DM Techniques forum

-- Cesc Julbe (erased user) - 2014-12-04

Software re-use or software development from scratch

Prior to all of the forums below, we have to decide to what extent we are going to re-use existing software.

  • Adopt one Data Mining Platform as the basis for the WP? Which? Requirements?
  • Adopt several and make them compatible via wrappers? Which? Requirements?
  • Develop all from scratch?
[Tapiador] - Something possibly of interest (under active development now). It aims to provide a higher-level interface for machine learning (picking the best algorithm for each use case based on cross-validation, etc.). The latest paper can be found at

Regarding the platform to use: most of the new frameworks and tools support HDFS, so it shouldn't be difficult to adopt a couple of them and integrate them seamlessly. The two most promising ones (at first glance) may be Mahout and Spark. Mahout already implements many algorithms, but it performs poorly on the iterative ones. Spark, on the other hand, is the best option for iterative algorithms (e.g. Logistic Regression), although it doesn't (currently) provide as many algorithms as Mahout. We'll definitely need to try out both.
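
To make the "iterative" point concrete, here is a minimal plain-Python sketch of logistic regression trained by batch gradient descent. It is not Spark or Mahout code; the function names, the toy data and the hyperparameters are made up for illustration. The thing to notice is that every iteration scans the whole dataset again. An engine that runs each pass as a separate job re-reading its input pays that cost every time, while one that can keep the data in memory between passes does not.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(points, labels, iterations=200, rate=0.5):
    """Batch gradient descent: each iteration is one full pass over the data."""
    w = [0.0] * len(points[0])
    b = 0.0
    n = len(points)
    for _ in range(iterations):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(points, labels):  # full scan of the dataset
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - rate * g / n for wi, g in zip(w, grad_w)]
        b -= rate * grad_b / n
    return w, b

def predict(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(score) >= 0.5 else 0

# Toy, linearly separable 1-D data (hypothetical values).
points = [[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]]
labels = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(points, labels)
predictions = [predict(w, b, x) for x in points]
```

With 200 iterations this toy run reads the six points 200 times; on a real catalogue the same access pattern is what makes the choice of platform matter.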

Dimensionality Reduction and Feature Extraction

Initial list for discussion:

  • PCA, ICA
  • Filter techniques (Mutual information, Gain Ratio, chi^2, correlation based...)
  • Wrapper techniques
  • Manifold-related techniques (Diffusion Maps, LLE...)
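
As a starting point for the discussion of filter techniques, a minimal sketch of ranking discrete features by their mutual information with the class label. The feature names and toy values are hypothetical, and continuous features would first need binning or a different estimator.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi

# Toy data: one feature copies the label, the other is independent of it.
labels = [0, 0, 1, 1]
features = {
    "informative": [0, 0, 1, 1],
    "noise": [0, 1, 0, 1],
}

scores = {name: mutual_information(col, labels) for name, col in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

A filter would then keep the top-ranked features, independently of whichever classifier is applied afterwards.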

Supervised classification/regression

Initial list for discussion:

  • LDA
  • ANN
  • SVM
  • Bayesian Networks
  • Bayesian inference with forward models
  • Gaussian Processes

Model inference [unsupervised classification]

Initial list for discussion:

  • k-means
  • Self-Organised Feature Maps
  • Density Based clustering (HMAC, DBSCAN,...)
  • Connectivity based
  • Parametric or model based clustering (Autoclass, EM)
  • Sub-space clustering
  • Spectral Clustering
  • Clustering by Message passing
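
For reference, a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The initial centroids are passed in explicitly to keep the run deterministic; the initialisation strategy is an assumption and would have to be decided separately. The toy points are hypothetical.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Like the logistic regression case, this is an iterative algorithm: each pass touches every point, so it is a reasonable benchmark when comparing platforms.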

Model evaluation

Initial list for discussion:

  • ROC curves
  • k-fold Cross Validation
  • Hypothesis tests
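
A minimal sketch of k-fold cross-validation used to compare two candidate models, in plain Python. The two "models" (a majority-class baseline and a simple threshold rule) and the toy data are hypothetical stand-ins for real learners. Folds are contiguous and unshuffled here, which assumes the data is not ordered by class.

```python
from collections import Counter

def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def cross_val_accuracy(fit, xs, ys, k):
    """Mean accuracy over k folds; fit(xs, ys) must return a predict function."""
    scores = []
    for train, test in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(model(xs[i]) == ys[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

def fit_majority(xs, ys):
    majority = Counter(ys).most_common(1)[0][0]
    return lambda x: majority

def fit_threshold(xs, ys):
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    threshold = (mean0 + mean1) / 2
    return lambda x: 1 if x > threshold else 0

xs = [1.0, 9.0, 2.0, 8.0, 3.0, 7.0, 4.0, 6.0]
ys = [0, 1, 0, 1, 0, 1, 0, 1]

candidates = {"majority": fit_majority, "threshold": fit_threshold}
scores = {name: cross_val_accuracy(fit, xs, ys, k=4) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
```

The same loop is the core of any "pick the best algorithm by cross-validation" scheme: only the list of candidates and the scoring function change.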

Evolution based algorithms

  • Genetic Algorithms
  • Swarm optimization
  • Genetic programming
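
As a baseline for this family, a minimal sketch of a genetic algorithm in plain Python, applied to the OneMax toy problem (maximise the number of 1 bits). The operators chosen here (tournament selection of size 3, one-point crossover, bit-flip mutation, one elite per generation) and all parameter values are assumptions made for illustration, not a recommendation.

```python
import random

def one_max(bits):
    return sum(bits)

def evolve(fitness, n_bits=20, pop_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        next_pop = [pop[0][:]]  # elitism: the best individual always survives
        while len(next_pop) < pop_size:
            # Tournament selection: best of three random individuals.
            a = max(rng.sample(pop, 3), key=fitness)
            b = max(rng.sample(pop, 3), key=fitness)
            # One-point crossover followed by bit-flip mutation.
            cut = rng.randrange(1, n_bits)
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history

best, history = evolve(one_max)
```

Because of the elitism step, the best fitness recorded in `history` can never decrease from one generation to the next.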


Topic revision: r1 - 2014-12-04 - CescJulbe