DM Techniques forum
--
Cesc Julbe (erased user) - 2014-12-04
Software re-use or software development from scratch
Prior to all of the forums below, we have to decide to what extent we are going to re-use existing software.
- Adopt one Data Mining Platform as the basis for the WP? Which? Requirements?
- Adopt several and make them compatible via wrappers? Which? Requirements?
- Develop all from scratch?
[Tapiador]
http://www.mlbase.org/
- Something possibly of interest (under active development now). It aims at providing a higher-level interface for machine learning (picking the best algorithm for each use case based on cross-validation, etc.). The latest paper can be found at
http://arxiv.org/pdf/1310.5426v2.pdf
Regarding the platform to use: most of the new frameworks and tools support HDFS, so it shouldn't be difficult to adopt a couple of them and integrate them seamlessly. The two most promising ones (at first glance) may be Mahout and Spark. Mahout already has many algorithms implemented, but its performance is poor for iterative ones. Spark, on the other hand, is the best option for iterative algorithms (e.g. Logistic Regression), although it doesn't (currently) provide as many algorithms as Mahout. We'll definitely need to try out both.
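To make the point about iterative algorithms concrete, here is a minimal single-machine sketch of logistic regression trained by batch gradient descent. It is not Mahout or Spark code: the function names (`fit_logistic`, `predict`), the learning rate, the iteration count and the toy data are all assumptions made for illustration. What it shows is that every iteration rescans the whole dataset, so a platform that can keep the data in memory between passes should have an advantage for this kind of algorithm.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(points, labels, learning_rate=0.5, iterations=200):
    """Batch gradient descent for logistic regression (illustrative only)."""
    n = len(points)
    n_features = len(points[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(iterations):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        # One full pass over the data per iteration: this is the costly part
        # when the dataset has to be re-read from storage every time.
        for x, y in zip(points, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else 0


# Hypothetical toy data: one centred feature, two classes.
points = [[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]]
labels = [0, 0, 0, 1, 1, 1]
weights, bias = fit_logistic(points, labels)
```

With 200 iterations the loop reads the six points 200 times; on a real catalogue the same access pattern applies to the full dataset.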
Dimensionality Reduction and Feature Extraction
Initial list for discussion:
- PCA, ICA
- Filter techniques (Mutual information, Gain Ratio, chi^2, correlation based...)
- Wrapper techniques
- Manifold-related techniques (Diffusion Maps, LLE...)
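As a reference point for the discussion of PCA, a minimal sketch follows. It finds only the first principal component, using power iteration on the sample covariance matrix. The function names and the toy data are assumptions, and a real implementation would more likely rely on an SVD routine from the chosen platform.

```python
import math


def center(rows):
    """Subtract the column means from every row."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]


def covariance_matrix(centered):
    """Sample covariance (divides by n - 1) of already centred rows."""
    n = len(centered)
    dims = len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]


def first_principal_component(rows, iterations=100):
    """Return the leading unit eigenvector and the projected scores.

    Assumes the data is not constant (otherwise the norm below is zero).
    """
    centered = center(rows)
    cov = covariance_matrix(centered)
    dims = len(cov)
    v = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalise.
        v = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    scores = [sum(r[j] * v[j] for j in range(dims)) for r in centered]
    return v, scores


# Hypothetical toy data lying on the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
component, scores = first_principal_component(rows)
```

Here `component` points along the direction (1, 2), and `scores` is the one-dimensional representation of each point.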
Supervised classification/regression
Initial list for discussion:
- LDA
- ANN
- SVM
- Bayesian Networks
- Bayesian inference with forward models
- Gaussian Processes
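For the first item in the list, a minimal sketch of a two-class linear discriminant is given below, assuming that "LDA" here means Linear Discriminant Analysis in Fisher's formulation. It is restricted to two features so that the within-class scatter matrix can be inverted by hand. The function names, the toy data and the decision threshold (projection of the midpoint between the class means, which amounts to assuming equal priors) are illustrative assumptions.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]


def scatter(rows, mean):
    """2x2 scatter matrix of one class around its mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fit_lda(class0, class1):
    """Fisher direction w = Sw^-1 (m1 - m0) plus a midpoint threshold."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    # Closed-form inverse of a 2x2 matrix; assumes Sw is not singular.
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    midpoint = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    threshold = w[0] * midpoint[0] + w[1] * midpoint[1]
    return w, threshold


def classify(w, threshold, x):
    return 1 if w[0] * x[0] + w[1] * x[1] > threshold else 0


# Hypothetical toy data: two well-separated classes in two dimensions.
class0 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]]
class1 = [[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]]
w, threshold = fit_lda(class0, class1)
```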
Model inference [unsupervised classification]
Initial list for discussion:
- k-means
- Self-Organised Feature Maps
- Density Based clustering (HMAC, DBSCAN,...)
- Connectivity based
- Parametric or model based clustering (Autoclass, EM)
- Sub-space clustering
- Spectral Clustering
- Clustering by Message passing
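As a baseline for comparing the clustering techniques above, here is a minimal sketch of k-means (Lloyd's algorithm). The function names and toy data are assumptions. The initial centroids are passed in explicitly to keep the example deterministic, whereas a real run would need some initialisation strategy.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate assignment and centroid update until labels stop changing."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical toy data: two small groups far apart.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, labels = k_means(points, initial_centroids=[points[0], points[3]])
```

Like the logistic regression case, this is an iterative algorithm that rescans all points on every pass, which is relevant to the platform choice.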
Model evaluation
Initial list for discussion:
- ROC curves
- k-fold Cross Validation
- hypothesis tests
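A minimal sketch of k-fold cross-validation follows, to fix ideas on what the evaluation layer would have to provide. The classifier is a deliberately trivial one-dimensional threshold model, and all names and data are assumptions. Folds are contiguous and unshuffled here; real data would normally be shuffled (or stratified) first.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    base, extra = divmod(n, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


def train_threshold(xs, ys):
    """Toy model: threshold halfway between the two class means.

    Assumes both classes are present in the training fold.
    """
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold, xs, ys):
    hits = sum(1 for x, y in zip(xs, ys) if (1 if x >= threshold else 0) == y)
    return hits / len(xs)


def cross_validate(xs, ys, k):
    """Train on k - 1 folds, score on the held-out fold, for every fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        threshold = train_threshold([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        scores.append(accuracy(threshold,
                               [xs[i] for i in test_idx],
                               [ys[i] for i in test_idx]))
    return scores


# Hypothetical toy data: label is 1 when x >= 5.
xs = [float(i) for i in range(10)]
ys = [0] * 5 + [1] * 5
scores = cross_validate(xs, ys, k=5)
mean_score = sum(scores) / len(scores)
```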
Evolution based algorithms
- Genetic Algorithms
- Swarm optimization
- Genetic programming
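For the first item, a minimal genetic algorithm sketch is given below. It maximises the number of ones in a bit string (the "OneMax" toy problem) using tournament selection, one-point crossover, bit-flip mutation and elitism. The problem, the parameter values and the function names are all assumptions chosen for illustration.

```python
import random


def fitness(bits):
    return sum(bits)


def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    """Return the best individual and the best fitness seen per generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # Elitism: the current best individual survives unchanged.
        next_population = [population[0][:]]
        while len(next_population) < pop_size:
            # Tournament selection: best of three random individuals.
            parent1 = max(rng.sample(population, 3), key=fitness)
            parent2 = max(rng.sample(population, 3), key=fitness)
            # One-point crossover followed by bit-flip mutation.
            cut = rng.randrange(1, n_bits)
            child = parent1[:cut] + parent2[cut:]
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_population.append(child)
        population = next_population
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = evolve()
```

Because of elitism, the values in `history` never decrease from one generation to the next.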