We agree on the paramount importance of a simple, easy-to-use system that facilitates scientific exploitation by astronomers who are acquainted with data mining but are not necessarily experts.
We also agree on the importance of reusable data mining models (classifiers or the like) and reproducible results.
The two previous considerations point to a design where models are defined and stored as PMML (Predictive Model Markup Language), generated off-line and locally by the astronomer with whatever software they prefer, and then uploaded to the CU9 infrastructure, which must be able to handle these models. Going further, it would make sense to explore the convergence of predictive modelling, machine learning and distributed systems, so that models can be built in a user-friendly environment like R and then exported to a full-scale distributed environment. An interesting contribution to consider in this respect is the Pattern project paper (http://kdd13pmml.files.wordpress.com/2013/07/pattern.pdf).
We also agree that we must try to be ambitious and design a system capable of handling data mining tasks beyond the exploitation of the Gaia database (that is, able to mine merged catalogues).
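The PMML-based workflow agreed above (build locally, export, score elsewhere) can be illustrated with a minimal sketch. It writes a linear model as a PMML-style `RegressionModel` and scores it again from the XML alone. The field names (`pc1`, `pc2`, `teff`), the coefficients and the choice of PMML 4.1 are assumptions for illustration only; the output is not validated against the PMML schema, and a real exchange would rely on an established exporter rather than hand-written XML.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_1"  # assumed PMML version for this sketch


def linear_model_to_pmml(intercept, coefficients, target="teff"):
    """Serialise a linear regression as a minimal PMML-style XML string."""
    root = ET.Element("PMML", {"version": "4.1", "xmlns": PMML_NS})
    ET.SubElement(root, "Header", {"description": "illustrative linear model"})
    dictionary = ET.SubElement(root, "DataDictionary")
    model = ET.SubElement(root, "RegressionModel", {"functionName": "regression"})
    schema = ET.SubElement(model, "MiningSchema")
    table = ET.SubElement(model, "RegressionTable", {"intercept": repr(intercept)})
    field = {"optype": "continuous", "dataType": "double"}
    for name, coef in coefficients.items():
        ET.SubElement(dictionary, "DataField", dict(field, name=name))
        ET.SubElement(schema, "MiningField", {"name": name})
        ET.SubElement(table, "NumericPredictor",
                      {"name": name, "coefficient": repr(coef)})
    # The target field is declared last, in both the dictionary and the schema.
    ET.SubElement(dictionary, "DataField", dict(field, name=target))
    ET.SubElement(schema, "MiningField", {"name": target, "usageType": "predicted"})
    return ET.tostring(root, encoding="unicode")


def score_pmml(pmml_text, record):
    """Evaluate the regression table using only the information in the XML."""
    ns = {"p": PMML_NS}
    table = ET.fromstring(pmml_text).find("p:RegressionModel/p:RegressionTable", ns)
    value = float(table.get("intercept"))
    for predictor in table.findall("p:NumericPredictor", ns):
        value += float(predictor.get("coefficient")) * record[predictor.get("name")]
    return value


# Hypothetical model: effective temperature from two PCA components.
pmml = linear_model_to_pmml(5000.0, {"pc1": 120.0, "pc2": -35.5})
prediction = score_pmml(pmml, {"pc1": 2.0, "pc2": 1.0})
```

The point of the sketch is that the scoring side needs no knowledge of the software that built the model, which is what makes the models reusable and the results reproducible.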
We identified four simple use cases to start assessing the strengths and weaknesses of current alternatives for the framework:
PCA of a large sample of BP+RP spectra from GOG (all types of stars, galaxies, etc.). The aim is to assess feasibility and scalability. In principle, this uses the classical approach (even if it can be coded as a stream).
Build and evaluate a set of three supervised classifiers (ANN, Random Forest and SVM) from Gaia data. Combine the three and apply them to a large collection of Gaia data. In principle, use the PCA components obtained above to predict effective temperatures of stars. The emphasis is not on the results but on feasibility and scalability.
Build an unsupervised model (Self-Organising Map and Mixture-of-Gaussians of arbitrary shape and orientation) from simulated astrometry (RA, Dec, proper motions, radial velocities and parallaxes).
Find outliers in a large data set based on the distance to nearest neighbour.
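As a rough illustration of the last use case, the sketch below flags points whose nearest-neighbour distance is unusually large. The threshold rule (a multiple of the median nearest-neighbour distance), the `factor` parameter and the function names are assumptions made for the example, not something fixed in these notes. The brute-force search costs O(n^2) distance evaluations, so at Gaia scale it would have to be replaced by a spatial index or a partitioned search; that scalability question is what the use case is meant to probe.

```python
import math
import statistics


def nearest_neighbour_distances(points):
    """For each point, the Euclidean distance to its nearest other point."""
    distances = []
    for i, p in enumerate(points):
        best = math.inf
        for j, q in enumerate(points):
            if i != j:
                best = min(best, math.dist(p, q))
        distances.append(best)
    return distances


def find_outliers(points, factor=3.0):
    """Indices whose nearest-neighbour distance exceeds factor * median distance."""
    distances = nearest_neighbour_distances(points)
    threshold = factor * statistics.median(distances)
    return [i for i, d in enumerate(distances) if d > threshold]


# Four points on a unit square plus one isolated point.
sample = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
outliers = find_outliers(sample)
```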
Basic framework
Hadoop installation with libraries for handling PMML models. Try to test use case #1 before the Vienna meeting. Assess weaknesses and strengths and advance a proposal for an alternative that meets all the requirements of the 4 use cases in terms of memory handling, iteration, conditionals, etc.
Assess the suitability of other core engines like Spark, which are very promising for iterative algorithms such as those used in machine learning and data mining. These newer architectures offer a generic in-memory processing engine wrapped by a core machine learning library, plus higher-level layers that aim to improve productivity and automate model selection by treating it as a search problem over the available feature extractors and machine learning algorithms.
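To make the first test on a map/reduce framework concrete, here is one possible way to phrase use case #1 (PCA) so that it needs a single pass over the data. Each split is reduced to a small summary (count, sum vector, sum of outer products), summaries merge by addition, and the iterative part (a power iteration here) runs on the small covariance matrix only. This is a plain-Python sketch with hypothetical function names, not Hadoop or Spark code. It assumes the number of features is small enough for the covariance matrix to fit in memory, and the raw sum-of-products formula can lose precision on badly scaled data.

```python
import math
from functools import reduce


def map_partition(rows):
    """Map step: summarise one split as (count, sum vector, sum of outer products)."""
    dim = len(rows[0])
    total = [0.0] * dim
    outer = [[0.0] * dim for _ in range(dim)]
    for row in rows:
        for i in range(dim):
            total[i] += row[i]
            for j in range(dim):
                outer[i][j] += row[i] * row[j]
    return len(rows), total, outer


def merge(a, b):
    """Reduce step: two partial summaries combine by element-wise addition."""
    total = [x + y for x, y in zip(a[1], b[1])]
    outer = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[2], b[2])]
    return a[0] + b[0], total, outer


def covariance(summary):
    """Sample covariance matrix recovered from the merged summary."""
    n, total, outer = summary
    mean = [t / n for t in total]
    dim = len(mean)
    return [[(outer[i][j] - n * mean[i] * mean[j]) / (n - 1) for j in range(dim)]
            for i in range(dim)]


def leading_component(cov, iterations=200):
    """Power iteration for the eigenvector with the largest eigenvalue."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Two hypothetical splits of a toy data set lying close to the line y = 2x.
splits = [[(1.0, 2.0), (2.0, 4.1)], [(3.0, 5.9), (4.0, 8.0)]]
cov = covariance(reduce(merge, map(map_partition, splits)))
pc1 = leading_component(cov)
```

Algorithms that must revisit the full data set on every iteration do not reduce to a single pass like this, which is where an in-memory engine would be expected to help.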