Madrid 1-2 September 2014, meeting minutes

-- Cesc Julbe (erased user) - 2014-12-04

Madrid, UNED, 1st-2nd September 2014

Attendees:

Cesc Julbe (UB)
Daniel Tapiador
Luis Sarro (UNED)

AGREEMENTS

Agreements on the high-level design and functionality of the data mining framework:

We expect two types of potential users for the framework:
1. Basic users, who want to use the tools provided by the framework: common machine learning techniques (classification, regression, collaborative filtering) and more general exploratory data analysis techniques (dimensionality reduction, feature selection, etc.).
2. Expert users, who have their own libraries and want to use the infrastructure to run their code.
We propose a collaborative platform, so that user developments can be incorporated into the framework. All such developments will be reviewed by a QA commission from within the WP.
From a technological point of view, we choose Apache Spark as the initial development technology. Spark is a large-scale data processing framework written in Scala and compatible with Java. Apache Mahout, a scalable machine learning platform, is moving from Map/Reduce to Spark. Although many algorithms are still missing from Spark's MLlib library, we expect them to be released in the coming months.
We have to put special emphasis on providing a framework that goes beyond well-known and established solutions such as WEKA and RapidMiner. GENIUS is a technology-oriented project, so our solution must not be a "powerful WEKA" but a more advanced, state-of-the-art Big Data framework.
We are open to alternative solutions and technologies, but any implementation that has not been planned and agreed upon and that requires additional development must come with the resources needed for that development.
A more traditional approach to data mining, such as the use of R in a grid environment, could be included as well.
We have to give visibility to our work through demos and workshops; it is also important to identify people in the WP to carry out that task.
All these agreements will be included in a document to be distributed among the members of the WP. This document will define at a high level the following topics:
- Goal of the framework
- Target users
- Basic functionality
- Technological selection
- Other considerations.

ACTION ITEMS AND WORK PLAN

Contact the CU8 working group to understand its supervised classification work with extended data and archives such as 2MASS. Points to clarify:
- Ultimate goal of the work package inside CU9? Are they trying to solve a specific use case?
- Detect potential synergies with our work.
- Technologies used (R, Java?).
- Schedule of their work.
Resume contact with the project lead at CSUC (Consorci de Serveis Universitaris de Catalunya, formerly CESCA) in order to obtain a more powerful infrastructure so that we can continue with the planned work. The main concern is storage capacity (currently we have 16 nodes with 250 GB per node).
Define 3 use cases based on the current Spark version:
1. UC #1:
  - Supervised classification using Lasso and Ridge (L1- and L2-regularized) regression, already available in Spark's MLlib implementation.
  - We must ask the CU8 working group for a Training Set (TS). We will train the system with 9/10 of the TS and validate with the remaining 1/10.
  - This is not a challenging use case, since it relies on a labeled data set (which is rare in astronomy), but it serves as a proof of concept.
2. UC #2: Clustering. We propose two features for this second use case.
  - K-means. Spark already implements this clustering technique (https://spark.apache.org/docs/latest/mllib-clustering.html).
  - HMAG clustering based on the work developed by Miguel García Torres. His current implementation (Java) could be rewritten in Scala. If he agrees, the tasks involved will have to be coordinated. The conclusions of this work should be published.
3. UC #3:
  - Bayesian inference with Markov chain Monte Carlo (MCMC) applied to the Milky Way model, resuming Angel Berihuete's work. We will start with a basic version, adapting the R and RStan (http://mc-stan.org/rstan.html) code written for the IMF model estimation.
  - Secondly, we will continue with the more advanced work done by Angel for this same study, extended to additional parameters. The work done for this use case should yield both scientific and technological results to be published.
  - Outlier detection. Work TBD.
4. UC #1 and UC #2 should be available to users through a web-based interface, probably built on GWT (Google Web Toolkit), still to be defined.
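
The validation scheme agreed for UC #1 (train on 9/10 of the Training Set, validate on the remaining 1/10) can be sketched as follows. This is only a plain-Python illustration under assumptions: the toy (features, label) rows, the function names holdout_split and accuracy, and the majority-label placeholder classifier are all hypothetical and stand in for the CU8 Training Set and for a model trained with Spark MLlib.

```python
import random


def holdout_split(labelled_rows, validation_parts=10, seed=42):
    """Shuffle the rows and hold out 1/validation_parts of them for validation."""
    rows = list(labelled_rows)           # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)    # fixed seed keeps the split reproducible
    n_validation = len(rows) // validation_parts
    return rows[n_validation:], rows[:n_validation]


def accuracy(classify, validation_rows):
    """Fraction of validation rows whose predicted label matches the true label."""
    hits = sum(1 for features, label in validation_rows
               if classify(features) == label)
    return hits / len(validation_rows)


# Hypothetical stand-in for the CU8 Training Set: (features, label) pairs.
training_set = [((float(i),), "star" if i % 2 == 0 else "galaxy")
                for i in range(100)]
train_rows, validation_rows = holdout_split(training_set)

# Placeholder classifier: always predicts the most frequent training label.
# The real use case would train a Spark MLlib model on train_rows instead.
train_labels = [label for _, label in train_rows]
majority_label = max(set(train_labels), key=train_labels.count)
validation_accuracy = accuracy(lambda features: majority_label, validation_rows)

print(len(train_rows), len(validation_rows), validation_accuracy)
```

Shuffling before cutting avoids a biased validation subset when the Training Set is delivered sorted (for example by class or by sky position), and the fixed seed makes the split repeatable across runs.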

SCHEDULE

We expect to have the 3 use cases implemented one year from now.
UC #1 and UC #2 should be available 6 months from now, accessible through the web-based interface.

OTHER CONSIDERATIONS

On 27 October 2014 there is a 'Data Analysis and Statistics' workshop/meeting at ESAC (http://esac-statistics.wikispaces.com/). Luis is attending, and it would benefit our work to contact some of the experts there in order to obtain collaboration and advice.
The "GREAT-ITN Closing Conference: The Milky Way Unravelled by GREAT" (1-5 December 2014, University of Barcelona, Spain; https://gaia.ub.edu/Twiki/bin/view/GREATITNFC/WebHome) would be a good opportunity to meet the astronomers who have been involved in the Milky Way modeling problem and hold some brainstorming sessions.
We need to find out what was agreed with ESAC when the Work Package was defined:
1. Who will be the key person at ESAC?
2. Which infrastructure will be used for Data Mining activities?
3. Security policies: Will external users be allowed to upload external developments like custom models and libraries to the Data Mining infrastructure at ESAC?
Data for testing:
1. A full-sky GOG simulation with end-of-mission data is available (no spectra).
2. S1 HTM region end-of-mission data with spectra is also available (about 6% of the sky).
