MADRID 22 December 2014, MEETING MINUTES

Madrid, CAB - ESAC, December 22nd 2014

Attendees:

# Cesc Julbe (UB)
# Daniel Tapiador
# Luis Sarro (UNED)

PREVIOUS MEETINGS REPORT

Report on the positive GENIUS review, which took place on December 15th.
Report on the meeting with the SAT team at ESAC on October 27th.
- The SAT did not impose any restrictions on the framework we are developing and will participate in the most critical aspects of the deployment and integration into the GAIA archive, such as security and infrastructure.

PROJECT REVIEW AND STATUS DISCUSSION

Report on the work done on the prototype of the Data Mining framework.
Basic design, features and technologies used.
Some low-level discussion with Daniel about Spark implementation details.
Data conversion. Currently we use the gbin-to-Parquet converter based on Daniel's initial implementation. Spark provides an improved mechanism to save an RDD to Parquet. We will explore it.
Missing component to be evaluated: Job Manager and Scheduler. CSUC may participate in this task by helping to define how this layer should work. We can take as a reference the job server open-sourced by Ooyala (http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server).
We keep the Cloudera distribution of Apache Hadoop, focusing on the new cluster that should be available shortly.
Review of the idea of writing a document that:
- Defines what the Data Mining framework will do.
- States which technologies are going to be used.
- Provides a work plan for the next two years: a more detailed one for 2015 and a milestone-based one for 2016.
Cross-match queries between different astronomical archives: some issues were raised about whether other archives can be replicated at ESAC. Will that be possible?
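To make the cross-match item concrete, the following is a minimal sketch of a positional cross-match between two small in-memory catalogues. It is only an illustration and not the mechanism discussed at the meeting: the function names, the `(id, ra, dec)` tuple layout, the 1 arcsecond default radius and the brute-force search are all assumptions; a real archive would use a spatial index.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(catalogue_a, catalogue_b, radius_arcsec=1.0):
    """Map each source id in A to the id of its nearest B source within the radius.

    Both catalogues are lists of (id, ra_deg, dec_deg) tuples (hypothetical layout).
    Sources in A without a counterpart inside the radius are left out of the result.
    """
    matches = {}
    for id_a, ra_a, dec_a in catalogue_a:
        best_id, best_sep = None, radius_arcsec / 3600.0
        for id_b, ra_b, dec_b in catalogue_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_id, best_sep = id_b, sep
        if best_id is not None:
            matches[id_a] = best_id
    return matches


# Toy catalogues, invented for the example.
catalogue_a = [("a1", 10.0, 20.0), ("a2", 50.0, -5.0)]
catalogue_b = [("b1", 10.0001, 20.0), ("b2", 120.0, 30.0)]
print(cross_match(catalogue_a, catalogue_b))
```

The brute-force loop costs one separation per pair of sources, so it only serves to show the matching rule; whether the other archives can be replicated locally decides where such a join could actually run.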
Review of the Spark MLlib library implementations: which ones can be used and extended (including the new Spark ML package). Extension with new mechanisms, such as dimensionality reduction methods more advanced than PCA, like 'locally linear embedding' and 'diffusion maps'. There is some code available to start working on these advanced algorithms. They will be incorporated into our Work Package development.
Coordination with the Visualization group. To what extent do we provide advanced visualization capabilities? We will provide results in a tabular format (ASCII, CSV...) that should be integrated with or connected to the more advanced visualization tools provided by WP980.

WORK PLAN - USE CASES

For the coming months we plan to solve several scientific use cases on top of our infrastructure, focusing on those that are unlikely to be covered by the MLlib implementations in Spark and/or by Mahout and that provide advanced capabilities. As presented in the previous sections, we plan to implement an extension of the most traditional dimensionality reduction mechanisms:
- Locally linear embedding.
- Diffusion maps.
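As an illustration of the second method, here is a minimal pure-Python sketch of a one-dimensional diffusion map. It is not the existing code mentioned in the minutes, and it does not use Spark: the function name `diffusion_coordinate`, the Gaussian kernel width `epsilon`, the power iteration used instead of a full eigensolver, and the toy points are all assumptions made for the example.

```python
import math


def diffusion_coordinate(points, epsilon=1.0, iterations=500):
    """Return the first non-trivial diffusion coordinate of each point.

    points: list of equal-length tuples of floats (hypothetical input layout).
    Assumes the points are not all identical, otherwise there is no
    non-trivial coordinate and the normalisation below divides by zero.
    """
    n = len(points)
    # Gaussian kernel on squared Euclidean distances.
    kernel = [[math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / epsilon)
               for q in points] for p in points]
    degree = [sum(row) for row in kernel]
    # Symmetric conjugate of the Markov matrix: S = D^-1/2 K D^-1/2.
    sym = [[kernel[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
           for i in range(n)]
    # The top eigenvector of S (eigenvalue 1) is sqrt(degree), normalised.
    norm = math.sqrt(sum(degree))
    top = [math.sqrt(d) / norm for d in degree]

    def deflate(v):
        # Remove the component along the trivial top eigenvector.
        dot = sum(a * b for a, b in zip(v, top))
        return [a - dot * b for a, b in zip(v, top)]

    start = deflate([float(i) for i in range(n)])
    length = math.sqrt(sum(a * a for a in start))
    vec = [a / length for a in start]
    eigenvalue = 0.0
    # Power iteration with deflation converges to the second eigenvector of S.
    for _ in range(iterations):
        nxt = deflate([sum(sym[i][j] * vec[j] for j in range(n))
                       for i in range(n)])
        eigenvalue = math.sqrt(sum(a * a for a in nxt))
        vec = [a / eigenvalue for a in nxt]
    # Map back to the Markov matrix eigenvector and scale by the eigenvalue.
    return [eigenvalue * v / math.sqrt(d) for v, d in zip(vec, degree)]


# Two well-separated toy groups on a line, invented for the example.
points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
print(diffusion_coordinate(points))
```

On data like this the sign of the coordinate separates the two groups, which is the property that makes the embedding useful as an input to classification. The dense kernel matrix grows with the square of the number of sources, so a distributed version would need a sparse or neighbour-limited kernel.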
Data to use: M-type stars from Sloan, with supervised and unsupervised classification (and with cross-match against other catalogues?). This could be done during 2015 using our infrastructure. Sloan is an archive queried through SQL. There are several issues to address:
- How is the data stored locally for the use case? (In HDFS.)
- The result of this work will be a set of recipes to perform supervised classification (SC) and unsupervised classification (UC), based on our experience with our infrastructure.
- How are those tasks/commands executed? From a console or from some Studio-like platform.
- Capacity to show how to use the tasks (a workshop to solve specific problems?). The platform should be flexible enough to solve scientific problems provided by the community. Supporting this with a workshop would be a good output for the GENIUS project.
- It would be nice to have a local path where query results can be stored to be consumed by the data mining framework, directly in HDFS. Manual serialization is to be avoided if Spark can convert to Parquet directly.

Cesc Julbe - 2015-05-28
