WP973_ESAC_meeting_May_2015 < GENIUS




Meeting notes
Conclusions - Action Items
- TASKS SUMMARY
- ACTION ITEMS
Comments

Meeting notes

GENERAL & COORDINATION

The agreed goal of the meeting is to set the basis for an agreement on:
- Infrastructure
- Potential users which will drive the Framework development
- Execution policies & time allocation committee
- Technological justification for choosing this technology over others
TGAS may allow some DM, but it is a small catalogue (of the order of 10^6 objects). For Release 1 there could be some interesting use cases to analyse, such as star counts, but there will not be infrastructure at ESAC for DM yet.
Data Mining should be ready for Release 3, between 2017 and 2018, although Releases 1 and 2 may have very useful data to start developing some use cases.
This platform could be re-used by Euclid (or at least the experience can be passed on).
A formal SRS is not necessary. We’ll check the current SRS for WP970 and update it if necessary. ESAC/SAT may want a say on the high-level requirements as they are currently defined. We'll write a roadmap document based on those requirements. Check the TGAS or Cross-match roadmap documents as examples.
Currently there are too many tasks/sub-WPs; we'll have to review them and merge them where possible.

DATA LAKE

Data transformation from gbin is very complicated.
WP973 proposes a format and SAT can provide a conversion to the given format in a similar way they ingest data into GACS in a post-ingestion stage.
SAT to implement a serializer to transform data from Gaia Catalogue to a DM compatible format. SAT uses a JSON serialization.
We store a full dump of the catalogue in HDFS.
Intermediate storage: We can use an approach similar to the VOSpace. There is a space not visible to the community and some data can be public if the user decides to do so.
There could be also a library of models previously applied to the archive - storing PMML or similar XML. Reproducible science. Focus more on sharing the models than the data.
Combination of catalogues. Queries not only on GAIA but also on SDSS or 2MASS. Other catalogues will be ingested into the archive, so they can also be stored in a format that is friendly to the DM framework. Users may also be able to upload external data to the framework, to be consumed by the DM.
TASK to be coordinated by UB team & SAT.
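The serialization step discussed above could look roughly like the sketch below. It is only an illustration: the record fields (source_id, ra_deg, dec_deg, g_mag) and the one-JSON-object-per-line layout are assumptions made for this example, not the format agreed between WP973 and SAT.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SourceRecord:
    # Hypothetical fields, chosen only for illustration.
    source_id: int
    ra_deg: float
    dec_deg: float
    g_mag: float


def to_json_lines(records):
    """Serialize records as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def from_json_lines(text):
    """Parse newline-delimited JSON back into SourceRecord objects."""
    return [SourceRecord(**json.loads(line))
            for line in text.splitlines() if line.strip()]


records = [
    SourceRecord(1, 10.5, -3.25, 12.1),
    SourceRecord(2, 200.0, 45.0, 15.7),
]
dump = to_json_lines(records)
restored = from_json_lines(dump)
```

One object per line keeps the dump splittable, so a large file stored in HDFS can be processed in parallel chunks.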

USER PROFILE DEFINITION

There is a short discussion about the user profile that the framework should focus on. We all agree that the main user should have an expert or semi-expert DM profile.
We want to provide a framework for big science, not focusing on small science (thus non-expert user). Let's focus on advanced users then.
Non-expert users would need a lot of documentation and guidance. The goal of the WP is not to teach Data Mining but to provide a Data Mining framework; documentation will focus on how to use the framework and will not be a Data Mining 'book'.
Luis Sarro to set up a poll to find out what sort of ideas and potential user profiles exist, and to collect use cases from people involved in the grand challenges. A set of beta testers can be identified to review the interface in the medium term.

Contact Nic Walton and Graine Costigan as a starting point as they are defining a working group.

Brief security discussion note:

User profile to be defined according to user roles which provide execution attributes/properties (storage quota, CPU time...).
Luis Sarro coordinates this Task
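As a rough sketch of the idea above, roles can be mapped to execution attributes and checked before a job is accepted. The role names and all numeric limits below are placeholders invented for this example; the real values are still to be defined in this task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleLimits:
    storage_quota_gb: int  # maximum intermediate storage
    cpu_hours: int         # CPU time budget for the allocation period


# Hypothetical roles and limits, for illustration only.
ROLE_LIMITS = {
    "expert": RoleLimits(storage_quota_gb=1000, cpu_hours=5000),
    "semi_expert": RoleLimits(storage_quota_gb=200, cpu_hours=500),
}


def can_run(role, requested_cpu_hours, used_cpu_hours):
    """Return True if the request fits in the role's remaining CPU budget."""
    limits = ROLE_LIMITS[role]
    return used_cpu_hours + requested_cpu_hours <= limits.cpu_hours
```

For example, can_run("semi_expert", 100, 450) is rejected because it would exceed the placeholder budget of 500 CPU hours.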

EXECUTION POLICIES

Discussion on how tasks will be executed in this framework:

Time allocation for big projects (90% of the time for example), big projects to be evaluated by a committee (?)

10% for low priority jobs, even interactive and small scale projects.

Discussion on which is the model we want to focus:
1. People submitting jobs to the infrastructure - Data Centre at ESAC (queued tasks, priorities, etc...)
2. Distributed execution model, with external computational resources.
For now we'll focus on the first option; integrating other data centres and external computational resources such as CSUC or CESGA is also contemplated in the mid/long term.
Time allocation committee. Maybe CSUC can provide feedback (they do not have privileged roles: all users have the same priority, and groups pay for the hours they consume).
- We cannot afford to let a non-expert user halt the system. It is difficult to avoid human intervention in the time allocation committee.
- Job execution proposals have to be approved.
- SAT/ESAC/CSG to play a role in this process as they are the infrastructure providers, but the proposal has to be validated from the scientific point of view.
A completely new policy has to be in place.
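The allocation discussed above can be sketched as follows. This is only a toy model under stated assumptions: the 90/10 split is the example figure from the discussion, and the two-level priority queue (approved big projects first, then everything else in submission order) is a simplification invented here, not an agreed policy.

```python
import heapq
from itertools import count


def split_cluster_hours(total_hours, big_project_percent=90):
    """Split an integer number of hours between big projects and low-priority jobs."""
    big = total_hours * big_project_percent // 100
    return big, total_hours - big


class JobQueue:
    """Serve committee-approved big-project jobs before low-priority ones."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker: keeps submission order within a priority

    def submit(self, name, approved_big_project):
        priority = 0 if approved_big_project else 1
        heapq.heappush(self._heap, (priority, next(self._order), name))

    def next_job(self):
        """Return the next job name, or None if the queue is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None


queue = JobQueue()
queue.submit("interactive-test", approved_big_project=False)
queue.submit("approved-survey", approved_big_project=True)
queue.submit("small-project", approved_big_project=False)
served = [queue.next_job() for _ in range(3)]
```

In a real deployment this logic would most likely live in the cluster scheduler configuration rather than in user code.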

TASK coordinated by ESAC/SAT and UB.

USER INTERFACE

Console is the preferred user interface where you can submit tasks. No security concerns.
A web interface to submit jobs is too complex and would provide little reward to the user. We'll leave it for now.
It is better to have a set of recipes that can be executed using the framework through the console or by submitting tasks directly to the cluster.

JOB SCHEDULER

Job scheduler to be merged with the infrastructure WP/Task. CSUC is involved, as the scheduler relies on the software infrastructure, mainly YARN.

TEST/DEBUG EXECUTION ENVIRONMENT

Execution environments: We propose having an environment for testing, not only the production one.

Everyone agrees on a test environment with a subset of the data. SAT already has a fuzzy table which is a representative sample of the archive.
Proposal: a Virtual Machine available for download, containing the subset of data + OS, etc., ready to be used by the user locally. This has to be kept in sync with the production system.
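A data subset for the test environment could be drawn reproducibly, as in the sketch below. It assumes a plain uniform random sample with a fixed seed; this is not how the SAT fuzzy table is built, and the function name and default seed are made up for the example.

```python
import random


def sample_subset(source_ids, fraction, seed=2015):
    """Return a reproducible uniform random subset of the given identifiers."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ids = sorted(source_ids)  # fixed input order, so the same seed gives the same subset
    if not ids:
        return []
    k = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, k)


subset = sample_subset(range(1000), 0.01)
```

Fixing the seed lets the downloadable Virtual Machine and the server-side test environment be regenerated with the same subset.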

VISUALIZATION INTERFACE

We should provide requirements to the Visualization Working group and they'll provide functionality.
1. Interactive visualization - Sanity checks.
2. Diagnostics - Static files for intermediate visualization (pdf, jpg, png...).
Basic requirement: feed files to the infrastructure.
WP980 has a server (server object) and an API to access it. You have to download a client or even build one using that API. Effort must be focused on connecting the output of Data Mining tasks to the Visualization server.
At the CU9 plenary meeting in Barcelona in September 2015 there is a splinter session with Visualization, which could be an opportunity to discuss these requirements with them.
Visualization work includes clustering visualization provided by M.Manteiga and C.Dafonte team.
Coordination of this task is for Univ. Coruña team.

GRAND CHALLENGES

Goal: an astrophysical problem that involves large amounts of data and can only be solved using data mining techniques. Provide a list of problems that can be solved in this framework and that are not being addressed elsewhere in DPAC.
Review of the grand challenges proposed since the very beginning.
Let's define big cases that can be also useful and provide reward from scientific point of view while testing the infrastructure.
Open a period of time to propose more detailed challenges: data required (dimensionality), scientific goals, etc.
The list of grand challenges should also be reflected in an initial list of GAIA papers, so these tasks must be in sync with other CU tasks. In the first week of June there is a CU9 review where the issue will be addressed. Grand challenges will lead to scientific papers that have to be sent to the WP managers to be discussed.
Luis Sarro to coordinate this task.

FEATURES

- BASIC

Let's take everything that is already implemented, rather than writing something that will probably be implemented elsewhere in the near future. Spark also provides an infrastructure to create pipelines.
SparkR can also be included as a "provider" of basic features. Giuliano Gioffrida and Silvia Marinoni have already done some research/development in SparkR.

- ADVANCED

Define a set of advanced techniques and release them incrementally; contributions are allowed. The initial list could be:
- Diffusion Maps
- LLE (Locally Linear Embedding)
- Kernel PCA
- HMAC (Hierarchical Mode Association Clustering)
- Self-Organised Feature Maps - Implemented by Univ. Coruña colleagues for Spark and also running on GPUs.
- Bayesian hierarchical models
We will evaluate the possibility to apply for European funds to finance these developments.
Luis Sarro to coordinate this task.

SCHOOLS & WORKSHOPS

Training for the community on the DM framework.
CSUC can provide infrastructure.
Once organized, we can find the people to give the courses.
UB coordinates this WP.

INFRASTRUCTURE

ESA is willing to provide a DM framework on the archive, sitting on ESAC. CSG is taking care of infrastructure and SAT providing services. Many technological solutions available (Cloud computing, with resources deployed on demand, etc.) based on proper requirements.
FJ gives an update on the current infrastructure: Cluster with 64 nodes, 264GB RAM, Cloudera 5 + Spark 1.3.1
A new cluster is in the process of being acquired using GENIUS funds. The security policy for using this cluster has to be defined. Also, CSG has to agree with the WP and CSUC on the security issues, how users can access the cluster, etc., as the system we develop must be replicated at ESAC.
GPUs provide a great performance boost. Can they be integrated in the final design?
Once the system is stabilized, it will define the hardware and software (Big Data platform - Hadoop, Spark, etc.) requirements that have to be formally agreed by CSG and SAT at ESAC.
Data Mining will also take place in other places where the archive could be replicated.
- We could have a census of Data Centers with replicated data and could share resources for the DM framework.
- Other partners can provide HPC (CSUC, CESGA, Italian partners as well). ESAC should be the HUB to other partners.
- Discussion on adding external computing infrastructure to the system beyond ESAC and accessing the GAIA catalogue.

Conclusions - Action Items

TASKS SUMMARY

WP-000 COORDINATION, coordinated by UB
WP-010 INFRASTRUCTURE, coordinated by UB, participation of CSUC & CSG/SAT
- Hardware Infrastructure
- Execution environments (test/debug)
- Big data framework
- Job scheduler
- User Interface
WP-110 USER PROFILE DEFINITION, coordinated by Luis Sarro
WP-120 EXECUTION POLICIES, coordinated by UB, validated/agreed with CSG & SAT
WP-130 VISUALIZATION INTERFACE, coordinated by Univ. Coruña
WP-140 GRAND CHALLENGES, coordinated by Luis Sarro
WP-150 FEATURES, coordinated by Luis Sarro
WP-160 SCHOOLS & WORKSHOPS, coordinated by UB

ACTION ITEMS

Review current SRS for WP 970, update requirements if necessary. Agree them with SAT. Task on Francesc Julbe.
Define a WP working methodology. Follow-up telecons, etc. Task on Francesc Julbe & Luis Sarro.
Design the roadmap document based on the WP970 SRS once it has been defined and reviewed. Task on Francesc Julbe.
Set up a poll to find out what sort of ideas and potential user profiles exist, and to collect use cases from people involved in the grand challenges. A set of beta testers can be identified to review the interface in the medium term. Contact Nic Walton and Graine Costigan as a starting point as they are defining a working group. Task on Luis Sarro.
Infrastructure: Update the current software platform. Task on CSUC.
Infrastructure: Setup and consolidation of the different technological frameworks involved. Stabilize the system. Document the steps taken and review infrastructure with CSG. Task on CSUC and UB agreed with CSG.
Infrastructure: Access to the members of the WP for testing and development. Task on CSUC, agreed by CSG.
Visualization: Contact WP980 team and start first study on how the DM output can be integrated into the WP980 system. Way to ingest intermediate data and final data (interactive visualization or static images). Task on Univ. Coruña.
Define a set of use cases, from the simplest to more complex ones, to be analysed with GAIA data (if possible in sync with data releases). Task on Luis Sarro and Francesc Julbe.
SparkR follow-up. Task on G.Gioffrida and Silvia Marinoni.
Clean up the list of Grand challenges and open a period of time for ‘use cases’ proposals. Use cases must include more detailed information (data required, science goal, etc.). Task on Luis Sarro.
Evaluate the possibility to apply for European funds to finance Advanced DM features. Task on Francesc Julbe.

Cesc Julbe - 2015-05-28