WP4 - Tools for data exploration

Description

Using the Gaia archive only through simple queries (e.g. sky region queries) would exploit just a fraction of its potential. To fully exploit a billion-object data set containing a wide variety of data (astrometric, photometric, spectrophotometric, spectroscopic, ...), more advanced and powerful data exploration tools will be needed. This work package is devoted to the development of such tools, in close coordination with WP200 to ensure that they are tailored to the actual needs of the scientific user community. It will include:

Development of visualization tools, adapted to the potentially large size and complexity of both the available data and the results of archive queries.
Development of data mining tools and infrastructure adapted to the characteristics of the archive (both its contents and the archive system), allowing users to perform data mining tasks and extract new knowledge.
Development or adaptation of VO tools and services for the Gaia archive. In particular, cross-matching the contents of the Gaia archive with other archives (especially large surveys that are ongoing or in preparation, like LSST) should be easily available.
Development of tools for the Grand Challenges outlined in WP 200, that will involve complex and massive exploration of the data.
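
As an illustration of the cross-matching mentioned in the list above, the following minimal sketch pairs each source of one catalogue with its nearest neighbour in another catalogue within a given angular radius. It is a brute-force example under stated assumptions, not a description of the GENIUS tools: the function names (`angular_separation_deg`, `cross_match`), the tuple layout and the sample coordinates are all hypothetical, and a real service would need spatial indexing to scale to a billion objects.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(cat_a, cat_b, radius_arcsec):
    """Return (id_a, id_b, separation_arcsec) for the nearest cat_b source within the radius.

    Each catalogue is a list of (source_id, ra_deg, dec_deg) tuples (hypothetical layout).
    Brute force, O(N*M): fine for an illustration, not for a full archive.
    """
    matches = []
    for id_a, ra_a, dec_a in cat_a:
        best = None
        for id_b, ra_b, dec_b in cat_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b) * 3600.0
            if sep <= radius_arcsec and (best is None or sep < best[1]):
                best = (id_b, sep)
        if best is not None:
            matches.append((id_a, best[0], best[1]))
    return matches


# Made-up coordinates, for illustration only.
gaia_like = [("G1", 10.0000, -5.0000), ("G2", 150.0, 30.0)]
survey_like = [("S1", 10.0002, -5.0001), ("S2", 200.0, 10.0)]
matches = cross_match(gaia_like, survey_like, radius_arcsec=2.0)
```

Here only "G1" finds a counterpart ("S1") inside the 2 arcsecond radius; "G2" has no neighbour and is left unmatched.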

Furthermore, this work package also includes the development of some tools for outreach and academic activities. Although not explicitly included in the FP7 call, we consider the task of presenting astronomy to the general public and the provision of resources for teaching astronomy based on actual Gaia data as worthy contributions to the dissemination of space mission data on a global scale.

WP4 - Tools for data exploitation [Months: 1-42] Lead beneficiary: UB Type of activity: RTD

The UB team leads this work package and will contribute most of the resources devoted to it. The personnel at the UB (see Sec. 2.2.1), led by the GENIUS coordinator X. Luri, will provide their extensive background in astrometry in general and the Gaia data in particular, and their knowledge and experience in the use of astronomical data. In addition, an experienced software engineer will be hired with the GENIUS funding and devoted full time to WP400, to provide the technical expertise necessary for the developments in this work package with the support of the UB staff. Some funding will also be devoted to specific tasks along the schedule, to employ part-time software engineers already working on DPAC developments in the UB team.

T4.1 - Technical coordination [Months: 1-42]

In addition to managing the resources deployed on the other WP-400 work packages, and producing reports on those activities, this work package oversees the design and specification of all work conducted under WP-400, to ensure that it adequately addresses the requirements identified within the GENIUS project and from external sources, such as the CU9 and GREAT. This WP also includes the liaison with Gaia and Science Archive team members at ESAC for the coordination in the development of exploitation tools working on the Gaia archive.

T4.2 - Visualization tools [Months: 1-42]

FFCUL, UB

This Work Package addresses the development of visualization tools and solutions, adapted to the large size and complexity of the Gaia archive. This includes interaction with the data, resulting in seamless visual queries to the archive.

Fully understanding the Gaia catalogue data requires a rich set of visualization tools that help in the human interpretation of the data and in knowledge discovery from its internal relations. To achieve that, the visualization package should support a wide variety of visualization algorithms, including geometrical and volumetric methods as well as advanced topological and modelling algorithms (e.g. polygon reduction, contouring, or glyphs). Besides that, we must consider modern concepts for displaying (statistical) data, moving beyond simple histograms or plots towards visual knowledge inspiration and persuasive presentation components (e.g. voxel, hixel, or texel representations). It will also be important to go one step further in current research areas such as the visualization of uncertainties (errors and their models must be seamlessly integrated and never ignored), user interactivity, and cosmetics (essential for outreach, WP-730).

The core components of the visualization framework, which interact with the different (N-dimensional) graphic widgets and the algorithms, will have to be provided as part of this package. Internal (server-side) parallel processing of massive data sets and provision for easy human interaction will have to be considered. From the hardware infrastructure point of view, the visualization package will have to allow for a flexible definition of the underlying client- and server-side egressing technologies and platforms.
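
One common way to reconcile server-side processing of massive data sets with interactive display is to aggregate on the server and send only the aggregated result to the client. The sketch below bins points into a 2D grid of counts, so that the amount of data to display depends on the grid size rather than on the number of sources. It is an illustrative assumption, not the GENIUS design: the function name `bin_points_2d`, its parameters and the sample points are hypothetical.

```python
def bin_points_2d(points, x_range, y_range, nx, ny):
    """Count points per cell of an nx-by-ny grid; returns grid[iy][ix]."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * nx for _ in range(ny)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the requested view
        # Points lying exactly on the upper edge fall into the last cell.
        ix = min(int((x - x_min) / (x_max - x_min) * nx), nx - 1)
        iy = min(int((y - y_min) / (y_max - y_min) * ny), ny - 1)
        grid[iy][ix] += 1
    return grid


# Made-up points, for illustration only; the last one lies outside the view.
points = [(0.1, 0.1), (0.2, 0.15), (0.9, 0.9), (1.0, 1.0), (2.0, 0.5)]
density = bin_points_2d(points, (0.0, 1.0), (0.0, 1.0), nx=2, ny=2)
```

A client would then render `density` as an image or contour map, and could request a finer grid over a smaller range when the user zooms in.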

Although Gaia data will be multi-dimensional, visual exploration in astronomy is mostly done using 2D representations. This reduced dimensionality has a price: it easily hides features and relations in the data and can produce cluttered views. Multiple 2D panels are often used as a solution, but the linkage between data in different panels is frequently not clear. Curiously, 3D visualization, with the gain of an extra visual dimension, is not widespread in astronomy, where most of the data are individual entities (stars, galaxies, asteroids). It is almost exclusively used in simulations of astrophysical fluids and fields, which are extended bodies. The reason is a lack of good tools for 3D selection and interaction with point clouds. 2D interfaces, such as a mouse and keyboard, are not adapted for this kind of interaction. This is one of the most critical obstacles to exploiting the third dimension in scientific research. There is clearly a need to develop an adequate tool for 3D interactive visualization supporting human-computer interfaces other than the mouse and keyboard.

Besides our own components, the reuse and extension of widely accepted (astronomical) visualization software will be analysed as part of the WP tasks. In particular, tools that support VO formats (e.g. TOPCAT, VOSpec) will be targeted, in coordination with WP-440. Those tools already handle a set of different astronomical formats and allow the inclusion of several user-defined formats. They also provide widgets for higher-dimensional visualisation, statistical algorithms, and visual comparison, which will be adapted to visualise the contents of the Gaia archive and compare them against other archives. Other existing tools will have to be examined, in particular those that deal with parallel visualization on large clusters (e.g. using MapReduce), the open-source ParaView coprocessing library (which uses VTK), and VisIVO, a visualization tool with parallel processing capabilities that is well known in astronomy.

The tasks in this sub-work package include the contributions of the FFCUL specialised partner. The team at FFCUL will provide expertise in the development of visualization tools. Their activity in visualization studies and developments for space and earth observation further allows GENIUS to take advantage of the synergies with fields other than astronomy.

The following tasks have been identified for the visualisation WP:

Define the list of requirements and feasible use cases to be covered by visualization.
Define the architecture to support the visualization requirements.
Identify the existing open-source visualization tools to be used or extended to support the graphical view of the Gaia archive.
Define the proper data models for the visualization of the requirements. In particular:
- Define in collaboration with WP430 the requirements for data mining visualization.
- Define in collaboration with WP440 the infrastructure technology compatibility and extensions to use VO standards and services.
Implement, test and monitor the visualisation and interaction tools (widgets and algorithms).

T4.3 - Data mining [Months: 1-42]

UB, CSIC

The Gaia catalogue will represent an unmatched opportunity to apply data mining techniques and algorithms as tools for knowledge discovery in a domain where there is no alternative to automated methods based on statistical learning (human exploration is certainly not feasible except for very limited subsets of data). Applying data mining algorithms to extract new knowledge from the data is mandatory for a full scientific exploitation of the Gaia data. The main focus will be on Knowledge Discovery, which is expected to reveal patterns and relationships within the astronomical data that can lead to the detection of new types of objects, or of isolated, exotic objects that represent rapid stages of stellar evolution and/or new astrophysical scenarios. Modelling tasks will also arise from the discovered patterns. In that sense, the capability of automated dimensionality reduction (feature extraction, feature selection) and the development of key learning algorithms (clustering, outlier analysis, swarm intelligence, ...) implemented for parallel processing are foreseen as important.
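
To make the idea of outlier analysis concrete, here is a minimal sketch that flags values lying far from the bulk of a sample, using a robust score based on the median absolute deviation (MAD). This is a simple textbook technique chosen for illustration, not the algorithm planned for the work package; the function name `mad_outliers`, the threshold and the sample values are assumptions.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose robust z-score exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread to measure deviations against
    # 0.6745 scales the MAD so the score is comparable to a Gaussian z-score.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - median) / mad > threshold]


# Made-up measurements, for illustration only; the last value is the odd one out.
values = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 25.0]
outliers = mad_outliers(values)
```

Because the median and the MAD are barely affected by a few extreme values, the single discrepant measurement is flagged without distorting the scale used to judge the others.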

From the architecture point of view, the DM module will have to scale to the entire Gaia data set and allow for a flexible definition of the underlying infrastructure (Cloud Computing, High Performance Computing (HPC), GRID computing, and other emerging technologies). The initial approach we plan is an architecture where the mining algorithms are accessed following the Software as a Service (SaaS) paradigm over a service-oriented architecture. However, the package should also be compatible with future definitions of data mining processes, which are expected to include more complex mining workflows supporting asynchronous notifications from those services.
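
The service-with-notification pattern described above can be sketched as follows: a client submits a job, gets control back immediately, and is notified through a callback when the result is ready. This is only a local stand-in under stated assumptions: the `MiningService` class, its methods and the toy algorithm `count_above_mean` are hypothetical, and a thread pool takes the place of what would be a remote service.

```python
from concurrent.futures import ThreadPoolExecutor


class MiningService:
    """Hypothetical local stand-in for a remote mining service."""

    def __init__(self, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def submit(self, algorithm, data, on_done):
        """Start a job and return at once; on_done(result) is called when it finishes."""
        future = self._pool.submit(algorithm, data)
        future.add_done_callback(lambda f: on_done(f.result()))
        return future

    def shutdown(self):
        """Wait for outstanding jobs and release the workers."""
        self._pool.shutdown(wait=True)


def count_above_mean(values):
    """Toy 'mining algorithm': how many values exceed the sample mean."""
    mean = sum(values) / len(values)
    return sum(1 for v in values if v > mean)


notifications = []  # collects results as they are announced
service = MiningService()
job = service.submit(count_above_mean, [1, 2, 3, 10], notifications.append)
service.shutdown()
```

The caller never blocks on the computation itself; it only reacts to the notification, which is what allows longer workflows to chain several such services.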

The tasks in this sub-work package are mainly under the UB partner, and also include the contribution of the CSIC specialised partner. Through the CSIC, the team of L. Sarro will provide GENIUS with its expertise in data mining in astronomy, including the synergies with his work in this area within the Gaia DPAC (see Sec. 2.2.7 of the DOW Part A).

The following tasks have been defined for the data mining WP:

Define the list of requirements (in coordination with WP200) and feasible use cases to be covered.
Define the architecture to support the mining processes listed in the requirements.
Define the framework to allow users to develop their own implementations of the mining algorithms.
Define the proper data models for the data mining based on the requirements. In particular:
- Define in collaboration with WP420 (Visualisation) the requirements for dimensionality reduction.
- Define in collaboration with WP300 the infrastructure technology compatibility for the data mining workflows needed by the requirements.
Parallelise existing algorithms or libraries for data mining in distributed environments.

T4.4 - VO tools and services [Months: 1-42]

CSIC, UBR

Besides novel modes of access to the entire Gaia archive and the emerging needs on visualisation (WP420) and data mining (WP430), it is anticipated that the more traditional archive access mode, in which a potentially complex query downloads a data set of modest size for interactive client-side processing, will continue to be important. The most efficient way to support this model is to provide a seamless interface for Gaia data acquisition from existing analysis tools in which astronomers already have expertise. We therefore intend to extend the following existing VO applications with Gaia-specific data acquisition tools:

- TOPCAT (Tool for OPerations on Catalogues And Tables, http://www.star.bris.ac.uk/~mbt/topcat/) is an interactive graphical application for exploration, analysis and manipulation of tabular data, especially source catalogues, which works well with moderately large data sets (up to a few million rows and a few hundred columns; more details are given in 2.2.11). TOPCAT already offers a number of service-specific load dialogues (e.g. VizieR, Millennium Simulation), and a Gaia option would be added alongside these. Additionally, we will investigate whether the existing practical limits on data set size can be increased. TOPCAT is in regular use by certainly hundreds and perhaps thousands of astronomers worldwide, and has users in 24 of the 27 EU member states. Providing direct access to Gaia data from this tool will be a highly effective way to provide an entry point for its exploitation.

- VOSpec: Gaia will produce a large set of spectra (spectrophotometric data for all the objects and high-resolution spectra for all objects up to G 17). VOSpec is an ESA-VO tool that can handle spectra in the VO context. It offers multi-wavelength spectral analysis and spectral widgets. The inclusion of Gaia-specific modules is foreseen for users who have to work with spectra processing in Gaia.

- VisIVO (Visualization Interface to the Virtual Observatory) is an open-source tool developed following the VO standards and recommendations. Data are retrieved by connecting to a VO service and loaded locally for manipulation or visualization. It can deal with multidimensional data sets of both observational and simulated data. It offers parallel processing facilities that will need to be extended to fully exploit the access to the Gaia data.

- VOSED is a tool developed in the framework of the Spanish VO to ease the generation of Spectral Energy Distributions (SEDs). VOSED is able to build SEDs by gathering information from the spectroscopic services available in the VO. These data sets can be complemented with photometric information from a number of VizieR catalogues as well as with data provided by the user.

- VOSA (http://svo.cab.inta-csic.es/theory/vosa/) is a tool to query photometric catalogues accessible through VO services, query VO-compliant theoretical spectra, calculate the associated synthetic photometry, and derive physical parameters from the model that best reproduces the observed data.
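
The notion of "the model that best reproduces the observed data" can be illustrated with a minimal chi-square comparison between observed photometry and a small grid of model predictions. This sketch does not reproduce how VOSA itself is implemented: the function names, the single-parameter grid keyed by temperature, and all flux values are made up for the example.

```python
def chi_square(observed, errors, model):
    """Sum of squared, error-weighted residuals between data and one model."""
    return sum(((o - m) / e) ** 2 for o, e, m in zip(observed, errors, model))


def best_fit_model(observed, errors, model_grid):
    """Return (parameter, chi2) for the grid model with the lowest chi-square."""
    best = min(model_grid,
               key=lambda p: chi_square(observed, errors, model_grid[p]))
    return best, chi_square(observed, errors, model_grid[best])


# Made-up fluxes in three bands (arbitrary units), for illustration only.
observed = [1.0, 2.1, 2.9]
errors = [0.1, 0.1, 0.1]
# Hypothetical grid: effective temperature (K) -> predicted fluxes per band.
model_grid = {
    5000: [0.5, 1.0, 1.5],
    6000: [1.0, 2.0, 3.0],
    7000: [2.0, 3.0, 4.0],
}
teff, chi2 = best_fit_model(observed, errors, model_grid)
```

With these numbers the 6000 K model is selected, since its predictions differ from the observations by at most one error bar in each band.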

The tasks in this sub-work package include the contributions of the CSIC and UBR specialised partners. At CSIC the team led by E. Solano (Spanish Virtual Observatory, see Sec. 2.2.7), will provide VO support and at UBR M. Taylor (main developer of TOPCAT and other VO tools, see Sec. 2.2.11) will provide the TOPCAT integration.

The following tasks have been defined for this sub-work package:

Define the list of services and tools specifications to be covered using VO for Gaia. In particular:
- Define in collaboration with WP420 (Visualisation) the requirements for VO tools and services.
- Define in collaboration with WP430 (Data mining) the requirements for VO tools and services.
Design and Implement VO services and tools for the Gaia data.
Test, optimise and validate the VO tools and services, providing performance monitoring.
Define/implement the query extensions necessary to query the catalogue to fulfil the specifications.
Obtain user feedback and update the tools and services if necessary.
Write documentation.

Participants

Manager: X. Luri (UB)
Partners:
- UB: Francesc Julbe
- CSIC: Enrique Solano, Luis Sarro
- FFCUL: Miguel Dias Duarte Ferreira Gomes, André Moitinho, Alberto Krone-Martins
- UBR: Mark Taylor
- CNRS: Jérôme Berthier

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7-SPACE-2013-1) under grant agreement n°606740.