400 - Tools for data exploration

Using the Gaia archive only through simple queries (i.e. sky region queries) would exploit just a fraction of its potential. To fully exploit a billion-object dataset containing a wide variety of data (astrometric, photometric, spectrophotometric, spectroscopic, …), more advanced and powerful data exploration tools will be needed. This work package is devoted to the development of such tools, in close coordination with WP 200 to ensure that they are tailored to the actual needs of the scientific user community. It will include:

Development of visualization tools adapted to the potentially large size and complexity both of the available data and of the results of archive queries.

Development of data mining tools adapted to the characteristics of the archive (both to its contents and the archive system), allowing the users to search and extract data based on complex criteria.

Development or adaptation of VO tools to the Gaia archive. In particular, the possibility of cross-matching the contents of the Gaia archive with other archives (especially with large surveys ongoing or in preparation, like LSST) should be easily available.

Development of tools for the Grand Challenges outlined in WP 200, which will involve complex and massive exploration of the data.
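The cross-matching mentioned in the list above can be illustrated with a minimal sketch. It assumes two small catalogues given as lists of (RA, Dec) pairs in degrees and uses a brute-force nearest-neighbour search; the function names and the 1 arcsec default radius are illustrative assumptions, and a real archive service would need spatial indexing to scale to a billion sources.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula, inputs in degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def crossmatch(cat_a, cat_b, radius_arcsec=1.0):
    """For each source in cat_a, find the nearest cat_b source within the radius.

    Returns a list of (index_a, index_b, separation_arcsec) tuples.
    Brute force, O(len(cat_a) * len(cat_b)): only suitable for small tables.
    """
    radius_deg = radius_arcsec / 3600.0
    matches = []
    for i, (ra_a, dec_a) in enumerate(cat_a):
        best = None
        for j, (ra_b, dec_b) in enumerate(cat_b):
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= radius_deg and (best is None or sep < best[1]):
                best = (j, sep)
        if best is not None:
            matches.append((i, best[0], best[1] * 3600.0))
    return matches
```

Sources without a counterpart inside the radius are simply left out of the result, which corresponds to an inner join between the two tables.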

Furthermore, the work package also includes the development of some tools for outreach and academic activities. Although these activities are not explicitly included in the call, we consider that bringing astronomy closer to the general public and providing resources for teaching astronomy based on actual Gaia data are a worthy contribution to the dissemination of space mission data on a global scale.

WP 410 Management

Overall management of WP 400

WP 420 Visualization tools

Inputs provided by A. Moitinho

Survey of visualisation tools of some utility for exploring the Gaia catalogue. Technical and semantic approaches ….. We have an ongoing ESA contract (VA-4D) for surveying the currently available visualisation tools in Climate Sciences and Astronomy, identifying visualisation needs and performing the corresponding gap (not GAP) assessment. Implementability of gap solutions. One of the outcomes is a conceptual design for a next-generation visualisation tool. The study covers not only technical aspects, but also a more abstract component focused on the semantics and ergonomics of visualisation. Application of this type of knowledge will be necessary in the definition of Gaia visualisation.
Technical solutions for visualisation beyond the study mentioned above. We have recently completed another ESA contract. This one (KD-LADS) was for knowledge discovery in large datasets and included a visualisation module, an extension of Paraview, which has already given us some practical experience in this field. With the VA-4D study we are now developing further expertise. Writing , SRS and SDD would be natural products of our activities.
Implementation. Provided that GENIUS gets funded so that we can support extra human resources, we can carry it out, as UNINOVA has proved with 15 successful projects (13 implementations) for ESA.

A sketch of our vision

With petabyte-sized databases, science will happen when we manage to connect all this data
with explanations that are usually kilobyte-sized. As attested by the portion of our brain dedicated
to the processing of visual information, human comprehension is favored when
data is presented in a visual way. The aim of scientific visualization is exactly this: to reduce
the complexity of scientific data in a way that favors the researcher's understanding, and thus the
flourishing of ideas and physical interpretation.

Gaia data is highly complex in nature, and so will be the Gaia catalogue. Therefore, tools should
be provided to the research community to help them grasp, as quickly and precisely as
possible, the information they are searching for, as well as to facilitate and even encourage
serendipitous discoveries. Whatever tool is implemented, it should therefore not work in
a completely passive way, waiting for commands from the user; it should have a little bit
of an active voice, suggesting characteristics of the visualization that would facilitate the
discovery process.

One simple example of an “active visualisation” is the following. Imagine you want to see
the Milky Way (MW) in 3D, so you request a visualization of the positions x, y, z of all the stars
in the catalogue. In this case, an “active tool” would automatically present you with a 3D volume
rendering of the stars, so that you would not see a 3D scatter plot with each point representing
a single star; instead, the global structure of the MW would be presented. Then, as you zoom in,
the volume rendering would progressively turn into a scatter plot showing individual stars, in
a fully automatic way.
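The switch between volume rendering and scatter plot described above can be sketched as a simple level-of-detail rule. This is only an illustration under stated assumptions: the threshold `scatter_limit`, the cubic viewing region and all function names are hypothetical, and a real tool would blend the two representations progressively rather than switch abruptly.

```python
from collections import Counter


def visible(points, center, half_width):
    """Keep the (x, y, z) points inside an axis-aligned cube around `center`."""
    return [p for p in points
            if all(abs(p[i] - center[i]) <= half_width for i in range(3))]


def density_grid(points, center, half_width, bins=8):
    """Count stars per voxel; a renderer would map these counts to opacity."""
    size = 2.0 * half_width / bins
    counts = Counter()
    for p in points:
        idx = tuple(
            min(bins - 1, int((p[i] - (center[i] - half_width)) / size))
            for i in range(3)
        )
        counts[idx] += 1
    return counts


def render_plan(points, center, half_width, scatter_limit=50_000):
    """Choose the representation from the number of stars currently in view."""
    in_view = visible(points, center, half_width)
    if len(in_view) <= scatter_limit:
        return "scatter", in_view  # few stars: draw them individually
    return "density", density_grid(in_view, center, half_width)
```

Zooming in corresponds to calling `render_plan` with a smaller `half_width`: once few enough stars remain in view, the plan changes from "density" to "scatter" without any user action.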

Also, this tool would present realistic visualizations. Still using our example of the Galaxy, when
it is seen as an external galaxy, a certain amount of degradation in the spatial resolution (PSF) is
necessary to give a realistic spatial representation. The bulk of the stellar population
would be visualized as a volume rendering, while some especially bright stars would be displayed as
PSFs, just as happens when we observe other galaxies (even from space).

Of course, basic functionalities must be available, such as tools for plotting scattered-point
data in 2D or 3D (with additional color-coded and shape-coded dimensions), but even these
features should present some kind of “active voice”. For instance, you graphically select a
certain number of stars in a scatter diagram. Automatically, you receive a report with the
percentage of stars of each type selected, both within the sample and globally (e.g. x% of the
sample is F stars, which are y% of the F stars in the catalogue; the same for other parameters).
This kind of information would immediately draw attention to any unexpected selection bias, and
could eventually lead to knowledge discovery: why do I have so many variable stars
here? Another appealing example is to plot unclassified stars and produce “mysterious Milky
Way” maps. What kind of biases will we find here?
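The automatic selection report described above reduces to two ratios per object type. The sketch below assumes the catalogue is available as a list of type labels and the selection as a list of row indices; the function name and the dictionary keys are illustrative choices, not part of any existing tool.

```python
from collections import Counter


def selection_report(catalogue_types, selected_indices):
    """Summarise a graphical selection by object type.

    For each type present in the selection, report
      - pct_of_sample: share of the selected stars that have this type (x%)
      - pct_of_type_in_catalogue: share of all catalogue stars of this type
        that ended up in the selection (y%)
    """
    total_by_type = Counter(catalogue_types)
    sample_by_type = Counter(catalogue_types[i] for i in selected_indices)
    n_selected = len(selected_indices)
    report = {}
    for star_type, count in sample_by_type.items():
        report[star_type] = {
            "pct_of_sample": 100.0 * count / n_selected,
            "pct_of_type_in_catalogue": 100.0 * count / total_by_type[star_type],
        }
    return report
```

A type whose second percentage is much larger than the overall selected fraction is over-represented in the selection, which is the kind of unexpected bias the report is meant to expose.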

This highlights how we must study what kind of representations can provide a broad view of the
Gaia catalogue: a Milky Way map alone is not a general view of the contents. The design
of the visualisation system will rely on the definition of key statistics representing the catalogue
contents.

Moreover, a rather neglected aspect of 3D visualization software, which in the case of Gaia is
of fundamental importance, is the treatment of measurement errors. Any tool implemented for visual
exploitation of Gaia data must take the catalogue errors into account seamlessly during the
visualisation process if it is to have real scientific value.

Architecture and functionality of the visualisation must be driven by use case scenarios, like those
being listed in the GREAT wiki (model comparison, etc.). However, we can only know the actual
usage in a broad sense. There will always be specific needs in special cases that we cannot
predict beforehand. We have to accept this. Gaia visualisation should not claim to be a universal
tool.
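Returning to the measurement errors discussed above, one very simple way to carry them into a 3D view is to draw each star with a line-of-sight error bar and an opacity that decreases with its relative parallax error. The sketch below assumes parallaxes and their uncertainties in milliarcseconds and uses the naive inversion d = 1000 / parallax, which is known to be biased for large relative errors; the opacity mapping and the function names are hypothetical.

```python
import math


def distance_interval_pc(parallax_mas, sigma_mas):
    """Return (near, nominal, far) distances in parsec for parallax +/- 1 sigma.

    Naive inversion of the parallax; only meaningful for positive parallaxes.
    """
    if parallax_mas <= 0:
        raise ValueError("this naive sketch needs a positive parallax")
    nominal = 1000.0 / parallax_mas
    near = 1000.0 / (parallax_mas + sigma_mas)
    low = parallax_mas - sigma_mas
    # If the 1-sigma interval reaches zero parallax, the far limit is unbounded.
    far = 1000.0 / low if low > 0 else math.inf
    return near, nominal, far


def marker_opacity(parallax_mas, sigma_mas, floor=0.05):
    """Fade stars with a large relative parallax error (hypothetical mapping)."""
    if parallax_mas <= 0:
        return floor
    return max(floor, 1.0 - sigma_mas / parallax_mas)
```

The asymmetric interval returned by `distance_interval_pc` is the point of the example: a symmetric error in parallax does not translate into a symmetric error in distance, so the visualisation cannot simply attach a fixed-size blob to every star.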

Gaia visualisation should allow interaction with 2D and 3D representations of the Milky Way,
allow zooming and panning, and allow selection of data based on positions or any other measurements
(color, chemical composition, kinematics, etc.). It should be able to represent and allow
interaction with both point-like data (stars) and extended sources (e.g. molecular clouds mapped
via Gaia extinction or measurements from radio surveys). Selection should be possible either
directly on the data parameters or with the help of some classification scheme. The tool would
also allow fitting or comparing theoretical and semi-empirical models to observations.

We don't really know how to do scientific analysis in 3D, or are not used to it. The interfaces are
not yet comfortable and the interaction approaches are not efficient. This must really be researched.
However, 3D displays and interfaces are becoming widespread in the entertainment market.
We have to port this experience into scientific visualisation. Why? Because we gain an extra
dimension to analyse simultaneously. Younger people will certainly be used to these systems.

Gaia, and astronomy in general, have a strong appeal to the public. However, scientific plots,
although useful to the researcher, do not have visual appeal for the public. To overcome this
scientist-public barrier, artist's impressions are usually produced, but they have the drawback of
being very qualitative and even misleading due to some exaggeration. The ideal tool should
provide some (automatic) cosmetic qualities.

WP 430 Data mining tools

Objectives

The objective is to implement the infrastructure to allow common data mining tasks in the Gaia Archive. The focus will be on Knowledge Discovery (new types of objects, exotic objects, similarity-based queries, etc.) and modeling. The DM (Data Mining) module will have to scale to the entire Gaia dataset and allow for a flexible definition of the underlying infrastructure (Cloud Computing, GRID computing, and other emerging technologies).

Tasks

Define the list of use cases to be made feasible.
Define in collaboration with WP 420 (Visualisation) the requirements for dimensionality reduction.
Define in collaboration with WP300 the infrastructure technology compatibility.
Parallelise existing libraries for Data Mining in distributed environments.
Implement a submodule to allow users to provide their own algorithms.
Write documentation.
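Two of the tasks above, running algorithms over a distributed dataset and letting users supply their own algorithms, can be sketched together as a small registry of map/combine pairs. Everything here is an assumption made for illustration: the registry, the function names and the use of a thread pool as a stand-in for the actual Cloud or GRID infrastructure, which this document leaves open.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical registry: algorithm name -> (map_chunk, combine)
_ALGORITHMS = {}


def register_algorithm(name, map_chunk, combine):
    """Register a user algorithm as a per-chunk function plus a merge function."""
    _ALGORITHMS[name] = (map_chunk, combine)


def run_algorithm(name, chunks, workers=4):
    """Apply `map_chunk` to every data chunk in parallel, then merge the partials.

    `chunks` must be a non-empty sequence of data partitions.
    """
    map_chunk, combine = _ALGORITHMS[name]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(combine, partials)


# Example user algorithm: count sources brighter than magnitude 14.
register_algorithm(
    "bright_count",
    map_chunk=lambda mags: sum(1 for g in mags if g < 14.0),
    combine=lambda a, b: a + b,
)
```

Splitting each algorithm into a per-chunk step and an associative combine step is what lets the same user code run unchanged whether the chunks live in one process or on many nodes.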

Input

Simulated data
Gaia Main DataBase
Existing DM (Data Mining) libraries

Output

Data Mining capabilities integrated in the Gaia Archive.

WP 440 VO tools and services

Objectives

The objective is to adapt, test, and implement Virtual Observatory tools and services for Gaia data.

Tasks

Acquire specifications to develop the VO tools and services
Design, develop and implement the VO tools and services
Test, optimize, and validate the VO tools and services
Monitor performance of the tools and services
Obtain feedback from users
Update VO tools and services if necessary
Write documentation

Input

Simulated data
IVOA tools

Output

Gaia VO tools and services

Suggestion from Mark Taylor, TOPCAT developer

TOPCAT (which I've developed over about the last 8 years) is a
graphical tool for analysis and interactive exploration of tabular
data which works well with moderately large datasets (1e6-1e7 rows,
1e2 columns); it does plotting, selections, crossmatching,
calculations, and a load of other stuff. It's already in quite
wide use, and already ticks a number of the buzzwords in the
WP500 introduction slide - it does visualisation, it's very VO-friendly
(and very well-known by the VO group at ESAC), it's been used to
some extent for outreach (though that hasn't been a high priority
before now), and I'm looking at adding some data mining capabilities.
In its current incarnation it is not scalable up to 1e9 rows
(which of course couldn't be reasonably transmitted
from an archive server to a client-side tool in any case), so I'm
by no means suggesting that it's the single solution to the
question that WP500 is seeking to answer, but I do think that a tool
of this nature is an important part of the armoury that a user
of the Gaia archive will want, and as far as I know, TOPCAT is
the most capable one around.

STILTS is a complementary suite of command-line tools based on the
same technology. Both are implemented in pure Java.

The web pages of these tools are here:

http://www.starlink.ac.uk/topcat/
http://www.starlink.ac.uk/stilts/

I don't have much background with Gaia, and I haven't worked on
writing an FP* proposal before now, so I don't have a very clear
idea of what's required here. However, I can imagine that once
there are requirements for a user-facing tool that can provide
the data exploration functionality being discussed here, adding
such functionality to an existing powerful and widely-used tool
will be a more effective way to tackle it than starting from scratch.
One concrete and fairly straightforward possibility that comes to
mind is adding a Gaia-specific load dialogue to TOPCAT which makes
it easy to interrogate the archive to get data into the tool
(similar requirements from users of other projects in the past
have led to custom load dialogues for VizieR and Millennium
Database access services).