WP5 - Tools for Data Validation and Analysis


The preparation of the Gaia archive before its publication requires a careful, detailed and in-depth validation of its contents. The scientific and statistical challenge of this task on a one-billion-source data set containing a wide variety of data (astrometric, photometric, spectrophotometric, spectroscopic, ...) is daunting, and would be impossible without tools adapted to work on such a massive and data-diverse archive. This work package aims at producing such tools, based on the actual validation needs and on the characteristics of the archive system, thus making them as efficient as possible. Furthermore, the validation process will rely on methods and tools that can also be used, with little or no adaptation, for the scientific analysis of the catalogue.

Therefore, this work package, in connection with WP 400, will also produce tools for the use of the scientific community in its analysis of the Gaia data. This work package will undertake the following tasks:

  • T5.2 Looking for trouble: definition of problem cases, validation scenarios and tools
  • T5.3 Simulation versus reality: from models to observables
  • T5.4 Confronting Gaia to external archives
  • T5.5 Data demining: outlier analysis
  • T5.6 Transversal tools for special objects

This structure is mostly identical to that of the CU9 work packages, and delivers tools for them.

  • WP5 - Tools for data validation and analysis [Months: 1-42]
  • Lead beneficiary: CNRS
  • Type of activity: RTD

Despite the precautions taken when building the data processing algorithms, completely avoiding errors in the astrometric, photometric, spectroscopic or classification data of a one-billion-source catalogue, each source carrying many interrelated data items, is an impossible task. Still, provisions should be made to ensure the highest quality of the Gaia Catalogue through a data validation before each release.

While every Gaia DPAC Coordination Unit (CU) has indeed implemented unit tests and verification tests, a validation between CUs, and a comparison with external data can offer, perhaps not a final word, but at least a useful complementary insight. The present section details the tools, either interactive or automated, devoted to validation purposes. As much as possible, the validation tools will rely on requirements, methods and tools developed in the other work packages in order to validate not only the data but also the other tools developed within GENIUS.

T5.1 - Technical coordination [Months: 1-42]


The objective of this work package is to ensure that WP 500 meets its objectives within budget and on schedule. Tasks will include co-ordinating and supervising the activities to be carried out, monitoring project progress, monitoring the quality and timing of deliverables, and reporting back to the GENIUS executive board. The manager of this Work Package will be responsible for management and progress reports, and for ensuring good coordination with the other Work Packages and with CU9 needs.

T5.2 - Looking for trouble: definition of problem cases, validation scenarios and tools [Months: 1-42]


A basic verification of the Catalogue content should ensure that the field contents are as expected, that all fields are within valid ranges, and that fields are present as indicated (e.g. spectroscopic epoch data should be present when, and only when, flagged as such). Blind automated tools fulfilling these simplest basic tests are thus needed. In addition, consistency of this content with the documentation is mandatory.
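A blind range/presence check of this kind could be sketched as follows. The field names, valid ranges and the `has_epoch_rv` flag are illustrative assumptions for the sake of the example, not the actual Gaia archive schema:

```python
# Minimal sketch of blind range/presence checks on catalogue records.
# Field names and ranges below are hypothetical, not the real schema.
VALID_RANGES = {
    "ra": (0.0, 360.0),          # degrees
    "dec": (-90.0, 90.0),        # degrees
    "parallax": (-5.0, 1000.0),  # mas; slightly negative values allowed (noise)
    "phot_g_mean_mag": (2.0, 22.0),
}

def validate_record(rec):
    """Return a list of problem descriptions for one catalogue record."""
    problems = []
    for field, (lo, hi) in VALID_RANGES.items():
        value = rec.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif not (lo <= value <= hi):
            problems.append(f"{field}: {value} outside [{lo}, {hi}]")
    # Conditional presence: epoch data must be there iff flagged as present
    if rec.get("has_epoch_rv") and "epoch_rv" not in rec:
        problems.append("epoch_rv: flagged as present but missing")
    return problems
```

In an automated pipeline such a checker would be driven by the archive's data model rather than a hand-written table, so that the tests stay consistent with the documentation.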

Complementing this formal validation of the Catalogue output, more complex tests should be elaborated, and the associated tools should be developed. For instance, the fact that Gaia is a complete observatory in orbit, combining astrometric, photometric and spectroscopic information implies some redundancy which can be exploited for validation purposes; for example, photometry should be consistent with spectroscopy. Other intrinsic correlations between parameters can be used to build those tests, such as e.g. the dependence of proper motion on distance.

On the forward-modelling side, it is of interest to consider what kind of problems could occur and what consequences they would have on the observed parameters. Some expected problems which would produce errors in the Catalogue are the following:

  • Calibration or instrumental problems
  • Classification errors
  • Data Processing shortcuts or approximate models

This work package will accordingly define validation scenarios, and implement the corresponding tests. Some illustrative examples can be given:

  • It is expected that photometric calibration problems would introduce a spurious variability for stars. Consequently, the analysis of stellar variability, either spatially or versus time, can validate the data or exhibit calibration problems.
  • On the astrometric side, any annual thermal or calibration effects would introduce a parallax bias, as was already studied for Hipparcos, so the parallax zero-point should be studied as in, e.g., [1].
  • Bad cross-matching of Solar System Objects (SSO) would produce spurious SSOs or stars, so the distribution of distances from SSO observations to the nearest non-SSO source is a useful test.
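The nearest-neighbour test in the last example could start from something like the sketch below. The flat-sky approximation and toy coordinates are assumptions made for brevity; a real tool would use proper spherical separations:

```python
import numpy as np

# Illustrative sketch of the SSO nearest-neighbour test: for each SSO
# observation, find the distance to the closest non-SSO source. An excess
# of very small separations would hint at cross-match errors turning stars
# into spurious SSOs. A flat-sky approximation is used here, valid only
# over small fields; positions are toy values in degrees.
def nearest_neighbour_sep(sso_xy, star_xy):
    """For each SSO position, return the distance to the closest star."""
    sso_xy = np.asarray(sso_xy, dtype=float)
    star_xy = np.asarray(star_xy, dtype=float)
    d = np.hypot(sso_xy[:, None, 0] - star_xy[None, :, 0],
                 sso_xy[:, None, 1] - star_xy[None, :, 1])
    return d.min(axis=1)
```

The validation diagnostic would then be the histogram of these separations, compared with the distribution expected for unrelated source populations.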

Summarising the above comments, the corresponding work packages would then be the following:

  • 521. Formal validation of the Catalogue field content as function of the object type

  • 522. Internal consistency tests

  • 523. Tests based on what is known to produce effects on given parameters

  • 524. Generation of validation reports with diagnostics filtering

T5.3 - Simulation versus reality: from models to observables [Months: 1-42]


The CU2 DPAC Coordination Unit has provided a very valuable tool: the Universe Model. Indeed, this model, initially based on the Besançon Model of our Galaxy, has been complemented with an extinction model, multiple stars and variability models, etc., and now represents the best simulated sky one could hope to test the DPAC algorithms against.

In turn, this model can be used to validate the Gaia data. In a first step, the (astrometric, photometric, spectroscopic or classification) observable parameters which are predicted by the model should be computed, in the form of statistics: distribution, confidence intervals and correlations between parameters, by object type, by region and by time.

Certainly, differences between what is predicted and what is observed are expected (or even desired) and the comparison between model and observed data requires clustering tools (WP 551) and robust implementations (WP 553). Clearly, for several parameters, checks will have to be made separately for different classes of sources and it would also be desirable that scientists are able to apply their interpretative skills to the comparison of model versus data.

  • 531. Statistics of the parameters deduced from models

  • 532. Build tools checking that all Catalogue fields have ‘reasonable’ distributions, i.e. consistent with what is obtained in WP 531
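A basic building block for such distribution checks is a distance between the empirical distribution of a Catalogue field and that predicted by the model. The sketch below implements the two-sample Kolmogorov-Smirnov statistic (maximum distance between empirical CDFs) by hand; this is one illustrative choice of diagnostic, not the method prescribed by the work package:

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the empirical CDFs of samples a and b. A large
    value flags a Catalogue field whose distribution deviates from the
    model prediction and deserves closer inspection."""
    a, b = np.sort(np.asarray(a, float)), np.sort(np.asarray(b, float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()
```

In practice such a statistic would be computed per object type, per sky region and per time interval, matching the stratification of the model statistics in WP 531.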

CNRS UMR 6213 is responsible for the CU2 Universe Model and is thus well placed to tackle these tasks efficiently.

Projecting models into the observable domain (such as the Universe Model mentioned above) is a task in common with WP 230, the difference being that validation expects to retrieve from the data specific structures that are already known, while scientific users of the Catalogue will expect to find new ones. The development of the needed tools will consequently be done in close cooperation with WP 230.

T5.4 - Confronting Gaia to external archives [Months: 1-42]


One of the first uses of the Gaia data will be the cross-matching to external archives, where the astrometry will allow absolute luminosities to be obtained in various wavelength ranges. Defining the tools that allow this is thus mandatory on the ‘scientific’ side; on the ‘validation’ side, what matters is that a photometric analysis should show the consistency between Gaia data and external data.

The problem, which is actually not a problem but one of the strengths of Gaia, is that there is no comparable all-sky survey with a comparable angular resolution and multiple-star discovery power. Although the cross-matching will be based on the VO tools elaborated in WP 400, the methodology for doing this in practice a) in dense areas, b) when handling multiple objects, c) taking into account all properties of Gaia on the one hand and of the other catalogues on the other, implies the need to develop tools allowing both user input and intelligence in the data pairing.
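One simple form this "intelligence in the data pairing" could take is an ambiguity flag on matches whose second-nearest candidate is close enough to confuse the pairing, which matters chiefly in dense areas. The sketch below is a toy flat-sky version under assumed parameter names; real tools would work with spherical coordinates and the VO cross-match services of WP 400:

```python
import numpy as np

def crossmatch(gaia_xy, ext_xy, radius, ambiguity_factor=2.0):
    """For each Gaia source, return the index of the nearest external
    source within `radius` (or -1 if none), plus a flag marking matches
    whose second-nearest candidate lies within ambiguity_factor*radius
    and could therefore be a confusion in dense fields."""
    gaia_xy = np.asarray(gaia_xy, float)
    ext_xy = np.asarray(ext_xy, float)
    d = np.hypot(gaia_xy[:, None, 0] - ext_xy[None, :, 0],
                 gaia_xy[:, None, 1] - ext_xy[None, :, 1])
    order = np.argsort(d, axis=1)
    rows = np.arange(len(gaia_xy))
    d1 = d[rows, order[:, 0]]
    matches = np.where(d1 <= radius, order[:, 0], -1)
    if ext_xy.shape[0] > 1:
        d2 = d[rows, order[:, 1]]   # distance to second-nearest candidate
        ambiguous = (matches >= 0) & (d2 <= ambiguity_factor * radius)
    else:
        ambiguous = np.zeros(len(gaia_xy), dtype=bool)
    return matches, ambiguous
```

Flagged pairs would then be handed to the user, or to more elaborate probabilistic pairing, rather than accepted blindly.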

Besides, the validation meant here is the validation of the Gaia data, not that of the external archives, although the latter would certainly be interesting on the scientific side (and will thus be disseminated as input to further scientific analysis). Robustness is thus mandatory in the face of missing data, limited precision, and the high level of systematics expected in external archives, which could wrongly be interpreted as problems within the Gaia data. Robustness shall be achieved thanks to tools developed in WP 550.

  • 541. Multi-wavelength cross-matching tools

CSIC will devote four person-months to this Work Package. Consultancy from the INAF partner will also be useful for this task, as INAF-OATo is responsible for the IGSL cross-matching algorithms in DPAC. This WP will also benefit from the developments in WP 240.

  • 542. Photometric and classification analysis tools

  • 543. Cross-validation tools with Nano-JASMINE data

As for Gaia, the Nano-JASMINE (N.J.) astrometric results will need to be validated, and many tools defined in the whole WP 500 package can indeed be used for this purpose.

Besides, it is planned to combine N.J. data with Hipparcos data to greatly improve the proper-motion precision of the stars in common, thanks to the long time baseline between the two missions. A cross-validation is needed before combining the data, which will incidentally allow long-period binaries to be detected. As N.J. uses the Astrometric Global Iterative Solution (AGIS) developed for Gaia by ESAC and Lund Observatory, a useful insight into the AGIS behaviour with N.J. (e.g. validation of the estimated correlations between astrometric measurements) will be obtained when the more precise Gaia data are available. Finally, the Gaia data will also allow testing of the results obtained with the validation tools applied to the N.J. data.
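The gain from the long baseline is easy to quantify: a proper motion derived from two epoch positions has an uncertainty that scales as the combined position error divided by the epoch difference, so a few decades between Hipparcos and N.J. turn modest position accuracies into precise proper motions. A toy one-dimensional illustration, with hypothetical numbers:

```python
# Toy illustration of the long-baseline proper-motion combination.
# Positions are one-dimensional offsets in mas from a reference point;
# the epochs and uncertainties below are hypothetical round numbers.
def long_baseline_pm(pos1_mas, epoch1, pos2_mas, epoch2, sigma1_mas, sigma2_mas):
    """Proper motion (mas/yr) and its uncertainty from two epoch positions."""
    dt = epoch2 - epoch1
    pm = (pos2_mas - pos1_mas) / dt
    sigma_pm = (sigma1_mas**2 + sigma2_mas**2) ** 0.5 / dt
    return pm, sigma_pm
```

With, say, a 35-year baseline, even millarcsecond-level epoch positions yield proper motions good to a few hundredths of a mas/yr; it is also why an unmodelled long-period binary, displacing the photocentre between the two epochs, shows up as a discrepant proper motion in the cross-validation.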

T5.5 - Data demining: outlier analysis [Months: 1-42]


Outliers being by definition objects which deviate from an assumed model, it would be surprising that a mission such as Gaia planned for deciphering the complex structure of the Galaxy would exhibit no outliers departing from our current knowledge.

While a first risk already handled is the presence of problems or systematic errors in the Catalogue, another issue is an incorrect interpretation of data features. Indeed, although three of the WP managers of the current proposal drew the attention of the community [4] to the precautions to be taken with the analysis of the Hipparcos data, this did not prevent incorrect exploitation of the astrometric data. In this respect, being able to show that objects are not outliers is perhaps as important. Tools dealing with extreme values are thus needed.

In an interactive discovery phase, data analysis via the graphics tools developed in WP 400 should allow tolerances, in order not to detect noise instead of substructures. Still, at some point special subsamples will be detected thanks to clustering tools. What is then needed is an immediate characterisation (statistical analysis and classification) of the properties of each subsample, with a subsequent visualisation (e.g. 3D spatial maps).
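A first step of such a characterisation, robust against the very outliers being studied, could use the median and the MAD-based scatter of each parameter over the detected subsample. This is one common choice of robust summary, sketched here, not the specific statistics the work package will settle on:

```python
import numpy as np

def robust_summary(values):
    """Robust location and scatter of one parameter over a subsample:
    the median, and the median absolute deviation scaled by 1.4826 so
    that it estimates the Gaussian sigma for normally distributed data
    while remaining insensitive to extreme values."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    return med, 1.4826 * mad
```

Applied per parameter and per cluster, such summaries give the immediate statistical portrait of a subsample before it is passed on to classification and visualisation.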

  • 551. Clustering and sub-population statistical characterisation tools

  • 552. From graphics to diagnostics, from diagnostics to graphics

The FFCUL node will contribute efficiently to this task.

  • 553. Robust tools using truncated, censored or correlated data

T5.6 - Transversal tools for special objects [Months: 1-42]


Some special objects need a special treatment, in particular those having a time dependence such as multiple or variable stars or solar system objects. Moreover, these objects may greatly benefit from a reprocessing of the Gaia data using external epoch data. Dedicated sub-work packages led by experts of the models used in these fields are thus required here and their specialized tools will also contribute to WP 520-550.

  • Detection of new objects in the Solar System is foreseen with Gaia. While in numbers they are very few (about one per 10⁷ Gaia objects), in terms of classes they are scientifically valuable: one expects to detect Near Earth Objects (NEOs) inside the orbit of the Earth, or bright outer Solar System objects. Real-time validation thus has to be provided for the dedicated ground-based support network Gaia-FUN-SSO, both to reduce false alerts as much as possible and to validate the data inserted in the global data analysis scheme. Specific software needs to be developed for automating alerts, transforming and disseminating data for use by observers, and making all data and alerts VO-compliant. This task will have to combine ground-based and space-based Gaia data. One also needs to compute the orbital elements and compare them to those of the known population of asteroids and comets, so as to perform orbital adjustment taking into account a full dynamical model, and at the same time validate the inversion process from the limited Gaia sample (which usually covers less than an orbital period around the Sun). These data must be ingested in the input auxiliary database for small Solar System bodies, which has to be maintained and regularly updated during the mission.

Solar System objects are particular in that they move with continuously varying velocity, and their brightness changes continuously because of both geometry and intrinsic properties. Observations can be corrupted by a close approach to a star; in such a case the information has to be provided to the group tasked with analysing the stellar data. Furthermore, one will either validate the rejection of corrupted data, or retain the data as possibly valuable additional scientific input (thus entering WP 550). Such analysis has to be performed on Solar System objects directly observed by Gaia as well as on other objects that will not be observed by Gaia but are catalogued in SSO databases (e.g. the planets, dwarf planets, large and irregular satellites, and asteroids fainter than magnitude 20).

  • Considering multiple stars, it should be noted that about half of the Gaia Catalogue will consist of sources that are actually non-single stars, of which only a much smaller (though significant) fraction will be detected. Assessing the quality of the data reduction for the majority of stars not detected as non-single stars (NSS) thus appears complicated (as the model fitted to the astrometric measurements may be incorrect), whereas it may prove easier when a more correct astrometric model is already known, which is the case for detected NSS.

Two types of validations are identified: on the one hand, validations relying upon the statistical behaviour of the solutions leading to the catalogue, i.e. purely standalone validations e.g. goodness of fit; on the other hand, validations based on a comparison with some auxiliary data, e.g. speckle observations or spectroscopic orbits. These validations may thus allow a better insight into the properties of the observations, associated uncertainties, and data reduction done in the astrometric (CU3), photometric (CU5) and spectroscopic (CU6) data reduction chains.

  • Finally, the stellar variability should be studied in detail: while a certain fraction of the sources are expected to be intrinsically variable in flux, the assumed constancy of many other sources is also what underpins the principles of the data reduction. Conversely, an unexpected variability can also be the signature of an acquisition or data reduction problem. Because variability is transversal to the validation process, this work package will develop tools validating the astrometric, photometric, spectro-photometric and spectroscopic reduction from the point of view of time series and variability. The study will take several directions: for example, studying variable sources to determine whether some of their variability behaviour is due to the instrument or the reduction, or detecting whether constant sources or small-amplitude variables show residual effects coming from the satellite, from the data acquisition mode, or from the reduction method. To see such effects, it is important to gather several sources and to work with averaged quantities.

A large list of all effects that should be studied has already been established, to check for residual effects in the (spectro-)photometric and spectroscopic time series.
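A standard starting point for the constancy side of these tests is the chi-square of a constant-flux model fitted to each time series: values far above the number of degrees of freedom flag either genuine variability or acquisition/calibration problems. This is one common diagnostic, sketched here as an illustration, not the full list of tests alluded to above:

```python
import numpy as np

def constancy_chi2(mag, sigma):
    """Chi-square of a constant-source model fitted to a magnitude time
    series with per-epoch uncertainties. Values far above the degrees of
    freedom (n - 1) flag variability, be it intrinsic or instrumental."""
    mag = np.asarray(mag, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    w = 1.0 / sigma**2
    mean = np.sum(w * mag) / np.sum(w)       # weighted mean magnitude
    chi2 = np.sum(((mag - mean) / sigma)**2)
    return chi2, len(mag) - 1
```

Averaging this statistic over many sources sharing an instrumental configuration, as the text suggests, separates per-source variability from residual effects of the satellite, acquisition mode or reduction method.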

  • 561. Solar system objects

The competence of the CNRS UMR 8028 will prove useful if not mandatory for this task.

  • 562. Multiple stars

The ULB node contribution, in charge of the CU4 NSS handling in DPAC, will be needed for this Work Package.

  • 563. Variability and time series

This WP is where the competence of the UG node (in charge of the coordination unit, CU7, responsible for variability processing in DPAC) will help to build tools related to the variability analysis.


  • Manager: F. Arenou (Obs. Paris-Meudon, CNRS)
  • Partners:
    • CNRS (GEPI, IMCCE, UTINAM): Paola Di Matteo, Carine Babusiaux, Daniel Hestroffer, William Thuillot, Annie Robin, Céline Reylé, Houri Ziaeepour, M. Kudryashova, Krzysztof Findeisen, Laura Ruiz-Dern
    • UNIGE: Laurent Eyer, Sergi Blanco-Cuaresma
    • ULB: Dimitri Pourbaix, Christos Siopis
    • CSIC: Enrique Solano, Luis Sarro
    • FFCUL: André Moitinho, Alberto Krone-Martins
    • KU: Yoshiyuki Yamada, Naoteru Gouda, Ryoichi Nishi, Shunsuke Hozumi, Satoshi Yoshioka


The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7-SPACE-2013-1) under grant agreement n°606740.

Topic revision: r12 - 2015-10-14 - LolaBalaguer