WP3 - Aspects of archive system design


The objective of this workpackage is to design, prototype and develop aspects of the archive infrastructure needed for the scientific exploitation of Gaia data.

  • WP3 - Aspects of archive system design [Months: 1-42]
  • Lead beneficiary: UEDIN
  • Type of activity: RTD

The design and technology choices made will be motivated by the real user requirements identified by WP 2 – in particular, the massive, complex queries defined by the Grand Challenges – and by other initiatives, such as the GREAT project, and will be made with full recognition of the constraints imposed by the ESAC archive system, with which it must interface effectively. Prototypes will be prepared and tested in cooperation with the end user community and with the ESAC science archive team through the DPAC CU9. A core principle will be the adoption of Virtual Observatory standards and the development of VO infrastructure to enable ready interoperation with the other external datasets needed to release the full scientific potential of Gaia.

T3.1 - Technical coordination [Months: 1-42]


In addition to managing the resources deployed on the other WP3 work packages, and producing reports on those activities, this work package oversees the design and specification of all work conducted under WP3, to ensure that it adequately addresses the requirements identified within the GENIUS project and from external sources, such as the CU9 and GREAT. The key aim is to maximise science return by enabling scientific exploitation through appropriate use of information technologies.

This WP also includes the assurance of compliance with the deployment of the archive at ESAC. Since the Gaia archive will be designed and run at this centre, it is essential that the techniques and technologies prototyped in this project are consistent with what can be ultimately implemented there. An important aspect of T3.1 is to ensure the injection of the relevant requirements for this in the design and evaluation phases, and that all GENIUS system design work is tackled with full awareness of the constraints imposed by ESAC infrastructure and practice. A key deliverable is therefore a formal, documented co-ordination and interface agreement between GENIUS and the Science Archive Team (SAT) at ESAC through the CU9.

This work will be undertaken by Hambly of UEDIN.

T3.2 - Aspects of archive interface design [Months: 1-42]


The Gaia mission will produce a wide variety of data products, leading to a complex archive. A crucial issue for the exploitability of the Gaia data set is, therefore, an archive interface that supports a sufficiently rich range of functionality and is sufficiently easy to use for users to do their science with it effectively. The task of this WP is to prototype archive interface components that meet these user requirements, as developed by the CU9 and GREAT. Since all of the candidate archive DBMSs that may be employed at ESAC support access from Java via Java Database Connectivity (JDBC), archive interface prototypes can be developed independently of the backend DBMS.

UEDIN has recently been prototyping the use of Web 2.0 technologies for the delivery of an intuitive but richly functional user interface to sky survey archives with complicated schemas, and this approach appears promising for Gaia: functionality such as making schema information readily available to users as they develop their queries, and even using code completion to help write them, can make archive use much more effective.

The interface offers users the ability to explore data interactively: they can execute a query, generate summary plots (e.g. scatter plots, histograms, etc.), realise their query was not quite making the desired selection, and then easily tweak the query and execute it again. This reflects the iterative method of working that scientists naturally adopt, which is clearly revealed in analyses of the query logs from sky survey archives such as the WFCAM Science Archive [6], curated by UEDIN, and this iterative workflow can be made to run efficiently using a combination of client- and server-side technologies.

What is most important is that the functionality prototyped is that prioritised by scientists, and that any testbed developed here helps the user community to further refine their expressed requirements. For example, while GAP has successfully engaged the Gaia user community via a call for ‘usage scenarios’ under the auspices of GREAT (and these form the inputs to WP2), iteration of requirements with these key consumers has not been considered so far. This process will drive the further development of the user interface design – e.g. in determining which additional graphical capabilities to implement, and in assessing how sophisticated a caching mechanism is required to support the division of datasets between the client and the server – and we propose to use the interfaces developed by this WP for an initial deployment as a testbed for the community to further assess its requirements.

The work will be undertaken by Read (UEDIN).

T3.3 - VO infrastructure [Months: 1-42]


The past decade has seen a huge amount of activity in defining, standardising and implementing the global ‘Virtual Observatory’. From the outset, large–scale mission data sets from ground and space were anticipated as being the cornerstone of the VO. This work has reached a level of maturity whereby most of the basic interoperability standards are in place (http://www.ivoa.net/Documents/) and it is possible to build project-specific services on top of them and to see where the further development of standards is needed in support of particular projects.

Our goal in T3.3 is a focused programme of VO consolidation and development work concerning server–side components (as opposed to client–side applications; see WP4) to provide the particular VO infrastructure required for Gaia exploitation. This will involve the following strands of work:

1. Assessment of compliance with VO standards (Solano, 6 sm CSIC): to test and implement the Virtual Observatory standards and protocols necessary to make Gaia data fully VO compliant. We will define the list of VO standards applicable to Gaia data; implement those standards in Gaia simulated data; and document the process, using simulated data and IVOA standards and protocols as inputs. The main deliverable will be a specification for VO–compliant Gaia data.

2. Deployment of specific web services (Berthier, 1.8 sm CNRS): the SkyBOT (http://vo.imcce.fr/webservices/skybot/) service suite will provide VO-compliant tools for the treatment of solar system bodies within Gaia data, while Miriade (http://vo.imcce.fr/webservices/miriade/) computes positional and physical ephemerides of known solar system bodies in a VO-compliant manner.

3. VO-Dance (Smareglia, 18 sm INAF): The VO-Dance suite provides a lightweight method of publishing data to the VO. Its components can be distributed as disk images to be run on a virtual machine, so we shall assess its use as a means whereby users can integrate their own datasets with Gaia data.

4. VOSpace (Voutsinas, 9 sm UEDIN): Support for an extension to the current VOSpace functionality so that, in addition to providing users with file storage space addressable by VO access protocols, they can also have database storage space on the same basis. This will provide users with a personal database facility like the SDSS MyDB system, which they are able to address in a VO-compliant manner. For example, a user will be able to direct the result set from one VO query into their personal database, and then use it as the target for a subsequent query, possibly also involving other datasets in the VO, using the TAP Factory system of T3.4 below.

T3.4 - Data Centre Collaboration [Months: 1-42]


With the Table Access Protocol (TAP http://www.ivoa.net/Documents/TAP/) the VO provides a standard means of querying tabular data sets, and with the advent of the TAP Factory [8] it has become possible to execute multiple, distributed TAP queries. In a traditional IVOA TAP scenario, single TAP endpoints provide the means for VO clients to present the user with a data resource schema and then to service an ADQL query on that resource, but it is then up to further, separate client–end manipulations to join data for multiwavelength science. TAP Factory takes this further by combining TAP with the Open Grid Service Architecture Data Access Infrastructure (OGSA–DAI) middleware to provide a means of creating TAP end-points on–the–fly, thereby facilitating the cross-querying of distributed resources by TAP clients.

Such a system supports one of the fundamental usage scenarios for the VO. A user can select a set of data resources published using TAP on which to execute a distributed query. From the metadata exposed by the individual TAP services, TAP Factory is able to create a new TAP endpoint on–the–fly for the distributed query and present the user with the metadata of the virtual data federation thus generated. The user can then pose a query against this virtual federation as if querying a single TAP service, and, when coupled with the MyDB-like personal database of T3.3, it enables users to create sophisticated sets of cross–catalogue queries, as required for the full exploitation of Gaia data. The key point here is that a data resource can be incorporated into a virtual federation without requiring any action on the part of the staff of the data centre that curates it; so, in the case of Gaia, it is possible for higher level services like these to be developed and deployed without requiring any action from (or placing any obligations on) the staff at ESAC.

A basic prototype of this system has been produced by UEDIN, but it needs further development in several related regards before it is capable of supporting the scientific exploitation of Gaia. Firstly, the efficiency with which the system can execute a distributed query over the virtual federation constructed by TAP Factory depends on the metadata available to OGSA-DAI’s Distributed Query Processor (DQP) for the purposes of constructing a good query execution plan. For example, if DQP knows the distribution of values of the attributes used in join clauses in the distributed query, it can make an informed decision about how best to move data in executing the query, and whether to perform any server-side pre-processing before doing so. Taking full advantage of these capabilities will require an extension to the TAP standard, to expand the range of metadata exposed by a TAP service, and this can be best progressed through the IVOA standardisation process by the demonstration of powerful prototypes performing realistic science analyses.

The efficiency of the distributed queries can be improved further by collaboration between data centres. A naive spatial cross-match query executed between distributed multi–TB data sets will remain expensive, given network speeds, but several strategies exist that can ameliorate this situation, and this work package will assess, through quantitative analysis – and, where possible, direct experimentation – the optimal configuration of the multiwavelength datasets required for the scientific exploitation of Gaia. For example, it will determine which external catalogues should be co-located with a copy of the Gaia archive, for which of them “cross-neighbour” tables should be precomputed to facilitate queries between data sets that remain geographically separated, and for which cross-matches can be performed on-the-fly with sufficient speed.

The work will be undertaken by Read and Voutsinas of UEDIN.

T3.5 - Cloud-based research and data mining environments [Months: 1-42]


Research environments such as that provided by CADC with CANFAR (http://canfar.phys.uvic.ca/) represent state-of-the-art solutions to the large and growing range of research and data mining demands being placed upon astronomical archives. CANFAR offers scientists a rich, yet bounded, environment based on virtual machines (VMs), within which a scientist can deploy the software they need for their individual research and have it run in a manner that does not risk the stability of the archive or the research of other scientists. VM images can be created and stored by individual scientists or research consortia, and deployed when, and in the numbers, necessary for the job at hand, so that the available data analysis hardware can be employed effectively, but with the flexibility needed to match the differing needs of multiple user groups.

As archives increase in size and complexity, data analysis will shift to the data centre, and the CANFAR initiative is showing how this can work in practice. Of particular relevance to this project is the recent work (https://sites.google.com/site/nickballastronomer/research/canfar_skytree) deploying the Skytree scalable data mining software within the CANFAR cloud, which has demonstrated how the provision of such a virtualized environment within a data centre can support the large-scale data mining analyses envisaged for Gaia by WP4. CANFAR is the pioneer in this domain, but further R&D work is needed to shape a system that will be suitable for Gaia: e.g. further integration with VO protocols (see T3.3 above), and creation of a more sophisticated packaging system for deployable software.

The work of T3.5 will centre on prototyping the deployment, configuration and enhancement of a virtualized data analysis environment for Gaia. Starting with the existing CANFAR system, it will identify best practice and requirements for further development, some of which can be prototyped within T3.5. Comparison with other solutions for Gaia analysis within the data centre will be undertaken and conclusions reported.

This work will be undertaken by Read at UEDIN.


  • Manager: N. Hambly (UEDIN)
  • Partners:
    • UEDIN: Mike Read, Stelios Voutsinas, Mark Holliman, Dave Morris
    • CSIC: Enrique Solano, Luis Sarro
    • CNRS: Jérôme Berthier
    • INAF: Riccardo Smareglia, Marco Molinaro



The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7-SPACE-2013-1) under grant agreement n°606740.

Topic revision: r14 - 2016-01-26 - NigelHambly