DESCRIPTION

Users must be able to execute data mining tasks and machine learning algorithms on the distributed computing infrastructure that the Gaia archive will provide. Data mining queries can be complex, often forming a complete pipeline rather than a single operation, so the interfaces must allow users to define these complex queries, test them, and finally submit them to the production cluster. We propose two execution environments: a test environment with fewer security restrictions and access to a subset of the archive, for trying out algorithms and pipelines, and a production environment, to which a fully defined job can be submitted.

INTERFACES

Console

Data mining tasks generally require good knowledge of the data being queried and fine-tuning of algorithms and processes through trial-and-error learning, so we consider a console where these operations can be performed interactively a strong requirement. Spark provides Scala and Python consoles, but other alternatives, such as R, can be considered. The console should be available in the test environment, with access to a subset of the archive.

Direct execution environment

This environment will allow users to upload their own implementations (i.e. compiled jar files) and submit them directly to the cluster, similar to a job submission script in an HPC environment. The environment should be set up to let users work with the latest MLlib libraries and any other advanced dependencies they need. Security policies have to be defined and applied to a job before it is submitted.

Web interface (deprecated)

Once a task is completely defined (tested and verified, such as a trained model in the test environment), it can be configured through a web interface. The main features of this interface should be:
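The pipeline idea described above (compose several mining operations, try them interactively on a small subset, then submit the whole chain) can be sketched in plain Python. This is an illustrative sketch only: the `Pipeline` class and the step functions are assumptions for the example, not part of the Gaia archive or Spark APIs.

```python
from typing import Callable, Iterable, List

# Illustrative sketch: a pipeline is an ordered chain of
# data-transformation steps that can be tried interactively on a
# small subset before the full job is submitted to the cluster.
class Pipeline:
    def __init__(self) -> None:
        self.steps: List[Callable[[Iterable], Iterable]] = []

    def add(self, step: Callable[[Iterable], Iterable]) -> "Pipeline":
        self.steps.append(step)
        return self

    def run(self, data: Iterable) -> list:
        out = list(data)
        for step in self.steps:
            out = list(step(out))
        return out

# Hypothetical mining steps: filter rows by magnitude, then normalise.
pipeline = (
    Pipeline()
    .add(lambda rows: (r for r in rows if r["mag"] < 15.0))
    .add(lambda rows: ({**r, "mag": r["mag"] / 15.0} for r in rows))
)

# Trial run on a small in-memory subset, as a user would do in the
# test environment before submitting the pipeline to production.
subset = [{"mag": 12.0}, {"mag": 18.0}, {"mag": 9.0}]
print(pipeline.run(subset))  # two rows survive the magnitude cut
```

The same chain-of-steps structure maps naturally onto a Spark job once tested, which is why the test and production environments can share a single pipeline definition.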
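For the direct execution environment, submitting a user-compiled jar could resemble Spark's standard `spark-submit` tool. The invocation below is only a plausible example: the master URL, main class, and jar name are placeholders, and the actual submission mechanism for the Gaia archive is yet to be defined.

```shell
# Hypothetical submission of a user-compiled jar to the production
# cluster; the master URL, class name and jar file are placeholders.
spark-submit \
  --master spark://cluster.example.org:7077 \
  --deploy-mode cluster \
  --class org.example.mining.MyJob \
  my-mining-job.jar
```

Security policies would be checked against the job at this point, before it is accepted by the cluster.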