DM Infrastructure forum
Data Mining computations infrastructure
- Where will the computations take place? At ESAC? Will there be a (CPU) Time Allocation Committee with proposals? Otherwise, who is entitled to submit what? Is the user allowed to contribute resources? How? Should the software be developed for execution in infrastructures other than ESAC? Amazon-like platforms?
[Tapiador] In my opinion the software should be executable in other infrastructures (otherwise we'd be doing it wrong, i.e. too specific), though it should be thoroughly optimized/integrated for ESAC.
- What will be the technology adopted for the distribution of computations? Hadoop+MapReduce? MPI? Grid? Any other alternatives?
[Apache Mahout]: A Java-based scalable machine learning library that uses the map/reduce paradigm by means of Hadoop.
[RHIPE]: A merger of R and Hadoop for deep analysis of complex big data.
[Spark]: An open source cluster computing system that aims to make data analytics fast, both to run and to write. To run programs faster, Spark provides primitives for in-memory cluster computing: a job can load data into memory and query it repeatedly much more quickly than with disk-based systems like Hadoop MapReduce. To make programming faster, Spark integrates into the Scala programming language, letting you manipulate distributed datasets like local collections. Spark can also be used interactively to query big data from the Scala interpreter. It provides bindings for Scala, Java and Python, and its core ecosystem plans to support machine learning (MLlib), graph processing (GraphX), interactive querying (Shark), etc., which may be useful.
[Twister]: A lightweight MapReduce runtime. According to its authors, the MapReduce programming model has simplified the implementation of many data-parallel applications, and years of experience applying it to scientific applications led them to identify a set of extensions to the programming model and improvements to its architecture that widen the classes of applications it can serve. Twister incorporates these enhancements.
[HaLoop]: Extends MapReduce with programming support for iterative applications and improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms.
[Ricardo]: Integration of R and Hadoop, including Jaql.
[JPPF]: Java Parallel Processing Framework, another alternative. It is suitable for parallelizing algorithms and executing them on a Grid.
[Tapiador] I think we can discard MPI and Grid upfront. The Grid computing model fits data reduction well, where we apply the same software to different data (observations, etc.) independently (Single Program Multiple Data architecture - SPMD). However, when it comes to data mining and machine learning algorithms (where data need to be shared and arranged in different ways for different iterations), Grid computing does not fit well, as each task is isolated from the rest. That does not mean it can't be done, though it will certainly be much more difficult. MPI allows us to exchange data between tasks, but it leaves the complexity of managing a distributed system (something extremely difficult for complex algorithms and large-scale systems) to the developer. Recent developments (Hadoop, Spark, Tez, Storm, Akka, etc.) clearly separate these concerns, making it easier (faster and less error-prone) to build scalable software.
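To make the point about iterations concrete, here is a minimal single-process sketch of k-means written as a map step and a reduce step. It is only an illustration: the function names (assign, recompute, kmeans) are hypothetical, and in a real deployment a framework such as Hadoop or Spark would distribute the map calls and perform the grouping. The thing to notice is that every iteration needs the centroids produced by the previous one, which is exactly the kind of shared state that isolated Grid tasks do not handle well.

```python
from collections import defaultdict


def assign(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances)), point


def recompute(points):
    """Reduce step: average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds (no convergence test)."""
    for _ in range(iterations):
        groups = defaultdict(list)
        for point in points:                      # map phase
            key, value = assign(point, centroids)
            groups[key].append(value)             # stands in for the shuffle
        # reduce phase; a centroid with no points keeps its old position
        centroids = [
            recompute(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))
        ]
    return centroids


# Toy data, made up for illustration only.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
result = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The loop in kmeans is the part a plain MapReduce job cannot express on its own, which is why tools like Spark, Twister and HaLoop add support for iterative workloads.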
- What will be the language adopted for the development? Java is compulsory for CU1-CU8, but it is not so for CU9.
[Fustes] I recommend Java since we can reuse code from CU1-CU8. Additionally, MapReduce systems like Hadoop are best used with Java implementations. Python might be a good choice too because of its simplicity, but take into account that it is much slower and not well supported by MapReduce.
[Tapiador] There are more and more tools/frameworks that support Python like Dumbo for instance, and some distributed data processing tools aimed at data mining and machine learning like Spark provide bindings for Python, Java and Scala (which runs on the JVM). I strongly agree though with the fact that we'll be reusing a lot of software from CU1-8 and thus we'll need to stick to languages that can run on the JVM for that reason, but there will surely be some users/developers that feel more comfortable with other languages like Python (at least for exploratory work).
- Predictive Model Markup Language ([PMML])
Are commercial plugins a possibility within DPAC? If so, consider [ZEMENTIS] or [RevolutionAnalytics].
Interfaces
Interaction with Database: Definition of requirements
The user should be able to select datasets for data mining analysis. Hive seems natural for such a purpose, but it does not support ADQL. Perhaps someone might be interested in extending HiveQL to support ADQL-like queries... One question to bear in mind is whether the users' datasets will be stored temporarily, permanently or not stored at all. The third case might be possible by integrating the data selection with the data mining algorithms; that is, the selection would be performed in each map phase of the data mining algorithm, which is only feasible for very simple queries (select some fields and filter some rows). Shall we assume that the data model of the HDFS system will be the same as the MDB one? Which tables do we need? Which kinds of joins will be useful? Should we integrate some tables if they will be frequently joined?
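As a rough illustration of the "not stored at all" option, the sketch below wraps an algorithm's map function so that the row filter and the field projection run inside the map phase, without materialising an intermediate dataset. Everything here is an assumption made for the example: the helper name make_selecting_mapper, the column names (source_id, ra, dec, g_mag) and the toy records do not come from the MDB data model.

```python
def make_selecting_mapper(fields, predicate, mapper):
    """Wrap `mapper` so selection happens inside the map phase.

    fields    -- column names to keep (projection)
    predicate -- function(record) -> bool (row filter)
    mapper    -- the algorithm's own map function, yielding (key, value)
    """
    def selecting_mapper(record):
        if not predicate(record):
            return                                   # row filtered out
        projected = {name: record[name] for name in fields}
        yield from mapper(projected)
    return selecting_mapper


def count_by_magnitude_bin(record):
    """Example map function: emit (integer magnitude bin, 1)."""
    yield int(record["g_mag"]), 1


# Hypothetical records; column names are placeholders.
records = [
    {"source_id": 1, "ra": 10.0, "dec": -5.0, "g_mag": 14.2},
    {"source_id": 2, "ra": 11.0, "dec": -6.0, "g_mag": 19.7},
    {"source_id": 3, "ra": 12.0, "dec": -7.0, "g_mag": 14.9},
]

mapper = make_selecting_mapper(
    fields=["g_mag"],
    predicate=lambda r: r["g_mag"] < 18.0,
    mapper=count_by_magnitude_bin,
)
pairs = [pair for record in records for pair in mapper(record)]
```

This only covers per-row filters and projections. Joins or aggregations in the selection would need their own pass over the data, which is why the approach is limited to very simple queries.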
[Tapiador] There is a framework from ESRI that could help when extending HiveQL to support ADQL. It can be found at http://esri.github.io/gis-tools-for-hadoop/
In my opinion, there should be a 'data lake' (with quotas and so on) where users could leave their intermediate results. This could be as simple as an HDFS deployment (many, if not all, big data frameworks/tools can read from and write to it).
Interaction with Visualisation: Definition of requirements
The users should be able to visualize the existing tables (or files), prepare a query to select the desired data, form a dataset and run a data mining algorithm. An estimate of the time required to finish the computation should be given. A run might be rejected if the required computation is too expensive.
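A possible shape for the estimate-and-reject step is sketched below. The cost model is a deliberately naive assumption (time grows linearly with rows and iterations and shrinks linearly with workers), and all names and numbers are hypothetical; a real estimator would have to be calibrated per algorithm on the actual cluster.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    seconds: float   # predicted wall-clock time
    accepted: bool   # False if the run exceeds the allowed budget


def estimate_run(n_rows, iterations, rows_per_second_per_worker,
                 n_workers, max_seconds):
    """Naive linear estimate of run time, checked against a time budget."""
    throughput = rows_per_second_per_worker * n_workers
    seconds = n_rows * iterations / throughput
    return Estimate(seconds=seconds, accepted=seconds <= max_seconds)


# Made-up figures, for illustration only.
within_budget = estimate_run(1_000_000, 10, 1000, 10, max_seconds=3600)
over_budget = estimate_run(1_000_000, 10, 1000, 10, max_seconds=600)
```

The interface would show the predicted time before submission and refuse the run when accepted is False.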
The users should be able to manage their models and visualize them with convenient screens. Some algorithms like unsupervised classification can help to filter the data, so the users should be able to select a subset and send it to VO tools via SAMP.
--
Cesc Julbe - 2014-12-04