Edit Raw edit Attach

Practical Work with GUMS and GOG

Virtual Machine Setup

For the demonstration, a virtual machine (VM) has been created and pre-configured. The following steps must be followed for installing the VM on your own laptops:

  • Download Virtual Box from http://www.virtualbox.org/wiki/Downloads (choose the proper installer depending on your laptop's platform).

  • Get the VM image from ftp://ssh.esac.esa.int/pub/dtapiador/vm/latest. Download the entire 'greatworkshop' folder to where Virtual Box VMs are placed (in a folder named 'VirtualBox VMs' under 'Documents and Settings\<user>' in Windows, '/Users/<user> /VirtualBox VMs' in MAC and a similar place for Linux platform).

  • Start Virtual Box and add the VM (Machine -> Add) configuring it properly (the type of OS is Linux 2.6 - 32 bits). Once added, in Settings -> Network menu, check that Adapter 1 is enabled and attached to NAT (with cable connected). If the network is properly set up we will have connectivity to the outside world from the VM.

  • Boot the VM (Show/Start button) and log in with user 'hadoop' and password 'vm4hadoop'. The mouse and keyboard should be captured when the VM window is active (there might be cases where you have to click on the VM window contents so that it is captured, in those cases, the right CTRL will get you out of the VM).
  • Open a terminal and run:

  • [Optional] Open Eclipse and have a look at the examples available in the package 'gaia.cu9.great.test.example' for GUMS and those contained in the package 'gaia.cu9.mock.test' for GOG. They implement histograms for counting elements (CountElemHistBuilder), creating a Healpix map (HealpixMapHistBuilder and AstroHealpixMapHistBuilder for GUMS and GOG respectively), a theoretical Hertzsprung-Russell histogram (TheoreticalHRHistBuilder), etc.

  • Run each example with the script:

run_histogram.sh /opt/hadoop/tests/props/<propsfilechosen>.properties
  • Check the results of the test by running:

hadoop dfs -ls /output

hadoop dfs -cat /output/<propsfilechosen>/part-r-00000 | more
  • The results can be copied to the local filesystem of the VM by running:

hadoop dfs -copyToLocal /output/<propsfilechosen>/part-r-00000 $HOME/<propsfilechosen>.out

Setting up a shared folder with the host machine

On the Virtual Box user interface:

  • On the menu at the top of the virtual machine choose devices -> shared folders. (ignore messages about not having access to the USB device)

  • On the right there is a little + folder icon to add a new folder -- give it a name for the vm like 'vmdata'.
  • Create a folder or use one already available on your machine.

Back in the VM you must mount this:

  • Open a terminal (right click on desktop in the vm).
  • You need to be root for this... type su enter (password is 'vm4great').

  • Assuming you called your folder vmdata type the following sequence of commands:

mkdir vmdata

chown hadoop vmdata

mount -t vboxsf vmdata -o uid=hadoop,gid=hadoop vmdata/

  • Now you should have ~/vmdata available.

Histogram Generation Framework

As stated in the presentation, for the workshop exercises we will be using the new BIG DATA processing paradigm known as MapReduce (MR), but we will not be accessing it directly. We will use a framework written in Java that will help us define and run the histograms required for the different exercises proposed.

The framework design is shown in the UML class diagram (no need to know either the notation or how it works, a quick scan of the classes names will give us some information of the things we can do).

Some very basic information of the framework (see the example below for a wider understanding of the different entities):

  • A Histogram comprises several things, namely: the categories or dimensions it is composed of, the value builder (for count histograms it will always return a '1'), the aggregator which will gather the different values generated by the value builder and will compute the single aggregated value per combination of categories (for count histograms it will just sum the values) and an optional filter for filtering out elements not interested for our particular case (generate a histogram for sources at a certain distance, etc).
  • A category represents one dimension of the histogram. It can hold discrete bins (values) such as the healpix pixel, the spectral type of a source, etc. and interval bins for grouping a range of values into a single interval (sources whose magG ranges from 8.0 to 10.0).
  • The value builder defines what we want to get as our output. It may be just a count of elements falling into each combination of categories (CountLongValBuilder) but it might also be another value (some magnitude already present in the source, or the result of a complex calculation from the source information).

  • The aggregator just defines how to gather the different values built into a single one (when counting elements for each combination of categories, it will just return the sum of all values). It may also compute the average/min/max of a certain magnitude, etc.
  • There is a helper (FieldReader) that can be used for accessing a determined field of an object, by passing the attribute name. For instance, if we want to use the result of the method getSpectralType(), we will set up 'spectralType' as the field to access by the FieldReader. This is convenient in order to avoid the creation of a class (for every field we are interested in) that basically returns the attribute required. Nested attributes (astrometry -> alpha) can be also accessed by field readers when configured recursively. Have a look at java code in file SpecTypeMagnitudeHistBuilder.java for a practical example of recursive field readers.

  • The histograms to be computed by the job (there may be as many as we want in a single job provided the data type returned by the value builder is the same) are defined in the class implementing HistogramBuilder (or extending HistogramBuilderHelper which contains some helpers for setting up the histograms). This concrete class will be the one to set in the properties file that defines the MR job (see the example below).

As a side comment, the framework has been developed using the powerful java 'generics', which allows us to deal with any type of object both as input and as output of the framework. Parameter types tagged as 'E' refer to the input object we will be dealing with (for us it will normally be 'UMStellarSource'). Parameter type 'V' refers to the output type being generated by the histogram (normally LongWritable if we generate count histograms). It is important to remark that due to some Hadoop constraints, java types cannot be used directly for the output types, thus a wrapper is needed (LongWritable for Long, DoubleWritable for Double, etc). Just ask us if you do not know the counterpart type for your concrete example output or how to use it.

Theoretical Hertzsprung-Russell histogram example

In this section, we will go through the steps needed for computing a theoretical Hertzsprung-Russell (HR) histogram with the framework. The starting point is the graph we want to build:


From the graph we can see:

  • We want to build one histogram only. In case we needed more histograms, we would just have to go through this example steps for the rest of histograms, making sure that the output type of the data returned by the different value builders is the same (LongWritable, DoubleWritable, etc).

  • There are two categories:
    • An interval category represented by the source field 'mbol' (absolute bolometric magnitude, 'getMbol()' method). Due to the fact that we need the field value itself (no computations to be done for calculating the value of the category for each source), a field reader (helper functionality) is used. The code for doing so is shown below:

        // Absolute bolometric magnitude category         Category<UMStellarSource, ?> bolLuminosityCat =                         new FieldIntervalCategory<UMStellarSource, Double>(                                         buildRanges(bolLumRanges, false),                                         // Get field mbol                                         new FieldReaderImpl("mbol"));
  • For the other category, there is some calculation to be performed and thus we cannot use a field reader. We then have to create another class that returns the logarithm in base 10 of the source field 'teff' (effective temperature) for each source. The code of this external class is below:

/**  * This class defines an interval category used to compute the logarithm  * in base 10 of the effective temperature, for the theoretical  * Hertzsprung-Russell diagram.  *   * @author Daniel Tapiador - ESA/ESAC - Madrid, Spain.  */ public class EffectiveTemperatureCat extends IntervalCategoryImpl<UMStellarSource, Double> {                  /**          * Constructor.          *           * @param intervals          */         public EffectiveTemperatureCat(List<IntervalBin<Double>> intervals) {                 super(intervals);         }                  /**          * Compute the log10 of the effective temperature for the source.          *           * @param element The source to get the logarithm in base 10 of the          * effective temperature from.          * @return The log10(effective temperature) of the source.          */         @Override         public Double getField(UMStellarSource element) {                 return Math.log10(element.getTeff());         } }
  • The value builder to use will just produce a '1' value for each source being analysed. Thus the helper CountLongValBuilder is the one chosen.

  • The aggregator that we will use is the already provided CountLongAggregator, as we want to get the count of sources falling down into every combination of the categories. Obviously, as the two categories do not contain discrete values, we will have to define the ranges for them later on when overriding HistogramBuilderHelper (class that defines the histograms to compute in the job).

  • We want to compute all sources contained in the dataset so no need to create a filter.

Taking into account the previous rationale, the code to be inserted into the class overriding HistogramBuilderHelper would look like this:

        /**          * @see esac.archive.gaia.dl.histogram.impl.          * HistogramBuilderHelper#getHistograms()          */         @Override         public List<Histogram<UMStellarSource, LongWritable>> getHistograms() {                  // List holding the histograms to compute                 List<Histogram<UMStellarSource, LongWritable>> histograms =                                 new ArrayList<Histogram<UMStellarSource, LongWritable>>();                  /*                  * Build the theoretical Hertzsprung-Russell categories.                  */                 List<Category<UMStellarSource, ?>> categories =                                 new ArrayList<Category<UMStellarSource, ?>>();                  // Effective Temperature ranges                 List<Double> teffRanges = new ArrayList<Double>();                 for (double i=2.5; i<=5.0; i+=0.0025) {                         teffRanges.add(Double.valueOf(oneDecimal.format(i)));                 }                  // Effective Temperature category                 Category<UMStellarSource, ?> teffCat =                                 new EffectiveTemperatureCat(buildRanges(teffRanges, false));                  categories.add(teffCat);                                  // Absolute bolometric magnitude ranges                 List<Double> bolLumRanges = new ArrayList<Double>();                 for (double i=-10; i<=25; i+=0.025) {                         bolLumRanges.add(Double.valueOf(oneDecimal.format(i)));                 }                                  // Absolute bolometric magnitude category                 Category<UMStellarSource, ?> bolLuminosityCat =                                 new FieldIntervalCategory<UMStellarSource, Double>(                                                 buildRanges(bolLumRanges, false),                                                 // Get field mbol                                                 new FieldReaderImpl("mbol"));                  categories.add(bolLuminosityCat);                  /*                  * Build the theoretical Hertzsprung-Russell histogram.                  */                 Histogram<UMStellarSource, LongWritable> hrHist =                                 new HistogramImpl<UMStellarSource, LongWritable>(                                                 "TheoreticalHR", categories,                                                 new CountLongAggregator(), null,                                                 new CountLongValBuilder<UMStellarSource>(), "/");                  histograms.add(hrHist);                  return histograms;         }

It is important to remark that the usage of the helpers supplied within the framework is not mandatory. Any interface/class (categories, value builders, aggregators, etc) can be externally implemented/overridden to cope with the concrete needs of the exercises. Helpers are just convenient implementations in order to save time.

Once the source code is ready, we can proceed with the definition of the job to be run on Hadoop. We do this by setting several properties in a typical java properties file (find some examples at /opt/hadoop/tests/props). The contents of the configuration file created for the theoretical HR job are:

###################################### # Properties for Histogram generation ######################################  esac.archive.gaia.dl.histogram.mapreduce.HistogramJob.inputPaths=/datasets/GUMS10-Sample esac.archive.gaia.dl.histogram.mapreduce.HistogramJob.outputPath=/output/theoreticalhr esac.archive.gaia.dl.histogram.mapreduce.HistogramJob.jobName=TheoreticalHRHistogram esac.archive.gaia.dl.histogram.HistogramBuilder.implementer=gaia.cu9.great.test.example.TheoreticalHRHistBuilder esac.archive.gaia.dl.histogram.mapreduce.HistogramJob.outputValue=org.apache.hadoop.io.LongWritable esac.archive.gaia.dl.histogram.mapreduce.HistogramJob.InputFormatClass=esac.archive.gaia.dl.histogram.mapreduce.gbin.GbinInputFormat

The last step is to run the job. This can be done by following the instructions in the VM setup section (using the configuration file of this example at /opt/hadoop/tests/props/theoreticalhr.properties).

** IMPORTANT **: For the exercise proposed to each group, there is already a class overriding HistogramBuilderHelper (for defining the histograms to compute in the job) named ExerciseHistBuilder, along with the associated configuration file (/opt/hadoop/tests/props/exercise.properties). The suggestion is that you use that class and embed your code where appropriate (generating the classes for the categories if needed, etc) and then run the job with the /opt/hadoop/tests/props/exercise.properties configuration file.

After the job has been run on hadoop, we can generate the graph by following the steps below:

  • Copy the results to a local directory ($HOME is fine) by running:

hadoop dfs -copyToLocal /output/theoreticalhr/part-r-00000 $HOME/theoreticalhr.out
  • Create the plot by using 'plot2DHistogramFromHadoopOutput.py' script:

plot2DHistogramFromHadoopOutput.py -g --title "Theoretical Hertzsprung-Russell Diagram" --xLabel "Log10 (Effective Temperature)" --yLabel "Absolute Bolometric Magnitude" --flipX --flipY $HOME/theoreticalhr.out
  • It will create a png (-g option supplied) located in the working directory named '2dhistogram.png' that we can visualize with your favorite tool:

gimp 2dhistogram.png

Healpix Map histogram example

The starting point is again the graph we want to generate:


The steps are basically the same as in the example above. The dataset to be used for this example is the biggest one available in the VM (subsetdiscrete) as discrete histograms are not so CPU intensive, thus they can be run over a larger dataset on the local laptop (in less time). For plotting, there is a helper python script that parses the output of the Hadoop job:

hadoopOutputOnSky.py $HOME/healpixmap-subsetdiscrete.out 256 -g --clabel 'Log10 (Counts)' --title 'Star Counts'

Running on Amazon Elastic Map Reduce

As you may already have found out, GUMS10 dataset (indeed any dataset generated in the Gaia mission) is quite big for either downloading entirely or processing locally on the laptop so we need to upload our software to the data centre or cloud infrastructure for performing the analysis/processing.

As already discussed, we will be using Amazon Elastic Map Reduce (EMR) for running our data processing exercises over the whole GUMS10 catalog, which has already been uploaded to Amazon S3. Then, we need to configure the VM before any of the commands written in this page can be executed there. If your exercise is running well on the local datasets and you want to give it a try over the whole GUMS10 catalog, ask any of the tutors.

Start cluster at Amazon EMR

There is a script in the VM that starts a cluster in Amazon EMR. We have to specify the number of worker nodes (4, 8 or 16) and the text identifier that will be shown in the console (this is just a textual id, not the job flow id). For instance:

start_cluster_amazon.sh 8 Exercise1

This script will boot a cluster with the specified number of worker nodes (plus one master instance) in the Amazon cloud and it will copy the entire GUMS10 dataset locally on the cluster. Although the copy of the data to the newly initiated cluster may take some time, we can start adding job flows (histogram generation jobs) whenever we wish, and they will be executed once the cluster is ready.

Job submission

For submitting jobs (adding job flows) to the Amazon EMR cluster, we need to use the job flow id (i.e. j-2OHOASLFJ9RPP) generated when 'start_cluster_amazon.sh' script is run:

run_histogram_amazon.sh /opt/hadoop/tests/props/healpixmap-amazon.properties j-2OHOASLFJ9RPP

Once the job has been submitted, we can check the cluster status by running (it will show the status of all steps added including the initial step for copying the data from s3):

list_cluster_status.sh j-2OHOASLFJ9RPP

Stop the cluster

It is extremely important to shutdown the cluster we have used once the job has finished (we can submit as many jobs as we want to the cluster and they will be run in the order they were launched). This way we release all computing resources of the cluster for which we are being charged. To do so:

stop_cluster_amazon.sh j-2OHOASLFJ9RPP

Collect results

Just after the job has finished (and we have shutdown the cluster), we proceed with the download of results, the grouping into a single file and the graph generation (as in the examples run locally). To download and collect the results we must run:

collect_results_amazon.sh /opt/hadoop/tests/props/healpixmap-amazon.properties $HOME/healpixmap-amazon.out

Then, $HOME/healpixmap-amazon.out will be created with the results of the job. We can then proceed with the graph generation with the pertinent tool (in the same way as for local results). This is the graph generated for the example run above (healpix map at NSIDE 256 for the entire GUMS10 catalog, generated in 2 hours and 15 minutes with a cluster of 8 worker nodes):


The theoretical HR diagram is shown below (just for reference). It has been generated from the whole GUMS10 catalog at Amazon EMR and it has taken 21 hours to process in a 16 worker nodes cluster (mainly due to the large amount of bins used in the histogram categories which has increased a lot the CPU time).


Raw View |  Raw edit |  Edit Attach More topic actions...
Topic revision: r1 - 2012-02-22 - DanielTapiador
This site is powered by the TWiki collaboration platform Powered by PerlCopyright © 2008-2021 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback