R Programming Language

This webpage gives a short introduction to the R programming language as it is installed on the HPC systems at LRZ.

Contents:

  1. Introduction to R

  2. High-performance and Parallel Programming with R
    R is known as a very powerful language for statistics, but it has also evolved into a tool for the analysis and visualisation of large data sets which are typically obtained from supercomputing applications.
    The course teaches the use of the dynamic language R for parallel programming of supercomputers and features rapid prototyping of simple simulations. Several parallel programming models including Rmpi, snow, multicore, and gputools are presented which exploit the multiple processors that are standard on modern supercomputer architectures.

  3. Data Mining of XML Data Sets using R
    Large data sets are often stored on the web in XML format. Automated access and analysis of these data sets presents a non-trivial task. The dynamic language R is known as a very powerful language for statistics, but it has also evolved into a tool for the analysis and visualisation of large data sets. In this course R will be used to perform and automate these tasks and visualise the results interactively and on the web.

  4. Scientific High-Quality Visualisation with R
    R is powerful software that can easily be used for high-quality visualisation of large datasets of various modalities, such as particle data or continuous volume data. The available plots range from line plots, contour plots and surface plots up to interactive 3D OpenGL plotting.

Preliminaries

R is a dynamic language for numerical computing and graphics with a strong affinity to statistics. R is available as Free Software under the terms of the GNU General Public License (GPL). It compiles and runs on a wide variety of UNIX and Linux platforms, Microsoft Windows and macOS. R is a fully featured programming language and much of the system itself is written in R. Advanced users can link and call C, C++ and Fortran code at run time. R has its roots in statistics, but its extensibility, ease of use and powerful graphics make it ideal for users looking for a fast, easy and robust environment for data analysis and numerics. R can easily be extended with more than 1,700 additional packages available through the Internet, which can be installed with the command install.packages("name") and then loaded with the function library(name).
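
The install-then-load workflow can be sketched as follows. The helper name ensure_package is hypothetical (it is not part of R), and the sketch assumes a reasonably recent R version and a reachable package repository whenever a package is actually missing.

```r
# Hypothetical helper (not part of R): install a package only if it is
# missing, then attach it to the search path.
ensure_package <- function(name) {
  if (!requireNamespace(name, quietly = TRUE)) {
    install.packages(name)               # downloads from the configured repository
  }
  library(name, character.only = TRUE)   # character.only: 'name' holds a string
  invisible(TRUE)
}

# 'stats' ships with R, so nothing is downloaded in this call.
ensure_package("stats")
```

On a shared cluster installation, install.packages usually writes to a personal library in the user's home directory, since the system library is typically read-only.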

R is a statistics package and was developed as a free successor to the S and Splus languages. It is probably a bit harder to learn than other statistics tools, but once you are used to the functional programming approach of R it gives you great flexibility. You can accomplish complex tasks with very few commands and produce publication-quality hardcopy output. It also allows you to add functionality and automate processes. R is available on all the most common platforms. At LRZ several different versions are installed.

An important remark about the equal sign in R: at the top level x <- 1 is equivalent to x = 1, but inside a function call the equal sign names an argument instead of assigning a variable, so it is the form to use wherever optional parameters are passed. Use the arrow form for assignments and explicit calculations and the equal sign form for passing named arguments.
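
The difference can be seen in a small sketch. The function half and its arguments are made up for illustration only.

```r
x <- 1        # arrow form: assignment
y = 1         # at top level '=' assigns as well
stopifnot(identical(x, y))

# Hypothetical function with one optional parameter.
half <- function(value, digits = 2) round(value / 2, digits)

# Inside a call, '=' names an argument; no variable 'digits' is created.
a <- half(3.14159, digits = 3)

# Inside a call, '<-' first assigns 'digits' in the calling environment and
# then passes the value by position. The result happens to be the same here
# only because 'digits' is the second parameter.
b <- half(3.14159, digits <- 3)

print(c(a, b))
print(exists("digits"))   # the second call left a variable behind
```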

Availability and starting R interactively

R is available on the LRZ-Linux cluster and on the HLRB-II. It can be used interactively or in batch mode. To use R interactively, log into one of the interactive nodes and type:

> module load R
for loading R
> R
for starting R.
To start a version other than the default version at LRZ, use the command
> module avail R
which prints out a list of all available versions. At the moment the following versions are available for the different architectures:
On all architectures the default version is R-2.8.1
For the x86_64 architecture (lx64ia2, rvs1, gvs1) also available are:
  • 2.5.1, 2.6.2_mpi, 2.8.0 compiled with gcc, 2.8.1 compiled with gcc -O3, 2.8.1 with MPI, 2.9.0 (for compatibility reasons)
  • 2.8.0 with linked OpenMP libraries
  • 2.9.1 compiled with mpicc. This version also features the Rmpi package, which allows users to write MPI programs using R.


For the ia64 architecture (hlrb2, lx64ia1) also available are the following versions:
  • 2.4.1, 2.5.1, 2.8.0, 2.6.2_mpi, 2.8.1 compiled with gcc -O3, 2.8.1 with MPI (for compatibility reasons)
  • 2.9.2 compiled with mpicc. This version also features the Rmpi package, which allows users to write MPI programs using R.


Please note that the compatibility versions will be removed sooner or later. We intend to provide only three versions:
  • Plain vanilla R
  • R compiled with mpicc
  • R linked to OpenMP and MKL


For example, to load the MPI version of R on the HLRB-II, issue the following command:
> module load R/2.9.2_mpi
In order to run R using MPI on 4 cores you have to copy a special .Rprofile to your current directory (please see details on separate page) and start R using the MPI environment (interactively):
> mpirun -np 4 R --no-save -q
R is then started on 4 cores and returns awaiting user input. Be aware that the MPI version of R does not use the readline library, so you will not be able to edit the command line as in the vanilla R environment. A possible workaround might be to run it in an Emacs shell (see more on separate page).



Batch jobs using R


For batch jobs you have to write a qsub script, which has to be customized for the respective batch scheduler. On the Linux cluster, an SGE qsub script for a serial job might look as follows:

#!/bin/bash
#$-N my_R_job
#$-S /bin/bash
#$-M user@home.de
#$-l march=x86_64
. /etc/profile
module load R
echo "
print(Sys.info())
print(version)
" | R --no-save -q > $HOME/my_R_job.$JOB_ID.out

MPI Job on x86_64


MPI job on HLRB2


Shared Memory Job on x86_64


By using the multicore package one can write parallel programs in R that use several cores on a shared-memory system. On the Linux cluster, shared-memory systems with up to 32 cores are available. Keep in mind, however, that the total size of an R array cannot be larger than 4 GB, due to the limited size of the array pointer. To submit a shared-memory R job using 8 cores, please add the following SGE control line to your job script:

#$-pe shm_8 8

For a 32-core job the line would be:

#$-pe shm_32 32
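
What such a shared-memory job might execute can be sketched as follows. This is an illustration under assumptions: the mclapply function of the multicore package is nowadays provided by the parallel package bundled with R, which is what the sketch loads; the workload function work is made up; and forking requires a Unix-like system.

```r
library(parallel)   # bundled with current R versions; provides mclapply()

# Toy workload (an assumption for illustration): sum of squares up to i.
work <- function(i) sum((1:i)^2)

# Fork 2 worker processes. In a job submitted with "-pe shm_8 8" one would
# typically set mc.cores = 8 to match the requested slots.
results <- mclapply(1:8, work, mc.cores = 2)

# The same computation done serially, for comparison.
serial <- lapply(1:8, work)
print(identical(results, serial))
```

Because mclapply forks the running R process, the workers see the parent's data without copying it explicitly, which is why this model suits a single shared-memory node.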

 

Shared Memory Job on HLRB2






Short example (reading data and visualisation)

In most cases you will have some data that you would like to read and analyse later on. The most straightforward way of loading data into R is to read it from a text file. The file 'measurements.txt' contains tab-separated data columns (performance measurements for the Itanium2 processor). It is also possible to read data in other formats or from a database; please refer to the R documentation for further information.

Having started R in the directory where your data file resides, you can read the file into a so-called data frame 'measurements' using the read.table function:

> measurements <- read.table("measurements.txt", header=TRUE)
You can inspect the contents of the data frame, and of any other data object, by simply entering its name at the R prompt:
> measurements
The measurements contain data for different sampling intervals, which are given in one column. Data often needs to be grouped by the contents of one column; this can be achieved very conveniently by using factors. In the following, a factor is created from the sampling interval 'stime':
> stimef <- factor(measurements$stime)
Then a boxplot containing a separate box for each sampling interval giving the variability of the measurements for that sampling interval can be created by:
> plot(measurements$FP_OPS_RETIRED / measurements$stime / 1.0E+9 ~ stimef,
    main="variability of 100 samples (5 min. sampling interval)",
    xlab="sampling length [s]", ylab="[GFlop/s]")
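
The grouping step can be tried without the original data file. The following sketch uses a small synthetic data frame in place of 'measurements.txt'; all values are made up for illustration and only the column names are taken from the example above.

```r
# Synthetic stand-in for 'measurements.txt' (values are made up).
set.seed(1)
measurements <- data.frame(
  stime          = rep(c(1, 10, 60), each = 5),
  FP_OPS_RETIRED = rep(c(1, 10, 60), each = 5) * runif(15, 0.8e9, 1.2e9)
)

stimef <- factor(measurements$stime)   # one level per sampling interval
gflops <- measurements$FP_OPS_RETIRED / measurements$stime / 1.0E+9

# Median rate per sampling interval; the boxplot draws these as centre lines.
medians <- tapply(gflops, stimef, median)
print(levels(stimef))
print(medians)
```

tapply applies a function to each group defined by the factor, which is the same grouping the formula interface of plot uses to draw one box per level.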
Plots can be prepared for hardcopy in a variety of formats (e.g. PostScript, PDF, PNG). To create a PNG file, first set up the graphics device:
> png("boxplot_GFlop.png", width=600, height=400)
Then issue the plot command(s) whose output should go to that file. Finally, close the png device so that the data is written and the output file is closed:
> dev.list()
X11 PNG
  2   3
> dev.off(3)

Now you should find a file 'boxplot_GFlop.png' in the current directory.
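
The open, plot, close sequence is the same for every file-based graphics device. The sketch below uses the pdf device and the ToothGrowth data set that ships with R, purely as assumptions for illustration, because pdf needs no display and the data set is always available; the output file name is hypothetical.

```r
outfile <- file.path(tempdir(), "boxplot_example.pdf")   # hypothetical file name

pdf(outfile, width = 6, height = 4)   # open the device; pdf() sizes are in inches
boxplot(len ~ dose, data = ToothGrowth,
        xlab = "dose", ylab = "len")
dev.off()                             # closes the current device and writes the file

print(file.exists(outfile))
```

Calling dev.off() without an argument closes the current device, so looking up the device number with dev.list() is only needed when several devices are open.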

The above is only a short example to give you a feeling for what R is like. You can find further information in the references given below.

References: The first address for further information is the homepage of the R-project. Another useful source of information might be the Wiki of the R-project.

If you have any questions, suggestions or would be interested in additional packages to be installed on the machines, please feel free to submit a trouble ticket.

R: MPI Extension

This R package allows you to create R programs that run cooperatively in parallel across multiple machines, or multiple CPUs on one machine, to accomplish a goal more quickly than a single program running on one machine.

How to use Rmpi

Rmpi is an R package that lets R programs run across multiple processors, possibly on multiple machines, using the MPI programming model. Please note that this page is not a tutorial on how to write Rmpi scripts, just a short example of how to run an Rmpi job at LRZ. As with any R job, you must load the R module before using Rmpi.

module load R/2.10.1ps


To run an Rmpi job you will need to create both a PBS script (to submit the job) and an Rmpi script. In the example below, we start an Rmpi process on four processors on each of two machines (8 processors and processes in total) and simply ask them to say "Hello".

The PBS script using sgi-mpt suitable for the HLRB-II looks like this:

#!/bin/bash
#PBS -N testRmpi
#PBS -l walltime=01:00:00
#PBS -l select=2:procs=4
. /etc/profile
module load R/2.10.1mpi
cd $PBS_O_WORKDIR
cd ~/testRmpi
cp $R_HOME/lib64/R/library/Rmpi/Rprofile ./.Rprofile
mpiexec -n 8 R CMD BATCH myjob_mpi.R

For the Sun Grid Engine and the ParaStation MPI environment the script looks very similar:

#!/bin/bash
#$-N testRmpi
#$-l march=x86_64
#$-l walltime=01:00:00
#$-l select=2:procs=4
. /etc/profile
module load R/2.10.1ps
cd ~/testRmpi
cp $R_HOME/lib64/R/library/Rmpi/Rprofile ./.Rprofile
mpiexec -n 8 R CMD BATCH myjob_mpi.R


Note that for Rmpi the use of mpiexec rather than mpirun is mandatory. The reasons are too lengthy to go into here, but they have to do with how R picks up the right processors/nodes to use. The script runs an Rmpi job saved in a file called "myjob_mpi.R". The R file looks like this:

# Tell all slaves to return a message identifying themselves
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
# Tell all slaves to close down, and exit the program
mpi.close.Rslaves()
mpi.quit()


The output of the job looks something like this:

R version 2.10.1 (2008-02-08)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0 
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to...
.
.
. 
master (rank 0, comm 1) of size 8 is running on: compB051
slave1 (rank 1, comm 1) of size 8 is running on: compB051
slave2 (rank 2, comm 1) of size 8 is running on: compB051
slave3 (rank 3, comm 1) of size 8 is running on: compB051
slave4 (rank 4, comm 1) of size 8 is running on: compB053
slave5 (rank 5, comm 1) of size 8 is running on: compB053
slave6 (rank 6, comm 1) of size 8 is running on: compB053
slave7 (rank 7, comm 1) of size 8 is running on: compB053
> # Tell all slaves to return a message identifying themselves
> mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
$slave1
[1] "I am 1 of 8"
$slave2
[1] "I am 2 of 8"
$slave3
[1] "I am 3 of 8"
$slave4
[1] "I am 4 of 8"
$slave5
[1] "I am 5 of 8"
$slave6
[1] "I am 6 of 8"
$slave7
[1] "I am 7 of 8"
>
> # Tell all slaves to close down, and exit the program
> mpi.close.Rslaves()
[1] 1
> mpi.quit()

Using the Intel Math Kernel Library (mkl) in R

The Intel Math Kernel Library is a specially optimized library from Intel which is installed on the HLRB-II and the Linux Cluster. There are bindings for R which accelerate linear algebra expressions substantially. In order to use this library you have to load a special R module called R/2.10.1mkl. Please note that at the moment MPI and MKL cannot be used together.

A simple use case is given by the following example:

> module load R/2.10.1mkl
> R
R version 2.10.1 (2008-02-08)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0 
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to...
> x <- matrix(runif(1000*1000),1000,1000)
> y <- solve(x)
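
To see whether the linked library makes a difference, the same operation can be timed and checked in any R session. This is a minimal sketch: the matrix size of 500 is an arbitrary choice to keep the run short, and the measured time depends entirely on the machine and on the BLAS/LAPACK library R is linked against.

```r
set.seed(42)
n <- 500                                  # smaller than the 1000 used above
x <- matrix(runif(n * n), n, n)

# Wall-clock time of the inversion, in seconds.
elapsed <- system.time(y <- solve(x))["elapsed"]

# x %*% y should be close to the identity matrix, whichever library is used.
max_err <- max(abs(x %*% y - diag(n)))
cat("elapsed:", elapsed, "s  max deviation from identity:", max_err, "\n")
```

Running the sketch once with the plain R module and once with the MKL module gives a simple, if rough, comparison of the two builds.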