R Programming Language

This webpage gives a short introduction to the R programming language as it is installed on the HPC systems at LRZ.

Contents:

  1. Introduction to R

  2. High-performance and Parallel Programming with R
    R is known as a very powerful language for statistics, but it has also evolved into a tool for the analysis and visualisation of large data sets which are typically obtained from supercomputing applications.
    The course teaches the use of the dynamic language R for parallel programming of supercomputers and features rapid prototyping of simple simulations. Several parallel programming models including Rmpi, snow, multicore, and gputools are presented which exploit the multiple processors that are standard on modern supercomputer architectures.

  3. Data Mining of XML Data Sets using R
    Large data sets are often stored on the web in XML format. Automated access and analysis of these data sets presents a non-trivial task. The dynamic language R is known as a very powerful language for statistics, but it has also evolved into a tool for the analysis and visualisation of large data sets. In this course R will be used to perform and automate these tasks and visualise the results interactively and on the web.

  4. Scientific High-Quality Visualisation with R
    R is a powerful software package that can also easily be used for high-quality visualisation of large datasets of various modalities such as particle data, continuous volume data, etc. The range of plots extends from line plots, contour plots and surface plots to interactive 3D OpenGL plotting.

Preliminaries

R is a dynamic language for numerical computing and graphics with a strong affinity to statistics. R is available as Free Software under the terms of the GNU General Public License (GPL). It compiles and runs on a wide variety of UNIX and Linux platforms, Microsoft Windows and MacOS. R is a fully featured programming language and much of the system itself is written in R. Advanced users can link and call C, C++ and Fortran code at run time. R has its roots in statistics, but its extensibility, ease of use and powerful graphics make it ideal for users looking for a fast, easy and robust environment for data analysis and numerics. R can easily be extended with more than 1,700 additional packages available through the Internet, which can be installed with the command install.packages("name") and then loaded with the function library(name).
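
For example, to install and then load the snow package mentioned above (purely as an illustration; any CRAN package name works the same way):

> install.packages("snow")   # download and install the package from a CRAN mirror
> library(snow)              # load the installed package into the current session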

R is a statistics package and was developed as a free successor to the S and S-PLUS languages. It is probably a bit harder to learn than other statistics tools, but once you are used to the functional programming approach of R it gives you great flexibility. You can accomplish complex tasks with just a few commands and produce publication-quality hardcopy output. It also allows you to add functionality and automate processes. R is available on all the most common platforms. At LRZ several different versions are installed.

An important remark about the equal sign in R: x <- 1 is equivalent to x = 1, but the equal sign should be used wherever optional (named) parameters are passed, as in function calls. Use the arrow form for assignments in your code and the equal sign form for arguments inside function calls.
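
A minimal illustration (the variable x and the trim argument of mean() are chosen arbitrarily for this example):

> x <- 1:10            # arrow form: creates the variable x in the workspace
> mean(x, trim = 0.1)  # equal sign: names the optional argument 'trim' inside the function call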

Availability and starting R interactively

R is available on the LRZ Linux cluster and on the HLRB-II. It can be used interactively or in batch mode. To use R interactively, log into one of the interactive nodes and type:

> module load R/serial

for loading R

> R

for starting R.
To start a version other than the default one at LRZ, first use the command

> module avail R

which prints out a list of all available versions; the desired version can then be loaded by giving its full module name. At the moment the following versions are available for the different architectures:
On all architectures the default version is R-2.13 (as of August 2013).
Also available are versions 2.10, 2.11, 2.12, 2.14, 2.15 and 3.0.


Please note that the compatibility versions will be removed sooner or later. We intend to provide only two versions:

  • Plain vanilla R
  • R compiled with mpicc


For example, to load the MPI version of R on SuperMUC, issue the following command:

> module load R/parallel/2.13

In order to run R with MPI on 4 cores, you have to copy a special .Rprofile to your current directory (please see the details on the separate page) and start R within the MPI environment (interactively):

> mpirun -np 4 R --no-save -q

R is then started on 4 cores and returns, awaiting user input. Be aware that MPI-R does not use the readline library, so you will not be able to edit the command line as you are used to in the vanilla R environment. A possible workaround is running MPI-R in an Emacs shell (see more on the separate page).


Short example (reading data and visualisation)

In most cases you will have some data that you would like to read and analyse later on. The most straightforward way of loading data into R is reading it from a text file. The file 'measurements.txt' contains tab-separated data columns (performance measurements for the Itanium2 processor). (It is also possible to read data in other formats or to read from a database; please refer to the R documentation for further information.)

Having started R in the directory where your data file resides, you can read the file into a so-called data frame 'measurements' by using the read.table function:

> measurements <- read.table("measurements.txt", header=TRUE)

It is possible to inspect the contents of the data frame and of all other data objects by simply entering the respective name at the R prompt:

> measurements
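
Two further standard R functions that are often convenient for a first look at a data frame (mentioned here only as a hint, they are not needed for the rest of the example) are str() and head():

> str(measurements)    # structure of the data frame: column names, types and a preview of the values
> head(measurements)   # first six rows of the data frame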

The measurements contain data for different sampling intervals, which are given in one column. It is common that the available data needs to be grouped by the contents of one column; this can be achieved very conveniently using factors. In the following, a factor is created from the sampling interval column 'stime':

> stimef <- factor(measurements["stime"][,1])

A boxplot with a separate box for each sampling interval, showing the variability of the measurements within that interval, can then be created by:

> plot(measurements["FP_OPS_RETIRED"][,1]/measurements["stime"][,1]/1.0E+9 ~ stimef, 
    main="variability of 100 samples (5 min. sampling interval)",  
    xlab="sampling length [s]", ylab = "[GFlop/s]")

Data can be prepared for hardcopy in a variety of formats (e.g. PostScript, PDF, PNG, ...). To create a PNG file, first set up the graphics device:

> png("boxplot_GFlop.png", width=600, height=400)

Then execute the plot command(s) whose output should go to the file you specified. Finally, switch off the PNG device so that the data is written and the output file is closed; dev.list() shows the currently open devices together with their numbers, and dev.off() closes the device with the given number:

> dev.list()
X11 PNG
  2   3
> dev.off(3)


Now you should find a file 'boxplot_GFlop.png' in the current directory.
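
Putting these steps together, a minimal sketch of the complete hardcopy workflow looks like this (the file name and plot call are taken from the example above; dev.off() without an argument closes the currently active device):

> png("boxplot_GFlop.png", width=600, height=400)
> plot(measurements["FP_OPS_RETIRED"][,1]/measurements["stime"][,1]/1.0E+9 ~ stimef, 
    main="variability of 100 samples (5 min. sampling interval)",  
    xlab="sampling length [s]", ylab = "[GFlop/s]")
> dev.off()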

The above is only a short example to give you a feeling for what R is like. You can find further information in the references given below.

References: The first address for further information is the homepage of the R-project. Another useful source of information is the Wiki of the R-project.

If you have any questions, suggestions or would be interested in additional packages to be installed on the machines, please feel free to submit a trouble ticket.

R: MPI Extension

This R package allows you to create R programs which run cooperatively in parallel across multiple machines, or across multiple CPUs of one machine, to accomplish a goal more quickly than a single program running on one machine.

How to use Rmpi

Rmpi provides an interface to MPI from within R, so that R programs can run across multiple processors, possibly on multiple machines, using the MPI programming model. Please note that this page is not a tutorial on how to write Rmpi scripts, just a short example of how to run an Rmpi job at LRZ. As for any R job, you must load the R module before using Rmpi.

module load R/parallel/2.13

The batch scripts below run an Rmpi job saved in a file called "myjob_mpi.R". The R file itself looks like this:

# Tell all slaves to return a message identifying themselves
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
# Tell all slaves to close down, and exit the program
mpi.close.Rslaves()
mpi.quit()
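
The example above does not spawn the slave processes itself; when R is started under MPI with the special .Rprofile mentioned earlier, the slaves are already set up (as the sample output further below shows). For reference, a self-contained sketch using standard Rmpi calls that spawns the slaves explicitly might look like this (the slave count of 3 is only an illustration):

# Load the Rmpi package and spawn the slave processes explicitly
library(Rmpi)
mpi.spawn.Rslaves(nslaves=3)
# Tell all slaves to return a message identifying themselves
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
# Tell all slaves to close down, and exit the program
mpi.close.Rslaves()
mpi.quit()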

MPI Job on SLURM


For batch jobs you have to write a job script, which has to be customized for the respective batch scheduler. On the Linux Cluster, a SLURM script for an Rmpi job might look as follows:

#!/bin/bash
#SBATCH -o myjob.%j.%N.out
#SBATCH -D /home/cluster/...
#SBATCH -J Rmpi_test
#SBATCH --get-user-env
#SBATCH --clusters uv3
#SBATCH --nodes=1-1
#SBATCH --cpus-per-task=32
#SBATCH --ntasks=32
#SBATCH --mail-user=user@lrz.de
#SBATCH --export=NONE
#SBATCH --time=01:00:00

source /etc/profile.d/modules.sh
module load R/parallel/2.13

srun_ps R -f myjob_mpi.R

MPI Job on SuperMUC

#!/bin/bash
#@ job_type = parallel
#@ class = general
#@ node = 2
#@ tasks_per_node = 16
#@ initialdir = /home/hpc/...
#@ output = job$(jobid).out
#@ error = job$(jobid).err
#@ wall_clock_limit = 5:00:00
#@ queue

. /etc/profile.d/modules.sh
module load R/parallel/2.13

poe R -f myjob_mpi.R


The output of the job looks something like this:

R version 2.13.1 (2008-02-08)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0 
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to...
.
.
. 
master (rank 0, comm 1) of size 8 is running on: compB051
slave1 (rank 1, comm 1) of size 8 is running on: compB051
slave2 (rank 2, comm 1) of size 8 is running on: compB051
slave3 (rank 3, comm 1) of size 8 is running on: compB051
slave4 (rank 4, comm 1) of size 8 is running on: compB053
slave5 (rank 5, comm 1) of size 8 is running on: compB053
slave6 (rank 6, comm 1) of size 8 is running on: compB053
slave7 (rank 7, comm 1) of size 8 is running on: compB053
> # Tell all slaves to return a message identifying themselves
> mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
$slave1
[1] "I am 1 of 8"
$slave2
[1] "I am 2 of 8"
$slave3
[1] "I am 3 of 8"
$slave4
[1] "I am 4 of 8"
$slave5
[1] "I am 5 of 8"
$slave6
[1] "I am 6 of 8"
$slave7
[1] "I am 7 of 8"
>
> # Tell all slaves to close down, and exit the program
> mpi.close.Rslaves()
[1] 1
> mpi.quit()