OpenMP - Parallel programming on shared memory systems

This document provides a guide to the usage and availability of OpenMP on the HPC systems at LRZ.

An abstract description of OpenMP

OpenMP is a parallelization method available for the programming languages Fortran, C and C++, which is targeted toward use on shared memory systems. Since the OpenMP standard was developed with support from many vendors, programs parallelized with OpenMP should be widely portable.

Version 3.1 of the standard document for the Fortran, C and C++ base languages was released in July 2011 and is supported by most compilers. The most recent standard document is that for OpenMP 4.5, with the examples annex published separately; it notably extends OpenMP to accelerator devices, but compiler support for this is, for the most part, not yet in place.

The OpenMP parallelization model

From the operating system's point of view, OpenMP functionality is based on the use of threads, while the user's job simply consists of inserting suitable parallelization directives into her/his code. These directives should not influence the sequential functionality of the code; this is ensured by their taking the form of Fortran comments and C/C++ preprocessor pragmas, respectively. An OpenMP-aware compiler, however, is capable of transforming the code blocks marked by OpenMP directives into threaded code; at run time the user can then decide (by setting suitable environment variables) which resources should be made available to the parallel parts of the executable, and how they are organized or scheduled. The following image illustrates this.

[Figure arch.png: the OpenMP parallelization model]
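To make this concrete, here is a minimal C sketch (an illustrative example, not part of the original material) that marks a single block as a parallel region. A compiler without OpenMP support simply ignores the pragma and the program runs serially, showing that the directives leave the sequential functionality intact:

    #include <stdio.h>

    int main(void) {
        /* An OpenMP-aware compiler forks a team of threads here;
           any other compiler ignores the pragma, and the block
           is executed once, serially. */
        #pragma omp parallel
        printf("Hello from an OpenMP thread\n");
        return 0;
    }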

However, the hardware and operation mode of the computing system place limits on the deployment of OpenMP parallel programs: it is usually not sensible to share processors with other applications, because the scalability of the codes would be negatively impacted by load imbalance and/or memory contention. For much the same reasons, it is in many cases not useful to generate more threads than there are CPUs available. Correspondingly, you need to be aware of your computing center's policies regarding the usage of multiprocessing resources.

At run time the following situation presents itself: certain regions of the application can be executed in parallel, while the rest (which should be as small as possible) is executed serially, i.e., by one CPU with one thread.

[Figure exec.png: alternation of serial and parallel regions during program execution]

Program execution always starts in serial mode; as soon as the first parallel region is reached, a team of threads is formed ("forked") based on the user's requirements (4 threads in the image above), and each thread executes the code enclosed in the parallel region. Of course, it is necessary to impose a suitable division of the work among the threads ("work sharing"). At the end of the parallel region all threads are synchronized ("join"), and the subsequent serial code is worked on by the master thread only, while the slave threads wait (shaded yellow squares in the image) until a new parallel region begins. This alternation between serial and parallel regions can be repeated arbitrarily often; the threads are only terminated when the application finishes.
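The fork/join pattern and work sharing just described can be sketched in C as follows (a schematic example):

    #include <stdio.h>

    int main(void) {
        double a[100];
        int i;

        printf("serial part, executed by the master thread\n");

        /* "fork": a team of threads executes the parallel region */
        #pragma omp parallel
        {
            /* work sharing: iterations are divided among the team;
               the loop variable is private to each thread */
            #pragma omp for
            for (i = 0; i < 100; i++)
                a[i] = 2.0 * i;
        }   /* "join": threads synchronize, serial execution resumes */

        printf("serial part again: a[99] = %f\n", a[99]);
        return 0;
    }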

The OpenMP standard also allows nesting of parallelism: a thread in a team of threads may generate a sub-team. Note, however, that some OpenMP implementations do not allow the use of more than one thread below the top nesting level.
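As an illustration, the following C sketch requests a nested sub-team; whether the inner region actually receives more than one thread depends on the implementation (see also OMP_NESTED below):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_set_nested(1);   /* equivalent to setting OMP_NESTED=true */

        #pragma omp parallel num_threads(2)
        {
            int outer = omp_get_thread_num();
            /* each outer thread may fork a sub-team of its own */
            #pragma omp parallel num_threads(2)
            printf("outer thread %d, inner thread %d\n",
                   outer, omp_get_thread_num());
        }
        return 0;
    }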

A priori, the number of threads used need not match the number of CPUs available for a job. However, to achieve good performance it is necessary to determine, and possibly enforce, an optimal assignment of CPUs to threads. This may involve additional operating system functionality and is not covered by the OpenMP standard.

Comparison with other parallelization methods

In contrast to using MPI (which usually requires a lot of work), one can very quickly obtain a functioning parallel program with OpenMP in many cases. However, in order to achieve good scalability and high performance it will be necessary to use suitable tools to perform further optimization. Even then, scalability of the resulting code will not always be on par with the corresponding code parallelized with MPI.

 

                                      MPI             OpenMP          proprietary     PGAS
                                      (shared and     (shared         directives      (coarrays, UPC)
                                      distributed     memory
                                      memory systems) systems)

portable                              yes             yes             no              yes
scalable                              yes             partially       partially       yes
supports data parallelism             no              yes             yes             partially
supports incremental parallelization  no              yes             yes             partially
serial functionality intact?          no              yes             yes             yes
correctness verifiable?               no              yes             ?               in principle yes

On high performance computing systems with a combined shared and distributed memory architecture, using MPI and OpenMP in a complementary manner is one possible parallelization strategy ("hybrid parallel programs"). Schematically, one obtains the following hierarchy of parallelism:

  • The job to be done is subdivided into large chunks, which are distributed to compute nodes using MPI.

  • Each chunk of work is then further subdivided by suitable OpenMP directives. Hence each compute node generates a team of threads, each thread working on part of a chunk.

  • On the lowest level, e.g. the loops within part of a chunk, the well-known optimization methods (either by compiler or by manual optimization) should be used to obtain good single CPU or single thread performance. The method used will depend on the hardware (e.g., RISC-like vs. vector-like).

Note that the hardware architecture may also have an influence on the OpenMP parallelization method itself. Furthermore, unrestricted intermixing of MPI and OpenMP requires a thread-safe implementation of MPI. The available level of thread support can be obtained by calling the MPI_Init_thread subroutine with suitable arguments; depending on the result, appropriate care may be required to stay within the limitations of the granted threading support level, as defined by the MPI standard. A minimal hybrid program is sketched below.
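The following C sketch illustrates this (a schematic example; it assumes an MPI implementation granting at least MPI_THREAD_FUNNELED, in which only the master thread makes MPI calls):

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char *argv[]) {
        int provided, rank;

        /* request a threading support level; the library returns
           the level it actually grants in "provided" */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (provided < MPI_THREAD_FUNNELED && rank == 0)
            printf("warning: insufficient threading support\n");

        /* coarse-grained distribution across nodes via MPI ranks,
           fine-grained parallelism via the OpenMP team on each node */
        #pragma omp parallel
        printf("rank %d, thread %d\n", rank, omp_get_thread_num());

        MPI_Finalize();
        return 0;
    }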

Overview of OpenMP functionality

Please consult the Wikipedia OpenMP article for such an overview.

Remarks on the usage of OpenMP

OpenMP compilers

OpenMP enabled compilers are available on all HPC systems at LRZ:

  • The current (15.0) Intel Compiler suite fully supports OpenMP 3.1 and some features of OpenMP 4.0, particularly the SIMD directives. These compilers are available on all HPC systems at LRZ.

  • The PGI Compiler suite supports OpenMP on x86_64 based systems.

  • The currently deployed release (4.9) of GCC supports OpenMP 3.1.

  • The NAG compiler supports most of OpenMP 3.1 in its 6.0 releases.

Compiler switches

For activation of OpenMP directives at least one additional compiler switch is required.

Vendor    Compiler calls          OpenMP option

Intel     ifort / icc / icpc      -openmp
GCC       gfortran / gcc / g++    -fopenmp
PGI       pgf90 / pgcc / pgCC     -mp

Please specify this switch for all program units that contain either OpenMP directives or calls to OpenMP run time functions. The OpenMP switch must also be specified at link time.
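For example, with the Intel Fortran compiler (the file names are placeholders):

    ifort -openmp -c foo.f90
    ifort -openmp -c main.f90
    ifort -openmp main.o foo.o -o myprog.exe

The same pattern applies to the other compilers with their respective switches.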

Controlling the Run Time environment

Before executing your OpenMP program, please set up the OpenMP environment using one or more of the variables indicated in the following table.

  • OMP_NUM_THREADS: a positive integer, or a comma-separated list of positive integers. Sets the number of threads spawned by the first encountered parallel region; subsequent list entries apply to nested parallel regions.

  • OMP_SCHEDULE: type[,chunk], where type is one of static, dynamic, guided or auto, and chunk is an optional positive integer. Determines how the iterations of loops with a schedule(runtime) clause are distributed to threads.

  • OMP_DYNAMIC: true or false. Determines whether the OpenMP run time may decide how many threads are used for a parallel region.

  • OMP_PROC_BIND: true or false. Determines whether the OpenMP run time should prevent threads from moving between processors.

  • OMP_NESTED: true or false. Determines whether the OpenMP run time supports nesting of parallel regions.

  • OMP_STACKSIZE: size, sizeB, sizeK, sizeM or sizeG, where size is a positive integer. Sets the size of each thread's individual stack in the given unit (Byte, kByte, MByte, GByte); if no unit specifier appears, the size is taken in kBytes.

  • OMP_WAIT_POLICY: active or passive. Provides a hint to the OpenMP run time about the desired behaviour of waiting threads.

  • OMP_MAX_ACTIVE_LEVELS: a positive integer. Sets the maximum number of active levels of nested parallelism.

  • OMP_THREAD_LIMIT: a positive integer. Sets the maximum number of threads used for the whole OpenMP program; an implementation-defined maximum may apply which cannot be exceeded.
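As an illustration of how OMP_SCHEDULE interacts with the schedule(runtime) clause, consider the following C sketch; running it after, e.g., export OMP_SCHEDULE="guided,100" selects guided scheduling with chunk size 100 without recompiling:

    #include <stdio.h>

    int main(void) {
        long i, n = 1000000;
        double sum = 0.0;

        /* the scheduling of this loop is deferred to run time and
           taken from the OMP_SCHEDULE environment variable */
        #pragma omp parallel for schedule(runtime) reduction(+:sum)
        for (i = 0; i < n; i++)
            sum += 1.0 / (double)(i + 1);

        printf("sum = %f\n", sum);
        return 0;
    }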

Simple example (using the bash or ksh Shell):

export OMP_NUM_THREADS=8

./my_openmp_program.exe

This will execute the OpenMP-compiled binary "my_openmp_program.exe" with up to 8 threads.

Stub library, module and include file for Fortran

In order to keep the code compilable with its serial functionality intact, any OpenMP function calls or declarations should also be guarded by the conditional compilation sentinel !$ (an "active comment"):

      implicit none
      integer mythread
      ...
!     Lines guarded by the !$ sentinel are compiled only when OpenMP
!     is enabled; a non-OpenMP compiler treats them as comments.
!$    integer OMP_GET_THREAD_NUM
!$    external OMP_GET_THREAD_NUM
      ...
      mythread = 0
!$    mythread = OMP_GET_THREAD_NUM()

Note that without the IMPLICIT NONE statement, a missing declaration would leave OMP_GET_THREAD_NUM implicitly typed as REAL, i.e. with the wrong type!

If you do not wish to do this, it is also possible to use

  • either an include file omp_lib.h

  • or a Fortran 90 module omp_lib.f90

for compilation of the serial code. For linkage, one also needs a stub library that supplies serial implementations of the run time functions. All of this is provided by the Intel compilers via the option -openmp-stubs, which otherwise produces purely serial code. The above code can then be written as follows:

Fortran 77 style:

      implicit none
      ...
      include 'omp_lib.h'
      ...
      mythread = OMP_GET_THREAD_NUM()

Fortran 90 style (recommended):

      use omp_lib
      implicit none
      ...
      mythread = OMP_GET_THREAD_NUM()

OpenMP extensions supported by the Intel Compilers

The Intel Fortran and C/C++ compilers provide some additional functionality described in the following.

Additional environment variables

  • KMP_AFFINITY and/or KMP_PLACE_THREADS: schedule threads to cores or hardware threads (logical CPUs) in a user-controlled manner; see the description on Intel's web site.

  • KMP_ALL_THREADS: maximum number of threads available to a parallel region. Default: max(32, 4*OMP_NUM_THREADS, 4*(number of processors)).

  • KMP_BLOCKTIME: interval in milliseconds after which an inactive thread is put to sleep. Should be short in throughput mode, can be longer in turnaround mode; see KMP_LIBRARY below. Default: 200 milliseconds.

  • KMP_LIBRARY: execution mode of the OpenMP run time library. Possible values are throughput (optimized for sharing resources with other programs), turnaround (suited to dedicated use of resources, as in HPC) and serial (enforce serial execution). Default: throughput.

  • KMP_MONITOR_STACKSIZE: stack size in bytes for the monitor thread. Default: max(32768, system minimum thread stack size).

  • KMP_STACKSIZE: stack size in bytes usable by each thread. Change this if your application segfaults for no apparent reason; you may also need to increase your shell's stack limit appropriately. With OpenMP 3.0 and higher, the standardized variable OMP_STACKSIZE should be used instead. Default: 2 MByte.

Notes:

  • Appending a suitable suffix allows you to specify units. For example, KMP_STACKSIZE=6m sets a value of 6 MByte.

  • There are also some extension routine calls, e.g. kmp_set_stacksize_s(...) with an implementation-dependent integer kind as argument, which can be used instead of the environment variables described above; see the sketch below. However, this is usually not portable, and usage is hence discouraged except for specific needs.
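A minimal C sketch of this routine-based variant follows; kmp_set_stacksize_s is an Intel-specific extension declared in Intel's omp.h, so this code only builds with the Intel compilers:

    #include <omp.h>

    int main(void) {
        /* Intel extension: request 6 MByte of stack per thread;
           must be called before the first parallel region forks */
        kmp_set_stacksize_s(6 * 1024 * 1024);

        #pragma omp parallel
        {
            /* work that needs large thread-private stacks */
        }
        return 0;
    }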

NUMA-related directives

For Fortran, Intel has implemented an additional proprietary directive which supports correctly distributed memory initialization and NUMA pre-fetching. This directive is of the form

!DIR$ MEMORYTOUCH (array-name [ , schedule-type [ ( chunk-size ) ]] [ , init-type]) 

where the parameter names have the following meaning:

  • array-name is an array of type INTEGER(4), INTEGER(8), REAL(4) or REAL(8)
  • schedule-type is one of STATIC, GUIDED, RUNTIME or DYNAMIC, and should be consistent with the schedule used in the subsequent OpenMP parallel loops.
  • chunk-size is an integer expression
  • init-type is one of LOAD or STORE

If init-type is LOAD, the compiler generates an OpenMP loop which fetches elements of array-name into a temporary variable. If init-type is STORE, the compiler generates an OpenMP loop which sets elements of array-name to zero. Examples:

!DIR$ memorytouch (A)
!DIR$ memorytouch (A , LOAD)
!DIR$ memorytouch (A , STATIC (load+jf(3)) )
!DIR$ memorytouch (A , GUIDED (20), STORE)

While the MEMORYTOUCH directive is accepted on all platforms, at present it is meaningful only on specific systems with NUMA designs and when OpenMP is enabled.

References and documentation

Intel Compiler documentation on the LRZ web site
Note that the links on this page lead to a password-protected area. Issue the command get_manuals_passwd when logged in to one of the LRZ HPC systems to obtain access information.
OpenMP home page
The central source of information about OpenMP
OpenMP Specifications
For Fortran, C, and C++
Compilers and tools
Various vendors' implementations and add-ons

Acknowledgments go to Isabel Loebich and Michael Resch, Höchstleistungsrechenzentrum Stuttgart, for a very stimulating OpenMP workshop and the permission to reuse material from this workshop in this document.