IBM MPI: An MPI implementation for SuperMUC

The MPI implementation by IBM is the default MPI environment on SuperMUC. This document gives implementation-specific hints on the usage of this MPI variant.


IBM's Parallel Environment (PE) is a development and execution environment for parallel applications (distributed-memory, message-passing applications running across multiple nodes). It is designed to support development, testing, debugging, tuning and running high-performance parallel applications written in C, C++ and Fortran on highly scalable SMP clusters. It supports most features from release 3.0 of the MPI standard.

The Parallel Environment includes the following components:

  • The Parallel Operating Environment (POE) for submitting and managing jobs.
  • IBM's MPI and LAPI/PAMI libraries for communication between parallel tasks.
  • A parallel debugger (pdb) for debugging parallel programs.
  • IBM High Performance Computing Toolkit for analyzing performance of parallel and serial applications.

This document explains the basic usage of POE and IBM MPI; the Parallel Environment Tools are explained elsewhere.

Supported compilers

The default module supports the Intel compiler suite (ifort/icc/icpc) version 16.0; version 17.0 of that compiler can also be used. Older versions of the Intel compiler (15.0 and lower) are not supported. If you wish to use IBM MPI with the system-provided GCC (version 4.3), please issue the following commands in both your compilation environment and batch script:

module unload
module load

Please note that using GCC versions other than the system-provided one is not supported (i.e. if trouble arises we cannot help you).

Starting up MPI programs using mpiexec/POE


For an interactive run, set the partition size via environment variables and start the program with either poe or mpiexec:

export MP_NODES=4
export MP_PROCS=160
poe ./myprog.exe
mpiexec ./myprog.exe
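In batch operation the partition size comes from LoadLeveler rather than from MP_PROCS/MP_NODES. As a sketch, a job script might look as follows; the class name, node counts and keyword values are placeholders following common LoadLeveler usage, not verified SuperMUC settings (consult the LoadLeveler pages for the exact keywords):

```shell
#!/bin/bash
#@ job_type = parallel
#@ class = general
#@ node = 4
#@ tasks_per_node = 16
#@ wall_clock_limit = 00:30:00
#@ output = job$(jobid).out
#@ error  = job$(jobid).err
#@ queue
# LoadLeveler provides the partition; poe only launches the executable.
poe ./myprog.exe
```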


For serial or MPMD-style runs, a POE commands file with one executable per task can be used. The program names below are placeholders for illustration:

cat >cmdfile <<EOD
./prog_task0.exe
./prog_task1.exe
./prog_task2.exe
./prog_task3.exe
./prog_task4.exe
./prog_task5.exe
EOD
export MP_NODES=1
export MP_PROCS=6
export MP_CMDFILE=cmdfile
poe

Environment variables used by mpiexec/POE

Normally, the poe module provides reasonable settings for the control variables used by POE. The following table explains these control variables; instead of setting an environment variable, the corresponding execution flag can be passed on the poe command line.

Each entry below lists the environment variable (and command line flag, where one exists), a description, the possible values, and the default.

Variable set by POE

MP_CHILD (no associated command line flag)
A number which is set by POE for each task; it may be used to perform trivial parallelization (task farming) with serial jobs. If myscript contains:
./my_serial_prog <in.$MP_CHILD >out.$MP_CHILD
then
poe myscript -procs 40
starts my_serial_prog 40 times, each with different input and output files.
Possible values: 0 to num_tasks-1. Cannot be set by the user.
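The MP_CHILD task-farming mechanism can be sketched as follows; here the POE launch is simulated by a local loop, and `tr` stands in for the serial program:

```shell
#!/bin/sh
# myscript would contain: ./my_serial_prog <in.$MP_CHILD >out.$MP_CHILD
# Under 'poe myscript -procs 4', POE starts 4 copies, each seeing its own
# MP_CHILD value (0..3). We simulate those 4 tasks with a loop.
for MP_CHILD in 0 1 2 3; do
    export MP_CHILD
    echo "input for task $MP_CHILD" > in.$MP_CHILD
    # stand-in for ./my_serial_prog: uppercase its input
    tr 'a-z' 'A-Z' < in.$MP_CHILD > out.$MP_CHILD
done
```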

MPI library

MP_MPILIB (only available on SuperMUC)
Specifies the MPI library to use.
  • pempi is the (legacy) library used on SuperMIG
  • mpich2 is the library on SuperMUC; it is based on PAMI

MP_MSG_API (-msg_api)
Indicates to POE which message passing API is being used by the application code.
  • MPI: the application makes only MPI calls.
  • LAPI: the application makes only LAPI calls.
  • MPI_LAPI: calls to both message passing APIs are used in the application, and the same set of communication resources (windows, IP addresses) is shared between them.
  • MPI,LAPI: calls to both message passing APIs are used in the application, with dedicated resources assigned to each of them.
  • LAPI,MPI: identical in meaning to MPI,LAPI.
Possible values: MPI | LAPI | PAMI | shmem

Partition manager control

MP_PROCS (-procs)
The number of program tasks. Only useful for interactive jobs; otherwise it is set by LoadLeveler.
Possible values: any number from 1 to the maximum supported configuration. Default: 1

MP_NODES (-nodes)
The number of processor nodes on which to run the parallel tasks. Only useful for interactive jobs; otherwise it is set by LoadLeveler. May be used alone or in conjunction with MP_TASKS_PER_NODE and/or MP_PROCS.
Possible values: any number from 1 to the maximum supported configuration. Default: none

MP_TASKS_PER_NODE (-tasks_per_node)
The number of tasks to be run on each of the physical nodes. Only useful for interactive jobs; otherwise it is set by LoadLeveler. May be used alone or in conjunction with MP_NODES.
Possible values: any number from 1 to the maximum supported configuration. Default: none

MP_RETRY (-retry)
The period of time (in seconds) between processor node allocation retries by POE if there are not enough processor nodes immediately available to run a program. Valid only if you are using LoadLeveler. If the character string wait is specified instead of a number, no retries are attempted by POE, and the job remains enqueued in LoadLeveler until LoadLeveler either schedules the job or cancels it.
Possible values: an integer greater than or equal to 0, or the case-insensitive value wait. Default: 0 (no retry)
MP_TIMEOUT (no associated command line flag)
The length of time that POE waits before abandoning an attempt to connect to the remote nodes.
Possible values: any number greater than 0; if set to 0 or a negative number, the value is ignored. Default: 150 seconds

MP_HOSTFILE (-hostfile, -hfile)
The name of a host list file for node allocation.
Possible values: any file specifier or the word NULL. Default: host.list in the current directory

MP_PULSE (-pulse)
The interval (in seconds) at which POE checks the remote nodes to ensure that they are actively communicating with the home node.
Possible values: an integer greater than or equal to 0. Default: 600

MP_CMDFILE (-cmdfile)
The name of a POE commands file used to load the nodes of your partition. If set, POE reads the commands file rather than STDIN.
Possible values: any file specifier. Default: none
MP_NEWJOB (-newjob)
Whether or not the Partition Manager maintains your partition for multiple job steps.
Possible values: yes | no | parallel. Default: no

MP_PGMMODEL (-pgmmodel)
The programming model you are using.
Possible values: spmd | mpmd. Default: spmd
MP_LLFILE (-llfile)
The name of a LoadLeveler job command file to be used for node allocation. This exploits new or existing LoadLeveler (as Resource Manager) functionality that is not available using POE options, including specification of:
  • task geometry
  • blocking factor
  • machine order
  • consumable resources
  • memory requirements
  • disk space requirements
  • machine architecture
  • running jobs from more than one pool

For more information on the LoadLeveler functionality you can exploit, see the LoadLeveler pages.

It also allows running parallel jobs without specifying a host file or pool, causing LoadLeveler to select nodes for the parallel job from any in its cluster.

Possible values: any relative or full path name. Default: none

MP_SAVE_LLFILE (-save_llfile)
When using LoadLeveler for node allocation, the name of the output LoadLeveler job command file to be generated by the Partition Manager. The output file shows the LoadLeveler settings that result from the POE environment variables and/or command line options for the current invocation of POE. It may be saved for debugging and/or for further execution (together with MP_LLFILE or -llfile) without specifying the arguments for poe.
Possible values: any relative or full path name. Default: none

MP_SAVEHOSTFILE (-savehostfile)
The name of an output host list file to be generated by the Partition Manager. May be saved for debugging.
Possible values: any relative or full path name. Default: none

I/O control

MP_LABELIO (-labelio)
Whether or not output from the parallel tasks is labeled by task id.
Possible values: yes | no

MP_STDINMODE (-stdinmode)
The input mode. This determines how input is managed for the parallel tasks.
  • all: all tasks receive the same input data from STDIN.
  • none: no tasks receive input data from STDIN; STDIN is used by the home node only.
  • a task id: STDIN is only sent to the task identified.
Possible values: all | none | taskid

MP_STDOUTMODE (-stdoutmode)
The output mode. This determines how STDOUT is handled by the parallel tasks.
  • unordered: all tasks write output data to STDOUT asynchronously.
  • ordered: output data from each parallel task is written to its own buffer; later, all buffers are flushed, in task order, to STDOUT.
  • a task id: only the task indicated writes output data to STDOUT.
Possible values: unordered | ordered | taskid
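For example, to make mixed output from many tasks readable, the two output-related controls can be combined (variable names MP_LABELIO and MP_STDOUTMODE as in the IBM PE documentation):

```shell
# Prefix each output line with the task id, and flush buffers in task order.
export MP_LABELIO=yes
export MP_STDOUTMODE=ordered
# poe ./myprog.exe
```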

Diagnostic Information and Debugging

MP_INFOLEVEL (-infolevel)
The level of message reporting.
  • 0: error
  • 1: warning and error
  • 2: informational, warning, and error
  • 3: informational, warning, and error; also reports high-level diagnostic messages for use by the IBM Support Center.
  • 4, 5, 6: informational, warning, and error; also reports high- and low-level diagnostic messages for use by the IBM Support Center.
Possible values: 0 - 6. Default: 1

MP_PRINTENV (-printenv)
Whether to produce a report of the current settings of MPI environment variables, across all tasks in a job. If yes is specified, the MPI environment variable information is gathered at initialization time from all tasks and forwarded to task 0, where the report is prepared. If a script name is specified, the script is run on each node, and the output is forwarded to task 0 and included in the report.
When a variable's value is the same for all tasks, it is printed only once. If it differs for some tasks, an asterisk (*) appears in the report after the word "Task".
  • no: do not produce a report of MPI environment variable settings.
  • yes: produce a report of MPI environment variable settings.
  • script_name: produce the report (same as yes), then run the script specified here.
Possible values: no | yes | script_name. Default: no

MP_EUIDEVELOP (-euidevelop)
Controls the level of parameter checking during execution.
  • yes: enables some intertask parameter checking, which may help uncover certain problems but slows execution.
  • no or nor (normal): only relatively inexpensive, local parameter checking is done.
  • deb (debug): checking intended for developing applications; can significantly slow performance.
  • min (minimum): allows PE MPI to bypass parameter checking on all send and receive operations; should only be used with well tested applications, because a bug in an application running with min will not provide useful error feedback.
Possible values: yes | no | nor | deb | min. Default: no

MP_STATISTICS (-statistics)
Provides the ability to gather communication statistics for User Space jobs.
Possible values: yes | no | print. Default: no

MP_DEBUG_NOTIMEOUT (-debug_notimeout)
A debugging aid that allows programmers to attach to one or more of their tasks without the concern that some other task may reach a timeout.
Possible values: yes | no. Default: no

Pinning of Tasks and Threads

MP_PE_AFFINITY
When set to yes, the MPI/POE runtime does not pass affinity information to the scheduler (e.g. LoadLeveler) but performs its own affinity setting according to MP_TASK_AFFINITY.
Possible values: yes | no. Default: yes

MP_TASK_AFFINITY (-task_affinity)
Setting this environment variable attaches each task of a parallel job to one of the system cpusets.
  • CORE (default): each MPI task runs on a single physical processor core.
  • CORE:n: the threads of an MPI task are constrained to n processor cores (one thread per core); typically n is the value of OMP_NUM_THREADS (should be used for OpenMP or hybrid jobs).
  • CPU: each MPI task runs on a single logical CPU.
  • CPU:n: the threads of an MPI task are constrained to n logical CPUs (one thread per CPU); typically n is the value of OMP_NUM_THREADS.
  • MCM: tasks are allocated in a round-robin fashion among the sockets of a node; should only be used if not all cores of a node are used.
  • list-of-numbers: tasks are assigned on a round-robin basis to this set of sockets.
  • -1: no affinity request will be made (disables task affinity).

CPU or CORE will generate a CPU mask with one or two entries. CPU:n will generate a CPU mask which contains all CPUs, but will set KMP_AFFINITY differently for each MPI task.
Possible values: core | cpu (optionally with :n), MCM, a list of socket numbers, or -1. Default: core

A list of CPUs to which the tasks and threads are bound; for this, MP_TASK_AFFINITY should be set to cpu or cpu:n. With such a list,

poe ./a.out

ensures that each task and its threads run on different sockets.
Possible values: a list of CPUs.
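A typical hybrid (MPI + OpenMP) placement, as a sketch: the per-task core count matches OMP_NUM_THREADS, as recommended above for OpenMP or hybrid jobs.

```shell
# 4 OpenMP threads per MPI task, each thread pinned to its own physical core.
export OMP_NUM_THREADS=4
export MP_TASK_AFFINITY=core:4
# poe ./myprog.exe
```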


MP_USE_BULK_XFER (-use_bulk_xfer)
Transparently causes portions of the user's virtual address space to be pinned and mapped to a communications adapter. The low-level communication protocol then uses Remote Direct Memory Access (RDMA, also known as bulk transfer) to copy (pull) data from the send buffer to the receive buffer as part of the MPI receive.
Possible values: yes | no. Default: yes

MP_BULK_MIN_MSG_SIZE (-bulk_min_msg_size)
Contiguous messages with data lengths greater than or equal to this value use the bulk transfer path; messages that are smaller, or noncontiguous, use packet mode transfer. Based on MPI benchmark measurements, the best overall bandwidth performance is achieved with MP_BULK_MIN_MSG_SIZE=512k. However, the real performance depends on the message sizes actually used inside an application, and this setting results in higher memory consumption. The acceptable range is from 4096 to 2147483647 (INT_MAX).
Possible values: nnnnn (bytes), nnnK (KB), nnM (MB), nnG (GB)


MP_BUFFER_MEM (-buffer_mem)
Controls the amount of memory MPI allows for the buffering of early arrival message data. Message data that is sent without knowing whether the receive is posted is said to be sent eagerly. If the message data arrives before the receive is posted, this is called an early arrival and must be buffered at the receive side. See the MP_BUFFER_MEM details in the IBM manual.
Can also be specified in the form MP_BUFFER_MEM=M1,M2, where M1 specifies the amount of pre-allocated memory and M2 an upper bound on the amount of early arrival buffer memory.
Possible values: nnnnn (bytes), nnnK (KB), nnM (MB), nnG (GB). Default: 64 MB (User Space and IP)
MP_EAGER_LIMIT (-eager_limit)
Changes the threshold value for message size, above which the rendezvous protocol is used. To ensure that at least 32 messages can be outstanding between any two tasks, MP_EAGER_LIMIT is adjusted based on the number of tasks according to the following table, when the user has specified neither MP_BUFFER_MEM nor MP_EAGER_LIMIT:

Number of tasks   MP_EAGER_LIMIT (bytes)
   1 to  256       32768
 257 to  512       16384
 513 to 1024        8192
1025 to 2048        4096
2049 to 4096        2048
4097 to 8192        1024

MPI uses the MP_BUFFER_MEM and MP_EAGER_LIMIT values that are selected for a job to determine how many complete point-to-point messages, each with a size that is equal to or less than the eager limit, can be sent eagerly from every task of the job to a single task without causing the target to run out of buffer space. This is done by allocating to each sending task a number of message credits for each target. The sending task consumes one message credit for each eager send to a particular target, and gets that credit back after the message has been matched at that target.

MPI ensures that there are at least two credits for each target. If needed, MPI reduces the initially selected value of MP_EAGER_LIMIT, or increases the initially selected value of MP_BUFFER_MEM, in order to achieve this minimum threshold of two credits for each target.
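The task-count dependence of the default eager limit in the table above can be expressed as a small helper function (a sketch for illustration only, not a PE interface):

```shell
# Default MP_EAGER_LIMIT (bytes) as a function of the number of tasks,
# following the table for the case where neither MP_BUFFER_MEM nor
# MP_EAGER_LIMIT was set by the user.
default_eager_limit() {
    n=$1
    if   [ "$n" -le  256 ]; then echo 32768
    elif [ "$n" -le  512 ]; then echo 16384
    elif [ "$n" -le 1024 ]; then echo 8192
    elif [ "$n" -le 2048 ]; then echo 4096
    elif [ "$n" -le 4096 ]; then echo 2048
    else                         echo 1024
    fi
}
default_eager_limit 1000   # prints 8192
```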

MP_COLLECTIVE_OFFLOAD
If set to yes, reduction operations are performed off-node, which can give a considerable performance boost if the number of elements to be reduced is small. However, this only works if the FCA manager is running; the command /opt/mellanox/fca/bin/fca_find_fmm can be used to determine whether this is the case.
Possible values: yes | no. Default: no
MP_CC_BUF_MEM
Specifies the size of the Early Arrival (EA) buffer that is used by the communication subsystem to buffer eager send messages, for collective communications operations, that arrive before there is a matching receive posted.
Possible values: nnnnn (bytes), nnnK (KB), nnM (MB), nnG (GB)

MP_CC_SCRATCH_BUF (-cc_scratch_buf)
Use the fastest collective communication algorithm even if that algorithm requires allocation of more scratch buffer space.
Possible values: yes | no
MP_SINGLE_THREAD (-single_thread)
Avoids lock overheads in a program that is known to be single-threaded. MPI-IO and MPI one-sided communications are unavailable if this variable is set to yes. Results are undefined if this variable is set to yes with multiple application message-passing threads in use. See IBM Parallel Environment: MPI Programming Guide for more information.
Possible values: yes | no. Default: no

MP_WAIT_MODE (-wait_mode)
Specifies how a thread or task behaves when it discovers it is blocked, waiting for a message to arrive.
MP_WAIT_MODE set to nopoll may reduce CPU consumption for applications that post a receive call on a separate thread, where that receive call does not expect an immediate message arrival. It may also increase the delay between message arrival and the blocking call's return. It is recommended that MP_CSS_INTERRUPT be set to yes when the nopoll wait is selected, so that the system wait can be interrupted by the arrival of a message. Otherwise, the nopoll wait is interrupted at the timing interval set by MP_POLLING_INTERVAL.
Possible values: nopoll | poll | sleep | yield. Default: poll (for User Space and IP)

MP_POLLING_INTERVAL (-polling_interval)
Changes the polling interval (in microseconds).
Possible values: an integer between 1 and 2 billion.

MP_CSS_INTERRUPT (-css_interrupt)
User Space is an unreliable packet transport (packets may be dropped during transport without an error being reported); the message dispatcher manages packet acknowledgment and retransmission with a sliding window protocol. This message dispatcher is also run on a hidden thread once every few hundred milliseconds and, if MP_CSS_INTERRUPT is set, upon notification of packet arrival.
Specifies whether or not arriving packets generate interrupts. Using this environment variable may provide better performance for certain applications. Setting this variable explicitly suppresses the MPI-directed switching of interrupt mode, leaving the user in control for the rest of the run. In particular, achieving fair locking behaviour with MPI_WIN_LOCK requires a yes setting for this variable.
For more information, refer to the MPI_FILE_OPEN and MPI_WIN_CREATE subroutines in IBM Parallel Environment: MPI Subroutine Reference.
Possible values: yes | no. Default: no

Advanced Tuning Parameters

MP_HINTS_FILTERED (-hints_filtered)
Specifies whether or not MPI info objects reject hints (key and value pairs) that are not meaningful to the MPI implementation.
Possible values: yes | no. Default: yes

MP_MSG_ENVELOPE_BUF (-msg_envelope_buf)
The size of the message envelope buffer (that is, uncompleted send and receive descriptors).
Possible values: any positive number; there is no upper limit, but any value less than 1 MB is ignored. Default: 8 MB

MP_RETRANSMIT_INTERVAL (-retransmit_interval)
Controls how often the communication subsystem library checks whether it should retransmit packets that have not been acknowledged. The value nnnn is the number of polling loops between checks.
Possible values: 1000 to INT_MAX (User Space)

MP_THREAD_STACKSIZE (-thread_stacksize)
Specifies the additional stack size allocated for user subroutines running on an MPI service thread. If you do not allocate enough space, the program may encounter a SIGSEGV exception or more subtle failures.
Possible values: nnnK (where K = 1024 bytes), nnM (where M = 1024*1024 bytes)

MP_TIMEOUT (no associated command line flag)
Changes the length of time (in seconds) the communication subsystem waits for a connection to be established during message-passing initialization.
Possible values: an integer greater than 0. Default: 150

MP_ACK_THRESH (-ack_thresh)
Allows the user to control the packet acknowledgement threshold.
Possible values: a positive integer limited to 31. Default: 30

MP_IO_BUFFER_SIZE (-io_buffer_size)
Specifies the default size of the data buffer used by MPI-IO agents.
Possible values: nnnK (where K = 1024 bytes), nnnM (where M = 1024*1024 bytes). Default: the number of bytes that corresponds to 16 file blocks.

MP_IO_ERRLOG (-io_errlog)
Specifies whether or not to turn on I/O error logging.
Possible values: yes | no. Default: no

MP_REXMIT_BUF_SIZE (-rexmit_buf_size)
The maximum LAPI-level message size that will be buffered locally, to more quickly free up the user send buffer. This sets the size of the local buffers that will be allocated to store such messages, and will impact memory usage, while potentially improving performance. The MPI application message size supported is smaller by, at most, 32 bytes.
Possible values: nnn bytes (where nnn > 0). Default: 16352 bytes

MP_REXMIT_BUF_CNT (-rexmit_buf_cnt)
The number of retransmit buffers that will be allocated per task; the total allocation per task is MP_REXMIT_BUF_SIZE * MP_REXMIT_BUF_CNT. This count controls the number of in-flight messages that can be buffered to allow prompt return of application send buffers.
Possible values: nnn (where nnn > 0). Default: 128

Gprof and Corefile Handling

MP_PROFDIR (-profdir)
Allows you to specify the directory into which POE stores the gmon.out file for each task. A gmon.out file contains profiling data and is produced by compiling a program with the -pg flag.
Possible values: any relative or full path name. Default: profdir.task_id

MP_COREFILE_FORMAT (-corefile_format)
POE processes that terminate abnormally can generate standard corefiles. If you prefer, you can instruct POE to write the stack trace information to standard error instead. It is highly recommended to limit the amount of information written for large parallel applications by setting this to STDERR.

MP_COREDIR (-coredir)
Creates a separate directory for each task's core file.
Possible values: any valid directory name, or "none" to bypass creating a new directory. Default: coredir.taskid

MP_FENCE (no associated command line flag)
A fence character string for separating arguments you want parsed by POE from those you do not.
Possible values: any string. Default: none

MP_NOARGLIST (no associated command line flag)
Whether or not POE ignores the argument list. If set to yes, POE does not attempt to remove POE command line flags before passing the argument list to the user's program.
Possible values: yes | no. Default: no

Troubleshooting, Hints and Tuning

Tuning of collectives with pami_tune

PE comes with an additional tool that permits selection of optimal settings for collective functions. It is driven by an input file (named ptune.ini for the sake of the example) that specifies a selection of collective calls (the naming scheme is hopefully obvious), message sizes (in units of bytes) and MPI task counts (i.e., geometry sizes). Some idea of the message sizes actually occurring in the application, as well as of the MPI calls used, is of course needed. One then executes the command

pami_tune -p -f ptune.ini

within a suitably configured batch job (in this case, requesting e.g. 80 Sandy Bridge nodes with 16 tasks each). The result is an XML file my_pami_settings.xml that can be used for an application run as follows:

export MP_COLLECTIVE_SELECTION_FILE=<folder_where_file_resides>/my_pami_settings.xml
poe ./my_application.exe

Before running either pami_tune or the application, it may also be worth trying out whether going above or below MPI-internal thresholds with MP_BULK_MIN_MSG_SIZE, MP_EAGER_LIMIT has an effect.

Further documentation on pami_tune is available on the IBM web server.


If communication timeouts occur (indicated by appropriate warnings from PE), try the following settings:

MP_TIMEOUT=36000 (or even higher)

Note that this will not fix the underlying problem; it only allows you to continue running at (much) degraded performance. Please report an incident if such timeouts occur.

MPI Collectives offload in MPICH2

The network adapters offer an interface called Fabric Collective Accelerator (FCA). FCA combines CORE-Direct® (Collective Offload Resource Engine) on the adapter with hardware assistance on the switch to speed up selected collective operations. Using FCA may significantly improve the performance of an application. Only MPICH2 supports collective offloading. The following MPI collectives are available with FCA:

  • MPI_Reduce
  • MPI_Allreduce
  • MPI_Barrier
  • MPI_Bcast
  • MPI_Allgather
  • MPI_Allgatherv

The list of supported data types includes:

  • All data types for C language bindings, except MPI_LONG_DOUBLE
  • All data types for C reduction functions (C reduction types).
  • The following data types for FORTRAN language bindings: MPI_INTEGER, MPI_INTEGER2, MPI_INTEGER4, MPI_INTEGER8, MPI_REAL, MPI_REAL4 and MPI_REAL8

FCA does not support data types for FORTRAN reduction functions (FORTRAN reduction types).

By default, collective offload is turned off. To enable it, the environment variable MP_COLLECTIVE_OFFLOAD=all|yes must be set. Setting MP_COLLECTIVE_OFFLOAD=none|no disables collective offload. Once enabled, the FCA collective algorithm will be the first one MPICH2 will try. If the FCA algorithm cannot run at this time, a default MPICH2 algorithm will be executed. FCA support is limited to 2K MPI communicators per network.

To enable a subset of the supported FCA algorithms, set the two environment variables MP_COLLECTIVE_OFFLOAD=[all | yes] and MP_MPI_PAMI_FOR=[<collective1>,…,<collectiveN>], where the collectives in the list are identified as "Reduce", "Allreduce", "Barrier", "Bcast", "Allgather" and "AllGatherV". All collectives outside the list will use the default MPICH2 algorithm.
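Put together, enabling FCA for just two collectives might look like this in a job environment:

```shell
# Offload only Allreduce and Barrier to FCA; all other collectives fall
# back to the default MPICH2 algorithms.
export MP_COLLECTIVE_OFFLOAD=yes
export MP_MPI_PAMI_FOR="Allreduce,Barrier"
# poe ./myprog.exe
```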

There is a known performance problem for MPI_Allreduce with small message sizes under the default settings of the module. IBM development is working on the issue; it will be fixed in a later release of the product. In the meantime, there are two ways to work around it:

  1. Use collective offloading (Mellanox FCA) with the MP_COLLECTIVE_OFFLOAD environment variable. This is fast for simple test cases, but we have seen problems (bugs in FCA) for real-world applications (there will be an FCA upgrade in the next SuperMUC maintenance slot which hopefully resolves most issues).
  2. Try 'export MP_MPILIB=pempi'. This is necessary for compilation, linking and execution, so you need to recompile.
    In this (legacy) implementation of the MPI library, a more sophisticated (and much faster) algorithm is used for MPI_Allreduce. This should be the easier and more robust workaround.

Non-deterministic/Non-standard behaviour for MP_COLLECTIVE_OFFLOAD=yes

Although mathematically equivalent, this option may change the order in which the reduction operations are performed and therefore may not give deterministic results; it can lead to (bitwise) different results for the same data. Also, the algorithm may not produce the same (bitwise) result on all nodes (this does not conform to the MPI standard). For numerically sensitive codes this may cause problems. To avoid this, force the FCA (Fabric Collective Accelerator from Mellanox) to be deterministic:

  • export fca_mpi_comm_type=FCA_COMM_DEFAULT

Overlapping of communication and computation

Overlap of communication and computation works if RDMA (remote direct memory access)  is used i.e.,

  • MP_USE_BULK_XFER=yes, which is set by default on SuperMUC, and for message sizes larger than
  • MP_BULK_MIN_MSG_SIZE, which is 8K on SuperMUC, and
  • MP_CSS_INTERRUPT=yes is set.

The default setting for MP_CSS_INTERRUPT on SuperMUC is 'no' (explicitly set in the poe module) so this has to be changed in your environment.
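The three conditions above translate into the following environment settings; the first two are shown only for explicitness, since they already hold by default on SuperMUC:

```shell
export MP_USE_BULK_XFER=yes       # RDMA on (SuperMUC default)
export MP_BULK_MIN_MSG_SIZE=8k    # RDMA for messages >= 8K (SuperMUC default)
export MP_CSS_INTERRUPT=yes       # NOT the default; must be set explicitly
# poe ./myprog.exe
```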

Please note that some programs using non-blocking communication are known to fail (i.e. transmit incorrect data across tasks) if RDMA is used. If this happens to you, please try one of the following: disable RDMA entirely with

export MP_USE_BULK_XFER=no

or keep it enabled,

export MP_USE_BULK_XFER=yes

while raising MP_BULK_MIN_MSG_SIZE so that the affected message sizes use packet mode.

Also, please report the problem to us via an incident ticket, including the information on which workaround (if any) makes things run correctly.

Reducing MPI Buffer Sizes

There usually is a tradeoff between performance and memory usage.

  • MP_EAGER_LIMIT: The eager buffer control is different for MPICH2 (MP_MPILIB=mpich) versus PEMPI (MP_MPILIB=pempi). The setting of MP_BUFFER_MEM has no effect on applications that are compiled with the MPICH2 compiler scripts. If there are not a lot of eager messages, MPICH2 won't take up much memory. The eager limit is set via the value of MP_EAGER_LIMIT.
  • MP_RFIFO_SIZE: Reduce the receive FIFO via the MP_RFIFO_SIZE environment variable. The default is 16 MB, with a minimum setting of 1 MB. Reducing this to 2-4 MB with RDMA enabled may retain performance for large messages and also reduce memory usage.
  • Hybrid applications: If you use OpenMP + MPI, you may get similar performance if you enable RDMA and turn off on-node shared memory through MP_SHARED_MEMORY=no (the default is to use a 256 MB pool for shared memory FIFO slots). You may also want to set the on-node eager limit, MP_EAGER_LIMIT_LOCAL, to a low value so that RDMA is used for most messages within the node.
  • Retransmission buffers: There are also some small retransmission buffers to help short message transfers; not a lot of memory saving is possible, but still some, depending on the application (about 2 MB per MPI process). They help more with blocking MPI send calls than with non-blocking ones: MP_REXMIT_BUF_SIZE (default is 16352) and MP_REXMIT_BUF_CNT (default is 128).
  • MP_USE_BULK_XFER: For large task counts, a significant portion of memory can be consumed by RDMA connections. Should that become a scalability issue, a program may choose to run in FIFO mode. This can be achieved by setting MP_USE_BULK_XFER=no. Buffer tuning environment variables MP_BUFFER_MEM and MP_RFIFO_SIZE can be set to reduce the MPI memory footprint. For a small task count per node, FIFO mode is less efficient than RDMA.
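As a sketch, a memory-lean configuration combining several of the knobs above might look as follows; the values are illustrative starting points, not validated recommendations:

```shell
export MP_USE_BULK_XFER=no    # FIFO mode: avoids per-connection RDMA memory
export MP_RFIFO_SIZE=4M       # receive FIFO down from the 16 MB default
export MP_BUFFER_MEM=16M      # smaller early-arrival buffer (default 64 MB)
# poe ./myprog.exe
```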

MPI one-sided Communication

Many MPI one-sided communication calls require that MP_CSS_INTERRUPT=yes is set.

Further information