ALIs

kommt noch

IBM MPI: An MPI implementation for SuperMUC

R. Bader

The MPI implementation by IBM is the default MPI environment in use on the Petaflop-class Supercomputer SuperMUC. This document gives implementation-specific hints on the usage of this MPI variant


Table of contents


Overview

IBM's Parallel Environment is a development and execution environment for parallel applications (distributed-memory, message-passing applications running across multiple nodes). It is designed to support development, testing, debuggging, tuning and running high-performance parallel applications written in C, C++ and Fortran on highly scalable SMP clusters.

The Parallel Environment includes the following components:

  • The Parallel Operating Environment (POE) for submitting and managing jobs.
  • IBM's MPI and LAPI/PAMI libraries for communication between parallel tasks.
  • A parallel debugger (pdb) for debugging parallel programs.
  • IBM High Performance Computing Toolkit for analyzing performance of parallel and serial applications.

This document explains the basic usage of POE and IBM MPI; the Parallel Environment Tools are explained elsewhere.

Starting up MPI programs using mpiexec/POE

SPMD

export MP_NODES=4
export MP_PROCS=160
poe ./myexe 
#or
mpiexec ./myexec

MPMD

cat >cmdfile <<EOD
./master
./slave1
./slave1
./slave1
./slave2
./slave2
EOD
export MP_NODES=1
export MP_PROCS=6
export MP_CMDFILE=cmdfile
poe ./myexe 
#or
mpiexec ./myexec

Environment variables used by mpiexec/POE

Normally, the poe module provides reasonable settings for the control variables used by POE. The following table gives a full explanation of these control variables; instead of setting an environment variable, an execution flag can be used.

Environment Variable/Command Line Flag(s): Item Possible Values: Default:

Variable set by POE

MP_CHILD Number which is set for each task ; may be used to perform trival parallelization (task farming) with serial jobs.
if myscript contains:
./my_serial_prog <in.$MP.CHILD >out.$MP_CHILD
poe myscript -procs 40

would start 40 times my_serial_prog with different input and output

0 - num_tasks can not be set

Partition manager control

MP_PROCS
-procs
The number of program tasks.

Only useful for interactive jobs; otherwise it is set be LoadLeveler.

Any number from 1 to the maximum supported configuration. 1
MP_NODES
-nodes
To specify the number of processor nodes on which to run the parallel tasks.

Only useful for interactive jobs; otherwise it is set be LoadLeveler.

It may be used alone or in conjunction with MP_TASKS_PER_NODE and/or MP_PROCS,

Any number from 1 to the maximum supported configuration. None
MP_TASKS_PER_ NODE
-tasks_per_node
To specify the number of tasks to be run on each of the physical nodes.

Only useful for interactive jobs; otherwise it is set be LoadLeveler.

It may be used alone or in conjunction with MP_NODES.

Any number from 1 to the maximum supported configuration. None
MP_RETRY
-retry
The period of time (in seconds) between processor node allocation retries by POE if there are not enough processor nodes immediately available to run a program.
This is valid only if you are using LoadLeveler. If the character string wait is specified instead of a number, no retries are attempted by POE, and the job remains enqueued in LoadLeveler until LoadLeveler either schedules the job or cancels it.
An integer greater than or equal to 0, or the case-insensitive value wait. 0 (no retry)
MP_TIMEOUT
(no associated command line flag)
The length of time that POE waits before abandoning an attempt to connect to the remote nodes. Any number greater than 0. If set to 0 or a negative number, the value is ignored. 150 seconds
MP_HOSTFILE
-hostfile -hfile
The name of a host list file for node allocation. Any file specifier or the word NULL. host.list in the current directory.
MP_PULSE
-pulse
The interval (in seconds) at which POE checks the remote nodes to ensure that they are actively communicating with the home node. An integer greater than or equal to 0. 600
MP_MSG_API
-msg_api
To indicate to POE which message passing API is being used by the application code.
  • MPI: Indicates that the application makes only MPI calls.
  • LAPI: Indicates that the application makes only LAPI calls.
  • MPI_LAPI: Indicates that calls to both message passing APIs are used in the application, and the same set of communication resources (windows, IP addresses) is to be shared between them.
  • MPI,LAPI: Indicates that calls to both message passing APIs are used in the application, with dedicated resources assigned to each of them.
  • LAPI,MPI: Has a meaning identical to MPI,LAPI.
MPI | LAPI | MPI_LAPI |MPI,LAPI | LAPI,MPI MPI

Job Specification

MP_CMDFILE
-cmdfile
The name of a POE commands file used to load the nodes of your partition. If set, POE will read the commands file rather than STDIN. Any file specifier. None
MP_NEWJOB
-newjob
Whether or not the Partition Manager maintains your partition for multiple job steps. yes  no no
MP_PGMMODEL
-pgmmodel
The programming model you are using. spmd  | mpmd spmd
MP_RMFILE
-rmfile
Exploit new or existing LoadLeveler (as Resource Manager)  functionality that is not available using POE options. This includes specification of:
  • task geometry
  • blocking factor
  • machine order
  • consumable resources
  • memory requirements
  • disk space requirements
  • machine architecture
  • run jobs from more than 1 pool

For more information on the LoadLeveler functionality you can exploit, refer to For more information, see LoadLeveler pages.

Run parallel jobs without specifying a host file or pool, thereby causing LoadLeveler to select nodes for the parallel job from any in its cluster.

Any relative or full path name. None
MP_SAVE_RMFILE
-save_rmfile
When using LoadLeveler for node allocation, the name of the output LoadLeveler job command file to be generated by the Partition Manager. The output LoadLeveler job command file will show the LoadLeveler settings that result from the POE environment variables and/or command line options for the current invocation of POE.

It may be saved for debugging and/or for further execution (together with MP_LLFILE or -llfile) without specifing the arguments for poe.

Any relative or full path name. None
MP_SAVEHOSTFILE
 -savehostfile
The name of an output host list file to be generated by the Partition Manager. These are tMay be saved for debugging. Any relative or full path name. None

 I/O control

MP_LABELIO
-labelio
Whether or not output from the parallel tasks is labeled by task id. yes  | no  
MP_STDINMODE
-stdinmode
The input mode. This determines how input is managed for the parallel tasks.
  • all: All tasks receive the same input data from STDIN.
  • none: No tasks receive input data from STDIN; STDIN will be used by the home node only.
  • a task id: STDIN is only sent to the task identified.
all | none| taskid  
MP_STDOUTMODE
-stdoutmode
The output mode. This determines how STDOUT is handled by the parallel tasks.
  • unordered: All tasks write output data to STDOUT asynchronously.
  • ordered: Output data from each parallel task is written to its own buffer. Later, all buffers are flushed, in task order, to STDOUT.
  • a task id: Only the task indicated writes output data to STDOUT.
unordered | ordered | taskid  

Diagnostic Information

MP_INFOLEVEL
-infolevel
The level of message reporting.
  • 0: Error
  • 1: Warning and error
  • 2: Informational, warning, and error
  • 3: Informational, warning, and error. Also reports high-level diagnostic messages for use by the IBM Support Center.
  • 4, 5, 6: Informational, warning, and error. Also reports high- and low-level diagnostic messages for use by the IBM Support Center
0 - 6
1
MP_PRINTENV
-printenv
Whether to produce a report of the current settings of MPI environment variables, across all tasks in a job. If yes is specified, the MPI environment variable information is gathered at initialization time from all tasks, and forwarded to task 0, where the report is prepared. If a script_name is specified, the script is run on each node, and the output script is forwarded to task 0 and included in the report.

When a variable's value is the same for all tasks, it is printed only once. If it is different for some tasks, an asterisk (*) appears in the report after the word "Task".

  • no: Do not produce a report of MPI environment variable settings.
  • yes: Produce a report of MPI environment variable settings.
  • script_name: Produce the report (same as yes), then run the script specified here.
no | yes | scriptname no
MP_EUIDEVELOP
-euidevelop
Controls the level of parameter checking during execution.
  • yes: Setting this to yes enables some intertask parameter checking which may help uncover certain problems, but slows execution.
  • no or nor: Normal mode does only relatively inexpensive, local parameter checking. Setting this variable to min allows PE MPI to bypass parameter checking on all send and receive operations.
  • deb (debug) checking is intended for developing applications, and can significantly slow performance.
  • min (for minimum mode) feedback.should only be used with well tested applications because a bug in an application running with min will not provide useful error
yes | no | nor | deb | min no
MP_STATISTICS
-statistics
Provides the ability to gather communication statistics for User Space jobs. yes | no | print no
Pinning of Tasks and Threads
MP_PE_AFFINITY
-pe_affinity
When MP_PE_AFFINITY=yes then MPI/POE runtime will not pass information to scheduler such as Load Leveler and it will do its own affinity setting using MP_TASK_AFFINITY yes | no yes
MP_TASK_AFFINITY
-task_affinity
Setting this environment variable attaches task of a parallel job to one of the system cpusets.
  • CORE   Default: specifies that each MPI task runs on a single physical processor core.
  • CORE:n  Specifies the number of processor cores to which the threads of an MPI task are constrained (one thread per core), typically n is the number of OMP_NUM_THREADS (should be used for OpenMP or hybrid jobs)
  • CPU  Specifies that each MPI task runs on a single logical CPU.
  • CPU:n Specifies the number of of logical CPUs to which the threads of an MPI task are constrained (one thread per cpu), typically n is the number of OMP_NUM_THREADS
  • MCM Specifies that the tasks are allocated in a round-robin fashion among the sockets of a node, should only be use if not all cores of a node are not used.
  • list-of-numbers  Specifies that the tasks are assigned on a round-robin basis to this set of sockets
  • -1  Specifies that no affinity request will be made (disables task affinity).

CPU or CORE will generate  a CPU mask with one or two entries.

CPU:n will generatea CPU which contains all CPU but will set KMP_AFFINITY different for each MPI task, e.g.
KMP_AFFINITY="proclist=[0-5],exclicit"

 =core|cpu core

Tuning

MP_USE_BULK_XFER
-use_bulk_xfer
This transparently causes portions of the user's virtual address space to be pinned and mapped to a communications adapter. The low level communication protocol will then use Remote Direct Memory Access (RDMA, also known as bulk transfer) to copy (pull) data from the send buffer to the receive buffer as part of the MPI receive. yes | no yes
MP_BULK_MIN_MSG_SIZE
-bulk_min_msg_size
Contiguous messages with data lengths greater than or equal to the value you specify for this environment variable will use the bulk transfer path. Messages with data lengths that are smaller than the value, or are noncontiguous, will use packet mode transfer.
Based on MPI benchmark measurements the altogether best bandwidth performance is achieved with a value of MP_BULK_MIN_MSG_SIZE=512k. However, the real performance will depend on the actual used message sizes inside an application and will results in higher memory consumption.
The acceptable range is from
4096 to 2147483647 (INT_MAX).

nnnnn (byte)
nnn
K (kB)
nnM (MB)
nnG (GB)

64K
MP_BUFFER_MEM
-buffer_mem

 

To control the amount of memory MPI allows for the buffering of early arrival message data. Message data that is sent without knowing if the receive is posted is said to be sent eagerly. If the message data arrives before the receive is posted, this is called an early arrival and must be buffered at the receive side.
See MP_BUFFER_MEM details in the IBM Manual.

Can also be specified in the form MP_BUFFER_MEM=M1,M2, where M1 specifies the amount of pre-allocated memory. M2 specifies an upper bound on the amount of early arrival buffer memory

nnnnn (byte)
nnn
K (kB)
nnM (MB)
nnG (GB)
64 MB
(User Space
and IP)
MP_EAGER_LIMIT
-eager_limit
To change the threshold value for message size, above which rendezvous protocol is used.
To ensure that at least 32 messages can be outstanding between any two tasks, MP_EAGER_LIMIT will be adjusted based on the number of tasks according to the following table, when the user has specified neither MP_BUFFER_MEM nor MP_EAGER_LIMIT:
Number of
Tasks     MP_EAGER_LIMIT
------------------------
   1 to  256       32768
 257 to  512       16384
 513 to 1024        8192
1025 to 2048        4096
2049 to 4096        2048
4097 to 8192        1024

MPI uses the MP_BUFFER_MEM and the MP_EAGER_LIMIT values that are selected for a job to determine how many complete point-to-point messages, each with a size that is equal to or less than the eager_limit, can be sent eagerly from every task of the job to a single task, without causing the single target to run out of buffer space. This is done by allocating to each sending task a number of message credits for each target. The sending task will consume one message credit for each eager send to a particular target. It will get that credit back after the message has been matched at that target. The following equation is used to calculate the number of credits to be allocated:

MP_BUFFER_MEM / (MP_PROCS * MAX(MP_EAGER_LIMIT, 64))

MPI uses this equation to ensure that there are at least two credits for each  target. If needed, MPI reduces the initially selected value of MP_EAGER_LIMIT, or increases the initially selected value of MP_BUFFER_MEM, in order to achieve this minimum threshold of two credits for each target.

nnnnn
 
65536
MP_CC_BUF_MEM
CC_BUF_MEM
 Specifies the size of the Early Arrival (EA) buffer that is
used by the communication subsystem to buffer eager send
messages, for collective communications operations, that arrive
before there is a matching receive posted.
nnnnn (byte)
nnn
K (kB)
nnM (MB)
nnG (GB)
 
MP_CC_SCRATCH_BUF
-cc_scratch_buf
Use the fastest collective communication algorithm even if that algorithm requires allocation of more scratch buffer space. yes | no
 
yes
MP_SINGLE_
THREAD
-single_thread
To avoid lock overheads in a program that is known to be single-threaded. MPI-IO and MPI one-sided communications are unavailable if this variable is set to yes. Results are undefined if this variable is set to yes with multiple application message passing threads in use. See IBM Parallel Environment: MPI Programming Guide for more information. yes
no
yes
MP_WAIT_MODE
-wait_mode
Set: to specify how a thread or task behaves when it discovers it is blocked, waiting for a message to arrive.

MPI_WAIT_MODE set to nopoll may reduce CPU consumption for applications that post a receive call on a separate thread, and that receive call does not expect an immediate message arrival. Also, using MPI_WAIT_MODE set to nopoll may increase delay between message arrival and the blocking call's return. It is recommended that MP_CSS_INTERRUPT be set to yes when the nopoll wait is selected, so that the system wait can be interrupted by the arrival of a message. Otherwise, the nopoll wait is interrupted at the timing interval set by MP_POLLING_INTERVAL

nopoll | poll | sleep | yield poll (for User Space and IP)
MP_POLLING_
INTERVAL
-polling_interval
To change the polling interval (in microseconds). An integer between 1 and
2 billion.
400000
MP_CSS_INTERRUPT
-css_interrupt

User Space is an unreliable packet transport (packets may be dropped during transport without an error being reported), the message dispatcher manages packet acknowledgment and retransmission with a sliding window protocol. This message dispatcher is also run on a hidden thread once every few hundred milliseconds and, if environment variable MP_CSS_INTERRUPT is set, upon notification of packet arrival.

To specify whether or not arriving packets generate interrupts. Using this environment variable may provide better performance for certain applications. Setting this variable explicitly will suppress the MPI-directed switching of interrupt mode, leaving the user in control for the rest of the run.

For more information, refer to the MPI_FILE_OPEN and MPI_WIN_CREATE subroutines in IBM Parallel Environment: MPI Subroutine Reference.

yes | no no

Advanced Tuning Parameters

MP_HINTS_FILTERED
-hints_filtered
To specify whether or not MPI info objects reject hints (key and value pairs) that are not meaningful to the MPI implementation. yes | no yes
MP_IONODEFILE
-ionodefile
To specify the name of a parallel I/O node file, a text file that lists the nodes that should be handling parallel I/O. Setting this variable enables you to limit the number of nodes that participate in parallel I/O and guarantees that all I/O operations are performed on the same node.

See Determining which nodes will participate in parallel file I/O for more information.

Any relative path name or full path name. None. All nodes will participate in parallel I/O.
MP_MSG_ENVELOPE_BUF
-msg_envelope_buf
The size of the message envelope buffer (that is, uncompleted send and receive descriptors). Any positive number. There is no upper limit, but any value less than 1 MB is ignored. 8 MB
MP_RETRANSMIT_
INTERVAL
-retransmit_interval
Control how often the communication subsystem library checks to see if it should retransmit packets that have not been acknowledged. The value nnnn is the number of polling loops between checks. The acceptable range
is from 1000 to INT_MAX
400000
(User Space)
MP_THREAD_
STACKSIZE
-thread_stacksize
To specify the additional stack size allocated for user subroutines running on an MPI service thread. If you do not allocate enough space, the program may encounter a SIGSEGV exception or more subtle failures.

nnnnn
nnnK (where:K = 1024 bytes)
nnM (where: M = 1024*1024 bytes)

bytes
 
0
MP_TIMEOUT
(no associated command line flag)
To change the length of time (in seconds) the communication subsystem will wait for a connection to be established during message-passing initialization. An integer greater than 0 150
MP_ACK_THRESH
-ack_thresh
Allows the user to control the packet acknowledgement threshold. Specify a positive integer. A positive integer limited to 31 30
MP_IO_BUFFER_SIZE
-io_buffer_size
To specify the default size of the data buffer used by MPI-IO agents.

nnnn
nnnK (where K=1024 bytes)
nnnM (where M=1024*1024 bytes)

bytes The number of
bytes that
corresponds to
16 file blocks.
MP_IO_ERRLOG
-io_errlog
To specify whether or not to turn on I/O error logging. yes | no no
MP_REXMIT_BUF_SIZE
-rexmit_buf_size
The maximum LAPI level message size that will be buffered locally, to more quickly free up the user send buffer. This sets the size of the local buffers that will be allocated to store such messages, and will impact memory usage, while potentially improving performance. The MPI application message size supported is smaller by, at most, 32 bytes. nnn bytes (where:
nnn > 0 bytes)
16352 bytes
MP_REXMIT_BUF_CNT
-rexmit_buf_cnt
The number of retransmit buffers that will be allocated per task. Each buffer is of size MP_REXMIT_BUF_SIZE * MP_REXMIT_BUF_CNT. This count controls the number of in-flight messages that can be buffered to allow prompt return of application send buffers. nnn (where:
nnn > 0)
128

Gprof and Corefile Handling

MP_PROFDIR
-profdir
Allows you to specify the directory into which POE stores the gmon.out file for each task. A gmon.out file contains profiling data and is produced by compiling a program with the -pg flag. Any relative path name or full path name. profdir.task_id
MP_COREFILE_FORMAT
corefile_format.
POE processes that terminate abnormally will can generate standard corefiles. If you prefer, you can instruct POE to write the stack trace  information to standard error instead.

It is highly recommended to limit the amount of information written for large parallel application by setting it to STDERR.

noset | STDERR STDERR
MP_COREDIR
-coredir
Creates a separate directory for each task's core file. Any valid directory name, or "none" to bypass creating a new directory. coredir.taskid

Other

MP_FENCE
(no associated command line flag)
A fence character string for separating arguments you want parsed by POE from those you do not. Any string. None
MP_NOARGLIST
(no associated command line flag)
Whether or not POE ignores the argument list. If set to yes, POE will not attempt to remove POE command line flags before passing the argument list to the user's program. yes | no no

Further information