SGI Message Passing Toolkit

SGI's Message Passing Toolkit (MPT) comprises user and system tools and libraries that provide optimized MPI functionality for SGI systems such as SGI Altix or SGI ICE.



Setting up for the use of SGI MPT

On all SGI systems, the environment module mpi.mpt is loaded automatically at login. This module makes available all tools needed to compile and execute MPI programs as described in the main MPI document.

 

MPT specific information on compiling and linking

Location of MPI libraries

Some software packages want an entry for the location of the MPI libraries. If you use the wrapper scripts, you should normally be able to leave the corresponding environment variables empty. If you do not wish to use the wrapper scripts, or if you do mixed-language programming, please specify 

-lmpi -lffio -lsma -lpthread   for Fortran,
-lmpi -lsma -lpthread  for C, and
-lmpi -lmpi++abi1002 -lsma -lpthread  for C++.

Note: The -lmpi++abi1002 setting applies to SLES10/11-based systems; the older -lmpi++ should no longer be used.
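
As an illustration of the explicit library lists above, the following shell sketch assembles a link line for a C program without the wrapper scripts. The compiler name icc and the file names myprog.c and myprog.exe are assumptions made for this example. The command is only printed, so that it can be checked before it is run.

```shell
# Explicit MPT link flags for C, taken from the list above.
MPT_LIBS_C="-lmpi -lsma -lpthread"

# Assumed names (not prescribed by MPT): compiler icc, source file myprog.c.
CC="${CC:-icc}"
LINK_CMD="$CC -o myprog.exe myprog.c $MPT_LIBS_C"

# Print the command instead of executing it.
echo "$LINK_CMD"
```

For the multi-threaded library, the -lmpi entry in MPT_LIBS_C would be replaced by -lmpi_mt.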

Multi-threaded MPI

MPT offers a multi-threaded version of its MPI library. When using the compiler wrappers, it is sufficient to specify the -mt switch for linking. If explicit libraries are used, the -lmpi entries from above should be replaced by -lmpi_mt. While the default MPI library provides only MPI_THREAD_SINGLE (i.e., no thread safety), the multi-threaded library provides MPI_THREAD_MULTIPLE (the highest level of thread safety).

Execution of programs

Interactive runs (ICE login node, small tests only)

In this case, you can use the mpirun command:

    mpirun -np 6 ./myprog.exe

This starts 6 MPI tasks. If your program was also compiled with OpenMP and the OMP_NUM_THREADS environment variable is set to a value other than 1, each MPI task may start additional threads.

MPMD startup is also supported via the syntax

    mpirun -np 2 ./myprog1.exe : -np 3 ./myprog2.exe

Batch mode runs

MPT programs which are run under control of a batch queuing system (at LRZ: SLURM) should be started up with the srun_ps command. As a rule, all necessary setup information will be automatically read from the batch configuration file, hence it is usually sufficient to specify

    srun_ps ./myprog.exe

srun_ps will invoke SGI's mpirun with suitable arguments.

Hybrid program startup

For MPI programs which also use OpenMP, placement of tasks and threads is automatically performed if appropriate specifications are handed to srun_ps:

    srun_ps -n 12 -t 4 omplace ./myprog.exe

This run starts 12 MPI tasks, each of which may create up to 4 threads without oversubscribing resources.
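
To check that such a request fits the allocation, the core count implied by the task and thread numbers can be computed before the job is submitted. This is a minimal sketch: the variable names are chosen for illustration, and it assumes that every thread should get a core of its own.

```shell
NTASKS=12      # value passed to srun_ps -n
NTHREADS=4     # value passed to srun_ps -t

# Assumption: one core per thread, so tasks times threads cores are needed.
CORES_NEEDED=$((NTASKS * NTHREADS))

# Print the planned command together with the core count.
echo "srun_ps -n $NTASKS -t $NTHREADS omplace ./myprog.exe  # needs $CORES_NEEDED cores"
```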


Controlling MPI execution via switches

The switches described in the following table can be used on the srun_ps command.

Flag Explanation
-f file_name Picks up command arguments from the file file_name.
-p prefix_string Specifies a string to prepend to each line of output from stderr and stdout for each MPI process.
The following prerequisites and recommendations apply:
  • To delimit lines of text that come from different hosts, each output to stdout/stderr must be terminated with a newline character.
  • The MPI_UNBUFFERED_STDIO environment variable must not be set.
  • Some special strings are available for obtaining MPI-internal information; LRZ recommends a setting like

        -p "|--%g of %G on %@-->"

    For each MPI task, this prints the task identifier, the total number of tasks, and the host the task is running on.

-stats Prints statistics about the amount of data sent with MPI calls during the MPI_Finalize process. Data is sent to stderr. Users can combine this option with the -p option to prefix the statistics messages with the MPI rank. For more details, see the MPI_SGI_stat_print(3) man page.
-v Displays comments on what mpirun is doing when launching the MPI application.
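
For illustration, the switches from this table could be combined as sketched below. The task count of 4 and the order of the switches are assumptions made for this example, and the -n switch is the one used in the hybrid startup example. The sketch only prints the resulting command line.

```shell
# Prefix recommended by LRZ: task identifier, total number of tasks, host name.
PREFIX='|--%g of %G on %@-->'

# -stats prints message statistics during MPI_Finalize,
# -v reports what mpirun is doing at launch.
CMD="srun_ps -n 4 -p \"$PREFIX\" -stats -v ./myprog.exe"

# Print the command instead of executing it.
echo "$CMD"
```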

Using memory mapping

Memory mapping is a feature of SGI MPT that provides optimized communication behaviour for some applications, e.g. by enabling single-copy mechanisms. For some MPT functionality (one-sided calls, shmem calls, and global shared memory), memory mapping is in fact mandatory. By default, this feature is enabled for SGI MPT.

However, memory mapping also has a downside: it makes extensive use of pinned memory pages, which may considerably increase the memory usage of your application unless you take steps to prevent this. The following alternatives are available:

  1. Deactivate default single copy by setting MPI_DEFAULT_SINGLE_COPY_OFF to any value. This will keep memory mapping available for those routines for which it is mandatory.
  2. Increase the value of MPI_BUFFER_MAX. This will suppress using single copy for all messages smaller than the supplied value.
  3. Deactivate memory mapping altogether by setting MPI_MEMMAP_OFF. Beware that certain functionality for which memory mapping is mandatory will not work in this case.
  4. Limit mapped memory usage by setting the MPI_MAPPED_HEAP_SIZE and MPI_MAPPED_STACK_SIZE to some value not too much larger than the maximum size of your messages. Since a silent changeover to non-mapped memory may have a performance impact, you will need to re-check performance after adjusting to new values.
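
The four alternatives could be set up in a job script along the following lines. This is a sketch only: the numeric values are placeholders that would need tuning for a real application, and in practice one would normally choose a single alternative rather than all of them. MPI_MEMMAP_OFF is left commented out because it disables functionality that requires memory mapping.

```shell
# 1. Disable default single copy (any value activates the switch).
export MPI_DEFAULT_SINGLE_COPY_OFF=1

# 2. Use single copy only for messages of at least 1 MiB (placeholder value).
export MPI_BUFFER_MAX=1048576

# 3. Switch memory mapping off altogether (one-sided calls, shmem and
#    global shared memory will then not work).
# export MPI_MEMMAP_OFF=1

# 4. Limit mapped heap and stack to 64 MiB each (placeholder values, in bytes).
export MPI_MAPPED_HEAP_SIZE=$((64 * 1024 * 1024))
export MPI_MAPPED_STACK_SIZE=$((64 * 1024 * 1024))
```

After any such change, performance should be measured again, since a silent changeover to non-mapped memory may affect it.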

The environment variables mentioned above are described in more detail in the table below. Any change to the default environment may cause performance variations, which in turn can depend on your application's message sizes. Tuning therefore needs to be done carefully for each application and each application setup.

Memory usage when using memory mapping

When memory mapping is enabled, tools like ps or top may appear to show a very large memory overhead. This is misleading: the Linux kernel accounts the pinned memory pages to each process, even though only one instance of them exists. To obtain a reliable estimate of memory usage, you need to disable memory mapping.

Environment variables

MPI execution can be more finely controlled by setting certain environment variables to suitable values. The exact settings may depend on the application as well as the parallel configuration the application is run on. The MPT environment module will perform some settings where deviations from the SGI defaults appear reasonable; but of course the user may need to make further changes. Some settings have considerable performance impact!

The entries below are given in the order: name, function, remarks.

Controlling task distribution (e.g., for hybrid parallelism)

Note: These variables should not be set if the omplace utility is used.

MPI_DSM_CPULIST

Specifies a list of CPUs (relative to current CPUset) on which to run an MPI application.

Unset by default. Usually only necessary for complex setups like hybrid and/or MPMD jobs.

MPI_DSM_DISTRIBUTE

Activates NUMA job placement mode. This mode ensures that each MPI process gets a unique CPU and physical memory on the node with which that CPU is associated. The CPUs are chosen by simply starting at relative CPU 0 and incrementing until all MPI processes have been forked. To choose specific CPUs, use the MPI_DSM_CPULIST environment variable.

LRZ/PBS sets this by default.

MPI_DSM_PPM

Sets the number of MPI processes per blade. The value must be less than or equal to the number of cores per blade (or memory channel).

The default is the number of cores per blade.

MPI_OPENMP_INTEROP

Setting this variable modifies the placement of MPI processes to better accommodate the OpenMP threads associated with each process. For this variable to take effect, you must also set MPI_DSM_DISTRIBUTE.

Set to any value to enable.

MPI_OMP_NUM_THREADS

Can be set to a colon separated list of positive integers, representing the value of the OMP_NUM_THREADS environment variable for each host-program specification on the mpirun command line.

Set to OMP_NUM_THREADS value by default, or 1 if OMP_NUM_THREADS is unset.
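
As a sketch of how MPI_OMP_NUM_THREADS relates to an MPMD startup line, the example below assigns 4 threads per task to the first program and 2 to the second, then adds up the resulting thread count. The thread counts are illustrative assumptions, and the mpirun line is printed rather than executed. As noted at the start of this section, the variable should not be set if omplace is used.

```shell
# One entry per host-program specification, separated by colons.
export MPI_OMP_NUM_THREADS=4:2

# 2 tasks of myprog1.exe with 4 threads each, 3 tasks of myprog2.exe with 2 each.
TOTAL_THREADS=$((2 * 4 + 3 * 2))

echo "mpirun -np 2 ./myprog1.exe : -np 3 ./myprog2.exe  # $TOTAL_THREADS threads in total"
```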

Controlling task execution

MPI_NAP

This variable affects the way in which ranks wait for events to occur:

  • unset: The MPI library spins in a tight loop when awaiting events. Best possible response time, but each waiting rank uses CPU time at wall-clock rates.
  • defined with no value (export MPI_NAP=""): The MPI library makes a system call while waiting, which might yield the CPU to another eligible process that can use it. If no such process exists, the rank receives control back nearly immediately, and CPU time accrues at near wall-clock rates. If another process does exist, it is given some CPU time, after which the MPI rank is again given the CPU to test for the event.
  • set to integer value: the rank sleeps for that many milliseconds before again testing to determine if an event has occurred. This dramatically reduces the CPU time that is charged against the rank, and might increase the system's "idle" time. This setting is best if there is usually a significant time difference between the times that sends and matching receives are posted.

Setting to a moderate value is useful for master-slave codes where the master shares CPU resources with one of the slaves. Defining MPI_NAP without value is best used if the system is oversubscribed (there are more processes ready to run than there are CPUs). Leaving MPI_NAP undefined is best if sends and matching receives occur nearly simultaneously.

MPI_UNBUFFERED_STDIO

Disables buffering of stdout/stderr. If MPI processes produce very long output lines, the program may crash because the STDIO buffer runs out; setting this variable avoids that.

Set to any value to enable. If enabled, the option -prefix is ignored.
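
The three MPI_NAP modes described in this section translate into shell settings as sketched here. The value of 10 milliseconds is an arbitrary example, not a recommendation. When the lines are run in sequence, only the last assignment remains in effect.

```shell
# Mode 1: unset, ranks spin in a tight loop while waiting.
unset MPI_NAP

# Mode 2: defined without a value, ranks make a system call that may yield the CPU.
export MPI_NAP=""

# Mode 3: integer value, ranks sleep that many milliseconds between tests.
export MPI_NAP=10
```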

Memory mapping, remote memory access

MPI_BUFFER_MAX

Specifies the minimum message size, in bytes, for which a message is considered a candidate for single-copy transfer.
Setting this variable to a large value, e.g. larger than the maximum message size, may improve performance considerably.

LRZ sets this to a default of 32768.

MPI_DEFAULT_SINGLE_COPY_OFF

Disables the single-copy mode. Users of MPI_Send should continue to use the MPI_BUFFER_MAX environment variable to control single-copy.

If unset, single-copy mode is enabled; transfers of more than 2000 bytes that use MPI_Isend, MPI_Sendrecv, MPI_Alltoall, MPI_Bcast, MPI_Allreduce and MPI_Reduce then use the single-copy optimization. Set to any value to disable.

MPI_MAPPED_HEAP_SIZE

Sets the new size (in bytes) for the amount of heap that is memory mapped per MPI process.

Default: The physical memory available per CPU less the static region size. This variable will only have an effect if memory mapping is on.

MPI_MAPPED_STACK_SIZE

Sets the new size (in bytes) for the amount of stack that is memory mapped per MPI process.

Default: The stack size limit. If stack size is set to unlimited, the mapped region is set to the physical memory available per CPU. This variable will only have an effect if memory mapping is on.

MPI_MEMMAP_OFF

Turns off the memory mapping feature. The memory mapping feature provides support for single-copy transfers and MPI-2 one-sided communication on Linux for single and multi-partition jobs.

Unset by default. Set to any value to switch memory mapping off.

Diagnostics and debugging support

MPI_CHECK_ARGS

Run-time checking of MPI function arguments. Segmentation faults might occur if bad arguments are passed to MPI.

Useful for debugging. Adds several microseconds to latency!

MPI_COREDUMP

Controls which ranks of an MPI job can dump core on receipt of a core-dumping signal. Valid values are:

  • FIRST: The first rank on each host to receive a core-dumping signal should dump core.
  • NONE: No rank should dump core.
  • ALL: All ranks should dump core if they receive a core-dumping signal.
  • INHIBIT: disables MPI signal-handler registration for core-dumping signals; stack traceback and MPI signal handler invocation are then suppressed.

Default setting is as if FIRST were specified. Please note that you will need to issue the command ulimit -c unlimited before starting MPI execution to actually obtain core files since a maximum core size of 0 is set as system default. Intel's idb is used to generate the traceback information; use of the -g -traceback compilation switch is recommended to enable source location.

MPI_DSM_VERBOSE

Print information about process placement unless MPI_DSM_OFF is also set. Output is sent to stderr.

Unset by default. Set to any value to enable.

MPI_MEMMAP_VERBOSE

Display additional information regarding the memory mapping initialization sequence. Output is sent to stderr.

Unset by default. Set to any value to enable.

MPI_SHARED_VERBOSE

Setting this variable allows for some diagnostic information concerning messaging within a host to be displayed on stderr.

Off by default.

MPI_SLAVE_DEBUG_ATTACH

Specifies the MPI process to be debugged. If you set MPI_SLAVE_DEBUG_ATTACH to N, the MPI process with rank N prints a message during program startup, describing how to attach to it from another window using the gdb or idb debugger. The message includes the number of seconds you have to attach the debugger to process N. If you fail to attach before the time expires, the process continues.

Off by default.

MPI_STATS

Enables printing of MPI internal statistics.

Off by default. Note: This variable should not be set if the program uses threads.
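
A debugging session might combine several of the diagnostic settings in this section as follows. This is a sketch that assumes a Bourne-type shell. Rank 0 for the debugger attach is an arbitrary choice, and MPI_CHECK_ARGS is assumed to be enabled by setting it to any value, like the other switches listed here. Raising the core size limit can fail if the system imposes a hard limit.

```shell
# Allow core files; the system default maximum core size is 0.
ulimit -c unlimited 2>/dev/null || echo "could not raise core size limit" >&2

# Let every rank that receives a core-dumping signal write a core file.
export MPI_COREDUMP=ALL

# Check MPI call arguments at run time (assumed: any value enables; adds latency).
export MPI_CHECK_ARGS=1

# Report process placement on stderr.
export MPI_DSM_VERBOSE=1

# Make rank 0 print instructions for attaching gdb or idb during startup.
export MPI_SLAVE_DEBUG_ATTACH=0
```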

MPI-internal limits

MPI_BUFS_PER_HOST

Number of shared message buffers (of size 16 KB) that MPI is to allocate for each host. These buffers are used to send and receive long inter-partition messages.

SGI default is 32. Increase if default buffering proves insufficient.

MPI_BUFS_PER_PROC

Number of shared message buffers (of size 16 KB) that MPI is to allocate for each MPI task. These buffers are used to send and receive long intra-partition messages.

SGI default is 32. Increase if default buffering proves insufficient.

MPI_COMM_MAX

Maximum number of communicators that can be used in an MPI program.

Default value is 256.

MPI_GROUP_MAX

Maximum number of groups available for each MPI process.

Default value is 32.

MPI_MSGS_MAX

This variable can be set to control the total number of message headers of size 128 kBytes that can be allocated. This allocation applies to messages exchanged between processes on a single host. If you set this variable, specify the maximum number of message headers.

May improve performance if your application generates many small messages. Default is 512.

MPI_REQUEST_MAX

Determines the maximum number of nonblocking sends and receives that can simultaneously exist for any single MPI process. Use this variable to increase internal default limits. MPI generates an error message if this limit (or the default, if not set) is exceeded.

The default value is 16384.

MPI_TYPE_DEPTH

Sets the maximum number of nesting levels for derived data types. Limits the maximum depth of derived data types that an application can create. MPI generates an error message if this limit (or the default, if not set) is exceeded.

By default, 8 levels can be used.

MPI_TYPE_MAX

Determines the maximum number of data types that can simultaneously exist for any single MPI process. Use this variable to increase internal default limits. MPI generates an error message if this limit (or the default, if not set) is exceeded.

1024 by default.

MPT Extension: Shmem programming interface

In addition to MPI calls, an SPMD parallel program on Altix systems may also use the efficiently implemented shmem library calls. These make use of the RDMA facilities of the SGI interconnect (NUMAlink) and also work across partition boundaries. Shmem calls are generally similar in semantics to one-sided MPI communication calls, but easier to use. For example,

shmem_double_put(target,source,len,pe)

is a facility for transferring a double precision array source(len) to the location target(len) on the remote process pe. The target object must be remotely accessible (also called symmetric), i.e., typically either a static array or an array allocated dynamically by executing the collective call shpalloc(3F) on a suitably defined Cray-type pointer. Repeatedly executed shmem calls targeting the same process usually require an additional synchronization call (in the above case, shmem_fence()) to enforce memory ordering. Also note that the interface is not generic: each data type used has its own distinct API call, if one is available at all. Further functionality includes:

  • shmem_get: transfer data from remote to local process
  • shmem_ptr: return a pointer to a memory location on a remote process
  • collective calls for reduction, broadcast, barrier
  • administrative calls for starting up and getting process IDs: If shmem is used in conjunction with MPI, please use the standard MPI administrative calls instead!

Due to the cache coherency properties of the Altix systems, the cache management functions - while still available for compatibility - are not actually required. Please consult the documentation referenced below for detailed shmem information.

Global shared memory

The GSM feature provides expanded shared memory capabilities across partitioned Altix systems and additional shared memory placement specifications within a single host configuration. Additional, albeit non-portable, API calls provide a way to allocate a global shared memory segment with desired placement options, free that segment, and obtain information about that segment. For example, calling the subroutine

gsm_alloc(len, placement, flags, comm, base, ierror)

will provide a memory segment of size len bytes at address base (accessed via a Cray-type pointer) for all processes in the MPI communicator comm. Data written to this segment by any process will be visible to all other processes after a synchronization call (usually MPI_Barrier). Please consult the documentation referenced below for detailed GSM information.

Documentation

General Information on MPI

Please refer to the MPI page at LRZ for the API documentation and information about the different flavors of MPI available.

Manual pages

  • For SGI MPT, please consult the man pages mpi(1) and mpirun(1). Also, each MPI API call has its own man page.
  • For the shmem API, consult the man page shmem_intro(1). Again, each shmem routine has its individual man page.
  • For the global shared memory API, consult the man page gsm_intro(1), which also contains references to further API calls.

SGI's MPT documentation

Some of the following links lead to a password-protected area. To obtain user name and password for access, please type the command get_manuals_passwd when logged in to the system.

  • MPT page on SGI's web site with summary information
  • sgi MPT User's Guide, in PDF (200 kByte) and HTML format.