SGI Message Passing Toolkit
SGI's Message Passing Toolkit (MPT) comprises user and system tools and libraries that provide optimized MPI functionality for SGI systems such as SGI Altix or SGI ICE.
Table of contents
- Setting up for the use of sgi MPT
- MPT specific information on compiling and linking
- Location of MPI libraries
- Multi-threaded MPI
- Execution of programs
- Interactive runs (ICE login node, small tests only)
- Batch mode runs
- Hybrid program startup
- Controlling MPI execution via switches
- Using memory mapping
- Memory usage when using memory mapping
- Environment variables
- MPT Extension: Shmem programming interface
- Global shared memory
- Documentation
- General Information on MPI
- Manual pages
- SGI's MPT documentation
Setting up for the use of sgi MPT
On all sgi systems, an environment module mpi.mpt is automatically loaded at login. This environment module makes available all tools needed to compile and execute MPI programs as described in the main MPI document.
MPT specific information on compiling and linking
Location of MPI libraries
Some software packages want an entry for the location of the MPI libraries. If you use the wrapper scripts, you should normally be able to leave the corresponding environment variables empty. If you do not wish to use the wrapper scripts, or if you do mixed-language programming, please specify the following libraries:
- Fortran: -lmpi -lffio -lsma -lpthread
- C: -lmpi -lsma -lpthread
- C++: -lmpi -lmpi++abi1002 -lsma -lpthread
Note: The -lmpi++abi1002 setting applies on SLES10/11 based systems; the older -lmpi++ should no longer be used.
Multi-threaded MPI
MPT offers a multi-threaded version of its MPI library. When using the compiler wrappers, it is sufficient to specify the -mt switch for linking. If explicit libraries are used, the -lmpi entries from above should be replaced by -lmpi_mt. While the default MPI library provides only MPI_THREAD_SINGLE (i.e., no thread safety), the multi-threaded library provides MPI_THREAD_MULTIPLE (the highest level of thread safety).
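If you link without the wrapper scripts, the library lists given above can be assembled by a small helper. The following is only a minimal sketch: the function name mpt_link_flags and its arguments are hypothetical, and only the library names are taken from the text above. The C++ ABI library is left unchanged in the multi-threaded case, because the text only mentions replacing -lmpi.

```shell
#!/bin/sh
# Hypothetical helper: print the MPT link flags for a language.
# Usage: mpt_link_flags <fortran|c|cxx> [mt]
mpt_link_flags() {
    lang=$1
    mpilib=-lmpi
    # Multi-threaded variant: -lmpi is replaced by -lmpi_mt.
    if [ "${2:-}" = "mt" ]; then
        mpilib=-lmpi_mt
    fi
    case "$lang" in
        fortran) echo "$mpilib -lffio -lsma -lpthread" ;;
        c)       echo "$mpilib -lsma -lpthread" ;;
        cxx)     echo "$mpilib -lmpi++abi1002 -lsma -lpthread" ;;
        *)       echo "unknown language: $lang" >&2; return 1 ;;
    esac
}

# Example: flags for a multi-threaded C program.
mpt_link_flags c mt
```

The printed string can then be appended to the link command of whichever compiler you use.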
Execution of programs
Interactive runs (ICE login node, small tests only)
In this case, you can use the mpirun command. For example,
mpirun -np 6 ./myprog.exe
will start up 6 MPI tasks. If your program was also compiled with OpenMP and the OMP_NUM_THREADS environment variable is set to a value other than 1, each MPI task may start additional threads.
MPMD startup is also supported via the syntax
mpirun -np 2 ./myprog1.exe : -np 3 ./myprog2.exe
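For MPMD runs with several executables, it can be convenient to generate the command line rather than type it. The sketch below assumes a hypothetical helper mpmd_cmdline that takes "count:program" pairs; it only prints the resulting mpirun command in the colon-separated syntax shown above and does not execute it.

```shell
#!/bin/sh
# Hypothetical helper: build an MPMD mpirun command line from
# "count:program" pairs and print it instead of running it.
mpmd_cmdline() {
    cmd="mpirun"
    sep=""
    for spec in "$@"; do
        count=${spec%%:*}    # text before the first colon
        prog=${spec#*:}      # text after the first colon
        cmd="$cmd$sep -np $count $prog"
        sep=" :"
    done
    echo "$cmd"
}

# Same layout as the example above: 2 tasks of myprog1, 3 of myprog2.
mpmd_cmdline 2:./myprog1.exe 3:./myprog2.exe
```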
Batch mode runs
MPT programs which are run under control of a batch queuing system (at LRZ: SLURM) should be started up with the srun_ps command. As a rule, all necessary setup information will be automatically read from the batch configuration file, hence it is usually sufficient to specify
srun_ps ./myprog.exe
srun_ps will invoke SGI's mpirun with suitable arguments.
Hybrid program startup
For MPI programs which also use OpenMP, placement of tasks and threads is automatically performed if appropriate specifications are handed to srun_ps:
srun_ps -n 12 -t 4 omplace ./myprog.exe
This run would start 12 MPI tasks, each of which might create 4 threads without undue resource overuse.
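Whether a hybrid run overuses resources depends on the product of tasks and threads. The following sketch checks that product against a core count supplied by the caller. The helper name hybrid_fits is hypothetical, and the number of cores granted to the job (48 here) is an assumption that you would take from your own batch request.

```shell
#!/bin/sh
# Hypothetical helper: check that tasks x threads fits into the
# number of cores granted to the job (passed in by the caller).
hybrid_fits() {
    tasks=$1
    threads=$2
    cores=$3
    needed=$((tasks * threads))
    if [ "$needed" -le "$cores" ]; then
        echo "ok: $needed of $cores cores used"
    else
        echo "oversubscribed: $needed cores needed, $cores available"
        return 1
    fi
}

# 12 MPI tasks with 4 threads each need 48 cores; print the matching
# srun_ps command only if the layout fits.
hybrid_fits 12 4 48 && echo "srun_ps -n 12 -t 4 omplace ./myprog.exe"
```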
Controlling MPI execution via switches
The switches described in the following table can be used on the srun_ps command.
| Flag | Explanation |
|---|---|
| -f file_name | pick up command arguments from file file_name |
| -p prefix_string | Specifies a string to prepend to each line of output from stderr and stdout for each MPI process. Prerequisites and recommendations for its use are described in the mpirun (1) man page. |
| -stats | Prints statistics about the amount of data sent with MPI calls during the MPI_Finalize process. Data is sent to stderr. Users can combine this option with the -p option to prefix the statistics messages with the MPI rank. For more details, see the MPI_SGI_stat_print(3) man page. |
| -v | Displays comments on what mpirun is doing when launching the MPI application. |
Using memory mapping
Memory mapping is a feature of SGI MPT that provides optimized communication behaviour for some applications, e.g. by enabling single-copy mechanisms. For some MPT functionality, such as one-sided calls, shmem calls or global shared memory, memory mapping is in fact mandatory. By default, this feature is enabled for SGI MPT.
However, memory mapping also has a downside: it makes extensive use of pinned memory pages, which may considerably and uncontrollably increase the memory usage of your application unless you take steps to prevent this. The following alternatives are available:
- Deactivate default single copy by setting MPI_DEFAULT_SINGLE_COPY_OFF to any value. This will keep memory mapping available for those routines for which it is mandatory.
- Increase the value of MPI_BUFFER_MAX. This will suppress using single copy for all messages smaller than the supplied value.
- Deactivate memory mapping altogether by setting MPI_MEMMAP_OFF. Beware that certain functionality for which memory mapping is mandatory will not work in this case.
- Limit mapped memory usage by setting the MPI_MAPPED_HEAP_SIZE and MPI_MAPPED_STACK_SIZE to some value not too much larger than the maximum size of your messages. Since a silent changeover to non-mapped memory may have a performance impact, you will need to re-check performance after adjusting to new values.
The table below describes these environment variables in more detail. Any change to the default environment may affect performance, and the effect can in turn depend on the message sizes your application uses. Hence you need to tune carefully for your application and its setup.
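The alternatives listed above each amount to setting one or two environment variables before the program is started. The following sketch wraps them in a hypothetical helper, mpt_memmap_policy. The policy names are invented for illustration, and the numeric values are placeholders that need to be tuned for your application.

```shell
#!/bin/sh
# Hypothetical helper: apply one of the alternatives listed above.
# The numeric values passed in are placeholders; tune them per application.
mpt_memmap_policy() {
    case "$1" in
        no-default-single-copy)
            export MPI_DEFAULT_SINGLE_COPY_OFF=1 ;;
        raise-threshold)
            # Single copy is only considered for messages of at least $2 bytes.
            export MPI_BUFFER_MAX="$2" ;;
        no-memmap)
            export MPI_MEMMAP_OFF=1 ;;
        limit-mapped)
            export MPI_MAPPED_HEAP_SIZE="$2"
            export MPI_MAPPED_STACK_SIZE="$3" ;;
        *)
            echo "unknown policy: $1" >&2; return 1 ;;
    esac
}

# Example: restrict single copy to messages of 1 MiB or more.
mpt_memmap_policy raise-threshold 1048576
echo "MPI_BUFFER_MAX=$MPI_BUFFER_MAX"
```

As noted above, performance should be re-checked after any such change.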
Memory usage when using memory mapping
Looking at memory usage with tools like ps or top when memory mapping is enabled may indicate a very large memory overhead. This overhead is only apparent: the Linux kernel accounts the pinned memory pages to each process even though only one instance of them exists. If you want to obtain a reliable estimate for memory usage, you need to disable memory mapping.
Environment variables
MPI execution can be more finely controlled by setting certain environment variables to suitable values. The exact settings may depend on the application as well as the parallel configuration the application is run on. The MPT environment module will perform some settings where deviations from the SGI defaults appear reasonable; but of course the user may need to make further changes. Some settings have considerable performance impact!
| Name | Function | Remarks |
|---|---|---|
| Controlling task distribution (e.g., for hybrid parallelism). Note: These variables should not be set if the omplace utility is used. | | |
| MPI_DSM_CPULIST | Specifies a list of CPUs (relative to current CPUset) on which to run an MPI application. | Unset by default. Usually only necessary for complex setups like hybrid and/or MPMD jobs. |
| MPI_DSM_DISTRIBUTE | Activates NUMA job placement mode. This mode ensures that each MPI process gets a unique CPU and physical memory on the node with which that CPU is associated. The CPUs are chosen by simply starting at relative CPU 0 and incrementing until all MPI processes have been forked. To choose specific CPUs, use the MPI_DSM_CPULIST environment variable. | LRZ/PBS sets this by default. |
| MPI_DSM_PPM | Sets the number of MPI processes per blade. The value must be less than or equal to the number of cores per blade (or memory channel). | The default is the number of cores per blade. |
| MPI_OPENMP_INTEROP | Setting this variable modifies the placement of MPI processes to better accommodate the OpenMP threads associated with each process. For this variable to take effect, you must also set MPI_DSM_DISTRIBUTE. | Set to any value to enable. |
| MPI_OMP_NUM_THREADS | Can be set to a colon separated list of positive integers, representing the value of the OMP_NUM_THREADS environment variable for each host-program specification on the mpirun command line. | Set to OMP_NUM_THREADS value by default, or 1 if OMP_NUM_THREADS is unset. |
| Controlling task execution | | |
| MPI_NAP | This variable affects the way in which ranks wait for events to occur. | Setting to a moderate value is useful for master-slave codes where the master shares CPU resources with one of the slaves. Defining MPI_NAP without value is best used if the system is oversubscribed (there are more processes ready to run than there are CPUs). Leaving MPI_NAP undefined is best if sends and matching receives occur nearly simultaneously. |
| MPI_UNBUFFERED_STDIO | Disables buffering of stdout/stderr. If MPI processes produce very long output lines, the program may crash due to running out of STDIO buffer. | Set to any value to enable. If enabled, the option -prefix is ignored. |
| Memory mapping, remote memory access | | |
| MPI_BUFFER_MAX | Specifies a minimum message size, in bytes, for which the message will be considered a candidate for single-copy transfer. | LRZ sets this to a default of 32768. |
| MPI_DEFAULT_SINGLE_COPY_OFF | Disables the single-copy mode. Users of MPI_Send should continue to use the MPI_BUFFER_MAX environment variable to control single-copy. | If unset, single copy mode is enabled; this causes transfers of more than 2000 Bytes that use MPI_Isend, MPI_Sendrecv, MPI_Alltoall, MPI_Bcast, MPI_Allreduce and MPI_Reduce to use the single-copy mode optimization. Set to any value to disable. |
| MPI_MAPPED_HEAP_SIZE | Sets the new size (in bytes) for the amount of heap that is memory mapped per MPI process. | Default: The physical memory available per CPU less the static region size. This variable will only have an effect if memory mapping is on. |
| MPI_MAPPED_STACK_SIZE | Sets the new size (in bytes) for the amount of stack that is memory mapped per MPI process. | Default: The stack size limit. If stack size is set to unlimited, the mapped region is set to the physical memory available per CPU. This variable will only have an effect if memory mapping is on. |
| MPI_MEMMAP_OFF | Turns off the memory mapping feature. The memory mapping feature provides support for single-copy transfers and MPI-2 one-sided communication on Linux for single and multi-partition jobs. | Unset by default. Set to any value to switch memory mapping off. |
| Diagnostics and debugging support | | |
| MPI_CHECK_ARGS | Run-time checking of MPI function arguments. Segmentation faults might occur if bad arguments are passed to MPI. | Useful for debugging. Adds several microseconds to latency! |
| MPI_COREDUMP | Controls which ranks of an MPI job can dump core on receipt of a core-dumping signal. The valid values are listed in the mpi (1) man page. | Default setting is as if FIRST were specified. Please note that you will need to issue the command ulimit -c unlimited before starting MPI execution to actually obtain core files since a maximum core size of 0 is set as system default. Intel's idb is used to generate the traceback information; use of the -g -traceback compilation switch is recommended to enable source location. |
| MPI_DSM_VERBOSE | Print information about process placement unless MPI_DSM_OFF is also set. Output is sent to stderr. | Unset by default. Set to any value to enable. |
| MPI_MEMMAP_VERBOSE | Display additional information regarding the memory mapping initialization sequence. Output is sent to stderr. | Unset by default. Set to any value to enable. |
| MPI_SHARED_VERBOSE | Setting this variable allows for some diagnostic information concerning messaging within a host to be displayed on stderr. | Off by default. |
| MPI_SLAVE_DEBUG_ATTACH | Specifies the MPI process to be debugged. If you set MPI_SLAVE_DEBUG_ATTACH to N, the MPI process with rank N prints a message during program startup, describing how to attach to it from another window using the gdb or idb debugger. The message includes the number of seconds you have to attach the debugger to process N. If you fail to attach before the time expires, the process continues. | Off by default. |
| MPI_STATS | Enables printing of MPI internal statistics. | Off by default. Note: This variable should not be set if the program uses threads. |
| MPI-internal limits | | |
| MPI_BUFS_PER_HOST | Number of shared message buffers (of size 16 KB) that MPI is to allocate for each host. These buffers are used to send and receive long inter-partition messages. | SGI default is 32. Increase if default buffering proves insufficient. |
| MPI_BUFS_PER_PROC | Number of shared message buffers (of size 16 KB) that MPI is to allocate for each MPI task. These buffers are used to send and receive long intra-partition messages. | SGI default is 32. Increase if default buffering proves insufficient. |
| MPI_COMM_MAX | Maximum number of communicators that can be used in an MPI program. | Default value is 256. |
| MPI_GROUP_MAX | Maximum number of groups available for each MPI process. | Default value is 32. |
| MPI_MSGS_MAX | This variable can be set to control the total number of message headers of size 128 kBytes that can be allocated. This allocation applies to messages exchanged between processes on a single host. If you set this variable, specify the maximum number of message headers. | May improve performance if your application generates many small messages. Default is 512. |
| MPI_REQUEST_MAX | Determines the maximum number of nonblocking sends and receives that can simultaneously exist for any single MPI process. Use this variable to increase internal default limits. MPI generates an error message if this limit (or the default, if not set) is exceeded. | The default value is 16384. |
| MPI_TYPE_DEPTH | Sets the maximum number of nesting levels for derived data types. Limits the maximum depth of derived data types that an application can create. MPI generates an error message if this limit (or the default, if not set) is exceeded. | By default, 8 levels can be used. |
| MPI_TYPE_MAX | Determines the maximum number of data types that can simultaneously exist for any single MPI process. Use this variable to increase internal default limits. MPI generates an error message if this limit (or the default, if not set) is exceeded. | 1024 by default. |
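As an illustration of how the task distribution variables fit together, the sketch below prepares the environment for a hybrid MPMD run started with mpirun (without omplace, since these variables should not be combined with it). The helper join_colon and the thread counts of 4 and 2 are assumptions made for this example; the variable names and their meaning are those given in the table.

```shell
#!/bin/sh
# Hypothetical helper: join per-program thread counts with colons,
# the list format described for MPI_OMP_NUM_THREADS.
join_colon() {
    out=""
    for n in "$@"; do
        out="${out:+$out:}$n"
    done
    echo "$out"
}

# Assumed MPMD layout: first program with 4 threads per task,
# second program with 2 threads per task.
export MPI_DSM_DISTRIBUTE=1    # needed for MPI_OPENMP_INTEROP to take effect
export MPI_OPENMP_INTEROP=1    # place tasks so that their threads have room
export MPI_OMP_NUM_THREADS="$(join_colon 4 2)"
export MPI_DSM_VERBOSE=1       # report the resulting placement on stderr

echo "MPI_OMP_NUM_THREADS=$MPI_OMP_NUM_THREADS"
```

With MPI_DSM_VERBOSE set, the placement that results from these settings can be checked in the stderr output of the run.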
MPT Extension: Shmem programming interface
In addition to MPI calls, an SPMD parallel program on Altix systems may also use the efficiently implemented shmem library calls. These make use of the RDMA facilities of the SGI interconnect (NUMAlink); indeed they also work across partition boundaries. Shmem calls are generally similar in semantics to one-sided MPI communication calls, but easier to use. For example,
shmem_double_put(target,source,len,pe)
is a facility for transferring a double precision array source(len) to the location target(len) on the remote process pe. The target object must be remotely accessible (aka symmetric), i.e. typically either a static array or an array dynamically allocated by executing the collective call shpalloc (3F) on a suitably defined Cray-type pointer. Repeatedly executed shmem calls targeting the same process will usually require an additional synchronization call (in the above case: shmem_fence()) to enforce memory ordering. Also note that the interface is not generic: for each data type used there is a distinct API call available (if at all). Here is a list of further functionality available:
- shmem_get: transfer data from remote to local process
- shmem_ptr: return a pointer to a memory location on a remote process
- collective calls for reduction, broadcast, barrier
- administrative calls for starting up and getting process IDs: If shmem is used in conjunction with MPI, please use the standard MPI administrative calls instead!
Due to the cache coherency properties of the Altix systems, the cache management functions - while still available for compatibility - are not actually required. Please consult the documentation referenced below for detailed shmem information.
Global shared memory
The GSM feature provides expanded shared memory capabilities across partitioned Altix systems and additional shared memory placement specifications within a single host configuration. Additional (however non-portable) API calls provide a way to allocate a global shared memory segment with desired placement options, free that segment, and provide information about that segment. For example, calling the subroutine
gsm_alloc(len, placement, flags, comm, base, ierror)
will provide a memory segment of size len bytes at address base (accessed via a Cray-type pointer) for all processes in the MPI communicator comm. Data written to this segment by any process will be visible to all other processes after a synchronization call (usually MPI_Barrier). Please consult the documentation referenced below for detailed GSM information.
Documentation
General Information on MPI
Please refer to the MPI page at LRZ for the API documentation and information about the different flavors of MPI available.
Manual pages
- For sgi MPT, please consult the man pages mpi (1), mpirun (1). Also, each MPI API call has its own man page.
- For the shmem API, consult the man page shmem_intro (1). Again, each shmem routine has its individual man page.
- For the global shared memory API, consult the man page gsm_intro (1), which also contains references to further API calls.
SGI's MPT documentation
Some of the following links lead to a password-protected area. To obtain user name and password for access, please type the command get_manuals_passwd when logged in to the system.
- MPT page on SGI's web site with summary information
- sgi MPT User's Guide, in PDF (200 kByte) and HTML format.