Using GPFS

Using GPFS utilities and development environment

By issuing the command

module load gpfs

you gain access to

  • GPFS commands that allow you to query and manipulate file system parameters (e.g. setting Access Control Lists)
  • the ability to compile and link executable code that makes use of GPFS-specific semantics

Please see the GPFS documentation (links at the end of this page) for details on both commands and API.

Best Practices and Hints for Optimizations

Most of the hints presented here apply not only to GPFS but also to other file systems.

Avoid using "ls -l"

  • Use "ls" if you just want to list files. If you use "-l" all the metadata have to be read.

Use "vi -n" or "view"

  • This avoids the creation of a swap file while opening the file (which is a metadata operation) and speeds up the inspection of files.

Avoid repetitive and excessive "open/close" or "stat" operations

  • Metadata operations may need serialized locking mechanisms.
  • Some users call "stat" or related functions and/or commands to test the size or existence of files. When such testing becomes excessive, it causes a heavy load on the metadata servers.

Avoid having multiple processes open the same file(s) (for writing)

  • Metadata operations may need serialized locking mechanisms.
  • When only reading, make this explicit in the open calls (Fortran: ACTION='READ', C: O_RDONLY), as shown in the sketch below. This reduces contention.
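
For example, a minimal Fortran sketch of a read-only open (the file name and contents are illustrative assumptions):

program read_only_example
  implicit none
  integer :: iu, n
  ! Open an existing file strictly for reading. ACTION='READ' tells the
  ! file system that no write locks will be needed, which reduces contention.
  open(newunit=iu, file='input.dat', status='old', action='read')
  read(iu, *) n
  close(iu)
end program read_only_example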

Do not have all (or too many) files in the same directory

(this also applies to having 1000s of directories within the same directory)

In general, the GPFS architecture handles parallel I/O from many nodes well. However, it is very slow when different nodes try to write to exactly the same area of the same file. The general rule is therefore to avoid having hundreds or thousands of tasks modify the same file or directory at the same time with certain operations. This happens, for instance, when at job start all participating nodes each try to create a file in one and the same directory (a directory is nothing but a file as well). The rate at which files are created that way was observed to be only about one file per second. It is strongly recommended not to do this for any larger job.

As a better alternative, the files for the individual tasks can all be created by a single task; at job start this is faster by several orders of magnitude (see below). If the nodes really do need to create their files themselves, first create subdirectories, either one per task or one per (small) subset of tasks, and then let the tasks create their files within these subdirectories. The subdirectory creation should again be done by just one task. The code using MPI should do something like this pseudo-code:

!# serial creation by one task
barrier
if (task == 0) then
   do i = 0, nprocs-1
      create subdirectory(i)
      create file(i)        ! with optional truncate option
   enddo
endif

!# all files created now
barrier

!# parallel usage
open file(myid)
open file(commonfile_id)
...
write privatefile
write commonfile

The tasks can then proceed to modify their own portions of a common file, with best results if their regions do not overlap at a granularity smaller than the GPFS blocksize (8 MB). For fine-grained updates smaller than the blocksize, the MPI-IO package is advised, since it uses MPI to ship the small updates to the nodes that manage the different regions of the file (see the sketch below).
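
As an illustration of this non-overlapping, block-aligned pattern, the following Fortran sketch lets each MPI task write one full GPFS block of a shared file with a collective MPI-IO call; the file name, the hard-coded 8 MB blocksize and the integer payload are assumptions for illustration only.

program blockwise_shared_write
  use mpi
  implicit none
  integer, parameter :: blocksize = 8*1024*1024        ! assumed GPFS blocksize (8 MB)
  integer, parameter :: n = blocksize/4                ! 4-byte integers per task
  integer :: error, rank, fh
  integer(kind=MPI_OFFSET_KIND) :: offset
  integer, allocatable :: buf(:)

  call MPI_Init(error)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, error)

  allocate(buf(n))
  buf = rank

  ! Each task writes exactly one GPFS block, so the written regions never
  ! overlap within a block and no cross-node byte-range locking is needed.
  call MPI_File_open(MPI_COMM_WORLD, 'commonfile.dat', &
                     IOR(MPI_MODE_CREATE, MPI_MODE_WRONLY), MPI_INFO_NULL, fh, error)
  offset = int(rank, MPI_OFFSET_KIND) * blocksize
  call MPI_File_write_at_all(fh, offset, buf, n, MPI_INTEGER, MPI_STATUS_IGNORE, error)
  call MPI_File_close(fh, error)

  call MPI_Finalize(error)
end program blockwise_shared_write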

Avoid simultaneous accesses from phases 1 and 2

A node that executes operations on an i-node or modifies the contents of a file becomes the metadata server for that file. This enables relatively good scalability for metadata operations. However, if nodes from phase 1 and phase 2 attempt to access a file simultaneously, the metadata functionality is moved to the storage server itself; this may not only cause a significant performance hit for the processes involved, but may even slow down operations on the complete file system. Such accesses should therefore be avoided, including e.g. looking at files on a login node of phase 1 while a phase 2 job is doing I/O on them, or accessing the same directory simultaneously from jobs running on phases 1 and 2.

Using MPIIO Hints

The existing hints and their usefulness for an application developer/user (given in parentheses) are listed below.

  • romio_cb_read (High): Enables or disables collective buffering for reads. If romio_cb_read is disabled, all tasks perform their own independent POSIX I/O. Enabled by default.
  • romio_cb_write (High): Enables or disables collective buffering for writes. If romio_cb_write is disabled, all tasks perform their own independent POSIX I/O. Enabled by default.
  • romio_cb_fr_types (Low): Tuning of collective buffering.
  • romio_cb_fr_alignment (Low): Tuning of collective buffering.
  • romio_cb_alltoall (Low): Tuning of collective buffering.
  • romio_cb_pfr (Low): Tuning of collective buffering.
  • romio_cb_ds_threshold (Low): Tuning of collective buffering.
  • cb_buffer_size (Medium): Tuning of collective buffering. Controls the size (in bytes) of the intermediate buffer used in two-phase collective I/O. If the amount of data an aggregator transfers is larger than this value, multiple operations are used. The default value is 16 MB.
  • cb_nodes (Medium): Tuning of collective buffering.
  • cb_config_list (Medium): Tuning of collective buffering. Provides explicit control over the aggregators.
  • romio_no_indep_rw (Low): Deferred open and purely collective I/O.
  • ind_rd_buffer_size (Low): Buffer size for data sieving (reads).
  • ind_wr_buffer_size (Low): Buffer size for data sieving (writes).
  • romio_ds_read (High): Enables or disables data sieving for reads.
  • romio_ds_write (High): Enables or disables data sieving for writes.

Most of the time, it is better to disable the data sieving optimisation because a similar one is already performed by the filesystem.
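
For instance, data sieving can be switched off through the corresponding ROMIO hints. A minimal Fortran fragment might look as follows (the info object is assumed to be passed later to MPI_File_open):

call MPI_Info_create(info, error)
! a similar optimisation is already done by the file system, so let ROMIO skip data sieving
call MPI_Info_set(info, "romio_ds_read", "disable", error)
call MPI_Info_set(info, "romio_ds_write", "disable", error)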

An example for which users reported good results on SuperMUC:

call MPI_Info_set(info,"romio_cb_write","enable", error)
call MPI_Info_set(info,"cb_buffer_size","4194304", error)
call MPI_Info_set(info,"striping_unit","4194304", error)

For details see: PRACE Advanced Training - Best practices for parallel IO and MPI-IO hints

General Hints for IO

  • Open files in the correct mode. If a file is only intended to be read, it must be opened in read-only mode because choosing the right mode allows the system to apply optimisations and to allocate only the necessary resources.
  • Write/read arrays/data structures in one call rather than element per element. Not complying with this rule has a significant negative impact on I/O performance (see the sketch after this list).
  • Do not open and close files too frequently, because this involves many system operations. The best approach is to open a file the first time it is needed and to close it only if it will not be needed for a sufficiently long period of time.
  • Limit the number of simultaneous open files because for each open file, the system must assign and manage some resources.
  • Separate procedures involving I/O from the rest of the source code for better readability and maintainability.
  • Separate metadata from data. Metadata is anything that describes the data. This is usually the parameters of calculations, the sizes of arrays... It is often easier to separate files into a first part (header) containing the metadata, followed by the data.
  • Create files independent of the number of processes. This will make life much easier for post-processing and also for restarts with a different number of processes.
  • Align accesses to the boundaries of the file system blocks and have only one process per data server (not easy).
  • Use non-blocking MPI-I/O calls (not implemented/available on all systems).
  • Use higher level libraries based on MPI-I/O (HDF5, ADIOS, SIONlib...).
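
As an illustration of writing an array in one call rather than element by element (referred to above), here is a minimal Fortran sketch; the file name and array size are arbitrary assumptions.

program write_whole_array
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n)
  integer :: iu

  a = 0.0
  open(newunit=iu, file='data.bin', form='unformatted', access='stream', &
       status='replace', action='write')

  ! one large write request instead of n tiny ones
  write(iu) a

  ! the element-by-element variant, write(iu) a(i) inside a loop over i,
  ! would issue n small requests and should be avoided

  close(iu)
end program write_whole_array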

For details see: PRACE Advanced Training - Best practices for parallel IO and MPI-IO hints

GPFS documentation

The GPFS documentation is available on the IBM web site: