Using GPFS
Table of contents
- Using GPFS utilities and development environment
- Best Practices and Hints for Optimizations
- Avoid using "ls -l"
- Use "vi -n" or "view"
- Avoid repetitive and excessive "open/close" or "stat" operations
- Avoid having multiple processes open the same file(s) (for writing)
- Do not have all (or too many) files in the same directory
- Avoid simultaneous accesses from phases 1 and 2
- Using MPIIO Hints
- General Hints for IO
- GPFS documentation
Using GPFS utilities and development environment
By issuing the command
module load gpfs
you gain access to
- GPFS commands that allow you to query and manipulate file system parameters (e.g. setting of Access Control Lists)
- the GPFS development environment, i.e. the headers and libraries needed to compile and link executable code that makes use of GPFS-specific semantics
Please see the GPFS documentation (links at the end of this page) for details on both commands and API.
Best Practices and Hints for Optimizations
Most of the hints presented here apply not only to GPFS, but also to other file systems.
Avoid using "ls -l"
- Use "ls" if you just want to list files. If you use "-l" all the metadata have to be read.
Use "vi -n" or "view"
- This avoids the creation of a swap file when opening the file (which is a metadata operation) and speeds up the inspection of files.
Avoid repetitive and excessive "open/close" or "stat" operations
- Metadata operations may need serialized locking mechanisms.
- Some users use "stat" or related functions and/or commands to test the size or existence of files. When such testing becomes excessive, it causes a heavy load on the metadata servers (see the sketch below).
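For illustration, a hedged Fortran sketch (the file name 'flag.dat' and the ten-second interval are made up for this example) of how polling for a file can be throttled:
logical :: ready

! Anti-pattern: a tight loop; every INQUIRE translates into a stat()
! on the metadata servers.
! do
!    inquire(file='flag.dat', exist=ready)
!    if (ready) exit
! end do

! Better: throttle the polling rate
do
   inquire(file='flag.dat', exist=ready)    ! still one stat() per pass
   if (ready) exit
   call execute_command_line('sleep 10')    ! wait before testing again
end do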
Avoid having multiple processes open the same file(s) (for writing)
- Metadata operations may need serialized locking mechanisms.
- When just reading, make this explicit in the open calls (Fortran: ACTION='READ', C: O_RDONLY). This will reduce contention; see the sketch below.
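A minimal Fortran sketch (the unit handling and the file name 'input.dat' are illustrative):
integer :: iu
real    :: a(100)

! The explicit read-only mode tells the file system that no write
! locks are needed for this file.
open(newunit=iu, file='input.dat', status='old', action='read')
read(iu,*) a
close(iu)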
Do not have all (or too many) files in the same directory
(This also applies to having thousands of directories within the same directory.)
In general, the GPFS architecture is good at processing parallel I/O from many nodes. However, it is very slow when different nodes try to write to exactly the same area of the same file. A robust pattern is therefore to let a single task create all directories and files serially, and only afterwards use them in parallel:
! serial creation: a single task creates all subdirectories and files
call MPI_Barrier(MPI_COMM_WORLD, ierr)
if (myid == 0) then
   do i = 0, nprocs-1
      write(dirname,'(a,i0)') 'subdir_', i
      call execute_command_line('mkdir -p '//trim(dirname))
      ! status='replace' truncates the file if it already exists
      open(newunit=iu, file=trim(dirname)//'/private.dat', status='replace')
      close(iu)
   end do
   open(newunit=iu, file='commonfile.dat', status='replace')
   close(iu)
end if
! all files created now
call MPI_Barrier(MPI_COMM_WORLD, ierr)
! parallel usage: each task opens its own private file plus the common file
write(dirname,'(a,i0)') 'subdir_', myid
open(newunit=iprivate, file=trim(dirname)//'/private.dat', &
     status='old', action='write')
inquire(iolength=reclen) localdata
open(newunit=icommon, file='commonfile.dat', status='old', &
     action='write', access='direct', recl=reclen)
...
write(iprivate,*) localdata            ! write private file
write(icommon, rec=myid+1) localdata   ! write this task's record of the common file
The tasks can then proceed to modify their own portions of a common file, with best results if their regions do not overlap on a granularity smaller than the GPFS blocksize (8 MB). For fine-grained updates smaller than the blocksize, the MPI-IO package is advised, since it uses MPI to ship the small updates to the nodes that manage the different regions of the file, as in the sketch below.
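As a hedged sketch of this approach (the file name, element type, and contiguous block decomposition are assumptions for the example), a collective MPI-IO write where every rank updates its own disjoint region of the shared file:
use mpi
integer, parameter :: nlocal = 1024            ! elements per rank (assumed)
integer :: fh, myid, ierr, status(MPI_STATUS_SIZE)
integer(kind=MPI_OFFSET_KIND) :: offset
real :: buf(nlocal)                            ! this rank's data (filling omitted)

call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
call MPI_File_open(MPI_COMM_WORLD, 'commonfile.dat', &
     MPI_MODE_CREATE + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ierr)
! disjoint offsets: rank i owns bytes [i*4*nlocal, (i+1)*4*nlocal)
offset = int(myid, MPI_OFFSET_KIND) * 4_MPI_OFFSET_KIND * nlocal
! collective call: MPI-IO aggregates the per-rank pieces before touching GPFS
call MPI_File_write_at_all(fh, offset, buf, nlocal, MPI_REAL, status, ierr)
call MPI_File_close(fh, ierr)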
Avoid simultaneous accesses from phases 1 and 2
A node that executes operations on an inode or modifies the contents of a file becomes the metadata server for that file. This enables relatively good scalability for metadata operations. However, if nodes from phase 1 and phase 2 attempt to access a file simultaneously, the metadata functionality is moved to the storage server itself; this may not only cause a significant performance hit for the processes involved, but may even slow down operations on the complete file system. Such accesses should therefore be avoided, including e.g. looking at files on a login node of phase 1 while a phase 2 job is doing I/O on them, or accessing the same directory simultaneously from jobs running on phases 1 and 2.
Using MPIIO Hints
The following table lists existing hints and their usefulness for an application developer or user.
Hint | Usefulness | Explanation
---|---|---
romio_cb_read | High | Enables or disables collective buffering for reading, i.e. whether collective I/O is used for read operations. If romio_cb_read is disabled, all tasks perform their own independent POSIX I/O. Enabled by default.
romio_cb_write | High | Enables or disables collective buffering for writing, i.e. whether collective I/O is used for write operations. If romio_cb_write is disabled, all tasks perform their own independent POSIX I/O. Enabled by default.
romio_cb_fr_types | Low | Tuning of collective buffering
romio_cb_fr_alignment | Low | Tuning of collective buffering
romio_cb_alltoall | Low | Tuning of collective buffering
romio_cb_pfr | Low | Tuning of collective buffering
romio_cb_ds_threshold | Low | Tuning of collective buffering
cb_buffer_size | Medium | Tuning of collective buffering. Controls the size (in bytes) of the intermediate buffer used in two-phase collective I/O. If the amount of data that an aggregator transfers is larger than this value, multiple operations are used. The default value is 16 MB.
cb_nodes | Medium | Tuning of collective buffering
cb_config_list | Medium | Tuning of collective buffering. Provides explicit control over aggregators.
romio_no_indep_rw | Low | Deferred open + only collective I/O
ind_rd_buffer_size | Low | Buffer size for data sieving (reads)
ind_wr_buffer_size | Low | Buffer size for data sieving (writes)
romio_ds_read | High | Enables or disables data sieving for reading
romio_ds_write | High | Enables or disables data sieving for writing
Most of the time, it is better to disable the data sieving optimisation, because a similar optimisation is already performed by the file system, for example:
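A minimal sketch (assuming an info object already created with MPI_Info_create):
call MPI_Info_set(info, "romio_ds_read",  "disable", error)
call MPI_Info_set(info, "romio_ds_write", "disable", error)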
Example with which users reported good results on SuperMUC:
call MPI_Info_set(info, "romio_cb_write", "enable",  error)
call MPI_Info_set(info, "cb_buffer_size", "4194304", error)
call MPI_Info_set(info, "striping_unit",  "4194304", error)
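For completeness, a hedged sketch of how such hints are attached to a file; the file name 'output.dat' and the access mode are illustrative only:
integer :: info, fh, error

call MPI_Info_create(info, error)
call MPI_Info_set(info, "romio_cb_write", "enable",  error)
call MPI_Info_set(info, "cb_buffer_size", "4194304", error)
call MPI_Info_set(info, "striping_unit",  "4194304", error)
! the hints take effect when the info object is passed to the open call
call MPI_File_open(MPI_COMM_WORLD, 'output.dat', &
     MPI_MODE_CREATE + MPI_MODE_WRONLY, info, fh, error)
call MPI_Info_free(info, error)
! ... file I/O on fh ...
call MPI_File_close(fh, error)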
For details see: PRACE Advanced Training - Best practices for parallel IO and MPI-IO hints
General Hints for IO
- Open files in the correct mode. If a file is only intended to be read, it must be opened in read-only mode because choosing the right mode allows the system to apply optimisations and to allocate only the necessary resources.
- Write/read arrays/data structures in one call rather than element per element (see the sketch after this list). Not complying with this rule will have a significant negative impact on the I/O performance.
- Do not open and close files too frequently because it involves many system operations. The best way is to open the file the first time it is needed and to close it only if its use is not necessary for a long enough period of time.
- Limit the number of simultaneous open files because for each open file, the system must assign and manage some resources.
- Separate procedures involving I/O from the rest of the source code for better readability and maintainability.
- Separate metadata from data. Metadata is anything that describes the data: typically the parameters of the calculation, the sizes of arrays, etc. It is often easier to separate files into a first part (header) containing the metadata, followed by the data.
- Create files independent of the number of processes. This will make life much easier for post-processing and also for restarts with a different number of processes.
- Align accesses to the boundaries of the file system blocks and have only one process per data server (not easy).
- Use non-blocking MPI-I/O calls (not implemented/available on all systems).
- Use higher level libraries based on MPI-I/O (HDF5, ADIOS, SIONlib...).
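To illustrate the rule about writing arrays in one call (the unit number iu and the array size are hypothetical), a minimal Fortran sketch:
real :: a(1000000)

! Bad: one I/O request (and one record) per element
do i = 1, size(a)
   write(iu) a(i)
end do

! Good: a single request for the whole array
write(iu) a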
For details see: PRACE Advanced Training - Best practices for parallel IO and MPI-IO hints
GPFS documentation
It is available on the IBM web site:
- General GPFS documentation page
- Specifically, GPFS commands
- Specifically, GPFS API