File systems on HLRB II

Short description of the file systems available on HLRB II, including recommendations on data handling.

Overview

/home/hlrb2 (environment variable $HOME)
    Purpose:                         store the user's sources, input data, and small but important result files
    Implementation, size, bandwidth: NAS filer (NFS), 60 TB, ~800 MB/s
    Backup:                          yes (backup to tape and snapshots)
    Lifetime and deletion strategy:  project duration
    Quota:                           yes (per project)

/ptmp1 (environment variable $OPT_TMP)
    Purpose:                         temporary huge files (restart files, files to be pre- or postprocessed)
    Implementation, size, bandwidth: parallel file system (CXFS), 300 TB, ~20 GB/s
    Backup:                          no
    Lifetime and deletion strategy:  high watermark deletion (1); beware of technical problems
    Quota:                           no

/ptmp2 (environment variable $PROJECT)
    Purpose:                         temporary huge files (restart files, files to be pre- or postprocessed)
    Implementation, size, bandwidth: parallel file system (CXFS), 300 TB, ~20 GB/s
    Backup:                          no
    Lifetime and deletion strategy:  project duration; beware of technical problems
    Quota:                           yes (per project)

/tmp (use of this area is strongly discouraged: danger of system failure)
    Purpose:                         temporary file system for system use
    Implementation, size, bandwidth: node-local, 4.8 GB
    Backup:                          no
    Lifetime and deletion strategy:  no guarantees; files may be deleted at any time at LRZ's discretion
    Quota:                           no
(1) High watermark deletion means: when the filling of the file system exceeds 80%, files are deleted, starting with the oldest and largest ones, until the filling drops below 60%. Be aware that the normal tar -x command preserves the modification time of the original file instead of the time when the archive is unpacked, so unpacked files may be among the first candidates for deletion. Use tar -mx, or touch in combination with find, to work around this. Be aware that the exact time of deletion is unpredictable!

User's responsibility for saving important data

With parallel file systems of several hundred terabytes (/ptmp1, /ptmp2), it is technically impossible (or too expensive) to back up these data automatically. Although the disks are protected by RAID mechanisms, other severe incidents might destroy the data. It is therefore the user's responsibility to transfer data to safe places (e.g. $HOME) and/or archive them to tape. Due to the long off-line times for dumping and restoring data, LRZ might not be able to recover data after any kind of outage or inconsistency of the scratch file systems /ptmp1 ($OPT_TMP) and /ptmp2 ($PROJECT). The alias name $PROJECT and the intended storage period until the end of your project must not be mistaken for an indication of data safety.

There is no automatic backup for /ptmp1 and /ptmp2. Besides high watermark deletion, severe technical problems might destroy your data.

Copy, transfer or archive the files you want to keep!

Quota/Volume limit in $HOME and $PROJECT

The storage in /home/hlrb2 is limited. Each project is assigned a separate volume on the NAS filer, which is mounted at /home/hlrb2/<project_name> and contains the home directories of the project's users. The maximum size of the volume is limited. The commands to get information about your quota and the disk space used are:

    sdf $HOME                            (for $HOME)
    /usr/sbin/repquota -g /ptmp2         (for $PROJECT)

The disk space in $HOME is occupied not only by your current data but also by snapshots ("backup copies") from the last 10 days. Typically your file space consists of a 150 GB quota plus an additional 150 GB for the snapshots. If you change and delete so many files in your home directory that the amount of changes exceeds 150 GB within 10 days, this additional space is not sufficient, and the snapshots will also take up space from the "real" quota until they are automatically deleted.
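
To see how much space the snapshots currently occupy, a rough check along the following lines can help. This is only a sketch: the subdirectory names under $HOME/.snapshot depend on the filer configuration, and du reports apparent sizes, which may overcount blocks shared between snapshots.

ls $HOME/.snapshot/                      # list the available snapshots
du -sh $HOME/.snapshot/* 2>/dev/null     # rough estimate of the space they occupy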

It might help not to place any temporary files in your home directory ($HOME), but to use the large parallel project file system ($PROJECT) or the parallel temporary file system $OPT_TMP, which is not limited by individual quotas.

Temporary filesystems $OPT_TMP

Please use the environment variable $OPT_TMP to access the temporary file system. This variable points to the location where the underlying file system delivers optimal IO performance. Do not use /tmp for storing large temporary files! (The file system where /tmp resides is very small and slow, and files there are regularly deleted by the system administrators.)
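
As an illustration, a batch job can create its own scratch directory under $OPT_TMP, run there, and afterwards copy only the small, important results back to $HOME. The following is a sketch; the directory layout is arbitrary and $PBS_JOBID is used merely as an example of a unique identifier.

SCRATCH=$OPT_TMP/$USER/run_$PBS_JOBID    # hypothetical per-job scratch directory
mkdir -p $SCRATCH
cd $SCRATCH
./myprog.exe                             # large temporary and restart files are written here
cp results_summary.dat $HOME/            # keep only small, important files in $HOME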

Coping with high watermark deletion in $OPT_TMP

The high watermark deletion mechanism may remove files which are only a few days old if the file system is used heavily. In order to cope with this situation, please note:

  • The normal tar -x command preserves the modification time of the original file rather than the time when the archive was unpacked. Files unpacked from an older archive are therefore among the first candidates for deletion. To prevent this, use tar -xm to unpack your files, which gives them the current date (see the sketch after this list).
  • Please use the TSM system to archive/retrieve files from/to $OPT_TMP to/from the tape archive.
  • Please always use $OPT_TMP for files which are considerably larger than 1GB. 
  • Please remove any files which are not needed any more as soon as possible. The high watermark deletion procedure is then less likely to be triggered.
  • More information about the filling of the file systems and about the oldest files will be made available on a web site in the near future.
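
The following sketch illustrates both workarounds: unpacking with tar -xm so that the files receive the current time stamp, and refreshing the time stamps of an already unpacked directory tree with find and touch. The file and directory names are placeholders.

tar -xmf results.tar                                  # -m stamps the unpacked files with the current time
find $OPT_TMP/$USER/mydata -type f -exec touch {} +   # refresh time stamps of existing files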

Transferring files from/to other systems

Due to our security regulations, transferring files will typically only work if the IP address of the remote system has been entered into the LRZ routing database. In addition, it may be necessary to have specific ports opened for this IP address. Please apply for such an entry via an update of the project application form.

secure copy

For secure file transfer in both directions, the command scp can be used. This method is the most straightforward, but it is time-consuming for large files; if you have large files, consider one of the alternatives below.
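
A typical invocation looks like this; the host name is only a placeholder for the HLRB II login node and has to be replaced by the address valid for your project.

scp results.tar myuser@hlrb2.lrz.de:          # copy a file into your remote home directory
scp myuser@hlrb2.lrz.de:input.dat .           # copy a file from the remote system to the local machine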

dmscp2

dmscp2 combines secure authentication with fast, unencrypted data transfer.

grid-ftp

This is a component of the Globus Toolkit; please consult the LRZ-specific documentation on this topic.
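
If the Globus Toolkit is installed on the remote side, a transfer can typically be started with globus-url-copy. The host name and paths below are placeholders, and the gsiftp service must actually be reachable for your registered IP address.

globus-url-copy -vb -p 4 file:///scratch/results.tar \
    gsiftp://hlrb2.lrz.de/ptmp2/myproject/results.tar
# -vb reports transfer performance, -p 4 uses four parallel data streams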


Snapshots, backup, archiving and restoring

For all files in $HOME, backup copies are kept and made available in the special subdirectory $HOME/.snapshot/.
A file can be restored by simply copying it from the appropriate snapshot directory back to its original location.
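
For example, to bring back an accidentally deleted or damaged file, list the available snapshots and copy the file from one of them. The snapshot name hourly.0 below is only an illustration; the actual names depend on the filer configuration.

ls $HOME/.snapshot/                                        # see which snapshots exist
cp $HOME/.snapshot/hourly.0/myfile.dat $HOME/myfile.dat    # restore the file from a snapshot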

Please consult the HPC Backup and Archiving document for how to handle backups and the TSM tape system.


Efficient use of IO

General rules

  • perform IO in few, large chunks
  • write binary instead of formatted data (factor 3 improvement in size and performance)
  • use IO libraries when possible (netcdf, hdf5,...)
  • convert to target format in memory if possible (i.e. perform as much postprocessing as possible before writing to disk)
  • for parallel programs, write output to separate files for each process: this gives the highest throughput, but may require postprocessing
  • use library/compiler support for conversion between little/big endian of files used on different architectures (better optimized and less error prone)
  • avoid unnecessary open/close statements (can disable caching)
  • avoid explicit flushes of data to disk, except when needed for consistency reasons

FFIO Layer

The Flexible File I/O (FFIO) system lets the user specify a comma-separated list of layers through which I/O data is passed; libFFIO supports the eie and event layers for this purpose. The layers are invoked by specifying their numerics and options in the FF_IO_OPTS environment variable and by overloading glibc with the libFFIO.so library via the Linux LD_PRELOAD mechanism (see ld.so(8)). The executables need not be modified, recompiled, or relinked, as long as they are dynamically linked against the libc library so that the overload can take effect.

Example 1 (serial jobs):

export LD_PRELOAD=/usr/lib/libFFIO.so
export FF_IO_OPTS='*.dat (eie.direct.bpons.mbytes:8192:24:256:0:1:0)'
./myprog.exe   # assume the program produces large *.dat files

It is recommended not to use the shell in which LD_PRELOAD is set for anything other than running the program; for example, using shell tools or editors in that shell typically causes trouble.

Example 2 (MPI-parallel jobs):

In order to support FFIO also in batch jobs using multiple partitions, an LRZ-specific mpiexec_ffio command is available which performs most of the needed settings for you; in particular, you should not set LD_PRELOAD yourself in this case, since this would cause the startup procedure to fail. In order to use the mpiexec_ffio facility, please create a file FFIOconf containing, for example,

export MPI_APPS_EXE_NAME=<name of executable>
NTASKS=<total number of tasks>
i=0
while [ $i -lt $NTASKS ] ; do
  export FF_IO_OPTS_RANK${i}='*.dat (eie.direct.bpons.mbytes:8192:12:16:0:1:0)'
  i=$(($i + 1))
done

In this example, I/O to all files of the form *.dat is diverted via FFIO for all MPI tasks. The configuration requires up to approximately 48 MB of memory (12 pages) per MPI task. Please start your MPI program inside the PBS job script via, e.g.,

mpiexec_ffio -ffio FFIOconf -n 256 ./myprog.exe

Using this scheme, it is also possible to configure only a subset of MPI tasks for FFIO, or use different settings for different tasks. MPMD-style processing is only possible if the same program is invoked in every clause.
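
As a sketch of such a configuration, the FFIOconf file below sets up FFIO for rank 0 only, assuming that ranks without an FF_IO_OPTS_RANK<i> entry fall back to normal I/O; please verify this behaviour against the mpiexec_ffio documentation.

export MPI_APPS_EXE_NAME=./myprog.exe
# divert *.dat I/O through FFIO for rank 0 only; all other ranks perform normal I/O
export FF_IO_OPTS_RANK0='*.dat (eie.direct.bpons.mbytes:8192:12:16:0:1:0)'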

MPI-IO on HLRB II

In general we do not expect a big performance difference between using serial IO facilities and MPI-IO, because the nodes of HLRB II are able to cache most IO data in memory. Therefore:

  • If your program is implemented using serial IO facilities, it is probably possible to leave the IO serial (unless you are writing lots of data using only one task, in which case you should indeed consider parallelizing the IO).
  • If your program uses parallel IO based on MPI-IO, our advice is to use the MPI-IO implementation inside SGI MPT.

Further information

Please contact HPC support if something is unclear to you or if you have further questions.