Background Storage and its Handling

This document gives an overview of background storage systems available on the LRZ Linux Cluster. Usage, special tools and policies are discussed.


Available disk resources and file system layout

The following table gives an overview of the available file system resources on the Linux Clusters.

Recommendation: LRZ has defined an environment variable $OPT_TMP which should be used as a base path for reading/writing large scratch files. This variable is set to the most appropriate file system for each available platform (e.g., to the XFS scratch file system on the Altix). Since the target of $OPT_TMP may change over time, it is recommended to use this variable instead of hard-coded paths.
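
As a minimal sketch of this recommendation, a job script can derive its working directory from $OPT_TMP rather than from a fixed path. The fallback to $TMPDIR or /tmp and the directory name myjob are assumptions made for illustration, so that the fragment also runs where $OPT_TMP is unset; they are not LRZ conventions.

```shell
#!/bin/sh
# Base path: prefer $OPT_TMP as set by LRZ. The fallbacks are assumptions
# so that the sketch also runs on machines where OPT_TMP is not defined.
scratch_base="${OPT_TMP:-${TMPDIR:-/tmp}}"

# Hypothetical per-job working directory below the scratch base.
workdir="$scratch_base/myjob.$$"
mkdir -p "$workdir" || exit 1

# Write large scratch files below $workdir instead of a hard-coded path.
echo "intermediate data" > "$workdir/scratch.dat"
echo "scratch directory: $workdir"
```

If the target of $OPT_TMP changes later, a script written this way should need no modification.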

Notes:

  1. On the Altix systems, NAS is not used as the default scratch area; you need to copy data explicitly from $OPT_TMP (which on the Altix points to the node-local scratch areas) to $PTMP within your SGE job script. The shell commands for this copy step might look like:

    cd $OPT_TMP; cp -a mydir $PTMP

    where mydir is a directory residing within $OPT_TMP.
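
A slightly more defensive variant of this copy step is sketched below. It only relies on $OPT_TMP and $PTMP as described in the note; the temporary stand-in directories and the sample directory mydir are created here purely so that the fragment can be tried outside a job, and are not part of the LRZ setup.

```shell
#!/bin/sh
# Stand-ins (assumption): use temporary directories when the LRZ variables
# are not set, so the sketch can be tried outside the cluster.
: "${OPT_TMP:=$(mktemp -d)}"
: "${PTMP:=$(mktemp -d)}"

# Sample result directory; in a real job the application creates it.
mkdir -p "$OPT_TMP/mydir"
echo "result" > "$OPT_TMP/mydir/out.dat"

# Copy the directory tree, preserving attributes, and report a failure
# instead of silently leaving the results on the node-local area.
cd "$OPT_TMP" || exit 1
if cp -a mydir "$PTMP"; then
    echo "copied mydir to $PTMP"
else
    echo "copy of mydir to $PTMP failed" >&2
    exit 1
fi
```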

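Globally accessible Home and Project Directories

User's Home Directories
  Segment of the Linux Cluster:    all
  File system type and full name:  NFS, /home/cluster/$USER
  How the user should access it:   $HOME
  Space available:                 25 GByte by default per group
  Approx. aggregated bandwidth:    up to 100 MB/s
  Backup by LRZ:                   Yes, backup to tape and snapshots
  Lifetime, deletion, remarks:     Expiration of LRZ project. NFS quotas apply.

Project file system
  Segment of the Linux Cluster:    all
  File system type and full name:  NFS, /naslx/projects
  How the user should access it:   $PROJECT
  Space available:                 up to 1 TByte per group, available on request
  Approx. aggregated bandwidth:    up to 1 GB/s
  Backup by LRZ:                   NO
  Lifetime, deletion, remarks:     No guarantee for data integrity. Disk quota.

Pseudo-temporary File Systems
Please use the scratch area that is most appropriate for the system you work on.

Altix preferred scratch file system
  Segment of the Linux Cluster:    SGI Altix
  File system type and full name:  XFS, /scratch/ptmp/$USER (node-local and therefore invisible from other nodes)
  How the user should access it:   $OPT_TMP
  Space available:                 7 TByte
  Approx. aggregated bandwidth:    read 1.2 GB/s, write 0.7 GB/s
  Backup by LRZ:                   NO
  Lifetime, deletion, remarks:     High watermark deletion of oldest and largest files. No guarantee for data integrity.

Scratch file system
  Segment of the Linux Cluster:    all
  File system type and full name:  NFS, /naslx/ptmp
  How the user should access it:   $OPT_TMP
  Space available:                 several TByte
  Approx. aggregated bandwidth:    up to 1 GB/s
  Backup by LRZ:                   NO
  Lifetime, deletion, remarks:     Sliding window file deletion. No guarantee for data integrity.

Local File Systems

Node-local temporary user data
  Segment of the Linux Cluster:    all
  File system type and full name:  local disks, /scratch
  How the user should access it:   $TMPDIR
  Space available:                 8-200 GByte
  Approx. aggregated bandwidth:    approx. 30 MB/s
  Backup by LRZ:                   NO
  Lifetime, deletion, remarks:     Batch nodes except Altix: job duration only; files should be deleted by the user's job script at the end of a job. Login nodes and Altix systems: files are removed if older than 4 weeks.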


Some details on use and LRZ configuration of the storage areas

Project directories

If your project requires processing large data sets (50+ GB) over a period of several months, the LRZ file deletion strategy in the pseudo-temporary file systems might become a problem. In this case, consider using the file system pointed to by the $PROJECT environment variable. Please note that we cannot guarantee data integrity over the full lifetime of your project, so you should archive all important data to tape after placing them in the project directory. A group quota is imposed on this area. $PROJECT is not available by default; if you need resources in this area, please contact LRZ HPC support.

Metadata on scratch and project directories

While the metadata performance (i.e., the performance for creating, accessing and deleting directories and files) of both scratch and project directories is improved compared to previously used technologies, the capacity for metadata (e.g., the number of file entries in a directory) is limited. Therefore, please do not generate extremely large numbers of very small files in these areas; instead, aggregate data into larger files and write into these, e.g., via direct access. If this rule is violated, LRZ will block your access to the $OPT_TMP or $PROJECT area, since otherwise user operation on the cluster may be obstructed. Please also note that each directory has a limit of 10 MByte for storing i-node metadata (directory entries and file names); this limits the number of files that can be put into a single directory.
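
One simple way to follow this advice is to bundle many small files into a single archive before placing them in the scratch or project area. The sketch below uses tar for this purpose; the directory and file names are hypothetical, and temporary directories stand in for the real storage areas.

```shell
#!/bin/sh
# Temporary stand-ins (assumption) for a working area and the target area.
src=$(mktemp -d)
dest=$(mktemp -d)

# Generate a number of very small files, as an application might produce.
mkdir "$src/results"
i=1
while [ "$i" -le 100 ]; do
    echo "sample $i" > "$src/results/part_$i.txt"
    i=$((i + 1))
done

# Store one archive instead of 100 directory entries in the target area.
tar -cf "$dest/results.tar" -C "$src" results

# Single members can still be read without unpacking everything to disk.
tar -xOf "$dest/results.tar" results/part_42.txt
```

The target directory then holds one file entry rather than one entry per result file, which keeps the per-directory metadata small.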


File deletion strategies and data integrity issues

To prevent overflow of the large scale storage areas, LRZ has implemented various deletion strategies. Please note that

  • for a given file or directory, the exact time of deletion is unpredictable!
  • the normal tar -x command preserves the modification time of the original file rather than setting it to the time at which the archive is unpacked, so unpacked files may be among the first candidates for deletion. Use tar -mx if required, or run touch on a single file, or
    find mydir -exec touch {} \;

    on a directory tree mydir.
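
The effect of the -m option can be checked with a short experiment such as the following sketch. All file names are hypothetical, a temporary directory stands in for the scratch area, and the archived file is given an old modification time on purpose.

```shell
#!/bin/sh
work=$(mktemp -d)
cd "$work" || exit 1

# Create a file with an old modification time (1 January 2000) and archive it.
mkdir old
echo "data" > old/file.dat
touch -t 200001010000 old/file.dat
tar -cf old.tar old

# Reference file dated 1 January 2010, between the old time stamp and now.
touch -t 201001010000 marker

# Plain extraction keeps the archived modification time ...
mkdir plain && tar -xf old.tar -C plain
# ... whereas -m leaves the modification time at the time of extraction.
mkdir fresh && tar -xmf old.tar -C fresh

# find -newer lists only files modified more recently than the marker.
find plain fresh -type f -newer marker
```

With a tar that supports -m as described above, the final command should list only fresh/old/file.dat, since the copy under plain still carries the time stamp from the year 2000.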

Because of the deletion strategies described in the subsections below, and because LRZ cannot guarantee the same level of data integrity for the high performance file systems as for, e.g., $HOME, LRZ urges you to copy, transfer or archive your files from the pseudo-temporary disks and the $PROJECT areas to safe storage or tape.

High Watermark Deletion

When the fill level of the file system exceeds a limit (typically between 80% and 90%), files are deleted, starting with the oldest and largest, until the fill level drops to between 60% and 75%. The precise values may vary.

  • This strategy is used on the Altix XFS scratch areas only.

Sliding window file deletion

Any files and directories older than typically 30 days (the interval may be shortened if the fill-up rate becomes very high) are removed from the disk area. This deletion mechanism is invoked once a day.

  • This strategy is used on the $OPT_TMP scratch area.
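
To see which files are approaching the deletion age, find can select files by modification time. The sketch below assumes that age is judged by modification time, as the remark on tar suggests, and uses a threshold of 25 days, an arbitrary value chosen to leave a margin before the typical 30 days. A temporary directory with two sample files stands in for $OPT_TMP.

```shell
#!/bin/sh
# Stand-in (assumption) for a directory below $OPT_TMP.
area=$(mktemp -d)

echo "recent" > "$area/new.dat"
echo "stale" > "$area/old.dat"
# Give one file an old modification time (1 January 2000).
touch -t 200001010000 "$area/old.dat"

# Files not modified for more than 25 days: candidates to archive or copy.
at_risk=$(find "$area" -type f -mtime +25)
echo "$at_risk"
```

Only old.dat is reported here; on the cluster, the same find command would be pointed at your own directory below $OPT_TMP.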

NAS-based file systems ($HOME, $OPT_TMP, $PROJECT) and quotas

The file systems reside on dedicated Network Attached Storage systems ("filers") and are accessed via NFSv3. Filers offer high I/O performance, also with smaller files, and excellent reliability. These file systems can be uniformly accessed from any node in the cluster.

You can check your HOME quota by using the sdf command:


   sdf $HOME

which will give you an output like

Filesystem                                                    Size       Avail (MiB)
nas0.hlrb2.lrz-muenchen.de:/home/cluster/<project_id>/x      25800       11926

The first number in each line is the total quota (in mebibytes, i.e. 2^20 bytes), and the second number is the amount used.

Note:

  • Quotas are assigned to projects and not to individual accounts. In case of quota overflow please check your own usage with du and then first contact your colleagues if your own usage is not responsible for filling up the quota.
  • Some applications or installation programs try to query the free disk space with just the "df" or "quota" command. This will not work with the NAS-based file systems; such applications need to be modified.
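
For the du check mentioned in the first note, a sketch like the following summarizes usage per top-level directory, largest first. The sample directory layout is hypothetical, and a temporary directory is used in place of $HOME so that the fragment is harmless to run.

```shell
#!/bin/sh
# Stand-in (assumption) for $HOME with two sample subdirectories.
base=$(mktemp -d)
mkdir "$base/small" "$base/large"
echo "x" > "$base/small/a.dat"
# Create a file of about 1 MiB so that the directories differ clearly in size.
dd if=/dev/zero of="$base/large/b.dat" bs=1024 count=1024 2>/dev/null

# Disk usage in KiB per top-level directory, sorted with the largest first.
usage=$(du -sk "$base"/* | sort -rn)
echo "$usage"
```

On the cluster, replacing $base with $HOME gives a quick picture of which of your own directories contribute most to the group quota.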

SGI XFS File System

On the SGI Altix systems, high performance I/O is performed to locally attached arrays of RAID disks with a net size of 11 and 7 TByte, respectively.


Large scale transfer of data to the outside world

The preferred method of transferring data to other compute systems outside LRZ is GridFTP. Please consult the LRZ-specific document on using the grid facilities.

Backup and Archiving

Please consult the HPC Backup and Archiving document for how to handle backups (via snapshots) and how to use the TSM tape system.