Handling backups and archives on the HPC systems

On LRZ's HPC systems, mechanisms are provided that allow users to restore accidentally deleted or overwritten files, write files to tape, and retrieve them. This document describes usage and recommended practices for these facilities. It applies to all HPC systems; where applicable, system-specific settings and procedures are described.

Snapshots, Backup, and Restore


For all files in $HOME, backup copies (snapshots) are kept and made available in the special subdirectory $HOME/.snapshot/.


Several snapshots are available:

File system   Time of snapshot                     Number of snapshots retained   How to access
$HOME         daily at 3:00, 9:00, 15:00, 21:00    4                              $HOME/.snapshot/hourly.[0-3]/
$HOME         daily at 0:00                        10                             $HOME/.snapshot/nightly.[0-9]/

A file can be restored by simply copying the file from the appropriate snapshot directory to its original location. Please note:

  • The directory $HOME/.snapshot/ is not listed by the ls or even ls -a command, and you cannot create it either. It is however possible to do a cd $HOME/.snapshot/ and then see all entries using the ls command.
  • When copying the snapshot file to its original location, some versions of cp might refuse to overwrite the original location (since it uses the same i-node). In that case, copy the snapshot file to an alternative location and then move it to the original location.
  • There are no snapshots for the $WORK and $SCRATCH file systems on SuperMUC. You must archive data in these file spaces yourself if necessary.
  • Deleted files in your ordinary $HOME directory are still contained in the snapshot directories and count towards the volume quota. Because of the way snapshots work, space is reserved for old file versions; this reserve is three times (300% of) your project quota. For example, if your quota is 25 GB, there are 75 GB of "snapshot reserve" for changes. If you change or delete more than 75 GB of data within a 10-day interval, your project space may fill up, and even deleting files will not free any storage. Please contact the LRZ Service Desk if you run into problems with this mechanism; LRZ sysadmins can manually remove superfluous snapshots.
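The copy-then-move procedure from the notes above can be wrapped in a small shell function. This is only a minimal sketch: the function name restore_from_snapshot and its arguments are hypothetical and not part of any LRZ tooling.

```shell
# restore_from_snapshot SNAPSHOT_DIR RELATIVE_PATH [TARGET_ROOT]
# Hypothetical helper: copies RELATIVE_PATH out of a snapshot directory
# (e.g. $HOME/.snapshot/nightly.0) into TARGET_ROOT (default: $HOME).
# The file is first copied to a temporary name and then moved into place,
# which sidesteps cp refusing to overwrite the original location.
restore_from_snapshot() {
    snap_dir=$1
    rel_path=$2
    target_root=${3:-$HOME}
    src="$snap_dir/$rel_path"
    dst="$target_root/$rel_path"
    tmp="$dst.restore.$$"

    if [ ! -f "$src" ]; then
        echo "not found in snapshot: $src" >&2
        return 1
    fi
    mkdir -p "$(dirname "$dst")" || return 1
    cp -p "$src" "$tmp" || return 1
    mv -f "$tmp" "$dst"
}
```

A call could look like restore_from_snapshot "$HOME/.snapshot/hourly.0" results/run1.dat, where results/run1.dat is a placeholder path relative to $HOME.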

Tape backups

In addition to the snapshots described above, LRZ also maintains tape backups. Tape backups of $HOME are made only for the purpose of disaster recovery and are not intended for daily operation. Usually, users can rely on the snapshots, e.g., for restoring unintentionally deleted files. Tape backups are made less often but are kept longer:

File system   Time of tape backup   Number of file versions retained   Lifetime of unchanged files   Lifetime of backup for files removed from disk storage
$HOME         Saturday 22:50        3                                  duration of the project       1 year

Please note that no tape backups are performed for the $OPT_TMP and $PROJECT file systems on the Linux-Cluster, nor for the $SCRATCH and $WORK file systems on SuperMUC. You need to archive data residing on these file systems yourself, at your own discretion.

Restoring tape backups

If you cannot find the version of the data you need in the snapshot directories, please contact the LRZ Service Desk with a request to restore the data from the TSM tape backup. The TSM tape backup of $HOME is not directly accessible to users. Please also keep in mind that restoring a backup from tape is labor-intensive and may take a long time to complete.

Archiving and retrieving files and directories (command line)

Archiving means saving the current version of a file to tape. Several versions of the same file can be kept in the tape archive. To restore a particular one, you must distinguish the versions by date and time or by an additional description.

To archive and retrieve data on SuperMUC/SuperMIG or the Linux Cluster, the Tivoli Storage Manager (TSM) infrastructure is provided. A system-wide TSM client configuration is available, so you do not need to install or configure a TSM client yourself.

  • On the Linux Cluster Systems, using the TSM client is possible only from a login shell on one of the public login nodes.
  • On SuperMUC, using the TSM client is only possible from the virtual TSM login node supermuc-tsm.lrz.de (see section on LANless archiving below).

Archiving data with TSM

Let's assume you have a file myFile stored on the temporary file system in a location (directory) denoted by some self-defined environment variable $MY_SCRDIR. Since myFile may be automatically removed from $MY_SCRDIR by high-watermark deletion after some days, you might want to have an archive copy at hand. To create one, go to $MY_SCRDIR and invoke one of the commands

   dsmc archive myFile
   dsmc archive -description="V.1.2" myFile

We recommend keeping logs of all archive commands in a specific directory, e.g.

   dsmc archive myFile >$HOME/mytapelogs/archived_on_YYYY_MM_DD_hh_mm

This might later help to avoid confusion with the file namespace (see below).
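The YYYY_MM_DD_hh_mm placeholder in the log file name can be filled in automatically with date. The following is a minimal sketch; the function name archive_log_path is hypothetical, and the default directory $HOME/mytapelogs is simply the one used in the example above.

```shell
# archive_log_path [LOG_DIR]
# Hypothetical helper: prints a log file name of the form
# <LOG_DIR>/archived_on_YYYY_MM_DD_hh_mm for the current time.
archive_log_path() {
    log_dir=${1:-$HOME/mytapelogs}
    printf '%s/archived_on_%s\n' "$log_dir" "$(date +%Y_%m_%d_%H_%M)"
}

# Usage sketch (not executed here, it requires the TSM client):
#   mkdir -p "$HOME/mytapelogs"
#   dsmc archive myFile > "$(archive_log_path)"
```

Note that two archive commands started within the same minute would write to the same log file; adding seconds to the format string is one way to avoid that.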

If the file name contains spaces, you have to enclose it in double quotes, e.g., "my file with spaces". If you want to archive several files myFile1, myFile2, ..., you can use wildcards or list the file names explicitly:

   dsmc archive myFile* 
   dsmc archive myFile1 myFile2 myFile3

You can also archive complete directory trees. This can be achieved by using an additional command-line option

   dsmc archive -subdir=yes MyDirectory/

dsmc interprets MyDirectory/ as a directory because of the trailing slash.

Retrieving data with TSM

You can search for archived files in a subdirectory $MY_SCRDIR of any file system by issuing the command

   dsmc query archive -subdir=yes $MY_SCRDIR/

Again, the slash after $MY_SCRDIR is important: it tells dsmc that the argument is a directory. A file can be retrieved with one of the commands

   dsmc retrieve $MY_SCRDIR/myFile $MY_SCRDIR/myNewFileName
   dsmc retrieve -description="V.1.2" myFile myNewFileName

If you omit the second file argument, the file will be restored under its original name. Of course, you can also retrieve complete directory trees.

   dsmc retrieve MyDirectory/ RetrievedDirectory/ -subdir=yes

This will restore the data in RetrievedDirectory/MyDirectory/. Again, directory or file names containing spaces have to be enclosed in double quotes, and directory names must end with a slash (/).

Retrieving files with several versions

If you have several versions of the same file, you can use the options -fromdate, -fromtime, -todate, -totime, and -description to distinguish them. You might need to specify the format of the date and time strings. Interactively, you can use the -pick option.

   dsmc retrieve -timeformat=4 -dateformat=3 -fromdate=2011-11-30 -fromtime=23:33:00 MyFile
   dsmc retrieve -pick MyFile
   dsmc retrieve -description="V.1.2" myFile

Deletion of data from TSM archives

By default, users are not permitted to delete data from the archives, to prevent accidental deletion. However, since many users request this feature, the permission can be granted on request via the Servicedesk.

Please bear in mind that deletion rights can only be granted at the granularity of a project: once granted, all users of the project are allowed to delete their own data. A user cannot, however, delete data archived by another user. Please also bear in mind that deleted data cannot be restored, so be very careful when deleting data. If you are unsure, feel free to contact us via the Servicedesk for guidance.

Dealing with resource limits (very large archives)

On some of LRZ's HPC systems, resource limits are in place to prevent misuse. Please use the ulimit command to check the values of these limits. In particular, a CPU time limit (-t switch of ulimit) may cause archiving of very large files to abort. If you are affected by this, you need to split your data and archive disjoint subsets with multiple dsmc commands (possibly in parallel).
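A quick way to see whether a CPU time limit applies to the current shell is sketched below. The function name check_cpu_limit and the directory names in the comment are hypothetical.

```shell
# Reports the CPU time limit of the current shell.
# `ulimit -t` prints the limit in seconds, or "unlimited".
check_cpu_limit() {
    limit=$(ulimit -t)
    if [ "$limit" = "unlimited" ]; then
        echo "no CPU time limit"
    else
        echo "CPU time limit: $limit seconds"
    fi
}

check_cpu_limit

# If a limit applies, archive disjoint subsets with separate commands,
# e.g. (part1/ and part2/ are placeholders; requires the TSM client):
#   dsmc archive -subdir=yes part1/
#   dsmc archive -subdir=yes part2/
```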

Checking that everything has been archived

In order to make sure all your files have been archived successfully, you should check and save the summary output of the TSM client. It typically looks like this:

   userxyz@i01r12s30:~> dsmc ar test/ -subdir=yes | tee LOG 
   IBM Tivoli Storage Manager
   Command Line Backup-Archive Client Interface
   Client Version 6, Release 2, Level 2.7  
   Client date/time: 11/13/2012 08:55:52
   (c) Copyright by IBM Corporation and other(s) 1990, 2011. All Rights Reserved.
   Node Name: PRXXFA
   Server date/time: 11/13/2012 08:55:52  Last access: 11/13/2012 08:26:50
   Archive function invoked.
   Directory-->              69,632 /home/hpc [Sent]      
   Directory-->               8,192 /home/hpc/prxxfa [Sent]      
   Directory-->               8,192 /home/hpc/prxxfa/userxyz [Sent]      
   Directory-->               4,096 /home/hpc/prxxfa/userxyz/test [Sent]      
   Normal File-->                 4 /home/hpc/prxxfa/userxyz/test/testfile1 [Sent]      
   Normal File-->                 4 /home/hpc/prxxfa/userxyz/test/testfile2 [Sent]      
   Normal File-->                 4 /home/hpc/prxxfa/userxyz/test/testfile3 [Sent]      
   Normal File-->                 4 /home/hpc/prxxfa/userxyz/test/testfile4 [Sent]      
   Normal File-->                 4 /home/hpc/prxxfa/userxyz/test/testfile5 [Sent]      
   Archive processing of '/home/hpc/prxxfa/userxyz/test/*' finished without failure.
   Total number of objects inspected:        9
   Total number of objects archived:         9
   Total number of objects updated:          0
   Total number of objects rebound:          0
   Total number of objects deleted:          0
   Total number of objects expired:          0
   Total number of objects failed:           0
   Total number of bytes inspected:     88.01 KB
   Total number of bytes transferred:     185  B
   LanFree data bytes:                    125  B
   Data transfer time:                    0.00 sec
   Network data transfer rate:       20,073.78 KB/sec
   Aggregate data transfer rate:          0.00 KB/sec
   Objects compressed by:                    0%
   Total data reduction ratio:           99.80%
   Elapsed processing time:           00:00:56
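The check of the summary can be automated for a saved log such as the file LOG written by tee above. This is a minimal sketch under the assumption that the log contains the "Total number of objects failed" line exactly as shown; the function name archive_log_ok is hypothetical.

```shell
# archive_log_ok LOG_FILE
# Hypothetical helper: succeeds only if LOG_FILE contains at least one
# "Total number of objects failed" summary line and all of them report 0.
archive_log_ok() {
    log_file=$1
    awk -F: '
        /Total number of objects failed/ {
            gsub(/[ ,]/, "", $2)   # strip padding and thousands separators
            n++
            sum += $2
        }
        END { exit !(n > 0 && sum == 0) }
    ' "$log_file"
}

# Usage sketch:
#   archive_log_ok LOG && echo "archive complete" || echo "check LOG"
```

A log without any summary line is treated as a failure, since an aborted dsmc run may never print one.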


If archiving or retrieving is not starting promptly this is most probably NOT a problem. There usually are just no free tape drives available at the moment. If you encounter such a situation, please be patient and wait. We only have 15 tape drives for SuperMUC available and if there is much workload from multiple users on the system, it can take several hours until your archive job gets a free tape drive. Avoid cancelling and resubmitting your archive command. If you feel that there is a problem, please open a support ticket at LRZ Servicedesk.

Archiving and retrieving files and directories (GUI)

You can also use the GUI for archiving and retrieving. Start the client with the command

module load java


Archiving data using the GUI

You can archive a file or a group of files using file names, or you can select files that match your search criteria using a directory tree.
Perform archives using the following procedure:

  1. Click Archive from the main window.
  2. Expand the directory tree by clicking the plus sign (+) or the folder icon next to an object in the tree. To search or filter files, click the Search icon from the tool bar. Your directories and files are under the paths:
       "Local ->/gpfs/scratch/project/user", "Local ->/gpfs/work/project/user", or "Network->/home/hpc/project/user" on SuperMUC
       "Network->/naslx/projects/project/user", or "Network->/home/hpc/project/user"
  3. Enter your search criteria in the Find Files (Archive) window.
  4. Click the Search button. The Matching Files (Archive) window appears.
  5. Click the selection boxes next to the files you want to archive and close the Matching Files (Archive) window.
  6. Enter your filter criteria in the Find Files (Archive) window.
  7. Click the Filter button. The Archive window displays the filtered files.
  8. Click the selection boxes next to the filtered files or directories you want to archive.
  9. Enter the description, accept the default description, or select an existing description for your archive package in the Description box. When an existing archive description is used, the files or directories selected are added to the archive package. All archived packages with the same description are grouped for retrieves, queries, and deletions.
  10. To modify specific archive options, click the Options button. Any options you change are effective during the current session only.
  11. Click on Archive. The Archive Task List window displays the archive processing status.

Retrieving data using the GUI

  1. Click Retrieve from the client Java GUI main window.
  2. Expand the directory tree by clicking the plus sign (+) or the folder icon next to an object you want to expand. To search or filter files, click the Search icon from the tool bar.
  3. Enter your search criteria in the Find Files (Retrieve) window.
  4. Click the Search button. The Matching Files (Retrieve) window appears.
  5. Click the selection boxes next to the files you want to retrieve and close the Matching Files (Retrieve) window.
  6. Enter your filter criteria in the Find Files (Retrieve) window.
  7. Click the Filter button. The Retrieve window displays the filtered files.
  8. Click the selection boxes next to the filtered files or directories you want to retrieve.
  9. To modify specific retrieve options, click the Options button. Any options you change are effective during the current session only.
  10. Click Retrieve. The Retrieve Destination window appears. Enter the appropriate information in the Retrieve Destination window.
  11. Click Retrieve. The Retrieve Task List window displays the retrieve processing status.

Optimal usage of TSM on all systems

In order to achieve a better performance with TSM archive or retrieve jobs you should consider the following guidelines.

Use large (multi-GB) files

If you have many small files, pack them into a tar archive and put the tar file into the tape archive:

   tar cvf tar.tar small_files
   dsmc ar tar.tar

Archiving small files

If you cannot avoid having many small files, put them in a directory on the NAS Filesystem and archive this directory from there. By no means try to archive many small files from any GPFS Filesystem because the slow meta data operation performance of GPFS will slow down archiving to 2-10 MB/s.

   dsmc ar smallfiles/ -subdir=yes

Avoid specifying the files on the command line, as in dsmc ar smallfiles/*. The difference is that the first command groups up to 4096 files or 20 GB of data into a single transaction, so you get the wire speed of the tape drive. The second command creates a separate transaction for each file and will therefore be very slow (by a factor of 10 or more, depending on the size of the files).

Optimal usage of TSM on SuperMUC

LAN-less archiving

LRZ provides a Tivoli Storage Manager (TSM) client on the two special SuperMUC TSM nodes mapped to the DNS name supermuc-tsm.lrz.de. For thorough coverage of the TSM client, please read the IBM documentation.

The TSM client sends/retrieves the archive metadata via LAN to/from the TSM server, where it is stored in a relational database, but the actual archive data is written/read by the client itself via the Storage Area Network (SAN) directly to/from tape (see figure). This is called "LAN-less archiving". For this reason, you will encounter a delay between calling the TSM client and sending/retrieving the first file, because the tape first has to be mounted. Usually this delay is about 1-2 minutes; however, if all tape drives are in use, it can be much longer. Currently there are 15 LTO-5 drives available for SuperMUC archiving. The native throughput of a single tape drive is 140 MB/s; depending on how well your data compresses, you may get up to 280 MB/s. Each of the two SuperMUC archive nodes has a bandwidth of 2 x 800 MB/s to the LRZ SAN.
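The drive throughput quoted above gives a rough feeling for how long an archive job will take. The sketch below computes a lower bound for the pure transfer time. It assumes the native rate of 140 MB/s per drive and perfect scaling across parallel streams, and it ignores the tape mount delay and any waiting for free drives. The function name estimate_archive_seconds is hypothetical.

```shell
# estimate_archive_seconds SIZE_MB [STREAMS] [RATE_MB_PER_S]
# Rough lower bound for the tape transfer time, rounded up to whole seconds.
# Defaults: 1 stream, 140 MB/s per drive (the native rate quoted above).
estimate_archive_seconds() {
    size_mb=$1
    streams=${2:-1}
    rate=${3:-140}
    total_rate=$(( streams * rate ))
    # integer ceiling division
    echo $(( (size_mb + total_rate - 1) / total_rate ))
}

estimate_archive_seconds 30720      # 30 GB on one drive:    220
estimate_archive_seconds 30720 3    # 30 GB on three drives: 74
```

Compressible data can finish faster than this estimate, since the effective rate per drive may then exceed the native 140 MB/s.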



Working with parallel archive streams

By default, a single TSM client call uses only a single tape drive. When you need to archive multiple big files, the throughput of a single tape drive may not be enough. In this case you can specify the number of parallel streams (and therefore the number of tape drives used in parallel) with the resourceutilization parameter. Keep in mind, however, that the resourceutilization parameter does not directly specify the number of sessions created by the client; it only influences the client's decision on how many resources it may use. Details on how this works can be found in the IBM documentation.

Please also bear in mind that SuperMUC has only 15 tape drives available and that other users may want to archive data at the same time as you. So please be considerate of other users and do not start too many parallel archiving jobs at once. The values of practical relevance for archiving are:

Resource Utilization   Max number of parallel write streams
4                      2
6                      3
7                      4
9                      5
10                     6
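The table above can be expressed as a small lookup that returns the resourceutilization value for a desired number of parallel write streams. This is a sketch only; the function name resourceutilization_for_streams is hypothetical, and the values are simply those listed in the table.

```shell
# resourceutilization_for_streams STREAMS
# Maps a desired number of parallel write streams (2-6) to the
# resourceutilization value listed in the table.
resourceutilization_for_streams() {
    case "$1" in
        2) echo 4 ;;
        3) echo 6 ;;
        4) echo 7 ;;
        5) echo 9 ;;
        6) echo 10 ;;
        *) echo "unsupported number of streams: $1" >&2; return 1 ;;
    esac
}

# Usage sketch (not executed here, it requires the TSM client):
#   ru=$(resourceutilization_for_streams 3) &&
#       dsmc ar test/ -subdir=yes -resourceutilization="$ru"
```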

In the example below we have 3 large files that we want to archive in parallel. Therefore we use the following command:

   dsmc ar test/ -subdir=yes -resourceutilization=6
   IBM Tivoli Storage Manager
   Command Line Backup-Archive Client Interface
   Client Version 6, Release 2, Level 2.7  
   Archive function invoked.
   Normal File-->    10,737,418,240 /home/hpc/prxxfa/userxyz/test/testfile2 [Sent]      
   Normal File-->    10,737,418,240 /home/hpc/prxxfa/userxyz/test/testfile1 [Sent]      
   Normal File-->    10,737,418,240 /home/hpc/prxxfa/userxyz/test/testfile3 [Sent]      
   Archive processing of '/home/hpc/prxxfa/userxyz/test/*' finished without failure.
   Total number of objects inspected:        3
   Total number of objects failed:           0
   Total number of bytes transferred:   30.01 GB
   LanFree data bytes:                   30.00 GB
   Data transfer time:                   69.38 sec
   Network data transfer rate:   453,533.57 KB/sec
   Aggregate data transfer rate: 413,967.37 KB/sec

Unfortunately, TSM currently lacks support for parallel retrieve sessions. You should therefore start retrieving your files early enough that they are ready when you need them. We have filed a Request for Enhancement (RFE) with IBM to add this feature. You can help us prioritize the request by logging into the IBM Developer Network and voting for that RFE. In special cases where you have to store and retrieve intermediate result files from/to scratch, it may be possible to work around this with special procedures. If you currently need parallel retrieve/restore sessions, please contact us via the ServiceDesk so that we can help find an individual solution for you.

Working with parallel retrieval streams

Currently, there is no out-of-the-box solution in Spectrum Protect to perform parallel retrievals of archived files. However, you can use pdsmc either to parse an already generated dsmc query archive output or as a wrapper around the dsmc query archive process.

Retrieving files into $SCRATCH

There is one minor problem when you retrieve files into $SCRATCH. The files are restored with the last access date (atime) of the original files. However, the automatic cleanup procedure in $SCRATCH deletes files older than 30 days. You therefore have to touch all restored files within the time span between retrieval and the cleanup deletion (which is typically done overnight). Alternatively, you can retrieve the files into $WORK, where no automatic cleanup is done.
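Touching every restored file by hand is tedious for large directory trees. A minimal sketch using find is shown below; the function name refresh_timestamps and the directory name in the usage comment are hypothetical.

```shell
# refresh_timestamps DIR
# Hypothetical helper: sets the access and modification time of every
# regular file below DIR to the current time.
refresh_timestamps() {
    find "$1" -type f -exec touch {} +
}

# Usage sketch, right after a retrieve into $SCRATCH:
#   refresh_timestamps "$SCRATCH/RetrievedDirectory"
```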

Special cases of TSM usage

Symbolic Links

By default, Tivoli Storage Manager follows a symbolic link and archives the associated file or directory. To avoid this, specify that only the symbolic link, and not the associated file or directory, is archived:

   dsmc -archsymlinkasfile=no

Switching between TSM nodes: General information

It may be necessary to access files archived by other users or on a previous system. Such files may reside on a different TSM server node than the one assigned to a particular user account on a particular system. Therefore, a special script is available that allows switching between TSM server nodes based on account and system information.

User shell      Switching command
sh, bash, ksh   source set_dsm_config.sh <user name> <system name>
csh, tcsh       source set_dsm_config.csh <user name> <system name>

The <system name> argument is ignored on the Linux Cluster or the HLRB-II. On SuperMUC, it may have the value "SUPERMUC" or "HLRB2"; both arguments are optional, but if <system name> is specified, <user name> must also be specified and stated first. If run without arguments, it resets the $DSM_CONFIG to the invoking user on the local system.

For setting access permissions (see below for examples), it is necessary to explicitly specify the name of the TSM node, typically denoted by <TSM_NODE>; the value which needs to be entered for <TSM_NODE> is contained in the "servername" entry of the file pointed at by $DSM_CONFIG; it will typically be of the following form:

HPC system at LRZ                            TSM node name
SuperMUC and Linux-Cluster after Nov. 2011   <project name = unix group name>
HLRB-II                                      HLRBArchive_<number>
Linux Cluster before Nov. 2011               LXCL_ARCHIVE_<number>

LRZ does not support arbitrary combinations of systems and users. The allowed combinations are described in the following subsections.

Retrieving files/directories belonging to other users

To retrieve files which were archived by other users (even by other members of your group), you need to perform the following steps:

a) Owner sets access permissions: the user who archived the data (e.g., h1100xx) must execute the command

   h1100xx$ dsmc set access archive "*" <TSM_NODE> h0000yy

This will grant access to all files ("*") archived on TSM node <TSM_NODE> (the TSM node the user is bound to, see the general information above) for user h0000yy.

b) Other user retrieves file(s): the other user (h0000yy) must execute the following commands (assuming the bash shell is used):

   h0000yy$ source set_dsm_config.sh h1100xx
   h0000yy$ dsmc q ar  -fromowner=h1100xx "<absolute directory name>/*"
   h0000yy$ dsmc retrieve -fromowner=h1100xx -fromdate='dd/mm/yyyy'    \
   "<archived file>" "<local file>"
   h0000yy$ dsmc retrieve -fromowner=h1100xx -fromdate='dd/mm/yyyy'    \
   -subdir=yes "archived_directory/*" "newdir/"

Again, directory or file names containing spaces have to be enclosed in double quotes, and directory names must end with a slash (/).

Do not forget to clean up your environment ...

Once the files stored on a different TSM node have been retrieved, please reset the configuration to the original user account and/or system by running source set_dsm_config.sh without arguments.

Otherwise, attempts to perform "normal" archiving or restoring (under your own user account) may fail or yield unexpected results.

Other Special Cases

Conversion of a SuperMUC project into a Data-Only project (after project end)

On request, it is possible to convert a Linux-Cluster or SuperMUC project into a Data-Only project. Within such a Data-Only project, the project manager can continue to retain and access the data archived on tape, thus using the tape archive as safe and reliable long-term storage for the data generated by a SuperMUC project.

Data can then be accessed via the gateway node "tsmgw.abs.lrz.de" using the SuperMUC username and password of the project manager. Access to the server is possible via SSH with no restrictions on the IP address. However, access to SuperMUC itself is not possible after the end of a project. Currently, the server is equipped with 37 TB of local disk storage (/tsmtrans) to buffer the data retrieved from tape. There is a directory /tsmtrans/<username> where you can store the data and transfer it via scp. The same location is also accessible via GridFTP on the standard port 2811. It is possible to use the command line tool globus-url-copy, Globus Online (via the endpoint lrz#TSMgateway), or any other client which supports the GridFTP protocol.

The project manager can access all data of the project that are stored in the tape archive, but it is necessary to use the -fromowner=otheruser flag for data which was not archived by him/herself but another project member. Also, the password for accessing the tape archive (TSM Node) is not stored on the gateway node and must be set and remembered by the project manager.  

  • When your SuperMUC project ends, the project manager will receive a reminder e-mail explaining the steps necessary to convert the project, if desired.

The TSM concept of File Space

TSM uses so-called file spaces, which represent the file systems. However, the file spaces depend on how the file systems are mounted. On rare occasions LRZ may need to change the mount points, which results in different file spaces. The file spaces can be displayed with the command

   dsmc q fi

E.g., on HLRB2, /home and /home/hlrb2 represent two distinct file spaces.

Which file space is used by TSM depends on the specifications in the command:

   dsmc q ar "/home/*" -subdir=yes 

would display the files in the space /home but not the files in the file space /home/hlrb2; vice versa the command

   dsmc q ar "/home/hlrb2/*" -subdir=yes 

would display the files in the file space /home/hlrb2 but not the files in the file space /home. The file space can be explicitly specified by using curly brackets.

   dsmc q ar "{/home}/hlrb2/*" -subdir=yes

would display all files from the file space /home which were archived from the directory /home/hlrb2. Thus, the previous two commands would give different output although the same files in the file system are targeted.

Help function and further information

To get more help on the more advanced functions of TSM, type

   dsmc help

and follow the instructions. More information can also be found in: