
Handling backups and archives on the HPC systems

On LRZ's HPC systems, mechanisms are provided which allow the user to restore accidentally deleted or overwritten files, write files to tape, and retrieve them. This document describes usage and recommended practices for these facilities. It applies to all HPC systems; where applicable, system specific settings and procedures are described.




Snapshots, Backup, and Restore

Snapshots

For all files in $HOME backup copies are kept and made available in the special subdirectory

$HOME/.snapshot/


Several snapshots are available:
 
File system   Time of snapshot                     Number of snapshots retained   How to access
$HOME         daily at 3:00, 9:00, 15:00, 21:00    4                              $HOME/.snapshot/hourly.[0-3]/
$HOME         daily at 0:00                        10                             $HOME/.snapshot/nightly.[0-9]/

A file can be restored by simply copying the file from the appropriate snapshot directory to its original location. Please note:

  • The directory $HOME/.snapshot/ is not listed by the ls or even ls -a command, and you cannot create it either. It is however possible to do a cd $HOME/.snapshot/ and then see all entries using the ls command.
  • When copying the snapshot file to its original location, some versions of cp might refuse to overwrite the original location (since it uses the same i-node). In that case, copy the snapshot file to an alternative location and then move it to the original location.
  • There are no snapshots for the $WORK and $SCRATCH file systems on SuperMUC. You must archive data in these file spaces yourself if necessary.
  • Deleted files in your ordinary $HOME directory are still contained in the snapshot directories, and they count against the volume quota. Because of the way snapshots work, space is reserved for old file versions; this reserve is three times the size (300%) of your project quota. For example, if your quota is 25 GB, there are 75 GB of "snapshot reserve" for changes. If you change or delete more than 75 GB of data during a 10-day interval, your project space may fill up, and even deleting files will not recover any storage. Please contact the LRZ Service Desk if you run into problems with this mechanism; LRZ sysadmins can manually remove superfluous snapshots.
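
The restore procedure from the notes above can be sketched as a small shell function. This is only an illustration: the function name restore_from_snapshot is hypothetical, and it assumes that the directory layout inside a snapshot mirrors the layout of $HOME. It copies the snapshot file to a temporary name first and then renames it, which avoids the cp overwrite problem mentioned above.

```shell
#!/bin/sh
# Restore one file from a snapshot directory to its original location.
# Usage: restore_from_snapshot <snapshot name> <path relative to $HOME>
# Assumption: paths inside the snapshot mirror the paths below $HOME.
restore_from_snapshot() {
    snap_name=$1
    rel_path=$2
    src="$HOME/.snapshot/$snap_name/$rel_path"
    dest="$HOME/$rel_path"

    if [ ! -f "$src" ]; then
        echo "no such snapshot file: $src" >&2
        return 1
    fi

    # Recreate the target directory in case it was deleted as well.
    mkdir -p "$(dirname "$dest")" || return 1

    # Copy to a temporary name next to the target, then rename it.
    # This sidesteps cp versions that refuse to overwrite the original.
    tmp="$dest.restore.$$"
    cp -p "$src" "$tmp" && mv -f "$tmp" "$dest"
}
```

For example, restore_from_snapshot nightly.0 proj/data.txt would bring back last night's version of $HOME/proj/data.txt.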

Tape backups

In addition to the snapshots described above, LRZ also maintains tape backups. Tape backups of $HOME are made only for the purpose of disaster recovery and are not intended for use in daily operation. Usually, users can rely on the snapshots, e.g., for restoring unintentionally deleted files. Tape backups are made less often but are kept longer:
 
File system   Time of tape backup   Number of file versions retained   Lifetime of unchanged files   Lifetime of backup for files removed from disk storage
$HOME         Saturday 22:50        3                                  duration of the project       1 year

Please note that no tape backups are performed for the $OPT_TMP and $PROJECT file systems on the Linux Cluster, and none for the $SCRATCH and $WORK file systems on SuperMUC. You need to archive data residing on these file systems yourself, at your own discretion.

Restoring tape backups

If you cannot find the version of the data you need in the snapshot directories, please contact the LRZ Service Desk with a request to restore the data from the TSM tape backup for you. The TSM tape backup of $HOME is not directly accessible by users. Please also keep in mind that restoring a backup from tape is labor-intensive and may take a long time to complete.


Archiving and retrieving files and directories

Archiving means saving the current version of a file to tape. Several versions of the same file can be kept in the tape archive. To restore a particular version, you must identify it by date and time or by an additional description.

In order to archive and retrieve data on SuperMUC/SuperMIG or the Linux Cluster, the Tivoli Storage Management Infrastructure (TSM) is provided. A system-wide TSM client configuration is available, so you do not need to install or configure a TSM client yourself.

Archiving data with TSM

Let's assume you have a file myFile stored on the temporary file system in a directory denoted by some self-defined environment variable $MY_SCRDIR. Since myFile may be automatically removed from $MY_SCRDIR by high-watermark deletion after some days, you might want to have an archive copy at hand. To create one, go to $MY_SCRDIR and invoke one of the commands
dsmc archive myFile
dsmc archive -description="V.1.2" myFile

We recommend keeping logs of all archive commands in a specific directory, e.g.

dsmc archive myFile >$HOME/mytapelogs/archived_on_YYYY_MM_DD_hh_mm

This might later help to avoid confusion with the file namespace (see below).
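
The logging recommendation above could be wrapped in a small helper so that the time stamp is filled in automatically. This is a minimal sketch: the function name archive_with_log and the LOGDIR variable are hypothetical, and recording the working directory in the log is an addition made here because archived path names depend on it.

```shell
#!/bin/sh
# Archive the given files and keep the dsmc output in a time-stamped log.
# LOGDIR is a hypothetical setting; adjust it to your needs.
LOGDIR="${LOGDIR:-$HOME/mytapelogs}"

archive_with_log() {
    mkdir -p "$LOGDIR" || return 1
    # Log name follows the pattern archived_on_YYYY_MM_DD_hh_mm.
    logfile="$LOGDIR/archived_on_$(date +%Y_%m_%d_%H_%M)"
    # Record the working directory and the command line along with
    # the dsmc output; the exit status is that of dsmc.
    {
        echo "# cwd: $(pwd)"
        echo "# cmd: dsmc archive $*"
        dsmc archive "$@"
    } >>"$logfile" 2>&1
}
```

A call such as archive_with_log -description="V.1.2" myFile then behaves like the plain dsmc archive command but leaves a log file behind.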

In case the file name contains spaces you have to enclose it in double quotes, e.g., "my file with spaces". If you want to archive several files myFile1, myFile2, ... you can use wildcards or specify the filenames

dsmc archive myFile* 
dsmc archive myFile1 myFile2 myFile3

You can also archive complete directory trees. This can be achieved by using an additional command-line option

dsmc archive -subdir=yes MyDirectory/ 
Please note the trailing slash in the directory name. This slash is important since it ensures that dsmc interprets MyDirectory/ as a directory.

Please also consult the section on optimal usage of TSM.

Retrieving data with TSM

You can search for archived files in a subdirectory $MY_SCRDIR of any file system by issuing the command

dsmc query archive -subdir=yes $MY_SCRDIR/ 

Again, the slash after $MY_SCRDIR is important to tell dsmc that it is a directory. A file can be retrieved with one of the commands

dsmc retrieve $MY_SCRDIR/myFile $MY_SCRDIR/myNewFileName
dsmc retrieve -description="V.1.2" myFile myNewFileName
If you omit the second file argument, the file will be restored under its original name. Of course, you can also retrieve complete directory trees
dsmc retrieve MyDirectory/ RetrievedDirectory/ -subdir=yes

This will restore the data in RetrievedDirectory/MyDirectory/. Again, directory or file names containing spaces have to be enclosed in double quotes.

Retrieving files with several versions

If you have several versions of the same file, you can use the options -fromdate, -fromtime, -todate, -totime, and -description to select the one you need. You might need to specify the format of the date and time strings. Interactively, you can use the -pick option.

dsmc retrieve -timeformat=4 -dateformat=3 -fromdate=2011-11-30 -fromtime=23:33:00 MyFile
dsmc retrieve -pick MyFile
dsmc retrieve -description="V.1.2" myFile

Deletion of TSM archives

To prevent users from shooting themselves in the foot, deletion of archives has been disabled. Multiple archiving of the same path name is always tagged with at minimum the archiving date on the TSM server, and the last archived version is retrieved unless you specify the -pick subargument at retrieval, in which case you are offered a choice of archived versions.

Help function and further information

To get more help on the more advanced functions of TSM, type the following command and follow the instructions:

dsmc help

More information can also be found in the

Special cases of TSM usage

Switching between TSM nodes: General information

It may be necessary to access files archived by other users or on a previous system. Such files may reside on a different TSM server node than the one assigned to a particular user account on a particular system. Therefore, a special script is available that switches between TSM server nodes based on account and system information.

User shell      Switching command
sh, bash, ksh   source set_dsm_config.sh <user name> <system name>
csh, tcsh       source set_dsm_config.csh <user name> <system name>

The <system name> argument is ignored on the Linux Cluster or the HLRB-II. On SuperMUC, it may have the value "SUPERMUC" or "HLRB2"; both arguments are optional, but if <system name> is specified, <user name> must also be specified and stated first. If run without arguments, it resets the $DSM_CONFIG to the invoking user on the local system.

For setting access permissions (see below for examples), it is necessary to explicitly specify the name of the TSM node, typically denoted by <TSM_NODE>; the value which needs to be entered for <TSM_NODE> is contained in the "servername" entry of the file pointed at by $DSM_CONFIG; it will typically be of the following form:

HPC system at LRZ   TSM node name
SuperMUC            <project name = unix group name>
HLRB-II             HLRBArchive_<number>
Linux Cluster       LXCL_ARCHIVE_<number>
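
To look up the value for <TSM_NODE>, the "servername" entry can be read from the file that $DSM_CONFIG points to. The helper below is a sketch only: the function name tsm_node_name is hypothetical, and it assumes that the option file contains a line of the form "servername <value>" separated by whitespace, with the option name in any mix of upper and lower case.

```shell
#!/bin/sh
# Print the value of the "servername" entry of a TSM option file.
# Usage: tsm_node_name [option file]   (defaults to $DSM_CONFIG)
# Assumption: the entry is a whitespace-separated "servername <value>" line.
tsm_node_name() {
    config=${1:-$DSM_CONFIG}
    # Compare the first word case-insensitively and print the second word
    # of the first matching line.
    awk 'tolower($1) == "servername" { print $2; exit }' "$config"
}
```

The printed name can then be used in place of <TSM_NODE> in the dsmc set access commands shown in the scenarios.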

LRZ does not support arbitrary combinations of systems and users. The allowed combinations are described in the following subsections.

Scenario 1: retrieving files/directories belonging to other users

To retrieve files that were archived by other users (even by other members of your group), you need to perform the following steps:

a) Owner sets access permissions: the user who archived the data (e.g., h1100xx)  must execute the command

h1100xx$ dsmc set access archive "*" <TSM_NODE> h0000yy

This grants user h0000yy access to all files ("*") archived on TSM node <TSM_NODE> (the TSM node the owner is bound to, see the general information above).

b) Other user retrieves file(s): the other user (h0000yy) must execute the following commands (assuming the bash shell is used):

h0000yy$ source set_dsm_config.sh h1100xx 
h0000yy$ dsmc q ar  -fromowner=h1100xx "<absolute directory name>/*" 
h0000yy$ dsmc retrieve -fromowner=h1100xx -fromdate='dd/mm/yyyy'    \
                  "<archived file>" "<local file>" 
h0000yy$ dsmc retrieve -fromowner=h1100xx -fromdate='dd/mm/yyyy'    \
                  -subdir=yes "<archived_directory/*>" "<local dir>"

Scenario 2: retrieving files archived from HLRB2 on SuperMUC

If you want to access your HLRB2 archive files on SuperMUC you need to execute the following command:

source set_dsm_config.sh <your user name> HLRB2 

After that, you can retrieve files as generally described above, provided you still remember the base directory paths: Note that the scratch and project file areas on HLRB-II used the following scheme:
  • the $HOME directory used /home/hlrb2/<group>/<user>
  • the $OPT_TMP directory used /ptmp1/<group>/<user>
  • the $PROJECT area used /ptmp2/<group>/<user>

Scenario 3: On SuperMUC, retrieving HLRB2 archives generated by another user

a) Owner sets access permissions: the user who archived the data (e.g., lu9999xx) must execute the commands:

lu9999xx$ source set_dsm_config.sh lu9999xx HLRB2 
lu9999xx$ dsmc set access archive "*" <TSM_NODE> lu1234yy

This grants user lu1234yy access to all files ("*") archived on the TSM node of HLRB2 (the TSM node the owner is bound to, see the general information above).

b) Other user retrieves file(s): the other user must execute the commands (assuming the bash shell is used):

lu1234yy$ source set_dsm_config.sh lu9999xx HLRB2        
lu1234yy$ dsmc q ar  -fromowner=lu9999xx "<absolute directory path>/*"
lu1234yy$ dsmc retrieve -fromowner=lu9999xx -fromdate='dd/mm/yyyy' \
                   "<archivedfile>" "<localfile>"
lu1234yy$ dsmc retrieve -fromowner=lu9999xx -fromdate='dd/mm/yyyy' -subdir=yes  \
                   "<archiveddir/*>" "<localdir>"

Scenario 4: On SuperMUC, retrieving archives which have incidentally been archived on login2.hlrb2.lrz.de

This subsection is of interest only to a few users of HLRB2 who archived files on the alternate login node login2.hlrb2.lrz-muenchen.de, before Sep 16, 2011. They can retrieve and query their archive by:

dsmc q ar -se=tsm1 -subdir=yes "<absolute directory path>/*"
dsmc retrieve -se=tsm1 -subdir=yes "<archiveddir/*>" "<localdir>"

Scenario 5: Retrieving older files on the Linux Cluster systems

This subsection is of interest only to users of the Linux Cluster systems at LRZ.

Due to changes in the system configuration, it is necessary to use one of the following commands to retrieve files archived before April 14, 2009:

  • for files archived after the restructuring of the HOME path names; please take note of the curly brackets (see the section on the TSM concept of file space):
dsmc retrieve {/home}/cluster/$(id -g -n)/$USER/MyDirectory \
        $HOME/MyDirectory -subdir=yes  -se=HPCArchive
  • for files archived before the restructuring of the HOME path names:
dsmc retrieve /home/cluster/$USER/MyDirectory \
        $HOME/MyDirectory -subdir=yes  -se=HPCArchive
  • for files archived from the /lustre or /lustre_projects storage areas:
dsmc retrieve /lustre[_projects]/.../MyDirectory \
        /lustre[_projects]/.../MyDirectory -subdir=yes  -se=HPCArchive

Do not forget to clean up your environment ...

Once the files stored on a different TSM node have been retrieved, please reset the configuration to the original user account and/or system by running

source set_dsm_config.sh

Otherwise, attempts to perform "normal" archiving or restoring (under the own user account) may fail or yield unexpected results.

The TSM concept of File Space

TSM uses so-called file spaces, which represent the file systems. However, the file spaces depend on how the file systems are mounted. On rare occasions LRZ may need to change the mount points, which results in different file spaces. The file spaces can be displayed by using the command

dsmc q fi

E.g., on HLRB2, /home and /home/hlrb2 represent two distinct file spaces.

Which file space is used by TSM depends on the specifications in the command:

dsmc q ar "/home/*" -subdir=yes 

would display the files in the file space /home but not the files in the file space /home/hlrb2; vice versa, the command

dsmc q ar "/home/hlrb2/*" -subdir=yes 

would display the files in the file space /home/hlrb2 but not the files in the file space /home. The file space can be explicitly specified by using curly brackets.

dsmc q ar "{/home}/hlrb2/*" -subdir=yes

would display all files from the file space /home which were archived from the directory /home/hlrb2. Thus, the previous two commands would give different output although the same files in the file system are targeted.


Optimal usage of TSM

In order to achieve a better performance with TSM archive or retrieve jobs you should consider the following guidelines.

Archiving/Retrieving large files

Use this procedure only if your archive files are larger than 1 GByte each.
Otherwise, first accumulate your files in tar archives that are larger than 1 GByte, or follow the procedure explained in the next subsection. If you have multiple files to archive/retrieve, you should archive/retrieve more than one file per dsmc call. For large files, the optimum throughput is achieved with 4 files per dsmc call. If you have 6 files to archive, for example, call:

dsmc ar file1 file2 file3 file4
dsmc ar file5 file6
Archiving/retrieving more than 4 files with one dsmc call does not increase TSM performance anymore. Instead it may even lead to a slightly lower overall performance.
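
The six-file example above generalizes to any number of large files. The following sketch (the function name archive_in_batches is hypothetical) passes its arguments to dsmc ar in groups of at most four and stops at the first failing call.

```shell
#!/bin/sh
# Archive any number of large files, at most four per dsmc call.
# Usage: archive_in_batches file1 file2 ...
archive_in_batches() {
    while [ "$#" -gt 0 ]; do
        if [ "$#" -ge 4 ]; then
            # Full batch: take the next four file names.
            dsmc ar "$1" "$2" "$3" "$4" || return 1
            shift 4
        else
            # Last batch: fewer than four files remain.
            dsmc ar "$@" || return 1
            shift "$#"
        fi
    done
}
```

Called with file1 to file6, this issues the same two dsmc ar commands as shown in the example.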

Using file lists for many small files

If you want to archive/retrieve more than 100 files that are smaller than 1 GByte each, you should archive/retrieve them via a file list. To do so, create a file fileList.txt that contains the fully qualified path names of the files to archive, one per line. You must not use any wildcard characters, and if a file name contains spaces, you must enclose the name in quotes. After that, invoke

dsmc ar -filelist=fileList.txt
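
One possible way to generate such a file list is sketched below. The function name make_filelist is hypothetical. It writes one absolute path per line and wraps names containing spaces in double quotes; file names containing newline characters are not handled, and the output file should be placed outside the directory being listed.

```shell
#!/bin/sh
# Write a TSM file list for all regular files below a directory.
# Usage: make_filelist <directory> <output file>
make_filelist() {
    # Turn the directory argument into an absolute path.
    dir=$(cd "$1" && pwd) || return 1
    # One fully qualified name per line, in sorted order; lines that
    # contain a space are enclosed in double quotes.
    find "$dir" -type f | sort | sed '/ /s/.*/"&"/' > "$2"
}
```

After make_filelist "$MY_SCRDIR" fileList.txt, the list can be passed to dsmc ar -filelist=fileList.txt as described above.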

It is worth mentioning that accumulating many small files into few large files, by using system tools like tar, is beneficial in terms of TSM archive/retrieve performance. So, if possible create few large files and archive them by using the procedure described in the previous subsection.
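
As an illustration of the tar approach, the sketch below packs one directory into a single tar file and reports its size, printing a note if the result is still below the 1 GByte mark. The function name bundle_dir and the MIN_BYTES variable are hypothetical, and interpreting 1 GByte as 1073741824 bytes is an assumption.

```shell
#!/bin/sh
# Pack a directory into one tar file and report the resulting size.
# Usage: bundle_dir <directory> <tar file>
# MIN_BYTES is a hypothetical threshold; 1 GByte is taken as 2^30 bytes.
MIN_BYTES=${MIN_BYTES:-1073741824}

bundle_dir() {
    src=$1
    tarfile=$2
    # Store paths relative to the parent directory of the source.
    tar -cf "$tarfile" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    size=$(wc -c < "$tarfile" | tr -d ' ')
    if [ "$size" -lt "$MIN_BYTES" ]; then
        echo "note: $tarfile has only $size bytes; consider adding more files" >&2
    fi
    # Print the size in bytes on standard output.
    echo "$size"
}
```

The resulting tar files can then be archived with the procedure for large files.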

Reserving CPUs for archiving

TSM archive/retrieve performance can be further improved by dedicating CPUs to the TSM client. You can choose from two options:
  1. Interactive batch job for archiving/retrieving on SuperMUC

    (Note - this is not yet fully determined and may change)
    Use a llrun job to process TSM archiving, using at most 4 archive/retrieve jobs which are executed in parallel.