Linux Cluster Maintenance Sep 23-27

Update (October 1):

All cluster segments have been returned to operation. Remaining issues are:

  • SLURM may need to be restarted on the CooLMUC-2 interactive and batch segments (but not on the serial segment) to correct a memory usage mis-specification
  • occasional hangs of the DSS-based file systems will need longer-term investigation
  • some application software packages may still cause trouble due to internal use of invalid path names and the change in the programming environment.

Dear users of the Linux cluster systems at LRZ,

All cluster segments will be in maintenance (and therefore unavailable for both login and job processing) from September 23 to 27.
This includes all housed systems.

This maintenance involves a significant changeover of the cluster's operational mode; as a consequence, users will need to
make substantial updates to their individual data and computational configuration after the maintenance. The following
overview lists each changed item, a description of the change, and what users need to do about it.

Programming Environment

Description of change: The Intel Parallel Studio modules (compilers, libraries, and tools) will be updated from the outdated
2017 release to the 2019 version.

What users need to do: Recompilation of user codes will be needed. For some time, it will remain possible to switch back to
the old module versions (e.g., intel/17.0, mpi.intel/2017), as sketched below, but we ask you to report problems observed in
the new environment, because the old releases will be retired early next year.
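
As an illustration only, the following commands show how such a switch back might look; the exact module names and version
strings (here intel/17.0 and mpi.intel/2017, taken from this announcement) should be verified with "module avail" first.

module switch intel intel/17.0            # revert the compiler modules to the 2017 release
module switch mpi.intel mpi.intel/2017    # revert the MPI modules to the 2017 release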

Software Environment

Description of change: Release 19.1 of the Spack-generated modules will be deployed, and this release will be used as the
baseline for the HPC software stack.

What users need to do: Some module names and most module version strings will change, and either no default version or a
different default version than before will be specified. The legacy modules will be retired early next year.

Please report any problems you encounter when using the provided modules.

If you relied on software from release 18.2, you can still access it via

module switch spack/18.2

For further information on the software stack and release conventions, please consult the LRZ documentation at

https://doku.lrz.de/display/PUBLIC/Environment+Modules#EnvironmentModules-SpackGeneratedModules

In most cases, recompilation of codes will be necessary.
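
To find out how module names and version strings have changed, the standard module query commands can be used after the
maintenance; this is only a sketch, and the exact output depends on the deployed release.

module avail            # list all modules provided by the current software stack
module list             # show the modules currently loaded in your session
module avail spack      # show which spack releases are available for switching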

Batch Environment

Description of change: A new version of SLURM will be deployed.

What users need to do: All SLURM scripts will require changes. Updated example scripts are now available on the LRZ
documentation server.

At minimum, please replace "source /etc/profile.d/modules.sh" by "module load slurm_setup". We also strongly recommend using
relative path names for the -o, -e, and -D options; this will save additional work in November, when the HOME path names will
change. Furthermore, redundant resource specifications should be avoided. A sketch of a minimally updated script is given
below.

Note that jobs that make use of the WORK or PROJECT areas will fail if they attempt write accesses. See also the changes in
the Data Environment below.
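
For illustration, a minimally updated batch script might look as follows; the job name, resource values, and the program to
be started are placeholders that must be adapted to your own setup (the authoritative examples are those on the LRZ
documentation server).

#!/bin/bash
#SBATCH -J myjob                  # placeholder job name
#SBATCH -o ./myjob.%j.out         # relative path for standard output (recommended)
#SBATCH -e ./myjob.%j.err         # relative path for standard error (recommended)
#SBATCH -D ./                     # start the job in the submission directory
#SBATCH --nodes=1                 # placeholder resources; avoid redundant specifications
#SBATCH --time=01:00:00
module load slurm_setup           # replaces "source /etc/profile.d/modules.sh"
# load further modules and start your program here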

Data Environment

Description of change: If a PROJECT area exists,
  1. its path name will be stored in the WORK_LEGACY and PROJECT_LEGACY variables, and the data will only be readable,
    not writable;
  2. a new, DSS-based storage area will be made available.

The WORK and PROJECT variables are no longer set.

See also the storage migration announcement.

What users need to do: Users need to migrate their project data to the new area themselves (a command sketch is given after
this list):

  1. You will receive an email at the address registered in the LRZ identity management system. It contains an invitation
    to a new DSS storage area, to which you need to respond appropriately.
  2. On any cluster login node, issue the command
    dssusrinfo all
    This lists the paths of accessible containers, as well as quota information etc. Note that this information only
    becomes visible after you have responded to the email mentioned above.
  3. Edit your shell profile and set the PROJECT and/or WORK variable to a suitable path based on the above output,
    typically one of the DSS paths with your account name appended to it.
  4. Use the cp, rsync, or tar command to migrate your data from PROJECT_LEGACY to the new storage area.
  5. If your scripts use absolute path names instead of the PROJECT or WORK variable, they need appropriate updates.
The file system documentation has been updated accordingly.
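
As a sketch of steps 2 to 4, the following commands illustrate the migration for a bash-type shell; the DSS container path
is a placeholder and must be replaced by one of the paths reported by dssusrinfo.

dssusrinfo all                              # find the path of your new DSS container
# add to your shell profile, replacing the placeholder with the reported DSS path:
export PROJECT=<dss_container_path>/$USER
mkdir -p $PROJECT                           # create your directory in the new area if needed
rsync -av $PROJECT_LEGACY/ $PROJECT/        # copy data from the read-only legacy area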

Data access from the outside world

Description of change: The gsissh and gridftp transfer methods have become unavailable on the cluster systems.

What users need to do: Instead, please use the Globus Research Data Management Portal for accessing data located on DSS.
The file system documentation has been updated accordingly. Initially, only the new DSS project areas will be accessible.




Authors: R. Bader, M. Brehm
Published: 2019-09-19