Initial operation of CooLMUC-3 and its restrictions

General comments

The documentation of the Linux Cluster systems has been extended to also cover operation of the CooLMUC-3 system. This document mostly describes the restrictions on and deviations from the documented behaviour during the introductory phase of operation, as well as some special properties of the system that distinguish it from the other cluster systems at LRZ. We will update this document as additional facilities become available.

Login node and operating environment

The front end node lxlogin8.lrz.de must be used to do development work for the CooLMUC-3 system. That login node is not a many-core (KNL) node itself, so it may not be possible to execute binaries there that have been custom built for the KNL architecture. However, the Intel development software stack permits you to do cross-compilation there, and an interactive SLURM job can be used to execute test runs on KNL compute nodes.
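
A minimal sketch of this cross-compile-and-test workflow, assuming the Intel compilers are available in the default environment on lxlogin8; the partition name mpp3_inter used for the interactive test run is an assumption and may differ on the actual system:

    # On lxlogin8: cross-compile for the KNL architecture
    # (-xMIC-AVX512 generates AVX-512 code for Knights Landing)
    icc -O2 -qopenmp -xMIC-AVX512 -o myprog.knl myprog.c

    # Run a short interactive test on a KNL compute node via SLURM
    # (partition name is an assumption; adjust to the actual CooLMUC-3 partition)
    salloc --nodes=1 --time=00:30:00 --partition=mpp3_inter
    srun ./myprog.knl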

Please also note that CooLMUC-3 deploys a new release of the operating environment (SLES12). Programs built on this platform are likely not to execute properly on the other HPC systems at LRZ.

Restrictions on operational continuity

During initial operation, LRZ will continue work on improving the usability and stability of the system; benchmark performance verification will also need to be done. For these purposes there may be (potentially unannounced) interruptions of operation, especially during night shifts (17:00-8:00) or on weekends.

Batch processing

  • A separate SLURM instance serves CooLMUC-3, so all batch runs for CooLMUC-3 must be submitted from the CooLMUC-3 login node lxlogin8 (and this will remain the case for quite some time).
  • Initially, at most 60 nodes can be used for a single job for at most 12 hours.
  • Initially, not all hardware facilities will be available:
    • there may be limitations on the efficient use of hyperthreading; handling optimal pinning of processes and threads under SLURM is still under investigation.
    • switching cluster modes and HBM modes is now supported.
    • setting core frequencies is supported (within limits); please use the --cpu-freq switch of the SLURM sbatch command (man sbatch supplies details). An example batch script is sketched after this list.
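
The following is a minimal sketch of a batch script that respects the current limits (at most 60 nodes, at most 12 hours) and uses the --cpu-freq switch; the partition name mpp3_batch and the thread count per node are assumptions and may need to be adjusted:

    #!/bin/bash
    #SBATCH -J knl_test
    #SBATCH -D ./
    #SBATCH --partition=mpp3_batch     # assumed CooLMUC-3 partition name
    #SBATCH --nodes=60                 # current maximum nodes per job
    #SBATCH --time=12:00:00            # current maximum run time
    #SBATCH --cpu-freq=1300000         # requested core frequency in kHz (within the supported limits)

    # Pinning of processes and threads is still under investigation;
    # as a conservative starting point, run one MPI task per node and
    # control OpenMP threads explicitly (assuming 64 physical cores per KNL node).
    export OMP_NUM_THREADS=64
    srun --ntasks-per-node=1 ./myprog.knl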

Restrictions on file system access

The SCRATCH file system of CooLMUC-2 is now also available for CooLMUC-3, but still in a somewhat experimental state. If the directory pointed to by the $SCRATCH environment variable does not exist, or if I/O operations on it fail, please use the "old" scratch area pointed to by $SCRATCH_LEGACY instead. Also, the current I/O performance of the system is still below expectations; work is under way to improve the situation.
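
A minimal sketch of such a fallback at the start of a job script, using only the $SCRATCH and $SCRATCH_LEGACY variables mentioned above:

    # Use $SCRATCH if it points to an existing, writable directory;
    # otherwise fall back to the legacy scratch area.
    if [ -d "$SCRATCH" ] && [ -w "$SCRATCH" ]; then
        WORKDIR="$SCRATCH"
    else
        WORKDIR="$SCRATCH_LEGACY"
    fi
    cd "$WORKDIR" || exit 1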

Current status / Updates

  • This document was last updated on Oct 26 to reflect all new features that had become available by that date.