Job Processing on the LRZ Clusters

This document provides general guidelines on how to select a suitable system for the desired application scenario. It also serves as an entry point to the technically much more detailed batch documentation.

Job resource matching

Because the LRZ clusters are shared among many users, production jobs must be submitted to a resource and workload manager (SLURM), which queues them and executes them once the requested resources become available.

Users of the cluster are responsible for determining the resource requirements of their job profile; some testing and the use of appropriate tools are usually needed for this characterization. Based on these requirements, they must select the appropriate segment of the cluster, corresponding to the desired processing mode.
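
As a minimal sketch of this workflow (all names and resource values are placeholders and must be adapted to the chosen segment and job profile as described below), a job is described by a shell script containing SLURM directives and is handed over to the scheduler with sbatch:

    #!/bin/bash
    #SBATCH -J my_job              # job name shown in the queue
    #SBATCH -o my_job.%j.out       # output file (%j expands to the job ID)
    #SBATCH --time=02:00:00        # requested wall clock time
    #SBATCH --ntasks=1             # number of tasks; adapt to your job profile
    ./my_program                   # placeholder for the actual executable

The script is then submitted and monitored from a login node, for example with

    sbatch my_job.slurm
    squeue -u $USER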

Overview of available batch resources

Parallel resources (technical introduction) (limits and policies) (example job scripts)
mpp1: Distributed memory CooLMUC1 QDR Infiniband cluster (multiple nodes with 16 GB/node)
mpp2: Distributed memory CooLMUC2 FDR Infiniband cluster (multiple nodes with 64 GB/node)
uv2, uv3: Shared memory SGI UltraViolet partitions (up to 1.5 TByte of memory for a single job)
myri: Shared memory 10G Myrinet cluster. For this cluster it may be necessary to specify the partition. Available partitions are:
  matum_u: short-time, unprioritized execution; dedicated to users from MA-TUM
  matum_p: long-time, prioritized execution; dedicated to users from MA-TUM
inter: Interactive parallel jobs. An "salloc" shell is generated on a login node; which segment is used depends on where you are logged in.
tum_chem: Parallel job processing dedicated to users of TUM chemistry (integrated with CooLMUC2)
Serial resources (technical introduction) (limits and policies) (example job scripts)
serial: For serial job processing. Available partitions are:
  serial_mpp2: Standard serial jobs on CooLMUC2 (typically single core and ca. 2 GByte of memory, but more can be configured)
  serial_long: Long-running serial jobs on CooLMUC2 (typically single core and ca. 2 GByte of memory, but more can be configured)
hugemem: Serial jobs with large memory requirements (typically up to 240 GByte in a single shared memory node)
tum_geodesy: Serial job processing dedicated to users of TUM geodesy
lmu_asc: Serial job processing dedicated to users of the Arnold-Sommerfeld-Centre

A separate document describes the general syntax and semantics of SLURM's job specifications.
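
As an illustration (the --clusters and --partition values below are taken from the names in the overview above and should be verified against the technical documentation), the segment and partition of a job are selected with SLURM directives such as:

    # serial job on CooLMUC2 (assumed cluster/partition names, see overview)
    #SBATCH --clusters=serial
    #SBATCH --partition=serial_mpp2

    # distributed memory parallel job on CooLMUC2
    #SBATCH --clusters=mpp2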

Guidelines for resource selection

Processing Mode

  • Jobs that can only use one or at most a few hardware cores should perform serial processing. The particular segment used depends on how much main memory is needed.
  • Jobs that can use many hardware cores, but require a single shared memory node, should perform parallel processing on the UltraViolet (uv2, uv3) systems. For example, programs that use more than 32 threads with OpenMP fit this pattern (some tuning for ccNUMA may be required).
  • Programs that use MPI (or PGAS) for parallelization typically require distributed memory parallel processing. These should use the parallel processing facilities available in the CooLMUC2, CooLMUC1, or possibly UltraViolet segments; a sketch of such a job script follows this list. Which of these segments to select depends on
    • how many tasks are started per node,
    • how much memory is needed by a task (see Memory Requirements below), and
    • how many computational nodes are needed in total (each segment imposes a limit on this number).
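
As an illustration, a distributed memory job on CooLMUC2 might be sketched as follows (the node and task counts are illustrative, the cluster name follows the overview above, and the module name and launcher depend on the MPI variant that is actually installed):

    #!/bin/bash
    #SBATCH -J mpi_job
    #SBATCH --clusters=mpp2          # CooLMUC2 distributed memory segment (assumed name)
    #SBATCH --nodes=4                # stay within the segment's node limit
    #SBATCH --ntasks-per-node=28     # illustrative value; match the cores per node
    #SBATCH --time=08:00:00
    module load mpi                  # placeholder for the site's MPI module
    srun ./my_mpi_program            # or mpiexec, depending on the MPI variant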

For workflows that alternate between serial and parallel processing, where the serial part accounts for a significant fraction of the execution wall time, consider setting up separate serial and parallel SLURM job scripts with appropriately defined dependencies between them. For technical information, search the specification document for the term "dependency".
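
A minimal sketch of such a dependency chain (the script names are placeholders): submit the serial step first, capture its job ID, and make the parallel step start only after the serial step has completed successfully:

    # submit the serial preprocessing step and capture its job ID
    JOBID=$(sbatch --parsable serial_prep.slurm)
    # the parallel step starts only once the serial step has finished successfully
    sbatch --dependency=afterok:${JOBID} parallel_run.slurm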

Run Time

Please note that all job classes impose a maximum run time limit. The precise value depends on the cluster segment used; it can be adjusted downward for any individual job. If your job cannot complete within the limit, the following (non-exclusive) options can enable processing on the Linux Cluster:

  • Enable checkpointing to disk: this permits subdividing a simulation into multiple jobs that are executed one after another. The program must be capable of writing its state near the end of a job and re-reading it at the beginning of the next one. Sufficient disk space must also be available to store the checkpoint data. A sketch of such a chained submission follows this list.
  • Increase the parallelism of your program: this can be done by requesting more computational resources (if possible), or by improving the parallel algorithms so that better performance is achieved with the same amount of resources.
  • Optimize your program's code: this may involve changing the algorithms used (e.g. reducing their complexity), or, more simply, adding vectorization (SIMD) directives to the code, using suitable compiler switches, or restructuring hot loops and data structures to improve the temporal and/or spatial locality of your code.
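
The following is a sketch of such a chained, checkpointed submission (the restart option and the completion marker file are hypothetical and must be replaced by whatever mechanism your program provides):

    #!/bin/bash
    #SBATCH --time=24:00:00           # request no more than the segment's run time limit
    ./my_simulation --restart-latest  # hypothetical option: resume from the last checkpoint
    if [ ! -f simulation.done ]; then
        sbatch my_job.slurm           # resubmit this script until the run signals completion
    fi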

Memory Requirements

The technical documentation provides information on the available memory per core as well as per node for each segment.

For serially executed applications, the user may need to specify an explicit memory requirement in the batch script if it exceeds the available per-core memory. This causes the scheduler to raise the memory limit for the job and prevents other jobs from being scheduled on the same node if they would overstrain its memory resources; a sketch follows below.
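
A sketch of such a request (the memory value is illustrative and must stay within the limits documented for the chosen segment; cluster and partition names follow the overview above):

    #SBATCH --clusters=serial
    #SBATCH --partition=serial_mpp2
    #SBATCH --ntasks=1
    #SBATCH --mem=8G                 # raise the memory limit beyond the per-core share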

For parallel applications that run on distributed memory systems, two conditions must be met:

  • The total memory available in user space for the set of nodes requested by the job must not be exceeded.
  • The memory available on each individual node must not be exceeded by the tasks running on that node.

Note that applications exist for which the memory usage is asymmetric. In this case it may become necessary to work with a variable number of tasks per node. One relevant scenario is a master-worker scheme where the master may need an order of magnitude more memory and therefore requires a node of its own, while the workers can share nodes. The MPI startup mechanism (which is implementation dependent; please consult the documentation for the variant you intend to use) usually offers a way to control such a placement; a sketch follows below.
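
As a hedged sketch of one such mechanism (the host:count machinefile format and the -machinefile option are in the style of MPICH/Intel MPI and are implementation dependent; 28 workers per node is an illustrative value), the master can be given a node of its own while the remaining nodes are filled with workers:

    # expand the list of nodes allocated to this job
    HOSTS=($(scontrol show hostnames $SLURM_JOB_NODELIST))
    # only the memory-hungry master (rank 0) runs on the first node
    echo "${HOSTS[0]}:1" > machinefile
    # fill the remaining nodes with workers
    for h in "${HOSTS[@]:1}"; do echo "$h:28" >> machinefile; done
    NTASKS=$(( 1 + 28 * (${#HOSTS[@]} - 1) ))
    mpiexec -machinefile machinefile -n $NTASKS ./master_worker_program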

Disk and I/O Requirements

Disk and I/O requirements are not controlled by the batch scheduling system. Jobs rely either on local disk space (which can be used as scratch space only), on a shared file system (NFS/NAS based), or on a parallel shared file system (GPFS). The latter two are system-global services with respect to bandwidth: the total I/O bandwidth is shared among all users. As a consequence, all I/O may be significantly slowed down if it is heavily used by multiple users at the same time, or even, for large-scale parallel jobs, by a single user. This may result in job failure if job time limits are exceeded due to slowed I/O. At present, LRZ cannot make any quality-of-service assurance for I/O bandwidth.

Please consult the file system page for more detailed technical information.

Licences

Some jobs make use of licensed software, either from the LRZ application software stack or installed in the user's HOME directory. In many cases, such software needs to access a license server because there are limits on how many instances of the software may run and on who may access it at all. Please be aware of the following aspects:

  • If the license server for the software is not located at LRZ, jobs running on the LRZ cluster will need to be able to access it. This may involve opening firewall ports for the cluster networks at a site that is not under LRZ's administrative purview.
  • LRZ is currently not able to manage license contingents. The reason is that this would require significant additional effort, not only for a suitable configuration of SLURM, but also for the way the license servers are managed. As a consequence, a job will fail if the usage limit of a licensed software product is exceeded when the job starts.

Alternatives

If your job profile cannot be matched to one of the cluster segments, please consider moving to a different LRZ system or service that does fit your requirements:

  • For scaling out to very large core counts, whether because you need the memory or want to reduce the time to solution, consider applying for a SuperMUC project. The same may apply if you need to do very large-scale I/O processing. However, except for initial testing, SuperMUC projects undergo a refereeing procedure, so the onus is on you to demonstrate the scientific value of using this expensive resource.
  • For workflows that use only moderate computational resources but need long run times, the LRZ cloud services will be more appropriate. For permanently needed services, you may consider acquiring a virtual machine. Both of these options are also relevant if you wish to deploy your own OS images (including a tested application stack). Please note that deploying a virtual machine incurs costs on your part; see the entry "Dienstleistungskatalog" on the services page for details.