Running parallel jobs with SLURM

The new MPP cluster uses the SLURM scheduler to execute parallel jobs. This document describes usage, policies and resources available for SLURM execution. It is envisaged that the other parallel cluster segments will also be migrated to SLURM in the near future.


Table of contents

  • Introduction and Prerequisites
  • Interactive SLURM shell
  • Script-driven SLURM jobs
    • Step 1: Edit a job script
    • Step 2: Submission procedure
    • Step 3: Checking the status of a job
    • Inspection and modification of jobs
    • Deleting jobs from the queue
  • A GUI for job management
  • Documentation and Support
    • LRZ specific information
    • General information

Note: The information contained in this document and its subdocuments applies to all parallel job classes. For serial job processing, please consult the SGE (Sun Grid Engine) documentation.

Introduction and Prerequisites

All parallel programs in the parallel segments of the cluster must be started up using either

  • an interactive SLURM shell
  • a SLURM batch script

In order to access the SLURM infrastructure described here, please first log in to a front end node of the cluster, and from there to a submission node of the cluster segment you wish to use; each cluster segment corresponds to one or more SLURM cells.

Cluster segment                      Submission (and development) node
Myrinet 10 GE cluster, MPP cluster   lxia4-1, lxia4-2
sgi ICE, Ultraviolet                 ice1-login

Jobs can be submitted to any SLURM cell from any submission node, but please note that an MPI program built on the MPP or Myrinet 10 GE cluster will typically not run on the ICE or UV systems and vice versa.

This document provides information on how to configure, submit and execute SLURM jobs, as well as information about batch processing policies. In particular, please be aware that misuse of the resources described here can result in the invalidation of the violating account.

Interactive SLURM shell

For program testing and short runs, the following sequence of commands can be used: first, salloc is invoked to reserve the needed resources; then, the srun_ps command is used to start up a program on these resources; finally, exit ends the SLURM shell.

Start an MPP mode Parastation MPI program using two 16-way nodes on the MPP cluster:

salloc --ntasks=32 --partition=mpp1_inter
srun_ps ./myprog.exe
exit

Start an MPP mode sgi MPT program using two nodes with 16 physical cores on the ICE (additionally specify --ntasks-per-core=2 if you want to use hyperthreaded cores; in this example only one node would then be assigned):

salloc --ntasks=16 --partition=ice1_inter
srun_ps ./myprog.exe
exit

Start a hybrid mode Parastation MPI program on the MPP cluster using 6 MPI tasks, with 8 OpenMP threads per task (3 nodes will be needed):

salloc --ntasks=6 --cpus-per-task=8 --partition=mpp1_inter
srun_ps -t 8 ./myprog.exe
exit

Start a hybrid mode sgi MPT program on the ICE with 4 MPI tasks and 4 OpenMP threads per task (2 nodes will be needed):

salloc --ntasks=4 --cpus-per-task=4 --partition=ice1_inter
srun_ps -t 4 ./myprog.exe
exit
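The node counts quoted in the examples above follow from dividing the requested cores (tasks times threads per task) by the cores per node and rounding up. The following is only a sketch of that arithmetic: the helper name nodes_needed is hypothetical, the 16 cores per MPP node are taken from the "16-way nodes" mentioned above, and the 8 cores per ICE node are an assumption inferred from the ICE examples.

```shell
# Estimate how many nodes an allocation needs.
# usage: nodes_needed <ntasks> <cpus_per_task> <cores_per_node>
nodes_needed() {
    local ntasks=$1
    local cpus_per_task=$2
    local cores_per_node=$3
    local cores=$(( ntasks * cpus_per_task ))
    # Integer ceiling division: a partially used node still counts as a node.
    echo $(( (cores + cores_per_node - 1) / cores_per_node ))
}

nodes_needed 32 1 16   # pure MPI on the MPP cluster     -> 2
nodes_needed 6 8 16    # hybrid, MPP cluster             -> 3
nodes_needed 4 4 8     # hybrid, ICE (assumed 8 cores)   -> 2
```

SLURM performs the actual node assignment itself; the helper is only meant for checking a planned --ntasks and --cpus-per-task combination beforehand.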

By default, a SLURM shell generated via salloc will run for 15 minutes. This interval can be extended up to the partition maximum by specifying a suitable --time=hh:mm:ss argument.

Notes:

  • SLURM also has its own srun command; however this is not usable with MPI programs built with Parastation MPI or sgi MPT.
  • The srun_ps command can be used to start up programs built with Parastation MPI (MPP cluster), sgi MPT (ICE or UV), or Intel MPI (all systems running under SLURM).
  • Once the allocation expires, the program will be signalled and killed, and no further programs can be started. Please issue the exit command and start a new allocation.
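The --time=hh:mm:ss argument mentioned above can be built from a plain number of minutes. This is a minimal sketch: the helper name slurm_time is hypothetical, and the 30 minutes are an arbitrary example value that may or may not lie within the maximum of a given partition. The salloc line is only printed, not executed.

```shell
# Convert a number of minutes into the hh:mm:ss form expected by --time.
slurm_time() {
    local minutes=$1
    printf '%02d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

# Print (do not run) an salloc call for a 30-minute interactive session.
echo "salloc --ntasks=32 --partition=mpp1_inter --time=$(slurm_time 30)"
```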

Script-driven SLURM jobs

This type of execution method should be used for all production runs. A step-by-step recipe for the simplest type of parallel job is given, illustrating the use of the SLURM commands for users of the bash shell. See the documentation section at the end for pointers to more complex setups.

Step 1: Edit a job script

The following script is assumed to be stored in the file myjob.cmd.

 

#!/bin/bash

 

#SBATCH -o /home/hpc/<group>/<user>/myjob.%j.%N.out

(Placeholder) standard output and error go there. Note that the directory where the output file is placed must exist before the job starts, and the full path name must be specified (no environment variable!). The %j encodes the job ID into the output file. The %N encodes the master node of the job and should be added since job IDs from different SLURM clusters may be the same.

#SBATCH -D  /home/hpc/<group>/<user>/mydir

directory used by script as starting point (working directory)

#SBATCH -J <job_name>

(Placeholder) name of job

#SBATCH --clusters=mpp1

Instead of mpp1 (for the MPP cluster) you can also use ice1 here (for the ICE), or uv2 or uv3 (for the two UltraViolet partitions).

#SBATCH --get-user-env

Set user environment properly

#SBATCH --ntasks=64

Number of MPI tasks assigned to job. By default, SLURM will start as many MPI tasks per node as there are physical cores in the node.

#SBATCH --mail-type=end

Send an e-mail at job completion

#SBATCH --mail-user=<email_address>@<domain>

(Placeholder) e-mail address (don't forget!)

#SBATCH --export=NONE

Do not export the environment of the submitting shell into the job; while SLURM also accepts ALL here, this is strongly discouraged, because the submission environment is very likely to be inconsistent with the environment required for execution of the job.

#SBATCH --time=08:00:00

maximum run time is 8 hours 0 minutes 0 seconds; this may be increased up to the queue limit

source /etc/profile.d/modules.sh

initialize module system

module load gsl # ... etc

load any required environment modules (may be needed if program is linked against shared libraries). "gsl" of course is only a placeholder.

srun_ps  ./my_mpi_prog.exe

start MPI executable. The MPI variant used depends on the loaded module set; non-MPI programs may fail to start up - please consult the example jobs or the software-specific documentation for other startup mechanisms

This script essentially looks like a bash script. However, it contains specially marked comment lines ("control sequences") that have a special meaning in the SLURM context, as explained in the description accompanying each line above. The entries marked "Placeholder" must be replaced with valid user-specific values.

For this script, the environment of the submitting shell will not be exported to the job's environment; the job will start 64 MPI tasks on as many cores.
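Without its annotations, the script from Step 1 reads as one piece as shown below. Wrapping it in a here-document and checking it with bash -n are conveniences added for this sketch, not part of the original recipe; the placeholders in angle brackets still have to be replaced by hand before submission.

```shell
# Write the job script from Step 1 to myjob.cmd.
# The quoted 'EOF' keeps the shell from expanding anything inside.
cat > myjob.cmd <<'EOF'
#!/bin/bash
#SBATCH -o /home/hpc/<group>/<user>/myjob.%j.%N.out
#SBATCH -D /home/hpc/<group>/<user>/mydir
#SBATCH -J <job_name>
#SBATCH --clusters=mpp1
#SBATCH --get-user-env
#SBATCH --ntasks=64
#SBATCH --mail-type=end
#SBATCH --mail-user=<email_address>@<domain>
#SBATCH --export=NONE
#SBATCH --time=08:00:00
source /etc/profile.d/modules.sh
module load gsl   # placeholder module
srun_ps ./my_mpi_prog.exe
EOF

# Syntax check only; this neither submits nor runs the job.
bash -n myjob.cmd && echo "myjob.cmd: syntax OK"
```

The #SBATCH lines are ordinary comments to bash, so the syntax check passes even while the placeholders are still in place.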

Step 2: Submission procedure

The job script is submitted to the queue via the command

sbatch myjob.cmd

At submission time the control sequences are evaluated and stored in the queuing database, and the script is copied into an internal directory for later execution. If the command was executed successfully, the Job ID will be returned as follows:

Submitted batch job 65648.

It is a good idea to note down your Job IDs, for example to provide them to LRZ HPC support if anything goes wrong. The submission command can also contain control sequences. For example,

sbatch --time=12:00:00 myjob.cmd

would override the setting inside the script, raising the run time limit of the job from 8 to 12 hours.
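Since the Job ID is needed for all later inspection and cancellation commands, it can be convenient to capture it in a shell variable at submission time. The sketch below assumes the confirmation line has the form shown above; the helper name parse_job_id is hypothetical, and the example feeds it a sample line rather than calling sbatch.

```shell
# Extract the numeric job ID from sbatch's confirmation line (read on stdin).
parse_job_id() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Sample line as shown above. On the cluster, one would instead write:
#   jobid=$(sbatch myjob.cmd | parse_job_id)
jobid=$(echo "Submitted batch job 65648." | parse_job_id)
echo "$jobid"   # prints 65648
```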

Step 3: Checking the status of a job

Once submitted, the job will be queued for some time, depending on how many jobs are presently submitted. Eventually, typically after previously submitted jobs have completed, the job will be started on one or more of the systems determined by its resource requirements. The status of the job can be queried with the squeue --clusters=[all | cluster_name] command, which will give an output like

CLUSTER: mpp1
JOBID PARTITION   NAME  USER ST  TIME NODES NODELIST(REASON)
65646 mpp1_batch  job1  xyz1  R 24:19     2 lxa[7-8]
65647 mpp1_batch  myj   xza2  R  0:09     1 lxa14
65648 mpp1_batch  calc  yaz7 PD  0:00     6 (Resources)

(assuming mpp1 is specified as the clusters argument), indicating that job 65648 is still queued (state "PD" = pending). Once the job is running, the output will show the state "R" (= running) and list the host(s) it is running on. For jobs that have not yet started, the --start option of squeue provides an estimate (!) of the starting time. The sinfo --clusters=[all | cluster_name] command prints an overview of the status of all clusters, or of one particular cluster, in the SLURM configuration.
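In scripts it is often enough to know the state of a single job. The following sketch filters the squeue listing for one Job ID and prints its ST column; it assumes the column layout of the sample output above (state in the fifth column), and the helper name job_state is hypothetical.

```shell
# Print the state column (ST) for one job ID from squeue output on stdin.
job_state() {
    awk -v id="$1" '$1 == id { print $5 }'
}

# Sample output copied from above. On the cluster, one would instead write:
#   squeue --clusters=mpp1 | job_state 65648
sample='CLUSTER: mpp1
JOBID PARTITION   NAME  USER ST  TIME NODES NODELIST(REASON)
65646 mpp1_batch  job1  xyz1  R 24:19     2 lxa[7-8]
65647 mpp1_batch  myj   xza2  R  0:09     1 lxa14
65648 mpp1_batch  calc  yaz7 PD  0:00     6 (Resources)'

echo "$sample" | job_state 65648   # prints PD
echo "$sample" | job_state 65646   # prints R
```

An empty result means that the job is not in the listing at all.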

Inspection and modification of jobs

Queued jobs can be inspected for their characteristics via the command

scontrol --clusters=<cluster_name> show jobid=<job ID>

which will print out a list of "Keyword=Value" pairs which characterize the job. As long as a job is waiting in the queue, it is possible to modify at least some of these; for example, the command

scontrol --clusters=<cluster_name> update jobid=65648 TimeLimit=04:00:00

would change the run time limit of the above-mentioned example job from 8 hours to 4 hours.
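To pick a single value out of the "Keyword=Value" list, the output can be split at whitespace and searched for the keyword. This is a sketch under assumptions: the helper name job_field is hypothetical, the sample text is an illustrative excerpt rather than real scontrol output, and values that themselves contain spaces are not handled.

```shell
# Print the value of one keyword from "Keyword=Value" text on stdin.
job_field() {
    awk -v key="$1" '{
        for (i = 1; i <= NF; i++)
            if (index($i, key "=") == 1) {       # field starts with "key="
                print substr($i, length(key) + 2)
                exit
            }
    }'
}

# Hypothetical excerpt, for illustration only. On the cluster:
#   scontrol --clusters=<cluster_name> show jobid=<job ID> | job_field TimeLimit
sample='   JobId=65648
   RunTime=00:00:00 TimeLimit=08:00:00'

echo "$sample" | job_field TimeLimit   # prints 08:00:00
```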

Deleting jobs from the queue

To forcibly remove a job from SLURM, the command

scancel --clusters=<cluster_name> <JOB_ID>

can be used. Please do not forget to specify the cluster! The scancel(1) man page provides further information on the use of this command.
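One way to avoid forgetting the cluster is a small wrapper that refuses to build the command line without it. The helper name scancel_cmd is hypothetical, and the sketch only prints the resulting command instead of executing it.

```shell
# Build an scancel command line, refusing to do so without a cluster name.
scancel_cmd() {
    local cluster=$1
    local jobid=$2
    if [ -z "$cluster" ] || [ -z "$jobid" ]; then
        echo "usage: scancel_cmd <cluster_name> <job_id>" >&2
        return 1
    fi
    echo "scancel --clusters=$cluster $jobid"
}

scancel_cmd mpp1 65648   # prints: scancel --clusters=mpp1 65648
```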

A GUI for job management

The command sview is available to inspect and modify jobs via a graphical user interface:

[Screenshot: sview_1]

  • To identify your jobs among the many in the list, select either the "specific user's jobs" or the "job ID" item from the menu "Actions → Search".
  • By right-clicking on one of your jobs and selecting "Edit job" in the context menu, you obtain a window which allows you to modify the job settings. Please be careful about committing your changes.

[Screenshot: sview_2]

Documentation and Support

LRZ specific information

The subdocuments linked to in the following table provide further information about usage of SLURM on LRZ's HPC systems:

Examples provides example job scripts which cover the most common usage patterns for the
Policies provides information about the policies, such as memory limits, run time limits etc.
Specifications lists SLURM parameter settings and explains them, making appropriate recommendations where necessary

General information

  • The home site of SLURM at LLNL
  • The manual pages slurm(1), sinfo(1), squeue(1), scontrol(1), scancel(1), sview(1)
  • In case of difficulties with the LRZ installations of SLURM, please contact us via the Service Desk.