Interactive and Batch Jobs with LoadLeveler



Node Allocation Policy

  • Only complete nodes (each with 40 physical cores) are provided to a given job for dedicated use.
  • Accounting is based on: AllocatedNodes * Walltime * 40. All 40 cores of every allocated node are charged, whether or not the job uses them.
  • Try not to waste cores of a node, i.e. try to use all 40 cores. In some special cases (e.g. large memory requirements) you may not be able to use all cores.
  • tasks_per_node must be less than or equal to 40.
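
As a rough illustration of the accounting rule, the following bash sketch computes the charge for a job. The function name charged_core_hours is hypothetical (it is not an LRZ tool), and the sketch assumes that Walltime is counted in whole hours, which the rule above does not state explicitly.

```shell
#!/bin/bash
CORES_PER_NODE=40   # physical cores per node, as stated above

# Hypothetical helper: AllocatedNodes * Walltime * 40
# Assumption: walltime is given in whole hours.
charged_core_hours() {
    local nodes=$1 walltime_hours=$2
    echo $(( nodes * walltime_hours * CORES_PER_NODE ))
}

# 4 nodes for 2 hours are charged as 4 * 2 * 40 core-hours,
# even if the job uses fewer than 40 cores per node.
charged_core_hours 4 2
```

The point of the sketch is that the number of tasks does not appear in the formula: a job that leaves cores idle is charged the same as one that uses the full node.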

Job size: total tasks, number of nodes, and tasks per node

Typically there are two ways of specifying the job size:

  • Total tasks and number of nodes
    • This is the more general case.
    • Use it if your number of tasks is not evenly divisible by the number of nodes.
    • You must compute the number of nodes yourself (i.e., ceil(number of tasks / 40)).
    • LoadLeveler decides how to distribute the tasks to the nodes.
    • In the following examples, this is the total_tasks line.
  • Tasks per node and number of nodes
    • Use it if your number of tasks per node is 40 (or slightly fewer).
    • Use it if you can run with a varying number of nodes by specifying min and max.
    • In the following examples, this is the tasks_per_node alternative given after the total_tasks line.
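
The node count for the first way, ceil(number of tasks / 40), can be computed with integer arithmetic in bash. This is a minimal sketch; nodes_for_tasks is a hypothetical helper name, not part of LoadLeveler.

```shell
#!/bin/bash
CORES_PER_NODE=40   # physical cores per node, as stated above

# Hypothetical helper: ceil(tasks / 40) using integer arithmetic only
nodes_for_tasks() {
    local tasks=$1
    echo $(( (tasks + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

# Print the two matching keyword lines for a job with 156 tasks
tasks=156
echo "#@ node = $(nodes_for_tasks "$tasks")"
echo "#@ total_tasks = $tasks"
```

Adding 39 before the integer division rounds up, so 40 tasks still fit on one node while 41 tasks require two.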

Hints:

  • Do not mix both ways!
  • Always specify number of nodes!
  • Do not waste cores of a node (remember: 40 cores per node)!

See details in the section: Keywords for Node and Core allocation


Interactive parallel Jobs

IBM MPI

Parallel executables which are built with IBM's MPI can be run interactively by invoking them with one of the following methods:

  • poe ./executable -procs <nnn> -nodes <NNN>
  • mpiexec -n <nnn> ./executable -nodes <NNN>
  • ./executable -procs <nnn> -nodes <NNN>
  • MP_PROCS=nnn MP_NODES=NNN poe ./executable
  • MP_PROCS=nnn MP_NODES=NNN mpiexec ./executable
  • MP_PROCS=nnn MP_NODES=NNN ./executable

The executable is not executed on the frontend nodes but on the compute nodes. The resource manager of LoadLeveler is used to allocate free nodes. Therefore it is equivalent to submitting a Job Command File to LoadLeveler via llsubmit.

The keyword part of a LoadLeveler job command file can also be used to allocate nodes:

  • cat > LL_FILE <<EOD
    #@ job_type = parallel
    #@ node = 2
    #@ total_tasks = 78
    ## other example: instead of total_tasks, specify tasks_per_node = 39 (do not use both)
    #@ queue
    EOD
    poe ./executable -rmfile LL_FILE

Intel MPI

You can invoke executables compiled with Intel MPI via poe. However, the runtime libraries of IBM MPI are then used, and problems may occur, particularly with hybrid (OpenMP-MPI) execution. We recommend writing a LoadLeveler job command file and submitting it to the class "test".

Limited resources for job class "test"

poe jobs take their resources from the nodes of the job class "test". If you get the following message:

ERROR: 0031-165 Job has been cancelled, killed, or schedd or resource is unavailable. Not enough resources to start now. Global MAX_TOP_DOGS limit of 1 reached.

then all nodes are busy. You have to wait and try again, or submit your program to the test or general queue as a batch job, e.g. using the llsubmit command. You can use the examples given below; for the test queue, just replace class=general by class=test.

If you use fewer than 40 cores, set "MP_NODES=1" to save resources for your co-workers, or use llrun, which in most cases makes the right decision for node and core allocation.


Batch-Jobs with LoadLeveler

The login node "supermuc-login" is intended only for editing and compiling your parallel programs. Interactive usage of "poe/mpirun" is not allowed. To run test or production jobs, submit them to the LoadLeveler batch system, which will find and allocate the resources required for your job (i.e., the compute nodes to run your job on).

The most important LoadLeveler commands are:

llsubmit   Submit a job script for execution.
llq        Check the status of your job(s).
llhold     Place a hold on a job.
llcancel   Cancel a job.
llclass    Query information about job classes.

The -H flag provides extended help information.

Build your job command file "job.cmd" by using a text editor to create a script file.

# This job command file is called job.cmd
#@ executable = a.out
#@ input = job.in
#@ output = job.out
#@ error = job.err
#@ queue
echo "JOB is run"

To submit the job command file that you created above, use the llsubmit command:

llsubmit job.cmd

LoadLeveler responds by issuing a message similar to:

submit: The job "supermuc.22" has been submitted.

Here, supermuc is the name of the machine to which the job was submitted and 22 is the job identifier (ID). To display the status of the job you just submitted, use the llq command. Called without arguments, llq returns information about all jobs in the LoadLeveler queue; called with a job ID, it reports only on that job:

llq supermuc.22

To place a temporary hold on a job in a queue, use the llhold command. This command only takes effect if jobs are in the Idle or NotQueued state:

llhold supermuc.22

To release the hold, use the llhold command with the -r option:

llhold -r supermuc.22

To cancel a job, use the llcancel command:

llcancel supermuc.22
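
When scripting the steps above, the job ID can be taken from the llsubmit response instead of being typed by hand. The following sketch parses a message of the form shown above; extract_job_id is a hypothetical helper, and the exact wording of the message is an assumption based on that sample (the text only says the response is "similar to" it).

```shell
#!/bin/bash
# Hypothetical helper: read llsubmit's response on stdin and print the job ID.
# Assumes a line of the form: submit: The job "<id>" has been submitted.
extract_job_id() {
    sed -n 's/.*The job "\([^"]*\)" has been submitted.*/\1/p'
}

# Demonstration with the sample message from the text
echo 'submit: The job "supermuc.22" has been submitted.' | extract_job_id
```

In a real session this could be used as job_id=$(llsubmit job.cmd | extract_job_id) followed by llq "$job_id", assuming llsubmit writes its response to standard output.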

Job Command File

A Job Command File describes the job to be submitted to the LoadLeveler Job Manager using the llsubmit command. It can contain multiple steps, each designated by the #@queue statement. Lines starting with '#@' are statements that are interpreted by LoadLeveler's parser. The job command file itself can be the script to be executed by each step of the job.

Note that llsubmit does not build the job from command-line arguments; all properties of the job must be specified in the job command file.

Examples for Job Command Files

Parallel MPI Job (IBM MPI)

#!/bin/bash
#
#@ job_type = parallel
#@ class = general
#@ node = 4
#@ total_tasks=156
## other example: instead of total_tasks, specify tasks_per_node = 39 (do not use both)
#@ wall_clock_limit = 1:20:30
##                    1 h 20 min 30 secs
#@ job_name = mytest
#@ network.MPI = sn_all,not_shared,us
#@ initialdir = $(home)/mydir
#@ output = job$(jobid).out
#@ error = job$(jobid).err
#@ notification=always
#@ notify_user=erika.mustermann@xyz.de
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh
mpiexec -n 156 ./myprog.exe

Parallel MPI Job (Intel MPI)

#!/bin/bash
# DO NOT USE environment = COPY_ALL
#@ job_type = MPICH
#@ class = general
#@ node = 4
#@ total_tasks=156
## other example: instead of total_tasks, specify tasks_per_node = 39 (do not use both)
#@ wall_clock_limit = 1:20:30
##                    1 h 20 min 30 secs
#@ network.MPI = sn_all,not_shared,us
#@ initialdir = $(home)/mydir
#@ output = job$(jobid).out
#@ error = job$(jobid).err
#@ notification=always
#@ notify_user=erika.mustermann@xyz.de
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh
#setup of environment
module unload mpi.ibm
module load mpi.intel
mpiexec -n 156 ./myprog.exe

Hybrid MPI/OpenMP Job (IBM MPI)

#!/bin/bash
#
#@ job_type = parallel
#@ class = general
#@ node = 4
#@ total_tasks=12
## other example: instead of total_tasks, specify tasks_per_node = 3 (do not use both)
#@ wall_clock_limit = 1:20:30
##                    1 h 20 min 30 secs
#@ job_name = mytest
#@ network.MPI = sn_all,not_shared,us
#@ initialdir = $(home)/mydir
#@ output = job$(jobid).out
#@ error = job$(jobid).err
#@ notification=always
#@ notify_user=erika.mustermann@xyz.de
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh
export MP_SINGLE_THREAD=no
export OMP_NUM_THREADS=10
# Pinning
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
mpiexec -n 12 ./myprog.exe

Hybrid MPI/OpenMP Job (Intel MPI)

#!/bin/bash
# DO NOT USE environment = COPY_ALL
#@ job_type = MPICH
#@ class = general
#@ node = 4
#@ total_tasks=12
## other example: instead of total_tasks, specify tasks_per_node = 3 (do not use both)
#@ wall_clock_limit = 1:20:30
##                    1 h 20 min 30 secs
#@ job_name = mytest
#@ network.MPI = sn_all,not_shared,us
#@ initialdir = $(home)/mydir
#@ output = job$(jobid).out
#@ error = job$(jobid).err
#@ notification=always
#@ notify_user=erika.mustermann@xyz.de
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh
#setup of environment
module unload mpi.ibm
module load mpi.intel
export OMP_NUM_THREADS=10 
#optional: 
#module load mpi_pinning/hybrid_blocked
mpiexec -n 12 ./myprog.exe
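
The hybrid examples above run 3 tasks per node with 10 OpenMP threads each. A small check like the following can catch layouts that would need more than the 40 physical cores of a node. It is only a sketch: check_hybrid_layout is a hypothetical helper, and it assumes one thread per physical core.

```shell
#!/bin/bash
CORES_PER_NODE=40   # physical cores per node

# Hypothetical helper: compare tasks_per_node * OMP_NUM_THREADS with 40.
# Assumption: every thread should get its own physical core.
check_hybrid_layout() {
    local tasks_per_node=$1 threads=$2
    local used=$(( tasks_per_node * threads ))
    if (( used > CORES_PER_NODE )); then
        echo "oversubscribed: $used threads on $CORES_PER_NODE cores"
        return 1
    fi
    echo "ok: $used of $CORES_PER_NODE cores used"
}

# Layout of the hybrid examples: tasks_per_node = 3, OMP_NUM_THREADS = 10
check_hybrid_layout 3 10
```

For the layout of the examples the check passes but reports 10 unused cores per node; 4 tasks per node with 10 threads each would fill the node completely.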

Pure OpenMP Job (on Fat Node Island)

#!/bin/bash
#@ wall_clock_limit = 01:20:00,01:19:30
## with softlimit
#@ job_name = mytest
#@ job_type = parallel
#@ class = general
#@ node = 1
#@ total_tasks = 1
## or, instead of total_tasks, specify tasks_per_node = 1 (do not use both)
#@ node_usage = not_shared
#@ initialdir = $(home)/mydir
#@ output = job$(jobid).out
#@ error = job$(jobid).err
#@ notification=always
#@ notify_user=erika.mustermann@xyz.de
#@ queue
. /etc/profile
. /etc/profile.d/modules.sh

export OMP_NUM_THREADS=40
export KMP_AFFINITY="granularity=core,compact,1"
./myprog.exe

TSM Archive or Retrieve Job

A data staging example is not yet available.

Job Command File Keywords

Job Name

#@ job_name = myjobname

Specifies the name of the job. This keyword must be specified in the first job step. If it is specified in other job steps in the job command file, it is ignored.

The job_name only appears in the long reports of the llq, llstatus, and llsummary commands, and in mail related to the job.

Job Type

#@ job_type=serial | parallel | MPICH

Serial (Presently not available at SuperMUC)

  • single task job
  • For single-task, multi-threaded jobs, please see the pure OpenMP job example above
  • Many parallel keywords cannot be specified with a serial job_type

Parallel

  • means a POE job
  • Multiple tasks, multiple nodes
  • At SuperMUC only complete nodes (40 cores) are given to the users. Nodes are the smallest allocatable units for parallel jobs

MPICH

  • Similar to a parallel job, but handled differently by LoadLeveler internally

Job Class

#@ class = class_name

Valid classes are:

Class    Purpose                                                       Max. nodes        Wall clock limit  Run limit per user
test     Test and interactive use                                      4 (160 cores)     2 h               1
general  General purpose class for long-running jobs                   52 (2080 cores)   48 h              1
special  Restricted use (by LRZ, IBM, or particular users on request)  200 (8000 cores)  unlimited         1

 Hint: use "llclass -l" to see all limits and definitions

Limits

#@ wall_clock_limit = hardlimit[,softlimit]

Specifies the hard limit, soft limit, or both for the elapsed time for which a job can run.
Limits are specified with the format hh:mm:ss (hours:minutes:seconds)
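
When checking or comparing limits in a script, it can help to convert the hh:mm:ss format into seconds. The sketch below does this in bash; hms_to_seconds is a hypothetical helper and assumes that all three fields are present.

```shell
#!/bin/bash
# Hypothetical helper: convert hh:mm:ss into seconds.
# Assumption: the value has exactly three colon-separated fields.
hms_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # The 10# prefix forces base 10, so "08" and "09" are not parsed as octal
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

hard=$(hms_to_seconds 01:20:00)
soft=$(hms_to_seconds 01:19:30)
echo "the soft limit is reached $(( hard - soft )) seconds before the hard limit"
```

With the values from the pure OpenMP example above, the soft limit leaves the job 30 seconds before the hard limit is reached.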

Parallel Jobs

Keywords for Node and Core allocation

#@ node = <min,max>

The scheduler attempts to get max nodes to run the job step, but will start the job step on min nodes if necessary.

#@ node = <number>

The scheduler will find number nodes on which to run the job step.

#@ tasks_per_node = <number>

Used in conjunction with #@ node; each node is assigned number tasks.

tasks_per_node must be less than or equal to 40.

#@ total_tasks = <number>

 Rather than specifying the number of tasks to start on each node, the total number of tasks in the job step across all nodes can be specified

#@ blocking = <number|unlimited>

Tasks are allocated in groups (blocks) of number tasks. A node could run more than one block of tasks.
Specifying #@ blocking = unlimited is also known as packing, where the fewest possible nodes are selected.

#@ task_geometry = {(task_id,...) (task_id,...) ...}

Allows you to specify which tasks run together on the same machines, although you cannot specify which machines.

Example:

#@ task_geometry = {(5,2) (1,3) (4,6,0)}

The 7 tasks are grouped to run on 3 nodes, where tasks 5 and 2 run on one of the nodes, tasks 1 and 3 on another, and tasks 0, 4 and 6 on the third.

Valid combinations of keywords for parallel jobs:

Each column marks one valid combination:

Keyword              (1)  (2)  (3)  (4)  (5)
total_tasks           X    X
tasks_per_node                  X    X
node = <min, max>               X
node = <number>       X              X
task_geometry                             X
blocking                   X

see also: Task Assignment Considerations

Network Protocol

#@ network.protocol = type[, usage[, mode[,comm_level]]]

protocol: specifies the communication protocols that are used with an adapter,

  • MPI: Specifies the message passing interface (MPI).
  • LAPI: Specifies the low-level application programming interface (LAPI).
  • MPI_LAPI: Specifies both MPI and LAPI.

usage: Specifies whether the adapter can be shared with tasks of other job steps. Possible values are

  • shared, which is the default,
  • not_shared.

type: This field is required and specifies one of the following:

  • sn_all: Specifies that striped communication should be used over all available switch networks.

mode: Specifies the communication subsystem mode used by the communication protocol that you specify

  • US (User Space): always use US!
  • IP: do not use, although it is the default!

Restart

#@ restart = yes | no

Specifies whether LoadLeveler considers a job to be "restartable".

If restart=yes (default), and the job is vacated (e.g. in case of system errors) from its executing machine before completing, the central manager requeues the job. It can start running again when a machine on which it can run becomes available.
If restart=no, a vacated job is canceled rather than requeued.

Files and Directories

#@ input = filename
#@ output = filename
#@ error = filename
#@ initialdir = pathname: Specifies the path name of the directory to use as the initial working directory during execution of the job step.

If no filename is specified, LoadLeveler uses /dev/null.
filename can be an absolute filename or it can be relative to the current working directory.

Executable and Arguments

The Job Manager copies the executable from the submitting node to the spool directory

#@ executable = program_name

  • The name of the executable that will be started by LoadLeveler
  • If blank, the job command file itself will be used as the executable
  • A different executable can be specified for each step in a job

#@ arguments = arg1 arg2 ...

  • Specifies the list of arguments passed to the program when the job step runs
  • Different arguments can be specified for each step, even if all steps are using the job command file as the executable

#@ environment = env1; env2; ...

Specifies the environment variables that will be set for the job step by LoadLeveler when the job step starts.

  • COPY_ALL: Copies all environment variables from your shell
  • $var: Copies var into the job step's environment
  • !var: Omits var from the job step's environment
  • var=value: Specifies that the environment variable var be set to the value "value" and copied to the job step's environment when the job step is started

LoadLeveler sets the environment before the login shell is executed, so anything set by the shell (such as settings in .profile) will override the variables set by LoadLeveler

#@ env_copy = all | master

Specifies whether environment variables are copied to all nodes of the job step or only to the master node. By default, the environment is copied to all nodes

#@ shell = name

When LoadLeveler starts the job, the specified shell will be used instead of the default shell for the user from /etc/passwd

Job Steps

A single job can contain more than one step. Every job contains, by default, at least one step. A step can have dependencies on the exit codes of other steps in the same job. Each "#@ queue" statement marks the end of one step and the beginning of the next. Most keyword values are inherited from the previous step. A group of steps within a job can be co-scheduled: they are treated as one entity that will all be started at the same time.

Dependent steps

#@ dependency = step_name operator value

value is usually a number that specifies the job return code to which the step_name is set.

It can also be one of the LoadLeveler-defined job step return codes CC_NOTRUN or CC_REMOVED (see Special Return Codes below).

Operators include ==, !=, <=, >=, <, >, &&, ||

A step can have dependencies on more than one step, like

         #@ dependency = (step1 == 0) && (step2 >= 0)

Special Return Codes

  • CC_NOTRUN: The return code set by LoadLeveler for a job step which is not run because the dependency is not met. The value of CC_NOTRUN is 1002
  • CC_REMOVED: The return code set by LoadLeveler for a job step which is removed from the system (because, for example, llcancel was issued against the job step). The value of CC_REMOVED is 1001.

See also: example job with dependencies
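
A minimal sketch of a two-step job command file in which the second step runs only if the first one returns 0. The job name, the step names compute and postprocess, the run_step function, and the resource values are hypothetical and chosen only for illustration; the job command file itself serves as the executable of both steps and branches on LOADL_STEP_NAME.

```shell
#!/bin/bash
# Sketch only: names and resource values are hypothetical.
#@ job_name = twostep
#@ job_type = parallel
#@ class = test
#@ node = 1
#@ total_tasks = 40
#@ wall_clock_limit = 0:30:00
#@ output = $(job_name).$(step_name).$(jobid).out
#@ error = $(job_name).$(step_name).$(jobid).err
#@ step_name = compute
#@ queue
#@ step_name = postprocess
#@ dependency = (compute == 0)
#@ queue

# Both steps execute this script; LoadLeveler sets LOADL_STEP_NAME per step.
run_step() {
    case "${LOADL_STEP_NAME:-}" in
        compute)     echo "running the compute step" ;;
        postprocess) echo "compute returned 0, running the postprocessing step" ;;
        *)           echo "not running under LoadLeveler" ;;
    esac
}

run_step
```

Keywords set before the first queue statement are inherited by the second step, so only step_name and dependency have to be repeated. If compute ends with a non-zero return code, the dependency is not met and postprocess is not run.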

Variables

Variables accessible only in the LoadLeveler job command file (at submit time)

Several variables are available for use in job command files.

$(domain): The domain of the host from which the job was submitted.
$(home): The home directory for the user on the cluster selected to run the job.
$(user): The user name that will be used to run the job. This might differ from the user name of the submitter.
$(host): The hostname of the machine from which the job was submitted.
$(jobid): The sequential number assigned to this job by the schedd daemon.
$(stepid): The sequential number assigned to this job step when multiple queue statements are used with the job command file.

Some variables are set from other keywords defined in the job command file

$(executable)
$(class)
$(comment)
$(job_name)
$(step_name)
$(base_executable): Automatically set from the executable keyword; consists of the executable file name without the directory component (basename).

Example: 
#@ output = $(home)/$(job_name)/$(step_name).$(schedd_host).$(jobid).$(stepid).out

Environment Variables (accessible within the job)

LoadLeveler sets several environment variables in the application's environment. A complete list is available in the LoadLeveler manual "Using and Administering"; here are a few examples and explanations:

LOADLBATCH=yes: Set when the job runs as a batch job.
LOADL_HOSTFILE=filename: The file that contains the list of hosts on which the job runs.
LOADL_JOB_NAME=i01adm01.sm.lrz.de.51458: The job identifier (host.jobid).
LOADL_STEP_ID=i01adm01.sm.lrz.de.51458.0: The job step identifier (host.jobid.stepid).
LOADL_STEP_COMMAND=/home/prxxxx/luyyyy/JOB: The name of the executable (or the name of the job command file if the job command file is the executable).
LOADL_STEP_CLASS=general: The job class for serial jobs.
LOADL_STEP_ARGS=input1: Any arguments passed by the job step.
LOADL_STEP_ERR=err.51458: The file used for standard error messages (stderr).
LOADL_STEP_OUT=out.51458: The file used for standard output (stdout).
LOADL_STEP_IN=/dev/null: The file used for standard input (stdin).
LOADL_STEP_INITDIR=/home/prxxxx/luyyyy/: The initial working directory.
LOADL_STEP_NAME=0: The name of the job step.
LOADL_TOTAL_TASKS=800: Specifies the total number of tasks of the MPICH job step. This variable is available only when the job_type is set to MPICH.
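
Inside a job script these variables can be used, for example, to log where the job runs. The following is a sketch under one assumption that is not stated above: that the file named by LOADL_HOSTFILE contains one host name per line (possibly repeated once per task). count_unique_hosts is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: count the distinct host names in a host file.
# Assumption: one host name per line, possibly repeated for each task.
count_unique_hosts() {
    sort -u "$1" | wc -l | tr -d ' '
}

if [ "${LOADLBATCH:-no}" = "yes" ] && [ -r "${LOADL_HOSTFILE:-}" ]; then
    echo "batch job $LOADL_JOB_NAME, step $LOADL_STEP_ID"
    echo "running on $(count_unique_hosts "$LOADL_HOSTFILE") node(s)"
else
    echo "not running as a LoadLeveler batch job"
fi
```

The test on LOADLBATCH lets the same script be started outside the batch system, e.g. for a quick syntax check on the login node.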

Notification

#@ notification = always|error|start|never|complete

Specifies when mail is sent to the address given in the notify_user keyword:

  • always : Notify the user when the job begins, ends, or if it incurs error conditions.
  • error: Notify the user only if the job fails.
  • start: Notify the user only when the job begins.
  • never:  Never notify the user.
  • complete: Notify the user only when the job ends. (Default)

#@ notify_user = email_address

Specifies the address to which mail is sent based on the notification keyword.


Querying the Status of a Job

The llq command lists all job steps in the queue, one job step per line.

llq -u userlist lists only those job steps belonging to the specified users.

llq -j joblist will display only the specified jobs.

The format of a job ID is host.jobid.
The format of a step ID is host.jobid.stepid.
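
Because the host part may itself contain dots, a step ID is best split from the right. A minimal bash sketch using parameter expansion (split_step_id is a hypothetical helper name):

```shell
#!/bin/bash
# Hypothetical helper: split host.jobid.stepid into its three parts.
# The host name may contain dots, so the last two fields are cut off first.
split_step_id() {
    local id=$1
    local stepid=${id##*.}     # text after the last dot
    local rest=${id%.*}        # everything before the last dot
    local jobid=${rest##*.}    # last field of the remainder
    local host=${rest%.*}      # what is left is the host name
    echo "host=$host jobid=$jobid stepid=$stepid"
}

split_step_id i01adm01.sm.lrz.de.51458.0
```

For a job ID (host.jobid) the same approach applies with one field fewer.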

Fields in llq's listing

Class: Job class.

Id: The format of a full LoadLeveler step identifier is host.jobid.stepid.

Owner: User ID that the job will be run under.

PRI: User priority of the job step

Running On: If running, the name of the machine the job step is running on. This is blank when the job is not running. For a parallel job step, only the first machine is shown.

ST: Current state of the job step.

Idle (I): The job step is waiting to be scheduled.

NotQueued (NQ): The job step is not being considered for scheduling, but it has been submitted to LoadLeveler and the Job Manager and Scheduler do know about it. Examples:

  • Job steps whose dependencies cannot yet be determined.
  • Job steps that have requested to run in a non-existent reservation.
  • Job steps submitted above installation-defined limits on queued or idle jobs.

No user intervention can move the job step to the Idle state.

User Hold (H): The job step is not being considered for scheduling. It can be released from hold using the llhold -r command by the user who submitted the job.

System Hold (S): The job step is not being considered for scheduling. It can be released only by a LoadLeveler administrator.

User & System Hold (HS): It must be released from hold by a LoadLeveler administrator and by the user before it can be scheduled.

Deferred (D): The job step was submitted with a startdate, and that date and time have not yet arrived.

Running (R): The job step is currently running.

Pending (P): The scheduler has assigned resources to the job step and is in the process of sending the start request to the resource manager.

Starting (ST): The resource manager has received the start request from the scheduler and is in the process of dispatching the job step to the nodes where it will run. The next state will normally be Running.

Completed (C): The job step has completed.

Canceled (CA): The job step was canceled by a user or an administrator.

Preempted (E): The job step has been preempted by the suspend method, either by the scheduler or by an administrator using the llpreempt command.

Preempt Pending (EP): LoadLeveler is in the process of preempting the job step.

Resume Pending (MP): The job step is being resumed from preemption. The next state will normally be Running

Submitted: Date and time of job submission.


Using Intel MPI in batch Jobs

Until the next release of Intel MPI (expected in mid-September 2011) it is necessary to set up ssh to work WITHIN SuperMIG without passwords. Do not confuse this with passphrase-free access to SuperMIG from the outside world, which is a breach of security rules!
None of the keys (public or private) generated here should be transferred outside SuperMIG (i.e., they should stay in $HOME/.ssh); otherwise security problems may arise.

The following commands must therefore be executed on SuperMIG:

Commands and remarks:

cd ~/.ssh
ssh-keygen -t rsa
    Generate an RSA key. The command will respond with
        Generating public/private rsa key pair.
        Enter file in which to save the key (/home/myaccount/.ssh/id_rsa)
    to which you respond by pressing the ENTER key. Next, you are prompted for a passphrase:
        Enter passphrase (empty for no passphrase):
    to which you should respond by pressing ENTER (no/empty passphrase).
cat id_rsa.pub >> authorized_keys
    Add the internal public key to the list of authorized keys.

Using llrun to start interactive or batch Jobs

A convenient way to start both interactive and batch jobs is the LRZ command "llrun". Load the module lrztools and use llrun:

module load lrztools
llrun -N <nodes> -n <mpi_processes> -t <threads> executable
Usage : llrun [<options>] <exe> [<user_or_poe_args>]
<options>:
-N: number of nodes (Default: 1)
-p: total number of processes, same as -n (Default: 1)
-n: total number of processes, same as -p (Default: 1)
-P: Task per node
-t: number of threads per process (Default: 1)
-A: No automatic adjustment of number of nodes (Default: automatic adj.)
-f: Pin processes and Threads to Fixed physical cores (IBM MPI only)
-b: submit batch job
-c: submit to class (default: test)
-m: email 
-w: submit batch job with wallclock limit (Default: 00:15:00)
-i: include content of file before parallel execution 
-I: include content of file after  parallel execution 
-o: do not run/submit job but save to file 
-h: help (this message)
-v: verbose (Default)
-V: NO verbose
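
To see how the -N, -n and -t options fit together, the following sketch assembles an llrun command line for a hybrid run and prints it instead of executing it. llrun_command is a hypothetical helper; llrun adjusts the number of nodes automatically unless -A is given, so the explicit -N here is only for illustration and assumes one thread per physical core.

```shell
#!/bin/bash
CORES_PER_NODE=40   # physical cores per node

# Hypothetical helper: print (not run) an llrun call for a hybrid job.
# Assumption: nodes = ceil(processes * threads / 40).
llrun_command() {
    local procs=$1 threads=$2 exe=$3
    local cores=$(( procs * threads ))
    local nodes=$(( (cores + CORES_PER_NODE - 1) / CORES_PER_NODE ))
    echo "llrun -N $nodes -n $procs -t $threads $exe"
}

# 12 MPI processes with 10 threads each
llrun_command 12 10 ./myprog.exe
```

Adding -b and -c general to the printed command would, according to the option list above, submit the same run as a batch job to the class general instead of running it interactively.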

Why isn't my job running?

First, run llq -s job_ids

  • It provides information on why the selected jobs remain in the Hold, NotQueued, Idle or Deferred state.

Is the job step's class available? Are machines that are configured to run your job class available?

  • Run llclass to see if the class is defined and has available initiators.
  • llstatus -l will show Configured Classes and Available Classes.
  • llstatus will show if machines are Idle or Busy.

Does the job step have requirements which cannot be met by any available machines?

Are higher priority job steps getting scheduled ahead of your job step or are they reserving resources that your job step cannot backfill?

  • llq -l will show q_sysprio which is what is used to order the job steps in the queue.
  • The llprio command can be used to adjust a job step's priority relative to that user's other submitted job steps.

Is the job step bound to a reservation that has not yet become active?

If the job status immediately goes to "Hold", check that the files for "output" and "error" really can be written (the directory exists, has write permissions, etc.).
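
The last check can be done before submitting. The sketch below tests whether the directory of an output or error file exists and is writable; check_output_dir is a hypothetical helper, and the path must be given as it will be resolved at run time (relative names depend on the working directory of the job).

```shell
#!/bin/bash
# Hypothetical helper: verify that the directory of a job output file
# exists and is writable for the current user.
check_output_dir() {
    local dir
    dir=$(dirname -- "$1")
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "ok: $dir is writable"
    else
        echo "problem: $dir is missing or not writable"
        return 1
    fi
}

# Example: check the location of job.out in the current directory
check_output_dir "$PWD/job.out" || echo "fix this before running llsubmit"
```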


Further Information