LCA

Laboratory for Advanced Computing

Running, monitoring and canceling jobs

Submitting batch jobs using the sbatch command

The way to run jobs on Navigator is by submitting a script with the sbatch command. The command to submit a job is as simple as:

sbatch runscript.sh

The commands specified in the runscript.sh file will then be run on the first available compute node that fits the resources requested in the script. sbatch returns immediately after submission; commands are not run as foreground processes and won't stop if you disconnect from Navigator.
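sbatch prints the id of the submitted job, which you will need later to monitor or cancel it. One way to capture it in a script (a sketch; the --parsable option makes sbatch print only the job id):

```shell
# Submit the script and keep the job id for later use.
# --parsable makes sbatch print just the numeric id instead of
# the full "Submitted batch job NNNN" message.
JOBID=$(sbatch --parsable runscript.sh)
echo "Submitted job ${JOBID}"

# The id can then be used directly, e.g.:
#   squeue -j "${JOBID}"
#   scancel "${JOBID}"
```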

A typical submission script, in this case using the hostname command to get the computer name, will look like this:

#!/bin/bash

#SBATCH -n 1 # Number of cores
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 0-00:05 # Runtime in D-HH:MM
#SBATCH -p veryshort # Partition to submit to
#SBATCH --mem=100 # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH -o hostname.out # File to which STDOUT will be written
#SBATCH -e hostname.err # File to which STDERR will be written
#SBATCH --mail-type=END # Type of email notification- BEGIN,END,FAIL,ALL
#SBATCH --mail-user=myname@example.com # Email to which notifications will be sent

hostname

In general, the script is composed of 3 parts:

  • the #!/bin/bash line (allows the script to be run as a bash script).
  • the #SBATCH lines (are technically bash comments, but they set various parameters for the SLURM scheduler).
  • the command line itself.

The #SBATCH lines shown above set key parameters.

The SLURM system copies many environment variables from your current session to the compute host where the script is run, including PATH and your current working directory. As a result, you can specify files relative to your current location (e.g. ./myfolder/myfiles/myfile.txt).

#SBATCH -n 1
    This line sets the number of cores that you're requesting. Make sure that your tool can use multiple cores before requesting more than one. If this parameter is omitted, SLURM assumes -n 1.

#SBATCH -N 1
    This line requests that all cores be on a single node. Only change this to a value greater than 1 if you know your code uses a message-passing protocol such as MPI. SLURM makes no assumptions about this parameter: if you request more than one core (-n > 1) and forget this parameter, your job may be scheduled across several nodes. Unless your job is MPI (multi-node) aware, it will then run slowly, oversubscribing the master node while wasting the resources reserved on the other node(s).

#SBATCH -t 0-00:05
    This line specifies the running time for the job. A plain number is interpreted as minutes, but you can also use the convenient format D-HH:MM. If your job runs longer than the value you specify here, it will be cancelled. Jobs have a maximum run time of 7 days on Navigator, though extensions can be requested. There is no penalty for over-requesting time. NOTE! If this parameter is omitted on any partition, your job will be given the default of 10 minutes.

#SBATCH -p veryshort
    This line specifies the SLURM partition (also known as a queue) under which the script will be run. Partitions differ in their time limits and typical PENDING times, so choose the one that best fits your job. See the partitions description below for more information.

#SBATCH --mem=100
    The Navigator cluster requires that you specify the amount of memory (in MB) that your job will use. Accurate specifications allow jobs to be scheduled with maximum efficiency on the system. There are two main options: --mem-per-cpu and --mem. If you request multiple cores (e.g. -n 4), --mem-per-cpu allocates the specified amount for each core you've requested, while --mem specifies the total amount across all cores. If this parameter is omitted, the smallest amount is allocated, usually 100 MB, and chances are good that your job will be killed for exceeding it.

#SBATCH -o hostname.out
    This line specifies the file to which standard out will be appended. If a relative file name is used, it will be relative to your current working directory. If this parameter is omitted, any output will be directed to a file named slurm-JOBID.out in the current directory.

#SBATCH -e hostname.err
    This line specifies the file to which standard error will be appended. SLURM submission and processing errors will also appear in this file. If this parameter is omitted, error output will be directed (together with standard out) to a file named slurm-JOBID.out in the current directory.

#SBATCH --mail-type=END
    Because jobs are processed in the "background" and can take some time to run, it is useful to send an email message when the job has finished (--mail-type=END). Email can also be sent at other processing stages (BEGIN, FAIL) or at all of them (ALL).

#SBATCH --mail-user=myname@example.com
    The email address to which the --mail-type messages will be sent.
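Several of these options can be combined with SLURM's filename patterns: %j in the -o and -e file names is replaced by the job id, which keeps output from repeated runs separate. A sketch of an alternative header (file names, memory, time and email address are placeholder choices, not recommendations):

```shell
#!/bin/bash

#SBATCH -n 1                      # Single core
#SBATCH -N 1                      # All cores on one node
#SBATCH -t 0-00:30                # 30 minutes, in D-HH:MM
#SBATCH -p veryshort              # Partition to submit to
#SBATCH --mem=500                 # 500 MB total memory
#SBATCH -o myjob_%j.out           # %j is replaced by the job id
#SBATCH -e myjob_%j.err
#SBATCH --mail-type=END,FAIL      # Several types can be combined
#SBATCH --mail-user=myname@example.com

hostname
```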

It is important to accurately request resources

Navigator is a medium-sized, shared system that must have an accurate idea of the resources your program(s) will use so that it can effectively schedule jobs. If insufficient memory is allocated, your program may crash (often in an unintelligible way). Additionally, your "fairshare", a number used in calculating the priority of your job for scheduling purposes, can be adversely affected by over-requesting. Therefore it is important to be as accurate as possible when requesting cores (-n) and memory (--mem or --mem-per-cpu).

The distinction between --mem and --mem-per-cpu is important when running multi-core jobs (for single-core jobs, the two are equivalent). --mem sets the total memory across all cores, while --mem-per-cpu sets the value for each requested core. If you request two cores (-n 2) and 4 GB with --mem, each core will receive 2 GB of RAM. If you specify 4 GB with --mem-per-cpu, each core will receive 4 GB, for a total of 8 GB.
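The two requests below illustrate the difference; both sketches assume a two-core job (the 4096 MB figure is just an example):

```shell
# Total memory: 4096 MB shared across the 2 cores (2048 MB each).
#SBATCH -n 2
#SBATCH --mem=4096

# Per-core memory: 4096 MB for each core (8192 MB in total).
#SBATCH -n 2
#SBATCH --mem-per-cpu=4096
```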

Monitoring job progress with squeue and sacct

squeue and sacct are two different commands that allow you to monitor job activity in SLURM. squeue is the primary and most accurate monitoring tool since it queries the SLURM controller directly. sacct gives you similar information for running jobs, and can also report on previously finished jobs, but because it accesses the SLURM database, there are some circumstances when the information is not in sync with squeue.

Running squeue without arguments will list all currently running jobs. It is more common, though, to list jobs for a particular user (like yourself) using the -u option...

squeue -u palmeida

or for a particular job

squeue -j 9999999

If you include the -l option (for "long" output) you can get useful data, including the running state of the job.

squeue "long" output using username (-u) and job id (-j) filters

Above you can se a typical output:

[palmeida@navigator ]# squeue -l
Sun Mar 22 20:25:39 2015
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
              2733 veryshort   myprog palmeida  PENDING       0:00   8:00:00     43 (Resources)
              2732 veryshort   myprog palmeida  RUNNING    2:24:17   8:00:00     43 com-[28-70]
              2731 veryshort   myprog palmeida  RUNNING    5:02:54   8:00:00     43 com-[71-113]

The squeue man page has a complete description of the tool options.
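One useful option from the man page is -o/--format, which lets you choose exactly which columns appear. For example (a sketch; the field widths are arbitrary):

```shell
# Show job id, partition, name, state, elapsed time and node list
# for your own jobs only. %T is the job state, %M the elapsed time,
# %R the node list (or pending reason).
squeue -u $USER -o "%.10i %.10P %.20j %.8T %.10M %R"
```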

The sacct command also provides details on the state of a particular job. A squeue-like report on a single job requires only a simple command.

sacct -j 9999999

However, sacct can provide much more detail, as it has access to many of the resource accounting fields that SLURM uses. For example, to get a detailed report on the memory usage of one of user palmeida's jobs:

[root@navigator slurm]# sacct -u palmeida -j 2733_39 --format=JobID,JobNAME,Partition,MaxRSS,AveRss,MaxRSSNode,ReqMem
       JobID    JobName  Partition     MaxRSS     AveRSS MaxRSSNode     ReqMem 
------------ ---------- ---------- ---------- ---------- ---------- ---------- 
2733           myprog-+  veryshort                                          0n 
2733.0       hydra_pmi+                     0          0     com-71         0n 
 

Both tools provide information about the job State. This value will typically be one of PENDING, RUNNING, COMPLETED, CANCELLED, or FAILED.

PENDING
    Job is awaiting a slot suitable for the requested resources. Jobs with high resource demands may spend significant time PENDING.
RUNNING
    Job is running.
COMPLETED
    Job has finished and the command(s) have returned successfully (i.e. exit code 0).
CANCELLED
    Job has been terminated by the user or administrator using scancel.
FAILED
    Job finished with an exit code other than 0.
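A quick way to tally your jobs by state is to combine squeue's state column with standard shell tools (a sketch; -h suppresses the header line):

```shell
# Print only the STATE column (%T) for your jobs and count each value.
squeue -u $USER -h -o "%T" | sort | uniq -c
```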

Killing jobs with scancel

If, for any reason, you need to kill a job that you've submitted, just use the scancel command with the job id.

scancel 9999999

If you don't keep track of the job id returned from sbatch, you should be able to find it with the squeue -u command described above.
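scancel can also select jobs by owner or state rather than by id, which is handy when cleaning up many jobs at once (a sketch; use with care, as these commands affect every matching job):

```shell
# Cancel every job belonging to you.
scancel -u $USER

# Cancel only your jobs that are still waiting in the queue.
scancel -u $USER --state=PENDING
```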

Using MPI

MPI (Message Passing Interface) is a standard that supports communication between separate processes, allowing a parallel program to run cooperatively across many cores and nodes. There are three implementations available on Navigator: MPICH, OpenMPI and MVAPICH2. These libraries can be loaded via the module system, for example:

module load libs/openmpi/1.8.4-gcc-4.4.7

or

module load comp/intel2015.1.133
module load comp/intel2015.1.133.optimization
module load libs/mvapich2/2.1rc1-intel2015.1.133

Note that the MPI module names also specify the compiler used to build them. It is important that the tools you are using have been built with the same compiler. If not, your job will fail.

An example MPI script with comments is below:

#!/bin/bash

#SBATCH -n 192
#SBATCH -p normal
#SBATCH -A staff
#SBATCH -J gmx_NP192

module load libs/openmpi/1.8.4-gcc-4.4.7
module load progs/gromacs 

mpiexec -n 192 mdrun_mpi -s dmpc3_md.tpr -o dmpc3_md.trr -c dmpc4.gro -g dmpc3_md.log

Notice that the number of processors requested by the mpiexec command matches the number of cores requested for SLURM (-n).

In parallel jobs, to make sure you use all the cores on every requested node, the number of requested cores should be a multiple of 24 (the number of cores per Navigator compute node).
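If you are unsure how many cores to request, a small shell calculation rounds a desired core count up to the next multiple of 24 so that every allocated node is fully used (a sketch; the per-node core count comes from the guideline above):

```shell
#!/bin/bash
# Round a core request up so that whole 24-core nodes are filled.
cores_wanted=100
cores_per_node=24

# Integer ceiling division: nodes needed to hold cores_wanted.
nodes=$(( (cores_wanted + cores_per_node - 1) / cores_per_node ))
total_cores=$(( nodes * cores_per_node ))

echo "Request -n ${total_cores} (-N ${nodes} nodes of ${cores_per_node} cores)"
```

With cores_wanted=100 this prints "Request -n 120 (-N 5 nodes of 24 cores)", i.e. five full nodes.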