The way to run jobs on Navigator is by submitting a script with the sbatch command. The command to submit a job is as simple as:
sbatch runscript.sh
The commands specified in the runscript.sh file will then be run on the first available compute node that fits the resources requested in the script. sbatch returns immediately after submission; commands are not run as foreground processes and won't stop if you disconnect from Navigator.
A typical submission script, in this case using the hostname command to get the computer name, will look like this:
#!/bin/bash
#SBATCH -n 1                # Number of cores
#SBATCH -N 1                # Ensure that all cores are on one machine
#SBATCH -t 0-00:05          # Runtime in D-HH:MM
#SBATCH -p veryshort        # Partition to submit to
#SBATCH --mem=100           # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH -o hostname.out     # File to which STDOUT will be written
#SBATCH -e hostname.err     # File to which STDERR will be written
#SBATCH --mail-type=END     # Type of email notification: BEGIN,END,FAIL,ALL
#SBATCH --mail-user=email@example.com  # Email to which notifications will be sent

hostname
In general, the script is composed of three parts: the interpreter line (#!/bin/bash), the #SBATCH directives that request resources and set job options, and the command(s) to be run.
The #SBATCH lines shown above set key parameters.
The SLURM system copies many environment variables from your current session to the compute host where the script is run, including PATH and your current working directory. As a result, you can specify files relative to your current location (e.g. ./myfolder/myfiles/myfile.txt).
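As a sketch of what this means in practice, the following job-script fragment (assuming the myfolder/myfiles/myfile.txt layout from the example above) refers to a file relative to the directory from which sbatch was invoked:

```shell
#!/bin/bash
#SBATCH -n 1
#SBATCH -t 5
# The working directory at submission time is inherited, so this
# relative path resolves from wherever sbatch was run:
wc -l ./myfolder/myfiles/myfile.txt
```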
#SBATCH -n 1
This line sets the number of cores that you're requesting. Make sure that your tool can use multiple cores before requesting more than one. If this parameter is omitted, SLURM assumes -n 1.
#SBATCH -N 1
This line requests that all of the cores be on a single node. Only change this to a value greater than 1 if you know your code uses a message passing protocol like MPI. SLURM makes no assumptions about this parameter: if you request more than one core (-n > 1) and you forget this parameter, your job may be scheduled across several nodes, and unless your job is MPI (multinode) aware it will run slowly, oversubscribed on the master node while wasting the resources reserved on the other(s).
#SBATCH -t 5
This line specifies the running time for the job in minutes. You can also use the convenient format D-HH:MM. If your job runs longer than the value you specify here, it will be cancelled. Jobs have a maximum run time of 7 days on Navigator, though extensions can be requested. There is no penalty for over-requesting time. NOTE! If this parameter is omitted on any partition, your job will be given the default of 10 minutes.
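For illustration, the following alternative forms of the directive are equivalent ways of writing time limits (only one -t line would appear in a real script):

```shell
#SBATCH -t 5          # 5 minutes
#SBATCH -t 0-00:05    # the same 5 minutes, written as D-HH:MM
#SBATCH -t 2-12:00    # 2 days and 12 hours
```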
#SBATCH -p veryshort
This line specifies the SLURM partition (a.k.a. queue) under which the script will be run. The veryshort partition used here is good for short, routine jobs, and PENDING times are typically brief for this queue. See the partitions description below for more information.
#SBATCH --mem=100
The Navigator cluster requires that you specify the amount of memory (in MB) that you will be using for your job. Accurate specifications allow jobs to be run with maximum efficiency on the system. There are two main options: --mem-per-cpu and --mem. If you specify multiple cores (e.g. -n 4), --mem-per-cpu will allocate the amount specified for each of the cores you've requested. The --mem option, on the other hand, specifies the total amount over all of the cores. If this parameter is omitted, the smallest amount is allocated, usually 100 MB, and chances are good that your job will be killed, as it will likely exceed this amount.
#SBATCH -o hostname.out
This line specifies the file to which standard out will be appended. If a relative file name is used, it will be relative to your current working directory. If this parameter is omitted, any output will be directed to a file named slurm-JOBID.out in the current directory.
#SBATCH -e hostname.err
This line specifies the file to which standard error will be appended. SLURM submission and processing errors will also appear in the file. If this parameter is omitted, any output will be directed to a file named slurm-JOBID.out in the current directory.
Because jobs are processed in the "background" and can take some time to run, it is useful to send an email message when the job has finished (--mail-type=END). Email can also be sent for other processing stages (BEGIN, FAIL) or for all of them (ALL).
#SBATCH --mail-user=email@example.com
This line specifies the email address to which the --mail-type messages will be sent.
Navigator is a medium-sized, shared system that must have an accurate idea of the resources your program(s) will use so that it can schedule jobs effectively. If insufficient memory is allocated, your program may crash (often in an unintelligible way); additionally, your "fairshare", a number used in calculating the priority of your job for scheduling purposes, can be adversely affected by over-requesting. Therefore it is important to be as accurate as possible when requesting cores (-n) and memory (--mem or --mem-per-cpu).
The distinction between --mem and --mem-per-cpu is important when running multi-core jobs (for single-core jobs, the two are equivalent). --mem sets the total memory across all cores, while --mem-per-cpu sets the value for each requested core. If you request two cores (-n 2) and 4 GB with --mem, each core will receive 2 GB of RAM. If you specify 4 GB with --mem-per-cpu, each core will receive 4 GB, for a total of 8 GB.
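As a sketch of the arithmetic above, either of the following directives gives a two-core job 4 GB in total; the second is commented out (a second # makes sbatch ignore a directive) since only one would be used in a real script:

```shell
#SBATCH -n 2
#SBATCH --mem=4000            # 4000 MB in total, i.e. 2000 MB per core
##SBATCH --mem-per-cpu=2000   # equivalent: 2000 MB for each of the 2 cores
```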
squeue and sacct are two different commands that allow you to monitor job activity in SLURM. squeue is the primary and most accurate monitoring tool since it queries the SLURM controller directly. sacct gives you similar information for running jobs, and can also report on previously finished jobs, but because it accesses the SLURM database, there are some circumstances when the information is not in sync with squeue.
Running squeue without arguments will list all currently running jobs. It is more common, though, to list jobs for a particular user (like yourself) using the -u option:
squeue -u palmeida
or for a particular job:
squeue -j 9999999
If you include the -l option (for "long" output) you can get useful data, including the running state of the job.
squeue "long" output using username (-u) and job id (-j) filters
Below you can see a typical output:
[palmeida@navigator ]# squeue -l
Sun Mar 22 20:25:39 2015
  JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
   2733 veryshort   myprog palmeida  PENDING       0:00   8:00:00     43 (Resources)
   2732 veryshort   myprog palmeida  RUNNING    2:24:17   8:00:00     43 com-[28-70]
   2731 veryshort   myprog palmeida  RUNNING    5:02:54   8:00:00     43 com-[71-113]
The squeue man page has a complete description of the tool options.
The sacct command also provides details on the state of a particular job. An squeue-like report on a single job requires only a simple command:
sacct -j 9999999
However, sacct can provide much more detail, as it has access to many of the resource accounting fields that SLURM uses. For example, to get a detailed report on the memory usage of a specific job run by user palmeida:
[root@navigator slurm]# sacct -u palmeida -j 2733_39 --format=JobID,JobName,Partition,MaxRSS,AveRSS,MaxRSSNode,ReqMem
       JobID    JobName  Partition     MaxRSS     AveRSS MaxRSSNode     ReqMem
------------ ---------- ---------- ---------- ---------- ---------- ----------
2733          myprog-+   veryshort                                         0n
2733.0       hydra_pmi+                     0          0     com-71         0n
Both tools provide information about the job State. This value will typically be one of PENDING, RUNNING, COMPLETED, CANCELLED, and FAILED.
PENDING     Job is awaiting a slot suitable for the requested resources. Jobs with high resource demands may spend significant time PENDING.
RUNNING     Job is running.
COMPLETED   Job has finished and the command(s) have returned successfully (i.e. exit code 0).
CANCELLED   Job has been terminated by the user or administrator using scancel.
FAILED      Job finished with an exit code other than 0.
If for any reason you need to kill a job that you've submitted, just use the scancel command with the job id:
scancel 9999999
If you don't keep track of the job id returned from sbatch, you should be able to find it with the squeue -u command described above.
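sbatch confirms a submission with a line such as "Submitted batch job 2734" (the job id here is illustrative). A minimal sketch of capturing that id in the shell, using a stand-in string in place of a live submission:

```shell
# Stand-in for: output=$(sbatch runscript.sh)
output="Submitted batch job 2734"
# The job id is the last word of the confirmation message
jobid=${output##* }
echo "$jobid"   # prints 2734
```

The captured id can then be passed to squeue -j or scancel.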
MPI (Message Passing Interface) is a standard that supports communication between separate processes, allowing parallel programs to scale across the distributed memory of many nodes. There are three implementations available on Navigator: MPICH, OpenMPI and MVAPICH2. These libraries can be loaded via the module system like so:
module load libs/openmpi/1.8.4-gcc-4.4.7
or, for MVAPICH2 built with the Intel compiler:
module load comp/intel2015.1.133
module load comp/intel2015.1.133.optimization
module load libs/mvapich2/2.1rc1-intel2015.1.133
Note that the MPI module names also specify the compiler used to build them. It is important that the tools you are using have been built with the same compiler. If not, your job will fail.
An example MPI script with comments is below:
#!/bin/bash
#SBATCH -n 192              # Number of cores
#SBATCH -p normal           # Partition to submit to
#SBATCH -A staff            # Account to charge the job to
#SBATCH -J gmx_NP192        # A name for the job

module load libs/openmpi/1.8.4-gcc-4.4.7
module load progs/gromacs

# Launch mdrun_mpi on all 192 requested cores
mpiexec -n 192 mdrun_mpi -s dmpc3_md.tpr -o dmpc3_md.trr -c dmpc4.gro -g dmpc3_md.log
Notice that the number of processors requested by the mpiexec command matches the number of cores requested for SLURM (-n).
In parallel jobs, to make sure you use all of the cores on every requested node, the number of requested cores should be a multiple of 24 (the number of cores in each Navigator compute node).
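Assuming 24 cores per node, as the multiple-of-24 rule implies, the 192-core request in the MPI example corresponds to exactly eight full nodes:

```shell
cores_per_node=24   # cores in each Navigator compute node
nodes=8
echo $(( nodes * cores_per_node ))   # prints 192, the -n value in the MPI example
```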