Cluster Navigator/Navigator+ (3936 + 1280 cores)
164 computing nodes (Fujitsu PRIMERGY CX400 S2)
2 x Intel Xeon E5-2697v2 (12-core) @ 2.70 GHz
96 GB ECC DDR3 LV RAM
HCA InfiniBand FDR 56 Gbps
28 computing nodes (Fujitsu PRIMERGY CX2550 M4)
2 x Intel Xeon Gold 6148 (20-core) @ 2.40 GHz
256 GB SSD (local storage)
21 nodes: 96 GB DDR4-2666 R ECC
7 nodes: 384 GB DDR4-2666 R ECC
EP Mellanox EDR ConnectX-4 100 Gbps
4 GPU computing nodes (Fujitsu PRIMERGY CX2570 M4)
2 x NVIDIA Tesla V100 16 GB
2 x Intel Xeon Gold 6148 (20-core) @ 2.40 GHz
256 GB SSD (local storage)
96 GB DDR4-2666 R ECC
EP Mellanox EDR ConnectX-4 100 Gbps
1 SMP computing node (Fujitsu PRIMERGY RX4770 M4)
4 x Intel Xeon Gold 6154 (18-core) @ 3.00 GHz
3.84 TB SSD (local storage)
3 TB DDR4-2666 R ECC
EP Mellanox EDR ConnectX-4 100 Gbps
Lustre Shared Storage (220 TB + 1.27 PB)
Fujitsu PRIMERGY: RX2540 M4 & RX350 S8
1 x MDS (RX2540 M4, SSD 3.2 TB)
5 x OSS (RX350 S8, SATA 40 TB)
1 x OSS (RX2540 M4, SAS 72 TB)
2 x MDS (block storage SFA200NV, 6.8 TB NVMe)
2 x OSS (1640 TB total)
8 external InfiniBand EDR connections
Interconnect (FDR InfiniBand 2:1 + EDR InfiniBand 100 Gb/s non-blocking)
Accessing the cluster
Once you've gone through the account setup procedure and obtained a suitable terminal application, you can log in to the Navigator system via ssh.
Navigator computers run the CentOS distribution of the Linux operating system and commands are run under the "bash" shell. There are a number of Linux and bash references, cheat sheets and tutorials available on the web.
There are some modifications to the default installation to meet all the computational needs we support.
If you have trouble compiling or running your code, feel free to contact us. When asking for help, please give us enough information to reproduce the problem: the environment in which it occurred, the relevant logs, and the exact steps you followed.
Because of the diversity of investigations currently supported by LCA, many applications and libraries are supported on the Navigator cluster. Technically, it is impossible to include all of these tools in every user's environment. The Linux module system is used to enable subsets of these tools for a particular user's computational needs.
Navigator ships the default gcc and gfortran of the installed CentOS release, GCC 4.8.5. Since this is an old version of the compiler, Navigator also provides GCC 5.4.0 and GCC 8.3.0 (gcc and gfortran) through environment modules.
As many users expect, Navigator also provides a recent version of the Intel compilers and the MKL algebra library, from Intel Parallel Studio XE Cluster Edition 2019 update 3. These can be used through an environment module.
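As a rough sketch of how the module system is typically explored with Lmod, the snippet below lists the GCC modules on offer instead of loading one, because the exact module names on Navigator are not given in this guide. The function name show_gcc_modules and the `<module_name>` placeholder are illustrative assumptions.

```shell
#!/bin/bash
# Sketch only: look at what the module system offers before loading anything.
# The guard lets the script run (and say so) on machines without a module command.
show_gcc_modules() {
    if type module >/dev/null 2>&1; then
        module avail gcc 2>&1    # Lmod writes its listings to stderr by default
        module list 2>&1         # modules currently loaded in this shell
    else
        echo "module command not available on this machine"
    fi
}

show_gcc_modules

# To enable a compiler, load the name reported by 'module avail', e.g.
#   module load <module_name>    # placeholder, not a real Navigator module name
```

Loading a module only affects the current shell, so the same `module load` lines usually need to appear inside the batch script as well.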
Queue management system
SLURM is a queue management system and stands for Simple Linux Utility for Resource Management. SLURM was developed at the Lawrence Livermore National Lab and currently runs some of the largest compute clusters in the world.
SLURM is similar in many ways to TORQUE or most other queue systems. You must write a batch script then submit it to the queue manager. The queue manager then schedules your job to run on the partition (or queue in TORQUE) that you designate. Below we will provide an outline of how to submit jobs to SLURM, how SLURM decides when to schedule your job and how to monitor progress.
SLURM has a number of features that make it more suited to our environment than TORQUE:
- Kill and Requeue: SLURM’s ability to kill and requeue is superior to that of TORQUE. It waits for jobs to be cleared before scheduling the high priority job. It also does kill and requeue on memory rather than just on core count.
- Memory: Memory requests are sacrosanct in SLURM. The amount of memory you request at run time is guaranteed to be there. No one can infringe on that memory space and you cannot exceed the amount of memory that you request.
- Accounting Tools: SLURM has a back-end database which stores historical information about the cluster. Users can query it to see how many resources they have used.
The primary source for documentation on SLURM usage and commands is the SLURM site. If you Google for SLURM questions, you'll often see the Lawrence Livermore pages as the top hits, but these tend to be outdated. A great way to get details on the SLURM commands is the man pages available from the Navigator cluster login node. For example, if you type the following command:
man sbatch
you'll get the manual page for the sbatch command.
Commands and flags
Since most people are familiar with the TORQUE queue management system, we provide a small set of examples showing SLURM commands and the corresponding TORQUE commands.
|Task|SLURM|TORQUE|Example|
|---|---|---|---|
|Submit a batch serial job|sbatch|qsub| |
|Kill a job|scancel|qdel| |
|Check current job by id|sacct|checkjob|sacct -j <JOBID>|
|View status of queues|squeue|qstat| |
|View information about nodes and partitions|sinfo|showq|sinfo -N; sinfo --long|
SLURM partitions (queues)
Partition is the term that SLURM uses for queues. Partitions can be thought of as a set of resources and parameters around their use. Full information about the partitions and their usage can be found in the internal wiki accessible to all registered users. One useful command to check the current limits for core usage in each queue is
sacctmgr show qos format="name,maxtresperuser%20,maxtresperaccount%20" | grep <queue>
Running, monitoring and canceling jobs
Submitting batch jobs using the sbatch command
Jobs are run on Navigator by submitting a script with the sbatch command. Submitting a job is as simple as:
sbatch runscript.sh
The commands specified in the runscript.sh file will then be run on the first available compute node that fits the resources requested in the script. sbatch returns immediately after submission; commands are not run as foreground processes and won't stop if you disconnect from Navigator.
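As a small sketch of the submission step, the snippet below submits runscript.sh and keeps the job id for later use with squeue, sacct or scancel. It relies on sbatch's --parsable option, which prints only the job id (optionally followed by ";cluster"). The helper names parse_jobid and submit_job are illustrative and not part of SLURM.

```shell
#!/bin/bash
# Sketch: submit a script and capture the job id that sbatch reports.

# Keep only the job id from a "--parsable" reply such as "12345" or "12345;cluster".
parse_jobid() {
    echo "${1%%;*}"
}

submit_job() {
    local script="$1" reply
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found: run this on a Navigator login node" >&2
        return 1
    fi
    reply="$(sbatch --parsable "$script")" || return 1
    parse_jobid "$reply"
}

if jobid="$(submit_job runscript.sh)"; then
    echo "Submitted job $jobid"
else
    echo "Job was not submitted"
fi
```

Keeping the id in a variable avoids having to look it up again when monitoring or cancelling the job.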
A typical submission script, in this case using the hostname command to get the computer name, will look like this:
#!/bin/bash
#SBATCH -n 1                 # Number of cores
#SBATCH -N 1                 # Ensure that all cores are on one machine
#SBATCH -t 0-00:05           # Runtime in D-HH:MM
#SBATCH -p veryshort         # Partition to submit to
#SBATCH --mem=100            # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH -o hostname.out      # File to which STDOUT will be written
#SBATCH -e hostname.err      # File to which STDERR will be written
#SBATCH --mail-type=END      # Type of email notification: BEGIN, END, FAIL, ALL
#SBATCH --mail-user=firstname.lastname@example.org   # Email to which notifications will be sent
hostname
In general, the script is composed of 3 parts:
- the #!/bin/bash line (allows the script to be run as a bash script).
- the #SBATCH lines (are technically bash comments, but they set various parameters for the SLURM scheduler).
- the command line itself.
The #SBATCH lines shown above set key parameters.
The SLURM system copies many environment variables from your current session to the compute host where the script is run including PATH and your current working directory. As a result, you can specify files relative to your current location (e.g. ./myfolder/myfiles/myfile.txt).
#SBATCH -n 1
This line sets the number of cores that you're requesting. Make sure that your tool can use multiple cores before requesting more than one. If this parameter is omitted, SLURM assumes -n 1.
#SBATCH -N 1
This line requests that the cores are all on one node. Only change this to >1 if you know your code uses a message passing protocol like MPI. SLURM makes no assumptions on this parameter: if you request more than one core (-n > 1) and you forget this parameter, your job may be scheduled across nodes, and unless your job is MPI (multinode) aware, it will run slowly, as it is oversubscribed on the master node and wastes resources on the other(s).
#SBATCH -t 5
This line specifies the running time for the job in minutes. You can also use the convenient format D-HH:MM. If your job runs longer than the value you specify here, it will be cancelled. Jobs have a maximum run time of 7 days on Navigator, though extensions are possible. There is no penalty for over-requesting time. NOTE! If this parameter is omitted on any partition, your job will be given the default of 10 minutes.
#SBATCH -p veryshort
This line specifies the SLURM partition (AKA queue) under which the script will be run. The serial_requeue partition is good for routine jobs that can handle being occasionally stopped and restarted. PENDING times are typically short for this queue. See the partitions description below for more information.
#SBATCH --mem=100
The Navigator cluster requires that you specify the amount of memory (in MB) that you will be using for your job. Accurate specifications allow jobs to be run with maximum efficiency on the system. There are two main options, --mem-per-cpu and --mem. If you specify multiple cores (e.g. -n 4), --mem-per-cpu will allocate the amount specified for each of the cores you've requested. The --mem option, on the other hand, specifies the total amount over all of the cores. If this parameter is omitted, the smallest amount is allocated, usually 100 MB, and chances are good that your job will be killed because it will likely exceed this amount.
#SBATCH -o hostname.out
This line specifies the file to which standard out will be appended. If a relative file name is used, it will be relative to your current working directory. If this parameter is omitted, any output will be directed to a file named slurm-JOBID.out in the current directory.
#SBATCH -e hostname.err
This line specifies the file to which standard error will be appended. SLURM submission and processing errors will also appear in the file. If this parameter is omitted, any output will be directed to a file named slurm-JOBID.out in the current directory.
#SBATCH --mail-type=END
Because jobs are processed in the "background" and can take some time to run, it is useful to send an email message when the job has finished (--mail-type=END). Email can also be sent at other processing stages (BEGIN, FAIL) or at all of them (ALL).
#SBATCH --mail-user=firstname.lastname@example.org
This line sets the email address to which the --mail-type messages will be sent.
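The -t option described above accepts either plain minutes or the D-HH:MM form. A minimal sketch of the conversion between the two, using a hypothetical helper minutes_to_slurm_time that is not part of SLURM:

```shell
#!/bin/bash
# Hypothetical helper: convert a whole number of minutes to the D-HH:MM form.
minutes_to_slurm_time() {
    local total="$1"
    local days=$(( total / 1440 ))            # 1440 minutes per day
    local hours=$(( (total % 1440) / 60 ))
    local mins=$(( total % 60 ))
    printf '%d-%02d:%02d\n' "$days" "$hours" "$mins"
}

minutes_to_slurm_time 5        # same request as "-t 5"
minutes_to_slurm_time 750      # 12.5 hours
minutes_to_slurm_time 10080    # 7 days, the maximum run time on Navigator
```

The three calls print 0-00:05, 0-12:30 and 7-00:00, which can be pasted directly after `#SBATCH -t`.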
It is important to accurately request resources
Navigator is a medium-sized, shared system that must have an accurate idea of the resources your program(s) will use so that it can effectively schedule jobs. If insufficient memory is allocated, your program may crash (often in an unintelligible way). Conversely, over-requesting can adversely affect your "fairshare", a number used in calculating the priority of your job for scheduling purposes. Therefore it is important to be as accurate as possible when requesting cores (-n) and memory (--mem or --mem-per-cpu).
The distinction between --mem and --mem-per-cpu is important when running multi-core jobs (for single core jobs, the two are equivalent). --mem sets total memory across all cores, while --mem-per-cpu sets the value for each requested core. If you request two cores (-n 2) and 4 GB with --mem, each core will receive 2 GB RAM. If you specify 4 GB with --mem-per-cpu, each core will receive 4 GB for a total of 8 GB.
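The arithmetic behind the two options can be checked with a few lines of shell. This is only an illustration of the two-core, 4 GB example above (expressed in MB); the variable names are arbitrary.

```shell
#!/bin/bash
# Sketch: memory implied by --mem versus --mem-per-cpu (values in MB).
cores=2
request_mb=4096

total_with_mem=$request_mb                          # --mem=4096: one pool for all cores
per_core_with_mem=$(( request_mb / cores ))         # nominal share of each core
total_with_mem_per_cpu=$(( request_mb * cores ))    # --mem-per-cpu=4096: amount per core

echo "--mem=$request_mb         -> total $total_with_mem MB (about $per_core_with_mem MB per core)"
echo "--mem-per-cpu=$request_mb -> total $total_with_mem_per_cpu MB"
```

Doing this calculation before submitting helps avoid requesting twice the memory a job actually needs.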
Monitoring job progress with squeue and sacct
squeue and sacct are two different commands that allow you to monitor job activity in SLURM. squeue is the primary and most accurate monitoring tool since it queries the SLURM controller directly. sacct gives you similar information for running jobs, and can also report on previously finished jobs, but because it accesses the SLURM database, there are some circumstances when the information is not in sync with squeue.
Running squeue without arguments will list all jobs currently in the queue, pending as well as running. It is more common, though, to list jobs for a particular user (like yourself) using the -u option...
squeue -u palmeida
or for a particular job
squeue -j 9999999
If you include the -l option (for "long" output) you can get useful data, including the running state of the job.
squeue "long" output using username (-u) and job id (-j) filters
Below you can see a typical output:
[palmeida@navigator ]# squeue -l
Sun Mar 22 20:25:39 2015
JOBID PARTITION   NAME     USER    STATE     TIME TIME_LIMI NODES NODELIST(REASON)
 2733 veryshort myprog palmeida  PENDING     0:00   8:00:00    43 (Resources)
 2732 veryshort myprog palmeida  RUNNING  2:24:17   8:00:00    43 com-[28-70]
 2731 veryshort myprog palmeida  RUNNING  5:02:54   8:00:00    43 com-[71-113]
The squeue man page has a complete description of the tool options.
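When many jobs are queued, a per-state count is often easier to read than the full listing. The sketch below assumes the standard squeue options -h (no header) and -o "%T" (print only the job state); the helper count_states is illustrative, and the fallback branch simply replays the three states from the sample output above so the snippet also runs off the cluster.

```shell
#!/bin/bash
# Sketch: count jobs per state from squeue-style output.
# count_states reads one state per line on stdin and prints "STATE count".
count_states() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

if command -v squeue >/dev/null 2>&1; then
    # -h drops the header line, -o "%T" prints only the job state
    squeue -h -u "$USER" -o "%T" | count_states
else
    # Fallback: the three states from the sample output above
    printf '%s\n' PENDING RUNNING RUNNING | count_states
fi
```

For the sample data this prints "PENDING 1" and "RUNNING 2".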
The sacct command also provides details on the state of a particular job. An squeue-like report on a single job is a simple command.
sacct -j 9999999
However sacct can provide much more detail as it has access to many of the resource accounting fields that SLURM uses. For example, to get a detailed report on the memory usage for today's jobs for user palmeida:
[root@navigator slurm]# sacct -u palmeida -j 2733_39 --format=JobID,JobNAME,Partition,MaxRSS,AveRss,MaxRSSNode,ReqMem
       JobID    JobName  Partition     MaxRSS     AveRSS MaxRSSNode     ReqMem
------------ ---------- ---------- ---------- ---------- ---------- ----------
2733           myprog-+  veryshort                                          0n
2733.0       hydra_pmi+                     0          0     com-71         0n
Both tools provide information about the job State. This value will typically be one of PENDING, RUNNING, COMPLETED, CANCELLED, and FAILED.
|State|Meaning|
|---|---|
|PENDING|Job is awaiting a slot suitable for the requested resources. Jobs with high resource demands may spend significant time PENDING.|
|RUNNING|Job is running.|
|COMPLETED|Job has finished and the command(s) have returned successfully (i.e. exit code 0).|
|CANCELLED|Job has been terminated by the user or administrator using scancel.|
|FAILED|Job finished with an exit code other than 0.|
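A script that has to wait for a job can use these states to decide when to stop polling. The following is a sketch under stated assumptions: the helper names is_final_state and wait_for_job are hypothetical, the sacct options used are -n (no header), -X (allocation only, no job steps) and -o State, and TIMEOUT and NODE_FAIL are included as further final states that SLURM reports for the time-limit and node-failure cases.

```shell
#!/bin/bash
# Sketch: decide whether a job state is final, then poll sacct until it is.
is_final_state() {
    case "$1" in
        COMPLETED|FAILED|TIMEOUT|NODE_FAIL) return 0 ;;
        CANCELLED*) return 0 ;;    # sacct may print "CANCELLED+" or "CANCELLED by <uid>"
        *) return 1 ;;             # PENDING, RUNNING, or not yet recorded
    esac
}

wait_for_job() {
    local jobid="$1" state
    while true; do
        # Take the first word of the first line as the job state
        state="$(sacct -j "$jobid" -n -X -o State | head -n 1 | awk '{ print $1 }')"
        if is_final_state "$state"; then
            echo "$state"
            return 0
        fi
        sleep 60    # poll sparingly to avoid loading the SLURM controller
    done
}

# Usage on the cluster (9999999 is the placeholder job id used in this guide):
#   wait_for_job 9999999
```

Any state not listed in is_final_state keeps the loop running, so the list should be extended if other end states matter for your workflow.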
Killing jobs with scancel
If, for any reason, you need to kill a job that you've submitted, just use the scancel command with the job id.
scancel <JOBID>
If you don't keep track of the job id returned from sbatch, you should be able to find it with the squeue -u command described above.
MPI (Message Passing Interface) is a standard that supports communication between separate processes, allowing parallel programs to simulate a large common memory space. There are three implementations available on Navigator: MPICH, OpenMPI and MVAPICH2. These libraries can be loaded via the module system like:
module load libs/openmpi/1.8.4-gcc-4.4.7
module load comp/intel2015.1.133
module load comp/intel2015.1.133.optimization
module load libs/mvapich2/2.1rc1-intel2015.1.133
Note that the MPI module names also specify the compiler used to build them. It is important that the tools you are using have been built with the same compiler. If not, your job will fail.
An example MPI script with comments is below:
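#!/bin/bash
#SBATCH -n 192          # Number of cores (one MPI process per core)
#SBATCH -p normal       # Partition to submit to
#SBATCH -A staff        # Account the job is charged to
#SBATCH -J gmx_NP192    # Job name
# Load the MPI library and the application built against it
module load libs/openmpi/1.8.4-gcc-4.4.7
module load progs/gromacs
# Start 192 MPI processes, matching the -n value above
mpiexec -n 192 mdrun_mpi -s dmpc3_md.tpr -o dmpc3_md.trr -c dmpc4.gro -g dmpc3_md.log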
Notice that the number of processors requested by the mpiexec command matches the number of cores requested for SLURM (-n).
In parallel jobs, to make sure you use all the cores in every requested node, the number of requested cores should be a multiple of 24, the number of cores in each of the 164 original Navigator nodes (2 x 12-core CPUs).
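A quick way to find the nearest suitable core count is to round the request up to whole nodes. The sketch below assumes 24 cores per node, as stated above; the helper round_up_to_nodes is illustrative only.

```shell
#!/bin/bash
# Sketch: round a core request up to whole nodes (24 cores per node assumed).
cores_per_node=24

# Prints "<nodes> <cores>" for the smallest whole-node request covering $1 cores.
round_up_to_nodes() {
    local wanted="$1"
    local nodes=$(( (wanted + cores_per_node - 1) / cores_per_node ))
    echo "$nodes $(( nodes * cores_per_node ))"
}

result="$(round_up_to_nodes 192)"
echo "192 cores -> ${result% *} nodes, request -n ${result#* }"

result="$(round_up_to_nodes 100)"
echo "100 cores -> ${result% *} nodes, request -n ${result#* }"
```

A request for 192 cores already fills 8 nodes exactly, while 100 cores would be rounded up to 5 nodes and -n 120.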
We use a multifactor method of job scheduling on Navigator.
Job priority is assigned by a combination of fair-share, partition priority, and length of time a job has been sitting in the queue.
The priority of the queue is the highest factor in the job priority calculation. For certain queues this will cause jobs on lower priority queues which overlap with that queue to be requeued.
The second most important factor is fair-share score. You can find a description of how SLURM calculates fair-share in the SLURM documentation.
The third most important is how long you have been sitting in the queue. The longer your job sits in the queue the higher its priority grows. If everyone’s priority is equal then FIFO is the scheduling method.
If you want to see what your current priority is just do
sprio -j <JOBID>
which will show you the calculation it does to figure out your job priority.
If you do
sshare -u <USERNAME>
you can see your current fair-share and usage.
We also have backfill turned on.
This allows for jobs which are smaller to sneak in while a larger higher priority job is waiting for nodes to free up.
If your job can run in the amount of time it takes for the other job to get all the nodes it needs, SLURM will schedule you to run during that period.
This means knowing how long your code will run for is very important and must be declared if you wish to leverage this feature.
Otherwise the scheduler will just assume you will use the maximum allowed time for the partition when you run.
- System: CentOS 7.7
Lustre version: 2.12.3_DDN5
Resource Manager and Scheduler: Slurm 20.02.0
User Environment Manager: Lmod 8.2.7
Compilers, Interpreters and Tools
Intel (from Intel Parallel Studio XE Cluster Edition 2019 update 3)
Message Passing Interface libraries
Intel MPI
Intel MKL
- Amber (18) *
Amber (for Assisted Model Building with Energy Refinement) is a software package for performing molecular dynamics and structure prediction.
- AmberTools (18)
AmberTools is a free and open source software package for performing molecular dynamics.
- CP2K (6.1)
CP2K is a freely available (GPL) program, written in Fortran 95, to perform atomistic and molecular simulations of solid state, liquid, molecular and biological systems. It provides a general framework for different methods such as e.g. density functional theory (DFT) using a mixed Gaussian and plane waves approach (GPW), and classical pair and many-body potentials.
- FDS (6.7.3)
Fire Dynamics Simulator (FDS) is a large-eddy simulation (LES) code for low-speed flows, with an emphasis on smoke and heat transport from fires.
- FreeSurfer (6)
FreeSurfer is an open source software package for processing MRI images of the brain.
- GAMESS-US (2018R3; 20190930)
The General Atomic and Molecular Electronic Structure System (GAMESS) is a general ab initio quantum chemistry package.
- Gaussian (16.b.01) *
Gaussian is a general-purpose computational chemistry software package that provides state-of-the-art capabilities for electronic structure modelling.
- GEOS (3.7.2)
GEOS (Geometry Engine - Open Source) is a C++ port of the Java Topology Suite (JTS).
- GROMACS (2016.6; 2018.7; 2019.4)
GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. This is a CPU only build, containing both MPI and threadMPI builds.
- MATLAB (2020a) *
MATLAB is a numerical computing environment built around the programming language with the same name.
- NWChem (6.8)
NWChem aims to provide its users with scalable computational chemistry tools.
- Octave (4.4.1; 5.1.0)
GNU Octave is a high-level interpreted language, primarily intended for numerical computations.
- OpenFOAM (6; 7; v1812; v1906)
OpenFOAM is open-source software to develop customised numerical solvers and pre-/post-processing utilities for the solution of continuum mechanics problems.
* These programs are only available to some users/groups, restricted either by License or by special Confidentiality agreements.
A variety of problems can arise when running jobs on Navigator. Many are related to resource mis-allocation, but there are other common problems as well.
|Error message|Likely cause|
|---|---|
|JOB <jobid> CANCELLED AT <time> DUE TO TIME LIMIT|You did not specify enough time in your batch submission script. The -t option sets time in minutes or can also take the D-HH:MM form (0-12:30 for 12.5 hours).|
|Job <jobid> exceeded <mem> memory limit,|Your job is attempting to use more memory than you've requested for it. Either increase the amount of memory requested by --mem or --mem-per-cpu or, if possible, reduce the amount your application is trying to use. For example, many Java programs set heap space using the -Xmx JVM option. This could potentially be reduced. For jobs that require truly large amounts of memory (>96 GB), you may need to use a future bigmem SLURM partition that we intend to set up on Navigator. Genome and transcript assembly tools are commonly in this camp.|
|Socket timed out on send/recv operation|This message indicates a failure of the SLURM controller. Though there are many possible explanations, it is generally due to an overwhelming number of jobs being submitted, or, occasionally, finishing simultaneously. If you want to figure out if SLURM is working, use the sdiag command. sdiag should respond quickly in these situations and give you an idea as to what the scheduler is up to.|
|JOB <jobid> CANCELLED AT <time> DUE TO NODE FAILURE|This message may arise for a variety of reasons, but it indicates that the host on which your job was running can no longer be contacted by SLURM.|