
LCA - Laboratory for Advanced Computing

General information

Accessing the cluster

Once you've gone through the account setup procedure and obtained a suitable terminal application, you can log in to the Navigator system via ssh:

ssh <USERNAME>@navigator.lca.uc.pt

Operating system

Navigator computers run the CentOS distribution of the Linux operating system, and commands are run under the "bash" shell. A number of Linux and bash references, cheat sheets, and tutorials are available on the web.

Some modifications have been made to the default installation to meet the computational needs we support.

If you run into trouble compiling or running your code, feel free to contact us. When asking for help, it is very important to give us enough information: the environment where the problem occurred, relevant logs, and the exact steps you took, so that we can reproduce the issue.

Environment modules

Because of the diversity of research currently supported by LCA, many applications and libraries are installed on the Navigator cluster. It is impractical to include all of these tools in every user's environment by default, so the Linux environment module system is used to enable subsets of these tools for a particular user's computational needs.
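As a quick sketch, a typical module session looks like this (the module name gcc/8.3.0 is illustrative; run module avail on Navigator to see the real list):

module avail                # list every module installed on the cluster
module load gcc/8.3.0       # enable a tool (module name is illustrative)
module list                 # show the modules currently loaded
module unload gcc/8.3.0     # remove it from your environment again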

Compilers

Navigator has the default gcc and gfortran for the installed CentOS release, GCC 4.8.5. Since this is an old version of the compiler, Navigator also provides gcc (and gfortran) versions 5.4.0 and 8.3.0 through environment modules.
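Switching to one of the newer GCC versions might look like the following sketch (the module name gcc/8.3.0 is an assumption; check what module avail gcc reports on Navigator):

module avail gcc            # see which GCC modules exist
module load gcc/8.3.0       # assumed module name; adjust to what module avail shows
gcc --version               # should now report 8.3.0 instead of 4.8.5
gfortran --version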

As many users expect, Navigator also provides a recent version of the Intel compiler and the Intel Math Kernel Library (MKL): Intel Parallel Studio XE Cluster Edition 2019 update 3 (19.0.3.199). These can be used through an environment module.
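Loading the Intel tools and compiling against MKL could look like this sketch (the module name intel/2019.3 is an assumption; -mkl is the Intel compiler's shorthand for linking MKL):

module load intel/2019.3                      # assumed module name; check module avail
icc -O2 -mkl my_solver.c -o my_solver         # compile C code and link against MKL
ifort -O2 -mkl my_solver.f90 -o my_solver_f   # the same idea with the Fortran compiler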

Queue management system

SLURM is a queue management system; the name stands for Simple Linux Utility for Resource Management. SLURM was developed at Lawrence Livermore National Laboratory and currently runs some of the largest compute clusters in the world.

SLURM is similar in many ways to TORQUE and most other queue systems. You write a batch script and then submit it to the queue manager. The queue manager then schedules your job to run on the partition (called a queue in TORQUE) that you designate. Below we outline how to submit jobs to SLURM, how SLURM decides when to schedule your job, and how to monitor progress.
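As a starting point, a minimal batch script might look like the sketch below (the partition name and resource values are placeholders; adjust them to your needs and to the partitions sinfo reports):

#!/bin/bash
#SBATCH --job-name=my_job          # name shown in the queue
#SBATCH --partition=normal         # placeholder; pick a real partition from sinfo
#SBATCH --ntasks=1                 # a single task, i.e. a serial job
#SBATCH --mem=4G                   # memory request; SLURM enforces this limit
#SBATCH --time=01:00:00            # wall-clock limit, HH:MM:SS
#SBATCH --output=my_job_%j.out     # output file; %j expands to the job id

./my_program                       # replace with your actual executable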

SLURM has a number of features that make it more suited to our environment than TORQUE:

  • Kill and Requeue: SLURM’s ability to kill and requeue jobs is superior to that of TORQUE. It waits for jobs to be cleared before scheduling a high-priority job, and it can kill and requeue based on memory use rather than just on core count.
  • Memory: memory requests are sacrosanct in SLURM. The amount of memory you request at run time is guaranteed to be there: no one can infringe on that memory space, and you cannot exceed the amount you requested.
  • Accounting Tools: SLURM has a back-end database which stores historical information about the cluster. Users who are curious about how many resources they have used can query this information.

The primary source of documentation on SLURM usage and commands is the SLURM site. If you Google for SLURM questions, you'll often see the Lawrence Livermore pages as the top hits, but these tend to be outdated. A great way to get details on SLURM commands is the man pages available on the Navigator cluster login node. For example, if you type the following command:

man sbatch

you'll get the manual page for the sbatch command.

Commands and flags

Since most people are familiar with the TORQUE queue management system, we provide a small set of SLURM examples together with the corresponding TORQUE commands.

Task                                          SLURM     TORQUE     SLURM Example
Submit a batch serial job                     sbatch    qsub       sbatch my_script.sh
Kill a job                                    scancel   qdel       scancel <JOBID>
Check current job by id                       sacct     checkjob   sacct -j <JOBID>
View status of queues                         squeue    qstat      squeue --long
View information about nodes and partitions   sinfo     showq      sinfo -N; sinfo --long
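Putting these together, a typical session might look like this (the job id 12345 is illustrative; use the id SLURM prints when you submit):

sbatch my_script.sh         # submit the job; SLURM replies with the job id
squeue -u $USER             # list only your own jobs in the queue
sacct -j 12345              # accounting details for the job, running or finished
scancel 12345               # cancel the job if something went wrong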