LCA

Laboratory for Advanced Computing

Troubleshooting

A variety of problems can arise when running jobs on Navigator. Many are related to resource mis-allocation, but there are other common problems as well.

Error / Likely cause
JOB <jobid> CANCELLED AT <time>
DUE TO TIME LIMIT
Your job ran longer than the time limit requested in your batch submission script. Increase the limit with the -t option, which takes a number of minutes or the D-HH:MM form (for example, 0-12:30 for 12.5 hours).
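As a rough illustration of the two time formats, the sketch below shows a job script header that requests 12.5 hours, together with a small helper that converts D-HH:MM to minutes. The helper name dhm_to_minutes is hypothetical and not part of SLURM; it only handles the D-HH:MM form mentioned above.

```shell
#!/bin/bash
#SBATCH -t 0-12:30    # 12.5 hours; the same limit could be written as "-t 750" (minutes)

# Hypothetical helper (not a SLURM command): convert D-HH:MM to minutes.
dhm_to_minutes() {
    local spec=$1
    local days=${spec%%-*}     # text before the first "-"
    local hm=${spec#*-}        # text after the first "-"
    local hours=${hm%%:*}      # text before the ":"
    local mins=${hm#*:}        # text after the ":"
    # 10# forces base 10 so values such as "08" are not read as octal.
    echo $(( 10#$days * 1440 + 10#$hours * 60 + 10#$mins ))
}

dhm_to_minutes 0-12:30    # prints 750
```

Checking the conversion this way can help confirm that a limit written in minutes and one written in D-HH:MM form really request the same amount of time.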
Job <jobid> exceeded <mem> memory limit,
being killed
Your job is attempting to use more memory than you requested for it. Either increase the request with --mem or --mem-per-cpu or, if possible, reduce the amount your application uses. For example, many Java programs set their heap size with the -Xmx JVM option, which can often be lowered.
For jobs that require truly large amounts of memory (>96 GB), you may need to use a future bigmem SLURM partition that we intend to run on Navigator. Genome and transcript assembly tools commonly fall into this category.
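One way to keep a Java heap below the memory request is to derive -Xmx from the request itself. The sketch below assumes the job was submitted with --mem, in which case SLURM exports the request in megabytes as SLURM_MEM_PER_NODE. The helper heap_mb, the 80% headroom figure, and my_app.jar are all illustrative assumptions, not site recommendations.

```shell
#!/bin/bash
#SBATCH --mem=8G      # total memory requested for the job on the node
#SBATCH -t 0-02:00

# Hypothetical helper: leave headroom for non-heap JVM memory by giving
# the heap 80% of the request (the 80% figure is an assumption).
heap_mb() {
    local request_mb=$1
    echo $(( request_mb * 80 / 100 ))
}

# SLURM_MEM_PER_NODE holds the --mem request in MB inside a job;
# fall back to 8192 MB when running outside SLURM.
mem_mb=${SLURM_MEM_PER_NODE:-8192}

# Print the command rather than running it; my_app.jar is a placeholder.
echo "java -Xmx$(heap_mb "$mem_mb")m -jar my_app.jar"
```

If the job was submitted with --mem-per-cpu instead, the per-node variable may not be set, so the fallback value would need adjusting.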
slurm_receive_msg:
Socket timed out on send/recv operation
This message indicates a failure of the SLURM controller. Though there are many possible explanations, it is generally due to an overwhelming number of jobs being submitted or, occasionally, finishing simultaneously. To check whether SLURM is working, use the sdiag command. sdiag should respond quickly in these situations and give you an idea of what the scheduler is doing.
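The check described above could be wrapped in a small function so that a hung controller does not leave your terminal waiting. This is only a sketch: check_scheduler is a hypothetical name, the 10-second default is an arbitrary assumption, and it relies on the coreutils timeout command being available on the login node.

```shell
#!/bin/bash

# Hypothetical wrapper: report whether sdiag answers within a time limit.
check_scheduler() {
    local limit=${1:-10}    # seconds to wait; 10 is an arbitrary default
    if ! command -v sdiag >/dev/null 2>&1; then
        echo "sdiag not found; run this on a SLURM login node"
        return 2
    fi
    # timeout kills sdiag if the controller does not answer in time.
    if timeout "$limit" sdiag >/dev/null 2>&1; then
        echo "scheduler responded within ${limit}s"
    else
        echo "scheduler did not respond within ${limit}s"
        return 1
    fi
}

# Usage: check_scheduler 5
```

When the scheduler does respond, running sdiag directly shows its full statistics.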
JOB <jobid> CANCELLED AT <time>
DUE TO NODE FAILURE
This message may arise for a variety of reasons, but it indicates that the host on which your job was running can no longer be contacted by SLURM.