This article is for mox.hyak (hyak nextgen). If you are using ikt.hyak (hyak classic), then go to Hyak_scheduler.
Mox uses a scheduler called SLURM. It is similar to, but different from, the PBS-based scheduler used on hyak classic.
In the examples below, xyz is your hyak group name and abc is your UW NetID.
To logon:
ssh abc@mox.hyak.uw.edu
The above command gives you access to the login node of mox. The login node is only for logging in and submitting jobs. The computational work is done on a compute node. As shown below, you can either get an interactive compute node or submit a batch job. The build node is a special compute node that can connect to the internet.
To see the various partitions (aka allocations):
sinfo
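For example, to see only your own group's partition:
sinfo -p xyz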
To see the usage of your group's nodes:
squeue -p xyz
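To see only your own jobs, regardless of partition:
squeue -u abc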
The mox-specific command below shows the number of nodes etc. in all of your allocations.
hyakalloc
Interactive Single Node Usage with srun:
The build node can connect to outside mox. It is useful for using git, transferring files to or from mox, installing packages in R or Python, etc.
To get an interactive build node for 2 hours:
srun -p build --time=2:00:00 --mem=100G --pty /bin/bash
(Note: (1) --pty /bin/bash must be the last option in the above command.
(2) It is important to specify --mem=100G. If this is not specified, then the SLURM scheduler limits the usage of memory to
a default value even if more memory is available on the node. Usually the default value is quite low.)
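Once you are on the build node, typical tasks might look like the sketch below. The repository URL, the package name, and the remote host are placeholders, not recommendations; depending on your setup you may first need to load a module to get git, Python, or R.
git clone https://github.com/someuser/somerepo.git
pip install --user numpy
scp abc@remote.example.com:mydata.tar.gz /gscratch/xyz/abc/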
An interactive node in your own group cannot connect to outside mox.
To get an interactive node in your own group for 2 hours:
srun -p xyz -A xyz --time=2:00:00 --mem=100G --pty /bin/bash
Issue the command below at an interactive node prompt to see the list of SLURM environment variables:
export | grep SLURM
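For example, a few SLURM variables that are commonly set on an interactive node (the exact set of variables depends on the SLURM version) are shown in this sketch:
echo $SLURM_JOB_ID
echo $SLURM_JOB_NODELIST
echo $SLURM_CPUS_ON_NODE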
Interactive Multiple Node Usage with srun:
If you are setting up an application that uses multiple nodes (e.g. Apache Spark), you will need interactive access to multiple nodes.
To get 2 nodes for interactive use:
srun -N 2 -p xyz -A xyz --time=2:00:00 --mem=100G --pty /bin/bash
When the above command runs, you will have been allocated 2 nodes but will be on only one of the two allocated nodes.
To find the names of the nodes that you have been allocated, issue the command below:
scontrol show hostnames
Once you know the node names, you can use them for your work (e.g. Apache Spark, Hadoop, etc.).
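For example, a minimal sketch that saves the allocated node names to a file so that another tool (e.g. a Spark or Hadoop setup script; the file name my_nodes.txt is just a placeholder) can read them:
scontrol show hostnames > my_nodes.txt
cat my_nodes.txt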
The command below will show you other SLURM environment variables.
export | grep SLURM
scancel
Use scancel to cancel jobs. For example, to cancel a job with job ID 1234:
scancel 1234
If your userid is abc, then you can cancel all of your jobs by using the command below:
scancel -u abc
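You can also cancel only your pending (not yet started) jobs:
scancel -u abc --state=PENDING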
sstat
Use sstat for information (e.g. memory usage) about a running job. For example, for a running job whose job ID is 1234:
sstat -j 1234 --format=JobID,MaxVMSize
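If the running job was submitted with sbatch and does not launch its own srun steps, you may need to query its batch step explicitly:
sstat -j 1234.batch --format=JobID,MaxRSS,MaxVMSize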
For details see https://slurm.schedmd.com/sstat.html
sacct
Use sacct for information (e.g. memory usage) about a job which has completed.
sacct --format=jobid,elapsed,MaxVMSize
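To list all of your completed jobs since a given date (the date below is only a placeholder), use the --starttime option:
sacct -u abc -S 2020-01-01 --format=JobID,JobName,Elapsed,MaxRSS,State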
For details see https://slurm.schedmd.com/sacct.html
Batch usage:
To submit a batch job:
sbatch -p xyz -A xyz myscript.slurm
The script myscript.slurm is similar to the myscript.pbs used on hyak classic. Below is an example SLURM script.
#!/bin/bash
## Job Name
#SBATCH --job-name=myjob
## Allocation Definition
#SBATCH --account=xyz
#SBATCH --partition=xyz
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (3 hours)
#SBATCH --time=3:00:00
## Memory per node
#SBATCH --mem=30G
## Specify the working directory for this job
#SBATCH --workdir=/gscratch/xyz/abc/myjobdir
myprogram
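In the script above, myprogram stands for whatever program or commands you want to run. Because the account and partition are already set by the #SBATCH directives, you can submit the script without repeating them on the command line (options given on the command line override the directives) and then watch the job with squeue:
sbatch myscript.slurm
squeue -u abc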
More details are here:
https://slurm.schedmd.com/quickstart.html
https://slurm.schedmd.com/documentation.html
===== Below is for Advanced users only =====
Below is about salloc. Do not use salloc unless you have a specific reason.
Interactive Multiple Node Usage with salloc:
If you are setting up an application that uses multiple nodes (e.g. Apache Spark), you will need interactive access to multiple nodes.
To get 2 nodes for interactive use:
salloc -N 2 -p xyz -A xyz --time=2:00:00 --mem=100G
When the above command runs, you will have been allocated 2 nodes but will still be on the mox login node.
If you now issue a command like the one below, then srun will run the command hostname on each allocated node:
srun hostname
To find the names of the nodes that you have been allocated, issue the command below:
scontrol show hostnames
Once you know the node names, you can use them for your work (e.g. Apache Spark, Hadoop, etc.).
The command below will show you other SLURM environment variables.
export | grep SLURM
srun vs salloc
If no nodes have been allocated yet, then (1) and (2) below achieve the same result.
(1) srun -p xyz -A xyz --time=4:00:00 --mem=100G --pty /bin/bash
This allocates a node and opens an interactive shell on it in a single step.
(2) salloc -p xyz -A xyz --time=4:00:00 --mem=100G
This allocates the node. Then:
srun --pty /bin/bash
This opens an interactive shell on the same node. In general, if nodes have already been allocated by using salloc, then srun just uses those nodes.
An alternative to srun is to allocate a node and then ssh to it, e.g.:
Allocate a node:
salloc -p xyz -A xyz --time=4:00:00 --mem=100G
Find the name of the allocated node:
export | grep LIST
If the above command shows a hostname of n1234, then you can ssh to n1234 and use it for your work.
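Equivalently, assuming the node list variable is named SLURM_JOB_NODELIST (on some SLURM versions it is SLURM_NODELIST), you can expand it into individual hostnames and then ssh to one of them:
scontrol show hostnames $SLURM_JOB_NODELIST
ssh n1234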