Use of the cluster is coordinated by a batch queue scheduler, which assigns compute nodes to jobs in an order that depends on factors such as the time of submission, the number of nodes requested, and the availability of the requested resources (GPUs, memory, etc.).
There are two basic types of jobs on the cluster: interactive and batch.
Interactive jobs give you access to a shell on one of the nodes, from which you can execute commands by hand, whereas batch jobs run a given shell script in the background and automatically terminate when finished.
Generally speaking, interactive jobs are used for building, prototyping and testing, while batch jobs are used thereafter.
Important - Beginning October 2022, every job submission requires the "account" parameter (--account=). Accounts are assigned as part of the SC access request and are based on the group you are affiliated with; you can use "showaccount" to find out your current affiliation. If you need to change or add an account, please submit an SC Access Request form.
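If showaccount is unavailable in your environment, standard Slurm accounting tooling can also list the accounts associated with your user (a sketch, assuming the accounting database is readable by regular users):
sacctmgr show associations user=$USER format=Account,Partition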
Batch Jobs
Batch jobs are the preferred way to interact with the cluster, and are useful when you do not need to interact with the shell to perform the desired task. Two clear advantages are that your job is managed automatically after submission, and that placing your setup commands in a shell script lets you efficiently dispatch multiple similar jobs. To start a simple batch job on a partition (the group you work with; see the bottom of the page), ssh into sc and type:
sbatch my_script.sh
There are many parameters you can define based on your requirements. You can reference a sample submit script at /sailhome/software/sample-batch.sh.
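As an illustration, a minimal submit script might look like the following (a sketch only; the account, partition, resource values, job name, and final command are placeholders, and the sample script above reflects the site's actual defaults):
#!/bin/bash
# Account and partition are assigned per group (placeholder names below).
#SBATCH --account=your_group_account
#SBATCH --partition=my_partition
# Resource requests: one GPU, 16G of memory, a 2-hour wall-clock limit.
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=02:00:00
# Name the job and write its output to <job-name>-<job-id>.out.
#SBATCH --job-name=my_job
#SBATCH --output=%x-%j.out

# Commands to run on the allocated node (placeholder).
python my_program.py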
For further documentation on submitting batch jobs via Slurm, see the online sbatch documentation via SchedMD.
Our friends at the Stanford Research Computing Center, who run the Sherlock cluster via Slurm, also have a wonderful write-up that largely applies to us as well: Sherlock Cluster.
Interactive Jobs
Interactive jobs are useful for compiling and prototyping code intended to run on the cluster, performing one-time tasks, and executing software that requires runtime feedback. To start an interactive job, ssh into sc and type:
srun --account=your_group_account --partition=my_partition --pty bash
The above will allocate a node in my_partition (replace that name with the name of your partition) and drop you into a bash shell. You can also add other parameters as necessary.
srun --account=your_group_account --partition=my_partition --nodelist=node1 --gres=gpu:1 --pty bash
The above will allocate node1 in my_partition with 1 GPU and drop you into a bash shell.
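As an example of adding further parameters, CPU count, memory, and a time limit can be requested with standard Slurm options (the values below are placeholders, not site recommendations):
srun --account=your_group_account --partition=my_partition --cpus-per-task=4 --mem=16G --time=02:00:00 --gres=gpu:1 --pty bash
The above will allocate 4 CPU cores, 16G of memory, and 1 GPU on a node in my_partition for up to 2 hours and drop you into a bash shell.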
If you need X11 forwarding, please make sure you have an X server installed (such as XQuartz) and add --x11 to your srun command:
srun --account=your_group_account --partition=my_partition --nodelist=node1 --gres=gpu:1 --pty --x11 xclock
GPU specifics
Users can request a specific type of GPU or specify a memory constraint if they choose to:
srun --account=your_group_account --partition=my_partition --gres=gpu:titanx:1 --pty bash
The above will request 1 TitanX GPU from any node in my_partition.
srun --account=your_group_account --partition=my_partition --gres=gpu:1 --constraint=12G --pty bash
The above will request 1 GPU with 12G of VRAM from any node in my_partition.
Of course, this varies from partition to partition depending on the hardware configuration. Please visit https://sc.stanford.edu and click on "Partition" at the top right to see the types of GPU available for each partition. As for constraints, you can refer to Nvidia's specifications for each GPU model (1080ti = 11G, titan = 12G, etc.).
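You can also query this information from the command line: sinfo can list the GPUs (GRES) and the feature tags that --constraint matches against for each node (a sketch; replace my_partition with the name of your partition):
sinfo --partition=my_partition --Node --format="%N %G %f"
The above will print each node in my_partition along with its GPU resources and its available feature tags.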
For further documentation on the srun command, see the online srun documentation via SchedMD.