HAI Compute Cluster

Important!

  • The HAI Compute Cluster uses SUNetID authentication and requires Duo 2FA; this is different from the SC Cluster
  • Do NOT run intensive processes on the haic headnode (no vscode, ipython, tensorboard, etc.); they will be killed automatically
  • The cluster is a shared resource, so always be mindful of others
  • If you have any issues using the cluster, please send us a request at https://support.cs.stanford.edu

Overview

The HAI Compute Cluster is hosted and managed by Stanford Computer Science I.T. (aka Action @ CS) and serves researchers who are affiliated with HAI. The cluster went into service in Fall 2024 and consists of 5 systems with a total of 40 NVIDIA H100 GPUs with NVLink. The systems are interconnected with an NDR InfiniBand network fabric and are capable of running fully parallel GPU workloads across the entire cluster. The cluster runs Ubuntu 22.04 and uses SLURM for workload scheduling.


Access

You must request access via your HAI faculty; the project proposal will be reviewed and access will be granted upon approval.

Once your access is set up, you can SSH to the cluster headnode, haic.stanford.edu, with your SUNetID (not CSID). If you are coming from off-campus, please be sure to connect to the Stanford VPN (Full-Tunnel) before you SSH.
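
For example, if your SUNetID is jdoe (a placeholder), the connection looks like:

ssh jdoe@haic.stanford.edu

Expect the usual SUNetID password prompt followed by a Duo 2FA approval, per the authentication note above.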


Storage

Important! As of now, we don't have a dedicated data-transfer node for the HAI cluster, which means the headnode will be busy at times. If you need to download a large dataset, please keep the number of concurrent connections to a reasonable amount.
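
For example, if you are copying a dataset over from your own machine, a single rate-limited rsync is gentler on the headnode than many parallel connections. A sketch, where the paths, SUNetID, and bandwidth cap are all placeholders:

rsync -avP --bwlimit=50000 ./my_dataset/ <SUNetID>@haic.stanford.edu:/hai/scratch/<your scratch directory>/

The --bwlimit value is in KB/s, so 50000 is roughly 50 MB/s; adjust it to whatever is reasonable for your transfer.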

UNIX home directory - each user will have a home directory created in /hai/users with a 50GB quota.

Shared network storage - each user will have a shared network directory created in /hai/scratch with a 5TB quota. Team/project-specific folders are under consideration at this time.
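
To keep an eye on how much of your scratch quota you are using, standard Linux tools are enough. A minimal sketch, assuming your scratch directory is named after your username (an assumption; adjust the path if yours differs):

du -sh /hai/scratch/$USER

du -sh reports the total size of the directory, which is worth checking before large downloads or checkpoint runs.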


SLURM

SLURM Accounting Group (IMPORTANT for job submission)

Each user account is tied to at least one SLURM group (account) based on the team you are affiliated with. When running SLURM srun/sbatch, you are required to use the '--account=' flag to submit your job. E.g., if you are part of the MODELS team,

srun --account=models -p hai-interactive --gres=gpu:1 --pty bash
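
If you are unsure which account(s) your user is tied to, SLURM's accounting tools can usually tell you. A sketch, assuming sacctmgr is available on the headnode:

sacctmgr show associations user=$USER format=Account,User

The Account column lists the value(s) you can pass to the '--account=' flag.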

SLURM partition (queue)

There are currently 3 partitions (queues) set up:

  • hai (sbatch only, no shell access, all jobs should go here; see the example batch script after this list), nodelist=haic-hgx-[2-5], walltime is currently set to default=24hrs, max=3days
  • hai-interactive (shell access allowed, meant for debugging/prototyping, resource quotas are heavily restricted), nodelist=haic-hgx-1, walltime is currently set to default=8hrs, max=24hrs
  • hai-lo (low-priority queue, jobs may be preempted when submitted to this partition, "free-for-all" when excess cycles are available), walltime is currently set to default=24hrs, max=14days
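
For the hai partition, jobs go in via sbatch. Below is a minimal batch script sketch, assuming the 'models' account from the earlier example; the job name, GPU count, time limit, and training script are all placeholders:

#!/bin/bash
#SBATCH --account=models
#SBATCH --partition=hai
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --job-name=example-job
#SBATCH --output=%x-%j.out

srun python train.py

Save it as job.sh and submit with 'sbatch job.sh'; keep in mind the 3-day walltime maximum on this partition.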

SLURM Job Quotas

There are 3 types of quotas applied to the SLURM setup on the HAI cluster. This is one major difference if you are an existing SC cluster user, so please read this carefully.

  • Per-user quotas, which are applied to each user individually
  • Per-account quotas, which are applied to your team account, i.e. the aggregate of all users on the team
  • Partition-based quotas, which are applied to specific partitions, e.g. hai-interactive and hai-lo; these quotas supersede the per-user and per-account quotas

The current quota settings are outlined below, but as always they are subject to change based on usage.

Example: every user currently has a quota of 8 GPUs and 8 running jobs plus 8 more in the queue, but each team (all users on that team combined) only has a quota of 16 GPUs and 16 running jobs plus 16 more in the queue. On the hai-interactive partition, each user can only have 1 running job with 1 GPU, and this also counts toward your team quota. Jobs submitted to hai-lo effectively have no quotas.
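
To see how much of your per-user quota is in use, the standard squeue command lists your running and pending jobs; a sketch with a few useful format fields (adjust to taste):

squeue -u $USER -o "%.12i %.16P %.10a %.10T %.6D %.12M %R"

The %a column shows the account each job was submitted under, which helps when you belong to more than one team.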


For technical issues or questions regarding the HAI Compute Cluster, please send us a request at https://support.cs.stanford.edu