Running jobs

This page covers what you need to know to get your code running on the Rosalind HPC cluster. It includes quick-start steps for beginners with simple jobs, configuration steps for GPU, MPI and GUI jobs, and general advice on using the scheduler that will help you get the most out of the system. Only trivial code examples are used here; guidance on specific scientific software is found elsewhere.

Quickstart

Prerequisites

These instructions assume the following to be true:

  • Your account has been enabled.
  • You have an ssh client installed on a computer with Internet connectivity or on the KCL network.
  • You are able to successfully access the cluster login nodes.
  • You know how to create files with a Linux based text editor.

What is the Rosalind HPC cluster?

The Rosalind HPC cluster is a collection of high-specification compute nodes (computers) with a shared network and storage servers. The nodes span a variety of hardware specifications and ages, and each is owned by one or more funding partners. The scheduler (Slurm) allows users of the cluster to submit jobs (software applications) to run on this pool of hardware, and allocates the underlying hardware efficiently according to compute requirements, access policies and priorities.

Loading software dependencies

The cluster makes use of Module Environments to provide a means for loading specific versions of scientific software or development tools on the cluster. When submitting jobs to the cluster you should load the software modules required for your program to run.
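For example, the module avail command lists the modules installed on the cluster, and module load makes one available in your session or job script. libs/openmpi is used here purely as an illustration because it appears later in this guide; load whichever module your own software needs:

  # List the software modules available on the cluster
  module avail

  # Load a specific module so its commands and libraries are available
  module load libs/openmpi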

Identify your partition

The scheduler is configured to group the compute nodes into a number of partitions in order to apply a sharing policy. The first task when submitting jobs to the cluster is to identify which partition you should use from the table below.

Partition Name     User Group
brc                Users from the GSTT and SLaM BRCs
nms_research       NMS research staff and PhD students
nms_research_gpu   NMS research staff and PhD students
nms_teach          NMS taught students
nms_teach_gpu      NMS taught students
shared             All King's staff and students
shared_gpu         All King's staff and students

Based on which User Group you belong to in the above table, you must use the associated Partition Name with the -p option of the srun and sbatch commands in the following examples. The _gpu partitions should be used when submitting GPU jobs.

Note

The shared partition is accessible to all users on the cluster and allows a small percentage of resources available in the nms/brc partitions to be made available for general use. If the same resource is available to you via a more specific (i.e. nms/brc) partition it is always best to target that in your job submissions.

Running an interactive job

Running an interactive job is the cluster equivalent of executing a command on a standard Linux command line. This means you will be able to provide input and read output (via the terminal) in real time. This is often used as a means to test that code runs before submitting a batch job. You can start a new interactive job using the srun command; the scheduler will search for an available compute node and, if one is available, provide you with an interactive login shell on it.

[k1234567@login3(rosalind) ~]$ srun -p shared --pty /bin/bash
[k1234567@nodek52 [rosalind] ~]$ echo 'Hello, World!'
Hello, World!
[k1234567@nodek52 [rosalind] ~]$ squeue -u k1234567
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              8178    shared     bash k1234567  R       4:44      1 nodek52
[k1234567@nodek52 [rosalind] ~]$ exit
exit
[k1234567@login3(rosalind) ~]$ squeue -u k1234567
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[k1234567@login3(rosalind) ~]$

In the above example, the srun command is used with three options: -p shared, --pty and /bin/bash. -p shared specifies the shared partition. The --pty option executes the task in pseudo-terminal mode, allowing the session to act like a standard terminal session. /bin/bash is the command to be run, in this case the default Linux shell, bash. Once srun runs, you acquire a terminal session where you can run standard Linux commands (echo, cd, ls, etc.) on the allocated compute node (nodek52 here). squeue -u <username> shows the details of the running interactive job and reports an empty list once we have exited bash.
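The srun options described later in this guide can be combined with --pty to request a larger interactive session. A minimal sketch with illustrative values, assuming shared is your partition:

  # Interactive shell with 4096 MB of memory and a 2 hour time limit
  srun -p shared --mem=4096 --time=0-2:00 --pty /bin/bash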

Submitting a batch job

Batch (or non-interactive) jobs allow users to leverage one of the main benefits of having a cluster scheduler; jobs can be queued up with instructions on how to run them and then executed across the cluster while the user does something else. Users submit jobs as scripts, which include instructions on how to run the job - the output of the job (stdout and stderr in Linux terminology) is written to a file on disk for review later on. You can write a batch job that does anything that can be typed on the command-line.

We'll start with a basic example - the following script is written in bash. You can create the script yourself using your editor of choice. The script does nothing more than print some messages to the screen (the echo lines), and sleeps for 15 seconds. We've saved the script to a file called helloworld.sh - the .sh extension helps to remind us that this is a shell script, but adding a filename extension isn't strictly necessary for Linux.

  #!/bin/bash -l
  #SBATCH --output=/scratch/users/%u/%j.out
  echo "Hello, World! From $HOSTNAME"
  sleep 15
  echo "Goodbye, World! From $HOSTNAME"

Note

We use the -l option to bash on the first line of the script to request a login session. This ensures that Environment Modules can be loaded from your script.

Note

SBATCH --output=/scratch/users/%u/%j.out is specified to direct the output of the job to the fast scratch storage (Lustre). We recommend using this configuration for all your jobs. If needed you can create sub-directories within your scratch storage area to keep things tidy between different projects.

If we execute that script directly on the login node using the command bash helloworld.sh, we get the following output:

[k1234567@login3(rosalind) ~]$ bash helloworld.sh
Hello, World! From login3.pri.rosalind2.alces.network
Goodbye, World! From login3.pri.rosalind2.alces.network

To submit your job script to the cluster job scheduler, use the command sbatch -p <partition> helloworld.sh, where <partition> is your partition name. The job scheduler should immediately report the job ID for your job; this is a unique identifier which can be used when viewing or controlling queued jobs.

[k1234567@login3(rosalind) ~]$ sbatch -p shared helloworld.sh
Submitted batch job 8256
[k1234567@login3(rosalind) ~]$ squeue -u k1234567
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              8256    shared hellowor k1234567  R       0:11      1 nodea01
[k1234567@login3(rosalind) ~]$ squeue -u k1234567
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[k1234567@login3(rosalind) ~]$ ls -l /scratch/users/k1234567/
total 5
-rw-r--r-- 1 k1234567 clusterusers 112 Oct 17 17:08 8244.out
-rw-r--r-- 1 k1234567 clusterusers 112 Oct 17 17:21 8256.out
[k1234567@login3(rosalind) ~]$ cat /scratch/users/k1234567/8256.out
Hello, World! From nodea01.pri.rosalind2.alces.network
Goodbye, World! From nodea01.pri.rosalind2.alces.network

Default resources

The job launched above didn't make any explicit requests for resources (e.g. CPU cores, memory) or specify a runtime, and so inherited the cluster defaults. If more resource is needed (this is HPC after all), the batch job should include instructions to request more resources. It is important to remember that if you exhaust these resource limits (e.g. runtime or memory), your job will get killed.
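As a preview of the options covered under Requesting more resources below, a sketch of a job script with explicit (purely illustrative) resource requests might look like this, where ./myprogram is a placeholder for your own application:

  #!/bin/bash -l
  #SBATCH --output=/scratch/users/%u/%j.out
  #SBATCH --nodes=1
  #SBATCH --ntasks=4
  #SBATCH --mem=4096
  #SBATCH --time=0-2:00
  # Placeholder for your own application
  ./myprogram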

Viewing and controlling queued jobs

Once your job has been submitted, use the squeue command to view the status of the job queue (adding -u <your_username> to see only your jobs). If there are available compute nodes with the resources you've requested, your job should be shown in the R (running) state; if not, your job may be shown in the PD (pending) state until resources are available to run it. If a job is in the PD state, the reason it cannot run will be displayed in the NODELIST(REASON) column of the squeue output.

[k1234567@login3(rosalind) ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              8306 morty,nms 2D_36_si k1802890 PD       0:00      5 (Resources)
              8307 morty,nms 2D_36_si k1802890 PD       0:00      5 (Priority)
              8291     morty  tvmc.sh k1623514  R    3:19:54      5 nodek[03-07]
              8312 nms_resea test.slu k1898460  R    2:00:38      1 nodek54

You can use the scancel <jobid> command to delete a job you've submitted, whether it's running or still in the queued state.

[k1234567@login3(rosalind) ~]$ squeue -u k1234567
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              8327    shared hellowor k1234567  R       0:00      1 nodea01
              8325    shared hellowor k1234567  R       0:03      1 nodea01
              8326    shared hellowor k1234567  R       0:03      1 nodea01
[k1234567@login3(rosalind) ~]$ scancel 8325
[k1234567@login3(rosalind) ~]$ squeue -u k1234567
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              8325    shared hellowor k1234567 CG       0:11      1 nodea01
              8327    shared hellowor k1234567  R       0:09      1 nodea01
              8326    shared hellowor k1234567  R       0:12      1 nodea01

Scheduler instructions

Job instructions can be provided in two ways; they are:

  1. On the command line, as parameters to your sbatch or srun command. For example, you can set the name of your job using the --job-name=[name] | -J [name] option:
[k1234567@login3(rosalind) ~]$ sbatch -p shared --job-name hello helloworld.sh
Submitted batch job 8333
[k1234567@login3(rosalind) ~]$ squeue -u k1234567
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              8333    shared    hello k1234567  R       0:09      1 nodea01
  2. In your job script, by including scheduler directives at the top of your job script - you can achieve the same effect as providing options with the sbatch or srun commands. To add the --job-name to our previous example:
#!/bin/bash -l
#SBATCH --output=/scratch/users/%u/%j.out
#SBATCH --job-name=hello
echo "Hello, World! From $HOSTNAME"
sleep 15
echo "Goodbye, World! From $HOSTNAME"

Including job scheduler instructions in your job-scripts is often the most convenient method of working for batch jobs - follow the guidelines below for the best experience:

  • Lines in your script that include job-scheduler directives must start with #SBATCH at the beginning of the line
  • You can have multiple lines starting with #SBATCH in your job-script
  • You can put multiple instructions separated by a space on a single line starting with #SBATCH
  • The scheduler will parse the script from top to bottom and set instructions in order; if you set the same parameter twice, the second value will be used.
  • Instructions are parsed at job submission time, before the job itself has actually run. This means you can't, for example, tell the scheduler to put your job output in a directory that you create in the job-script itself - the directory will not exist when the job starts running, and your job will fail with an error.
  • You can use dynamic variables in your instructions (see below)
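Putting these guidelines together, a minimal sketch of a job script that places two instructions on one #SBATCH line and uses dynamic variables (explained in the next section) might look like this:

  #!/bin/bash -l
  #SBATCH --job-name=hello --output=/scratch/users/%u/%j.out
  #SBATCH --time=0-1:00
  # %u and %j are resolved by the scheduler at submission time; $USER and
  # $SLURM_JOBID are available once the job is running
  echo "Hello from job $SLURM_JOBID submitted by $USER"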

Dynamic scheduler variables

When writing submission scripts you will often need to reference values set by the scheduler (e.g. jobid), values inherited from the OS (e.g. username), or values you set elsewhere in your script (e.g. ntasks in the SBATCH directives). For this purpose a number of dynamic variables are made available. In the list below variables starting % can be referenced in the SBATCH directives and those starting $ from the body of the shell script.

  • %u / $USER The Linux username of the submitting user
  • %a / $SLURM_ARRAY_TASK_ID Job array ID (index) number
  • %A / $SLURM_ARRAY_JOB_ID Job allocation number for an array job
  • %j / $SLURM_JOBID Job allocation number
  • $SLURM_NTASKS Number of CPU cores requested with -n, --ntasks, this can be provided to your code to make use of the allocated CPU cores

Simple scheduler instruction examples

Here are some commonly used scheduler instructions, along with some examples of their usage:

Setting output file location

To set the output file location for your job, use the -o [file_name] | --output=[file_name] option - both standard output and standard error from your job script, including any output generated by applications launched by the script, will be saved in the file you specify.

By default, the scheduler stores data relative to your home directory - but to avoid confusion, we recommend specifying a full path to the filename to be used. Although Linux can support several jobs writing to the same output file, the result is likely to be garbled - it's common practice to include something unique about the job (e.g. its job ID) in the output filename to make sure your job's output is clear and easy to read.

Note

The directory used to store your job output file must exist and be writable by your user before you submit your job to the scheduler. Your job may fail to run if the scheduler cannot create the output file in the directory requested.

The following example uses the --output=[file_name] instruction to set the output file location:

  #!/bin/bash -l
  #SBATCH --output=/scratch/users/%u/%j.out
  echo "Hello, World! From $HOSTNAME"
  sleep 15
  echo "Goodbye, World! From $HOSTNAME"

Note

SBATCH --output=/scratch/users/%u/%j.out is specified to direct the output of the job to the fast scratch storage (Lustre). We recommend using this configuration for all your jobs. If needed you can create sub-directories within your scratch storage area to keep things tidy between different projects.

Setting working directory for your job

By default, jobs are executed from your home-directory on the cluster (i.e. /home/<your-user-name>, $HOME or ~). You can include cd commands in your job-script to change to different directories; alternatively, you can provide an instruction to the scheduler to change to a different directory to run your job. The available options are:

  • -D | --workdir=[dir_name] - instruct the job scheduler to move into the directory specified before starting to run the job on a compute node

Note

The directory specified must exist and be accessible by the compute node in order for the job you submitted to run.
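For example, a sketch using the --workdir directive with an illustrative project directory (which must already exist):

  #!/bin/bash -l
  #SBATCH --output=/scratch/users/%u/%j.out
  #SBATCH --workdir=/scratch/users/k1234567/myproject
  # The job runs from the directory above rather than your home directory
  pwd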

Waiting for a previous job before running

You can instruct the scheduler to wait for an existing job to finish before starting to run the job you are submitting with the -d [state:job_id] | --depend=[state:job_id] option.
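For example, reusing the helloworld.sh script from earlier (and shared as the example partition), the sbatch --parsable option prints just the job ID of the first job, which can then be passed to --depend so the second job starts only after the first completes successfully (afterok):

  # Submit the first job and capture its job ID
  jobid=$(sbatch --parsable -p shared helloworld.sh)
  # Submit a second job that waits for the first to complete successfully
  sbatch -p shared --depend=afterok:${jobid} helloworld.sh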

Running task array jobs

A common workload is having a large number of jobs to run which basically do the same thing, aside perhaps from having different input data. You could generate a job-script for each of them and submit it, but that's not very convenient - especially if you have many hundreds or thousands of tasks to complete. Such jobs are known as task arrays - an embarrassingly parallel job will often fit into this category.

A convenient way to run such jobs on a cluster is to use a task array, using the -a [array_spec] | --array=[array_spec] directive. Your job-script can then use the pseudo environment variables created by the scheduler to refer to data used by each task in the job. The following job-script uses the $SLURM_ARRAY_TASK_ID/%a variable to echo its current task ID to an output file:

  #!/bin/bash -l
  #SBATCH --job-name=array
  #SBATCH --output=output.array.%A.%a
  #SBATCH --array=1-1000
  echo "I am $SLURM_ARRAY_TASK_ID from job $SLURM_ARRAY_JOB_ID"

All tasks in an array job are given a job ID with the format [job_ID]_[task_number] e.g. 77_81 would be job number 77, array task 81.

Array jobs can easily be cancelled using the scancel command - the following examples show various levels of control over an array job:

scancel 77 Cancels all array tasks under the job ID 77

scancel 77_[100-200] Cancels array tasks 100-200 under the job ID 77

scancel 77_5 Cancels array task 5 under the job ID 77

Requesting more resources

By default, jobs are constrained to the cluster defaults (see table below) - users can use scheduler instructions to request more resources for their jobs as needed. The following documentation shows how these requests can be made.

CPU cores   Memory   Runtime
1 core      1GB      24 hours

In order to promote the best use of the cluster scheduler, particularly in a shared environment, it is recommended that you inform the scheduler of the amount of time, memory and CPU cores your job is expected to need. This helps the scheduler appropriately place jobs on the available nodes in the cluster and should minimise any time spent queuing for resources to become available (to this end, you should always request the minimum amount of resources you need to run).

Requesting a longer runtime

Note

7 days is the maximum runtime that can be set on the Rosalind HPC cluster.

You can inform the cluster scheduler of the expected runtime using the -t, --time=<time> option. For example - to submit a job that runs for 2 hours, the following example job script could be used:

  #!/bin/bash -l
  #SBATCH --job-name=sleep
  #SBATCH --time=0-2:00
  sleep 7200

You can then see any time limits assigned to running jobs using the command squeue --long:

[k1234567@login3(rosalind) ~]$ squeue --long -u k1234567
Fri Oct 18 16:21:39 2019
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
             11439    shared sleep.sh k1234567  RUNNING       0:29   2:00:00      1 nodea01

Requesting more memory

You can specify the maximum amount of memory required per submitted job with the --mem=<MB> option. This informs the scheduler of the memory required for the submitted job. Optionally - you can also request an amount of memory per CPU core rather than a total amount of memory required per job. To specify an amount of memory to allocate per core, use the --mem-per-cpu=<MB> option.
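For example, with purely illustrative values and reusing the helloworld.sh script from earlier:

  # Request 8192 MB of memory for the job
  sbatch -p shared --mem=8192 helloworld.sh

  # Or request 2048 MB per allocated CPU core instead
  sbatch -p shared --mem-per-cpu=2048 helloworld.sh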

Note

When running a job across multiple compute hosts, the --mem=<MB> option informs the scheduler of the required memory per node

Running multi-threaded jobs

If you want to use multiple cores on a compute node to run a multi-threaded application, you need to inform the scheduler. Using multiple CPU cores is achieved by specifying the -n, --ntasks=<number> option in either your submission command or the scheduler directives in your job script. The --ntasks option informs the scheduler of the number of cores you wish to reserve for use; if the parameter is omitted, the default --ntasks=1 is assumed. You could specify the option -n 4 to request 4 CPU cores for your job. Besides the number of tasks, you will need to add --nodes=1 to your scheduler command or at the top of your job script with #SBATCH --nodes=1; this sets the maximum number of nodes to be used to 1 and prevents the job selecting cores from multiple nodes (multi-node jobs require MPI).

Note

If you request more cores than are available on a node in your cluster, the job will not run until a node capable of fulfilling your request becomes available. The scheduler will display the error in the output of the squeue command.

Note

Just asking for more cores won't necessarily mean your code makes use of them. You generally need to inform your application of how many cores to use. This can be done using the $SLURM_NTASKS variable in your submission script.
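A minimal sketch of a multi-threaded job script following this pattern is shown below; myprogram and its --threads option are hypothetical stand-ins for your own application and however it accepts a thread count:

  #!/bin/bash -l
  #SBATCH --output=/scratch/users/%u/%j.out
  #SBATCH --job-name=threads
  #SBATCH --nodes=1
  #SBATCH --ntasks=4
  # Pass the number of allocated cores to the (hypothetical) application
  ./myprogram --threads=$SLURM_NTASKS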

Running parallel (MPI) jobs

Note

It is important to note that applications will not necessarily support being run across multiple nodes. They must explicitly support MPI for this purpose, as demonstrated by the mpirun example below.

If you want to run parallel jobs via a message passing interface (MPI), you need to inform the scheduler - this allows jobs to be efficiently spread over compute nodes to get the best possible performance. Using multiple CPU cores across multiple nodes is achieved by specifying the -N, --nodes=<minnodes[-maxnodes]> option - which requests a minimum (and optional maximum) number of nodes to allocate to the submitted job. If only the minnodes count is specified, it is used for both the minimum and maximum node count for the job.

You can request multiple cores over multiple nodes using a combination of scheduler directives, either in your job submission command or within your job script. The following examples demonstrate how you can obtain cores across different resources:

  • --nodes=2 --ntasks=16 Requests 16 cores across 2 compute nodes
  • --nodes=2 Requests all available cores of 2 compute nodes
  • --ntasks=16 Requests 16 cores across any available compute nodes

For example, to use 30 CPU cores on the cluster for a single application, the instruction --ntasks=30 can be used. The following example uses the mpirun command to test MPI functionality across 30 CPU cores; in this example the job is scheduled over two compute nodes. The job script loads the libs/openmpi module, which makes the mpirun command available.

#!/bin/bash -l
#SBATCH -n 30
#SBATCH --job-name=openmpi
#SBATCH --output=openmpi.out.%j
module load libs/openmpi
mpirun -n 30 mpitest
[k1234567@login3(rosalind) ~]$ sbatch -p shared openmpi.sh
Submitted batch job 11449
[k1234567@login3(rosalind) ~]$ squeue -u k1234567
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             11449    shared  openmpi k1234567  R       0:02      2 nodea[13-14]
[k1234567@login3(rosalind) ~]$ cat openmpi.out.11449
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 4 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 6 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 7 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 16 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 0 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 1 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 2 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 3 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 5 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 11 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 12 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 14 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 8 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 9 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 10 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 15 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 17 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 19 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 18 out of 30 processors
Hello world from processor nodea13.pri.rosalind2.alces.network, rank 13 out of 30 processors
Hello world from processor nodea14.pri.rosalind2.alces.network, rank 20 out of 30 processors
Hello world from processor nodea14.pri.rosalind2.alces.network, rank 22 out of 30 processors
Hello world from processor nodea14.pri.rosalind2.alces.network, rank 25 out of 30 processors
Hello world from processor nodea14.pri.rosalind2.alces.network, rank 26 out of 30 processors
Hello world from processor nodea14.pri.rosalind2.alces.network, rank 29 out of 30 processors
Hello world from processor nodea14.pri.rosalind2.alces.network, rank 27 out of 30 processors
Hello world from processor nodea14.pri.rosalind2.alces.network, rank 28 out of 30 processors
Hello world from processor nodea14.pri.rosalind2.alces.network, rank 23 out of 30 processors
Hello world from processor nodea14.pri.rosalind2.alces.network, rank 24 out of 30 processors
Hello world from processor nodea14.pri.rosalind2.alces.network, rank 21 out of 30 processors

Running GPU jobs

Lots of scientific software is starting to make use of Graphics Processing Units (GPUs) for computation instead of traditional CPU cores, because GPUs out-perform CPUs for certain mathematical operations. If you wish to schedule your job on a GPU you need to provide the --gres=gpu option in your submission script. The following example schedules a job on a GPU node and then lists the GPU card it was assigned.

#!/bin/bash -l
#SBATCH --output=/mnt/lustre/users/%u/%j.out
#SBATCH --job-name=gpu
#SBATCH --gres=gpu
echo "Hello, World! From $HOSTNAME"
nvidia-debugdump -l
sleep 15
echo "Goodbye, World! From $HOSTNAME"
[k1234567@login3(rosalind) ~]$ sbatch -p nms_research_gpu hellogpu.sh
Submitted batch job 12087
[k1234567@login3(rosalind) ~]$ cat /mnt/lustre/users/k1234567/12087.out
Hello, World! From nodek53.pri.rosalind2.alces.network
Found 1 NVIDIA devices
    Device ID:              0
    Device name:            Tesla V100-PCIE-32GB
    GPU internal ID:        0322118079778

Goodbye, World! From nodek53.pri.rosalind2.alces.network

Note

Due to limited numbers it is only possible to reserve a single GPU per job.

Note

Your GPU-enabled application will most likely make use of the NVIDIA CUDA libraries; to load the CUDA module, use module load libs/cuda in your job submission script.
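Building on the GPU example above, a sketch of a job script that loads the CUDA module before running a hypothetical CUDA-enabled binary (my_cuda_app is a placeholder for your own program):

  #!/bin/bash -l
  #SBATCH --output=/mnt/lustre/users/%u/%j.out
  #SBATCH --job-name=cuda
  #SBATCH --gres=gpu
  # Make the NVIDIA CUDA libraries available
  module load libs/cuda
  # Placeholder for your own GPU-enabled application
  ./my_cuda_app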

Asking for specific hardware

The Rosalind HPC cluster is made up of a number of different compute node specifications. While sometimes you may just want your job to run as soon as possible, other times you may wish to target specific hardware - perhaps because a newer CPU/GPU architecture has features your code wishes to exploit, or because you are running code which was compiled specifically for an older generation of CPU. This can be achieved with the sbatch -C / --constraint options.

Request V100 GPU
#!/bin/bash -l
#SBATCH --output=/mnt/lustre/users/%u/%j.out
#SBATCH --job-name=gpu
#SBATCH --gres=gpu
#SBATCH --constraint=v100
Request Skylake CPU
#!/bin/bash -l
#SBATCH --output=/mnt/lustre/users/%u/%j.out
#SBATCH --job-name=skylake
#SBATCH --constraint=skylake

Note

You will need to confirm that the hardware features you are targeting are available in the partitions you have access to and are directing your jobs at.
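One way to check which features the nodes in a partition advertise is to ask sinfo for the node feature list; %P, %N and %f are standard sinfo format options for the partition name, node list and available features, and nms_research_gpu is used purely as an example partition:

  sinfo -p nms_research_gpu -o "%P %N %f"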

Running software with a GUI

A Graphical User Interface (GUI) may be required when working with some scientific software. The following steps are required to get a session with GUI support on a compute node. These steps request the default resources and can be modified using the instructions above to ask for more CPU, GPUs, memory or runtime.

  1. Connect with ssh to a login node.
  2. If you have previously started a GUI session use alces session list to get the connection details.
  3. For a new GUI session run alces session start gnome.
  4. Use the Host, Port and Password to connect with a VNC client from your OS of choice.
  5. Open a terminal window in the desktop session and run the command start-interactive-session -p <partition>, choosing the appropriate value for <partition>.
  6. You should now be able to run applications with a GUI, e.g. xeyes.

Further documentation

This guide is a quick overview of some of the many available options of the SLURM cluster scheduler. For more information on the available options, you may wish to reference the following documentation for the demonstrated SLURM commands:

  • Use the man squeue command to see a full list of scheduler queue instructions
  • Use the man sbatch and man srun commands to see a full list of scheduler submission instructions
  • Online documentation for the SLURM scheduler is available at https://slurm.schedmd.com/