Bare metal HPC

If you're new to HPC job scheduling, you may find our job scheduler overview and reference page a useful place to start.



Overview

Users can buy access to compute nodes by purchasing a number of core hours*. You will need to specify the length of time over which you want to run the project, e.g. 6 months or 2 years, and during this time you will be expected to consume your allotted core hours. Priorities on the cluster have been set up so that users consuming a given number of core hours over a short period are given a higher priority than users consuming the same number of core hours over a longer period; this is done to balance usage across the system.

The system has been set up so that we cannot sell more CPU core hours over a given period than are available, which should prevent it from becoming overloaded. In an ideal world everyone would be able to get their jobs onto the system right away, but on a shared system there will always be times when demand is high and users will have to wait for their jobs to be executed. We will be sympathetic to users who have urgent deadlines and will increase a user's priority in exceptional circumstances.

* Core hours are a measure of the occupancy of the system's CPU cores over a period of time. For example, a job that keeps 20 cores busy for 10 hours consumes 200 core hours.

In addition to paid access, there will also be a free tier which will allow users restricted access to the system. The purpose of this is to let users become familiar with the system and carry out some small-scale testing before setting up paid access.


CPU

There are different tariffs for CPU usage based on the underlying costs of the different types of hardware. Users control which hardware their jobs run on by submitting them to a particular "queue" (or queues). The table below shows the relationship between queues and hardware charging categories at the time of writing; a minimal job-submission sketch follows the table.

Charging category | Queue               | Hardware description
HPCHighMem        | HighMemLongterm.q   | 2 x 10 core Intel Haswell E5-2660 v3 2.60GHz with 384GB RAM
HPCHighMem        | HighMemShortterm.q  | 2 x 10 core Intel Haswell E5-2660 v3 2.60GHz with 384GB RAM
HPCLowMem         | LowMemLongterm.q    | 2 x 10 core Intel Haswell E5-2660 v3 2.60GHz with 192GB RAM
HPCLowMem         | LowMemShortterm.q   | 2 x 10 core Intel Haswell E5-2660 v3 2.60GHz with 192GB RAM
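
As an illustration only: the ".q" queue names suggest a Grid Engine-style scheduler, so a job script might look roughly like the sketch below. The queue name is taken from the table above; the job name, runtime request and the command run are placeholders, and the exact resource options available will depend on how the cluster is actually configured.

    #!/bin/bash
    # Minimal Grid Engine-style job script (illustrative sketch only).
    #$ -N my_test_job             # job name (placeholder)
    #$ -q HighMemShortterm.q      # target queue from the table above
    #$ -cwd                       # run from the submission directory
    #$ -l h_rt=01:00:00           # requested wall-clock time (assumed limit syntax)

    # The actual work goes here; this is just a placeholder command.
    echo "Running on $(hostname)"

Such a script would then be submitted with something like "qsub myjob.sh", and "qstat" used to check its state, assuming standard Grid Engine client tools are available on the login nodes.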


Storage

There is more than one type of storage and more than one way to store your data. All users will have a login home directory that can be used for software and scripts; this has a small capacity and shouldn't be used to store data. Additionally, all users will need an appropriate amount of scratch space on our Lustre storage. The Lustre storage is where working data should be kept, i.e. whenever a job needs to access data, that data should be on the Lustre file system.
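
As a hedged sketch of how you might keep an eye on your scratch usage: on Lustre file systems the standard lfs client can usually report usage against a quota, assuming quotas are enabled and the file system is mounted at the path shown (the path /lustre/scratch is a placeholder, not the real mount point).

    # Report your own usage and quota on the scratch file system (placeholder path).
    lfs quota -u $USER /lustre/scratch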

Additionally, there is a secondary storage system, our Ceph system, with higher capacity, which is intended for data that isn't being accessed by the compute nodes. This is much less expensive and is intended for longer-term storage.

Data transfer between the scratch storage and the secondary storage should be relatively quick, since this is likely to involve copying entire files at a time rather than many small, random reads and writes. There are a number of ways of interacting with the Ceph storage, and we intend to configure an iRODS service to give users easy access to, and control over, where their data resides.
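
As a rough sketch of what such a transfer might look like once access is configured: the paths below are placeholders, and the iRODS commands assume the standard icommands client ends up being the interface provided, which is not yet decided.

    # Copy a results directory from Lustre scratch to the secondary storage
    # via a conventional file copy (placeholder paths).
    rsync -av /lustre/scratch/$USER/results/ /ceph/archive/$USER/results/

    # Or, if the planned iRODS service is used, upload recursively with the
    # standard icommands client (the collection path is a placeholder).
    iput -r /lustre/scratch/$USER/results /myZone/home/$USER/results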