Info for first-time cluster users
So, this is your first time using a cluster? Don’t worry, it’s not as daunting as it sounds. We will attempt to answer all of the basic questions here. You can also refer to the Glossary of terms.
What is a “cluster”?
DeepSense owns and operates its own High-Performance Computing (HPC) cluster. This is merely a fancy term for a group of computers interconnected by high-speed switches, which lets users run jobs that need more processing power, more memory, more storage and a more reliable filesystem than a single computer can offer. This is particularly useful for AI/ML models, which may take weeks to train. Instead of bogging down their own computers, our users can submit jobs to our platform, which can easily handle the workload. Each computer in the cluster is typically referred to as a compute node, and the compute nodes are the workhorses for users' jobs. DeepSense has 34 compute nodes with different hardware configurations (see Resources). The cluster runs 24/7 unless there is an outage.
What is a “job”?
When we refer to a job, we simply mean any sort of computer program you run. These are usually command line programs, such as a Bash script, Python code or a compiled executable. Our users can also use Jupyter notebooks that are open in their browser, while our machines handle the computation. In order to run most jobs, you must submit them to a job scheduler. We will look at this in more detail shortly.
How is an HPC environment different from a single computer?
When you write a computer program on your own computer, it, along with any data it needs, is housed on your computer’s hard drive. When you execute the code, it typically runs on one core, unless you’ve specifically written it to run in parallel. Even so, your computer probably only has a handful of cores. If you run more than one program at the same time, they will typically run on separate cores. It is possible to have all of the cores of your computer running separate programs. If you then try to run another, it will have to wait until a core is available.
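To make the idea of cores concrete, here is a minimal Python sketch that reports how many cores your own machine has and runs a small task on several of them at once. The function names (sum_of_squares, run_in_parallel) and the input sizes are made up for illustration. When there are more tasks than worker processes, the extra tasks simply wait their turn, much like the programs described above.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(n):
    """A small CPU-bound task: add up the squares of 0..n-1."""
    return sum(i * i for i in range(n))


def run_in_parallel(inputs, workers=None):
    """Run sum_of_squares on each input, using up to `workers` processes.

    With workers=None, Python picks a default based on the machine's cores.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sum_of_squares, inputs))


if __name__ == "__main__":
    print("Logical cores on this machine:", os.cpu_count())
    # Three tasks but only two workers: the third waits for a free worker.
    print(run_in_parallel([10, 100, 1000], workers=2))
```

On a laptop this will usually report a handful of cores, which is the limit that a cluster lets you move past.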
However, when you submit a job on a cluster, the scheduler finds an appropriate node on which to run it. Depending on how many jobs other users are currently running, some nodes may not have enough free cores or memory to take on more work. Once the scheduler finds a node that does have the required resources available, it runs the job on that node, without requiring you to connect to that node directly. Each node in our cluster has 20 cores and at least 500 GB of memory.
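The matching step can be pictured with a toy model. The sketch below is not how LSF actually chooses a node. It is a simple "first node that fits" rule, and the node names and free amounts are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cores: int
    free_mem_gb: int


def find_node(nodes, cores_needed, mem_gb_needed):
    """Return the first node with enough free cores and memory, or None."""
    for node in nodes:
        if node.free_cores >= cores_needed and node.free_mem_gb >= mem_gb_needed:
            return node
    return None


def place_job(nodes, cores_needed, mem_gb_needed):
    """Reserve resources on a suitable node and return its name.

    Returns None when no node currently fits, meaning the job must wait.
    """
    node = find_node(nodes, cores_needed, mem_gb_needed)
    if node is None:
        return None
    node.free_cores -= cores_needed
    node.free_mem_gb -= mem_gb_needed
    return node.name


# Hypothetical snapshot of a two-node cluster; node01 is almost full.
nodes = [
    Node("node01", free_cores=2, free_mem_gb=100),
    Node("node02", free_cores=20, free_mem_gb=500),
]

print(place_job(nodes, cores_needed=8, mem_gb_needed=64))   # fits on node02
print(place_job(nodes, cores_needed=16, mem_gb_needed=64))  # nothing fits now
```

The second job finds no node with 16 free cores, so in a real cluster it would stay pending until running jobs finish and free up resources.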
Another big difference is the filesystem. Typically, your computer will only have one hard drive (possibly a few). In order to achieve large enough storage space, our cluster has many more hard drives. In order to manage this, we use a parallel filesystem (GPFS in our case). Without getting into the technical details, this is simply a way to store files that can be accessed from any node in the cluster. Thus, no matter which node your job runs on, any input or output files can be accessed with the same path.
How do you connect to the DeepSense infrastructure?
The easiest way to connect is through “ssh”. As stated earlier, our users are typically running command line programs. On their own computer, they would open a command line interface (Terminal on macOS, a shell on *nix, or Windows PowerShell, for example) and execute their programs. SSH (secure shell) is a way to connect to a remote server through a command line interface. In this way, users can run commands on our machines. For various other tasks, there are other ways of connecting, but they can be more complicated.
When you do connect via ssh, you will connect to one of two login nodes (login1.deepsense.ca and login2.deepsense.ca). These are also called head nodes, and are generally the only ones you will connect to. You can run commands on these, just as you would from the terminal on your computer. However, since all users will be connected to one of these two nodes, we don’t want them overused. You may test small bits of code on these nodes, but longer-running jobs should be submitted to the scheduler.
We have two job schedulers, but the most commonly used one is called LSF (Load Sharing Facility). When you submit a job to the scheduler, it places that job in one of a few different queues. For now, we only use two queues: Normal and GPU. Normal is the default, and all jobs requiring a GPU need to be sent to the GPU queue. If there are nodes available, the job will run immediately on a node that has the required resources. If, for example, all of the GPU nodes are in use and you submit a job to the GPU queue, your job will have to wait until a GPU node is available. Multiple jobs can be waiting in a queue for available resources. Once a GPU node is available, the scheduler will run the next job in the queue.
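As a rough illustration of what a submission looks like, the Python sketch below assembles an LSF bsub command line without running it. The flags shown (-q for the queue, -n for the number of job slots, -o for the output file) are standard bsub options. The script name ./train.sh, the output file name, and the lowercase queue names "normal" and "gpu" are assumptions made for this example, so check the LSF page for the names and any extra options used on our cluster.

```python
import shlex


def build_bsub_command(script, queue="normal", cores=1, output="job.%J.out"):
    """Build the argument list for an LSF `bsub` submission.

    Nothing is executed here; the function only assembles the command.
    """
    return [
        "bsub",
        "-q", queue,        # queue to submit to (assumed queue name)
        "-n", str(cores),   # number of job slots (cores) requested
        "-o", output,       # file for the job's output; %J expands to the job ID
        script,             # the program to run (hypothetical script)
    ]


cmd = build_bsub_command("./train.sh", queue="gpu", cores=4)
print(shlex.join(cmd))
```

You would type the printed command on a login node, and the scheduler would then hold the job in the chosen queue until a suitable node is free.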
In the future, if our resources are running at near capacity, we may implement other queues. Queues can also have different priority levels. Suppose, for example, we had two queues called “normal” and “priority”, with the latter having higher priority. If all the nodes are in use, and you submit job A to the normal queue and then job B to the priority queue, then when a node becomes free the scheduler will run job B first, because it comes from the priority queue, even though job A was submitted earlier.
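The job A and job B example can be worked through with a small toy scheduler. This is a sketch of the idea only, not LSF's real behaviour: the class name ToyScheduler and the numeric priority levels are invented, and jobs in the same queue are assumed to run in submission order.

```python
import heapq
import itertools

# Hypothetical priority levels: a larger number means higher priority.
QUEUE_PRIORITY = {"normal": 1, "priority": 2}


class ToyScheduler:
    def __init__(self):
        self._waiting = []
        self._order = itertools.count()  # submission order breaks ties

    def submit(self, job_name, queue):
        """Add a job to the waiting list of the given queue."""
        # heapq pops the smallest item first, so the priority is negated.
        entry = (-QUEUE_PRIORITY[queue], next(self._order), job_name)
        heapq.heappush(self._waiting, entry)

    def next_job(self):
        """Return the job to start when a node becomes free, or None."""
        if not self._waiting:
            return None
        return heapq.heappop(self._waiting)[2]


scheduler = ToyScheduler()
scheduler.submit("job A", "normal")    # submitted first
scheduler.submit("job B", "priority")  # submitted second, higher priority
print(scheduler.next_job())  # job B
print(scheduler.next_job())  # job A
```

Job B starts first despite being submitted later, and job A follows as soon as another node is free.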
The technical details of how to submit jobs to LSF can be found on the LSF page.