High performance computing

High performance computing (HPC) is a powerful tool in today’s research, enabling large-scale computations of complex systems, big data analysis, and data visualization. To achieve this, HPC systems offer thousands of processor cores, multiple GPUs, hundreds of gigabytes of memory, and terabytes of storage.

Resources are available throughout Canada in the form of large compute clusters provided by the consortia under the umbrella of Compute Canada, as well as through your own equipment hosted in the university data centres.

Available clusters

Compute Canada is the organization overseeing all of the compute clusters available in Canada. To begin using a cluster, you must first register and open an account.

[Image: the Compute Canada login screen]

There are many clusters available in Canada that are managed by six consortia: ACENET, CAC, Calcul Québec, SHARCNET, SciNet, and WestGrid. The tables below list the available systems for CPU- and GPU-based clusters.

 

CPU clusters

Consortium | Cluster | Cores | Memory per node | Storage | Interconnect
ACENET | Placentia | 3680 | 8–32 GB | 75 GB | Infiniband 4xDDR
ACENET | Glooscap | 1968 | 4–128 GB | 61 GB | Gigabit Ethernet
CAC | Frontenac | 3600 | – | 3 TB | –
Calcul Québec | Briarée | 8064 | 24–96 GB | 7.3 TB | Infiniband QDR
Calcul Québec | Colosse | 7680 | 24–48 GB | 500 TB | Infiniband QDR
Calcul Québec | Cottos | 1024 | 16 GB | 151 GB | Infiniband QDR
Calcul Québec | Guillimin | 20512 | 24–1024 GB | – | Infiniband QDR
Calcul Québec | Mp2 | 39168 | 32–512 GB | – | Infiniband QDR
Calcul Québec | Ms2 | 2464 | 16–32 GB | – | Infiniband DDR
Calcul Québec | Psi | 1008 | 72 GB | – | Ethernet
SHARCNET | Graham | 33448 | 128–3072 GB | 20 TB | Infiniband EDR
SHARCNET | Orca | 8880 | 24–128 GB | – | Infiniband QDR
SHARCNET | Windeee | 144 | 32 GB | – | Ethernet
SciNet | Niagara | 60000 | 202 GB | 6 PB | Infiniband EDR
SciNet | BlueGene/Q | 65536 | 16 GB | – | 5D Torus
WestGrid | Cedar | 58416 | 125–3022 GB | 20 TB | Omnipath

 
GPU/coprocessor clusters

Consortium | Cluster | Nodes | Memory per node | Interconnect | Storage
Calcul Québec | Guillimin | 58 / 50 | 64 GB | – | 343 GB
Calcul Québec | Hàdes | 9 | 24 GB | – | 412 GB
Calcul Québec | Helios K20 | 15 | 128 GB | – | 2 TB
Calcul Québec | Helios K80 | 6 | 256 GB | – | 330 GB
SHARCNET | Graham | 160 | 128 GB | Infiniband FDR | 20 TB
SHARCNET | Monk | 54 | 48 GB | Infiniband QDR | 20 TB
SHARCNET | Copper | 8 | 96 GB | Infiniband FDR | –
SciNet | SOSCIP GPU Cluster | 14 | 512 GB | Infiniband EDR | –
WestGrid | Cedar | 146 | 125–250 GB | Omnipath | 20 TB

In addition to these systems, there are also smaller contributed systems.

Job submission

All compute clusters use a queuing system, which allows each program to run at full speed without running out of resources.

To run a program, you must submit it to the queue, and the cluster scheduler will determine where and when your job will run. The wait time in the queue depends on the current usage of the cluster as well as on the requested number of CPU cores, amount of memory, and run time. Jobs that require many resources wait longer because the scheduler has to wait until there is enough space free for your job. It is therefore very important to know how many resources your program requires in order to minimize waiting times.

Minimize queue wait time

The clusters use different schedulers, but they all operate on the same principle: the larger the computational requirements, the longer the wait time in the queue.

You can determine the amount of resources your program needs in several ways.

If you have access to the source code, you can estimate the amount of memory a particular run would require. Add a routine that goes through the initial setup and tallies the memory that would be allocated, without actually allocating it. While this requires some work initially, it will allow you to quickly calculate the required amount of memory and will save you time in the end. Some commercial packages may also give you estimates based on the input configuration.
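As an illustration, here is a minimal Python sketch of such a dry run; the function name estimate_memory, the choice of arrays, and the assumed neighbour count are invented for this example and would follow your own data structures in practice.

import numpy as np

def estimate_memory(n_particles):
    """Walk through the setup of a hypothetical particle simulation and
    tally the memory its main arrays would occupy, without allocating them."""
    total = 0
    # positions, velocities and forces: three float64 components per particle
    total += 3 * (3 * n_particles * np.dtype(np.float64).itemsize)
    # neighbour list: assume roughly 100 int32 neighbours per particle
    total += 100 * n_particles * np.dtype(np.int32).itemsize
    return total

for n in (10, 100, 500):
    print(f"N = {n:4d}: about {estimate_memory(n) / 1024**2:.2f} MiB")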

If this is not an option, you can run trials of your program on your own computer with small input parameters.

For example, if you are running a simulation that requires N = 500 particles, complete runs for N = 10, 20, 30, and 40, and check how the memory usage increases with each increment. In Linux, you can check the memory usage with pmap.

Find the PID of your program:

$ ps -u $USER
  PID TTY          TIME CMD
    4 tty1     00:00:00 bash
  141 tty1     00:00:00 ps
  166 tty1     00:01:22 simulations

Then check the memory usage with pmap:

$ pmap 166
166:   simulations
00007ff3dc000000    132K rw---   [ anon ]
00007ff3dc021000  65404K -----   [ anon ]
...
00007ff456815000      4K rw--- simulations
00007ffff0590000  16388K rw---   [ anon ]
00007ffff699e000   8192K rw---   [ anon ]
00007ffff784b000      4K r-x--   [ anon ]
 total          1982516K

The total on the last line shows that the program used about 2 GB for this run. Memory usage tends to scale either polynomially or logarithmically with the problem size. By plotting the results and fitting a polynomial or a logarithm, you can estimate the memory usage for N = 500.
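As a sketch of that fit, assuming hypothetical measurements from the trial runs above, a few lines of Python with NumPy are enough:

import numpy as np

# Memory measured with pmap for the trial runs (hypothetical values, in MB)
n_values = np.array([10, 20, 30, 40])
mem_mb = np.array([55.0, 110.0, 170.0, 235.0])

# Fit a quadratic, a common scaling for pairwise interactions
coeffs = np.polyfit(n_values, mem_mb, deg=2)

# Extrapolate to the production problem size
n_target = 500
print(f"Estimated memory for N = {n_target}: "
      f"{np.polyval(coeffs, n_target) / 1024:.1f} GB")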

After submitting a job to a cluster, the automatically generated log file will show the memory usage of your program, so you can adjust your next submissions accordingly. Our recommendation is to request about 10% more than that, so your job will still complete if the memory usage fluctuates a bit.

Estimating the run time is more difficult, since it depends on the CPU speed and on how well the program scales when running on multiple CPU cores. You will need to make an educated guess and, when submitting the job, allow for a generous margin of error. Use the cluster reports to determine the real run time.

Checkpointing

If your program requires a long run time, it will also sit in the queue for a long time before it starts. Our recommendation is that such programs implement checkpointing.

Checkpointing is a technique that makes a snapshot of the entire program memory and saves it to disk periodically.

For example, a one-week run can be split into seven one-day runs. This reduces the wait time and makes your program more resilient to crashes or outages, as it will continue from the last checkpoint.

This is a working example of a simple molecular dynamics simulation with checkpointing.

The simulation creates a new checkpoint every 40 steps by saving the iteration number and the positions of all the particles at that iteration to a binary file and then quits.

[Image: program output showing a new checkpoint written every 40 steps, after which the program quits]

Running the program again will continue where it left off, thanks to the data written to the checkpoint.

[Image: program output showing the run resuming from where it left off, using the checkpoint data]
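For readers without access to the screenshots, here is a minimal Python sketch of the same idea; the file name checkpoint.npz, the step() placeholder, and the particle counts are assumptions for illustration, not the code shown above.

import os
import numpy as np

CHECKPOINT = "checkpoint.npz"   # hypothetical checkpoint file name
N_PARTICLES = 500
N_STEPS = 200
CHECKPOINT_EVERY = 40

def step(positions):
    # Placeholder for one molecular-dynamics integration step
    return positions + np.random.normal(scale=1e-3, size=positions.shape)

# Resume from the last checkpoint if one exists, otherwise start fresh
if os.path.exists(CHECKPOINT):
    data = np.load(CHECKPOINT)
    start, positions = int(data["iteration"]) + 1, data["positions"]
    print(f"Resuming from iteration {start}")
else:
    start, positions = 0, np.zeros((N_PARTICLES, 3))

for i in range(start, N_STEPS):
    positions = step(positions)
    if (i + 1) % CHECKPOINT_EVERY == 0:
        # Save the iteration number and all particle positions to a binary file
        np.savez(CHECKPOINT, iteration=i, positions=positions)
        print(f"Checkpoint written at iteration {i}")

Unlike the example above, which quits after writing each checkpoint, this sketch keeps running; either way, a restart picks up from the last saved iteration.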

Jobs can be submitted in such a way that they wait for other jobs to finish, so you do not have to resubmit the job manually after each checkpointed run completes.


For Research Support

For help using clusters and running programs, please contact us using the Service Desk request form.
