To speed up your computations dramatically, you can write a program that uses multiple CPU cores in parallel instead of a single core. Depending on the type of problem, the speed-up can scale linearly with the number of cores: on a quad-core machine, for example, your program could run up to four times faster. On clusters with hundreds or thousands of CPUs, the benefits of parallel computing can be substantial.
Various techniques are available to implement parallel computing, such as OpenMP, Python multiprocessing, or MPI.
One of the easiest ways to parallelize programs is with OpenMP. It is available for C, C++, and Fortran, and is supported by most modern compilers.
To activate OpenMP on shared-memory systems (i.e., on a single computer), you only need to set a compiler flag. Adding a single #pragma line above a for-loop then runs that loop in parallel. The only limitation in this situation is the number of cores in the CPU.
The example below illustrates a simulation of falling and bouncing particles in which the physics update is parallelized.
The particle positions are initialized using the C++ random number generator. This is a serial part because the random number generator is not thread-safe.
The actual physics is done by continuously updating the particle positions. By adding a #pragma statement above the for-loop, this part is run in parallel.
Compiling programs using OpenMP requires a compile flag. Its name will depend on the compiler, but is usually in the form of -fopenmp or -openmp. For example, to compile the above example use:
See these tutorials for more information on how to program with OpenMP.
The example above was relatively straightforward to run in parallel because the particles do not interact with each other.
Interaction complicates things because multiple cores simultaneously access the positions of the particles, which can lead to race conditions and incorrect results. There are also limits to how much speed-up can be obtained because parallel computing always carries some overhead. For instance, the example above does not scale linearly with the number of threads. In fact, the program can become slower when more threads are added. It is therefore best to profile your program to see what the actual speed-up is, especially if you are considering RAC applications at Compute Canada.
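One common way to reason about such limits is Amdahl's law, which is not part of the original example and is shown here only as an illustration. It assumes that a fraction p of the runtime parallelizes perfectly over n threads while the rest stays serial (such as the random-number initialization above). The value p = 0.9 is an assumed number, not a measurement of the particle program:

```latex
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}, \qquad
S(4)\Big|_{p = 0.9} = \frac{1}{0.1 + 0.225} \approx 3.08, \qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p} = 10
```

The formula ignores the overhead of starting and synchronizing threads, so a real program can do worse than this bound.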
A plot of the above example on an Intel Xeon E5520 processor reveals that while a speed-up of almost 2x can be obtained by using four threads, there is no gain when using more.
This plot was generated by timing the program with different numbers of threads. The number of threads can be set with the environment variable OMP_NUM_THREADS.
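A timing run like the one described above could be scripted roughly as follows. This is a minimal sketch, not the script used for the plot: the function names time_command and speedups, the thread counts, and the binary name ./particles are all assumptions made for illustration.

```python
import os
import subprocess
import sys
import time


def time_command(cmd, num_threads):
    """Run cmd once with OMP_NUM_THREADS set; return wall-clock seconds."""
    env = dict(os.environ, OMP_NUM_THREADS=str(num_threads))
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def speedups(timings):
    """Convert {threads: seconds} into {threads: speed-up vs. one thread}."""
    baseline = timings[1]
    return {n: baseline / t for n, t in timings.items()}


if __name__ == "__main__":
    # Hypothetical usage: python time_threads.py ./particles
    # Without arguments, a do-nothing Python command stands in for the program.
    cmd = sys.argv[1:] or [sys.executable, "-c", "pass"]
    timings = {n: time_command(cmd, n) for n in (1, 2, 4, 8)}
    for n, s in speedups(timings).items():
        print(f"{n} threads: {timings[n]:.3f} s, speed-up {s:.2f}x")
```

A single run per thread count is noisy; in practice you would likely repeat each measurement a few times and keep the fastest or the median.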
Even though Python is not naturally geared toward multithreading, it does have several packages that enable it to do so. The multiprocessing package launches multiple Python processes to leverage multiple CPU cores. Below is an example of a simulation of the random walk of photons in an absorbing and isotropically scattering medium.
Multiprocessing is set up by initializing the Manager and adding jobs to run.
The "target" argument to Process is the function that the job will execute. Here that function is called "simulate", and it simulates the photons as they travel through the medium.
- The check that __name__ equals "__main__" is required because multiprocessing can launch multiple instances of Python that load the script and execute any code not protected in this way.
- A separate I/O writer job is used to write the results to disk. The worker processes send their results to this writer through a Queue.
- Using 3 cores gives a speed-up of 270% compared to a single core on an Intel i5-4310U.
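The pieces described above (Manager, Queue, Process with a target, a dedicated writer job, and the __main__ guard) can be put together roughly as follows. This is a simplified sketch rather than the original program: the physics is reduced to a bare random walk, and the helper names worker and writer, the absorb_prob parameter, the unit mean free path, and the output file name are assumptions.

```python
import math
import multiprocessing as mp
import os
import random
import tempfile


def simulate(n_photons, seed, absorb_prob=0.1):
    """Random-walk n_photons; return their mean distance from the origin at absorption."""
    rng = random.Random(seed)  # one independent generator per job
    total = 0.0
    for _ in range(n_photons):
        x = y = z = 0.0
        while True:
            step = rng.expovariate(1.0)            # free path, mean 1 (assumed unit)
            cos_t = rng.uniform(-1.0, 1.0)         # isotropic scattering direction
            sin_t = math.sqrt(1.0 - cos_t * cos_t)
            phi = rng.uniform(0.0, 2.0 * math.pi)
            x += step * sin_t * math.cos(phi)
            y += step * sin_t * math.sin(phi)
            z += step * cos_t
            if rng.random() < absorb_prob:         # photon absorbed, walk ends
                break
        total += math.sqrt(x * x + y * y + z * z)
    return total / n_photons


def worker(queue, n_photons, seed):
    """Job body: run one batch and hand the result to the writer."""
    queue.put((seed, simulate(n_photons, seed)))


def writer(queue, path):
    """Single I/O job: write results until the None sentinel arrives."""
    with open(path, "w") as f:
        while True:
            item = queue.get()
            if item is None:
                break
            f.write(f"{item[0]} {item[1]:.4f}\n")


if __name__ == "__main__":  # guard: child processes may re-import this script
    manager = mp.Manager()
    queue = manager.Queue()
    path = os.path.join(tempfile.gettempdir(), "photon_results.txt")  # hypothetical name
    out = mp.Process(target=writer, args=(queue, path))
    out.start()
    jobs = [mp.Process(target=worker, args=(queue, 2000, seed)) for seed in range(3)]
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
    queue.put(None)  # all workers are done; tell the writer to stop
    out.join()
```

Giving each job its own seeded generator keeps the batches statistically independent and the results reproducible, and routing all output through one writer means only a single process ever touches the file.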
Another common technique for writing parallel programs is the Message Passing Interface (MPI), which is used on all of the large clusters. It enables the use of hundreds or thousands of CPUs (or more) by combining the computing power of many individual computers that communicate over a network. This network is typically very fast because communication is often the limiting factor in large-scale simulations. Documentation is available for Open MPI and MPICH.
Writing a program with MPI is more involved than with OpenMP because you need to think carefully about how data is passed between the compute nodes. For example, if the simulation involves a grid, the grid needs to be divided into subgrids, and their boundaries need to be shared between the different processes at every time step. The example below simulates the heat equation with fixed boundary conditions.
First the grid is spread over all of the MPI processes.
During the simulation, each MPI process needs to communicate with its neighbours through "ghost" cells, which make the boundaries of the local grids line up.
- An MPI program always starts with MPI_Init.
- The number of processes involved with the simulations is obtained with MPI_Comm_size.
- The rank is the process number and is used for communicating with the other processes and for running code that needs to be done by a single process only. Here process 0 is responsible for writing data to a file.
- The exchange of ghost cells uses MPI_Send and MPI_Recv to transfer the boundaries of the subgrids to the neighbouring subgrids. Generally, you want to pass as little data as possible between the processes so that the simulation runs at maximum speed.
- Writing data is only done by one process to avoid race conditions. All other processes send their subgrids to process 0 for writing. This ensures that the data remain in sequential order.
- An MPI program always ends with MPI_Finalize.
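To make the ghost-cell idea concrete without requiring an MPI installation, the sketch below emulates the decomposition serially in plain Python. It is not MPI code: the list copies in exchange_ghosts stand in for the MPI_Send/MPI_Recv pairs, gather stands in for process 0 collecting the subgrids, and all function names, the 1D grid, and the coefficient 0.25 are assumptions made for illustration.

```python
def step_serial(u, alpha):
    """One explicit heat-equation step on the whole grid; both end cells stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def split(u, n_parts):
    """Divide the grid into n_parts subgrids, each padded with one ghost cell per side."""
    size = len(u) // n_parts  # assumes len(u) is divisible by n_parts
    return [[0.0] + u[r * size:(r + 1) * size] + [0.0] for r in range(n_parts)]


def exchange_ghosts(parts):
    """Copy edge cells into the neighbours' ghost cells (MPI_Send/MPI_Recv in real MPI)."""
    for r in range(len(parts) - 1):
        parts[r][-1] = parts[r + 1][1]   # right ghost <- neighbour's first real cell
        parts[r + 1][0] = parts[r][-2]   # left ghost  <- neighbour's last real cell


def step_parts(parts, alpha):
    """Update the real cells of every subgrid; the global end cells stay fixed."""
    last = len(parts) - 1
    out = []
    for r, p in enumerate(parts):
        new = p[:]
        lo = 2 if r == 0 else 1                        # skip fixed left boundary
        hi = len(p) - 2 if r == last else len(p) - 1   # skip fixed right boundary
        for i in range(lo, hi):
            new[i] = p[i] + alpha * (p[i - 1] - 2.0 * p[i] + p[i + 1])
        out.append(new)
    return out


def gather(parts):
    """Strip the ghost cells and join the subgrids in rank order."""
    return [x for p in parts for x in p[1:-1]]


if __name__ == "__main__":
    grid = [100.0] + [0.0] * 10 + [50.0]  # 12 cells with fixed end temperatures
    parts = split(grid, 3)                # three emulated "processes"
    for _ in range(20):
        exchange_ghosts(parts)            # must happen before every time step
        parts = step_parts(parts, 0.25)
        grid = step_serial(grid, 0.25)
    print(gather(parts) == grid)          # compares decomposed and single-grid runs
```

The comparison at the end checks that the decomposed run reproduces the single-grid run, which is a useful sanity test to keep when moving the same logic to real MPI calls.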
For other languages, multiple solutions exist. For example, Julia has parallel computing built in. If you need finer control than OpenMP offers, you can use pthreads on Linux instead. MPI is also available for Python, R, and other languages.
For advanced usage, it is even possible to combine MPI with multithreading to leverage shared-memory systems while also having the ability to combine multiple systems.