Basics of Parallel Programming Using MPI

In this post, I will describe the basics of parallel programming using Fortran and MPI. MPI implementations are available for many other languages (C, C++, et cetera) as well, so everything here can be translated to those languages with little effort. I will assume that readers have some coding experience and understand how the compilation process works, although, for the sake of pedagogy, I will briefly go through these steps. Also, since I work in the field of physics, I will focus on the things I have found useful in my research rather than going through all the nitty-gritty details.

Before we look at a piece of parallel Fortran code, we must install a Fortran compiler (gfortran, ifort, et cetera) along with an MPI implementation such as Open MPI. On UNIX-based systems, the following commands can be used to install Open MPI, which provides the MPI compiler wrappers around gfortran.

For macOS (make sure the Xcode command line tools and Homebrew are installed beforehand; I am not a Mac user, so please correct me if this doesn’t work):

brew install open-mpi

For debian based systems:

sudo apt install openmpi-bin
sudo apt install libopenmpi-dev

For Arch based systems:

sudo pacman -Syu openmpi

For rpm based systems:

sudo yum install openmpi
sudo yum install openmpi-devel

To make sure that everything has been installed correctly, execute the version check commands as follows:

mpiexec --version
mpifort --version

Once these compilers are installed, we are ready to write parallel Fortran code. By parallel programming, we mean that computations are carried out on multiple processors, all running at the same time. If a computation takes time T to complete serially, then on N processors we would ideally expect the computation time to drop by a factor of N, to T/N. In practice it usually takes somewhat longer, because many computations require time-consuming data transfers between processors. A good parallel algorithm should therefore minimize data transfers (also known as message passing).

MPI (Message Passing Interface) is a standard designed to unify message passing implementations across parallel computing architectures. Essentially, it defines the subroutines, keywords, et cetera to be used by all message passing implementations. When using MPI, the most important thing to understand is that exactly the same code is executed on every processor; the only thing unique to each processor is its rank, which I will refer to as the process identification number (PID). To belabor this point, let’s look at an example. Suppose we want to run a fluid mechanics simulation that solves the Navier-Stokes equations on a 3D lattice. To parallelize the simulation, we can divide the lattice into multiple smaller 3D sub-lattices. Since the exact same code runs on all the processors, how do we make sure that they do not all perform computations on the same sub-lattice? The answer is to use the PID to translate (offset) the lattice indices, so that each processor is assigned its own sub-lattice.
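As a concrete (hypothetical) illustration of this idea, the sketch below splits the rows of a lattice among the available processes using nothing but the PID. The lattice size nx, and the assumption that it divides evenly among the processes, are my own simplifications, not part of any particular simulation.

```fortran
c      Sketch: each process uses its PID to claim a distinct block of
c      rows of an nx-row lattice. Assumes nx is divisible by the
c      number of processes.
       program latticeSplitExample

       use mpi

       implicit none

       integer nx, nlocal, istart, iend
       integer pid, nprocesses, ierr

       call mpi_init(ierr)
       call mpi_comm_rank(mpi_comm_world, pid, ierr)
       call mpi_comm_size(mpi_comm_world, nprocesses, ierr)

       nx = 120
       nlocal = nx/nprocesses
c      Translate the row range by pid*nlocal, so that no two
c      processes overlap.
       istart = pid*nlocal + 1
       iend   = istart + nlocal - 1

       print *, 'pid ', pid, ' rows ', istart, ' to ', iend

       call mpi_finalize(ierr)
       end program
```

Each process prints a different row range even though all of them execute the same lines, which is exactly the mechanism described above.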

We are now ready to look at a parallel Fortran example. In the code below, we add the four numbers 79, 83, 2*79, and 2*83 (it will become clear shortly why I am multiplying by 2) using two processors. On the first processor (PID=0), we add 79 and 83, while on the second processor (PID=1), we add 2*79 and 2*83. Once the two additions are done, we send the sum from the second processor to the first processor and add the two sums there. In this particular example, we do not get a significant speed-up over a serial computation, since the number of operations performed on each processor is small. In fact, the evaluation may even take longer than on a single processor because of the time penalty incurred by message passing.

       program firstParallelExample

       use iso_fortran_env
       use mpi

       implicit none

       real(real64) a, b, sum_local, sum_global
       integer pid, nprocesses
       integer ierr, msg, status(mpi_status_size)

c      Initialize MPI, then query this process's rank (pid) and the
c      total number of processes in mpi_comm_world.
       call mpi_init(ierr)
       call mpi_comm_rank(mpi_comm_world, pid, ierr)
       call mpi_comm_size(mpi_comm_world, nprocesses, ierr)

       a = 79.*(pid+1)
       b = 83.*(pid+1)

       sum_local = a + b 

       print *, 'The sum on the process ID ',pid, ' is ',sum_local

c      msg is the tag used to match the send with the receive.
       msg=1

c      Process 1 sends its local sum to process 0.
       if (pid==1) then
       call mpi_send(sum_local,1,mpi_double_precision,0,msg,
     1                     mpi_comm_world,ierr)
       endif

c      Process 0 receives the sum from process 1 and adds its own.
       if (pid == 0) then

       call mpi_recv(sum_global,1,mpi_double_precision,1,msg,
     1                     mpi_comm_world, status, ierr)
       sum_global=sum_global+sum_local

       print *,'The total sum of the four numbers is',sum_global

       endif

       call mpi_finalize(ierr)

       end program firstParallelExample

Please copy the code above into a file called “sum.f” to see how everything works. As in this example, all Fortran codes using MPI must include the statement use mpi, which imports the MPI module. The other MPI keywords used here are the most prevalent ones and are described as follows:

mpi_init(ierr): This routine initializes the MPI environment and must be called before any other MPI routine.

mpi_comm_rank(mpi_comm_world, pid, ierr): This routine returns the rank or ID of the current process. MPI partitions the processes into groups called communicators. By default, we only have one communicator, mpi_comm_world, which contains all the processes. The processes in each communicator are assigned a process ID (pid) or rank, starting at 0. A single process may be part of multiple communicators.

mpi_comm_size(mpi_comm_world, nprocesses, ierr): This routine gives us the number of processes in the communicator mpi_comm_world.

mpi_send(sum_local, 1, mpi_double_precision, 0, msg, mpi_comm_world, ierr): This routine can be described in one sentence: it sends the data stored in the variable sum_local (1st argument), consisting of 1 element (2nd argument) of type mpi_double_precision (3rd argument), to the process with pid=0 (4th argument), labeled with the tag msg (5th argument), in the communicator mpi_comm_world (6th argument). If we wish to send an array of data, the second argument has to be changed to the number of elements in that array.

mpi_recv(sum_global, 1, mpi_double_precision, 1, msg, mpi_comm_world, status, ierr): This routine receives data into the variable sum_global (1st argument), consisting of 1 element (2nd argument) of type mpi_double_precision (3rd argument), from the process with pid=1 (4th argument), labeled with the tag msg (5th argument), in the communicator mpi_comm_world (6th argument). For the communication to take place, the tag msg on the receiving end must match the tag used by the sender. For arrays, just as with mpi_send, the second argument has to be changed to the number of elements in the array.
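To make the array case concrete, here is a minimal sketch of my own (not part of the original example) in which process 1 sends a 5-element array to process 0. Note that only the count argument changes compared to the scalar case; the array contents and tag value here are arbitrary.

```fortran
c      Sketch: sending an array instead of a scalar. The count
c      argument (here 5) must match the number of elements sent.
       program arraySendExample

       use iso_fortran_env
       use mpi

       implicit none

       real(real64) vec(5)
       integer pid, nprocesses, ierr, msg, i
       integer status(mpi_status_size)

       call mpi_init(ierr)
       call mpi_comm_rank(mpi_comm_world, pid, ierr)
       call mpi_comm_size(mpi_comm_world, nprocesses, ierr)

       msg = 2

       if (pid == 1) then
       do i = 1, 5
       vec(i) = real(i, real64)
       enddo
c      The count argument is 5, the size of the array.
       call mpi_send(vec, 5, mpi_double_precision, 0, msg,
     1               mpi_comm_world, ierr)
       endif

       if (pid == 0) then
       call mpi_recv(vec, 5, mpi_double_precision, 1, msg,
     1               mpi_comm_world, status, ierr)
       print *, 'Received array: ', vec
       endif

       call mpi_finalize(ierr)
       end program
```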

mpi_finalize(ierr): This routine cleanly shuts down the MPI environment; no MPI routines may be called after it.

Note that all of these routines return an error code, ierr, which indicates whether the call succeeded. To run this code, we first need to compile it by executing the following command in the terminal,
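By default, Open MPI aborts the program when a call fails, so ierr is rarely inspected explicitly. Still, a check can be placed after any MPI call; the fragment below is a hypothetical sketch of such a check (mpi_success is a constant provided by the mpi module).

```fortran
c      Sketch: explicitly checking the error code returned in ierr.
c      This fragment would follow any MPI call inside a program that
c      has already called mpi_init.
       call mpi_send(sum_local, 1, mpi_double_precision, 0, msg,
     1               mpi_comm_world, ierr)
       if (ierr /= mpi_success) then
       print *, 'mpi_send failed with error code ', ierr
       endif
```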

mpifort sum.f -o sum.out

The executable can then be run using the command below,

mpirun -n 2 ./sum.out

The flag “-n 2” is used to indicate the number of processes to be used for running the code. The output we get looks like the following,

 The sum on the process ID            0  is    162.00000000000000     
 The sum on the process ID            1  is    324.00000000000000     
 The total sum of the four numbers is   486.00000000000000 


In all MPI codes, the variables are local to each process. We can see that the variables a, b, and sum_local have different values on the two processors. After the summation is performed on both processes, the second process (pid=1) sends its local sum to the first process (pid=0), where the data is received into the new variable sum_global. We then add the local sum to it to obtain the final sum of the four numbers.

I will discuss advanced features of MPI and their implementations in my upcoming posts.