This tutorial covers how to write a parallel program to calculate π using the Monte Carlo method. The first code is a simple serial implementation. The next codes are parallelized using MPI and OpenMP and then finally, the last code sample is a version that combines both of these parallel techniques.

## Serial Implementation

The codes use Monte Carlo methods to estimate π. To set up the estimate, randomly located points are generated within a 2×2 square which has a circle inscribed within it– think of a game of darts. The algorithm generates a large number of points and checks to see if the coordinates, x and y, of each point are inside the circle- x2+y2≤1. The ratio, P, of points inside the circle to the total number of points tried is calculated. Using this ratio, the approximation of π can be computed by assuming that P approaches the value of ratio of the area of the circle to the area of the square when the number of points, niter, is large. Thus we can say: P= (πr²)/4r² = π/4 and solve for π. To do this, P is multiplied by 4 and the result is an approximation of π. (Note: The more iterations of points generated (niter), the more accurate the approximation will be).

### serialpi.c

#include <stdio.h> #include <stdlib.h> #include <math.h> void main(int argc, char* argv[]) { double niter = 10000000; double x,y; int i; int count=0; double z; double pi; srand(time(NULL)); //main loop for (i=0; i<niter; ++i) { //get random points x = (double)random()/RAND_MAX; y = (double)random()/RAND_MAX; z = sqrt((x*x)+(y*y)); //check to see if point is in unit circle if (z<=1) { ++count; } } pi = ((double)count/(double)niter)*4.0; //p = 4(m/n) printf("Pi: %f\n", pi);

Next, compile the code using the compiler wrappers on the supercomputer

cc serialpi.c -o a.out

To run this program on Titan, or any of the other clusters, a job needs to be submitted in the form of a PBS batch script. Here is the an example batch scrip, titan_pi.pbs:

#!/bin/bash # Begin PBS directives #PBS -A AAA### #PBS -N pi #PBS -j oe #PBS -l walltime=0:01:00,nodes=1 #PBS -V #PBS -q debug # End PBS directives and begin shell commands cd $PBS_O_WORKDIR date aprun -n 1 ./a.out

Replace AAA### with the appropriate project ID that the time will be charged to. See the Titan System User Guide for a more in depth look at writing batch scripts.

To submit this batch script, run the following:

qsub titan_pi.pbs

When executed, the result will be something like:

`Pi: 3.141664`

## Parallelize with MPI

### MPI Send/Recv

In this next version of the program, the code has been parallelized with MPI. In order to do this, a few things need to be set up in the code. First, the mpi.h header file needs to be included.

#include "mpi.h"

Next, an MPI instance needs to be created. To do this, add this:

MPI_Init(&argv, &argc);

right before the code needed to be parallelized with MPI. To utilize the created MPI instance, the code needs to have MPI sends and receives present. This message sending and receiving is the basic concept behind MPI. Many functions in MPI can be implemented with combinations of basic send and receive calls. Here, our code is set up to do computations on all but the master node (MPI rank 0). The only thing that rank 0 does is receive the messages (the calculated points in/out of the circle) sent by the other nodes and calculate the ratio based upon that data. The MPI_Recv(); function needs to be added so that our master node can receive the messages. To set up the receive, it needs the following info:

int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status);

- A buffer to store the reveived data
- A number of maximum elements in the receive buffer
- An MPI_Datatype of the appropriate type for the data to be revieved
- The rank number that the message will be sent from
- A message tag
- The name of the MPI communicator. I.e. MPI_COMM_WORLD.
- A status variable

So the first MPI_Recv line should look like this:

MPI_Recv(&recieved[i], nodenum, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

Next, an MPI_Send(); needs to be set up so that the master rank has data to receive. MPI_Send(); only needs to know:

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm);

- A pointer to the send buffer (data to send)
- Number of elements in the send buffer
- An MPI_Datatype for what kind of data thats being sent
- The destination rank
- A unique message tag
- The MPI communicator

So with this in mind, the first MPI_Send needs to look like this:

MPI_Send(&count, 1, MPI_INT, 0, myid, MPI_COMM_WORLD);

Finally, the MPI instance should be closed out before the program exits. To do this simply add this line after the parallel code:

MPI_Finalize();

This code is written so that each node is running the same algorithm to place the random points, categorize them, and count them. The task assigned to master node collects the total number of tried points and total number of points in the circle. It takes the ratio of them and multiplies it by 4 to get a more accurate estimation of π. This is a more accurate, because each node is running as many iterations as the singe node in the serial code ran. The master node treats them as if the values were ultimately from the same circle. The same thing can be achieved with the serial code by multiplying niter, the number of iterations, by the total number of nodes used in the parallelized version–but that is much less efficient than the MPI version because it is not dividing the work over several nodes simultaneously.

### mpiPi.c

#include <stdio.h> #include <stdlib.h> #include "mpi.h" #include <math.h> #define SEED 35791246 int main(int argc, char* argv[]) { long niter = 1000000; int myid; //holds process's rank id double x,y; //x,y value for the random coordinate int i, count=0; //Count holds all the number of how many good coordinates double z; //Used to check if x^2+y^2<=1 double pi; //holds approx value of pi int nodenum; MPI_Init(&argc, &argv); //Start MPI MPI_Comm_rank(MPI_COMM_WORLD, &myid); //get rank of node's process MPI_Comm_size(MPI_COMM_WORLD, &nodenum); int recieved[nodenum]; long recvniter[nodenum]; srand(SEED); //Give rand() a seed value if(myid != 0) { for (i=0; i<niter; ++i) //main loop { x= ((double)rand())/RAND_MAX; //gets a random x coordinate y =((double)rand())/RAND_MAX; //gets a random y coordinate z = sqrt(x*x+y*y); //Checks to see if number in inside unit circle if (z<=1) { count++; //if it is, consider it a valid random point } } for(i=0; i<nodenum; ++i) { MPI_Send(&count, 1, MPI_INT, 0, myid, MPI_COMM_WORLD); MPI_Send(&niter, 1, MPI_LONG, 0, myid, MPI_COMM_WORLD); } } else if (myid == 0) { for(i=0; i<nodenum; ++i) { MPI_Recv(&recieved[i], nodenum, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE); MPI_Recv(&recvniter[i], nodenum, MPI_LONG, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE); } } if (myid == 0) //if root process { int finalcount = 0; long finalniter = 0; for(i = 0; i<nodenum; ++i) { finalcount += recieved[i]; finalniter += recvniter[i]; } pi = ((double)finalcount/(double)finalniter)*4.0; //p = 4(m/n) printf("Pi: %f\n", pi); //Print the calculated value of pi } MPI_Finalize(); //Close the MPI instance return 0; }

Compile this code by running:

cc mpiPi.c -o a.out

The job submission is the same as the previous program but two lines in the batch script need to be edited:

#PBS -l walltime=0:01:00,nodes=1

the “nodes=1” needs to be changed to to how many nodes needed to run this code.

and

aprun -n 1 ./a.out

The “-n 1” needs to be changed to reflect the change made in the previous line (e.g. if the previous line now reads “nodes = 100” then the aprun line needs to reflect that by saying “-n 100”

Now the job can be submitted to the batch system to be run when the number of requested nodes becomes available.

qsub titan_pi.pbs

### MPI Reduce

In the following version of the code, the serial code has been parallelized with MPI again, but this time the collective MPI_Reduce() operation is used. MPI_Reduce() gathers all the specified values on all nodes to one reduced value on the master node according to the specified operation (i.e. MPI_SUM). This has the same effect as the previous code, but makes for cleaner reading code.

### mpirPi.c

#include <stdio.h> #include <stdlib.h> #include "mpi.h" #include <math.h> int main(int argc, char* argv[]) { int niter = 100000; int myid; //holds process's rank id double x,y; //x,y value for the random coordinate int i; int count=0; //Count holds all the number of how many good coordinates double z; //Used to check if x^2+y^2<=1 double pi; //holds approx value of pi int reducedcount; //total number of "good" points from all nodes int reducedniter; //total number of ALL points from all nodes MPI_Init(&argc, &argv); //Start MPI MPI_Comm_rank(MPI_COMM_WORLD, &myid); //get rank of node's process if(myid != 0) { for (i=0; i<niter; ++i) //main loop { srand48(time(NULL)); //Give rand() a seed value x = (double)random()/RAND_MAX; //gets a random x coordinate y = (double)random()/RAND_MAX; //gets a random y coordinate z = sqrt((x*x)+(y*y)); //Checks to see if number in inside unit circle if (z<=1) { ++count; //if it is, consider it a valid random point } } } MPI_Reduce(&count, &reducedcount, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); MPI_Reduce(&niter, &reducedniter, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); reducedniter -= niter; //to compensate for no loop on master node if (myid == 0) //if root process { pi = ((double)reducedcount/(double)reducedniter)*4.0; //p = 4(m/n) printf("Pi: %f\n%i\n%d\n", pi, reducedcount, reducedniter); //Print the calculated value of pi } MPI_Finalize(); //Close the MPI instance return 0; }

Compile this code by running

cc mpirPi.c -o a.out

Again, the batch script needs to be edited to reflect the number of nodes requested for this code. Once that is done, submit the job by once again running

qsub titan_pi.pbs

## Parallelize with OpenMP

Here the original serial code has been parallelized with OpenMP instead of MPI. OpenMP is used to create separate threads to run on the multicore CPUs of Titan. To use OpenMP the following line needs to added:

#include <omp.h>

Also, the `#pragma omp parallel`

compiler directive needs to be added. This tells the compiler that the following block will be OpenMP parallel code. This code is creating 16 threads; one for each core of the CPU. Since OpenMP is a shared memory API, one needs to keep in mind what variables need to be accessible by what threads. Here, our code needs each thread to have its own copy of x, y, z, and i, and the variable “count” needs to be accessible by all threads so that it can be incremented when needed. In order to accomplish this, OpenMP needs to be told that x, y, z, and i need to be private. Here the firstprivate() clause is used in the omp pragma to declare them as thread private. The firstprivate() clause also has the feature that will automatically initialize the variables inside its parentheses. The variable, “count” needs to be accessible to all threads (i.e. shared) so the shared() clause is used to accomplish that. Finally, OpenMP is told to spawn 16 threads (one for each core of the processor). This is done by using the num_threads() clause.

### omppi.c

#include <stdio.h> #include <stdlib.h> #include <omp.h> #include <math.h> int main(int argc, char* argv[]) { int niter = 1000000; //number of iterations per FOR loop double x,y; //x,y value for the random coordinate int i; //loop counter int count=0; //Count holds all the number of how many good coordinates double z; //Used to check if x^2+y^2<=1 double pi; //holds approx value of pi int numthreads = 16; #pragma omp parallel firstprivate(x, y, z, i) shared(count) num_threads(numthreads) { srandom((int)time(NULL) ^ omp_get_thread_num()); //Give random() a seed value for (i=0; i<niter; ++i) //main loop { x = (double)random()/RAND_MAX; //gets a random x coordinate y = (double)random()/RAND_MAX; //gets a random y coordinate z = sqrt((x*x)+(y*y)); //Checks to see if number is inside unit circle if (z<=1) { ++count; //if it is, consider it a valid random point } } //print the value of each thread/rank } pi = ((double)count/(double)(niter*numthreads))*4.0; printf("Pi: %d\n", pi); return 0; }

Compiling it is similar to the previous examples, with one exception. When the compiler is called, the correct OpenMP flag needs to be added. This flag is different depending on which programming environment module you have loaded (run ` module list`

and look for a module that says PrgEnv-*****)

- PrgEnv-pgi: -mp

- PrgEnv-intel: -openmp

- PrgEnv-cray: -openmp

- PrgEnv-gnu: -fopenmp

Now the code can be compiled using this format:

cc -(appropriate flag here) omppi.c -o a.out

The batch script needs to be edited so that only 1 node is requested (there is no MPI code in this for multiple nodes. Yet.)

Submit the script:

qsub titan_pi.pbs

## OpenMP + MPI Hybrid

This next code sample combines both MPI and OpenMP. This allows it to fully exploit the parallel architecture of Titan’s multicore CPUs as well as add a layer of parallelism from MPI. In this version, 16 OpenMP threads are launched on each MPI rank. Here, a different approach is taken with the OpenMP section. Instead of declaring count as shared, the reduction(:) clause is used. This clause effectively creates a private copy of “count” on each thread to be incremented and then once all threads complete, OpenMP will reduce all instances of “count” into the original count variable according to the mathematical operator given. In this case, the total number from all threads is needed, so they are reduced by taking the sum of all of them. Then MPI_Reduce is used to gather and reduce “count” from each node into one variable on the master node.

### mpiOMPpi.c

#include <stdio.h> #include <stdlib.h> #include "mpi.h" #include <omp.h> #include <math.h> int main(int argc, char* argv[]) { int niter = 1000000; //number of iterations per FOR loop int myid; //holds process's rank id double x,y; //x,y value for the random coordinate int i; //loop counter int count=0; //Count holds all the number of how many good coordinates double z; //Used to check if x^2+y^2<=1 double pi; //holds approx value of pi int reducedcount; //total number of "good" points from all nodes int reducedniter; //total number of ALL points from all nodes int ranknum = 0; //total number of nodes available int numthreads = 16; MPI_Init(&argc, &argv); //Start MPI MPI_Comm_rank(MPI_COMM_WORLD, &myid); //get rank of node's process MPI_Comm_size(MPI_COMM_WORLD, &ranknum); //Gets number of nodes availible to process if(myid != 0) //Do the following on all except the master node { //Start OpenMP code: 16 threads/node #pragma omp parallel firstprivate(x, y, z, i) reduction(+:count) num_threads(numthreads) { srandom((int)time(NULL) ^ omp_get_thread_num()); //Give random() a seed value for (i=0; i<niter; ++i) //main loop { x = (double)random()/RAND_MAX; //gets a random x coordinate y = (double)random()/RAND_MAX; //gets a random y coordinate z = sqrt((x*x)+(y*y)); //Checks to see if number is inside unit circle if (z<=1) { ++count; //if it is, consider it a valid random point } } } } MPI_Barrier(MPI_COMM_WORLD); MPI_Reduce(&count, &reducedcount, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); reducedniter = numthreads*niter*(ranknum-1); if (myid == 0) //if root process/master node { //p = 4(m/n) pi = ((double)reducedcount/(double)reducedniter)*4.0; //Print the calculated value of pi printf("Pi: %f\n%i\n%d\n", pi, reducedcount, reducedniter); } MPI_Finalize(); //Close the MPI instance return 0; }

Compiling this code is similar to the past examples (don’t forget to use the correct OpenMP flag according to the programming environment used). To compile run:

cc -(OpenMP flag) mpiOMPpi.c -o a.out

The batch script can be edited now so that it can request the appropriate number of nodes (see above). After the batch script is edited, submit the job like before:

qsub titan_pi.pbs

If you have any questions about these examples or how to run them please email help@nccs.gov.

All code from this tutorial, including a Makefile and batch scripts can be found as a git repository at https://github.com/olcf/Serial-to-Parallel–Monte-Carlo-Pi