
CSCI-4320/6360 - Group Assignment 4:
Hybrid Parallel HighLife Using 1-D Arrays in CUDA and MPI

1 Overview
For Group Assignment 4 (you may have up to 4 people in your group), you are to extend your CUDA
implementation of HighLife that uses ***only*** 1-D arrays to run across multiple GPUs and
compute nodes using MPI. You will run your hybrid parallel CUDA/MPI C program on the AiMOS
supercomputer at the CCI in parallel using at most 2 compute nodes for a total of 12 GPUs.
1.1 Review of HighLife Specification
HighLife is an example of a Cellular Automaton where the universe is a two-dimensional orthogonal
grid of square cells (with WRAP AROUND FOR THIS ASSIGNMENT), each of which is in one
of two possible states, ALIVE or DEAD. Every cell interacts with its eight neighbors, which are
the cells that are horizontally, vertically, or diagonally adjacent. At each step in time, the following
transitions occur at each and every cell:
• Any LIVE cell with FEWER THAN two (2) LIVE neighbors DIES, as if caused by underpopulation.
• Any LIVE cell with two (2) OR three (3) LIVE neighbors LIVES on to the next generation.
• Any LIVE cell with MORE THAN three (3) LIVE neighbors DIES, as if by over-population.
• Any DEAD cell with EXACTLY three (3) or six (6) LIVE neighbors becomes a LIVE cell, as
if by reproduction/birth.
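As a point of reference, the transition rule can be captured in a small helper function like the sketch
below. This is only an illustration: the function name highlife_next_state is a placeholder, and in
the CUDA version the same logic would typically be a __device__ function or inlined directly into the kernel.

/* HighLife transition rule: given a cell's current state (0 = DEAD, 1 = ALIVE)
 * and its count of LIVE neighbors (0..8), return the state for the next tick. */
static inline unsigned char highlife_next_state(unsigned char alive, int live_neighbors)
{
    if (alive)
    {
        /* Survival: exactly 2 or 3 LIVE neighbors; otherwise the cell dies
         * from under-population or over-population. */
        return (live_neighbors == 2 || live_neighbors == 3) ? 1 : 0;
    }
    /* Birth: a DEAD cell with exactly 3 or 6 LIVE neighbors becomes ALIVE. */
    return (live_neighbors == 3 || live_neighbors == 6) ? 1 : 0;
}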
The world size and initial pattern are determined by arguments to your program. A template
for your program will be provided and more details are below. The first generation is created by
applying the above rules to every cell in the seed; births and deaths occur simultaneously, and
the discrete moment at which this happens is sometimes called a “tick” or iteration. The rules
continue to be applied repeatedly to create further generations. The number of generations will
also be an argument to your program. Note, an iteration starts with Cell(0, 0) and ends with
Cell(N − 1, N − 1) in the serial case.
1.2 Revised Implementation Details
For the MPI parallelization approach, each MPI Rank will perform an even “chunk” of rows of
the Cellular Automata universe. Using our existing program, this means that the MPI ranks’
sub-worlds will be stacked on top of one another. For example, suppose each MPI rank processes
a 1024x1024 sub-world; then each Rank has 1024x1024 cells to compute. Thus, Rank 0 will
compute rows 0 to 1023, Rank 1 computes rows 1024 to 2047, Rank 2 will compute rows 2048
to 3071, and so on. Inside of each rank, the HighLife CUDA kernel will be used to process each
iteration.
Thus, the actual world will no longer be square but each sub-world will be.
For the Cellular Automata universe allocation, each MPI Rank only needs to allocate its specific
chunk plus space for “ghost” rows at the MPI Rank boundaries. The “ghost” rows can be held
outside of the main MPI Rank’s Cellular Automata universe.
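A minimal allocation sketch along these lines is shown below. The names (world_width, chunk_rows,
d_world, d_ghost_top, d_ghost_bottom) are placeholders, and the use of cudaMallocManaged is one
possible choice, not a requirement; you could equally use cudaMalloc plus explicit copies.

// Hypothetical per-rank allocation: this rank's chunk of rows plus two separate
// "ghost" rows held outside the main universe, as described above.
int world_width = 16384;   // columns in every row (example value)
int chunk_rows  = 16384;   // rows owned by this rank (fixed per rank for weak scaling)

unsigned char *d_world, *d_ghost_top, *d_ghost_bottom;
cudaMallocManaged(&d_world,        (size_t)chunk_rows * world_width * sizeof(unsigned char));
cudaMallocManaged(&d_ghost_top,    (size_t)world_width * sizeof(unsigned char));  // copy of the last row of the rank above
cudaMallocManaged(&d_ghost_bottom, (size_t)world_width * sizeof(unsigned char));  // copy of the first row of the rank below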
Now, you’ll notice that rows at the boundaries of MPI Ranks need to be updated / exchanged
prior to the start of computing each tick, as in Assignment 3.
For these row exchanges, you will use the MPI_Isend and MPI_Irecv routines. You are free to
design your own approach to “tags” and how you use these routines, except your design should not
deadlock. Make sure, as in Assignment 3, to use MPI_Wait or MPI_Waitall to ensure all messages
have been sent/received before moving on to the CUDA kernel compute stage of the loop.
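A sketch of one possible exchange is shown below, reusing the buffer names from the allocation sketch
above; the tag values are illustrative and the neighbor ranks are computed with wrap-around. Depending
on how you handle the world’s top and bottom edges in ranks 0 and N-1 (see the kernel note later in
this section), you may instead skip the wrap-around exchange for those two ranks, and depending on
your memory management you may exchange host-side staging buffers rather than the device buffers directly.

// Hypothetical ghost-row exchange for one tick: each rank sends its first and last
// owned rows to its neighbors and receives their boundary rows into the ghost buffers.
int up   = (myrank - 1 + numranks) % numranks;   // rank owning the rows above mine
int down = (myrank + 1) % numranks;              // rank owning the rows below mine

MPI_Request reqs[4];
MPI_Irecv(d_ghost_top,    world_width, MPI_UNSIGNED_CHAR, up,   0, MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(d_ghost_bottom, world_width, MPI_UNSIGNED_CHAR, down, 1, MPI_COMM_WORLD, &reqs[1]);
MPI_Isend(d_world, world_width, MPI_UNSIGNED_CHAR, up, 1, MPI_COMM_WORLD, &reqs[2]);   // my first row
MPI_Isend(d_world + (size_t)(chunk_rows - 1) * world_width,
          world_width, MPI_UNSIGNED_CHAR, down, 0, MPI_COMM_WORLD, &reqs[3]);          // my last row
MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

Posting both receives before the sends and matching tags across neighbors keeps the exchange deadlock-free.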
So the algorithm becomes:
main(...)
{
    Setup MPI.
    Set CUDA device based on MPI rank.
    Start time with MPI_Wtime.
    Allocate my Rank's chunk of the universe, init pattern 5 (middle of my rank's chunk)
    and allocate space for "ghost" rows.
    for( i = 0; i < number of ticks; i++)
    {
        Exchange row data with neighboring MPI Ranks
        using MPI_Isend/MPI_Irecv.
        Use MPI_Wait or MPI_Waitall to ensure all messages are sent/recv'ed.
        Do rest of universe update as done in Assignment 2
        using the CUDA HighLife kernel.
        // note, no top-bottom wrap except for ranks 0 and N-1.
    }
    MPI_Barrier();
    if Rank 0,
        end time with MPI_Wtime and printf the MPI_Wtime
        performance results;
    if (Output Argument is True)
    {
        Printf my Rank's chunk of the universe.
    }
    MPI_Finalize();
}
To init MPI in your main function do:
// MPI init stuff
int myrank, numranks;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &numranks);
To init CUDA (after MPI has been initialized in main) in your “init master” function, do:
cudaError_t cE;
int cudaDeviceCount;

if( (cE = cudaGetDeviceCount( &cudaDeviceCount)) != cudaSuccess )
{
    printf(" Unable to determine cuda device count, error is %d, count is %d\n",
           cE, cudaDeviceCount );
    exit(-1);
}
if( (cE = cudaSetDevice( myrank % cudaDeviceCount )) != cudaSuccess )
{
    printf(" Unable to have rank %d set to cuda device %d, error is %d \n",
           myrank, (myrank % cudaDeviceCount), cE);
    exit(-1);
}
Note, myrank is this MPI rank’s value.
As noted in the pseudo-code comments, you will need to modify your CUDA HighLife kernel
to account for the lack of top and bottom edge “world wrapping” for ranks other than rank 0 and
rank N-1 where N is the number of ranks used in the parallel job. All ranks will need to support
left and right edge “world wrapping”.
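One way to express this indexing is sketched below. It assumes the ghost-row layout from the allocation
sketch earlier (d_ghost_top and d_ghost_bottom held outside the chunk) and a design where the wrap-around
rows for ranks 0 and N-1 also arrive through the ghost exchange; if you instead handle the world’s
top/bottom wrap inside those two ranks’ kernels, the y cases below would change accordingly. Only the
x coordinate wraps here.

// Hypothetical device helper: fetch the state of the cell at (x, y), where y may be
// -1 (supplied by the ghost row above) or chunk_rows (the ghost row below).
__device__ unsigned char cell_at(const unsigned char *world,
                                 const unsigned char *ghost_top,
                                 const unsigned char *ghost_bottom,
                                 int x, int y, int width, int chunk_rows)
{
    int wx = (x + width) % width;                   // left/right wrap for every rank
    if (y < 0)           return ghost_top[wx];      // row owned by the rank above
    if (y >= chunk_rows) return ghost_bottom[wx];   // row owned by the rank below
    return world[(size_t)y * width + wx];
}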
Moreover, for this assignment, pattern 5 will be started in the middle of each rank’s sub-world.
This is different from previous Assignments, so the answer will differ depending on the number of
ranks used, because the total world size grows as the number of ranks is increased. This is called
a weak scaling performance study: the problem size per rank stays fixed while the number of ranks
grows, so ideally the total runtime stays roughly constant.
2 Running on AiMOS
2.1 Building Hybrid MPI-CUDA Programs
You will need to break apart your code into two files. The highlife-mpi.c file contains all the
MPI C code. See the example mpi-cuda-template file as an example. The highlife-cuda.cu file
contains all the CUDA specific code, including the world init routines. You’ll need to make sure
those routines are correctly “extern”-ed. Because nvcc is a C++-like compiler, you’ll need to turn
off name mangling for any kernel-launch C functions, via extern "C" { function declaration },
that are called from highlife-mpi.c and defined in highlife-cuda.cu. Next, create your own
Makefile with the following:
all: highlife-mpi.c highlife-cuda.cu
	mpixlc -g highlife-mpi.c -c -o highlife-mpi.o
	nvcc -g -G highlife-cuda.cu -c -o highlife-cuda.o
	mpicc -g highlife-mpi.o highlife-cuda.o -o highlife-exe \
	    -L/usr/local/cuda-11.2/lib64/ -lcudadevrt -lcudart -lstdc++
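For the extern "C" linkage described above, a minimal sketch is shown below; the function name
highlife_kernel_launch and its parameters are placeholders for whatever launch wrapper your code
actually uses.

// In highlife-cuda.cu: give the launch wrapper C linkage so nvcc does not mangle its name.
extern "C" void highlife_kernel_launch(unsigned char *d_world,
                                       unsigned char *d_ghost_top,
                                       unsigned char *d_ghost_bottom,
                                       int chunk_rows, int width, int block_size);

// In highlife-mpi.c: declare the same function as an ordinary C prototype
// (no extern "C" here, since mpixlc compiles this file as C) and call it each tick.
void highlife_kernel_launch(unsigned char *d_world,
                            unsigned char *d_ghost_top,
                            unsigned char *d_ghost_bottom,
                            int chunk_rows, int width, int block_size);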
2.2 SLURM Submission Script
Then create your own slurmSpectrum.sh batch script using the example from the CCI Docs website,
see: https://docs.cci.rpi.edu/Slurm/ and below:
#!/bin/bash
module load xl_r spectrum-mpi cuda/11.2
mpirun --bind-to core --report-bindings -np $SLURM_NPROCS /path/to/your/executable
mpirun --bind-to core --report-bindings -np $SLURM_NPROCS \
    /gpfs/u/home/SPNR/SPNRcaro/barn/PARALLEL-COMPUTING/HighLife-CUDA/highlife-with-cuda 5 16384 128 256
Next, please follow the steps below:
1. Login to CCI landing pad (blp01.ccni.rpi.edu) using SSH and your CCI account and
PIC/Token/password information. For example, ssh SPNRcaro@blp03.ccni.rpi.edu.
2. Login to AiMOS front end by doing ssh PCPEyourlogin@dcsfen01.
3. (Do one time only if you did not do this for Assignment 1.) Setup ssh keys for password-less
login between compute nodes via ssh-keygen -t rsa and then cp ~/.ssh/id_rsa.pub
~/.ssh/authorized_keys.
4. Load modules: run the module load xl_r spectrum-mpi cuda/11.2 command. This puts
the correct IBM XL compiler along with MPI and CUDA in your path, as well as all
needed libraries, etc.
5. Compile code on front end per directions above.
6. Get a single node allocation by issuing: salloc -N 1 --partition=el8 --gres=gpu:4 -t
30 which will allocate a single compute node using 4 GPUs for 30 mins. The max time for
the class is 30 mins per job. Your salloc command will return once you’ve been granted a
node. Normally, it’s been immediate but if the system is full of jobs you may have to wait for
some period of time.
7. Recall, use the “interactive” queueing option to allocate a “debug” compute node for 30 mins.
8. Use the squeue command to find the dcsXYZ node name (e.g., dcs24).
9. SSH to that compute node, for example, ssh dcs24. You should be at a Linux prompt on
that compute node.
10. Issue the run command for HighLife. For example, ./highlife 5 16384 128 256 will run
HighLife using pattern 5 with a world size of 16Kx16K for 128 iterations using a 256-thread
block size.
11. If you are done with a node early, please exit the node and cancel the job with scancel
JOBID where the JOBID can be found using the squeue command.
12. Use the example sbatch script above and the sbatch command covered in Lecture 9 to run across
multiple nodes; a sketch of a submission command is shown after this list.
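As an illustration only, a submission command along these lines would request the largest configuration
in the next section (2 nodes, with 6 GPUs and 6 MPI ranks per node); adjust the counts, time limit, and
script path to your own setup:

sbatch -N 2 --ntasks-per-node=6 --gres=gpu:6 --partition=el8 -t 30 ./slurmSpectrum.sh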
3 Weak Scaling Parallel Performance Analysis and Report
First, make sure you disable any “world” output from your program to prevent extremely large
output files. Note, the arguments are the same as used in Assignment 2. Using MPI_Wtime,
execute your program across the following configurations and collect the total execution
time for each run.
• 1 node, 1 GPU, 16Kx16K world size each MPI rank, 128 iterations with 256 CUDA thread
block size and pattern 5.
• 1 node, 2 GPUs/MPI ranks, 16Kx16K world size each MPI rank, 128 iterations with 256
CUDA thread block size and pattern 5.
• 1 node, 3 GPUs/MPI ranks, 16Kx16K world size each MPI rank, 128 iterations with 256
CUDA thread block size and pattern 5.
• 1 node, 4 GPUs/MPI ranks, 16Kx16K world size each MPI rank, 128 iterations with 256
CUDA thread block size and pattern 5.
• 1 node, 5 GPUs/MPI ranks, 16Kx16K world size each MPI rank, 128 iterations with 256
CUDA thread block size and pattern 5.
• 1 node, 6 GPUs/MPI ranks, 16Kx16K world size each MPI rank, 128 iterations with 256
CUDA thread block size and pattern 5.
• 2 nodes, 12 GPUs/MPI ranks, 16Kx16K world size each MPI rank, 128 iterations with 256
CUDA thread block size and pattern 5.
Determine your maximum speedup relative to using a single GPU and which configuration
yields the fastest “cell updates per second” rate, as done in Assignment 2 (a sketch of this
computation follows below). Explain why you think a particular configuration was faster than others.
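A minimal sketch of the rate computation is shown below, assuming a weak-scaling run where every rank
owns a 16384x16384 sub-world and that start_time, end_time (both taken with MPI_Wtime), numranks, and
num_ticks are already defined; the variable names are illustrative.

// Total cell updates = (cells per rank) * (number of ranks) * (number of ticks),
// divided by the wall-clock time measured on Rank 0.
double elapsed = end_time - start_time;
double total_updates = (double)16384 * 16384 * numranks * num_ticks;
double cell_updates_per_sec = total_updates / elapsed;
printf("%d ranks: %.3e cell updates/sec over %f seconds\n",
       numranks, cell_updates_per_sec, elapsed);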
Because this is a group assignment, please document in your report which team members
contributed particular items to the Assignment, e.g., coding, running experiments, writing the
report, etc.
4 HAND-IN and GRADING INSTRUCTIONS
Please submit your C-code and PDF report with performance data/table to the submitty.cs.rpi.edu
grading system. All grading will be done manually because Submitty currently does not support
GPU programs. We will test against a smaller world size (e.g., 32x32 per rank) for correctness. A
rubric will be posted which describes the grading elements of the program and report in Submitty.
Also, please make sure you document the code you write for this assignment. That is, say what you
are doing and why.