CPSC 424/524
Assignment 3
Triangular Matrix Multiplication
The Problem. Given a lower triangular N×N double precision matrix A ($a_{ij} = 0$ for $i < j$) and an upper
triangular N×N matrix B ($b_{ij} = 0$ for $i > j$), develop an MPI program to compute the product of A and B.
Mathematically, you are computing an N×N matrix C whose entries are defined by:

$$c_{ij} = \sum_{k=1}^{\min(i,j)} a_{ik}\, b_{kj} \qquad (1)$$
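For concreteness, the computation in equation (1) is just a triple loop whose inner summation stops at min(i,j). Below is a minimal sketch in C, assuming (purely for illustration) that the matrices are held as full row-major N×N arrays of double; the provided serial.c/matmul.c may organize storage and loops differently.

    /* Triple loop for equation (1); kmax = min(i,j) with 0-based indexing. */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int kmax = (i < j) ? i : j;
            double sum = 0.0;
            for (int k = 0; k <= kmax; k++)
                sum += A[i*N + k] * B[k*N + j];   /* a_ik * b_kj */
            C[i*N + j] = sum;
        }
    }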
Task 1: Serial Program (5 points)
I have provided source files (serial.c & matmul.c) for a serial program that computes C using equation (1).
These files are in the directory /home/fas/cpsc424/ahs3/assignment3. You may copy and use all files in
that directory. (However, please do not copy the .dat files, which are rather large; just use them where they are.)
You are free to modify or replace the C program, if you wish. If you do so, you may use any data structures you like,
but you may store and use only the potentially nonzero elements of the matrices. That means you may not store
entries of the strict upper triangle of A or the strict lower triangle of B. If your modifications require significantly
more space or time than the original, you may be docked some points for inefficiency.
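If you do roll your own storage, one layout that satisfies the "potentially nonzero elements only" rule packs the lower triangle of A by rows and the upper triangle of B by columns, N(N+1)/2 doubles each. This is only a sketch of one possibility (the accessor names are made up, and serial.c may do something different):

    #include <stddef.h>

    /* a_ik for k <= i: row i of the packed lower triangle starts at offset i*(i+1)/2. */
    static inline double getA(const double *Ap, int i, int k) {
        return Ap[(size_t)i * (i + 1) / 2 + k];
    }
    /* b_kj for k <= j: column j of the packed upper triangle starts at offset j*(j+1)/2. */
    static inline double getB(const double *Bp, int k, int j) {
        return Bp[(size_t)j * (j + 1) / 2 + k];
    }

A convenient side effect of this pairing: the inner product for c_ij walks consecutive memory in both packed arrays.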
Run the serial program on randomly-generated triangular matrices using N = 1000; 2000; 4000; and 8000. The
serial.c program contains code (a seed setting) to reproducibly create randomly-generated triangular matrices.
Please use that code for this purpose throughout this assignment to facilitate checking the numerical results. Along
with the source files, I have provided binary result files containing the correct C matrices, and serial.c computes
the Frobenius norm of the error in the computed C, which should be zero (or very nearly so). All your codes should
illustrate performance and correctness by printing a table similar to the one printed by serial.c. (To save disk
space, please do not copy the binary result files; simply read them where they are, as done in serial.c.) For
timing, you may use my timing() functions from earlier assignments; see the Makefile. Alternatively, you may
use the standard MPI timing routine MPI_Wtime(). Below is a sample output table from serial.c. See the Makefile
and the file build-run.sh for a list of relevant #SBATCH settings for Slurm, required module files, and how to
build the serial program.
Matrix multiplication times:
N TIME (secs) F-norm of Error
----- ------------- -----------------
1000 0.1521 0.0000000000
2000 1.1684 0.0000000000
4000 14.0898 0.0000000000
8000 119.3466 0.0000000000
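For reference, the correctness check is simply the Frobenius norm of the difference between your C and the stored result. A sketch, assuming both are held as full row-major arrays (the details in serial.c may differ):

    #include <math.h>
    #include <stddef.h>

    /* Frobenius norm of (C - Cref): sqrt of the sum of squared entry errors. */
    double fnorm_error(const double *C, const double *Cref, int N) {
        double sumsq = 0.0;
        for (size_t idx = 0; idx < (size_t)N * N; idx++) {
            double d = C[idx] - Cref[idx];
            sumsq += d * d;
        }
        return sqrt(sumsq);
    }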
NOTES:
(a) You should need no more than ~30 minutes for runs in this assignment. Please limit your walltime requests so
that everyone has fair access to the class partition. Please debug using N = 1000 before trying larger cases. My
own experience suggests that you may see variation in timings, so make several runs and report average timings.
(b) As you develop parallel codes, ensure that your programs produce the correct answers by having the master
compute and print error norms as in serial.c. (However, DO NOT include this work in your reported timings.)
Tasks 2-4: Parallel Programs
Parallelization Approach: Here’s the approach you’ll use for this assignment: Divide the input matrices into
blocks of rows (for A) or columns (for B). Then a computational task would be to compute the product of a block
row of A and a block column of B to form a rectangular block of C. Pictorially (shown for full, not triangular,
matrices A and B), this looks like:

[Figure: a block of rows of A multiplied by a block of columns of B produces the corresponding rectangular block of C.]
To implement this using p MPI processes (including the master, which should do real work in addition to
managing the entire computation in this case), you will use the following “ring-pass” approach:
1. The master (only) creates space for the full matrices and initializes A and B as in serial.c.
2. The master then partitions A and C into p block rows each, and it partitions B into p block columns.
3. Next, the master permanently assigns each MPI process (including the master) one block row each of A
and C, and it assigns each MPI process one block column of B as its initial assignment. The master should
use MPI collective operations to distribute the assigned block rows or columns of A and B to each MPI
process. (Each MPI process should allocate its own space for its portion of C.)
4. The computation then proceeds iteratively as follows:
a. Each MPI process does all the computation it can with the data it has in hand.
b. MPI processes then pass their block columns of B to the next higher-ranked MPI process. (MPI
process p-1 passes to MPI process 0.)
c. Repeat until done with the calculation.
5. The master then uses MPI collective operations to assemble the full C matrix by collecting all the block
rows of C from the other MPI processes.
6. Finally, the master computes the error norm and prints a table similar to the one printed by serial.c.
Notes: You will use various MPI functions in this assignment, but you may not use MPI_Sendrecv().
Except for Task 5, you may assume that N is a multiple of p.
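To fix ideas, here is a skeletal outline of steps 1-5 in C. It assumes, for simplicity only, that each block is a contiguous strip of doubles (a row-major B would need an MPI derived datatype or pre-packing for its column blocks), that N is a multiple of p, and that compute_block() is a placeholder you would write; none of these names come from the provided code.

    #include <mpi.h>
    #include <stdlib.h>

    /* Placeholder: multiply my block row of A by the B strip in hand,
       filling the corresponding rectangular piece of my C block row. */
    void compute_block(const double *Ablk, const double *Bblk, double *Cblk,
                       int rowowner, int colowner, int nb, int N);

    void ring_pass(double *A, double *B, double *C, int N)  /* A,B,C used at root only */
    {
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        int nb = N / p;                   /* rows/columns per block (N % p == 0) */
        double *Ablk = malloc((size_t)nb * N * sizeof(double));
        double *Bblk = malloc((size_t)nb * N * sizeof(double));
        double *Cblk = malloc((size_t)nb * N * sizeof(double));

        /* Step 3: one block row of A and one block column of B per process. */
        MPI_Scatter(A, nb*N, MPI_DOUBLE, Ablk, nb*N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Scatter(B, nb*N, MPI_DOUBLE, Bblk, nb*N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        int colowner = rank;              /* original owner of the B strip in hand */
        for (int step = 0; step < p; step++) {
            compute_block(Ablk, Bblk, Cblk, rank, colowner, nb, N);   /* step 4a */
            if (step < p - 1) {
                /* Step 4b goes here: send Bblk to (rank+1)%p and receive the
                   next strip from (rank-1+p)%p -- but not via MPI_Sendrecv(). */
            }
            colowner = (colowner + p - 1) % p;
        }

        /* Step 5: assemble the block rows of C on the master. */
        MPI_Gather(Cblk, nb*N, MPI_DOUBLE, C, nb*N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        free(Ablk); free(Bblk); free(Cblk);
    }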
Task 2: Blocking MPI Parallel Program (40 points for CPSC 424; 30 points for CPSC 524)
Part A (25 points for CPSC 424; 15 points for CPSC 524): Implement an MPI program that computes the
product of two triangular matrices using equation (1) and the parallelization approach described above. For Task 2,
you must use MPI’s blocking communication operations (including collectives). Your program should use
exactly p block rows for A, and p block columns for B, where p is the total number of MPI processes (including the
master). For Task 2, each block row or block column should contain N/p consecutive rows or columns. Insert
timing calls to help understand and critique the performance of your program. (Specifically, try to (a) assess load
balance by comparing the times for each MPI process; and (b) distinguish between time spent computing and time
spent communicating for some of the processes.) Run your program on randomly-generated triangular matrices
using N = 1000; 2000; 4000; and 8000, using p = 1, 2, 4, and 8 on one node for each value of N. For each run print a
summary timing/error table similar to that in the serial program. (Do not report the breakdown of the computation
vs. communication times for this part of Task 2, but you will need to do that in the additional test cases below.) For
this part of Task 2, in your report, please discuss the raw performance, scalability, and load balance you observe.
Include per-process timing data sufficient to support your discussion of load balance.
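Since MPI_Sendrecv() is off-limits, the step-4b exchange needs some care with blocking calls: if every process calls MPI_Send first, the ring can deadlock. One standard pattern orders the calls by rank parity, which pairs up safely whenever p is even (true for all multi-process runs in this task). The fragment below slots into the loop of the earlier ring_pass sketch; Bnew is a hypothetical second buffer allocated like Bblk, and the MPI_Wtime() bookkeeping shows one way to separate compute from communication time.

    double t0, t_comp = 0.0, t_comm = 0.0;
    int next = (rank + 1) % p, prev = (rank + p - 1) % p;

    t0 = MPI_Wtime();
    compute_block(Ablk, Bblk, Cblk, rank, colowner, nb, N);       /* step 4a */
    t_comp += MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    if (rank % 2 == 0) {                                          /* step 4b */
        MPI_Send(Bblk, nb*N, MPI_DOUBLE, next, 0, MPI_COMM_WORLD);
        MPI_Recv(Bnew, nb*N, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(Bnew, nb*N, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(Bblk, nb*N, MPI_DOUBLE, next, 0, MPI_COMM_WORLD);
    }
    t_comm += MPI_Wtime() - t0;

    double *tmp = Bblk; Bblk = Bnew; Bnew = tmp;  /* received strip becomes current */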
Part B (15 points): For p = 4 and 8 only, using N = 8000 only, make additional runs using 2 and 4 nodes with
appropriate numbers of MPI processes per node. (You will have to tell Slurm and/or mpiexec how to distribute
processes on available cores. In this case, please use a round-robin assignment across the sockets. To learn how to
do this, see the man pages for sbatch and mpiexec.) Provide the summary table for each run, and, in addition,
provide tables that show the timing breakdowns for computation and communication in your program when run on
1, 2, and 4 nodes. (You should have results for 1 node from Part A.) Discuss the raw performance and load balance
you observe, including specific comments based on the timings for computation vs. communication. Based on your
observations in both parts of Task 2, suggest (but do not implement) ways to improve the program (including raw
performance, scalability, load balance, and timing splits between computation and communication). In formulating
these suggestions, you may consider use of any MPI operations except MPI_Sendrecv(), and you may suggest
variations in the number and sizes of row- and column-blocks, or modifications to the communication pattern, if
you wish. You MUST describe why and how you expect each of your suggestions to improve program
performance, scalability, and/or load balance.
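The exact placement flags for the round-robin distribution requested above depend on the Slurm and MPI versions installed, so treat the following only as a hypothetical starting point for a 2-node, 4-processes-per-node run; confirm the details in the man pages, and add the class partition and other settings from build-run.sh.

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks=8
    #SBATCH --ntasks-per-node=4
    #SBATCH --time=30:00

    # Round-robin (cyclic) placement across sockets, depending on your launcher:
    #   Slurm's own launcher:  srun --distribution=cyclic:cyclic ./task2
    #   Open MPI-style:        mpirun -np 8 --map-by socket ./task2
    mpiexec -n 8 ./task2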
Task 3: Non-Blocking MPI Parallel Program (40 points for CPSC 424; 35 points for CPSC 524)
Part A (25 points for CPSC 424; 20 points for CPSC 524): One way to improve performance might be to
overlap computation and communication. Modify your Task-2 program to try to implement overlapping by using
non-blocking MPI communication operations where possible. You should continue to use collective operations as
in Task 2. (Hint: Think about the steps in the program that might be overlapped. You may need to introduce
additional communication buffers to implement certain types of overlap.) Once you have a working program,
repeat the runs and discussions/analyses from Part A of Task 2.
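One possible shape for the overlapped loop body, continuing the earlier sketches (again with the hypothetical spare buffer Bnew): post the receive and send up front, compute on the strip already in hand, then wait before swapping. Whether any real overlap occurs depends on the MPI implementation making asynchronous progress, which is part of what your timings should reveal.

    MPI_Request reqs[2];
    MPI_Irecv(Bnew, nb*N, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(Bblk, nb*N, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[1]);

    compute_block(Ablk, Bblk, Cblk, rank, colowner, nb, N);  /* overlap window */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    double *tmp = Bblk; Bblk = Bnew; Bnew = tmp;             /* swap buffers */

Note that compute_block() only reads Bblk while the send is pending; recent MPI standards permit reading an active send buffer.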
Part B (15 points): Repeat the runs and discussions/analyses from Part B of Task 2 using your modified program.
Compare the results to your Task-2 results, and try to explain what you observe. Be sure to address differences in
raw performance, scalability, and load balance, and to discuss the observed breakdowns between computation and
communication times.
Task 4: Load Balance (15 points)
If the observed load balance in your Task-3 program was not as good as you’d have liked, then suggest and
implement one simple approach that might improve the load balance, while retaining similar levels of raw
performance and scalability. For this task, you may modify the parallelization approach described above, if you
feel that doing so will achieve better results. (Hint: Follow the KISS principle. Don’t expect to achieve perfect
load balance. Just come up with, describe, justify, and implement one simple strategy that tries to make a
significant improvement. If you believe that your load balance in Task 3 was good enough, you need not make
new runs—in that case, all you need to do here for full credit is to explain how your Task-3 program achieved
such good load balance.)
If you do create a modified program for this task, then run it on the test cases from Part B of Task 3 (and also on
the corresponding test cases on 1 node), and produce similar tabular outputs. Briefly discuss/explain the
performance differences you see, including raw performance, scalability, and load balance. Note: You are only
expected to try out one reasonable idea, which may or may not actually produce improvements.
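For quantifying (not fixing) the imbalance, note that under the contiguous partition the multiply-add count for each c_ij is min(i,j)+1, so per-process totals can be computed exactly. A tiny helper along these lines (names hypothetical) can back your load-balance discussion with predicted, as opposed to measured, work:

    /* Multiply-adds performed by rank r under the contiguous partition:
       rows r*nb .. (r+1)*nb-1 of C, each entry costing min(i,j)+1 madds.
       Low-ranked processes own the short rows of A and thus do less work. */
    long long madds_for_rank(int r, int nb, int N) {
        long long ops = 0;
        for (int i = r * nb; i < (r + 1) * nb; i++)
            for (int j = 0; j < N; j++)
                ops += (i < j ? i : j) + 1;
        return ops;
    }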
Task 5: Generalization (15 points). Required for CPSC 524; Extra credit for CPSC 424
Modify your Task-4 program (or your Task-3 program if you didn’t modify it for Task 4) so that it works for
arbitrary N and p (you may assume, if you wish, that p ≤ 32 ≪ N). For this task only, N is not presumed to
be a multiple of p. Demonstrate that your code works by running it for N = 7633 and p = 7 on 4 nodes, with a
round-robin MPI process distribution by socket. (Note that I have provided a correct answer for N = 7633.)
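Handling N % p != 0 usually comes down to uneven strip sizes. A sketch, continuing the full-strip storage assumption of the earlier outlines: give the first N % p processes one extra row (or column) each and switch the collectives to their "v" variants (the block-column strips of B need analogous counts).

    int base = N / p, extra = N % p;
    int *counts = malloc(p * sizeof(int));
    int *displs = malloc(p * sizeof(int));
    for (int r = 0, off = 0; r < p; r++) {
        int rows = base + (r < extra ? 1 : 0);   /* rows owned by rank r   */
        counts[r] = rows * N;                    /* doubles in that strip  */
        displs[r] = off;
        off += counts[r];
    }
    MPI_Scatterv(A, counts, displs, MPI_DOUBLE,
                 Ablk, counts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);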
Procedures for Programming Assignments
For this class, we will use the Canvas website to submit solutions to programming assignments.
Remember: While you may discuss the assignment with me, a ULA, or your classmates, the source code you
turn in must be yours alone and should not represent collaborations or the ideas of others!
What should you include in your solution package?
1. All source code files, Makefiles, and scripts that you developed, used, or modified. All source code
files should contain proper attributions and suitable comments to explain your code.
2. A report in PDF format containing:
a. Your name, the assignment number, and course name/number.
b. Information on building and running the code:
i. A brief description of the software/development environment used. For example, you should list
the module files you’ve loaded.
ii. Steps/commands used to compile, link, and run the submitted code. Best is to use a Makefile for
building the code and to submit an sbatch script for executing it. (If you ran your code
interactively, then you’ll need to list the commands required to run it.)
iii. Outputs from executing your program.
c. Any other information required for the assignment, including any questions you were asked to answer.
How should you submit your solution?
1. On the cluster, create a directory named “NetID_ps3_cpsc424”. (For me, that would be “ahs3_ps3_cpsc424”.)
Put into it all the files you need to submit.
2. Create a compressed tar file of your directory by running the following in its parent directory:
tar -cvzf NetID_ps3_cpsc424.tar.gz NetID_ps3_cpsc424
3. To submit your solution, click on the “Assignments” button on the Canvas website and select this assignment
from the list. Then click on the “Submit Assignment” button and upload your solution file
NetID_ps3_cpsc424.tar.gz. (Canvas will only accept files with a “gz” or “tgz” extension.) You may add
additional comments to your submission, but your report should be included in the attachment. You can use
scp or rsync or various GUI tools (e.g., CyberDuck) to move files back and forth to Omega.
Due Date and Late Policy (Note that this assignment is due in the morning.)
Due Date: Sunday, November 4, 2018 by 11:59 a.m.
Late Policy: On time submission: Full credit
Up to 5 days late: 90% credit
Up to 1 week late: 75% credit
Up to 9 days late: 50% credit
More than 9 days late: 35% credit
Important Note: The entire HPC data center is scheduled to be closed for maintenance work from 4:00pm on
Sunday, November 4, 2018 until 5:00pm on Thursday, November 8, 2018. None of the HPC clusters will be
available during that time.