$30
CSE613: Parallel Programming
Homework #3
Task 1. [ 120 Points ] Distributed-Memory Matrix Multiplication.
(a) [ 50 Points ] Implement the three distributed-memory algorithms for multiplying two square
matrices shown in Figures 1–3. Assume the initial distribution of input matrices and the final
distribution of the output matrix from lecture 13.
(b) [ 10 Points ] Use your implementations from part 1(a) to multiply two 2k × 2
k matrices
(initialized with random integers in [−100, 100]) on 2l×2
l
compute nodes with 1 process/node
for 10 ≤ k ≤ 14 and 0 ≤ l ≤ 2. Report the running times and explain your findings.
(c) [ 10 Points ] Repeat part 1(b) with t processes/node, where t is the number of cores available
on a compute node. Report the running times. Compare with part 1(b) and explain.
(d) [ 20 Points ] Suppose a master node initially holds the input matrices and will hold the final
output matrix. Augment your fastest implementation from part 1(a) with efficient routines
for initial distribution and final collection of matrices. When you measure the running time
of this algorithm please include the time needed for these additional distribution/collection
steps.
(e) [ 10 Points ] Repeat part 1(b) with the algorithm from part 1(d).
(f) [ 10 Points ] Repeat part 1(c) with the algorithm from part 1(d).
(g) [ 10 Points ] Compare your results from parts 1(b, c) with those from parts 1(e, f). Explain
your findings.
Task 2. [ 80 Points ] Distributed-Shared-Memory Matrix Multiplication.
(a) [ 30 Points ] Modify your fastest implementation from part 1(a) and and its modified version
from part 1(d) to use a shared-memory parallel matrix multiplication algorithm inside each
process. Use your fastest shared-memory parallel matrix multiplication routine from HW1.
(b) [ 40 Points ] Repeat part 1(b) with the two implementations from part 2(a). Use 1 process/node, but inside each process use all cores available on that node.
(c) [ 10 Points ] Compare your results from parts 1(b, c, e, f) with those from part 2(b). Explain
your findings.
1
Figure 1: Distributed matrix multiplications using block rotations for both input matrices.
Figure 2: Distributed matrix multiplications using block rotations for one input matrices and block
broadcasts for the other.
Figure 3: Distributed matrix multiplications using block broadcasts for both input matrices.
2
APPENDIX 1: What to Turn in
One compressed archive file (e.g., zip, tar.gz) containing the following items.
– Source code, makefiles and job scripts for both tasks.
– A PDF document containing all answers.
APPENDIX 2: Things to Remember
– Please never run anything that takes more than a minute or uses multiple cores
on login nodes. Doing so may result in account suspension. All runs must be submitted as
jobs to compute nodes (even when you use Cilkview or PAPI).
– Please store all data in your work folder ($WORK), and not in your home folder ($HOME).
3