$30
Homework:
Using the CAC flux computer, run an efficient general purpose program to use p cores to solve
a PDE on an n × n double-precision real-valued matrix A(0:n−1, 0:n−1). Do this for n = 1000
and 5000 and for p = 1, 4, 16, and 36. The same program must be used for all of the runs, and
should work no matter what the entries of A are and no matter what n 2 and p 0 are, for p
a perfect square ≤ 9 · n
2
. If you are using row or column partitioning, instead of partitioning into
squares, then it should work for arbitrary p ≤ 9 · n
2
, not necessarily a perfect square. The factor of
9 is included to insure that there are no pathological cases.
Warning: do not put this off until the last minute, and debug using small matrices.
Procedure: For each n and p pair,
1. Initialize A using the definition below.
2. Use MPI BARRIER, then have the root process call MPI WTIME.
3. Do 500 iterations, defined below.
4. Compute the verification sum (see below) and send it to the root process.
5. Use MPI BARRIER, then have the root process call MPI WTIME.
6. Report the elapsed time and the verification.
7. There are significant timing differences between runs. Therefore, for each n and p pair, do 3
different runs (not three timings on the same run) and report all 3.
Turn in a program listing, the timing and verification outputs, and a report.
To generate A: the initial values are defined as follows:
• A(0, j) = 0.0 for 0 ≤ j ≤ n − 1
• A(n−1, j) = 5 sin(π(j/n)
2
) for 0 ≤ j ≤ n − 1
• All entries except the above are 0.5
Each entry of A can start on whatever processor you choose. If you want multiple copies of an
entry, you must copy the value from its initial processor, and this must occur after the barrier.
Iterative step: In each iteration,
• A(0, j) and A(n−1, j) are unchanged for 0 ≤ j ≤ n − 1
• for all other entries, the new value of A(i, j) is the average of Ao(i, j) and the Ao values of
the 8 neighbors of (i, j), where Ao denotes the value from the previous iteration. Be careful
that you don’t use values from the current iteration. A neighbor of (i, j) is any location
(i
′
, j′
) 6= (i, j) where i
′ = i or i
′ = i ± 1 and j
′ = j or j
′ = j ± 1 mod n. Using the mod
function eliminates the boundary condition in the second dimension, turning the matrix into
the surface of a cylinder.
To verify a run: Calculate X
0≤i≤n−1
A(i,i).
The report: Turn in a short report which briefly describes how you decomposed the matrix, how
you did communication, and the timing result. Analyze whether the timing represents perfect
speedup and scaling, and, if not, why is the program not achieving it (or, if you have superlinear
scaling, why that occurred). Note that you have scaling in terms of the size of the matrix, as well
as in the number of processors. The report needs to be typewritten, though you can do drawings
by hand. You will be graded on correctness, the quality of your report, and achieved performance.
Since performance is important, you should make sure you do serial and parallel optimizations
discussed in class. Please keep the report short.