$30
Exercise 3
Overview
This exercise is designed to help you better understand the lecture material and be prepared for the style of
questions you will get on the exams. The questions are designed to have simple answers. Any explanation
you provide can be brief—at most 3 sentences. You should work on this on your own, since that’s how
things will be when you take an exam.
You will submit an electronic version of this assignment to Gradescope as a PDF file. For those of you
familiar with the LATEX text formatter, you can download the template and configuation files at:
http://www.cs.cmu.edu/˜418/exercises/config-ex3.tex
http://www.cs.cmu.edu/˜418/exercises/ex3.tex
Instructions for how to use this template are included as comments in the file. Otherwise, you can use this
PDF document as your starting point. You can either: 1) electronically modify the PDF, or 2) print it out,
write your answers by hand, and scan it. In any case, we expect your solution to follow the formatting of
this document.
Problem 1: GPU Programming
Your friends have been hearing that you now know how to program a GPU and take advantage of all its
capabilities. Eager to bask in your knowledge, they come to you with different problems that they think
might benefit from GPU parallelism.
One such friend heard that GPUs can launch thousands of parallel “threads.” They want to identify the
indices of a large array of n elements in an even larger array (of size m) using binary search. They suggest
to you that every thread on the GPU will be searching for one element in the array, so that with say, 1024
threads, they’ll be able to search get near 1024× performance.
1
A. First off, how do you explain that threads on a CPU are not the same as threads on a GPU? (What’s
the difference between the implementation of p threads and CUDA threads?) Give at least 3 important
distinctions.
B. Still convinced of their binary search’s potential, they write this kernel function.
// Array S and R are of size n
// Array A is of size m, with its elements ordered.
__global__ void cudaBinarySearch(int n, int m, int *S, int *A, int *R) {
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i >= n) return;
int search = S[i];
int left = 0, right = m - 1, middle = 0;
R[i] = -1; // In case not found
// Binary Search
while (left <= right) {
middle = left + (right - left) / 2;
if (search < A[middle])
right = middle - 1;
else if (search > A[middle])
left = middle + 1;
else
R[i] = middle;
break;
}
}
After testing it out, they find that they get very little performance gain relative to the number of threads
that they are running. Sullenly, they come back to you and ask where they went wrong. Give at least
two reasons for this code’s poor performance.
2
Problem 2: Problem Scaling
This problem is a three-dimensional extension of the grid problem described in the lecture on workloaddriven performance evaluation:
http://www.cs.cmu.edu/˜418/lectures/10 perfeval.pdf.
Consider an iterative solver that operates on a three-dimensional grid of size N × N × N. It requires N
iterations to reach convergence. We refer to the parameter N as the problem size. Increasing N by a factor
of 2 increases the number of grid elements 8× and the number of iterations 2×.
The state of the system is represented as a three-dimensional array of values state. The function update
computes a new value of the state based on the current state. For each element state[x][y][z],
update computes the new value as a function of its current value and that of its six neighbors:
Self: state[x][y][z]
West: state[x-1][y][z]
East: state[x+1][y][z]
North: state[x][y-1][z]
South: state[x][y+1][z]
Down: state[x][y][z-1]
Up: state[x][y][z+1]
(You don’t need to know the exact function.)
The overall operation can then be described in a pseudo-C notation as:
// For each iteration
for (iter = 0; iter < N; iter++) {
// Main computation
for all x, y, z in 0..N
nstate[x][y][z] = update(state, x, y, z);
3
(a) N 2N (b) P 8P
Figure 1: Scaling a problem by doubling the grid size N (a), or by increasing the number of processors 8×
(b).
for all x, y, z in 0..N
state[x][y][z] = nstate[x][y][z]
}
In mapping the computation onto P processors, the state array is split into P cubic blocks, each containing
N3/P array elements. Each iteration (labeled “Main computation”) involves having each processor 1)
communicate the boundary values with its (up to six) neighbors, and 2) computing the update function for
all N3/P of its assigned elements. The program uses M gigabytes of memory per processor and runs in
total time T.
1. Consider the following two scaling possibilities, yielding a problem of size N0
running on P
0 processors, as illustrated in Figure 1.
(a) Double the value of N, while holding P constant: N0 = 2N, P
0 = P
(b) Hold N constant, while increasing P by 8×: N0 = N, P
0 = 8P
Fill in the table below showing how the amounts of computation and communication for a single
processor, performing a single iteration would scale, relative to the original problem. (For example,
2× would indicate growth by a factor of two.)
Computation Communication
(a): N0 = 2N, P
0 = P
(b): N0 = N, P
0 = 8P
2. Now assume both N and P can vary. Based on these specific cases, give formulas for how the computation, communication, and arithmetic intensity would scale as functions of N and P. Again, quantify
these values with respect to the computation and communication each processor would perform during a single iteration. (By way of reference, the formulas for the two-dimensional version were given
in slide 7 of the lecture.)
4
Computation Communication Arithmetic Intensity
3. Suppose we have a machine with 8P processors, each identical to the original P processors. We
consider three scaling possibilities, yielding a new problem of size N0
requiring total time T
0
and M0
gigabytes per processor:
Problem scaling: N0 = N. Solve the same problem faster. Goal is to minimize T
0
Memory scaling: M0 = M. Make full use of the increased total memory. Goal is to maximize N0
.
Time scaling: T
0 = T. Solve a bigger problem in the same amount of time. Goal is to maximize N0
.
Fill in the following table with formulas indicating the problem size N0
, the per-processor memory
requirement M0
, the ideal total time T
0
, and the change in arithmetic intensity. (By way of reference,
this information for the two-dimensional version was given in slides 17, 22, and 25 of the lecture.)
Scaling Type N0 M0 T
0 Arith. Intensity
Problem N
Memory M
Time T
Problem 3: Cache coherency
Your friend suggests modifying the MSI coherence protocol so that PrRd / BusRd behavior on the I-to-S
transition is changed to PrRd / BusRdX, as is shown below:
5
A. Is the memory system still coherent? What impact does this change have on the system?
6
B. You are hired by Intel to design an invalidation based cache-coherent processor with a large number of
cores. The main application for this processor will be to train a chess AI by playing games against itself
for experiments. It is critical that it doesn’t play too many games during training to maintain the validity
of the experiment, so we store how many games are played in a shared counter:
int num_games = 100000;
int games_played = 0;
// This code is run on each core
while (true) {
// Atomically get value of count prior to increment, and
// write incremented value to games_played
int val = atomic_add(&games_played, 1);
if (val > num_games) break;
play_game()
}
Unsure on how to design the cache coherence of this processor, you check with one of the senior designers. She suggests that you should use a bus-based, snooping coherence implementation, rather than a
directory-based protocol, because the broadcasting it performs would be more efficient in this case. Do
you agree or disagree? Why or why not?
7
C. You are now tasked with finding the best performing chess AI. You plan to run the following code on
the processor:
int num_experiments = 10000;
int best_score = 0;
#pragma omp parallel for
for (int i = 0; i < num_experiments; i++) {
int score = perform_experiment(i);
if (score > best_score) {
// Atomically retrieve best score, compute max,
// and write result to best_score
atomic_max(&best_score, score);
}
}
Trying to improve the performance of this code, you remove the if statement prior to the atomic_max,
noticing that it is redundant. Suprisingly, the performance of the program is drastically reduced. Why
did this happen?
8
D. You are hired as a TA for 15-418/618. In an attempt to speed up exam grading, you decide to multithread
the grading program on a cache coherent system as shown below
int scores[NUM_STUDENTS];
#pragma omp parallel for schedule(dynamic)
for (int student = 0; student < NUM_STUDENTS; student++) {
for (int problem = 0; problem < NUM_EXAM_PROBLEMS; problem++) {
// performs a write to scores[student]
grade_and_update(student, problem, &scores[student]);
}
}
Dismayed by the speed of this program, you analyze the runtime of this program and find that there is a
large amount of bus communication during the execution.
What kind of communication is happening in this program? What performance problem is this a result
of?
9