Exercise 1 Instruction-Level Parallelism


Overview
This exercise is designed to help you better understand the lecture material and be prepared for the style of
questions you will get on the exams. The questions are designed to have simple answers. Any explanation
you provide can be brief—at most 3 sentences. You should work on this on your own, since that’s how
things will be when you take an exam.
You will submit an electronic version of this assignment to Gradescope as a PDF file. For those of you
familiar with the LaTeX text formatter, you can download the template and configuration files at:
http://www.cs.cmu.edu/~418/exercises/config-ex1.tex
http://www.cs.cmu.edu/~418/exercises/ex1.tex
Instructions for how to use this template are included as comments in the file. Otherwise, you can use this
PDF document as your starting point. You can either: 1) electronically modify the PDF, or 2) print it out,
write your answers by hand, and scan it. In any case, we expect your solution to follow the formatting of
this document.
Problems
Consider the following code where each line within the function represents a single instruction.
typedef struct {
    float x;
    float y;
} point;

inline void innerProduct(point *a, point *b, float *result)
{
    float x1 = a->x;                   // Uses a load instruction
    float x2 = b->x;
    float product1 = x1 * x2;
    float y1 = a->y;
    float y2 = b->y;
    float product2 = y1 * y2;
    float inner = product1 + product2;
    *result = inner;                   // Uses a store instruction
}

void computeInnerProduct(point A[], point B[], float result[], int N)
{
    for (int i = 0; i < N; i++)
        innerProduct(&A[i], &B[i], &result[i]);
}
In the following questions, you can assume the following:
• N is very large (> 10^6).
• The machines described have modern CPUs, providing out-of-order execution, speculative execution,
branch prediction, etc.
– There are ample resources for fetching, decoding, and committing instructions. The only performance limitations are due to the number, capabilities, and latencies of the execution units.
– The branch prediction is perfect.
• There are no cache misses.
• The overhead of updating the loop index i is negligible.
• The load/store units perform any necessary address arithmetic.
• The overhead due to procedure calls, as well as starting and ending loops, is negligible.
Problem 1: Instruction-Level Parallelism
Suppose you have a machine M1 with two load/store units that can each load or store a single value on
each clock cycle, and one arithmetic unit that can perform one arithmetic operation (e.g., multiplication or
addition) on each clock cycle.
A. Assume that the load/store and arithmetic units have latencies of one cycle. How many clock cycles
would be required to execute computeInnerProduct as a function of N? Explain what limits the
performance.
B. Now assume that the load/store and arithmetic units have latencies of 10 clock cycles, but they are fully
pipelined, able to initiate new operations every clock cycle. How many clock cycles would be required
to execute computeInnerProduct as a function of N? Explain how this relates to your answer to
part A.
Problem 2: SIMD with ISPC
Consider running the following ISPC code.
export void computeInnerProductISPC(uniform point A[],
                                    uniform point B[],
                                    uniform float result[],
                                    uniform int N)
{
    foreach (i = 0 ... N)
    {
        result[i] = A[i].x * B[i].x + A[i].y * B[i].y;
    }
}
Suppose machine M2 has one 8-wide SIMD load/store unit, and one 8-wide SIMD arithmetic unit. Both
have latencies of one clock cycle.
A. How many clock cycles would be required to execute computeInnerProductISPC as a function
of N? Explain what limits the performance.
B. If we were to run the code shown in computeInnerProductISPC on a five-core machine M3,
where each core has the same SIMD capabilities as M2, what would be the best speedup it could achieve
over the single-core performance of part A? Explain.
Problem 3: SIMD, Multicore, and Multi-Threaded Parallelism with ISPC
A. Consider the five-core machine M3 described in Problem 2B. Suppose you could write multi-threaded
code where there is no overhead associated with the threads. Each thread would run the function
computeInnerProductISPC to compute some subset of the N elements. What is the maximum
speedup you could achieve relative to the single-threaded code running on machine M2?
B. Now suppose we have a machine M4, identical to M3, except that each core supports three-way simultaneous
multithreading. What is the maximum speedup your multithreaded code could achieve relative
to what it achieved running on machine M3?
C. Let N = 10^6. Running on machine M3, if we were to write a Pthreads program that spawns 250 threads,
each computing 4000 inner products using computeInnerProductISPC, would this program get
an overall performance improvement over one that uses a single thread to compute all N inner products?
(Use your own intuition about the cost of spawning new threads in Pthreads when answering this
question.) Explain.
D. Let N = 10^6. Running on machine M3, if we were to write an ISPC program that launches 250 tasks,
each computing 4000 inner products using computeInnerProductISPC, would this program get
an overall performance improvement over one that uses a single task to compute all N inner products?
(Consider what you know about the performance characteristics of ISPC tasks.) Explain.