Starting from:

$30

Homework 3: All-Pairs Shortest Path

Homework 3: All-Pairs Shortest Path

Contents
Goal
Requirements
Command line specification
Input specification
Output specification
Grading
Submission
Resources
Blocked Floyd-Warshall algorithm
Final Notes
Goal
1. Understand the all-pairs shortest path problem.
2. Get familiar with CUDA by implementing the blocked Floyd-Warshall algorithm.
Requirements
In this assignment, you are asked to implement two versions of all-pairs shortest path.
hw3a. Thread version. You may use pthread or std::thread. The code should run on apollo. You can use any
algorithm for shortest path.
hw3b. CUDA version. The code should run on hades. You should use the blocked Floyd-Warshall algorithm.
Both versions follow the same input & output format.
Command line specification
For hw3a on apollo:
srun -N1 -n1 -c12 ./hw3a inputfile outputfile
For hw3b on hades:
srun --gres=gpu:1 -pipl ./hw3b inputfile outputfile
Where inputfile and outputfile are the paths to the corresponding files. The format of the input/output
files are specified below.
Input specification
The input is a binary file containing a directed graph with non-negative edge distances.
The first two integers mean number of vertices (V) and number of edges (E).
After that, there are the list of edges. Each edge contains three integers: source vertex id ( ) ,
destination vertex id ( ), edge weight ( ).
All values are non-negative integers. The values of vertex indexes start at 0.
The ranges for the input is:
srci
dsti wi
for correctness
for performance testing
(there will not be repeated edges)
Here’s an example:
offset type decimal value description
0000 32-bit integer 3 # vertices (V)
0004 32-bit integer 6 # edges (E)
0008 32-bit integer 0 src id for edge 0
0012 32-bit integer 1 dst id for edge 0
0016 32-bit integer 3 edge 0’s distance
0020 32-bit integer src id for edge 1
… … … …
0076 32-bit integer edge 5’s distance
Output specification
The output file is also in binary format.
For an input file with vertices, you should output an output file containing integers.
The first integers should be the shortest path distances for starting from edge 0:
; then the following integers would be the
shortest path distances starting from edge 1: ; and so
on, totaling integers.
where .
If there is no valid path between i→j, please output with: .
Example output file:
offset type decimal value description
0000 32-bit integer 0 dist(0, 0)
0004 32-bit integer ? dist(0, 1)
0008 32-bit integer ? dist(0, 2)
… … … …
4V2
-8 32-bit integer ? min dist(V-1, V-2)
4V2
-4 32-bit integer 0 min dist(V-1, V-1)
Grading
1. hw3a (30%)
An unknown number of test cases will be used to test your implementation. You get 30 points for hw3a if
you passed all the test cases, points if there are failed test cases. There will be hidden
testcases.
For each test case, you pass it if:
2 ≤ V ≤ 10000
2 ≤ V ≤ 45000
0 ≤ E ≤ V × (V − 1)
0 ≤ srci , dsti < V
srci ≠ dsti
if srci = srcj then dsti ≠ dstj
0 ≤ wi < 1000
V V
2
V
dist(0, 0), dist(0, 1), dist(0, 2),…, dist(0, V − 1) V
dist(1, 0), dist(1, 1), dist(1, 2),…, dist(1, V − 1)
V
2
dist(i, j) = 0 i = j
dist(i, j) = 2 − 1 = 1073741823
30
max(0, 30 − 2k) k
The answer is correct.
The execution time of your implementation is shorter than , where is the
execution time of TA’s sequential code.
2. hw3b correctness (30%)
An unknown number of test cases will be used to test your implementation. You get 30 points for hw3a if
you passed all the test cases, points if there are failed test cases. There will be hidden
testcases.
For each test case, you pass it if:
The answer is correct.
The execution time of your implementation is shorter than 30 seconds.
Note that for the purpose for correctness testing, .
3. hw3b performance (30%)
Points are given according to the largest test case (largest ) you can complete within 30 seconds.
3. Demo (10%)
Each student is given 7 minutes to explain the implementation followed by some questions from TA.
Points are given according to your understanding and explanation of your code, and your answers of
the TA questions.
Submission¶
Put these source code under at ~/homework/hw3 in apollo31, then run hw3a-judge:
Your Makefile for hw3a – ~/homework/hw3/Makefile
Source code for thread version – ~/homework/hw3/hw3a.cc or ~/homework/hw2/hw3a.c
Put these source code under at ~/homework/hw3 in hades01, then run hw3b-judge:
Your Makefile for hw3b – ~/homework/hw3/Makefile
Source code for the CUDA version – ~/homework/hw3/hw3b.cu
make hw3a and make hw3b will be used to compile your code.
Use hw3a-judge and hw3b-judge to check and submit your code before the deadline. You can submit as
many times as you want. You need to answer y when judging to update your submitted code.
Resources
Sequential Blocked FW source code – hades:/home/ipl19/y/hw3/seq.cc
Recommended compiler flags
hw3a: g?? -O3 -pthread -std=c++11 -Wall
hw3b: nvcc -O3 -std=c++11 -Xptxas=-v -arch=sm_61
No example Makefile for you this time!
Scoreboard location: http://ipl.cs.nthu.edu.tw/s/hw3
Correctness Testcases: /home/ipl19/y/hw3/c
Performance Testcases: /home/ipl19/y/hw3/p, only on hades
Use the hw3-cat command to view the binary test cases in text format.
Blocked Floyd-Warshall algorithm
Tseq ÷ 6 + 10 seconds Tseq
max(0, 30 − 2k) k
V ≤ 10000
V
Given an matrix where represents the distance (weight of the edge) from a
vertex to a vertex in a directed graph with vertices. We define an matrix where
denotes the shortest-path distance from a vertex to a vertex . Let be the result
which all the intermediate vertices are in the set .
We define as the following:
The matrix gives the answer to the all-pairs shortest path problem.
In the blocked all-pairs shortest path algorithm, we partition into blocks of
submatrices. The number is called the blocking factor. For instance, in figure 1, we divide a matrix
into submatrices (or blocks) by .
Figure 1: Divide a matrix by B = 2
The blocked version of the Floyd-Warshall algorithm will perform rounds, and each round is divided
into 3 phases. It performs iterations in each phase.
Assuming a block is identified by its index , where . The block with index is
denoted by .
In the following explanation, we assume and . The execution flow is described step by step as
follows:
Phase 1: self-dependent blocks.
In the -th iteration, the first phase is to compute the pivot block .
For instance, in the 1st iteration, is computed as follows:
V × V W = [w(i, j)] w(i, j) ≥ 0
i j V V × V D = [d(i, j)]
d(i, j) i j D = [ (i, j)]
(k) d
(k)
{1, 2, . . . , k}
d (i, j)
(k)
d (i, j) = {
(k) w(i, j)
min(d (i, j), (i, k − 1) + (k − 1, j))
(k−1) d
(k−1) d
(k−1)
if k = 0;
if k ≥ 1.
D = (i, j)
(V ) d
(V )
D ⌈V /B⌉ × ⌈V /B⌉ B × B
B 6 × 6
3 × 3 B = 2
⌈V /B⌉
B
(I, J) 0 ≤ I, J < ⌈V /B⌉ (I, J)
D
(k)
(I,J)
N = 6 B = 2
k B × B D
(k⋅B)
(k−1,k−1)
D
(2)
(0,0)
d (0, 0) = min( (0, 0), (0, 0) + (0, 0))
(1) d
(0) d
(0) d
(0)
d (0, 1) = min( (0, 1), (0, 0) + (0, 1))
(1) d
(0) d
(0) d
(0)
d (1, 0) = min( (1, 0), (1, 0) + (0, 0))
(1) d
(0) d
(0) d
(0)
d (1, 1) = min( (1, 1), (1, 0) + (0, 1))
(1) d
(0) d
(0) d
(0)
d (0, 0) = min( (0, 0), (0, 1) + (1, 0))
(2) d
(1) d
(1) d
(1)
d (0, 1) = min( (0, 1), (0, 1) + (1, 1))
(2) d
(1) d
(1) d
(1)
d (1, 0) = min( (1, 0), (1, 1) + (1, 0))
(2) d
(1) d
(1) d
(1)
d (1, 1) = min( (1, 1), (1, 1) + (1, 1))
(2) d
(1) d
(1) d
(1)
Note that the result of depends on the result of and therefore cannot be computed in parallel with
the computation of .
Phase 2: pivot-row and pivot-column blocks.
In the -th iteration, it computes all and where .
The result of pivot-row / pivot-column blocks depend on the result in phase 1 and itself.
For instance, in the 1st iteration, the result of depends on and :
Phase 3: other blocks.
In the -th iteration, it computes all where .
The result of these blocks depends on the result from phase 2 and itself.
For instance, in the 1st iteration, the result of depends on and :
Figure 2: The 3 phases of blocked FW algorithm in the first 1st iteration.
d
(2) d
(1)
d
(1)
k D
(k⋅B)
(h,k) D
(k⋅B)
(k,h) h ≠ k
D
(2)
(0,2) D
(2)
(0,0) D
(0)
(0,2)
d (0, 4) = min( (0, 4), (0, 0) + (0, 4))
(1) d
(0) d
(2) d
(0)
d (0, 5) = min( (0, 5), (0, 0) + (0, 5))
(1) d
(0) d
(2) d
(0)
d (1, 4) = min( (1, 4), (1, 0) + (0, 4))
(1) d
(0) d
(2) d
(0)
d (1, 5) = min( (1, 5), (1, 0) + (0, 5))
(1) d
(0) d
(2) d
(0)
d (0, 4) = min( (0, 4), (0, 1) + (1, 4))
(2) d
(1) d
(2) d
(1)
d (0, 5) = min( (0, 5), (0, 1) + (1, 5))
(2) d
(1) d
(2) d
(1)
d (1, 4) = min( (1, 4), (1, 1) + (1, 4))
(2) d
(1) d
(2) d
(1)
d (1, 5) = min( (1, 5), (1, 1) + (1, 5))
(2) d
(1) d
(2) d
(1)
k D
(k⋅B)
(h1 ,h2 ) h1
, h2 ≠ k
D
(2)
(1,2) D
(2)
(1,0) D
(2)
(0,2)
d (2, 4) = min( (2, 4), (2, 0) + (0, 4))
(1) d
(0) d
(2) d
(2)
d (2, 5) = min( (2, 5), (2, 0) + (0, 5))
(1) d
(0) d
(2) d
(2)
d (3, 4) = min( (3, 4), (3, 0) + (0, 4))
(1) d
(0) d
(2) d
(2)
d (3, 5) = min( (3, 5), (3, 0) + (0, 5))
(1) d
(0) d
(2) d
(2)
d (2, 4) = min( (2, 4), (2, 1) + (1, 4))
(2) d
(1) d
(2) d
(2)
d (2, 5) = min( (2, 5), (2, 1) + (1, 5))
(2) d
(1) d
(2) d
(2)
d (3, 4) = min( (3, 4), (3, 1) + (1, 4))
(2) d
(1) d
(2) d
(2)
d (3, 5) = min( (3, 5), (3, 1) + (1, 5))
(2) d
(1) d
(2) d
(2)
Figure 3: The computations of , and their dependeicies in the first iteration.
Figure 4: In this particular example where and , we will require rounds.
Final Notes
Contact TA via afg984@gmail.com or iLMS immediately if you find any problems with the homework
specification, judge scripts, example source code or the test cases.
You are allowed to discuss and exchange ideas with others, but you are required to write the code on your
own. You’ll got 0 points if we found you cheating.
D
(2)
(0,2) D
(2)
(1,2)
V = 6 B = 2 ⌈V /B⌉ = 3

More products