$30
Electrical and Computer Engineering / Software Engineering
CPrE/SE 419: SOFTWARE TOOLS FOR LARGE-SCALE DATA ANALYSIS, SPRING 2022
LAB #3: SORTING
Purpose
In this lab, your goal is to write a program using MapReduce that can sort a large data set quickly.
Sorting is a common task on large data sets, and you will be working on a standard input data set, which
is used as a benchmark for comparing different implementations of sort.
At the end of the lab, you will be able to write an algorithm and implementation for sorting a large data
set, and measure its performance on various input data sets.
Constraints:
• Use no more than 10 reducers at any MapReduce stage.
• In your program, make sure to delete all the temporary directories which contains the
intermediate data before exiting.
Metric of Success:
• The output of your program could be one or more sorted lists. Suppose you have n reducers, e.g.
R1, R2, …, Rn, all data in R1 need to be smaller than all data in R2, and all data in R2 need to be
smaller than R3 and so on.
Submission (50 points)
Create a zip (or tar) archive and hand it in through Canvas. Your submission includes:
● A writeup report:
• The results for each experiment including top 5 and last 5 lines for each partition.
• The strategy you used for experiment 2.
• The arguments for your jar file to run, for each experiment.
• Any other resources to support your work.
● Commented Code for your program. Include all source files needed for compilation.
Resource
• Lecture notes on sorting and Partitioner in MR
• http://hadoop.apache.org/docs/r3.1.1/api/org/apache/hadoop/mapreduce/Partitioner.html
• https://hadoop.apache.org/docs/r3.1.1/api/org/apache/hadoop/mapreduce/lib/partition/Total
OrderPartitioner.html
Electrical and Computer Engineering / Software Engineering
Template Code
• sortingTemplate.java on Canvas
• Chapter 9 MapReduce Features: Total Sort, Hadoop-The Definitive Guide
Dataset
We have multiple datasets in cybox, https://iastate.box.com/s/3l34r2d3yvssxzzbzxvfn1zwt9yve0uk , to
play with, such as “input-50k”, “input-500k” and etc. Each dataset is made of multiple lines, one line per
record. Each record has a key (first 15 chars) and a payload (40 chars), separated by tab character. The
final output should be sorted by key, and the output should also include the payload for the key.
Example of the input data:
Experiment 1 (25 points)
Sort the data in “input-5m” by keys, using TotalOrderPartitioner. Use up to 10 reducers.
Experiment 2 (25 points)
Write your own “MyPartitioner” class and use it to sort the data in input-500k. You MUST NOT use
InputSampler class. Explain your algorithm/strategy to partition in the submission. Use up to 10
reducers.
Bonus (5 points)
Question: We have a list of datasets (see below), sorted by their size. Try TotalOrderPartitioner and your
own partitioner (MyPartitioner) on each dataset. Do you get different results? What makes it different?
Which is the largest dataset your solution can sort? You can get the bonus if you can sort 5m dataset by
using only up to 10 reducers and your own partitioner.