Assignment 1

Overview
Part I. Streaming (35 points)
Part II. MapReduce (65 points)
Submission
Overview
Goal: Gain experience working with streaming algorithms as well as MapReduce workflow systems. Gain
experience simulating work with a distributed file system.
Requirements. You must use Python version 3.5 or later. You do not need a distribution of MapReduce for
this assignment as we will use our own basic simulator (see task II.A.) -- future assignments will use
MapReduce on a cluster.
Templates and Python Libraries. Template code is provided for each part of the assignment. Each template
includes all of the data science, machine learning, or statistics libraries that you may import. Other data
science, machine learning, or statistics-related libraries are prohibited unless listed below -- ask if
unsure. The intention is for you to implement the algorithms we have gone over, and to problem-solve in order
to best understand the concepts of this course and their practical application.
Within the templates, all provided method and class names must be used as provided, with the same
parameters. However, you may add helper methods of your own to keep your code clean.
Additional approved libraries that are not in the template will be listed here (if any):
datetime
random.shuffle

Part I. Streaming (35 points)

Here, you will implement and evaluate the runtime of the typical multi-level sampler as well as the
“streaming” sampler that we went over in class. We call it a “multi-level sampler” because the level of analysis
we wish to sample (e.g. user level) is not the same level at which the data arrives (e.g. transaction level).
Data: to complete these tasks, you will be provided transaction data of the following form:
record_id, date, user_id, amount
record_id -- unique id for the transaction
date -- date of the transaction
user_id -- id for the user making the transaction
amount -- the amount of the transaction
The data is provided in three files:
1) transactions_small.csv
2) transactions_medium.csv [zip]
3) transactions_large.csv [zip; to be released]
Task I.A) Typical non-streaming multi-level sampler (15 points)
Implement the standard non-streaming sampling method (a minimal sketch follows these steps):
Step 1: read the file to pull out the unique user_ids
Step 2: subset to a random 1% of the user_ids
Step 3: read the file again to pull out the records for that 1% of user_ids,
and compute the mean and standard deviation of their amounts
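For illustration only, here is a minimal sketch of the two-pass approach. It assumes the four comma-separated columns described above with no header row; the function name two_pass_sample is ours, and statistics.mean/stdev stand in for whatever the template actually permits. Note the use of random.shuffle, which is on the approved-library list above.

# Minimal sketch of the two-pass (non-streaming) sampler.
# Assumes columns: record_id, date, user_id, amount (no header row).
import csv
import random
from statistics import mean, stdev

def two_pass_sample(filename, pct=0.01):
    # Pass 1: collect the set of unique user_ids.
    with open(filename) as f:
        user_ids = {row[2] for row in csv.reader(f)}

    # Step 2: subset to a random pct of the user_ids.
    ids = sorted(user_ids)
    random.shuffle(ids)  # random.shuffle is on the approved list
    sampled = set(ids[:max(1, int(len(ids) * pct))])

    # Pass 2: pull the amounts belonging to the sampled users only.
    with open(filename) as f:
        amounts = [float(row[3]) for row in csv.reader(f) if row[2] in sampled]

    return mean(amounts), stdev(amounts)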
Task I.B) Streaming multi-level sampler (15 points)
Implement the streaming multi-level sampling code we went over in class, which uses hash functions on
the user_id to read through the file only once. Technically, we are simulating a web stream of data by
taking a single pass over a file instead, but you should see the advantage of this algorithm even for sampling from
files. Record the information that is needed in order to compute the mean and standard deviation. Your
sample should correspond to 2% and 0.5% of the user_ids in each file (approximately, especially in the case
of the small file). Make sure to use a streaming mean and std-dev (see rules in the method description).
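As a rough illustration (not the template's exact interface), the sketch below keeps a record whenever its user_id hashes into the first few of 1,000 buckets (keeping 5 buckets gives roughly a 0.5% sample; 20 gives roughly 2%), and maintains the mean and standard deviation with Welford's online algorithm so no amounts need to be stored. The deterministic md5 hash is our choice, so that every record from a given user lands in the same bucket.

# Minimal sketch of the one-pass streaming multi-level sampler.
import csv
import hashlib

def stream_sample(filename, buckets=1000, keep=5):
    n, mu, m2 = 0, 0.0, 0.0
    with open(filename) as f:
        for row in csv.reader(f):
            user_id, amount = row[2], float(row[3])
            # Deterministic hash: all of a user's records map to one bucket.
            bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % buckets
            if bucket < keep:
                # Welford's online update for the mean and variance.
                n += 1
                delta = amount - mu
                mu += delta / n
                m2 += delta * (amount - mu)
    std = (m2 / (n - 1)) ** 0.5 if n > 1 else 0.0
    return mu, std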
Task I.C) Timing (5 points)
Time wall-clock processing time, in milliseconds, over different sizes of data: small (10,000),
medium (1,000,000), and large (100,000,000) records.
Report runtimes and results for both implementations above, using percentages of .02 and .005 for
each of the three files (small may not have an adequate sample at .005: that’s ok).
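One way to take the wall-clock measurements (time.perf_counter is the standard-library monotonic timer suited to benchmarking; stream_sample is the hypothetical function from the sketch above):

import time

start = time.perf_counter()
result = stream_sample('transactions_small.csv')   # function from the sketch above
elapsed_ms = (time.perf_counter() - start) * 1000  # wall-clock time in milliseconds
print('{:.1f} ms, result: {}'.format(elapsed_ms, result))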
Template Code for Part I. A template to be filled in with your code is provided here:
sampler_lastname_id.py
Part II. MapReduce (65 points)
Here, you will complete a back-end for a MapReduce system and test it on a couple of MapReduce jobs: word
count (provided), and MeanCharsMR (which you must implement). Template code is provided.
Specifically, you must complete:
Task II.A) PartitionFunction (10 points)
Complete the partition function, making sure to use a hash that can handle both integers and strings.
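A minimal sketch of one possibility is below. Python's built-in hash() accepts both ints and strings (string hashes are salted per process, which is fine within a single simulator run), and num_reduce_tasks is an assumed attribute name; match whatever the template provides.

# Sketch of a partitioner: hash() handles both ints and strings, and
# abs(...) % n maps the result onto a valid reducer index.
def partitionFunction(self, key):
    return abs(hash(key)) % self.num_reduce_tasks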
Task II.B) RunSystem (20 points)
Complete the “runSystem(self)” method, which divides the data into chunks and schedules the running of
mapTasks and reduceTasks. There are two places to complete (both are sketched after this list):
(1) Divide up the data into chunks according to num_map_tasks, and launch a map task per chunk.

(2) Send each key-value pair to its assigned reducer.
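A rough standalone sketch of both pieces, assuming the data is a list of (key, value) pairs; the parameter and variable names here (map_task, namenode_m2r, to_reduce_task) are illustrative, not the template's.

import math

def run_system_sketch(data, num_map_tasks, num_reduce_tasks, map_task, partition):
    # (1) Divide the data into num_map_tasks chunks; run one map task per chunk.
    chunk_size = max(1, math.ceil(len(data) / num_map_tasks))
    namenode_m2r = []  # collects the (key, value) pairs emitted by map tasks
    for i in range(0, len(data), chunk_size):
        map_task(data[i:i + chunk_size], namenode_m2r)

    # (2) Send each mapped key-value pair to its assigned reducer's input list.
    to_reduce_task = [[] for _ in range(num_reduce_tasks)]
    for key, value in namenode_m2r:
        to_reduce_task[partition(key, num_reduce_tasks)].append((key, value))
    return to_reduce_task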
Task II.C) Combiner (15 points)
Edit the “MapTask” method to add support for running a Combiner. Look for “#<<COMPLETE”
within the method. Remember, a combiner runs the reduce task at the end of the map task in order to
save the communication cost of sending to multiple reducers. Note: main will run the WordCountBasicMR
to test with and without the combiner. It is recommended that you look over WordCountBasicMR
to understand what it is doing. You can assume your combiner code will only run with reduce functions that are
both commutative and associative (see hint at bottom).
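To make the idea concrete, here is a standalone sketch of a combiner pass over one map task's output; the names mapped_pairs and reduce_fn are illustrative, and the last line assumes reduce_fn(key, values) returns a single combined (key, value) pair.

from collections import defaultdict

def combine(mapped_pairs, reduce_fn):
    # Group this map task's own output by key...
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    # ...then reduce each group locally, so each key crosses the network
    # once per map task instead of once per record. This is safe only
    # because the reduce function is commutative and associative.
    return [reduce_fn(key, values) for key, values in grouped.items()]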
Task II.D) MeanCharsMR (20 points)
Edit the “map” and “reduce” methods of “MeanCharsMR” to implement a map-reduce computation of
the mean and standard deviation of the number of occurrences of each character (a-z, case insensitive) per document.
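As a rough sketch (the signatures below are assumptions; follow the template's), map can emit a count for every letter, zeros included, so that reduce sees one value per document for each letter, and the per-letter mean and standard deviation fall out directly:

from statistics import mean, pstdev
from string import ascii_lowercase

def map_fn(doc_id, text):
    # Emit (letter, count-in-this-document) for all 26 letters, zeros
    # included, so every document contributes a value for every letter.
    text = text.lower()
    return [(c, text.count(c)) for c in ascii_lowercase]

def reduce_fn(letter, counts):
    # counts holds one per-document count for this letter across all docs.
    # pstdev is the population std-dev; swap in stdev for the sample version.
    return (letter, (mean(counts), pstdev(counts)))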
Template Code for Part II. A template to be filled in with your code is provided here:
MRSystemSimulator2020_lastname_id.py
Submission
Please use Blackboard to submit two files, each named with your lastname and student id:
1. sampler_<lastname>_<id>.py
2. MRSystemSimulator2020_<lastname>_<id>.py
Do not upload a zip file. Double-check that your files are there and correct after uploading, and make
sure to submit. Uploading files that are zips or any type other than Python code files will result in the
submission being considered invalid. Partially uploaded files or non-submitted files will count as
unsubmitted.
Questions: Please post questions to the course Piazza page.
Hints
As questions come in this location will be used to track suggestions.
● Combiners require a reduce function that is both commutative and associative:
f(v1, v2) = f(v2, v1) and f(f(v1, v2), v3) = f(v1, f(v2, v3))