$30
Asmt 5: Frequent Items
100 points
Overview
In this assignment you will explore finding frequent items in data sets, with emphasis on streaming techniques designed to work at enormous scale. For simplicity you will work on more manageably sized data
sets, and simulate the stream by just processing with a for loop.
You will use two data sets for this assignment:
• http://www.cs.utah.edu/˜jeffp/teaching/cs5140/A5/S1.txt
• http://www.cs.utah.edu/˜jeffp/teaching/cs5140/A5/S2.txt
The first data set S1 has a set of m = 3,000,000 characters, and the second one S2 has m = 4,000,000
characters. The order of the file represents the order of the stream.
As usual, it is highly recommended that you use LaTeX for this assignment. If you do not, you may
lose points if your assignment is difficult to read or hard to follow. Find a sample form in this directory:
http://www.cs.utah.edu/˜jeffp/teaching/latex/
1 Streaming Algorithms
A (40 points): Run the Misra-Gries Algorithm (see L11.3.1) with (k − 1) = 9 counters on streams S1
and S2. Report the output of the counters at the end of the stream. In addition to each counter report the
estimated ratio of each label using the estimated count/m.
In each stream, use just the counters to report which characters might occur at least 20% of the time, and
which must occur at least 20% of the time.
B (40 points): Build a Count-Min Sketch (see L12.1.1) with k = 10 counters using t = 5 hash functions.
Run it on streams S1 and S2.
For both streams, report the estimated counts for characters a, b, and c. In addition to each counter report
the estimated ratio of each of these labels using the estimated count/m. Just from the output of the sketch,
with probably 1−δ = 31/32 (that is assuming the randomness in the algorithm does not do something bad),
which objects might occur at least 20% of the time, and which objects must occur at least 20% of the time.
C (10 points): How would your implementation of these algorithms need to change (to answer the same
questions) if each object of the stream was a “word” seen on Twitter, and the stream contained all tweets
concatenated together?
D (10 points): Describe one advantage of the Count-Min Sketch over the Misra-Gries Algorithm.
2 BONUS
The exact heavy-hitter problem is as follows: return all objects that occur more than 10% of the time. It
cannot return any false positives or any false negatives. In the streaming setting, this requires Ω(min{m, n})
space if there are n objects that can occur and the stream is of length m.
CS 6140/CS 5140 Data Mining; Spring 2020 Instructor: Jeff M. Phillips, U. of Utah
A: (1 point) A 2-Pass streaming algorithm is one that is able to read all of the data in-order exactly twice,
but still only has limited memory. Describe a small space algorithm to solve the exact heavy hitter problem,
i.e., with ε = 0, (say for φ = 10% threshold).
B: (2 points) Provide an upper bound for the probability that at least one student gets a really bad estimate
for the CountMin Sketch where the count estimate for a, b, and c are all the same since they always collide.
As in the question 1B use k = 10 and t = 5, and assume perfect hash functions. And take into account that
there are n = 125 students in the class, and this is a problem if it happens to any one of them.