$30
Asmt 2: Document Similarity and Hashing
100 points
Overview
In this assignment you will explore the use of k-grams, Jaccard distance, min hashing, and LSH in the
context of document similarity.
You will use four text documents for this assignment:
• http://www.cs.utah.edu/˜jeffp/teaching/cs5140/A2/D1.txt
• http://www.cs.utah.edu/˜jeffp/teaching/cs5140/A2/D2.txt
• http://www.cs.utah.edu/˜jeffp/teaching/cs5140/A2/D3.txt
• http://www.cs.utah.edu/˜jeffp/teaching/cs5140/A2/D4.txt
As usual, it is recommended that you use LaTeX for this assignment. If you do not, you may lose points
if your assignment is difficult to read or hard to follow. Find a sample form in this directory: http:
//www.cs.utah.edu/˜jeffp/teaching/latex/
1 Creating k-Grams (50 points)
You will construct several types of k-grams for all documents. All documents only have at most 27 characters: all lower case letters and space. Yes, the space counts as a character in character k-grams.
[G1] Construct 2-grams based on characters, for all documents.
[G2] Construct 3-grams based on characters, for all documents.
[G3] Construct 2-grams based on words, for all documents.
Remember, that you should only store each k-gram once, duplicates are ignored.
A: (25 points) How many distinct k-grams are there for each document with each type of k-gram? You
should report 4 × 3 = 12 different numbers.
B: (25 points) Compute the Jaccard similarity between all pairs of documents for each type of k-gram.
You should report 3 × 6 = 18 different numbers.
2 Min Hashing (50 points)
We will consider a hash family H so that any hash function h ∈ H maps from h : {k-grams} → [m] for m
large enough (To be extra cautious, I suggest over m ≥ 10,000; but should work with smaller m too).
A: (35 points) Using grams G2, build a min-hash signature for document D1 and D2 using t = {20, 60, 150, 300, 600}
hash functions. For each value of t report the approximate Jaccard similarity between the pair of documents
D1 and D2, estimating the Jaccard similarity:
JSˆ
t(a, b) = 1
t
X
t
i=1 (
1 if ai = bi
0 if ai 6= bi
.
You should report 5 numbers.
CS 6140/CS 5140 Data Mining; Spring 2020 Instructor: Jeff M. Phillips, U. of Utah
B: (15 point) What seems to be a good value for t? You may run more experiments. Justify your answer
in terms of both accuracy and time.
3 Bonus (3 points)
Describe a scheme like Min-Hashing over a domain of size n for the Andberg Similarity, defined Andb(A, B) =
|A∩B|
|A∪B|+|A4B|
. That is so given two sets A and B and family of hash functions, then Prh∈H[h(A) = h(B)] =
Andb(A, B). Note the only randomness is in the choice of hash function h from the set H, and h ∈ H represents the process of choosing a hash function (randomly) from H. The point of this question is to design
this process, and show that it has the required property.
Or show that such a process cannot be done.