$30
Asmt 3: Distances and LSH
100 points
Overview
In this assignment you will explore LSH and Euclidean distances.
You will use a data set for this assignment:
• http://www.cs.utah.edu/˜jeffp/teaching/cs5140/A3/R.csv
It is recommended that you use LaTeX for this assignment (or other option that can properly digitally
render math). If you do not, you may lose points if your assignment is difficult to read or hard to follow.
Find a sample form in this directory: http://www.cs.utah.edu/˜jeffp/teaching/latex/
1 Choosing r, b (35 points)
Consider computing an LSH using t = 160 hash functions. We want to find all object pairs which have
Jaccard similarity above τ = .85.
A: (15 points) Use the trick mentioned in class and the notes to estimate the best values of hash functions
b within each of r bands to provide the S-curve
f(s) = 1 − (1 − s
b
)
r
with good separation at τ . Report these values.
B: (15 points) Consider the 4 objects A, B, C, D, with the following pair-wise similarities:
A B C D
A 1 0.77 0.25 0.33
B 0.77 1 0.20 0.55
C 0.25 0.20 1 0.91
D 0.33 0.55 0.91 1
Using your choice of r and b and f(·), what is the probability of each pair of the four objects for being
estimated to having similarity greater that τ = 0.85? Report 6 numbers. (Show your work.)
2 Generating Random Directions (30 points)
A: (10 points) Describe how to generate a single random unit vector in d = 10 dimensions using only the
operation u ← unif(0, 1) which generates a uniform random variable between 0 and 1. (This can be called
multiple times.)
B: (20 points) Generate t = 160 unit vectors in R
d
for d = 100. Plot of cdf of their pairwise dot products
(yes, you need to calculate
t
2
?
dot products).
CS 6140/CS 5140 Data Mining; Spring 2020 Instructor: Jeff M. Phillips, U. of Uta
3 Angular Hashed Approximation (35 points)
Consider the n = 500 data points in R
d
for d = 100 in data set R, given at the top. We will use the angular
similarity, between two vectors a, b ∈ R
d
:
sang(a, b) = 1 −
1
π
arccos(ha, ¯
¯bi)
If a, b are not unit vectors (e.g., in S
d−1
), then we convert them to a¯ = a/kak2 and ¯b = b/kbk2. The
definition of sang(a, b) assumes that the input are unit vectors, and it takes a value between 0 and 1, with as
usual 1 meaning most similar.
A: (15 points) Compute all pairs of dot products (Yes, compute
n
2
?
values), and plot a cdf of their angular
similarities. Report the number with angular similarity more than τ = 0.85.
B: (20 points) Now compute the dot products and angular similarities among
t
2
?
pairs of the t random
unit vectors from Q2.B. Again plot the cdf, and report the number with angular similarity above τ = 0.85.
4 Bonus (3 points)
Implement the banding scheme with your choice of r, b, using your t = 160 random vectors, to estimate the
pairs with similarity above τ = 0.85 in the data set R. Report the fraction found above τ = 0.85. Compare
the runtime of this approach versus a brute force search.