$29
Problem 1
(10 pts) Given the balanced binary trees T1 and T2, which contain m and n elements
respectively, we want to determine whether they have some particular key in common.
Assume an adversarial sequence that placed the m and n items into the two trees.
(A) Suppose our algorithm traverses each node of T1 using an in-order traversal and
checks if the key corresponding to the current node traversed exists in T2. Express
the asymptotic running time of this procedure, in terms of m and n.
(B) Now suppose our algorithm first allocates a new hash table H1 of size m (assume
H1 uses a uniform hash function) and then inserts every key in T1 into H1 during
a traversal of T1. Then, we traverse the tree T2 and search for whether the key of
each node traversed already exists in H1. Give the asymptotic running time of this
algorithm in the average case. Justify your answer.
Problem 2
(45 pts) A good hash function h(x) behaves in practice very close to the uniform hashing
assumption analyzed in class, but is a deterministic function. That is, h(x) = k each time
x is used as an argument to h(). Designing good hash functions is hard, and a bad hash
function can cause a hash table to quickly exit the sparse loading regime by overloading
some buckets and under loading others. Good hash functions often rely on beautiful and
complicated insights from number theory, and have deep connections to pseudorandom
number generators and cryptographic functions. In practice, most hash functions are
moderate to poor approximations of uniform hashing.
Consider the following two hash functions. Let U be the universe of strings composed
of the characters from the alphabet P = [A; · · · ; Z], and let the function f(xi) return the
index of a letter xi 2 P, e.g., f(A) = 1 and f(Z) = 26. Finally, for an m-character string
x 2 Pm, define h1(x) = ([Pm i=1 f(xi)] mod l), where l is the number of buckets in the
hash table. For the other hash function, let the function f2(xi; ai) return xi∗ai, where ai is
a uniform random integer, x 2 [0; · · · ; l − 1], and define h2(x) = ([Pm i=1 f(xi; ai)] mod l).
That is, the first hash function sums up the index values of the characters of a string x
and maps that value onto one of the l buckets, and the second hash function is a universal
hash function.
(A) There is a txt file on Moodle that contains US Census derived last names: Using
these names as input strings, first choose a uniformly random 50% of these name
strings and then hash them using h1(x) and h2(x).
See python code.
(B) Produce a histogram showing the corresponding distribution of hash locations for each
hash function when l = 5701. Label the axes of your figure. Brief description what
the figure shows about h1(x) and h2(x); justify your results in terms of the behavior
of h(x). Hint: the raw file includes information other than the name strings, which
will need to be removed; and, think about how you can count hash locations without
building or using a real hash table
(C) Enumerate at least 4 reasons why h1(x) is a bad hash function relative to the ideal
behavior of uniform hashing
(D) Produce a plot showing (i) the length of the longest chain (were we to use chaining
for resolving collisions) as a function of the number n of these strings that we hash
into a table with l = 5701 buckets. Produce another plot showing the number of
collisions as a function of l. Choose prime numbers for l and comment on how
collisions decrease as l increases.