COMS 4771 HW3
You are allowed to write up solutions in groups of (at most) two students. These group members
do not necessarily have to be the same as in previous homeworks. Only one submission per group
is required by the due date on Gradescope. The name and UNI of all group members must be clearly
specified on the homework. No late homeworks are allowed. To receive credit, a typeset copy of
the homework PDF must be uploaded to Gradescope by the due date. You must show your work to
receive full credit. Discussing possible solutions for homework questions is encouraged on Piazza
and with peers outside your group, but every group must write their own individual solutions. You
must cite all external references you used (including the names of individuals you discussed the
solutions with) to complete the homework.
1 [Bayesian interpretation of ridge regression] Consider the following data generating process for a linear regression problem in $\mathbb{R}^d$. Nature first selects $d$ weight coefficients $w_1, \ldots, w_d$ as $w_i \sim N(0, \tau^2)$ i.i.d. Given $n$ examples $x_1, \ldots, x_n \in \mathbb{R}^d$, nature generates the output variable $y_i$ as
\[ y_i = \sum_{j=1}^{d} w_j x_{i,j} + \epsilon_i, \]
where $\epsilon_i \sim N(0, \sigma^2)$ i.i.d.
Show that finding the coefficients $w_1, \ldots, w_d$ that maximize $P[w_1, \ldots, w_d \mid (x_1, y_1), \ldots, (x_n, y_n)]$
is equivalent to minimizing the ridge optimization criterion.
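For concreteness, the data generating process above can be simulated with a short script like the following sketch; the particular values of d, n, tau, and sigma are arbitrary illustrative choices, not part of the problem.

# Illustrative sketch of the data generating process above; the values of
# d, n, tau, sigma are arbitrary choices for demonstration only.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100
tau, sigma = 1.0, 0.5

w = rng.normal(0.0, tau, size=d)             # w_i ~ N(0, tau^2), i.i.d.
X = rng.normal(size=(n, d))                  # n examples x_1, ..., x_n in R^d
eps = rng.normal(0.0, sigma, size=n)         # eps_i ~ N(0, sigma^2), i.i.d.
y = X @ w + eps                              # y_i = sum_j w_j x_{i,j} + eps_i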
2 [Combining multiple classifiers] The concept of “wisdom-of-the-crowd” posits that the collective knowledge of a group, as expressed through its aggregated actions or opinions, is superior to the decision of any one individual in the group. Here we will study a version of the
“wisdom-of-the-crowd” for binary classifiers: how can one combine the prediction outputs of
multiple, possibly low-quality, binary classifiers to achieve an aggregate high-quality final output? Consider the following iterative procedure to combine classifier results.
Input:
- S – a set of training samples: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, where each $y_i \in \{-1, +1\}$
- T – number of iterations (also, the number of classifiers to combine)
- F – a set of (possibly low-quality) classifiers. Each $f \in F$ is of the form $f : X \to \{-1, +1\}$
Output:
- F – a set of selected classifiers $\{f_1, \ldots, f_T\}$, where each $f_i \in F$.
- A – a set of combination weights $\{\alpha_1, \ldots, \alpha_T\}$
Iterative Combination Procedure:
- Initialize distribution weights $D_1(i) = \frac{1}{m}$ [for $i = 1, \ldots, m$]
- for $t = 1, \ldots, T$ do
- // $\epsilon_j$ is the weighted error of the $j$-th classifier w.r.t. $D_t$
- Define $\epsilon_j := \sum_{i=1}^{m} D_t(i) \cdot \mathbf{1}[y_i \neq f_j(x_i)]$ [for each $f_j \in F$]
- // select the classifier with the smallest (weighted) error
- $f_t = \arg\min_{f_j \in F} \epsilon_j$
- $\epsilon_t = \min_{f_j \in F} \epsilon_j$
- // recompute weights w.r.t. performance of $f_t$
- Compute classifier weight $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$
- Compute distribution weight $D_{t+1}(i) = D_t(i) \exp(-\alpha_t y_i f_t(x_i))$
- Normalize distribution weights $D_{t+1}(i) = \frac{D_{t+1}(i)}{\sum_i D_{t+1}(i)}$
- endfor
- return weights $\alpha_t$ and classifiers $f_t$, for $t = 1, \ldots, T$.
Final Combined Prediction:
- For any test input x, define the aggregation function as $g(x) := \sum_t \alpha_t f_t(x)$, and return the prediction as $\mathrm{sign}(g(x))$.
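As a concrete reference, here is a minimal Python sketch of the combination procedure above. It assumes each classifier in F is a callable mapping an array of inputs to predictions in {−1, +1}, and that every selected classifier has weighted error strictly between 0 and 1; these assumptions are for illustration only.

# Minimal sketch of the iterative combination procedure; assumes each f in F
# maps an (m, d) array of inputs to an (m,) array of predictions in {-1, +1},
# and that every selected classifier has weighted error strictly in (0, 1).
import numpy as np

def combine(X, y, F, T):
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    classifiers, alphas = [], []
    for t in range(T):
        # weighted error of each classifier w.r.t. D_t
        errors = [np.sum(D * (f(X) != y)) for f in F]
        j = int(np.argmin(errors))               # smallest weighted error
        eps_t = errors[j]
        alpha_t = 0.5 * np.log((1 - eps_t) / eps_t)
        classifiers.append(F[j])
        alphas.append(alpha_t)
        # recompute and normalize the distribution weights
        D = D * np.exp(-alpha_t * y * F[j](X))
        D = D / D.sum()
    return classifiers, alphas

def predict(X, classifiers, alphas):
    # final combined prediction: sign(g(x)) with g(x) = sum_t alpha_t f_t(x)
    g = sum(a * f(X) for a, f in zip(alphas, classifiers))
    return np.sign(g)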
We’ll prove the following statement: If for each iteration $t$ there is some $\gamma_t > 0$ such that
$\epsilon_t = \frac{1}{2} - \gamma_t$ (that is, assuming that at each iteration the error of the classifier $f_t$ is just $\gamma_t$ better
than random guessing), then the error of the aggregate classifier satisfies
\[ \mathrm{err}(g) := \frac{1}{m} \sum_i \mathbf{1}[y_i \neq \mathrm{sign}(g(x_i))] \;\leq\; \exp\Big(-2 \sum_{t=1}^{T} \gamma_t^2\Big). \]
That is, the error of the aggregate classifier $g$ decreases exponentially fast with the number of
combinations $T$!
(i) Let $Z_t := \sum_i D_{t+1}(i)$ (i.e., $Z_t$ denotes the normalization constant for the weighted
distribution $D_{t+1}$). Show that
\[ D_{T+1}(i) = \frac{1}{m} \cdot \frac{1}{\prod_t Z_t} \exp(-y_i g(x_i)). \]
(ii) Show that the error of the aggregate classifier $g$ is upper bounded by the product of the $Z_t$:
\[ \mathrm{err}(g) \leq \prod_t Z_t. \]
(hint: use the fact that the 0-1 loss is upper bounded by the exponential loss)
(iii) Show that $Z_t = 2\sqrt{\epsilon_t(1 - \epsilon_t)}$.
(hint: noting $Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i f_t(x_i))$, separate the expression for correctly
and incorrectly classified cases and express it in terms of $\epsilon_t$)
(iv) By combining results from (ii) and (iii), we have that $\mathrm{err}(g) \leq \prod_t 2\sqrt{\epsilon_t(1 - \epsilon_t)}$; now
show that:
\[ \prod_t 2\sqrt{\epsilon_t(1 - \epsilon_t)} \;=\; \prod_t \sqrt{1 - 4\gamma_t^2} \;\leq\; \exp\Big(-2 \sum_t \gamma_t^2\Big). \]
Thus establishing that $\mathrm{err}(g) \leq \exp\big(-2 \sum_t \gamma_t^2\big)$.
3 [Low-dimensional information-preserving transformations] (hashing the cube) You have a
collection of nonzero distinct binary vectors $x_1, \ldots, x_m \in \{0, 1\}^n$. To facilitate later lookup,
you decide to hash them to vectors of length $p < n$ by means of a linear mapping $x_i \mapsto A x_i$,
where $A$ is a $p \times n$ matrix with 0-1 entries, and all computations are performed modulo 2.
Suppose the entries of the matrix are picked uniformly at random (i.e., each is an independent
coin toss).
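For intuition, the hashing scheme can be simulated along the following lines; the values of n, p, and m are arbitrary illustrative choices, and this sketch does not enforce that the hashed vectors are nonzero and distinct as the problem assumes.

# Rough sketch of the random linear hash over arithmetic modulo 2; the
# values of n, p, m are arbitrary illustrative choices, not prescribed by
# the problem, and the sampled vectors are not forced to be nonzero/distinct.
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 16, 8, 10

A = rng.integers(0, 2, size=(p, n))          # each entry an independent coin toss
xs = rng.integers(0, 2, size=(m, n))         # some binary vectors to hash

def hash_vec(x):
    # linear map x -> A x, with all arithmetic modulo 2
    return (A @ x) % 2

hashes = [tuple(hash_vec(x)) for x in xs]
print(len(hashes), len(set(hashes)))         # compare counts to spot collisions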
(i) Pick any $1 \leq i \leq m$, and any $b \in \{0, 1\}^p$. Show that the probability (over the choice of
$A$) that $x_i$ hashes to $b$ is exactly $1/2^p$. (Hint: focus on a coordinate $1 \leq j \leq n$ for which
$x_{ij} = 1$.)
(ii) Pick any $1 \leq i < j \leq m$. What is the probability that $x_i$ and $x_j$ hash to the same vector?
This is called a collision.
(iii) Show that if $p \geq 2 \log_2 m$, then with probability at least 1/2, there are no collisions
among the $x_i$. Thus: to avoid collisions, it is enough to linearly hash into $O(\log m)$
dimensions!
(question credit: Prof. Sanjoy Dasgupta)
4 [Strange consequences of high dimensionality] As discussed in class, we often represent our
data in high dimensions. Thus to understand our data better and design effective prediction
algorithms, it is good to understand how things behave in high dimensions. Obviously, since
we cannot visualize or imagine high dimensional spaces, we often tend to rely on how data
behave in one, two, or three dimensions and extrapolate how they may behave in hundreds of
dimensions. It turns out that our low dimensional intuition can be very misleading about data
and distributions in high dimensional spaces. In this problem we will explore this in more
detail.
Consider the Gaussian distribution with mean $\mu$ and identity covariance $I_d$ in $\mathbb{R}^d$. Recall that
the density it assigns to any point $x \in \mathbb{R}^d$ is
\[ p(x) = (2\pi)^{-d/2} \exp\big(-\|x - \mu\|^2 / 2\big). \]
(i) Show that when x = µ, x gets assigned the highest density.
(This, of course, makes sense: the Gaussian density peaks at its mean and thus x = µ
has the highest density.)
(ii) If the mean has the highest density, it stands to reason that if we draw a large i.i.d. sample
from the distribution, then a large fraction of the points should lie close to the mean.
Let’s try to verify this experimentally. For simplicity, let the mean be $\mu = 0$ (the covariance is still
$I_d$). Draw 10,000 points i.i.d. from a Gaussian $N(0, I_d)$.
To see how far away a sampled datapoint is from the mean, we can look at the squared distance $\|x - \mu\|^2 = \|x\|^2$ (that is, the squared length of the sampled datapoint, when the
mean is zero). Plot the histogram of the squared length of the samples, for dimensions
$d = 1, 2, 3, 5, 10, 50$ and $100$. You should plot all these histograms on the same
figure for a better comparison (one way to set up this experiment is sketched after part (v)).
What interesting observations do you see from this plot? Do you notice anything strange
about the samples that were drawn from the high dimensional Gaussian distribution? Do
most of the samples lie close to the mean?
(iii) Let’s mathematically derive where we expect these samples to lie. That is, calculate
\[ \mathbb{E}_{x \sim N(0, I_d)}\big[\|x\|^2\big]. \]
Is the empirical plot in part (ii) in agreement with the mathematical expression you
derived here?
(iv) This “strangeness” is not specific to the Gaussian distribution; you can observe something
similar even for the simplest of distributions in high dimensions. Consider the uniform
distribution over the cube $[-1, 1]^d$. Just like in part (ii), draw 10,000 i.i.d. samples from
this $d$-dimensional cube with uniform density, and plot the histogram of how far away
from the origin the sample points lie (do this for $d = 1, 2, 3, 5, 10, 50$ and $100$, again on
the same plot; see the sketch after part (v)).
Recall that the cube has side length 2, while most of the high-dimensional samples
have length far greater than 2! This means that even though you are drawing uniformly from
the cube, most of your samples lie in the corners (and not the interior) of the cube!
(v) Again, calculate the expected (squared) length of the samples. That is, calculate
\[ \mathbb{E}_{x \sim \mathrm{unif}([-1,1]^d)}\big[\|x\|^2\big]. \]
Is the plot in part (iv) in agreement with the expression you derive here?
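One possible way to set up the sampling experiments in parts (ii) and (iv) is sketched below. It plots squared lengths for both distributions (for part (iv) you may prefer the unsquared distance from the origin), and the number of bins and figure styling are left to you.

# Sketch for the experiments in parts (ii) and (iv): squared lengths of
# 10,000 samples per dimension, overlaid as histograms (one panel per
# distribution). Bin counts and styling are arbitrary choices.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
dims = [1, 2, 3, 5, 10, 50, 100]
n_samples = 10_000

fig, (ax_gauss, ax_cube) = plt.subplots(1, 2, figsize=(12, 4))
for d in dims:
    gauss = rng.normal(size=(n_samples, d))          # part (ii): N(0, I_d)
    cube = rng.uniform(-1, 1, size=(n_samples, d))   # part (iv): unif([-1, 1]^d)
    ax_gauss.hist((gauss ** 2).sum(axis=1), bins=100, alpha=0.5, label=f"d={d}")
    ax_cube.hist((cube ** 2).sum(axis=1), bins=100, alpha=0.5, label=f"d={d}")

ax_gauss.set(title="Gaussian N(0, I_d)", xlabel="squared length ||x||^2")
ax_cube.set(title="Uniform on [-1, 1]^d", xlabel="squared length ||x||^2")
ax_gauss.legend()
ax_cube.legend()
plt.show()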