COMS 4771 HW4
You are allowed to work in groups of (at most) three students. The group members do not have to be the same as in previous homeworks. Only one submission per group is required
by the due date. The name and UNI of all group members must be clearly specified on the homework.
You must cite all the references you used to do this homework. You must show your work to receive
full credit.
1 [PAC learning and VC dimension]
(a) Consider a domain with $n$ binary features and binary class labels. Let $F_d$ be the hypothesis class that contains all decision trees over these features that have depth at most $d$.
Give the tightest possible VC-dimension estimate of $F_d$.
(b) [a sharper learning rate for finding perfect predictors!] Consider a finite-size hypothesis class $F$ that contains the perfect predictor, that is, $\exists f^* \in F$ s.t. $\forall (x, y) \sim D(X \times Y)$, $y = f^*(x)$. We will show that for such a hypothesis class $F$, we can derive a sharper
sample complexity rate than what was shown in class (i.e., the Occam's Razor bound).
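For reference, one standard form of the Occam's Razor (finite-class uniform convergence) bound is recalled below; the exact constants and statement may differ from the version shown in class, and it is included here only as a point of comparison. With probability at least $1 - \delta$ over $m$ i.i.d. samples, every $f \in F$ satisfies

$$\big|\mathrm{err}(f) - \widehat{\mathrm{err}}_m(f)\big| \;\le\; \sqrt{\frac{\ln|F| + \ln(2/\delta)}{2m}},$$

where $\widehat{\mathrm{err}}_m(f)$ denotes the empirical error of $f$ on the $m$ samples. This gives a sample complexity on the order of $(\ln|F| + \ln(1/\delta))/\epsilon^2$; the realizable setting of this part allows a better dependence on $\epsilon$.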
(i) For a fixed $\epsilon > 0$, we say that a classifier $f$ is $\epsilon$-bad if
$$P_{(x,y)\sim D}[f(x) = y] < 1 - \epsilon.$$
What is the chance that the output of a given $\epsilon$-bad $f$ matches the true label for all $m$
samples drawn i.i.d. from $D$? That is, for a given $\epsilon$-bad $f$, calculate
$$P_{(x_i,y_i)_{i=1}^m \sim D}\Big[\,\forall i\ f(x_i) = y_i \;\Big|\; f \text{ is } \epsilon\text{-bad}\,\Big].$$
(ii) What is the chance that the event in part (i) occurs for some $f \in F$? That is, calculate
$$P_{(x_i,y_i)_{i=1}^m \sim D}\Big[\,\exists\ \epsilon\text{-bad } f \in F \text{ such that } \forall i\ f(x_i) = y_i\,\Big].$$
(iii) Observe that the event in part (ii) is a “bad” event: an ERM algorithm can pick an $\epsilon$-bad
classifier if that classifier happens to perfectly classify the $m$ i.i.d. samples. We want to limit
the probability of this bad event to at most $\delta > 0$. So how many i.i.d.
samples suffice to ensure that, with probability at least $1 - \delta$, an ERM algorithm
does not return an $\epsilon$-bad classifier?
(c) Given a collection of models $F$, suppose you were able to develop an algorithm $A : (x_i, y_i)_{i=1}^n \mapsto f^A_n$ (that is, given $n$ labeled training samples, $A$ returns a model $f^A_n \in F$)
that has the following property: for all $\epsilon > 0$, with probability 0.55 (over the draw of
$n = O\!\big(\tfrac{1}{\epsilon^2}\big)$ samples),
$$\mathrm{err}(f^A_n) - \inf_{f \in F} \mathrm{err}(f) \le \epsilon,$$
where $\mathrm{err}(f) := P_{(x,y)}[f(x) \neq y]$.
Show that one can construct an algorithm $B : (x_i, y_i)_{i=1}^{n'} \mapsto f^B_{n'}$ with the property: for
all $\epsilon > 0$ and all $\delta > 0$, with probability at least $1 - \delta$ over a draw of $n'$ samples,
$$\mathrm{err}(f^B_{n'}) - \inf_{f \in F} \mathrm{err}(f) \le \epsilon.$$
Moreover, show that $n' = \mathrm{poly}\!\big(\tfrac{1}{\epsilon}, \tfrac{1}{\delta}\big)$ samples are enough for the algorithm $B$ to return
such a model $f^B_{n'}$. Hence, the model class $F$ is efficiently PAC-learnable.
(Hint: Algorithm B can make multiple calls to the algorithm A.)
2 [Exploring k-means in detail] Recall that in k-means clustering we attempt to find $k$ cluster
centers $c_j \in \mathbb{R}^d$, $j \in \{1, \ldots, k\}$, such that the total (squared) distance between each datapoint
and the nearest cluster center is minimized. In other words, we attempt to find $c_1, \ldots, c_k$ that
minimize
$$\sum_{i=1}^{n} \min_{j \in \{1,\ldots,k\}} \|x_i - c_j\|^2, \qquad (1)$$
where $n$ is the total number of datapoints. To do so, we iterate between assigning each $x_i$ to the
nearest cluster center and updating each cluster center $c_j$ to the average of all points assigned
to the $j$th cluster (aka Lloyd's method).
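As a point of reference, here is a minimal sketch of Lloyd's method in Python/NumPy; it only illustrates the two alternating steps described above, and the function name, random initialization, and empty-cluster handling are our own simplifying choices.

import numpy as np

def lloyds_kmeans(X, k, n_iters=100, seed=0):
    # Minimal sketch of Lloyd's method: alternate between assigning each point
    # to its nearest center and recomputing each center as the mean of its
    # assigned points.  X is an (n, d) array of datapoints.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)]  # random initial centers
    for _ in range(n_iters):
        # assignment step: squared distance from every point to every center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # update step: each center becomes the mean of the points assigned to it
        new_centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign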
(a) [it is unclear how to find the best k, i.e. estimate the correct number of clusters!]
Instead of holding the number of clusters $k$ fixed, one can think of minimizing (1) over
both $k$ and $c$. Show that this is a bad idea. Specifically, what is the minimum possible
value of (1)? What values of $k$ and $c$ result in this value?
(b) [efficient optimal solution when either k = 1 or d = 1] Optimizing the k-means
objective (1) in the general case, when $k \ge 2$ and $d \ge 2$, is NP-hard. But one can find
an efficient optimal solution when either $k = 1$ or $d = 1$.
(i) What is the optimal objective value and the setting of the cluster centers in the case
of k = 1?
(ii) For the case $d = 1$ (and $k \ge 2$), show that Lloyd's algorithm is not optimal. That
is, exhibit a suboptimal cluster assignment for some dataset (with $d = 1$)
from which Lloyd's algorithm will not be able to improve the solution.
(iii) (note: this part is extra credit, only attempt it once everything else is done)
Propose an efficient algorithm (that is, an algorithm that runs in a number of steps polynomial
in $n$ and $k$) for optimally solving k-means in the $d = 1$ case.
(Hint: Use the fact that when $d = 1$, the dataset can be ordered as $x_{i_1} \le x_{i_2} \le \cdots \le x_{i_n}$,
and each cluster in the optimal assignment will be a nonoverlapping interval.)
(c) [kernelizing k-means] k-means with the Euclidean distance metric assumes that each pair
of clusters is linearly separable. This may not be the case. A classic example is where
we have two clusters corresponding to data points on two concentric circles in $\mathbb{R}^2$.
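Since the figure is not reproduced here, the sketch below shows one way to generate such a concentric-circles dataset in Python/NumPy; the radii, sample sizes, and noise level are arbitrary illustrative choices, not values specified by the assignment.

import numpy as np

def concentric_circles(n_per_ring=200, radii=(1.0, 3.0), noise=0.05, seed=0):
    # Two clusters lying on concentric circles in R^2.
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for label, r in enumerate(radii):
        theta = rng.uniform(0, 2 * np.pi, size=n_per_ring)
        ring = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
        ring += noise * rng.standard_normal(ring.shape)
        points.append(ring)
        labels.append(np.full(n_per_ring, label))
    return np.vstack(points), np.concatenate(labels)

Running standard k-means with $k = 2$ on data like this typically splits the plane roughly in half rather than separating the inner ring from the outer ring, which is the kind of undesirable clustering the next part asks you to demonstrate.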
(i) Implement Lloyd's method for the k-means algorithm and show the resulting cluster
assignment for the concentric-circles dataset described above. Give two more examples of datasets in
$\mathbb{R}^2$ for which the optimal k-means solution results in an undesirable clustering. Show
the resulting cluster assignments for the two additional example datasets.
Let $\phi$ denote an explicit non-linear feature-space mapping into some inner product space.
We will show how one can derive an implicit, kernel-based version of Lloyd's method
for k-means, where we only operate on the data through inner products $\phi(x_i) \cdot \phi(x_j) = K(x_i, x_j)$.
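For concreteness, the kernels mentioned in part (vi) have the following standard textbook forms (the offset in the quadratic kernel and the bandwidth $\sigma$ in the RBF kernel are conventional parameters, not values fixed by the assignment):
$$K_{\text{lin}}(x, x') = x \cdot x', \qquad K_{\text{quad}}(x, x') = (x \cdot x' + 1)^2, \qquad K_{\text{rbf}}(x, x') = \exp\!\Big(-\frac{\|x - x'\|^2}{2\sigma^2}\Big).$$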
(ii) Let $z_{ij}$ be an indicator that is equal to 1 if $\phi(x_i)$ is currently assigned to the $j$th
cluster and 0 otherwise ($1 \le i \le n$ and $1 \le j \le k$). Show that the $j$th cluster center
$c_j$ can be written as $\sum_{i=1}^n \alpha_{ij}\,\phi(x_i)$, where $\alpha_{ij}$ only depends on the $z_{ij}$ variables.
(iii) Given any two data points $\phi(x_i)$ and $\phi(x_j)$, show that the squared distance $\|\phi(x_i) - \phi(x_j)\|^2$
can be computed using only (linear combinations of) inner products.
(iv) Given the results of parts (ii) and (iii), show how to compute the squared distance
$\|\phi(x_i) - c_j\|^2$ using only (linear combinations of) inner products between the data
points $x_1, \ldots, x_n$.
(v) Using the results from parts (ii), (iii) and (iv), propose an algorithm for a kernelized
version of Lloyd's method for finding the k-means clustering.
(vi) Implement your proposed kernelized k-means method and run it on the three example datasets from part (i). Compare the resulting clusterings for various choices of kernels
(e.g. linear kernel, quadratic kernel, RBF kernel); a sketch for computing these kernel matrices is given after this problem.
(submit your datasets and kernelized code on Courseworks to receive full credit)
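As a starting point for part (vi), here is a minimal sketch of the Gram-matrix computation for the three kernels mentioned above; the function name and the default RBF bandwidth are our own illustrative choices.

import numpy as np

def gram_matrix(X, kernel="rbf", sigma=1.0):
    # Compute the n x n kernel (Gram) matrix K[i, j] = K(x_i, x_j),
    # where X is an (n, d) array; 'sigma' is used only by the RBF kernel.
    dots = X @ X.T  # all pairwise inner products x_i . x_j
    if kernel == "linear":
        return dots
    if kernel == "quadratic":
        return (dots + 1.0) ** 2
    if kernel == "rbf":
        sq_norms = np.diag(dots)
        sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * dots
        return np.exp(-sq_dists / (2 * sigma ** 2))
    raise ValueError("unknown kernel: " + kernel)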
3 [different learning rates for different strategies] Here we will explore how the rate at which
an algorithm learns a concept can change based on how labeled examples are obtained. For
that we look at three settings: (i) an active learning setting, where the algorithm has the luxury
of specifying a data point and querying its label; (ii) a passive learning setting, where labeled
examples are drawn at random; and (iii) an adversarial setting, where training examples are
given by an adversary that tries to make your life hard.
Consider a binary classification problem where each data point consists of $d$ binary features.
Let $F$ be the hypothesis class of conjunctions of subsets of the $d$ features and their negations.
So, for example, one hypothesis could be $f_1(x) = x_1 \wedge x_2 \wedge \neg x_d$ (where $\wedge$ denotes the logical
“and” and $\neg$ denotes the logical “not”). Another hypothesis could be $f_2(x) = \neg x_3 \wedge x_5$. A
conjunction in $F$ cannot contain both $x_i$ and $\neg x_i$. We assume a consistent learning scenario
where there exists a hypothesis $f^* \in F$ that is consistent with the true label for all data points.
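To make the hypothesis class concrete, one simple way to represent a conjunction is by the index sets of its positive and negated literals; the short sketch below (our own representation, not part of the assignment) evaluates such a hypothesis on a binary feature vector.

def eval_conjunction(positives, negatives, x):
    # Evaluate a conjunction hypothesis on a binary feature vector x.
    # Features are indexed from 1 to d, to match the problem statement:
    # 'positives' holds indices i whose literal is x_i, and 'negatives'
    # holds indices i whose literal is NOT x_i.
    return (all(x[i - 1] == 1 for i in positives)
            and all(x[i - 1] == 0 for i in negatives))

# Example: f2(x) = (NOT x_3) AND x_5 evaluated on a point with d = 5 features.
print(eval_conjunction(positives={5}, negatives={3}, x=[1, 0, 0, 1, 1]))  # True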
(i) In the active learning setting, the learning algorithm can query the label of an unlabeled
example. Assume that you can query any possible example. Show that, starting with a
single positive example, you can exactly learn the true hypothesis $f^*$ using $d$ queries.
(ii) In the passive learning setting, the examples are drawn i.i.d. from an underlying
fixed distribution $D$. How many examples are sufficient to guarantee a generalization
error of less than $\epsilon$ with probability $\ge 1 - \delta$?
(iii) Show that if the training data is not representative of the underlying distribution, a consistent hypothesis $f$
can perform poorly. Specifically, assume that the true hypothesis
$f^*$ is a conjunction of $k$ out of the $d$ features for some $k > 0$, and that all possible data
points are equally likely. Show that there exists a training set of $2^{d-k}$ unique examples
and a hypothesis $f$ that is consistent with this training set but achieves a classification
error $\ge 50\%$ when tested on all possible data points.
4 [From distances to embeddings] Your friend from overseas is visiting you and asks you
for the geographical locations of popular US cities on a map. Not having access to a US map,
you realize that you cannot provide your friend with accurate information. You recall that you
have access to the relative distances between nine popular US cities, given by the following
distance matrix D:
Distances (D) BOS NYC DC MIA CHI SEA SF LA DEN
BOS 0 206 429 1504 963 2976 3095 2979 1949
NYC 206 0 233 1308 802 2815 2934 2786 1771
DC 429 233 0 1075 671 2684 2799 2631 1616
MIA 1504 1308 1075 0 1329 3273 3053 2687 2037
CHI 963 802 671 1329 0 2013 2142 2054 996
SEA 2976 2815 2684 3273 2013 0 808 1131 1307
SF 3095 2934 2799 3053 2142 808 0 379 1235
LA 2979 2786 2631 2687 2054 1131 379 0 1059
DEN 1949 1771 1616 2037 996 1307 1235 1059 0
Being a machine learning student, you believe that it may be possible to infer the locations
of these cities from the distance data. To find an embedding of these nine cities on a two-dimensional map, you decide to solve it as an optimization problem, as follows.
You associate a two-dimensional variable $x_i$ with the unknown latitude and longitude values
for each of the nine cities (that is, $x_1$ is the lat/lon value for BOS, $x_2$ is the lat/lon value for
NYC, etc.). You write down the following (unconstrained) optimization problem:
$$\text{minimize}_{x_1, \ldots, x_9} \;\; \sum_{i,j} \big(\|x_i - x_j\| - D_{ij}\big)^2,$$
where $\sum_{i,j} (\|x_i - x_j\| - D_{ij})^2$ denotes the embedding discrepancy function.
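To get started on part (ii), here is a minimal Python/NumPy setup: the distance matrix $D$ from the table above entered as an array, together with the embedding discrepancy function exactly as defined; the optimization itself (and the derivative asked for in part (i)) is left to you.

import numpy as np

CITIES = ["BOS", "NYC", "DC", "MIA", "CHI", "SEA", "SF", "LA", "DEN"]

# Pairwise distance matrix D from the table above (rows/columns ordered as CITIES).
D = np.array([
    [   0,  206,  429, 1504,  963, 2976, 3095, 2979, 1949],
    [ 206,    0,  233, 1308,  802, 2815, 2934, 2786, 1771],
    [ 429,  233,    0, 1075,  671, 2684, 2799, 2631, 1616],
    [1504, 1308, 1075,    0, 1329, 3273, 3053, 2687, 2037],
    [ 963,  802,  671, 1329,    0, 2013, 2142, 2054,  996],
    [2976, 2815, 2684, 3273, 2013,    0,  808, 1131, 1307],
    [3095, 2934, 2799, 3053, 2142,  808,    0,  379, 1235],
    [2979, 2786, 2631, 2687, 2054, 1131,  379,    0, 1059],
    [1949, 1771, 1616, 2037,  996, 1307, 1235, 1059,    0],
], dtype=float)

def discrepancy(X):
    # Embedding discrepancy: sum_{i,j} (||x_i - x_j|| - D_ij)^2,
    # where X is a (9, 2) array of candidate city locations.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return ((dists - D) ** 2).sum()

# Example: evaluate the discrepancy at a random initial guess.
X0 = np.random.default_rng(0).standard_normal((9, 2))
print(discrepancy(X0))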
(i) What is the derivative of the discrepancy function with respect to a location $x_i$?
(ii) Write a program in your preferred language to find an optimal setting of locations
$x_1, \ldots, x_9$. You must submit your code to Courseworks to receive full credit.
(iii) Plot the result of the optimization showing the estimated locations of the nine cities.
(Here is sample code to plot the city locations in Matlab; a rough Python equivalent is sketched at the end of this problem.)
cities={’BOS’,’NYC’,’DC’,’MIA’,’CHI’,’SEA’,’SF’,’LA’,’DEN’};
locs = [x1;x2;x3;x4;x5;x6;x7;x8;x9];
figure; text(locs(:,1), locs(:,2), cities);
What can you say about the estimated locations compared to the actual
geographical locations of these cities?
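For those working in Python rather than Matlab, a rough equivalent of the plotting snippet in part (iii) is sketched below; the array name locs is our own placeholder for the optimized locations.

import numpy as np
import matplotlib.pyplot as plt

cities = ["BOS", "NYC", "DC", "MIA", "CHI", "SEA", "SF", "LA", "DEN"]
locs = np.zeros((9, 2))  # replace with your optimized (9, 2) array of locations

plt.scatter(locs[:, 0], locs[:, 1])
for (px, py), name in zip(locs, cities):
    plt.annotate(name, (px, py))
plt.show()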