CM146
Problem Set 5: Boosting, Unsupervised learning
Submission instructions
• Submit your solutions electronically on the course Gradescope site as PDF files.
• If you plan to typeset your solutions, please use the LaTeX solution template. If you must
submit scanned handwritten solutions, please use a black pen on blank white paper and a
high-quality scanner app.
1 AdaBoost [5 pts]
In the lecture on ensemble methods, we said that in iteration $t$, AdaBoost is picking the pair $(h_t, \beta_t)$ that minimizes the objective:
\begin{align*}
(h_t^*(x), \beta_t^*) &= \arg\min_{(h_t(x),\, \beta_t)} \sum_n w_t(n)\, e^{-y_n \beta_t h_t(x_n)} \\
&= \arg\min_{(h_t(x),\, \beta_t)} \left( e^{\beta_t} - e^{-\beta_t} \right) \sum_n w_t(n)\, \mathbb{I}[y_n \neq h_t(x_n)] + e^{-\beta_t} \sum_n w_t(n)
\end{align*}
We define the weighted misclassification error at time $t$ to be $\epsilon_t = \sum_n w_t(n)\, \mathbb{I}[y_n \neq h_t(x_n)]$. Also, the weights are normalized so that $\sum_n w_t(n) = 1$.
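As a quick sanity check on this decomposition (not part of the problem), the sketch below, with hypothetical helper names such as `weighted_exp_loss`, evaluates both the exponential-loss form and the decomposed form for labels and predictions in $\{-1, +1\}$ and confirms that they agree numerically.

```python
import numpy as np

def weighted_exp_loss(y, h, w, beta):
    """Exponential-loss form: sum_n w_t(n) * exp(-y_n * beta_t * h_t(x_n))."""
    return np.sum(w * np.exp(-y * beta * h))

def decomposed_loss(y, h, w, beta):
    """Decomposed form: (e^beta - e^-beta) * eps_t + e^-beta * sum_n w_t(n),
    where eps_t is the weighted misclassification error."""
    eps = np.sum(w * (y != h))          # weighted misclassification error eps_t
    return (np.exp(beta) - np.exp(-beta)) * eps + np.exp(-beta) * np.sum(w)

# Toy check with normalized weights and labels/predictions in {-1, +1}.
y = np.array([+1, -1, +1, -1])
h = np.array([+1, +1, +1, -1])          # one mistake, on the second point
w = np.ones(4) / 4                      # normalized weights, sum to 1
for beta in [0.1, 0.5, 1.0]:
    assert np.isclose(weighted_exp_loss(y, h, w, beta),
                      decomposed_loss(y, h, w, beta))
```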
(a) Take the derivative of the above objective function with respect to $\beta_t$, set it to zero, and solve for $\beta_t$ to obtain the update for $\beta_t$.
(b) Suppose the training set is linearly separable, and we use a hard-margin linear support vector
machine (no slack) as a base classifier. In the first boosting iteration, what would the resulting
β1 be?
2 K-means for single dimensional data [5 pts]
In this problem, we will work through K-means for single-dimensional data.
(a) Consider the case where $K = 3$ and we have 4 data points $x_1 = 1$, $x_2 = 2$, $x_3 = 5$, $x_4 = 7$. What is the optimal clustering for this data? What is the corresponding value of the objective?
Parts of this assignment are adapted from course material by Jenna Wiens (UMich) and Tommi Jaakola (MIT).
(b) One might be tempted to think that Lloyd's algorithm is guaranteed to converge to the global minimum when $d = 1$. Show that there exists a suboptimal cluster assignment (i.e., initialization) for the data in part (a) that Lloyd's algorithm will not be able to improve (to get full credit, you need to show the assignment, show why it is suboptimal, and explain why it will not be improved).
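For reference, Lloyd's algorithm alternates between assigning each point to its nearest center and moving each center to the mean of its assigned points, stopping at a fixed point of these two steps. The sketch below is our own minimal 1-D illustration on made-up data (deliberately not the data from part (a)), with a hypothetical name `lloyd_1d`; it is only meant to make the two steps and the K-means objective concrete, not to answer either part.

```python
import numpy as np

def lloyd_1d(x, centers, n_iters=100):
    """Plain Lloyd's algorithm for 1-D data.
    x: (N,) data points; centers: (K,) initial cluster centers."""
    centers = centers.astype(float).copy()
    for _ in range(n_iters):
        # Assignment step: each point goes to its closest current center.
        assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        # Update step: each center moves to the mean of its assigned points
        # (an empty cluster keeps its old center).
        new_centers = np.array([x[assign == k].mean() if np.any(assign == k) else centers[k]
                                for k in range(len(centers))])
        if np.allclose(new_centers, centers):   # fixed point reached
            break
        centers = new_centers
    # K-means objective: total squared distance of points to their assigned centers.
    assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    objective = np.sum((x - centers[assign]) ** 2)
    return centers, assign, objective

# Toy run on made-up data (not the data from this problem).
x = np.array([0.0, 2.0, 6.0, 8.0])
centers, assign, obj = lloyd_1d(x, centers=np.array([1.0, 7.0]))
print(centers, assign, obj)  # -> [1. 7.] [0 0 1 1] 4.0
```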
3 Gaussian Mixture Models [8 pts]
We would like to cluster data $\{x_1, \ldots, x_N\}$, $x_n \in \mathbb{R}^d$, using a Gaussian Mixture Model (GMM) with $K$ mixture components. To do this, we need to estimate the parameters $\theta$ of the GMM, i.e., we need to set the values $\theta = \{\omega_k, \mu_k, \Sigma_k\}_{k=1}^{K}$, where $\omega_k$ is the mixture weight associated with mixture component $k$, and $\mu_k$ and $\Sigma_k$ denote the mean and the covariance matrix of the Gaussian distribution associated with mixture component $k$.
If we knew which cluster each sample $x_n$ belongs to (i.e., if we had complete data), we showed in the lecture on Clustering that the log-likelihood $\ell$ takes the form below, and we can compute the maximum likelihood estimate (MLE) of all the parameters.
\begin{equation*}
\ell(\theta) = \sum_n \log p(x_n, z_n) = \sum_k \sum_n \gamma_{nk} \log \omega_k + \sum_k \left\{ \sum_n \gamma_{nk} \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\} \tag{1}
\end{equation*}
Since we do not have complete data, we use the EM algorithm. The EM algorithm works by iterating between setting each $\gamma_{nk}$ to the posterior probability $p(z_n = k \mid x_n)$ (step 1 on slide 26 of the lecture on Clustering) and then using $\gamma_{nk}$ to find the value of $\theta$ that maximizes $\ell$ (step 2 on slide 26). We will now derive updates for one of the parameters, i.e., $\mu_j$ (the mean parameter associated with mixture component $j$).
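As a concrete reference for step 1 only, the sketch below (our own code, assuming one-dimensional data and per-component standard deviations, with a hypothetical name `e_step`) computes $\gamma_{nk} \propto \omega_k\, \mathcal{N}(x_n \mid \mu_k, \sigma_k^2)$ and normalizes over $k$. Step 2 is exactly what the parts below ask you to work out, so it is not shown here.

```python
import numpy as np

def e_step(x, omega, mu, sigma):
    """Step 1 of EM for a 1-D GMM: responsibilities gamma[n, k] = p(z_n = k | x_n).
    x: (N,) data; omega, mu, sigma: (K,) mixture weights, means, std devs."""
    # Gaussian densities N(x_n | mu_k, sigma_k^2) for every (n, k) pair.
    dens = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma[None, :]) ** 2) \
           / (np.sqrt(2.0 * np.pi) * sigma[None, :])
    unnorm = omega[None, :] * dens                      # omega_k * N(x_n | mu_k, sigma_k^2)
    return unnorm / unnorm.sum(axis=1, keepdims=True)   # normalize over components k

# Toy usage with arbitrary parameter values (K = 2, unit standard deviations).
x = np.array([5.0, 15.0, 25.0, 30.0, 40.0])
gamma = e_step(x, omega=np.array([0.5, 0.5]),
               mu=np.array([10.0, 30.0]), sigma=np.array([1.0, 1.0]))
print(gamma.round(3))  # each row n is [gamma_n1, gamma_n2] and sums to 1
```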
(a) To maximize $\ell$, compute $\nabla_{\mu_j} \ell(\theta)$: the gradient of $\ell(\theta)$ with respect to $\mu_j$.
(b) Set the gradient to zero and solve for $\mu_j$ to show that $\mu_j = \frac{1}{\sum_n \gamma_{nj}} \sum_n \gamma_{nj}\, x_n$.
(c) Suppose that we are fitting a GMM to data using $K = 2$ components. We have $N = 5$ samples in our training data, with $x_n$, $n \in \{1, \ldots, N\}$, equal to: $\{5, 15, 25, 30, 40\}$.
We use the EM algorithm to find the maximum likelihood estimates for the model parameters, which are the mixing weights for the two components, $\omega_1$ and $\omega_2$, and the means for the two components, $\mu_1$ and $\mu_2$. The standard deviations for the two components are fixed at 1.
Suppose that at the end of step 1 of iteration 5 of the EM algorithm, the soft assignments $\gamma_{nk}$ for the five data items are as shown in Table 1.
\begin{tabular}{c|cc}
$n$ & $\gamma_{n1}$ & $\gamma_{n2}$ \\
\hline
1 & 0.2 & 0.8 \\
2 & 0.2 & 0.8 \\
3 & 0.8 & 0.2 \\
4 & 0.9 & 0.1 \\
5 & 0.9 & 0.1 \\
\end{tabular}

Table 1: Entry in row $n$ and column $k$ of the table corresponds to $\gamma_{nk}$.
What are the updated values of the parameters $\omega_1$, $\omega_2$, $\mu_1$, and $\mu_2$ at the end of step 2 of the EM algorithm?