CS 446 / ECE 449 — Homework 4
Version 1.0
Instructions.
• Everyone must submit individually on Gradescope under hw4 and hw4code.
• The “written” submission at hw4 must be typed and submitted in any format Gradescope accepts (to be safe, submit a PDF). You may use LaTeX, Markdown, Google Docs, MS Word, whatever you like; but it must be typed!
• When submitting at hw4, Gradescope will ask you to mark out boxes around each of your answers; please do this precisely!
• Please make sure your NetID is clear and large on the first page of the homework.
• Your solution must be written in your own words.
• When submitting to hw4code, only upload hw4.py and hw4_utils.py. Additional files will be ignored.
Version History.
1.0 Initial Version.
1. Principal Component Analysis
(a) For each of the following statements, specify whether the statement is true or false. If you think the
statement is wrong, explain in 1 to 2 sentences why it is wrong.
• True or False: As shown in the figure below, PCA seeks a subspace such that the sum of all the
vertical distances to the subspace (the dashed lines) is minimized.
• True or False: PCA seeks a projection that best represents the data in a least-squares sense.
• True or False: PCA seeks a linear combination of variables such that the maximum variance is
extracted from the variables.
• True or False: The principal components are not necessarily orthogonal to each other.
(b) Recall that PCA finds a direction w in which the projected data has highest variance by solving the
following program:
    max_{w : ||w||_2 = 1}  w^T Σ w.                                 (1)
Here, Σ is a covariance matrix. You are given a dataset of two 2-dimensional points (1, 3) and (4, 7).
Draw the two data points on the 2D plane. What is the first principal component w of this dataset?
(c) Now you are given a dataset of four points (2, 0), (2, 2), (6, 0) and (6, 2). Draw the four data points
on the 2D plane. Given this dataset, what is the dimension of the covariance matrix Σ in Eq. (1)?
Also, explicitly write down the values of Σ given the dataset.
(d) What is the optimal w and the optimal value of the program in Eq. (1) given
    Σ =
        [ 12   0   0   0 ]
        [  0   6   0   0 ]
        [  0   0  20   0 ]
        [  0   0   0  10 ] .
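For parts (b)–(d), if you want to sanity-check your hand computations numerically, a minimal numpy sketch is given below. It is an optional aid, not part of the required answer; it uses the 1/n covariance convention, which does not change the direction of w:

    import numpy as np

    # Points from part (b); for part (c) replace X with the four points there,
    # or set Sigma directly to the matrix given in part (d).
    X = np.array([[1.0, 3.0],
                  [4.0, 7.0]])

    Xc = X - X.mean(axis=0)              # center the data
    Sigma = Xc.T @ Xc / X.shape[0]       # covariance matrix (1/n convention)

    vals, vecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
    w = vecs[:, -1]                      # direction of largest variance
    print(Sigma, vals[-1], w)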
2. Gaussian Mixture Models
Consider a Gaussian mixture model with K components (k ∈ {1, . . . , K}), each having mean µk, variance σk²,
and mixture weight πk. Further, we are given a dataset D = {xi}, where xi ∈ R. We use zi = {zik} to denote
the latent variables.
(a) What is the log-likelihood of the data according to the Gaussian Mixture Model (use µk, σk, πk, K, xi, and D)?
(b) Assume K = 1 and find the maximum likelihood estimate for the parameters (µ1, σ1², π1).
(c) What is the probability distribution on the latent variables, i.e., what is the distribution p(zi,1, zi,2, · · · , zi,K)
underlying Gaussian mixture models? Also give its name.
(d) For general K, what is the posterior probability p(zik = 1 | xi)? To simplify, wherever possible, use
N(xi | µk, σk), a Gaussian distribution over xi ∈ R having mean µk and variance σk².
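As an optional, purely numerical illustration of the posterior asked for in part (d) (not part of the required derivation), a small numpy/scipy sketch with made-up parameter values could look like this:

    import numpy as np
    from scipy.stats import norm

    pi = np.array([0.3, 0.7])      # mixture weights pi_k (made-up values)
    mu = np.array([0.0, 5.0])      # means mu_k
    sigma = np.array([1.0, 2.0])   # standard deviations sigma_k

    x = 1.5                                     # a single observation x_i
    lik = norm.pdf(x, loc=mu, scale=sigma)      # N(x_i | mu_k, sigma_k) for each k
    posterior = pi * lik / np.sum(pi * lik)     # p(z_ik = 1 | x_i)
    print(posterior)                            # sums to 1 over k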
(e) How are k-Means and the Gaussian Mixture Model related? (There are three conditions.)
Hint: Think of variance, πk, and hard/soft assignment.
(f) Show that:

    lim_{ε→0}  −ε log Σ_{k=1}^{K} exp(−Fk/ε)  =  min_k Fk,        ε ∈ R+.

Hint: Use l'Hôpital's rule.
(g) Consider the modified Gaussian Mixture Model objective:

    min_µ  −ε Σ_{xi∈D} log Σ_{k=1}^{K} exp(−(xi − µk)²/ε).

Conclude that the objective for k-Means is the 0-temperature limit of the Gaussian Mixture Model.
Hint: Let Fk = (x − µk)² and apply the equation you proved in (f).
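An optional numerical sanity check of the identity in part (f) is sketched below; it assumes numpy and scipy are available and uses made-up values for Fk:

    import numpy as np
    from scipy.special import logsumexp

    F = np.array([3.0, 1.0, 4.0, 1.5])    # made-up values F_k; min_k F_k is 1.0

    for eps in [1.0, 0.1, 0.01, 0.001]:
        # -eps * log( sum_k exp(-F_k / eps) ), computed stably with logsumexp
        val = -eps * logsumexp(-F / eps)
        print(eps, val)                    # approaches min_k F_k as eps -> 0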
3. Expectation Maximization
In this problem, you will implement an expectation-maximization (EM) algorithm to cluster samples
D = {x^(i)}_{i=1}^n, with x^(i) ∈ {0, 1}^D, into groups. You will be using a mixture of Bernoullis model to
tackle this problem.
(a) Mixture of Bernoullis.
i. Assume each variable xd in the D-dimensional vector x is drawn from a Bernoulli(qd) distribution,
P(xd = 1) = qd. Let q = [q1, · · · , qD] ∈ [0, 1]^D be the resulting vector of Bernoulli parameters.
Write an expression for P(x|q) as a function of qd and xd.
ii. Now suppose we have a mixture of K Bernoulli distributions: each vector x^(i) is drawn from some
vector of Bernoulli random variables with parameters p^(k) = [p^(k)_1, · · · , p^(k)_D], that we call
Bernoulli(p^(k)). Let p ≜ {p^(1), · · · , p^(K)}. Let πk denote the probability that the k-th set of
Bernoulli parameters is chosen. Assume a distribution π ≜ {π1, . . . , πK} over all possible Bernoulli
parameters. Write an expression for P(x^(i) | p, π), as a function of πk and P(x^(i) | p^(k)).
iii. Using the above, write an expression for the log-likelihood of i.i.d. data D, i.e., log P(D|π, p).
(b) Expectation step.
i. Let z^(i) ≜ [z^(i)_1, . . . , z^(i)_K] ∈ {0, 1}^K be an indicator vector, such that z^(i)_k = 1 if x^(i) was
drawn from a Bernoulli(p^(k)), and 0 otherwise. Write down the expression of P(z^(i) | π) as a function of
πk and z^(i)_k.
ii. Write down the expression of P(x^(i) | z^(i), p, π) as a function of P(x^(i) | p^(k)) and z^(i)_k.
iii. Let Z = {z^(i)}_{i=1}^n. Using the two quantities above, derive the likelihood of the data and latent
variables P(Z, D | π, p).
iv. Let η(z^(i)_k) = E[z^(i)_k | x^(i), π, p]. Show that

    η(z^(i)_k) = [ πk ∏_{d=1}^{D} (p^(k)_d)^{x^(i)_d} (1 − p^(k)_d)^{1−x^(i)_d} ]
                 / [ Σ_j πj ∏_{d=1}^{D} (p^(j)_d)^{x^(i)_d} (1 − p^(j)_d)^{1−x^(i)_d} ].
v. Let p̃, π̃ be the new parameters that we would like to maximize. p, π are from the previous iteration.
Use this to derive the following final expression for the E-step in the EM algorithm:

    E[log P(Z, D | p̃, π̃) | D, p, π]
        = Σ_{i=1}^{n} Σ_{k=1}^{K} η(z^(i)_k) [ log π̃k + Σ_{d=1}^{D} ( x^(i)_d log p̃^(k)_d + (1 − x^(i)_d) log(1 − p̃^(k)_d) ) ].
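For intuition only, a minimal vectorized numpy sketch of the E-step quantity η(z^(i)_k) from (b)iv is given below; the function name and array shapes are made up for this example, and log-space is used to avoid underflow:

    import numpy as np
    from scipy.special import logsumexp

    def responsibilities(X, p, pi):
        """Sketch of the E-step: eta[i, k] is p(z^(i)_k = 1 | x^(i)).

        X  : (n, D) binary data matrix
        p  : (K, D) Bernoulli parameters p^(k)_d, assumed strictly inside (0, 1)
        pi : (K,)   mixture weights pi_k
        """
        # log P(x^(i) | p^(k)) for every i and k
        log_px = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T       # (n, K)
        log_joint = np.log(pi) + log_px                            # add log pi_k
        return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))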
(c) Maximization step. In the following, we will find p̃ and π̃ that maximize the above expression.
i. Show that the p̃ that maximizes the E-step is:

    p̃^(k) = ( Σ_{i=1}^{n} η(z^(i)_k) x^(i) ) / Nk ,

where Nk = Σ_{i=1}^{n} η(z^(i)_k).
ii. Prove that the value of π̃ that maximizes the E-step is:

    π̃k = Nk / Σ_{k'} Nk' .

Hint: π̃ needs to be a distribution. Use a Lagrange multiplier to include this constraint in your
objective function and solve it.
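To see how the E-step and the M-step updates from (c)i and (c)ii fit together, here is a compact numpy sketch of the full loop. It is illustrative only, with simplified initialization and no stopping criterion, and is not the reference solution for hw4.py:

    import numpy as np
    from scipy.special import logsumexp

    def em_bernoulli_mixture(X, K, n_iters=50, seed=0, eps=1e-8):
        """Sketch of EM for a mixture of Bernoullis on binary data X of shape (n, D)."""
        rng = np.random.default_rng(seed)
        n, D = X.shape
        p = rng.uniform(0.25, 0.75, size=(K, D))     # Bernoulli parameters p^(k)_d
        pi = np.full(K, 1.0 / K)                     # mixture weights pi_k

        for _ in range(n_iters):
            # E-step: eta[i, k] = p(z^(i)_k = 1 | x^(i)), in log-space for stability
            log_px = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T
            log_joint = np.log(pi) + log_px
            eta = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))

            # M-step: the updates derived in (c)i and (c)ii
            Nk = eta.sum(axis=0)                               # N_k = sum_i eta(z^(i)_k)
            p = np.clip((eta.T @ X) / Nk[:, None], eps, 1 - eps)
            pi = Nk / Nk.sum()
        return p, pi, eta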
4. K-Means 1
(a) Mention whether K-Means is a supervised or an unsupervised method and state the reason.
(b) Assume that you are trying to cluster data points xi for i ∈ {1, 2, . . . , D} into K clusters, each
with center µk where k ∈ {1, 2, . . . , K}. The objective function for doing this clustering involves
minimization of the squared Euclidean distance between the points and the cluster centers. It is given by

    min_µ min_r Σ_{i∈D} Σ_{k=1}^{K} (1/2) rik ||xi − µk||₂² .
How do you ensure hard assignment of one data point to one and only one cluster at a given time?
Hint: By hard assignment we mean that you are 100 % sure that a point either belongs or doesn’t
belong to a cluster.
(c) How does your answer to part b change if we want to obtain a soft assignment instead?
Hint: By soft assignment we mean that a point belongs to a cluster with some probability.
(d) Looking at the following plot, what is the best choice for the number of clusters?
[Figure: plot of the squared distance of all points from their cluster center (y-axis, 0 to 5000) versus the number of clusters (x-axis, 1 to 10).]
(e) Would K-Means be an efficient algorithm to cluster the following data? Explain your answer in a
couple of lines.
[Figure: scatter plot of the data, with x and y axes each ranging from −3 to 3.]
5. K-Means 2
We are given a dataset D = {x} of 2D points x ∈ R² which we are interested in partitioning into K
clusters, each having a cluster center µk (k ∈ {1, . . . , K}), via the k-Means algorithm. This algorithm
optimizes the following cost function:

    min_{µk, r}  Σ_{x∈D, k∈{1,...,K}} (1/2) rx,k ||x − µk||₂²
    s.t.  rx,k ∈ {0, 1}            ∀x ∈ D, k ∈ {1, . . . , K},
          Σ_{k∈{1,...,K}} rx,k = 1  ∀x ∈ D.                                   (2)
(a) What is the domain for µk?
(b) Given fixed cluster centers µk ∀k ∈ {1, . . . , K}, what is the optimal rx,k for the program in Eq. 2?
Provide a reason.
(c) Given fixed rx,k ∀x ∈ D, k ∈ {1, . . . , K}, what are the optimal cluster centers µk ∀k ∈ {1, . . . , K} for
the program in Eq. 2?
Hint: Reason by first computing the derivative w.r.t. µk.
(d) Using pseudo-code, sketch the algorithm which alternates the aforementioned two steps. Is this
algorithm guaranteed to converge and why? Is this algorithm guaranteed to find the global optimum?
What is the reason?
Hint: you can provide a counter-example to invalidate a statement.
(e) Please implement the aforementioned two steps. For the given dataset, after how many updates
does the algorithm converge, what cost function value does it converge to, and what are the obtained
cluster centers? Visualize the clusters at each step and attach the plots here. Please report numbers
with at least one decimal place.
Remark: how we count updates: computing a set of new centroids from the initialization counts as one update.
Hint: You may find hw4_utils.vis_cluster useful.
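If it helps to get started, below is a minimal numpy sketch of the two alternating steps from parts (b) and (c). Everything except the hw4_utils.vis_cluster name mentioned in the hint is made up for this sketch, and initialization, the convergence test, and plotting are left to you:

    import numpy as np

    def kmeans_update(X, mu):
        """One k-Means update on data X of shape (n, 2) with current centers mu of shape (K, 2)."""
        # Assignment step: hard-assign every point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # (n, K)
        assign = dists.argmin(axis=1)

        # Update step: each center becomes the mean of the points assigned to it
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(mu.shape[0])])

        # Cost of Eq. (2) at the new centers and current assignments
        cost = 0.5 * np.sum((X - new_mu[assign]) ** 2)
        return new_mu, assign, cost

Repeating kmeans_update until the assignments (and therefore the cost) stop changing yields the quantities asked for above.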