CS 446 / ECE 449 — Homework 4
Version 1.0
Instructions.
• Everyone must submit individually on Gradescope under hw4 and hw4code.
• The “written” submission at hw4 must be typed and submitted in any format Gradescope accepts (to be safe, submit a PDF). You may use LaTeX, Markdown, Google Docs, MS Word, whatever you like; but it must be typed!
• When submitting at hw4, Gradescope will ask you to mark out boxes around each of your answers; please do this precisely!
• Please make sure your NetID is clear and large on the first page of the homework.
• Your solution must be written in your own words.
• When submitting to hw4code, only upload hw4.py and hw4_utils.py. Additional files will be ignored.
1. Principal Component Analysis
(a) In the lecture, we mainly discussed the case where the data is centered at the origin (see slide 14/21). If the data is not centered at the origin, the PCA algorithm first subtracts the mean of all the data, then computes the projection $w$, and finally adds the mean back (see slide 15/21). In this question, we will prove that subtracting the mean is reasonable for PCA. However, since proving the general theorem is beyond the scope of this course, we will focus on the 2-dimensional case.
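The three-step recipe described above (subtract the mean, compute the projection, add the mean back) can be mirrored directly in code; the following numpy sketch on a made-up 2-D dataset is purely illustrative and is not part of the assignment code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 2-D data that is not centered at the origin.
P = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]]) + np.array([5.0, -2.0])

mean = P.mean(axis=0)                    # step 1: compute the mean
Pc = P - mean                            # step 2: subtract it and run PCA on centered data
_, _, Vt = np.linalg.svd(Pc, full_matrices=False)
w = Vt[0]                                # top principal direction (unit vector)
scores = Pc @ w                          # 1-D projections of the centered points
P_hat = np.outer(scores, w) + mean       # step 3: add the mean back when reconstructing
```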
For a line on the $xy$-plane, we denote it as the set of roots/solutions to $f(p) = 0$, where $f(p) = v^\top p + b$. Given $N$ points on the $xy$-plane, $P = [p^{(1)}, \ldots, p^{(N)}]$ with $p^{(i)} = (x^{(i)}, y^{(i)})$, we want to find the optimal line $\tilde{f} = 0$, with parameters $\tilde{v}$ and $\tilde{b}$, that minimizes the sum of the squared distances between the points and the line, as in the equation below.
$$\tilde{v}, \tilde{b} = \arg\min_{v,\, b} \sum_{i=1}^{N} \left( \frac{v^\top p^{(i)} + b}{\|v\|_2} \right)^2 \tag{1}$$
Prove that if the optimal line exists, then the mean of the N points is on the line.
Hint. Without loss of generality, consider the case where v is a unit vector. Note that v is the
normal vector of the line.
(b) For each of the following statements, specify whether the statement is true or false. If you think the
statement is wrong, explain in 1 to 2 sentences why it is wrong.
• True or False: As shown in the figure below, PCA seeks a line such that the sum of the vertical
distances from the points to the line is minimized.
• True or False: PCA seeks a projection that best represents the data in a least-squares sense.
• True or False: The principal components are not orthogonal to each other.
• True or False: Solving PCA using SVD might result in a solution which corresponds to a local
minimum.
(c) In PCA, assume we have
$$X^\top X = \begin{bmatrix} 12 & 0 & 0 & 0 \\ 0 & 6 & 0 & 0 \\ 0 & 0 & 20 & 0 \\ 0 & 0 & 0 & 10 \end{bmatrix}.$$
Compute the largest eigenvalue of $X^\top X$ and its corresponding eigenvector.
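If you want to double-check a hand computation, a short numpy snippet such as the following (optional; not required in the written answer) will do; np.linalg.eigh returns the eigenvalues of a symmetric matrix in ascending order.

```python
import numpy as np

XtX = np.diag([12.0, 6.0, 20.0, 10.0])   # the matrix X^T X from part (c)
eigvals, eigvecs = np.linalg.eigh(XtX)   # eigenvalues in ascending order
print(eigvals[-1], eigvecs[:, -1])       # largest eigenvalue and its unit eigenvector
```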
Solution.
2. K-Means
We are given a dataset $D = \{x^{(i)}\}_{i=1}^{N}$ of $d$-dimensional points $x^{(i)} \in \mathbb{R}^d$ which we are interested in partitioning into $K$ clusters, each having a cluster center $\mu_k \in \mathbb{R}^d$ ($k \in [K]$), via the K-Means algorithm. Note that we denote $[K] = \{1, \ldots, K\}$ and similarly for $[N]$. Recall (from our lecture notes) that the algorithm optimizes the following cost function:
$$\min_{\mu_k \in \mathbb{R}^d,\, k \in [K]} \;\; \min_{r_{ik} \in \{0,1\},\, i \in [N],\, k \in [K]} \;\; \sum_{i=1}^{N} \sum_{k=1}^{K} \frac{1}{2}\, r_{ik} \big\| x^{(i)} - \mu_k \big\|_2^2 \qquad \text{s.t.} \;\; \sum_{k=1}^{K} r_{ik} = 1 \;\; \forall i \in [N]. \tag{2}$$
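To make the objective concrete, here is a minimal numpy sketch (the array names X, r, and mu are illustrative only and unrelated to hw4.py) that evaluates the double sum in (2) for fixed assignments and centers.

```python
import numpy as np

def kmeans_cost(X, r, mu):
    """Evaluate the K-Means objective in (2).

    X: (N, d) data matrix, r: (N, K) one-hot assignment matrix, mu: (K, d) centers.
    """
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
    return 0.5 * (r * sq_dists).sum()
```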
(a) Given fixed cluster centers $\mu_k$, $\forall k \in [K]$, what are the optimal assignments $r_{ik}$ for the optimization in (2)? (Assume that there is no $\|x^{(i)} - \mu_p\|_2^2 = \|x^{(i)} - \mu_q\|_2^2$ for any $p \neq q$.)
(b) Given fixed assignments $r_{ik}$, $\forall i \in [N], k \in [K]$, we assume each cluster center has at least one data point assigned to it, i.e., $\sum_{i=1}^{N} r_{ik} > 0$. What are the optimal cluster centers $\mu_k$ for the optimization in (2)?
(c) Implement the K-Means algorithm (with $K = 2$) for a 2-dimensional dataset and submit your code to hw4code on Gradescope. For the given dataset, visualize the clusters at the first three steps and attach the plots in your written (typed) report. Please see hw4.py for details.
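For the visualization, something along the lines of the following matplotlib sketch suffices (the function and array names here are hypothetical; follow whatever structure hw4.py prescribes).

```python
import matplotlib.pyplot as plt

def plot_clusters(X, labels, centers, step):
    """Scatter a 2-D dataset colored by cluster index, with the current centers marked."""
    plt.figure()
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="coolwarm", s=15)
    plt.scatter(centers[:, 0], centers[:, 1], c="black", marker="x", s=80)
    plt.title(f"K-Means, step {step}")
    plt.savefig(f"kmeans_step_{step}.png")
```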
Solution.
3. Gaussian Mixture Models
Consider a one-dimensional Gaussian mixture model with $K$ components ($k \in \{1, \ldots, K\}$), each having mean $\mu_k$, variance $\sigma_k^2$, and mixture weight $\pi_k$. Further, we are given a dataset $D = \{x^{(i)}\}_{i=1}^{N}$, where $x^{(i)} \in \mathbb{R}$.
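As a reminder of the generative story behind this model, here is a tiny numpy sketch (with made-up parameter values, not taken from the assignment) of drawing a single sample: pick a component according to the mixture weights, then draw from that component's Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.5, 0.3])       # mixture weights (sum to 1)
mu = np.array([-3.0, 0.0, 4.0])      # component means
sigma = np.array([0.5, 1.0, 2.0])    # component standard deviations

k = rng.choice(len(pi), p=pi)        # z ~ Categorical(pi)
x = rng.normal(mu[k], sigma[k])      # x | z = e_k ~ N(mu_k, sigma_k^2)
```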
(a) What is the log-likelihood of the data, i.e., $\log p(D \mid \mu_k, \sigma_k, \pi_k, 1 \le k \le K)$, according to the Gaussian mixture model, assuming the data samples $x^{(i)}$ are i.i.d.?
(b) Recall the Expectation-Maximization procedure for Gaussian mixture models from our lecture, where $r_{ik} = p_{\psi^{(t)}}\big(z^{(i)} = e_k \mid x^{(i)}\big)$. Prove that $\sum_{k=1}^{K} r_{ik} = 1$.
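Before writing the proof, you can convince yourself numerically with a short scipy check; the responsibility formula used here is the usual E-step posterior from lecture, and the parameter values are made up.

```python
import numpy as np
from scipy.stats import norm

pi = np.array([0.2, 0.5, 0.3])
mu = np.array([-3.0, 0.0, 4.0])
sigma = np.array([0.5, 1.0, 2.0])
x = np.array([-2.5, 0.7, 3.0])                     # a few sample points

dens = norm.pdf(x[:, None], loc=mu, scale=sigma)   # (N, K) component densities
r = pi * dens / (pi * dens).sum(axis=1, keepdims=True)
print(r.sum(axis=1))                               # every row of responsibilities sums to 1
```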
(c) Recall the Expectation-Maximization procedure for Gaussian mixture models from our lecture, where the Lagrangian is
$$L = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \left[ \frac{1}{2\sigma_k^2} \big(x^{(i)} - \mu_k\big)^2 + \frac{1}{2} \log\big(2\pi\sigma_k^2\big) - \log \pi_k \right] + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right).$$
Show how to derive $\pi_k = \frac{1}{N} \sum_{i=1}^{N} r_{ik}$ from setting $\frac{\partial L}{\partial \lambda} = 0$ and $\frac{\partial L}{\partial \pi_k} = 0$ ($1 \le k \le K$).
(d) A nice property of Gaussian mixtures is the universal approximation property. Loosely speaking, given any distribution, there exists a Gaussian mixture that can approximate it up to arbitrary accuracy. Let us verify this powerful property on discrete distributions. Consider $n$ points $z_1, z_2, \ldots, z_n \in \mathbb{R}$, and a discrete distribution characterized by a probability mass function $q$ on $\mathbb{R}$:
$$q(x) = \begin{cases} q_i & \text{if } x = z_i, \\ 0 & \text{otherwise,} \end{cases}$$
where $q_i > 0$ and $\sum_{i=1}^{n} q_i = 1$.
Construct a Gaussian mixture on $\mathbb{R}$ that approximates the distribution characterized by the probability mass function $q$. It is sufficient to describe the construction and show intuitively why it is a good approximation. You don't need to rigorously prove your results.
(e) We have seen the Gaussian mixture with finitely many components ($K$ components), where $\pi_k$ ($k \in \{1, \ldots, K\}$) are the mixture weights. Noting that $\sum_{k=1}^{K} \pi_k = 1$ and $\pi_k \ge 0$, we can see that the mixture weights represent a probability mass function. Therefore, we can generalize the Gaussian mixture to combine infinitely many components by using a probability density function.
Formally, let $\mu : \mathbb{R} \to \mathbb{R}$ and $\sigma : \mathbb{R} \to \mathbb{R}_+$ be two functions representing the means and standard deviations, respectively. Let $\pi : \mathbb{R} \to \mathbb{R}_+$ be a probability density function on $\mathbb{R}$ representing the mixture weights.
Denote the density of the generalized Gaussian mixture as
$$p(x) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma(t)^2}} \exp\!\left( -\frac{(x - \mu(t))^2}{2\sigma(t)^2} \right) \pi(t)\, dt. \tag{3}$$
Prove that $p$ is a valid density function, i.e., $\forall x \in \mathbb{R} : p(x) \ge 0$ and $\int_{-\infty}^{\infty} p(x)\, dx = 1$ (assuming $p(x)$ and $\int_{-\infty}^{\infty} p(x)\, dx$ exist).
Hint: See Fubini's theorem, but you don't need to worry about the mathematical details.
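If you would like numerical intuition before the proof, the following scipy sketch checks (3) for one concrete, made-up choice of $\mu(t) = t$, $\sigma(t) = 1 + t^2$, and $\pi(t)$ the standard normal density; it is an illustration only, not part of the required answer.

```python
import numpy as np
from scipy import integrate

def integrand(t, x):
    """Integrand of (3) for mu(t) = t, sigma(t) = 1 + t^2, pi(t) = standard normal pdf."""
    var_t = (1.0 + t ** 2) ** 2
    gauss = np.exp(-(x - t) ** 2 / (2 * var_t)) / np.sqrt(2 * np.pi * var_t)
    weight = np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)
    return gauss * weight

def p(x):
    val, _ = integrate.quad(integrand, -np.inf, np.inf, args=(x,))
    return val

total, _ = integrate.quad(p, -np.inf, np.inf)
print(total)   # close to 1, up to numerical error
```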
Solution.