CS 189 Introduction to Machine Learning
HW7
Deliverables:
1. Submit a PDF of your homework, with an appendix listing all your code, to the Gradescope assignment entitled “Homework 7 Write-Up”. In addition, please include, as your solutions to each
coding problem, the specific subset of code relevant to that part of the problem. You may typeset your
homework in LaTeX or Word (submit PDF format, not .doc/.docx format) or submit neatly handwritten and scanned solutions. Please start each question on a new page. If there are graphs, include
those graphs in the correct sections. Do not put them in an appendix. We need each solution to be
self-contained on pages of its own.
• In your write-up, please state with whom you worked on the homework.
• In your write-up, please copy the following statement and sign your signature next to it. (Mac
Preview and FoxIt PDF Reader, among others, have tools to let you sign a PDF file.) We want
to make it extra clear so that no one inadvertently cheats.
“I certify that all solutions are entirely in my own words and that I have not looked at another
student’s solutions. I have given credit to all external sources I consulted.”
2. Submit all the code needed to reproduce your results to the Gradescope assignment entitled “Homework 7 Code”. Yes, you must submit your code twice: once in your PDF write-up, following the directions described above, so the readers can easily read it, and once in compilable/interpretable form so the readers can easily run it. Do NOT include any data files we provided. Please include a short file
named README listing your name, student ID, and instructions on how to reproduce your results.
Please take care that your code doesn’t take up inordinate amounts of time or memory. If your code
cannot be executed, your solution cannot be verified.
1 Regularized and Kernel k-Means
Recall that in k-means clustering we attempt to minimize the objective
\[
\min_{C_1, C_2, \ldots, C_k} \; \sum_{i=1}^{k} \sum_{x_j \in C_i} \| x_j - \mu_i \|_2^2,
\quad \text{where} \quad
\mu_i = \operatorname*{argmin}_{\mu_i \in \mathbb{R}^d} \sum_{x_j \in C_i} \| x_j - \mu_i \|_2^2
      = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j,
\qquad i = 1, 2, \ldots, k.
\]
The samples are $\{x_1, \ldots, x_n\}$, where $x_j \in \mathbb{R}^d$. $C_i$ is the set of sample points assigned to cluster $i$ and $|C_i|$ is its cardinality. Each sample point is assigned to exactly one cluster.
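For concreteness, the alternating minimization this objective suggests (assign each point to its nearest center, then recompute each center as the mean above) looks roughly like the NumPy sketch below. It is illustrative only, not required code, and all names in it are our own.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # Plain (Lloyd's) k-means: alternate the assignment step and the mean update above.
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # initial centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment step: send each x_j to its nearest current center mu_i.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mu_i becomes the mean of the points currently assigned to C_i.
        for i in range(k):
            if np.any(labels == i):
                mu[i] = X[labels == i].mean(axis=0)
    return labels, mu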
(a) What is the minimum value of the objective when k = n (the number of clusters equals the number of
sample points)?
(b) (Regularized k-means) Suppose we add a regularization term to the above objective. The objective is
now
\[
\sum_{i=1}^{k} \left( \lambda \| \mu_i \|_2^2 + \sum_{x_j \in C_i} \| x_j - \mu_i \|_2^2 \right).
\]
Show that the optimum of
\[
\min_{\mu_i \in \mathbb{R}^d} \; \lambda \| \mu_i \|_2^2 + \sum_{x_j \in C_i} \| x_j - \mu_i \|_2^2
\]
is obtained at
\[
\mu_i = \frac{1}{|C_i| + \lambda} \sum_{x_j \in C_i} x_j .
\]
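Before deriving this, it can be reassuring to check the claimed closed form numerically. The sketch below is illustrative only; it assumes NumPy and SciPy are available, and the variable names are ours. It compares the closed-form regularized mean against a brute-force minimizer of the same objective for a single cluster.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X_cluster = rng.normal(size=(10, 3))   # the points x_j in one cluster C_i
lam = 2.5                              # regularization strength lambda

def objective(mu):
    # lambda * ||mu||_2^2  +  sum_j ||x_j - mu||_2^2
    return lam * np.sum(mu ** 2) + np.sum((X_cluster - mu) ** 2)

closed_form = X_cluster.sum(axis=0) / (len(X_cluster) + lam)
numerical = minimize(objective, x0=np.zeros(3)).x
print(np.allclose(closed_form, numerical, atol=1e-5))   # expect True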
(c) Here is an example where we would want to regularize clusters. Suppose there are $n$ students who live in an $\mathbb{R}^2$ Euclidean world and who wish to share rides efficiently to Berkeley for their final exam in CS 189.
The university permits k vehicles which may be used for shuttling students to the exam location. The
students need to figure out k good locations to meet up. The students will then walk to the closest meet
up point and then the shuttles will deliver them to the exam location. Let $x_j$ be the location of student $j$, and let the exam location be at $(0, 0)$. Assume that we can drive as the crow flies, i.e., by taking
the shortest path between two points. Write down an appropriate objective function to minimize the
total distance that the students and vehicles need to travel. Hint: your result should be similar to the
regularized k-means objective.
(d) (Kernel k-means) Suppose we have a dataset $\{x_i\}_{i=1}^{n}$, $x_i \in \mathbb{R}^{\ell}$, that we want to split into $k$ clusters, i.e., we want to find the best k-means clustering without the regularization. Furthermore, suppose we know a priori that this data is best clustered in an impractically high-dimensional feature space $\mathbb{R}^m$ with an appropriate metric. Fortunately, instead of having to deal with the (implicit) feature map $\Phi : \mathbb{R}^{\ell} \to \mathbb{R}^m$ and (implicit) distance metric$^1$, we have a kernel function $\kappa(x_1, x_2) = \Phi(x_1) \cdot \Phi(x_2)$ that we can compute easily on the raw samples. How should we perform the kernelized counterpart of k-means clustering?
Derive the underlined portion of this algorithm, and show your work in deriving it. The main issue is that although we define the means $\mu_i$ in the usual way, we can't ever compute $\Phi$ explicitly because it's way too big. Therefore, in the step where we determine which cluster each sample point is assigned to, we must use the kernel function $\kappa$ to obtain the right result. (Review the lecture on kernels if you don't remember how that's done.)

$^1$ Just as the interpretation of kernels in kernelized ridge regression involves an implicit prior/regularizer as well as an implicit feature space, we can think of kernels as generally inducing an implicit distance metric as well. Think of how you would represent the squared distance between two points in terms of pairwise inner products and operations on them.
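To make the hint in the footnote concrete (the identity below is standard, though not spelled out in the handout), the squared feature-space distance between two mapped points can be written entirely in terms of kernel evaluations:
\[
\| \Phi(x) - \Phi(y) \|_2^2
= \Phi(x) \cdot \Phi(x) - 2\, \Phi(x) \cdot \Phi(y) + \Phi(y) \cdot \Phi(y)
= \kappa(x, x) - 2\, \kappa(x, y) + \kappa(y, y).
\]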
Algorithm 1: Kernel k-means
Require: Data matrix $X \in \mathbb{R}^{n \times d}$; number of clusters $K$; kernel function $\kappa(x_1, x_2)$
Ensure: Cluster assignment class($j$) for each sample $x_j$.
function Kernel-k-means($X$, $K$)
    Randomly initialize class($j$) to be an integer in $\{1, 2, \ldots, K\}$ for each $x_j$.
    while not converged do
        for $i \leftarrow 1$ to $K$ do
            Set $S_i = \{\, j \in \{1, 2, \ldots, n\} : \text{class}(j) = i \,\}$.
        for $j \leftarrow 1$ to $n$ do
            Set class($j$) $= \operatorname*{argmin}_i$ ____________________ (this is the underlined portion you must derive)
    Return $S_i$ for $i = 1, 2, \ldots, K$.
end function
(e) The expression you derived may have unnecessary terms or redundant kernel computations. Explain
how to eliminate them; that is, how to perform the computation quickly without doing irrelevant computations or redoing computations already done.
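One possible organization of the computation, sketched below in NumPy, precomputes the full Gram matrix once and reuses per-cluster sums across all sample points within an iteration. It presupposes one particular way of expanding the feature-space distance with $\kappa$ (your derivation in part (d) should justify or correct it), and every name in it is illustrative rather than prescribed.

import numpy as np

def kernel_kmeans(X, K, kernel, n_iters=50, seed=0):
    n = len(X)
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=n)
    # Compute every pairwise kernel value exactly once, up front.
    G = np.array([[kernel(xa, xb) for xb in X] for xa in X])
    for _ in range(n_iters):
        cost = np.full((n, K), np.inf)
        for i in range(K):
            idx = np.flatnonzero(labels == i)
            if idx.size == 0:
                continue
            # Per-cluster quantities are computed once per iteration and reused for all j.
            cross = G[:, idx].sum(axis=1) / idx.size            # (1/|C_i|) sum_l kappa(x_j, x_l)
            within = G[np.ix_(idx, idx)].sum() / idx.size ** 2  # (1/|C_i|^2) sum_{l,m} kappa(x_l, x_m)
            # kappa(x_j, x_j) is the same for every cluster, so it can be dropped from the comparison.
            cost[:, i] = -2.0 * cross + within
        labels = cost.argmin(axis=1)
    return labels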
2 Low-Rank Approximation
Low-rank approximation tries to find an approximation to a given matrix, where the approximating matrix has lower rank than the original matrix. This is useful for mathematical modeling and data compression. Mathematically, given a matrix $M$, we try to find $\hat{M}$ by solving the following optimization problem,
\[
\operatorname*{argmin}_{\hat{M}} \; \| M - \hat{M} \|_F \quad \text{subject to} \quad \operatorname{rank}(\hat{M}) \le k \tag{1}
\]
where $\| A \|_F = \sqrt{\sum_i \sum_j a_{ij}^2}$ is the Frobenius norm, i.e., the square root of the sum of squares of all entries of the matrix.
This problem can be solved using the Singular Value Decomposition (SVD). Specifically, let $M = U \Sigma V^\top$, where $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_n)$ with the singular values in decreasing order. Then a rank-$k$ approximation of $M$ can be written as $\hat{M} = U \hat{\Sigma} V^\top$, where $\hat{\Sigma} = \operatorname{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)$. In this problem, we aim to apply this approximation method to gray-scale images, each of which can be thought of as a 2D matrix.
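A minimal sketch of this truncation with NumPy's SVD is given below. It is not required code, PIL is only one possible way to load a gray-scale image, and the path should be adjusted to wherever the provided data actually lives.

import numpy as np
from PIL import Image

def low_rank_approx(M, k):
    # Truncate the SVD: keep the k largest singular values and zero out the rest.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Example usage on one of the provided images (adjust the path as needed):
face = np.asarray(Image.open("low-rank data/face.jpg").convert("L"), dtype=float)
face_rank5 = low_rank_approx(face, 5)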
(a) Using the image low-rank data/face.jpg, perform a rank-5, rank-20, and rank-100 approximation
on the image. Show both the original image as well as the low-rank images you obtain in your report.
(b) Now perform the same rank-5, rank-20, and rank-100 approximation on low-rank data/sky.jpg.
Show both the original image as well as the low-rank images you obtain in your report.
(c) In one plot, plot the Mean Squared Error (MSE) between the rank-k approximation and the original
image for both low-rank data/face.jpg and low-rank data/sky.jpg, for k ranging from 1 to
100. Be sure to label each curve in the plot. The MSE between two images $I, J \in \mathbb{R}^{w \times h}$ is
\[
\operatorname{MSE}(I, J) = \sum_{i,j} \left( I_{i,j} - J_{i,j} \right)^2 . \tag{2}
\]
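A sketch of how such a plot could be produced under the definition in equation (2) follows; matplotlib is assumed, the helper names are ours, and the paths should again be adjusted to match the provided data.

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

def low_rank_approx(M, k):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def mse(I, J):
    # Sum of squared pixel differences, exactly as defined in equation (2).
    return float(np.sum((I - J) ** 2))

ks = list(range(1, 101))
for name in ("face.jpg", "sky.jpg"):
    img = np.asarray(Image.open("low-rank data/" + name).convert("L"), dtype=float)
    errors = [mse(img, low_rank_approx(img, k)) for k in ks]
    plt.plot(ks, errors, label=name)
plt.xlabel("rank k")
plt.ylabel("MSE (equation 2)")
plt.legend()
plt.savefig("mse_vs_rank.png")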
(d) Find the lowest-rank approximation for which you begin to have a hard time differentiating the original
and the approximated images. Compare your results for the face and the sky image. What are the
possible reasons for the difference?
HW7, ©UCB CS 189, Spring 2019. All Rights Reserved. This may not be publicly shared without explicit permission.