Homework 4: Clustering and EM
This homework assignment focuses on different unsupervised learning methods from a theoretical and practical standpoint. In Problem 1, you will explore Hierarchical Clustering and experiment with how the choice of distance metrics can alter the behavior of the algorithm. In Problem 2, you will derive from scratch the full expectation-maximization algorithm for fitting a Gaussian mixture model. In Problem 3, you will implement K-Means clustering on a dataset of handwritten images and analyze the latent structure learned by this algorithm.
There is a mathematical component and a programming component to this homework. Please submit your PDF and Python files to Canvas, and push all of your work to your GitHub repository. If a question requires you to make any plots, please include those in the writeup.
Hierarchical Clustering [7 pts]
At each step of hierarchical clustering, the two most similar clusters are merged. This step is repeated until a single cluster remains. We saw in class that hierarchical clustering returns different results depending on the pointwise distance and cluster distance that are used. In this problem you will examine different choices of pointwise distance (specified through the choice of norm) and cluster distance, and explore how these choices change how the HAC algorithm runs on a toy data set.
Problem 1
Consider the following four data points in $\mathbb{R}^2$, belonging to three clusters: the black cluster consisting of $x_1 = (0.1, 0.5)$ and $x_2 = (0.35, 0.75)$, the red cluster consisting of $x_3 = (0.28, 1.35)$, and the blue cluster consisting of $x_4 = (0, 1.01)$.
Different pointwise distances $d(x, x') = \|x - x'\|_p$ can be used. Recall the definitions of the $\ell_1$, $\ell_2$, and $\ell_\infty$ norms:
$$\|x\|_1 = \sum_{j=1}^{m} |x_j| \qquad \|x\|_2 = \sqrt{\sum_{j=1}^{m} x_j^2} \qquad \|x\|_\infty = \max_{j \in \{1, \dots, m\}} |x_j|$$
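For concreteness, here is a minimal Python sketch (assuming numpy is available; variable names are illustrative only) that evaluates these three norms on the difference between two of the toy points:

```python
import numpy as np

x1 = np.array([0.10, 0.50])
x2 = np.array([0.35, 0.75])
diff = x1 - x2

# l1, l2, and l-infinity norms of the difference vector
l1   = np.sum(np.abs(diff))        # sum of absolute coordinates
l2   = np.sqrt(np.sum(diff ** 2))  # Euclidean length
linf = np.max(np.abs(diff))        # largest absolute coordinate

print(l1, l2, linf)  # 0.5, ~0.354, 0.25
```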
Also recall the definitions of the min-distance, max-distance, centroid-distance, and average-distance between two clusters (where $\mu_G$ is the centroid of a cluster $G$):
$$d_{\min}(G, G') = \min_{x \in G,\, x' \in G'} d(x, x')$$
$$d_{\max}(G, G') = \max_{x \in G,\, x' \in G'} d(x, x')$$
$$d_{\text{centroid}}(G, G') = d(\mu_G, \mu_{G'})$$
$$d_{\text{avg}}(G, G') = \frac{1}{|G|\,|G'|} \sum_{x \in G} \sum_{x' \in G'} d(x, x')$$
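The sketch below (again assuming numpy; the function and variable names are my own, not part of the provided starter code) shows one way these four cluster distances could be computed for the toy clusters under the $\ell_2$ norm:

```python
import numpy as np
from itertools import product

def cluster_distances(G, Gp, p=2):
    """Return (min, max, centroid, average) distances between clusters G and Gp."""
    pair_dists = [np.linalg.norm(x - xp, ord=p) for x, xp in product(G, Gp)]
    d_min = min(pair_dists)
    d_max = max(pair_dists)
    d_centroid = np.linalg.norm(np.mean(G, axis=0) - np.mean(Gp, axis=0), ord=p)
    d_avg = np.mean(pair_dists)
    return d_min, d_max, d_centroid, d_avg

black = np.array([[0.10, 0.50], [0.35, 0.75]])
red   = np.array([[0.28, 1.35]])
blue  = np.array([[0.00, 1.01]])

print(cluster_distances(black, red))
print(cluster_distances(black, blue))
print(cluster_distances(red, blue))
```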
1. Draw the 2D unit sphere for each norm, defined as $S = \{x \in \mathbb{R}^2 : \|x\| = 1\}$. Feel free to do it by hand, take a picture, and include it in your PDF.
2. For each norm ($\ell_1$, $\ell_2$, $\ell_\infty$) and each cluster distance, specify which two clusters would be the first to merge.
3. Draw the complete dendrograms showing the order of agglomerations for the $\ell_2$ norm and each of the cluster distances. We have provided some code to make this easier for you. You are not required to use it.
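If you want to sanity-check your hand-drawn dendrograms independently of the provided starter code, one option is scipy's hierarchical-clustering routines. This is a rough sketch only; note that scipy agglomerates starting from singleton clusters rather than from the three given clusters, so treat it purely as a cross-check.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.10, 0.50], [0.35, 0.75], [0.28, 1.35], [0.00, 1.01]])

# linkage methods correspond to the cluster distances above:
# 'single' = min, 'complete' = max, 'centroid' = centroid, 'average' = average
for method in ["single", "complete", "centroid", "average"]:
    Z = linkage(X, method=method, metric="euclidean")
    plt.figure()
    plt.title(f"HAC, {method} linkage, l2 norm")
    dendrogram(Z, labels=["x1", "x2", "x3", "x4"])
plt.show()
```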
Solution
Expectation-Maximization for Gaussian Mixture Models [7 pts]
In this problem we will explore expectation-maximization for the Gaussian mixture model. Each observation $x_i$ is a vector in $\mathbb{R}^D$. We posit that each observation comes from one mixture component. For this problem, we will assume there are $c$ components. Each component $k \in \{1, \dots, c\}$ is associated with a mean vector $\mu_k \in \mathbb{R}^D$ and a covariance $\Sigma_k$. Finally, let the (unknown) overall mixing proportions of the components be $\theta \in [0,1]^c$, where $\sum_{k=1}^{c} \theta_k = 1$.
Our generative model is that each of the $n$ observations comes from a single component. We encode observation $i$'s component assignment as a one-hot vector $z_i \in \{0,1\}^c$ over components. This one-hot vector is drawn from $\theta$; then $x_i$ is drawn from $\mathcal{N}(\mu_{z_i}, \Sigma_{z_i})$. Formally, observations are generated in two steps:
$$z_i \sim \text{Categorical}(\theta) \qquad x_i \sim \mathcal{N}(\mu_{z_i}, \Sigma_{z_i})$$
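As a concrete illustration of this generative process, here is a minimal numpy sketch (the parameter values are made up purely for illustration) that samples $n$ observations from a mixture with $c = 3$ components:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, c = 500, 2, 3

theta = np.array([0.5, 0.3, 0.2])                      # mixing proportions, sums to 1
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])  # one mean per component
Sigmas = np.stack([np.eye(D) for _ in range(c)])       # one covariance per component

z = rng.choice(c, size=n, p=theta)                     # component assignment z_i
X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])  # x_i | z_i
```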
Problem 2
1. Intractability of the Data Likelihood. Let $\phi_k$ represent all the parameters associated with component $k$, i.e., $(\mu_k, \Sigma_k)$. We are generally interested in finding a set of parameters $\phi_k$ that maximize the data log likelihood $\log p(\{x_i\} \mid \{\phi_k\})$. Expand the data likelihood to include the necessary sums over observations $x_i$ and latents $z_i$. Why is optimizing this objective directly intractable?
2. Complete-Data Log Likelihood. Define the complete data for this problem to be $D = \{(x_i, z_i)\}_{i=1}^{n}$. Write out the complete-data negative log likelihood:
$$\mathcal{L}(\theta, \{\mu_k, \Sigma_k\}_{k=1}^{c}) = -\ln p(D \mid \theta, \{\mu_k, \Sigma_k\}_{k=1}^{c}).$$
3. Expectation Step. Our next step is to introduce a mathematical expression for $q_i$, the posterior over the hidden component-assignment variable $z_i$ conditioned on the observed data $x_i$ with fixed parameters, i.e., $p(z_i \mid x_i; \theta, \{\mu_k, \Sigma_k\}_{k=1}^{c})$.
• Write down and simplify the expression for $q_i$.
• Give an algorithm for calculating $q_i$ for all $i$, given the observed data $\{x_i\}_{i=1}^{n}$ and settings of the parameters $\theta$ and $\{\mu_k, \Sigma_k\}_{k=1}^{c}$.
4. Maximization Step. Using the $q_i$ estimates from the Expectation Step, derive an update for maximizing the expected complete-data log likelihood in terms of $\theta$ and $\{\mu_k, \Sigma_k\}_{k=1}^{c}$. (A numerical sketch of a single EM iteration appears after this problem statement.)
• Derive an expression for the expected complete-data log likelihood in terms of $q_i$.
• Find an expression for $\theta$ that maximizes this expected complete-data log likelihood. You may find it helpful to use Lagrange multipliers in order to enforce the constraint $\sum_k \theta_k = 1$. Why does this optimized $\theta$ make intuitive sense?
• Apply a similar argument to find the values of the $(\mu_k, \Sigma_k)$'s that maximize the expected complete-data log likelihood.
5. Finally, compare this EM approach to the generative model for classification in Homework 2. How are the computations similar, and how do they differ?
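For reference, the sketch below (plain numpy plus scipy.stats, with illustrative variable names; it is not a required part of your solution) runs a single EM iteration using the standard textbook GMM updates, which you can compare against your own derivations:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, theta, mus, Sigmas):
    """One EM iteration for a GMM. X is (n, D); returns updated parameters."""
    n, D = X.shape
    c = len(theta)

    # E-step: responsibilities q[i, k] = p(z_i = k | x_i; current parameters)
    q = np.column_stack([
        theta[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(c)
    ])
    q /= q.sum(axis=1, keepdims=True)

    # M-step: re-estimate mixing proportions, means, and covariances
    Nk = q.sum(axis=0)                      # effective number of points per component
    theta_new = Nk / n
    mus_new = (q.T @ X) / Nk[:, None]
    Sigmas_new = np.empty((c, D, D))
    for k in range(c):
        diff = X - mus_new[k]
        Sigmas_new[k] = (q[:, k, None] * diff).T @ diff / Nk[k]
    return theta_new, mus_new, Sigmas_new
```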
Solution
K-Means [15 pts]
For this problem you will implement K-Means clustering from scratch. Using numpy is fine, but don’t use a third-party machine learning implementation like scikit-learn. You will then apply this approach to clustering of image data.
We have provided you with the MNIST dataset, a collection of handwritten digits used as a benchmark for image recognition (you can learn more about the data set at http://yann.lecun.com/exdb/mnist/). The MNIST task is widely used in supervised learning, and modern algorithms with neural networks do very well on this task.
Here we will use MNIST for unsupervised learning. You have been given representations of 6000 MNIST images, each of which is a 28 × 28 greyscale handwritten digit. Your job is to implement K-means clustering on MNIST and to test whether this relatively simple algorithm can cluster similar-looking images together.
Problem 3
The given code loads the images into your environment as a 6000 × 28 × 28 array.
• Implement K-means clustering from different random initializations and for several values of $K$, using the $\ell_2$ norm as your distance metric. (Feel free to explore metrics other than the $\ell_2$ norm, but this is strictly optional.) Compare the K-means objective for different values of $K$ and across random initializations. A minimal from-scratch sketch appears after this problem statement.
• For three different values of $K$, and a couple of random restarts for each, show the mean image for each cluster (i.e., the cluster prototypes), as well as a few representative images for each cluster. Explain how you selected these representative images. To render an image, use the pyplot imshow function.
• Are the results wildly different for different restarts and/or different values of $K$? For one of your runs, plot the K-means objective as a function of iteration and verify that it never increases.
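As one possible starting point, here is a minimal sketch of Lloyd's algorithm using only numpy; the function name, random initialization scheme, and iteration count are my own choices, not part of the starter code.

```python
import numpy as np

def kmeans(X, K, n_iters=50, seed=0):
    """Lloyd's algorithm. X is (n, d) flattened images; returns centroids,
    cluster assignments, and the objective value recorded each iteration."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # random init from data
    objective = []
    for _ in range(n_iters):
        # Assignment step: nearest centroid under squared l2 distance
        dists = np.stack([((X - c) ** 2).sum(axis=1) for c in centroids], axis=1)
        labels = dists.argmin(axis=1)
        objective.append(dists[np.arange(len(X)), labels].sum())
        # Update step: each centroid becomes the mean of its assigned points
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels, objective

# Example usage, assuming the loaded 6000 x 28 x 28 array is called `images`:
# X = images.reshape(len(images), -1).astype(float)
# centroids, labels, obj = kmeans(X, K=10)
# plt.imshow(centroids[0].reshape(28, 28), cmap="gray")
```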
As in past problem sets, please include your plots in this document. (There may be several plots for this problem, so feel free to take up multiple pages.)