Homework 4: Clustering and EM
This homework assignment focuses on different unsupervised learning methods from a theoretical and practical standpoint. In Problem 1, you will explore Hierarchical Clustering and experiment with how the choice of distance metrics can alter the behavior of the algorithm. In Problem 2, you will derive from scratch the full expectation-maximization algorithm for fitting a simple topic model. In Problem 3, you will implement K-Means clustering on a dataset of handwritten digit images and analyze the latent structure learned by this algorithm.
There is a mathematical component and a programming component to this homework. Please submit your PDF and Python files to Canvas, and push all of your work to your GitHub repository. If a question requires you to make any plots, please include those in the writeup.
Hierarchical Clustering [7 pts]
At each step of hierarchical clustering, the two most similar clusters are merged together. This step is repeated until there is one single group. We saw in class that hierarchical clustering will return a different result depending on the pointwise distance and cluster distance that are used. In this problem you will examine different choices of pointwise distance (specified through a choice of norm) and cluster distance, and explore how these choices change how the HAC (hierarchical agglomerative clustering) algorithm runs on a toy data set.
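The merge loop itself is simple. Below is a minimal sketch of the agglomerative procedure, assuming a generic `cluster_distance` function (a hypothetical name, not part of the assignment) that implements whichever cluster distance is being tested.

```python
import numpy as np

def hac(points, cluster_distance):
    """Minimal agglomerative loop: repeatedly merge the two closest clusters.

    points           : list of 1-D numpy arrays, one per data point
    cluster_distance : function taking two lists of points and returning a scalar
    Returns the sequence of merges performed, most similar pair first.
    """
    clusters = [[p] for p in points]              # start with singleton clusters
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest cluster distance
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]   # merge cluster b into cluster a
        del clusters[b]
    return merges
```

With min-distance under the $\ell_2$ norm plugged in as `cluster_distance`, this loop reproduces single-linkage clustering; swapping in the other cluster distances below changes which pair merges at each step.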
Problem 1
Consider the following four data points in $\mathbb{R}^2$, belonging to three clusters: the black cluster consisting of $x_1 = (0.1, 0.5)$ and $x_2 = (0.35, 0.75)$, the red cluster consisting of $x_3 = (0.28, 1.35)$, and the blue cluster consisting of $x_4 = (0, 1.01)$.
Different pointwise distances $d(x, x') = \|x - x'\|_p$ can be used. Recall the definitions of the $\ell_1$, $\ell_2$, and $\ell_\infty$ norms:
$$\|x\|_1 = \sum_{j=1}^{m} |x_j|, \qquad \|x\|_2 = \sqrt{\sum_{j=1}^{m} x_j^2}, \qquad \|x\|_\infty = \max_{j \in \{1, \dots, m\}} |x_j|.$$
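As a quick numerical illustration of the three definitions (not part of the required solution), the norms can be computed with numpy and checked against `np.linalg.norm`:

```python
import numpy as np

# difference between two of the toy points, x1 - x2
x = np.array([0.1, 0.5]) - np.array([0.35, 0.75])

l1   = np.sum(np.abs(x))          # ell_1 norm
l2   = np.sqrt(np.sum(x ** 2))    # ell_2 norm
linf = np.max(np.abs(x))          # ell_infinity norm

# np.linalg.norm gives the same values
assert np.isclose(l1,   np.linalg.norm(x, 1))
assert np.isclose(l2,   np.linalg.norm(x, 2))
assert np.isclose(linf, np.linalg.norm(x, np.inf))
```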
Also recall the definitions of min-distance, max-distance, centroid-distance, and average-distance between two clusters (where $\mu_G$ is the centroid of cluster $G$):
$$d_{\min}(G, G') = \min_{x \in G,\, x' \in G'} d(x, x')$$
$$d_{\max}(G, G') = \max_{x \in G,\, x' \in G'} d(x, x')$$
$$d_{\text{centroid}}(G, G') = d(\mu_G, \mu_{G'})$$
$$d_{\text{avg}}(G, G') = \frac{1}{|G||G'|} \sum_{x \in G} \sum_{x' \in G'} d(x, x')$$
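For concreteness, here is a short numpy sketch of the four cluster distances under an arbitrary pointwise distance $d$; the helper names and the example call are our own, not part of the assignment.

```python
import numpy as np

def pairwise(G, H, d):
    """All pointwise distances d(x, x') for x in G, x' in H."""
    return np.array([[d(x, xp) for xp in H] for x in G])

def d_min(G, H, d):      return pairwise(G, H, d).min()
def d_max(G, H, d):      return pairwise(G, H, d).max()
def d_avg(G, H, d):      return pairwise(G, H, d).mean()
def d_centroid(G, H, d): return d(np.mean(G, axis=0), np.mean(H, axis=0))

# example with the ell_2 norm as the pointwise distance
d2 = lambda x, xp: np.linalg.norm(np.asarray(x) - np.asarray(xp), 2)
black = [np.array([0.1, 0.5]), np.array([0.35, 0.75])]
blue  = [np.array([0.0, 1.01])]
print(d_min(black, blue, d2), d_max(black, blue, d2),
      d_centroid(black, blue, d2), d_avg(black, blue, d2))
```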
1. Draw the 2D unit sphere for each norm, defined as $S = \{x \in \mathbb{R}^2 : \|x\| = 1\}$. Feel free to do it by hand, take a picture, and include it in your PDF.
2. For each norm ($\ell_1$, $\ell_2$, $\ell_\infty$) and each cluster distance, specify which two clusters would be the first to merge.
3. Draw the complete dendrograms showing the order of agglomerations for the $\ell_2$ norm and each of the cluster distances.
Solution
Topic Modeling [15 pts]
In this problem we will explore a simplified version of topic modeling in which each document has just a single topic. For this problem, we will assume there are $c$ topics. Each topic $k \in \{1, \dots, c\}$ is associated with a vector $\beta_k \in [0,1]^{|V|}$ describing a distribution over the vocabulary $V$, with $\sum_{j=1}^{|V|} \beta_{k,j} = 1$. Each document $x_i$ is represented as a bag of words $x_i \in (\mathbb{Z}_{\geq 0})^{|V|}$, where $x_{i,j}$ is a non-negative integer giving the number of times word $j$ appears in document $i$. Document $i$ has $n_i$ word tokens in total. Finally, let the (unknown) overall mixing proportion of topics be $\theta \in [0,1]^c$, where $\sum_{k=1}^{c} \theta_k = 1$.

Our generative model is that each of the $n$ documents has a single topic, which we encode as a one-hot vector $z_i \in \{0,1\}^c$ over topics. This one-hot vector is drawn from $\theta$; then each of the words is drawn from $\beta_{z_i}$. Formally, documents are generated in two steps:
$$z_i \sim \text{Categorical}(\theta), \qquad x_i \sim \text{Multinomial}(\beta_{z_i}).$$
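To make the generative story concrete, here is a small sampling sketch under the model above; the function name and the toy parameter values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_documents(theta, beta, n_tokens):
    """Sample documents from the single-topic model.

    theta    : (c,) mixing proportions over topics
    beta     : (c, |V|) per-topic word distributions (rows sum to 1)
    n_tokens : list of document lengths n_i
    Returns the one-hot topics z (n, c) and bag-of-words counts x (n, |V|).
    """
    c, V = beta.shape
    z, x = [], []
    for n_i in n_tokens:
        k = rng.choice(c, p=theta)                # z_i ~ Categorical(theta)
        z.append(np.eye(c, dtype=int)[k])
        x.append(rng.multinomial(n_i, beta[k]))   # x_i ~ Multinomial(n_i, beta_{z_i})
    return np.array(z), np.array(x)

# tiny example: 2 topics, 3-word vocabulary
theta = np.array([0.6, 0.4])
beta  = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
z, x = generate_documents(theta, beta, n_tokens=[5, 8, 4])
```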
Problem 2
1. Draw the graphical model representation of this problem. Be sure to use plate notation to indicate repeated random variables and gray nodes to indicate observed variables.
2. Complete-Data Log Likelihood. Define the complete data for this problem to be $D = \{(x_i, z_i)\}_{i=1}^{n}$.
• Write out the complete-data (negative) log likelihood, $L(\theta, \{\beta_k\}_{k=1}^{c}) = -\ln p(D \mid \theta, \{\beta_k\}_{k=1}^{c})$.
• Explain in one sentence why we cannot directly optimize this loss function.
3. Expectation Step. Our next step is to introduce a mathematical expression for $q_i$, the posterior over the hidden topic variable $z_i$ conditioned on the observed data $x_i$ with fixed parameters, i.e. $p(z_i \mid x_i; \theta, \{\beta_k\}_{k=1}^{c})$.
• Write down and simplify the expression for $q_i$.
• Give an algorithm for calculating $q_i$ for all $i$, given the observed data $\{x_i\}_{i=1}^{n}$ and settings of the parameters $\theta$ and $\{\beta_k\}_{k=1}^{c}$.
4. Maximization Step. Using the $q_i$ estimates from the Expectation Step, derive an update for maximizing the expected complete-data log likelihood in terms of $\theta$ and $\{\beta_k\}_{k=1}^{c}$.
• Derive an expression for the expected complete-data log likelihood in terms of $q_i$.
• Find an expression for $\theta$ that maximizes this expected complete-data log likelihood. You may find it helpful to use Lagrange multipliers in order to enforce the constraint $\sum_k \theta_k = 1$. Why does this optimized $\theta$ make intuitive sense?
• Apply a similar argument to find the values of the $\beta_k$'s that maximize the expected complete-data log likelihood.
Solution
K-Means [15 pts]
For this problem you will implement K-Means clustering from scratch. Using numpy is fine, but don’t use a third-party machine learning implementation like scikit-learn. You will then apply this approach to clustering of image data.
We have provided you with the MNIST dataset, a collection of handwritten digits used as a benchmark for image recognition (you can learn more about the data set at http://yann.lecun.com/exdb/mnist/). The MNIST task is widely used in supervised learning, and modern algorithms with neural networks do very well on this task.
Here we will use MNIST for unsupervised learning. You have been given representations of 6000 MNIST images, each of which is a 28 × 28 greyscale handwritten digit. Your job is to implement K-means clustering on MNIST, and to test whether this relatively simple algorithm can cluster similar-looking images together.
Problem 3
The given code loads the images into your environment as a 6000 × 28 × 28 array.
• Implement K-means clustering from different random initializations and for several values of K, using the $\ell_2$ norm as your distance metric. (You should feel free to explore metrics other than the $\ell_2$ norm, but this is strictly optional.) Compare the K-means objective for different values of K and across random initializations.
• For three different values of K, and a couple of random restarts for each, show the mean images for each cluster (i.e., the cluster prototypes), as well as a few representative images for each cluster. You should explain how you selected these representative images. To render an image, use the pyplot imshow function (a minimal example is sketched after this list).
• Are the results wildly different for different restarts and/or different values of K? For one of your runs, plot the K-means objective function as a function of iteration and verify that it never increases.
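Here is a minimal sketch of rendering the cluster mean images with pyplot's imshow. The names `images` and `assignments` are hypothetical stand-ins for the provided data array and your K-means cluster labels; the random placeholders below only exist so the snippet runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholders: in your code, `images` is the provided 6000x28x28 array and
# `assignments` is the length-6000 vector of cluster indices from your K-means.
images = np.random.rand(6000, 28, 28)
assignments = np.random.randint(0, 10, 6000)
K = 10

fig, axes = plt.subplots(1, K, figsize=(2 * K, 2))
for k, ax in enumerate(axes):
    mean_image = images[assignments == k].mean(axis=0)   # cluster prototype
    ax.imshow(mean_image, cmap="gray")
    ax.set_title(f"cluster {k}")
    ax.axis("off")
plt.show()
```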
As in past problem sets, please include your plots in this document. (There may be several plots for this problem, so feel free to take up multiple pages.)
Solution
Problem 4 (Calibration, 1pt) Approximately how long did this homework take you to complete?