Starting from:

$30

CS559-B HW3

CS559-B HW3

Problem 1 (25pt): [K-means] Implement the K-means algorithm. Note that you cannot directly
call the built-in kmeans functions.
Figure 1: Scatter plot of datasets and the initialized centers of 3 clusters
Given the matrix X whose rows represent different data points, you are asked to perform a kmeans clustering on this dataset using the Euclidean distance as the distance function. Here k is
chosen as 3. The Euclidean distance d between a vector x and a vector y both in Rp
is defined
as d =
pPp
i=1(xi − yi)
2. All data in X were plotted in Figure 1. The centers of 3 clusters were
1
initialized as µ1 = (6.2, 3.2) (red), µ2 = (6.6, 3.7) (green), µ3 = (6.5, 3.0) (blue).
X =





















5.9 3.2
4.6 2.9
6.2 2.8
4.7 3.2
5.5 4.2
5.0 3.0
4.9 3.1
6.7 3.1
5.1 3.8
6.0 3.0





















(1) [10pt] What’s the center of the first cluster (red) after one iteration? (Answer in the format
of [x1, x2], round results to three decimal places, same as part (2) and (3) )
(2) [5pt] What’s the center of the second cluster (green) after two iteration?
(3) [5pt] What’s the center of the third cluster (blue) when the clustering converges?
(4) [5pt] How many iterations are required for the clusters to converge?
Problem 2 (15pt): [K-means and gradient descent] Recall the loss function for k-means clustering with k clusters, sample points x1, x2,..., xn, and centers µ1, µ2,..., µk:
L =
X
k
j=1
X
xi∈Sj
||xi − µj ||2
, where Sj refers to the set of data points that are closer to µj than to any other cluster mean.
(1) [5pt] Instead of updating µj by computing he mean, let’s minimize L with batch gradient
descent while holding the sets Sj fixed. Derive the update formula for µ1 with learning rate .
(2) [2pt] Derive the update formula for µ1 with stochastic gradient descent on a single sample
point xi
. Use learning rate .
(3) [8pt] In this part, we will connect the batch gradient descent update equation with the
standard k-means algorithm. Recall that in the update step of the standard algorithm, we assign
each cluster center to be the mean of the data points closest to that center. It turns out that a
particular choice of the learning rate  (which may be different for each cluster) makes the two
algorithms (batch gradient descent and the standard k-means algorithm) have identical update
steps. Let’s focus on the update for the first cluster, with center µ1. Calculate the value of  so
that both algorithms perform the same update for µ1.
2
Problem 3 (10pt): [Latent variable model and GMM] Consider the discrete latent variable
model where the latent variable z use 1-of-K representation. The distribution for latent variable z
is defined as:
p(zk = 1) = πk
where {πk} satisfy: 0 ≤ πk ≤ 1 and PK
k=1 πk = 1. Suppose the conditional distribution of
observation x given particular value for z is Gaussian:
p(x|zk = 1) = N (x|µk, Σk)
(1) [2pt] Write down the compact form of p(z) and p(x|z).
(2) [3pt] Show that the marginal distribution p(x) has the following form:
p(x) = X
K
k=1
πkN (x|µk, Σk)
(3) [5pt] If we want to find the MLE solution for parameters πk, µk, Σk in such model, what
algorithm should we use? Briefly describe its major difference compared to K-means algorithm.
Problem 4 (20pt): [Bayesian Network] Suppose we are given 5 random variables, A1, A2, B1, B2, B3.
These random variables have dependence relations as follows: B1 depends on A1, A2 depends on
B1. B2 depends on A2 and A1. B3 depends on A2. All 5 random variables are binary, i.e.,
Ai
, Bj ∈ {0, 1}, i = 1, 2; j = 1, 2, 3.
(1) [5pt] Draw the corresponding bayesian network.
(2) [5pt] Based on the bayesian network in (1), write down the joint distribution p(A1, A2, B1, B2, B3).
(3) [5pt] How many independent parameters are needed to fully specify the joint distribution
in (2).
(4) [5pt] Suppose we do not have any independence assumption, write down one possible factorization of p(A1, A2, B1, B2, B3), and state how many independent parameters are required to
describe joint distribution in this case.
Problem 5 (30 pt) [Neural Networks]
Build a neural network with one hidden layer to predict class labels for Iris plant dataset (https:
//archive.ics.uci.edu/ml/datasets/iris).
(1) [20pt] Properly split the data to training and testing set, and report the training, testing
accuracy. You can use sigmoid activation function and select any reasonable size of hidden units.
Note that for this part, you need to implement the forward/backward function yourself without
using the deep learning package. However, you can use the deep learning package, e.g., tensorflow,
pytorch, matconvnet etc, to compare with your own results.
3
(2) [10pt] Try different design of the neural network, compare with part (1), and report findings.
This is an open-ended question, you can change the previous model in several ways, e.g., (1) change
the activation function to be tanh, ReLU etc, or (2) try to build more complex neural network by
introducing more layers, or many other options. Note that for this part, you are allowed to use
deep learning packages.
4

More products