COMS 4771 HW4
You are allowed to write up solutions in groups of at most two students. The group members
need not be the same as for previous homeworks. Only one submission per group
is required by the due date on Gradescope. The name and UNI of every group member must be clearly
specified on the homework. No late homeworks are allowed. To receive credit, a typeset copy of
the homework pdf must be uploaded to Gradescope by the due date. You must show your work to
receive full credit. Discussing possible solutions for homework questions is encouraged on Piazza
and with peers outside your group, but every group must write its own individual solutions. You
must cite all external references you used (including the names of individuals you discussed the
solutions with) to complete the homework.
1 [Estimating parameters with complete and incomplete data] Consider the data generation
process for observation pair (a, b) as follows:
– a is the outcome of an independent (possibly loaded) six-faced die roll. That is, the chance
of rolling face ‘1’ is $p_1$, of rolling face ‘2’ is $p_2$, etc., with a total of six distinct possibilities.
– Given the outcome a, b is drawn independently from the density $q_a e^{-q_a b}$
(where $q_a > 0$), that is, an exponential distribution with rate $q_a$.
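For concreteness, here is a minimal simulation sketch of this generative process in Python; the parameter values below are arbitrary placeholders, not part of the problem.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values, chosen only for illustration.
p = np.array([0.1, 0.1, 0.2, 0.2, 0.2, 0.2])  # face probabilities p_1, ..., p_6
q = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])  # rates q_1, ..., q_6 (all > 0)

def draw_pair():
    # a: outcome of the (possibly loaded) die roll, a face in {1, ..., 6}
    a = rng.choice(6, p=p) + 1
    # b: exponential with rate q_a, i.e., density q_a * exp(-q_a * b)
    b = rng.exponential(scale=1.0 / q[a - 1])
    return a, b

pairs = [draw_pair() for _ in range(10)]  # ten independent (a_i, b_i) pairs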
(i) List all the parameters of this process. We shall denote the collection of all the parameters as the variable θ (the parameter vector).
(ii) Suppose we run this process n times independently, and get the sequence
$(a_1, b_1), (a_2, b_2), \ldots, (a_n, b_n)$.
What is the likelihood that this sequence was generated by a specific setting of the parameter vector θ?
(iii) What is the most likely setting of the parameter vector θ given the complete observation
sequence $(a_i, b_i)_{i=1}^n$? That is, find the Maximum Likelihood Estimate of θ given the observations.
(iv) What is the probability of the partial (incomplete) observation $b_1, b_2, \ldots, b_n$ given a specific setting of the parameter vector θ?
(v) Derive the Expectation Maximization (EM) algorithm to estimate the parameters given
the incomplete observation sequence $(b_i)_{i=1}^n$.
2 [Kernelizing k-means] k-means with the Euclidean distance metric assumes that each pair of
clusters is linearly separable. This may not be the case. A classic example is where we have
two clusters corresponding to data points on two concentric circles in $\mathbb{R}^2$.
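For reference, here is one way such a two-concentric-circles dataset can be generated (a sketch; the radii, sample sizes, and noise level are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)

def concentric_circles(n_per_circle=200, radii=(1.0, 3.0), noise=0.05):
    # Sample points on two concentric circles in R^2, with small radial noise.
    points, labels = [], []
    for label, r in enumerate(radii):
        theta = rng.uniform(0, 2 * np.pi, n_per_circle)
        rr = r + noise * rng.standard_normal(n_per_circle)
        points.append(np.column_stack([rr * np.cos(theta), rr * np.sin(theta)]))
        labels.append(np.full(n_per_circle, label))
    return np.vstack(points), np.concatenate(labels)

X, y = concentric_circles()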
(i) Implement Lloyd’s method for the k-means algorithm and show the resulting cluster assignment for the dataset described above. Give two more examples of datasets in $\mathbb{R}^2$ for
which the optimal k-means setting results in an undesirable clustering. Show the resulting
cluster assignments for the two additional example datasets.
Let φ denote an explicit non-linear feature space mapping into some inner product space. We
will show how one can derive an implicit kernel-based version of Lloyd’s method for
k-means, where we only operate on data as $\phi(x_i) \cdot \phi(x_j) = K(x_i, x_j)$.
(ii) Let $z_{ij}$ be an indicator that is equal to 1 if $\phi(x_i)$ is currently assigned to the jth
cluster and 0 otherwise ($1 \le i \le n$ and $1 \le j \le k$). Show that the jth cluster center $c_j$
can be written as $\sum_{i=1}^{n} \alpha_{ij} \phi(x_i)$, where $\alpha_{ij}$ only depends on the $z_{ij}$ variables.
(iii) Given any two data points $\phi(x_i)$ and $\phi(x_j)$, show that the squared distance
$\|\phi(x_i) - \phi(x_j)\|^2$ can be computed using only (linear combinations of) inner products.
(iv) Given the results of parts (ii) and (iii), show how to compute the squared distance
$\|\phi(x_i) - c_j\|^2$ using only (linear combinations of) inner products between the data points $x_1, \ldots, x_n$.
(v) Using the results of parts (ii), (iii), and (iv), propose an algorithm for a kernelized version
of Lloyd’s method for finding k-means.
(vi) Implement your proposed kernelized k-means method and run it on the three example
datasets of part (i). Compare the resulting clusterings for various choices of kernels (e.g.,
the linear kernel, the quadratic kernel, and the RBF kernel; sample definitions are sketched below).
(submit your datasets and kernelized code on Courseworks to receive full credit)
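For part (vi), the three kernels mentioned can be written, for example, as follows; the quadratic kernel’s offset c and the RBF bandwidth sigma are common but arbitrary choices.

import numpy as np

def linear_kernel(x, y):
    return x @ y

def quadratic_kernel(x, y, c=1.0):
    # One common form of a degree-2 polynomial (quadratic) kernel.
    return (x @ y + c) ** 2

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def gram_matrix(X, kernel):
    # Gram matrix K with K[i, j] = kernel(x_i, x_j).
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])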
3 [Randomized decision rules] Consider a “reward-based” prediction framework, where given
a continuous state-space X and a discrete action-space A, a learning algorithm f : X → A
decides to perform action a ∈ A when it is in state x ∈ X. The learner then receives a
numerical (possibly negative) reward R(x, a) for performing action a ∈ A in state x ∈ X.
(Note: such reward-based frameworks are common in robotics where the learning agent, i.e.,
the robot, performs some—possibly randomized—action a in some situation x and receives
reward R(x, a). The goal of the learning agent is to maximize its expected reward.)
(i) Assume that f is allowed to take a randomized action in any state x. That is, in any state
x ∈ X, f chooses to perform action a with probability P[a|x]. Show that the average
(over the states) expected reward received by f, i.e., $\mathbb{E}_{x, f(x)}[R(x, f(x))]$, is
$\int \Big[ \sum_a R(x, a) \, P[a \,|\, x] \Big] P[x] \, dx.$
(A toy numerical version of this quantity is sketched after this problem.)
(ii) Show that the expected reward is maximized by choosing P[a|x] = 1 for the action a
associated with the maximal reward R(x, a) (for a given state x). This shows us that
there is no benefit in randomizing the best decision rule.
(iii) Can one benefit from randomizing a suboptimal rule? Briefly explain.
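To make the quantity in part (i) concrete, here is a toy numerical sketch in which a small discrete state space stands in for the integral; the reward table and probabilities are made up for illustration.

import numpy as np

# Hypothetical toy problem: 2 states, 3 actions.
P_x = np.array([0.4, 0.6])                 # P[x] for each state
R = np.array([[1.0, -2.0, 0.5],            # R(x, a): rows = states,
              [0.0,  3.0, 1.0]])           #          columns = actions
P_a_given_x = np.array([[0.2, 0.3, 0.5],   # a randomized rule P[a|x];
                        [0.1, 0.8, 0.1]])  # each row sums to 1

# Expected reward: sum_x P[x] * sum_a R(x, a) * P[a|x]
expected_reward = np.sum(P_x * np.sum(R * P_a_given_x, axis=1))

# A deterministic rule that picks argmax_a R(x, a) in each state.
best = np.sum(P_x * R.max(axis=1))
print(expected_reward, best)  # the deterministic rule is never worse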
4 [From distances to embeddings] Your friend from overseas is visiting you and asks for
the geographical locations of popular US cities on a map. Not having access to a US map,
you realize that you cannot provide your friend accurate information. You recall that you
have access to the relative distances between nine popular US cities, given by the following
distance matrix D:
Distances D (in miles) BOS NYC DC MIA CHI SEA SF LA DEN
BOS 0 206 429 1504 963 2976 3095 2979 1949
NYC 206 0 233 1308 802 2815 2934 2786 1771
DC 429 233 0 1075 671 2684 2799 2631 1616
MIA 1504 1308 1075 0 1329 3273 3053 2687 2037
CHI 963 802 671 1329 0 2013 2142 2054 996
SEA 2976 2815 2684 3273 2013 0 808 1131 1307
SF 3095 2934 2799 3053 2142 808 0 379 1235
LA 2979 2786 2631 2687 2054 1131 379 0 1059
DEN 1949 1771 1616 2037 996 1307 1235 1059 0
Being a machine learning student, you believe that it may be possible to infer the locations
of these cities from the distance data. To find an embedding of these nine cities on a two
dimensional map, you decide to solve it as an optimization problem as follows.
You associate a two-dimensional variable $x_i$ with the unknown latitude and longitude values
for each of the nine cities (that is, $x_1$ is the lat/lon value for BOS, $x_2$ is the lat/lon value for
NYC, etc.). You write down the (unconstrained) optimization problem
$\min_{x_1, \ldots, x_9} \; \sum_{i,j} \big( \|x_i - x_j\| - D_{ij} \big)^2,$
where $\sum_{i,j} (\|x_i - x_j\| - D_{ij})^2$ denotes the embedding discrepancy function.
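As a sketch, the discrepancy function can be evaluated in Python as follows (X is a 9 × 2 array whose rows are the $x_i$):

import numpy as np

def discrepancy(X, D):
    # X: 9 x 2 array whose rows are the candidate locations x_1, ..., x_9
    # D: the 9 x 9 distance matrix above
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # ||x_i - x_j||
    return np.sum((dists - D) ** 2)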
(i) What is the derivative of the discrepancy function with respect to a location $x_i$?
(ii) Write a program in your preferred language to find an optimal setting of locations
x1, . . . , x9. You must submit your code to Courseworks to receive full credit.
(iii) Plot the result of the optimization showing the estimated locations of the nine cities.
(here is sample code to plot the city locations in Matlab)
% x1, ..., x9 are the estimated 1x2 lat/lon row vectors for each city
cities={'BOS','NYC','DC','MIA','CHI','SEA','SF','LA','DEN'};
locs = [x1;x2;x3;x4;x5;x6;x7;x8;x9];        % stack into a 9x2 matrix of locations
figure; text(locs(:,1), locs(:,2), cities); % label each location with its city name
What can you say about the estimated locations compared to the actual
geographical locations of these cities?
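For parts (ii) and (iii), one possible approach in Python (a sketch, not the required solution) is to hand the discrepancy function to a general-purpose minimizer and plot the result; the random initialization and the use of scipy’s default solver are arbitrary choices.

import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt

cities = ['BOS', 'NYC', 'DC', 'MIA', 'CHI', 'SEA', 'SF', 'LA', 'DEN']

# The distance matrix D from the table above.
D = np.array([
    [   0,  206,  429, 1504,  963, 2976, 3095, 2979, 1949],
    [ 206,    0,  233, 1308,  802, 2815, 2934, 2786, 1771],
    [ 429,  233,    0, 1075,  671, 2684, 2799, 2631, 1616],
    [1504, 1308, 1075,    0, 1329, 3273, 3053, 2687, 2037],
    [ 963,  802,  671, 1329,    0, 2013, 2142, 2054,  996],
    [2976, 2815, 2684, 3273, 2013,    0,  808, 1131, 1307],
    [3095, 2934, 2799, 3053, 2142,  808,    0,  379, 1235],
    [2979, 2786, 2631, 2687, 2054, 1131,  379,    0, 1059],
    [1949, 1771, 1616, 2037,  996, 1307, 1235, 1059,    0],
], dtype=float)

def packed_discrepancy(flat):
    # Discrepancy of the 9 candidate locations packed into a flat vector.
    X = flat.reshape(9, 2)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.sum((dists - D) ** 2)

# Minimize from a random start; a gradient-based update also works if you
# supply the derivative from part (i).
x0 = np.random.default_rng(0).standard_normal(18)
locs = minimize(packed_discrepancy, x0).x.reshape(9, 2)

fig, ax = plt.subplots()
for (px, py), name in zip(locs, cities):
    ax.text(px, py, name)
ax.set_xlim(locs[:, 0].min() - 300, locs[:, 0].max() + 300)
ax.set_ylim(locs[:, 1].min() - 300, locs[:, 1].max() + 300)
plt.show()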