CSE 250A. Assignment 8

8.1 EM algorithm for binary matrix completion
In this problem you will use the EM algorithm to build a simple movie recommendation system. Download
the files hw8 movies.txt, hw8 ids.txt, and hw8 ratings.txt. The last of these files contains a matrix of zeros,
ones, and missing elements denoted by question marks. The (i, j)-th element in this matrix contains the i-th student's rating of the j-th movie, according to the following key:

1  recommended,
0  not recommended,
?  not seen.
(a) Sanity check
Compute the mean popularity rating of each movie, given by the simple ratio

$$\frac{\text{number of students who recommended the movie}}{\text{number of students who saw the movie}},$$

and sort the movies by this ratio. Print out the movie titles from least popular (The Last Airbender) to most popular (Inception). Note how well these rankings do or do not correspond to your individual preferences.
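A minimal sketch of this computation in Python, assuming the file names actually use underscores (hw8_movies.txt, hw8_ratings.txt), that the movies file lists one title per line, and that the ratings file holds one whitespace-separated row of 1/0/? entries per student:

import numpy as np

# Parse titles (one per line) and the ratings matrix (rows = students, columns = movies);
# encode '?' (not seen) as -1 so the matrix is numeric.
movies = [line.strip() for line in open("hw8_movies.txt") if line.strip()]
ratings = np.array([[-1 if r == "?" else int(r) for r in line.split()]
                    for line in open("hw8_ratings.txt") if line.strip()])

seen = (ratings >= 0).sum(axis=0)          # students who saw each movie
recommended = (ratings == 1).sum(axis=0)   # students who recommended each movie
popularity = recommended / seen

for j in np.argsort(popularity):           # least popular first
    print(f"{popularity[j]:.3f}  {movies[j]}")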
(b) Likelihood
Now you will learn a naive Bayes model of these movie ratings, represented by the belief network shown below, with hidden variable Z ∈ {1, 2, . . . , k} and partially observed binary variables
R1, R2, . . . , R76 (corresponding to movie ratings).
[Figure: naive Bayes belief network with hidden node Z and observed children R1, R2, R3, . . .]
This model assumes that there are k different types of movie-goers, and that the i-th type of movie-goer, who represents a fraction P(Z = i) of the overall population, likes the j-th movie with conditional probability P(Rj = 1|Z = i). Let Ωt denote the set of movies seen (and hence rated) by the t-th student. Show that the likelihood of the t-th student's ratings is given by
$$P\Big(\{R_j = r_j^{(t)}\}_{j\in\Omega_t}\Big) \;=\; \sum_{i=1}^{k} P(Z=i) \prod_{j\in\Omega_t} P\big(R_j = r_j^{(t)} \,\big|\, Z=i\big).$$
(c) E-step
The E-step of this model is to compute, for each student, the posterior probability that he or she
corresponds to a particular type of movie-goer. Show that
$$P\Big(Z=i \,\Big|\, \{R_j = r_j^{(t)}\}_{j\in\Omega_t}\Big) \;=\; \frac{P(Z=i)\,\prod_{j\in\Omega_t} P\big(R_j = r_j^{(t)} \,\big|\, Z=i\big)}{\sum_{i'=1}^{k} P(Z=i')\,\prod_{j\in\Omega_t} P\big(R_j = r_j^{(t)} \,\big|\, Z=i'\big)}.$$
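Looking ahead to the implementation in part (e), the E-step above reduces to a short array computation. A minimal sketch, assuming the conventions of the earlier sketch (ratings is a T x 76 integer array with missing entries encoded as -1), with pZ a length-k vector holding P(Z = i) and pR a k x 76 matrix holding P(Rj = 1|Z = i):

import numpy as np

def e_step(ratings, pZ, pR):
    """Return rho[t, i] = P(Z=i | student t's observed ratings)."""
    T = ratings.shape[0]
    rho = np.zeros((T, len(pZ)))
    for t in range(T):
        seen = ratings[t] >= 0                      # Omega_t: movies rated by student t
        r = ratings[t, seen]
        # log [ P(Z=i) * prod_{j in Omega_t} P(Rj = r_j | Z=i) ], kept in log space for stability
        log_joint = np.log(pZ) + (r * np.log(pR[:, seen])
                                  + (1 - r) * np.log(1 - pR[:, seen])).sum(axis=1)
        log_joint -= log_joint.max()                # normalize (the denominator above)
        rho[t] = np.exp(log_joint) / np.exp(log_joint).sum()
    return rho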
(d) M-step
The M-step of the model is to re-estimate the probabilities P(Z =i) and P(Rj = 1|Z =i) that define
the CPTs of the belief network. As shorthand, let
$$\rho_{it} \;=\; P\Big(Z=i \,\Big|\, \{R_j = r_j^{(t)}\}_{j\in\Omega_t}\Big)$$
denote the probabilities computed in the E-step of the algorithm. Also, let T denote the number of students. Show that the EM updates are given by
$$P(Z=i) \;\leftarrow\; \frac{1}{T}\sum_{t=1}^{T}\rho_{it},$$
$$P(R_j=1\,|\,Z=i) \;\leftarrow\; \frac{\sum_{\{t\,|\,j\in\Omega_t\}} \rho_{it}\, I\big(r_j^{(t)},1\big) \;+\; \sum_{\{t\,|\,j\notin\Omega_t\}} \rho_{it}\, P(R_j=1\,|\,Z=i)}{\sum_{t=1}^{T}\rho_{it}}.$$
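A matching sketch of the M-step updates above, under the same array conventions (again only an illustration, not the required implementation):

import numpy as np

def m_step(ratings, rho, pR):
    """Re-estimate P(Z=i) and P(Rj=1|Z=i) from the E-step posteriors rho[t, i]."""
    pZ_new = rho.mean(axis=0)                       # P(Z=i) <- (1/T) sum_t rho_it
    pR_new = np.empty_like(pR)
    for j in range(ratings.shape[1]):
        rated = ratings[:, j] >= 0                  # students with movie j in Omega_t
        numer = (rho[rated] * (ratings[rated, j] == 1)[:, None]).sum(axis=0) \
                + (rho[~rated] * pR[:, j]).sum(axis=0)
        pR_new[:, j] = numer / rho.sum(axis=0)
    return pZ_new, pR_new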
(e) Implementation
Download the files hw8 probZ init.txt and hw8 probR init.txt, and use them to initialize the probabilities P(Z = i) and P(Rj = 1|Z = i) for a model with k = 4 types¹ of movie-goers. Run 256 iterations of the EM algorithm, computing the (normalized) log-likelihood
$$L \;=\; \frac{1}{T}\sum_{t=1}^{T} \log P\Big(\{R_j = r_j^{(t)}\}_{j\in\Omega_t}\Big)$$
at each iteration. Does your log-likelihood increase (i.e., become less negative) at each iteration? Fill
in a completed version of the following table, using the already provided entries to check your work:
iteration log-likelihood L
0 -29.3276
1 -18.1393
2
4
8
16 -13.8581
32
64
128
256
¹There is nothing special about these initial values or the choice of k = 4; feel free to experiment with other choices.
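Putting the pieces together for part (e): a sketch of the normalized log-likelihood and of a possible EM driver loop, reusing the e_step and m_step sketches above. The np.loadtxt calls and the orientation of the initialization files are assumptions that should be checked against the actual files:

import numpy as np

def log_likelihood(ratings, pZ, pR):
    """Normalized log-likelihood L = (1/T) sum_t log P({Rj = rj^(t)} for j in Omega_t)."""
    total = 0.0
    for t in range(ratings.shape[0]):
        seen = ratings[t] >= 0
        r = ratings[t, seen]
        joint = pZ * np.prod(pR[:, seen] ** r * (1 - pR[:, seen]) ** (1 - r), axis=1)
        total += np.log(joint.sum())
    return total / ratings.shape[0]

# Possible driver (commented out because the file layouts are assumptions):
# pZ = np.loadtxt("hw8_probZ_init.txt")             # assumed shape (k,)
# pR = np.loadtxt("hw8_probR_init.txt").T           # assumed to end up with shape (k, 76)
# for it in range(257):
#     if it in (0, 1, 2, 4, 8, 16, 32, 64, 128, 256):
#         print(it, log_likelihood(ratings, pZ, pR))
#     pZ, pR = m_step(ratings, e_step(ratings, pZ, pR), pR)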
(f) Personal movie recommendations
Find your student PID in hw8 ids.txt to determine the row of the ratings matrix that stores your personal data. Compute the posterior probability in part (c) for this row from your trained model, and
then compute your expected ratings on the movies you haven’t yet seen:
$$P\Big(R_\ell = 1 \,\Big|\, \{R_j = r_j^{(t)}\}_{j\in\Omega_t}\Big) \;=\; \sum_{i=1}^{k} P\Big(Z=i \,\Big|\, \{R_j = r_j^{(t)}\}_{j\in\Omega_t}\Big)\, P(R_\ell = 1\,|\,Z=i) \qquad \text{for } \ell \notin \Omega_t.$$
Print out the list of these (unseen) movies sorted by their expected ratings. Does this list seem to
reflect your personal tastes better than the list in part (a)? Hopefully it does (although our data set is
obviously far smaller and more incomplete than the data sets at companies like Netflix or Amazon).
Note: if you didn’t complete the survey in time, then you will need to hard-code your ratings in order
to answer this question.
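A sketch of this computation, reusing the conventions and the e_step function from the earlier sketches; the row index t is assumed to come from matching your PID against the rows of hw8 ids.txt:

import numpy as np

def expected_unseen_ratings(ratings, movies, pZ, pR, t):
    """Print P(R_l = 1 | student t's observed ratings) for the movies student t has not seen."""
    rho_t = e_step(ratings, pZ, pR)[t]              # posterior over the k movie-goer types
    unseen = ratings[t] < 0
    expected = rho_t @ pR[:, unseen]                # sum_i P(Z=i | ...) P(R_l=1 | Z=i)
    for score, title in sorted(zip(expected, np.array(movies)[unseen]), reverse=True):
        print(f"{score:.3f}  {title}")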
(g) Source code
Turn in a hard-copy printout of your source code for all parts of this problem. As usual, you may
program in the language of your choice.
8.2 Mixture model decision boundary
Consider a multivariate Gaussian mixture model with two mixture components. The model has a hidden
binary variable y ∈ {0, 1} and an observed vector variable ~x ∈ R^d, with graphical model:

[Figure: y → x]

$$P(y=i) = \pi_i, \qquad P(\vec{x}\,|\,y=i) \;=\; (2\pi)^{-d/2}\,|\Sigma_i|^{-1/2}\, \exp\!\Big(-\tfrac{1}{2}(\vec{x}-\vec{\mu}_i)^{\top}\Sigma_i^{-1}(\vec{x}-\vec{\mu}_i)\Big).$$
The parameters of the Gaussian mixture model are the prior probabilities π0 and π1, the mean vectors ~µ0
and ~µ1, and the covariance matrices Σ0 and Σ1.
(a) Compute the posterior distribution P(y= 1|~x) as a function of the parameters (π0, π1, ~µ0, ~µ1, Σ0, Σ1)
of the Gaussian mixture model.
(b) Consider the special case of this model where the two mixture components share the same covariance
matrix: namely, Σ0 = Σ1 = Σ. In this case, show that your answer from part (a) can be written as:
$$P(y=1\,|\,\vec{x}) \;=\; \sigma(\vec{w}\cdot\vec{x} + b), \qquad \text{where } \sigma(z) = \frac{1}{1+e^{-z}}.$$
As part of your answer, you should express the parameters ( ~w, b) of the sigmoid function explicitly in
terms of the parameters (π0, π1, ~µ0, ~µ1, Σ) of the Gaussian mixture model.
(c) Assume again that Σ0 = Σ1 = Σ. Note that in this case, the decision boundary for the mixture model
reduces to a hyperplane; namely, we have P(y = 1|~x) = P(y = 0|~x) when ~w · ~x + b = 0. Let k be a
positive integer. Show that the set of points for which
$$\frac{P(y=1\,|\,\vec{x})}{P(y=0\,|\,\vec{x})} \;=\; k$$
is also described by a hyperplane, and find the equation for this hyperplane. (These are the points
for which one class is precisely k times more likely than the other.) Of course, your answer should
recover the hyperplane decision boundary ~w · ~x + b = 0 when k = 1.
8.3 Gradient ascent versus EM
[Figure: belief network with parent nodes X1, X2, X3, . . . , Xd and child node Y]
Consider the belief network shown above, with binary random variables Xi∈ {0, 1} and Y ∈ {0, 1}. Suppose
that the conditional probability table (CPT) for node Y has the form
$$P\big(Y=1 \,\big|\, \vec{X}=\vec{x}\big) \;=\; 1 - \exp(-\vec{v}\cdot\vec{x}),$$
which is parameterized in terms of nonnegative weights vi ≥ 0, one for each parent Xi.
(a) Log (conditional) likelihood
Consider a fully observed data set of i.i.d. examples $\{\vec{x}_t, y_t\}_{t=1}^{T}$, where each ~xt ∈ {0, 1}^d and yt ∈ {0, 1}. Compute the log (conditional) likelihood
$$L(\vec{v}) \;=\; \sum_{t=1}^{T} \log P(y_t\,|\,\vec{x}_t)$$
in terms of the weights vi. You will want to simplify your expression as much as possible for the next part of this question.
(b) Gradient
As shorthand in this problem, let ρt = P(Y = 1|~xt). The gradient of the conditional log-likelihood
from part (a) is given by one of the following expressions:
(i) $\displaystyle \frac{\partial L}{\partial \vec{v}} = \sum_{t=1}^{T} (y_t - \rho_t)\,\vec{x}_t$

(ii) $\displaystyle \frac{\partial L}{\partial \vec{v}} = \sum_{t=1}^{T} \left(\frac{y_t - \rho_t}{\rho_t}\right)\vec{x}_t$

(iii) $\displaystyle \frac{\partial L}{\partial \vec{v}} = \sum_{t=1}^{T} \left(\frac{y_t - \rho_t}{\rho_t(1-\rho_t)}\right)\vec{x}_t$
Compute the gradient, and show that your final answer matches (the correct) one of these expressions.
(c) Noisy-OR
With a simple transformation, we can express the conditional probability model for this problem in a
more familiar form. Let
$$v_i \;=\; -\log(1 - p_i),$$
where the new parameters pi ∈ [0, 1] lie in the unit interval. Use this transformation to express the conditional probability
$$P(y=1\,|\,\vec{x}) \;=\; 1 - e^{-\vec{v}\cdot\vec{x}}$$
in terms of the parameters pi. Your answer should reproduce the result for a noisy-OR model with parameters pi.
(d) Chain rule
We can perform gradient ascent in either the weights vi or the noisy-OR parameters pi. Use the transformation in part (c) and the chain rule from calculus to show that:
$$\frac{\partial L}{\partial p_i} \;=\; \left(\frac{1}{1-p_i}\right)\frac{\partial L}{\partial v_i}.$$
(e) Gradient ascent versus EM
Consider the update rule for gradient ascent (GA) in the noisy-OR parameters pi ∈ [0, 1]. This update
takes the simple form:
$$p_i \;\leftarrow\; p_i + \eta\left(\frac{\partial L}{\partial p_i}\right). \qquad \text{(GA)}$$
In an earlier homework, you also derived an EM update for the noisy-OR parameters pi in this belief network. It had the following form:
$$p_i \;\leftarrow\; \frac{p_i}{T_i}\left[\sum_{t=1}^{T}\frac{y_t\, x_{it}}{\rho_t}\right], \qquad \text{(EM)}$$
where on the right hand side of this update we have used shorthand for the conditional probability ρt = P(y = 1|~xt) and the sums of inputs $T_i = \sum_{t=1}^{T} x_{it}$.
How are these two different updates related? In this problem you will prove a surprising result.
Suppose that at each GA update we use a specialized learning rate ηi for each parameter pi, and that we set
$$\eta_i \;=\; \frac{p_i(1-p_i)}{T_i}.$$
For this particular adaptive choice of the learning rates ηi (which rescale the individual components
of the gradient), show that the resulting GA-update is identical to the EM-update.
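Before proving the equivalence, it can be sanity-checked numerically. The sketch below uses hypothetical toy data (the sizes and seeds are assumptions for illustration only), takes one EM step and one GA step with the adaptive rates ηi = pi(1 − pi)/Ti from the same starting point, and compares them; the gradient is approximated by finite differences so that no analytic expression from part (b) is needed:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: T examples of d binary inputs, labels drawn from the 1 - exp(-v.x) model.
T, d = 200, 5
X = rng.integers(0, 2, size=(T, d)).astype(float)
X[X.sum(axis=1) == 0, 0] = 1.0                      # avoid all-zero inputs, where P(y=1|x) = 0
v_true = -np.log(1 - rng.uniform(0.2, 0.8, size=d))
y = (rng.uniform(size=T) < 1 - np.exp(-X @ v_true)).astype(float)

def log_likelihood(p):
    rho = 1 - np.exp(-(X @ -np.log(1 - p)))         # rho_t = P(y=1|x_t) under parameters p
    return np.sum(y * np.log(rho) + (1 - y) * np.log(1 - rho))

p = rng.uniform(0.3, 0.7, size=d)                   # common starting point for both updates
rho = 1 - np.exp(-(X @ -np.log(1 - p)))
Ti = X.sum(axis=0)

# EM update, exactly as printed above: p_i <- (p_i / T_i) sum_t y_t x_it / rho_t
p_em = (p / Ti) * ((y / rho)[:, None] * X).sum(axis=0)

# GA update with eta_i = p_i (1 - p_i) / T_i; dL/dp_i estimated by central finite differences
grad = np.zeros(d)
eps = 1e-6
for i in range(d):
    dp = np.zeros(d)
    dp[i] = eps
    grad[i] = (log_likelihood(p + dp) - log_likelihood(p - dp)) / (2 * eps)
p_ga = p + (p * (1 - p) / Ti) * grad

print(np.max(np.abs(p_em - p_ga)))                  # ~0, up to finite-difference error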
8.4 Similarity learning with logistic regression
In class we introduced logistic regression as a probabilistic model of classification that mapped vector inputs ~x ∈ R^n to binary outputs y ∈ {0, 1}. We also showed how to train the model from labeled examples {(~x1, y1), (~x2, y2), . . . , (~xT, yT)}; this was done by maximizing the log (conditional) likelihood $\sum_{t=1}^{T}\log P(y_t\,|\,\vec{x}_t)$ of correct classification.
Sometimes it is necessary to train models of logistic regression from examples that are not so explicitly
labeled. An interesting case is the problem of similarity learning. In this problem we are told only whether
pairs of inputs belong to the same class—that is, whether or not they share the same label—but not the
actual labels of the inputs themselves.
The figure below shows a belief network for this problem. The observed inputs ~x, ~x′ ∈ R^n have unknown (hidden) labels y, y′ ∈ {0, 1}, and the observed variable s ∈ {0, 1} indicates whether or not y = y′. Here σ(z) = [1 + e^{−z}]^{−1} denotes the sigmoid function. Note that the same weight vector ~w ∈ R^n parameterizes the logistic regressions for P(y|~x) and P(y′|~x′).
[Figure: two copies of the logistic regression network, with inputs X1, X2, X3, . . . , Xn and hidden labels Y and Y′, both feeding the observed similarity node S]
In this problem you will be guided to derive an EM algorithm for models of this form. (The virtue of the
EM algorithm for this model is that its M-step consists of ordinary logistic regressions.)
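For reference, a minimal sketch of that "ordinary logistic regression" building block: gradient ascent on Σt [ yt log σ(~w · ~xt) + (1 − yt) log σ(−~w · ~xt) ], with hypothetical toy data (the data, learning rate, and iteration count are assumptions for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, iters=5000):
    """Gradient ascent on sum_t [ y_t log s(w.x_t) + (1 - y_t) log s(-w.x_t) ]."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w += lr * X.T @ (y - sigmoid(X @ w)) / len(y)   # average log-likelihood gradient
    return w

# Hypothetical usage: labels drawn from a known weight vector, then refit.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (rng.uniform(size=500) < sigmoid(X @ np.array([1.0, -2.0, 0.5]))).astype(float)
w_hat = fit_logistic(X, y)                              # should land near [1, -2, 0.5]

In the M-step of part (e), the objective has exactly this form, with the E-step posteriors standing in for the labels.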
(a) Inference for similar examples
Show how to compute the posterior probability P(y = 1, y′ = 1|~x, ~x′, s = 1) in terms of the weight vector ~w and the inputs ~x, ~x′. Simplify your answer as much as possible.
(b) Inference for dissimilar examples
Show how to compute the posterior probability P(y = 1, y′ = 0|~x, ~x′, s = 0) in terms of the weight vector ~w and the inputs ~x, ~x′. Simplify your answer as much as possible.
(c) E-Step
The posterior probabilities needed for the EM algorithm can be derived (fairly easily) from the ones
that you just computed. Choose the correct answer for each of the following.
The posterior probability P(y = 1|~x, ~x′, s = 1) is equal to:
(a) P(y = 1, y′ = 0|~x, ~x′, s = 0)
(b) P(y = 1, y′ = 1|~x, ~x′, s = 1)
(c) 1 − P(y = 1, y′ = 0|~x, ~x′, s = 0)
(d) 1 − P(y = 1, y′ = 1|~x, ~x′, s = 1)

The posterior probability P(y′ = 1|~x, ~x′, s = 1) is equal to:
(a) P(y = 1, y′ = 0|~x, ~x′, s = 0)
(b) P(y = 1, y′ = 1|~x, ~x′, s = 1)
(c) 1 − P(y = 1, y′ = 0|~x, ~x′, s = 0)
(d) 1 − P(y = 1, y′ = 1|~x, ~x′, s = 1)

The posterior probability P(y = 1|~x, ~x′, s = 0) is equal to:
(a) P(y = 1, y′ = 0|~x, ~x′, s = 0)
(b) P(y = 1, y′ = 1|~x, ~x′, s = 1)
(c) 1 − P(y = 1, y′ = 0|~x, ~x′, s = 0)
(d) 1 − P(y = 1, y′ = 1|~x, ~x′, s = 1)

The posterior probability P(y′ = 1|~x, ~x′, s = 0) is equal to:
(a) P(y = 1, y′ = 0|~x, ~x′, s = 0)
(b) P(y = 1, y′ = 1|~x, ~x′, s = 1)
(c) 1 − P(y = 1, y′ = 0|~x, ~x′, s = 0)
(d) 1 − P(y = 1, y′ = 1|~x, ~x′, s = 1)
(d) Log-likelihood
Consider a data set {(~x1, ~x′1, s1), . . . , (~xT, ~x′T, sT)} that consists of pairs of inputs (~xt, ~x′t) and judgments of similarity st ∈ {0, 1}. Derive an explicit expression for the log (conditional) likelihood
$$L(\vec{w}) \;=\; \sum_t \log P(s_t\,|\,\vec{x}_t, \vec{x}\,'_t)$$
in terms of the weight vector ~w and the examples in the data set. Hint: You will need to sum over (allowed) values of the hidden variables (y, y′).
(e) M-step
As shorthand, let $\bar{y}_t = P(y=1\,|\,\vec{x}_t, \vec{x}\,'_t, s_t)$ and $\bar{y}\,'_t = P(y'=1\,|\,\vec{x}_t, \vec{x}\,'_t, s_t)$ denote the posterior probabilities computed in the E-step of the EM algorithm. The M-step for this model takes the following form:
$$\vec{w} \;\leftarrow\; \operatorname*{argmax}_{\vec{w}} \left\{\sum_t \Big[\, \bar{y}_t \log\sigma(\vec{w}\cdot\vec{x}_t) + (1-\bar{y}_t)\log\sigma(-\vec{w}\cdot\vec{x}_t) + \bar{y}\,'_t \log\sigma(\vec{w}\cdot\vec{x}\,'_t) + (1-\bar{y}\,'_t)\log\sigma(-\vec{w}\cdot\vec{x}\,'_t) \,\Big]\right\}$$
It is generally much simpler to maximize the right hand side of this expression (which should look familiar to you) than the log-likelihood L(~w) in part (d).
Let η > 0 denote a small learning rate. Derive the iterative update rule to perform the above maximization using gradient ascent. Complete the expression below with your final answer:
$$\vec{w} \;\leftarrow\; \vec{w} + \eta\left\{\sum_t \Big[\;\underline{\qquad\qquad}\;\Big]\right\}$$
Note: compute the gradient for the maximization shown above, not the gradient of your answer to part (d) of this problem (which is much more complicated).
8.5 Logistic regression across time
[Figure: logistic regression belief network with inputs X1, X2, X3, . . . , Xd and output Y]
In class we studied a simple model of logistic regression represented by the belief network shown above.
Predictions were modeled by the conditional probability
P(Y = 1|~x) = σ( ~w · ~x),
where σ(z) = [1 + e^{−z}]^{−1} was the sigmoid function, and the model as a whole was parameterized by the weight vector ~w ∈ R^d. Below we show a more compact way to draw this same belief network, with all the inputs represented by a single vector-valued node.

[Figure: X → Y]
In lecture we considered this model for i.i.d. data. In this problem you will explore a model for data that does
not satisfy this assumption. Since the i.i.d. assumption does not hold, we will try to model the dependence
across time explicitly.
[Figure: time-series belief network with inputs X1, X2, . . . , XT−1, XT and outputs Y0, Y1, Y2, . . . , YT−1, YT, where each Yt (t ≥ 1) has parents Xt and Yt−1, with CPTs P(Yt = 1 | ~xt, Yt−1 = 0) = σ(~w0 · ~xt) and P(Yt = 1 | ~xt, Yt−1 = 1) = σ(~w1 · ~xt)]
Consider the belief network shown above for predicting a time series of binary outputs yt ∈ {0, 1} from a time series of inputs ~xt ∈ R^d, where t = 1, 2, . . . , T. For times t ≥ 1, the belief network has the sigmoid conditional probability tables (CPTs):
$$P(Y_t = 1 \,|\, \vec{x}_t, Y_{t-1} = 0) \;=\; \sigma(\vec{w}_0 \cdot \vec{x}_t),$$
$$P(Y_t = 1 \,|\, \vec{x}_t, Y_{t-1} = 1) \;=\; \sigma(\vec{w}_1 \cdot \vec{x}_t),$$
where ~w0 ∈ R^d and ~w1 ∈ R^d are different weight vectors. In other words, we predict yt from ~xt differently depending on the value of yt−1: if yt−1 = 0, then we use ~w0 in the sigmoid CPT for predicting yt, and if yt−1 = 1, then we use ~w1 in the sigmoid CPT.
In addition, imagine for this application that only two outputs are observed, an initial one at time t = 0
and a final one at time t = T. These observations are indicated as usual by shaded nodes.
(a) Likelihood
As shorthand, let αit = P(Yt = i|y0, ~x1, ~x2, . . . , ~xt) where i ∈ {0, 1}. Derive an efficient recursion
for computing the probabilities αj,t+1 in terms of the probabilities αit at the previous time step as well
as the model parameters ~w0 and ~w1. Show your work for full credit, justifying each step.
(b) Most likely hidden states
Let i ∈ {0, 1}, and denote the log-probability of the most likely set of hidden states up to time t, such that yt = i, by
$$\ell^{*}_{it} \;=\; \max_{y_1,\ldots,y_{t-1}} \Big[\, \log P(y_1, y_2, \ldots, y_t = i \,|\, y_0, \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_t) \,\Big].$$
Derive an efficient recursion for computing the log-probabilities $\ell^{*}_{j,t+1}$ from those at the previous time step as well as the model parameters ~w0 and ~w1. Show your work for full credit, justifying each step.
(c) Prediction (2 pts)
To use this model for prediction, given parameters ~w0 and ~w1, we must be able to compute the binary outputs $\{y^{*}_t\}_{t=1}^{T-1}$ that maximize the posterior probability
$$P(y_1, y_2, \ldots, y_{T-1} \,|\, y_0, y_T, \vec{x}_1, \ldots, \vec{x}_T).$$
Show that your results from part (b) can be used to perform this computation; in particular, complete the two blocks of pseudocode sketched below.

for t = 1 to T−1
  for j = 0 to 1
    Φt+1(j) = argmax_{i∈{0,1}} [ ________ ]

for t = T−1 down to 1
  y*_t = ________