CSE 250a. Assignment 6
6.1 Survey
Please fill out the short survey on movies at this link: https://forms.gle/3ygxERqXgNhbPgTF8
We are collecting this data for an upcoming assignment in which you will build a simple movie recommendation system and obtain personalized recommendations based on your own tastes. Note that you will not
be able to complete this assignment if you do not fill out the survey.
6.2 EM algorithm
[Figure: belief network over nodes A, B, C, D with edges A→B, A→C, B→C, B→D, C→D.]
(a) Posterior probability
Consider the belief network shown above, with observed nodes C and D and hidden nodes A and B.
Compute the posterior probability P(a, b|c, d) in terms of the CPTs of the belief network—that is, in
terms of P(a), P(b|a), P(c|a, b) and P(d|b, c).
(b) Posterior probability
Compute the posterior probabilities P(a|c, d) and P(b|c, d) in terms of your answer from part (a);
that is, for this problem, you may assume that P(a, b|c, d) is given.
(c) Log-likelihood
Consider a partially complete data set of i.i.d. examples $\{c_t, d_t\}_{t=1}^{T}$ drawn from the joint distribution
of the above belief network. The log-likelihood of the data set is given by:

$$\mathcal{L} \;=\; \sum_t \log P(C = c_t, D = d_t).$$
Compute this log-likelihood in terms of the CPTs of the belief network. You may re-use work from
earlier parts of the problem.
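For a numerical sanity check, here is a minimal sketch of how this log-likelihood could be evaluated once the CPTs are known, assuming discrete variables and hypothetical NumPy arrays (pA, pB_A, pC_AB, pD_BC) holding the CPTs; the hidden variables A and B are summed out for each observed pair $(c_t, d_t)$.

```python
# A minimal sketch: evaluate L = sum_t log P(C=c_t, D=d_t) by summing out A and B.
# The CPT array names below are hypothetical:
#   pA[a]          = P(A=a)
#   pB_A[a, b]     = P(B=b | A=a)
#   pC_AB[a, b, c] = P(C=c | A=a, B=b)
#   pD_BC[b, c, d] = P(D=d | B=b, C=c)
import numpy as np

def log_likelihood(pA, pB_A, pC_AB, pD_BC, data):
    """data is a list of observed pairs (c_t, d_t)."""
    total = 0.0
    nA, nB = pA.shape[0], pB_A.shape[1]
    for c, d in data:
        p_cd = 0.0
        for a in range(nA):
            for b in range(nB):
                p_cd += pA[a] * pB_A[a, b] * pC_AB[a, b, c] * pD_BC[b, c, d]
        total += np.log(p_cd)
    return total
```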
(d) EM algorithm
Give the EM updates to estimate CPTs that maximize the log-likelihood in part (c); in particular,
complete the numerator and denominator in the expressions below for the update rules. Simplify your
answers as much as possible, expressing them in terms of the posterior probabilities $P(a, b|c_t, d_t)$,
$P(a|c_t, d_t)$, and $P(b|c_t, d_t)$, as well as the indicator functions $I(c, c_t)$ and $I(d, d_t)$.
P(A=a) ←
P(B =b|A=a) ←
P(C =c|A=a, B =b) ←
P(D =d|B =b, C =c) ←
6.3 EM algorithm for noisy-OR
Consider the belief network on the right, with binary random variables $X \in \{0,1\}^n$ and $Y \in \{0,1\}$ and
a noisy-OR conditional probability table (CPT). The noisy-OR CPT is given by:

$$P(Y=1|X) \;=\; 1 - \prod_{i=1}^{n} (1 - p_i)^{X_i},$$

which is expressed in terms of the noisy-OR parameters $p_i \in [0,1]$.

[Figure: belief network with parents $X_1, X_2, X_3, \ldots, X_n$ and child node $Y$.]
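For reference, evaluating this CPT numerically takes only a line or two; the sketch below assumes p (the noisy-OR parameters) and x (a binary input vector) are NumPy arrays of length n.

```python
# A minimal sketch of the noisy-OR CPT: P(Y=1 | X=x) = 1 - prod_i (1 - p_i)^{x_i}.
import numpy as np

def noisy_or(p, x):
    return 1.0 - np.prod((1.0 - p) ** x)
```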
In this problem, you will derive and implement an EM algorithm for estimating the noisy-OR parameters $p_i$.
It may seem that the EM algorithm is not suited to this problem, in which all the nodes are observed, and
the CPT has a parameterized form. In fact, the EM algorithm can be applied, but first we must express the
model in a different but equivalent form.
Consider the belief network shown to the right. In this network, a binary random variable $Z_i \in \{0,1\}$
intercedes between each pair of nodes $X_i$ and $Y$. Suppose that:

$$P(Z_i = 1|X_i = 0) \;=\; 0,$$
$$P(Z_i = 1|X_i = 1) \;=\; p_i.$$

Also, let the node $Y$ be determined by the logical-OR of the $Z_i$. In other words:

$$P(Y=1|Z) \;=\; \begin{cases} 1 & \text{if } Z_i = 1 \text{ for any } i, \\ 0 & \text{if } Z_i = 0 \text{ for all } i. \end{cases}$$

[Figure: extended belief network with edges $X_i \to Z_i$ for each $i$, and node $Y$ with parents $Z_1, Z_2, Z_3, \ldots, Z_n$.]
(a) Show that this “extended” belief network defines the same conditional distribution $P(Y|X)$ as the
original one. In particular, starting from

$$P(Y=1|X) \;=\; \sum_{Z \in \{0,1\}^n} P(Y=1, Z|X),$$

show that the right hand side of this equation reduces to the noisy-OR CPT with parameters $p_i$. To
perform this marginalization, you will need to exploit various conditional independence relations.
(b) Consider estimating the noisy-OR parameters $p_i$ to maximize the (conditional) likelihood of the observed data. The (normalized) log-likelihood in this case is given by:

$$\mathcal{L} \;=\; \frac{1}{T} \sum_{t=1}^{T} \log P\big(Y = y^{(t)} \,\big|\, X = \vec{x}^{(t)}\big),$$

where $(\vec{x}^{(t)}, y^{(t)})$ is the $t$th joint observation of $X$ and $Y$, and where for convenience we have divided
the overall log-likelihood by the number of examples $T$. From your result in part (a), it follows that
we can estimate the parameters $p_i$ in either the original network or the extended one, since in both
networks we would be maximizing the same expression for the log-likelihood.
Notice that in the extended network, we can view $X$ and $Y$ as observed nodes and $Z$ as hidden nodes.
Thus in this network, we can use the EM algorithm to estimate each parameter $p_i$, which simply
defines one row of the “look-up” CPT for the node $Z_i$.

Compute the posterior probability that appears in the E-step of this EM algorithm. In particular, for
joint observations $x \in \{0,1\}^n$ and $y \in \{0,1\}$, use Bayes rule to show that:

$$P(Z_i = 1, X_i = 1|X = x, Y = y) \;=\; \frac{y\, x_i\, p_i}{1 - \prod_j (1 - p_j)^{x_j}}.$$
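A minimal sketch of this E-step posterior, assuming p and x are length-n NumPy arrays, y is 0 or 1, and reusing the hypothetical noisy_or helper from the earlier sketch:

```python
# Posterior P(Z_i=1, X_i=1 | X=x, Y=y) = y * x_i * p_i / (1 - prod_j (1 - p_j)^{x_j}),
# returned as a length-n vector (one entry per i).
import numpy as np

def e_step_posterior(p, x, y):
    denom = noisy_or(p, x)
    if denom == 0.0:       # only happens when x is all zeros; the numerator is 0 too
        return np.zeros_like(p)
    return y * x * p / denom
```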
(c) For the data set $\{\vec{x}^{(t)}, y^{(t)}\}_{t=1}^{T}$, show that the EM update for the parameters $p_i$ is given by:

$$p_i \;\leftarrow\; \frac{1}{T_i} \sum_t P\big(Z_i = 1, X_i = 1 \,\big|\, X = \vec{x}^{(t)}, Y = y^{(t)}\big),$$

where $T_i$ is the number of examples in which $X_i = 1$. (You should derive this update as a special case
of the general form presented in lecture.)
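This update can be written as a short loop; the sketch below assumes X is a T-by-n binary array, Y is a length-T binary array, and it reuses the hypothetical e_step_posterior helper above.

```python
# One EM update of the noisy-OR parameters: p_i <- (1/T_i) * sum_t posterior_i(t),
# where T_i is the number of examples with X_i = 1.
import numpy as np

def em_update(p, X, Y):
    T, n = X.shape
    numer = np.zeros(n)
    for t in range(T):
        numer += e_step_posterior(p, X[t], Y[t])
    T_i = X.sum(axis=0)
    return numer / np.maximum(T_i, 1)   # guard against division by zero when X_i is never 1
```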
(d) Download the data files on the course web site, and use the EM algorithm to estimate the parameters $p_i$. The data set¹ has $T = 267$ examples over $n = 23$ inputs. To check your solution, initialize
all $p_i = 0.05$ and perform 256 iterations of the EM algorithm. At each iteration, compute the log-likelihood shown in part (b). (If you have implemented the EM algorithm correctly, this log-likelihood
will always increase from one iteration to the next.) Also compute the number of mistakes $M \le T$
made by the model at each iteration; a mistake occurs either when $y_t = 0$ and $P(y_t = 1|\vec{x}_t) \ge 0.5$
(indicating a false positive) or when $y_t = 1$ and $P(y_t = 1|\vec{x}_t) \le 0.5$ (indicating a false negative). The
number of mistakes should generally decrease as the model is trained, though it is not guaranteed to
do so at each iteration. Complete the following table:
      iteration   number of mistakes M   log-likelihood L
          0                175               -0.95809
          1                 56
          2                                  -0.40822
          4
          8
         16
         32
         64                 37
        128
        256                                  -0.31016
You may use the already completed entries of this table to check your work.
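One possible shape for this experiment is sketched below, assuming the hypothetical noisy_or and em_update helpers from the earlier sketches, and that X (a 267-by-23 binary array) and Y (length 267) have already been loaded from the course data files; it prints only the iterations requested in the table.

```python
# Run EM from the initialization in part (d), reporting the normalized log-likelihood
# and the number of mistakes M at the iterations listed in the table.
import numpy as np

def train(X, Y, iterations=256):
    T, n = X.shape
    p = np.full(n, 0.05)
    report = {0, 1, 2, 4, 8, 16, 32, 64, 128, 256}
    for it in range(iterations + 1):
        prob1 = np.array([noisy_or(p, x) for x in X])            # P(Y=1 | x_t) for each t
        L = np.mean(np.log(np.where(Y == 1, prob1, 1.0 - prob1)))
        M = np.sum((Y == 0) & (prob1 >= 0.5)) + np.sum((Y == 1) & (prob1 <= 0.5))
        if it in report:
            print(it, M, L)
        if it < iterations:
            p = em_update(p, X, Y)
    return p
```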
(e) Turn in your source code. As always, you may program in the language of your choice.
¹ For those interested, more information about this data set is available at http://archive.ics.uci.edu/ml/datasets/SPECT+Heart.
However, be sure to use the data files provided on the course web site, as they have been specially assembled for this assignment.
6.4 Auxiliary function
In class we derived an auxiliary function for optimizing the log-likelihood in belief networks with hidden
variables. In this problem you will derive an auxiliary function for optimizing a simpler function that is
nearly quadratic near its minimum, but nearly linear far away from its minimum.
(a) Consider the function f(x) = log cosh(x). Show that the minimum occurs at x = 0.
(b) Show that $f''(x) \le 1$ for all $x$.
(c) Consider the function $Q(x, y) = f(y) + f'(y)(x-y) + \frac{1}{2}(x-y)^2$. Plot $f(x)$, $Q(x, -2)$, and $Q(x, 3)$
as a function of $x$.
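If it helps, the plot in part (c) takes only a few lines of matplotlib; the sketch below uses the fact that $f'(x) = \tanh(x)$.

```python
# Plot f(x) = log cosh(x) together with Q(x, -2) and Q(x, 3).
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return np.log(np.cosh(x))

def Q(x, y):
    return f(y) + np.tanh(y) * (x - y) + 0.5 * (x - y) ** 2   # f'(y) = tanh(y)

xs = np.linspace(-5.0, 5.0, 400)
plt.plot(xs, f(xs), label="f(x)")
plt.plot(xs, Q(xs, -2.0), label="Q(x, -2)")
plt.plot(xs, Q(xs, 3.0), label="Q(x, 3)")
plt.legend()
plt.show()
```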
(d) Prove that Q(x, y) is an auxiliary function for f(x). In particular, show that it satisfies:
(i) Q(x, x) = f(x)
(ii) Q(x, y) ≥ f(x)
Hint: use part (b), and note that

$$f(x) \;=\; f(y) + \int_y^x du\, f'(u) \;=\; f(y) + \int_y^x du \left[ f'(y) + \int_y^u dv\, f''(v) \right].$$
(e) Derive the form of the update rule $x_{n+1} = \arg\min_x Q(x, x_n)$.
(f) Write a simple program to show that your update rule in (e) converges numerically for the initial
guesses $x_0 = -2$ and $x_0 = 3$. Turn in your source code as well as plots of $x_n$ versus $n$.
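A small scaffolding sketch for this experiment (it also serves part (g) below): the update rule is passed in as an argument, so you can plug in whatever you derive in part (e).

```python
# Iterate a one-dimensional update rule from the starting points in part (f)
# and plot x_n versus n.
import matplotlib.pyplot as plt

def run(update, x0, n_iters=20):
    xs = [x0]
    for _ in range(n_iters):
        xs.append(update(xs[-1]))
    return xs

def plot_iterates(update, label):
    for x0 in (-2.0, 3.0):
        plt.plot(run(update, x0), marker="o", label=f"{label}, x0 = {x0}")
    plt.xlabel("n")
    plt.ylabel("x_n")
    plt.legend()
    plt.show()
```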
(g) Repeat parts (e) and (f) using the update rule for Newton’s method: namely, $x_{n+1} = x_n - f'(x_n)/f''(x_n)$.
What happens and why? Determine an upper bound on $|x_0|$ so that Newton’s method converges.
(Hint: require $|x_1| < |x_0|$.)
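Since the Newton update is spelled out above, it drops straight into the same scaffolding; for $f(x) = \log\cosh(x)$ we have $f'(x) = \tanh(x)$ and $f''(x) = 1 - \tanh^2(x)$. The plot_iterates helper is the hypothetical one from the previous sketch.

```python
# Newton's method for f(x) = log cosh(x): x_{n+1} = x_n - f'(x_n) / f''(x_n).
import numpy as np

def newton_update(x):
    return x - np.tanh(x) / (1.0 - np.tanh(x) ** 2)

# plot_iterates(newton_update, "Newton")
```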
(h) Plot the function $g(x) = \frac{1}{10} \sum_{k=1}^{10} \log\cosh\!\big(x + \sqrt{2/k}\big)$. Is it still simple to find the exact minimum?
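A plotting sketch for part (h); the offsets $\sqrt{2/k}$ inside the cosh follow the reconstruction of $g$ above and should be treated as an assumption of this sketch.

```python
# Plot g(x) = (1/10) * sum_{k=1}^{10} log cosh(x + sqrt(2/k)).
import numpy as np
import matplotlib.pyplot as plt

ks = np.arange(1, 11)

def g(x):
    x = np.asarray(x, dtype=float)
    return np.mean(np.log(np.cosh(x[..., None] + np.sqrt(2.0 / ks))), axis=-1)

xs = np.linspace(-4.0, 3.0, 400)
plt.plot(xs, g(xs), label="g(x)")
plt.xlabel("x")
plt.legend()
plt.show()
```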
(i) Consider the function $R(x, y) = g(y) + g'(y)(x-y) + \frac{1}{2}(x-y)^2$. Prove that $R(x, y)$ is an auxiliary
function for $g(x)$.
(j) Derive the form of the update rule $x_{n+1} = \arg\min_x R(x, x_n)$.
(k) Use the update rule from part (j) to locate the minimum of $g(x)$ to four significant digits. In addition
to your answer for the minimum, turn in your source code as well as plots of $x_n$ versus $n$.