CSE 250a. Assignment 5

5.1 EM algorithm
For this problem, consider the belief network shown below, with observed nodes B and D and hidden
nodes A and C.
[Figure: belief network with nodes A, B, C, and D, with edges A → B, A → C, B → C, B → D, and C → D.]
(a) Posterior probability
Compute the posterior probability P(a, c|b, d) in terms of the conditional probability tables (CPTs) of
the belief network, namely P(a), P(b|a), P(c|a, b) and P(d|b, c).
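As a starting point (this is simply the factorization implied by the network structure, not a complete answer to part (a)), recall that the joint distribution of the network is the product of its CPTs:
$$P(a,b,c,d) = P(a)\,P(b \mid a)\,P(c \mid a,b)\,P(d \mid b,c).$$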
(b) Posterior probability
Compute the posterior probabilities P(a|b, d) and P(c|b, d) in terms of your answer from part (a).
(c) Log-likelihood
Consider a partially complete data set of i.i.d. examples $\{b_t, d_t\}_{t=1}^{T}$ drawn from the joint distribution of the above belief network. The log-likelihood of the data set is given by:
$$\mathcal{L} = \sum_t \log P(B=b_t,\, D=d_t).$$
Compute this log-likelihood in terms of the CPTs of the belief network.
(d) EM algorithm
Derive the EM updates for the CPTs of this belief network. Simplify your answers as much as possible, expressing them in terms of the posterior probabilities $P(a,c\,|\,b_t,d_t)$, $P(a\,|\,b_t,d_t)$, and $P(c\,|\,b_t,d_t)$, as well as the binary-valued functions $I(b,b_t)$ and $I(d,d_t)$, where
$$I(x,x') = \begin{cases} 1 & \text{if } x = x', \\ 0 & \text{otherwise.} \end{cases}$$
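For reference, one standard statement of the general EM update for a tabular CPT (check it against the form presented in lecture) is, for a generic node $X$ with parents $\mathrm{pa}(X)$ and observed evidence $V_t$ in the $t$th example,
$$P(X=x \mid \mathrm{pa}(X)=\pi) \;\leftarrow\; \frac{\sum_t P(X=x,\, \mathrm{pa}(X)=\pi \mid V_t)}{\sum_t P(\mathrm{pa}(X)=\pi \mid V_t)}.$$
Your answers to part (d) should emerge as special cases of this form.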
5.2 EM algorithm for noisy-OR
Consider the belief network shown below, with binary random variables $X \in \{0,1\}^n$ and $Y \in \{0,1\}$ and a noisy-OR conditional probability table (CPT). The noisy-OR CPT is given by:
$$P(Y=1\,|\,X) \;=\; 1 - \prod_{i=1}^{n} (1 - p_i)^{X_i},$$
which is expressed in terms of the noisy-OR parameters $p_i \in [0,1]$.
[Figure: noisy-OR network with parent nodes $X_1, X_2, X_3, \ldots, X_n$ and child node $Y$.]
In this problem, you will derive and implement an EM algorithm for estimating the noisy-OR parameters $p_i$. It may seem that the EM algorithm is not suited to this problem, in which all the nodes are observed, and the CPT has a parameterized form. In fact, the EM algorithm can be applied, but first we must express the model in a different but equivalent form.
Consider the belief network shown below. In this network, a binary random variable $Z_i \in \{0,1\}$ intercedes between each pair of nodes $X_i$ and $Y$. Suppose that:
$$P(Z_i=1 \mid X_i=0) = 0, \qquad P(Z_i=1 \mid X_i=1) = p_i.$$
Also, let the node $Y$ be determined by the logical-OR of the $Z_i$. In other words:
$$P(Y=1\,|\,Z) = \begin{cases} 1 & \text{if } Z_i = 1 \text{ for any } i, \\ 0 & \text{if } Z_i = 0 \text{ for all } i. \end{cases}$$
[Figure: extended network with edges $X_i \to Z_i$ for each $i$ and edges $Z_1, \ldots, Z_n \to Y$.]
(a) Show that this “extended” belief network defines the same conditional distribution $P(Y|X)$ as the original one. In particular, starting from
$$P(Y=1\,|\,X) \;=\; \sum_{Z \in \{0,1\}^n} P(Y=1, Z\,|\,X),$$
show that the right hand side of this equation reduces to the noisy-OR CPT with parameters $p_i$. To perform this marginalization, you will need to exploit various conditional independence relations.
(b) Consider estimating the noisy-OR parameters $p_i$ to maximize the (conditional) likelihood of the observed data. The (normalized) log-likelihood in this case is given by:
$$\mathcal{L} \;=\; \frac{1}{T} \sum_{t=1}^{T} \log P\bigl(Y=y^{(t)} \,\big|\, X=\vec{x}^{(t)}\bigr),$$
where $(\vec{x}^{(t)}, y^{(t)})$ is the $t$th joint observation of $X$ and $Y$, and where for convenience we have divided the overall log-likelihood by the number of examples $T$. From your result in part (a), it follows that we can estimate the parameters $p_i$ in either the original network or the extended one (since in both networks they would be maximizing the same expression for the log-likelihood).
Notice that in the extended network, we can view $X$ and $Y$ as observed nodes and $Z$ as hidden nodes. Thus in this network, we can use the EM algorithm to estimate each parameter $p_i$, which simply defines one row of the “look-up” CPT for the node $Z_i$.
Compute the posterior probability that appears in the E-step of this EM algorithm. In particular, for joint observations $x \in \{0,1\}^n$ and $y \in \{0,1\}$, use Bayes rule to show that:
$$P(Z_i=1, X_i=1 \mid X=x, Y=y) \;=\; \frac{y\, x_i\, p_i}{1 - \prod_j (1-p_j)^{x_j}}.$$
(c) For the data set $\{\vec{x}^{(t)}, y^{(t)}\}_{t=1}^{T}$, show that the EM update for the parameters $p_i$ is given by:
$$p_i \;\leftarrow\; \frac{1}{T_i} \sum_t P\!\left(Z_i=1, X_i=1 \,\middle|\, X=\vec{x}^{(t)}, Y=y^{(t)}\right),$$
where $T_i$ is the number of examples in which $X_i = 1$. (You should derive this update as a special case of the general form presented in lecture.)
(d) Download the ASCII files hw5 X.txt and hw5 Y.txt from the course web site. These files contain $T = 20000$ examples of $(x, y)$ pairs (with $n = 10$). Use the EM algorithm to estimate the parameters $p_i$ that maximize the log-likelihood of this data set.
To check your solution, initialize all $p_i = 0.5$ and perform 64 iterations of the EM algorithm. At each iteration, compute the log-likelihood shown in part (b). Turn in a completed version of this table:
iteration:   0        1        2    4    8    16   32   64
L:          -2.5709  -0.6938   ?    ?    ?    ?    ?    -0.5363
If you have implemented the EM algorithm correctly, this log-likelihood will always increase from
one iteration to the next. You may use the already completed entries of this table to check your work.
(e) Turn in your source code. You may program in the language of your choice.
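For reference, here is a minimal sketch of one possible implementation in Python, assuming the downloaded files are saved locally as hw5_X.txt and hw5_Y.txt (adjust the paths to match the actual file names) and that numpy is available; the E-step and M-step follow the formulas in parts (b) and (c).

import numpy as np

# Load the data: X has one row of n binary values per example, Y one binary value per example.
X = np.loadtxt("hw5_X.txt")          # shape (T, n); adjust the file name if needed
Y = np.loadtxt("hw5_Y.txt")          # shape (T,);   adjust the file name if needed
T, n = X.shape

p = np.full(n, 0.5)                  # initialize all p_i = 0.5
Ti = X.sum(axis=0)                   # T_i = number of examples with X_i = 1

def log_likelihood(p):
    # normalized conditional log-likelihood from part (b)
    prob_y1 = 1.0 - np.prod((1.0 - p) ** X, axis=1)
    prob_y1 = np.clip(prob_y1, 1e-12, 1.0 - 1e-12)   # guard against log(0)
    return np.mean(np.where(Y == 1, np.log(prob_y1), np.log(1.0 - prob_y1)))

print(0, log_likelihood(p))
for it in range(1, 65):
    # E-step: posterior P(Z_i=1, X_i=1 | x^(t), y^(t)) from part (b)
    denom = 1.0 - np.prod((1.0 - p) ** X, axis=1)
    denom = np.clip(denom, 1e-12, None)
    post = (Y[:, None] * X * p[None, :]) / denom[:, None]
    # M-step: update from part (c)
    p = post.sum(axis=0) / Ti
    if it in (1, 2, 4, 8, 16, 32, 64):
        print(it, log_likelihood(p))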
5.3 Auxiliary function
In class we derived an auxiliary function for optimizing the log-likelihood in belief networks with hidden
variables. In this problem you will derive an auxiliary function for optimizing a simpler function that is
nearly quadratic near its minimum, but nearly linear far away from its minimum.
(a) Consider the function f(x) = log cosh(x). Show that the minimum occurs at x = 0.
(b) Show that $f''(x) \le 1$ for all $x$.
(c) Consider the function $Q(x,y) = f(y) + f'(y)(x-y) + \frac{1}{2}(x-y)^2$. Plot $f(x)$, $Q(x,-2)$, and $Q(x,3)$ as a function of $x$.
(d) Prove that Q(x, y) is an auxiliary function for f(x). In particular, show that it satisfies:
(i) Q(x, x) = f(x)
(ii) Q(x, y) ≥ f(x)
(e) Derive the form of the update rule $x_{n+1} = \arg\min_x Q(x, x_n)$.
(f) Write a simple program to show that your update rule in (e) converges numerically for the initial guesses $x_0 = -2$ and $x_0 = 3$. Turn in your source code as well as plots of $x_n$ versus $n$.
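For part (f), a minimal sketch in Python (assuming numpy, scipy, and matplotlib are available) is shown below; it carries out the minimization of $Q(x, x_n)$ numerically with scipy rather than with the closed-form rule you derive in part (e), so you can swap in your own update once you have it.

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize_scalar

def f(x):
    return np.log(np.cosh(x))

def fprime(x):
    return np.tanh(x)                # derivative of log cosh(x)

def Q(x, y):
    # auxiliary function from part (c)
    return f(y) + fprime(y) * (x - y) + 0.5 * (x - y) ** 2

for x0 in (-2.0, 3.0):
    xs = [x0]
    for n in range(25):
        # x_{n+1} = argmin_x Q(x, x_n), found numerically here
        xs.append(minimize_scalar(lambda x: Q(x, xs[-1])).x)
    plt.plot(xs, marker="o", label=f"x0 = {x0}")

plt.xlabel("n")
plt.ylabel("x_n")
plt.legend()
plt.show()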
(g) Repeat parts (e) and (f) using the update rule for Newton's method: namely, $x_{n+1} = x_n - f'(x_n)/f''(x_n)$. What happens and why? Determine an upper bound on $|x_0|$ so that Newton's method converges. (Hint: require $|x_1| < |x_0|$.)
(h) Plot the function
$$g(x) \;=\; \frac{1}{10} \sum_{k=1}^{10} \log\cosh\!\left(x + \frac{1}{\sqrt{k}}\right).$$
Is it still simple to find the exact minimum?
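For part (h), a short plotting sketch in Python might look like the following; the plotting range is an arbitrary illustrative choice.

import numpy as np
import matplotlib.pyplot as plt

def g(x):
    # g(x) = (1/10) * sum_{k=1}^{10} log cosh(x + 1/sqrt(k))
    k = np.arange(1, 11)
    return np.mean(np.log(np.cosh(np.asarray(x)[..., None] + 1.0 / np.sqrt(k))), axis=-1)

x = np.linspace(-3.0, 2.0, 400)
plt.plot(x, g(x))
plt.xlabel("x")
plt.ylabel("g(x)")
plt.show()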
(i) Consider the function $R(x,y) = g(y) + g'(y)(x-y) + \frac{1}{2}(x-y)^2$. Prove that $R(x,y)$ is an auxiliary function for $g(x)$.
(j) Derive the form of the update rule $x_{n+1} = \arg\min_x R(x, x_n)$.
(k) Use the update rule from part (j) to locate the minimum of $g(x)$ to four significant digits. In addition to your answer for the minimum, turn in your source code as well as plots of $x_n$ versus $n$.