CSE 250a. Assignment 3
3.1 Inference
[Figure: a belief network forming a chain $X_1 \to X_2 \to X_3 \to \cdots \to X_{T-2} \to X_{T-1} \to X_T$.]
Consider the belief network shown above with random variables $X_t \in \{1, 2, \ldots, m\}$. Suppose that the CPT at each non-root node is given by the same $m \times m$ transition matrix; that is, for all $t \geq 1$, we have:
$$A_{ij} = P(X_{t+1} = j \mid X_t = i).$$
(a) Show that $P(X_{t+1} = j \mid X_1 = i) = [A^t]_{ij}$, where $A^t$ is the $t$-th power of the transition matrix.
(b) Consider the computational complexity of this inference. Devise a simple algorithm, based on matrix-vector multiplication, that scales as $O(m^2 t)$.
(c) Show alternatively that the inference can also be done in $O(m^3 \log_2 t)$.
(d) Suppose that the transition matrix $A_{ij}$ is sparse, with at most $s \ll m$ non-zero elements per row. Show that in this case the inference can be done in $O(smt)$.
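For intuition (not part of the required derivations), the following NumPy sketch contrasts the $O(m^2 t)$ matrix-vector recursion of part (b) with the $O(m^3 \log_2 t)$ repeated-squaring approach of part (c); the inputs `A`, `i`, and `t` are placeholders.

```python
import numpy as np

def forward_by_matvec(A, i, t):
    """Part (b) idea: O(m^2 t). Propagate the point distribution at X_1 = i
    through t matrix-vector products."""
    p = np.zeros(A.shape[0])
    p[i] = 1.0                      # P(X_1 = i) = 1
    for _ in range(t):
        p = p @ A                   # one O(m^2) step: p_j <- sum_k p_k A_kj
    return p                        # p[j] = P(X_{t+1} = j | X_1 = i)

def forward_by_squaring(A, i, t):
    """Part (c) idea: O(m^3 log t). Form A^t by repeated squaring."""
    result = np.eye(A.shape[0])
    power = A.copy()
    while t > 0:
        if t & 1:
            result = result @ power
        power = power @ power
        t >>= 1
    return result[i]                # row i of A^t, i.e. P(X_{t+1} = . | X_1 = i)
```

For part (d), the same loop as in `forward_by_matvec` applies, except that each step only touches the stored non-zero entries of each row (for instance, with a `scipy.sparse` matrix), which is where the $O(smt)$ bound comes from.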
3.2 Stochastic simulation
Consider the belief network shown below, with $n$ binary random variables $B_i \in \{0, 1\}$ and an integer random variable $Z$. Let
$$f(B) = \sum_{i=1}^{n} 2^{i-1} B_i$$
denote the nonnegative integer whose binary representation is given by $B_n B_{n-1} \ldots B_2 B_1$. Suppose that each bit has prior probability $P(B_i = 1) = \tfrac{1}{2}$, and that
$$P(Z \mid B_1, B_2, \ldots, B_n) = \left(\frac{1-\alpha}{1+\alpha}\right) \alpha^{|Z - f(B)|},$$
where $0 < \alpha < 1$ is a parameter measuring the amount of noise in the conversion from binary to decimal. (Larger values of $\alpha$ indicate greater levels of noise.)
[Figure: the bits $B_1, \ldots, B_n$ are parents of the integer node $Z$.]
(a) Show that the conditional distribution for binary-to-decimal conversion is normalized; namely, that $\sum_z P(Z = z \mid B_1, B_2, \ldots, B_n) = 1$, where the sum is over all integers $z \in (-\infty, +\infty)$.
(b) Use the method of likelihood weighting to estimate the probability $P(B_7 = 1 \mid Z = 64)$ for a network with $n = 10$ bits and noise level $\alpha = 0.35$.
(c) Plot your estimate in part (b) as a function of the number of samples. You should be confident from
the plot that your estimate has converged to a good degree of precision (say, at least two significant
digits).
(d) Submit a hard-copy print-out of your source code. You may program in the language of your choice,
and you may use any program at your disposal to plot the results.
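A minimal likelihood-weighting sketch for part (b) is given below (not a complete solution; the sample count and seed are arbitrary choices, and the indexing assumes $B_1$ is the least significant bit, as in the definition of $f(B)$ above).

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, z_obs, num_samples = 10, 0.35, 64, 100_000

def likelihood(z, f_b):
    """The CPT P(Z = z | B) defined above, evaluated at f(B) = f_b."""
    return (1 - alpha) / (1 + alpha) * alpha ** abs(z - f_b)

numer = denom = 0.0
for _ in range(num_samples):
    bits = rng.integers(0, 2, size=n)              # sample each B_i from its prior
    f_b = int(np.sum(2 ** np.arange(n) * bits))    # f(B) = sum_i 2^(i-1) B_i
    w = likelihood(z_obs, f_b)                     # weight by the evidence likelihood
    numer += w * bits[6]                           # bits[6] is B_7 (1 if B_7 = 1)
    denom += w

print("estimate of P(B7 = 1 | Z = 64):", numer / denom)
```

For part (c), the running ratio `numer / denom` can simply be recorded after each sample and plotted against the number of samples drawn.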
3.3 Node clustering
Consider the belief network shown below over binary variables $X$, $Y_1$, $Y_2$, $Y_3$, $Z_1$, and $Z_2$. The network can be transformed into a polytree by clustering the nodes $Y_1$, $Y_2$, and $Y_3$ into a single node $Y$. From the CPTs in the original belief network, fill in the missing elements of the CPTs for the polytree.
X P(Y1 = 1|X) P(Y2 = 1|X) P(Y3 = 1|X)
0 0.75 0.50 0.25
1 0.50 0.25 0.75
Y1 Y2 Y3 P(Z1 = 1|Y1, Y2, Y3) P(Z2 = 1|Y1, Y2, Y3)
0 0 0 0.9 0.1
1 0 0 0.8 0.2
0 1 0 0.7 0.3
0 0 1 0.6 0.4
1 1 0 0.5 0.5
1 0 1 0.4 0.6
0 1 1 0.3 0.7
1 1 1 0.2 0.8
Y1  Y2  Y3  Y   P(Y | X = 0)   P(Y | X = 1)   P(Z1 = 1 | Y)   P(Z2 = 1 | Y)
0   0   0   1   ____           ____           ____            ____
1   0   0   2   ____           ____           ____            ____
0   1   0   3   ____           ____           ____            ____
0   0   1   4   ____           ____           ____            ____
1   1   0   5   ____           ____           ____            ____
1   0   1   6   ____           ____           ____            ____
0   1   1   7   ____           ____           ____            ____
1   1   1   8   ____           ____           ____            ____
[Figures: the original network, in which $X$ is the parent of $Y_1, Y_2, Y_3$ and each $Y_i$ is a parent of both $Z_1$ and $Z_2$; and the polytree obtained after clustering, $X \to Y \to \{Z_1, Z_2\}$.]
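Although not required by the problem, the following short sketch shows where the clustered CPTs come from: given $X$, the three bits are conditionally independent, so $P(Y = (y_1, y_2, y_3) \mid X)$ is a product of the original entries, while $P(Z_1 = 1 \mid Y)$ is just the original table re-indexed by the joint state.

```python
from itertools import product

p_y1 = {0: (0.75, 0.50, 0.25), 1: (0.50, 0.25, 0.75)}   # P(Y_i = 1 | X) from the first table
p_z1 = {(0, 0, 0): 0.9, (1, 0, 0): 0.8, (0, 1, 0): 0.7, (0, 0, 1): 0.6,
        (1, 1, 0): 0.5, (1, 0, 1): 0.4, (0, 1, 1): 0.3, (1, 1, 1): 0.2}

for x in (0, 1):
    for ys in product((0, 1), repeat=3):
        # P(Y1=y1, Y2=y2, Y3=y3 | X=x) = prod_i P(Yi=yi | X=x)
        prob = 1.0
        for yi, p_one in zip(ys, p_y1[x]):
            prob *= p_one if yi == 1 else 1.0 - p_one
        print(f"P(Y={ys} | X={x}) = {prob:.4f},  P(Z1=1 | Y={ys}) = {p_z1[ys]}")
```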
3.4 Maximum likelihood estimation
Consider the two DAGs shown below, $G_1$ and $G_2$, over the same nodes $\{X_1, X_2, \ldots, X_T\}$, that differ only in the direction of their edges.
[Figure: $G_1$ and $G_2$ are chains over $X_1, X_2, \ldots, X_T$ whose edges point in opposite directions along the chain.]
Suppose that the CPTs for these DAGs are obtained by maximum likelihood estimation from a "fully observed" data set in which each example provides a complete instantiation of the nodes in the DAG. In this data set, let $\text{COUNT}_t(x)$ denote the number of examples in which $X_t = x$, and let $\text{COUNT}_t(x, x')$ denote the number of examples in which $X_t = x$ and $X_{t+1} = x'$.
(a) Express the maximum likelihood estimates for the CPTs in $G_1$ in terms of these counts.
(b) Express the maximum likelihood estimates for the CPTs in $G_2$ in terms of these counts.
(c) Using your answers from parts (a) and (b), show that the maximum likelihood CPTs for $G_1$ and $G_2$ from this data set give rise to the same joint distribution over the nodes $\{X_1, X_2, \ldots, X_T\}$.
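As a concrete (toy, entirely hypothetical) illustration of the counts involved, the sketch below tabulates $\text{COUNT}_t(x)$ and $\text{COUNT}_t(x, x')$ from a handful of fully observed sequences and forms the ratio that gives the maximum likelihood CPT for an edge $X_t \to X_{t+1}$ in $G_1$.

```python
from collections import Counter

examples = [(0, 1, 1), (0, 0, 1), (1, 1, 0)]        # toy fully observed data, T = 3
T = len(examples[0])

count_node = [Counter() for _ in range(T)]          # COUNT_t(x)
count_edge = [Counter() for _ in range(T - 1)]      # COUNT_t(x, x')
for seq in examples:
    for t, x in enumerate(seq):
        count_node[t][x] += 1
    for t in range(T - 1):
        count_edge[t][(seq[t], seq[t + 1])] += 1

# Maximum likelihood CPT for the edge X_t -> X_{t+1} in G_1:
#   P(X_{t+1} = x' | X_t = x) = COUNT_t(x, x') / COUNT_t(x)
t, x, x_next = 0, 0, 1
print(count_edge[t][(x, x_next)] / count_node[t][x])   # here: 1 / 2 = 0.5
```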
3.5 Statistical language modeling
In this problem, you will explore some simple statistical models of English text. Download and examine
the data files on Piazza for this assignment. (Start with the readme.txt file.) These files contain unigram
and bigram counts for 500 frequently occurring tokens in English text. These tokens include actual words as
well as punctuation symbols and other textual markers. In addition, an “unknown” token is used to represent
all words that occur outside this basic vocabulary. Turn in a printed copy of all your source code and results
for the following problems. You may program in the language of your choice.
(a) Compute the maximum likelihood estimate of the unigram distribution $P_u(w)$ over words $w$. Print out a table of all the tokens (i.e., words) that start with the letter "M", along with their numerical unigram probabilities (not counts). (You do not need to print out the unigram probabilities for all 500 tokens.)
(b) Compute the maximum likelihood estimate of the bigram distribution $P_b(w' \mid w)$. Print out a table of the ten most likely words to follow the word "THE", along with their numerical bigram probabilities.
(c) Consider the sentence "The stock market fell by one hundred points last week." Ignoring punctuation, compute and compare the log-likelihoods of this sentence under the unigram and bigram models:
$$L_u = \log\Big[ P_u(\text{the})\, P_u(\text{stock})\, P_u(\text{market}) \cdots P_u(\text{points})\, P_u(\text{last})\, P_u(\text{week}) \Big]$$
$$L_b = \log\Big[ P_b(\text{the} \mid \langle s\rangle)\, P_b(\text{stock} \mid \text{the})\, P_b(\text{market} \mid \text{stock}) \cdots P_b(\text{last} \mid \text{points})\, P_b(\text{week} \mid \text{last}) \Big]$$
In the equation for the bigram log-likelihood, the token $\langle s\rangle$ is used to mark the beginning of a sentence. Which model yields the higher log-likelihood?
(d) Consider the sentence "The sixteen officials sold fire insurance." Ignoring punctuation, compute and compare the log-likelihoods of this sentence under the unigram and bigram models:
$$L_u = \log\Big[ P_u(\text{the})\, P_u(\text{sixteen})\, P_u(\text{officials}) \cdots P_u(\text{sold})\, P_u(\text{fire})\, P_u(\text{insurance}) \Big]$$
$$L_b = \log\Big[ P_b(\text{the} \mid \langle s\rangle)\, P_b(\text{sixteen} \mid \text{the})\, P_b(\text{officials} \mid \text{sixteen}) \cdots P_b(\text{fire} \mid \text{sold})\, P_b(\text{insurance} \mid \text{fire}) \Big]$$
Which pairs of adjacent words in this sentence are not observed in the training corpus? What effect
does this have on the log-likelihood from the bigram model?
(e) Consider the so-called mixture model that predicts words from a weighted interpolation of the unigram and bigram models:
$$P_m(w' \mid w) = \lambda P_u(w') + (1 - \lambda) P_b(w' \mid w),$$
where $\lambda \in [0, 1]$ determines how much weight is attached to each prediction. Under this mixture model, the log-likelihood of the sentence from part (d) is given by:
$$L_m = \log\Big[ P_m(\text{the} \mid \langle s\rangle)\, P_m(\text{sixteen} \mid \text{the})\, P_m(\text{officials} \mid \text{sixteen}) \cdots P_m(\text{fire} \mid \text{sold})\, P_m(\text{insurance} \mid \text{fire}) \Big].$$
Compute and plot the value of this log-likelihood $L_m$ as a function of the parameter $\lambda \in [0, 1]$. From your results, deduce the optimal value of $\lambda$ to two significant digits.
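A rough sketch of the interpolation in part (e) is shown below, using tiny made-up counts in place of the real unigram and bigram count files (whose actual format is described in readme.txt); the token names and numbers are purely illustrative.

```python
import numpy as np

unigram_counts = {"<s>": 3, "the": 3, "fire": 2, "insurance": 1, "sold": 2}
bigram_counts = {("<s>", "the"): 2, ("sold", "fire"): 1, ("fire", "insurance"): 1}
total = sum(c for w, c in unigram_counts.items() if w != "<s>")

def p_u(w):                      # maximum likelihood unigram estimate P_u(w)
    return unigram_counts.get(w, 0) / total

def p_b(w_prev, w):              # maximum likelihood bigram estimate P_b(w | w_prev)
    return bigram_counts.get((w_prev, w), 0) / unigram_counts[w_prev]

def mixture_loglik(tokens, lam):
    """Log-likelihood under P_m(w'|w) = lam * P_u(w') + (1 - lam) * P_b(w'|w)."""
    loglik, prev = 0.0, "<s>"
    for w in tokens:
        p = lam * p_u(w) + (1 - lam) * p_b(prev, w)
        loglik += np.log(p)
        prev = w
    return loglik

# Sweep lambda over a grid and report the value that maximizes the log-likelihood.
sentence = ["the", "sold", "fire", "insurance"]   # placeholder token sequence
lams = np.linspace(0.01, 0.99, 99)
best = max(lams, key=lambda lam: mixture_loglik(sentence, lam))
print("best lambda on this toy data:", round(best, 2))
```

The same sweep, run on the actual counts and the sentence from part (d), produces the plot and the optimal value of $\lambda$ asked for.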