CSE 250A. Assignment 7
Supplementary reading:
• Russell & Norvig, Chapter 15.
• L. R. Rabiner (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2):257–286.
7.1 Viterbi algorithm
In this problem, you will decode an English sentence from a long sequence of non-text observations. To
do so, you will implement the same basic algorithm used in most engines for automatic speech recognition.
In a speech recognizer, these observations would be derived from real-valued measurements of acoustic
waveforms. Here, for simplicity, the observations only take on binary values, but the high-level concepts are
the same.
Consider a discrete HMM with n = 27 hidden states St ∈ {1, 2, . . . , 27} and binary observations Ot ∈ {0, 1}.
Download the ASCII data files from the course web site for this assignment. These files contain parameter
values for the initial state distribution πi = P(S1 = i), the transition matrix aij = P(St+1 = j|St = i), and
the emission matrix bik = P(Ot = k|St = i), as well as a long bit sequence of T = 55000 observations.
Use the Viterbi algorithm to compute the most probable sequence of hidden states conditioned on this
particular sequence of observations. As always, you may program in the language of your choice. Turn in
the following:
(a) a hard-copy print-out of your source code
(b) a plot of the most likely sequence of hidden states versus time.
To check your answer: suppose that the hidden states {1, 2, . . . , 26} represent the letters {a, b, . . . , z} of the
English alphabet, and suppose that hidden state 27 encodes a space between words. If you have implemented
the Viterbi algorithm correctly, the most probable sequence of hidden states (in fixed-length chunks) will
reveal a highly recognizable message, as well as an interesting commentary on our times.
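If it helps to organize your implementation, here is one possible sketch of the Viterbi recursion in Python/NumPy, written in log space to avoid numerical underflow over T = 55000 steps. The function and argument names are hypothetical choices, not part of the assignment; check every line against the recursion derived in lecture rather than taking the sketch as definitive.

import numpy as np

def viterbi(pi, A, B, obs):
    # pi  : (n,)   initial distribution, pi[i] = P(S1 = i+1)
    # A   : (n, n) transition matrix,  A[i, j] = P(S_{t+1} = j+1 | S_t = i+1)
    # B   : (n, m) emission matrix,    B[i, k] = P(O_t = k | S_t = i+1)
    # obs : (T,)   observed bit sequence o_1, ..., o_T (integers)
    T, n = len(obs), len(pi)
    logA, logB = np.log(A), np.log(B)
    l = np.zeros((T, n))                  # l[t, j] = max log P(s_1..s_t, o_1..o_t) with s_t = j
    back = np.zeros((T, n), dtype=int)    # back-pointers for recovering the best path
    l[0] = np.log(pi) + logB[:, obs[0]]
    for t in range(1, T):
        scores = l[t - 1][:, None] + logA          # scores[i, j] = l[t-1, i] + log a_ij
        back[t] = np.argmax(scores, axis=0)
        l[t] = scores[back[t], np.arange(n)] + logB[:, obs[t]]
    states = np.empty(T, dtype=int)       # trace back the most probable state sequence
    states[-1] = np.argmax(l[-1])
    for t in range(T - 1, 0, -1):
        states[t - 1] = back[t, states[t]]
    return states                         # 0-indexed: add 1 to map back to states 1..27

With the 0-based indexing above, states 0–25 correspond to the letters a–z and state 26 to the space character described in the check above.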
7.2 Inference in HMMs
Consider a discrete HMM with hidden states St, observations Ot, transition matrix aij = P(St+1 = j|St = i),
and emission matrix bik = P(Ot = k|St = i). In class, we defined the forward-backward probabilities:
αit = P(o1, o2, . . . , ot, St = i),
βit = P(ot+1, ot+2, . . . , oT |St = i),
for a particular observation sequence {o1, o2, . . . , oT } of length T. This problem will increase your familiarity with these definitions. In terms of these probabilities, which you may assume to be given, as well as the
transition and emission matrices of the HMM, show how to (efficiently) compute the following probabilities:
(a) P(St+1 =j|St =i, o1, o2, . . . , oT )
(b) P(St =i|St+1 =j, o1, o2, . . . , oT )
(c) P(St−1 =i, St =k, St+1 =j|o1, o2, . . . , oT )
(d) P(St+1 =j|St−1 =i, o1, o2, . . . , oT )
In all these problems, you may assume that t > 1 and t < T; in particular, you are not asked to consider the
boundary cases.
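As a warm-up (this identity is not one of the parts above), note that αit βit = P(o1, . . . , oT , St = i), so Bayes rule immediately gives the posterior marginal of a single hidden state:
P(St = i|o1, o2, . . . , oT ) = αit βit / Σ_j αjt βjt.
The quantities in parts (a)–(d) can be obtained in the same spirit.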
7.3 Conditional independence
Consider the hidden Markov model (HMM) shown below, with hidden states St and observations Ot for
times t ∈ {1, 2, . . . , T}. State whether the following statements of conditional independence are true or false.
P(St|St−1) = P(St|St−1, St+1)
P(St|St−1) = P(St|St−1, Ot−1)
P(St|St−1) = P(St|St−1, Ot)
P(St|Ot−1) = P(St|O1, O2, . . . , Ot−1)
P(Ot|St−1) = P(Ot|St−1, Ot−1)
P(Ot|Ot−1) = P(Ot|O1, O2, . . . , Ot−1)
P(S2, S3, . . . , ST |S1) = ∏_{t=2}^{T} P(St|St−1)
P(S1, S2, . . . , ST−1|ST ) = ∏_{t=1}^{T−1} P(St|St+1)
P(S1, S2, . . . , ST |O1, O2, . . . , OT ) = ∏_{t=1}^{T} P(St|Ot)
P(S1, S2, . . . , ST , O1, O2, . . . , OT ) = ∏_{t=1}^{T} P(St, Ot)
P(O1, O2, . . . , OT |S1, S2, . . . , ST ) = ∏_{t=1}^{T} P(Ot|St)
P(O1, O2, . . . , OT ) = ∏_{t=1}^{T} P(Ot|O1, . . . , Ot−1)
[Figure: hidden Markov model with the chain of hidden states S1, S2, . . . , St−1, St, St+1, . . . , ST−1, ST and an observation Ot attached to each hidden state St.]
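All of these statements can be checked against the joint distribution that the HMM defines (equivalently, by d-separation in the belief network above). For reference, that joint distribution factorizes as
P(S1, S2, . . . , ST , O1, O2, . . . , OT ) = P(S1) ∏_{t=2}^{T} P(St|St−1) ∏_{t=1}^{T} P(Ot|St).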
7.4 Belief updating
In this problem, you will derive recursion relations for real-time updating of beliefs based on incoming
evidence. These relations are useful for situated agents that must monitor their environments in real-time.
(a) Consider the discrete hidden Markov model (HMM) with hidden states St, observations Ot, transition
matrix aij and emission matrix bik. Let
qit = P(St = i|o1, o2, . . . , ot)
denote the conditional probability that St is in the ith state of the HMM based on the evidence up to
and including time t. Derive the recursion relation:
qjt = (1/Zt) bj(ot) Σ_i aij qi,t−1,    where    Zt = Σ_{ij} bj(ot) aij qi,t−1.
Justify each step in your derivation—for example, by appealing to Bayes rule or properties of conditional independence.
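Once you have derived the recursion, it may be reassuring to see that it translates directly into code. The sketch below (Python/NumPy, with hypothetical argument names) is one reading of the update; it is illustrative only and not required for the assignment.

import numpy as np

def filter_beliefs(pi, A, B, obs):
    # pi  : (n,)   initial distribution, pi[i] = P(S_1 = i)
    # A   : (n, n) transition matrix,  A[i, j] = a_ij
    # B   : (n, m) emission matrix,    B[j, k] = b_jk, so B[:, obs[t]] plays the role of b_j(o_t)
    # obs : (T,)   observation sequence (integer-coded)
    T, n = len(obs), len(pi)
    q = np.zeros((T, n))
    q[0] = pi * B[:, obs[0]]              # base case: q_{j1} proportional to P(S_1 = j) b_j(o_1)
    q[0] /= q[0].sum()
    for t in range(1, T):
        q[t] = B[:, obs[t]] * (A.T @ q[t - 1])   # b_j(o_t) * sum_i a_ij q_{i,t-1}
        q[t] /= q[t].sum()                       # dividing by the normalizer Z_t
    return q                              # q[t-1, j] = q_{jt} = P(S_t = j | o_1, ..., o_t)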
(b) Consider the dynamical system with continuous, real-valued hidden states Xt and observations Yt,
represented by the belief network shown below. By analogy to the previous problem (replacing sums
by integrals), derive the recursion relation:
P(xt|y1, y2, . . . , yt) = (1/Zt) P(yt|xt) ∫ dxt−1 P(xt|xt−1) P(xt−1|y1, y2, . . . , yt−1),
where Zt is the appropriate normalization factor,
Zt = ∫ dxt P(yt|xt) ∫ dxt−1 P(xt|xt−1) P(xt−1|y1, y2, . . . , yt−1).
In principle, an agent could use this recursion for real-time updating of beliefs in arbitrarily complicated continuous worlds. In practice, why is this difficult for all but Gaussian random variables?
[Figure: belief network with the chain of continuous hidden states X1, X2, X3, X4, . . . and an observation Yt attached to each hidden state Xt.]
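Though not required for the assignment, the recursion in part (b) can be made concrete by discretizing the state space on a grid and replacing each integral with a sum. The densities, grid limits, and observation values in the sketch below (Python/NumPy) are made up purely for illustration.

import numpy as np

# Hypothetical densities, chosen only to make the sketch concrete (not part of the assignment):
# x_t = 0.9 x_{t-1} + unit-variance Gaussian noise, y_t = x_t + unit-variance Gaussian noise.
def p_trans(x, x_prev):                   # P(x_t | x_{t-1})
    return np.exp(-0.5 * (x - 0.9 * x_prev) ** 2) / np.sqrt(2.0 * np.pi)

def p_emit(y, x):                         # P(y_t | x_t)
    return np.exp(-0.5 * (y - x) ** 2) / np.sqrt(2.0 * np.pi)

grid = np.linspace(-5.0, 5.0, 401)        # discretized state space
dx = grid[1] - grid[0]
belief = np.full(grid.shape, 1.0 / (grid[-1] - grid[0]))    # flat initial belief on the grid

for y in [0.3, -0.1, 0.5]:                # a made-up observation stream y_1, y_2, y_3
    # predict: approximate the integral over x_{t-1} by a sum over grid points
    predicted = p_trans(grid[:, None], grid[None, :]) @ belief * dx
    # correct: multiply by P(y_t | x_t) and renormalize (the factor 1/Z_t)
    belief = p_emit(y, grid) * predicted
    belief /= belief.sum() * dx

Each pass through the loop applies one predict step (the integral over the previous state) and one correct step (multiplication by the emission density followed by normalization), exactly as in the displayed recursion.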
7.5 V-chain
Consider the belief network shown below with hidden nodes Xt and Yt and observed nodes Ot. Suppose
also that the same CPTs are shared across nodes at different times, with parameters
πj = P(Yt = j),
ajℓ = P(Xt+1 = ℓ|Yt = j),
bij(k) = P(Ot = k|Xt = i, Yt = j),
that are valid for all times t. In this problem you will derive an efficient algorithm for computing the
likelihood P(o1, o2, . . . , oT ) in terms of these parameters.
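For orientation, the CPTs above (together with the prior P(X1 = i) mentioned in part (a) below) define the joint distribution
P(x1, . . . , xT , y1, . . . , yT , o1, . . . , oT ) = P(x1) ∏_{t=1}^{T} π_{yt} b_{xt yt}(ot) ∏_{t=1}^{T−1} a_{yt xt+1},
from which the quantities in parts (a)–(d) follow by marginalization.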
(a) Base case
Show how to compute the marginal probability P(Y1 = j, O1 = o1) in terms of the prior probability
P(X1 =i) and the parameters {π, b} of this model.
(b) Forward algorithm
Consider the probabilities αjt = P(o1, o2, . . . , ot, Yt = j) for a particular sequence of observations {o1, o2, . . . , oT }. Derive an efficient recursion to compute these probabilities at time t + 1 from
those at time t and the parameters {π, a, b} of this model. Justify each step in your derivation.
(c) Likelihood
Show how to compute the likelihood P(o1, o2, . . . , oT ) efficiently in terms of the marginal probabilities αjT as defined in part (b).
(d) Complexity
Suppose that Xt ∈ {1, 2, . . . , nx} and Yt ∈ {1, 2, . . . , ny}. How does the computation of the likelihood scale with the length T of the sequence and the cardinalities nx and ny of the hidden states? Justify
your answer.
[Figure: V-chain belief network over times t = 1, . . . , T. At each time t, the hidden nodes Xt and Yt are both parents of the observed node Ot, and Yt is the parent of Xt+1.]