$30
KTH, DD2380 ARTIFICIAL INTELLIGENCE
Hidden Markov Models
Deadline For grade A you must latest have your implementations accepted by Kattis (and for
the duckhunt with the required score) by Sept 20 2018, 20:00 sharp and present this in an "ontime" presentation slot Sept 21, 2018, 17:00. Any assignment accepted by Kattis/presented after
these deadlines will result in max. grade B.
1 INTRODUCTION TO HIDDEN MARKOV MODELS
A Hidden Markov Model (HMM) is a powerful statistical model used in robotics, computational biology and many other disciplines. HMMs build on the idea that observations are generated by unknown
or hidden states. For example, a GPS signal indicating your location is generated by your actual position. Since many processes, ranging from DNA sequences to satellite coordinates, can only be observed indirectly and under high uncertainty, HMMs are useful approximations of these systems.
HMMs belong to the class of Markov processes that describe memoryless dynamical systems. Any
Markov process fulfils the Markov assumption, i.e. that the future is independent of the past given the
current state at time t. In terms of HMMs, given the knowledge of the hidden state Xt−1, the current
hidden state Xt
is independent of all past hidden states Xτ with τ < t −1. Similarly, given the current
state Xt the current observation Ot
is independent of all past states and observations.
In this notation, Xt and Ot are random variables which means that their value depends on a probability distribution. Realizations of these random variables, which means choosing values according
to these probability distributions, are denoted by xt and ot
. Each random variable can take values
from a given set of outcomes, discrete or continuous. In this assignment, we will only be dealing with
discrete random variables.
As an example of that, the outcome of a coin toss can be denoted by Ot which can take any value
in the discrete set {head,t ai l}. The event of observing head is a realization ot and has probability
P(Ot = ot) = P(Ot = head) = 0.5. The probability of observing any outcome, i.e. either head or t ai l,
has to sum to P
ot∈{head, tail} P(Ot = ot) = 1.
Considering more than one time step, the notation O1:T = o1:T describes an observed series of realizations of the random variable. In terms of the coin example, if we threw the coin three times and
observed head,head,t ai l, then we would get O1 = head, O2 = head and O3 = t ai l.
Equipped with this notation, we can define the joint probability of a sequence of made observations
o1:T and all hidden states in the time interval [1,T ] as
P(O1:T = o1:T,X1:T) = P(X1)P(O1 = o1|X1)
Y
T
t=2
P(Xt
|Xt−1)P(Ot = ot
|Xt)
Since the hidden states are unknown to us, they must be modelled under uncertainty. This is achieved
by modelling the probability distribution over the transition from one state to another and the probability distribution over the observations given the current state. Thus, P(Xt = xt
|Xt−1 = xt−1) denotes
1
the probability of being in a specific state at time t given that the system was in a specific state at time
t −1. P(Ot = ot
|Xt = xt) is the probability of observing ot given that the system is in hidden state xt
.
At each time step t ∈ [1,T ] in the interval from 1 to T an HMM can thus be characterized by the
following components:
Xt = xi
,i ∈ {1, 2,...N} : N possible hidden states
Ot = ok,k ∈ {1, 2,...K} : K possible observations
πi = P(X1 = i) : the initial probability vector π ∈ R
1,N with elements πi
ai,j = P(Xt+1 = j|Xt = i) : the transition probability matrix A ∈ R
N,N with elements ai,j
bi(k) = P(Ot = k|Xt = i) : the observation probability matrix B ∈ R
N,K with elements bi,k = bi(k)
As discussed above, both the hidden states and observations are random variables for which we
denote the realizations by Xt = xi
,i ∈ {1, 2,...N} and Ot = ok,k ∈ {1, 2,...K}. The transition probability matrix A describes the probability of being in state j at time t + 1 given that the system has been
in state i at time t. Equivalently, the observation probability matrix B describes the probability of
observing k at time t given that the system is in state i at time t. The matrix notation of the parameters simplifies the computations as the implementation of this statistical model are reduced to pure
algebraic manipulations.
HMMs can be cast into different problem settings. First, if we assume that the parameter set λ =
(A,B,π) as defined above is known and that we have observed the realization O1:T = o1:T, we can
predict the next observation. The auto-completion function for text in a mobile phone can be based
on this operation - given the letters that have been observed so far, what is the most probable next
letter or complete word? In this example, both the observations and hidden states are the letters of
the alphabet.
Second, the probability of a given observation sequence can be estimated. Since the HMM represents the statistics of hidden states and observations, the observation of ”good morninf” will have a
lower probability than ”good morning” and an auto-correction program can detect the mistake.
Third, in order to correct the mistake, we need to decode the actual intention of the writer. Thus, we
need to determine the most likely hidden state sequence that might have produced the observation
sequence.
Fourth, and finally, if the parameters are not known, the set λ = (A,B,π) needs to be learned with
help of made observations. Both the Markov assumption and the algebraic notation render this problem solvable.
1.1 GETTING STARTED
In this lab, you are required to understand and implement the different HMM problems. Along
with the description of your tasks, we pose a number of questions, marked by a frame in the text.
You are required to answer these in written form and present them to a teaching assistant in a lab
session.
For the grades E and D, this implies to implement the above mentioned problems of HMMs,
i.e. the evaluation, decoding and learning problem.
For grade C, you need to have passed the E-D level and you are furthermore required to investigate HMMs in empirical and theoretical terms.
Finally, for A and B, you need to have passed the E-D and C level. In the final assignment,
you are required to apply your knowledge to an applied problem setting. We provide you with
observations of the direction of flight of different birds over time. The task is to develop a system
that can determine different bird species based on their flying pattern and to shoot the birds.
2
2 GRADE LEVEL E AND D
2.1 OVERVIEW
As introduced above, HMMs consist of probability distributions over hidden states and observations.
In order to infer e.g. future observations or the most likely sequence of hidden states, it is essential to
take any uncertainty about the system into account. This is achieved by marginalizing over states and
observations at all relevant time points. For the discrete case, marginalization is achieved as follows:
In general terms, let O ∼ P(O) and X ∼ P(X) be random variables in a discrete space that can take
the values O ∈ {o1,o2,...oK } and X ∈ {x1,x2,...xN } respectively. The joint probability distribution over
O and X is denoted by P(O,X) and the conditional distribution of O given X by P(O|X). Then the
marginal probability mass distribution of P(O = oi) is given by
P(O = oi) =
Xm
j=1
P(O = oi
,X = xj) =
Xm
j=1
P(O = oi
|X = xj)P(X = xj) (2.1)
Here we are marginalizing over the random variable X, i.e. in the discrete setting we sum over all its
values.
As an example, imagine that we have two coins, representing the hidden states Xi
, denoted by c1
and c2, depicted in Figure 2.1. We know that coin c1 is biased and will show head with probability 0.9
and t ai l with probability 0.1, while coin c2 shows head and t ai l each with a probability of 0.5. Now
assume that our observation at each time point consists of the result of a toss of one of these coins, so
either o1 = head or o2 = t ai l but we do not know which of the two coins was used. In this example,
the hidden states Xt are the identity of the coin at time t and the observations Ot are the observed
outcome of head or t ai l. Let us assume that the probability of selecting any coin is 0.5, both initially
and at every next time step. In order to determine the probability of observing e.g. head at time t
we need to take into account that any of the two coins could produce this observation. Therefore, we
Figure 2.1
3
need to marginalize over this uncertainty. For the initial time step, t = 1, this gives us e.g.
P(O1 = head) =
X
2
i=1
P(O1 = head,X1 = ci) (2.2)
=
X
2
i=1
P(O1 = head|X1 = ci)P(X1 = ci) (2.3)
= P(O1 = head|X1 = c1)P(X1 = c1)+P(O1 = head|X1 = c2)P(X1 = c2) (2.4)
= 0.9∗0.5+0.5∗0.5 = 0.7 (2.5)
Question 1 This problem can be formulated in matrix form. Please specify the initial probability vector π, the transition probability matrix A and the observation probability matrix B.
2.2 HMM 0 - NEXT OBSERVATION DISTRIBUTION
Imagine that we have an estimate of the current state distribution, i.e. we know which of the coins
we have used at the last time step with a certain probability. In Figure 2.1, we are located at time
step 7 and know the probability of using coin one or two at the previous time step. This knowledge is
represented in a row vector, in which each row represents the probability to be in a certain state, here
the coin we use, and the sum of all entries is 1. Given this knowledge, we can compute the probability
of observing each observation, here head or tail. First, we need to multiply the transition matrix with
our current estimate of states.
Question 2 What is the result of this operation?
In the following, the result must be multiplied with the observation matrix.
Question 3 What is the result of this operation?
Implement these steps in order to solve the HMM 0 problem (found in Kattis: https://kth.
kattis.com/problems/kth.ai.hmm0). It might be a good idea to implement basic matrix and vector operations such as the multiplication of a matrix and a vector. Test these sufficiently before spamming Kattis. Keep in mind that the problems on Kattis have nothing to do with the coin example but
represent arbitrary HMM models. For testing, download the files from the homepage for the problem.
You can test your java code outside Kattis by running
java MyHMM < file.in
and then make sure that the output matches that of file.ans in this case. Replace MyHMM with
whatever your java program is called. If you are using C++ you would run something like
./myHMM < file.in.
2.3 HMM 1 - PROBABILITY OF THE OBSERVATION SEQUENCE
After predicting the next observation, we want to estimate what the probability of the made observation sequence O1:T = [O1 = o1,O2 = o2,...,OT = oT ] is. For example, given our model of the coin
problem, it will be extremely unlikely to observe only tails all the time.
In order to solve this problem, we can make use of the so called forward algorithm or α-pass algorithm. This procedure iteratively estimates the probability to be in a certain state i at time t and
having observed the observation sequence up to time t for t ∈ [1,..T ].
We start off by computing the probability of having observed the first observation o1 and having
been in any of the hidden states. The latter probability is provided by the initial state distribution
4
vector π. Thus, we initialize α1(i), where the subscript 1 indicates the time step and i the state, as
α1(i) = P(O1 = o1,X1 = xi) for i ∈ [1,..N] (2.6)
= P(O1 = o1|X1 = xi)P(X1 = xi) (2.7)
= bi(o1)πi (2.8)
In the following steps, we need to take into account that any of the other hidden states at time t − 1
could have preceeded the state at the current time t. In the coin example this means that we need
to take into account the probability of having used any of the coins at the last time step in order
to estimate the probability of the coin that is currently used. If we would not take this probability
into account, we would always guess that coin c1 produced a head observation, even if the last 6
observations have been t ai l. Any information available to us needs to be taken into account! Thus,
at time t we need to marginalize over the probability of having been in any other state at t − 1 and
multiply this estimate with the matching observation probability as follows
αt(i) = P(O1:t = o1:t
,Xt = xi) for i ∈ [1,..N] (2.9)
= P(Ot = ot
|Xt = xi
,O1:t−1)P(Xt = xi
,O1:t−1) (2.10)
= P(Ot = ot
|Xt = xi)P(Xt = xi
,O1:t−1) (2.11)
= P(Ot = ot
|Xt = xi)[
X
N
j=1
P(Xt = xi
|Xt−1 = xj)P(Xt−1 = xj
,O1:t−1)] (2.12)
= bi(ot)[
X
N
j=1
aj,iαt−1(j)] (2.13)
Question 4 Why is it valid to substitute O1:t = o1:t with Ot = ot when we condition on the state
Xt = xi?
This step has to be computed iteratively for every t ∈ [1,..T ]. Finally, we can compute the probability of having observed the given observation sequence O1:T . For this, we again have to marginalize
over the hidden state distribution such that
P(O1:T = o1:T ) =
X
N
j=1
P(O1:T = o1:T ,XT = xj) (2.14)
=
X
N
j=1
αT (j) (2.15)
Implement these steps in order to solve the HMM 1 problem (found in Kattis: https://kth.kattis.
com/problems/kth.ai.hmm1). You might want to store the α values in matrix form. Be careful to index all matrices and vectors correctly and to maintain the temporal order of summation and product
as in the formulas. As before, HMM 1 is not connected to the coin example but represents arbitrary
HMM problems.
2.4 HMM 2 - ESTIMATE SEQUENCE OF STATES
The forward algorithm marginalizes over the hidden state distribution. In contrast, we can also compute the most likely sequence of hidden states given the observations. Let us denote this sequence
with {X
∗
}1:T = (X
∗
1
,X
∗
2
,...X
∗
T
). Instead of assuming that any state has led to the current estimate, we
only take the most likely predecessor into account. The most likely hidden state sequence can be
valuable information when estimating the actual dynamics underlying our noisy observations. In the
coin example, you might want to know which coin has most likely been used at each time point. In
Figure 2.1, such a path is indicated by the red line.
This problem is solved with help of the Viterbi algorithm. The procedure is divided into two steps.
5
At first, we need to compute the probability of having observed O1:t = o1:t and being in a state
Xt = xi given the most likely preceding state Xt−1 = xj
for each t. This quantity is denoted by δt(i),
defined in equations (2.16 - 2.18). We start with the initial values
δ1(i) = P(O1 = o1,X1 = xi) for i ∈ [1,..N] (2.16)
= bi(o1)πi
. (2.17)
The subsequent steps update the δt as follows
δt(i) = maxj∈[1,..N] P(Xt = xi
|Xt−1 = xj)P(O1:t−1,X
∗
1:t−2
,X
∗
t−1 = xj)P(Ot = ot
|Xt = xi) for i ∈ [1,..N]
(2.18)
= maxj∈[1,..N] aj,iδt−1(j)bi(ot). (2.19)
In order to be able to trace back the most likely sequence later on, it is convenient to store the indices
of the most likely states at each step.
δ
i d x
t
(i) = ar gmaxj∈[1,..N] aj,iδt−1(j)bi(ot) for i ∈ [1,..N] (2.20)
So, if the algorithm determined δt(i) for state i at time t to have been preceded by state k, then
δ
i d x
t
(i) = k.
When arriving at the end of the observation sequence, T , the probability of the most likely hidden
state sequence is given by
P({X
∗
}1:T ,O1:T = o1:T ) = maxj∈[1,..N] δT (2.21)
In the second step, we can backtrack the sequence {X
∗
} by setting
X
∗
T = ar gmaxj∈[1,..N] δT (j) and (2.22)
X
∗
t = δ
i d x
t
(X
∗
t+1
) (2.23)
Implement these steps in order to solve the HMM 2 problem (found in Kattis: https://kth.kattis.
com/problems/kth.ai.hmm2). You might want to store the δ and δ
i d x values in matrix form.
Question 5 How many values are stored in the matrices δ and δ
i d x respectively?
2.5 HMM 3 - ESTIMATE MODEL PARAMETERS
Up to this point, we assumed that the HMM parameters were given. In a real world scenario, this will
rarely be the case. Instead, the initial distribution and transition and observation matrices must be
approximated with help of data samples. This can be done efficiently with help of the Baum-Welch
algorithm. Luckily, we have already implemented a part of this method - the α-pass algorithm in the
HMM 1 problem.
Recall that the α-pass algorithm determined the joint probability of each state at time t and all
observations up to t. In order to be able to determine the probability of any state at any time point
given all observations, we need assess the probability of a state at time t and all future observations
[t+1 : T ]. For this, we can make use of the β-pass algorithm that works similar to the α-pass algorithm
but traverses the observations in the opposite direction - from the last to the first. The β is defined as
βt(i) = P(Ot+1:T = ot+1:T |Xt = xi) for i ∈ [1,..N], (2.24)
(2.25)
i.e. the probability of observing all future observations Ot+1:T given the current state Xt = xi
. Again,
we need to determine β values for all time steps and all states. For this, we initialize
βT (i) = 1 for i ∈ [1,..N], (2.26)
(2.27)
6
as we do not intend to favour any final state over the other. In the following, we iterate backwards
through the observation sequence and compute the β values, defined in the following equation:
βt(i) = P(Ot+1:T = ot+1:T |Xt = xi) for i ∈ [1,..N], (2.28)
=
X
N
j=1
P(Ot+2:T = ot+2:T |Xt+1 = xj)P(Ot+1 = ot+1|Xt+1 = xj)P(Xt+1 = xj
|Xt = xi) (2.29)
=
X
N
j=1
βt+1(j)bj(ot+1)ai,j
. (2.30)
Now that we have computed both α and β, we can estimate the probability of being in state i at time
t and being in state j at time t +1 given all made observations and the probability of being in state i
at time t given all made observations. These probabilities are given by the di-gamma function
γt(i, j) = P(Xt = xi
,Xt+1 = xj
|O1:T = o1:T ) (2.31)
=
αt(i)ai,j bj(Ot+1)βt+1(j)
PN
k=1
αT (k)
(2.32)
and the gamma function
γt(i) = P(Xt = xi
,|O1:T = o1:T ) (2.33)
=
X
N
j=1
γt(i, j). (2.34)
Question 6 Why we do we need to divide by the sum over the final α values for the di-gamma
function?
Finally, we can estimate λ = (A,B,π) by determining the expected value of the probabilities. We
have the transition estimates given by
ai,j =
PT −1
t=1
γt(i, j)
PT −1
t=1
γt(i)
for i, j ∈ [1,..N], (2.35)
the observation estimates given by
bj(k) =
PT −1
t=1
1(Ot = k)γt(j)
PT −1
t=1
γt(j)
for j ∈ [1,..N],k ∈ [1,..K], (2.36)
where 1(Ot = k) is the indicator function that is 1, when the argument is true and 0 otherwise.
At last, the initial probabilities are given by
πi = γ1(i) for i ∈ [1,..N]. (2.37)
Now, we have everything in place in order to learn the parameters of our HMM with the Baum-Welch
Algorithm:
1. Initialize λ = (A,B,π) (2.38)
2. Compute all αt(i),βt(i),γt(i, j),γt(i) values (2.39)
3. Re-estimate λ = (A,B,π) (2.40)
4. Repeat from 2. until convergence (2.41)
Implement these steps in order to solve the HMM 3 problem (found in Kattis: https://kth.
kattis.com/problems/kth.ai.hmm3). For more details, especially regarding the initialization of
the parameters (!), please refer to the Stamp tutorial (A Revealing Introduction to Hidden Markov
Models by Mark Stamp).
7
3 GRADE LEVEL C
3.1 OVERVIEW
This part extends the HMM implementation by more empirical investigations. Given your complete
Baum-Welch algorithm, you are supposed to both investigate different properties of its performance
and to test different parameter settings. To this end, we provide you with an HMM model which can
be used for data generation. Thus, this assignment focuses on theoretical and empirical understanding of HMMs.
3.2 EMPERICAL INVESTIGATIONS
We provide you with a file that contains a data sequence (T = 1000, 10000) generated from the following HMM dynamics:
A =
0.7 0.05 0.25
0.1 0.8 0.1
0.2 0.3 0.5
B =
0.7 0.2 0.1 0
0.1 0.4 0.3 0.2
0 0.1 0.2 0.7
π =
¡
1 0 0¢
(3.1)
Your task is to train an HMM on a varying number of these observations with help of your BaumWelch implementation.
Question 7 Train an HMM with the same parameter dimensions as above, i.e. A should be a 3
times 3 matrix, etc. Initialize your algorithm with the following matrices:
A =
0.54 0.26 0.20
0.19 0.53 0.28
0.22 0.18 0.6
B =
0.5 0.2 0.11 0.19
0.22 0.28 0.23 0.27
0.19 0.21 0.15 0.45
π =
¡
0.3 0.2 0.5¢
Does the algorithm converge? How many observations do you need for the algorithm to converge? How can you define convergence?
Question 8 Train an HMM with the same parameter dimensions as above, i.e. A is a 3x3 matrix
etc. The initialization is left up to you.
How close do you get to the parameters above, i.e. how close do you get to the generating
parameters in Eq. 3.1? What is the problem when it comes to estimating the distance between
these matrices? How can you solve these issues?
Question 9 Train an HMM with different numbers of hidden states.
What happens if you use more or less than 3 hidden states? Why?
Are three hidden states and four observations the best choice? If not, why? How can you determine the optimal setting? How does this depend on the amount of data you have?
Question 10 Initialize your Baum-Welch algorithm with a uniform distribution. How does this
effect the learning?
Initialize your Baum-Welch algorithm with a diagonal A matrix and π = [0, 0, 1]. How does this
effect the learning?
Initialize your Baum-Welch algorithm with a matrices that are close to the solution. How does
this effect the learning?
8
4 GRADE LEVEL B AND A
4.1 OVERVIEW
In this part of the assignment you will implement an agent that can classify and hunt ducks. The
aim is to get a deep understanding of Hidden Markov Models and for you to show that you can make
use of them to solve a less structured problem. You will practice identifying the number of states,
observations, etc. You will have to learn the HMM model and use it for classification.
If you have not played the game until now, go here and try your hand at it: http://www.silvergames.
com/duck-hunt.
Though this is a one person version of the game, the game was originally meant for 2-6 players and
was designed as a Nintendo TV-video game. The controllers were plastic guns with light sensors that
could be aimed at the television to "shoot" the ducks on the TV screen.
Our version of ducks hunt is a generalized version of the original game: There is a sky with birds
flying. The birds fly in the sky until they are shot down. The player has to observe the flight pattern
of the birds and predict the next move in order to shoot them down. If the prediction is correct, the
player will hit the bird and gain points. If the prediction is wrong the player will miss the bird and lose
points. The game is over when all birds are shot or when the time runs out.
We have also added different species of birds which behave differently in the air. Most birds give one
point when shot, but one of the species is very rare and will hurt your score seriously if you shoot it.
The species are identified after each round, in order to prioritize your targets you will need to identify
them during flight. You may also get extra points for guessing the correct species before the truth is
revealed.
AI PERSPECTIVE
The game can be broken down into three main topics, which will all be addressed in your code:
1. Predicting the flight trajectory of the birds so that they can be shot down.
2. Identifying the bird species to avoid forbidden targets and maximizing the score.
3. Making decisions to shoot or not based on the confidence of prediction of bird species and
flight trajectory.
Tip: First focus on predicting the movement of a single bird using a HMM. This is the building block
to solve the entire homework.
4.2 DUCK HUNT - GENERAL INFORMATION
This assignment requires thorough understanding of HMMs. Therefore it is highly recommended to
have a look at the Stamp tutorial (A Revealing Introduction to Hidden Markov Models by Mark Stamp)!
It will give you more details about numerically stable computations and other important factors for
your implementation.
You find valuable information regarding the dynamics of the game on Kattis https://kth.kattis.
com/problems/kth.ai.duckhunt.
PROVIDED CODE
A basic code skeleton is provided for you in each of the supported programming languages. The program is fully functional as it is provided, but the program never shoots nor guesses. The code can be
found on Kattis: https://kth.kattis.com/problems/kth.ai.duckhunt. Please remember that
you are not allowed to use the implementation we provided for the 2015 edition of this course!
9
The birds and the environments provided for testing are different from the ones that are used in
Kattis.
You should modify the player class and you may also create any number of new classes and files.
The files included in the skeleton may be modified locally but keep in mind that they will be overwritten on kattis (except for Player.cpp/hpp/java).
Your interface with the judge is the player class. Avoid using stdin and stdout. (Use stderr for debugging.)
TEST ENVIRONMENT
There are five test environments available to you. Each environment comes in two versions, with and
without opponents. The default environment is SouthEmissions.in. You can play other environments
by using the "load" option to the server, for example
./duckhunt server load ParadiseEmissions.in < player2server | ./duckhunt verbose player2server
GRADING DETAILS AND TASKS
You are required to have a Kattis score of at least 370. This homework has not as an aim that you
optimize your code toward Kattis but that you get an understanding of the theory behind HMMs.
Therefore, you have to be able to reason about your solution and to answer theoretical questions
regarding HMMs. You need to solve both the shooting and guessing problem. Having solved them
means being able to both shoot and guess reasonably, all using an HMM framework.
10