CSCI 544 – Applied Natural Language Processing
Written Homework 3: Hidden Markov Models

Total: 18 pages.
General instructions
1. This is not a graded assignment. Do not turn it in.
2. The assignment is meant as preparation for the in-class exams. You should aim to answer all
the questions on your own, without help.
3. Space is provided as it would be on the exams. Answers should be concise and fit in the space
provided; in the exams it will not be possible to add space, and long and rambling answers
will be penalized.
4. After solving the problems (or giving them your best try), you are encouraged to discuss and
compare solutions with your classmates.
5. You are welcome to discuss the problems with us. We encourage open discussion on Piazza,
so that the entire class can benefit from the discussion.
6. Answers to select problems will be distributed at a later time.
For further background on hidden Markov models, you may consult the following optional readings
from Jurafsky and Martin, Speech and Language Processing (3rd edition draft):
Chapter 8: Sequence Labeling for Parts of Speech and Named Entities
Chapter A: Hidden Markov Models
For the purpose of this assignment, please use the definitions given in the assignment, not the ones
in the above chapters or any other source.
Markov chains. A Markov chain is a state machine which consists of the following:
1. A set of states Q = {q1,...,qn}.
2. A transition probability matrix $A$, where each $a_{ij}$ represents the probability of transitioning from state $q_i$ to state $q_j$, such that for each $i$, $\sum_{j=1}^{n} a_{ij} = 1$.

$$A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix}$$
3. A set of possible observations V = {v1,..., vm}.
4. An emission probability matrix $B$, where each $b_{ij}$ represents the probability of state $q_i$ emitting the observation $v_j$, such that for each $i$, $\sum_{j=1}^{m} b_{ij} = 1$.

$$B = \begin{pmatrix} b_{11} & \cdots & b_{1m} \\ \vdots & \ddots & \vdots \\ b_{n1} & \cdots & b_{nm} \end{pmatrix}$$
5. A special start state q0 which is not associated with observations, along with transition probabilities a01 ···a0n from the start state to the other states. The start state may be identical to
one of the other states.
A Markov chain starts at the start state q0, and at each time point t1,t2 ... performs a transition and
emits an observation. The Markov property states that the probability of being in a particular state
at time ti depends only on the previous state (that is, the state at ti−1), and the probability of an
observation at time ti depends only on the current state (that is, the state at ti).
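For reference, the definitions above can be made concrete with a minimal Python sketch of a Markov chain that samples a sequence of states and observations; the function and variable names (sample_markov_chain, a0, and so on) are conventions of this sketch, not notation used elsewhere in this assignment.

import random

def sample_markov_chain(Q, V, A, B, a0, T):
    """Sample T (state, observation) pairs from a Markov chain as defined above.

    Q  : list of states q1 ... qn
    V  : list of possible observations v1 ... vm
    A  : A[i][j] = probability of transitioning from Q[i] to Q[j]
    B  : B[i][j] = probability that Q[i] emits V[j]
    a0 : a0[i]   = probability of transitioning from the start state q0 to Q[i]
    """
    sequence = []
    # The first transition leaves the special start state q0.
    state = random.choices(range(len(Q)), weights=a0)[0]
    for _ in range(T):
        # Emit an observation from the current state, then transition.
        obs = random.choices(range(len(V)), weights=B[state])[0]
        sequence.append((Q[state], V[obs]))
        state = random.choices(range(len(Q)), weights=A[state])[0]
    return sequence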
Problem 1. Consider a Markov chain that represents the probability that a child left alone in her
room will be awake or asleep. There are two states {awake, asleep}, and two possible observations
coming from the room {noise,quiet}. The transition and emission probabilities are noted in the
following diagram: transitions are shown with solid arrows, and emissions with dashed arrows.
(Note that the diagram is identical to the one discussed in class, but the probabilities are different!)
[Diagram: states awake and asleep. Solid transition arrows carry the probabilities 0.4, 0.2, 0.6, and 0.8; dashed emission arrows to the observations noise and quiet carry the probabilities 0.5, 0.5, 0.1, and 0.9.]
The child starts by being awake, and remains in the room for 4 time points, t1 ...t4 (4 iterations of
the Markov chain).
a. What is the most probable sequence of states for t1 ...t4?
b. What is the probability of the above sequence of states?
c. What is the most probable sequence of states and observations?
d. What is the probability of the above sequence of states and observations?
e. What is the least probable sequence of states?
f. What is the probability of the above sequence of states?
g. What is the least probable sequence of states and observations?
h. What is the probability of the above sequence of states and observations?
The Viterbi algorithm. A Hidden Markov Model (HMM) is a Markov chain where we cannot
observe the states directly, but we can observe the emissions. The Viterbi algorithm is used for
decoding the sequence of states, that is, finding the most likely sequence of states that could give
rise to a sequence of observations. Given a set of states Q and a sequence of time points 1...T, the
algorithm builds two matrices of size Q×(1...T): a probability matrix representing the probability
of each state at each time point, and a backpointer matrix which points from each state at each time
point to the most likely previous state. At the final time point T, the algorithm selects the state
with the highest probability, and returns the path of backpointers from that state, representing the
most likely sequence of states to give rise to the observations. The following is pseudocode for the algorithm; the notation a(q′, q) represents the transition probability between states q′ and q, and b(q, ot) represents the emission probability by state q of the observation noted at time t.
# Initialization step at t = 1
for q in Q:
    probability(q, 1) = a(q0, q) ∗ b(q, o1)
    backpointer(q, 1) = q0

# Recursion step for the remaining time points
for t from 2 to T:
    for q in Q:
        probability(q, t) = max_{q′ ∈ Q} probability(q′, t − 1) ∗ a(q′, q) ∗ b(q, ot)
        backpointer(q, t) = argmax_{q′ ∈ Q} probability(q′, t − 1) ∗ a(q′, q)

# Termination step
most probable state(T) = argmax_{q′ ∈ Q} probability(q′, T)
return the backtrace path by following the backpointers from the most probable state
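For reference, the following is a minimal runnable Python version of the pseudocode above, assuming the transition and emission probabilities are supplied as dictionaries (with zero probabilities present as 0.0, and the string "q0" standing for the start state); the function and variable names are conventions of this sketch.

def viterbi(Q, a, b, observations):
    """Return the most likely state sequence for the given observations.

    Q : list of states (not including the start state "q0")
    a : a[(q_prev, q)] = transition probability from q_prev to q
    b : b[(q, o)]      = probability that state q emits observation o
    """
    T = len(observations)
    probability = {}   # probability[(q, t)] = probability of the best path ending in q at time t
    backpointer = {}   # backpointer[(q, t)] = most likely previous state for q at time t

    # Initialization step at t = 1
    for q in Q:
        probability[(q, 1)] = a[("q0", q)] * b[(q, observations[0])]
        backpointer[(q, 1)] = "q0"

    # Recursion step for the remaining time points
    for t in range(2, T + 1):
        for q in Q:
            best_prev = max(Q, key=lambda qp: probability[(qp, t - 1)] * a[(qp, q)])
            probability[(q, t)] = probability[(best_prev, t - 1)] * a[(best_prev, q)] * b[(q, observations[t - 1])]
            backpointer[(q, t)] = best_prev

    # Termination step: pick the best final state and follow the backpointers
    state = max(Q, key=lambda q: probability[(q, T)])
    path = [state]
    for t in range(T, 1, -1):
        state = backpointer[(state, t)]
        path.append(state)
    return list(reversed(path))

With the probabilities from Problem 1 filled into a and b, a call would look like viterbi(["awake", "asleep"], a, b, ["quiet", "quiet", "noise"]).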
Problem 2. Consider the same Markov chain from problem 1, this time as a hidden Markov model.
The child starts by being awake, and remains in the room for 3 time points, t1 ...t3 (3 iterations of
the Markov chain). The observations are: quiet, quiet, noise.
a. Using the Viterbi algorithm, identify the most likely sequence of states that would give rise
to these observations. Show your work.
[Trellis for your work: an awake node at each of t0, t1, t2, t3 and an asleep node at each of t1, t2, t3.]
b. After the first iteration (at t1), the sum of the probabilities is less than 1. Why? Where is the
rest of the probability mass?
c. Suppose we are not interested in decoding the sequence of states (that is, whether the child
was awake or asleep at each point), but only in the overall most likely state at the end (that is,
whether the child is awake or asleep at the end). Obviously we can remove the backpointer
lines from the Viterbi algorithm; however, this would still give us the probability of only the
most likely path to each end state. What additional change can we make to the algorithm so
that instead of giving us the probability of the most likely path to each state at each time, it
will give the overall probability of being at each state at each time? Explain why.
Problem 3. In part-of-speech tagging, we learn a hidden Markov model from a tagged corpus: each state is a part-of-speech tag, transition probabilities are the conditional probabilities of tags given the previous tag, observations are words, and emission probabilities are the conditional probabilities of words given tags. The start state is the beginning of a sentence, which is not a part-of-speech tag. In this problem we will look at some data that will help us tag the sentence Time flies like an arrow. We will use the following sentences as a corpus of training data (the notation word/TAG means word tagged with a specific part-of-speech tag).
eat/VB breakfast/NN at/IN morning/NN time/NN
take/VB time/NN with/IN arrow/NN projects/NN
horse/NN riders/NN like/VB the/DT airport/NN
paper/NN flies/VB on/IN hydrogen/NN gas/NN
bees/NN sting/VB like/IN some/DT flies/NN
beans/NN soil/VB an/DT iron/NN grill/NN
flies/NN smell/VB an/DT arrow/NN drink/NN
people/NN like/VB an/DT army/NN arrow/NN
dinner/NN time/NN flies/VB all/DT day/NN
horse/NN flies/NN time/VB morning/NN rays/NN
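As a concrete illustration of the estimation described above, here is a minimal Python sketch that computes transition and emission probabilities by relative-frequency counting, assuming the corpus is given as a list of tagged sentences, each a list of (word, tag) pairs; the function name estimate_hmm and the data layout are conventions of this sketch.

from collections import defaultdict

def estimate_hmm(tagged_sentences):
    """Estimate transition and emission probabilities by relative frequency.

    Returns (transitions, emissions), where
      transitions[prev_tag][tag] = P(tag | prev_tag), with "q0" for the start of a sentence
      emissions[tag][word]       = P(word | tag)
    """
    trans_counts = defaultdict(lambda: defaultdict(int))
    emit_counts = defaultdict(lambda: defaultdict(int))
    for sentence in tagged_sentences:
        prev = "q0"  # the start state is the beginning of the sentence
        for word, tag in sentence:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag

    def normalize(counts):
        return {outer: {inner: c / sum(row.values()) for inner, c in row.items()}
                for outer, row in counts.items()}

    return normalize(trans_counts), normalize(emit_counts)

For example, the first training sentence above would be passed as [("eat", "VB"), ("breakfast", "NN"), ("at", "IN"), ("morning", "NN"), ("time", "NN")].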
a. Based on the corpus, fill in the transition probabilities in the state chart below.
[State chart: nodes q0, VB, NN, IN, DT; fill in the transition probability on each arc.]
b. Fill in the emission probabilities for the 4 states; only write down the probabilities for the
words time flies like an arrow.
time flies like an arrow
VB
NN
IN
DT
c. Now use the Viterbi algorithm with the above model to tag the sentence Time flies like an
arrow. What is the most likely tag sequence, based on the training data? Show your work.
Problem 4. You are building a system to tag utterances in English and Spanish. You collect the
following sample of 20 short utterances, tagged with part-of-speech labels.
English                                                           Spanish
hay/NN fever/NN      bread/NN pudding/NN    field/NN llamas/NN    llamas/VB médico/NN
barn/NN hay/NN       goat/NN meat/NN        desert/NN hay/NN      hay/VB queso/NN
hay/NN burns/VB      fish/NN swim/VB        dogs/NN bark/VB       vemos/VB llamas/NN
llamas/NN spit/VB    people/NN run/VB       birds/NN fly/VB       comemos/VB bebemos/VB
ride/VB donkeys/NN   make/VB hay/NN         eat/VB drink/VB       alimentos/NN bebidas/NN
Of course, the above is a very biased sample of general English and Spanish (and some of the
utterances are barely grammatical), but let’s assume that they are a good sample of the corpus we
have at hand. We will tag and identify the language of the unseen utterance hay llamas.
a. Based on the training data, what is the prior probability that an utterance selected at random
will be English? Spanish?
English: Spanish:
b. Assume the tags are generated by a separate hidden Markov model for each language. Based
on the training data, estimate the transition probabilities for each language.
English
[State chart: nodes q0, NN, VB; fill in the transition probability on each arc.]

Spanish
[State chart: nodes q0, NN, VB; fill in the transition probability on each arc.]
c. Now estimate the emission probabilities, again separately for each language. Only write down
the probabilities for the words hay and llamas.
        English              Spanish
        hay      llamas      hay      llamas
VB
NN
d. For each language, use the Viterbi algorithm to decode the most likely tag sequence.
English
[Trellis for hay llamas: start state q0, then NN and VB nodes at each of the two words.]

Spanish
[Trellis for hay llamas: start state q0, then NN and VB nodes at each of the two words.]
e. What is the most likely language and tag sequence for the utterance hay llamas? Why? (Don’t
forget the prior probabilities!)
f. What is the most likely language (under any tag sequence) for the utterance hay llamas?
Why? (Don’t forget the prior probabilities!)
Problem 5. Suppose we want to use a hidden Markov model to tag a text with the most likely tag
for each word, regardless of context. (Yes, this is overkill, as there are much simpler ways to do
this, but for this exercise assume we want to use the mechanism of a hidden Markov model.) How
would we set the transition probabilities? How would we set the emission probabilities? Why?
Problem 6. You are building a system to tag utterances in English. You collect the following
sample of 20 short utterances, tagged with part-of-speech labels.
mister/NN jones/NN    mister/NN smith/NN       color/NN gray/NN         squirrels/NN eat/VB
money/NN talks/VB     gray/NN wins/VB          squirrels/NN drink/VB    mister/NN healthy/JJ
mister/NN runs/VB     princess/NN royal/JJ     squirrels/VB food/NN     earn/VB money/NN
drink/VB tea/NN       cook/VB potatoes/NN      fix/VB things/NN         gray/JJ squirrels/NN
blue/JJ gray/JJ       happy/JJ squirrels/NN    yellow/JJ red/JJ         mellow/JJ green/JJ
Using the above data, we will tag the unseen sentence mister gray squirrels money.
a. Assume the tags are generated by a hidden Markov model. Based on the training data, estimate the transition probabilities of the model.
[State chart: nodes q0, VB, NN, JJ; fill in the transition probability on each arc.]
b. Now estimate the emission probabilities. Only write down the probabilities for the words
mister, gray, squirrels, and money.
mister gray squirrels money
NN
VB
JJ
c. Now use the Viterbi algorithm to decode the utterance mister gray squirrels money. Remember the backpointers. Do not draw arrows for zero probabilities. Show your work.
[Trellis for mister gray squirrels money: start state q0, then NN, VB, and JJ nodes at each of the four words.]
d. What is the most likely tag sequence for the utterance mister gray squirrels?
e. What is the most likely tag sequence for mister gray squirrels money?
f. Did any of the tags change from part (d) to part (e)? Why or why not?
Problem 7. A program intended to be a hidden Markov model part-of-speech tagger had a bug
which caused it to ignore the previous state when estimating transition probabilities. That is, instead
of estimating transition probabilities a(qi, qj) ← P(state = qj | prev state = qi), it estimated the transition matrix as a(qi, qj) ← P(state = qj) for every state qi. The emission matrix was not affected by the bug, so it was estimated properly: b(qi, vj) ← P(observation = vj | state = qi).
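As a concrete illustration of the difference, the following minimal sketch contrasts the correct conditional estimate with the buggy unconditional one, assuming each training example is given as a list of tags; the function names and data layout are conventions of this sketch.

from collections import Counter

def correct_transitions(tag_sequences):
    """a(qi, qj) estimated as P(state = qj | prev state = qi), by relative frequency."""
    pair_counts, prev_counts = Counter(), Counter()
    for tags in tag_sequences:
        for prev, tag in zip(["q0"] + tags, tags):
            pair_counts[(prev, tag)] += 1
            prev_counts[prev] += 1
    return {(prev, tag): c / prev_counts[prev] for (prev, tag), c in pair_counts.items()}

def buggy_transitions(tag_sequences, states):
    """The bug: a(qi, qj) estimated as P(state = qj), the same row for every previous state qi."""
    tag_counts = Counter(tag for tags in tag_sequences for tag in tags)
    total = sum(tag_counts.values())
    return {(prev, tag): tag_counts[tag] / total for prev in ["q0"] + states for tag in states}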
a. Is the buggy model still a Markov chain? Explain why or why not.
b. If the buggy model is used for decoding, what tag will it give to each word? Explain why.
Problem 8. You are building a system to identify named entities in English headlines using the
B-I-O tagging scheme: each word receives one of the following tags:
B: The word is the first word (beginning) of a named entity.
I: The word is part of a named entity, but not the first word (inside).
O: The word is not part of a named entity (outside).
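As a small illustration of the scheme, the following sketch recovers the named-entity spans encoded by a B-I-O tag sequence; the function name bio_to_entities is a convention of this sketch.

def bio_to_entities(words, tags):
    """Return the named entities (as lists of words) encoded by a B-I-O tag sequence."""
    entities, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B":            # first word of a new named entity
            if current:
                entities.append(current)
            current = [word]
        elif tag == "I":          # inside the current named entity
            current.append(word)
        else:                     # "O": outside any named entity
            if current:
                entities.append(current)
            current = []
    if current:
        entities.append(current)
    return entities

For example, bio_to_entities("Apple Sues Google Inc".split(), ["B", "O", "B", "I"]) returns [["Apple"], ["Google", "Inc"]].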
You collect the following sample of 10 headlines, tagged with B-I-O labels.
Man/O Will/O Buy/O Apple/O        Give/O Microsoft/B Corporation/I Peace/O
Apple/B Sues/O Google/B Inc/I     Child/O Enters/O Whole/B Foods/I
Dancer/O Likes/O Will/B Smith/I   Save/O Whole/B Foods/I Corporation/I
Apple/B Google/B Inc/I Gain/O     Billionaires/O Buy/O Samsung/B Corporation/I
Woman/O Will/O Eat/O Apple/O      Mister/B Ronald/I Reagan/I Died/O
Using the above data, we will tag the unseen headline Apple Will Buy Corporation.
Note that in the above data there are 10 B tags, 10 I tags, and 20 O tags.
Also, there are 10 B→. . . transitions, 5 I→. . . transitions, and 15 O→. . . transitions.
All the probabilities in parts a and b are multiples of 0.1.
a. Assume the tags are generated by a hidden Markov model. Based on the training data, estimate the transition probabilities of the model. Do not perform any smoothing.
[State chart: nodes q0, B, I, O; fill in the transition probability on each arc.]
b. Estimate the emission probabilities. Only write down the probabilities for the words Apple,
Will, Buy, and Corporation. Do not perform any smoothing.
Apple Will Buy Corporation
B
I
O
c. Now use the Viterbi algorithm to decode the headline Apple Will Buy Corporation. Remember
the backpointers. Do not draw arrows for zero probabilities. Show your work.
Note: If your numbers are correct, the algorithm will fail to tag the last word.
[Trellis for Apple Will Buy Corporation: start state q0, then B, I, and O nodes at each of the four words.]
d. Briefly explain why the algorithm failed to tag the last word.
e. Without calculating any numbers: Would it be a good idea to fix the tagger by smoothing the
transition probabilities? Why or why not?
f. Without calculating any numbers: Would it be a good idea to fix the tagger by smoothing the
emission probabilities? Why or why not?
g. The Ratnaparkhi paper on part-of-speech tagging recommends using a tag dictionary, because
it greatly improves runtime and does not have a noticeable effect on accuracy. As discussed
in class, this is analogous to not smoothing the emission probabilities. Is this reasoning valid
for the present B-I-O tagging problem? Why or why not?
Problem 9. You are building a system to tag sentences in English. You collect the following
sample of 10 short sentences, tagged with part-of-speech labels.
love/VB dog/NN kisses/NN    kiss/VB cat/NN fish/NN
cats/NN fly/VB planes/NN    cat/NN kisses/VB dogs/NN
flies/NN dog/VB cats/NN     dog/NN eats/VB frog/NN
fruit/NN fly/NN lands/VB    cat/NN kisses/NN hurt/VB
horse/NN fly/NN wins/VB     city/NN birds/NN fly/VB
Using the above data, we will tag the unseen sentence dog kisses fly. We will assume the tags are
generated by a hidden Markov model with an end state. That is, in addition to transitions from
an initial state q0 to the first word of a sentence, and transitions between words, we also estimate
transition probabilities from the final word of a sentence to a special end state qF. (You can find
more on end states in an earlier draft of the Jurafsky and Martin chapter on Hidden Markov Models.)
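As a sketch of how the end state changes the termination step (reusing the dictionary layout of the Viterbi sketch given earlier, with the string "qF" standing for the end state), each candidate final state is weighted by its transition into qF before taking the argmax:

def best_final_state(Q, probability, a, T):
    """Pick the final state of the best path when the model has an end state qF."""
    # Weight each candidate final state by its transition probability into the end state.
    return max(Q, key=lambda q: probability[(q, T)] * a[(q, "qF")])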
a. Based on the training data, estimate the transition probabilities of the model. Note that there
are 20 NN tags and NN → . . . transitions, and 10 VB tags and VB → . . . transitions. Remember that transitions out of each state must form a probability distribution.
[State chart: nodes q0, NN, VB, qF; fill in the transition probability on each arc.]
b. Now estimate the emission probabilities. Only write down the probabilities for the words
dog, kisses, and fly.
dog kisses fly
NN
VB
c. Now use the Viterbi algorithm to decode the utterance dog kisses fly. Remember the backpointers. Do not draw arrows for zero probabilities. Show your work.
[Trellis for dog kisses fly: start state q0, then NN and VB nodes at each of the three words, followed by the end state qF.]
d. If the decoder were to stop at the word fly before reaching the end state, what would be the
most likely tag sequence (or sequences) for the utterance dog kisses fly?
e. After reaching the end state, what is (or are) the most likely tag sequence (or sequences) for
the utterance dog kisses fly?
f. Does having an end state make a difference? What properties of the data does the end state
capture, and how do these bear upon the decision made by the Viterbi decoder?
