
Natural Language Processing:
Assignment 5: Qu’ bopbe’ paqvam

Introduction
As always, check out the GitHub repository with the course homework templates:
git://github.com/ezubaric/cl1-hw.git
The code for this homework is in the hw5 directory.
1 Tagging and Tag Sets (10 points)
1.1 When taggers go bad (5 points)
Consider the following sentences:
1. British Left Waffles on Falkland Islands
2. Teacher Strikes Idle Kids
3. Clinton Wins Budget; More Lies Ahead
4. Juvenile Court to Try Shooting Defendant
Choose one of these sentences and tag it in two different (but plausible)
ways.
1.2 Exploring the tag set (5 points)
There are 265 distinct words in the Brown Corpus having exactly four possible tags (assuming nothing is done to normalize the word forms).
1. Create a table with the integers 1, ..., 10 in one column and, in a second column, the number of distinct words in the corpus having that many distinct tags.
2. For the word with the greatest number of distinct tags, print out sentences from the corpus containing the word, one for each possible tag. (One way to start is sketched below.)
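One possible starting point, assuming NLTK's tagged Brown corpus (a sketch, not a complete solution; names such as tags_per_word are only illustrative):

import nltk
from collections import defaultdict

nltk.download("brown")          # only needed the first time
from nltk.corpus import brown

# Map each word form (no case folding or other normalization) to the set
# of distinct tags it appears with anywhere in the corpus.
tags_per_word = defaultdict(set)
for word, tag in brown.tagged_words():
    tags_per_word[word].add(tag)

# Table: how many distinct words have exactly k distinct tags, for k = 1..10.
table = defaultdict(int)
for tags in tags_per_word.values():
    table[len(tags)] += 1
for k in range(1, 11):
    print(k, table[k])

# The word with the greatest number of distinct tags, plus one example
# sentence for each of its tags.
best = max(tags_per_word, key=lambda w: len(tags_per_word[w]))
needed = set(tags_per_word[best])
for sent in brown.tagged_sents():
    hit = next((t for w, t in sent if w == best and t in needed), None)
    if hit is not None:
        needed.discard(hit)
        print(hit, " ".join(w for w, _ in sent))
    if not needed:
        break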
2 Viterbi Algorithm (30 Points)
Consider the following sentences written in Klingon. For each sentence,
the part of speech of each “word” has been given (for ease of translation,
some prefixes/suffixes have been treated as words), along with a translation.
Using these training sentences, we’re going to build a hidden Markov model
to predict the part of speech of an unknown sentence using the Viterbi
algorithm. Do not use log probabilities yet (you can switch to them later).
N              PRO   V    N          PRO
pa’Daq         ghaH  taH  tera’ngan  ’e
room (inside)  he    is   human      of
The human is in the room
V                   N       V     N
ja’chuqmeH          rojHom  neH   tera’ngan
in order to parley  truce   want  human
The enemy commander wants a truce in order to parley
N          V    N      CONJ  N      V    N
tera’ngan  qIp  puq    ’eg   puq    qIp  tera’ngan
human      bit  child  and   child  bit  human
The child bit the human, and the human bit the child
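If you want to check your counts in code, the three training sentences can be transcribed directly as lists of (word, tag) pairs (a sketch; the tags are copied from the rows above):

# The three Klingon training sentences as (word, tag) pairs.
training_sentences = [
    [("pa’Daq", "N"), ("ghaH", "PRO"), ("taH", "V"), ("tera’ngan", "N"), ("’e", "PRO")],
    [("ja’chuqmeH", "V"), ("rojHom", "N"), ("neH", "V"), ("tera’ngan", "N")],
    [("tera’ngan", "N"), ("qIp", "V"), ("puq", "N"), ("’eg", "CONJ"),
     ("puq", "N"), ("qIp", "V"), ("tera’ngan", "N")],
]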
2.1 Emission Probability (10 points)
Compute the smoothed frequency of each word for nouns and verbs in the table
below. We’ll use a smoothing factor of 0.1 (as discussed in class) to make
sure that no event is impossible; add this number to all of your observed
counts. Two parts of speech (CONJ and PRO) have already been done for you.
After you’ve done this, compute the emission probabilities in a similar table.
             NOUN   VERB   CONJ   PRO
’e                          0.1    1.1
’eg                         1.1    0.1
ghaH                        0.1    1.1
ja’chuqmeH                  0.1    0.1
legh                        0.1    0.1
neH                         0.1    0.1
pa’Daq                      0.1    0.1
puq                         0.1    0.1
qIp                         0.1    0.1
rojHom                      0.1    0.1
taH                         0.1    0.1
tera’ngan                   0.1    0.1
yaS                         0.1    0.1
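For example, in the CONJ column that has already been filled in, the smoothed counts sum to 1.1 + 12 × 0.1 = 2.3, so (assuming each column is normalized over the 13 word types listed; follow whatever convention was discussed in class) the corresponding emission probability would be P(’eg | CONJ) = 1.1 / 2.3 ≈ 0.478, and every other word would get 0.1 / 2.3 ≈ 0.043 under CONJ. Use the same recipe for the noun and verb columns once you have their smoothed counts.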
2.2 Start and Transition Probability (5 points)
Now, for each part of speech, total the number of times it transitioned to
each other part of speech; also count how often each part of speech begins a
sentence (the START row). Again, use a smoothing factor of 0.1. After you’ve
done this, compute the start and transition probabilities.
        NOUN   VERB   CONJ   PRO
START
N                      1.1    2.1
V                      0.1    0.1
CONJ                   0.1    0.1
PRO                    0.1    0.1
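As with the emissions, one common convention (again, follow whatever normalization was discussed in class) is to divide each smoothed count by its row total, so that every row of the finished table sums to one:

P(next = t' | current = t) = (count(t → t') + 0.1) / sum over t'' of (count(t → t'') + 0.1)

The START row works the same way, using counts of which part of speech begins each training sentence.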
2.3 Viterbi Decoding (15 points)
Now consider the following sentence: “tera’ngan legh yaS”.
1. Compute the probability of the sequence noun, verb, noun.
2. Create the decoding matrix log2 δn(z) for this sentence (word positions
are columns, rows are parts of speech). Only provide log probabilities,
using base 2. (A generic decoding sketch appears after these questions.)
POS        n = 1   n = 2   n = 3
z = N
z = V
z = CONJ
z = PRO
3. What is the most likely sequence of parts of speech?
4. Let’s compare this to your answer to the first question.
(a) How does the probability of the most likely sequence compare to that of the sequence noun, verb, noun?
(b) Which is more plausible linguistically?
(c) Does an HMM encode the intuition that you used to answer
the previous question?
5. (For fun, not for credit) What do you think this sentence means? What
word is the subject of the sentence?
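For question 1, the probability of a fixed tag sequence is just the product of the start probability of the first tag, the emission probability of each word given its tag, and the transition probabilities between adjacent tags. For the decoding matrix and the most likely sequence, the sketch below shows one generic way to organize the computation in log space; the probability tables are placeholders that you would fill in with your answers from sections 2.1 and 2.2, and the unseen-word fallback is only a guard (every word in the test sentence already appears in your smoothed emission table).

import math

STATES = ["N", "V", "CONJ", "PRO"]

# Placeholder probability tables: replace these uniform values with the
# start, transition, and emission probabilities you computed above.
start = {s: 0.25 for s in STATES}                        # P(first tag = s)
trans = {s: {t: 0.25 for t in STATES} for s in STATES}   # P(next = t | current = s)
emit = {s: {} for s in STATES}                           # P(word | tag), e.g. emit["N"]["yaS"]

def viterbi(words, unseen=1e-6):
    """Return the log2 decoding matrix and the most likely tag sequence."""
    # delta[n][z]: best log2 probability of any tag sequence for words[0..n]
    # that ends in tag z; back[n][z] remembers the best previous tag.
    delta = [{z: math.log2(start[z]) + math.log2(emit[z].get(words[0], unseen))
              for z in STATES}]
    back = [{z: None for z in STATES}]
    for n in range(1, len(words)):
        delta.append({})
        back.append({})
        for z in STATES:
            prev = max(STATES, key=lambda zp: delta[n - 1][zp] + math.log2(trans[zp][z]))
            delta[n][z] = (delta[n - 1][prev] + math.log2(trans[prev][z])
                           + math.log2(emit[z].get(words[n], unseen)))
            back[n][z] = prev
    # Trace back from the best final tag.
    best = max(STATES, key=lambda z: delta[-1][z])
    path = [best]
    for n in range(len(words) - 1, 0, -1):
        path.append(back[n][path[-1]])
    return delta, list(reversed(path))

delta, path = viterbi("tera’ngan legh yaS".split())
print(path)   # most likely tag sequence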
Turning in Your Assignment
As usual, submit your completed code to Moodle. Make sure to:
• Add your name to the assignment
• Turn in a single PDF