COMS W4705: Problem Set 1
• Total points: 100
• Due date: Analytical problems (questions 1–3) due Monday 9th February, 5pm. Programming problems due 16th February, 5pm.
• Late policy: 5 points off for every day late, 0 points if handed in after 5pm on February 12th for the
analytical problems, February 19th for the programming problems.
Analytical Problems (due February 9th)
Question 1 (15 points)
In lecture we saw a method for language modeling called linear interpolation, where the trigram estimate q(wi | wi−2, wi−1) is defined as
\[
q(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 \times q_{ML}(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 \times q_{ML}(w_i \mid w_{i-1}) + \lambda_3 \times q_{ML}(w_i)
\]
Here λ1, λ2, λ3 are weights for the trigram, bigram, and unigram estimates, and qML stands for the maximum-likelihood estimate.
One way to optimize the λ values (again, as seen in lecture) is to use a set of validation data, in the following way. Say the validation data consists of n sentences, S1, S2, . . . , Sn. Define c′(w1, w2, w3) to be the number of times the trigram w1, w2, w3 is seen in the validation sentences. Then the λ values are chosen to maximize the following function:
\[
L(\lambda_1, \lambda_2, \lambda_3) = \sum_{w_1, w_2, w_3} c'(w_1, w_2, w_3) \log q(w_3 \mid w_1, w_2)
\]
Question: show that choosing λ values that maximize L(λ1, λ2, λ3) is equivalent to choosing λ values that
minimize the perplexity of the language model on the validation data.
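As a reminder (this is the standard definition used in lecture; the symbols M and l are introduced here only for convenience), the perplexity of the model on the validation data is
\[
\text{perplexity} = 2^{-l}, \qquad l = \frac{1}{M} \sum_{i=1}^{n} \log_2 p(S_i),
\]
where M is the total number of words in S1, . . . , Sn and p(Si) is the probability the trigram model assigns to sentence Si.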
Question 2 (10 points)
In the lecture we saw an improved method for linear interpolation where the λ values are sensitive to the
number of times the bigram (wi−2, wi−1) has been seen; the intuition behind this was that the more frequently this bigram has been seen, the more weight should be put on the trigram estimate.
Here we’ll define a method that is similar in form to the method seen in lecture, but differs in some important
ways. First, we define a function Φ(wi−2, wi−1, wi) which maps trigrams into “bins”, depending on their
count. For example, we can define Φ as follows:
Φ(wi−2, wi−1, wi) = 1 If Count(wi−2, wi−1, wi) = 0
Φ(wi−2, wi−1, wi) = 2 If 1 ≤ Count(wi−2, wi−1, wi) ≤ 2
Φ(wi−2, wi−1, wi) = 3 If 3 ≤ Count(wi−2, wi−1, wi) ≤ 5
Φ(wi−2, wi−1, wi) = 4 If 6 ≤ Count(wi−2, wi−1, wi)
The trigram estimate q(wi | wi−2, wi−1) is then defined as
\begin{align*}
q(w_i \mid w_{i-2}, w_{i-1}) ={} & \lambda_1^{\Phi(w_{i-2}, w_{i-1}, w_i)} \times q_{ML}(w_i \mid w_{i-2}, w_{i-1}) \\
 & + \lambda_2^{\Phi(w_{i-2}, w_{i-1}, w_i)} \times q_{ML}(w_i \mid w_{i-1}) \\
 & + \lambda_3^{\Phi(w_{i-2}, w_{i-1}, w_i)} \times q_{ML}(w_i)
\end{align*}
Notice that we now have 12 smoothing parameters, i.e., λ^i_j for i = 1 . . . 4 and j = 1 . . . 3.
Question: Unfortunately this estimation method has a serious problem: what is it?
Question 3 (15 points)
We are going to come up with a modified version of the Viterbi algorithm for trigram taggers. Assume that the input to the Viterbi algorithm is a word sequence x1 . . . xn. For each word in the vocabulary, we have a tag dictionary T(x) that lists the tags y such that e(x|y) > 0. Take K to be a constant such that |T(x)| ≤ K for all x. Give pseudo-code for a version of the Viterbi algorithm that runs in O(nK³) time, where n is the length of the input sentence.
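(For concreteness, the tag dictionaries themselves are easy to precompute. The following minimal Python sketch builds T(x) from a tagged training corpus; the function name and the list-of-(word, tag)-pairs input are assumptions made here for illustration and are not part of the problem.)

from collections import defaultdict

def build_tag_dictionary(tagged_pairs):
    # T(x): the set of tags y observed with word x in training, i.e. e(x|y) > 0.
    tag_dict = defaultdict(set)
    for word, tag in tagged_pairs:
        tag_dict[word].add(tag)
    return tag_dict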
Programming Problems (due February 16th)
Please read the submission instructions, policy and hints on the course webpage.
In this programming problem, you are going to build a trigram HMM tagger for named entities. The joint probability of a sentence x1, x2, · · · , xn and a tag sequence y1, y2, · · · , yn is defined as
\[
p(x_1 \cdots x_n, y_1 \cdots y_n) = q(y_1 \mid *, *) \cdot q(y_2 \mid *, y_1) \cdot q(\mathrm{STOP} \mid y_{n-1}, y_n) \cdot \prod_{i=3}^{n} q(y_i \mid y_{i-2}, y_{i-1}) \cdot \prod_{i=1}^{n} e(x_i \mid y_i)
\]
Here ∗ is a padding symbol that indicates the beginning of a sentence, and STOP is a special HMM state indicating the end of a sentence.
We provide a training data set ner_train.dat and a development set ner_dev.key. Both files are in the following format (one word per line, with the word and its tag separated by a space; a single blank line separates sentences):
U.N. I-ORG
official O
Ekeus I-PER
heads O
for O
Baghdad I-LOC
. O
Senior O
United I-ORG
Nations I-ORG
arms O
official O
...
There are four types of named entities: person names (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC). Named entity tags have the format I-TYPE. Only if two phrases of the same type immediately follow each other does the first word of the second phrase get the tag B-TYPE, to show that it starts a new entity. Words marked with O are not part of a named entity.
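The files can be read one sentence at a time by splitting on blank lines. A minimal Python sketch (the function name and generator structure are our own, not part of the provided scripts):

def read_sentences(path):
    # Yield each sentence as a list of (word, tag) pairs.
    # Non-blank lines have the form "word tag"; blank lines separate sentences.
    sentence = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                if sentence:
                    yield sentence
                    sentence = []
            else:
                word, tag = line.rsplit(" ", 1)
                sentence.append((word, tag))
    if sentence:
        yield sentence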
The script count_freqs.py reads in a training file and produces trigram, bigram and emission counts. Run the script on the training data and pipe the output into a file:
python count_freqs.py ner_train.dat > ner.counts
Each line in the output contains the count for one event. There are two types of counts:
• Lines where the second token is WORDTAG contain emission counts Count(y ❀ x), for example
13 WORDTAG I-ORG University
indicates that University was tagged 13 times as I-ORG in the training data.
• Lines where the second token is n-GRAM (where n is 1, 2 or 3) contain unigram counts Count(y),
bigram counts Count(yn−1, yn), or trigram counts Count(yn−2, yn−1, yn). For example
3792 2-GRAM O I-ORG
indicates that there were 3792 instances of an I-ORG tag following an O tag and
1586 3-GRAM O I-ORG I-ORG
indicates that in 1586 cases the bigram O I-ORG was followed by another I-ORG tag.
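For the questions below it is convenient to load these counts into dictionaries. A minimal sketch, assuming the counts file format described above (the function name and dictionary layout are our own choices):

def read_counts(path):
    # Returns (emission_counts, ngram_counts):
    #   emission_counts[(tag, word)]      = Count(tag ~> word)
    #   ngram_counts[(tag_1, ..., tag_n)] = count of that tag n-gram
    emission_counts = {}
    ngram_counts = {}
    with open(path) as f:
        for line in f:
            parts = line.strip().split(" ")
            count, kind = int(parts[0]), parts[1]
            if kind == "WORDTAG":
                emission_counts[(parts[2], parts[3])] = count
            else:  # "1-GRAM", "2-GRAM", or "3-GRAM"
                ngram_counts[tuple(parts[2:])] = count
    return emission_counts, ngram_counts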
Question 4 (20 points)
• Using the counts produced by count_freqs.py, write a function that computes emission parameters
e(x|y) = Count(y ❀ x) / Count(y)
• We need to predict emission probabilities for words in the test data that do not occur in the training data. One simple approach is to map infrequent words in the training data to a common class and to treat unseen words as members of this class. Replace infrequent words (Count(x) < 5) in the original training data file with a common symbol _RARE_ (see the sketch after this question). Then re-run count_freqs.py to produce new counts. Submit your code.
• As a baseline, implement a simple named entity tagger that always produces the tag y∗ = arg max_y e(x|y) for each word x. Make sure your tagger uses the _RARE_ word probabilities for rare and unseen words. Your tagger should read in the counts file and the file ner_dev.dat (which is ner_dev.key without the tags) and produce output in the same format as the training file, but with an additional column at the end that contains the log probability for each prediction. For instance
Nations I-ORG -9.2103403719761818
Submit your program and report on the performance of your model. You can evaluate your model using the evaluation script:
python eval_ne_tagger.py ner_dev.key prediction_file
In addition, write up any observations you made.
Question 5 (30 Points)
• Using the counts produced by count_freqs.py, write a function that computes parameters
q(yi | yi−2, yi−1) = Count(yi−2, yi−1, yi) / Count(yi−2, yi−1)
for a given trigram yi−2 yi−1 yi. Make sure your function works for the boundary cases q(y1|∗, ∗), q(y2|∗, y1) and q(STOP|yn−1, yn).
Write a program that reads in lines of state trigrams yi−2 yi−1 yi (separated by spaces) and prints the log probability for each trigram. Submit the program.
• Using the maximum-likelihood estimates for transitions and emissions, implement the Viterbi algorithm to compute
arg max_{y1 ··· yn} p(x1 · · · xn, y1 · · · yn).
Your tagger should have the same basic functionality as the baseline tagger; the dynamic-programming recurrence from lecture is restated after this question for reference. Instead of the emission probability, the third column should contain the log-probability of the tagged sequence up to this word. Submit your program and report on the performance of your model. In addition, write up any observations you made.
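For reference, a standard form of the trigram Viterbi recurrence, restated here in the notation of this problem (the symbol π(k, u, v) is introduced for this reminder and denotes the maximum probability of any tag sequence for x1 . . . xk that ends with tags u, v at positions k − 1, k):
\[
\pi(k, u, v) = \max_{w} \Big( \pi(k-1, w, u) \cdot q(v \mid w, u) \cdot e(x_k \mid v) \Big),
\]
with base case π(0, ∗, ∗) = 1; the highest-probability sequence is obtained by also maximizing π(n, u, v) · q(STOP | u, v) over the final tag pair (u, v).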
Question 6 (10 Points)
Improve the way your HMM tagger deals with low-frequency words by grouping them based on informative patterns rather than mapping them all into a single class of rare words. For instance, such classes could contain all capitalized words (possibly proper names), all words that contain only capital letters and dots (possibly abbreviations for first names or companies), or all words that consist only of numerals. You can again replace words in the original training data and generate new counts by running count_freqs.py. Report on the classes and patterns you chose and how these methods impact the performance of your named entity tagger. Submit your program.
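One possible grouping, based directly on the example patterns above, is sketched below; the class symbols and the function classify_rare are our own choices, and you are expected to design and justify your own classes:

import re

def classify_rare(word):
    # Map a low-frequency or unseen word to a pattern-based pseudo-word.
    if word.isdigit():
        return "_NUMERIC_"          # words consisting only of numerals
    if re.fullmatch(r"[A-Z.]+", word):
        return "_ALLCAPS_"          # only capital letters and dots, e.g. "U.N."
    if word[:1].isupper():
        return "_INITCAP_"          # capitalized words, e.g. "Baghdad"
    return "_RARE_"                 # everything else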