COMS W4705: Problem Set 4
Total points: 145
Analytic Problems (due April 27th at 5pm)
Question 1 (20 points)
Clarissa Linguistica decides to build a log-linear model for language modeling. She has a training sample
$(x_i, y_i)$ for $i = 1 \ldots n$, where each $x_i$ is a prefix of a document (e.g., $x_i$ = "Yesterday, George Bush said")
and $y_i$ is the next word seen after this prefix (e.g., $y_i$ = "that"). As usual in log-linear models, she defines a
function $f(x, y)$ that maps any $(x, y)$ pair to a vector in $\mathbb{R}^d$. Given parameter values $v \in \mathbb{R}^d$, the model defines
\[
P(y \mid x, v) = \frac{e^{v \cdot f(x, y)}}{\sum_{y' \in V} e^{v \cdot f(x, y')}}
\]
where $V$ is the vocabulary, i.e., the set of possible words; and $v \cdot f(x, y)$ is the inner product between the
vectors $v$ and $f(x, y)$.
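As a concrete illustration of this definition (not part of the problem), the following minimal Python sketch computes $P(y \mid x, v)$ with sparse feature maps; the function names and the dictionary representation are our own assumptions.

import math

def log_linear_prob(x, y, v, f, vocab):
    """P(y | x, v) under the log-linear model above.

    f(x, y) returns a dict mapping feature names to values;
    v is a dict mapping feature names to weights.
    """
    def score(candidate):
        feats = f(x, candidate)
        return sum(val * v.get(name, 0.0) for name, val in feats.items())
    denom = sum(math.exp(score(y_prime)) for y_prime in vocab)
    return math.exp(score(y)) / denom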
Given the training set, the training procedure returns parameters $v^* = \arg\max_v L(v)$, where
\[
L(v) = \sum_i \log P(y_i \mid x_i, v) - C \sum_k v_k^2
\]
and $C > 0$ is some constant.
Clarissa makes the following choice of her first two features in the model:
\[
f_1(x, y) = \begin{cases} 1 & \text{if $y$ = \emph{model} and the previous word in $x$ is \emph{the}} \\ 0 & \text{otherwise} \end{cases}
\]
\[
f_2(x, y) = \begin{cases} 1 & \text{if $y$ = \emph{model} and the previous word in $x$ is \emph{the}} \\ 0 & \text{otherwise} \end{cases}
\]
So $f_1(x, y)$ and $f_2(x, y)$ are identical features.
Question (10 points): Show that for any training set, with $f_1$ and $f_2$ defined as above, the optimal parameters
$v^*$ satisfy the property that $v^*_1 = v^*_2$.
Question (10 points): Now say we define the optimal parameters to be $v^* = \arg\max_v L(v)$, where
\[
L(v) = \sum_i \log P(y_i \mid x_i, v) - C \sum_k |v_k|
\]
and $C > 0$ is some constant. (Here $|v_k|$ is the absolute value of the $k$'th parameter value.) In this case, does the
property $v^*_1 = v^*_2$ necessarily hold? If not, what constraints do hold for the values $v^*_1$ and $v^*_2$?
Question 2 (15 points)
Nathan L. Pedant now decides to build a bigram language model using log-linear models. He gathers a
training sample $(x_i, y_i)$ for $i = 1 \ldots n$. Given a vocabulary of words $V$, each $x_i$ and each $y_i$ is a member of
$V$. Each $(x_i, y_i)$ pair is a bigram extracted from the corpus, where the word $y_i$ is seen following $x_i$ in the
corpus.
Nathan's model is similar to Clarissa's, except he chooses the optimal parameters $v^*$ to be $\arg\max_v L(v)$
where
\[
L(v) = \sum_i \log P(y_i \mid x_i, v)
\]
The features in his model are of the following form:
\[
f_i(x, y) = \begin{cases} 1 & \text{if $y$ = \emph{model} and $x$ = \emph{the}} \\ 0 & \text{otherwise} \end{cases}
\]
i.e., the features track pairs of words. To be more specific, he creates one feature of the form
\[
f_i(x, y) = \begin{cases} 1 & \text{if } y = w_2 \text{ and } x = w_1 \\ 0 & \text{otherwise} \end{cases}
\]
for every $(w_1, w_2) \in V \times V$.
Question (15 points): Assume that the training corpus contains all possible bigrams: i.e., for all $w_1, w_2 \in V$
there is some $i$ such that $x_i = w_1$ and $y_i = w_2$. The optimal parameter estimates $v^*$ define a probability
$P(y = w_2 \mid x = w_1, v^*)$ for any bigram $w_1, w_2$. Show that for any $w_1, w_2$ pair, we have
\[
P(y = w_2 \mid x = w_1, v^*) = \frac{\mathrm{Count}(w_1, w_2)}{\mathrm{Count}(w_1)}
\]
where $\mathrm{Count}(w_1, w_2)$ is the number of times $(x_i, y_i) = (w_1, w_2)$, and $\mathrm{Count}(w_1)$ is the number of times $x_i = w_1$.
Question 3 (30 points)
Nathan L. Pedant generates $(x, y)$ pairs as follows. Take $V$ to be the set of possible words (vocabulary), e.g., $V$
= {the, cat, dog, happy, ...}. Take $V'$ to be the set of all words in $V$, plus the reversed string
of each word, e.g., $V'$ = {the, eht, cat, tac, dog, god, happy, yppah, ...}.
For each $x$, Nathan chooses a word from the vocabulary $V$. He then does the following:
• With 0.4 probability, he chooses $y$ to be identical to $x$.
• With 0.3 probability, he chooses $y$ to be the reversed string of $x$.
• With 0.3 probability, he chooses $y$ to be some string that is neither $x$ nor the reverse of $x$. In this case
he chooses $y$ from the uniform distribution over words in $V'$ that are neither $x$ nor the reverse of $x$.
Question (10 points)
Define a log-linear model that can model this distribution $P(y \mid x)$ perfectly. (Note: you may assume that
there are no palindromes in the vocabulary, i.e., no words like \emph{eye} which stay the same when reversed.)
Your model should make use of as few parameters as possible (we will give you 10 points for a correct
model with 2 parameters, 8 points for a correct model with 3 parameters, and 5 points for a correct model
with more than 3 parameters).
Question (10 points)
Write an expression for each of the probabilities
\[
P(\textit{the} \mid \textit{the}), \quad P(\textit{eht} \mid \textit{the}), \quad P(\textit{dog} \mid \textit{the})
\]
as a function of the parameters in your model.
Question (10 points)
What value do the parameters in your model take to give the distribution described above?
Programming Problems (due May 4th at 5pm)
Please read the submission instructions, policy and hints on the course webpage.
In this programming assignment, you will train a perceptron model for part-of-speech tagging. For the
decoding problem for a global linear model, we are given an input instance $x \in \mathcal{X}$ and a function
$\mathrm{GEN}(x)$ that generates possible output structures, e.g., $x$ is a sentence $w_1, \ldots, w_n$ and $y \in \mathrm{GEN}(x)$ is a
possible tagging $t_1, \ldots, t_n$. Additionally we assume that we have a feature function $f(x, y)$ and a weight
vector $v \in \mathbb{R}^m$. The decoding problem is defined as
\[
y^* = \arg\max_{y \in \mathrm{GEN}(x)} f(x, y) \cdot v
\]
Crucially, the feature function $f$ must factor to make the problem tractable. For part-of-speech tagging, we
partition a tagging into a sequence of histories $h_1, \ldots, h_n$ and decompose $f$ into a sum of feature functions
over each history and the corresponding output, i.e., $f(x, y) = \sum_{i=1}^n g(h_i, t_i)$. Our focus will be on
bigram tagging, so we define a history as
\[
h_i = \langle t_{i-1}, x, i \rangle
\]
In this assignment you will experiment with different feature functions and implement the perceptron training algorithm for learning the vector v.
Histories and Feature Vectors
Consider a typical tagged training sentence from the Wall Street Journal corpus:
There/DET is/VERB no/DET asbestos/NOUN in/ADP our/PRON products/NOUN now/ADV ./.
In order to score this sentence it is necessary to generate all the $(h_i, t_i)$ pairs for the sentence. We provide a
script to generate the histories seen for the gold tagging.
python tagger_history_generator.py GOLD < example.sent
1 * DET
2 DET VERB
3 VERB DET
4 DET NOUN
...
Each line has three columns, $i$, $t_{i-1}$, and $t_i$, which, along with the sentence itself, give all the information
needed to compute the feature vector for the gold tagging.
In addition to the gold histories, we also need to compute features for all other possible histories. Computing
these features is necessary for tagging new sentences and for training the model. We provide a command to
enumerate all the possible histories for an example sentence.
python tagger_history_generator.py ENUM < example.sent
1 * DET
1 * VERB
1 * NOUN
1 * ADJ
...
Once we have the histories for an example we can compute features for each history. Much of the art of
global linear models is selecting good features to describe the data. Feature functions in NLP typically look
like
\[
g_{\mathrm{BIGRAM:DET:NOUN}}(\langle t_{i-1}, x, i \rangle, t_i) = \begin{cases} 1 & \text{if } t_{i-1} = \mathrm{DET}, t_i = \mathrm{NOUN} \\ 0 & \text{otherwise} \end{cases}
\]
which is a binary feature indicating that the previous tag is DET and the current tag is NOUN. We would
typically have a feature like this for all pairs of part-of-speech tags in addition to DET/NOUN. For examples
of more features, see Questions 4 and 5.
Since most problems in NLP use binary features, a common implementation technique is to store a feature
vector (the output of $g(h_i, t_i)$) as a map from strings to $\{0, 1\}$. For instance, if the bigram DET/NOUN is
seen in the sentence, the map would contain {"BIGRAM:DET:NOUN" → 1}. Similarly, the weight vector
$v$ can be implemented as a sparse map. For instance, in Python the weight vector $v$ would be a
dictionary mapping features to weights, e.g.
v["BIGRAM:DET:NOUN"] = 1.23
The dot product in the decoding problem can then be implemented as a sparse product between these two
maps. For more details about this sparse representation, see the notes on log-linear models.
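For concreteness, here is a minimal sketch of such a sparse dot product, assuming both the feature vector and the weight vector are stored as Python dictionaries (the name sparse_dot is our own, not part of the provided scripts):

def sparse_dot(features, weights):
    """Dot product of a sparse feature map with a sparse weight map.

    Iterating over the feature map means features with no stored
    weight simply contribute 0, exactly as in the full dot product.
    """
    return sum(value * weights.get(name, 0.0)
               for name, value in features.items())

# Example: score one (history, tag) pair against the weight vector v.
v = {"BIGRAM:DET:NOUN": 1.23}
g = {"BIGRAM:DET:NOUN": 1}
print(sparse_dot(g, v))  # 1.23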
Decoding
If we have a weight vector $v$ and a set of features, we can use the global linear model to tag new sentences.
This requires solving the following decoding problem:
\[
y^* = \arg\max_{y \in \mathrm{GEN}(x)} f(x, y) \cdot v = \arg\max_{t_1, \ldots, t_n \in \mathrm{GEN}(x)} \sum_{i=1}^n g(\langle t_{i-1}, x, i \rangle, t_i) \cdot v
\]
To simplify your task, we have implemented a decoder for bigram tagging. That is, given the score for each
history, $g(h_i, t_i) \cdot v$, we will return $y^*$. However, we do not provide any of the scoring code. To use the
decoder, you must score each possible history of the sentence.
The interface for scoring histories is to provide a string with columns $i$, $t_{i-1}$, $t_i$, and a fourth column
containing the score $g(h_i, t_i) \cdot v$. As an example, consider storing the following information in a file history.scores
cat history.scores
1 * DET 23.2
1 * VERB 0.25
1 * NOUN 0.51
1 * ADJ 0.10
...
Calling the tagger decoder with the sentence and the scores will return $y^*$ represented as a sequence of
histories.
python tagger_decoder.py HISTORY < history.scores
1 * DET
2 DET VERB
3 VERB DET
4 DET ADV
...
Note: To use the decoder, it is necessary to call this script from within your code. For efficiency reasons,
you should spawn each script once and then communicate by piping sentences and history weights to the
scripts. You should not need to write out to files or spawn more than a handful of processes per run. For
Python, see the subprocess module and the example distributed with this problem for a simple interface.
If you have trouble running shell commands in your language of choice, please consult the TAs.
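As an illustration, here is a minimal sketch of spawning the decoder once and piping one sentence's worth of weighted histories through it. The line protocol assumed here (one history per line, with a blank line ending each sentence's input and output) is a guess for illustration only; consult the example distributed with this problem for the authoritative interface.

import subprocess

# Spawn the decoder once and keep it alive for the whole run.
decoder = subprocess.Popen(
    ["python", "tagger_decoder.py", "HISTORY"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE,
    universal_newlines=True)

def decode(scored_histories):
    """Send scored histories for one sentence; read back the best histories.

    scored_histories: a list of (i, prev_tag, tag, score) tuples.
    Assumes the decoder emits one history per line followed by a blank
    line -- check the provided example for the real protocol.
    """
    for i, prev_tag, tag, score in scored_histories:
        decoder.stdin.write("%d %s %s %f\n" % (i, prev_tag, tag, score))
    decoder.stdin.write("\n")
    decoder.stdin.flush()
    best = []
    while True:
        line = decoder.stdout.readline().strip()
        if not line:
            break
        best.append(tuple(line.split()))
    return best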
Perceptron Training
So far, we have discussed generating histories and decoding. The next step is to use these to train your perceptron tagger. Training the tagger requires a corpus of tagged examples $(x^{(i)}, y^{(i)})$ for $i \in \{1, \ldots, M\}$. The
file tag_train.dat contains the training sentences, and the files tag_dev.dat and tag_dev.key
contain development sentences and correct results. The sentences are taken from the Wall Street Journal
corpus and consist of sentences of length less than 15 words. The tag format is identical to that used in
Homework 1.
The perceptron algorithm is defined in detail in the class slides. The algorithm is slightly complicated, but
you will find that much of the implementation consists of gluing together the provided scripts. The following
gives a high-level view of the pieces required for each update (a sketch of the update step appears after this list):
1. Get the gold tag histories using tagger_history_generator.py GOLD.
2. Enumerate all possible histories using tagger_history_generator.py ENUM.
3. Use your model to assign a weight to each history.
4. Find the histories of the highest-scoring tagging using tagger_decoder.py.
5. Update your weights based on the features of the gold and highest-scoring histories.
See the note above for a description of how to call these scripts from your code.
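The following is a minimal sketch of step 5, the perceptron update itself. The helper extract_features, which maps one (history, tag) pair to a list of feature strings, is our own placeholder for whatever feature code you write; the gold and best arguments are the history lists produced by steps 1 and 4.

from collections import defaultdict

def perceptron_update(v, gold, best, extract_features):
    """One perceptron update: v <- v + f(x, gold) - f(x, best).

    gold, best: lists of (history, tag) pairs for the gold tagging and
    the current highest-scoring tagging.
    """
    if best == gold:
        return  # prediction was correct; no update needed
    for history, tag in gold:
        for feature in extract_features(history, tag):
            v[feature] += 1.0
    for history, tag in best:
        for feature in extract_features(history, tag):
            v[feature] -= 1.0

v = defaultdict(float)  # the weight vector, initialized to 0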
Question 4 (20 points)
In this problem you will experiment with using the bigram tagging decoder with a pre-trained model.
• Consider a very simple model with two types of feature functions, for all tags $u$ and $v$ and words $r$:
\[
g_{\mathrm{BIGRAM}:u:v}(\langle t_{i-1}, x, i \rangle, t_i) = \begin{cases} 1 & \text{if } t_{i-1} = u \text{ and } t_i = v \\ 0 & \text{otherwise} \end{cases}
\]
\[
g_{\mathrm{TAG}:r:u}(\langle t_{i-1}, (w_1 \ldots w_n), i \rangle, t_i) = \begin{cases} 1 & \text{if } w_i = r \text{ and } t_i = u \\ 0 & \text{otherwise} \end{cases}
\]
Taking the example sentence above ("There is no ...") as $x$, we extract the following feature vector
from an example (wrong) history:
\[
g(\langle \mathrm{DET}, x, 2 \rangle, \mathrm{NOUN}) = \{\text{BIGRAM:DET:NOUN} \to 1, \text{TAG:is:NOUN} \to 1\}
\]
For this problem we provide a pre-trained weight vector $v$ for these features (initialized to $v = 0$ and
trained for $K = 5$ iterations). The file tag.model contains the vector. All lines in the file consist of
two columns, FEATURE and WEIGHT, indicating $v(\text{FEATURE}) = \text{WEIGHT}$. More specifically:
– Lines of the form TAG:asbestos:NOUN 0.23 mean the feature TAG:asbestos:NOUN,
representing asbestos tagged with NOUN, has weight 0.23.
– Lines of the form BIGRAM:DET:NOUN 0.45 mean the feature BIGRAM:DET:NOUN,
representing the bigram DET, NOUN, has weight 0.45.
• The goal of this problem is to decode with this model. First, read tag.model into a map from feature
strings to weights. Next, for each sentence in the development data, run the following steps:
1. Enumerate all possible histories using tagger_history_generator.py ENUM.
2. Compute the features for each history and use tag.model to assign a weight to each history.
3. Call tagger_decoder.py HISTORY and pipe in the weighted histories to compute the
highest-scoring tagging.
You can check the performance of your tagger by calling python eval_tagger.py tag_dev.key
tag_dev.out. Report the performance of your tagger and submit your code.
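For example, reading tag.model into a Python dictionary might look like the following minimal sketch (assuming only the two-column FEATURE WEIGHT layout described above):

def read_model(path):
    """Read a model file of FEATURE WEIGHT lines into a weight dict."""
    v = {}
    with open(path) as f:
        for line in f:
            feature, weight = line.split()
            v[feature] = float(weight)
    return v

v = read_model("tag.model")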
Question 5 (40 points)
One reason that this model performs poorly is that it does not handle rare words. We could introduce RARE
tokens or word classes; however, a major benefit of discriminative models is that we can instead introduce
features that look directly at properties of the word.
A good example of this type of feature is word suffixes. Suffixes are a very useful indicator of part-of-speech
in English (e.g., "ed", "ing", "er"), and so we expect features of this type to help disambiguate
unknown words. The new features are, for all suffixes $u$, tags $v$, and lengths $j \in \{1, 2, 3\}$:
\[
g_{\mathrm{SUFF}:u:j:v}(\langle t_{i-1}, (w_1 \ldots w_n), i \rangle, t_i) = \begin{cases} 1 & \text{if } u = \mathrm{suffix}(w_i, j) \text{ and } t_i = v \\ 0 & \text{otherwise} \end{cases}
\]
where $\mathrm{suffix}(w, j)$ returns the last $j$ letters of $w$.
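For instance, the suffix features for a single (word, tag) pair could be generated as in the sketch below; the feature-string format SUFF:u:j:v follows the definition above, and the function name suffix_features is our own:

def suffix_features(word, tag):
    """Generate SUFF:u:j:v feature strings for one (word, tag) pair."""
    features = []
    for j in (1, 2, 3):
        if len(word) >= j:  # skip suffixes longer than the word itself
            u = word[-j:]   # suffix(w, j): the last j letters of w
            features.append("SUFF:%s:%d:%s" % (u, j, tag))
    return features

print(suffix_features("walking", "VERB"))
# ['SUFF:g:1:VERB', 'SUFF:ng:2:VERB', 'SUFF:ing:3:VERB']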
• For this question, you should first implement the perceptron algorithm to estimate a weight vector $v$ for
the new set of features. Start with $v = 0$ and run $K = 5$ iterations. Follow the steps in the Perceptron
Training section. When your code is finished, write the final model out to suffix_tagger.model.
• After training is complete, add the new suffix features to your code from Question 4 and then run on
the development set. Report the performance of your tagger and submit the tagged output as well as
your code. In addition, write up any observations. Did the new features help? How did they compare
to the previous model?
Question 6 (20 Points)
As mentioned above, the benefit of this formulation is the ability to add arbitrary features conditioned on
the input sentence $x$. In the last question we experimented with adding suffix features for the $i$-th word of
the sentence. Many other features have been shown to be useful for part-of-speech tagging.
• For this question you should run tagging experiments with custom features. There are many ways to
come up with new features. One strategy is to look at the errors your tagger made in Question 5
and try out features that target specific issues. We also encourage you to look up features in the
NLP literature or in open-source taggers. Don't be discouraged if adding seemingly useful features
sometimes decreases the accuracy of your tagger; feature selection can be a difficult problem.
When you come up with interesting features to use, run experiments with three different feature
combinations. Compare the results and write up any issues that you encountered.