CS447: Natural Language Processing
Homework 3
Getting started
All files that are necessary to do the assignment are contained in a tarball which you can get from:
http://courses.grainger.illinois.edu/cs447/fa2020/HW/HW3/cs447_hw3.zip
You need to unpack this zip file (unzip cs447_hw3.zip) to get a directory that contains the code and data
you will need for this homework.
NLTK
For this assignment you will need the Natural Language Toolkit (http://nltk.org/). You can install it with
the following command:
sudo pip install nltk==3.4.5
Part 1: Hidden Markov Model Part of Speech Taggers (8 points)
1.1 Goal
The first part of the assignment asks you to implement a Hidden Markov Model (HMM) for part-of-speech
tagging. Specifically, you need to accomplish the following:
• implement the Viterbi algorithm for a bigram HMM tagger
• train this tagger on labeled training data (train.txt), and use it to tag unseen test data (test.txt).
• write another program that compares the output of your tagger with the gold standard for the test
data (gold.txt) to compute the overall token accuracy of your tagger. In addition, you need to
compute the precision and recall for each tag and produce a confusion matrix.
1.2 Data
You will use three data files to train your tagger, produce Viterbi tags for unlabeled data, and evaluate your
tagger’s Viterbi output.
1.2.1 Training
The file train.txt consists of POS-tagged text:
Kemper_NNP Financial_NNP Services_NNPS Inc._NNP ,_, charging_VBG that_DT program_NN trading_NN
is_VBZ ruining_VBG the_DT stock_NN market_NN ,_, cut_VBD off_RP four_CD big_JJ Wall_NNP Street_NNP
firms_NNS from_IN doing_VBG any_DT of_IN its_PRP$ stock-trading_NN business_NN ._. The_DT move_NN
is_VBZ the_DT biggest_JJS salvo_NN yet_RB in_IN the_DT renewed_VBN outcry_NN against_IN program_NN
trading_NN ,_, with_IN Kemper_NNP putting_VBG its_PRP$ money_NN --_: the_DT millions_NNS of_IN
dollars_NNS in_IN commissions_NNS it_PRP generates_VBZ each_DT year_NN --_: where_WRB its_PRP$
mouth_NN is_VBZ ._.
Words are separated by spaces, and each tag is attached to its word with a single underscore. In
the actual file, each line is a single sentence. You will use this file to train a bigram HMM tagger (i.e.,
estimate its transition, emission, and initial probabilities). For the purposes of this assignment, you need
to use an UNK token to handle unseen words: map every word that occurs fewer than 5 times in the training
data to UNK. Additionally, you need to smooth the transition probabilities using Laplace smoothing (Lecture 3,
slide 48). This allows your model to assign non-zero probability to unseen tag bigrams (which do exist in the
test corpus).
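To make the training recipe concrete, here is a minimal sketch of the counting and smoothing described
above. It is only an illustration: the function names, the UNK constant, and the (word, tag) input format
are assumptions for this sketch, not the skeleton's actual API.

from collections import defaultdict
import math

UNK = "UNK"        # assumed placeholder token for rare/unseen words
MIN_FREQ = 5       # words seen fewer than 5 times in training become UNK

def train_counts(sentences):
    """sentences: a list of sentences, each a list of (word, tag) pairs."""
    # First pass: word frequencies, so rare words can be mapped to UNK.
    word_freq = defaultdict(int)
    for sen in sentences:
        for word, _ in sen:
            word_freq[word] += 1

    transitions = defaultdict(lambda: defaultdict(int))  # transitions[t1][t2]
    emissions = defaultdict(lambda: defaultdict(int))    # emissions[tag][word]
    initial = defaultdict(int)    # counts of sentence-initial tags
    for sen in sentences:
        prev = None
        for word, tag in sen:
            if word_freq[word] < MIN_FREQ:
                word = UNK
            emissions[tag][word] += 1
            if prev is None:
                initial[tag] += 1
            else:
                transitions[prev][tag] += 1
            prev = tag
    return transitions, emissions, initial

def log_transition_prob(transitions, t1, t2, num_tags):
    # Laplace (add-one) smoothing: unseen tag bigrams keep non-zero mass.
    count = transitions[t1].get(t2, 0)
    total = sum(transitions[t1].values())
    return math.log((count + 1) / (total + num_tags))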
1.2.2 Tagging
The file test.txt consists of raw text. Each line is a single sentence, and words are separated by spaces.
Once you finish your implementation, run your tagger over test.txt and create a file out.txt that contains
the Viterbi output.
1.2.3 Evaluation
For evaluation purposes, you need to compare the output of your tagger with gold.txt.
1.3 Provided Code/What you need to implement
Your solution will consist of two files, hw3_hmm.py and hw3_eval_hmm.py.
1.3.1 HMM tagger
You should implement your HMM tagger in the module hw3_hmm.py. You are free to implement your HMM
tagger using whatever data structures you wish, but all of your code should be your own.[1] We provide a
skeleton implementation for the HMM class with training and testing methods:
• train(self, corpus): estimates the counts for the bigram HMM on a labeled training corpus, which
is a nested list of TaggedWord objects[2] (0.5 points)
• test(self, corpus): given an unlabeled corpus (a nested list of word strings), prints each sentence with
its Viterbi tag sequence (0.5 points)
The tagging method requires you to implement the Viterbi algorithm for bigram HMMs:
• viterbi(self, sen): given sen (a list of words), returns the Viterbi tag sequence for that sentence
as a list of strings (2 points)
Implementation Hint: If you have a lexicon that tells you for each word which tags it can have, you don’t
have to go through all cells in each column. Similarly, if you first check whether a word has zero probability
given a tag, you don’t have to do all the computations to fill that cell. All your computations should again
be in log space. You may implement these functions however you wish, but please preserve the method
signatures for autograding.
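The following is a hedged sketch of log-space Viterbi that uses the tag-lexicon restriction from the hint
above. Everything in it is an assumption for illustration: the helpers log_init, log_trans, and log_emit,
and the lexicon dict mapping each word (already mapped to UNK where appropriate) to the set of tags it
occurred with in training.

def viterbi(sen, lexicon, log_init, log_trans, log_emit):
    # trellis[i][tag] = (best log probability of any path that ends in
    #                    `tag` at position i, backpointer to previous tag)
    trellis = [dict() for _ in sen]
    for tag in lexicon[sen[0]]:
        trellis[0][tag] = (log_init(tag) + log_emit(tag, sen[0]), None)

    for i in range(1, len(sen)):
        for tag in lexicon[sen[i]]:          # only tags this word can take
            best_prev, best = None, float("-inf")
            for prev, (score, _) in trellis[i - 1].items():
                cand = score + log_trans(prev, tag)
                if cand > best:
                    best_prev, best = prev, cand
            trellis[i][tag] = (best + log_emit(tag, sen[i]), best_prev)

    # Backtrace from the highest-scoring cell in the last column.
    last = max(trellis[-1], key=lambda t: trellis[-1][t][0])
    tags = [last]
    for i in range(len(sen) - 1, 0, -1):
        tags.append(trellis[i][tags[-1]][1])
    return list(reversed(tags))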
Note: The gold labels for the test data contain a transition ("sharper_JJR $_$") that is not present in the
training data when using the default unknown word threshold of 5.[3] To prevent your tagger from
assigning zero probability to a test sentence, you can use add-one smoothing on the transition distributions
to ensure that a sequence with non-zero probability exists. Make sure that you still restrict the space of
possible tags for each word token according to your tag lexicon.
[1] Feel free to re-use your distribution code from HW1, if you like.
[2] A TaggedWord simply stores a word token and a POS tag (e.g., from train.txt or gold.txt) as two separate strings.
[3] Both 'sharper' and '$' appear at least 5 times, and JJR and $ are the only tags they appear with, respectively; however,
nowhere in the training data is there a bigram tagged as JJR-$.
1.3.2 Evaluation program
You should implement your evaluation program in hw3_eval_hmm.py. We ask that you implement the
following methods for the Eval class:
• getTokenAccuracy(self): returns the percentage of tokens that were correctly labeled by the tagger
(1 point)
• getSentenceAccuracy(self): returns the percentage of sentences where each word in a sentence was
correctly labeled (1 point)
• getPrecision(self, tagTi): returns the tagger's precision when predicting tag ti (1 point)
Precision indicates how often ti was the correct tag among the tokens the tagger labeled ti.
• getRecall(self, tagTj): returns the tagger's recall for predicting gold tag tj (1 point)
Recall indicates how often the tagger output tj when tj was the correct tag.
• writeConfusionMatrix(self, outFile): writes a confusion matrix to outFile (1 point)
We'd like you to create a confusion matrix (Lecture 9, slide 13) that indicates how often words that are
supposed to have tag ti were assigned tag tj (when ti = tj, a token is correctly labeled). The first line of
your confusion matrix should be a comma- or tab-separated list of POS tags (NN, NNP, etc.); the j-th entry in
this list indicates that the j-th column in the matrix stores the number of times that another tag was tagged
as tj. The first entry in the i-th following row should be tag ti, followed by the counts for each tj. Thus,
the diagonal of the confusion matrix indicates the number of times that each tag was labeled correctly. The
entries in this confusion matrix should be raw frequencies (i.e., you don't need to normalize the counts). You
can use the counts in your confusion matrix to calculate precision and recall for each tag.
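As an illustration of how the confusion-matrix counts feed the other metrics, here is a minimal sketch. It
assumes the counts live in a nested dict confusion[gold_tag][predicted_tag]; that representation and the
function names below are illustrative choices, not part of the provided skeleton.

def precision(confusion, tag_list, ti):
    # precision(ti) = count(predicted == ti and gold == ti)
    #               / count(predicted == ti), i.e. a column of the matrix
    predicted_ti = sum(confusion.get(g, {}).get(ti, 0) for g in tag_list)
    correct = confusion.get(ti, {}).get(ti, 0)
    return correct / predicted_ti if predicted_ti else 0.0

def recall(confusion, tj):
    # recall(tj) = count(predicted == tj and gold == tj)
    #            / count(gold == tj), i.e. a row of the matrix
    gold_tj = sum(confusion.get(tj, {}).values())
    correct = confusion.get(tj, {}).get(tj, 0)
    return correct / gold_tj if gold_tj else 0.0

def write_confusion_matrix(confusion, tag_list, out_path):
    with open(out_path, "w") as f:
        f.write(",".join(tag_list) + "\n")   # header row: predicted tags
        for ti in tag_list:                  # one row per gold tag ti
            row = [str(confusion.get(ti, {}).get(tj, 0)) for tj in tag_list]
            f.write(ti + "," + ",".join(row) + "\n")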
1.3.3 Sanity check
The Python script hmm_sanity_check.py will run your code and evaluate your HMM tagger on a sample
sentence. It will check that your tagger correctly predicts the POS tags for that sentence, and that the
probabilities of the model are reasonably close to the TA implementation. The script will also test your
evaluation program to make sure that the methods you implement there behave correctly. Note that passing
the script does not mean that your code is 100% correct (but it’s still a good sign!).
1.4 What to submit for Part 1
For this portion, you should submit your two program files along with your confusion matrix:
• hw3_hmm.py: your completed Python module for POS tagging using a bigram HMM (see section 1.3.1)
• hw3_eval_hmm.py: your completed Python module for evaluating your HMM's output (see section
1.3.2)
• confusion_matrix.txt: a text file containing the confusion matrix you produced from your HMM's
output
We will provide a copy of the corpora/input files to test your program during grading.
1.5 What you will be graded on
You will be graded according to the correctness of your HMM tagger[4] (3 points), as well as the correctness of
your token and sentence accuracy functions (2 points), precision and recall functions (2 points), and confusion
matrix (1 point).
[4] Your tagger should not get 100% accuracy; correctness will be judged by comparing your Viterbi predictions against the
predictions of the TA's tagger, allowing for noise due to ties and such.
Part 2: Writing a Context-Free Grammar (2 points)
2.1 Goal
Your task in this part is to write a context-free grammar that can parse a set of short sentences. You will
be using parsing code from NLTK for this portion of the assignment, and your grammar does not need to
be in Chomsky Normal Form.
2.2 Data
The file sentences.txt contains an evaluation set of 25 short sentences.
2.3 Provided code
We have provided a script (hw3_nltkcfg.py) that reads your grammar from mygrammar.cfg and tries to
parse the evaluation set. The script expects the sentences.txt file to be in the same directory, and will
write its output to a new file, hw3_cfg_out.txt. Call the script from the command line with:
python hw3_nltkcfg.py
For each sentence, the script will provide a brief message in the console (whether the parse was a SUCCESS
or a FAILURE), along with the number of successful parse trees. More importantly, each successful parse tree
will be written to the output file in a bracketed, tabbed format. When evaluating your grammar, you should
look at this output to find ways to improve your grammar.
Debugging
Additionally, you can provide --gui as an optional argument for debugging:
python hw3_nltkcfg.py --gui
With --gui enabled, the script will display a pop-up window with each successful parse as a tree. In order
to view the next parse tree, you must close the window. We provide minimal debugging support for looking
at partial charts in the event of a parse failure. If you feel that you need to see this information to debug
your grammar, you are free to add additional debugging statements to hw3_nltkcfg.py.
2.4 What you need to implement
You don’t need to write (or submit) any actual Python code for this part of the assignment. Instead, your
task is to define the rules for your CFG in mygrammar.cfg. This file already contains the lexical rules
required for these sentences (i.e., words are mapped to their POS tags), along with a start symbol for the
grammar that generates the nonterminal symbol S.
Please do not change or delete the existing rules.
Instead, you will need to add rules such as:
S -> NP VP
NP -> DT N
N -> NN | NNS | ...
etc.
You may define your own inventory of nonterminals, but please do not change the existing rules. Your grade
will depend in part on your grammar’s ability to successfully parse the evaluation sentences; however, we
care more about your ability to define and explain a linguistically meaningful grammar (see below).
Implementation hints
While it’s possible to define a very simple grammar that parses every sentence, such as:
S -> LEX | LEX S
LEX -> NN | NNS | COMMA | VB | ...
that’s not the point of this part of the assignment, and such a solution would receive zero credit. Similarly,
a set of rules like:
S -> NN CC NN VBD FS
S -> PRP VBP NNS CC NNS FS
etc.
where each (extremely flat) rule generates the POS tag sequence for a particular sentence in the evaluation
data is also not allowed. Instead, your grammar should use a set of nonterminals similar to those of the
Penn Treebank parse trees you’ve seen in class. You can find a description of the POS tags used in this
assignment at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.
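For a sense of the expected granularity, here is a hedged fragment in the same rule notation as the
examples above. The nonterminal inventory is one possible choice, not a required set, and it assumes that
the provided lexical rules already define preterminals such as NN, VBD, IN, CC, and FS:

S -> NP VP FS
NP -> DT N | PRP | NP PP | NP CC NP
VP -> V | V NP | VP PP | VP CC VP
PP -> IN NP
N -> NN | NNS
V -> VB | VBD | VBP | VBZ

Rules at roughly this level of generality let one nonterminal (e.g., NP) cover many different sentences,
which is what distinguishes a linguistically meaningful grammar from the flat, per-sentence rules shown above.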
2.5 What to submit
Along with your completed grammar file (mygrammar.cfg), you should submit your output file (hw3_cfg_out.txt).
2.6 What you will be graded on
You will be graded on your grammar’s ability to successfully parse the given 25 sentences (1 point) and on
your grammar’s ability to successfully parse the sentences in our held-out test set (1 point).
Submission Details
To conclude, submit a zip file containing the following files to Gradescope for grading, and remember not
to change any function name or filename we give you:
• hw3_hmm.py
• hw3_eval_hmm.py
• confusion_matrix.txt
• mygrammar.cfg
• hw3_cfg_out.txt