CMPSC 442: Homework 6 [100 points]
TO SUBMIT HOMEWORK
To submit this homework assignment:
1. You *must* download the homework template file homework6_cmpsc442.py from Canvas. Each
template file is a Python file that gives you a head start in creating your homework Python script
with the correct function names for autograding. You will also need a text file from the Canvas
assignment page, called:
brown-corpus.txt
2. You *must* rename the template file by replacing cmpsc442 with the id from your official
PSU email. For example, if your PSU email id is abcd1234, you would rename your file
homework6_abcd1234.py before submitting to Gradescope.
3. Upload your *.py file to Gradescope by its due date. If the submission window closes before you
upload, your homework will be counted late, and you might receive a zero grade.
4. Make sure your file can be imported before you submit; the autograder imports your file. If it
cannot be imported, you will receive a zero.
Instructions
In this assignment, you will gain experience working with hidden Markov models for part-of-speech
tagging.
A skeleton file homework6_cmpsc442.py containing empty definitions for each question has been
provided. Since portions of this assignment will be graded automatically, none of the names or
function signatures in this file should be modified. However, you are free to introduce additional
variables or functions if needed.
You may import definitions from any standard Python library, and are encouraged to do so in case you
find yourself reinventing the wheel.
You will find that in addition to a problem specification, most programming questions also include a
pair of examples from the Python interpreter. These are meant to illustrate typical use cases to clarify
the assignment, and are not comprehensive test suites. In addition to performing your own testing,
you are strongly encouraged to verify that your code gives the expected output for these examples
before submitting.
You are strongly encouraged to follow the Python style guidelines set forth in PEP 8, which was written
in part by the creator of Python. However, your code will not be graded for style.
1. Hidden Markov Models [95 points]
In this section, you will develop a hidden Markov model for part-of-speech (POS) tagging, using the
Brown corpus as training data. The tag set used in this assignment will be the universal POS tag set,
which is composed of the twelve POS tags NOUN (noun), VERB (verb), ADJ (adjective), ADV (adverb),
PRON (pronoun), DET (determiner or article), ADP (adposition: preposition or postposition), NUM
(numeral), CONJ (conjunction), PRT (particle), '.' (punctuation mark), and X (other).
As in previous assignments, your use of external code should be limited to built-in Python modules,
which excludes packages such as NumPy and NLTK.
1. [10 points] Write a function load_corpus(path) that loads the corpus at the given path and
returns it as a list of POS-tagged sentences. Each line in the file should be treated as a separate
sentence, where sentences consist of sequences of whitespace-separated strings of the form
"token=POS". Your function should return a list of lists, with individual entries being 2-tuples of
the form (token, POS).
>>> c = load_corpus("brown_corpus.txt")
>>> c[1402]
[('It', 'PRON'), ('made', 'VERB'),
('him', 'PRON'), ('human', 'NOUN'),
('.', '.')]
>>> c = load_corpus("brown_corpus.txt")
>>> c[1799]
[('The', 'DET'), ('prospects', 'NOUN'),
('look', 'VERB'), ('great', 'ADJ'),
('.', '.')]
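As a rough sketch, the parsing in load_corpus might look like the following. This is one possible implementation, not the required one; it splits each item on its last "=" so that tokens which themselves contain "=" still parse.

```python
def load_corpus(path):
    """Load a corpus of "token=POS" items, one sentence per line."""
    sentences = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # each whitespace-separated item has the form "token=POS";
            # rsplit on the last "=" keeps tokens containing "=" intact
            sentences.append([tuple(item.rsplit("=", 1))
                              for item in line.split()])
    return sentences
```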
2. [20 points] In the Tagger class, write an initialization method __init__(self, sentences)
which takes a list of sentences in the form produced by load_corpus(path) as input and
initializes the internal variables needed for the POS tagger. In particular, if {t_1, t_2, ..., t_n}
denotes the set of tags and {w_1, w_2, ..., w_m} denotes the set of tokens found in the input
sentences, you should at minimum compute:
- The initial tag probabilities π(t_i) for 1 ≤ i ≤ n, where π(t_i) is the probability that a
sentence begins with tag t_i.
- The transition probabilities a(t_i → t_j) for 1 ≤ i, j ≤ n, where a(t_i → t_j) is the
probability that tag t_j occurs after tag t_i.
- The emission probabilities b(t_i → w_j) for 1 ≤ i ≤ n and 1 ≤ j ≤ m, where b(t_i → w_j) is the
probability that token w_j is generated given tag t_i.
It is imperative that you use Laplace smoothing where appropriate to ensure that your system
can handle novel inputs, but the exact manner in which this is done is left up to you as a design
decision. Your initialization method should take no more than a few seconds to complete when
given the full Brown corpus as input.
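One common design, shown as a sketch rather than the required implementation, is to count initial tags, tag bigrams, and tag–token pairs in a single pass, then apply add-one (Laplace) smoothing when converting counts to probabilities. All attribute names below are illustrative; note that using the total tag count as the transition denominator is an approximation, since it also counts sentence-final occurrences.

```python
from collections import defaultdict

class Tagger:
    def __init__(self, sentences):
        init_counts = defaultdict(int)   # tag at sentence start
        trans_counts = defaultdict(int)  # (tag_i, tag_j) bigrams
        emit_counts = defaultdict(int)   # (tag, token) pairs
        tag_counts = defaultdict(int)
        vocab = set()
        for sent in sentences:
            if sent:
                init_counts[sent[0][1]] += 1
            for token, tag in sent:
                tag_counts[tag] += 1
                emit_counts[(tag, token)] += 1
                vocab.add(token)
            for (_, t1), (_, t2) in zip(sent, sent[1:]):
                trans_counts[(t1, t2)] += 1

        self.tags = sorted(tag_counts)
        n, m = len(self.tags), len(vocab)
        total_starts = sum(init_counts.values())
        # add-one smoothing: add 1 to every count, grow the denominator
        # by the number of possible outcomes
        self.pi = {t: (init_counts[t] + 1) / (total_starts + n)
                   for t in self.tags}
        self.trans = {(t1, t2): (trans_counts[(t1, t2)] + 1)
                                / (tag_counts[t1] + n)
                      for t1 in self.tags for t2 in self.tags}
        # reserve one extra outcome (m + 1) per tag for unseen tokens
        self.emit = {}
        for t in self.tags:
            denom = tag_counts[t] + m + 1
            self.emit[t] = defaultdict(lambda d=denom: 1 / d)
            for token in vocab:
                self.emit[t][token] = (emit_counts[(t, token)] + 1) / denom
```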
3. [25 points] In the Tagger class, write a method most_probable_tags(self, tokens) which
returns the list of the most probable tags corresponding to each input token. In particular, the
most probable tag for a token w_j is defined to be the tag with index i* = argmax_i b(t_i → w_j).
>>> c = load_corpus("brown_corpus.txt")
>>> t = Tagger(c)
>>> t.most_probable_tags(
... ["The", "man", "walks", "."])
['DET', 'NOUN', 'VERB', '.']
>>> c = load_corpus("brown_corpus.txt")
>>> t = Tagger(c)
>>> t.most_probable_tags(
... ["The", "blue", "bird", "sings"])
['DET', 'ADJ', 'NOUN', 'VERB']
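The per-token argmax can be sketched as a free function over a stand-in emission table emit[tag][token] (the function signature here is illustrative; in your class the method reads the table built in __init__):

```python
def most_probable_tags(tokens, tags, emit):
    # for each token w, pick the tag t maximizing b(t -> w);
    # (tag, token) pairs missing from the table default to 0.0 here
    return [max(tags, key=lambda t: emit[t].get(w, 0.0)) for w in tokens]
```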
4. [40 points] In the Tagger class, write a method viterbi_tags(self, tokens) which returns
the most probable tag sequence as found by Viterbi decoding. Recall from lecture that Viterbi
decoding is a modification of the Forward algorithm, adapted to find the path of highest
probability through the trellis graph containing all possible tag sequences. Computation will
likely proceed in two stages: you will first compute the probability of the most likely tag
sequence, and will then reconstruct the sequence which achieves that probability from end to
beginning by tracing backpointers.
>>> c = load_corpus("brown_corpus.txt")
>>> t = Tagger(c)
>>> s = "I am waiting to reply".split()
>>> t.most_probable_tags(s)
['PRON', 'VERB', 'VERB', 'PRT', 'NOUN']
>>> t.viterbi_tags(s)
['PRON', 'VERB', 'VERB', 'PRT', 'VERB']
>>> c = load_corpus("brown_corpus.txt")
>>> t = Tagger(c)
>>> s = "I saw the play".split()
>>> t.most_probable_tags(s)
['PRON', 'VERB', 'DET', 'VERB']
>>> t.viterbi_tags(s)
['PRON', 'VERB', 'DET', 'NOUN']
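The two-stage computation described above can be sketched as a standalone function taking the model's probability tables as arguments (the signature and the 1e-12 floor for unseen tokens are illustrative stand-ins; your implementation should reuse the Laplace-smoothed emission table from __init__, and this sketch assumes tokens is non-empty):

```python
import math

def viterbi(tokens, tags, pi, trans, emit):
    # stage 1: forward pass in log space to avoid underflow
    delta = [{t: math.log(pi[t])
                 + math.log(emit[t].get(tokens[0], 1e-12)) for t in tags}]
    back = [{}]  # back[i][t] = best tag at position i-1 ending in t at i
    for w in tokens[1:]:
        scores, ptrs = {}, {}
        for t in tags:
            best = max(tags, key=lambda s: delta[-1][s]
                                           + math.log(trans[(s, t)]))
            ptrs[t] = best
            scores[t] = (delta[-1][best] + math.log(trans[(best, t)])
                         + math.log(emit[t].get(w, 1e-12)))
        delta.append(scores)
        back.append(ptrs)
    # stage 2: reconstruct the best sequence by tracing backpointers
    last = max(tags, key=lambda t: delta[-1][t])
    path = [last]
    for ptrs in reversed(back[1:]):
        path.append(ptrs[path[-1]])
    return path[::-1]
```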
2. Feedback [5 points]
1. [1 point] Approximately how long did you spend on this assignment?
2. [2 points] Which aspects of this assignment did you find most challenging? Were there any
significant stumbling blocks?
3. [2 points] Which aspects of this assignment did you like? Is there anything you would have
changed?