CS322 Assignment 1
Written Questions:
1. 3.1
2. 3.2
3. 3.3
4. 3.4
5. 3.7
6. Suppose you are an employee of a large (unnamed) web company, and you would like to release a
dataset of all six-gram counts on the web. However, given that there are many possible n-grams,
you are worried that your file will be too large. You encounter a blog post from Google describing
some basic statistics of their (now famous) n-gram dataset:
https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html [1]. In particular, you find
the counts of the number of unique unigrams/bigrams/trigrams/etc. (e.g., the number of unique
unigrams is approximately 13.58 million).
(a) Assume that the number of unigrams is equal to the size of the vocabulary. How many possible
bigrams are there for that vocabulary? For a given n, what is the formula to compute the number
of possible n-grams?
(b) Compute the sparsity of bigrams on the web as the ratio of the number of observed bigrams to the
number of possible bigrams. Repeat this process for 3/4/5-grams. Make a plot where the x-axis
is n and the y-axis is log10 of the sparsity measure. (A small sketch of this computation appears
after this question.)
(c) Given the plot you previously made, (roughly) what sparsity would you expect if you were to
release a dataset of six-grams with the same vocabulary?
(d) Given the sparsity you computed in the previous question, (roughly) how many unique six-grams
do you expect to exist on the web?
[1] Archived version: https://web.archive.org/web/20190213193236/https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
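One way to organize the computation for parts (b)-(d) is sketched below in Python. This is a minimal, illustrative sketch: only the unigram figure quoted above is filled in, and the remaining observed counts are placeholders to be read off the linked blog post before use.

    import math

    # Approximate numbers of unique n-grams observed on the web. Only the
    # unigram figure is quoted in this handout; fill in the rest from the
    # Google blog post before running.
    observed = {
        1: 13_580_000,   # ~13.58 million unique unigrams
        2: None,         # unique bigrams  -- from the blog post
        3: None,         # unique trigrams -- from the blog post
        4: None,         # unique 4-grams  -- from the blog post
        5: None,         # unique 5-grams  -- from the blog post
    }

    V = observed[1]  # per part (a), treat the unigram count as the vocabulary size

    for n, count in observed.items():
        if count is None:
            continue
        possible = V ** n            # number of possible n-grams for a vocabulary of size V
        sparsity = count / possible  # observed / possible, as in part (b)
        print(n, math.log10(sparsity))  # y-value for the plot in part (b)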
Programming Assignment:
In this assignment, you will be coding up a smoothed n-gram language model (with no backoff, for
simplicity). Additionally, you will be evaluating the model using perplexity. We will be using a corpus of
50,000 movie reviews from IMDB (49.5K are for training, and 500 are for validation). The original dataset
is due to Maas et al. 2011 (http://ai.stanford.edu/~amaas/data/sentiment/).
The dataset is distributed as a text file, where each line consists of either a movie review or an ignorable
blank line. The training data is contained in movies_train.toks and the validation data is contained in
movies_val.toks. For this assignment, start/end token buffers should be added at the beginning and end
of each line (rather than between sentences). For example, consider training a trigram language model on
the following (made-up) line:
The movie was great. I laughed a lot.
Even though the line contains multiple sentences, start/end tokens should be added as follows (for a trigram
language model, with a history of two):
<s> <s> The movie was great. I laughed a lot. </s>
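As a concrete illustration, a line can be padded with n - 1 start tokens and one end token before counting. The following is a minimal sketch, assuming the buffer symbols are literally the strings '<s>' and '</s>'; check the starter code for the exact symbols it expects.

    def pad_line(tokens, n):
        """Add n-1 start tokens and one end token around a line of tokens."""
        return ['<s>'] * (n - 1) + tokens + ['</s>']

    # For a trigram model (history of two):
    print(pad_line('The movie was great. I laughed a lot.'.split(), n=3))
    # ['<s>', '<s>', 'The', 'movie', 'was', 'great.', 'I', 'laughed', 'a', 'lot.', '</s>']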
More details are available in the starter code (which you are free to use, modify, or change in any way
you like). The outline of the assignment is as follows:
1. Write a function count_unigrams. This function counts up all unigrams in the input, and returns a
dictionary mapping each word type to how many times it appeared in the training data. (Illustrative
sketches of this and several of the later functions appear after this list.)
2. We next create a vocabulary of words (this code is included in the starter code). Answer the following
questions about the vocabulary: How many tokens were in the original training corpus, in total? How
many unique words were there in the original training corpus? How big is your vocabulary after the
cutoff is applied (and what in the starter code controls it)? What are the 10 most common words in
your vocabulary?
3. Write a function unigram_distribution that takes as input a list of token lists, and outputs a dictionary
mapping from word type to the smoothed probability of that word, i.e., P(w). Remember: if you saw
a word 5 times in your corpus, and your smoothing factor is .1, you assume that you actually saw it
5.1 times! Be sure to normalize this into a probability distribution, i.e., your dictionary should have
the property that Σ_{w in vocab} P(w) = 1.
4. Write a function compute_perplexity_unigrams that computes the perplexity of an input list of tokens
according to the unigram probability distribution returned by unigram_distribution.
5. Write a function build_table that computes a nested dictionary of context → next-word counts. For
example, if counts is the output of the function, counts[('had', 'great')] might be
{'acting': 10, 'effects': 5, ...}. A warning: do not include zero counts in this data structure!
As we have discussed (or will soon discuss) in class, the number of possible n-grams is much greater
than the number of observed n-grams, and it is possible to compute accurate probabilities with this
sparse (i.e., zeros are not explicitly included) representation. Note: in my solution, counts is implemented
with a collections.defaultdict. In particular, you might find collections.defaultdict(int) (or
something similar) to be useful; this is a normal Python dictionary, but if you query it for a key
you haven't seen before, it returns zero.
6. Write a function compute_log_probability that loops over a sentence and sums the log probability
of each token. For a trigram language model applied to the sentence the movie was great!, your
function should return log(P(the | <s>, <s>)) + log(P(movie | <s>, the)) + log(P(was | the, movie)) +
... + log(P(</s> | great, !)). To compute these probabilities, you should use the counts you computed
in the build_table function. Keep in mind: you need to smooth all of the counts in your dictionary
by the smoothing factor parameter, i.e., if one of the counts in your count dictionary is 5 and your
smoothing parameter is .1, then you should assume that you observed that n-gram 5.1 times for
the purposes of computing the probability. Note: given that your count dictionary doesn't contain
zero entries, you may be tempted to ignore the smoothing for those entries. However, to properly
normalize your probability distribution, you should assume that all zero counts are now, say, .1! The
reading has more detail, but you'll have to be a bit clever so as to (1) properly account for zero counts
with smoothing, while (2) not making your count dictionary dense, i.e., it should only store observed counts!
7. Write a function compute_perplexity that computes the perplexity of a sentence, given its log probability. This function should call compute_log_probability.
8. There is some starter code in main, but complete the code to run the following experiment: train a
language model for n = 1, 2, 3, 4 and compute the average per-line validation perplexity for each of the
resulting models. Make a plot of perplexity vs. n and discuss any patterns you see.
9. (Optional) How does the maximum vocabulary size interact with the perplexity of the unigram language
model? Try running the experiment from the previous question with several different vocabulary sizes and
discuss.
10. (Optional) Try computing the training set perplexity of your language models, and compare those values
to the validation perplexities. Are the training perplexities lower than the validation perplexities? Why
or why not?
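Illustrative sketches:
The sketches below show one possible way to structure the functions described above. They are minimal and non-authoritative: the starter code's actual signatures, vocabulary handling, and special-token conventions may differ, so treat them as a reference for the ideas rather than code to copy.

For item 1 (count_unigrams), assuming the input is a list of token lists, one per line of the training file:

    import collections

    def count_unigrams(lines):
        """Count how many times each word type appears across all lines."""
        counts = collections.Counter()
        for tokens in lines:
            counts.update(tokens)
        return dict(counts)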
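For item 3 (unigram_distribution), one way to apply the smoothing factor and normalize, assuming every word in the vocabulary (including words with zero observed count) receives the same added mass; how out-of-vocabulary tokens are handled is left to the starter code:

    def unigram_distribution(lines, vocab, smoothing=0.1):
        """Map each word in vocab to its smoothed, normalized probability."""
        counts = {w: 0 for w in vocab}
        for tokens in lines:
            for tok in tokens:
                if tok in counts:      # out-of-vocabulary tokens are ignored here; the
                    counts[tok] += 1   # starter code may map them to an unknown-word token
        total = sum(counts.values()) + smoothing * len(vocab)
        return {w: (count + smoothing) / total for w, count in counts.items()}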
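For item 4 (compute_perplexity_unigrams), perplexity can be computed as the exponential of the negative average log probability; this sketch assumes natural logarithms and that every token already appears in the distribution:

    import math

    def compute_perplexity_unigrams(tokens, unigram_probs):
        """Perplexity of a token list under a unigram distribution."""
        log_prob = sum(math.log(unigram_probs[tok]) for tok in tokens)
        return math.exp(-log_prob / len(tokens))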
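For item 5 (build_table), a sparse nested count table built with collections.defaultdict, assuming lines have already been padded with start/end tokens for a model of order n:

    import collections

    def build_table(lines, n):
        """Map each (n-1)-token context tuple to a dict of next-word counts.
        Only observed counts are stored, so the table stays sparse."""
        counts = collections.defaultdict(lambda: collections.defaultdict(int))
        for tokens in lines:
            for i in range(n - 1, len(tokens)):
                context = tuple(tokens[i - (n - 1):i])
                counts[context][tokens[i]] += 1
        return counts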
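For items 6 and 7, the key idea is that smoothing can be applied at lookup time without densifying the table: under add-k smoothing, P(w | context) = (count(context, w) + k) / (count(context) + k * V), where k is the smoothing factor and V is the vocabulary size (whether special tokens count toward V is up to the starter code). A sketch under those assumptions:

    import math

    def compute_log_probability(tokens, counts, n, vocab_size, smoothing=0.1):
        """Sum of log P(w | context) over a padded token list, with add-k
        smoothing applied at lookup time."""
        log_prob = 0.0
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - (n - 1):i])
            # .get avoids inserting new (empty) entries into a defaultdict-backed table
            next_counts = counts.get(context, {})
            numerator = next_counts.get(tokens[i], 0) + smoothing
            denominator = sum(next_counts.values()) + smoothing * vocab_size
            log_prob += math.log(numerator / denominator)
        return log_prob

    def compute_perplexity(tokens, counts, n, vocab_size, smoothing=0.1):
        """Perplexity of one padded line, computed from its log probability."""
        log_prob = compute_log_probability(tokens, counts, n, vocab_size, smoothing)
        num_predictions = len(tokens) - (n - 1)  # positions after the start buffer
        return math.exp(-log_prob / num_predictions)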