CS322 Assignment 4
Written Questions: None this time!
Programming Assignment:
In this assignment, you will be coding up a neural n-gram language model. We will use the same corpus
from the first assignment: the corpus of 50,000 movie reviews from IMDB (49.5K for training and 500 for
validation). The original dataset is due to Maas et al. (2011): http://ai.stanford.edu/~amaas/data/sentiment/. You'll also need the GoogleNews word2vec
embeddings from the previous assignment in the same directory as your code for part of this assignment.
As with A1, the dataset is distributed in a text file, where each line consists of either a movie review
or an ignorable blank line. The training data is contained in movies_train.toks and the validation data
is contained in movies_val.toks. For this assignment, start/end token buffers should be added at the
beginning and end of each line (rather than between sentences). For example, consider training a trigram
language model on the following (made-up) line:
The movie was great. I laughed a lot.
Even though the line contains multiple sentences, start/end tokens should be added as follows (for a trigram
language model, i.e., with a history of two):
<s> <s> The movie was great. I laughed a lot. </s>
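As a minimal sketch of this padding, assuming simple whitespace tokenization (the variable names here are just illustrative, not from the starter code):

    history = 2  # n - 1 for a trigram model
    line = "The movie was great. I laughed a lot."
    tokens = ['<s>'] * history + line.split() + ['</s>']
    # tokens == ['<s>', '<s>', 'The', 'movie', 'was', 'great.', 'I', 'laughed', 'a', 'lot.', '</s>']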
More details are available in the starter code (which you are free to use, modify, or change in any way
you like). The outline of the assignment is as follows:
1. Implement make_all_contexts. Given a list of lists of tokens, this function generates (context, target)
tuples representing the preceding words and the target prediction word, respectively. See the documentation
within the starter code for more information; a rough sketch of the idea is given below.
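The following is only a sketch under assumptions (the function signature and variable names are not taken from the starter code; it assumes lines are already padded as shown above):

    def make_all_contexts_sketch(lines, history=2):
        """For each padded line, collect (context, target) pairs.

        lines: a list of lists of tokens (already padded with <s>/</s>).
        history: number of preceding words in each context.
        """
        pairs = []
        for tokens in lines:
            # Start at index `history` so every target has a full-length context.
            for i in range(history, len(tokens)):
                context = tokens[i - history:i]
                target = tokens[i]
                pairs.append((context, target))
        return pairs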
2. Implement generate_context_batches. This function is a Python generator that yields context/target
pairs (X, y) for your model. See the documentation within the starter code for more specific info about
the form of X and y, but the main idea is that X is the input to your neural network and gives
the indices of the context words. Similarly, y is the target output of your neural network, and gives a
one-hot encoding of the correct target word for prediction purposes. A minimal sketch follows below.
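A minimal sketch of the batching idea (the word_to_idx mapping, batch size, and exact shapes are assumptions; defer to the starter code's documentation for the real interface):

    import numpy as np

    def generate_context_batches_sketch(pairs, word_to_idx, batch_size=128):
        """Yield (X, y) batches: X holds context word indices,
        y holds one-hot encodings of the target words."""
        vocab_size = len(word_to_idx)
        while True:  # Keras generators are expected to loop indefinitely
            for start in range(0, len(pairs), batch_size):
                batch = pairs[start:start + batch_size]
                X = np.array([[word_to_idx[w] for w in context]
                              for context, _ in batch])
                y = np.zeros((len(batch), vocab_size))
                for row, (_, target) in enumerate(batch):
                    y[row, word_to_idx[target]] = 1.0
                yield X, y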
3. Implement a feedforward neural n-gram language model with one hidden layer in Keras. The documentation for each layer is available online, and we have seen some of these layers used in class demos
before. This model should consist of, in order:
(a) a keras.layers.Embedding layer (the embeddings should be 300 dimensional)
(b) a keras.layers.Flatten layer (this concatenates all of the word embeddings from the context
into a single vector)
(c) a keras.layers.Dense layer (this should map your concatenated word embedding to 100 dimensions; I'd recommend adding a relu activation to this layer)
(d) a keras.layers.BatchNormalization layer (we didn't talk about this layer in class, but the
idea of "Batch Normalization" is to transform activations within a neural network to have zero
mean and unit variance. You can read more about it here: https://en.wikipedia.org/wiki/Batch_normalization. Adding it speeds up training considerably.)
(e) a keras.layers.Dense layer (this should map your batch normalized activations to the size of the
vocabulary; to turn these activations into probabilities for each word, you should add a softmax
activation.)
Compile your model with the categorical crossentropy loss and the adam optimizer. Use model.summary()
to print out a summary of your model, and ensure that the inputs and outputs of each layer make
sense. A sketch of one way to assemble this model is given below.
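One possible way to assemble such a model (a sketch only; vocab_size and history are placeholder values, and the starter code may organize the construction differently):

    from keras.models import Sequential
    from keras.layers import Embedding, Flatten, Dense, BatchNormalization

    vocab_size = 25000   # placeholder: size of your vocabulary
    history = 2          # placeholder: number of context words

    model = Sequential()
    # Maps each of the `history` context word indices to a 300-d embedding.
    model.add(Embedding(vocab_size, 300, input_length=history))
    # Concatenates the context embeddings into one (history * 300)-d vector.
    model.add(Flatten())
    # Hidden layer projecting the concatenated embeddings to 100 dimensions.
    model.add(Dense(100, activation='relu'))
    # Normalizes activations to zero mean / unit variance; speeds up training.
    model.add(BatchNormalization())
    # Output distribution over the vocabulary.
    model.add(Dense(vocab_size, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam')
    model.summary()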
4. Implement compute_perplexity. This function is quite similar to your perplexity computation function from the first assignment, but instead of estimating conditional probabilities from counts in a
table, you should use model.predict, which gives your model's predictions for the probability of a
target word, given some context.
More specifically, this function should loop over a sentence and sum the log probability of each
token in sequence. If args.history is 2 and the list of tokens was derived from the previously introduced example, your function should compute log(P(the | <s>, <s>)) + log(P(movie | the, <s>)) +
log(P(was | movie, the)) + ... + log(P(</s> | lot., a)). Each conditional probability should be computed via a forward pass of your neural network using model.predict. For numerical
stability purposes, you should add epsilon to each of the probabilities output by your model and then
re-normalize the resulting distribution so it is a true probability distribution. (Sometimes the neural network can get over-confident in its predictions and assign exactly zero probability to a word; adding
a small epsilon to each estimated probability and then re-normalizing prevents this annoying
case.) Given these summed log probabilities, you should be able to compute perplexity, which should
be returned. A rough sketch of this computation is given below.
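A rough sketch of the per-sequence computation (the function name, epsilon value, and word_to_idx mapping are assumptions; your actual code should follow the starter code's interface):

    import numpy as np

    def compute_perplexity_sketch(model, tokens, word_to_idx, history=2, eps=1e-10):
        """Compute the perplexity of one padded token sequence under the model."""
        total_log_prob = 0.0
        n_predictions = 0
        for i in range(history, len(tokens)):
            context = [word_to_idx[w] for w in tokens[i - history:i]]
            probs = model.predict(np.array([context]))[0]
            # Add epsilon and re-normalize so no word gets exactly zero probability.
            probs = (probs + eps) / np.sum(probs + eps)
            total_log_prob += np.log(probs[word_to_idx[tokens[i]]])
            n_predictions += 1
        # Perplexity = exp(-average log probability).
        return float(np.exp(-total_log_prob / n_predictions))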
5. Run an experiment training your neural language model. I’ve written the skeleton code for a training
loop that prints out perplexity every training epoch (where an epoch is defined to be 5000 parameter
updates). Run for 30 epochs, saving the per-epoch validation perplexities. Each epoch for me takes
around one minute or so, so you may need to leave your computer running while you do this (suggestions: maybe grab dinner, or hang out with your friends while your machine learns to model language!
One “pro” of machine learning experiments is that you have down-time to do other things :) ).
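The skeleton already implements this, but as a sketch of the general shape (train_generator, val_lines, word_to_idx, and the perplexity helper are assumptions carried over from the sketches above):

    import numpy as np

    # Sketch of a training loop similar to the provided skeleton (names are assumptions).
    val_perplexities = []
    for epoch in range(30):
        # One "epoch" here is 5000 parameter updates (batches), not a full pass over the data.
        model.fit_generator(train_generator, steps_per_epoch=5000, epochs=1, verbose=1)
        val_ppl = np.mean([compute_perplexity_sketch(model, toks, word_to_idx, history)
                           for toks in val_lines])
        val_perplexities.append(val_ppl)
        print('epoch {}: validation perplexity = {:.2f}'.format(epoch + 1, val_ppl))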
6. One strategy to improve the performance of your language model is to initialize the weights in the
keras.layers.Embedding layer with word2vec weights. Implement load_word2vec_init, a function
that returns a vocab-by-300 dimensional matrix to serve as the initializer for your embedding layer (see
the documentation in the starter code for more info). Implement functionality for the command line
argument variable args.word2vec_init, i.e., if it is set to zero, you don't initialize your embedding
layer's weights, and if it is set to one, you do. A sketch of one way to build the matrix is given below.
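A sketch of one way to build this matrix with gensim (the filename, word_to_idx mapping, and out-of-vocabulary handling are assumptions; check the starter code for the expected interface):

    import numpy as np
    from gensim.models import KeyedVectors

    def load_word2vec_init_sketch(word_to_idx, path='GoogleNews-vectors-negative300.bin'):
        """Return a (vocab_size x 300) matrix of initial embedding weights."""
        w2v = KeyedVectors.load_word2vec_format(path, binary=True)
        # Words missing from word2vec keep a small random initialization.
        init = np.random.normal(scale=0.01, size=(len(word_to_idx), 300))
        for word, idx in word_to_idx.items():
            if word in w2v:
                init[idx] = w2v[word]
        return init

One common pattern in Keras 2.x is to pass such a matrix to the layer constructor, e.g. Embedding(vocab_size, 300, input_length=history, weights=[init]); your starter code may expect a different mechanism, so follow its documentation.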
7. Run the same experiment, except with --word2vec_init 1, i.e., with the Embedding layer initialized
with the pretrained word2vec weights. Run for 30 epochs, saving the per-epoch validation perplexities.
8. Compare the two experiments you ran by plotting epoch versus validation perplexity. Did initializing
the keras.layers.Embedding layer with word2vec weights help? For reference, my implementation of an n-gram language model from HW1 gets, at best, 45.20 perplexity. I was able to achieve
better perplexity than that within 6 or 7 epochs using a history-4 feedforward neural language model as described above. A minimal plotting sketch is given below.
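If you'd like a starting point for the plot, a minimal matplotlib sketch (the two list names holding per-epoch validation perplexities are assumptions):

    import matplotlib.pyplot as plt

    epochs = range(1, 31)
    plt.plot(epochs, val_ppl_random_init, label='random init')
    plt.plot(epochs, val_ppl_word2vec_init, label='word2vec init')
    plt.xlabel('epoch')
    plt.ylabel('validation perplexity')
    plt.legend()
    plt.savefig('perplexity_comparison.png')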
9. Optionally, run a third experiment with args.history set to 10, and add this line to your plot. Did
incorporating more history help?
10. Optionally, write a function that generates sequences of output text using model.predict. This
function should predict a word by taking the most likely next word according to the model, and feed
that word back into its own context to predict the next word. Repeating this process in a loop should allow
you to generate a sequence of words using your language model. A greedy-decoding sketch is given below.
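For instance, a greedy-decoding sketch (idx_to_word, word_to_idx, the starting context, and the stopping condition are all assumptions):

    import numpy as np

    def generate_text_sketch(model, word_to_idx, idx_to_word, history=2, max_len=30):
        """Greedily generate a token sequence, starting from an all-<s> context."""
        context = ['<s>'] * history
        generated = []
        for _ in range(max_len):
            X = np.array([[word_to_idx[w] for w in context]])
            probs = model.predict(X)[0]
            next_word = idx_to_word[int(np.argmax(probs))]
            if next_word == '</s>':
                break
            generated.append(next_word)
            # Feed the predicted word back into the context for the next step.
            context = context[1:] + [next_word]
        return ' '.join(generated)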