CS447 - Assignment 2
In this part of Assignment 2 we'll build a machine learning model to detect the sentiment of movie reviews using the Stanford Sentiment Treebank ([SST](http://ai.stanford.edu/~amaas/data/sentiment/)) dataset. First we will import all the required libraries. We highly recommend that you finish the PyTorch Tutorials 1, 2, 3 before starting this assignment.
After finishing this assignment you will be able to answer the following questions:
How to write Dataloaders in PyTorch?
How to build dictionaries and vocabularies for Deep Nets?
How to use Embedding Layers in PyTorch?
How to use a Convolutional Neural Network for sentiment analysis?
How to build various recurrent models (LSTMs and GRUs) for sentiment analysis?
How to use packed_padded_sequences for sequential models?
Please make sure that you have selected "GPU" as the Hardware accelerator from Runtime -> Change runtime type.
Import Libraries
# Don't import any other libraries
from collections import defaultdict
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
import torch.optim as optim
from torchtext import data, datasets
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if __name__=='__main__':
print('Using device:', device)
Using device: cpu
Download dataset
First we will download the dataset using torchtext, a package that provides NLP datasets and utilities for PyTorch. The following command will give you three objects: train_data, val_data and test_data. To access the data:
To access the list of textual tokens - train_data[0].text
To access label - train_data[0].label
if __name__=='__main__':
train_data, val_data, test_data = datasets.SST.splits(data.Field(tokenize = 'spacy'), data.LabelField(dtype = torch.float), filter_pred=lambda ex: ex.label != 'neutral')
print('{:d} train and {:d} test samples'.format(len(train_data), len(test_data)))
print('Sample text:', train_data[0].text)
print('Sample label:', train_data[0].label)
downloading trainDevTestTrees_PTB.zip
trainDevTestTrees_PTB.zip: 100%|██████████| 790k/790k [00:01<00:00, 743kB/s]
extracting
6920 train and 1821 test samples
Sample text: ['The', 'Rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'Century', "'s", 'new', '`', '`', 'Conan', "''", 'and', 'that', 'he', "'s", 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'Arnold', 'Schwarzenegger', ',', 'Jean', '-', 'Claud', 'Van', 'Damme', 'or', 'Steven', 'Segal', '.']
Sample label: positive
1. Define the Dataset Class (4 points)
In the following cell, we will define the dataset class. You need to implement the following functions:
build_dictionary() - creates the dictionaries ixtoword and wordtoix. It converts the text of every example into a list of word ids and stores them in textual_ids. If a word is not present in your dictionary, it should be mapped to <unk>. Use the hyperparameter THRESHOLD to control which words appear in the dictionary, based on their frequency in the training data: a word's frequency must be >= THRESHOLD for it to be included. Also make sure that <end> is at idx 0 and <unk> is at idx 1
get_label() - This function should return the value 1 if the label in the dataset is positive, and should return 0 if it is negative. The data type for the returned item should be torch.LongTensor
get_text() - This function should pad the review with the <end> token up to a length of MAX_LEN if the text is shorter than MAX_LEN. If the length is more than MAX_LEN, it should return only the first MAX_LEN words. This function should also return the original length of the review. The data type for the returned items should be torch.LongTensor. Note that the text returned is a list of indices of the words from your wordtoix mapping. (A toy sketch of this pad/truncate behaviour follows the note below.)
__len__() - This function should return the total length (int value) of the dataset i.e. the number of sentences
__getitem__() - This function should return the padded text, the length of the text (without the padding) and the label. The data type for all the returned items should be torch.LongTensor. You will use the get_label() and get_text() functions here
NOTE: Don't forget to convert all text to lowercase!
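For intuition, the pad/truncate behaviour expected from get_text() can be sketched on a plain Python list. The values below are made up purely for illustration and are not part of the assignment code:
if __name__=='__main__':
    # Toy illustration of the expected behaviour: pad with <end> (idx 0) up to MAX_LEN, or truncate
    toy_ids = [5, 7, 9]                                  # word indices for a 3-word review
    toy_max_len = 6                                      # stand-in for MAX_LEN
    end_idx = 0                                          # <end> is at idx 0
    padded = (toy_ids + [end_idx] * toy_max_len)[:toy_max_len]
    print(padded, len(toy_ids))                          # [5, 7, 9, 0, 0, 0] 3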
THRESHOLD = 10
MAX_LEN = 60
UNK = '<unk>'
END = '<end>'
class TextDataset(data.Dataset):
def __init__(self, examples, split, ixtoword=None, wordtoix=None, THRESHOLD=THRESHOLD, MAX_LEN=MAX_LEN):
self.examples = examples
self.split = split
self.ixtoword = ixtoword
self.wordtoix = wordtoix
self.THRESHOLD = THRESHOLD
self.MAX_LEN = MAX_LEN
self.build_dictionary()
self.vocab_size = len(self.ixtoword)
self.textual_ids = []
self.labels = []
##### TODO #####
# textual_ids contains list of word ids as per wordtoix for all sentences
# Replace words out of vocabulary with id of UNK token
# labels is a list of integer labels (0 or 1) for all sentences
for sentences in self.examples:
seq = []
for word in sentences.text:
if word.lower() in self.wordtoix:  # lowercase to match the keys stored by build_dictionary()
seq.append(self.wordtoix[word.lower()])
else:
seq.append(self.wordtoix[UNK])
# seq.append(self.wordtoix[END])
self.textual_ids.append(seq)
self.labels.append(sentences.label)
def build_dictionary(self):
# This is built only from train dataset and then reused in test dataset
# by passing ixtoword and wordtoix from train dataset to __init__() when creating test dataset
# which is done under 'Initialize the Dataloader' section
if self.split.lower() != 'train':
return
# END should be at idx 0. UNK should be at idx 1
self.ixtoword = {0:END, 1:UNK}
self.wordtoix = {END:0, UNK:1}
##### TODO #####
# Count the frequencies of all words in the training data (self.examples)
# Assign idx (starting from 2) to all words having word_freq >= THRESHOLD
idx = 2
cnt_word = defaultdict(int)
for sentences in self.examples:
for word in sentences.text:
cnt_word[word.lower()] += 1
for key in cnt_word.keys():
if cnt_word[key] >= self.THRESHOLD:
self.ixtoword[idx] = key
self.wordtoix[key] = idx
idx += 1
return
def get_label(self, index):
##### TODO #####
# This function should return the value 1 if the label is positive, and 0 if it is negative for sentence at index `index`
# The data type for the returned item should be torch.LongTensor
if self.labels[index] == 'positive':
label = 1
else:
label = 0
label = torch.LongTensor([label])
label = torch.squeeze(label)
return label
def get_text(self, index):
##### TODO #####
# This function should pad the text with the END token up to a length of MAX_LEN if the length of the text is less than MAX_LEN
# If the length is more than MAX_LEN then only return the first MAX_LEN words
# This function should also return the original length of the review
# The data type for the returned items should be torch.LongTensor
# Note that the text returned is a list of indices of the words from your wordtoix mapping
seq = self.textual_ids[index]
text = []
text_len = torch.LongTensor([len(seq)])
text_len = torch.squeeze(text_len)
for idx in range(self.MAX_LEN):
if idx >= len(seq):
text.append(0)
else:
text.append(seq[idx])
text = torch.LongTensor(text)
return text, text_len
def __len__(self):
##### TODO #####
# This function should return the number of sentences (int value) in the dataset
return len(self.examples)
def __getitem__(self, index):
text, text_len = self.get_text(index)
label = self.get_label(index)
return text, text_len, label
if __name__=='__main__':
# Sample item
Ds = TextDataset(train_data, 'train')
print('vocab_size:', Ds.vocab_size)
text, text_len, label = Ds[0]
print('text:', text)
print('text_len:', text_len)
print('label:', label)
vocab_size: 1469
text: tensor([ 1, 1, 4, 1, 5, 6, 2, 1, 1, 8, 9, 10, 10, 1, 11, 12, 13, 14,
8, 15, 5, 16, 17, 1, 18, 1, 19, 1, 1, 20, 1, 21, 1, 1, 1, 22,
1, 1, 24, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0])
text_len: tensor(39)
label: tensor(1)
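As an optional sanity check (not required for grading), you can decode the sample back into tokens with ixtoword and compare it against the raw text printed earlier:
if __name__=='__main__':
    # Map the first text_len indices back to words; out-of-vocabulary tokens show up as '<unk>'
    decoded = [Ds.ixtoword[int(ix)] for ix in text[:int(text_len)]]
    print(decoded[:10])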
Some helper functions
##### Do not modify this
def count_parameters(model):
"""
Count number of trainable parameters in the model
"""
return sum(p.numel() for p in model.parameters() if p.requires_grad)
def accuracy(output, labels):
"""
Returns accuracy per batch
output: Tensor [batch_size, n_classes]
labels: LongTensor [batch_size]
"""
preds = output.argmax(dim=1) # find predicted class
correct = (preds == labels).sum().float() # convert into float for division
acc = correct / len(labels)
return acc
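To see what accuracy() expects, here is a tiny hand-made example; the logits and labels are invented just for this illustration:
if __name__=='__main__':
    # argmax over the class dimension gives predictions [1, 0, 1, 1]; 3 of 4 match the labels
    toy_output = torch.tensor([[0.1, 0.9], [2.0, -1.0], [0.0, 0.3], [-0.5, 0.5]])
    toy_labels = torch.LongTensor([1, 0, 0, 1])
    print(accuracy(toy_output, toy_labels))   # tensor(0.7500)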
Train your Model
##### Do not modify this
def train_model(model, num_epochs, data_loader, optimizer, criterion):
print('Training Model...')
model.train()
for epoch in range(num_epochs):
epoch_loss = 0
epoch_acc = 0
for texts, text_lens, labels in data_loader:
texts = texts.to(device) # shape: [batch_size, MAX_LEN]
text_lens = text_lens.to(device) # shape: [batch_size]
labels = labels.to(device) # shape: [batch_size]
optimizer.zero_grad()
output = model(texts, text_lens)
acc = accuracy(output, labels)
loss = criterion(output, labels)
loss.backward()
optimizer.step()
epoch_loss += loss.item()
epoch_acc += acc.item()
print('[TRAIN]\t Epoch: {:2d}\t Loss: {:.4f}\t Accuracy: {:.2f}%'.format(epoch+1, epoch_loss/len(data_loader), 100*epoch_acc/len(data_loader)))
print('Model Trained!\n')
Evaluate your Model
##### Do not modify this
def evaluate(model, data_loader, criterion):
print('Evaluating performance on Test dataset...')
model.eval()
epoch_loss = 0
epoch_acc = 0
all_predictions = []
for texts, text_lens, labels in data_loader:
texts = texts.to(device)
text_lens = text_lens.to(device)
labels = labels.to(device)
output = model(texts, text_lens)
acc = accuracy(output, labels)
all_predictions.append(output.argmax(dim=1))
loss = criterion(output, labels)
epoch_loss += loss.item()
epoch_acc += acc.item()
print('[TEST]\t Loss: {:.4f}\t Accuracy: {:.2f}%'.format(epoch_loss/len(data_loader), 100*epoch_acc/len(data_loader)))
predictions = torch.cat(all_predictions)
return predictions
2. Build your Convolutional Neural Network Model (3 points)
In the following cell we provide the class for building your model, along with some parameters that we expect you to use in its initialization. Do not change these parameters.
class CNN(nn.Module):
def __init__(self, vocab_size, embed_size, out_channels, filter_heights, stride, num_classes, dropout, pad_idx):
super(CNN, self).__init__()
##### TODO #####
# Create an embedding layer (https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)
# to represent the words in your vocabulary. You can vary the dimensionality of the embedding
self.embedding = nn.Embedding(vocab_size, embed_size)
# Define multiple Convolution layers (nn.Conv2d) with filter (kernel) size [filter_height, embed_size] based on your different filter_heights.
# Input channels will be 1 and output channels will be `out_channels` (these many different filters will be trained for each convolution layer)
# If you want, you can have a list of modules inside nn.ModuleList
# Note that even though your conv layers are nn.Conv2d, we are doing a 1d convolution since we are only moving the filter in one direction
#
# You can vary the number of output channels, filter heights, and stride
self.conv1 = nn.Conv2d(1, out_channels, (filter_heights[0], embed_size), stride=stride)
self.conv2 = nn.Conv2d(1, out_channels, (filter_heights[1], embed_size), stride=stride)
self.conv3 = nn.Conv2d(1, out_channels, (filter_heights[2], embed_size), stride=stride)
# Define a linear layer (nn.Linear) that consists of num_classes (2 in our case) units
# and takes as input the concatenated output for all cnn layers (out_channels * num_of_cnn_layers units)
self.linear = nn.Linear(out_channels * 3, num_classes)
def forward(self, texts, text_lens):
"""
texts: LongTensor [batch_size, MAX_LEN]
text_lens: LongTensor [batch_size] - you might not even need to use this
Returns output: Tensor [batch_size, num_classes]
"""
##### TODO #####
# Pass texts through your embedding layer to convert from word ids to word embeddings
# texts: [batch_size, MAX_LEN, embed_size]
em = self.embedding(texts)
# input to conv should have 1 channel. Take a look at torch's unsqueeze() function
# texts [batch_size, 1, MAX_LEN, embed_size]
x1 = self.conv1(em.unsqueeze(1))
x2 = self.conv2(em.unsqueeze(1))
x3 = self.conv3(em.unsqueeze(1))
# Pass these texts to each of your cnn and compute their output as follows:
# Your cnn output will have shape [batch_size, out_channels, *, 1] where * depends on filter_height and stride
# Convert to shape [batch_size, out_channels, *] (see torch's squeeze() function)
x1 = x1.squeeze(3)
x2 = x2.squeeze(3)
x3 = x3.squeeze(3)
# Apply non-linearity on it (F.relu() is a commonly used one. Feel free to try others)
x1 = F.relu(x1)
x2 = F.relu(x2)
x3 = F.relu(x3)
# Take the max value across last dimension to have shape [batch_size, out_channels]
x1 = torch.max(x1, 2).values
x2 = torch.max(x2, 2).values
x3 = torch.max(x3, 2).values
# Concatenate (torch.cat) outputs from all your cnns [batch_size, (out_channels*num_of_cnn_layers)]
#
x = torch.cat((x1,x2,x3), 1)
# Let's understand what you just did:
# Since each cnn is of different filter_height, it will look at different number of words at a time
# So, a filter_height of 3 means your cnn looks at 3 words (3-grams) at a time and tries to extract some information from it
# Each cnn will learn `out_channels` number of different features from the words it sees at a time
# Then you applied a non-linearity and took the max value for all channels
# You are essentially trying to find important n-grams from the entire text
# Everything happens on a batch simultaneously hence you have that additional batch_size as the first dimension
# optionally apply a dropout if you want to (You will have to initialize an nn.Dropout layer in __init__)
# Pass your concatenated output through your linear layer and return its output ([batch_size, num_classes])
##### NOTE: Do not apply a sigmoid or softmax to the final output - done in evaluation method!
output = self.linear(x)
return output
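Before training, it can help to sanity-check the forward pass on random inputs. The tiny sizes below are chosen arbitrarily for this check and are unrelated to the hyperparameters used later:
if __name__=='__main__':
    # A throwaway CNN on random word ids: the output should have shape [batch_size, num_classes]
    toy_cnn = CNN(vocab_size=100, embed_size=8, out_channels=4, filter_heights=[1, 3, 5],
                  stride=1, num_classes=2, dropout=0, pad_idx=0)
    toy_texts = torch.randint(0, 100, (5, MAX_LEN))    # batch of 5 fake reviews
    toy_lens = torch.LongTensor([MAX_LEN] * 5)
    print(toy_cnn(toy_texts, toy_lens).shape)          # torch.Size([5, 2])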
Initialize the Dataloader
We initialize the training and testing dataloaders using the Dataset class defined above for both training and testing. Make sure you use the same vocabulary for both datasets.
if __name__=='__main__':
BATCH_SIZE = 64 # Feel free to try other batch sizes
##### Do not modify this
Ds = TextDataset(train_data, 'train')
train_loader = torch.utils.data.DataLoader(Ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=4, drop_last=True)
test_Ds = TextDataset(test_data, 'test', Ds.ixtoword, Ds.wordtoix)
test_loader = torch.utils.data.DataLoader(test_Ds, batch_size=1, shuffle=False, num_workers=1, drop_last=False)
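If you want to verify the loader output before training, one batch can be inspected like this (optional):
if __name__=='__main__':
    # One training batch: texts [BATCH_SIZE, MAX_LEN], lengths [BATCH_SIZE], labels [BATCH_SIZE]
    sample_texts, sample_lens, sample_labels = next(iter(train_loader))
    print(sample_texts.shape, sample_lens.shape, sample_labels.shape)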
Training and Evaluation for CNN Model
We first train your model using the training data. Feel free to play around with the hyperparameters. We recommend you write code to save your model (save/load model tutorial), as Colab connections are not permanent and it can get messy if you have to train your model again and again.
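If you do decide to checkpoint, a minimal sketch with torch.save/torch.load looks like the commented lines below; the filename is just an example, and the calls belong after train_model further down:
# Optional checkpointing sketch (illustrative filename); run after training the model below:
#   torch.save(model.state_dict(), 'sentiment_cnn.pt')
#   model.load_state_dict(torch.load('sentiment_cnn.pt', map_location=device))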
if __name__=='__main__':
##### Do not modify this
VOCAB_SIZE = Ds.vocab_size
NUM_CLASSES = 2
PAD_IDX = 0
# Hyperparameters (Feel free to play around with these)
EMBEDDING_DIM = 1024
DROPOUT = 0
OUT_CHANNELS = 64
FILTER_HEIGHTS = [1, 3, 5] # 3 different filter sizes - unigrams, trigrams and 5-grams in this case. Feel free to try other n-grams as well
STRIDE = 1
model = CNN(VOCAB_SIZE, EMBEDDING_DIM, OUT_CHANNELS, FILTER_HEIGHTS, STRIDE, NUM_CLASSES, DROPOUT, PAD_IDX)
# put your model on device
model = model.to(device)
print('The model has {:,d} trainable parameters'.format(count_parameters(model)))
The model has 2,094,658 trainable parameters
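As an optional sanity check, the printed count can be reproduced by hand from the layer sizes chosen above (embedding + three conv layers + linear head):
if __name__=='__main__':
    # Recompute the trainable parameter count from the hyperparameters above
    embed_params = VOCAB_SIZE * EMBEDDING_DIM
    conv_params = sum(OUT_CHANNELS * h * EMBEDDING_DIM + OUT_CHANNELS for h in FILTER_HEIGHTS)
    linear_params = OUT_CHANNELS * len(FILTER_HEIGHTS) * NUM_CLASSES + NUM_CLASSES
    print(embed_params + conv_params + linear_params)   # 2094658, matching count_parameters(model)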
if __name__=='__main__':
LEARNING_RATE = 5e-4 # Feel free to try other learning rates
# Define your loss function
criterion = nn.CrossEntropyLoss().to(device)
# Define your optimizer
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
if __name__=='__main__':
N_EPOCHS = 8 # Feel free to change this
# train model for N_EPOCHS epochs
train_model(model, N_EPOCHS, train_loader, optimizer, criterion)
Training Model...
[TRAIN] Epoch: 1 Loss: 0.6514 Accuracy: 62.43%
[TRAIN] Epoch: 2 Loss: 0.4528 Accuracy: 79.15%
[TRAIN] Epoch: 3 Loss: 0.2866 Accuracy: 89.92%
[TRAIN] Epoch: 4 Loss: 0.1559 Accuracy: 96.17%
[TRAIN] Epoch: 5 Loss: 0.0818 Accuracy: 98.61%
[TRAIN] Epoch: 6 Loss: 0.0479 Accuracy: 99.19%
[TRAIN] Epoch: 7 Loss: 0.0305 Accuracy: 99.36%
[TRAIN] Epoch: 8 Loss: 0.0233 Accuracy: 99.45%
Model Trained!
##### Do not modify this
if __name__=='__main__':
# Compute test data accuracy
predictions_cnn = evaluate(model, test_loader, criterion)
# Convert tensor to numpy array
# This will be saved to your Google Drive below and you will be submitting this file to gradescope
predictions_cnn = predictions_cnn.cpu().data.detach().numpy()
Evaluating performance on Test dataset...
[TEST] Loss: 0.6267 Accuracy: 76.17%
3. Build your Recurrent Neural Network Model (3 points)
In the following cell we provide the class for building your model, along with some parameters that we expect you to use in the initialization of your sequential model. Do not change these parameters.
class RNN(nn.Module):
def __init__(self, vocab_size, embed_size, hidden_size, num_classes, num_layers, bidirectional, dropout, pad_idx):
super(RNN, self).__init__()
self.hidden_size = hidden_size
self.num_layers = num_layers
##### TODO #####
# Create an embedding layer (https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)
# to represent the words in your vocabulary. You can vary the dimensionality of the embedding
self.embedding = nn.Embedding(vocab_size, embed_size)
# Create a recurrent network (nn.LSTM or nn.GRU) with batch_first = False
# You can vary the number of hidden units, directions, layers, and dropout
self.rnn = nn.LSTM(embed_size, self.hidden_size, self.num_layers, bidirectional=bidirectional)
# Define a linear layer (nn.Linear) that consists of num_classes (2 in our case) units
# and takes as input the output of the last timestep (in the bidirectional case: the output of the last timestep
# of the forward direction, concatenated with the output of the last timestep of the backward direction)
self.linear = nn.Linear(self.hidden_size*2, num_classes)
def forward(self, texts, text_lens):
"""
texts: LongTensor [batch_size, MAX_LEN]
text_lens: LongTensor [batch_size]
Returns output: Tensor [batch_size, num_classes]
"""
##### TODO #####
# permute texts so that the sequence length is the first dimension
# texts: [MAX_LEN, batch_size]
texts = texts.T
# Pass texts through your embedding layer to convert from word ids to word embeddings
# texts: [MAX_LEN, batch_size, embed_size]
em = self.embedding(texts)
# Pack texts into PackedSequence using nn.utils.rnn.pack_padded_sequence
# use the true (unpadded) lengths, capped at MAX_LEN; the batch is not sorted by length
em = nn.utils.rnn.pack_padded_sequence(em, text_lens.clamp(max=texts.shape[0]).cpu(), enforce_sorted=False)
# Pass the pack through your recurrent network
output, (h_n, c_n) = self.rnn(em)
# print(h_n.view(self.num_layers, 2, text_lens.shape[0], self.hidden_size))
h_n = h_n.view(self.num_layers, 2, text_lens.shape[0], self.hidden_size)
h_n_forward = h_n[-1, 0]
h_n_backward = h_n[-1, 1]
# Take output of the last timestep of the last layer for all directions and concatenate them (see torch.cat())
# depends on whether your model is bidirectional
# Your concatenated output will have shape [batch_size, num_dirs*hidden_size]
lin_in = torch.cat((h_n_forward, h_n_backward), 1)
# optionally apply a dropout if you want to (You will have to initialize an nn.Dropout layer in __init__)
# Pass your concatenated output through your linear layer and return its output ([batch_size, num_classes])
output = self.linear(lin_in)
##### NOTE: Do not apply a sigmoid or softmax to the final output - done in evaluation method!
return output
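If packing is new to you, this small, self-contained example (toy numbers, sequence-first layout) shows what pack_padded_sequence does and how pad_packed_sequence undoes it:
if __name__=='__main__':
    # Two sequences of true lengths 3 and 2, already padded to max_len 3, embedding size 4
    toy_batch = torch.randn(3, 2, 4)                              # [max_len, batch, embed]
    packed = pack_padded_sequence(toy_batch, torch.tensor([3, 2]), enforce_sorted=False)
    print(packed.batch_sizes)                                     # tensor([2, 2, 1]) - the padded step is dropped
    unpacked, lens = pad_packed_sequence(packed)
    print(unpacked.shape, lens)                                   # torch.Size([3, 2, 4]) tensor([3, 2])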
Initialize the Dataloader
We initialize the training and testing dataloaders using the Dataset class defined above for both training and testing. Make sure you use the same vocabulary for both datasets.
if __name__=='__main__':
BATCH_SIZE = 32 # Feel free to try other batch sizes
##### Do not modify this
Ds = TextDataset(train_data, 'train')
train_loader = torch.utils.data.DataLoader(Ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=4, drop_last=True)
test_Ds = TextDataset(test_data, 'test', Ds.ixtoword, Ds.wordtoix)
test_loader = torch.utils.data.DataLoader(test_Ds, batch_size=1, shuffle=False, num_workers=1, drop_last=False)
Training and Evaluation for Sequential Model
We first train your model using the training data. Feel free to play around with the hyperparameters. We recommend you write code to save your model (save/load model tutorial), as Colab connections are not permanent and it can get messy if you have to train your model again and again.
if __name__=='__main__':
##### Do not modify this
VOCAB_SIZE = Ds.vocab_size
NUM_CLASSES = 2
PAD_IDX = 0
# Hyperparameters (Feel free to play around with these)
EMBEDDING_DIM = 1024
DROPOUT = 0
BIDIRECTIONAL = True
HIDDEN_DIM = 256
N_LAYERS = 4
model = RNN(VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_DIM, NUM_CLASSES, N_LAYERS, BIDIRECTIONAL, DROPOUT, PAD_IDX)
# put your model on device
model = model.to(device)
print('The model has {:,d} trainable parameters'.format(count_parameters(model)))
The model has 8,861,698 trainable parameters
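Most of these parameters sit in the LSTM: each direction of each layer has 4*(input_size*hidden_size + hidden_size^2 + 2*hidden_size) weights, and layers after the first take the concatenated bidirectional output (2*hidden_size) as input. As an optional check, this reproduces the printed total:
if __name__=='__main__':
    # Recompute the parameter count: embedding + bidirectional multi-layer LSTM + linear head
    def lstm_dir_params(in_size, hid):                  # one direction of one LSTM layer
        return 4 * (in_size * hid + hid * hid + 2 * hid)
    embed_params = VOCAB_SIZE * EMBEDDING_DIM
    lstm_params = 2 * lstm_dir_params(EMBEDDING_DIM, HIDDEN_DIM)
    lstm_params += 2 * (N_LAYERS - 1) * lstm_dir_params(2 * HIDDEN_DIM, HIDDEN_DIM)
    linear_params = 2 * HIDDEN_DIM * NUM_CLASSES + NUM_CLASSES
    print(embed_params + lstm_params + linear_params)   # 8861698, matching count_parameters(model)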
if __name__=='__main__':
LEARNING_RATE = 5e-4 # Feel free to try other learning rates
# Define your loss function
criterion = nn.CrossEntropyLoss().to(device)
# Define your optimizer
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
if __name__=='__main__':
N_EPOCHS = 10 # Feel free to change this
# train model for N_EPOCHS epochs
train_model(model, N_EPOCHS, train_loader, optimizer, criterion)
Training Model...
[TRAIN] Epoch: 1 Loss: 0.6592 Accuracy: 60.26%
[TRAIN] Epoch: 2 Loss: 0.4876 Accuracy: 77.45%
[TRAIN] Epoch: 3 Loss: 0.3413 Accuracy: 86.08%
[TRAIN] Epoch: 4 Loss: 0.2278 Accuracy: 91.35%
[TRAIN] Epoch: 5 Loss: 0.1503 Accuracy: 94.44%
[TRAIN] Epoch: 6 Loss: 0.0824 Accuracy: 97.14%
[TRAIN] Epoch: 7 Loss: 0.0657 Accuracy: 97.82%
[TRAIN] Epoch: 8 Loss: 0.0560 Accuracy: 97.93%
[TRAIN] Epoch: 9 Loss: 0.0354 Accuracy: 98.73%
[TRAIN] Epoch: 10 Loss: 0.0339 Accuracy: 98.87%
Model Trained!
##### Do not modify this
if __name__=='__main__':
# Compute test data accuracy
predictions_rnn = evaluate(model, test_loader, criterion)
# Convert tensor to numpy array
# This will be saved to your Google Drive below and you will be submitting this file to gradescope
predictions_rnn = predictions_rnn.cpu().data.detach().numpy()
Evaluating performance on Test dataset...
[TEST] Loss: 1.1929 Accuracy: 75.29%
Saving test results to your Google Drive for submission.
You will save the predictions_rnn.txt and predictions_cnn.txt with your test data results. Make sure you do not shuffle the order of the test_data or the autograder will give you a bad score.
You will submit the following files to the autograder on Gradescope:
Your predictions_cnn.txt of test data results
Your predictions_rnn.txt of test data results
Your code for this notebook. You can download it by clicking File -> Download .py - make sure the name of the downloaded file is HW2.py
##### Do not modify this
if __name__=='__main__':
try:
from google.colab import drive
drive.mount('/content/drive')
except:
pass
np.savetxt('drive/My Drive/predictions_cnn.txt', predictions_cnn, delimiter=',')
np.savetxt('drive/My Drive/predictions_rnn.txt', predictions_rnn, delimiter=',')
print('Files saved successfully!')
Mounted at /content/drive
Files saved successfully!