Problem Set 1: Text Classification
In this problem set, you will build a system for automatically classifying reddit comments by subreddit. You will:
Do some basic text processing, tokenizing your input and converting it into a bag-of-words representation
Build a machine learning classifier based on a generative model, using Naive Bayes
Evaluate your classifiers and examine what they have learned
Build a machine learning classifier based on a discriminative model, using the perceptron
Build a more stable discriminative classifier, using the averaged perceptron
Build a logistic regression classifier (see the instructions in that section)
Implement techniques to improve your classifier
0. Setup
To complete this assignment, you will need the following libraries. Many of these ship with Anaconda, so installing Anaconda is a good starting point.
Python 2.7 (and not Python 3, although if somebody wants to try that and tell us what goes wrong, that would be appreciated...)
jupyter notebook
scipy
numpy (this is installed along with scipy above; if not, install it separately)
nltk (tested on NLTK 3.0.4)
matplotlib
nose (provides the nosetests command)
pandas
Here is some help on installing packages in python: https://packaging.python.org/installing/
I generally use pip install --user for everything.
About this assignment
This is a Jupyter notebook. You can execute cell blocks by pressing control-enter.
Most of your coding will be in the python source files in the directory gtnlplib.
To submit this assignment, run the script make-submission.sh, and submit the tarball pset1-submission.tgz on T-square.
Grading will be mostly based on automated unit testing.
The directory tests contains unit tests that are very similar to the tests that will be used to grade your assignment --- but not the same!
You should run these tests as you work on the assignment to see that you're on the right track. If you pass them in a non-adversarial way (i.e., you didn't write functions that target these tests directly), you will pass the tests that we use for grading.
Parts 1 and 2 are the foundation for the rest of the assignment. Don't even try to work on the later parts of the assignment until your code passes all tests for parts 1 and 2.
The same is largely true within each part: for example, do not move on to deliverable 4.2 until your code passes the test for deliverable 4.1.
You are free to look at the source code of the unit tests -- though most of the relevant code is also here in this notebook. Learn more about running unit tests at http://pythontesting.net/framework/nose/nose-introduction/
You may want to add more tests, but that is completely optional.
1. Preprocessing
Total: 1 point.
import pandas as pd
df_tr = pd.read_csv('reddit-train.csv',encoding='utf-8') # read the training data into a data frame
df_dv = pd.read_csv('reddit-dev.csv',encoding='utf-8') # read the dev data into a data frame
A dataframe is a structured representation of your data. You can preview your dataframes using head()
df_tr.head()
subreddit text
0 science Why don't you bother instead and respond prope...
1 iama AAIA's ACES is the database standard (http://w...
2 iama So many. I needed to perform a maintenance ru...
3 worldnews Ukraine used to be part of Russia in the Sovie...
4 worldnews This wiki article goes into how contentious th...
Your first task is to convert the text to a bag-of-words representation. There are three steps:
Break each input into sentences
Break each sentence into word "tokens"
Downcase each token, and add it to a Counter
You should use NLTK to complete the tokenization step, and collections.Counter for the bag of words representation. For more about NLTK tokenization, see http://www.nltk.org/book/ch03.html
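To see how the pieces fit together, here is a small demo of NLTK tokenization feeding a Counter. This is a minimal sketch of the general pattern, not necessarily how you should structure tokenize_and_downcase:
from collections import Counter
import nltk
# nltk.download('punkt') # uncomment once if the punkt tokenizer models are missing
text = u"I love NLP. NLP is fun!"
bow = Counter()
for sentence in nltk.sent_tokenize(text): # step 1: split into sentences
    for token in nltk.word_tokenize(sentence): # step 2: split into tokens
        bow[token.lower()] += 1 # step 3: downcase and count
print bow # e.g. Counter({u'nlp': 2, u'i': 1, u'love': 1, ...})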
Deliverable 1.1 Complete the function gtnlplib.preproc.tokenize_and_downcase. (0.5 points)
from gtnlplib import preproc
reload(preproc); # the trailing semicolon suppresses output
# when you edit gtnlplib/preproc.py, you will need to reload it into the notebook, using the line above
# this will not work until you implement it
y_tr,x_tr = preproc.read_data('reddit-train.csv', #filename
'subreddit', #label field
preprocessor=preproc.tokenize_and_downcase) #your preprocessor
y_dv,x_dv = preproc.read_data('reddit-dev.csv', #filename
'subreddit', #label field
preprocessor=preproc.tokenize_and_downcase) #your preprocessor
y_te,x_te = preproc.read_data('reddit-test.csv', #filename
'subreddit', #label field
preprocessor=preproc.tokenize_and_downcase) #your preprocessor
Each element in the list x_tr is a counter, which corresponds to a bag of words.
Each element in the list y_tr is a string, corresponding to a label.
i = 100
print 'ORIGINAL TEXT: ',df_tr.loc[i]['text']
print
print 'BAG OF WORDS: ',x_tr[i]
ORIGINAL TEXT: I saw an article recently stating that there was a link between angiotensin converting enzymes and Alzheimer's. It said that ACE was partially responsible for the break down of the protein fibers and plaques that form in Alzheimer's. To your knowledge, is there any truth to this and if so, is there a correlation between the prevalence of Alzheimer's and the use of ACE inhibitors in the last few decades?
Thank you for the work that you do! I am glad there are much smarter people than I working on research for these cognitive issues.
EDIT: I'm really sad I missed this AMA.
BAG OF WORDS: Counter({u'the': 6, u'i': 5, u'and': 4, u'there': 4, u'.': 4, u'that': 4, u'alzheimer': 3, u'for': 3, u'of': 3, u"'s": 3, u'this': 2, u'is': 2, u'in': 2, u',': 2, u'to': 2, u'between': 2, u'you': 2, u'was': 2, u'a': 2, u'ace': 2, u'responsible': 1, u'few': 1, u'prevalence': 1, u'edit': 1, u'am': 1, u'it': 1, u'an': 1, u'down': 1, u'really': 1, u'ama': 1, u'are': 1, u'research': 1, u'converting': 1, u'protein': 1, u'inhibitors': 1, u'saw': 1, u'your': 1, u'issues': 1, u'if': 1, u'!': 1, u'people': 1, u'use': 1, u'said': 1, u'knowledge': 1, u'angiotensin': 1, u'fibers': 1, u'plaques': 1, u'much': 1, u':': 1, u'?': 1, u'missed': 1, u'do': 1, u'working': 1, u'recently': 1, u'cognitive': 1, u'form': 1, u'any': 1, u'smarter': 1, u'break': 1, u'article': 1, u"'m": 1, u'sad': 1, u'than': 1, u'glad': 1, u'partially': 1, u'on': 1, u'last': 1, u'work': 1, u'enzymes': 1, u'so': 1, u'these': 1, u'truth': 1, u'correlation': 1, u'thank': 1, u'link': 1, u'stating': 1, u'decades': 1})
You should now be able to pass the first test, test_preproc_d1_1. Try this by running the following code on the command line:
nosetests -v tests/test_pset1_preproc.py
Now let's aggregate these counts, by running the code block below.
corpus_counts = preproc.get_corpus_counts(x_tr)
This makes it possible to see the K most common terms.
corpus_counts.most_common(5)
[(u'.', 22920),
(u'the', 20303),
(u',', 19036),
(u'to', 13252),
(u'and', 11552)]
Word frequencies are said to follow a power law distribution (often called Zipf's law). In practice, this means that a plot of log-frequency against log-rank is nearly linear. Let's see if this holds for our data.
# you need matplotlib version 1.4 or above
import matplotlib.pyplot as plt
import matplotlib
print matplotlib.__version__
%matplotlib inline
1.5.3
plt.loglog([val for word,val in corpus_counts.most_common(4000)])
plt.xlabel('rank')
plt.ylabel('frequency');
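If you want to quantify how close to linear this relationship is, you can fit a line in log-log space. This is just a sanity check, not part of the assignment:
import numpy as np
counts = np.array([val for word,val in corpus_counts.most_common(4000)],dtype=float)
ranks = np.arange(1,len(counts)+1)
slope,intercept = np.polyfit(np.log(ranks),np.log(counts),1) # least-squares fit in log-log space
print slope # for Zipf-like data this is roughly -1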
Deliverable 1.2 Now let's compute some statistics about the training and dev data.
Print the token/type ratio for the training data. You will have to implement gtnlplib.preproc_metrics.get_token_type_ratio (0.1 points)
Print the number of types which appear exactly once in the training data. You will have to implement gtnlplib.preproc_metrics.type_frequency (0.1 points)
Print the number of types that appear in the dev data but not the training data (hint: use sets for this) You will have to implement gtnlplib.preproc_metrics.unseen_types (0.1 points)
(0.3 points total)
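Here is a rough sketch of what these three metrics compute, assuming each takes Counter arguments like corpus_counts; check the stubs in gtnlplib/preproc_metrics.py for the exact signatures:
def get_token_type_ratio(counts):
    # tokens are total word occurrences; types are distinct words
    return sum(counts.values()) / float(len(counts))
def type_frequency(counts,k):
    # number of types that occur exactly k times
    return len([word for word,count in counts.iteritems() if count == k])
def unseen_types(first_counts,second_counts):
    # number of types in second_counts that never occur in first_counts
    return len(set(second_counts.keys()) - set(first_counts.keys()))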
from gtnlplib.preproc_metrics import get_token_type_ratio,type_frequency,unseen_types
get_token_type_ratio(corpus_counts)
19.67109394737792
print type_frequency(corpus_counts,1)
print type_frequency(corpus_counts,10)
14134
263
Words that appear exactly once are called hapax legomena.
Now let's look at the dev data.
y_dv,x_dv = preproc.read_data('reddit-dev.csv', #filename
'subreddit', #label field
preprocessor=preproc.tokenize_and_downcase) #your preprocessor
corpus_counts_dv = preproc.get_corpus_counts(x_dv)
print get_token_type_ratio(corpus_counts_dv)
8.0507739212
Finally, let's compute the number of word types that appear in the dev data but not in the training data.
unseen_types(corpus_counts,corpus_counts_dv)
1737
Again, you can test your code by running nosetests -v tests/test_pset1_preproc.py
Deliverable 1.3
Why do you think the token/type ratio is lower for the dev data than for the training data?
Yes, the dev set is smaller, but why does its size affect the token/type ratio? (0.2 points)
Please put your answer in the file gtnlplib/text-answers.md
Pruning the vocabulary
Now let's prune the vocabulary to words that have appeared more than ten times.
Please run the following two code blocks.
vocab = [word for word,count in corpus_counts.iteritems() if count > 10]
print len(vocab)
print len(x_tr[0])
3600
86
This cutoff is chosen mainly for reasons of speed; in the bakeoff, you may want to have a larger vocabulary, assuming your classifiers are fast enough.
Run the code below to prune the data to this vocabulary. Converting vocab to a set first makes the membership tests much faster.
vocab = set(vocab) # set membership tests are O(1); list membership tests are O(|V|)
x_tr = [{key:val for key,val in x_i.iteritems() if key in vocab} for x_i in x_tr]
x_dv = [{key:val for key,val in x_i.iteritems() if key in vocab} for x_i in x_dv]
x_te = [{key:val for key,val in x_i.iteritems() if key in vocab} for x_i in x_te]
print len(x_tr[0])
68
2. Linear classification
Now you'll implement the linear classification rule, $\hat{y} = \text{argmax}_y \theta^{\top} f(x,y)$.
You will use these functions in all classifiers in this assignment.
Total: 1 point.
from gtnlplib import clf_base
reload(clf_base)
from gtnlplib import constants
reload(constants);
Deliverable 2.1
Recall from class and the reading that the feature function vector $f(x,y)$ can be viewed as a dict, in which the values are counts, and the keys are tuples $(y,x_j)$, where $y$ is a label and $x_j$ is a base feature.
Implement the function make_feature_vector in clf_base.py. The desired output is shown below.
Note that you must also include the offset feature, gtnlplib.constants.OFFSET.
(0.2 points)
fv = clf_base.make_feature_vector({'test':1,'case':2},'iama')
print fv
{('iama', '**OFFSET**'): 1, ('iama', 'case'): 2, ('iama', 'test'): 1}
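Here is a minimal sketch of this construction, assuming base_features is a dict or Counter of word counts, label is a string, and constants has been imported as above:
def make_feature_vector(base_features,label):
    fv = {(label,constants.OFFSET):1} # the offset feature always fires once
    for word,count in base_features.iteritems():
        fv[(label,word)] = count
    return fv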
Let's compute the entire set of labels.
labels = set(y_tr) #figure out all possible labels
print labels
set([u'worldnews', u'science', u'askreddit', u'iama', u'todayilearned'])
Deliverable 2.2
Now implement the prediction rule, $\hat{y} = \text{argmax}_y \theta^{\top} f(x,y)$.
Specifically, implement the function predict in clf_base.py. The output should be:
A predicted label
The scores of all labels
This function will be called a lot, so try to make it fast. You don't need anything fancy, but avoid doing unnecessary extra work.
(0.4 points)
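One possible shape for this function, as a sketch. It computes each dot product directly instead of materializing the full feature vector, which keeps it fast; because theta is a defaultdict(float), missing features simply contribute zero:
def predict(base_features,theta,labels):
    scores = {}
    for label in labels:
        score = theta[(label,constants.OFFSET)]
        for word,count in base_features.iteritems():
            score += theta[(label,word)] * count
        scores[label] = score
    return max(scores,key=scores.get),scores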
You can test this function using these simple hand-crafted weights.
from collections import defaultdict,Counter
# weight vectors must be defaultdicts
theta_hand = defaultdict(float,
{('worldnews','worldnews'):1.,
('worldnews','news'):.5,
('worldnews','world'):.5,
('science','science'):1.,
('askreddit','askreddit'):1.,
('askreddit','ask'):0.5,
('iama','iama'):1,
('todayilearned','til'):1.,
('todayilearned','todayilearned'):1.,
('iama',constants.OFFSET):0.1
})
clf_base.predict(x_tr[5],theta_hand,labels)
(u'science',
{u'askreddit': 0.0,
u'iama': 0.1,
u'science': 5.0,
u'todayilearned': 0.0,
u'worldnews': 0.0})
Now let's see how good these weights are by evaluating on the dev set.
from gtnlplib import evaluation
reload(evaluation);
# this applies your predict function to all the instances in x_dv
y_hat = clf_base.predict_all(x_dv,theta_hand,labels)
print evaluation.acc(y_hat,y_dv)
0.394
Deliverable 2.3
Now modify theta_hand in gtnlplib/hand_weights.py to get dev-set accuracy above 41%.
You can look at the training set to see how best to do this.
(0.4 points)
from gtnlplib import hand_weights
reload(evaluation)
reload(hand_weights);
# currently showing the accuracy for my weights
y_hat = clf_base.predict_all(x_dv,hand_weights.theta_hand,labels)
print evaluation.acc(y_hat,y_dv)
0.438
# run this block to output predictions on the test set
y_hat_te = clf_base.predict_all(x_te,hand_weights.theta_hand,labels)
evaluation.write_predictions(y_hat_te,'hand-test.preds')
3. Naive Bayes
You'll now implement a Naive Bayes classifier, as described in chapter 1 of the notes.
Total: 2 points.
from gtnlplib import naive_bayes
reload(naive_bayes);
Deliverable 3.1 (warmup) Implement get_corpus_counts in naive_bayes.py. This function should compute the word counts for a given label.
(.2 points)
iama_counts = naive_bayes.get_corpus_counts(x_tr,y_tr,unicode('iama'));
print iama_counts['four']
print iama_counts['am']
17.0
255.0
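A sketch of one way to write this, where x and y are the lists of bags-of-words and labels produced by read_data:
from collections import defaultdict
def get_corpus_counts(x,y,label):
    counts = defaultdict(float)
    for x_i,y_i in zip(x,y):
        if y_i == label:
            for word,count in x_i.iteritems():
                counts[word] += count
    return counts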
Deliverable 3.2 Now implement estimate_pxy in naive_bayes.py. This function should compute the smoothed multinomial distribution $\log P(x \mid y)$ for a given label $y$.
Hint: note that this function takes the vocabulary as an argument. You have to assign a probability even for words that do not appear in documents with label $y$, if they are in the vocabulary.
You can use get_corpus_counts in this function if you want to, but you don't have to.
(.5 points)
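The smoothed estimate is $\log P(w \mid y) = \log \frac{\text{count}(w,y) + \alpha}{\sum_{w'} \text{count}(w',y) + \alpha |V|}$, where $\alpha$ is the smoothing value and $V$ is the vocabulary. Here is a sketch that reuses get_corpus_counts; returning a defaultdict(float) gives out-of-vocabulary words a log-probability of zero, which matches the output further below:
import numpy as np
from collections import defaultdict
def estimate_pxy(x,y,label,smoothing,vocab):
    counts = get_corpus_counts(x,y,label)
    total = sum(counts[word] for word in vocab) # total count of in-vocab tokens with this label
    log_pxy = defaultdict(float)
    for word in vocab:
        log_pxy[word] = np.log(counts[word] + smoothing) - np.log(total + smoothing*len(vocab))
    return log_pxy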
log_pxy = naive_bayes.estimate_pxy(x_tr,y_tr,unicode('iama'),0.1,vocab)
import numpy as np
Probabilities must sum to one!
sum(np.exp(log_pxy.values()))
1.0000000000000027
Let's look at the log-probabilities of the words from the hand-tuned weights.
print({word:log_pxy[word] for (_,word),weight in theta_hand.iteritems() if weight>0})
{'todayilearned': 0.0, 'science': -8.54040419944498, 'iama': 0.0, 'til': -11.412083824328992, 'worldnews': 0.0, 'askreddit': 0.0, '**OFFSET**': 0.0, 'ask': -7.710194133479042, 'world': -7.365696439972432, 'news': -8.892085854729721}
Deliverable 3.3 Now you are ready to implement estimate_nb in naive_bayes.py.
The goal is that the score given by clf_base.predict is equal to the joint probability $P(x,y)$, as described in the notes.
Don't forget the offset feature, whose weights should be set to the prior $\log P(y)$.
The log-probabilities for the offset feature should not be smoothed.
You can call the functions you have defined above, but you don't have to.
(0.8 points)
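A sketch of how the pieces can fit together, building on the estimate_pxy sketched above: the per-word weights are the smoothed log-likelihoods, and the offset weight is the unsmoothed log-prior.
from collections import Counter
def estimate_nb(x,y,smoothing):
    labels = set(y)
    doc_counts = Counter(y) # training documents per label, for the prior
    vocab = set(word for x_i in x for word in x_i)
    theta = defaultdict(float)
    for label in labels:
        log_pxy = estimate_pxy(x,y,label,smoothing,vocab)
        for word in vocab:
            theta[(label,word)] = log_pxy[word]
        theta[(label,constants.OFFSET)] = np.log(doc_counts[label] / float(len(y))) # unsmoothed log P(y)
    return theta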
theta_nb = naive_bayes.estimate_nb(x_tr,y_tr,0.1)
clf_base.predict(x_tr[55],theta_nb,labels)
(u'science',
{u'askreddit': -1007.9847188388546,
u'iama': -975.8446106267712,
u'science': -949.4065887913448,
u'todayilearned': -976.5022264717562,
u'worldnews': -1004.6447250798955})
clf_base.predict(x_dv[48],theta_nb,labels)
(u'science',
{u'askreddit': -1078.6362701511373,
u'iama': -1040.9001474598663,
u'science': -976.3420010228351,
u'todayilearned': -983.3265505245143,
u'worldnews': -1029.2493890460516})
y_hat = clf_base.predict_all(x_dv,theta_nb,labels)
print evaluation.acc(y_hat,y_dv)
0.728
# execute this block to write predictions for the test set
# the accuracy below will be 0.0, because the test labels distributed with the assignment are placeholders
y_hat = clf_base.predict_all(x_te,theta_nb,labels)
evaluation.write_predictions(y_hat,'nb-test.preds')
y_hat_te = evaluation.read_predictions('nb-test.preds')
evaluation.acc(y_hat_te,y_te)
0.0
Deliverable 3.4 Write a function in naive_bayes.py called find_best_smoother, which finds the smoothing value that gives best performance on the dev data.
Your function should try at least the following values: [1e-3,1e-2,1e-1,1].
Then, using this smoothing value, run your Naive Bayes classifier on the test set, and output the results. (0.3 points)
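A sketch of a simple grid search over smoothing values, scoring each by dev-set accuracy with the predict_all and acc helpers used earlier in this notebook:
def find_best_smoother(x_tr,y_tr,x_dv,y_dv,smoothers):
    labels = set(y_tr)
    scores = {}
    for smoother in smoothers:
        theta_nb = estimate_nb(x_tr,y_tr,smoother)
        y_hat = clf_base.predict_all(x_dv,theta_nb,labels)
        scores[smoother] = evaluation.acc(y_hat,y_dv)
    best = max(scores,key=scores.get)
    return best,scores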
best_smoother, scores = naive_bayes.find_best_smoother(x_tr,y_tr,x_dv,y_dv,[1e-3,1e-2,1e-1,1])
Now let's apply the best smoothing value to the test data. Recall that the y_te labels are all meaningless placeholders, so the test-set accuracy will display as 0.0.
theta_nb = naive_bayes.estimate_nb(x_tr,y_tr,best_smoother)
y_hat = clf_base.predict_all(x_te,theta_nb,labels)
evaluation.write_predictions(y_hat,'nb-best-test.preds')
y_hat = evaluation.read_predictions('nb-best-test.preds')
print evaluation.acc(y_hat,y_te)
0.0
Deliverable 3.5 Run the code below to compare the weights learned with smoothing values of $0.001$ and $10$.
Explain the resulting figure as best you can, in the text file text-answers.md.
(0.2 points)
theta_nb_001 = naive_bayes.estimate_nb(x_tr,y_tr,.001)
theta_nb_10 = naive_bayes.estimate_nb(x_tr,y_tr,10.)
plt.scatter(theta_nb_001.values(),theta_nb_10.values());
plt.axis('square');
4. Perceptron
Total: 2 points
The perceptron update is,
\begin{align}
\hat{y} &= \text{argmax}_y \theta^\top f(x,y) \\
\theta &\gets \theta + f(x,y) - f(x,\hat{y})
\end{align}
You will now implement this classifier, using the file gtnlplib/perceptron.py
from gtnlplib import perceptron
reload(perceptron)
<module 'gtnlplib.perceptron' from 'gtnlplib/perceptron.pyc'>
Deliverable 4.1
Implement the perceptron update, $f(x,y) - f(x,\hat{y})$, in the function perceptron_update in perceptron.py. (0.5 points)
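A sketch of the update: predict with the current weights, and if the prediction is wrong, return $f(x,y) - f(x,\hat{y})$ as a dict of feature deltas; when the prediction is correct, the update is an empty defaultdict, matching the output below:
from collections import defaultdict
def perceptron_update(x,y,weights,labels):
    update = defaultdict(float)
    y_hat,_ = clf_base.predict(x,weights,labels)
    if y_hat != y:
        for feat,count in clf_base.make_feature_vector(x,y).iteritems():
            update[feat] += count # add f(x,y)
        for feat,count in clf_base.make_feature_vector(x,y_hat).iteritems():
            update[feat] -= count # subtract f(x,y_hat)
    return update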
theta_perc = hand_weights.theta_hand_original.copy() #let's start with the hand-set weights
# no update when the prediction is correct
update = perceptron.perceptron_update(x_tr[110],y_tr[110],theta_perc,labels)
print update
defaultdict(<type 'float'>, {})
# update when the prediction is incorrect
i=20
update = perceptron.perceptron_update(x_tr[i],y_tr[i],theta_perc,labels)
print update.items()[:5]
print len(update)
[((u'science', u'and'), 2), ((u'iama', u'200'), -1.0), ((u'iama', u'million'), -1.0), ((u'iama', u'now'), -1.0), ((u'science', u'years'), 1)]
146
Deliverable 4.2
Now implement the perceptron algorithm. Your implementation should take as inputs:
The training instances $x$
The training labels $y$
The number of iterations to train
It should use your update function, and it should return:
weights $\theta$
a list of the weights at each iteration
Specifically, you should implement estimate_perceptron in perceptron.py (0.5 points)
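A sketch of the training loop, continuing from the perceptron_update sketched above: apply the update to each instance in turn, and save a copy of the weights after every pass through the data:
def estimate_perceptron(x,y,N_its):
    labels = set(y)
    theta = defaultdict(float)
    theta_history = []
    for it in range(N_its):
        for x_i,y_i in zip(x,y):
            # a correct prediction yields an empty update, so theta is unchanged
            for feat,delta in perceptron_update(x_i,y_i,theta,labels).iteritems():
                theta[feat] += delta
        theta_history.append(theta.copy())
    return theta,theta_history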
reload(perceptron);
theta_perc,theta_perc_history = perceptron.estimate_perceptron(x_tr[:10],y_tr[:10],3)
print theta_perc[('science','its')]
print theta_perc[('science','what')]
0.0
4.0
I'm including the running time (on my Lenovo X1 Carbon laptop) for reference:
%%timeit
theta_perc,theta_perc_history = perceptron.estimate_perceptron(x_tr,y_tr,20)
1 loop, best of 3: 45.3 s per loop
theta_perc,theta_perc_history = perceptron.estimate_perceptron(x_tr,y_tr,50)
# run this to plot the accuracy over iterations
def plot_accs(weight_history,x_tr=x_tr,y_tr=y_tr,x_dv=x_dv,y_dv=y_dv):
tr_accs = []
dv_accs = []
for theta in weight_history:
tr_accs.append(evaluation.acc(clf_base.predict_all(x_tr,theta,labels),y_tr))
dv_accs.append(evaluation.acc(clf_base.predict_all(x_dv,theta,labels),y_dv))
plt.plot(tr_accs,'--')
plt.plot(dv_accs)
plt.xlabel('iteration')
plt.ylabel('accuracy');
plt.legend(['training','dev'],loc='lower right');
return tr_accs,dv_accs
y_hat = clf_base.predict_all(x_dv,theta_perc,labels)
print evaluation.acc(y_hat,y_dv)
0.646
plot_accs(theta_perc_history);