Problem Set 3: Structured Perceptron
In this problem set, you will implement a sequence labeling algorithm that is near state-of-the-art: structured perceptron. To do this, you will use several functions that you have written earlier, especially Viterbi and averaged perceptron.

The problem set is designed to highlight the connection between structured perceptron and classification-based tagging. You will first write most of the pieces that you need while working in the framework of classification-based tagging. Your implementation will take various tagging algorithms as arguments, including both one-word-at-a-time classification-based tagging and Viterbi sequence labeling.

One of the main reasons to prefer perceptron over hidden Markov models is the ability to use rich, overlapping features. You will design several feature functions throughout the assignment.

Because structured perceptron is slower to train than a hidden Markov model, we will use smaller datasets in this assignment, focusing on sentences that contain the most common POS tags in English and Japanese.

from gtnlplib import preproc, tagger_base, scorer 
from gtnlplib import features, viterbi, constants, structure_perceptron, kaggle

import os
import matplotlib.pyplot as plt
from collections import defaultdict
%matplotlib inline
## Demo
## NOTE! These data files are different from those in pset 2. Don't copy over constants.py from that pset.
all_tags = set()
for i,(words, tags) in enumerate(preproc.conll_seq_generator(constants.TRAIN_FILE,max_insts=100000)):
    for tag in tags:
        all_tags.add(tag)
print("English tags: {}".format(all_tags))

all_tags_ja = set()
for i,(words, tags) in enumerate(preproc.conll_seq_generator(constants.JA_TRAIN_FILE,max_insts=100000)):
    for tag in tags:
        all_tags_ja.add(tag)
print("Japanese tags: {}".format(all_tags_ja))
English tags: set([u'ADV', u'NOUN', u'ADP', u'PRON', u'PROPN', u'DET', u'PUNCT', u'VERB', u'AUX', u'ADJ'])
Japanese tags: set([u'ADV', u'NOUN', u'PRON', u'DET', u'PUNCT', u'VERB', u'NUM', u'ADJ'])
1. Tagging as discriminative classification
In pset 2, you performed part-of-speech tagging as generative classification, using Naive Bayes. Now you will perform discriminative classification, using the averaged perceptron.

In this section, we are only doing classification-based tagging, but we will write the code in a way that generalizes to Viterbi-based structured prediction. This means that all features are of the form $f(\boldsymbol{w},y_m,y_{m-1},m)$.

Deliverable 1.1 Implement features.word_feats to output features that are tuples (y,constants.CURR_WORD_FEAT,w[m]) and (y,constants.OFFSET).

(0.7 points)

reload(features);
print(features.word_feats(['The','old','man','the','boat'],'NOUN','ADJ',2))
print(features.word_feats(['The','old','man','the','boat'],'VERB','NOUN',3))
# note that we may need to handle m >= len(tokens)
print(features.word_feats(['The','old','man','the','boat'],'NOUN','ADJ',5))
{('NOUN', '**OFFSET**'): 1, ('NOUN', '--CURR-WORD--', 'man'): 1}
{('VERB', '--CURR-WORD--', 'the'): 1, ('VERB', '**OFFSET**'): 1}
{('NOUN', '**OFFSET**'): 1}
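For reference, here is a minimal sketch of one way word_feats could look, consistent with the expected outputs above; it assumes only that constants.py defines OFFSET and CURR_WORD_FEAT as the strings shown:

def word_feats(words, curr_tag, prev_tag, m):
    # (sketch) emission-style features for position m; prev_tag is ignored
    feats = {(curr_tag, constants.OFFSET): 1}
    if m < len(words):  # boundary case: no word feature past the end of the sentence
        feats[(curr_tag, constants.CURR_WORD_FEAT, words[m])] = 1
    return feats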
Deliverable 1.2 Reimplement tagger_base.classifier_tagger as follows:

Inputs:

List of tokens to tag
Feature function, of the form $f(\boldsymbol{w},y_m,y_{m-1},m)$
Defaultdict of weights
List of all candidate tags
Outputs:

List of predicted tags
Score of predicted tag sequence $\theta \cdot f(\boldsymbol{w},\boldsymbol{y})$
(0.7 points)

theta_clf_hand = defaultdict(float,
                             {('NOUN',constants.OFFSET):0.1,
                              ('PRON',constants.CURR_WORD_FEAT,'They'):1,
                              ('PRON',constants.CURR_WORD_FEAT,'can'):-1,
                              ('NOUN',constants.CURR_WORD_FEAT,'fish'):1,
                              ('VERB',constants.CURR_WORD_FEAT,'fish'):0.5})
w = 'They can fish'.split()
print(w)
['They', 'can', 'fish']
reload(tagger_base);
tagger_base.classifier_tagger(w,features.word_feats,theta_clf_hand,all_tags)
([u'PRON', u'NOUN', u'NOUN'], 2.2)
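A minimal sketch of one way to implement this, consistent with the hand-set weights above (tie-breaking among equally scored tags is left to the implementation):

def classifier_tagger(words, feat_func, weights, all_tags):
    # (sketch) tag each word independently with its highest-scoring tag
    tags = []
    total_score = 0.0
    for m in range(len(words)):
        # word_feats ignores the previous tag, so any placeholder works here
        scores = {tag: sum(weights[feat] * count
                           for feat, count in feat_func(words, tag, None, m).items())
                  for tag in all_tags}
        best_tag = max(scores, key=scores.get)
        tags.append(best_tag)
        total_score += scores[best_tag]
    return tags, total_score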
Deliverable 1.3 The perceptron update requires computing the difference

\begin{align}
& f(\boldsymbol{w},\boldsymbol{y}) - f(\boldsymbol{w},\hat{\boldsymbol{y}})\\
= & \sum_{m=1}^M \left[ f(\boldsymbol{w},y_m,y_{m-1},m) - f(\boldsymbol{w},\hat{y}_m,\hat{y}_{m-1},m) \right]
\end{align}
Implement tagger_base.compute_features to compute $f(\boldsymbol{w},\boldsymbol{y})$, with the following arguments:

A list of words
A list of tags
A feature function, of the form $f(\boldsymbol{w},y_m,y_{m-1},m)$.
The output should be a dict of features and counts.

Boundary cases:

When $m=0$, use the special case $y_{-1} = \text{START}$, using constants.START_TAG. Your current feature function will not test this, because it ignores $y_{m-1}$, but we will test it later.
When $m=M$, use the special case $y_M = \text{STOP}$, using constants.END_TAG.
These boundary cases will be important when you incorporate Viterbi tagging.

(0.7 points)

reload(tagger_base);
tagger_base.compute_features('the old man the boat'.split(),
                            ['DET','NOUN','VERB','DET','NOUN'],
                            features.word_feats)
{('--END--', '**OFFSET**'): 1.0,
 ('DET', '**OFFSET**'): 2.0,
 ('DET', '--CURR-WORD--', 'the'): 2.0,
 ('NOUN', '**OFFSET**'): 2.0,
 ('NOUN', '--CURR-WORD--', 'boat'): 1.0,
 ('NOUN', '--CURR-WORD--', 'old'): 1.0,
 ('VERB', '**OFFSET**'): 1.0,
 ('VERB', '--CURR-WORD--', 'man'): 1.0}
Deliverable 1.4

Now you can implement the function structure_perceptron.sp_update.

This will be very similar to your implementation of perceptron.perceptron_update in pset 1, but instead of calling clf_base.predict, you should call the tagger function that is passed in as an argument.

(0.7 points)

reload(structure_perceptron);
tagger_base.classifier_tagger('They can fish'.split(),
                             features.word_feats,
                             theta_clf_hand,
                             all_tags)
([u'PRON', u'NOUN', u'NOUN'], 2.2)
update = structure_perceptron.sp_update('They can fish'.split(),
                               ['PRON','AUX','VERB'],
                               theta_clf_hand,
                               features.word_feats,
                               tagger_base.classifier_tagger,
                               all_tags)
for key,val in update.iteritems():
    if val != 0:
        print(key,val)
((u'NOUN', '--CURR-WORD--', 'fish'), -1.0)
((u'NOUN', '**OFFSET**'), -2.0)
(('AUX', '--CURR-WORD--', 'can'), 1.0)
(('VERB', '--CURR-WORD--', 'fish'), 1.0)
(('VERB', '**OFFSET**'), 1.0)
((u'NOUN', '--CURR-WORD--', 'can'), -1.0)
(('AUX', '**OFFSET**'), 1.0)
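A minimal sketch, written directly as the feature difference from Deliverable 1.3; it reproduces the nonzero entries printed above:

def sp_update(words, true_tags, weights, feat_func, tagger, all_tags):
    # (sketch) predict with the current weights, then return
    # f(w, y) - f(w, y-hat) as a dict of feature deltas
    pred_tags, _ = tagger(words, feat_func, weights, all_tags)
    update = defaultdict(float)
    for feat, count in tagger_base.compute_features(words, true_tags, feat_func).items():
        update[feat] += count
    for feat, count in tagger_base.compute_features(words, pred_tags, feat_func).items():
        update[feat] -= count
    return update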
Deliverable 1.5

You are now ready to implement structure_perceptron.estimate_perceptron.

Your implementation will be nearly identical to perceptron.estimate_perceptron, except for two things:

The input is now a list of (token-list, tag-list) tuples
Instead of calling perceptron.perceptron_update, you will call structure_perceptron.sp_update.
Other aspects of the implementation, such as weight averaging, should be identical.

(0.7 points)
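A rough sketch of the training loop follows; the averaging bookkeeping shown here (a t-weighted running sum, subtracted out at the end of each iteration) is one standard scheme and may differ in detail from your pset 1 implementation:

def estimate_perceptron(labeled_instances, feat_func, tagger, N_its, all_tags):
    # (sketch) averaged structured perceptron, built on sp_update above
    theta = defaultdict(float)       # current (unaveraged) weights
    theta_wsum = defaultdict(float)  # t-weighted sum of updates, for averaging
    theta_hist = []                  # averaged weights after each iteration
    t = 0.0
    for it in range(N_its):
        for words, tags in labeled_instances:
            t += 1
            update = sp_update(words, tags, theta, feat_func, tagger, all_tags)
            for feat, delta in update.items():
                theta[feat] += delta
                theta_wsum[feat] += t * delta
        theta_avg = defaultdict(float,
                                {feat: theta[feat] - theta_wsum[feat] / t
                                 for feat in theta})
        theta_hist.append(theta_avg)
    return theta_hist[-1], theta_hist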

reload(features)
reload(tagger_base)
reload(structure_perceptron);
toy_data = [('They can fish'.split(),['PRON','AUX','VERB']),
            ('the old man the boat'.split(),['DET','NOUN','VERB','DET','NOUN'])]
theta_toy_one_inst,_ = structure_perceptron.estimate_perceptron(toy_data[:1],
                                                                features.word_feats,
                                                                tagger_base.classifier_tagger,
                                                                1,
                                                                all_tags)
features.word_feats(toy_data[0][0],'NOUN','IGNORE',0)
{('NOUN', '**OFFSET**'): 1, ('NOUN', '--CURR-WORD--', 'They'): 1}
for feat,weight in theta_toy_one_inst.iteritems():
    if weight != 0:
        print(feat,weight)
((u'PRON', '**OFFSET**'), 1.0)
((u'NOUN', '--CURR-WORD--', 'can'), -1.0)
((u'AUX', '--CURR-WORD--', 'can'), 1.0)
((u'VERB', '--CURR-WORD--', 'fish'), 1.0)
((u'VERB', '**OFFSET**'), 1.0)
(('NOUN', '**OFFSET**'), -2.999)
((u'NOUN', '--CURR-WORD--', 'They'), -1.0)
((u'AUX', '**OFFSET**'), 1.0)
((u'NOUN', '--CURR-WORD--', 'fish'), -1.0)
((u'PRON', '--CURR-WORD--', 'They'), 1.0)
theta_toy_one_inst,_ = structure_perceptron.estimate_perceptron(toy_data[:1],
                                                                features.word_feats,
                                                                tagger_base.classifier_tagger,
                                                                10,
                                                                all_tags)
for feat,weight in theta_toy_one_inst.iteritems():
    if weight != 0:
        print(feat,weight)
((u'PRON', '**OFFSET**'), 1.0)
((u'NOUN', '--CURR-WORD--', 'can'), -1.0)
((u'AUX', '--CURR-WORD--', 'can'), 1.0)
((u'VERB', '--CURR-WORD--', 'fish'), 1.0)
((u'VERB', '**OFFSET**'), 1.0)
(('NOUN', '**OFFSET**'), -2.999)
((u'NOUN', '--CURR-WORD--', 'They'), -1.0)
((u'AUX', '**OFFSET**'), 1.0)
((u'NOUN', '--CURR-WORD--', 'fish'), -1.0)
((u'PRON', '--CURR-WORD--', 'They'), 1.0)
theta_toy,_ = structure_perceptron.estimate_perceptron(toy_data,
                                                     features.word_feats,
                                                     tagger_base.classifier_tagger,
                                                     1, all_tags)
for feat,weight in theta_toy.iteritems():
    if weight != 0:
        print(feat,weight)
((u'NOUN', '--CURR-WORD--', 'boat'), 0.5)
((u'PRON', '--CURR-WORD--', 'man'), -0.5)
((u'PRON', '**OFFSET**'), -1.5)
((u'NOUN', '--CURR-WORD--', 'can'), -1.0)
((u'PRON', '--CURR-WORD--', 'old'), -0.5)
((u'VERB', '--CURR-WORD--', 'man'), 0.5)
((u'AUX', '--CURR-WORD--', 'can'), 1.0)
((u'VERB', '--CURR-WORD--', 'fish'), 1.0)
((u'VERB', '**OFFSET**'), 1.5)
((u'PRON', '--CURR-WORD--', 'boat'), -0.5)
(('NOUN', '**OFFSET**'), -1.999)
((u'NOUN', '--CURR-WORD--', 'old'), 0.5)
((u'NOUN', '--CURR-WORD--', 'They'), -1.0)
((u'AUX', '**OFFSET**'), 1.0)
((u'NOUN', '--CURR-WORD--', 'fish'), -1.0)
((u'PRON', '--CURR-WORD--', 'They'), 1.0)
((u'PRON', '--CURR-WORD--', 'the'), -1.0)
((u'DET', '--CURR-WORD--', 'the'), 1.0)
((u'DET', '**OFFSET**'), 1.0)
Deliverable 1.6 Let's train this tagger and evaluate it.

(0.5 points)

training_set = [inst for inst in preproc.conll_seq_generator(constants.TRAIN_FILE)]
len(training_set)
3531
The cell below takes 30 seconds to run on my laptop.

theta_avp,theta_hist = structure_perceptron.estimate_perceptron(training_set,
                                                       features.word_feats,
                                                       tagger_base.classifier_tagger,
                                                       20,
                                                       all_tags)
tagger_base.eval_tagging_model(constants.DEV_FILE,
                               tagger_base.classifier_tagger,
                               features.word_feats,
                               theta_avp,
                               all_tags,
                               'avp-words.preds')
0.8129496402877698
tagger_base.apply_tagging_model(constants.TEST_FILE_HIDDEN,
                               tagger_base.classifier_tagger,
                               features.word_feats,
                               theta_avp,
                               all_tags,
                               'avp-words-te.preds')
# you can't run this line
scorer.accuracy(scorer.get_confusion(constants.TEST_FILE,'avp-words-te.preds'))
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-29-020d7712d1a5> in <module>()
      1 # you can't run this line
----> 2 scorer.accuracy(scorer.get_confusion(constants.TEST_FILE,'avp-words-te.preds'))

/Users/ananyaroy/Documents/Sem2/NLP/gt-nlp-class-master/psets/ps3/gtnlplib/scorer.pyc in get_confusion(keyfilename, responsefilename)
     29     """
     30     counts = defaultdict(int)
---> 31     with codecs.open(keyfilename,encoding='utf8') as keyfile:
     32         with open(responsefilename,'r') as resfile:
     33             for key_line in keyfile:

/Users/ananyaroy/anaconda/lib/python2.7/codecs.pyc in open(filename, mode, encoding, errors, buffering)
    894             # Force opening of the file in binary mode
    895             mode = mode + 'b'
--> 896     file = __builtin__.open(filename, mode, buffering)
    897     if encoding is None:
    898         return file

IOError: [Errno 2] No such file or directory: 'data/en-ud-simpler-test.conllu'
Now let's see how accuracy improved over training. You can use this function elsewhere in the notebook if you'd like to see the progress of your classifier.

tagger_base.plot_learning_curve(constants.DEV_FILE,
                               tagger_base.classifier_tagger,
                               features.word_feats,
                               theta_hist,
                               all_tags);

These accuracies are not directly comparable with your HMM accuracies from pset2, since we are using a different dataset.

Let's see how many features are active.

print("Number of active features: %d"%len([val for val in theta_avp.values() if val != 0]))
Number of active features: 19265
Deliverable 1.7 Now try it on Japanese.

(0.5 points for 4650 / 0.25 points for 7650)

As before, 4650 students can opt in to 7650 grading by doing the 7650 problems.

Please set the GRADING variable in constants.py to the appropriate grading scheme. Note that CS7650 students must be graded by the CS7650 rubric.

training_set_ja = [inst for inst in preproc.conll_seq_generator(constants.JA_TRAIN_FILE)]
The cell below takes approximately 40 seconds to run on my laptop.

theta_avp_ja,theta_hist_ja =\
structure_perceptron.estimate_perceptron(training_set_ja,
                                         features.word_feats,
                                         tagger_base.classifier_tagger,
                                         20,
                                         all_tags_ja)
tagger_base.eval_tagging_model(constants.JA_DEV_FILE,
                               tagger_base.classifier_tagger,
                               features.word_feats,
                               theta_avp_ja,
                               all_tags_ja,
                               'avp-words.ja.preds')
0.790288213524728
tagger_base.apply_tagging_model(constants.JA_TEST_FILE_HIDDEN,
                               tagger_base.classifier_tagger,
                               features.word_feats,
                               theta_avp_ja,
                               all_tags_ja,
                               'avp-words-te.ja.preds')
# you can't run this
scorer.accuracy(scorer.get_confusion(constants.JA_TEST_FILE,'avp-words-te.ja.preds'))
Deliverable 1.8 (for 7650) As you can see from the cells above, this tagging model is less accurate for Japanese than it is for English. Why might that be? (I'm looking for an explanation that is based on quantitative facts about the datasets.) Put your answer in text-answers.md.

(0.5 points for 7650, optional for 4650)

2. Better features
One simple way to improve tagging is to add better features to the classification-based tagger.

Deliverable 2.1 Let's start by adding a suffix feature: the final two characters of each word. Do this by implementing word_suff_feats in features.py.

(0.5 points)

reload(constants)
reload(features);
print(features.word_suff_feats(['The','old','man','a','boat'],'DET','ADJ',0))
print(features.word_suff_feats(['The','old','man','a','boat'],'NOUN','DET',1))
print(features.word_suff_feats(['The','old','man','a','boat'],'NOUN','ADJ',3))
{('DET', '--CURR-WORD--', 'The'): 1, ('DET', '**OFFSET**'): 1, ('DET', '--SUFFIX--', 'he'): 1}
{('NOUN', '--SUFFIX--', 'ld'): 1, ('NOUN', '**OFFSET**'): 1, ('NOUN', '--CURR-WORD--', 'old'): 1}
{('NOUN', '**OFFSET**'): 1, ('NOUN', '--SUFFIX--', 'a'): 1, ('NOUN', '--CURR-WORD--', 'a'): 1}
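A minimal sketch, building on word_feats; note that the suffix-feature constant name below (SUFF_FEAT) is an assumption, so use whatever name constants.py gives to the '--SUFFIX--' string:

def word_suff_feats(words, curr_tag, prev_tag, m):
    # (sketch) word_feats plus a two-character suffix feature;
    # SUFF_FEAT is assumed here, substitute the actual name from constants.py
    feats = word_feats(words, curr_tag, prev_tag, m)
    if m < len(words):
        # for one-character words the suffix is the whole word, as in the demo above
        feats[(curr_tag, constants.SUFF_FEAT, words[m][-2:])] = 1
    return feats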
Deliverable 2.2 Let's see whether this improves accuracy.

(0.5 points for 4650, 0.25 points for 7650. Includes both English and Japanese evaluations.)

reload(features);
The cell below takes 50 seconds to run on my laptop.

theta_suff_avp,_ =\
structure_perceptron.estimate_perceptron(training_set,
                                         features.word_suff_feats,
                                         tagger_base.classifier_tagger,
                                         20,
                                         all_tags)
print('EN dev set')
tagger_base.eval_tagging_model(constants.DEV_FILE,
                               tagger_base.classifier_tagger,
                               features.word_suff_feats,
                               theta_suff_avp,
                               all_tags,
                               'avp-words-suff.preds')
EN dev set
0.8439248601119105
tagger_base.apply_tagging_model(constants.TEST_FILE_HIDDEN,
                               tagger_base.classifier_tagger,
                               features.word_suff_feats,
                               theta_suff_avp,
                               all_tags,
                               'avp-words-suff-te.preds')
