from collections import Counter, defaultdict
import os
from nltk.tag import pos_tag
from gtnlplib import coref, coref_rules, coref_features, coref_learning
Part 1: Exploring the data
The core data is in the form of "markables", which refer to token sequences that can participate in coreference relations.
Each markable has four elements:
string, which is a list of tokens
entity, which defines the ground truth assignments
start_token, the index of the first token in the markable with respect to the entire document
end_token, one plus the index of the last token in the markable
The read_data function also returns a list of tokens. You can use this to incorporate the linguistic context around each markable.
dv_dir = os.path.join('data','dev')
tr_dir = os.path.join('data','tr')
te_dir = os.path.join('data','te-hidden-labels')
markables,words = coref.read_data('Johnston Atoll',basedir=tr_dir)
print markables[3]
print words[markables[3]['start_token']:markables[3]['end_token']]
{'end_token': 21, 'start_token': 19, 'string': ['The', 'atoll'], 'entity': u'set_76'}
['The', 'atoll']
Deliverable 1.1: Write a function that returns all the markable strings associated with a given entity. Specifically, fill in the function get_markables_for_entity in coref.py. (0.5 pts)
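A minimal sketch of one way this could look (the reference implementation may differ; it relies only on the markable fields described above):

def get_markables_for_entity(markables, entity):
    # collect the surface string of every markable annotated with this entity
    return [' '.join(m['string']) for m in markables if m['entity'] == entity]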
reload(coref);
sorted(coref.get_markables_for_entity(markables,'set_100'))
['Johnston and Sand Island',
'Johnston and Sand islands',
'The islands',
'the area',
'the islands',
'them']
Deliverable 1.2 Write a function that takes as input a string, and returns a list of distances to the most recent ground truth antecedent for every time the input string appears. For example, if the input is "they", it should make a list with one element for each time the word "they" appears in the list of markables. Each element should be the distance of the word "they" to the nearest previous mention of the entity that "they" references.
Fill in the function get_distances in coref.py. If an occurrence of the input string is not anaphoric, its distance should be zero. Note that input strings may contain spaces. You may use any other function in coref.py to help you. (0.5 pts)
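One possible shape for this function, shown as a sketch; it assumes the input term is compared against the whitespace-joined markable string, as elsewhere in this notebook:

def get_distances(markables, term):
    # for each occurrence of `term`, the distance (in markables) back to the most
    # recent earlier mention of the same ground-truth entity; 0 if non-anaphoric
    distances = []
    for i, m in enumerate(markables):
        if ' '.join(m['string']) == term:
            dist = 0
            for j in range(i - 1, -1, -1):
                if markables[j]['entity'] == m['entity']:
                    dist = i - j
                    break
            distances.append(dist)
    return distances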
coref.get_distances(markables,'they')
[2, 2, 1, 2]
Now let's compare the typical distances for various mention types.
You can see the most frequent mention types by using the Counter class.
Counter([' '.join(markable['string']) for markable in markables]).most_common(5)
[('Johnston Atoll', 13),
('the atoll', 13),
('the island', 11),
('it', 8),
('Johnston Island', 5)]
coref.get_distances(markables,'Johnston Atoll')
[0, 4, 10, 3, 3, 1, 4, 3, 3, 1, 4, 7, 7]
coref.get_distances(markables,'the island')
[2, 4, 1, 2, 3, 7, 3, 6, 2, 3, 1, 2]
coref.get_distances(markables,'it')
[3, 1, 3, 1, 2, 1, 3, 1, 1, 1, 2]
Part 2: Rule-based coreference resolution
We have written a simple coreference classifier, which predicts that each markable is linked to the most recent antecedent which is an exact string match.
The code block below applies this method to the dev set.
exact_matcher = coref_rules.make_resolver(coref_rules.exact_match)
The code above has two pieces:
coref_rules.exact_match is a function that takes two markables, and returns True iff they are an exact (case-insensitive) string match
make_resolver is a function that takes a matching function, and returns a function that computes an antecedent list for a list of markables.
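Conceptually, the resolver returned by make_resolver can be pictured as the following sketch (not necessarily the exact code in coref_rules.py; it assumes the matching function takes the candidate antecedent as its first argument):

def make_resolver(matcher):
    # build a function that maps a list of markables to a list of antecedent
    # indices, linking each markable to the most recent earlier match,
    # or to itself if no earlier markable matches
    def resolver(markables):
        antecedents = []
        for i in range(len(markables)):
            ant = i
            for j in range(i - 1, -1, -1):
                if matcher(markables[j], markables[i]):
                    ant = j
                    break
            antecedents.append(ant)
        return antecedents
    return resolver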
Let's run it.
ant_exact = exact_matcher(markables)
print ant_exact[:20]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 11, 12, 13, 3, 15, 16, 17, 18, 19]
The output is a list of antecedent numbers, $c_i$. When $c_i = i$, the markable $i$ has no antecedent: it is the first mention of its entity.
We can test whether these predictions are correct by comparing against the key.
ant_true = coref.get_true_antecedents(markables)
num_correct = sum([c_true==c_predict for c_true,c_predict in zip(ant_true,ant_exact)])
acc = num_correct/float(len(markables))
print "correct: %d\taccuracy: %.3f"%(num_correct,acc)
correct: 76 accuracy: 0.353
Evaluation
Coreference is typically evaluated in terms of recall, precision, and F-measure. Here is how we will define these terms:
True positive: The system predicts $\hat{c}_i < i$, and $\hat{c}_i$ and $i$ are references to the same entity.
False positive: The system predicts $\hat{c}_i < i$, but $\hat{c}_i$ and $i$ are not references to the same entity.
False negative: There exists some $c_i < i$ such that $c_i$ and $i$ are references to the same entity, but the system predicts either $\hat{c}_i = i$, or some $\hat{c}_i$ which is not really a reference to the same entity that $i$ references.
Recall $= \frac{tp}{tp + fn}$
Precision $= \frac{tp}{tp + fp}$
F-measure $= \frac{2RP}{R + P}$
A couple of things to notice here:
There is no reward for correctly identifying a markable as non-anaphoric (not having any antecedent), but you do avoid committing a false positive by doing this.
You cannot compute the evaluation by directly matching the predicted antecedents to the true antecedents. Suppose the truth is $a \leftarrow b$, $b \leftarrow c$, but the system predicts $a \leftarrow b$, $a \leftarrow c$: the system should receive two true positives, since $a$ and $c$ are references to the same entity in the ground truth.
Deliverable 2.1 Implement get_tp, get_fp, and get_fn in coref.py. You will want to use the function coref.get_entities. (1 point)
NOTE! You must successfully complete this deliverable. Otherwise, some of the unit tests won't work and you won't be able to complete the rest of the assignment.
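As a rough guide, here is a sketch of get_tp and get_fn that works directly from the definitions above, comparing ground-truth entity labels rather than antecedent indices (get_fp is symmetric to get_tp; the provided coref.get_entities helper may organize this differently, so treat this as one possible layout):

def get_tp(antecedents, markables):
    # 1 for each markable whose predicted antecedent is an earlier mention
    # of the same ground-truth entity, else 0
    return [1 if a != i and markables[a]['entity'] == markables[i]['entity'] else 0
            for i, a in enumerate(antecedents)]

def get_fn(antecedents, markables):
    # 1 for each markable that has some true antecedent but whose prediction
    # is either "new entity" (a == i) or an antecedent from the wrong entity
    fn = []
    for i, a in enumerate(antecedents):
        has_true_ant = any(markables[j]['entity'] == markables[i]['entity'] for j in range(i))
        wrong = (a == i) or (markables[a]['entity'] != markables[i]['entity'])
        fn.append(1 if has_true_ant and wrong else 0)
    return fn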
f,r,p = coref.evaluate(exact_matcher,markables)
print f,r,p
0.619607843137 0.473053892216 0.897727272727
all_markables,all_words = coref.read_dataset(tr_dir)
coref.eval_on_dataset(exact_matcher,all_markables);
F: 0.6608 R: 0.5259 P:0.8886
Increasing precision
The exact_match function matches everything, including pronouns. This can lead to mistakes:
"Umashanthi ate pizza until she was full. Parvati kept eating until she had a stomach ache."
In this example, both pronouns likely refer to the names that immediately precede them, and not to each other.
Deliverable 2.2 The file coref_rules.py contains the signature for a function exact_match_no_pronouns, which solves this problem by only predicting matches between markables that are not pronouns. Implement and test this function. For now, you may use the list of pronouns provided in the code file coref_rules.py.
(0.5 points 4650 / 0.25 points 7650)
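A possible sketch, assuming a set of pronoun strings like the one provided in coref_rules.py (the list below is illustrative, not the full provided list):

PRONOUNS = set(['he', 'him', 'his', 'she', 'her', 'hers',
                'it', 'its', 'they', 'them', 'their'])

def exact_match_no_pronouns(m_a, m_i):
    # exact case-insensitive string match, but never link pronouns
    s_a = ' '.join(m_a['string']).lower()
    s_i = ' '.join(m_i['string']).lower()
    if s_a in PRONOUNS or s_i in PRONOUNS:
        return False
    return s_a == s_i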
reload(coref_rules);
no_pro_matcher = coref_rules.make_resolver(coref_rules.exact_match_no_pronouns)
f,r,p = coref.eval_on_dataset(no_pro_matcher,all_markables);
F: 0.6419 R: 0.4868 P:0.9421
Precision has increased, but recall decreased, dragging down the overall F-measure.
Increasing recall
Our current matcher is very conservative. Let's try to increase recall. One solution is to match on the head word of each markable.
As you know, in a CFG parse, the head word is defined by a set of rules: for example, the head of a determiner-noun construction is the noun. In a dependency parse, the head word would be the root of the subtree governing the markable span. But this assumes that the markables correspond to syntactic constituents or dependency subtrees. This is not guaranteed to be true -- particularly when there are parsing errors.
Deliverable 2.3 Let's start with a much simpler head-finding heuristic: simply select the last word in the markable. This handles many cases --- but as we will see, not all. To do this, implement the function match_last_token in coref_rules.py. This function should match all cases where the final tokens match. (0.5 points 4650 / 0.25 points 7650)
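A sketch of the heuristic, assuming the comparison is case-insensitive like exact_match:

def match_last_token(m_a, m_i):
    # match iff the final tokens of the two markables are identical
    return m_a['string'][-1].lower() == m_i['string'][-1].lower()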
reload(coref_rules);
last_tok_matcher = coref_rules.make_resolver(coref_rules.match_last_token)
coref.eval_on_dataset(last_tok_matcher,all_markables);
F: 0.6482 R: 0.5959 P:0.7105
Recall is up, but precision is back down. To try to increase precision, let's add one more rule: two markables cannot coref if their spans overlap. This can happen with nested mentions, such as "(the president (of the united states))". Under our last-token rule, these two mentions would co-refer, but logically, overlapping markables cannot refer to the same entity.
Deliverable 2.4 Fill in the function match_last_token_no_overlap, which should match any two markables that share the same last token, unless their spans overlap. Use the start_token and end_token fields of each markable to determine whether they overlap. (0.5 points 4650 / 0.25 points 7650)
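A sketch, reusing match_last_token from the previous deliverable and a standard interval-overlap test:

def match_last_token_no_overlap(m_a, m_i):
    # last-token match, unless the two spans overlap
    overlap = (m_a['start_token'] < m_i['end_token'] and
               m_i['start_token'] < m_a['end_token'])
    return (not overlap) and match_last_token(m_a, m_i)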
reload(coref_rules);
coref.eval_on_dataset(coref_rules.make_resolver(coref_rules.match_last_token_no_overlap),all_markables);
F: 0.6723 R: 0.6108 P:0.7476
Both recall and precision increase. Why would recall increase? The restriction does not create any new coreference links, but it changes some incorrect links to correct links. This increases the number of true positives and reduces the number of false negatives.
Error analysis
To see whether we can do even better, let's try some error analysis on a specific file.
# predicted antecedent series
ant = coref_rules.make_resolver(coref_rules.match_last_token_no_overlap)(markables)
# let's look at large entities
m2e,e2m = coref.markables_to_entities(markables,ant)
big_entities = [ent for ent,vals in e2m.iteritems() if len(vals)>20]
for entity in big_entities:
    print 'Entity %d: %d mentions'%(entity,len(e2m[entity]))
    print [' '.join(markables[idx]['string']) for idx in e2m[entity]]
    print
Entity 0: 31 mentions
['Johnston Atoll', 'The atoll', 'Johnston Atoll', 'the atoll', 'the atoll', 'the atoll', 'Johnston Atoll', 'the atoll', 'Johnston Atoll', 'the atoll', 'the atoll', 'the atoll', 'Johnston Atoll', 'Johnston Atoll', 'Johnston Atoll', 'The atoll', 'Johnston Atoll', 'Johnston Atoll', 'the atoll', 'the atoll', 'the deserted atoll', 'the Atoll', 'Johnston Atoll', 'the atoll', 'Johnston Atoll', 'Johnston Atoll', 'Johnston Atoll', 'the atoll', 'the atoll', 'Seabird species recorded as breeding on the atoll', 'the atoll']
Entity 22: 21 mentions
['Sand Island', 'the island', 'Sand Island', 'Johnston Island', 'Johnston Island', 'Johnston Island', 'the island', 'the island', 'Johnston Island', 'the island', 'the island', 'Johnston Island', 'the island', 'the island', 'the island', 'the island', 'The island', 'The central means of transportation to this island', 'this island', 'the island', 'the island']
Incorporating parts of speech
One clear mistake is that we are matching "Sand Island" to "Johnston Island". The last token heuristic is the culprit: in this case, the first token is a key disambiguator. Let's try a more syntactically-motivated approach.
Instead of matching the last token (low precision) or matching on all tokens (low recall), let's try matching on all content words. Let's start by including only the following grammatical categories:
Nouns (proper, common, singular, plural)
Pronouns (including possessive)
Cardinal numbers
To get these categories, we can call read_dataset with an optional argument, a part-of-speech tagger. We'll use NLTK for this project, which provides an averaged perceptron tagger trained on the PTB tagset.
reload(coref);
all_markables,_ = coref.read_dataset(tr_dir,tagger=pos_tag)
all_markables_dev,_ = coref.read_dataset(dv_dir,tagger=pos_tag)
all_markables_te,_ = coref.read_dataset(te_dir,tagger=pos_tag)
all_markables[14][4]
{'end_token': 30,
'entity': u'set_8',
'start_token': 26,
'string': ['the', 'coral', 'reef', 'platform'],
'tags': ['DT', 'JJ', 'NN', 'NN']}
As you can see, the markables now contain an additional tags field, with the part of speech tags for each token in the 'string' field.
Deliverable 2.5 Now implement a new matcher, match_on_content in coref_rules.py. Your code should match $m_a$ and $m_i$ iff all content words are identical. It should also enforce the "no overlap" restriction defined above. (0.5 points 4650 / 0.25 points 7650)
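A sketch, assuming the content categories above correspond to the Penn Treebank tags NN, NNS, NNP, NNPS, PRP, PRP$, and CD, and that the comparison is case-insensitive:

CONTENT_TAGS = set(['NN', 'NNS', 'NNP', 'NNPS', 'PRP', 'PRP$', 'CD'])

def match_on_content(m_a, m_i):
    # markables match iff their spans do not overlap and their content words
    # (tokens whose tag is in CONTENT_TAGS) are identical
    overlap = (m_a['start_token'] < m_i['end_token'] and
               m_i['start_token'] < m_a['end_token'])
    if overlap:
        return False
    content_a = [w.lower() for w, t in zip(m_a['string'], m_a['tags']) if t in CONTENT_TAGS]
    content_i = [w.lower() for w, t in zip(m_i['string'], m_i['tags']) if t in CONTENT_TAGS]
    return content_a == content_i

One edge case worth deciding for yourself is what to do when a markable contains no content words at all.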
Run the cells below to run on the dev and test sets.
coref.eval_on_dataset(coref_rules.make_resolver(coref_rules.match_on_content),all_markables);
F: 0.6897 R: 0.5783 P:0.8545
Deliverable 2.6 Run the code blocks below to output predictions for the dev and test data. (0.25 points)
coref.write_predictions(coref_rules.make_resolver(coref_rules.match_on_content),
all_markables_dev,
'predictions/rules-dev.preds')
f,r,p = coref.eval_predictions('predictions/rules-dev.preds',all_markables_dev);
print f
F: 0.6830 R: 0.5889 P:0.8130
0.683029453015
coref.write_predictions(coref_rules.make_resolver(coref_rules.match_on_content),
all_markables_te,
'predictions/rules-test.preds')
# students can't run this
all_markables_te_secret,_ = coref.read_dataset('data/te')
coref.eval_predictions('predictions/rules-test.preds',all_markables_te_secret)
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-36-8fda4a87ca24> in <module>()
1 # students can't run this
2 all_markables_te_secret,_ = coref.read_dataset('data/te')
----> 3 coref.eval_predictions('predictions/rules-test.preds',all_markables_te_secret)
/Users/dhananjaybahal/Downloads/ps5/gtnlplib/coref.pyc in eval_predictions(pred_file, markables)
236 tot_fp += sum(get_fp(sys_ant,markables_i))
237 tot_fn += sum(get_fn(sys_ant,markables_i))
--> 238 r = tot_tp / float(tot_tp + tot_fn)
239 p = tot_tp / float(tot_tp + tot_fp)
240 f = 2 * r * p / ( r + p )
ZeroDivisionError: float division by zero
Part 3: Machine learning for coreference resolution
You will now implement coreference resolution using the mention-ranking model. Let's start by implementing some features.
Deliverable 3.1 Implement coref_features.minimal_features, using the rules you wrote from coref_rules. This should be a function that takes a list of markables, and indices for two mentions, and returns a dict with features and counts. Include the following features:
exact_match
last_token_match
content_match
cross_over: value of 1 iff the mentions overlap
new_entity: value of 1 iff $i = j$
For the first four features, you should call your code from coref_rules directly. (1 point)
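A sketch of how the feature function might be assembled from the coref_rules matchers. It assumes the first index is the candidate antecedent and the second is the current mention (as in the example calls below), and it uses the hyphenated feature names that appear in the expected output:

def minimal_features(markables, a, i):
    # binary features describing the link from mention i to candidate antecedent a;
    # a == i encodes the "new entity" decision
    f = dict()
    if a == i:
        f['new-entity'] = 1.0
        return f
    m_a, m_i = markables[a], markables[i]
    if coref_rules.exact_match(m_a, m_i):
        f['exact-match'] = 1
    if coref_rules.match_last_token(m_a, m_i):
        f['last-token-match'] = 1
    if coref_rules.match_on_content(m_a, m_i):
        f['content-match'] = 1
    if m_a['start_token'] < m_i['end_token'] and m_i['start_token'] < m_a['end_token']:
        f['crossover'] = 1
    return f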
reload(coref_features);
for i,markable in enumerate(all_markables[14][:15]):
    print i,markable
0 {'end_token': 2, 'start_token': 0, 'tags': ['NNP', 'NNP'], 'string': ['Johnston', 'Atoll'], 'entity': u'set_76'}
1 {'end_token': 12, 'start_token': 10, 'tags': ['NNP', 'NNP'], 'string': ['Pacific', 'Ocean'], 'entity': u'set_3'}
2 {'end_token': 18, 'start_token': 17, 'tags': ['NNP'], 'string': ['Hawaii'], 'entity': u'set_107'}
3 {'end_token': 21, 'start_token': 19, 'tags': ['DT', 'NN'], 'string': ['The', 'atoll'], 'entity': u'set_76'}
4 {'end_token': 30, 'start_token': 26, 'tags': ['DT', 'JJ', 'NN', 'NN'], 'string': ['the', 'coral', 'reef', 'platform'], 'entity': u'set_8'}
5 {'end_token': 34, 'start_token': 32, 'tags': ['CD', 'NNS'], 'string': ['four', 'islands'], 'entity': u'set_10000'}
6 {'end_token': 36, 'start_token': 35, 'tags': ['NNP'], 'string': ['Johnston'], 'entity': u'set_76'}
7 {'end_token': 39, 'start_token': 35, 'tags': ['NNP', 'CC', 'NNP', 'NNS'], 'string': ['Johnston', 'and', 'Sand', 'islands'], 'entity': u'set_100'}
8 {'end_token': 39, 'start_token': 37, 'tags': ['NNP', 'NNS'], 'string': ['Sand', 'islands'], 'entity': u'set_83'}
9 {'end_token': 55, 'start_token': 46, 'tags': ['NNP', 'NNP', 'NNP', 'NNP', 'CC', 'NNP', 'NNP', 'NNP', 'NNP'], 'string': ['North', '-LRB-', 'Akau', '-RRB-', 'and', 'East', '-LRB-', 'Hikina', '-RRB-'], 'entity': u'set_10'}
10 {'end_token': 66, 'start_token': 64, 'tags': ['NNP', 'NNP'], 'string': ['Johnston', 'Atoll'], 'entity': u'set_76'}
11 {'end_token': 77, 'start_token': 69, 'tags': ['CD', 'IN', 'DT', 'NNP', 'NNPS', 'NNP', 'NNP', 'NNP'], 'string': ['one', 'of', 'the', 'United', 'States', 'Minor', 'Outlying', 'Islands'], 'entity': u'set_76'}
12 {'end_token': 74, 'start_token': 72, 'tags': ['NNP', 'NNPS'], 'string': ['United', 'States'], 'entity': u'set_108'}
13 {'end_token': 82, 'start_token': 79, 'tags': ['RB', 'CD', 'NNS'], 'string': ['nearly', '70', 'years'], 'entity': u'set_71'}
14 {'end_token': 85, 'start_token': 83, 'tags': ['DT', 'NN'], 'string': ['the', 'atoll'], 'entity': u'set_76'}
print coref_features.minimal_features(all_markables[14],0,1)
print coref_features.minimal_features(all_markables[14],1,1)
print coref_features.minimal_features(all_markables[14],0,3)
print coref_features.minimal_features(all_markables[14],6,7)
print coref_features.minimal_features(all_markables[14],3,14)
{}
{'new-entity': 1.0}
{'last-token-match': 1}
{'crossover': 1}
{'exact-match': 1, 'last-token-match': 1, 'content-match': 1}
Deliverable 3.2 Implement coref_learning.mention_rank, which should select the highest-scoring antecedent for each markable. (1 point)
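A sketch of one way to write the ranking step: score every candidate antecedent $a \le i$ (including the new-entity candidate $a = i$) as the dot product of its feature vector with the weights, and return the argmax. Ties are broken here toward the earliest candidate; the reference implementation may break them differently.

def mention_rank(markables, i, feats, weights):
    # return the highest-scoring candidate antecedent for mention i
    best_a, best_score = i, None
    for a in range(i + 1):
        f = feats(markables, a, i)
        score = sum(weights[name] * val for name, val in f.items())
        if best_score is None or score > best_score:
            best_a, best_score = a, score
    return best_a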
reload(coref_learning);
reload(coref_features);
hand_weights = defaultdict(float,
{'new-entity':0.5,
'last-token-match':0.6,
'content-match':0.7,
'exact-match':1.}
)
print coref_learning.mention_rank(all_markables[12],1,coref_features.minimal_features,hand_weights)
print coref_learning.mention_rank(all_markables[12],7,coref_features.minimal_features,hand_weights)
1
0
Deliverable 3.3 Now implement coref_learning.compute_instance_update, which computes a perceptron update for instance $i$. (1 point)
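Here is a sketch that is consistent with the example calls below, but it is inferred from them, so treat it as an assumption: the update is empty whenever the predicted antecedent already refers to the same entity as the annotated antecedent $a$; otherwise the features of $a$ are rewarded and the features of the prediction are penalized. It reuses mention_rank from the previous deliverable.

def compute_instance_update(markables, i, a, feats, weights):
    # perceptron update for mention i with annotated antecedent a
    pred = mention_rank(markables, i, feats, weights)
    if markables[pred]['entity'] == markables[a]['entity']:
        return dict()  # prediction is already coreferent with a: no update
    update = defaultdict(float)
    for name, val in feats(markables, a, i).items():
        update[name] += val
    for name, val in feats(markables, pred, i).items():
        update[name] -= val
    return update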
for i,markable in enumerate(all_markables[14][:18]):
    print i,markable
0 {'end_token': 2, 'start_token': 0, 'tags': ['NNP', 'NNP'], 'string': ['Johnston', 'Atoll'], 'entity': u'set_76'}
1 {'end_token': 12, 'start_token': 10, 'tags': ['NNP', 'NNP'], 'string': ['Pacific', 'Ocean'], 'entity': u'set_3'}
2 {'end_token': 18, 'start_token': 17, 'tags': ['NNP'], 'string': ['Hawaii'], 'entity': u'set_107'}
3 {'end_token': 21, 'start_token': 19, 'tags': ['DT', 'NN'], 'string': ['The', 'atoll'], 'entity': u'set_76'}
4 {'end_token': 30, 'start_token': 26, 'tags': ['DT', 'JJ', 'NN', 'NN'], 'string': ['the', 'coral', 'reef', 'platform'], 'entity': u'set_8'}
5 {'end_token': 34, 'start_token': 32, 'tags': ['CD', 'NNS'], 'string': ['four', 'islands'], 'entity': u'set_10000'}
6 {'end_token': 36, 'start_token': 35, 'tags': ['NNP'], 'string': ['Johnston'], 'entity': u'set_76'}
7 {'end_token': 39, 'start_token': 35, 'tags': ['NNP', 'CC', 'NNP', 'NNS'], 'string': ['Johnston', 'and', 'Sand', 'islands'], 'entity': u'set_100'}
8 {'end_token': 39, 'start_token': 37, 'tags': ['NNP', 'NNS'], 'string': ['Sand', 'islands'], 'entity': u'set_83'}
9 {'end_token': 55, 'start_token': 46, 'tags': ['NNP', 'NNP', 'NNP', 'NNP', 'CC', 'NNP', 'NNP', 'NNP', 'NNP'], 'string': ['North', '-LRB-', 'Akau', '-RRB-', 'and', 'East', '-LRB-', 'Hikina', '-RRB-'], 'entity': u'set_10'}
10 {'end_token': 66, 'start_token': 64, 'tags': ['NNP', 'NNP'], 'string': ['Johnston', 'Atoll'], 'entity': u'set_76'}
11 {'end_token': 77, 'start_token': 69, 'tags': ['CD', 'IN', 'DT', 'NNP', 'NNPS', 'NNP', 'NNP', 'NNP'], 'string': ['one', 'of', 'the', 'United', 'States', 'Minor', 'Outlying', 'Islands'], 'entity': u'set_76'}
12 {'end_token': 74, 'start_token': 72, 'tags': ['NNP', 'NNPS'], 'string': ['United', 'States'], 'entity': u'set_108'}
13 {'end_token': 82, 'start_token': 79, 'tags': ['RB', 'CD', 'NNS'], 'string': ['nearly', '70', 'years'], 'entity': u'set_71'}
14 {'end_token': 85, 'start_token': 83, 'tags': ['DT', 'NN'], 'string': ['the', 'atoll'], 'entity': u'set_76'}
15 {'end_token': 92, 'start_token': 91, 'tags': ['JJ'], 'string': ['American'], 'entity': u'set_108'}
16 {'end_token': 97, 'start_token': 95, 'tags': ['DT', 'NN'], 'string': ['that', 'time'], 'entity': u'set_71'}
17 {'end_token': 98, 'start_token': 97, 'tags': ['PRP'], 'string': ['it'], 'entity': u'set_76'}
print "prediction:",coref_learning.mention_rank(all_markables[14],14,coref_features.minimal_features,hand_weights)
print "update at a=3:",coref_learning.compute_instance_update(all_markables[14],14,3,coref_features.minimal_features,hand_weights)
print "update at a=10:",coref_learning.compute_instance_update(all_markables[14],14,10,coref_features.minimal_features,hand_weights)
print "update at a=12:",coref_learning.compute_instance_update(all_markables[14],14,12,coref_features.minimal_features,hand_weights)
print "update at a=4:",coref_learning.compute_instance_update(all_markables[14],14,1,coref_features.minimal_features,hand_weights)
prediction: 3
update at a=3: {}
update at a=10: {}
update at a=12: defaultdict(<type 'float'>, {'exact-match': -1.0, 'last-token-match': -1.0, 'content-match': -1.0})
update at a=4: defaultdict(<type 'float'>, {'exact-match': -1.0, 'last-token-match': -1.0, 'content-match': -1.0})
Deliverable 3.4 You are now ready to implement coref_learning.train_avg_perceptron.
You can probably get away with "naive" weight averaging, unless you want to go crazy with features later.
Make sure that your running total of weights gets updated after each markable.
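A sketch of the training loop with "naive" averaging: keep a running total of the weight vector, add the current weights to the total after every markable, and append the average to the history after each iteration. The per-iteration numbers printed in the notebook output are presumably counts of nonzero updates; this sketch just tracks that count in a variable. The exact signature and any printing behavior of the reference code may differ.

def train_avg_perceptron(all_markables, feats, N_its=5):
    # averaged perceptron for mention ranking over a list of documents;
    # returns one averaged weight vector per iteration
    weights = defaultdict(float)
    tot_weights = defaultdict(float)  # running total of the weight vector
    weight_hist = []
    t = 0.0
    for it in range(N_its):
        num_updates = 0  # the reference code apparently prints this each iteration
        for markables in all_markables:
            true_ants = coref.get_true_antecedents(markables)
            for i in range(len(markables)):
                update = compute_instance_update(markables, i, true_ants[i], feats, weights)
                if len(update) > 0:
                    num_updates += 1
                    for name, val in update.items():
                        weights[name] += val
                # add to the running total after *every* markable,
                # not just after the markables that trigger an update
                t += 1
                for name, val in weights.items():
                    tot_weights[name] += val
        avg = defaultdict(float, dict((k, v / t) for k, v in tot_weights.items()))
        weight_hist.append(avg)
    return weight_hist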
reload(coref_features);
reload(coref_learning);
theta_simple = coref_learning.train_avg_perceptron([all_markables[3][:10]],coref_features.minimal_features,N_its=2)
3 2
theta_simple[-1]
defaultdict(float,
{'content-match': 0.6,
'exact-match': 0.6,
'last-token-match': 0.6,
'new-entity': 0.2})
theta_hist = coref_learning.train_avg_perceptron(all_markables,coref_features.minimal_features,N_its=5)
1139 1137 1137 1137 1137
coref_learning.eval_weight_hist(all_markables,theta_hist,coref_features.minimal_features);
F: 0.6959 R: 0.5853 P:0.8582
F: 0.6954 R: 0.5849 P:0.8575
F: 0.6954 R: 0.5849 P:0.8575
F: 0.6954 R: 0.5849 P:0.8575
F: 0.6954 R: 0.5849 P:0.8575
theta_hist[-1]
defaultdict(float,
{'content-match': 0.5576695194206714,
'crossover': -0.5720868992758393,
'exact-match': 0.43515470704410797,
'last-token-match': 0.230348913759052,
'new-entity': 0.423633969716919})
Already pretty competitive with the rule-based alternatives, at least on the training set. Let's run on the dev set.
coref_learning.eval_weight_hist(all_markables_dev,theta_hist,coref_features.minimal_features);
F: 0.6671 R: 0.5756 P:0.7933
F: 0.6671 R: 0.5756 P:0.7933
F: 0.6671 R: 0.5756 P:0.7933
F: 0.6671 R: 0.5756 P:0.7933
F: 0.6671 R: 0.5756 P:0.7933
# run this block to output your predictions
coref.write_predictions(coref_learning.make_resolver(coref_features.minimal_features,
theta_hist[-1]),
all_markables_dev,
'predictions/minimal-dev.preds')
coref.eval_predictions('predictions/minimal-dev.preds',all_markables_dev);
F: 0.6671 R: 0.5756 P:0.7933
Deliverable 3.5 Implement distance features in coref_features.distance_features, measuring the mention distance and the token distance. Specifically:
Mention distance is the number of intervening mentions between $i$ and $j$, computed as $i - j$.
Token distance is the number of tokens between the start of $i$ and the end of $j$.
These should be binary features, up to a maximum distance of 10, with the final feature indicating distance of 10 and above. The desired behavior is shown below. (0.25 points)
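A sketch consistent with the desired behavior below, using the same argument convention as minimal_features (candidate antecedent first, current mention second), with the cap of 10 hard-coded:

def distance_features(markables, a, i):
    # binary mention-distance and token-distance features, capped at 10;
    # the new-entity case a == i gets no distance features
    if a == i:
        return dict()
    f = dict()
    mention_dist = min(i - a, 10)
    token_dist = min(markables[i]['start_token'] - markables[a]['end_token'], 10)
    f['mention-distance-' + str(mention_dist)] = 1
    f['token-distance-' + str(token_dist)] = 1
    return f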
reload(coref_features);
for i,markable_i in enumerate(all_markables[14][:4]):
    print i,markable_i
0 {'end_token': 2, 'start_token': 0, 'tags': ['NNP', 'NNP'], 'string': ['Johnston', 'Atoll'], 'entity': u'set_76'}
1 {'end_token': 12, 'start_token': 10, 'tags': ['NNP', 'NNP'], 'string': ['Pacific', 'Ocean'], 'entity': u'set_3'}
2 {'end_token': 18, 'start_token': 17, 'tags': ['NNP'], 'string': ['Hawaii'], 'entity': u'set_107'}
3 {'end_token': 21, 'start_token': 19, 'tags': ['DT', 'NN'], 'string': ['The', 'atoll'], 'entity': u'set_76'}
print coref_features.distance_features(all_markables[14],0,0)
print coref_features.distance_features(all_markables[14],0,1)
print coref_features.distance_features(all_markables[14],0,2)
print coref_features.distance_features(all_markables[14],1,3)
print coref_features.distance_features(all_markables[14],0,30)
{}
{'token-distance-8': 1, 'mention-distance-1': 1}
{'token-distance-10': 1, 'mention-distance-2': 1}
{'mention-distance-2': 1, 'token-distance-7': 1}
{'token-distance-10': 1, 'mention-distance-10': 1}
Deliverable 3.6 Implement coref_features.make_feature_union, which should take a list of feature functions, and return a function that computes the union of all features in the list. You can assume the feature functions don't use the same name for any feature. (0.25 points)
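A sketch of the union builder; it simply merges the dicts returned by each feature function:

def make_feature_union(feat_func_list):
    # return a feature function whose output is the union of the outputs of
    # every function in feat_func_list (feature names are assumed disjoint)
    def union_feats(markables, a, i):
        f = dict()
        for feat_func in feat_func_list:
            f.update(feat_func(markables, a, i))
        return f
    return union_feats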
reload(coref_features);
joint_feats1 = coref_features.make_feature_union([coref_features.minimal_features,
coref_features.distance_features])
print joint_feats1(all_markables[12],1,3)
print joint_feats1(all_markables[12],0,3)
print joint_feats1(all_markables[12],0,7)
print joint_feats1(all_markables[12],10,10)
{'token-distance-6': 1, 'mention-distance-2': 1}
{'token-distance-10': 1, 'mention-distance-3': 1}
{'mention-distance-7': 1, 'token-distance-10': 1, 'last-token-match': 1}
{'new-entity': 1.0}
theta_hist = coref_learning.train_avg_perceptron(all_markables,joint_feats1,N_its=10)
1378 1421 1413 1414 1414 1413 1407 1418 1424 1425
coref_learning.eval_weight_hist(all_markables,theta_hist,joint_feats1);
F: 0.7002 R: 0.5849 P:0.8721
F: 0.7037 R: 0.5906 P:0.8704
F: 0.6997 R: 0.5844 P:0.8715
F: 0.6998 R: 0.5844 P:0.8720
F: 0.6959 R: 0.5800 P:0.8695
F: 0.6974 R: 0.5814 P:0.8715
F: 0.6973 R: 0.5814 P:0.8709
F: 0.6976 R: 0.5818 P:0.8710
F: 0.6973 R: 0.5814 P:0.8709
F: 0.6976 R: 0.5818 P:0.8710
Pretty much the same on the training set.
coref_learning.eval_weight_hist(all_markables_dev,theta_hist,joint_feats1);
F: 0.6911 R: 0.5937 P:0.8266
F: 0.6947 R: 0.5985 P:0.8278
F: 0.6925 R: 0.5949 P:0.8283
F: 0.6925 R: 0.5949 P:0.8283
F: 0.6906 R: 0.5925 P:0.8277
F: 0.6906 R: 0.5925 P:0.8277
F: 0.6892 R: 0.5913 P:0.8260
F: 0.6892 R: 0.5913 P:0.8260
F: 0.6892 R: 0.5913 P:0.8260
F: 0.6892 R: 0.5913 P:0.8260
Better on dev.
Deliverable 3.7 Implement coref_features.make_feature_cross_product, which should take two feature functions, and return a function that computes the cross product of their features. Desired behavior:
$f_1 = \{(i, x_i), (j, x_j)\}$
$f_2 = \{(m, x_m), (n, x_n)\}$
$f_1 \times f_2 = \{((i,m), x_i \times x_m),\ ((i,n), x_i \times x_n),\ ((j,m), x_j \times x_m),\ ((j,n), x_j \times x_n)\}$
The product of features "feat1" and "feat2" should have the name "feat1-feat2", as shown in the example below.
(0.25 points)
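A sketch of the cross-product builder, following the naming convention above ("feat1-feat2") and multiplying the feature values:

def make_feature_cross_product(feat_func1, feat_func2):
    # return a feature function producing every pairwise product of the
    # features from feat_func1 and feat_func2
    def product_feats(markables, a, i):
        f1 = feat_func1(markables, a, i)
        f2 = feat_func2(markables, a, i)
        f = dict()
        for name1, val1 in f1.items():
            for name2, val2 in f2.items():
                f[name1 + '-' + name2] = val1 * val2
        return f
    return product_feats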
reload(coref_features);
prod_feats1 = coref_features.make_feature_cross_product(coref_features.minimal_features,
coref_features.distance_features)
print coref_features.minimal_features(all_markables[14],3,14)
print coref_features.distance_features(all_markables[14],3,14)
print prod_feats1(all_markables[14],3,14)
{'exact-match': 1, 'last-token-match': 1, 'content-match': 1}
{'token-distance-10': 1, 'mention-distance-10': 1}
{'content-match-mention-distance-10': 1, 'exact-match-mention-distance-10': 1, 'content-match-token-distance-10': 1, 'last-token-match-mention-distance-10': 1, 'last-token-match-token-distance-10': 1, 'exact-match-token-distance-10': 1}
Now let's try a combined feature set, which includes the union of the product features and the original features.
feats = coref_features.make_feature_union([coref_features.minimal_features,
coref_features.distance_features,
prod_feats1])
theta_hist = coref_learning.train_avg_perceptron(all_markables,feats,N_its=10)
1399 1388 1405 1410 1397 1406 1409 1394 1383 1409
coref_learning.eval_weight_hist(all_markables,theta_hist,feats);
F: 0.7064 R: 0.5994 P:0.8599
F: 0.7067 R: 0.5976 P:0.8645
F: 0.6873 R: 0.5677 P:0.8705
F: 0.6914 R: 0.5734 P:0.8705
F: 0.6914 R: 0.5734 P:0.8705
F: 0.6909 R: 0.5726 P:0.8709
F: 0.6907 R: 0.5726 P:0.8703
F: 0.6907 R: 0.5726 P:0.8703
F: 0.6909 R: 0.5726 P:0.8709
F: 0.6909 R: 0.5726 P:0.8709
coref_learning.eval_weight_hist(all_markables_dev,theta_hist,feats);
F: 0.6959 R: 0.6046 P:0.8197
F: 0.6955 R: 0.6034 P:0.8207
F: 0.6837 R: 0.5828 P:0.8268
F: 0.6860 R: 0.5865 P:0.8262
F: 0.6860 R: 0.5865 P:0.8262
F: 0.6851 R: 0.5852 P:0.8259
F: 0.6860 R: 0.5865 P:0.8262
F: 0.6860 R: 0.5865 P:0.8262
F: 0.6860 R: 0.5865 P:0.8262
F: 0.6860 R: 0.5865 P:0.8262
This doesn't help much in this case, but you may find it useful in the bakeoff.
Deliverable 3.8 (7650 only; 4650 optional)
To match nominals, it is often necessary to capture semantics. Find a paper (in ACL, NAACL, EACL, or TACL, since 2007) that attempts to use semantic analysis to do nominal coreference, and explain:
What form of semantics they are trying to capture (e.g., synonymy, hypernymy, predicate-argument, distributional)
How they formalize semantics into features, constraints, or some other preference
How much it helps
Put your answer in text-answers.md (1 point)
As usual, if you are in 4650 and you do this problem, you will be graded on the 7650 rubric.
Final bakeoff!
Ideas for additional features and other improvements:
Large-margin training
Cost-sensitive training to balance precision and recall
Syntax (you can parse all the markables as a preprocessing step)
Tree distance
Syntactic parallelism
Better head matching
Word vector matching
Neural representations of each entity (Wiseman et al 2016)
Multilayer perceptron for mention ranking
Feel free to search the research literature (via Google scholar) to get ideas. If you use an idea from another paper, mention the paper (authors, title, and URL) in your comments in coref_features.py
Deliverable 3.9. Run the code blocks below to output predictions for both the dev and test sets. Note that theta_hist contains the history weights over all training epochs. You don't have to use the final set of weights for your output.
Scoring:
Dev F1 > .71: +0.25 points
Dev F1 > .72: +0.25 points
Dev F1 > .73: +0.25 points
Test F1 > .7: +0.25 points
The test set threshold is a low bar if you pass the dev thresholds without badly overfitting.
Extra credit (evaluated on test set)
Best in 4650: +0.5 points
Best in 7650: +0.5 points
Better than best TA/prof system: +0.5 points
reload(coref_features);
# writing a function to make the bakeoff features can be convenient
# but you can define them directly if you want
# we are only evaluating the outputs
prod_feats_bakeoff = coref_features.make_feature_cross_product(coref_features.minimal_bakeoff_features,
coref_features.distance_features)
bakeoff_feats = coref_features.make_feature_union([coref_features.minimal_bakeoff_features,
coref_features.distance_features,
prod_feats_bakeoff])
theta_hist = coref_learning.train_avg_perceptron(all_markables,bakeoff_feats,N_its=5)
1261 1211 1238 1240 1240
coref_learning.eval_weight_hist(all_markables,theta_hist,bakeoff_feats);
F: 0.7077 R: 0.6113 P:0.8404
F: 0.7085 R: 0.6157 P:0.8343
F: 0.7222 R: 0.6368 P:0.8341
F: 0.7200 R: 0.6332 P:0.8343
F: 0.7219 R: 0.6341 P:0.8379
coref_learning.eval_weight_hist(all_markables_dev,theta_hist,bakeoff_feats);
F: 0.7309 R: 0.6421 P:0.8482
F: 0.7381 R: 0.6542 P:0.8466
F: 0.7456 R: 0.6699 P:0.8407
F: 0.7442 R: 0.6651 P:0.8449
F: 0.7407 R: 0.6614 P:0.8415
# run this block to output your predictions
coref.write_predictions(coref_learning.make_resolver(bakeoff_feats,
theta_hist[2]),
all_markables_dev,
'predictions/bakeoff-dev.preds')
# run this block to output your predictions
coref.write_predictions(coref_learning.make_resolver(bakeoff_feats,
theta_hist[2]),
all_markables_te,
'predictions/bakeoff-te.preds')