COMS W4705: Problem Set 2
Total points: 140
Analytic Problems (due March 2nd)
Question 1 (20 points)
A probabilistic context-free grammar G = (N, Σ, R, S, q) in Chomsky Normal Form is defined as follows:
• N is a set of non-terminal symbols (e.g., NP, VP, S etc.)
• Σ is a set of terminal symbols (e.g., cat, dog, the, etc.)
• R is a set of rules which take one of two forms:
– X → Y1Y2 for X ∈ N, and Y1, Y2 ∈ N
– X → Y for X ∈ N, and Y ∈ Σ
• S ∈ N is a distinguished start symbol
• q is a function that maps every rule in R to a probability, which satisfies the following conditions:
– ∀r ∈ R, q(r) ≥ 0
– ∀X ∈ N, Σ_{X → α ∈ R} q(X → α) = 1
Now assume we have a probabilistic CFG G′, which has a set of rules R which take one of the two following forms:
• X → Y1Y2 . . . Yn for X ∈ N, n ≥ 2, and ∀i, Yi ∈ N
• X → Y for X ∈ N, and Y ∈ Σ
Note that this is a more permissive definition than Chomsky normal form, as some rules in the grammar may
have more than 2 non-terminals on the right-hand side. An example of a grammar that satisfies this more
permissive definition is as follows:
S → NP VP 0.7
S → NP NP VP 0.3
VP → Vt NP 0.8
VP → Vt NP PP 0.2
NP → DT NN NN 0.3
NP → NP PP 0.7
PP → IN NP 1.0
Vt → saw 1.0
NN → man 0.7
NN → woman 0.2
NN → telescope 0.1
DT → the 1.0
IN → with 0.5
IN → in 0.5
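For concreteness, this example grammar can be written down in code and checked against the condition above that, for every non-terminal X, the probabilities of all rules with X on the left-hand side sum to one. This is only an illustrative sketch; the dictionary layout is an arbitrary choice and is not used elsewhere in the assignment.

from collections import defaultdict

# Rules of the example permissive PCFG, written as (lhs, rhs) -> probability.
# Right-hand sides are tuples of non-terminals, or a single terminal word.
q = {
    ("S", ("NP", "VP")): 0.7,
    ("S", ("NP", "NP", "VP")): 0.3,
    ("VP", ("Vt", "NP")): 0.8,
    ("VP", ("Vt", "NP", "PP")): 0.2,
    ("NP", ("DT", "NN", "NN")): 0.3,
    ("NP", ("NP", "PP")): 0.7,
    ("PP", ("IN", "NP")): 1.0,
    ("Vt", ("saw",)): 1.0,
    ("NN", ("man",)): 0.7,
    ("NN", ("woman",)): 0.2,
    ("NN", ("telescope",)): 0.1,
    ("DT", ("the",)): 1.0,
    ("IN", ("with",)): 0.5,
    ("IN", ("in",)): 0.5,
}

# Check the PCFG conditions: every rule probability is non-negative, and the
# probabilities of all rules with the same left-hand side sum to 1.
totals = defaultdict(float)
for (lhs, rhs), prob in q.items():
    assert prob >= 0.0
    totals[lhs] += prob

for lhs, total in totals.items():
    assert abs(total - 1.0) < 1e-9, (lhs, total)
print("all rule probabilities sum to 1 per non-terminal")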
Question 1(a): Describe how to transform a PCFG G′, in this more permissive form, into an “equivalent” PCFG G in Chomsky normal form. By equivalent, we mean that there is a one-to-one function f between derivations in G′ and derivations in G, such that for any derivation T′ under G′ which has probability p, f(T′) also has probability p. (Note: one major motivation for this transformation is that we can then apply the dynamic programming parsing algorithm, described in lecture, to the transformed grammar.) Hint: think about adding new rules with new non-terminals to the grammar.
Question 1(b): Show the resulting grammar G after applying your transformation to the example PCFG
shown above.
Question 2 (20 points)
Nathan L. Pedant decides to build a treebank. He finally produces a corpus which contains the following
three parse trees:
(S (NP John) (VP (V1 said) (SBAR (COMP that) (S (NP Sally) (VP (VP (V2 snored)) (ADVP loudly))))))

(S (NP Sally) (VP (V1 declared) (SBAR (COMP that) (S (NP Bill) (VP (VP (V2 ran)) (ADVP quickly))))))

(S (NP Fred) (VP (V1 pronounced) (SBAR (COMP that) (S (NP Jeff) (VP (VP (V2 swam)) (ADVP elegantly))))))
Clarissa Lexica then purchases the treebank, and decides to build a PCFG, and a parser, using Nathan’s data.
Question 2(a): Show the PCFG that Clarissa would derive from this treebank.
Question 2(b): Show two parse trees for the string “Jeff pronounced that Fred snored loudly”, and calculate
their probabilities under the PCFG.
Question 2(c): Clarissa is shocked and dismayed (see 2(b)) that “Jeff pronounced that Fred snored loudly”
has two possible parses, and that one of them—that Jeff is doing the pronouncing loudly—has relatively
high probability, in spite of it having the ADVP loudly modifying the “higher” verb, pronounced. This type
of high attachment is never seen in the corpus, so the PCFG is clearly missing something. Clarissa decides
to fix the treebank, by altering some non-terminal labels in the corpus. Show one such transformation that
results in a PCFG that gives zero probability to parse trees with “high” attachments. (Your solution should
systematically refine some non-terminals in the treebank, in a way that slightly increases the number of
non-terminals in the grammar, but allows the grammar to capture the distinction between high and low
attachment to VPs.)
Question 3 (20 points)
Consider the CKY algorithm for finding the maximum probability for any tree when given as input a sequence of words x1, x2, . . . , xn. As usual, we use N to denote the set of non-terminals in the grammar, and
S to denote the start symbol.
The base case in the recursive definition is as follows: for all i = 1 . . . n, for all X ∈ N,
π(i, i, X) = q(X → xi) if X → xi ∈ R, and π(i, i, X) = 0 otherwise
and the recursive definition is as follows: for all (i, j) such that 1 ≤ i < j ≤ n, for all X ∈ N,
π(i, j, X) = max_{X → Y Z ∈ R, s ∈ {i, ..., (j−1)}} ( q(X → Y Z) × π(i, s, Y) × π(s + 1, j, Z) )
Finally, we return
π(1, n, S) = max_{t ∈ T_G(s)} p(t)
Now assume that we want to find the maximum probability for any balanced tree for a sentence. Here are
some example balanced trees:
(B (D e) (C f))

(S (B (D e) (C f)) (A (F f) (G g)))

(A (S (B (D e) (C f)) (A (F f) (G g))) (S (B (D e) (C f)) (A (F f) (G g))))
It can be seen that in balanced trees, whenever a rule of the form X → Y Z is seen in the tree, then the
non-terminals Y and Z dominate the same number of words.
NOTE: we will assume that the length of the sentence, n, is n = 2^x for some integer x. For example, n = 2, or 4, or 8, or 16, etc.
Question 3(a): Complete the recursive definition below, so that the algorithm returns the maximum probability for any balanced tree underlying a sentence x1, x2, . . . , xn.
Base case: for all i = 1 . . . n, for all X ∈ N,
π(i, i, X) = q(X → xi) if X → xi ∈ R, and π(i, i, X) = 0 otherwise
Recursive case: (Complete)
Return:
π(1, n, S) = max_{t ∈ T_G(s)} p(t)
Programming Problems (due March 12th)
Please read the submission instructions, policy and hints on the course webpage.
In this programming assignment, you will train a probabilistic context-free grammar (PCFG) for syntactic
parsing. In class we defined a PCFG as a tuple
G = (N, Σ, S, R, q)
where N is the set of non-terminals, Σ is the vocabulary, S is a root symbol, R is a set of rules, and q is a
probabilistic model. We will assume that the grammar is in Chomsky normal form (CNF). Recall from class
that this means all rules in R will be either binary rules X → Y1 Y2 where X, Y1, Y2 ∈ N or unary rules
X → Z where X ∈ N and Z ∈ Σ.
You will estimate the parameters of the grammar from a corpus of annotated sentences from the Wall Street
Journal. Before diving into the details of the corpus format, we will discuss the rule structure of the grammar.
You will not have to implement this section but it will be important to know when building your parser.
Rule Structure
Consider a typical tagged sentence from the Wall Street Journal
There/DET is/VERB no/DET asbestos/NOUN in/ADP our/PRON products/NOUN now/ADV ./.
The correct parse tree for the sentence (as annotated by linguists) is as follows
(S
  (NP (DET There))
  (VP (VERB is)
      (NP (DET no) (NOUN asbestos))
      (PP (ADP in) (NP (PRON our) (NOUN products)))
      (ADVP (ADV now)))
  (. .))
Note that nodes at the fringe of the tree are words, the nodes one level up are part-of-speech tags, and the
internal nodes are syntactic tags. For the definition of the syntactic tags used in this corpus, see “The Penn
Treebank: An Overview.” The part-of-speech tags should be self-explanatory; for further reference, see “A Universal Part-of-Speech Tagset.” Both are linked to on the website.
Unfortunately the annotated parse tree is not in Chomsky normal form. For instance, the rule S → NP VP . has three symbols on its right-hand side, and the rule NP → DET has a single non-terminal child. We will need to work around this problem by applying a transformation to the grammar.
We will apply two transformations to the tree. The first is to collapse illegal unary rules of the form X → Y, where X, Y ∈ N, into new non-terminals X+Y, for example
(NP (DET There)) → (NP+DET There)
The second is to split n-ary rules X → Y1 Y2 Y3, where X, Y1, Y2, Y3 ∈ N, into right-branching binary rules of the form X → Y1 X and X → Y2 Y3, for example
(S NP VP .) → (S NP (S VP .))
With these transformations to the grammar, the new correct parse for this sentence is
(S
  (NP+DET There)
  (S
    (VP (VERB is)
        (VP (NP (DET no) (NOUN asbestos))
            (VP (PP (ADP in) (NP (PRON our) (NOUN products)))
                (ADVP+ADV now))))
    (. .)))
Converting the grammar to CNF will greatly simplify the algorithm for decoding; however, as you saw in
Question 1, there are other transformations into CNF that better preserve properties of the original grammar.
All the parses we provide for the assignment will be in the transformed format; you will not need to implement this transformation, but you should understand how it works.
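For intuition, here is a minimal sketch of the two transformations applied to trees in the nested-list (JSON) format described in the next section. It is not the code used to prepare the data, and the function names are illustrative only.

def collapse_unary(tree):
    """Collapse unary rules X -> Y (Y a non-terminal) into a single node X+Y."""
    if isinstance(tree, str):                       # a word at the fringe
        return tree
    children = [collapse_unary(c) for c in tree[1:]]
    if len(children) == 1 and isinstance(children[0], list):
        child = children[0]                         # X -> Y becomes X+Y -> ...
        return [tree[0] + "+" + child[0]] + child[1:]
    return [tree[0]] + children

def binarize(tree):
    """Split n-ary rules X -> Y1 Y2 ... Yn into right-branching binary rules."""
    if isinstance(tree, str):
        return tree
    children = [binarize(c) for c in tree[1:]]
    if len(children) <= 2:
        return [tree[0]] + children
    # X -> Y1 Y2 ... Yn  becomes  X -> Y1 X, with X -> Y2 ... Yn split recursively
    return [tree[0], children[0], binarize([tree[0]] + children[1:])]

original = ["S",
            ["NP", ["DET", "There"]],
            ["VP", ["VERB", "is"],
                   ["NP", ["DET", "no"], ["NOUN", "asbestos"]],
                   ["PP", ["ADP", "in"],
                          ["NP", ["PRON", "our"], ["NOUN", "products"]]],
                   ["ADVP", ["ADV", "now"]]],
            [".", "."]]

print(binarize(collapse_unary(original)))

Running this on the example produces the transformed tree shown above, with NP+DET and ADVP+ADV nodes and right-branching S and VP nodes.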
Corpus Format
We provide a training data set with parses, parse_train.dat, and a development set of sentences, parse_dev.dat, along with the correct parses, parse_dev.key. All the data comes from the Wall Street Journal corpus and consists of sentences of fewer than 15 words for efficiency.
In the training set, each line of the file consists of a single parse tree in CNF. The trees are represented as nested multi-dimensional arrays conforming to the JSON standard. JSON is a general data encoding standard (similar to XML) and is supported in pretty much every language (see http://www.json.org/).
Binary rules are represented as arrays of length 3. For example, the subtree (NP DT NOUN) becomes
["NP", ["DT", ...], ["NOUN", ...]]
Unary rules are represented as arrays of length 2. For example, the subtree (NOUN asbestos) becomes
["NOUN", "asbestos"]
For example, the file tree.example contains the tree given above:
["S", ["NP", ["DET", "There"]], ["S", ["VP", ["VERB", "is"], ["VP", ["NP", ["DET", "no"], ["NOUN", "asbestos"]], ["VP", ["PP", ["ADP", "in"], ["NP", ["PRON", "our"], ["NOUN", "products"]]], ["ADVP", ["ADV", "now"]]]]], [".", "."]]]
The tree is represented as a recursive, multi-dimensional JSON array. If the array is of length 3 it is a binary rule; if it is of length 2 it is a unary rule, as shown above.
As an example, assume we want to access the node for “is” by walking down the tree. Starting at the root S, we take the child at index 2 (the inner S node), then the child at index 1 (the VP node), then the child at index 1 again (the VERB node), and finally the element at index 1, which is the word “is”.
In Python, we can read and access this node using the following code
import json
tree = json.loads(open("tree.example").readline())
print(tree[2][1][1][1])
The first line reads the tree into an array, and the second line walks down the tree to the node for “is”.
Other languages have similar (although possibly more verbose) libraries for reading in these arrays. Consult
the TAs if you are unable to find a parser for your favorite language.
Tree Counts
In order to estimate the parameters of the model you will need the counts of the rules used in the corpus. The script count_cfg_freqs.py reads in a training file and produces these counts. Run the script on the training data and pipe the output into some file:
python count_cfg_freqs.py parse_train.dat > cfg.counts
Each line in the output contains the count for one event. There are three types of counts:
• Lines where the second token is NONTERMINAL contain counts of non-terminals Count(X)
17 NONTERMINAL NP
indicates that the non-terminal NP was used 17 times.
• Lines where the second token is BINARYRULE contain counts of binary rules Count(X → Y1 Y2), for example
13 BINARYRULE NP DET NOUN
indicates that binary rule NP → DET NOUN was used 13 times in the training data.
• Lines where the second token is UNARYRULE contain counts of unary rules Count(X → Y ).
14 UNARYRULE NP+NOUN products
indicates that the rule NP+NOUN → products was seen 14 times in the training data.
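As a concrete illustration, the counts file can be loaded into Python dictionaries along the following lines. This is only a sketch; the function and variable names are not part of the provided scripts.

from collections import defaultdict

def read_counts(counts_file):
    """Load non-terminal, binary-rule and unary-rule counts from cfg.counts."""
    nonterm = defaultdict(int)     # Count(X)
    binary = defaultdict(int)      # Count(X -> Y1 Y2)
    unary = defaultdict(int)       # Count(X -> w)
    with open(counts_file) as f:
        for line in f:
            parts = line.split()
            count, kind = int(parts[0]), parts[1]
            if kind == "NONTERMINAL":
                nonterm[parts[2]] += count
            elif kind == "BINARYRULE":
                binary[(parts[2], parts[3], parts[4])] += count
            elif kind == "UNARYRULE":
                unary[(parts[2], parts[3])] += count
    return nonterm, binary, unary

# Example usage:
# nonterm, binary, unary = read_counts("cfg.counts")
# print(binary[("NP", "DET", "NOUN")])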
Question 4 (20 points)
• As in the tagging model, we need to predict emission probabilities for words in the test data that do not occur in the training data. Recall our approach is to map infrequent words in the training data to a common class and to treat unseen words as members of this class. Replace infrequent words (Count(x) < 5) in the original training data file with a common symbol _RARE_. Note that this is more difficult than the previous assignment because you will need to read the training file, find the words at the fringe of the tree, and rewrite the tree out in the correct format (one way of doing this is sketched below). We provide a script, pretty_print_tree.py, which you can run as python pretty_print_tree.py parse_dev.key to view your trees in a more readable format. The script will also throw an error if your output is not in the correct format. Re-run count_cfg_freqs.py to produce new counts.
Submit your code and the new training file it produces.
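One possible way to do this replacement is sketched below, assuming trees are read and written in the JSON format described earlier; the helper function names and the output file name are arbitrary choices, not part of the provided materials.

import json
from collections import Counter

def count_words(tree, counter):
    """Accumulate counts of the words at the fringe of a tree."""
    if isinstance(tree, list):
        if len(tree) == 2 and isinstance(tree[1], str):
            counter[tree[1]] += 1          # unary rule X -> word
        else:
            for child in tree[1:]:
                count_words(child, counter)

def replace_rare(tree, rare_words):
    """Return a copy of the tree with rare fringe words replaced by _RARE_."""
    if len(tree) == 2 and isinstance(tree[1], str):
        word = tree[1]
        return [tree[0], "_RARE_" if word in rare_words else word]
    return [tree[0]] + [replace_rare(child, rare_words) for child in tree[1:]]

if __name__ == "__main__":
    trees = [json.loads(line) for line in open("parse_train.dat")]
    counts = Counter()
    for t in trees:
        count_words(t, counts)
    rare = {w for w, c in counts.items() if c < 5}
    with open("parse_train.RARE.dat", "w") as out:
        for t in trees:
            out.write(json.dumps(replace_rare(t, rare)) + "\n")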
Question 5 (40 points)
• Using the counts produced by count_cfg_freqs.py, write a function that computes the rule parameters
q(X → Y1 Y2) = Count(X → Y1 Y2) / Count(X)
q(X → w) = Count(X → w) / Count(X)
• Using the maximum likelihood estimates for the rule parameters, implement the CKY algorithm to compute
arg max_{t ∈ T_G(s)} p(t).
Your parser should read in the counts file (with rare words) and the file parse_dev.dat (which contains just the sentences from parse_dev.key, without the parses) and produce output in the following format: the tree for one sentence per line, represented in the JSON tree format specified above. Each line should correspond to a parse for a sentence from parse_dev.dat. You can view your output using pretty_print_parse.py and check your performance with eval_parser.py by running python eval_parser.py parse_dev.key prediction_file. The evaluator takes two files where each line is a matching parse in JSON format and outputs an evaluation score.
Note: There is a slight change from the class notes. Some of these sentences are fragments, and they do not have S as the root. For the purpose of this assignment you should change the last line of the CKY algorithm to
if π[1, n, S] ≠ 0 return π[1, n, S]
else return max_{X ∈ N} π[1, n, X]
This should help performance. (A rough sketch of the estimation step and the CKY recursion appears after this question.)
Submit your program and report on the performance of your model. In addition, write up any observations you made.
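The sketch below shows how the maximum-likelihood estimates and the CKY recursion fit together. It is a rough outline rather than a reference solution: it assumes counts loaded as in the earlier read_counts sketch, returns only the maximum probability (a real parser also needs backpointers to recover the tree), and omits the mapping of unseen words to _RARE_.

from collections import defaultdict

def rule_params(nonterm, binary, unary):
    """Maximum-likelihood estimates q(X -> Y1 Y2) and q(X -> w)."""
    q_bin = {(x, y1, y2): c / nonterm[x] for (x, y1, y2), c in binary.items()}
    q_un = {(x, w): c / nonterm[x] for (x, w), c in unary.items()}
    return q_bin, q_un

def cky_max_probability(sentence, nonterms, q_bin, q_un, root="S"):
    """Maximum probability of any tree for the sentence under the PCFG."""
    n = len(sentence)
    pi = defaultdict(float)                       # pi[(i, j, X)], defaults to 0.0
    for i, word in enumerate(sentence, start=1):  # base case: preterminal rules
        for x in nonterms:
            pi[(i, i, x)] = q_un.get((x, word), 0.0)
    for length in range(1, n):                    # spans of increasing length
        for i in range(1, n - length + 1):
            j = i + length
            for (x, y, z), q in q_bin.items():    # recursive case
                best = max((q * pi[(i, s, y)] * pi[(s + 1, j, z)]
                            for s in range(i, j)), default=0.0)
                if best > pi[(i, j, x)]:
                    pi[(i, j, x)] = best
    if pi[(1, n, root)] != 0.0:
        return pi[(1, n, root)]
    # Some sentences are fragments without S as root: fall back to the best symbol.
    return max(pi[(1, n, x)] for x in nonterms)

A full parser would additionally record, for each (i, j, X), the rule and split point that achieved the maximum, and would map words unseen in training to _RARE_ before looking up q(X → w).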
Question 6 (20 Points)
Consider the following subtree of the original parse.
(VP VERB (NP DET NOUN) PP ADVP)
The tree structure makes the implicit independence assumption that the rule NP → DET NOUN is generated independently of the parent non-terminal VP, i.e. the probability of generating this subtree in context is
p(VP → VERB NP PP ADVP | VP) p(NP → DET NOUN | NP)
However, often this independence assumption is too strong. It can be informative for the lower rule to
condition on the parent non-terminal as well, i.e. model generating this subtree as
p(VP → VERB NP PP ADVP | VP)p(NP → DET NOUN | NP, parent=VP)
This is a similar idea to using a trigram language model, except with parent non-terminals instead of words.
We can implement this model without changing our parsing algorithm by applying a tree transformation
known as vertical markovization. We simply augment each non-terminal with the symbol of its parent.
(VP^S VERB^VP (NP^VP DET^NP NOUN^NP) PP^VP ADVP^VP)
We apply vertical markovization before binarization; the new non-terminals are augmented with their original parent. After applying both transformations the final tree will look like
(VP^S VERB^VP (VP^S (NP^VP DET^NP NOUN^NP) (VP^S PP^VP ADVP^VP)))
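To make the transformation concrete, here is a minimal sketch of one level of vertical markovization applied to a JSON tree, before binarization, as described above. It is not the script used to produce parse_train_vert.dat, and the ^ separator for the parent annotation is an assumed notation.

def vertical_markovize(tree, parent=None):
    """Augment each non-terminal with the symbol of its parent (one level)."""
    if isinstance(tree, str):
        return tree                                  # words stay unchanged
    label = tree[0] if parent is None else tree[0] + "^" + parent
    if len(tree) == 2 and isinstance(tree[1], str):
        return [label, tree[1]]                      # preterminal: keep the word
    return [label] + [vertical_markovize(child, tree[0]) for child in tree[1:]]

# The VP subtree of the sentence above, with S as its parent.
subtree = ["VP", ["VERB", "is"],
                 ["NP", ["DET", "no"], ["NOUN", "asbestos"]],
                 ["PP", ["ADP", "in"],
                        ["NP", ["PRON", "our"], ["NOUN", "products"]]],
                 ["ADVP", ["ADV", "now"]]]
print(vertical_markovize(subtree, parent="S"))
# ['VP^S', ['VERB^VP', 'is'], ['NP^VP', ['DET^NP', 'no'], ['NOUN^NP', 'asbestos']], ...]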
There is a downside to vertical markovization. The transformation greatly increases the number of non-terminals, and hence the number of rules, in the grammar, potentially causing data sparsity issues when estimating the probability model q. In practice, though, using one level of vertical markovization (conditioning on the parent) has been shown to increase accuracy.
Note: This problem looks straightforward, but the difficulty is that we now have a much larger set of non-terminals N. It is likely that a simple implementation of CKY that worked for the last question will not scale to this problem. The challenge is to eliminate the basic inefficiencies of the implementation in order to handle the extra non-terminals.
• The file parse_train_vert.dat contains the original training sentences with vertical markovization applied to the parse trees. Train with the new file and reparse the development set. Run python eval_parser.py parse_dev.key prediction_file to evaluate performance.
Report on the performance of your parser with the new trees. In what ways did vertical markovization
help or hurt parser performance? Find an example where it improved your parser’s output. Submit
your program and any further observations.