X -> Y1 Y2 ... Ynwhere X is a non-terminal symbol, and each symbol on the right-hand side of the rule is either a non-terminal symbol or a terminal symbol. We've seen and used simple grammars for a fragment of English many times in this course. Here's an example:
S -> NP VP Adj -> tall
NP -> Det N Adj -> short
NP -> Det Adj N N -> man
NP -> John N -> woman
NP -> Mary N -> building
VP -> Vi N -> song
VP -> Vt NP Vi -> ate
Det -> the Vi -> laughed
Det -> a Vi -> sang
Det -> some Vt -> saw
Vt -> sang
A probabilistic context-free grammar (PCFG) is a context-free grammar where, for every non-terminal symbol, the possible right-hand sides all have probabilities associated with them. For any given non-terminal symbol (e.g. NP), all the probabilities of the different right-hand sides must sum to 1. For example:
S -> NP VP [1.0] NP -> Det N [0.6] NP -> Det Adj N [0.2] NP -> John [0.1] NP -> Mary [0.1] VP -> Vi [0.3] VP -> Vt NP [0.7] ...etc.
S S
/ \ / \
/ \ / \
NP VP NP VP
| | \ | |
John Vt NP Mary Vi
| \ |
saw Mary laughed
the first tree can be seen as containing the context-free rules
S -> NP VP NP -> John VP -> Vt NP Vt -> saw NP -> Maryand the second tree as containing the rules
S -> NP VP NP -> Mary VP -> Vi Vi -> laughedThus the treebank contains the following rules and frequencies:
S -> NP VP (2) NP -> John (1) NP -> Mary (2) VP -> Vt NP (1) VP -> Vi (1) Vt -> saw (1) Vi -> laughed (1)These can be converted to probabilities by normalizing so that the right-hand sides for each non-terminal have probabilities that sum to 1:
S -> NP VP Prob = 2/2 = 1.0 NP -> John Prob = 1/3 = 0.3333 NP -> Mary Prob = 2/3 = 0.6666 VP -> Vt NP Prob = 1/2 = 0.5 VP -> Vi Prob = 1/2 = 0.5 Vt -> saw Prob = 1/1 = 1.0 Vi -> laughed Prob = 1/1 = 1.0To take one example in detail, the NP rule probabilities can be calculated by keeping track of (or computing) the total number of NP rules (here equal to 3), and converting the frequency of each NP rule to a probability by dividing by that total. In this example, we see that for this corpus, whenever there's an NP, there is a 1/3 chance that it will be expanded to the symbol John and a 2/3 chance it will be expanded as the symbol Mary.
Let's express this more as an algorithm (what follows is only one of many different ways of doing it). Given a treebank:
For each tree in the treebank
Get the context-free rules from the tree
For each such context-free rule R
Update the frequency of R
Update the frequency of LHS(R)
For every context-free rule R encountered in the treebank
Let freq(R) be the frequency of R
Let freq(LHS(R)) be the frequency of the LHS of R
Let prob = freq(R)/freq(LHS(R))
Output: R prob
Thus the end result of the algorithm is a probabilistic CFG,
represented as a table of context-free rules and their probabilities.
(LHS = left-hand-side, i.e. the nonterminal being expanded.)
( (S
(NP (DT The) (NNP Fulton) (NNP County) (NNP Grand) (NNP Jury) )
(VP (VBD said)
(NP (NNP Friday) )
(SBAR (-NONE- 0)
(S
(NP (DT an) (NN investigation)
(PP (IN of)
(NP
(NP (NNP Atlanta) )
(POS 's) (JJ recent) (JJ primary) (NN election) )))
(VP (VBD produced)
(NP (OQUOTE OQUOTE) (DT no) (NN evidence) (CQUOTE CQUOTE)
(SBAR (IN that)
(S
(NP (DT any) (NNS irregularities) )
(VP (VBD took)
(NP (NN place) ))))))))))
(PERIOD PERIOD) )
((S
(NP (DT the) (JJ new) (NN management) )
(VP (VBZ takes)
(NP (NN charge) ))
(NP (NNP JanPERIOD) (CD 1) ))
(PERIOD PERIOD) )
You should produce a file called "pcfg.dat" that contains, on each
line, a PCFG rule followed by its probability. The rules don't have
to be in any particular order, and the probabilities can be
represented either as rational numbers (e.g. 1/2) or floating-point
numbers (e.g. 0.5), it doesn't matter which. For example, here are
some rules from an output file, sorted so matching left-hand-sides of
rules are grouped together:
(DT -> (A)) 54/245 (DT -> (AN)) 6/245 (DT -> (ANY)) 1/245 (DT -> (BOTH)) 3/245 (MD -> (MAY)) 1/35 (MD -> (MIGHT)) 1/35 (MD -> (MUST)) 3/35 (MD -> (SHOULD)) 6/35 (MD -> (WILL)) 2/5 (NNP -> (WILLIAM)) 2/251 (NNP -> (WILLIAMS)) 5/251 (NNP -> (WPERIOD)) 1/251 (NNS -> (ACTIONS)) 1/115 (NNS -> (ADJUSTMENTS)) 1/115 (NNS -> (ADMINISTRATORS)) 1/115 (NP -> (DT JJ NN)) 12/727 (NP -> (DT JJ NNP)) 1/727 (NP -> (DT NN PP)) 23/727 (NP -> (DT NN)) 82/727Notice the output format for CFG rules: a "rule" is a list containing three elements: