Parts of speech categorise the syntactic function of words.
Tag | Description | Example |
---|---|---|
CC | Coordinating conjunction | and |
CD | Cardinal number | 1 |
DT | Determiner | the |
EX | Existential there | there |
FW | Foreign word | שלום |
IN | Preposition or subordinating conjunction | in |
JJ | Adjective | high |
JJR | Adjective, comparative | higher |
JJS | Adjective, superlative | highest |
LS | List item marker | 1. |
MD | Modal | can |
NN | Noun, singular or mass | desk |
NNS | Noun, plural | desks |
NNP | Proper noun, singular | Denmark |
NNPS | Proper noun, plural | Danes |
PDT | Predeterminer | both |
POS | Possessive ending | 's |
PRP | Personal pronoun | you |
PRP$ | Possessive pronoun | your |
RB | Adverb | well |
RBR | Adverb, comparative | better |
RBS | Adverb, superlative | best |
RP | Particle | |
SYM | Symbol | |
TO | to | |
UH | Interjection | |
VB | Verb, base form | see |
VBD | Verb, past tense | saw |
VBG | Verb, gerund or present participle | seeing |
VBN | Verb, past participle | seen |
VBP | Verb, non-3rd person singular present | see |
VBZ | Verb, 3rd person singular present | sees |
WDT | Wh-determiner | |
WP | Wh-pronoun | |
WP$ | Possessive wh-pronoun | |
WRB | Wh-adverb | |
Phrases also have a grammatical function when they are syntactic constituents.
Penn Treebank constituent tagset:
Phrase Level | Description | Example |
---|---|---|
ADJP | Adjective Phrase | really high |
ADVP | Adverb Phrase | very well |
CONJP | Conjunction Phrase | as well as |
FRAG | Fragment | |
INTJ | Interjection | |
LST | List marker | |
NP | Noun Phrase | high desk |
PP | Prepositional Phrase | at home |
PRN | Parenthetical | |
PRT | Particle. Category for words that should be tagged RP | |
QP | Quantifier Phrase (i.e. complex measure/amount phrase); used within NP | |
RRC | Reduced Relative Clause | |
VP | Verb Phrase | see the desk |
WHADJP | Wh-adjective Phrase. Adjectival phrase containing a wh-adverb | how hot |
WHADVP | Wh-adverb Phrase, containing a wh-adverb | how well
WHNP | Wh-noun Phrase, containing some wh-word | which book |
WHPP | Wh-prepositional Phrase, containing a wh-noun phrase | of which |
X | Unknown, uncertain, or unbracketable. |
Clause Level | Description | |
---|---|---|
S | Simple declarative clause, i.e. one that is not introduced by a (possibly empty) subordinating conjunction or a wh-word and that does not exhibit subject-verb inversion. | |
SBAR | Clause introduced by a (possibly empty) subordinating conjunction. | |
SBARQ | Direct question introduced by a wh-word or a wh-phrase. Indirect questions and relative clauses should be bracketed as SBAR, not SBARQ. | |
SINV | Inverted declarative sentence, i.e. one in which the subject follows the tensed verb or modal. | |
SQ | Inverted yes/no question, or main clause of a wh-question, following the wh-phrase in SBARQ. |
A tree is a connected acyclic undirected graph.
Graphs consist of nodes and edges between them.
Another example of a PP attachment problem: does the PP (prepositional phrase) attach to the VP (verb phrase) or to the NP (noun phrase)?
A treebank is a dataset that consists of a text corpus annotated with (syntactic) trees.
Some commonly used treebanks:
Constituency parsing is a structured prediction task: parsers are trained on treebanks to build constituency trees from text.
See more in the chapter from this book about constituency parsing (slides).
In relation extraction, it helps to define linguistic patterns such as <subject> <verb> <object>
instead of purely text-based patterns.
Dechra Pharmaceuticals, which has just made its second acquisition, had previously purchased Genitrix.
Trinity Mirror plc, the largest British newspaper, purchased Local World, its rival.
Kraft, owner of Milka, purchased Cadbury Dairy Milk and is now gearing up for a roll-out of its new brand.
Syntactic dependencies are a useful representation for this purpose.
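A minimal sketch of such a pattern, assuming the sentence has already been parsed into (head, relation, dependent) triples. The triples are a hand-written, simplified parse of the Kraft example (not the output of a real parser), and `extract_svo` is a hypothetical helper name.

```python
# Hand-written, simplified dependency triples (head, relation, dependent) for
# "Kraft, owner of Milka, purchased Cadbury Dairy Milk ...".
# Illustrative assumption only: "Cadbury Dairy Milk" is reduced to one token.
triples = [
    ("purchased", "nsubj", "Kraft"),
    ("Kraft", "appos", "owner"),
    ("owner", "nmod", "Milka"),
    ("purchased", "obj", "Cadbury"),
]


def extract_svo(triples):
    """Return (subject, verb, object) for every head that has both nsubj and obj."""
    subjects = {head: dep for head, rel, dep in triples if rel == "nsubj"}
    objects = {head: dep for head, rel, dep in triples if rel == "obj"}
    return [(subjects[verb], verb, objects[verb]) for verb in subjects if verb in objects]


print(extract_svo(triples))
```

The apposition "owner of Milka" sits between subject and verb in the surface string, but the pattern is unaffected because it only looks at the `nsubj` and `obj` arcs of the verb.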
Reordering rules can be stated in terms of syntactic dependencies:
ROOT node.
Must be a tree: every word has exactly one head, and ROOT has no head.
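The tree constraint can be checked mechanically. The sketch below assumes heads are given as a list of 1-based word IDs with 0 standing for ROOT; `is_valid_dependency_tree` is a hypothetical name, and the check does not enforce that only one word attaches to ROOT.

```python
def is_valid_dependency_tree(heads):
    """heads[i] is the head of word i+1 (1-based word IDs); 0 denotes ROOT.

    A list gives every word exactly one head by construction, so it remains
    to check that head IDs are in range and that following heads from any
    word eventually reaches ROOT (i.e. there are no cycles).
    """
    n = len(heads)
    if any(h < 0 or h > n for h in heads):
        return False
    for word in range(1, n + 1):
        seen = set()
        node = word
        while node != 0:
            if node in seen:  # returned to a visited word: cycle
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


# Head indices of a seven-word sentence in which word 2 is attached to ROOT.
print(is_valid_dependency_tree([2, 0, 4, 2, 7, 7, 2]))
# Words 1 and 2 point at each other, so ROOT is never reached.
print(is_valid_dependency_tree([2, 1, 2]))
```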
conllu = """
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
1 I _ _ _ _ 2 nsubj _ _
2 saw _ _ _ _ 0 root _ _
3 the _ _ _ _ 4 det _ _
4 star _ _ _ _ 2 dobj _ _
5 with _ _ _ _ 7 case _ _
6 the _ _ _ _ 7 det _ _
7 telescope _ _ _ _ 2 obl _ _
"""
arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))
render_displacy(arcs, tokens,"2400px")
conllu = """
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
1 I _ _ _ _ 2 nsubj _ _
2 saw _ _ _ _ 0 root _ _
3 the _ _ _ _ 4 det _ _
4 star _ _ _ _ 2 dobj _ _
5 with _ _ _ _ 7 case _ _
6 the _ _ _ _ 7 det _ _
7 telescope _ _ _ _ 4 nmod _ _
"""
arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))
render_displacy(arcs, tokens,"2400px")
CoNLL-U is a tabular format with 10 columns indicating various morphosyntactic attributes.
Shown here: ID, surface form, dependency head and dependency relation.
(The others are shown as _ but normally they would be filled in too.)
display(HTML(pd.read_csv(StringIO(conllu), sep="\t").to_html(index=False)))
render_displacy(arcs, tokens,"2400px")
# ID | FORM | LEMMA | UPOS | XPOS | FEATS | HEAD | DEPREL | DEPS | MISC |
---|---|---|---|---|---|---|---|---|---|
1 | I | _ | _ | _ | _ | 2 | nsubj | _ | _ |
2 | saw | _ | _ | _ | _ | 0 | root | _ | _ |
3 | the | _ | _ | _ | _ | 4 | det | _ | _ |
4 | star | _ | _ | _ | _ | 2 | dobj | _ | _ |
5 | with | _ | _ | _ | _ | 7 | case | _ | _ |
6 | the | _ | _ | _ | _ | 7 | det | _ | _ |
7 | telescope | _ | _ | _ | _ | 4 | nmod | _ | _ |
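To make the column layout concrete, here is a minimal reader for the four columns shown in the table. It is a sketch under assumptions: `read_conllu` is a hypothetical helper (not the `load_arcs_tokens` function used in these slides), it splits on any whitespace, and it does not handle multiword-token ranges or empty nodes, which real CoNLL-U files can contain.

```python
example = """
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
1 I _ _ _ _ 2 nsubj _ _
2 saw _ _ _ _ 0 root _ _
3 the _ _ _ _ 4 det _ _
4 star _ _ _ _ 2 dobj _ _
5 with _ _ _ _ 7 case _ _
6 the _ _ _ _ 7 det _ _
7 telescope _ _ _ _ 4 nmod _ _
"""


def read_conllu(text):
    """Return (id, form, head, deprel) tuples, skipping comments and blank lines."""
    rows = []
    for line in text.strip().splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split()
        # Columns are 0-indexed here: ID=0, FORM=1, HEAD=6, DEPREL=7.
        rows.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return rows


for token_id, form, head, deprel in read_conllu(example):
    print(token_id, form, head, deprel)
```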
How to define the relation labels? There are different linguistic traditions in different languages...
| | Nominals | Clauses | Modifier words | Function Words |
---|---|---|---|---|
Core arguments | nsubj, obj, iobj | csubj, ccomp, xcomp | | |
Non-core dependents | obl, vocative, expl, dislocated | advcl | advmod, discourse | aux, cop, mark |
Nominal dependents | nmod, appos, nummod | acl | amod | det, clf, case |

Coordination | MWE | Loose | Special | Other |
---|---|---|---|---|
conj, cc | fixed, flat, compound | list, parataxis | orphan, goeswith, reparandum | punct, root, dep |
UD also includes other morphosyntactic annotation:
Open class words | Closed class words | Other |
---|---|---|
ADJ | ADP | PUNCT |
ADV | AUX | SYM |
INTJ | CCONJ | X |
NOUN | DET | |
PROPN | NUM | |
VERB | PART | |
| | PRON | |
| | SCONJ | |
the big fish ate the small fish
conllu = """
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
1 Den _ _ _ _ 3 det _ _
2 store _ _ _ _ 3 amod _ _
3 fisk _ _ _ _ 4 nsubj _ _
4 spiste _ _ _ _ 0 root _ _
5 den _ _ _ _ 7 det _ _
6 lille _ _ _ _ 7 amod _ _
7 fisk _ _ _ _ 4 obj _ _
"""
arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))
render_displacy(arcs, tokens,"1400px")
big fish small fish ate
conllu = """
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
1 큰 _ _ _ _ 2 amod _ _
2 물고기가 _ _ _ _ 5 nsubj _ _
3 작은 _ _ _ _ 4 amod _ _
4 물고기를 _ _ _ _ 5 obj _ _
5 먹었다 _ _ _ _ 0 root _ _
"""
arcs, tokens = to_displacy_graph(*load_arcs_tokens(conllu))
render_displacy(arcs, tokens,"1400px")
Task:
conllu = """
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
1 Alice _ _ _ _ 2 nsubj _ _
2 saw _ _ _ _ 0 root _ _
3 Bob _ _ _ _ 2 dobj _ _
"""
display(HTML(pd.read_csv(StringIO(conllu), sep="\t").to_html(index=False)))
# ID | FORM | LEMMA | UPOS | XPOS | FEATS | HEAD | DEPREL | DEPS | MISC |
---|---|---|---|---|---|---|---|---|---|
1 | Alice | _ | _ | _ | _ | 2 | nsubj | _ | _ |
2 | saw | _ | _ | _ | _ | 0 | root | _ | _ |
3 | Bob | _ | _ | _ | _ | 2 | dobj | _ | _ |
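Always $0 \leq \text{LAS} \leq \text{UAS} \leq 100\%$, because every word that LAS counts as correct (head and label both right) is also counted as correct by UAS (head right).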
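A small sketch of how the two scores can be computed, taking UAS as the percentage of words with the correct head and LAS as the percentage with both the correct head and the correct relation label. The function name `attachment_scores` and the predicted parse are made up for illustration.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent.

    gold and predicted are equal-length lists of (head, label) pairs, one per word.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty parses of the same sentence")
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    both_hits = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100 * head_hits / n, 100 * both_hits / n


gold = [(2, "nsubj"), (0, "root"), (2, "dobj")]        # Alice saw Bob
predicted = [(2, "nsubj"), (0, "root"), (2, "nsubj")]  # hypothetical output: wrong label on Bob
uas, las = attachment_scores(gold, predicted)
print(f"UAS = {uas:.1f}%, LAS = {las:.1f}%")
```

Here all three heads are right but one label is wrong, so UAS is 100% while LAS is two thirds.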
Transition-based parsers maintain a buffer and a stack, and incrementally build the parse by applying actions (transitions).
What are the possible actions? That depends on which transition system we are using!
Common transition systems:
Possible actions at each step:
Two special configurations:
render_transitions_displacy(transitions, tokenized_sentence)
stack | buffer | parse | action |
---|---|---|---|
ROOT | Alice saw Bob | | |
ROOT Alice | saw Bob | | shift |
ROOT Alice saw | Bob | | shift |
ROOT saw | Bob | | leftArc-nsubj |
ROOT saw Bob | | | shift |
ROOT saw | | | rightArc-dobj |
ROOT | | | rightArc-root |
ROOT | | | |
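The derivation in the table can be replayed in a few lines. This is a sketch of arc-standard only: `run_arc_standard` is a hypothetical function (not the `render_transitions_displacy` helper used above), and it identifies words by their strings, which would be ambiguous for sentences with repeated words.

```python
def run_arc_standard(words, actions):
    """Apply arc-standard actions and return the arcs as (head, label, dependent)."""
    stack, buffer, arcs = ["ROOT"], list(words), []
    for action in actions:
        if action == "shift":
            stack.append(buffer.pop(0))
        else:
            name, label = action.split("-", 1)
            top, second = stack[-1], stack[-2]
            if name == "leftArc":     # stack top becomes head of the second item
                arcs.append((top, label, second))
                del stack[-2]
            elif name == "rightArc":  # second item becomes head of the stack top
                arcs.append((second, label, top))
                stack.pop()
            else:
                raise ValueError(f"unknown action: {action}")
        print(" ".join(stack), "|", " ".join(buffer), "|", action)
    return arcs


actions = ["shift", "shift", "leftArc-nsubj", "shift", "rightArc-dobj", "rightArc-root"]
arcs = run_arc_standard(["Alice", "saw", "Bob"], actions)
print(arcs)
```

Each printed line corresponds to one row of the table: the configuration reached after applying the action.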
Model $p(a|c)$: how likely is action $a$ to be next, given that the current configuration is $c$? $$p(a|c) \approx s_\params(a,c)$$
Training: learn $\params$ with an annotated training set $$ \argmax_\params \prod_{x \in \train} \prod_{i=1}^{|x|} s_\params(a_i,c_i) $$
Decoding: try to find the most likely action sequence $$\argmax_{a_1,\ldots,a_{|x|}} \prod_{i=1}^{|x|} s_\params(a_i,c_i)$$
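One simple way to approximate the argmax is greedy decoding: at every step, apply the highest-scoring action that is valid in the current configuration. The sketch below assumes unlabeled arc-standard transitions over word indices (0 is ROOT); `toy_score` is a made-up stand-in for the learned scorer and all function names are hypothetical.

```python
def valid_actions(stack, buffer):
    """Unlabeled arc-standard actions allowed in this configuration."""
    actions = []
    if buffer:
        actions.append("shift")
    if len(stack) >= 3:  # the second item is a word, never ROOT
        actions.append("leftArc")
    # Attaching to ROOT is only allowed once the buffer is empty (single root).
    if len(stack) >= 3 or (len(stack) == 2 and not buffer):
        actions.append("rightArc")
    return actions


def greedy_decode(n_words, score):
    """Apply the best-scoring valid action until the terminal configuration."""
    stack, buffer = [0], list(range(1, n_words + 1))
    arcs, actions = [], []
    while True:
        candidates = valid_actions(stack, buffer)
        if not candidates:  # stack is [ROOT] and buffer is empty
            return actions, arcs
        action = max(candidates, key=lambda a: score(a, stack, buffer))
        if action == "shift":
            stack.append(buffer.pop(0))
        elif action == "leftArc":
            arcs.append((stack[-1], stack[-2]))  # (head, dependent)
            del stack[-2]
        else:
            arcs.append((stack[-2], stack[-1]))
            stack.pop()
        actions.append(action)


def toy_score(action, stack, buffer):
    """Toy stand-in for the learned score: a fixed preference order."""
    return {"shift": 3.0, "leftArc": 2.0, "rightArc": 1.0}[action]


actions, arcs = greedy_decode(3, toy_score)
print(actions)
print(arcs)
```

Greedy decoding commits to each action and cannot recover from an early mistake, so it is not guaranteed to find the highest-scoring sequence overall.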
Sequence-to-sequence, but with control structure:
Alternative: (see also MT slides)
But what is the ground truth? Treebanks contain trees, not action sequences!
Oracle: rules to select the right action given the configuration and the correct tree.
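A minimal sketch of such rules for arc-standard, without relation labels and assuming the gold tree is projective. The function name and the index-based encoding (0 is ROOT, `heads[0]` unused) are choices made for this example.

```python
def arc_standard_oracle(heads):
    """Derive unlabeled arc-standard actions from gold heads.

    heads[i] is the gold head of word i; heads[0] is unused and 0 denotes ROOT.
    """
    n = len(heads) - 1
    stack, buffer, actions = [0], list(range(1, n + 1)), []
    remaining = [0] * (n + 1)  # per node: dependents not yet attached
    for word in range(1, n + 1):
        remaining[heads[word]] += 1
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            top, second = stack[-1], stack[-2]
            if second != 0 and heads[second] == top:
                actions.append("leftArc")
                remaining[top] -= 1
                del stack[-2]
                continue
            # Only attach top once it has collected all of its own dependents.
            if heads[top] == second and remaining[top] == 0:
                actions.append("rightArc")
                remaining[second] -= 1
                stack.pop()
                continue
        if not buffer:
            raise ValueError("no oracle action: gold tree is not projective")
        actions.append("shift")
        stack.append(buffer.pop(0))
    return actions


# Alice(1) saw(2) Bob(3), with saw attached to ROOT.
print(arc_standard_oracle([None, 2, 0, 2]))
```

The check on `remaining[top]` is what delays attaching "saw" to ROOT until "Bob" has been attached; a right arc removes the dependent from the stack, so applying it too early would leave its later dependents without a head.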
Unlabeled parsing (without relation labels), just for simplicity.
| | LEFT-ARC | initial configuration | terminal configuration |
---|---|---|---|
arc-standard | create arc from stack top to second stack item, pop second stack item | stack contains root, buffer contains words | stack contains root, buffer is empty |
arc-hybrid | create arc from buffer top to stack top, pop stack top | stack is empty, buffer contains words and root | stack is empty, buffer contains root |