All the IPython Notebooks in Python Natural Language Processing lecture series by Dr. Milaan Parmar are available @ GitHub
Sentence segmentation is the process of determining the longer processing units consisting of one or more words. This task involves identifying sentence boundaries between words in different sentences.
In spaCy Basics we saw briefly how Doc objects are divided into sentences. In this section we'll learn how sentence segmentation works, and how to set our own segmentation rules.
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')
# From Spacy Basics:
doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')
for sent in doc.sents:
print(sent)
This is the first sentence. This is another sentence. This is the last sentence.
Doc.sents
is a generator¶It is important to note that doc.sents
is a generator. That is, a Doc is not segmented until doc.sents
is called. This means that, where you could print the second Doc token with print(doc[1])
, you can't call the "second Doc sentence" with print(doc.sents[1])
:
print(doc[1])
is
print(doc.sents[1])
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-4-2bc012eee1da> in <module>() ----> 1 print(doc.sents[1]) TypeError: 'generator' object is not subscriptable
However, you can build a sentence collection by running doc.sents
and saving the result to a list:
doc_sents = [sent for sent in doc.sents]
doc_sents
[This is the first sentence., This is another sentence., This is the last sentence.]
NOTE: list(doc.sents)
also works. We show a list comprehension as it allows you to pass in conditionals.
# Now you can access individual sentences:
print(doc_sents[1])
This is another sentence.
sents
are Spans¶At first glance it looks like each sent
contains text from the original Doc object. In fact they're just Spans with start and end token pointers.
type(doc_sents[1])
spacy.tokens.span.Span
print(doc_sents[1].start, doc_sents[1].end)
6 11
spaCy's built-in sentencizer
relies on the dependency parse and end-of-sentence punctuation to determine segmentation rules. We can add rules of our own, but they have to be added before the creation of the Doc object, as that is where the parsing of segment start tokens happens:
# Parsing the segmentation start tokens happens during the nlp pipeline
doc2 = nlp(u'This is a sentence. This is a sentence. This is a sentence.')
for token in doc2:
print(token.is_sent_start, ' '+token.text)
True This None is None a None sentence None . True This None is None a None sentence None . True This None is None a None sentence None .
Notice we haven't run doc2.sents
, and yet token.is_sent_start
was set to True on two tokens in the Doc.
Let's add a semicolon to our existing segmentation rules. That is, whenever the sentencizer encounters a semicolon, the next token should start a new segment.
# SPACY'S DEFAULT BEHAVIOR
doc3 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')
for sent in doc3.sents:
print(sent)
"Management is doing things right; leadership is doing the right things." -Peter Drucker
# ADD A NEW RULE TO THE PIPELINE
def set_custom_boundaries(doc):
for token in doc[:-1]:
if token.text == ';':
doc[token.i+1].is_sent_start = True
return doc
nlp.add_pipe(set_custom_boundaries, before='parser')
nlp.pipe_names
['tagger', 'set_custom_boundaries', 'parser', 'ner']
The new rule has to run before the document is parsed. Here we can either pass the argument before='parser'
or first=True
.
# Re-run the Doc object creation:
doc4 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')
for sent in doc4.sents:
print(sent)
"Management is doing things right; leadership is doing the right things." -Peter Drucker
# And yet the new rule doesn't apply to the older Doc object:
for sent in doc3.sents:
print(sent)
"Management is doing things right; leadership is doing the right things." -Peter Drucker
Why not simply set the .is_sent_start
value to True on existing tokens?
# Find the token we want to change:
doc3[7]
leadership
# Try to change the .is_sent_start attribute:
doc3[7].is_sent_start = True
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-15-bcec3fe6a9a2> in <module>() 1 # Try to change the .is_sent_start attribute: ----> 2 doc3[7].is_sent_start = True token.pyx in spacy.tokens.token.Token.is_sent_start.__set__() ValueError: [E043] Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state.
spaCy refuses to change the tag after the document is parsed to prevent inconsistencies in the data.
In some cases we want to replace spaCy's default sentencizer with our own set of rules. In this section we'll see how the default sentencizer breaks on periods. We'll then replace this behavior with a sentencizer that breaks on linebreaks.
nlp = spacy.load('en_core_web_sm') # reset to the original
mystring = u"This is a sentence. This is another.\n\nThis is a \nthird sentence."
# SPACY DEFAULT BEHAVIOR:
doc = nlp(mystring)
for sent in doc.sents:
print([token.text for token in sent])
['This', 'is', 'a', 'sentence', '.'] ['This', 'is', 'another', '.', '\n\n'] ['This', 'is', 'a', '\n', 'third', 'sentence', '.']
# CHANGING THE RULES
from spacy.pipeline import SentenceSegmenter
def split_on_newlines(doc):
start = 0
seen_newline = False
for word in doc:
if seen_newline:
yield doc[start:word.i]
start = word.i
seen_newline = False
elif word.text.startswith('\n'): # handles multiple occurrences
seen_newline = True
yield doc[start:] # handles the last group of tokens
sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
nlp.add_pipe(sbd)
While the function split_on_newlines
can be named anything we want, it's important to use the name sbd
for the SentenceSegmenter.
doc = nlp(mystring)
for sent in doc.sents:
print([token.text for token in sent])
['This', 'is', 'a', 'sentence', '.', 'This', 'is', 'another', '.', '\n\n'] ['This', 'is', 'a', '\n'] ['third', 'sentence', '.']