Notebook

All the IPython Notebooks in Python Natural Language Processing lecture series by Dr. Milaan Parmar are available @ GitHub

07 Sentence Segmentation¶

Sentence segmentation is the process of determining the longer processing units consisting of one or more words. This task involves identifying sentence boundaries between words in different sentences.

In spaCy Basics we saw briefly how Doc objects are divided into sentences. In this section we'll learn how sentence segmentation works, and how to set our own segmentation rules.

In [1]:

# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:

# From Spacy Basics:
doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

for sent in doc.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.

`Doc.sents` is a generator¶

It is important to note that doc.sents is a generator. That is, a Doc is not segmented until doc.sents is called. This means that, where you could print the second Doc token with print(doc[1]), you can't call the "second Doc sentence" with print(doc.sents[1]):

In [3]:

print(doc[1])

is

In [4]:

print(doc.sents[1])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-2bc012eee1da> in <module>()
----> 1 print(doc.sents[1])

TypeError: 'generator' object is not subscriptable

However, you can build a sentence collection by running doc.sents and saving the result to a list:

In [5]:

doc_sents = [sent for sent in doc.sents]
doc_sents

Out[5]:

[This is the first sentence.,
 This is another sentence.,
 This is the last sentence.]

NOTE: list(doc.sents) also works. We show a list comprehension as it allows you to pass in conditionals.

In [6]:

# Now you can access individual sentences:
print(doc_sents[1])

This is another sentence.

`sents` are Spans¶

At first glance it looks like each sent contains text from the original Doc object. In fact they're just Spans with start and end token pointers.

In [7]:

type(doc_sents[1])

Out[7]:

spacy.tokens.span.Span

In [8]:

print(doc_sents[1].start, doc_sents[1].end)

6 11

Adding Rules¶

spaCy's built-in sentencizer relies on the dependency parse and end-of-sentence punctuation to determine segmentation rules. We can add rules of our own, but they have to be added before the creation of the Doc object, as that is where the parsing of segment start tokens happens:

In [9]:

# Parsing the segmentation start tokens happens during the nlp pipeline
doc2 = nlp(u'This is a sentence. This is a sentence. This is a sentence.')

for token in doc2:
    print(token.is_sent_start, ' '+token.text)

True  This
None  is
None  a
None  sentence
None  .
True  This
None  is
None  a
None  sentence
None  .
True  This
None  is
None  a
None  sentence
None  .

Notice we haven't run doc2.sents, and yet token.is_sent_start was set to True on two tokens in the Doc.

Let's add a semicolon to our existing segmentation rules. That is, whenever the sentencizer encounters a semicolon, the next token should start a new segment.

In [10]:

# SPACY'S DEFAULT BEHAVIOR
doc3 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')

for sent in doc3.sents:
    print(sent)

"Management is doing things right; leadership is doing the right things."
-Peter
Drucker

In [11]:

# ADD A NEW RULE TO THE PIPELINE
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i+1].is_sent_start = True
    return doc

nlp.add_pipe(set_custom_boundaries, before='parser')

nlp.pipe_names

Out[11]:

['tagger', 'set_custom_boundaries', 'parser', 'ner']

The new rule has to run before the document is parsed. Here we can either pass the argument before='parser' or first=True.

In [12]:

# Re-run the Doc object creation:
doc4 = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker')

for sent in doc4.sents:
    print(sent)

"Management is doing things right;
leadership is doing the right things."
-Peter
Drucker

In [13]:

# And yet the new rule doesn't apply to the older Doc object:
for sent in doc3.sents:
    print(sent)

"Management is doing things right; leadership is doing the right things."
-Peter
Drucker

Why not change the token directly?¶

Why not simply set the .is_sent_start value to True on existing tokens?

In [14]:

# Find the token we want to change:
doc3[7]

Out[14]:

leadership

In [15]:

# Try to change the .is_sent_start attribute:
doc3[7].is_sent_start = True

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-bcec3fe6a9a2> in <module>()
      1 # Try to change the .is_sent_start attribute:
----> 2 doc3[7].is_sent_start = True

token.pyx in spacy.tokens.token.Token.is_sent_start.__set__()

ValueError: [E043] Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state.

spaCy refuses to change the tag after the document is parsed to prevent inconsistencies in the data.

Changing the Rules¶

In some cases we want to replace spaCy's default sentencizer with our own set of rules. In this section we'll see how the default sentencizer breaks on periods. We'll then replace this behavior with a sentencizer that breaks on linebreaks.

In [16]:

nlp = spacy.load('en_core_web_sm')  # reset to the original

mystring = u"This is a sentence. This is another.\n\nThis is a \nthird sentence."

# SPACY DEFAULT BEHAVIOR:
doc = nlp(mystring)

for sent in doc.sents:
    print([token.text for token in sent])

['This', 'is', 'a', 'sentence', '.']
['This', 'is', 'another', '.', '\n\n']
['This', 'is', 'a', '\n', 'third', 'sentence', '.']

In [17]:

# CHANGING THE RULES
from spacy.pipeline import SentenceSegmenter

def split_on_newlines(doc):
    start = 0
    seen_newline = False
    for word in doc:
        if seen_newline:
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        elif word.text.startswith('\n'): # handles multiple occurrences
            seen_newline = True
    yield doc[start:]      # handles the last group of tokens


sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
nlp.add_pipe(sbd)

While the function split_on_newlines can be named anything we want, it's important to use the name sbd for the SentenceSegmenter.

In [18]:

doc = nlp(mystring)
for sent in doc.sents:
    print([token.text for token in sent])

['This', 'is', 'a', 'sentence', '.', 'This', 'is', 'another', '.', '\n\n']
['This', 'is', 'a', '\n']
['third', 'sentence', '.']

In [ ]: