All the IPython Notebooks in Python Natural Language Processing lecture series by Dr. Milaan Parmar are available @ GitHub
Named Entity Recognition (NER) — also known as entity identification, entity chunking, and entity extraction — is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, and medical codes.
spaCy has an 'ner' pipeline component that identifies token spans fitting a predetermined set of named entities. These are available as the `ents` property of a `Doc` object.
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')
# Write a function to display basic entity info:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')
doc = nlp(u'Hi, everyone welcome to Milaan Parmar CS tutorial on NPL')
show_ents(doc)
Milaan Parmar CS - ORG - Companies, agencies, institutions, etc.
doc = nlp(u'May I go to England or Canada, next month to see the virus report?')
show_ents(doc)
England - GPE - Countries, cities, states
Canada - GPE - Countries, cities, states
next month - DATE - Absolute or relative dates or periods
`Doc.ents` are token spans with their own set of annotations.
ATTRIBUTE | DESCRIPTION |
---|---|
`ent.text` | The original entity text |
`ent.label` | The entity type's hash value |
`ent.label_` | The entity type's string description |
`ent.start` | The token span's *start* index position in the Doc |
`ent.end` | The token span's *stop* index position in the Doc (exclusive) |
`ent.start_char` | The entity text's *start* character offset in the Doc text |
`ent.end_char` | The entity text's *stop* character offset in the Doc text |
doc = nlp(u'Can I please borrow 500 dollars from Blake to buy some Microsoft stock?')
for ent in doc.ents:
print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)
500 dollars 4 6 20 31 MONEY
Blake 7 8 37 42 PERSON
Microsoft 11 12 55 64 ORG
Entity tags are accessible through the `.label_` property of an entity.
TYPE | DESCRIPTION | EXAMPLE |
---|---|---|
`PERSON` | People, including fictional. | *Fred Flintstone* |
`NORP` | Nationalities or religious or political groups. | *The Republican Party* |
`FAC` | Buildings, airports, highways, bridges, etc. | *Logan International Airport, The Golden Gate* |
`ORG` | Companies, agencies, institutions, etc. | *Microsoft, FBI, MIT* |
`GPE` | Countries, cities, states. | *France, UAR, Chicago, Idaho* |
`LOC` | Non-GPE locations, mountain ranges, bodies of water. | *Europe, Nile River, Midwest* |
`PRODUCT` | Objects, vehicles, foods, etc. (Not services.) | *Formula 1* |
`EVENT` | Named hurricanes, battles, wars, sports events, etc. | *Olympic Games* |
`WORK_OF_ART` | Titles of books, songs, etc. | *The Mona Lisa* |
`LAW` | Named documents made into laws. | *Roe v. Wade* |
`LANGUAGE` | Any named language. | *English* |
`DATE` | Absolute or relative dates or periods. | *20 July 1969* |
`TIME` | Times smaller than a day. | *Four hours* |
`PERCENT` | Percentage, including "%". | *Eighty percent* |
`MONEY` | Monetary values, including unit. | *Twenty Cents* |
`QUANTITY` | Measurements, as of weight or distance. | *Several kilometers, 55kg* |
`ORDINAL` | "first", "second", etc. | *9th, Ninth* |
`CARDINAL` | Numerals that do not fall under another type. | *2, Two, Fifty-two* |
Normally we would have spaCy build a library of named entities by training it on several samples of text.
In this case, we only want to add one value:
doc = nlp(u'Arthur to build a U.K. factory for $6 million')
show_ents(doc)
Arthur - ORG - Companies, agencies, institutions, etc.
U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit
Add Arthur as PERSON
from spacy.tokens import Span
# Get the hash value of the PERSON entity label
PERSON = doc.vocab.strings[u'PERSON']
# Create a Span for the new entity
new_ent = Span(doc, 0, 1, label=PERSON)
# Add the entity to the existing Doc object
doc.ents = list(doc.ents) + [new_ent]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-f76bf8be7547> in <module>()
      8
      9 # Add the entity to the existing Doc object
---> 10 doc.ents = list(doc.ents) + [new_ent]

doc.pyx in spacy.tokens.doc.Doc.ents.__set__()

ValueError: [E103] Trying to set conflicting doc.ents: '(0, 1, 'ORG')' and '(0, 1, 'PERSON')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
In the code above, the arguments passed to `Span()` are:
- `doc` - the name of the Doc object
- `0` - the *start* index position of the span
- `1` - the *stop* index position (exclusive)
- `label=PERSON` - the label assigned to our entity

show_ents(doc)
Arthur - ORG - Companies, agencies, institutions, etc.
U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit
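The E103 error above occurs because Arthur is already tagged as ORG over positions (0, 1), and a token can belong to only one entity. One way around it is to filter out any existing entity that overlaps the new span before assigning. A minimal sketch of that filtering logic, using (start, end, label) tuples to stand in for real Span objects so it runs standalone:

```python
# Existing entities as (start, end, label) tuples, standing in for doc.ents:
ents = [(0, 1, 'ORG'), (3, 4, 'GPE'), (7, 9, 'MONEY')]
new_ent = (0, 1, 'PERSON')  # the span we want to add

# Keep only entities that do not overlap the new span:
kept = [e for e in ents if e[1] <= new_ent[0] or e[0] >= new_ent[1]]
ents = kept + [new_ent]
print(ents)  # [(3, 4, 'GPE'), (7, 9, 'MONEY'), (0, 1, 'PERSON')]
```

With real spaCy objects the same idea is `doc.ents = [e for e in doc.ents if e.end <= new_ent.start or e.start >= new_ent.end] + [new_ent]`.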
What if we want to tag *all occurrences* of a word or phrase? We can use the PhraseMatcher to identify a series of spans in the Doc:
doc = nlp(u'Our company plans to introduce a new vacuum cleaner. '
u'If successful, the vacuum cleaner will be our first product.')
show_ents(doc)
first - ORDINAL - "first", "second", etc.
# Import PhraseMatcher and create a matcher object:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
# Create the desired phrase patterns:
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']
phrase_patterns = [nlp(text) for text in phrase_list]
# Apply the patterns to our matcher object:
matcher.add('newproduct', None, *phrase_patterns)
# Apply the matcher to our Doc object:
matches = matcher(doc)
# See what matches occur:
matches
[(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]
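Each match is a (match_id, start, end) tuple, where match_id is the hash of the name passed to `matcher.add()` ('newproduct' here) and start/end are token index positions. A minimal sketch of unpacking these tuples, reusing the hash value from the output above (plain Python, no model required):

```python
matches = [(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]

# Pull out just the token spans from each match tuple:
spans = [(start, end) for match_id, start, end in matches]
print(spans)  # [(7, 9), (14, 16)]
```

With a loaded model, `nlp.vocab.strings[match_id]` recovers the rule name and `doc[start:end]` gives the matched tokens.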
# Here we create Spans from each match, and create named entities from them:
from spacy.tokens import Span
PROD = doc.vocab.strings[u'PRODUCT']
new_ents = [Span(doc, match[1],match[2],label=PROD) for match in matches]
doc.ents = list(doc.ents) + new_ents
show_ents(doc)
vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
vacuum cleaner - PRODUCT - Objects, vehicles, foods, etc. (not services)
first - ORDINAL - "first", "second", etc.
While spaCy may not have a built-in tool for counting entities, we can pass a conditional statement into a list comprehension:
doc = nlp(u'Originally priced at $29.50, the sweater was marked down to five dollars.')
show_ents(doc)
29.50 - MONEY - Monetary values, including unit
five dollars - MONEY - Monetary values, including unit
len([ent for ent in doc.ents if ent.label_=='MONEY'])
2
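The same list-comprehension idea generalizes to counting every label at once with `collections.Counter`. A minimal sketch, using plain (text, label) tuples to stand in for `doc.ents` so it runs without a loaded model:

```python
from collections import Counter

# Stand-ins for the (ent.text, ent.label_) pairs shown above:
ents = [('29.50', 'MONEY'), ('five dollars', 'MONEY')]

label_counts = Counter(label for text, label in ents)
print(label_counts['MONEY'])  # 2
```

With a real Doc the equivalent is `Counter(ent.label_ for ent in doc.ents)`.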
spacy.__version__
'2.2.4'
doc = nlp(u'Originally priced at $29.50,\nthe sweater was marked down to five dollars.')
show_ents(doc)
29.50 - MONEY - Monetary values, including unit
five dollars - MONEY - Monetary values, including unit
# Quick function to remove ents formed on whitespace:
def remove_whitespace_entities(doc):
doc.ents = [e for e in doc.ents if not e.text.isspace()]
return doc
# Insert this into the pipeline AFTER the ner component:
nlp.add_pipe(remove_whitespace_entities, after='ner')
# Rerun nlp on the text above, and show ents:
doc = nlp(u'Originally priced at $29.50,\nthe sweater was marked down to five dollars.')
show_ents(doc)
29.50 - MONEY - Monetary values, including unit
five dollars - MONEY - Monetary values, including unit
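The filter inside remove_whitespace_entities is just `str.isspace()` applied to each entity's text. A standalone sketch of that check, with plain strings standing in for each `ent.text`:

```python
# Entity texts, two of which are pure whitespace:
ent_texts = ['29.50', '\n', 'five dollars', '  ']

kept = [t for t in ent_texts if not t.isspace()]
print(kept)  # ['29.50', 'five dollars']
```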
For more on Named Entity Recognition visit https://spacy.io/usage/linguistic-features#101
`Doc.noun_chunks` are base noun phrases: token spans that include the noun and the words describing the noun. Noun chunks cannot be nested, cannot overlap, and do not involve prepositional phrases or relative clauses.
Where `Doc.ents` rely on the ner pipeline component, `Doc.noun_chunks` are provided by the parser.
noun_chunks components:

ATTRIBUTE | DESCRIPTION |
---|---|
`.text` | The original noun chunk text. |
`.root.text` | The original text of the word connecting the noun chunk to the rest of the parse. |
`.root.dep_` | Dependency relation connecting the root to its head. |
`.root.head.text` | The text of the root token's head. |
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")
for chunk in doc.noun_chunks:
print(chunk.text+' - '+chunk.root.text+' - '+chunk.root.dep_+' - '+chunk.root.head.text)
Autonomous cars - cars - nsubj - shift
insurance liability - liability - dobj - shift
manufacturers - manufacturers - pobj - toward
`Doc.noun_chunks` is a generator function

Previously we mentioned that `Doc` objects do not retain a list of sentences, but they're available through the `Doc.sents` generator. It's the same with `Doc.noun_chunks` - lists can be created if needed:
len(doc.noun_chunks)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-22-8b52b37c204e> in <module>()
----> 1 len(doc.noun_chunks)

TypeError: object of type 'generator' has no len()
len(list(doc.noun_chunks))
3
For more on noun_chunks visit https://spacy.io/usage/linguistic-features#noun-chunks