Neural NER with spaCy¶

In this hands-on, we use spaCy (https://spacy.io), a framework that supports several languages out of the box and covers all basic NLP processing steps:

Tokenization/Word segmentation
Sentence splitting
Part-of-Speech tagging
Lemmatization
NER
Dependency Parsing

Hands-on¶

Downloading a small model for English trained on modern Web data. (Medium models end with md, large models end with lg.)

In [ ]:

!python -m spacy download en_core_web_sm

Setting up an NLP pipeline for English:

In [ ]:

import spacy
nlp = spacy.load("en_core_web_sm")

Reading and linguistic processing of a single sentence.

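More information on the meaning of the results can be found under https://spacy.io/usage/linguistic-features . Note that the human-readable (string) attributes of tokens always end with an underscore, e.g. token.pos_; the attribute without the underscore (token.pos) returns an integer ID instead.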

In [ ]:

doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

The default spaCy NLP pipeline for English includes NER. See https://spacy.io/usage/linguistic-features#named-entities for more information.

Let's look at the text, character offsets and label of every named entity found:

In [ ]:

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Accessing IOB information at the token level:

In [ ]:

print([doc[0].text, doc[0].ent_iob_, doc[0].ent_type_])

In [ ]:

print([doc[9].text, doc[9].ent_iob_, doc[9].ent_type_])

More information on accessing NER information: https://spacy.io/usage/linguistic-features#accessing
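
To make the IOB scheme more concrete, here is a minimal sketch in plain Python that groups token-level tags back into entities. The function name iob_to_spans and the triples below are hand-written for illustration (they are not output copied from a model run); with spaCy you could build such triples as [(t.text, t.ent_iob_, t.ent_type_) for t in doc].

```python
def iob_to_spans(triples):
    """Group (text, iob, ent_type) triples into (entity_text, label) pairs.

    "B" starts an entity, "I" continues it, "O" (or an empty tag) is outside.
    """
    spans = []
    current_tokens, current_label = [], None
    for text, iob, ent_type in triples:
        if iob == "B" or (iob == "I" and current_label is None):
            # Close any open entity, then start a new one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [text], ent_type
        elif iob == "I":
            current_tokens.append(text)
        else:
            # Outside an entity: close the open one, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Hand-written example triples (assumption: illustrative tags, not model output).
triples = [
    ("Apple", "B", "ORG"), ("is", "O", ""), ("looking", "O", ""),
    ("at", "O", ""), ("buying", "O", ""), ("U.K.", "B", "GPE"),
    ("startup", "O", ""), ("for", "O", ""),
    ("$", "B", "MONEY"), ("1", "I", "MONEY"), ("billion", "I", "MONEY"),
]
spans = iob_to_spans(triples)
print(spans)
```

Joining tokens with a single space is a simplification; ent.text in spaCy keeps the original whitespace of the input.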

Working with a different language¶

If you want to work on French data, use the following commands for setting up an NLP pipeline.

In [ ]:

!python -m spacy download fr_core_news_sm

In [ ]:

# Only needed for spaCy v2: without this line spaCy may not find the downloaded model.
# The "link" command was removed in spaCy v3, so skip this cell there.

! python -m spacy link --force fr_core_news_sm fr_core_news_sm

In [ ]:

nlp_fr = spacy.load("fr_core_news_sm")

See https://spacy.io/usage/models for more languages.

Testing the robustness¶

Try sentences from domains other than contemporary news.
Add noise (OCR errors, typos) to the text.

How much do the results suffer?
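
One possible way to add noise is sketched below in plain Python. The function add_noise and its parameters rate and seed are assumptions made for this exercise, and the random character edits are only a rough stand-in for real OCR errors or typos.

```python
import random


def add_noise(text, rate=0.1, seed=0):
    """Randomly substitute, delete or duplicate letters in text.

    rate is the probability that a letter is modified; digits, spaces and
    punctuation are left untouched. A fixed seed makes the noise reproducible.
    """
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            op = rng.choice(["substitute", "delete", "duplicate"])
            if op == "substitute":
                out.append(rng.choice(alphabet))
            elif op == "duplicate":
                out.append(ch + ch)
            # "delete": append nothing
        else:
            out.append(ch)
    return "".join(out)


sentence = "Apple is looking at buying U.K. startup for $1 billion"
noisy = add_noise(sentence, rate=0.2, seed=42)
print(noisy)
```

Passing the noisy string to nlp(...) and printing doc.ents as before shows which entities survive. Increasing rate step by step gives a rough picture of how quickly the results degrade.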

In [ ]:

doc_fr = nlp_fr(u'''Apple envisage d'acheter une startup britannique pour 1 milliard de dollars''')

In [ ]:

for ent in doc_fr.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Next step: Combining rule-based and statistical NER¶

A tutorial on how to use rule-based pattern matchers in spaCy can be found here: https://spacy.io/usage/rule-based-matching#entityruler

A nice example is the rule-based addition of person titles to named entities that have been recognized by statistical NER: https://spacy.io/usage/rule-based-matching#models-rules-ner
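
The idea behind the title example can be sketched in plain Python, independent of the spaCy API. The function expand_person_titles, the TITLES set, and the token and entity lists are all hypothetical and hand-written for illustration. Entities are given as (start, end, label) token-index spans with an exclusive end.

```python
# Assumption: a small, hand-picked list of titles; extend it for your data.
TITLES = {"Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms.", "Mrs", "Mrs."}


def expand_person_titles(tokens, ents, titles=TITLES):
    """Extend PERSON spans to the left if the preceding token is a title."""
    expanded = []
    for start, end, label in ents:
        if label == "PERSON" and start > 0 and tokens[start - 1] in titles:
            start -= 1  # include the title token in the entity
        expanded.append((start, end, label))
    return expanded


# Hand-written tokens and entity spans (not output from a model run).
tokens = ["Dr.", "Alex", "Smith", "chaired", "the", "meeting", "at", "Acme"]
ents = [(1, 3, "PERSON"), (7, 8, "ORG")]
expanded = expand_person_titles(tokens, ents)
print(expanded)
print([" ".join(tokens[s:e]) for s, e, _ in expanded])
```

In spaCy, logic like this would typically live in a custom pipeline component that runs after the NER component and overwrites doc.ents; the linked documentation shows the actual API for that.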

Next step: Online training of existing models¶

All spaCy NER models can be updated by training them further on new labeled examples. The relevant spaCy documentation and sample code can be found here: https://spacy.io/usage/training#ner

There is a step-by-step tutorial that shows how an existing model can be adapted to your own data: https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718
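
Labeled examples for NER are usually written as a text plus character offsets for each entity. The sketch below builds such an example from a list of surface strings. The helper make_training_example is hypothetical, and the exact data format expected for training depends on your spaCy version, so check the linked documentation before using it.

```python
def make_training_example(text, mentions):
    """Build (text, {"entities": [(start, end, label), ...]}) from mentions.

    mentions is a list of (surface_string, label) pairs, given in the order
    in which they appear in text. Offsets are character positions with an
    exclusive end, as printed by ent.start_char and ent.end_char.
    """
    entities = []
    search_from = 0
    for surface, label in mentions:
        start = text.find(surface, search_from)
        if start == -1:
            raise ValueError(f"{surface!r} not found in text")
        end = start + len(surface)
        entities.append((start, end, label))
        search_from = end  # continue after this mention
    return text, {"entities": entities}


text = "Apple is looking at buying U.K. startup for $1 billion"
mentions = [("Apple", "ORG"), ("U.K.", "GPE"), ("$1 billion", "MONEY")]
example = make_training_example(text, mentions)
print(example)
```

Computing the offsets from the text rather than typing them by hand avoids off-by-one errors in the annotations.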