In this hands-on, we use spaCy (https://spacy.io), a framework that covers all basic NLP processing steps. It supports several languages out of the box.
Downloading a small model for English trained on modern web data. (Names of medium models end with md, names of large models end with lg.)
!python -m spacy download en_core_web_sm
Setting up an NLP pipeline for English:
import spacy
nlp = spacy.load("en_core_web_sm")
Reading and linguistic processing of a single sentence.
More information on the meaning of the results can be found at https://spacy.io/usage/linguistic-features. Note that the names of the human-readable (string) token attributes always end with an underscore; the same names without the underscore return integer IDs.
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
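To illustrate the underscore convention, the following minimal sketch compares both forms of an attribute. It assumes spaCy is installed and uses a blank English pipeline (spacy.blank), which only tokenizes, so no model download is needed. The names nlp_blank and doc_blank are illustrative and not part of the original notebook.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components.
# (Assumption: this is enough to show the attribute naming convention.)
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Apple is looking at buying U.K. startup for $1 billion")

for token in doc_blank:
    # orth / lower are integer IDs; orth_ / lower_ are the readable strings.
    print(token.orth, token.orth_, token.lower, token.lower_)

# An integer ID can be mapped back to its string through the vocabulary.
first = doc_blank[0]
print(nlp_blank.vocab.strings[first.orth] == first.text)
```

The same pattern applies to attributes such as pos / pos_ or dep / dep_, although those are only filled in by a trained pipeline.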
The default spaCy pipeline for English includes named entity recognition (NER). See https://spacy.io/usage/linguistic-features#named-entities for more information.
Let's print the text, character offsets, and label of every named entity found:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
Accessing IOB information at the token level:
print([doc[0].text, doc[0].ent_iob_, doc[0].ent_type_])
print([doc[9].text, doc[9].ent_iob_, doc[9].ent_type_])
More information on accessing NER information: https://spacy.io/usage/linguistic-features#accessing
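In the IOB scheme, B marks the first token of an entity, I marks a token inside an entity, and O marks a token outside any entity. The sketch below makes this visible for every token of the sentence. To stay independent of a downloaded model, it sets two entity spans by hand on a blank English pipeline; these spans are illustrative annotations, not model output, and the names nlp_blank and doc_iob are assumptions.

```python
import spacy

nlp_blank = spacy.blank("en")
text = "Apple is looking at buying U.K. startup for $1 billion"
doc_iob = nlp_blank(text)

# Hand-written entity spans (illustrative, not predicted by a model),
# given as character offsets so they do not depend on token indices.
money_start = text.index("$")
doc_iob.ents = [
    doc_iob.char_span(0, len("Apple"), label="ORG"),
    doc_iob.char_span(money_start, len(text), label="MONEY"),
]

# Token-level view: IOB tag and entity type for every token.
for token in doc_iob:
    print(token.text, token.ent_iob_, token.ent_type_)
```

The multi-token MONEY entity shows the difference between B (on the currency symbol) and I (on the following tokens).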
If you want to work on French data, use the following commands for setting up an NLP pipeline.
!python -m spacy download fr_core_news_sm
# spaCy v2 only: without this line spaCy is not able to find the downloaded model.
# The link command was removed in spaCy v3, so skip this line there.
!python -m spacy link --force fr_core_news_sm fr_core_news_sm
nlp_fr = spacy.load("fr_core_news_sm")
See https://spacy.io/usage/models for more languages.
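How much do the results suffer compared with the English model? Let's run the French pipeline on a translation of the same sentence: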
doc_fr = nlp_fr(u'''Apple envisage d'acheter une startup britannique pour 1 milliard de dollars''')
for ent in doc_fr.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
A tutorial on how to use rule-based pattern matchers in spaCy can be found here: https://spacy.io/usage/rule-based-matching#entityruler
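As a rough idea of what rule-based entity matching looks like, here is a minimal sketch with the entity ruler. It assumes the spaCy v3 API, where nlp.add_pipe("entity_ruler") creates and returns the component (in v2 the component is constructed differently). The patterns and the names nlp_rules and doc_rules are made up for illustration.

```python
import spacy

# Blank pipeline so that only the rules below produce entities.
nlp_rules = spacy.blank("en")

# Assumption: spaCy v3, where add_pipe takes the component name as a string.
ruler = nlp_rules.add_pipe("entity_ruler")
ruler.add_patterns([
    # Phrase pattern: exact string match.
    {"label": "GPE", "pattern": "U.K."},
    # Token pattern: one dict per token, here a case-insensitive match.
    {"label": "ORG", "pattern": [{"LOWER": "apple"}]},
])

doc_rules = nlp_rules("Apple is looking at buying U.K. startup for $1 billion")
for ent in doc_rules.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```

When the ruler is added to a pipeline that already contains a statistical NER component, its position relative to that component decides which of the two gets to set an entity first.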
A nice example is the rule-based addition of person titles to named entities that have been recognized by statistical NER: https://spacy.io/usage/rule-based-matching#models-rules-ner
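The idea behind that example can be sketched in a few lines. This is a simplified variant, not the code from the linked page: the PERSON entity is set by hand in place of a statistical model, the document is built from a fixed word list, and the names TITLES, expand_person_entities, and doc_title are illustrative.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

# Illustrative list of titles; extend as needed.
TITLES = {"Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."}

def expand_person_entities(doc):
    """Extend each PERSON entity to include a directly preceding title."""
    new_ents = []
    for ent in doc.ents:
        has_title = ent.start > 0 and doc[ent.start - 1].text in TITLES
        if ent.label_ == "PERSON" and has_title:
            ent = Span(doc, ent.start - 1, ent.end, label=ent.label_)
        new_ents.append(ent)
    doc.ents = new_ents
    return doc

# Pre-tokenized document with a hand-set PERSON entity standing in for
# the output of a statistical NER component.
doc_title = Doc(Vocab(), words=["Dr.", "Alex", "Smith", "chaired", "the", "meeting"])
doc_title.ents = [Span(doc_title, 1, 3, label="PERSON")]
print([(ent.text, ent.label_) for ent in doc_title.ents])

expand_person_entities(doc_title)
print([(ent.text, ent.label_) for ent in doc_title.ents])
```

In a real pipeline, such a function would be registered as a custom component and placed after the NER component.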
All spaCy NER models can be updated by further training them on new labeled examples. The relevant spaCy documentation and sample code can be found here: https://spacy.io/usage/training#ner
There is a step-by-step tutorial that shows how an existing model can be adapted to your own data: https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718
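To give an impression of what such training code involves, here is a minimal sketch of an update loop. It assumes the spaCy v3 API (Example.from_dict, nlp.initialize, nlp.update) and, to stay self-contained, trains a new NER component in a blank pipeline on two made-up sentences instead of updating a downloaded model. TRAIN_DATA, nlp_train, and the number of iterations are illustrative choices, and two sentences are far too few for a useful model.

```python
import random

import spacy
from spacy.training import Example

# Toy training data (made up): text plus entities as character offsets.
TRAIN_DATA = [
    ("Apple is looking at buying U.K. startup",
     {"entities": [(0, 5, "ORG"), (27, 31, "GPE")]}),
    ("Apple opened a new office",
     {"entities": [(0, 5, "ORG")]}),
]

nlp_train = spacy.blank("en")
nlp_train.add_pipe("ner")

# Pair each unannotated Doc with its gold-standard annotations.
examples = [
    Example.from_dict(nlp_train.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]

# initialize() infers the labels from the examples and returns an optimizer.
optimizer = nlp_train.initialize(lambda: examples)

for iteration in range(10):
    random.shuffle(examples)
    losses = {}
    nlp_train.update(examples, sgd=optimizer, losses=losses)

print(losses)
```

For updating an existing model, the pipeline would be loaded with spacy.load and, as far as the v3 documentation describes it, nlp.resume_training() would take the place of nlp.initialize().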