As I mentioned in the previous notebook, there are two types of NLP and NER: rules-based and machine learning-based. This notebook will be dedicated to rules-based, while the next one will be dedicated to machine learning-based.
Rules-based NER is the method by which an NLP practitioner either creates or utilizes an NLP system that has a predefined set of instructions, or rules, to perform certain NLP tasks. For NER, this often times means using what is known as a gazetteer. A gazetteer is a list, or dictionary, of entities that align with a specific label. In the case of people, this would be a list of first and last names. If you are developing an NER for a specific region, as we will in a later notebook, this may be a list of all locations in that region.
Well-cultivated gazetteer served as the primary method for performing NER in the early twenty-first century. It is still a staple in some NLP frameworks, such as v.0.01 of the Classical Language Toolkit (CLTK), which is currently being replaced with machine learning models.
Sometimes, however, a researcher may not have a list of all potential names in a domain. In these instances, should one wish to pursue rules-based NLP, one would utilize pattern rules-based NER. In this scenario, the researcher creates a specific set of patterns to find and then presume that anything that aligns with that pattern is a specific entity. This can be used for places, i.e. Berlin, Germany; London, England, Dallas, Texas. In this scenario, one could use string pattern libraries, such as RegEx (Regular Expressions) to find instances of a capitalized word, followed by a comma, followed by another capitalized word and presume that such an occurrence is a PLACE. More commonly, however, such a method is used to find dates in a text. While there are many ways to represent a date, there are a limited number of patterns. For example, were I to want to represent January 1st, I could write the following:
There are likely other potential variations, but my point is that there is a limited number of ways of representing a date. For this reason, finding and extraction temporal entities in texts is usually fairly easy via rules-based pattern methods. The problem becomes even easier if dates are only represented a few ways. One could develop a pattern to find any instance of a number, followed by a capital word or visa versa as a date.
I mentioned in the previous notebook that NLP is moving away from rules-based methods and toward machine learning-based ones. You may be thinking, why you should learn rules-based NLP and NER.
The answer is because there remain certain situations in which a rules-based solution is better than a machine learning-based one. Further, there are times when a combination of the two is more suited. In order to understand when to use which solution is important and, therefore, a firm foundation in rules-based NER is essential in order to do machine learning NER.
This raises an important question. When should you use a rules-based NER approach? The answer is simple. When there are a finite number of ways a specific entity will be represented so that you can catch roughly 95-97% of them with such rules. I say 95-97% not because this is the industry standard, rather because this is my goal for my NER models. If I can use a rules-based approach and achieve equal accuracy as an ML model, I will likely do that because the time it takes to implement a rules-based approach to NER is often less than the time it takes to train, validate, and test a machine learning model.
As we will see in the video below, this is a particularly useful approach when you know every potential name in a corpus, such as we will see with the toy example of Harry Potter.
A key thing to remember, however, is that rules-based methods are precisely that: rules-based. If an entity falls outside of the rule you have developed, then it will not be flagged as an entity. This is particularly prevenlant in texts that have been OCRed, typed without spellcheck, unedited, or any form of uncleaned text.
While cleaning texts is an essential component to proper data preparation for NLP, it is sometimes impossible to clean entirely a text and, sometimes, an individual using your particular NER framework may not know to clean texts before hand. This is the chief limitation of rules-based methods and it is one of the chief reasons researchers use machine learning methods today. Machine learning models learn and can, therefore, generalize on unseen data, despite variance (to an extent) from things it has seen in the past. We will explore this in greater detail in the next notebook.
While spaCy is chiefly a machine learning NLP library, it possess the ability to use rules-based NER methods. This is a major strength of the library, for as we shall see to develop a strong NER system for many difficult domains, a combination of rules-based and machine learning-based approaches are sometimes essential.
There are a few ways to engage in rules-based NER with spaCy, but one of the more fundamental is its EntityRuler.
Let's return to our sample sentence from the first notebook with Martha the basketball player. In this scenario, we want to not only extract the normal entities from the text (PERSON, DATE, etc.), but also a new entity SPORT. In a later notebook, we will discuss adding custom entities in much more detail. For now, we will use this simple toy example to demonstrate how rules-based NER works.
text = "Martha, a senior, moved to Spain where she will be playing basketball until 05 June 2022 or until she can't play any longer."
#Import spacy
import spacy
#Create a blank spaCy model that will parse English ("en")
nlp = spacy.blank("en")
#Create a set of patterns
patterns = [{"label": "SPORT", "pattern": "basketball"}]
#Create a ruler that we will add to the model
entity_ruler = nlp.add_pipe("entity_ruler")
#Initialize the entity ruler with the patterns
entity_ruler.add_patterns(patterns)
#Create the doc object
doc = nlp(text)
#Iterate over all entities (there will be only one)
for ent in doc.ents:
print (ent.text, ent.label_)
INFO:tensorflow:Enabling eager execution INFO:tensorflow:Enabling v2 tensorshape INFO:tensorflow:Enabling resource variables INFO:tensorflow:Enabling tensor equality INFO:tensorflow:Enabling control flow v2 basketball SPORT
We see that the output is as desired. We have extracted basektball, the string, and its correct label, SPORT. Don't be alarmed if you don't entirely understand how the code above works. By the end of this series, you will. For now, simply understand how rules-based NER works. Expirement with the code above and try to create a model below that can identify soccer as a SPORT in the following text:
text = "Scott enjoys to play soccer."
#Here, replicate the code above to find soccer as a SPORT
Once you've finished with the above exercise, check out the video on Rules-Based NER below. It reinforces some of these concepts and explores them in a larger context, specifically via the first book of Harry Potter.
%%html
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/16Ujcah_-h0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>