CLTK uses a backoff lemmatizer consisting of a chain of different lemmatizers:
a dictionary-based lemmatizer with high-frequency words
a training-data-based lemmatizer based on 4,000 sentences from the Perseus Latin Dependency Treebanks
a regular-expression-based lemmatizer stripping word affixes
a dictionary-based lemmatizer with the complete set of Morpheus lemmas
an ‘identity’ lemmatizer returning the token itself as the lemma
We'll look at some of these in more detail later.
from cltk.lemmatize.latin.backoff import BackoffLatinLemmatizer
lemmatizer = BackoffLatinLemmatizer()
from data.word_tokenized_text import word_tokenized_text as text
lemmatized = lemmatizer.lemmatize(text)
lemmatized[:30]
[('Conditi', 'Conditi'), ('paradoxi', 'paradoxus'), ('compositio', 'compositio'), ('mellis', 'mel'), ('pondo', 'pondo'), ('XV', 'XV'), ('in', 'in'), ('aeneum', 'aeneus'), ('vas', 'vas'), ('mittuntur', 'mitto'), ('praemissis', 'praemitto'), ('vini', 'vinum'), ('sextariis', 'sextarius'), ('duobus', 'duo'), ('ut', 'ut'), ('in', 'in'), ('coctura', 'coquo'), ('mellis', 'mel'), ('vinum', 'vinum'), ('decoquas', 'decoquo'), ('quod', 'qui'), ('igni', 'ignis'), ('lento', 'lentus'), ('et', 'et'), ('aridis', 'aridus'), ('lignis', 'lignum'), ('calefactum', 'calefacio'), ('commotum', 'commoveo'), ('ferula', 'ferula'), ('dum', 'dum')]
line_lemmatized_text = []
from data.line_tokenized_text import line_tokenized_text
from cltk.stem.latin.j_v import JVReplacer

for line in line_tokenized_text:
    # JVReplacer normalizes the orthography (v -> u, j -> i) before the line is split into tokens and lemmatized
    _line = lemmatizer.lemmatize(JVReplacer().replace(line).split(" "))
    line_lemmatized_text.append(_line)
line_lemmatized_text[0]
[('Conditi', 'Conditi'), ('paradoxi', 'paradoxus'), ('compositio', 'compositio'), ('mellis', 'mel'), ('pondo', 'pondo'), ('XU', 'XU'), ('in', 'in'), ('aeneum', 'aeneus'), ('uas', 'uas'), ('mittuntu', 'mittuntu'), ('praemissis', 'praemitto'), ('uini', 'uinum'), ('sextariis', 'sextarius'), ('duobu', 'duobu'), ('ut', 'ut'), ('in', 'in'), ('coctura', 'coquo'), ('mellis', 'mel'), ('uinum', 'uinum'), ('decoquas', 'decoquo')]
Suppose we want to detect which tokens of the text correspond to the ingredients needed for each recipe. In theory we could create a huge list of all the ingredients and simply check whether each token is on the list; that, however, would mean also including every grammatical case and orthographic variation of each word. With a lemmatizer, we can instead restrict the list to the lemmata of the ingredients and simply check whether the lemmatized token belongs to it.
from data.ingredient_list import ingredients
ingredients = set(ingredients)
ingredient_indices = []
for i, lemma in enumerate(lemmatized):
    if lemma[1] in ingredients:
        ingredient_indices.append(i)
ingredient_indices[:30]
[3, 11, 17, 18, 34, 83, 88, 102, 119, 122, 137, 140, 147, 152, 185, 206, 214, 221, 234, 247, 256, 280, 307, 314, 322, 334, 357, 360, 388, 394]
Here we check which of the first 100 tokens are ingredients.
for i, lemma in enumerate(lemmatized[:100]):
    if lemma[1] in ingredients:
        print(i, lemmatized[i])
3 ('mellis', 'mel')
11 ('vini', 'vinum')
17 ('mellis', 'mel')
18 ('vinum', 'vinum')
34 ('vini', 'vinum')
83 ('vino', 'vinum')
88 ('vini', 'vinum')
If you have a collection of tagged lemmata, you can also train your own lemmatizers based on NLTK's models.
The identity lemmatizer simply returns the token itself.
from cltk.lemmatize.backoff import IdentityLemmatizer
lemmatizer_identity = IdentityLemmatizer()
lemmatizer_identity.lemmatize(['mellis', 'vino'])
[('mellis', 'mellis'), ('vino', 'vino')]
The dictionary lemmatizer looks up the lemma in a user-defined dictionary. It is suitable for commonly occurring tokens (mainly words with irregular inflection). You can also define a "backoff" lemmatizer, which your lemmatizer falls back to for tokens that are not in the dictionary.
from cltk.lemmatize.backoff import DictLemmatizer
lemmata_dict = {'mellis': 'mel', 'vino': 'vinum'}
lemmatizer_dict = DictLemmatizer(lemmas=lemmata_dict, backoff=lemmatizer_identity)
lemmatizer_dict.lemmatize(['mellis', 'vino', 'vini'])
[('mellis', 'mel'), ('vino', 'vinum'), ('vini', 'vini')]
The unigram lemmatizer is trained on a tagged list of lemmata and, for each token, returns the lemma most frequently associated with it in the training data.
from cltk.lemmatize.backoff import UnigramLemmatizer
train_data = [[('dactylum', 'dactylus'), ('dactylum', 'dactilus'), ('dactylum', 'dactylus')]]
lemmatizer_unigram = UnigramLemmatizer(train_data, backoff=lemmatizer_dict)
lemmatizer_unigram.lemmatize(['dactylum'])
[('dactylum', 'dactylus')]
You can also create a rule-based lemmatizer using regular expressions.
from cltk.lemmatize.backoff import RegexpLemmatizer
regexps = [
    (r'^(.+)(a|is|ii)$', r'\1um')
]
lemmatizer_regexp = RegexpLemmatizer(regexps=regexps, backoff=lemmatizer_unigram)
lemmatizer_regexp.lemmatize(['vinis', 'mella', 'dactylum'])
[('vinis', 'vinum'), ('mella', 'mellum'), ('dactylum', 'dactylus')]
It is very common for the same word to be spelled differently, even within the same text. Write a simple program that detects whether a word matches an entry in the ingredient list with at most one differing letter (bonus points for also handling inserted or deleted characters). One possible solution is sketched after the examples below.
>>> isIngredient("daktylus")
True
>>> isIngredient("dactyllus")
True
>>> isIngredient("daktyllus")
False
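One possible approach (a sketch of our own, not part of CLTK) is to compare a word against every ingredient lemma with an edit distance of at most one, which covers one substituted letter as well as the bonus case of a single inserted or deleted letter. The helper name within_one_edit is hypothetical, and the sketch assumes the ingredients set loaded earlier contains the lemma 'dactylus'.

def within_one_edit(a, b):
    # True if a and b differ by at most one substitution, insertion or deletion
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # equal length: allow exactly one substituted letter
        return sum(x != y for x, y in zip(a, b)) == 1
    # lengths differ by one: allow a single inserted or deleted letter
    if len(a) > len(b):
        a, b = b, a
    for i in range(len(a)):
        if a[i] != b[i]:
            return a[i:] == b[i + 1:]
    return True

def isIngredient(word):
    return any(within_one_edit(word, ingredient) for ingredient in ingredients)

Under these assumptions, 'daktylus' and 'dactyllus' are each one edit away from 'dactylus' and are accepted, while 'daktyllus' requires two edits and is rejected.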