CLTK uses a backoff lemmatizer consisting of a chain of different lemmatizers:
a dictionary-based lemmatizer with high-frequency words
a training-data-based lemmatizer based on 4,000 sentences from the Perseus Latin Dependency Treebanks
a regular-expression-based lemmatizer stripping word affixes
a dictionary-based lemmatizer with the complete set of Morpheus lemmas
an ‘identity’ lemmatizer returning the token itself as the lemma
We'll look at some of these in more detail later.
from cltk.lemmatize.latin.backoff import BackoffLatinLemmatizer
lemmatizer = BackoffLatinLemmatizer()
from data.word_tokenized_text import word_tokenized_text as text
lemmatized = lemmatizer.lemmatize(text)
lemmatized[:30]
[('Conditi', 'Conditi'), ('paradoxi', 'paradoxus'), ('compositio', 'compositio'), ('mellis', 'mel'), ('pondo', 'pondo'), ('XV', 'XV'), ('in', 'in'), ('aeneum', 'aeneus'), ('vas', 'vas'), ('mittuntur', 'mitto'), ('praemissis', 'praemitto'), ('vini', 'vinum'), ('sextariis', 'sextarius'), ('duobus', 'duo'), ('ut', 'ut'), ('in', 'in'), ('coctura', 'coquo'), ('mellis', 'mel'), ('vinum', 'vinum'), ('decoquas', 'decoquo'), ('quod', 'qui'), ('igni', 'ignis'), ('lento', 'lentus'), ('et', 'et'), ('aridis', 'aridus'), ('lignis', 'lignum'), ('calefactum', 'calefacio'), ('commotum', 'commoveo'), ('ferula', 'ferula'), ('dum', 'dum')]
line_lemmatized_text = []
from data.line_tokenized_text import line_tokenized_text
from cltk.stem.latin.j_v import JVReplacer

for line in line_tokenized_text:
    # JVReplacer normalizes the orthography (v -> u, j -> i) before the line is split into tokens and lemmatized
    _line = lemmatizer.lemmatize(JVReplacer().replace(line).split(" "))
    line_lemmatized_text.append(_line)
line_lemmatized_text[0]
[('Conditi', 'Conditi'), ('paradoxi', 'paradoxus'), ('compositio', 'compositio'), ('mellis', 'mel'), ('pondo', 'pondo'), ('XU', 'XU'), ('in', 'in'), ('aeneum', 'aeneus'), ('uas', 'uas'), ('mittuntu', 'mittuntu'), ('praemissis', 'praemitto'), ('uini', 'uinum'), ('sextariis', 'sextarius'), ('duobu', 'duobu'), ('ut', 'ut'), ('in', 'in'), ('coctura', 'coquo'), ('mellis', 'mel'), ('uinum', 'uinum'), ('decoquas', 'decoquo')]
Suppose we want to detect which tokens of the text correspond to the ingredients needed for each recipe. In theory we could create a huge list of all the ingredients and simply check whether each token is on the list; that, however, would mean also including every grammatical case and orthographic variation of each word. With a lemmatizer, we can instead restrict the list to the lemmata of the ingredients and simply check whether the lemmatized token belongs to it.
from data.ingredient_list import ingredients
ingredients = set(ingredients)
ingredient_indices = []
for i, lemma in enumerate(lemmatized):
    if lemma[1] in ingredients:
        ingredient_indices.append(i)
ingredient_indices[:30]
[3, 11, 17, 18, 34, 83, 88, 102, 119, 122, 137, 140, 147, 152, 185, 206, 214, 221, 234, 247, 256, 280, 307, 314, 322, 334, 357, 360, 388, 394]
Here we check which of the first 100 tokens are ingredients.
for i, lemma in enumerate(lemmatized[:100]):
    if lemma[1] in ingredients:
        print(i, lemmatized[i])
3 ('mellis', 'mel')
11 ('vini', 'vinum')
17 ('mellis', 'mel')
18 ('vinum', 'vinum')
34 ('vini', 'vinum')
83 ('vino', 'vinum')
88 ('vini', 'vinum')
If you have a collection of tagged lemmata, you can also train your own lemmatizers based on NLTK's models.
The identity lemmatizer simply returns the token itself.
from cltk.lemmatize.backoff import IdentityLemmatizer
lemmatizer_identity = IdentityLemmatizer()
lemmatizer_identity.lemmatize(['mellis', 'vino'])
[('mellis', 'mellis'), ('vino', 'vino')]
The dictionary lemmatizer looks up the lemma in a user-defined dictionary. It is suitable for commonly occurring tokens (mainly words with irregular inflection). You can also define a "backoff" lemmatizer, which your lemmatizer falls back to for tokens that are not in the dictionary.
from cltk.lemmatize.backoff import DictLemmatizer
lemmata_dict = {'mellis': 'mel', 'vino': 'vinum'}
lemmatizer_dict = DictLemmatizer(lemmas=lemmata_dict, backoff=lemmatizer_identity)
lemmatizer_dict.lemmatize(['mellis', 'vino', 'vini'])
[('mellis', 'mel'), ('vino', 'vinum'), ('vini', 'vini')]
The unigram lemmatizer is trained on a tagged list of lemmata and, for each token, returns the lemma most frequently associated with it in the training data.
from cltk.lemmatize.backoff import UnigramLemmatizer
train_data = [[('dactylum', 'dactylus'), ('dactylum', 'dactilus'), ('dactylum', 'dactylus')]]
lemmatizer_unigram = UnigramLemmatizer(train_data, backoff=lemmatizer_dict)
lemmatizer_unigram.lemmatize(['dactylum'])
[('dactylum', 'dactylus')]
You can also create a rule-based lemmatizer using regular expressions.
from cltk.lemmatize.backoff import RegexpLemmatizer
regexps = [
    (r'^(.+)(a|is|ii)$', r'\1um')
]
lemmatizer_regexp = RegexpLemmatizer(regexps=regexps, backoff=lemmatizer_unigram)
lemmatizer_regexp.lemmatize(['vinis', 'mella', 'dactylum'])
[('vinis', 'vinum'), ('mella', 'mellum'), ('dactylum', 'dactylus')]
It is very common for the same word to be spelled differently, even within the same text. Write a simple program that detects whether a word matches an entry in the ingredient list with at most one differing letter (bonus points for also handling inserted or deleted characters). One possible solution is sketched after the examples below.
>>> isIngredient("daktylus")
True
>>> isIngredient("dactyllus")
True
>>> isIngredient("daktyllus")
False
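One possible approach (a sketch of our own, not part of CLTK) is to compare a word against every ingredient lemma with an edit distance of at most one, which covers one substituted letter as well as the bonus case of a single inserted or deleted letter. The helper name within_one_edit is hypothetical, and the sketch assumes the ingredients set loaded earlier contains the lemma 'dactylus'.

def within_one_edit(a, b):
    # True if a and b differ by at most one substitution, insertion or deletion
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # equal length: allow exactly one substituted letter
        return sum(x != y for x, y in zip(a, b)) == 1
    # lengths differ by one: allow a single inserted or deleted letter
    if len(a) > len(b):
        a, b = b, a
    for i in range(len(a)):
        if a[i] != b[i]:
            return a[i:] == b[i + 1:]
    return True

def isIngredient(word):
    return any(within_one_edit(word, ingredient) for ingredient in ingredients)

Under these assumptions, 'daktylus' and 'dactyllus' are each one edit away from 'dactylus' and are accepted, while 'daktyllus' requires two edits and is rejected.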