Notebook

I was studying German the other day and stumbled upon a typo that led me to an interesting observation about these two words:

anschließen (to connect)
anschließend (following, afterwards)

They are very "similar", and I would like them to be connected in wilhelmlang.com, a platform that helps language learners study multiple languages through a knowledge graph.

We define the similarity of two words in this context as follows:

___Two words are similar either structurally or semantically___.

For example:

anschließen and anschließend are structurally similar because they differ by just one character (the trailing d).
anschließend and nachher are semantically similar because, as adverbs, they both mean "afterwards".
Some pairs are similar in both ways. For instance, das Theater (the theater) and das Theaterstück (the drama) are similar both semantically and structurally.

Levenshtein Distance¶

The first idea was to calculate the similarity between two words.

The closest existing measure is the Levenshtein distance (also commonly called the edit distance).

In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
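To make the definition concrete, here is a minimal sketch of the textbook dynamic-programming approach to edit distance. The function name `levenshtein` is made up for illustration and is not part of NLTK. It counts insertions, deletions, and substitutions at a cost of 1 each.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # previous[j] = distance between the prefix of `a` processed so far
    # (initially empty) and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # Turning the first i characters of `a` into "" takes i deletions.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(levenshtein("anschließen", "anschließend"))  # 1
print(levenshtein("Bank", "Sahne"))                # 3
```

Only two rows of the table are kept at a time, so memory grows with the length of the second word rather than with the product of both lengths.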

In [1]:

import nltk
nltk.edit_distance("anschließen", "anschließend")

Out[1]:

1

The code above returns 1 because the two words differ by a single character. Levenshtein distance is good at spotting the anschließen-anschließend case.

Note that Levenshtein distance must be combined with stemming to eliminate false positives. For example, the German words "Bank" and "Sahne" (cream) have a fairly small edit distance (3) even though they are unrelated. To distinguish "Bank-Sahne" from "anschließen-anschließend", notice that the words in the former pair have different stems while the words in the latter pair share the same stem:

In [2]:

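from nltk.stem.snowball import GermanStemmer
stemmer = GermanStemmer()

# An unrelated pair (Bank, Sahne) and a related pair (anschließen, anschließend)
words = ["Bank", "Sahne", "anschließen", "anschließend"]

# Print each word next to its stem so the pairs can be compared
for word in words:
    print("{original} stem: {stemmed}".format(original=word, stemmed=stemmer.stem(word)))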

Bank stem: bank
Sahne stem: sahn
anschließen stem: anschliess
anschließend stem: anschliess
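The two checks can be combined into one predicate. This is only a sketch. The helper name `structurally_similar` and the default threshold `max_distance=3` are assumptions made for illustration, not anything defined by NLTK or by wilhelmlang.com. The stemmer and the distance function are passed in as parameters so the helper does not depend on a particular library.

```python
from typing import Callable


def structurally_similar(
    a: str,
    b: str,
    stem: Callable[[str], str],
    distance: Callable[[str, str], int],
    max_distance: int = 3,  # assumed threshold, tune for real data
) -> bool:
    """Hypothetical helper: True when both words share a stem
    and are at most `max_distance` edits apart."""
    # Different stems rule the pair out, however small the edit distance.
    if stem(a) != stem(b):
        return False
    return distance(a, b) <= max_distance
```

With the objects from the cells above, `structurally_similar("Bank", "Sahne", stemmer.stem, nltk.edit_distance)` would be rejected because the stems (bank, sahn) differ, while the anschließen-anschließend pair would pass: both stem to anschliess and the edit distance is 1.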

The anschließend-nachher pair, however, does not work well with the approach above: the two words have different stems and are far apart in edit distance. We need a different metric.

Cosine Similarity¶