I was studying German the other day and stumbled upon a typo that led me to an interesting observation about these two words:
They are very "similar", and I would like them to be connected in wilhelmlang.com, a platform that helps language learners study multiple languages via a knowledge graph.
We define the similarity of two words in this context as follows:
___Two words are similar either structurally or semantically___.
For example:
The first idea was to calculate the similarity between two words.
The closest existing metric is the Levenshtein distance (also commonly called the edit distance).
In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
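To make the definition concrete, here is a minimal pure-Python sketch of the textbook dynamic-programming approach. The function name `levenshtein` is my own choice for illustration; it is not how any particular library implements the metric.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn `a` into `b`."""
    # previous[j] = distance between the prefix of `a` handled so far
    # (initially the empty string) and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # Turning the first i characters of `a` into "" takes i deletions.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("anschließen", "anschließend"))  # 1
print(levenshtein("Bank", "Sahne"))                # 3
```

Only two rows of the table are kept at a time, so memory grows with the length of the second word rather than with the product of both lengths.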
import nltk
nltk.edit_distance("anschließen", "anschließend")
1
The code above returns 1, because the two words differ by a single letter (the trailing "d"). The Levenshtein distance is good for spotting the anschließen-anschließend case.
It should be noted that the Levenshtein distance must be combined with stemming to eliminate false positives. For example, the German words "Bank" and "Sahne" (cream) have a fairly small edit distance (3). To distinguish "Bank-Sahne" from "anschließen-anschließend", notice that the words in the former pair have different stems, while the words in the latter pair share the same stem:
from nltk.stem.snowball import GermanStemmer
stemmer = GermanStemmer()
words = ["Bank", "Sahne", "anschließen", "anschließend"]
for word in words:
    print("{original} stem: {stemmed}".format(original=word, stemmed=stemmer.stem(word)))
Bank stem: bank
Sahne stem: sahn
anschließen stem: anschliess
anschließend stem: anschliess
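Putting the two signals together could look like the sketch below. Everything in it is an assumption made for illustration: the helper names `edit_distance` and `structurally_similar`, the threshold of 3, and the `toy_stem` lookup table, which simply hard-codes the stems printed above so that the snippet runs without NLTK. In practice one would pass `GermanStemmer().stem` as the `stem` argument instead.

```python
from typing import Callable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def structurally_similar(a: str, b: str,
                         stem: Callable[[str], str],
                         max_distance: int = 3) -> bool:
    """Two words count as structurally similar only if they share a stem
    AND are within `max_distance` edits (the threshold is an assumption)."""
    return stem(a) == stem(b) and edit_distance(a, b) <= max_distance


# Hypothetical stand-in for a real stemmer: hard-coded stems copied from
# the stemmer output shown above.
TOY_STEMS = {
    "Bank": "bank",
    "Sahne": "sahn",
    "anschließen": "anschliess",
    "anschließend": "anschliess",
}


def toy_stem(word: str) -> str:
    return TOY_STEMS[word]


print(structurally_similar("anschließen", "anschließend", toy_stem))  # True
print(structurally_similar("Bank", "Sahne", toy_stem))                # False
```

With the threshold set to 3, "Bank" and "Sahne" pass the distance check on their own, so it is the stem comparison that rejects the pair.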
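The anschließend-nachher pair, however, is not handled well by the approach above: the two words are close in meaning (both can be translated as "afterwards") but share almost no letters. We need a different metric for this case.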