#!/usr/bin/env python # coding: utf-8 # I was studying German the other day and stumbled upon a typo that leads me an interesting observation on these two words: # # - [anschließen](https://en.wiktionary.org/wiki/anschlie%C3%9Fen#German) (to connect) # - [anschließend](https://en.wiktionary.org/wiki/anschlie%C3%9Fend#German) (following, afterwards) # # They are very "similar" and I would like them to be connected in [wilhelmlang.com](https://wilhelmlang.com/), a platform that helps language learner learn multi-languages via knowledge graph. # # We define the similarity of two words in this context as follows: # # ___Two words are similar either structurally semantically___. # # For example: # # - __anschließen__ and __anschließend__ are structually similar because they differ by just one character (trailing __d__). # - __anschließend__ and [__nachher__](https://en.wiktionary.org/wiki/nachher#German), are semantically similar because they both mean __afterwards__ as adverb # - Some can possess both. For instance, [__das Theater__](https://en.wiktionary.org/wiki/Theater#German) (the theater) and [__das Theaterstück__](https://en.wiktionary.org/wiki/Theaterst%C3%BCck#German) (the drama) are similar both semantically and structurally # ### Lavenshtien's Distance # # The first idea was to calculating the similarity between two words # # The closest would be like the [Levenstein's distance](https://en.wikipedia.org/wiki/Levenshtein_distance) (also popularly called the _edit distance_). # # > In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. # # In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. # In[1]: import nltk nltk.edit_distance("anschließen", "anschließend") # The code above would return 1, as only one letter is different between the two words. Lavenshtien's distance is good for spotting the __anschließen-anschließend__ case # It should be noted that Lavenshtien's distance must be combined with stemming to eliminate false-positive. For example, the German words "Bank" and "Sahne" (cream) has a pretty small edit distance (3). To distinguish "Bank-Sahne" with "anschließen-anschließend", notice that the former share different word stem while the latter share the same: # In[2]: from nltk.stem.snowball import GermanStemmer stemmer = GermanStemmer() words = ["Bank", "Sahne", "anschließen", "anschließend"] for word in words: print("{original} stem: {stemmed}".format(original=word, stemmed=stemmer.stem(word))) # The __anschließend-nachher__, however, won't work well with the approach above. We need a different metric approach. # ### Cosin Similarity