#!/usr/bin/env python
# coding: utf-8

# I was studying German the other day and stumbled upon a typo that leads me an interesting observation on these two words:
# 
# - [anschließen](https://en.wiktionary.org/wiki/anschlie%C3%9Fen#German) (to connect)
# - [anschließend](https://en.wiktionary.org/wiki/anschlie%C3%9Fend#German) (following, afterwards)
# 
# They are very "similar" and I would like them to be connected in [wilhelmlang.com](https://wilhelmlang.com/), a platform that helps language learner learn multi-languages via knowledge graph.
# 
# We define the similarity of two words in this context as follows:
# 
# ___Two words are similar either structurally semantically___.
# 
# For example:
# 
# - __anschließen__ and __anschließend__ are structually similar because they differ by just one character (trailing __d__).
# - __anschließend__ and [__nachher__](https://en.wiktionary.org/wiki/nachher#German), are semantically similar because they both mean __afterwards__ as adverb
# - Some can possess both. For instance, [__das Theater__](https://en.wiktionary.org/wiki/Theater#German) (the theater) and [__das Theaterstück__](https://en.wiktionary.org/wiki/Theaterst%C3%BCck#German) (the drama) are similar both semantically and structurally

# ### Lavenshtien's Distance
# 
# The first idea was to calculating the similarity between two words
# 
# The closest would be like the [Levenstein's distance](https://en.wikipedia.org/wiki/Levenshtein_distance) (also popularly called the _edit distance_).
# 
# > In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
# 
# In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

# In[1]:


import nltk
nltk.edit_distance("anschließen", "anschließend")


# The code above would return 1, as only one letter is different between the two words. Lavenshtien's distance is good for spotting the __anschließen-anschließend__ case

# It should be noted that Lavenshtien's distance must be combined with stemming to eliminate false-positive. For example, the German words "Bank" and "Sahne" (cream) has a pretty small edit distance (3). To distinguish "Bank-Sahne" with "anschließen-anschließend", notice that the former share different word stem while the latter share the same:

# In[2]:


from nltk.stem.snowball import GermanStemmer
stemmer = GermanStemmer()

words = ["Bank", "Sahne", "anschließen", "anschließend"]

for word in words:
    print("{original} stem: {stemmed}".format(original=word, stemmed=stemmer.stem(word)))


# The __anschließend-nachher__, however, won't work well with the approach above. We need a different metric approach.

# ### Cosin Similarity