#!/usr/bin/env python # coding: utf-8 # # Document Similarity # This notebook introduces techniques for exploring document similarity. It's part of the [The Art of Literary Text Analysis](ArtOfLiteraryTextAnalysis.ipynb) (and assumes that you've already worked through previous notebooks – see the table of contents). In this notebook we'll look in particular at: # # * [Document term matrix](#Document-Term-Matrix) # * [TF-IDF](#TF-IDF) # * [Cosine similarity](#Cosine-Similarity) # * [Visualizing document similarity with scatterplots](#Visualizing-Document-Distances-with-a-Scatterplot) # * [Visualizing document similarity with dendrograms](#Visualizing-Document-Clusters-with-a-Dendrogram) # A common task when working with a larger corpus is to try to determine how documents relate to one another – how similar or different they are, or how they might cluster together. For instance, if we take the 18 texts from the NTLK Gutenberg corpus, are there any texts that group together? There are several authors with multiple texts (Shakespeare, Austen, Chesterton), do these cluster together? # # More importantly, how do we determine similarity? For instance, we could plot the ratio of sentences per number of words and see if that's helpful in determining stylistic similarity. # In[1]: import nltk get_ipython().run_line_magic('matplotlib', 'inline') sentenceLengths = {} for fileid in nltk.corpus.gutenberg.fileids(): sentenceLengths[fileid] = len(nltk.corpus.gutenberg.sents(fileid)) / len(nltk.corpus.gutenberg.words(fileid)) nltk.FreqDist(sentenceLengths).plot() # We do indeed get a certain amount of grouping, for instance the first three texts are from Shakespeare, where there are a high number of sentences per total number of tokens. But this is perhaps more an indication of genre and even formatting rather than of vocabulary (indicated character names in the plain text plays are counted as sentences and the lines of text tend to be formatted shorter than in prose). # ## Document Term Matrix # Another way to consider document similarity is to consider and compare the frequency of terms in each document. The first step in doing this is to create what's called a document term matrix where frequencies are indicated for each document – it might look something like this: # In[2]: import pandas as pd documents = [ "I love jazz music and I love to read good books", "What good books have you read recently", "It's music to my ears when you say I love you" ] allWords = set([word.lower() for word in nltk.word_tokenize(" ".join(documents)) if any([c for c in word if c.isalpha()])]) allRawFreqs = [nltk.FreqDist(nltk.word_tokenize(document.lower())) for document in documents] pd.DataFrame(allRawFreqs).fillna(0) # The top row indicates each term in the entire corpus and each row represents a document ("0" is the first document "I love jazz…", etc.). This would be helpful (with real data), but it's not ideal to use raw frequencies (counts) since document length wouldn't be taken into consideration. # # We could create a similar table using relative frequencies instead, in other words, where each value would be relative to the total number of tokens in the document. That would be better for variable length documents, but an even more powerful (and commonly used) technique is to calculate a value that better represents the significance of a term within the corpus. Just as the relative frequency dampens counts based on document length, the [TF-IDF](http://en.wikipedia.org/wiki/Tf–idf) value dampens relative frequencies based on the distribution of the term in the corpus. # ## TF-IDF # We already have the "TF" part of the TF-IDF value, it's simply the term's raw or relative frequency in a given document. We need to multiply that by IDF, which is the inverse document frequency, which we can calculate like this: # # IDF = log of the total number of documents divided by the number of documents that contain the term # # So, imagine we want to calculate the TF-IDF score of the word "boot" for Austen's _Emma_ within the Gutenberg corpus: # # * 4 occurrences of "boot" in _Emma_ # * 16,1975 words in _Emma_ # * 6 documents that contain "boot" (at least once) # * 18 documents in total # # TF = 4 / 161975 = 0.00002469516901 # # IDF = log(18 /6) = 1.0986122886681098 # # TF-IDF (for boot in _Emma_) = 0.00002469516901 \* 1.0986122886681098 = **0.00002713041615** # That seems like a lot of work for a single TF-IDF value for one term in one document. But good news! We don't need to calculate TF-IDF scores ourselves, there are several libraries that do that for us. In fact, we've already seen this with the gensim library in the topic modelling notebook, but here we're going to use [scikit](http://scikit-learn.org/stable/), a machine learning library for Python, and in particular with [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). # In[3]: from sklearn.feature_extraction.text import TfidfVectorizer simpledtm = TfidfVectorizer().fit_transform(documents) # takes a list of strings (and tokenizes them for us) pd.DataFrame(simpledtm.toarray()) # When we use this convenient method, we lose the term labels (they appear in the top header row as numbers from 0 to 16), but in fact that's not as important for now as just seeing that we have a document term matrix with TF-IDF values for each document and each term (with several values of zero). # # We can generate a similar – though much larger – table for all the texts in the Gutenberg corpus. # In[4]: names = nltk.corpus.gutenberg.fileids() # we'll keep track of filenames (for labels) texts = [nltk.corpus.gutenberg.raw(fileid) for fileid in names] # a list of strings, one per document documentTermMatrix = TfidfVectorizer().fit_transform(texts) documentTermMatrix # Our document term matrix is stored in something called a sparse matrix, which is a useful and efficient way to store and process a very large table of data where several values are absent or not set. There are typically a lot of zero values in a sparse matrix, (imagine a term that only occurs in one of the 18 documents – a true matrix would need to store 18 cells/values, but a sparse matrix can store just one. # # This document term matrix is conceptually enormous. Depending on the arguments we provide to fit_transform() we could have a huge table of 18 (the number of documents) by the total number of unique words in all the documents (each term is a column and each column needs a value for each one of the documents). # ## Cosine Similarity # Even though we have a table of TF-IDF values for every term in every document, in order to make any use of it we need a way of comparing documents. In other words: of determining the distance between the frequency values (columns) for any of the documents in the rows. In effect, we want a way of expressing how different two rows (documents) are. # # We can do this by converting our TF-IDF values from each document into geometric space. Imagine each term frequency (as TF-IDF) as a point on a cartesian graph and a value that expresses the aggregate of these points. Each document could thus be expressed as a vector, or a line with a certain magnitude (length) and angle from the origin (0, 0). # # Because the length of the line (the total magnitude of the total TF-IDF values) would be sensitive to the length of the document, we can't simply compare the lines for each document, but we can compare the angle between the documents – that's the essence of cosine similarity. # # * cosine distance between document 1 and document 2 (d1, d2) # * cosine distance between document 2 and document 3 (d2, d3) # * cosine distance between document 1 and document 3 (d1, d3) # # #  # # We'll use scikit's [cosine_similarity](http://scikit-learn.org/stable/modules/metrics.html#cosine-similarity) function. # In[5]: from sklearn.metrics.pairwise import cosine_similarity distances = 1 - cosine_similarity(documentTermMatrix) # This creates a new matrix not of individual terms but of distance measures between documents. Think of a table of distances between cities on a map, if you present it in a table there will be duplication, and identity values for where cities meet. # #
Calgary | Montreal | Toronto | Vancouver | |
Calgary | 3750 | 3450 | 1050 | |
Montreal | 3750 | 550 | 4800 | |
Toronto | 3450 | 550 | 4500 | |
Vancouver | 1050 | 4800 | 4500 |