by Dominic Reichl (@domreichl)
October 2018
My blog (https://www.mindcoolness.com) currently has 322 blog posts, which I have categorized into four broad topics: psychology & cognitive science, willpower & self-improvement, philosophy & spirituality, and morality & ethics.
Recently, I experienced a curious desire to find out how unsupervised NLP models would cluster my writings, so I've created this notebook.
This notebook requires the libraries Pandas, Beautiful Soup 4, Matplotlib, Mpld3, Scikit-learn, and TensorFlow.
With the blog post data already exported from my MySQL server in CSV format, all we have to do here is load the CSV file into a Pandas DataFrame, filter the data, and convert the HTML code into text via BeautifulSoup.
import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import MDS, TSNE
import mpld3
from sklearn.decomposition import NMF, TruncatedSVD, LatentDirichletAllocation
import tensorflow as tf
from sklearn.metrics import silhouette_score, calinski_harabaz_score
# load data
data = pd.read_csv('wp_posts.csv', sep=';')
# filter data (exclude pages, drafts, revisions, etc.), then keep only title & content
data = data[(data['post_type'] == 'post') & (data['post_status'] == 'publish')]
data = data[['post_title', 'post_content']].reset_index(drop=True)
# convert html code into text, then lowercase all words
for i in data.index:
    soup = BeautifulSoup(data.loc[i, 'post_content'], 'html.parser')
    data.loc[i, 'post_content'] = soup.get_text().lower()
    data.loc[i, 'post_title'] = data.loc[i, 'post_title'].lower()
# display the last five blog posts
data.tail()
| | post_title | post_content |
|---|---|---|
| 317 | how the brain makes emotions | here's the latest state of the art in the cogn... |
| 318 | is willpower a cognitive strength? | \nwillpower is the ability to pursue long-term... |
| 319 | 6 reasons why people use moral language | \nwhy do we use moral language?\nhello, i'm do... |
| 320 | the bayesian brain: placebo effects explained | \n\nin my article on predictive processing, i ... |
| 321 | great minds discuss ideas, great men also disc... | \non great minds and great men\ngreat minds di... |
Natural language processing requires all words to be represented as numbers. For our purposes, the best way to achieve this is with Scikit-learn's TfidfVectorizer. This tool not only transforms words into vectors, but also ensures that the terms defining a cluster provide enough differentiation.
How does it work?
Consider first that the frequency of words like 'the', 'a', 'is', and 'and' is likely to be high in any English corpus, which means that they're of little value for document clustering. Moreover, basically every AI-related document would have the same topic if 'network', 'model', and 'algorithm' were cluster-defining terms.
A clustering algorithm will produce much better results if the term frequency (tf = how often a word appears in a document) is multiplied by the inverse document frequency (idf = a measure of how much information the word provides). This is precisely what the TfidfVectorizer does when it penalizes high-frequency terms for lacking informational value.
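To make the penalty concrete, here's a toy illustration (three made-up sentences, not the blog data): because 'the' appears in every document, its tf-idf weight ends up well below that of a rare, document-specific word.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus: 'the' appears in every document, 'willpower' in only one
docs = ["the brain makes emotions",
        "the brain predicts the world",
        "willpower is a strength of the brain"]

vec = TfidfVectorizer()  # no stop word filtering here, so 'the' stays in
matrix = vec.fit_transform(docs)

# compare the weights of 'the' and 'willpower' in the third document
row = matrix[2].toarray()[0]
w_the = row[vec.vocabulary_['the']]
w_willpower = row[vec.vocabulary_['willpower']]
print(w_the, w_willpower)  # 'willpower' gets the higher weight
```

Both words occur once in that document (same tf), so the difference comes entirely from the idf factor.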
The CountVectorizer, by contrast, uses a simple bag-of-words approach where each term is transformed into a vector based on its count/frequency. This is useful for plotting the top 20 words in our data, and it's also needed for Latent Dirichlet Allocation (LDA), a structured probabilistic model.
Lastly, we set a lower bound on document frequency (min_df = 0.05) as a cut-off threshold that ignores the rarest words in our vocabulary. This will also prove useful for the autoencoder later because it speeds up the neural network training considerably.
# tf-idf (term frequency-inverse document frequency)
tfidf_vectorizer = TfidfVectorizer(stop_words='english', min_df = 0.05)
tfidf_matrix = tfidf_vectorizer.fit_transform(data['post_content'])
tfidf_words = tfidf_vectorizer.get_feature_names()
# bag of words (term frequency)
tf_vectorizer = CountVectorizer(stop_words='english', min_df = 0.05)
tf_matrix = tf_vectorizer.fit_transform(data['post_content'])
tf_words = tf_vectorizer.get_feature_names()
tfidf_matrix.shape, tf_matrix.shape
((322, 995), (322, 995))
That's 322 documents (blog posts) and a vocabulary of 995 vectorized words.
To visualize the frequency of words in our data set, we must first retrieve each term and its count (as the sum of its vector) from the vocabulary, sort all terms by count, and then build a list with the 20 most frequent words as well as a list with their counts. With that, we can plot the lists in a bar chart. (Note that the most common English words were already filtered out as stop words by the vectorizer.)
# get word frequencies from the bag of words and sort them by count in descending order
term_frequency = [(term, tf_matrix.sum(axis=0)[0, i]) for term, i in tf_vectorizer.vocabulary_.items()]
term_frequency = sorted(term_frequency, key = lambda x: x[1], reverse=True)
terms = [i[0] for i in term_frequency[:20]] # get top 20 words
count = [i[1] for i in term_frequency[:20]] # get counts of top 20 words
# plot the 20 most frequent words in a bar chart
fig, ax = plt.subplots(figsize=(16,8))
ax.bar(range(len(terms)), count)
ax.set_xticks(range(len(terms)))
ax.set_xticklabels(terms)
ax.set_title('Top 20 Most Frequent Words')
ax.set_xlabel('Term')
ax.set_ylabel('Count')
plt.show()
Our main algorithm for clustering will be k-means, which randomly initializes cluster centers, assigns all data points to their closest centroid (measured as least squared Euclidean distance), and then gradually and heuristically moves the centroids until convergence, i.e., until each centroid has become the actual center of its assigned data points (although reaching an optimum is not guaranteed). Note that Scikit-learn's KMeans uses an improved initialization algorithm—k-means++—by default.
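The assign-and-update loop just described can be sketched in a few lines of NumPy (a minimal version of plain Lloyd's k-means on made-up 2-D points, not Scikit-learn's k-means++ implementation):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # random initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point joins its closest centroid
        # (least squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids

# two well-separated point groups are recovered as two clusters
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
labels, centroids = lloyd_kmeans(X, 2)
```

Scikit-learn's KMeans adds smarter (k-means++) initialization and multiple restarts on top of this basic loop.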
In my blog, I manually grouped my posts into four broad topic categories, so we will tell the model to find 4 clusters. After fitting the model with the matrix of vectorized words and storing the centroids, we can peek into the clustering results by printing out the top 3 defining words of each cluster.
k = 4 # number of clusters
# build and fit model, then store centroids
km = KMeans(k)
km_matrix = km.fit_transform(tfidf_matrix)
km_centroids = km.cluster_centers_.argsort()[:, ::-1]
# create a dictionary with the top three words for each cluster
top_words = {}
for i in range(4):
    top_words[i] = ""
    for c in km_centroids[i, :3]:
        if top_words[i] == "":
            top_words[i] = tfidf_words[c]
        else:
            top_words[i] = top_words[i] + ", " + tfidf_words[c]
    print('Cluster %s:' % i, top_words[i])
Cluster 0: moral, meaning, values
Cluster 1: willpower, self, control
Cluster 2: pride, ego, humility
Cluster 3: mind, emotions, life
The clusters already make sense, but we shall wait until #9 for an in-depth qualitative evaluation.
To visualize our clusters in a two-dimensional plot, we can use manifold learning models such as multidimensional scaling (MDS) and t-distributed stochastic neighbor embedding (t-SNE).
Let's fit both models with and without cosine distance.
Using cosine distance is generally recommended for text data, but as we shall see, the plots look better without it. To build the plots, we create a data frame with x and y coordinates, post titles, and cluster labels before we group it by the latter. For the fun of it, let's also store the figure as a PNG image.
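As a quick aside on what cosine distance measures (toy term-count vectors, not our actual tf-idf data): it depends only on the direction of two vectors, not their length, so a document and a twice-as-long copy of it are at distance zero.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# toy 'term count' vectors: b has the same word proportions as a,
# just twice as long; c has a different topic mix
a = np.array([[2., 1., 0.]])
b = np.array([[4., 2., 0.]])
c = np.array([[0., 1., 2.]])

dist_ab = 1 - cosine_similarity(a, b)[0, 0]
dist_ac = 1 - cosine_similarity(a, c)[0, 0]
print(dist_ab, dist_ac)  # ~0.0 vs. 0.8
```

This length invariance is why cosine distance is popular for text, where long and short documents can share the same topic.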
# fit two non-linear dimensionality reduction models
mds = MDS().fit_transform(km_matrix)
tsne = TSNE().fit_transform(km_matrix)
# fit models on precomputed cosine distances
cos_dist = 1 - cosine_similarity(km_matrix)
mds_cos = MDS(dissimilarity="precomputed").fit_transform(cos_dist)
tsne_cos = TSNE(metric="precomputed").fit_transform(cos_dist)
# create data frame with coordinates, cluster labels, and post titles, grouped by clusters
df = pd.DataFrame(dict(x1=mds[:,0], y1=mds[:,1], x2=mds_cos[:,0], y2=mds_cos[:,1],
x3=tsne[:,0], y3=tsne[:,1], x4=tsne_cos[:,0], y4=tsne_cos[:,1],
label=km.labels_.tolist(), title=data['post_title']))
groups = df.groupby('label')
# set a color and get the top three words for each cluster
clusters = {0: ('#1b9e77', top_words[0]),
1: ('#d98f02', top_words[1]),
2: ('#7580b3', top_words[2]),
3: ('#e7196a', top_words[3]), }
# build two plots for the manifold learning models
fig, ax = plt.subplots(2,2, figsize=(16,12)) # 2x2 subplots
ax[0,0].set_title('MDS (euclidean distance)'); ax[0,1].set_title('MDS (cosine distance)') # titles for first row
ax[1,0].set_title('T-SNE (euclidean distance)'); ax[1,1].set_title('T-SNE (cosine distance)') # titles for second row
for i, g in groups: # iterate over clusters
    ax[0,0].plot(g.x1, g.y1, marker='o', linestyle='', ms=12, color=clusters[i][0], label=clusters[i][1])
    ax[0,1].plot(g.x2, g.y2, marker='o', linestyle='', ms=12, color=clusters[i][0], label=clusters[i][1])
    ax[1,0].plot(g.x3, g.y3, marker='o', linestyle='', ms=12, color=clusters[i][0], label=clusters[i][1])
    ax[1,1].plot(g.x4, g.y4, marker='o', linestyle='', ms=12, color=clusters[i][0], label=clusters[i][1])
ax[0,0].legend(); ax[0,1].legend(); ax[1,0].legend(); ax[1,1].legend() # add legends
# save the figure as png and display it
plt.savefig('clusters.png', dpi=200)
plt.show()
With the mpld3 library, we can use JavaScript and CSS code to create an interactive map that displays a tooltip with the blog title for each data point when the mouse hovers over it. This is great for exploring how my blog posts were clustered! I must give credit here to Brandon Rose (http://brandonrose.org) for this sweet piece of code—thank you.
# define custom toolbar location
class TopToolbar(mpld3.plugins.PluginBase):
    """Plugin for moving toolbar to top of figure"""
    JAVASCRIPT = """
    mpld3.register_plugin("toptoolbar", TopToolbar);
    TopToolbar.prototype = Object.create(mpld3.Plugin.prototype);
    TopToolbar.prototype.constructor = TopToolbar;
    function TopToolbar(fig, props){
        mpld3.Plugin.call(this, fig, props);
    };
    TopToolbar.prototype.draw = function(){
        this.fig.toolbar.draw();
        this.fig.toolbar.toolbar.attr("x", 150);
        this.fig.toolbar.toolbar.attr("y", 400);
        this.fig.toolbar.draw = function() {}
    }
    """
    def __init__(self):
        self.dict_ = {"type": "toptoolbar"}
# define custom css to format the font and to remove the axis labeling
css = """
text.mpld3-text, div.mpld3-tooltip {
font-family:Arial, Helvetica, sans-serif;
font-size:14px;
font-weight: bold;
color: White;
background-color: DodgerBlue;
}
g.mpld3-xaxis, g.mpld3-yaxis {
display: none; }
svg.mpld3-figure {
margin-left: -75px;}
"""
# create plot
fig, ax = plt.subplots(figsize=(14,6))
for i, g in groups: # layer the plot by iterating through cluster labels
    points = ax.plot(g.x3, g.y3, marker='o', linestyle='', ms=14, color=clusters[i][0], label=clusters[i][1])
    labels = [t.title() for t in g.title] # get the blog post titles in title case
    tooltip = mpld3.plugins.PointHTMLTooltip(points[0], labels, voffset=10, hoffset=10, css=css) # set tooltip
    mpld3.plugins.connect(fig, tooltip, TopToolbar()) # connect tooltip to fig
ax.legend(edgecolor='white') # add legend
# save as html file and show plot
html = mpld3.fig_to_html(fig)
with open("clusters.html", "w") as file: file.write(html)
mpld3.display()
Being deeply familiar with every data point (each one represents a blog post I have written), I can learn a lot from this interactive plot. But even a brief look at some of the titles and their relative distances should convince you that the clustering has been successful: the patterns make sense!
To find out the degree to which blog titles and contents belong to the same clusters, we can let the k-means model predict in which cluster each title and content fits best and then compute the overlap of these predictions.
In addition, we can see how the four pairs of topic categories I use on my blog map to the four clusters generated by the model. 100% would mean that the model has categorized my blog posts very similarly to how I have categorized them.
# use model to predict the cluster for each title and content
title_predictions = []
content_predictions = []
for i in range(len(data['post_content'])):
    titles = tfidf_vectorizer.transform([data['post_title'][i]])
    title_predictions.append(km.predict(titles))
    contents = tfidf_vectorizer.transform([data['post_content'][i]])
    content_predictions.append(km.predict(contents))
# check how often a post's title and content are predicted to belong to the same cluster
match = [int(t[0] == c[0]) for t, c in zip(title_predictions, content_predictions)]
print('Title/content match: ' + str(round(sum(match)/len(match)*100, 1)) + '%')
# test to what extent each manually defined topic category falls into its own cluster
category_predictions = []
for topic in ('psychology cognitive science', 'willpower self improvement',
              'philosophy spirituality', 'morality ethics'):
    category = tfidf_vectorizer.transform([topic])
    category_predictions.append(km.predict(category)[0])
print('Category similarity: ' + str(len(set(category_predictions))/k*100) + '%')
Title/content match: 76.7%
Category similarity: 75.0%
Now let's look at three additional models (NMF, LSA, and LDA), and let's also combine each of them with k-means:
nmf = NMF(k)
nmf_matrix = nmf.fit_transform(tfidf_matrix)
lsa = TruncatedSVD(k)
lsa_matrix = lsa.fit_transform(tfidf_matrix)
lda = LatentDirichletAllocation(k, learning_method='batch')
lda_matrix = lda.fit_transform(tf_matrix)
km_nmf = KMeans(k).fit(nmf_matrix) # NMF-based k-means
km_lsa = KMeans(k).fit(lsa_matrix) # LSA-based k-means
km_lda = KMeans(k).fit(lda_matrix) # LDA-based k-means
For qualitative evaluation, we can look at the three words that were most defining for each cluster produced by a model. If the word combinations make sense, if all top words of a cluster belong to a distinct category, and if there's little topical overlap between clusters, we may judge the model as good.
def top_words_decomp(model_name, model, terms):
    ''' prints the top 3 words of each cluster
    from the components of decomposition models '''
    print(model_name)
    for i, topic in enumerate(model.components_):
        print("Cluster %d: " % i, end="")
        print(" ".join([terms[t] for t in topic.argsort()[:-4:-1]]))
    print()
top_words_decomp(" ---NMF---", nmf, tfidf_words)
top_words_decomp(" ---LSA---", lsa, tfidf_words)
top_words_decomp(" ---LDA---", lda, tf_words)
def top_words_cluster(model_name, centers):
    ''' prints the top 3 words of each cluster
    from the centroids of the k-means models '''
    print(model_name)
    for i in range(k):
        print("Cluster %d: " % i, end="")
        print(" ".join([tfidf_words[c] for c in centers[i, :3]]))
    print()
top_words_cluster(" ---K-M---", km_centroids)
top_words_cluster(" ---NMF-KM---", nmf.inverse_transform(km_nmf.cluster_centers_).argsort()[:, ::-1])
top_words_cluster(" ---LSA-KM---", lsa.inverse_transform(km_lsa.cluster_centers_).argsort()[:, ::-1])
---NMF---
Cluster 0: meditation mindfulness life
Cluster 1: moral values meaning
Cluster 2: pride humility ego
Cluster 3: willpower emotions control

---LSA---
Cluster 0: pride self true
Cluster 1: moral values meaning
Cluster 2: pride emotion emotions
Cluster 3: emotions willpower control

---LDA---
Cluster 0: emotions pride self
Cluster 1: people human values
Cluster 2: true life want
Cluster 3: willpower self control

---K-M---
Cluster 0: moral meaning values
Cluster 1: willpower self control
Cluster 2: pride ego humility
Cluster 3: mind emotions life

---NMF-KM---
Cluster 0: life mind meditation
Cluster 1: moral values meaning
Cluster 2: willpower emotions control
Cluster 3: pride humility true

---LSA-KM---
Cluster 0: willpower control emotions
Cluster 1: moral values meaning
Cluster 2: pride humility true
Cluster 3: life meditation mindfulness
From what I know about the data, which consists of my own blog posts (hence easy for me to interpret), the NMF clusters are certainly the best: their top three words neatly outline the very topics I have written about the most on my blog.
Here's the full model ranking:
Had we used better tokenization and stemmed our tokens, LSA might have performed better. Still, it is worth noting that all the other algorithms did quite well even without any stemming prior to word vectorization.
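For illustration, here is a rough sketch of what stemming before vectorization could look like; the suffix-stripping 'stemmer' below is a crude stand-in for a real one (e.g., NLTK's Porter stemmer), invented just for this example:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def crude_stem(token):
    # toy stand-in for a real stemmer: strip a few common English suffixes
    return re.sub(r'(ings?|ed|es|s)$', '', token)

def tokenize_and_stem(text):
    return [crude_stem(t) for t in re.findall(r'[a-z]{2,}', text.lower())]

vec = TfidfVectorizer(tokenizer=tokenize_and_stem, token_pattern=None,
                      stop_words='english')
m = vec.fit_transform(["emotions and feelings", "an emotion and a feeling"])
print(sorted(vec.vocabulary_))  # both documents collapse to the same two stems
```

A real pipeline would use a proper stemmer or lemmatizer; the point is only that inflected forms collapse into one vocabulary entry before the tf-idf weighting.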
Ready for something more complex?
Autoencoders!
What's that?
Autoencoders are neural networks used for unsupervised learning. They are a powerful tool for dealing with the curse of dimensionality.
Every autoencoder consists of two parts: (1) an encoder with multiple layers to reduce dimensionality and (2) a decoder with multiple layers to reconstruct the input from the dimensionally-reduced data. By reconstructing its inputs, the network detects the most important features in the data as it learns the identity function under the constraint of reduced dimensionality (or added noise). Since clustering is a form of dimensionality reduction, autoencoders should be useful for categorizing my blog posts into four broad topics.
In the code below, we use TensorFlow to build an autoencoder with two hidden layers. First, we set two hyperparameters (learning rate and number of epochs) as well as the network parameters (the number of nodes in each layer). After defining the graph input (X) and all weights and biases, initialized with normally distributed random numbers, we build an encoder and a decoder function, each applying a sigmoid activation to every layer. Next, we construct the model and define the loss and optimization functions: minimize the squared error with adaptive moment estimation (Adam). Finally, we initialize the variables, launch the graph, and run the training cycles. In the last two lines, we retrieve the results and end the session.
learning_rate = 0.001
training_epochs = 501
n_input = tfidf_matrix.shape[1]
n_hidden_1 = tfidf_matrix.shape[1] // 4
n_hidden_2 = 4
X = tf.placeholder("float", [None, n_input])
weights = {
'encoder_h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
'encoder_h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
'decoder_h1': tf.Variable(tf.random_normal([n_hidden_2, n_hidden_1])),
'decoder_h2': tf.Variable(tf.random_normal([n_hidden_1, n_input])),
}
biases = {
'encoder_b1': tf.Variable(tf.random_normal([n_hidden_1])),
'encoder_b2': tf.Variable(tf.random_normal([n_hidden_2])),
'decoder_b1': tf.Variable(tf.random_normal([n_hidden_1])),
'decoder_b2': tf.Variable(tf.random_normal([n_input])),
}
def encoder(x):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['encoder_h1']), biases['encoder_b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['encoder_h2']), biases['encoder_b2']))
    return layer_2
def decoder(x):
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['decoder_h1']), biases['decoder_b1']))
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['decoder_h2']), biases['decoder_b2']))
    return layer_2
enc = encoder(X)
dec = decoder(enc)
cost = tf.reduce_mean(tf.pow(X - dec, 2))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
init = tf.global_variables_initializer()
sess = tf.InteractiveSession() # interactive for jupyter notebook
sess.run(init)
for epoch in range(training_epochs):
    for i in range(len(data)): # one batch per blog post
        _, c = sess.run([optimizer, cost], feed_dict={X: tfidf_matrix[i].toarray()})
    if epoch % 100 == 0: # display every hundredth epoch
        print("Epoch:", '%03d' % epoch, "cost =", "{:.9f}".format(c))
autoenc_results = dec.eval(feed_dict={X: tfidf_matrix.toarray()})
sess.close()
Epoch: 000 cost = 0.076162167
Epoch: 100 cost = 0.020960618
Epoch: 200 cost = 0.007014040
Epoch: 300 cost = 0.001955998
Epoch: 400 cost = 0.000939283
Epoch: 500 cost = 0.000937003
The loss decreased nicely, certainly better than with the many other architectures and hyperparameters I tried before.
Now, let's feed the output of our NN into a k-means model and plot the results.
km_autoenc = KMeans(k).fit(autoenc_results) # autoencoder-based k-means
# fit T-SNE on the cosine distances of the autoencoder and autoencoder-based k-means results
cos_dist_autoenc = 1 - cosine_similarity(autoenc_results)
tsne_autoenc = TSNE(metric="precomputed").fit_transform(cos_dist_autoenc)
cos_dist_km_autoenc = 1 - cosine_similarity(KMeans(k).fit_transform(autoenc_results))
tsne_km_autoenc = TSNE(metric="precomputed").fit_transform(cos_dist_km_autoenc)
# plot T-SNE results
fig, ax = plt.subplots(1,2, figsize=(16,8))
ax[0].set_title('Autoencoder')
ax[0].scatter(tsne_autoenc[:,0], tsne_autoenc[:,1])
ax[1].set_title('Autoencoder-based KMeans')
ax[1].scatter(tsne_km_autoenc[:,0], tsne_km_autoenc[:,1])
plt.show()
For quantitative evaluation, we will use three metrics that don't require ground-truth labels: the silhouette coefficient, the within-cluster sum of squares (WCSS, i.e., the model's inertia), and the Calinski-Harabasz index.
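To get a feel for the silhouette coefficient (a toy illustration on made-up 2-D points, not our tf-idf vectors): two well-separated blobs score close to 1, while forcing two clusters onto a single undifferentiated cloud scores much lower.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# two tight, well-separated blobs vs. one undifferentiated gaussian cloud
separated = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
cloud = rng.normal(0, 1, (100, 2))

scores = []
for X in (separated, cloud):
    labels = KMeans(2, n_init=10, random_state=0).fit_predict(X)
    scores.append(silhouette_score(X, labels))
print([round(s, 2) for s in scores])  # high score first, much lower score second
```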
In this final step, we create an evaluation table with the scores for all our k-means models on these three metrics.
# create evaluation table
evaluation = pd.DataFrame({'Model': ['km', 'km_nmf', 'km_lsa', 'km_lda', 'km_autoenc']})
sc, wcss, chi = [], [], []
# calculate scores
for model in (km, km_nmf, km_lsa, km_lda, km_autoenc):
    sc.append(silhouette_score(tfidf_matrix.toarray(), model.labels_))
    wcss.append(round(model.inertia_, 2))
    chi.append(round(calinski_harabaz_score(tfidf_matrix.toarray(), model.labels_), 2))
# use term frequency matrix for LDA
sc[-2] = silhouette_score(tf_matrix.toarray(), km_lda.labels_)
chi[-2] = round(calinski_harabaz_score(tf_matrix.toarray(), km_lda.labels_), 2)
# fill and display evaluation table
evaluation['Silhouette'] = sc
evaluation['WCSS'] = wcss
evaluation['Calinski-Harabasz'] = chi
evaluation.head()
| | Model | Silhouette | WCSS | Calinski-Harabasz |
|---|---|---|---|---|
| 0 | km | 0.023253 | 276.33 | 5.97 |
| 1 | km_nmf | 0.023509 | 2.37 | 6.35 |
| 2 | km_lsa | 0.024074 | 8.68 | 6.32 |
| 3 | km_lda | -0.009829 | 23.09 | 0.54 |
| 4 | km_autoenc | 0.012356 | 0.09 | 1.92 |
Although these stats aren't particularly impressive (e.g., all silhouette coefficients are almost zero), their relative values are diagnostic nonetheless:
This confirms the above qualitative analysis.
In conclusion, it appears that sticking to NMF-based KMeans, LSA-based KMeans, or even just NMF alone is the best choice for this data set and our word vectorization method.