A tutorial by Thomas Jurczyk, CERES (Bochum)
This notebook includes the code used in my clustering tutorial for the Programming Historian. Please note that you need to save the datasets DNP_ancient_authors.csv and RELIGION_abstracts.csv in a folder called data. The datasets are available in my GitHub repository.
If you have any questions or comments, please contact me via email.
import pandas as pd
pd.set_option('display.max_colwidth', None)
# load the authors dataset that has been stored as a .csv file in a folder called "data" in the same directory as the Jupyter Notebook
df_authors = pd.read_csv("data/DNP_ancient_authors.csv", index_col="authors").drop(columns=["Unnamed: 0"])
In the next step, we print out the first five rows and look at some information and overview statistics about the dataset using pandas' info() and describe() methods.
df_authors.info()
df_authors.head(5)
df_authors.describe()
As we can see when using describe() on the df_authors dataset, we have a very large standard deviation in almost every column and a big difference between the 75% percentile value and the maximum value. This indicates that we might have some serious outliers in our dataset, and it might make sense to get rid of them before we continue with our analysis. Therefore, we only keep those data points in our dataframe whose word_count lies within the 90th percentile range.
ninety_quantile = df_authors["word_count"].quantile(0.9)
df_authors = df_authors[df_authors["word_count"] <= ninety_quantile]
df_authors.info()
df_authors.describe()
We start our analysis by clustering the DNP_ancient_authors.csv dataset. Before we start with the actual clustering process, we first import all the necessary libraries and write a couple of functions that will help us to plot our results during the analysis. We will also use these functions and imports in the second part of our analysis.
Before continuing with the following part, you need to install yellowbrick
first if you haven't done so already. You can either do this via
pip install yellowbrick
or
conda install yellowbrick
when using the Anaconda distribution.
from sklearn.preprocessing import StandardScaler as SS # z-score standardization
from sklearn.cluster import KMeans, DBSCAN # clustering algorithms
from sklearn.decomposition import PCA # dimensionality reduction
from sklearn.metrics import silhouette_score # used as a metric to evaluate the cohesion in a cluster
from sklearn.neighbors import NearestNeighbors # for selecting the optimal eps value when using DBSCAN
import numpy as np
# plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
from yellowbrick.cluster import SilhouetteVisualizer
def silhouettePlot(range_, data):
    '''
    we will use this function to plot a silhouette plot that helps us to evaluate the cohesion in clusters (k-means only)
    '''
    half_length = int(len(range_)/2)
    range_list = list(range_)
    fig, ax = plt.subplots(half_length, 2, figsize=(15,8))
    for n in range_:
        kmeans = KMeans(n_clusters=n, random_state=42)
        q, mod = divmod(n - range_list[0], 2)
        sv = SilhouetteVisualizer(kmeans, colors="yellowbrick", ax=ax[q][mod])
        ax[q][mod].set_title("Silhouette Plot with n={} Cluster".format(n))
        sv.fit(data)
    fig.tight_layout()
    fig.show()
    fig.savefig("silhouette_plot.png")
def elbowPlot(range_, data, figsize=(10,10)):
    '''
    the elbow plot function helps to figure out the right number of clusters for a dataset
    '''
    inertia_list = []
    for n in range_:
        kmeans = KMeans(n_clusters=n, random_state=42)
        kmeans.fit(data)
        inertia_list.append(kmeans.inertia_)
    # plotting
    fig = plt.figure(figsize=figsize)
    ax = fig.add_subplot(111)
    sns.lineplot(y=inertia_list, x=range_, ax=ax)
    ax.set_xlabel("Cluster")
    ax.set_ylabel("Inertia")
    ax.set_xticks(list(range_))
    fig.show()
    fig.savefig("elbow_plot.png")
def findOptimalEps(n_neighbors, data):
    '''
    function to find the optimal eps distance when using DBSCAN; based on this article: https://towardsdatascience.com/machine-learning-clustering-dbscan-determine-the-optimal-value-for-epsilon-eps-python-example-3100091cfbc
    '''
    neigh = NearestNeighbors(n_neighbors=n_neighbors)
    nbrs = neigh.fit(data)
    distances, indices = nbrs.kneighbors(data)
    distances = np.sort(distances, axis=0)
    distances = distances[:,1]
    plt.plot(distances)
    plt.xlabel("Data Points")
    plt.ylabel("Distance")
    plt.savefig("eps_plot.png")
def progressiveFeatureSelection(df, n_clusters=3, max_features=4):
    '''
    very basic implementation of an algorithm for feature selection (unsupervised clustering); taken from this post: https://datascience.stackexchange.com/questions/67040/how-to-do-feature-selection-for-clustering-and-implement-it-in-python
    '''
    feature_list = list(df.columns)
    selected_features = list()
    # select starting feature
    initial_feature = ""
    high_score = 0
    for feature in feature_list:
        kmeans = KMeans(n_clusters=n_clusters, random_state=42)
        data_ = df[feature]
        labels = kmeans.fit_predict(data_.to_frame())
        score_ = silhouette_score(data_.to_frame(), labels)
        print("Proposed new feature {} with score {}".format(feature, score_))
        if score_ >= high_score:
            initial_feature = feature
            high_score = score_
    print("The initial feature is {} with a silhouette score of {}.".format(initial_feature, high_score))
    feature_list.remove(initial_feature)
    selected_features.append(initial_feature)
    for i in range(max_features-1):
        high_score = 0
        selected_feature = ""
        print("Starting selection {}...".format(i))
        for feature in feature_list:
            selection_ = selected_features.copy()
            selection_.append(feature)
            kmeans = KMeans(n_clusters=n_clusters, random_state=42)
            data_ = df[selection_]
            labels = kmeans.fit_predict(data_)
            score_ = silhouette_score(data_, labels)
            print("Proposed new feature {} with score {}".format(feature, score_))
            if score_ > high_score:
                selected_feature = feature
                high_score = score_
        selected_features.append(selected_feature)
        feature_list.remove(selected_feature)
        print("Selected new feature {} with score {}".format(selected_feature, high_score))
    return selected_features
Next, we initialize scikit-learn's StandardScaler()
to standardize our data.
scaler = SS()
DNP_authors_standardized = scaler.fit_transform(df_authors)
DNP_authors_standardized
df_authors_standardized = pd.DataFrame(DNP_authors_standardized, columns=["word_count_standardized", "modern_translations_standardized", "known_works_standardized", "manuscripts_standardized", "early_editions_standardized", "early_translations_standardized", "modern_editions_standardized", "commentaries_standardized"])
df_authors_standardized = df_authors_standardized.set_index(df_authors.index)
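As a quick sanity check (not part of the original analysis), we can call describe() on the standardized dataframe: after z-score standardization, every column should have a mean of roughly 0 and a standard deviation of roughly 1.
# optional sanity check: each standardized column should now have mean ~0 and standard deviation ~1
df_authors_standardized.describe()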
This next section is a little fuzzy. Selecting features in an unsupervised learning context is difficult. I have decided to implement a rather "simple" approach (see the function progressiveFeatureSelection), based on this algorithm presented on Stack Exchange. Of course, there are more elaborate approaches to feature selection in an unsupervised learning context, such as the one presented in this article. However, the method implemented here turned out to be quite successful for the datasets used in this tutorial.
For the sake of simplicity, and since the DNP_ancient_authors.csv dataset only has eight features, we are looking for the ideal three features to cluster our data.
selected_features = progressiveFeatureSelection(df_authors_standardized, max_features=3, n_clusters=3)
selected_features
It turns out that the features known_works_standardized
, commentaries_standardized
, and modern_editions_standardized
might be worth using when trying to cluster our data. We are creating a new dataframe with only these three features.
df_standardized_sliced = df_authors_standardized[selected_features]
df_standardized_sliced
We will now apply the elbow method and then use silhouette plots to get an impression of how many clusters we should choose to analyse our dataset. We will check for one to ten clusters. Note, however, that the feature selection was also done with a pre-defined k-means algorithm using n=3 clusters. Thus, our three selected features might already have a tendency towards this number of clusters, since they turned out to be the best choice under the assumption of n=3 clusters.
elbowPlot(range(1,11), df_standardized_sliced)
Looking at the elbow plot indeed shows an "elbow" at n=3 as well as at n=5 clusters. However, it is still pretty difficult to decide whether to use three, four, five, or even six clusters. Therefore, we should also look at the silhouette plots.
silhouettePlot(range(3,9), df_standardized_sliced)
Looking at the silhouette plots underlines our previous intuition that n=3 or n=5 seems to be the right choice for the number of clusters. The silhouette plot with n=3 clusters in particular has a relatively high average silhouette score. Yet, since two of the three clusters in the n=3 plot score far below the average silhouette score, we decide to analyze the dataset with k-means using n=5 clusters. In both the n=3 and the n=5 plots, however, the varying sizes and the sharp shapes of the silhouette "knives" indicate a single dominant cluster and a couple of rather small and less cohesive clusters.
Let us now train a k-means instance with n=5 clusters and plot the results using seaborn. Since the plotting in this tutorial focuses on two-dimensional plots, we will use PCA() (Principal Component Analysis) to reduce the dimensionality to two dimensions for plotting purposes only.
kmeans = KMeans(n_clusters=5, random_state=42)
cluster_labels = kmeans.fit_predict(df_standardized_sliced)
df_standardized_sliced["clusters"] = cluster_labels
# using PCA to reduce the dimensionality to two dimensions (to be able to plot the data with *seaborn*)
pca = PCA(n_components=2, whiten=False, random_state=42)
authors_standardized_pca = pca.fit_transform(df_standardized_sliced)
df_authors_standardized_pca = pd.DataFrame(data=authors_standardized_pca, columns=["pc_1", "pc_2"])
df_authors_standardized_pca["clusters"] = cluster_labels
df_authors_standardized_pca
sns_plot = sns.scatterplot(x="pc_1", y="pc_2", hue="clusters", data=df_authors_standardized_pca)
sns_plot.get_figure().savefig("dnp_ancient_authors_kmeans_plot_final_n=5.png")
This looks interesting! Even though the data is not perfectly clustered (which is something that we could already see during the evaluation of the silhouette plot), we can clearly separate at least one to two clusters. However, the two smaller ones still include quite a few outliers, which is a common problem of k-means.
Let us now look at the entries in the five clusters and see if this actually delivered any valuable insights into our data.
df_authors.iloc[df_authors_standardized_pca[df_authors_standardized_pca["clusters"] == 0].index].describe() # authors with very few known works and few modern editions/commentaries
df_authors.iloc[df_authors_standardized_pca[df_authors_standardized_pca["clusters"] == 1].index].describe() # authors with many known works and many modern editions, but almost no commentaries
df_authors.iloc[df_authors_standardized_pca[df_authors_standardized_pca["clusters"] == 1].index]
df_authors.iloc[df_authors_standardized_pca[df_authors_standardized_pca["clusters"] == 2].index].describe() # authors with few known works, but a lot of commentaries and relatively many modern editions
df_authors.iloc[df_authors_standardized_pca[df_authors_standardized_pca["clusters"] == 2].index]
df_authors.iloc[df_authors_standardized_pca[df_authors_standardized_pca["clusters"] == 3].index].describe() # authors with few commentaries and an average number of known works and modern editions
df_authors.iloc[df_authors_standardized_pca[df_authors_standardized_pca["clusters"] == 3].index]
df_authors.iloc[df_authors_standardized_pca[df_authors_standardized_pca["clusters"] == 4].index].describe() # authors with a very high amount of commentaries and an average amount of known works and modern editions
df_authors.iloc[df_authors_standardized_pca[df_authors_standardized_pca["clusters"] == 4].index]
This provided us with some quite interesting results. Note, however, that many well-known authors are missing because we initially cut off the top 10% of the data (based on word count). In an actual research article, this might be a bad idea; here, however, it facilitated our clustering process.
The second part of this tutorial deals with textual data, namely all abstracts scraped from the Religion (journal) website. We will try to cluster the abstracts based on their word features in the form of TF-IDF vectors (short for "Term Frequency - Inverse Document Frequency").
Similar to the analysis of the DNP_ancient_authors.csv dataset, we first load the abstracts into our program and look at some descriptive statistics. Note that the following cell loads the already lemmatized version of the RELIGION_abstracts.csv dataset (RELIGION_abstracts_lemmatized.csv), which is created in the optional lemmatization step further below.
df_abstracts = pd.read_csv("data/RELIGION_abstracts_lemmatized.csv").drop(columns="Unnamed: 0")
df_abstracts.info()
df_abstracts.describe()
In order to process the textual data with clustering algorithms, we need to convert the texts to vectors. For this purpose, we are using the scikit-learn implementation of TF-IDF vectorization. For a good introduction to how TF-IDF works, see this great tutorial by Melanie Walsh.
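To illustrate what the vectorizer does before we apply it to the abstracts, here is a minimal toy example (not part of the original analysis) with three made-up mini "documents": each row of the resulting matrix corresponds to one document and each column to one term from the learned vocabulary, weighted by TF-IDF.
# toy example for illustration only: TF-IDF vectorization of three made-up mini documents
from sklearn.feature_extraction.text import TfidfVectorizer
toy_corpus = ["religion and ritual", "ritual and history", "history of religion"]
toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_corpus)
print(toy_tfidf.get_feature_names_out())  # the learned vocabulary (use get_feature_names() in scikit-learn < 1.0)
print(toy_matrix.toarray().round(2))  # one row per document, one TF-IDF weight per term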
As an optional step, I have implemented a function called lemmatizeAbstracts() that lemmatizes the abstracts for us. Since we are not interested in stylistic similarities between the abstracts, this might help us to reduce the overall number of features (words) in our dataset. As part of the function, we also clean the text of all punctuation and other noise such as brackets. In the following part, we are only working with the lemmatized version of the abstracts. However, if you feel like it, you can also continue using the original texts.
# lemmatization (optional step)
import spacy
import re
nlp = spacy.load("en_core_web_sm")
def lemmatizeAbstracts(x):
    doc = nlp(x)
    new_text = []
    for token in doc:
        new_text.append(token.lemma_)
    text_string = " ".join(new_text)
    # getting rid of non-word characters
    text_string = re.sub(r"[^\w\s]+", "", text_string)
    text_string = re.sub(r"\s{2,}", " ", text_string)
    return text_string
df_abstracts["abstract_lemma"] = df_abstracts["abstract"].apply(lemmatizeAbstracts)
df_abstracts.to_csv("data/RELIGION_abstracts_lemmatized.csv")
I recommend saving the new lemmatized version of the dataset so that we do not have to redo the lemmatization each time we restart our notebook.
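On a later run, the saved file can then simply be reloaded instead of re-running spaCy (this mirrors the read_csv call at the beginning of this part).
# reload the previously saved lemmatized abstracts instead of redoing the lemmatization
df_abstracts = pd.read_csv("data/RELIGION_abstracts_lemmatized.csv").drop(columns="Unnamed: 0")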
The first step is to instantiate our TF-IDF model, passing it the argument to ignore stop words such as "the," "a," etc. The second step is pretty similar to the training of our k-means instance in the previous part: we pass the abstracts from our dataset to the vectorizer in order to convert them to machine-readable vectors. For the moment, we are not passing any additional arguments. In a later step, we will also create a new pandas DataFrame object based on the TF-IDF matrix of our textual data.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words="english")
df_abstracts_tfidf = tfidf.fit_transform(df_abstracts["abstract_lemma"])
df_abstracts_tfidf
Our initial matrix is huge and includes over 8,000 words from the overall vocabulary of the 701 abstracts. This is obviously too much, not only from a computational perspective, but also because clustering algorithms such as k-means lose much of their power due to the so-called curse of dimensionality. We will thus need to reduce the number of features significantly. To do so, we create a new TF-IDF matrix (df_abstracts_tfidf) with the maximum number of features reduced to 250. We also tell the model to only consider words from the vocabulary that appear in at least ten different documents, but in no more than 200 documents. We also add the possibility to include not only single words but bigrams as well (such as "19th century"). Finally, we tell our model to strip any potential accents.
Secondly, we again use Principal Component Analysis (PCA), a technique that is often applied to reduce the dimensionality of datasets.
PCA allows us to reduce the dimensionality of the original data substantially while retaining most of the salient information. On the PCA-reduced feature set, other machine learning algorithms—downstream in the machine learning pipeline—will have an easier time separating the data points in space (to perform tasks such as anomaly detection and clustering) and will require fewer computational resources. (quote from the online version of Ankur A. Patel: Hands-On Unsupervised Learning Using Python, O'Reilly Media 2020)
# creating a new TF-IDF matrix
tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1,2), max_features=250, strip_accents="unicode", min_df=10, max_df=200)
tfidf_religion_array = tfidf.fit_transform(df_abstracts["abstract_lemma"])
df_abstracts_tfidf = pd.DataFrame(tfidf_religion_array.toarray(), index=df_abstracts.index, columns=tfidf.get_feature_names_out())  # in scikit-learn versions older than 1.0, use tfidf.get_feature_names() instead
df_abstracts_tfidf.describe()
Due to the curse of dimensionality when using k-means, let us next use PCA() to reduce the dimensionality from d=250 to d=10.
# using PCA to reduce the dimensionality
pca = PCA(n_components=10, whiten=False, random_state=42)
abstracts_pca = pca.fit_transform(df_abstracts_tfidf)
df_abstracts_pca = pd.DataFrame(data=abstracts_pca)
df_abstracts_pca
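As an optional check (not part of the original analysis), we can look at how much of the variance of the 250-dimensional TF-IDF matrix is retained by the ten principal components.
# optional: total share of variance captured by the ten principal components
print(pca.explained_variance_ratio_.sum())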
Next, we try to figure out whether we can find any clusters in the abstracts using k-means. As we did in the case of the DNP_ancient_authors.csv dataset, we start by looking for the right number of clusters using the elbow method.
elbowPlot(range(10,100), df_abstracts_pca, figsize=(20,20))
# the elbow plot does not reveal one obviously "right" number of clusters, so we choose a relatively large n=100 to obtain small, thematically focused clusters
kmeans = KMeans(n_clusters=100, random_state=42)
abstracts_labels = kmeans.fit_predict(df_abstracts_pca)
df_abstracts_labeled = df_abstracts.copy()
df_abstracts_labeled["cluster"] = abstracts_labels
df_abstracts_labeled[df_abstracts_labeled["cluster"] == 9]["title"]
This works pretty well! The titles gathered in this cluster appear thematically related.
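If you want to get a feeling for how the abstracts are distributed over the clusters, you can additionally count the cluster sizes (an optional step that is not part of the original analysis).
# optional: number of abstracts per cluster (largest clusters first)
df_abstracts_labeled["cluster"].value_counts().head(10)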
Even though the k-means clustering of our data already provided us with some valuable results from a clustering perspective, it might still be interesting to apply a different clustering algorithm such as DBSCAN. As explained above, DBSCAN allows for outliers in our data, meaning that it focuses on those regions in our data that may rightfully be called dense.
We will be using the d=10 reduced version of our RELIGION_abstracts.csv dataset, which allows us to keep Euclidean distance as a metric. If we were to use the initial TF-IDF matrix with 250 features, we should probably change the underlying metric to cosine distance, which is better suited when dealing with sparse matrices such as in this example.
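For illustration, the following sketch (not part of the analysis below) shows what such a cosine-based run on the full TF-IDF matrix could look like; note that the eps value of 0.4 is only a placeholder assumption and would need to be tuned, since cosine distances between non-negative TF-IDF vectors lie between 0 and 1.
# hypothetical sketch only: DBSCAN with cosine distance on the 250-feature TF-IDF matrix
# the eps value is a placeholder and would have to be determined, for instance with a k-distance plot
dbscan_cosine = DBSCAN(eps=0.4, min_samples=5, metric="cosine")
dbscan_cosine_labels = dbscan_cosine.fit_predict(df_abstracts_tfidf)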
The first step will be to figure out which eps value is most suitable for our data.
findOptimalEps(2, df_abstracts_pca)
Using 0.2 as the eps value seems reasonable.
dbscan = DBSCAN(eps=0.2, min_samples=5, metric="euclidean")
dbscan_labels = dbscan.fit_predict(df_abstracts_pca)
df_abstracts_dbscan = df_abstracts.copy()
df_abstracts_dbscan["cluster"] = dbscan_labels
df_abstracts_dbscan["cluster"].unique()
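Since DBSCAN labels outliers with -1, it can also be instructive to check how many abstracts were treated as noise (an optional check, not part of the original analysis).
# optional: number of abstracts that DBSCAN marked as noise (cluster label -1)
print((df_abstracts_dbscan["cluster"] == -1).sum())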
df_abstracts_dbscan[df_abstracts_dbscan["cluster"] == 1]["title"]
pca = PCA(n_components=2, whiten=False, random_state=42)
dbscan_pca_2d = pca.fit_transform(df_abstracts_tfidf)
df_dbscan_2d = pd.DataFrame(data=dbscan_pca_2d, columns=["pc_1", "pc_2"])
df_dbscan_2d["clusters"] = dbscan_labels
sns_plot = sns.scatterplot(x="pc_1", y="pc_2", hue="clusters", data=df_dbscan_2d)
sns_plot.get_figure().savefig("clustering-with-sklearn-in-python-fig12.png")