In this project, we'll pretend we work for a company that creates data science content, be it books, online articles, videos or interactive text-based platforms like DataQuest.
We're tasked with figuring out what the best content to write about is. However, we haven't been told what "best" means here. Therefore, our approximation will start by looking at the most popular questions on Data Science Stack Exchange (DSSE) - a site dedicated to data science. Some useful information about the site follows:
Stack Exchange provides a public database for each of its websites. After spending some minutes exploring it, we found some tables whose content seems promising for our project:
From the 8 possible types of posts, we'll focus on the **'questions'** type. Besides being the most common type, it's the most relevant for our purpose.
We'll start by getting the following information from the Posts table, including only posts of the 'question' type created in 2021 or later - as we're only interested in recent posts:
To that end, we used the following query:
SELECT Id, PostTypeId, CreationDate, Score, ViewCount, Tags, AnswerCount, FavoriteCount
FROM Posts
WHERE (PostTypeId = 1) AND (CreationDate >= '2021-01-01');
The result was downloaded into the file 'dsse_questions.csv'.
Next, we'll check how our recently obtained data looks:
# Importing the packages of our interest
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
# Data inspection
questions = pd.read_csv('dsse_questions.csv')
display(questions.head(100))
display("The posts dataframe has " + str(questions.shape[0]) + " rows")
display("The posts dataframe contains the following columns:", questions.dtypes)
The inspection above reveals some issues with the data that we'll need to fix:
# Filling missing values with 0 (plain assignment avoids pandas chained-assignment warnings)
questions["FavoriteCount"] = questions["FavoriteCount"].fillna(0)
# Setting proper column types
questions["FavoriteCount"] = questions["FavoriteCount"].astype(int)
questions["CreationDate"] = pd.to_datetime(questions["CreationDate"])
# Formatting 'Tags' column for easier manipulation
questions["Tags"] = (
    questions["Tags"]
    .str.replace("><", ",", regex=False)
    .str.replace("<", "", regex=False)
    .str.replace(">", "", regex=False)
)
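To make the transformation above concrete, here is a small sketch of what the chain of literal replacements does to a couple of hypothetical raw 'Tags' values as they come from the SEDE export:

```python
import pandas as pd

# Hypothetical raw 'Tags' values, in the angle-bracket format SEDE exports
sample = pd.Series(["<machine-learning><python>", "<deep-learning>"])

# Same chain of literal (non-regex) replacements used above
cleaned = (
    sample.str.replace("><", ",", regex=False)
          .str.replace("<", "", regex=False)
          .str.replace(">", "", regex=False)
)
print(cleaned.tolist())  # ['machine-learning,python', 'deep-learning']
```

The intermediate separator `><` must be replaced first; otherwise removing `<` and `>` would merge adjacent tags into one string.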
# Data inspection
display(questions.head())
display("The posts dataframe contains the following columns:", questions.dtypes)
# Expand the 'Tags' column to facilitate analysis
questions_tags = pd.concat([questions, questions["Tags"].str.split(",", expand=True)], axis=1)
display(questions_tags.head())
# Counting how many times each tag has been used
## Creating empty dictionary to store results
tags_uses = {}
## Defining a function that increments the 'tags_uses' dictionary by one for a particular tag
def tag_count(string):
    if string is not None:
        if string in tags_uses:
            tags_uses[string] += 1
        else:
            tags_uses[string] = 1
## Applying the previously defined function to each individual tag column
for x in range(0, 5):
    questions_tags[x].apply(tag_count)
## Converting the dictionary to dataframe and sorting by number of uses in descending order
tag_uses_df = pd.DataFrame.from_dict(tags_uses, orient="index").reset_index().rename(columns={"index":"tag", 0:"uses"})
tag_uses_df.sort_values("uses", ascending=False, inplace=True, ignore_index=True)
## Plotting the top 10 tags and their number of uses
top10_uses = tag_uses_df.head(10).sort_values("uses")
plt.barh(y=top10_uses["tag"], width=top10_uses["uses"])
plt.title("Number of uses for top10 used tags")
plt.tick_params(bottom=False, left=False)
plt.show()
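As an aside, the dictionary-and-loop approach above can also be expressed with pandas built-ins: splitting the comma-separated tags into lists, exploding them to one tag per row, and counting. A minimal sketch on a toy frame (the data here is made up for illustration):

```python
import pandas as pd

# Toy stand-in for our questions frame; 'Tags' is already comma-separated
df = pd.DataFrame({"Tags": ["python,pandas", "python", "pandas,numpy"]})

# Split into lists, explode to one tag per row, then count occurrences
tag_counts = df["Tags"].str.split(",").explode().value_counts()
print(tag_counts)  # python and pandas appear twice, numpy once
```

This avoids maintaining a mutable dictionary and the hard-coded assumption of at most 5 tag columns.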
# Counting how many views each tag has
## Creating empty dictionary to store results
tags_views = {}
## Defining a function that increases the 'tags_views' dictionary by the number of views for each tag column
def tag_views(row):
    for x in range(0, 5):
        if row[x] is not None:
            if row[x] in tags_views:
                tags_views[row[x]] += row["ViewCount"]
            else:
                tags_views[row[x]] = row["ViewCount"]
## Applying the previously defined function to our dataframe
questions_tags.apply(tag_views, axis=1)
## Converting the dictionary to dataframe and sorting by number of uses in descending order
tag_views_df = pd.DataFrame.from_dict(tags_views, orient="index").reset_index().rename(columns={"index":"tag", 0:"views"})
tag_views_df.sort_values("views", ascending=False, inplace=True, ignore_index=True)
## Plotting the top 10 tags and their number of views
top10_views = tag_views_df.head(10).sort_values("views")
plt.barh(y=top10_views["tag"], width=top10_views["views"])
plt.title("Number of views for top10 viewed tags")
plt.tick_params(bottom=False, left=False)
plt.show()
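The per-tag view totals can likewise be computed without a row-wise function, by exploding the tags while keeping `ViewCount` aligned and summing per tag. A sketch with invented numbers:

```python
import pandas as pd

# Toy stand-in: each question has comma-separated tags and a view count
df = pd.DataFrame({
    "Tags": ["python,pandas", "python", "pandas"],
    "ViewCount": [100, 50, 30],
})

# One row per (question, tag) pair, ViewCount repeated for each tag,
# then summed per tag and sorted descending
exploded = df.assign(tag=df["Tags"].str.split(",")).explode("tag")
views_per_tag = exploded.groupby("tag")["ViewCount"].sum().sort_values(ascending=False)
print(views_per_tag)  # python: 150, pandas: 130
```

Because `explode` duplicates the other columns for each tag, a question's views are credited to every one of its tags, matching the logic of the loop above.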
The bar plots above show that, while most tags appear in both top 10 lists, some appear in only one of them. More specifically:
We could therefore conclude that the most popular topics in data science relate to **machine learning** and **deep learning**, with the following tags:
To make our recommendation more robust, we'll check whether the popularity of these topics, which we measured from 2021 onward, holds over time, so we don't end up recommending something that could simply be a fad.
For this purpose, we have downloaded a new file from the public database for the Stack Exchange sites, using a modified version of the initial query:
SELECT Id, PostTypeId, CreationDate, Score, ViewCount, Tags, AnswerCount, FavoriteCount
FROM Posts
WHERE PostTypeId = 1;
We'll work with this new file in the following analyses.
Let's start by loading and cleaning the data:
# Reading the new file with no time filter
all_questions = pd.read_csv('all_dsse_questions.csv')
# Filling missing values with 0 (plain assignment avoids pandas chained-assignment warnings)
all_questions["FavoriteCount"] = all_questions["FavoriteCount"].fillna(0)
# Setting proper column types
all_questions["FavoriteCount"] = all_questions["FavoriteCount"].astype(int)
all_questions["CreationDate"] = pd.to_datetime(all_questions["CreationDate"])
# Formatting 'Tags' column for easier manipulation
all_questions["Tags"] = (
    all_questions["Tags"]
    .str.replace("><", ",", regex=False)
    .str.replace("<", "", regex=False)
    .str.replace(">", "", regex=False)
)
Once we have our data loaded and cleaned, we'll continue by creating a column that shows 1 for rows containing deep learning tags and 0 otherwise:
deep_learning_tags = ["machine-learning", "deep-learning", "neural-network", "nlp", "classification", "keras", "tensorflow", "scikit-learn"]
# Create a function that returns 1 if any of the deep learning tags is in the "Tags" column, 0 otherwise
def is_deep(row):
    taglist = row["Tags"].split(",")
    for element in deep_learning_tags:
        if element in taglist:
            return 1
    return 0
# Apply the function above to create a column with the results
all_questions["deep_learning"] = all_questions.apply(is_deep, axis=1)
# Display the table with the new column included
display(all_questions.head())
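The same flag can be computed without a row-wise `apply`, by intersecting each question's tag set with the deep learning list. A minimal sketch on invented rows:

```python
import pandas as pd

# Shortened tag list for illustration (the real analysis uses 8 tags)
deep_learning_tags = {"machine-learning", "deep-learning", "neural-network"}

df = pd.DataFrame({"Tags": ["python,deep-learning", "r,statistics"]})

# 1 if the question's tags intersect the deep learning set, else 0
df["deep_learning"] = (
    df["Tags"].str.split(",")
      .apply(lambda tags: int(bool(set(tags) & deep_learning_tags)))
)
print(df["deep_learning"].tolist())  # [1, 0]
```

Using a set for the tag list also makes the membership test O(1) per tag, which matters little here but is the idiomatic choice.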
Then, we need to group the data so it's easier to interpret and plot:
# Creating a year column
all_questions["year"] = all_questions["CreationDate"].dt.year
# Creating a pivot table showing the evolution of the proportion of deep learning topics across time
dl_evolution = all_questions.pivot_table(values="deep_learning", index="year", aggfunc="mean")
dl_evolution = dl_evolution.reset_index()
dl_evolution["deep_learning"] = (dl_evolution["deep_learning"] * 100).astype(int)
display(dl_evolution)
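The key step above is that the mean of a 0/1 flag within each year equals the proportion of flagged questions that year. A tiny made-up example to confirm the logic:

```python
import pandas as pd

# Toy data: two questions per year, flagged 1 if deep learning related
df = pd.DataFrame({
    "year": [2017, 2017, 2018, 2018],
    "deep_learning": [1, 0, 1, 1],
})

# Mean of the 0/1 flag per year = share of deep learning questions
evolution = df.pivot_table(values="deep_learning", index="year", aggfunc="mean")
print(evolution)  # 2017 -> 0.5, 2018 -> 1.0
```

Multiplying by 100, as done above, then turns these proportions into percentages.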
Finally, let's graph the evolution of deep learning topics:
# Initiate the plot with the corresponding data
fig, ax = plt.subplots(figsize=(15, 8))
ax.plot(
dl_evolution["year"],
dl_evolution["deep_learning"],
color="#0B74CB",
linewidth=4,
alpha=.8
)
# Remove all the four spines
# Remove all four spines
for location in ["top", "bottom", "left", "right"]:
    ax.spines[location].set_visible(False)
# Hide ticks
ax.tick_params(bottom=False, left=False)
# Modify ticks labels
ax.tick_params(axis="x", color="gray", which="major", labelsize=14, labelrotation=0, labelcolor="gray")
ax.tick_params(axis="y", color="gray", which="major", labelsize=14, labelrotation=0, labelcolor="gray")
ax.grid(axis="y", visible=True, alpha=.5)
# Title
ax.text(
0, 1.15, "Percentage of deep learning related topics in DSSE",
horizontalalignment="left",
verticalalignment="center",
transform=ax.transAxes,
size=22, weight="bold",
alpha=.75
)
# Subtitle
ax.text(
0, 1.08, "Deep learning's popularity starts decreasing after 2018",
horizontalalignment="left",
verticalalignment="center",
transform=ax.transAxes,
size=16,
alpha=.80
)
# Signature line
ax.text(
0.5, -0.1, "Created by: Álvaro Viúdez" + " " * 215 + "Source: DataQuest",
horizontalalignment="center",
verticalalignment="center",
transform=ax.transAxes,
color="white",
backgroundcolor="gray"
)
# Initial point
ax.text(
0.01, .05, '46%',
horizontalalignment="center",
verticalalignment="center",
transform=ax.transAxes,
color="#0B74CB",
size=18, weight="bold"
)
# Max point
ax.text(
0.51, 1, '65%',
horizontalalignment="center",
verticalalignment="center",
transform=ax.transAxes,
color="#0B74CB",
size=18, weight="bold"
)
# End point
ax.text(
0.99, .44, '55%',
horizontalalignment="center",
verticalalignment="center",
transform=ax.transAxes,
color="#0B74CB",
size=18, weight="bold"
)
plt.show()
The present analysis has shown two interesting points about the popularity of data science topics:
However, even if machine learning and deep learning are no longer at their peak popularity, they still account for more than 50% of the topics on DSSE.
We could advise our data science company to create machine learning and deep learning content, while still paying attention to new topics that may emerge - data science is always evolving!