Our goal in this project is to use Data Science Stack Exchange to determine what content should a data science education company create, based on interest by subject.
On DSSE's help center's section on questions , we can read that we should:
All of these characteristics, if employed, should be helpful attributes to our goal.
In the help center we also learned that in addition to the sites mentioned in the Learn section, there are other two sites that are relevant:
On the home page we can see that we have four sections:
Questions — a list of all questions asked;
Tags — a list of tags (keywords or labels that categorize questions);
Users — a list of users;
Unanswered — a list of unanswered questions;
The tagging system used by Stack Exchange looks just like what we need to solve this problem as it allow us to quantify how many questions are asked about each subject.
Something else we can learn from exploring the help center, is that Stack Exchange's sites are heavily moderated by the community; this gives us some confidence in using the tagging system to derive conclusions.
Looking, just as an example, at this question, some of the information we see is:
Perusing the table names, a few stand out as relevant for our goal:
Running a few exploratory queries, leads us to focus our efforts on Posts
table. For examples, the Tags
table looked very promising as it tells us how many times each tag was used, but there's no way to tell just from this if the interest in these tags is recent or a thing from the past.
Id | TagName | Count | ExcerptPostId | WikiPostId |
---|---|---|---|---|
2 | machine-learning | 6919 | 4909 | 4908 |
46 | python | 3907 | 5523 | 5522 |
81 | neural-network | 2923 | 8885 | 8884 |
194 | deep-learning | 2786 | 8956 | 8955 |
77 | classification | 1899 | 4911 | 4910 |
324 | keras | 1736 | 9251 | 9250 |
128 | scikit-learn | 1303 | 5896 | 5895 |
321 | tensorflow | 1224 | 9183 | 9182 |
47 | nlp | 1162 | 147 | 146 |
24 | r | 1114 | 49 | 48 |
To get the relevant data we run the following query.
SELECT Id, CreationDate,
Score, ViewCount, Tags,
AnswerCount, FavoriteCount
FROM posts
WHERE PostTypeId = 1 AND YEAR(CreationDate) = 2019;
Here's what the first few rows look like:
Id | PostTypeId | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount |
---|---|---|---|---|---|---|---|
44419 | 1 | 2019-01-23 09:21:13 | 1 | 21 | <machine-learning><data-mining> | 0 | |
44420 | 1 | 2019-01-23 09:34:01 | 0 | 25 | <machine-learning><regression><linear-regression><regularization> | 0 | |
44423 | 1 | 2019-01-23 09:58:41 | 2 | 1651 | <python><time-series><forecast><forecasting> | 0 | |
44427 | 1 | 2019-01-23 10:57:09 | 0 | 55 | <machine-learning><scikit-learn><pca> | 1 | |
44428 | 1 | 2019-01-23 11:02:15 | 0 | 19 | <dataset><bigdata><data><speech-to-text> | 0 |
We can read in the data while immediately making sure CreationDate
will be stored as a datetime object:
# We import everything that we'll use
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
questions = pd.read_csv("2019_questions.csv", parse_dates=["CreationDate"])
Running questions.info()
should gives a lot of useful information.
questions.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 8839 entries, 0 to 8838 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 8839 non-null int64 1 CreationDate 8839 non-null datetime64[ns] 2 Score 8839 non-null int64 3 ViewCount 8839 non-null int64 4 Tags 8839 non-null object 5 AnswerCount 8839 non-null int64 6 FavoriteCount 1407 non-null float64 dtypes: datetime64[ns](1), float64(1), int64(4), object(1) memory usage: 483.5+ KB
We see that only FavoriteCount
has missing values. A missing value on this column probably means that the question was is not present in any users' favorite list, so we can replace the missing values with zero.
The types seem adequate for every column, however, after we fill in the missing values on FavoriteCount
, there is no reason to store the values as floats.
Since the object
dtype is a catch-all type, let's see what types the objects in questions["Tags"]
are.
questions["Tags"].apply(lambda value: type(value)).unique()
array([<class 'str'>], dtype=object)
We see that every value in this column is a string. On Stack Exchange, each question can only have a maximum of five tags (source), so one way to deal with this column is to create five columns in questions
called Tag1
, Tag2
, Tag3
, Tag4
, and Tag5
and populate the columns with the tags in each row.
However, since doesn't help is relating tags from one question to another, we'll just keep them as a list.
We'll begin by fixing FavoriteCount
.
questions.fillna(value={"FavoriteCount": 0}, inplace=True)
questions["FavoriteCount"] = questions["FavoriteCount"].astype(int)
questions.dtypes
Id int64 CreationDate datetime64[ns] Score int64 ViewCount int64 Tags object AnswerCount int64 FavoriteCount int64 dtype: object
Let's now modify Tags
to make it easier to work with.
questions["Tags"] = questions["Tags"].str.replace("^<|>$", "").str.split("><")
questions.sample(3)
Id | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount | |
---|---|---|---|---|---|---|---|
511 | 56382 | 2019-07-25 15:00:20 | 0 | 34 | [machine-learning, python, pandas, natural-lan... | 0 | 0 |
2178 | 58312 | 2019-08-28 09:44:00 | 1 | 41 | [neural-network, pytorch] | 0 | 1 |
2536 | 58151 | 2019-08-25 01:01:29 | 0 | 37 | [dataset, audio-recognition] | 2 | 0 |
We'll begin by counting how many times each tag was used
tag_count = dict()
for tags in questions["Tags"]:
for tag in tags:
if tag in tag_count:
tag_count[tag] += 1
else:
tag_count[tag] = 1
For improved aesthetics, let's transform tag_count
in a dataframe.
tag_count = pd.DataFrame.from_dict(tag_count, orient="index")
tag_count.rename(columns={0: "Count"}, inplace=True)
tag_count.head(10)
Count | |
---|---|
machine-learning | 2693 |
data-mining | 217 |
regression | 347 |
linear-regression | 175 |
regularization | 50 |
python | 1814 |
time-series | 466 |
forecast | 34 |
forecasting | 85 |
scikit-learn | 540 |
Let's now sort this dataframe by Count
and visualize the top 20 results.
most_used = tag_count.sort_values(by="Count").tail(20)
most_used
Count | |
---|---|
machine-learning-model | 224 |
statistics | 234 |
clustering | 257 |
predictive-modeling | 265 |
r | 268 |
dataset | 340 |
regression | 347 |
pandas | 354 |
lstm | 402 |
time-series | 466 |
cnn | 489 |
nlp | 493 |
scikit-learn | 540 |
tensorflow | 584 |
classification | 685 |
keras | 935 |
neural-network | 1055 |
deep-learning | 1220 |
python | 1814 |
machine-learning | 2693 |
The threshold of 20
is somewhat arbitrary and we can experiment with others, however, popularity of the tags rapidly declines, so looking at these tags should be enough to help us with our goal. Let's visualize these data.
most_used.plot(kind="barh", figsize=(16,8))
<matplotlib.axes._subplots.AxesSubplot at 0x7f2f948a2e50>
Some tags are very, very broad and are unlikely to be useful; e.g.: python
, dataset
, r
. Before we investigate the tags a little deeper, let's repeat the same process for views.
We'll use pandas's pandas.DataFrame.iterrows()
.
tag_view_count = dict()
for index, row in questions.iterrows():
for tag in row['Tags']:
if tag in tag_view_count:
tag_view_count[tag] += row['ViewCount']
else:
tag_view_count[tag] = row['ViewCount']
tag_view_count = pd.DataFrame.from_dict(tag_view_count, orient="index")
tag_view_count.rename(columns={0: "ViewCount"}, inplace=True)
most_viewed = tag_view_count.sort_values(by="ViewCount").tail(20)
most_viewed.plot(kind="barh", figsize=(16,8))
<matplotlib.axes._subplots.AxesSubplot at 0x7f2f938fb350>
Let's see them side by side.
fig, axes = plt.subplots(nrows=1, ncols=2)
fig.set_size_inches((24, 10))
most_used.plot(kind="barh", ax=axes[0], subplots=True)
most_viewed.plot(kind="barh", ax=axes[1], subplots=True)
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f2f93aa7210>], dtype=object)
in_used = pd.merge(most_used, most_viewed, how="left", left_index=True, right_index=True)
in_viewed = pd.merge(most_used, most_viewed, how="right", left_index=True, right_index=True)
One way of trying to gauge how pairs of tags are related to each other, is to count how many times each pair appears together. Let's do this.
We'll begin by creating a list of all tags.
all_tags = list(tag_count.index)
We'll now create a dataframe where each row will represent a tag, and each column as well. Something like this:
tag1 | tag2 | tag3 | |
---|---|---|---|
tag1 | |||
tag2 | |||
tag3 |
associations = pd.DataFrame(index=all_tags, columns=all_tags)
associations.iloc[0:4,0:4]
machine-learning | data-mining | regression | linear-regression | |
---|---|---|---|---|
machine-learning | NaN | NaN | NaN | NaN |
data-mining | NaN | NaN | NaN | NaN |
regression | NaN | NaN | NaN | NaN |
linear-regression | NaN | NaN | NaN | NaN |
We will now fill this dataframe with zeroes and then, for each lists of tags in questions["Tags"]
, we will increment the intervening tags by one. The end result will be a dataframe that for each pair of tags, it tells us how many times they were used together.
associations.fillna(0, inplace=True)
for tags in questions["Tags"]:
associations.loc[tags, tags] += 1
This dataframe is quite large. Let's focus our attention on the most used tags. We'll add some colors to make it easier to talk about the dataframe. (At the time of this writing, GitHub's renderer does not display the colors, we suggest you use this solution notebook together with JupyterLab).
relations_most_used = associations.loc[most_used.index, most_used.index]
def style_cells(x):
helper_df = pd.DataFrame('', index=x.index, columns=x.columns)
helper_df.loc["time-series", "r"] = "background-color: yellow"
helper_df.loc["r", "time-series"] = "background-color: yellow"
for k in range(helper_df.shape[0]):
helper_df.iloc[k,k] = "color: blue"
return helper_df
relations_most_used.style.apply(style_cells, axis=None)
machine-learning-model | statistics | clustering | predictive-modeling | r | dataset | regression | pandas | lstm | time-series | cnn | nlp | scikit-learn | tensorflow | classification | keras | neural-network | deep-learning | python | machine-learning | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
machine-learning-model | 224 | 3 | 3 | 21 | 7 | 12 | 8 | 4 | 5 | 7 | 4 | 4 | 18 | 9 | 21 | 17 | 10 | 19 | 37 | 139 |
statistics | 3 | 234 | 3 | 16 | 16 | 17 | 16 | 3 | 1 | 22 | 1 | 3 | 6 | 0 | 19 | 3 | 11 | 12 | 35 | 89 |
clustering | 3 | 3 | 257 | 0 | 16 | 5 | 2 | 5 | 3 | 20 | 0 | 9 | 24 | 0 | 12 | 0 | 8 | 2 | 45 | 61 |
predictive-modeling | 21 | 16 | 0 | 265 | 13 | 7 | 28 | 4 | 13 | 31 | 6 | 1 | 12 | 6 | 27 | 11 | 13 | 32 | 35 | 123 |
r | 7 | 16 | 16 | 13 | 268 | 6 | 10 | 2 | 3 | 22 | 2 | 4 | 1 | 1 | 10 | 10 | 9 | 5 | 24 | 63 |
dataset | 12 | 17 | 5 | 7 | 6 | 340 | 6 | 14 | 7 | 6 | 11 | 11 | 9 | 9 | 28 | 13 | 20 | 32 | 53 | 99 |
regression | 8 | 16 | 2 | 28 | 10 | 6 | 347 | 6 | 11 | 24 | 6 | 2 | 37 | 9 | 34 | 31 | 42 | 21 | 59 | 119 |
pandas | 4 | 3 | 5 | 4 | 2 | 14 | 6 | 354 | 7 | 19 | 1 | 3 | 37 | 3 | 3 | 3 | 1 | 1 | 244 | 62 |
lstm | 5 | 1 | 3 | 13 | 3 | 7 | 11 | 7 | 402 | 87 | 24 | 19 | 2 | 43 | 20 | 133 | 69 | 103 | 61 | 71 |
time-series | 7 | 22 | 20 | 31 | 22 | 6 | 24 | 19 | 87 | 466 | 8 | 0 | 12 | 9 | 25 | 51 | 33 | 44 | 105 | 131 |
cnn | 4 | 1 | 0 | 6 | 2 | 11 | 6 | 1 | 24 | 8 | 489 | 7 | 0 | 57 | 20 | 116 | 118 | 160 | 62 | 124 |
nlp | 4 | 3 | 9 | 1 | 4 | 11 | 2 | 3 | 19 | 0 | 7 | 493 | 12 | 11 | 35 | 23 | 24 | 72 | 71 | 113 |
scikit-learn | 18 | 6 | 24 | 12 | 1 | 9 | 37 | 37 | 2 | 12 | 0 | 12 | 540 | 15 | 47 | 34 | 24 | 16 | 235 | 188 |
tensorflow | 9 | 0 | 0 | 6 | 1 | 9 | 9 | 3 | 43 | 9 | 57 | 11 | 15 | 584 | 20 | 256 | 108 | 136 | 167 | 106 |
classification | 21 | 19 | 12 | 27 | 10 | 28 | 34 | 3 | 20 | 25 | 20 | 35 | 47 | 20 | 685 | 58 | 65 | 59 | 98 | 259 |
keras | 17 | 3 | 0 | 11 | 10 | 13 | 31 | 3 | 133 | 51 | 116 | 23 | 34 | 256 | 58 | 935 | 235 | 247 | 280 | 195 |
neural-network | 10 | 11 | 8 | 13 | 9 | 20 | 42 | 1 | 69 | 33 | 118 | 24 | 24 | 108 | 65 | 235 | 1055 | 305 | 137 | 366 |
deep-learning | 19 | 12 | 2 | 32 | 5 | 32 | 21 | 1 | 103 | 44 | 160 | 72 | 16 | 136 | 59 | 247 | 305 | 1220 | 160 | 429 |
python | 37 | 35 | 45 | 35 | 24 | 53 | 59 | 244 | 61 | 105 | 62 | 71 | 235 | 167 | 98 | 280 | 137 | 160 | 1814 | 499 |
machine-learning | 139 | 89 | 61 | 123 | 63 | 99 | 119 | 62 | 71 | 131 | 124 | 113 | 188 | 106 | 259 | 195 | 366 | 429 | 499 | 2693 |
The cells highlighted in yellow tell us that time-series
was used together with r
22 times. The values in blue tell us how many times each of the tags was used. We saw earlier that machine-learning
was used 2693 times and we confirm it in this dataframe.
It's hard for a human to understand what is going on in this dataframe. Let's create a heatmap. But before we do it, let's get rid of the values in blue, otherwise the colors will be too skewed.
for i in range(relations_most_used.shape[0]):
relations_most_used.iloc[i,i] = pd.np.NaN
plt.figure(figsize=(12,8))
sns.heatmap(relations_most_used, cmap="Greens", annot=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7f2f936807d0>
The most used tags also seem to have the strongest relationships, as given by the dark concentration in the bottom right corner. However, this could simply be because each of these tags is used a lot, and so end up being used together a lot without possibly even having any strong relation between them.
A more intuitive manifestation of this phenomenon is the following. A lot of people buy bread, a lot of people buy toilet paper, so they end up being purchased together a lot, but purchasing one of them doesn't increase the chances of purchasing the other.
Another shortcoming of this attempt is that it only looks at relations between pairs of tags and not between multiple groups of tags. For example, it could be the case that when used together, dataset
and scikit-learn
have a "strong" relation to pandas
, but each by itself doesn't.
So how do we attack both these problems? There is a powerful data mining technique that allows us to handle this: association rules. Association rules allow us to analytically spot relations like "people who purchase milk, also purchase eggs". Moreover, we can also measure how strong this relations are on several fronts: how common the relation is, how strong it is, and how independent the components of the relationship are (toilet paper and bread are probably more independent than eggs and milk — you'll learn more about statistical independence in the next step).
We won't get into the details of it, as the technique is out of scope for this course, but it is a path worth investigating!
Keras, scikit-learn, TensorFlow are all Python libraries that allow their users to employ deep learning (a type of neural network).
Most of the top tags are all intimately related with one central machine learning theme: deep learning. If we want to be very specific, we can suggest the creation of Python content that uses deep learning for classification problems (and other variations of this suggestion).
At the glance of an eye, someone with sufficient domain knowledge can tell that the most popular topic at the moment, as shown by our analysis, is deep learning.
Let's read in the file into a dataframe called all_q
. We'll parse the dates at read-time.
all_q = pd.read_csv("all_questions.csv", parse_dates=["CreationDate"])
We can use the same technique as before to clean the tags column.
all_q["Tags"] = all_q["Tags"].str.replace("^<|>$", "").str.split("><")
Before deciding which questions should be classified as being deep learning questions, we should decide what tags are deep learning tags.
The definition of what constitutes a deep learning tag we'll use is: a tag that belongs to the list ["lstm", "cnn", "scikit-learn", "tensorflow", "keras", "neural-network", "deep-learning"]
.
This list was obtained by looking at all the tags in most_used
and seeing which ones had any relation to deep learning. You can use Google and read the tags descriptions to reach similar results.
We'll now create a function that assigns 1
to deep learning questions and 0
otherwise; and we use it.
def class_deep_learning(tags):
for tag in tags:
if tag in ["lstm", "cnn", "scikit-learn", "tensorflow",
"keras", "neural-network", "deep-learning"]:
return 1
return 0
all_q["DeepLearning"] = all_q["Tags"].apply(class_deep_learning)
all_q.sample(5)
Id | CreationDate | Tags | DeepLearning | |
---|---|---|---|---|
15231 | 44675 | 2019-01-28 06:20:18 | [model-selection] | 0 |
440 | 55639 | 2019-07-14 11:45:43 | [machine-learning, dataset, machine-learning-m... | 0 |
11720 | 51523 | 2019-05-07 04:57:35 | [neural-network, gradient-descent, batch-norma... | 1 |
6262 | 27232 | 2018-01-30 09:53:38 | [python, convergence] | 0 |
19292 | 64930 | 2019-12-16 14:38:17 | [neural-network, deep-learning, keras, convolu... | 1 |
Looks good!
The data-science-techonology landscape isn't something as dynamic to merit daily, weekly, or even monthly tracking. Let's track it quarterly.
Since we don't have all the data for the first quarter of 2020, we'll get rid of those dates:
all_q = all_q[all_q["CreationDate"].dt.year < 2020]
Let's create a column that identifies the quarter in which a question was asked.
def fetch_quarter(datetime):
year = str(datetime.year)[-2:]
quarter = str(((datetime.month-1) // 3) + 1)
return "{y}Q{q}".format(y=year, q=quarter)
all_q["Quarter"] = all_q["CreationDate"].apply(fetch_quarter)
all_q.head()
Id | CreationDate | Tags | DeepLearning | Quarter | |
---|---|---|---|---|---|
0 | 45416 | 2019-02-12 00:36:29 | [python, keras, tensorflow, cnn, probability] | 1 | 19Q1 |
1 | 45418 | 2019-02-12 00:50:39 | [neural-network] | 1 | 19Q1 |
2 | 45422 | 2019-02-12 04:40:51 | [python, ibm-watson, chatbot] | 0 | 19Q1 |
3 | 45426 | 2019-02-12 04:51:49 | [keras] | 1 | 19Q1 |
4 | 45427 | 2019-02-12 05:08:24 | [r, predictive-modeling, machine-learning-mode... | 0 | 19Q1 |
For the final stretch of this screen, we'll group by quarter and:
quarterly = all_q.groupby('Quarter').agg({"DeepLearning": ['sum', 'size']})
quarterly.columns = ['DeepLearningQuestions', 'TotalQuestions']
quarterly["DeepLearningRate"] = quarterly["DeepLearningQuestions"]\
/quarterly["TotalQuestions"]
# The following is done to help with visualizations later.
quarterly.reset_index(inplace=True)
quarterly.sample(5)
Quarter | DeepLearningQuestions | TotalQuestions | DeepLearningRate | |
---|---|---|---|---|
17 | 18Q3 | 685 | 1512 | 0.453042 |
7 | 16Q1 | 110 | 516 | 0.213178 |
6 | 15Q4 | 66 | 382 | 0.172775 |
22 | 19Q4 | 809 | 2036 | 0.397348 |
9 | 16Q3 | 161 | 585 | 0.275214 |
ax1 = quarterly.plot(x="Quarter", y="DeepLearningRate",
kind="line", linestyle="-", marker="o", color="orange",
figsize=(24,12)
)
ax2 = quarterly.plot(x="Quarter", y="TotalQuestions",
kind="bar", ax=ax1, secondary_y=True, alpha=0.7, rot=45)
for idx, t in quarterly["TotalQuestions"].iteritems():
ax2.text(idx, t, str(t), ha="center", va="bottom")
xlims = ax1.get_xlim()
ax1.get_legend().remove()
handles1, labels1 = ax1.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(handles=handles1 + handles2,
labels=labels1 + labels2,
loc="upper left", prop={"size": 12})
for ax in (ax1, ax2):
for where in ("top", "right"):
ax.spines[where].set_visible(False)
ax.tick_params(right=False, labelright=False)
It seems that deep learning questions was a high-growth trend since the start of DSSE and it looks like it is plateauing. There is no evidence to suggest that interest in deep learning is decreasing and so we maintain our previous idea of proposing that we create deep learning content.