Exploring the Data¶

We can read in the data while immediately making sure CreationDate will be stored as a datetime object:

In [1]:

# We import everything that we'll use

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:

questions = pd.read_csv("2019_questions.csv", parse_dates=["CreationDate"])

Running questions.info() should gives a lot of useful information.

In [3]:

questions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Id             8839 non-null   int64         
 1   CreationDate   8839 non-null   datetime64[ns]
 2   Score          8839 non-null   int64         
 3   ViewCount      8839 non-null   int64         
 4   Tags           8839 non-null   object        
 5   AnswerCount    8839 non-null   int64         
 6   FavoriteCount  1407 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 483.5+ KB

We see that only FavoriteCount has missing values. A missing value on this column probably means that the question was is not present in any users' favorite list, so we can replace the missing values with zero.

The types seem adequate for every column, however, after we fill in the missing values on FavoriteCount, there is no reason to store the values as floats.

Since the object dtype is a catch-all type, let's see what types the objects in questions["Tags"] are.

In [4]:

questions["Tags"].apply(lambda value: type(value)).unique()

Out[4]:

array([<class 'str'>], dtype=object)

We see that every value in this column is a string. On Stack Exchange, each question can only have a maximum of five tags (source), so one way to deal with this column is to create five columns in questions called Tag1, Tag2, Tag3, Tag4, and Tag5 and populate the columns with the tags in each row.

However, since doesn't help is relating tags from one question to another, we'll just keep them as a list.

Cleaning the Data¶

We'll begin by fixing FavoriteCount.

In [5]:

questions.fillna(value={"FavoriteCount": 0}, inplace=True)
questions["FavoriteCount"] = questions["FavoriteCount"].astype(int)
questions.dtypes

Out[5]:

Id                        int64
CreationDate     datetime64[ns]
Score                     int64
ViewCount                 int64
Tags                     object
AnswerCount               int64
FavoriteCount             int64
dtype: object

Let's now modify Tags to make it easier to work with.

In [6]:

questions["Tags"] = questions["Tags"].str.replace("^<|>$", "").str.split("><")
questions.sample(3)

Out[6]:

	Id	CreationDate	Score	ViewCount	Tags	AnswerCount	FavoriteCount
511	56382	2019-07-25 15:00:20	0	34	[machine-learning, python, pandas, natural-lan...	0	0
2178	58312	2019-08-28 09:44:00	1	41	[neural-network, pytorch]	0	1
2536	58151	2019-08-25 01:01:29	0	37	[dataset, audio-recognition]	2	0

Most Used and Most Viewed¶

We'll begin by counting how many times each tag was used

In [7]:

tag_count = dict()

for tags in questions["Tags"]:
    for tag in tags:
        if tag in tag_count:
            tag_count[tag] += 1
        else:
            tag_count[tag] = 1

For improved aesthetics, let's transform tag_count in a dataframe.

In [8]:

tag_count = pd.DataFrame.from_dict(tag_count, orient="index")
tag_count.rename(columns={0: "Count"}, inplace=True)
tag_count.head(10)

Out[8]:

	Count
machine-learning	2693
data-mining	217
regression	347
linear-regression	175
regularization	50
python	1814
time-series	466
forecast	34
forecasting	85
scikit-learn	540

Let's now sort this dataframe by Count and visualize the top 20 results.

In [9]:

most_used = tag_count.sort_values(by="Count").tail(20)
most_used

Out[9]:

	Count
machine-learning-model	224
statistics	234
clustering	257
predictive-modeling	265
r	268
dataset	340
regression	347
pandas	354
lstm	402
time-series	466
cnn	489
nlp	493
scikit-learn	540
tensorflow	584
classification	685
keras	935
neural-network	1055
deep-learning	1220
python	1814
machine-learning	2693

The threshold of 20 is somewhat arbitrary and we can experiment with others, however, popularity of the tags rapidly declines, so looking at these tags should be enough to help us with our goal. Let's visualize these data.

In [10]:

most_used.plot(kind="barh", figsize=(16,8))

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2f948a2e50>

Some tags are very, very broad and are unlikely to be useful; e.g.: python, dataset, r. Before we investigate the tags a little deeper, let's repeat the same process for views.

We'll use pandas's pandas.DataFrame.iterrows().

In [12]:

tag_view_count = dict()

for index, row in questions.iterrows():
    for tag in row['Tags']:
        if tag in tag_view_count:
            tag_view_count[tag] += row['ViewCount']
        else:
            tag_view_count[tag] = row['ViewCount']
            
tag_view_count = pd.DataFrame.from_dict(tag_view_count, orient="index")
tag_view_count.rename(columns={0: "ViewCount"}, inplace=True)

most_viewed = tag_view_count.sort_values(by="ViewCount").tail(20)

most_viewed.plot(kind="barh", figsize=(16,8))

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2f938fb350>

Let's see them side by side.

In [13]:

fig, axes = plt.subplots(nrows=1, ncols=2)
fig.set_size_inches((24, 10))
most_used.plot(kind="barh", ax=axes[0], subplots=True)
most_viewed.plot(kind="barh", ax=axes[1], subplots=True)

Out[13]:

array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f2f93aa7210>],
      dtype=object)

In [14]:

in_used = pd.merge(most_used, most_viewed, how="left", left_index=True, right_index=True)
in_viewed = pd.merge(most_used, most_viewed, how="right", left_index=True, right_index=True)

Relations Between Tags¶

One way of trying to gauge how pairs of tags are related to each other, is to count how many times each pair appears together. Let's do this.

We'll begin by creating a list of all tags.

In [15]:

all_tags = list(tag_count.index)

We'll now create a dataframe where each row will represent a tag, and each column as well. Something like this:

	tag1	tag2	tag3
tag1
tag2
tag3

In [16]:

associations = pd.DataFrame(index=all_tags, columns=all_tags)
associations.iloc[0:4,0:4]

Out[16]:

	machine-learning	data-mining	regression	linear-regression
machine-learning	NaN	NaN	NaN	NaN
data-mining	NaN	NaN	NaN	NaN
regression	NaN	NaN	NaN	NaN
linear-regression	NaN	NaN	NaN	NaN

We will now fill this dataframe with zeroes and then, for each lists of tags in questions["Tags"], we will increment the intervening tags by one. The end result will be a dataframe that for each pair of tags, it tells us how many times they were used together.

In [17]:

associations.fillna(0, inplace=True)

for tags in questions["Tags"]:
    associations.loc[tags, tags] += 1

This dataframe is quite large. Let's focus our attention on the most used tags. We'll add some colors to make it easier to talk about the dataframe. (At the time of this writing, GitHub's renderer does not display the colors, we suggest you use this solution notebook together with JupyterLab).

In [18]:

relations_most_used = associations.loc[most_used.index, most_used.index]

def style_cells(x):
    helper_df = pd.DataFrame('', index=x.index, columns=x.columns)
    helper_df.loc["time-series", "r"] = "background-color: yellow"
    helper_df.loc["r", "time-series"] = "background-color: yellow"
    for k in range(helper_df.shape[0]):
        helper_df.iloc[k,k] = "color: blue"
    
    return helper_df

relations_most_used.style.apply(style_cells, axis=None)

Out[18]:

	machine-learning-model	statistics	clustering	predictive-modeling	r	dataset	regression	pandas	lstm	time-series	cnn	nlp	scikit-learn	tensorflow	classification	keras	neural-network	deep-learning	python	machine-learning
machine-learning-model	224	3	3	21	7	12	8	4	5	7	4	4	18	9	21	17	10	19	37	139
statistics	3	234	3	16	16	17	16	3	1	22	1	3	6	0	19	3	11	12	35	89
clustering	3	3	257	0	16	5	2	5	3	20	0	9	24	0	12	0	8	2	45	61
predictive-modeling	21	16	0	265	13	7	28	4	13	31	6	1	12	6	27	11	13	32	35	123
r	7	16	16	13	268	6	10	2	3	22	2	4	1	1	10	10	9	5	24	63
dataset	12	17	5	7	6	340	6	14	7	6	11	11	9	9	28	13	20	32	53	99
regression	8	16	2	28	10	6	347	6	11	24	6	2	37	9	34	31	42	21	59	119
pandas	4	3	5	4	2	14	6	354	7	19	1	3	37	3	3	3	1	1	244	62
lstm	5	1	3	13	3	7	11	7	402	87	24	19	2	43	20	133	69	103	61	71
time-series	7	22	20	31	22	6	24	19	87	466	8	0	12	9	25	51	33	44	105	131
cnn	4	1	0	6	2	11	6	1	24	8	489	7	0	57	20	116	118	160	62	124
nlp	4	3	9	1	4	11	2	3	19	0	7	493	12	11	35	23	24	72	71	113
scikit-learn	18	6	24	12	1	9	37	37	2	12	0	12	540	15	47	34	24	16	235	188
tensorflow	9	0	0	6	1	9	9	3	43	9	57	11	15	584	20	256	108	136	167	106
classification	21	19	12	27	10	28	34	3	20	25	20	35	47	20	685	58	65	59	98	259
keras	17	3	0	11	10	13	31	3	133	51	116	23	34	256	58	935	235	247	280	195
neural-network	10	11	8	13	9	20	42	1	69	33	118	24	24	108	65	235	1055	305	137	366
deep-learning	19	12	2	32	5	32	21	1	103	44	160	72	16	136	59	247	305	1220	160	429
python	37	35	45	35	24	53	59	244	61	105	62	71	235	167	98	280	137	160	1814	499
machine-learning	139	89	61	123	63	99	119	62	71	131	124	113	188	106	259	195	366	429	499	2693

The cells highlighted in yellow tell us that time-series was used together with r 22 times. The values in blue tell us how many times each of the tags was used. We saw earlier that machine-learning was used 2693 times and we confirm it in this dataframe.

It's hard for a human to understand what is going on in this dataframe. Let's create a heatmap. But before we do it, let's get rid of the values in blue, otherwise the colors will be too skewed.

In [19]:

for i in range(relations_most_used.shape[0]):
    relations_most_used.iloc[i,i] = pd.np.NaN

In [20]:

plt.figure(figsize=(12,8))
sns.heatmap(relations_most_used, cmap="Greens", annot=False)

Out[20]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f2f936807d0>

The most used tags also seem to have the strongest relationships, as given by the dark concentration in the bottom right corner. However, this could simply be because each of these tags is used a lot, and so end up being used together a lot without possibly even having any strong relation between them.

A more intuitive manifestation of this phenomenon is the following. A lot of people buy bread, a lot of people buy toilet paper, so they end up being purchased together a lot, but purchasing one of them doesn't increase the chances of purchasing the other.

Another shortcoming of this attempt is that it only looks at relations between pairs of tags and not between multiple groups of tags. For example, it could be the case that when used together, dataset and scikit-learn have a "strong" relation to pandas, but each by itself doesn't.

So how do we attack both these problems? There is a powerful data mining technique that allows us to handle this: association rules. Association rules allow us to analytically spot relations like "people who purchase milk, also purchase eggs". Moreover, we can also measure how strong this relations are on several fronts: how common the relation is, how strong it is, and how independent the components of the relationship are (toilet paper and bread are probably more independent than eggs and milk — you'll learn more about statistical independence in the next step).

We won't get into the details of it, as the technique is out of scope for this course, but it is a path worth investigating!

Enter Domain Knowledge¶

Keras, scikit-learn, TensorFlow are all Python libraries that allow their users to employ deep learning (a type of neural network).

Most of the top tags are all intimately related with one central machine learning theme: deep learning. If we want to be very specific, we can suggest the creation of Python content that uses deep learning for classification problems (and other variations of this suggestion).

At the glance of an eye, someone with sufficient domain knowledge can tell that the most popular topic at the moment, as shown by our analysis, is deep learning.

Just a Fad?¶

Let's read in the file into a dataframe called all_q. We'll parse the dates at read-time.

In [21]:

all_q = pd.read_csv("all_questions.csv", parse_dates=["CreationDate"])

We can use the same technique as before to clean the tags column.

In [22]:

all_q["Tags"] = all_q["Tags"].str.replace("^<|>$", "").str.split("><")

Before deciding which questions should be classified as being deep learning questions, we should decide what tags are deep learning tags.

The definition of what constitutes a deep learning tag we'll use is: a tag that belongs to the list ["lstm", "cnn", "scikit-learn", "tensorflow", "keras", "neural-network", "deep-learning"].

This list was obtained by looking at all the tags in most_used and seeing which ones had any relation to deep learning. You can use Google and read the tags descriptions to reach similar results.

We'll now create a function that assigns 1 to deep learning questions and 0 otherwise; and we use it.

In [23]:

def class_deep_learning(tags):
    for tag in tags:
        if tag in ["lstm", "cnn", "scikit-learn", "tensorflow",
                   "keras", "neural-network", "deep-learning"]:
            return 1
    return 0

In [24]:

all_q["DeepLearning"] = all_q["Tags"].apply(class_deep_learning)

In [25]:

all_q.sample(5)

Out[25]:

	Id	CreationDate	Tags	DeepLearning
15231	44675	2019-01-28 06:20:18	[model-selection]	0
440	55639	2019-07-14 11:45:43	[machine-learning, dataset, machine-learning-m...	0
11720	51523	2019-05-07 04:57:35	[neural-network, gradient-descent, batch-norma...	1
6262	27232	2018-01-30 09:53:38	[python, convergence]	0
19292	64930	2019-12-16 14:38:17	[neural-network, deep-learning, keras, convolu...	1

Looks good!

The data-science-techonology landscape isn't something as dynamic to merit daily, weekly, or even monthly tracking. Let's track it quarterly.

Since we don't have all the data for the first quarter of 2020, we'll get rid of those dates:

In [26]:

all_q = all_q[all_q["CreationDate"].dt.year < 2020]

Let's create a column that identifies the quarter in which a question was asked.

In [27]:

def fetch_quarter(datetime):
    year = str(datetime.year)[-2:]
    quarter = str(((datetime.month-1) // 3) + 1)
    return "{y}Q{q}".format(y=year, q=quarter)

all_q["Quarter"] = all_q["CreationDate"].apply(fetch_quarter)

In [28]:

all_q.head()

Out[28]:

	Id	CreationDate	Tags	DeepLearning	Quarter
0	45416	2019-02-12 00:36:29	[python, keras, tensorflow, cnn, probability]	1	19Q1
1	45418	2019-02-12 00:50:39	[neural-network]	1	19Q1
2	45422	2019-02-12 04:40:51	[python, ibm-watson, chatbot]	0	19Q1
3	45426	2019-02-12 04:51:49	[keras]	1	19Q1
4	45427	2019-02-12 05:08:24	[r, predictive-modeling, machine-learning-mode...	0	19Q1

For the final stretch of this screen, we'll group by quarter and:

Count the number of deep learning questions.
Count the total number of questions.
Compute the ratio between the two numbers above.

In [29]:

quarterly = all_q.groupby('Quarter').agg({"DeepLearning": ['sum', 'size']})
quarterly.columns = ['DeepLearningQuestions', 'TotalQuestions']
quarterly["DeepLearningRate"] = quarterly["DeepLearningQuestions"]\
                                /quarterly["TotalQuestions"]
# The following is done to help with visualizations later.
quarterly.reset_index(inplace=True)
quarterly.sample(5)

Out[29]:

	Quarter	DeepLearningQuestions	TotalQuestions	DeepLearningRate
17	18Q3	685	1512	0.453042
7	16Q1	110	516	0.213178
6	15Q4	66	382	0.172775
22	19Q4	809	2036	0.397348
9	16Q3	161	585	0.275214

In [30]:

ax1 = quarterly.plot(x="Quarter", y="DeepLearningRate",
                    kind="line", linestyle="-", marker="o", color="orange",
                    figsize=(24,12)
                    )

ax2 = quarterly.plot(x="Quarter", y="TotalQuestions",
                     kind="bar", ax=ax1, secondary_y=True, alpha=0.7, rot=45)

for idx, t in quarterly["TotalQuestions"].iteritems():
    ax2.text(idx, t, str(t), ha="center", va="bottom")
xlims = ax1.get_xlim()

ax1.get_legend().remove()

handles1, labels1 = ax1.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(handles=handles1 + handles2,
           labels=labels1 + labels2,
           loc="upper left", prop={"size": 12})


for ax in (ax1, ax2):
    for where in ("top", "right"):
        ax.spines[where].set_visible(False)
        ax.tick_params(right=False, labelright=False)

It seems that deep learning questions was a high-growth trend since the start of DSSE and it looks like it is plateauing. There is no evidence to suggest that interest in deep learning is decreasing and so we maintain our previous idea of proposing that we create deep learning content.

Id	TagName	Count	ExcerptPostId	WikiPostId
2	machine-learning	6919	4909	4908
46	python	3907	5523	5522
81	neural-network	2923	8885	8884
194	deep-learning	2786	8956	8955
77	classification	1899	4911	4910
324	keras	1736	9251	9250
128	scikit-learn	1303	5896	5895
321	tensorflow	1224	9183	9182
47	nlp	1162	147	146
24	r	1114	49	48

Id	PostTypeId	CreationDate	Score	ViewCount	Tags	AnswerCount
44419	1	2019-01-23 09:21:13	1	21	<machine-learning><data-mining>	0
44420	1	2019-01-23 09:34:01	0	25	<machine-learning><regression><linear-regression><regularization>	0
44423	1	2019-01-23 09:58:41	2	1651	<python><time-series><forecast><forecasting>	0
44427	1	2019-01-23 10:57:09	0	55	<machine-learning><scikit-learn><pca>	1
44428	1	2019-01-23 11:02:15	0	19	<dataset><bigdata><data><speech-to-text>	0

Popular Data Science Questions¶

Stack Exchange¶

What kind of questions are welcome on this site?¶

What, other than questions, does DSSE's home subdivide into?¶

What information is available in each post?¶

Stack Exchange Data Explorer¶

Getting the Data¶