(exploring the family of Stack Exchange Websites)

After running the following query:

SELECT PostTypeId AS type_of_posts, COUNT(*) AS number_of_posts
  FROM Posts
 GROUP BY PostTypeId
 ORDER BY number_of_posts DESC;

One can see that the two most numerous post types are 2 and 1, i.e., Answer and Question. Let's now focus on the Questions side.

SELECT Id, 
       PostTypeId, 
       CreationDate, 
       Score, 
       ViewCount, 
       Tags, 
       AnswerCount, 
       FavoriteCount
  FROM Posts
 WHERE PostTypeId = 1 AND CreationDate >= '2019-01-01'
ORDER BY CreationDate;

Now let's read the CSV file (2019_questions.csv) created from the query above:

In [1]:
import numpy as np
import pandas as pd


posts = pd.read_csv('2019_questions.csv', parse_dates=['CreationDate'])

# Drawing a random sample of five rows from the newly created posts DataFrame:
posts.sample(5)
Out[1]:
Id CreationDate Score ViewCount Tags AnswerCount FavoriteCount
7925 44021 2019-01-15 10:15:57 0 12 <machine-learning><k-nn> 0 NaN
2459 47565 2019-03-18 22:31:37 2 142 <machine-learning><predictive-modeling><machin... 2 NaN
3799 61127 2019-10-02 05:02:49 1 13 <nlp><topic-model> 0 NaN
4158 49470 2019-04-17 10:51:15 1 43 <machine-learning><neural-network><data-mining... 0 NaN
4366 61617 2019-10-11 18:55:06 0 7 <python><matplotlib> 0 NaN
In [2]:
posts.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
Id               8839 non-null int64
CreationDate     8839 non-null datetime64[ns]
Score            8839 non-null int64
ViewCount        8839 non-null int64
Tags             8839 non-null object
AnswerCount      8839 non-null int64
FavoriteCount    1407 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 483.5+ KB
In [3]:
posts['FavoriteCount'].value_counts(dropna=False)
Out[3]:
NaN      7432
 1.0      953
 2.0      205
 0.0      175
 3.0       43
 4.0       12
 5.0        8
 6.0        4
 7.0        4
 11.0       1
 8.0        1
 16.0       1
Name: FavoriteCount, dtype: int64
In [4]:
posts['Tags'].value_counts()
Out[4]:
<machine-learning>                                                                               118
<python><pandas>                                                                                  58
<python>                                                                                          55
<r>                                                                                               38
<tensorflow>                                                                                      36
<nlp>                                                                                             35
<neural-network>                                                                                  35
<reinforcement-learning>                                                                          32
<keras>                                                                                           29
<deep-learning>                                                                                   29
<time-series>                                                                                     26
<keras><tensorflow>                                                                               24
<machine-learning><python>                                                                        23
<classification>                                                                                  23
<python><pandas><dataframe>                                                                       22
<clustering>                                                                                      21
<machine-learning><neural-network>                                                                21
<cnn>                                                                                             19
<machine-learning><deep-learning>                                                                 18
<lstm>                                                                                            17
<dataset>                                                                                         17
<machine-learning><classification>                                                                17
<orange>                                                                                          16
<machine-learning><neural-network><deep-learning>                                                 16
<visualization>                                                                                   16
<machine-learning><python><scikit-learn>                                                          15
<pytorch>                                                                                         15
<python><keras><tensorflow>                                                                       15
<decision-trees>                                                                                  15
<pandas>                                                                                          15
                                                                                                ... 
<time-series><forecasting><probabilistic-programming>                                              1
<neural-network><training><optimization><fuzzy-logic><fuzzy-classification>                        1
<deep-learning><cnn><image-recognition><image-preprocessing><image-size>                           1
<machine-learning><cnn><reinforcement-learning><convolution>                                       1
<python><c>                                                                                        1
<data-mining><dbscan><research><implementation>                                                    1
<deep-learning><cross-validation>                                                                  1
<dataset><lstm>                                                                                    1
<gan><databases>                                                                                   1
<machine-learning><data><categorical-data><encoding>                                               1
<training><methodology>                                                                            1
<python><statistics><geospatial>                                                                   1
<machine-learning><classification><perceptron>                                                     1
<machine-learning><machine-learning-model><azure-ml>                                               1
<classification><scikit-learn><decision-trees><multiclass-classification><unbalanced-classes>      1
<deep-learning><loss-function><cosine-distance>                                                    1
<neural-network><computer-vision><object-detection>                                                1
<deep-learning><nlp><lstm><rnn><language-model>                                                    1
<r><ensemble-modeling>                                                                             1
<machine-learning><python><scikit-learn><regression><feature-selection>                            1
<machine-learning><python><nlp><stanford-nlp>                                                      1
<training><computer-vision><gan>                                                                   1
<machine-learning><nlp><natural-language-process><nlg>                                             1
<python><pandas><matplotlib>                                                                       1
<machine-learning><python><pytorch>                                                                1
<data><feature-engineering><encoding>                                                              1
<machine-learning><python><similarity><correlation>                                                1
<multilabel-classification><confusion-matrix>                                                      1
<machine-learning><lstm><bert>                                                                     1
<deep-learning><dataset><cnn><training><image-size>                                                1
Name: Tags, Length: 6462, dtype: int64

We have a tremendous amount of missing (NaN) datapoints in our posts DataFrame, all concentrated in the FavoriteCount column: 7432 datapoints out of 8839 are NaN. Apart from this particular column there are no other missing values in our DataFrame.
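This can be confirmed in one line (a quick sketch, not run here):

posts.isnull().sum()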

From a quick search we found out that these missing values likely correspond to questions that have never been favorited. The difference between these and the 0.0 values in the FavoriteCount column is that the latter were favorited at some point but later lost all their favorites: https://meta.stackexchange.com/questions/327680/why-do-some-questions-have-a-favorite-count-of-0-while-others-have-none

We have two ways to handle these missing values. The first would be to simply drop the affected rows; the second, based on our findings above, would be to fill them with 0, meaning those questions have zero favorite votes. Given how many rows are affected, we suggest the second option, so as not to lose a large share of the data.
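For reference, the discarded first option would be a one-liner (a sketch, not applied here):

posts = posts.dropna(subset=['FavoriteCount'])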

Regarding the Tags column, to ease the analysis and smooth the results we could either group the tags further or treat each combination of tags as an individual tag. But first of all we should separate the tags properly, i.e., with a comma (,).

First let's fill the missing values in the FavoriteCount column with zeros (0), and then change the column type from float to integer:

In [5]:
#Filling in the missing values with the fillna method:
posts['FavoriteCount'] = posts['FavoriteCount'].fillna(0)
posts.sample(5)
Out[5]:
Id CreationDate Score ViewCount Tags AnswerCount FavoriteCount
7045 43471 2019-01-04 10:26:10 5 354 <neural-network><classification><overfitting> 3 1.0
1745 57692 2019-08-16 21:15:18 1 29 <regression><visualization><feature-extraction... 1 0.0
34 44500 2019-01-24 12:40:20 0 157 <deep-learning><training> 0 0.0
8324 65611 2019-12-30 08:12:01 1 23 <machine-learning><deep-learning><pytorch> 0 1.0
4289 61768 2019-10-15 13:08:21 0 14 <machine-learning><predictive-modeling><data-c... 0 0.0
In [6]:
posts['FavoriteCount'] = posts['FavoriteCount'].astype(int)
posts.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
Id               8839 non-null int64
CreationDate     8839 non-null datetime64[ns]
Score            8839 non-null int64
ViewCount        8839 non-null int64
Tags             8839 non-null object
AnswerCount      8839 non-null int64
FavoriteCount    8839 non-null int64
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 483.5+ KB
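As a side note, newer pandas versions (0.24+) offer a nullable integer dtype that keeps NaN values while staying integer-typed, in case we preferred not to fill them (a sketch, not used here):

posts['FavoriteCount'] = posts['FavoriteCount'].astype('Int64')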

It's now time to clean the Tags column and separate each tag with a comma (,):

In [7]:
posts['Tags'] = (posts['Tags']
                     .str.replace('><', ',')
                     .str.replace('<', '')
                     .str.replace('>', ''))
posts.head()
Out[7]:
Id CreationDate Score ViewCount Tags AnswerCount FavoriteCount
0 44419 2019-01-23 09:21:13 1 21 machine-learning,data-mining 0 0
1 44420 2019-01-23 09:34:01 0 25 machine-learning,regression,linear-regression,... 0 0
2 44423 2019-01-23 09:58:41 2 1651 python,time-series,forecast,forecasting 0 0
3 44427 2019-01-23 10:57:09 0 55 machine-learning,scikit-learn,pca 1 0
4 44428 2019-01-23 11:02:15 0 19 dataset,bigdata,data,speech-to-text 0 0
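The same cleanup can also be written in two chained steps, stripping the outer angle brackets and replacing the inner separators (an equivalent sketch, not run here):

posts['Tags'] = posts['Tags'].str.strip('<>').str.replace('><', ',')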
We are going to calculate the number of times each tag was used: first by splitting each Tags string on the commas (,), then by stacking the resulting DataFrame into a single Series, and finally counting how many times each distinct tag appears:
In [8]:
tags_count = posts['Tags'].str.split(',', expand=True).stack().value_counts()
tags_top5 = tags_count.head(5)
print(tags_top5)
machine-learning    2693
python              1814
deep-learning       1220
neural-network      1055
keras                935
dtype: int64
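On newer pandas versions (0.25+), Series.explode offers an equivalent route (a sketch, not run here):

tags_count = posts['Tags'].str.split(',').explode().value_counts()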
Plotting the above results:
In [9]:
import matplotlib.pyplot as plt
%matplotlib inline

#Plotting the tags_top5 Series as a horizontal bar graph:
tags_top5_graph = tags_top5.plot.barh(
                    edgecolor='none',
                    color = [(255/255,188/255,121/255),
                            (162/255,200/255, 236/255),
                            (207/255,207/255,207/255),
                            (200/255,82/255,0/255),
                            (255/255,194/255,10/255)])

#ENHANCING PLOT AESTHETICS:

#Removing all four spines from the axes:
for spine in tags_top5_graph.spines.values():
    spine.set_visible(False)
#Removing the ticks (booleans; the old 'off' strings are deprecated):
tags_top5_graph.tick_params(
                            bottom=False, top=False, left=False, right=False)
#Setting a graph title:
tags_top5_graph.set_title('Top5 Tags by Number of Usage')
#Setting an average graph line:
tags_top5_graph.axvline(tags_top5.mean(),
                       alpha=.8, linestyle='--', color='grey')

# Displaying the graph:
plt.show()
As a first probe, let's total the views of every question whose tags mention machine-learning:

In [10]:
posts[posts['Tags'].str.contains('machine-learning')]['ViewCount'].sum()
Out[10]:
398666
Now let's check which of the top5 tags computed above is the most viewed. We will use the str.contains() method as a mask to filter the posts DataFrame for each of the top5 tags, and then, concentrating on the ViewCount column, add up the views of the matching questions:
In [11]:
#creating the tags_top5_views DataFrame (DF):
tags_top5_views = pd.DataFrame(
                                columns=tags_top5.index,
                                index=['Total Views'])


#filling in the tags_top5_views DF, one column per tag, with the total number
#of views of every question whose tags mention that tag:
for tag in tags_top5.index:
    n_views = posts[posts['Tags'].str.contains(tag)]['ViewCount'].sum()
    print(tag + '__total-views:', n_views)
    tags_top5_views[tag] = [n_views]
machine-learning__total-views: 398666
python__total-views: 541691
deep-learning__total-views: 233628
neural-network__total-views: 185367
keras__total-views: 269051
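The same figures could be gathered without the helper DataFrame, using a dict comprehension (an equivalent sketch, not run here):

views = {tag: posts.loc[posts['Tags'].str.contains(tag), 'ViewCount'].sum()
         for tag in tags_top5.index}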
Plotting the above results:
In [12]:
#Plotting the tags_top5_views DataFrame in a bar graph:
tags_top5_views_graph = tags_top5_views.plot.bar(edgecolor='none',
                                            color = [(255/255,188/255,121/255),
                            (162/255,200/255, 236/255),
                            (207/255,207/255,207/255),
                            (200/255,82/255,0/255),
                            (255/255,194/255,10/255)])

#ENHANCING PLOT AESTHETICS: 

#Removing all four spines from the axes:
for spine in tags_top5_views_graph.spines.values():
    spine.set_visible(False)
    
#Removing the ticks from the graph:
tags_top5_views_graph.tick_params(
                                  top=False,
                                  bottom=False,
                                  right=False,
                                  left=False)
   
# Setting up a legend box for our bar graph:    
tags_top5_views_graph.legend(
    loc='upper right', 
    labels=(tags_top5_views.columns), 
    ncol=1, fancybox=True, framealpha=.6,
    prop={'size': 10})
#Rotating the xtick labels:
plt.xticks(rotation='horizontal')

#Drawing a line at the average of the five view totals (n_views would only
#hold the last tag's total, not the mean):
tags_top5_views_graph.axhline(tags_top5_views.loc['Total Views'].mean(),
                             color='grey',
                             alpha=.8,
                             linestyle=':')

plt.show()

It is clear that among our top5 tags two stand out: Machine-Learning (ML) and Python (Py). Not only are these two tags the most used (ML: 2693 times; Py: 1814 times) but they are also the most viewed (Py: 541691 views; ML: 398666 views).

We've got two pretty good candidates for our assignment, and complementary ones at that, which can even be combined into a single major subject: Python and Machine-Learning.

A tag closely tied to both of our candidates is tensorflow, a Python deep-learning library. Let's pull out every question whose tags mention it:

In [13]:
#Keeping any question whose tags mention tensorflow:
posts[posts['Tags'].apply(lambda tags: 'tensorflow' in tags)]
Out[13]:
Id CreationDate Score ViewCount Tags AnswerCount FavoriteCount
22 44474 2019-01-24 00:43:27 2 1810 python,keras,tensorflow,gpu 2 2
39 44508 2019-01-24 15:18:57 1 27 tensorflow 0 0
52 44537 2019-01-25 00:54:49 0 303 machine-learning,neural-network,keras,tensorflow 1 0
66 55922 2019-07-18 13:59:42 0 117 keras,tensorflow,anomaly-detection,autoencoder 0 0
69 55925 2019-07-18 14:26:20 0 16 python,tensorflow,predictive-modeling,lstm,ana... 0 0
73 55931 2019-07-18 15:07:00 1 29 tensorflow 1 0
103 55994 2019-07-19 10:55:04 0 100 machine-learning,deep-learning,tensorflow,obje... 1 0
104 56000 2019-07-19 12:35:37 0 144 tensorflow 0 0
113 44584 2019-01-25 18:22:34 0 229 deep-learning,keras,tensorflow 1 0
122 44611 2019-01-26 16:16:54 2 102 keras,tensorflow 1 1
126 44624 2019-01-27 02:53:33 2 2538 keras,tensorflow,lstm 3 1
128 44627 2019-01-27 07:54:54 0 255 python,tensorflow,anaconda 2 0
136 44645 2019-01-27 14:13:32 1 216 machine-learning,neural-network,keras,tensorflow 1 0
152 44680 2019-01-28 07:48:11 1 193 python,tensorflow,cnn,computer-vision,opencv 0 0
190 55855 2019-07-17 18:10:46 0 109 machine-learning,tensorflow 1 0
194 55859 2019-07-17 19:32:53 1 50 deep-learning,keras,tensorflow 2 1
198 55868 2019-07-17 21:30:52 0 9 python,neural-network,tensorflow,predictive-mo... 0 1
209 55887 2019-07-18 05:46:20 1 26 machine-learning,deep-learning,tensorflow,data... 0 1
216 55899 2019-07-18 08:00:57 0 29 neural-network,keras,tensorflow,convolution,au... 0 0
240 56038 2019-07-19 22:58:54 0 165 neural-network,keras,tensorflow,cnn,gpu 0 0
257 56067 2019-07-20 17:41:12 0 99 neural-network,keras,tensorflow,autoencoder 0 0
287 44840 2019-01-31 01:15:50 0 54 scikit-learn,tensorflow,algorithms 1 0
297 44864 2019-01-31 14:00:45 0 16 machine-learning,tensorflow,autoencoder 0 0
298 44866 2019-01-31 14:16:49 1 492 tensorflow 1 0
304 44883 2019-02-01 00:02:15 5 4551 deep-learning,keras,tensorflow,multiclass-clas... 6 1
318 44911 2019-02-01 12:16:58 0 11 python,neural-network,scikit-learn,tensorflow,... 0 0
326 44928 2019-02-01 17:05:44 0 77 python,keras,tensorflow 1 0
332 56171 2019-07-22 16:36:03 1 255 deep-learning,keras,tensorflow,cnn,convnet 1 0
333 56172 2019-07-22 17:05:49 0 264 tensorflow 0 0
335 56181 2019-07-22 18:53:04 0 14 keras,tensorflow 0 0
... ... ... ... ... ... ... ...
8194 65518 2019-12-27 10:20:04 2 47 keras,tensorflow,prediction 0 0
8208 54765 2019-06-30 03:30:59 0 16 neural-network,tensorflow,word2vec,word-embedd... 0 0
8209 54766 2019-06-30 04:14:43 0 15 tensorflow,multiclass-classification,word-embe... 0 0
8234 44241 2019-01-19 16:53:02 0 17 machine-learning,deep-learning,tensorflow,comp... 0 0
8237 44246 2019-01-19 18:55:30 1 55 python,neural-network,tensorflow,convnet 1 0
8297 54888 2019-07-02 06:41:19 1 414 deep-learning,tensorflow,bert 1 0
8368 55081 2019-07-04 16:50:25 0 11 deep-learning,keras,tensorflow,convnet 0 0
8373 55089 2019-07-04 18:59:56 0 206 deep-learning,tensorflow,java,opencv 0 1
8413 55188 2019-07-06 17:07:51 0 20 machine-learning,python,tensorflow,accuracy 0 1
8420 55202 2019-07-07 05:36:28 0 16 machine-learning,tensorflow 0 0
8426 55215 2019-07-07 12:33:05 3 749 python,keras,tensorflow,loss-function 1 0
8464 55312 2019-07-08 21:15:45 1 85 keras,tensorflow 2 0
8515 54972 2019-07-03 08:53:01 0 12 neural-network,tensorflow,image-classification... 0 0
8530 55004 2019-07-03 19:10:53 0 31 machine-learning,tensorflow 1 0
8563 55032 2019-07-04 09:24:26 0 24 machine-learning,python,tensorflow,pandas,data... 2 0
8574 55050 2019-07-04 13:08:49 0 393 neural-network,deep-learning,keras,tensorflow 2 0
8588 55494 2019-07-11 10:11:21 0 60 neural-network,tensorflow,regression 1 0
8594 55505 2019-07-11 14:16:52 0 75 deep-learning,keras,tensorflow 0 0
8607 55536 2019-07-12 03:20:03 0 124 python,deep-learning,keras,tensorflow,object-d... 1 0
8612 55545 2019-07-12 06:52:03 2 502 neural-network,keras,tensorflow,cnn,convolution 2 0
8656 55158 2019-07-05 21:26:40 1 38 python,tensorflow,logistic-regression,loss-fun... 0 0
8674 55641 2019-07-14 12:49:57 0 35 machine-learning,deep-learning,tensorflow,imag... 0 0
8684 55659 2019-07-14 21:31:04 0 32 python,deep-learning,tensorflow,cnn 0 0
8742 55724 2019-07-15 20:23:27 0 14 tensorflow 0 0
8748 55735 2019-07-16 01:24:41 0 23 tensorflow,multiclass-classification,multilabe... 0 0
8755 55749 2019-07-16 06:49:44 1 29 deep-learning,tensorflow,multilabel-classifica... 0 0
8769 55777 2019-07-16 13:30:23 1 1212 tensorflow 1 0
8800 55265 2019-07-08 09:38:01 1 42 neural-network,deep-learning,keras,tensorflow,... 1 1
8808 55293 2019-07-08 16:33:03 0 109 deep-learning,tensorflow,cnn 1 0
8822 55391 2019-07-09 20:28:14 0 16 python,keras,tensorflow 0 0

584 rows × 7 columns
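If we wanted the stricter subset, questions tagged with both python and tensorflow, the condition would need both membership tests (a sketch, not run here):

posts[posts['Tags'].apply(lambda tags: 'python' in tags and 'tensorflow' in tags)]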

Prior to summarizing our findings, let's dig deeper into Deep Learning and check whether or not this trend has come to stay.

In [14]:
all_questions = pd.read_csv('all_questions.csv', parse_dates=['CreationDate'])
all_questions.head()
Out[14]:
Id CreationDate Tags
0 45416 2019-02-12 00:36:29 <python><keras><tensorflow><cnn><probability>
1 45418 2019-02-12 00:50:39 <neural-network>
2 45422 2019-02-12 04:40:51 <python><ibm-watson><chatbot>
3 45426 2019-02-12 04:51:49 <keras>
4 45427 2019-02-12 05:08:24 <r><predictive-modeling><machine-learning-mode...

Applying the same process as above, cleaning the Tags column and separating each tag with a comma (,):

In [15]:
all_questions['Tags'] = (all_questions['Tags']
                             .str.replace('><', ',')
                             .str.replace('<', '')
                             .str.replace('>', ''))
all_questions.sample(5)
Out[15]:
Id CreationDate Tags
10879 61095 2019-10-01 13:10:58 keras,time-series,lstm,convolution,autoencoder
5630 17287 2017-03-01 21:16:32 machine-learning,neural-network,convnet
180 55500 2019-07-11 13:19:26 prediction,forecasting,missing-data
17393 24080 2017-10-25 19:22:39 machine-learning,prediction,gaussian
20356 13854 2016-09-04 19:01:31 machine-learning,r,apache-spark,logistic-regre...
In [16]:
#Selecting the questions whose only tag is deep-learning, on an explicit copy
#to avoid pandas' SettingWithCopyWarning:
deep_learning = all_questions[all_questions['Tags'] == 'deep-learning'].copy()

#Sorting the DataFrame by date:
deep_learning = deep_learning.sort_values(by='CreationDate')
deep_learning.head()

Out[16]:
Id CreationDate Tags
5514 5375 2015-03-23 17:36:03 deep-learning
8039 6643 2015-07-31 08:21:49 deep-learning
16750 11591 2016-05-04 18:17:38 deep-learning
2852 15984 2016-12-29 03:35:35 deep-learning
3513 16276 2017-01-12 11:28:45 deep-learning
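Note that the == 'deep-learning' filter keeps only the questions tagged exclusively with deep-learning. If we instead wanted every question that mentions the tag, a broader filter would do it (a sketch, not used here, so the figures below match the outputs shown):

deep_learning_any = all_questions[all_questions['Tags'].str.contains('deep-learning')]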
In [17]:
deep_learning.tail()
Out[17]:
Id CreationDate Tags
14682 62769 2019-11-06 12:48:16 deep-learning
15307 62896 2019-11-08 20:56:55 deep-learning
18065 64103 2019-12-02 16:10:37 deep-learning
18785 64645 2019-12-11 12:32:05 deep-learning
20660 66004 2020-01-07 06:54:06 deep-learning

It's clear that our deep-learning sample ranges from 2015 to 2020, although the 2020 slice is too small to be meaningful, containing barely one month of data.

Now it is time to group our deep-learning data by year:

In [18]:
#Summing the Id column per year, a rough proxy for yearly volume (Ids are
#assigned sequentially, so they grow over time), then sorting by that sum:
deep_learning_grp = deep_learning.groupby(
                    deep_learning.CreationDate.dt.year).sum().sort_values(by='Id')
deep_learning_grp.head(10)
Out[18]:
Id
CreationDate
2015 12018
2016 27575
2020 66004
2017 265854
2018 562010
2019 1625975
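Since post Ids are assigned sequentially, summing them mixes the number of questions with how late in time they were asked, which is why 2020 lands between 2016 and 2017 above. A more direct tally would simply count the rows per year (a sketch, not run here):

deep_learning_per_year = deep_learning.groupby(
                             deep_learning.CreationDate.dt.year).size()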

Applying the same process to the all_questions DataFrame:

In [19]:
all_questions_grp = all_questions.groupby(
                    all_questions.CreationDate.dt.year).sum()
all_questions_grp.head(10)
Out[19]:
Id
CreationDate
2014 774987
2015 8241798
2016 27199660
2017 62341989
2018 189044640
2019 482278000
2020 30388658

Let's now merge the two DataFrames into one for the sake of our analysis, and proceed with some comparisons and conclusions:

In [20]:
deep_all = pd.merge(all_questions_grp, deep_learning_grp, how='left', 
                    left_index=True, right_index=True)

deep_all = deep_all.rename(
                        columns={'Id_x':'all_questions','Id_y':'deep_learning' })

deep_all.head(10)
Out[20]:
all_questions deep_learning
CreationDate
2014 774987 NaN
2015 8241798 12018.0
2016 27199660 27575.0
2017 62341989 265854.0
2018 189044640 562010.0
2019 482278000 1625975.0
2020 30388658 66004.0
In order to align both samples and make the comparison cleaner, we will drop the 2014 row, since there is no data for the deep-learning tag in that period, and also the 2020 row, due to the lack of sufficient data for that year. This way our comparisons are more robust and consistent:
In [21]:
deep_all = deep_all.drop([2014, 2020], axis=0)
deep_all.head(10)
Out[21]:
all_questions deep_learning
CreationDate
2015 8241798 12018.0
2016 27199660 27575.0
2017 62341989 265854.0
2018 189044640 562010.0
2019 482278000 1625975.0

Let's now run another test and compare the deep_learning figures against those of all questions asked on the Stack Exchange website, in order to validate the growth relative to the overall question volume; a kind of common-size analysis:

In [22]:
deep_all['%_deep_learning'] = (deep_all[
    'deep_learning']/deep_all['all_questions'])*100
deep_all['date'] = deep_all.index
#dropping the old index (another way would be deep_all.index.name = None):
deep_all.reset_index(drop=True, inplace=True)
deep_all = deep_all[['date', 'all_questions', 'deep_learning', '%_deep_learning']]
deep_all.head(10)
Out[22]:
date all_questions deep_learning %_deep_learning
0 2015 8241798 12018.0 0.145818
1 2016 27199660 27575.0 0.101380
2 2017 62341989 265854.0 0.426445
3 2018 189044640 562010.0 0.297290
4 2019 482278000 1625975.0 0.337145
At first glimpse we can observe an upward trend over the years in the deep_learning figures: the 2015 aggregate sits around 12,000, while by 2019 it climbs to roughly 1,600,000, an impressive growth.
In terms of their percentage among all questions asked on the Stack Exchange website, the growth trend is also there, albeit weaker and less linear. Now let's visualize it.
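Before plotting, one quick way to quantify how uneven that growth is would be the year-over-year change of the percentage column (a sketch, not run here):

deep_all['%_deep_learning'].pct_change()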

Plotting the results:

In [23]:
fig = plt.figure(figsize=(6,7))

ax1 = fig.add_subplot(3,1,1)
ax2 = fig.add_subplot(3,1,2)
ax3 = fig.add_subplot(3,1,3)
x = deep_all['date']
xi = list(range(len(x)))

#Labelling the shared x axis with the years (acts on the current axes, ax3):
plt.xticks(xi, x)

ax1.plot(deep_all['all_questions'], color='green', linestyle='-.')
#Setting the yticks for ax1:
ax1.set_yticks([deep_all['all_questions'].min(), deep_all['all_questions'].max()/2,
     deep_all['all_questions'].max()])
ax2.plot(deep_all['deep_learning'], color='orange', linestyle='-.')
#Setting the yticks for ax2:
ax2.set_yticks([deep_all['deep_learning'].min(), deep_all['deep_learning'].max()/2,
     deep_all['deep_learning'].max()])
ax3.plot(deep_all['%_deep_learning'], color='grey', linestyle='-.')
#Setting the yticks for ax3:
ax3.set_yticks([0, 0.25, 0.5])

#Giving each graph a title:
ax1.set_title('All Questions')
ax2.set_title('Deep Learning Questions')
ax3.set_title('% of Deep Learning Questions')


#ENHANCING PLOT AESTHETICS:

for ax in (ax1, ax2, ax3):
    #Removing the ticks from the graph:
    ax.tick_params(top=False, bottom=False, right=False, left=False)
    #Removing all four spines:
    for spine in ax.spines.values():
        spine.set_visible(False)

#Hiding the x tick labels on the two upper graphs:
for ax in (ax1, ax2):
    ax.tick_params(labelbottom=False)

plt.show()

Concluding, over the years there is a clear upward trend in the interest shown in the Deep Learning subject. That statement is weaker, though, when we analyse the number of Deep Learning questions relative to the total (All Questions): there is still growth in Deep Learning interest from 2015 to 2019, but that growth is neither as strong nor as linear.

To sum it up, it's fair to say that Deep Learning is a subject that deserves our attention, given the traction it gained over the five years covered by our analysis.