Popular Data Science Questions

Introduction

Stack Exchange is a network of question-and-answer (Q&A) websites on topics in diverse fields, each site covering a specific topic, where questions, answers, and users are subject to a reputation award process. The reputation system allows the sites to be self-moderating.

Stack Exchange currently hosts 176 communities, created and run by experts and enthusiasts who are passionate about a specific topic. They build libraries of high-quality questions and answers focused on each community's area of expertise. A few of the communities are shown below; the size of each icon reflects how heavily that community is used. Of the 176 communities, Data Science ranks only 46th in usage.

[Image: Stack Exchange.jpg]

There is an incredibly broad spectrum of subject matter to ask about across the 176 communities, covering Technology, Culture/Recreation, Life/Arts, Science, Professional, and Business.

Clear boundaries are set regarding question format:

Ask about ...

  • Specific issues within each site's area of expertise
  • Real problems or questions that you’ve encountered

Don't ask about ...

  • Questions that are primarily opinion-based
  • Questions with too many possible answers or that would require an extremely long answer

"Tags" are used to make it easy to find interesting questions. All questions are tagged with their subject areas. Each can have up to 5 tags, since a question might be related to several subjects.

Project Objective

Although the title of this guided project is 'Popular Data Science Questions', the raw data provided does not let me identify the most frequent questions. It contains the tags attached to Data Science questions, not the questions themselves.

What I can reveal through data analysis is the top Data Science tags, tag usage on a monthly basis over a period of time, and a few other interesting things.

At the end of this report, I will list the current top 10 Data Science questions (across all tags) as of May 25, 2021 by extracting them from the appropriate data site. I will provide a link to where I obtained the information.

Example Data Science Tags and Questions

Here are a few example "Tags" available within the Data Science Community:

  • machine-learning
  • python
  • deep-learning
  • scikit-learn
  • neural-network

Here are a few top post questions:

  • What are graph embedding?
  • One Hot Encoding vs Word Embedding - When to choose one or another?
  • What is Hellinger Distance and when to use it?
  • Can machine learning learn a function like finding maximum from a list?

Query 1 in Data Science Stack Exchange Data Explorer

[Image: Query 1B.jpg]

[Image: Table 1B.jpg]

Observations

The two separate SQL 'SELECT' commands shown above generated the two subsequent tables.

The table on the left shows the different types of posts available within Data Science Stack Exchange. The table on the right shows the number of posts of each type.

The top two post types are Answer (32404) and Question (28881). It makes sense to me that there are more Answer posts since answers from different users may be provided for a single question.

Query 2 in Data Science Stack Exchange Data Explorer

[Image: Query 2.jpg]

[Image: Table 2.jpg]

Observations

The table above shows the type of information contained within the 'Posts' table. The 'Tags' column shows that a single question can be classified under multiple tags.

Later in this project we will observe the frequency of tags and summarize the top 20. Tags don't give us the exact questions posed, but they can give us an idea of the subject matter the questions were related to.

Read Data File

In [1]:
# import the libraries used throughout the analysis.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# read data file provided for analysis.
questions = pd.read_csv('2019_questions.csv', na_values=['Not Stated'])

# print column headings and file info to get a feel for
# the data file content and structure.
print(questions.columns, '\n')
print(questions.info(), '\n')
print(questions.isna().sum(), '\n')
print(questions.head())
Index(['Id', 'CreationDate', 'Score', 'ViewCount', 'Tags', 'AnswerCount',
       'FavoriteCount'],
      dtype='object') 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             8839 non-null   int64  
 1   CreationDate   8839 non-null   object 
 2   Score          8839 non-null   int64  
 3   ViewCount      8839 non-null   int64  
 4   Tags           8839 non-null   object 
 5   AnswerCount    8839 non-null   int64  
 6   FavoriteCount  1407 non-null   float64
dtypes: float64(1), int64(4), object(2)
memory usage: 483.5+ KB
None 

Id                  0
CreationDate        0
Score               0
ViewCount           0
Tags                0
AnswerCount         0
FavoriteCount    7432
dtype: int64 

      Id         CreationDate  Score  ViewCount  \
0  44419  2019-01-23 09:21:13      1         21   
1  44420  2019-01-23 09:34:01      0         25   
2  44423  2019-01-23 09:58:41      2       1651   
3  44427  2019-01-23 10:57:09      0         55   
4  44428  2019-01-23 11:02:15      0         19   

                                                Tags  AnswerCount  \
0                    <machine-learning><data-mining>            0   
1  <machine-learning><regression><linear-regressi...            0   
2       <python><time-series><forecast><forecasting>            0   
3              <machine-learning><scikit-learn><pca>            1   
4           <dataset><bigdata><data><speech-to-text>            0   

   FavoriteCount  
0            NaN  
1            NaN  
2            NaN  
3            NaN  
4            NaN  

Observations

  1. The only column that has missing values is 'FavoriteCount'. There are 7432 missing values out of 8839 rows. We may convert 'NaN' to 0 if need be to answer specific questions.
  2. In the 'Tags' column a question can carry several tags, each wrapped in '<' and '>' characters. Whether anything specific needs to be done with this column depends on the type of questions that need to be answered.
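
To illustrate the tag format described in observation 2, here is a minimal sketch that pulls the individual tag names out of one raw 'Tags' value with a regular expression. The helper name `parse_tags` and the pattern are my own assumptions for illustration; the project itself cleans the column with string replacements further down.

```python
import re

# Matches the text between each '<' and '>' pair.
TAG_PATTERN = re.compile(r'<([^<>]+)>')

def parse_tags(raw):
    """Return the tag names found in a string such as '<python><pandas>'."""
    return TAG_PATTERN.findall(raw)

# Sample value copied from the first row of the 'Tags' column above.
sample = '<machine-learning><data-mining>'
print(parse_tags(sample))
```

A list of tag names like this is convenient when the goal is to count tags, since each name can then be tallied on its own.
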
In [2]:
# convert 'FavoriteCount' column from float64 to int64.
questions['FavoriteCount'] = questions['FavoriteCount'].fillna(0).astype(np.int64)

# convert CreationDate from object to datetime64.
questions['CreationDate'] = pd.to_datetime(questions['CreationDate'])

# confirm successful conversions.
print(questions.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Id             8839 non-null   int64         
 1   CreationDate   8839 non-null   datetime64[ns]
 2   Score          8839 non-null   int64         
 3   ViewCount      8839 non-null   int64         
 4   Tags           8839 non-null   object        
 5   AnswerCount    8839 non-null   int64         
 6   FavoriteCount  8839 non-null   int64         
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 483.5+ KB
None
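
Filling the missing 'FavoriteCount' values with 0 treats "not recorded" as "no favorites". As a hedged aside, the sketch below compares that choice with pandas' nullable integer dtype on a small made-up Series. The toy values and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'FavoriteCount' column (values are made up).
favorites = pd.Series([np.nan, 2.0, np.nan, 1.0], name='FavoriteCount')

# Option used in this project: treat a missing value as zero favorites.
filled = favorites.fillna(0).astype(np.int64)

# Alternative: the nullable 'Int64' dtype keeps missing values as <NA>.
nullable = favorites.astype('Int64')

print(filled.tolist())
print(nullable.isna().sum())
```

Either option gives an integer column; the nullable version is only worth the extra care if the difference between 0 and missing matters for a later question.
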
In [3]:
# replace tag separators '><' with commas and
# remove '<' and '>' from beginning and end of Tags.
# this leaves each Tags value as a comma-separated string.
# regex=False treats the patterns as literal text.

questions['Tags'] = questions['Tags'].str.replace('><', ',', regex=False)
questions['Tags'] = questions['Tags'].str.replace('<', '', regex=False)
questions['Tags'] = questions['Tags'].str.replace('>', '', regex=False)

# confirm successful executions.
print(questions.head())
      Id        CreationDate  Score  ViewCount  \
0  44419 2019-01-23 09:21:13      1         21   
1  44420 2019-01-23 09:34:01      0         25   
2  44423 2019-01-23 09:58:41      2       1651   
3  44427 2019-01-23 10:57:09      0         55   
4  44428 2019-01-23 11:02:15      0         19   

                                                Tags  AnswerCount  \
0                       machine-learning,data-mining            0   
1  machine-learning,regression,linear-regression,...            0   
2            python,time-series,forecast,forecasting            0   
3                  machine-learning,scikit-learn,pca            1   
4                dataset,bigdata,data,speech-to-text            0   

   FavoriteCount  
0              0  
1              0  
2              0  
3              0  
4              0  
In [4]:
# split each comma-separated Tags string into a list, then use 'explode'
# to give every tag its own row under the same column name.
new = questions.assign(Tags=questions['Tags'].str.split(',')).explode('Tags')
print(new.info(), '\n')

# count quantity of times each tag type was used
# and limit output to top 20.
print('\033[1mTop 20 Data Science Tags in 2019  \033[0m')
new.Tags.value_counts().nlargest(20)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 26640 entries, 0 to 8838
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Id             26640 non-null  int64         
 1   CreationDate   26640 non-null  datetime64[ns]
 2   Score          26640 non-null  int64         
 3   ViewCount      26640 non-null  int64         
 4   Tags           26640 non-null  object        
 5   AnswerCount    26640 non-null  int64         
 6   FavoriteCount  26640 non-null  int64         
dtypes: datetime64[ns](1), int64(5), object(1)
memory usage: 1.6+ MB
None 

Top 20 Data Science Tags in 2019  
Out[4]:
machine-learning          2693
python                    1814
deep-learning             1220
neural-network            1055
keras                      935
classification             685
tensorflow                 584
scikit-learn               540
nlp                        493
cnn                        489
time-series                466
lstm                       402
pandas                     354
regression                 347
dataset                    340
r                          268
predictive-modeling        265
clustering                 257
statistics                 234
machine-learning-model     224
Name: Tags, dtype: int64
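
To put the counts above in proportion, a tag count can be divided by the number of questions. For example, 2693 of the 8839 questions (about 30%) carry 'machine-learning'. Here is a minimal sketch of that calculation on made-up toy data, assuming a tag appears at most once per question. The `toy` frame and the `tag_share` name are illustrative assumptions, not part of the project data.

```python
import pandas as pd

# Toy data in the same shape as the cleaned 'Tags' column (values are made up).
toy = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'Tags': ['python,pandas', 'python', 'machine-learning,python', 'pandas'],
})

# One row per (question, tag) pair, as in the cell above.
exploded = toy.assign(Tags=toy['Tags'].str.split(',')).explode('Tags')

# Number of questions carrying each tag, divided by the number of questions.
tag_share = exploded['Tags'].value_counts() / len(toy)
print(tag_share)
```

Note that the shares do not add up to 1, because a question with several tags is counted once for each of them.
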
In [5]:
# plot a horizontal bar chart to provide visualization
# of top 20 Data Science tags used in 2019.
print('\n')
fig, ax = plt.subplots(figsize=(30,20))

new['Tags'].value_counts()[:20].sort_values(ascending=True).plot(kind='barh')
plt.title('TOP 20 Data Science Tags Used in 2019', fontsize=55, pad = 30)
plt.xlabel('Qty of Tag Usage', fontsize=45, labelpad = 30)
plt.xticks(fontsize=35, rotation=0)
plt.ylabel('Data Science Tag Names', fontsize=45)
plt.yticks(fontsize=35)
sns.despine(bottom=False, left=True)
ax.grid(False)
ax.tick_params(bottom=True, left=False, pad=15)
plt.show()

Data Science Stack Exchange Tags

The bar plot above shows the top 20 Data Science tags used in 2019.

One good question is: "Are these tags totally independent of each other, or is there some inter-relationship between any of them?"

The Data Science Stack Exchange web site has a dedicated 'Tags' section that gives a brief description of each tag, as shown below. The tag captions below are ordered from most to least used and cover the current top 20 as of May 25, 2021. They are very close to the 2019 top 20 shown in the graph above.

Judging by the tag descriptions, there are definite inter-relationships between various tags. For example, 'deep-learning', 'scikit-learn', 'tensorflow' and so on are sub-categories of 'machine-learning'.
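
One way to look for such relationships in the data itself, rather than in the tag descriptions, is to count how often two tags appear on the same question. The sketch below does this on a handful of made-up tag strings in the cleaned, comma-separated format. The sample values and the `pair_counts` name are assumptions for illustration, and co-occurrence only hints at a relationship; it does not show that one tag is a sub-category of another.

```python
from collections import Counter
from itertools import combinations

# Toy tag strings in the cleaned, comma-separated format (values are made up).
tag_strings = [
    'machine-learning,deep-learning,keras',
    'machine-learning,scikit-learn',
    'deep-learning,keras',
    'python,pandas',
]

pair_counts = Counter()
for tags in tag_strings:
    # Sort so that (a, b) and (b, a) are counted as the same pair.
    unique_tags = sorted(set(tags.split(',')))
    pair_counts.update(combinations(unique_tags, 2))

print(pair_counts.most_common(3))
```

Applied to the real 'Tags' column, the same loop would show which of the top 20 tags tend to be used together.
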