Popular Data Science Questions

In this project, we'll pretend we're working for a company that creates data science content, be it books, online articles, videos or interactive text-based platforms like DataQuest.

We're tasked with figuring out what the best content to write about is. However, we haven't been told what "best" means here. We'll therefore start by approximating it with the most popular questions on Data Science Stack Exchange (DSSE), a site dedicated to data science. Some useful information about the site follows:

  1. Data Science Stack Exchange is a question and answer site for data science professionals, machine learning specialists, and those interested in learning more about the field.
  2. Besides questions, the site's home page is also subdivided into tags, users, companies and unanswered. Tags could be of interest to us, as they contain information about the content of each post.
  3. In each post, we have the following information available: the question's creation date, modification date, score (based on upvotes and downvotes), number of views, tags, author and number of answers, plus the content, authors and scores of the answers, and whether any answer has been accepted by the author as the best answer.

1) Getting the data

Stack Exchange provides a public database for each of its websites. After spending a few minutes there, we found some tables whose content seems promising for our project:

  • Posts: Contains information about individual posts, such as score, tags, creation dates, etc.
  • Tags: Contains the number of uses of each tag, plus some other information that is not relevant to us right now.
  • Votes: Contains information about individual votes, such as ID of the voted post, type of vote, and date.

Of the 8 possible types of posts, we'll focus on the **'questions'** type. Besides being the most common type, it's the most relevant for our purpose.

1.a) Querying posts' data

We'll start by getting the following information from the Posts table, including only 'question' posts created in 2021 or later, as we're only interested in recent posts:

  • Id: An identification number for the post.
  • PostTypeId: An identification number for the type of post.
  • CreationDate: The date and time of creation of the post.
  • Score: The post's score.
  • ViewCount: How many times the post was viewed.
  • Tags: What tags were used.
  • AnswerCount: How many answers the question got (only applicable to question posts).
  • FavoriteCount: How many times the question was marked as a favorite (only applicable to question posts).

To that end, we used the following query:

SELECT Id, PostTypeId, CreationDate, Score, ViewCount, Tags, AnswerCount, FavoriteCount
FROM Posts
WHERE (PostTypeId = 1) AND (CreationDate > '2021/01/01');

The result was downloaded into the file 'dsse_questions.csv'.
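
For readers who would rather filter locally, the same condition can be sketched in pandas. The toy `posts` frame below is made up for illustration and is not the DSSE data. The `>=` comparison is an assumption that a post created exactly at midnight on 1 January 2021 should be kept; the query above uses a strict `>`.

```python
import pandas as pd

# Hypothetical toy data standing in for the Posts table (not real DSSE rows)
posts = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "PostTypeId": [1, 2, 1, 1],
    "CreationDate": pd.to_datetime([
        "2020-12-31 23:59:59",
        "2021-03-01 10:00:00",
        "2021-01-01 00:00:00",
        "2021-06-15 08:30:00",
    ]),
})

# Keep questions (PostTypeId == 1) created on or after 2021-01-01
is_question = posts["PostTypeId"] == 1
is_recent = posts["CreationDate"] >= pd.Timestamp("2021-01-01")
recent_questions = posts[is_question & is_recent]

print(recent_questions["Id"].tolist())  # [3, 4]
```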

1.b) First inspection of the posts' data

Next, we'll check how our recently obtained data looks:

In [1]:
# Importing the packages of our interest
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
# Data inspection
questions = pd.read_csv('dsse_questions.csv')
display(questions.head(100))
display("The posts dataframe has " + str(len(questions)) + " rows")
display("The posts dataframe contains the following columns:", questions.dtypes)
Id PostTypeId CreationDate Score ViewCount Tags AnswerCount FavoriteCount
0 87391 1 2021-01-01 03:10:42 1 34 <decision-trees> 1 NaN
1 87392 1 2021-01-01 07:28:07 0 34 <machine-learning><python><deep-learning><imag... 1 NaN
2 87393 1 2021-01-01 08:07:33 1 21 <neural-network><deep-learning><inception> 0 NaN
3 87395 1 2021-01-01 10:31:51 1 45 <machine-learning><cloud><federated-learning> 1 1.0
4 87404 1 2021-01-01 18:00:21 1 59 <reinforcement-learning><openai-gym> 1 NaN
... ... ... ... ... ... ... ... ...
95 87674 1 2021-01-08 11:47:07 2 1691 <mlp><deep-learning> 1 1.0
96 87675 1 2021-01-08 12:07:12 0 66 <python> 2 NaN
97 87683 1 2021-01-08 14:25:41 2 64 <cnn><feature-selection> 0 NaN
98 87688 1 2021-01-08 16:32:10 0 191 <clustering><dbscan> 0 NaN
99 87691 1 2021-01-08 18:44:11 1 154 <scikit-learn><hyperparameter-tuning><grid-sea... 1 NaN

100 rows × 8 columns

'The posts dataframe has 8137 rows'
'The posts dataframe contains the following columns:'
Id                 int64
PostTypeId         int64
CreationDate      object
Score              int64
ViewCount          int64
Tags              object
AnswerCount        int64
FavoriteCount    float64
dtype: object
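
When reporting how large a dataframe is, it helps to remember that `DataFrame.size` counts cells (rows times columns), whereas `len(df)` and `df.shape[0]` count rows. A minimal sketch on a made-up frame (the `toy` name and its values are illustrative only):

```python
import pandas as pd

# Hypothetical toy frame: 3 rows and 2 columns
toy = pd.DataFrame({"Id": [1, 2, 3], "Score": [0, 5, 2]})

n_rows, n_cols = toy.shape   # (rows, columns)
n_cells = toy.size           # rows * columns

print(len(toy), n_rows, n_cols, n_cells)  # 3 3 2 6
```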

1.c) Posts' data cleaning

The inspection of the data above shows some issues with the data that we'll need to fix:

  1. The 'FavoriteCount' column has missing values, and is stored as float instead of integer.
  2. The 'CreationDate' column is stored as a string, instead of DateTime.
  3. The 'Tags' column content has each tag enclosed with '<>' signs.
In [3]:
# Filling missing values with 0 (assigning back instead of a chained inplace call)
questions["FavoriteCount"] = questions["FavoriteCount"].fillna(0)

# Setting proper column types
questions["FavoriteCount"] = questions["FavoriteCount"].astype(int)
questions["CreationDate"] = pd.to_datetime(questions["CreationDate"])

# Formatting 'Tags' column for easier manipulation: '<a><b>' becomes 'a,b'
questions["Tags"] = (
    questions["Tags"]
    .str.replace("><", ",", regex=False)
    .str.replace("<", "", regex=False)
    .str.replace(">", "", regex=False)
)

# Data inspection
display(questions.head())
display("The posts dataframe contains the following columns:", questions.dtypes)
Id PostTypeId CreationDate Score ViewCount Tags AnswerCount FavoriteCount
0 87391 1 2021-01-01 03:10:42 1 34 decision-trees 1 0
1 87392 1 2021-01-01 07:28:07 0 34 machine-learning,python,deep-learning,image-cl... 1 0
2 87393 1 2021-01-01 08:07:33 1 21 neural-network,deep-learning,inception 0 0
3 87395 1 2021-01-01 10:31:51 1 45 machine-learning,cloud,federated-learning 1 1
4 87404 1 2021-01-01 18:00:21 1 59 reinforcement-learning,openai-gym 1 0
'The posts dataframe contains the following columns:'
Id                        int64
PostTypeId                int64
CreationDate     datetime64[ns]
Score                     int64
ViewCount                 int64
Tags                     object
AnswerCount               int64
FavoriteCount             int32
dtype: object
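
As an alternative to chaining three replacements, the tag strings can be turned directly into Python lists. This is only a sketch on made-up strings in the `<a><b>` format shown earlier; the `raw` and `tag_lists` names are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical sample of raw tag strings in the '<a><b>' format
raw = pd.Series(["<decision-trees>", "<mlp><deep-learning>", "<python>"])

# Strip the outer brackets, then split on the '><' separator
tag_lists = raw.str.strip("<>").str.split("><")

print(tag_lists.tolist())
# [['decision-trees'], ['mlp', 'deep-learning'], ['python']]
```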

2) Data analysis

2.a) Most used and most viewed tags

We'll now focus on determining the most popular tags by considering two different popularity proxies: times used and number of views.

In [4]:
# Expand the 'Tags' column to facilitate analysis
questions_tags = pd.concat([questions, questions["Tags"].str.split(",", expand=True)], axis=1)

display(questions_tags.head())



# Counting how many times each tag has been used

## Creating empty dictionary to store results
tags_uses = {}

## Defining a function that increments the 'tags_uses' count for a given tag
def tag_count(tag):
    # Questions with fewer than 5 tags leave missing values in the expanded columns
    if pd.notna(tag):
        tags_uses[tag] = tags_uses.get(tag, 0) + 1

## Applying the previously defined function to each individual tags column
for x in range(0,5):
    questions_tags[x].apply(tag_count)

## Converting the dictionary to dataframe and sorting by number of uses in descending order
tag_uses_df = pd.DataFrame.from_dict(tags_uses, orient="index").reset_index().rename(columns={"index":"tag", 0:"uses"})
tag_uses_df.sort_values("uses", ascending=False, inplace=True, ignore_index=True)

## Plotting the top 10 tags and their number of uses
top10_uses = tag_uses_df.head(10).sort_values("uses")
plt.barh(y=top10_uses["tag"], width=top10_uses["uses"])
plt.title("Number of uses for top10 used tags")
plt.tick_params(bottom=False, left=False)
plt.show()



# Counting how many views each tag has

## Creating empty dictionary to store results
tags_views = {}

## Defining a function that adds a question's views to 'tags_views' for each of its tags
def tag_views(row):
    for x in range(0,5):
        # Skip the empty slots of questions with fewer than 5 tags
        if pd.notna(row[x]):
            tags_views[row[x]] = tags_views.get(row[x], 0) + row["ViewCount"]

## Applying the previously defined function to our dataframe
questions_tags.apply(tag_views, axis=1)

## Converting the dictionary to dataframe and sorting by number of views in descending order
tag_views_df = pd.DataFrame.from_dict(tags_views, orient="index").reset_index().rename(columns={"index":"tag", 0:"views"})
tag_views_df.sort_values("views", ascending=False, inplace=True, ignore_index=True)

## Plotting the top 10 tags and their number of views
top10_views = tag_views_df.head(10).sort_values("views")
plt.barh(y=top10_views["tag"], width=top10_views["views"])
plt.title("Number of views for top10 viewed tags")
plt.tick_params(bottom=False, left=False)
plt.show()
Id PostTypeId CreationDate Score ViewCount Tags AnswerCount FavoriteCount 0 1 2 3 4
0 87391 1 2021-01-01 03:10:42 1 34 decision-trees 1 0 decision-trees None None None None
1 87392 1 2021-01-01 07:28:07 0 34 machine-learning,python,deep-learning,image-cl... 1 0 machine-learning python deep-learning image-classification image-preprocessing
2 87393 1 2021-01-01 08:07:33 1 21 neural-network,deep-learning,inception 0 0 neural-network deep-learning inception None None
3 87395 1 2021-01-01 10:31:51 1 45 machine-learning,cloud,federated-learning 1 1 machine-learning cloud federated-learning None None
4 87404 1 2021-01-01 18:00:21 1 59 reinforcement-learning,openai-gym 1 0 reinforcement-learning openai-gym None None None
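
The two counting loops above can also be expressed without dictionaries, by giving each (question, tag) pair its own row. The sketch below runs on a made-up frame in the cleaned, comma-separated format; `toy`, `exploded`, `uses` and `views` are illustrative names, and the numbers are not DSSE figures.

```python
import pandas as pd

# Hypothetical toy data in the cleaned format (comma-separated tags)
toy = pd.DataFrame({
    "Tags": ["python,pandas", "python", "nlp,python"],
    "ViewCount": [100, 20, 30],
})

# One row per (question, tag) pair
exploded = toy.assign(tag=toy["Tags"].str.split(",")).explode("tag")

# Number of questions using each tag
uses = exploded["tag"].value_counts()

# Total views of the questions carrying each tag
views = exploded.groupby("tag")["ViewCount"].sum().sort_values(ascending=False)

print(uses.head(10))
print(views.head(10))
```

Note that, as in the loops above, a question's views are credited in full to every tag it carries, so view totals across tags add up to more than the total number of views.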

The bar plots above show that, while most tags appear in both top-10 lists, some appear in only one of them. More specifically:

  • "time-series" only appears in the top-10 used tags.
  • "pandas" only appears in the top-10 viewed tags.

We could therefore conclude that the most popular topics in Data Science relate to **machine learning** and **deep learning**, with the following tags:

  • machine-learning
  • deep-learning
  • neural-network
  • nlp
  • classification
  • keras
  • tensorflow
  • scikit-learn
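
The comparison between the two top-10 lists can be sketched with set operations. The lists below are short illustrative subsets built from tags named in this section, not the full top-10 lists, which come from the plots.

```python
# Illustrative subsets only; the full top-10 lists come from the bar plots
top_used = ["machine-learning", "deep-learning", "time-series"]
top_viewed = ["machine-learning", "deep-learning", "pandas"]

in_both = sorted(set(top_used) & set(top_viewed))
only_used = sorted(set(top_used) - set(top_viewed))
only_viewed = sorted(set(top_viewed) - set(top_used))

print(in_both)      # ['deep-learning', 'machine-learning']
print(only_used)    # ['time-series']
print(only_viewed)  # ['pandas']
```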

2.b) Popularity of machine learning and deep learning over time

To make our recommendation more robust, we would like to check that the topics we found most popular from 2021 onwards have also been popular in the past. We will therefore measure their popularity over time, so we don't recommend something that could simply be a fad.

For this purpose, we have downloaded a new file from the public database for the Stack Exchange sites, using a modified version of the initial query:

SELECT Id, PostTypeId, CreationDate, Score, ViewCount, Tags, AnswerCount, FavoriteCount
FROM Posts
WHERE PostTypeId = 1;

We'll work with this new file in the following analyses.

Let's start by loading and cleaning the data:

In [5]:
# Reading the new file with no time filter
all_questions = pd.read_csv('all_dsse_questions.csv')

# Filling missing values with 0 (assigning back instead of a chained inplace call)
all_questions["FavoriteCount"] = all_questions["FavoriteCount"].fillna(0)

# Setting proper column types
all_questions["FavoriteCount"] = all_questions["FavoriteCount"].astype(int)
all_questions["CreationDate"] = pd.to_datetime(all_questions["CreationDate"])

# Formatting 'Tags' column for easier manipulation: '<a><b>' becomes 'a,b'
all_questions["Tags"] = (
    all_questions["Tags"]
    .str.replace("><", ",", regex=False)
    .str.replace("<", "", regex=False)
    .str.replace(">", "", regex=False)
)

Once our data is loaded and cleaned, we'll continue by creating a column that shows 1 for rows containing any of the machine learning and deep learning tags identified above, and 0 otherwise:

In [6]:
all_questions["deep_learning"] = 0
deep_learning_tags = ["machine-learning", "deep-learning", "neural-network", "nlp", "classification", "keras", "tensorflow", "scikit-learn"]

# Create a function that returns 1 if any of the deep learning tags is in the "Tags" column
def is_deep(row):
    taglist = row["Tags"].split(",")
    for element in deep_learning_tags:
        if element in taglist:
            return 1

# Apply the function above to create a column with the results
all_questions["deep_learning"] = all_questions.apply(is_deep, axis=1)

# Rows with no matching tag come back as NaN (is_deep returns None), so fill them with 0
all_questions["deep_learning"] = all_questions["deep_learning"].fillna(0)

# Display the table with the new column included
display(all_questions.head())
Id PostTypeId CreationDate Score ViewCount Tags AnswerCount FavoriteCount deep_learning
0 56889 1 2019-08-03 12:05:21 2 185 deep-learning,keras,manifold 1 0 1.0
1 56893 1 2019-08-03 14:31:31 1 51 machine-learning,regression,feature-selection,... 1 0 1.0
2 56894 1 2019-08-03 14:32:09 2 743 deep-learning,time-series,word2vec,anomaly-det... 3 1 1.0
3 56905 1 2019-08-04 00:26:07 0 22 svm,multilabel-classification 2 0 0.0
4 56910 1 2019-08-04 06:49:43 0 519 machine-learning,keras 0 0 1.0
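
The same flag can be computed without a row-wise function by testing whether each question's tags overlap the list. This is a sketch on made-up rows in the cleaned format; the tag list is copied from the cell above, and `toy` is an illustrative name. Matching is on whole tags, so "multilabel-classification" does not count as "classification".

```python
import pandas as pd

# Tag list copied from the cell above, stored as a set for fast lookups
deep_learning_tags = {"machine-learning", "deep-learning", "neural-network", "nlp",
                      "classification", "keras", "tensorflow", "scikit-learn"}

# Hypothetical toy rows in the cleaned, comma-separated format
toy = pd.DataFrame({"Tags": ["deep-learning,keras,manifold",
                             "svm,multilabel-classification",
                             "machine-learning,keras"]})

# A question is flagged when its tags share at least one entry with the set
toy["deep_learning"] = (
    toy["Tags"].str.split(",")
    .apply(lambda tags: int(not deep_learning_tags.isdisjoint(tags)))
)

print(toy["deep_learning"].tolist())  # [1, 0, 1]
```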

Then, we need to group the data so it can be easier to interpret and plot:

In [7]:
# Creating a year column
all_questions["year"] = all_questions["CreationDate"].dt.year.astype(int)

# Creating a pivot table showing evolution of the proportion of deep learning topics across time
dl_evolution = all_questions.pivot_table(values="deep_learning", index="year", aggfunc="mean")
dl_evolution = dl_evolution.reset_index()

# Converting the proportion to a whole percentage (astype(int) drops the decimals)
dl_evolution["deep_learning"] = (dl_evolution["deep_learning"] * 100).astype(int)
display(dl_evolution)
year deep_learning
0 2014 46
1 2015 48
2 2016 55
3 2017 62
4 2018 65
5 2019 63
6 2020 62
7 2021 59
8 2022 55
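
Because the flag is a 0/1 column, its yearly mean is the share of flagged questions. The sketch below reproduces that idea with `groupby` on made-up data and shows the difference between truncating and rounding the percentage; `toy` and its values are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: one row per question with its year and 0/1 flag
toy = pd.DataFrame({
    "year": [2020, 2020, 2020, 2021, 2021, 2021],
    "deep_learning": [1, 1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of flagged questions
share = toy.groupby("year")["deep_learning"].mean() * 100

truncated = share.astype(int)        # drops the decimals: 66.67 -> 66
rounded = share.round().astype(int)  # nearest integer:    66.67 -> 67

print(truncated.to_dict())  # {2020: 66, 2021: 33}
print(rounded.to_dict())    # {2020: 67, 2021: 33}
```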

Finally, let's graph the evolution of deep learning topics:

In [8]:
# Initiate the plot with the corresponding data
fig, ax = plt.subplots(figsize=(15, 8))
ax.plot(
    dl_evolution["year"],
    dl_evolution["deep_learning"],
    color="#0B74CB",
    linewidth=4,
    alpha=.8
)

# Remove all the four spines
for location in ["top", "bottom", "left", "right"]:
    ax.spines[location].set_visible(False)

# Hide ticks
ax.tick_params(bottom=False, left=False)

# Modify ticks labels
ax.tick_params(axis="x", color="gray", which="major", labelsize=14, labelrotation=0, labelcolor="gray")
ax.tick_params(axis="y", color="gray", which="major", labelsize=14, labelrotation=0, labelcolor="gray")
ax.grid(axis="y", visible=True, alpha=.5)

# Title
ax.text(
    0, 1.15, "Percentage of deep learning related topics in DSSE",
    horizontalalignment="left",
    verticalalignment="center",
    transform=ax.transAxes,
    size=22, weight="bold",
    alpha=.75
)

# Subtitle
ax.text(
    0, 1.08, "Deep Learning starts decreasing popularity after 2018",
    horizontalalignment="left",
    verticalalignment="center",
    transform=ax.transAxes,
    size=16,
    alpha=.80
)

# Signature line
ax.text(
    0.5, -0.1, "Created by: Álvaro Viúdez" + " " * 215 + "Source: DataQuest",
    horizontalalignment="center",
    verticalalignment="center",
    transform=ax.transAxes,
    color="white",
    backgroundcolor="gray"
)

# Initial point
ax.text(
    0.01, .05, '46%',
    horizontalalignment="center",
    verticalalignment="center",
    transform=ax.transAxes,
    color="#0B74CB",
    size=18, weight="bold"
)

# Max point
ax.text(
    0.51, 1, '65%',
    horizontalalignment="center",
    verticalalignment="center",
    transform=ax.transAxes,
    color="#0B74CB",
    size=18, weight="bold"
)

# End point
ax.text(
    0.99, .44, '55%',
    horizontalalignment="center",
    verticalalignment="center",
    transform=ax.transAxes,
    color="#0B74CB",
    size=18, weight="bold"
)

plt.show()
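
The three percentage labels above are positioned by hand in axes coordinates. As a possible alternative, they could be attached to the data points themselves, so they follow the line if the data changes. The sketch below copies the yearly values from the table in this section; the `Agg` backend line is only there so it runs without a display and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Yearly percentages copied from the table above
years = [2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022]
pct = [46, 48, 55, 62, 65, 63, 62, 59, 55]

fig, ax = plt.subplots(figsize=(15, 8))
ax.plot(years, pct, color="#0B74CB", linewidth=4, alpha=.8)

# Label the first, highest and last points using data coordinates
peak = pct.index(max(pct))
labels = []
for i in (0, peak, len(pct) - 1):
    labels.append(ax.annotate(
        f"{pct[i]}%", xy=(years[i], pct[i]),
        xytext=(0, 12), textcoords="offset points",
        ha="center", color="#0B74CB", size=18, weight="bold",
    ))

# Render the figure (use plt.show() in an interactive session)
fig.canvas.draw()
```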

3) Conclusions

The present analysis has shown two interesting points on the popularity of data science topics:

  1. The most popular recent topics in Data Science relate to machine learning and deep learning.
  2. These topics have been losing popularity since 2018, when they peaked at 65% of the questions.

However, even if machine learning and deep learning are not at their maximum popularity values, they still account for more than 50% of the topics in DSSE.
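
The peak and the latest share can be read off the yearly table from section 2.b. A minimal sketch, with the values copied by hand from that table into an illustrative `dl_share` series:

```python
import pandas as pd

# Yearly percentages copied from the table in section 2.b
dl_share = pd.Series(
    [46, 48, 55, 62, 65, 63, 62, 59, 55],
    index=range(2014, 2023), name="deep_learning",
)

peak_year = dl_share.idxmax()
peak_value = dl_share.max()
latest_value = dl_share.iloc[-1]

print(peak_year, peak_value, latest_value)  # 2018 65 55
```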

We could advise our company to create machine learning and deep learning content, while still paying attention to new topics that may emerge - Data Science is always evolving!