Stack Exchange is a network of question-and-answer (Q&A) websites on topics in diverse fields, each site covering a specific topic, where questions, answers, and users are subject to a reputation award process. The reputation system allows the sites to be self-moderating.
Stack Exchange currently hosts 176 communities that are created and run by experts and enthusiasts who are passionate about a specific topic. They build libraries of high-quality questions and answers, focused on each community's area of expertise. A few of the communities are shown below; the size of each icon reflects how heavily that community is used. Of all 176 communities, Data Science ranks only 46th in terms of usage.
There is an incredibly broad spectrum of subject matter to ask questions about across the 176 communities, covering Technology, Culture/Recreation, Life/Arts, Science, Professional, and Business.
Clear boundaries are set regarding question format:
Ask about ...
Don't ask about ...
"Tags" are used to make it easy to find interesting questions. All questions are tagged with their subject areas. Each can have up to 5 tags, since a question might be related to several subjects.
Although the title of this guided project is 'Popular Data Science Questions', the raw data provided does not let me identify the most frequent questions. What the raw data provides is the Data Science tags attached to questions, not the actual questions posed by users.
What I can reveal through data analysis is the top Data Science tags, tag usage on a monthly basis over a period of time, and a few other interesting things.
At the end of this report, I will list the top 10 Data Science questions (across all tags) as of May 25, 2021, extracted from the appropriate data site, with a link to where I obtained the information.
Here are a few example "Tags" available within the Data Science Community:
Here are a few top post questions:
The two separate SQL 'SELECT' commands shown above generated the two subsequent tables.
The first table on the left shows the different types of posts available within Data Science Stack Exchange. The table on the right shows the quantity of posts for each type of post.
The top two post types are Answer (32404) and Question (28881). It makes sense to me that there are more Answer posts since answers from different users may be provided for a single question.
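As a quick check of the reasoning above, the two counts can be turned into an answers-per-question ratio. This is only a rough sketch: it assumes every Answer post belongs to one of the counted Question posts, which the tables do not confirm, and the variable names are my own.

```python
# Post counts quoted in the table above.
answer_posts = 32404
question_posts = 28881

# Rough average number of answers per question, assuming every answer
# belongs to one of the counted questions (an assumption, not a given).
answers_per_question = answer_posts / question_posts

print(f"{answers_per_question:.2f} answers per question on average")
```

A ratio slightly above 1 is consistent with some questions receiving several answers while others receive none.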
The table above shows the type of information contained within the 'Posts' table. The information under the 'Tags' column shows that there are times when a question has multiple 'Tags' it can be classified under.
Later in this project we will observe the frequency of tag types and summarize the top 20. Tag types don't give us the exact questions posed, but they do give us an idea of the subject matter the questions relate to.
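Because a question can carry several tags, a tag count is not the same thing as a question count. The sketch below illustrates this on a tiny made-up table (the `toy` DataFrame and its values are hypothetical, not taken from the project data). It assumes tags are stored as one string per question with each tag wrapped in angle brackets, and it uses a regular expression to extract them, which is just one possible approach.

```python
import pandas as pd

# Hypothetical toy data: three questions with angle-bracketed tag strings.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "Tags": [
        "<python><pandas>",
        "<machine-learning>",
        "<python><machine-learning><scikit-learn>",
    ],
})

# Extract every tag name from its angle brackets into a list per question.
toy["TagList"] = toy["Tags"].str.findall(r"<([^<>]+)>")

# One row per (question, tag) pair, then count how often each tag appears.
tag_counts = toy.explode("TagList")["TagList"].value_counts()

print(tag_counts)
print("questions:", len(toy), "| tag uses:", int(tag_counts.sum()))
```

Here three questions produce six tag uses, so tag frequencies describe subject matter rather than the number of distinct questions.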
# import the libraries needed to run the commands below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# read data file provided for analysis.
questions = pd.read_csv('2019_questions.csv', na_values=['Not Stated'])
# print column headings and file info to get a feel for
# the data file content and structure.
print(questions.columns, '\n')
# info() prints its own report and returns None, so call it directly.
questions.info()
print()
print(questions.isna().sum(), '\n')
print(questions.head())
# convert 'FavoriteCount' column from float64 to int64.
questions['FavoriteCount'] = questions['FavoriteCount'].fillna(0).astype(np.int64)
# convert CreationDate from object to datetime64.
questions['CreationDate'] = pd.to_datetime(questions['CreationDate'])
# confirm successful conversions.
questions.info()
# replace tag separators '><' with commas and
# remove '<' and '>' from beginning and end of Tags.
# this leaves a comma-separated string in the Tags column.
# regex=False treats the patterns as literal text.
questions['Tags'] = questions['Tags'].str.replace('><', ',', regex=False)
questions['Tags'] = questions['Tags'].str.replace('<', '', regex=False)
questions['Tags'] = questions['Tags'].str.replace('>', '', regex=False)
# confirm successful executions.
print(questions.head())
# split each comma-separated Tags string into a list, then use 'explode'
# to give every tag its own row under the same column name.
new = questions.assign(Tags=questions['Tags'].str.split(',')).explode('Tags')
new.info()
print()
# count quantity of times each tag type was used
# and limit output to top 20.
print('\033[1mTop 20 Data Science Tags in 2019 \033[0m')
print(new['Tags'].value_counts().nlargest(20))
# plot a horizontal bar chart to provide visualization
# of top 20 Data Science tags used in 2019.
print('\n')
fig, ax = plt.subplots(figsize=(30,20))
new['Tags'].value_counts()[:20].sort_values(ascending=True).plot(kind='barh', ax=ax)
plt.title('TOP 20 Data Science Tags Used in 2019', fontsize=55, pad = 30)
plt.xlabel('Qty of Tag Usage', fontsize=45, labelpad = 30)
plt.xticks(fontsize=35, rotation=0)
plt.ylabel('Data Science Tag Names', fontsize=45)
plt.yticks(fontsize=35)
sns.despine(bottom=False, left=True)
ax.grid(False)
ax.tick_params(bottom=True, left=False, pad=15)
plt.show()
The bar plot above shows the top 20 Data Science tags used in 2019.
One good question is: "Are these tags totally independent of each other, or is there some inter-relationship between them?"
There is a dedicated section of the Data Science Stack Exchange website, titled 'Tags', that provides a brief description of each tag, as shown below. The tag captions below not only provide a description but are also ordered from highest to lowest usage for the current top 20 as of May 25, 2021. They are very close to the top 20 of 2019 shown in the graph above.
Judging by the tag descriptions, there is definitely an inter-relationship between various tags. For example, 'machine-learning' has sub-categories such as 'deep-learning', 'scikit-learn', 'tensorflow' and so on.
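One way to look for such relationships in data, rather than in the descriptions alone, is to count how often two tags appear on the same question. The sketch below does this on made-up tag lists (the `tag_lists` values are hypothetical and only illustrate the idea); it is not an analysis the project has run.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tag lists, one per question (not taken from the real data).
tag_lists = [
    ["machine-learning", "deep-learning", "tensorflow"],
    ["machine-learning", "scikit-learn"],
    ["deep-learning", "tensorflow"],
    ["python", "pandas"],
]

pair_counts = Counter()
for tags in tag_lists:
    # Sorting makes ('a', 'b') and ('b', 'a') count as the same pair;
    # set() guards against a tag listed twice on one question.
    pair_counts.update(combinations(sorted(set(tags)), 2))

# Show the tag pairs that co-occur most often.
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Pairs with high counts relative to each tag's own frequency would suggest related tags, while pairs that never co-occur would suggest independent ones.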