Suppose we are data analysts for an online e-learning company that specializes in programming courses. We cover domains such as data science and game development, but our primary focus is web and mobile development. Our goal is to promote our products and invest more money in advertising, but to do that we need to know which markets to advertise in. We utilized three surveys related to programming/web development from FreeCodeCamp and Stack Overflow.
These surveys were completed online by participants worldwide in 2016, 2017, and 2018. FreeCodeCamp's surveys targeted new programmers and asked many questions related to career interest, income expectations, age, gender, home country, time spent programming, and so on. Stack Overflow's 2018 survey was aimed primarily at individuals already in the developer community and covered topics from favorite technologies to job preferences.
We discovered that new programmers are interested in a wide variety of career fields, including web development, data science, data engineering, game development, QA engineering, machine learning, and many others. We found that the likely motivator for their programming journey was to advance their income and career opportunities. With this knowledge, we need to ensure that our courses stay up to date, relevant, and beneficial for our customers.
Most importantly, after exploring the surveys we found that the two best countries to invest our advertising in are the United States and India. These two countries had the highest numbers of survey participants, which suggests that new programmers are most numerous there. The US also has the highest average monthly spending on programming education. India's average is lower, but it is still around the same amount as our monthly subscription (\$59 US dollars per month).
In short, the two best markets for advertising are the United States and India, and we recommend that the marketing team focus its efforts on these two regions.
We want to answer questions about a population of new coders that are interested in the subjects we teach. We'd like to know:
FreeCodeCamp Survey: https://www.freecodecamp.org/news/we-asked-20-000-people-who-they-are-and-how-theyre-learning-to-code-fff5d668969
Github repository: Survey Year 2017: https://github.com/freeCodeCamp/2017-new-coder-survey/tree/master/clean-data
Survey Year 2016: https://github.com/freeCodeCamp/2016-new-coder-survey#about-the-data
Stack Overflow Survey: https://www.kaggle.com/datasets/stackoverflow/stack-overflow-2018-developer-survey
Some limitations for analyzing survey data:
Method
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.style as style
#style.use("fivethirtyeight")
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
#pd.options.display.float_format = '{:20,.2f}'.format
pd.options.display.max_columns = 150 # to avoid truncated output
# Freecodecamp survey 2017
csv = pd.read_csv("2017-fCC-New-Coders-Survey-Data.csv", low_memory= False)
# Freecodecamp survey 2016
csv2016 = pd.read_csv("2016-fCC-New-Coders-Survey-Data.csv", low_memory = False)
# Stack exchange survey
exchange = pd.read_csv("survey_results_public.csv", low_memory= False)
csv.head()
exchange.head()
The first step in our analysis is to identify the relevant columns. Unfortunately, there are over 100 columns, which is far too many for a practical analysis.
We identified a few columns for analysis using `datapackage.json`. This JSON file describes each column for FreeCodeCamp's new coder surveys.
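As a rough illustration of how column descriptions might be looked up, the sketch below parses a small, made-up JSON string rather than the real file. It assumes the descriptor follows the common data-package layout (`resources`, then `schema`, then `fields`, each with a `name` and `description`); that layout, the sample text, and the helper `describe_columns` are assumptions for illustration only.

```python
import json

# Made-up excerpt shaped like a data-package descriptor (assumed layout,
# not copied from the actual datapackage.json).
sample = """
{
  "resources": [
    {
      "schema": {
        "fields": [
          {"name": "JobRoleInterest",
           "description": "Which one of these careers are you interested in?"},
          {"name": "ExpectedEarning",
           "description": "About how much money do you expect to earn per year at your first developer job, in US dollars?"}
        ]
      }
    }
  ]
}
"""

def describe_columns(package_text):
    """Hypothetical helper: map each column name to its description."""
    package = json.loads(package_text)
    descriptions = {}
    for resource in package.get("resources", []):
        for field in resource.get("schema", {}).get("fields", []):
            descriptions[field["name"]] = field.get("description", "")
    return descriptions

column_descriptions = describe_columns(sample)
for name, text in column_descriptions.items():
    print(f"{name}: {text}")
```

With the real file, the same function could be fed the text returned by reading the file from disk, provided the layout assumption holds.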
# Index location of the first set of columns to drop
print(csv.columns.get_loc("CodeEventConferences"))
print(csv.columns.get_loc("CodeEventWorkshops"))
# Drops columns
csv = csv.drop(csv.iloc[:, 8:23], axis=1)
# Index location of the next set of columns to drop
print(csv.columns.get_loc("NetworkID"))
print(csv.columns.get_loc("ResourceW3S"))
# Drop columns
csv = csv.drop(csv.iloc[:, 59:100], axis=1)
print(csv.columns.get_loc("YouTubeCodeCourse"))
# Drop remaining columns from index position 63 onward
csv = csv.drop(csv.iloc[:, 63:], axis=1)
csv.head()
csv.iloc[:,:20]
The code below gives us a better understanding of how much data is missing in each column. There are instances of respondents failing to enter information during the survey. Many columns have missing data, and it's going to be difficult to clean the dataset without removing nearly every row.
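As a minimal sketch of the idea on toy data (the values and the 60% threshold here are illustrative), the share of missing values per column can be computed with `isna().mean()` and then used to pick the columns worth keeping:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the survey; values are illustrative only.
toy = pd.DataFrame({
    "Age": [25, 31, np.nan, 40],
    "Income": [np.nan, np.nan, np.nan, 50000],
    "CountryLive": ["India", "United States of America", "Canada", None],
})

# isna().mean() gives the fraction of missing values in each column.
missing_pct = toy.isna().mean() * 100

# Keep columns whose missing share is at or below a chosen threshold.
threshold = 60
keep = missing_pct[missing_pct <= threshold].index.tolist()

print(missing_pct)
print(keep)
```

In this toy frame, `Income` is 75% missing and is excluded, while `Age` and `CountryLive` (25% missing each) are kept.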
# Percentage of missing values per column
series = csv.apply(pd.isnull).sum()/csv.shape[0] * 100
# Columns with less than or equal to 60% missing data points
# (named usable_columns to avoid shadowing Python's built-in list)
usable_columns = series[series <= 60].index
print(series)
# Converts the columns we want to use from pandas.Index to a Python list
cols_to_use = usable_columns.tolist()
cols_to_use.extend(["JobRoleInterest", "ExpectedEarning"])
# Isolates the dataframe down to only preferred columns
csv = csv[cols_to_use]
# Drop id.x and id.y columns
csv = csv.drop(columns=["ID.x","ID.y","ResourceW3S"])
csv
# Count missing data
nulls = csv.apply(pd.isnull).sum()/csv.shape[0] * 100
nulls = nulls.sort_values()
nulls
csv.info()
# New column to indicate year of survey completion
csv["Year"] = 2017
csv2016["Year"] = 2016
# Columns of interest
column_lists = csv.columns.to_list()
column_lists
# Apply column filtering to survey 2016
survey_2016 = csv2016[column_lists]
# Merge dataframes
combined_survey = pd.concat([csv, survey_2016])
# Merged dataframe length (rows)
print("Number of Rows:")
print(combined_survey.shape[0])
`JobRoleInterest`: "Which one of these careers are you interested in?"
Most of the courses offered on our e-learning platform are for web and mobile development. We need to identify whether the sample from the dataset is representative of the population of new coders. One significant limitation of this survey is the number of rows with missing information for `JobRoleInterest`. Roughly 6 out of 10 observations do not have a response to this question.
It's strange that so many people who took the survey neglected to answer this question. In addition to this question, perhaps another should have been asked: "What are your goals for learning programming?", or something similar.
After merging both dataframes together we ended up with 33,795 rows. For analysis we're going to remove all observations that failed to answer this question. The final dataframe will include only 13,495 rows.
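As a quick arithmetic check, the two row counts quoted here can be turned into response shares. The sketch below uses only those two numbers; the variable names are ours.

```python
total_rows = 33_795      # rows after merging the 2016 and 2017 surveys
answered_rows = 13_495   # rows with a JobRoleInterest response

answered_share = answered_rows / total_rows * 100
missing_share = 100 - answered_share

print(f"Answered: {answered_share:.1f}%")
print(f"Missing:  {missing_share:.1f}%")
```

This works out to roughly 40% answered and 60% missing, consistent with the "6 out of 10" figure.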
Of these observations, career interest leans heavily toward web development (including full stack, front end, and back end web development). Many observations also include multiple categories rather than just one. We can split the string in each row of the `JobRoleInterest` column. This will help us understand the number of choices that each person selected.
To split each occurrence of a job category in rows containing multiple categories, we'll use `pandas.Series.str.split`. This approach will help us count every individual job category.
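One compact way to do this split-and-count is sketched below on toy data. It uses `Series.explode`, which is a substitution on our part rather than the approach taken in this notebook, and the sample strings are made up.

```python
import pandas as pd

# Toy responses; the strings are illustrative, not taken from the survey.
responses = pd.Series([
    "Full-Stack Web Developer,   Data Scientist",
    "data scientist",
    "Game Developer, Full-Stack Web Developer ",
    None,
])

role_counts = (
    responses.dropna()
    .str.split(",")      # one list of roles per respondent
    .explode()           # one row per selected role
    .str.strip()         # remove stray whitespace
    .str.lower()         # treat differently capitalized answers alike
    .value_counts()
)

# Number of roles each respondent selected
choices_per_person = responses.dropna().str.split(",").str.len()

print(role_counts)
print(choices_per_person.tolist())
```

Normalizing whitespace and case before counting means "Data Scientist" and "data scientist" land in the same bucket.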
interests = combined_survey["JobRoleInterest"].value_counts(normalize=True) * 100
interests.head(20)
# Combination of all job interests
len(interests)
# New dataframe excluding any missing data from JobRoleInterest column
survey = combined_survey[combined_survey["JobRoleInterest"].notnull()].copy()
# Splits each occurrence of a job category
survey["JobRoleInterest"] = survey["JobRoleInterest"].str.split(",")
# Combined dataset (survey) missing values in percentage
(survey.apply(pd.isnull).sum()/survey.shape[0] * 100).sort_values(ascending = False)
# Fill missing data points with the median
survey["ExpectedEarning"] = survey["ExpectedEarning"].fillna(survey["ExpectedEarning"].median())
survey
# Counts each occurrence of a particular category
category_count = dict()
# For loop for counting each individual category in the JobRoleInterest column
for categories in survey["JobRoleInterest"]:
    for category in categories:
        if category in category_count:
            category_count[category] += 1 # counts category key if already present in dictionary
        else:
            category_count[category] = 1 # adds unique category key to dictionary if not already present
# Transforms dictionary to dataframe
category_count = pd.DataFrame.from_dict(category_count, orient="index", columns= ["Count"])
category_count = category_count.reset_index(level = 0)
category_count = category_count.rename(columns = {"index":"Interests"})
category_count["Interests"].unique()
There are many different "job interests" throughout the survey, and it's clear that respondents were able to write in their own response to the question. The biggest drawback of this approach is that we end up with many variations of the same career, different spellings and capitalization, and unknown responses.
pandas counts all of these as unique values, so it is more difficult to get a completely accurate count. One example is the different variations of "Front-End Developer". We also see some extra whitespace scattered throughout the values. To clean up the values in this dataframe, we'll strip any extra whitespace and change everything to lowercase.
# Strips whitespace, changes to lower case
category_count["Interests"] = category_count["Interests"].str.strip().str.lower()
# Group by interests and add up the number of occurrences
category_count.groupby("Interests").sum().sort_values(by = "Count", ascending= False).head(50)
# Career interest frequency
group_category = category_count.groupby("Interests").sum().sort_values(by = "Count", ascending= False).head(50)
# Plot results
fig, ax = plt.subplots(figsize = (10,8))
plt.barh(group_category.index[:15], group_category["Count"][:15], height = .6, color = "grey")
# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)
# Invert data, x ticks to top
plt.gca().invert_yaxis()
ax.xaxis.tick_top()
# Title
plt.title("Career Interests", size = 20, loc = "left", x = -0.28, y = 1.08)
# X label
plt.text(-1950, -1.6,"Frequency", size = 14, color = "grey")
plt.show()
After some data cleaning, the result isn't perfect, but we can tell that interests range from primarily web development to data science, game development, and many others.
While we have many mixed interests, this is a good way to show that individuals might be interested in topics other than web development. We also see that some individuals responded with different versions of "I don't know". While it would be possible to remove any rows with these answers, there are so few that it's unlikely to affect our analysis either way.
# Gender frequency (Freecodecamp)
genders = survey["Gender"].value_counts(normalize=True, dropna=False) * 100
# Plot results
fig, ax = plt.subplots(figsize = (12, 8))
genders.plot(kind = "bar", color = "grey", width = .58)
# Title
plt.title("Gender representation (FreeCodeCamp)", size = 19, loc = "left", x = -0.1, y = 1.02)
# Remove spines
plt.gca().spines[["top", "left", "right"]].set_visible(False)
# X and Y labels
plt.ylabel("Frequency (percent)", color = "grey", size = 14, loc = "top")
plt.xlabel("Gender", color = "grey", size = 14, loc = "left")
# X and Y ticks
plt.yticks(size = 12)
plt.xticks(rotation = 0, size = 12)
plt.show()
We'll introduce a similar survey conducted in 2018 by Stack Overflow, part of the Stack Exchange network (a popular forum for asking and answering software/programming related questions). We'll perform data cleaning on this dataset shortly, but first we can get an overview of its contents and how its demographics compare to FreeCodeCamp's.
# Gender frequency (Stack Exchange)
genders_stk_exchange = exchange["Gender"].value_counts(normalize=True, dropna=False) * 100
# Plot results
fig, ax = plt.subplots(figsize = (12, 8))
genders_stk_exchange[:3].plot(kind = "bar", color = "grey", width = .57)
# Title
plt.title("Gender representation (Stack Exchange)", size = 19, loc = "left", x = -0.1, y = 1.02)
# Remove spines
plt.gca().spines[["top", "left", "right"]].set_visible(False)
# X and Y labels
plt.ylabel("Frequency (percent)", color = "grey", size = 14, loc = "top")
plt.xlabel("Gender", color = "grey", size = 14, loc = "left")
# X and Y ticks
plt.yticks(size = 12)
plt.xticks(rotation = 0, size = 12)
plt.show()
# Age distribution plotted
fig, ax = plt.subplots(figsize = (12,8))
survey["Age"].hist(bins = 20, color = "grey")
# Title
plt.title("Age Groups (FreeCodeCamp)", size = 19, loc = "left", x = -0.1, y = 1.02)
# Remove gridlines
ax.grid(False)
# Remove spines
plt.gca().spines[["right","top"]].set_visible(False)
# X and Y labels
plt.ylabel("# of observations", color = "grey", size = 14, loc = "top")
plt.xlabel("Age", color = "grey", size = 14, loc = "left")
# X and Y ticks
plt.yticks(size = 12)
plt.xticks(size = 12)
# Text
plt.text(32.5,2700,"Most new programmers\nare in their early 20s to early 30s", size = 14, color = "maroon")
# Main demographic highlighted (interquartile range; the span covers the full axis height by default)
plt.axvspan(survey["Age"].quantile(0.25), survey["Age"].quantile(0.75), color = "maroon", alpha = 0.4)
plt.show()
# Stack exchange age groups
# Color assignment
colors = ["grey","grey", "maroon", "grey", "grey", "grey"]
# Plot results
fig, ax = plt.subplots(figsize = (12, 8))
ages = exchange["Age"].value_counts().iloc[[4,1,0,2,3,5]].plot.bar(width = 0.65, color = colors)
# Remove spines
plt.gca().spines[["top", "left", "right"]].set_visible(False)
# Title
plt.title("Age Groups (Stack Exchange)", size = 19, loc = "left",x = -0.1, y = 1.02)
# X and Y labels
plt.ylabel("# of observations", color = "grey", size = 14, loc = "top")
plt.xlabel("Age", color = "grey", size = 14, loc = "left")
# X and Y ticks
plt.yticks(size = 12, color = "grey")
plt.xticks(size = 11, rotation = 0, color = "grey")
# Most frequent age group highlighted
plt.gca().get_xticklabels()[2].set_color("maroon")
plt.show()
# Freecodecamp countries
# Country frequency (freecodecamp)
countries = survey["CountryLive"].value_counts(normalize=True) * 100
# Frequency table to dataframe with explicit column names
# (naming the index and values directly avoids relying on pandas' default labels, which differ between versions)
countries = countries.rename_axis("Country").reset_index(name="Percentage")
#------------------------------------------------------------------------------------------------#
# Stack Exchange Countries
# Country frequency (Stack Exchange)
countries_stack = exchange["Country"].value_counts(normalize=True) * 100
# Frequency table to dataframe with explicit column names
# (naming the index and values directly avoids relying on pandas' default labels, which differ between versions)
countries_stack = countries_stack.rename_axis("Country").reset_index(name="Percentage")
#---------------------------------------------------------------------------------------------------#
# Plot results (FreeCodeCamp)
# Color assignment
colors = ["maroon","maroon","maroon","maroon","grey","grey","grey","grey","grey","grey"]
fig, ax = plt.subplots(figsize = (10, 8))
plt.barh(countries["Country"][:10], countries["Percentage"][:10], color = colors, height= 0.65)
# Title
plt.title("Country Representation (FreeCodeCamp)", loc = "left", size = 18, x = -0.3, y = 1.08)
# Invert data, x ticks to top
plt.gca().invert_yaxis()
ax.xaxis.tick_top()
# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)
# Text
plt.text(-15.2, -1.4,"Frequency (in percent)", size = 14, color = "grey")
# X and Y ticks
plt.xticks(size = 13, color = "grey")
plt.yticks(size = 14, color = "grey")
# Top 4 countries highlighted
plt.gca().get_yticklabels()[0].set_color("maroon")
plt.gca().get_yticklabels()[1].set_color("maroon")
plt.gca().get_yticklabels()[2].set_color("maroon")
plt.gca().get_yticklabels()[3].set_color("maroon")
plt.show()
# Plot results (Stack Exchange)
# Color Assignment
colors = ["maroon","maroon","#D6A0A9","maroon","maroon","grey","grey","grey","grey","grey"]
fig, ax = plt.subplots(figsize = (10, 8))
plt.barh(countries_stack["Country"][:10], countries_stack["Percentage"][:10], color = colors, height= 0.6)
# Title
plt.title("Country Representation (Stack Exchange)", loc = "left", size = 18, x = -0.23, y = 1.09)
# Invert data, x ticks to top
plt.gca().invert_yaxis()
ax.xaxis.tick_top()
# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)
# Text
plt.text(-4.9, -1.4,"Frequency (in percent)", size = 14, color = "grey")
# X and Y ticks
plt.xticks(size = 13, color = "grey")
plt.yticks(size = 14, color = "grey")
# Highlight top 5 countries
plt.gca().get_yticklabels()[0].set_color("maroon")
plt.gca().get_yticklabels()[1].set_color("maroon")
plt.gca().get_yticklabels()[2].set_color("#D6A0A9") # Germany
plt.gca().get_yticklabels()[3].set_color("maroon")
plt.gca().get_yticklabels()[4].set_color("maroon")
plt.show()
# FreeCodeCamp
# School degree frequency (Freecodecamp)
code_camp_edu = survey["SchoolDegree"].value_counts(normalize=True) * 100
# Frequency table to dataframe with explicit column names
# (naming the index and values directly avoids relying on pandas' default labels, which differ between versions)
code_camp_edu = code_camp_edu.rename_axis("School Degree").reset_index(name="Percentage")
# Color assignment
colors = ["maroon","maroon","grey","grey","grey","grey","grey","grey","grey","grey"]
# Plot results
fig, ax = plt.subplots(figsize = (10, 8))
plt.barh(code_camp_edu["School Degree"][:10], code_camp_edu["Percentage"][:10], color = colors, height= 0.62)
# Title
plt.title("School Degree Representation (FreeCodeCamp)", loc = "left", size = 18, x = -0.52, y = 1.1)
# Y label
plt.ylabel("School Degree", loc = "top", size = 14, color = "grey")
# Invert data, x ticks to top
plt.gca().invert_yaxis()
ax.xaxis.tick_top()
# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)
# Text
plt.text(-12, -1.6,"Frequency (in percent)", size = 14, color = "grey")
# X and Y ticks
plt.xticks(size = 13, color = "grey")
plt.yticks(size = 14, color = "grey")
# Highlight top 2 degrees
plt.gca().get_yticklabels()[0].set_color("maroon")
plt.gca().get_yticklabels()[1].set_color("maroon")
plt.show()
# Stack Exchange
# Replace string values
exchange["FormalEducation"] = exchange["FormalEducation"].replace({"Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)":"High School"})
# School degree frequency (Stack Exchange)
stk_exchange_edu = exchange["FormalEducation"].value_counts(normalize=True) * 100
# Frequency table to dataframe with explicit column names
# (naming the index and values directly avoids relying on pandas' default labels, which differ between versions)
stk_exchange_edu = stk_exchange_edu.rename_axis("School Degree").reset_index(name="Percentage")
# Color assignment
colors = ["maroon","maroon","grey","grey","grey","grey","grey","grey","grey","grey"]
# Plot results
fig, ax = plt.subplots(figsize = (10, 8))
plt.barh(stk_exchange_edu["School Degree"], stk_exchange_edu["Percentage"], color = colors, height= 0.62)
# Title
plt.title("School Degree Representation (Stack Exchange)", loc = "left", size = 18, x = -0.72, y = 1.1)
# Y label
plt.ylabel("School Degree", loc = "top", size = 14, color = "grey")
# Invert data, x ticks to top
plt.gca().invert_yaxis()
ax.xaxis.tick_top()
# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)
# Text
plt.text(-12, -1.5,"Frequency (in percent)", size = 14, color = "grey")
# X and Y ticks
plt.xticks(size = 13, color = "grey")
plt.yticks(size = 14, color = "grey")
# Highlight top 2 degrees
plt.gca().get_yticklabels()[0].set_color("maroon")
plt.gca().get_yticklabels()[1].set_color("maroon")
plt.show()
Thus far we have done the following:
- Removed observations with no `JobRoleInterest` input
- Split and counted the `JobRoleInterest` categories
Both datasets share a similar distribution of age and gender. Men make up the majority of new-programmer respondents (70%, with women at 20%).
The Stack Exchange survey consists primarily of respondents in STEM careers, and the gender gap is even more pronounced. Men represent nearly 60% of respondents, `nan`s (unknown, missing data) roughly 35%, and women only around 5%.
Age distribution is roughly the same too. New programmers are most likely to be in their early 20s to early 30s, and stack exchange survey participants are usually 25 to 34 years old.
Country representation is about the same in both surveys. The largest share of survey participants is from the United States, followed by India in both cases. The countries with the highest participation are English-speaking countries (except for Germany in Stack Exchange).
Bachelor's degrees are the most common degree held by respondents from both surveys.
We've seen a high level overview of the data. To provide customers with the most relevant training possible, we need to discover why people decide to learn a new skill like programming.
We'll provide several charts and data that we believe support the idea that new programmers are motivated by income and career opportunities. While only 40% of respondents answered the `JobRoleInterest` question, 13,495 observations is more than enough to get a representative sample. There are many different career paths utilizing programming and tech skills that respondents are interested in.
Participants were asked the following questions regarding employment opportunities:
"Imagine that you are assessing a potential job opportunity. Please rank the following aspects of the job opportunity in order of importance, where 1 is the most important and 10 is the least important."
"Now, imagine you are assessing a job's benefits package. Please rank the following aspects of a job's benefits package from most to least important to you, where 1 is most important and 11 is least important."
When we average the rankings for each job aspect and benefit, the most important items should have the lowest average score (since 1 is most important, and 10 or 11 is least important). Before this calculation, we'll perform a bit of data cleaning on the Stack Exchange dataset.
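To make the "lower average means more important" reading concrete, here is a minimal sketch on toy rankings. The column names and values are placeholders, not survey results.

```python
import numpy as np
import pandas as pd

# Toy rankings: each row is one respondent, 1 = most important.
# Column names and values are illustrative placeholders.
ranks = pd.DataFrame({
    "Compensation": [1, 2, 1, np.nan],
    "Company_culture": [2, 1, 3, 2],
    "Company_diversity": [3, 3, 2, 3],
})

# mean() skips missing values by default; sorting ascending puts the
# most important item (lowest average rank) first.
average_rank = ranks.mean().sort_values()

print(average_rank)
```

Here the first entry of `average_rank` is the item respondents ranked as most important on average.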
# Rename current job related columns from stack exchange dataset
# Currency related columns
currency = exchange.columns[51:56].tolist()
# Columns up to index 38
columns = exchange.columns[:38].tolist()
# Age and gender columns
columns.extend(["Gender", "Age"])
# Add currency related columns to list
columns.extend(currency)
# Isolates dataframe down to columns from list "columns"
stk_exchange = exchange[columns].copy()
# Rename job aspects and job benefits columns for easier comprehension
rename_cols = {
    "AssessJob1":"Industry_working_in",
    "AssessJob2":"Company_funding",
    "AssessJob3":"Department_working_in",
    "AssessJob4":"Technologies/Frameworks",
    "AssessJob5":"Compensation_and_benefits",
    "AssessJob6":"Company_culture",
    "AssessJob7":"WFH",
    "AssessJob8":"Professional_development",
    "AssessJob9":"Company_diversity",
    "AssessJob10":"Product_impact",
    "AssessBenefits1":"Compensation",
    "AssessBenefits2":"Stock_options",
    "AssessBenefits3":"Health_insurance",
    "AssessBenefits4":"Parental_leave",
    "AssessBenefits5":"Fitness_wellness_benefit",
    "AssessBenefits6":"Retirement",
    "AssessBenefits7":"Meals/snacks",
    "AssessBenefits8":"Computer/office_equipment",
    "AssessBenefits9":"Childcare_benefit",
    "AssessBenefits10":"Transportation_benefit",
    "AssessBenefits11":"Conference/education_budget"
}
exchange = exchange.rename(columns=rename_cols)
# Isolate rows only containing following countries listed below
stk_countries = stk_exchange[stk_exchange["Country"].str.contains("United States|India|United Kingdom|Canada", na = False)]
len(stk_countries["Country"])
exchange
These benefits and aspects were rated by current employees working in STEM fields, so we have to be careful not to assume these ratings apply directly to the new programmers who participated in FreeCodeCamp's survey (as many of those respondents do not work in software/tech jobs).
However, if FreeCodeCamp had asked the same questions, it's plausible that we would see similar results. Therefore, if we use the Stack Exchange survey as a proxy, compensation and health insurance are the most important benefits to job applicants or those interested in switching jobs. Some of the least important benefits include childcare, parental leave, and a fitness/wellness benefit.
Job aspects describe how job candidates view a potential job opportunity, and the particular make-up of an organization. Respondents rated pay and benefits (which for some reason is listed as a benefit and an aspect), the technologies or programs used, career mobility, and the company culture higher than other aspects.
# Slice dataset to contain only job aspect columns
job_assessment = exchange.iloc[:,17:27]
# Constructs new dataframe of column averages
assessments = pd.Series.to_frame(job_assessment.mean(axis=0).sort_values(ascending=False)) # Calculate averages along each column
# Assign index name
assessments.index.name = "Aspects"
# Reset index
assessments.reset_index()
#---------------------------------------------------------------------------------------------------------------------------------#
# Slice dataset to contain only job benefit columns
benefits = exchange.iloc[:,27:38]
# Constructs new dataframe of column averages
job_benefits = pd.Series.to_frame(benefits.mean(axis=0).sort_values(ascending=False)) # Calculate averages along each column
# Assign index name
job_benefits.index.name = "Benefits"
# Reset index
job_benefits.reset_index()
#----------------------------------------------------------------------------------------------------------------------------------#
# Plot results
# If looking for a new job, rate importance of job benefits from 1 (most important) to 11 (least important)
# Color assignment
colors = ["grey","grey","grey","grey","grey","grey","grey","grey","grey","maroon","maroon"]
fig, ax = plt.subplots(figsize = (8, 6))
plt.barh(job_benefits.index, job_benefits[0], color = colors, height= 0.62)
# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)
# X axis top
ax.xaxis.tick_top()
# Title
plt.title("Job benefits", size = 19, loc = "left", x= -0.35, y = 1.16)
# Text
plt.text(-3.2,12,"Rating (1 most important, 11 least important), average", color = "grey", size = 14)
# X and Y ticks
plt.yticks(size = 14, color = "grey")
plt.xticks(size = 13, color = "grey")
# Highlight top 2 benefits
plt.gca().get_yticklabels()[-1].set_color("maroon")
plt.gca().get_yticklabels()[-2].set_color("maroon")
plt.show()
# Plot results
# If looking for a new job, rate importance of job aspects from 1(most important) to 10(least important)
# Color assignment
colors = ["grey","grey","grey","grey","grey","grey","maroon","maroon","maroon","maroon"]
fig, ax = plt.subplots(figsize = (8, 6))
plt.barh(assessments.index, assessments[0], color = colors, height= 0.6)
# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)
# X axis top
ax.xaxis.tick_top()
# Title
plt.title("Job aspects", size = 19, loc = "left", x= -0.4, y = 1.16)
# Text
plt.text(-3.2,11,"Rating (1 most important, 10 least important), average", color = "grey", size = 14)
# X and Y ticks
plt.yticks(size = 14, color = "grey")
plt.xticks(size = 13, color = "grey")
# Highlight top 4 job aspects
plt.gca().get_yticklabels()[-1].set_color("maroon")
plt.gca().get_yticklabels()[-2].set_color("maroon")
plt.gca().get_yticklabels()[-3].set_color("maroon")
plt.gca().get_yticklabels()[-4].set_color("maroon")
plt.show()
`Income`: Respondents were asked their current yearly income.
`ExpectedEarning`: "About how much money do you expect to earn per year at your first developer job, in US dollars?"
`Has Debt`: The question asked was "Do you have any debt?"
In a high-level overview, we'll see that the median and average salary of new programmers are both less than \$50,000 (US). We'll also see that new programmers expect to earn about \$15,000 to \$20,000 more in their new tech/software careers than they currently earn.
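Because income data tends to be skewed by a few high earners, the mean and median can sit far apart. The sketch below shows this on made-up numbers, along with one way to summarize the gap between expected and current earnings; none of these values come from the survey.

```python
import pandas as pd

# Toy yearly incomes in US dollars; values are made up for illustration.
toy = pd.DataFrame({
    "Income": [20000, 30000, 35000, 40000, 200000],
    "ExpectedEarning": [40000, 50000, 50000, 60000, 150000],
})

mean_income = toy["Income"].mean()      # pulled upward by the single high earner
median_income = toy["Income"].median()  # middle value, robust to that outlier

# Typical gap between expected and current earnings
median_gap = (toy["ExpectedEarning"] - toy["Income"]).median()

print(mean_income, median_income, median_gap)
```

In this toy frame the mean is nearly double the median, which is why both are worth reporting side by side.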
# Income distribution
# Plot results
fig, ax = plt.subplots(figsize = (14,10))
survey["Income"].plot.hist(bins = 120, color = "grey", xlim = (0,250000))
# Remove spines
plt.gca().spines[["right","top"]].set_visible(False)
# Title
plt.title("Income distribution of survey respondents\n(All countries)",loc= "left", size = 18, y = 1.02)
# Average and median income
plt.axvline(survey["Income"].mean(), color = "red", alpha = 0.5, linewidth = 3)
plt.axvline(survey["Income"].median(), color = "blue", alpha = 0.5, linewidth = 3)
# Misc. Text
plt.text(41000, 850, " Average \n Income", size = 15, color = "red")
plt.text(14000, 850, " Median \n Income", size = 15, color = "blue")
# X and Y labels
plt.ylabel("Frequency",size = 15, loc = "top", color ="grey")
plt.xlabel("Income, Yearly (US dollars)", size = 15, loc = "left", color ="grey")
# X and Y ticks
plt.yticks(size = 14)
plt.xticks(size = 13)
plt.show()
# Difference between current income and expected income
fig, ax = plt.subplots(figsize = (13,10))
# Freecodecamp survey expected earning distribution
survey["ExpectedEarning"].plot.kde(xlim = (0, 200000), color = "#ED7E00", linewidth = 3)
# Freecodecamp survey current income distribution
survey["Income"].plot.kde(color = "#4B86C1", linewidth = 3)
# Title
plt.title("Current earnings vs. Expected earnings\n(All countries)", size = 18, loc = "left", y = 1.05)
# X and Y ticks
plt.xticks(size = 14)
plt.yticks(size = 14)
# X and Y labels
plt.ylabel("Density (Probability)", size = 14, color = "grey", loc = "top")
plt.xlabel("Income, Yearly (US dollars)", size = 14, color = "grey", loc = "left")
# Remove spines
plt.gca().spines[["right", "top"]].set_visible(False)
# Misc. text
plt.text(x = 0.01, y = 0.84, s="Income: Freecodecamp", color = "#4B86C1", size = 13, transform=ax.transAxes)
plt.text(x = 0.29, y = .90, s="Desired Income", color = "#ED7E00", size = 13, transform=ax.transAxes)
plt.text(0.55,0.85,"""Typically, survey participants expect to earn\n\$15,000 to \$20,000 more in their new career,
compared to their current income""", color = "grey", size = 14, transform=ax.transAxes)
# X and Y ticks
plt.yticks(size = 13)
plt.xticks(size = 13)
plt.show()
We can find each person's desired salary increase (relative to their current income, as a percentage) by utilizing the following formula:
Increase = New Number - Original Number
% increase = Increase / Original Number x 100
Both columns had missing data (the gaps in `ExpectedEarning` were filled with the median earlier), so we expect to see some negative percentages in the new column that we create. We won't drop those rows; instead we'll ignore any percentages below 0.
We'll notice that most often, respondents desire a salary increase in the range of 0% to 120%.
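The formula can be sketched on toy data as follows. Treating a zero income as missing (to avoid dividing by zero) is an assumption added for this illustration, and the numbers are made up.

```python
import numpy as np
import pandas as pd

# Toy incomes in US dollars; values are illustrative only.
toy = pd.DataFrame({
    "Income": [40000, 50000, 0, np.nan],
    "ExpectedEarning": [60000, 45000, 50000, 55000],
})

# Assumption for this sketch: a zero income would divide by zero,
# so treat it as missing first.
income = toy["Income"].replace(0, np.nan)

# % increase = (new - original) / original * 100
toy["Percent_Increase"] = (toy["ExpectedEarning"] - income) / income * 100

# Keep only non-negative increases; NaN rows drop out of the comparison.
non_negative = toy.loc[toy["Percent_Increase"] >= 0, "Percent_Increase"]

print(toy)
print(non_negative)
```

The first respondent expects a 50% increase, the second a 10% decrease, and the last two rows have no usable current income, so only the first survives the filter.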
# Column creation using formula above
survey["Percent_Increase"] = (survey["ExpectedEarning"] - survey["Income"]) / survey["Income"] * 100
# Frequency distribution
survey["Percent_Increase"].value_counts(bins = 20, normalize= True) * 100
fig, ax = plt.subplots(figsize = (13,9))
# Expected salary increase (in a percentage) histogram
survey[survey["Percent_Increase"] <= 500]["Percent_Increase"].plot.hist(bins = 15, color = "grey")
# Boolean mask above keeps values less than or equal to 500%
# Lower and upper quartile (25% to 75%) range
plt.axvspan(survey["Percent_Increase"].quantile(0.25), survey["Percent_Increase"].quantile(0.75), color = "maroon", alpha = 0.4)
# Remove spines
plt.gca().spines[["right","top"]].set_visible(False)
# Title
plt.title("Desired salary increase (in percent)", loc="left", size = 20, y = 1.05)
# X and Y labels
plt.ylabel("Frequency", size = 15, color = "grey", loc = "top")
plt.xlabel("Percent Increase", size = 15, loc = "left", color = "grey")
# X and Y ticks
plt.xticks(size = 13)
plt.yticks(size = 13)
# Text
plt.text(126,1000,"Typical range of expected salary increase (in percent)", size = 14, color = "maroon")
plt.show()
Most respondents do not have financial dependents to care for, and fewer than half are free of debt.
# Replaces following columns with True/False values
survey["HasDebt"] = survey["HasDebt"].replace({1.0:"True", 0.0: "False"})
survey["HasFinancialDependents"] = survey["HasFinancialDependents"].replace({1.0:"True", 0.0: "False"})
# Financial dependents
print("Financial Dependents:","\n", survey["HasFinancialDependents"].value_counts(normalize = True, dropna=False) * 100)
print("\n")
# Has debt of any kind
print("Has Debt:", "\n", survey["HasDebt"].value_counts(normalize = True, dropna=False) * 100)
print("\n")
EmploymentStatus
: "Regarding employment status, are you currently..."
Respondents were asked to select their current employment status. Examples include not working, employed for wages, self-employed, and military.
About half of respondents answered that they are actively working in some manner for their income. A smaller percentage neglected to answer, and the remaining participants are either not working but looking for work, not working and not looking for work, or stay-at-home parents.
"Employed for wages" is the most common employment status, but this group has the lowest median hours spent learning per week (10 hours). The employment group "Not working but looking for work" has the highest median (20 hours). Typically, respondents spend about 12 hours per week (median), or 1.7 hours per day, learning programming. We did not calculate the weekly average because the data contains many outliers, in the range of 30 to 175 hours per week, that significantly skew the distribution.
# Fills in missing data from hours learning column
survey["HoursLearning"] = survey["HoursLearning"].fillna(survey["HoursLearning"].median())
# Hours spent learning distribution
fig, ax = plt.subplots(figsize = (12, 6))
sns.boxplot(x= "HoursLearning", data = survey, color = "grey", medianprops=dict(color="maroon", alpha=1))
# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)
# Title
plt.title("Hours spent learning per week", loc="left", size = 20, y = 1.05)
# Misc. Text
plt.text(80,-0.04,"Outliers", size = 14)
plt.text(0, -0.42, "Median (12 hours/week)", size = 14, color = "maroon")
plt.text(100, -0.3, "75% of participants spent\n20 hours or less per week learning", size = 14, color = "grey")
plt.text(100, -0.15, "25% of participants spent\n6 hours or less per week learning", size = 14, color = "grey")
# X label
plt.xlabel("Hours", size = 15, loc = "left", color = "grey")
# X ticks
plt.xticks(size = 13, color = "grey")
plt.show()
# Print stats
print(survey["HoursLearning"].describe())
# Frequency table employment status
survey["EmploymentStatus"].value_counts(dropna=False)
# Hours spent per week by employment status
survey.groupby("EmploymentStatus")["HoursLearning"].median().sort_values(ascending=False)
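To illustrate why this section reports the median rather than the average for hours spent learning, the sketch below compares the two on a small made-up sample that contains a single extreme value. The numbers are hypothetical and only meant to show the effect of an outlier.

```python
import pandas as pd

# Hypothetical weekly learning hours; 175 plays the role of an outlier
hours = pd.Series([5, 10, 12, 15, 20, 175])

mean_hours = hours.mean()      # pulled upward by the outlier
median_hours = hours.median()  # barely affected by the outlier

print(f"mean: {mean_hours:.1f}, median: {median_hours:.1f}")
```

In this toy sample the single large value drags the mean well above what most of the values look like, while the median stays close to the bulk of the data.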
MonthsProgramming
: "About how many months have you been programming for?" ("Programming experience")
There is some evidence to suggest that a respondent's career field has little influence on their motivation to learn programming.
Farming/fishing/forestry and education (careers we would not typically associate with programming or software development) have the highest average number of months programming. The IT/software development field has the third-highest average. Presumably, respondents in IT/software development were spending time outside of work learning, or had just been hired.
Farming/fishing/forestry and education are among the lowest-paid career fields in this survey, and on average respondents in these fields also expected a lower income than those in other career fields. Instead, we see respondents in higher-paying careers with less "programming experience" expecting a higher income after switching to tech/software-related jobs.
A better argument may be that education level has more influence on a person's decision to begin learning a skill like programming for more career opportunities.
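One way to line up the three measures discussed above is a single table of averages per career field. The sketch below uses made-up rows and assumes the column names used in this notebook (EmploymentField, MonthsProgramming, Income, ExpectedEarning); it is not the survey data and not necessarily how the figures below were produced.

```python
import pandas as pd

# Hypothetical rows, not taken from the survey
toy = pd.DataFrame({
    "EmploymentField": ["education", "education", "finance", "finance"],
    "MonthsProgramming": [30, 20, 6, 10],
    "Income": [30000, 34000, 70000, 80000],
    "ExpectedEarning": [45000, 55000, 90000, 100000],
})

# Average each measure per career field, selecting the numeric columns explicitly
summary = (
    toy.groupby("EmploymentField")[["MonthsProgramming", "Income", "ExpectedEarning"]]
    .mean()
    .sort_values(by="MonthsProgramming", ascending=False)
)

print(summary)
```

Selecting the columns before calling `mean()` keeps text columns out of the aggregation and puts experience, current income, and expected income side by side for each field.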
# Salary and experience comparison for employment fields
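# Assign groupby objects for plotting using EmploymentField
# numeric_only=True averages only numeric columns; recent pandas versions raise an error on text columns otherwise
empfld_months_prg = survey.groupby("EmploymentField").mean(numeric_only=True).sort_values(by="MonthsProgramming") # sort by the average number of months programming
empfld_income = survey.groupby("EmploymentField").mean(numeric_only=True).sort_values(by="Income") # sort by the average income
empfld_expected_salary = survey.groupby("EmploymentField").mean(numeric_only=True).sort_values(by="ExpectedEarning") # sort by the average expected earning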
#-------------------------------------------------------------------------------------------------------------------------------------#
# Color assignment
colors = ["grey", "grey", "grey", "grey", "grey","grey", "grey", "grey", "grey", "grey","grey", "grey", "maroon", "maroon", "maroon",]
# Plot results experience
fig, ax = plt.subplots(figsize = (8, 6))
plt.barh(empfld_months_prg.index, empfld_months_prg["MonthsProgramming"], color = colors, height = 0.6)
# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)
# Y label
plt.ylabel("Career Field", loc = "top", size = 14, color = "grey")
# Text
plt.text(-8,16.3,"Average number of months", size = 14, color = "grey")
# Title
plt.title("New programmer experience by career field", size = 20, loc = "left", x = -0.65, y = 1.12)
# X axis to top
ax.xaxis.tick_top()
# X and Y ticks
plt.xticks(size = 13, color = "grey")
plt.yticks(size = 14)
# Highlight top 3 career fields most experience
plt.gca().get_yticklabels()[-1].set_color("maroon")
plt.gca().get_yticklabels()[-2].set_color("maroon")
plt.gca().get_yticklabels()[-3].set_color("maroon")
plt.show()
#---------------------------------------------------------------------------------------------------------------------------------------#
# Salary
# Color assignment
colors = ["maroon", "grey", "grey", "grey", "grey","maroon", "grey", "maroon", "grey", "grey","grey", "grey", "grey", "grey", "grey",]
# Plot results income
fig, ax = plt.subplots(figsize = (8, 6))
plt.barh(empfld_income.index, empfld_income["Income"], color = colors, height = 0.6)
# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)
# Y label
plt.ylabel("Career Field", loc = "top", size = 14, color = "grey")
# Text
plt.text(-25000,16.4,"Average Salary (US dollars)", size = 14, color = "grey")
plt.text(30000,-0.5, "Career fields shaded in red\nhave the highest average number of months\nspent learning programming", color = "grey")
# Title
plt.title("Salary by career field", size = 20, loc = "left", x = -0.65, y = 1.12)
# X axis to top
ax.xaxis.tick_top()
# X and Y ticks
plt.xticks(size = 13, color = "grey")
plt.yticks(size = 14)
# Highlight top 3 career fields most experience
plt.gca().get_yticklabels()[0].set_color("maroon")
plt.gca().get_yticklabels()[5].set_color("maroon")
plt.gca().get_yticklabels()[-8].set_color("maroon")
plt.show()
#--------------------------------------------------------------------------------------------------------------------------------------------#
# Color assignment
colors_salary = ["maroon", "grey", "grey", "grey", "grey","grey", "grey", "maroon", "grey", "grey","maroon", "grey", "grey", "grey", "grey"]
# Plot results expected earning
fig, ax = plt.subplots(figsize = (8, 6))
plt.barh(empfld_expected_salary.index, empfld_expected_salary["ExpectedEarning"], color = colors_salary, height = 0.6)
# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)
# Y label
plt.ylabel("Career Field", loc = "top", size = 14, color = "grey")
# Text
plt.text(-20000, 16.3,"Average (US dollars)", size = 14, color = "grey")
# Title
plt.title("Expected annual salary by career field", size = 20, loc = "left", x = -0.65, y = 1.12)
# X axis to top
ax.xaxis.tick_top()
# X and Y ticks
plt.xticks(size = 13, color = "grey")
plt.yticks(size = 14)
# Highlight top 3 career fields most experience
plt.gca().get_yticklabels()[0].set_color("maroon")
plt.gca().get_yticklabels()[-5].set_color("maroon")
plt.gca().get_yticklabels()[7].set_color("maroon")
plt.show()
Earlier we noticed that a person's career field may have a weaker influence on their motivation to start learning programming. Instead, a person's education level may be a more significant factor. The data suggests that individuals with less than a bachelor's degree have some of the greatest amounts of "programming experience".
These three fields are:
They have the highest median and average expected salary increase (in percent), and in terms of yearly income, respondents in these groups are some of the lowest earning. However, we have to note that in comparison to the average expected earning (in US dollar amounts), degree holders with Ph.D.s, professional degrees, and bachelor's generally expect a higher salary, with the exception of associate's degree holders.
Neither career type nor education level is a perfect indicator of whether someone may be more motivated or interested in learning new programming/tech skills. We think it's reasonable to argue that the data suggests survey participants are generally interested in programming for the career and income opportunities.
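Because this section compares both the median and the average expected salary increase by education level, it can help to see the two statistics side by side. The sketch below is a minimal illustration on made-up values, assuming the SchoolDegree and Percent_Increase column names from this notebook; the degree labels and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical rows, not taken from the survey
toy = pd.DataFrame({
    "SchoolDegree": ["high school", "high school", "high school",
                     "bachelor's", "bachelor's", "bachelor's"],
    "Percent_Increase": [50, 100, 450, 20, 40, 60],
})

# Median and mean expected increase per education level, highest median first
stats = (
    toy.groupby("SchoolDegree")["Percent_Increase"]
    .agg(["median", "mean"])
    .sort_values(by="median", ascending=False)
)

print(stats)
```

In this toy data one large value lifts the "high school" mean far above its median, which is the kind of gap that makes it worth reporting both numbers.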
# Average expected earning by school degree
round(survey.groupby("SchoolDegree")["ExpectedEarning"].mean().sort_values(ascending=False), 2)
# Median expected salary increase (percent)
salary_increase_median = round(survey.groupby("SchoolDegree")["Percent_Increase"].median().sort_values(ascending=False),2)
salary_increase_median = pd.Series.to_frame(salary_increase_median).reset_index()
salary_increase_median = salary_increase_median.rename(columns={"index":"SchoolDegree","Percent_Increase":"Percentage"})
# Average expected salary increase (percent)
salary_increase = round(survey.groupby("SchoolDegree")["Percent_Increase"].mean().sort_values(ascending=False),2)
salary_increase = pd.Series.to_frame(salary_increase).reset_index()
salary_increase = salary_increase.rename(columns={"index":"SchoolDegree","Percent_Increase":"Percentage"})
#------------------------------------------------------------------------------------------------------------------------#
# Color assignment
colors = ["#145DDE", "#145DDE","#145DDE", "grey", "grey", "grey","grey", "grey", "grey", "grey"]
# Plot (1) results average and median salary increase (percent)
fig, ax = plt.subplots(figsize = (8, 6))
# Title
plt.title("Expected salary raise by education level", size = 19, loc = "left", x= -0.65, y = 1.16)
# Y label
plt.ylabel("School Degree", loc = "top", size = 14, color = "grey")
# Misc. text
plt.text(-4,-2,"Median", color = "#4B86C1", size = 14)
plt.text(35,-2,"Average", color = "grey", size = 14)
plt.text(80,-2,"(Percent)", size = 14)
plt.text(160, 2.5,"Education levels below bachelor's\ndegree have the highest average\nand median expected salary increase", color = "grey")
# Average plotted
plt.barh(salary_increase["SchoolDegree"], salary_increase["Percentage"], color = colors, height = 0.62)
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)
ax.xaxis.tick_top()
# Median plotted
plt.barh(salary_increase_median["SchoolDegree"], salary_increase_median["Percentage"], color = "#4B86C1", height = 0.62)
plt.yticks(size = 14, color = "grey")
plt.xticks(size = 13, color = "grey")
plt.gca().invert_yaxis()
# Top 3 education levels by expected salary raise highlighted
plt.gca().get_yticklabels()[0].set_color("#145DDE")
plt.gca().get_yticklabels()[1].set_color("#145DDE")
plt.gca().get_yticklabels()[2].set_color("#145DDE")
plt.show()
#-----------------------------------------------------------------------------------------------------------------------------------------#
# Assign groupby objects for plotting using SchoolDegree
# numeric_only=True averages only numeric columns; recent pandas versions raise an error on text columns otherwise
schl_dgree = survey.groupby("SchoolDegree").mean(numeric_only=True).sort_values(by = "MonthsProgramming") # sort by the average number of months programming
degree_income = survey.groupby("SchoolDegree").mean(numeric_only=True).sort_values(by = "Income") # sort by the average income
# Plot results income by school education level
colors_degree_income = ["#145DDE", "grey","#145DDE", "grey", "grey", "#145DDE","grey", "grey", "grey", "grey"]
# Plot (2) school degree income
fig, ax = plt.subplots(figsize = (8, 6))
plt.barh(degree_income.index, degree_income["Income"], color = colors_degree_income, height = 0.62)
# Title
plt.title("Salary by education", size = 20, loc = "left", x = -0.7, y = 1.12)
# Text
plt.text(-23000,10.7,"Average Salary (US dollars)", size = 14, color = "grey")
# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)
# Y label
plt.ylabel("School Degree", loc = "top", size = 14, color = "grey")
# X axis to top
ax.xaxis.tick_top()
# X and Y ticks
plt.xticks(size = 13, color = "grey")
plt.yticks(size = 14)
# Top 3 education levels by expected salary raise highlighted
plt.gca().get_yticklabels()[0].set_color("#145DDE")
plt.gca().get_yticklabels()[2].set_color("#145DDE")
plt.gca().get_yticklabels()[5].set_color("#145DDE")
plt.show()
#-----------------------------------------------------------------------------------------------------------------------------------------#
# Plot results number of months programming by education level
colors_schl_dgree = ["grey", "grey","grey", "grey", "grey", "#145DDE","#145DDE", "grey", "#145DDE", "grey"]
# Plot (3) school degree number of months programming
fig, ax = plt.subplots(figsize = (8, 6))
plt.barh(schl_dgree.index, schl_dgree["MonthsProgramming"], color = colors_schl_dgree, height = 0.62)
# Title
plt.title("New programmer experience by education", size = 20, loc = "left", x = -0.7, y = 1.12)
# Text
plt.text(-11,10.7,"Average number of months", size = 14, color = "grey")
# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)
# Y label
plt.ylabel("School Degree", loc = "top", size = 14, color = "grey")
# X axis to top
ax.xaxis.tick_top()
# X and Y ticks
plt.xticks(size = 13, color = "grey")
plt.yticks(size = 14)
# Top 3 education levels by expected salary raise highlighted
plt.gca().get_yticklabels()[-2].set_color("#145DDE")
plt.gca().get_yticklabels()[-4].set_color("#145DDE")
plt.gca().get_yticklabels()[-5].set_color("#145DDE")
plt.show()