Best Markets for Advertising

Suppose we are data analysts for an online e-learning company that specializes in programming courses. We cover domains such as data science and game development, but our primary focus is web and mobile development. Our goal is to promote our products by investing more in advertising, but to do that we need to know which markets to advertise in. We utilized three surveys related to programming and web development from FreeCodeCamp and Stack Overflow.

These surveys were conducted online in 2016, 2017, and 2018 with participants worldwide. FreeCodeCamp's surveys targeted new programmers and asked many questions related to career interest, income expectations, age, gender, home country, time spent programming, and so on. Stack Overflow's 2018 survey was aimed primarily at individuals already in the developer community and covered topics from favorite technologies to job preferences.

We discovered that new programmers are interested in a wide variety of career fields, including web development, data science, data engineering, game development, QA engineering, machine learning, and many others. We found that the likely motivator for their programming journey was to advance their income and career opportunities. With this knowledge, we need to ensure that our courses stay up to date, relevant, and beneficial for our customers.

Most importantly, after exploring the surveys we found that the two best countries for our advertising investment are the United States and India. Both countries had the highest numbers of survey participants, which suggests that new programmers are most numerous in these two countries. In addition, the US has the highest average monthly spending on programming education. India's average is lower, but it is still roughly equal to our monthly subscription price ($59 per month).

In short, the two best markets for advertising are the United States and India, and we recommend that the marketing team focus its efforts on these two countries.

We want to answer questions about a population of new coders that are interested in the subjects we teach. We'd like to know:

Where are these new coders located?
Which locations have the greatest number of new coders?
How much money are new coders willing to spend on learning?
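
As a rough sketch of how these three questions map onto the survey columns, the snippet below uses a tiny hand-made DataFrame (not the real survey data) with the `CountryLive`, `MoneyForLearning`, and `MonthsProgramming` columns that appear in the 2017 dataset. Dividing total spend by months of programming to get a monthly figure, and the `MoneyPerMonth` column name, are assumptions made here for illustration.

```python
import pandas as pd

# Toy stand-in for the survey; the values are invented for illustration only.
toy = pd.DataFrame({
    "CountryLive": ["United States of America", "India",
                    "United States of America", "Canada"],
    "MoneyForLearning": [600.0, 120.0, 0.0, 300.0],
    "MonthsProgramming": [6.0, 0.0, 12.0, 3.0],
})

# Questions 1 and 2: where respondents live, and which locations are most common.
country_counts = toy["CountryLive"].value_counts()

# Question 3: approximate monthly spend. Treat 0 months as 1 month to avoid
# dividing by zero (an assumption, not something the survey prescribes).
months = toy["MonthsProgramming"].replace(0, 1)
toy["MoneyPerMonth"] = toy["MoneyForLearning"] / months
spend_by_country = toy.groupby("CountryLive")["MoneyPerMonth"].mean()

print(country_counts)
print(spend_by_country)
```

A mean is sensitive to a few very large spenders, so a median or an outlier check may be worth comparing against on the real data.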

FreeCodeCamp Survey: https://www.freecodecamp.org/news/we-asked-20-000-people-who-they-are-and-how-theyre-learning-to-code-fff5d668969

GitHub repository (Survey Year 2017): https://github.com/freeCodeCamp/2017-new-coder-survey/tree/master/clean-data

GitHub repository (Survey Year 2016): https://github.com/freeCodeCamp/2016-new-coder-survey#about-the-data

Stack Overflow Survey: https://www.kaggle.com/datasets/stackoverflow/stack-overflow-2018-developer-survey

Some limitations for analyzing survey data:

For some questions, participants were free to write in their own responses. Differences in spelling, grammar, punctuation, and word usage make it difficult to map every response to a consistent set of unique values, although we have done our best to clean some of these columns.
Almost all columns have missing data, because participants could leave blank any question they did not want to answer. This makes it impossible to get a completely accurate analysis of all the data.
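
To illustrate the first limitation, here is a minimal sketch of normalizing write-in answers so that near-duplicates collapse into one value. The sample responses and the `normalize` helper are hypothetical; the cleaning applied to the real columns may differ.

```python
import pandas as pd

# Invented write-in answers; real survey responses vary far more than this.
responses = pd.Series(
    ["Web Developer", " web developer", "WEB DEVELOPER.", "Data Scientist", None]
)

def normalize(text_series: pd.Series) -> pd.Series:
    """Trim whitespace, lowercase, and strip trailing punctuation."""
    return (
        text_series.str.strip()
        .str.lower()
        .str.rstrip(".,;!")
    )

cleaned = normalize(responses)

# Missing answers stay missing; the three spellings collapse into one value.
print(cleaned.value_counts(dropna=False))
```

Rules like these only catch formatting differences. Misspellings and synonyms would still need a manual mapping.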

Method

Load datasets
Clean dataframes, including standardizing any columns if needed
Concatenate/merge datasets
Correct any remaining inconsistencies/errors
Perform analysis and visualization
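
The standardize-then-concatenate steps above can be sketched on two toy frames. The lowercase column names in `survey_b` and the `SurveyYear` column are hypothetical, chosen only to show the pattern; the real surveys have their own naming differences.

```python
import pandas as pd

# Two invented frames standing in for different survey years.
survey_a = pd.DataFrame({"Age": [27, 34], "CountryLive": ["Canada", "India"]})
survey_b = pd.DataFrame({"age": [21], "country_live": ["Brazil"]})

# Standardize column names so both frames share one schema.
survey_b = survey_b.rename(columns={"age": "Age", "country_live": "CountryLive"})

# Record the source of each row before combining (hypothetical column name).
survey_a["SurveyYear"] = 2017
survey_b["SurveyYear"] = 2016

# Stack the rows; ignore_index gives the result a fresh 0..n-1 index.
combined = pd.concat([survey_a, survey_b], ignore_index=True)
print(combined)
```

Keeping a source column makes it possible to check later whether a pattern holds in each survey separately.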

In [1]:

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.style as style
#style.use("fivethirtyeight")
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
#pd.options.display.float_format = '{:20,.2f}'.format

In [2]:

pd.options.display.max_columns = 150 # to avoid truncated output 

# freeCodeCamp New Coder Survey 2017
csv = pd.read_csv("2017-fCC-New-Coders-Survey-Data.csv", low_memory=False)

# freeCodeCamp New Coder Survey 2016
csv2016 = pd.read_csv("2016-fCC-New-Coders-Survey-Data.csv", low_memory=False)

# Stack Overflow 2018 developer survey
exchange = pd.read_csv("survey_results_public.csv", low_memory=False)

In [3]:

csv.head()

Out[3]:

	Age	BootcampFinish	BootcampLoanYesNo	BootcampName	BootcampRecommend	ChildrenNumber	CityPopulation	CodeEventConferences	CodeEventDjangoGirls	CodeEventFCC	CodeEventGameJam	CodeEventGirlDev	CodeEventHackathons	CodeEventMeetup	CodeEventNodeSchool	CodeEventNone	CodeEventOther	CodeEventRailsBridge	CodeEventRailsGirls	CodeEventStartUpWknd	CodeEventWkdBootcamps	CodeEventWomenCode	CodeEventWorkshops	CommuteTime	CountryCitizen	CountryLive	EmploymentField	EmploymentFieldOther	EmploymentStatus	EmploymentStatusOther	ExpectedEarning	FinanciallySupporting	FirstDevJob	Gender	GenderOther	HasChildren	HasDebt	HasFinancialDependents	HasHighSpdInternet	HasHomeMortgage	HasStudentDebt	HomeMortgageOwe	HoursLearning	ID.x	ID.y	Income	IsEthnicMinority	IsUnderEmployed	JobApplyWhen	JobInterestBackEnd	JobInterestDataEngr	JobInterestDataSci	JobInterestDevOps	JobInterestFrontEnd	JobInterestFullStack	JobInterestGameDev	JobInterestInfoSec	JobInterestMobile	JobInterestOther	JobInterestProjMngr	JobInterestQAEngr	JobInterestUX	JobPref	JobRelocateYesNo	JobRoleInterest	JobWherePref	LanguageAtHome	MaritalStatus	MoneyForLearning	MonthsProgramming	NetworkID	Part1EndTime	Part1StartTime	Part2EndTime	Part2StartTime	PodcastChangeLog	PodcastCodeNewbie	PodcastCodePen	PodcastDevTea	PodcastDotNET	PodcastGiantRobots	PodcastJSAir	PodcastJSJabber	PodcastNone	PodcastOther	PodcastProgThrowdown	PodcastRubyRogues	PodcastSEDaily	PodcastSERadio	PodcastShopTalk	PodcastTalkPython	PodcastTheWebAhead	ResourceCodecademy	ResourceCodeWars	ResourceCoursera	ResourceCSS	ResourceEdX	ResourceEgghead	ResourceFCC	ResourceHackerRank	ResourceKA	ResourceLynda	ResourceMDN	ResourceOdinProj	ResourceOther	ResourcePluralSight	ResourceSkillcrush	ResourceSO	ResourceTreehouse	ResourceUdacity	ResourceUdemy	ResourceW3S	SchoolDegree	SchoolMajor	StudentDebtOwe	YouTubeCodeCourse	YouTubeCodingTrain	YouTubeCodingTut360	YouTubeComputerphile	YouTubeDerekBanas	YouTubeDevTips	YouTubeEngineeredTruth	YouTubeFCC	YouTubeFunFunFunction	YouTubeGoogleDev	
YouTubeLearnCode	YouTubeLevelUpTuts	YouTubeMIT	YouTubeMozillaHacks	YouTubeOther	YouTubeSimplilearn	YouTubeTheNewBoston
0	27.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	15 to 29 minutes	Canada	Canada	software development and IT	NaN	Employed for wages	NaN	NaN	NaN	NaN	female	NaN	NaN	1.0	0.0	1.0	0.0	0.0	NaN	15.0	02d9465b21e8bd09374b0066fb2d5614	eb78c1c3ac6cd9052aec557065070fbf	NaN	NaN	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	start your own business	NaN	NaN	NaN	English	married or domestic partnership	150.0	6.0	6f1fbc6b2b	2017-03-09 00:36:22	2017-03-09 00:32:59	2017-03-09 00:59:46	2017-03-09 00:36:26	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	1.0	some college credit, no degree	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	34.0	NaN	NaN	NaN	NaN	NaN	less than 100,000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	United States of America	United States of America	NaN	NaN	Not working but looking for work	NaN	35000.0	NaN	NaN	male	NaN	NaN	1.0	0.0	1.0	0.0	1.0	NaN	10.0	5bfef9ecb211ec4f518cfc1d2a6f3e0c	21db37adb60cdcafadfa7dca1b13b6b1	NaN	0.0	NaN	Within 7 to 12 months	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	work for a nonprofit	1.0	Full-Stack Web Developer	in an office with other developers	English	single, never married	80.0	6.0	f8f8be6910	2017-03-09 00:37:07	2017-03-09 00:33:26	2017-03-09 00:38:59	2017-03-09 00:37:10	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	1.0	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	1.0	1.0	some college credit, no degree	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	21.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	NaN	NaN	NaN	NaN	1.0	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	15 to 29 minutes	United States of America	United States of America	software development and IT	NaN	Employed for wages	NaN	70000.0	NaN	NaN	male	NaN	NaN	0.0	0.0	1.0	NaN	NaN	NaN	25.0	14f1863afa9c7de488050b82eb3edd96	21ba173828fbe9e27ccebaf4d5166a55	13000.0	1.0	0.0	Within 7 to 12 months	1.0	NaN	NaN	1.0	1.0	1.0	NaN	NaN	1.0	NaN	NaN	NaN	NaN	work for a medium-sized company	1.0	Front-End Web Developer, Back-End Web Develo...	no preference	Spanish	single, never married	1000.0	5.0	2ed189768e	2017-03-09 00:37:58	2017-03-09 00:33:53	2017-03-09 00:40:14	2017-03-09 00:38:02	1.0	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	Codenewbie	NaN	NaN	NaN	NaN	1.0	NaN	NaN	1.0	NaN	NaN	1.0	NaN	NaN	1.0	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	1.0	1.0	NaN	high school diploma or equivalent (GED)	NaN	NaN	NaN	NaN	1.0	NaN	1.0	1.0	NaN	NaN	NaN	NaN	1.0	1.0	NaN	NaN	NaN	NaN	NaN
3	26.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	I work from home	Brazil	Brazil	software development and IT	NaN	Employed for wages	NaN	40000.0	0.0	NaN	male	NaN	0.0	1.0	1.0	1.0	1.0	0.0	40000.0	14.0	91756eb4dc280062a541c25a3d44cfb0	3be37b558f02daae93a6da10f83f0c77	24000.0	0.0	1.0	Within the next 6 months	1.0	NaN	NaN	NaN	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	work for a medium-sized company	NaN	Front-End Web Developer, Full-Stack Web Deve...	from home	Portuguese	married or domestic partnership	0.0	5.0	dbdc0664d1	2017-03-09 00:40:13	2017-03-09 00:37:45	2017-03-09 00:42:26	2017-03-09 00:40:18	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	1.0	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	some college credit, no degree	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	1.0	1.0	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN
4	20.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Portugal	Portugal	NaN	NaN	Not working but looking for work	NaN	140000.0	NaN	NaN	female	NaN	NaN	0.0	0.0	1.0	NaN	NaN	NaN	10.0	aa3f061a1949a90b27bef7411ecd193f	d7c56bbf2c7b62096be9db010e86d96d	NaN	0.0	NaN	Within 7 to 12 months	1.0	NaN	NaN	NaN	1.0	1.0	NaN	1.0	1.0	NaN	NaN	NaN	NaN	work for a multinational corporation	1.0	Full-Stack Web Developer, Information Security...	in an office with other developers	Portuguese	single, never married	0.0	24.0	11b0f2d8a9	2017-03-09 00:42:45	2017-03-09 00:39:44	2017-03-09 00:45:42	2017-03-09 00:42:50	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	bachelor's degree	Information Technology	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

In [4]:

exchange.head()

Out[4]:

	Respondent	Hobby	OpenSource	Country	Student	Employment	FormalEducation	UndergradMajor	CompanySize	DevType	YearsCoding	YearsCodingProf	JobSatisfaction	CareerSatisfaction	HopeFiveYears	JobSearchStatus	LastNewJob	AssessJob1	AssessJob2	AssessJob3	AssessJob4	AssessJob5	AssessJob6	AssessJob7	AssessJob8	AssessJob9	AssessJob10	AssessBenefits1	AssessBenefits2	AssessBenefits3	AssessBenefits4	AssessBenefits5	AssessBenefits6	AssessBenefits7	AssessBenefits8	AssessBenefits9	AssessBenefits10	AssessBenefits11	JobContactPriorities1	JobContactPriorities2	JobContactPriorities3	JobContactPriorities4	JobContactPriorities5	JobEmailPriorities1	JobEmailPriorities2	JobEmailPriorities3	JobEmailPriorities4	JobEmailPriorities5	JobEmailPriorities6	JobEmailPriorities7	UpdateCV	Currency	Salary	SalaryType	ConvertedSalary	CurrencySymbol	CommunicationTools	TimeFullyProductive	EducationTypes	SelfTaughtTypes	TimeAfterBootcamp	HackathonReasons	AgreeDisagree1	AgreeDisagree2	AgreeDisagree3	LanguageWorkedWith	LanguageDesireNextYear	DatabaseWorkedWith	DatabaseDesireNextYear	PlatformWorkedWith	PlatformDesireNextYear	FrameworkWorkedWith	FrameworkDesireNextYear	IDE	OperatingSystem	NumberMonitors	Methodology	VersionControl	CheckInCode	AdBlocker	AdBlockerDisable	AdBlockerReasons	AdsAgreeDisagree1	AdsAgreeDisagree2	AdsAgreeDisagree3	AdsActions	AdsPriorities1	AdsPriorities2	AdsPriorities3	AdsPriorities4	AdsPriorities5	AdsPriorities6	AdsPriorities7	AIDangerous	AIInteresting	AIResponsible	AIFuture	EthicsChoice	EthicsReport	EthicsResponsible	EthicalImplications	StackOverflowRecommend	StackOverflowVisit	StackOverflowHasAccount	StackOverflowParticipate	StackOverflowJobs	StackOverflowDevStory	StackOverflowJobsRecommend	StackOverflowConsiderMember	HypotheticalTools1	HypotheticalTools2	HypotheticalTools3	HypotheticalTools4	HypotheticalTools5	WakeTime	HoursComputer	HoursOutside	SkipMeals	ErgonomicDevices	Exercise	Gender	SexualOrientation	EducationParents	RaceEthnicity	Age	Dependents	MilitaryUS	SurveyTooLong	SurveyEasy
0	1	Yes	No	Kenya	No	Employed part-time	Bachelor’s degree (BA, BS, B.Eng., etc.)	Mathematics or statistics	20 to 99 employees	Full-stack developer	3-5 years	3-5 years	Extremely satisfied	Extremely satisfied	Working as a founder or co-founder of my own c...	I’m not actively looking, but I am open to new...	Less than a year ago	10.0	7.0	8.0	1.0	2.0	5.0	3.0	4.0	9.0	6.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3.0	1.0	4.0	2.0	5.0	5.0	6.0	7.0	2.0	1.0	4.0	3.0	My job status or other personal status changed	NaN	NaN	Monthly	NaN	KES	Slack	One to three months	Taught yourself a new language, framework, or ...	The official documentation and/or standards fo...	NaN	To build my professional network	Strongly agree	Strongly agree	Neither Agree nor Disagree	JavaScript;Python;HTML;CSS	JavaScript;Python;HTML;CSS	Redis;SQL Server;MySQL;PostgreSQL;Amazon RDS/A...	Redis;SQL Server;MySQL;PostgreSQL;Amazon RDS/A...	AWS;Azure;Linux;Firebase	AWS;Azure;Linux;Firebase	Django;React	Django;React	Komodo;Vim;Visual Studio Code	Linux-based	1	Agile;Scrum	Git	Multiple times per day	Yes	No	NaN	Strongly agree	Strongly agree	Strongly agree	Saw an online advertisement and then researche...	1.0	5.0	4.0	7.0	2.0	6.0	3.0	Artificial intelligence surpassing human intel...	Algorithms making important decisions	The developers or the people creating the AI	I'm excited about the possibilities more than ...	No	Yes, and publicly	Upper management at the company/organization	Yes	10 (Very Likely)	Multiple times per day	Yes	I have never participated in Q&A on Stack Over...	No, I knew that Stack Overflow had a jobs boar...	Yes	NaN	Yes	Extremely interested	Extremely interested	Extremely interested	Extremely interested	Extremely interested	Between 5:00 - 6:00 AM	9 - 12 hours	1 - 2 hours	Never	Standing desk	3 - 4 times per week	Male	Straight or heterosexual	Bachelor’s degree (BA, BS, B.Eng., etc.)	Black or of African descent	25 - 34 years old	Yes	NaN	The survey was an appropriate length	Very easy
1	3	Yes	Yes	United Kingdom	No	Employed full-time	Bachelor’s degree (BA, BS, B.Eng., etc.)	A natural science (ex. biology, chemistry, phy...	10,000 or more employees	Database administrator;DevOps specialist;Full-...	30 or more years	18-20 years	Moderately dissatisfied	Neither satisfied nor dissatisfied	Working in a different or more specialized tec...	I am actively looking for a job	More than 4 years ago	1.0	7.0	10.0	8.0	2.0	5.0	4.0	3.0	6.0	9.0	1.0	5.0	3.0	7.0	10.0	4.0	11.0	9.0	6.0	2.0	8.0	3.0	1.0	5.0	2.0	4.0	1.0	3.0	4.0	5.0	2.0	6.0	7.0	I saw an employer’s advertisement	British pounds sterling (£)	51000	Yearly	70841.0	GBP	Confluence;Office / productivity suite (Micros...	One to three months	Taught yourself a new language, framework, or ...	The official documentation and/or standards fo...	NaN	NaN	Agree	Agree	Neither Agree nor Disagree	JavaScript;Python;Bash/Shell	Go;Python	Redis;PostgreSQL;Memcached	PostgreSQL	Linux	Linux	Django	React	IPython / Jupyter;Sublime Text;Vim	Linux-based	2	NaN	Git;Subversion	A few times per week	Yes	Yes	The website I was visiting asked me to disable it	Somewhat agree	Neither agree nor disagree	Neither agree nor disagree	NaN	3.0	5.0	1.0	4.0	6.0	7.0	2.0	Increasing automation of jobs	Increasing automation of jobs	The developers or the people creating the AI	I'm excited about the possibilities more than ...	Depends on what it is	Depends on what it is	Upper management at the company/organization	Yes	10 (Very Likely)	A few times per month or weekly	Yes	A few times per month or weekly	Yes	No, I have one but it's out of date	7	Yes	A little bit interested	A little bit interested	A little bit interested	A little bit interested	A little bit interested	Between 6:01 - 7:00 AM	5 - 8 hours	30 - 59 minutes	Never	Ergonomic keyboard or mouse	Daily or almost every day	Male	Straight or heterosexual	Bachelor’s degree (BA, BS, B.Eng., etc.)	White or of European descent	35 - 44 years old	Yes	NaN	The survey was an appropriate length	Somewhat easy
2	4	Yes	Yes	United States	No	Employed full-time	Associate degree	Computer science, computer engineering, or sof...	20 to 99 employees	Engineering manager;Full-stack developer	24-26 years	6-8 years	Moderately satisfied	Moderately satisfied	Working as a founder or co-founder of my own c...	I’m not actively looking, but I am open to new...	Less than a year ago	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	5	No	No	United States	No	Employed full-time	Bachelor’s degree (BA, BS, B.Eng., etc.)	Computer science, computer engineering, or sof...	100 to 499 employees	Full-stack developer	18-20 years	12-14 years	Neither satisfied nor dissatisfied	Slightly dissatisfied	Working as a founder or co-founder of my own c...	I’m not actively looking, but I am open to new...	Less than a year ago	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A recruiter contacted me	U.S. dollars ($)	NaN	NaN	NaN	NaN	NaN	Three to six months	Completed an industry certification program (e...	The official documentation and/or standards fo...	NaN	NaN	Disagree	Disagree	Strongly disagree	C#;JavaScript;SQL;TypeScript;HTML;CSS;Bash/Shell	C#;JavaScript;SQL;TypeScript;HTML;CSS;Bash/Shell	SQL Server;Microsoft Azure (Tables, CosmosDB, ...	SQL Server;Microsoft Azure (Tables, CosmosDB, ...	Azure	Azure	NaN	Angular;.NET Core;React	Visual Studio;Visual Studio Code	Windows	2	Agile;Kanban;Scrum	Git	Multiple times per day	Yes	Yes	The ad-blocking software was causing display i...	Neither agree nor disagree	Somewhat agree	Somewhat agree	Stopped going to a website because of their ad...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Artificial intelligence surpassing human intel...	Artificial intelligence surpassing human intel...	A governmental or other regulatory body	I don't care about it, or I haven't thought ab...	No	Yes, but only within the company	Upper management at the company/organization	Yes	10 (Very Likely)	A few times per week	Yes	A few times per month or weekly	Yes	No, I have one but it's out of date	8	Yes	Somewhat interested	Somewhat interested	Somewhat interested	Somewhat interested	Somewhat interested	Between 6:01 - 7:00 AM	9 - 12 hours	Less than 30 minutes	3 - 4 times per week	NaN	I don't typically exercise	Male	Straight or heterosexual	Some college/university study without earning ...	
White or of European descent	35 - 44 years old	No	No	The survey was an appropriate length	Somewhat easy
4	7	Yes	No	South Africa	Yes, part-time	Employed full-time	Some college/university study without earning ...	Computer science, computer engineering, or sof...	10,000 or more employees	Data or business analyst;Desktop or enterprise...	6-8 years	0-2 years	Slightly satisfied	Moderately satisfied	Working in a different or more specialized tec...	I’m not actively looking, but I am open to new...	Between 1 and 2 years ago	8.0	5.0	7.0	1.0	2.0	6.0	4.0	3.0	10.0	9.0	1.0	10.0	2.0	4.0	8.0	3.0	11.0	7.0	5.0	9.0	6.0	2.0	1.0	4.0	5.0	3.0	7.0	3.0	6.0	2.0	1.0	4.0	5.0	My job status or other personal status changed	South African rands (R)	260000	Yearly	21426.0	ZAR	Office / productivity suite (Microsoft Office,...	Three to six months	Taken a part-time in-person course in programm...	The official documentation and/or standards fo...	NaN	NaN	Strongly agree	Agree	Strongly disagree	C;C++;Java;Matlab;R;SQL;Bash/Shell	Assembly;C;C++;Matlab;SQL;Bash/Shell	SQL Server;PostgreSQL;Oracle;IBM Db2	PostgreSQL;Oracle;IBM Db2	Arduino;Windows Desktop or Server	Arduino;Windows Desktop or Server	NaN	NaN	Notepad++;Visual Studio;Visual Studio Code	Windows	2	Evidence-based software engineering;Formal sta...	Zip file back-ups	Weekly or a few times per month	No	NaN	NaN	Somewhat agree	Somewhat agree	Somewhat disagree	Clicked on an online advertisement;Saw an onli...	2.0	3.0	4.0	6.0	1.0	7.0	5.0	Algorithms making important decisions	Algorithms making important decisions	The developers or the people creating the AI	I'm excited about the possibilities more than ...	No	Yes, but only within the company	Upper management at the company/organization	Yes	10 (Very Likely)	Daily or almost daily	Yes	Less than once per month or monthly	No, I knew that Stack Overflow had a jobs boar...	
No, I know what it is but I don't have one	NaN	Yes	Extremely interested	Extremely interested	Extremely interested	Extremely interested	Extremely interested	Before 5:00 AM	Over 12 hours	1 - 2 hours	Never	NaN	3 - 4 times per week	Male	Straight or heterosexual	Some college/university study without earning ...	White or of European descent	18 - 24 years old	Yes	NaN	The survey was an appropriate length	Somewhat easy

Data Processing and Cleaning

The first step in our analysis is to identify the columns that are relevant. Unfortunately, there are over 100 columns, which is far too many for a practical analysis.

We identified a few columns for analysis using datapackage.json. This JSON file describes each column for FreeCodeCamp's new coder surveys.
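
As a sketch of how such a file could be read programmatically, the snippet below parses a made-up excerpt laid out in the common Data Package style (`resources` → `schema` → `fields`). This layout is an assumption; the actual datapackage.json in the repository may be structured differently, so check it before relying on these keys.

```python
import json

# A made-up excerpt in the Data Package style; the real file may differ.
sample = """
{
  "resources": [
    {
      "schema": {
        "fields": [
          {"name": "Age", "description": "Respondent age in years"},
          {"name": "CountryLive", "description": "Country the respondent lives in"}
        ]
      }
    }
  ]
}
"""

package = json.loads(sample)

# Map each column name to its description ("" when none is given).
descriptions = {
    field["name"]: field.get("description", "")
    for resource in package["resources"]
    for field in resource["schema"]["fields"]
}

for name, text in descriptions.items():
    print(f"{name}: {text}")
```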

In [5]:

# Index location of the first set of columns to drop
print(csv.columns.get_loc("CodeEventConferences"))
print(csv.columns.get_loc("CodeEventWorkshops"))

8
23

In [6]:

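# Drop the columns at positions 8-22 (CodeEvent* columns); the slice end is
# exclusive, so CodeEventWorkshops at position 23 is kept
csv = csv.drop(columns=csv.columns[8:23])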

In [7]:

# Index location of the next set of columns to drop
print(csv.columns.get_loc("NetworkID"))
print(csv.columns.get_loc("ResourceW3S"))

59
100

In [8]:

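# Drop the columns at positions 59-99 (NetworkID onward); the slice end is
# exclusive, so ResourceW3S at position 100 is kept
csv = csv.drop(columns=csv.columns[59:100])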

In [9]:

print(csv.columns.get_loc("YouTubeCodeCourse"))

In [10]:

# Drop the remaining columns from index position 63 onward
csv = csv.drop(columns=csv.columns[63:])
csv.head()

Out[10]:

	Age	BootcampFinish	BootcampLoanYesNo	BootcampName	BootcampRecommend	ChildrenNumber	CityPopulation	CodeEventWorkshops	CommuteTime	CountryCitizen	CountryLive	EmploymentField	EmploymentFieldOther	EmploymentStatus	EmploymentStatusOther	ExpectedEarning	FinanciallySupporting	FirstDevJob	Gender	GenderOther	HasChildren	HasDebt	HasFinancialDependents	HasHighSpdInternet	HasHomeMortgage	HasStudentDebt	HomeMortgageOwe	HoursLearning	ID.x	ID.y	Income	IsEthnicMinority	IsUnderEmployed	JobApplyWhen	JobInterestBackEnd	JobInterestDataEngr	JobInterestDataSci	JobInterestDevOps	JobInterestFrontEnd	JobInterestFullStack	JobInterestGameDev	JobInterestInfoSec	JobInterestMobile	JobInterestOther	JobInterestProjMngr	JobInterestQAEngr	JobInterestUX	JobPref	JobRelocateYesNo	JobRoleInterest	JobWherePref	LanguageAtHome	MaritalStatus	MoneyForLearning	MonthsProgramming	ResourceW3S	SchoolDegree	SchoolMajor	StudentDebtOwe
0	27.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	15 to 29 minutes	Canada	Canada	software development and IT	NaN	Employed for wages	NaN	NaN	NaN	NaN	female	NaN	NaN	1.0	0.0	1.0	0.0	0.0	NaN	15.0	02d9465b21e8bd09374b0066fb2d5614	eb78c1c3ac6cd9052aec557065070fbf	NaN	NaN	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	start your own business	NaN	NaN	NaN	English	married or domestic partnership	150.0	6.0	1.0	some college credit, no degree	NaN	NaN
1	34.0	NaN	NaN	NaN	NaN	NaN	less than 100,000	NaN	NaN	United States of America	United States of America	NaN	NaN	Not working but looking for work	NaN	35000.0	NaN	NaN	male	NaN	NaN	1.0	0.0	1.0	0.0	1.0	NaN	10.0	5bfef9ecb211ec4f518cfc1d2a6f3e0c	21db37adb60cdcafadfa7dca1b13b6b1	NaN	0.0	NaN	Within 7 to 12 months	NaN	NaN	NaN	NaN	NaN	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	work for a nonprofit	1.0	Full-Stack Web Developer	in an office with other developers	English	single, never married	80.0	6.0	1.0	some college credit, no degree	NaN	NaN
2	21.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	15 to 29 minutes	United States of America	United States of America	software development and IT	NaN	Employed for wages	NaN	70000.0	NaN	NaN	male	NaN	NaN	0.0	0.0	1.0	NaN	NaN	NaN	25.0	14f1863afa9c7de488050b82eb3edd96	21ba173828fbe9e27ccebaf4d5166a55	13000.0	1.0	0.0	Within 7 to 12 months	1.0	NaN	NaN	1.0	1.0	1.0	NaN	NaN	1.0	NaN	NaN	NaN	NaN	work for a medium-sized company	1.0	Front-End Web Developer, Back-End Web Develo...	no preference	Spanish	single, never married	1000.0	5.0	NaN	high school diploma or equivalent (GED)	NaN	NaN
3	26.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	I work from home	Brazil	Brazil	software development and IT	NaN	Employed for wages	NaN	40000.0	0.0	NaN	male	NaN	0.0	1.0	1.0	1.0	1.0	0.0	40000.0	14.0	91756eb4dc280062a541c25a3d44cfb0	3be37b558f02daae93a6da10f83f0c77	24000.0	0.0	1.0	Within the next 6 months	1.0	NaN	NaN	NaN	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	work for a medium-sized company	NaN	Front-End Web Developer, Full-Stack Web Deve...	from home	Portuguese	married or domestic partnership	0.0	5.0	NaN	some college credit, no degree	NaN	NaN
4	20.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	Portugal	Portugal	NaN	NaN	Not working but looking for work	NaN	140000.0	NaN	NaN	female	NaN	NaN	0.0	0.0	1.0	NaN	NaN	NaN	10.0	aa3f061a1949a90b27bef7411ecd193f	d7c56bbf2c7b62096be9db010e86d96d	NaN	0.0	NaN	Within 7 to 12 months	1.0	NaN	NaN	NaN	1.0	1.0	NaN	1.0	1.0	NaN	NaN	NaN	NaN	work for a multinational corporation	1.0	Full-Stack Web Developer, Information Security...	in an office with other developers	Portuguese	single, never married	0.0	24.0	NaN	bachelor's degree	Information Technology	NaN
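
Dropping by position works, but the index numbers shift after every drop and slice ends are exclusive. As an alternative sketch (not the approach used in this notebook), whole column groups can be selected by name prefix. The toy frame and the prefix list below are chosen for illustration only.

```python
import pandas as pd

# Toy frame using a few column names from the survey; the values are invented.
toy = pd.DataFrame({
    "Age": [27.0],
    "CodeEventMeetup": [1.0],
    "PodcastCodeNewbie": [None],
    "YouTubeFCC": [1.0],
    "CountryLive": ["Canada"],
})

# Prefixes of the column groups to remove (an illustrative choice).
prefixes = ("CodeEvent", "Podcast", "YouTube")

# str.startswith accepts a tuple, so one pass covers all prefixes.
to_drop = [col for col in toy.columns if col.startswith(prefixes)]
trimmed = toy.drop(columns=to_drop)

print(trimmed.columns.tolist())
```

Unlike the positional slices above, a prefix match removes every column in a group, including the last one.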

In [11]:

csv.iloc[:,:20]

Out[11]:

	Age	AttendedBootcamp	BootcampFinish	BootcampLoanYesNo	BootcampName	BootcampRecommend	ChildrenNumber	CityPopulation	CodeEventWorkshops	CommuteTime	CountryCitizen	CountryLive	EmploymentField	EmploymentFieldOther	EmploymentStatus	EmploymentStatusOther	ExpectedEarning	FinanciallySupporting	FirstDevJob	Gender
0	27.0	0.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	15 to 29 minutes	Canada	Canada	software development and IT	NaN	Employed for wages	NaN	NaN	NaN	NaN	female
1	34.0	0.0	NaN	NaN	NaN	NaN	NaN	less than 100,000	NaN	NaN	United States of America	United States of America	NaN	NaN	Not working but looking for work	NaN	35000.0	NaN	NaN	male
2	21.0	0.0	NaN	NaN	NaN	NaN	NaN	more than 1 million	NaN	15 to 29 minutes	United States of America	United States of America	software development and IT	NaN	Employed for wages	NaN	70000.0	NaN	NaN	male
3	26.0	0.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	I work from home	Brazil	Brazil	software development and IT	NaN	Employed for wages	NaN	40000.0	0.0	NaN	male
4	20.0	0.0	NaN	NaN	NaN	NaN	NaN	between 100,000 and 1 million	NaN	NaN	Portugal	Portugal	NaN	NaN	Not working but looking for work	NaN	140000.0	NaN	NaN	female
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
18170	41.0	0.0	NaN	NaN	NaN	NaN	1.0	more than 1 million	NaN	I work from home	Indonesia	Indonesia	software development and IT	NaN	Self-employed freelancer	NaN	NaN	0.0	NaN	male
18171	31.0	0.0	NaN	NaN	NaN	NaN	1.0	more than 1 million	NaN	Less than 15 minutes	Nigeria	Nigeria	transportation	NaN	Self-employed freelancer	NaN	70000.0	1.0	NaN	male
18172	39.0	0.0	NaN	NaN	NaN	NaN	3.0	more than 1 million	1.0	45 to 60 minutes	South Africa	South Africa	NaN	IT support and website update	Employed for wages	NaN	NaN	0.0	1.0	male
18173	54.0	0.0	NaN	NaN	NaN	NaN	3.0	between 100,000 and 1 million	NaN	Less than 15 minutes	United Kingdom	United Kingdom	education	NaN	Employed for wages	NaN	NaN	0.0	NaN	male
18174	50.0	0.0	NaN	NaN	NaN	NaN	2.0	less than 100,000	NaN	15 to 29 minutes	United Kingdom	United Kingdom	health care	NaN	Employed for wages	NaN	NaN	0.0	NaN	male

18175 rows × 20 columns

The code below computes the percentage of missing values in each column. Respondents often left questions blank, so many columns have missing data, and dropping every row that contains a missing value would remove nearly every row.

In [12]:

# Percentage of missing values in each column
series = csv.isnull().sum() / csv.shape[0] * 100

# Columns with at most 60% missing data points
# (named to avoid shadowing the built-in `list`)
cols_under_threshold = series[series <= 60].index

In [13]:

print(series)

Age                  15.449794
AttendedBootcamp      2.563961
BootcampFinish       94.118294
BootcampLoanYesNo    94.063274
BootcampName         94.778542
                       ...    
MonthsProgramming     6.002751
ResourceW3S          46.272352
SchoolDegree         15.444292
SchoolMajor          51.983494
StudentDebtOwe       81.502063
Length: 63, dtype: float64

In [14]:

# Convert the selected columns from a pandas Index to a Python list
cols_to_use = cols_under_threshold.tolist()
cols_to_use.extend(["JobRoleInterest", "ExpectedEarning"])

# Isolates the dataframe down to only preferred columns
csv = csv[cols_to_use]

# Drop the ID.x, ID.y, and ResourceW3S columns
csv = csv.drop(columns=["ID.x","ID.y","ResourceW3S"])
csv

Out[14]:

	Age	AttendedBootcamp	CityPopulation	CommuteTime	CountryCitizen	CountryLive	EmploymentField	EmploymentStatus	Gender	HasDebt	HasFinancialDependents	HasHighSpdInternet	HasServedInMilitary	HoursLearning	Income	IsEthnicMinority	IsReceiveDisabilitiesBenefits	IsSoftwareDev	IsUnderEmployed	JobApplyWhen	JobPref	JobWherePref	LanguageAtHome	MaritalStatus	MoneyForLearning	MonthsProgramming	SchoolDegree	SchoolMajor	JobRoleInterest	ExpectedEarning
0	27.0	0.0	more than 1 million	15 to 29 minutes	Canada	Canada	software development and IT	Employed for wages	female	1.0	0.0	1.0	0.0	15.0	NaN	NaN	0.0	0.0	0.0	NaN	start your own business	NaN	English	married or domestic partnership	150.0	6.0	some college credit, no degree	NaN	NaN	NaN
1	34.0	0.0	less than 100,000	NaN	United States of America	United States of America	NaN	Not working but looking for work	male	1.0	0.0	1.0	0.0	10.0	NaN	0.0	0.0	0.0	NaN	Within 7 to 12 months	work for a nonprofit	in an office with other developers	English	single, never married	80.0	6.0	some college credit, no degree	NaN	Full-Stack Web Developer	35000.0
2	21.0	0.0	more than 1 million	15 to 29 minutes	United States of America	United States of America	software development and IT	Employed for wages	male	0.0	0.0	1.0	0.0	25.0	13000.0	1.0	0.0	0.0	0.0	Within 7 to 12 months	work for a medium-sized company	no preference	Spanish	single, never married	1000.0	5.0	high school diploma or equivalent (GED)	NaN	Front-End Web Developer, Back-End Web Develo...	70000.0
3	26.0	0.0	between 100,000 and 1 million	I work from home	Brazil	Brazil	software development and IT	Employed for wages	male	1.0	1.0	1.0	0.0	14.0	24000.0	0.0	0.0	0.0	1.0	Within the next 6 months	work for a medium-sized company	from home	Portuguese	married or domestic partnership	0.0	5.0	some college credit, no degree	NaN	Front-End Web Developer, Full-Stack Web Deve...	40000.0
4	20.0	0.0	between 100,000 and 1 million	NaN	Portugal	Portugal	NaN	Not working but looking for work	female	0.0	0.0	1.0	0.0	10.0	NaN	0.0	0.0	0.0	NaN	Within 7 to 12 months	work for a multinational corporation	in an office with other developers	Portuguese	single, never married	0.0	24.0	bachelor's degree	Information Technology	Full-Stack Web Developer, Information Security...	140000.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
18170	41.0	0.0	more than 1 million	I work from home	Indonesia	Indonesia	software development and IT	Self-employed freelancer	male	1.0	1.0	0.0	0.0	10.0	60000.0	0.0	0.0	0.0	0.0	NaN	start your own business	NaN	Indonesian	married or domestic partnership	10.0	1.0	bachelor's degree	Telecommunications Technician	NaN	NaN
18171	31.0	0.0	more than 1 million	Less than 15 minutes	Nigeria	Nigeria	transportation	Self-employed freelancer	male	1.0	1.0	0.0	0.0	1.0	60000.0	0.0	0.0	0.0	1.0	more than 12 months from now	work for a nonprofit	no preference	English	divorced	10000.0	1.0	high school diploma or equivalent (GED)	NaN	DevOps / SysAdmin, Mobile Developer, Pro...	70000.0
18172	39.0	0.0	more than 1 million	45 to 60 minutes	South Africa	South Africa	NaN	Employed for wages	male	1.0	1.0	0.0	0.0	10.0	1000000.0	0.0	0.0	1.0	1.0	NaN	NaN	NaN	Zulu	married or domestic partnership	19.0	3.0	some high school	NaN	NaN	NaN
18173	54.0	0.0	between 100,000 and 1 million	Less than 15 minutes	United Kingdom	United Kingdom	education	Employed for wages	male	0.0	1.0	1.0	0.0	1.0	1000000.0	0.0	0.0	0.0	1.0	NaN	freelance	NaN	English	divorced	0.0	5.0	trade, technical, or vocational training	NaN	NaN	NaN
18174	50.0	0.0	less than 100,000	15 to 29 minutes	United Kingdom	United Kingdom	health care	Employed for wages	male	1.0	1.0	1.0	1.0	5.0	1000000.0	0.0	0.0	0.0	1.0	I haven't decided	work for a government	no preference	English	married or domestic partnership	NaN	10.0	bachelor's degree	Computer and Information Studies	Back-End Web Developer, Data Engineer, Data ...	NaN

18175 rows × 30 columns

In [15]:

# Count missing data
nulls = csv.apply(pd.isnull).sum()/csv.shape[0] * 100
nulls = nulls.sort_values()
nulls

Out[15]:

IsSoftwareDev                     0.588721
AttendedBootcamp                  2.563961
MonthsProgramming                 6.002751
HoursLearning                     8.038514
MoneyForLearning                  8.792297
Gender                           14.971114
CountryCitizen                   15.367263
HasHighSpdInternet               15.378267
SchoolDegree                     15.444292
Age                              15.449794
CityPopulation                   15.521320
LanguageAtHome                   15.576341
CountryLive                      15.620358
MaritalStatus                    15.625860
HasFinancialDependents           15.658872
IsEthnicMinority                 15.856946
HasDebt                          15.867950
HasServedInMilitary              16.060523
IsReceiveDisabilitiesBenefits    16.247593
EmploymentStatus                 21.072902
JobPref                          25.815681
CommuteTime                      49.127923
IsUnderEmployed                  49.254470
SchoolMajor                      51.983494
JobApplyWhen                     55.224209
JobWherePref                     55.334250
EmploymentField                  55.345254
Income                           58.057772
ExpectedEarning                  60.385144
JobRoleInterest                  61.529574
dtype: float64

In [16]:

csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18175 entries, 0 to 18174
Data columns (total 30 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Age                            15367 non-null  float64
 1   AttendedBootcamp               17709 non-null  float64
 2   CityPopulation                 15354 non-null  object 
 3   CommuteTime                    9246 non-null   object 
 4   CountryCitizen                 15382 non-null  object 
 5   CountryLive                    15336 non-null  object 
 6   EmploymentField                8116 non-null   object 
 7   EmploymentStatus               14345 non-null  object 
 8   Gender                         15454 non-null  object 
 9   HasDebt                        15291 non-null  float64
 10  HasFinancialDependents         15329 non-null  float64
 11  HasHighSpdInternet             15380 non-null  float64
 12  HasServedInMilitary            15256 non-null  float64
 13  HoursLearning                  16714 non-null  float64
 14  Income                         7623 non-null   float64
 15  IsEthnicMinority               15293 non-null  float64
 16  IsReceiveDisabilitiesBenefits  15222 non-null  float64
 17  IsSoftwareDev                  18068 non-null  float64
 18  IsUnderEmployed                9223 non-null   float64
 19  JobApplyWhen                   8138 non-null   object 
 20  JobPref                        13483 non-null  object 
 21  JobWherePref                   8118 non-null   object 
 22  LanguageAtHome                 15344 non-null  object 
 23  MaritalStatus                  15335 non-null  object 
 24  MoneyForLearning               16577 non-null  float64
 25  MonthsProgramming              17084 non-null  float64
 26  SchoolDegree                   15368 non-null  object 
 27  SchoolMajor                    8727 non-null   object 
 28  JobRoleInterest                6992 non-null   object 
 29  ExpectedEarning                7200 non-null   float64
dtypes: float64(15), object(15)
memory usage: 4.2+ MB

In [17]:

# New column to indicate year of survey completion
csv["Year"] = 2017
csv2016["Year"] = 2016

# Columns of interest
column_lists = csv.columns.to_list()
column_lists

# Apply column filtering to survey 2016
survey_2016 = csv2016[column_lists]

Dataset merging¶

In [18]:

# Merge dataframes
combined_survey = pd.concat([csv, survey_2016])

# Merged dataframe length (rows)
print("Number of Rows:")
print(combined_survey.shape[0])

Number of Rows:
33795
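One detail of the merge is worth keeping in mind: by default `pd.concat` keeps each frame's original row labels, so the combined frame can contain repeated index values. The sketch below uses two small made-up frames (`df_2017` and `df_2016` are illustrative names, not objects from this notebook) to show the effect and how `ignore_index=True` would avoid it.

```python
import pandas as pd

# Toy stand-ins for the two surveys (values are invented for illustration)
df_2017 = pd.DataFrame({"Age": [27.0, 34.0], "Year": 2017})
df_2016 = pd.DataFrame({"Age": [41.0, 31.0, 39.0], "Year": 2016})

# Default behaviour: original row labels are kept, so labels 0 and 1 appear twice
stacked = pd.concat([df_2017, df_2016])

# ignore_index=True assigns a fresh 0..n-1 index instead
stacked_clean = pd.concat([df_2017, df_2016], ignore_index=True)

print(stacked.index.is_unique)        # False
print(stacked_clean.index.is_unique)  # True
```

Repeated labels are harmless for the counts and plots in this analysis, but they matter if rows are later selected with `.loc`.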

JobRoleInterest: "Which one of these careers are you interested in?"

Most of the courses offered on our e-learning platform cover web and mobile development, so we need to check whether the sample is representative of the population of new coders we want to reach. One significant limitation of this survey is the number of rows with no answer for JobRoleInterest: roughly 6 out of 10 observations do not have a response to this question.

It's strange that so many people who took the survey neglected to answer this question. In addition to this question, perhaps another should have been asked, such as "What are your goals for learning programming?"

After merging both dataframes we ended up with 33,795 rows. For the analysis we're going to remove all observations with no answer to this question, which leaves a final dataframe of 13,495 rows.
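The "6 out of 10" figure can be checked directly from the two row counts quoted above. This is only the arithmetic, with variable names chosen for illustration:

```python
total_rows = 33795      # rows after merging both surveys
answered_rows = 13495   # rows with a JobRoleInterest response

# Share of merged rows with no JobRoleInterest answer
missing_share = 1 - answered_rows / total_rows
print(f"{missing_share:.1%}")  # 60.1%
```

On a real column the same share can be computed with `Series.isna().mean()`.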

Among these observations, career interest leans heavily toward web development (including full stack, front end, and back end web development). Many respondents also selected multiple categories rather than just one. Splitting the string in each row of the JobRoleInterest column will help us understand how many choices each person selected.

For rows containing multiple categories, we can split out each occurrence of a job category using pandas.Series.str.split. This approach will help us count every individual job category.
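As a compact illustration of the split-and-count idea, the sketch below works on a handful of invented responses in the same comma-separated format. It uses `Series.explode` as an alternative to a manual loop; the names `responses`, `roles_per_person`, and `role_counts` are illustrative only.

```python
import pandas as pd

# Toy responses in the same comma-separated format (invented values)
responses = pd.Series([
    "Full-Stack Web Developer",
    "  Front-End Web Developer, Back-End Web Developer",
    "Back-End Web Developer,   Front-End Web Developer, Full-Stack Web Developer",
    None,
])

# Split each answered response into a list of roles
split_roles = responses.dropna().str.split(",")

# Number of roles each respondent selected
roles_per_person = split_roles.str.len()

# One row per role, whitespace trimmed, then counted
role_counts = split_roles.explode().str.strip().value_counts()

print(roles_per_person.tolist())
print(role_counts)
```

Trimming whitespace before counting matters here, because the raw answers carry inconsistent leading spaces after each comma.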

In [19]:

interests = combined_survey["JobRoleInterest"].value_counts(normalize=True) * 100
interests.head(20)

Out[19]:

Full-Stack Web Developer                                                       25.150056
  Front-End Web Developer                                                      13.553168
Back-End Web Developer                                                          6.268989
  Data Scientist / Data Engineer                                                4.786958
  Mobile Developer                                                              3.934791
  User Experience Designer                                                      2.423120
  DevOps / SysAdmin                                                             1.889589
  Product Manager                                                               1.822897
  Data Scientist                                                                1.126343
  Quality Assurance Engineer                                                    0.881808
Game Developer                                                                  0.844757
Information Security                                                            0.681734
Full-Stack Web Developer,   Front-End Web Developer                             0.474250
  Front-End Web Developer, Full-Stack Web Developer                             0.414969
Data Engineer                                                                   0.392738
  User Experience Designer,   Front-End Web Developer                           0.318637
  Front-End Web Developer, Back-End Web Developer, Full-Stack Web Developer     0.288996
Back-End Web Developer,   Front-End Web Developer, Full-Stack Web Developer     0.266765
Back-End Web Developer, Full-Stack Web Developer,   Front-End Web Developer     0.266765
Full-Stack Web Developer,   Front-End Web Developer, Back-End Web Developer     0.229715
Name: JobRoleInterest, dtype: float64

In [20]:

# Number of distinct combinations of job interests
len(interests)

Out[20]:

In [21]:

# New dataframe excluding any missing data from JobRoleInterest column
survey = combined_survey[combined_survey["JobRoleInterest"].notnull()].copy()

# Split each comma-separated response into a list of job categories
survey["JobRoleInterest"] = survey["JobRoleInterest"].str.split(",")

In [22]:

# Combined dataset (survey) missing values in percentage
(survey.apply(pd.isnull).sum()/survey.shape[0] * 100).sort_values(ascending = False)

Out[22]:

EmploymentField                  61.645054
Income                           59.147833
CommuteTime                      53.864394
IsUnderEmployed                  53.093738
SchoolMajor                      47.143386
MaritalStatus                    39.066321
EmploymentStatus                 14.071878
ExpectedEarning                  10.596517
IsReceiveDisabilitiesBenefits     7.773249
HasServedInMilitary               7.654687
IsEthnicMinority                  7.476843
LanguageAtHome                    7.476843
HasDebt                           7.454613
Age                               7.387921
CityPopulation                    7.365691
CountryLive                       7.321230
HasFinancialDependents            7.306410
SchoolDegree                      7.128566
CountryCitizen                    7.121156
HasHighSpdInternet                7.054465
Gender                            6.595035
MoneyForLearning                  6.587625
HoursLearning                     5.779918
MonthsProgramming                 4.579474
AttendedBootcamp                  1.237495
JobPref                           0.955910
JobWherePref                      0.652093
JobApplyWhen                      0.548351
IsSoftwareDev                     0.229715
JobRoleInterest                   0.000000
Year                              0.000000
dtype: float64

In [23]:

# Fill missing data points with the median
survey["ExpectedEarning"] = survey["ExpectedEarning"].fillna(survey["ExpectedEarning"].median())
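
Filling with the median rather than the mean is a reasonable choice for a right-skewed column such as expected earnings, because a few very large answers pull the mean upward. A small sketch with invented numbers (the `earnings` series is not taken from the survey) shows the difference:

```python
import pandas as pd

# Invented expected-earning answers with one extreme value and one missing value
earnings = pd.Series([35000.0, 40000.0, 70000.0, 1000000.0, None])

print(earnings.mean())    # 286250.0, dominated by the single large answer
print(earnings.median())  # 55000.0, closer to a typical answer

# Median imputation, as in the cell above
filled = earnings.fillna(earnings.median())
```

Note that imputing about 10% of the column with a single value narrows its spread, so summary statistics on the filled column should be read with that in mind.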

In [24]:

survey

Out[24]:

	Age	AttendedBootcamp	CityPopulation	CommuteTime	CountryCitizen	CountryLive	EmploymentField	EmploymentStatus	Gender	HasDebt	HasFinancialDependents	HasHighSpdInternet	HasServedInMilitary	HoursLearning	Income	IsEthnicMinority	IsReceiveDisabilitiesBenefits	IsSoftwareDev	IsUnderEmployed	JobApplyWhen	JobPref	JobWherePref	LanguageAtHome	MaritalStatus	MoneyForLearning	MonthsProgramming	SchoolDegree	SchoolMajor	JobRoleInterest	ExpectedEarning	Year
1	34.0	0.0	less than 100,000	NaN	United States of America	United States of America	NaN	Not working but looking for work	male	1.0	0.0	1.0	0.0	10.0	NaN	0.0	0.0	0.0	NaN	Within 7 to 12 months	work for a nonprofit	in an office with other developers	English	single, never married	80.0	6.0	some college credit, no degree	NaN	[Full-Stack Web Developer]	35000.0	2017
2	21.0	0.0	more than 1 million	15 to 29 minutes	United States of America	United States of America	software development and IT	Employed for wages	male	0.0	0.0	1.0	0.0	25.0	13000.0	1.0	0.0	0.0	0.0	Within 7 to 12 months	work for a medium-sized company	no preference	Spanish	single, never married	1000.0	5.0	high school diploma or equivalent (GED)	NaN	[ Front-End Web Developer, Back-End Web Deve...	70000.0	2017
3	26.0	0.0	between 100,000 and 1 million	I work from home	Brazil	Brazil	software development and IT	Employed for wages	male	1.0	1.0	1.0	0.0	14.0	24000.0	0.0	0.0	0.0	1.0	Within the next 6 months	work for a medium-sized company	from home	Portuguese	married or domestic partnership	0.0	5.0	some college credit, no degree	NaN	[ Front-End Web Developer, Full-Stack Web De...	40000.0	2017
4	20.0	0.0	between 100,000 and 1 million	NaN	Portugal	Portugal	NaN	Not working but looking for work	female	0.0	0.0	1.0	0.0	10.0	NaN	0.0	0.0	0.0	NaN	Within 7 to 12 months	work for a multinational corporation	in an office with other developers	Portuguese	single, never married	0.0	24.0	bachelor's degree	Information Technology	[Full-Stack Web Developer, Information Securi...	140000.0	2017
6	29.0	0.0	between 100,000 and 1 million	30 to 44 minutes	United Kingdom	United Kingdom	NaN	Employed for wages	female	1.0	0.0	1.0	0.0	16.0	40000.0	NaN	0.0	0.0	0.0	I'm already applying	work for a medium-sized company	no preference	English	married or domestic partnership	0.0	12.0	some college credit, no degree	NaN	[Full-Stack Web Developer]	30000.0	2017
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
15585	32.0	0.0	more than 1 million	40.0	Ukraine	Ukraine	health care	Employed for wages	female	0.0	1.0	1.0	0.0	5.0	36000.0	1.0	0.0	0.0	1.0	Within the next 6 months	work for a multinational corporation	in an office with other developers	Russian	married or domestic partnership	5.0	2.0	bachelor's degree	Linguistics	[ Front-End Web Developer]	8400.0	2016
15598	51.0	0.0	less than 100,000	30.0	United States of America	United States of America	finance	Employed for wages	male	1.0	1.0	1.0	1.0	30.0	200000.0	0.0	0.0	0.0	0.0	more than 12 months from now	work for a medium-sized company	in an office with other developers	English	married or domestic partnership	100.0	12.0	professional degree (MBA, MD, JD, etc.)	Investments and Securities	[Full-Stack Web Developer]	100000.0	2016
15600	38.0	0.0	more than 1 million	90.0	United States of America	United States of America	finance	Employed for wages	male	0.0	1.0	1.0	0.0	6.0	200000.0	0.0	0.0	0.0	0.0	more than 12 months from now	work for a startup	no preference	English	married or domestic partnership	500.0	12.0	bachelor's degree	Finance	[Full-Stack Web Developer]	150000.0	2016
15608	40.0	0.0	more than 1 million	60.0	Australia	Australia	software development and IT	Employed for wages	male	1.0	1.0	0.0	0.0	10.0	200000.0	0.0	0.0	0.0	0.0	more than 12 months from now	work for a multinational corporation	in an office with other developers	English	married or domestic partnership	0.0	2.0	bachelor's degree	Computer Systems Analysis	[ DevOps / SysAdmin]	80000.0	2016
15615	28.0	0.0	less than 100,000	7.0	United States of America	United States of America	food and beverage	Employed for wages	male	1.0	1.0	1.0	0.0	20.0	200000.0	0.0	0.0	0.0	1.0	I'm already applying	work for a medium-sized company	from home	English	married or domestic partnership	1400.0	7.0	associate's degree	Computer and Information Systems Security	[Full-Stack Web Developer]	50000.0	2016

13495 rows × 31 columns

In [25]:

# Counts each occurrence of a particular category
category_count = dict()

# For loop for counting each individual category in the JobRoleInterest column
for categories in survey["JobRoleInterest"]: 
    for category in categories:
        if category in category_count:
            category_count[category] += 1 # counts category key if already present in dictionary
        else:
            category_count[category] = 1 # adds unique category key to dictionary if not already present

# Transforms dictionary to dataframe 
category_count = pd.DataFrame.from_dict(category_count, orient="index", columns= ["Count"])
category_count = category_count.reset_index(level = 0)
category_count = category_count.rename(columns = {"index":"Interests"})

In [26]:

category_count["Interests"].unique()

Out[26]:

array(['Full-Stack Web Developer', '  Front-End Web Developer',
       ' Back-End Web Developer', '   DevOps / SysAdmin',
       '   Mobile Developer', ' Full-Stack Web Developer',
       ' Information Security', '   Front-End Web Developer',
       '   Quality Assurance Engineer', ' Game Developer',
       '   User Experience Designer', '  DevOps / SysAdmin',
       '   Data Scientist', ' Data Engineer', 'Back-End Web Developer',
       'Information Security', '  Data Scientist', '  Mobile Developer',
       '   Product Manager', 'Data Engineer', 'Game Developer',
       '  Product Manager', '  User Experience Designer',
       '  Quality Assurance Engineer', 'Ethical Hacker',
       ' security expert', ' Technical Writer', ' Researcher',
       'Systems Engineer', 'Desktop Applications Programmer', ' Robotics',
       'Non technical ', ' UI Design', 'Software engineer ',
       'email coder', ' Data analyst', ' I dont yet know',
       ' UX developer/designer', ' support scientific resaerch ',
       ' AI and neuroscience', 'Full Stack Software Engineer',
       ' Program Manager', ' Application Support Analyst',
       " This futurist's dream of using some tech in a way that inspires critical amounts of people to influence the changes we need to protect ",
       ' Information Architect', 'Physicist ',
       'Security Business Analyst ', ' Bioinformatics/science ',
       ' creative coder / generative artist/designer',
       ' a job in which I can use coding skills to create valuable portals to advance human rights',
       'Research ', ' Bitcoin/Crypto', 'Embedded hardware',
       'Data/Interactive Journalist', 'Software Engineering',
       ' Software Engineer', ' Business Analyst', 'Network Engineer',
       'Information Developer', 'Java developer', ' Project Management',
       'Machine learning engineer', 'Real-time systems', ' Cybersecurity',
       ' software engineer', 'GIS Developer', 'Research and education',
       ' System Software', 'Full Stack Developer ', 'AI',
       '  Bioinformatics ', ' Data Analyst', 'Urban Planner',
       'Software Engineer', 'full stack developer', ' SWE',
       ' Embedded Developer', ' virtual reality developer',
       ' Journalist/Graphic Designer/Marketing', ' Web Designer',
       'Computer Architect', ' Networking', 'Software Developer',
       ' Software Developer', ' Machine Learning Engineer',
       ' data analyst', ' AI and Machine Learning', ' computer engineer',
       ' Artificial Intelligence', 'Systems Programming',
       'Software Engineer (Computer Science Based)',
       'Technology Management', 'full-stack developer',
       ' Software developer', 'BA or developer', ' User Interface Design',
       'System Engineer', 'Network', ' Analyst', ' Machine Learning ',
       'Pharmacy tech', 'data journalist / data visualist', 'Desings',
       ' Infrastructure Architect ', ' Tech art',
       ' Technology-Business Liaison', ' Product Designer',
       'Front-End Web Designer', 'Document Controller',
       ' Software enginner', ' programmer', 'undeceided',
       'Pharmaceutical industry', ' Information Technology',
       ' Library Developer', ' Desktop Application Developer',
       ' Machine Learning', ' Operating Systems', ' Compilers', ' etc...',
       ' GIS Database Admin', ' designer',
       'Support Engineer or API Support', ' Software engineer',
       ' Python Developer', ' Bioinformatics',
       'Robotics Process Automation Specialist', 'Data visualisation',
       ' Desktop applications developer',
       'All - whatever is required to develop tools to revolutionize the mechanical engineering process',
       'Digital Humanitites', ' User Interface Designer',
       'Artificial Intelligence', ' Software Development', 'Programming',
       'Web development ', ' Marketing', 'Financial Services',
       'software developer', 'Natural Language Processing',
       ' Entreprenuer / Web Dev Hustler ', ' Machine Learning Engineer ',
       'Marketing Automation ', 'AI Developer', ' network admin',
       'Front end', ' back end', ' game', ' web', ' mobile developer',
       'Not sure!', ' Anything that engages me',
       "i don't know what the difference is between most of these soz lol",
       'Unsure', 'Any of them.', 'Not sure yet', 'Not Sure Yet',
       'Not sure', ' i dunno!!!!', ' milatary engineer', ' SEO',
       'Software engineer', 'Astrophysicist', ' Journalist',
       'philosopher', ' Java developer', 'Desktop Applications',
       ' Programmer', 'IoT Developer', 'Systems Programmer',
       'Web Designer', "Don't know yet", ' Artificial intelligence',
       ' Artificial Intelligence Engineer', 'Developer Evangelist',
       ' Bioinformatitian', ' IoT', ' Entrepreneur',
       ' I am interested in Game Development', ' Mobile Development',
       ' Web Design', ' Front End Web Development', 'programmer',
       'Data Reporter', 'Not Sure', 'Web developer',
       'User Interface Designer', 'Robotics and AI Engineer',
       ' Ethical Hacker', ' Artificial Intelligence engineer',
       ' Scientific Programming',
       ' Software Developer or Front-End Web Developer', ' UI Designer',
       ' Campaign Manager', ' AI Engineer', 'Software Specialist ',
       ' Project Manager', ' Growth Hacker', 'Research', 'idk',
       ' Founder', 'Software Engineers', 'VR Technology developer',
       ' developer', ' plc', 'Ceo', ' Tech lobbiest',
       'Quant (Algorithmic Trader)', 'Machine learning and AI ',
       'Project manager', 'undecided', ' Databases', 'Project Manager',
       'Cloud computing ', 'Software Developper', 'College professor',
       ' System Administrator/Network', ' Software Projects Manager',
       'Teacher. Teaching students to code. ', 'Education',
       'code developer...in whatever format', ' front-end', ' back-end',
       ' app dev etc.',
       'improving in my current career as a Learning technologist',
       'Informatician', ' Artificial Intelligence ', 'lab scientist',
       'Data Visualization Specialist', "I don't know yet!",
       "I'm just learning code to increase my skill-set. I see it as a literacy issue.",
       ' Teacher',
       ' Criminal Defense Attorney-- focusing on cyber crimes ',
       'Remote Support', 'non-programmer', ' IT specialist ',
       '  Data Scientist / Data Engineer'], dtype=object)

There are many different "job interests" throughout the survey, and it's clear that respondents were able to write in their own response to the question. The biggest drawback of this approach is that we end up with many variations of the same career, inconsistent spelling and capitalization, and uninformative responses.

pandas counts each of these as a unique value, which makes it harder to get an accurate count; for example, there are several variations of "Front-End Web Developer". There is also extra whitespace scattered throughout the values. To clean up the values in this dataframe we'll strip the surrounding whitespace and convert everything to lower case.
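Stripping and lower-casing merges labels that differ only in spacing or capitalization, but not misspellings or alternative wordings. The sketch below takes a few raw labels from the array above and adds a second, optional step: a hand-written `aliases` dictionary (hypothetical, and a judgment call about which labels mean the same thing) that maps known variants to one canonical label.

```python
import pandas as pd

# A few raw labels copied from the array above
raw = pd.Series([
    "Full-Stack Web Developer",
    " Full-Stack Web Developer",
    "full stack developer",
    "Software engineer ",
    " Software Engineer",
    " Software enginner",
])

# Step 1: trim surrounding whitespace and lower-case
cleaned = raw.str.strip().str.lower()

# Step 2 (hypothetical mapping): fold known variants into a canonical label
aliases = {
    "full stack developer": "full-stack web developer",
    "software enginner": "software engineer",
}
canonical = cleaned.replace(aliases)

print(cleaned.nunique(), canonical.nunique())
```

The notebook itself applies only the first step, which is enough here because the write-in variants are rare compared with the predefined categories.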

In [27]:

# Strip surrounding whitespace and convert to lower case
category_count["Interests"] = category_count["Interests"].str.strip().str.lower()

# Group by interest and add up the number of occurrences
category_count.groupby("Interests").sum().sort_values(by = "Count", ascending= False).head(50)

Out[27]:

	Count
Interests
full-stack web developer	6769
front-end web developer	4912
back-end web developer	3476
mobile developer	2719
user experience designer	1744
data scientist	1643
game developer	1628
information security	1326
data engineer	1248
devops / sysadmin	1146
product manager	1005
data scientist / data engineer	646
quality assurance engineer	602
software engineer	16
software developer	8
artificial intelligence	5
data analyst	5
programmer	4
machine learning engineer	4
desktop application developer	3
not sure	3
not sure yet	3
project manager	3
machine learning	2
product designer	2
web designer	2
full stack developer	2
research	2
ethical hacker	2
user interface designer	2
researcher	2
business analyst	2
bioinformatics	2
undecided	2
unsure	2
java developer	2
artificial intelligence engineer	2
python developer	1
quant (algorithmic trader)	1
project management	1
remote support	1
research and education	1
real-time systems	1
philosopher	1
programming	1
program manager	1
mobile development	1
natural language processing	1
network	1
network admin	1

In [28]:

# Career interest frequency
group_category = category_count.groupby("Interests").sum().sort_values(by = "Count", ascending= False).head(50)

# Plot results
fig, ax = plt.subplots(figsize = (10,8))
plt.barh(group_category.index[:15], group_category["Count"][:15], height = .6, color = "grey")

# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)

# Invert data, x ticks to top
plt.gca().invert_yaxis()
ax.xaxis.tick_top()

# Title
plt.title("Career Interests", size = 20, loc = "left", x = -0.28, y = 1.08)

# X label
plt.text(-1950, -1.6,"Frequency", size = 14, color = "grey")

plt.show()

After some data cleaning the result is not perfect, but we can tell that interests range widely, from web development (the most common) to data science, game development, and many others.

Although interests are mixed, this shows that individuals may be interested in topics other than web development. We also see that some individuals responded with different versions of "I don't know". While it would be possible to remove rows with these answers, there are so few that doing so is unlikely to affect our analysis either way.
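To put a number on "so few", the sketch below takes six rows from the cleaned count table above and computes the share held by the undecided-style answers. The `undecided_labels` list is a hypothetical choice of which labels to treat as "I don't know", and the share is relative to these six rows only, so the true share across all categories is smaller still.

```python
import pandas as pd

# Six rows copied from the cleaned count table above
counts = pd.DataFrame({
    "Interests": [
        "full-stack web developer", "front-end web developer",
        "not sure", "not sure yet", "undecided", "unsure",
    ],
    "Count": [6769, 4912, 3, 3, 2, 2],
})

# Hypothetical list of labels treated as "I don't know"
undecided_labels = ["not sure", "not sure yet", "undecided", "unsure"]
is_undecided = counts["Interests"].isin(undecided_labels)

# Share of these rows' selections that are undecided, in percent
share = counts.loc[is_undecided, "Count"].sum() / counts["Count"].sum() * 100

# Table with the undecided rows removed
decided = counts[~is_undecided]
print(round(share, 3))
```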

Age and Gender¶

In [29]:

# Gender frequency (Freecodecamp)
genders = survey["Gender"].value_counts(normalize=True, dropna=False) * 100

# Plot results
fig, ax = plt.subplots(figsize = (12, 8))
genders.plot(kind = "bar", color = "grey", width = .58)

# Title
plt.title("Gender representation (FreeCodeCamp)", size = 19, loc = "left", x = -0.1, y = 1.02)

# Remove spines
plt.gca().spines[["top", "left", "right"]].set_visible(False)

# X and Y labels
plt.ylabel("Frequency (percent)", color = "grey", size = 14, loc = "top")
plt.xlabel("Gender", color = "grey", size = 14, loc = "left")

# X and Y ticks
plt.yticks(size = 12)
plt.xticks(rotation = 0, size = 12)

plt.show()

Next we introduce a similar survey conducted in 2018 by Stack Overflow, part of the Stack Exchange network and a popular site for asking and answering programming questions; the plots below label it "Stack Exchange". We'll perform data cleaning on this dataset shortly, but first we can get an overview of its contents and how its demographics compare to FreeCodeCamp's.
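When comparing a categorical column across two surveys, it can help to put both distributions in one table before plotting them separately. The sketch below uses invented answers, and it assumes the two surveys capitalize their labels differently, which is why the labels are lower-cased first. `gender_share` is an illustrative helper, not a function from this notebook.

```python
import pandas as pd

# Invented answers; the differing capitalization is an assumption for illustration
fcc = pd.Series(["male", "female", "male", None, "male"])
stack = pd.Series(["Male", "Male", "Female", "Male", None, "Male"])

def gender_share(answers):
    """Percentage of each answer, with missing answers kept as their own group."""
    labelled = answers.fillna("no answer").str.lower()
    return labelled.value_counts(normalize=True) * 100

# Align both distributions on the answer label, one column per survey
comparison = pd.concat(
    {"FreeCodeCamp": gender_share(fcc), "Stack Exchange": gender_share(stack)},
    axis=1,
)
print(comparison.round(1))
```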

In [30]:

# Gender frequency (Stack Exchange)
genders_stk_exchange = exchange["Gender"].value_counts(normalize=True, dropna=False) * 100

# Plot results
fig, ax = plt.subplots(figsize = (12, 8))
genders_stk_exchange[:3].plot(kind = "bar", color = "grey", width = .57)

# Title
plt.title("Gender representation (Stack Exchange)", size = 19, loc = "left", x = -0.1, y = 1.02)

# Remove spines
plt.gca().spines[["top", "left", "right"]].set_visible(False)

# X and Y labels
plt.ylabel("Frequency (percent)", color = "grey", size = 14, loc = "top")
plt.xlabel("Gender", color = "grey", size = 14, loc = "left")

# X and Y ticks
plt.yticks(size = 12)
plt.xticks(rotation = 0, size = 12)

plt.show()

In [31]:

# Age distribution plotted
fig, ax = plt.subplots(figsize = (12,8))
survey["Age"].hist(bins = 20, color = "grey")

# Title
plt.title("Age Groups (FreeCodeCamp)", size = 19, loc = "left", x = -0.1, y = 1.02)

# Remove gridlines
ax.grid(False)

# Remove spines
plt.gca().spines[["right","top"]].set_visible(False)

# X and Y labels
plt.ylabel("# of observations", color = "grey", size = 14, loc = "top")
plt.xlabel("Age", color = "grey", size = 14, loc = "left")

# X and Y ticks
plt.yticks(size = 12)
plt.xticks(size = 12)

# Text
plt.text(32.5,2700,"Most new programmers\nare in their early 20s to early 30s", size = 14, color = "maroon")

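# Highlight the main demographic: the interquartile range (25th to 75th percentile) of age
# (ymax is a fraction of the axes height, so the default of 1 already spans the full height)
plt.axvspan(survey["Age"].quantile(0.25), survey["Age"].quantile(0.75), color = "maroon", alpha = 0.4)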

plt.show()

In [32]:

# Stack exchange age groups

# Color assignment
colors = ["grey","grey", "maroon", "grey", "grey", "grey"]

# Plot results
fig, ax = plt.subplots(figsize = (12, 8))
ages = exchange["Age"].value_counts().iloc[[4,1,0,2,3,5]].plot.bar(width = 0.65, color = colors)

# Remove spines
plt.gca().spines[["top", "left", "right"]].set_visible(False)

# Title
plt.title("Age Groups (Stack Exchange)", size = 19, loc = "left",x = -0.1, y = 1.02)

# X and Y labels
plt.ylabel("# of observations", color = "grey", size = 14, loc = "top")
plt.xlabel("Age", color = "grey", size = 14, loc = "left")

# X and Y ticks
plt.yticks(size = 12, color = "grey")
plt.xticks(size = 11, rotation = 0, color = "grey")

# Most frequent age group highlighted
plt.gca().get_xticklabels()[2].set_color("maroon")

plt.show()

Country Representation¶

In [33]:

# Freecodecamp countries
# Country frequency (freecodecamp)
countries = survey["CountryLive"].value_counts(normalize=True) * 100
# Frequency table to dataframe with named columns
# (naming the index and values explicitly avoids depending on how value_counts labels its result)
countries = countries.rename_axis("Country").reset_index(name="Percentage")

#------------------------------------------------------------------------------------------------#

# Stack Exchange Countries
# Country frequency (Stack Exchange)
countries_stack = exchange["Country"].value_counts(normalize=True) * 100
# Frequency table to dataframe with named columns
# (naming the index and values explicitly avoids depending on how value_counts labels its result)
countries_stack = countries_stack.rename_axis("Country").reset_index(name="Percentage")

#---------------------------------------------------------------------------------------------------#

# Plot results (FreeCodeCamp)

# Color assignment
colors = ["maroon","maroon","maroon","maroon","grey","grey","grey","grey","grey","grey"]

fig, ax = plt.subplots(figsize = (10, 8))
plt.barh(countries["Country"][:10], countries["Percentage"][:10], color = colors, height= 0.65)

# Title
plt.title("Country Representation (FreeCodeCamp)", loc = "left", size = 18, x = -0.3, y = 1.08)

# Invert data, x ticks to top
plt.gca().invert_yaxis()
ax.xaxis.tick_top()

# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)

# Text
plt.text(-15.2, -1.4,"Frequency (in percent)", size = 14, color = "grey")

# X and Y ticks
plt.xticks(size = 13, color = "grey")
plt.yticks(size = 14, color = "grey")

# Top 4 countries highlighted
plt.gca().get_yticklabels()[0].set_color("maroon")
plt.gca().get_yticklabels()[1].set_color("maroon")
plt.gca().get_yticklabels()[2].set_color("maroon")
plt.gca().get_yticklabels()[3].set_color("maroon")

plt.show()


# Plot results (Stack Exchange)

# Color Assignment
colors = ["maroon","maroon","#D6A0A9","maroon","maroon","grey","grey","grey","grey","grey"]

fig, ax = plt.subplots(figsize = (10, 8))
plt.barh(countries_stack["Country"][:10], countries_stack["Percentage"][:10], color = colors, height= 0.6)

# Title
plt.title("Country Representation (Stack Exchange)", loc = "left", size = 18, x = -0.23, y = 1.09)

# Invert data, x ticks to top
plt.gca().invert_yaxis()
ax.xaxis.tick_top()

# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)

# Text
plt.text(-4.9, -1.4,"Frequency (in percent)", size = 14, color = "grey")

# X and Y ticks
plt.xticks(size = 13, color = "grey")
plt.yticks(size = 14, color = "grey")

# Highlight top 5 countries
plt.gca().get_yticklabels()[0].set_color("maroon")
plt.gca().get_yticklabels()[1].set_color("maroon")
plt.gca().get_yticklabels()[2].set_color("#D6A0A9") # Germany
plt.gca().get_yticklabels()[3].set_color("maroon")
plt.gca().get_yticklabels()[4].set_color("maroon")

plt.show()

Education levels¶

In [34]:

# FreeCodeCamp
# School degree frequency (Freecodecamp)
code_camp_edu = survey["SchoolDegree"].value_counts(normalize=True) * 100
# Frequency table to dataframe
code_camp_edu = pd.Series.to_frame(code_camp_edu).reset_index()
# Rename dataframe columns
code_camp_edu = code_camp_edu.rename(columns={"index":"School Degree","SchoolDegree":"Percentage"})

# Color assignment
colors = ["maroon","maroon","grey","grey","grey","grey","grey","grey","grey","grey"]

# Plot results
fig, ax = plt.subplots(figsize = (10, 8))
plt.barh(code_camp_edu["School Degree"][:10], code_camp_edu["Percentage"][:10], color = colors, height= 0.62)

# Title
plt.title("School Degree Representation (FreeCodeCamp)", loc = "left", size = 18, x = -0.52, y = 1.1)

# Y label
plt.ylabel("School Degree", loc = "top", size = 14, color = "grey")

# Invert data, x ticks to top
plt.gca().invert_yaxis()
ax.xaxis.tick_top()

# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)

# Text
plt.text(-12, -1.6,"Frequency (in percent)", size = 14, color = "grey")

# X and Y ticks
plt.xticks(size = 13, color = "grey")
plt.yticks(size = 14, color = "grey")

# Highlight top 2 degrees
plt.gca().get_yticklabels()[0].set_color("maroon")
plt.gca().get_yticklabels()[1].set_color("maroon")

plt.show()


# Stack Exchange
# Replace string values
exchange["FormalEducation"] = exchange["FormalEducation"].replace({"Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)":"High School"})

# School degree frequency (Stack Exchange)
stk_exchange_edu = exchange["FormalEducation"].value_counts(normalize=True) * 100
# Frequency table to dataframe
stk_exchange_edu = pd.Series.to_frame(stk_exchange_edu).reset_index()
# Rename dataframe columns
stk_exchange_edu = stk_exchange_edu.rename(columns={"index":"School Degree","FormalEducation":"Percentage"})

# Color assignment
colors = ["maroon","maroon","grey","grey","grey","grey","grey","grey","grey","grey"]

# Plot results
fig, ax = plt.subplots(figsize = (10, 8))
plt.barh(stk_exchange_edu["School Degree"], stk_exchange_edu["Percentage"], color = colors, height= 0.62)

# Title
plt.title("School Degree Representation (Stack Exchange)", loc = "left", size = 18, x = -0.72, y = 1.1)

# Y label
plt.ylabel("School Degree", loc = "top", size = 14, color = "grey")

# Invert data, x ticks to top
plt.gca().invert_yaxis()
ax.xaxis.tick_top()

# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)

# Text
plt.text(-12, -1.5,"Frequency (in percent)", size = 14, color = "grey")

# X and Y ticks
plt.xticks(size = 13, color = "grey")
plt.yticks(size = 14, color = "grey")

# Highlight top 2 degrees
plt.gca().get_yticklabels()[0].set_color("maroon")
plt.gca().get_yticklabels()[1].set_color("maroon")
plt.show()

Thus far we have done the following:

Read in the dataframes
Investigated missing data
Selected appropriate columns for analysis
Merged 2016 and 2017 Freecodecamp surveys into one dataframe
Filtered out all observations that are missing JobRoleInterest input
Plotted the frequency of JobRoleInterest categories
Plotted the frequency of age and genders for both dataframes (freecodecamp & stack exchange)
Plotted the frequency of countries for both dataframes
Plotted education levels
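
The frequency tables behind the plots listed above all follow the same pattern: count the categories, convert to percentages, and turn the result into a two-column dataframe. Below is a minimal sketch of that pattern on made-up data. The toy `degrees` series and the helper name `percent_table` are assumptions for illustration, not part of the original analysis. The sketch names both columns explicitly with `rename_axis` and `reset_index(name=...)`, so it does not rely on the default column labels, which differ between pandas 1.x and 2.x.

```python
import pandas as pd

def percent_table(series, label):
    """Return a dataframe of category percentages, most frequent first."""
    return (
        series.value_counts(normalize=True)  # proportions; NaN is excluded by default
        .mul(100)                            # convert proportions to percent
        .rename_axis(label)                  # name the category column
        .reset_index(name="Percentage")      # name the value column
    )

# Hypothetical toy data standing in for a survey column
degrees = pd.Series(["bachelor's", "bachelor's", "master's", "high school", None])
table = percent_table(degrees, "School Degree")
print(table)
```

The resulting `table` can be passed to `plt.barh` in the same way as the dataframes used in the cells above.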

Both datasets share a similar distribution of age and gender. Among new programmers, men make up the majority of respondents (70%, with women at 20%).

The Stack Exchange survey consists primarily of respondents in STEM careers, and its gender gap is even more pronounced: men represent nearly 60% of respondents, NaNs (unknown or missing data) roughly 35%, and women only around 5%.

Age distribution is roughly the same too. New programmers are most likely to be in their early 20s to early 30s, and stack exchange survey participants are usually 25 to 34 years old.

Country representation is about the same in both surveys. A majority of survey participants are from the United States, followed by India, in both surveys. The countries with the highest participation are English-speaking countries (except for Germany in the Stack Exchange survey).

Bachelor's degrees are the most common degree held by respondents from both surveys.

We've seen a high level overview of the data. To provide customers with the most relevant training possible, we need to discover why people decide to learn a new skill like programming.

We'll present several charts and data that we believe support the idea that new programmers are motivated by income and career opportunities. Although only 40% of respondents answered the JobRoleInterest question, 13,495 observations is more than enough to get a representative sample. Respondents are interested in many different career paths that use programming and tech skills.

Job Benefits and Satisfaction¶

Participants were asked the following questions regarding employment opportunities:

"Imagine that you are assessing a potential job opportunity. Please rank the following aspects of the job opportunity in order of importance, where 1 is the most important and 10 is the least important."
"Now, imagine you are assessing a job's benefits package. Please rank the following aspects of a job's benefits package from most to least important to you, where 1 is most important and 11 is least important."

After averaging the rankings for each job aspect and benefit, the most important items should have the lowest average scores (since 1 is most important and 10 or 11 is least important). Before this calculation, we'll perform a bit of data cleaning on the Stack Exchange dataset.
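
To make the scoring concrete, here is a small sketch with invented rankings from three respondents. The `ranks` dataframe and its values are hypothetical and only the column names are borrowed from the renamed survey columns. Averaging each column and sorting in ascending order puts the most important item first.

```python
import pandas as pd

# Hypothetical rankings from three respondents (1 = most important)
ranks = pd.DataFrame({
    "Compensation": [1, 2, 1],
    "Company_culture": [2, 1, 3],
    "Company_diversity": [3, 3, 2],
})

# Average rank per column; missing answers would be skipped by default
mean_rank = ranks.mean(axis=0).sort_values()

# The lowest average rank marks the item respondents value most
most_important = mean_rank.index[0]
print(mean_rank)
print(most_important)
```

In this toy data, `Compensation` has the lowest average rank, so it is treated as the most important item.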

In [35]:

# Rename current job related columns from stack exchange dataset
# Currency related columns
currency = exchange.columns[51:56].tolist()

# Columns up to index 38
columns = exchange.columns[:38].tolist()

# Age and gender columns
columns.extend(["Gender", "Age"])

# Add currency related columns to list
columns.extend(currency)

# Isolates dataframe down to columns from list "columns"
stk_exchange = exchange[columns].copy()

# Rename job aspects and job benefits columns for easier comprehension
rename_cols = {
                "AssessJob1":"Industry_working_in",
                "AssessJob2":"Company_funding",
                "AssessJob3":"Department_working_in",
                "AssessJob4":"Technologies/Frameworks",
                "AssessJob5":"Compensation_and_benefits",
                "AssessJob6":"Company_culture",
                "AssessJob7":"WFH",
                "AssessJob8":"Professional_development",
                "AssessJob9":"Company_diversity",
                "AssessJob10":"Product_impact",
                "AssessBenefits1":"Compensation",
                "AssessBenefits2":"Stock_options",
                "AssessBenefits3":"Health_insurance",
                "AssessBenefits4":"Parental_leave",
                "AssessBenefits5":"Fitness_wellness_benefit",
                "AssessBenefits6":"Retirement",
                "AssessBenefits7":"Meals/snacks",
                "AssessBenefits8":"Computer/office_equipment",
                "AssessBenefits9":"Childcare_benefit",
                "AssessBenefits10":"Transportaion_benefit",
                "AssessBenefits11":"Conference/education_budget"
                }

exchange = exchange.rename(columns=rename_cols)

# Isolate rows only containing following countries listed below
stk_countries = stk_exchange[stk_exchange["Country"].str.contains("United States|India|United Kingdom|Canada", na = False)]
len(stk_countries["Country"])

Out[35]:

In [36]:

exchange

Out[36]:

	Respondent	Hobby	OpenSource	Country	Student	Employment	FormalEducation	UndergradMajor	CompanySize	DevType	YearsCoding	YearsCodingProf	JobSatisfaction	CareerSatisfaction	HopeFiveYears	JobSearchStatus	LastNewJob	Industry_working_in	Company_funding	Department_working_in	Technologies/Frameworks	Compensation_and_benefits	Company_culture	WFH	Professional_development	Company_diversity	Product_impact	Compensation	Stock_options	Health_insurance	Parental_leave	Fitness_wellness_benefit	Retirement	Meals/snacks	Computer/office_equipment	Childcare_benefit	Transportaion_benefit	Conference/education_budget	JobContactPriorities1	JobContactPriorities2	JobContactPriorities3	JobContactPriorities4	JobContactPriorities5	JobEmailPriorities1	JobEmailPriorities2	JobEmailPriorities3	JobEmailPriorities4	JobEmailPriorities5	JobEmailPriorities6	JobEmailPriorities7	UpdateCV	Currency	Salary	SalaryType	ConvertedSalary	CurrencySymbol	CommunicationTools	TimeFullyProductive	EducationTypes	SelfTaughtTypes	TimeAfterBootcamp	HackathonReasons	AgreeDisagree1	AgreeDisagree2	AgreeDisagree3	LanguageWorkedWith	LanguageDesireNextYear	DatabaseWorkedWith	DatabaseDesireNextYear	PlatformWorkedWith	PlatformDesireNextYear	FrameworkWorkedWith	FrameworkDesireNextYear	IDE	OperatingSystem	NumberMonitors	Methodology	VersionControl	CheckInCode	AdBlocker	AdBlockerDisable	AdBlockerReasons	AdsAgreeDisagree1	AdsAgreeDisagree2	AdsAgreeDisagree3	AdsActions	AdsPriorities1	AdsPriorities2	AdsPriorities3	AdsPriorities4	AdsPriorities5	AdsPriorities6	AdsPriorities7	AIDangerous	AIInteresting	AIResponsible	AIFuture	EthicsChoice	EthicsReport	EthicsResponsible	EthicalImplications	StackOverflowRecommend	StackOverflowVisit	StackOverflowHasAccount	StackOverflowParticipate	StackOverflowJobs	StackOverflowDevStory	StackOverflowJobsRecommend	StackOverflowConsiderMember	HypotheticalTools1	HypotheticalTools2	HypotheticalTools3	HypotheticalTools4	HypotheticalTools5	WakeTime	HoursComputer	HoursOutside	SkipMeals	ErgonomicDevices	Exercise	Gender	
SexualOrientation	EducationParents	RaceEthnicity	Age	Dependents	MilitaryUS	SurveyTooLong	SurveyEasy
0	1	Yes	No	Kenya	No	Employed part-time	Bachelor’s degree (BA, BS, B.Eng., etc.)	Mathematics or statistics	20 to 99 employees	Full-stack developer	3-5 years	3-5 years	Extremely satisfied	Extremely satisfied	Working as a founder or co-founder of my own c...	I’m not actively looking, but I am open to new...	Less than a year ago	10.0	7.0	8.0	1.0	2.0	5.0	3.0	4.0	9.0	6.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3.0	1.0	4.0	2.0	5.0	5.0	6.0	7.0	2.0	1.0	4.0	3.0	My job status or other personal status changed	NaN	NaN	Monthly	NaN	KES	Slack	One to three months	Taught yourself a new language, framework, or ...	The official documentation and/or standards fo...	NaN	To build my professional network	Strongly agree	Strongly agree	Neither Agree nor Disagree	JavaScript;Python;HTML;CSS	JavaScript;Python;HTML;CSS	Redis;SQL Server;MySQL;PostgreSQL;Amazon RDS/A...	Redis;SQL Server;MySQL;PostgreSQL;Amazon RDS/A...	AWS;Azure;Linux;Firebase	AWS;Azure;Linux;Firebase	Django;React	Django;React	Komodo;Vim;Visual Studio Code	Linux-based	1	Agile;Scrum	Git	Multiple times per day	Yes	No	NaN	Strongly agree	Strongly agree	Strongly agree	Saw an online advertisement and then researche...	1.0	5.0	4.0	7.0	2.0	6.0	3.0	Artificial intelligence surpassing human intel...	Algorithms making important decisions	The developers or the people creating the AI	I'm excited about the possibilities more than ...	No	Yes, and publicly	Upper management at the company/organization	Yes	10 (Very Likely)	Multiple times per day	Yes	I have never participated in Q&A on Stack Over...	No, I knew that Stack Overflow had a jobs boar...	Yes	NaN	Yes	Extremely interested	Extremely interested	Extremely interested	Extremely interested	Extremely interested	Between 5:00 - 6:00 AM	9 - 12 hours	1 - 2 hours	Never	Standing desk	3 - 4 times per week	Male	Straight or heterosexual	Bachelor’s degree (BA, BS, B.Eng., etc.)	Black or of African descent	25 - 34 years old	Yes	NaN	The survey was an appropriate length	Very easy
1	3	Yes	Yes	United Kingdom	No	Employed full-time	Bachelor’s degree (BA, BS, B.Eng., etc.)	A natural science (ex. biology, chemistry, phy...	10,000 or more employees	Database administrator;DevOps specialist;Full-...	30 or more years	18-20 years	Moderately dissatisfied	Neither satisfied nor dissatisfied	Working in a different or more specialized tec...	I am actively looking for a job	More than 4 years ago	1.0	7.0	10.0	8.0	2.0	5.0	4.0	3.0	6.0	9.0	1.0	5.0	3.0	7.0	10.0	4.0	11.0	9.0	6.0	2.0	8.0	3.0	1.0	5.0	2.0	4.0	1.0	3.0	4.0	5.0	2.0	6.0	7.0	I saw an employer’s advertisement	British pounds sterling (£)	51000	Yearly	70841.0	GBP	Confluence;Office / productivity suite (Micros...	One to three months	Taught yourself a new language, framework, or ...	The official documentation and/or standards fo...	NaN	NaN	Agree	Agree	Neither Agree nor Disagree	JavaScript;Python;Bash/Shell	Go;Python	Redis;PostgreSQL;Memcached	PostgreSQL	Linux	Linux	Django	React	IPython / Jupyter;Sublime Text;Vim	Linux-based	2	NaN	Git;Subversion	A few times per week	Yes	Yes	The website I was visiting asked me to disable it	Somewhat agree	Neither agree nor disagree	Neither agree nor disagree	NaN	3.0	5.0	1.0	4.0	6.0	7.0	2.0	Increasing automation of jobs	Increasing automation of jobs	The developers or the people creating the AI	I'm excited about the possibilities more than ...	Depends on what it is	Depends on what it is	Upper management at the company/organization	Yes	10 (Very Likely)	A few times per month or weekly	Yes	A few times per month or weekly	Yes	No, I have one but it's out of date	7	Yes	A little bit interested	A little bit interested	A little bit interested	A little bit interested	A little bit interested	Between 6:01 - 7:00 AM	5 - 8 hours	30 - 59 minutes	Never	Ergonomic keyboard or mouse	Daily or almost every day	Male	Straight or heterosexual	Bachelor’s degree (BA, BS, B.Eng., etc.)	White or of European descent	35 - 44 years old	Yes	NaN	The survey was an appropriate length	Somewhat easy
2	4	Yes	Yes	United States	No	Employed full-time	Associate degree	Computer science, computer engineering, or sof...	20 to 99 employees	Engineering manager;Full-stack developer	24-26 years	6-8 years	Moderately satisfied	Moderately satisfied	Working as a founder or co-founder of my own c...	I’m not actively looking, but I am open to new...	Less than a year ago	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	5	No	No	United States	No	Employed full-time	Bachelor’s degree (BA, BS, B.Eng., etc.)	Computer science, computer engineering, or sof...	100 to 499 employees	Full-stack developer	18-20 years	12-14 years	Neither satisfied nor dissatisfied	Slightly dissatisfied	Working as a founder or co-founder of my own c...	I’m not actively looking, but I am open to new...	Less than a year ago	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A recruiter contacted me	U.S. dollars ($)	NaN	NaN	NaN	NaN	NaN	Three to six months	Completed an industry certification program (e...	The official documentation and/or standards fo...	NaN	NaN	Disagree	Disagree	Strongly disagree	C#;JavaScript;SQL;TypeScript;HTML;CSS;Bash/Shell	C#;JavaScript;SQL;TypeScript;HTML;CSS;Bash/Shell	SQL Server;Microsoft Azure (Tables, CosmosDB, ...	SQL Server;Microsoft Azure (Tables, CosmosDB, ...	Azure	Azure	NaN	Angular;.NET Core;React	Visual Studio;Visual Studio Code	Windows	2	Agile;Kanban;Scrum	Git	Multiple times per day	Yes	Yes	The ad-blocking software was causing display i...	Neither agree nor disagree	Somewhat agree	Somewhat agree	Stopped going to a website because of their ad...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Artificial intelligence surpassing human intel...	Artificial intelligence surpassing human intel...	A governmental or other regulatory body	I don't care about it, or I haven't thought ab...	No	Yes, but only within the company	Upper management at the company/organization	Yes	10 (Very Likely)	A few times per week	Yes	A few times per month or weekly	Yes	No, I have one but it's out of date	8	Yes	Somewhat interested	Somewhat interested	Somewhat interested	Somewhat interested	Somewhat interested	Between 6:01 - 7:00 AM	9 - 12 hours	Less than 30 minutes	3 - 4 times per week	NaN	I don't typically exercise	Male	Straight or heterosexual	Some college/university study without earning ...	
White or of European descent	35 - 44 years old	No	No	The survey was an appropriate length	Somewhat easy
4	7	Yes	No	South Africa	Yes, part-time	Employed full-time	Some college/university study without earning ...	Computer science, computer engineering, or sof...	10,000 or more employees	Data or business analyst;Desktop or enterprise...	6-8 years	0-2 years	Slightly satisfied	Moderately satisfied	Working in a different or more specialized tec...	I’m not actively looking, but I am open to new...	Between 1 and 2 years ago	8.0	5.0	7.0	1.0	2.0	6.0	4.0	3.0	10.0	9.0	1.0	10.0	2.0	4.0	8.0	3.0	11.0	7.0	5.0	9.0	6.0	2.0	1.0	4.0	5.0	3.0	7.0	3.0	6.0	2.0	1.0	4.0	5.0	My job status or other personal status changed	South African rands (R)	260000	Yearly	21426.0	ZAR	Office / productivity suite (Microsoft Office,...	Three to six months	Taken a part-time in-person course in programm...	The official documentation and/or standards fo...	NaN	NaN	Strongly agree	Agree	Strongly disagree	C;C++;Java;Matlab;R;SQL;Bash/Shell	Assembly;C;C++;Matlab;SQL;Bash/Shell	SQL Server;PostgreSQL;Oracle;IBM Db2	PostgreSQL;Oracle;IBM Db2	Arduino;Windows Desktop or Server	Arduino;Windows Desktop or Server	NaN	NaN	Notepad++;Visual Studio;Visual Studio Code	Windows	2	Evidence-based software engineering;Formal sta...	Zip file back-ups	Weekly or a few times per month	No	NaN	NaN	Somewhat agree	Somewhat agree	Somewhat disagree	Clicked on an online advertisement;Saw an onli...	2.0	3.0	4.0	6.0	1.0	7.0	5.0	Algorithms making important decisions	Algorithms making important decisions	The developers or the people creating the AI	I'm excited about the possibilities more than ...	No	Yes, but only within the company	Upper management at the company/organization	Yes	10 (Very Likely)	Daily or almost daily	Yes	Less than once per month or monthly	No, I knew that Stack Overflow had a jobs boar...	
No, I know what it is but I don't have one	NaN	Yes	Extremely interested	Extremely interested	Extremely interested	Extremely interested	Extremely interested	Before 5:00 AM	Over 12 hours	1 - 2 hours	Never	NaN	3 - 4 times per week	Male	Straight or heterosexual	Some college/university study without earning ...	White or of European descent	18 - 24 years old	Yes	NaN	The survey was an appropriate length	Somewhat easy
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
98850	101513	Yes	Yes	United States	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
98851	101531	No	Yes	Spain	Yes, full-time	Not employed, but looking for work	NaN	NaN	NaN	Back-end developer;Front-end developer	0-2 years	0-2 years	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
98852	101541	Yes	Yes	India	Yes, full-time	Employed full-time	Bachelor’s degree (BA, BS, B.Eng., etc.)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
98853	101544	Yes	No	Russian Federation	No	Independent contractor, freelancer, or self-em...	Some college/university study without earning ...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
98854	101548	Yes	Yes	Cambodia	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

98855 rows × 129 columns

These benefits and aspects were rated by current employees working in STEM fields, so we have to be careful not to assume these ratings apply directly to the new programmers who participated in FreeCodeCamp's survey (many of whom do not work in software/tech jobs).

However, if FreeCodeCamp had asked the same questions, it's probable that we would see similar results. Therefore, if we use the Stack Exchange survey as a proxy, compensation and health insurance are the most important benefits to job applicants and to those interested in switching jobs. Some of the least important benefits include childcare, parental leave, and a fitness/wellness benefit.

Job aspects describe how job candidates view a potential job opportunity, and the particular make-up of an organization. Respondents rated pay and benefits (which for some reason is listed as a benefit and an aspect), the technologies or programs used, career mobility, and the company culture higher than other aspects.

In [37]:

# Slice dataset to contain only job aspect columns
job_assessment = exchange.iloc[:,17:27]

# Average each job aspect column; sorted so the highest average (least important) comes first
assessments = job_assessment.mean(axis=0).sort_values(ascending=False).to_frame()
# Assign index name
assessments.index.name = "Aspects"

#---------------------------------------------------------------------------------------------------------------------------------#
# Slice dataset to contain only job benefit columns
benefits = exchange.iloc[:,27:38]
# Average each job benefit column; sorted so the highest average (least important) comes first
job_benefits = benefits.mean(axis=0).sort_values(ascending=False).to_frame()
# Assign index name
job_benefits.index.name = "Benefits"

#----------------------------------------------------------------------------------------------------------------------------------#

# Plot results
# If assessing a job's benefits package, rate importance of job benefits from 1 (most important) to 11 (least important)
# Color assignment
colors = ["grey","grey","grey","grey","grey","grey","grey","grey","grey","maroon","maroon"]

fig, ax = plt.subplots(figsize = (8, 6))
plt.barh(job_benefits.index, job_benefits[0], color = colors, height= 0.62)

# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)

# X axis top
ax.xaxis.tick_top()

# Title
plt.title("Job benefits", size = 19, loc = "left", x= -0.35, y = 1.16)

# Text
plt.text(-3.2,12,"Rating (1 most important, 11 least important), average", color = "grey", size = 14)

# X and Y ticks
plt.yticks(size = 14, color = "grey")
plt.xticks(size = 13, color = "grey")

# Highlight top 2 benefits
plt.gca().get_yticklabels()[-1].set_color("maroon")
plt.gca().get_yticklabels()[-2].set_color("maroon")
plt.show()

# Plot results
# If looking for a new job, rate importance of job aspects from 1(most important) to 10(least important)
# Color assignment
colors = ["grey","grey","grey","grey","grey","grey","maroon","maroon","maroon","maroon"]

fig, ax = plt.subplots(figsize = (8, 6))
plt.barh(assessments.index, assessments[0], color = colors, height= 0.6)

# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)

# X axis top
ax.xaxis.tick_top()

# Title
plt.title("Job aspects", size = 19, loc = "left", x= -0.4, y = 1.16)

# Text
plt.text(-3.2,11,"Rating (1 most important, 10 least important), average", color = "grey", size = 14)

# X and Y ticks
plt.yticks(size = 14, color = "grey")
plt.xticks(size = 13, color = "grey")

# Highlight top 4 job aspects
plt.gca().get_yticklabels()[-1].set_color("maroon")
plt.gca().get_yticklabels()[-2].set_color("maroon")
plt.gca().get_yticklabels()[-3].set_color("maroon")
plt.gca().get_yticklabels()[-4].set_color("maroon")

plt.show()

Income/Financial Situations¶

Income: Respondents were asked their current yearly income.

ExpectedEarning: "About how much money do you expect to earn per year at your first developer job, in US dollars?"

Has Debt: The question asked was "Do you have any debt?"

In a high-level overview, we'll see that the median and average income of new programmers is less than $50,000 (US dollars). We'll also see that new programmers expect to earn about $15,000 to $20,000 more in their new tech/software careers than they currently earn.

In [38]:

# Income distribution
# Plot results
fig, ax = plt.subplots(figsize = (14,10))
survey["Income"].plot.hist(bins = 120, color = "grey", xlim = (0,250000))

# Remove spines
plt.gca().spines[["right","top"]].set_visible(False)

# Title
plt.title("Income distribution of survey respondents\n(All countries)",loc= "left", size = 18, y = 1.02)

# Average and median income
plt.axvline(survey["Income"].mean(), color = "red", alpha = 0.5, linewidth = 3)
plt.axvline(survey["Income"].median(), color = "blue", alpha = 0.5, linewidth = 3)

# Misc. Text
plt.text(41000, 850, " Average \n Income", size = 15, color = "red")
plt.text(14000, 850, " Median \n Income", size = 15, color = "blue")

# X and Y labels
plt.ylabel("Frequency",size = 15, loc = "top", color ="grey")
plt.xlabel("Income, Yearly (US dollars)", size = 15, loc = "left", color ="grey")

# X and Y ticks
plt.yticks(size = 14)
plt.xticks(size = 13)

plt.show()

In [39]:

# Difference between current income and expected income
fig, ax = plt.subplots(figsize = (13,10))

# Freecodecamp survey expected earning distribution
survey["ExpectedEarning"].plot.kde(xlim = (0, 200000), color = "#ED7E00", linewidth = 3)

# Freecodecamp survey current income distribution
survey["Income"].plot.kde(color = "#4B86C1", linewidth = 3)

# Title
plt.title("Current earnings vs. Expected earnings\n(All countries)", size = 18, loc = "left", y = 1.05)

# X and Y ticks
plt.xticks(size = 14)
plt.yticks(size = 14)

# X and Y labels
plt.ylabel("Density (Probability)", size = 14, color = "grey", loc = "top")
plt.xlabel("Income, Yearly (US dollars)", size = 14, color = "grey", loc = "left")

# Remove spines
plt.gca().spines[["right", "top"]].set_visible(False)

# Misc. text
plt.text(x = 0.01, y = 0.84, s="Income: Freecodecamp", color = "#4B86C1", size = 13, transform=ax.transAxes)
plt.text(x = 0.29, y = .90, s="Desired Income", color = "#ED7E00", size = 13, transform=ax.transAxes)
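# Dollar signs are escaped with a literal backslash so matplotlib does not read them as math delimiters
plt.text(0.55,0.85,"""Typically, survey participants expect to earn\n\\$15,000 to \\$20,000 more in their new career,
compared to their current income""", color = "grey", size = 14, transform=ax.transAxes)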

# X and Y ticks
plt.yticks(size = 13)
plt.xticks(size = 13)

plt.show()

We can find each person's desired salary increase (relative to their current income, as a percentage) by utilizing the following formula:

Increase = New Number - Original Number

% increase = (Increase / Original Number) × 100

Rows with missing data in either column yield NaN in the new column that we create, and respondents who expect to earn less than their current income yield negative percentages. We won't drop any rows; instead we'll disregard any percentages below 0 when interpreting the results.

We'll notice that, most often, respondents desire a salary increase in the range of 0% to 120%.
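
As a quick check of the formula, here is a small sketch in plain Python. The function name `percent_increase` and the salary figures are made up for illustration. Raising an error for an original value of 0 is a design choice of this sketch, since the percentage is undefined in that case.

```python
def percent_increase(original, new):
    """Percent change from `original` to `new`; negative when `new` is lower."""
    if original == 0:
        raise ValueError("percent increase is undefined for an original value of 0")
    increase = new - original
    return increase / original * 100

# Hypothetical respondent expecting a raise
print(percent_increase(40_000, 60_000))  # 50.0

# Hypothetical respondent expecting to earn less than today
print(percent_increase(40_000, 30_000))  # -25.0
```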

In [40]:

# Column creation using formula above
survey["Percent_Increase"] = (survey["ExpectedEarning"] - survey["Income"]) / survey["Income"] * 100

# Frequency distribution
survey["Percent_Increase"].value_counts(bins = 20, normalize= True) * 100

Out[40]:

(-115.647, 734.302]       40.192664
(734.302, 1567.585]        0.570582
(1567.585, 2400.867]       0.044461
(15733.384, 16566.667]     0.014820
(4067.432, 4900.714]       0.007410
(5733.996, 6567.279]       0.007410
(2400.867, 3234.149]       0.007410
(14066.82, 14900.102]      0.007410
(3234.149, 4067.432]       0.000000
(4900.714, 5733.996]       0.000000
(6567.279, 7400.561]       0.000000
(7400.561, 8233.843]       0.000000
(9067.126, 9900.408]       0.000000
(9900.408, 10733.69]       0.000000
(10733.69, 11566.973]      0.000000
(11566.973, 12400.255]     0.000000
(12400.255, 13233.537]     0.000000
(13233.537, 14066.82]      0.000000
(14900.102, 15733.384]     0.000000
(8233.843, 9067.126]       0.000000
Name: Percent_Increase, dtype: float64

In [41]:

fig, ax = plt.subplots(figsize = (13,9))

# Expected salary increase (in a percentage) histogram
survey[survey["Percent_Increase"] <= 500]["Percent_Increase"].plot.hist(bins = 15, color = "grey")
# Boolean masking ^^^ less than or equal to 500% ^^^

# Interquartile range (25th to 75th percentile)
plt.axvspan(survey["Percent_Increase"].quantile(0.25), survey["Percent_Increase"].quantile(0.75), color = "maroon", alpha = 0.4)

# Remove spines
plt.gca().spines[["right","top"]].set_visible(False)

# Title
plt.title("Desired salary increase (in percent)", loc="left", size = 20, y = 1.05)

# X and Y labels
plt.ylabel("Frequency", size = 15, color = "grey", loc = "top")
plt.xlabel("Percent Increase", size = 15, loc = "left", color = "grey")

# X and Y ticks
plt.xticks(size = 13)
plt.yticks(size = 13)

# Text
plt.text(126,1000,"Typical range of expected salary increase (in percent)", size = 14, color = "maroon")

plt.show()

Most respondents do not have financial dependents to care for, and fewer than half have debts to pay off.

In [42]:

# Replace the 1.0/0.0 codes in the following columns with "True"/"False" string labels
survey["HasDebt"] = survey["HasDebt"].replace({1.0:"True", 0.0: "False"})
survey["HasFinancialDependents"] = survey["HasFinancialDependents"].replace({1.0:"True", 0.0: "False"})

In [43]:

# Financial dependents
print("Financial Dependents:","\n", survey["HasFinancialDependents"].value_counts(normalize = True, dropna=False) * 100) 
print("\n")

# Has debt of any kind
print("Has Debt:", "\n", survey["HasDebt"].value_counts(normalize = True, dropna=False) * 100) 
print("\n")

Financial Dependents: 
 False    71.619118
True     21.074472
NaN       7.306410
Name: HasFinancialDependents, dtype: float64


Has Debt: 
 False    50.596517
True     41.948870
NaN       7.454613
Name: HasDebt, dtype: float64

Employment status¶

EmploymentStatus: "Regarding employment status, are you currently..." Respondents were asked to select their current employment status; options include not working, employed for wages, self-employed, and military. About half of respondents are working for income in some form. A smaller share did not answer, and the remaining participants are not working but looking for work, not working and not looking for work, or stay-at-home parents, among other categories.

"Employed for wages" is the most common employment status, but this group, together with "Military", has the lowest median hours spent learning per week (10). "Not working but looking for work" and "Self-employed freelancer" share the highest median (20 hours). Typically, respondents spend about 12 hours per week (median), or 1.7 hours per day, learning programming. We report the median rather than the weekly average because the data contains many extreme values, up to 168 hours per week, that significantly skew the distribution.

In [44]:

# Fills in missing data from hours learning column
survey["HoursLearning"] = survey["HoursLearning"].fillna(survey["HoursLearning"].median())

In [45]:

# Hours spent learning distribution
fig, ax = plt.subplots(figsize = (12, 6))
sns.boxplot(x= "HoursLearning", data = survey, color = "grey", medianprops=dict(color="maroon", alpha=1))

# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)

# Title
plt.title("Hours spent learning per week", loc="left", size = 20, y = 1.05)

# Misc. Text
plt.text(80,-0.04,"Outliers", size = 14)
plt.text(0, -0.42, "Median (12 hours/week)", size = 14, color = "maroon")
plt.text(100, -0.3, "75% of participants spent\n20 hours or less per week learning", size = 14, color = "grey")
plt.text(100, -0.15, "25% of participants spent\n7 hours or less per week learning", size = 14, color = "grey")

# X label
plt.xlabel("Hours", size = 15, loc = "left", color = "grey")

# X ticks
plt.xticks(size = 13, color = "grey")


plt.show()

# Print stats
print(survey["HoursLearning"].describe())

count    13495.000000
mean        16.955761
std         14.573179
min          0.000000
25%          7.000000
50%         12.000000
75%         20.000000
max        168.000000
Name: HoursLearning, dtype: float64
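
The gap between the mean (about 17 hours) and the median (12 hours) in the summary above is what a handful of very large values would produce. As a rough sketch, the snippet below applies the 1.5 × IQR rule (the default whisker length in seaborn's boxplot) to the printed quartiles, then shows on made-up values how a single extreme entry moves the mean but not the median. The series `toy_hours` is an assumption for illustration, not survey data.

```python
import pandas as pd

# Quartiles copied from the describe() output above
q1, q3 = 7.0, 20.0
iqr = q3 - q1

# Tukey's rule: points beyond 1.5 * IQR from the quartiles are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(f"Outlier fences: {lower_fence} to {upper_fence} hours per week")

# Made-up values (not survey data): one extreme entry among typical ones
toy_hours = pd.Series([5, 8, 10, 12, 12, 15, 20, 168])
print("mean:  ", toy_hours.mean())
print("median:", toy_hours.median())
```

Under this rule, any respondent reporting more than 39.5 hours per week appears as an outlier point in the boxplot.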

In [46]:

# Frequency table employment status
survey["EmploymentStatus"].value_counts(dropna=False)

Out[46]:

Employed for wages                      5564
Not working but looking for work        3395
NaN                                     1899
Not working and not looking for work     948
Self-employed freelancer                 644
Doing an unpaid internship               294
Unable to work                           235
A stay-at-home parent or homemaker       220
Self-employed business owner             210
Military                                  62
Retired                                   24
Name: EmploymentStatus, dtype: int64

In [47]:

# Hours spent per week by employment status
survey.groupby("EmploymentStatus")["HoursLearning"].median().sort_values(ascending=False)

Out[47]:

EmploymentStatus
Not working but looking for work        20.0
Self-employed freelancer                20.0
Doing an unpaid internship              15.0
Self-employed business owner            15.0
Not working and not looking for work    14.0
A stay-at-home parent or homemaker      13.0
Retired                                 12.0
Unable to work                          12.0
Employed for wages                      10.0
Military                                10.0
Name: HoursLearning, dtype: float64

Career comparison by salary and experience (months programming)¶

MonthsProgramming: "About how many months have you been programming for?" ("programming experience")

There is some evidence that a respondent's career field has relatively little influence on their motivation to learn programming.

Farming/fishing/forestry and education, career fields we would not typically associate with programming or software development, have the highest average number of months programming. The IT/software development field has the third-highest average. Presumably, respondents in IT/software development were spending time outside of work learning or had just been hired.

Farming/fishing/forestry and education are also among the lowest-paid career fields in this survey, and on average their respondents expect a lower income than those in other career fields. Instead, respondents in higher-paying careers with less "programming experience" expect higher incomes after switching to tech/software-related jobs.

A stronger argument may be that education level has more influence on a person's decision to start learning a skill like programming in pursuit of more career opportunities.
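
The comparison in this section rests on per-field averages. A minimal sketch of that aggregation follows; the values in `toy` are made up for illustration and are not survey data. Selecting the numeric columns before calling `mean()` keeps text columns out of the aggregation.

```python
import pandas as pd

# Made-up rows (not survey data) using the survey's column names
toy = pd.DataFrame({
    "EmploymentField": ["education", "education", "finance", "finance"],
    "MonthsProgramming": [24, 36, 6, 10],
    "Income": [30000, 34000, 70000, 90000],
    "ExpectedEarning": [45000, 47000, 80000, 100000],
})

numeric_cols = ["MonthsProgramming", "Income", "ExpectedEarning"]
field_means = toy.groupby("EmploymentField")[numeric_cols].mean()

# Sorting by one column gives the bar order used in a horizontal bar chart
print(field_means.sort_values(by="MonthsProgramming"))
```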

In [48]:

# Salary and experience comparison for employment fields

# Assign groupby objects for plotting using EmploymentField
# numeric_only=True skips text columns, which newer pandas versions require for mean()
empfld_months_prg = survey.groupby("EmploymentField").mean(numeric_only=True).sort_values(by="MonthsProgramming") # sort by the average number of months programming
empfld_income = survey.groupby("EmploymentField").mean(numeric_only=True).sort_values(by="Income") # sort by the average income
empfld_expected_salary = survey.groupby("EmploymentField").mean(numeric_only=True).sort_values(by="ExpectedEarning") # sort by the average expected earning

#-------------------------------------------------------------------------------------------------------------------------------------#
# Color assignment
colors = ["grey", "grey", "grey", "grey", "grey","grey", "grey", "grey", "grey", "grey","grey", "grey", "maroon", "maroon", "maroon",]

# Plot results experience
fig, ax = plt.subplots(figsize = (8, 6))
plt.barh(empfld_months_prg.index, empfld_months_prg["MonthsProgramming"], color = colors, height = 0.6)

# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)

# Y label
plt.ylabel("Career Field", loc = "top", size = 14, color = "grey")

# Text
plt.text(-8,16.3,"Average number of months", size = 14, color = "grey")

# Title
plt.title("New programmer experience by career field", size = 20, loc = "left", x = -0.65, y = 1.12)

# X axis to top
ax.xaxis.tick_top()

# X and Y ticks
plt.xticks(size = 13, color = "grey")
plt.yticks(size = 14)

# Highlight top 3 career fields most experience
plt.gca().get_yticklabels()[-1].set_color("maroon")
plt.gca().get_yticklabels()[-2].set_color("maroon")
plt.gca().get_yticklabels()[-3].set_color("maroon")

plt.show()

#---------------------------------------------------------------------------------------------------------------------------------------#

# Salary

# Color assignment
colors = ["maroon", "grey", "grey", "grey", "grey","maroon", "grey", "maroon", "grey", "grey","grey", "grey", "grey", "grey", "grey",]

# Plot results income
fig, ax = plt.subplots(figsize = (8, 6))
plt.barh(empfld_income.index, empfld_income["Income"], color = colors, height = 0.6)

# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)

# Y label
plt.ylabel("Career Field", loc = "top", size = 14, color = "grey")

# Text
plt.text(-25000,16.4,"Average Salary (US dollars)", size = 14, color = "grey")
plt.text(30000,-0.5, "Career fields shaded in red\nhave the highest average number of months\nspent learning programming", color = "grey")

# Title
plt.title("Salary by career field", size = 20, loc = "left", x = -0.65, y = 1.12)

# X axis to top
ax.xaxis.tick_top()

# X and Y ticks
plt.xticks(size = 13, color = "grey")
plt.yticks(size = 14)

# Highlight top 3 career fields most experience
plt.gca().get_yticklabels()[0].set_color("maroon")
plt.gca().get_yticklabels()[5].set_color("maroon")
plt.gca().get_yticklabels()[-8].set_color("maroon")

plt.show()

#--------------------------------------------------------------------------------------------------------------------------------------------#

# Color assignment
colors_salary = ["maroon", "grey", "grey", "grey", "grey","grey", "grey", "maroon", "grey", "grey","maroon", "grey", "grey", "grey", "grey"]

# Plot results expected earning
fig, ax = plt.subplots(figsize = (8, 6))
plt.barh(empfld_expected_salary.index, empfld_expected_salary["ExpectedEarning"].sort_values(), color = colors_salary, height = 0.6)

# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)

# Y label
plt.ylabel("Career Field", loc = "top", size = 14, color = "grey")

# Text
plt.text(-20000, 16.3,"Average (US dollars)", size = 14, color = "grey")

# Title
plt.title("Expected annual salary by career field", size = 20, loc = "left", x = -0.65, y = 1.12)

# X axis to top
ax.xaxis.tick_top()

# X and Y ticks
plt.xticks(size = 13, color = "grey")
plt.yticks(size = 14)

# Highlight top 3 career fields most experience
plt.gca().get_yticklabels()[0].set_color("maroon")
plt.gca().get_yticklabels()[-5].set_color("maroon")
plt.gca().get_yticklabels()[7].set_color("maroon")

plt.show()

Education¶

Earlier we noted that a person's career field may have only a weak influence on their motivation to start learning programming. Education level may be a more significant factor. The data suggests that individuals with less than a bachelor's degree have some of the greatest amounts of "programming experience".

The three education levels are:

"Some high school"
"Associate's degree"
"Some college credit, no degree"

These groups have the highest median and average expected salary increase (in percent), and they are among the lowest earning in terms of yearly income. However, in terms of average expected earning (in US dollars), holders of Ph.D.s, professional degrees, and bachelor's degrees generally expect a higher salary than these groups. Associate's degree holders are the exception: their average expected earning is second only to that of Ph.D. holders.
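
The Percent_Increase column is created earlier in the notebook. The sketch below assumes it is the relative change from current income to expected earning, which is consistent with sample rows in the survey data (an income of 13,000 and an expected earning of 70,000 give 438.46). The helper name `percent_increase` is hypothetical. It also illustrates why lower earners can show the largest percentage raises while expecting fewer dollars.

```python
def percent_increase(income, expected_earning):
    """Relative change from current income to expected earning, in percent."""
    return (expected_earning - income) / income * 100

# Values taken from rows displayed in this notebook
print(round(percent_increase(13000, 70000), 2))   # income 13,000, expected 70,000
print(round(percent_increase(40000, 30000), 2))   # income 40,000, expected 30,000

# Made-up comparison: a lower earner expecting fewer dollars
# can still show the larger percentage increase
low_earner = percent_increase(20000, 50000)
high_earner = percent_increase(45000, 55000)
print(low_earner, round(high_earner, 2))
```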

Neither career field nor education level is a perfect indicator of whether someone is more motivated to learn new programming/tech skills. We think it is reasonable to argue that survey participants are generally interested in programming for the career and income opportunities.

In [49]:

# Average expected earning by school degree
round(survey.groupby("SchoolDegree")["ExpectedEarning"].mean().sort_values(ascending=False), 2)

Out[49]:

SchoolDegree
Ph.D.                                       61165.49
associate's degree                          60870.16
professional degree (MBA, MD, JD, etc.)     56383.49
bachelor's degree                           55103.32
no high school (secondary school)           54756.51
some high school                            53954.36
some college credit, no degree              53603.29
master's degree (non-professional)          52670.46
trade, technical, or vocational training    49398.36
high school diploma or equivalent (GED)     48131.91
Name: ExpectedEarning, dtype: float64

In [50]:

# Median expected salary increase (percent)
salary_increase_median = round(survey.groupby("SchoolDegree")["Percent_Increase"].median().sort_values(ascending=False),2)
salary_increase_median = pd.Series.to_frame(salary_increase_median).reset_index()
salary_increase_median = salary_increase_median.rename(columns={"index":"SchoolDegree","Percent_Increase":"Percentage"})

# Average expected salary increase (percent)
salary_increase = round(survey.groupby("SchoolDegree")["Percent_Increase"].mean().sort_values(ascending=False),2)
salary_increase = pd.Series.to_frame(salary_increase).reset_index()
salary_increase = salary_increase.rename(columns={"index":"SchoolDegree","Percent_Increase":"Percentage"})
#------------------------------------------------------------------------------------------------------------------------#

# Color assignment
colors = ["#145DDE", "#145DDE","#145DDE", "grey", "grey", "grey","grey", "grey", "grey", "grey"]

# Plot (1) results average and median salary increase (percent)
fig, ax = plt.subplots(figsize = (8, 6))

# Title
plt.title("Expected salary raise by education level", size = 19, loc = "left", x= -0.65, y = 1.16)

# Y label
plt.ylabel("School Degree", loc = "top", size = 14, color = "grey")

# Misc. text
plt.text(-4,-2,"Median", color = "#4B86C1", size = 14)
plt.text(35,-2,"Average", color = "grey", size = 14)
plt.text(80,-2,"(Percent)", size = 14)
plt.text(160, 2.5,"Education levels below bachelor's\ndegree have the highest average\nand median expected salary increase", color = "grey")

# Average plotted
plt.barh(salary_increase["SchoolDegree"], salary_increase["Percentage"], color = colors, height = 0.62)
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)
ax.xaxis.tick_top()

# Median plotted
plt.barh(salary_increase_median["SchoolDegree"], salary_increase_median["Percentage"], color = "#4B86C1", height = 0.62)
plt.yticks(size = 14, color = "grey")
plt.xticks(size = 13, color = "grey")
plt.gca().invert_yaxis()

# Top 3 education levels by expected salary raise highlighted
plt.gca().get_yticklabels()[0].set_color("#145DDE")
plt.gca().get_yticklabels()[1].set_color("#145DDE")
plt.gca().get_yticklabels()[2].set_color("#145DDE")

plt.show()

#-----------------------------------------------------------------------------------------------------------------------------------------#

# Assign groupby objects for plotting using SchoolDegree 
schl_dgree = survey.groupby("SchoolDegree").mean(numeric_only=True).sort_values(by = "MonthsProgramming") # sort by the average number of months programming
degree_income = survey.groupby("SchoolDegree").mean(numeric_only=True).sort_values(by = "Income") # sort by the average income

# Plot results income by school education level
colors_degree_income = ["#145DDE", "grey","#145DDE", "grey", "grey", "#145DDE","grey", "grey", "grey", "grey"]

# Plot (2) school degree income
fig, ax = plt.subplots(figsize = (8, 6))
plt.barh(degree_income.index, degree_income["Income"], color = colors_degree_income, height = 0.62)

# Title
plt.title("Salary by education", size = 20, loc = "left", x = -0.7, y = 1.12)

# Text
plt.text(-23000,10.7,"Average Salary (US dollars)", size = 14, color = "grey")

# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)

# Y label
plt.ylabel("School Degree", loc = "top", size = 14, color = "grey")

# X axis to top
ax.xaxis.tick_top()

# X and Y ticks
plt.xticks(size = 13, color = "grey")
plt.yticks(size = 14)

# Top 3 education levels by expected salary raise highlighted
plt.gca().get_yticklabels()[0].set_color("#145DDE")
plt.gca().get_yticklabels()[2].set_color("#145DDE")
plt.gca().get_yticklabels()[5].set_color("#145DDE")

plt.show()

#-----------------------------------------------------------------------------------------------------------------------------------------#

# Plot results number of months programming by education level
colors_schl_dgree = ["grey", "grey","grey", "grey", "grey", "#145DDE","#145DDE", "grey", "#145DDE", "grey"]

# Plot (3) school degree number of months programming
fig, ax = plt.subplots(figsize = (8, 6))
plt.barh(schl_dgree.index, schl_dgree["MonthsProgramming"], color = colors_schl_dgree, height = 0.62)

# Title
plt.title("New programmer experience by education", size = 20, loc = "left", x = -0.7, y = 1.12)

# Text
plt.text(-11,10.7,"Average number of months", size = 14, color = "grey")

# Remove spines
plt.gca().spines[["right", "left", "top", "bottom"]].set_visible(False)

# Y label
plt.ylabel("School Degree", loc = "top", size = 14, color = "grey")

# X axis to top
ax.xaxis.tick_top()

# X and Y ticks
plt.xticks(size = 13, color = "grey")
plt.yticks(size = 14)

# Top 3 education levels by expected salary raise highlighted
plt.gca().get_yticklabels()[-2].set_color("#145DDE")
plt.gca().get_yticklabels()[-4].set_color("#145DDE")
plt.gca().get_yticklabels()[-5].set_color("#145DDE")

plt.show()

In [51]:

# Calculate median to avoid most of the skewness from outliers
# Median hours spent studying programming
survey.groupby("SchoolDegree")["HoursLearning"].median().sort_values(ascending=False)

Out[51]:

SchoolDegree
trade, technical, or vocational training    13.0
Ph.D.                                       12.0
associate's degree                          12.0
bachelor's degree                           12.0
high school diploma or equivalent (GED)     12.0
master's degree (non-professional)          12.0
professional degree (MBA, MD, JD, etc.)     12.0
some college credit, no degree              12.0
some high school                            12.0
no high school (secondary school)           10.0
Name: HoursLearning, dtype: float64

Markets¶

A vast majority of respondents reside in the United States, followed by India at about 7% and the United Kingdom at 5%. Before making a decision, we need to find out how much new programmers are willing to spend on education. If we advertise in markets that are only interested in free learning, we are unlikely to be profitable.

The MoneyForLearning column describes the amount of money that survey participants have spent since they began programming. Since our business model operates on a monthly subscription, we are interested in how much customers are willing to spend per month. To find that, we need to create a new column.

Formula: MoneyForLearning / MonthsProgramming

We may need to limit our analysis to the following countries: US, India, UK, and Canada. Two reasons for this decision are:

These countries have the highest frequency in the dataset
The e-learning program is in English, and English is an official language in all four of these countries. We'd like to maximize our chances of advertising to the right audience.
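
One way to restrict the data to these four countries is exact matching with `isin`, sketched below on made-up rows. Substring matching with `str.contains` would also keep any country whose name merely contains one of the four names; the territory used to show this is only an illustration and may not occur in the survey.

```python
import pandas as pd

targets = ["United States of America", "India", "United Kingdom", "Canada"]

# Made-up rows (not survey data)
toy = pd.DataFrame({
    "CountryLive": [
        "United States of America",
        "India",
        "British Indian Ocean Territory",
        "Germany",
    ]
})

# Exact matching keeps only rows whose value equals one of the targets
exact = toy[toy["CountryLive"].isin(targets)]

# Substring matching also keeps names that merely contain a target
substring = toy[toy["CountryLive"].str.contains("|".join(targets))]

print(len(exact), "rows with exact matching")
print(len(substring), "rows with substring matching")
```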

In [52]:

# Months programming frequency
survey["MonthsProgramming"].value_counts().head(20)

Out[52]:

1.0     1373
6.0     1371
12.0    1334
3.0     1273
2.0     1228
24.0     821
4.0      733
5.0      557
36.0     441
0.0      421
8.0      412
10.0     320
18.0     288
7.0      246
9.0      229
20.0     194
48.0     190
30.0     149
60.0     143
15.0     143
Name: MonthsProgramming, dtype: int64

To avoid dividing by zero, we'll replace values of 0 in MonthsProgramming with 1. We can reasonably assume that respondents who answered 0 months had just started and had at most a few weeks of experience.
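
As a small sketch of why the replacement matters: dividing float columns in pandas does not raise an error on zero, it silently produces NaN (for 0/0) or infinity (for a positive amount divided by 0), and either would distort later averages. The rows below are made up for illustration and are not survey data.

```python
import pandas as pd

# Made-up rows (not survey data) using the survey's column names
toy = pd.DataFrame({
    "MoneyForLearning": [80.0, 1000.0, 0.0, 200.0, float("nan")],
    "MonthsProgramming": [6.0, 5.0, 0.0, 0.0, 12.0],
})

# Without the replacement, pandas yields NaN (0/0) or inf (x/0) instead of raising
raw = toy["MoneyForLearning"] / toy["MonthsProgramming"]

# Treat "0 months" as one month before dividing
months = toy["MonthsProgramming"].replace(0, 1)
toy["Monthly_spending"] = toy["MoneyForLearning"] / months

print(raw.tolist())
print(toy["Monthly_spending"].round(2).tolist())
```

Rows with a missing MoneyForLearning value stay missing after the division, which is why they still need to be dropped afterwards.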

In [53]:

# Set new copy
spending = survey.copy()

# Replaces any instances of "zero months programming" (0) with (1) for proper calculation
spending["MonthsProgramming"] = spending["MonthsProgramming"].replace({0:1})

In [54]:

# Calculates monthly spending by dividing money for learning with number of months programming
spending["Monthly_spending"] = spending["MoneyForLearning"] / spending["MonthsProgramming"]
spending["Monthly_spending"].value_counts(dropna=False)

Out[54]:

0.000000        5769
NaN             1140
16.666667        297
50.000000        264
100.000000       246
                ... 
130.000000         1
80000.000000       1
76.000000          1
47.222222          1
1600.000000        1
Name: Monthly_spending, Length: 707, dtype: int64

In [55]:

# Total number of missing data points in monthly_spending column
spending["Monthly_spending"].isna().sum()

Out[55]:

1140

In [56]:

# Drop missing data from following columns
spending = spending.dropna(subset=["CountryLive","Monthly_spending"])

# Groupby and calculate mean (numeric columns only)
avg_month = spending.groupby("CountryLive").mean(numeric_only=True)

# Shows only four countries selected below
avg_month["Monthly_spending"][["United States of America", "India","United Kingdom", "Canada"]]

Out[56]:

CountryLive
United States of America    256.969675
India                       100.449884
United Kingdom               93.828988
Canada                      141.571630
Name: Monthly_spending, dtype: float64

The United States spends the most of the four countries, by a significant margin
The United Kingdom spends the least; interest in programming skills might not be as prevalent there
A box plot will show any discrepancies/outliers

In [57]:

# Assigns new variable for countries listed below
four_countries = spending[spending["CountryLive"].str.contains("United States of America|India|United Kingdom|Canada")]

In [58]:

# Plot results of outliers in USA, India, UK, and Canada
fig, ax = plt.subplots(figsize = (12, 8))
sns.boxplot(x = "CountryLive", y = "Monthly_spending", data = four_countries)

# Remove spines
plt.gca().spines[["right","top"]].set_visible(False)

# X ticks
plt.xticks(size = 13, color = "grey")

# X and Y labels
plt.ylabel("US dollars", loc = "top", size = 14, color = "grey")
plt.xlabel("")

# Title
plt.title("Money spent per month", loc = "left", size = 20)

plt.show()

With so many outliers in each country, it is still difficult to tell whether the data is wrong. Far too many data points have Monthly_spending values exceeding several thousand dollars, and these outliers skew the distribution of monthly spending.

Using the .value_counts() method with bins set to 20 should give a clearer picture of the distribution of Monthly_spending and show where to trim the data further.
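
A minimal sketch of binned counts on made-up values (not survey data) shows how to read such output. pandas splits the range from the minimum to the maximum into equal-width bins and lowers the left edge of the first bin slightly so that the minimum is included, which is why the first interval can start below zero even when no value is negative.

```python
import pandas as pd

# Made-up monthly spending values (not survey data)
toy = pd.Series([0, 10, 20, 50, 100, 8000])

# Two equal-width bins over the range 0 to 8000, as percentages of all rows
shares = toy.value_counts(bins=2, normalize=True) * 100
print(shares)

# The most populated bin comes first because value_counts sorts by frequency
first_bin = shares.index[0]
print("first bin:", first_bin.left, "to", first_bin.right)
```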

In [59]:

# Value counts method shows distribution
spending["Monthly_spending"].value_counts(bins = 20, normalize= True) * 100

Out[59]:

(-80.001, 4000.0]     99.210412
(4000.0, 8000.0]       0.442516
(8000.0, 12000.0]      0.182213
(16000.0, 20000.0]     0.052061
(12000.0, 16000.0]     0.043384
(24000.0, 28000.0]     0.017354
(36000.0, 40000.0]     0.008677
(68000.0, 72000.0]     0.008677
(48000.0, 52000.0]     0.008677
(76000.0, 80000.0]     0.008677
(32000.0, 36000.0]     0.008677
(28000.0, 32000.0]     0.008677
(44000.0, 48000.0]     0.000000
(52000.0, 56000.0]     0.000000
(56000.0, 60000.0]     0.000000
(60000.0, 64000.0]     0.000000
(64000.0, 68000.0]     0.000000
(20000.0, 24000.0]     0.000000
(72000.0, 76000.0]     0.000000
(40000.0, 44000.0]     0.000000
Name: Monthly_spending, dtype: float64

$4,000 (US) per month is higher than even the average college tuition in the United States. Nonetheless, since over 99% of responses fall at or below this amount, we'll use it as a cutoff when recalculating the monthly spending of the United States, India, the UK, and Canada. This should result in slightly less skewed calculations.
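
To illustrate the effect of such a cutoff, the sketch below compares the mean before and after trimming on made-up values (not survey data). A single extreme entry dominates the untrimmed mean, while the median changes far less.

```python
import pandas as pd

# Made-up monthly spending values (not survey data)
toy = pd.Series([0.0, 20.0, 50.0, 100.0, 80000.0])

cutoff = 4000
trimmed = toy[toy <= cutoff]

print("mean before cutoff:  ", toy.mean())
print("mean after cutoff:   ", trimmed.mean())
print("median before/after: ", toy.median(), trimmed.median())
```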

In [60]:

four_countries

Out[60]:

	Age	AttendedBootcamp	CityPopulation	CommuteTime	CountryCitizen	CountryLive	EmploymentField	EmploymentStatus	Gender	HasDebt	HasFinancialDependents	HasHighSpdInternet	HasServedInMilitary	HoursLearning	Income	IsEthnicMinority	IsReceiveDisabilitiesBenefits	IsSoftwareDev	IsUnderEmployed	JobApplyWhen	JobPref	JobWherePref	LanguageAtHome	MaritalStatus	MoneyForLearning	MonthsProgramming	SchoolDegree	SchoolMajor	JobRoleInterest	ExpectedEarning	Year	Percent_Increase	Monthly_spending
1	34.0	0.0	less than 100,000	NaN	United States of America	United States of America	NaN	Not working but looking for work	male	True	False	1.0	0.0	10.0	NaN	0.0	0.0	0.0	NaN	Within 7 to 12 months	work for a nonprofit	in an office with other developers	English	single, never married	80.0	6.0	some college credit, no degree	NaN	[Full-Stack Web Developer]	35000.0	2017	NaN	13.333333
2	21.0	0.0	more than 1 million	15 to 29 minutes	United States of America	United States of America	software development and IT	Employed for wages	male	False	False	1.0	0.0	25.0	13000.0	1.0	0.0	0.0	0.0	Within 7 to 12 months	work for a medium-sized company	no preference	Spanish	single, never married	1000.0	5.0	high school diploma or equivalent (GED)	NaN	[ Front-End Web Developer, Back-End Web Deve...	70000.0	2017	438.461538	200.000000
6	29.0	0.0	between 100,000 and 1 million	30 to 44 minutes	United Kingdom	United Kingdom	NaN	Employed for wages	female	True	False	1.0	0.0	16.0	40000.0	NaN	0.0	0.0	0.0	I'm already applying	work for a medium-sized company	no preference	English	married or domestic partnership	0.0	12.0	some college credit, no degree	NaN	[Full-Stack Web Developer]	30000.0	2017	-25.000000	0.000000
15	32.0	0.0	less than 100,000	30 to 44 minutes	United States of America	United States of America	sales	Employed for wages	male	True	False	1.0	0.0	1.0	20000.0	0.0	0.0	0.0	1.0	more than 12 months from now	work for a nonprofit	in an office with other developers	English	single, never married	0.0	1.0	master's degree (non-professional)	English	[Full-Stack Web Developer]	40000.0	2017	100.000000	0.000000
16	29.0	0.0	between 100,000 and 1 million	30 to 44 minutes	Lithuania	United States of America	finance	Employed for wages	male	False	False	1.0	0.0	6.0	60000.0	0.0	0.0	0.0	0.0	Within the next 6 months	work for a medium-sized company	in an office with other developers	English	married or domestic partnership	200.0	12.0	master's degree (non-professional)	Political Science	[Full-Stack Web Developer]	60000.0	2017	0.000000	16.666667
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
15571	61.0	0.0	less than 100,000	20.0	United States of America	United States of America	farming, fishing, and forestry	Employed for wages	male	False	True	1.0	0.0	40.0	60000.0	0.0	0.0	0.0	1.0	Within 7 to 12 months	work for a medium-sized company	no preference	English	married or domestic partnership	500.0	240.0	bachelor's degree	Computer Programming	[Full-Stack Web Developer]	80000.0	2016	33.333333	2.083333
15578	42.0	0.0	between 100,000 and 1 million	60.0	United States of America	United States of America	NaN	Self-employed business owner	female	True	True	1.0	0.0	25.0	60000.0	0.0	0.0	0.0	1.0	Within 7 to 12 months	work for a medium-sized company	no preference	English	married or domestic partnership	0.0	1.0	bachelor's degree	Film and Video Studies	[Full-Stack Web Developer]	60000.0	2016	0.000000	0.000000
15598	51.0	0.0	less than 100,000	30.0	United States of America	United States of America	finance	Employed for wages	male	True	True	1.0	1.0	30.0	200000.0	0.0	0.0	0.0	0.0	more than 12 months from now	work for a medium-sized company	in an office with other developers	English	married or domestic partnership	100.0	12.0	professional degree (MBA, MD, JD, etc.)	Investments and Securities	[Full-Stack Web Developer]	100000.0	2016	-50.000000	8.333333
15600	38.0	0.0	more than 1 million	90.0	United States of America	United States of America	finance	Employed for wages	male	False	True	1.0	0.0	6.0	200000.0	0.0	0.0	0.0	0.0	more than 12 months from now	work for a startup	no preference	English	married or domestic partnership	500.0	12.0	bachelor's degree	Finance	[Full-Stack Web Developer]	150000.0	2016	-25.000000	41.666667
15615	28.0	0.0	less than 100,000	7.0	United States of America	United States of America	food and beverage	Employed for wages	male	True	True	1.0	0.0	20.0	200000.0	0.0	0.0	0.0	1.0	I'm already applying	work for a medium-sized company	from home	English	married or domestic partnership	1400.0	7.0	associate's degree	Computer and Information Systems Security	[Full-Stack Web Developer]	50000.0	2016	-75.000000	200.000000

7536 rows × 33 columns

In [61]:

# Money spent less than or equal to $4,000
four_countries = four_countries[four_countries["Monthly_spending"] <= 4000]

# Dataframe length
four_countries.shape

Out[61]:

(7469, 33)

Programming Bootcamp Attendance¶

Respondents who indicated they paid for learning will help us decide which countries to advertise in. Over 92% of survey participants from the United States, India, the United Kingdom, and Canada did not attend a programming bootcamp. A bootcamp is an online or in-person training program that teaches the fundamentals of programming within a limited timeframe; it should not be confused with our e-learning platform, which is accessible 24/7 at the customer's own pace.

The average monthly spending grouped by bootcamp attendance shows that bootcamp attendees spend far more than those who did not attend (about $953 versus $72 per month). This skew may distort the results for spending by country.

More respondents paid for learning (4,198) than spent no money on learning (3,271)
Less than 7% of respondents attended a programming bootcamp
Bootcamps cost money to attend; as expected, 98% of participants who have not spent any money on learning did not attend one, and the roughly 1% who did may have used a loan or borrowed money from someone
11% of respondents who spent money on learning reported bootcamp attendance, whereas about 88% did not attend one and presumably spent it on other platforms/resources
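
The shares in the list above are row-wise percentages: bootcamp attendance within free learners and within paid learners. One way to get both at once is a normalized cross-tabulation, sketched here on made-up rows (not survey data). The label column `LearnerType` is a hypothetical name introduced for the example.

```python
import pandas as pd

# Made-up rows (not survey data) using the survey's column names
toy = pd.DataFrame({
    "Monthly_spending": [0, 0, 0, 0, 50, 100, 30, 900, 20],
    "AttendedBootcamp": [0, 0, 0, 1, 0, 0, 0, 1, 0],
})

# Label each respondent as a free or paid learner
learner_type = (toy["Monthly_spending"] > 0).map({True: "paid", False: "free"})
learner_type = learner_type.rename("LearnerType")

# Row-wise percentages: bootcamp attendance within each learner type
shares = pd.crosstab(learner_type, toy["AttendedBootcamp"], normalize="index") * 100
print(shares)

# Average spending by attendance, the comparison described above
attendance_mean = toy.groupby("AttendedBootcamp")["Monthly_spending"].mean()
print(attendance_mean)
```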

In [62]:

# Bootcamp attendance frequency
# 1 = Yes
# 0 = No
four_countries["AttendedBootcamp"].value_counts(normalize= True, dropna= False) * 100

Out[62]:

0.0    92.703173
1.0     6.868389
NaN     0.428438
Name: AttendedBootcamp, dtype: float64

In [63]:

# Average money spent grouped by bootcamp attendance
four_countries.groupby("AttendedBootcamp")["Monthly_spending"].mean()

Out[63]:

AttendedBootcamp
0.0     72.306148
1.0    952.885023
Name: Monthly_spending, dtype: float64

In [64]:

# Number of observations where individuals did not spend money
print("Free learning:", len(four_countries[four_countries["Monthly_spending"] <= 0]), "observations")

# Number of observations where individuals did spend money
print("Paid learning", len(four_countries[four_countries["Monthly_spending"] > 0]), "observations")

Free learning: 3271 observations
Paid learning 4198 observations

In [65]:

# Frequency of observations that did not spend money, but did attend a programming bootcamp
# Boolean masking
free = (four_countries[four_countries["Monthly_spending"] == 0])
free["AttendedBootcamp"].value_counts(dropna=False, normalize= True) * 100

Out[65]:

0.0    98.135127
1.0     1.284011
NaN     0.580862
Name: AttendedBootcamp, dtype: float64

In [66]:

# Assigns variable that isolates respondents that only paid for learning
attended_bc = four_countries[four_countries["Monthly_spending"] > 0]

# Returns frequency of all paid learning based on bootcamp attendance
attended_bc["AttendedBootcamp"].value_counts(dropna=False, normalize= True) * 100

Out[66]:

0.0    88.470700
1.0    11.219628
NaN     0.309671
Name: AttendedBootcamp, dtype: float64

In [67]:

# Variable assignment for boolean masking, both values must meet criteria below

# Paid learning, and bootcamp attendance is true
group_a = (four_countries["Monthly_spending"] > 0) & (four_countries["AttendedBootcamp"] == 1)
a = four_countries[group_a]

# Paid learning, and bootcamp attendance is false
group_b = (four_countries["Monthly_spending"] > 0) & (four_countries["AttendedBootcamp"] == 0)
b = four_countries[group_b]

Monthly spending by attendance¶

Group A (paid learners who also attended a programming bootcamp) has a greater range of monthly spending: any amount from $1 to $4,000 is common. As programming bootcamps are generally expensive, this is not a surprise. The opposite is true for group B (paid learners who did not attend a bootcamp). Group B could cover any other paid learning service or subscription, so most people in this group spent less than $500.

Bootcamp attendees clearly skew the data for Monthly_spending; however, we cannot rule out that they also pursued other means of learning. Instead, we need to keep in mind that bootcamps are expensive, whereas other means of learning (outside of community colleges and universities) are cheaper.

We'll demonstrate that monthly spending by country is significantly higher when restricted to bootcamp attendees, whereas average spending is more reasonable when all spending habits are considered.
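
The comparison can be sketched on made-up rows (not survey data): build a boolean mask for paid learners who attended a bootcamp, then compare the per-country mean for that subset with the mean over all respondents. The variable names below are illustrative.

```python
import pandas as pd

# Made-up rows (not survey data) using the survey's column names
toy = pd.DataFrame({
    "CountryLive": ["United States of America"] * 3 + ["India"] * 3,
    "Monthly_spending": [1000.0, 50.0, 0.0, 800.0, 20.0, 0.0],
    "AttendedBootcamp": [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
})

# Each condition needs its own parentheses before combining with &
paid_and_bootcamp = (toy["Monthly_spending"] > 0) & (toy["AttendedBootcamp"] == 1)

attendee_mean = toy[paid_and_bootcamp].groupby("CountryLive")["Monthly_spending"].mean()
overall_mean = toy.groupby("CountryLive")["Monthly_spending"].mean()

print(pd.DataFrame({"attendees": attendee_mean, "all respondents": overall_mean}))
```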

In [68]:

# Plot results of monthly spending for group a
fig, ax = plt.subplots(figsize = (13,9))
four_countries[group_a]["Monthly_spending"].plot.hist(bins = 20, color = "grey")

# Remove spines
plt.gca().spines[["right","top"]].set_visible(False)

# Title
plt.title("Monthly spending of programming bootcamp attendees", loc="left", size = 20, y = 1.05)

# X and Y labels
plt.ylabel("Frequency", size = 16, color = "grey", loc = "top")
plt.xlabel("US dollars", size = 14, loc = "left", color = "grey")

# X and Y ticks
plt.xticks(size = 13)
plt.yticks(size = 14)

# Text
plt.text(1500, 80,"Respondents that reported attendance\nof a programming bootcamp spent any amount up to $4,000", color = "grey", size = 14)

plt.show() 

# Plot results of monthly spending for group b
fig, ax = plt.subplots(figsize = (13,9))
four_countries[group_b]["Monthly_spending"].plot.hist(bins = 15, color = "grey")

# Remove spines
plt.gca().spines[["right","top"]].set_visible(False)

# Title
plt.title("Monthly spending (excluding bootcamp attendees)", loc="left", size = 20, y = 1.05)

# X and Y labels
plt.ylabel("Frequency", size = 16, color = "grey", loc = "top")
plt.xlabel("US dollars", size = 14, loc = "left", color = "grey")

# X and Y ticks
plt.xticks(size = 13)
plt.yticks(size = 14)

# Text
plt.text(2000,2000,"Excluding programming bootcamp attendees,\nmost participants spent less than $500", color = "grey", size = 14)

plt.show()

In [69]:

# Isolate rows to include only monthly spending less than or equal to $4000
spending = spending[spending["Monthly_spending"] <= 4000]

# Monthly average spending by country
# numeric_only=True skips text columns, which raise a TypeError in pandas 2.0+
avg_month = spending.groupby("CountryLive").mean(numeric_only=True)
country_spends = avg_month["Monthly_spending"][["United States of America", "India","United Kingdom", "Canada"]].sort_index(ascending=False)

# Respondents that have spent money for learning, and did attend a bootcamp
over_zero_and_bootcamp = four_countries[group_a].groupby("CountryLive")["Monthly_spending"].mean().sort_index(ascending=False)

# X labels
labels = ["United States", "United Kingdom", "India", "Canada"]

x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

# Plot results
fig, ax = plt.subplots(figsize = (12, 8))
rects1 = ax.bar(x - width/2, country_spends, width, label="Includes Free Learners",color = "#4B86C1")
rects2 = ax.bar(x + width/2, over_zero_and_bootcamp, width, label= "Bootcamp attendees only", color = "#ED7E00")

# Subscription price
plt.axhline(59, color = "black", alpha = 0.5, label = "Subscription Price ($59)", linewidth = 2, linestyle = "--")

# Text
plt.text(-0.5,1110,"Country averages only include bootcamp\nattendees that paid for learning", size = 14, color = "grey")

# Labels and title
ax.set_ylabel("US dollars", loc = "top", size = 14, color = "grey")
ax.set_title("Average Monthly Spending", size = 20, loc = "left")
ax.set_xticks(x, labels, size = 14, color = "grey")

# Legend
ax.legend(loc = "center left")
ax.spines[["right", "left", "top", "bottom"]].set_visible(False)

# Bar labels
ax.bar_label(rects1, padding=3)
ax.bar_label(rects2, padding=3)

# Apply tight layout
fig.tight_layout()

plt.show()

#------------------------------------------------------------------------------------------------------------------------------------------------#

# Isolate rows to include only monthly spending less than or equal to $4000
spending = spending[spending["Monthly_spending"] <= 4000]

# Monthly average spending by country
# numeric_only=True skips text columns, which raise a TypeError in pandas 2.0+
avg_month = spending.groupby("CountryLive").mean(numeric_only=True)
country_spends = avg_month["Monthly_spending"][["United States of America", "India","United Kingdom", "Canada"]].sort_index(ascending=False)

# Respondents that have spent money on learning (bootcamp attendees included)
over_zero = four_countries[four_countries["Monthly_spending"] > 0].groupby("CountryLive")["Monthly_spending"].mean().sort_index(ascending=False)

# X labels
labels = ["United States", "United Kingdom", "India", "Canada"]

x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

# Plot results
fig, ax = plt.subplots(figsize = (12, 8))
rects1 = ax.bar(x - width/2, country_spends, width, label="Includes Free Learners",color = "#4B86C1")
rects2 = ax.bar(x + width/2, over_zero, width, label= "Excludes Free Learners", color = "#ED7E00")

# Subscription price
plt.axhline(59, color = "black", alpha = 0.5, label = "Subscription Price ($59)", linewidth = 2, linestyle = "--")

# Labels and title
ax.set_ylabel("US dollars", loc = "top", size = 14, color = "grey")
ax.set_title("Average Monthly Spending", size = 20, loc = "left")
ax.set_xticks(x, labels, size = 14, color = "grey")

# Legend
ax.legend(loc = "upper right")

# Remove spines
ax.spines[["right", "left", "top", "bottom"]].set_visible(False)

# Bar labels
ax.bar_label(rects1, padding=3)
ax.bar_label(rects2, padding=3)

# Apply tight layout
fig.tight_layout()

plt.show()

The United States should be our first choice for advertising:

The US has the highest number of new programmers
Highest average monthly spending for learning programming

India could be the second choice:

Second highest number of new programmers
India's average monthly spending is below our subscription price. However, in both surveys India has the second highest frequency of survey participation. This suggests that programming interest in India and the United States is higher than in other countries, so the potential number of customers is probably higher than in the United Kingdom or Canada.
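
The comparison against the $59 subscription price can also be expressed as a small table. This is a minimal sketch with placeholder averages (the country names and values in `toy_avg` are hypothetical, not the survey results). Only the $59 price comes from the analysis above.

```python
import pandas as pd

SUBSCRIPTION_PRICE = 59  # monthly price in US dollars

# Hypothetical country averages, for illustration only
toy_avg = pd.Series(
    {"Country A": 140.0, "Country B": 45.0, "Country C": 60.0},
    name="Monthly_spending",
)

comparison = pd.DataFrame({
    "avg_spending": toy_avg,
    # How many months of subscription the average spending would cover
    "months_of_subscription": (toy_avg / SUBSCRIPTION_PRICE).round(2),
    "covers_price": toy_avg >= SUBSCRIPTION_PRICE,
}).sort_values("avg_spending", ascending=False)
print(comparison)
```

Substituting country_spends for `toy_avg` would produce the same table for the four countries in the charts.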

Conclusion¶

The analysis of freeCodeCamp's survey indicates that the United States should be the prime candidate for advertising based on the following criteria:

Greatest number of new programmers
Monthly spending is the highest among the English-speaking countries

The second candidate can be India:

Second highest number of new programmers
Monthly spending is low, but there is potential for attracting new customers given India's large population

We believe we have explained why people are learning and practicing a new skill like programming. The information in this survey points to a desire for upward mobility, career advancement, and higher income. We demonstrated that the gap between respondents' current yearly salary and their expected salary was large enough to explain their decisions. Many people indicated that they are interested in software or data science careers outside of their current one, with an expectation of increased salary.