Finding The Best Markets To Advertise In

For this project we will take the role of a data analyst who works for an e-learning company. While the company focuses on courses in web and mobile development, it also offers programs in data science and game development. The company wants to invest in advertisement and wants to know what the two best markets are to invest in.

To gain insight into who is learning to code, we will examine the results of a survey released by freeCodeCamp. The survey received 31,000 responses to over 50 questions.

You can read more about the survey here and more about the data in their GitHub repository.

Examining this survey is a good first step since organizing our own survey would be costly and time-consuming.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# To avoid truncated output
pd.options.display.max_columns = 150

# Read in the data
survey = pd.read_csv('2017-fCC-New-Coders-Survey-Data.csv', low_memory=False)

Explore The Data

In [2]:
survey.shape
Out[2]:
(18175, 136)
In [3]:
survey.head()
Out[3]:
Age AttendedBootcamp BootcampFinish BootcampLoanYesNo BootcampName BootcampRecommend ChildrenNumber CityPopulation CodeEventConferences CodeEventDjangoGirls CodeEventFCC CodeEventGameJam CodeEventGirlDev CodeEventHackathons CodeEventMeetup CodeEventNodeSchool CodeEventNone CodeEventOther CodeEventRailsBridge CodeEventRailsGirls CodeEventStartUpWknd CodeEventWkdBootcamps CodeEventWomenCode CodeEventWorkshops CommuteTime CountryCitizen CountryLive EmploymentField EmploymentFieldOther EmploymentStatus EmploymentStatusOther ExpectedEarning FinanciallySupporting FirstDevJob Gender GenderOther HasChildren HasDebt HasFinancialDependents HasHighSpdInternet HasHomeMortgage HasServedInMilitary HasStudentDebt HomeMortgageOwe HoursLearning ID.x ID.y Income IsEthnicMinority IsReceiveDisabilitiesBenefits IsSoftwareDev IsUnderEmployed JobApplyWhen JobInterestBackEnd JobInterestDataEngr JobInterestDataSci JobInterestDevOps JobInterestFrontEnd JobInterestFullStack JobInterestGameDev JobInterestInfoSec JobInterestMobile JobInterestOther JobInterestProjMngr JobInterestQAEngr JobInterestUX JobPref JobRelocateYesNo JobRoleInterest JobWherePref LanguageAtHome MaritalStatus MoneyForLearning MonthsProgramming NetworkID Part1EndTime Part1StartTime Part2EndTime Part2StartTime PodcastChangeLog PodcastCodeNewbie PodcastCodePen PodcastDevTea PodcastDotNET PodcastGiantRobots PodcastJSAir PodcastJSJabber PodcastNone PodcastOther PodcastProgThrowdown PodcastRubyRogues PodcastSEDaily PodcastSERadio PodcastShopTalk PodcastTalkPython PodcastTheWebAhead ResourceCodecademy ResourceCodeWars ResourceCoursera ResourceCSS ResourceEdX ResourceEgghead ResourceFCC ResourceHackerRank ResourceKA ResourceLynda ResourceMDN ResourceOdinProj ResourceOther ResourcePluralSight ResourceSkillcrush ResourceSO ResourceTreehouse ResourceUdacity ResourceUdemy ResourceW3S SchoolDegree SchoolMajor StudentDebtOwe YouTubeCodeCourse YouTubeCodingTrain YouTubeCodingTut360 YouTubeComputerphile YouTubeDerekBanas YouTubeDevTips 
YouTubeEngineeredTruth YouTubeFCC YouTubeFunFunFunction YouTubeGoogleDev YouTubeLearnCode YouTubeLevelUpTuts YouTubeMIT YouTubeMozillaHacks YouTubeOther YouTubeSimplilearn YouTubeTheNewBoston
0 27.0 0.0 NaN NaN NaN NaN NaN more than 1 million NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 15 to 29 minutes Canada Canada software development and IT NaN Employed for wages NaN NaN NaN NaN female NaN NaN 1.0 0.0 1.0 0.0 0.0 0.0 NaN 15.0 02d9465b21e8bd09374b0066fb2d5614 eb78c1c3ac6cd9052aec557065070fbf NaN NaN 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN start your own business NaN NaN NaN English married or domestic partnership 150.0 6.0 6f1fbc6b2b 2017-03-09 00:36:22 2017-03-09 00:32:59 2017-03-09 00:59:46 2017-03-09 00:36:26 NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN 1.0 NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 some college credit, no degree NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 34.0 0.0 NaN NaN NaN NaN NaN less than 100,000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN United States of America United States of America NaN NaN Not working but looking for work NaN 35000.0 NaN NaN male NaN NaN 1.0 0.0 1.0 0.0 0.0 1.0 NaN 10.0 5bfef9ecb211ec4f518cfc1d2a6f3e0c 21db37adb60cdcafadfa7dca1b13b6b1 NaN 0.0 0.0 0.0 NaN Within 7 to 12 months NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN work for a nonprofit 1.0 Full-Stack Web Developer in an office with other developers English single, never married 80.0 6.0 f8f8be6910 2017-03-09 00:37:07 2017-03-09 00:33:26 2017-03-09 00:38:59 2017-03-09 00:37:10 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN 1.0 1.0 some college credit, no degree NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 21.0 0.0 NaN NaN NaN NaN NaN more than 1 million NaN NaN NaN NaN NaN 1.0 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN 15 to 29 minutes United States of America United States of America software development and IT NaN Employed for wages NaN 70000.0 NaN NaN male NaN NaN 0.0 0.0 1.0 NaN 0.0 NaN NaN 25.0 14f1863afa9c7de488050b82eb3edd96 21ba173828fbe9e27ccebaf4d5166a55 13000.0 1.0 0.0 0.0 0.0 Within 7 to 12 months 1.0 NaN NaN 1.0 1.0 1.0 NaN NaN 1.0 NaN NaN NaN NaN work for a medium-sized company 1.0 Front-End Web Developer, Back-End Web Develo... no preference Spanish single, never married 1000.0 5.0 2ed189768e 2017-03-09 00:37:58 2017-03-09 00:33:53 2017-03-09 00:40:14 2017-03-09 00:38:02 1.0 NaN 1.0 NaN NaN NaN NaN NaN NaN Codenewbie NaN NaN NaN NaN 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN 1.0 1.0 NaN high school diploma or equivalent (GED) NaN NaN NaN NaN 1.0 NaN 1.0 1.0 NaN NaN NaN NaN 1.0 1.0 NaN NaN NaN NaN NaN
3 26.0 0.0 NaN NaN NaN NaN NaN between 100,000 and 1 million NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN I work from home Brazil Brazil software development and IT NaN Employed for wages NaN 40000.0 0.0 NaN male NaN 0.0 1.0 1.0 1.0 1.0 0.0 0.0 40000.0 14.0 91756eb4dc280062a541c25a3d44cfb0 3be37b558f02daae93a6da10f83f0c77 24000.0 0.0 0.0 0.0 1.0 Within the next 6 months 1.0 NaN NaN NaN 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN work for a medium-sized company NaN Front-End Web Developer, Full-Stack Web Deve... from home Portuguese married or domestic partnership 0.0 5.0 dbdc0664d1 2017-03-09 00:40:13 2017-03-09 00:37:45 2017-03-09 00:42:26 2017-03-09 00:40:18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 NaN NaN NaN 1.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN some college credit, no degree NaN NaN NaN NaN NaN NaN NaN 1.0 NaN 1.0 1.0 NaN NaN 1.0 NaN NaN NaN NaN NaN
4 20.0 0.0 NaN NaN NaN NaN NaN between 100,000 and 1 million NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Portugal Portugal NaN NaN Not working but looking for work NaN 140000.0 NaN NaN female NaN NaN 0.0 0.0 1.0 NaN 0.0 NaN NaN 10.0 aa3f061a1949a90b27bef7411ecd193f d7c56bbf2c7b62096be9db010e86d96d NaN 0.0 0.0 0.0 NaN Within 7 to 12 months 1.0 NaN NaN NaN 1.0 1.0 NaN 1.0 1.0 NaN NaN NaN NaN work for a multinational corporation 1.0 Full-Stack Web Developer, Information Security... in an office with other developers Portuguese single, never married 0.0 24.0 11b0f2d8a9 2017-03-09 00:42:45 2017-03-09 00:39:44 2017-03-09 00:45:42 2017-03-09 00:42:50 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN bachelor's degree Information Technology NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [4]:
survey.columns.values
Out[4]:
array(['Age', 'AttendedBootcamp', 'BootcampFinish', 'BootcampLoanYesNo',
       'BootcampName', 'BootcampRecommend', 'ChildrenNumber',
       'CityPopulation', 'CodeEventConferences', 'CodeEventDjangoGirls',
       'CodeEventFCC', 'CodeEventGameJam', 'CodeEventGirlDev',
       'CodeEventHackathons', 'CodeEventMeetup', 'CodeEventNodeSchool',
       'CodeEventNone', 'CodeEventOther', 'CodeEventRailsBridge',
       'CodeEventRailsGirls', 'CodeEventStartUpWknd',
       'CodeEventWkdBootcamps', 'CodeEventWomenCode',
       'CodeEventWorkshops', 'CommuteTime', 'CountryCitizen',
       'CountryLive', 'EmploymentField', 'EmploymentFieldOther',
       'EmploymentStatus', 'EmploymentStatusOther', 'ExpectedEarning',
       'FinanciallySupporting', 'FirstDevJob', 'Gender', 'GenderOther',
       'HasChildren', 'HasDebt', 'HasFinancialDependents',
       'HasHighSpdInternet', 'HasHomeMortgage', 'HasServedInMilitary',
       'HasStudentDebt', 'HomeMortgageOwe', 'HoursLearning', 'ID.x',
       'ID.y', 'Income', 'IsEthnicMinority',
       'IsReceiveDisabilitiesBenefits', 'IsSoftwareDev',
       'IsUnderEmployed', 'JobApplyWhen', 'JobInterestBackEnd',
       'JobInterestDataEngr', 'JobInterestDataSci', 'JobInterestDevOps',
       'JobInterestFrontEnd', 'JobInterestFullStack',
       'JobInterestGameDev', 'JobInterestInfoSec', 'JobInterestMobile',
       'JobInterestOther', 'JobInterestProjMngr', 'JobInterestQAEngr',
       'JobInterestUX', 'JobPref', 'JobRelocateYesNo', 'JobRoleInterest',
       'JobWherePref', 'LanguageAtHome', 'MaritalStatus',
       'MoneyForLearning', 'MonthsProgramming', 'NetworkID',
       'Part1EndTime', 'Part1StartTime', 'Part2EndTime', 'Part2StartTime',
       'PodcastChangeLog', 'PodcastCodeNewbie', 'PodcastCodePen',
       'PodcastDevTea', 'PodcastDotNET', 'PodcastGiantRobots',
       'PodcastJSAir', 'PodcastJSJabber', 'PodcastNone', 'PodcastOther',
       'PodcastProgThrowdown', 'PodcastRubyRogues', 'PodcastSEDaily',
       'PodcastSERadio', 'PodcastShopTalk', 'PodcastTalkPython',
       'PodcastTheWebAhead', 'ResourceCodecademy', 'ResourceCodeWars',
       'ResourceCoursera', 'ResourceCSS', 'ResourceEdX',
       'ResourceEgghead', 'ResourceFCC', 'ResourceHackerRank',
       'ResourceKA', 'ResourceLynda', 'ResourceMDN', 'ResourceOdinProj',
       'ResourceOther', 'ResourcePluralSight', 'ResourceSkillcrush',
       'ResourceSO', 'ResourceTreehouse', 'ResourceUdacity',
       'ResourceUdemy', 'ResourceW3S', 'SchoolDegree', 'SchoolMajor',
       'StudentDebtOwe', 'YouTubeCodeCourse', 'YouTubeCodingTrain',
       'YouTubeCodingTut360', 'YouTubeComputerphile', 'YouTubeDerekBanas',
       'YouTubeDevTips', 'YouTubeEngineeredTruth', 'YouTubeFCC',
       'YouTubeFunFunFunction', 'YouTubeGoogleDev', 'YouTubeLearnCode',
       'YouTubeLevelUpTuts', 'YouTubeMIT', 'YouTubeMozillaHacks',
       'YouTubeOther', 'YouTubeSimplilearn', 'YouTubeTheNewBoston'],
      dtype=object)
In [5]:
survey['JobRoleInterest']
Out[5]:
0                                                      NaN
1                                 Full-Stack Web Developer
2          Front-End Web Developer, Back-End Web Develo...
3          Front-End Web Developer, Full-Stack Web Deve...
4        Full-Stack Web Developer, Information Security...
                               ...                        
18170                                                  NaN
18171      DevOps / SysAdmin,   Mobile Developer,   Pro...
18172                                                  NaN
18173                                                  NaN
18174    Back-End Web Developer, Data Engineer,   Data ...
Name: JobRoleInterest, Length: 18175, dtype: object

Observations

There is a lot of information in this data set, but the columns most pertinent to our analysis are the following:

  • JobRoleInterest - which field the respondent is interested in
  • CountryLive - which country the respondent lives in
  • MoneyForLearning - the amount of money (in US dollars) that the respondent has spent from when they started coding up to the time they completed the survey
  • MonthsProgramming - the number of months the respondent has been coding
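
As a minimal sketch of narrowing the data to these four columns and checking how much of each is missing, the snippet below uses a small made-up frame (`toy_survey`, hypothetical) instead of the real survey file. Only the column names come from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the survey; values are invented for illustration.
toy_survey = pd.DataFrame({
    'JobRoleInterest': ['Full-Stack Web Developer', np.nan, 'Data Scientist'],
    'CountryLive': ['Canada', 'India', np.nan],
    'MoneyForLearning': [150.0, 0.0, np.nan],
    'MonthsProgramming': [6.0, 12.0, 3.0],
    'Age': [27.0, 34.0, 21.0],
})

key_columns = ['JobRoleInterest', 'CountryLive', 'MoneyForLearning', 'MonthsProgramming']

# Keep only the columns relevant to the analysis and count missing values per column.
subset = toy_survey[key_columns]
missing_per_column = subset.isna().sum()
print(missing_per_column)
```

Running the same two lines against the real `survey` frame would show how many respondents skipped each question before any filtering is done.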

Determining If The Survey Is A Representative Sample

To determine whether the survey is relevant for us, we need to find out if the respondents are interested in the job fields where we offer courses.

We will look at the JobRoleInterest column to determine this:

In [6]:
survey['JobRoleInterest'].value_counts(normalize=True, ascending=False)
Out[6]:
Full-Stack Web Developer                                                                                                                                                                                      0.117706
  Front-End Web Developer                                                                                                                                                                                     0.064359
  Data Scientist                                                                                                                                                                                              0.021739
Back-End Web Developer                                                                                                                                                                                        0.020309
  Mobile Developer                                                                                                                                                                                            0.016733
                                                                                                                                                                                                                ...   
Game Developer,   DevOps / SysAdmin,   Mobile Developer,   Front-End Web Developer, Full-Stack Web Developer, Back-End Web Developer                                                                          0.000143
  Quality Assurance Engineer,   DevOps / SysAdmin,   Data Scientist, Game Developer, Information Security, Back-End Web Developer, Full-Stack Web Developer,   Front-End Web Developer,   Mobile Developer    0.000143
  Mobile Developer, Game Developer, Back-End Web Developer, Full-Stack Web Developer,   Front-End Web Developer,   User Experience Designer, User Interface Designer                                          0.000143
  Front-End Web Developer, Back-End Web Developer,   Mobile Developer,   User Experience Designer, Full-Stack Web Developer                                                                                   0.000143
Back-End Web Developer,   DevOps / SysAdmin,   Front-End Web Developer, Full-Stack Web Developer,   Mobile Developer,   User Experience Designer, Game Developer                                              0.000143
Name: JobRoleInterest, Length: 3213, dtype: float64

Observations

It appears that Web Development and Data Science are among the most popular fields. However, many respondents selected multiple fields.

To extract the information we need, we will split the strings so that we can determine:

1) if the respondent has one or multiple interests and
2) if they are interested in the fields covered by our courses.
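
The splitting step can be sketched on a few made-up answers written in the same comma-separated style as JobRoleInterest. The names `toy_interests`, `roles`, `n_roles` and `role_counts` are hypothetical. Stripping whitespace and using `explode` are additions for illustration, not something the notebook itself does.

```python
import pandas as pd

# Made-up answers; note the irregular spacing after some commas, as in the real column.
toy_interests = pd.Series([
    'Full-Stack Web Developer',
    'Front-End Web Developer,   Data Scientist',
    'Game Developer, Mobile Developer,   Data Scientist',
])

# Split on commas, then strip the stray whitespace around each role.
roles = toy_interests.str.split(',').apply(lambda parts: [p.strip() for p in parts])

# Number of roles each respondent selected.
n_roles = roles.str.len()

# One row per (respondent, role), so individual roles can be counted.
role_counts = roles.explode().value_counts()

print(n_roles.tolist())
print(role_counts['Data Scientist'])
```

Counting individual roles this way avoids treating every distinct combination of roles as its own category.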

Interested in One or Multiple Areas of Study?

In [7]:
# Split each string in the 'JobRoleInterest' column
interests_split = survey['JobRoleInterest'].dropna()
interests_split = interests_split.str.split(',')

# Frequency table for the var describing the number of options
interests_count = interests_split.apply(lambda x: len(x)) # x is a list of job options
interests_count.value_counts(normalize = True).sort_index() * 100
Out[7]:
1     31.650458
2     10.883867
3     15.889588
4     15.217391
5     12.042334
6      6.721968
7      3.861556
8      1.759153
9      0.986842
10     0.471968
11     0.185927
12     0.300343
13     0.028604
Name: JobRoleInterest, dtype: float64
In [8]:
# Transform Data and Aggregate 
interests_split = pd.DataFrame(interests_split)
interests_split['multiple'] = interests_split['JobRoleInterest'].apply(
    lambda x: 'multiple' if len(x) >1 else 'one')
interests_split_gb = interests_split.groupby(['multiple']).count()
interests_split_gb

# Plot
fig, ax = plt.subplots(figsize=(6,6))

plt.pie(interests_split_gb['JobRoleInterest'], 
        labels=('Multiple Interests', 'One Interest'), 
        startangle=90,
        autopct='%1.1f%%',
        textprops={'fontsize': 8})

# Plot Aesthetics
plt.title('Number of Job Interests',
          fontsize=10,
          y=0.95)
Out[8]:
Text(0.5, 0.95, 'Number of Job Interests')

Observations

  • Almost 70% of the respondents have more than one interest.
  • Since our company offers courses across several disciplines, this isn't a disqualifying factor at all. It may be a benefit, since they could possibly take multiple courses with us.
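
The share of respondents with more than one interest can also be read off without a plot. The sketch below assumes a Series of per-respondent role counts like the one built earlier; `toy_counts` is a hypothetical stand-in with invented values.

```python
import pandas as pd

# Hypothetical number of selected roles for five respondents.
toy_counts = pd.Series([1, 3, 2, 1, 5])

# The mean of a boolean Series is the fraction of True values.
share_multiple = (toy_counts > 1).mean() * 100
print(f'{share_multiple:.1f}% selected more than one role')
```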

Are The Respondents Interested In The Courses We Offer?

In [9]:
# Flag respondents whose interests include our primary (web/mobile) or secondary (data/game) subjects
# Rows with a missing JobRoleInterest stay NaN and are ignored by value_counts()
contains_web_mobile = survey['JobRoleInterest'].str.contains('Web Developer|Mobile Developer')
contains_data_game = survey['JobRoleInterest'].str.contains('Data|Game')

# Plot
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(12,6))

patches, texts, autotexts = ax1.pie(contains_web_mobile.value_counts(), 
        startangle=90,
        autopct='%1.1f%%',
        textprops={'fontsize': 16})

patches, texts, autotexts = ax2.pie(contains_data_game.value_counts(),
        startangle=90,
        autopct='%1.1f%%',
        textprops={'fontsize': 16})

# Plot Aesthetics
ax1.set_title('Interested In Web or Mobile Development?',
              y=0.97,
              fontsize=16)
ax2.set_title('Interested In Data Science or Game Development?',
              y=0.97,
              fontsize=16)
pielabels = ['Yes', 'No']
fig.legend(pielabels,
           loc='lower center',
           prop={'size': 18})
plt.tight_layout()

Observations

  • Over 85% of the respondents are interested in the primary courses we offer (Web and Mobile Development).
  • Over half of the respondents are interested in the secondary courses we offer (Data Science and Game Development).
  • Based on this, we can be confident that the survey is a representative sample of our target population.

Next we need to determine where the respondents live and whether they are willing to spend enough money to enroll in our courses.

Finding Locations To Advertise In

Next, we need to determine where our target market lives so we know where to deploy resources. We will use the CountryLive column.

We will use the contains_web_mobile and contains_data_game series created above as boolean masks against the survey dataframe. This gives us the respondents who are interested in the subjects of the courses we offer. From there we can find the most common locations and choose among them.
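
Boolean masking with missing answers can be sketched as follows. This is an alternative to the approach in the next cell, not a copy of it: passing `na=False` to `str.contains` marks unanswered rows as False, so the mask can index the frame directly. The frame `toy` and the name `is_web_mobile` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows; the second respondent did not answer the job-interest question.
toy = pd.DataFrame({
    'JobRoleInterest': ['Full-Stack Web Developer', np.nan, 'Data Scientist', 'Mobile Developer'],
    'CountryLive': ['Canada', 'India', 'Brazil', 'India'],
})

# na=False turns missing answers into False instead of NaN,
# which keeps the mask purely boolean.
is_web_mobile = toy['JobRoleInterest'].str.contains('Web Developer|Mobile Developer', na=False)

# Keep only respondents interested in web or mobile development.
toy_web_mobile = toy[is_web_mobile]
print(toy_web_mobile['CountryLive'].value_counts())
```

Either approach selects the same respondents. Dropping the NaN rows first, as the notebook does, also shrinks the base frame for later steps.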

In [10]:
# Remove rows that have NaN values in the JobRoleInterest column
# NaN values in this column mean the respondent did not answer that question
survey = survey[survey['JobRoleInterest'].notna()]

# Remove the same rows in our boolean masks series
contains_web_mobile = contains_web_mobile[contains_web_mobile.notna()]
contains_data_game = contains_data_game[contains_data_game.notna()]

# Apply the filters to get dataframes that hold respondants interested in our courses
survey_web_mobile = survey[contains_web_mobile].copy()
survey_data_game = survey[contains_data_game].copy()

# Plot
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize=(14,6))
plt.rcParams['figure.dpi'] = 460
plt.suptitle('Country of Residence (Percentage)',
             fontsize=22,
             y=1.04)

ax1 = plt.subplot(1,2,1)
ax1 = sns.barplot(data = survey_web_mobile,
                  x = survey_web_mobile['CountryLive'].value_counts(normalize=True).index[:10],
                  y = survey_web_mobile['CountryLive'].value_counts(normalize=True)[:10] * 100,
                  color = 'steelblue')
ax1.set_title('Web & Mobile Interest', fontsize=18)
ax1.set_ylabel('')
ax1.tick_params(labelsize=14)
plt.xticks(rotation = 45, ha='right')
sns.despine(left=True)
sns.set(style='whitegrid')

ax2 = plt.subplot(1,2,2)
ax2 = sns.barplot(data = survey_data_game,
                  x = survey_data_game['CountryLive'].value_counts(normalize=True).index[:10],
                  y = survey_data_game['CountryLive'].value_counts(normalize=True)[:10] * 100,
                  color = 'steelblue')
ax2.set_title('Data & Game Interest', fontsize=18)
ax2.set_ylabel('')
ax2.tick_params(labelsize=14)
plt.xticks(rotation = 45, ha='right')
sns.despine(left=True)
sns.set(style='whitegrid')
In [11]:
survey_web_mobile['CountryLive'].value_counts()[:10]
Out[11]:
United States of America    2676
India                        443
United Kingdom               281
Canada                       221
Poland                       121
Brazil                       115
Germany                      107
Russia                        93
Australia                     88
Ukraine                       86
Name: CountryLive, dtype: int64
In [12]:
survey_data_game['CountryLive'].value_counts()[:10]
Out[12]:
United States of America    1444
India                        247
United Kingdom               137
Canada                       123
Brazil                        66
Germany                       60
Australia                     55
Poland                        54
Russia                        44
Spain                         37
Name: CountryLive, dtype: int64

Observations

  • The United States has by far the most respondents interested in our courses: over 40% of each group, or about 4,000 responses across the two groups (2,676 for web and mobile, 1,444 for data and game; respondents interested in both are counted in each group).
  • India is second with about 8%, or about 700 responses across the two groups, followed by the UK and Canada.
  • The percentage breakdown is nearly identical for our primary and secondary courses.
  • Since our courses are written in English, it makes sense to include the UK and Canada in the list of countries we will consider investing in.
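
Counts and percentages like those quoted above can be put side by side in one table. The sketch below uses a made-up Series of country answers (`toy_countries`, hypothetical); the resulting `freq` table is an illustration, not output from the survey.

```python
import pandas as pd

# Invented answers to the CountryLive question.
toy_countries = pd.Series([
    'United States of America', 'India', 'United States of America',
    'Canada', 'United States of America', 'India',
])

# Absolute and relative frequencies share the same index, so they align row by row.
freq = pd.DataFrame({
    'count': toy_countries.value_counts(),
    'percent': toy_countries.value_counts(normalize=True) * 100,
})
print(freq)
```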

Do They Plan To Spend Money On Taking Courses?

Now that we know where our target market lives, we need to look at another critical factor: Do they intend to spend money on courses? If so, is it more or less than the price of our courses?

The MoneyForLearning column is the amount of money (in US dollars) that the respondent has spent from when they started coding up to the time they completed the survey. Since our courses are paid for with a monthly subscription, we would like to break this down into a monthly figure. Fortunately, we have the MonthsProgramming column, which is the number of months the respondent has been coding.
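
The monthly figure is simply MoneyForLearning divided by MonthsProgramming. A minimal sketch on invented values (the frame `toy` is hypothetical), treating 0 months as 1 month so the division is always defined:

```python
import pandas as pd

# Invented spending and experience values for four respondents.
toy = pd.DataFrame({
    'MoneyForLearning': [80.0, 1000.0, 0.0, 200.0],
    'MonthsProgramming': [6.0, 5.0, 5.0, 0.0],
})

# Treat 0 months as 1 month to avoid dividing by zero.
months = toy['MonthsProgramming'].replace(0, 1)

# Average spend per month of learning, rounded to cents.
toy['MoneyForLearning_monthly'] = (toy['MoneyForLearning'] / months).round(2)
print(toy['MoneyForLearning_monthly'].tolist())
```

For example, 80 dollars spread over 6 months gives 13.33 dollars per month.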

Next we will create a bar plot to visualize the average monthly expenditure for our 4 countries of interest:

  • The United States
  • India
  • The United Kingdom
  • Canada
In [13]:
# Some respondents indicated they have 0 months of experience. We change it to 1 so we won't divide by 0.
survey_web_mobile['MonthsProgramming'] = survey_web_mobile['MonthsProgramming'].replace(0, 1)
survey_data_game['MonthsProgramming'] = survey_data_game['MonthsProgramming'].replace(0, 1)

# Remove null values for both columns
survey_web_mobile_clean = survey_web_mobile[survey_web_mobile['MoneyForLearning'].notna()]
survey_web_mobile_clean = survey_web_mobile_clean[survey_web_mobile_clean['MonthsProgramming'].notna()]
survey_data_game_clean = survey_data_game[survey_data_game['MoneyForLearning'].notna()].copy()
survey_data_game_clean = survey_data_game_clean[survey_data_game_clean['MonthsProgramming'].notna()]

# Calculate monthly spending
survey_web_mobile_clean['MoneyForLearning_monthly'] = round(
    (survey_web_mobile_clean['MoneyForLearning'] / survey_web_mobile_clean['MonthsProgramming']),2)
survey_data_game_clean['MoneyForLearning_monthly'] = round(
    (survey_data_game_clean['MoneyForLearning'] / survey_data_game_clean['MonthsProgramming']),2)

# Remove null values from the CountryLive column
survey_web_mobile_clean = survey_web_mobile_clean[survey_web_mobile_clean['CountryLive'].notna()].copy()
survey_data_game_clean = survey_data_game_clean[survey_data_game_clean['CountryLive'].notna()].copy()

survey_web_mobile_clean.head(3)
Out[13]:
Age AttendedBootcamp BootcampFinish BootcampLoanYesNo BootcampName BootcampRecommend ChildrenNumber CityPopulation CodeEventConferences CodeEventDjangoGirls CodeEventFCC CodeEventGameJam CodeEventGirlDev CodeEventHackathons CodeEventMeetup CodeEventNodeSchool CodeEventNone CodeEventOther CodeEventRailsBridge CodeEventRailsGirls CodeEventStartUpWknd CodeEventWkdBootcamps CodeEventWomenCode CodeEventWorkshops CommuteTime CountryCitizen CountryLive EmploymentField EmploymentFieldOther EmploymentStatus EmploymentStatusOther ExpectedEarning FinanciallySupporting FirstDevJob Gender GenderOther HasChildren HasDebt HasFinancialDependents HasHighSpdInternet HasHomeMortgage HasServedInMilitary HasStudentDebt HomeMortgageOwe HoursLearning ID.x ID.y Income IsEthnicMinority IsReceiveDisabilitiesBenefits IsSoftwareDev IsUnderEmployed JobApplyWhen JobInterestBackEnd JobInterestDataEngr JobInterestDataSci JobInterestDevOps JobInterestFrontEnd JobInterestFullStack JobInterestGameDev JobInterestInfoSec JobInterestMobile JobInterestOther JobInterestProjMngr JobInterestQAEngr JobInterestUX JobPref JobRelocateYesNo JobRoleInterest JobWherePref LanguageAtHome MaritalStatus MoneyForLearning MonthsProgramming NetworkID Part1EndTime Part1StartTime Part2EndTime Part2StartTime PodcastChangeLog PodcastCodeNewbie PodcastCodePen PodcastDevTea PodcastDotNET PodcastGiantRobots PodcastJSAir PodcastJSJabber PodcastNone PodcastOther PodcastProgThrowdown PodcastRubyRogues PodcastSEDaily PodcastSERadio PodcastShopTalk PodcastTalkPython PodcastTheWebAhead ResourceCodecademy ResourceCodeWars ResourceCoursera ResourceCSS ResourceEdX ResourceEgghead ResourceFCC ResourceHackerRank ResourceKA ResourceLynda ResourceMDN ResourceOdinProj ResourceOther ResourcePluralSight ResourceSkillcrush ResourceSO ResourceTreehouse ResourceUdacity ResourceUdemy ResourceW3S SchoolDegree SchoolMajor StudentDebtOwe YouTubeCodeCourse YouTubeCodingTrain YouTubeCodingTut360 YouTubeComputerphile YouTubeDerekBanas YouTubeDevTips 
YouTubeEngineeredTruth YouTubeFCC YouTubeFunFunFunction YouTubeGoogleDev YouTubeLearnCode YouTubeLevelUpTuts YouTubeMIT YouTubeMozillaHacks YouTubeOther YouTubeSimplilearn YouTubeTheNewBoston MoneyForLearning_monthly
1 34.0 0.0 NaN NaN NaN NaN NaN less than 100,000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN United States of America United States of America NaN NaN Not working but looking for work NaN 35000.0 NaN NaN male NaN NaN 1.0 0.0 1.0 0.0 0.0 1.0 NaN 10.0 5bfef9ecb211ec4f518cfc1d2a6f3e0c 21db37adb60cdcafadfa7dca1b13b6b1 NaN 0.0 0.0 0.0 NaN Within 7 to 12 months NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN work for a nonprofit 1.0 Full-Stack Web Developer in an office with other developers English single, never married 80.0 6.0 f8f8be6910 2017-03-09 00:37:07 2017-03-09 00:33:26 2017-03-09 00:38:59 2017-03-09 00:37:10 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN 1.0 1.0 some college credit, no degree NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 13.33
2 21.0 0.0 NaN NaN NaN NaN NaN more than 1 million NaN NaN NaN NaN NaN 1.0 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN 15 to 29 minutes United States of America United States of America software development and IT NaN Employed for wages NaN 70000.0 NaN NaN male NaN NaN 0.0 0.0 1.0 NaN 0.0 NaN NaN 25.0 14f1863afa9c7de488050b82eb3edd96 21ba173828fbe9e27ccebaf4d5166a55 13000.0 1.0 0.0 0.0 0.0 Within 7 to 12 months 1.0 NaN NaN 1.0 1.0 1.0 NaN NaN 1.0 NaN NaN NaN NaN work for a medium-sized company 1.0 Front-End Web Developer, Back-End Web Develo... no preference Spanish single, never married 1000.0 5.0 2ed189768e 2017-03-09 00:37:58 2017-03-09 00:33:53 2017-03-09 00:40:14 2017-03-09 00:38:02 1.0 NaN 1.0 NaN NaN NaN NaN NaN NaN Codenewbie NaN NaN NaN NaN 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN 1.0 1.0 NaN high school diploma or equivalent (GED) NaN NaN NaN NaN 1.0 NaN 1.0 1.0 NaN NaN NaN NaN 1.0 1.0 NaN NaN NaN NaN NaN 200.00
3 26.0 0.0 NaN NaN NaN NaN NaN between 100,000 and 1 million NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN I work from home Brazil Brazil software development and IT NaN Employed for wages NaN 40000.0 0.0 NaN male NaN 0.0 1.0 1.0 1.0 1.0 0.0 0.0 40000.0 14.0 91756eb4dc280062a541c25a3d44cfb0 3be37b558f02daae93a6da10f83f0c77 24000.0 0.0 0.0 0.0 1.0 Within the next 6 months 1.0 NaN NaN NaN 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN work for a medium-sized company NaN Front-End Web Developer, Full-Stack Web Deve... from home Portuguese married or domestic partnership 0.0 5.0 dbdc0664d1 2017-03-09 00:40:13 2017-03-09 00:37:45 2017-03-09 00:42:26 2017-03-09 00:40:18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 NaN NaN NaN 1.0 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN some college credit, no degree NaN NaN NaN NaN NaN NaN NaN 1.0 NaN 1.0 1.0 NaN NaN 1.0 NaN NaN NaN NaN NaN 0.00
In [14]:
# Group and calculate mean expenditure per country
# Select the numeric column before aggregating so text columns are not averaged
countries = ['United States of America', 'India', 'United Kingdom', 'Canada']
web_mobile_gb = survey_web_mobile_clean.groupby('CountryLive')[['MoneyForLearning_monthly']].mean()
web_mobile_gb = web_mobile_gb.loc[countries]
data_game_gb = survey_data_game_clean.groupby('CountryLive')[['MoneyForLearning_monthly']].mean()
data_game_gb = data_game_gb.loc[countries]

# Merge into one dataframe
monthly_spend = web_mobile_gb.merge(data_game_gb, how='inner', on='CountryLive')
monthly_spend = monthly_spend.rename(
    mapper={'MoneyForLearning_monthly_x':'web_mobile', 'MoneyForLearning_monthly_y':'data_game'}, axis=1)

# Further transform the data for a grouped bar plot
monthly_spend= monthly_spend.reset_index()
monthly_spend = monthly_spend.melt(id_vars=['CountryLive'])

monthly_spend
Out[14]:
CountryLive variable value
0 United States of America web_mobile 249.521510
1 India web_mobile 146.663842
2 United Kingdom web_mobile 49.117287
3 Canada web_mobile 129.170936
4 United States of America data_game 163.720415
5 India data_game 105.637189
6 United Kingdom data_game 23.719098
7 Canada data_game 119.184054
In [15]:
# Plot
fig, ax = plt.subplots(figsize=(6,3))
ax = sns.barplot(x='CountryLive',
                 y='value',
                 hue='variable',
                 data=monthly_spend)

# Plot Aesthetics
handles, labels = ax.get_legend_handles_labels()
labels_new = ['Web & Mobile Development', 'Data Science & Game Development']

ax.set_title('Monthly Learning Expenditure (USD)', fontsize=10)
ax.set_ylabel('')
ax.set_xlabel('')
ax.tick_params(labelsize=7)
plt.xticks(rotation = 45, ha='right')
plt.legend(labels = labels_new,
           handles = handles,
           prop={'size': 7},
           fontsize=5,
           frameon=False,
           bbox_to_anchor=(0.5,0.95))
sns.despine(left=True)
sns.set(style='whitegrid')

Observations

  • At first glance, the US appears to have the highest monthly spending.
  • The Web & Mobile average for the US is much higher than the others, so let's check whether extreme values are causing this.
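
One quick way to tell whether an average is driven by a few extreme values is to compare the mean with the median. The sketch below uses invented spending values with a single very large entry; `toy_spend` is hypothetical.

```python
import pandas as pd

# Invented monthly spending values; the last one is an extreme entry.
toy_spend = pd.Series([0.0, 0.0, 5.0, 20.0, 50.0, 80000.0])

# The mean is pulled up by the extreme value; the median is not.
print('mean:  ', round(toy_spend.mean(), 2))
print('median:', toy_spend.median())

# describe() accepts custom percentiles to inspect the upper tail.
print(toy_spend.describe(percentiles=[.5, .95]))
```

A mean far above the median, as here, suggests that a handful of respondents account for most of the average.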

Dealing With Outliers

In [16]:
# Create dataframes that just have our countries of interest
countries = ['United States of America', 'India', 'United Kingdom', 'Canada']
survey_web_mobile_countries = survey_web_mobile_clean[survey_web_mobile_clean['CountryLive'].isin(countries)]
survey_data_game_countries = survey_data_game_clean[survey_data_game_clean['CountryLive'].isin(countries)]
In [17]:
# Plot
sns.boxplot(data=survey_web_mobile_countries, x='CountryLive', y='MoneyForLearning_monthly')
Out[17]:
<AxesSubplot:xlabel='CountryLive', ylabel='MoneyForLearning_monthly'>
In [18]:
# Plot
sns.boxplot(data=survey_data_game_countries, x='CountryLive', y='MoneyForLearning_monthly') 
Out[18]:
<AxesSubplot:xlabel='CountryLive', ylabel='MoneyForLearning_monthly'>

Observations

Clearly we have many extreme outliers, including two for the US that are very extreme. Let's use the describe() method with custom percentiles to see the upper percentile ranges:

In [19]:
survey_web_mobile_countries[survey_web_mobile_countries['CountryLive'] == 'United States of America']['MoneyForLearning_monthly'].describe(percentiles=[.25, .5, .75, .85, .95, .98])
Out[19]:
count     2516.000000
mean       249.521510
std       2080.607888
min          0.000000
25%          0.000000
50%          4.170000
75%         50.000000
85%        166.670000
95%        979.165000
98%       2333.330000
max      80000.000000
Name: MoneyForLearning_monthly, dtype: float64
In [20]:
survey_web_mobile_countries[survey_web_mobile_countries['CountryLive'] == 'India']['MoneyForLearning_monthly'].describe(percentiles=[.25, .5, .75, .85, .95, .98])
Out[20]:
count      393.000000
mean       146.663842
std        747.584003
min          0.000000
25%          0.000000
50%          0.000000
75%         12.500000
85%         60.500000
95%        500.000000
98%       1720.002800
max      10000.000000
Name: MoneyForLearning_monthly, dtype: float64
In [21]:
survey_web_mobile_countries[survey_web_mobile_countries['CountryLive'] == 'United Kingdom']['MoneyForLearning_monthly'].describe(percentiles=[.25, .5, .75, .85, .95, .98])
Out[21]:
count     247.000000
mean       49.117287
std       171.432998
min         0.000000
25%         0.000000
50%         0.250000
75%        22.915000
85%        60.000000
95%       200.000000
98%       403.508800
max      1400.000000
Name: MoneyForLearning_monthly, dtype: float64
In [22]:
survey_web_mobile_countries[survey_web_mobile_countries['CountryLive'] == 'Canada']['MoneyForLearning_monthly'].describe(percentiles=[.25, .5, .75, .85, .95, .98])
Out[22]:
count     203.000000
mean      129.170936
std       476.672180
min         0.000000
25%         0.000000
50%         0.830000
75%        25.000000
85%        83.330000
95%       857.276000
98%      1493.333200
max      5000.000000
Name: MoneyForLearning_monthly, dtype: float64
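The four per-country summaries above could also be produced in a single call by grouping first and then describing. This is only a sketch on made-up data (the countries 'A' and 'B' and their values are hypothetical); the column names and the percentile list are the ones used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical values, not from the survey)
toy = pd.DataFrame({
    'CountryLive': ['A'] * 5 + ['B'] * 5,
    'MoneyForLearning_monthly': [0.0, 0.0, 10.0, 50.0, 1000.0,
                                 0.0, 5.0, 5.0, 20.0, 200.0],
})

percentiles = [.25, .5, .75, .85, .95, .98]

# One row per country, one column per statistic
summary = toy.groupby('CountryLive')['MoneyForLearning_monthly'].describe(percentiles=percentiles)

# Transpose so each country becomes a column, which is easier to compare
print(summary.T)
```

Having all countries in one table makes it easier to compare the upper percentiles side by side.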

Observations

There are many outliers, and it's hard to know why the respondents entered those values. Some possibilities are:

  • The number is incorrect.
  • They included college tuition in that number.
  • They entered the total cost (instead of the monthly cost) of a bootcamp/course/college they attended.
  • It's an amount that they would be willing to pay, not an amount they have actually paid.
  • They included some other source of funding (scholarship, financial aid, company is paying, etc.).
  • They actually are paying that much per month and the number is correct.

Remember we are using a survey that wasn't specifically designed for our intended purpose. Since we simply want to know which countries tend to invest more in their learning and since our courses only cost $59 per month, it isn't critical that we know if someone is paying or would be willing to pay $200, $800, or $2500 per month for a course. For our analysis here, it is sufficient to know that they are willing to spend something more than what our courses cost.

To represent this, we could change any value over a certain threshold to some other (lower) amount that is closer to the rest of the values. How can we achieve this while still representing that value as an outlier, but without letting it skew our analysis so much?

If we look at the percentile tables above, we can see a big jump from the 85th percentile to the 95th percentile. If we cap any value above a country's 95th percentile at that 95th percentile value, the capped values will still be represented as relatively "high" values, which indicates that those respondents are willing to spend a lot of money on their learning, without letting the most extreme entries dominate the mean.
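A compact way to express this kind of percentile capping is sketched below on made-up data. It assumes the cap is each country's own 95th percentile, as in the `imputer` function defined later in this notebook; the countries 'A' and 'B', their values, and the `capped` column name are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values, not from the survey)
toy = pd.DataFrame({
    'CountryLive': ['A'] * 5 + ['B'] * 5,
    'MoneyForLearning_monthly': [0.0, 10.0, 20.0, 30.0, 5000.0,
                                 0.0, 1.0, 2.0, 3.0, 400.0],
})
col = 'MoneyForLearning_monthly'

# Per-row cap: the 95th percentile of the row's own country
cap = toy.groupby('CountryLive')[col].transform(lambda s: s.quantile(.95))

# Values above the cap are lowered to the cap; everything else is unchanged
toy['capped'] = toy[col].clip(upper=cap)
print(toy)
```

Only the largest entry in each toy group is lowered, so it still stands out as a high value but no longer dominates the group mean.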

In [23]:
# Update the dataframes so they only include the information we need
# (.copy() gives independent frames that can be modified safely later)
survey_web_mobile_countries = survey_web_mobile_countries.loc[:, ['CountryLive', 'MoneyForLearning_monthly']].copy()
survey_data_game_countries = survey_data_game_countries.loc[:, ['CountryLive', 'MoneyForLearning_monthly']].copy()

survey_web_mobile_countries
Out[23]:
CountryLive MoneyForLearning_monthly
1 United States of America 13.33
2 United States of America 200.00
6 United Kingdom 0.00
15 United States of America 0.00
16 United States of America 16.67
... ... ...
18107 India 275.00
18111 India 200.00
18113 United States of America 0.00
18130 United States of America 0.00
18156 India 1000.00

3359 rows × 2 columns

In [24]:
# Create function that will impute values that are over the 95th percentile with a value equal to the 95th percentile

def imputer(df):
    countries = ['United States of America', 'India', 'United Kingdom', 'Canada']
    for country in countries:
        value = df[df['CountryLive'] == country]['MoneyForLearning_monthly'].quantile(.95) #The 95th percentile value
        mask = (df['CountryLive'] == country) & (df['MoneyForLearning_monthly'] > value)
        df['MoneyForLearning_monthly'] = df['MoneyForLearning_monthly'].mask(mask, value)
    return df

# Execute function on both dataframes
imputer(survey_web_mobile_countries)
imputer(survey_data_game_countries)
Out[24]:
CountryLive MoneyForLearning_monthly
19 United States of America 17.86
32 United States of America 100.00
35 United States of America 0.00
40 United States of America 25.00
52 India 0.00
... ... ...
18049 United States of America 0.00
18050 United States of America 16.67
18111 India 200.00
18113 United States of America 0.00
18130 United States of America 0.00

1798 rows × 2 columns

As we can see below, the highest values in the MoneyForLearning_monthly column are equal to the 95th percentile value for that country:

In [25]:
survey_web_mobile_countries[survey_web_mobile_countries['CountryLive'] == 'United States of America']['MoneyForLearning_monthly'].sort_values(ascending=False)
Out[25]:
7612     979.165
9801     979.165
13587    979.165
13517    979.165
3013     979.165
          ...   
9875       0.000
9857       0.000
9823       0.000
3793       0.000
18130      0.000
Name: MoneyForLearning_monthly, Length: 2516, dtype: float64
In [26]:
survey_data_game_countries[survey_data_game_countries['CountryLive'] == 'United States of America']['MoneyForLearning_monthly'].sort_values(ascending=False)
Out[26]:
4944     597.9
4365     597.9
4824     597.9
4831     597.9
4832     597.9
         ...  
6750       0.0
6760       0.0
6782       0.0
6847       0.0
18130      0.0
Name: MoneyForLearning_monthly, Length: 1348, dtype: float64
In [27]:
# Group and calculate mean expenditure per country
web_mobile_gb2 = survey_web_mobile_countries.groupby(by='CountryLive').mean()
data_game_gb2 = survey_data_game_countries.groupby(by='CountryLive').mean()

# Merge into one dataframe
monthly_spend2 = web_mobile_gb2.merge(data_game_gb2, how='inner', on='CountryLive')
monthly_spend2 = monthly_spend2.rename(mapper={'MoneyForLearning_monthly_x':'web_mobile', 'MoneyForLearning_monthly_y':'data_game'}, axis=1)

# Further transform the data for a grouped bar plot
monthly_spend2 = monthly_spend2.reset_index()
monthly_spend2 = monthly_spend2.melt(id_vars=['CountryLive'])

monthly_spend2
Out[27]:
CountryLive variable value
0 Canada web_mobile 81.982000
1 India web_mobile 47.321170
2 United Kingdom web_mobile 27.441377
3 United States of America web_mobile 103.435688
4 Canada data_game 43.849279
5 India data_game 27.864498
6 United Kingdom data_game 14.040410
7 United States of America data_game 60.228316
In [28]:
# Plot
sns.set(style='whitegrid')  # set the style before drawing so it applies to this figure
fig, ax = plt.subplots(figsize=(6,3))
sns.barplot(x='CountryLive',
            y='value',
            hue='variable',
            data=monthly_spend2,
            ax=ax)

# Plot Aesthetics
handles, labels = ax.get_legend_handles_labels()
labels_new = ['Web & Mobile Development', 'Data Science & Game Development']
ax.set_title('Adjusted Monthly Learning Expenditure (USD)', fontsize=10)
ax.set_ylabel('')
ax.set_xlabel('')
ax.tick_params(labelsize=7)
# Reference line at the monthly course price
ax.axhline(59,
           color='red',
           ls='-')
ax.text(.98,
        63,
        "Price of Course = $59",
        color='red',
        fontsize=8)
plt.xticks(rotation=45, ha='right')
ax.legend(labels=labels_new,
          handles=handles,
          prop={'size': 7},  # prop takes precedence over fontsize, so fontsize is dropped
          frameon=False,
          bbox_to_anchor=(0.7,0.915))
sns.despine(left=True)