In this project we work as analysts for an e-learning company that offers courses on programming. Most of our courses are on web and mobile development, but we also cover many other domains, like data science and game development.
Our company has some money to invest in advertising. We are going to determine the two most promising markets to advertise in.
To reach our goal, we could organize surveys for a couple of different markets to find out which would be the best choices for advertising. This is very costly, however, and it's a good call to explore cheaper options first.
We can try searching for existing data that might be relevant for our purpose. One good candidate is the data from freeCodeCamp's 2017 New Coder Survey.
freeCodeCamp is a free e-learning platform that offers courses on web development. Because they run a popular Medium publication (over 400,000 followers), their survey attracted new coders with varying interests (not only web development), which is ideal for the purpose of our analysis.
The survey data is publicly available in freeCodeCamp's GitHub repository.
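If the CSV isn't already available locally, it can be read straight from GitHub; a minimal sketch, assuming the raw-file path below (the exact repository layout is my assumption, not stated here):

import pandas as pd

# Hypothetical raw-file URL; adjust if the repository layout differs
url = ('https://raw.githubusercontent.com/freeCodeCamp/'
       '2017-new-coder-survey/master/clean-data/'
       '2017-fCC-New-Coders-Survey-Data.csv')
data = pd.read_csv(url, low_memory=False)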
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = pd.read_csv('2017-fCC-New-Coders-Survey-Data.csv', low_memory=False)
data.info()
data.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18175 entries, 0 to 18174
Columns: 136 entries, Age to YouTubeTheNewBoston
dtypes: float64(105), object(31)
memory usage: 18.9+ MB
 | Age | AttendedBootcamp | BootcampFinish | BootcampLoanYesNo | BootcampName | BootcampRecommend | ChildrenNumber | CityPopulation | CodeEventConferences | CodeEventDjangoGirls | ... | YouTubeFCC | YouTubeFunFunFunction | YouTubeGoogleDev | YouTubeLearnCode | YouTubeLevelUpTuts | YouTubeMIT | YouTubeMozillaHacks | YouTubeOther | YouTubeSimplilearn | YouTubeTheNewBoston
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 27.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 34.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | less than 100,000 | NaN | NaN | ... | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 21.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | ... | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
3 | 26.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | NaN | NaN | ... | 1.0 | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
4 | 20.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 136 columns
# Show all columns with more than 5,000 non-null values
for col in data.columns:
    if data[col].notna().sum() > 5000:
        print(col, data[col].notna().sum())
Age 15367
AttendedBootcamp 17709
CityPopulation 15354
CommuteTime 9246
CountryCitizen 15382
CountryLive 15336
EmploymentField 8116
EmploymentStatus 14345
ExpectedEarning 7200
Gender 15454
HasDebt 15291
HasFinancialDependents 15329
HasHighSpdInternet 15380
HasHomeMortgage 6397
HasServedInMilitary 15256
HasStudentDebt 6402
HoursLearning 16714
ID.x 18175
ID.y 15743
Income 7623
IsEthnicMinority 15293
IsReceiveDisabilitiesBenefits 15222
IsSoftwareDev 18068
IsUnderEmployed 9223
JobApplyWhen 8138
JobPref 13483
JobRelocateYesNo 6503
JobRoleInterest 6992
JobWherePref 8118
LanguageAtHome 15344
MaritalStatus 15335
MoneyForLearning 16577
MonthsProgramming 17084
NetworkID 18175
Part1EndTime 18175
Part1StartTime 18175
Part2EndTime 15743
Part2StartTime 15743
ResourceCodecademy 9390
ResourceFCC 13803
ResourceMDN 6417
ResourceSO 11207
ResourceUdemy 5120
ResourceW3S 9765
SchoolDegree 15368
SchoolMajor 8727
YouTubeFCC 6036
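As an aside, the same check can be written without an explicit loop; a vectorized equivalent:

# Count non-null values per column, then keep only columns above the threshold
non_null_counts = data.notna().sum()
print(non_null_counts[non_null_counts > 5000])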
We've determined the columns we want to analyze. These are:
relevant_data = data[['ID.x',
'CountryLive',
'MonthsProgramming',
'MoneyForLearning',
'JobRoleInterest'
]].copy()
relevant_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18175 entries, 0 to 18174
Data columns (total 5 columns):
ID.x                 18175 non-null object
CountryLive          15336 non-null object
MonthsProgramming    17084 non-null float64
MoneyForLearning     16577 non-null float64
JobRoleInterest      6992 non-null object
dtypes: float64(2), object(3)
memory usage: 710.0+ KB
As our goal is to choose two markets for advertising, we can't make use of answers where a respondent reported no job interest: we simply can't predict which courses to advertise to them. So we'll keep only the answers with at least one job interest.
relevant_data = relevant_data[relevant_data['JobRoleInterest'].notnull()]
relevant_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6992 entries, 1 to 18174
Data columns (total 5 columns):
ID.x                 6992 non-null object
CountryLive          6838 non-null object
MonthsProgramming    6633 non-null float64
MoneyForLearning     6476 non-null float64
JobRoleInterest      6992 non-null object
dtypes: float64(2), object(3)
memory usage: 327.8+ KB
Now we have only about a third of the initial data set, and there are still some null values. Let's see how dropping all rows with nulls affects the data set.
relevant_data.dropna(axis=0, inplace=True)
relevant_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6212 entries, 1 to 18171
Data columns (total 5 columns):
ID.x                 6212 non-null object
CountryLive          6212 non-null object
MonthsProgramming    6212 non-null float64
MoneyForLearning     6212 non-null float64
JobRoleInterest      6212 non-null object
dtypes: float64(2), object(3)
memory usage: 291.2+ KB
As the data set has decreased by only about 10 percent, we can accept this loss.
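The figure can be verified directly from the row counts in the two info() outputs above:

# 6992 rows before dropna, 6212 after (see the info() outputs)
print('Dropped: {:.1%}'.format(1 - 6212 / 6992))  # ~11.2%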
relevant_data.rename(columns = {'ID.x': 'id',
'CountryLive': 'country',
'MonthsProgramming': 'months_programming',
'MoneyForLearning': 'money_for_learning',
'JobRoleInterest': 'job_interest'},
inplace = True)
relevant_data.head(2)
 | id | country | months_programming | money_for_learning | job_interest
---|---|---|---|---|---|
1 | 5bfef9ecb211ec4f518cfc1d2a6f3e0c | United States of America | 6.0 | 80.0 | Full-Stack Web Developer |
2 | 14f1863afa9c7de488050b82eb3edd96 | United States of America | 5.0 | 1000.0 | Front-End Web Developer, Back-End Web Develo... |
Let's analyze which jobs are the most interesting to the audience.
relevant_data['job_interest'].value_counts(normalize=True)*100
Full-Stack Web Developer                              11.799742
Front-End Web Developer                                6.423052
Data Scientist                                         2.189311
Back-End Web Developer                                 1.996137
Game Developer                                         1.625885
Mobile Developer                                       1.593690
Information Security                                   1.191243
Full-Stack Web Developer, Front-End Web Developer      0.965873
Data Engineer                                          0.756600
Product Manager                                        0.756600
Front-End Web Developer, Full-Stack Web Developer      0.756600
User Experience Designer                               0.724404
Front-End Web Developer, Back-End Web Developer, Full-Stack Web Developer    0.627817
User Experience Designer, Front-End Web Developer      0.595621
DevOps / SysAdmin                                      0.515132
Back-End Web Developer, Front-End Web Developer, Full-Stack Web Developer    0.499034
Back-End Web Developer, Full-Stack Web Developer, Front-End Web Developer    0.482936
Full-Stack Web Developer, Front-End Web Developer, Back-End Web Developer    0.466838
Full-Stack Web Developer, Mobile Developer             0.450741
Front-End Web Developer, Full-Stack Web Developer, Back-End Web Developer    0.434643
Front-End Web Developer, User Experience Designer      0.418545
Full-Stack Web Developer, Back-End Web Developer       0.386349
Back-End Web Developer, Full-Stack Web Developer       0.370251
Full-Stack Web Developer, Back-End Web Developer, Front-End Web Developer    0.289762
Back-End Web Developer, Front-End Web Developer        0.257566
Data Engineer, Data Scientist                          0.257566
Full-Stack Web Developer, Data Scientist               0.257566
Full-Stack Web Developer, Game Developer               0.225370
Mobile Developer, Full-Stack Web Developer             0.225370
Mobile Developer, Front-End Web Developer              0.225370
...
User Experience Designer, Full-Stack Web Developer, Mobile Developer, Front-End Web Developer, Quality Assurance Engineer, Back-End Web Developer    0.016098
User Experience Designer, Front-End Web Developer, Data Engineer, Product Manager, DevOps / SysAdmin, Full-Stack Web Developer, Data Scientist, Game Developer, Back-End Web Developer    0.016098
Data Scientist, Product Manager, Full-Stack Web Developer, User Experience Designer    0.016098
Data Scientist, Data Engineer, Quality Assurance Engineer    0.016098
Data Scientist, Front-End Web Developer, Information Security    0.016098
Mobile Developer, DevOps / SysAdmin, Data Scientist, Back-End Web Developer, Front-End Web Developer, User Experience Designer, Full-Stack Web Developer    0.016098
Data Engineer, Full-Stack Web Developer, Front-End Web Developer, Mobile Developer, Game Developer, Back-End Web Developer    0.016098
Front-End Web Developer, Data Scientist, Data Engineer, Full-Stack Web Developer, DevOps / SysAdmin, Back-End Web Developer, Information Security, User Experience Designer, Mobile Developer, Game Developer    0.016098
Information Security, Front-End Web Developer, Full-Stack Web Developer, Back-End Web Developer, Mobile Developer    0.016098
User Experience Designer, Full-Stack Web Developer, Information Security, Front-End Web Developer, Mobile Developer, Game Developer, Back-End Web Developer    0.016098
Mobile Developer, Full-Stack Web Developer, Information Security, User Experience Designer, Product Manager, Data Scientist    0.016098
User Experience Designer, Back-End Web Developer, Game Developer, Mobile Developer, Front-End Web Developer    0.016098
Back-End Web Developer, Full-Stack Web Developer, Data Engineer, DevOps / SysAdmin    0.016098
Game Developer, User Experience Designer, Front-End Web Developer, Mobile Developer    0.016098
User Experience Designer, Mobile Developer, Product Manager, Front-End Web Developer, Data Scientist, Information Security, Quality Assurance Engineer    0.016098
DevOps / SysAdmin, Front-End Web Developer, Mobile Developer, Full-Stack Web Developer    0.016098
Data Scientist, Information Security, Back-End Web Developer, Game Developer    0.016098
Data Engineer, Back-End Web Developer, Front-End Web Developer, Mobile Developer, DevOps / SysAdmin, Full-Stack Web Developer, Data Scientist, Information Security    0.016098
Mobile Developer, Data Engineer, Back-End Web Developer, Data Scientist, Front-End Web Developer, Full-Stack Web Developer    0.016098
Data Scientist, Back-End Web Developer, Mobile Developer, Front-End Web Developer, Full-Stack Web Developer    0.016098
Front-End Web Developer, Back-End Web Developer, Full-Stack Web Developer, Mobile Developer, Product Manager    0.016098
Data Scientist, User Experience Designer, Full-Stack Web Developer    0.016098
Ceo    0.016098
All - whatever is required to develop tools to revolutionize the mechanical engineering process    0.016098
Data Scientist, Data Engineer, Full-Stack Web Developer, Game Developer, Mobile Developer    0.016098
User Experience Designer, Back-End Web Developer, Data Engineer, Data Scientist, Full-Stack Web Developer, Information Security    0.016098
Data Scientist, Data Engineer, Back-End Web Developer, Mobile Developer, Full-Stack Web Developer, Product Manager    0.016098
Full-Stack Web Developer, Front-End Web Developer, Information Security, DevOps / SysAdmin    0.016098
Data Engineer, DevOps / SysAdmin, Data Scientist, Back-End Web Developer, Front-End Web Developer, Full-Stack Web Developer, Mobile Developer    0.016098
Mobile Developer, Back-End Web Developer, Front-End Web Developer, Information Security    0.016098
Name: job_interest, Length: 2905, dtype: float64
It seems the job_interest column needs to be cleaned, because multiple answers were allowed. To see the whole picture we need to count every mention of every job. We'll do this by splitting the job_interest column and then melting the result.
jobs = relevant_data['job_interest'].str.split(',', expand = True).copy()
splited_db = pd.concat([relevant_data, jobs], axis=1)
splited_db.info()
splited_db.head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6212 entries, 1 to 18171
Data columns (total 18 columns):
id                    6212 non-null object
country               6212 non-null object
months_programming    6212 non-null float64
money_for_learning    6212 non-null float64
job_interest          6212 non-null object
0                     6212 non-null object
1                     4263 non-null object
2                     3590 non-null object
3                     2602 non-null object
4                     1650 non-null object
5                     904 non-null object
6                     471 non-null object
7                     225 non-null object
8                     118 non-null object
9                     56 non-null object
10                    30 non-null object
11                    20 non-null object
12                    2 non-null object
dtypes: float64(2), object(16)
memory usage: 922.1+ KB
 | id | country | months_programming | money_for_learning | job_interest | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 5bfef9ecb211ec4f518cfc1d2a6f3e0c | United States of America | 6.0 | 80.0 | Full-Stack Web Developer | Full-Stack Web Developer | None | None | None | None | None | None | None | None | None | None | None | None |
2 | 14f1863afa9c7de488050b82eb3edd96 | United States of America | 5.0 | 1000.0 | Front-End Web Developer, Back-End Web Develo... | Front-End Web Developer | Back-End Web Developer | DevOps / SysAdmin | Mobile Developer | Full-Stack Web Developer | None | None | None | None | None | None | None | None |
3 | 91756eb4dc280062a541c25a3d44cfb0 | Brazil | 5.0 | 0.0 | Front-End Web Developer, Full-Stack Web Deve... | Front-End Web Developer | Full-Stack Web Developer | Back-End Web Developer | None | None | None | None | None | None | None | None | None | None |
4 | aa3f061a1949a90b27bef7411ecd193f | Portugal | 24.0 | 0.0 | Full-Stack Web Developer, Information Security... | Full-Stack Web Developer | Information Security | Mobile Developer | Front-End Web Developer | Back-End Web Developer | None | None | None | None | None | None | None | None |
6 | 5e130f133306abd6c2f9af31467ff37c | United Kingdom | 12.0 | 0.0 | Full-Stack Web Developer | Full-Stack Web Developer | None | None | None | None | None | None | None | None | None | None | None | None |
Now that we can see the split was done correctly, we can delete the initial job_interest column.
splited_db.drop('job_interest', axis=1, inplace=True)
splited_db.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6212 entries, 1 to 18171
Data columns (total 17 columns):
id                    6212 non-null object
country               6212 non-null object
months_programming    6212 non-null float64
money_for_learning    6212 non-null float64
0                     6212 non-null object
1                     4263 non-null object
2                     3590 non-null object
3                     2602 non-null object
4                     1650 non-null object
5                     904 non-null object
6                     471 non-null object
7                     225 non-null object
8                     118 non-null object
9                     56 non-null object
10                    30 non-null object
11                    20 non-null object
12                    2 non-null object
dtypes: float64(2), object(15)
memory usage: 873.6+ KB
melted_db = splited_db.melt(id_vars = ['id',
'country',
'months_programming',
'money_for_learning'],
value_vars = [0,1,2,3,4,5,6,7,8,9,10,11,12]
)
melted_db = melted_db[melted_db['value'].notnull()]
melted_db.rename(columns = {'variable': 'mention_num'}, inplace=True)
melted_db.info()
melted_db.sort_values('id', ascending=False).head(10)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20143 entries, 0 to 80755
Data columns (total 6 columns):
id                    20143 non-null object
country               20143 non-null object
months_programming    20143 non-null float64
money_for_learning    20143 non-null float64
mention_num           20143 non-null object
value                 20143 non-null object
dtypes: float64(2), object(4)
memory usage: 1.1+ MB
 | id | country | months_programming | money_for_learning | mention_num | value
---|---|---|---|---|---|---|
6803 | fffba79ffe1aacf3aaa4d55cd42c826e | India | 8.0 | 0.0 | 1 | Full-Stack Web Developer |
19227 | fffba79ffe1aacf3aaa4d55cd42c826e | India | 8.0 | 0.0 | 3 | Data Scientist |
591 | fffba79ffe1aacf3aaa4d55cd42c826e | India | 8.0 | 0.0 | 0 | Back-End Web Developer |
13015 | fffba79ffe1aacf3aaa4d55cd42c826e | India | 8.0 | 0.0 | 2 | Data Engineer |
52506 | fff57c57c3e9e9762f5410204dfa8281 | Canada | 6.0 | 400.0 | 8 | Mobile Developer |
33870 | fff57c57c3e9e9762f5410204dfa8281 | Canada | 6.0 | 400.0 | 5 | Back-End Web Developer |
27658 | fff57c57c3e9e9762f5410204dfa8281 | Canada | 6.0 | 400.0 | 4 | Information Security |
2810 | fff57c57c3e9e9762f5410204dfa8281 | Canada | 6.0 | 400.0 | 0 | Quality Assurance Engineer |
15234 | fff57c57c3e9e9762f5410204dfa8281 | Canada | 6.0 | 400.0 | 2 | Data Scientist |
9022 | fff57c57c3e9e9762f5410204dfa8281 | Canada | 6.0 | 400.0 | 1 | DevOps / SysAdmin |
melted_db['value'].value_counts().head(15)
Full-Stack Web Developer    2228
Front-End Web Developer     2046
Back-End Web Developer      1810
Full-Stack Web Developer    1555
Mobile Developer            1531
Front-End Web Developer     1105
Game Developer              1028
Data Scientist               995
User Experience Designer     966
Information Security         873
Data Engineer                838
Back-End Web Developer       686
DevOps / SysAdmin            633
Product Manager              531
Mobile Developer             500
Name: value, dtype: int64
The split left leading whitespace, so identical jobs are counted as different values. Let's strip it.
melted_db['value'] = melted_db['value'].str.strip()
melted_db['value'].value_counts()
Full-Stack Web Developer         3783
Front-End Web Developer          3151
Back-End Web Developer           2496
Mobile Developer                 2031
Data Scientist                   1460
Game Developer                   1441
User Experience Designer         1306
Information Security             1165
Data Engineer                    1108
DevOps / SysAdmin                 838
Product Manager                   718
Quality Assurance Engineer        433
Software Engineer                   9
Software Developer                  6
Artificial Intelligence             4
Software engineer                   4
Desktop Application Developer       3
Project Manager                     2
Data Analyst                        2
Machine Learning Engineer           2
Research                            2
Not sure yet                        2
User Interface Designer             2
Java developer                      2
data analyst                        2
Business Analyst                    2
Unsure                              2
Product Designer                    2
Machine Learning                    2
Programmer                          2
...
Marketing                           1
Software Projects Manager           1
Software developer                  1
Teacher. Teaching students to code.    1
Software Developper                 1
mobile developer                    1
Library Developer                   1
Teacher                             1
developer                           1
Cybersecurity                       1
Software Engineer (Computer Science Based)    1
Data Visualization Specialist       1
Information Developer               1
etc...                              1
non-programmer                      1
Software Specialist                 1
Software enginner                   1
Quant (Algorithmic Trader)          1
Not Sure                            1
Analyst                             1
Desktop Applications                1
Developer Evangelist                1
Astrophysicist                      1
code developer...in whatever format    1
UI Design                           1
Networking                          1
Security Business Analyst           1
back-end                            1
Researcher                          1
Non technical                       1
Name: value, Length: 187, dtype: int64
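For reference, on newer pandas versions (0.25+, not available in the environment used here), the same one-row-per-mention shape can be produced more compactly with explode; a sketch:

# Split each answer into a list, explode into one row per job mention,
# then strip the leading whitespace left by the split
exploded = relevant_data.assign(value=relevant_data['job_interest'].str.split(','))
exploded = exploded.explode('value')
exploded['value'] = exploded['value'].str.strip()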
Much better, but there are many jobs with fewer than 10 mentions. Let's group them into "Others".
frequencies = melted_db['value'].value_counts().head(12)
job_list = list(frequencies.index)
job_list
['Full-Stack Web Developer', 'Front-End Web Developer', 'Back-End Web Developer', 'Mobile Developer', 'Data Scientist', 'Game Developer', 'User Experience Designer', 'Information Security', 'Data Engineer', 'DevOps / SysAdmin', 'Product Manager', 'Quality Assurance Engineer']
melted_db['value'] = melted_db['value'].apply(lambda x: x if x in job_list else 'Others')
(melted_db['value'].value_counts(normalize=True) * 100).head(20)
Full-Stack Web Developer      18.780718
Front-End Web Developer       15.643151
Back-End Web Developer        12.391401
Mobile Developer              10.082907
Data Scientist                 7.248176
Game Developer                 7.153850
User Experience Designer       6.483642
Information Security           5.783647
Data Engineer                  5.500670
DevOps / SysAdmin              4.160254
Product Manager                3.564514
Quality Assurance Engineer     2.149630
Others                         1.057439
Name: value, dtype: float64
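The same grouping can also be done without apply, using vectorized methods:

# Vectorized equivalent of the apply above: keep values found in job_list,
# replace everything else with 'Others'
melted_db['value'] = melted_db['value'].where(melted_db['value'].isin(job_list), 'Others')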
It is likely that the more interesting a job is, the earlier it is mentioned. Let's see whether the trend changes if we take only the first three jobs listed by each participant.
(melted_db[melted_db['mention_num'] < 3]['value'].value_counts(normalize=True) * 100).head(20)
Full-Stack Web Developer      20.988269
Front-End Web Developer       16.885887
Back-End Web Developer        12.264486
Mobile Developer               9.662282
Data Scientist                 7.095627
Game Developer                 7.017419
User Experience Designer       6.142908
Information Security           5.311056
Data Engineer                  4.820476
DevOps / SysAdmin              3.618912
Product Manager                3.448276
Quality Assurance Engineer     1.770352
Others                         0.974049
Name: value, dtype: float64
Generally the results are the same, so we'll use the extended version with all job preferences.
melted_db['value'].value_counts().sort_values().plot.barh()
plt.show()
We can see that most participants are interested in web and mobile development, which is exactly what we sell most of.
We can't use the melted_db data set for analyzing countries, as it can contain multiple rows per respondent when they listed several job interests, so we'll return to the relevant_data data set.
relevant_data['country'].value_counts().head(10)
United States of America    2933
India                        463
United Kingdom               279
Canada                       240
Poland                       122
Germany                      114
Brazil                       111
Australia                    102
Russia                        79
Ukraine                      79
Name: country, dtype: int64
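We can confirm the duplication in melted_db directly; assuming the id column uniquely identifies a respondent, it should contain far more rows than unique ids:

# melted_db holds one row per job mention, not one per respondent
print(len(melted_db))             # 20143 mentions
print(melted_db['id'].nunique())  # expected to match the 6212 respondents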
relevant_data['country'].value_counts().sort_values().tail(10).plot.barh()
plt.show()
It seems obvious that the US and India are the biggest markets. But we must also gauge the participants' willingness to pay for education: advertising in markets where most people only want to learn for free is extremely unlikely to be profitable for us.
The MoneyForLearning column describes, in US dollars, the amount of money participants spent from the moment they started coding until the moment they completed the survey. Our company sells subscriptions at a price of $59 per month, so we're interested in finding out how much money each student spends per month.
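As a quick worked example, take the respondent in row 2 of the table shown earlier: $1000 spent over 5 months of programming works out to $200 per month, well above the $59 subscription price:

# money_for_learning / months_programming for the respondent in row 2
print(1000 / 5)  # 200.0 USD per month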
It also seems like a good idea to narrow down our analysis to only four countries: the US, India, the United Kingdom, and Canada. Two reasons for this decision are:
These are the four countries with the most respondents in our sample, as the counts above show.
Our courses are written in English, and English is an official language in all four countries. The more people that know English, the better our chances of reaching the right people with our ads.
countries = ['United States of America', 'India', 'United Kingdom', 'Canada']
relevant_data = relevant_data[relevant_data['country'].isin(countries)]
relevant_data['country'].value_counts()
United States of America    2933
India                        463
United Kingdom               279
Canada                       240
Name: country, dtype: int64
print(relevant_data['months_programming'].isnull().sum())
print((relevant_data['months_programming'] == 0).sum())
0
146
# Replace 0 months with 1 to avoid division by zero when computing monthly spend
relevant_data.loc[(relevant_data['months_programming'] == 0), 'months_programming'] = 1
print(relevant_data['months_programming'].isnull().sum())
print((relevant_data['months_programming'] == 0).sum())
0
0
relevant_data['spent_per_month'] = relevant_data['money_for_learning'] / relevant_data['months_programming']
We'll take only those who have already spent some money on learning. But there is a catch:
Dataquest's suggested solution deletes only the rows with null values in spent_per_month, treating what remains as participants who spent at least something.
This is wrong, because it leaves the '0' values in place. I'll show the consequences below.
print(relevant_data['spent_per_month'].isnull().sum())
(relevant_data['spent_per_month'] == 0).sum()
0
1844
relevant_data_without_zeroes = relevant_data[relevant_data['spent_per_month'] != 0]
relevant_data_with_zeroes = relevant_data[relevant_data['spent_per_month'].notnull()]
relevant_data_with_zeroes['country'].value_counts().head()
United States of America    2933
India                        463
United Kingdom               279
Canada                       240
Name: country, dtype: int64
relevant_data_without_zeroes['country'].value_counts().head()
United States of America    1619
India                        201
United Kingdom               137
Canada                       114
Name: country, dtype: int64
relevant_data_with_zeroes.groupby('country')['spent_per_month'].describe()
 | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---|
country | ||||||||
Canada | 240.0 | 113.510961 | 441.014158 | 0.0 | 0.0 | 0.000000 | 25.000000 | 5000.0 |
India | 463.0 | 135.100982 | 692.960378 | 0.0 | 0.0 | 0.000000 | 13.250000 | 10000.0 |
United Kingdom | 279.0 | 45.534443 | 162.311836 | 0.0 | 0.0 | 0.000000 | 20.416667 | 1400.0 |
United States of America | 2933.0 | 227.997996 | 1940.245614 | 0.0 | 0.0 | 3.333333 | 50.000000 | 80000.0 |
relevant_data_without_zeroes.groupby('country')['spent_per_month'].describe()
 | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---|
country | ||||||||
Canada | 114.0 | 238.970444 | 617.345854 | 0.416667 | 8.333333 | 25.000000 | 131.250000 | 5000.0 |
India | 201.0 | 311.202759 | 1026.728280 | 0.138889 | 5.500000 | 20.000000 | 133.333333 | 10000.0 |
United Kingdom | 137.0 | 92.730727 | 222.359671 | 0.050000 | 8.333333 | 20.833333 | 66.666667 | 1400.0 |
United States of America | 1619.0 | 413.043929 | 2597.175174 | 0.033333 | 10.000000 | 33.333333 | 166.666667 | 80000.0 |
Let's look at the means in isolation.
means_with_zeroes = relevant_data_with_zeroes.groupby('country')['spent_per_month'].mean()
means_without_zeroes = relevant_data_without_zeroes.groupby('country')['spent_per_month'].mean()
pd.DataFrame(data = {'Means_with_zeroes': means_with_zeroes,
'Means_without_zeroes': means_without_zeroes})
 | Means_with_zeroes | Means_without_zeroes
---|---|---|
country | ||
Canada | 113.510961 | 238.970444 |
India | 135.100982 | 311.202759 |
United Kingdom | 45.534443 | 92.730727 |
United States of America | 227.997996 | 413.043929 |
It turns out the '0' values we left in change everything!
We also see many outliers (look at the huge difference between the 75th percentile and the maximum), so the mean is not very representative here. Surprisingly, Dataquest suggests using it anyway.
Let's see how the averages change if we take the median instead.
medians_with_zeroes = relevant_data_with_zeroes.groupby('country')['spent_per_month'].median()
medians_without_zeroes = relevant_data_without_zeroes.groupby('country')['spent_per_month'].median()
pd.DataFrame(data = {'Medians_with_zeroes': medians_with_zeroes,
'Medians_without_zeroes': medians_without_zeroes})
 | Medians_with_zeroes | Medians_without_zeroes
---|---|---|
country | ||
Canada | 0.000000 | 25.000000 |
India | 0.000000 | 20.000000 |
United Kingdom | 0.000000 | 20.833333 |
United States of America | 3.333333 | 33.333333 |
Again the presence of '0' values changes everything and makes the median useless, while on the properly cleaned data set the median becomes quite representative.
At this stage it should be obvious that the rows with zeroes in the spent_per_month column must be dropped. I will use only the properly cleaned data set from now on.
We already noticed the huge difference between the 75th percentile and the maximum value when we analyzed spent_per_month, which suggests there are outliers. Let's visualize the distribution and see.
sns.boxplot(x = 'country',
y = 'spent_per_month',
data = relevant_data_without_zeroes)
plt.title('Money Spent Per Month',fontsize = 16)
plt.ylabel('Money per month, USD')
plt.xlabel('Country')
plt.xticks(range(4), ['US', 'UK', 'Canada', 'India'])
plt.show()
We can't even see the boxes. Let's cut off the most extreme outliers and keep only values below 20,000.
sns.boxplot(x = 'country',
y = 'spent_per_month',
data = relevant_data_without_zeroes[relevant_data_without_zeroes['spent_per_month']<20000])
plt.title('Money Spent Per Month',fontsize = 16)
plt.ylabel('Money per month, USD')
plt.xlabel('Country')
plt.xticks(range(4), ['US', 'UK', 'Canada', 'India'])
plt.show()
Still quite unclear. The number of outliers suggests that some participants may have included money spent on university or bootcamps.
Let's define our potential customers as those who spent between 20 and 200 USD per month and see how this affects the size of our sample.
condition = (relevant_data_without_zeroes['spent_per_month'] > 20) & (relevant_data_without_zeroes['spent_per_month'] < 200)
print(relevant_data_without_zeroes.shape[0])
relevant_data_without_zeroes[condition].shape[0]
2071
774
Now we have only about a third of the previous sample, but these respondents are most likely our audience.
relevant_data_without_zeroes[condition]['spent_per_month'].hist()
plt.show()
The distribution is still strongly right-skewed even between 20 and 200 USD, with most of the mass at the low end, so we can assume our condition kept the most important part of the data.
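We can also back this claim with a number; a positive sample skewness confirms the long right tail:

# Sample skewness of the filtered monthly spend (positive = right-skewed)
print(relevant_data_without_zeroes[condition]['spent_per_month'].skew())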
Let's see how the boxplot looks now.
sns.boxplot(x = 'country',
y = 'spent_per_month',
data = relevant_data_without_zeroes[condition])
plt.title('Money Spent Per Month',fontsize = 16)
plt.ylabel('Money per month, USD')
plt.xlabel('Country')
plt.xticks(range(4), ['US', 'UK', 'Canada', 'India'])
plt.show()
Much better now! But before drawing conclusions, let's compare the means and medians for each country. We must also take the number of respondents from each country into account.
medians = relevant_data_without_zeroes[condition].groupby('country')['spent_per_month'].median()
means = relevant_data_without_zeroes[condition].groupby('country')['spent_per_month'].mean()
num_respondents = relevant_data_without_zeroes[condition]['country'].value_counts()
results_df = pd.DataFrame(data = {'Median': medians,
'Mean': means,
'Num_respondents': num_respondents})
results_df
 | Mean | Median | Num_respondents
---|---|---|---|
Canada | 58.502976 | 45.833333 | 40 |
India | 88.939039 | 79.916667 | 62 |
United Kingdom | 69.128001 | 60.000000 | 55 |
United States of America | 68.857456 | 51.111111 | 617 |
Now let's decide which measure represents the distribution best. I assume it is the mean: since we've kept only those who spent between 20 and 200 USD per month, the extreme outliers are gone, and it is more important to take the weight of every value into account than to guard against outliers.
We'll also round the means.
results_df.drop(columns = 'Median', inplace = True)
results_df['Mean'] = results_df['Mean'].round().astype(int)
results_df.reset_index(inplace=True)
results_df
 | index | Mean | Num_respondents
---|---|---|---|
0 | Canada | 59 | 40 |
1 | India | 89 | 62 |
2 | United Kingdom | 69 | 55 |
3 | United States of America | 69 | 617 |
Unfortunately, seaborn.relplot is unavailable in this environment, so we'll make a simple graph using matplotlib.
plt.figure(figsize = (10, 4))
plt.title('Average sums spent per month in top-4 countries', fontsize = 20)
plt.scatter(results_df['Num_respondents'],
results_df['Mean'],
s=100)
plt.text(60, 60, 'Canada', fontsize = 15)
plt.text(71, 71, 'UK', fontsize = 15)
plt.text(75, 90, 'India', fontsize = 15)
plt.text(620, 72, 'USA', fontsize = 15)
plt.ylabel('Money spent per month, USD')
plt.xlabel('Number of respondents')
plt.show()
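For reference, in an environment with seaborn 0.9 or newer, a similar chart could be drawn with relplot:

# Equivalent scatter chart with seaborn's relplot (requires seaborn >= 0.9)
sns.relplot(x='Num_respondents', y='Mean', data=results_df, height=4, aspect=2.5)
plt.show()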
Strictly speaking, none of our samples represents the population of its country, because each includes only people who took the survey. Nevertheless, we can share some thoughts with the marketing department.
Obviously, the most promising market for advertising is the US. As for the second country, the numbers of respondents for India, the UK, and Canada are roughly comparable, but people in India on average tend to spend more than in Canada and the UK.
So I would recommend investing in the US and India.
What is much more important is to always bear in mind what lies behind the numbers we're analyzing.