Finding investment opportunities - Where should we advertise?

We are an e-learning company that offers programming courses. We focus on web and mobile development; however, we also many other domains, like data science, game development, etc.

We want to spend some money on advertising to promote our product. However, our goal is to find the two best markets to advertise our products as our budget is limited.

To do so, we will try answering the below questions: • Where are new coders located? • What are the locations with the most significant number of new coders? • How much money new coders are willing to spend on learning.

Summary of finding

In this project, we analyzed survey data from new coders to find the best two markets to advertise in, based on the potential customer and monthly spend.

The USA is an excellent market to promote as it is the biggest market and has the highest spending per month on learning.

India and Canada are good candidates too. However, we will need input from the Marketing team to make the best decision.

For more details, please refer to the full analysis below.

Gathering existing data

To avoid spending a lot of money on gathering data, we will explore existing data relevant to our purpose.

One promising candidate is the data from freeCodeCamp's 2017 New Coder Survey. freeCodeCamp is a free e-learning platform that offers courses on web development. Because they run a popular Medium publication (over 400,000 followers), their survey attracted new coders with varying interests (not only web development), which is ideal for our analysis.

The data set is available on GitHub

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

survey = pd.read_csv('2017-fCC-New-Coders-Survey-Data.csv',low_memory=False)

survey.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18175 entries, 0 to 18174
Columns: 136 entries, Age to YouTubeTheNewBoston
dtypes: float64(105), object(31)
memory usage: 18.9+ MB
In [2]:
# prinint the first 5 rows to inspect the data
survey.head()
Out[2]:
Age AttendedBootcamp BootcampFinish BootcampLoanYesNo BootcampName BootcampRecommend ChildrenNumber CityPopulation CodeEventConferences CodeEventDjangoGirls ... YouTubeFCC YouTubeFunFunFunction YouTubeGoogleDev YouTubeLearnCode YouTubeLevelUpTuts YouTubeMIT YouTubeMozillaHacks YouTubeOther YouTubeSimplilearn YouTubeTheNewBoston
0 27.0 0.0 NaN NaN NaN NaN NaN more than 1 million NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 34.0 0.0 NaN NaN NaN NaN NaN less than 100,000 NaN NaN ... 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 21.0 0.0 NaN NaN NaN NaN NaN more than 1 million NaN NaN ... NaN NaN NaN 1.0 1.0 NaN NaN NaN NaN NaN
3 26.0 0.0 NaN NaN NaN NaN NaN between 100,000 and 1 million NaN NaN ... 1.0 1.0 NaN NaN 1.0 NaN NaN NaN NaN NaN
4 20.0 0.0 NaN NaN NaN NaN NaN between 100,000 and 1 million NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 136 columns

The data has 18 175 rows and 136 columns. Each column represents a question from the survey.

Checking the Reresentativity of the sample

InIn the next step, we will check if the data it's representative of our population of interest and if it has the right categories of people for our purpose.

In [3]:
# explore JobRoleInterest column

survey.JobRoleInterest
Out[3]:
0                                                      NaN
1                                 Full-Stack Web Developer
2          Front-End Web Developer, Back-End Web Develo...
3          Front-End Web Developer, Full-Stack Web Deve...
4        Full-Stack Web Developer, Information Security...
5                                                      NaN
6                                 Full-Stack Web Developer
7                                                      NaN
8                                                      NaN
9        Full-Stack Web Developer,   Quality Assurance ...
10                                                     NaN
11         DevOps / SysAdmin,   Data Scientist, Informa...
12                                                     NaN
13       Back-End Web Developer, Full-Stack Web Develop...
14                                Full-Stack Web Developer
15                                Full-Stack Web Developer
16                                Full-Stack Web Developer
17                                                     NaN
18       Full-Stack Web Developer,   Front-End Web Deve...
19         Front-End Web Developer,   Mobile Developer,...
20                                                     NaN
21                                    Information Security
22                                Full-Stack Web Developer
23                                  Back-End Web Developer
24                                                     NaN
25                                                     NaN
26                                                     NaN
27                                                     NaN
28                                Full-Stack Web Developer
29         Front-End Web Developer,   Data Scientist, F...
                               ...                        
18145                                                  NaN
18146                                                  NaN
18147                                                  NaN
18148                                                  NaN
18149                                                  NaN
18150                                                  NaN
18151                              Front-End Web Developer
18152                                                  NaN
18153       Information Security, Full-Stack Web Developer
18154                             Full-Stack Web Developer
18155    Full-Stack Web Developer,   Front-End Web Deve...
18156                             Full-Stack Web Developer
18157    Back-End Web Developer, Data Engineer,   Mobil...
18158                                                  NaN
18159                                                  NaN
18160                             User Experience Designer
18161                             Full-Stack Web Developer
18162      Data Scientist, Game Developer,   Quality As...
18163    Back-End Web Developer, Data Engineer,   Data ...
18164                                                  NaN
18165                                                  NaN
18166                                                  NaN
18167                                                  NaN
18168                                                  NaN
18169                                                  NaN
18170                                                  NaN
18171      DevOps / SysAdmin,   Mobile Developer,   Pro...
18172                                                  NaN
18173                                                  NaN
18174    Back-End Web Developer, Data Engineer,   Data ...
Name: JobRoleInterest, Length: 18175, dtype: object

As we can see, people are interested in more than one subject. Moreover, we can see that we have some null values.

Cleaning Job Interest data

To better understand the data, we are going to calculate the frequency of every job.

In [4]:
job_interest_freq = {'Unknown':0}

# loop over each response in JobRoleInterest
for row in survey["JobRoleInterest"]:
    if type(row) is float:
        job_interest_freq["Unknown"] += 1 # null Values will be tracked under Unknown
        
    elif type(row) is str:
        str_list = row.split(',')# create a list for answer with multiple choices
        for n in str_list:
            n= n.strip().lower()# clean up the string by removing trailing spaces and Upper case letter
            if n in job_interest_freq:
                job_interest_freq[n]+=1
            else:
                job_interest_freq[n]=1
                
# turn Dictionnary into Dataframe                
job_interest_freq = pd.DataFrame.from_dict(job_interest_freq,orient="index")
job_interest_freq.rename(columns={0:"Frequency"}, inplace=True)

# Convert values to percentages
job_interest_freq["Frequency"] = round(job_interest_freq["Frequency"]/job_interest_freq["Frequency"].sum()*100,2)
job_interest_freq.sort_values(by="Frequency",ascending=False,inplace= True)

#Summarize top 10 and others
top_10 = job_interest_freq[:10]

# add the rest as others
others = job_interest_freq[10:].sum()
others = pd.Series(others[0],index=["Other"],name="Frequency")
others= others.to_frame()

top_10 = pd.concat([top_10,others])

top_10.plot.barh()
plt.title("Frequency of Top 10 Job Interest")
plt.xlabel('Frequency (%)')
Out[4]:
<matplotlib.text.Text at 0x7f08b62741d0>

As the question is open-ended, there is a lot of different jobs. However, 33.1% of the respondent did not specify any interest. Web developer, which splits into full-stack, front-end, and back-end, is the highest interest role with 31.1% of the respondent. Mobile developer is next with 6.82% of the respondent, followed by Data scientists, game developers.

As we can see, the sample has the right categories of people for our purpose. Therefore we can continue with our analysis.

New Coders by Country

In the survey, we have information regarding the location of the new coders. Each Country represents an individual market. In the next step, we will be finding the best Country to advertise on, based on new coders density.

We will start by removing all rows that have not entered any JobRoleInterest to ensure that we focus on the coders we want to target.

In [5]:
#drop rows with unknow job interest
survey = survey.dropna(subset=["JobRoleInterest"])

Next, we will explore the column CountryLive, which describes the Country the participants currently living.

In [6]:
country = survey["CountryLive"].value_counts(normalize= True)*100
country
Out[6]:
United States of America         45.700497
India                             7.721556
United Kingdom                    4.606610
Canada                            3.802281
Poland                            1.915765
Brazil                            1.886517
Germany                           1.828020
Australia                         1.637906
Russia                            1.491664
Ukraine                           1.301550
Nigeria                           1.228429
Spain                             1.126060
France                            1.096812
Romania                           1.038315
Netherlands (Holland, Europe)     0.950570
Italy                             0.906698
Serbia                            0.760456
Philippines                       0.760456
Greece                            0.672711
Ireland                           0.628839
South Africa                      0.570342
Mexico                            0.541094
Turkey                            0.526470
Hungary                           0.497221
Singapore                         0.497221
New Zealand                       0.482597
Argentina                         0.467973
Croatia                           0.467973
Pakistan                          0.453349
Indonesia                         0.453349
                                   ...    
Rwanda                            0.014624
Nicaragua                         0.014624
Panama                            0.014624
Guadeloupe                        0.014624
Cayman Islands                    0.014624
Channel Islands                   0.014624
Qatar                             0.014624
Guatemala                         0.014624
Liberia                           0.014624
Samoa                             0.014624
Nambia                            0.014624
Somalia                           0.014624
Myanmar                           0.014624
Gambia                            0.014624
Kyrgyzstan                        0.014624
Angola                            0.014624
Jordan                            0.014624
Anguilla                          0.014624
Gibraltar                         0.014624
Yemen                             0.014624
Mozambique                        0.014624
Cameroon                          0.014624
Botswana                          0.014624
Turkmenistan                      0.014624
Trinidad & Tobago                 0.014624
Papua New Guinea                  0.014624
Cuba                              0.014624
Sudan                             0.014624
Vanuatu                           0.014624
Aruba                             0.014624
Name: CountryLive, Length: 137, dtype: float64

The survey has respondents from all over the world. However, 72% of the respondent are within the top 10 countries.

Let's visualize the Top 10 countries to get a better understanding of the top markets.

In [7]:
country[:10].plot.barh(color='blue')
plt.title("Frequency of respondent by Country")
plt.xlabel("Frequency (%)")
Out[7]:
<matplotlib.text.Text at 0x7f08b62cd4a8>

As we can see, most of the respondents are from the United States of America, followed by India, the United Kingdom, and Canada. Any of these countries could be a potential market for our advertisement. However, to decide which market to advertise on, we need to go more in-depth with the analysis.

In the next step, we will explore how much each market is willing to spend on learning.

Exploring Spent by Market

For the rest of the analysis, we will focus on the top 4 countries, making up 62% of the respondents. Moreover, our courses are in English, and all 4 are English speaking countries.

Let's explore the column MoneyForLearning. This column describes in USD, the amount of money spent by participants from the moment they started coding.

In [8]:
survey["MoneyForLearning"].describe()
Out[8]:
count      6476.00000
mean        885.31223
std        4893.37377
min           0.00000
25%           0.00000
50%          10.00000
75%         200.00000
max      200000.00000
Name: MoneyForLearning, dtype: float64

Average money spends USD 885. However, the data is very spread with a range of USD 200000. If we look at the third Quartile, 75% of the respondent has spent USD 200 or less. Comparing this with the average, we can say that some extreme outliers likely impact the average.

Calculating Monthly Spend

For our analysis, we are interested in monthly spending, as our subscription is priced per month. We will need to calculate this by using the column MonthsProgramming.

In [22]:
country_list=["United States of America","India","United Kingdom","Canada"]

# filtering the data for the countries we are intersted in
survey = survey[survey["CountryLive"].isin(country_list)].copy()

# replacing monthsProgramming 0 with 1 to avoid dividing by 0

survey["MonthsProgramming"] = survey["MonthsProgramming"].replace(0,1)

# creating spend per month column

survey["Spend_Month"] = survey["MoneyForLearning"]/survey["MonthsProgramming"]

# Removing rows with Null Spend_month

survey.dropna(subset=["Spend_Month"],inplace= True)

In the next step, we will explore the average monthly spend per Country to determine which market could be the most profitable.

In [10]:
survey.pivot_table(values="Spend_Month",index="CountryLive",aggfunc=[np.mean,np.median])
Out[10]:
mean median
Spend_Month Spend_Month
CountryLive
Canada 113.510961 0.000000
India 135.100982 0.000000
United Kingdom 45.534443 0.000000
United States of America 227.997996 3.333333

We can say that Coders in the USA average monthly spending is the highest at USD 228 per month from the table above. United Kingdom has the lowest spend at USD 45 per month. There seems to be a big difference between the average spend for some countries. Therefore we are going to explore the spread of the monthly spend for each Country.

On the other hand, looking at the median, we can see that all countries except the USA have a Median spend of USD 0 per month, suggesting that the data is right-skewed and that 50% of the respondent lear using free learnings.

In the next step, we will explore the outliers to get a better understanding of the average spend per month

Exploring Extreme Outliers

In [11]:
sns.boxplot(y ="Spend_Month",x="CountryLive", data = survey)
plt.title("Comparison of monthly spend by Country")
plt.xticks(rotation =45)
/dataquest/system/env/python3/lib/python3.4/site-packages/seaborn/categorical.py:454: FutureWarning:

remove_na is deprecated and is a private function. Do not use.

Out[11]:
(array([0, 1, 2, 3]), <a list of 4 Text xticklabel objects>)

The Box plots above are not very clear; however, we can see that the USA has some very extreme values impacting the average spend. The same can be said about India and Canada.

To continue with our analysis, we will exclude rows with Average spend per month higher than 10000.

In [12]:
survey = survey[survey["Spend_Month"]< 10000].copy()
In [13]:
sns.boxplot(y ="Spend_Month",x="CountryLive", data = survey)
plt.title("Comparison of monthly spend by Country")
plt.xticks(rotation =45)
/dataquest/system/env/python3/lib/python3.4/site-packages/seaborn/categorical.py:454: FutureWarning:

remove_na is deprecated and is a private function. Do not use.

Out[13]:
(array([0, 1, 2, 3]), <a list of 4 Text xticklabel objects>)

We can see that there are still quite a lot of outliers.

Let's explore some of the outliers respondents to see if there is a parameter driving this high spend per month.

In [14]:
survey[survey["Spend_Month"]> 2000].head()
Out[14]:
Age AttendedBootcamp BootcampFinish BootcampLoanYesNo BootcampName BootcampRecommend ChildrenNumber CityPopulation CodeEventConferences CodeEventDjangoGirls ... YouTubeFunFunFunction YouTubeGoogleDev YouTubeLearnCode YouTubeLevelUpTuts YouTubeMIT YouTubeMozillaHacks YouTubeOther YouTubeSimplilearn YouTubeTheNewBoston Spend_Month
415 26.0 1.0 1.0 1.0 The Iron Yard 1.0 NaN between 100,000 and 1 million NaN NaN ... NaN NaN 1.0 NaN NaN NaN NaN NaN NaN 2333.333333
441 30.0 1.0 0.0 0.0 Rutgers Coding Bootcamp 1.0 NaN between 100,000 and 1 million NaN NaN ... NaN 1.0 NaN NaN 1.0 NaN The Net Ninja NaN NaN 2250.000000
484 25.0 1.0 1.0 0.0 Codesmith 1.0 NaN more than 1 million NaN NaN ... 1.0 NaN NaN NaN NaN NaN NaN NaN NaN 2857.142857
718 26.0 1.0 0.0 0.0 The Coding Boot Camp at UCLA Extension 1.0 NaN more than 1 million 1.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 8000.000000
723 22.0 1.0 1.0 0.0 Dev Bootcamp 0.0 NaN more than 1 million NaN NaN ... 1.0 NaN NaN NaN NaN NaN NaN NaN NaN 4000.000000

5 rows × 137 columns

From the data extract above, we can see that respondents with spending higher than USD 2000 seem to have done a Bootcamp.

Exploring Differences in Spend for Bootcamp attendees.

In the next step, we will split the monthly spend into ten bins to explore how many people per bin attended a Bootcamp.

In [15]:
# Use pd.cut to sort data into Bins and Unstack to create a  pivot table

(survey.groupby(["AttendedBootcamp",pd.cut(survey["Spend_Month"],10)]).size().unstack())
Out[15]:
Spend_Month (-9.0, 900.0] (900.0, 1800.0] (1800.0, 2700.0] (2700.0, 3600.0] (3600.0, 4500.0] (4500.0, 5400.0] (5400.0, 6300.0] (6300.0, 7200.0] (7200.0, 8100.0] (8100.0, 9000.0]
AttendedBootcamp
0.0 3557.0 44.0 12.0 5.0 5.0 6.0 NaN NaN NaN NaN
1.0 171.0 42.0 23.0 10.0 2.0 6.0 1.0 2.0 2.0 1.0

The table above shows the number of people who did or did not attend a Bootcamp ( show by value 1.0 or 0.0) by Monthly spending.

We can see that most people who did not attend a booth camp spend less than USD 900 per month. The bins with USD 900 and more have more people attending Boothcamps than not attending.

Therefore, we are going to exclude rows with spending more than USD 1000 and re-evaluate. We will also exclude all rows with 0 Spend, as we are interested in understanding how much each market is willing to pay for learning.

In [16]:
no_outlier = survey[(survey["Spend_Month"]< 1000) & (survey["Spend_Month"]>1)]

(no_outlier.groupby(["AttendedBootcamp",pd.cut(no_outlier["Spend_Month"],10)]).size().unstack())
Out[16]:
Spend_Month (0.0705, 98.16] (98.16, 195.278] (195.278, 292.396] (292.396, 389.514] (389.514, 486.632] (486.632, 583.75] (583.75, 680.868] (680.868, 777.986] (777.986, 875.104] (875.104, 972.222]
AttendedBootcamp
0.0 1291 190 81 40 23 26 20 3 5 4
1.0 51 17 8 15 13 11 11 7 15 7

As we can see, most people who did not attend a Bootcamp spend less than USD 98 per month on learning. However,there are some significant numbers of people without a Bootcamp on some of the higher spend bins. We will use this data without outliers to re-evaluate the average spend per Country.

Re-evaluating Average spend per Country

In [20]:
sns.boxplot(y ="Spend_Month",x="CountryLive", data = no_outlier)
plt.title("Comparison of monthly spend by Country")
plt.xticks(rotation =45)
/dataquest/system/env/python3/lib/python3.4/site-packages/seaborn/categorical.py:454: FutureWarning:

remove_na is deprecated and is a private function. Do not use.

Out[20]:
(array([0, 1, 2, 3]), <a list of 4 Text xticklabel objects>)

We can see that the USA has the highest median and the highest 1st Quartile in the box plots above. This suggests that the USA could be the most profitable market to advertise.

Let's calculate the average spend per Country to confirm that the USA is the most profitable market and understand which other markets could be profitable.

In [18]:
no_outlier[["CountryLive","Spend_Month"]].groupby(["CountryLive"]).mean()
Out[18]:
Spend_Month
CountryLive
Canada 104.687955
India 78.786918
United Kingdom 62.797366
United States of America 107.408242

USA and Canada have the highest average spend per month and seem to be the most profitable market to advertise.

However, before we conclude, let's review the market size after the removal of the outliers.

In [21]:
no_outlier["CountryLive"].value_counts(normalize=True).plot.barh()
plt.title("Frequency of respondent by Country")
plt.xlabel("Frequency (%)")
Out[21]:
<matplotlib.text.Text at 0x7f08b5e83630>

The United States remains the most significant market and has the highest average spend per month; therefore, we should advertise in this market.

Canada is the second market we should focus on, as it has the second-highest spend per month.

India is also a potential market to advertise on; despite a lower average spend of USD 78, it is the second-highest market. The average spend is higher than the price of our subscription.

It is now time to share the results with the Marketing team, to use their domain knowledge in making the final decision

Conclusion

In this project, we analyzed survey data from new coders to find the best two markets to advertise in, based on the potential customer and monthly spend.

The USA is an excellent market to promote as it is the biggest market and has the highest spending per month on learning.

India and Canada are good candidates too. However, we will need input from the Marketing team to make the best decision.