We're working for an e-learning company that offers courses on programming. Most of our courses cover web and mobile development, but we also cover many other domains, like data science and game development. We want to promote our product and we'd like to invest some money in advertising. Our goal in this project is to find the two best markets to advertise our product in.
To reach our goal, we could organize surveys for a couple of different markets to find out which would be the best choices for advertising. Surveys are costly, however, and it's a good call to explore cheaper options first.
We can try to search existing data that might be relevant for our purpose. One good candidate is the data from freeCodeCamp's 2017 New Coder Survey. freeCodeCamp is a free e-learning platform that offers courses on web development. Because they run a popular Medium publication (over 400,000 followers), their survey attracted new coders with varying interests (not only web development), which is ideal for the purpose of our analysis.
The survey data is publicly available in this GitHub repository.
# read in our data and import the libraries we'll use
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
survey = pd.read_csv('2017-fCC-New-Coders-Survey-Data.csv', low_memory=False)
survey.shape
(18175, 136)
survey.head()
Age | AttendedBootcamp | BootcampFinish | BootcampLoanYesNo | BootcampName | BootcampRecommend | ChildrenNumber | CityPopulation | CodeEventConferences | CodeEventDjangoGirls | ... | YouTubeFCC | YouTubeFunFunFunction | YouTubeGoogleDev | YouTubeLearnCode | YouTubeLevelUpTuts | YouTubeMIT | YouTubeMozillaHacks | YouTubeOther | YouTubeSimplilearn | YouTubeTheNewBoston | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 27.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 34.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | less than 100,000 | NaN | NaN | ... | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 21.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | ... | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
3 | 26.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | NaN | NaN | ... | 1.0 | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
4 | 20.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 136 columns
Our dataset contains 18,175 rows and 136 columns. Some interesting columns for our analysis are `JobRoleInterest` (the role(s) respondents are interested in), `HoursLearning` (the time respondents spend studying), `CityPopulation` (the size of the place respondents live in) and `MoneyForLearning` (how much money they have spent on learning).
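Before diving in, we can sketch how we'd inspect these columns and their missing answers. The frame below is a tiny made-up stand-in for the survey data (only the column names come from the real file), so the counts are purely illustrative:

```python
import pandas as pd

# Tiny synthetic stand-in for the survey data; the column names are taken
# from the real file, the values here are invented for illustration
toy = pd.DataFrame({
    'JobRoleInterest': ['Data Scientist', None, 'Mobile Developer'],
    'HoursLearning': [10.0, 5.0, None],
    'CityPopulation': ['more than 1 million', None, 'less than 100,000'],
    'MoneyForLearning': [100.0, 0.0, 50.0],
})

cols = ['JobRoleInterest', 'HoursLearning', 'CityPopulation', 'MoneyForLearning']
# count the missing answers per column of interest
missing = toy[cols].isnull().sum()
print(missing)
```

On the real file, the same `isnull().sum()` call tells us immediately how many respondents skipped each question.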
As we mentioned earlier, most of the courses we offer are on web and mobile development, but we also cover many other domains, like data science and game development. For the purpose of our analysis, we want to answer questions about a population of new coders who are interested in the subjects we teach.
Before starting to analyze the sample data we have, we need to clarify whether it's representative of our population of interest and whether it contains the right categories of people for our purpose. The `JobRoleInterest` column describes, for every participant, the role(s) they'd be interested in working in.
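One thing to keep in mind when reading the frequency table below: `value_counts` ignores missing answers, so the percentages are relative to the respondents who actually answered this question, not to all 18,175 rows. A tiny demo with made-up answers:

```python
import pandas as pd

# Made-up answers: four respondents, one of whom skipped the question
roles = pd.Series(['Full-Stack Web Developer',
                   'Full-Stack Web Developer',
                   'Data Scientist',
                   None])

# percentages are computed over the 3 non-missing answers only
pct = roles.value_counts(normalize=True) * 100
print(pct)
```

Here 'Full-Stack Web Developer' comes out at roughly 66.7%, i.e. two out of the three non-missing answers; the skipped answer simply drops out of the denominator.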
# generating a frequency table for the JobRoleInterest column
freq_interest = survey['JobRoleInterest'].value_counts(normalize=True) * 100
freq_interest
Full-Stack Web Developer                                                     11.770595
Front-End Web Developer                                                       6.435927
Data Scientist                                                                2.173913
Back-End Web Developer                                                        2.030892
Mobile Developer                                                              1.673341
Game Developer                                                                1.630435
Information Security                                                          1.315789
Full-Stack Web Developer, Front-End Web Developer                             0.915332
Front-End Web Developer, Full-Stack Web Developer                             0.800915
Product Manager                                                               0.786613
Data Engineer                                                                 0.758009
User Experience Designer                                                      0.743707
User Experience Designer, Front-End Web Developer                             0.614989
Front-End Web Developer, Back-End Web Developer, Full-Stack Web Developer     0.557780
Back-End Web Developer, Full-Stack Web Developer, Front-End Web Developer     0.514874
Back-End Web Developer, Front-End Web Developer, Full-Stack Web Developer     0.514874
DevOps / SysAdmin                                                             0.514874
Full-Stack Web Developer, Front-End Web Developer, Back-End Web Developer     0.443364
Front-End Web Developer, Full-Stack Web Developer, Back-End Web Developer     0.429062
Full-Stack Web Developer, Mobile Developer                                    0.414760
Front-End Web Developer, User Experience Designer                             0.414760
Back-End Web Developer, Full-Stack Web Developer                              0.386156
Full-Stack Web Developer, Back-End Web Developer                              0.371854
Back-End Web Developer, Front-End Web Developer                               0.286041
Full-Stack Web Developer, Back-End Web Developer, Front-End Web Developer     0.271739
Data Engineer, Data Scientist                                                 0.271739
Front-End Web Developer, Mobile Developer                                     0.257437
Full-Stack Web Developer, Data Scientist                                      0.243135
Mobile Developer, Game Developer                                              0.228833
Data Scientist, Data Engineer                                                 0.228833
                                                                               ...
User Experience Designer, Front-End Web Developer, Data Engineer, Product Manager, DevOps / SysAdmin, Full-Stack Web Developer, Data Scientist, Game Developer, Back-End Web Developer     0.014302
                                                                               ...
DevOps / SysAdmin, Information Security, Mobile Developer, Quality Assurance Engineer     0.014302
Name: JobRoleInterest, Length: 3213, dtype: float64
Analyzing our frequency table, we can tell that almost 12% of the respondents are interested in becoming a Full-Stack Web Developer, and more than 6% in a role as Front-End Web Developer. Looking further down the table, we notice that many people are interested in more than one role. We'll visualize this split in a pie chart.
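Since many answers list several roles separated by commas, we could also gauge how popular each individual role is by splitting the answers apart. This is an optional sketch that goes beyond the analysis here (and `Series.explode` requires pandas 0.25 or later), shown on made-up data:

```python
import pandas as pd

# Made-up multi-select answers in the same comma-separated format
answers = pd.Series([
    'Full-Stack Web Developer',
    'Front-End Web Developer, Full-Stack Web Developer',
    'Data Scientist',
])

# split each answer on ', ' and count every role on its own
roles = answers.str.split(', ').explode()
print(roles.value_counts())
```

This counts 'Full-Stack Web Developer' twice, once from each answer that mentions it, which gives a per-role view rather than a per-combination view.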
# analyze the number of students interested in one or multiple courses
# flag respondents whose answer lists more than one role (NaN if they skipped)
multiple_courses = survey['JobRoleInterest'].str.contains(',')
print("Number of students interested in multiple courses: " + str(multiple_courses.sum()))
# note: shape[0] counts every row, including respondents with no answer
all_students = survey['JobRoleInterest'].shape[0]
single_course = all_students - multiple_courses.sum()
print("Number of students interested in one single course: " + str(single_course))
print('Total number of students: ' + str(all_students))
Number of students interested in multiple courses: 4779
Number of students interested in one single course: 13396
Total number of students: 18175
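One subtlety worth flagging in the count above: `str.contains` returns NaN for respondents who skipped the question, and `.sum()` silently skips those NaNs, while `shape[0]` counts every row. So the "one single course" figure also absorbs the respondents who gave no answer at all. A small demo of that behavior on made-up data:

```python
import pandas as pd

# Made-up answers: one multi-role, one single-role, one missing
interests = pd.Series(['Front-End Web Developer, Data Scientist',
                       'Data Scientist',
                       None])

multi = interests.str.contains(',')
print(multi)        # the missing answer stays missing (NaN)
print(multi.sum())  # NaN is skipped, so only genuine multi-role answers count
```

Subtracting `multi.sum()` from `interests.shape[0]` would therefore report two "single course" students, even though one of them never answered.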
# plot our data in a pie chart
df = pd.DataFrame({'Courses': [multiple_courses.sum(), single_course]}, index=['Multiple Courses', 'One Course'])
df.plot.pie(y='Courses', figsize=(5, 5))
plt.ylabel('')
plt.legend(loc='lower right')
More than a quarter of the students are interested in multiple courses, so it makes sense for our company to offer a variety of course subjects. Let's analyze how many people are interested in web and mobile development - the focus of our courses.
# create a frequency table with absolute values
survey_web_mobile = survey['JobRoleInterest'].str.contains('Web Developer|Mobile Developer')
survey_web_mobile.value_counts()
True     6030
False     962
Name: JobRoleInterest, dtype: int64
# create a frequency table with percentages and plot the results in a bar chart
plt.style.use('fivethirtyeight')
survey_web_mobile_percent = survey_web_mobile.value_counts(normalize=True) * 100
survey_web_mobile_percent.plot(kind='bar')
plt.title('Distribution of Students \nby Job Role Interest', size=18)
plt.ylabel('Percentage', size=12)
plt.ylim([0, 100])
plt.xticks([0, 1], ['Web or Mobile \nDevelopment', 'Other Subjects'], rotation=0)
plt.show()
More than 86% of the respondents indicated that they are interested in web or mobile related subjects. We can conclude that our sample is representative for the purpose of our analysis. Now that we know the sample contains the right categories of people, we can begin analyzing it.
We can start by finding out where these new coders are located and how many of them live in each location. This information is in the `CountryCitizen` and `CountryLive` columns, which describe the country of origin and the country each participant lives in. As we're interested in where people are currently located, we'll use only the `CountryLive` column. One good indicator of a promising market is the number of potential customers: the more potential customers in a market, the better.
# drop the rows where respondents did not answer what role they are interested in
survey_clean = survey[survey['JobRoleInterest'].notnull()].copy()
survey_clean.shape
(6992, 136)
# generate a frequency table for the `CountryLive` column with absolute values
absolute_values = survey_clean['CountryLive'].value_counts()
absolute_values.head()
United States of America    3125
India                        528
United Kingdom               315
Canada                       260
Poland                       131
Name: CountryLive, dtype: int64
# generate a frequency table for the `CountryLive` column with percentages
percentage = survey_clean['CountryLive'].value_counts(normalize=True) * 100
percentage.head()
United States of America    45.700497
India                        7.721556
United Kingdom               4.606610
Canada                       3.802281
Poland                       1.915765
Name: CountryLive, dtype: float64
The United States is the top country: more than 45% of the respondents live there. India comes far behind with almost 8% of the respondents, and all other countries stay under 5%. We will certainly choose the United States as our first market. As a second market, we could choose India, but we should first check whether people living in India are willing to spend more money on learning than people in the numbers 3 and 4 on the list, the United Kingdom and Canada. The `MoneyForLearning` column will be helpful to answer this question.
We'll narrow down our analysis to only four countries: the US, India, the United Kingdom and Canada. These are the countries having the highest absolute frequencies in our sample, which means we have a decent amount of data for each. In addition to that, our courses are written in English, and English is an official language in all these four countries. The more people that know English, the better our chances to target the right people with our ads.
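Narrowing the sample to these four countries can be done with `Series.isin`. A minimal sketch on made-up rows (the country names are real, the money values are invented):

```python
import pandas as pd

# Made-up sample with one country outside our four target markets
df = pd.DataFrame({
    'CountryLive': ['United States of America', 'India', 'Poland',
                    'United Kingdom', 'Canada'],
    'MoneyForLearning': [200.0, 80.0, 60.0, 40.0, 100.0],
})

four = ['United States of America', 'India', 'United Kingdom', 'Canada']
# keep only the rows for the four countries of interest
top_four = df[df['CountryLive'].isin(four)]
print(top_four['CountryLive'].tolist())
```

Applied to the cleaned survey data, the same filter drops every respondent living outside the four candidate markets while keeping all columns intact.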
The `MoneyForLearning` column describes, in US dollars, the amount of money participants spent from the moment they started coding until the moment they completed the survey. Our company sells subscriptions at a price of $59 per month, so we're interested in finding out how much money each student spends per month. Therefore, we will divide the `MoneyForLearning` column by the `MonthsProgramming` column, after replacing all values of 0 in `MonthsProgramming` (students who had just started when they completed the survey) with 1 to avoid dividing by 0.
# replace values of 0 with 1
survey_clean['MonthsProgramming'] = survey_clean['MonthsProgramming'].replace(0, 1)
# create a new column that describes the money students spent per month
survey_clean['MonthlySpent'] = round(survey_clean['MoneyForLearning'] / survey_clean['MonthsProgramming'], 2)
survey_clean['MonthlySpent'].head()
1     13.33
2    200.00
3      0.00
4      0.00
6      0.00
Name: MonthlySpent, dtype: float64
# count the null values in the `MonthlySpent` column
survey_clean['MonthlySpent'].isnull().sum()
675
# count the null values in the `CountryLive` column
survey_clean['CountryLive'].isnull().sum()
154
# remove the rows with null values in the new column and the `CountryLive` column
survey_clean = survey_clean[survey_clean['MonthlySpent'].notnull()]
survey_clean = survey_clean[survey_clean['CountryLive'].notnull()]
# check the null values again
print(survey_clean['MonthlySpent'].isnull().sum())
print(survey_clean['CountryLive'].isnull().sum())
0
0
# group the remaining data by country and mean of the `MonthlySpent` column
countries_mean = survey_clean.groupby('CountryLive')['MonthlySpent'].mean()
countries_mean
We got some surprising results. Intuitively, we'd expect people in the UK and Canada to spend more on learning than people in India, taking into account socio-economic metrics like GDP per capita. It might be that we don't have enough representative data for the UK, Canada and India, that some outliers (coming from wrong survey answers, for example) are influencing the mean, or that the results are simply correct.
We'll generate four box plots to visualize the distribution of the `MonthlySpent` column for each country and spot possible outliers.
# generate 4 box plots for the `MonthlySpent` column, one per country
countries = ['United States of America', 'India', 'United Kingdom', 'Canada']
# box plots need the individual monthly amounts, not the per-country means
data = [survey_clean.loc[survey_clean['CountryLive'] == country, 'MonthlySpent']
        for country in countries]
plt.boxplot(data, labels=['US', 'India', 'UK', 'Canada'])
plt.title('Money Spent per Month by Country')
plt.ylabel('Money per month (US dollars)')
plt.show()