#!/usr/bin/env python
# coding: utf-8

# # What is the best market to advertise in?
# 
# We're working for an an e-learning company that offers courses on programming. Most of our courses are on web and mobile development, but we also cover many other domains, like data science, game development, etc. We want to promote our product and we'd like to invest some money in advertisement. Our goal in this project is to find out the two best markets to advertise our product in.
# 
# To reach our goal, we could organize surveys for a couple of different markets to find out which would the best choices for advertising. This is very costly, however, and it's a good call to explore cheaper options first.
# 
# We can try to search existing data that might be relevant for our purpose. One good candidate is the data from freeCodeCamp's 2017 New Coder Survey. freeCodeCamp is a free e-learning platform that offers courses on web development. Because they run a popular Medium publication (over 400,000 followers), their survey attracted new coders with varying interests (not only web development), which is ideal for the purpose of our analysis.
# 
# Let's have a look at the survey dataset:
# 
# # Data Exploration
# 

# In[1]:


# Import Pandas
import pandas as pd

# view all columns
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Read file and view first 5 rows
file = pd.read_csv("2017-fCC-New-Coders-Survey-Data.csv", low_memory=False)
file.head()


# The data contains a lot of information that could be useful for us. For example, columns include location, Money for learning, Job preferences and information on where they learn. All of this will help alot but we do not yet know if this sample is actually useful for our analysis. We are working for a company that focuses mainly on web development and mobile development. So, let's make sure that these interests are well represented in the survey:

# In[2]:


# Value counts of Job role interests
file["JobRoleInterest"].value_counts(normalize=True)


# Just from looking through the most popular responses in the 'JobRoleInterest' column doesn't help too much. We can see that the most popular repsonses seem to involve Web development but responders can give multiple responses which means we have 3212 unique responses. Does this indicate that most are unsure of what they are interested in learning? let's take a look. 

# In[3]:


# Import matplotlib
import matplotlib.pyplot as plt


# Find numbers with 1 or more interests
multiple_interests = 0
one_interest = 0

for i in file["JobRoleInterest"].dropna():
        i = i.split(",")
        if len(i) > 1:
            multiple_interests += 1
        else:
            one_interest += 1
            
# convert to %            
percent_multiple_interests = (multiple_interests / len(file["JobRoleInterest"].dropna())) * 100
percent_one_interest = (one_interest / len(file["JobRoleInterest"].dropna())) * 100   

# Print Values
print("Multiple Interests: {}".format(percent_multiple_interests))
print("percent_one_interest: {}".format(percent_one_interest))

# Show in Chart
plt.bar(['Multiple Interests', 'One Interest'], [percent_multiple_interests, percent_one_interest])
plt.title("Majority have more than 1 interest")
plt.ylabel("Percentage")
plt.show()


# 68% of responders gave more than response which indicates the majority are unsure of what they want to learn. Now remember that the company we are working for offers many different courses so this isnt actually a problem. With may dfferent courses they can appeal to those unsure what they want to learn.
# 
# The main focus of the company, however, is on web development and mobile development. So, what % are interested in at least one of those?

# In[4]:


# Find munber intersted in web or mobile
are_interested = 0
for i in file["JobRoleInterest"].dropna():
        if 'Web' in i or 'Mobile' in i:
            are_interested += 1

# Convert to %
percent_interested = (are_interested / len(file["JobRoleInterest"].dropna())) * 100
percent_not_interested = 100 - percent_interested

# Print
print('Interested: {}'.format(percent_interested))
print('Not Interested: {}'.format(percent_not_interested))

# Show in Bar Chart
plt.bar(['Interested', 'Not Interested'], [percent_interested, percent_not_interested])
plt.title("Majority are interested in web or mobile")
plt.ylabel("Percentage")
plt.show()


# Great, over 86% of responders are interested in at least one. Now that we know the survey is representative of our population interest we can move on. Let's look through where most of our potential customers live.
# 
# # Where do they live?

# In[5]:


# Remove null columns
file_clean = file[file['JobRoleInterest'].notnull()].copy()
# Show proportion from each country
file_clean['CountryLive'].value_counts(normalize=True) *100


# By far the the highest proportion come from the United States. In second place is India. I am fairly sure that with such a large number of potential customers that the United States is going to be 1 of the 2 markets I suggest advertising in. However, before deciding on India as the second choice we need to also check how much money each customer is able to spend. 
# 
# I am going to narrow things down to the top 4 countries in the chart above. These are 4 countries with English as an official language(the programming courses are in English) and are also a fair bit ahead of the 5th most common country.
# 
# # How much do they spend?
# 
# The subscription for a our programming cours is 59$ per month so lets find out how much on average each customer can spend by dividing the money for learning column with the months programming column. I will then use this number to find the average spend of each potential customer from our top 4 countries. This should give us a good idea which 2 markets to invest in.

# In[6]:


# Create money spent column
programming = file_clean["MonthsProgramming"].replace(0, 1, inplace = True)
file_clean['Money_Spent'] = round(file_clean['MoneyForLearning'] / file_clean["MonthsProgramming"], 2)


# Remove null values
file_clean = file_clean[file_clean['Money_Spent'].notnull()].copy()
file_clean = file_clean[file_clean['CountryLive'].notnull()].copy()

# Filter top 4 countries
top_four = ["United States of America", "United Kingdom", "Canada", "India"]
file_clean_four = file_clean[(file_clean["CountryLive"] == "United States of America") | (file_clean["CountryLive"] == "United Kingdom")
                             | (file_clean["CountryLive"] == "Canada") | (file_clean["CountryLive"] == "India")].copy()

# Show mean per country
money_per_region = file_clean_four.groupby('CountryLive')["Money_Spent"].mean()
money_per_region


# I must admit these results are a slght suprise. I expected the average spend in India to be the lowest because of the average income in each country. Before suggesting USA and India as the 2 markets I will just check to make sure that the results are not being influenced by extreme outliers. Let's check a box plot for each country:

# In[7]:


# Import module
import seaborn as sns

# Plot box plot
ax = sns.boxplot(x = 'CountryLive', y = 'Money_Spent', data = file_clean_four)
ax.set_title('Average money spent per person in each Country')
ax.set_xticklabels(labels = ['USA', 'UK', 'India', 'Canada'],  rotation=90)
ax.tick_params(left=False, bottom=False)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)


# It's clear we have outliers affecting the results. Someone in America is spending 80,000 a month! I am going to use z score to remove some of the extreme outliers from each country. Usually, anything with a z score above 3 would be removed. However, I do expect quite a lot of variance in the data because of the difference in spending between learners who study for free online and thos who pay for expensive coding bootcamps. I will therefore remove any that have a z score above 10.

# In[8]:


# Calculate mean and std per country
grouped_mean = file_clean_four.groupby('CountryLive')["Money_Spent"].transform("mean")
grouped_std = file_clean_four.groupby('CountryLive')["Money_Spent"].transform("std")
file_clean_four["grouped_std"] = grouped_std
file_clean_four["grouped_mean"] = grouped_mean

# Create x score column
file_clean_four["z_score"] = (file_clean_four['Money_Spent'] - file_clean_four["grouped_mean"]) / file_clean_four["grouped_std"]
# Filter out high z score
file_clean_four = file_clean_four[file_clean_four["z_score"] < 10]

# Show new mean
money_per_region = file_clean_four.groupby('CountryLive')["Money_Spent"].mean()
money_per_region


# In[9]:


# Plot box plot
ax = sns.boxplot(x = 'CountryLive', y = 'Money_Spent', data = file_clean_four)
ax.set_title('Average money spent per person in each Country')
ax.set_xticklabels(labels = ['USA', 'UK', 'India', 'Canada'],  rotation=90)
ax.tick_params(left=False, bottom=False)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)


# These boxplots look better but we still have a few outliers in the USA and India columns. Let's take a closer look:

# In[10]:


# Show USA outliers
usa_outliers = file_clean_four[(file_clean_four["CountryLive"] == "United States of America") & (file_clean_four["Money_Spent"] > 7500)]
usa_outliers


# Of the outliers here 5 of the 9 didn't go to bootcamp. How else would they have spent so much? (University tuition is not included). They can be removed. The other 4 did attend bootcamp but the data shows that they been programming for no more than three months when they completed the survey. They most likely paid a large sum of money for a bootcamp that was going to last for several months, so the amount of money spent per month is unrealistic and should be significantly lower (because they probably didn't spend anything for the next couple of months after the survey). As a consequence, I'll remove all outliers.

# In[11]:


# Show India outliers
india_outliers = file_clean_four[(file_clean_four["CountryLive"] == "India") & (file_clean_four["Money_Spent"] > 2500)]
india_outliers


# Meanwhile, none of the Indian outliers attended bootcamp and can therefore all be excluded. 

# In[12]:


# Drop outliers
file_clean_four = file_clean_four.drop(usa_outliers.index)
file_clean_four = file_clean_four.drop(india_outliers.index)

# Show new mean values per country
money_per_region = file_clean_four.groupby('CountryLive')["Money_Spent"].mean()
money_per_region


# Removing all the outliers from the box plot leaves us with the following average money spent shown above. Before giving the final verdict I am going to make a couple more charts. The first will show the total number of potential customers in each country who spend enough for a monthly subscription. The second will show the total money spent by all learners from each country. This will give a good idea of the total potnetial customers and revenue available in each market. 
# 
# # Market Potential

# In[13]:


# Number who spend above 59 dollars in each country
india_spend_enough = len(file_clean_four[(file_clean_four["CountryLive"] == "India") &
                                 (file_clean_four["MoneyForLearning"] > 59)])
canada_spend_enough = len(file_clean_four[(file_clean_four["CountryLive"] == "Canada") &
                                 (file_clean_four["MoneyForLearning"] > 59)])
uk_spend_enough = len(file_clean_four[(file_clean_four["CountryLive"] == "United Kingdom") &
                                 (file_clean_four["MoneyForLearning"] > 59)])
usa_spend_enough = len(file_clean_four[(file_clean_four["CountryLive"] == "United States of America") &
                                 (file_clean_four["MoneyForLearning"] > 59)])

spend_enough = [canada_spend_enough, india_spend_enough, usa_spend_enough, uk_spend_enough]

# Spend per country
canada_meanxpotential = money_per_region["Canada"] * len(file_clean_four[file_clean_four["CountryLive"] == "Canada"])
india_meanxpotential = money_per_region["India"] * len(file_clean_four[file_clean_four["CountryLive"] == "Canada"])
usa_meanxpotential = money_per_region["United States of America"] * len(file_clean_four[file_clean_four["CountryLive"] == "United States of America"])
uk_meanxpotential = money_per_region["United Kingdom"] * len(file_clean_four[file_clean_four["CountryLive"] == "United Kingdom"])
meanxpotential = [canada_meanxpotential, india_meanxpotential, usa_meanxpotential, uk_meanxpotential]

# Create new dataframe
potential_customers = pd.DataFrame([spend_enough, meanxpotential])
potential_customers.columns = ['Canada', 'India', 'USA', 'UK']
potential_customers.index = ['Total Spend > 59', 'Spending * potential customers']
potential_customers


# In[14]:


# Create 1x2 grid
fig = plt.figure(figsize= (14,6))
fig.tight_layout
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

# plot number who spend above 59 dollars
ax1.bar(potential_customers.columns, potential_customers.iloc[0])
ax1.set_title("Number of potential customers who spend more per month \n than subscription cost.")
ax1.set_ylabel("Number of people")
ax1.set_xlabel("Countries")
ax1.tick_params(left=False, bottom=False)
ax1.spines["top"].set_visible(False)
ax1.spines["right"].set_visible(False)

# Plot total spend per country
ax2.bar(potential_customers.columns, potential_customers.iloc[1], color='g')
ax2.set_title("Potential money in the market")
ax2.set_ylabel("Spend X potential Customers")
ax2.set_xlabel("Countries")
ax2.tick_params(left=False, bottom=False)
ax2.spines["top"].set_visible(False)
ax2.spines["right"].set_visible(False)


# If it wasn't clear already the USA is by far the best market to advertise in. With both the highest number of potential customers and the highest mean spend the US market dwarfs the competition. The chart on the left shows that the US has over 6X the number of potential customers who spend more than the subscription price per month. The chart on the right shows how much more potential money is in the US market. Candada and India are both slightly ahead of the UK in both charts. India has more potential customers who spend at least the subscription cost per month, whilst there is more total money in the Canadian market due to it's higher mean spend.
# 
# # Conclusion
# 
# One of the markets must be the United States and it may be prudent just to spend all of the budget there considering how much more potential money is in the market. The second market should be either India or Canada. If it has to be 2 markets then I would suggest 80% of the budget goes to the US whilst 20% goes to Canada which has slightly more potential money in its market than India.
# 
# **Suggested Budget**
# 
# USA 80%
# 
# Canada 20%

# In[ ]: