In this practice project, I assume that I'm working for an e-learning company that offers online courses on programming. Most of the courses cover web and mobile development, but the company also covers many other domains, such as data science and game development. The company wants to promote its product and is willing to invest money in advertising.
The goal in this project is to find out the two best markets for advertising the e-learning platform.
To reach this goal, the company could organize surveys for a couple of different markets to find out which would be the best choices for advertising. However, this is very costly. It would be more efficient to first explore cheaper options, such as using existing sources of market research and surveys.
To select these markets, I use data from an existing survey conducted by FreeCodeCamp as a sample of the population of people we would like to target with advertising: new coders or coders interested in continuing education. This will help narrow down which two markets are the best for our advertising campaign.
The main questions I wanted to answer are:
By answering these questions, I will then suggest which two national markets would be the best for advertising.
The conclusions for these questions were:
Based on these conclusions, I recommend the US and India as the two best national markets for advertising:
After answering those questions, I decided to look at additional market segments and related points of investigation. I considered looking at:
I know that within tech fields, the underrepresentation of women is a major problem, contributing to gender discrimination in the form of unequal pay, workplace harassment, and a lack of technology designed with women's needs in mind. I therefore decided to explore the gender breakdown of respondents: how many men vs. women participated in the survey, and whether there are differences in job role interests, location, or ability to pay between genders.
My findings from the gender analysis were that:
My conclusions for the gender analysis of the sample were that:
Female respondents show an even greater ability to pay monthly for learning than male respondents, and they are underrepresented both in the sample and in tech fields worldwide. By targeting advertising at interested female learners in web and mobile development, or even offering them discounted rates, the company could both grow its user base and help reduce gender inequality.
I am using data from freeCodeCamp's 2017 New Coder Survey. Because freeCodeCamp runs a popular Medium publication (over 400,000 followers), the survey attracted new coders with varying interests (not only web development), which is ideal for the purpose of our analysis. I am using an existing survey rather than conducting a new one in order to save money on the initial analysis.
Survey data is taken from this GitHub repository: https://github.com/freeCodeCamp/2017-new-coder-survey
import numpy as np
import pandas as pd
survey_data = pd.read_csv('2017-fCC-New-Coders-Survey-Data.csv', sep=',')
survey_data.info()
## On first pass, read_csv emits a DtypeWarning:
## /dataquest/system/env/python3/lib/python3.4/site-packages/IPython/core/interactiveshell.py:2723:
## DtypeWarning: Columns (17,62) have mixed types.
## Specify dtype option on import or set low_memory=False.
##   interactivity=interactivity, compiler=compiler, result=result)
## So we will investigate the dtypes of columns 17 and 62
survey_data.columns[17]
## Printed 'CodeEventOther'
survey_data.columns[62]
## Printed 'JobInterestOther'
survey_data['CodeEventOther'].dtype
survey_data['JobInterestOther'].dtype
## Both return dtype 'O' indicating a python object (string)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18175 entries, 0 to 18174
Columns: 136 entries, Age to YouTubeTheNewBoston
dtypes: float64(105), object(31)
memory usage: 18.9+ MB
/dataquest/system/env/python3/lib/python3.4/site-packages/IPython/core/interactiveshell.py:2723: DtypeWarning: Columns (17,62) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
dtype('O')
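Since both columns are free-text strings, one way to silence the warning on future runs would be to declare their dtypes at import time (the alternative the warning itself suggests is `low_memory=False`). The sketch below uses a tiny inline stand-in for the real CSV, with hypothetical values, since only the `read_csv` option matters here:

```python
import io
import pandas as pd

# Tiny inline stand-in for the real survey CSV (hypothetical values).
csv_text = (
    "Age,CodeEventOther,JobInterestOther\n"
    "25,Ladies Learning Code,Researcher\n"
    "30,,\n"
)

# Declaring the two mixed-type columns as strings up front avoids the
# chunk-by-chunk type inference that triggers the DtypeWarning.
sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'CodeEventOther': str, 'JobInterestOther': str},
)
print(sample['CodeEventOther'].dtype)  # object
```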
Next, I'll look more closely at the unique values in each column to determine what dtype to assign.
survey_data['CodeEventOther'].unique()
## This returned unique values that were entirely titles of
## different code camps, so we will leave the dtype 'O' designation
array([nan, 'Ladies Learning Code', 'Peatix Events', 'Virada Tecnológica - Brasil', 'Mentoring', 'Microsoft Dev Camps', 'GDG', 'Include Girls Workshop', 'Local Event Called CodeCamp', 'DjangoCon', 'C3', 'Learners Guild Enrollment Day', 'Bootcamp', 'Laracon', 'Girl Geek Carrot', 'Coding Kids', 'No', 'Meet-ups In College', 'Codepen Meetups', 'Code2040', 'Summer Of Tech Bootcamps', 'CTFs', 'CodeNewbie', 'Been Wanting To Attend Women Who Code Etc Hackathons, Go To Hacklab But Am A Chicken And Also A Tired Human', 'Pyladies Events', 'LRUG', 'General Assembly', 'School Mettings', 'TechOlympics', 'The Iron Yard Crash Courses', 'LearningFuze Development Bootcamp In Irvine, CA', 'Code First: Girls', "Havn't Attended Any Events", 'WordCamp', 'App Academy Prep Course', ' Kc Code Noobs', 'Noone', 'CCC Waterloo', 'BrazilJS', 'MLH Hackathons', 'No One', 'Techzion', 'Bitmaker (Toronto, ON - CA)', 'The Recurse Center', 'Outreachy', 'Mentor', 'Pink Programming', 'Women Techmakers', 'There Are Very Limited Resources Where I Live', 'Institute Of Code Bali', 'Mentorship For Girls', 'Code Up', 'Competitive Programming', 'HearHerCode DC, Iron Yard Meetups', 'DeveloperDeveloperDeveloper', 'Just In My University.', 'No Attended', 'Weekend Coding Meet Ups', 'Hackathons', 'ACM ICPC', 'Code4 America', 'PyCon', 'Css3 At Universidad De La Laguna', 'Mentorship Saturdays In Portland', 'Codebar!!', 'Some Meetups Organized In My Neighbourhood', 'South Florida Code Camp', 'Hack Reactor', 'Launch Code', 'IndustryConf', 'Email Course On The Fundamentals Of Functional Programming', 'DrupalCon', 'Udemy', 'Software Carpentry', 'Nan', 'Team Retreat :P', 'Hear Me Code', 'Coalition For Queens (C4Q)', 'Google Developer Group', 'WeTeach_CS', 'Code For America', 'Coding Dojo', "Just Moved To Kansas City And I'm Going To All The Tech Meetups Here, Kansas City Women In Tech, Kansas City Code And Meetup, Etc. 
", 'School', 'IEEEXtreme Competition', 'Perl User Groups', 'Codeathons', 'CodeCoach', 'Zappos Coding Challenge', 'In-company Events', 'Derek Benas', 'Hz', 'Pyday', 'N0', 'I Went To A Bootcamp', 'DevCon Philippines', 'For Loop', 'Programming Knowledge Youtube Channel', 'Metal Toad Meetup', 'Try Turing', 'Local Lectures From College', 'PDX Ruby Group', '12-week Bootcamp', 'Code Fellows Bootcamp In Seattle, WA', 'Python Code School', 'Saylani Mass Training Program', 'Conferences', 'Codebar', 'There Are No Events In My Area Whatsoever.', 'Noble Desktop', 'Have Yet To Visit', 'All Star Code', '2 Week Html Class And School', 'High School Code Class', 'Pycon India', 'Google Developers Group', 'General Assembly Introductory Class', 'Detroit.js', 'Discord', 'Andela Bootcamp', 'Course Run At Local Digital Hub', 'Private Classes', 'Pyladies', 'Launchcode', 'TENT', 'Code.org Teacher Training', 'Null', 'Small University Group', 'Didnt Attended Any', 'Android App Building School', 'OpenHack', 'Didnt Have Chance To Attend To Any Event', 'Study Groups For My Minor In College', 'WoC Hackers', 'Py Ohio (Hosted By Local Meetup "Python Ohio" Every Year)', 'Q', 'SharePoint Saturdays', 'Manchester Girl Geeks', 'College', 'Spring In JAVA', 'No Events', 'Codesmith', 'Black Tech Week', 'Ninguno', 'Na', 'PyLadies', 'Any', 'AWS Day By AWS User Group Japan In Japan', 'DrupalCamp', 'Meet.js', 'CoderCamps Meet And Greet', 'Library Code Courses', 'Steer Data Viz D3.js', 'Vilnius Girls Code', 'Yt Videos', 'Intro To Computer Science Class', 'Self Coding', 'College-related', 'Minor Programmen At The University Of Amsterdam', 'Coding Bootcamp', '5 Month Private Class', 'Camp', 'I Meet With A Friend Weekly.', 'CoderGirls', 'Meetup With Friends', 'OKCoders', 'JavaForum Stuttgart', 'Angular', 'Boise Code Camp', 'Competion', 'Eduweb.pl Web Design UI/UX/HTML Bootcamp', 'MyCocoahead', 'Puppy', 'Haven;t Been 2 One', 'Competitions', 'Library Hosted Classes And Workshops', 'Girls Who Code', 'I Dont Atend', 
'Weekly Courses Held By Our School', 'Campus Party Mexico', 'Le Wagon', 'Code & Beer In Seattle', 'Local Events', 'Codeforce', 'Hackerspace Member', 'Brainstation', 'Code For Orlando', 'Radius Co-Working Events', 'PiterJS', 'Meetup Data Girls', 'ChiHack Night', 'WordPress Meetup Groups', 'CoderGirl By LaunchCode', 'Code Fellows', 'NIT FossMeet', 'LaunchCode', 'Any One', 'Letslearncoding.org', 'No Event', 'Local', 'IDTech', 'Barcode', 'Www.thenewboston.com', '.', 'Local Coding Club', 'CodeCamp Cluj, DevDays Cluj', 'We Got Coders', 'Learn Teach Code LA', 'N A', "I Havn't", 'Computer Science 1st Year', 'Programming Group Projects In My College Of Technology', 'Hyperion', 'YouTube', 'Epicodus (part Time/full Time Boot Camp)', 'Rmotr', 'High School Curriculum', 'NEXT Academy (Malaysia)', 'Easyctf', 'Meetup', 'GreenFox', 'Friends', 'School Camp', 'Codemotion', 'At Our University PAFKIET(IntraBattle)', 'Code.org', 'Lær Kids Koding', 'Networking With Women Like Me Who Are Currently Learning Code', 'Free Class', 'Mini Curso De Python', 'Google', 'Hacklab Almeria', 'Local Developers Events', 'Symposium', 'Coding Club Started With Developers At The Software Company I Work For.', 'Meetus', 'Reactive Conference Bratislava', 'CoderDojo Scotland', 'Pink Programming, Codher', 'Coding Tutorials 360', 'Code Clan', 'Ladies Learn Code', 'Bay Area Coding Sessions', 'Codementor.io', 'HackSSC', 'Thinkful Meetups', 'CodeUp Manchester', 'Amazon Workshop For Getting Into Coding', 'Php Conference', 'HeartLandGaming Expo', 'DroidCon', 'Pycon', 'Continuing Education Program For IT', 'Smashing Conference', 'Google Developers Conference. GDG', 'Informatics Olympiads', 'Hour Of Code', 'Galvanize', 'SC Codes', 'Hackerspace', 'Google Developers Group Ghana', 'Codingame', 'AZ Code Challenge', "Here In My Country We Don't Have Much Of This Kind Of Meetings. 
Brazil", '7 Week Bootcamp', 'Ironhack', 'Google Developer Groups Study Jam,devfest', '微软win10校园开发', 'Codechef', 'Code Over Coffee', 'Week Long Bootcamp', 'Free Software Day Event', 'Grace Hopper Conference', 'Free Code Camp', '-', 'Firebase Dev Summit', 'Talking To Other Developers At My Company', 'PuppetConf 2016', 'Hackaton', 'Code For Miami', 'Ainsleys Cooking Classes', 'College Competion', 'Laracast', 'Code Academy Azerbaijan', 'Iforum, Wordpress Kitchen', 'Codecool', 'Bristol JS', 'Hack Reactor Meetups, Rithm School Meetups', 'WoMoz', 'Founders And Coders', 'Local Courses From Es-press-oh', 'Exercises At The University', 'FISL', 'Inspaya Incubator In Nigeria', 'Edx', ' D', 'Weekly Bootcamps', 'Νονε', 'The Iron Yard', 'Sololearn', 'Code Camp South Florida', 'Bhkbhk', 'LaunchCode Mentorship', 'A', 'LaunchCode LC101', 'Unknown', 'Roblox', 'Html500 Vancouver', 'Tech Meetup Edinburgh, Scotland Based Event', 'Asdf', 'GDG Meetups', "I Don't Remember What's That Called", 'Bootcamps Night', 'CoderGirl', 'General Assembly Data Science Bootcamp', 'Codebar Brighton', 'School Classes', 'Coder Dojo', 'She Codes;', 'Codecademy London', 'Bitwise', 'CTF', 'Class', 'Crash Courses', 'Didnt Attempt', 'Google DevFest', "GDG Code D'Armor", 'I Havnt', 'Sleep Over With Friends', 'I Attended In Bootcamps In My Region', 'Coding Contests', 'Gdg', 'Campus Party Brasil 2017', 'Web Summit', 'Google Developer Groups', 'College Class For A Year', 'College Events', 'PyCaribbean 2017', "Couplo' Kids Web Pages On Coding :)", "I Live In Saudi Arabia And There Ain't Much Event Organized Here", 'School Groups :)', 'College Hackathon', 'Home Meets', 'IRL', 'Mobile App Development Class', 'Baidu', "My College's Own Activities", 'Iduntdoanyting', 'Random Conversations With Friends Of Friends And Friend-friends', 'Hackers News Meetup', 'Meet Ups', 'Other School', 'Coding Classes', 'College :(', 'Coder Camps-Pearland Texas', 'Seattle Coder Dojo', 'University Events Etc', 'Short Bootcamp', 'Have Yet To 
Attend', 'Group, Friend, Coding Jams!', 'Php Conf', 'Slef', 'Jug Meet Up', 'Just Starting School, Take Advantage Of Office Hours, Slack And Coffe Clatches Dedicated To Coding', 'Code First Girls', 'I Attend Firestation 101 Every Saturday. We Use FreeCodeCamp.', 'Developer Conferences', "I Don't Have Attended Yet.", 'Youtube', 'Coderhouse', 'University Workshop', 'Codecademy', 'LearningFuze', 'Geek Girls Carrots', 'CoderDojo Silicon Valley', 'Rails Girls', 'CoLab Kaduna', 'Internet', 'DevFest', 'Q College', 'AndroidSchool', 'Campus Party', 'Python For (Typo)Graphic Designers', 'For Loop Lagos', 'Meet Up And Teaching', 'HackerRank', 'Female Coders Lab', 'Hackathon', 'Open Source 101', 'Ieee', 'Hackerrank', 'PHP 7 Lection', 'RubyRails', 'Other', 'Swap Round Project', 'CodeFirst:Girls', 'Didnt Attend', 'Coderbunker', 'Nobody', 'Forloop', 'I Didnt', 'DevCongress Meetups', 'Audax', 'CodeSmith', 'A Part-time Course With Instructors And Peers', "I Don't Know", 'No,I Have No Attended It', 'Ihub Meetups', 'GDG Events', 'Facebook Groups', 'Code Platoon', 'Rutgers Coding Bootcamp', 'Lighthouse Labs', 'Local Part Time Bootcamps', 'University', 'Linux Security Related 5 Days Boot Camp', 'PHP Courses', 'AstanaJUG', 'Betabeers', 'School Events Such As Cat Barcamp', 'Yow', 'Local Ones', 'Books', 'College Class Meetups', 'Youtube Conversations Live - Video Streaming', 'No Coding-related Events', 'Baseball Hack Day', 'Code For Greenville', 'User Group Meetup', 'GoCode Colorado', 'Bootcamp Free Workshops', 'Google Dev Conference', 'Work Full Time', 'HackerYou, Ladies Learning Code', 'Rails Camp', 'Learning Tree', 'User Groups', 'Bootcamp Online', 'Cisco Live DevNet', 'YOU TUBE FREE TUTORIALS', 'Meetups', 'Webinar Online', 'Local Ruby Meetup (Dakar)', 'Local Game Developer Group', 'Chi Hack Night Open Data Hacking', 'Off-line Java Courses', 'Legal Hackers', 'Word Camp', 'Grand Circus', 'Edx, Alison, Coursera', 'An Event Apart', 'Self Trained', 'EL ZERO WEB SCHOOL', 'Getting Together 
With Family Who Code', 'Learners Guild', 'Local Coding And Dinner Event', 'My Friend', 'N', 'Sir Syed', 'Google Code Camp', 'WSD - WebStandardsDays', 'A Class At Community College', 'Matf', 'EPAM Trainings'], dtype=object)
survey_data['JobInterestOther'].unique()
## This returned a list of different job titles, so we will
## also leave the dtype 'O' designation
array([nan, 'Security Expert', 'Technical Writer', 'Researcher', 'Systems Engineer', 'Desktop Applications Programmer', 'Robotics', 'Non Technical', 'UI Design', 'Software Engineer', 'Email Coder', 'Data Analyst', 'I Dont Yet Know', 'UX Developer/designer', 'Support Scientific Resaerch', 'AI And Neuroscience', 'Full Stack Software Engineer', 'Program Manager', 'Application Support Analyst', "This Futurist's Dream Of Using Some Tech In A Way That Inspires Critical Amounts Of People To Influence The Changes We Need To Protect & Repair Our Planet", 'Information Architect', 'Physicist', 'Security Business Analyst', 'Bioinformatics/science', 'Creative Coder / Generative Artist/designer', 'A Job In Which I Can Use Coding Skills To Create Valuable Portals To Advance Human Rights', 'Research', 'Bitcoin/Crypto', 'Embedded Hardware', 'Data/Interactive Journalist', 'Software Engineering', 'Business Analyst', 'Network Engineer', 'Information Developer', 'Java Developer', 'Project Management', 'Machine Learning Engineer', 'Real-time Systems', 'Cybersecurity', 'GIS Developer', 'Research And Education', 'System Software', 'Full Stack Developer & Instructional Designer / Educational Technologist', 'AI', ' Bioinformatics', 'Urban Planner', 'Full Stack Developer', 'SWE', 'Embedded Developer', 'Virtual Reality Developer', 'Journalist/Graphic Designer/Marketing', 'Web Designer', 'Computer Architect', 'Networking', 'Software Developer', 'AI And Machine Learning', 'Computer Engineer', 'Artificial Intelligence', 'Systems Programming', 'Software Engineer (Computer Science Based)', 'Technology Management', 'Full-stack Developer', 'BA Or Developer', 'User Interface Design', 'System Engineer', 'Network', 'Analyst', 'Machine Learning', 'Pharmacy Tech', 'Data Journalist / Data Visualist', 'Desings', 'Infrastructure Architect', 'Tech Art', 'Technology-Business Liaison', 'Product Designer', 'Front-End Web Designer', 'Document Controller', 'Software Enginner', 'Programmer', 'Undeceided', 
'Pharmaceutical Industry', 'Information Technology', 'Library Developer', 'Desktop Application Developer', 'Operating Systems, Compilers, Etc...', 'GIS Database Admin', 'Designer', 'Support Engineer Or API Support', 'Python Developer', 'Bioinformatics', 'Robotics Process Automation Specialist', 'Data Visualisation', 'Desktop Applications Developer', 'All - Whatever Is Required To Develop Tools To Revolutionize The Mechanical Engineering Process', 'Digital Humanitites', 'User Interface Designer', 'Software Development', 'Programming', 'Web Development', 'Marketing', 'Financial Services', 'Natural Language Processing', 'Entreprenuer / Web Dev Hustler', 'Marketing Automation', 'AI Developer', 'Network Admin', 'Front End, Back End, Game, Web, Mobile Developer', 'Computer Scientist', 'UI Designer', 'Data Entry', 'Business Consultant', 'Cloud Computing', 'Machine\u200b Learning Engineer', "I'd Like To Wear Lots Of Hats And Do Hard Work", 'Fintech', 'Neuroscientist', 'Visual Designer', 'Database Administration', 'Application Developer', 'AI Development', 'Eggs', 'Project Manager', 'Undecided', 'Milatary Engineer', 'SEO', 'Astrophysicist', '*', 'Journalist', 'Philosopher', 'Desktop Applications', 'IoT Developer', 'Systems Programmer', 'Professor', 'Artificial Intelligence Engineer', 'Developer Evangelist', 'Interaction Developer', 'Bioinformatitian', 'IoT', 'Entrepreneur', 'I Am Interested In Game Development, Mobile Development, Web Design, Front End Web Development', 'Data Reporter', 'Computer Vision Engineer/Research Scientist', 'Web Developer', 'Robotics And AI Engineer', 'Ethical Hacker', 'Scientific Programming', 'Software Developer Or Front-End Web Developer', 'Campaign Manager', 'AI Engineer', 'Software Specialist', 'Growth Hacker', 'Founder', 'Software Engineers', 'VR Technology Developer', 'Developer', 'Plc', 'Ceo', 'Tech Lobbiest', 'Quant (Algorithmic Trader)', 'Machine Learning And AI', 'Databases', 'Software Developper', 'College Professor', 'System 
Administrator/Network', 'Software Projects Manager', 'Teacher. Teaching Students To Code.', 'Education', 'Code Developer...in Whatever Format, Front-end, Back-end, App Dev Etc.', 'Improving In My Current Career As A Learning Technologist', 'Informatician', 'Lab Scientist', 'Data Visualization Specialist', "I'm Just Learning Code To Increase My Skill-set. I See It As A Literacy Issue.", 'Teacher', 'Criminal Defense Attorney-- Focusing On Cyber Crimes', 'Remote Support', 'Non-programmer', 'IT Specialist'], dtype=object)
## Print first 5 rows for overview
## Data has 136 columns
survey_data.head()
## Build a list of column names so we can identify the most useful ones for analysis
column_list = list(survey_data.columns)
## Print a numbered list to more easily identify columns
for i, item in enumerate(column_list, start=0):
    print(i, item)
0 Age 1 AttendedBootcamp 2 BootcampFinish 3 BootcampLoanYesNo 4 BootcampName 5 BootcampRecommend 6 ChildrenNumber 7 CityPopulation 8 CodeEventConferences 9 CodeEventDjangoGirls 10 CodeEventFCC 11 CodeEventGameJam 12 CodeEventGirlDev 13 CodeEventHackathons 14 CodeEventMeetup 15 CodeEventNodeSchool 16 CodeEventNone 17 CodeEventOther 18 CodeEventRailsBridge 19 CodeEventRailsGirls 20 CodeEventStartUpWknd 21 CodeEventWkdBootcamps 22 CodeEventWomenCode 23 CodeEventWorkshops 24 CommuteTime 25 CountryCitizen 26 CountryLive 27 EmploymentField 28 EmploymentFieldOther 29 EmploymentStatus 30 EmploymentStatusOther 31 ExpectedEarning 32 FinanciallySupporting 33 FirstDevJob 34 Gender 35 GenderOther 36 HasChildren 37 HasDebt 38 HasFinancialDependents 39 HasHighSpdInternet 40 HasHomeMortgage 41 HasServedInMilitary 42 HasStudentDebt 43 HomeMortgageOwe 44 HoursLearning 45 ID.x 46 ID.y 47 Income 48 IsEthnicMinority 49 IsReceiveDisabilitiesBenefits 50 IsSoftwareDev 51 IsUnderEmployed 52 JobApplyWhen 53 JobInterestBackEnd 54 JobInterestDataEngr 55 JobInterestDataSci 56 JobInterestDevOps 57 JobInterestFrontEnd 58 JobInterestFullStack 59 JobInterestGameDev 60 JobInterestInfoSec 61 JobInterestMobile 62 JobInterestOther 63 JobInterestProjMngr 64 JobInterestQAEngr 65 JobInterestUX 66 JobPref 67 JobRelocateYesNo 68 JobRoleInterest 69 JobWherePref 70 LanguageAtHome 71 MaritalStatus 72 MoneyForLearning 73 MonthsProgramming 74 NetworkID 75 Part1EndTime 76 Part1StartTime 77 Part2EndTime 78 Part2StartTime 79 PodcastChangeLog 80 PodcastCodeNewbie 81 PodcastCodePen 82 PodcastDevTea 83 PodcastDotNET 84 PodcastGiantRobots 85 PodcastJSAir 86 PodcastJSJabber 87 PodcastNone 88 PodcastOther 89 PodcastProgThrowdown 90 PodcastRubyRogues 91 PodcastSEDaily 92 PodcastSERadio 93 PodcastShopTalk 94 PodcastTalkPython 95 PodcastTheWebAhead 96 ResourceCodecademy 97 ResourceCodeWars 98 ResourceCoursera 99 ResourceCSS 100 ResourceEdX 101 ResourceEgghead 102 ResourceFCC 103 ResourceHackerRank 104 ResourceKA 105 
ResourceLynda 106 ResourceMDN 107 ResourceOdinProj 108 ResourceOther 109 ResourcePluralSight 110 ResourceSkillcrush 111 ResourceSO 112 ResourceTreehouse 113 ResourceUdacity 114 ResourceUdemy 115 ResourceW3S 116 SchoolDegree 117 SchoolMajor 118 StudentDebtOwe 119 YouTubeCodeCourse 120 YouTubeCodingTrain 121 YouTubeCodingTut360 122 YouTubeComputerphile 123 YouTubeDerekBanas 124 YouTubeDevTips 125 YouTubeEngineeredTruth 126 YouTubeFCC 127 YouTubeFunFunFunction 128 YouTubeGoogleDev 129 YouTubeLearnCode 130 YouTubeLevelUpTuts 131 YouTubeMIT 132 YouTubeMozillaHacks 133 YouTubeOther 134 YouTubeSimplilearn 135 YouTubeTheNewBoston
There are many columns here that would be useful when deciding how to target an advertising campaign. Some of the most potentially relevant are (column names in parentheses):
Although the focus of the e-learning platform is on web and mobile development, it also wants to appeal to those interested in other programming areas like data science or game development.
To get an idea of whether the population taking this survey represents our population of interest, I look at the JobRoleInterest column to see which roles survey participants are interested in.
## Generate a frequency table of the JobRoleInterest column
## Normalize to show percentages and drop null values (NaN)
job_freq = survey_data['JobRoleInterest'].value_counts(normalize=True, dropna=True)*100
job_freq
Full-Stack Web Developer 11.770595 Front-End Web Developer 6.435927 Data Scientist 2.173913 Back-End Web Developer 2.030892 Mobile Developer 1.673341 Game Developer 1.630435 Information Security 1.315789 Full-Stack Web Developer, Front-End Web Developer 0.915332 Front-End Web Developer, Full-Stack Web Developer 0.800915 Product Manager 0.786613 Data Engineer 0.758009 User Experience Designer 0.743707 User Experience Designer, Front-End Web Developer 0.614989 Front-End Web Developer, Back-End Web Developer, Full-Stack Web Developer 0.557780 Back-End Web Developer, Full-Stack Web Developer, Front-End Web Developer 0.514874 DevOps / SysAdmin 0.514874 Back-End Web Developer, Front-End Web Developer, Full-Stack Web Developer 0.514874 Full-Stack Web Developer, Front-End Web Developer, Back-End Web Developer 0.443364 Front-End Web Developer, Full-Stack Web Developer, Back-End Web Developer 0.429062 Full-Stack Web Developer, Mobile Developer 0.414760 Front-End Web Developer, User Experience Designer 0.414760 Back-End Web Developer, Full-Stack Web Developer 0.386156 Full-Stack Web Developer, Back-End Web Developer 0.371854 Back-End Web Developer, Front-End Web Developer 0.286041 Full-Stack Web Developer, Back-End Web Developer, Front-End Web Developer 0.271739 Data Engineer, Data Scientist 0.271739 Front-End Web Developer, Mobile Developer 0.257437 Full-Stack Web Developer, Data Scientist 0.243135 Data Scientist, Data Engineer 0.228833 Mobile Developer, Game Developer 0.228833 ... 
Data Engineer, Mobile Developer, Front-End Web Developer, Back-End Web Developer, Game Developer 0.014302 Front-End Web Developer, Back-End Web Developer, Quality Assurance Engineer, Full-Stack Web Developer, Product Manager 0.014302 Full-Stack Web Developer, Data Engineer, Information Security, User Experience Designer, Mobile Developer, Back-End Web Developer, Front-End Web Developer 0.014302 Game Developer, Front-End Web Developer, User Experience Designer, Information Security 0.014302 Information Security, Full-Stack Web Developer, Data Scientist, Back-End Web Developer 0.014302 Full-Stack Web Developer, Back-End Web Developer, Mobile Developer, User Experience Designer, Front-End Web Developer 0.014302 Mobile Developer, Back-End Web Developer, User Experience Designer, Full-Stack Web Developer, DevOps / SysAdmin, Front-End Web Developer 0.014302 Game Developer, Full-Stack Web Developer, Software Developer 0.014302 Game Developer, Full-Stack Web Developer, Front-End Web Developer, Data Engineer, User Experience Designer, Data Scientist, Mobile Developer, Back-End Web Developer, Information Security 0.014302 Data Engineer, Full-Stack Web Developer, Data Scientist, Information Security, Back-End Web Developer 0.014302 Mobile Developer, Full-Stack Web Developer, DevOps / SysAdmin, Front-End Web Developer, Game Developer, Back-End Web Developer 0.014302 Full-Stack Web Developer, I dont yet know 0.014302 Front-End Web Developer, Data Scientist, Back-End Web Developer, Data Engineer, Full-Stack Web Developer 0.014302 I'm just learning code to increase my skill-set. I see it as a literacy issue. 
0.014302 User Experience Designer, Data Engineer, Front-End Web Developer, Back-End Web Developer, Game Developer, Data Scientist, Information Security, Full-Stack Web Developer, Mobile Developer 0.014302 User Experience Designer, Information Security, Mobile Developer, Product Manager, Quality Assurance Engineer 0.014302 Data Engineer, Data Scientist, Front-End Web Developer 0.014302 DevOps / SysAdmin, Front-End Web Developer, User Experience Designer, Data Scientist, Full-Stack Web Developer, Information Security, Mobile Developer 0.014302 Game Developer, Mobile Developer, Data Engineer, User Experience Designer, Product Manager, DevOps / SysAdmin, Full-Stack Web Developer, Front-End Web Developer, Back-End Web Developer 0.014302 Mobile Developer, Game Developer, User Experience Designer, Product Manager, Full-Stack Web Developer, Front-End Web Developer 0.014302 Full-Stack Web Developer, User Experience Designer, Data Scientist, Mobile Developer, Game Developer 0.014302 Full-Stack Web Developer, Game Developer, Quality Assurance Engineer, Front-End Web Developer 0.014302 Data Engineer, Back-End Web Developer, Full-Stack Web Developer, Data Scientist 0.014302 Full-Stack Web Developer, Front-End Web Developer, Mobile Developer, Data Scientist, Product Manager, User Experience Designer 0.014302 Back-End Web Developer, User Experience Designer, Mobile Developer, Full-Stack Web Developer, Front-End Web Developer 0.014302 Information Security, Game Developer, Full-Stack Web Developer, Back-End Web Developer, Mobile Developer, Front-End Web Developer 0.014302 Front-End Web Developer, Full-Stack Web Developer, Information Security, Data Scientist, Back-End Web Developer, Game Developer 0.014302 Data Engineer, Front-End Web Developer, Mobile Developer, Full-Stack Web Developer, Back-End Web Developer 0.014302 User Experience Designer, Full-Stack Web Developer, DevOps / SysAdmin 0.014302 Back-End Web Developer, Data Scientist, Product Manager, Data Engineer 0.014302 Name: 
JobRoleInterest, Length: 3213, dtype: float64
The frequency table shows that the largest share of survey respondents are interested in Full-Stack Web Development, followed by Front-End Web Development, Data Science, and Back-End Web Development. However, scrolling through the list reveals that a large number of respondents selected more than one job role. We therefore can't yet say whether the majority is interested in web development, since the percentages for the answers that combine multiple roles would need to be added up.
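A toy example (hypothetical answers, not the real survey) illustrates the undercounting: an exact frequency table misses every combined answer that mentions a role, while a substring check catches them all.

```python
import pandas as pd

# Toy stand-in for the JobRoleInterest column (hypothetical answers):
# 'Web Developer' is the exact answer only once, but is mentioned twice.
answers = pd.Series([
    'Web Developer',
    'Web Developer, Data Scientist',
    'Game Developer',
])

# Share of answers that are exactly 'Web Developer' (1 of 3)
exact_share = (answers.value_counts(normalize=True) * 100)['Web Developer']
# Share of answers that mention 'Web Developer' anywhere (2 of 3)
mention_share = answers.str.contains('Web Developer').mean() * 100

print(exact_share)    # counts only the single-role answer
print(mention_share)  # counts every answer mentioning the role
```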
First, let's see how many respondents chose more than one job role by generating another frequency table.
## Starting with our original dataset, we first drop null values from the
## JobRoleInterest column
jobrole_no_nulls = survey_data['JobRoleInterest'].dropna()
## We then use our cleaned column to split by commas to separate out
## individual roles from the multiple selection answers
multi_jobs = jobrole_no_nulls.str.split(',')
multi_jobs
## This returns the same column, but with each individual job role selected
## as a separate list item within each row
1 [Full-Stack Web Developer] 2 [ Front-End Web Developer, Back-End Web Deve... 3 [ Front-End Web Developer, Full-Stack Web De... 4 [Full-Stack Web Developer, Information Securi... 6 [Full-Stack Web Developer] 9 [Full-Stack Web Developer, Quality Assuranc... 11 [ DevOps / SysAdmin, Data Scientist, Info... 13 [Back-End Web Developer, Full-Stack Web Devel... 14 [Full-Stack Web Developer] 15 [Full-Stack Web Developer] 16 [Full-Stack Web Developer] 18 [Full-Stack Web Developer, Front-End Web De... 19 [ Front-End Web Developer, Mobile Develope... 21 [Information Security] 22 [Full-Stack Web Developer] 23 [Back-End Web Developer] 28 [Full-Stack Web Developer] 29 [ Front-End Web Developer, Data Scientist,... 30 [Back-End Web Developer, Full-Stack Web Devel... 31 [ Front-End Web Developer] 32 [ Data Scientist, Information Security, Dat... 33 [Full-Stack Web Developer, Quality Assuranc... 34 [Back-End Web Developer, Full-Stack Web Devel... 35 [Back-End Web Developer, Full-Stack Web Devel... 37 [ Mobile Developer, Product Manager] 40 [ Front-End Web Developer, Back-End Web Deve... 41 [ Front-End Web Developer] 42 [Full-Stack Web Developer] 43 [Back-End Web Developer, Front-End Web Deve... 52 [ Data Scientist, Game Developer, Full-Stac... ... 18080 [ Mobile Developer, Front-End Web Developer] 18081 [Full-Stack Web Developer, Back-End Web Devel... 18088 [Full-Stack Web Developer] 18089 [ Quality Assurance Engineer] 18090 [Game Developer, Data Scientist, Full-Stac... 18093 [ Front-End Web Developer, Mobile Developer] 18097 [Game Developer, Mobile Developer, Full-St... 18098 [ Front-End Web Developer, Full-Stack Web De... 18099 [Full-Stack Web Developer] 18107 [Full-Stack Web Developer] 18111 [ Mobile Developer, Game Developer] 18112 [ Mobile Developer, Game Developer, Full-St... 18113 [ Mobile Developer, Game Developer] 18118 [ DevOps / SysAdmin, Full-Stack Web Develope... 18125 [ Front-End Web Developer, Full-Stack Web De... 
18129 [ Mobile Developer] 18130 [ Front-End Web Developer, User Experience... 18131 [Game Developer, Front-End Web Developer, ... 18151 [ Front-End Web Developer] 18153 [Information Security, Full-Stack Web Developer] 18154 [Full-Stack Web Developer] 18155 [Full-Stack Web Developer, Front-End Web De... 18156 [Full-Stack Web Developer] 18157 [Back-End Web Developer, Data Engineer, Mo... 18160 [ User Experience Designer] 18161 [Full-Stack Web Developer] 18162 [ Data Scientist, Game Developer, Quality... 18163 [Back-End Web Developer, Data Engineer, Da... 18171 [ DevOps / SysAdmin, Mobile Developer, ... 18174 [Back-End Web Developer, Data Engineer, Da... Name: JobRoleInterest, Length: 6992, dtype: object
## We now count the number of job roles selected in each row
n_options = multi_jobs.apply(lambda x: len(x))  ## x is the list of roles in a row
## Generate frequency table of number of options
n_freq = n_options.value_counts(normalize=True).sort_index()*100
n_freq
1     31.650458
2     10.883867
3     15.889588
4     15.217391
5     12.042334
6      6.721968
7      3.861556
8      1.759153
9      0.986842
10     0.471968
11     0.185927
12     0.300343
13     0.028604
Name: JobRoleInterest, dtype: float64
This shows that almost 32% of respondents chose only one job role. The remaining 68%, however, selected multiple roles, most commonly between two and five.
This indicates that many potential customers are interested in multi-disciplinary learning, and it would be in the platform's best interests to offer learning opportunities on a variety of subjects to attract the broadest customer segment.
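As a quick sanity check on that two-to-five range, we can sum the corresponding slice of the frequency table (values reproduced from the output above):

```python
import pandas as pd

## Relative frequencies (%) from the n_freq table above,
## indexed by the number of job roles selected
n_freq = pd.Series(
    [31.650458, 10.883867, 15.889588, 15.217391, 12.042334, 6.721968,
     3.861556, 1.759153, 0.986842, 0.471968, 0.185927, 0.300343, 0.028604],
    index=range(1, 14))

## Share of respondents who selected between 2 and 5 roles
## (.loc label slicing is inclusive on both ends)
share_2_to_5 = n_freq.loc[2:5].sum()
print(round(share_2_to_5, 1))  ## about 54.0
```

So a little over half of all respondents picked between two and five roles.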
I will next need to select all rows that contain a mention of a certain role to get a better idea of how much interest there really is, including those who are interested in multiple roles.
The main course offerings are in web and mobile development, but the company is also interested in offering courses in data science, game development or other technical fields that meet our customers' needs.
So let's first see what proportion of the respondents are interested in at least one of those four fields: Web Development, Mobile Development, Data Science or Game Development.
## Select all the rows where at least one of the included job roles is mentioned
markets_4 = jobrole_no_nulls.str.contains('Web Developer|Mobile Developer|Data Scientist|Game Developer'
) ## returns a boolean array of the column
markets_4_freq = markets_4.value_counts(normalize=True)*100
markets_4_freq
True     93.320938
False     6.679062
Name: JobRoleInterest, dtype: float64
About 93% of survey respondents showed an interest in learning in at least one of those fields. This indicates the company is on the right track in terms of its planned course offerings. Next, let's see how many respondents are interested in our main 2 areas of specialization: Web and Mobile Development.
## Identify rows that include either Web or Mobile Developer as a selected job role
web_mobile = jobrole_no_nulls.str.contains('Web Developer|Mobile Developer') ## returns a boolean array
## Generate a relative frequency table
web_mobile_freq = web_mobile.value_counts(normalize=True)*100
web_mobile_freq
True     86.241419
False    13.758581
Name: JobRoleInterest, dtype: float64
Or visualized below:
## Display the plots within the cell
%matplotlib inline
##Import pyplot module from matplotlib library; this will remain imported
import matplotlib.pyplot as plt
## Set style; this will hold true for future plots generated
plt.style.use('ggplot')
## Generate bar plot of the relative frequency table for web or mobile development
web_mobile_freq.plot.bar()
## Set the title; y number pads the title upward
plt.title('Interest in Web or Mobile Development', y = 1.1)
## Set x tick labels and rotate horizontally
plt.xticks([0,1],['Web or Mobile \nDevelopment', 'Other'], rotation=0)
## Set y label and size
plt.ylabel('Percentage', fontsize=12)
## Set x label and size
plt.xlabel('Job Roles', fontsize=12)
## Set limits of y axis; since we normalized the frequencies to percentage,
## upper limit should be 100%
plt.ylim([0,100])
plt.show()
The chart shows that about 86% of respondents listed some interest in Web or Mobile Development specifically. It's safe to say it would be good for the company to continue specializing in these areas, while providing some offerings in Data Science or Game Development to supplement.
Let's continue this line of investigation to get even more nuanced data regarding the job roles respondents are most interested in.
Let's see both what percentage mentions web development at all, and which role within web development (Full Stack, Front End, Back End) is most in demand.
## Generate boolean array for row containing Web Developer as a job selection
## Then generate a relative frequency table (all within same code line)
web = jobrole_no_nulls.str.contains('Web Developer').value_counts(normalize=True)*100
web
True     82.608696
False    17.391304
Name: JobRoleInterest, dtype: float64
Or visualized below:
%matplotlib inline
## ggplot style is kept from the plt.style.use call executed above
web.plot.bar()
## Set same graph style parameters
plt.title('Interest in Web Development', y = 1.1)
plt.xticks([0,1],['Web Development', 'Other'], rotation=0)
plt.ylabel('Percentage', fontsize=12)
plt.xlabel('Job Roles', fontsize=12)
plt.ylim([0,100])
plt.show()
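The question above also asked which web-development sub-role (Full Stack, Front End, Back End) is most in demand, but the cell so far only measures web development as a whole. A sketch of how that breakdown could be computed with the same str.contains approach, shown on a small illustrative Series (in the notebook it would run on jobrole_no_nulls rather than this toy data):

```python
import pandas as pd

## Toy stand-in for jobrole_no_nulls (the real Series has 6992 rows)
toy = pd.Series([
    'Full-Stack Web Developer',
    'Front-End Web Developer, Back-End Web Developer',
    'Front-End Web Developer, Full-Stack Web Developer',
    'Full-Stack Web Developer, Mobile Developer',
])

## Percentage of respondents mentioning each web-development sub-role
pcts = {}
for role in ['Full-Stack Web Developer', 'Front-End Web Developer',
             'Back-End Web Developer']:
    pcts[role] = toy.str.contains(role, regex=False).mean() * 100
    print(role, pcts[role])
```

Since respondents can select several sub-roles, these percentages can overlap and need not sum to 100.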
And interest in Mobile Development?
## Select rows with Mobile Developer selected as a job interest;
## then generate relative frequency table
mobile = jobrole_no_nulls.str.contains('Mobile Developer').value_counts(normalize=True)*100
mobile
False    67.048055
True     32.951945
Name: JobRoleInterest, dtype: float64
Above, we see that only about 33% of respondents indicated some interest in Mobile Development, while almost 83% indicated some interest in Web Development.
Recall from the first frequency table above that, among responses with only one job interest selected, data scientist actually took third place, behind full-stack and front-end web development.
Let's see at what frequency it occurs across all responses, including those with multiple selections.
## Select rows with Data Scientist selected as a job interest;
## then generate relative frequency table
data_sci = jobrole_no_nulls.str.contains('Data Scientist').value_counts(normalize=True)*100
data_sci
False    76.501716
True     23.498284
Name: JobRoleInterest, dtype: float64
Interest in data science shows up in only about 23% of responses, even lower than mobile development.
From this initial data on Job Roles, it's confirmed that web and mobile development continue to be the best fields to focus our e-learning courses on, as they account for the majority of interest in the sample. I could run this same analysis for other single interest roles to see how often they are mentioned in the multi-selection answers, but it seems unlikely given the data we've already seen that they would account for a greater share than either mobile or web.
Besides, focusing particularly on web development (which accounts for the largest share of interest in our sample) provides the opportunity to reach learners in a variety of sub-fields (full stack, front-end, back-end) and to create specially curated content for these users.
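That sweep over the remaining roles could be automated in one pass rather than run role by role; a sketch on a toy Series standing in for jobrole_no_nulls:

```python
import pandas as pd

## Toy stand-in for jobrole_no_nulls
toy = pd.Series([
    'Full-Stack Web Developer',
    'DevOps / SysAdmin, Data Scientist',
    'Mobile Developer, Game Developer',
    'Data Scientist, Information Security',
])

roles = ['Data Scientist', 'Game Developer', 'Information Security',
         'DevOps / SysAdmin', 'User Experience Designer']

## Percentage of rows mentioning each role, ranked in descending order
mention_pct = pd.Series(
    {r: toy.str.contains(r, regex=False).mean() * 100 for r in roles}
).sort_values(ascending=False)
print(mention_pct)
```

On the real data, this would quickly confirm whether any other single role rivals web or mobile development in mention rate.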
Now that the target job roles are confirmed, I now want to analyze where new coders in the sample are located.
I can use the columns previously noted, CountryCitizen and CountryLive, to see both which countries new coders are originally from and where they are living now (if it is different than their origin country).
However, for the purpose of advertising, I'm more interested in where new coders are living now.
Since country is the most granular location data we have on coders, I will use 'country' as synonymous with 'market' and find the two best countries/national markets to advertise in.
I will again remove all respondents who didn't answer which job role they were interested in, so the sample remains representative of the population we are interested in. I'll then use it to generate frequency tables for the CountryLive column.
## Drop rows with null JobRoleInterest from the entire survey dataset;
## .copy() gives an independent DataFrame, avoiding SettingWithCopyWarning
## when we add columns later
job_no_nulls = survey_data[survey_data['JobRoleInterest'].notnull()].copy()
## Generate an absolute frequency table for the CountryLive column
job_no_nulls['CountryLive'].value_counts()
United States of America         3125
India                             528
United Kingdom                    315
Canada                            260
Poland                            131
Brazil                            129
Germany                           125
Australia                         112
Russia                            102
Ukraine                            89
Nigeria                            84
Spain                              77
France                             75
Romania                            71
Netherlands (Holland, Europe)      65
...
(122 more countries, many with only a single respondent)
Name: CountryLive, Length: 137, dtype: int64
The table above shows that, in absolute terms, our sample has most respondents residing in the US. Let's look at it in relative proportions.
## Generate relative frequency table
top_countries = job_no_nulls['CountryLive'].value_counts(normalize=True)*100
top_countries
United States of America         45.700497
India                             7.721556
United Kingdom                    4.606610
Canada                            3.802281
Poland                            1.915765
Brazil                            1.886517
Germany                           1.828020
Australia                         1.637906
Russia                            1.491664
Ukraine                           1.301550
Nigeria                           1.228429
Spain                             1.126060
France                            1.096812
Romania                           1.038315
Netherlands (Holland, Europe)     0.950570
...
(122 more countries, each below 1%)
Name: CountryLive, Length: 137, dtype: float64
About 46% of respondents in the sample are US residents. This is leagues ahead of the next highest proportion coming from India at roughly 8% and the UK at roughly 4%.
Based on these results, the company could focus the majority of its advertising budget on the US market since, taking this sample as representative of our target population, it accounts for nearly half of our potential market.
%matplotlib inline
## ggplot2 style remains
## Too many individual country values to visualize well
## So we visualize the top 5 rows here
top_countries[:5].plot.bar()
## Same style parameters
plt.title('Resident Countries of \nSurvey Respondents', y = 1)
## Change the xtick parameters to avoid label overlap but still be
## easily readable; 50 is equivalent to 50 degrees rotation
plt.xticks(rotation=50)
plt.ylabel('Percentage', fontsize=12)
plt.xlabel('Countries', fontsize=12)
plt.ylim([0,100])
plt.show()
Let's also check whether some larger region could function as a similar 'single' market beyond an individual country.
For example, although the EU comprises many different countries, it is characterized by a high degree of cultural exchange, labor migration and educational mobility, with similar services, companies and websites operating throughout the region.
Let's see how big of a market there is in the EU by looking at all respondents from EU countries in the sample. The EU includes:
Austria, Belgium, Bulgaria, Croatia, Cyprus, Czech Republic, Denmark, Estonia, Finland, France, Germany, Greece, Hungary, Ireland, Italy, Latvia, Lithuania, Luxembourg, Malta, Netherlands, Poland, Portugal, Romania, Slovakia, Slovenia, Spain and Sweden
## Create list of EU countries for use in row selection of survey data
eu =['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark',
'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland', 'Italy',
'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands', 'Poland', 'Portugal',
'Romania', 'Slovakia', 'Slovenia', 'Spain', 'Sweden']
## Select rows where the CountryLive column value matches one of the elements
## within the eu list above;
## then generate a relative frequency table
eu_countries = job_no_nulls['CountryLive'].isin(eu).value_counts(normalize=True)*100
eu_countries
False    86.255721
True     13.744279
Name: CountryLive, dtype: float64
%matplotlib inline
eu_countries.plot.bar()
## Same style parameters
plt.title('Percentage of EU Residents \namong Survey Respondents', y = 1)
## Change the xtick label and rotation
plt.xticks([0,1], ['Outside EU', 'Within EU'], rotation=0)
plt.ylabel('Percentage', fontsize=12)
plt.xlabel('Countries', fontsize=12)
plt.ylim([0,100])
plt.show()
Even if all respondents from EU countries in our sample are summed together, they still only account for roughly 14% of overall respondents.
Conclusion #2: The US should be our primary market of focus for advertising, given that it is the country of residence for about 46% of survey respondents. The second recommended market would be the EU overall, accounting for 14% of respondents, if a website or advertising medium can be found that extends over most or all of the EU. Otherwise, India would be the second target market: although it only accounts for roughly 8% of respondents, it is the country of residence with the second-highest frequency in the data.
Since the e-learning platform is entirely in English, we will stick with the US and India as the two recommended countries for advertising: English is widely spoken in both (as the dominant native language in the US and a common official language in India), and they have the two highest absolute frequencies of respondents.
One point to keep in mind:
Free Code Camp offers all of its content in English only, which would likely influence the nationalities using it (based on the average English level in a given country) and therefore who is responding to the survey. If the company is only interested in offering courses in English, then it could stick with this sample. However, if it is interested in hosting courses in a variety of languages or using automatic translation in its courses, it would need to use a different sample representing more linguistic communities to draw conclusions about which markets to advertise in.
We saw before that the dataset contains a MoneyForLearning column, showing how much money (in USD) respondents had spent on learning to code from the moment they started until the time of the survey.
The company sells monthly subscriptions to its e-learning website for $59/month.
So it is interested in analyzing how much each respondent spends monthly on coding, on average.
For this analysis, I will include respondents from the two countries selected above (the US and India) as our two highest-potential markets. I will also include the UK and Canada, which come in third and fourth, respectively, in absolute frequency of respondents and are also predominantly English-speaking countries in which the ads would be understandable and attractive.
I will calculate a new column in the dataset using the MoneyForLearning column and dividing by the MonthsProgramming column to get the approximate amount spent monthly by each respondent.
Since I will be dividing one column by another, I first need to check whether either column has characteristics that would interfere with the calculation, including the respective data types and whether the denominator column (MonthsProgramming) contains any zeros.
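Those checks can be sketched as follows (a toy DataFrame with the same column names stands in for the survey data):

```python
import pandas as pd

## Toy stand-in for the survey data; column names match the real dataset
df = pd.DataFrame({
    'MoneyForLearning': [100.0, 0.0, 250.0],
    'MonthsProgramming': [4.0, 0.0, 10.0],
})

## 1) Both columns should have a numeric dtype for the division to work
print(df.dtypes)

## 2) Count zeros in the denominator column; dividing by them would
##    produce inf (or NaN for 0/0)
n_zeros = (df['MonthsProgramming'] == 0).sum()
print(n_zeros)  ## 1 zero in the toy data
```

In the real data, any zeros found this way are what the replacement in the next cell handles.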
## Replace 0 values with 1 in MonthsProgramming to avoid division by 0
job_no_nulls['MonthsProgramming'] = job_no_nulls['MonthsProgramming'].replace(0, 1)
## New column for the amount of money each student spends each month
job_no_nulls['MoneyPerMonth'] = job_no_nulls['MoneyForLearning'] / job_no_nulls['MonthsProgramming']
job_no_nulls['MoneyPerMonth']
[Output truncated: per-respondent MoneyPerMonth values (e.g. 13.33, 200.00, 0.00, NaN, ...) — Name: MoneyPerMonth, Length: 6992, dtype: float64]
We can see above that the MoneyPerMonth column I created has some null values. We'll now remove those.
## Remove the null values from MoneyPerMonth and assign to new variable
money = job_no_nulls[job_no_nulls['MoneyPerMonth'].notnull()]
money['MoneyPerMonth']
[Output truncated: MoneyPerMonth values with nulls removed — Name: MoneyPerMonth, Length: 6317, dtype: float64]
## Remove null values from the CountryLive column, in preparation for
## grouping by country
money = money[money['CountryLive'].notnull()]
## Group the dataset by country and calculate the mean amount of money
## spent per month by students in each country
mpm_mean = money.groupby('CountryLive').mean()
mpm_mean
[Output truncated: table of per-country means for every numeric column (Age, AttendedBootcamp, ..., MoneyPerMonth) — 131 rows × 106 columns. Only the MoneyPerMonth column is used in the next cell.]
## See the average money spent per month per student in the top 4 countries
mpm_mean['MoneyPerMonth'][['United States of America',
'India', 'United Kingdom',
'Canada']].sort_values(ascending=False)
CountryLive
United States of America    227.997996
India                       135.100982
Canada                      113.510961
United Kingdom               45.534443
Name: MoneyPerMonth, dtype: float64
This gives us some surprising results. I expected the countries with the highest average MoneyPerMonth to roughly track per-capita GDP. By that logic, one would expect respondents in the UK and Canada to have more money available for learning each month than respondents in India.
We will generate plots to understand how the average spending per month is distributed among respondents from each country. This may give us a clue as to what is driving Indian respondents' average spending per month to be higher than expected.
import seaborn as sns
## Create separate data set for the top 4 countries identified above
top_4 = money[money['CountryLive'].str.contains(
'United States of America|India|United Kingdom|Canada')]
## Plot the MoneyPerMonth column for the top 4 countries
sns.set_style('darkgrid')
sns.boxplot(y = 'MoneyPerMonth', x = 'CountryLive',
data = top_4)
plt.title('Money Spent Per Month Per Country\n(Distributions)',
fontsize = 16)
plt.ylabel('Money Per Month (US dollars)')
plt.xlabel('Country')
plt.xticks([0,1,2,3],['US', 'UK', 'India', 'Canada']) # avoids tick labels overlap
plt.show()
We can already see some obvious outliers. Anything above $10,000 per month is unrealistic: even for expensive coding coursework like bootcamps, the total cost usually tops out around $9,000. Both India and the US show values of $10,000 per month or above, so I'll first remove all values at or above $10,000.
top_4_under = top_4[top_4['MoneyPerMonth'] < 10000]
top_4_under['MoneyPerMonth'].describe()
count    3906.000000
mean      140.096420
std       555.236533
min         0.000000
25%         0.000000
50%         1.846591
75%        37.500000
max      9000.000000
Name: MoneyPerMonth, dtype: float64
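Worth noting against the $59/month subscription price: the quartiles above show the median respondent spends far less than that. A related per-country view, the share of respondents already spending at least $59/month, could be computed like this (toy DataFrame with the same column names; in the notebook it would run on top_4_under):

```python
import pandas as pd

## Toy stand-in for top_4_under
df = pd.DataFrame({
    'CountryLive': ['United States of America', 'United States of America',
                    'India', 'India', 'Canada', 'United Kingdom'],
    'MoneyPerMonth': [200.0, 10.0, 75.0, 0.0, 59.0, 20.0],
})

## Percentage of respondents per country spending >= $59/month
share_at_price = (df['MoneyPerMonth'] >= 59).groupby(df['CountryLive']).mean() * 100
print(share_at_price.sort_values(ascending=False))
```

This complements the mean, which outliers can inflate, with a figure tied directly to the product's price point.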
Now I will group the data by country again and recompute the mean MoneyPerMonth to see how removing those outliers changed the results.
top_4_mean = top_4_under.groupby('CountryLive').mean()
top_4_mean['MoneyPerMonth'].sort_values(ascending=False)
CountryLive
United States of America    155.459187
India                       113.748387
Canada                      113.510961
United Kingdom               45.534443
Name: MoneyPerMonth, dtype: float64
We can see that India is still on par with Canada and above the UK, which still doesn't match what we would expect based on each country's per-capita GDP.
import seaborn as sns
## Generate boxplot of distributions for the MoneyPerMonth column
sns.set_style('darkgrid')
sns.boxplot(y = 'MoneyPerMonth', x = 'CountryLive',
data = top_4_under)
plt.title('Money Spent Per Month Per Country\n(Distributions)',
fontsize = 16)
plt.ylabel('Money Per Month (US dollars)')
plt.xlabel('Country')
plt.xticks([0,1,2,3],['US', 'UK', 'India', 'Canada']) # avoids tick labels overlap
plt.ylim(0,10000)
plt.show()
Students who spend thousands of dollars per month on learning are most likely either attending a bootcamp or pursuing some other kind of formal education, like university. The lower monthly cost of e-learning platforms is one of their major selling points over formal, in-person education. I could remove respondents who reported very high monthly spending in one of two ways:
I could manually set another, lower spending limit that we think will effectively filter out those respondents with thousands to spend each month, indicating they are most likely participating in a bootcamp or formal education
I could remove respondents from the data set who indicated on the survey that they were participating or had participated in a bootcamp.
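For completeness, option 1 might look something like the sketch below. The 500 USD cutoff is purely illustrative (an assumption, not a value derived from the data), and the toy DataFrame stands in for top_4_under:

```python
import pandas as pd

def cap_monthly_spend(df, limit=500):
    """Keep only respondents spending below `limit` USD per month.

    The default of 500 USD is an illustrative cutoff meant to screen
    out likely bootcamp/formal-education spending; it would need
    justification before being used in the real analysis."""
    return df[df['MoneyPerMonth'] < limit]

# Toy stand-in for top_4_under
toy = pd.DataFrame({'MoneyPerMonth': [0, 25, 80, 450, 1200, 5000]})
print(cap_monthly_spend(toy)['MoneyPerMonth'].tolist())  # [0, 25, 80, 450]
```

The drawback of this approach is that any fixed cutoff is arbitrary, which is why option 2 (filtering on the survey's own bootcamp question) is explored below.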
Let's explore option 2 by looking at the original dataset to see which columns we could use to filter out bootcamp participants.
top_4_under.describe()
## We can see there is an AttendedBootcamp column
Age | AttendedBootcamp | BootcampFinish | BootcampLoanYesNo | BootcampRecommend | ChildrenNumber | CodeEventConferences | CodeEventDjangoGirls | CodeEventFCC | CodeEventGameJam | ... | YouTubeFCC | YouTubeFunFunFunction | YouTubeGoogleDev | YouTubeLearnCode | YouTubeLevelUpTuts | YouTubeMIT | YouTubeMozillaHacks | YouTubeSimplilearn | YouTubeTheNewBoston | MoneyPerMonth | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 3866.000000 | 3889.000000 | 255.000000 | 257.000000 | 257.000000 | 611.000000 | 217.0 | 25.0 | 376.0 | 46.0 | ... | 1463.0 | 244.0 | 631.0 | 561.0 | 264.0 | 808.0 | 85.0 | 43.0 | 646.0 | 3906.000000 |
mean | 28.223487 | 0.066855 | 0.529412 | 0.334630 | 0.770428 | 1.906710 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 140.096420 |
std | 8.989643 | 0.249803 | 0.500116 | 0.472782 | 0.421378 | 0.974833 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 555.236533 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.000000 |
25% | 22.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.000000 |
50% | 27.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 2.000000 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.846591 |
75% | 33.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 2.000000 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 37.500000 |
max | 71.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 7.000000 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 9000.000000 |
8 rows × 106 columns
top_4_under['AttendedBootcamp'].value_counts()
## This looks like a binary response column; 0=no and 1=yes
0.0 3629 1.0 260 Name: AttendedBootcamp, dtype: int64
This indicates that 260 respondents said they did participate in a bootcamp. Let's look at how much money these respondents are spending per month to see whether their answers could be skewing our data.
bootcamps = top_4_under[top_4_under['AttendedBootcamp'] == 1.0]
bootcamps['MoneyPerMonth'].describe()
count 260.000000 mean 1031.457503 std 1459.192236 min 0.000000 25% 82.500000 50% 500.000000 75% 1297.619048 max 9000.000000 Name: MoneyPerMonth, dtype: float64
top_4_under['MoneyPerMonth'].describe()
count 3906.000000 mean 140.096420 std 555.236533 min 0.000000 25% 0.000000 50% 1.846591 75% 37.500000 max 9000.000000 Name: MoneyPerMonth, dtype: float64
Comparing the descriptive statistics for MoneyPerMonth between bootcamp attendees and all respondents, it seems likely the bootcamp attendees are skewing the MoneyPerMonth results: their mean is over 1,000 USD per month, while the mean across all respondents is only about 140 USD.
I'll now remove the respondents who attended bootcamps and see how it changes the results for MoneyPerMonth.
## Select only respondents who did not attend a bootcamp
top4_no_bc = top_4_under[top_4_under['AttendedBootcamp'] == 0.0]
## Generate new boxplots
sns.set_style('darkgrid')
sns.boxplot(y = 'MoneyPerMonth', x = 'CountryLive',
data = top4_no_bc)
plt.title('Money Spent Per Month Per Country\n(Distributions)',
fontsize = 16)
plt.ylabel('Money Per Month (US dollars)')
plt.xlabel('Country')
plt.xticks([0,1,2,3],['US', 'UK', 'India', 'Canada']) # avoids tick labels overlap
plt.ylim(0,10000)
plt.show()
Although the variation in our data is gradually lessening and looking more realistic, the data for India still seems unusual given its GDP per capita relative to the other countries. We can also see in the box plot that the highest data points on the upper whisker are widely dispersed from one another compared to the majority of points near the box itself; they do not match the rest of the data well.
We'll take a look at the India data points at or above 2,000 USD and then remove these outliers.
## Examine India data points at 2000 or above
india_outliers = top4_no_bc[(top4_no_bc['CountryLive'] == 'India') & (top4_no_bc['MoneyPerMonth'] >= 2000)]
india_outliers
Age | AttendedBootcamp | BootcampFinish | BootcampLoanYesNo | BootcampName | BootcampRecommend | ChildrenNumber | CityPopulation | CodeEventConferences | CodeEventDjangoGirls | ... | YouTubeFunFunFunction | YouTubeGoogleDev | YouTubeLearnCode | YouTubeLevelUpTuts | YouTubeMIT | YouTubeMozillaHacks | YouTubeOther | YouTubeSimplilearn | YouTubeTheNewBoston | MoneyPerMonth | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1728 | 24.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5000.000000 |
1755 | 20.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | ... | NaN | NaN | 1.0 | NaN | 1.0 | NaN | NaN | NaN | NaN | 3333.333333 |
7989 | 28.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | 1.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5000.000000 |
8126 | 22.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | ... | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | 1.0 | 5000.000000 |
9410 | 38.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | 1.0 | NaN | ... | 1.0 | 1.0 | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN | 2000.000000 |
12451 | 24.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | between 100,000 and 1 million | NaN | NaN | ... | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 2000.000000 |
15587 | 27.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | more than 1 million | NaN | NaN | ... | NaN | 1.0 | 1.0 | NaN | 1.0 | NaN | NaN | NaN | NaN | 4166.666667 |
7 rows × 137 columns
## Remove the outliers for India
top4_no_outliers = top4_no_bc.drop(india_outliers.index) #Passes the row labels of Indian outliers
# to the drop function
## Generate boxplots
sns.set_style('darkgrid')
sns.boxplot(y = 'MoneyPerMonth', x = 'CountryLive',
data = top4_no_outliers)
plt.title('Money Spent Per Month Per Country\n(Distributions)',
fontsize = 16)
plt.ylabel('Money Per Month (US dollars)')
plt.xlabel('Country')
plt.xticks([0,1,2,3],['US', 'UK', 'India', 'Canada']) # avoids tick labels overlap
plt.ylim(0,10000)
plt.show()
The spread of these boxplots looks more coherent now. Let's regroup the data by country to see what the mean MoneyPerMonth spent is now.
top4_new_mean = top4_no_outliers.groupby('CountryLive').mean()
top4_new_mean['MoneyPerMonth'].sort_values(ascending=False)
CountryLive United States of America 76.350634 Canada 64.127841 India 53.923528 United Kingdom 34.468329 Name: MoneyPerMonth, dtype: float64
Compared to the first grouping by country, the US and Canada now show the highest average spending per month by students, with India in 3rd place and the UK in 4th. Previously, before removing outliers, India was equal to Canada in monthly spending. While it is still a bit surprising that Indian respondents report more available monthly spending on learning than UK respondents, we have already removed many outliers, so we can reasonably take this trend at face value. We also don't want to let our assumptions about how the world works override what the data actually shows.
We can now safely say that the top market we want to advertise in is still the US. We saw before that the majority of respondents in the survey were from the US, and we now see above that US respondents have the highest available monthly spending, on average.
Which should be our second national target market? Canada has the second-highest average monthly spending, at roughly 64 USD. Given that the monthly price of our e-learning platform is 59 USD, we would likely do well advertising in Canada, since our price is below the average willingness to spend of beginner coders there. However, let's look again at how many respondents come from Canada versus the other countries.
top4_no_outliers['CountryLive'].value_counts(normalize=True)*100
United States of America 73.826615 India 12.313639 United Kingdom 7.509663 Canada 6.350083 Name: CountryLive, dtype: float64
Although Canada has the second-highest average monthly spending, it has the lowest percentage of respondents among our top 4 countries. India, on the other hand, still shows an average willingness to spend (roughly 54 USD) relatively close to our monthly price and has a much higher percentage of respondents (~12%) than Canada or the UK.
For this reason, we stand to gain more potential subscribers by choosing India as our second advertising market than by choosing Canada.
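To make this trade-off explicit, the two signals can be put side by side. The figures are copied (rounded) from the outputs above; the composite "score" of respondent share times mean spend is only an illustrative way of combining them, not a validated metric:

```python
import pandas as pd

# Respondent share (%) and mean monthly spend (USD), rounded from the outputs above
markets = pd.DataFrame({
    'share_pct': [73.83, 12.31, 7.51, 6.35],
    'mean_spend': [76.35, 53.92, 34.47, 64.13],
}, index=['US', 'India', 'UK', 'Canada'])

# Illustrative composite: respondent share weighted by mean spend
markets['score'] = markets['share_pct'] * markets['mean_spend']
print(markets.sort_values('score', ascending=False))
```

Under this rough weighting, India edges out Canada for second place, matching the reasoning above: its larger pool of potential subscribers outweighs Canada's modestly higher average spend.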
We will advertise our e-learning platform to learners in the fields of web and mobile development in both the US and Indian national markets.
We can see above that 73% of the respondents in our top 4 countries come from the US, while only 12% come from the next best market, India.
I propose using these percentages as a simple rule for allocating the advertising budget: 70% of the budget to the US market (in line with the percentage of respondents from the US) and the remaining 30% to the Indian market.
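A minimal sketch of this proportional allocation rule follows; the 100,000 USD total is a placeholder amount for illustration only, not a real budget figure:

```python
def split_budget(total, weights):
    """Allocate `total` across markets proportionally to `weights`.

    `weights` maps market name -> relative weight (e.g. percentage).
    """
    weight_sum = sum(weights.values())
    return {market: round(total * w / weight_sum, 2)
            for market, w in weights.items()}

# The 70/30 weights mirror the proposal above
print(split_budget(100_000, {'US': 70, 'India': 30}))
# {'US': 70000.0, 'India': 30000.0}
```

Because the function normalizes by the weight sum, the same helper would work with the raw respondent percentages (73.83 and 12.31) instead of the rounded 70/30 split.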
However:
It might be useful to simply *pass the analysis on to the marketing or business development team for further decision making on the budget*, as other important factors to consider when making this decision are:
The costs of advertising in the US vs. the Indian market. How much will we have to pay per ad in each? If advertising in the US market is more expensive, should we allocate even more of our budget there to ensure we can create enough marketing material to reach our audience?
What types of advertisements (e.g. web banners, video, social media) work best in each market? How will relying on different types affect the cost of advertising in each market?
How many competitors do we have in each market? And what is the relative likelihood that we can outcompete them with this budget?
Marketing or business development would likely have the most available data and experience to answer these questions.
Is there any further useful information we could get out of this data to help marketing or business development make their decision? Let's see.
## Look at other information provided in the dataset's columns
list(top4_no_outliers.columns)
['Age', 'AttendedBootcamp', 'BootcampFinish', 'BootcampLoanYesNo', 'BootcampName', 'BootcampRecommend', 'ChildrenNumber', 'CityPopulation', 'CodeEventConferences', 'CodeEventDjangoGirls', 'CodeEventFCC', 'CodeEventGameJam', 'CodeEventGirlDev', 'CodeEventHackathons', 'CodeEventMeetup', 'CodeEventNodeSchool', 'CodeEventNone', 'CodeEventOther', 'CodeEventRailsBridge', 'CodeEventRailsGirls', 'CodeEventStartUpWknd', 'CodeEventWkdBootcamps', 'CodeEventWomenCode', 'CodeEventWorkshops', 'CommuteTime', 'CountryCitizen', 'CountryLive', 'EmploymentField', 'EmploymentFieldOther', 'EmploymentStatus', 'EmploymentStatusOther', 'ExpectedEarning', 'FinanciallySupporting', 'FirstDevJob', 'Gender', 'GenderOther', 'HasChildren', 'HasDebt', 'HasFinancialDependents', 'HasHighSpdInternet', 'HasHomeMortgage', 'HasServedInMilitary', 'HasStudentDebt', 'HomeMortgageOwe', 'HoursLearning', 'ID.x', 'ID.y', 'Income', 'IsEthnicMinority', 'IsReceiveDisabilitiesBenefits', 'IsSoftwareDev', 'IsUnderEmployed', 'JobApplyWhen', 'JobInterestBackEnd', 'JobInterestDataEngr', 'JobInterestDataSci', 'JobInterestDevOps', 'JobInterestFrontEnd', 'JobInterestFullStack', 'JobInterestGameDev', 'JobInterestInfoSec', 'JobInterestMobile', 'JobInterestOther', 'JobInterestProjMngr', 'JobInterestQAEngr', 'JobInterestUX', 'JobPref', 'JobRelocateYesNo', 'JobRoleInterest', 'JobWherePref', 'LanguageAtHome', 'MaritalStatus', 'MoneyForLearning', 'MonthsProgramming', 'NetworkID', 'Part1EndTime', 'Part1StartTime', 'Part2EndTime', 'Part2StartTime', 'PodcastChangeLog', 'PodcastCodeNewbie', 'PodcastCodePen', 'PodcastDevTea', 'PodcastDotNET', 'PodcastGiantRobots', 'PodcastJSAir', 'PodcastJSJabber', 'PodcastNone', 'PodcastOther', 'PodcastProgThrowdown', 'PodcastRubyRogues', 'PodcastSEDaily', 'PodcastSERadio', 'PodcastShopTalk', 'PodcastTalkPython', 'PodcastTheWebAhead', 'ResourceCodecademy', 'ResourceCodeWars', 'ResourceCoursera', 'ResourceCSS', 'ResourceEdX', 'ResourceEgghead', 'ResourceFCC', 'ResourceHackerRank', 'ResourceKA', 
'ResourceLynda', 'ResourceMDN', 'ResourceOdinProj', 'ResourceOther', 'ResourcePluralSight', 'ResourceSkillcrush', 'ResourceSO', 'ResourceTreehouse', 'ResourceUdacity', 'ResourceUdemy', 'ResourceW3S', 'SchoolDegree', 'SchoolMajor', 'StudentDebtOwe', 'YouTubeCodeCourse', 'YouTubeCodingTrain', 'YouTubeCodingTut360', 'YouTubeComputerphile', 'YouTubeDerekBanas', 'YouTubeDevTips', 'YouTubeEngineeredTruth', 'YouTubeFCC', 'YouTubeFunFunFunction', 'YouTubeGoogleDev', 'YouTubeLearnCode', 'YouTubeLevelUpTuts', 'YouTubeMIT', 'YouTubeMozillaHacks', 'YouTubeOther', 'YouTubeSimplilearn', 'YouTubeTheNewBoston', 'MoneyPerMonth']
Some further avenues for analysis could be:
Exploring the average age of respondents in our target markets
Exploring the gender balance of respondents in our target markets
We see some columns referencing whether respondents have attended different coding events, listen to different podcasts or have used different coding resources. By looking at which one of these media sources is most popular and/or which is most popular within each category, I might be able to identify some excellent advertising platforms.
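As a sketch of how that media-popularity check could work: the describe() output above suggests these columns hold 1.0 when a respondent used the source and NaN otherwise, so popularity reduces to counting non-null entries per column. The toy DataFrame below stands in for the real dataset:

```python
import pandas as pd

def rank_media(df, prefix):
    """Rank indicator-style columns (1.0 = used, NaN = not used)
    whose names start with `prefix`, by number of respondents who
    used each source."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    return df[cols].notnull().sum().sort_values(ascending=False)

# Toy stand-in for top4_no_outliers
toy = pd.DataFrame({
    'PodcastCodeNewbie': [1.0, None, 1.0],
    'PodcastChangeLog':  [None, None, 1.0],
    'Age': [25, 31, 40],
})
print(rank_media(toy, 'Podcast'))
```

The same call with prefix `'Resource'`, `'YouTube'`, or `'CodeEvent'` would rank the other media categories listed in the column dump above.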
Since we know there is a large gender gap in many tech fields still today, particularly in web and mobile development, let's look at gender first.
## Use dataset with null values in JobRole column removed
## Remove null values from Gender column
gender = job_no_nulls[job_no_nulls['Gender'].notnull()]
## First look at unique values in the gender column
gender['Gender'].value_counts()
male 5221 female 1572 trans 42 genderqueer 38 agender 18 Name: Gender, dtype: int64
## Look at unique values in the GenderOther column
gender['GenderOther'].value_counts()
Series([], Name: GenderOther, dtype: int64)
## Closer look at GenderOther
gender['GenderOther']
1 NaN 2 NaN 3 NaN 4 NaN 6 NaN 9 NaN 11 NaN 13 NaN 14 NaN 15 NaN 16 NaN 18 NaN 19 NaN 21 NaN 22 NaN 23 NaN 28 NaN 29 NaN 30 NaN 31 NaN 32 NaN 33 NaN 34 NaN 35 NaN 37 NaN 40 NaN 41 NaN 42 NaN 43 NaN 52 NaN ... 18080 NaN 18081 NaN 18088 NaN 18089 NaN 18090 NaN 18093 NaN 18097 NaN 18098 NaN 18099 NaN 18107 NaN 18111 NaN 18112 NaN 18113 NaN 18118 NaN 18125 NaN 18129 NaN 18130 NaN 18131 NaN 18151 NaN 18153 NaN 18154 NaN 18155 NaN 18156 NaN 18157 NaN 18160 NaN 18161 NaN 18162 NaN 18163 NaN 18171 NaN 18174 NaN Name: GenderOther, Length: 6891, dtype: object
The GenderOther column seems to contain nothing but null values, indicating it wasn't used by respondents. Maybe it wasn't relevant to the majority of respondents who already answered the first 'Gender' selection question, or maybe the question was confusing or inaccessible. Regardless, I will stick with just the Gender column for further analysis.
We can see from looking at the unique values for that column that men account for the majority of respondents. Let's look more closely at the relative amounts though.
## Show proportion of each unique value of the total
gender['Gender'].value_counts(normalize=True)*100
male 75.765491 female 22.812364 trans 0.609491 genderqueer 0.551444 agender 0.261210 Name: Gender, dtype: float64
Men account for a whopping 76% of those who responded. Women, trans and other non-binary genders together only account for approximately 25% of respondents. Female respondents alone account for only 23%.
This is a clear gender gap. We're well aware that women are less represented in tech fields, which leads to a variety of other problems, such as technology that doesn't meet the needs of women, harassment at work, a wider gender-based salary gap, and others. Let's say the company here wants to play a role in helping to correct this by encouraging women and underrepresented genders to learn programming and other tech skills on their platform so that they can move into tech careers.
I'll now explore job role interests and whether there are any significant differences in interest between men and women. I'll first isolate the male and female respondents with their job roles in separate data frames, and then look at which job roles have the most interest proportionally. Let's look first at what job roles women are most interested in.
women_jobroles = gender[(gender['Gender'] == 'female') & (gender['JobRoleInterest'].notnull())]
women_jobroles['JobRoleInterest'].value_counts(normalize=True)*100
Full-Stack Web Developer 9.351145 Front-End Web Developer 8.778626 Data Scientist 2.226463 User Experience Designer 1.781170 Mobile Developer 1.463104 Back-End Web Developer 1.399491 Information Security 1.335878 User Experience Designer, Front-End Web Developer 1.335878 Game Developer 1.145038 Front-End Web Developer, User Experience Designer 0.954198 Front-End Web Developer, Full-Stack Web Developer 0.890585 Data Engineer 0.699746 Product Manager 0.636132 Back-End Web Developer, Front-End Web Developer, Full-Stack Web Developer 0.636132 Front-End Web Developer, Back-End Web Developer, Full-Stack Web Developer 0.572519 Full-Stack Web Developer, Front-End Web Developer 0.508906 Back-End Web Developer, Full-Stack Web Developer, Front-End Web Developer 0.508906 Front-End Web Developer, Mobile Developer, Full-Stack Web Developer 0.445293 DevOps / SysAdmin 0.445293 Full-Stack Web Developer, Front-End Web Developer, Back-End Web Developer 0.445293 User Experience Designer, Full-Stack Web Developer, Front-End Web Developer 0.445293 Mobile Developer, Full-Stack Web Developer 0.381679 Back-End Web Developer, Front-End Web Developer 0.381679 Back-End Web Developer, Data Scientist 0.381679 Game Developer, Mobile Developer 0.381679 Back-End Web Developer, Full-Stack Web Developer 0.381679 Mobile Developer, Front-End Web Developer 0.381679 Front-End Web Developer, Full-Stack Web Developer, User Experience Designer 0.318066 Full-Stack Web Developer, Back-End Web Developer 0.318066 Full-Stack Web Developer, User Experience Designer, Front-End Web Developer 0.318066 ... 
Mobile Developer, Game Developer, User Experience Designer, Product Manager, Full-Stack Web Developer, Front-End Web Developer 0.063613 Mobile Developer, User Experience Designer, Full-Stack Web Developer, Back-End Web Developer, Front-End Web Developer 0.063613 Front-End Web Developer, User Experience Designer, Data Scientist, Full-Stack Web Developer, Back-End Web Developer 0.063613 Product Manager, Game Developer, Front-End Web Developer, User Experience Designer 0.063613 User Experience Designer, Front-End Web Developer, Data Engineer, Back-End Web Developer, Game Developer, Mobile Developer, Full-Stack Web Developer 0.063613 Mobile Developer, Information Security, Back-End Web Developer, Front-End Web Developer, Game Developer, User Experience Designer 0.063613 Mobile Developer, Product Manager, Full-Stack Web Developer, Front-End Web Developer, Back-End Web Developer, Game Developer 0.063613 User Experience Designer, Front-End Web Developer, Full-Stack Web Developer, Mobile Developer, Back-End Web Developer 0.063613 User Experience Designer, Game Developer, Data Scientist, Full-Stack Web Developer 0.063613 Data Scientist, Game Developer, Information Security, Mobile Developer, Front-End Web Developer 0.063613 Mobile Developer, User Experience Designer, Data Scientist, Front-End Web Developer, Quality Assurance Engineer, DevOps / SysAdmin, Full-Stack Web Developer, Data Engineer, Information Security, Back-End Web Developer 0.063613 User Experience Designer, Front-End Web Developer, Mobile Developer, Game Developer 0.063613 Front-End Web Developer, Product Manager, Mobile Developer, Data Scientist, Data Engineer, User Experience Designer 0.063613 User Experience Designer, Game Developer, Full-Stack Web Developer, Front-End Web Developer 0.063613 Game Developer, Front-End Web Developer, Product Manager, Mobile Developer, Back-End Web Developer 0.063613 Game Developer, Mobile Developer, Front-End Web Developer 0.063613 Full-Stack Web Developer, Mobile Developer, 
Game Developer, Front-End Web Developer, Back-End Web Developer 0.063613 Mobile Developer, Front-End Web Developer, Full-Stack Web Developer, Data Scientist, Product Manager, Back-End Web Developer 0.063613 User Experience Designer, Mobile Developer, Back-End Web Developer, Front-End Web Developer, DevOps / SysAdmin, Full-Stack Web Developer 0.063613 Front-End Web Developer, Full-Stack Web Developer, User Experience Designer, Quality Assurance Engineer, Back-End Web Developer 0.063613 Quality Assurance Engineer, Front-End Web Developer, Product Manager, User Experience Designer 0.063613 Back-End Web Developer, Mobile Developer, Front-End Web Developer 0.063613 Back-End Web Developer, Mobile Developer, Game Developer, DevOps / SysAdmin, Front-End Web Developer 0.063613 I don't know yet! 0.063613 Product Manager, Back-End Web Developer, Full-Stack Web Developer, Mobile Developer, Front-End Web Developer 0.063613 Front-End Web Developer, Information Security, Game Developer, Data Scientist 0.063613 Data Engineer, DevOps / SysAdmin 0.063613 Full-Stack Web Developer, Mobile Developer, Game Developer, Back-End Web Developer, Front-End Web Developer 0.063613 Mobile Developer, Product Manager, Information Security, User Experience Designer 0.063613 Information Security, Full-Stack Web Developer, Front-End Web Developer, User Experience Designer, Back-End Web Developer, Mobile Developer 0.063613 Name: JobRoleInterest, Length: 874, dtype: float64
We can see here that, similar to the full dataset containing all genders, front-end and full-stack web development hold the top 2 positions, and data science is still within the top 4. However, we can also see that User Experience Designer is in the top 4. Much earlier in this analysis, when we looked at the full dataset containing respondents of all genders, we barely saw rows where User Experience Designer was selected.
Next, I'll isolate male respondents to see their specific job role interests.
## Create dataset isolating respondents who identify as male
men_jobroles = gender[(gender['Gender'] == 'male') & (gender['JobRoleInterest'].notnull())]
## Generate frequency table
men_jobroles['JobRoleInterest'].value_counts(normalize=True)*100
Full-Stack Web Developer 12.679563 Front-End Web Developer 5.669412 Back-End Web Developer 2.240950 Data Scientist 2.106876 Game Developer 1.781268 Mobile Developer 1.685501 Information Security 1.340739 Full-Stack Web Developer, Front-End Web Developer 1.072591 Product Manager 0.823597 Data Engineer 0.804444 Front-End Web Developer, Full-Stack Web Developer 0.785290 Front-End Web Developer, Back-End Web Developer, Full-Stack Web Developer 0.536296 Front-End Web Developer, Full-Stack Web Developer, Back-End Web Developer 0.517142 DevOps / SysAdmin 0.517142 Back-End Web Developer, Full-Stack Web Developer, Front-End Web Developer 0.517142 Back-End Web Developer, Front-End Web Developer, Full-Stack Web Developer 0.497989 Full-Stack Web Developer, Front-End Web Developer, Back-End Web Developer 0.459682 Full-Stack Web Developer, Mobile Developer 0.440529 User Experience Designer 0.440529 User Experience Designer, Front-End Web Developer 0.402222 Back-End Web Developer, Full-Stack Web Developer 0.383068 Full-Stack Web Developer, Back-End Web Developer 0.383068 Full-Stack Web Developer, Back-End Web Developer, Front-End Web Developer 0.325608 Mobile Developer, Game Developer 0.268148 Data Scientist, Data Engineer 0.268148 Data Engineer, Data Scientist 0.268148 Full-Stack Web Developer, Game Developer 0.268148 Front-End Web Developer, User Experience Designer 0.268148 Back-End Web Developer, Front-End Web Developer 0.268148 Data Scientist, Full-Stack Web Developer 0.248994 ... 
Mobile Developer, Back-End Web Developer, Front-End Web Developer, User Experience Designer, Full-Stack Web Developer, Information Security 0.019153 Full-Stack Web Developer, Game Developer, Back-End Web Developer, User Experience Designer 0.019153 Front-End Web Developer, DevOps / SysAdmin, Back-End Web Developer, Full-Stack Web Developer, Data Scientist, Mobile Developer, Data Engineer 0.019153 Data Engineer, Data Scientist, Product Manager, Back-End Web Developer, Quality Assurance Engineer 0.019153 Back-End Web Developer, Game Developer, Front-End Web Developer, Full-Stack Web Developer, Mobile Developer, Quality Assurance Engineer 0.019153 Data Scientist, Full-Stack Web Developer, DevOps / SysAdmin, Data Engineer 0.019153 Information Security, Back-End Web Developer, Mobile Developer, Full-Stack Web Developer, Data Scientist 0.019153 Front-End Web Developer, Full-Stack Web Developer, Data Engineer, Game Developer, Data Scientist, Back-End Web Developer 0.019153 Mobile Developer, Back-End Web Developer, Front-End Web Developer, Full-Stack Web Developer, Product Manager 0.019153 Quality Assurance Engineer, Back-End Web Developer, Data Engineer, Game Developer 0.019153 Game Developer, Mobile Developer, Full-Stack Web Developer, Front-End Web Developer, DevOps / SysAdmin 0.019153 Back-End Web Developer, Front-End Web Developer, Full-Stack Web Developer, DevOps / SysAdmin, Information Security 0.019153 Information Security, Data Engineer, Data Scientist, Front-End Web Developer, Back-End Web Developer, Game Developer, DevOps / SysAdmin, Full-Stack Web Developer 0.019153 Back-End Web Developer, Game Developer, Full-Stack Web Developer, Data Engineer 0.019153 Full-Stack Web Developer, Game Developer, Back-End Web Developer, Front-End Web Developer, DevOps / SysAdmin 0.019153 DevOps / SysAdmin, Front-End Web Developer, Back-End Web Developer, Information Security, Full-Stack Web Developer 0.019153 Information Security, Product Manager, Data Scientist 0.019153 Data 
Scientist, Back-End Web Developer, Product Manager, DevOps / SysAdmin, Mobile Developer 0.019153 Product Manager, Full-Stack Web Developer, Back-End Web Developer, Mobile Developer 0.019153 Back-End Web Developer, Information Security, DevOps / SysAdmin, Full-Stack Web Developer 0.019153 Game Developer, Data Engineer, Back-End Web Developer, Information Security 0.019153 DevOps / SysAdmin, Back-End Web Developer, Front-End Web Developer, Full-Stack Web Developer, Information Security 0.019153 Front-End Web Developer, Mobile Developer, User Experience Designer, Game Developer 0.019153 Data Scientist, Data Engineer, Back-End Web Developer, Front-End Web Developer, Full-Stack Web Developer 0.019153 Full-Stack Web Developer, Data Scientist, Mobile Developer, Game Developer, Information Security, Back-End Web Developer, Front-End Web Developer, Data Engineer 0.019153 Front-End Web Developer, User Experience Designer, Mobile Developer, Information Security, Back-End Web Developer, Full-Stack Web Developer 0.019153 Front-End Web Developer, Data Engineer, Full-Stack Web Developer, Data Scientist, Back-End Web Developer 0.019153 Data Scientist, Data Engineer, DevOps / SysAdmin, Full-Stack Web Developer, Back-End Web Developer 0.019153 Data Engineer, Front-End Web Developer, Full-Stack Web Developer, Back-End Web Developer, Information Security, Mobile Developer, Data Scientist 0.019153 DevOps / SysAdmin, Front-End Web Developer, Full-Stack Web Developer, Game Developer, Mobile Developer, Back-End Web Developer, Data Engineer, User Experience Designer 0.019153 Name: JobRoleInterest, Length: 2514, dtype: float64
The frequency table for male respondents shows that web development, specifically full-stack and front-end development, holds the top choices among single-selection answers, just as for female respondents.
One thing that stands out when comparing the frequency tables of job role interests between male and female respondents is the relative interest in the Back-End Web Developer role, which shows the reverse pattern from UX Design. In the frequency table for male respondents, Back-End Web Developer is the number 3 job role among single selections in the JobRoleInterest column, whereas for female respondents it takes only 6th place among single selections.
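A wrinkle in the frequency tables above is that multi-select answers are counted separately for every ordering, e.g. "Data Scientist, Data Engineer" and "Data Engineer, Data Scientist" appear as distinct rows. One possible normalization (a sketch, not part of the original analysis) sorts each answer's selections before counting:

```python
import pandas as pd

def normalize_multiselect(s):
    """Sort the comma-separated selections in each answer so that
    answer order doesn't split what is logically the same combination."""
    return s.dropna().apply(
        lambda v: ', '.join(sorted(part.strip() for part in v.split(',')))
    )

answers = pd.Series(['Data Scientist, Data Engineer',
                     'Data Engineer, Data Scientist',
                     'Full-Stack Web Developer'])
print(normalize_multiselect(answers).value_counts())
```

With this normalization, the two order variants collapse into a single row with a count of 2, giving a truer picture of how popular each combination is.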
Next, I'll get a fuller picture of the interest in different job roles by selecting all rows, including the multi-select answers, containing the names of certain roles. I'll then graphically compare the popularity of different roles between male and female respondents.
First, I'll check whether web development really does have relatively equal interest among men and women in our sample.
## Isolate rows indicating interest in web development for women
women_web = women_jobroles['JobRoleInterest'].str.contains('Web Developer')
## Generate frequency table for relative interest
women_web_freq = women_web.value_counts(normalize=True)*100
women_web_freq
True 81.615776 False 18.384224 Name: JobRoleInterest, dtype: float64
## Bar plot in matplotlib, still with ggplot style
women_web_freq.plot.bar()
## Set plot parameters
plt.title('Percentage of Women Interested in \nWeb Development')
plt.xticks([0,1],['Web Dev','Other'], rotation=0)
plt.xlabel('Job Role', fontsize=12)
plt.ylabel('Percentage', fontsize=12)
plt.ylim([0,100])
plt.show()
## How many men are interested in web development compared to women?
men_web = men_jobroles['JobRoleInterest'].str.contains('Web Developer')
men_web_freq = men_web.value_counts(normalize=True)*100
men_web_freq
#Roughly 82% of women showed some interest in web development
#For men, it's roughly 83%
True 82.991764 False 17.008236 Name: JobRoleInterest, dtype: float64
## Plot a comparison of the relative percentages of interest in web dev
## Create new data set from relative frequencies for web dev interest
web_data = {'Men': men_web_freq,
'Women': women_web_freq}
web_compare = pd.concat(web_data, axis=1)
web_compare
Men | Women | |
---|---|---|
True | 82.991764 | 81.615776 |
False | 17.008236 | 18.384224 |
## Create grouped bar chart
web_compare.plot.bar()
plt.title('Comparative Gender Differences in \nWeb Development Interest')
plt.xlabel('Interested- True or False?')
plt.ylabel('Percentage')
plt.ylim([0,100])
plt.legend()
plt.show()
The bar graph clearly shows that interest in working in web development is almost equal between men and women.
Next, I'll look at relative interest between genders in UX design.
## Isolate responses containing 'User Experience Designer' in the job role interest column
women_ux = women_jobroles['JobRoleInterest'].str.contains('User Experience Designer')
## Generate frequency table
women_ux_freq = women_ux.value_counts(normalize=True)*100
women_ux_freq
False 68.956743 True 31.043257 Name: JobRoleInterest, dtype: float64
## Generate bar plot
women_ux_freq.plot.bar()
plt.title('Women Interested in \nUX Design')
plt.xticks([0,1],['Other','UX Design'], rotation=0)
plt.xlabel('Job Role', fontsize=12)
plt.ylabel('Percentage', fontsize=12)
plt.ylim([0,100])
plt.show()
It looks like about 31% of women who responded indicated some interest in User Experience Design. While not nearly as high as interest in Web Development, it stands out because in the full dataset containing all genders, UX design did not appear anywhere near the top results for relative interest. This is why we did not explore this job role before.
Logically, this must mean that most men showed little interest in UX design, skewing the results. Let's look at how many men are interested in UX design.
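This intuition can be checked with simple weighted-average arithmetic: the overall interest rate is a weighted mean of the per-gender rates, weighted by group size. The group sizes below are hypothetical; only the rates come from the analysis:

```python
## Hypothetical respondent counts; male respondents outnumber female
men_n, women_n = 5000, 1400
## UX interest rates reported in this analysis (~17.9% men, ~31% women)
men_rate, women_rate = 0.179, 0.310

## Overall rate is pulled toward the larger group's rate
overall = (men_n * men_rate + women_n * women_rate) / (men_n + women_n)
print(round(overall * 100, 1))  # sits much closer to the men's 17.9%
```

Because men dominate the sample, the combined percentage lands near the men's rate, which is why UX design never surfaced in the all-genders analysis.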
## How many men are interested in user experience design compared to women?
men_ux = men_jobroles['JobRoleInterest'].str.contains('User Experience Designer')
men_ux_freq = men_ux.value_counts(normalize=True)*100
men_ux_freq
## Only about 18% of men showed an interest in UX design, compared to 31% of women
False    82.149014
True     17.850986
Name: JobRoleInterest, dtype: float64
## Plotting differences in interest in UX design between men & women
## Create new data set from UX relative frequencies
ux_data = {'Men': men_ux_freq,
'Women': women_ux_freq}
ux_compare = pd.concat(ux_data, axis=1)
ux_compare
## Create grouped bar chart
ux_compare.plot.bar()
plt.title('Comparative Gender Differences in \nUX Design Interest')
plt.xlabel('Interested- True or False?')
plt.ylabel('Percentage')
plt.ylim([0,100])
plt.legend()
plt.show()
The bar graph above shows that interest in UX design is not especially high for either gender, with neither group reaching 50%. However, significantly more women than men expressed interest in UX design.
Let's see if there are differences in men and women interested in back-end development.
## Isolate women who listed Back-End Web Development in JobRoleInterest column
women_backend = women_jobroles['JobRoleInterest'].str.contains('Back-End Web Developer')
## Generate frequency table
women_backend_freq = women_backend.value_counts(normalize=True)*100
women_backend_freq
## Roughly 34% of female respondents showed interest in back-end development
False    66.284987
True     33.715013
Name: JobRoleInterest, dtype: float64
## Isolate men who listed Back-End Web Development in JobRoleInterest column
men_backend = men_jobroles['JobRoleInterest'].str.contains('Back-End Web Developer')
## Generate frequency table
men_backend_freq = men_backend.value_counts(normalize=True)*100
men_backend_freq
## Roughly 41% of male respondents showed interest in back-end development
False    58.762689
True     41.237311
Name: JobRoleInterest, dtype: float64
## Plotting differences in interest in back-end development between men & women
## Create new data set from back-end relative frequencies
backend_data = {'Men': men_backend_freq,
'Women': women_backend_freq}
backend_compare = pd.concat(backend_data, axis=1)
## Create grouped bar chart
backend_compare.plot.bar()
plt.title('Comparative Gender Differences in \nBack-End Development Interest')
plt.xlabel('Interested- True or False?')
plt.ylabel('Percentage')
plt.ylim([0,100])
plt.legend()
plt.show()
Interest in back-end development shows the reverse pattern from UX Design. While not a top choice for either gender, significantly more men show interest in back-end development than women.
We can see a few obvious differences by comparing the frequency tables for job role interest between men and women.
Although web development is the top field of interest for both men and women, at roughly 83% and 82% respectively, far more women than men are interested in UX design: 31% of women showed an interest compared to only about 18% of men.
Conversely, back-end web development draws more men than women, with 41% of men showing interest compared to only 34% of women.
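The three role comparisons can be collected into a single summary table. The percentages below are the (rounded) values from the frequency tables above:

```python
import pandas as pd

## Interest percentages reported above, rounded to one decimal place
summary = pd.DataFrame(
    {'Men':   [83.0, 17.9, 41.2],
     'Women': [81.6, 31.0, 33.7]},
    index=['Web Development', 'UX Design', 'Back-End Development'])

## Positive gap = more women interested; negative = more men
summary['Gap (Women - Men)'] = summary['Women'] - summary['Men']
print(summary)
```

Laying the gaps side by side makes the mirror-image pattern obvious: web development is near parity, while UX design and back-end development skew in opposite directions.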
This indicates that if the company wants to encourage more women to use its platform, one potential way of doing so would be to offer some courses in user experience design, or courses that combine elements of UX design and web development and/or data science.
While it's not necessarily important for our immediate business decisions, this also suggests that the tech industry as a whole might have a gender gap, particularly in back-end web development. It would be interesting to investigate why fewer women are interested in pursuing back-end development and why so few men are interested in pursuing UX design. This could also have implications for average earnings of men and women in tech, as the average salary for back-end developers is often significantly higher than for UX designers.
While this is an interesting point to consider, it is outside the scope of the current goal of the analysis.
Now that we know what proportion of our sample identifies as women and which fields women are most interested in, it would also be interesting to check whether the women in our sample are mostly located in the same top 4 countries we identified earlier. I also want to check where most men are located, and how that skews the country analysis we ran earlier on the full data set. Let's have a look.
## Remove null values in the CountryLive column from gender dataset
gender_countries = gender[gender['CountryLive'].notnull()]
gender_countries
## Isolate female respondents
women_countries = gender_countries[gender_countries['Gender'] == 'female']
## Generate frequency table
women_countries_freq = women_countries['CountryLive'].value_counts(normalize=True)*100
women_countries_freq
United States of America    56.957929
United Kingdom               4.983819
Canada                       4.142395
India                        3.689320
Australia                    1.941748
Poland                       1.682848
Germany                      1.553398
Brazil                       1.488673
...
Name: CountryLive, Length: 83, dtype: float64
## Isolate male respondents
men_countries = gender_countries[gender_countries['Gender'] == 'male']
## Generate frequency table
men_countries_freq = men_countries['CountryLive'].value_counts(normalize=True)*100
men_countries_freq
United States of America    41.956142
India                        9.101494
United Kingdom               4.482826
Canada                       3.725985
Poland                       2.037648
Brazil                       1.998836
Germany                      1.843586
Russia                       1.571900
...
Name: CountryLive, Length: 133, dtype: float64
## Compare top 5 countries for men and women side by side
## Top 5 countries for women
women_top5 = women_countries_freq.head()
## Top 5 countries for men
men_top5 = men_countries_freq.head()
## Create dictionary with data
top5_data = {'Women': women_top5, 'Men': men_top5}
## Create new frame from both series
top5_mw = pd.concat(top5_data, axis=1)
top5_mw
 | Men | Women |
---|---|---|
Australia | NaN | 1.941748 |
Canada | 3.725985 | 4.142395 |
India | 9.101494 | 3.689320 |
Poland | 2.037648 | NaN |
United Kingdom | 4.482826 | 4.983819 |
United States of America | 41.956142 | 56.957929 |
## Plotting grouped bar chart
top5_mw.plot.bar()
plt.title('Comparative Gender Differences by \nResident Country')
plt.xlabel('Countries')
plt.ylabel('Percentage Residing')
plt.ylim([0,100])
plt.legend()
plt.show()
We can tell a few things from looking at this gender breakdown by country.
First, the same 4 countries appear in the top 4 for both men and women in our sample, though in a slightly different order. The order for men (US, India, UK, Canada) matches the order, from most interested learners to fewest, that we found when analysing the overall data set across all genders.
However, when we isolate countries for women, India falls to a lower rank, with only roughly 4% of all female respondents residing there, compared to 9% of male respondents. Poland is the 5th country with the most male respondents, while Australia is the 5th country with the most female respondents. The top countries for female respondents are, in descending order, the US, UK, Canada and India.
This has some interesting implications if the company wants to help close the gender gap in tech by appealing to its broadest female audience. The US is still the clear winner in the sample for interested female learners, accounting for roughly 57%. But India is no longer the second most important market: the UK, home to roughly 5% of female respondents, has the second highest proportion of female learners, with Canada close behind at about 4%.
This means that if the company decided at some point to run a targeted marketing campaign encouraging women to learn tech and development skills, it would want to keep its primary focus on the US but shift its secondary focus to the UK or Canada.
In the previous analysis on available spending per month for the full data, I created a MoneyPerMonth column by dividing the MoneyForLearning column by the MonthsProgramming column, and then removed null values.
I will use this and filter by gender to:

- see if there is a general difference between men and women in the money available to spend on courses each month. Looking across all countries will show whether there is a gap in ability to pay. Given the well-documented gender pay gap, this seems likely, and it can tell us how much of a disadvantage women who want to learn tech skills may face when paying for courses.
- use the groupby function to see how available spending for women differs by country, and compares to men's, in our 3 countries of interest for potential female coders (US, UK, Canada).
I will again remove respondents who have attended bootcamps, to avoid skewing the data towards high monthly spending amounts and use a sample of spending data that is more realistic for a remote, self-paced learning platform like ours.
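As a reminder, MoneyPerMonth was built by dividing MoneyForLearning by MonthsProgramming and dropping nulls. A minimal sketch of that construction on toy data (replacing 0 months with 1 is my assumption here to avoid division by zero; the original cleaning step may have handled this differently):

```python
import pandas as pd

## Toy data standing in for the survey columns
toy = pd.DataFrame({'MoneyForLearning': [200.0, 0.0, 50.0, None],
                    'MonthsProgramming': [4.0, 0.0, 2.0, 6.0]})

## Assumption: treat 0 months as 1 so brand-new learners don't divide by zero
months = toy['MonthsProgramming'].replace(0, 1)
toy['MoneyPerMonth'] = toy['MoneyForLearning'] / months

## Drop rows where MoneyPerMonth could not be computed
toy = toy[toy['MoneyPerMonth'].notnull()]
print(toy['MoneyPerMonth'].tolist())  # [50.0, 0.0, 25.0]
```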
## Female respondents who did not attend a bootcamp and reported
## nonzero monthly spending
women_mpm = money[(money['Gender'] == 'female') &
                  (money['AttendedBootcamp'] == 0.0) &
                  (money['MoneyPerMonth'] > 0)]
women_mpm['MoneyPerMonth'].describe()
count      597.000000
mean       187.835575
std        879.070620
min          0.066667
25%          8.333333
50%         30.000000
75%        100.000000
max      15000.000000
Name: MoneyPerMonth, dtype: float64
## Male respondents who did not attend a bootcamp and reported
## nonzero monthly spending
men_mpm = money[(money['Gender'] == 'male') &
                (money['AttendedBootcamp'] == 0.0) &
                (money['MoneyPerMonth'] > 0)]
men_mpm['MoneyPerMonth'].describe()
count     2192.000000
mean       259.765508
std       2208.591653
min          0.033333
25%          7.500000
50%         25.000000
75%         84.910000
max      80000.000000
Name: MoneyPerMonth, dtype: float64
We can see from the descriptive stats that we will still need to remove significant outliers. The men's data has a much larger standard deviation than the women's, but both are very large. The maximum value in each group is also unrealistic in the context of monthly spending on learning programming, sitting far beyond the IQRs.
We'll graph both in boxplots to visualize how best to define the outliers to remove.
## Concatenate women and men MPM data to single dataframe for use in pyplot
## Reset index to remove gaps within columns
gender_mpm = pd.concat([women_mpm['MoneyPerMonth'].reset_index(drop=True).rename('Women'),
men_mpm['MoneyPerMonth'].reset_index(drop=True).rename('Men')], axis=1)
gender_mpm
 | Women | Men |
---|---|---|
0 | 35.714286 | 13.333333 |
1 | 100.000000 | 200.000000 |
2 | 285.714286 | 5.555556 |
3 | 100.000000 | 16.666667 |
4 | 166.666667 | 17.857143 |
... | ... | ... |
2189 | NaN | 1000.000000 |
2190 | NaN | 33.333333 |
2191 | NaN | 10000.000000 |
2192 rows × 2 columns
## Generate boxplots
sns.set_style('darkgrid')
sns.boxplot(data = gender_mpm)
plt.title('Money Spent Per Month \nby Gender')
plt.ylabel('Money Per Month (US dollars)')
plt.xlabel('Gender')
plt.xticks([0,1],['Women','Men']) # avoids tick labels overlap
plt.show()
The box plots show outliers for both the Men and Women columns, although the most significant outliers are in the Men column. For uniformity, we will follow the same methodology used previously in the full dataset to remove outliers: removing any data points indicating MPM at or above 10,000 USD.
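The fixed 10,000 USD cutoff keeps this consistent with the earlier analysis. A common alternative worth noting is the 1.5 × IQR rule, which derives the cutoff from the data itself. A minimal sketch on toy data:

```python
import pandas as pd

## Toy monthly-spending values with one extreme outlier
spend = pd.Series([5, 10, 25, 30, 50, 80, 100, 15000], dtype=float)

## Tukey's rule: anything above Q3 + 1.5*IQR is flagged as an outlier
q1, q3 = spend.quantile(0.25), spend.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

## Keep only values at or below the fence
trimmed = spend[spend <= upper_fence]
print(upper_fence, trimmed.max())
```

This is a sketch only; for this project, the fixed threshold has the advantage of matching the methodology already applied to the full dataset.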
## Keep rows where both Women and Men are below 10,000 USD
## (rows where either column is NaN are dropped as well)
gender_mpm_removed = gender_mpm[(gender_mpm['Women'] < 10000) & (gender_mpm['Men'] < 10000)]
gender_mpm_removed
 | Women | Men |
---|---|---|
0 | 35.714286 | 13.333333 |
1 | 100.000000 | 200.000000 |
2 | 285.714286 | 5.555556 |
3 | 100.000000 | 16.666667 |
4 | 166.666667 | 17.857143 |
... | ... | ... |
595 | 182.000000 | 3.333333 |
596 | 28.571429 | 7.000000 |
593 rows × 2 columns
## Generate new boxplots
sns.set_style('darkgrid')
sns.boxplot(data = gender_mpm_removed)
plt.title('Money Spent Per Month \nby Gender without Outliers')
plt.ylabel('Money Per Month (US dollars)')
plt.xlabel('Gender')
plt.xticks([0,1],['Women','Men']) # avoids tick labels overlap
plt.ylim([0,10000])
plt.show()
With outliers removed, we can see that, when considering respondents from all countries, women and men actually reported relatively equal financial resources dedicated to learning.
Now I will compare how men and women differ in the financial resources available for learning by country, using the 3 countries we found to have the most female respondents: the US, the UK, and Canada.
## Isolate the money data set for:
## 1) male & female respondents
## 2) our 3 countries of interest for the gender analysis: US, UK, Canada
## 3) MoneyPerMonth below 10,000 USD (outliers removed)
gender_top3_mw = money[(money['CountryLive'].str.contains(
                            'United States of America|United Kingdom|Canada')) &
                       (money['Gender'].str.contains('female|male')) &
                       (money['MoneyPerMonth'] < 10000)]
## Generate grouped box plots by country and gender
sns.set_style('darkgrid')
sns.boxplot(x='CountryLive', y='MoneyPerMonth', hue='Gender', data=gender_top3_mw)
plt.title('Money Spent Per Month \nby Gender and Country')
plt.ylabel('Money Per Month (US dollars)')
plt.xlabel('Country')
plt.xticks(rotation=0) # avoids tick labels overlap
plt.ylim([0,10000])
plt.show()
## Group gender_top3_mw by country and gender with average MPM
top3_mw_grouped = gender_top3_mw.groupby(['CountryLive','Gender']).mean()
top3_mw_grouped['MoneyPerMonth']
CountryLive               Gender
Canada                    female    152.252715
                          male      102.158928
United Kingdom            female     78.673260
                          male       36.668347
United States of America  female    205.464014
                          male      138.559582
Name: MoneyPerMonth, dtype: float64
We can see that, despite published statistics showing lower average salaries for women than for men, female survey respondents in the US, UK and Canada actually reported having more money available per month to spend on learning than men.
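One caveat: these are means, and a few remaining large values below the 10,000 cutoff can still pull a group's mean up. Comparing group medians is a cheap robustness check. A sketch on toy data (the numbers are illustrative, not survey values):

```python
import pandas as pd

## Illustrative data: both groups share the same median, but one large
## value pulls the mean of group 'a' far above group 'b'
toy = pd.DataFrame({'Gender': ['a', 'a', 'a', 'b', 'b', 'b'],
                    'MoneyPerMonth': [20.0, 30.0, 900.0, 20.0, 30.0, 40.0]})

grouped = toy.groupby('Gender')['MoneyPerMonth']
print(grouped.mean())    # means differ a lot
print(grouped.median())  # medians are close
```

If the gender-by-country medians told the same story as the means above, that would strengthen the conclusion; if not, the mean gap may be driven by a handful of big spenders.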
This disproves our original hypothesis: that female survey respondents may be at a disadvantage financially compared to the male respondents and may benefit from a lower cost for our e-learning courses.
Although the data here shows that women in these 3 countries have the financial resources to study programming and other tech skills, they still represent a minority in our sample, accounting for only 22% of overall responses.
Therefore, should the company want to contribute to reducing the gender representation gap in tech, offering women a scholarship or discount on learning fees as part of its advertising campaign could encourage more women to use its courses to start a career in tech.