Star Wars Survey
Reading in the data
import pandas as pd
import numpy as np
star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')
Exploring the data set
star_wars.head(10)
print(star_wars.columns)
star_wars.shape
Data cleaning
We will start by removing the rows where RespondentID is null, since every respondent is meant to have a unique ID.
star_wars['RespondentID'].notnull().sum()
star_wars = star_wars[star_wars['RespondentID'].notnull()]
star_wars.shape
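The same filter can also be written with `dropna`. The sketch below runs on a small toy frame whose values are placeholders, not survey data, and adds a uniqueness check on the ID column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey; the values are illustrative placeholders.
toy = pd.DataFrame({
    "RespondentID": [1.0, np.nan, 3.0],
    "Gender": ["Male", "Female", np.nan],
})

# Drop only the rows whose RespondentID is missing; NaNs in other columns are kept.
cleaned = toy.dropna(subset=["RespondentID"])

# An ID column is expected to hold unique values, so verify that too.
ids_unique = cleaned["RespondentID"].is_unique
print(cleaned.shape, ids_unique)
```

On the real data this would be `star_wars.dropna(subset=['RespondentID'])`, which keeps the same rows as the boolean filter above.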
We will convert the next two columns from Yes/No to True/False to make them easier to work with, and the six film columns from title/NaN to True/False. After that, we will rename the columns that pertain to the films seen and the film rankings so that they are easier to read.
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
yes_no = {'Yes': True, 'No': False}
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(yes_no)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(yes_no)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)
star_wars[star_wars.columns[3]].value_counts(dropna=False)
# Any film title means the respondent has seen that film; a missing value means they have not.
# np.nan is used as the key because the np.NaN alias was removed in NumPy 2.0.
dic_map = {'Star Wars: Episode I The Phantom Menace': True, 'Star Wars: Episode II Attack of the Clones': True, 'Star Wars: Episode III Revenge of the Sith': True, 'Star Wars: Episode IV A New Hope': True, 'Star Wars: Episode V The Empire Strikes Back': True, 'Star Wars: Episode VI Return of the Jedi': True, np.nan: False}
for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(dic_map)
    #print(star_wars[col])
star_wars[star_wars.columns[8]].value_counts(dropna=False)
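Because each of these columns holds either a film title or a missing value, the conversion can also be done without a lookup dictionary. This is a minimal sketch on a toy frame; the column names `seen_a` and `seen_b` are hypothetical stand-ins for the survey columns.

```python
import numpy as np
import pandas as pd

# Toy frame: a title where the film was ticked, NaN where it was not.
toy = pd.DataFrame({
    "seen_a": ["Star Wars: Episode I The Phantom Menace", np.nan],
    "seen_b": [np.nan, np.nan],
})

# notna() is True wherever a value is present, so it yields the same
# True/False flags as mapping every title to True and NaN to False.
seen_flags = toy.notna()
print(seen_flags)
```

This form does not depend on using NaN as a dictionary key, but it only works when applied to the raw columns, before they have been mapped.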
print(star_wars.columns[3:9])
star_wars = star_wars.rename(columns={'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1', 'Unnamed: 4': 'seen_2', 'Unnamed: 5': 'seen_3', 'Unnamed: 6': 'seen_4', 'Unnamed: 7': 'seen_5', 'Unnamed: 8': 'seen_6'})
print(star_wars.columns[3:9])
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
print(star_wars.columns[9:15])
star_wars = star_wars.rename(columns={'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': 'ranking_1', 'Unnamed: 10': 'ranking_2', 'Unnamed: 11': 'ranking_3', 'Unnamed: 12': 'ranking_4', 'Unnamed: 13': 'ranking_5', 'Unnamed: 14': 'ranking_6'})
print(star_wars.columns[9:15])
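The two rename dictionaries above follow the same pattern, so they could also be generated from the column positions. The helper `numbered_names` below is hypothetical and is shown on a toy frame with a shortened column layout.

```python
import pandas as pd

# Toy frame with the same header pattern as the survey: one long question
# header followed by "Unnamed: N" columns.
toy = pd.DataFrame(columns=[
    "RespondentID",
    "Which of the following Star Wars films have you seen? Please select all that apply.",
    "Unnamed: 4",
    "Unnamed: 5",
])

def numbered_names(columns, prefix):
    """Map each column label to prefix_1, prefix_2, ... in positional order."""
    return {old: f"{prefix}_{i}" for i, old in enumerate(columns, start=1)}

rename_map = numbered_names(toy.columns[1:4], "seen")
toy = toy.rename(columns=rename_map)
print(list(toy.columns))
```

On the survey frame the equivalent calls would use `star_wars.columns[3:9]` with the prefix `'seen'` and `star_wars.columns[9:15]` with the prefix `'ranking'`.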
Analyze data
We will start by computing the mean of each ranking column and plotting the means as a bar chart. Then we will compute the sum of each seen column and plot those sums as a bar chart.
%matplotlib inline
ranking_mean = star_wars.iloc[:,9:15].mean()
ranking_mean.plot(kind='bar', title='Mean rankings', ylim=(0,5))
From the chart above, the highest-ranked Star Wars film is Star Wars: Episode V The Empire Strikes Back, since it has the lowest mean rank (1 means favorite), and the lowest-ranked film is Star Wars: Episode III Revenge of the Sith, since it has the highest mean rank.
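The chart labels its bars `ranking_1` to `ranking_6`, so reading it requires remembering which episode each column stands for. One way to make this explicit is to relabel the index before plotting. The sketch assumes that `ranking_N` corresponds to Episode N, and the numbers in the toy Series are placeholders rather than survey results.

```python
import pandas as pd

# Assumed correspondence between ranking columns and films (Episode N -> ranking_N).
titles = {
    "ranking_1": "Star Wars: Episode I The Phantom Menace",
    "ranking_2": "Star Wars: Episode II Attack of the Clones",
    "ranking_3": "Star Wars: Episode III Revenge of the Sith",
    "ranking_4": "Star Wars: Episode IV A New Hope",
    "ranking_5": "Star Wars: Episode V The Empire Strikes Back",
    "ranking_6": "Star Wars: Episode VI Return of the Jedi",
}

# Placeholder values standing in for ranking_mean.
toy_mean = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    index=[f"ranking_{i}" for i in range(1, 7)],
)

# Replace the column names with film titles; the values are unchanged.
labelled = toy_mean.rename(index=titles)
print(labelled.index.tolist())
```

Calling `ranking_mean.rename(index=titles).plot(kind='barh')` would then put the film titles on the axis.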
seen_sum = star_wars.iloc[:,3:9].sum()
print(seen_sum)
seen_sum.plot(kind='bar', title='Sum of each seen Star Wars movies', ylim=(500,800))
As the chart shows, Star Wars: Episode V The Empire Strikes Back is the most seen, which I believe is related to its high ranking. The least seen, unsurprisingly, is Star Wars: Episode III Revenge of the Sith, which is consistent with the low rank it received.
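The suggested link between how often a film was seen and how well it ranked can be checked with a rank correlation rather than by eye. The function `rank_agreement` below is a hypothetical helper, and the toy numbers are placeholders chosen only to exercise it.

```python
import pandas as pd

def rank_agreement(seen_sum, ranking_mean):
    """Rank correlation between view counts and mean rankings.

    Computed as the Pearson correlation of the ranks. A value near -1 means
    that films seen by more respondents tend to have a lower (better) mean rank.
    """
    # Align the two Series on film number by stripping the seen_/ranking_ prefixes.
    seen = seen_sum.rename(index=lambda c: c.split("_")[-1])
    rank = ranking_mean.rename(index=lambda c: c.split("_")[-1])
    return seen.rank().corr(rank.rank())

# Placeholder values, not survey results.
toy_seen = pd.Series([10, 20, 30], index=["seen_1", "seen_2", "seen_3"])
toy_rank = pd.Series([3.0, 2.0, 1.0], index=["ranking_1", "ranking_2", "ranking_3"])

agreement = rank_agreement(toy_seen, toy_rank)
print(agreement)
```

With only six films, any such coefficient is a rough indicator at best, and it cannot say whether popularity drives the ranking or the other way round.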
Split the data into two groups by gender
Let's split the data into two groups by gender and repeat the analysis to see whether any interesting pattern emerges.
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]
males_ranking_mean = males.iloc[:,9:15].mean()
print(males_ranking_mean)
males_ranking_mean.plot(kind='bar', title='Mean rankings by Males', ylim=(0,5))
females_ranking_mean = females.iloc[:,9:15].mean()
print(females_ranking_mean)
females_ranking_mean.plot(kind='bar', title='Mean rankings by Females', ylim=(0,5))
males_seen_sum = males.iloc[:,3:9].sum()
print(males_seen_sum)
males_seen_sum.plot(kind='bar', title='Sum of each seen Star Wars movies by Males', ylim=(200,400))
females_seen_sum = females.iloc[:,3:9].sum()
print(females_seen_sum)
females_seen_sum.plot(kind='bar', title='Sum of each seen Star Wars movies by Females', ylim=(200,400))
Splitting the data into two groups by gender did not change the pattern of the results: the highest-ranked and the most-seen Star Wars films are the same for both groups.
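The per-gender figures can also be produced in one step with `groupby`, which puts both groups in a single table for comparison. This is a sketch on a toy frame with one ranking column and one seen column; the values are placeholders.

```python
import pandas as pd

# Toy frame; the row with a missing Gender is left out of both groups.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", None],
    "ranking_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "seen_1": [True, False, True, True, True],
})

# Mean ranking and number of viewers per gender, one row per group.
by_gender_rank = toy.groupby("Gender")[["ranking_1"]].mean()
by_gender_seen = toy.groupby("Gender")[["seen_1"]].sum()
print(by_gender_rank)
print(by_gender_seen)
```

On the survey frame, the column selections would be `list(star_wars.columns[9:15])` for the rankings and `list(star_wars.columns[3:9])` for the seen flags, and calling `.T.plot(kind='bar')` on either result draws the two genders side by side.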