Star Wars Fans Preference Analysis
Source:facebook
Does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch? This was the question the team at FiveThirtyEight set out to answer shortly before the release of Star Wars: The Force Awakens.
They conducted a survey through SurveyMonkey and received 835 responses. Beyond that central question, the survey was also designed to build a better picture of Star Wars fans, and its questions reflect this.
The goal of this project is to clean the survey data and analyse it to determine which films fans prefer and how those preferences vary across fan demographics.
The data required for this analysis is available in the GitHub repository of fivethirtyeight. We shall read the data below.
import pandas as pd
#Read the data
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
pd.options.display.max_columns = 50
star_wars.columns
star_wars.head(2)
There are multiple issues with the data: the column names are full survey questions (with follow-up columns left as 'Unnamed: N'), Yes/No answers are stored as strings rather than booleans, checkbox answers are stored as film titles, and age and income are recorded as ranges. These issues need to be corrected before proceeding with the analysis.
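Before cleaning, it helps to quantify how much of the data is missing. A minimal sketch, using a toy frame that mimics the raw survey layout (the column names and values below are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the raw survey layout: a follow-up checkbox column is
# left as 'Unnamed: 4' and holds the film title when ticked, NaN otherwise
raw = pd.DataFrame({
    "RespondentID": [1, 2, 3],
    "Have you seen any of the 6 films in the Star Wars franchise?": ["Yes", "No", np.nan],
    "Unnamed: 4": ["Star Wars: Episode II  Attack of the Clones", np.nan, np.nan],
})

# Count missing values per column to gauge how much cleaning is needed
null_counts = raw.isnull().sum()
print(null_counts)
```

Running the same `isnull().sum()` on `star_wars` gives a per-column picture of the gaps we will deal with below.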
Cleaning the dataset
We can begin by cleaning the column names. The column names are very descriptive and need to be shortened so that they can be referred to easily during analysis.
#Re-name columns
star_wars.rename(columns={'RespondentID':'participant_id',
'Have you seen any of the 6 films in the Star Wars franchise?':'watched_any',
'Do you consider yourself to be a fan of the Star Wars film franchise?':'star_wars_fan',
'Which of the following Star Wars films have you seen? Please select all that apply.':'episode_1',
'Unnamed: 4':'episode_2',
'Unnamed: 5':'episode_3',
'Unnamed: 6':'episode_4',
'Unnamed: 7':'episode_5',
'Unnamed: 8':'episode_6',
'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.':'rating_ep1',
'Unnamed: 10':'rating_ep2',
'Unnamed: 11':'rating_ep3',
'Unnamed: 12':'rating_ep4',
'Unnamed: 13':'rating_ep5',
'Unnamed: 14':'rating_ep6',
'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.':'han_solo',
'Unnamed: 16':'luke_skywalker',
'Unnamed: 17':'princess_leia',
'Unnamed: 18':'anakin_skywalker',
'Unnamed: 19':'obi_wan',
'Unnamed: 20':'emperor_palpatine',
'Unnamed: 21':'darth_vader',
'Unnamed: 22':'lando_calrissian',
'Unnamed: 23':'boba_fett',
'Unnamed: 24':'c_3po',
'Unnamed: 25':'r2_d2',
'Unnamed: 26':'jar_jar_binks',
'Unnamed: 27':'padme_amidala',
'Unnamed: 28':'yoda',
'Which character shot first?':'shot_first',
'Are you familiar with the Expanded Universe?':'know_expand_uni',
'Do you consider yourself to be a fan of the Expanded Universe?':'expand_uni_fan',
'Do you consider yourself to be a fan of the Star Trek franchise?':'star_trek_fan',
'Gender':'gender', 'Age':'age', 'Household Income':'income', 'Education':'education',
'Location (Census Region)':'location'},inplace=True)
After renaming, the columns are much easier to refer to. Next we shall clean the data in each of them.
star_wars.head(3)
Based on a review of the columns above, we can identify several column sets: Yes/No questions, episode-watched checkboxes, episode rankings, character favourability ratings, and demographic details.
We can clean the columns with Yes/No values by replacing the values with boolean True and False.
import numpy as np
#Convert the two Yes/No columns after participant_id to booleans
star_wars.loc[:,'watched_any':'star_wars_fan'] = star_wars.loc[:,'watched_any':'star_wars_fan'].applymap(lambda val:True if val=='Yes' else (False if val=='No' else np.NaN))
#Convert the remaining Yes/No columns to booleans
star_wars.loc[:,'know_expand_uni':'star_trek_fan'] = star_wars.loc[:,'know_expand_uni':'star_trek_fan'].applymap(lambda val:True if val=='Yes' else (False if val=='No' else np.NaN))
star_wars.loc[:,'watched_any':'star_wars_fan'].head(2)
star_wars.loc[:,'know_expand_uni':'star_trek_fan'].head(2)
Each of the episode columns holds the film's title when the participant watched that film, and NaN otherwise. We can convert these to boolean True wherever a title is present.
#Display the episode watched columns
episode_columns = [each for each in star_wars.columns if 'episode' in each]
star_wars[episode_columns].head(3)
#pd.isnull handles NaN reliably; 'val is np.NaN' can fail on distinct NaN objects
star_wars[episode_columns] = star_wars[episode_columns].applymap(lambda val:np.NaN if pd.isnull(val) else True)
star_wars[episode_columns].head(3)
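As a sanity check on the conversion, summing a boolean column counts the True values, since NaN is skipped. A minimal sketch on a toy frame standing in for the cleaned episode_1..episode_6 columns:

```python
import numpy as np
import pandas as pd

# Toy watched columns (True = seen, NaN = not seen), illustrative values only
watched = pd.DataFrame({
    "episode_1": [True, np.nan, True],
    "episode_5": [True, True, True],
})

# True counts as 1 and NaN is skipped, so sum() yields viewers per episode
viewers = watched.sum()
print(viewers)
```

Applied to `star_wars[episode_columns]`, the same `sum()` gives the number of respondents who watched each episode.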
star_wars[star_wars.columns[15:29]].head(4)
Each of the character columns holds one of the values below. These will be replaced with numeric scores, which should help calculate a rating for each character of the franchise.
star_wars['obi_wan'].value_counts(dropna=True)
ratings_dict = {'Very favorably':2,
'Somewhat favorably':1,
'Neither favorably nor unfavorably (neutral)':0,
'Somewhat unfavorably':-1,
'Very unfavorably':-2,
#'Unfamiliar' carries no favourability signal, so treat it as missing rather than a score
'Unfamiliar (N/A)':np.NaN}
char_cols = star_wars.columns[15:29]
star_wars[char_cols] = star_wars[char_cols].applymap(lambda rating:np.NaN if pd.isnull(rating) else ratings_dict[rating])
star_wars[char_cols].head(4)
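With the scores in place, a per-character rating is just the column mean. A minimal sketch on a toy frame using the same -2..+2 scale (character names and values below are illustrative):

```python
import numpy as np
import pandas as pd

# Toy character scores on the -2..+2 scale (NaN = unfamiliar / no answer)
scores = pd.DataFrame({
    "han_solo": [2, 2, 1],
    "jar_jar_binks": [-2, np.nan, -1],
})

# NaN is excluded from the mean, so unfamiliar responses don't distort ratings
mean_scores = scores.mean().sort_values(ascending=False)
print(mean_scores)
```

The same `star_wars[char_cols].mean().sort_values()` ranks every character in the cleaned data.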
The age column holds age ranges as opposed to the actual age of the individual.
star_wars['age'].value_counts(dropna=False)
The age column can be categorized into four generational groups: Boomers (> 60), Gen X (45-60), Gen Y (30-44), and Gen Z (18-29). This categorization can be mapped to the age column to enable easier analysis.
age_dict = {'> 60':'Boomers',
'45-60':'Gen X',
'30-44':'Gen Y',
'18-29':'Gen Z'}
#Series.map leaves missing values as NaN, so no NaN handling is needed
star_wars['age'] = star_wars['age'].map(age_dict)
star_wars['age'].value_counts()
The income column has the same issue as the age column. It too will be categorized based on the ranges.
star_wars['income'].value_counts()
The income column can be categorized as follows:
income_dict = {"$50,000 - $99,999":"Lower Middle",
"$25,000 - $49,999":"Middle",
"$100,000 - $149,999":"Upper Middle",
"$0 - $24,999":"Low",
"$150,000+":"High"}
star_wars['income'] = star_wars['income'].map(income_dict)
star_wars['income'].head(5)
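One caveat with `Series.map` is that any value missing from the dictionary becomes NaN silently. Comparing null counts before and after the mapping confirms nothing was dropped. A minimal sketch on a toy series (values illustrative):

```python
import numpy as np
import pandas as pd

income_dict = {"$0 - $24,999": "Low", "$150,000+": "High"}
income = pd.Series(["$0 - $24,999", "$150,000+", np.nan])

# Series.map returns NaN for keys absent from the dict; equal null counts
# before and after show every non-null value was covered by the mapping
mapped = income.map(income_dict)
assert mapped.isnull().sum() == income.isnull().sum()
print(mapped.tolist())
```

The same comparison can be run on the real income column after the `map` call above.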
star_wars['education'].value_counts()
Neither the column nor the data in it has any specific issues, but we can rename the values to make them friendlier for annotating plots.
education_dict = {'Bachelor degree':'Bachelors',
'Some college or Associate degree':'Diploma',
'Graduate degree':'Masters',
'High school degree':'High School',
'Less than high school degree':'Primary School'}
star_wars['education']=star_wars['education'].map(education_dict)
star_wars['education'].head(5)
star_wars['star_wars_fan'].value_counts(dropna=False)
Now that we have cleaned all the columns, let's inspect the data set as a whole and check the state of the null values in the data.
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
check = star_wars.isnull()
fig,ax = plt.subplots(figsize=(10,8))
sns.heatmap(check,cbar=False)
ax.set_xticklabels(labels = star_wars.columns, rotation=80)
ax.set_title("Null (white colour) in Star Wars data", size=20)
plt.show()
Clearly, a number of rows in the data set are largely empty. Several participants did not state whether they are fans. A quick look at those rows can help decide whether they should be removed.
star_wars[star_wars["star_wars_fan"].isnull()].tail(10)
Based on the data above, it is clear that these participants shared nothing about their preferences; all they provided are a few personal details. We shall exclude these rows and review the impact of the removal.
#Remove rows with a missing fan status
star_wars = star_wars[star_wars["star_wars_fan"].notnull()]
check = star_wars.isnull()
fig,ax = plt.subplots(figsize=(10,8))
sns.heatmap(check,cbar=False)
ax.set_xticklabels(labels = star_wars.columns, rotation=80)
ax.set_title("Null (white colour) after Cleaning", size=20)
plt.show()
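Filtering on `notnull()` keeps exactly the rows where the fan flag was answered, whether True or False. A minimal sketch on a toy frame (values illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey: one respondent skipped the fan question
df = pd.DataFrame({"star_wars_fan": [True, np.nan, False, True]})

before = len(df)
# Keep every row with an answered fan flag, True or False alike
cleaned = df[df["star_wars_fan"].notnull()]
print(f"kept {len(cleaned)} of {before} rows")
```

Comparing `len(star_wars)` before and after the filter quantifies how many responses were discarded in the real data.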