Not So Long Ago, In a Dataset...

Star Wars Fans Preference Analysis

1

Introduction

[Image: image.png. Source: Facebook]

Does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch? This was the question the team at FiveThirtyEight was trying to answer shortly before the release of Star Wars: The Force Awakens.

They conducted a survey using SurveyMonkey and received 835 responses. Beyond the question above, the survey also aimed to build a better picture of who Star Wars fans are, as the survey questions themselves show.

The goal of this project is to:

  • Clean and analyse the survey data
  • Get a better understanding of who Star Wars fans are
  • Analyse how fans respond to the previously released movies in the Star Wars franchise
  • Analyse character preferences across the previously released movies, covering the 14 characters included in the survey

2

Reading the data

The data required for this analysis is available in FiveThirtyEight's GitHub repository. We shall read the data below.

In [1]:
import pandas as pd

#Read the data
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
pd.options.display.max_columns = 50
star_wars.columns
Out[1]:
Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')
In [2]:
star_wars.head(2)
Out[2]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? Which of the following Star Wars films have you seen? Please select all that apply. Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19 Unnamed: 20 Unnamed: 21 Unnamed: 22 Unnamed: 23 Unnamed: 24 Unnamed: 25 Unnamed: 26 Unnamed: 27 Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe? Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
0 3292879998 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3.0 2.0 1.0 4.0 5.0 6.0 Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Unfamiliar (N/A) Unfamiliar (N/A) Very favorably Very favorably Very favorably Very favorably Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
1 3292879538 No NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central

There are multiple issues with the data, detailed below:

  • The column names, while sensible, are long and inconsistent with Python naming conventions
  • The columns holding survey answers contain a mix of Yes, No, NaN or, for the films-seen columns, the title of the film itself

These issues need to be corrected before proceeding with analysis.
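
One quick way to confirm the second issue is to list the distinct values in each column. The sketch below uses a small made-up frame (the name `sample` and its rows are illustrative stand-ins, not the survey data) and assumes that a per-column `unique()` call is enough to expose the mix of answers.

```python
import pandas as pd

# Toy frame standing in for two survey columns (rows are illustrative only).
sample = pd.DataFrame({
    "Have you seen any of the 6 films in the Star Wars franchise?": ["Yes", "No", "Yes"],
    "Unnamed: 4": ["Star Wars: Episode II Attack of the Clones", None, None],
})

# Collect the distinct values per column, including missing ones.
distinct_values = {col: sample[col].unique().tolist() for col in sample.columns}

for col, values in distinct_values.items():
    # Truncate the long question text so the printout stays readable.
    print(f"{col[:30]!r}: {values}")
```

On the real data the same loop would be run over `star_wars.columns`.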

3

Clean the Data We Must..

Cleaning the dataset

We can begin by cleaning the column names. The column names are very descriptive and need to be shortened so that they can be referred to easily during analysis.

In [3]:
#Re-name columns
star_wars.rename(columns={'RespondentID':'participant_id',
                          'Have you seen any of the 6 films in the Star Wars franchise?':'watched_any',
                          'Do you consider yourself to be a fan of the Star Wars film franchise?':'star_wars_fan',
                          'Which of the following Star Wars films have you seen? Please select all that apply.':'episode_1',
                          'Unnamed: 4':'episode_2',
                          'Unnamed: 5':'episode_3',
                          'Unnamed: 6':'episode_4',
                          'Unnamed: 7':'episode_5',
                          'Unnamed: 8':'episode_6',
                          'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.':'rating_ep1',
                          'Unnamed: 10':'rating_ep2',
                          'Unnamed: 11':'rating_ep3', 
                          'Unnamed: 12':'rating_ep4', 
                          'Unnamed: 13':'rating_ep5',
                          'Unnamed: 14':'rating_ep6',
                          'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.':'han_solo',
                          'Unnamed: 16':'luke_skywalker', 
                          'Unnamed: 17':'princess_leia', 
                          'Unnamed: 18':'anakin_skywalker', 
                          'Unnamed: 19':'obi_wan',
                          'Unnamed: 20':'emperor_palpatine', 
                          'Unnamed: 21':'darth_vader', 
                          'Unnamed: 22':'lando_calrissian', 
                          'Unnamed: 23':'boba_fett',
                          'Unnamed: 24':'c_3po', 
                          'Unnamed: 25':'r2_d2', 
                          'Unnamed: 26':'jar_jar_binks', 
                          'Unnamed: 27':'padme_amidala',
                          'Unnamed: 28':'yoda', 
                          'Which character shot first?':'shot_first',
                          'Are you familiar with the Expanded Universe?':'know_expand_uni',
                          'Do you consider yourself to be a fan of the Expanded Universe?':'expand_uni_fan',
                          'Do you consider yourself to be a fan of the Star Trek franchise?':'star_trek_fan',
                          'Gender':'gender', 'Age':'age', 'Household Income':'income', 'Education':'education',
                          'Location (Census Region)':'location'},inplace=True)

With the columns renamed, they are much easier to work with. Next we shall clean the data in each of them.
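
The rename dictionary above is written out by hand. For the numbered `Unnamed` columns, part of it could also be generated by position. The following is only a sketch: `raw_columns`, `seen_names` and `rename_map` are hypothetical names, and the frame is a one-row stand-in for the six films-seen columns.

```python
import pandas as pd

# Stand-in for the six raw films-seen columns (the first carries the question text).
raw_columns = (
    ["Which of the following Star Wars films have you seen? Please select all that apply."]
    + [f"Unnamed: {i}" for i in range(4, 9)]
)
sample = pd.DataFrame([[None] * len(raw_columns)], columns=raw_columns)

# Pair each raw name with a short name by position.
seen_names = [f"episode_{n}" for n in range(1, 7)]
rename_map = dict(zip(raw_columns, seen_names))

sample = sample.rename(columns=rename_map)
print(list(sample.columns))
```

Generating the mapping this way keeps the short names in step with the column order, at the cost of being less explicit than the hand-written dictionary.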

In [4]:
star_wars.head(3)
Out[4]:
participant_id watched_any star_wars_fan episode_1 episode_2 episode_3 episode_4 episode_5 episode_6 rating_ep1 rating_ep2 rating_ep3 rating_ep4 rating_ep5 rating_ep6 han_solo luke_skywalker princess_leia anakin_skywalker obi_wan emperor_palpatine darth_vader lando_calrissian boba_fett c_3po r2_d2 jar_jar_binks padme_amidala yoda shot_first know_expand_uni expand_uni_fan star_trek_fan gender age income education location
0 3292879998 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3.0 2.0 1.0 4.0 5.0 6.0 Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Unfamiliar (N/A) Unfamiliar (N/A) Very favorably Very favorably Very favorably Very favorably Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
1 3292879538 No NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
2 3292765271 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN 1.0 2.0 3.0 4.0 5.0 6.0 Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central

Inspecting the columns above, we can identify the following column sets:

  • Yes/No columns: the first two columns after participant_id and the last three columns before gender hold only Yes, No or NaN. Yes and No can be replaced by the boolean values True and False.
  • Films-seen columns: the columns prefixed with episode, recording whether each film was seen.
  • Film-ranking columns: the columns prefixed with rating.
  • Character columns: the ratings given to each character of the franchise.

Let's clean these column sets below.

Clean Columns with Yes/No data

We can clean the columns with Yes/No values by replacing the values with boolean True and False.

In [5]:
import numpy as np

# Convert Yes/No to True/False in the first two columns after participant_id
star_wars.loc[:,'watched_any':'star_wars_fan'] = star_wars.loc[:,'watched_any':'star_wars_fan'].applymap(lambda val:True if val=='Yes' else (False if val=='No' else np.nan))

# Convert the remaining Yes/No columns in the same way
star_wars.loc[:,'know_expand_uni':'star_trek_fan'] = star_wars.loc[:,'know_expand_uni':'star_trek_fan'].applymap(lambda val:True if val=='Yes' else (False if val=='No' else np.nan))
In [6]:
star_wars.loc[:,'watched_any':'star_wars_fan'].head(2)
Out[6]:
watched_any star_wars_fan
0 True True
1 False NaN
In [7]:
star_wars.loc[:,'know_expand_uni':'star_trek_fan'].head(2)
Out[7]:
know_expand_uni expand_uni_fan star_trek_fan
0 True False False
1 NaN NaN True
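
The Yes/No conversion can also be expressed with a lookup dictionary and `Series.map`, which leaves anything not in the dictionary (including missing answers) as NaN. This is a sketch on made-up rows, not the notebook's own code; `yes_no`, `sample` and `converted` are names introduced for illustration.

```python
import pandas as pd

# Lookup table for the two valid answers; anything else becomes NaN.
yes_no = {"Yes": True, "No": False}

# Toy rows standing in for two of the Yes/No columns.
sample = pd.DataFrame({
    "watched_any": ["Yes", "No", "Yes"],
    "star_wars_fan": ["Yes", None, "No"],
})

# Apply the same mapping to every column, one Series at a time.
converted = sample.apply(lambda col: col.map(yes_no))
print(converted)
```

A dictionary keeps the valid answers in one place, which is handy if the same mapping is reused for several column ranges.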

Cleaning columns with Episodes Seen

The columns prefixed with episode hold the film's title when the participant has seen that film, and NaN otherwise. Each title will be replaced with the boolean True, while missing values stay NaN.

In [8]:
#Display columns recording which episodes were seen
episode_columns = [each for each in star_wars.columns if 'episode' in each]
star_wars[episode_columns].head(3)
Out[8]:
episode_1 episode_2 episode_3 episode_4 episode_5 episode_6
0 Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi
1 NaN NaN NaN NaN NaN NaN
2 Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN
In [9]:
# Any non-missing value is a film title, so it becomes True; missing values stay NaN
star_wars[episode_columns] = star_wars[episode_columns].applymap(lambda val:np.nan if pd.isnull(val) else True)
In [10]:
star_wars[episode_columns].head(3)
Out[10]:
episode_1 episode_2 episode_3 episode_4 episode_5 episode_6
0 True True True True True True
1 NaN NaN NaN NaN NaN NaN
2 True True True NaN NaN NaN
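
A variant of this step, under the assumption that a missing title can be read as "not seen", is to use `notna()` directly. That yields a purely boolean frame, which makes it easy to count the films each participant has seen. The frame below is a made-up stand-in and the names `seen` and `films_seen` are illustrative.

```python
import pandas as pd

# Toy rows standing in for two films-seen columns before cleaning.
sample = pd.DataFrame({
    "episode_1": ["Star Wars: Episode I The Phantom Menace", None,
                  "Star Wars: Episode I The Phantom Menace"],
    "episode_2": ["Star Wars: Episode II Attack of the Clones", None, None],
})

# True where a title is present, False where the answer is missing.
seen = sample.notna()

# Booleans sum as 0/1, giving the number of films seen per participant.
films_seen = seen.sum(axis=1)
print(films_seen.tolist())
```

Note that this differs from the cell above, which keeps NaN for unseen films instead of False.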

Cleaning columns with Ratings of each Character

In [11]:
star_wars[star_wars.columns[15:29]].head(4)
Out[11]:
han_solo luke_skywalker princess_leia anakin_skywalker obi_wan emperor_palpatine darth_vader lando_calrissian boba_fett c_3po r2_d2 jar_jar_binks padme_amidala yoda
0 Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Unfamiliar (N/A) Unfamiliar (N/A) Very favorably Very favorably Very favorably Very favorably Very favorably
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A)
3 Very favorably Very favorably Very favorably Very favorably Very favorably Somewhat favorably Very favorably Somewhat favorably Somewhat unfavorably Very favorably Very favorably Very favorably Very favorably Very favorably

Each of the character columns holds the values below. They will be replaced with numeric scores, which makes it possible to calculate a rating for each character of the franchise. Unfamiliar (N/A) is coded as 3, a placeholder rather than a score, so it has to be excluded before any averaging.

  • Unfamiliar (N/A): 3 (placeholder)
  • Very favorably: 2
  • Somewhat favorably: 1
  • Neither favorably nor unfavorably (neutral): 0
  • Somewhat unfavorably: -1
  • Very unfavorably: -2
In [12]:
star_wars['obi_wan'].value_counts(dropna=True)
Out[12]:
Very favorably                                 591
Somewhat favorably                             159
Neither favorably nor unfavorably (neutral)     43
Unfamiliar (N/A)                                17
Somewhat unfavorably                             8
Very unfavorably                                 7
Name: obi_wan, dtype: int64
In [13]:
ratings_dict = {'Very favorably':2,
                'Somewhat favorably':1,
                'Neither favorably nor unfavorably (neutral)':0,
                'Somewhat unfavorably':-1,
                'Very unfavorably':-2,
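                'Unfamiliar (N/A)':3}  # 3 is a placeholder for "unfamiliar", not a score
char_cols = star_wars.columns[15:29]
# Replace each label with its score; missing answers stay NaN
star_wars[char_cols] = star_wars[char_cols].applymap(lambda rating:np.nan if pd.isnull(rating) else ratings_dict[rating])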
star_wars[char_cols].head(4)
Out[13]:
han_solo luke_skywalker princess_leia anakin_skywalker obi_wan emperor_palpatine darth_vader lando_calrissian boba_fett c_3po r2_d2 jar_jar_binks padme_amidala yoda
0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 3.0 3.0 2.0 2.0 2.0 2.0 2.0
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1.0 1.0 1.0 1.0 1.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3 2.0 2.0 2.0 2.0 2.0 1.0 2.0 1.0 -1.0 2.0 2.0 2.0 2.0 2.0
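
Because the mapping in In [13] codes Unfamiliar (N/A) as 3, that value would inflate any average taken over the raw scores. The sketch below shows one way to guard against this, assuming the placeholder should simply be treated as missing. The two columns reuse the first four values of han_solo and boba_fett from Out[13]; `UNFAMILIAR`, `scores` and `mean_scores` are names introduced here.

```python
import numpy as np
import pandas as pd

UNFAMILIAR = 3  # placeholder code used for "Unfamiliar (N/A)"

# First four rows of two character columns, as shown in Out[13].
scores = pd.DataFrame({
    "han_solo": [2.0, np.nan, 1.0, 2.0],
    "boba_fett": [3.0, np.nan, 3.0, -1.0],
})

# Treat the placeholder as missing so it does not count as a score.
familiar_only = scores.replace(UNFAMILIAR, np.nan)

# mean() skips NaN by default, so only real ratings are averaged.
mean_scores = familiar_only.mean()
print(mean_scores)
```

Without the `replace` step, boba_fett would average (3 + 3 - 1) / 3 instead of the single real rating of -1.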

Cleaning the Age Column

The age column holds age ranges as opposed to the actual age of the individual.

In [14]:
star_wars['age'].value_counts(dropna=False)
Out[14]:
45-60    291
> 60     269
30-44    268
18-29    218
NaN      140
Name: age, dtype: int64

The age column can be categorized into four groups:

  • >60: Boomers
  • 45-60: Gen X
  • 30-44: Gen Y
  • 18-29: Gen Z

This categorization can be mapped to the age column to enable easier analysis.

In [15]:
age_dict = {'> 60':'Boomers',
            '45-60':'Gen X',
            '30-44':'Gen Y',
            '18-29':'Gen Z'}
# Map each age range to its group label; missing ages stay NaN
star_wars['age'] = star_wars['age'].map(age_dict)
star_wars['age'].value_counts()
Out[15]:
Gen X      291
Boomers    269
Gen Y      268
Gen Z      218
Name: age, dtype: int64
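
The group labels have a natural order (youngest to oldest) that plain strings do not carry. If that order matters for later tables or plots, an ordered categorical is one option. This is a sketch on made-up values; `age_order`, `ages` and `counts` are illustrative names, and the choice of ordering is an assumption.

```python
import pandas as pd

# Assumed display order, youngest to oldest.
age_order = ["Gen Z", "Gen Y", "Gen X", "Boomers"]

# Toy values standing in for the cleaned age column.
ages = pd.Series(["Gen X", "Boomers", None, "Gen Z", "Gen X"])

# An ordered categorical remembers the order of its categories.
ages = pd.Series(pd.Categorical(ages, categories=age_order, ordered=True))

# Sorting by index now follows age_order rather than alphabetical order.
counts = ages.value_counts().sort_index()
print(counts)
```

Categories with no observations still appear in the counts with a value of zero, which keeps group comparisons aligned.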

Cleaning the Income column

The income column has the same issue as the age column: it holds ranges rather than actual values. It too will be categorized based on the ranges.

In [16]:
star_wars['income'].value_counts()
Out[16]:
$50,000 - $99,999      298
$25,000 - $49,999      186
$100,000 - $149,999    141
$0 - $24,999           138
$150,000+               95
Name: income, dtype: int64

The income column can be categorized as follows:

  • $0 - $24,999: Low
  • $50,000 - $99,999: Lower Middle
  • $25,000 - $49,999: Middle
  • $100,000 - $149,999: Upper Middle
  • $150,000+: High
In [17]:
income_dict = {"$50,000 - $99,999":"Lower Middle",
               "$25,000 - $49,999":"Middle",
               "$100,000 - $149,999":"Upper Middle",
               "$0 - $24,999":"Low",
               "$150,000+":"High"}
star_wars['income'] = star_wars['income'].map(income_dict)
star_wars['income'].head(5)
Out[17]:
0             NaN
1             Low
2             Low
3    Upper Middle
4    Upper Middle
Name: income, dtype: object

Cleaning the Education column

In [18]:
star_wars['education'].value_counts()
Out[18]:
Some college or Associate degree    328
Bachelor degree                     321
Graduate degree                     275
High school degree                  105
Less than high school degree          7
Name: education, dtype: int64

Neither the column nor its data has any specific issues, but we can relabel the values to make them friendlier for annotating plots.

In [19]:
education_dict = {'Bachelor degree':'Bachelors',
                  'Some college or Associate degree':'Diploma',
                  'Graduate degree':'Masters',
                  'High school degree':'High School',
                  'Less than high school degree':'Primary School'}
star_wars['education']=star_wars['education'].map(education_dict)
star_wars['education'].head(5)
Out[19]:
0    High School
1      Bachelors
2    High School
3        Diploma
4        Diploma
Name: education, dtype: object
In [20]:
star_wars['star_wars_fan'].value_counts(dropna=False)
Out[20]:
True     552
NaN      350
False    284
Name: star_wars_fan, dtype: int64

Now that we have cleaned all the columns, let's look at the data set as a whole and check the state of its null values.

In [21]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Boolean mask: True where a value is missing
check = star_wars.isnull()
fig,ax = plt.subplots(figsize=(10,8))
# Draw on the axes created above and keep a tick for every column
sns.heatmap(check,cbar=False,ax=ax,xticklabels=True)
ax.set_xticklabels(labels = star_wars.columns, rotation=80)
ax.set_title("Null (white colour) in Star Wars data", size=20)
plt.show()
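
The heatmap gives a visual impression. A numeric summary of the same mask can make it easier to compare columns. The sketch below runs on a small made-up frame, not the survey data; `sample` and `null_share` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values.
sample = pd.DataFrame({
    "star_wars_fan": [True, np.nan, False, np.nan],
    "gender": ["Male", "Male", np.nan, "Female"],
    "participant_id": [1, 2, 3, 4],
})

# The mean of a boolean mask is the share of True values,
# i.e. the fraction of missing entries per column.
null_share = sample.isnull().mean().sort_values(ascending=False)
print(null_share)
```

On the real data, `star_wars.isnull().mean()` would give the equivalent per-column figures.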

Clearly, a number of rows in the data set are largely empty. In particular, 350 participants did not state whether they consider themselves fans (the NaN count in Out[20]). A quick look at those rows helps decide whether they should be removed.

In [22]:
star_wars[star_wars["star_wars_fan"].isnull()].tail(10)
Out[22]:
participant_id watched_any star_wars_fan episode_1 episode_2 episode_3 episode_4 episode_5 episode_6 rating_ep1 rating_ep2 rating_ep3 rating_ep4 rating_ep5 rating_ep6 han_solo luke_skywalker princess_leia anakin_skywalker obi_wan emperor_palpatine darth_vader lando_calrissian boba_fett c_3po r2_d2 jar_jar_binks padme_amidala yoda shot_first know_expand_uni expand_uni_fan star_trek_fan gender age income education location
1143 3288455900 True NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1150 3288436226 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False Female Boomers NaN Bachelors Middle Atlantic
1151 3288432395 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False Female Gen X Lower Middle Diploma West South Central
1153 3288429390 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False Male Gen Y Lower Middle Bachelors West South Central
1157 3288423088 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False Male Gen X Upper Middle Masters Middle Atlantic
1159 3288421819 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False Female Boomers Low Diploma West South Central
1168 3288410073 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False Female Gen Y High Masters East North Central
1170 3288402717 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False Female Gen X NaN Masters East North Central
1178 3288395255 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False Female Gen Z NaN Primary School West North Central
1183 3288375286 False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN False Female Gen Y Lower Middle Bachelors Middle Atlantic

The rows above show that these participants shared nothing about their preferences. At most, they supplied some demographic details. We shall exclude these rows and check the impact of the removal.

In [23]:
#Remove rows where the fan status is missing
star_wars = star_wars[star_wars["star_wars_fan"].notnull()].copy()
In [24]:
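# Rebuild the null mask on the filtered data
check = star_wars.isnull()
fig,ax = plt.subplots(figsize=(10,8))
# Draw on the axes created above and keep a tick for every column
sns.heatmap(check,cbar=False,ax=ax,xticklabels=True)
ax.set_xticklabels(labels = star_wars.columns, rotation=80)
ax.set_title("Null (white colour) after Cleaning", size=20)
plt.show()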
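
The impact of the removal can also be stated as a row count. The following sketch shows the idea on a made-up frame, assuming the same `notnull()` filter as in In [23]; `sample`, `kept` and `rows_removed` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data before filtering.
sample = pd.DataFrame({
    "participant_id": [1, 2, 3, 4, 5],
    "star_wars_fan": [True, np.nan, False, np.nan, True],
})

rows_before = len(sample)

# Same filter as in the notebook: keep rows with a stated fan status.
kept = sample[sample["star_wars_fan"].notnull()].copy()
rows_removed = rows_before - len(kept)

print(f"Removed {rows_removed} of {rows_before} rows ({rows_removed / rows_before:.0%})")
```

Recording the row count before filtering is the only extra step needed to run the same check on `star_wars`.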