Star Wars Survey

Reading in the data

In [1]:
import pandas as pd
import numpy as np
star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')

Exploring the data set

In [2]:
star_wars.head(10)
Out[2]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? Which of the following Star Wars films have you seen? Please select all that apply. Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. ... Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
0 NaN Response Response Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi Star Wars: Episode I The Phantom Menace ... Yoda Response Response Response Response Response Response Response Response Response
1 3.292880e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3 ... Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
2 3.292880e+09 No NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.292765e+09 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN 1 ... Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
4 3.292763e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 ... Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
5 3.292731e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 ... Somewhat favorably Greedo Yes No No Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
6 3.292719e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 1 ... Very favorably Han Yes No Yes Male 18-29 $25,000 - $49,999 Bachelor degree Middle Atlantic
7 3.292685e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 6 ... Very favorably Han Yes No No Male 18-29 NaN High school degree East North Central
8 3.292664e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 4 ... Very favorably Han No NaN Yes Male 18-29 NaN High school degree South Atlantic
9 3.292654e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 ... Somewhat favorably Han No NaN No Male 18-29 $0 - $24,999 Some college or Associate degree South Atlantic

10 rows × 38 columns

In [3]:
print(star_wars.columns)
Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?ξ',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')
In [4]:
star_wars.shape
Out[4]:
(1187, 38)

Data cleaning

we will start by removing the null values in the RespondentID since it's meant to have a unique number

In [5]:
star_wars['RespondentID'].notnull().sum()
Out[5]:
1186
In [6]:
star_wars = star_wars[star_wars['RespondentID'].notnull()]
In [7]:
star_wars.shape
Out[7]:
(1186, 38)

We will convert the next few columns from Yes/No to True/False to make it easier to work with. After that we will rename the columns that pertains to star wars seen and ranking so that it can easily be comprehended.

In [8]:
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)
Out[8]:
Yes    936
No     250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
In [9]:
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
Out[9]:
Yes    552
NaN    350
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
In [10]:
yes_no = {'Yes': True, 'No': False}

star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(yes_no)
In [11]:
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(yes_no)
In [12]:
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
Out[12]:
True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
In [13]:
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)
Out[13]:
True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
In [14]:
star_wars[star_wars.columns[3]].value_counts(dropna=False)
Out[14]:
Star Wars: Episode I  The Phantom Menace    673
NaN                                         513
Name: Which of the following Star Wars films have you seen? Please select all that apply., dtype: int64
In [15]:
dic_map = {'Star Wars: Episode I  The Phantom Menace': True, 'Star Wars: Episode II  Attack of the Clones': True, 'Star Wars: Episode III  Revenge of the Sith': True, 'Star Wars: Episode IV  A New Hope': True, 'Star Wars: Episode V The Empire Strikes Back': True, 'Star Wars: Episode VI Return of the Jedi': True, np.NaN: False}

for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(dic_map)
    #print(star_wars[col])
In [16]:
star_wars[star_wars.columns[8]].value_counts(dropna=False)
Out[16]:
True     738
False    448
Name: Unnamed: 8, dtype: int64
In [17]:
print(star_wars.columns[3:9])
Index(['Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8'],
      dtype='object')
In [18]:
star_wars = star_wars.rename(columns={'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1', 'Unnamed: 4': 'seen_2', 'Unnamed: 5': 'seen_3', 'Unnamed: 6': 'seen_4', 'Unnamed: 7': 'seen_5', 'Unnamed: 8': 'seen_6'})
In [19]:
print(star_wars.columns[3:9])
Index(['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6'], dtype='object')
In [20]:
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
In [21]:
print(star_wars.columns[9:15])
Index(['Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14'],
      dtype='object')
In [22]:
star_wars = star_wars.rename(columns={'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': 'ranking_1', 'Unnamed: 10': 'ranking_2', 'Unnamed: 11': 'ranking_3', 'Unnamed: 12': 'ranking_4', 'Unnamed: 13': 'ranking_5', 'Unnamed: 14': 'ranking_6'})
In [23]:
print(star_wars.columns[9:15])
Index(['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5',
       'ranking_6'],
      dtype='object')

Analyze data

we will start by computing the mean of the ranking columns and making a bar chart of each. Then we proceed to computing the sum of the seen columns and plotting a bar chart of each.

In [24]:
%matplotlib inline

ranking_mean = star_wars.iloc[:,9:15].mean()
ranking_mean.plot(kind='bar', title='Mean rankings', ylim=(0,5))
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f81f4063f28>

From the above chart we can see that the highest ranked star wars movie is Star Wars: Episode V The Empire Strikes Back since it has the lowest mean score and the least ranked is Star Wars: Episode III Revenge of the Sith since it has the highest mean score.

In [25]:
seen_sum = star_wars.iloc[:,3:9].sum()
print(seen_sum)
seen_sum.plot(kind='bar', title='Sum of each seen Star Wars movies', ylim=(500,800))
seen_1    673
seen_2    571
seen_3    550
seen_4    607
seen_5    758
seen_6    738
dtype: int64
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f81f1d43978>

As can be seen from the chart, Star Wars: Episode V The Empire Strikes Back is the most seen which i believe should be as a result of the high ranking. and the least seen unsuprisingly is Star Wars: Episode III Revenge of the Sith which should be as a result of the low rank it received.

Split the data into two groups by gender

Let's split the data into two groups by gender and reperform our analysis to see if there will be any interesting pattern.

In [26]:
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]
In [27]:
males_ranking_mean = males.iloc[:,9:15].mean()
print(males_ranking_mean)
males_ranking_mean.plot(kind='bar', title='Mean rankings by Males', ylim=(0,5))
ranking_1    4.037825
ranking_2    4.224586
ranking_3    4.274882
ranking_4    2.997636
ranking_5    2.458629
ranking_6    3.002364
dtype: float64
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f81f58f9c88>
In [28]:
females_ranking_mean = females.iloc[:,9:15].mean()
print(females_ranking_mean)
females_ranking_mean.plot(kind='bar', title='Mean rankings by Females', ylim=(0,5))
ranking_1    3.429293
ranking_2    3.954660
ranking_3    4.418136
ranking_4    3.544081
ranking_5    2.569270
ranking_6    3.078086
dtype: float64
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f81f1c901d0>
In [29]:
males_seen_sum = males.iloc[:,3:9].sum()
print(males_seen_sum)
males_seen_sum.plot(kind='bar', title='Sum of each seen Star Wars movies by Males', ylim=(200,400))
seen_1    361
seen_2    323
seen_3    317
seen_4    342
seen_5    392
seen_6    387
dtype: int64
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f81f1be0668>
In [30]:
females_seen_sum = females.iloc[:,3:9].sum()
print(females_seen_sum)
females_seen_sum.plot(kind='bar', title='Sum of each seen Star Wars movies by Females', ylim=(200,400))
seen_1    298
seen_2    237
seen_3    222
seen_4    255
seen_5    353
seen_6    338
dtype: int64
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f81f1c01be0>

Performing the analysis by splitting the data into two groups by gender did not change the pattern of the results we received for the highest ranked and the most seen star wars movies.