Star Wars Survey

Reading in the data

In [1]:

import pandas as pd
import numpy as np
star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')

Exploring the data set

In [2]:

star_wars.head(10)

Out[2]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	...	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
0	NaN	Response	Response	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	Star Wars: Episode I The Phantom Menace	...	Yoda	Response	Response	Response	Response	Response	Response	Response	Response	Response
1	3.292880e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	3	...	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
2	3.292880e+09	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
3	3.292765e+09	Yes	No	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN	1	...	Unfamiliar (N/A)	I don't understand this question	No	NaN	No	Male	18-29	$0 - $24,999	High school degree	West North Central
4	3.292763e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5	...	Very favorably	I don't understand this question	No	NaN	Yes	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
5	3.292731e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5	...	Somewhat favorably	Greedo	Yes	No	No	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
6	3.292719e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	1	...	Very favorably	Han	Yes	No	Yes	Male	18-29	$25,000 - $49,999	Bachelor degree	Middle Atlantic
7	3.292685e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	6	...	Very favorably	Han	Yes	No	No	Male	18-29	NaN	High school degree	East North Central
8	3.292664e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	4	...	Very favorably	Han	No	NaN	Yes	Male	18-29	NaN	High school degree	South Atlantic
9	3.292654e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5	...	Somewhat favorably	Han	No	NaN	No	Male	18-29	$0 - $24,999	Some college or Associate degree	South Atlantic

10 rows × 38 columns

In [3]:

print(star_wars.columns)

Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

In [4]:

star_wars.shape

Out[4]:

(1187, 38)

Data cleaning

We will start by removing the rows with a null RespondentID, since every respondent is meant to have a unique ID. In this file the only such row is the sub-header row shown at index 0 above.
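
A minimal sketch of this filtering step on a toy frame. The values below are made up for illustration and are not taken from the survey. `notnull()` builds a Boolean mask, and `.copy()` is an optional extra that gives an independent frame so later column assignments do not trigger a `SettingWithCopyWarning`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the first row mimics a sub-header row,
# which has no RespondentID.
toy = pd.DataFrame({
    "RespondentID": [np.nan, 101.0, 102.0],
    "Gender": ["Response", "Male", "Female"],
})

# True where an ID is present, False where it is missing.
mask = toy["RespondentID"].notnull()

# Keep only the rows with an ID.
cleaned = toy[mask].copy()

print(mask.sum())     # number of rows that will be kept
print(cleaned.shape)  # (rows, columns) after filtering
```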

In [5]:

star_wars['RespondentID'].notnull().sum()

Out[5]:

1186

In [6]:

# .copy() keeps later column assignments from raising SettingWithCopyWarning
star_wars = star_wars[star_wars['RespondentID'].notnull()].copy()

In [7]:

star_wars.shape

Out[7]:

(1186, 38)

We will convert the next two columns from Yes/No to True/False to make them easier to work with. After that, we will convert the "seen" columns to Booleans and the ranking columns to floats, and rename both sets of columns so that they are easier to understand.
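
As a small illustration of the Yes/No conversion, here is a sketch on a made-up Series (the answers are hypothetical). Note that `Series.map` with a dictionary leaves any value that is not a key, including a missing answer, as NaN.

```python
import numpy as np
import pandas as pd

yes_no = {"Yes": True, "No": False}

# Hypothetical answers, including one skipped question.
answers = pd.Series(["Yes", "No", np.nan, "Yes"])

# Values found in the dictionary are replaced; everything else becomes NaN.
mapped = answers.map(yes_no)

print(mapped.value_counts(dropna=False))
```

This is why the fan question still shows NaN entries after mapping: respondents who skipped it have no Yes/No value to translate.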

In [8]:

star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)

Out[8]:

Yes    936
No     250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64

In [9]:

star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)

Out[9]:

Yes    552
NaN    350
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64

In [10]:

yes_no = {'Yes': True, 'No': False}

star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(yes_no)

In [11]:

star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(yes_no)

In [12]:

star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)

Out[12]:

True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64

In [13]:

star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)

Out[13]:

True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64

In [14]:

star_wars[star_wars.columns[3]].value_counts(dropna=False)

Out[14]:

Star Wars: Episode I  The Phantom Menace    673
NaN                                         513
Name: Which of the following Star Wars films have you seen? Please select all that apply., dtype: int64

In [15]:

# A film title means the respondent has seen it; a missing answer means they have not.
dic_map = {
    'Star Wars: Episode I  The Phantom Menace': True,
    'Star Wars: Episode II  Attack of the Clones': True,
    'Star Wars: Episode III  Revenge of the Sith': True,
    'Star Wars: Episode IV  A New Hope': True,
    'Star Wars: Episode V The Empire Strikes Back': True,
    'Star Wars: Episode VI Return of the Jedi': True,
    np.nan: False,  # np.NaN was removed in NumPy 2.0
}

for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(dic_map)

In [16]:

star_wars[star_wars.columns[8]].value_counts(dropna=False)

Out[16]:

True     738
False    448
Name: Unnamed: 8, dtype: int64

In [17]:

print(star_wars.columns[3:9])

Index(['Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8'],
      dtype='object')

In [18]:

star_wars = star_wars.rename(columns={'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1', 'Unnamed: 4': 'seen_2', 'Unnamed: 5': 'seen_3', 'Unnamed: 6': 'seen_4', 'Unnamed: 7': 'seen_5', 'Unnamed: 8': 'seen_6'})

In [19]:

print(star_wars.columns[3:9])

Index(['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6'], dtype='object')

In [20]:

star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)

In [21]:

print(star_wars.columns[9:15])

Index(['Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14'],
      dtype='object')

In [22]:

star_wars = star_wars.rename(columns={'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': 'ranking_1', 'Unnamed: 10': 'ranking_2', 'Unnamed: 11': 'ranking_3', 'Unnamed: 12': 'ranking_4', 'Unnamed: 13': 'ranking_5', 'Unnamed: 14': 'ranking_6'})

In [23]:

print(star_wars.columns[9:15])

Index(['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5',
       'ranking_6'],
      dtype='object')
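
The two rename dictionaries above were typed out by hand. As an alternative sketch, the same kind of mapping can be generated from column positions. `position_names` is a hypothetical helper written for this example, and the toy column names are made up.

```python
import pandas as pd

def position_names(columns, start, stop, prefix):
    """Map columns[start:stop] to prefix_1, prefix_2, ... (hypothetical helper)."""
    return {
        old: f"{prefix}_{i}"
        for i, old in enumerate(columns[start:stop], start=1)
    }

# Toy frame with one question column followed by two unnamed ones.
toy = pd.DataFrame(
    columns=["RespondentID", "Which films have you seen?", "Unnamed: 2", "Unnamed: 3"]
)

toy = toy.rename(columns=position_names(toy.columns, 1, 4, "seen"))
print(list(toy.columns))
```

On the survey frame, a call such as `position_names(star_wars.columns, 3, 9, 'seen')` would be expected to build the same dictionary that was written out manually for the seen columns.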

Analyze data

We will start by computing the mean of each ranking column and plotting the means as a bar chart. Then we will compute the sum of each seen column, which counts the respondents who have seen that film, and plot those counts as a bar chart as well.
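
Before running this on the survey, here is a minimal sketch of both aggregations on made-up data. It selects columns by name rather than by position, which is an assumption of this example and not what the cells below do. `mean()` skips NaN rankings, and `sum()` on a Boolean column counts the True values.

```python
import numpy as np
import pandas as pd

# Hypothetical respondents; the second one ranked nothing.
toy = pd.DataFrame({
    "seen_1": [True, False, True],
    "seen_2": [True, True, True],
    "ranking_1": [2.0, np.nan, 2.0],
    "ranking_2": [1.0, np.nan, 1.0],
})

# Count of respondents who have seen each film (True counts as 1).
seen_counts = toy[["seen_1", "seen_2"]].sum()

# Mean ranking per film; NaN rows are ignored.
mean_ranks = toy[["ranking_1", "ranking_2"]].mean()

# A lower mean ranking means a better-liked film.
best = mean_ranks.idxmin()

print(seen_counts)
print(mean_ranks)
print(best)
```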

In [24]:

%matplotlib inline

ranking_mean = star_wars.iloc[:,9:15].mean()
ranking_mean.plot(kind='bar', title='Mean rankings', ylim=(0,5))

Out[24]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f81f4063f28>

From the chart above, we can see that the highest-ranked Star Wars film is Star Wars: Episode V The Empire Strikes Back, since it has the lowest mean ranking (1 is the favorite), and the lowest-ranked is Star Wars: Episode III Revenge of the Sith, since it has the highest mean ranking.

In [25]:

seen_sum = star_wars.iloc[:,3:9].sum()
print(seen_sum)
seen_sum.plot(kind='bar', title='Number of respondents who have seen each Star Wars movie', ylim=(500,800))

seen_1    673
seen_2    571
seen_3    550
seen_4    607
seen_5    758
seen_6    738
dtype: int64

Out[25]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f81f1d43978>

As can be seen from the chart, Star Wars: Episode V The Empire Strikes Back is the most-seen film (758 respondents), which I believe is related to its high ranking. The least-seen film, unsurprisingly, is Star Wars: Episode III Revenge of the Sith (550 respondents), which also received the lowest ranking.

Split the data into two groups by gender

Let's split the data into two groups by gender and repeat our analysis to see whether any interesting patterns emerge.

In [26]:

males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]
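
The two Boolean filters above can also be expressed as a single `groupby`. The sketch below uses made-up data to show the idea. By default `groupby` drops rows whose key is NaN, which matches the behaviour of the equality filters.

```python
import numpy as np
import pandas as pd

# Hypothetical respondents; the last one did not state a gender.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", np.nan],
    "seen_1": [True, True, False, True, True],
    "ranking_1": [3.0, 1.0, 5.0, 2.0, 4.0],
})

by_gender = toy.groupby("Gender")

# One row per gender instead of one variable per gender.
mean_rank_by_gender = by_gender["ranking_1"].mean()
seen_by_gender = by_gender["seen_1"].sum()

print(mean_rank_by_gender)
print(seen_by_gender)
```

Keeping both groups in one object makes it easier to plot or compare them side by side, at the cost of being slightly less explicit than two named frames.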

In [27]:

males_ranking_mean = males.iloc[:,9:15].mean()
print(males_ranking_mean)
males_ranking_mean.plot(kind='bar', title='Mean rankings by Males', ylim=(0,5))

ranking_1    4.037825
ranking_2    4.224586
ranking_3    4.274882
ranking_4    2.997636
ranking_5    2.458629
ranking_6    3.002364
dtype: float64

Out[27]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f81f58f9c88>

In [28]:

females_ranking_mean = females.iloc[:,9:15].mean()
print(females_ranking_mean)
females_ranking_mean.plot(kind='bar', title='Mean rankings by Females', ylim=(0,5))

ranking_1    3.429293
ranking_2    3.954660
ranking_3    4.418136
ranking_4    3.544081
ranking_5    2.569270
ranking_6    3.078086
dtype: float64

Out[28]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f81f1c901d0>

In [29]:

males_seen_sum = males.iloc[:,3:9].sum()
print(males_seen_sum)
males_seen_sum.plot(kind='bar', title='Number of male respondents who have seen each Star Wars movie', ylim=(200,400))

seen_1    361
seen_2    323
seen_3    317
seen_4    342
seen_5    392
seen_6    387
dtype: int64

Out[29]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f81f1be0668>

In [30]:

females_seen_sum = females.iloc[:,3:9].sum()
print(females_seen_sum)
females_seen_sum.plot(kind='bar', title='Number of female respondents who have seen each Star Wars movie', ylim=(200,400))

seen_1    298
seen_2    237
seen_3    222
seen_4    255
seen_5    353
seen_6    338
dtype: int64

Out[30]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f81f1c01be0>

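Splitting the data into two groups by gender did not change the overall pattern: for both males and females, Star Wars: Episode V The Empire Strikes Back is the highest-ranked and most-seen film, and Star Wars: Episode III Revenge of the Sith is the lowest-ranked and least-seen. There is one visible difference in the middle of the ranking: females rank Episode I (mean 3.43) slightly ahead of Episode IV (mean 3.54), while males rank Episode IV (mean 3.00) well ahead of Episode I (mean 4.04).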
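
To make the comparison easier to check, the mean rankings printed in cells 27 and 28 can be placed side by side. This is only a sketch: the numbers are copied from those outputs, and the column name `female_minus_male` is introduced here for illustration.

```python
import pandas as pd

films = ["ranking_1", "ranking_2", "ranking_3",
         "ranking_4", "ranking_5", "ranking_6"]

# Values copied from the printed outputs of cells 27 and 28.
males_mean = pd.Series(
    [4.037825, 4.224586, 4.274882, 2.997636, 2.458629, 3.002364], index=films)
females_mean = pd.Series(
    [3.429293, 3.954660, 4.418136, 3.544081, 2.569270, 3.078086], index=films)

comparison = pd.DataFrame({"male": males_mean, "female": females_mean})

# Negative values mean females ranked the film better (lower) than males did.
comparison["female_minus_male"] = comparison["female"] - comparison["male"]

print(comparison.round(2))
print(comparison[["male", "female"]].idxmin())  # best-ranked film per group
print(comparison[["male", "female"]].idxmax())  # worst-ranked film per group
```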