Star Wars Survey Project

FiveThirtyEight conducted an online survey to find out more about Star Wars fans. Their main question was: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?

They received 835 survey responses, downloadable from GitHub.

First, we'll read the data into a pandas dataframe and begin our investigations.

In [1]:
import pandas as pd
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
star_wars
Out[1]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? Which of the following Star Wars films have you seen? Please select all that apply. Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. ... Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
0 NaN Response Response Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi Star Wars: Episode I The Phantom Menace ... Yoda Response Response Response Response Response Response Response Response Response
1 3.292880e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3 ... Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
2 3.292880e+09 No NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.292765e+09 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN 1 ... Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
4 3.292763e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 ... Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1182 3.288389e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 ... Very favorably Han No NaN Yes Female 18-29 $0 - $24,999 Some college or Associate degree East North Central
1183 3.288379e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 4 ... Very favorably I don't understand this question No NaN Yes Female 30-44 $50,000 - $99,999 Bachelor degree Mountain
1184 3.288375e+09 No NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN No Female 30-44 $50,000 - $99,999 Bachelor degree Middle Atlantic
1185 3.288373e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 4 ... Very favorably Han No NaN Yes Female 45-60 $100,000 - $149,999 Some college or Associate degree East North Central
1186 3.288373e+09 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones NaN NaN Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 6 ... Very unfavorably I don't understand this question No NaN No Female > 60 $50,000 - $99,999 Graduate degree Pacific

1187 rows × 38 columns

In [2]:
star_wars.shape
Out[2]:
(1187, 38)
In [3]:
star_wars.columns
Out[3]:
Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?ξ',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

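Cleaning up RespondentID

RespondentID should be a unique ID for each respondent, so a row without one is not a valid response. The first row, for example, has a null RespondentID because it holds sub-headers rather than a response. Let's remove every row where RespondentID is null.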

In [4]:
star_wars = star_wars[pd.notnull(star_wars['RespondentID'])]
# drop rows with null RespondentID
In [5]:
star_wars.shape
Out[5]:
(1186, 38)
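
The cell above removes null IDs but does not confirm that the remaining IDs are unique. A minimal sketch of such a check follows, assuming a small made-up DataFrame named `toy` (hypothetical, not the survey data); `Series.is_unique` and `Series.duplicated` are standard pandas calls.

```python
import pandas as pd

# Toy stand-in for the survey data; the values are made up for illustration.
toy = pd.DataFrame({
    "RespondentID": [float("nan"), 3292879998.0, 3292879538.0, 3292765271.0],
    "Gender": ["Response", "Male", "Male", "Female"],
})

# Drop rows with a null RespondentID, mirroring the filtering step above.
toy = toy[toy["RespondentID"].notnull()]

# True when no ID appears more than once.
ids_unique = toy["RespondentID"].is_unique

# Number of rows whose ID repeats an earlier row's ID.
n_duplicates = toy["RespondentID"].duplicated().sum()

print(ids_unique, n_duplicates)
```

Running the same two checks on `star_wars["RespondentID"]` would show whether any respondent appears twice.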

Changing 'Yes'/'No' data to boolean

The next two columns are:

  • Have you seen any of the 6 films in the Star Wars franchise?
  • Do you consider yourself to be a fan of the Star Wars film franchise?

These are populated with 'Yes', 'No' or NaN. Let's convert these to boolean to make filtering easier later.

In [6]:
bool_map = {'Yes': True, 'No': False}
# dictionary to define our mapping
In [7]:
for col in [
    "Have you seen any of the 6 films in the Star Wars franchise?",
    "Do you consider yourself to be a fan of the Star Wars film franchise?"
    ]:
    star_wars[col] = star_wars[col].map(bool_map)
<ipython-input-7-3b955894cb79>:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  star_wars[col] = star_wars[col].map(bool_map)
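
The SettingWithCopyWarning above most likely appears because `star_wars` was produced by filtering another DataFrame, so pandas cannot tell whether the assignment targets a copy. One common way to avoid it is to take an explicit copy right after filtering. The sketch below assumes a made-up DataFrame `toy` with a hypothetical `seen_any` column.

```python
import pandas as pd

# Toy stand-in for the survey data; column names and values are illustrative.
toy = pd.DataFrame({
    "RespondentID": [float("nan"), 1.0, 2.0, 3.0],
    "seen_any": ["Response", "Yes", "No", "Yes"],
})

# .copy() makes the filtered result an independent DataFrame,
# so assigning to its columns later does not trigger the warning.
toy = toy[toy["RespondentID"].notnull()].copy()

# Same Yes/No mapping as in the notebook.
bool_map = {"Yes": True, "No": False}
toy["seen_any"] = toy["seen_any"].map(bool_map)

print(toy)
```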
In [8]:
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna = False)
Out[8]:
True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
In [9]:
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna = False)
Out[9]:
True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64

Tidying up column names and data for films that have been watched

In [10]:
cols = star_wars.columns[3:9]
In [11]:
star_wars[cols]
Out[11]:
Which of the following Star Wars films have you seen? Please select all that apply. Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8
1 Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi
2 NaN NaN NaN NaN NaN NaN
3 Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN
4 Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi
5 Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi
... ... ... ... ... ... ...
1182 Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi
1183 Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi
1184 NaN NaN NaN NaN NaN NaN
1185 Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi
1186 Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones NaN NaN Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi

1186 rows × 6 columns

In [12]:
import numpy as np
bool_map = {
    "Star Wars: Episode I  The Phantom Menace": True,
    "Star Wars: Episode II  Attack of the Clones": True,
    "Star Wars: Episode III  Revenge of the Sith": True,
    "Star Wars: Episode IV  A New Hope": True,
    "Star Wars: Episode V The Empire Strikes Back": True,
    "Star Wars: Episode VI Return of the Jedi": True,
    np.nan: False
}

for col in cols:
    star_wars[col] = star_wars[col].map(bool_map)
<ipython-input-12-d87bf786c603>:13: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  star_wars[col] = star_wars[col].map(bool_map)
In [13]:
star_wars[cols]
Out[13]:
Which of the following Star Wars films have you seen? Please select all that apply. Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8
1 True True True True True True
2 False False False False False False
3 True True True False False False
4 True True True True True True
5 True True True True True True
... ... ... ... ... ... ...
1182 True True True True True True
1183 True True True True True True
1184 False False False False False False
1185 True True True True True True
1186 True True False False True True

1186 rows × 6 columns
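
Because each of these columns holds either a film title or NaN, the same True/False result could also be obtained without spelling out the titles (some of which contain double spaces in the raw data). The following is a sketch of that alternative on a made-up DataFrame `toy` with hypothetical column names, not a replacement for the mapping used above.

```python
import pandas as pd

# Toy stand-in: a title means the film was seen, a missing value means it was not.
toy = pd.DataFrame({
    "film_a": [
        "Star Wars: Episode I  The Phantom Menace",
        None,
        "Star Wars: Episode I  The Phantom Menace",
    ],
    "film_b": [
        None,
        None,
        "Star Wars: Episode II  Attack of the Clones",
    ],
})

# notna() is True wherever a value is present and False wherever it is missing.
seen = toy.notna()

print(seen)
```

The dictionary approach has the advantage of turning any unexpected value into NaN, whereas `notna()` treats every non-null value as "seen".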

In [14]:
star_wars.columns
Out[14]:
Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?ξ',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')
In [15]:
col_map = {}
for i in range(0, 6):
    col_map[star_wars.columns[i+3]] = 'seen_{}'.format(i+1)

col_map
Out[15]:
{'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1',
 'Unnamed: 4': 'seen_2',
 'Unnamed: 5': 'seen_3',
 'Unnamed: 6': 'seen_4',
 'Unnamed: 7': 'seen_5',
 'Unnamed: 8': 'seen_6'}
In [16]:
star_wars = star_wars.rename(columns = col_map)
In [17]:
star_wars
Out[17]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. ... Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
1 3.292880e+09 True True True True True True True True 3 ... Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
2 3.292880e+09 False NaN False False False False False False NaN ... NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.292765e+09 True False True True True False False False 1 ... Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
4 3.292763e+09 True True True True True True True True 5 ... Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
5 3.292731e+09 True True True True True True True True 5 ... Somewhat favorably Greedo Yes No No Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1182 3.288389e+09 True True True True True True True True 5 ... Very favorably Han No NaN Yes Female 18-29 $0 - $24,999 Some college or Associate degree East North Central
1183 3.288379e+09 True True True True True True True True 4 ... Very favorably I don't understand this question No NaN Yes Female 30-44 $50,000 - $99,999 Bachelor degree Mountain
1184 3.288375e+09 False NaN False False False False False False NaN ... NaN NaN NaN NaN No Female 30-44 $50,000 - $99,999 Bachelor degree Middle Atlantic
1185 3.288373e+09 True True True True True True True True 4 ... Very favorably Han No NaN Yes Female 45-60 $100,000 - $149,999 Some college or Associate degree East North Central
1186 3.288373e+09 True False True True False False True True 6 ... Very unfavorably I don't understand this question No NaN No Female > 60 $50,000 - $99,999 Graduate degree Pacific

1186 rows × 38 columns
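
The renaming loop above can also be written as a dictionary comprehension. This sketch assumes an empty made-up DataFrame `toy` whose column names are hypothetical and only mimic the layout of the survey columns.

```python
import pandas as pd

# Toy stand-in with one ID column followed by three "film seen" columns.
toy = pd.DataFrame(
    columns=["RespondentID", "Which films have you seen?", "Unnamed: 2", "Unnamed: 3"]
)

# Build the old-name -> new-name mapping, numbering the films from 1.
col_map = {
    old_name: f"seen_{n}"
    for n, old_name in enumerate(toy.columns[1:4], start=1)
}

toy = toy.rename(columns=col_map)

print(list(toy.columns))
```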

In [18]:
cols = star_wars.columns[9:15]
star_wars[cols]
Out[18]:
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14
1 3 2 1 4 5 6
2 NaN NaN NaN NaN NaN NaN
3 1 2 3 4 5 6
4 5 6 1 2 4 3
5 5 4 6 2 1 3
... ... ... ... ... ... ...
1182 5 4 6 3 2 1
1183 4 5 6 2 3 1
1184 NaN NaN NaN NaN NaN NaN
1185 4 3 6 5 2 1
1186 6 1 2 3 4 5

1186 rows × 6 columns

Tidying up ranking columns

We'll change the data types to float, and rename the columns.

In [19]:
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
In [20]:
star_wars[cols]
Out[20]:
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14
1 3.0 2.0 1.0 4.0 5.0 6.0
2 NaN NaN NaN NaN NaN NaN
3 1.0 2.0 3.0 4.0 5.0 6.0
4 5.0 6.0 1.0 2.0 4.0 3.0
5 5.0 4.0 6.0 2.0 1.0 3.0
... ... ... ... ... ... ...
1182 5.0 4.0 6.0 3.0 2.0 1.0
1183 4.0 5.0 6.0 2.0 3.0 1.0
1184 NaN NaN NaN NaN NaN NaN
1185 4.0 3.0 6.0 5.0 2.0 1.0
1186 6.0 1.0 2.0 3.0 4.0 5.0

1186 rows × 6 columns

In [21]:
col_map = {}
for i in range(9, 15):
    col_map[star_wars.columns[i]] = 'ranking_{}'.format(i-8)

col_map
Out[21]:
{'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': 'ranking_1',
 'Unnamed: 10': 'ranking_2',
 'Unnamed: 11': 'ranking_3',
 'Unnamed: 12': 'ranking_4',
 'Unnamed: 13': 'ranking_5',
 'Unnamed: 14': 'ranking_6'}
In [22]:
star_wars = star_wars.rename(columns = col_map)
star_wars[star_wars.columns[9:15]]
Out[22]:
ranking_1 ranking_2 ranking_3 ranking_4 ranking_5 ranking_6
1 3.0 2.0 1.0 4.0 5.0 6.0
2 NaN NaN NaN NaN NaN NaN
3 1.0 2.0 3.0 4.0 5.0 6.0
4 5.0 6.0 1.0 2.0 4.0 3.0
5 5.0 4.0 6.0 2.0 1.0 3.0
... ... ... ... ... ... ...
1182 5.0 4.0 6.0 3.0 2.0 1.0
1183 4.0 5.0 6.0 2.0 3.0 1.0
1184 NaN NaN NaN NaN NaN NaN
1185 4.0 3.0 6.0 5.0 2.0 1.0
1186 6.0 1.0 2.0 3.0 4.0 5.0

1186 rows × 6 columns

Visualizing mean rankings

In [23]:
means = star_wars[star_wars.columns[9:15]].mean()
means
Out[23]:
ranking_1    3.732934
ranking_2    4.087321
ranking_3    4.341317
ranking_4    3.272727
ranking_5    2.513158
ranking_6    3.047847
dtype: float64
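
Since 1 is the most favorable ranking, the film with the lowest mean is the best liked. As a sketch, the means printed above can be relabeled by episode and sorted to make this easier to read. It assumes that `ranking_1` to `ranking_6` correspond to Episodes I to VI in order, which the sub-header row of the raw data suggests; the label names are illustrative.

```python
import pandas as pd

# Mean rankings copied from the output above.
means = pd.Series({
    "ranking_1": 3.732934,
    "ranking_2": 4.087321,
    "ranking_3": 4.341317,
    "ranking_4": 3.272727,
    "ranking_5": 2.513158,
    "ranking_6": 3.047847,
})

# Assumed mapping: ranking_n refers to Episode n.
numerals = ["I", "II", "III", "IV", "V", "VI"]
episode_labels = {
    f"ranking_{i}": f"Episode {numeral}"
    for i, numeral in enumerate(numerals, start=1)
}

# Relabel and sort so the best-ranked film (lowest mean) comes first.
labelled = means.rename(index=episode_labels).sort_values()
best = labelled.idxmin()

print(labelled)
print(best)
```

Calling `labelled.plot.bar()` would then give a chart with episode names on the x-axis.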
In [24]:
import matplotlib.pyplot as plt
%matplotlib inline

means.plot.bar(title = 'Mean rankings for Star Wars films')
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x23ba72aaac0>

Visualizing viewing figures

In [25]:
viewing_figures = star_wars[star_wars.columns[3:9]].sum()
viewing_figures.plot.bar(title = 'Viewing figures for Star Wars Films')
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x23ba7a207f0>

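More people have seen Episodes IV, V and VI, and those films also tend to be ranked more favorably. Since 1 is the best possible ranking, a lower mean is better: Episode V has the lowest mean ranking (about 2.51) and Episode III the highest (about 4.34).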

In [26]:
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]
males_means = males[males.columns[9:15]].mean()
males_means.plot.bar(title = 'Mean rankings for Star Wars films - males')
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x23ba7a8b6d0>
In [27]:
females_means = females[females.columns[9:15]].mean()
females_means.plot.bar(title = 'Mean rankings for Star Wars films - females')
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x23ba7afc310>
In [28]:
males_viewing_figures = males[males.columns[3:9]].sum()
males_viewing_figures.plot.bar(title = 'Viewing figures for Star Wars Films - males')
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x23ba7ae89a0>
In [29]:
females_viewing_figures = females[females.columns[3:9]].sum()
females_viewing_figures.plot.bar(title = 'Viewing figures for Star Wars Films - females')
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x23ba7bc1b50>
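
The male and female splits above could also be computed in one step with `groupby`, which puts both groups in a single table for side-by-side comparison. This is a sketch on a made-up DataFrame `toy` with only two ranking and two viewing columns; the values are illustrative, not survey results.

```python
import pandas as pd

# Toy stand-in for the cleaned survey data.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", None],
    "ranking_1": [3.0, 5.0, 1.0, 4.0, 2.0],
    "ranking_5": [1.0, 2.0, 3.0, 1.0, 6.0],
    "seen_1": [True, True, False, True, True],
    "seen_5": [True, False, True, True, False],
})

ranking_cols = ["ranking_1", "ranking_5"]
seen_cols = ["seen_1", "seen_5"]

# Mean ranking per film for each gender; rows with a missing Gender are dropped.
mean_by_gender = toy.groupby("Gender")[ranking_cols].mean()

# Number of respondents of each gender who have seen each film.
seen_by_gender = toy.groupby("Gender")[seen_cols].sum()

print(mean_by_gender)
print(seen_by_gender)
```

Transposing either result and calling `.plot.bar()` on it would draw both genders on one chart instead of two.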