Star Wars Survey


AIM :

To verify if The Empire Strikes Back is really the best Star Wars movie out there


Data Description :

The data consists of responses collected by SurveyMonkey for a survey about the Star Wars movies conducted by the team at FiveThirtyEight. It includes, among others, the following columns:

Header : Description

  • RespondentID : An anonymized ID for the respondent (person taking the survey)
  • Gender : The respondent's gender
  • Age : The respondent's age
  • Household Income : The respondent's income
  • Education : The respondent's education level
  • Location (Census Region) : The respondent's location
  • Have you seen any of the 6 films in the Star Wars franchise? : Has a Yes or No response
  • Do you consider yourself to be a fan of the Star Wars film franchise? : Has a Yes or No response

They report 835 total responses (note that the file loaded below contains 1,186 rows). The data can be downloaded from their GitHub repository.


Data Cleaning :

In [1]:
## Let's first import all the required libraries :
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
In [2]:
## Importing the data into a dataframe :
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")

# Exploring the data :
star_wars.head(10)
Out[2]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? Which of the following Star Wars films have you seen? Please select all that apply. Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. ... Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe? Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
0 3292879998 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3.0 ... Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
1 3292879538 No NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
2 3292765271 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN 1.0 ... Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
3 3292763116 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5.0 ... Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
4 3292731220 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5.0 ... Somewhat favorably Greedo Yes No No Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
5 3292719380 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 1.0 ... Very favorably Han Yes No Yes Male 18-29 $25,000 - $49,999 Bachelor degree Middle Atlantic
6 3292684787 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 6.0 ... Very favorably Han Yes No No Male 18-29 NaN High school degree East North Central
7 3292663732 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 4.0 ... Very favorably Han No NaN Yes Male 18-29 NaN High school degree South Atlantic
8 3292654043 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5.0 ... Somewhat favorably Han No NaN No Male 18-29 $0 - $24,999 Some college or Associate degree South Atlantic
9 3292640424 Yes No NaN Star Wars: Episode II Attack of the Clones NaN NaN NaN NaN 1.0 ... Very favorably I don't understand this question No NaN No Male 18-29 $25,000 - $49,999 Some college or Associate degree Pacific

10 rows × 38 columns

In [3]:
## Let's see all the columns of the dataframe:
star_wars.columns
Out[3]:
Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

The columns Have you seen any of the 6 films in the Star Wars franchise? and Do you consider yourself to be a fan of the Star Wars film franchise? seem to contain Yes and No values. We can convert these into Boolean values, which makes the columns easier to filter and aggregate later on.

In [4]:
## First, let's verify our observation :
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)
Out[4]:
Yes    936
No     250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
In [5]:
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
Out[5]:
Yes    552
NaN    350
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
In [6]:
## Now that we have verified the values in the columns, let's convert them into boolean:
yes_no = {'Yes': True, 'No': False}
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(yes_no)
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)
Out[6]:
True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
In [7]:
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(yes_no)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
Out[7]:
True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
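
The two conversions above follow the same pattern, so they could also be written as one loop. The following is only a minimal sketch on toy data; the short column names seen_any and fan are hypothetical stand-ins for the two long survey questions.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the two Yes/No survey columns.
# The column names "seen_any" and "fan" are hypothetical.
demo = pd.DataFrame({
    "seen_any": ["Yes", "No", "Yes"],
    "fan": ["Yes", np.nan, "No"],
})

yes_no = {"Yes": True, "No": False}

# Map each Yes/No column in one pass.
# Values that are not keys of the dict (here: NaN) stay NaN.
for col in ["seen_any", "fan"]:
    demo[col] = demo[col].map(yes_no)

print(demo)
```

Keeping NaN as NaN preserves the difference between "answered No" and "did not answer", which matters for the fan question.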

We see above that some of the columns have headers like Unnamed: 4, or Which of the following Star Wars films have you seen? Please select all that apply. Let's change these headers into something shorter and more descriptive.

In [8]:
## Let's change the column headers :
headers = {'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1',
          'Unnamed: 4': 'seen_2', 'Unnamed: 5': 'seen_3', 'Unnamed: 6': 'seen_4', 'Unnamed: 7': 'seen_5',
           'Unnamed: 8': 'seen_6'}
star_wars = star_wars.rename(columns=headers)
# checking the headers
star_wars.columns
Out[8]:
Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

These six columns behave like checkboxes: each cell holds the name of the movie if the respondent ticked it and NaN otherwise. Rather than storing the movie names, we can change these columns to contain Boolean values indicating whether the respondent has seen the movie.

In [9]:
## The first row lists all six movies, so let's use it to build a dict that maps each title to True and NaN to False:
to_change = {star_wars.iloc[0,3] : True, star_wars.iloc[0,4] : True, star_wars.iloc[0,5] :True,
             star_wars.iloc[0,6] : True, star_wars.iloc[0,7] : True, star_wars.iloc[0,8] : True,
             np.nan : False
            }
for series in star_wars.columns[3:9]:
    star_wars[series] = star_wars[series].map(to_change)

# checking the values :
star_wars.head()
Out[9]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. ... Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe? Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
0 3292879998 True True True True True True True True 3.0 ... Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
1 3292879538 False NaN False False False False False False NaN ... NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
2 3292765271 True False True True True False False False 1.0 ... Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
3 3292763116 True True True True True True True True 5.0 ... Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
4 3292731220 True True True True True True True True 5.0 ... Somewhat favorably Greedo Yes No No Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central

5 rows × 38 columns
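
The dict above depends on row 0 happening to list all six films. A possible alternative, sketched here on toy data, is to treat any non-null cell as a ticked checkbox. That is an assumption about the data (it holds if the only values in these columns are a film title or NaN), not something the survey documentation states.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two of the checkbox columns (titles shortened).
demo = pd.DataFrame({
    "seen_1": ["Episode I", np.nan, np.nan],
    "seen_2": ["Episode II", np.nan, "Episode II"],
})

# Assumption: a non-null cell means the respondent ticked that film.
seen_cols = [c for c in demo.columns if c.startswith("seen_")]
for col in seen_cols:
    demo[col] = demo[col].notna()

print(demo)
```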

In [10]:
## let's look at the next 6 columns :
star_wars.iloc[:, 9:15]
Out[10]:
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14
0 3.0 2.0 1.0 4.0 5.0 6.0
1 NaN NaN NaN NaN NaN NaN
2 1.0 2.0 3.0 4.0 5.0 6.0
3 5.0 6.0 1.0 2.0 4.0 3.0
4 5.0 4.0 6.0 2.0 1.0 3.0
... ... ... ... ... ... ...
1181 5.0 4.0 6.0 3.0 2.0 1.0
1182 4.0 5.0 6.0 2.0 3.0 1.0
1183 NaN NaN NaN NaN NaN NaN
1184 4.0 3.0 6.0 5.0 2.0 1.0
1185 6.0 1.0 2.0 3.0 4.0 5.0

1186 rows × 6 columns

These columns hold the rank that each respondent has given to each of the six Star Wars movies. The headers of these columns are unintuitive, so let's change them into suitable headers.

In [11]:
## changing the headers:
i=1
for header in star_wars.columns[9:15]:
    new_header = 'ranking_{}'.format(i)
    star_wars = star_wars.rename(columns={header : new_header})
    i +=1

# checking the new headers:
star_wars.columns[9:15]
Out[11]:
Index(['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5',
       'ranking_6'],
      dtype='object')
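
The same renaming can be done with a single rename call by building the whole mapping first. This is only a sketch on a toy frame with three columns; the column names used below are placeholders for the real headers.

```python
import pandas as pd

# Toy frame with placeholder headers standing in for the ranking columns.
demo = pd.DataFrame(
    [[3.0, 2.0, 1.0]],
    columns=["Please rank the films", "Unnamed: 10", "Unnamed: 11"],
)

# Build the old -> new mapping once, then rename in one call.
new_names = {old: f"ranking_{i}" for i, old in enumerate(demo.columns, start=1)}
demo = demo.rename(columns=new_names)

print(list(demo.columns))
```

On the real dataframe the mapping would be built from star_wars.columns[9:15] instead of demo.columns.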
In [12]:
## Let's look at the column dtypes :
star_wars.iloc[:, 9:15].dtypes
Out[12]:
ranking_1    float64
ranking_2    float64
ranking_3    float64
ranking_4    float64
ranking_5    float64
ranking_6    float64
dtype: object

Data Analysis :

Which movie is ranked better :

In [13]:
rankings = star_wars.iloc[:, 9:15].mean()
rankings.plot.bar(title= 'Movie rankings (Lower is better!)', rot=0,color='red')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f675ab36b80>

We see that ranking_4 (Star Wars: Episode IV A New Hope), ranking_5 (Star Wars: Episode V The Empire Strikes Back) and ranking_6 (Star Wars: Episode VI Return of the Jedi) have the lowest mean rank, i.e. they have been rated the best out of the bunch. These movies are older and tend to have a staunch fan following.
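
To make a chart like the one above easier to read, the ranking_N labels can be replaced by film names before plotting. The sketch below uses only the four complete rows visible in the first rows of Out[10], so its result is not representative of the full survey; the short titles are abbreviations introduced here.

```python
import pandas as pd

# Abbreviated film titles (introduced here for readability).
titles = {
    "ranking_1": "Episode I",
    "ranking_2": "Episode II",
    "ranking_3": "Episode III",
    "ranking_4": "Episode IV",
    "ranking_5": "Episode V",
    "ranking_6": "Episode VI",
}

# Rows 0, 2, 3 and 4 of the ranking columns shown in Out[10].
sample = pd.DataFrame(
    [
        [3.0, 2.0, 1.0, 4.0, 5.0, 6.0],
        [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        [5.0, 6.0, 1.0, 2.0, 4.0, 3.0],
        [5.0, 4.0, 6.0, 2.0, 1.0, 3.0],
    ],
    columns=list(titles),
)

# Mean rank per film, relabelled and sorted so the best (lowest) comes first.
mean_rank = sample.mean().rename(index=titles).sort_values()
print(mean_rank)
```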

Number of people that have seen each movie :

In [14]:
movie_seen = star_wars.iloc[:, 3:9].sum()
movie_seen.plot.bar(title= 'How many people have seen the movie', rot=0)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f675a7aeb50>

We again see that the older movies have been watched more than the newer ones, adding to our findings from the rankings.

Genderwise distribution :

In [15]:
## grouping the data by gender and summing each column within every group:
star_gender = star_wars.groupby('Gender').agg(np.sum)

# Plotting the most watched film:
star_gender.iloc[:,1:8].plot.bar(title = 'Most watched film by gender')
plt.legend(loc='lower left', bbox_to_anchor=(1,0),borderaxespad=0)
plt.show()
In [16]:
# Plotting the best ranked film:
star_gender.iloc[:,8:].plot.bar(title = 'The best ranked film by gender (Lower is better!)')
plt.legend(loc='lower left', bbox_to_anchor=(1,0),borderaxespad=0)
plt.show()

By Gender :

  • More male than female respondents have watched the films.
  • The fifth film, Star Wars: Episode V The Empire Strikes Back, is the most popular film among both genders.
  • On average, people seem to gravitate towards the older films, as these have a higher view rate and are ranked better by the respondents.
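
Because the groups differ in size, summed values are hard to compare between genders. One possible refinement, sketched on toy data below, is to aggregate with the mean and to select columns by name rather than by position. The numbers are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Invented toy data: one "seen" column and one ranking column.
demo = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "seen_5": [True, True, True, False, True],
    "ranking_5": [1.0, 2.0, 2.0, np.nan, 3.0],
})

# The mean of a Boolean column is the share of True values;
# the mean of a ranking column ignores missing answers.
by_gender = demo.groupby("Gender").agg(
    share_seen_5=("seen_5", "mean"),
    mean_rank_5=("ranking_5", "mean"),
)
print(by_gender)
```

Shares and mean ranks are independent of group size, so they can be compared directly across the bars of a grouped chart.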

How have fans rated the movies :

In [17]:
## creating a subset dataframe that contains only fan values:
star_wars_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']== True]
print ('The number of fans of the Star wars franchise :', star_wars_fan.shape[0])

## grouping the data by fan status (True / False) and summing each column within every group:
star_fan = star_wars.groupby('Do you consider yourself to be a fan of the Star Wars film franchise?').agg(np.sum)

# Plotting the most watched film:
star_fan.iloc[:,1:8].plot.bar(title = 'Most watched film by fandom')
plt.legend(loc='lower left', bbox_to_anchor=(1,0),borderaxespad=0)
plt.show()
The number of fans of the Star wars franchise : 552
In [18]:
# Plotting the best ranked film:
star_fan.iloc[:,8:].plot.bar(title = 'The best ranked film by fandom (Lower is better!)')
plt.legend(loc='lower left', bbox_to_anchor=(1,0),borderaxespad=0)
plt.show()

As ranked by Star Trek fans :

In [19]:
## creating a subset dataframe that contains only Star Trek fans:
star_trek_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Trek franchise?']== 'Yes']

# Let's see the number of fans of the franchise:
print ('Number of people who are fans of the Star Trek franchise :', star_trek_fan.shape[0])

## grouping the data by Star Trek fan status (Yes / No) and summing each column within every group
## (note that this overwrites the subset created above):
star_trek_fan = star_wars.groupby('Do you consider yourself to be a fan of the Star Trek franchise?').agg(np.sum)

# Plotting the most watched film:
star_trek_fan.iloc[:,2:8].plot.bar(title = 'Most watched film by Star Trekkers')
plt.legend(loc='lower left', bbox_to_anchor=(1,0),borderaxespad=0)
plt.show()
Number of people who are fans of the Star Trek franchise : 427

Being a fan of the Star Wars franchise or being a fan of Star Trek doesn't seem to affect our analysis. Two films from the original trilogy, i.e. Star Wars: Episode V The Empire Strikes Back and Star Wars: Episode VI Return of the Jedi, are the most popular Star Wars films.

Location wise :

In [20]:
loc = star_wars['Location (Census Region)'].value_counts()
loc.plot.bar(title = 'Number of respondents based on location')
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f67597435b0>

We see that East North Central had the most respondents and East South Central had the fewest respondents for the survey.

In [21]:
star_group = star_wars.groupby('Location (Census Region)').agg(np.sum)
star_group = star_group.drop('RespondentID', axis=1)
In [22]:
# Let's plot the values to visually understand the result better:
star_group.iloc[:,1].plot(kind='bar', title= 'Number of films watched by location')
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f67596afb80>

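We see that the Pacific region has the largest number of respondents who have seen the films, which suggests that people there are more excited about Star Wars. On the other hand, East South Central has the fewest viewers. Note that these are raw counts, so they also reflect how many people responded from each region.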
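
Since the regions have different numbers of respondents, it may help to look at the share of viewers per region next to the raw count. A minimal sketch on invented data follows; the column names region and seen_any are hypothetical short names.

```python
import pandas as pd

# Invented toy data; "region" and "seen_any" are hypothetical column names.
demo = pd.DataFrame({
    "region": ["Pacific"] * 2 + ["East North Central"] * 4,
    "seen_any": [True, True, True, True, False, False],
})

# viewers: number of True values, respondents: rows per region,
# share: fraction of respondents in the region who have seen a film.
summary = demo.groupby("region")["seen_any"].agg(
    viewers="sum",
    respondents="count",
    share="mean",
)
print(summary)
```

In this toy example both regions have two viewers, but the shares differ (1.0 versus 0.5), which is exactly the distinction a raw count hides.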

In [23]:
star_group.iloc[:,1:7].plot(kind='bar', title= 'Films watched by location')
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f675abf2a90>

We again witness the popularity of the older films, as they have been watched more than the newer ones all over the country.

Which character shot first:

In [24]:
star_wars['Which character shot first?'].value_counts().plot.bar(title='Which character shot first')
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6759611df0>

The age-old ambiguity about who fired first still holds. Personally, though, I believe that Han fired first at Greedo in the cantina.
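
To judge how close the answers are, the counts can be turned into shares with value_counts(normalize=True). The answers below are invented toy data, not survey results.

```python
import pandas as pd

# Invented toy answers; None stands for a respondent who skipped the question.
answers = pd.Series([
    "Han",
    "Greedo",
    "Han",
    "I don't understand this question",
    None,
])

# normalize=True returns fractions instead of counts; missing answers are dropped.
shares = answers.value_counts(normalize=True)
print(shares)
```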


Conclusion :

Let's condense our findings from above :

  1. From our analysis we clearly see that Star Wars: Episode V The Empire Strikes Back and Star Wars: Episode VI Return of the Jedi have been rated the best out of the bunch (by both view rate and rankings). These movies are older and tend to have a staunch fan following (including me).

  2. Genderwise:

    • More men than women seem to follow the Star Wars franchise.
  3. Fandom Trivia :

    • Star Wars has more fans than Star Trek among the respondents (552 vs. 427), though not by a large margin.
    • Star Wars: Episode V The Empire Strikes Back and Star Wars: Episode VI Return of the Jedi are unbeatable!
  4. By Location :

    • The Pacific region has the largest number of respondents who have seen the films, which suggests that people there are more excited about Star Wars. On the other hand, East South Central has the fewest viewers.
  5. Who Shot First :