Star Wars Survey¶

AIM :¶

To verify if *The Empire Strikes Back* is really the best *Star Wars* movie out there

Data Description :¶

The data consists of responses collected through SurveyMonkey for a survey about the *Star Wars* movies conducted by the team at FiveThirtyEight. It includes the following columns:

Header	Description
`RespondentID`	An anonymized ID for the respondent (person taking the survey)
`Gender`	The respondent's gender
`Age`	The respondent's age
`Household Income`	The respondent's income
`Education`	The respondent's education level
`Location (Census Region)`	The respondent's location
`Have you seen any of the 6 films in the Star Wars franchise?`	Has a `Yes` or `No` response
`Do you consider yourself to be a fan of the Star Wars film franchise?`	Has a `Yes` or `No` response

They received 835 total responses, which can be downloaded from their GitHub repository. Note that the file loaded below has 1,186 rows.

Data Cleaning :¶

In [1]:

## Let's first import all the required libraries :
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 

In [2]:

## Importing the data into a dataframe :
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")

# Exploring the data :
star_wars.head(10)

Out[2]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	...	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
0	3292879998	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	3.0	...	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
1	3292879538	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
2	3292765271	Yes	No	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN	1.0	...	Unfamiliar (N/A)	I don't understand this question	No	NaN	No	Male	18-29	$0 - $24,999	High school degree	West North Central
3	3292763116	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5.0	...	Very favorably	I don't understand this question	No	NaN	Yes	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
4	3292731220	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5.0	...	Somewhat favorably	Greedo	Yes	No	No	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
5	3292719380	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	1.0	...	Very favorably	Han	Yes	No	Yes	Male	18-29	$25,000 - $49,999	Bachelor degree	Middle Atlantic
6	3292684787	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	6.0	...	Very favorably	Han	Yes	No	No	Male	18-29	NaN	High school degree	East North Central
7	3292663732	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	4.0	...	Very favorably	Han	No	NaN	Yes	Male	18-29	NaN	High school degree	South Atlantic
8	3292654043	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5.0	...	Somewhat favorably	Han	No	NaN	No	Male	18-29	$0 - $24,999	Some college or Associate degree	South Atlantic
9	3292640424	Yes	No	NaN	Star Wars: Episode II Attack of the Clones	NaN	NaN	NaN	NaN	1.0	...	Very favorably	I don't understand this question	No	NaN	No	Male	18-29	$25,000 - $49,999	Some college or Associate degree	Pacific

10 rows × 38 columns

In [3]:

## Let's see all the columns of the dataframe:
star_wars.columns

Out[3]:

Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

The columns `Have you seen any of the 6 films in the Star Wars franchise?` and `Do you consider yourself to be a fan of the Star Wars film franchise?` seem to contain `Yes` and `No` values. We can convert these into Boolean values, which will make the data cleaning process much easier.

In [4]:

## First, let's verify our observation :
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)

Out[4]:

Yes    936
No     250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64

In [5]:

star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)

Out[5]:

Yes    552
NaN    350
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64

In [6]:

## Now that we have verified the values in the columns, let's convert them into boolean:
yes_no = {'Yes': True, 'No': False}
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(yes_no)
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)

Out[6]:

True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64

In [7]:

star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(yes_no)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)

Out[7]:

True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
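The conversion above relies on how `Series.map` treats values that have no key in the dictionary. The following is a minimal sketch on a made-up column (the `answers` values are an assumption for illustration, not survey data) showing that a missing answer stays `NaN` instead of becoming `False`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a Yes/No survey column (values are made up for illustration).
answers = pd.Series(["Yes", "No", np.nan, "Yes"])

yes_no = {"Yes": True, "No": False}
as_bool = answers.map(yes_no)

# Values with no key in the dict (here the missing answer) stay NaN,
# so the result mixes True/False with NaN rather than being purely boolean.
n_true = int((as_bool == True).sum())
n_missing = int(as_bool.isna().sum())
print(as_bool.tolist())
print(n_true, n_missing)
```

This is why the fan column still reports 350 `NaN` values after the mapping: respondents who skipped the question are kept as missing.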

We see above that some of the columns have headers like `Unnamed: 4` or `Which of the following Star Wars films have you seen? Please select all that apply.` Let's change these headers into something more descriptive.

In [8]:

## Let's change the column headers :
headers = {'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1',
          'Unnamed: 4': 'seen_2', 'Unnamed: 5': 'seen_3', 'Unnamed: 6': 'seen_4', 'Unnamed: 7': 'seen_5',
           'Unnamed: 8': 'seen_6'}
star_wars = star_wars.rename(columns=headers)
# checking the headers
star_wars.columns

Out[8]:

Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

The values in these columns seem to be 'checkboxes': each cell holds the name of the movie if the respondent has seen it and is `NaN` otherwise. Rather than storing the names of the movies, we can change these columns to contain Boolean values indicating whether the respondent has seen the movie.

In [9]:

## The first row lists all six movies, so let's use it to build a dict that maps each title to True and NaN to False:
to_change = {star_wars.iloc[0,3] : True, star_wars.iloc[0,4] : True, star_wars.iloc[0,5] :True,
             star_wars.iloc[0,6] : True, star_wars.iloc[0,7] : True, star_wars.iloc[0,8] : True,
             np.nan : False
            }
for series in star_wars.columns[3:9]:
    star_wars[series] = star_wars[series].map(to_change)

# checking the values :
star_wars.head()

Out[9]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	...	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
0	3292879998	True	True	True	True	True	True	True	True	3.0	...	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
1	3292879538	False	NaN	False	False	False	False	False	False	NaN	...	NaN	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
2	3292765271	True	False	True	True	True	False	False	False	1.0	...	Unfamiliar (N/A)	I don't understand this question	No	NaN	No	Male	18-29	$0 - $24,999	High school degree	West North Central
3	3292763116	True	True	True	True	True	True	True	True	5.0	...	Very favorably	I don't understand this question	No	NaN	Yes	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
4	3292731220	True	True	True	True	True	True	True	True	5.0	...	Somewhat favorably	Greedo	Yes	No	No	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central

5 rows × 38 columns
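As an alternative to building a title dictionary, the same result could be sketched with `DataFrame.notna()`, assuming a missing cell always means "not seen". The `toy` frame below is hypothetical and only mimics the shape of two checkbox columns:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of two "seen" checkbox columns:
# a cell holds the film title when ticked and NaN when not.
toy = pd.DataFrame({
    "seen_1": ["Star Wars: Episode I The Phantom Menace", np.nan,
               "Star Wars: Episode I The Phantom Menace"],
    "seen_2": ["Star Wars: Episode II Attack of the Clones", np.nan, np.nan],
})

# notna() is True wherever a title is present, False where the cell is NaN.
seen_bool = toy.notna()

# Summing booleans counts the respondents who ticked each film.
per_film = seen_bool.sum()
print(seen_bool)
print(per_film)
```

The dictionary approach used above has one advantage: an unexpected value in a cell would surface as `NaN` rather than silently counting as "seen".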

In [10]:

## let's look at the next 6 columns :
star_wars.iloc[:, 9:15]

Out[10]:

	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	Unnamed: 10	Unnamed: 11	Unnamed: 12	Unnamed: 13	Unnamed: 14
0	3.0	2.0	1.0	4.0	5.0	6.0
1	NaN	NaN	NaN	NaN	NaN	NaN
2	1.0	2.0	3.0	4.0	5.0	6.0
3	5.0	6.0	1.0	2.0	4.0	3.0
4	5.0	4.0	6.0	2.0	1.0	3.0
...	...	...	...	...	...	...
1181	5.0	4.0	6.0	3.0	2.0	1.0
1182	4.0	5.0	6.0	2.0	3.0	1.0
1183	NaN	NaN	NaN	NaN	NaN	NaN
1184	4.0	3.0	6.0	5.0	2.0	1.0
1185	6.0	1.0	2.0	3.0	4.0	5.0

1186 rows × 6 columns

These columns hold the rank that each respondent has given to each of the *Star Wars* movies. The headers are unintuitive, so let's change them into suitable ones.

In [11]:

## changing the headers:
i=1
for header in star_wars.columns[9:15]:
    new_header = 'ranking_{}'.format(i)
    star_wars = star_wars.rename(columns={header : new_header})
    i +=1

# checking the new headers:
star_wars.columns[9:15]

Out[11]:

Index(['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5',
       'ranking_6'],
      dtype='object')
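The renaming loop can also be sketched as a single `rename` call with a dictionary built by `enumerate`. The column names in `toy` are placeholders chosen for illustration, not the real survey headers:

```python
import pandas as pd

# Placeholder headers standing in for the ranking question and its unnamed siblings.
toy = pd.DataFrame([[3.0, 2.0, 1.0]],
                   columns=["rank question", "Unnamed: 10", "Unnamed: 11"])

# Map every old header to ranking_1, ranking_2, ... in column order.
new_names = {old: "ranking_{}".format(i)
             for i, old in enumerate(toy.columns, start=1)}
renamed = toy.rename(columns=new_names)
print(list(renamed.columns))
```

Building the mapping first and renaming once avoids reassigning the dataframe on every loop iteration.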

In [12]:

## Let's look at the column dtype :
star_wars.iloc[:, 9:15].dtypes

Out[12]:

ranking_1    float64
ranking_2    float64
ranking_3    float64
ranking_4    float64
ranking_5    float64
ranking_6    float64
dtype: object

Data Analysis :¶

Which movie is ranked better :¶

In [13]:

rankings = star_wars.iloc[:, 9:15].mean()
rankings.plot.bar(title= 'Movie rankings (Lower is better!)', rot=0,color='red')

Out[13]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f675ab36b80>

We see that `ranking_4` (*Star Wars: Episode IV A New Hope*), `ranking_5` (*Star Wars: Episode V The Empire Strikes Back*) and `ranking_6` (*Star Wars: Episode VI Return of the Jedi*) have been rated the best out of the bunch. These movies are older and tend to have a staunch fan following.
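Reading the best film off the bar chart can be cross-checked numerically. A minimal sketch on invented rankings (the numbers in `toy` are assumptions, not survey results): since a lower rank is better, the best film is the column with the smallest mean.

```python
import numpy as np
import pandas as pd

# Invented rankings for three films; the all-NaN row mimics a respondent
# who did not rank anything.
toy = pd.DataFrame({
    "ranking_4": [2.0, 1.0, np.nan, 3.0],
    "ranking_5": [1.0, 2.0, np.nan, 1.0],
    "ranking_6": [3.0, 3.0, np.nan, 2.0],
})

# mean() skips NaN by default, so unranked rows do not distort the average.
mean_rank = toy.mean()

# Lower is better, so the best-ranked film has the minimum mean.
best = mean_rank.idxmin()
print(mean_rank.sort_values())
print(best)
```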

Number of people that have seen each movie :¶

In [14]:

movie_seen = star_wars.iloc[:, 3:9].sum()
movie_seen.plot.bar(title= 'How many people have seen the movie', rot=0)

Out[14]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f675a7aeb50>

We again see that the older movies have been watched more than the newer ones, adding to our findings from the rankings.

Genderwise distribution :¶

In [15]:

## grouping by gender and summing every numeric column (for the boolean seen_* columns this counts the viewers):
star_gender = star_wars.groupby('Gender').agg(np.sum)

# Plotting the most watched film:
star_gender.iloc[:,1:8].plot.bar(title = 'Most watched film by gender')
plt.legend(loc='lower left', bbox_to_anchor=(1,0),borderaxespad=0)
plt.show()

In [16]:

# Plotting the best ranked film:
star_gender.iloc[:,8:].plot.bar(title = 'The best ranked film by gender (Lower is better!)')
plt.legend(loc='lower left', bbox_to_anchor=(1,0),borderaxespad=0)
plt.show()

By Gender :

- More males have watched the films.
- The fifth film, *Star Wars: Episode V The Empire Strikes Back*, is the most popular film among both genders.
- On average, people seem to gravitate towards the older films, as they have a higher view rate and are ranked better by the respondents.
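One caveat with summed values is that they grow with the size of each group, so a larger group gets a larger total even when its members rank a film identically. The sketch below uses a hypothetical four-row frame (all values invented) to contrast per-group sums with per-group means:

```python
import pandas as pd

# Hypothetical respondents: three males and one female.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female"],
    "seen_5": [True, True, False, True],
    "ranking_5": [1.0, 2.0, 3.0, 2.0],
})

by_gender = toy.groupby("Gender").agg(
    seen_5_share=("seen_5", "mean"),       # fraction of the group that saw the film
    ranking_5_mean=("ranking_5", "mean"),  # average rank, comparable across groups
    ranking_5_sum=("ranking_5", "sum"),    # total rank, grows with group size
)
print(by_gender)
```

Both groups have the same mean rank here, yet the summed rank for males is three times larger simply because there are more of them. Means or shares may therefore be the safer basis for a "lower is better" comparison between groups.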

How have fans rated the movies :¶

In [17]:

## creating a subset dataframe that contains only fan values:
star_wars_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']== True]
print ('The number of fans of the Star wars franchise :', star_wars_fan.shape[0])

## grouping by Star Wars fandom and summing every numeric column:
star_fan = star_wars.groupby('Do you consider yourself to be a fan of the Star Wars film franchise?').agg(np.sum)

# Plotting the most watched film:
star_fan.iloc[:,1:8].plot.bar(title = 'Most watched film by fandom')
plt.legend(loc='lower left', bbox_to_anchor=(1,0),borderaxespad=0)
plt.show()

The number of fans of the Star wars franchise : 552

In [18]:

# Plotting the best ranked film:
star_fan.iloc[:,8:].plot.bar(title = 'The best ranked film by fandom (Lower is better!)')
plt.legend(loc='lower left', bbox_to_anchor=(1,0),borderaxespad=0)
plt.show()

As ranked by Star Trek fans :¶

In [19]:

## creating a subset dataframe that contains only Star Trek fans:
star_trek_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Trek franchise?']== 'Yes']

# Let's see the number of fans of the franchise:
print ('Number of people who are fans of the Star Trek franchise :', star_trek_fan.shape[0])

## grouping by Star Trek fandom and summing every numeric column:
star_trek_fan = star_wars.groupby('Do you consider yourself to be a fan of the Star Trek franchise?').agg(np.sum)

# Plotting the most watched film:
star_trek_fan.iloc[:,2:8].plot.bar(title = 'Most watched film by Star Trekkers')
plt.legend(loc='lower left', bbox_to_anchor=(1,0),borderaxespad=0)
plt.show()

Number of people who are fans of the Star Trek franchise : 427

Being a fan of the *Star Wars* franchise or of *Star Trek* doesn't seem to affect our analysis. Two of the older films, *Star Wars: Episode V The Empire Strikes Back* and *Star Wars: Episode VI Return of the Jedi*, are the most popular *Star Wars* films.
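A compact way to compare fans and non-fans is to group on the answer column and average a ranking. The sketch below assumes a hypothetical `trek_fan` column with `Yes`/`No` answers and invented rankings; note that `groupby` drops rows whose key is missing by default:

```python
import numpy as np
import pandas as pd

# Hypothetical answers and rankings; the NaN row is a skipped question.
toy = pd.DataFrame({
    "trek_fan": ["Yes", "No", np.nan, "Yes"],
    "ranking_5": [1.0, 3.0, 2.0, 2.0],
})

grouped = toy.groupby("trek_fan")["ranking_5"]
group_sizes = grouped.size()  # respondents per answer, NaN key excluded
group_means = grouped.mean()  # average rank per answer
print(group_sizes)
print(group_means)
```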

Location wise :¶

In [20]:

loc = star_wars['Location (Census Region)'].value_counts()
loc.plot.bar(title = 'Number of respondents based on location')

Out[20]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f67597435b0>

We see that East North Central had the most respondents and East South Central had the fewest respondents for the survey.

In [21]:

star_group = star_wars.groupby('Location (Census Region)').agg(np.sum)
star_group = star_group.drop('RespondentID', axis=1)

In [22]:

# Let's plot the values to visually understand the result better:
star_group.iloc[:,1].plot(kind='bar', title= 'Number of films watched by location')

Out[22]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f67596afb80>

We see that people from the Pacific region are more excited about the *Star Wars* films, as most of the respondents there had seen the films. On the other hand, East South Central has the least number of viewers.
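Because regions differ in how many people responded, a raw viewer count and a viewer share can tell different stories. The following sketch uses an invented `seen_any` column and made-up rows to show both side by side:

```python
import pandas as pd

# Made-up respondents from two regions.
toy = pd.DataFrame({
    "Location (Census Region)": ["Pacific", "Pacific",
                                 "East South Central", "East South Central",
                                 "East South Central"],
    "seen_any": [True, True, True, False, False],
})

# sum = number of viewers, count = number of respondents, mean = share of viewers.
summary = (toy.groupby("Location (Census Region)")["seen_any"]
              .agg(["sum", "count", "mean"]))
print(summary)
```

The `mean` column is the one that supports a statement such as "most of the respondents had seen the films", since it is independent of how many people answered in each region.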

In [23]:

star_group.iloc[:,1:7].plot(kind='bar', title= 'Films watched by location')

Out[23]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f675abf2a90>

We again witness the popularity of the older films, as they have been watched more than the newer ones all over the country.

Which character shot first:¶

In [24]:

star_wars['Which character shot first?'].value_counts().plot.bar(title='Which character shot first')

Out[24]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f6759611df0>

The age-old ambiguity about who fired first still holds. Personally, though, I believe that Han fired first at Greedo in the cantina.
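To put numbers on that ambiguity, the answer counts can be turned into shares with `value_counts(normalize=True)`. The answers below are invented for illustration and do not reflect the survey:

```python
import numpy as np
import pandas as pd

# Invented answers; NaN stands for a respondent who skipped the question.
shot = pd.Series(["Han", "Greedo", "I don't understand this question",
                  "Han", np.nan])

# normalize=True returns fractions; missing answers are excluded by default.
shares = shot.value_counts(normalize=True)
print(shares)
```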

Conclusion :¶

Let's condense our findings from above :

From our analysis we clearly see that *Star Wars: Episode V The Empire Strikes Back* and *Star Wars: Episode VI Return of the Jedi* have been rated the best out of the bunch (by view rate and rankings). These movies are older and tend to have a staunch fan following (including me).
Genderwise :
- More men than women seem to follow the *Star Wars* franchise.
Fandom Trivia :
- *Star Wars* has more fans than *Star Trek* (though not by a large margin).
- *Star Wars: Episode V The Empire Strikes Back* and *Star Wars: Episode VI Return of the Jedi* are unbeatable!
By Location :
- We see that people from the Pacific region are more excited about the *Star Wars* films, as most of the respondents there had seen the films. On the other hand, East South Central has the least number of viewers.
Who Shot First :
- Depends on 'George Lucas'. - Link attached.