Guided Project: Star Wars Survey¶

In this project, we address a question about the Star Wars series: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?

To answer it, we are going to analyse the data collected by FiveThirtyEight after surveying Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which you can download from their GitHub repository.

Read data set and overview of the data¶

In [39]:

import pandas as pd
import numpy as np
import  matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline
sns.set_style("whitegrid", {'axes.grid' : False})


star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
star_wars.columns

Out[39]:

Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

The data has several columns, including:

RespondentID - An anonymized ID for the respondent (person taking the survey)
Gender - The respondent's gender
Age - The respondent's age
Household Income - The respondent's income
Education - The respondent's education level
Location (Census Region) - The respondent's location
Have you seen any of the 6 films in the Star Wars franchise? - Has a Yes or No response
Do you consider yourself to be a fan of the Star Wars film franchise? - Has a Yes or No response
Which of the following Star Wars films have you seen? Please select all that apply.

We will now check for any strange values in the dataset.

In [40]:

star_wars.head(5)

Out[40]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	...	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
0	NaN	Response	Response	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	Star Wars: Episode I The Phantom Menace	...	Yoda	Response	Response	Response	Response	Response	Response	Response	Response	Response
1	3.292880e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	3	...	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
2	3.292880e+09	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
3	3.292765e+09	Yes	No	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN	1	...	Unfamiliar (N/A)	I don't understand this question	No	NaN	No	Male	18-29	$0 - $24,999	High school degree	West North Central
4	3.292763e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5	...	Very favorably	I don't understand this question	No	NaN	Yes	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central

5 rows × 38 columns

The first row of the output above is not a real response: its RespondentID is NaN and its other cells hold sub-headers such as Response or a film title. We drop every row whose RespondentID is missing before proceeding with the analysis.

In [41]:

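# keep rows with a RespondentID; .copy() avoids SettingWithCopyWarning on later column assignments
star_wars = star_wars[pd.notnull(star_wars["RespondentID"])].copy()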
star_wars.head(3)

Out[41]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	...	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
1	3.292880e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	3	...	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
2	3.292880e+09	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
3	3.292765e+09	Yes	No	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN	1	...	Unfamiliar (N/A)	I don't understand this question	No	NaN	No	Male	18-29	$0 - $24,999	High school degree	West North Central

3 rows × 38 columns
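To see the filtering step above in isolation, here is a minimal sketch on a tiny hypothetical frame. The values are made up for illustration; only the column names `RespondentID` and `Gender` come from the survey file. The sub-header row has no respondent ID, so keeping the rows where the ID is not null removes it.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame: row 0 mimics the sub-header row of the survey file
toy = pd.DataFrame({
    "RespondentID": [np.nan, 1001.0, 1002.0],
    "Gender": ["Response", "Male", "Female"],
})

# Keep only rows that carry a respondent ID
cleaned = toy[toy["RespondentID"].notna()].copy()
print(cleaned)
```

The original row labels are kept, which is why the cleaned survey data starts at index 1 rather than 0.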

Cleaning and Mapping Yes/No columns¶

Some of the columns represent Yes/No questions. They can also contain NaN where a respondent chose not to answer a question. The columns in question are:

Have you seen any of the 6 films in the Star Wars franchise?
Do you consider yourself to be a fan of the Star Wars film franchise?

Let's jump straightaway into cleaning these columns.

In [42]:

#  dictionary to define a mapping for each value in the series
#  map value Yes to boolean value True and No to False
yes_no = {
    "Yes": True,
    "No": False
}
# function to map and convert column values to Boolean
def convert_to_bool(col):
    return col.map(yes_no)
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = convert_to_bool(star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] )
print(star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False))
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = convert_to_bool(star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'])
print(star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False))

True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64

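Both columns now hold Boolean values. The first column has no missing answers, while the fan question has 350 NaN values from respondents who did not answer it.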
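The NaN values survive the conversion because `Series.map` returns NaN for any value that is not a key of the mapping dictionary. A minimal sketch on a hypothetical series of answers (the data is invented for illustration):

```python
import numpy as np
import pandas as pd

yes_no = {"Yes": True, "No": False}

# Hypothetical answers, including one skipped question (NaN)
answers = pd.Series(["Yes", "No", np.nan, "Yes"])

# Values that are not keys of the dictionary (here NaN) come back as NaN
mapped = answers.map(yes_no)
print(mapped.value_counts(dropna=False))
```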

Cleaning and mapping checkbox columns¶

If we check the column names, we can see that six columns together represent a single checkbox question. The columns are:

Which of the following Star Wars films have you seen? Please select all that apply. - Whether or not the respondent saw Star Wars: Episode I The Phantom Menace.
Unnamed: 4 - Whether or not the respondent saw Star Wars: Episode II Attack of the Clones.
Unnamed: 5 - Whether or not the respondent saw Star Wars: Episode III Revenge of the Sith.
Unnamed: 6 - Whether or not the respondent saw Star Wars: Episode IV A New Hope.
Unnamed: 7 - Whether or not the respondent saw Star Wars: Episode V The Empire Strikes Back.
Unnamed: 8 - Whether or not the respondent saw Star Wars: Episode VI Return of the Jedi.

For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw the movie. If the value is NaN, the respondent either didn't answer or didn't see the movie. We'll convert each of these columns to a Boolean, then rename the column for sanity purposes. 🤓

In [43]:

# mapping dictionary for movies 

movie_mapping = {
    "Star Wars: Episode I  The Phantom Menace": True,
    np.nan: False,
    "Star Wars: Episode II  Attack of the Clones": True,
    "Star Wars: Episode III  Revenge of the Sith": True,
    "Star Wars: Episode IV  A New Hope": True,
    "Star Wars: Episode V The Empire Strikes Back": True,
    "Star Wars: Episode VI Return of the Jedi": True
}
# map values and convert to Boolean values. 
# positions 3 through 8 (the slice 3:9) hold the six checkbox columns
for col in star_wars.columns[3:9]: 
    star_wars[col] = star_wars[col].map(movie_mapping)

# rename columns
star_wars = star_wars.rename(columns={
        "Which of the following Star Wars films have you seen? Please select all that apply.": "seen_1",
        "Unnamed: 4": "seen_2",
        "Unnamed: 5": "seen_3",
        "Unnamed: 6": "seen_4",
        "Unnamed: 7": "seen_5",
        "Unnamed: 8": "seen_6"
        })

star_wars.head(2)

Out[43]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	...	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
1	3.292880e+09	True	True	True	True	True	True	True	True	3	...	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
2	3.292880e+09	False	NaN	False	False	False	False	False	False	NaN	...	NaN	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central

2 rows × 38 columns
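The title-by-title mapping above depends on the exact spelling and spacing of each film title. As an alternative sketch (not the approach used in this notebook), any non-missing cell can be treated as "seen" with `notna()`. The frame below is hypothetical and only reuses two of the renamed column names.

```python
import numpy as np
import pandas as pd

# Hypothetical checkbox columns: a cell holds the film title if ticked, NaN otherwise
toy = pd.DataFrame({
    "seen_1": ["Star Wars: Episode I  The Phantom Menace", np.nan,
               "Star Wars: Episode I  The Phantom Menace"],
    "seen_5": ["Star Wars: Episode V The Empire Strikes Back", np.nan, np.nan],
})

# Any non-missing cell counts as "seen", regardless of the spacing in the title
seen = toy.notna()
print(seen.sum())
```

This assumes that every non-missing value in these columns really is a ticked checkbox, which holds only after the sub-header row has been dropped.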

Cleaning the ranking columns¶

Next, we have the columns in which respondents rank the Star Wars films in order of preference, 1 being the most favorite and 6 being the least favorite. The following are the columns that rank the movies:

Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. - How much the respondent liked Star Wars: Episode I The Phantom Menace
Unnamed: 10 - How much the respondent liked Star Wars: Episode II Attack of the Clones
Unnamed: 11 - How much the respondent liked Star Wars: Episode III Revenge of the Sith
Unnamed: 12 - How much the respondent liked Star Wars: Episode IV A New Hope
Unnamed: 13 - How much the respondent liked Star Wars: Episode V The Empire Strikes Back
Unnamed: 14 - How much the respondent liked Star Wars: Episode VI Return of the Jedi

We'll convert each column to a numeric type and then rename the columns. These are the columns at positions 9 through 14, which the slice 9:15 selects.

In [44]:

# Convert each of the columns above to a float type
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)

# rename the columns
star_wars = star_wars.rename(columns={
        "Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.": "ranking_1",
        "Unnamed: 10": "ranking_2",
        "Unnamed: 11": "ranking_3",
        "Unnamed: 12": "ranking_4",
        "Unnamed: 13": "ranking_5",
        "Unnamed: 14": "ranking_6"
        })

star_wars.head(2)

Out[44]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	ranking_1	...	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
1	3.292880e+09	True	True	True	True	True	True	True	True	3.0	...	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
2	3.292880e+09	False	NaN	False	False	False	False	False	False	NaN	...	NaN	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central

2 rows × 38 columns
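`astype(float)` works here because the sub-header row was removed earlier; it raises an error if a non-numeric string is left in a column. A more forgiving sketch, under the assumption that stray text should simply become missing, uses `pd.to_numeric` with `errors="coerce"`. The series below is hypothetical.

```python
import pandas as pd

# Hypothetical ranking column as read from CSV: numbers stored as text, one stray label
raw = pd.Series(["3", "1", None, "Response"])

# errors="coerce" turns anything that is not a number into NaN instead of raising
ranks = pd.to_numeric(raw, errors="coerce")
print(ranks)
```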

Finding the highest-ranked movie¶

In [45]:

# find the highest-ranked movie by finding the mean of each rating
star_wars[star_wars.columns[9:15]].mean()

Out[45]:

ranking_1    3.732934
ranking_2    4.087321
ranking_3    4.341317
ranking_4    3.272727
ranking_5    2.513158
ranking_6    3.047847
dtype: float64

In [46]:

# plot the mean values
star_wars[star_wars.columns[9:15]].mean().plot(kind='bar')
sns.despine()

plt.show()

From the plot, ranking_5 has the lowest mean value (about 2.51), i.e. Star Wars: Episode V The Empire Strikes Back is the favorite movie. Remember that the ranking values run from 1 through 6, where 1 means the film was the most favorite and 6 means it was the least favorite, so a lower mean is better.
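Because a lower mean rank is better, the favorite film is the column with the minimum mean, not the maximum. A short sketch using the mean values printed above (the variable names are illustrative):

```python
import pandas as pd

# Mean ranks copied from the output above (1 = favorite, 6 = least favorite)
mean_ranks = pd.Series({
    "ranking_1": 3.732934,
    "ranking_2": 4.087321,
    "ranking_3": 4.341317,
    "ranking_4": 3.272727,
    "ranking_5": 2.513158,
    "ranking_6": 3.047847,
})

# idxmin returns the label of the smallest value, i.e. the best-ranked film
best = mean_ranks.idxmin()
worst = mean_ranks.idxmax()
print(mean_ranks.sort_values())
print("Best-ranked column:", best)
print("Worst-ranked column:", worst)
```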

Finding the most viewed movie¶

We have already cleaned up the seen columns and converted their values to the Boolean type. Now let's find the most viewed movie from the series.

In [47]:

# positions 3 through 8 (the slice 3:9) hold the seen columns
star_wars[star_wars.columns[3:9]].sum()

Out[47]:

seen_1    673
seen_2    571
seen_3    550
seen_4    607
seen_5    758
seen_6    738
dtype: int64

In [48]:

# plot the values

star_wars[star_wars.columns[3:9]].sum().plot(kind='bar')
sns.despine()
plt.show()

seen_5, i.e. Star Wars: Episode V The Empire Strikes Back, is the most viewed movie, with 758 respondents. The most viewed movie is therefore also the highest-ranked one, although the view counts alone do not show that one explains the other.
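Raw counts are easier to compare as shares. The sketch below turns the counts printed above into percentages, once relative to all respondents and once relative to those who saw at least one film. The two totals come from the Yes/No cleaning step (936 True, 250 False); the variable names are illustrative.

```python
# View counts copied from the output above
seen_counts = {
    "seen_1": 673, "seen_2": 571, "seen_3": 550,
    "seen_4": 607, "seen_5": 758, "seen_6": 738,
}

# Totals from the Yes/No cleaning step: 936 saw at least one film, 250 saw none
saw_any = 936
all_respondents = 936 + 250

for column, count in seen_counts.items():
    share_all = count / all_respondents
    share_viewers = count / saw_any
    print(f"{column}: {share_all:.1%} of all respondents, "
          f"{share_viewers:.1%} of those who saw any film")

most_viewed = max(seen_counts, key=seen_counts.get)
print("Most viewed:", most_viewed)
```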

Exploring the data by binary segments¶

Let's examine how certain segments of the survey population responded. There are several columns that segment our data into two groups. Here are a few examples:

Do you consider yourself to be a fan of the Star Wars film franchise? - True or False
Do you consider yourself to be a fan of the Star Trek franchise? - Yes or No
Gender - Male or Female

We can compute the most viewed movie, the highest-ranked movie, and other statistics separately for each group.

In [49]:

# split the data into two groups based on gender
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]

# Highest-ranked movie - Male respondents and plot the values

print("Highest-ranked movie - Male respondents \n\n",males[males.columns[9:15]].mean())
males[males.columns[9:15]].mean().plot(kind='bar', title="Movie ranking by Male respondents")
plt.show()

# Highest-ranked movie - Female respondents and plot the values

print("Highest-ranked movie - Female respondents \n\n",females[females.columns[9:15]].mean())
females[females.columns[9:15]].mean().plot(kind='bar', title="Movie ranking by Female respondents")
sns.despine()
plt.show()

Highest-ranked movie - Male respondents 

 ranking_1    4.037825
ranking_2    4.224586
ranking_3    4.274882
ranking_4    2.997636
ranking_5    2.458629
ranking_6    3.002364
dtype: float64

Highest-ranked movie - Female respondents 

 ranking_1    3.429293
ranking_2    3.954660
ranking_3    4.418136
ranking_4    3.544081
ranking_5    2.569270
ranking_6    3.078086
dtype: float64

In [50]:

# Most viewed - Male and plot the values
print("Most viewed - Male respondents\n\n",males[males.columns[3:9]].sum())

males[males.columns[3:9]].sum().plot(kind='bar',title="Most viewed movie by Male respondents")
plt.show()

# Most viewed - Female and plot the values
print("Most viewed - Female respondents\n\n",females[females.columns[3:9]].sum())

females[females.columns[3:9]].sum().plot(kind='bar',title="Most viewed movie by Female respondents")
sns.despine()
plt.show()

Most viewed - Male respondents

 seen_1    361
seen_2    323
seen_3    317
seen_4    342
seen_5    392
seen_6    387
dtype: int64

Most viewed - Female respondents

 seen_1    298
seen_2    237
seen_3    222
seen_4    255
seen_5    353
seen_6    338
dtype: int64

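From the plots, episode 5 received the best ranking and the most views from both men and women. More men than women watched episodes 1-3, yet men ranked episodes 1 and 2 worse than women did (4.04 vs. 3.43 and 4.22 vs. 3.95), while episode 3 fared slightly better with men (4.27 vs. 4.42). Episodes 5 and 6 have the most views among both men and women.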

In [51]:

# analysis based on column-
# Do you consider yourself to be a fan of the Star Wars film franchise?


# rename the column in both male and female dataset we grouped in the previous step
male_fans = males.rename(columns={
        'Do you consider yourself to be a fan of the Star Wars film franchise?': "fan_or_not"})

# show the counts, then treat NaN (no answer) as False
print(male_fans['fan_or_not'].value_counts(dropna=False))
male_fans['fan_or_not']= male_fans['fan_or_not'].fillna(False)
print('\nafter removing NaN values\n',male_fans['fan_or_not'].value_counts(dropna=False,normalize=True))
# plot values
male_fans['fan_or_not'].value_counts(dropna=False,normalize=True).plot(kind='bar', title='Star Wars Fan or not - Male')
sns.despine()

plt.show()

True     303
False    120
NaN       74
Name: fan_or_not, dtype: int64

after removing NaN values
 True     0.609658
False    0.390342
Name: fan_or_not, dtype: float64

In [52]:

female_fans = females.rename(columns={
        'Do you consider yourself to be a fan of the Star Wars film franchise?': "fan_or_not"})

# show the counts, then treat NaN (no answer) as False
print(female_fans['fan_or_not'].value_counts(dropna=False))
female_fans['fan_or_not']= female_fans['fan_or_not'].fillna(False)
print('\nafter removing NaN values\n',female_fans['fan_or_not'].value_counts(dropna=False,normalize=True))
# plot values
female_fans['fan_or_not'].value_counts(dropna=False,normalize=True).plot(kind='bar', title='Star Wars Fan or not - Female')
sns.despine()

plt.show()

True     238
False    159
NaN      152
Name: fan_or_not, dtype: int64

after removing NaN values
 False    0.566485
True     0.433515
Name: fan_or_not, dtype: float64

In [53]:

# combined plot - Gender, fan_or_not

all_fans =star_wars.rename(columns={
        'Do you consider yourself to be a fan of the Star Wars film franchise?': "fan_or_not"})
# considering NaN values as False
all_fans['fan_or_not']= all_fans['fan_or_not'].fillna(False)

# group by columns in question and plot
fans_by_gender=all_fans.groupby(['Gender','fan_or_not']).size()
df=fans_by_gender.unstack()
df.plot(kind='bar', title="Are you a fan of Star Wars?")
sns.despine()

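The plots show that a larger share of men than women consider themselves fans of the Star Wars series: about 61% of male respondents versus about 43% of female respondents, when a missing answer is counted as not being a fan.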
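Filling NaN with False counts every respondent who skipped the question as a non-fan, which lowers the fan share. The sketch below recomputes the share from the counts printed above, once with that convention and once among only those who answered. The dictionary layout and variable names are illustrative.

```python
# Counts copied from the value_counts output above: (fans, non-fans, no answer)
counts = {
    "Male": (303, 120, 74),
    "Female": (238, 159, 152),
}

shares = {}
for gender, (fans, non_fans, no_answer) in counts.items():
    # Share when a missing answer is treated as "not a fan" (what fillna(False) does)
    share_filled = fans / (fans + non_fans + no_answer)
    # Share among respondents who actually answered the question
    share_answered = fans / (fans + non_fans)
    shares[gender] = (share_filled, share_answered)
    print(f"{gender}: {share_filled:.1%} with NaN as False, "
          f"{share_answered:.1%} among those who answered")
```

Under both conventions the share of fans is higher among men, but the gap and the absolute levels differ, so the chosen treatment of missing answers should be stated alongside the result.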

Further analysis¶

1. Based on Education¶

In [54]:

# check the values in the column
star_wars['Education'].value_counts()

Out[54]:

Some college or Associate degree    328
Bachelor degree                     321
Graduate degree                     275
High school degree                  105
Less than high school degree          7
Name: Education, dtype: int64

Ranking based on Education¶

In [55]:

# rename the column
star_wars = star_wars.rename(columns={
        'Do you consider yourself to be a fan of the Star Wars film franchise?': "fan_or_not"})
# create a pivot table
ranking_by_education = star_wars.pivot_table(index="Education", values=star_wars.columns[9:15])
print(ranking_by_education)

# plot the data

ranking_by_education.plot(kind='bar', title='Ranking by education', figsize=(20,10),fontsize=10)
sns.despine()
plt.show()

# sns heatmap
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(ranking_by_education, annot=True,  linewidths=.5, ax=ax)
ax.set_title('Ranking by education')

                                  ranking_1  ranking_2  ranking_3  ranking_4  \
Education                                                                      
Bachelor degree                    3.828244   4.290076   4.521073   3.114504   
Graduate degree                    3.822222   4.225664   4.500000   3.199115   
High school degree                 3.802817   3.746479   4.126761   3.211268   
Less than high school degree       5.000000   5.333333   3.666667   2.666667   
Some college or Associate degree   3.551181   3.885827   4.102362   3.503937   

                                  ranking_5  ranking_6  
Education                                               
Bachelor degree                    2.309160   2.931298  
Graduate degree                    2.323009   2.920354  
High school degree                 2.873239   3.239437  
Less than high school degree       1.000000   3.333333  
Some college or Associate degree   2.783465   3.173228

Out[55]:

<matplotlib.text.Text at 0x7fef99a0b400>

Views based on Education¶

In [56]:

views_by_education  =  star_wars.pivot_table(index="Education", values=star_wars.columns[3:9]) 
print(views_by_education)


f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(views_by_education*100, annot=True,fmt='.1f' ,linewidths=.5, ax=ax)
ax.set_title('Views by education')

                                    seen_1    seen_2    seen_3    seen_4  \
Education                                                                  
Bachelor degree                   0.641745  0.529595  0.507788  0.607477   
Graduate degree                   0.650909  0.541818  0.505455  0.592727   
High school degree                0.542857  0.457143  0.457143  0.504762   
Less than high school degree      0.428571  0.428571  0.428571  0.428571   
Some college or Associate degree  0.643293  0.567073  0.557927  0.548780   

                                    seen_5    seen_6  
Education                                             
Bachelor degree                   0.757009  0.728972  
Graduate degree                   0.752727  0.730909  
High school degree                0.580952  0.571429  
Less than high school degree      0.428571  0.428571  
Some college or Associate degree  0.692073  0.679878

Out[56]:

<matplotlib.text.Text at 0x7fef99b81b00>

The data above shows that respondents with less than a high school degree gave episode 5 the best mean rank (1.0), but only 43% of them watched it, and the group has just 7 respondents, so these figures are not reliable. In contrast, almost 76% of respondents with a bachelor's degree watched episode 5 and ranked it 2.3 on average.
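A group mean says little without the number of answers behind it. One way to keep both in view is to aggregate with `mean` and `count` together. The sketch below uses a hypothetical mini-survey, not the real data; only the column names are taken from the notebook.

```python
import pandas as pd

# Hypothetical mini-survey: one larger group and one very small group
toy = pd.DataFrame({
    "Education": ["Bachelor degree"] * 4 + ["Less than high school degree"],
    "ranking_5": [2.0, 3.0, 1.0, 2.0, 1.0],
})

# Report the mean rank together with the number of answers it is based on
summary = toy.groupby("Education")["ranking_5"].agg(["mean", "count"])
print(summary)
```

In this toy data the small group has a perfect mean of 1.0, but it rests on a single answer.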

2. Based on Location/Region¶

In [57]:

# check values in  location column

star_wars['Location (Census Region)'].value_counts()

Out[57]:

East North Central    181
Pacific               175
South Atlantic        170
Middle Atlantic       122
West South Central    110
West North Central     93
Mountain               79
New England            75
East South Central     38
Name: Location (Census Region), dtype: int64

Ranking and views based on region¶

In [58]:

ranking_by_location = star_wars.pivot_table(index="Location (Census Region)", values=star_wars.columns[9:15])

f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(ranking_by_location, annot=True,  linewidths=.5, ax=ax)
ax.set_title('Ranking by region')
plt.show()

#views by location
views_location = star_wars.pivot_table(index="Location (Census Region)", values=star_wars.columns[3:9])

#
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(views_location, annot=True,  linewidths=.5, ax=ax)
ax.set_title('Views by region')
plt.show()

From our analysis of the location data, we see that respondents across all the regions gave episode 5 the best ranking. Approx. 82% of respondents from the East South Central region watched episode 5. Taking a closer look, the data shows that more than 50% of respondents watched episodes 1, 4, 5 and 6 across the regions.

3. Response to `Which character shot first?`¶

In [59]:

# check the values in the column
star_wars['Which character shot first?'].value_counts(dropna=False)

Out[59]:

NaN                                 358
Han                                 325
I don't understand this question    306
Greedo                              197
Name: Which character shot first?, dtype: int64

In [60]:

# replace NaN values by assigning the result back (avoids chained inplace modification)
star_wars['Which character shot first?'] = star_wars['Which character shot first?'].fillna("I don't understand this question")
print(star_wars['Which character shot first?'].value_counts(normalize=True))

star_wars['Which character shot first?'].value_counts(normalize=True).plot(kind='bar', title='Who shot first - all respondents')
sns.despine()

plt.show()

I don't understand this question    0.559865
Han                                 0.274030
Greedo                              0.166105
Name: Which character shot first?, dtype: float64

`Which character shot first?` : Response based on `Gender`¶

In [61]:

# NaN values were already replaced in the previous cell
print("Who shot first? - all respondents \n",star_wars['Which character shot first?'].value_counts())

grouped=star_wars.groupby(['Gender','Which character shot first?']).size()
df=grouped.unstack()
df.plot(kind='bar')
sns.despine()
# sns.set_style( {'axes.grid' : False})

Who shot first? - all respondents 
 I don't understand this question    664
Han                                 325
Greedo                              197
Name: Which character shot first?, dtype: int64

Male and female respondents who named a character mostly said Han shot first; however, more female respondents said they did not understand the question (a category that here also includes missing answers).
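Because the two gender groups differ in size, raw counts can mislead when comparing answers. A sketch of the same comparison as within-gender shares, using `pd.crosstab` with `normalize="index"` on hypothetical answers (the rows are invented; only the column names and answer labels come from the survey):

```python
import pandas as pd

# Hypothetical answers, reusing the survey's column names and answer labels
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female", "Female"],
    "Which character shot first?": [
        "Han", "Han", "Greedo",
        "Han", "Greedo",
        "I don't understand this question",
        "I don't understand this question",
    ],
})

# normalize="index" makes each gender's row sum to 1, so groups of different size compare fairly
shares = pd.crosstab(toy["Gender"], toy["Which character shot first?"], normalize="index")
print(shares.round(2))
```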

4. Views and Ranks based on age group¶

In [62]:

star_wars['Age'].value_counts(normalize=True)

Out[62]:

45-60    0.278203
> 60     0.257170
30-44    0.256214
18-29    0.208413
Name: Age, dtype: float64

In [63]:

# views by age
views_by_age =  star_wars.pivot_table(index="Age", values=star_wars.columns[3:9]) 
print(views_by_age)


f, ax = plt.subplots(figsize=(9, 6))

sns.heatmap(views_by_age*100, annot=True,fmt='.1f' ,linewidths=.5, ax=ax)
ax.set_title('Views by Age')

         seen_1    seen_2    seen_3    seen_4    seen_5    seen_6
Age                                                              
18-29  0.733945  0.678899  0.665138  0.697248  0.733945  0.733945
30-44  0.652985  0.589552  0.567164  0.656716  0.735075  0.735075
45-60  0.621993  0.508591  0.487973  0.567010  0.756014  0.721649
> 60   0.531599  0.394052  0.371747  0.386617  0.624535  0.587361

Out[63]:

<matplotlib.text.Text at 0x7fef998e2358>

We can see that each episode was watched by at least 66% of respondents in the 18-29 age group, and 73.4% of them watched episode 5. The series was least watched by respondents above 60 years of age; however, 62.5% of them watched episode 5, which is the highest share for any episode in this age group.

More than 73% of respondents in the age groups 18-29, 30-44 and 45-60 watched episode 5. Episode 5 is the most viewed episode in every age group, although it ties with episodes 1 and 6 in the 18-29 group and with episode 6 in the 30-44 group.

In [64]:

# rankings by age
ranks_by_age =  star_wars.pivot_table(index="Age", values=star_wars.columns[9:15]) 
print(ranks_by_age)



f, ax = plt.subplots(figsize=(9, 6))

sns.heatmap(ranks_by_age, annot=True,fmt='.1f' ,linewidths=.5, ax=ax)
ax.set_title('Rankings by Age')

       ranking_1  ranking_2  ranking_3  ranking_4  ranking_5  ranking_6
Age                                                                    
18-29   4.100000   4.100000   3.966667   2.994444   2.722222   3.116667
30-44   4.347826   4.309179   4.475728   2.932367   2.212560   2.714976
45-60   3.541667   4.170833   4.537500   3.308333   2.437500   3.004167
> 60    3.010417   3.761658   4.316062   3.808290   2.730570   3.357513

Out[64]:

<matplotlib.text.Text at 0x7fef99c42588>

The highest-ranked movie in every age group is episode 5, with mean ranks between 2.21 (30-44) and 2.73 (> 60), close to the overall mean of 2.5.
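The per-group favorite can be read off the table programmatically by taking the column with the smallest mean in each row. The sketch below rebuilds the pivot table from the values printed above and applies `idxmin(axis=1)`; the variable names are illustrative.

```python
import pandas as pd

# Mean ranks by age group, copied from the pivot table output above
ranks_by_age = pd.DataFrame(
    {
        "ranking_1": [4.100000, 4.347826, 3.541667, 3.010417],
        "ranking_2": [4.100000, 4.309179, 4.170833, 3.761658],
        "ranking_3": [3.966667, 4.475728, 4.537500, 4.316062],
        "ranking_4": [2.994444, 2.932367, 3.308333, 3.808290],
        "ranking_5": [2.722222, 2.212560, 2.437500, 2.730570],
        "ranking_6": [3.116667, 2.714976, 3.004167, 3.357513],
    },
    index=["18-29", "30-44", "45-60", "> 60"],
)

# For each age group, the column with the lowest mean rank is the favorite
best_per_group = ranks_by_age.idxmin(axis=1)
print(best_per_group)
print(ranks_by_age["ranking_5"].round(2))
```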

Conclusion¶

We started our analysis of the survey data collected by FiveThirtyEight to answer the question: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?

From our analysis of the survey results of 835 responses, Star Wars: Episode V The Empire Strikes Back comes out as the favorite of all the episodes in the Star Wars franchise. It has the best mean rank overall and in every gender and age group we examined, and it is also the most watched movie. We also found that a larger share of men than women consider themselves fans of the Star Wars movies.