Analyzing Star Wars Survey

Studying people's perception of the Star Wars franchise

Star Wars is an epic space-opera media franchise that quickly became a worldwide pop-culture phenomenon. Since its first release in 1977, there have been a total of 9 main movies, called episodes.
While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that "The Empire Strikes Back" is clearly the best of the bunch?
They surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses.

The aim of this project is to clean and analyze this survey to answer the following questions:

  • How many respondents like the Star Wars franchise?
  • Which Star Wars film is most popular among the fans?
  • How many of the respondents are super fans?
  • How many of the respondents like space-opera media franchises (Star Wars and Star Trek)?
  • Which characters do the fans view favorably and unfavorably?
  • Which character is controversial, split between likes and dislikes?

These questions give insight into the perceptions of the respondents and are key to finding the popular movies and characters of the franchise.
The analysis to answer the above questions is split into 5 parts:

  • Analyzing Star Wars film franchise fans on a granular level.
  • Finding the most viewed and most popular movie of the Star Wars franchise.
  • Analyzing super fans of the franchise on a granular level.
  • Perceptions of characters from the Star Wars franchise.
  • Analyzing space-opera media franchise (Star Wars and Star Trek) fans on a granular level.

The dataset was obtained from this link.
A few of its columns are described below:

  • RespondentID - An anonymized ID for the respondent (person taking the survey)
  • Gender - The respondent's gender
  • Age - The respondent's age
  • Household Income - The respondent's income
  • Education - The respondent's education level
  • Location (Census Region) - The respondent's location
  • Have you seen any of the 6 films in the Star Wars franchise? - Has a Yes or No response
  • Do you consider yourself to be a fan of the Star Wars film franchise? - Has a Yes or No response

There are several more columns in the dataset containing answers to questions about the Star Wars franchise.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pywaffle.waffle import Waffle
import plotly as py
import plotly.graph_objs as go
import warnings
warnings.filterwarnings('ignore')
In [2]:
survey = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
survey.head(5)
Out[2]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? Which of the following Star Wars films have you seen? Please select all that apply. Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. ... Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
0 NaN Response Response Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi Star Wars: Episode I The Phantom Menace ... Yoda Response Response Response Response Response Response Response Response Response
1 3.292880e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3 ... Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
2 3.292880e+09 No NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.292765e+09 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN 1 ... Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
4 3.292763e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 ... Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central

5 rows × 38 columns

In [3]:
cols = ['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location']

survey.columns = cols
survey.columns
Out[3]:
Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education', 'Location'],
      dtype='object')
In [4]:
survey.isna().sum()
Out[4]:
RespondentID                                                                                                                                       1
Have you seen any of the 6 films in the Star Wars franchise?                                                                                       0
Do you consider yourself to be a fan of the Star Wars film franchise?                                                                            350
Which of the following Star Wars films have you seen? Please select all that apply.                                                              513
Unnamed: 4                                                                                                                                       615
Unnamed: 5                                                                                                                                       636
Unnamed: 6                                                                                                                                       579
Unnamed: 7                                                                                                                                       428
Unnamed: 8                                                                                                                                       448
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.    351
Unnamed: 10                                                                                                                                      350
Unnamed: 11                                                                                                                                      351
Unnamed: 12                                                                                                                                      350
Unnamed: 13                                                                                                                                      350
Unnamed: 14                                                                                                                                      350
Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.                                   357
Unnamed: 16                                                                                                                                      355
Unnamed: 17                                                                                                                                      355
Unnamed: 18                                                                                                                                      363
Unnamed: 19                                                                                                                                      361
Unnamed: 20                                                                                                                                      372
Unnamed: 21                                                                                                                                      360
Unnamed: 22                                                                                                                                      366
Unnamed: 23                                                                                                                                      374
Unnamed: 24                                                                                                                                      359
Unnamed: 25                                                                                                                                      356
Unnamed: 26                                                                                                                                      365
Unnamed: 27                                                                                                                                      372
Unnamed: 28                                                                                                                                      360
Which character shot first?                                                                                                                      358
Are you familiar with the Expanded Universe?                                                                                                     358
Do you consider yourself to be a fan of the Expanded Universe?                                                                                   973
Do you consider yourself to be a fan of the Star Trek franchise?                                                                                 118
Gender                                                                                                                                           140
Age                                                                                                                                              140
Household Income                                                                                                                                 328
Education                                                                                                                                        150
Location                                                                                                                                         143
dtype: int64

The RespondentID column contains one invalid row in which RespondentID is NaN. This row is not a response: it lists the options presented to the respondent for the checkbox questions. Since every response must have a unique ID, this row is removed from the dataset.

In [5]:
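# .copy() gives an independent frame, so the later column assignments
# do not operate on a slice of `survey`
df = survey.dropna(axis=0, subset=['RespondentID']).copy()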
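
As an alternative sketch (not what this notebook does), the options row could be skipped while loading the file, using the skiprows argument of pd.read_csv, so that RespondentID is parsed as a number from the start. The tiny inline CSV and its column names below are made up for illustration.

```python
import io
import pandas as pd

# Made-up miniature of the survey file: a header line, then the
# "options" line, then two real responses.
raw = io.StringIO(
    "RespondentID,Seen any?,Fan?\n"
    ",Response,Response\n"
    "3292879998,Yes,Yes\n"
    "3292879538,No,\n"
)

# skiprows=[1] skips the second physical line of the file (0-based),
# which is the options row directly under the header.
toy = pd.read_csv(raw, skiprows=[1])

print(toy.shape)
print(toy["RespondentID"].isna().sum())
```

The trade-off is that the options row, which names the film or character behind each Unnamed column, is then no longer available in the frame for reference.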

The two columns Have you seen any of the 6 films in the Star Wars franchise? and Do you consider yourself to be a fan of the Star Wars film franchise? hold the answers to those two questions. They are central to this analysis, which focuses on the people who have seen the movies and/or are fans of the franchise.

In [6]:
df[[
    'Have you seen any of the 6 films in the Star Wars franchise?',
    'Do you consider yourself to be a fan of the Star Wars film franchise?'
]].head(15)
Out[6]:
Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise?
1 Yes Yes
2 No NaN
3 Yes No
4 Yes Yes
5 Yes Yes
6 Yes Yes
7 Yes Yes
8 Yes Yes
9 Yes Yes
10 Yes No
11 Yes NaN
12 No NaN
13 Yes No
14 Yes Yes
15 Yes Yes

The two columns contain either Yes or No values, with some missing values in between. For ease of usage throughout the analysis, these values are mapped to boolean.

'Yes' - True
'No' - False

The column names have also been changed to:

Have you seen any of the 6 films in the Star Wars franchise? - seen_any
Do you consider yourself to be a fan of the Star Wars film franchise? - is_fan
In [7]:
mappings = {
    'Yes':True,
    'No':False
}

df[
    'Have you seen any of the 6 films in the Star Wars franchise?'
] = df[
    'Have you seen any of the 6 films in the Star Wars franchise?'
].map(mappings)

df[
    'Do you consider yourself to be a fan of the Star Wars film franchise?'
] = df[
    'Do you consider yourself to be a fan of the Star Wars film franchise?'
].map(mappings)

df.rename(columns = {
    'Have you seen any of the 6 films in the Star Wars franchise?' : 'seen_any',
    'Do you consider yourself to be a fan of the Star Wars film franchise?' : 'is_fan'    
}, inplace=True)
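
The same Yes/No mapping is applied to several columns in this notebook, so it could be wrapped in a small helper. The sketch below is one possible way to do that; the function name yes_no_to_bool and the toy question texts are hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd

YES_NO = {"Yes": True, "No": False}

def yes_no_to_bool(frame, renames):
    """Map Yes/No answers to booleans and rename the columns.

    `renames` maps each original question text to a short column name.
    Missing answers stay missing. Returns a new DataFrame.
    """
    out = frame.copy()
    for question in renames:
        out[question] = out[question].map(YES_NO)
    return out.rename(columns=renames)

# Toy data standing in for two survey questions
toy = pd.DataFrame({
    "Seen any film?": ["Yes", "No", "Yes"],
    "Are you a fan?": ["Yes", np.nan, "No"],
})
clean = yes_no_to_bool(toy, {"Seen any film?": "seen_any",
                             "Are you a fan?": "is_fan"})
print(clean)
```

Because the helper returns a new frame instead of modifying its input, it can be applied again later to the Expanded Universe and Star Trek columns without side effects.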

The next 6 columns, from Which of the following Star Wars films have you seen? Please select all that apply. to Unnamed: 8, hold the answers to the question Which of the following Star Wars films have you seen? Please select all that apply. The respondent answered by checking off a series of boxes.

Since the aim of the survey, and ultimately of this analysis, is to identify which Star Wars film the public likes the most, it is imperative that these columns be cleaned. Each of the six columns represents one movie, from Star Wars: Episode I The Phantom Menace to Star Wars: Episode VI Return of the Jedi. A column holds NaN if the respondent either hasn't watched the movie or hasn't answered. Treating NaN as False and any text as True makes the columns much easier to work with. The column names are also changed.

In [8]:
cols = df.iloc[:,3:9].columns

# A checked box holds the film title; an unchecked box is NaN
for col in cols:
    df[col] = df[col].notna()

df.rename(columns={
    'Which of the following Star Wars films have you seen? Please select all that apply.':'Ep_1',
    "Unnamed: 4":'Ep_2',
    "Unnamed: 5":'Ep_3',
    "Unnamed: 6":'Ep_4',
    "Unnamed: 7":'Ep_5',
    "Unnamed: 8":'Ep_6'
}, inplace=True)

df.iloc[:,3:9].head(10)
Out[8]:
Ep_1 Ep_2 Ep_3 Ep_4 Ep_5 Ep_6
1 True True True True True True
2 False False False False False False
3 True True True False False False
4 True True True True True True
5 True True True True True True
6 True True True True True True
7 True True True True True True
8 True True True True True True
9 True True True True True True
10 False True False False False False
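
With the episode columns stored as booleans, counting viewers per film becomes a simple column sum. The following is a minimal sketch on made-up data (only three of the six columns, four invented respondents), not a result from the survey.

```python
import pandas as pd

# Made-up viewing data for four respondents
toy = pd.DataFrame({
    "Ep_1": [True, False, True, True],
    "Ep_5": [True, False, False, True],
    "Ep_6": [True, False, False, False],
})

# True counts as 1, so the column sum is the number of viewers per film
views = toy.sum()
# The column mean is the share of respondents who saw each film
share = toy.mean()

print(views.sort_values(ascending=False))
print(share)
```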

The columns from Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. to Unnamed: 14 ask the respondent to rank the movies from 1 to 6, with rank 1 being the favorite and rank 6 the least favorite.
The ranks are cast to float so that they can be treated as numbers, and the column names are changed for ease of analysis.

In [9]:
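# Assign by column label so the columns take on the float dtype
# (assigning through .iloc may keep the old object dtype in some pandas versions)
rank_cols = df.columns[9:15]
df[rank_cols] = df[rank_cols].astype(float)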

df.rename(columns= {
    'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.' : 'Ep_1_rank',
    'Unnamed: 10' : 'Ep_2_rank',
    'Unnamed: 11' : 'Ep_3_rank',
    'Unnamed: 12' : 'Ep_4_rank',
    'Unnamed: 13' : 'Ep_5_rank',
    'Unnamed: 14' : 'Ep_6_rank'
}, inplace = True)
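
A sketch of how the rank columns might be used later: since rank 1 is the favorite, the film with the lowest mean rank is the most preferred. The example assumes the ranks may arrive as text and uses pd.to_numeric to convert them; the data is invented and covers only three films.

```python
import numpy as np
import pandas as pd

# Made-up ranks stored as text, with one respondent who skipped the question
toy = pd.DataFrame({
    "Ep_1_rank": ["3", "6", np.nan, "5"],
    "Ep_5_rank": ["1", "2", np.nan, "1"],
    "Ep_6_rank": ["2", "1", np.nan, "3"],
})

# Convert every column to numbers; unparseable values become NaN
ranks = toy.apply(pd.to_numeric, errors="coerce")

# mean() skips NaN by default, so non-answers do not affect the average
mean_rank = ranks.mean()
best = mean_rank.idxmin()

print(mean_rank)
print(best)
```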

The columns from Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. to Unnamed: 28 hold the respondents' views (favorable, unfavorable, or unfamiliar) on individual characters from the films. These columns, together with Which character shot first?, are not needed for the analysis at hand, so they are removed from the dataset for now.

In [10]:
rm_cols = df.iloc[:,15:30].columns
df.drop(rm_cols, axis=1, inplace=True)
df.head(5)
Out[10]:
RespondentID seen_any is_fan Ep_1 Ep_2 Ep_3 Ep_4 Ep_5 Ep_6 Ep_1_rank ... Ep_5_rank Ep_6_rank Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe? Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location
1 3.292880e+09 True True True True True True True True 3.0 ... 5.0 6.0 Yes No No Male 18-29 NaN High school degree South Atlantic
2 3.292880e+09 False NaN False False False False False False NaN ... NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.292765e+09 True False True True True False False False 1.0 ... 5.0 6.0 No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
4 3.292763e+09 True True True True True True True True 5.0 ... 4.0 3.0 No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
5 3.292731e+09 True True True True True True True True 5.0 ... 1.0 3.0 Yes No No Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central

5 rows × 23 columns
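
Positional slices are half-open, so iloc[:, 15:30] selects positions 15 through 29. In the column list of this dataset, position 29 is Which character shot first?, which is why it is dropped along with the character columns. The sketch below shows the same behavior on a made-up six-column frame, and also keeps a copy of the dropped columns in case they are needed later; the column names are hypothetical.

```python
import pandas as pd

# Empty frame with made-up column names; only the positions matter here
toy = pd.DataFrame(columns=["id", "q1", "q1_a", "q1_b", "shot_first", "age"])

# Positions 1, 2, 3 and 4 are selected; position 5 ("age") is not
to_drop = toy.columns[1:5]

# Keep a copy of the dropped columns for a later analysis
stashed = toy[to_drop].copy()
kept = toy.drop(columns=to_drop)

print(list(to_drop))
print(list(kept.columns))
```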

All the non-film material, such as novels, comic books, TV series and other supporting films, is referred to as the Star Wars Expanded Universe, which was later rebranded as Star Wars Legends. Two columns in the dataset touch on this topic: Are you familiar with the Expanded Universe? and Do you consider yourself to be a fan of the Expanded Universe?.
These columns are preserved for now, for further analysis. As with the first two columns, the answers are mapped to Boolean:

'Yes' - True
'No' - False

and the column names are changed for ease of access to:

Are you familiar with the Expanded Universe? - knows_EU
Do you consider yourself to be a fan of the Expanded Universe? - likes_EU
In [11]:
mappings = {
    'Yes':True,
    'No':False
}

cols = [
    'Are you familiar with the Expanded Universe?',
    'Do you consider yourself to be a fan of the Expanded Universe?'
]

for col in cols:
    df[col] = df[col].map(mappings)

df.rename(columns= {
    'Are you familiar with the Expanded Universe?' : 'knows_EU',
    'Do you consider yourself to be a fan of the Expanded Universe?' : 'likes_EU'
}, inplace=True)

df.head(5)
Out[11]:
RespondentID seen_any is_fan Ep_1 Ep_2 Ep_3 Ep_4 Ep_5 Ep_6 Ep_1_rank ... Ep_5_rank Ep_6_rank knows_EU likes_EU Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location
1 3.292880e+09 True True True True True True True True 3.0 ... 5.0 6.0 True False No Male 18-29 NaN High school degree South Atlantic
2 3.292880e+09 False NaN False False False False False False NaN ... NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.292765e+09 True False True True True False False False 1.0 ... 5.0 6.0 False NaN No Male 18-29 $0 - $24,999 High school degree West North Central
4 3.292763e+09 True True True True True True True True 5.0 ... 4.0 3.0 False NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
5 3.292731e+09 True True True True True True True True 5.0 ... 1.0 3.0 True False No Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central

5 rows × 23 columns

To add some spice to the data, the respondents were asked whether they also liked the Star Trek franchise. Their answers are in the column Do you consider yourself to be a fan of the Star Trek franchise?.
Taking a similar approach, the column is cleaned by applying the following mappings:

'Yes' - True
'No' - False

and changing the column name to likes_star_trek for ease of analysis.

In [12]:
mappings = {
    'Yes':True,
    'No':False
}

df[
    'Do you consider yourself to be a fan of the Star Trek franchise?'
] = df[
    'Do you consider yourself to be a fan of the Star Trek franchise?'
].map(mappings)

df.rename(columns= {
    'Do you consider yourself to be a fan of the Star Trek franchise?' : 'likes_star_trek'
}, inplace=True)

df.head(5)
Out[12]:
RespondentID seen_any is_fan Ep_1 Ep_2 Ep_3 Ep_4 Ep_5 Ep_6 Ep_1_rank ... Ep_5_rank Ep_6_rank knows_EU likes_EU likes_star_trek Gender Age Household Income Education Location
1 3.292880e+09 True True True True True True True True 3.0 ... 5.0 6.0 True False False Male 18-29 NaN High school degree South Atlantic
2 3.292880e+09 False NaN False False False False False False NaN ... NaN NaN NaN NaN True Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.292765e+09 True False True True True False False False 1.0 ... 5.0 6.0 False NaN False Male 18-29 $0 - $24,999 High school degree West North Central
4 3.292763e+09 True True True True True True True True 5.0 ... 4.0 3.0 False NaN True Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
5 3.292731e+09 True True True True True True True True 5.0 ... 1.0 3.0 True False False Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central

5 rows × 23 columns

The remaining columns -

  • Gender
  • Age
  • Household Income
  • Education
  • Location

describe personal attributes of the respondent. These columns can be useful for generalizing the analysis over segments of respondents. They require neither cleaning nor renaming.

Now that all the columns are clean and ready for the analysis, the first question to answer is: How many respondents are fans of the Star Wars franchise? To answer this question, the columns seen_any, is_fan, Gender and Age are used.

The is_fan column identifies which respondents are fans of the Star Wars franchise. Since this is a survey, the data cannot be trusted blindly. It is possible that a respondent claimed to be a fan without having watched any of the movies. Such rows would be inconsistent responses, so it is worth checking for them.

In [13]:
df[(df.seen_any == False) & (df.is_fan == True)]
Out[13]:
RespondentID seen_any is_fan Ep_1 Ep_2 Ep_3 Ep_4 Ep_5 Ep_6 Ep_1_rank ... Ep_5_rank Ep_6_rank knows_EU likes_EU likes_star_trek Gender Age Household Income Education Location

0 rows × 23 columns

There are no such rows, which is a good sign: it suggests that the responses are internally consistent.
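
A broader version of this check is to cross-tabulate the two columns, which shows every combination of answers at once, including the missing ones. This is a minimal sketch on invented data; the labels used for the categories are assumptions made for readability.

```python
import numpy as np
import pandas as pd

# Made-up answers for five respondents
toy = pd.DataFrame({
    "seen_any": [True, True, False, False, True],
    "is_fan": [True, False, np.nan, np.nan, np.nan],
})

# Turn the booleans into labels so that missing answers get their own column
seen = toy["seen_any"].map({True: "seen", False: "not seen"})
fan = toy["is_fan"].map({True: "fan", False: "not fan"}).fillna("no answer")

table = pd.crosstab(seen, fan)
print(table)
```

A non-zero count in the "not seen" row under "fan" would flag respondents who claim to be fans without having watched any film.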

In [14]:
df.is_fan.value_counts(dropna=False)
Out[14]:
True     552
NaN      350
False    284
Name: is_fan, dtype: int64

There are 350 NaN values in the data. It is better to classify them as either True or False so that the analysis is complete; leaving these 350 respondents out would skew the percentages computed from the remaining data.

In [15]:
nulls = df[df.is_fan.isna()]
nulls.seen_any.value_counts()
Out[15]:
False    250
True     100
Name: seen_any, dtype: int64

There are 250 respondents who have not watched any of the 6 Star Wars movies (seen_any is False). Since they have not watched the movies, it is fair to assume that they are not fans of the franchise. The remaining 100 respondents answered Yes (seen_any is True), yet every subsequent answer in their rows is missing; the episode columns show False only because NaN was converted to False earlier. These 100 rows consist almost entirely of missing data.

NOTE - Only the first 15 of the 100 rows are shown below; the remaining rows follow the same pattern.

In [16]:
nulls[nulls.seen_any == True].head(15)
Out[16]:
RespondentID seen_any is_fan Ep_1 Ep_2 Ep_3 Ep_4 Ep_5 Ep_6 Ep_1_rank ... Ep_5_rank Ep_6_rank knows_EU likes_EU likes_star_trek Gender Age Household Income Education Location
11 3.292638e+09 True NaN False False False False False False NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
81 3.291669e+09 True NaN False False False False False False NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
97 3.291570e+09 True NaN False False False False False False NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
106 3.291470e+09 True NaN False False False False False False NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
128 3.291420e+09 True NaN False False False False False False NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
130 3.291406e+09 True NaN False False False False False False NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
146 3.291341e+09 True NaN False False False False False False NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
181 3.291038e+09 True NaN False False False False False False NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
191 3.291022e+09 True NaN False False False False False False NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
198 3.291007e+09 True NaN False False False False False False NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
209 3.290981e+09 True NaN False False False False False False NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
211 3.290977e+09 True NaN False False False False False False NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
223 3.290950e+09 True NaN False False False False False False NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
231 3.290940e+09 True NaN False False False False False False NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
244 3.290912e+09 True NaN False False False False False False NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

15 rows × 23 columns

Based on the arguments above, the following actions are taken:

  • All 100 rows with almost all values missing are removed from the dataset.
  • Rows where is_fan is NaN and seen_any is False have is_fan filled with False.
In [17]:
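# Drop respondents who saw a film but answered nothing else
df = df[~((df.seen_any == True) & (df.is_fan.isna()))].copy()
# The remaining NaN values belong to non-viewers, who are counted as non-fans
df['is_fan'] = df['is_fan'].fillna(False)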
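
The two rules can be checked on a small made-up frame before trusting them on the full data. The sketch below uses eq(True) rather than fillna for the second rule, which is a swapped-in technique that gives the same result here: the comparison is False for both False and NaN, and it yields a plain boolean column.

```python
import numpy as np
import pandas as pd

# Made-up answers for five respondents
toy = pd.DataFrame({
    "seen_any": [True, True, False, False, True],
    "is_fan": [True, np.nan, np.nan, np.nan, False],
})

# Rule 1: drop respondents who saw a film but left is_fan blank
incomplete = toy["seen_any"] & toy["is_fan"].isna()
cleaned = toy.loc[~incomplete].copy()

# Rule 2: the remaining blanks belong to non-viewers; count them as non-fans
cleaned["is_fan"] = cleaned["is_fan"].eq(True)

print(cleaned)
print(cleaned["is_fan"].mean())
```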
In [18]:
df.is_fan.value_counts(normalize=True)
Out[18]:
True     0.508287
False    0.491713
Name: is_fan, dtype: float64
In [19]:
fans_dist = df.is_fan.value_counts(normalize=True)

plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.barplot(x= fans_dist.index, y= fans_dist.values)
plt.ylabel('percentage of respondents')
plt.xlabel('respondent is a fan?')
plt.title('Percentage of respondents who are fans of the Star Wars franchise')
Out[19]:
Text(0.5, 1.0, 'Percentage of respondents who are fans of the Star Wars franchise')

Counting people who have not watched any of the movies as "not fans", the shares of respondents who are and are not fans are very close: slightly more than half of the respondents (about 51%) are fans of the Star Wars franchise.
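
The shares in Out[18] can be reproduced by hand from the counts in Out[14] and Out[15]. The variable names below are only for illustration.

```python
fans = 552          # is_fan == True (Out[14])
non_fans = 284      # is_fan == False (Out[14])
non_viewers = 250   # is_fan missing, seen_any == False: counted as non-fans
dropped = 100       # is_fan missing, seen_any == True: removed from the data

# Respondents left after the 100 mostly-empty rows are removed
total = fans + non_fans + non_viewers

fan_share = fans / total
non_fan_share = (non_fans + non_viewers) / total

print(total, round(fan_share, 6), round(non_fan_share, 6))
```

Note that without the 250 non-viewers in the denominator, the fan share among those who answered the question directly would be 552 out of 836, which is considerably higher.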

These findings are incomplete without a more granular look that focuses only on the "Fans", i.e. those respondents who claim to be fans of the franchise.
This part of the analysis uses two columns, Gender and Age, starting with Gender.
There may be no meaningful relationship between gender and being a fan of the franchise; gender is considered here only to describe the makeup of the respondents.

In [20]:
fans = df[df.is_fan == True].copy()  # independent copy, safe to modify below

fans.Gender.value_counts(dropna=False, normalize=True)
Out[20]:
Male      0.548913
Female    0.431159
NaN       0.019928
Name: Gender, dtype: float64

There are 11 NaN values (about 2% of the fans); these could be because the respondent either did not want to reveal their gender or found no option that represented them.
To avoid discarding data, these NaN values are filled with 'Others'. The figures above show that the shares of males and females among the "Fans" are not far apart, with males ahead by about 12 percentage points.

In [21]:
fans['Gender'] = fans['Gender'].fillna('Others')
gender_counts = fans.Gender.value_counts(normalize=True)
print(gender_counts)

# plt.style.use('seaborn')
# plt.figure(figsize=(10,8))
# plt.pie(
#     x= gender_counts,
#     labels = ['Male','Female','Others'],
#     colors = ['#009999','#ff9933','#99004C'],
#     autopct="%1.1f%%",
#     textprops=dict(color='w',fontsize=10),
#     shadow= True,
#     wedgeprops = {'linewidth': 1},
#     pctdistance= 0.7
# )
# plt.legend(['Male','Female','Others'],loc='upper right', bbox_to_anchor=(1, 0.5, 0.5, 0.5))
# plt.show()

layout = go.Layout(
    title={
        'text':"<b>Distribution of Gender among Fans</b><br>Percentage of Males, Females or others in the Fan population",
        'yanchor':'top',
        'xref':'paper',
        'x':0.5
    }
)

data = [
    go.Pie(
        labels= gender_counts.index,
        values= gender_counts.values,
        marker= dict(
            colors= ['#009999','#ff9933','#99004C'],
            line= dict(width=1)
        ),
        hovertemplate= "%{label} : %{percent}<extra></extra>"
    )
]

fig = go.Figure(data= data, layout= layout)
fig.show()
Male      0.548913
Female    0.431159
Others    0.019928
Name: Gender, dtype: float64
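
The printed shares can be turned back into head counts and a male-female gap with a little arithmetic, using the 552 fans from Out[14]. The variable names are only for illustration.

```python
fan_count = 552
shares = {"Male": 0.548913, "Female": 0.431159, "Others": 0.019928}

# Approximate head count behind each share
counts = {label: round(share * fan_count) for label, share in shares.items()}

# Gap between male and female fans, in percentage points
gap_points = (shares["Male"] - shares["Female"]) * 100

print(counts)
print(round(gap_points, 1))
```

The gap is expressed in percentage points because it is a difference between two shares of the same group, not a relative change.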