Star Wars is an epic space-opera media franchise that quickly became a worldwide pop-culture phenomenon. There have been a total of 9 movies, called episodes, since its first release in 1977.
While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that "The Empire Strikes Back" is clearly the best of the bunch?
They surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses.
The aim of this project is to clean and analyze this survey to answer the following questions :-
These questions give an insight into the perception of the respondents and are key to finding the popular movies and characters of the franchise.
The analysis to answer the above questions is split into 5 parts :-
Analyzing Star Wars film franchise fans on a granular level.
Finding the most viewed and most popular movie of the Star Wars franchise.
Analyzing super fans of the franchise on a granular level.
Perceptions of characters from the Star Wars franchise.
Analyzing Space-Opera media franchises (Star Wars and Star Trek) fans on a granular level.
The dataset has been picked up from this Link
A few columns are described below:-
There are several more columns in the dataset containing answers to questions about the Star Wars franchise.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pywaffle.waffle import Waffle
import plotly as py
import plotly.graph_objs as go
import warnings
warnings.filterwarnings('ignore')
survey = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
survey.head(5)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Star Wars: Episode I The Phantom Menace | ... | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 38 columns
cols = ['RespondentID',
'Have you seen any of the 6 films in the Star Wars franchise?',
'Do you consider yourself to be a fan of the Star Wars film franchise?',
'Which of the following Star Wars films have you seen? Please select all that apply.',
'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
'Unnamed: 14',
'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
'Unnamed: 28', 'Which character shot first?',
'Are you familiar with the Expanded Universe?',
'Do you consider yourself to be a fan of the Expanded Universe?',
'Do you consider yourself to be a fan of the Star Trek franchise?',
'Gender', 'Age', 'Household Income', 'Education',
'Location']
survey.columns = cols
survey.columns
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'Which of the following Star Wars films have you seen? Please select all that apply.', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location'], dtype='object')
survey.isna().sum()
RespondentID 1 Have you seen any of the 6 films in the Star Wars franchise? 0 Do you consider yourself to be a fan of the Star Wars film franchise? 350 Which of the following Star Wars films have you seen? Please select all that apply. 513 Unnamed: 4 615 Unnamed: 5 636 Unnamed: 6 579 Unnamed: 7 428 Unnamed: 8 448 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. 351 Unnamed: 10 350 Unnamed: 11 351 Unnamed: 12 350 Unnamed: 13 350 Unnamed: 14 350 Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. 357 Unnamed: 16 355 Unnamed: 17 355 Unnamed: 18 363 Unnamed: 19 361 Unnamed: 20 372 Unnamed: 21 360 Unnamed: 22 366 Unnamed: 23 374 Unnamed: 24 359 Unnamed: 25 356 Unnamed: 26 365 Unnamed: 27 372 Unnamed: 28 360 Which character shot first? 358 Are you familiar with the Expanded Universe? 358 Do you consider yourself to be a fan of the Expanded Universe? 973 Do you consider yourself to be a fan of the Star Trek franchise? 118 Gender 140 Age 140 Household Income 328 Education 150 Location 143 dtype: int64
The RespondentID column contains one invalid row where the RespondentID is NaN. Since the ID is supposed to be unique and present for every respondent, this row is removed from the dataset. The row actually holds the options presented to the respondent for the questions with checkboxes.
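# .copy() gives an independent frame, so the column assignments below do not act on a view of survey
df = survey.dropna(axis=0, subset=['RespondentID']).copy()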
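The row removal described above can be tried on a small stand-in frame. This is only a sketch: the `toy` frame and its `seen` column are made up for illustration and are not part of the survey file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw survey: row 0 mimics the answer-options row (no ID).
toy = pd.DataFrame({
    "RespondentID": [np.nan, 3.0, 2.0, 1.0],
    "seen": ["Response", "Yes", "No", "Yes"],
})

# Keep only the rows that carry a respondent ID.
toy_clean = toy.dropna(subset=["RespondentID"]).copy()

print(len(toy), "->", len(toy_clean))
print(toy_clean["RespondentID"].is_unique)
```

Checking `is_unique` afterwards is a cheap way to confirm that the remaining IDs really identify one respondent each.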
The two columns Have you seen any of the 6 films in the Star Wars franchise? and Do you consider yourself to be a fan of the Star Wars film franchise? hold the answers to those two questions. They are important for this analysis, which focuses on the people who have seen the movies and/or are fans of the franchise.
df[[
'Have you seen any of the 6 films in the Star Wars franchise?',
'Do you consider yourself to be a fan of the Star Wars film franchise?'
]].head(15)
Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | |
---|---|---|
1 | Yes | Yes |
2 | No | NaN |
3 | Yes | No |
4 | Yes | Yes |
5 | Yes | Yes |
6 | Yes | Yes |
7 | Yes | Yes |
8 | Yes | Yes |
9 | Yes | Yes |
10 | Yes | No |
11 | Yes | NaN |
12 | No | NaN |
13 | Yes | No |
14 | Yes | Yes |
15 | Yes | Yes |
The two columns contain either Yes or No values, with some missing values in between. For ease of usage throughout the analysis, these values are mapped to boolean.
'Yes' - True
'No' - False
The column names have also been changed to :-
Have you seen any of the 6 films in the Star Wars franchise? - seen_any
Do you consider yourself to be a fan of the Star Wars film franchise? - is_fan
mappings = {
'Yes':True,
'No':False
}
df[
'Have you seen any of the 6 films in the Star Wars franchise?'
] = df[
'Have you seen any of the 6 films in the Star Wars franchise?'
].map(mappings)
df[
'Do you consider yourself to be a fan of the Star Wars film franchise?'
] = df[
'Do you consider yourself to be a fan of the Star Wars film franchise?'
].map(mappings)
df.rename(columns = {
'Have you seen any of the 6 films in the Star Wars franchise?' : 'seen_any',
'Do you consider yourself to be a fan of the Star Wars film franchise?' : 'is_fan'
}, inplace=True)
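A minimal sketch of the Yes/No mapping on made-up data shows what happens to unanswered questions: `Series.map` leaves them as NaN, which is why the missing values have to be handled separately later on. The `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical answers; NaN stands for a skipped question.
toy = pd.DataFrame({
    "seen_any": ["Yes", "No", "Yes"],
    "is_fan": ["Yes", np.nan, "No"],
})

yes_no = {"Yes": True, "No": False}
for col in ["seen_any", "is_fan"]:
    # Values that are not keys of the dict (including NaN) become NaN.
    toy[col] = toy[col].map(yes_no)

print(toy)
```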
The next 6 columns, from Which of the following Star Wars films have you seen? Please select all that apply. to Unnamed: 8, hold the answers to the question Which of the following Star Wars films have you seen? Please select all that apply. The respondent checked off a series of boxes as the response.
Since the aim of the survey, and eventually of the analysis, is to identify which Star Wars film the public likes the most, it is imperative that these columns be cleaned. Each of the six columns represents a movie, from Star Wars: Episode I The Phantom Menace to Star Wars: Episode VI Return of the Jedi. A column has a NaN value if the respondent either hasn't watched the movie or hasn't answered. Treating these NaN values as False and any text as True makes the columns much easier to work with. The column names are also changed.
cols = df.iloc[:,3:9].columns
for col in cols:
    # True when the checkbox held the film title, False when it was left empty (NaN)
    df[col] = df[col].notna()
df.rename(columns={
'Which of the following Star Wars films have you seen? Please select all that apply.':'Ep_1',
"Unnamed: 4":'Ep_2',
"Unnamed: 5":'Ep_3',
"Unnamed: 6":'Ep_4',
"Unnamed: 7":'Ep_5',
"Unnamed: 8":'Ep_6'
}, inplace=True)
df.iloc[:,3:9].head(10)
Ep_1 | Ep_2 | Ep_3 | Ep_4 | Ep_5 | Ep_6 | |
---|---|---|---|---|---|---|
1 | True | True | True | True | True | True |
2 | False | False | False | False | False | False |
3 | True | True | True | False | False | False |
4 | True | True | True | True | True | True |
5 | True | True | True | True | True | True |
6 | True | True | True | True | True | True |
7 | True | True | True | True | True | True |
8 | True | True | True | True | True | True |
9 | True | True | True | True | True | True |
10 | False | True | False | False | False | False |
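The checkbox conversion can be sketched on two made-up columns. `DataFrame.notna` turns any text into True and any NaN into False in one step, and summing the resulting booleans counts the viewers per film. The shortened titles and the `toy` frame are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical checkbox columns: the film title when ticked, NaN otherwise.
toy = pd.DataFrame({
    "Ep_1": ["Episode I", np.nan, np.nan],
    "Ep_2": ["Episode II", np.nan, "Episode II"],
})

seen = toy.notna()   # text -> True, NaN -> False
print(seen)
print(seen.sum())    # number of viewers per film
```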
The columns from Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. to Unnamed: 14 ask the respondent to rank the movies from 1 to 6, with rank 1 being the most favorite and rank 6 the least favorite.
Since the columns hold numeric ranks, they are converted to float and only the column names are changed for ease of analysis.
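# Assign by column label so the float dtype is stored; assigning through iloc may keep the existing object dtype
rank_cols = list(df.columns[9:15])
df[rank_cols] = df[rank_cols].astype('float')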
df.rename(columns= {
'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.' : 'Ep_1_rank',
'Unnamed: 10' : 'Ep_2_rank',
'Unnamed: 11' : 'Ep_3_rank',
'Unnamed: 12' : 'Ep_4_rank',
'Unnamed: 13' : 'Ep_5_rank',
'Unnamed: 14' : 'Ep_6_rank'
}, inplace = True)
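As a sketch of the rank conversion, the toy frame below stores ranks as text, which is an assumption about how such columns can arrive when a file mixes labels and numbers. Assigning the converted columns back by label replaces them, so the float dtype is kept and `mean` works as expected.

```python
import numpy as np
import pandas as pd

# Hypothetical rank columns read in as text, with NaN for unanswered rows.
toy = pd.DataFrame({
    "Ep_1_rank": ["3", np.nan, "1"],
    "Ep_2_rank": ["2", np.nan, "6"],
})

rank_cols = ["Ep_1_rank", "Ep_2_rank"]
toy[rank_cols] = toy[rank_cols].astype(float)

print(toy.dtypes)
print(toy.mean())   # NaN rows are skipped by default
```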
The columns from Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. to Unnamed: 28 are answers to a series of questions about how favorably the respondents view the characters of the films. These columns, along with Which character shot first?, are not of much importance to the analysis at hand and are removed from the dataset for now.
rm_cols = df.iloc[:,15:30].columns
df.drop(rm_cols, axis=1, inplace=True)
df.head(5)
RespondentID | seen_any | is_fan | Ep_1 | Ep_2 | Ep_3 | Ep_4 | Ep_5 | Ep_6 | Ep_1_rank | ... | Ep_5_rank | Ep_6_rank | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3.0 | ... | 5.0 | 6.0 | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1.0 | ... | 5.0 | 6.0 | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5.0 | ... | 4.0 | 3.0 | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | True | True | True | True | True | True | True | True | 5.0 | ... | 1.0 | 3.0 | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 23 columns
All the non-film material produced, such as novels, comic books, TV series and other supporting films, is referred to as the Star Wars Expanded Universe, which was later rebranded as Star Wars Legends. There are two columns in the dataset that touch on this topic: Are you familiar with the Expanded Universe? and Do you consider yourself to be a fan of the Expanded Universe?.
These columns are preserved for now, for further analysis. Similar to the approach for the first two columns, the answers are mapped to booleans :-
'Yes' - True
'No' - False
and column names are changed for ease of access to :-
Are you familiar with the Expanded Universe? - knows_EU
Do you consider yourself to be a fan of the Expanded Universe? - likes_EU
mappings = {
'Yes':True,
'No':False
}
cols = [
'Are you familiar with the Expanded Universe?',
'Do you consider yourself to be a fan of the Expanded Universe?'
]
for col in cols:
    df[col] = df[col].map(mappings)
df.rename(columns= {
'Are you familiar with the Expanded Universe?' : 'knows_EU',
'Do you consider yourself to be a fan of the Expanded Universe?' : 'likes_EU'
}, inplace=True)
df.head(5)
RespondentID | seen_any | is_fan | Ep_1 | Ep_2 | Ep_3 | Ep_4 | Ep_5 | Ep_6 | Ep_1_rank | ... | Ep_5_rank | Ep_6_rank | knows_EU | likes_EU | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3.0 | ... | 5.0 | 6.0 | True | False | No | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1.0 | ... | 5.0 | 6.0 | False | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5.0 | ... | 4.0 | 3.0 | False | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | True | True | True | True | True | True | True | True | 5.0 | ... | 1.0 | 3.0 | True | False | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 23 columns
To add some spice to the data, the respondents were asked whether they also liked the Star Trek franchise. Their answers are in the column Do you consider yourself to be a fan of the Star Trek franchise?.
Taking a similar approach, the column is cleaned by making the following mappings :-
'Yes' - True
'No' - False
and changing the column name to likes_star_trek for ease of analysis.
mappings = {
'Yes':True,
'No':False
}
df[
'Do you consider yourself to be a fan of the Star Trek franchise?'
] = df[
'Do you consider yourself to be a fan of the Star Trek franchise?'
].map(mappings)
df.rename(columns= {
'Do you consider yourself to be a fan of the Star Trek franchise?' : 'likes_star_trek'
}, inplace=True)
df.head(5)
RespondentID | seen_any | is_fan | Ep_1 | Ep_2 | Ep_3 | Ep_4 | Ep_5 | Ep_6 | Ep_1_rank | ... | Ep_5_rank | Ep_6_rank | knows_EU | likes_EU | likes_star_trek | Gender | Age | Household Income | Education | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3.0 | ... | 5.0 | 6.0 | True | False | False | Male | 18-29 | NaN | High school degree | South Atlantic |
2 | 3.292880e+09 | False | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | True | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1.0 | ... | 5.0 | 6.0 | False | NaN | False | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5.0 | ... | 4.0 | 3.0 | False | NaN | True | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3.292731e+09 | True | True | True | True | True | True | True | True | 5.0 | ... | 1.0 | 3.0 | True | False | False | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 rows × 23 columns
The remaining columns (Gender, Age, Household Income, Education and Location) describe personal attributes of the respondent. These columns can be useful for generalizing the analysis over segments of respondents. As such, they require neither cleaning nor column name changes.
Now that all the columns are clean and ready for the analysis, the first question to answer is: how many respondents are fans of the Star Wars franchise? To answer this question, the columns seen_any, is_fan, Gender and Age are utilized.
The is_fan column identifies which respondents are fans of the Star Wars franchise. Since this is a survey, we cannot always trust the data present. There is a possibility that a respondent answered that he/she is a fan, but hasn't watched any movies. Such responses can be considered outliers in the data, so it is better to check for them.
df[(df.seen_any == False) & (df.is_fan == True)]
RespondentID | seen_any | is_fan | Ep_1 | Ep_2 | Ep_3 | Ep_4 | Ep_5 | Ep_6 | Ep_1_rank | ... | Ep_5_rank | Ep_6_rank | knows_EU | likes_EU | likes_star_trek | Gender | Age | Household Income | Education | Location |
---|
0 rows × 23 columns
There are no such outliers, which is a good sign that the data can be trusted.
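The same kind of check can be extended to the unanswered fan question. The sketch below uses a made-up `toy` frame and counts three combinations with boolean masks; the labels in `checks` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned columns: is_fan is True, False or NaN (no answer).
toy = pd.DataFrame({
    "seen_any": [True, False, True, True, False],
    "is_fan": [True, np.nan, False, np.nan, np.nan],
})

is_fan = toy["is_fan"] == True      # NaN compares as False here
no_answer = toy["is_fan"].isna()
seen = toy["seen_any"]

checks = {
    "fan but seen nothing": int((is_fan & ~seen).sum()),
    "no answer, seen nothing": int((no_answer & ~seen).sum()),
    "no answer, seen something": int((no_answer & seen).sum()),
}
print(checks)
```

Counting the combinations side by side makes it easier to decide how each group of missing answers should be treated.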
df.is_fan.value_counts(dropna=False)
True 552 NaN 350 False 284 Name: is_fan, dtype: int64
There are 350 NaN values in the data. It would be better to classify them as either True or False so that the analysis can be complete. Ignoring these 350 respondents would skew the percentages computed from the remaining data.
nulls = df[df.is_fan.isna()]
nulls.seen_any.value_counts()
False 250 True 100 Name: seen_any, dtype: int64
There are 250 respondents who have not watched any of the 6 Star Wars movies (seen_any is False). Since they have not watched any movie, it is fair to assume that they are not fans of the franchise. The remaining 100 respondents answered yes (seen_any is True), yet it is interesting to note that all of their following columns are NaN. These 100 rows contain almost no data.
NOTE - Only a few rows out of the 100 are shown to prove the point. The trend does follow through.
nulls[nulls.seen_any == True].head(15)
RespondentID | seen_any | is_fan | Ep_1 | Ep_2 | Ep_3 | Ep_4 | Ep_5 | Ep_6 | Ep_1_rank | ... | Ep_5_rank | Ep_6_rank | knows_EU | likes_EU | likes_star_trek | Gender | Age | Household Income | Education | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11 | 3.292638e+09 | True | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
81 | 3.291669e+09 | True | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
97 | 3.291570e+09 | True | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
106 | 3.291470e+09 | True | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
128 | 3.291420e+09 | True | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
130 | 3.291406e+09 | True | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
146 | 3.291341e+09 | True | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
181 | 3.291038e+09 | True | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
191 | 3.291022e+09 | True | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
198 | 3.291007e+09 | True | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
209 | 3.290981e+09 | True | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
211 | 3.290977e+09 | True | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
223 | 3.290950e+09 | True | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
231 | 3.290940e+09 | True | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
244 | 3.290912e+09 | True | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
15 rows × 23 columns
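Based on the arguments above, the following actions are taken :-
NaN values for is_fan where seen_any is False are filled with False.
Rows where seen_any is True but is_fan is NaN are removed, since they hold almost no data.
df = df[~((df.seen_any == True) & (df.is_fan.isna()))].copy()
# Assign the filled column back instead of calling fillna in place on an attribute
df['is_fan'] = df['is_fan'].fillna(False)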
df.is_fan.value_counts(normalize=True)
True 0.508287 False 0.491713 Name: is_fan, dtype: float64
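The two-step treatment of the missing is_fan answers can be sketched on made-up data as follows. The `toy` frame is hypothetical, and the final `astype(bool)` is an addition of this sketch to get a plain boolean column.

```python
import numpy as np
import pandas as pd

# Hypothetical respondents: NaN in is_fan means the question was skipped.
toy = pd.DataFrame({
    "seen_any": [True, False, True, False, True],
    "is_fan": [True, np.nan, np.nan, np.nan, False],
})

# Step 1: drop respondents who saw a film but skipped the fan question.
toy = toy[~(toy["seen_any"] & toy["is_fan"].isna())].copy()

# Step 2: the remaining gaps belong to respondents who saw no film,
# so they are counted as not fans.
toy["is_fan"] = toy["is_fan"].fillna(False).astype(bool)

print(toy["is_fan"].value_counts(normalize=True))
```

Doing the drop before the fill matters: filling first would also turn the nearly empty rows into "not fans" and inflate that group.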
fans_dist = df.is_fan.value_counts(normalize=True)
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.barplot(x= fans_dist.index, y= fans_dist.values)
plt.ylabel('percentage of respondents')
plt.xlabel('respondent is a fan?')
plt.title('Percentage of respondents who are fans of the Star Wars franchise')
Text(0.5, 1.0, 'Percentage of respondents who are fans of the Star Wars franchise')
Considering people who have not watched any of the movies as "not fans", the percentages of respondents who are and are not fans are very close. Of all the respondents, slightly more than half are fans of the Star Wars franchise.
These findings are incomplete unless the analysis goes further, to a granular level, focusing only on the "Fans", i.e. those respondents who claim to be fans of the franchise.
The analysis will be done on the two columns Gender and Age.
Starting with the gender
There may not be a logical correlation between gender and being a "Fan" of the franchise. Gender is considered only for the sake of the analysis and to gather a statistic about the respondents.
# .copy() so that columns can be filled and added on fans without touching df
fans = df[df.is_fan == True].copy()
fans.Gender.value_counts(dropna=False, normalize=True)
Male 0.548913 Female 0.431159 NaN 0.019928 Name: Gender, dtype: float64
There are 11 NaN values. These could be because the respondent either did not want to reveal their gender or found no option that represented them.
To be fair and not ignore data, these NaN values are filled with 'Others' out of respect.
The stats above show that the percentages of males and females among the "Fans" are not very far apart, with males ahead by around 12 percentage points.
fans['Gender'] = fans['Gender'].fillna('Others')
gender_counts = fans.Gender.value_counts(normalize=True)
print(gender_counts)
# plt.style.use('seaborn')
# plt.figure(figsize=(10,8))
# plt.pie(
# x= gender_counts,
# labels = ['Male','Female','Others'],
# colors = ['#009999','#ff9933','#99004C'],
# autopct="%1.1f%%",
# textprops=dict(color='w',fontsize=10),
# shadow= True,
# wedgeprops = {'linewidth': 1},
# pctdistance= 0.7
# )
# plt.legend(['Male','Female','Others'],loc='upper right', bbox_to_anchor=(1, 0.5, 0.5, 0.5))
# plt.show()
layout = go.Layout(
title={
'text':"<b>Distribution of Gender among Fans</b><br>Percentage of Males, Females or others in the Fan population",
'yanchor':'top',
'xref':'paper',
'x':0.5
}
)
data = [
go.Pie(
labels= gender_counts.index,
values= gender_counts.values,
marker= dict(
colors= ['#009999','#ff9933','#99004C'],
line= dict(width=1)
),
hovertemplate= "%{label} : %{percent}<extra></extra>"
)
]
fig = go.Figure(data= data, layout= layout)
fig.show()
Male 0.548913 Female 0.431159 Others 0.019928 Name: Gender, dtype: float64
plt.figure(
FigureClass=Waffle,
figsize=(12,6),
rows = 4,
columns = 10,
values = fans.Gender.value_counts(normalize=True),
legend={'loc': 'upper left', 'bbox_to_anchor': (1.05, 0.5), 'labels':['Male','Female','Others']},
icons='child',
font_size=65,
title={'label': 'Number of Fans by Gender (per 40 people)', 'loc': 'center','fontsize':24}
)
plt.show()
The following conclusions are drawn from the plots :-
Thus concluding that the franchise is popular among both Males and Females almost equally.
Age seems a more appropriate characteristic for differentiating between the Fans. Different age groups usually have different likings, and the results of this analysis are expected to reflect that. However, given that Star Wars was first released in 1977, with A New Hope (Episode IV), and the survey was conducted in recent times, the age group 40 and above is also a candidate to have a good number of Fans.
df.Age.value_counts(dropna=False)
45-60 291 > 60 269 30-44 268 18-29 218 NaN 40 Name: Age, dtype: int64
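The age group is of ordinal type with the intervals shown above. There are 40 NaN values in the full dataset. Note that the encode function below sends every value outside the first three intervals to its else branch, so a missing age is labeled Elder rather than ignored. For ease of the analysis, the intervals are converted to text categories as below :-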
18-29 -> Young
30-44 -> Middle
45-60 -> Senior
> 60 -> Elder
The barplot below, shows the percentage of fans per age group.
def encode(age_grp):
    """Convert an age interval to a text label.
    :param age_grp: the age interval
    """
    if age_grp == '18-29':
        return 'Young'
    elif age_grp == '30-44':
        return 'Middle'
    elif age_grp == '45-60':
        return 'Senior'
    else:
        # Any other value, including a missing age (NaN), is labeled 'Elder'
        return 'Elder'
fans['age_label'] = fans.Age.apply(encode)
fans_age = fans.age_label.value_counts(normalize=True)
# Order the categories by age explicitly instead of by position
fans_age = fans_age.reindex(['Young', 'Middle', 'Senior', 'Elder'])
print(fans_age)
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.barplot(x= fans_age.index, y=fans_age.values)
plt.ylabel('percentage of fans')
plt.xlabel('age category')
plt.title('Percentage of fans of Star Wars per Age category')
Young 0.224638 Middle 0.271739 Senior 0.278986 Elder 0.224638 Name: age_label, dtype: float64
Text(0.5, 1.0, 'Percentage of fans of Star Wars per Age category')
The plot shows that the share of fans increases with age up to the Senior category, which has the most representatives, and then drops for the Elder category. This supports the theory discussed above: the young generation of the 1980s are the Seniors of today, and thus this category is likely, as shown, to have more representatives. It's important to note that the fan base of the Star Wars franchise has stayed consistent over the years; all age categories have a good number of fans.
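As an alternative to an if/else chain, the age labels can be assigned with a dictionary lookup. This is a sketch on made-up ages, not the method used for the figures above: with `Series.map`, a missing age stays NaN instead of falling into a catch-all branch, so it is left out of the shares.

```python
import numpy as np
import pandas as pd

age_labels = {"18-29": "Young", "30-44": "Middle", "45-60": "Senior", "> 60": "Elder"}
order = ["Young", "Middle", "Senior", "Elder"]

# Hypothetical ages, including one respondent who did not state an age.
ages = pd.Series(["18-29", "> 60", np.nan, "45-60", "30-44", "45-60"])

labels = ages.map(age_labels)   # missing or unknown intervals stay NaN
shares = labels.value_counts(normalize=True).reindex(order)

print(shares)
```

Whether missing ages should be dropped or kept in a category of their own is a judgment call; the point of the sketch is only that the choice becomes explicit.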
age_catgs = fans.age_label.unique()
Males = []
Females = []
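for catg in age_catgs:
    gender_counts = fans[fans.age_label == catg].Gender.value_counts()
    # Look counts up by label so the result does not depend on which gender is more frequent
    Males.append(gender_counts.get('Male', 0))
    Females.append(gender_counts.get('Female', 0))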
layout = go.Layout(
title = {
'text':"<b>Gender distribution of Fans between various Age categories</b><br>Distribution of male and female fans for every age category",
'xanchor':'left',
'font':{'size':22}
},
yaxis=go.layout.YAxis(
title='Age category'
),
xaxis=go.layout.XAxis(
range=[-80, 100],
# tickvals=[-100, -70, -30, 0, 30, 70, 100],
# ticktext=[100, 70, 30, 0, 30, 70, 100],
title='Number of Fans',
showticklabels=False
),
barmode='overlay',
bargap=0.1
)
data = [
go.Bar(
y=age_catgs,
x=Males,
orientation='h',
name='Male',
hoverinfo='x',
marker=dict(color='#ff9933')
),
go.Bar(
y=age_catgs,
x=[-1 * f for f in Females],
orientation='h',
name='Female',
text= Females,
hoverinfo='text',
marker=dict(color='#009999')
)
]
fig = go.Figure(data= data, layout= layout)
fig.show()
The plot above is only to illustrate the distribution of gender across various age categories for the fans of the Star Wars franchise.
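The per-category gender counts that feed such a plot can also be produced in one step with `pd.crosstab`, where every count is addressed by its label. The `toy_fans` frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical fans with an age label and a gender each.
toy_fans = pd.DataFrame({
    "age_label": ["Young", "Young", "Senior", "Senior", "Senior", "Elder"],
    "Gender": ["Male", "Female", "Female", "Female", "Male", "Male"],
})

counts = pd.crosstab(toy_fans["age_label"], toy_fans["Gender"])
# Fix the row order and give categories without any fan a count of zero.
counts = counts.reindex(["Young", "Middle", "Senior", "Elder"], fill_value=0)

males = counts["Male"].tolist()
females = counts["Female"].tolist()
print(counts)
```

The lists `males` and `females` line up with the fixed category order, so they can be passed to the two bar traces directly.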
The next part of the analysis answers which of the 6 films is most viewed by the public (the respondents) and which of them is the most popular. Here, popularity is defined by the average rank received by the movie: the most popular movie is the one with the lowest average rank on the scale of 1-6.
There are two sets of columns required for this analysis. The columns Ep_1 through Ep_6 are boolean columns identifying whether the respondent has watched that movie or not. Every respondent has ranked the movies they have seen from 1 to 6, 1 being the highest rank and 6 being the lowest. These are available in the Ep_1_rank through Ep_6_rank columns.
The sum of the first set of columns gives the number of views per movie, and the mean of the second set gives a fair enough metric for the popularity of the movie. A popular movie, as mentioned, will have a lower mean value, which means a better average rank. For ease of understanding, the movies are mapped as follows :-
Episode I -> The Phantom Menace (1999)
Episode II -> Attack of the Clones (2002)
Episode III -> Revenge of the Sith (2005)
Episode IV -> A New Hope (1977)
Episode V -> The Empire Strikes Back (1980)
Episode VI -> Return of the Jedi (1983)
The mean rank of each movie is subtracted from the maximum possible rank, 6, so that a movie with a lower (better) mean rank gets a larger value. This inverts the scale and thus the plot is more interpretable.
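A small worked example of the inversion, using three hypothetical films ranked by three hypothetical respondents (a scale of 1-3 would be enough here, but the maximum of 6 is kept to match the survey):

```python
import pandas as pd

max_rank = 6

# Hypothetical ranks: one row per respondent, 1 = favorite.
toy_ranks = pd.DataFrame({
    "Film A": [1, 1, 2],
    "Film B": [2, 3, 1],
    "Film C": [3, 2, 3],
})

mean_rank = toy_ranks.mean()
inverted = max_rank - mean_rank   # lower mean rank -> larger value

print(mean_rank)
print(inverted)
```

The film with the smallest mean rank ends up with the largest inverted value, so the longest bar in the plot marks the favorite.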
max_rank = 6
movies = [
'The Phantom Menace',
'Attack of the Clones',
'Revenge of the Sith',
'A New Hope',
'The Empire Strikes Back',
'Return of the Jedi'
]
views = df.iloc[:,3:9].sum()
views.index = movies
ranks = df.iloc[:,9:15].mean()
invert_ranks = max_rank - ranks
invert_ranks.index = movies
colors=['#009999','#009999','#009999','#009999','#ff9933','#009999']
layout = go.Layout(
title = {
'text':'<b>Most viewed Star Wars film in the franchise</b><br>Views received by each movie in the Star Wars franchise',
'font':{'size':22}
},
yaxis=go.layout.YAxis(title='Star Wars movie'),
xaxis=go.layout.XAxis(title='Number of views',showticklabels=False)
)
data = [
go.Bar(
x= views.values,
y= views.index,
marker_color= colors,
hovertemplate='<i>Views:</i> %{x:.0f}<extra></extra>',
orientation= 'h'
)
]
fig = go.Figure(data= data, layout= layout)
fig.show()
The plot highlights the most viewed movie in the Star Wars franchise: The Empire Strikes Back, which 758 respondents have watched. Return of the Jedi is a close second with 738 views. Since these numbers cover all respondents and not just fans, the high view counts give a sense of how popular these movies are.
colors=['#009999','#009999','#009999','#009999','#ff9933','#009999']
layout = go.Layout(
title = {
'text':'<b>Most popular Star Wars film in the franchise</b><br>'+
'average rank received by each movie in the Star Wars franchise',
'font':{'size':22},
},
yaxis=go.layout.YAxis(title='Star Wars movie'),
xaxis=go.layout.XAxis(title='Average rank',showticklabels=False)
)
data = [
go.Bar(
x= invert_ranks.values,
y= invert_ranks.index,
marker_color= colors,
orientation= 'h',
hovertemplate='<i>Average rank:</i> %{text:.2f}<extra></extra>',
text = ranks
)
]
fig = go.Figure(data= data, layout= layout)
fig.show()
As the plot shows, The Empire Strikes Back is the most popular film in the Star Wars franchise. It can also be inferred that the original trilogy, released between 1977 and 1983, is the most popular overall. The prequel trilogy, released from 1999 through 2005, is less popular among the respondents.
The Star Wars franchise also includes novels, comic books, TV series and other works apart from the movies. These were grouped under the Star Wars Expanded Universe, which was later rebranded as Star Wars Legends. Two columns hold the respondents' views on it: knows_EU and likes_EU.
A super fan for this analysis is defined as someone who is a fan of the movies and also likes the Expanded Universe. Before these two columns are used, they have to be checked for inconsistent data, i.e. rows where the respondent answered False
for knows_EU but answered True
for likes_EU
df[(df.knows_EU == False ) & (df.likes_EU == True)]
RespondentID | seen_any | is_fan | Ep_1 | Ep_2 | Ep_3 | Ep_4 | Ep_5 | Ep_6 | Ep_1_rank | ... | Ep_5_rank | Ep_6_rank | knows_EU | likes_EU | likes_star_trek | Gender | Age | Household Income | Education | Location |
---|
0 rows × 23 columns
df[df.likes_EU.isna()].knows_EU.value_counts()
False 615 Name: knows_EU, dtype: int64
There are no such faulty rows. All NaN values in the likes_EU column have knows_EU as False, hence these rows can be filled with False.
df['likes_EU'] = df['likes_EU'].fillna(False)
df['knows_EU'] = df['knows_EU'].fillna(False)
df.likes_EU.value_counts(dropna=False)
False 987 True 99 Name: likes_EU, dtype: int64
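The consistency check and the handling of missing answers can be sketched on a toy frame as follows. The data and the helper name `contradictory_rows` are hypothetical; the sketch also assumes the columns hold True, False or NaN, as in the cleaned survey. It uses a comparison with True rather than `fillna`, which treats missing answers as False and yields a plain boolean column.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values); True/False/NaN mirror the cleaned survey columns.
toy = pd.DataFrame({
    "knows_EU": [True, True, False, False, np.nan],
    "likes_EU": [True, False, np.nan, np.nan, np.nan],
})

def contradictory_rows(frame):
    """Rows that claim to like the Expanded Universe without knowing about it."""
    return frame[(frame["knows_EU"] == False) & (frame["likes_EU"] == True)]

n_bad = len(contradictory_rows(toy))

# Only treat missing answers as False once the check has passed.
cleaned = toy.copy()
if n_bad == 0:
    for col in ["knows_EU", "likes_EU"]:
        # Comparing with True maps both False and NaN to False.
        cleaned[col] = cleaned[col] == True

print(n_bad, int(cleaned["likes_EU"].sum()))
```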
Among all the respondents, only 99 like the Star Wars Expanded Universe, which suggests that it is not a big hit with the public. To understand this better, the following plots show how many respondents know about the Expanded Universe and, among those who do, how many like it compared with how many don't. Respondents who were not aware of its existence are left out of the second comparison, since it cannot be said whether they would like it once they came to know of it.
knows_fans = df.knows_EU.value_counts(dropna=False, normalize=True)
likes_fans = df[df.knows_EU == True].likes_EU.value_counts(dropna=False, normalize=True)
# One figure holding two side-by-side subplots
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
sns.barplot(x= knows_fans.index, y= knows_fans.values)
plt.xlabel('knows about the Expanded Universe?')
plt.ylabel('percentage of respondents')
plt.title('Percentage of respondents who know about the Expanded Universe')
plt.subplot(1,2,2)
sns.barplot(x= likes_fans.index, y= likes_fans.values)
plt.xlabel('likes the Expanded Universe?')
plt.ylabel('percentage of respondents')
plt.title('Percentage of respondents who like the Expanded Universe')
Text(0.5, 1.0, 'Percentage of respondents who like the Expanded Universe')
The conclusions drawn from the two plots above are :-
This leads to the conclusion that the Star Wars Expanded Universe is less popular than the Star Wars film franchise.
As per the definition of "super fan" in this analysis, a fan who likes both the movies and the Expanded Universe is considered a super fan, so the super fans are a subset of the fans. The conclusions above were drawn for the entire population of respondents. The plots below limit the sample to only those respondents who are fans.
knows_fans = fans.knows_EU.value_counts(dropna=False, normalize=True)
likes_fans = fans[fans.knows_EU == True].likes_EU.value_counts(dropna=False, normalize=True)
# One figure holding two side-by-side subplots
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
sns.barplot(x= knows_fans.index, y= knows_fans.values)
plt.xlabel('knows about the Expanded Universe?')
plt.ylabel('percentage of respondents')
plt.title('Percentage of respondents who know about the Expanded Universe')
plt.subplot(1,2,2)
sns.barplot(x= likes_fans.index, y= likes_fans.values)
plt.xlabel('likes the Expanded Universe?')
plt.ylabel('percentage of respondents')
plt.title('Percentage of respondents who like the Expanded Universe')
Text(0.5, 1.0, 'Percentage of respondents who like the Expanded Universe')
The first plot shows the percentage of fans who know about the Expanded Universe. The result is in line with the previous conclusions: not many fans know about it. Among the small set of fans who do, slightly more than half actually like it; these are the "super fans". The super fans are analyzed by gender and age in the same way as the fans were.
super_fans = fans[fans.likes_EU == True]
gender_counts = super_fans.Gender.value_counts(normalize=True)
layout = go.Layout(
title={
'text':"<b>Distribution of Gender among Super Fans</b><br>Percentage of Males, Females or others in the Super Fan population",
'yanchor':'top',
'xref':'paper',
'x':0.5
}
)
data = [
go.Pie(
labels= gender_counts.index,
values= gender_counts.values,
marker= dict(
colors= ['#009999','#ff9933','#99004C'],
line= dict(width=1)
),
hovertemplate= "%{label} : %{percent}<extra></extra>"
)
]
fig = go.Figure(data= data, layout= layout)
fig.show()
It is interesting to note that Males clearly outnumber Females among the super fans, whereas the percentages of Males and Females among the fans were almost equal.
Similarly, for the age categories.
super_fans_age = super_fans.age_label.value_counts(normalize=True)
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.barplot(x= super_fans_age.index, y=super_fans_age.values)
plt.ylabel('percentage of super fans')
plt.xlabel('age category')
plt.title('Percentage of super fans of Star Wars per Age category')
Text(0.5, 1.0, 'Percentage of super fans of Star Wars per Age category')
Contrary to the observations made in the previous version of this plot, where the population considered was of the fans of the Star Wars franchise, the plot above narrates a different story.
age_catgs = super_fans.age_label.unique()
Males = []
Females = []
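for catg in age_catgs:
    gender_counts = super_fans[super_fans.age_label == catg].Gender.value_counts()
    # Look counts up by label; positional access assumed 'Male' was always the larger group
    Males.append(gender_counts.get('Male', 0))
    Females.append(gender_counts.get('Female', 0))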
layout = go.Layout(
title = {
'text':"<b>Gender distribution of Super Fans between various Age categories</b><br>"+
"Distribution of male and female super fans for every age category",
'xanchor':'left',
'font':{'size':22}
},
yaxis=go.layout.YAxis(title='Age category'),
xaxis=go.layout.XAxis(
range=[-15, 25],
# tickvals=[-10, 0, 10, 20],
# ticktext=[10, 0, 10 ,20],
showticklabels= False,
title='Number of Fans'
),
barmode='overlay',
bargap=0.1
)
data = [
go.Bar(
y=age_catgs,
x=Males,
orientation='h',
name='Male',
hoverinfo='x',
marker=dict(color='#ff9933')
),
go.Bar(
y=age_catgs,
x=[-1 * f for f in Females],
orientation='h',
name='Female',
text= Females,
hoverinfo='text',
marker=dict(color='#009999')
)
]
fig = go.Figure(data= data, layout= layout)
fig.show()
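The per-category gender counts that feed the plot above can also be built with a single cross-tabulation instead of a loop. This is a minimal sketch on toy data: the rows are invented, and the category order Young, Middle, Senior, Elder is assumed from the labels used elsewhere in the notebook.

```python
import pandas as pd

# Toy respondents (hypothetical values, same column names as the notebook).
toy = pd.DataFrame({
    "age_label": ["Young", "Young", "Middle", "Middle", "Senior", "Young"],
    "Gender":    ["Male", "Female", "Male", "Male", "Female", "Male"],
})

age_order = ["Young", "Middle", "Senior", "Elder"]

# Rows = age category, columns = gender; absent combinations become 0.
counts = (
    pd.crosstab(toy["age_label"], toy["Gender"])
      .reindex(index=age_order, columns=["Male", "Female"], fill_value=0)
)

males = counts["Male"].tolist()
females = counts["Female"].tolist()
print(males, females)
```

Reindexing by label keeps the bars in a fixed order and avoids relying on which gender happens to be more frequent in a given age category.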
As suggested by the pie plot, the number of male super fans is far greater in every age category.
The Expanded Universe is a newer entity than the film franchise. Its animated movies, TV series, comic books and video games appeal more to the Young and Middle age categories than to older respondents, and the results of the plot support this argument.
Several columns hold the answers to the question "Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her." The responses are on the following scale :-
Very favorably
Somewhat favorably
Neither favorably nor unfavorably (neutral)
Unfamiliar
Somewhat unfavorably
Very unfavorably
All the NaN values are filled with Unfamiliar in every column, and each column is converted to a categorical type.
characters = survey.iloc[0,15:29].values
# Work on a copy so that filling values does not write into a slice of survey
char_ratings = survey.iloc[1:,15:29].copy()
char_ratings = char_ratings.fillna('Unfamiliar')
characters
array(['Han Solo', 'Luke Skywalker', 'Princess Leia Organa', 'Anakin Skywalker', 'Obi Wan Kenobi', 'Emperor Palpatine', 'Darth Vader', 'Lando Calrissian', 'Boba Fett', 'C-3P0', 'R2 D2', 'Jar Jar Binks', 'Padme Amidala', 'Yoda'], dtype=object)
cols = char_ratings.columns
# Ordered categorical type shared by every character column
rating_dtype = pd.api.types.CategoricalDtype(ordered=True, categories = [
    'Very favorably',
    'Somewhat favorably',
    'Neither favorably nor unfavorably (neutral)',
    'Unfamiliar',
    'Somewhat unfavorably',
    'Very unfavorably'
])
for col in cols:
    char_ratings[col] = char_ratings[col].astype(rating_dtype)
The categories Unfamiliar and Neither favorably nor unfavorably (neutral) give no insight into how popular a character is or how the respondents perceive it, so these two categories are ignored.
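The percentage of each retained answer per character can be computed in one pass, as in the sketch below. The two characters' answers are invented toy data. As in the notebook, percentages are taken over all answers, including the ignored categories, so a row need not sum to 100.

```python
import pandas as pd

# Toy ratings for two characters (hypothetical values).
toy = pd.DataFrame({
    "Han Solo": ["Very favorably", "Very favorably", "Somewhat favorably", "Unfamiliar"],
    "Jar Jar Binks": ["Very unfavorably", "Somewhat favorably", "Very unfavorably", "Unfamiliar"],
})

kept = ["Very favorably", "Somewhat favorably", "Somewhat unfavorably", "Very unfavorably"]

# Percentage of every answer per character, then keep the four informative levels.
pct = (
    toy.apply(lambda col: col.value_counts(normalize=True) * 100)
       .reindex(kept)   # drops 'Unfamiliar' and the neutral answer
       .fillna(0.0)     # a level nobody chose counts as 0 %
       .T               # one row per character
)
print(pct)
```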
char_very_favorable = []
char_somewhat_favorable = []
char_somewhat_unfavorable = []
char_very_unfavorable = []
for col in cols:
    # Percentage of each answer for this character, over all respondents
    counts = char_ratings[col].value_counts(normalize=True) * 100
    char_very_favorable.append(counts['Very favorably'])
    char_somewhat_favorable.append(counts['Somewhat favorably'])
    char_somewhat_unfavorable.append(counts['Somewhat unfavorably'])
    char_very_unfavorable.append(counts['Very unfavorably'])
ratings = pd.DataFrame({
'character':characters,
'very_fav':char_very_favorable,
'somewhat_fav':char_somewhat_favorable,
'somewhat_unfav':char_somewhat_unfavorable,
'very_unfav':char_very_unfavorable
})
ratings.sort_values(by='very_fav',ascending=True,inplace=True)
layout = go.Layout(
title={
"text":"<b>Percentage of Favourability for every Character</b><br>"+
"Viewers perception of the popular characters from the Star Wars franchise",
'x':.5
},
yaxis=go.layout.YAxis(title='Characters'),
xaxis=go.layout.XAxis(
range=[-25, 55],
# tickvals=[-25, -10, 0, 15, 30, 55],
# ticktext=[25, 10, 0, 15, 30, 55],
title='Percentage of favorability',
showticklabels= False
),
barmode='overlay',
bargap=0.1
)
data = [
go.Bar(
y=ratings.character,
x= ratings.very_fav,
orientation= 'h',
name= 'Very Favorable',
hoverinfo= 'text',
hovertemplate="Very favorable %{text:.1f}<extra></extra>",
text = ratings.very_fav,
marker= dict(color= 'powderblue')
),
go.Bar(
y=ratings.character,
x= ratings.somewhat_fav,
orientation= 'h',
name= 'Somewhat Favorable',
hovertemplate="Somewhat favorable %{text:.1f}<extra></extra>",
hoverinfo= 'text',
text= ratings.somewhat_fav,
marker= dict(color= '#009999')
),
go.Bar(
y=ratings.character,
x= [-1 * r for r in ratings.very_unfav],
orientation= 'h',
name= 'Very Unfavorable',
hovertemplate="Very Unfavorable %{text:.1f}<extra></extra>",
hoverinfo= 'text',
text= ratings.very_unfav,
marker= dict(color= '#FC4040')
),
go.Bar(
y=ratings.character,
x= [-1 * r for r in ratings.somewhat_unfav],
orientation= 'h',
name= 'Somewhat Unfavorable',
hovertemplate="Somewhat Unfavorable %{text:.1f}<extra></extra>",
hoverinfo= 'text',
text= ratings.somewhat_unfav,
marker= dict(color= 'crimson')
)
]
fig = go.Figure(data= data, layout= layout)
fig.show()
The interactive plot above leads to the following conclusions :-
The most popular and favorable characters are :-
* Han Solo
* Yoda
* Obi Wan Kenobi
These characters received more than 50% "very favorable" responses and more than 10% "somewhat favorable" responses from the respondents.
No character is completely unfavored. A few characters cross the 10% mark for unfavorability :-
* Jar Jar Binks
* Emperor Palpatine
* Darth Vader
An interesting point: the two characters with the highest unfavorability percentages also have comparable favorability percentages. This suggests that no character in the Star Wars film franchise is overwhelmingly disliked.
There are some controversial character ratings among the lot such as :-
* Emperor Palpatine
* Darth Vader
These characters are villains and so have a high unfavorability percentage, but at the same time their favorability percentage is equal or higher.
The character list includes both Anakin Skywalker and Darth Vader, which shows how viewers perceive the same character at two different points in the Star Wars timeline.
To conclude this analysis, the final question is how many of the respondents like the space-opera media franchises. A respondent is counted as a sci-fi fan if they have said they are a fan of the Star Wars franchise or of the Star Trek franchise. The columns considered for this are is_fan and likes_star_trek.
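The fan definition above amounts to a logical OR of two columns. A minimal sketch on toy data (the values are invented; NaN stands for a skipped question):

```python
import numpy as np
import pandas as pd

# Toy answers (hypothetical values); NaN stands for a skipped question.
toy = pd.DataFrame({
    "is_fan":          [True, False, np.nan, False],
    "likes_star_trek": [False, True, np.nan, False],
})

# Comparing with True maps NaN to False, so skipped answers never count as fans.
toy["scifi_fan"] = (toy["is_fan"] == True) | (toy["likes_star_trek"] == True)

share = toy["scifi_fan"].mean()  # fraction of respondents who are sci-fi fans
print(toy["scifi_fan"].tolist(), share)
```

Note that this counts fans of either franchise, not only respondents who are fans of both.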
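# A respondent is a sci-fi fan if they are a fan of Star Wars or of Star Trek (NaN counts as False)
df['scifi_fan'] = (df.likes_star_trek == True) | (df.is_fan == True)
# Copy so that columns added to scifi_fans later do not write into a slice of df
scifi_fans = df[df.scifi_fan == True].copy()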
scifi = df.scifi_fan.value_counts(normalize=True)
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.barplot(x= scifi.index, y= scifi.values)
plt.ylabel('percentage of respondents')
plt.xlabel('respondent is a fan?')
plt.title('Percentage of respondents who are fans of the Space-Opera Media franchises')
Text(0.5, 1.0, 'Percentage of respondents who are fans of the Space-Opera Media franchises')
Among all the respondents, a high percentage are fans of at least one of the two Space-Opera Media franchises, Star Wars or Star Trek. A more granular analysis of these fans follows.
gender_counts = scifi_fans.Gender.value_counts(normalize=True)
layout = go.Layout(
title={
'text':"<b>Distribution of Gender among Sci-Fi Fans</b><br>"+
"Percentage of Males, Females or others in the Sci-Fi Fan population",
'yanchor':'top',
'xref':'paper',
'x':0.5
}
)
data = [
go.Pie(
labels= gender_counts.index,
values= gender_counts.values,
marker= dict(
colors= ['#009999','#ff9933','#99004C'],
line= dict(width=1)
),
hovertemplate= "%{label} : %{percent}<extra></extra>"
)
]
fig = go.Figure(data= data, layout= layout)
fig.show()
The percentages of Males and Females are close to equal, with Males slightly ahead.
scifi_fans['age_label'] = scifi_fans.Age.apply(encode)
fans_age = scifi_fans.age_label.value_counts(dropna=False, normalize=True)
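# Order the age categories by label instead of by position in the sorted counts
fans_age = fans_age.reindex(['Young', 'Middle', 'Senior', 'Elder'])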
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.barplot(x= fans_age.index, y=fans_age.values)
plt.ylabel('percentage of fans')
plt.xlabel('age category')
plt.title('Percentage of fans of Space-Opera Media per Age category')
Text(0.5, 1.0, 'Percentage of fans of Space-Opera Media per Age category')
It is interesting to note that the fans of the space-opera media franchises are mostly older than 30 years, with a peak among fans aged 45 and older. The two franchises in question are Star Wars and Star Trek; the former was first released in 1977 and the latter in 1966. The Young of that era are the Senior and Elder of today, so it makes sense that most of the fans are 45 or older.
age_catgs = scifi_fans.age_label.unique()
Males = []
Females = []
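for catg in age_catgs:
    gender_counts = scifi_fans[scifi_fans.age_label == catg].Gender.value_counts()
    # Look counts up by label; positional access assumed 'Male' was always the larger group
    Males.append(gender_counts.get('Male', 0))
    Females.append(gender_counts.get('Female', 0))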
layout = go.Layout(
title = {
'text':"<b>Gender distribution of Space-Opera Media Fans between various Age categories</b><br>"+
"Distribution of male and female space-opera media fans for every age category",
'xanchor':'left',
'font':{'size':22}
},
yaxis=go.layout.YAxis(title='Age category'),
xaxis=go.layout.XAxis(
range=[-100, 100],
# tickvals=[-60, -40, -20, 0, 25, 50, 75],
# ticktext=[60, 40, 20, 0, 25, 50, 75],
title='Number of Fans',
showticklabels= False
),
barmode='overlay',
bargap=0.1
)
data = [
go.Bar(
y=age_catgs,
x=Males,
orientation='h',
name='Male',
hoverinfo='x',
marker=dict(color='#ff9933')
),
go.Bar(
y=age_catgs,
x=[-1 * f for f in Females],
orientation='h',
name='Female',
text= Females,
hoverinfo='text',
marker=dict(color='#009999')
)
]
fig = go.Figure(data= data, layout= layout)
fig.show()
The plot shows that the number of fans increases with age category, in line with the results above. The plot below summarizes the data into categories and makes it easier to view the entire distribution.
df['age_label'] = df.Age.apply(encode)
total_resp = len(df)
# Look counts up by label rather than by position in the sorted value_counts output
gender_counts = df.Gender.value_counts(dropna=False)
no_male = gender_counts['Male']
no_female = gender_counts['Female']
age_counts_male = df[df.Gender == 'Male'].age_label.value_counts(dropna=False)
no_male_young = age_counts_male['Young']
no_male_middle = age_counts_male['Middle']
no_male_senior = age_counts_male['Senior']
no_male_elder = age_counts_male['Elder']
age_counts_female = df[df.Gender == 'Female'].age_label.value_counts(dropna=False)
no_female_young = age_counts_female['Young']
no_female_middle = age_counts_female['Middle']
no_female_senior = age_counts_female['Senior']
no_female_elder = age_counts_female['Elder']
# Build the masks from scifi_fans itself so they align with the frame being filtered
sw_count_young_male = scifi_fans[(scifi_fans.Gender=='Male') & (scifi_fans.age_label == 'Young')].is_fan.sum()
sw_count_middle_male = scifi_fans[(scifi_fans.Gender=='Male') & (scifi_fans.age_label == 'Middle')].is_fan.sum()
sw_count_senior_male = scifi_fans[(scifi_fans.Gender=='Male') & (scifi_fans.age_label == 'Senior')].is_fan.sum()
sw_count_elder_male = scifi_fans[(scifi_fans.Gender=='Male') & (scifi_fans.age_label == 'Elder')].is_fan.sum()
sw_count_young_female = scifi_fans[(scifi_fans.Gender=='Female') & (scifi_fans.age_label == 'Young')].is_fan.sum()
sw_count_middle_female = scifi_fans[(scifi_fans.Gender=='Female') & (scifi_fans.age_label == 'Middle')].is_fan.sum()
sw_count_senior_female = scifi_fans[(scifi_fans.Gender=='Female') & (scifi_fans.age_label == 'Senior')].is_fan.sum()
sw_count_elder_female = scifi_fans[(scifi_fans.Gender=='Female') & (scifi_fans.age_label == 'Elder')].is_fan.sum()
labels= [
'Total',
'Male',
'Female',
'Young',
'Middle',
'Senior',
'Elder',
'Young',
'Middle',
'Senior',
'Elder',
'Star Wars Fans',
'Star Wars Fans',
'Star Wars Fans',
'Star Wars Fans',
'Star Wars Fans',
'Star Wars Fans',
'Star Wars Fans',
'Star Wars Fans'
]
ids = [
'Total',
'Male',
'Female',
'Young Males',
'Middle Males',
'Senior Males',
'Elder Males',
'Young Females',
'Middle Females',
'Senior Females',
'Elder Females',
'Star Wars Fans Young Males',
'Star Wars Fans Middle Males',
'Star Wars Fans Senior Males',
'Star Wars Fans Elder Males',
'Star Wars Fans Young Females',
'Star Wars Fans Middle Females',
'Star Wars Fans Senior Females',
'Star Wars Fans Elder Females'
]
parents = [
"",
'Total',
'Total',
'Male',
'Male',
'Male',
'Male',
'Female',
'Female',
'Female',
'Female',
'Young Males',
'Middle Males',
'Senior Males',
'Elder Males',
'Young Females',
'Middle Females',
'Senior Females',
'Elder Females',
]
values = [
total_resp,
no_male,
no_female,
no_male_young,
no_male_middle,
no_male_senior,
no_male_elder,
no_female_young,
no_female_middle,
no_female_senior,
no_female_elder,
sw_count_young_male,
sw_count_middle_male,
sw_count_senior_male,
sw_count_elder_male,
sw_count_young_female,
sw_count_middle_female,
sw_count_senior_female,
sw_count_elder_female
]
data= [
go.Sunburst(
ids= ids,
labels = labels,
parents= parents,
values= values,
branchvalues= 'total'
)
]
layout = go.Layout(
title= {
'text': '<b>Distribution of Respondents in various categories</b><br>'+
'Summarizing the data into various categories',
'x':0.5,
'y':0.95
},
autosize= True,
margin = dict(t=85, l=0, r=0, b=0)
)
fig = go.Figure(data= data,layout=layout)
fig.show()
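With branchvalues set to 'total', each value in a sunburst is read as already including all of its descendants, so a node whose children sum to more than its own value is inconsistent and may prevent the chart from rendering. The helper below is a hypothetical sanity check, not part of Plotly, shown on an invented toy hierarchy.

```python
def check_sunburst_totals(ids, parents, values):
    """Return the ids whose children sum to more than the node's own value."""
    value_of = dict(zip(ids, values))
    child_sum = {}
    for parent, value in zip(parents, values):
        if parent:  # the root has an empty parent string
            child_sum[parent] = child_sum.get(parent, 0) + value
    return [node for node, total in child_sum.items() if total > value_of[node]]

# Toy hierarchy (hypothetical counts).
ids = ["Total", "Male", "Female", "Young Males"]
parents = ["", "Total", "Total", "Male"]
values = [10, 6, 4, 7]  # "Young Males" (7) exceeds its parent "Male" (6)

problems = check_sunburst_totals(ids, parents, values)
print(problems)
```

Running the same check on the ids, parents and values lists built above, before creating the figure, would flag any category whose fan count was computed on a different subset than its parent.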
The conclusions of the analysis of the survey respondents are as follows :-
Slightly more than half of the respondents are fans of the Star Wars film franchise.
The fans of the Star Wars film franchise have about equal percentages of Males and Females, with the Males slightly tipping the balance.
The fans of the Star Wars film franchise mostly belong to the age bracket of 30 to 60 years, which matches the period when Star Wars was in vogue.
The Empire Strikes Back is the most viewed and the most popular movie of the entire Star Wars film franchise.
The Star Wars Expanded Universe is not a great hit with the public, but among the fans who know about its existence, slightly more than half like it.
The Super fans have a substantially higher percentage of Males than Females.
Super fans mostly belong to the Young age category, since the Expanded Universe contains TV series and video games, which are newer and popular with the Young.
The following characters are rated the most favorable from the Star Wars film franchise: Han Solo, Yoda and Obi Wan Kenobi.
Similarly, the following characters are rated the most unfavorable from the Star Wars film franchise: Jar Jar Binks, Emperor Palpatine and Darth Vader.
The most controversial characters of the Star Wars film franchise are Emperor Palpatine and Darth Vader: even though they are villains and among the most unfavorable, they still have a good percentage of favorability.
A good percentage of the respondents are fans of at least one of the Space-Opera Media franchises (Star Wars or Star Trek).
The percentages of Males and Females among the fans of the Space-Opera Media franchises are very close, with Males slightly outnumbering Females.
The fans of the Space-Opera Media franchises are mostly above the age of 30, with a peak in the 45 to 60 age bracket. This is plausible since Star Wars and Star Trek date from the 1970s and 1960s respectively.
The final sunburst plot summarizes how the respondents are distributed between the various categories.