The team at FiveThirtyEight used the online tool SurveyMonkey to survey Star Wars fans on one question: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch? They received 835 total responses. We will analyze those responses to answer that question.
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
We need to specify an encoding because the data set contains some characters that can't be decoded with the default UTF-8 encoding.
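As a minimal sketch of why the encoding matters, the snippet below encodes a made-up string as ISO-8859-1 and then tries to decode the bytes as UTF-8. The sample text is an assumption for illustration and is not taken from the survey file.

```python
# A byte sequence written in ISO-8859-1 (Latin-1); the sample text is made up.
raw = "Padmé".encode("ISO-8859-1")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # The single byte 0xE9 ("é" in Latin-1) is not valid UTF-8 on its own.
    utf8_ok = False

# Decoding with the encoding the bytes were written in recovers the text.
latin1_text = raw.decode("ISO-8859-1")
print(utf8_ok, latin1_text)
```

The same kind of decoding failure is what `pd.read_csv` would report if the file were read without the `encoding` argument.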
# Let's look at the first rows of the data
star_wars.head(10)
# Let's inspect the data types
star_wars.info()
The data has several columns, including:
RespondentID is meant to be a unique ID for each respondent, but some rows have no value for it. We will remove the rows with a missing RespondentID.
# Let's look at the columns
star_wars.columns
# Selecting rows with a genuine respondent ID
star_wars = star_wars[star_wars['RespondentID'].notna()]
# Verifying Data
star_wars.head(10)
Now the star_wars dataframe contains only rows where RespondentID is not NaN.
The columns Have you seen any of the 6 films in the Star Wars franchise? and Do you consider yourself to be a fan of the Star Wars film franchise? hold Yes/No values. They can also be NaN when a respondent chose not to answer a question.
# Analyzing user response for the view count
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'][:20]
# Analyzing user response for fan count
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'][:20]
We can see that both columns contain the values Yes, No, or NaN.
It will be easier to work with these columns if we convert the values to booleans, because we can then select the rows that are True or False without doing a string comparison.
# Converting Yes/No values to Boolean values
yes_no = {
"Yes": True,
"No": False
}
star_wars=star_wars.copy()
star_wars['Have you seen any of the 6 films in the Star Wars franchise?']=star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(yes_no)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']=star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(yes_no)
# Verifying the conversion
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'][:20]
# Verifying the conversion
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'][:20]
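To illustrate why booleans are convenient here, the following sketch runs the same Yes/No mapping on a small made-up Series and then builds a selection mask from it. The Series name and its values are hypothetical and stand in for the real survey columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy answers; the real columns hold the same three kinds of values.
answers = pd.Series(["Yes", "No", np.nan, "Yes"], name="seen_any")

# Series.map sends "Yes"/"No" to booleans; values that are not keys of the
# dict (here NaN) come back as NaN.
as_bool = answers.map({"Yes": True, "No": False})

# Comparing with True gives a clean boolean mask; NaN compares as False.
mask = as_bool == True  # noqa: E712
n_yes = int(mask.sum())
print(as_bool.tolist(), n_yes)
```

A mask like this can be passed straight to a dataframe, for example `star_wars[mask]`, without comparing strings.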
The next six columns represent a single checkbox question. The respondent checked off a series of boxes in response to the question, Which of the following Star Wars films have you seen? Please select all that apply.
The columns for this question are:
# Analyzing movie column values
star_wars['Which of the following Star Wars films have you seen? Please select all that apply.']
Now we will convert the values in these columns to booleans: True if the respondent has seen the movie, and False if the value is NaN (not seen).
# Mapping movie column as boolean values
movie_mapping = {
    "Star Wars: Episode I The Phantom Menace": True,
    np.nan: False,
    "Star Wars: Episode II Attack of the Clones": True,
    "Star Wars: Episode III Revenge of the Sith": True,
    "Star Wars: Episode IV A New Hope": True,
    "Star Wars: Episode V The Empire Strikes Back": True,
    "Star Wars: Episode VI Return of the Jedi": True
}
# Columns 3 to 8 hold the six checkbox answers
for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(movie_mapping)
# Verifying conversion
star_wars.iloc[:, 3:9].head(10)
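Because each of these columns holds either a film title or NaN, the same conversion could also be sketched with `notna()`, which does not depend on the exact title strings. This is an alternative to the mapping dictionary, not the method used in this notebook, and the toy frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column excerpt of the checkbox question.
toy = pd.DataFrame({
    "film_a": ["Star Wars: Episode I The Phantom Menace", np.nan, np.nan],
    "film_b": [np.nan, "Star Wars: Episode II Attack of the Clones", np.nan],
})

# A title means the box was checked; NaN means it was not.
seen = toy.notna()
print(seen)
```

The result is a boolean frame of the same shape, so column sums count how many respondents saw each film.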
# Renaming the column to seen_1,seen_2... and so on
star_wars = star_wars.rename(columns={
"Which of the following Star Wars films have you seen? Please select all that apply.": "seen_1",
"Unnamed: 4": "seen_2",
"Unnamed: 5": "seen_3",
"Unnamed: 6": "seen_4",
"Unnamed: 7": "seen_5",
"Unnamed: 8": "seen_6"
})
# Verifying data
star_wars.head(10)
We have given the columns meaningful names. Now let's analyze the next six columns.
The next six columns ask the respondent to rank the Star Wars movies in order of preference. 1 means the film was the respondent's favorite, and 6 means it was the least favorite. Each of the following columns can contain the value 1, 2, 3, 4, 5, 6, or NaN:
The values in these columns are mostly integers, with NaN marking a missing rank, so we will convert the columns to the float data type.
# Converting the ranking columns to float
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
Now we will rename the columns to give them more meaningful names.
# Renaming columns
star_wars = star_wars.rename(columns={
"Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.": "ranking_1",
"Unnamed: 10": "ranking_2",
"Unnamed: 11": "ranking_3",
"Unnamed: 12": "ranking_4",
"Unnamed: 13": "ranking_5",
"Unnamed: 14": "ranking_6"
})
#Verifying Data
star_wars.head()
Now let's analyze the ranking columns and plot them.
# Find mean values of ranking columns
mean=star_wars[star_wars.columns[9:15]].mean()
# Verifying average ranking mean values
mean
# Plot average ranking mean values
sns.set_style('white')
x_tile = ["Episode I", "Episode II", "Episode III", "Episode IV", "Episode V", "Episode VI"]
fig, ax = plt.subplots(figsize=(5, 5))
x = np.arange(len(mean.index))
y = mean.values
ax.bar(x, y)
# Bars are centered on x, so the ticks go at the same positions
ax.set_xticks(x)
ax.set_xticklabels(x_tile, rotation=90)
ax.set_title("Average Ranking of All 6 Star Wars Movies")
ax.set_ylabel('Ranking')
ax.set_xlabel('Movies')
# Hide the tick marks (recent matplotlib versions expect booleans, not "off")
ax.tick_params(bottom=False, top=False, left=False, right=False)
for sp in ax.spines:
    ax.spines[sp].set_visible(False)
plt.show()
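From the plot above, it looks like the "original" movies (Episodes IV to VI) are ranked much more highly than the newer ones. Their average rank is lower, and a lower rank means a stronger preference.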
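Since 1 is the best rank, reading the chart means looking for the shortest bar. A minimal sketch of that reading, using made-up averages that are not the survey results and hypothetical column names:

```python
import pandas as pd

# Made-up average ranks for illustration only; these are NOT the survey results.
toy_mean = pd.Series({"ranking_a": 3.8, "ranking_b": 2.4, "ranking_c": 3.1})

# Rank 1 is the favorite, so sorting ascending puts the best-liked film first.
ordered = toy_mean.sort_values()
best = ordered.index[0]
print(ordered)
```

Applying `sort_values()` to the `mean` Series computed above would order the six films the same way.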
# Calculating value count for seen movies
total_seen=star_wars.iloc[:,3:9].sum()
# Verifying Data
total_seen
# Plot view count of the movies
sns.set_style('white')
fig, ax = plt.subplots(figsize=(5, 5))
x = np.arange(len(total_seen.index))
y = total_seen.values
ax.bar(x, y)
# Bars are centered on x, so the ticks go at the same positions
ax.set_xticks(x)
ax.set_xticklabels(x_tile, rotation=90)
ax.set_title("View count of the movies")
ax.set_ylabel('Count')
ax.set_xlabel('Movies')
# Hide the tick marks (recent matplotlib versions expect booleans, not "off")
ax.tick_params(bottom=False, top=False, left=False, right=False)
for sp in ax.spines:
    ax.spines[sp].set_visible(False)
plt.show()
From the plot above, we can observe that:
# Filtering data based on gender
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]
# Calculating mean values of ranking genderwise
males_mean=males[males.columns[9:15]].mean()
females_mean=females[females.columns[9:15]].mean()
# Plot average ranking of movies for males
sns.set_style('white')
fig, ax = plt.subplots(figsize=(5, 5))
x = np.arange(len(males_mean.index))
y = males_mean.values
# Recent seaborn versions require x and y as keyword arguments
sns.barplot(x=x, y=y, ax=ax)
ax.set_xticks(x)
ax.set_xticklabels(x_tile, rotation=90)
ax.set_title("Average Ranking of All 6 Star Wars Movies by Males")
ax.set_ylabel('Ranking')
# Hide the tick marks (recent matplotlib versions expect booleans, not "off")
ax.tick_params(bottom=False, top=False, left=False, right=False)
for sp in ax.spines:
    ax.spines[sp].set_visible(False)
plt.show()
# Plot average ranking of movies for females
sns.set_style('white')
fig, ax = plt.subplots(figsize=(5, 5))
x = np.arange(len(females_mean.index))
y = females_mean.values
# Recent seaborn versions require x and y as keyword arguments
sns.barplot(x=x, y=y, ax=ax)
ax.set_xticks(x)
ax.set_xticklabels(x_tile, rotation=90)
ax.set_title("Average Ranking of All 6 Star Wars Movies by Females")
ax.set_ylabel('Ranking')
# Hide the tick marks (recent matplotlib versions expect booleans, not "off")
ax.tick_params(bottom=False, top=False, left=False, right=False)
for sp in ax.spines:
    ax.spines[sp].set_visible(False)
plt.show()
# Plot average ranking of movies for both genders
sns.set_style('white')
plt.figure(figsize=(10, 6))
width = 0.25
plt.bar(np.arange(1,len(females_mean.index)+1)+width,females_mean.values,width,label="Female",color="red")
plt.bar(np.arange(1,len(males_mean.index)+1)-width,males_mean.values,width,label="Male",color="blue")
plt.xticks(np.arange(1,len(females_mean.index)+1), x_tile,fontsize=10)
# Label each bar with its value
for i, d in enumerate(males_mean.values):
    plt.text(x=i+1-width, y=d+0.10, s='{:.2f}'.format(d), fontdict=dict(fontsize=8), bbox=dict(facecolor='white', alpha=0.5))
for i, d in enumerate(females_mean.values):
    plt.text(x=i+1+width, y=d+0.10, s='{:.2f}'.format(d), fontdict=dict(fontsize=8), bbox=dict(facecolor='gray', alpha=0.5))
plt.legend(title='Gender')
plt.ylabel('Average Ranking', fontsize=12)
plt.ylim(0, 5)
plt.title('Ranking of Star Wars Movies by Gender', fontsize=18)
plt.show()
From the plot above, we can observe that:
# Calculating movie view count for males
males_seen=males.iloc[:,3:9].sum()
males_seen
# Calculating movie view count for females
females_seen=females.iloc[:,3:9].sum()
females_seen
# Plot view count of the movies for males
sns.set_style('white')
fig, ax = plt.subplots(figsize=(5, 5))
x = np.arange(len(males_seen.index))
y = males_seen.values
ax.bar(x, y)
# Bars are centered on x, so the ticks go at the same positions
ax.set_xticks(x)
ax.set_xticklabels(x_tile, rotation=90)
ax.set_title("Movies Seen by Males", fontsize=18)
ax.set_ylabel('No. of Males', fontsize=12)
# Hide the tick marks (recent matplotlib versions expect booleans, not "off")
ax.tick_params(bottom=False, top=False, left=False, right=False)
for sp in ax.spines:
    ax.spines[sp].set_visible(False)
plt.show()
# Plot view count of the movies for females
sns.set_style('white')
fig, ax = plt.subplots(figsize=(5, 5))
x_tile = ["Episode I", "Episode II", "Episode III", "Episode IV", "Episode V", "Episode VI"]
x = np.arange(len(females_seen.index))
y = females_seen.values
ax.bar(x, y)
# Bars are centered on x, so the ticks go at the same positions
ax.set_xticks(x)
ax.set_xticklabels(x_tile, rotation=90)
ax.set_title("Movies Seen by Females", fontsize=18)
ax.set_ylabel('No. of Females', fontsize=12)
for sp in ax.spines:
    ax.spines[sp].set_visible(False)
plt.show()
# Plot view count of the movies for both genders
sns.set_style('white')
plt.figure(figsize=(10, 6))
width = 0.25
x_tile = ["Episode I", "Episode II", "Episode III", "Episode IV", "Episode V", "Episode VI"]
plt.bar(np.arange(1, len(females_seen.index)+1)+width, females_seen.values, width, label="Female", color="red")
plt.bar(np.arange(1, len(males_seen.index)+1)-width, males_seen.values, width, label="Male", color="blue")
plt.xticks(np.arange(1, len(females_seen.index)+1), x_tile, fontsize=10)
# Label each bar with its count
for i, d in enumerate(males_seen.values):
    plt.text(x=i+1-width, y=d+10, s=d, fontdict=dict(fontsize=8), bbox=dict(facecolor='white', alpha=0.5))
for i, d in enumerate(females_seen.values):
    plt.text(x=i+1+width, y=d+10, s=d, fontdict=dict(fontsize=8), bbox=dict(facecolor='gray', alpha=0.3))
plt.legend(title='Gender')
plt.ylabel('View Count', fontsize=12)
plt.ylim(0, 450)
plt.title('View Count of Star Wars Movies by Gender', fontsize=24)
plt.show()
From the plot above, we can observe that:
# Analyzing data based on education
star_wars['Education'].value_counts(dropna=False)
# Calculating the average ranking of each movie, grouped by education
pivot_data = star_wars.pivot_table(values=['ranking_1','ranking_2','ranking_3','ranking_4','ranking_5','ranking_6'], index='Education', dropna=True, aggfunc='mean')
pivot_data.reset_index(inplace=True)
#Verifying pivot data
pivot_data
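To make the pivot step concrete, here is a small sketch on made-up data showing that a mean pivot table indexed by Education gives the same numbers as a groupby. The values are hypothetical, and only two ranking columns are used to keep it short.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "Education": ["Bachelor degree", "Bachelor degree", "Graduate degree"],
    "ranking_1": [2.0, 4.0, 5.0],
    "ranking_2": [1.0, 3.0, 6.0],
})

# One row per Education level, one column per ranking, each cell a group mean.
by_pivot = toy.pivot_table(values=["ranking_1", "ranking_2"],
                           index="Education", aggfunc="mean")

# The same table expressed as a groupby.
by_group = toy.groupby("Education")[["ranking_1", "ranking_2"]].mean()
print(by_pivot)
```

Either form works here. The pivot table reads naturally when the goal is a table that goes straight into `plot.bar`.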
# Plot pivot data for ranking based on education
pivot_data.plot.bar(width=0.8)
deg=['Bachelor','Graduate','High school','< High school','> High School']
plt.xticks(np.arange(0,5),deg,fontsize=8,rotation=0)
plt.legend(fontsize=7)
plt.ylabel('Average Ranking')
plt.xlabel('Degree')
plt.title('Ranking based on Education')
plt.show()
From the plot above, we can observe that:
#Unique values
star_wars['Location (Census Region)'].value_counts(dropna=False)
# Counting views of each movie, grouped by location
pivot_location = star_wars.pivot_table(values=['seen_1','seen_2','seen_3','seen_4','seen_5','seen_6'], index='Location (Census Region)', dropna=True, aggfunc='sum')
pivot_location.reset_index(inplace=True)
pivot_location
#Plotting pivot
pivot_location.plot.bar(figsize=(15,6),width=0.8)
deg=['East North Central','East South Central','Middle Atlantic','Mountain','New England','Pacific','South Atlantic','West North Central','West South Central']
plt.xticks(np.arange(0,9),deg,fontsize=8,rotation=0)
plt.legend(fontsize=12)
plt.ylabel('View Count',fontsize=15)
plt.xlabel('Location (Census Region)',fontsize=15)
plt.title('View Count based on location',fontsize=18)
plt.show()
From the plot above, we can observe that:
#Now renaming columns 15-28:
name_mapping = {'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.':'Han Solo',
'Unnamed: 16':'Luke Skywalker','Unnamed: 17':'Princess Leia','Unnamed: 18':'Anakin','Unnamed: 19':'Obi wan Kenobi',
'Unnamed: 20':'Palpatine','Unnamed: 21':'Darth Vader','Unnamed: 22':'Lando','Unnamed: 23':'Boba Fett','Unnamed: 24':'C-3PO',
'Unnamed: 25':'R2 D2','Unnamed: 26':'Jar Jar Binks','Unnamed: 27':'Padme','Unnamed: 28':'Yoda'}
star_wars=star_wars.rename(columns=(name_mapping)).copy()
# Let's check for NaN values in the character columns:
star_wars[star_wars.columns[15:29]].isna().sum()
The number of missing values is about the same in every character column (roughly 355-374), which suggests that the same respondents skipped this whole set of questions. We can therefore drop those rows, since doing so affects every column about equally.
# Let's drop rows with NaN values in the character columns
# (.copy() gives an independent dataframe that is safe to modify later)
character_star_wars = star_wars[star_wars.columns[15:29]].dropna(axis=0).copy()
#Verify the data
character_star_wars.head(10)
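As a rough illustration of the two steps above, the sketch below counts missing answers per column and then drops incomplete rows on a tiny made-up frame. The two character columns are borrowed from the notebook, but the answers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical answers for two characters; the values are made up.
toy = pd.DataFrame({
    "Han Solo": ["Very favorably", np.nan, "Somewhat favorably"],
    "Yoda": ["Very favorably", np.nan, np.nan],
})

# Per-column count of missing answers.
missing = toy.isna().sum()

# dropna(axis=0) keeps only the rows that have no NaN in any column.
complete = toy.dropna(axis=0)
print(missing, len(complete))
```

Note that a respondent who skipped only one character is removed from every column, so the averages computed afterwards cover fully answered rows only.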
We can convert the ratings to integers to make the calculations easier. We will map the values as follows:
#Mapping character column data
mapping = {'Very favorably':6,'Somewhat favorably':5,'Neither favorably nor unfavorably (neutral)':4,'Somewhat unfavorably':3,'Unfamiliar (N/A)':2,'Very unfavorably':1}
for c in character_star_wars:
    character_star_wars[c] = character_star_wars[c].map(mapping)
# Verifying data
character_star_wars.head(10)
# Find average rating of each character
character_mean=character_star_wars.mean().sort_values(ascending=False)
# Plot the graph
sns.set_style('white')
plt.figure(figsize=(10, 6))
character_mean.plot.barh()
plt.box(False) #remove box
plt.yticks(fontsize=14)
plt.xlabel('Ratings',fontsize=12)
plt.ylabel('Characters',fontsize=12)
plt.title('Average Rating of Character',fontsize=24)
plt.show()