While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?
The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which you can download from their GitHub repository.
In this project, we clean and analyze the survey data to find out which movies are the best liked and the most watched, and which factors affect those results.
import pandas as pd
#reading in data
star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')
star_wars.head(10)
star_wars.columns
The data has several columns, including the following:
RespondentID - an anonymized ID for the respondent (the person taking the survey)
Gender - the respondent's gender
Age - the respondent's age
Household Income - the respondent's income
Education - the respondent's education level
Location (Census Region) - the respondent's location
Have you seen any of the 6 films in the Star Wars franchise? - a Yes or No response
Do you consider yourself to be a fan of the Star Wars film franchise? - a Yes or No response
There are several other columns containing answers to questions about the Star Wars movies. For some questions, the respondent had to check one or more boxes.
Let us take a look at the following 2 columns:
Have you seen any of the 6 films in the Star Wars franchise?
Do you consider yourself to be a fan of the Star Wars film franchise?
saw_any_of_6 = star_wars['Have you seen any of the 6 films in the Star Wars franchise?']
saw_any_of_6.value_counts(dropna=False)
fan_series = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']
fan_series.value_counts(dropna=False)
Both columns represent Yes/No questions. A value of NaN appears where a respondent chose not to answer. We used the pandas.Series.value_counts() method with dropna=False to see every unique value in a column, including NaN, along with the number of times each value appears.
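As a small illustration of what dropna=False changes, here is a sketch on a made-up Series (the values are hypothetical, not taken from the survey):

```python
import pandas as pd

# Hypothetical toy answers; None stands in for a skipped question.
answers = pd.Series(["Yes", "No", "Yes", None, "Yes"])

# dropna=False keeps the missing answers as their own row in the tally.
counts_with_missing = answers.value_counts(dropna=False)

# The default (dropna=True) silently leaves the missing answers out.
counts_without_missing = answers.value_counts()

print(counts_with_missing)
print(counts_without_missing)
```

Comparing the two totals is a quick way to see how many respondents skipped a question.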
yes_no = {'Yes':True, 'No':False}
saw_any_of_6 = saw_any_of_6.map(yes_no) #converting Yes, No to boolean
saw_any_of_6.value_counts(dropna=False)
fan_series = fan_series.map(yes_no)
fan_series.value_counts(dropna=False)
Both columns start out as string types, because the main values they contain are Yes and No. We made the data a bit easier to analyze by converting each column to a Boolean with only the values True, False, and NaN. Booleans are easier to work with because we can select the rows that are True or False without having to do a string comparison.
We used the pandas.Series.map() method, passing it a dictionary that maps each string to its Boolean value, to perform the conversion. Note that the converted values live in the saw_any_of_6 and fan_series variables; the corresponding columns in star_wars still hold the original strings.
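To make the behavior of map() with a dictionary concrete, here is a minimal sketch on a hypothetical toy Series. Values that are not keys of the dictionary, including missing ones, come out as NaN:

```python
import pandas as pd

yes_no = {"Yes": True, "No": False}

# Hypothetical toy answers; None stands in for a skipped question.
answers = pd.Series(["Yes", "No", None, "Yes"])

# Each value is looked up in the dictionary; anything without a match becomes NaN.
as_bool = answers.map(yes_no)

# Select the rows answered "Yes" without comparing strings.
yes_rows = answers[as_bool == True]

print(as_bool)
print(len(yes_rows))
```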
The next six columns represent a single checkbox question. The respondent checked off a series of boxes in response to the question, Which of the following Star Wars films have you seen? Please select all that apply.
star_wars['Which of the following Star Wars films have you seen? Please select all that apply.'].value_counts(dropna=False)
The columns for this checkbox question are:
Which of the following Star Wars films have you seen? Please select all that apply. - whether or not the respondent saw Star Wars: Episode I The Phantom Menace.
Unnamed: 4 - whether or not the respondent saw Star Wars: Episode II Attack of the Clones.
Unnamed: 5 - whether or not the respondent saw Star Wars: Episode III Revenge of the Sith.
Unnamed: 6 - whether or not the respondent saw Star Wars: Episode IV A New Hope.
Unnamed: 7 - whether or not the respondent saw Star Wars: Episode V The Empire Strikes Back.
Unnamed: 8 - whether or not the respondent saw Star Wars: Episode VI Return of the Jedi.
cols_before = star_wars.columns[3:9]  # execute only once
cols_before
#renaming columns with index 3 to 8
for i in range(len(cols_before)):
    star_wars = star_wars.rename(columns={cols_before[i]: f'seen_{i+1}'})
star_wars.columns
cols_after = star_wars.columns[3:9]
cols_after
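The same renaming can also be done with a single rename() call by building the whole mapping first. The sketch below uses a hypothetical, shortened set of column names rather than the real survey headers:

```python
import pandas as pd

# Hypothetical stand-in for the survey DataFrame, with shortened column names.
toy = pd.DataFrame(
    columns=["RespondentID", "Which films have you seen?", "Unnamed: 2", "Unnamed: 3"]
)

# Build the old-name -> new-name mapping in one pass, numbering from 1.
old_names = toy.columns[1:4]
new_names = {old: f"seen_{i}" for i, old in enumerate(old_names, start=1)}

# A single rename call applies every entry of the mapping.
toy = toy.rename(columns=new_names)
print(list(toy.columns))
```

Because the mapping is keyed on the old names, running the rename a second time is harmless: names that no longer exist are simply ignored.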
Next we clean the data in the renamed columns. In each of these columns, if the value in a cell is the name of the movie, the respondent saw the movie. If the value is NaN, the respondent either didn't answer or didn't see the movie; we assumed that they didn't see it. As in the previous step, we converted the data to a Boolean type, which makes the later analysis easier.
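As a sketch of this conversion on hypothetical toy data, DataFrame.notna() gives the same result as negating isna(): a cell holding a movie name becomes True and a NaN cell becomes False.

```python
import numpy as np
import pandas as pd

# Hypothetical toy version of two checkbox columns.
toy = pd.DataFrame({
    "seen_1": ["Episode I", np.nan, "Episode I"],
    "seen_2": [np.nan, np.nan, "Episode II"],
})

# True where a movie name is present, False where the cell is NaN.
seen = toy.notna()

print(seen)
```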
def boolean_conv(series):
    new_series = series.isna()
    return ~new_series
watched_data_before=star_wars.iloc[:, 3:9] #execute only once
watched_data_before
# converting data to boolean
watched_data_after=watched_data_before.apply(boolean_conv)
watched_data_after
# assigning the boolean data back to main dataframe
star_wars.iloc[:, 3:9]=watched_data_after
star_wars
The next six columns ask the respondent to rank the Star Wars movies in order of preference. 1 means the film was the most favorite, and 6 means it was the least favorite.
ranking_cols_before=star_wars.columns[9:15] #execute only once
ranking_cols_before
Each of the following columns can contain the value 1, 2, 3, 4, 5, 6, or NaN:
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. - how much the respondent liked Star Wars: Episode I The Phantom Menace
Unnamed: 10 - how much the respondent liked Star Wars: Episode II Attack of the Clones
Unnamed: 11 - how much the respondent liked Star Wars: Episode III Revenge of the Sith
Unnamed: 12 - how much the respondent liked Star Wars: Episode IV A New Hope
Unnamed: 13 - how much the respondent liked Star Wars: Episode V The Empire Strikes Back
Unnamed: 14 - how much the respondent liked Star Wars: Episode VI Return of the Jedi
#renaming columns
for i in range(len(ranking_cols_before)):
    star_wars = star_wars.rename(columns={ranking_cols_before[i]: f'ranking_{i+1}'})
star_wars.columns
ranking_cols_after=star_wars.columns[9:15]
ranking_cols_after
#check column names and data type
star_wars[ranking_cols_after].info()
Now that we have cleaned up the ranking columns, finding the highest-ranked movie is much easier: we take the mean of each ranking column and compare the results.
ranking_data=star_wars[ranking_cols_after]
ranking_data
mean_data=ranking_data.mean() #calculating mean
mean_data.sort_values()
%matplotlib inline
import matplotlib.pyplot as plt
mean_data.plot(kind='bar')
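Let's relate the rankings to when each movie was released. Episodes IV, V, and VI came out in 1977, 1980, and 1983, while Episodes I, II, and III came out in 1999, 2002, and 2005.
It looks like the "original" movies are ranked more favorably than the newer ones. Because 1 is the best possible rank, a lower mean ranking indicates a better-liked movie.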
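Since the direction of the scale is easy to get backwards, here is a small sketch on hypothetical rankings (three made-up respondents, three movies) showing that the best-liked movie is the one with the smallest mean:

```python
import numpy as np
import pandas as pd

# Hypothetical rankings: 1 = favorite. NaN marks a movie the respondent did not rank.
rankings = pd.DataFrame({
    "ranking_1": [3, 2, np.nan],
    "ranking_2": [2, 3, 3],
    "ranking_3": [1, 1, 2],
})

# mean() skips NaN by default, so unranked movies do not distort the average.
mean_rank = rankings.mean()

# The best-liked movie has the LOWEST mean rank, hence idxmin rather than idxmax.
best = mean_rank.idxmin()

print(mean_rank.sort_values())
print(best)
```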
watched_data_after.sum().sort_values(ascending=False)
watched_data_after.sum().plot(kind='bar')
It appears that the original movies were seen by more respondents than the newer movies. This reinforces what we saw in the rankings, where the earlier movies seem to be more popular.
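Counting viewers with sum() works because each True counts as 1 and each False as 0. A minimal sketch on hypothetical Boolean data, which also shows that mean() gives the share of respondents instead of the count:

```python
import pandas as pd

# Hypothetical Boolean "seen" columns for three respondents.
seen = pd.DataFrame({
    "seen_1": [True, False, True],
    "seen_2": [False, False, True],
})

# Summing Booleans counts the True values in each column.
view_counts = seen.sum()

# The mean of a Boolean column is the fraction of True values.
view_share = seen.mean()

print(view_counts)
print(view_share)
```

The share is handy when comparing groups of different sizes, where raw counts can mislead.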
Now let's examine how certain segments of the survey population responded. There are several columns that segment our data into two groups. Here are a few examples:
Do you consider yourself to be a fan of the Star Wars film franchise? - Yes or No (the column in star_wars still holds the original strings; the Boolean version is stored separately in fan_series)
Do you consider yourself to be a fan of the Star Trek franchise? - Yes or No
Gender - Male or Female
We can split a DataFrame into two groups based on a binary column by selecting the rows that match each of its two values.
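The splitting idea can be sketched on a hypothetical toy DataFrame (the column names fan and ranking_1 are made up for illustration). A Boolean mask selects each group, and groupby() is shown as one possible alternative that computes the same per-group means in a single step:

```python
import pandas as pd

# Hypothetical toy data; None stands in for a respondent who skipped the question.
toy = pd.DataFrame({
    "fan": ["Yes", "No", "Yes", None],
    "ranking_1": [1, 3, 2, 2],
})

# Each comparison builds a Boolean mask; indexing with it keeps the matching rows.
fans = toy[toy["fan"] == "Yes"]
non_fans = toy[toy["fan"] == "No"]

# groupby drops rows whose key is missing by default, matching the two masks above.
by_group = toy.groupby("fan")["ranking_1"].mean()

print(fans["ranking_1"].mean(), non_fans["ranking_1"].mean())
print(by_group)
```

Respondents who skipped the question fall into neither subset, so the two groups together can be smaller than the full DataFrame.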
fans = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']=='Yes'] #star wars fans
non_fans=star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']=='No'] # non star wars fans
#comparison of fans & non star wars fans rating
plt.figure(figsize=(10, 3))
plt.subplot(1, 2, 1)
fans[ranking_cols_after].mean().plot(kind='bar')
plt.title('Fans')
plt.subplot(1, 2, 2)
non_fans[ranking_cols_after].mean().plot(kind='bar')
plt.title('Non Fans')
#comparison of star wars fans and non fans on the number of people who watched the movie
plt.figure(figsize=(10, 3))
plt.subplot(1, 2, 1)
fans[cols_after].sum().plot(kind='bar')
plt.title('Fans')
plt.subplot(1, 2, 2)
non_fans[cols_after].sum().plot(kind='bar')
plt.title('Non Fans')
non_fans[ranking_cols_after].mean().sort_values()
fans[ranking_cols_after].mean().sort_values()
non_fans[cols_after].sum().sort_values(ascending=False)
fans[cols_after].sum().sort_values(ascending=False)