While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?
The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. The resulting data, which you can download from their GitHub repository, contains 1,186 responses, of which about 835 include film rankings.
In this project, we cleaned and analyzed the data to see which movie is liked or seen the most, and how various factors affect that.
import pandas as pd
#reading in data
star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')
star_wars.head(10)
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3292879998 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3.0 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
1 | 3292879538 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
2 | 3292765271 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1.0 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
3 | 3292763116 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5.0 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
4 | 3292731220 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5.0 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3292719380 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 1.0 | ... | Very favorably | Han | Yes | No | Yes | Male | 18-29 | $25,000 - $49,999 | Bachelor degree | Middle Atlantic |
6 | 3292684787 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 6.0 | ... | Very favorably | Han | Yes | No | No | Male | 18-29 | NaN | High school degree | East North Central |
7 | 3292663732 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 4.0 | ... | Very favorably | Han | No | NaN | Yes | Male | 18-29 | NaN | High school degree | South Atlantic |
8 | 3292654043 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5.0 | ... | Somewhat favorably | Han | No | NaN | No | Male | 18-29 | $0 - $24,999 | Some college or Associate degree | South Atlantic |
9 | 3292640424 | Yes | No | NaN | Star Wars: Episode II Attack of the Clones | NaN | NaN | NaN | NaN | 1.0 | ... | Very favorably | I don't understand this question | No | NaN | No | Male | 18-29 | $25,000 - $49,999 | Some college or Associate degree | Pacific |
10 rows × 38 columns
star_wars.columns
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'Which of the following Star Wars films have you seen? Please select all that apply.', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
The data has several columns, including the following:
RespondentID - an anonymized ID for the respondent (person taking the survey)
Gender - the respondent's gender
Age - the respondent's age
Household Income - the respondent's income
Education - the respondent's education level
Location (Census Region) - the respondent's location
Have you seen any of the 6 films in the Star Wars franchise? - a Yes or No response
Do you consider yourself to be a fan of the Star Wars film franchise? - a Yes or No response
There are several other columns containing answers to questions about the Star Wars movies. For some questions, the respondent had to check one or more boxes.
Let us take a look at the following 2 columns:
Have you seen any of the 6 films in the Star Wars franchise?
Do you consider yourself to be a fan of the Star Wars film franchise?
saw_any_of_6 = star_wars['Have you seen any of the 6 films in the Star Wars franchise?']
saw_any_of_6.value_counts(dropna=False)
Yes    936
No     250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
fan_series = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']
fan_series.value_counts(dropna=False)
Yes    552
NaN    350
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
Both columns hold answers to Yes/No questions. The second column also contains NaN where a respondent did not answer the question. We used the pandas.Series.value_counts() method, with dropna=False, to see all of the unique values in each column, along with the number of times each value appears.
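To make the effect of dropna=False concrete, here is a minimal sketch on a small, made-up series (the values are hypothetical and not taken from the survey file). It only illustrates how the tally changes when missing values are included.

```python
import numpy as np
import pandas as pd

# Hypothetical toy answers, not taken from the survey file.
answers = pd.Series(["Yes", "No", np.nan, "Yes", np.nan, "Yes"])

# By default, missing values are left out of the tally.
counts_default = answers.value_counts()

# With dropna=False, NaN gets its own entry in the result.
counts_with_nan = answers.value_counts(dropna=False)

print(counts_default)
print(counts_with_nan)
```

With dropna=False the counts add up to the length of the series, which is a quick way to confirm that no rows were silently skipped.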
yes_no = {'Yes':True, 'No':False}
saw_any_of_6 = saw_any_of_6.map(yes_no) #converting Yes, No to boolean
saw_any_of_6.value_counts(dropna=False)
True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
fan_series = fan_series.map(yes_no)
fan_series.value_counts(dropna=False)
True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
Both columns started out as string types, because the main values they contain are Yes and No. We made the data a bit easier to analyze by converting each column to a Boolean with only the values True, False, and NaN. Booleans are easier to work with because we can select the rows that are True or False without having to do a string comparison.
We used the pandas.Series.map() method on series objects to perform the conversion. Note that the converted values live in the saw_any_of_6 and fan_series variables; the corresponding columns in star_wars itself still hold Yes and No.
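As a small illustration of how a dictionary-based map() behaves, the sketch below uses a made-up series (hypothetical values, including a "Maybe" that does not occur in the survey). It assumes nothing beyond the yes_no dictionary used above: any value that is not a key in the dictionary, including a missing value, comes out as NaN.

```python
import pandas as pd

yes_no = {"Yes": True, "No": False}

# Hypothetical toy answers; "Maybe" is included only to show an unmapped value.
toy = pd.Series(["Yes", "No", None, "Maybe"])

# Values that are not keys in the dictionary become NaN.
converted = toy.map(yes_no)

# Comparing against True yields a plain boolean mask (NaN compares as False),
# which can be used to select rows without any string comparison.
is_yes = converted == True

print(converted)
print(toy[is_yes])
```

Because unmapped values silently become NaN, it is worth checking value_counts(dropna=False) after a map() to confirm that nothing unexpected was dropped.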
The next six columns represent a single checkbox question. The respondent checked off a series of boxes in response to the question, Which of the following Star Wars films have you seen? Please select all that apply.
star_wars['Which of the following Star Wars films have you seen? Please select all that apply.'].value_counts(dropna=False)
Star Wars: Episode I The Phantom Menace    673
NaN                                        513
Name: Which of the following Star Wars films have you seen? Please select all that apply., dtype: int64
The columns for this checkbox question are:
Which of the following Star Wars films have you seen? Please select all that apply. - whether or not the respondent saw Star Wars: Episode I The Phantom Menace.
Unnamed: 4 - whether or not the respondent saw Star Wars: Episode II Attack of the Clones.
Unnamed: 5 - whether or not the respondent saw Star Wars: Episode III Revenge of the Sith.
Unnamed: 6 - whether or not the respondent saw Star Wars: Episode IV A New Hope.
Unnamed: 7 - whether or not the respondent saw Star Wars: Episode V The Empire Strikes Back.
Unnamed: 8 - whether or not the respondent saw Star Wars: Episode VI Return of the Jedi.
cols_before=star_wars.columns[3:9] #execute_only_once
cols_before
Index(['Which of the following Star Wars films have you seen? Please select all that apply.', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8'], dtype='object')
#renaming columns with index 3 to 8
for i in range(len(cols_before)):
    star_wars = star_wars.rename(columns={cols_before[i]: f'seen_{i+1}'})
star_wars.columns
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6', 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
cols_after = star_wars.columns[3:9]
cols_after
Index(['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6'], dtype='object')
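The renaming loop above calls rename() once per column. As an alternative sketch (not the approach used in this project), the whole mapping can be built first and applied in a single call. The frame and its column names below are hypothetical stand-ins for the survey columns.

```python
import pandas as pd

# Hypothetical empty frame with stand-in column names.
df = pd.DataFrame(
    columns=["id", "Which films have you seen?", "Unnamed: 2", "Unnamed: 3"]
)

# Take the columns to rename by position, as in the project.
old_names = df.columns[1:4]

# Build the full old-name -> new-name mapping, numbering from 1.
mapping = {old: f"seen_{i}" for i, old in enumerate(old_names, start=1)}

# Apply every rename in one call.
df = df.rename(columns=mapping)

print(list(df.columns))
```

Building the dictionary first also makes it easy to print and inspect the mapping before anything is changed.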
Next, we cleaned the data in the renamed columns. In each of these columns, if the value in a cell is the name of the movie, the respondent saw the movie. If the value is NaN, the respondent either didn't answer or didn't see the movie; we assumed that they didn't see it. As in the previous step, we converted the data to a Boolean type, which makes it easier to analyze in later steps.
def boolean_conv(series):
    # True where a movie title is present, False where the cell is NaN
    new_series = series.isna()
    return ~new_series
watched_data_before=star_wars.iloc[:, 3:9] #execute only once
watched_data_before
seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | |
---|---|---|---|---|---|---|
0 | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi |
1 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN |
3 | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi |
4 | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi |
... | ... | ... | ... | ... | ... | ... |
1181 | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi |
1182 | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi |
1183 | NaN | NaN | NaN | NaN | NaN | NaN |
1184 | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi |
1185 | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | NaN | NaN | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi |
1186 rows × 6 columns
# converting data to boolean
watched_data_after=watched_data_before.apply(boolean_conv)
watched_data_after
seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | |
---|---|---|---|---|---|---|
0 | True | True | True | True | True | True |
1 | False | False | False | False | False | False |
2 | True | True | True | False | False | False |
3 | True | True | True | True | True | True |
4 | True | True | True | True | True | True |
... | ... | ... | ... | ... | ... | ... |
1181 | True | True | True | True | True | True |
1182 | True | True | True | True | True | True |
1183 | False | False | False | False | False | False |
1184 | True | True | True | True | True | True |
1185 | True | True | False | False | True | True |
1186 rows × 6 columns
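The conversion above inverts isna() column by column. A shorter sketch that should give the same result is DataFrame.notna(), which works on the whole frame at once. The toy frame below is hypothetical and only mirrors the shape of the seen_ columns.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame shaped like the seen_ columns.
toy = pd.DataFrame({
    "seen_1": ["Star Wars: Episode I The Phantom Menace", np.nan],
    "seen_2": [np.nan, np.nan],
})

# notna() is True where a value is present and False where it is NaN.
seen = toy.notna()

print(seen)
```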
# assigning the boolean data back to main dataframe
star_wars.iloc[:, 3:9]=watched_data_after
star_wars
RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | seen_1 | seen_2 | seen_3 | seen_4 | seen_5 | seen_6 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3292879998 | Yes | Yes | True | True | True | True | True | True | 3.0 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
1 | 3292879538 | No | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
2 | 3292765271 | Yes | No | True | True | True | False | False | False | 1.0 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
3 | 3292763116 | Yes | Yes | True | True | True | True | True | True | 5.0 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
4 | 3292731220 | Yes | Yes | True | True | True | True | True | True | 5.0 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1181 | 3288388730 | Yes | Yes | True | True | True | True | True | True | 5.0 | ... | Very favorably | Han | No | NaN | Yes | Female | 18-29 | $0 - $24,999 | Some college or Associate degree | East North Central |
1182 | 3288378779 | Yes | Yes | True | True | True | True | True | True | 4.0 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Female | 30-44 | $50,000 - $99,999 | Bachelor degree | Mountain |
1183 | 3288375286 | No | NaN | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | No | Female | 30-44 | $50,000 - $99,999 | Bachelor degree | Middle Atlantic |
1184 | 3288373068 | Yes | Yes | True | True | True | True | True | True | 4.0 | ... | Very favorably | Han | No | NaN | Yes | Female | 45-60 | $100,000 - $149,999 | Some college or Associate degree | East North Central |
1185 | 3288372923 | Yes | No | True | True | False | False | True | True | 6.0 | ... | Very unfavorably | I don't understand this question | No | NaN | No | Female | > 60 | $50,000 - $99,999 | Graduate degree | Pacific |
1186 rows × 38 columns
The next six columns ask the respondent to rank the Star Wars movies in order from most to least favorite. 1 means the film was the most favorite, and 6 means it was the least favorite.
ranking_cols_before=star_wars.columns[9:15] #execute only once
ranking_cols_before
Index(['Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14'], dtype='object')
Each of the following columns can contain the value 1, 2, 3, 4, 5, 6, or NaN:
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. - How much the respondent liked Star Wars: Episode I The Phantom Menace
Unnamed: 10 - How much the respondent liked Star Wars: Episode II Attack of the Clones
Unnamed: 11 - How much the respondent liked Star Wars: Episode III Revenge of the Sith
Unnamed: 12 - How much the respondent liked Star Wars: Episode IV A New Hope
Unnamed: 13 - How much the respondent liked Star Wars: Episode V The Empire Strikes Back
Unnamed: 14 - How much the respondent liked Star Wars: Episode VI Return of the Jedi
#renaming columns
for i in range(len(ranking_cols_before)):
    star_wars = star_wars.rename(columns={ranking_cols_before[i]: f'ranking_{i+1}'})
star_wars.columns
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6', 'ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
ranking_cols_after=star_wars.columns[9:15]
ranking_cols_after
Index(['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6'], dtype='object')
#check column names and data type
star_wars[ranking_cols_after].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1186 entries, 0 to 1185
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   ranking_1  835 non-null    float64
 1   ranking_2  836 non-null    float64
 2   ranking_3  835 non-null    float64
 3   ranking_4  836 non-null    float64
 4   ranking_5  836 non-null    float64
 5   ranking_6  836 non-null    float64
dtypes: float64(6)
memory usage: 55.7 KB
Now that we have cleaned up the ranking columns, finding the highest-ranked movie is much easier.
ranking_data=star_wars[ranking_cols_after]
ranking_data
ranking_1 | ranking_2 | ranking_3 | ranking_4 | ranking_5 | ranking_6 | |
---|---|---|---|---|---|---|
0 | 3.0 | 2.0 | 1.0 | 4.0 | 5.0 | 6.0 |
1 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 |
3 | 5.0 | 6.0 | 1.0 | 2.0 | 4.0 | 3.0 |
4 | 5.0 | 4.0 | 6.0 | 2.0 | 1.0 | 3.0 |
... | ... | ... | ... | ... | ... | ... |
1181 | 5.0 | 4.0 | 6.0 | 3.0 | 2.0 | 1.0 |
1182 | 4.0 | 5.0 | 6.0 | 2.0 | 3.0 | 1.0 |
1183 | NaN | NaN | NaN | NaN | NaN | NaN |
1184 | 4.0 | 3.0 | 6.0 | 5.0 | 2.0 | 1.0 |
1185 | 6.0 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
1186 rows × 6 columns
mean_data=ranking_data.mean() #calculating mean
mean_data.sort_values()
ranking_5    2.513158
ranking_6    3.047847
ranking_4    3.272727
ranking_1    3.732934
ranking_2    4.087321
ranking_3    4.341317
dtype: float64
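Because 1 is the best rank, the lowest mean identifies the most-liked film. The sketch below shows this on a small, made-up table (the numbers are hypothetical, not survey data). It relies on mean() skipping NaN by default, so respondents who left the question blank do not affect the averages.

```python
import numpy as np
import pandas as pd

# Hypothetical rankings from four respondents; the third gave no ranking.
toy_rankings = pd.DataFrame({
    "ranking_4": [2.0, 1.0, np.nan, 3.0],
    "ranking_5": [1.0, 2.0, np.nan, 1.0],
    "ranking_6": [3.0, 3.0, np.nan, 2.0],
})

# mean() ignores NaN by default, so each average uses three respondents.
means = toy_rankings.mean()

# The column with the lowest mean is the best-ranked one.
best = means.idxmin()

print(means.sort_values())
print(best)
```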
%matplotlib inline
import matplotlib.pyplot as plt
mean_data.plot(kind='bar')
[Figure: bar chart of the mean ranking for ranking_1 through ranking_6]
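Let's look at the year in which each of these movies was released, in order of mean ranking (a lower mean is better, since 1 is the favorite):
ranking_5 - Episode V The Empire Strikes Back (1980)
ranking_6 - Episode VI Return of the Jedi (1983)
ranking_4 - Episode IV A New Hope (1977)
ranking_1 - Episode I The Phantom Menace (1999)
ranking_2 - Episode II Attack of the Clones (2002)
ranking_3 - Episode III Revenge of the Sith (2005)
It looks like the "original" movies are ranked much more favorably than the newer ones.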
watched_data_after.sum().sort_values(ascending=False)
seen_5    758
seen_6    738
seen_1    673
seen_4    607
seen_2    571
seen_3    550
dtype: int64
watched_data_after.sum().plot(kind='bar')
[Figure: bar chart of the number of respondents who saw each film, seen_1 through seen_6]
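Episodes V and VI were seen by the most respondents. The pattern is less clean than in the rankings, since Episode I (673) was seen by more respondents than Episode IV (607), but overall the original movies were seen by more respondents than the newer ones. This broadly reinforces what we saw in the rankings, where the earlier movies seem to be more popular.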
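The view counts above come from summing Boolean columns, where each True counts as 1. A minimal sketch on made-up data (hypothetical values, not from the survey) shows the same idea, and also that mean() gives the share of respondents who saw a film, which can be easier to compare across groups of different sizes.

```python
import pandas as pd

# Hypothetical Boolean answers from four respondents.
toy_seen = pd.DataFrame({
    "seen_1": [True, False, True, True],
    "seen_5": [True, True, True, True],
})

# sum() treats True as 1, so it counts viewers per film.
viewers = toy_seen.sum()

# mean() gives the fraction of respondents who saw each film.
share = toy_seen.mean()

print(viewers)
print(share)
```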
Now let's examine how certain segments of the survey population responded. There are several columns that segment our data into two groups. Here are a few examples:
Do you consider yourself to be a fan of the Star Wars film franchise? - Yes or No
Do you consider yourself to be a fan of the Star Trek franchise? - Yes or No
Gender - Male or Female
We can split a DataFrame into two groups based on a binary column by creating two subsets of its rows, one for each value in that column.
fans = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']=='Yes'] #star wars fans
non_fans=star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']=='No'] # non star wars fans
#comparison of fans & non star wars fans rating
plt.figure(figsize=(10, 3))
plt.subplot(1, 2, 1)
fans[ranking_cols_after].mean().plot(kind='bar')
plt.title('Fans')
plt.subplot(1, 2, 2)
non_fans[ranking_cols_after].mean().plot(kind='bar')
plt.title('Non Fans')
[Figure: side-by-side bar charts of the mean ranking per film, titled 'Fans' and 'Non Fans']
#comparison of star wars fans and non fans on the number of people who watched the movie
plt.figure(figsize=(10, 3))
plt.subplot(1, 2, 1)
fans[cols_after].sum().plot(kind='bar')
plt.title('Fans')
plt.subplot(1, 2, 2)
non_fans[cols_after].sum().plot(kind='bar')
plt.title('Non Fans')
[Figure: side-by-side bar charts of the number of respondents who saw each film, titled 'Fans' and 'Non Fans']
non_fans[ranking_cols_after].mean().sort_values()
ranking_5    2.862676
ranking_1    2.936396
ranking_6    3.471831
ranking_2    3.591549
ranking_4    3.933099
ranking_3    4.193662
dtype: float64
fans[ranking_cols_after].mean().sort_values()
ranking_5    2.333333
ranking_6    2.829710
ranking_4    2.932971
ranking_1    4.141304
ranking_2    4.342391
ranking_3    4.417423
dtype: float64
non_fans[cols_after].sum().sort_values(ascending=False)
seen_5    220
seen_6    201
seen_1    173
seen_4    124
seen_2    108
seen_3    100
dtype: int64
fans[cols_after].sum().sort_values(ascending=False)
seen_5    538
seen_6    537
seen_1    500
seen_4    483
seen_2    463
seen_3    450
dtype: int64
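The fan and non-fan comparison above builds each subset by hand. As an alternative sketch (not the method used in this project), groupby() can compute the same kind of per-segment averages in one step, and it extends naturally to the other binary columns such as Gender. The data below is hypothetical and only illustrates the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical respondents; the last one did not state a gender.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", np.nan],
    "ranking_1": [4.0, 2.0, 6.0, 4.0, 1.0],
    "ranking_5": [1.0, 3.0, 1.0, 1.0, 2.0],
})

# Rows with a missing group key are dropped by default,
# so only the Male and Female segments are averaged.
by_gender = toy.groupby("Gender")[["ranking_1", "ranking_5"]].mean()

print(by_gender)
```

Each row of the result is one segment and each column is one film, so by_gender.plot(kind='bar') would be one way to draw the groups side by side.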