Star Wars Survey¶

While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?

The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. The file used here has 1,186 rows, of which about 835 include film rankings; you can download the data from their GitHub repository.

In this project, we clean and analyze the data to find out which movie is liked the most, which is seen the most, and how those results differ between groups of respondents.

In [1]:

import pandas as pd
#reading in data

star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')
star_wars.head(10)

Out[1]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	...	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
0	3292879998	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	3.0	...	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
1	3292879538	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
2	3292765271	Yes	No	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN	1.0	...	Unfamiliar (N/A)	I don't understand this question	No	NaN	No	Male	18-29	$0 - $24,999	High school degree	West North Central
3	3292763116	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5.0	...	Very favorably	I don't understand this question	No	NaN	Yes	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
4	3292731220	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5.0	...	Somewhat favorably	Greedo	Yes	No	No	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
5	3292719380	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	1.0	...	Very favorably	Han	Yes	No	Yes	Male	18-29	$25,000 - $49,999	Bachelor degree	Middle Atlantic
6	3292684787	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	6.0	...	Very favorably	Han	Yes	No	No	Male	18-29	NaN	High school degree	East North Central
7	3292663732	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	4.0	...	Very favorably	Han	No	NaN	Yes	Male	18-29	NaN	High school degree	South Atlantic
8	3292654043	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5.0	...	Somewhat favorably	Han	No	NaN	No	Male	18-29	$0 - $24,999	Some college or Associate degree	South Atlantic
9	3292640424	Yes	No	NaN	Star Wars: Episode II Attack of the Clones	NaN	NaN	NaN	NaN	1.0	...	Very favorably	I don't understand this question	No	NaN	No	Male	18-29	$25,000 - $49,999	Some college or Associate degree	Pacific

10 rows × 38 columns

In [2]:

star_wars.columns

Out[2]:

Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

The data has several columns, including the following:

RespondentID — An anonymized ID for the respondent (person taking the survey)
Gender — the respondent's gender
Age — the respondent's age
Household Income — the respondent's income
Education — the respondent's education level
Location (Census Region) — the respondent's location
Have you seen any of the 6 films in the Star Wars franchise? — a Yes or No response
Do you consider yourself to be a fan of the Star Wars film franchise? — a Yes or No response

There are several other columns containing answers to questions about the Star Wars movies. For some questions, the respondent had to check one or more boxes.

Cleaning and Mapping Columns¶

Let us take a look at the following 2 columns:

Have you seen any of the 6 films in the Star Wars franchise?
Do you consider yourself to be a fan of the Star Wars film franchise?

In [3]:

saw_any_of_6 = star_wars['Have you seen any of the 6 films in the Star Wars franchise?']
saw_any_of_6.value_counts(dropna=False)

Out[3]:

Yes    936
No     250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64

In [4]:

fan_series = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']
fan_series.value_counts(dropna=False)

Out[4]:

Yes    552
NaN    350
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64

Both columns hold answers to Yes/No questions. A NaN appears where a respondent did not answer; here only the fan question has missing values (350 of them). We used the pandas.Series.value_counts() method with dropna=False to list every unique value in a column, including NaN, along with the number of times each value appears.
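As a minimal sketch of why dropna=False matters, the toy series below (hypothetical answers, not rows from the survey file) is counted both ways. The variable names are ours and purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy answers, not rows from the survey file.
answers = pd.Series(["Yes", "No", "Yes", np.nan, "Yes"])

# dropna=False keeps missing answers as their own category.
with_missing = answers.value_counts(dropna=False)

# The default (dropna=True) silently leaves missing answers out.
without_missing = answers.value_counts()

print(with_missing)
print(without_missing)
```

Comparing the two totals is a quick way to see how many respondents skipped a question.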

In [5]:

yes_no = {'Yes':True, 'No':False}

saw_any_of_6 = saw_any_of_6.map(yes_no) #converting Yes, No to boolean
saw_any_of_6.value_counts(dropna=False)

Out[5]:

True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64

In [6]:

fan_series = fan_series.map(yes_no)
fan_series.value_counts(dropna=False)

Out[6]:

True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64

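Both columns were originally string types, because the main values they contain are Yes and No. We made the data a bit easier to analyze by converting each one to a Boolean series with only the values True, False, and NaN. Booleans are easier to work with because we can select the rows that are True or False without having to do a string comparison. Note that the converted values live in the separate series saw_any_of_6 and fan_series; the corresponding columns in star_wars itself still hold the original strings.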

We made use of the pandas.Series.map() method on series objects to perform the conversion.
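The sketch below illustrates two details of map() on a toy DataFrame: values that are not keys of the dictionary come back as NaN, and the result only affects the DataFrame if it is assigned back. The column name "fan" and the extra answer "Maybe" are assumptions made for illustration and do not appear in the survey.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column name "fan" is illustrative only.
df = pd.DataFrame({"fan": ["Yes", "No", np.nan, "Yes", "Maybe"]})

yes_no = {"Yes": True, "No": False}

# Values missing from the dict (NaN and "Maybe" here) come back as NaN.
converted = df["fan"].map(yes_no)

# Assigning the result back stores the conversion in the DataFrame itself.
df["fan"] = converted

# Select rows that are True; NaN rows compare as False and are left out.
fans_only = df[df["fan"].eq(True)]
print(len(fans_only))
```

If the converted series is not assigned back, later filters on the DataFrame must still compare against the original strings.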

Cleaning and Mapping Checkbox Columns¶

The next six columns represent a single checkbox question. The respondent checked off a series of boxes in response to the question, Which of the following Star Wars films have you seen? Please select all that apply.

In [7]:

star_wars['Which of the following Star Wars films have you seen? Please select all that apply.'].value_counts(dropna=False)

Out[7]:

Star Wars: Episode I  The Phantom Menace    673
NaN                                         513
Name: Which of the following Star Wars films have you seen? Please select all that apply., dtype: int64

The columns for this checkbox question are:

Which of the following Star Wars films have you seen? Please select all that apply. — whether or not the respondent saw Star Wars: Episode I The Phantom Menace.
Unnamed: 4 — whether or not the respondent saw Star Wars: Episode II Attack of the Clones.
Unnamed: 5 — whether or not the respondent saw Star Wars: Episode III Revenge of the Sith.
Unnamed: 6 — whether or not the respondent saw Star Wars: Episode IV A New Hope.
Unnamed: 7 — whether or not the respondent saw Star Wars: Episode V The Empire Strikes Back.
Unnamed: 8 — whether or not the respondent saw Star Wars: Episode VI Return of the Jedi.

In [8]:

cols_before=star_wars.columns[3:9] #execute_only_once
cols_before

Out[8]:

Index(['Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8'],
      dtype='object')

In [9]:

#renaming columns with index 3 to 8

# build the full old-name -> new-name mapping, then rename in one call
seen_names = {old: f'seen_{i}' for i, old in enumerate(cols_before, start=1)}
star_wars = star_wars.rename(columns=seen_names)
star_wars.columns

Out[9]:

Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

In [10]:

cols_after = star_wars.columns[3:9]
cols_after

Out[10]:

Index(['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6'], dtype='object')

In each of the renamed seen_ columns, a cell that holds the name of a movie means the respondent saw that movie. A NaN means the respondent either didn't answer or didn't see the movie; we assumed that they didn't see it. As with the Yes/No columns, we converted the data to a Boolean type, which makes the later analysis easier.
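As a small illustration of this rule, the sketch below applies DataFrame.notna() to a hypothetical two-column checkbox table. The shortened titles and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy checkbox data: a title means "checked", NaN means "unchecked".
checkbox = pd.DataFrame({
    "seen_1": ["Episode I", np.nan, "Episode I"],
    "seen_2": [np.nan, np.nan, "Episode II"],
})

# notna() marks every non-missing cell True, for all columns at once.
seen = checkbox.notna()

# Summing Booleans counts the True values per column.
counts = seen.sum()
print(counts)
```

Because the result is Boolean, sum() gives the number of viewers per film directly.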

In [11]:

def boolean_conv(series):
    # True where a movie title is present, False where the cell is NaN
    return series.notna()

In [12]:

watched_data_before=star_wars.iloc[:, 3:9] #execute only once
watched_data_before

Out[12]:

	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6
0	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi
1	NaN	NaN	NaN	NaN	NaN	NaN
2	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN
3	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi
4	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi
...	...	...	...	...	...	...
1181	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi
1182	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi
1183	NaN	NaN	NaN	NaN	NaN	NaN
1184	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi
1185	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	NaN	NaN	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi

1186 rows × 6 columns

In [13]:

# converting data to boolean

watched_data_after=watched_data_before.apply(boolean_conv)
watched_data_after

Out[13]:

	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6
0	True	True	True	True	True	True
1	False	False	False	False	False	False
2	True	True	True	False	False	False
3	True	True	True	True	True	True
4	True	True	True	True	True	True
...	...	...	...	...	...	...
1181	True	True	True	True	True	True
1182	True	True	True	True	True	True
1183	False	False	False	False	False	False
1184	True	True	True	True	True	True
1185	True	True	False	False	True	True

1186 rows × 6 columns

In [14]:

# assigning the boolean data back to main dataframe

star_wars.iloc[:, 3:9]=watched_data_after
star_wars

Out[14]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	...	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
0	3292879998	Yes	Yes	True	True	True	True	True	True	3.0	...	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
1	3292879538	No	NaN	False	False	False	False	False	False	NaN	...	NaN	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
2	3292765271	Yes	No	True	True	True	False	False	False	1.0	...	Unfamiliar (N/A)	I don't understand this question	No	NaN	No	Male	18-29	$0 - $24,999	High school degree	West North Central
3	3292763116	Yes	Yes	True	True	True	True	True	True	5.0	...	Very favorably	I don't understand this question	No	NaN	Yes	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
4	3292731220	Yes	Yes	True	True	True	True	True	True	5.0	...	Somewhat favorably	Greedo	Yes	No	No	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1181	3288388730	Yes	Yes	True	True	True	True	True	True	5.0	...	Very favorably	Han	No	NaN	Yes	Female	18-29	$0 - $24,999	Some college or Associate degree	East North Central
1182	3288378779	Yes	Yes	True	True	True	True	True	True	4.0	...	Very favorably	I don't understand this question	No	NaN	Yes	Female	30-44	$50,000 - $99,999	Bachelor degree	Mountain
1183	3288375286	No	NaN	False	False	False	False	False	False	NaN	...	NaN	NaN	NaN	NaN	No	Female	30-44	$50,000 - $99,999	Bachelor degree	Middle Atlantic
1184	3288373068	Yes	Yes	True	True	True	True	True	True	4.0	...	Very favorably	Han	No	NaN	Yes	Female	45-60	$100,000 - $149,999	Some college or Associate degree	East North Central
1185	3288372923	Yes	No	True	True	False	False	True	True	6.0	...	Very unfavorably	I don't understand this question	No	NaN	No	Female	> 60	$50,000 - $99,999	Graduate degree	Pacific

1186 rows × 38 columns

Cleaning the ranking columns¶

The next six columns ask the respondent to rank the Star Wars movies in order of preference. A rank of 1 means the film was the respondent's favorite, and 6 means it was their least favorite.

In [15]:

ranking_cols_before=star_wars.columns[9:15] #execute only once
ranking_cols_before

Out[15]:

Index(['Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14'],
      dtype='object')

Each of the following columns can contain the value 1, 2, 3, 4, 5, 6, or NaN:

Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. - How much the respondent liked Star Wars: Episode I The Phantom Menace
Unnamed: 10 — How much the respondent liked Star Wars: Episode II Attack of the Clones
Unnamed: 11 — How much the respondent liked Star Wars: Episode III Revenge of the Sith
Unnamed: 12 — How much the respondent liked Star Wars: Episode IV A New Hope
Unnamed: 13 — How much the respondent liked Star Wars: Episode V The Empire Strikes Back
Unnamed: 14 — How much the respondent liked Star Wars: Episode VI Return of the Jedi

In [16]:

#renaming columns

# build the full old-name -> new-name mapping, then rename in one call
ranking_names = {old: f'ranking_{i}' for i, old in enumerate(ranking_cols_before, start=1)}
star_wars = star_wars.rename(columns=ranking_names)
star_wars.columns

Out[16]:

Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6', 'ranking_1',
       'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

In [17]:

ranking_cols_after=star_wars.columns[9:15]
ranking_cols_after

Out[17]:

Index(['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5',
       'ranking_6'],
      dtype='object')

In [18]:

#check column names and data type

star_wars[ranking_cols_after].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1186 entries, 0 to 1185
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ranking_1  835 non-null    float64
 1   ranking_2  836 non-null    float64
 2   ranking_3  835 non-null    float64
 3   ranking_4  836 non-null    float64
 4   ranking_5  836 non-null    float64
 5   ranking_6  836 non-null    float64
dtypes: float64(6)
memory usage: 55.7 KB

Finding the Highest Ranked Movie¶

Now that we have cleaned up the ranking columns, finding the highest-ranked movie is much easier. Because 1 is the best rank, the film with the lowest mean ranking is the most liked.

In [19]:

ranking_data=star_wars[ranking_cols_after]
ranking_data

Out[19]:

	ranking_1	ranking_2	ranking_3	ranking_4	ranking_5	ranking_6
0	3.0	2.0	1.0	4.0	5.0	6.0
1	NaN	NaN	NaN	NaN	NaN	NaN
2	1.0	2.0	3.0	4.0	5.0	6.0
3	5.0	6.0	1.0	2.0	4.0	3.0
4	5.0	4.0	6.0	2.0	1.0	3.0
...	...	...	...	...	...	...
1181	5.0	4.0	6.0	3.0	2.0	1.0
1182	4.0	5.0	6.0	2.0	3.0	1.0
1183	NaN	NaN	NaN	NaN	NaN	NaN
1184	4.0	3.0	6.0	5.0	2.0	1.0
1185	6.0	1.0	2.0	3.0	4.0	5.0

1186 rows × 6 columns

In [20]:

mean_data=ranking_data.mean() #calculating mean
mean_data.sort_values()

Out[20]:

ranking_5    2.513158
ranking_6    3.047847
ranking_4    3.272727
ranking_1    3.732934
ranking_2    4.087321
ranking_3    4.341317
dtype: float64

In [21]:

%matplotlib inline
import matplotlib.pyplot as plt

mean_data.plot(kind='bar')

Out[21]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fd410593d30>

Let's look at the year in which each of these movies was released, listed in order of mean ranking (best first):

ranking_5: Star Wars: Episode V The Empire Strikes Back - 1980
ranking_6: Star Wars: Episode VI Return of the Jedi - 1983
ranking_4: Star Wars: Episode IV A New Hope - 1977
ranking_1: Star Wars: Episode I The Phantom Menace - 1999
ranking_2: Star Wars: Episode II Attack of the Clones - 2002
ranking_3: Star Wars: Episode III Revenge of the Sith - 2005

Since a lower mean ranking means a more favored film, the three "original" movies (1977 to 1983) are rated more highly than the three newer ones (1999 to 2005).
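To make the sorted output easier to read, the column labels can be swapped for film titles. The sketch below re-enters the mean rankings from Out[20] by hand and uses a hypothetical dictionary, films, built from the titles and years listed above. It illustrates the relabeling and is not a cell from the original notebook.

```python
import pandas as pd

# Mean rankings copied from Out[20] above (lower is better).
mean_data = pd.Series({
    "ranking_1": 3.732934, "ranking_2": 4.087321, "ranking_3": 4.341317,
    "ranking_4": 3.272727, "ranking_5": 2.513158, "ranking_6": 3.047847,
})

# Title and release year per column; the dict name `films` is hypothetical.
films = {
    "ranking_1": "I The Phantom Menace (1999)",
    "ranking_2": "II Attack of the Clones (2002)",
    "ranking_3": "III Revenge of the Sith (2005)",
    "ranking_4": "IV A New Hope (1977)",
    "ranking_5": "V The Empire Strikes Back (1980)",
    "ranking_6": "VI Return of the Jedi (1983)",
}

# Relabel the index, then sort so the best-ranked film comes first.
labelled = mean_data.rename(index=films).sort_values()
best = labelled.index[0]
print(labelled)
```

Calling labelled.plot(kind='barh') on the result would give a bar chart with readable labels.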

Finding the Most Viewed Movie¶

In [22]:

watched_data_after.sum().sort_values(ascending=False)

Out[22]:

seen_5    758
seen_6    738
seen_1    673
seen_4    607
seen_2    571
seen_3    550
dtype: int64

In [23]:

watched_data_after.sum().plot(kind='bar')

Out[23]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fd40f5d1970>

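Episodes V and VI were seen by the most respondents, which broadly agrees with the rankings, where the earlier movies are more popular. The pattern is not clean, however: Episode I (673 viewers) was seen by more respondents than Episode IV (607), even though Episode IV has the better mean ranking.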
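Raw counts can be easier to compare as shares of all respondents. The sketch below re-enters the counts from Out[22] and divides by the 1,186 rows reported earlier. The variable names are ours, and the choice of denominator is an assumption; dividing by the 936 respondents who saw at least one film would be equally defensible.

```python
import pandas as pd

# View counts copied from Out[22] above.
seen_counts = pd.Series({
    "seen_1": 673, "seen_2": 571, "seen_3": 550,
    "seen_4": 607, "seen_5": 758, "seen_6": 738,
})

# Row count of the DataFrame as reported above (assumed denominator).
total_respondents = 1186

# Share of all respondents who saw each film, rounded for display.
share = (seen_counts / total_respondents).round(3)
print(share.sort_values(ascending=False))
```

On the Boolean DataFrame itself, watched_data_after.mean() should give the same shares, since the mean of a Boolean column is the fraction of True values.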

Exploring the data by Binary Segments¶

Now let's examine how certain segments of the survey population responded. There are several columns that segment our data into two groups. Here are a few examples:

Do you consider yourself to be a fan of the Star Wars film franchise? — Yes or No
Do you consider yourself to be a fan of the Star Trek franchise? — Yes or No
Gender — Male or Female

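We can split a DataFrame into two groups based on a binary column by filtering its rows on each of the column's two values. The fan column in star_wars still holds the strings Yes and No, so the filters compare against those strings; rows where the answer is NaN fall into neither group.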

In [24]:

fans = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']=='Yes'] #star wars fans
non_fans=star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']=='No'] # non star wars fans
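The two filters above can also be expressed as a single groupby, which scales to the other binary columns without creating a pair of variables for each. The sketch below uses a hypothetical toy DataFrame in which "fan" stands in for the long survey column name; the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "fan" stands in for the long survey column name.
toy = pd.DataFrame({
    "fan": ["Yes", "No", "Yes", np.nan, "No"],
    "ranking_5": [1.0, 3.0, 2.0, np.nan, 4.0],
    "seen_5": [True, True, True, False, False],
})

# groupby drops rows whose key is NaN by default,
# matching the == 'Yes' / == 'No' filters.
by_fan = toy.groupby("fan")

# Mean ranking and number of viewers for each group.
mean_rank = by_fan["ranking_5"].mean()
views = by_fan["seen_5"].sum()

print(mean_rank)
print(views)
```

Swapping "fan" for another two-valued column, such as Gender, would give the same comparison for that segment.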

In [25]:

#comparison of fans & non star wars fans rating
plt.figure(figsize=(10, 3))

plt.subplot(1, 2, 1)
fans[ranking_cols_after].mean().plot(kind='bar')
plt.title('Fans')

plt.subplot(1, 2, 2)
non_fans[ranking_cols_after].mean().plot(kind='bar')
plt.title('Non Fans')

Out[25]:

Text(0.5, 1.0, 'Non Fans')

In [26]:

#comparison of star wars fans and non fans on the number of people who watched the movie

plt.figure(figsize=(10, 3))

plt.subplot(1, 2, 1)
fans[cols_after].sum().plot(kind='bar')
plt.title('Fans')

plt.subplot(1, 2, 2)
non_fans[cols_after].sum().plot(kind='bar')
plt.title('Non Fans')

Out[26]:

Text(0.5, 1.0, 'Non Fans')

In [27]:

non_fans[ranking_cols_after].mean().sort_values()

Out[27]:

ranking_5    2.862676
ranking_1    2.936396
ranking_6    3.471831
ranking_2    3.591549
ranking_4    3.933099
ranking_3    4.193662
dtype: float64

In [28]:

fans[ranking_cols_after].mean().sort_values()

Out[28]:

ranking_5    2.333333
ranking_6    2.829710
ranking_4    2.932971
ranking_1    4.141304
ranking_2    4.342391
ranking_3    4.417423
dtype: float64

In [29]:

non_fans[cols_after].sum().sort_values(ascending=False)

Out[29]:

seen_5    220
seen_6    201
seen_1    173
seen_4    124
seen_2    108
seen_3    100
dtype: int64

In [30]:

fans[cols_after].sum().sort_values(ascending=False)

Out[30]:

seen_5    538
seen_6    537
seen_1    500
seen_4    483
seen_2    463
seen_3    450
dtype: int64

Observations:¶

Among both fans and non-fans in this survey, Episode V: The Empire Strikes Back has the best mean ranking and is the most-seen movie.
Interestingly, fans and non-fans order the films identically by number of viewers (V, VI, I, IV, II, III; see cells 29, 30), and that order does not follow the release dates. Their rankings (see cells 27, 28) agree only on the best film (Episode V) and the worst (Episode III). Episodes I, II, IV, and VI are ordered differently: fans place the three original films first, while non-fans rank Episode I second and Episode IV fifth. (See cells 25, 26 for graphs.)