Star Wars Survey

While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?

The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which you can download from their GitHub repository.

In this project, we clean and analyze the data to find out which movie is liked and seen the most, and to explore the factors that affect this.

In [1]:
import pandas as pd
#reading in data

star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')
star_wars.head(10)
Out[1]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? Which of the following Star Wars films have you seen? Please select all that apply. Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. ... Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe? Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
0 3292879998 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3.0 ... Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
1 3292879538 No NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
2 3292765271 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN 1.0 ... Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
3 3292763116 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5.0 ... Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
4 3292731220 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5.0 ... Somewhat favorably Greedo Yes No No Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
5 3292719380 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 1.0 ... Very favorably Han Yes No Yes Male 18-29 $25,000 - $49,999 Bachelor degree Middle Atlantic
6 3292684787 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 6.0 ... Very favorably Han Yes No No Male 18-29 NaN High school degree East North Central
7 3292663732 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 4.0 ... Very favorably Han No NaN Yes Male 18-29 NaN High school degree South Atlantic
8 3292654043 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5.0 ... Somewhat favorably Han No NaN No Male 18-29 $0 - $24,999 Some college or Associate degree South Atlantic
9 3292640424 Yes No NaN Star Wars: Episode II Attack of the Clones NaN NaN NaN NaN 1.0 ... Very favorably I don't understand this question No NaN No Male 18-29 $25,000 - $49,999 Some college or Associate degree Pacific

10 rows × 38 columns

In [2]:
star_wars.columns
Out[2]:
Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

The data has several columns, including the following:

  • RespondentID — An anonymized ID for the respondent (person taking the survey)
  • Gender — the respondent's gender
  • Age — the respondent's age
  • Household Income — the respondent's income
  • Education — the respondent's education level
  • Location (Census Region) — the respondent's location
  • Have you seen any of the 6 films in the Star Wars franchise? — a Yes or No response
  • Do you consider yourself to be a fan of the Star Wars film franchise? — a Yes or No response

There are several other columns containing answers to questions about the Star Wars movies. For some questions, the respondent had to check one or more boxes.

Cleaning and Mapping Columns

Let us take a look at the following 2 columns:

  • Have you seen any of the 6 films in the Star Wars franchise?
  • Do you consider yourself to be a fan of the Star Wars film franchise?
In [3]:
saw_any_of_6 = star_wars['Have you seen any of the 6 films in the Star Wars franchise?']
saw_any_of_6.value_counts(dropna=False)
Out[3]:
Yes    936
No     250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
In [4]:
fan_series = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']
fan_series.value_counts(dropna=False)
Out[4]:
Yes    552
NaN    350
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64

Both columns hold answers to Yes/No questions. The second column also contains NaN where a respondent did not answer the question. We used the pandas.Series.value_counts() method with dropna=False to see all of the unique values in a column, including NaN, along with the total number of times each value appears.
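As a small illustration of why dropna=False matters here, the sketch below tallies a toy series (made-up answers, not the survey data) with and without it.

```python
import numpy as np
import pandas as pd

# Toy answers, not the survey data: two "Yes", one "No", one missing.
answers = pd.Series(["Yes", "No", np.nan, "Yes"])

# By default, missing values are left out of the tally.
counts_default = answers.value_counts()

# dropna=False adds a row for NaN, so skipped questions stay visible.
counts_with_nan = answers.value_counts(dropna=False)

print(counts_default)
print(counts_with_nan)
```

Without dropna=False, the 350 unanswered rows in the fan question would not have shown up in the counts.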

In [5]:
yes_no = {'Yes':True, 'No':False}

saw_any_of_6 = saw_any_of_6.map(yes_no) #converting Yes, No to boolean
saw_any_of_6.value_counts(dropna=False)
Out[5]:
True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
In [6]:
fan_series = fan_series.map(yes_no)
fan_series.value_counts(dropna=False)
Out[6]:
True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64

Both columns started out as string types, because the main values they contain are Yes and No. We made the data a bit easier to analyze by converting each one to a Boolean series with only the values True, False, and NaN. Booleans are easier to work with because we can select the rows that are True or False without having to do a string comparison. Note that the converted values live in the standalone series saw_any_of_6 and fan_series; the corresponding columns in star_wars still hold Yes and No.

We made use of the pandas.Series.map() method on series objects to perform the conversion.
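A minimal sketch of the same mapping on a toy DataFrame follows. The column name fan and the value "Maybe" are made up for illustration; the point is that any value missing from the dictionary, including NaN, comes back as NaN, and that the result can be written back to the column.

```python
import numpy as np
import pandas as pd

yes_no = {"Yes": True, "No": False}

# Toy frame (not the survey data); "fan" is a hypothetical column name.
toy = pd.DataFrame({"fan": ["Yes", "No", np.nan, "Maybe"]})

# Values missing from the dictionary, including NaN itself, come back as NaN.
toy["fan"] = toy["fan"].map(yes_no)

# Select rows without a string comparison; NaN rows match neither mask.
fans = toy[toy["fan"] == True]
non_fans = toy[toy["fan"] == False]

print(toy)
print(len(fans), len(non_fans))
```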

Cleaning and Mapping Checkbox Columns

The next six columns represent a single checkbox question. The respondent checked off a series of boxes in response to the question, Which of the following Star Wars films have you seen? Please select all that apply.

In [7]:
star_wars['Which of the following Star Wars films have you seen? Please select all that apply.'].value_counts(dropna=False)
Out[7]:
Star Wars: Episode I  The Phantom Menace    673
NaN                                         513
Name: Which of the following Star Wars films have you seen? Please select all that apply., dtype: int64

The columns for this checkbox question are:

  • Which of the following Star Wars films have you seen? Please select all that apply. — whether or not the respondent saw Star Wars: Episode I The Phantom Menace.
  • Unnamed: 4 — whether or not the respondent saw Star Wars: Episode II Attack of the Clones.
  • Unnamed: 5 — whether or not the respondent saw Star Wars: Episode III Revenge of the Sith.
  • Unnamed: 6 — whether or not the respondent saw Star Wars: Episode IV A New Hope.
  • Unnamed: 7 — whether or not the respondent saw Star Wars: Episode V The Empire Strikes Back.
  • Unnamed: 8 — whether or not the respondent saw Star Wars: Episode VI Return of the Jedi.
In [8]:
cols_before=star_wars.columns[3:9] #execute_only_once
cols_before
Out[8]:
Index(['Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8'],
      dtype='object')
In [9]:
#renaming columns with index 3 to 8

for i in range(len(cols_before)):
    star_wars=star_wars.rename(columns={cols_before[i]:f'seen_{i+1}'})
star_wars.columns
Out[9]:
Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')
In [10]:
cols_after = star_wars.columns[3:9]
cols_after
Out[10]:
Index(['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6'], dtype='object')
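The loop in cell 9 calls rename once per column. As an alternative sketch, the same result can be had from a single rename call with one dictionary. The toy column headers below are placeholders, not the real survey headers.

```python
import pandas as pd

# Toy frame with placeholder headers standing in for the checkbox columns.
toy = pd.DataFrame(columns=["RespondentID", "Which films?", "Unnamed: 4", "Unnamed: 5"])

# Build the full old-name -> new-name mapping first, then rename once.
old_names = toy.columns[1:4]
new_names = {old: f"seen_{i}" for i, old in enumerate(old_names, start=1)}
renamed = toy.rename(columns=new_names)

print(list(renamed.columns))
```

Building the mapping up front also makes it easy to inspect before any column is changed.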

Next we clean the data in the renamed columns. In each of these columns, if the value in a cell is the name of the movie, the respondent saw the movie. If the value is NaN, the respondent either didn't answer or didn't see the movie; we assumed that they didn't see it. As in the previous step, we converted the data to a Boolean type, which makes the later analysis easier.
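The rule above (a title means seen, NaN means not seen) can be sketched on a toy frame with DataFrame.notna(), which gives the same result as negating isna(). The three rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy checkbox data: a cell holds the film title if the box was checked, else NaN.
toy = pd.DataFrame({
    "seen_1": ["Star Wars: Episode I  The Phantom Menace", np.nan, np.nan],
    "seen_2": ["Star Wars: Episode II  Attack of the Clones", np.nan,
               "Star Wars: Episode II  Attack of the Clones"],
})

# True where a title is present, False where the cell is NaN.
seen = toy.notna()

# Summing Booleans counts the True values, i.e. viewers per film.
views = seen.sum()

print(seen)
print(views)
```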

In [11]:
def boolean_conv(series):
    # True where the cell holds a movie title (box checked), False where it is NaN
    return series.notna()
In [12]:
watched_data_before=star_wars.iloc[:, 3:9] #execute only once
watched_data_before
Out[12]:
seen_1 seen_2 seen_3 seen_4 seen_5 seen_6
0 Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi
1 NaN NaN NaN NaN NaN NaN
2 Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN
3 Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi
4 Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi
... ... ... ... ... ... ...
1181 Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi
1182 Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi
1183 NaN NaN NaN NaN NaN NaN
1184 Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi
1185 Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones NaN NaN Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi

1186 rows × 6 columns

In [13]:
# converting data to boolean

watched_data_after=watched_data_before.apply(boolean_conv)
watched_data_after
Out[13]:
seen_1 seen_2 seen_3 seen_4 seen_5 seen_6
0 True True True True True True
1 False False False False False False
2 True True True False False False
3 True True True True True True
4 True True True True True True
... ... ... ... ... ... ...
1181 True True True True True True
1182 True True True True True True
1183 False False False False False False
1184 True True True True True True
1185 True True False False True True

1186 rows × 6 columns

In [14]:
# assigning the boolean data back to main dataframe

star_wars.iloc[:, 3:9]=watched_data_after
star_wars
Out[14]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. ... Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe? Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
0 3292879998 Yes Yes True True True True True True 3.0 ... Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
1 3292879538 No NaN False False False False False False NaN ... NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
2 3292765271 Yes No True True True False False False 1.0 ... Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
3 3292763116 Yes Yes True True True True True True 5.0 ... Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
4 3292731220 Yes Yes True True True True True True 5.0 ... Somewhat favorably Greedo Yes No No Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1181 3288388730 Yes Yes True True True True True True 5.0 ... Very favorably Han No NaN Yes Female 18-29 $0 - $24,999 Some college or Associate degree East North Central
1182 3288378779 Yes Yes True True True True True True 4.0 ... Very favorably I don't understand this question No NaN Yes Female 30-44 $50,000 - $99,999 Bachelor degree Mountain
1183 3288375286 No NaN False False False False False False NaN ... NaN NaN NaN NaN No Female 30-44 $50,000 - $99,999 Bachelor degree Middle Atlantic
1184 3288373068 Yes Yes True True True True True True 4.0 ... Very favorably Han No NaN Yes Female 45-60 $100,000 - $149,999 Some college or Associate degree East North Central
1185 3288372923 Yes No True True False False True True 6.0 ... Very unfavorably I don't understand this question No NaN No Female > 60 $50,000 - $99,999 Graduate degree Pacific

1186 rows × 38 columns

Cleaning the Ranking Columns

The next six columns ask the respondent to rank the Star Wars movies in order of preference. 1 means the film was the respondent's favorite, and 6 means it was the least favorite.

In [15]:
ranking_cols_before=star_wars.columns[9:15] #execute only once
ranking_cols_before
Out[15]:
Index(['Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14'],
      dtype='object')

Each of the following columns can contain the value 1, 2, 3, 4, 5, 6, or NaN:

  • Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. - How much the respondent liked Star Wars: Episode I The Phantom Menace
  • Unnamed: 10 — How much the respondent liked Star Wars: Episode II Attack of the Clones
  • Unnamed: 11 — How much the respondent liked Star Wars: Episode III Revenge of the Sith
  • Unnamed: 12 — How much the respondent liked Star Wars: Episode IV A New Hope
  • Unnamed: 13 — How much the respondent liked Star Wars: Episode V The Empire Strikes Back
  • Unnamed: 14 — How much the respondent liked Star Wars: Episode VI Return of the Jedi
In [16]:
#renaming columns

for i in range(len(ranking_cols_before)):
    star_wars = star_wars.rename(columns={ranking_cols_before[i]:f'ranking_{i+1}'})
star_wars.columns    
Out[16]:
Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6', 'ranking_1',
       'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')
In [17]:
ranking_cols_after=star_wars.columns[9:15]
ranking_cols_after
Out[17]:
Index(['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5',
       'ranking_6'],
      dtype='object')
In [18]:
#check column names and data type

star_wars[ranking_cols_after].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1186 entries, 0 to 1185
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ranking_1  835 non-null    float64
 1   ranking_2  836 non-null    float64
 2   ranking_3  835 non-null    float64
 3   ranking_4  836 non-null    float64
 4   ranking_5  836 non-null    float64
 5   ranking_6  836 non-null    float64
dtypes: float64(6)
memory usage: 55.7 KB

Finding the Highest Ranked Movie

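Now that we have cleaned up the ranking columns, finding the highest ranked movie is much easier. Because 1 is the best possible rank, the movie with the lowest mean ranking is the most liked.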

In [19]:
ranking_data=star_wars[ranking_cols_after]
ranking_data
Out[19]:
ranking_1 ranking_2 ranking_3 ranking_4 ranking_5 ranking_6
0 3.0 2.0 1.0 4.0 5.0 6.0
1 NaN NaN NaN NaN NaN NaN
2 1.0 2.0 3.0 4.0 5.0 6.0
3 5.0 6.0 1.0 2.0 4.0 3.0
4 5.0 4.0 6.0 2.0 1.0 3.0
... ... ... ... ... ... ...
1181 5.0 4.0 6.0 3.0 2.0 1.0
1182 4.0 5.0 6.0 2.0 3.0 1.0
1183 NaN NaN NaN NaN NaN NaN
1184 4.0 3.0 6.0 5.0 2.0 1.0
1185 6.0 1.0 2.0 3.0 4.0 5.0

1186 rows × 6 columns

In [20]:
mean_data=ranking_data.mean() #calculating mean
mean_data.sort_values()
Out[20]:
ranking_5    2.513158
ranking_6    3.047847
ranking_4    3.272727
ranking_1    3.732934
ranking_2    4.087321
ranking_3    4.341317
dtype: float64
In [21]:
%matplotlib inline
import matplotlib.pyplot as plt

mean_data.plot(kind='bar')
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd410593d30>

Let's look at the year in which each of these movies was released, listed in order of mean ranking (best first):

  • ranking_5: Star Wars: Episode V The Empire Strikes Back - 1980
  • ranking_6: Star Wars: Episode VI Return of the Jedi - 1983
  • ranking_4: Star Wars: Episode IV A New Hope - 1977
  • ranking_1: Star Wars: Episode I The Phantom Menace - 1999
  • ranking_2: Star Wars: Episode II Attack of the Clones - 2002
  • ranking_3: Star Wars: Episode III Revenge of the Sith - 2005

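Because 1 is the best rank, a lower mean is better. The "original" movies (Episodes IV to VI, released 1977 to 1983) are ranked noticeably higher than the newer ones (Episodes I to III, released 1999 to 2005).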
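To make the link between ranking and release year explicit, the sketch below puts the mean rankings from Out[20] next to the titles and years listed above. The values are copied by hand from the notebook output rather than recomputed from the CSV.

```python
import pandas as pd

# Mean rankings copied from Out[20] (lower is better).
mean_rank = pd.Series({
    "ranking_1": 3.732934, "ranking_2": 4.087321, "ranking_3": 4.341317,
    "ranking_4": 3.272727, "ranking_5": 2.513158, "ranking_6": 3.047847,
})

# Titles and release years from the list above.
titles = {
    "ranking_1": "The Phantom Menace", "ranking_2": "Attack of the Clones",
    "ranking_3": "Revenge of the Sith", "ranking_4": "A New Hope",
    "ranking_5": "The Empire Strikes Back", "ranking_6": "Return of the Jedi",
}
years = {
    "ranking_1": 1999, "ranking_2": 2002, "ranking_3": 2005,
    "ranking_4": 1977, "ranking_5": 1980, "ranking_6": 1983,
}

# Series are aligned on their shared index, then sorted best-first.
summary = pd.DataFrame({
    "title": pd.Series(titles),
    "year": pd.Series(years),
    "mean_rank": mean_rank,
}).sort_values("mean_rank")

print(summary)
```

The three best-ranked rows are exactly the three films released before 1990.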

Finding the Most Viewed Movie

In [22]:
watched_data_after.sum().sort_values(ascending=False)
Out[22]:
seen_5    758
seen_6    738
seen_1    673
seen_4    607
seen_2    571
seen_3    550
dtype: int64
In [23]:
watched_data_after.sum().plot(kind='bar')
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd40f5d1970>

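Episodes V and VI were seen by the most respondents, which is consistent with the rankings. The pattern is not uniform across the original trilogy, though: Episode I (673) was seen by more respondents than Episode IV (607). Episodes II and III were both the least seen and the lowest ranked.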
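To put the view counts on a common scale, the sketch below divides the counts from Out[22] by the 1186 rows in the DataFrame. Using all rows as the denominator is an assumption; dividing by the 936 respondents who saw at least one film would be an equally reasonable choice.

```python
import pandas as pd

# View counts copied from Out[22]; 1186 is the row count reported for the DataFrame.
views = pd.Series({"seen_1": 673, "seen_2": 571, "seen_3": 550,
                   "seen_4": 607, "seen_5": 758, "seen_6": 738})
total_respondents = 1186

# Fraction of all respondents who saw each film, most seen first.
share = (views / total_respondents).sort_values(ascending=False)
print(share.round(3))

# Episode I was seen by more respondents than Episode IV.
print(views["seen_1"] > views["seen_4"])
```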

Exploring the Data by Binary Segments

Now let's examine how certain segments of the survey population responded. There are several columns that segment our data into two groups. Here are a few examples:

  • Do you consider yourself to be a fan of the Star Wars film franchise? (Yes or No)
  • Do you consider yourself to be a fan of the Star Trek franchise? (Yes or No)
  • Gender (Male or Female)

We can split a DataFrame into two groups based on a binary column by selecting two subsets of its rows, one for each value of that column.
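As a hedged sketch of this idea, the toy example below builds the two subsets with Boolean masks and, as an alternative, computes the same per-group means with groupby. The column name fan and the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (not the survey): a binary column plus two ranking columns.
toy = pd.DataFrame({
    "fan": ["Yes", "No", "Yes", np.nan, "No"],
    "ranking_1": [4.0, 2.0, 5.0, 3.0, 3.0],
    "ranking_5": [1.0, 3.0, 2.0, 2.0, 2.0],
})

# Two subsets selected with Boolean masks; the NaN row falls into neither.
fans = toy[toy["fan"] == "Yes"]
non_fans = toy[toy["fan"] == "No"]

# groupby gives the per-group means in one step and also skips NaN keys by default.
by_group = toy.groupby("fan")[["ranking_1", "ranking_5"]].mean()

print(fans[["ranking_1", "ranking_5"]].mean())
print(by_group)
```

Separate subsets are handy when each group gets its own plot, while groupby keeps the results side by side in one table.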

In [24]:
fans = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']=='Yes'] #star wars fans
non_fans=star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']=='No'] # non star wars fans
In [25]:
#comparison of fans & non star wars fans rating
plt.figure(figsize=(10, 3))

plt.subplot(1, 2, 1)
fans[ranking_cols_after].mean().plot(kind='bar')
plt.title('Fans')

plt.subplot(1, 2, 2)
non_fans[ranking_cols_after].mean().plot(kind='bar')
plt.title('Non Fans')
Out[25]:
Text(0.5, 1.0, 'Non Fans')
In [26]:
#comparison of star wars fans and non fans on the number of people who watched the movie

plt.figure(figsize=(10, 3))

plt.subplot(1, 2, 1)
fans[cols_after].sum().plot(kind='bar')
plt.title('Fans')

plt.subplot(1, 2, 2)
non_fans[cols_after].sum().plot(kind='bar')
plt.title('Non Fans')
Out[26]:
Text(0.5, 1.0, 'Non Fans')
In [27]:
non_fans[ranking_cols_after].mean().sort_values()
Out[27]:
ranking_5    2.862676
ranking_1    2.936396
ranking_6    3.471831
ranking_2    3.591549
ranking_4    3.933099
ranking_3    4.193662
dtype: float64
In [28]:
fans[ranking_cols_after].mean().sort_values()
Out[28]:
ranking_5    2.333333
ranking_6    2.829710
ranking_4    2.932971
ranking_1    4.141304
ranking_2    4.342391
ranking_3    4.417423
dtype: float64
In [29]:
non_fans[cols_after].sum().sort_values(ascending=False)
Out[29]:
seen_5    220
seen_6    201
seen_1    173
seen_4    124
seen_2    108
seen_3    100
dtype: int64
In [30]:
fans[cols_after].sum().sort_values(ascending=False)
Out[30]:
seen_5    538
seen_6    537
seen_1    500
seen_4    483
seen_2    463
seen_3    450
dtype: int64

Observations:

  • Among both fans and non-fans in this survey, Episode V: The Empire Strikes Back has the best mean ranking and was seen by the most respondents.
  • Interestingly, the viewing counts for fans and non-fans follow the same order (see cells 29, 30): Episodes V, VI, I, IV, II, III. Apart from Episode IV, this matches the order of release. The rankings given by fans and non-fans (see cells 27, 28) agree only on the best (Episode V) and worst (Episode III) films; Episodes I, II, IV and VI land in different positions. (See cells 25, 26 for graphs.)
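The comparison of fan and non-fan rankings can be checked with a short sketch. The means are copied by hand from Out[27] and Out[28]; Series.rank() turns them into positions from 1 (best) to 6 (worst).

```python
import pandas as pd

# Mean rankings copied from Out[28] (fans) and Out[27] (non-fans).
fans_mean = pd.Series({
    "ranking_1": 4.141304, "ranking_2": 4.342391, "ranking_3": 4.417423,
    "ranking_4": 2.932971, "ranking_5": 2.333333, "ranking_6": 2.829710,
})
non_fans_mean = pd.Series({
    "ranking_1": 2.936396, "ranking_2": 3.591549, "ranking_3": 4.193662,
    "ranking_4": 3.933099, "ranking_5": 2.862676, "ranking_6": 3.471831,
})

# Position of each film within its group: 1 = best mean rank, 6 = worst.
positions = pd.DataFrame({
    "fans": fans_mean.rank().astype(int),
    "non_fans": non_fans_mean.rank().astype(int),
})

# Films whose position differs between the two groups.
differs = positions[positions["fans"] != positions["non_fans"]]

print(positions)
print(list(differs.index))
```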