Guided Project: Star Wars

While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?

The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which we'll be cleaning and exploring.

Exploring and cleaning the data set

In [1]:
import numpy as np
import pandas as pd

star_wars = pd.read_csv('star_wars.csv', encoding="ISO-8859-1")
star_wars.head(10)
Out[1]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? Which of the following Star Wars films have you seen? Please select all that apply. Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. ... Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
0 NaN Response Response Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi Star Wars: Episode I The Phantom Menace ... Yoda Response Response Response Response Response Response Response Response Response
1 3.292880e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3 ... Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
2 3.292880e+09 No NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.292765e+09 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN 1 ... Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
4 3.292763e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 ... Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
5 3.292731e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 ... Somewhat favorably Greedo Yes No No Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
6 3.292719e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 1 ... Very favorably Han Yes No Yes Male 18-29 $25,000 - $49,999 Bachelor degree Middle Atlantic
7 3.292685e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 6 ... Very favorably Han Yes No No Male 18-29 NaN High school degree East North Central
8 3.292664e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 4 ... Very favorably Han No NaN Yes Male 18-29 NaN High school degree South Atlantic
9 3.292654e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 ... Somewhat favorably Han No NaN No Male 18-29 $0 - $24,999 Some college or Associate degree South Atlantic

10 rows × 38 columns

A few things stand out. The RespondentID column is supposed to be a unique ID for each respondent, but it's NaN in some rows (row 0, for example, holds answer labels rather than a response). There are also questions in the survey where the respondent had to check one or more boxes. This type of data is difficult to represent in columnar format, so each such question is spread across several columns.

In [2]:
# reviewing column names
star_wars.columns
Out[2]:
Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?ξ',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

Some columns have strange names, which we will handle later on. For now, we'll remove any rows where RespondentID is NaN.

In [3]:
# removing NaN values
star_wars = star_wars[star_wars['RespondentID'].notnull()]
star_wars.head()
Out[3]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? Which of the following Star Wars films have you seen? Please select all that apply. Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. ... Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
1 3.292880e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3 ... Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
2 3.292880e+09 No NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.292765e+09 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN 1 ... Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
4 3.292763e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 ... Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
5 3.292731e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 ... Somewhat favorably Greedo Yes No No Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central

5 rows × 38 columns

Three columns represent Yes/No questions:

  • Have you seen any of the 6 films in the Star Wars franchise?
  • Do you consider yourself to be a fan of the Star Wars film franchise?
  • Do you consider yourself to be a fan of the Star Trek franchise?

Each of these columns can also be NaN when a respondent chose not to answer the question. We can convert the Yes and No values to Booleans, which makes the data easier to analyze down the road because we can select the rows that are True or False without having to do a string comparison.

In [4]:
# exploring columns
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts()
Out[4]:
Yes    936
No     250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
In [5]:
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts()
Out[5]:
Yes    552
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
In [6]:
star_wars['Do you consider yourself to be a fan of the Star Trek franchise?'].value_counts()
Out[6]:
No     641
Yes    427
Name: Do you consider yourself to be a fan of the Star Trek franchise?, dtype: int64
In [7]:
# converting 'Yes' and 'No' to Boolean values
new_values = {'Yes': True, 'No': False}

star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(new_values)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(new_values)
star_wars['Do you consider yourself to be a fan of the Star Trek franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Trek franchise?'].map(new_values)
In [8]:
# exploring the new values
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts()
Out[8]:
True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
In [9]:
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts()
Out[9]:
True     552
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
In [10]:
star_wars['Do you consider yourself to be a fan of the Star Trek franchise?'].value_counts()
Out[10]:
False    641
True     427
Name: Do you consider yourself to be a fan of the Star Trek franchise?, dtype: int64
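
To illustrate why the Boolean form is convenient, here is a minimal sketch on toy data. The Series `answers` and its values are hypothetical stand-ins for one of the Yes/No columns; only `Series.map` with a dictionary mirrors what the cells above do.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one Yes/No survey column (hypothetical values).
answers = pd.Series(['Yes', 'No', np.nan, 'Yes'], name='is_fan')

# Map the strings to Booleans; NaN is not a key in the dict, so it stays NaN.
yes_no = {'Yes': True, 'No': False}
is_fan = answers.map(yes_no)

# Comparing with True keeps only the True rows; False and NaN rows are dropped.
fans = answers[is_fan == True]
non_fans = answers[is_fan == False]

print(is_fan.tolist())
print(len(fans), len(non_fans))
```

Note that respondents who skipped the question end up in neither group, which is worth keeping in mind when comparing group sizes.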

The next six columns represent a single checkbox question. The respondent checked off a series of boxes in response to the question Which of the following Star Wars films have you seen? Please select all that apply.

The columns for this question are:

  • Which of the following Star Wars films have you seen? Please select all that apply. - Whether or not the respondent saw Star Wars: Episode I The Phantom Menace.
  • Unnamed: 4 - Whether or not the respondent saw Star Wars: Episode II Attack of the Clones.
  • Unnamed: 5 - Whether or not the respondent saw Star Wars: Episode III Revenge of the Sith.
  • Unnamed: 6 - Whether or not the respondent saw Star Wars: Episode IV A New Hope.
  • Unnamed: 7 - Whether or not the respondent saw Star Wars: Episode V The Empire Strikes Back.
  • Unnamed: 8 - Whether or not the respondent saw Star Wars: Episode VI Return of the Jedi.

For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw the movie. If the value is NaN, the respondent either didn't answer or didn't see the movie. We'll assume that they didn't see the movie.

We'll convert each of these columns to a Boolean and then rename the column something more intuitive.

In [11]:
new_values2 = {
    'Star Wars: Episode I  The Phantom Menace': True, 
    'Star Wars: Episode II  Attack of the Clones': True, 
    'Star Wars: Episode III  Revenge of the Sith': True, 
    'Star Wars: Episode IV  A New Hope': True, 
    'Star Wars: Episode V The Empire Strikes Back': True, 
    'Star Wars: Episode VI Return of the Jedi': True, 
    np.nan: False
}



for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(new_values2)
    
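# note: star_wars[3:9] slices rows by position (the output starts at index label 4), not columns 3 to 8
star_wars[3:9].head()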
Out[11]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? Which of the following Star Wars films have you seen? Please select all that apply. Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. ... Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
4 3.292763e+09 True True True True True True True True 5 ... Very favorably I don't understand this question No NaN True Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
5 3.292731e+09 True True True True True True True True 5 ... Somewhat favorably Greedo Yes No False Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
6 3.292719e+09 True True True True True True True True 1 ... Very favorably Han Yes No True Male 18-29 $25,000 - $49,999 Bachelor degree Middle Atlantic
7 3.292685e+09 True True True True True True True True 6 ... Very favorably Han Yes No False Male 18-29 NaN High school degree East North Central
8 3.292664e+09 True True True True True True True True 4 ... Very favorably Han No NaN True Male 18-29 NaN High school degree South Atlantic

5 rows × 38 columns
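
As a possible alternative to an explicit mapping dictionary, the same conversion can be sketched with `Series.notnull()`. This assumes, as stated above, that any non-null cell in a checkbox column is the movie title; the toy DataFrame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy checkbox column: the movie title when the box was checked, NaN otherwise.
title = 'Star Wars: Episode V The Empire Strikes Back'
seen = pd.DataFrame({'Unnamed: 7': [title, np.nan, title]})

# Any non-null cell counts as "seen"; NaN becomes False.
seen_bool = seen['Unnamed: 7'].notnull()

print(seen_bool.tolist())
```

The mapping dictionary is stricter, since an unexpected string would map to NaN instead of True, so it doubles as a sanity check on the column contents.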

In [12]:
# renaming the columns
star_wars = star_wars.rename(columns={'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1', 
                                     'Unnamed: 4': 'seen_2', 
                                      'Unnamed: 5': 'seen_3', 
                                      'Unnamed: 6': 'seen_4',
                                      'Unnamed: 7': 'seen_5', 
                                      'Unnamed: 8': 'seen_6'
                                     })

star_wars.columns[3:9]
Out[12]:
Index(['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6'], dtype='object')

The next six columns ask the respondent to rank the Star Wars movies in order of preference, from most favorite to least favorite. 1 means the film was the most favorite, and 6 means it was the least favorite. Each of the following columns can contain the value 1, 2, 3, 4, 5, 6 or NaN:

  • Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. - How much the respondent liked Star Wars: Episode I The Phantom Menace.
  • Unnamed: 10 - How much the respondent liked Star Wars: Episode II Attack of the Clones.
  • Unnamed: 11 - How much the respondent liked Star Wars: Episode III Revenge of the Sith.
  • Unnamed: 12 - How much the respondent liked Star Wars: Episode IV A New Hope.
  • Unnamed: 13 - How much the respondent liked Star Wars: Episode V The Empire Strikes Back.
  • Unnamed: 14 - How much the respondent liked Star Wars: Episode VI Return of the Jedi.

We'll convert each column to a numeric type and rename the columns.

In [13]:
# converting columns to numeric type
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)

# renaming the columns
star_wars = star_wars.rename(columns={'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': 'ranking_1', 
                                      'Unnamed: 10': 'ranking_2', 
                                     'Unnamed: 11': 'ranking_3',
                                     'Unnamed: 12': 'ranking_4',
                                     'Unnamed: 13': 'ranking_5',
                                     'Unnamed: 14': 'ranking_6'})

star_wars.columns[9:15]
Out[13]:
Index(['ranking_1', 'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5',
       'ranking_6'],
      dtype='object')
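
The conversion above relies on `astype(float)`, which raises an error if a cell cannot be parsed as a number. A hedged alternative is `pd.to_numeric` with `errors='coerce'`, which turns unparseable cells into NaN. The toy frame and the `'unknown'` value below are hypothetical and are not taken from the survey data.

```python
import numpy as np
import pandas as pd

# Hypothetical ranking columns read in as strings.
rankings = pd.DataFrame({
    'ranking_1': ['3', '1', np.nan],
    'ranking_2': ['2', '6', np.nan],
})

# Same approach as in the notebook: cast every column to float.
as_float = rankings.astype(float)

# Alternative: apply pd.to_numeric column by column.
coerced = rankings.apply(pd.to_numeric, errors='coerce')

# With errors='coerce', a value that is not a number becomes NaN instead of raising.
messy = pd.Series(['4', 'unknown'])
cleaned = pd.to_numeric(messy, errors='coerce')

print(coerced.dtypes)
print(cleaned.tolist())
```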

Global Movie Ranking

In [14]:
# calculating the mean of each ranking column
ranking_mean = star_wars[star_wars.columns[9:15]].mean()
ranking_mean
Out[14]:
ranking_1    3.732934
ranking_2    4.087321
ranking_3    4.341317
ranking_4    3.272727
ranking_5    2.513158
ranking_6    3.047847
dtype: float64
In [15]:
# creating a bar chart to plot the different means
import matplotlib.pyplot as plt
%matplotlib inline

ranking_mean.plot(kind='bar', title='Ranking Star Wars Movies', colormap='ocean')
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x2affcede160>

The lower the mean, the higher the respondents ranked the movie.

From this bar chart, we can deduce the following:

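  • ranking_5 (Episode V The Empire Strikes Back) has the lowest mean, so it is the respondents' favorite movie
  • ranking_3 (Episode III Revenge of the Sith) has the highest mean, so it is the respondents' least favorite movie
  • the last 3 movies (Episodes IV to VI) have a better (lower) mean ranking than the first 3 (Episodes I to III)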
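
The same reading can be checked without the chart by sorting the means. The sketch below copies the values printed in Out[14] into a Series by hand; the variable names `ordered`, `first_three` and `last_three` are illustrative only.

```python
import pandas as pd

# Mean rankings copied from Out[14].
ranking_mean = pd.Series({
    'ranking_1': 3.732934,
    'ranking_2': 4.087321,
    'ranking_3': 4.341317,
    'ranking_4': 3.272727,
    'ranking_5': 2.513158,
    'ranking_6': 3.047847,
})

# An ascending sort puts the best-ranked movie (lowest mean) first.
ordered = ranking_mean.sort_values()
print(ordered)

# Compare the average of the first three columns with that of the last three.
first_three = ranking_mean.iloc[:3].mean()
last_three = ranking_mean.iloc[3:].mean()
print(first_three, last_three)
```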

For the seen columns, we'll figure out how many people have seen each movie by taking the sum of each column.

In [16]:
sum_seen = star_wars[star_wars.columns[3:9]].sum()
sum_seen.plot(kind='bar', title='Number of Respondents Who Have Seen Each Star Wars Movie', colormap='ocean')
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x2affef8fa90>

We can deduce from this bar chart that the first movie and the last two movies have been seen by the most respondents. This may help explain why the last two movies (Episodes V and VI) rank better than the others.
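
Summing the seen columns works because pandas treats True as 1 and False as 0. A minimal sketch on hypothetical toy data (the values below are not survey results):

```python
import pandas as pd

# Hypothetical Boolean "seen" columns for four respondents.
seen = pd.DataFrame({
    'seen_1': [True, False, False, True],
    'seen_5': [True, True, True, False],
})

# sum() counts the True values in each column.
counts = seen.sum()

# mean() gives the share of respondents who saw each movie.
shares = seen.mean()

print(counts)
print(shares)
```

Shares can be easier to compare than raw counts when groups differ in size, as they do in the segments examined next.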

We'll now examine how certain segments of the survey population responded. There are several columns that segment our data into two groups:

  • Do you consider yourself to be a fan of the Star Wars film franchise? - True or False
  • Do you consider yourself to be a fan of the Star Trek franchise? - True or False
  • Gender - Male or Female
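
The cells below split the data by hand, one pair of DataFrames per segment. As a hedged alternative, `DataFrame.groupby` can compute the same per-group statistics in one step. The toy frame below is hypothetical and only reuses this notebook's column names.

```python
import numpy as np
import pandas as pd

# Hypothetical mini survey using this notebook's column names.
survey = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', np.nan],
    'ranking_1': [3.0, 2.0, 5.0, 4.0, 1.0],
    'ranking_5': [1.0, 3.0, 2.0, 1.0, 6.0],
    'seen_5': [True, True, False, True, True],
})

# Mean ranking per group; rows with a NaN key are dropped by default.
mean_rank = survey.groupby('Gender')[['ranking_1', 'ranking_5']].mean()

# Number of respondents per group who have seen the movie.
seen_count = survey.groupby('Gender')['seen_5'].sum()

print(mean_rank)
print(seen_count)
```

The manual split used in this notebook is more verbose but keeps each group available as its own DataFrame, which the plotting cells rely on.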

Ranking by Groups

In [17]:
# splitting our data in two groups

fan_star_wars = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] == True]
no_fan_star_wars = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] == False]

fan_star_trek = star_wars[star_wars['Do you consider yourself to be a fan of the Star Trek franchise?'] == True]
no_fan_star_trek = star_wars[star_wars['Do you consider yourself to be a fan of the Star Trek franchise?'] == False]

males = star_wars[star_wars['Gender'] == 'Male']
females = star_wars[star_wars['Gender'] == 'Female']

By Star Wars fans

In [18]:
# calculating the mean ranking values for Star Wars fans and non Star Wars fans
mean_star_wars_fan = fan_star_wars[fan_star_wars.columns[9:15]].mean()
mean_star_wars_no_fan = no_fan_star_wars[no_fan_star_wars.columns[9:15]].mean()

# creating a bar chart for both groups
cols = ['ranking_1', 'ranking_2', 'ranking_3', 
        'ranking_4', 'ranking_5', 'ranking_6']
fan = [fan_star_wars, no_fan_star_wars]
pos = np.arange(len(cols))
bar_width = 0.35

plt.bar(pos, mean_star_wars_fan, bar_width, color='green', label='Fan')
plt.bar(pos + bar_width, mean_star_wars_no_fan, bar_width, color='darkblue', label='No Fan')
plt.xticks(pos, cols, rotation=90)
plt.ylabel('Ranking')
plt.title('Ranking Star Wars Movies')
plt.legend(loc='upper right')

plt.show()

The bar chart shows the following trends:

  • the fans of Star Wars rank the first three Star Wars movies less favorably (a higher mean ranking value) than the respondents who do not consider themselves to be fans.
  • the last three movies are much more appreciated by the respondents who are fans of the Star Wars movies.
  • in general, we can say that the last two movies are the most popular.
In [25]:
# calculating the total number of Star Wars fans and non Star Wars fans
star_wars_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] == True]
star_wars_no_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] == False]

sum_star_wars_fan = star_wars_fan[star_wars_fan.columns[3:9]].sum()
sum_star_wars_no_fan = star_wars_no_fan[star_wars_no_fan.columns[3:9]].sum()

# creating a bar chart for both groups
cols = ['seen_1', 'seen_2', 'seen_3', 
        'seen_4', 'seen_5', 'seen_6']
pos = np.arange(len(cols))
bar_width = 0.35

plt.bar(pos, sum_star_wars_fan, bar_width, color='green', label='Fan')
plt.bar(pos + bar_width, sum_star_wars_no_fan, bar_width, color='darkblue', label='No Fan')
plt.xticks(pos, cols, rotation=90)
plt.ylabel('Number of Respondents')
plt.title('Number of Respondents Who Have Seen Each Star Wars Movie')
plt.legend(loc='upper left')

plt.show()

From the bar chart above, we can deduce the following:

  • far more fans than non-fans have seen each of the Star Wars movies.
  • the last two movies have been seen by the most respondents (fans and non-fans).

By Star Trek fans

In [30]:
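# selecting Star Trek fans and non Star Trek fans (both frames must exist before the means are computed)
star_trek_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Trek franchise?'] == True]
star_trek_no_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Trek franchise?'] == False]

# calculating the mean ranking values for Star Trek fans and non Star Trek fans
mean_star_trek_fan = star_trek_fan[star_trek_fan.columns[9:15]].mean()
mean_star_trek_no_fan = star_trek_no_fan[star_trek_no_fan.columns[9:15]].mean()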

# creating a bar chart for both groups
cols = ['ranking_1', 'ranking_2', 'ranking_3', 
        'ranking_4', 'ranking_5', 'ranking_6']
pos = np.arange(len(cols))
bar_width = 0.35

plt.bar(pos, mean_star_trek_fan, bar_width, color='green', label='Fan')
plt.bar(pos + bar_width, mean_star_trek_no_fan, bar_width, color='darkblue', label='No Fan')
plt.xticks(pos, cols, rotation=90)
plt.ylabel('Ranking')
plt.title('Ranking Star Wars Movies')
plt.legend(loc='upper right')

plt.show()

We can observe the following trends:

  • the first 3 movies are more appreciated by the respondents who do not consider themselves as fans of the Star Trek franchise.
  • the last 3 movies are more appreciated by the Star Trek fans.
  • overall, we can see that the last 2 movies are the most popular amongst both groups.
In [21]:
# calculating the number of Star Trek fans and non Star Trek fans
star_trek_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Trek franchise?'] == True]
star_trek_no_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Trek franchise?'] == False]

sum_star_trek_fan = star_trek_fan[star_trek_fan.columns[3:9]].sum()
sum_star_trek_no_fan = star_trek_no_fan[star_trek_no_fan.columns[3:9]].sum()

# creating a bar chart for both groups
cols = ['seen_1', 'seen_2', 'seen_3', 
        'seen_4', 'seen_5', 'seen_6']
pos = np.arange(len(cols))
bar_width = 0.35

plt.bar(pos, sum_star_trek_fan, bar_width, color='green', label='Fan')
plt.bar(pos + bar_width, sum_star_trek_no_fan, bar_width, color='darkblue', label='No Fan')
plt.xticks(pos, cols, rotation=90)
plt.ylabel('Number of Respondents')
plt.title('Number of Respondents Who Have Seen Each Star Wars Movie')
plt.legend(loc='upper left')

plt.show()

From the bar chart above, we can deduce the following:

  • more Star Trek fans than non-fans have seen each movie.
  • the last two movies (Episodes V and VI) have been seen more than the others by both groups (fans and non-fans).

By Gender

In [22]:
# calculating the mean ranking values for males and females
mean_males = males[males.columns[9:15]].mean()
mean_females = females[females.columns[9:15]].mean()

# creating a bar chart for both groups
cols = ['ranking_1', 'ranking_2', 'ranking_3', 
        'ranking_4', 'ranking_5', 'ranking_6']
pos = np.arange(len(cols))
bar_width = 0.35

plt.bar(pos, mean_males, bar_width, color='green', label='Male')
plt.bar(pos + bar_width, mean_females, bar_width, color='darkblue', label='Female')
plt.xticks(pos, cols, rotation=90)
plt.ylabel('Ranking')
plt.title('Ranking Star Wars Movies by Gender')
plt.legend(loc='upper right')

plt.show()

We can see the same overall trend in the rankings by male and by female respondents. The following differences appear in the bar chart:

  • the first 2 movies have a higher score amongst female respondents.
  • the last 4 movies have a higher score amongst male respondents.
  • in general, we can say that the last 2 movies are the most popular.
In [37]:
# calculating the sum of each `seen` column
seen_male = males[males.columns[3:9]].sum()
seen_female = females[females.columns[3:9]].sum()

# creating a bar chart for both groups
cols = ['seen_1', 'seen_2', 'seen_3', 
        'seen_4', 'seen_5', 'seen_6']
pos = np.arange(len(cols))
bar_width = 0.35

plt.bar(pos, seen_male, bar_width, color='green', label='Male')
plt.bar(pos + bar_width, seen_female, bar_width, color='darkblue', label='Female')
plt.xticks(pos, cols, rotation=90)
plt.ylabel('Number of Respondents')
plt.title('No of Respondents who have seen the Star Wars movies')
plt.legend(loc='upper center')

plt.show()

From the bar chart above, we can deduce the following:

  • in general, each movie has been seen more by male than by female respondents.
  • the last two movies have been seen by the most respondents (male and female).

In conclusion, the more respondents have seen a movie, the better it tends to rank, regardless of gender or of being a fan of Star Wars or Star Trek. The last two movies (Episodes V and VI) are by far the most seen and most popular of the six, generally followed by the first movie. Fans of Star Wars and Star Trek have seen more of the movies than non-fans and rank the last three more favorably.

Here are some potential next steps:

  • Try to segment the data based on columns like Education, Location (Census Region), and Which character shot first?, which aren't binary. Are there any interesting patterns?
  • Clean up columns 15 to 29, which contain data on the characters respondents view favorably and unfavorably.
  • Which character do respondents like the most?
  • Which character do respondents dislike the most?
  • Which character is the most controversial (split between likes and dislikes)?
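
As a starting point for the character questions, here is a rough sketch on hypothetical data. The column names, the unfavorable answer labels and the controversy measure (the smaller of the like and dislike counts) are all assumptions made for illustration; the actual columns 15 to 29 would first need to be renamed and inspected.

```python
import numpy as np
import pandas as pd

# Hypothetical favorability answers, one column per character (assumed names).
views = pd.DataFrame({
    'Han': ['Very favorably', 'Somewhat favorably', 'Very favorably', np.nan],
    'Greedo': ['Very favorably', 'Very unfavorably',
               'Somewhat unfavorably', 'Unfamiliar (N/A)'],
})

# Answer labels treated as likes and dislikes (the unfavorable labels are assumed).
favorable = ['Very favorably', 'Somewhat favorably']
unfavorable = ['Somewhat unfavorably', 'Very unfavorably']

# Count likes and dislikes per character; NaN and other answers count as neither.
summary = pd.DataFrame({
    'likes': views.isin(favorable).sum(),
    'dislikes': views.isin(unfavorable).sum(),
})

# One possible controversy measure: high only when both counts are high.
summary['controversy'] = summary[['likes', 'dislikes']].min(axis=1)

print(summary)
print(summary['likes'].idxmax(), summary['dislikes'].idxmax())
```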