The Best and the Worst of the Star Wars Film Franchise (so far)¶

A decade after the release of Star Wars: Revenge of the Sith, Disney sought to build upon the main storyline of this classic sci-fi series with its seventh installment, Star Wars: The Force Awakens. In anticipation of the movie, FiveThirtyEight conducted an online survey through SurveyMonkey to assess America's sentiments towards the series so far. The survey was limited to the six main films and did not include other media such as comic books or television series.

The data collected from the survey can be downloaded from their GitHub repository.

Based on the survey results, we will try to determine how the respondents ranked the series' first six films and its pivotal characters.

Results¶

Among the first six films of the franchise, The Empire Strikes Back ranked at the top as both the most viewed and the best movie. Among the characters, Luke Skywalker was the most favored while Jar Jar Binks was the least favored.

Cleaning Our Data Set¶

Let us familiarize ourselves with some of the columns we will be working with.

Column Name	Description
`RespondentID`	An anonymized ID for the respondent (person taking the survey)
`Gender`	The respondent's gender
`Age`	The respondent's age
`Household Income`	The respondent's income
`Education`	The respondent's education level
`Location (Census Region)`	The respondent's location
`Have you seen any of the 6 films in the Star Wars franchise?`	Has a Yes or No response
`Do you consider yourself to be a fan of the Star Wars film franchise?`	Has a Yes or No response

Retaining Valid Respondents¶

After reading the dataset into a pandas DataFrame, we notice that the first row appears to contain an invalid response, as indicated by the NaN value under the RespondentID column. A valid response needs to have a unique identifier. Checking the contents of this row for the other columns, we quickly realize that the first row actually contains the answer options or a prompt for each question. The actual answers begin in the second row.

In [1]:

import pandas as pd

star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')

pd.options.display.max_columns = 50 # To show all columns of the data set
display(star_wars.head())
star_wars.shape

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	Unnamed: 10	Unnamed: 11	Unnamed: 12	Unnamed: 13	Unnamed: 14	Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.	Unnamed: 16	Unnamed: 17	Unnamed: 18	Unnamed: 19	Unnamed: 20	Unnamed: 21	Unnamed: 22	Unnamed: 23	Unnamed: 24	Unnamed: 25	Unnamed: 26	Unnamed: 27	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
0	NaN	Response	Response	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	Han Solo	Luke Skywalker	Princess Leia Organa	Anakin Skywalker	Obi Wan Kenobi	Emperor Palpatine	Darth Vader	Lando Calrissian	Boba Fett	C-3P0	R2 D2	Jar Jar Binks	Padme Amidala	Yoda	Response	Response	Response	Response	Response	Response	Response	Response	Response
1	3.292880e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	3	2	1	4	5	6	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	I don't understand this question	Yes	No	No	Male	18-29	NaN	High school degree	South Atlantic
2	3.292880e+09	No	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Yes	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
3	3.292765e+09	Yes	No	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN	1	2	3	4	5	6	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	I don't understand this question	No	NaN	No	Male	18-29	$0 - $24,999	High school degree	West North Central
4	3.292763e+09	Yes	Yes	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5	6	1	2	4	3	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Somewhat favorably	Very favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	I don't understand this question	No	NaN	Yes	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central

Out[1]:

(1187, 38)

We will store the first row under options as reference.

In [2]:

options = star_wars.iloc[0:1,:]
options

Out[2]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	Unnamed: 10	Unnamed: 11	Unnamed: 12	Unnamed: 13	Unnamed: 14	Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.	Unnamed: 16	Unnamed: 17	Unnamed: 18	Unnamed: 19	Unnamed: 20	Unnamed: 21	Unnamed: 22	Unnamed: 23	Unnamed: 24	Unnamed: 25	Unnamed: 26	Unnamed: 27	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
0	NaN	Response	Response	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	Han Solo	Luke Skywalker	Princess Leia Organa	Anakin Skywalker	Obi Wan Kenobi	Emperor Palpatine	Darth Vader	Lando Calrissian	Boba Fett	C-3P0	R2 D2	Jar Jar Binks	Padme Amidala	Yoda	Response	Response	Response	Response	Response	Response	Response	Response	Response

Although we haven't checked the entire dataset, it is possible to have other NaN values for RespondentID. We will use the pandas.Series.notnull() method to filter the dataset to contain only rows with valid ID values.

In [3]:

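# Keep rows with a valid ID; .copy() makes later in-place edits act on an independent DataFrame
star_wars = star_wars[star_wars['RespondentID'].notnull()].copy()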
star_wars.shape

Out[3]:

(1186, 38)

Comparing the results of the DataFrame.shape attribute (enclosed in parentheses), from 1187 to 1186, it is clear that only the first row contained an invalid ID value.
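The steps above can be sketched on a small stand-in for the survey file. The CSV text, the column names, and the `toy_` variable names below are hypothetical and only mimic the layout of the real data, where the row right after the header holds prompts instead of a response.

```python
import io
import pandas as pd

# Toy CSV mimicking the survey layout (hypothetical data):
# the first data row holds answer prompts, not a real response.
csv_text = """RespondentID,Seen any film?,Gender
,Response,Response
101,Yes,Male
102,No,Female
"""

toy = pd.read_csv(io.StringIO(csv_text))

# Keep the prompt row for reference, as done with `options`.
toy_options = toy.iloc[0:1, :]

# Keep only rows that carry an ID; .copy() returns an independent DataFrame.
toy_valid = toy[toy['RespondentID'].notnull()].copy()

print(toy_options.shape, toy_valid.shape)
```

Only the prompt row lacks an ID here, so the filtered frame is exactly one row shorter than the original, which mirrors the drop from 1187 to 1186 rows seen above.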

Yes/No Columns¶

To make our analysis easier, we will convert the columns with Yes or No responses into the bool type by replacing Yes with True and No with False. NaN values will remain as such. bool values make it easier for us to compute aggregate values through methods such as Series.mean() and Series.sum().

Below are the names of the columns that we will perform our value substitution on. We will access them by their positional indexes to avoid typing out the long names.

In [4]:

star_wars.columns[[1,2,30,31,32]]

Out[4]:

Index(['Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦',
       'Do you consider yourself to be a fan of the Star Trek franchise?'],
      dtype='object')

Let's see first how many Yes / No values each column contains.

In [5]:

for n in [1,2,30,31,32]:
    display(star_wars.iloc[:,n].value_counts(dropna=False))

Yes    936
No     250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64

Yes    552
NaN    350
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64

No     615
NaN    358
Yes    213
Name: Are you familiar with the Expanded Universe?, dtype: int64

NaN    973
No     114
Yes     99
Name: Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦, dtype: int64

No     641
Yes    427
NaN    118
Name: Do you consider yourself to be a fan of the Star Trek franchise?, dtype: int64

We create a dictionary, yes_no, to contain the substitute values. This dictionary is passed on to the Series.map() method which will take care of the substitution. Note that since NaN is not included, these values will be retained as such. The respondents may have been given the option to not answer the question, hence the missing value.

In [6]:

# Substitute Yes and No values with True and False, respectively
yes_no = {'Yes': True, 'No': False}

for n in [1,2,30,31,32]:
    star_wars.iloc[:,n] = star_wars.iloc[:,n].map(yes_no)

We confirm our changes below.

In [7]:

for n in [1,2,30,31,32]:
    display(star_wars.iloc[:,n].value_counts(dropna=False))

True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64

True     552
NaN      350
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64

False    615
NaN      358
True     213
Name: Are you familiar with the Expanded Universe?, dtype: int64

NaN      973
False    114
True      99
Name: Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦, dtype: int64

False    641
True     427
NaN      118
Name: Do you consider yourself to be a fan of the Star Trek franchise?, dtype: int64
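To illustrate why boolean values are convenient for aggregation, here is a minimal sketch on toy data. The `answers` Series is hypothetical, and the conversion to the nullable `'boolean'` dtype is an extra step assumed for this sketch; it is not part of the notebook's own pipeline.

```python
import pandas as pd

# Hypothetical answers to a Yes/No question, with one skipped answer.
answers = pd.Series(['Yes', 'No', None, 'Yes'])

yes_no = {'Yes': True, 'No': False}

# Values that are not keys of the dictionary (here, the missing one) become NaN.
mapped = answers.map(yes_no)

# Assumption for this sketch: use the nullable boolean dtype so that
# aggregations skip the missing answer instead of treating it as False.
as_bool = mapped.astype('boolean')

n_yes = int(as_bool.sum())       # number of "Yes" answers
share_yes = float(as_bool.mean())  # share of "Yes" among those who answered

print(n_yes, round(share_yes, 3))
```

The sum counts the True values and the mean gives their share among the non-missing answers, which is the kind of summary that is awkward to compute directly on 'Yes' and 'No' strings.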

In [8]:

display(star_wars.head())

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	Unnamed: 10	Unnamed: 11	Unnamed: 12	Unnamed: 13	Unnamed: 14	Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.	Unnamed: 16	Unnamed: 17	Unnamed: 18	Unnamed: 19	Unnamed: 20	Unnamed: 21	Unnamed: 22	Unnamed: 23	Unnamed: 24	Unnamed: 25	Unnamed: 26	Unnamed: 27	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
1	3.292880e+09	True	True	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	3	2	1	4	5	6	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	I don't understand this question	True	False	False	Male	18-29	NaN	High school degree	South Atlantic
2	3.292880e+09	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	True	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
3	3.292765e+09	True	False	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	NaN	NaN	NaN	1	2	3	4	5	6	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	I don't understand this question	False	NaN	False	Male	18-29	$0 - $24,999	High school degree	West North Central
4	3.292763e+09	True	True	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5	6	1	2	4	3	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Somewhat favorably	Very favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	I don't understand this question	False	NaN	True	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
5	3.292731e+09	True	True	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	5	4	6	2	1	3	Very favorably	Somewhat favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very unfavorably	Somewhat favorably	Neither favorably nor unfavorably (neutral)	Very favorably	Somewhat favorably	Somewhat favorably	Very unfavorably	Somewhat favorably	Somewhat favorably	Greedo	True	False	False	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central

Checkbox columns¶

The columns with indexes 3 to 8 of our data set indicate whether a respondent has seen a particular Star Wars film. From the last sentence of the question, "Please select all that apply" (column with index 3), we can deduce that each film title was listed beside a checkbox. If a box was ticked, the corresponding film's title appears in our dataset. Otherwise, there is a missing value (NaN), indicating that the respondent has not seen the film.

Similar to the Yes or No columns, we will replace the values with True or False.

Below, the names of the movies are taken from options and stored as a list in movie_names.

In [9]:

movie_names = list(options.iloc[0,3:9])

movie_names

Out[9]:

['Star Wars: Episode I  The Phantom Menace',
 'Star Wars: Episode II  Attack of the Clones',
 'Star Wars: Episode III  Revenge of the Sith',
 'Star Wars: Episode IV  A New Hope',
 'Star Wars: Episode V The Empire Strikes Back',
 'Star Wars: Episode VI Return of the Jedi']

We create a function, replace(), which changes a cell's content to True if it contains a film title, or to False otherwise. This function is passed to the DataFrame.applymap() method to work on the relevant columns.

In [10]:

# Return True if the cell holds a movie title, and False for any other value (including NaN)
def replace(movie):
    return movie in movie_names

star_wars.iloc[:,3:9] = star_wars.iloc[:,3:9].applymap(replace) # Apply the function to the appropriate columns
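As an alternative sketch (not the approach used in this notebook), the same True/False conversion can be expressed with `DataFrame.isin()`, which tests every cell for membership in a list without a helper function. The titles and column names below are hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical film titles standing in for `movie_names`.
titles = ['Film A', 'Film B']

# Toy checkbox columns: a title if the box was ticked, NaN otherwise.
toy = pd.DataFrame({
    'col_a': ['Film A', np.nan, 'Film A'],
    'col_b': [np.nan, np.nan, 'Film B'],
})

# True where the cell holds one of the titles, False everywhere else (NaN included).
seen_flags = toy.isin(titles)

# With booleans, summing each column counts how many respondents saw each film.
views = seen_flags.sum()
print(views)
```

Because the result is purely boolean, per-film view counts follow from a plain column sum.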

We will also change the column names to something more understandable and intuitive. The current column names are stored in seen_cols.

In [11]:

# Store the current column names (a pandas Index) in seen_cols
seen_cols = star_wars.columns[3:9]

seen_cols

Out[11]:

Index(['Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8'],
      dtype='object')

To replace our column names, we need to create a dictionary, which will contain the current names as the keys and the replacement names as the corresponding values.

In [12]:

# Create a dictionary to contain replacement names for the checkbox (seen movies) columns
seen_dict = {}

n = 0
for seen in seen_cols:
    n += 1
    seen_dict[seen] = 'seen_{0}'.format(n)
    
seen_dict

Out[12]:

{'Which of the following Star Wars films have you seen? Please select all that apply.': 'seen_1',
 'Unnamed: 4': 'seen_2',
 'Unnamed: 5': 'seen_3',
 'Unnamed: 6': 'seen_4',
 'Unnamed: 7': 'seen_5',
 'Unnamed: 8': 'seen_6'}

We then use the DataFrame.rename() method to replace our column names.

In [13]:

star_wars.rename(columns=seen_dict, inplace=True)
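The renaming pattern above can also be sketched more compactly with `enumerate()` and a dictionary comprehension. The DataFrame and its column names below are hypothetical; only the `seen_N` naming scheme is taken from the notebook.

```python
import pandas as pd

# Empty toy frame with survey-like column names (hypothetical).
toy = pd.DataFrame(columns=['Which films have you seen?',
                            'Unnamed: 4', 'Unnamed: 5', 'Gender'])

# The first three columns play the role of the checkbox columns.
old_names = toy.columns[0:3]

# Map each old name to 'seen_1', 'seen_2', ... by position.
new_names = {old: 'seen_{}'.format(i)
             for i, old in enumerate(old_names, start=1)}

# Columns that are not keys of the dictionary keep their names.
toy = toy.rename(columns=new_names)
print(list(toy.columns))
```

Assigning the result of `rename()` back to the variable is equivalent in effect to `inplace=True` for this purpose.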

Our changes are confirmed below.

In [14]:

star_wars.head()

Out[14]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	Unnamed: 10	Unnamed: 11	Unnamed: 12	Unnamed: 13	Unnamed: 14	Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.	Unnamed: 16	Unnamed: 17	Unnamed: 18	Unnamed: 19	Unnamed: 20	Unnamed: 21	Unnamed: 22	Unnamed: 23	Unnamed: 24	Unnamed: 25	Unnamed: 26	Unnamed: 27	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
1	3.292880e+09	True	True	True	True	True	True	True	True	3	2	1	4	5	6	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	I don't understand this question	True	False	False	Male	18-29	NaN	High school degree	South Atlantic
2	3.292880e+09	False	NaN	False	False	False	False	False	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	True	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
3	3.292765e+09	True	False	True	True	True	False	False	False	1	2	3	4	5	6	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	I don't understand this question	False	NaN	False	Male	18-29	$0 - $24,999	High school degree	West North Central
4	3.292763e+09	True	True	True	True	True	True	True	True	5	6	1	2	4	3	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Somewhat favorably	Very favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	I don't understand this question	False	NaN	True	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
5	3.292731e+09	True	True	True	True	True	True	True	True	5	4	6	2	1	3	Very favorably	Somewhat favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very unfavorably	Somewhat favorably	Neither favorably nor unfavorably (neutral)	Very favorably	Somewhat favorably	Somewhat favorably	Very unfavorably	Somewhat favorably	Somewhat favorably	Greedo	True	False	False	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central

Ranking Columns¶

The next six columns after the checkbox columns let the respondent rank the movies according to preference. Since there are six movies, each movie is ranked on a scale of 1-6, with 1 being the highest.

Each of these columns currently has the object dtype, so the rankings are not yet stored as numbers.

In [15]:

star_wars[star_wars.columns[9:15]].dtypes

Out[15]:

Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.    object
Unnamed: 10                                                                                                                                      object
Unnamed: 11                                                                                                                                      object
Unnamed: 12                                                                                                                                      object
Unnamed: 13                                                                                                                                      object
Unnamed: 14                                                                                                                                      object
dtype: object

We will convert them to a numeric type and then rename the columns to something more intuitive like we did with the checkbox columns.

In [16]:

# Convert the ranking columns to a numeric type
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float) 

# Store the column names as a list in rank_cols
rank_cols = star_wars.columns[9:15]

rank_cols

Out[16]:

Index(['Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14'],
      dtype='object')
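To show what the numeric conversion makes possible, here is a small sketch on toy rankings. The values and the `ranking_1`/`ranking_2` column names are hypothetical stand-ins for the survey columns.

```python
import numpy as np
import pandas as pd

# Toy rankings stored as text, with one respondent who gave no ranking.
toy = pd.DataFrame({
    'ranking_1': ['3', '1', np.nan],
    'ranking_2': ['1', '2', np.nan],
})

# Convert the text to floats; missing values stay NaN.
toy = toy.astype(float)

# mean() skips NaN by default; a lower mean indicates a better-ranked film.
mean_rank = toy.mean()
best = mean_rank.idxmin()
print(mean_rank)
print(best)
```

Since 1 marks the favorite film, the column with the lowest average is the one respondents preferred.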

As with the checkbox columns, we conduct a similar process of creating a dictionary and using the DataFrame.rename() method to replace the ranking column names.

The results of our changes are displayed after.

In [17]:

# Create a dictionary to contain replacement names for the ranking columns
rank_dict = {}
n = 0
for rank in rank_cols:
    n += 1
    rank_dict[rank] = 'ranking_{}'.format(n)
    

# Replace the column names of the ranking columns
star_wars.rename(columns=rank_dict, inplace=True)

star_wars.head()

Out[17]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	ranking_1	ranking_2	ranking_3	ranking_4	ranking_5	ranking_6	Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.	Unnamed: 16	Unnamed: 17	Unnamed: 18	Unnamed: 19	Unnamed: 20	Unnamed: 21	Unnamed: 22	Unnamed: 23	Unnamed: 24	Unnamed: 25	Unnamed: 26	Unnamed: 27	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
1	3.292880e+09	True	True	True	True	True	True	True	True	3.0	2.0	1.0	4.0	5.0	6.0	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	I don't understand this question	True	False	False	Male	18-29	NaN	High school degree	South Atlantic
2	3.292880e+09	False	NaN	False	False	False	False	False	False	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	True	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
3	3.292765e+09	True	False	True	True	True	False	False	False	1.0	2.0	3.0	4.0	5.0	6.0	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Somewhat favorably	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	Unfamiliar (N/A)	I don't understand this question	False	NaN	False	Male	18-29	$0 - $24,999	High school degree	West North Central
4	3.292763e+09	True	True	True	True	True	True	True	True	5.0	6.0	1.0	2.0	4.0	3.0	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	Somewhat favorably	Very favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very favorably	Very favorably	Very favorably	Very favorably	I don't understand this question	False	NaN	True	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
5	3.292731e+09	True	True	True	True	True	True	True	True	5.0	4.0	6.0	2.0	1.0	3.0	Very favorably	Somewhat favorably	Somewhat favorably	Somewhat unfavorably	Very favorably	Very unfavorably	Somewhat favorably	Neither favorably nor unfavorably (neutral)	Very favorably	Somewhat favorably	Somewhat favorably	Very unfavorably	Somewhat favorably	Somewhat favorably	Greedo	True	False	False	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central

Character Columns¶

Reviewing options, we will notice that there are several columns with character names, from Han Solo to Yoda.

In [18]:

options

Out[18]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	Which of the following Star Wars films have you seen? Please select all that apply.	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.	Unnamed: 10	Unnamed: 11	Unnamed: 12	Unnamed: 13	Unnamed: 14	Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.	Unnamed: 16	Unnamed: 17	Unnamed: 18	Unnamed: 19	Unnamed: 20	Unnamed: 21	Unnamed: 22	Unnamed: 23	Unnamed: 24	Unnamed: 25	Unnamed: 26	Unnamed: 27	Unnamed: 28	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
0	NaN	Response	Response	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	Star Wars: Episode I The Phantom Menace	Star Wars: Episode II Attack of the Clones	Star Wars: Episode III Revenge of the Sith	Star Wars: Episode IV A New Hope	Star Wars: Episode V The Empire Strikes Back	Star Wars: Episode VI Return of the Jedi	Han Solo	Luke Skywalker	Princess Leia Organa	Anakin Skywalker	Obi Wan Kenobi	Emperor Palpatine	Darth Vader	Lando Calrissian	Boba Fett	C-3P0	R2 D2	Jar Jar Binks	Padme Amidala	Yoda	Response	Response	Response	Response	Response	Response	Response	Response	Response

Each of these characters was rated in terms of favorability. We can see the list of possible answers for these columns below.

In [19]:

star_wars.iloc[:,15].value_counts(dropna=False)

Out[19]:

Very favorably                                 610
NaN                                            357
Somewhat favorably                             151
Neither favorably nor unfavorably (neutral)     44
Unfamiliar (N/A)                                15
Somewhat unfavorably                             8
Very unfavorably                                 1
Name: Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her., dtype: int64

For easier visualization, we can group these ratings into four categories:

1. Favorable
2. Unfavorable
3. Neutral
4. Unfamiliar

We will group the responses accordingly and convert them to the corresponding numbers above to aid in our analysis.

Before doing so, let us inspect the reason behind the presence of NaN values. These values may be coming from those respondents who have not seen any of the Star Wars films. Let's confirm this below.

In [20]:

# Select only rows for respondents who have NOT seen any film
not_seen = star_wars.loc[star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] == False, star_wars.columns[15:29]]

not_seen.isnull().describe() # To show the amount of NaN values per column 

Out[20]:

	Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.	Unnamed: 16	Unnamed: 17	Unnamed: 18	Unnamed: 19	Unnamed: 20	Unnamed: 21	Unnamed: 22	Unnamed: 23	Unnamed: 24	Unnamed: 25	Unnamed: 26	Unnamed: 27	Unnamed: 28
count	250	250	250	250	250	250	250	250	250	250	250	250	250	250
unique	1	1	1	1	1	1	1	1	1	1	1	1	1	1
top	True	True	True	True	True	True	True	True	True	True	True	True	True	True
freq	250	250	250	250	250	250	250	250	250	250	250	250	250	250

As seen above, none of the 250 respondents who answered No to the first question gave a favorability rating to any of the characters, as indicated by a frequency of 250 for all columns. This frequency represents the number of times the value NaN appears per column.

However, upon further inspection, we will see that NaN values appear even for respondents who have seen ANY of the films.

In [21]:

# Select only rows for respondents who have SEEN any film
seen = star_wars.loc[star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] == True, star_wars.columns[15:29]]

In [22]:

seen.isnull().describe() # To show the amount of NaN values per column 

Out[22]:

	Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.	Unnamed: 16	Unnamed: 17	Unnamed: 18	Unnamed: 19	Unnamed: 20	Unnamed: 21	Unnamed: 22	Unnamed: 23	Unnamed: 24	Unnamed: 25	Unnamed: 26	Unnamed: 27	Unnamed: 28
count	936	936	936	936	936	936	936	936	936	936	936	936	936	936
unique	2	2	2	2	2	2	2	2	2	2	2	2	2	2
top	False	False	False	False	False	False	False	False	False	False	False	False	False	False
freq	829	831	831	823	825	814	826	820	812	827	830	821	814	826

For those who have seen ANY of the films, we can see that even though most of the values are not NaN, this value still appears in varying frequencies per column. Since we will not include respondents who have NOT seen any film in our analysis, we can assume that NaN values indicate that a respondent was unfamiliar with the character.
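The same check can be sketched more directly by counting missing values per column within each group of respondents. The toy DataFrame and the `seen_any` column name below are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: whether the respondent has seen any film, plus two character ratings.
toy = pd.DataFrame({
    'seen_any': [True, True, False, False, True],
    'Han Solo': ['Very favorably', np.nan, np.nan, np.nan, 'Somewhat favorably'],
    'Yoda': ['Very favorably', 'Very favorably', np.nan, np.nan, np.nan],
})

# Mark missing ratings, then count them separately for each value of seen_any.
missing = toy[['Han Solo', 'Yoda']].isna().groupby(toy['seen_any']).sum()
print(missing)
```

Each row of the result corresponds to one group (False first, then True), and each cell is the number of missing ratings for that character within the group.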

Below, we create a dictionary for the replacement values and then use the Series.map() method within a loop to do the appropriate substitution.

In [23]:

import numpy as np

# Dictionary containing replacement values for favorability ratings
favor_ratings = {
                'Very favorably' : 1,
                'Somewhat favorably' : 1,
                'Neither favorably nor unfavorably (neutral)' : 3,
                'Somewhat unfavorably' : 2,
                'Very unfavorably' : 2,
                'Unfamiliar (N/A)' : 4, 
                np.nan : 4
                }

# Replace favorability ratings with their numeric code (NaN is treated as Unfamiliar)
for n in range(15,29):
    star_wars.iloc[:,n] = star_wars.iloc[:,n].map(favor_ratings)
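A minimal sketch of the grouping on toy data follows. The `ratings` Series and the `labels` dictionary are hypothetical, and missing values are handled here with `fillna()` after the mapping, as an alternative to including NaN as a dictionary key.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for a single character, with one missing answer.
ratings = pd.Series(['Very favorably', 'Somewhat unfavorably', np.nan,
                     'Unfamiliar (N/A)', 'Somewhat favorably'])

codes = {
    'Very favorably': 1,
    'Somewhat favorably': 1,
    'Somewhat unfavorably': 2,
    'Very unfavorably': 2,
    'Neither favorably nor unfavorably (neutral)': 3,
    'Unfamiliar (N/A)': 4,
}

# Human-readable names for the four numeric codes (assumed for this sketch).
labels = {1: 'Favorable', 2: 'Unfavorable', 3: 'Neutral', 4: 'Unfamiliar'}

# Map to codes, then treat missing answers as Unfamiliar (code 4).
coded = ratings.map(codes).fillna(4).astype(int)

# Count how many answers fall into each category.
counts = coded.map(labels).value_counts()
print(counts)
```

Collapsing six answer options into four codes keeps later tallies and charts compact while preserving the distinction between neutral and unfamiliar respondents.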

We will also convert the column names to the actual names of the characters.

In [24]:

# Dictionary containing replacement names for character column names
char_dict = {
            'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.': 'Han Solo',
            'Unnamed: 16': 'Luke Skywalker',
            'Unnamed: 17': 'Princess Leia Organa',
            'Unnamed: 18': 'Anakin Skywalker',
            'Unnamed: 19': 'Obi Wan Kenobi',
            'Unnamed: 20': 'Emperor Palpatine',
            'Unnamed: 21': 'Darth Vader',
            'Unnamed: 22': 'Lando Calrissian',
            'Unnamed: 23': 'Boba Fett',
            'Unnamed: 24': 'C-3P0',
            'Unnamed: 25': 'R2 D2',
            'Unnamed: 26': 'Jar Jar Binks',
            'Unnamed: 27': 'Padme Amidala',
            'Unnamed: 28': 'Yoda'
            }

# Replace column names of character columns
star_wars.rename(columns=char_dict, inplace=True)

The first rows of our final dataset are displayed below.

In [25]:

star_wars.head()

Out[25]:

	RespondentID	Have you seen any of the 6 films in the Star Wars franchise?	Do you consider yourself to be a fan of the Star Wars film franchise?	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	ranking_1	ranking_2	ranking_3	ranking_4	ranking_5	ranking_6	Han Solo	Luke Skywalker	Princess Leia Organa	Anakin Skywalker	Obi Wan Kenobi	Emperor Palpatine	Darth Vader	Lando Calrissian	Boba Fett	C-3P0	R2 D2	Jar Jar Binks	Padme Amidala	Yoda	Which character shot first?	Are you familiar with the Expanded Universe?	Do you consider yourself to be a fan of the Expanded Universe?ÂÃ¦	Do you consider yourself to be a fan of the Star Trek franchise?	Gender	Age	Household Income	Education	Location (Census Region)
1	3.292880e+09	True	True	True	True	True	True	True	True	3.0	2.0	1.0	4.0	5.0	6.0	1	1	1	1	1	1	1	4	4	1	1	1	1	1	I don't understand this question	True	False	False	Male	18-29	NaN	High school degree	South Atlantic
2	3.292880e+09	False	NaN	False	False	False	False	False	False	NaN	NaN	NaN	NaN	NaN	NaN	4	4	4	4	4	4	4	4	4	4	4	4	4	4	NaN	NaN	NaN	True	Male	18-29	$0 - $24,999	Bachelor degree	West South Central
3	3.292765e+09	True	False	True	True	True	False	False	False	1.0	2.0	3.0	4.0	5.0	6.0	1	1	1	1	1	4	4	4	4	4	4	4	4	4	I don't understand this question	False	NaN	False	Male	18-29	$0 - $24,999	High school degree	West North Central
4	3.292763e+09	True	True	True	True	True	True	True	True	5.0	6.0	1.0	2.0	4.0	3.0	1	1	1	1	1	1	1	1	2	1	1	1	1	1	I don't understand this question	False	NaN	True	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central
5	3.292731e+09	True	True	True	True	True	True	True	True	5.0	4.0	6.0	2.0	1.0	3.0	1	1	1	2	1	2	1	3	1	1	1	2	1	1	Greedo	True	False	False	Male	18-29	$100,000 - $149,999	Some college or Associate degree	West North Central

Most Viewed Movie¶

Now that we have cleaned our dataset, let us first figure out which among the six films was viewed the most. Recall that the titles were assigned to movie_names.

In [26]:

movie_names

Out[26]:

['Star Wars: Episode I  The Phantom Menace',
 'Star Wars: Episode II  Attack of the Clones',
 'Star Wars: Episode III  Revenge of the Sith',
 'Star Wars: Episode IV  A New Hope',
 'Star Wars: Episode V The Empire Strikes Back',
 'Star Wars: Episode VI Return of the Jedi']

In our visualization, we will only include the subtitles to avoid cluttering it with text. The subtitle of each movie appears after the Roman numeral (I, II, III, etc.).

We will first convert movie_names to a Series object, assign it to titles, then use a regular expression to extract the subtitles. These subtitles will then be assigned to the aptly named variable, subtitles.

In [27]:

import re

pattern = r'Star Wars: Episode .{1,3} +(.+)' # Regular expression to extract the subtitles

titles = pd.Series(movie_names)
subtitles = titles.str.extract(pattern)
subtitles

Out[27]:

	0
0	The Phantom Menace
1	Attack of the Clones
2	Revenge of the Sith
3	A New Hope
4	The Empire Strikes Back
5	Return of the Jedi
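
To see what the pattern captures, here is a standard-library sketch that applies it with re.match to each title. The list name extracted is an assumption made for this example.

```python
import re

pattern = r'Star Wars: Episode .{1,3} +(.+)'

movie_names = [
    'Star Wars: Episode I  The Phantom Menace',
    'Star Wars: Episode II  Attack of the Clones',
    'Star Wars: Episode III  Revenge of the Sith',
    'Star Wars: Episode IV  A New Hope',
    'Star Wars: Episode V The Empire Strikes Back',
    'Star Wars: Episode VI Return of the Jedi',
]

# .{1,3} consumes the Roman numeral, ' +' the space(s) after it,
# and the capture group (.+) keeps the rest of the title
extracted = [re.match(pattern, name).group(1) for name in movie_names]
print(extracted)
```

The ` +` part matters because some titles have two spaces after the numeral and others only one.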

The first question of the survey asks whether the respondent has seen any of the six Star Wars films. It would make sense to include in our analysis only those who responded Yes to this question. Below, we see that 936 respondents meet this criterion.

In [28]:

star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna = False)

Out[28]:

True     936
False    250
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64

For each film, we will figure out how many of the 936 respondents have seen it. Note that a respondent may have seen all, a few, or just one of the films. This means that each film can have a maximum of 936 True values. The film with the highest number of respondents (True values) will be considered as the most viewed film among the six.

Below, we filter the dataset to include only the 936 respondents and the checkbox columns we previously cleaned. We then use the DataFrame.T attribute to make the rows into columns and vice versa. This will allow us to make a new column containing the sum of True values for each film. The transposed DataFrame is stored in seen_any.

In [29]:

seen_any = star_wars.loc[star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] == True, star_wars.columns[3:9]].T
seen_any

Out[29]:

	1	3	4	5	6	7	8	9	10	11	13	14	15	16	17	18	19	20	21	22	23	24	25	27	28	...	1156	1157	1159	1161	1162	1163	1164	1165	1166	1167	1168	1170	1172	1173	1174	1175	1176	1177	1178	1180	1181	1182	1183	1185	1186
seen_1	True	True	True	True	True	True	True	True	False	False	True	True	True	True	False	True	True	True	True	True	True	True	True	True	True	...	False	True	False	True	True	True	False	True	True	True	True	True	True	False	True	True	True	True	True	False	True	True	True	True	True
seen_2	True	True	True	True	True	True	True	True	True	False	True	True	True	True	False	True	True	True	True	True	True	True	True	True	True	...	False	True	False	False	True	False	False	True	True	True	False	False	True	False	True	True	True	True	True	False	True	True	True	True	True
seen_3	True	True	True	True	True	True	True	True	False	False	True	True	True	True	False	True	True	True	True	True	True	True	True	True	True	...	False	True	False	True	True	False	False	True	True	True	False	False	True	False	True	True	True	False	True	False	True	True	True	True	False
seen_4	True	False	True	True	True	True	True	True	False	False	True	True	True	True	True	False	True	True	True	True	True	True	True	True	True	...	True	True	False	False	False	True	True	False	True	True	False	True	True	True	True	False	True	False	False	False	True	True	True	True	False
seen_5	True	False	True	True	True	True	True	True	False	False	True	True	True	True	False	False	True	True	True	True	True	True	True	True	True	...	True	True	True	False	True	True	True	True	True	True	True	True	True	True	True	False	True	False	True	True	True	True	True	True	True
seen_6	True	False	True	True	True	True	True	True	False	False	True	True	True	True	False	True	True	True	True	False	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	True	True	True	True	True	True	False	True	False	True	True	True	True	True	True	True

6 rows × 936 columns

We will create a new column, seen_count, to contain the sum of the True values for each film. Each count is then expressed as a percentage of the 936 respondents, rounded to the nearest whole number, and placed under the perc column.

In [30]:

seen_any['seen_count'] = seen_any.sum(axis = 1) 
seen_any['perc'] = (seen_any['seen_count'] / 936) * 100 # The denominator 936 is the number of respondents who have seen ANY film
seen_any['perc'] = round(seen_any['perc'])
seen_any_df = seen_any[['seen_count', 'perc']]

In [31]:

seen_any_df # New DataFrame to be used for the visualization.

Out[31]:

	seen_count	perc
seen_1	673	72.0
seen_2	571	61.0
seen_3	550	59.0
seen_4	607	65.0
seen_5	758	81.0
seen_6	738	79.0
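
As a cross-check, the same kind of summary can be sketched without transposing, since DataFrame.sum() and DataFrame.mean() work column by column and treat True as 1. The toy DataFrame and the name summary below are made up for illustration.

```python
import pandas as pd

# Hypothetical checkbox columns for four respondents
toy = pd.DataFrame({
    'seen_1': [True, True, False, True],
    'seen_2': [True, False, False, True],
})

summary = pd.DataFrame({
    'seen_count': toy.sum(),              # number of True values per film
    'perc': (toy.mean() * 100).round(),   # share of respondents, in percent
})
print(summary)
```

Using mean() avoids hard-coding the number of respondents, at the cost of hiding the denominator.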

Now that we have computed, for each film, how many of the 936 respondents have seen it, we can visualize these values through a horizontal bar graph. For a more intuitive approach, we will plot our values in terms of percentage.

We can check the available preset graph styles of matplotlib by importing its style module and accessing style.available.

In [32]:

import matplotlib.pyplot as plt
import matplotlib.style as style

%matplotlib inline

style.available

Out[32]:

['Solarize_Light2',
 '_classic_test_patch',
 'bmh',
 'classic',
 'dark_background',
 'fast',
 'fivethirtyeight',
 'ggplot',
 'grayscale',
 'seaborn',
 'seaborn-bright',
 'seaborn-colorblind',
 'seaborn-dark',
 'seaborn-dark-palette',
 'seaborn-darkgrid',
 'seaborn-deep',
 'seaborn-muted',
 'seaborn-notebook',
 'seaborn-paper',
 'seaborn-pastel',
 'seaborn-poster',
 'seaborn-talk',
 'seaborn-ticks',
 'seaborn-white',
 'seaborn-whitegrid',
 'tableau-colorblind10']

We will use the fivethirtyeight style to imitate FiveThirtyEight's graphs. The code block below is used to construct our bar graph.

In [33]:

%config InlineBackend.figure_format ='retina' # For higher resolution graphs

# Set essential graph attributes
style.use('fivethirtyeight') # This style will apply to all our succeeding graphs

seen_graph = seen_any_df['perc'].plot.barh(figsize = (7,5), legend = False, color = '#1F77B4', width = 0.7)
seen_graph.tick_params(axis = 'both', which = 'both', labelsize = 16, labelbottom = False)
seen_graph.grid(False)
seen_graph.set_yticklabels(subtitles[0], alpha = 0.8)
seen_graph.set_xlim(left = -2) # Gives space between bars and the y-tick labels
seen_graph.get_children()[4].set_color('#D62728') # Sets a different color for 'The Empire Strikes Back'
seen_graph.invert_yaxis() # Orders the films from top (part 1) to bottom (part 6)

# Label each bar's value in percent 
seen_graph.text(x = 74.5, y = 0.1, s = '72%', alpha = 0.8)
seen_graph.text(x = 64, y = 1.1, s = '61', alpha = 0.8)
seen_graph.text(x = 62, y = 2.1, s = '59', alpha = 0.8)
seen_graph.text(x = 68, y = 3.1, s = '65', alpha = 0.8)
seen_graph.text(x = 84.2, y = 4.1, s = '81  ', alpha = 0.8)
seen_graph.text(x = 81.5, y = 5.1, s = '79', alpha = 0.8)

# Place the heading and subheading
seen_graph.text(x = -43.5, y = -1.6, s = ' The Most Viewed Star Wars Movie', fontsize = 27, weight = 'bold', alpha = 0.75)
seen_graph.text(x = -43, y = -1, s = ' Out of 936 respondents who have seen ANY of the films', fontsize = 18)

Out[33]:

Text(-43, -1, ' Out of 936 respondents who have seen ANY of the films')
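
The bar labels above are positioned with hand-picked coordinates. A possible alternative, sketched below under the assumption that a fixed offset of 1.5 units looks acceptable, is to read each bar's width and position from the rectangles returned by barh. The names fig, ax, bars and bar_texts are made up for this example.

```python
import matplotlib.pyplot as plt

subtitle_labels = ['The Phantom Menace', 'Attack of the Clones', 'Revenge of the Sith',
                   'A New Hope', 'The Empire Strikes Back', 'Return of the Jedi']
percentages = [72, 61, 59, 65, 81, 79]  # perc values from the table above

fig, ax = plt.subplots(figsize=(7, 5))
bars = ax.barh(subtitle_labels, percentages, color='#1F77B4')
ax.invert_yaxis()  # first film at the top

bar_texts = []
for i, (bar, value) in enumerate(zip(bars, percentages)):
    suffix = '%' if i == 0 else ''  # only the top bar carries the % symbol
    label = ax.text(bar.get_width() + 1.5,                 # just past the end of the bar
                    bar.get_y() + bar.get_height() / 2,    # vertical middle of the bar
                    f'{value}{suffix}', va='center', alpha=0.8)
    bar_texts.append(label.get_text())
```

With this approach the labels follow the data if the percentages change.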

From the graph above, we can see that The Empire Strikes Back was the most viewed movie at 81%, followed by Return of the Jedi at 79%.

Highest Ranked Movie¶

The next thing we want to find out is which of the films the respondents consider the best. For this, we will have to be more selective with our respondents. A good criterion would be to select only those who have seen ALL of the films, as they are better qualified to rank the films from best to worst.

To determine if a respondent has seen ALL six films, we will create a new column, seen_all. With an axis value of 1, the DataFrame.all() method will be used on the checkbox (seen) columns to check whether all cells in a particular row contain True. If this is the case, True is stored in seen_all. Otherwise, False is stored.

In [34]:

star_wars['seen_all'] = star_wars.iloc[:,3:9].all(axis = 1)
star_wars['seen_all'].sum()

Out[34]:

471

In [35]:

star_wars.iloc[:,[3,4,5,6,7,8,38]].head()

Out[35]:

	seen_1	seen_2	seen_3	seen_4	seen_5	seen_6	seen_all
1	True	True	True	True	True	True	True
2	False	False	False	False	False	False	False
3	True	True	True	False	False	False	False
4	True	True	True	True	True	True	True
5	True	True	True	True	True	True	True
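
The row-wise behavior of DataFrame.all() can be sketched on a toy DataFrame with three hypothetical respondents (the data and the name toy are illustrative only).

```python
import pandas as pd

# One row per respondent, one checkbox column per film (three films for brevity)
toy = pd.DataFrame({
    'seen_1': [True, False, True],
    'seen_2': [True, False, True],
    'seen_3': [True, False, False],
})

# axis=1 evaluates each row: True only if every checkbox in that row is True
toy['seen_all'] = toy[['seen_1', 'seen_2', 'seen_3']].all(axis=1)
print(toy['seen_all'].tolist())
print(toy['seen_all'].sum())  # number of respondents who saw every film
```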

Above, we see that there are 471 respondents who have seen ALL films.

As we have observed, the films were ranked on a scale of 1-6, with 1 being the highest. To determine the highest ranked film, we will count how many times each film was ranked number 1 and compare the results. Just as with our previous query, we will transpose the DataFrame and create a new column to contain the results. This time, we will use our more selective criteria for the respondents.

In [36]:

ranks_seen_all = star_wars.loc[star_wars['seen_all'] == True, star_wars.columns[9:15]].T
ranks_seen_all

Out[36]:

	1	4	5	6	7	8	9	13	14	15	16	19	20	21	23	24	25	27	28	29	30	31	32	33	36	...	1115	1120	1122	1124	1126	1128	1129	1130	1132	1135	1143	1148	1150	1153	1155	1157	1166	1167	1172	1174	1176	1181	1182	1183	1185
ranking_1	3.0	5.0	5.0	1.0	6.0	4.0	5.0	3.0	4.0	4.0	4.0	6.0	6.0	6.0	6.0	6.0	1.0	4.0	6.0	6.0	4.0	4.0	6.0	6.0	6.0	...	6.0	3.0	1.0	1.0	1.0	3.0	4.0	5.0	4.0	6.0	4.0	6.0	3.0	6.0	6.0	3.0	5.0	6.0	6.0	5.0	1.0	3.0	5.0	4.0	4.0
ranking_2	2.0	6.0	4.0	4.0	5.0	5.0	4.0	4.0	5.0	2.0	6.0	5.0	5.0	5.0	5.0	1.0	2.0	6.0	5.0	5.0	6.0	3.0	5.0	1.0	3.0	...	5.0	2.0	4.0	5.0	5.0	4.0	5.0	6.0	5.0	4.0	5.0	4.0	2.0	5.0	4.0	4.0	6.0	5.0	2.0	3.0	3.0	4.0	4.0	5.0	3.0
ranking_3	1.0	1.0	6.0	3.0	4.0	6.0	6.0	5.0	6.0	5.0	5.0	2.0	1.0	4.0	4.0	4.0	6.0	5.0	4.0	4.0	5.0	5.0	4.0	2.0	4.0	...	3.0	1.0	5.0	6.0	3.0	5.0	6.0	4.0	6.0	5.0	6.0	5.0	1.0	4.0	5.0	5.0	4.0	1.0	1.0	6.0	4.0	5.0	6.0	6.0	6.0
ranking_4	4.0	2.0	2.0	6.0	3.0	3.0	2.0	6.0	2.0	3.0	3.0	3.0	4.0	1.0	3.0	2.0	3.0	2.0	2.0	2.0	2.0	6.0	2.0	3.0	1.0	...	4.0	5.0	3.0	2.0	6.0	6.0	2.0	1.0	3.0	2.0	1.0	3.0	5.0	1.0	1.0	6.0	3.0	4.0	5.0	4.0	6.0	2.0	3.0	2.0	5.0
ranking_5	5.0	4.0	1.0	5.0	1.0	2.0	1.0	1.0	3.0	1.0	1.0	1.0	3.0	2.0	1.0	3.0	4.0	3.0	3.0	1.0	1.0	2.0	1.0	4.0	2.0	...	1.0	6.0	2.0	4.0	2.0	1.0	3.0	2.0	1.0	1.0	2.0	1.0	4.0	2.0	2.0	2.0	1.0	3.0	3.0	2.0	2.0	1.0	2.0	3.0	2.0
ranking_6	6.0	3.0	3.0	2.0	2.0	1.0	3.0	2.0	1.0	6.0	2.0	4.0	2.0	3.0	2.0	5.0	5.0	1.0	1.0	3.0	3.0	1.0	3.0	5.0	5.0	...	2.0	4.0	6.0	3.0	4.0	2.0	1.0	3.0	2.0	3.0	3.0	2.0	6.0	3.0	3.0	1.0	2.0	2.0	4.0	1.0	5.0	6.0	1.0	1.0	1.0

6 rows × 471 columns

In ranks_seen_all, we will create a new column one_count to contain the number of times each film was ranked first. Since each film is ranked uniquely by each respondent, we can simply divide each one_count value by the column's sum (equal to 471) to determine its percentage. The results are assigned to the perc column.

In [37]:

ranks_seen_all['one_count'] = (ranks_seen_all == 1).sum(axis = 1)
ranks_seen_all['perc'] = (ranks_seen_all['one_count'] / ranks_seen_all['one_count'].sum()) * 100 
ranks_seen_all['perc'] = round(ranks_seen_all['perc'])
ranks_one_df = ranks_seen_all[['one_count', 'perc']]

In [38]:

ranks_one_df # New DataFrame to be used for the visualization.

Out[38]:

	one_count	perc
ranking_1	47	10.0
ranking_2	18	4.0
ranking_3	27	6.0
ranking_4	128	27.0
ranking_5	169	36.0
ranking_6	82	17.0
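
The percentages can be verified with plain arithmetic. The sketch below copies the one_count values from the table above into a dictionary (the names one_counts and percentages are made up for this check).

```python
# First-place counts taken from the table above
one_counts = {
    'ranking_1': 47,
    'ranking_2': 18,
    'ranking_3': 27,
    'ranking_4': 128,
    'ranking_5': 169,
    'ranking_6': 82,
}

# Every respondent gives exactly one first place, so the counts add up to 471
total = sum(one_counts.values())

percentages = {film: round(count / total * 100) for film, count in one_counts.items()}
print(total)
print(percentages)
```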

With the ranks_one_df DataFrame ready, we use the code block below to visualize our findings.

In [39]:

# Set essential graph attributes
ranks_graph = ranks_one_df['perc'].plot.barh(figsize = (7,5), legend = False, color = '#1F77B4', width = 0.7)
ranks_graph.tick_params(axis = 'both', which = 'both', labelsize = 16, labelbottom = False)
ranks_graph.grid(False)
ranks_graph.set_yticklabels(subtitles[0], alpha = 0.8)
ranks_graph.set_xlim(left = -1) # Gives space between bars and the y-tick labels
ranks_graph.get_children()[4].set_color('#D62728') # Sets a different color for 'The Empire Strikes Back'
ranks_graph.invert_yaxis() # Orders the films from top (part 1) to bottom (part 6)

# Label each bar's value in percent 
ranks_graph.text(x = 11, y = 0.1, s = '10%', alpha = 0.8)
ranks_graph.text(x = 5, y = 1.1, s = '4', alpha = 0.8)
ranks_graph.text(x = 7, y = 2.1, s = '6', alpha = 0.8)
ranks_graph.text(x = 28.2, y = 3.1, s = '27', alpha = 0.8)
ranks_graph.text(x = 37, y = 4.1, s = '36  ', alpha = 0.8)
ranks_graph.text(x = 18.4, y = 5.1, s = '17', alpha = 0.8)

# Place the heading and subheading
ranks_graph.text(x = -19.6, y = -1.5, s = ' The Best Star Wars Movie', fontsize = 27, weight = 'bold', alpha = 0.75)
ranks_graph.text(x = -19.2, y = -0.9, s = ' According to 471 respondents who have seen ALL films', fontsize = 18)

Out[39]:

Text(-19.2, -0.9, ' According to 471 respondents who have seen ALL films')

We can see that The Empire Strikes Back was the highest ranked at 36%, followed by A New Hope at 27%.

Even though The Empire Strikes Back was both the most viewed and the highest ranked of the six, viewership and ranking do not necessarily go hand in hand, as the other five films fall in a different order in each graph. For example, Revenge of the Sith was the least viewed (59%) but places 5th in the graph above, ahead of Attack of the Clones.

Character Ratings¶

Finally, we move on to how the respondents felt about a number of the franchise's characters in terms of favorability. We will select the answers of those respondents who have seen ANY of the films (936 respondents).

Below, we filter the dataset according to this criterion, select the character columns, and then transpose the filtered DataFrame. We assign the result to all_chars.

In [40]:

all_chars = star_wars.loc[star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] == True , star_wars.columns[15:29]].T
all_chars

Out[40]:

	1	3	4	5	6	7	8	9	10	11	13	14	15	16	17	18	19	20	21	22	23	24	25	27	28	...	1156	1157	1159	1161	1162	1163	1164	1165	1166	1167	1168	1170	1172	1173	1174	1175	1176	1177	1178	1180	1181	1182	1183	1185	1186
Han Solo	1	1	1	1	1	1	1	1	3	4	1	1	1	1	3	1	1	1	1	3	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	3	1	1	1	1	1	1	1
Luke Skywalker	1	1	1	1	1	1	1	2	1	4	1	1	1	1	1	1	1	1	1	3	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	3	3	1	1	1	1	1	3	1
Princess Leia Organa	1	1	1	1	1	1	1	1	1	4	1	1	1	1	2	1	1	1	1	3	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	3	1	1	1	1	1	1	1
Anakin Skywalker	1	1	1	2	1	1	3	1	1	4	1	1	1	2	3	1	1	1	1	3	3	2	1	1	1	...	1	1	3	1	1	2	3	1	1	1	1	3	1	1	1	1	1	3	1	3	1	1	2	1	2
Obi Wan Kenobi	1	1	1	1	1	1	1	1	1	4	1	1	1	1	2	1	1	1	1	3	1	1	1	1	1	...	1	1	1	1	1	4	1	1	1	1	1	1	1	1	1	1	1	3	1	1	1	1	1	1	1
Emperor Palpatine	1	4	1	2	3	1	2	1	2	4	1	4	2	1	1	1	1	2	3	3	2	1	1	1	2	...	4	1	3	1	1	1	4	1	4	2	4	1	1	3	1	4	3	3	3	3	1	1	3	3	2
Darth Vader	1	4	1	1	1	1	2	1	1	4	1	2	2	1	2	1	1	1	1	3	1	1	1	1	2	...	1	2	1	1	2	2	2	2	1	1	1	1	1	2	1	1	1	3	3	2	1	1	2	1	1
Lando Calrissian	4	4	1	3	3	1	3	1	2	4	3	4	2	1	3	3	1	2	3	3	1	3	1	1	1	...	1	3	3	1	3	3	1	1	1	4	4	1	2	3	4	4	3	3	4	3	1	1	1	1	2
Boba Fett	4	4	2	1	1	1	1	1	2	4	1	3	2	1	1	1	1	1	1	3	2	3	1	1	2	...	1	1	3	1	3	2	2	3	1	1	4	1	2	3	1	4	1	3	4	3	1	1	4	1	4
C-3P0	1	4	1	1	1	1	1	3	1	4	1	1	1	2	3	1	1	1	1	3	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	3	1	1	1	1	1	1	1
R2 D2	1	4	1	1	1	1	1	1	1	4	1	1	1	3	2	1	1	1	1	3	1	1	1	1	1	...	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	3	1	1	1	1	1	1	1
Jar Jar Binks	1	4	1	2	1	2	2	2	1	4	1	4	1	2	3	1	2	2	2	3	2	1	1	2	1	...	1	1	3	1	2	3	1	2	4	2	4	2	3	3	1	4	3	3	1	3	1	1	2	1	2
Padme Amidala	1	4	1	1	3	1	2	2	2	4	1	1	1	2	2	1	2	1	3	3	3	3	1	1	1	...	4	1	3	1	3	1	1	2	1	1	3	2	3	3	1	4	3	3	4	3	1	1	2	1	3
Yoda	1	4	1	1	1	1	1	1	1	4	1	1	1	1	2	1	1	1	1	3	1	1	1	1	1	...	1	1	1	1	3	1	1	3	1	1	1	1	2	1	1	1	1	3	1	1	1	1	1	1	2

14 rows × 936 columns

Recall that we made the following divisions for the ratings, numbered by the code assigned to each:

1. Favorable
2. Unfavorable
3. Neutral
4. Unfamiliar

For each character, we will compute the percentage of respondents who fall under each division. To do this, we will have to create eight new columns. The first four will contain the total count of each favorability rating while the last four will contain their corresponding percentages.

In [41]:

favorability = ['Favorable', 'Unfavorable', 'Neutral', 'Unfamiliar']
responses = all_chars.iloc[:, :936].copy() # Respondent columns only, so the new count columns are never compared

# Create the first four columns to contain each rating's total count 
for n, rating in enumerate(favorability, start = 1):
    all_chars[rating] = (responses == n).sum(axis = 1) # Counts the cells equal to the rating code n
    
# Create the last four columns for the corresponding percentages 
for rating in favorability:
    all_chars[rating + '_perc'] = round((all_chars[rating] / 936) * 100) # Out of the 936 respondents
    all_chars[rating + '_perc'] = all_chars[rating + '_perc'].astype(int)

char_df = all_chars.iloc[:,936:]
char_df = char_df.sort_values('Favorable', ascending = False) # Order the DataFrame by the Favorable rating

In [42]:

char_df # New DataFrame to be used for the visualization.

Out[42]:

	Favorable	Unfavorable	Neutral	Unfamiliar	Favorable_perc	Unfavorable_perc	Neutral_perc	Unfamiliar_perc
Luke Skywalker	771	16	38	111	82	2	4	12
Han Solo	761	9	44	122	81	1	5	13
Princess Leia Organa	757	18	48	113	81	2	5	12
Obi Wan Kenobi	750	15	43	128	80	2	5	14
Yoda	749	16	51	120	80	2	5	13
R2 D2	747	16	57	116	80	2	6	12
C-3P0	703	30	79	124	75	3	8	13
Anakin Skywalker	514	122	135	165	55	13	14	18
Darth Vader	481	251	84	120	51	27	9	13
Lando Calrissian	365	71	236	264	39	8	25	28
Padme Amidala	351	92	207	286	38	10	22	31
Boba Fett	291	141	248	256	31	15	26	27
Emperor Palpatine	253	192	213	278	27	21	23	30
Jar Jar Binks	242	306	164	224	26	33	18	24
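
The counting step can be sketched on un-transposed toy data, where each column is a character and each row a respondent. This is an illustrative alternative, not the notebook's code; the names toy, labels, counts and perc are assumptions.

```python
import pandas as pd

# Hypothetical coded answers from four respondents
toy = pd.DataFrame({
    'Han Solo': [1, 1, 3, 4],
    'Jar Jar Binks': [2, 1, 2, 4],
})

labels = {1: 'Favorable', 2: 'Unfavorable', 3: 'Neutral', 4: 'Unfamiliar'}

# For each code, count the matching cells per character (column-wise sum)
counts = pd.DataFrame({name: (toy == code).sum() for code, name in labels.items()})

# Convert counts to whole-number percentages of all respondents
perc = (counts / len(toy) * 100).round().astype(int)
print(counts)
print(perc)
```

Keeping counts and perc as separate DataFrames means the rating codes are never compared against previously computed totals.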

To visualize our table, we will use the percentage columns to place four horizontal bar graphs beside each other. This will make it easy for us to compare the four favorability ratings for each character.

Below, we take the names of each character and store them in char_names.

In [43]:

char_names = char_df.index
char_names

Out[43]:

Index(['Luke Skywalker', 'Han Solo', 'Princess Leia Organa', 'Obi Wan Kenobi',
       'Yoda', 'R2 D2', 'C-3P0', 'Anakin Skywalker', 'Darth Vader',
       'Lando Calrissian', 'Padme Amidala', 'Boba Fett', 'Emperor Palpatine',
       'Jar Jar Binks'],
      dtype='object')

The code block below allows us to construct our desired visualization. Please refer to the comments to better understand how the combination of graphs was constructed.

In [44]:

%config InlineBackend.figure_format = 'retina' # For higher resolution graphs

favorability = ['Favorable', 'Unfavorable', 'Neutral', 'Unfamiliar']
clrs = ['#2CA02C', '#D62728', '#1F77B4', 'grey'] # Colors to be used for each bar graph
favorability_perc = ['Favorable_perc', 'Unfavorable_perc', 'Neutral_perc', 'Unfamiliar_perc'] # Column names to extract percent values

char = plt.figure(figsize = (9, 8)) 

#Set essential attributes for each bar graph
f = 0 # For favorability list order
s = 0 # For the subplot number
for rating in favorability_perc:
    s += 1
    ax = char.add_subplot(1, 4, s)
    ax.barh(char_names, char_df[rating], color = clrs[f])
    ax.text(x = 0.2, y = -1, s = favorability[f], weight = 'bold', size = 18, alpha = 0.7)
    ax.grid(False)
    ax.set_xticklabels([])
    ax.set_xlim((-12,90)) # Gives space between bars and the y-tick labels
    ax.set_yticklabels(char_names, alpha = 0.8)
    ax.invert_yaxis()
    f += 1
    
    # Remove y-tick labels for the three other graphs
    if s != 1:
        ax.set_yticklabels([])
    
    # Bar labels in percent
    bar_labels = char_df[rating].tolist()
    for y_loc, label in enumerate(bar_labels): # y_loc is the y-location of the label
        # Places the % symbol for only the top bar label
        text = str(label) + '%' if y_loc == 0 else str(label)
        ax.text(x = label + 5, y = 0.15 + y_loc,  s = text, alpha = 0.8, size = 12)

# Places the heading and subheading
char.axes[0].text(x = -140, y = -2.7, s = ' Favorite Characters from the films', fontsize = 27, weight = 'bold', alpha = 0.75)
char.axes[0].text(x = -137, y = -1.9, s = ' Out of 936 respondents who have seen ANY of the films', fontsize = 18)

Out[44]:

Text(-137, -1.9, ' Out of 936 respondents who have seen ANY of the films')

Luke Skywalker was the most favored character at 82%, followed by Han Solo and Princess Leia, both at 81%. At the opposite end, Jar Jar Binks was viewed the most unfavorably at 33%. A further 24% of the respondents were unfamiliar with him, which left him with the lowest favorable share (26%).

Conclusion¶

From the results shown by the graphs we generated, we saw that the fifth installment of the film franchise, The Empire Strikes Back, ranked at the top for both most viewed and best movie. We also saw how the characters were rated in terms of favorability, with Luke Skywalker, the main protagonist of the original Star Wars trilogy, being the most favored.