%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
return false;
}
FiveThirtyEight publishes some great data sets, and this is one of them. In July 2014, before the third Star Wars trilogy was released, they surveyed Americans to find out which of the first six movies was their favorite. Let's take a look at the results. Some light cleaning should make the data more usable!
# import libraries and csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
# explore the data frame
display(star_wars.head(10))
display(star_wars.columns)
display(star_wars.shape)
| RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe? | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3292879998 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3.0 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
1 | 3292879538 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
2 | 3292765271 | Yes | No | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1.0 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
3 | 3292763116 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5.0 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
4 | 3292731220 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5.0 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
5 | 3292719380 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 1.0 | ... | Very favorably | Han | Yes | No | Yes | Male | 18-29 | $25,000 - $49,999 | Bachelor degree | Middle Atlantic |
6 | 3292684787 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 6.0 | ... | Very favorably | Han | Yes | No | No | Male | 18-29 | NaN | High school degree | East North Central |
7 | 3292663732 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 4.0 | ... | Very favorably | Han | No | NaN | Yes | Male | 18-29 | NaN | High school degree | South Atlantic |
8 | 3292654043 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5.0 | ... | Somewhat favorably | Han | No | NaN | No | Male | 18-29 | $0 - $24,999 | Some college or Associate degree | South Atlantic |
9 | 3292640424 | Yes | No | NaN | Star Wars: Episode II Attack of the Clones | NaN | NaN | NaN | NaN | 1.0 | ... | Very favorably | I don't understand this question | No | NaN | No | Male | 18-29 | $25,000 - $49,999 | Some college or Associate degree | Pacific |
10 rows × 38 columns
Index(['RespondentID', 'Have you seen any of the 6 films in the Star Wars franchise?', 'Do you consider yourself to be a fan of the Star Wars film franchise?', 'Which of the following Star Wars films have you seen? Please select all that apply.', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Which character shot first?', 'Are you familiar with the Expanded Universe?', 'Do you consider yourself to be a fan of the Expanded Universe?', 'Do you consider yourself to be a fan of the Star Trek franchise?', 'Gender', 'Age', 'Household Income', 'Education', 'Location (Census Region)'], dtype='object')
(1186, 38)
# rename columns
new_columns = {"Have you seen any of the 6 films in the Star Wars franchise?":"Seen any of the first 6 Star Wars movies?",
"Do you consider yourself to be a fan of the Star Wars film franchise?":"Fan of the Star Wars franchise?",
"Which of the following Star Wars films have you seen? Please select all that apply." : "seen_ep_1",
"Unnamed: 4" : "seen_ep_2",
"Unnamed: 5" : "seen_ep_3",
"Unnamed: 6" : "seen_ep_4",
"Unnamed: 7" : "seen_ep_5",
"Unnamed: 8" : "seen_ep_6",
"Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.":"rank_ep_1",
"Unnamed: 10":"rank_ep_2",
"Unnamed: 11":"rank_ep_3",
"Unnamed: 12":"rank_ep_4",
"Unnamed: 13":"rank_ep_5",
"Unnamed: 14":"rank_ep_6"}
star_wars = star_wars.rename(columns = new_columns)
# value counts for columns 1:3
print(star_wars["Seen any of the first 6 Star Wars movies?"].value_counts(dropna=False))
print(star_wars["Fan of the Star Wars franchise?"].value_counts(dropna=False))
print(star_wars["seen_ep_1"].value_counts(dropna=False))
Yes    936
No     250
Name: Seen any of the first 6 Star Wars movies?, dtype: int64
Yes    552
NaN    350
No     284
Name: Fan of the Star Wars franchise?, dtype: int64
Star Wars: Episode I The Phantom Menace    673
NaN                                        513
Name: seen_ep_1, dtype: int64
# switch values to boolean for columns 1:2
yes_no_bool = {"Yes":True, "No":False}
star_wars["Seen any of the first 6 Star Wars movies?"] = star_wars["Seen any of the first 6 Star Wars movies?"].map(yes_no_bool)
star_wars["Fan of the Star Wars franchise?"] = star_wars["Fan of the Star Wars franchise?"].map(yes_no_bool)
# switch values to boolean for columns 3:9
watch_bool = {"Star Wars: Episode I The Phantom Menace" : True,
"Star Wars: Episode II Attack of the Clones" : True,
"Star Wars: Episode III Revenge of the Sith" : True,
"Star Wars: Episode IV A New Hope" : True,
"Star Wars: Episode V The Empire Strikes Back" : True,
"Star Wars: Episode VI Return of the Jedi" : True,
np.nan : False}
for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(watch_bool)
# value counts for columns 1:3
print(star_wars["Seen any of the first 6 Star Wars movies?"].value_counts(dropna=False))
print(star_wars["Fan of the Star Wars franchise?"].value_counts(dropna=False))
display(star_wars.iloc[:,3].value_counts(dropna=False))
True     936
False    250
Name: Seen any of the first 6 Star Wars movies?, dtype: int64
True     552
NaN      350
False    284
Name: Fan of the Star Wars franchise?, dtype: int64
True     673
False    513
Name: seen_ep_1, dtype: int64
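Each "seen" column holds the film title when the respondent selected it and NaN otherwise, so the same True/False conversion can also be sketched with `notna()`, without a lookup dictionary. The three-row frame below is made up for illustration and is not the survey data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample that mimics two of the "seen" columns
sample = pd.DataFrame({
    "seen_ep_1": ["Star Wars: Episode I The Phantom Menace", np.nan,
                  "Star Wars: Episode I The Phantom Menace"],
    "seen_ep_2": [np.nan, np.nan,
                  "Star Wars: Episode II Attack of the Clones"],
})

# Any non-missing entry means the film was selected
seen = sample.notna()

print(seen)
print(seen.sum())  # number of respondents who saw each film
```

Unlike the dictionary approach, this treats any non-missing value as "seen", which is only safe if the columns contain nothing but film titles and NaN.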
# convert rankings to float
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
# star_wars.columns[9] is only the column label (a str), which has no .dtype; check the column itself instead
star_wars.iloc[:,9].dtype
dtype('float64')
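A column label and the column it names are different objects, which matters when checking a dtype. The sketch below uses a made-up one-column frame to show the difference; the frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical frame with a single ranking column
df = pd.DataFrame({"rank_ep_1": [3.0, 1.0, 5.0]})

label = df.columns[0]    # a plain Python string
column = df.iloc[:, 0]   # the Series holding the data

print(type(label))             # the label is a str
print(hasattr(label, "dtype")) # a str has no dtype attribute
print(column.dtype)            # the Series reports the data's dtype
```

A slice such as `df.columns[0:1]` is still an Index, and an Index does have a `dtype`, but it describes the labels rather than the data in the columns.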
star_wars.iloc[:,9:15].head()
| rank_ep_1 | rank_ep_2 | rank_ep_3 | rank_ep_4 | rank_ep_5 | rank_ep_6 |
---|---|---|---|---|---|---|
0 | 3.0 | 2.0 | 1.0 | 4.0 | 5.0 | 6.0 |
1 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 |
3 | 5.0 | 6.0 | 1.0 | 2.0 | 4.0 | 3.0 |
4 | 5.0 | 4.0 | 6.0 | 2.0 | 1.0 | 3.0 |
star_wars.iloc[:,9].value_counts()
4.0    237
6.0    168
3.0    130
1.0    129
5.0    100
2.0     71
Name: rank_ep_1, dtype: int64
star_wars.iloc[:,14].value_counts()
2.0    232
3.0    220
1.0    146
6.0    145
4.0     57
5.0     36
Name: rank_ep_6, dtype: int64
# reverse the ranking scale so that 6 marks the favorite film and 1 the least favorite
star_wars.iloc[:,9:15] = star_wars.iloc[:,9:15].replace({1:6, 6:1, 2:5, 5:2, 3:4, 4:3})
star_wars.iloc[:,9].value_counts()
3.0    237
1.0    168
4.0    130
6.0    129
2.0    100
5.0     71
Name: rank_ep_1, dtype: int64
star_wars.iloc[:,14].value_counts()
5.0    232
4.0    220
6.0    146
1.0    145
3.0     57
2.0     36
Name: rank_ep_6, dtype: int64
star_wars.iloc[:,9:15].head()
| rank_ep_1 | rank_ep_2 | rank_ep_3 | rank_ep_4 | rank_ep_5 | rank_ep_6 |
---|---|---|---|---|---|---|
0 | 4.0 | 5.0 | 6.0 | 3.0 | 2.0 | 1.0 |
1 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 6.0 | 5.0 | 4.0 | 3.0 | 2.0 | 1.0 |
3 | 2.0 | 1.0 | 6.0 | 5.0 | 3.0 | 4.0 |
4 | 2.0 | 3.0 | 1.0 | 5.0 | 6.0 | 4.0 |
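Because the ranks run from 1 to 6, swapping 1 with 6, 2 with 5, and 3 with 4 is the same as computing `7 - rank`. The sketch below applies that arithmetic to a small frame built from the first three rows of `rank_ep_1` and `rank_ep_6` shown before the swap. It is an alternative way to get the same result, not the method used in the notebook.

```python
import numpy as np
import pandas as pd

# First three rows of two ranking columns, copied from the table before the swap
ranks = pd.DataFrame({
    "rank_ep_1": [3.0, np.nan, 1.0],
    "rank_ep_6": [6.0, np.nan, 6.0],
})

# On a 1-6 scale, 7 - rank reverses the order; NaN stays NaN
flipped = 7 - ranks

print(flipped)
```

The result matches the corresponding cells after the swap (4.0, NaN, 6.0 for `rank_ep_1` and 1.0, NaN, 1.0 for `rank_ep_6`).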