The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which you can download from their GitHub repository.
For this project, you'll be cleaning and exploring the data set in a Jupyter notebook. To see a sample notebook containing all of the answers, visit the project's GitHub repository.
We need to specify an encoding because the data set contains some characters that can't be decoded with the default UTF-8 encoding. You can read more about character encodings on developer Joel Spolsky's blog.
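As a rough illustration of why the encoding matters, the sketch below encodes a made-up sample string (not taken from the survey file) as ISO-8859-1 and then tries to decode the resulting bytes both ways. The variable names are hypothetical.

```python
# Hypothetical sample text; the real file's contents are not reproduced here.
raw = "Café".encode("ISO-8859-1")  # the accented letter becomes the single byte 0xE9

# Decoding those bytes as UTF-8 fails, because 0xE9 alone is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the bytes were written in recovers the text.
decoded = raw.decode("ISO-8859-1")

print(utf8_ok, decoded)
```

Passing encoding="ISO-8859-1" to pandas.read_csv() asks pandas to decode the file the second way.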
The data has several columns, including:
There are several other columns containing answers to questions about the Star Wars movies. For some questions, the respondent had to check one or more boxes. This type of data is difficult to represent in columnar format. As a result, this data set needs a lot of cleaning.
#importing the modules
import pandas as pd
import numpy as np
#loading the file into a DataFrame
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")
#displaying the first 10 rows
print(star_wars.head(10))
#inspecting column names
columns = star_wars.columns
#inspecting for null values
star_wars.isnull()
#removing rows without a RespondentID
star_wars = star_wars[star_wars['RespondentID'].notnull()]
#counting the remaining null values per column
star_wars.isnull().sum()
Take a look at the next two columns, which are:
Both represent Yes/No questions. They can also be NaN where a respondent chooses not to answer a question. We can use the pandas.Series.value_counts() method on a series to see all of the unique values in a column, along with the total number of times each value appears.
Both columns are currently string types, because the main values they contain are Yes and No. We can make the data a bit easier to analyze down the road by converting each column to a Boolean having only the values True, False, and NaN. Booleans are easier to work with because we can select the rows that are True or False without having to do a string comparison.
# inspecting Unique Column Values
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts()
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts()
#defining a dictionary to use with the map function
yes_no = {"Yes": True, "No": False}
#applying the map function with the dictionary to both columns
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(yes_no)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(yes_no)
#Looking for changes
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts()
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts()
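To see the same conversion on a small scale, here is a minimal sketch using a made-up Series (the values and variable names are assumptions, not survey data). It shows that value_counts() can include NaN when dropna=False is passed, that map() leaves unmatched values such as NaN as NaN, and that the resulting column can be used to select rows without a string comparison.

```python
import numpy as np
import pandas as pd

# Made-up answers standing in for one Yes/No survey column.
answers = pd.Series(["Yes", "No", np.nan, "Yes"])

# dropna=False keeps NaN in the tally; by default it is left out.
counts = answers.value_counts(dropna=False)

# Values that are not keys in the dictionary (here, NaN) stay NaN.
yes_no = {"Yes": True, "No": False}
as_bool = answers.map(yes_no)

# Select the rows that answered Yes; NaN rows compare as False and drop out.
said_yes = answers[as_bool.eq(True)]

print(counts)
print(as_bool)
print(said_yes)
```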
The next six columns represent a single checkbox question. The respondent checked off a series of boxes in response to the question, Which of the following Star Wars films have you seen? Please select all that apply.
The columns for this question are:
For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw the movie. If the value is NaN, the respondent either didn't answer or didn't see the movie. We'll assume that they didn't see the movie.
We'll need to convert each of these columns to a Boolean, then rename the column something more intuitive. We can convert the values the same way we did earlier, except that we'll need to include the movie title and NaN in the mapping dictionary.
#defining dictionary for conversion
true_false = {
    "Star Wars: Episode I The Phantom Menace": True,
    "Star Wars: Episode II Attack of the Clones": True,
    "Star Wars: Episode III Revenge of the Sith": True,
    "Star Wars: Episode IV A New Hope": True,
    "Star Wars: Episode V The Empire Strikes Back": True,
    "Star Wars: Episode VI Return of the Jedi": True,
    np.nan: False,  # np.NaN was removed in NumPy 2.0; np.nan works in all versions
}
#applying mapping function to each of the six checkbox columns
for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(true_false)
#renaming columns (a single rename call covers all six, so no loop is needed)
star_wars = star_wars.rename(columns={
    "Which of the following Star Wars films have you seen? Please select all that apply.": "seen_1",
    "Unnamed: 4": "seen_2",
    "Unnamed: 5": "seen_3",
    "Unnamed: 6": "seen_4",
    "Unnamed: 7": "seen_5",
    "Unnamed: 8": "seen_6",
})
star_wars.head()
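Because any non-null value in these columns means the respondent saw the movie, a different route to the same Booleans is DataFrame.notna(). This is an alternative sketch, not the mapping approach used above, and the small frame below is made-up data with hypothetical contents.

```python
import numpy as np
import pandas as pd

# Made-up checkbox data: a title means "seen", NaN means "not seen or no answer".
seen = pd.DataFrame({
    "seen_1": ["Star Wars: Episode I The Phantom Menace", np.nan,
               "Star Wars: Episode I The Phantom Menace"],
    "seen_2": [np.nan, np.nan, "Star Wars: Episode II Attack of the Clones"],
})

# notna() is True wherever a title is present and False wherever the cell is NaN.
seen_bool = seen.notna()

# Summing Booleans counts the True values, i.e. the viewers per movie.
viewers = seen_bool.sum()

print(seen_bool)
print(viewers)
```

The trade-off is that notna() would also mark an unexpected string as seen, whereas the explicit dictionary turns anything it does not recognize into NaN, which makes stray values easier to spot.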
The next six columns ask the respondent to rank the Star Wars movies from most favorite to least favorite. 1 means the film was the respondent's favorite, and 6 means it was their least favorite. Each of the following columns can contain the value 1, 2, 3, 4, 5, 6, or NaN:
# converting columns to float type
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
star_wars[star_wars.columns[9:15]].head()
#Renaming columns
star_wars = star_wars.rename(columns={"Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.":
"ranking_1",
"Unnamed: 10": "ranking_2",
"Unnamed: 11": "ranking_3",
"Unnamed: 12": "ranking_4",
"Unnamed: 13": "ranking_5",
"Unnamed: 14": "ranking_6"
})
star_wars[star_wars.columns[9:15]].head()
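As a small illustration of what the float conversion does, the sketch below applies astype(float) to a made-up Series of rankings stored as strings. It assumes the values are numeric strings or NaN; the data here is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rankings stored as strings, with one missing answer.
raw_rank = pd.Series(["3", "1", np.nan])

# astype(float) parses the numeric strings and keeps NaN as NaN.
rank = raw_rank.astype(float)

print(rank.dtype)
print(rank)
```

Float is used rather than an integer type because NaN is itself a floating-point value.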
Now that we've cleaned up the ranking columns, we can find the highest-ranked movie more quickly. To do this, take the mean of each of the ranking columns using the pandas.DataFrame.mean() method.
#importing visualization library
%matplotlib inline
import matplotlib.pyplot as plt
ranking_mean = star_wars[star_wars.columns[9:15]].mean()
plt.bar(range(6), ranking_mean)
It looks like the "original" movies are ranked much more highly than the newer ones. Because 1 is the best possible rank, a lower mean ranking indicates a more popular movie.
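The comparison can also be made numerically instead of by reading bar heights. The following is a minimal sketch on made-up rankings (the numbers are assumptions, not survey results): mean() skips NaN by default, and idxmin() returns the column with the lowest mean, which is the best rank.

```python
import numpy as np
import pandas as pd

# Made-up rankings for two movies from three respondents.
rankings = pd.DataFrame({
    "ranking_1": [3.0, 2.0, np.nan],
    "ranking_2": [1.0, 1.0, 2.0],
})

# Column means; the NaN in ranking_1 is ignored rather than counted as zero.
means = rankings.mean()

# The lowest mean corresponds to the most preferred movie.
best = means.idxmin()

print(means)
print(best)
```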
plt.bar(range(6), star_wars[star_wars.columns[3:9]].sum())
It appears that the original movies were seen by more respondents than the newer movies. This reinforces what we saw in the rankings, where the earlier movies seem to be more popular.
We know which movies the survey population as a whole has ranked the highest. Now let's examine how certain segments of the survey population responded. There are several columns that segment our data into two groups. Here are a few examples:
# splitting the data into two segments by gender
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]
# plotting the segments
plt.bar(range(6), males[males.columns[9:15]].mean())
plt.show()
plt.bar(range(6), females[females.columns[9:15]].mean())
plt.show()
# finding the most-seen movies by gender
plt.bar(range(6), males[males.columns[3:9]].sum())
plt.show()
plt.bar(range(6), females[females.columns[3:9]].sum())
plt.show()
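The per-segment means above can also be computed in one step with groupby(). This is an alternative sketch on made-up data, not the approach used in the cells above; the frame and its values are hypothetical. Note that groupby() leaves out rows whose key is NaN by default, which matches the effect of filtering on "Male" and "Female" explicitly.

```python
import numpy as np
import pandas as pd

# Made-up respondents; the last one did not report a gender.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", np.nan],
    "ranking_1": [1.0, 3.0, 3.0, 2.0],
    "ranking_2": [2.0, 1.0, 1.0, 1.0],
})

# One row per gender, one column per ranking; the NaN-gender row is excluded.
mean_by_gender = df.groupby("Gender")[["ranking_1", "ranking_2"]].mean()

print(mean_by_gender)
```

Calling mean_by_gender.T.plot.bar() would then draw both segments on a single chart, which can make the comparison easier than two separate plots.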