#!/usr/bin/env python
# coding: utf-8

# # Star Wars Survey
#
# While waiting for *[Star Wars: The Force Awakens](https://en.wikipedia.org/wiki/Star_Wars:_The_Force_Awakens)* to come out, the team at [FiveThirtyEight](https://fivethirtyeight.com/) became interested in answering some questions about Star Wars fans. In particular, they wondered: **does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?**
#
# The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which you can download from their [GitHub repository](https://github.com/fivethirtyeight/data/tree/master/star-wars-survey).
#
# In this project, we cleaned and analyzed the data to find which movie is liked or seen the most, and which factors affect that.

# In[1]:


import pandas as pd

# reading in data
star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')
star_wars.head(10)


# In[2]:


star_wars.columns


# The data has several columns, including the following:
#
# * `RespondentID` — An anonymized ID for the respondent (person taking the survey)
# * `Gender` — the respondent's gender
# * `Age` — the respondent's age
# * `Household Income` — the respondent's income
# * `Education` — the respondent's education level
# * `Location (Census Region)` — the respondent's location
# * `Have you seen any of the 6 films in the Star Wars franchise?` — a Yes or No response
# * `Do you consider yourself to be a fan of the Star Wars film franchise?` — a Yes or No response
#
# There are several other columns containing answers to questions about the Star Wars movies. For some questions, the respondent had to check one or more boxes.

# ## Cleaning and Mapping Columns
#
# Let us take a look at the following 2 columns:
#
# * `Have you seen any of the 6 films in the Star Wars franchise?`
# * `Do you consider yourself to be a fan of the Star Wars film franchise?`

# In[3]:


saw_any_of_6 = star_wars['Have you seen any of the 6 films in the Star Wars franchise?']
saw_any_of_6.value_counts(dropna=False)


# In[4]:


fan_series = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']
fan_series.value_counts(dropna=False)


# Both represent `Yes/No` questions. There is also `NaN` where a respondent chose not to answer a question. We used the `pandas.Series.value_counts()` method to see all of the unique values in a column, along with the number of times each value appears.

# In[5]:


yes_no = {'Yes': True, 'No': False}
saw_any_of_6 = saw_any_of_6.map(yes_no)  # converting Yes, No to boolean
saw_any_of_6.value_counts(dropna=False)


# In[6]:


fan_series = fan_series.map(yes_no)
fan_series.value_counts(dropna=False)


# Both columns are originally string types, because the main values they contain are `Yes` and `No`. We made the data a bit easier to analyze by converting each series to a Boolean with only the values `True`, `False`, and `NaN`. Booleans are easier to work with because we can select the rows that are `True` or `False` without having to do a string comparison. Note that the converted values live in `saw_any_of_6` and `fan_series`; the corresponding columns in `star_wars` are left unchanged.
#
# We used the `pandas.Series.map()` method on series objects to perform the conversion.

# ## Cleaning and Mapping Checkbox Columns
#
# The next six columns represent a single checkbox question. The respondent checked off a series of boxes in response to the question, `Which of the following Star Wars films have you seen? Please select all that apply.`

# In[7]:


star_wars['Which of the following Star Wars films have you seen? Please select all that apply.'].value_counts(dropna=False)


# The columns for this checkbox question are:
#
# * `Which of the following Star Wars films have you seen? Please select all that apply.` — whether or not the respondent saw Star Wars: Episode I The Phantom Menace.
# * `Unnamed: 4` — whether or not the respondent saw Star Wars: Episode II Attack of the Clones.
# * `Unnamed: 5` — whether or not the respondent saw Star Wars: Episode III Revenge of the Sith.
# * `Unnamed: 6` — whether or not the respondent saw Star Wars: Episode IV A New Hope.
# * `Unnamed: 7` — whether or not the respondent saw Star Wars: Episode V The Empire Strikes Back.
# * `Unnamed: 8` — whether or not the respondent saw Star Wars: Episode VI Return of the Jedi.

# In[8]:


cols_before = star_wars.columns[3:9]  # execute only once
cols_before


# In[9]:


# renaming columns with index 3 to 8
for i in range(len(cols_before)):
    star_wars = star_wars.rename(columns={cols_before[i]: f'seen_{i+1}'})
star_wars.columns


# In[10]:


cols_after = star_wars.columns[3:9]
cols_after


# In each of the renamed columns, if the value in a cell is the name of the movie, the respondent saw the movie. If the value is `NaN`, the respondent either didn't answer or didn't see the movie; we assumed that they didn't see it. As in the previous step, we converted the data to a Boolean type, which makes the later analysis easier.

# In[11]:


def boolean_conv(series):
    # True where the cell holds a movie name, False where it is NaN
    new_series = series.isna()
    return ~new_series


# In[12]:


watched_data_before = star_wars.iloc[:, 3:9]  # execute only once
watched_data_before


# In[13]:


# converting data to boolean
watched_data_after = watched_data_before.apply(boolean_conv)
watched_data_after


# In[14]:


# assigning the boolean data back to main dataframe
star_wars.iloc[:, 3:9] = watched_data_after
star_wars


# ## Cleaning the ranking columns
#
# The next six columns ask the respondent to rank the Star Wars movies in order of preference. `1` means the film was the most favorite, and `6` means it was the least favorite.
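
# A minimal sketch of the type conversion this step may need, assuming the ranking columns were read as text rather than numbers. The `toy_rankings` frame and its values are made up for illustration and are not survey data; `astype(float)` keeps missing answers as `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for two ranking columns that were read as text
toy_rankings = pd.DataFrame({
    'ranking_1': ['3', '1', np.nan],
    'ranking_2': ['2', np.nan, '6'],
})

# Convert every column to float so that means can be computed later
toy_rankings = toy_rankings.astype(float)
print(toy_rankings.dtypes)
```

# If `info()` already reports a float dtype for these columns, no conversion is needed.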

# In[15]:


ranking_cols_before = star_wars.columns[9:15]  # execute only once
ranking_cols_before


# Each of the following columns can contain the value `1`, `2`, `3`, `4`, `5`, `6`, or `NaN`:
#
# * `Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.` - How much the respondent liked Star Wars: Episode I The Phantom Menace
# * `Unnamed: 10` — How much the respondent liked Star Wars: Episode II Attack of the Clones
# * `Unnamed: 11` — How much the respondent liked Star Wars: Episode III Revenge of the Sith
# * `Unnamed: 12` — How much the respondent liked Star Wars: Episode IV A New Hope
# * `Unnamed: 13` — How much the respondent liked Star Wars: Episode V The Empire Strikes Back
# * `Unnamed: 14` — How much the respondent liked Star Wars: Episode VI Return of the Jedi

# In[16]:


# renaming columns
for i in range(len(ranking_cols_before)):
    star_wars = star_wars.rename(columns={ranking_cols_before[i]: f'ranking_{i+1}'})
star_wars.columns


# In[17]:


ranking_cols_after = star_wars.columns[9:15]
ranking_cols_after


# In[18]:


# check column names and data type
star_wars[ranking_cols_after].info()


# ## Finding the Highest Ranked Movie
#
# Now that we cleaned up the ranking columns, finding the highest ranked movie is much easier.
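
# To make the reading of the averages concrete: because `1` is the best rank, the film with the lowest mean ranking is the most preferred. A small sketch on invented numbers (the `toy_ranks` frame is hypothetical, not survey data); `mean()` skips `NaN` by default.

```python
import numpy as np
import pandas as pd

# Hypothetical rankings from three respondents for three films
toy_ranks = pd.DataFrame({
    'ranking_1': [5.0, 6.0, np.nan],
    'ranking_4': [2.0, 1.0, 3.0],
    'ranking_5': [1.0, 2.0, 1.0],
})

toy_means = toy_ranks.mean()       # column-wise mean, NaN ignored
best_column = toy_means.idxmin()   # lowest mean = most preferred
print(toy_means.sort_values())
print(best_column)
```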

# In[19]:


ranking_data = star_wars[ranking_cols_after]
ranking_data


# In[20]:


mean_data = ranking_data.mean()  # calculating mean
mean_data.sort_values()


# In[21]:


get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt

mean_data.plot(kind='bar')


# Let's look at the year in which each of these movies was released, in order of ranking:
#
# * ranking_5: Star Wars: Episode V The Empire Strikes Back - 1980
# * ranking_6: Star Wars: Episode VI Return of the Jedi - 1983
# * ranking_4: Star Wars: Episode IV A New Hope - 1977
# * ranking_1: Star Wars: Episode I The Phantom Menace - 1999
# * ranking_2: Star Wars: Episode II Attack of the Clones - 2002
# * ranking_3: Star Wars: Episode III Revenge of the Sith - 2005
#
# It looks like the "original" movies are rated much more highly than the newer ones (a lower mean ranking means a better rating).

# ## Finding the Most Viewed Movie

# In[22]:


watched_data_after.sum().sort_values(ascending=False)


# In[23]:


watched_data_after.sum().plot(kind='bar')


# It appears that the original movies were seen by more respondents than the newer movies. This reinforces what we saw in the rankings, where the earlier movies seem to be more popular.

# ## Exploring the data by Binary Segments
#
# Now let's examine how certain segments of the survey population responded. There are several columns that segment our data into two groups. Here are a few examples:
#
# * `Do you consider yourself to be a fan of the Star Wars film franchise?` — `Yes` or `No` in `star_wars` (mapped to `True` or `False` in `fan_series`)
# * `Do you consider yourself to be a fan of the Star Trek franchise?` — `Yes` or `No`
# * `Gender` — `Male` or `Female`
#
# We can split a DataFrame into two groups based on a binary column by creating two subsets of its rows, one for each value of that column.
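
# The split described above can be sketched with boolean masks. The `toy` frame, its `fan` column, and its values are hypothetical and only illustrate the pattern; rows with a missing answer fall into neither subset.

```python
import numpy as np
import pandas as pd

# Hypothetical answers: a Yes/No column and one ranking column
toy = pd.DataFrame({
    'fan': ['Yes', 'No', 'Yes', np.nan],
    'ranking_5': [1.0, 3.0, 2.0, 4.0],
})

toy_fans = toy[toy['fan'] == 'Yes']      # rows where the answer is Yes
toy_non_fans = toy[toy['fan'] == 'No']   # rows where the answer is No

print(toy_fans['ranking_5'].mean())
print(toy_non_fans['ranking_5'].mean())
```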

# In[24]:


fan_col = 'Do you consider yourself to be a fan of the Star Wars film franchise?'
fans = star_wars[star_wars[fan_col] == 'Yes']     # star wars fans
non_fans = star_wars[star_wars[fan_col] == 'No']  # non star wars fans


# In[25]:


# comparison of fans & non star wars fans rating
plt.figure(figsize=(10, 3))
plt.subplot(1, 2, 1)
fans[ranking_cols_after].mean().plot(kind='bar')
plt.title('Fans')
plt.subplot(1, 2, 2)
non_fans[ranking_cols_after].mean().plot(kind='bar')
plt.title('Non Fans')


# In[26]:


# comparison of star wars fans and non fans on the number of people who watched the movie
plt.figure(figsize=(10, 3))
plt.subplot(1, 2, 1)
fans[cols_after].sum().plot(kind='bar')
plt.title('Fans')
plt.subplot(1, 2, 2)
non_fans[cols_after].sum().plot(kind='bar')
plt.title('Non Fans')


# In[27]:


non_fans[ranking_cols_after].mean().sort_values()


# In[28]:


fans[ranking_cols_after].mean().sort_values()


# In[29]:


non_fans[cols_after].sum().sort_values(ascending=False)


# In[30]:


fans[cols_after].sum().sort_values(ascending=False)


# ## Observations:
#
# * Fans and non-fans agree that Episode 5: The Empire Strikes Back is the most liked and most seen movie among the survey respondents.
# * Interestingly, the number of viewers among both fans and non-fans decreases in chronological order of the episodes' release dates (see cells 29, 30), but the rankings given by fans and non-fans (see cells 27, 28) are different for episodes 1, 2, 4 and 6 (see cells 25, 26 for graphs).
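
# The fan and non-fan figures compared in cells 27 to 30 can also be collected in one table with `groupby`. This is a sketch on invented data, not the approach used above: the `toy_survey` frame, its column values, and the `mean_rank` and `viewers` labels are hypothetical. Rows with a missing answer are dropped from the grouping by default.

```python
import numpy as np
import pandas as pd

# Hypothetical survey rows: fan answer, one ranking column, one seen column
toy_survey = pd.DataFrame({
    'fan': ['Yes', 'No', 'Yes', 'No', np.nan],
    'ranking_1': [4.0, 2.0, 6.0, 4.0, 1.0],
    'seen_1': [True, True, True, False, False],
})

# One row per segment: mean ranking and number of viewers
summary = toy_survey.groupby('fan').agg(
    mean_rank=('ranking_1', 'mean'),
    viewers=('seen_1', 'sum'),
)
print(summary)
```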