Cleaning Star Wars Survey

The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 total responses, which you can download from their GitHub repository.

For this project, you'll be cleaning and exploring the data set in a Jupyter notebook. To see a sample notebook containing all of the answers, visit the project's GitHub repository.

We need to specify an encoding because the file contains some bytes that are not valid in UTF-8, the encoding pandas assumes by default when reading a CSV file. You can read more about character encodings on developer Joel Spolsky's blog.
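To see why the encoding matters, here is a minimal sketch. The sample string is made up for illustration and is not taken from the survey file; the sketch only assumes that the file holds at least one byte sequence that is not valid UTF-8.

```python
# Hypothetical sample text, not taken from star_wars.csv.
text = "café"

# Encode the text as ISO-8859-1 (Latin-1): "é" becomes the single byte 0xE9.
raw = text.encode("ISO-8859-1")

# Decoding those bytes as UTF-8 fails, because 0xE9 announces a multi-byte
# sequence in UTF-8 and the continuation bytes are missing.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the bytes were written in recovers the text.
decoded = raw.decode("ISO-8859-1")
print(utf8_ok, decoded)
```

Passing `encoding="ISO-8859-1"` to `pd.read_csv` applies the same idea to the whole file.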

The data has several columns, including:

  • RespondentID - An anonymized ID for the respondent (person taking the survey)
  • Gender - The respondent's gender
  • Age - The respondent's age
  • Household Income - The respondent's income
  • Education - The respondent's education level
  • Location (Census Region) - The respondent's location
  • Have you seen any of the 6 films in the Star Wars franchise? - Has a Yes or No response
  • Do you consider yourself to be a fan of the Star Wars film franchise? - Has a Yes or No response

There are several other columns containing answers to questions about the Star Wars movies. For some questions, the respondent had to check one or more boxes. This type of data is difficult to represent in columnar format. As a result, this data set needs a lot of cleaning.
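As a rough illustration of why checkbox questions are awkward in a table, the sketch below spreads a "select all that apply" question over one column per option. The answers and option names are hypothetical and only mimic the shape of the survey export.

```python
import numpy as np
import pandas as pd

# Hypothetical answers from three respondents to a "select all that apply"
# question with three options (the option names are made up).
answers = [["A", "B"], ["B"], []]
options = ["A", "B", "C"]

# One column per option: the cell holds the option name if the box was
# checked and NaN otherwise, which mirrors how the survey export looks.
wide = pd.DataFrame(
    [[opt if opt in picked else np.nan for opt in options] for picked in answers],
    columns=[f"option_{opt}" for opt in options],
)
print(wide)
```

A single question therefore occupies several columns, and a missing value can mean either "not checked" or "question skipped".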

In [64]:
#importing the modules
import pandas as pd
import numpy as np

#loading the file into a DataFrame
star_wars = pd.read_csv("star_wars.csv", encoding="ISO-8859-1")

#displaying first 10 rows
print(star_wars.head(10))

#storing the column names for inspection
columns = star_wars.columns
RespondentID Have you seen any of the 6 films in the Star Wars franchise?  \
0           NaN                                           Response             
1  3.292880e+09                                                Yes             
2  3.292880e+09                                                 No             
3  3.292765e+09                                                Yes             
4  3.292763e+09                                                Yes             
5  3.292731e+09                                                Yes             
6  3.292719e+09                                                Yes             
7  3.292685e+09                                                Yes             
8  3.292664e+09                                                Yes             
9  3.292654e+09                                                Yes             

  Do you consider yourself to be a fan of the Star Wars film franchise?  \
0                                           Response                      
1                                                Yes                      
2                                                NaN                      
3                                                 No                      
4                                                Yes                      
5                                                Yes                      
6                                                Yes                      
7                                                Yes                      
8                                                Yes                      
9                                                Yes                      

  Which of the following Star Wars films have you seen? Please select all that apply.  \
0           Star Wars: Episode I  The Phantom Menace                                    
1           Star Wars: Episode I  The Phantom Menace                                    
2                                                NaN                                    
3           Star Wars: Episode I  The Phantom Menace                                    
4           Star Wars: Episode I  The Phantom Menace                                    
5           Star Wars: Episode I  The Phantom Menace                                    
6           Star Wars: Episode I  The Phantom Menace                                    
7           Star Wars: Episode I  The Phantom Menace                                    
8           Star Wars: Episode I  The Phantom Menace                                    
9           Star Wars: Episode I  The Phantom Menace                                    

                                    Unnamed: 4  \
0  Star Wars: Episode II  Attack of the Clones   
1  Star Wars: Episode II  Attack of the Clones   
2                                          NaN   
3  Star Wars: Episode II  Attack of the Clones   
4  Star Wars: Episode II  Attack of the Clones   
5  Star Wars: Episode II  Attack of the Clones   
6  Star Wars: Episode II  Attack of the Clones   
7  Star Wars: Episode II  Attack of the Clones   
8  Star Wars: Episode II  Attack of the Clones   
9  Star Wars: Episode II  Attack of the Clones   

                                    Unnamed: 5  \
0  Star Wars: Episode III  Revenge of the Sith   
1  Star Wars: Episode III  Revenge of the Sith   
2                                          NaN   
3  Star Wars: Episode III  Revenge of the Sith   
4  Star Wars: Episode III  Revenge of the Sith   
5  Star Wars: Episode III  Revenge of the Sith   
6  Star Wars: Episode III  Revenge of the Sith   
7  Star Wars: Episode III  Revenge of the Sith   
8  Star Wars: Episode III  Revenge of the Sith   
9  Star Wars: Episode III  Revenge of the Sith   

                          Unnamed: 6  \
0  Star Wars: Episode IV  A New Hope   
1  Star Wars: Episode IV  A New Hope   
2                                NaN   
3                                NaN   
4  Star Wars: Episode IV  A New Hope   
5  Star Wars: Episode IV  A New Hope   
6  Star Wars: Episode IV  A New Hope   
7  Star Wars: Episode IV  A New Hope   
8  Star Wars: Episode IV  A New Hope   
9  Star Wars: Episode IV  A New Hope   

                                     Unnamed: 7  \
0  Star Wars: Episode V The Empire Strikes Back   
1  Star Wars: Episode V The Empire Strikes Back   
2                                           NaN   
3                                           NaN   
4  Star Wars: Episode V The Empire Strikes Back   
5  Star Wars: Episode V The Empire Strikes Back   
6  Star Wars: Episode V The Empire Strikes Back   
7  Star Wars: Episode V The Empire Strikes Back   
8  Star Wars: Episode V The Empire Strikes Back   
9  Star Wars: Episode V The Empire Strikes Back   

                                 Unnamed: 8  \
0  Star Wars: Episode VI Return of the Jedi   
1  Star Wars: Episode VI Return of the Jedi   
2                                       NaN   
3                                       NaN   
4  Star Wars: Episode VI Return of the Jedi   
5  Star Wars: Episode VI Return of the Jedi   
6  Star Wars: Episode VI Return of the Jedi   
7  Star Wars: Episode VI Return of the Jedi   
8  Star Wars: Episode VI Return of the Jedi   
9  Star Wars: Episode VI Return of the Jedi   

  Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.  \
0           Star Wars: Episode I  The Phantom Menace                                                                                              
1                                                  3                                                                                              
2                                                NaN                                                                                              
3                                                  1                                                                                              
4                                                  5                                                                                              
5                                                  5                                                                                              
6                                                  1                                                                                              
7                                                  6                                                                                              
8                                                  4                                                                                              
9                                                  5                                                                                              

   ...         Unnamed: 28       Which character shot first?  \
0  ...                Yoda                          Response   
1  ...      Very favorably  I don't understand this question   
2  ...                 NaN                               NaN   
3  ...    Unfamiliar (N/A)  I don't understand this question   
4  ...      Very favorably  I don't understand this question   
5  ...  Somewhat favorably                            Greedo   
6  ...      Very favorably                               Han   
7  ...      Very favorably                               Han   
8  ...      Very favorably                               Han   
9  ...  Somewhat favorably                               Han   

  Are you familiar with the Expanded Universe?  \
0                                     Response   
1                                          Yes   
2                                          NaN   
3                                           No   
4                                           No   
5                                          Yes   
6                                          Yes   
7                                          Yes   
8                                           No   
9                                           No   

  Do you consider yourself to be a fan of the Expanded Universe?ξ  \
0                                           Response                   
1                                                 No                   
2                                                NaN                   
3                                                NaN                   
4                                                NaN                   
5                                                 No                   
6                                                 No                   
7                                                 No                   
8                                                NaN                   
9                                                NaN                   

  Do you consider yourself to be a fan of the Star Trek franchise?    Gender  \
0                                           Response                Response   
1                                                 No                    Male   
2                                                Yes                    Male   
3                                                 No                    Male   
4                                                Yes                    Male   
5                                                 No                    Male   
6                                                Yes                    Male   
7                                                 No                    Male   
8                                                Yes                    Male   
9                                                 No                    Male   

        Age     Household Income                         Education  \
0  Response             Response                          Response   
1     18-29                  NaN                High school degree   
2     18-29         $0 - $24,999                   Bachelor degree   
3     18-29         $0 - $24,999                High school degree   
4     18-29  $100,000 - $149,999  Some college or Associate degree   
5     18-29  $100,000 - $149,999  Some college or Associate degree   
6     18-29    $25,000 - $49,999                   Bachelor degree   
7     18-29                  NaN                High school degree   
8     18-29                  NaN                High school degree   
9     18-29         $0 - $24,999  Some college or Associate degree   

  Location (Census Region)  
0                 Response  
1           South Atlantic  
2       West South Central  
3       West North Central  
4       West North Central  
5       West North Central  
6          Middle Atlantic  
7       East North Central  
8           South Atlantic  
9           South Atlantic  

[10 rows x 38 columns]
In [65]:
#Inspecting for null values
star_wars.isnull()
Out[65]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? Which of the following Star Wars films have you seen? Please select all that apply. Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. ... Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
0 True False False False False False False False False False ... False False False False False False False False False False
1 False False False False False False False False False False ... False False False False False False False True False False
2 False False True True True True True True True True ... True True True True False False False False False False
3 False False False False False False True True True False ... False False False True False False False False False False
4 False False False False False False False False False False ... False False False True False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1182 False False False False False False False False False False ... False False False True False False False False False False
1183 False False False False False False False False False False ... False False False True False False False False False False
1184 False False True True True True True True True True ... True True True True False False False False False False
1185 False False False False False False False False False False ... False False False True False False False False False False
1186 False False False False False True True False False False ... False False False True False False False False False False

1187 rows × 38 columns

In [66]:
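#removing rows without a RespondentID; copy() avoids a SettingWithCopyWarning
#when columns of the filtered DataFrame are reassigned later
star_wars = star_wars[star_wars['RespondentID'].notnull()].copy()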
In [67]:
#checking for changes in the DF
star_wars.isnull().sum()
Out[67]:
RespondentID                                                                                                                                       0
Have you seen any of the 6 films in the Star Wars franchise?                                                                                       0
Do you consider yourself to be a fan of the Star Wars film franchise?                                                                            350
Which of the following Star Wars films have you seen? Please select all that apply.                                                              513
Unnamed: 4                                                                                                                                       615
Unnamed: 5                                                                                                                                       636
Unnamed: 6                                                                                                                                       579
Unnamed: 7                                                                                                                                       428
Unnamed: 8                                                                                                                                       448
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.    351
Unnamed: 10                                                                                                                                      350
Unnamed: 11                                                                                                                                      351
Unnamed: 12                                                                                                                                      350
Unnamed: 13                                                                                                                                      350
Unnamed: 14                                                                                                                                      350
Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.                                   357
Unnamed: 16                                                                                                                                      355
Unnamed: 17                                                                                                                                      355
Unnamed: 18                                                                                                                                      363
Unnamed: 19                                                                                                                                      361
Unnamed: 20                                                                                                                                      372
Unnamed: 21                                                                                                                                      360
Unnamed: 22                                                                                                                                      366
Unnamed: 23                                                                                                                                      374
Unnamed: 24                                                                                                                                      359
Unnamed: 25                                                                                                                                      356
Unnamed: 26                                                                                                                                      365
Unnamed: 27                                                                                                                                      372
Unnamed: 28                                                                                                                                      360
Which character shot first?                                                                                                                      358
Are you familiar with the Expanded Universe?                                                                                                     358
Do you consider yourself to be a fan of the Expanded Universe?ξ                                                                               973
Do you consider yourself to be a fan of the Star Trek franchise?                                                                                 118
Gender                                                                                                                                           140
Age                                                                                                                                              140
Household Income                                                                                                                                 328
Education                                                                                                                                        150
Location (Census Region)                                                                                                                         143
dtype: int64

Cleaning and Mapping Yes/No Columns

Take a look at the next two columns, which are:

  • Have you seen any of the 6 films in the Star Wars franchise?
  • Do you consider yourself to be a fan of the Star Wars film franchise?

Both represent Yes/No questions. They can also be NaN where a respondent chooses not to answer a question. We can use the pandas.Series.value_counts() method on a series to see all of the unique values in a column, along with the total number of times each value appears.

Both columns are currently string types, because the main values they contain are Yes and No. We can make the data a bit easier to analyze down the road by converting each column to a Boolean having only the values True, False, and NaN. Booleans are easier to work with because we can select the rows that are True or False without having to do a string comparison.
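The idea can be sketched on a small made-up Series before applying it to the survey columns. The values below are hypothetical; `dropna=False` is an optional argument of `value_counts()` that also counts missing answers.

```python
import numpy as np
import pandas as pd

# Hypothetical answers; NaN stands for a skipped question.
answers = pd.Series(["Yes", "No", np.nan, "Yes"])

# dropna=False makes value_counts() report missing answers as well.
counts = answers.value_counts(dropna=False)

# map() translates each value through the dictionary; values with no
# matching key (here NaN) stay NaN.
as_bool = answers.map({"Yes": True, "No": False})

# Select the respondents who answered "Yes" without comparing strings.
yes_rows = answers[as_bool.eq(True)]
print(counts)
print(yes_rows.index.tolist())
```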

In [68]:
# inspecting Unique Column Values
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts()
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts()
Out[68]:
Yes    552
No     284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
In [69]:
#defining a dictionary to use with the map function
yes_no = {"Yes": True, "No": False}

#applying the map function with the dictionary to both columns
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'] = star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map(yes_no)
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'] = star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map(yes_no)
In [70]:
#Looking for changes 
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts()
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts()
Out[70]:
True     552
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64

Cleaning and Mapping Checkbox Columns

The next six columns represent a single checkbox question. The respondent checked off a series of boxes in response to the question, Which of the following Star Wars films have you seen? Please select all that apply.

The columns for this question are:

  • Which of the following Star Wars films have you seen? Please select all that apply. - Whether or not the respondent saw Star Wars: Episode I The Phantom Menace.
  • Unnamed: 4 - Whether or not the respondent saw Star Wars: Episode II Attack of the Clones.
  • Unnamed: 5 - Whether or not the respondent saw Star Wars: Episode III Revenge of the Sith.
  • Unnamed: 6 - Whether or not the respondent saw Star Wars: Episode IV A New Hope.
  • Unnamed: 7 - Whether or not the respondent saw Star Wars: Episode V The Empire Strikes Back.
  • Unnamed: 8 - Whether or not the respondent saw Star Wars: Episode VI Return of the Jedi.

For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw the movie. If the value is NaN, the respondent either didn't answer or didn't see the movie. We'll assume that they didn't see the movie.

We'll need to convert each of these columns to a Boolean, then rename the column something more intuitive. We can convert the values the same way we did earlier, except that we'll need to include the movie title and NaN in the mapping dictionary.
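As an alternative to a mapping dictionary, and under the assumption stated above that NaN means "not seen", the same Boolean column could be derived from whether a value is present at all. This is a sketch on a made-up column, not the approach used in the cells that follow.

```python
import numpy as np
import pandas as pd

# Hypothetical checkbox column: the film title if seen, NaN otherwise.
seen = pd.Series([
    "Star Wars: Episode IV  A New Hope",
    np.nan,
    "Star Wars: Episode IV  A New Hope",
])

# notnull() is True wherever a title is present and False for NaN, which
# gives a Boolean column without spelling out each title.
seen_bool = seen.notnull()
print(seen_bool.tolist())
```

The dictionary approach has one advantage: an unexpected value in a column maps to NaN and stands out, whereas `notnull()` would silently treat it as True.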

In [71]:
#defining dictionary for conversion
true_false = {"Star Wars: Episode I  The Phantom Menace": True,
              "Star Wars: Episode II  Attack of the Clones": True,
              "Star Wars: Episode III  Revenge of the Sith": True,
              "Star Wars: Episode IV  A New Hope": True,
              "Star Wars: Episode V The Empire Strikes Back": True,
              "Star Wars: Episode VI Return of the Jedi": True,
              np.nan: False}  # np.nan instead of np.NaN, which was removed in NumPy 2.0
#applying mapping function
for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(true_false)
In [72]:
#renaming columns (a single rename call is enough, no loop is needed)
star_wars = star_wars.rename(columns={
    "Which of the following Star Wars films have you seen? Please select all that apply.": "seen_1",
    "Unnamed: 4": "seen_2",
    "Unnamed: 5": "seen_3",
    "Unnamed: 6": "seen_4",
    "Unnamed: 7": "seen_5",
    "Unnamed: 8": "seen_6"})
star_wars.head()
Out[72]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. ... Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
1 3.292880e+09 True True True True True True True True 3 ... Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
2 3.292880e+09 False NaN False False False False False False NaN ... NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.292765e+09 True False True True True False False False 1 ... Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
4 3.292763e+09 True True True True True True True True 5 ... Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
5 3.292731e+09 True True True True True True True True 5 ... Somewhat favorably Greedo Yes No No Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central

5 rows × 38 columns

Cleaning the Ranking Columns

The next six columns ask the respondent to rank the Star Wars movies from most favorite to least favorite: 1 means the film was the respondent's favorite, and 6 means it was the least favorite. Each of the following columns can contain the value 1, 2, 3, 4, 5, 6, or NaN:

  • Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. - How much the respondent liked Star Wars: Episode I The Phantom Menace
  • Unnamed: 10 - How much the respondent liked Star Wars: Episode II Attack of the Clones
  • Unnamed: 11 - How much the respondent liked Star Wars: Episode III Revenge of the Sith
  • Unnamed: 12 - How much the respondent liked Star Wars: Episode IV A New Hope
  • Unnamed: 13 - How much the respondent liked Star Wars: Episode V The Empire Strikes Back
  • Unnamed: 14 - How much the respondent liked Star Wars: Episode VI Return of the Jedi
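The conversion to a numeric type can be sketched on a made-up column. The values are hypothetical; the sketch assumes the rankings are loaded as strings, which is likely here because the first data row of the file holds text.

```python
import numpy as np
import pandas as pd

# Hypothetical ranking column as it might look after loading: the numbers
# are stored as strings and a skipped answer is NaN.
ranking = pd.Series(["3", np.nan, "1"])

# float is used rather than int because NaN is a floating-point value and
# cannot be stored in a plain integer column.
ranking_float = ranking.astype(float)
print(ranking_float.dtype)
```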
In [73]:
# converting columns to float type
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)

star_wars[star_wars.columns[9:15]].head()
Out[73]:
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14
1 3.0 2.0 1.0 4.0 5.0 6.0
2 NaN NaN NaN NaN NaN NaN
3 1.0 2.0 3.0 4.0 5.0 6.0
4 5.0 6.0 1.0 2.0 4.0 3.0
5 5.0 4.0 6.0 2.0 1.0 3.0
In [74]:
#Renaming columns

star_wars = star_wars.rename(columns={"Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.":
                                      "ranking_1", 
                                      "Unnamed: 10": "ranking_2",
                                      "Unnamed: 11": "ranking_3",
                                      "Unnamed: 12": "ranking_4",
                                      "Unnamed: 13": "ranking_5",
                                      "Unnamed: 14": "ranking_6"
    
})

star_wars[star_wars.columns[9:15]].head()
Out[74]:
ranking_1 ranking_2 ranking_3 ranking_4 ranking_5 ranking_6
1 3.0 2.0 1.0 4.0 5.0 6.0
2 NaN NaN NaN NaN NaN NaN
3 1.0 2.0 3.0 4.0 5.0 6.0
4 5.0 6.0 1.0 2.0 4.0 3.0
5 5.0 4.0 6.0 2.0 1.0 3.0

Finding the Highest-Ranked Movie

Now that we've cleaned up the ranking columns, we can find the highest-ranked movie more quickly. To do this, take the mean of each of the ranking columns using the pandas.DataFrame.mean() method.
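A minimal sketch of this step on made-up rankings for two films. The numbers are hypothetical and only show how `mean()` treats NaN and how the best-ranked column could be picked out with `idxmin()`.

```python
import numpy as np
import pandas as pd

# Hypothetical rankings from three respondents for two films
# (1 = favorite); the second respondent skipped the question.
rankings = pd.DataFrame({
    "ranking_1": [2.0, np.nan, 2.0],
    "ranking_2": [1.0, np.nan, 1.0],
})

# mean() works column by column and skips NaN by default.
means = rankings.mean()

# The lowest mean corresponds to the best-ranked film.
best = means.idxmin()
print(means)
print(best)
```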

In [75]:
#importing visualization library
%matplotlib inline
import matplotlib.pyplot as plt

ranking_mean = star_wars[star_wars.columns[9:15]].mean()

plt.bar(range(6), ranking_mean)
Out[75]:
<BarContainer object of 6 artists>

Findings

It looks like the "original" movies are rated much more highly than the newer ones. Because 1 is the best possible rank, a lower mean ranking indicates a more highly rated film.
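The bars in the chart are only numbered 0 to 5, which makes it easy to misread. One way to label them is sketched below; the labels and the mean values are placeholders invented for illustration and are not the survey results.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Placeholder labels and values, made up for illustration only.
labels = ["Ep. I", "Ep. II", "Ep. III", "Ep. IV", "Ep. V", "Ep. VI"]
placeholder_means = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

fig, ax = plt.subplots()
# Passing the labels as x values puts one named tick under each bar.
bars = ax.bar(labels, placeholder_means)
ax.set_xlabel("Film")
ax.set_ylabel("Mean ranking (lower is better)")
ax.set_title("Mean ranking per film")
```

In the notebook, `placeholder_means` would be replaced by the `ranking_mean` Series computed in the cell above.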

Finding the Most Viewed Movie

In [76]:
plt.bar(range(6), star_wars[star_wars.columns[3:9]].sum())
Out[76]:
<BarContainer object of 6 artists>

It appears that the original movies were seen by more respondents than the newer movies. This reinforces what we saw in the rankings, where the earlier movies seem to be more popular.
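Summing the seen columns works because Booleans are counted as 1 and 0. A small sketch with hypothetical flags:

```python
import pandas as pd

# Hypothetical "seen" flags for three respondents and two films.
seen = pd.DataFrame({
    "seen_1": [True, False, True],
    "seen_2": [True, False, False],
})

# True counts as 1 and False as 0, so sum() returns the number of
# respondents who saw each film.
views = seen.sum()

# idxmax() returns the column with the largest count.
most_viewed = views.idxmax()
print(views)
print(most_viewed)
```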

Exploring the Data by Binary Segments

We know which movies the survey population as a whole has ranked the highest. Now let's examine how certain segments of the survey population responded. There are several columns that segment our data into two groups. Here are a few examples:

  • Do you consider yourself to be a fan of the Star Wars film franchise? - True or False
  • Do you consider yourself to be a fan of the Star Trek franchise? - Yes or No
  • Gender - Male or Female
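One possible way to compare segments without building a separate DataFrame per group is `groupby()`. The sketch below uses made-up respondents and borrows the `ranking_` column names from the renaming step earlier; it is an alternative to the filtering used in the next cells, not a replacement for them.

```python
import pandas as pd

# Hypothetical respondents, with values invented for illustration.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "ranking_1": [1.0, 2.0, 3.0, 4.0],
    "ranking_2": [2.0, 1.0, 2.0, 1.0],
})

# groupby() splits the rows by the segment column and computes the mean
# ranking of each film within each group in one step.
by_gender = df.groupby("Gender")[["ranking_1", "ranking_2"]].mean()
print(by_gender)
```

The same pattern applies to any of the two-valued columns listed above, such as the Star Wars fan column.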
In [77]:
# Splitting the data into two segments by gender
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]

# plotting the segments
plt.bar(range(6), males[males.columns[9:15]].mean())
plt.show()

plt.bar(range(6), females[females.columns[9:15]].mean())
plt.show()
In [78]:
#Finding the most viewed movies by gender

plt.bar(range(6), males[males.columns[3:9]].sum())
plt.show()

plt.bar(range(6), females[females.columns[3:9]].sum())
plt.show()