Guided Project: Insights from the Star Wars Survey

Scenario: While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?

The team needed to collect data addressing this question. To do this, they surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 responses in total. We will now clean the data and look for the insight behind it: do Star Wars fans feel that The Empire Strikes Back is the best film of the series?

To be clear, The Empire Strikes Back was released in 1980, and by the time FiveThirtyEight ran this survey (the mid-2010s), four more episodes had been released after it. Our task is to find out what the survey data says about the question FiveThirtyEight is interested in: depending on the results, we can accept or reject the claim. The conclusion at the end of this project will be our final answer for the team.

We will work with a CSV file, but we do not always know the file's encoding in advance. So we will write a function that checks the encoding before we load the file into a DataFrame.

In [1]:
# pip install chardet
In [2]:
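## Check the file encoding:

from chardet.universaldetector import UniversalDetector

def detect_encode(file_name):
    # Feed the file to the detector line by line until it is confident
    detector = UniversalDetector()
    with open(file_name, 'rb') as file:
        for item in file:
            detector.feed(item)
            if detector.done:
                break
    # Finalize the detection once, after the loop, then report the result
    detector.close()
    print(detector.result)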
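
If chardet is not available, a rougher fallback is to try a short list of candidate encodings in order and keep the first one that decodes without error. The sketch below uses only the standard library. The helper name `first_working_encoding` and the candidate list are assumptions made for illustration, and unlike chardet this approach gives no confidence estimate.

```python
def first_working_encoding(raw_bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes raw_bytes without error.

    Hypothetical helper, not part of chardet. latin-1 accepts any byte
    sequence, so it acts as a last resort rather than a real detection.
    """
    for name in candidates:
        try:
            raw_bytes.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# 0x92 is a curly apostrophe in Windows-1252 but is not valid UTF-8 here.
sample = b"Han Solo\x92s ship"
print(first_working_encoding(sample))
```

The order of the candidates matters: a stricter encoding such as UTF-8 should come before permissive ones, otherwise the permissive one always wins.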

Now let's download the file to our local machine and check its encoding.

In [3]:
# # Download the file:

# import opendatasets as od
# page='https://raw.githubusercontent.com/fivethirtyeight/data/master/star-wars-survey/StarWars.csv'
# od.download(page)
In [4]:
## Check the encoding:
detect_encode('StarWars.csv')
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

The file is encoded in Windows-1252, so we will pass this encoding name to the function that loads our DataFrame, then inspect the result to make sure everything is OK.
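
To see why the encoding argument matters, here is a minimal sketch that reads a tiny in-memory CSV encoded as Windows-1252 (`cp1252` is Python's name for it). The CSV content is invented for illustration and is not part of the survey file.

```python
import io
import pandas as pd

# Invented two-line CSV; the curly apostrophe is stored as byte 0x92 in cp1252.
raw = "character,quote\nHan,I’ve got a bad feeling about this\n".encode("cp1252")

# Passing the matching encoding lets pandas decode the apostrophe correctly.
df = pd.read_csv(io.BytesIO(raw), encoding="cp1252")
print(df.loc[0, "quote"])
```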

In [5]:
## Load library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
In [6]:
## Load data in:
starwar = pd.read_csv('StarWars.csv', encoding='Windows-1252', delimiter=',')
starwar.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1187 entries, 0 to 1186
Data columns (total 38 columns):
 #   Column                                                                                                                                         Non-Null Count  Dtype  
---  ------                                                                                                                                         --------------  -----  
 0   RespondentID                                                                                                                                   1186 non-null   float64
 1   Have you seen any of the 6 films in the Star Wars franchise?                                                                                   1187 non-null   object 
 2   Do you consider yourself to be a fan of the Star Wars film franchise?                                                                          837 non-null    object 
 3   Which of the following Star Wars films have you seen? Please select all that apply.                                                            674 non-null    object 
 4   Unnamed: 4                                                                                                                                     572 non-null    object 
 5   Unnamed: 5                                                                                                                                     551 non-null    object 
 6   Unnamed: 6                                                                                                                                     608 non-null    object 
 7   Unnamed: 7                                                                                                                                     759 non-null    object 
 8   Unnamed: 8                                                                                                                                     739 non-null    object 
 9   Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.  836 non-null    object 
 10  Unnamed: 10                                                                                                                                    837 non-null    object 
 11  Unnamed: 11                                                                                                                                    836 non-null    object 
 12  Unnamed: 12                                                                                                                                    837 non-null    object 
 13  Unnamed: 13                                                                                                                                    837 non-null    object 
 14  Unnamed: 14                                                                                                                                    837 non-null    object 
 15  Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.                                 830 non-null    object 
 16  Unnamed: 16                                                                                                                                    832 non-null    object 
 17  Unnamed: 17                                                                                                                                    832 non-null    object 
 18  Unnamed: 18                                                                                                                                    824 non-null    object 
 19  Unnamed: 19                                                                                                                                    826 non-null    object 
 20  Unnamed: 20                                                                                                                                    815 non-null    object 
 21  Unnamed: 21                                                                                                                                    827 non-null    object 
 22  Unnamed: 22                                                                                                                                    821 non-null    object 
 23  Unnamed: 23                                                                                                                                    813 non-null    object 
 24  Unnamed: 24                                                                                                                                    828 non-null    object 
 25  Unnamed: 25                                                                                                                                    831 non-null    object 
 26  Unnamed: 26                                                                                                                                    822 non-null    object 
 27  Unnamed: 27                                                                                                                                    815 non-null    object 
 28  Unnamed: 28                                                                                                                                    827 non-null    object 
 29  Which character shot first?                                                                                                                    829 non-null    object 
 30  Are you familiar with the Expanded Universe?                                                                                                   829 non-null    object 
 31  Do you consider yourself to be a fan of the Expanded Universe?ξ                                                                               214 non-null    object 
 32  Do you consider yourself to be a fan of the Star Trek franchise?                                                                               1069 non-null   object 
 33  Gender                                                                                                                                         1047 non-null   object 
 34  Age                                                                                                                                            1047 non-null   object 
 35  Household Income                                                                                                                               859 non-null    object 
 36  Education                                                                                                                                      1037 non-null   object 
 37  Location (Census Region)                                                                                                                       1044 non-null   object 
dtypes: float64(1), object(37)
memory usage: 352.5+ KB

Let's look at a few records to understand the data better:

In [7]:
# Setting for maximum columns = 60
pd.set_option('display.max_columns',60)

# Check the first 5 records:
starwar.head()
Out[7]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? Which of the following Star Wars films have you seen? Please select all that apply. Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19 Unnamed: 20 Unnamed: 21 Unnamed: 22 Unnamed: 23 Unnamed: 24 Unnamed: 25 Unnamed: 26 Unnamed: 27 Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
0 NaN Response Response Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi Han Solo Luke Skywalker Princess Leia Organa Anakin Skywalker Obi Wan Kenobi Emperor Palpatine Darth Vader Lando Calrissian Boba Fett C-3P0 R2 D2 Jar Jar Binks Padme Amidala Yoda Response Response Response Response Response Response Response Response Response
1 3.292880e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3 2 1 4 5 6 Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Unfamiliar (N/A) Unfamiliar (N/A) Very favorably Very favorably Very favorably Very favorably Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
2 3.292880e+09 No NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.292765e+09 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN 1 2 3 4 5 6 Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
4 3.292763e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 6 1 2 4 3 Very favorably Very favorably Very favorably Very favorably Very favorably Somewhat favorably Very favorably Somewhat favorably Somewhat unfavorably Very favorably Very favorably Very favorably Very favorably Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central

We can see that the survey is structured like this:

  • The header row holds 9 questions in total, plus 5 fields of personal information. Among the 9 questions:
  1. From the 3rd question on, some questions allow more than one selection and span several columns, one per answer option
  • The first record, right below the header, holds the label of each column (a film title, a character name, or just 'Response'). Respondents answered with fixed choices such as Yes, No, or a favorability level
  • The respondents' answers actually start at the 2nd record => we should consider combining the header with the 1st record, or otherwise cleaning up the column names.
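
To make this layout concrete, the sketch below builds a tiny frame with the same shape (question text in the header, labels in the first record) and separates the two. The toy data and column names are invented for illustration; only the layout mirrors the survey file.

```python
import pandas as pd

# Toy frame mimicking the survey layout (invented data, same structure):
# the header holds question text, record 0 holds the per-column labels.
toy = pd.DataFrame({
    "Have you seen any film?": ["Response", "Yes", "No"],
    "Which films have you seen?": ["Episode I", "Episode I", None],
    "Unnamed: 2": ["Episode II", None, None],
})

sub_labels = toy.iloc[0]   # film titles or the word "Response"
answers = toy.iloc[1:]     # real answers start at index 1

print(sub_labels.tolist())
print(len(answers))
```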

Clean data

1. Handle YES/NO columns

We have two YES/NO fields:

  • Have you seen any of the 6 films in the Star Wars franchise? (1st)
  • Do you consider yourself to be a fan of the Star Wars film franchise? (2nd)

As each field can contain missing values (when a respondent declined to answer), we do the following:

  1. Count the values of each field: from df.info() we know that the first field has no missing values but the 2nd does => we count again to confirm
  2. Convert YES/NO to boolean with the Series.map() function.

The process is shown in the code blocks below.

In [8]:
# Count the unique value again:
starwar['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)
Out[8]:
Yes         936
No          250
Response      1
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
In [9]:
starwar['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
Out[9]:
Yes         552
NaN         350
No          284
Response      1
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64

Skipping over the 'Response' record, we now convert the other values:

In [10]:
# Convert the value to boolean:
yes_no = {'Yes':True, 'No':False}

col = ['Have you seen any of the 6 films in the Star Wars franchise?',
      'Do you consider yourself to be a fan of the Star Wars film franchise?']

for item in col:
    starwar[item] = starwar[item].map(yes_no, na_action='ignore')
In [11]:
# Check the result:
starwar['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)
Out[11]:
True     936
False    250
NaN        1
Name: Have you seen any of the 6 films in the Star Wars franchise?, dtype: int64
In [12]:
starwar['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
Out[12]:
True     552
NaN      351
False    284
Name: Do you consider yourself to be a fan of the Star Wars film franchise?, dtype: int64
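
The missing count for the second question rises from 350 to 351 because `Series.map` returns NaN for any value that is not a key of the dictionary, and that includes the 'Response' label in the first record. A minimal sketch on invented values:

```python
import pandas as pd

yes_no = {"Yes": True, "No": False}

# Invented values: one label record, two answers, one skipped question.
s = pd.Series(["Response", "Yes", "No", None])
converted = s.map(yes_no, na_action="ignore")

# "Response" is not a key of yes_no, so it becomes NaN like the skipped answer.
print(converted.isna().sum())
```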

2. Clean Check-box columns

Everything seems OK. Now let's move on to the next 6 columns, where we can see that:

  • Under the field Which of the following Star Wars films have you seen? Please select all that apply., the first record holds the film title Star Wars: Episode I The Phantom Menace, which means this column records whether the respondent has seen Episode I
  • The next columns follow the same pattern: Star Wars: Episode II Attack of the Clones...
  • Below that, each answer either repeats the film title (seen) or is NaN (not seen, or no answer).

With that said, we can do the following:

  1. Check the values contained in each field and modify the strings if needed
  2. Convert the film title to True and NaN to False
  3. Rename the columns, e.g. Which of the following Star Wars films have you seen? Please select all that apply. => seen_1
In [13]:
# Check the unique values in each field:
def check_value(df, start, end):
    col_check = df.columns[start:end]
    result = []
    for item in col_check:
        val = df[item].unique()
        # The first record holds the film title, so it is the first unique value
        result.append(val[0])
        # The only other value is NaN (film not ticked); record it once
        if val[1] not in result:
            val[1] = np.nan
            result.append(val[1])
    return result
In [14]:
check_value(starwar,3, 9)
Out[14]:
['Star Wars: Episode I  The Phantom Menace',
 nan,
 'Star Wars: Episode II  Attack of the Clones',
 'Star Wars: Episode III  Revenge of the Sith',
 'Star Wars: Episode IV  A New Hope',
 'Star Wars: Episode V The Empire Strikes Back',
 'Star Wars: Episode VI Return of the Jedi']
In [15]:
# Convert the value to boolean:
unique_values = check_value(starwar,3,9)
mapper = {}
for name in unique_values:
    # NaN means the film was not ticked; any title means it was seen
    mapper[name] = not pd.isna(name)
In [16]:
def convert_value(df, start, end):
    cols = df.columns[start:end]
    for i in cols:
        df[i] = df[i].map(mapper)
    return df[i]
In [17]:
convert_value(starwar,3,9)
Out[17]:
0        True
1        True
2       False
3       False
4        True
        ...  
1182     True
1183     True
1184    False
1185     True
1186     True
Name: Unnamed: 8, Length: 1187, dtype: bool
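
Because a check-box cell is either the film title or NaN, the same result can be reached without building a mapping at all: `notna()` is True exactly where a title is present. This is an alternative sketch on invented data, not the approach used above. Like the mapping, it also marks the label record as True, so that record still has to be dropped later.

```python
import pandas as pd
import numpy as np

# Invented stand-in for two check-box columns; record 0 is the label record.
toy = pd.DataFrame({
    "seen_1": ["Episode I", "Episode I", np.nan],
    "seen_2": ["Episode II", np.nan, np.nan],
})

seen = toy.notna()   # True where a title is present, False where NaN
print(seen)
```
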
In [18]:
# Convert the name of column:
def convert_columns(df, start, end, str_, number):
    
    cols = df.columns[start:end]
    revert = []
    for i in range(1,number+1):
        revert.append('{}_{}'.format(str_,i))

    mapp = {}
    for old, new in zip(cols, revert):
        mapp[old] = new
    
    return df.rename(mapp, inplace=True, axis=1)
     
In [19]:
## Run the function
convert_columns(starwar, 3, 9, 'seen', 6)
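
The `number` argument is redundant with the length of the column slice, so the new names can also be derived from the columns themselves. A sketch of that variant follows; `numbered_names` is a hypothetical helper and the toy columns are invented.

```python
import pandas as pd

def numbered_names(columns, prefix):
    """Map each column to '<prefix>_<position>', counting from 1 (hypothetical helper)."""
    return {old: f"{prefix}_{i}" for i, old in enumerate(columns, start=1)}

# Invented, empty frame: only the column names matter here.
toy = pd.DataFrame(columns=["Which films?", "Unnamed: 1", "Unnamed: 2", "Gender"])
toy = toy.rename(columns=numbered_names(toy.columns[0:3], "seen"))
print(list(toy.columns))
```
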
In [20]:
## Check some records:
starwar.head()
Out[20]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14 Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19 Unnamed: 20 Unnamed: 21 Unnamed: 22 Unnamed: 23 Unnamed: 24 Unnamed: 25 Unnamed: 26 Unnamed: 27 Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
0 NaN NaN NaN True True True True True True Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi Han Solo Luke Skywalker Princess Leia Organa Anakin Skywalker Obi Wan Kenobi Emperor Palpatine Darth Vader Lando Calrissian Boba Fett C-3P0 R2 D2 Jar Jar Binks Padme Amidala Yoda Response Response Response Response Response Response Response Response Response
1 3.292880e+09 True True True True True True True True 3 2 1 4 5 6 Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Unfamiliar (N/A) Unfamiliar (N/A) Very favorably Very favorably Very favorably Very favorably Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
2 3.292880e+09 False NaN False False False False False False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.292765e+09 True False True True True False False False 1 2 3 4 5 6 Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
4 3.292763e+09 True True True True True True True True 5 6 1 2 4 3 Very favorably Very favorably Very favorably Very favorably Very favorably Somewhat favorably Very favorably Somewhat favorably Somewhat unfavorably Very favorably Very favorably Very favorably Very favorably Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central

3. Clean the ranking columns:

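This is similar to the check-box columns above, but here we need less cleaning. We simply rename the columns to ranking_n and cast the values to a numeric type. Note that the cast below starts at index 2, so it leaves NaN not only in the label record (index 0) but also in the first real response (index 1); that record's rankings are restored by hand before the analysis.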

In [21]:
# Cast the value:
starwar[starwar.columns[9:15]] = starwar.loc[2:,starwar.columns[9:15]].astype('float')
In [22]:
# Check:
starwar[starwar.columns[9:15]].dtypes
Out[22]:
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.    float64
Unnamed: 10                                                                                                                                      float64
Unnamed: 11                                                                                                                                      float64
Unnamed: 12                                                                                                                                      float64
Unnamed: 13                                                                                                                                      float64
Unnamed: 14                                                                                                                                      float64
dtype: object
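
An alternative that does not depend on choosing the right starting row is `pd.to_numeric` with `errors='coerce'`: the film titles in the label record turn into NaN while every real ranking is kept. This is a sketch on invented data rather than the cast used above.

```python
import pandas as pd

# Invented ranking columns; record 0 carries film titles instead of numbers.
toy = pd.DataFrame({
    "ranking_1": ["Episode I", "3", "1", None],
    "ranking_2": ["Episode II", "2", "2", None],
})

# Non-numeric strings become NaN; numeric strings become floats.
numeric = toy.apply(pd.to_numeric, errors="coerce")
print(numeric)
```
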
In [23]:
# Rename the columns:
convert_columns(starwar, 9, 15, 'ranking', 6)
In [24]:
## Check:
starwar.head()
Out[24]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 ranking_1 ranking_2 ranking_3 ranking_4 ranking_5 ranking_6 Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. Unnamed: 16 Unnamed: 17 Unnamed: 18 Unnamed: 19 Unnamed: 20 Unnamed: 21 Unnamed: 22 Unnamed: 23 Unnamed: 24 Unnamed: 25 Unnamed: 26 Unnamed: 27 Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
0 NaN NaN NaN True True True True True True NaN NaN NaN NaN NaN NaN Han Solo Luke Skywalker Princess Leia Organa Anakin Skywalker Obi Wan Kenobi Emperor Palpatine Darth Vader Lando Calrissian Boba Fett C-3P0 R2 D2 Jar Jar Binks Padme Amidala Yoda Response Response Response Response Response Response Response Response Response
1 3.292880e+09 True True True True True True True True NaN NaN NaN NaN NaN NaN Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Very favorably Unfamiliar (N/A) Unfamiliar (N/A) Very favorably Very favorably Very favorably Very favorably Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
2 3.292880e+09 False NaN False False False False False False NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.292765e+09 True False True True True False False False 1.0 2.0 3.0 4.0 5.0 6.0 Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Somewhat favorably Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
4 3.292763e+09 True True True True True True True True 5.0 6.0 1.0 2.0 4.0 3.0 Very favorably Very favorably Very favorably Very favorably Very favorably Somewhat favorably Very favorably Somewhat favorably Somewhat unfavorably Very favorably Very favorably Very favorably Very favorably Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central

4. Clean the favorability columns

Like the check-box and ranking fields, the favorability fields have this structure:

  • On top: the columns from Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. to Unnamed: 28 hold the question, and each one corresponds to a character:
  1. Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. => Han Solo
  2. Unnamed: 16 => Luke Skywalker, and so on
  • In the answers, we can see: Very favorably, Somewhat favorably, Unfamiliar (N/A)

So, what we should do is:

  • Change the column names first, to avoid losing the character names
  • Check the unique values of these fields, and
  • Convert the adjectives into 3 ranks, A, B, C, corresponding to 3 levels: Very favorably, Somewhat favorably, Unfamiliar; N/A is of course 'C'
  • Keep the character names, because we will use them to summarize the data.

The conversion process is shown in the code blocks below.

In [25]:
# Get the character names from the first record and turn them into column names:
list_actor_name = starwar.iloc[0,15:29].values
actor_name = [] 
for ac_name in list_actor_name:
    # Strip surrounding whitespace first so it does not turn into underscores
    names = (ac_name
                      .strip()
                      .lower()
                      .replace(' ','_'))
    actor_name.append(names)

# modify the actor name:
old_name = starwar.columns[15:29]
mapp1 = {}
for od, nw in zip(old_name, actor_name):
    mapp1[od] = nw
starwar.rename(mapper=mapp1, inplace=True, axis=1)

# Check the result:
starwar.columns[15:29]
Out[25]:
Index(['han_solo', 'luke_skywalker', 'princess_leia_organa',
       'anakin_skywalker', 'obi_wan_kenobi', 'emperor_palpatine',
       'darth_vader', 'lando_calrissian', 'boba_fett', 'c-3p0', 'r2_d2',
       'jar_jar_binks', 'padme_amidala', 'yoda'],
      dtype='object')
In [26]:
# Check the unique value in each field:
starwar['han_solo'].value_counts(dropna=False)
Out[26]:
Very favorably                                 610
NaN                                            357
Somewhat favorably                             151
Neither favorably nor unfavorably (neutral)     44
Unfamiliar (N/A)                                15
Somewhat unfavorably                             8
Han Solo                                         1
Very unfavorably                                 1
Name: han_solo, dtype: int64

We need to adjust our plan a little:

  • The answers actually use six levels: Very favorably, Somewhat favorably, Neither favorably nor unfavorably, Unfamiliar, Somewhat unfavorably, Very unfavorably => We will change our standardized ranks to:
  1. 1: Very favorably
  2. 2: Somewhat favorably
  3. 3: Neither favorably nor unfavorably (neutral)
  4. 4: Unfamiliar or NaN (because, per the description, Unfamiliar means unknown and can be treated as a missing value)
  5. 5: Somewhat unfavorably
  6. 6: Very unfavorably
  • Records with NaN are therefore counted as rank 4
In [27]:
#Convert the value:
mapping = {'Very favorably':1, 'Somewhat favorably':2,'Neither favorably nor unfavorably (neutral)':3,
          'Unfamiliar (N/A)':4, 'Somewhat unfavorably':5, 'Very unfavorably':6}

colume = starwar.columns[15:29]
for mark in colume:
    starwar[mark] = starwar[mark].map(mapping, na_action='ignore')
In [28]:
#Check:
starwar['han_solo']
Out[28]:
0       NaN
1       1.0
2       NaN
3       2.0
4       1.0
       ... 
1182    1.0
1183    1.0
1184    NaN
1185    1.0
1186    1.0
Name: han_solo, Length: 1187, dtype: float64
In [29]:
# Fill missing values with 4 (the same rank as Unfamiliar)
starwar[colume] = starwar[colume].fillna(4)
starwar.head()
Out[29]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 ranking_1 ranking_2 ranking_3 ranking_4 ranking_5 ranking_6 han_solo luke_skywalker princess_leia_organa anakin_skywalker obi_wan_kenobi emperor_palpatine darth_vader lando_calrissian boba_fett c-3p0 r2_d2 jar_jar_binks padme_amidala yoda Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
0 NaN NaN NaN True True True True True True NaN NaN NaN NaN NaN NaN 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 Response Response Response Response Response Response Response Response Response
1 3.292880e+09 True True True True True True True True NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 4.0 4.0 1.0 1.0 1.0 1.0 1.0 I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
2 3.292880e+09 False NaN False False False False False False NaN NaN NaN NaN NaN NaN 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.292765e+09 True False True True True False False False 1.0 2.0 3.0 4.0 5.0 6.0 2.0 2.0 2.0 2.0 2.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
4 3.292763e+09 True True True True True True True True 5.0 6.0 1.0 2.0 4.0 3.0 1.0 1.0 1.0 1.0 1.0 2.0 1.0 2.0 5.0 1.0 1.0 1.0 1.0 1.0 I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
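
Filling missing answers with 4 affects any average computed later, because respondents who skipped the question are counted as if they had answered. The sketch below, on invented scores, compares the mean with and without the fill so the effect is visible. Which of the two is appropriate depends on the analysis.

```python
import pandas as pd
import numpy as np

# Invented scores on the 1-6 scale defined above; NaN marks a skipped answer.
scores = pd.Series([1.0, 1.0, 2.0, np.nan, np.nan])

mean_skipping_nan = scores.mean()        # NaN values are ignored by default
mean_filled = scores.fillna(4).mean()    # skipped answers counted as 4

print(mean_skipping_nan, mean_filled)
```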

To make the next steps easier, we drop the first record, which contains only 'Response' and the other labels rather than real answers.

In [30]:
## Drop the first record (the label row):
starwar.drop(index=0, inplace=True)

Considering how to explore the data:

We already have cleaned data, so now we consider some ways to explore it:

  • Find the highest-ranked episode.
  • Aggregate the data by gender, male and female: which characters does each group prefer? What is their (average) ranking of the highest-ranked episode?
  • Calculate the number of views of each episode, divide by the total number of views across all episodes, and output the share of each episode: which episode has the most views? (Pie chart)
  • For the questions about being a fan of a given franchise, there may be some patterns:
  1. Which gender tends to be a fan of which franchise?
  2. Which age group is most likely to belong to a given franchise's fandom?

Other questions may come up during the analysis.

About the ranking of each episode

1. First, let's find the episode with the highest ranking:

In [31]:
# Fill the value again for the first record:

#Check data:
starwar.iloc[0,9:15].values
Out[31]:
array([nan, nan, nan, nan, nan, nan], dtype=object)
In [32]:
#Replace data:
replace = [3, 2, 1, 4, 5, 6]

for item, i in zip(replace, range(9,15)):
    starwar.iloc[0, i] = item
    
#Check:
starwar.head()
Out[32]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 ranking_1 ranking_2 ranking_3 ranking_4 ranking_5 ranking_6 han_solo luke_skywalker princess_leia_organa anakin_skywalker obi_wan_kenobi emperor_palpatine darth_vader lando_calrissian boba_fett c-3p0 r2_d2 jar_jar_binks padme_amidala yoda Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
1 3.292880e+09 True True True True True True True True 3.0 2.0 1.0 4.0 5.0 6.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 4.0 4.0 1.0 1.0 1.0 1.0 1.0 I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
2 3.292880e+09 False NaN False False False False False False NaN NaN NaN NaN NaN NaN 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.292765e+09 True False True True True False False False 1.0 2.0 3.0 4.0 5.0 6.0 2.0 2.0 2.0 2.0 2.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
4 3.292763e+09 True True True True True True True True 5.0 6.0 1.0 2.0 4.0 3.0 1.0 1.0 1.0 1.0 1.0 2.0 1.0 2.0 5.0 1.0 1.0 1.0 1.0 1.0 I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
5 3.292731e+09 True True True True True True True True 5.0 4.0 6.0 2.0 1.0 3.0 1.0 2.0 2.0 5.0 1.0 6.0 2.0 3.0 1.0 2.0 2.0 6.0 2.0 2.0 Greedo Yes No No Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central
In [33]:
## Find the mean ranking:
#1. Create a new dataframe of ranking
cols_ranking = starwar.columns[9:15]
rank = starwar.copy()[cols_ranking]

#2. Compute the mean:
mean_rank = rank.mean()

#3. Plot the rank:
# Define the tick names
tick_name = ['Episode_{}'.format(i) for i in range(1, 7)]

mean_rank.plot.bar(legend=True, label='Ranking', rot=30)
plt.xticks(ticks=range(0,6), labels=tick_name)
plt.xlabel('Episode')
plt.ylabel('Mean ranking')

plt.title('Highest ranking of each episode\nLower is better')
plt.show()

We've got the first result: Episode V (The Empire Strikes Back) has the best mean ranking among the survey respondents (lower is better).
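The same reading can be done without the chart. The sketch below uses a small made-up table (the values and the `toy_rank` name are assumptions for illustration, not survey data) and picks the column with the lowest mean rank, since 1 is the best rank.

```python
import pandas as pd

# Toy rankings, made up for illustration: 1 = best, 6 = worst
toy_rank = pd.DataFrame({
    'ranking_1': [3, 5, 4],
    'ranking_2': [4, 6, 5],
    'ranking_5': [1, 2, 1],
})

toy_mean = toy_rank.mean()        # mean rank per episode column
best_column = toy_mean.idxmin()   # lowest mean rank = best-ranked episode

print(toy_mean.sort_values())
print(best_column)
```

On the real data, the same two calls could be applied to `mean_rank` from the cell above.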

Now, let's look a little deeper:

  • Is the ranking affected by Age, by Education, or by Location?

To do that, we'll:

  • Check the unique values of each of these fields, and clean them if needed
  • Aggregate the data by each of these fields (each into a new DataFrame)

2: Clean data

In [34]:
## Check the unique value:
#1. Create the list unique value
age_value = []
education_value = []
location_value = []

array = [-1, -2, -4]
for name in starwar.columns[array]:
    if name == 'Age':
        age_value.append(starwar[name].value_counts(dropna=False).sort_index())
    elif name == 'Education':
        education_value.append(starwar[name].value_counts(dropna=False).sort_index())
    else:
        location_value.append(starwar[name].value_counts(dropna=False).sort_index())
        
#2. Store the results in a dict:
store = {}
list_data = [age_value, education_value, location_value]
list_name = ['Age', 'Education', 'Location']
for item, value in zip(list_name, list_data):
    store[item] = value

store
Out[34]:
{'Age': [18-29    218
  30-44    268
  45-60    291
  > 60     269
  NaN      140
  Name: Age, dtype: int64],
 'Education': [Bachelor degree                     321
  Graduate degree                     275
  High school degree                  105
  Less than high school degree          7
  Some college or Associate degree    328
  NaN                                 150
  Name: Education, dtype: int64],
 'Location': [East North Central    181
  East South Central     38
  Middle Atlantic       122
  Mountain               79
  New England            75
  Pacific               175
  South Atlantic        170
  West North Central     93
  West South Central    110
  NaN                   143
  Name: Location (Census Region), dtype: int64]}
In [35]:
# Check the missing value of each field:
starwar[starwar['Age'].isnull()]
Out[35]:
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 ranking_1 ranking_2 ranking_3 ranking_4 ranking_5 ranking_6 han_solo luke_skywalker princess_leia_organa anakin_skywalker obi_wan_kenobi emperor_palpatine darth_vader lando_calrissian boba_fett c-3p0 r2_d2 jar_jar_binks padme_amidala yoda Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
11 3.292638e+09 True NaN False False False False False False NaN NaN NaN NaN NaN NaN 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 3.292635e+09 False NaN False False False False False False NaN NaN NaN NaN NaN NaN 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
81 3.291669e+09 True NaN False False False False False False NaN NaN NaN NaN NaN NaN 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
87 3.291650e+09 False NaN False False False False False False NaN NaN NaN NaN NaN NaN 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 NaN NaN NaN No NaN NaN NaN NaN NaN
97 3.291570e+09 True NaN False False False False False False NaN NaN NaN NaN NaN NaN 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1109 3.288512e+09 True NaN False False False False False False NaN NaN NaN NaN NaN NaN 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1140 3.288460e+09 True NaN False False False False False False NaN NaN NaN NaN NaN NaN 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1142 3.288459e+09 True NaN False False False False False False NaN NaN NaN NaN NaN NaN 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1144 3.288456e+09 True NaN False False False False False False NaN NaN NaN NaN NaN NaN 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1162 3.288418e+09 True False True True True False True True 3.0 4.0 5.0 6.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 5.0 3.0 3.0 2.0 2.0 5.0 3.0 3.0 Greedo No NaN No NaN NaN NaN NaN NaN

140 rows × 38 columns

We can see something odd here:

  • Some respondents said they have seen at least one of the 6 episodes, but left every per-episode 'seen' check-box unanswered
  • The same records have no response to any of the other questions

=> To see this pattern clearly, we'll use a heatmap of the missing values, where missing values are shown in a light color and present values in black

In [36]:
# Check missing value by heatmap graph:
#1. Import seaborn library:
import seaborn as sns      

#2. Define the plotting function:
def plot_null(df):
    # Set the figure size
    plt.figure(figsize=(20,10))
    # Boolean mask: True where a value is missing
    data = df.isnull()
    # Plot the mask as a heatmap
    sns.heatmap(data, cbar=False, yticklabels=False)
    plt.xticks(rotation=90, size='x-large')
    
plot_null(starwar)

We can see that:

  • Wherever Age is missing, the other fields are missing too, except the actor_favourite fields (han_solo to yoda), because we have already converted both Unfamiliar and NaN to D rank => If we remove the records with a missing Age, we lose no data in the other fields
  • For the fields Do you consider yourself to be..., we will convert all missing values and No to False before analyzing them, because the missing pattern and the data pattern of the two fields differ significantly.

=> We will remove all records with a missing Age.
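The heatmap reading can also be checked numerically. The sketch below is a minimal example on a made-up frame (the `toy` data is an assumption, not survey data): it restricts the table to the rows where Age is missing and reports, per column, the share of values that are missing there. A share of 1.0 means dropping those rows loses nothing in that column.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    'Age':       ['18-29', np.nan, '30-44', np.nan],
    'Gender':    ['Male', np.nan, 'Female', np.nan],
    'Education': ['Bachelor degree', np.nan, np.nan, 'Graduate degree'],
})

age_missing = toy['Age'].isnull()

# Share of missing values per column, only among rows where Age is missing
share_missing = toy[age_missing].isnull().mean()
print(share_missing)
```

Here Gender is always missing when Age is missing, while Education is missing in only half of those rows, so dropping on Age would discard one real Education answer.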

In [37]:
# Remove the missing value at Age field:
starwar.dropna(subset=['Age'], inplace=True)

For the Education field, the missing records are almost the same as those of Age, so we have already removed most of them. For the remaining missing values, we'll look for a correlation between the Age field and the Education field and, if there is one, fill them in with the value corresponding to each record's Age.

For Household Income, we will deal with it later.

In [38]:
# Find the correlation of `Age` and `Education` field:
col_c = ['Age', 'Education']
print(starwar[col_c].notnull().corr())

# Show the `Age` of the records where `Education` is missing:
starwar[col_c][starwar['Education'].isnull()]
           Age  Education
Age        NaN        NaN
Education  NaN        1.0
Out[38]:
Age Education
25 18-29 NaN
33 18-29 NaN
88 > 60 NaN
93 18-29 NaN
263 45-60 NaN
415 45-60 NaN
527 30-44 NaN
673 30-44 NaN
823 45-60 NaN
929 > 60 NaN

The correlation between the Age and Education fields comes out as NaN (Age has no missing values left, so its not-null indicator is constant and no correlation can be computed), and the same holds for Location => It's better not to fill in the values: only 10 Education records are missing, and deleting them has no large effect on the number of remaining records.

=> Let's delete the records with a missing Education too, and do the same for any missing Location.
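To illustrate why the correlation above is NaN, here is a minimal sketch on made-up data (the `toy` frame is an assumption). Once a column has no missing values, its not-null indicator is constant, so its correlation with anything is undefined. A cross-tabulation is one possible alternative for seeing how the missing Education records spread over the Age groups.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration: Age is complete, Education has gaps
toy = pd.DataFrame({
    'Age':       ['18-29', '30-44', '45-60', '> 60'],
    'Education': ['Bachelor degree', np.nan, 'Graduate degree', np.nan],
})

# 1 where a value is present, 0 where it is missing
indicator = toy.notnull().astype(int)

# The Age indicator is constant, so every correlation involving it is NaN
corr = indicator.corr()
print(corr)

# Alternative view: count missing Education records per Age group
missing_by_age = pd.crosstab(toy['Age'], toy['Education'].isnull())
print(missing_by_age)
```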

In [39]:
#Check the `Location` missing value:
print(starwar.iloc[:,-1].isnull().sum())

# Delete the missing record in `Education`:
starwar.dropna(subset=['Education'], inplace=True)
3
In [40]:
# Delete the missing record in `Location`:
starwar.dropna(subset=[starwar.columns[-1]], inplace=True)
In [41]:
# Check for the missing value at 3 field we modify:
num = [-1, -2, -4]
order = starwar.columns[num]

starwar[order].isnull().sum()
Out[41]:
Location (Census Region)    0
Education                   0
Age                         0
dtype: int64

The last cleaning step in this chapter is for Gender. Looking back at the missing-value heatmap, we observe that almost all records with a missing Age are missing Gender too; so any remaining missing values can be dropped now without worrying about losing data.

In [42]:
# Check the missing value at `Gender`:
starwar['Gender'].isnull().sum()
Out[42]:
0

Luckily, the assumption was true: all the missing values in Gender were removed along with Age when we cleaned the Age field. Now, let's move on to the analysis by:

  • Education, Age and Location (first, to expand the analysis of the ranking of each episode above)
  • Gender

For the next analysis step, there are a few ways to aggregate the data:

  1. Age + Gender (aggregate by Age and filter by Gender)
  2. Location + Age or Gender + Location

3: Discover the highest ranking Episode by Age + Gender

We've looked at the graph above and seen which episode received the best ranking (lower is better) => Let's look behind the curtain. We want to know:

  • For men and women in each age group, which group contributes the most to the result? => This tells us which group drives the result (good or bad).
  • What average rank does each group give to the Episode? => This can help explain the ranking we got in the bar chart above.
In [43]:
#Check the 'gender' value:
starwar['Gender'].value_counts(dropna=False)
Out[43]:
Female    545
Male      489
Name: Gender, dtype: int64
In [44]:
# Aggregate by 'Age'
age_group = starwar.copy().groupby('Age')

#Filter by `Male`:
age_male = age_group.apply(lambda x: x[x['Gender']=='Male'])
rank_by_male = age_male[age_male.columns[9:15]]
# Sum the rankings within each age group:
sum_rank_by_male = rank_by_male.reset_index().groupby('Age').agg(np.sum)
result = sum_rank_by_male['ranking_5']
        
#Filter by `Female`:
age_female = age_group.apply(lambda x: x[x['Gender']=='Female'])
rank_by_female = age_female[age_female.columns[9:15]]
# Sum the rankings within each age group:
sum_rank_by_female = rank_by_female.reset_index().groupby('Age').agg(np.sum)
result_2 = sum_rank_by_female['ranking_5']
In [45]:
def plot_the_pie(series_1, series_2, df, title_1, title_2, col_name):
    
    # Define data for each group:
    group_1 = list(series_1)
    group_2 = list(series_2)

    # Create explode data:
    explodes = [0.2, 0, 0.3, 0]

    # Create auto pct funct:
    def get_pct(pct, data):
        result = int(pct / 100.*np.sum(data)) ##formula: percentage/100 * sum(data) = item in data
        return ("{:.1f}% \n {}".format(pct, result))
    # Create wedge properties:
    wd = {'linewidth': 1, 'edgecolor':'black'}
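    # Create labels from the grouped data's index, so each label stays aligned with its value
    # (df[col_name].unique() follows order of appearance, which may differ from the groupby order):
    key = list(series_1.index)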

    # Plot the pie:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
  
    # For male group  
    wedges, texts, autotexts = ax1.pie(x=group_1,
                     autopct= lambda pct: get_pct(pct, group_1),
                     explode = explodes,
                     wedgeprops=wd,
                     colors=['cyan','pink','violet','green'],
                     labels=key,
                     startangle=90)
    ax1.legend(wedges, key, title='Age',loc=6, bbox_to_anchor=[1,1]) #Set legend box
    plt.setp(autotexts, size = 12, weight ="bold")
    ax1.set_title('{}'.format(title_1)) # Set title
    
    # For female group
    wedges, texts, autotexts = ax2.pie(x=group_2,
                     autopct= lambda pct: get_pct(pct, group_2),
                     explode = explodes,
                     wedgeprops=wd,
                    colors=['cyan','pink','violet','green'],
                     labels=key,
                     startangle=90)
#     ax2.legend(wedges, key, title='Age', loc=6, bbox_to_anchor=[1,1])
    plt.setp(autotexts, size = 12, weight ="bold")
    ax2.set_title('{}'.format(title_2))
    plt.show()
In [46]:
plot_the_pie(result, result_2, starwar, 
            'Ranking ratio for Episode V by age\nGender: Male', 'Ranking ratio for Episode V by age\nGender: Female',
            'Age')

Looking at the graph (each slice is a group's share of the summed Episode V ranking points), we get an overview of our respondents:

  • The largest share comes from the Senior group (above 60 years old), and there the men's share is larger than the women's
  • The second largest comes from the 45-60 group, but this time the women's share is larger than the men's
  • The smallest shares come from the 18-29 and 30-44 groups, and the men/women ratio in one is the opposite of the other.
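Because the pies are built from a sum of rank values, a large slice can mean many respondents, poor ranks, or both. A minimal sketch on made-up data (the `toy` frame is an assumption) shows how the count, sum and mean per group can be put side by side to separate the two effects.

```python
import pandas as pd

# Toy data, made up for illustration: 1 = best rank, 6 = worst
toy = pd.DataFrame({
    'Age':       ['18-29', '18-29', '> 60', '> 60', '> 60'],
    'ranking_5': [6, 5, 1, 1, 2],
})

# count = number of respondents, sum = what the pie slices use, mean = average rank
summary = toy.groupby('Age')['ranking_5'].agg(['count', 'sum', 'mean'])
print(summary)
```

In this toy case the 18-29 group has fewer respondents but the larger sum, simply because it ranked the episode worse.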

After getting the ranking ratio, we can focus on the average rank each group gives this Episode, and make a bar chart to see what happens.

In [47]:
## Get the mean point of each gender:
#Male:
mean_rank_by_male = rank_by_male.reset_index().groupby('Age').agg(np.mean)
mean_1 = mean_rank_by_male['ranking_5']

#Female:
mean_rank_by_female = rank_by_female.reset_index().groupby('Age').agg(np.mean)
mean_2 = mean_rank_by_female['ranking_5']
In [48]:
#Plot the bar chart:
# Take the labels from the grouped data's index so each bar lines up with its own age group
age_labels = list(mean_1.index)
plt.barh(y = age_labels, width = list(mean_1), label='Male', height=0.3)
plt.barh(y = age_labels, width = list(mean_2), align='edge', height=-0.3, label='Female')

plt.yticks(ticks=np.arange(0,4), labels=age_labels)
plt.xlabel('Ranking points in average')
plt.ylabel('Age group')

plt.legend()
plt.title('Ranking point by each gender in each age group\n for StarWar Episode V')
plt.show()

Comparing with the pie charts of ranking share above, we can divide the result into 2 areas: Love and urghh, normal:

  • Love group: there is a lot to say here:
  1. The love group contains only men aged 30-44, so it is the only age group that 'saves' the ranking of Episode V
  2. The > 60 group has quite a large number of respondents ranking this Episode. They still like it, but their score sits between love and normal. In other words, if the men aged 30-44 like this Episode very much, the men in the Senior group have a somewhat paler taste for it
  • urghh, normal:
  1. Group 18-29: the women's average rank is close to 3.0, worse than the men's (2.5) => It seems the women in this group are not very interested in the episode.
  2. Group 45-60: both the men's and the women's averages are greater than 2.5 (between 2.5 and 3.0) => These respondents don't feel they got much out of the episode; the feeling is urghh, normal, normal and normal.
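The numbers behind a comparison like this can be laid out in one table. The following is a minimal sketch on made-up data (the `toy` frame and its values are assumptions): grouping by both Age and Gender and unstacking Gender gives one row per age group and one column per gender.

```python
import pandas as pd

# Toy data, made up for illustration: 1 = best rank, 6 = worst
toy = pd.DataFrame({
    'Age':       ['18-29', '18-29', '18-29', '30-44', '30-44', '30-44'],
    'Gender':    ['Male', 'Male', 'Female', 'Male', 'Male', 'Female'],
    'ranking_5': [2, 3, 3, 1, 2, 3],
})

# Mean Episode V rank per (Age, Gender), reshaped to Age rows x Gender columns
mean_table = toy.groupby(['Age', 'Gender'])['ranking_5'].mean().unstack('Gender')
print(mean_table)
```

This is an alternative to filtering each gender separately; the cells above keep the two-step approach.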

4: Expand for Location: which region gives this episode the best ranking?

To expand the result, we'll look at the ranking ratio by Location. We will draw two charts: a pie chart showing how the ranking ratio is distributed across regions, and a bar chart showing the average rank each region gave to the episode. Our purpose is to:

  • Answer the question: which region do most of the results come from?
  • Evaluate the result (average rank): does the region contributing the most results affect the overall result?
In [49]:
## Aggregate data:
location = starwar.copy().groupby(starwar.columns[-1])

## Ranking ratio of each region:
sum_by_location = location.agg(np.sum)
to_piechart = sum_by_location['ranking_5']
## Average ranking point by each region:
mean_by_location = location.agg(np.mean)
to_barchart = mean_by_location['ranking_5']
In [50]:
def plot_the_pie_and_bar(data_to_piechart, data_to_barchart,
                        num_cols , jud_num, df,title_of_legend,titl_piechart,titl_xlabel,titl_barchart,
                         greater_or_less='greater'):
    
    # Get the data:
    piechart = list(data_to_piechart)
    barchat = list(data_to_barchart)
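    # Define the autopct function for the pie chart:
    def get_func(pct, data):
        result = int(pct/100*np.sum(data))
        return ('{:.1f}% \n {} votes').format(pct, result)
    # Define wedge props:
    wd = {'linewidth':1, 'edgecolor':'black'}
    # Define labels from the grouped data's index, so each label stays aligned with its value
    # (unique() follows order of appearance, which may differ from the groupby order):
    key = list(data_to_piechart.index)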

    #Plot process:
    #1. Define the frame:
    fig, (ax1, ax2) = plt.subplots(1,2, figsize=(20,10))
    #2. Plot the pie chart:
    wedges, texts, autotexts = ax1.pie(x=piechart,
                                  autopct= lambda pct:get_func(pct,piechart),
                                  
                                  
                                  startangle= 90,
                                  wedgeprops=wd)
    ax1.legend(wedges, key, title='{}'.format(title_of_legend), loc=6, bbox_to_anchor=[-0.3, 0.8])
    plt.setp(autotexts, size=12, weight='bold')
    ax1.set_title('{}'.format(titl_piechart))
    #3. Plot the bar chart:
    if greater_or_less=='greater':
        color_def = data_to_barchart>jud_num #To filter the region with ranking point greater than 2.5
        color_def_fil = color_def.map({True:'orange', False:'grey'})
        ax2.barh(y=np.arange(1,len(barchat)+1), width=barchat, align='center', color=color_def_fil)
    else:
        color_def_2 = data_to_barchart<jud_num
        color_def_fil_2 = color_def_2.map({True:'green',False:'grey'})
        ax2.barh(y=np.arange(1,len(barchat)+1), width=barchat, align='center', color=color_def_fil_2)
    ax2.set_yticks(np.arange(1,len(barchat)+1))
    ax2.set_yticklabels(key)
    ax2.set_xlabel('{}'.format(titl_xlabel))
    ax2.set_title('{}'.format(titl_barchart))

    plt.show()
In [51]:
plot_the_pie_and_bar(to_piechart, to_barchart, -1, 2.5, starwar,
                     'Location', 'Ranking ratio of each region', 'Average ranking (points)',
                     'The average ranking point of each region', 'greater')

For convenience, regions with an average rank greater than 2.5 are marked in orange. Looking at the result, we get:

  • Most votes come from the Pacific region, and sadly it ranked the episode above 2.5 points (approximately 2.7 points). The next two largest shares come from Mountain and South Atlantic; they ranked it below 2.5 points, but close to it (2.4 points) => This result sheds some light on the current ranking of Episode V of StarWar: in certain regions, it's just urrgh, normal
  • The three remaining regions with a score greater than 2.5 points are East North Central, Middle Atlantic and New England, which place 8th (7.4%), 6th (9.3%) and 7th (8.8%) in the ratio
  • The rest ranked it below 2.5 points, with averages between 2.0 and 2.5.
  • The only region giving the best ranking is West South Central, but it places last in the ratio distribution across regions (3.5%)
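The observations above can be cross-checked in table form. This is a minimal sketch on made-up data (the `toy` frame and the `region_col` helper name are assumptions): it computes each region's share of respondents, sorts the regions by mean rank, and lists the ones above the 2.5 threshold used for the orange bars.

```python
import pandas as pd

region_col = 'Location (Census Region)'

# Toy data, made up for illustration: 1 = best rank, 6 = worst
toy = pd.DataFrame({
    region_col:  ['Pacific', 'Pacific', 'Pacific', 'Mountain', 'Mountain', 'New England'],
    'ranking_5': [3, 3, 2, 2, 1, 4],
})

# Share of respondents per region (independent of how they ranked)
respondent_share = toy[region_col].value_counts(normalize=True)

# Mean Episode V rank per region, best (lowest) first
mean_by_region = toy.groupby(region_col)['ranking_5'].mean().sort_values()

# Regions whose mean rank is worse than the 2.5 threshold
above_threshold = mean_by_region[mean_by_region > 2.5].index.tolist()

print(respondent_share)
print(mean_by_region)
print(above_threshold)
```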

Connecting this with the Age and Gender result, where men aged 30-44 love this episode: do they show up in West South Central? Let's make a quick bar chart of the average rank they gave this Episode.

In [52]:
## Aggregate data by Location: West South Central
west_south = location.get_group('West South Central')

##Filter by `Male` and give it the average points:
west_south_male = west_south[west_south['Gender']=='Male']
ranking_by_age_west_south_male = west_south_male.groupby('Age')[west_south_male.columns[9:15]]
mean_ranking = ranking_by_age_west_south_male.agg(np.mean)
In [53]:
## Plot the barchart:
color_bar = mean_ranking['ranking_5']<2
color_fill = color_bar.map({True:'Green', False:'Grey'})

mean_ranking['ranking_5'].plot.barh(color=color_fill)
plt.xlabel('Average points')
plt.ylabel('Age group')

plt.title('The average ranking point by\nMr. in West South Central region')
plt.show()

It looks like we've found them: the fans of Episode V of StarWar are men aged 30-44 in West South Central, and sadly they are the only ones who love this Episode wholeheartedly. Even in their own region and among the same gender, the Senior group finds it urrgh, normal, and the rest find it good, but are not too much in love with it.

Now that we've found the fans of this Episode, let's see who their favourite characters are. As for the total views of each episode, we will cover that after the next analysis below, since it deserves a topic of its own.

CONCLUSION:

  • Fans of Episode V: The Empire Strikes Back do exist: they are men aged 30-44 in West South Central
  • Almost everyone else's feeling ranges from Somewhat Favourable to Normal, especially people aged 45-60 and 18-29; the group above 60 is fine with it to about the same degree
  • Most votes come from the Pacific region, but respondents there are not very enthusiastic about this Episode.

About the favourite characters: what are the tastes of the Episode V fans?

We can be fairly sure of one thing: only fans of Episode V are likely to be crazy about its characters. So let's start with the men aged 30-44 in the West South Central region.

In [54]:
# Create the condition:
con_1 = starwar['Gender'] == 'Male'
con_3 = starwar['Age'] == '30-44'
total = con_1&con_3

#Filter the data:
start_1 = starwar.copy()[total]
start_2 = start_1[start_1[starwar.columns[-1]]=='West South Central']
compute = start_2[start_2.columns[15:29]]
# Compute the mean:
favour_1 = compute.mean()
In [55]:
#Plot the bar chart:
#Define color bar:
color_fil = favour_1<2
color_tab = color_fil.map({True:'green', False:'grey'})

favour_1.plot.barh(color = color_tab, rot=10)
plt.xlabel('Favourable')
plt.ylabel('Characters name')

plt.title('The favourite characters by fan of Episode V')
plt.show()

Among the many characters here, all are rated between Somewhat Favourable and Normal => We still can't say who the most favoured character is for this group.

We've now seen the tastes of the Episode V fans; let's go back to their whole region to see which characters are favourites there. We expect the result not to differ much, because everyone else's feeling is about 50:50, so they are unlikely to focus on any particular character.

In [56]:
#Filter the data:
west_south_charac = west_south.loc[:,starwar.columns[15:29]]
mean_favou = west_south_charac.agg(np.mean)
In [57]:
#Plot the bar chart:
#Define color bar:
color_fil_2 = mean_favou<2.5
color_tab_2 = color_fil_2.map({True:'green', False:'grey'})

mean_favou.plot.barh(color = color_tab_2, rot=10)
plt.xlabel('Favourable')
plt.ylabel('Characters name')

plt.title('The favourite characters by West South Central Region')
plt.show()

We still get the same 7 familiar names as above, but this time all of them are rated Somewhat Favourable. Let's check one optional item, all the women in the West South Central region, and finally dig into the whole data set to confirm one thing: because the ranking for this film is not in 'loving it' territory (above 2, closer to 2.5 points), the favourability of each character, if any, won't be high either; we can expect it to be Somewhat Favourable

In [58]:
#Aggregate data by Ms/ Mrs:
west_south_female = west_south[west_south['Gender']=='Female']

# Compute the mean case 1: Ms/ Mrs in West South Central
female_favour_char = west_south_female[starwar.columns[15:29]].mean()

#Case 2: Get all the mean favour in data
cal_rec = starwar[starwar.columns[15:29]]
cal_mean = cal_rec.mean()
In [59]:
#Plot the bar chart for case 1
#Define color bar:
color_fil_3 = female_favour_char<2.5
color_tab_3 = color_fil_3.map({True:'green', False:'grey'})

female_favour_char.plot.barh(color = color_tab_3, rot=10)
plt.xlabel('Favourable')
plt.ylabel('Characters name')

plt.title('The favourite characters by all Ms/ Mrs \n in West South Central Region')
plt.show()

The result here is the same as in the graph above for the whole West South Central region. Apparently, because so many people, including the women here, are not much interested in Episode V, the favourability ratings are somewhat affected by this.

In [60]:
#Plot the bar chart for case 2
#Define color bar:
color_fil_4 = cal_mean<2
color_tab_4 = color_fil_4.map({True:'green', False:'grey'})

cal_mean.plot.barh(color = color_tab_4, rot=10)
plt.xlabel('Favourable')
plt.ylabel('Characters name')

plt.title('The favourite characters by all attendance')
plt.show()

For this final test, the chart highlights only the characters with a favourability rating below 2 (closer to Somewhat Favourable), and we get 2 characters: Han-Solo and Yoda. For these two, we will keep only the records in which either character has a favourability rating below 2, to see in which regions, age groups and genders these two characters are liked.

In [61]:
## Keep only the records where han_solo or yoda has a favourability rating below 2.0
favour_cha = starwar.copy()[(starwar['han_solo']<2)|(starwar['yoda']<2)]

## Aggregate data:
#1. By location:
cha_loc = favour_cha.groupby(starwar.columns[-1])
favou_loc = cha_loc[starwar.columns[15:29]].agg(np.mean)
sum_favour_loc = cha_loc[starwar.columns[15:29]].agg(np.sum)

#2. By Age and Gender:
cha_age = favour_cha.groupby(starwar['Age'])
cha_gender = favour_cha.groupby(starwar['Gender'])
In [62]:
# Prepare data:
#Case 1: For `Han-Solo`
to_piechart_case1 = sum_favour_loc['han_solo']
to_barchat_case1 = favou_loc['han_solo']

#Case 2: For 'Yoda':
to_piechart_2 = sum_favour_loc['yoda']
to_barchart_2 = favou_loc['yoda']
In [63]:
plot_the_pie_and_bar(to_piechart_case1, to_barchat_case1, -1, 1.2, starwar,
                     'Location', 'The favourite ranking for Han-Solo\n by each region', 
                     'Average ranking (points)',
                     'The average favourite ranking point for Han-Solo of each region', 'less')