Scenario: While waiting for Star Wars: The Force Awakens to come out, the team at FiveThirtyEight became interested in answering some questions about Star Wars fans. In particular, they wondered: does the rest of America realize that “The Empire Strikes Back” is clearly the best of the bunch?
The team needed data to address this question, so they surveyed Star Wars fans through the online tool SurveyMonkey and received 835 total responses. We will now clean the data to uncover the insights hidden in this survey: do Star Wars fans feel that The Empire Strikes Back is the best of the series?
To be clear, The Empire Strikes Back was released in 1980, and by the time the FiveThirtyEight team ran this survey, four more episodes had been released after it. Our task is to extract insights from the survey data to answer the question FiveThirtyEight is interested in: we will accept or reject the claim depending on what the data shows. The conclusion at the end of this project will be the final answer for the team.
We will be working with a CSV file, but we don't always know a file's encoding in advance. So we will write a function that checks the file encoding before we load it into a DataFrame.
# pip install chardet
## Check the file encoding:
from chardet.universaldetector import UniversalDetector

def detect_encode(file_name):
    detector = UniversalDetector()
    for line in open(file_name, 'rb'):
        detector.feed(line)
        if detector.done:
            break
    detector.close()
    print(detector.result)
Now let's download the file to our local machine and check its encoding.
# # Download the file:
# import opendatasets as od
# page='https://raw.githubusercontent.com/fivethirtyeight/data/master/star-wars-survey/StarWars.csv'
# od.download(page)
## Check the encoding:
detect_encode('StarWars.csv')
The file encoding is Windows-1252, so we will pass this encoding name into the read_csv call that builds our DataFrame, then check the file one final time to make sure everything is OK.
## Load library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
## Load data in:
starwar = pd.read_csv('StarWars.csv', encoding='Windows-1252', delimiter=',')
starwar.info()
Let's look at a few records to understand the data more clearly:
# Setting for maximum columns = 60
pd.set_option('display.max_columns',60)
# Check the first 5 records:
starwar.head()
We can now see the structure of this survey.
1. Handle Yes/No columns
We have two Yes/No fields:
(1st) Have you seen any of the 6 films in the Star Wars franchise?
(2nd) Do you consider yourself to be a fan of the Star Wars film franchise?
Since each field can contain missing values (respondents who declined to answer), we do the following: from the df.info() output we know the first field has no missing values but the second one does, so we will count the unique values again to confirm, then convert the answers with the series.map() function. The process is shown in the code blocks below.
# Count the unique value again:
starwar['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)
starwar['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
Skipping the 'Response' header record, we will now convert the remaining values:
# Convert the values to boolean:
yes_no = {'Yes': True, 'No': False}
col = ['Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?']
for item in col:
    starwar[item] = starwar[item].map(yes_no, na_action='ignore')
# Check the result:
starwar['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna=False)
starwar['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna=False)
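As a toy illustration (not the survey data) of how `Series.map` with `na_action='ignore'` treats missing answers:

```python
import pandas as pd

# Toy series standing in for a Yes/No survey column with one skipped answer
answers = pd.Series(['Yes', 'No', None])
mapped = answers.map({'Yes': True, 'No': False}, na_action='ignore')
# The missing entry is passed through untouched rather than looked up in the dict
print(mapped.tolist())
```

This is why the NaN counts survive the conversion above and still show up in `value_counts(dropna=False)`.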
2. Clean check-box columns
Everything seems OK, so let's take the next step with the next 6 questions. We can see that the column header is:
Which of the following Star Wars films have you seen? Please select all that apply.
and right below it, in the first record, is a film title: Star Wars: Episode I The Phantom Menace. That means the first of these columns asks whether the respondent has seen Episode I, the next column covers Star Wars: Episode II Attack of the Clones, and so on. With that said, we will rename each of these columns, for example:
Which of the following Star Wars films have you seen? Please select all that apply.
=> seen_1
# Check the unique values in each field:
def check_value(df, start, end):
    result = []
    for item in df.columns[start:end]:
        for val in df[item].unique():
            # Each check-box column holds either the film title or a missing value
            if pd.isnull(val):
                val = np.nan
            if val not in result:
                result.append(val)
    return result

check_value(starwar, 3, 9)
# Convert the values to boolean:
unique_values = check_value(starwar, 3, 9)  # renamed so it no longer shadows the function below
mapper = {}
for name in unique_values:
    # A film title means "seen" (True); a missing answer means "not seen" (False)
    mapper[name] = not pd.isnull(name)

def convert_value(df, start, end):
    for i in df.columns[start:end]:
        df[i] = df[i].map(mapper)
    return df

convert_value(starwar, 3, 9)
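The same conversion can be reached without building a mapper at all: since each check-box cell holds either the film title or nothing, "seen" is simply "the cell is not empty". A minimal sketch on a toy series (not the survey data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for one check-box column: title if seen, NaN otherwise
col = pd.Series(['Star Wars: Episode I  The Phantom Menace', np.nan,
                 'Star Wars: Episode I  The Phantom Menace'])
seen = col.notnull()  # a non-null answer means the film was seen
print(seen.tolist())
```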
# Convert the column names:
def convert_columns(df, start, end, str_, number):
    cols = df.columns[start:end]
    revert = ['{}_{}'.format(str_, i) for i in range(1, number + 1)]
    mapp = dict(zip(cols, revert))
    df.rename(mapp, inplace=True, axis=1)

## Run the function
convert_columns(starwar, 3, 9, 'seen', 6)
## Check some records:
starwar.head()
3. Clean the ranking columns
Similar to the check-box columns above, but this time we don't need as much cleaning. Instead, we rename the columns to ranking_n and cast the values to a numeric type.
# Cast the value:
starwar[starwar.columns[9:15]] = starwar.loc[2:,starwar.columns[9:15]].astype('float')
# Check:
starwar[starwar.columns[9:15]].dtypes
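An alternative way to cast such columns, sketched here on toy data, is `pd.to_numeric` with `errors='coerce'`, which turns non-numeric leftovers such as the 'Response' sub-header into NaN instead of raising:

```python
import pandas as pd

raw = pd.Series(['Response', '3', '1', None])  # toy stand-in for a ranking column
ranks = pd.to_numeric(raw, errors='coerce')    # non-numeric strings become NaN
print(ranks.tolist())
```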
# Rename the columns:
# Convert the name of column:
convert_columns(starwar, 9, 15, 'ranking', 6)
## Check:
starwar.head()
4. Clean the favorability columns
Like the check-box and ranking fields, the favorability fields share one structure. The question
Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.
spans the columns up to Unnamed: 28, and the character names sit in the first record: the question column corresponds to Han Solo, Unnamed: 16 to Luke Skywalker, and so on. The possible answers are Very favorably, Somewhat favorably, Neither favorably nor unfavorably (neutral), Unfamiliar (N/A), Somewhat unfavorably and Very unfavorably.
So what we should do is rename each column after its character and map each answer to a numeric rank. The conversion process is shown in the code blocks below.
# Get the column names and the character names:
list_actor_name = starwar.iloc[0, 15:29].values
actor_name = []
for ac_name in list_actor_name:
    names = (ac_name
             .strip()
             .replace(' ', '_')
             .lower())
    actor_name.append(names)
# Rename the columns after the characters:
old_name = starwar.columns[15:29]
mapp1 = {}
for od, nw in zip(old_name, actor_name):
    mapp1[od] = nw
starwar.rename(mapper=mapp1, inplace=True, axis=1)
# Check the result:
starwar.columns[15:29]
# Check the unique value in each field:
starwar['han_solo'].value_counts(dropna=False)
We will make a small change to our plan. The field has six answer levels:
Very favorably, Somewhat favorably, Neither favorably nor unfavorably (neutral), Unfamiliar (N/A), Somewhat unfavorably, Very unfavorably.
=> We will standardize them as numeric ranks 1 through 6 in that order, and treat Unfamiliar (N/A) the same as a missing answer (in the description, Unfamiliar means unknown, so it can be considered a missing value) by filling NaN with rank 4 as well.
#Convert the values:
mapping = {'Very favorably': 1, 'Somewhat favorably': 2,
           'Neither favorably nor unfavorably (neutral)': 3,
           'Unfamiliar (N/A)': 4, 'Somewhat unfavorably': 5, 'Very unfavorably': 6}
colume = starwar.columns[15:29]
for mark in colume:
    starwar[mark] = starwar[mark].map(mapping, na_action='ignore')
#Check:
starwar['han_solo']
# Fill missing values (treated the same as 'Unfamiliar', rank 4):
df1 = starwar[colume].fillna(4)
starwar.loc[:, colume] = df1
starwar.head()
For convenience in the later steps, we'll drop the first record, which is filled only with 'Response' and other header text.
## Drop the first records:
starwar.drop(index=0,axis=0, inplace=True)
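An equivalent approach, sketched here on an in-memory CSV rather than StarWars.csv, is to drop the sub-header row at read time with `skiprows`, so there is nothing to delete afterwards:

```python
import io
import pandas as pd

csv = ("q1,q2\n"
       "Response,Response\n"   # sub-header row, like record 0 in the survey file
       "Yes,3\n"
       "No,1\n")
df = pd.read_csv(io.StringIO(csv), skiprows=[1])  # skip the sub-header row
print(df)
```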
Now that the data is clean, let's consider some exploratory questions about it: how do Male and Female respondents differ in their taste in characters? What is their average ranking of the highest-ranked episode? Other topics may come up along the way during the analysis.
1. First, let's find the episode with highest ranking:
# Fill in the values for the first remaining record:
#Check the data:
starwar.iloc[0, 9:15].values
#Replace the data:
replace = [3, 2, 1, 4, 5, 6]
for item, i in zip(replace, range(9, 15)):
    starwar.iloc[0, i] = item
#Check:
starwar.head()
## Find the mean ranking:
#1. Create a new dataframe of ranking
cols_ranking = starwar.columns[9:15]
rank = starwar.copy()[cols_ranking]
#2. Compute the mean:
mean_rank = rank.mean()
#3. Plot the rank:
tick_name = []
for i in range(1, 7):
    tick_name.append('Episode_{}'.format(i))  # Define the tick names
mean_rank.plot.bar(legend=True, label='Ranking', rot=30)
plt.xticks(ticks=range(0,6), labels=tick_name)
plt.xlabel('Episode')
plt.ylabel('Mean ranking')
plt.title('Highest ranking of each episode\nLower is better')
plt.show()
We've got our first result: Episode V (The Empire Strikes Back) is the episode with the best ranking from the survey respondents (lower is better).
Now let's dig a little deeper: how does the result break down by Age? By Education? Or by Location? To do that, we'll proceed as below:
2: Clean data
## Check the unique value:
#1. Create the list unique value
age_value = []
education_value = []
location_value = []
array = [-1, -2, -4]
for name in starwar.columns[array]:
    if name == 'Age':
        age_value.append(starwar[name].value_counts(dropna=False).sort_index())
    elif name == 'Education':
        education_value.append(starwar[name].value_counts(dropna=False).sort_index())
    else:
        location_value.append(starwar[name].value_counts(dropna=False).sort_index())
#2. Create the dict that stores the data:
store = {}
list_data = [age_value, education_value, location_value]
list_name = ['Age', 'Education', 'Location']
for item, value in zip(list_name, list_data):
    store[item] = value
store
# Check the missing value of each field:
starwar[starwar['Age'].isnull()]
We can see something odd here.
=> To see it clearly, we will use a heatmap to inspect the missing values: missing cells are shown in a light color, and present data in black.
# Check missing values with a heatmap:
#1. Import the seaborn library:
import seaborn as sns
#2. Plot the graph:
def plot_null(df):
    # Set the figure size
    plt.figure(figsize=(20, 10))
    # Identify the missing data
    data = df.isnull()
    # Plot
    sns.heatmap(data, cbar=False, yticklabels=False)
    plt.xticks(rotation=90, size='x-large')

plot_null(starwar)
We can see that when Age is missing, the other fields are usually missing too, except the character-favorability fields (han_solo through yoda), because we've already converted both Unfamiliar and NaN to rank 4. So if we remove the records with a missing Age, we lose almost no data in the other fields. For 'Do you consider yourself to be...', we will convert the missing values along with No to False before analyzing it, because the missing pattern of that field differs significantly from the data pattern.
=> We will remove all records with a missing Age.
# Remove the missing value at Age field:
starwar.dropna(subset=['Age'], inplace=True)
For the Education field, we can see its missing cases almost coincide with Age, and we have already gotten rid of most of them. For the remaining missing values, we'll check the correlation between the Age and Education fields, and then fill each gap with the value corresponding to Age.
The Household Income field we will put aside for later.
# Find the correlation of `Age` and `Education` field:
col_c = ['Age', 'Education']
print(starwar[col_c].notnull().corr())
# Find the value of Education co-respond with `Age`:
starwar[col_c][starwar['Education'].isnull()]
Since the correlation between the Age and Education fields comes out as NaN (and likewise for Location), it's better not to fill the gaps. The missing set is only 10 Education records, and the number of records remaining after deleting these 10 cases is barely affected.
=> Let's delete the missing values in Education too, and in Location if there are any.
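The NaN correlation is expected: after dropping the missing Age records, the Age missing-indicator column is constant, and correlation with a constant (zero-variance) column is undefined. A tiny illustration on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [True, True, True],          # no missing values left
                   'Education': [True, False, True]})  # some still missing
corr = df.corr()
print(corr)  # the Age column has zero variance, so its correlations are NaN
```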
#Check the `Location` missing value:
print(starwar.iloc[:,-1].isnull().sum())
# Delete the missing record in `Education`:
starwar.dropna(subset=['Education'], inplace=True)
# Delete the missing record in `Location`:
starwar.dropna(subset=[starwar.columns[-1]], inplace=True)
# Check for the missing value at 3 field we modify:
num = [-1, -2, -4]
order = starwar.columns[num]
starwar[order].isnull().sum()
The last cleaning task in this chapter is Gender. Looking back at the missing-value heatmap, we can observe that almost every record missing Age is missing Gender too; so the remaining missing values (if any) can be removed now without worrying about data loss.
# Check the missing value at `Gender`:
starwar['Gender'].isnull().sum()
Luckily, the assumption was true: all the missing values in Gender were swept away along with Age when we cleaned the Age field. Now, let's move on to the analysis.
For the next analysis step, there are several ways to aggregate the data:
Age + Gender (aggregate by Age and filter by Gender), Location + Age, or Gender + Location.
3: Discover the highest-ranking episode by Age + Gender
We looked at the graph above and saw which episode received the best ranking (lower is better). Now let's look behind the curtain. Between Male and Female, and across the Age groups, which group has the most respondents? This tells us which group the result (good or bad) mostly comes from.
#Check the 'Gender' values:
starwar['Gender'].value_counts(dropna=False)
# Aggregate by 'Age'
age_group = starwar.copy().groupby('Age')
#Filter by `Male`:
age_male = age_group.apply(lambda x: x[x['Gender']=='Male'])
rank_by_male = age_male[age_male.columns[9:15]]
# Input sum ranking of each age group:
sum_rank_by_male = rank_by_male.reset_index().groupby('Age').agg(np.sum)
result = sum_rank_by_male['ranking_5']
#Filter by `Female':
age_female = age_group.apply(lambda x: x[x['Gender']=='Female'])
rank_by_female = age_female[age_female.columns[9:15]]
# Input sum ranking of each age group:
sum_rank_by_female = rank_by_female.reset_index().groupby('Age').agg(np.sum)
result_2 = sum_rank_by_female['ranking_5']
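A more compact route to the same totals (a sketch on toy data mirroring the renamed `ranking_5` column, not the survey itself) is grouping by both keys at once and unstacking Gender:

```python
import pandas as pd

toy = pd.DataFrame({'Age':       ['18-29', '18-29', '30-44', '30-44'],
                    'Gender':    ['Male', 'Female', 'Male', 'Male'],
                    'ranking_5': [1.0, 2.0, 1.0, 2.0]})
# One groupby over both keys replaces the separate Male/Female filters
table = toy.groupby(['Age', 'Gender'])['ranking_5'].sum().unstack('Gender')
print(table)
```

Missing combinations come out as NaN in the unstacked table, so they are easy to spot.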
def plot_the_pie(series_1, series_2, df, title_1, title_2, col_name):
    # Define the data for each group:
    group_1 = list(series_1)
    group_2 = list(series_2)
    # Create the explode data:
    explodes = [0.2, 0, 0.3, 0]
    # Create the autopct function:
    def get_pct(pct, data):
        result = int(pct / 100. * np.sum(data))  # formula: percentage/100 * sum(data) = item in data
        return "{:.1f}% \n {}".format(pct, result)
    # Create the wedge properties:
    wd = {'linewidth': 1, 'edgecolor': 'black'}
    # Create the labels (taken from the series index so they match the groupby order):
    key = list(series_1.index)
    # Plot the pies:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
    # For the male group
    wedges, texts, autotexts = ax1.pie(x=group_1,
                                       autopct=lambda pct: get_pct(pct, group_1),
                                       explode=explodes,
                                       wedgeprops=wd,
                                       colors=['cyan', 'pink', 'violet', 'green'],
                                       labels=key,
                                       startangle=90)
    ax1.legend(wedges, key, title='Age', loc=6, bbox_to_anchor=[1, 1])  # Set the legend box
    plt.setp(autotexts, size=12, weight="bold")
    ax1.set_title(title_1)  # Set the title
    # For the female group
    wedges, texts, autotexts = ax2.pie(x=group_2,
                                       autopct=lambda pct: get_pct(pct, group_2),
                                       explode=explodes,
                                       wedgeprops=wd,
                                       colors=['cyan', 'pink', 'violet', 'green'],
                                       labels=key,
                                       startangle=90)
    plt.setp(autotexts, size=12, weight="bold")
    ax2.set_title(title_2)
    plt.show()
plot_the_pie(result, result_2, starwar,
'Ranking ratio for Episode V by age\nGender: Male', 'Ranking ratio for Episode V by age\nGender: Female',
'Age')
Looking at the graphs gives us an overview of our respondents.
Now that we have the ranking-vote ratios, we can focus on the average score each group gave this episode, and draw a bar chart to see what's happening.
## Get the mean point of each gender:
#Male:
mean_rank_by_male = rank_by_male.reset_index().groupby('Age').agg(np.mean)
mean_1 = mean_rank_by_male['ranking_5']
#Female:
mean_rank_by_female = rank_by_female.reset_index().groupby('Age').agg(np.mean)
mean_2 = mean_rank_by_female['ranking_5']
#Plot the bar chat:
plt.barh(y=np.arange(0, 4), width=list(mean_1), label='Male', height=0.3)
plt.barh(y=np.arange(0, 4), width=list(mean_2), align='edge', height=-0.3, label='Female')
plt.yticks(ticks=np.arange(0, 4), labels=list(mean_1.index))  # labels follow the groupby order
plt.xlabel('Ranking points in average')
plt.ylabel('Age group')
plt.legend()
plt.title('Ranking point by each gender in each age group\n for StarWar Episode V')
plt.show()
Comparing this with the pie chart of ranking-vote ratios above, we should divide the results into two areas: "love it" and "well, it's fine":
30-44: this is the only Age group whose men really 'save' the ranking score of Episode V.
>60: they still like it, but their score sits about halfway between love and normal. In other words, if our men aged 30-44 like this episode very much, the seniors' taste seems a bit paler than the 30-44 group's.
18-29: the average score the women gave is closer to 3.0, worse than the men's (2.5) => it seems our female respondents are not that interested in the episode.
45-60: both genders score between 2.5 and 3.0 => these respondents don't feel they got a very good taste from the episode; the feeling is just normal, normal and normal.
4: Expanding to Location, which region gives the best ranking for this episode?
To expand the result, we will look at the ranking ratio by Location. We will draw two charts: a pie chart to see the distribution of ranking votes across regions, and a bar chart to see the average ranking score each region gave the episode.
## Aggregate data:
location = starwar.copy().groupby(starwar.columns[-1])
## Ranking ratio of each region:
sum_by_location = location.agg(np.sum)
to_piechart = sum_by_location['ranking_5']
## Average ranking point by each region:
mean_by_location = location.agg(np.mean)
to_barchart = mean_by_location['ranking_5']
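Both inputs (the vote totals for the pie chart and the averages for the bar chart) can also be produced in a single aggregation, sketched here on toy data rather than the survey DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({'Location':  ['East', 'East', 'West'],
                    'ranking_5': [1.0, 3.0, 2.0]})
# One agg call yields both the sums and the means per region
both = toy.groupby('Location')['ranking_5'].agg(['sum', 'mean'])
print(both)
```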
def plot_the_pie_and_bar(data_to_piechart, data_to_barchart,
                         num_cols, jud_num, df, title_of_legend, titl_piechart,
                         titl_xlabel, titl_barchart, greater_or_less='greater'):
    # Get the data:
    piechart = list(data_to_piechart)
    barchat = list(data_to_barchart)
    # Define the pct function for the pie chart:
    def get_func(pct, data):
        result = int(pct / 100 * np.sum(data))
        return '{:.1f}% \n {} votes'.format(pct, result)
    # Define the wedge properties:
    wd = {'linewidth': 1, 'edgecolor': 'black'}
    # Define the labels (from the series index so they match the aggregation order;
    # df and num_cols are kept only for compatibility with the existing calls):
    key = list(data_to_piechart.index)
    # Plot process:
    #1. Define the frame:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
    #2. Plot the pie chart:
    wedges, texts, autotexts = ax1.pie(x=piechart,
                                       autopct=lambda pct: get_func(pct, piechart),
                                       startangle=90,
                                       wedgeprops=wd)
    ax1.legend(wedges, key, title=title_of_legend, loc=6, bbox_to_anchor=[-0.3, 0.8])
    plt.setp(autotexts, size=12, weight='bold')
    ax1.set_title(titl_piechart)
    #3. Plot the bar chart:
    if greater_or_less == 'greater':
        color_def = data_to_barchart > jud_num  # highlight the regions above the threshold
        color_def_fil = color_def.map({True: 'orange', False: 'grey'})
    else:
        color_def = data_to_barchart < jud_num  # highlight the regions below the threshold
        color_def_fil = color_def.map({True: 'green', False: 'grey'})
    ax2.barh(y=np.arange(1, len(barchat) + 1), width=barchat, align='center', color=color_def_fil)
    ax2.set_yticks(np.arange(1, len(barchat) + 1))
    ax2.set_yticklabels(key)
    ax2.set_xlabel(titl_xlabel)
    ax2.set_title(titl_barchart)
    plt.show()
plot_the_pie_and_bar(to_piechart, to_barchart, -1, 2.5, starwar,
'Location', 'Ranking ratio of each region', 'Average ranking (points)',
'The average ranking point of each region', 'greater')
For convenience, the regions with an average ranking score greater than 2.5 are marked in orange. Looking at the result:
Connecting this with the Age and Gender result, we know men aged 30-44 love this episode. Is there a sign of them in the West South Central region? Let's do a quick bar chart of the average ranking they gave this episode.
## Aggregate data by Location: West South Central
west_south = location.get_group('West South Central')
##Filter by `Male` and give it the average points:
west_south_male = west_south[west_south['Gender']=='Male']
ranking_by_age_west_south_male = west_south_male.groupby('Age')[west_south_male.columns[9:15]]
mean_ranking = ranking_by_age_west_south_male.agg(np.mean)
## Plot the barchart:
color_bar = mean_ranking['ranking_5']<2
color_fill = color_bar.map({True:'Green', False:'Grey'})
mean_ranking['ranking_5'].plot.barh(color=color_fill)
plt.xlabel('Average points')
plt.ylabel('Age group')
plt.title('The average ranking point by\nMr. in West South Central region')
plt.show()
It sounds like we've captured them: the fans of Star Wars Episode V are the men aged 30-44 in West South Central, and sadly, they are the only ones who truly love this episode. Even in their region and gender, the seniors find it just normal, and the rest feel "oh, it's good", but are not deeply in love.
Now that we've captured the fans of this episode, let's see who their favorite characters are. The total number of viewers per episode we will put off until after the next analysis, since it deserves a topic of its own.
CONCLUSION:
The biggest fans of Episode V are the men aged 30-44 in West South Central.
Most other groups rate the episode only Somewhat Favorable to Normal, especially those aged 45-60 and 18-29; the group above 60 feels about the same.
The Pacific region also ranks the episode well, but doesn't feel fantastic about it.
We can be sure of one thing: only the fans of Episode V can get crazy about its characters. So let's get started with our male respondents in the West South Central region.
# Create the condition:
con_1 = starwar['Gender'] == 'Male'
con_3 = starwar['Age'] == '30-44'
total = con_1&con_3
#Filter the data:
start_1 = starwar.copy()[total]
start_2 = start_1[start_1[starwar.columns[-1]]=='West South Central']
compute = start_2[start_2.columns[15:29]]
# Compute the mean:
favour_1 = compute.mean()
#Plot the bar chart:
#Define color bar:
color_fil = favour_1<2
color_tab = color_fil.map({True:'green', False:'grey'})
favour_1.plot.barh(color = color_tab, rot=10)
plt.xlabel('Favourable')
plt.ylabel('Characters name')
plt.title('The favourite characters by fan of Episode V')
plt.show()
Among the characters here, all of them fall between Somewhat Favorable and Normal => we still can't say who the most favored character of this group is.
We've found the tastes of the Episode V fans; now let's take it back to their whole region to see the favorite characters there. We expect the result to be similar, because all the rest feel about 50:50, so they are unlikely to focus on any character.
#Filter the data:
west_south_charac = west_south.loc[:,starwar.columns[15:29]]
mean_favou = west_south_charac.agg(np.mean)
#Plot the bar chart:
#Define color bar:
color_fil_2 = mean_favou<2.5
color_tab_2 = color_fil_2.map({True:'green', False:'grey'})
mean_favou.plot.barh(color = color_tab_2, rot=10)
plt.xlabel('Favourable')
plt.ylabel('Characters name')
plt.title('The favourite characters by West South Central Region')
plt.show()
We still get 7 familiar names as above, but this time all of them are in the Somewhat Favorable range. Let's check the optional item: all the women in the West South Central region. Finally, we'll dig into all the data to confirm one thing:
Because the ranking for this film is not overwhelmingly loving (above 2, closer to 2.5 points), the favorability of any character will not be much higher either; we can expect Somewhat Favorable at best.
#Aggregate data by Ms/ Mrs:
west_south_female = west_south[west_south['Gender']=='Female']
# Compute the mean case 1: Ms/ Mrs in West South Central
female_favour_char = west_south_female[starwar.columns[15:29]].mean()
#Case 2: Get all the mean favour in data
cal_rec = starwar[starwar.columns[15:29]]
cal_mean = cal_rec.mean()
#Plot the bar chart for case 1
#Define color bar:
color_fil_3 = female_favour_char<2.5
color_tab_3 = color_fil_3.map({True:'green', False:'grey'})
female_favour_char.plot.barh(color = color_tab_3, rot=10)
plt.xlabel('Favourable')
plt.ylabel('Characters name')
plt.title('The favourite characters by all Ms/ Mrs \n in West South Central Region')
plt.show()
The result here is the same as in the graph above for the whole West South Central region. Obviously, because many people, including our female respondents, are not very interested in Episode V, the favorability ranking is somewhat affected by this factor.
#Plot the bar chart for case 2
#Define color bar:
color_fil_4 = cal_mean<2
color_tab_4 = color_fil_4.map({True:'green', False:'grey'})
cal_mean.plot.barh(color = color_tab_4, rot=10)
plt.xlabel('Favourable')
plt.ylabel('Characters name')
plt.title('The favourite characters by all attendance')
plt.show()
For this final test, the result is set to highlight only characters with a favorability ranking below 2 (closer to Somewhat Favorable), and we get 2 characters: Han Solo and Yoda. For these two, we will keep only the records where either character has a favorability ranking below 2, to see in which region, and in which age and gender group, people like these two characters.
## Get the data only for characters got favourite point (less than 2.0)
favour_cha = starwar.copy()[(starwar['han_solo']<2)|(starwar['yoda']<2)]
## Aggregate data:
#1. By location:
cha_loc = favour_cha.groupby(starwar.columns[-1])
favou_loc = cha_loc[starwar.columns[15:29]].agg(np.mean)
sum_favour_loc = cha_loc[starwar.columns[15:29]].agg(np.sum)
#2. By Age and Gender:
cha_age = favour_cha.groupby(starwar['Age'])
cha_gender = favour_cha.groupby(starwar['Gender'])
# Prepare data:
#Case 1: For `Han-Solo`
to_piechart_case1 = sum_favour_loc['han_solo']
to_barchat_case1 = favou_loc['han_solo']
#Case 2: For 'Yoda':
to_piechart_2 = sum_favour_loc['yoda']
to_barchart_2 = favou_loc['yoda']
plot_the_pie_and_bar(to_piechart_case1, to_barchat_case1, -1, 1.2, starwar,
                     'Location', 'The favourite ranking for Han-Solo\n by each region',
                     'Average ranking (points)',
                     'The average favourite ranking point for Han-Solo of each region', 'less')