Exploratory Data Analysis¶

This is the code for the blog post Exploratory Data Analysis

In [1]:

import pandas as pd
import plotly.offline as py 
import seaborn as sns
py.init_notebook_mode(connected=True)

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objs as go 
import plotly.tools as tls 
import plotly.figure_factory as ff 

The data is loaded with the pandas function read_csv(), since the file is in CSV (comma-separated values) format, and stored in the df variable.

In [2]:

df = pd.read_csv("billionaires.csv")

Basics¶

In [3]:

df.shape

Out[3]:

(2614, 22)

In [4]:

df.columns

Out[4]:

Index(['name', 'rank', 'year', 'company.founded', 'company.name',
       'company.relationship', 'company.sector', 'company.type',
       'demographics.age', 'demographics.gender', 'location.citizenship',
       'location.country code', 'location.gdp', 'location.region',
       'wealth.type', 'wealth.worth in billions', 'wealth.how.category',
       'wealth.how.from emerging', 'wealth.how.industry',
       'wealth.how.inherited', 'wealth.how.was founder',
       'wealth.how.was political'],
      dtype='object')

The purpose of displaying examples from the data set is not to make a thorough analysis. It is to get a qualitative "sense" of the data we have.

In [5]:

df.head()

Out[5]:

	name	rank	year	company.founded	company.name	company.relationship	company.sector	company.type	demographics.age	demographics.gender	...	location.gdp	location.region	wealth.type	wealth.worth in billions	wealth.how.category	wealth.how.from emerging	wealth.how.industry	wealth.how.inherited	wealth.how.was founder	wealth.how.was political
0	Bill Gates	1	1996	1975	Microsoft	founder	Software	new	40	male	...	8.100000e+12	North America	founder non-finance	18.5	New Sectors	True	Technology-Computer	not inherited	True	True
1	Bill Gates	1	2001	1975	Microsoft	founder	Software	new	45	male	...	1.060000e+13	North America	founder non-finance	58.7	New Sectors	True	Technology-Computer	not inherited	True	True
2	Bill Gates	1	2014	1975	Microsoft	founder	Software	new	58	male	...	0.000000e+00	North America	founder non-finance	76.0	New Sectors	True	Technology-Computer	not inherited	True	True
3	Warren Buffett	2	1996	1962	Berkshire Hathaway	founder	Finance	new	65	male	...	8.100000e+12	North America	founder non-finance	15.0	Traded Sectors	True	Consumer	not inherited	True	True
4	Warren Buffett	2	2001	1962	Berkshire Hathaway	founder	Finance	new	70	male	...	1.060000e+13	North America	founder non-finance	32.3	Traded Sectors	True	Consumer	not inherited	True	True

5 rows × 22 columns

Let's check whether any columns have missing values.

In [6]:

df.isna().sum()

Out[6]:

name                         0
rank                         0
year                         0
company.founded              0
company.name                38
company.relationship        46
company.sector              23
company.type                36
demographics.age             0
demographics.gender         34
location.citizenship         0
location.country code        0
location.gdp                 0
location.region              0
wealth.type                 22
wealth.worth in billions     0
wealth.how.category          1
wealth.how.from emerging     0
wealth.how.industry          1
wealth.how.inherited         0
wealth.how.was founder       0
wealth.how.was political     0
dtype: int64
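Raw counts are easier to judge as a share of all rows. The sketch below is only an illustration on a made-up toy frame (the values and the `toy` and `missing_pct` names are assumptions, not part of the original notebook); the same expression could be applied to df.

```python
import pandas as pd

# Toy frame standing in for the billionaires data (all values are made up).
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "company.name": ["X", None, "Z", None],
    "demographics.gender": ["male", "female", None, "male"],
})

# isna() gives a boolean frame; the column mean of booleans is the missing fraction.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns that need the most attention first.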

Descriptive statistics¶

Descriptive statistics summarize the numerical features: the mean, the standard deviation, and the five elements of a box plot (min, Q1, median, Q3, max).

In [7]:

df.describe().T

Out[7]:

	count	mean	std	min	25%	50%	75%	max
rank	2614.0	5.996725e+02	4.678857e+02	1.0	215.0	430.0	9.880000e+02	1.565000e+03
year	2614.0	2.008412e+03	7.483598e+00	1996.0	2001.0	2014.0	2.014000e+03	2.014000e+03
company.founded	2614.0	1.924712e+03	2.437765e+02	0.0	1936.0	1963.0	1.985000e+03	2.012000e+03
demographics.age	2614.0	5.334124e+01	2.533332e+01	-42.0	47.0	59.0	7.000000e+01	9.800000e+01
location.gdp	2614.0	1.769103e+12	3.547083e+12	0.0	0.0	0.0	7.250000e+11	1.060000e+13
wealth.worth in billions	2614.0	3.531943e+00	5.088813e+00	1.0	1.4	2.0	3.500000e+00	7.600000e+01

Plot quantitative data¶

Often a quick histogram is enough to understand the data.

Let's start with the main thing: the money.

In [8]:

plt.figure(figsize=(15,10))
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(df['wealth.worth in billions'], kde=True, stat='density')
plt.xscale('log')

I used a logarithmic scale so that the shape of the distribution is visible at all. Most people on the list have comparatively modest fortunes, but there is a long right tail of people with very large ones.
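To see why the logarithmic scale helps, it can be useful to count values in bins that are evenly spaced in log space. This is a minimal sketch on made-up net worths (the `worth` array and the choice of five bin edges are assumptions for illustration only).

```python
import numpy as np

# Made-up net worths in billions, right-skewed like the real column.
worth = np.array([1.0, 1.2, 1.4, 2.0, 2.5, 3.5, 8.0, 18.5, 58.7, 76.0])

# Bin edges evenly spaced in log10 space: 1, ~3.16, 10, ~31.6, 100.
edges = np.logspace(0, 2, num=5)
counts, _ = np.histogram(worth, bins=edges)
print(counts)
```

With linear bins of the same number, almost every value would fall into the first bin and the tail would be invisible.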

How old are our billionaires?

The descriptive statistics showed invalid values in this column (the minimum age is -42), so let's remove them to get an accurate picture.

In [9]:

df = df[df['demographics.age'] > 0]

In [10]:

plt.figure(figsize=(15,10))
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(df['demographics.age'], bins=15, kde=True, stat='density')
plt.show()

The distribution is similar to normal, with a slightly larger tail on the left.

Let's do the same, split by industry.

In [11]:

# FacetGrid creates its own figure, so no plt.figure() call is needed
g = sns.FacetGrid(data=df, hue='wealth.how.industry', aspect=3, height=4)
g.map(sns.kdeplot, 'demographics.age', fill=True)  # fill replaces the deprecated shade
g.add_legend(title='wealth.how.industry')

Out[11]:

<seaborn.axisgrid.FacetGrid at 0x7f0eb219e7f0>

In [12]:

industries = ['Hedge funds', 'Consumer', 'Technology-Computer']
# Keep only the three industries of interest; FacetGrid creates its own figure
g = sns.FacetGrid(
    data=df[df['wealth.how.industry'].isin(industries)],
    hue='wealth.how.industry',
    aspect=3,
    height=4)
g.map(sns.kdeplot, 'demographics.age', fill=True)  # fill replaces the deprecated shade
g.add_legend(title='wealth.how.industry')

Out[12]:

<seaborn.axisgrid.FacetGrid at 0x7f0eb21a6f98>

Most of the density sits at older ages. Tech billionaires skew younger, while the consumer industry skews the opposite way, toward the elderly. There is also an industry in which, for some reason, one can get rich before 20.
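The visual impression can be backed with numbers by comparing the median age per industry. The sketch below uses a made-up toy frame (the ages and the `age_by_industry` name are assumptions, not results from the real data).

```python
import pandas as pd

# Toy data using the notebook's column names; the ages are made up.
toy = pd.DataFrame({
    "wealth.how.industry": [
        "Technology-Computer", "Technology-Computer",
        "Consumer", "Consumer", "Hedge funds",
    ],
    "demographics.age": [40, 50, 70, 66, 55],
})

# Median age and group size per industry, youngest industry first.
age_by_industry = (
    toy.groupby("wealth.how.industry")["demographics.age"]
    .agg(["median", "count"])
    .sort_values("median")
)
print(age_by_industry)
```

Keeping the count next to the median makes it obvious when a group is too small to support a conclusion.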

Plot qualitative data¶

Let's answer the question: which industries have the most billionaires?

In [13]:

# Count billionaires per industry (value_counts sorts in descending order)
industry_counts = (
    df['wealth.how.industry']
    .value_counts()
    .rename_axis('wealth.how.industry')
    .reset_index(name='count')
)

plt.figure(figsize=(15,8))
sns.barplot(data=industry_counts, x='count', y='wealth.how.industry')
plt.title('Industries of billionaires', fontsize=17)

Out[13]:

Text(0.5, 1.0, 'Industries of billionaires')

Judging by the plot, the industries at the top are those that target consumers. It is hard for me to say why, but this is the kind of insight I can pass on to the business. There is also an industry labeled 0; we can assume these are people who have no single industry or a mix of them.

Are there more men or women among the billionaires?

In [14]:

plt.figure(figsize=(7,5))
sns.countplot(data=df, x='demographics.gender')
plt.title('Gender', fontsize=17)

Out[14]:

Text(0.5, 1.0, 'Gender')

It just so happens that it's mostly men.

Let's look at the billionaires' countries of citizenship.

In [15]:

column = 'location.citizenship'
counts = df[column].value_counts()  # computed once, reused for values and labels
fig = go.Figure(
    data=[
        go.Pie(
            values=counts.values.tolist(),
            labels=counts.index.tolist(),
            name=column,
            marker=dict(line=dict(width=2, color='rgb(243,243,243)')),
            hole=.3,
        )
    ],
    layout=dict(title=dict(text="Billionaire countries")),
)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

More than a third of billionaires come from the United States.
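The pie chart counts rows, and df.head() shows that the same person can appear in several years. A minimal sketch of how the shares could be checked both per row and per person, on made-up data (the `toy`, `by_row` and `by_person` names are assumptions for illustration):

```python
import pandas as pd

# Toy records: person "A" appears in two different years (values are made up).
toy = pd.DataFrame({
    "name": ["A", "A", "B", "C", "D"],
    "location.citizenship": [
        "United States", "United States", "Germany", "United States", "China",
    ],
})

# Share of rows per country (what the pie chart shows).
by_row = toy["location.citizenship"].value_counts(normalize=True)

# Share of distinct people per country: keep one row per name first.
by_person = (
    toy.drop_duplicates("name")["location.citizenship"]
    .value_counts(normalize=True)
)
print(by_row, by_person, sep="\n\n")
```

If the two shares differ noticeably on the real data, it is worth stating which one a claim refers to.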

Boxplots¶

Let's go through all the quantitative data and build their box plots.

In [16]:

f, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 10))
for (index, col) in enumerate(df.select_dtypes(include='number').columns):  # 'number' covers every int and float width
    sns.boxplot(data=df, y=col, ax=axes[index//3][index%3])

rank. This appears to be a person's rank within the overall sample.

year. This shows the period over which the billionaires were recorded. The distribution is heavily skewed toward recent years, which seems logical: someone who earned a first billion long ago has probably accumulated more since and is unlikely to leave the list.

company.founded. A similar conclusion. The values at 0 also suggest that some founding years are missing. We'll have to deal with them later.

demographics.age. There are a lot of outliers, including people with zero or negative age, which cannot be right. Without those outliers, the distribution looks close to normal. We should plot a histogram for this variable.

location.gdp. It is hard to conclude much from this plot. It seems that most billionaires' countries are not very rich, but at least half of the values are 0, which may simply mean the GDP is missing, so it is hard to judge what exactly this column means.

wealth.worth in billions. A huge number of outliers, although judging by the quartiles most people have only a few billion (the median is 2.0), which we have already seen in the previous plots.

In the simplest box plot, the central rectangle spans the first quartile to the third quartile (the interquartile range or IQR). As a common rule, outliers lie more than 1.5×IQR below the first quartile or more than 1.5×IQR above the third quartile; some texts reserve 3×IQR for extreme outliers. But the definition of an outlier will be different for each data set.
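The quartile-based rule can be sketched in a few lines. This version assumes the common multiplier k = 1.5 (points below Q1 - k×IQR or above Q3 + k×IQR are flagged); the `iqr_outliers` helper and the sample ages are hypothetical, introduced only for illustration.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Made-up ages, including the kind of invalid values seen in the data.
ages = pd.Series([-42, 0, 47, 52, 59, 61, 66, 70, 98], name="demographics.age")
outliers = iqr_outliers(ages)
print(outliers.tolist())
```

Passing a larger k, for example 3, keeps only the most extreme points, which is why the multiplier should be chosen per data set.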

Box plots are very good at presenting information about central tendency, symmetry, and variance, although they can be misleading about aspects such as multimodality. One of their best applications is the side-by-side box plot (see multivariate graphical analysis below).

In [17]:

plt.figure(figsize=(15,10))
sns.boxplot(x='demographics.gender', y="demographics.age", hue="wealth.type", data=df)
plt.show()

Correlation analysis¶

In [18]:

sns.set()
sns.set(font_scale = 1.25)
sns.heatmap(
    df[['rank', 'year', 'company.founded', 'demographics.age', 'location.gdp']].corr(), 
    annot = True,
    fmt = '.1f'
)
plt.show()

In [19]:

cols = ['rank', 'year', 'company.founded', 'demographics.age', 'location.gdp']
sns.pairplot(
    data=df[cols], 
    vars=cols, 
    kind='scatter'
)

Out[19]:

<seaborn.axisgrid.PairGrid at 0x7f0eab2cb7f0>

Industries of billionaires¶

In [20]:

times = df['name'].value_counts().rename_axis('name').reset_index(name='times')

In [21]:

top_rich = pd.merge(df, times, on='name')
top_rich = top_rich[top_rich['year'] == 2014]
top_rich = top_rich.nlargest(200, 'wealth.worth in billions')
top_rich.shape

Out[21]:

(200, 23)

In [22]:

plt.figure(figsize=(15,10))
sns.scatterplot(
    x='wealth.worth in billions', 
    y='demographics.age', 
    hue='times', 
    size='times', 
    data=top_rich, 
    palette='plasma',
    sizes=(50, 500)
)
plt.show()
