In this chapter, you'll learn how to quantify the strength of a linear relationship between two variables, and explore how confounding variables can affect the relationship between two other variables. You'll also see how a study's design can influence its results, change how the data should be analyzed, and potentially affect the reliability of your conclusions. This is a summary of the lecture "Introduction to Statistics in Python" from DataCamp.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Correlation coefficient
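Pearson product-moment correlation ($r$)
$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \times \sigma_y} $$
Here $\bar{x}$ and $\bar{y}$ are the sample means, and $\sigma_x$ and $\sigma_y$ are the sample standard deviations of the two variables.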
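To make the definition concrete, here is a minimal sketch that computes $r$ by hand as the sum of products of z-scores divided by $n - 1$, using sample standard deviations, and compares it with NumPy's built-in result. The helper name `pearson_r` and the toy values are assumptions made for illustration; they are not part of the course material or the dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation from z-scores, using sample standard deviations (ddof=1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return float(np.sum(zx * zy) / (n - 1))

# Hypothetical toy data, not taken from the World Happiness Report
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.0, 4.0, 5.0, 4.0, 5.0]

r_manual = pearson_r(x_demo, y_demo)
r_numpy = float(np.corrcoef(x_demo, y_demo)[0, 1])
print(r_manual, r_numpy)  # the two values should agree up to rounding error
```

The pandas `.corr()` method used throughout this chapter computes the same Pearson coefficient by default.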
In this chapter, you'll be working with a dataset, world_happiness, containing results from the 2019 World Happiness Report. The report scores various countries based on how happy people in that country are. It also ranks each country on various societal aspects such as social support, freedom, corruption, and others. The dataset also includes the GDP per capita and life expectancy for each country.
In this exercise, you'll examine the relationship between a country's life expectancy (life_exp) and happiness score (happiness_score) both visually and quantitatively.
world_happiness = pd.read_csv('./dataset/world_happiness.csv', index_col=0)
world_happiness.head()
| | country | social_support | freedom | corruption | generosity | gdp_per_cap | life_exp | happiness_score |
|---|---|---|---|---|---|---|---|---|
| 1 | Finland | 2.0 | 5.0 | 4.0 | 47.0 | 42400 | 81.8 | 155 |
| 2 | Denmark | 4.0 | 6.0 | 3.0 | 22.0 | 48300 | 81.0 | 154 |
| 3 | Norway | 3.0 | 3.0 | 8.0 | 11.0 | 66300 | 82.6 | 153 |
| 4 | Iceland | 1.0 | 7.0 | 45.0 | 3.0 | 47900 | 83.0 | 152 |
| 5 | Netherlands | 15.0 | 19.0 | 12.0 | 7.0 | 50500 | 81.8 | 151 |
# Create a scatterplot of happiness_score vs. life_exp and show
sns.scatterplot(x='life_exp', y='happiness_score', data=world_happiness);
# Create scatterplot of happiness_score vs. life_exp with trendline
sns.lmplot(x='life_exp', y='happiness_score', data=world_happiness, ci=None);
# Correlation between life_exp and happiness_score
cor = world_happiness['life_exp'].corr(world_happiness['happiness_score'])
print(cor)
0.7802249053272061
While the correlation coefficient is a convenient way to quantify the strength of a relationship between two variables, it's far from perfect: it only measures how well a straight line describes the relationship. In this exercise, you'll explore this caveat by examining the relationship between a country's GDP per capita (gdp_per_cap) and life expectancy (life_exp).
# Scatterplot of gdp_per_cap and life_exp
sns.scatterplot(x='gdp_per_cap', y='life_exp', data=world_happiness);
# Correlation between gdp_per_cap and life_exp
cor = world_happiness['gdp_per_cap'].corr(world_happiness['life_exp'])
print(cor)
0.7019547642148014
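The caveat is easiest to see in an extreme case. The sketch below uses made-up data (the names `x_curve` and `y_curve` are assumptions, not columns of world_happiness) in which one variable is completely determined by the other, yet the correlation coefficient is essentially zero because the relationship is a curve rather than a line.

```python
import numpy as np

# Hypothetical data: y depends on x exactly, but through a curve rather than a line
x_curve = np.linspace(-3, 3, 61)
y_curve = x_curve ** 2

r_curve = float(np.corrcoef(x_curve, y_curve)[0, 1])
print(r_curve)  # very close to zero despite the perfect (nonlinear) dependence
```

This is why it helps to look at the scatterplot before trusting the number: a low or moderate $r$ does not rule out a strong nonlinear relationship.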
When variables have skewed distributions, they often require a transformation in order to form a linear relationship with another variable, so that the correlation coefficient describes the relationship meaningfully. In this exercise, you'll perform a transformation yourself.
# Scatterplot of happiness_score vs. gdp_per_cap
sns.scatterplot(x='gdp_per_cap', y='happiness_score', data=world_happiness);
# Calculate correlation
cor = world_happiness['gdp_per_cap'].corr(world_happiness['happiness_score'])
print(cor)
0.7279733012222975
# Create log_gdp_per_cap column
world_happiness['log_gdp_per_cap'] = np.log(world_happiness['gdp_per_cap'])
# Scatterplot of log_gdp_per_cap and happiness_score
sns.scatterplot(x='log_gdp_per_cap', y='happiness_score', data=world_happiness);
# Calculate correlation
cor = world_happiness['log_gdp_per_cap'].corr(world_happiness['happiness_score'])
print(cor)
0.8043146004918288
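The same effect can be reproduced on a small synthetic example. The sketch below is an illustration under stated assumptions: the columns `income` and `score` are invented, and `score` is constructed to be exactly linear in the logarithm of `income`, so the log transform recovers a perfect linear relationship that the raw values hide.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable: values spread over several orders of magnitude
demo = pd.DataFrame({'income': np.exp(np.linspace(6, 11, 50))})
# Hypothetical outcome that is linear in log(income) by construction
demo['score'] = 2.0 * np.log(demo['income']) + 1.0

# Correlation on the raw scale, then after the log transform
r_raw = demo['income'].corr(demo['score'])
demo['log_income'] = np.log(demo['income'])
r_log = demo['log_income'].corr(demo['score'])
print(r_raw, r_log)
```

Real data will not line up this cleanly, but the direction of the change matches what the GDP example shows: the correlation is higher on the log scale.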
A new column has been added to world_happiness called grams_sugar_per_day, which contains the average amount of sugar eaten per person per day in each country. In this exercise, you'll examine the relationship between a country's average sugar consumption and its happiness score.
world_happiness = pd.read_csv('./dataset/world_happiness_add_sugar.csv', index_col=0)
world_happiness
| | country | social_support | freedom | corruption | generosity | gdp_per_cap | life_exp | happiness_score | grams_sugar_per_day |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Finland | 2 | 5 | 4.0 | 47 | 42400 | 81.8 | 155 | 86.8 |
| 2 | Denmark | 4 | 6 | 3.0 | 22 | 48300 | 81.0 | 154 | 152.0 |
| 3 | Norway | 3 | 3 | 8.0 | 11 | 66300 | 82.6 | 153 | 120.0 |
| 4 | Iceland | 1 | 7 | 45.0 | 3 | 47900 | 83.0 | 152 | 132.0 |
| 5 | Netherlands | 15 | 19 | 12.0 | 7 | 50500 | 81.8 | 151 | 122.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 129 | Yemen | 100 | 147 | 83.0 | 155 | 2340 | 68.1 | 5 | 77.9 |
| 130 | Rwanda | 144 | 21 | 2.0 | 90 | 2110 | 69.1 | 4 | 14.1 |
| 131 | Tanzania | 131 | 78 | 34.0 | 49 | 2980 | 67.7 | 3 | 28.0 |
| 132 | Afghanistan | 151 | 155 | 136.0 | 137 | 1760 | 64.1 | 2 | 24.5 |
| 133 | Central African Republic | 155 | 133 | 122.0 | 113 | 794 | 52.9 | 1 | 22.4 |
133 rows × 9 columns
# Scatterplot of grams_sugar_per_day and happiness_score
sns.scatterplot(x='grams_sugar_per_day', y='happiness_score', data=world_happiness);
# Correlation between grams_sugar_per_day and happiness_score
cor = world_happiness['grams_sugar_per_day'].corr(world_happiness['happiness_score'])
print(cor)
0.6939100021829635
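A positive correlation like this one does not show that eating more sugar makes people happier; a third variable could be driving both. The sketch below illustrates the idea with simulated data only. The variable names `wealth`, `sugar`, and `happiness` and the helper `residuals` are assumptions chosen for the illustration, and nothing here is estimated from the real dataset. Both simulated variables depend only on the confounder, yet they end up strongly correlated; once the linear effect of the confounder is removed from each, the correlation largely disappears.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical simulated data, not the real world_happiness dataset
wealth = rng.normal(size=n)                          # the confounder
sugar = wealth + rng.normal(scale=0.5, size=n)       # driven by wealth only
happiness = wealth + rng.normal(scale=0.5, size=n)   # driven by wealth only

def residuals(target, predictor):
    """Part of target left over after a straight-line fit on predictor."""
    slope, intercept = np.polyfit(predictor, target, 1)
    return target - (slope * predictor + intercept)

# Raw correlation, then correlation after removing the confounder's linear effect
r_raw = float(np.corrcoef(sugar, happiness)[0, 1])
r_adjusted = float(
    np.corrcoef(residuals(sugar, wealth), residuals(happiness, wealth))[0, 1]
)
print(round(r_raw, 2), round(r_adjusted, 2))
```

Correlating residuals in this way is one simple form of adjusting for a confounder; it only works when the confounder has been measured and its effect is roughly linear.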