In this chapter, you will create and customize plots that visualize the relationship between two quantitative variables. To do this, you will use scatter plots and line plots to explore how the level of air pollution in a city changes over the course of a day and how horsepower relates to fuel efficiency in cars. You will also see another big advantage of using Seaborn: the ability to easily create subplots in a single figure. This is a summary of the lecture "Introduction to Data Visualization with Seaborn" from DataCamp.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (10, 5)
We've seen in prior exercises that students with more absences ("absences") tend to have lower final grades ("G3"). Does this relationship hold regardless of how much time students study each week?
To answer this, we'll look at the relationship between the number of absences that a student has in school and their final grade in the course, creating separate subplots based on each student's weekly study time ("study_time").
student_data = pd.read_csv('./dataset/student-alcohol-consumption.csv', index_col=0)
student_data.head()
| | school | sex | age | famsize | Pstatus | Medu | Fedu | traveltime | failures | schoolsup | ... | goout | Dalc | Walc | health | absences | G1 | G2 | G3 | location | study_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | GT3 | A | 4 | 4 | 2 | 0 | yes | ... | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 | Urban | 2 to 5 hours |
| 1 | GP | F | 17 | GT3 | T | 1 | 1 | 1 | 0 | no | ... | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 | Urban | 2 to 5 hours |
| 2 | GP | F | 15 | LE3 | T | 1 | 1 | 1 | 3 | yes | ... | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 | Urban | 2 to 5 hours |
| 3 | GP | F | 15 | GT3 | T | 4 | 2 | 1 | 0 | no | ... | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 | Urban | 5 to 10 hours |
| 4 | GP | F | 16 | GT3 | T | 3 | 3 | 1 | 0 | no | ... | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 | Urban | 2 to 5 hours |
5 rows × 29 columns
# Change to use relplot() instead of scatterplot()
sns.relplot(x="absences", y="G3", data=student_data, kind='scatter');
# Change to make subplots based on study time
sns.relplot(x="absences", y="G3",
            data=student_data,
            kind="scatter",
            col="study_time");
# Change this scatter plot to arrange the plots in rows instead of columns
sns.relplot(x="absences", y="G3",
            data=student_data,
            kind="scatter",
            row="study_time");
Because these subplots had a large range of x values, it's easier to read them arranged in rows instead of columns.
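The difference between `col` and `row` can also be checked on the grid object that `relplot()` returns. The following is a minimal sketch, assuming a small made-up table in place of `student_data` (the values and the `toy` name are illustrative only) and a headless Matplotlib backend:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import pandas as pd
import seaborn as sns

# Made-up stand-in for student_data: three study-time groups, three rows each.
toy = pd.DataFrame({
    "absences":   [0, 4, 10, 2, 6, 30, 1, 8, 16],
    "G3":         [15, 12, 8, 14, 11, 5, 16, 10, 9],
    "study_time": ["<2 hours"] * 3 + ["2 to 5 hours"] * 3 + ["5 to 10 hours"] * 3,
})

# relplot() returns a FacetGrid; its .axes attribute is a 2D array of subplots.
by_col = sns.relplot(x="absences", y="G3", data=toy, kind="scatter", col="study_time")
by_row = sns.relplot(x="absences", y="G3", data=toy, kind="scatter", row="study_time")

col_shape = by_col.axes.shape  # one row of panels, one panel per group
row_shape = by_row.axes.shape  # one column of panels, one panel per group
print(col_shape, row_shape)
```

With `row`, each panel gets the full figure width, which is why a wide x range is easier to read in that layout.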
Let's continue looking at the student_data dataset of students in secondary school. Here, we want to answer the following question: does a student's first semester grade ("G1") tend to correlate with their final grade ("G3")?
There are many aspects of a student's life that could result in a higher or lower final grade in the class. For example, some students receive extra educational support from their school ("schoolsup") or from their family ("famsup"), which could result in higher grades. Let's try to control for these two factors by creating subplots based on whether the student received extra educational support from their school or family.
# Create a scatter plot of G1 vs. G3
sns.relplot(x='G1', y='G3', data=student_data, kind='scatter');
# Adjust to add subplots based on school support
sns.relplot(x="G1", y="G3", data=student_data, kind="scatter", col='schoolsup', col_order=['yes', 'no']);
# Adjust further to add subplots based on family support
sns.relplot(x="G1", y="G3",
            data=student_data,
            kind="scatter",
            col="schoolsup",
            row="famsup",
            col_order=["yes", "no"],
            row_order=["yes", "no"]);
It looks like the first semester grade does correlate with the final grade, regardless of what kind of support the student received.
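A visual impression like this can be backed up with a number per subplot. The sketch below computes the Pearson correlation between "G1" and "G3" within each support group. It uses a made-up table (the `toy` frame and its values are illustrative, not taken from the real dataset), so only the technique carries over:

```python
import pandas as pd

# Made-up stand-in for student_data: four rows per support combination.
toy = pd.DataFrame({
    "G1": [5, 8, 12, 15, 6, 9, 13, 16, 7, 10, 11, 14, 4, 9, 12, 17],
    "G3": [6, 8, 11, 16, 5, 10, 13, 17, 6, 11, 12, 15, 5, 8, 13, 18],
    "schoolsup": ["yes"] * 8 + ["no"] * 8,
    "famsup": (["yes"] * 4 + ["no"] * 4) * 2,
})

# One Pearson correlation per (schoolsup, famsup) group,
# mirroring the four panels of the faceted scatter plot.
group_corr = {
    key: grp["G1"].corr(grp["G3"])
    for key, grp in toy.groupby(["schoolsup", "famsup"])
}

for key, r in group_corr.items():
    print(key, round(r, 2))
```

Running the same loop on `student_data` would show whether the correlation is similarly strong in every panel.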
In this exercise, we'll explore Seaborn's mpg dataset, which contains one row per car model and includes information such as the year the car was made, the number of miles per gallon ("M.P.G.") it achieves, the power of its engine (measured in "horsepower"), and its country of origin.
What is the relationship between the power of a car's engine ("horsepower") and its fuel efficiency ("mpg")? And how does this relationship vary by the number of cylinders ("cylinders") the car has? Let's find out.
Let's continue to use relplot() instead of scatterplot() since it offers more flexibility.
mpg = pd.read_csv('./dataset/mpg.csv')
mpg.head()
| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |
# Create scatter plot of horsepower vs. mpg, with point size set by cylinders
sns.relplot(x='horsepower', y='mpg', data=mpg, size='cylinders', kind='scatter');
# Add a hue so that point color also varies by cylinders
sns.relplot(x="horsepower", y="mpg",
            data=mpg, kind="scatter",
            size="cylinders",
            hue="cylinders");
Cars with higher horsepower tend to get a lower number of miles per gallon. They also tend to have a higher number of cylinders.
Let's continue exploring Seaborn's mpg dataset by looking at the relationship between how fast a car can accelerate ("acceleration") and its fuel efficiency ("mpg"). Do these properties vary by country of origin ("origin")?
Note that the "acceleration" variable is the time to accelerate from 0 to 60 miles per hour, in seconds. Higher values indicate slower acceleration.
# Create a scatter plot of acceleration vs. mpg
sns.relplot(x='acceleration', y='mpg', data=mpg, kind='scatter', style='origin', hue='origin');
Cars from the USA tend to accelerate more quickly and get lower miles per gallon compared to cars from Europe and Japan.
Two types of relational plots: scatter plots and line plots
- Scatter plots: each plot point is an independent observation
- Line plots: each plot point represents the same "thing", typically tracked over time
The shaded region is the confidence interval
- Assumes the dataset is a random sample
- We are 95% confident that the population mean is within this interval
- Indicates uncertainty in our estimate
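By default, Seaborn estimates this interval by bootstrapping, that is, by resampling the observations with replacement many times and looking at how much the mean moves. The following is a minimal standard-library sketch of that idea, not Seaborn's actual implementation; the `bootstrap_ci` helper and the sample values are hypothetical:

```python
import random
import statistics


def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap interval for the mean of `values` (illustrative helper)."""
    rng = random.Random(seed)
    # Resample with replacement n_boot times and record the mean of each resample.
    boot_means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    tail = (100 - level) / 200  # fraction cut from each end, e.g. 0.025 for 95%
    lower = boot_means[int(tail * n_boot)]
    upper = boot_means[int((1 - tail) * n_boot) - 1]
    return lower, upper


# Made-up mpg readings for the cars of a single model year.
sample = [18.0, 15.0, 18.0, 16.0, 17.0, 15.0, 14.0, 14.0, 24.0, 22.0]
mean = statistics.fmean(sample)
lower, upper = bootstrap_ci(sample)
print(f"mean={mean:.2f}, 95% interval=({lower:.2f}, {upper:.2f})")
```

A narrow interval means the estimate of the mean is fairly certain; a wide one means a different random sample could have produced a noticeably different mean.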
In this exercise, we'll continue to explore Seaborn's mpg dataset, which contains one row per car model and includes information such as the year the car was made, its fuel efficiency (measured in "miles per gallon" or "M.P.G"), and its country of origin (USA, Europe, or Japan).
How has the average miles per gallon achieved by these cars changed over time? Let's use line plots to find out!
# Create line plot
sns.relplot(x='model_year', y='mpg', data=mpg, kind='line');
In the last exercise, we looked at how the average miles per gallon achieved by cars has changed over time. Now let's use a line plot to visualize how the distribution of miles per gallon has changed over time.
# Make the shaded area show the standard deviation
sns.relplot(x="model_year", y="mpg", data=mpg, kind="line", ci='sd');
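Unlike the plot in the last exercise, where the shaded area was the confidence interval for the mean, the shaded area here spans one standard deviation on either side of the mean, so it shows how spread out miles per gallon is among the cars in each year.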
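The band drawn for the standard deviation can be reproduced as mean ± standard deviation per year. The sketch below does this with pandas on a made-up table (`toy` is illustrative only). Note also that in Seaborn 0.12 and later the `ci` parameter is deprecated in favor of `errorbar`, so `errorbar="sd"` is the equivalent spelling there.

```python
import pandas as pd

# Made-up stand-in for mpg: several cars per model year.
toy = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71, 72, 72, 72],
    "mpg":        [18.0, 15.0, 16.0, 25.0, 19.0, 22.0, 28.0, 13.0],
})

# Mean and sample standard deviation of mpg for each year.
summary = toy.groupby("model_year")["mpg"].agg(["mean", "std"])

# Edges of a one-standard-deviation band around the mean.
summary["lower"] = summary["mean"] - summary["std"]
summary["upper"] = summary["mean"] + summary["std"]

print(summary.round(2))
```

A year with a wide band has cars with very different fuel efficiency, even if its mean is similar to that of other years.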
Let's continue to look at the mpg dataset. We've seen that the average miles per gallon for cars has increased over time, but how has the average horsepower for cars changed over time? And does this trend differ by country of origin?
# Create line plot of model year vs. horsepower
sns.relplot(x='model_year', y='horsepower', data=mpg, kind='line', ci=None);
# Change to create subgroups for country of origin
sns.relplot(x="model_year", y="horsepower",
            data=mpg, kind="line", ci=None,
            style="origin", hue="origin");
# Add markers and make each line have the same style
sns.relplot(x="model_year", y="horsepower",
            data=mpg, kind="line", ci=None, style="origin", hue="origin",
            markers=True, dashes=False);
Now that we've added subgroups, we can see that this downward trend in horsepower was more pronounced among cars from the USA.