Author: Dylan Castillo
I'm teaching a course about the essential tools of Data Science. Among those, I'm going to cover how to use some of the most popular data visualization libraries in Python: pandas (yes, that's not a typo!), matplotlib, seaborn, and plotly.express.
I thought it would be useful for my students to have a cheat sheet with some popular graphs made with each of these tools. So I wrote this one.
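In the next sections, you'll learn how to set up your local environment, read the data, and get the code to make the following types of graphs: line charts, grouped bar charts, stacked bar charts, stacked area charts, pie/donut charts, histograms, scatter plots, and box plots.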
Let me know what you think!
Working with virtual environments will save you lots of headaches when working on Python projects. So, you'll start by creating one and installing the required libraries.
If you're using venv, then here's how you set up your local environment:
$ python3 -m venv .dataviz
$ source .dataviz/bin/activate
(.dataviz) $ python3 -m pip install pandas==1.2.4 numpy==1.19.2 matplotlib==3.4.2 plotly==4.14.3 seaborn==0.11.1 notebook==6.4.0
(.dataviz) $ jupyter notebook
If you're using conda, then you need to run these commands:
$ conda create --name .dataviz
$ conda activate .dataviz
(.dataviz) $ conda install pandas==1.2.4 numpy==1.19.2 matplotlib==3.4.2 plotly==4.14.3 seaborn==0.11.1 notebook==6.4.0 -y
(.dataviz) $ jupyter notebook
That's it! These commands will create a virtual environment called .dataviz, activate it, install the required libraries (pandas, numpy, matplotlib, plotly, seaborn, and notebook), and start Jupyter Notebook.
Note that if you're planning on using just one of the data visualization libraries, then feel free not to install all of them. For example, if you only want to use plotly.express, you don't need to install matplotlib and seaborn.
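If you want to confirm which of these libraries ended up in the active environment, something like the following can be run in a notebook cell. It's a minimal sketch: the helper name installed_versions is made up for this example and isn't part of any library.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages installed by the setup commands above.
REQUIRED = ["pandas", "numpy", "matplotlib", "plotly", "seaborn", "notebook"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'not installed'}")
```

Any package reported as "not installed" is only a problem if you plan to use the sections that depend on it.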
Open Jupyter Notebook. Then, create a new notebook by clicking on New > Python3 notebook in the menu. By now, you should have an empty Jupyter notebook in front of you. Now, let's get to the fun part!
First, you'll need to import the required libraries. Create a new cell in your notebook and paste the following code:
# All
import pandas as pd
import numpy as np
# matplotlib
import matplotlib.ticker as mtick
import matplotlib.pyplot as plt
# plotly
import plotly.io as pio
import plotly.express as px
# seaborn
import seaborn as sns
# Set templates
pio.templates.default = "seaborn"
plt.style.use("seaborn")
This code will import the required libraries and set up the themes for matplotlib and plotly. Each library provides you with a specific set of functionalities:
pandas helps you read the data
matplotlib.pyplot, plotly.express, and seaborn will help you make the graphs
matplotlib.ticker provides you with a way to format the ticks on the axes of your matplotlib graphs
plotly.io makes it easy to define a specific theme for your plotly graphs
In the last two lines, you define the themes for plotly and matplotlib. In this case, you set them to use the seaborn theme. This will make the graphs from all the libraries look similar.
Throughout this tutorial you'll use a dataset with stock market data for 29 companies compiled by ichardddddd. It has the following columns: Date, Open, High, Low, Close, Volume, and Name.
You can take a look at the data by taking a sample of a few rows:
url = "https://raw.githubusercontent.com/szrlee/Stock-Time-Series-Analysis/master/data/all_stocks_2006-01-01_to_2018-01-01.csv"
df = pd.read_csv(url)
df.sample(5)
| | Date | Open | High | Low | Close | Volume | Name |
|---|---|---|---|---|---|---|---|
| 25931 | 2013-01-18 | 52.24 | 52.34 | 51.81 | 52.34 | 8492176 | DIS |
| 53204 | 2013-06-05 | 98.13 | 98.16 | 96.12 | 96.42 | 5394802 | MCD |
| 39946 | 2008-09-26 | 117.21 | 121.01 | 117.01 | 119.42 | 4760683 | IBM |
| 37191 | 2009-10-15 | 27.28 | 27.37 | 27.05 | 27.30 | 13350145 | HD |
| 2877 | 2017-06-08 | 204.84 | 206.03 | 204.09 | 205.94 | 2451348 | MMM |
This dataset is in long format with respect to the stock names: each row holds the values for one stock on one date. In the next sections, you'll notice that some libraries make it easy to work with data in this form, and others will require you to transform it into a wide dataset, with one column per stock.
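To make the long versus wide distinction concrete, here is a small sketch on made-up numbers (not values from the real dataset). The variable names long_df, wide_df, and back_to_long are only illustrative.

```python
import pandas as pd

# Tiny made-up dataset in long form: one row per (Date, Name) pair.
long_df = pd.DataFrame(
    {
        "Date": ["2017-01-03", "2017-01-03", "2017-01-04", "2017-01-04"],
        "Name": ["AAPL", "IBM", "AAPL", "IBM"],
        "Close": [10.0, 20.0, 11.0, 21.0],
    }
)

# Long -> wide: one row per Date, one column per stock.
wide_df = long_df.pivot(index="Date", columns="Name", values="Close")

# Wide -> long again: back to one row per (Date, Name) pair.
back_to_long = wide_df.reset_index().melt(
    id_vars="Date", var_name="Name", value_name="Close"
)

print(wide_df)
print(back_to_long)
```

In the sections below, pivot is the call that turns the long data into the wide form that pandas and matplotlib plotting tend to expect, while melt goes the other way.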
That's it! Now you can find whatever graph you'd like to make and copy-paste its code.
For the line charts, read the data as follows:
url = "https://raw.githubusercontent.com/szrlee/Stock-Time-Series-Analysis/master/data/all_stocks_2006-01-01_to_2018-01-01.csv"
df = pd.read_csv(url)
df = df.loc[df.Name.isin(["AAPL", "JPM", "GOOGL", "AMZN"]), ["Date", "Name", "Close"]]
df["Date"] = pd.to_datetime(df.Date)
df.rename(columns={"Close": "Closing Price"}, inplace=True)
pandas
df_wide = df.pivot(index="Date", columns="Name", values="Closing Price")
df_wide.plot(
    title="Stock prices (2006 - 2017)", ylabel="Closing Price", figsize=(12, 6), rot=0
)
matplotlib
fig, ax = plt.subplots(figsize=(12, 6))
# Draw one line per stock
for i, g in df.groupby("Name"):
    ax.plot(g["Date"], g["Closing Price"], label=i)
ax.set_title("Stock prices (2006 - 2017)")
ax.set_ylabel("Closing Price")
ax.set_xlabel("Date")
ax.legend(title="Name")
seaborn
fig, ax = plt.subplots(figsize=(12, 6))
sns.lineplot(data=df, x="Date", y="Closing Price", hue="Name", ax=ax)
ax.set_title("Stock Prices (2006 - 2017)")
plotly.express
fig = px.line(
    df,
    x="Date",
    y="Closing Price",
    color="Name",
    title="Stock Prices (2006 - 2017)",
    width=900,
    height=500,
)
fig.show()
For the grouped bar charts, read the data as follows:
url = "https://raw.githubusercontent.com/szrlee/Stock-Time-Series-Analysis/master/data/all_stocks_2006-01-01_to_2018-01-01.csv"
df = pd.read_csv(url)
df = df[df.Name == "AAPL"]
df["Year"] = pd.to_datetime(df.Date).dt.year
df = df.query("Year >= 2014").groupby("Year").max().reset_index(drop=False)
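One detail of this preparation step is easy to miss: groupby("Year").max() takes the maximum of each column independently, so the maximum Open and the maximum Close for a year can come from different days. A small sketch on made-up numbers (the names prices and yearly_max are illustrative):

```python
import pandas as pd

# Made-up prices: two trading days in each of two years.
prices = pd.DataFrame(
    {
        "Date": ["2014-01-02", "2014-06-02", "2015-01-02", "2015-06-02"],
        "Open": [1.0, 3.0, 5.0, 4.0],
        "Close": [2.0, 2.5, 4.5, 6.0],
    }
)
prices["Year"] = pd.to_datetime(prices["Date"]).dt.year

# Column-wise maximum per year; Open and Close are handled separately.
yearly_max = prices.groupby("Year")[["Open", "Close"]].max().reset_index()
print(yearly_max)
```

For 2015 in this toy data, the highest Open comes from January and the highest Close from June.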
pandas
df.plot.bar(
    x="Year",
    y=["Open", "Close"],
    rot=0,
    figsize=(12, 6),
    ylabel="Price in USD",
    title="Max Opening and Closing Prices per Year for AAPL",
)
matplotlib
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(df.Year))
width = 0.25
# Opening prices go to the left of each tick, closing prices to the right
ax.bar(x - width / 2, df.Open, width, label="Open")
ax.bar(x + width / 2, df.Close, width, label="Close")
ax.set_xlabel("Year")
ax.set_ylabel("Price in USD")
ax.set_title("Max Opening and Closing Prices per Year for AAPL")
ax.set_xticks(x)
ax.set_xticklabels(df.Year)
ax.legend()
seaborn
df_long = df.melt(
    id_vars="Year",
    value_vars=["Open", "Close"],
    var_name="Category",
    value_name="Price",
)
fig, ax = plt.subplots(figsize=(12, 6))
sns.barplot(data=df_long, x="Year", y="Price", hue="Category", ax=ax)
ax.set_title("Max Opening and Closing Prices per Year for AAPL")
ax.legend(title=None)
plotly.express
fig = px.bar(
    df,
    x="Year",
    y=["Open", "Close"],
    title="Max Opening and Closing Prices per Year for AAPL",
    barmode="group",
    labels={"value": "Price in USD"},
    width=900,
    height=500,
)
fig.show()
For the stacked bar charts, read the data as follows:
url = "https://raw.githubusercontent.com/szrlee/Stock-Time-Series-Analysis/master/data/all_stocks_2006-01-01_to_2018-01-01.csv"
df = pd.read_csv(url)
stocks_filter = ["AAPL", "JPM", "GOOGL", "AMZN", "IBM"]
df = df[df.Name.isin(stocks_filter)]
df["Date"] = pd.to_datetime(df.Date)
df["Year"] = pd.to_datetime(df.Date).dt.year
df["Volume"] = df["Volume"] / 1e9
df = (
    df[["Year", "Volume", "Name"]]
    .query("Year >= 2012")
    .groupby(["Year", "Name"])
    .sum()
    .reset_index(drop=False)
)
pandas
df_wide = df.pivot(index="Year", columns="Name", values="Volume")
df_wide.plot.bar(
    rot=0,
    figsize=(12, 6),
    ylabel="Volume (billions of shares)",
    title="Trading volume per year for selected shares",
    stacked=True,
)
matplotlib
fig, ax = plt.subplots(figsize=(12, 6))
# Running height of the stack for each year
bottom = np.zeros(df.Year.nunique())
for i, g in df.groupby("Name"):
    ax.bar(g["Year"], g["Volume"], bottom=bottom, label=i, width=0.5)
    bottom += g["Volume"].values
ax.set_title("Trading volume per year for selected shares")
ax.set_ylabel("Volume (billions of shares)")
ax.set_xlabel("Year")
ax.legend()
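The matplotlib loop stacks the bars by keeping a running bottom array, which relies on every stock having a value for every year, in the same order. The same offsets can be computed on a wide table with a cumulative sum. This is a sketch on made-up numbers, and the names volume, wide, tops, and bottoms are illustrative only.

```python
import pandas as pd

# Made-up yearly volumes in long form.
volume = pd.DataFrame(
    {
        "Year": [2016, 2016, 2017, 2017],
        "Name": ["AAPL", "IBM", "AAPL", "IBM"],
        "Volume": [3.0, 1.0, 2.0, 4.0],
    }
)

# One row per year, one column per stock.
wide = volume.pivot(index="Year", columns="Name", values="Volume")

# Cumulative sum across columns gives the top edge of each stacked segment;
# subtracting the segment's own height gives its bottom edge.
tops = wide.cumsum(axis=1)
bottoms = tops - wide

print(bottoms)
```

Each column of bottoms is what you would pass as the bottom argument when drawing that stock's bars.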
seaborn
fig, ax = plt.subplots(figsize=(12, 6))
ax = sns.histplot(
    x=df.Year,
    hue=df.Name,
    weights=df.Volume,
    multiple="stack",
    shrink=0.5,
    discrete=True,
    hue_order=df.groupby("Name").Volume.sum().sort_values().index,
)
ax.set_title("Trading volume per year for selected shares")
ax.set_ylabel("Volume (billions of shares)")
legend = ax.get_legend()
legend.set_bbox_to_anchor((1, 1))
plotly.express
fig = px.bar(
    df,
    x="Year",
    y="Volume",
    color="Name",
    title="Trading volume per year for selected shares",
    barmode="stack",
    labels={"Volume": "Volume (billions of shares)"},
    width=900,
    height=500,
)
fig.show()
For the stacked area charts, read the data as follows:
url = "https://raw.githubusercontent.com/szrlee/Stock-Time-Series-Analysis/master/data/all_stocks_2006-01-01_to_2018-01-01.csv"
df = pd.read_csv(url)
stocks = ["AAPL", "AMZN", "GOOGL", "IBM", "JPM"]
df = df.loc[df.Name.isin(stocks), ["Date", "Name", "Volume"]]
df["Date"] = pd.to_datetime(df.Date)
df = df[df.Date.dt.year >= 2017]
df["Volume Perc"] = df["Volume"] / df.groupby("Date")["Volume"].transform("sum")
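The last line of this preparation step divides each stock's volume by the total volume traded that day, so the shares for any single date add up to 1. Here is a sketch of the same groupby and transform pattern on made-up numbers (the names volumes and daily_total are illustrative):

```python
import pandas as pd

# Made-up daily volumes for two stocks over two days.
volumes = pd.DataFrame(
    {
        "Date": ["2017-01-03", "2017-01-03", "2017-01-04", "2017-01-04"],
        "Name": ["AAPL", "IBM", "AAPL", "IBM"],
        "Volume": [30, 10, 20, 20],
    }
)

# transform("sum") returns the per-date total aligned with the original rows,
# so the division works row by row.
daily_total = volumes.groupby("Date")["Volume"].transform("sum")
volumes["Volume Perc"] = volumes["Volume"] / daily_total

print(volumes)
```

Unlike a plain groupby followed by sum, transform keeps the original number of rows, which is why the result can be assigned straight back as a new column.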
pandas
df_wide = df.pivot(index="Date", columns="Name", values="Volume Perc")
ax = df_wide.plot.area(
    rot=0,
    figsize=(12, 6),
    title="Distribution of daily trading volume - 2017",
    stacked=True,
)
ax.legend(bbox_to_anchor=(1, 1), loc="upper left")
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1))
matplotlib
df_wide = df.pivot(index="Date", columns="Name", values="Volume Perc")
fig, ax = plt.subplots(figsize=(12, 6))
ax.stackplot(df_wide.index, [df_wide[col].values for col in stocks], labels=stocks)
ax.legend(bbox_to_anchor=(1, 1), loc="upper left")
ax.set_title("Distribution of daily trading volume - 2017")
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1))
plotly.express
fig = px.area(
    df,
    x="Date",
    y="Volume Perc",
    color="Name",
    title="Distribution of daily trading volume - 2017",
    width=900,
    height=500,
)
fig.update_layout(yaxis_tickformat="%")
fig.show()
For the donut charts, read the data as follows:
url = "https://raw.githubusercontent.com/szrlee/Stock-Time-Series-Analysis/master/data/all_stocks_2006-01-01_to_2018-01-01.csv"
df = pd.read_csv(url)
stocks_filter = ["AAPL", "JPM", "GOOGL", "AMZN", "IBM"]
df = df.loc[df.Name.isin(stocks_filter), ["Name", "Volume"]]
df = df.groupby("Name").sum().reset_index()
pandas
df.set_index("Name").plot.pie(
    y="Volume",
    wedgeprops=dict(width=0.5),
    figsize=(8, 8),
    autopct="%1.0f%%",
    pctdistance=0.75,
    title="Distribution of trading volume for selected stocks (2006 - 2017)",
)
matplotlib
fig, ax = plt.subplots(figsize=(8, 8))
ax.pie(
    df.Volume,
    labels=df.Name,
    autopct="%1.0f%%",
    wedgeprops=dict(width=0.5),
    pctdistance=0.75,
)
ax.set_title("Distribution of trading volume for selected stocks (2006 - 2017)")
ax.legend()
plotly.express
fig = px.pie(
    data_frame=df,
    values="Volume",
    names="Name",
    hole=0.5,
    color="Name",
    title="Distribution of trading volume for selected stocks (2006 - 2017)",
    width=900,
    height=500,
)
fig.show()
For the histograms, read the data as follows:
url = "https://raw.githubusercontent.com/szrlee/Stock-Time-Series-Analysis/master/data/all_stocks_2006-01-01_to_2018-01-01.csv"
df = pd.read_csv(url)
stocks_filter = ["GOOGL", "AMZN"]
df = df.loc[df.Name.isin(stocks_filter), ["Name", "Close"]]
pandas
fig, ax = plt.subplots(figsize=(12, 6))
for idx, (i, g) in enumerate(df.groupby("Name")):
    if idx == 0:
        # Keep the bin edges from the first stock so both histograms share them
        _, bins, _ = ax.hist(g.Close, alpha=0.75, label=i, bins=30)
    else:
        ax.hist(g.Close, alpha=0.75, label=i, bins=bins)
ax.legend()
ax.set_title("Distribution of Closing Prices - GOOGL vs. AMZN")
ax.set_ylabel("Frequency")
ax.set_xlabel("Closing Price")
matplotlib
fig, ax = plt.subplots(figsize=(12, 6))
for idx, (i, g) in enumerate(df.groupby("Name")):
    if idx == 0:
        # Keep the bin edges from the first stock so both histograms share them
        _, bins, _ = ax.hist(g.Close, alpha=0.75, label=i, bins=30)
    else:
        ax.hist(g.Close, alpha=0.75, label=i, bins=bins)
ax.legend()
ax.set_title("Distribution of Closing Prices - GOOGL vs. AMZN")
ax.set_ylabel("Frequency")
ax.set_xlabel("Closing Price")
seaborn
fig, ax = plt.subplots(figsize=(12, 6))
sns.histplot(data=df, x="Close", hue="Name", ax=ax)
ax.set_title("Distribution of Closing Prices - GOOGL vs. AMZN")
ax.set_ylabel("Frequency")
ax.set_xlabel("Closing Price")
plotly.express
fig = px.histogram(
    df,
    x="Close",
    color="Name",
    labels={"Close": "Closing Price", "count": "Frequency"},
    title="Distribution of Closing Prices - GOOGL vs. AMZN",
    barmode="overlay",
    width=900,
    height=500,
)
fig.show()
For the scatter plots, read the data as follows:
url = "https://raw.githubusercontent.com/szrlee/Stock-Time-Series-Analysis/master/data/all_stocks_2006-01-01_to_2018-01-01.csv"
df = pd.read_csv(url)
stocks_filter = ["GOOGL", "AMZN"]
df = df.loc[
    (df.Name.isin(stocks_filter)) & (pd.to_datetime(df.Date).dt.year >= 2017),
    ["Date", "Name", "Open", "Close"],
]
df["Return"] = (df["Close"] - df["Open"]) / df["Open"]
df_wide = df.pivot(index="Date", columns="Name", values="Return")
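The Return column here is the intraday return, (Close - Open) / Open, and the pivot puts each stock's returns in its own column so that one can be plotted against the other. A sketch on made-up numbers (the names quotes and returns_wide are illustrative):

```python
import pandas as pd

# Made-up open and close prices for two stocks over two days.
quotes = pd.DataFrame(
    {
        "Date": ["2017-01-03", "2017-01-03", "2017-01-04", "2017-01-04"],
        "Name": ["AMZN", "GOOGL", "AMZN", "GOOGL"],
        "Open": [100.0, 200.0, 110.0, 210.0],
        "Close": [110.0, 190.0, 110.0, 231.0],
    }
)

# Intraday return: relative change between the open and the close of a day.
quotes["Return"] = (quotes["Close"] - quotes["Open"]) / quotes["Open"]

# One row per date, one column of returns per stock.
returns_wide = quotes.pivot(index="Date", columns="Name", values="Return")
print(returns_wide)
```

Each row of returns_wide then becomes one point in the scatter plot, with one stock on each axis.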
pandas
ax = df_wide.plot.scatter(
    x="GOOGL", y="AMZN", title="Daily returns - GOOGL vs. AMZN", figsize=(8, 8)
)
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1))
ax.xaxis.set_major_formatter(mtick.PercentFormatter(1))
matplotlib
import matplotlib.ticker as mtick
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(x=df_wide["GOOGL"], y=df_wide["AMZN"])
ax.set_xlabel("GOOGL")
ax.set_ylabel("AMZN")
ax.set_title("Daily returns - GOOGL vs. AMZN")
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1))
ax.xaxis.set_major_formatter(mtick.PercentFormatter(1))
seaborn
fig, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(data=df_wide, x="GOOGL", y="AMZN", ax=ax)
ax.set_title("Daily returns - GOOGL vs. AMZN")
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1))
ax.xaxis.set_major_formatter(mtick.PercentFormatter(1))
plotly.express
fig = px.scatter(
    df_wide,
    x="GOOGL",
    y="AMZN",
    title="Daily returns - GOOGL vs. AMZN",
    width=600,
    height=600,
)
fig.update_layout(yaxis_tickformat="%", xaxis_tickformat="%")
fig.show()
For the box plots, read the data as follows:
url = "https://raw.githubusercontent.com/szrlee/Stock-Time-Series-Analysis/master/data/all_stocks_2006-01-01_to_2018-01-01.csv"
df = pd.read_csv(url)
stocks = ["AMZN", "GOOGL", "IBM", "JPM"]
df = df.loc[
    (df.Name.isin(stocks)) & (pd.to_datetime(df.Date).dt.year == 2016),
    ["Date", "Name", "Close", "Open"],
]
df["Return"] = (df["Close"] - df["Open"]) / df["Open"]
df["Date"] = pd.to_datetime(df.Date)
pandas
df_wide = df.pivot(index="Date", columns="Name", values="Return")
ax = df_wide.boxplot(column=stocks)
ax.set_ylabel("Daily returns")
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1))
matplotlib
df_wide = df.pivot(index="Date", columns="Name", values="Return")
fig, ax = plt.subplots(figsize=(12, 6))
ax.boxplot([df_wide[col] for col in stocks], vert=True, autorange=True, labels=stocks)
ax.set_ylabel("Daily returns")
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1))
seaborn
ax = sns.boxplot(x="Name", y="Return", data=df, order=stocks)
ax.set_ylabel("Daily returns")
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1))
plotly.express
fig = px.box(
    df,
    x="Name",
    y="Return",
    category_orders={"Name": stocks},
    width=900,
    height=500,
)
fig.show()