Matplotlib is a great library for visualization because you do not have to customize anything unless you need to: everything works fine out of the box.
Another great thing is that you can customize almost any part of a plot; the range of tuning options is very wide.
A good source of information about Matplotlib's capabilities is the gallery and the tutorials on the project website. The gallery has an example for almost any need: imagine what you want to visualize and how, and you will likely find an implementation of that idea there.
In this tutorial, we will not retell the User's Guide.
We will just create some plots and adjust them to better convey the idea.
import matplotlib
import numpy as np
import pandas as pd
Backends determine the output format. There are two types of backends: user interface (interactive) backends and hardcopy (non-interactive) backends that produce image files. You can list the available backends:
matplotlib.rcsetup.interactive_bk
matplotlib.rcsetup.non_interactive_bk
If you want more interactive capabilities in your plots (such as zooming in and out), you can choose an appropriate backend
matplotlib.use('nbagg')
before you call
import matplotlib.pyplot as plt
We will not do that.
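If you are unsure which backend ended up active, you can ask Matplotlib directly. The sketch below is not part of the original notebook; it selects the non-interactive Agg backend purely as an illustration, on the assumption that only image files are needed.

```python
import matplotlib

# Select a non-interactive backend before pyplot is imported.
# "Agg" renders to raster images; the choice is only illustrative.
matplotlib.use("Agg")

import matplotlib.pyplot as plt  # noqa: E402  (imported after choosing the backend)

# Ask Matplotlib which backend is active now.
active_backend = matplotlib.get_backend()
print(active_backend)
```

The exact capitalization of the returned name can differ between Matplotlib versions, so compare it case-insensitively if you need to branch on it.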
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
What you really need is to choose a style. In Appendix 1 you can see the available styles.
The style defines many parameters of the chart, so if you cannot live without a grid and a gray background, set your favorite style first.
I will use:
plt.rcdefaults()
plt.style.use("seaborn-whitegrid")
plt.rcParams["figure.figsize"] = (8, 4)
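One caveat: in recent Matplotlib releases (3.6 and later, as far as I know) the seaborn styles were renamed with a seaborn-v0_8- prefix, so the style name used above may not be found. Here is a minimal sketch that picks whichever name is available; the fallback to "default" is my own assumption, not something this tutorial prescribes.

```python
import matplotlib.pyplot as plt

# Newer name first, older name second.
candidates = ["seaborn-v0_8-whitegrid", "seaborn-whitegrid"]

# Fall back to Matplotlib's built-in "default" style if neither exists (assumption).
style_name = next((s for s in candidates if s in plt.style.available), "default")

plt.rcdefaults()
plt.style.use(style_name)
plt.rcParams["figure.figsize"] = (8, 4)
print(style_name)
```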
Let's load the Olympic dataset.
dfAthlete = pd.read_csv("../../data/athlete_events.csv")
dfNOC = pd.read_csv("../../data/noc_regions.csv")
dfNOC.columns = ["NOC", "NOC_Region", "NOC_Notes"]
dfAthlete = pd.merge(
left=dfAthlete, right=dfNOC, left_on="NOC", right_on="NOC", how="left"
)
dfAthlete.head()
Our first idea is to show how some values have grown over the years.
plt.plot(
dfAthlete[dfAthlete["Season"] == "Summer"].groupby("Year")["ID"].nunique(), "ro-"
)
plt.plot(
dfAthlete[dfAthlete["Season"] == "Winter"].groupby("Year")["ID"].nunique(), "bo-"
)
plt.xlabel("Years", fontsize=14)
plt.ylabel("Athletes", fontsize=14)
plt.title("The number of Athletes", fontsize=16);
Obviously, the total number of athletes taking part in the games grows over time.
Autoscaling has done well, but we want better. Let's consider what we can do with the grid.
Now we need to briefly describe the relationship between Figure, Axes, Subplots and pyplot.
When you call pyplot functions, pyplot calls the corresponding Axes method on the Axes that is "current".
We purposely used pyplot functions above because it is faster and easier.
From here on, instead of this notation:
plt.plot(...)
plt.title(...)
we will use this one:
fig, ax = plt.subplots()
ax.plot(...)
ax.set_title(...)
Yes, it's longer, but it's more explicit, and we can use this variant with multiple Axes created by subplots.
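To make the idea of a "current" Axes concrete, here is a small sketch with two Axes in one Figure. The data and titles are made up for illustration; plt.sca, plt.gca and plt.gcf are the pyplot functions that set and query the current Axes and Figure.

```python
import matplotlib.pyplot as plt

fig, (ax_left, ax_right) = plt.subplots(nrows=1, ncols=2)

# pyplot style: first make the left Axes "current", then call pyplot functions.
plt.sca(ax_left)
plt.plot([1, 2, 3])        # forwarded to ax_left.plot(...)
plt.title("via pyplot")    # forwarded to ax_left.set_title(...)

# Explicit style: call the methods on the Axes you mean.
ax_right.plot([3, 2, 1])
ax_right.set_title("via Axes")

same_figure = plt.gcf() is fig          # the current Figure is the one we created
current_is_left = plt.gca() is ax_left  # the current Axes is still the left one
print(same_figure, current_is_left)
print(ax_left.get_title(), "|", ax_right.get_title())
```

With the explicit style there is no hidden state to keep track of, which is why it scales better to figures with several Axes.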
OK, we want to turn on minor grid lines for the x-axis and show one every 4 years.
We can do it like this:
ax.xaxis.set_ticks(np.arange(minYear, maxYear, 4), minor=True) # set minor ticks location
ax.grid(True, which='minor', linestyle='dotted') # turn minor ticks lines on
fig, ax = plt.subplots()
ax.plot(
dfAthlete[dfAthlete["Season"] == "Summer"].groupby("Year")["ID"].nunique(), "ro-"
)
ax.plot(
dfAthlete[dfAthlete["Season"] == "Winter"].groupby("Year")["ID"].nunique(), "bo-"
)
ax.xaxis.set_ticks(np.arange(1896, 2020, 4), minor=True)
ax.grid(True, which="minor", linestyle="dotted")
ax.set_xlabel("Years", fontsize=14)
ax.set_ylabel("Athletes", fontsize=14)
ax.set_title("The number of Athletes", fontsize=16);
But there is a better way to place ticks.
We can use the MultipleLocator class; it is just what we need in this case.
Do not invent your own way of placing ticks until you have looked into matplotlib.ticker. It has many useful locators for all sorts of cases (for datetime axes, you can use the locators from matplotlib.dates).
So, let's use MultipleLocator for the minor (every 4 years) and major (every 24 years) ticks on the x-axis, and MaxNLocator for the y-axis (no more than 3 intervals between major ticks).
from matplotlib.ticker import MaxNLocator, MultipleLocator
fig, ax = plt.subplots()
ax.plot(
dfAthlete[dfAthlete["Season"] == "Summer"].groupby("Year")["ID"].nunique(),
color="r",
linewidth=4,
marker="o",
markersize=8,
markerfacecolor="w",
markeredgecolor="r",
label="Summer",
)
ax.plot(
dfAthlete[dfAthlete["Season"] == "Winter"].groupby("Year")["ID"].nunique(),
color="b",
linewidth=4,
marker="o",
markersize=8,
markerfacecolor="w",
markeredgecolor="b",
label="Winter",
)
ax.xaxis.set_major_locator(MultipleLocator(24)) # Major ticks every 24 years
ax.xaxis.set_minor_locator(MultipleLocator(4)) # Minor ticks every 4 years
ax.yaxis.set_major_locator(MaxNLocator(3)) # Major ticks: at most 3 intervals
ax.grid(True, which="minor", linestyle="dotted") # Minor lines dotted
ax.margins(x=0.05, y=0.1) # padding
ax.tick_params(labelsize=12)
ax.legend(loc="center left", frameon=True, fontsize=12) # legend with frame
ax.set_xlabel("Years", fontsize=14)
ax.set_ylabel("Athletes", fontsize=14)
ax.set_title("The number of Athletes", fontsize=16)
plt.show();
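To see what MultipleLocator computes without drawing anything, you can call its tick_values method directly. This is only a sketch: the year range is a hypothetical one chosen for illustration, and the locator may return a tick or two just outside the requested range (Matplotlib draws only those that fall within the axis limits).

```python
import numpy as np
from matplotlib.ticker import MultipleLocator

# Hypothetical range of years, used only for this illustration.
first_year, last_year = 1896, 1944

# Candidate tick positions for the minor (every 4 years) and major (every 24 years) ticks.
minor_ticks = np.asarray(MultipleLocator(4).tick_values(first_year, last_year))
major_ticks = np.asarray(MultipleLocator(24).tick_values(first_year, last_year))

print(minor_ticks)
print(major_ticks)
```

Every returned position is a multiple of the locator's base, which is why the ticks stay aligned with the four-year cycle whatever the data range is.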
Now we can make the same plot for the number of events and put the two together.
We want both plots to share one x-axis, so we will use
plt.subplots(nrows=2, ncols=1, sharex=True)
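A quick way to see what sharex=True does is to change the x-limits on one Axes and read them back from the other. The sketch below uses empty Axes and arbitrary limits, just for illustration.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=2, ncols=1, sharex=True)

# Change the x-limits on the top Axes only.
axes[0].set_xlim(1896, 2016)

# With a shared x-axis, the bottom Axes follows.
top_limits = tuple(axes[0].get_xlim())
bottom_limits = tuple(axes[1].get_xlim())
print(top_limits, bottom_limits)
plt.close(fig)
```

Tick locators are shared in the same way, so setting them on one of the Axes is enough.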
fig, axes = plt.subplots(nrows=2, ncols=1, sharex=True)
axes[0].plot(
dfAthlete[dfAthlete["Season"] == "Summer"].groupby("Year")["ID"].nunique(),
color="r",
linewidth=4,
marker="o",
markersize=8,
markerfacecolor="w",
markeredgecolor="r",
label="Summer",
)
axes[0].plot(
dfAthlete[dfAthlete["Season"] == "Winter"].groupby("Year")["ID"].nunique(),
color="b",
linewidth=4,
marker="o",
markersize=8,
markerfacecolor="w",
markeredgecolor="b",
label="Winter",
)
axes[0].grid(True, which="minor", linestyle="dotted")
axes[0].yaxis.set_major_locator(MaxNLocator(3))
axes[0].margins(x=0.05, y=0.1)
axes[0].legend(loc="center left", frameon=True, fontsize=12)
axes[0].set_xlabel("Years", fontsize=14)
axes[0].set_ylabel("Athletes", fontsize=14)
axes[0].set_title("The number of Athletes", fontsize=16)
axes[1].plot(
dfAthlete[dfAthlete["Season"] == "Summer"].groupby("Year")["Event"].nunique(),
color="r",
linewidth=4,
marker="o",
markersize=8,
markerfacecolor="w",
markeredgecolor="r",
label="Summer",
)
axes[1].plot(
dfAthlete[dfAthlete["Season"] == "Winter"].groupby("Year")["Event"].nunique(),
color="b",
linewidth=4,
marker="o",
markersize=8,
markerfacecolor="w",
markeredgecolor="b",
label="Winter",
)
axes[1].xaxis.set_major_locator(MultipleLocator(24))
axes[1].xaxis.set_minor_locator(MultipleLocator(4))
axes[1].yaxis.set_major_locator(MaxNLocator(3))
axes[1].grid(True, which="minor", linestyle="dotted")
axes[1].margins(x=0.05, y=0.1)
axes[1].legend(loc="center left", frameon=True, fontsize=12)
axes[1].set_xlabel("Years", fontsize=14)
axes[1].set_ylabel("Events", fontsize=14)
axes[1].set_title("The number of Events", fontsize=16)
plt.show();
Oh, something is wrong with the labels and titles.
In this case, possible solutions are to call fig.tight_layout(), to adjust the spacing with fig.subplots_adjust(hspace=XX), or to change figure.figsize.
Now let's set the location of the shared x ticks. For example, we can place them between the two plots.
We can do it like this:
axes[1].xaxis.tick_top()
axes[1].set_xlabel('')
axes[1].tick_params(axis='x', pad=5)
fig.subplots_adjust(hspace=0.1)
Let's try.
fig, axes = plt.subplots(nrows=2, ncols=1, sharex=True, figsize=(10, 8))
axes[0].plot(
dfAthlete[dfAthlete["Season"] == "Summer"].groupby("Year")["ID"].nunique(),
color="r",
linewidth=4,
marker="o",
markersize=8,
markerfacecolor="w",
markeredgecolor="r",
label="Summer",
)
axes[0].plot(
dfAthlete[dfAthlete["Season"] == "Winter"].groupby("Year")["ID"].nunique(),
color="b",
linewidth=4,
marker="o",
markersize=8,
markerfacecolor="w",
markeredgecolor="b",
label="Winter",
)
axes[0].grid(True, which="minor", linestyle="dotted")
axes[0].yaxis.set_major_locator(MaxNLocator(3))
axes[0].margins(x=0.05, y=0.1)
axes[0].set_xlabel("Years", fontsize=14)
axes[0].set_ylabel("Athletes", fontsize=14)
axes[0].set_title("The number of athletes and events is growing", fontsize=16)
# Common title
axes[1].plot(
dfAthlete[dfAthlete["Season"] == "Summer"].groupby("Year")["Event"].nunique(),
color="r",
linewidth=4,
marker="o",
markersize=8,
markerfacecolor="w",
markeredgecolor="r",
label="Summer",
)
axes[1].plot(
dfAthlete[dfAthlete["Season"] == "Winter"].groupby("Year")["Event"].nunique(),
color="b",
linewidth=4,
marker="o",
markersize=8,
markerfacecolor="w",
markeredgecolor="b",
label="Winter",
)
axes[1].xaxis.set_major_locator(MultipleLocator(24))
axes[1].xaxis.set_minor_locator(MultipleLocator(4))
axes[1].yaxis.set_major_locator(MaxNLocator(3))
axes[1].grid(True, which="minor", linestyle="dotted")
axes[1].margins(x=0.05, y=0.1)
axes[1].legend(loc="upper left", frameon=True, fontsize=12)
# Common legend
axes[1].set_xlabel("")
# Hide x-label
axes[1].xaxis.tick_top()
# Move ticks to top
axes[1].tick_params(axis="x", pad=5) # Increase space to plot
axes[1].set_ylabel("Events", fontsize=14)
fig.subplots_adjust(hspace=0.1) # Reduce space between plots
plt.show();
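The effect of hspace can also be checked numerically. Here is a minimal sketch on empty Axes; the helper vertical_gap is a hypothetical name introduced only for this illustration, and the two hspace values are arbitrary.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=2, ncols=1, sharex=True)


def vertical_gap(top_ax, bottom_ax):
    """Gap between two stacked Axes, as a fraction of the figure height."""
    return top_ax.get_position().y0 - bottom_ax.get_position().y1


# A small hspace pulls the plots together ...
fig.subplots_adjust(hspace=0.1)
gap_small = vertical_gap(axes[0], axes[1])

# ... and a large one pushes them apart.
fig.subplots_adjust(hspace=0.6)
gap_large = vertical_gap(axes[0], axes[1])

print(gap_small, gap_large)
plt.close(fig)
```

hspace is expressed as a fraction of the average Axes height, so the gap grows together with the plots when you enlarge the figure.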
We are going to add two more plots in the same style, and setting the same linewidth, markersize, etc. every time gets tiring.
If we plan to customize more parameters and use this style again, we can save a <style-name>.mplstyle file to mpl_configdir/stylelib
and then load the style with plt.style.use(<style-name>) every time we need it.
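As a rough sketch of what such a style file could contain, the example below writes the four line parameters used in this tutorial into a .mplstyle file and loads it. Two assumptions to note: the file name my_lines.mplstyle is hypothetical, and the file is written to a temporary directory and loaded by its path instead of being installed into mpl_configdir/stylelib.

```python
import os
import tempfile

import matplotlib
import matplotlib.pyplot as plt

# Contents of the style file: one "key: value" pair per line.
style_text = """\
lines.linewidth: 3
lines.marker: o
lines.markersize: 6
lines.markerfacecolor: white
"""

# Hypothetical file name; a temporary directory keeps the sketch side-effect free.
style_path = os.path.join(tempfile.mkdtemp(), "my_lines.mplstyle")
with open(style_path, "w") as style_file:
    style_file.write(style_text)

# plt.style.use also accepts a path to a style file.
plt.style.use(style_path)
loaded = {key: plt.rcParams[key] for key in
          ("lines.linewidth", "lines.marker", "lines.markersize", "lines.markerfacecolor")}
print(loaded)

# Where a permanently installed style would go:
print(os.path.join(matplotlib.get_configdir(), "stylelib"))

plt.rcdefaults()  # restore the defaults after the demonstration
```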
For now, we will just set some common parameters with rcParams:
plt.rcParams['lines.linewidth'] = 3
plt.rcParams['lines.marker'] = 'o'
plt.rcParams['lines.markersize'] = 6
plt.rcParams['lines.markerfacecolor'] = 'white'
After we have finished working with these parameters, we need to return to the default settings:
plt.rcdefaults()
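If the parameters are needed for only one plot, an alternative is plt.rc_context, which restores the previous values automatically when the block ends. This is a sketch of that alternative with made-up data, not the approach used in the rest of this tutorial.

```python
import matplotlib.pyplot as plt

width_before = plt.rcParams["lines.linewidth"]

# Parameters set here apply only inside the "with" block.
with plt.rc_context({"lines.linewidth": 3, "lines.marker": "o"}):
    width_inside = plt.rcParams["lines.linewidth"]
    fig, ax = plt.subplots()
    (line,) = ax.plot([1, 2, 3])  # picks up the temporary line settings
    plt.close(fig)

# Outside the block the previous value is back.
width_after = plt.rcParams["lines.linewidth"]
print(width_before, width_inside, width_after)
```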
plt.rcdefaults()
plt.style.use("seaborn-whitegrid")
plt.rcParams["lines.linewidth"] = 3
plt.rcParams["lines.marker"] = "o"
plt.rcParams["lines.markersize"] = 6
plt.rcParams["lines.markerfacecolor"] = "white"
color = sns.color_palette("Paired")
fig, axes = plt.subplots(nrows=2, ncols=1, sharex=True, figsize=(10, 8))
axes[0].plot(
dfAthlete[(dfAthlete["Sex"] == "F") & (dfAthlete["Season"] == "Summer")][
["ID", "Year", "Height"]
]
.drop_duplicates()
.groupby("Year")["Height"]
.mean(),
color=color[5],
markeredgecolor=color[5],
label="Female Summer",
)
axes[0].plot(
dfAthlete[(dfAthlete["Sex"] == "F") & (dfAthlete["Season"] == "Winter")][
["ID", "Year", "Height"]
]
.drop_duplicates()
.groupby("Year")["Height"]
.mean(),
color=color[4],
markeredgecolor=color[4],
label="Female Winter",
)
axes[0].plot(
dfAthlete[(dfAthlete["Sex"] == "M") & (dfAthlete["Season"] == "Summer")][
["ID", "Year", "Height"]
]
.drop_duplicates()
.groupby("Year")["Height"]
.mean(),
color=color[1],
markeredgecolor=color[1],
label="Male Summer",
)
axes[0].plot(
dfAthlete[(dfAthlete["Sex"] == "M") & (dfAthlete["Season"] == "Winter")][
["ID", "Year", "Height"]
]
.drop_duplicates()
.groupby("Year")["Height"]
.mean(),
color=color[0],
markeredgecolor=color[0],
label="Male Winter",
)
axes[0].yaxis.set_major_locator(MaxNLocator(6))
axes[0].grid(True, which="minor", linestyle="dotted")
axes[0].margins(x=0.05, y=0.1)
axes[0].set_xlabel("Years", fontsize=12)
axes[0].set_ylabel("Height", fontsize=12)
axes[0].set_title("The average height and weight of the Athletes", fontsize=14)
axes[1].plot(
dfAthlete[(dfAthlete["Sex"] == "F") & (dfAthlete["Season"] == "Summer")][
["ID", "Year", "Weight"]
]
.drop_duplicates()
.groupby("Year")["Weight"]
.mean(),
color=color[5],
markeredgecolor=color[5],
label="Female Summer",
)
axes[1].plot(
dfAthlete[(dfAthlete["Sex"] == "F") & (dfAthlete["Season"] == "Winter")][
["ID", "Year", "Weight"]
]
.drop_duplicates()
.groupby("Year")["Weight"]
.mean(),
color=color[4],
markeredgecolor=color[4],
label="Female Winter",
)
axes[1].plot(
dfAthlete[(dfAthlete["Sex"] == "M") & (dfAthlete["Season"] == "Summer")][
["ID", "Year", "Weight"]
]
.drop_duplicates()
.groupby("Year")["Weight"]
.mean(),
color=color[1],
markeredgecolor=color[1],
label="Male Summer",
)
axes[1].plot(
dfAthlete[(dfAthlete["Sex"] == "M") & (dfAthlete["Season"] == "Winter")][
["ID", "Year", "Weight"]
]
.drop_duplicates()
.groupby("Year")["Weight"]
.mean(),
color=color[0],
markeredgecolor=color[0],
label="Male Winter",
)
axes[1].set_ylabel("Weight", fontsize=12)
axes[1].xaxis.set_major_locator(MultipleLocator(24))
axes[1].xaxis.set_minor_locator(MultipleLocator(4))
axes[1].yaxis.set_major_locator(MaxNLocator(6))
axes[1].grid(True, which="minor", linestyle="dotted")
axes[1].margins(x=0.05, y=0.1)
axes[1].set_xlabel("")
axes[1].xaxis.tick_top()
axes[1].tick_params(axis="x", pad=5)
fig.subplots_adjust(hspace=0.1)
plt.show();
As you can see in the list of available colormaps, they are grouped into classes such as sequential, diverging, qualitative and miscellaneous.
Above, we used qualitative colors from the Paired colormap to distinguish Male/Female and Summer/Winter.
color = sns.color_palette('Paired')
ax.plot(...,color=color[5])
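The same colors can be inspected with Matplotlib alone. The sketch below reads the Paired colormap through plt.get_cmap; as far as I can tell, this should correspond to what sns.color_palette('Paired') returns, but treat that equivalence as an assumption.

```python
import matplotlib.pyplot as plt

# Paired is a listed (qualitative) colormap, so its colors are available as a sequence.
paired = plt.get_cmap("Paired").colors

# Colors come in light/dark pairs, which suits a two-by-two grouping
# such as Male/Female by Summer/Winter.
light_blue, dark_blue = paired[0], paired[1]
light_red, dark_red = paired[4], paired[5]

print(len(paired))
print(dark_blue, dark_red)
```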
Now let's use a diverging colormap for a heatmap. Let's look at how the number of medals per athlete has changed in the 10 countries that won the most medals in the last 5 games.
TopTen = (
dfAthlete[
(dfAthlete["Season"] == "Summer")
& (dfAthlete["Year"] >= 2000)
& (~dfAthlete["Medal"].isna())
][["Year", "NOC", "Event", "Medal"]]
.drop_duplicates()
.groupby("NOC")
.size()
.sort_values(ascending=False)
.head(10)
.index.values
)
TopTen_MedalRate = []
for c in TopTen:
    for y in range(2000, 2020, 4):
        TopTen_MedalRate.append(
            [
                y,
                c,
                # Medals: distinct (Event, Medal) pairs won by this country in this year
                dfAthlete[
                    (dfAthlete["Season"] == "Summer")
                    & (dfAthlete["Year"] == y)
                    & (dfAthlete["NOC"] == c)
                    & (~dfAthlete["Medal"].isna())
                ][["Event", "Medal"]]
                .drop_duplicates()["Event"]
                .count(),
                # Athletes: distinct athlete IDs of this country in this year
                dfAthlete[
                    (dfAthlete["Season"] == "Summer")
                    & (dfAthlete["Year"] == y)
                    & (dfAthlete["NOC"] == c)
                ][["ID"]]
                .nunique()
                .values[0],
            ]
        )
dfTopTen_MedalRate = pd.DataFrame(
TopTen_MedalRate, columns=["Year", "NOC", "Medals", "Athletes"]
)
dfTopTen_MedalRate["MedalsToAthlete"] = (
dfTopTen_MedalRate["Medals"] / dfTopTen_MedalRate["Athletes"]
)
plt.rcdefaults()
plt.style.use("seaborn-whitegrid")
sns.heatmap(
dfTopTen_MedalRate.pivot_table(
columns="Year", index="NOC", values="MedalsToAthlete"
),
annot=True,
fmt=".5f",
linewidths=0.5,
cmap="RdGy_r",
);
We can set the center manually if we know the boundary that separates the losers (we will treat as losers everyone who falls short of 0.16 medals per athlete). We can also rotate the y tick labels and increase the space between them and the plot.
fig, ax = plt.subplots()
sns.heatmap(
dfTopTen_MedalRate.pivot_table(
columns="Year", index="NOC", values="MedalsToAthlete"
),
annot=True,
fmt=".5f",
linewidths=0.5,
cmap="RdGy_r",
center=0.16,
ax=ax,
)
plt.setp(ax.yaxis.get_majorticklabels(), rotation=0)
ax.tick_params(axis="y", pad=10)
ax.set_title(
"Change Medal-per-Athlete in last 5 Summer Olympic Games",
fontsize=12,
weight="bold",
)
plt.show()
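The idea of centering a diverging colormap can be illustrated with plain Matplotlib as well. The sketch below uses matplotlib.colors.TwoSlopeNorm, which maps a chosen center value to the middle of the colormap. Note the assumptions: seaborn may implement its center argument differently internally, and the minimum and maximum rates here are hypothetical numbers, not values from the dataset.

```python
import matplotlib.colors as clrs
import matplotlib.pyplot as plt

# Hypothetical range of medals-per-athlete values; 0.16 is the boundary used above.
vmin, vcenter, vmax = 0.05, 0.16, 0.40

norm = clrs.TwoSlopeNorm(vmin=vmin, vcenter=vcenter, vmax=vmax)
cmap = plt.get_cmap("RdGy_r")

for rate in (vmin, 0.10, vcenter, 0.30, vmax):
    position = float(norm(rate))  # position in the colormap, between 0 and 1
    red, green, blue, alpha = cmap(position)
    print(f"{rate:.2f} -> {position:.2f} -> ({red:.2f}, {green:.2f}, {blue:.2f})")

center_position = float(norm(vcenter))
```

Values below the center land in one half of the colormap and values above it in the other, even though the two sides of the range have different widths.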
We will demonstrate the use of a miscellaneous colormap on the rating data.
import matplotlib.colors as clrs
import numpy as np
import pandas as pd
dfMLO = pd.read_csv("../../data/English session 2 rating - Sheet1.csv")
dfMLO.fillna(value=0, inplace=True)
AssignColimns = dfMLO.columns.values[
[True if x.find("A") == 0 else False for x in dfMLO.columns.values]
]
CURRENT_STEP = 9
dfPlaces = pd.DataFrame(
{"Place": range(1, 1 + dfMLO["Overall"].drop_duplicates().shape[0])},
index=dfMLO["Overall"].drop_duplicates(),
)
dfMLO = pd.merge(
left=dfMLO, right=dfPlaces, left_on="Overall", right_index=True, how="left"
)
MaxPlace = dfMLO.iloc[-1]["Place"]
MaxGroupMembers = dfMLO.groupby("Overall").size().max()
plt.rcParams["figure.figsize"] = (20, 20)
cmap = plt.cm.rainbow
cl_norm = clrs.Normalize(vmin=1, vmax=MaxPlace)
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
sns.countplot(x="Overall", data=dfMLO, palette="rainbow", ax=ax1)
ax1.set_title(str(dfMLO.shape[0]) + " participants", fontsize=16)
ax1.set_xlabel("Points")
ax1.set_xlim([dfMLO.iloc[-1]["Overall"] - 1, dfMLO.iloc[0]["Overall"] + 2])
ax1.set_yticks(np.arange(0, ((MaxGroupMembers // 50) + 1) * 50, 50))
ax1.set_yticks(np.arange(0, ((MaxGroupMembers // 10) + 1) * 10, 10), minor=True)
ax1.grid(axis="y", which="minor", linestyle="dotted")
ax1.set_ylabel("Number of persons", fontsize=16)
for i in range(dfMLO.shape[0] - 1, 0, -1):
    ax2.plot(
        dfMLO.iloc[i][AssignColimns[:CURRENT_STEP]].cumsum().values,
        range(1, len(AssignColimns[:CURRENT_STEP]) + 1),
        color=cmap(cl_norm(MaxPlace - dfMLO.iloc[i]["Place"])),
        marker="o",
        linestyle="-",
        alpha=0.3,
    )
ax2.set_yticks(ticks=range(1, len(AssignColimns[:CURRENT_STEP]) + 1))
ax2.set_yticklabels(labels=AssignColimns[:CURRENT_STEP])
ax2.set_ylabel("Assignment", fontsize=16)
ax2.xaxis.tick_top()
ax2.set_xlim([dfMLO.iloc[-1]["Overall"] - 1, dfMLO.iloc[0]["Overall"] + 2])
ax2.set_xticks(np.arange(0, ((dfMLO.iloc[0]["Overall"] // 5) + 1) * 5, 5))
ax2.set_xticklabels(
range(0, ((dfMLO.iloc[0]["Overall"] // 5) + 1) * 5, 5), rotation="vertical"
)
ax2.tick_params(axis="x", pad=20)
ax2.grid(False)
plt.subplots_adjust(hspace=0.1)
plt.show();
Here the color is set by the participant's place in the current rating, so each line shows the participant's "path" through all the previous assignments:
import matplotlib.colors as clrs
cmap = plt.cm.rainbow
cl_norm = clrs.Normalize(vmin=1, vmax=MaxPlace)
ax.plot(..., color=cmap(cl_norm(...)))
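To see what the cmap(cl_norm(...)) combination returns, here is a minimal sketch with a hypothetical number of places. Normalize rescales a value to the 0..1 range, and the colormap turns that number into an RGBA tuple.

```python
import matplotlib.colors as clrs
import matplotlib.pyplot as plt

max_place = 10  # hypothetical number of places, for illustration only

cmap = plt.get_cmap("rainbow")
cl_norm = clrs.Normalize(vmin=1, vmax=max_place)

# Normalize maps vmin to 0.0 and vmax to 1.0 ...
lowest = float(cl_norm(1))
highest = float(cl_norm(max_place))

# ... and the colormap turns a number in that range into an RGBA tuple.
first_color = cmap(cl_norm(1))
last_color = cmap(cl_norm(max_place))

print(lowest, highest)
print(first_color, last_color)
```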
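We very briefly reviewed backends, styles, tick locators and grid lines, subplots with a shared axis, rcParams, and colormaps.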
While preparing this tutorial, I did not reach the limits of Matplotlib's capabilities. To find them, we need more experience. Good luck with that!
# Appendix 1: preview every available style
for st in plt.style.available:
    plt.rcdefaults()
    plt.style.use(st)
    plt.plot(range(10), np.random.normal(5, 3, 10))
    plt.title(st)
    plt.show();