Unlike R, where the community has rallied around a single visualization package (ggplot2), Python offers many plotting packages to choose from, each with its own strengths and weaknesses.
Here is a sampling of a few prominent options:
Matplotlib
Matplotlib is the "grandparent" of Python plotting libraries. It was originally designed to look and act like MATLAB, so its interface is fairly "non-Pythonic". Because it has been around the longest, many Python libraries are built on top of it, and there have been various efforts to streamline and overhaul its interface.
Link: https://matplotlib.org/
Seaborn
Seaborn is built on top of Matplotlib and provides functions for building specific statistical plots. It also applies attractive default styling and attempts to standardize the plotting code.
Link: https://seaborn.pydata.org/
Plotnine
Plotnine is also built on top of Matplotlib, and is a Python port of R's ggplot2 plotting library. The original Data Carpentry Python visualization lesson is written to use Plotnine, so that it can stay in sync with the Data Carpentry R lesson.
Link: https://plotnine.readthedocs.io/en/stable/
Plotly
Link: https://plotly.com/python/
Bokeh
Link: https://bokeh.org/
Altair
Link: https://altair-viz.github.io/
We will be using Altair for most of today's lesson because it combines adherence to the Grammar of Graphics with widespread adoption by Python users.
import pandas as pd
surveys = pd.read_csv('data/surveys.csv')
surveys.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35549 entries, 0 to 35548
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   record_id        35549 non-null  int64
 1   month            35549 non-null  int64
 2   day              35549 non-null  int64
 3   year             35549 non-null  int64
 4   plot_id          35549 non-null  int64
 5   species_id       34786 non-null  object
 6   sex              33038 non-null  object
 7   hindfoot_length  31438 non-null  float64
 8   weight           32283 non-null  float64
dtypes: float64(2), int64(5), object(2)
memory usage: 2.4+ MB
species_counts = surveys.groupby('species_id')['record_id'].count().reset_index(name='species_count')
species_counts.head()
| | species_id | species_count |
|---|---|---|
| 0 | AB | 303 |
| 1 | AH | 437 |
| 2 | AS | 2 |
| 3 | BA | 46 |
| 4 | CB | 50 |
len(species_counts)
48
big_species = species_counts[species_counts['species_count'] >= 50]['species_id'].to_list()
big_species
['AB', 'AH', 'CB', 'DM', 'DO', 'DS', 'NL', 'OL', 'OT', 'PB', 'PE', 'PF', 'PM', 'PP', 'RF', 'RM', 'SA', 'SH', 'SS']
surveys_filtered = surveys[surveys['species_id'].isin(big_species)].dropna()
surveys_filtered.info()
<class 'pandas.core.frame.DataFrame'>
Index: 30463 entries, 62 to 35547
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   record_id        30463 non-null  int64
 1   month            30463 non-null  int64
 2   day              30463 non-null  int64
 3   year             30463 non-null  int64
 4   plot_id          30463 non-null  int64
 5   species_id       30463 non-null  object
 6   sex              30463 non-null  object
 7   hindfoot_length  30463 non-null  float64
 8   weight           30463 non-null  float64
dtypes: float64(2), int64(5), object(2)
memory usage: 2.3+ MB
surveys_filtered.to_csv('data/surveys_filtered.csv', index=False)
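The filtering steps above (count records per species, keep species at or above a threshold, then drop incomplete rows) can be easier to follow on a tiny table. The sketch below repeats the same pandas calls on made-up data: the `toy` DataFrame, its values, and the threshold of 2 are assumptions for illustration only and are not part of the surveys dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the surveys table.
toy = pd.DataFrame({
    'record_id': [1, 2, 3, 4, 5, 6],
    'species_id': ['AA', 'AA', 'AA', 'BB', 'BB', 'CC'],
    'weight': [10.0, 12.0, None, 30.0, 31.0, 5.0],
})

# Count records per species, as done for species_counts above.
counts = (
    toy.groupby('species_id')['record_id']
    .count()
    .reset_index(name='species_count')
)

# Keep species with at least 2 records (the lesson uses 50).
common = counts[counts['species_count'] >= 2]['species_id'].to_list()

# Keep only rows for those species, then drop rows with any missing value.
toy_filtered = toy[toy['species_id'].isin(common)].dropna()

print(common)
print(toy_filtered)
```

Note that `dropna()` with no arguments removes a row if any column is missing, so the filtered table can end up with fewer rows than the species counts alone would suggest.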
import altair as alt
import vegafusion as vf
vf.enable_widget()
vegafusion.enable_widget()
source = surveys.sample(50)
alt.Chart(source).mark_circle().encode(x='weight',
y='hindfoot_length')
VegaFusionWidget(spec='{\n "config": {\n "view": {\n "continuousWidth": 300,\n "continuousHeight…
url = 'https://gist.githubusercontent.com/MikeTrizna/cd01f9bf3e21d6f74823423bdb45a2f3/raw/2d8c36cf78c9b6abf6938451c60defc93c5911a4/surveys_filtered.csv'
alt.Chart(surveys_filtered).mark_circle(opacity=0.1).encode(x='weight:Q',
y='hindfoot_length:Q')
VegaFusionWidget(spec='{\n "config": {\n "view": {\n "continuousWidth": 300,\n "continuousHeight…
alt.Chart(surveys_filtered).mark_circle(opacity=0.1,
color='red').encode(x='weight:Q',
y='hindfoot_length:Q')
VegaFusionWidget(spec='{\n "config": {\n "view": {\n "continuousWidth": 300,\n "continuousHeight…
alt.Chart(surveys_filtered).mark_circle(opacity=0.1).encode(x='weight:Q',
y='hindfoot_length:Q',
color='species_id:N')
VegaFusionWidget(spec='{\n "config": {\n "view": {\n "continuousWidth": 300,\n "continuousHeight…
alt.Chart(surveys_filtered).mark_circle(opacity=0.1).encode(x='weight:Q',
y='hindfoot_length:Q',
color='species_id:N',
tooltip='species_id:N'
).interactive()
VegaFusionWidget(spec='{\n "config": {\n "view": {\n "continuousWidth": 300,\n "continuousHeight…
alt.Chart(surveys_filtered).mark_circle(opacity=0.1).encode(x='weight:Q',
y='hindfoot_length:Q',
facet='sex:N',
color='species_id:N')
VegaFusionWidget(spec='{\n "config": {\n "view": {\n "continuousWidth": 300,\n "continuousHeight…
alt.Chart(surveys_filtered).mark_boxplot().encode(x='species_id:N',
y='weight:Q')
VegaFusionWidget(spec='{\n "config": {\n "view": {\n "continuousWidth": 300,\n "continuousHeight…
Challenge
Make a boxplot of the dataset that shows the distribution of hindfoot_length values by plot_id.
alt.Chart(surveys_filtered).mark_bar().encode(
x='plot_id:O',
y='count():Q',
color='sex:N'
)
VegaFusionWidget(spec='{\n "config": {\n "view": {\n "continuousWidth": 300,\n "continuousHeight…
alt.Chart(surveys_filtered).mark_line().encode(
x='year:O',
y='count():Q',
color='species_id:N'
)
VegaFusionWidget(spec='{\n "config": {\n "view": {\n "continuousWidth": 300,\n "continuousHeight…
Challenge
Make a bar plot showing the breakdown of sex values by species_id.
brush = alt.selection_interval()
points = alt.Chart(surveys_filtered).mark_point(opacity=0.1).encode(
x='weight:Q',
y='hindfoot_length:Q',
color=alt.condition(brush, 'species_id:N', alt.value('lightgray'))
).add_params(
brush
)
bars = alt.Chart(surveys_filtered).mark_bar().encode(
y='species_id:N',
color='species_id:N',
x='count(species_id):Q'
).transform_filter(
brush
)
points & bars
VegaFusionWidget(spec='{\n "config": {\n "view": {\n "continuousWidth": 300,\n "continuousHeight…