Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, and Jonathan Morgan.
In this module, you will learn to quickly and flexibly make a wide range of visualizations for exploratory data analysis and for communicating with your audience. This module contains a practical introduction to data visualization in Python and covers important rules that any data visualizer should follow.
Become familiar with a core base of data visualization tools in Python - specifically matplotlib and seaborn
Begin exploring what visualizations are going to best reveal various types of patterns in your data
Learn more about our primary datasets through exploratory analyses and visualizations
# data manipulation in Python
import pandas as pd
# visualization packages
import matplotlib.pyplot as plt
import seaborn as sns
# database connection
from sqlalchemy import create_engine
# see how long queries/etc take
import time
# so images get plotted in the notebook
%matplotlib inline
# set up sqlalchemy engine
host = 'stuffed.adrf.info'
DB = 'appliedda'
connection_string = "postgresql://{}/{}".format(host, DB)
conn = create_engine(connection_string)
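Before running large queries, it can help to confirm the `pd.read_sql` workflow on something trivial. A minimal sketch of the same pattern, using Python's built-in SQLite instead of the ADRF Postgres server (the connection details above are environment-specific, so the database here is a stand-in):

```python
# demonstrate the pd.read_sql pattern against an in-memory SQLite database;
# in the notebook you would pass the `conn` engine created above instead
import sqlite3

import pandas as pd

demo_conn = sqlite3.connect(':memory:')
demo_conn.execute("CREATE TABLE demo (id INTEGER, val REAL)")
demo_conn.executemany("INSERT INTO demo VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

# pd.read_sql returns a DataFrame, exactly as with the Postgres engine
demo_df = pd.read_sql("SELECT * FROM demo", demo_conn)
print(demo_df.shape)  # (2, 2)
```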
We will continue exploring the selection of data we ended with in the Dataset Exploration notebook.
SQL code to generate the tables we'll explore below
Table 1: tanf_cohort_2006q4
: study cohort in this notebook, individuals who finished a TANF spell in Q4 of 2006
CREATE TABLE ada_tanf.tanf_cohort_2006q4 AS
SELECT DISTINCT ON (i.recptno) i.recptno, i.start_date, i.end_date,
m.birth_date, m.ssn_hash, sex, rac, rootrace
FROM il_dhs.ind_spells i
LEFT JOIN il_dhs.member m
ON i.recptno = m.recptno
WHERE end_date >= '2006-10-01'::date AND
end_date < '2007-01-01'::date
AND benefit_type = 'tanf46';
-- age at end/beginning of spell
ALTER TABLE ada_tanf.tanf_cohort_2006q4 ADD COLUMN age_end numeric, ADD COLUMN age_start numeric;
UPDATE ada_tanf.tanf_cohort_2006q4 SET (age_start, age_end) =
(extract(epoch from age(start_date, birth_date))/(3600.*24*365),
extract(epoch from age(end_date, birth_date))/(3600.*24*365));
-- add duration of spell
ALTER TABLE ada_tanf.tanf_cohort_2006q4 ADD COLUMN spell_dur int;
UPDATE ada_tanf.tanf_cohort_2006q4 SET spell_dur = end_date - start_date;
-- add indexes
CREATE INDEX ON ada_tanf.tanf_cohort_2006q4 (recptno);
CREATE INDEX ON ada_tanf.tanf_cohort_2006q4 (ssn_hash);
CREATE INDEX ON ada_tanf.tanf_cohort_2006q4 (start_date);
CREATE INDEX ON ada_tanf.tanf_cohort_2006q4 (end_date);
-- change owner to schema's admin group
ALTER TABLE ada_tanf.tanf_cohort_2006q4 OWNER TO ada_tanf_admin;
-- good practice to VACUUM (although DB does it periodically)
VACUUM FULL ada_tanf.tanf_cohort_2006q4;
Table 2: tanf_cohort_2006q4_jobs
-- create job view for the 2006q4 cohort
CREATE TABLE ada_tanf.tanf_cohort_2006q4_jobs AS
SELECT
-- job identifiers
ssn, ein, seinunit, empr_no, year, quarter,
-- individual's earnings at this job
wage AS earnings
FROM il_des_kcmo.il_wage
WHERE year IN (2005, 2006, 2007)
AND ssn IN
(SELECT ssn_hash FROM ada_tanf.tanf_cohort_2006q4);
-- add indexes
CREATE INDEX ON ada_tanf.tanf_cohort_2006q4_jobs (ssn);
CREATE INDEX ON ada_tanf.tanf_cohort_2006q4_jobs (ein);
CREATE INDEX ON ada_tanf.tanf_cohort_2006q4_jobs (seinunit);
CREATE INDEX ON ada_tanf.tanf_cohort_2006q4_jobs (empr_no);
CREATE INDEX ON ada_tanf.tanf_cohort_2006q4_jobs (year);
CREATE INDEX ON ada_tanf.tanf_cohort_2006q4_jobs (quarter);
-- change owner to schema's admin group
ALTER TABLE ada_tanf.tanf_cohort_2006q4_jobs OWNER TO ada_tanf_admin;
-- good practice to VACUUM (although DB does it periodically)
VACUUM FULL ada_tanf.tanf_cohort_2006q4_jobs;
# get dataframe of study cohort
start_time = time.time()
query = """
SELECT * FROM ada_tanf.tanf_cohort_2006q4;
"""
# read the data, and parse the dates so we can use datetime functions
df = pd.read_sql(query, conn, parse_dates=['start_date', 'end_date', 'birth_date'])
# report how long reading this data frame took
print('data read in {:.2f} seconds'.format(time.time()-start_time))
df.info()
# get DataFrame of cohort jobs
start_time = time.time()
query = """
SELECT * FROM ada_tanf.tanf_cohort_2006q4_jobs;
"""
df_jobs = pd.read_sql(query, conn)
print('data read in {:.2f} seconds'.format(time.time()-start_time))
df_jobs.info()
# how many of our cohort have a job in the `il_wage` dataset
df_jobs['ssn'].nunique() #.nunique() returns the unique number of values
matplotlib
Under the hood, Pandas uses matplotlib to produce visualizations. matplotlib is the most commonly used base visualization package in Python and provides low-level access to individual chart characteristics (e.g., tick mark labels).
# and view a simple hist of the age distribution of our cohort
df.hist(column='age_end')
# one default we may want to change for histograms is the number of bins
df.hist(column='age_end', bins=50)
df.hist('spell_dur')
# spell duration is in days, so what is the duration in years?
(df['spell_dur']/365.).hist()
# aside: the "." after 365 is a holdover from Python 2,
# where "/" between integers did integer division;
# in Python 3, "/" always returns floating point values, so the "." is unnecessary
print('type with "." is: {}'.format((df['spell_dur']/365.).dtypes))
print('type without "." is: {}'.format((df['spell_dur']/365).dtypes))
# to get integers in python 3 you can do floor division:
print('floor division type is: {}'.format((df['spell_dur']//365).dtypes))
# how many spells last over 5 years?
((df['spell_dur']/365.)>5).value_counts()
# the .value_counts() function can also report percentages by using the 'normalize' parameter:
((df['spell_dur']/365.)>5).value_counts(normalize=True)
# what is the overall distribution of earnings across all the jobs?
df_jobs.hist(column='earnings', bins=50)
# the simple histogram produced above shows a lot of small earnings values
# what is the distribution of the higher values?
df_jobs['earnings'].describe(percentiles = [.01, .1, .25, .5, .75, .9, .95, .99, .999])
## We can see a long tail in the earnings per job
## let's subset to below the 99th percentile and make a histogram
subset_values = df_jobs['earnings'] < df_jobs['earnings'].quantile(0.99)
df_jobs[subset_values].hist(column='earnings', bins=50)
Note in the above cell we split subsetting the data into two steps:
1. Create subset_values, which is simply a Series of True or False values
2. Select the rows of the df_jobs dataframe where subset_values was True
## We can change options within the hist function (e.g. number of bins, color, transparency):
df_jobs[subset_values].hist(column='earnings', bins=20, facecolor="purple", alpha=0.5, figsize=(10,6))
## And we can change the plot options with `plt` (which is our alias for matplotlib.pyplot)
plt.xlabel('Job earnings ($)')
plt.ylabel('Number of jobs')
plt.title('Distribution of jobs by earnings for 2006q4 cohort')
## And add Data sourcing:
### xy is measured as a fraction of the axes' length, from the bottom left of the graph:
plt.annotate('Source: IL Depts of Employment Security and Human Services',
xy=(0.5,-0.15), xycoords="axes fraction")
## We use plt.show() to display the graph once we are done setting options:
plt.show()
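As an aside, the two-step subsetting used above (building subset_values, then indexing with it) can also be written inline. A toy sketch of the same boolean-mask pattern (the data here is made up, not the cohort data):

```python
import pandas as pd

# toy stand-in for df_jobs
toy = pd.DataFrame({'earnings': [100.0, 2500.0, 900.0, 99000.0]})

# two-step version, as in the cells above
mask = toy['earnings'] < toy['earnings'].quantile(0.99)
subset_a = toy[mask]

# equivalent one-step version
subset_b = toy[toy['earnings'] < toy['earnings'].quantile(0.99)]

print(subset_a.equals(subset_b))  # True
```

The two-step form is often clearer when the same mask is reused across several cells, as it is in this notebook.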
Data sourcing is a critical aspect of any data visualization. Although here we are simply referencing the agencies that created the data, it is ideal to provide as direct a path as possible for the viewer to find the data the graph is based on. When this is not possible (e.g. the data is sequestered), directing the viewer to documentation or methodology for the data is a good alternative. Regardless, providing clear sourcing for the underlying data is an absolute requirement of any respectable visualization, and further builds trust and enables reproducibility.
matplotlib
As briefly demonstrated by changing the labels and adding the source above, we can make consecutive changes to the same plot; that means we can also layer multiple plots on the same figure. By default, the first graph you create will be on the bottom, with subsequent graphs layered on top.
# to demonstrate simple layering
# we will create a histogram of 2005 and 2007 earnings
# similar to that already demonstrated above
plt.hist(df_jobs[subset_values & (df_jobs['year']==2005)]['earnings'], facecolor="blue", alpha=0.6)
plt.hist(df_jobs[subset_values & (df_jobs['year']==2007)]['earnings'], facecolor="orange", alpha=0.6)
plt.annotate('Source: IL Depts of Employment Security and Human Services',
xy=(0.5,-0.15), xycoords="axes fraction")
plt.show()
seaborn
Seaborn is a popular visualization package built on top of matplotlib. It makes some more cumbersome graphs easier to produce, though it does not give direct access to the lower-level objects in a figure (more on that later).
## Barplot function in seaborn
sns.barplot(x='year', y='earnings', data=df_jobs)
plt.show()
What values does the above plot actually show us? Let's use the help()
function to check the details of the seaborn.barplot()
function we called above:
help(sns.barplot)
In the documentation, we can see that there is an estimator function that by default is the mean.
## Barplot using sum of earnings rather than the default mean
sns.barplot(x='year', y='earnings', data=df_jobs, estimator=sum)
plt.show()
## Seaborn has a great series of charts for showing different cuts of data
sns.factorplot(x='quarter', y='earnings', hue='year', data=df_jobs, kind='bar')
plt.show()
## Other options for the 'kind' argument can be found in the documentation
seaborn and matplotlib
There are many excellent data visualization modules available in Python, but for this tutorial we will stick to the tried and true combination of matplotlib and seaborn.
Below, we use seaborn to set an overall aesthetic style and to facet the data (creating small multiples). We then use matplotlib to make very specific adjustments - things like adding the title, adjusting the locations of the plots, and sizing the graph space. This is a prototypical use of the power of these two libraries together.
More on seaborn's set_style function.
More on matplotlib's figure (fig) API.
## Seaborn offers a powerful tool called FacetGrid for making small multiples of matplotlib graphs:
### Create an empty set of grids:
facet_histograms = sns.FacetGrid(df_jobs[subset_values], row='year', col='quarter')
## "map" a histogram to each grid:
facet_histograms = facet_histograms.map(plt.hist, 'earnings')
## Data Sourcing:
plt.annotate('Source: IL Depts of Employment Security and Human Services',
xy=(0.5,-0.35), xycoords="axes fraction")
plt.show()
# Seaborn's set_style function allows us to set many aesthetic parameters.
sns.set_style("whitegrid")
facet_histograms = sns.FacetGrid(df_jobs[subset_values], row='year', col='quarter')
facet_histograms = facet_histograms.map(plt.hist, 'earnings')
## We can still change options with matplotlib, using facet_histograms.fig
facet_histograms.fig.subplots_adjust(top=0.9)
facet_histograms.fig.suptitle("Earnings for 99% of the jobs held by 2006q4 cohort", fontsize=14)
facet_histograms.fig.set_size_inches(12,8)
## Data Sourcing:
facet_histograms.fig.text(x=0.5, y=-0.05, s='Source: IL Depts of Employment Security and Human Services',
fontsize=12)
plt.show()
Question: what are employment patterns of our cohort?
# reminder of what columns we have in our two DataFrames
print(df.columns.tolist())
print('') # just to add a space
print(df_jobs.columns.tolist())
# also check the total rows in the two datasets, and the number of unique individuals in our jobs data
print(df.shape[0])
print(df_jobs.shape[0], df_jobs['ssn'].nunique())
# how many in our cohort had any job during each quarter
df_jobs.groupby(['year', 'quarter'])['ssn'].nunique().plot(kind='bar')
# did individuals have more than one job in a given quarter?
df_jobs.groupby(['year', 'quarter', 'ssn'])['ein'].count().sort_values(ascending=False).head()
How many people were employed in the same pattern of quarters over our 3 year period?
# count the number of jobs each individual had in each quarter
# where a "job" is simply that they had a record in the IDES data
df_tmp = df_jobs.groupby(['year', 'quarter', 'ssn'])['ein'].count().unstack(['year', 'quarter'])
df_tmp.head(1)
# flatten all columns to a single name with an '_' separator:
df_tmp.columns = ['_'.join([str(c) for c in col]) for col in df_tmp.columns.values]
df_tmp.head()
# replace NaN with 0
df_tmp.fillna(0, inplace=True)
# and set values >0 to 1
df_tmp[df_tmp>0] = 1
# make "ssn" a column instead of an index - then we can count it when we group by the 'year_q' columns
df_tmp.reset_index(inplace=True)
df_tmp.head()
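As an aside, the fillna-then-threshold steps above can be collapsed into a single vectorized expression. A sketch on a toy frame (made-up values, not the cohort data):

```python
import numpy as np
import pandas as pd

# toy job counts per quarter; NaN means "no record in the wage data"
toy = pd.DataFrame({'2005_1': [1.0, np.nan], '2005_2': [2.0, 3.0]})

# equivalent to: toy.fillna(0) followed by toy[toy > 0] = 1
binary = (toy.fillna(0) > 0).astype(int)
print(binary)
```

Either form works; the step-by-step version in the cells above makes each transformation easier to inspect.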
# make a list of just the columns that start with '2005' or '2006'
cols = [c for c in df_tmp.columns.values if c.startswith('2005') | c.startswith('2006')]
print(cols)
# aside on the above "list comprehension", here are the same steps one by one:
# 1- get an array of our columns
column_list = df_tmp.columns.values
# 2 - loop through each value in the array
for c in column_list:
# 3 - check if the string starts with either '2005' or '2006'
if c.startswith('2005') | c.startswith('2006'):
# 4 - add the column to our new list (here we just print to demonstrate)
print(c)
# group by all columns to count number of people with the same pattern
df_tmp = df_tmp.groupby(cols)['ssn'].count().reset_index()
print('There are {} different patterns of employment in 2005 and 2006'.format(df_tmp.shape[0]))
# total possible patterns of employment
poss_patterns = 2**len(cols)
pct_of_patterns = 100 * df_tmp.shape[0] / poss_patterns
print('With this definition of employment, our cohort shows {:.1f}% of the possible patterns'.format(pct_of_patterns))
# Look at just the top 10:
df_tmp.sort_values('ssn', ascending=False).head(10)
# and how many people follow other patterns
df_tmp.sort_values('ssn', ascending=False).tail(df_tmp.shape[0]-10)['ssn'].sum()
# grab the top 10 for a visualization
df_tmp_top = df_tmp.sort_values('ssn', ascending=False).head(10).reset_index()
# drop old index
df_tmp_top.drop(columns='index', inplace=True)
print('percent of employed in top 10 patterns is {:.1f}%'.format(100*df_tmp_top['ssn'].sum()/df_tmp['ssn'].sum()))
# calculate percentage of cohort in each group:
df_tmp_top['pct_cohort'] = df_tmp_top['ssn'].astype(float) / df['ssn_hash'].nunique()
df_tmp_top.head()
# visualize with a simple heatmap
sns.heatmap(df_tmp_top[cols])
The default visualization leaves a lot to be desired. Now let's customize the same heatmap.
# Create the matplotlib object so we can tweak graph properties later
fig, ax = plt.subplots(figsize = (14,8))
# create the list of labels we want on our y-axis
ylabs = ['{:.2f}%'.format(x*100) for x in df_tmp_top['pct_cohort']]
# make the heatmap
sns.heatmap(df_tmp_top[cols], linewidths=0.01, linecolor='grey', yticklabels=ylabs, cbar=False, cmap="Blues")
# make y-labels horizontal and change tickmark font size
plt.yticks(rotation=360, fontsize=12)
plt.xticks(fontsize=12)
# add axis labels
ax.set_ylabel('Percent of cohort', fontsize=16)
ax.set_xlabel('Quarter', fontsize=16)
## Data Sourcing:
ax.annotate('Source: IL Depts of Employment Security and Human Services',
xy=(0.5,-0.15), xycoords="axes fraction", fontsize=12)
## add a title
fig.suptitle('Top 10 most common employment patterns of cohort', fontsize=18)
ax.set_title('Blue is "employed" and white is "not employed"', fontsize=12)
plt.show()
Decision trees are a useful visualization when exploring how important your "features" (aka "right-hand variables", "explanatory variables", etc) are in predicting your "label" (aka "outcome") - we will revisit these concepts much more in the Machine Learning portion of the program. For now, we're going to use the data we have been exploring above to demonstrate creating and visualizing a Decision Tree.
# our "label" will just be if our cohort was present in the wage data after 2006
# get the list of SSN's present after 2006:
employed = df_jobs[df_jobs['year']>2006]['ssn'].unique()
df['label'] = df['ssn_hash'].isin(employed)
# how many of our cohort are "employed" after exiting TANF:
df['label'].value_counts(normalize=True)
# set which columns to use as our "features"
sel_features = ['sex', 'rac', 'rootrace', 'age_end', 'age_start', 'spell_dur']
# additional imports to create and visualize tree
# we will revisit sklearn during the Machine Learning portions of the program
from sklearn.tree import DecisionTreeClassifier
# packages to display a tree in Jupyter notebooks
from io import StringIO  # older sklearn examples use: from sklearn.externals.six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import graphviz as gv
import pydotplus
# create Tree to visualize,
# here we'll set maximum tree depth to 3 but you should try different values
dtree = DecisionTreeClassifier(max_depth=3)
# fit our data
dtree.fit(df[sel_features], df['label'])
# visualize the tree
# object to hold the graphviz data
dot_data = StringIO()
# create the visualization
export_graphviz(dtree, out_file=dot_data, filled=True,
rounded=True, special_characters=True,
feature_names=df[sel_features].columns.values)
# convert to a graph from the data
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
# display the graph in our notebook
Image(graph.create_png())
What does this tree tell us about our data?
When you are satisfied with your visualization, you may want to save a copy outside of your notebook. You can do this with matplotlib's savefig function. You simply need to run:
plt.savefig("fileName.fileExtension")
The file extension is surprisingly important. Image formats like png and jpeg are not ideal: these formats store your graph as a grid of pixels, which is space-efficient but cannot be edited later. Saving your visualizations as a PDF instead is strongly advised. PDFs are a type of vector image, which means all the components of the graph will be maintained.
With PDFs, you can later open the image in a program like Adobe Illustrator and make changes like the size or typeface of your text, move your legends, or adjust the colors of your visual encodings. All of this would be impossible with a png or jpeg.
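A minimal, self-contained sketch of the save pattern (the file name and the histogram data here are illustrative):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe outside notebooks
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist([1, 2, 2, 3, 3, 3])
ax.set_title('Example histogram')

# PDF keeps the figure as editable vector graphics;
# bbox_inches='tight' trims excess whitespace around the figure
fig.savefig('example_plot.pdf', bbox_inches='tight')
plt.close(fig)
```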
## Let's save the employment patterns heatmap we created earlier
## below just copied and pasted from above:
# Create the matplotlib object so we can tweak graph properties later
fig, ax = plt.subplots(figsize = (14,8))
# create the list of labels we want on our y-axis
ylabs = ['{:.2f}%'.format(x*100) for x in df_tmp_top['pct_cohort']]
# make the heatmap
sns.heatmap(df_tmp_top[cols], linewidths=0.01, linecolor='grey', yticklabels=ylabs, cbar=False, cmap="Blues")
# make y-labels horizontal and change tickmark font size
plt.yticks(rotation=360, fontsize=12)
plt.xticks(fontsize=12)
# add axis labels
ax.set_ylabel('Percent of cohort', fontsize=16)
ax.set_xlabel('Quarter', fontsize=16)
## Data Sourcing:
ax.annotate('Source: IL Depts of Employment Security and Human Services',
xy=(0.5,-0.15), xycoords="axes fraction", fontsize=12)
## add a title
fig.suptitle('Top 10 most common employment patterns of cohort', fontsize=18)
ax.set_title('Blue is "employed" and white is "not employed"', fontsize=12)
fig.savefig('./output/cohort2006q4_empl_patterns.pdf')
# and the decision tree we made - note since we created the "graph" object we
# do not need to completely reproduce the tree itself
graph.write_pdf('./output/cohort2006q4_dtree.pdf')
You can read more about different options for data visualization in Python in the Additional Resources section at the bottom of this notebook.
matplotlib is very expressive, meaning it has functionality that can easily account for fine-tuned graph creation and adjustment. However, this also means that matplotlib is somewhat more complex to code.
seaborn is a higher-level visualization module, which means it is much less expressive and flexible than matplotlib, but far more concise and easier to code.
It may seem like we need to choose between these two approaches, but this is not the case! Since seaborn is itself written in matplotlib (you will sometimes see seaborn called a matplotlib 'wrapper'), we can use seaborn to make graphs quickly and then matplotlib for specific adjustments. When you see plt referenced in the code below, we are using matplotlib's pyplot submodule.
seaborn also improves on matplotlib in important ways, such as more easily visualizing regression model results, creating small multiples, enabling better color palettes, and improving default aesthetics. From seaborn's documentation:
If matplotlib 'tries to make easy things easy and hard things possible', seaborn tries to make a well-defined set of hard things easy too.
The title of a visualization occupies the most valuable real estate on the page. If nothing else, you can be reasonably sure a viewer will at least read the title and glance at your visualization. This is why you want to put thought into making a clear and effective title that acts as a narrative for your chart. Many novice visualizers default to an explanatory title, something like: "Average Wages Over Time (2006-2016)". This title is correct - it just isn't very useful. This is particularly true since any good graph will have explained what the visualization is through the axes and legends. Instead, use the title to reinforce and explain the core point of the visualization. It should answer the question "Why is this graph important?" and focus the viewer onto the most critical take-away.
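To make the contrast concrete, here is a sketch with entirely hypothetical wage values, showing a descriptive title swapped for a narrative one:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe outside notebooks
import matplotlib.pyplot as plt

# hypothetical data, for illustration only
years = [2006, 2008, 2010, 2012, 2014, 2016]
avg_wages = [5200, 5400, 5100, 5600, 6000, 6300]

fig, ax = plt.subplots()
ax.plot(years, avg_wages)
ax.set_xlabel('Year')
ax.set_ylabel('Average quarterly wages ($)')

# descriptive (weaker): ax.set_title('Average Wages Over Time (2006-2016)')
# narrative (stronger): state the takeaway the viewer should leave with
ax.set_title('Average wages recovered after 2010 and kept rising')

fig.savefig('titled_plot.pdf', bbox_inches='tight')
plt.close(fig)
```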
Data-Viz-Extras notebook in the "notebooks_additional" folder
Other Python Visualization Libraries: