matplotlib.pyplot
Data and original analysis from Stuart Geiger, adapted with added analysis by Chris Pyles
In this lab, we'll cover the basics of one of the most important aspects of data science: data visualization. By this point, you should have some familiarity with data visualization from the built-in Table
methods that allow you to create plots, histograms, and the like, but today we'll cover the tool that underlies all of those: pyplot
. The pyplot
module, part of a larger plotting library called matplotlib
, is a very powerful (if unintuitive) tool that allows you to create robust data visualizations in Python. While there are some other tools that can be used in conjunction with pyplot
and which make the code easier to write and understand (e.g. seaborn
, altair
), we won't be covering those in this lesson.
Before you continue in this lab, please take time to complete the pyplot
tutorial. This will give you the basic building blocks to understand how to answer the questions in this lab. If you find yourself struggling with a plot, post on Piazza or Google the function. We cannot stress enough how helpful this is. For example, if you didn't know how to choose where the $x$ ticks fall on a scatter plot, you should start by googling "pyplot xticks" and go from there.
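As a quick illustration of the kind of answer such a search turns up, here is a minimal sketch of setting tick positions with plt.xticks. The data and tick labels are made up for this example and are not part of the lab's data set.

```python
import matplotlib.pyplot as plt

# hypothetical data, made up for illustration
x = [1, 2, 3, 4, 5, 6]
y = [3, 1, 4, 1, 5, 9]

plt.figure(figsize=[8, 4])
plt.scatter(x, y)
# place ticks only at x = 2, 4, 6 and give each one a custom label
plt.xticks([2, 4, 6], ['two', 'four', 'six'])
plt.xlabel('x')
plt.ylabel('y');
```

The first argument to plt.xticks gives the tick locations; the optional second argument gives the text shown at each location.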
from datascience import *
import datetime as dt
import numpy as np
import pandas as pd
# the standard pyplot abbreviation is plt
import matplotlib.pyplot as plt
# this sets the color scheme and style of the plots created in this notebook
plt.style.use('fivethirtyeight')
# this Jupyter magic command tells IPython to display the plots right after they're generated
%matplotlib inline
In this lab, we'll be working with a data set generated by Stuart Geiger on the number of pageviews for the Wikipedia pages of members of the U.S. Congress. If you're interested, the code that was used to generate the data sets loaded below and the original analysis are hosted on GitHub. The house
and senate
tables below contain one column for each member of the House and Senate, respectively, and one row for each month from January 2016 to December 2018. Each cell contains the number of pageviews for that member of Congress in the given month. We'll be looking at trends in these pageviews across and within houses of Congress, and for individual legislators. The goal of this lab is to familiarize you with the matplotlib.pyplot
library for creating data visualizations.
house = Table.read_table('house_views.csv')
house.show(5)
senate = Table.read_table('senate_views.csv')
senate.show(5)
Before we can do much with the data that we have collected, we need to do a bit of data cleaning. For this particular project, the only real cleaning we have to do is convert the dates in both tables to something that we can use to analyze time trends. Python has a built-in library called datetime
(standard abbreviation: dt
) which allows for robust comparison and manipulation of dates.
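To see what this looks like in practice, here is a small sketch using two made-up month strings in the 'YYYY-MM' format; the variable names are just for illustration.

```python
import datetime as dt

# '%Y-%m' matches strings like '2016-01' (4-digit year, dash, 2-digit month);
# the day is not given, so it defaults to the 1st of the month
jan_2016 = dt.datetime.strptime('2016-01', '%Y-%m')
dec_2018 = dt.datetime.strptime('2018-12', '%Y-%m')

# datetime objects can be compared and subtracted directly
print(jan_2016 < dec_2018)           # True
print((dec_2018 - jan_2016).days)    # 1065
```

Because the results are real date objects rather than strings, pyplot can place them correctly along a time axis.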
Question 1. The month_to_dt
function (defined for you) converts a string to a datetime
object. Apply this function to the Month
column of both house
and senate
, storing the results as house_months
and senate_months
, respectively. The last two lines of code replace the strings in the tables with the datetime
objects.
def month_to_dt(entry):
    # parse a 'YYYY-MM' string into a datetime object
    return dt.datetime.strptime(entry, '%Y-%m')
house_months = ...
senate_months = ...
house['Month'] = house_months
senate['Month'] = senate_months
Let's say that we wanted to look at how the number of pageviews for congresspersons is changing over time. The first thing we would need to do is aggregate the data that we have collected across columns into a total pageview number for each month (row) in the table.
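The idea of summing across each row can be sketched on a tiny made-up grid of numbers. The values and the name monthly_totals are hypothetical; the real tables also hold a Month column of dates, which should not be included in a sum.

```python
import numpy as np

# hypothetical pageview counts: one row per month, one column per legislator
rows = [
    [120, 340, 95],
    [150, 310, 80],
    [600, 290, 110],
]

# loop over the rows, adding up the values in each one
monthly_totals = []
for row in rows:
    monthly_totals.append(sum(row))
monthly_totals = np.array(monthly_totals)
print(monthly_totals)   # [ 555  540 1000]

# the same totals computed in one step with numpy (axis=1 sums across columns)
print(np.array(rows).sum(axis=1))
```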
Question 2a. Iterate over the rows in the house
table and store the sum of each row in row_sums
. Then add a column to house
with the values in row_sums
.
...
Question 2b. The code below creates a barplot of the number of pageviews over each month in the house
table. Add code to it to include the following elements:
Title: Monthly Wikipedia pageviews for U.S. House of Representatives
x-axis label: Month
y-axis label: Number of pageviews
Also, don't forget the semicolon ;
so that Jupyter doesn't output any text.
plt.figure(figsize=[20, 10])
plt.bar(house.column('Month'), house.column('view sums'), width=10)
...
Now we'll do a similar analysis for the senate
table.
Question 3a. Iterate over the rows in the senate
table and store the sum of each row in row_sums
. Then add a column to senate
with the values in row_sums
.
...
Question 3b. Create another barplot, this time for the senate
table, similar to the one above. Use the same axis labels, but change the title to read Monthly Wikipedia pageviews for U.S. Senate
. Your plot should look like this:
plt.figure(figsize=[20, 10])
...
Now let's look at aggregates across both houses of Congress. In order to manipulate the tables in the format that we have them, we need to do some work using pandas
, a Python library that deals with tables and data manipulation. You should look over the code below and try to understand it, but you don't need to be able to do anything of the sort until you take Data 100. If you're interested, here is the basic series of steps that happen below:
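1. Convert both tables to pandas dataframes
2. Transpose both dataframes
3. Set the columns of the house_t dataframe to be the row labeled Month and select all rows after the Month row
4. Do the same for the senate_t dataframe
5. Append the senate_t dataframe to the end of the house_t dataframe and reset the index, storing the result as congress_df
6. Transpose congress_df
7. Set the columns of congress_df to be the row labeled index
8. Select all rows after the index row and reset the index
9. Save congress_df as a csv file without the index
10. Load the csv file as a Table object
11. Convert the Month column to datetime objects
house_df = house.to_df()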
senate_df = senate.to_df()
house_t = house_df.transpose()
senate_t = senate_df.transpose()
house_t.columns = house_t.loc['Month']
house_t = house_t.iloc[1:]
senate_t.columns = senate_t.loc['Month']
senate_t = senate_t.iloc[1:]
# pd.concat stacks the two dataframes; DataFrame.append was removed in pandas 2.0
congress_df = pd.concat([house_t, senate_t]).reset_index()
congress_df = congress_df.transpose()
congress_df.columns = congress_df.loc['index']
congress_df = congress_df.iloc[1:].reset_index()
congress_df.to_csv('congress_views.csv', index=False)
congress = Table.read_table('congress_views.csv')
congress_months = congress.apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d'), 'Month')
congress['Month'] = congress_months
congress.show(5)
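If the reshaping in the cell above is hard to follow, it may help to try the core idea on a toy example. This is a simplified sketch, not the lab's exact procedure: the two small dataframes and the helper members_as_rows are made up, pd.concat is used to stack the frames, and the reset_index and csv round trip are skipped.

```python
import pandas as pd

# hypothetical miniature versions of the two tables
house_toy = pd.DataFrame({'Month': ['2016-01', '2016-02'],
                          'Rep_A': [10, 20],
                          'Rep_B': [30, 40]})
senate_toy = pd.DataFrame({'Month': ['2016-01', '2016-02'],
                           'Sen_C': [50, 60]})

def members_as_rows(df):
    """Transpose so each member is a row and each month is a column."""
    t = df.transpose()
    t.columns = t.loc['Month']   # use the Month row as the column labels
    return t.iloc[1:]            # drop the Month row itself

# stack the members of both chambers on top of each other
stacked = pd.concat([members_as_rows(house_toy), members_as_rows(senate_toy)])

# transpose back: one row per month, one column per member
combined = stacked.transpose()
print(combined)
```

Transposing first lets the two tables be stacked by member even though they have different columns; transposing again restores the one-row-per-month layout.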
The congress
table now contains similar data to the house
and senate
tables, but now with 1 column for each representative and senator. We can, therefore, use similar code to create a plot of aggregate views over time.
Question 4a. Iterate over the rows in the congress
table and store the sum of each row in row_sums
. Then add a column to congress
with the values in row_sums
.
...
Question 4b. This time, create a line plot for the congress
, house
, and senate
tables (all in the same plot). Use the same axis labels, but change the title to read Monthly Wikipedia pageviews for U.S. Congress
. Use dashed lines for the house
and senate
tables and include a legend in your plot; it should look like this:
plt.figure(figsize=[20, 10])
...
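For reference, here is a minimal sketch of how line styles and legends work in pyplot. The numbers and labels are made up for illustration and are unrelated to the pageview tables.

```python
import matplotlib.pyplot as plt

# hypothetical values, made up for illustration
months = [1, 2, 3, 4]
total = [9, 12, 10, 15]
part_a = [5, 7, 6, 9]
part_b = [4, 5, 4, 6]

plt.figure(figsize=[8, 4])
plt.plot(months, total, label='total')                     # solid line by default
plt.plot(months, part_a, linestyle='--', label='part a')   # dashed line
plt.plot(months, part_b, linestyle='--', label='part b')   # dashed line
# the legend uses the label given to each plt.plot call
plt.legend();
```

Calling plt.plot several times before starting a new figure draws all of the lines on the same axes.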
Finally, let's take a look at the distribution of pageviews for all of Congress. In order to get an idea of what the distribution of a variable looks like, we use a histogram. In the cell below, we plot the histogram for you with varying numbers of bins.
plt.figure(figsize=[20, 7])
plt.suptitle('Histogram of Number of pageviews for Members of U.S. Congress', size=20)
plt.subplot(131)
plt.hist(congress['view sums'], bins=5)
plt.title('5 bins')
plt.xlabel('Number of pageviews')
plt.subplot(132)
plt.hist(congress['view sums'], bins=10)
plt.title('10 bins')
plt.xlabel('Number of pageviews')
plt.subplot(133)
plt.hist(congress['view sums'], bins=20)
plt.title('20 bins')
plt.xlabel('Number of pageviews');
Question 5. What do you notice about how the distribution changes as we increase the number of bins? Why is this significant?
Type your answer here, replacing this text.
Now let's take a look at the trends for a few individual members of Congress. If we wanted to look at trends for one congressperson, say Nancy Pelosi, we could select the columns Month
and Nancy_Pelosi
from the congress
table and make a plot of the pageviews for her Wikipedia page over time:
pelosi = congress.select('Month', 'Nancy_Pelosi')
plt.figure(figsize=[20, 10])
plt.plot(pelosi['Month'], pelosi['Nancy_Pelosi'])
plt.title('Monthly Wikipedia pageviews for Nancy Pelosi')
plt.xlabel('Month')
plt.ylabel('Number of pageviews');
You should see a spike in the plot right around November 2018; thinking about the context, this makes sense, because it was in this month that the Democrats won control of the House and Nancy Pelosi was on track to become Speaker of the House.
Question 6. Choose two legislators in the congress
table. For each, create a figure with 2 subplots (1 row, 2 columns); the first subplot should contain a line plot of their monthly pageviews and the second should be a histogram of their pageviews. For the lineplot, use a different line style for each plot. For the histogram, make sure you choose a number of bins that reveals important trends in the distribution but which does not result in a hard-to-read or uninformative plot.
# 1st Legislator: [PUT THEIR NAME HERE]
plt.figure(figsize=[20, 7])
...
# 2nd Legislator: [PUT THEIR NAME HERE]
plt.figure(figsize=[20, 7])
...
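As a reminder of how a 1-row, 2-column layout is built, here is a small sketch with made-up numbers; the values, line style, and bin count are arbitrary choices for illustration, not recommendations for your own plots.

```python
import matplotlib.pyplot as plt

# hypothetical monthly counts, made up for illustration
views = [120, 150, 600, 130, 140, 900, 160, 155]

plt.figure(figsize=[12, 4])
plt.subplot(121)    # 1 row, 2 columns, first panel
plt.plot(range(len(views)), views, linestyle=':')   # dotted line
plt.title('Line plot')
plt.subplot(122)    # 1 row, 2 columns, second panel
plt.hist(views, bins=4)
plt.title('Histogram');
```

Each call to plt.subplot switches the active panel, so the plotting and labeling calls that follow it apply only to that panel.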
Data visualization is a powerful tool in many aspects of data science, from data exploration to finished analyses. Mastering pyplot
, which underlies many plotting libraries in Python, is an important step to being able to make powerful, informative visualizations. The next step in the process is to start learning about another library, like seaborn
or altair
, which make plotting more intuitive and responsive. If you're interested in data visualization, you should definitely take Data 100, as a large part of that course focuses on how to make good graphics using Python tools beyond just matplotlib
.
To submit this lab, please download this notebook from DataHub and replace the notebook in your GitHub repository with this notebook. Commit these changes and push them to GitHub.