
L&S 88 - Lab 5 - Data Visualization with `matplotlib.pyplot`

Data and original analysis from Stuart Geiger, adapted with added analysis by Chris Pyles

In this lab, we'll cover the basics of one of the most important aspects of data science: data visualization. By this point, you should have some familiarity with data visualization from the built-in Table methods that allow you to create plots, histograms, and the like, but today we'll cover the tool that underlies all of those: pyplot. The pyplot library, a submodule of a larger plotting library called matplotlib, is a very powerful (if unintuitive) tool that allows you to create robust data visualizations in Python. While there are other tools that make plotting code easier to write and understand, some built on top of pyplot (e.g. seaborn) and some independent of it (e.g. altair), we won't be covering those in this lesson.

Before you continue in this lab, please take time to complete the pyplot tutorial. This will give you the basic building blocks to understand how to answer the questions in this lab. If you find yourself struggling with a plot, post on Piazza or Google the function. We cannot stress enough how helpful this is. For example, if you didn't know how to choose where the $x$ ticks fall on a scatter plot, you should start by googling "pyplot xticks" and go from there.
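As a rough illustration of where that search leads, here is a minimal sketch that chooses the $x$ tick locations on a scatter plot with plt.xticks. The data are made up for this example and are not part of the lab.

```python
import matplotlib.pyplot as plt

# Made-up points, purely for illustration
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]

plt.figure(figsize=[6, 4])
plt.scatter(x, y)

# Place ticks only at the listed x positions on the current axes
plt.xticks([1, 3, 5]);
```

Passing a second list to plt.xticks would set custom tick labels at those positions.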

In [ ]:

from datascience import *
import datetime as dt
import numpy as np
import pandas as pd

# the standard pyplot abbreviation is plt
import matplotlib.pyplot as plt

# this sets the color scheme and style of the plots created in this notebook
plt.style.use('fivethirtyeight')

# this Jupyter magic command tells IPython to display the plots right after they're generated
%matplotlib inline

In this lab, we'll be working with a data set generated by Stuart Geiger on the number of pageviews for the Wikipedia pages of members of the U.S. Congress. If you're interested, the code that was used to generate the data sets loaded below and the original analysis are hosted on GitHub. The house and senate tables below contain one column for each member of the House and Senate, respectively, and one row for each month from January 2016 to December 2018. Each cell contains the number of pageviews for that member of Congress in the given month. We'll be looking at trends in these pageviews across and within houses of Congress, and for individual legislators. The goal of this lab is to familiarize you with the matplotlib.pyplot library for creating data visualizations.

In [ ]:

house = Table.read_table('house_views.csv')
house.show(5)
senate = Table.read_table('senate_views.csv')
senate.show(5)

Part 1: Data Cleaning

Before we can do much with the data that we have collected, we need to do a bit of data cleaning. For this particular project, the only real cleaning we have to do is convert the dates in both tables to something that we can use to analyze time trends. Python has a built-in library called datetime (conventionally imported as dt) which allows for robust comparison and manipulation of dates.
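To see why datetime objects are more useful than strings here, consider this small sketch. It parses two month strings in the same 'YYYY-MM' layout used by this lab and then compares and subtracts them; the variable names are just for illustration.

```python
import datetime as dt

# Parse 'YYYY-MM' strings; the day defaults to the first of the month
start = dt.datetime.strptime('2016-01', '%Y-%m')
end = dt.datetime.strptime('2018-12', '%Y-%m')

# datetime objects compare chronologically...
is_earlier = start < end

# ...and subtracting them gives a datetime.timedelta
span = end - start

print(start, is_earlier, span.days)
```

Because the values are real dates, pyplot can also space them correctly along an axis.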

Question 1. The month_to_dt function (defined for you) converts a string to a datetime object. Apply this function to the Month column of both house and senate, storing the results as house_months and senate_months, respectively. The last two lines of code replace the strings in the tables with the datetime objects.

In [ ]:

def month_to_dt(entry):
    return dt.datetime.strptime(entry, '%Y-%m')

house_months = ...
senate_months = ...

house['Month'] = house_months
senate['Month'] = senate_months

Part 2: Aggregate Pageviews

Let's say that we wanted to look at how the number of pageviews for congresspersons is changing over time. The first thing we would need to do is aggregate the data that we have collected across columns into a total pageview number for each month (row) in the table.
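The idea of a per-row total can be sketched on a tiny made-up data set. The rows below are hypothetical tuples (a month label followed by counts), not the lab's Table rows, so treat this as an illustration of the pattern rather than a solution.

```python
import numpy as np

# Hypothetical rows: a month label followed by pageview counts
rows = [
    ('2016-01', 120, 80, 45),
    ('2016-02', 150, 60, 30),
    ('2016-03', 90, 110, 75),
]

row_sums = []
for row in rows:
    # Skip the first entry (the month) and add up the remaining counts
    row_sums.append(sum(row[1:]))

# An array is convenient for adding the totals as a new column later
row_sums = np.array(row_sums)
```

Note that the month entry has to be excluded from the sum, since it is not a count.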

Question 2a. Iterate over the rows in the house table and store the sum of each row in row_sums. Then add a column to house with the values in row_sums.

In [ ]:

...

Question 2b. The code below creates a barplot of the number of pageviews over each month in the house table. Add code to it to include the following elements:

Title: Monthly Wikipedia pageviews for U.S. House of Representatives
$x$ label: Month
$y$ label: Number of pageviews

Also, don't forget the semicolon ; so that Jupyter doesn't output any text.
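For reference, here is a minimal sketch of the title and axis-label calls on a toy bar plot. The categories, values, and label text are placeholders made up for this example.

```python
import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar']  # made-up categories
counts = [3, 7, 5]              # made-up values

plt.figure(figsize=[6, 4])
plt.bar(months, counts)
plt.title('Example title')
plt.xlabel('Example x label')
# The trailing semicolon keeps Jupyter from printing the Text(...) object
plt.ylabel('Example y label');
```

Only the last statement in a cell needs the semicolon, since Jupyter displays only the value of the final expression.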

In [ ]:

plt.figure(figsize=[20, 10])
plt.bar(house.column('Month'), house.column('view sums'), width=10)

...

Now we'll do a similar analysis for the senate table.

Question 3a. Iterate over the rows in the senate table and store the sum of each row in row_sums. Then add a column to senate with the values in row_sums.

In [ ]:

...

Question 3b. Create another barplot, this time for the senate table, similar to the one above. Use the same axis labels, but change the title to read Monthly Wikipedia pageviews for U.S. Senate. Your plot should look like this:

In [ ]:

plt.figure(figsize=[20, 10])

...

Now let's look at aggregates across both houses of Congress. In order to manipulate the tables in the format that we have them, we need to do some work using pandas, a Python library that deals with tables and data manipulation. You should look over the code below and try to understand it, but you don't need to be able to do anything of the sort until you take Data 100. If you're interested, here is the basic series of steps that happen below:

Move tables to pandas dataframes
Take the transpose of the dataframes (rows become columns and columns become rows)
Assign the column names of the house_t dataframe to be the row labeled Month and select all rows after the Month row
Do the same for the senate_t dataframe
Append the senate_t dataframe to the end of the house_t dataframe and reset the index
Take the transpose of congress_df
Set the columns of congress_df to be the row labeled index
Take all rows after the index row and reset the index
Export congress_df as a csv file without the index
Load the csv file as a Table object
Convert the string dates in the new Month column to datetime objects
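The core of these steps (transpose, stack, transpose back) can be seen on two tiny made-up dataframes. This sketch uses set_index as a shortcut for the "use the Month row as column names" step, so it is a simplified variant of the approach, not the lab's exact code; all names and values here are hypothetical.

```python
import pandas as pd

# Made-up stand-ins: one row per month, one column per member
house_toy = pd.DataFrame({'Month': ['2016-01', '2016-02'],
                          'Rep_A': [10, 20], 'Rep_B': [30, 40]})
senate_toy = pd.DataFrame({'Month': ['2016-01', '2016-02'],
                           'Sen_C': [50, 60]})

# Transpose so that members become rows and months become columns
house_toy_t = house_toy.set_index('Month').transpose()
senate_toy_t = senate_toy.set_index('Month').transpose()

# Stack the member rows, then transpose back to one row per month
combined = pd.concat([house_toy_t, senate_toy_t]).transpose().reset_index()
print(combined)
```

Stacking works only because both transposed dataframes share the same month columns.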

In [ ]:

house_df = house.to_df()
senate_df = senate.to_df()

house_t = house_df.transpose()
senate_t = senate_df.transpose()

house_t.columns = house_t.loc['Month']
house_t = house_t.iloc[1:]

senate_t.columns = senate_t.loc['Month']
senate_t = senate_t.iloc[1:]

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
congress_df = pd.concat([house_t, senate_t]).reset_index()
congress_df = congress_df.transpose()
congress_df.columns = congress_df.loc['index']
congress_df = congress_df.iloc[1:].reset_index()
congress_df.to_csv('congress_views.csv', index=False)

congress = Table.read_table('congress_views.csv')
congress_months = congress.apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d'), 'Month')
congress['Month'] = congress_months
congress.show(5)

The congress table contains data similar to the house and senate tables, but now with one column for each representative and senator. We can, therefore, use similar code to create a plot of aggregate views over time.

Question 4a. Iterate over the rows in the congress table and store the sum of each row in row_sums. Then add a column to congress with the values in row_sums.

In [ ]:

...

Question 4b. This time, create a line plot for the congress, house, and senate tables (all in the same plot). Use the same axis labels, but change the title to read Monthly Wikipedia pageviews for U.S. Congress. Use dashed lines for the house and senate tables and include a legend in your plot; it should look like this:

In [ ]:

plt.figure(figsize=[20, 10])

...

Finally, let's take a look at the distribution of pageviews for all of Congress. In order to get an idea of what the distribution of a variable looks like, we use a histogram. In the below cell, we plot the histogram for you with varying numbers of bins.
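To make the effect of the bin count concrete, here is a small numeric sketch using np.histogram, which computes bin counts the same way plt.hist does. The data are made up for this example.

```python
import numpy as np

data = [1, 2, 2, 3, 3, 3, 4, 4, 9]  # made-up values with one outlier

# Same data, two different bin counts
counts_2, edges_2 = np.histogram(data, bins=2)
counts_4, edges_4 = np.histogram(data, bins=4)

print(counts_2, edges_2)
print(counts_4, edges_4)
```

The counts always add up to the number of observations; only how they are grouped changes.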

In [ ]:

plt.figure(figsize=[20, 7])

plt.suptitle('Histogram of Number of pageviews for Members of U.S. Congress', size=20)

plt.subplot(131)
plt.hist(congress['view sums'], bins=5)
plt.title('5 bins')
plt.xlabel('Number of pageviews')

plt.subplot(132)
plt.hist(congress['view sums'], bins=10)
plt.title('10 bins')
plt.xlabel('Number of pageviews')

plt.subplot(133)
plt.hist(congress['view sums'], bins=20)
plt.title('20 bins')
plt.xlabel('Number of pageviews');

Question 5. What do you notice about how the distribution changes as we increase the number of bins? Why is this significant?

Type your answer here, replacing this text.

Part 3: Individual Trends

Now let's take a look at the trends for a few individual members of Congress. If we wanted to look at trends for one congressperson, say Nancy Pelosi, we could select the columns Month and Nancy_Pelosi from the congress table and make a plot of the pageviews for her Wikipedia page over time:

In [ ]:

pelosi = congress.select('Month', 'Nancy_Pelosi')

plt.figure(figsize=[20, 10])
plt.plot(pelosi['Month'], pelosi['Nancy_Pelosi'])
plt.title('Monthly Wikipedia pageviews for Nancy Pelosi')
plt.xlabel('Month')
plt.ylabel('Number of pageviews');

You should see a spike in the plot right around November 2018; thinking about the context, this makes sense, because it was in this month that the Democrats won control of the House and Nancy Pelosi was on track to become Speaker of the House.

Question 6. Choose two legislators in the congress table. For each, create a figure with 2 subplots (1 row, 2 columns); the first subplot should contain a line plot of their monthly pageviews and the second should be a histogram of their pageviews. Use a different line style for each legislator's line plot. For the histogram, make sure you choose a number of bins that reveals important trends in the distribution but which does not result in a hard-to-read or uninformative plot.
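As a reminder of how the subplot layout and line styles work, here is a sketch on made-up numbers. The values and titles are placeholders; your plots should use the legislators' columns from congress.

```python
import matplotlib.pyplot as plt

views = [5, 8, 6, 12, 30, 9, 7, 6]  # made-up monthly values
months = range(len(views))

plt.figure(figsize=[12, 4])

# 121 means 1 row, 2 columns, first panel
plt.subplot(121)
plt.plot(months, views, linestyle=':')  # dotted line; '--' gives dashes
plt.title('Line plot')

# 122 selects the second panel of the same grid
plt.subplot(122)
plt.hist(views, bins=4)
plt.title('Histogram');
```

Each call to plt.subplot makes that panel the current one, so the plotting calls that follow draw into it.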

In [ ]:

# 1st Legislator: [PUT THEIR NAME HERE]

plt.figure(figsize=[20, 7])

...

In [ ]:

# 2nd Legislator: [PUT THEIR NAME HERE]

plt.figure(figsize=[20, 7])

...

Data visualization is a powerful tool in many aspects of data science, from data exploration to finished analyses. Mastering pyplot, which underlies many plotting tools in Python (including seaborn and the Table plotting methods), is an important step toward making powerful, informative visualizations. The next step in the process is to start learning about another library, like seaborn or altair, which makes plotting more intuitive and responsive. If you're interested in data visualization, you should definitely take Data 100, as a large part of that course focuses on how to make good graphics using Python tools beyond just matplotlib.

Submission

To submit this lab, please download this notebook from DataHub and replace the notebook in your GitHub repository with this notebook. Commit these changes and push them to GitHub.
