#!/usr/bin/env python
# coding: utf-8

# # L&S 88 - Lab 5 - Data Visualization with `matplotlib.pyplot`
#
# _Data and original analysis from [Stuart Geiger](https://github.com/staeiou/wiki-pageview-notebook), adapted with added analysis by Chris Pyles_
#
# In this lab, we'll cover the basics of one of the most important aspects of data science: data visualization. By this point, you should have some familiarity with data visualization from the built-in `Table` methods that allow you to create plots, histograms, and the like, but today we'll cover the tool that underlies all of those: `pyplot`. The `pyplot` library, a subset of a larger plotting library called `matplotlib`, is a very powerful (if unintuitive) tool that allows you to create robust data visualizations in Python. While there are some other tools that can be used in conjunction with `pyplot` and which make the code easier to write and understand (e.g. `seaborn`, `altair`), we won't be covering those in this lesson.
#
# **Before you continue in this lab, please take time to complete the [`pyplot` tutorial](pyplot.ipynb).** This will give you the basic building blocks to understand how to answer the questions in this lab. If you find yourself struggling with a plot, post on Piazza or _Google the function_. We cannot stress enough how helpful this is. For example, if you didn't know how to choose where the $x$ ticks fall on a scatter plot, you should start by googling "pyplot xticks" and going from there.
# In[ ]:

from datascience import *
import datetime as dt
import numpy as np
import pandas as pd

# the standard pyplot abbreviation is plt
import matplotlib.pyplot as plt

# this sets the color scheme and style of the plots created in this notebook
plt.style.use('fivethirtyeight')

# this Jupyter magic command tells IPython to display the plots right after they're generated
get_ipython().run_line_magic('matplotlib', 'inline')


# In this lab, we'll be working with a data set generated by Stuart Geiger on the number of pageviews for the Wikipedia pages of members of the U.S. Congress. If you're interested, the code that was used to generate the data sets loaded below and the original analysis are hosted [on Github](https://github.com/staeiou/wiki-pageview-notebook). The `house` and `senate` tables below contain one column for each member of the House and Senate, respectively, and one row for each month from January 2016 to December 2018. Each cell contains the number of pageviews for that member of Congress in the given month. We'll be looking at trends in these pageviews across and within houses of Congress, and for individual legislators. The goal of this lab is to familiarize you with the `matplotlib.pyplot` library for creating data visualizations.

# In[ ]:

house = Table.read_table('house_views.csv')
house.show(5)

senate = Table.read_table('senate_views.csv')
senate.show(5)


# ## Part 1: Data Cleaning

# Before we can do much with the data that we have collected, we need to do a bit of data cleaning. For this particular project, the only real cleaning we have to do is to convert the dates in both tables to something that we can use to analyze time trends. Python has a built-in library called `datetime` (standard abbreviation: `dt`) which allows for robust comparison and manipulation of dates.
#
# **Question 1.** The `month_to_dt` function (defined for you) converts a string to a `datetime` object.
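# For reference, here is a quick sketch (separate from the lab answers) of what `strptime` does with the `'%Y-%m'` format string used in `month_to_dt`: it parses a `'YYYY-MM'` string into a `datetime` object, with the day defaulting to the first of the month.

```python
import datetime as dt

# parse a 'YYYY-MM' string into a datetime; unspecified fields default (day -> 1)
d = dt.datetime.strptime('2016-01', '%Y-%m')
print(d.year, d.month, d.day)  # 2016 1 1
```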
# Apply this function to the `Month` column of both `house` and `senate`, storing the results as `house_months` and `senate_months`, respectively. The last two lines of code replace the strings in the tables with the `datetime` objects.

# In[ ]:

def month_to_dt(entry):
    return dt.datetime.strptime(entry, '%Y-%m')

house_months = ...
senate_months = ...

house['Month'] = house_months
senate['Month'] = senate_months


# ## Part 2: Aggregate Pageviews

# Let's say that we wanted to look at how the number of pageviews for congresspersons is changing over time. The first thing we would need to do is aggregate the data that we have collected across columns into a total pageview number for each month (row) in the table.
#
# **Question 2a.** Iterate over the rows in the `house` table and store the sum of each row in `row_sums`. Then add a column to `house` with the values in `row_sums`.

# In[ ]:

...

# **Question 2b.** The code below creates a bar plot of the number of pageviews over each month in the `house` table. Add code to it to include the following elements:
# * Title: `Monthly Wikipedia pageviews for U.S. House of Representatives`
# * $x$ label: `Month`
# * $y$ label: `Number of pageviews`
#
# Also, don't forget the semicolon `;` so that Jupyter doesn't output any text.

# In[ ]:

plt.figure(figsize=[20, 10])
plt.bar(house.column('Month'), house.column('view sums'), width=10)
...

# Now we'll do a similar analysis for the `senate` table.
#
# **Question 3a.** Iterate over the rows in the `senate` table and store the sum of each row in `row_sums`. Then add a column to `senate` with the values in `row_sums`.

# In[ ]:

...

# **Question 3b.** Create another bar plot, this time for the `senate` table, similar to the one above. Use the same axis labels, but change the title to read `Monthly Wikipedia pageviews for U.S. Senate`. Your plot should look like this:
#
#

# In[ ]:

plt.figure(figsize=[20, 10])
...

# Now let's look at aggregates across both houses of Congress.
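# Before moving on, the row-sum pattern asked for in Questions 2a and 3a can be sketched with a toy NumPy array (hypothetical numbers, not the lab data): iterate over the rows, sum each one, and collect the totals.

```python
import numpy as np

# toy stand-in for a table's numeric columns: rows = months, columns = members
views = np.array([[10, 20],
                  [30, 40],
                  [50, 60]])

# sum each row, collecting the totals into an array
row_sums = np.array([sum(row) for row in views])
print(row_sums)  # [ 30  70 110]
```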
# In order to manipulate the tables in the format that we have them, we need to do some work using `pandas`, a Python library that deals with tables and data manipulation. You should look over the code below and try to understand it, but you don't need to be able to do anything of the sort until you take Data 100. If you're interested, here is the basic series of steps that happen below:
# 1. Move the tables to `pandas` dataframes
# 2. Take the transpose of the dataframes (rows become columns and columns become rows)
# 3. Assign the column names of the `house_t` dataframe to be the row labeled `Month` and select all rows after the `Month` row
# 4. Do the same for the `senate_t` dataframe
# 5. Append the `senate_t` dataframe to the end of the `house_t` dataframe and reset the index
# 6. Take the transpose of `congress_df`
# 7. Set the columns of `congress_df` to be the row labeled `index`
# 8. Take all rows after the `index` row and reset the index
# 9. Export `congress_df` as a csv file without the index
# 10. Load the csv file as a `Table` object
# 11. Convert the string dates in the new `Month` column to `datetime` objects

# In[ ]:

house_df = house.to_df()
senate_df = senate.to_df()

house_t = house_df.transpose()
senate_t = senate_df.transpose()

house_t.columns = house_t.loc['Month']
house_t = house_t.iloc[1:]

senate_t.columns = senate_t.loc['Month']
senate_t = senate_t.iloc[1:]

congress_df = house_t.append(senate_t).reset_index()

congress_df = congress_df.transpose()
congress_df.columns = congress_df.loc['index']
congress_df = congress_df.iloc[1:].reset_index()

congress_df.to_csv('congress_views.csv', index=False)

congress = Table.read_table('congress_views.csv')
congress_months = congress.apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d'), 'Month')
congress['Month'] = congress_months
congress.show(5)

# The `congress` table now contains similar data to the `house` and `senate` tables, but now with one column for each representative and senator.
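# To see why the transpose trick in steps 2 and 3 works, here is a minimal sketch with a toy DataFrame (the column names are hypothetical, not the lab's): transposing turns each original column into a row, after which the `Month` row can serve as the new column labels.

```python
import pandas as pd

# toy dataframe standing in for `house_df` (hypothetical member name)
df = pd.DataFrame({'Month': ['2016-01', '2016-02'], 'Rep_A': [1, 2]})

t = df.transpose()          # rows become columns and columns become rows
t.columns = t.loc['Month']  # use the 'Month' row as the column labels
t = t.iloc[1:]              # drop the 'Month' row, keeping only member rows

print(t.columns.tolist())   # ['2016-01', '2016-02']
```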
# We can, therefore, use similar code to create a plot of aggregate views over time.
#
# **Question 4a.** Iterate over the rows in the `congress` table and store the sum of each row in `row_sums`. Then add a column to `congress` with the values in `row_sums`.

# In[ ]:

...

# **Question 4b.** This time, create a line plot for the `congress`, `house`, and `senate` tables (all in the same plot). Use the same axis labels, but change the title to read `Monthly Wikipedia pageviews for U.S. Congress`. Use dashed lines for the `house` and `senate` tables and include a legend in your plot; it should look like this:
#
#

# In[ ]:

plt.figure(figsize=[20, 10])
...

# Finally, let's take a look at the distribution of pageviews for all of Congress. In order to get an idea of what the distribution of a variable looks like, we use a histogram. In the cell below, we plot the histogram for you with varying numbers of bins.

# In[ ]:

plt.figure(figsize=[20, 7])
plt.suptitle('Histogram of Number of pageviews for Members of U.S. Congress', size=20)

plt.subplot(131)
plt.hist(congress['view sums'], bins=5)
plt.title('5 bins')
plt.xlabel('Number of pageviews')

plt.subplot(132)
plt.hist(congress['view sums'], bins=10)
plt.title('10 bins')
plt.xlabel('Number of pageviews')

plt.subplot(133)
plt.hist(congress['view sums'], bins=20)
plt.title('20 bins')
plt.xlabel('Number of pageviews');

# **Question 5.** What do you notice about how the distribution changes as we increase the number of bins? Why is this significant?

# _Type your answer here, replacing this text._

# ## Part 3: Individual Trends

# Now let's take a look at the trends for a few individual members of Congress.
# If we wanted to look at trends for one congressperson, say Nancy Pelosi, we could select the columns `Month` and `Nancy_Pelosi` from the `congress` table and make a plot of the pageviews for her Wikipedia page over time:

# In[ ]:

pelosi = congress.select('Month', 'Nancy_Pelosi')

plt.figure(figsize=[20, 10])
plt.plot(pelosi['Month'], pelosi['Nancy_Pelosi'])
plt.title('Monthly Wikipedia pageviews for Nancy Pelosi')
plt.xlabel('Month')
plt.ylabel('Number of pageviews');

# You should see a spike in the plot right around November 2018; thinking about the context, this makes sense, because it was in this month that the Democrats took over the House and Nancy Pelosi was on track to become Speaker of the House.
#
# **Question 6.** Choose two legislators in the `congress` table. For each, create a figure with 2 subplots (1 row, 2 columns); the first subplot should contain a line plot of their monthly pageviews and the second should be a histogram of their pageviews. For the line plot, use a different line style for each plot. For the histogram, make sure you choose a number of bins that reveals important trends in the distribution but which does not result in a hard-to-read or uninformative plot.

# In[ ]:

# 1st Legislator: [PUT THEIR NAME HERE]
plt.figure(figsize=[20, 7])
...

# In[ ]:

# 2nd Legislator: [PUT THEIR NAME HERE]
plt.figure(figsize=[20, 7])
...

# Data visualization is a powerful tool in many aspects of data science, from data exploration to finished analyses. Mastering `pyplot`, which underlies all plotting libraries in Python, is an important step toward being able to make powerful, informative visualizations. The next step in the process is to start learning about another library, like `seaborn` or `altair`, which makes plotting more intuitive and responsive. If you're interested in data visualization, you should definitely take Data 100, as a large part of that course focuses on how to make good graphics using Python tools beyond just `matplotlib`.
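# As a parting sketch, the 1-row, 2-column subplot layout used in Question 6 looks like this with synthetic data (random numbers standing in for the lab's tables, and an off-screen backend so it runs anywhere):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; not needed inside the notebook itself
import matplotlib.pyplot as plt
import numpy as np

views = np.random.default_rng(0).integers(1_000, 50_000, size=36)  # 36 fake months

fig = plt.figure(figsize=[20, 7])

plt.subplot(121)  # 1 row, 2 columns, first panel: the trend over time
plt.plot(range(36), views, linestyle='--')
plt.title('Monthly pageviews')
plt.xlabel('Month')

plt.subplot(122)  # second panel: the distribution of the same values
plt.hist(views, bins=10)
plt.title('Distribution of pageviews')
plt.xlabel('Number of pageviews');
```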
# ## Submission

# To submit this lab, please download this notebook from DataHub and replace the notebook in your Github repository with this notebook. Commit these changes and push them to Github.