#!/usr/bin/env python
# coding: utf-8

# # Tabular data primer
#
# This notebook demonstrates downloading speeches from DICES, creating tables of speech metadata, and using grouping and aggregation to produce summary statistics.
#
# ## Preliminaries
#
# ### Install DICES client software
#
# Because Google Colab runs this notebook on a fresh virtual machine every time, we always need to install DICES as the first step.

# In[1]:

# install DICES client
get_ipython().system('pip install -q git+https://github.com/cwf2/dices-client')


# ### Import statements
#
# Here we tell Python which ancillary packages we want to make accessible: in this case, the DICES client and Pandas.

# In[2]:

# for talking to DICES
from dicesapi import DicesAPI

# for creating data tables
import pandas as pd


# ### Connect to DICES
#
# Here we instantiate a connection to the DICES database. As we work, the DICES client spits out a lot of debugging information; it's convenient to divert those messages to a separate file.

# In[3]:

# create a new DICES connection
api = DicesAPI(logfile="dices.log", logdetail=0)


# ## Download speech metadata
#
# The `getSpeeches()` method is our main tool for downloading speech metadata. If you provide no other criteria, you get all the speeches.

# In[4]:

# download all speeches
speeches = api.getSpeeches()

# check that we got them
print(len(speeches), "speeches.")


# ## Create a table
#
# One convenient way to manipulate the speech metadata from DICES is in tabular format, like a spreadsheet. The ancillary package Pandas is designed for creating and analyzing tabular data.
#
# ### A table with one row per speech
#
# Here we create a table in which each row represents one speech. Using a `for` loop over the results from DICES, we take each speech in turn. After making a few calculations, we create a row record for that speech, choosing which values we want to keep and giving each a label. Then we add the record to a growing list of rows.
# In the final step, we create a **DataFrame** (i.e., a table) from the list of individual row records.

# In[5]:

# an empty list to hold the rows
rows = list()

# iterate over the speeches
for speech in speeches:

    # skip speeches that span more than one book/poem
    if speech.isMultiPrefix():
        print("skipping", speech)
        continue

    # rough estimate of the number of lines, based on subtraction:
    # - get line numbers, stripping alphabetic suffixes like "101a"
    # - subtract the first line from the last line and add one
    first_line = speech.getLineNo("first", alpha=False)
    last_line = speech.getLineNo("last", alpha=False)
    nlines = int(last_line) - int(first_line) + 1

    # get the first speaker's gender
    spkr_gender = speech.spkr[0].gender

    # create a new row, labelling all the data values
    row = {
        "speech_id": speech.id,
        "language": speech.lang,
        "author": speech.author.name,
        "work": speech.work.title,
        "prefix": speech.getPrefix(),
        "first_line": speech.getLineNo("first"),
        "last_line": speech.getLineNo("last"),
        "num_lines": nlines,
        "speaker": speech.getSpkrString(),
        "gender": spkr_gender,
        "addressee": speech.getAddrString(),
        "turn": speech.part,
    }

    # add the row to the list
    rows.append(row)

# make the table
table = pd.DataFrame(rows)

# display the table
display(table)


# #### Export the table as a CSV file
#
# I like to save the output to a CSV file right away. The file can be imported into Excel to check the data or to continue your work outside of Python. If you're using Google Colab, you may have to click the "files" icon on the left in order to see and download the new file.
#
# 💁🏻‍♂️ *I like tab-separated files better than comma-separated files; I think they're a little easier to read. Excel will import either format without a problem.
# If you prefer to use commas instead of tabs, just remove the option `sep="\t"` below.*

# In[6]:

# write the table to a file for import to Excel
table.to_csv("speeches.tsv", sep="\t", index=False)


# ## Manipulating tabular data
#
# Now that we have a table with one row per speech, we probably want to look for different ways to group and summarize the collection. In this workshop we introduce three Pandas methods that aggregate records in ways that may be familiar from Excel.
#
# ### Grouping and aggregating
#
# We can use `groupby()` to group the speeches according to some factor and `aggregate()` to summarize the new groups. The values of our grouping column become the row names of the new table.
#
# #### Example
#
# Let's find out how many speeches there are in each language. Here's how columns in the original table map onto the rows, columns, and values of the new summary table:
#
# - We're grouping by column **language**. That means *rows* in the new table will correspond to the values of **language** in the original table.
# - The column we're summarizing is **speech_id**, because we're counting individual speeches. Let's call the corresponding *column* in our new summary table **speeches**.
# - The values of the new column should be how many speeches are in each group, so our *aggregation function* is just `"count"`.
#
# So, our new table will have rows taken from **language** and a single column called **speeches**, in which each value is the `count` of column **speech_id** in the original table.

# In[7]:

# group by column "language"
grouped_data = table.groupby("language")

# aggregate column "speech_id" using aggregation function "count"
language_summary = grouped_data.aggregate(speeches=("speech_id", "count"))

# show the results
display(language_summary)


# #### Alternative style
#
# Here's a slightly different way of doing the same thing.
# It omits a step by chaining `aggregate()` directly onto `groupby()`, while adding some whitespace to make the layout of the new table more prominent. `agg()` is a synonym for `aggregate()`. The outer parentheses are necessary so that Python understands where the whole chain ends.

# In[8]:

language_summary = (
    table                                     # original data
    .groupby("language")                      # column to group by
    .agg(                                     # define summary table
        speeches = ("speech_id", "count"),    # - one new column
    )
)

# show results
display(language_summary)


# ### More aggregation functions
#
# In this example, each row of the new table represents the group of speeches in one language. Besides just counting **speech_id**s, we can summarize other columns of the original data using more interesting functions.
#
# #### Example
#
# Let's add some new columns about speech length to our summary table. For each language, we'll give the total number of lines, the longest speech, and the average speech length.
#
# - All the new columns summarize the column **num_lines** in the original table.
# - New column **lines** will use aggregation function `"sum"`.
# - New column **max_lines** will use aggregation function `"max"`.
# - New column **avg_lines** will use aggregation function `"mean"`.
#
# We can add these column definitions to the code we used before.

# In[9]:

language_summary = (
    table                                     # original data
    .groupby("language")                      # column to group by
    .agg(                                     # define summary table
        speeches = ("speech_id", "count"),    # - count of speech ids
        lines = ("num_lines", "sum"),         # - sum of lines in all speeches
        max_lines = ("num_lines", "max"),     # - length of longest speech
        avg_lines = ("num_lines", "mean"),    # - average speech length
    )
)

# show results
display(language_summary)


# ### Grouping by additional factors
#
# If we provide `groupby()` with a *list* of column names instead of just one, we get a multi-dimensional grouping in which each possible combination of values from the respective columns is taken in turn.
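# 💡 *The real table above requires a live DICES connection, so here is a minimal, self-contained sketch of multi-factor grouping on toy data. The column names mirror the speech table, but the values are invented:*

```python
import pandas as pd

# toy stand-in for the speech table (values are invented)
toy = pd.DataFrame({
    "speech_id": [1, 2, 3, 4, 5],
    "language": ["greek", "greek", "latin", "latin", "latin"],
    "gender": ["female", "male", "male", "male", "female"],
    "num_lines": [10, 4, 7, 12, 3],
})

toy_summary = (
    toy                                       # toy data
    .groupby(["language", "gender"])          # one row per combination
    .agg(                                     # define summary table
        speeches = ("speech_id", "count"),    # - count of speech ids
    )
)

print(toy_summary)
```

# *Note that the row labels of `toy_summary` now have two levels, one per grouping column.*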
# #### Example
#
# How many speeches are delivered by speakers of each gender in each language? Our *aggregation function* will again be `"count"`, but now the rows (or *index*) of our summary table have two levels, corresponding to **language** and **gender**.

# In[10]:

language_summary = (
    table                                     # original data
    .groupby(["language", "gender"])          # columns to group by
    .agg(                                     # define summary table
        speeches = ("speech_id", "count"),    # - count of speech ids
    )
)

# show results
display(language_summary)


# #### Alternative table orientation
#
# To make it easier to compare values across genders, we can *unstack* the table: this rotates the inner level of the *index* or row labels (i.e., gender) over to the column headings.

# In[11]:

language_summary = (
    table                                     # original data
    .groupby(["language", "gender"])          # columns to group by
    .agg(                                     # define summary table
        speeches = ("speech_id", "count"),    # - count of speech ids
    )
    .unstack()                                # move second factor to columns
)

# show results
display(language_summary)


# ### Complex, multi-factor tables
#
# Let's put everything together and calculate multiple statistics for every combination of language and gender. The results are necessarily complex; you'll have to decide whether the stacked or unstacked version is easier to read.
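# 💡 *To see what `unstack()` does in isolation, here is a toy example (invented values, not DICES data) showing the rotation of the inner index level into the column headings:*

```python
import pandas as pd

# a toy grouped table with a two-level row index (invented values)
idx = pd.MultiIndex.from_tuples(
    [("greek", "female"), ("greek", "male"), ("latin", "male")],
    names=["language", "gender"],
)
stacked = pd.DataFrame({"speeches": [5, 12, 9]}, index=idx)

# rotate the inner level (gender) into the column headings;
# combinations with no corresponding row become NaN
unstacked = stacked.unstack()

print(unstacked)
```

# *There is no ("latin", "female") row in the stacked version, so the unstacked table shows `NaN` in that cell; `unstack(fill_value=0)` would substitute zeros instead.*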
# #### stacked version

# In[12]:

language_summary = (
    table                                     # original data
    .groupby(["language", "gender"])          # columns to group by
    .agg(                                     # define summary table
        speeches = ("speech_id", "count"),    # - count of speech ids
        lines = ("num_lines", "sum"),         # - sum of lines in all speeches
        max_lines = ("num_lines", "max"),     # - length of longest speech
        avg_lines = ("num_lines", "mean"),    # - average speech length
    )
)

# show results
display(language_summary)


# #### unstacked version

# In[13]:

language_summary = (
    table                                     # original data
    .groupby(["language", "gender"])          # columns to group by
    .agg(                                     # define summary table
        speeches = ("speech_id", "count"),    # - count of speech ids
        lines = ("num_lines", "sum"),         # - sum of lines in all speeches
        max_lines = ("num_lines", "max"),     # - length of longest speech
        avg_lines = ("num_lines", "mean"),    # - average speech length
    )
    .unstack()                                # move second factor to columns
)

# show results
display(language_summary)


# ### Bonus: filtering
#
# We don't have time here to explore all the ways to select and filter data, but here's a quick example. The **gender** column includes some values that I'm not interested in at the moment, so I'm going to use the `query()` method to select only speeches where the gender is `'male'` or `'female'`.

# In[14]:

language_summary = (
    table                                     # original data
    .query("gender in ['male', 'female']")    # filter rows
    .groupby(["language", "gender"])          # columns to group by
    .agg(                                     # define summary table
        speeches = ("speech_id", "count"),    # - count of speech ids
        lines = ("num_lines", "sum"),         # - sum of lines in all speeches
        max_lines = ("num_lines", "max"),     # - length of longest speech
        avg_lines = ("num_lines", "mean"),    # - average speech length
    )
    .unstack()                                # move second factor to columns
)

# show results
display(language_summary)


# ## Further reading
#
# Pandas provides complete documentation for the methods we use here. They also have some helpful tutorials.
#
# - [Guide to grouping and aggregating](https://pandas.pydata.org/docs/user_guide/groupby.html)
# - [Complete user guide to Pandas](https://pandas.pydata.org/docs/user_guide/index.html)
# - [Manual page for `groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)
# - [Manual page for `aggregate()`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html)
# - [Manual page for `query()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html)

# ### Bonus: the same summary with `pivot_table()`
#
# The two cells below produce the same statistics in slightly different layouts: first using Pandas' `pivot_table()` shortcut, then using the `groupby()`/`agg()`/`unstack()` chain from above.

# In[17]:

(
    table
    .query("gender in ['male', 'female']")
    .pivot_table(
        index = "language",
        columns = "gender",
        values = "num_lines",
        aggfunc = ["sum", "max", "mean"],
    )
)


# In[18]:

(
    table                                     # original data
    .query("gender in ['male', 'female']")    # filter rows
    .groupby(["language", "gender"])          # columns to group by
    .agg(                                     # define summary table
        lines = ("num_lines", "sum"),         # - sum of lines in all speeches
        max_lines = ("num_lines", "max"),     # - length of longest speech
        avg_lines = ("num_lines", "mean"),    # - average speech length
    )
    .unstack()                                # move second factor to columns
)
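# 💡 *As a quick sanity check that `pivot_table()` and the `groupby()`/`agg()`/`unstack()` chain really agree, here is a toy comparison on invented data (not DICES):*

```python
import pandas as pd

# invented stand-in for the speech table
toy = pd.DataFrame({
    "language": ["greek", "greek", "latin", "latin"],
    "gender": ["female", "male", "male", "male"],
    "num_lines": [10, 4, 7, 12],
})

# total lines per language and gender, via pivot_table()
via_pivot = toy.pivot_table(
    index = "language",
    columns = "gender",
    values = "num_lines",
    aggfunc = "sum",
)

# the same totals, via groupby()/agg()/unstack()
via_groupby = (
    toy
    .groupby(["language", "gender"])
    .agg(lines=("num_lines", "sum"))
    .unstack()
)

# same values; only the column labelling differs
print(via_pivot)
print(via_groupby)
```

# *The numbers are identical; `pivot_table()` labels its columns with the gender values alone, while the `unstack()` version keeps a two-level column heading like `("lines", "male")`.*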