This notebook demonstrates downloading speeches from DICES, creating tables of speech metadata, and using grouping and aggregation to create summary statistics.
Because Google Colab runs this notebook on a fresh virtual machine every time, we always need to install DICES as the first step.
# install DICES client
!pip install -q git+https://github.com/cwf2/dices-client
Here we tell Python which ancillary packages we want to make accessible. In this case, the DICES client and Pandas.
# for talking to DICES
from dicesapi import DicesAPI
# for creating data tables
import pandas as pd
Here we instantiate a connection to the DICES database. As we work, the DICES client emits a lot of debugging information; it's convenient to divert those messages to a separate file.
# create a new DICES connection
api = DicesAPI(logfile="dices.log", logdetail=0)
The getSpeeches() method is our main tool for downloading speech metadata. If you provide no other criteria, you will get all the speeches.
# download all speeches
speeches = api.getSpeeches()
# check that we got them
print(len(speeches), "speeches.")
4696 speeches.
One convenient way to manipulate the speech metadata from DICES is in tabular format, like a spreadsheet. The ancillary package Pandas is designed to create and analyze tabular data.
Here, we create a table in which each row represents one speech. Using a for loop over the results from DICES, we take each speech in turn. After making a few calculations, we create a row record for that speech. We choose which values we want to keep, giving each a label. Then we add the record to a growing list of rows. In the final step, we create a DataFrame (i.e., a table) from the list of individual row records.
# an empty list to hold the rows
rows = list()
# iterate over the speeches
for speech in speeches:
# skip speeches that span more than one book/poem
if speech.isMultiPrefix():
print("skipping", speech)
continue
# rough estimate of number of lines based on subtraction
# - get line numbers, strip alphabetic suffixes like "101a"
# - subtract first line from last line and add one
first_line = speech.getLineNo("first", alpha=False)
last_line = speech.getLineNo("last", alpha=False)
nlines = int(last_line) - int(first_line) + 1
# get first speaker gender
spkr_gender = speech.spkr[0].gender
# create a new row, labelling all the data values
row = {
"speech_id": speech.id,
"language": speech.lang,
"author": speech.author.name,
"work": speech.work.title,
"prefix": speech.getPrefix(),
"first_line": speech.getLineNo("first"),
"last_line": speech.getLineNo("last"),
"num_lines": nlines,
"speaker": speech.getSpkrString(),
"gender": spkr_gender,
"addressee": speech.getAddrString(),
"turn": speech.part,
}
# add the row to the list
rows.append(row)
# make the table
table = pd.DataFrame(rows)
# display the table
display(table)
skipping <Speech 890: Odyssey 9.2-11.332>
skipping <Speech 957: Odyssey 11.378-12.453>
skipping <Speech 3950: Paraphrase 14.89-16.54>
skipping <Speech 1552: Aeneid 2.3-3.715>
| | speech_id | language | author | work | prefix | first_line | last_line | num_lines | speaker | gender | addressee | turn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4230 | greek | Anonymous | Orphic Argonautica | | 77 | 96 | 20 | Jason | male | Orpheus | 1 |
| 1 | 4231 | greek | Anonymous | Orphic Argonautica | | 98 | 113 | 16 | Orpheus | male | Jason | 2 |
| 2 | 4232 | greek | Anonymous | Orphic Argonautica | | 256 | 267 | 12 | Orpheus | male | Argo, Argonauts | 1 |
| 3 | 4233 | greek | Anonymous | Orphic Argonautica | | 283 | 294 | 12 | Jason | male | Argonauts | 1 |
| 4 | 4234 | greek | Anonymous | Orphic Argonautica | | 335 | 354 | 20 | Orpheus | male | gods of the sea | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4687 | 1864 | latin | Virgil | Aeneid | 12 | 872 | 884 | 13 | Juturna | female | Juturna, Turnus | 1 |
| 4688 | 1865 | latin | Virgil | Aeneid | 12 | 889 | 893 | 5 | Aeneas | male | Turnus | 1 |
| 4689 | 1866 | latin | Virgil | Aeneid | 12 | 894 | 895 | 2 | Turnus | male | Aeneas | 2 |
| 4690 | 1867 | latin | Virgil | Aeneid | 12 | 931 | 938 | 8 | Turnus | male | Aeneas | 1 |
| 4691 | 1868 | latin | Virgil | Aeneid | 12 | 947 | 949 | 3 | Aeneas | male | Turnus | 2 |

4692 rows × 12 columns
I like to save the output to a CSV file right away. This can be imported into Excel to check the data or to continue your work outside of Python. If you're using Google Colab, you might have to click a "files" icon on the left in order to see and download the new file.
💁🏻♂️ I like tab-separated files better than comma-separated files; I think they're a little easier to read. Excel will import either format without a problem. If you prefer to use commas instead of tabs, just remove the option sep="\t" below.
# write the table to a file for import to Excel
table.to_csv("speeches.tsv", sep="\t", index=False)
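If you want to pick the work up again later without re-downloading everything from DICES, the TSV file can be read back in with pd.read_csv. Here's a minimal sketch of the round trip using a toy stand-in for the speeches table (the real file is speeches.tsv):

```python
import pandas as pd

# a small stand-in for the full speeches table
table = pd.DataFrame([
    {"speech_id": 1, "language": "greek", "num_lines": 20},
    {"speech_id": 2, "language": "latin", "num_lines": 5},
])

# write to a tab-separated file, then read it back
table.to_csv("speeches_demo.tsv", sep="\t", index=False)
restored = pd.read_csv("speeches_demo.tsv", sep="\t")

# the round trip preserves the data
print(restored.equals(table))
```

Remember to pass sep="\t" to pd.read_csv as well; by default it expects commas.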
Now that we have a table with one row per speech, we probably want to look for different ways to group and summarize the collection. In this workshop we introduce three Pandas methods that aggregate records in ways that might be familiar from Excel.
We can use groupby() to group the speeches according to some factor and aggregate() to summarize the new groups. The values of our grouping column become the row names of the new table.
Let's find out how many speeches there are in each language. Here's how columns in the original table map onto the rows, columns, and values of the new summary table:

- rows: the values of the language column
- columns: a single new column, speeches
- values: the aggregation function "count"

So, our new table will have rows taken from language and a single column called speeches, in which each value will be the count of column speech_id in the original table.
# group by column "language"
grouped_data = table.groupby("language")
# aggregate column "speech_id" using aggregation function "count"
language_summary = grouped_data.aggregate(speeches=("speech_id", "count"))
# show the results
display(language_summary)
| language | speeches |
|---|---|
| greek | 2778 |
| latin | 1914 |
Here's a slightly different way of doing the same thing. It omits a step by chaining aggregate() directly to groupby(), while adding some whitespace to make the layout of the new table more prominent. agg() is a synonym for aggregate(). The outer parentheses are necessary so that Python understands when the whole chain is finished.
language_summary = (
table # original data
.groupby("language") # column to group by
.agg( # define summary table
speeches = ("speech_id", "count"), # - one new column
)
)
# show results
display(language_summary)
| language | speeches |
|---|---|
| greek | 2778 |
| latin | 1914 |
In this example, each row of the new table represents the group of speeches in one language. Besides just counting speech_ids, we can summarize other columns of the original data using more interesting functions.

Let's add some new columns about speech length to our summary table. For each language, we'll give:

- the total number of lines, using the aggregation function "sum"
- the longest speech, using "max"
- the average speech length, using "mean"

We can add these column definitions into the code we used before.
language_summary = (
table # original data
.groupby("language") # column to group by
.agg( # define summary table
speeches = ("speech_id", "count"), # - count of speech ids
lines = ("num_lines", "sum"), # - sum lines in all speeches
max_lines = ("num_lines", "max"), # - length of longest speech
avg_lines = ("num_lines", "mean"), # - average speech length
)
)
# show results
display(language_summary)
| language | speeches | lines | max_lines | avg_lines |
|---|---|---|---|---|
| greek | 2778 | 31886 | 478 | 11.478042 |
| latin | 1914 | 27123 | 592 | 14.170846 |
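The long decimals in avg_lines are more precise than we probably need. One way to tidy them for display is DataFrame.round, which accepts a dictionary mapping column names to decimal places. A quick sketch with made-up numbers matching the shape of our summary table:

```python
import pandas as pd

# a toy summary table like the one above
summary = pd.DataFrame(
    {"speeches": [2778, 1914], "avg_lines": [11.478042, 14.170846]},
    index=pd.Index(["greek", "latin"], name="language"),
)

# round only the floating-point column to two decimals;
# integer columns like "speeches" are left alone
rounded = summary.round({"avg_lines": 2})
print(rounded)
```

This changes only how the values are stored for display; if you plan to do more arithmetic, keep the unrounded table.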
If we provide groupby() with a list of column names instead of just one, we get a multi-dimensional grouping in which each possible combination of values from the respective columns is taken in turn.

How many speeches are delivered by speakers of each gender in each language? Our aggregation function will again be "count", but now the rows (or index) of our summary table have two levels, corresponding to language and gender.
language_summary = (
table # original data
.groupby(["language", "gender"]) # columns to group by
.agg( # define summary table
speeches = ("speech_id", "count"), # - count of speech ids
)
)
# show results
display(language_summary)
| language | gender | speeches |
|---|---|---|
| greek | female | 575 |
| greek | male | 2071 |
| greek | none | 76 |
| greek | x | 56 |
| latin | female | 546 |
| latin | male | 1334 |
| latin | none | 9 |
| latin | x | 25 |
To make it easier to compare values across genders, we can unstack the table: this rotates the inner level of the index or row labels (i.e., gender) over to the column headings.
language_summary = (
table # original data
.groupby(["language", "gender"]) # columns to group by
.agg( # define summary table
speeches = ("speech_id", "count"), # - count of speech ids
)
.unstack() # move second factor to columns
)
# show results
display(language_summary)
| language | female | male | none | x |
|---|---|---|---|---|
| greek | 575 | 2071 | 76 | 56 |
| latin | 546 | 1334 | 9 | 25 |
Of course, we can combine the two-level grouping with the extra summary columns from before.
language_summary = (
table # original data
.groupby(["language", "gender"]) # columns to group by
.agg( # define summary table
speeches = ("speech_id", "count"), # - count of speech ids
lines = ("num_lines", "sum"), # - sum lines in all speeches
max_lines = ("num_lines", "max"), # - length of longest speech
avg_lines = ("num_lines", "mean"), # - average speech length
)
)
# show results
display(language_summary)
| language | gender | speeches | lines | max_lines | avg_lines |
|---|---|---|---|---|---|
| greek | female | 575 | 7053 | 100 | 12.266087 |
| greek | male | 2071 | 24321 | 478 | 11.743602 |
| greek | none | 76 | 302 | 34 | 3.973684 |
| greek | x | 56 | 210 | 12 | 3.750000 |
| latin | female | 546 | 8913 | 450 | 16.324176 |
| latin | male | 1334 | 17922 | 592 | 13.434783 |
| latin | none | 9 | 92 | 25 | 10.222222 |
| latin | x | 25 | 196 | 49 | 7.840000 |
Unstacking this bigger table spreads every statistic across the gender columns.
language_summary = (
table # original data
.groupby(["language", "gender"]) # columns to group by
.agg( # define summary table
speeches = ("speech_id", "count"), # - count of speech ids
lines = ("num_lines", "sum"), # - sum lines in all speeches
max_lines = ("num_lines", "max"), # - length of longest speech
avg_lines = ("num_lines", "mean"), # - average speech length
)
.unstack() # move second factor to columns
)
# show results
display(language_summary)
| language | speeches, female | speeches, male | speeches, none | speeches, x | lines, female | lines, male | lines, none | lines, x | max_lines, female | max_lines, male | max_lines, none | max_lines, x | avg_lines, female | avg_lines, male | avg_lines, none | avg_lines, x |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| greek | 575 | 2071 | 76 | 56 | 7053 | 24321 | 302 | 210 | 100 | 478 | 34 | 12 | 12.266087 | 11.743602 | 3.973684 | 3.75 |
| latin | 546 | 1334 | 9 | 25 | 8913 | 17922 | 92 | 196 | 450 | 592 | 25 | 49 | 16.324176 | 13.434783 | 10.222222 | 7.84 |
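After unstack(), the column labels have two levels: the statistic on the outside and gender on the inside. Indexing with the outer label pulls out just one statistic's sub-table. A minimal sketch with a toy frame built to have the same two-level column structure (the numbers are copied from the table above):

```python
import pandas as pd

# build a small frame with two-level columns: (statistic, gender)
columns = pd.MultiIndex.from_product(
    [["speeches", "lines"], ["female", "male"]], names=[None, "gender"]
)
summary = pd.DataFrame(
    [[575, 2071, 7053, 24321], [546, 1334, 8913, 17922]],
    index=pd.Index(["greek", "latin"], name="language"),
    columns=columns,
)

# selecting the outer label returns just that statistic,
# with the gender values as ordinary columns
lines = summary["lines"]
print(lines.loc["greek", "male"])  # 24321
```

This is handy when the wide table gets unwieldy and you only want to compare one statistic at a time.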
We don't have time here to explore all the ways to select and filter data, but here's a quick example. The gender column includes some values that I'm not interested in at the moment, so I'm going to use the query() method to select only speeches where the gender is 'male' or 'female'.
language_summary = (
table # original data
.query("gender in ['male', 'female']") # filter
.groupby(["language", "gender"]) # columns to group by
.agg( # define summary table
speeches = ("speech_id", "count"), # - count of speech ids
lines = ("num_lines", "sum"), # - sum lines in all speeches
max_lines = ("num_lines", "max"), # - length of longest speech
avg_lines = ("num_lines", "mean"), # - average speech length
)
.unstack() # move second factor to columns
)
# show results
display(language_summary)
| language | speeches, female | speeches, male | lines, female | lines, male | max_lines, female | max_lines, male | avg_lines, female | avg_lines, male |
|---|---|---|---|---|---|---|---|---|
| greek | 575 | 2071 | 7053 | 24321 | 100 | 478 | 12.266087 | 11.743602 |
| latin | 546 | 1334 | 8913 | 17922 | 450 | 592 | 16.324176 | 13.434783 |
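The "gender in ['male', 'female']" string is pandas query syntax: column names appear bare, and in tests membership against a list. Here's a minimal sketch on toy data:

```python
import pandas as pd

# a toy table with a gender column to filter
table = pd.DataFrame({
    "speaker": ["Jason", "Orpheus", "Argo"],
    "gender": ["male", "male", "x"],
})

# keep only rows whose gender is in the given list
filtered = table.query("gender in ['male', 'female']")
print(len(filtered))  # 2
```

query() returns a new DataFrame and leaves the original untouched, which is why it chains so cleanly between the table and groupby() above.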
Pandas provides complete documentation for the methods we use here. They also have some helpful tutorials.
As a final flourish, the pivot_table() method can build a similar gender-by-language summary in a single step.
(
table
.query("gender in ['male', 'female']")
.pivot_table(
index = "language",
columns = "gender",
values = "num_lines",
aggfunc = ["sum", "max", "mean"],
)
)
| language | sum, female | sum, male | max, female | max, male | mean, female | mean, male |
|---|---|---|---|---|---|---|
| greek | 7053 | 24321 | 100 | 478 | 12.266087 | 11.743602 |
| latin | 8913 | 17922 | 450 | 592 | 16.324176 | 13.434783 |
For comparison, here's the equivalent summary built with groupby() and agg().
(
table # original data
.query("gender in ['male', 'female']") # filter
.groupby(["language", "gender"]) # columns to group by
.agg( # define summary table
lines = ("num_lines", "sum"), # - sum lines in all speeches
max_lines = ("num_lines", "max"), # - length of longest speech
avg_lines = ("num_lines", "mean"), # - average speech length
)
.unstack() # move second factor to columns
)
| language | lines, female | lines, male | max_lines, female | max_lines, male | avg_lines, female | avg_lines, male |
|---|---|---|---|---|---|---|
| greek | 7053 | 24321 | 100 | 478 | 12.266087 | 11.743602 |
| latin | 8913 | 17922 | 450 | 592 | 16.324176 | 13.434783 |