This notebook demonstrates downloading speeches from DICES, creating tables of speech metadata, and using grouping and aggregation to create summary statistics.
Because Google Colab runs this notebook on a fresh virtual machine every time, we always need to install DICES as the first step.
# install DICES client
!pip install -q git+https://github.com/cwf2/dices-client
Here we tell Python which ancillary packages we want to make accessible. In this case, the DICES client and Pandas.
# for talking to DICES
from dicesapi import DicesAPI
# for creating data tables
import pandas as pd
Here we instantiate a connection to the DICES database. As we work, the DICES client emits a lot of debugging information; it's convenient to divert those messages to a separate file.
# create a new DICES connection
api = DicesAPI(logfile="dices.log", logdetail=0)
The getSpeeches() method is our main tool for downloading speech metadata. If you provide no other criteria, you will get all the speeches.
# download all speeches
speeches = api.getSpeeches()
# check that we got them
print(len(speeches), "speeches.")
4696 speeches.
One convenient way to manipulate the speech metadata from DICES is in tabular format, like a spreadsheet. The ancillary package Pandas is designed to create and analyze tabular data.
Here, we create a table in which each row represents one speech. Using a for loop over the results from DICES, we take each speech in turn. After making a few calculations, we create a row record for that speech. We choose which values we want to keep, giving each a label. Then we add the record to a growing list of rows. In the final step, we create a DataFrame (i.e., a table) from the list of individual row records.
# an empty list to hold the rows
rows = list()
# iterate over the speeches
for speech in speeches:
# skip speeches that span more than one book/poem
if speech.isMultiPrefix():
print("skipping", speech)
continue
# rough estimate of number of lines based on subtraction
# - get line numbers, strip alphabetic suffixes like "101a"
# - subtract first line from last line and add one
first_line = speech.getLineNo("first", alpha=False)
last_line = speech.getLineNo("last", alpha=False)
nlines = int(last_line) - int(first_line) + 1
# get first speaker gender
spkr_gender = speech.spkr[0].gender
# create a new row, labelling all the data values
row = {
"speech_id": speech.id,
"language": speech.lang,
"author": speech.author.name,
"work": speech.work.title,
"prefix": speech.getPrefix(),
"first_line": speech.getLineNo("first"),
"last_line": speech.getLineNo("last"),
"num_lines": nlines,
"speaker": speech.getSpkrString(),
"gender": spkr_gender,
"addressee": speech.getAddrString(),
"turn": speech.part,
}
# add the row to the list
rows.append(row)
# make the table
table = pd.DataFrame(rows)
# display the table
display(table)
skipping <Speech 890: Odyssey 9.2-11.332>
skipping <Speech 957: Odyssey 11.378-12.453>
skipping <Speech 3950: Paraphrase 14.89-16.54>
skipping <Speech 1552: Aeneid 2.3-3.715>
| | speech_id | language | author | work | prefix | first_line | last_line | num_lines | speaker | gender | addressee | turn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4230 | greek | Anonymous | Orphic Argonautica | | 77 | 96 | 20 | Jason | male | Orpheus | 1 |
| 1 | 4231 | greek | Anonymous | Orphic Argonautica | | 98 | 113 | 16 | Orpheus | male | Jason | 2 |
| 2 | 4232 | greek | Anonymous | Orphic Argonautica | | 256 | 267 | 12 | Orpheus | male | Argo, Argonauts | 1 |
| 3 | 4233 | greek | Anonymous | Orphic Argonautica | | 283 | 294 | 12 | Jason | male | Argonauts | 1 |
| 4 | 4234 | greek | Anonymous | Orphic Argonautica | | 335 | 354 | 20 | Orpheus | male | gods of the sea | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4687 | 1864 | latin | Virgil | Aeneid | 12 | 872 | 884 | 13 | Juturna | female | Juturna, Turnus | 1 |
| 4688 | 1865 | latin | Virgil | Aeneid | 12 | 889 | 893 | 5 | Aeneas | male | Turnus | 1 |
| 4689 | 1866 | latin | Virgil | Aeneid | 12 | 894 | 895 | 2 | Turnus | male | Aeneas | 2 |
| 4690 | 1867 | latin | Virgil | Aeneid | 12 | 931 | 938 | 8 | Turnus | male | Aeneas | 1 |
| 4691 | 1868 | latin | Virgil | Aeneid | 12 | 947 | 949 | 3 | Aeneas | male | Turnus | 2 |

4692 rows × 12 columns
I like to save the output to a CSV file right away. This can be imported into Excel to check the data or to continue your work outside of Python. If you're using Google Colab, you might have to click a "files" icon on the left in order to see and download the new file.
💁🏻♂️ I like tab-separated files better than comma-separated files; I think they're a little easier to read. Excel will import either format without a problem. If you prefer to use commas instead of tabs, just remove the option sep="\t" below.
# write the table to a file for import to Excel
table.to_csv("speeches.tsv", sep="\t", index=False)
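If you want to pick the work up again later without re-downloading everything from DICES, the TSV file can be read back in with pd.read_csv. Here's a minimal sketch of the round trip using a toy stand-in for the speeches table (the real file is speeches.tsv):

```python
import pandas as pd

# a small stand-in for the full speeches table
table = pd.DataFrame([
    {"speech_id": 1, "language": "greek", "num_lines": 20},
    {"speech_id": 2, "language": "latin", "num_lines": 5},
])

# write to a tab-separated file, then read it back
table.to_csv("speeches_demo.tsv", sep="\t", index=False)
restored = pd.read_csv("speeches_demo.tsv", sep="\t")

# the round trip preserves the data
print(restored.equals(table))
```

Remember to pass sep="\t" to pd.read_csv as well; by default it expects commas.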
Now that we have a table with one row per speech, we probably want to look for different ways to group and summarize the collection. In this workshop we introduce three Pandas methods that aggregate records in ways that might be familiar from Excel.
We can use groupby() to group the speeches according to some factor and aggregate() to summarize the new groups. The values of our grouping column become the row names of the new table.
Let's find out how many speeches there are in each language. Here's how columns in the original table map onto the rows, columns, and values of the new summary table:

- rows: the values of the language column
- columns: a single new column, speeches
- values: the aggregation function "count"

So, our new table will have rows taken from language and a single column called speeches, in which each value will be the count of column speech_id in the original table.
# group by column "language"
grouped_data = table.groupby("language")
# aggregate column "speech_id" using aggregation function "count"
language_summary = grouped_data.aggregate(speeches=("speech_id", "count"))
# show the results
display(language_summary)
| language | speeches |
|---|---|
| greek | 2778 |
| latin | 1914 |
Here's a slightly different way of doing the same thing. It omits a step by chaining aggregate() directly to groupby(), while adding some whitespace to make the layout of the new table more prominent. agg() is a synonym for aggregate(). The outer parentheses are necessary so that Python understands when the whole chain is finished.
language_summary = (
table # original data
.groupby("language") # column to group by
.agg( # define summary table
speeches = ("speech_id", "count"), # - one new column
)
)
# show results
display(language_summary)
| language | speeches |
|---|---|
| greek | 2778 |
| latin | 1914 |
In this example, each row of the new table represents the group of speeches in one language. Besides just counting speech_ids, we can summarize other columns of the original data using more interesting functions.

Let's add some new columns about speech length to our summary table. For each language, we'll give:

- the total number of lines, using the aggregation function "sum"
- the longest speech, using "max"
- the average speech length, using "mean"

We can add these column definitions into the code we used before.
language_summary = (
table # original data
.groupby("language") # column to group by
.agg( # define summary table
speeches = ("speech_id", "count"), # - count of speech ids
lines = ("num_lines", "sum"), # - sum lines in all speeches
max_lines = ("num_lines", "max"), # - length of longest speech
avg_lines = ("num_lines", "mean"), # - average speech length
)
)
# show results
display(language_summary)
| language | speeches | lines | max_lines | avg_lines |
|---|---|---|---|---|
| greek | 2778 | 31886 | 478 | 11.478042 |
| latin | 1914 | 27123 | 592 | 14.170846 |
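The long decimals in avg_lines are more precise than we probably need. One way to tidy them for display is DataFrame.round, which accepts a dictionary mapping column names to decimal places. A quick sketch with made-up numbers matching the shape of our summary table:

```python
import pandas as pd

# a toy summary table like the one above
summary = pd.DataFrame(
    {"speeches": [2778, 1914], "avg_lines": [11.478042, 14.170846]},
    index=pd.Index(["greek", "latin"], name="language"),
)

# round only the floating-point column to two decimals;
# integer columns like "speeches" are left alone
rounded = summary.round({"avg_lines": 2})
print(rounded)
```

This changes only how the values are stored for display; if you plan to do more arithmetic, keep the unrounded table.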
If we provide groupby() with a list of column names instead of just one, we get a multi-dimensional grouping in which each possible combination of values from the respective columns is taken in turn.

How many speeches are delivered by speakers of each gender in each language? Our aggregation function will again be "count", but now the rows (or index) of our summary table have two levels, corresponding to language and gender.
language_summary = (
table # original data
.groupby(["language", "gender"]) # columns to group by
.agg( # define summary table
speeches = ("speech_id", "count"), # - count of speech ids
)
)
# show results
display(language_summary)
| language | gender | speeches |
|---|---|---|
| greek | female | 575 |
| greek | male | 2071 |
| greek | none | 76 |
| greek | x | 56 |
| latin | female | 546 |
| latin | male | 1334 |
| latin | none | 9 |
| latin | x | 25 |
To make it easier to compare values across genders, we can unstack the table: this rotates the inner level of the index or row labels (i.e., gender) over to the column headings.
language_summary = (
table # original data
.groupby(["language", "gender"]) # columns to group by
.agg( # define summary table
speeches = ("speech_id", "count"), # - count of speech ids
)
.unstack() # move second factor to columns
)
# show results
display(language_summary)
| language | female | male | none | x |
|---|---|---|---|---|
| greek | 575 | 2071 | 76 | 56 |
| latin | 546 | 1334 | 9 | 25 |
Of course, we can combine the two-level grouping with the extra summary columns from before.
language_summary = (
table # original data
.groupby(["language", "gender"]) # columns to group by
.agg( # define summary table
speeches = ("speech_id", "count"), # - count of speech ids
lines = ("num_lines", "sum"), # - sum lines in all speeches
max_lines = ("num_lines", "max"), # - length of longest speech
avg_lines = ("num_lines", "mean"), # - average speech length
)
)
# show results
display(language_summary)
| language | gender | speeches | lines | max_lines | avg_lines |
|---|---|---|---|---|---|
| greek | female | 575 | 7053 | 100 | 12.266087 |
| greek | male | 2071 | 24321 | 478 | 11.743602 |
| greek | none | 76 | 302 | 34 | 3.973684 |
| greek | x | 56 | 210 | 12 | 3.750000 |
| latin | female | 546 | 8913 | 450 | 16.324176 |
| latin | male | 1334 | 17922 | 592 | 13.434783 |
| latin | none | 9 | 92 | 25 | 10.222222 |
| latin | x | 25 | 196 | 49 | 7.840000 |
Unstacking this bigger table spreads every statistic across the gender columns.
language_summary = (
table # original data
.groupby(["language", "gender"]) # columns to group by
.agg( # define summary table
speeches = ("speech_id", "count"), # - count of speech ids
lines = ("num_lines", "sum"), # - sum lines in all speeches
max_lines = ("num_lines", "max"), # - length of longest speech
avg_lines = ("num_lines", "mean"), # - average speech length
)
.unstack() # move second factor to columns
)
# show results
display(language_summary)
| language | speeches, female | speeches, male | speeches, none | speeches, x | lines, female | lines, male | lines, none | lines, x | max_lines, female | max_lines, male | max_lines, none | max_lines, x | avg_lines, female | avg_lines, male | avg_lines, none | avg_lines, x |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| greek | 575 | 2071 | 76 | 56 | 7053 | 24321 | 302 | 210 | 100 | 478 | 34 | 12 | 12.266087 | 11.743602 | 3.973684 | 3.75 |
| latin | 546 | 1334 | 9 | 25 | 8913 | 17922 | 92 | 196 | 450 | 592 | 25 | 49 | 16.324176 | 13.434783 | 10.222222 | 7.84 |
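After unstack(), the column labels have two levels: the statistic on the outside and gender on the inside. Indexing with the outer label pulls out just one statistic's sub-table. A minimal sketch with a toy frame built to have the same two-level column structure (the numbers are copied from the table above):

```python
import pandas as pd

# build a small frame with two-level columns: (statistic, gender)
columns = pd.MultiIndex.from_product(
    [["speeches", "lines"], ["female", "male"]], names=[None, "gender"]
)
summary = pd.DataFrame(
    [[575, 2071, 7053, 24321], [546, 1334, 8913, 17922]],
    index=pd.Index(["greek", "latin"], name="language"),
    columns=columns,
)

# selecting the outer label returns just that statistic,
# with the gender values as ordinary columns
lines = summary["lines"]
print(lines.loc["greek", "male"])  # 24321
```

This is handy when the wide table gets unwieldy and you only want to compare one statistic at a time.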
We don't have time here to explore all the ways to select and filter data, but here's a quick example. The gender column includes some values that I'm not interested in at the moment, so I'm going to use the query() method to select only speeches where the gender is 'male' or 'female'.
language_summary = (
table # original data
.query("gender in ['male', 'female']") # filter
.groupby(["language", "gender"]) # columns to group by
.agg( # define summary table
speeches = ("speech_id", "count"), # - count of speech ids
lines = ("num_lines", "sum"), # - sum lines in all speeches
max_lines = ("num_lines", "max"), # - length of longest speech
avg_lines = ("num_lines", "mean"), # - average speech length
)
.unstack() # move second factor to columns
)
# show results
display(language_summary)
| language | speeches, female | speeches, male | lines, female | lines, male | max_lines, female | max_lines, male | avg_lines, female | avg_lines, male |
|---|---|---|---|---|---|---|---|---|
| greek | 575 | 2071 | 7053 | 24321 | 100 | 478 | 12.266087 | 11.743602 |
| latin | 546 | 1334 | 8913 | 17922 | 450 | 592 | 16.324176 | 13.434783 |
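The "gender in ['male', 'female']" string is pandas query syntax: column names appear bare, and in tests membership against a list. Here's a minimal sketch on toy data:

```python
import pandas as pd

# a toy table with a gender column to filter
table = pd.DataFrame({
    "speaker": ["Jason", "Orpheus", "Argo"],
    "gender": ["male", "male", "x"],
})

# keep only rows whose gender is in the given list
filtered = table.query("gender in ['male', 'female']")
print(len(filtered))  # 2
```

query() returns a new DataFrame and leaves the original untouched, which is why it chains so cleanly between the table and groupby() above.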
Pandas provides complete documentation for the methods we use here. They also have some helpful tutorials.
As a final flourish, the pivot_table() method can build a similar gender-by-language summary in a single step.
(
table
.query("gender in ['male', 'female']")
.pivot_table(
index = "language",
columns = "gender",
values = "num_lines",
aggfunc = ["sum", "max", "mean"],
)
)
| language | sum, female | sum, male | max, female | max, male | mean, female | mean, male |
|---|---|---|---|---|---|---|
| greek | 7053 | 24321 | 100 | 478 | 12.266087 | 11.743602 |
| latin | 8913 | 17922 | 450 | 592 | 16.324176 | 13.434783 |
For comparison, here's the equivalent summary built with groupby() and agg().
(
table # original data
.query("gender in ['male', 'female']") # filter
.groupby(["language", "gender"]) # columns to group by
.agg( # define summary table
lines = ("num_lines", "sum"), # - sum lines in all speeches
max_lines = ("num_lines", "max"), # - length of longest speech
avg_lines = ("num_lines", "mean"), # - average speech length
)
.unstack() # move second factor to columns
)
| language | lines, female | lines, male | max_lines, female | max_lines, male | avg_lines, female | avg_lines, male |
|---|---|---|---|---|---|---|
| greek | 7053 | 24321 | 100 | 478 | 12.266087 | 11.743602 |
| latin | 8913 | 17922 | 450 | 592 | 16.324176 | 13.434783 |