Records as tabular data¶

Sometimes it's helpful to thing of speeches, characters, or conversations as spreadsheet tables rather than as objects. A really useful tool for converting data back and forth between DICES object classes and tabular formats is Pandas.

In [ ]:

from dicesapi import DicesAPI
import pandas as pd

# create a connection to DICES
api = DicesAPI(logfile='dices.log')

Pandas provides the DataFrame class: this is a really convenient container for tabular data.

Example¶

Download some speeches¶

In [ ]:

speeches = api.getSpeeches(spkr_name='Achilles')
print('Got', len(speeches), 'speeches')

Convert to a DataFrame¶

The basic move for building a data frame is to create a list of records ('dictionaries' in Python) where each record is made up of key-value pairs. The key is the column heading for this value. The pattern looks like this:

In [ ]:

table = pd.DataFrame(dict(
    speech_id = s.id,
    author = s.author.name,
    work = s.work.title,
    language = s.lang,
    first_line = s.l_fi,
    last_line = s.l_la,
    speakers = s.getSpkrString(),
    addressees = s.getAddrString(),
    type = s.type,
) for s in speeches)

The first and last lines in this block set up the data frame as a list of records, one per speech. The lines in the middle define the fields and values for each record. The column names are on the left sides of the equals signs, and the values for a given speech are on the right. Each time through the loop, s will be a different speech.

Let's see the result:

In [ ]:

display(table)

By default, Pandas only shows the first and last rows of the table. This is helpful if it's really large. If you want to see the whole thing at once, you can do this:

In [ ]:

with pd.option_context('display.max_rows', None,):
    display(table)

Export CSV¶

If at this point you want to move your data to Excel, you can save a Pandas DataFrame to CSV format using the to_csv() method. (Note: if you're running this in Binder or another cloud-based platform, you'll have to download the resulting file afterward.)

In [ ]:

table.to_csv('achilles.csv')

Pivot Tables¶

If you want to continue working with your data in Python, you'll find that most of the tasks you might do in Excel have straightforward analogues in Pandas.

In particular, aggregating and summarizing data in a Pivot Table is done with the pivot_table() method. The most common arguments are:

index: which fields to use for the rows of the new table
columns: which fields to use for the columns of the new table
values: which fields to summarize in the cells of the table
aggfunc: the function used to summarize multiple data. Common choices are 'count', 'sum', 'mean', 'max', etc. See here for details, or type help(pd.DataFrame.pivot_table)

Example¶

This produces a table showing how many speeches are of each type. I want one row for each type, and then the values should be a count of speeches: I'll count the speech_id column here.

In [ ]:

table.pivot_table(
    index = 'type',
    values = 'speech_id',
    aggfunc = 'count',
)

Working with columns and rows¶

Subsetting columns by name¶

To get a new DataFrame using only selected columns, you can use doubled square brackets after the table name.

In [ ]:

table[['first_line', 'last_line']]

Extracting one column as a Series¶

You can get a single column out of the DataFrame as a list-like, one-dimensional series by putting the name in single square brackets or using a dot after the table name.

We'll see how this form is useful when we subset rows by column values below.

In [ ]:

table['type']

In [ ]:

table.addressees

Subsetting rows¶

If you want specific rows, you can put a range of numbers in square brackets:

In [ ]:

table[0:10]

You can also select rows based on the values in certain columns:

In [ ]:

table[table.addressees=='Agamemnon']

Creating new columns¶

You can add columns to the table by assigning values to them. Here, we create a new column with the book number for each speech by performing a operation on one of the other columns. We split the value of first_line on the '.' character, and then take only the first element of the result.

In [ ]:

table['book'] = [l.split('.')[0] for l in table.first_line]

In [ ]:

table

Putting it together¶

Let's count Achilles' speeches in the Iliad by book number, and make a graph:

In [ ]:

subset = table[table.work=='Iliad']

subset.pivot_table(
    index = 'book',
    values = 'speech_id',
    aggfunc = 'count',
    sort = False,
)

In [ ]:

subset.pivot_table(
    index = 'book',
    values = 'speech_id',
    aggfunc = 'count',
    sort = False,
).plot.bar(
    legend = False,
    rot = False,
)