from dicesapi import DicesAPI
import pandas as pd
# create a connection to DICES
api = DicesAPI(logfile='dices.log')
speeches = api.getSpeeches(spkr_name='Achilles')
print('Got', len(speeches), 'speeches')
The basic move for building a data frame is to create a list of records ('dictionaries' in Python) where each record is made up of key-value pairs. The key is the column heading for this value. The pattern looks like this:
table = pd.DataFrame(dict(
speech_id = s.id,
author = s.author.name,
work = s.work.title,
language = s.lang,
first_line = s.l_fi,
last_line = s.l_la,
speakers = s.getSpkrString(),
addressees = s.getAddrString(),
type = s.type,
) for s in speeches)
The first and last lines in this block set up the data frame as a list of records, one per speech. The lines in the middle define the fields and values for each record. The column names are on the left sides of the equals signs, and the values for a given speech are on the right. Each time through the loop, s
will be a different speech.
Let's see the result:
display(table)
By default, Pandas only shows the first and last rows of the table. This is helpful if it's really large. If you want to see the whole thing at once, you can do this:
with pd.option_context('display.max_rows', None,):
display(table)
If at this point you want to move your data to Excel, you can save a Pandas DataFrame to CSV format using the to_csv()
method. (Note: if you're running this in Binder or another cloud-based platform, you'll have to download the resulting file afterward.)
table.to_csv('achilles.csv')
If you want to continue working with your data in Python, you'll find that most of the tasks you might do in Excel have straightforward analogues in Pandas.
In particular, aggregating and summarizing data in a Pivot Table is done with the pivot_table()
method. The most common arguments are:
index
: which fields to use for the rows of the new tablecolumns
: which fields to use for the columns of the new tablevalues
: which fields to summarize in the cells of the tableaggfunc
: the function used to summarize multiple data. Common choices are 'count'
, 'sum'
, 'mean'
, 'max'
, etc. See here for details, or type help(pd.DataFrame.pivot_table)
This produces a table showing how many speeches are of each type. I want one row for each type
, and then the values should be a count of speeches: I'll count the speech_id
column here.
table.pivot_table(
index = 'type',
values = 'speech_id',
aggfunc = 'count',
)
table[['first_line', 'last_line']]
You can get a single column out of the DataFrame as a list-like, one-dimensional series by putting the name in single square brackets or using a dot after the table name.
We'll see how this form is useful when we subset rows by column values below.
table['type']
table.addressees
If you want specific rows, you can put a range of numbers in square brackets:
table[0:10]
You can also select rows based on the values in certain columns:
table[table.addressees=='Agamemnon']
You can add columns to the table by assigning values to them. Here, we create a new column with the book number for each speech by performing a operation on one of the other columns. We split the value of first_line
on the '.'
character, and then take only the first element of the result.
table['book'] = [l.split('.')[0] for l in table.first_line]
table
Let's count Achilles' speeches in the Iliad by book number, and make a graph:
subset = table[table.work=='Iliad']
subset.pivot_table(
index = 'book',
values = 'speech_id',
aggfunc = 'count',
sort = False,
)
subset.pivot_table(
index = 'book',
values = 'speech_id',
aggfunc = 'count',
sort = False,
).plot.bar(
legend = False,
rot = False,
)