Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.
____
Description of methods in this notebook: This notebook shows how to explore and pre-process the metadata of a dataset using Pandas.
The following processes are described:
Use Case: For Learners (Detailed explanation, not ideal for researchers)
Difficulty: Intermediate
Completion time: 45 minutes
Knowledge Required:
Knowledge Recommended:
Data Format: CSV file
Libraries Used:
Research Pipeline: None
____
We'll use the `tdm_client` library to automatically retrieve the metadata for a dataset. We can retrieve metadata in a CSV file using the `get_metadata` method.
Enter a dataset ID in the next code cell.
If you don't have a dataset ID, you can:
# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"
Next, import the `tdm_client` library and pass the `dataset_id` as an argument to its `get_metadata` method.
# Import the `tdm_client`
import tdm_client
# Pull in our dataset CSV using the `get_metadata` method
dataset_metadata = tdm_client.get_metadata(dataset_id)
We are ready to import pandas for our analysis and create a dataframe. We will use the `read_csv()` function to create our dataframe from the CSV file.
# Import pandas
import pandas as pd
# Create our dataframe
df = pd.read_csv(dataset_metadata)
We can confirm the size of our dataset using the `len()` function on our dataframe.
original_document_count = len(df)
print('Total original documents:', original_document_count)
Now let's take a look at the data in our dataframe `df`. We will set pandas to show all columns using `set_option()`, then get a preview using `head()`.
# Set the pandas option to show all columns
pd.set_option("display.max_columns", None)
# Show the first five rows of our dataframe
df.head()
Here are descriptions for the metadata types found in each column:
Column Name | Description |
---|---|
id | a unique item ID (In JSTOR, this is a stable URL) |
title | the title for the item |
isPartOf | the larger work that holds this title (for example, a journal title) |
publicationYear | the year of publication |
doi | the digital object identifier for an item |
docType | the type of document (for example, article or book) |
provider | the source or provider of the dataset |
datePublished | the publication date in yyyy-mm-dd format |
issueNumber | the issue number for a journal publication |
volumeNumber | the volume number for a journal publication |
url | a URL for the item and/or the item's metadata |
creator | the author or authors of the item |
publisher | the publisher for the item |
language | the language or languages of the item (eng is the ISO 639 code for English) |
pageStart | the first page number of the print version |
pageEnd | the last page number of the print version |
placeOfPublication | the city of the publisher |
wordCount | the number of words in the item |
pageCount | the number of print pages in the item |
outputFormat | what data is available (unigrams, bigrams, trigrams, and/or full-text) |
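Before deciding which columns to drop, it can help to see how complete each one is. The sketch below uses a small hand-made dataframe (`sample_df`, with made-up values standing in for the real metadata) to count missing values per column and show each column's data type. The same two calls should work on `df`.

```python
import pandas as pd

# Hypothetical stand-in for the metadata dataframe; the values are invented
sample_df = pd.DataFrame({
    "id": ["item-a", "item-b", "item-c"],
    "title": ["Front Matter", "An Essay", "Review Article"],
    "creator": [None, "A. Author", "B. Author"],
    "wordCount": [250, 5400, 1200],
})

# Count the missing values in each column
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Show the data type pandas inferred for each column
print(sample_df.dtypes)
```

A column that is mostly empty is a reasonable candidate for dropping, while a column with only a few gaps may be more useful for filtering rows.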
If there are any columns you would like to drop from your analysis, you can drop them with:
`df = df.drop(['column_name1', 'column_name2', ...], axis=1)`
# Drop each of these named columns
df = df.drop(['outputFormat', 'pageEnd', 'pageStart', 'datePublished', 'language'], axis=1)
# Show the first five rows of our updated dataframe
df.head()
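Running a drop cell a second time raises a `KeyError`, because the named columns are already gone. One way around this, sketched here on an invented `toy` dataframe, is to pass `errors="ignore"` so that missing column names are skipped. The `columns=` keyword is an equivalent spelling of `axis=1`.

```python
import pandas as pd

# Hypothetical dataframe with one column we want to remove
toy = pd.DataFrame({
    "id": ["item-a"],
    "title": ["An Essay"],
    "language": ["eng"],
})

# 'outputFormat' is not present here; errors="ignore" skips it silently
toy = toy.drop(columns=["language", "outputFormat"], errors="ignore")

# Running the same line again is now harmless
toy = toy.drop(columns=["language", "outputFormat"], errors="ignore")

print(list(toy.columns))
```

The trade-off is that a misspelled column name will also be skipped without warning, so it is worth checking the remaining columns afterwards.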
If you would like to know if a particular id is in the dataframe, you can use the `in` operator to return a boolean value (True or False).
# Check if a particular item id is in the `id` column
'http://www.jstor.org/stable/2868641' in df.id.values
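The `.values` part of that check matters. Applied directly to a pandas Series, the `in` operator tests the index labels rather than the stored values. A minimal illustration, using an invented Series named `ids`:

```python
import pandas as pd

# Hypothetical Series of item ids with the default index 0, 1, 2
ids = pd.Series(["item-a", "item-b", "item-c"])

# `in` on the Series itself looks at the index labels
found_in_index = "item-b" in ids

# `in` on the underlying array looks at the values
found_in_values = "item-b" in ids.values

# isin() is an alternative that also works for several ids at once
found_with_isin = bool(ids.isin(["item-b"]).any())

print(found_in_index, found_in_values, found_with_isin)
```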
Now that we have filtered out unwanted metadata columns, we can begin filtering out any texts that may not match our research interests. Let's examine the first and last twenty rows of the dataframe to see if we can identify texts that we would like to remove. We are looking for patterns in the metadata that could help us remove many texts at once.
# Preview the first twenty items in the dataframe
# df.head(20) # Change 20 to view a greater or lesser number of rows
# Preview the last twenty items in the dataframe
# df.tail(20) # Change 20 to view a greater or lesser number of rows
For example, we may wish to remove any texts that do not have authors. (In the case of journals, this may be helpful for removing paratextual sections such as the table of contents, indices, etc.) The column of interest in this case is `creator`.
# Remove all texts without an author
df = df.dropna(subset=['creator']) # Drop each row that has no value under 'creator'
# Print the total original documents followed by the current number
print('Total original documents:', original_document_count)
print('Total current documents: ', len(df))
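Dropping rows cannot be undone without reloading the data, so it may be worth previewing what would be removed first. This sketch uses an invented `toy` dataframe and selects the rows with no `creator` before calling `dropna()`.

```python
import pandas as pd

# Hypothetical metadata; two rows have no author
toy = pd.DataFrame({
    "title": ["Front Matter", "An Essay", "Index", "A Note"],
    "creator": [None, "A. Author", None, "B. Author"],
})

# Rows that dropna(subset=['creator']) would remove
to_remove = toy[toy["creator"].isna()]
print(to_remove["title"].tolist())

# The rows that remain after dropping
kept = toy.dropna(subset=["creator"])
print(len(kept))
```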
We can also remove texts that have a particular value in a column. Here are a few examples.
# Remove all items with a particular title
df = df[df.title != 'Review Article'] # Change `Review Article` to your desired title
# Remove all items with 3000 words or fewer
df = df[df.wordCount > 3000] # Change `3000` to your desired number
# Print the total original documents followed by the current number
print('Total original documents:', original_document_count)
print('Total current documents: ', len(df))
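The two filters above can also be combined into a single boolean mask. The following is a sketch on an invented `toy` dataframe; the list name `unwanted_titles` is an assumption made for illustration. Each condition needs its own parentheses when joined with `&`.

```python
import pandas as pd

# Hypothetical metadata for four items
toy = pd.DataFrame({
    "title": ["Review Article", "An Essay", "Front Matter", "A Long Study"],
    "wordCount": [4200, 5400, 250, 9100],
})

# Titles we want to exclude (illustrative list)
unwanted_titles = ["Review Article", "Front Matter"]

# Keep rows whose title is not unwanted AND whose word count exceeds 3000
mask = ~toy["title"].isin(unwanted_titles) & (toy["wordCount"] > 3000)
filtered = toy[mask]

print(filtered["title"].tolist())
```

Building the mask as its own variable makes it easy to inspect (for example with `mask.sum()`) before applying it.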
Take a final look at your dataframe to make sure the current texts fit your research goals. In the next step, we will save the IDs of your pre-processed dataset.
# Preview the first 50 rows of your dataset
df.head(50)
# Write the column "id" to a CSV file called `pre-processed_###.csv` where ### is the `dataset_id`
df["id"].to_csv('data/pre-processed_' + dataset_id + '.csv')
Download the "pre-processed_###.csv" file (where ### is the `dataset_id`) for future analysis. You can use this file in combination with the dataset ID to automatically filter your texts and reduce the processing time of your analyses.
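As a rough sketch of how the saved file might be used later, the example below writes a Series of ids to a temporary CSV, reads it back, and keeps only the matching rows. The dataframe `full` and the file name are invented for illustration. It assumes a recent pandas version, where `Series.to_csv()` writes a header row and the index by default, so the ids come back in a column named `id`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical full metadata
full = pd.DataFrame({
    "id": ["item-a", "item-b", "item-c"],
    "title": ["An Essay", "Front Matter", "A Note"],
})

# Ids that survived pre-processing (here: everything except front matter)
kept_ids = full.loc[full["title"] != "Front Matter", "id"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pre-processed_example.csv")
    kept_ids.to_csv(path)      # same call style as in the notebook
    saved = pd.read_csv(path)  # columns: an unnamed index column and 'id'

# Filter the full metadata down to the saved ids
reloaded = full[full["id"].isin(saved["id"])]
print(reloaded["id"].tolist())
```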
# Group the data by publication year and plot the number of ids per year as a bar chart
df.groupby(['publicationYear'])['id'].agg('count').plot.bar(title='Documents by year', figsize=(20, 5), fontsize=12);
# Read more about Pandas dataframe plotting here:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html
And now let's look at the total number of pages by year.
# Group the data by publication year and plot the sum of the page counts per year as a bar chart
df.groupby(['publicationYear'])['pageCount'].agg('sum').plot.bar(title='Pages by year', figsize=(20, 5), fontsize=12);
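If a per-decade view is easier to read than a per-year one, the years can be rounded down to the start of their decade before grouping. This is a sketch on an invented `toy` dataframe, not part of the original notebook. The resulting Series could be passed to `.plot.bar()` in the same way as above.

```python
import pandas as pd

# Hypothetical publication years and page counts
toy = pd.DataFrame({
    "publicationYear": [1951, 1958, 1963, 1969, 1970],
    "pageCount": [10, 12, 8, 20, 5],
})

# Integer-divide by 10 and multiply back to get the decade (1958 -> 1950)
decade = (toy["publicationYear"] // 10) * 10

# Sum the page counts within each decade
pages_by_decade = toy.groupby(decade)["pageCount"].sum()
print(pages_by_decade)
```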