
Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email [email protected]


Exploring Metadata and Pre-Processing

Description of methods in this notebook: This notebook shows how to explore and pre-process the metadata of a dataset using Pandas.

The following processes are described:

  • Importing a CSV file containing the metadata for a given dataset ID
  • Creating a Pandas dataframe to view the metadata
  • Pre-processing your dataset by filtering out unwanted texts
  • Exporting a list of relevant IDs to a CSV file
  • Visualizing the metadata of your pre-processed dataset by the number of documents/year and pages/year

Use Case: For Learners (Detailed explanation, not ideal for researchers)

Difficulty: Intermediate

Completion time: 45 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: CSV file

Libraries Used: Pandas, tdm_client

Research Pipeline: None


Import your dataset

We'll use the tdm_client library to automatically retrieve the metadata for a dataset. We can retrieve metadata in a CSV file using the get_metadata method.

Enter a dataset ID in the next code cell.

If you don't have a dataset ID, you can use the sample dataset ID already supplied in the next code cell.

In [ ]:
# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"

Next, import the tdm_client and pass dataset_id as an argument to its get_metadata method.

In [ ]:
# Import the `tdm_client`
import tdm_client

# Pull in our dataset's metadata CSV using `get_metadata`
dataset_metadata = tdm_client.get_metadata(dataset_id)
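
Since we pass the value returned by get_metadata straight to read_csv() in the next step, it should be the location of the downloaded CSV file; printing it is a quick sanity check:

In [ ]:
# `get_metadata` returns where the metadata CSV was saved;
# printing it lets us confirm the download location
print(dataset_metadata)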

We are ready to import pandas for our analysis and create a dataframe. We will use the read_csv() method to create our dataframe from the CSV file.

In [ ]:
# Import pandas 
import pandas as pd

# Create our dataframe
df = pd.read_csv(dataset_metadata)

We can confirm the size of our dataset using the len() function on our dataframe.

In [ ]:
original_document_count = len(df)
print('Total original documents:', original_document_count)

Now let's take a look at the data in our dataframe df. We will set pandas to show all columns using set_option(), then get a preview using head().

In [ ]:
# Set the pandas option to show all columns
pd.set_option("display.max_columns", None)

# Show the first five rows of our dataframe
df.head() 
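
Alongside the preview, it can be useful to see each column's data type and how many non-null values it contains before deciding what to keep. Pandas' built-in info() method provides that summary:

In [ ]:
# Summarize the dataframe: column names, non-null counts, and data types
df.info()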

Metadata Type by Column Name

Here are descriptions for the metadata types found in each column:

  • id: a unique item ID (in JSTOR, this is a stable URL)
  • title: the title for the item
  • isPartOf: the larger work that holds this title (for example, a journal title)
  • publicationYear: the year of publication
  • doi: the digital object identifier for an item
  • docType: the type of document (for example, article or book)
  • provider: the source or provider of the dataset
  • datePublished: the publication date in yyyy-mm-dd format
  • issueNumber: the issue number for a journal publication
  • volumeNumber: the volume number for a journal publication
  • url: a URL for the item and/or the item's metadata
  • creator: the author or authors of the item
  • publisher: the publisher of the item
  • language: the language or languages of the item (eng is the ISO 639 code for English)
  • pageStart: the first page number of the print version
  • pageEnd: the last page number of the print version
  • placeOfPublication: the city of the publisher
  • wordCount: the number of words in the item
  • pageCount: the number of print pages in the item
  • outputFormat: what data is available (unigrams, bigrams, trigrams, and/or full-text)

If there are any columns you would like to drop from your analysis, you can drop them with:

df = df.drop(['column_name1', 'column_name2', ...], axis=1)

In [ ]:
# Drop each of these named columns
df = df.drop(['outputFormat', 'pageEnd', 'pageStart', 'datePublished', 'language'], axis=1)

# Show the first five rows of our updated dataframe
df.head()

If you would like to know if a particular id is in the dataframe, you can use the in operator to return a boolean value (True or False).

In [ ]:
# Check if a particular item id is in the `id` column
'http://www.jstor.org/stable/2868641' in df.id.values
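
If the item is present and you want to see its full metadata, you can filter the dataframe on the id column with the same value:

In [ ]:
# Show the full metadata row for a particular item id
df[df.id == 'http://www.jstor.org/stable/2868641']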

Filtering Out Unwanted Texts

Now that we have filtered out unwanted metadata columns, we can begin filtering out any texts that may not match our research interests. Let's examine the first and last twenty rows of the dataframe to see if we can identify texts that we would like to remove. We are looking for patterns in the metadata that could help us remove many texts at once.

In [ ]:
# Preview the first twenty items in the dataframe
df.head(20) # Change 20 to view a greater or lesser number of rows
In [ ]:
# Preview the last twenty items in the dataframe
df.tail(20) # Change 20 to view a greater or lesser number of rows
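
A quick way to surface such patterns is to count how often each value appears in a column. Repeated generic titles (for example, front matter or book reviews) often signal paratext that can be filtered out in bulk; here is a short sketch using pandas' value_counts():

In [ ]:
# Count the twenty most common titles; repeated generic titles
# often indicate paratext such as front matter or reviews
df['title'].value_counts().head(20)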

Remove all rows without data for a particular column

For example, we may wish to remove any texts that do not have authors. (In the case of journals, this may be helpful for removing paratextual sections such as the table of contents, indices, etc.) The column of interest in this case is creator.

In [ ]:
# Remove all texts without an author
df = df.dropna(subset=['creator']) # Drop each row that has no value under 'creator'
In [ ]:
# Print the total original documents followed by the current number
print('Total original documents:', original_document_count)
print('Total current documents: ', len(df))

Remove rows based on the content of a particular column

We can also remove texts that have a particular value in a column. Here are a few examples.

In [ ]:
# Remove all items with a particular title
df = df[df.title != 'Review Article'] # Change `Review Article` to your desired title
In [ ]:
# Remove all items with 3000 words or fewer
df = df[df.wordCount > 3000] # Change `3000` to your desired number
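
The same comparison pattern works for any numeric column. For example, to keep only documents published in or after a particular year (the 1980 cutoff below is just an illustration, not part of the original filters):

In [ ]:
# Keep only items published in or after 1980
df = df[df.publicationYear >= 1980] # Change `1980` to your desired year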
In [ ]:
# Print the total original documents followed by the current number
print('Total original documents:', original_document_count)
print('Total current documents: ', len(df))

Take a final look at your dataframe to make sure the current texts fit your research goals. In the next step, we will save the IDs of your pre-processed dataset.

In [ ]:
# Preview the first 50 rows of your dataframe
df.head(50)

Saving a list of IDs to a CSV file

In [ ]:
# Write the column "id" to a CSV file called `pre-processed_###.csv` where ### is the `dataset_id`
df["id"].to_csv('data/pre-processed_' + dataset_id + '.csv')

Download the "pre-processed_###.csv" file (where ### is the dataset_id) for future analysis. You can use this file in combination with the dataset ID to automatically filter your texts and reduce the processing time of your analyses.
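
As a sketch of that later workflow, the saved file can be read back with pandas to recover the list of IDs. (This assumes a recent pandas version, where Series.to_csv() writes a header row by default, so the id column keeps its name.)

In [ ]:
# Read the saved CSV back in and recover the filtered ids as a list
filtered_ids = pd.read_csv('data/pre-processed_' + dataset_id + '.csv')['id'].tolist()
print('Filtered ids:', len(filtered_ids))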


Visualizing the Pre-Processed Data

In [ ]:
# Group the data by publication year and plot the count of ids as a bar chart
df.groupby(['publicationYear'])['id'].agg('count').plot.bar(title='Documents by year', figsize=(20, 5), fontsize=12); 

# Read more about Pandas dataframe plotting here: 
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html

And now let's look at the total page counts by year.

In [ ]:
# Group the data by publication year and plot the sum of page counts as a bar chart
df.groupby(['publicationYear'])['pageCount'].agg('sum').plot.bar(title='Pages by year', figsize=(20, 5), fontsize=12);
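
If you would like to keep either chart as an image, the plot call returns a matplotlib Axes object whose parent figure can be written to disk. A minimal sketch (the filename is just an example):

In [ ]:
# Capture the Axes returned by the plot, then save its parent figure
ax = df.groupby(['publicationYear'])['pageCount'].agg('sum').plot.bar(title='Pages by year', figsize=(20, 5), fontsize=12)
ax.get_figure().savefig('pages_by_year.png', bbox_inches='tight') # example filename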