
Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email [email protected]


Exploring Metadata and Pre-Processing

Description of methods in this notebook: This notebook shows how to explore and pre-process the metadata of a dataset using Pandas.

The following processes are described:

  • Importing a CSV file containing the metadata for a given dataset ID
  • Creating a Pandas dataframe to view the metadata
  • Pre-processing your dataset by filtering out unwanted texts
  • Exporting a list of relevant IDs to a CSV file
  • Visualizing the metadata of your pre-processed dataset by the number of documents/year and pages/year

Use Case: For Learners (Detailed explanation, not ideal for researchers)

Difficulty: Intermediate

Completion time: 45 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: CSV file

Libraries Used: Pandas, tdm_client

Research Pipeline: None


Import your dataset

We'll use the tdm_client library to automatically retrieve the metadata for a dataset. We can retrieve metadata in a CSV file using the get_metadata method.

Enter a dataset ID in the next code cell.

If you don't have a dataset ID, you can use the sample dataset ID already supplied in the next code cell.

In [ ]:
# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"

Next, import the tdm_client and pass dataset_id as an argument to its get_metadata method.

In [ ]:
# Import the `tdm_client`
import tdm_client

# Pull in our dataset's metadata CSV using `get_metadata`
dataset_metadata = tdm_client.get_metadata(dataset_id)
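
Since we pass the value returned by get_metadata straight to read_csv() in the next step, it should be the location of the downloaded CSV file; printing it is a quick sanity check:

In [ ]:
# `get_metadata` returns where the metadata CSV was saved;
# printing it lets us confirm the download location
print(dataset_metadata)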

We are ready to import pandas for our analysis and create a dataframe. We will use the read_csv() method to create our dataframe from the CSV file.

In [ ]:
# Import pandas 
import pandas as pd

# Create our dataframe
df = pd.read_csv(dataset_metadata)

We can confirm the size of our dataset using the len() function on our dataframe.

In [ ]:
original_document_count = len(df)
print('Total original documents:', original_document_count)

Now let's take a look at the data in our dataframe df. We will set pandas to show all columns using set_option(), then get a preview using head().

In [ ]:
# Set the pandas option to show all columns
pd.set_option("display.max_columns", None)

# Show the first five rows of our dataframe
df.head() 
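
Alongside the preview, it can be useful to see each column's data type and how many non-null values it contains before deciding what to keep. Pandas' built-in info() method provides that summary:

In [ ]:
# Summarize the dataframe: column names, non-null counts, and data types
df.info()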

Metadata Type by Column Name

Here are descriptions for the metadata types found in each column:

  • id: a unique item ID (in JSTOR, this is a stable URL)
  • title: the title for the item
  • isPartOf: the larger work that holds this title (for example, a journal title)
  • publicationYear: the year of publication
  • doi: the digital object identifier for an item
  • docType: the type of document (for example, article or book)
  • provider: the source or provider of the dataset
  • datePublished: the publication date in yyyy-mm-dd format
  • issueNumber: the issue number for a journal publication
  • volumeNumber: the volume number for a journal publication
  • url: a URL for the item and/or the item's metadata
  • creator: the author or authors of the item
  • publisher: the publisher of the item
  • language: the language or languages of the item (eng is the ISO 639 code for English)
  • pageStart: the first page number of the print version
  • pageEnd: the last page number of the print version
  • placeOfPublication: the city of the publisher
  • wordCount: the number of words in the item
  • pageCount: the number of print pages in the item
  • outputFormat: what data is available (unigrams, bigrams, trigrams, and/or full-text)

If there are any columns you would like to drop from your analysis, you can drop them with:

df = df.drop(['column_name1', 'column_name2', ...], axis=1)

In [ ]:
# Drop each of these named columns
df = df.drop(['outputFormat', 'pageEnd', 'pageStart', 'datePublished', 'language'], axis=1)

# Show the first five rows of our updated dataframe
df.head()

If you would like to know if a particular id is in the dataframe, you can use the in operator to return a boolean value (True or False).

In [ ]:
# Check if a particular item id is in the `id` column
'http://www.jstor.org/stable/2868641' in df.id.values
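
If the item is present and you want to see its full metadata, you can filter the dataframe on the id column with the same value:

In [ ]:
# Show the full metadata row for a particular item id
df[df.id == 'http://www.jstor.org/stable/2868641']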

Filtering Out Unwanted Texts

Now that we have filtered out unwanted metadata columns, we can begin filtering out any texts that may not match our research interests. Let's examine the first and last twenty rows of the dataframe to see if we can identify texts that we would like to remove. We are looking for patterns in the metadata that could help us remove many texts at once.

In [ ]:
# Preview the first twenty items in the dataframe
df.head(20) # Change 20 to view a greater or lesser number of rows
In [ ]:
# Preview the last twenty items in the dataframe
df.tail(20) # Change 20 to view a greater or lesser number of rows
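
A quick way to surface such patterns is to count how often each value appears in a column. Repeated generic titles (for example, front matter or book reviews) often signal paratext that can be filtered out in bulk; here is a short sketch using pandas' value_counts():

In [ ]:
# Count the twenty most common titles; repeated generic titles
# often indicate paratext such as front matter or reviews
df['title'].value_counts().head(20)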

Remove all rows without data for a particular column

For example, we may wish to remove any texts that do not have authors. (In the case of journals, this may be helpful for removing paratextual sections such as the table of contents, indices, etc.) The column of interest in this case is creator.

In [ ]:
# Remove all texts without an author
df = df.dropna(subset=['creator']) # Drop each row that has no value under 'creator'
In [ ]:
# Print the total original documents followed by the current number
print('Total original documents:', original_document_count)
print('Total current documents: ', len(df))

Remove rows based on the content of a particular column

We can also remove texts that have a particular value in a column. Here are a few examples.

In [ ]:
# Remove all items with a particular title
df = df[df.title != 'Review Article'] # Change `Review Article` to your desired title
In [ ]:
# Remove all items with 3000 words or fewer
df = df[df.wordCount > 3000] # Change `3000` to your desired number
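
The same comparison pattern works for any numeric column. For example, to keep only documents published in or after a particular year (the 1980 cutoff below is just an illustration, not part of the original filters):

In [ ]:
# Keep only items published in or after 1980
df = df[df.publicationYear >= 1980] # Change `1980` to your desired year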
In [ ]:
# Print the total original documents followed by the current number
print('Total original documents:', original_document_count)
print('Total current documents: ', len(df))

Take a final look at your dataframe to make sure the current texts fit your research goals. In the next step, we will save the IDs of your pre-processed dataset.

In [ ]:
# Preview the first 50 rows of your dataframe
df.head(50)

Saving a list of IDs to a CSV file

In [ ]:
# Write the column "id" to a CSV file called `pre-processed_###.csv` where ### is the `dataset_id`
df["id"].to_csv('data/pre-processed_' + dataset_id + '.csv')

Download the "pre-processed_###.csv" file (where ### is the dataset_id) for future analysis. You can use this file in combination with the dataset ID to automatically filter your texts and reduce the processing time of your analyses.
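
As a sketch of that later workflow, the saved file can be read back with pandas to recover the list of IDs. (This assumes a recent pandas version, where Series.to_csv() writes a header row by default, so the id column keeps its name.)

In [ ]:
# Read the saved CSV back in and recover the filtered ids as a list
filtered_ids = pd.read_csv('data/pre-processed_' + dataset_id + '.csv')['id'].tolist()
print('Filtered ids:', len(filtered_ids))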


Visualizing the Pre-Processed Data

In [ ]:
# Group the data by publication year and plot the count of ids as a bar chart
df.groupby(['publicationYear'])['id'].agg('count').plot.bar(title='Documents by year', figsize=(20, 5), fontsize=12); 

# Read more about Pandas dataframe plotting here: 
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html

And now let's look at the total page counts by year.

In [ ]:
# Group the data by publication year and plot the sum of page counts as a bar chart
df.groupby(['publicationYear'])['pageCount'].agg('sum').plot.bar(title='Pages by year', figsize=(20, 5), fontsize=12);
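
If you would like to keep either chart as an image, the plot call returns a matplotlib Axes object whose parent figure can be written to disk. A minimal sketch (the filename is just an example):

In [ ]:
# Capture the Axes returned by the plot, then save its parent figure
ax = df.groupby(['publicationYear'])['pageCount'].agg('sum').plot.bar(title='Pages by year', figsize=(20, 5), fontsize=12)
ax.get_figure().savefig('pages_by_year.png', bbox_inches='tight') # example filename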