Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.
____
Description of methods in this notebook: This notebook shows how to explore the metadata of your JSTOR and/or Portico dataset using Python. The following processes are described:

- Converting a JSON Lines (.jsonl) file into a Python list of dictionaries
- Using the `.get` method to retrieve bibliographic metadata
- Exploring corpus metadata with Pandas dataframes

Difficulty: Intermediate
Purpose: Learning (Optimized for explanation over code)
Knowledge Required:
Knowledge Recommended:
Completion time: 45 minutes
Data Format: JSTOR/Portico JSON Lines (.jsonl)
Libraries Used: tdm_client, json, pandas
We'll use the `tdm_client` library to automatically retrieve your dataset. We import the `tdm_client` library and call its `get_dataset` function. The `tdm_client` library contains functions for connecting to the JSTOR server containing the corpus dataset. To analyze your dataset, use the dataset ID provided when you created your dataset. A copy of your dataset ID was sent to your email when you created your corpus. It should look like a long series of characters separated by dashes. If you haven't created a dataset, feel free to use a sample dataset. Here's a list by discipline. Advanced users can also upload a dataset from their local machine.
# Importing your dataset with a dataset ID
import tdm_client

# Load the sample dataset, the full run of Shakespeare Quarterly from 1950-2013.
tdm_client.get_dataset("7e41317e-740f-e86a-4729-20dab492e925", "sampleJournalAnalysis") # Insert your dataset ID on this line
Before we can begin working with our dataset, we need to convert the JSON lines file format into Python so we can work with it. Remember that each line of our JSON lines file represents a single text, whether that is a journal article, book, or something else. We will create a Python list that contains every document. Within each list item for each document, we will use a Python dictionary of key/value pairs to store information related to that document. Read more about the dataset format.
Essentially we will have a list of documents numbered, from zero to the last document. Each list item then will be composed of a dictionary of key/value pairs that allows us to retrieve information from that particular document by number. The structure will look something like this:
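Here is a minimal sketch of that structure (the keys and values shown are placeholders for illustration, not actual records from the dataset):

# A sketch of the structure of `all_documents` (placeholder keys/values only)
all_documents = [
    {'id': '...', 'title': '...'},  # document 0 (the first item)
    {'id': '...', 'title': '...'},  # document 1
    # ...one dictionary of key/value pairs per document...
]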
For each item in our list, we will be able to use key/value pairs to get a value if we supply a key. We will call our Python list variable `all_documents` since it will contain all of the documents in our corpus.
# The tdm_client uses a default `file_name` of `sampleJournalAnalysis.jsonl`.
# Unless you have uploaded your own dataset, this code cell requires no modification before running.
file_name = 'sampleJournalAnalysis.jsonl'
# Import the json module
import json
# Create an empty new list variable named `all_documents`
all_documents = []
# Temporarily open the file `file_name` in the datasets/ folder
with open('./datasets/' + file_name) as dataset_file:
    # For each line in the dataset file
    for line in dataset_file:
        # Read the line into a Python dictionary using json.loads,
        # which converts the JSON key/value pairs to a Python dictionary
        document = json.loads(line)
        # Append a new list item to `all_documents` containing the dictionary we created
        all_documents.append(document)
Now all of our documents have been converted from our original JSON Lines file format (.jsonl) into a Python list variable named `all_documents`. Let's see what we can discover about our corpus with a few simple methods.

First, we can determine how many texts are in our dataset by using the `len()` function to get the size of `all_documents`.
len(all_documents)
We can also choose a single document and get bibliographic metadata for that item. First, we select a document from our list `all_documents` by its index. (In computer code, 0 is the first item, 1 is the second item, 2 is the third item, etc.) To select the first item, we write `all_documents[0]`. We can then use the `.get` method to retrieve the title for that item.
# Define a new dictionary variable `chosen_document` that is equal to a single item in our `all_documents` list
chosen_document = all_documents[0] # Select the first document in our list
chosen_document.get('title') # Get the corresponding value for the key 'title'
We can also use the `.get` method to discover additional bibliographic metadata. Here are the most significant bibliographic metadata items found with a JSTOR item:

- `title` returns the title
- `creator` returns the authors in a Python list
- `isPartOf` returns the journal title
- `datePublished` returns the publication date
- `id` returns the stable URL for a JSTOR item
- `identifier` returns a Python list of dictionaries containing the ISSN, OCLC, and DOI numbers
- `volumeNumber` returns the journal volume number
- `pageCount` returns the number of pages in the print article
- `pagination` returns the page number range of the print article
- `pageStart` returns the first print page
- `pageEnd` returns the last print page
- `wordCount` returns the number of words in the article
- `docType` returns the type of document, usually `article` for journal articles
- `url` returns the stable URL for the document
- `provider` returns the source of the data, usually `jstor` for JSTOR articles
- `language` returns the language the article is written in

Let's try all these on our `chosen_document`.
print("Title: " + chosen_document.get('title'))
print("Authors: ", end='')
print(chosen_document.get('creator'))
print("Journal: " + chosen_document.get('isPartOf'))
print("Publication Date: " + chosen_document.get('datePublished'))
#print("Publisher: " + chosen_document.get('publisher'))
print("ID: " + chosen_document.get('id'))
print("ISSN, OCLC, DOI: ", end='')
print(chosen_document.get('identifier'))
#print("Volume Number: " + chosen_document.get('volumeNumber'))
print("Number of Pages: " + str(chosen_document.get('pageCount')))
print("Print Pagination: " + str(chosen_document.get('pagination')))
print("First Page: " + str(chosen_document.get('pageStart')))
print("Last Page: " + str(chosen_document.get('pageEnd')))
print("Number of words: " + str(chosen_document.get('wordCount')))
print("Document Type: " + chosen_document.get('docType'))
print("URL: " + chosen_document.get('url'))
print("Provider: " + chosen_document.get('provider'))
print("Language: " + str(chosen_document.get('language')))
We can see every Python dictionary key in the metadata by using the `.keys()` method. We could use this in conjunction with the `print()` function, but we will use the `list()` function here to make it a little neater for reading purposes.
#print(chosen_document.keys()) # Remove the # in front of print to run this line of code
list(chosen_document.keys()) # Create a list of every Python dictionary key within `chosen_document`
Of course, we could also list all the Python dictionary values, but the output will be quite long since it includes the word counts for every word that is in the article. (In fact, it includes the count for every unique string in the article. We'll address the distinction in the word frequencies notebooks.) The word counts are found within `unigramCount`, which we'll address in the word frequencies notebook.

# Remove the # from the next line to display all values in chosen_document
#list(chosen_document.values())
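Since `unigramCount` holds most of that long output, we can peek at just a few of its entries instead. This is a small sketch; it assumes `unigramCount` is a dictionary of word/count pairs, as described above.

# Peek at the first ten word/count pairs in `unigramCount`
unigrams = chosen_document.get('unigramCount', {})
list(unigrams.items())[0:10]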
Let's return to our larger corpus `all_documents` to do some exploratory analysis. What if we wanted to check if a particular item was in the corpus?

Assuming the item is from JSTOR, we could seek out any journal article on jstor.org. The article description page will feature a stable URL.

We already saw above that the stable URL is stored under both the `id` and `url` keys, so we can check our whole corpus for a particular JSTOR article if we know the stable URL. (If we are looking at a Portico item, it will have an `id` that starts with `ark://` and a `url` that lists a DOI.) For example, the article in question here has a stable URL of: https://www.jstor.org/stable/2871420
We can check whether the item above is in `all_documents` with the `in` or `not in` operators. First though, we need a list of all of the URLs in our corpus. We'll create a variable `list_of_urls` to hold all these values. Then we can check to see if our stable URL (http://www.jstor.org/stable/2871420) is in that list.
# We create a blank list that will contain all of the urls in our dataset
list_of_urls = []
# For every document in our dataset
for document in all_documents:
    # Create a url_value variable to hold the URL for that document
    url_value = document.get('url')
    # Append or add that URL to our Python list `list_of_urls`
    list_of_urls.append(url_value)

# Show the first five items in our list of urls
list_of_urls[0:5]
Now that we have a list of all the URLs in our corpus in the `list_of_urls` variable, let's use the `in` operator to discover whether our text is in the corpus. If the article is in our dataset, we will receive `True`. If the article is not in our dataset, we will receive `False`.

*Note that the stable URL from jstor.org uses a secure address starting with "https://". Our dictionary values, however, use a standard address beginning with "http://". You'll need to remove the "s" to run this test since our `list_of_urls` entries are not secure addresses.
'http://www.jstor.org/stable/2871420' in list_of_urls
Now we have a good idea of what metadata is in our corpus and how we might retrieve it. We were able to use the `in` operator above to check if a particular article was in the corpus using the URL. Of course, we could also check to see if a particular journal, author, publisher, or DOI was in our corpus using a similar method, as in the sketch below.
We'll finish this notebook by taking a big-picture look at the corpus. What large-scale patterns exist in this corpus over the decades? We'll use Pandas to help with our analysis. If you would like to learn more about Pandas, we recommend the Python Pandas tutorial at learndatasci.com. For now, we will create a couple of visualizations for demonstration purposes.
# Import pandas and allow us to call it with the abbreviation pd
import pandas as pd
Now we can turn our Python list `all_documents` into a Pandas dataframe. This will enable us to manipulate and view our data as a table or a graph. We will call our dataframe `df`.
df = pd.DataFrame(all_documents)
Let's see what our corpus looks like in table form. We can use the `.head()` method to show us the first five rows of our data as a table.
df.head()
Now we can see the first part of all the metadata we have been discussing in table form. Much clearer! There's a lot of metadata here, so you may need to scroll right to see all the columns. Long items in this view are abbreviated with `...` to signify that they continue past what is shown. To see all the columns, we can set an option in Pandas.
# To show all columns
pd.set_option("display.max_columns", None) # Show all columns
# We could do the same with "display.max_rows" but the length would be too long to effectively scroll through.
df.head() # Show the first five rows of our DataFrame
Let's use our new Pandas dataframe to learn a little more about our corpus. First, we may not be interested in every column, so let's simplify our dataframe by dropping columns that may not be useful to us. We'll drop:

- `identifier`
- `outputFormat`
- `sourceCategory`
- `pageEnd`
- `pageStart`
- `pagination`
- `datePublished`
- `language`

Add any others you might like to drop.
df = df.drop(['identifier', 'outputFormat', 'sourceCategory', 'pageEnd', 'pageStart', 'pagination', 'datePublished', 'language'], axis=1) # Drop each of these named columns
df #display df
We will also drop any rows where the creator/author is blank.
df = df.dropna(subset=['creator']) # Drop each row that has no value under 'creator'
df #display df
And finally clean up some of the data based on parameters relevant to our goals and dataset.
# Cleaning up the dataset by removing entries. Customize these choices to your goals and dataset.
# Examples for cleaning up the data based on the values found under 'title'
df = df[df.title != 'Review Article'] # Remove articles with title "Review Article"
df = df[df.title != 'Front Matter'] # Remove articles with title "Front Matter"
df = df[df.title != 'Back Matter'] # Remove articles with title "Back Matter"
# Examples for cleaning up the data based on values found under 'wordCount'
df = df[df.wordCount > 3000] # Keep only articles with more than 3000 words; adjust or remove
df #display df
Notice above there is a column labeled `publicationYear`. Let's figure out the full year range of our corpus. We can do this by finding the minimum and maximum of `publicationYear` using the `.min()` and `.max()` methods. We'll create a variable to store each and then print them out.
min_year = df['publicationYear'].min() # Create variable `min_year` that is the minimum value from `publicationYear`
max_year = df['publicationYear'].max() # Create variable `max_year` that is the maximum value from `publicationYear`
print(str(min_year) + ' to ' + str(max_year)) # Print a string showing "min_year to max_year"
Now we know the full year range of our dataset. Let's see if we can identify any trends across the decades.

Since `decade` isn't a column in our Pandas dataframe, we'll need to create it. First though, we'll need to consider how to turn a date into a decade. Let's try an example. To translate a year (1925) to a decade (1920), we need to subtract the final digit so it becomes a zero. Basically, we need a way to discover the final digit of each date and then subtract it so the final digit becomes a zero. Something like:

1925 - 5 = 1920
We can find the value of the final digit in any particular case by using modulo (which provides the remainder of a division). If we use `% 10` on a date, it should give us a remainder that is the ones digit.
# What is the remainder of 1925 divided by 10?
1925 % 10
The result will give us our ones digit. Now we subtract this calculation from our original date. The result gives us the decade number we are looking for.
1925 - (1925 % 10) # Returns 1920
We can translate this example to the whole dataframe using the following code. We'll create a new function `add_decade` that takes a `value` from the `publicationYear` column and translates it into a decade, which we'll store in a new `decade` column.
# Create a function `add_decade` that takes an argument `value`
def add_decade(value):
    # Create a variable `yr` that turns value into an integer
    yr = int(value)
    # Create a variable `decade` that subtracts the ones digit using the modulo (%) operator
    decade = yr - (yr % 10)
    # Return the variable `decade` for the function `add_decade`
    return decade

# Create a new column `decade` in our dataframe that is equal to
# the column `publicationYear` after applying
# the add_decade function we created
df['decade'] = df['publicationYear'].apply(add_decade)
To see the new `decade` column we created in our data, let's use the `df.head()` method again to see how it changed the first five rows of the dataframe. To see the `decade` column, you will need to scroll all the way to the right.
df.head()
# Group the data by decade and plot the count of document ids as a bar chart
# There is a weird bug where this cell may need to be run twice for the plot to show
df.groupby(['decade'])['id'].agg('count').plot.bar(title='Documents by decade', figsize=(20, 5), fontsize=12);
# Read more about Pandas dataframe plotting here:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html
And now let's look at the total page counts by decade.

# Group the data by decade and plot the sum of the page counts as a bar chart
df.groupby(['decade'])['pageCount'].agg('sum').plot.bar(title='Pages by decade', figsize=(20, 5), fontsize=12);