Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.
___
Working with Dataset Files
Description: This notebook describes how to read and write text, CSV, and JSON files using Python. Additionally, it explains how the tdm_client can help users easily load and analyze their datasets, including:

* Using the tdm_client to read in metadata
* Using the tdm_client to read in data
Use Case: For Learners (Detailed explanation, not ideal for researchers)
Difficulty: Intermediate
Completion Time: 60 minutes
Knowledge Required:
Knowledge Recommended: None
Data Format: Text (.txt), CSV (.csv), JSON (.json)
Libraries Used:

* pandas to read and write CSV files
* json to read and write JSON files
* tdm_client to retrieve and read data

Research Pipeline: None
___
Working with files is an essential part of Python programming. When we execute code in Python, we manipulate data through the use of variables. When the program is closed, however, any data stored in those variables is erased. To save the information stored in variables, we must learn how to write it to a file.
At the same time, we may have notebooks for applying specific analyses, but we need a way to bring data into the notebook for analysis. Otherwise, we would have to type all the data into the program ourselves! Both reading in data from files and writing out data to files are important skills for data science and the digital humanities.
This section describes how to work with three kinds of common data files in Python:

* Plain text files (.txt)
* Comma-separated value files (.csv)
* JavaScript Object Notation files (.json)

Each of these filetypes is in wide use in data science, digital humanities, and general programming.
A plain text file is one of the simplest kinds of computer files. Plain text files can be opened with a text editor like Notepad (Windows 10) or TextEdit (OS X). The file can contain only basic textual characters such as letters, numbers, spaces, and line breaks. Plain text files do not contain styling such as heading sizes, italics, bold, or specialized fonts. (To include styling in a text file, writers may use other file formats such as rich text format (.rtf) or markdown (.md).)
Plain text files (.txt) can be easily viewed and modified by humans by changing the text within. This is an important distinction from binary files such as images (.jpg), archives (.gz), audio (.wav), or video (.mp4). If a binary file is opened with a text editor, the content will be largely unreadable.
A comma-separated value file is also a text file that can easily be modified with a text editor. A CSV file is generally used to store data that fits in a series or table (like a list or spreadsheet). A spreadsheet application (like Microsoft Excel or Google Sheets) will allow you to view and edit CSV data in the form of a table.
Each row of a CSV file represents a single row of a table. The values in a CSV are separated by commas (with no space between), but other delimiters can be chosen such as a tab or pipe (|). A tab-separated value file is called a TSV file (.tsv). Using tabs or pipes may be preferable if the data being stored contains commas (since this could make it confusing whether a comma is part of a single entry or a delimiter between entries).
Username,Login email,Identifier,First name,Last name
booker12,rachel@example.com,9012,Rachel,Booker
grey07,,2070,Laura,Grey
johnson81,,4081,Craig,Johnson
jenkins46,mary@example.com,9346,Mary,Jenkins
smith79,jamie@example.com,5079,Jamie,Smith
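Because each row is just text separated by commas, a single row can be pulled apart with plain Python string methods. This toy illustration (our own, using the first data row above) previews the idea; the pandas section later in this lesson shows a more robust approach:

# Split one CSV row into its values using the comma delimiter
row = 'booker12,rachel@example.com,9012,Rachel,Booker'
username, email, identifier, first_name, last_name = row.split(',')
print(first_name, last_name)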
A JavaScript Object Notation file is also a text file that can be modified with a text editor. A JSON file stores data in key/value pairs, very similar to a Python dictionary. One of the key benefits of JSON is its compactness, which makes it ideal for exchanging data between web browsers and servers.
While smaller JSON files can be opened in a text editor, larger files can be difficult to read. Viewing and editing JSON is easier in specialized editors, many of which are available free online.
A JSON file has a nested structure, where smaller concepts are grouped under larger ones. Like extensible markup language (.xml), a JSON file can be checked to determine that it is well-formed (follows the proper syntax for JSON) and/or valid (follows a structure defined in a specialized document, called a schema).
{
"firstName": "Julia",
"lastName": "Smith",
"gender": "woman",
"age": 57,
"address": {
"streetAddress": "11434",
"city": "Detroit",
"state": "Mi",
"postalCode": "48202"
},
"phoneNumbers": [
{ "type": "home", "number": "7383627627" }
]
}
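As a quick illustration of well-formedness, Python's built-in json library (introduced later in this lesson) refuses to parse text that breaks JSON's syntax rules. This small example is our own, not part of the sample files:

# json.loads raises a JSONDecodeError for text
# that is not well-formed JSON
import json
bad_json = '{"firstName": "Julia",}'  # trailing comma is not allowed
try:
    json.loads(bad_json)
except json.JSONDecodeError as err:
    print('Not well-formed JSON:', err)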
Before we can read or write to a text file, we must open the file. Normally, when we open a file, a new window appears where we can see the file's contents. In Python, opening a file simply means creating a file object that refers to the particular file. Once the file has been opened, we can read from or write to it. Finally, we must close the file. Here are the three steps:

1. Open the file
2. Read or write to the file
3. Close the file
Let's practice on sample.txt, a sample text file.
# Open the text file `sample.txt` creating
# a file object called `f`
f = open('data/sample.txt', 'r')
We have created a file object called f. The first argument ('data/sample.txt') is a string containing the file path. You can find sample.txt in the data folder of this lesson. If your file was called reports.txt and stored in the same folder, you would replace that argument with 'data/reports.txt'. The second argument ('r') determines that we are opening the file in "read" mode, where the file can be read but not modified. There are three main modes that can be specified:
Argument | Mode Name | Description |
---|---|---|
'r' | read | Reads file but no writing allowed (protects file from modification) |
'w' | write | Writes to file, overwriting any information in the file (saves over the current version of the file) |
'a' | append | Appends (or adds) information to the end of the file (new information is added below old information) |
# Create a variable called `file_contents`
# that will hold the result of using the
# .read() method on our file object
file_contents = f.read()
print(file_contents)
When we are finished with the file object, we must close it using the .close() method. It is very important to always close a file; otherwise your program may crash or create memory problems.
# Close the file by using the .close() method
# on the file object
f.close()
If a file is very large, we may want to process a single line at a time so as not to fill all of the available computer memory; in Python, this is done by looping over the file object itself. The related .readlines() method still reads the entire file, but it returns the contents as a list of lines rather than a single string.
# Open the file sample.txt in read mode
# creating the file object `f`
f = open('data/sample.txt', 'r')
file_contents = f.readlines()
print(file_contents)
With the .read() method, we read in the whole text file as a single string. The .readlines() method gives us a Python list, where each item is a single string representing a single line of the text document. You may also notice that each line ends with \n, which represents a line break in a string. If we print the string, the line break is visible in our output.
# Print the first item in the file_contents list
# Note the \n turns into a line break
print(file_contents[0])
f.close()
# Opening a file in append mode
# and creating a new file object
# called `f`
f = open('data/sample.txt', 'a')
Now we can use the .write() method to append a string to the file.
# Appending an eleventh line to the file
f.write('\nThis is the eleventh line')
f.close()
Can you read the file back in to see whether the .write() was successful?
# Open the file in read mode
# creating a file object called `f`
f = open('data/sample.txt', 'r')
# Use the .read() method on the file object
# Store the result in a variable `file_contents`
file_contents = f.read()
# Print the contents
print(file_contents)
# Close the file
f.close()
Opening a file in write mode is useful in two scenarios:

* Creating a new file (the file is created if it does not already exist)
* Overwriting an existing file (erasing the data currently in it)
Here is an example:
# Creating a new file in write mode
f = open('data/new_sample.txt', 'w')
# Define a string variable to add to the new file
string = 'Here is some data\nWith a second line'
# Using write method on the file object
contents = f.write(string)
# Close file object
f.close()
Using with open

The with open technique is commonly used in Python because it has two significant advantages:

* The syntax is shorter and easier to read
* The file is closed automatically

The basic form resembles a flow control statement, ending in a colon and then executing an indented block of code. After the block executes, the file is closed automatically.
with open('data/sample.txt', 'r') as f:
print(f.read())
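The same pattern works for the other file modes. As a small sketch, here we append a line with with open and then read the file back; no .close() calls are needed:

# Append a line; the file closes automatically
with open('data/sample.txt', 'a') as f:
    f.write('\nAnother appended line')

# Read the file back to confirm the append
with open('data/sample.txt', 'r') as f:
    print(f.read())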
CSV file data can be easily opened, read, and written using the pandas library. (For large CSV files (>500 MB), you may wish to use the csv library to read in a single row at a time to reduce the memory footprint.) Pandas is flexible for working with tabular data, and the process for importing and exporting to CSV is simple.
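For the large-file case mentioned above, a minimal sketch with the built-in csv library might look like this, reading one row at a time instead of loading the whole file:

# Read a CSV one row at a time with the csv library
import csv
with open('data/sample.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # the first row holds column names
    for row in reader:
        print(row)  # each row is a list of strings

For typical files, the pandas approach shown next is simpler.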
# Import pandas
import pandas as pd
# Create our dataframe
df = pd.read_csv('data/sample.csv')
# Display the dataframe
print(df)
After you've made any necessary changes in Pandas, write the dataframe back to the CSV file. (Remember to always back up your data before writing over the file.)
# Write data to new file
# Keeping the Header but removing the index
df.to_csv('data/new_sample.csv', header=True, index=False)
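To confirm the export worked, the new file can be read back in:

# Read the new CSV back in to verify the write
new_df = pd.read_csv('data/new_sample.csv')
print(new_df)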
JSON files use a key/value structure very similar to a Python dictionary. We will start with a Python dictionary called py_dict and then write the data to a JSON file using the json library.
# Defining sample data in a Python dictionary
py_dict = {
"firstName": "Julia",
"lastName": "Smith",
"gender": "woman",
"age": 57,
"address": {
"streetAddress": "11434",
"city": "Detroit",
"state": "Mi",
"postalCode": "48202"
},
"phoneNumbers": [
{ "type": "home", "number": "7383627627" }
]
}
To write our dictionary to a JSON file, we will use the with open technique we learned that automatically closes file objects. We also need the json library to help dump our dictionary data into the file object. The json.dump function works a little differently than the write method we saw with text files. We need to specify two arguments:

* The data to be written (here, py_dict)
* The file object to write into (here, f)
# Open/create sample.json in write mode
# as the file object `f`. The data in py_dict
# is dumped into `f` and then `f` is closed
import json
with open('data/sample.json', 'w') as f:
json.dump(py_dict, f)
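If you open data/sample.json in a text editor, the data will be on a single line. For more human-readable output, json.dump accepts an optional indent argument:

# Write the same data with indentation for readability
with open('data/sample.json', 'w') as f:
    json.dump(py_dict, f, indent=4)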
To read data in from a JSON file, we can use the json.load function on our file object. Here we load all the content into a variable called contents. We can then print values based on particular keys.
with open('data/sample.json') as f:
contents = json.load(f)
print('First Name: ' + contents['firstName'])
print('Last Name: '+ contents['lastName'])
print('Age: ' + str(contents['age']))
print('Phone Number: ', contents['phoneNumbers'][0]['number'])
The tdm_client

The tdm_client helps retrieve a given dataset and/or its associated metadata. The metadata is supplied in a CSV file and the full dataset is supplied in a compressed JSON Lines file (.jsonl.gz). For any analysis focused on metadata pre-processing, we recommend users start with the CSV file since it is both easier and faster to view, parse, and manipulate.

All of the textual data and metadata is available inside of the JSON Lines files, but we have chosen to offer the metadata CSV for two primary reasons:

* The CSV file is much smaller and faster to download than the full dataset
* The CSV file is easier to view and parse, whether with a spreadsheet application or pandas

More information is available, including the metadata categories, in the FAQ "What is the data file format?".
Retrieving data (tdm_client methods)

By passing the tdm_client a dataset ID (here called dataset_id), we can automatically download the metadata CSV file or the full JSON Lines dataset created by the dataset builder:

* Use the .get_metadata() method to retrieve the metadata CSV file
* Use the .get_dataset() method to retrieve the full JSON Lines data file

Code | Result |
---|---|
f = tdm_client.get_metadata(dataset_id) | Automatically retrieves a metadata CSV file and creates a file object f |
f = tdm_client.get_dataset(dataset_id) | Automatically retrieves a compressed JSON Lines dataset file (jsonl.gz) and creates a file object f |
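For example, the metadata CSV can be pulled straight into pandas. This is a sketch rather than a runnable recipe: the dataset ID below is a placeholder, and the code assumes the tdm_client environment described above:

# Retrieve the metadata CSV for a dataset
# and load it into a pandas dataframe
import tdm_client
import pandas as pd

dataset_id = 'your-dataset-id'  # placeholder; use your own dataset ID
metadata_file = tdm_client.get_metadata(dataset_id)
df = pd.read_csv(metadata_file)
print(df.head())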
The JSON Lines file will be downloaded in a compressed gzip format (jsonl.gz). We can iterate over each document in the corpus by using the dataset_reader() method.

The dataset_reader() method will read in data from the compressed JSON dataset file object. By keeping the data in a compressed format and reading in a single line at a time, we reduce the processing time and memory use. These can be substantial for large datasets; even a modest dataset (~5000 files) can be over 1 GB in size uncompressed.

The dataset_reader() essentially unzips one row of the dataset at a time. Each row constitutes all the metadata and data available for a single document. Here is what that looks like in the actual tdm_client:
# Inside dataset_reader: open the gzipped file and
# yield one decoded JSON document per line
with gzip.open(file_path, 'rb') as input_file:
    for row in input_file:
        yield json.loads(row)
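To make the JSON Lines format concrete, here is a small, self-contained sketch of our own (not part of the tdm_client) that writes and then reads a two-document .jsonl.gz file in the same way:

import gzip
import json

docs = [{'id': 'doc1', 'wordCount': 100},
        {'id': 'doc2', 'wordCount': 250}]

# Write one JSON object per line to a compressed file
with gzip.open('data/tiny.jsonl.gz', 'wt') as f:
    for doc in docs:
        f.write(json.dumps(doc) + '\n')

# Read it back one line at a time, as dataset_reader does
with gzip.open('data/tiny.jsonl.gz', 'rt') as f:
    for line in f:
        print(json.loads(line))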
In most cases, users will want to iterate over every document/row in the JSON Lines file. In practice, that looks like:
Code | Result |
---|---|
for document in tdm_client.dataset_reader(f): | Iterates over each document in file object f |
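Putting the pieces together, a typical workflow is sketched below; the dataset ID is a placeholder, and here we simply count the documents:

# Retrieve a dataset and iterate over its documents
import tdm_client

dataset_id = 'your-dataset-id'  # placeholder; use your own dataset ID
dataset_file = tdm_client.get_dataset(dataset_id)

count = 0
for document in tdm_client.dataset_reader(dataset_file):
    count += 1  # each `document` is a Python dictionary
print('Documents in dataset:', count)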