(rectangling)=
In this chapter, you'll learn about nested data, working with data that is fundamentally tree-like and (often) converting it into a rectangular data frames made up of rows and columns. This is important because nested data is surprisingly common, especially when working with data that comes from a web API (such as you will see in {ref}webscraping-and-apis
).
To learn about rectangling, you'll first learn about lists, dictionaries, and the JSON format, as these are the data structures that are most often used to work with hierarchical data in Python. Then you'll learn about some functions that can help you turn hierarchical data into 'tidy' data in columns and rows. We'll then show you a few case studies, applying these simple function multiple times to solve real complex problems.
# remove cell
import matplotlib_inline.backend_inline
import matplotlib.pyplot as plt
# Plot settings
plt.style.use("https://github.com/aeturrell/python4DS/raw/main/plot_style.txt")
matplotlib_inline.backend_inline.set_matplotlib_formats("svg")
This chapter will use the pandas data analysis package.
Lists are a really useful way to work with lots of data at once. They're defined with square brackets, with entries separated by commas.
list_example = [10, 1.23, "like this", True, None]
print(list_example)
You can also construct them by appending entries:
list_example.append("one more entry")
print(list_example)
And you can access earlier entries using an index, which begins at 0 and ends at one less than the length of the list (this is the convention in many programming languages). For instance, to print specific entries at the start, using 0
, and end, using -1
:
print(list_example[0])
print(list_example[-1])
{admonition}
How might you access the penultimate entry in a list object if you didn't know how many elements it had?
As well as accessing positions in lists using indexing, you can use slices on lists. This uses the colon character, :
, to stand in for 'from the beginning' or 'until the end' (when only appearing once). For instance, to print just the last two entries, we would use the index -2:
to mean from the second-to-last onwards. Here are two distinct examples: getting the first three and last three entries to be successively printed:
print(list_example[:3])
print(list_example[-3:])
Slicing can be even more elaborate than that because we can jump entries using a second colon. Here's a full example that begins at the second entry (remember the index starts at 0), runs up until the second-to-last entry (exclusive), and jumps every other entry inbetween (range just produces a list of integers from the value to one less than the last):
list_of_numbers = list(range(1, 11))
start = 1
stop = -1
step = 2
print(list_of_numbers[start:stop:step])
A handy trick is that you can print a reversed list entirely using double colons:
print(list_of_numbers[::-1])
{admonition}
Slice the `list_example` from earlier to get only the first five entries.
What's amazing about lists is that they can hold any type, including other lists! Here's a valid example of a list that's got a lot going on:
wacky_list = [
3.1415,
16,
["five", 4, 3],
(91, 93, 90),
"Hello World!",
True,
None,
{"key": "value", "key2": "value2"},
]
wacky_list
Because lists can contain more lists (and so on), they can be used to put hierachical data in. Let's take a look at an example:
multilayer_list = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
multilayer_list
Now, say we wanted to reduce this to a single list. We can do it with a list comprehension:
[x for little_list in multilayer_list for x in little_list]
What we're saying here is take all of the values of every little list and put them into a single list.
Occassionally, you'll have data in lists that you wish to turn into a data frame. For example, perhaps you have a list of lists like this:
list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
You can pass this straight into a constructor for a data frame as the data=
keyword argument (adding in other info as necessary). Note that this is four lists of three entries, so the inner loop has entries in 0 to 2... it is this inner loop that will be used as the rows of any data frame with the number of entries in each inner list equal to the number of columns.
import pandas as pd
pd.DataFrame(data=list_of_lists, columns=["a", "b", "c"])
There's one more trick to show you: explode. This is useful when you have data that has more than one level of list depth. Let's say you read in some data with a complex hierarchical structure like this:
df = pd.DataFrame(
{
"alpha": [[0, 1, 2], "foo", [], [3, 4]],
"beta": 1,
"gamma": [["a", "b", "c"], pd.NA, [], ["d", "e"]],
}
)
df
We have multiple rows and columns that contain lists. In some situations, it's fine to have a list in a column but here it's probably not as it's mixed in with other types of data. We can use explode()
to split out the columns further length-wise
df.explode("alpha")
No discussion of programming and hierarchical data objects would be complete without mentioning JSON (Java Script Object Notation)! You will run into this text data format all of the time when dealing with data from the web, especially from APIs (automated web-based data services). JSON represents objects as name/value pairs, just like a Python dictionary (these have the format {key1: value1, key2: value2}
).
The table below compares the different data types found in Python and JSON.
JSON OBJECT | PYTHON OBJECT |
---|---|
object | dict |
array | list |
string | str |
null | None |
number (int) | int |
number (real) | float |
true | True |
false | False |
There are typically two operations you may want to do with JSON data: 1) turn JSON data in a Python object (eg JSON to Python dictionary) or vice versa (known as deserialisation and serialisation respectively); and 2) converting a deserialised object into a different kind of Python object.
Let's look at each in turn.
import requests
url = "https://api.ons.gov.uk/timeseries/MGSX/dataset/LMS/data"
# Get the data from the ONS API:
json_data = requests.get(url).json()
# There's too much data to show it all here, so just show the first 2 entries
dict(list(json_data.items())[:2])
There's a lot here, and it's not terribly easy to read! Let's check what type we got:
type(json_data)
As expected, the JSON data has automatically been read in as a dictionary—but be wary that the fields have been read in as text rather than numbers, datetimes, and other specific data types.
For this exercise, you'll need to download the JSON file 'cakes.json' from the data folder of the repository associated with this book and save it in a sub-folder called "data". We can take a peek at the data using the terminal (which is what the preceeding exclamation mark means):
json_string = """
{
"food": "doughnut",
"good_with": ["coffee", "tea"],
"flavour": null,
"toppings": [{"id": 0, "type": "glazed"},
{"id": 1, "type": "sugar"}]
}
"""
We use the built-in json library to read this into Python (you could also use a file path here—more on how in a moment):
import json
result = json.loads(json_string)
result
Note that not everything is the same in going from JSON text to a Python dictionary: JSON uses null
rather than None
, won't accept trailing commas at the end of lists, and has basic types that are lists, strings (and all keys must be strings), numbers, booleans, and nulls. Let's now see how to write a Python dictionary back to a JSON, perhaps for writing to file:
json_stream = json.dumps(result)
json_stream
To write to a file, you would use the pattern:
with open('data/json_data_output.json', 'w') as outfile:
json.dump(json_stream, outfile)
To read a file from disk, for example "data/json_data_output.json", it's
json.load(open("data/json_data_output.json"))
pandas has lots of options for turning JSON or dictionary data into a data frame. You do need to think a little bit about the structure of the data underneath though:
import pandas as pd
pd.DataFrame(result["toppings"], columns=["id", "type"])
The web-scraped data we downloaded earlier had a more complicated structure, but pandas has a json_normalize()
function that can cope with this. For example, with the following data, there are many missing entries but json_normalize()
can still parse it into a Data Frame.
data = [
{"id": 1, "name": {"first": "Coleen", "last": "Volk"}},
{"name": {"given": "Mark", "family": "Regner"}},
{"id": 2, "name": "Faye Raker"},
]
pd.json_normalize(data)
And we can control the level that properties like 'name' are split out to as well (you can check out more options over at the pandas documentation)
pd.json_normalize(data, max_level=0)
As well as the JSON normalise function, pandas has a from_dict()
method to work with simpler dictionary objects.