(rectangling)=

Nested Data¶

Introduction¶

In this chapter, you'll learn about nested data, working with data that is fundamentally tree-like and (often) converting it into a rectangular data frames made up of rows and columns. This is important because nested data is surprisingly common, especially when working with data that comes from a web API (such as you will see in {ref}webscraping-and-apis).

To learn about rectangling, you'll first learn about lists, dictionaries, and the JSON format, as these are the data structures that are most often used to work with hierarchical data in Python. Then you'll learn about some functions that can help you turn hierarchical data into 'tidy' data in columns and rows. We'll then show you a few case studies, applying these simple function multiple times to solve real complex problems.

In [ ]:

# remove cell
import matplotlib_inline.backend_inline
import matplotlib.pyplot as plt

# Plot settings
plt.style.use("https://github.com/aeturrell/python4DS/raw/main/plot_style.txt")
matplotlib_inline.backend_inline.set_matplotlib_formats("svg")

Prerequisites¶

This chapter will use the pandas data analysis package.

Lists¶

Lists are a really useful way to work with lots of data at once. They're defined with square brackets, with entries separated by commas.

In [ ]:

list_example = [10, 1.23, "like this", True, None]
print(list_example)

You can also construct them by appending entries:

In [ ]:

list_example.append("one more entry")
print(list_example)

And you can access earlier entries using an index, which begins at 0 and ends at one less than the length of the list (this is the convention in many programming languages). For instance, to print specific entries at the start, using 0, and end, using -1:

In [ ]:

print(list_example[0])
print(list_example[-1])

{admonition}
How might you access the penultimate entry in a list object if you didn't know how many elements it had?

As well as accessing positions in lists using indexing, you can use slices on lists. This uses the colon character, :, to stand in for 'from the beginning' or 'until the end' (when only appearing once). For instance, to print just the last two entries, we would use the index -2: to mean from the second-to-last onwards. Here are two distinct examples: getting the first three and last three entries to be successively printed:

In [ ]:

print(list_example[:3])
print(list_example[-3:])

Slicing can be even more elaborate than that because we can jump entries using a second colon. Here's a full example that begins at the second entry (remember the index starts at 0), runs up until the second-to-last entry (exclusive), and jumps every other entry inbetween (range just produces a list of integers from the value to one less than the last):

In [ ]:

list_of_numbers = list(range(1, 11))
start = 1
stop = -1
step = 2
print(list_of_numbers[start:stop:step])

A handy trick is that you can print a reversed list entirely using double colons:

In [ ]:

print(list_of_numbers[::-1])

{admonition}
Slice the `list_example` from earlier to get only the first five entries.

What's amazing about lists is that they can hold any type, including other lists! Here's a valid example of a list that's got a lot going on:

In [ ]:

wacky_list = [
    3.1415,
    16,
    ["five", 4, 3],
    (91, 93, 90),
    "Hello World!",
    True,
    None,
    {"key": "value", "key2": "value2"},
]
wacky_list

Hierarchical Data in Lists¶

Because lists can contain more lists (and so on), they can be used to put hierachical data in. Let's take a look at an example:

In [ ]:

multilayer_list = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
multilayer_list

Now, say we wanted to reduce this to a single list. We can do it with a list comprehension:

In [ ]:

[x for little_list in multilayer_list for x in little_list]

What we're saying here is take all of the values of every little list and put them into a single list.

From Lists to Data Frames¶

Occassionally, you'll have data in lists that you wish to turn into a data frame. For example, perhaps you have a list of lists like this:

In [ ]:

list_of_lists = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

You can pass this straight into a constructor for a data frame as the data= keyword argument (adding in other info as necessary). Note that this is four lists of three entries, so the inner loop has entries in 0 to 2... it is this inner loop that will be used as the rows of any data frame with the number of entries in each inner list equal to the number of columns.

In [ ]:

import pandas as pd

pd.DataFrame(data=list_of_lists, columns=["a", "b", "c"])

There's one more trick to show you: explode. This is useful when you have data that has more than one level of list depth. Let's say you read in some data with a complex hierarchical structure like this:

In [ ]:

df = pd.DataFrame(
    {
        "alpha": [[0, 1, 2], "foo", [], [3, 4]],
        "beta": 1,
        "gamma": [["a", "b", "c"], pd.NA, [], ["d", "e"]],
    }
)
df

We have multiple rows and columns that contain lists. In some situations, it's fine to have a list in a column but here it's probably not as it's mixed in with other types of data. We can use explode() to split out the columns further length-wise

In [ ]:

df.explode("alpha")

JSON (Java Script Object Notation)¶

No discussion of programming and hierarchical data objects would be complete without mentioning JSON (Java Script Object Notation)! You will run into this text data format all of the time when dealing with data from the web, especially from APIs (automated web-based data services). JSON represents objects as name/value pairs, just like a Python dictionary (these have the format {key1: value1, key2: value2}).

The table below compares the different data types found in Python and JSON.

JSON OBJECT	PYTHON OBJECT
object	dict
array	list
string	str
null	None
number (int)	int
number (real)	float
true	True
false	False

There are typically two operations you may want to do with JSON data: 1) turn JSON data in a Python object (eg JSON to Python dictionary) or vice versa (known as deserialisation and serialisation respectively); and 2) converting a deserialised object into a different kind of Python object.

Let's look at each in turn.

Reading in JSON data¶

Let's look at a typical example of reading in some JSON data.

From the Web¶

We'll get some JSON data from an API. Let's grab the latest UK unemployment data (timeseries code "MGSX" and dataset code "LMS").

In [ ]:

import requests

url = "https://api.ons.gov.uk/timeseries/MGSX/dataset/LMS/data"

# Get the data from the ONS API:
json_data = requests.get(url).json()
# There's too much data to show it all here, so just show the first 2 entries
dict(list(json_data.items())[:2])

There's a lot here, and it's not terribly easy to read! Let's check what type we got:

In [ ]:

type(json_data)

As expected, the JSON data has automatically been read in as a dictionary—but be wary that the fields have been read in as text rather than numbers, datetimes, and other specific data types.

From a File or Stream¶

For this exercise, you'll need to download the JSON file 'cakes.json' from the data folder of the repository associated with this book and save it in a sub-folder called "data". We can take a peek at the data using the terminal (which is what the preceeding exclamation mark means):

In [ ]:

json_string = """
{
 "food": "doughnut",
 "good_with": ["coffee", "tea"],
 "flavour": null,
 "toppings": [{"id": 0, "type": "glazed"},
              {"id": 1, "type": "sugar"}]
}
"""

We use the built-in json library to read this into Python (you could also use a file path here—more on how in a moment):

In [ ]:

import json

result = json.loads(json_string)
result

Note that not everything is the same in going from JSON text to a Python dictionary: JSON uses null rather than None, won't accept trailing commas at the end of lists, and has basic types that are lists, strings (and all keys must be strings), numbers, booleans, and nulls. Let's now see how to write a Python dictionary back to a JSON, perhaps for writing to file:

In [ ]:

json_stream = json.dumps(result)
json_stream

To write to a file, you would use the pattern:

with open('data/json_data_output.json', 'w') as outfile:
    json.dump(json_stream, outfile)

To read a file from disk, for example "data/json_data_output.json", it's

json.load(open("data/json_data_output.json"))

From JSON data to Data Frame¶

pandas has lots of options for turning JSON or dictionary data into a data frame. You do need to think a little bit about the structure of the data underneath though:

In [ ]:

import pandas as pd

pd.DataFrame(result["toppings"], columns=["id", "type"])

The web-scraped data we downloaded earlier had a more complicated structure, but pandas has a json_normalize() function that can cope with this. For example, with the following data, there are many missing entries but json_normalize() can still parse it into a Data Frame.

In [ ]:

data = [
    {"id": 1, "name": {"first": "Coleen", "last": "Volk"}},
    {"name": {"given": "Mark", "family": "Regner"}},
    {"id": 2, "name": "Faye Raker"},
]
pd.json_normalize(data)

And we can control the level that properties like 'name' are split out to as well (you can check out more options over at the pandas documentation)

In [ ]:

pd.json_normalize(data, max_level=0)

As well as the JSON normalise function, pandas has a from_dict() method to work with simpler dictionary objects.