jsonlines
the current notebook format only works by loading entire documents at a time. this limits performance and user experience when loading large documents. in this document, we present an alternative serialization of the notebook format, specifically the cell format, represented as jsonlines.
jsonlines is a line-delimited json format, where each line is a json object. effectively, jsonlines represents a list of json objects.
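for example, a list of objects round-trips through jsonlines using only the standard library json module:

```python
import json

# serialize: one json object per line
records = [{"id": 1}, {"id": 2}]
lines = "\n".join(map(json.dumps, records))
assert lines == '{"id": 1}\n{"id": 2}'

# deserialize: load each line independently
assert list(map(json.loads, lines.splitlines())) == records
```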
this approach takes advantage of the new id feature in the notebook cell format. specifically, we will use uuid1 identifiers to capture timestamp metadata about the execution. we get the added benefit of an id and temporal metadata.
import typing, nbformat, json, uuid, pathlib, freezegun, datetime, bz2, gzip, pandas, operator, collections, frozendict
for this discussion, we use this essay "lines-nb-format.ipynb" as test data.
with open("lines-nb-format.ipynb") as file:
    original: nbformat = nbformat.v4.reads(file.read())
we are going to replace the existing cell ids because their values are effectively meaningless. by adding a uuid1 id we now have ids that serve the dual purpose of timekeeping and identification. time tracking can be turned off by using a different uuid format.
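for instance, swapping uuid1 for uuid4 keeps unique ids but drops the embedded timestamp; a minimal illustration:

```python
import uuid

# uuid1 embeds a timestamp (and host information); uuid4 is purely random,
# so switching formats turns the temporal metadata off
timed = uuid.uuid1()
anonymous = uuid.uuid4()
assert timed.version == 1
assert anonymous.version == 4
```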
we'll use freezegun to simulate cell execution at different times, and replace our original existing ids with proper uuids.
for i, cell in enumerate(original["cells"]):
    @freezegun.freeze_time(datetime.datetime(2021, 4, 15, 11, i))
    def frozen_uuid():
        return str(uuid.uuid1())
    original["cells"][i] = {**cell, "id": frozen_uuid()}
the lead cell uses the special nil uuid. this value refers to the date datetime(1582, 10, 15); the date of the Gregorian reform to the Christian calendar. we will use this special identifier to hold the notebook metadata inside that cell's metadata, because no additional properties are allowed at the top level of a cell.
nil = "00000000-0000-0000-0000-000000000000"
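as a sketch of why uuid1 ids double as timestamps: the 60-bit time field counts 100-nanosecond intervals since the Gregorian reform, so the nil uuid decodes to the epoch itself. `uuid1_to_datetime` below is a hypothetical helper, standard library only:

```python
import datetime
import uuid

# per RFC 4122, uuid1 timestamps count 100-nanosecond intervals
# since the Gregorian reform, 1582-10-15
GREGORIAN_EPOCH = datetime.datetime(1582, 10, 15)

def uuid1_to_datetime(value):
    """recover the creation time encoded in a uuid1 string."""
    return GREGORIAN_EPOCH + datetime.timedelta(
        microseconds=uuid.UUID(value).time // 10)

# the nil uuid has a zero time field, so it decodes to the epoch itself
nil = "00000000-0000-0000-0000-000000000000"
assert uuid1_to_datetime(nil) == GREGORIAN_EPOCH
```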
the statement below converts all of our original cells into a block of text where each line is valid json.
jsonlines: typing.Text = "\n".join(
    map(
        json.dumps, [
            nbformat.v4.new_code_cell(id=nil, metadata=original["metadata"])
        ] + original["cells"]
    )
)
from the string representation we can create a replica by loading each line as json.
replica: nbformat = nbformat.v4.new_notebook(cells=list(
    map(json.loads, jsonlines.splitlines())
))
our replica notebook has one extra cell relative to the original notebook. this is because we added the nil uuid cell to represent our notebook's metadata.
assert len(replica.cells) == len(original.cells) + 1
our last task is to merge the jsonlines metadata representation with the conventional nbformat. we do this by promoting the metadata from the cell with the nil uuid to the notebook metadata.
def reconstitute_metadata_from_nil(nb):
    """perform an in place modification of the notebook metadata"""
    for i, cell in enumerate(nb["cells"]):
        if cell["id"] == nil:
            cell = nb["cells"].pop(i)
            nb["metadata"].update(cell["metadata"])
            break
    return nb
reconstitute_metadata_from_nil(replica); # in place operation
from json lines we can recover the original with little custom work. the only extra effort is to cast the metadata of the cell with the nil id as the notebook metadata.
assert replica == original
it is recommended to use the .jsonl extension for jsonlines files.
with open("lines-nb-format.ipynb.jsonl", "w") as file:
    file.write(jsonlines)
jsonlines can also use bz2 or gzip compression.
with bz2.open("lines-nb-format.ipynb.jsonl.bz2", "w") as file:
    file.write(jsonlines.encode("utf-8"))
with gzip.open("lines-nb-format.ipynb.jsonl.gz", "w") as file:
    file.write(jsonlines.encode("utf-8"))
files = list(pathlib.Path().glob("lines-nb-format.ipynb*"))
size = pandas.Series(data=files, index=files, name="kb")
the table below shows the relative on disk sizes of the different formats.
size.apply(pathlib.PosixPath.stat).apply(
operator.attrgetter("st_size")).sort_values().divide(2**10).to_frame()
| | kb |
|---|---|
| lines-nb-format.ipynb.jsonl.bz2 | 3.480469 |
| lines-nb-format.ipynb.jsonl.gz | 3.559570 |
| lines-nb-format.ipynb.jsonl | 12.794922 |
| lines-nb-format.ipynb | 14.484375 |
from_disk = collections.defaultdict(nbformat.v4.new_notebook)
the original document has to be loaded all at once.
with open("lines-nb-format.ipynb") as file:
    from_disk[file.name].update(json.loads(file.read()))
from_disk[file.name] = original  # because this is the one with the uuid ids
meanwhile, the jsonlines format can load in cells line by line.
with open("lines-nb-format.ipynb.jsonl") as file:
    for line in file:
        from_disk[file.name]["cells"].append(json.loads(line))
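one payoff of line-delimited cells is partial loading: we can parse just the first n cells without reading or parsing the whole document. a minimal sketch with an in-memory stream (`head_cells` and the toy document are illustrative, not part of the format):

```python
import io
import itertools
import json

# a toy jsonlines document: three cells, one per line
doc = "\n".join(json.dumps({"id": str(i), "source": ""}) for i in range(3))

def head_cells(file, n):
    """lazily parse only the first n cells from a jsonlines stream."""
    return [json.loads(line) for line in itertools.islice(file, n)]

# only the first two lines are ever parsed
first_two = head_cells(io.StringIO(doc), 2)
assert [cell["id"] for cell in first_two] == ["0", "1"]
```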
and there is a polymorphic api for loading lines of json from compressed files.
name = "lines-nb-format.ipynb.jsonl.bz2"
with bz2.open(name) as file:
    for line in file:
        from_disk[name]["cells"].append(json.loads(line))
name = "lines-nb-format.ipynb.jsonl.gz"
with gzip.open(name) as file:
    for line in file:
        from_disk[name]["cells"].append(json.loads(line))
remember we need to take our nil uuid cell and make its metadata the notebook metadata.
for value in from_disk.values():
    reconstitute_metadata_from_nil(value)
we can now test that all of the forms of the notebook are the same.
import unittest
class Compose(unittest.TestCase):
    def test_compare_files(x):
        x.assertDictEqual(
            from_disk['lines-nb-format.ipynb'],
            from_disk['lines-nb-format.ipynb.jsonl']
        )
        x.assertDictEqual(
            from_disk['lines-nb-format.ipynb'],
            from_disk['lines-nb-format.ipynb.jsonl.gz']
        )
        x.assertDictEqual(
            from_disk['lines-nb-format.ipynb'],
            from_disk['lines-nb-format.ipynb.jsonl.bz2']
        )
unittest.main(argv=[""], exit=False);
.
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK