jsonlines
the current notebook format only works by loading entire documents at a time. this limits performance and user experience when loading large documents. in this document, we present an alternative serialization of the notebook format, specifically the cell format, represented as jsonlines.
jsonlines is a line-delimited json format, where each line is a json object. effectively, jsonlines represents a list of json objects.
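for example, a list of objects round-trips through jsonlines using only the standard library json module:

```python
import json

# serialize: one json object per line
records = [{"id": 1}, {"id": 2}]
lines = "\n".join(map(json.dumps, records))
assert lines == '{"id": 1}\n{"id": 2}'

# deserialize: load each line independently
assert list(map(json.loads, lines.splitlines())) == records
```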
this approach takes advantage of the new id feature in the notebook cell format. specifically, we will use uuid1 identifiers to capture timestamp metadata about the execution. we get the added benefit of an id and temporal metadata.
import typing, nbformat, json, uuid, pathlib, freezegun, datetime, bz2, gzip, pandas, operator, collections, frozendict
for this discussion, we use this essay "lines-nb-format.ipynb" as test data.
with open("lines-nb-format.ipynb") as file:
    original: nbformat = nbformat.v4.reads(file.read())
we are going to replace the existing cell ids because their values are effectively meaningless. by adding a uuid1 id we now have ids that serve the dual purpose of timekeeping and identification. time tracking can be turned off by using a different uuid format.
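for instance, swapping uuid1 for uuid4 keeps unique ids but drops the embedded timestamp; a minimal illustration:

```python
import uuid

# uuid1 embeds a timestamp (and host information); uuid4 is purely random,
# so switching formats turns the temporal metadata off
timed = uuid.uuid1()
anonymous = uuid.uuid4()
assert timed.version == 1
assert anonymous.version == 4
```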
we'll use freezegun to simulate cell execution at different times, and replace our original existing ids with proper uuids.
for i, cell in enumerate(original["cells"]):
    @freezegun.freeze_time(datetime.datetime(2021, 4, 15, 11, i))
    def frozen_uuid():
        return str(uuid.uuid1())
    original["cells"][i] = {**cell, "id": frozen_uuid()}
the lead cell uses the special nil uuid. this value refers to the date datetime(1582, 10, 15); the date of the Gregorian reform to the Christian calendar. we will use this special identifier to hold the notebook metadata inside that cell's metadata, because no additional properties are allowed at the top level of a cell.
nil = "00000000-0000-0000-0000-000000000000"
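as a sketch of why uuid1 ids double as timestamps: the 60-bit time field counts 100-nanosecond intervals since the Gregorian reform, so the nil uuid decodes to the epoch itself. `uuid1_to_datetime` below is a hypothetical helper, standard library only:

```python
import datetime
import uuid

# per RFC 4122, uuid1 timestamps count 100-nanosecond intervals
# since the Gregorian reform, 1582-10-15
GREGORIAN_EPOCH = datetime.datetime(1582, 10, 15)

def uuid1_to_datetime(value):
    """recover the creation time encoded in a uuid1 string."""
    return GREGORIAN_EPOCH + datetime.timedelta(
        microseconds=uuid.UUID(value).time // 10)

# the nil uuid has a zero time field, so it decodes to the epoch itself
nil = "00000000-0000-0000-0000-000000000000"
assert uuid1_to_datetime(nil) == GREGORIAN_EPOCH
```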
the statement below converts all of our original cells into a block of text where each line is valid json.
jsonlines: typing.Text = "\n".join(
    map(
        json.dumps, [
            nbformat.v4.new_code_cell(id=nil, metadata=original["metadata"])
        ] + original["cells"]
    )
)
from the string representation we can create a replica by loading each line as json.
replica: nbformat = nbformat.v4.new_notebook(cells=list(
    map(json.loads, jsonlines.splitlines())
))
our replica notebook has one extra cell relative to the original notebook. this is because we added the nil uuid cell to represent our notebook's metadata.
assert len(replica.cells) == len(original.cells) + 1
our last task is to merge the jsonlines metadata representation with the conventional nbformat. we do this by promoting the metadata from the cell with the nil uuid to the notebook metadata.
def reconstitute_metadata_from_nil(nb):
    """perform an in place modification of the notebook metadata"""
    for i, cell in enumerate(nb["cells"]):
        if cell["id"] == nil:
            cell = nb["cells"].pop(i)
            nb["metadata"].update(cell["metadata"])
            break
    return nb
reconstitute_metadata_from_nil(replica); # in place operation
from json lines we can recover the original with little custom work. the only extra effort is to cast the metadata of the cell with the nil id as the notebook metadata.
assert replica == original
it is recommended to use the .jsonl extension for jsonlines files.
with open("lines-nb-format.ipynb.jsonl", "w") as file:
    file.write(jsonlines)
jsonlines can also use bz2 or gzip compression.
with bz2.open("lines-nb-format.ipynb.jsonl.bz2", "w") as file:
    file.write(jsonlines.encode("utf-8"))
with gzip.open("lines-nb-format.ipynb.jsonl.gz", "w") as file:
    file.write(jsonlines.encode("utf-8"))
files = list(pathlib.Path().glob("lines-nb-format.ipynb*"))
size = pandas.Series(data=files, index=files, name="kb")
the table below shows the relative on disk sizes of the different formats.
size.apply(pathlib.PosixPath.stat).apply(
operator.attrgetter("st_size")).sort_values().divide(2**10).to_frame()
| | kb |
|---|---|
| lines-nb-format.ipynb.jsonl.bz2 | 3.480469 |
| lines-nb-format.ipynb.jsonl.gz | 3.559570 |
| lines-nb-format.ipynb.jsonl | 12.794922 |
| lines-nb-format.ipynb | 14.484375 |
from_disk = collections.defaultdict(nbformat.v4.new_notebook)
the original document has to be loaded all at once.
with open("lines-nb-format.ipynb") as file:
    from_disk[file.name].update(json.loads(file.read()))
from_disk[file.name] = original  # because this is the one with the uuid ids
meanwhile, the jsonlines format can load in cells line by line.
with open("lines-nb-format.ipynb.jsonl") as file:
    for line in file:
        from_disk[file.name]["cells"].append(json.loads(line))
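one payoff of line-delimited cells is partial loading: we can parse just the first n cells without reading or parsing the whole document. a minimal sketch with an in-memory stream (`head_cells` and the toy document are illustrative, not part of the format):

```python
import io
import itertools
import json

# a toy jsonlines document: three cells, one per line
doc = "\n".join(json.dumps({"id": str(i), "source": ""}) for i in range(3))

def head_cells(file, n):
    """lazily parse only the first n cells from a jsonlines stream."""
    return [json.loads(line) for line in itertools.islice(file, n)]

# only the first two lines are ever parsed
first_two = head_cells(io.StringIO(doc), 2)
assert [cell["id"] for cell in first_two] == ["0", "1"]
```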
and there is a polymorphic api for loading lines of json from compressed files.
name = "lines-nb-format.ipynb.jsonl.bz2"
with bz2.open(name) as file:
    for line in file:
        from_disk[name]["cells"].append(json.loads(line))
name = "lines-nb-format.ipynb.jsonl.gz"
with gzip.open(name) as file:
    for line in file:
        from_disk[name]["cells"].append(json.loads(line))
remember we need to take our nil uuid cell and make its metadata the notebook metadata.
for value in from_disk.values():
    reconstitute_metadata_from_nil(value)
we can now test that all of the forms of the notebook are the same.
import unittest
class Compose(unittest.TestCase):
    def test_compare_files(x):
        x.assertDictEqual(
            from_disk['lines-nb-format.ipynb'],
            from_disk['lines-nb-format.ipynb.jsonl']
        )
        x.assertDictEqual(
            from_disk['lines-nb-format.ipynb'],
            from_disk['lines-nb-format.ipynb.jsonl.gz']
        )
        x.assertDictEqual(
            from_disk['lines-nb-format.ipynb'],
            from_disk['lines-nb-format.ipynb.jsonl.bz2']
        )
unittest.main(argv=[""], exit=False);
.
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK