Marshal is a data serialization format used in the standard library of Python. It is more primitive than pickle, but it might be faster.
As a simple test, we take the feature data for g_word and oslots.
g_word is a map from the numbers 1 to ca. 420,000 to Hebrew word occurrences (ASCII strings).
oslots is a map from ca. 1 million integers to tuples of integers.
In Text-Fabric we have a representation in plain text and a compressed, pickled representation.
We also run the deserialization in two ways: with the garbage collector enabled, and with the garbage collector deliberately turned off.
Pickle is faster. Loading gzipped, pickled data is much faster than loading gzipped, marshalled data.
The size of the marshal uncompressed serialization is much bigger than the TF text representation.
The size of the gzipped marshal serialization is approximately the same as the gzipped, pickled TF serialization.
what (load time in seconds) | g_word | oslots |
---|---|---|
pickle.gz with gc | 0.08 | 0.7 |
pickle.gz without gc | 0.09 | 0.38 |
marshal.gz with gc | 1.11 | 1.86 |
marshal.gz without gc | 1.07 | 1.85 |
We do not see reasons to replace the TF feature data serialization by marshal.
We do not fiddle with the garbage collector.
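The comparison summarized above can be tried on a small scale without the BHSA data. The following is a minimal sketch, not the notebook's own measurement: it uses a synthetic dictionary (`sample`, an assumption standing in for a feature like oslots), the default pickle protocol, and helper names (`serialize`, `time_load`) that are invented for illustration. Timings and sizes on synthetic data need not resemble the figures in the table.

```python
import gc
import gzip
import marshal
import pickle
import time

GZIP_LEVEL = 2  # compression level taken from the notebook

# Synthetic stand-in for a feature such as oslots: integer -> tuple of integers.
# The number of entries is an arbitrary choice for this sketch.
sample = {n: (n, n + 1, n + 2) for n in range(1, 50_001)}


def serialize(data):
    """Return raw and gzipped byte strings for both serializers."""
    raw = {
        "pickle": pickle.dumps(data),
        "marshal": marshal.dumps(data),
    }
    zipped = {
        name: gzip.compress(blob, compresslevel=GZIP_LEVEL)
        for name, blob in raw.items()
    }
    return raw, zipped


def time_load(blob, loader, with_gc=True):
    """Decompress and deserialize blob once; return (seconds, data)."""
    was_enabled = gc.isenabled()
    if not with_gc:
        gc.disable()
    try:
        start = time.perf_counter()
        data = loader(gzip.decompress(blob))
        elapsed = time.perf_counter() - start
    finally:
        # restore the collector only if it was on before the call
        if was_enabled:
            gc.enable()
    return elapsed, data


raw, zipped = serialize(sample)
results = {}
for name, loader in (("pickle", pickle.loads), ("marshal", marshal.loads)):
    for with_gc in (True, False):
        elapsed, loaded = time_load(zipped[name], loader, with_gc=with_gc)
        results[(name, with_gc)] = elapsed
        print(
            f"{name}.gz gc={with_gc}: {elapsed:.3f}s "
            f"raw={len(raw[name])} bytes, gz={len(zipped[name])} bytes"
        )
```

Note that this sketch decompresses the whole blob in memory before deserializing, whereas the notebook reads from a gzip file object, so the two setups are not directly comparable.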
%load_ext autoreload
%autoreload 2
import os
import gzip
import marshal
import pickle
import gc
from shutil import move
from tf.fabric import Fabric
from tf.app import use
GZIP_LEVEL = 2 # same as used in Text-Fabric
BASE = os.path.expanduser("~/github/annotation/text-fabric")
TEST_BASE = f"{BASE}/_temp/serial"
TEST_DATA_TF = f"{TEST_BASE}/tf"
TEST_DATA_SERIAL = f"{TEST_BASE}/serialized"
FEATURES = ("g_word", "oslots")
if not os.path.exists(TEST_DATA_SERIAL):
    os.makedirs(TEST_DATA_SERIAL, exist_ok=True)
TF = Fabric(locations=TEST_DATA_TF)
api = TF.load(FEATURES)
This is Text-Fabric 9.1.11
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html
2 features found and 0 ignored
0.00s Not all of the warp features otype and oslots are present in ~/github/annotation/text-fabric/_temp/serial/tf/
0.00s Only the Feature and Edge APIs will be enabled
0.00s Warp feature "otext" not found. Working without Text-API
| 23s T oslots from ~/github/annotation/text-fabric/_temp/serial/tf
| 2.29s T g_word from ~/github/annotation/text-fabric/_temp/serial/tf
25s All features loaded/computed - for details use TF.isLoaded()
During this time, the textual data has been compiled and written to a binary form. We move the binary form (gz pickled) to the serial directory.
for fName in FEATURES:
    move(f"{TEST_DATA_TF}/.tf/2/{fName}.tfx", f"{TEST_DATA_SERIAL}/{fName}.pickle.gz")
def load(fName, ext, withGc=True):
    # Load one serialized feature and report the elapsed time through TF.info
    TF.indent(reset=True)
    fullName = f"{fName}.{ext}"
    path = f"{TEST_DATA_SERIAL}/{fullName}"
    TF.info(f"start loading {fullName}")
    if not withGc:
        # switch the garbage collector off for the duration of the load
        gc.disable()
    if ext == "pickle.gz":
        with gzip.open(path, "rb") as f:
            data = pickle.load(f)
    elif ext == "marshal.gz":
        with gzip.open(path, "rb") as f:
            data = marshal.load(f)
    TF.info(f"end loading {fName}.{ext}")
    if not withGc:
        gc.enable()
    return data
data = {}
for fName in FEATURES:
    data[fName] = load(fName, "pickle.gz")
for fName in FEATURES:
    data[fName] = load(fName, "pickle.gz", withGc=False)
0.00s start loading g_word.pickle.gz
0.09s end loading g_word.pickle.gz
0.00s start loading oslots.pickle.gz
0.70s end loading oslots.pickle.gz
0.00s start loading g_word.pickle.gz
0.09s end loading g_word.pickle.gz
0.00s start loading oslots.pickle.gz
0.43s end loading oslots.pickle.gz
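The load function above switches the garbage collector off and on by hand. As a hedged alternative, the same idea can be wrapped in a context manager that restores the previous collector state even when loading raises an exception. The name `gc_paused` is hypothetical and not part of Text-Fabric; this is a sketch of the technique, not the library's implementation.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused(pause=True):
    """Temporarily disable the cyclic garbage collector.

    The previous state is restored on exit, also when the body raises.
    With pause=False the collector is left untouched.
    """
    was_enabled = gc.isenabled()
    if pause:
        gc.disable()
    try:
        yield
    finally:
        # only switch the collector back on if it was on before
        if was_enabled:
            gc.enable()


before = gc.isenabled()
with gc_paused():
    inside = gc.isenabled()  # False while the block runs
    # stand-in for a deserialization step that creates many objects
    big = [(i, i + 1) for i in range(100_000)]
after = gc.isenabled()  # same as before the block
```

Only the automatic cyclic collector is paused here; reference counting keeps working, so objects without cycles are still freed immediately.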
for fName in FEATURES:
    # gzip.open creates the file itself; no separate open() on the same path is needed
    with gzip.open(f"{TEST_DATA_SERIAL}/{fName}.marshal.gz", "wb", compresslevel=GZIP_LEVEL) as f:
        marshal.dump(data[fName], f)
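Before timing the marshalled files it can be worth checking that the data survives the round trip. The following is a minimal sketch on a tiny, hypothetical dictionary (`sample`) written to a temporary directory rather than to the notebook's serial directory. It also shows a second way of reading, `marshal.loads` on the fully decompressed bytes; that variant was not timed in this notebook, so no claim is made about its speed.

```python
import gzip
import marshal
import os
import tempfile

GZIP_LEVEL = 2  # compression level taken from the notebook

# Hypothetical miniature data: one string value and one tuple value.
sample = {1: "word", 2: (3, 4, 5)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.marshal.gz")

    # Write: marshal straight into the gzip stream.
    with gzip.open(path, "wb", compresslevel=GZIP_LEVEL) as f:
        marshal.dump(sample, f)

    # Read, variant 1: let marshal read from the gzip stream.
    with gzip.open(path, "rb") as f:
        restored_stream = marshal.load(f)

    # Read, variant 2: decompress everything first, then deserialize the bytes.
    with gzip.open(path, "rb") as f:
        restored_bytes = marshal.loads(f.read())

round_trip_ok = restored_stream == sample and restored_bytes == sample
print(round_trip_ok)
```

The Python documentation notes that the marshal format is specific to Python and may change between Python versions, which is a further consideration when choosing it for long-lived data files.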
dataMarshal = {}
for fName in FEATURES:
    dataMarshal[fName] = load(fName, "marshal.gz")
for fName in FEATURES:
    dataMarshal[fName] = load(fName, "marshal.gz", withGc=False)
0.00s start loading g_word.marshal.gz
1.16s end loading g_word.marshal.gz
0.00s start loading oslots.marshal.gz
1.92s end loading oslots.marshal.gz
0.00s start loading g_word.marshal.gz
1.07s end loading g_word.marshal.gz
0.00s start loading oslots.marshal.gz
1.87s end loading oslots.marshal.gz
With pickle, oslots seems to load much faster when the garbage collector is temporarily switched off (0.43s versus 0.70s). With marshal, switching it off makes hardly any difference.
Let's try to load the whole BHSA in both ways:
TF.indent(reset=True)
TF.info("start loading bhsa with gc switched off")
A = use("bhsa", withGc=False)
TF.info("end loading bhsa with gc switched off")
0.00s start loading bhsa with gc switched off
This is Text-Fabric 9.1.11
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html
122 features found and 0 ignored
7.70s end loading bhsa with gc switched off
TF.indent(reset=True)
TF.info("start loading bhsa with gc switched on")
A = use("bhsa", withGc=True)
TF.info("end loading bhsa with gc switched on")
0.00s start loading bhsa with gc switched on
This is Text-Fabric 9.1.11
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html
122 features found and 0 ignored
6.34s end loading bhsa with gc switched on
This does not make much difference: 7.70s with the garbage collector switched off versus 6.34s with it switched on. We leave the garbage collector untouched by default, i.e. we do not switch it off.