Marshal is a data serialization format from the Python standard library (the `marshal` module). It is more primitive than pickle, but it might be faster.
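To make "more primitive" concrete: `marshal` only supports a fixed set of built-in types, whereas `pickle` also handles many class instances. The sketch below is a minimal illustration; the `datetime.date` value is an assumption chosen for the example and plays no role in the test data of this notebook.

```python
import datetime
import marshal
import pickle

# A dict of int -> str (the shape of the feature data) round-trips in both.
sample = {2: "רֵאשִׁ֖ית"}
assert marshal.loads(marshal.dumps(sample)) == sample
assert pickle.loads(pickle.dumps(sample)) == sample

# An instance of a non-builtin class: pickle handles it, marshal refuses.
day = datetime.date(2018, 1, 1)
assert pickle.loads(pickle.dumps(day)) == day

try:
    marshal.dumps(day)
    marshal_ok = True
except ValueError:
    # marshal raises ValueError for objects of unsupported types
    marshal_ok = False

print("marshal can serialize a date:", marshal_ok)
```

For feature data that consists only of integers and strings, this limitation does not matter, which is why marshal is a candidate at all.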
As a simple test, we take the feature data for g_word_utf8.
It is a map from the numbers 1 to 426584 to Hebrew word occurrences (Unicode strings).
In Text-Fabric we have a representation in plain text and a compressed, pickled representation.
Pickle is faster. Loading gzipped, pickled data is much faster than loading gzipped, marshalled data.
The size of the marshal uncompressed serialization is much bigger than the TF text representation.
The size of the gzipped marshal serialization is approximately the same as the gzipped, pickled TF serialization.
name | kind | size | load time |
---|---|---|---|
g_word_utf8.tf | tf: plain unicode text | 5.4 MB | 1.6 s |
g_word_utf8.tfx | tf: gzipped binary | 3.2 MB | 0.3 s |
g_word_utf8.marshal | marshal: uncompressed | 9.2 MB | 0.8 s |
g_word_utf8.marshal.gz | marshal: gzipped | 3.3 MB | 3.0 s |
We do not see reasons to replace the TF feature data serialization by marshal.
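The comparison can also be tried without any Text-Fabric data. The sketch below is a self-contained variant under stated assumptions: the dictionary is synthetic (not the real g_word_utf8 feature), the file names are made up, and the helper `timed_load` is hypothetical. The sizes and timings it prints will differ from the table above.

```python
import gzip
import marshal
import os
import pickle
import tempfile
import time

GZIP_LEVEL = 2  # same level as used in this notebook

# Synthetic stand-in for the feature data: node number -> short string.
data = {n: f"word{n % 5000}" for n in range(1, 50001)}


def timed_load(path, opener, loader):
    """Load `path` and return (elapsed seconds, loaded object)."""
    start = time.perf_counter()
    with opener(path, "rb") as fh:
        obj = loader(fh)
    return time.perf_counter() - start, obj


results = []
with tempfile.TemporaryDirectory() as tmp:
    variants = [
        ("marshal", open, marshal),
        ("marshal.gz", gzip.open, marshal),
        ("pickle", open, pickle),
        ("pickle.gz", gzip.open, pickle),
    ]
    for (ext, opener, module) in variants:
        path = os.path.join(tmp, f"feature.{ext}")
        if opener is gzip.open:
            with gzip.open(path, "wb", compresslevel=GZIP_LEVEL) as fh:
                module.dump(data, fh)
        else:
            with open(path, "wb") as fh:
                module.dump(data, fh)
        seconds, loaded = timed_load(path, opener, module.load)
        assert loaded == data  # the round trip must be lossless
        results.append((ext, os.path.getsize(path), seconds))

for (ext, size, seconds) in results:
    print(f"{ext:<12} {size:>10} bytes {seconds:6.3f} s")
```

The rest of this notebook does the same on the real feature data.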
import os
import gzip
import marshal
import pickle
from tf.fabric import Fabric
GZIP_LEVEL = 2 # same as used in Text-Fabric
VERSION = 'c'
BHSA = f'BHSA/tf/{VERSION}'
PARA = f'parallels/tf/{VERSION}'
TF = Fabric(locations='~/github/etcbc', modules=[BHSA, PARA])
api = TF.load('')
api.makeAvailableIn(globals())
This is Text-Fabric 5.5.22 Api reference : https://annotation.github.io/text-fabric/Api/Fabric/ Tutorial : https://github.com/annotation/text-fabric/blob/master/docs/tutorial.ipynb Example data : https://github.com/annotation/text-fabric-data 117 features found and 0 ignored 0.00s loading features ... | 1.60s T g_word_utf8 from /Users/dirk/github/etcbc/BHSA/tf/c 6.02s All features loaded/computed - for details use loadLog()
The load time is ~ 1.6 seconds.
But during this time, the textual data has been compiled and written to a binary form. Let's load again.
TF = Fabric(locations='~/github/etcbc', modules=[BHSA, PARA])
api = TF.load('')
api.makeAvailableIn(globals())
This is Text-Fabric 5.5.22 Api reference : https://annotation.github.io/text-fabric/Api/Fabric/ Tutorial : https://github.com/annotation/text-fabric/blob/master/docs/tutorial.ipynb Example data : https://github.com/annotation/text-fabric-data 117 features found and 0 ignored 0.00s loading features ... 4.65s All features loaded/computed - for details use loadLog()
loadLog()
| 0.03s B otype from /Users/dirk/github/etcbc/BHSA/tf/c | 0.53s B oslots from /Users/dirk/github/etcbc/BHSA/tf/c | 0.01s B book from /Users/dirk/github/etcbc/BHSA/tf/c | 0.01s B chapter from /Users/dirk/github/etcbc/BHSA/tf/c | 0.01s B verse from /Users/dirk/github/etcbc/BHSA/tf/c | 0.13s B g_cons from /Users/dirk/github/etcbc/BHSA/tf/c | 0.18s B g_cons_utf8 from /Users/dirk/github/etcbc/BHSA/tf/c | 0.14s B g_lex from /Users/dirk/github/etcbc/BHSA/tf/c | 0.25s B g_lex_utf8 from /Users/dirk/github/etcbc/BHSA/tf/c | 0.22s B g_word from /Users/dirk/github/etcbc/BHSA/tf/c | 0.26s B g_word_utf8 from /Users/dirk/github/etcbc/BHSA/tf/c | 0.14s B lex0 from /Users/dirk/github/etcbc/BHSA/tf/c | 0.17s B lex_utf8 from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B qere from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B qere_trailer from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B qere_trailer_utf8 from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B qere_utf8 from /Users/dirk/github/etcbc/BHSA/tf/c | 0.07s B trailer from /Users/dirk/github/etcbc/BHSA/tf/c | 0.08s B trailer_utf8 from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B __levels__ from otype, oslots, otext | 0.03s B __order__ from otype, oslots, __levels__ | 0.03s B __rank__ from otype, __order__ | 1.11s B __levUp__ from otype, oslots, __rank__ | 0.87s B __levDown__ from otype, __levUp__, __rank__ | 0.33s B __boundary__ from otype, oslots, __rank__ | 0.01s B __sections__ from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse | 0.00s B book@am from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@ar from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@bn from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@da from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@de from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@el from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@en from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@es from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@fa from 
/Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@fr from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@he from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@hi from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@id from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@ja from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@ko from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@la from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@nl from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@pa from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@pt from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@ru from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@sw from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@syc from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@tr from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@ur from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@yo from /Users/dirk/github/etcbc/BHSA/tf/c | 0.00s B book@zh from /Users/dirk/github/etcbc/BHSA/tf/c
The load time of the feature g_word_utf8 from the binary form is ~ 0.3 seconds (0.26 s in the log above).
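The two loads above follow a compile-once pattern: the first load parses the text and writes a binary form, later loads read only the binary. The sketch below illustrates that pattern under loud assumptions: `parse_text` reads a simplified one-value-per-line file that is not the real .tf format, and `load_feature` with its cache naming is hypothetical, not the Text-Fabric implementation.

```python
import gzip
import os
import pickle
import tempfile

GZIP_LEVEL = 2


def parse_text(path):
    """Parse a simplified one-value-per-line file into {node: value}.

    This is NOT the real .tf format; it is a stand-in for illustration.
    """
    with open(path, encoding="utf8") as fh:
        return {n: line.rstrip("\n") for (n, line) in enumerate(fh, start=1)}


def load_feature(text_path):
    """Return (data, source), reading a gzipped pickle cache when it is fresh."""
    cache_path = text_path + "x"  # hypothetical naming, echoing .tf -> .tfx
    if (
        os.path.exists(cache_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(text_path)
    ):
        with gzip.open(cache_path, "rb") as fh:
            return pickle.load(fh), "cache"
    # No usable cache: parse the text and write the binary form for next time.
    data = parse_text(text_path)
    with gzip.open(cache_path, "wb", compresslevel=GZIP_LEVEL) as fh:
        pickle.dump(data, fh)
    return data, "text"


with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "demo.tf")
    with open(text_path, "w", encoding="utf8") as fh:
        fh.write("alpha\nbeta\ngamma\n")
    first, first_source = load_feature(text_path)
    second, second_source = load_feature(text_path)

print(first_source, second_source)
```

The first call reports `text` and the second `cache`, analogous to the `T` and `B` markers in the Text-Fabric load logs.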
tempDir = os.path.expanduser('~/github/annotation/text-fabric/_temp/marshal')
os.makedirs(tempDir, exist_ok=True)
feature = 'g_word_utf8'
data = TF.features[feature].data
print(len(data))
print(data[2])
426584 רֵאשִׁ֖ית
We write the feature data to a marshal data file.
dataFile = f'{tempDir}/{feature}.marshal'
indent(reset=True)
info('start writing')
with open(dataFile, 'wb') as mf:
    marshal.dump(data, mf)
info('done')
0.00s start writing 0.09s done
We also make a gzipped data file.
indent(reset=True)
info('start writing')
dataFileZ = f'{dataFile}.gz'
with gzip.open(dataFileZ, 'wb', compresslevel=GZIP_LEVEL) as mf:
    marshal.dump(data, mf)
info('done')
0.00s start writing 0.26s done
Now we read the uncompressed marshal file back.
indent(reset=True)
info('start reading')
with open(dataFile, 'rb') as mf:
    rData = marshal.load(mf)
info('done')
print(rData[2])
0.00s start reading 0.83s done רֵאשִׁ֖ית
Load time ~ 0.8 seconds.
Next we read the gzipped marshal file.
indent(reset=True)
info('start reading')
with gzip.open(dataFileZ, 'rb') as mf:
    rData = marshal.load(mf)
info('done')
print(rData[2])
0.00s start reading 3.04s done רֵאשִׁ֖ית
Load time ~ 3.0 seconds.
We read the gzipped marshal file a second time to check that the timing is stable.
indent(reset=True)
info('start reading')
with gzip.open(dataFileZ, 'rb') as mf:
    rData = marshal.load(mf)
info('done')
print(rData[2])
0.00s start reading 3.05s done רֵאשִׁ֖ית
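One variation that this notebook does not time: `marshal.load` reads from the file object as it parses, which may be costly on a gzip stream, whereas `marshal.loads` parses a bytes object that has already been decompressed in full. The sketch below only shows both patterns side by side on synthetic data (an assumption, not the real feature); whether the second pattern would change the numbers above is untested here.

```python
import gzip
import marshal
import os
import tempfile
import time

GZIP_LEVEL = 2

# Synthetic stand-in for the feature data: node number -> short string.
data = {n: f"word{n % 5000}" for n in range(1, 50001)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "feature.marshal.gz")
    with gzip.open(path, "wb", compresslevel=GZIP_LEVEL) as fh:
        marshal.dump(data, fh)

    # Variant 1: let marshal read from the gzip stream itself.
    start = time.perf_counter()
    with gzip.open(path, "rb") as fh:
        streamed = marshal.load(fh)
    t_stream = time.perf_counter() - start

    # Variant 2: decompress everything first, then parse the bytes in memory.
    start = time.perf_counter()
    with gzip.open(path, "rb") as fh:
        in_memory = marshal.loads(fh.read())
    t_memory = time.perf_counter() - start

assert streamed == in_memory == data
print(f"marshal.load(stream): {t_stream:.3f} s")
print(f"marshal.loads(bytes): {t_memory:.3f} s")
```

Both variants return the same data; only the way the bytes reach the parser differs.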
For comparison, we load the gzipped, pickled TF data for the same feature in the same way.
tfDataFileZ = os.path.expanduser('~/github/etcbc/BHSA/tf/c/.tf/g_word_utf8.tfx')
indent(reset=True)
info('start reading')
with gzip.open(tfDataFileZ, 'rb') as mf:
    rData = pickle.load(mf)
info('done')
print(rData[2])
0.00s start reading 0.27s done רֵאשִׁ֖ית