Nodegraph can work as a regular Bloom filter if you ignore the k-size and call count directly with integers (limited to 64-bit ints, since that's what the implementation supports). It works with the hash values in the signatures created with sourmash, so that's what I'm using for now.
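To make that idea concrete, here is a minimal pure-Python sketch of a Bloom filter keyed by 64-bit integers. This is only an illustration of the concept, not khmer's Nodegraph code; the class name and the SHA1-based hashing scheme are made up for the example, with the method names chosen to mirror Nodegraph's count/get:

```python
import hashlib

class IntBloomFilter:
    """Toy Bloom filter over 64-bit integers (a concept sketch, not Nodegraph)."""

    def __init__(self, size, n_tables):
        self.size = size
        self.n_tables = n_tables
        self.bits = [0] * size

    def _positions(self, value):
        # Derive n_tables bit positions from the integer's little-endian bytes.
        for i in range(self.n_tables):
            digest = hashlib.sha1(value.to_bytes(8, 'little') + bytes([i])).digest()
            yield int.from_bytes(digest[:8], 'little') % self.size

    def count(self, value):
        # Insertion; named after Nodegraph's count() method.
        for pos in self._positions(value):
            self.bits[pos] = 1

    def get(self, value):
        # Membership query: True means "probably present".
        return all(self.bits[pos] for pos in self._positions(value))
```

Inserting the hash values of a signature and later querying them is the same pattern the SBT internal nodes use below.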
The initial SBT implementation comes from https://github.com/dib-lab/2015-09-10-scihack, which was based on https://github.com/ctb/2015-sbt-demo. The scihack version adds better tree balancing (all leaves are always in the last or second-to-last level), saving and loading from disk, and methods for printing the tree (for easier debugging).

I extended it by supporting different types of Leaf classes and by moving from a binary to an n-ary tree (which might work better with lots of Leaves).
I'm using the same signatures from the sourmash urchin demo.
cd -q ..
from collections import defaultdict
from glob import glob
import os
from functools import partial
from IPython.display import Image
from sourmash_lib import signature
from sbt import SBT, GraphFactory, Node
from sbtmh import search_minhashes, SigLeaf
factory = GraphFactory(31, 1e5, 4)  # ksize, table size, number of tables
We'll use a reduced dataset to demonstrate the SBT methods (only lividus-*
signatures, and a binary tree).
sig_to_search = "urchin/lividus-SRR1735497.sig"
with open(sig_to_search, 'r') as data:
    to_search = signature.load_signatures(data)[0]
We build trees by adding leaves with the add_node method, which takes care of positioning the leaf in the tree and updating internal nodes.
tree = SBT(factory)
for f in glob("urchin/lividus*.sig"):
    with open(f, 'r') as data:
        sig = signature.load_signatures(data)
    leaf = SigLeaf(os.path.basename(f), sig[0])
    tree.add_node(leaf)
There are two printing methods: print, a simple ASCII tree, and print_dot, which can be fed into GraphViz.

tree.print()
*Node:internal.0 [occupied: 1587, fpr: 6.4e-08]
*Node:internal.2 [occupied: 1020, fpr: 1.1e-08]
**Leaf:lividus-SRR1735498.sig -> lividus-SRR1735498.sig
*Node:internal.5 [occupied: 787, fpr: 3.8e-09]
**Leaf:lividus-SRR1735497.sig -> lividus-SRR1735497.sig
**Leaf:lividus-SRR1735496.sig -> lividus-SRR1735496.sig
*Node:internal.1 [occupied: 1207, fpr: 2.1e-08]
*Node:internal.4 [occupied: 750, fpr: 3.2e-09]
**Leaf:lividus-SRR1735499.sig -> lividus-SRR1735499.sig
**Leaf:lividus-SRR1735501.sig -> lividus-SRR1735501.sig
*Node:internal.3 [occupied: 848, fpr: 5.2e-09]
**Leaf:lividus-SRR1735500.sig -> lividus-SRR1735500.sig
**Leaf:lividus-SRR1664663.sig -> lividus-SRR1664663.sig
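The fpr column looks consistent with the textbook Bloom filter estimate (occupied / table_size) ** n_tables, given the GraphFactory parameters used above (4 tables of 1e5 entries each; this is my interpretation of the number, not a formula taken from the khmer source). Checking it against internal.0:

```python
# Assumed meaning of GraphFactory(31, 1e5, 4): k=31, 4 tables of 1e5 entries.
occupied, table_size, n_tables = 1587, 1e5, 4

# Chance that an absent value hits an occupied slot in every table.
fpr = (occupied / table_size) ** n_tables
print(fpr)  # ~6.34e-08, close to the 6.4e-08 printed for internal.0
```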
%%capture dag
tree.print_dot()
with open('dag.dot', 'w') as f:
    f.write(dag.stdout)
!dot -Tpng -Nshape=ellipse dag.dot > tree.png
Image("tree.png")
The find method needs a search function. In our case it is the search_minhashes function, defined in sbtmh:
from inspect import getsource
print(getsource(search_minhashes))
def search_minhashes(node, sig, threshold, results=None):
    mins = sig.estimator.mh.get_mins()
    if isinstance(node, SigLeaf):
        matches = node.data.estimator.count_common(sig.estimator)
    else:  # Node or Leaf, Nodegraph by minhash comparison
        matches = sum(1 for value in mins if node.data.get(value))
    if results is not None:
        results[node.name] = matches / len(mins)
    if matches / len(mins) >= threshold:
        return 1
    return 0
There are two cases to consider: is the node a SigLeaf (another MinHash) or a Nodegraph (an internal node)? Both do the same thing (count how many values are in the intersection), but need to use the appropriate method from each class. results can be passed to keep track of intermediary results (to see which nodes were searched), but more about this later.
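The reason this works as a tree search: an internal Nodegraph contains every value stored anywhere below it, so the match fraction at a node is an upper bound for its whole subtree, and subtrees scoring below the threshold can be skipped. A self-contained toy version of the traversal (my sketch of the behaviour, not the actual find code in sbt.py), using plain sets instead of Nodegraphs:

```python
def toy_find(node, query, threshold, results):
    """Depth-first search with subtree pruning; node is a dict with
    'name', 'values' (a set) and 'children' (a list of nodes)."""
    matched = len(node['values'] & query) / len(query)
    if matched < threshold:
        return  # prune: no descendant can score higher than its ancestor
    if not node['children']:
        results.append(node['name'])
    for child in node['children']:
        toy_find(child, query, threshold, results)

# Two leaves under a root that stores the union of their values.
leaf_a = {'name': 'a', 'values': {1, 2, 3}, 'children': []}
leaf_b = {'name': 'b', 'values': {4, 5, 6}, 'children': []}
root = {'name': 'root', 'values': {1, 2, 3, 4, 5, 6},
        'children': [leaf_a, leaf_b]}

results = []
toy_find(root, {1, 2}, 0.5, results)
print(results)  # ['a']: only leaf a reaches the threshold
```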
Finally, we can pass search_minhashes to the find method:
print('*' * 60)
print("{}:".format(sig_to_search))
filtered = tree.find(search_minhashes, to_search, 0.1)
matches = [(str(s.metadata), s.data.similarity(to_search))
           for s in filtered]
print(*matches, sep='\n')
************************************************************
urchin/lividus-SRR1735497.sig:
('lividus-SRR1735498.sig', 0.47999998927116394)
('lividus-SRR1664663.sig', 0.41600000858306885)
('lividus-SRR1735500.sig', 0.4059999883174896)
('lividus-SRR1735501.sig', 0.3619999885559082)
('lividus-SRR1735499.sig', 0.4580000042915344)
('lividus-SRR1735496.sig', 0.421999990940094)
('lividus-SRR1735497.sig', 1.0)
tree.save('urchin')
'urchin.sbt.json'
tree = SBT.load('urchin.sbt.json', leaf_loader=SigLeaf.load)
print('*' * 60)
print("{}:".format(sig_to_search))
load_filtered = tree.find(search_minhashes, to_search, 0.1)
load_matches = [(str(s.metadata), s.data.similarity(to_search))
                for s in load_filtered]
print(*load_matches, sep='\n')
************************************************************
urchin/lividus-SRR1735497.sig:
('lividus-SRR1735498.sig', 0.47999998927116394)
('lividus-SRR1664663.sig', 0.41600000858306885)
('lividus-SRR1735500.sig', 0.4059999883174896)
('lividus-SRR1735501.sig', 0.3619999885559082)
('lividus-SRR1735499.sig', 0.4580000042915344)
('lividus-SRR1735496.sig', 0.421999990940094)
('lividus-SRR1735497.sig', 1.0)
And we can see that the results before saving the tree and after loading it are the same:
set(matches) == set(load_matches)
True
We'll use the urchin/purpuratus* signatures, since there are more of them than urchin/lividus* and so they generate more pronounced differences in tree layout (while still being easy to visualize without huge images, as happens when using all the urchin signatures).
sig_to_search = "urchin/purpuratus-SRR1012313.sig"
with open(sig_to_search, 'r') as data:
    to_search = signature.load_signatures(data)[0]
trees = {}
for d in (2, 5, 10):
    trees[d] = SBT(factory, d=d)
We can read all signatures once and add them to each tree (instead of re-reading the signatures each time we build a tree):
for f in glob("urchin/purpuratus*.sig"):
    with open(f, 'r') as data:
        sig = signature.load_signatures(data)
    leaf = SigLeaf(os.path.basename(f), sig[0])
    for d in (2, 5, 10):
        trees[d].add_node(leaf)
%%capture dag
trees[2].print_dot()
with open('dag.dot', 'w') as f:
    f.write(dag.stdout)
!twopi -Tpng -Granksep=5 dag.dot > tree.png
Image("tree.png")
%%capture dag
trees[5].print_dot()
with open('dag.dot', 'w') as f:
    f.write(dag.stdout)
!twopi -Tpng -Granksep=5 dag.dot > tree.png
Image("tree.png")
%%capture dag
trees[10].print_dot()
with open('dag.dot', 'w') as f:
    f.write(dag.stdout)
!twopi -Granksep=5 -Tpng -Nshape=ellipse dag.dot > tree.png
Image("tree.png")
print("Internal nodes:")
for d in sorted(trees):
    internal = sum(1 for node in trees[d].nodes if isinstance(node, Node))
    total = sum(1 for node in trees[d].nodes if node is not None)
    print("{}-ary: {}/{} nodes ({:.2f}%)".format(
        d, internal, total, internal / total * 100))
Internal nodes:
2-ary: 52/105 nodes (49.52%)
5-ary: 13/66 nodes (19.70%)
10-ary: 6/59 nodes (10.17%)
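These fractions follow from tree shape: in a complete n-ary tree each internal node accounts for n - 1 additional leaves, so with L leaves there are roughly (L - 1) / (n - 1) internal nodes, about 1/n of the total. A quick consistency check (my arithmetic, not code from the notebook) against the 53 purpuratus leaves:

```python
import math

leaves = 53  # purpuratus signatures: total nodes minus internal nodes above
for n in (2, 5, 10):
    # (L - 1) / (n - 1) internal nodes in a complete n-ary tree, rounded up
    internal = math.ceil((leaves - 1) / (n - 1))
    print("{}-ary: {} internal nodes".format(n, internal))
# 2-ary: 52, 5-ary: 13, 10-ary: 6 -- matching the counts printed above
```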
trees = {}
for d in (2, 5, 10):
    trees[d] = SBT(factory, d=d)

for f in glob("urchin/*.sig"):
    with open(f, 'r') as data:
        sig = signature.load_signatures(data)
    leaf = SigLeaf(os.path.basename(f), sig[0])
    for d in (2, 5, 10):
        trees[d].add_node(leaf)
print('*' * 60)
print("{}:".format(sig_to_search))
for d in trees:
    print(*[(str(s.metadata), s.data.similarity(to_search))
            for s in trees[d].find(search_minhashes, to_search, 0.2)],
          sep='\n')
    print()
************************************************************
urchin/purpuratus-SRR1012313.sig:
('purpuratus-SRR1012339.sig', 0.4300000071525574)
('purpuratus-SRR1012313.sig', 1.0)
('purpuratus-SRR1765910.sig', 0.32199999690055847)
('purpuratus-SRR1765938.sig', 0.33399999141693115)
('purpuratus-SRR1012340.sig', 0.3700000047683716)

('purpuratus-SRR1012339.sig', 0.4300000071525574)
('purpuratus-SRR1765938.sig', 0.33399999141693115)
('purpuratus-SRR1012313.sig', 1.0)
('purpuratus-SRR1765910.sig', 0.32199999690055847)
('purpuratus-SRR1012340.sig', 0.3700000047683716)

('purpuratus-SRR1012339.sig', 0.4300000071525574)
('purpuratus-SRR1765938.sig', 0.33399999141693115)
('purpuratus-SRR1012313.sig', 1.0)
('purpuratus-SRR1765910.sig', 0.32199999690055847)
('purpuratus-SRR1012340.sig', 0.3700000047683716)
Node expansion can be performed once all available spaces for a depth are used, but I need to figure out the correct formula for n > 2.

Done. Once all nodes are used, we can extend based on the current height $h$ and the $n$-arity:

$$h = 1 + \lfloor \log_n |nodes| \rfloor$$

And we need to expand by $n^h$ nodes. These are the expansions for $n=2, 5, 10$:

$$ n=2: 1 \rightarrow 3 \rightarrow 7 \rightarrow 15 \rightarrow 31 \rightarrow 63 \rightarrow 127 \rightarrow 255 \rightarrow 511 \\ n=5: 1 \rightarrow 6 \rightarrow 31 \rightarrow 156 \rightarrow 781 \\ n=10: 1 \rightarrow 11 \rightarrow 111 \rightarrow 1111 $$

But it is still worth checking alternatives: the node list expansion is quite dramatic for larger $n$, and many nodes end up unused (empty nodes).
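Those sequences can be reproduced directly from the formula. A small helper (mine, not part of sbt.py) that grows a flat node list the same way:

```python
def expansions(n, steps):
    """Node-list sizes for an n-ary SBT stored as a flat list, expanding
    by n ** h whenever it fills up, with h = 1 + floor(log_n |nodes|)."""
    sizes = [1]
    for _ in range(steps):
        nodes = sizes[-1]
        # Compute h = 1 + floor(log_n(nodes)) with integers, avoiding
        # floating-point edge cases near exact powers of n.
        h, power = 1, 1
        while power * n <= nodes:
            power *= n
            h += 1
        sizes.append(nodes + n ** h)
    return sizes

print(expansions(2, 8))   # [1, 3, 7, 15, 31, 63, 127, 255, 511]
print(expansions(5, 4))   # [1, 6, 31, 156, 781]
print(expansions(10, 3))  # [1, 11, 111, 1111]
```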
Might also adapt the save method to avoid writing a lot of None nodes, but that also needs rewriting the load method to create a long enough node list.
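One possible shape for that (an assumption about how it could be done, not the current save/load code): serialize only the occupied positions as an index-to-node mapping, and rebuild a sufficiently long flat list on load.

```python
def sparse_dump(nodes):
    # Keep only occupied slots; the indices preserve the flat-list layout.
    return {i: n for i, n in enumerate(nodes) if n is not None}

def sparse_load(dumped):
    # Recreate a list long enough to hold the highest occupied index.
    nodes = [None] * (max(dumped) + 1)
    for i, n in dumped.items():
        nodes[i] = n
    return nodes

nodes = ['root', 'a', None, 'b', None, None]
print(sparse_load(sparse_dump(nodes)))  # ['root', 'a', None, 'b']
```

Trailing empty slots are dropped; the loader would still need to extend the list when new leaves are added.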
for d in trees:
    useful = sum(1 for c in trees[d].nodes if c is not None)
    total = len(trees[d].nodes)
    print("{}: {}/{} ({})".format(d, useful, total, useful / total))
2: 405/511 (0.7925636007827789)
10: 226/1111 (0.20342034203420342)
5: 254/781 (0.32522407170294493)
Right now we just select the next available position and add it. This works well for the original SBT, but for the MinHash SBT it might be better to put similar datasets together. We can do that by searching the tree first when adding a node and following the path of greater similarity.
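A toy sketch of that idea (hypothetical, not implemented in sbt.py; plain sets and Jaccard similarity stand in for signatures): each node keeps the union of its subtree's values, and insertion descends toward the most similar child once a node is full.

```python
class ToyNode:
    def __init__(self):
        self.values = set()
        self.children = []

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def insert(node, leaf_values, max_children=2):
    node.values |= leaf_values  # nodes store the union of their subtree
    if len(node.children) < max_children:
        child = ToyNode()
        child.values = set(leaf_values)
        node.children.append(child)
    else:
        # Full: follow the path of greater similarity.
        best = max(node.children, key=lambda c: jaccard(c.values, leaf_values))
        insert(best, leaf_values, max_children)

root = ToyNode()
for values in ({1, 2, 3}, {7, 8, 9}, {1, 2, 4}, {7, 8, 10}):
    insert(root, values)

# Similar sets end up under the same subtree:
print([sorted(c.values) for c in root.children])  # [[1, 2, 3, 4], [7, 8, 9, 10]]
```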
Possible issues:

sourmash search

#!cd urchin && /usr/bin/time sourmash search --threshold 0.1 urchin/lividus-SRR1735497.sig urchin/*.sig