LCA_Database
API¶LCA_Database
objects combine a fast in-memory storage of signatures
indexed by their hash values, with taxonomic lineage storage. They are
limited to storing scaled DNA signatures with a single ksize; the scaled
and ksize values are specified at creation.
You can run this notebook interactively via mybinder; click on this button:
A rendered version of this notebook is available at sourmash.readthedocs.io under "Tutorials and notebooks".
You can also get this notebook from the doc/ subdirectory of the sourmash github repository. See binder/environment.yaml for installation dependencies.
This is a Jupyter Notebook using Python 3. If you are running this via binder, you can use Shift-ENTER to run cells, and double click on code cells to edit them.
Contact: C. Titus Brown, ctbrown@ucdavis.edu. Please file issues on GitHub if you have any questions or comments!
LCA_Database
object¶Create an LCA_Database
like so:
import sourmash
db = sourmash.lca.LCA_Database(ksize=31, scaled=1000)
Create signatures for some genomes, load them, and add them:
!sourmash sketch dna -p k=31,scaled=1000 genomes/*
== This is sourmash version 4.8.2. == == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. == computing signatures for files: genomes/akkermansia.fa, genomes/shew_os185.fa, genomes/shew_os223.fa Computing a total of 1 signature(s) for each input. skipping genomes/akkermansia.fa - already done skipping genomes/shew_os185.fa - already done skipping genomes/shew_os223.fa - already done
sig1 = sourmash.load_one_signature('akkermansia.fa.sig', ksize=31)
sig2 = sourmash.load_one_signature('shew_os185.fa.sig', ksize=31)
sig3 = sourmash.load_one_signature('shew_os223.fa.sig', ksize=31)
db.insert(sig1, ident='akkermansia')
db.insert(sig2, ident='shew_os185')
db.insert(sig3, ident='shew_os223')
490
search
and gather
via the Index
API¶from pprint import pprint
pprint(db.search(sig1, threshold=0.1))
[Result(score=1.0, signature=SourmashSignature('CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete genome', 6822e0b7), location=None)]
pprint(db.search(sig2, threshold=0.1))
[Result(score=1.0, signature=SourmashSignature('NC_009665.1 Shewanella baltica OS185, complete genome', b47b13ef), location=None), Result(score=0.22846441947565543, signature=SourmashSignature('NC_011663.1 Shewanella baltica OS223, complete genome', ae6659f6), location=None)]
pprint(db.best_containment(sig3))
Result(score=1.0, signature=SourmashSignature('NC_011663.1 Shewanella baltica OS223, complete genome', ae6659f6), location=None)
signatures()
¶for i in db.signatures():
print(i)
CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete genome NC_009665.1 Shewanella baltica OS185, complete genome NC_011663.1 Shewanella baltica OS223, complete genome
The list of (unique) identifiers in the database can be accessed via the attribute ident_to_idx
, which maps to integer identifiers; identifiers can also retrieve full names, which are taken from sig.name()
upon insertion.
pprint(db._ident_to_idx.keys())
dict_keys(['akkermansia', 'shew_os185', 'shew_os223'])
pprint(db._ident_to_name)
{'akkermansia': 'CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete ' 'genome', 'shew_os185': 'NC_009665.1 Shewanella baltica OS185, complete genome', 'shew_os223': 'NC_011663.1 Shewanella baltica OS223, complete genome'}
The attribute _hashval_to_idx
contains a mapping from individual hash values to sets of idx
indices.
See the method _find_signatures()
for an example of how this is used in search
and gather
.
print('{} hash values total in this database'.format(len(db._hashval_to_idx)))
1300 hash values total in this database
all_idx = set()
for idx_set in db._hashval_to_idx.values():
all_idx.update(idx_set)
print('belonging to signatures with idx {}'.format(all_idx))
belonging to signatures with idx {0, 1, 2}
first_three_hashvals = list(db._hashval_to_idx)[:3]
for hashval in first_three_hashvals:
print('hashval {} belongs to idxs {}'.format(hashval, db._hashval_to_idx[hashval]))
hashval 17302105753387 belongs to idxs {0} hashval 95741036335406 belongs to idxs {0} hashval 165640715598232 belongs to idxs {0}
query_idx = 2
hashval_set = set()
for hashval, idx_set in db._hashval_to_idx.items():
if query_idx in idx_set:
hashval_set.add(hashval)
print('{} hashvals belong to query idx {}'.format(len(hashval_set), query_idx))
ident = db._idx_to_ident[query_idx]
print('query idx {} matches to ident {}'.format(query_idx, ident))
name = db._ident_to_name[ident]
print('query idx {} matches to name {}'.format(query_idx, name))
490 hashvals belong to query idx 2 query idx 2 matches to ident shew_os223 query idx 2 matches to name NC_011663.1 Shewanella baltica OS223, complete genome
from sourmash.lca.lca_utils import LineagePair
superkingdom = LineagePair('superkingdom', 'Bacteria')
phylum = LineagePair('phylum', 'Verrucomicrobia')
klass = LineagePair('class', 'Verrucomicrobiae')
lineage = (superkingdom, phylum, klass)
db = sourmash.lca.LCA_Database(ksize=31, scaled=1000)
db.insert(sig1, lineage=lineage)
499
# by default, the identifier is the signature name --
ident = sig1.name
idx = db._ident_to_idx[ident]
print("ident '{}' has idx {}".format(ident, idx))
lid = db._idx_to_lid[idx]
print("lid for idx {} is {}".format(idx, lid))
lineage = db._lid_to_lineage[lid]
display = sourmash.lca.display_lineage(lineage)
print("lineage for lid {} is {}".format(lid, display))
ident 'CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete genome' has idx 0 lid for idx 0 is 0 lineage for lid 0 is Bacteria;Verrucomicrobia;Verrucomicrobiae
While sourmash lineage functions can work with any taxonomy ranks and any taxonomy names, both the NCBI and GTDB taxonomies use superkingdom/phylum/etc, so there is a hard coded list availalbe via sourmash.lca.taxlist()
.
print(list(sourmash.lca.taxlist()))
['superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'strain']
Given a taxonomy as a list, you can then construct a lineage like so:
linstr1 = ["Bacteria", "Verrucomicrobia", "Verrucomicrobiae",
"Verrucomicrobiales", "Akkermansiaceae", "Akkermansia",
"Akkermansia muciniphila", "Akkermansia muciniphila ATCC BAA-835"]
lineage1 = [ LineagePair(*pair) for pair in zip(sourmash.lca.taxlist(), linstr1) ]
pprint(lineage1)
[LineagePair(rank='superkingdom', name='Bacteria'), LineagePair(rank='phylum', name='Verrucomicrobia'), LineagePair(rank='class', name='Verrucomicrobiae'), LineagePair(rank='order', name='Verrucomicrobiales'), LineagePair(rank='family', name='Akkermansiaceae'), LineagePair(rank='genus', name='Akkermansia'), LineagePair(rank='species', name='Akkermansia muciniphila'), LineagePair(rank='strain', name='Akkermansia muciniphila ATCC BAA-835')]
# display lineages as strings with 'sourmash.lca.display_lineage()'
sourmash.lca.display_lineage(lineage1)
'Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835'
The LCA functionality available in sourmash is built around some simple lineage manipulation functions -- build_tree
and find_lca
.
First, let's define some more lineages --
linstr2 = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
"Alteromonadales", "Shewanellaceae", "Shewanella",
"Shewanella baltica", "Shewanella baltica OS185"]
lineage2 = [ LineagePair(*pair) for pair in zip(sourmash.lca.taxlist(), linstr2) ]
linstr3 = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
"Alteromonadales", "Shewanellaceae", "Shewanella",
"Shewanella baltica", "Shewanella baltica OS223"]
lineage3 = [ LineagePair(*pair) for pair in zip(sourmash.lca.taxlist(), linstr3) ]
print('lineage2 is', sourmash.lca.display_lineage(lineage2))
print('lineage3 is', sourmash.lca.display_lineage(lineage3))
lineage2 is Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS185 lineage3 is Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS223
Now, build a tree structure that collapses these lineages where it can, and run some LCA analyses. Lineages 1 and 2 collapse to superkingdom:
tree = sourmash.lca.build_tree([lineage1, lineage2])
sourmash.lca.find_lca(tree)
((LineagePair(rank='superkingdom', name='Bacteria'),), 2)
while lineages 2 and 3 collapse to species:
tree = sourmash.lca.build_tree([lineage2, lineage3])
sourmash.lca.find_lca(tree)
((LineagePair(rank='superkingdom', name='Bacteria'), LineagePair(rank='phylum', name='Proteobacteria'), LineagePair(rank='class', name='Gammaproteobacteria'), LineagePair(rank='order', name='Alteromonadales'), LineagePair(rank='family', name='Shewanellaceae'), LineagePair(rank='genus', name='Shewanella'), LineagePair(rank='species', name='Shewanella baltica')), 2)
First, let's create a database from 3 signatures, and this time we'll store lineages in there:
db = sourmash.lca.LCA_Database(ksize=31, scaled=1000)
db.insert(sig1, lineage=lineage1)
db.insert(sig2, lineage=lineage2)
db.insert(sig3, lineage=lineage3)
490
Now, for any collection of hashes, you can retrieve all the lineage assignments into a dictionary:
{ hashval: set of lineages }
assignments = sourmash.lca.gather_assignments(sig2.minhash.get_mins(), [db])
print('num hashvals:', len(assignments))
num hashvals: 494
/var/folders/6s/_f373w1d6hdfjc2kjstq97s80000gp/T/ipykernel_3384/490137846.py:1: DeprecatedWarning: get_mins is deprecated as of 3.5 and will be removed in 5.0. Use .hashes property instead. assignments = sourmash.lca.gather_assignments(sig2.minhash.get_mins(), [db])
For example, this particular hashvalue belongs to two different lineages:
assignments[196037984804395]
{(LineagePair(rank='superkingdom', name='Bacteria'), LineagePair(rank='phylum', name='Proteobacteria'), LineagePair(rank='class', name='Gammaproteobacteria'), LineagePair(rank='order', name='Alteromonadales'), LineagePair(rank='family', name='Shewanellaceae'), LineagePair(rank='genus', name='Shewanella'), LineagePair(rank='species', name='Shewanella baltica'), LineagePair(rank='strain', name='Shewanella baltica OS185')), (LineagePair(rank='superkingdom', name='Bacteria'), LineagePair(rank='phylum', name='Proteobacteria'), LineagePair(rank='class', name='Gammaproteobacteria'), LineagePair(rank='order', name='Alteromonadales'), LineagePair(rank='family', name='Shewanellaceae'), LineagePair(rank='genus', name='Shewanella'), LineagePair(rank='species', name='Shewanella baltica'), LineagePair(rank='strain', name='Shewanella baltica OS223'))}
sourmash also includes functions to summarize assignments by counting the number of
times a particular lineage occurs; this is
used by sourmash lca classify
and sourmash lca summarize
.
counter = sourmash.lca.count_lca_for_assignments(assignments)
# count_lca_for_assignments returns a collections.Counter object
for lineage, count in counter.most_common():
print('{} hashes have LCA: {}'.format(count, sourmash.lca.display_lineage(lineage)))
311 hashes have LCA: Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS185 183 hashes have LCA: Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica