You can run this notebook interactively via mybinder.
A rendered version of this notebook is available at sourmash.readthedocs.io under "Tutorials and notebooks".
You can also get this notebook from the doc/ subdirectory of the sourmash GitHub repository. See binder/environment.yaml for installation dependencies.
This is a Jupyter Notebook using Python 3. If you are running this via binder, you can use Shift-ENTER to run cells, and double-click on code cells to edit them.
Contact: C. Titus Brown, ctbrown@ucdavis.edu. Please file issues on GitHub if you have any questions or comments!
!mkdir -p big_genomes
!curl -L https://osf.io/8uxj9/?action=download | (cd big_genomes && tar xzf -)
/Users/t/dev/sourmash/doc/big_genomes % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 459 100 459 0 0 750 0 --:--:-- --:--:-- --:--:-- 750 100 61.1M 100 61.1M 0 0 2966k 0 0:00:21 0:00:21 --:--:-- 3496k
!cd big_genomes/ && sourmash compute -k 31 --scaled=1000 --name-from-first *.fa
/Users/t/dev/sourmash/doc/big_genomes == This is sourmash version 2.0.0a12.dev48+ga92289b. == == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. == setting num_hashes to 0 because --scaled is set computing signatures for files: 0.fa, 1.fa, 10.fa, 11.fa, 12.fa, 13.fa, 14.fa, 15.fa, 16.fa, 17.fa, 18.fa, 19.fa, 2.fa, 20.fa, 21.fa, 22.fa, 23.fa, 24.fa, 25.fa, 26.fa, 27.fa, 28.fa, 29.fa, 3.fa, 30.fa, 31.fa, 32.fa, 33.fa, 34.fa, 35.fa, 36.fa, 37.fa, 38.fa, 39.fa, 4.fa, 40.fa, 41.fa, 42.fa, 43.fa, 44.fa, 45.fa, 46.fa, 47.fa, 48.fa, 49.fa, 5.fa, 50.fa, 51.fa, 52.fa, 53.fa, 54.fa, 55.fa, 56.fa, 57.fa, 58.fa, 59.fa, 6.fa, 60.fa, 61.fa, 62.fa, 63.fa, 7.fa, 8.fa, 9.fa Computing signature for ksizes: [31] Computing only nucleotide (and not protein) signatures. Computing a total of 1 signature(s). ... reading sequences from 0.fa calculated 1 signatures for 1 sequences in 0.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 1.fa calculated 1 signatures for 1 sequences in 1.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 10.fa calculated 1 signatures for 1 sequences in 10.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 11.fa calculated 1 signatures for 1 sequences in 11.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 12.fa calculated 1 signatures for 1 sequences in 12.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 13.fa calculated 1 signatures for 1 sequences in 13.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 14.fa calculated 1 signatures for 1 sequences in 14.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 15.fa calculated 1 signatures for 1 sequences in 15.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 16.fa calculated 1 signatures for 4 sequences in 16.fa saved 1 signature(s). 
Note: signature license is CC0. ... reading sequences from 17.fa calculated 1 signatures for 2 sequences in 17.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 18.fa calculated 1 signatures for 1 sequences in 18.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 19.fa calculated 1 signatures for 9 sequences in 19.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 2.fa calculated 1 signatures for 1 sequences in 2.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 20.fa calculated 1 signatures for 1 sequences in 20.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 21.fa calculated 1 signatures for 1 sequences in 21.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 22.fa calculated 1 signatures for 1 sequences in 22.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 23.fa calculated 1 signatures for 5 sequences in 23.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 24.fa calculated 1 signatures for 3 sequences in 24.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 25.fa calculated 1 signatures for 1 sequences in 25.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 26.fa calculated 1 signatures for 1 sequences in 26.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 27.fa calculated 1 signatures for 1 sequences in 27.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 28.fa calculated 1 signatures for 3 sequences in 28.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 29.fa calculated 1 signatures for 1 sequences in 29.fa saved 1 signature(s). Note: signature license is CC0. ... 
reading sequences from 3.fa calculated 1 signatures for 1 sequences in 3.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 30.fa calculated 1 signatures for 1 sequences in 30.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 31.fa calculated 1 signatures for 1 sequences in 31.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 32.fa calculated 1 signatures for 1 sequences in 32.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 33.fa calculated 1 signatures for 1 sequences in 33.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 34.fa calculated 1 signatures for 1 sequences in 34.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 35.fa calculated 1 signatures for 7 sequences in 35.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 36.fa calculated 1 signatures for 1 sequences in 36.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 37.fa calculated 1 signatures for 1 sequences in 37.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 38.fa calculated 1 signatures for 1 sequences in 38.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 39.fa calculated 1 signatures for 1 sequences in 39.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 4.fa calculated 1 signatures for 1 sequences in 4.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 40.fa calculated 1 signatures for 1 sequences in 40.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 41.fa calculated 1 signatures for 1 sequences in 41.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 42.fa calculated 1 signatures for 1 sequences in 42.fa saved 1 signature(s). 
Note: signature license is CC0. ... reading sequences from 43.fa calculated 1 signatures for 1 sequences in 43.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 44.fa calculated 1 signatures for 2 sequences in 44.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 45.fa calculated 1 signatures for 1 sequences in 45.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 46.fa calculated 1 signatures for 1 sequences in 46.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 47.fa calculated 1 signatures for 2 sequences in 47.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 48.fa calculated 1 signatures for 1 sequences in 48.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 49.fa calculated 1 signatures for 228 sequences in 49.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 5.fa calculated 1 signatures for 1 sequences in 5.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 50.fa calculated 1 signatures for 1 sequences in 50.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 51.fa calculated 1 signatures for 1 sequences in 51.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 52.fa calculated 1 signatures for 1 sequences in 52.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 53.fa calculated 1 signatures for 1 sequences in 53.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 54.fa calculated 1 signatures for 1 sequences in 54.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 55.fa calculated 1 signatures for 1 sequences in 55.fa saved 1 signature(s). Note: signature license is CC0. ... 
reading sequences from 56.fa calculated 1 signatures for 1 sequences in 56.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 57.fa calculated 1 signatures for 1 sequences in 57.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 58.fa calculated 1 signatures for 30 sequences in 58.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 59.fa calculated 1 signatures for 5 sequences in 59.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 6.fa calculated 1 signatures for 76 sequences in 6.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 60.fa calculated 1 signatures for 11 sequences in 60.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 61.fa calculated 1 signatures for 47 sequences in 61.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 62.fa calculated 1 signatures for 1 sequences in 62.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 63.fa calculated 1 signatures for 4 sequences in 63.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 7.fa calculated 1 signatures for 3 sequences in 7.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 8.fa calculated 1 signatures for 1 sequences in 8.fa saved 1 signature(s). Note: signature license is CC0. ... reading sequences from 9.fa calculated 1 signatures for 3 sequences in 9.fa saved 1 signature(s). Note: signature license is CC0.
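The log above notes "setting num_hashes to 0 because --scaled is set": with --scaled=1000, the signature keeps a fixed fraction of k-mer hashes (roughly 1 in 1000) rather than a fixed number. The sketch below illustrates that idea only. It assumes a 64-bit hash space, uses hashlib's BLAKE2 as a stand-in for whatever hash function sourmash really uses, and ignores reverse complements; the function names are made up for this example.

```python
import hashlib


def kmers(seq, k):
    """Yield every k-mer (substring of length k) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Stand-in 64-bit hash; NOT the hash function sourmash uses."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_sketch(seq, k=31, scaled=1000):
    """Keep only the hashes below 2**64 / scaled (about 1 in `scaled` k-mers)."""
    max_hash = 2**64 // scaled
    return {h for h in map(hash64, kmers(seq, k)) if h < max_hash}
```

Because every sketch uses the same cutoff, a sketch built with a larger scaled value is a subset of one built with a smaller value, which is what makes downsampling possible.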
!sourmash compare big_genomes/*.sig -o compare_all.mat
!sourmash plot compare_all.mat
== This is sourmash version 2.0.0a12.dev48+ga92289b. == == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. == loaded 64 signatures total. downsampling to scaled value of 1000 min similarity in matrix: 0.000 saving labels to: compare_all.mat.labels.txt saving distance matrix to: compare_all.mat == This is sourmash version 2.0.0a12.dev48+ga92289b. == == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. == loading comparison matrix from compare_all.mat... ...got 64 x 64 matrix. loading labels from compare_all.mat.labels.txt saving histogram of matrix values => compare_all.mat.hist.png wrote dendrogram to: compare_all.mat.dendro.png wrote numpy distance matrix to: compare_all.mat.matrix.png
from IPython.display import Image
Image(filename='compare_all.mat.matrix.png')
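Besides plotting, the comparison matrix can be inspected directly. The sketch below assumes that compare_all.mat is stored in numpy's binary .npy format and that compare_all.mat.labels.txt holds one label per line, in matrix order. Both are assumptions about the file layout, and load_comparison and top_pairs are hypothetical helper names.

```python
def load_comparison(prefix):
    """Load the matrix and labels written by 'sourmash compare -o prefix'.

    Assumption: the matrix is a numpy .npy file and the labels file has one
    name per line, in the same order as the matrix rows.
    """
    import numpy  # imported here so top_pairs() below works without numpy

    with open(prefix, "rb") as fp:
        matrix = numpy.load(fp)
    with open(prefix + ".labels.txt") as fp:
        labels = [line.rstrip("\n") for line in fp]
    return matrix, labels


def top_pairs(matrix, labels, n=5):
    """Return the n most similar pairs of distinct samples as (similarity, a, b)."""
    pairs = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            pairs.append((float(matrix[i][j]), labels[i], labels[j]))
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs[:n]
```

If the file-format assumption holds, top_pairs(*load_comparison('compare_all.mat')) would list the closest genome pairs in the 64 x 64 matrix.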
!sourmash index -k 31 all-genomes big_genomes/*.sig
== This is sourmash version 2.0.0a12.dev48+ga92289b. == == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. == loading 64 files into SBT reading from big_genomes/9.fa.sig (63 signatures so far)) loaded 64 sigs; saving SBT under "all-genomes" 127 of 127 nodes saved Finished saving nodes, now saving SBT json file.
You can now use this index (the SBT saved under "all-genomes") with sourmash search and sourmash gather.
!sourmash search shew_os185.fa.sig all-genomes --threshold=0.001
== This is sourmash version 2.0.0a12.dev48+ga92289b. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
selecting default query k=31.
loaded query: NC_009665.1 Shewanella baltica... (k=31, DNA)
loaded 1 databases.
2 matches:
similarity   match
----------   -----
  9.5%       NC_009665.1 Shewanella baltica OS185, complete genome
  4.4%       NC_011663.1 Shewanella baltica OS223, complete genome
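The similarity column can be read as a Jaccard similarity between the query's hashes and each match's hashes, filtered by --threshold. As a minimal sketch under that assumption, the code below uses plain Python sets and a linear scan instead of an SBT; jaccard and search are illustrative names, not sourmash functions.

```python
def jaccard(a, b):
    """Jaccard similarity of two hash sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search(query, database, threshold=0.001):
    """Return (similarity, name) for entries at or above threshold, best first.

    `database` maps a name to a collection of hashes; every entry is scanned,
    with no index to speed things up.
    """
    hits = [(jaccard(query, hashes), name) for name, hashes in database.items()]
    return sorted((hit for hit in hits if hit[0] >= threshold), reverse=True)
```

An index such as the SBT built above exists to avoid this linear scan; the ranking it returns should be the same in spirit.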
# (make fake metagenome again, just in case)
!cat genomes/*.fa > fake-metagenome.fa
!sourmash compute -k 31 --scaled=1000 fake-metagenome.fa
== This is sourmash version 2.0.0a12.dev48+ga92289b. == == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. == setting num_hashes to 0 because --scaled is set computing signatures for files: fake-metagenome.fa Computing signature for ksizes: [31] Computing only nucleotide (and not protein) signatures. Computing a total of 1 signature(s). skipping fake-metagenome.fa - already done
!sourmash gather fake-metagenome.fa.sig all-genomes
== This is sourmash version 2.0.0a12.dev48+ga92289b. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
select query k=31 automatically.
loaded query: fake-metagenome.fa... (k=31, DNA)
loaded 1 databases.
overlap     p_query p_match
---------   ------- -------
0.5 Mbp       42.2%   10.5%    NC_011663.1 Shewanella baltica OS223,...
499.0 kbp     38.4%   18.5%    CP001071.1 Akkermansia muciniphila AT...
0.5 Mbp       19.4%    4.9%    NC_009665.1 Shewanella baltica OS185,...
found 3 matches total;
the recovered matches hit 100.0% of the query
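The p_query values above sum to 100%, which suggests that each match is credited only with the part of the query not already explained by an earlier match. A greedy decomposition along those lines might look like the sketch below. The definitions of p_query and p_match used here are assumptions chosen for illustration and may differ from sourmash's own; gather is just a convenient name for the function.

```python
def gather(query, database):
    """Greedily decompose a query hash set into database matches.

    Returns (name, p_query, p_match) tuples, where p_query is the fraction of
    the original query newly explained by the match, and p_match is the
    fraction of the match's hashes found among the still-unexplained hashes.
    """
    query = set(query)
    remaining = set(query)
    results = []
    while remaining:
        # Pick the entry that covers the most still-unexplained query hashes.
        name, hashes = max(database.items(),
                           key=lambda item: len(remaining & set(item[1])))
        found = remaining & set(hashes)
        if not found:
            break  # nothing in the database explains what is left
        results.append((name, len(found) / len(query), len(found) / len(hashes)))
        remaining -= found
    return results
```

For example, with a query of the hashes 1 through 10 and three overlapping database entries, the entry with the largest overlap is reported first and later entries are credited only for what remains.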
To build a database that carries taxonomy (an LCA database), we need to provide a metadata file that maps each accession to its taxonomic lineage.
import pandas
df = pandas.read_csv('podar-lineage.csv')
df
| | accession | taxid | superkingdom | phylum | class | order | family | genus | species | strain |
---|---|---|---|---|---|---|---|---|---|---|
0 | AE000782 | 224325 | Archaea | Euryarchaeota | Archaeoglobi | Archaeoglobales | Archaeoglobaceae | Archaeoglobus | Archaeoglobus fulgidus | Archaeoglobus fulgidus DSM 4304 |
1 | NC_000909 | 243232 | Archaea | Euryarchaeota | Methanococci | Methanococcales | Methanocaldococcaceae | Methanocaldococcus | Methanocaldococcus jannaschii | Methanocaldococcus jannaschii DSM 2661 |
2 | NC_003272 | 103690 | Bacteria | Cyanobacteria | NaN | Nostocales | Nostocaceae | Nostoc | Nostoc sp. PCC 7120 | NaN |
3 | AE009441 | 178306 | Archaea | Crenarchaeota | Thermoprotei | Thermoproteales | Thermoproteaceae | Pyrobaculum | Pyrobaculum aerophilum | Pyrobaculum aerophilum str. IM2 |
4 | AE009950 | 186497 | Archaea | Euryarchaeota | Thermococci | Thermococcales | Thermococcaceae | Pyrococcus | Pyrococcus furiosus | Pyrococcus furiosus DSM 3638 |
5 | AE009951 | 190304 | Bacteria | Fusobacteria | Fusobacteriia | Fusobacteriales | Fusobacteriaceae | Fusobacterium | Fusobacterium nucleatum | NaN |
6 | AE010299 | 188937 | Archaea | Euryarchaeota | Methanomicrobia | Methanosarcinales | Methanosarcinaceae | Methanosarcina | Methanosarcina acetivorans | Methanosarcina acetivorans C2A |
7 | AE009439 | 190192 | Archaea | Euryarchaeota | Methanopyri | Methanopyrales | Methanopyraceae | Methanopyrus | Methanopyrus kandleri | Methanopyrus kandleri AV19 |
8 | NC_003911 | 246200 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodobacterales | Rhodobacteraceae | Ruegeria | Ruegeria pomeroyi | Ruegeria pomeroyi DSS-3 |
9 | AE006470 | 194439 | Bacteria | Chlorobi | Chlorobia | Chlorobiales | Chlorobiaceae | Chlorobaculum | Chlorobaculum tepidum | Chlorobaculum tepidum TLS |
10 | AE015928 | 226186 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Bacteroidaceae | Bacteroides | Bacteroides thetaiotaomicron | Bacteroides thetaiotaomicron VPI-5482 |
11 | AL954747 | 228410 | Bacteria | Proteobacteria | Betaproteobacteria | Nitrosomonadales | Nitrosomonadaceae | Nitrosomonas | Nitrosomonas europaea | Nitrosomonas europaea ATCC 19718 |
12 | BX119912 | 243090 | Bacteria | Planctomycetes | Planctomycetia | Planctomycetales | Planctomycetaceae | Rhodopirellula | Rhodopirellula baltica | Rhodopirellula baltica SH 1 |
13 | BX571656 | 273121 | Bacteria | Proteobacteria | Epsilonproteobacteria | Campylobacterales | Helicobacteraceae | Wolinella | Wolinella succinogenes | Wolinella succinogenes DSM 1740 |
14 | AE017180 | 243231 | Bacteria | Proteobacteria | Deltaproteobacteria | Desulfuromonadales | Geobacteraceae | Geobacter | Geobacter sulfurreducens | Geobacter sulfurreducens PCA |
15 | AE017226 | 243275 | Bacteria | Spirochaetes | Spirochaetia | Spirochaetales | Spirochaetaceae | Treponema | Treponema denticola | Treponema denticola ATCC 35405 |
16 | BX950229 | 267377 | Archaea | Euryarchaeota | Methanococci | Methanococcales | Methanococcaceae | Methanococcus | Methanococcus maripaludis | Methanococcus maripaludis S2 |
17 | AE017221 | 262724 | Bacteria | Deinococcus-Thermus | Deinococci | Thermales | Thermaceae | Thermus | Thermus thermophilus | Thermus thermophilus HB27 |
18 | BA000001 | 70601 | Archaea | Euryarchaeota | Thermococci | Thermococcales | Thermococcaceae | Pyrococcus | Pyrococcus horikoshii | Pyrococcus horikoshii OT3 |
19 | BA000023 | 273063 | Archaea | Crenarchaeota | Thermoprotei | Sulfolobales | Sulfolobaceae | Sulfolobus | Sulfolobus tokodaii | Sulfolobus tokodaii str. 7 |
20 | NC_007951 | 266265 | Bacteria | Proteobacteria | Betaproteobacteria | Burkholderiales | Burkholderiaceae | Paraburkholderia | Paraburkholderia xenovorans | Paraburkholderia xenovorans LB400 |
21 | CP000492 | 290317 | Bacteria | Chlorobi | Chlorobia | Chlorobiales | Chlorobiaceae | Chlorobium | Chlorobium phaeobacteroides | Chlorobium phaeobacteroides DSM 266 |
22 | NC_008751 | 391774 | Bacteria | Proteobacteria | Deltaproteobacteria | Desulfovibrionales | Desulfovibrionaceae | Desulfovibrio | Desulfovibrio vulgaris | Desulfovibrio vulgaris DP4 |
23 | CP000568 | 203119 | Bacteria | Firmicutes | Clostridia | Clostridiales | Ruminococcaceae | Ruminiclostridium | Ruminiclostridium thermocellum | Ruminiclostridium thermocellum ATCC 27405 |
24 | CP000561 | 410359 | Archaea | Crenarchaeota | Thermoprotei | Thermoproteales | Thermoproteaceae | Pyrobaculum | Pyrobaculum calidifontis | Pyrobaculum calidifontis JCM 11548 |
25 | CP000609 | 402880 | Archaea | Euryarchaeota | Methanococci | Methanococcales | Methanococcaceae | Methanococcus | Methanococcus maripaludis | Methanococcus maripaludis C5 |
26 | CP000607 | 290318 | Bacteria | Chlorobi | Chlorobia | Chlorobiales | Chlorobiaceae | Chlorobium | Chlorobium phaeovibrioides | Chlorobium phaeovibrioides DSM 265 |
27 | CP000660 | 340102 | Archaea | Crenarchaeota | Thermoprotei | Thermoproteales | Thermoproteaceae | Pyrobaculum | Pyrobaculum arsenaticum | Pyrobaculum arsenaticum DSM 13514 |
28 | CP000667 | 369723 | Bacteria | Actinobacteria | Actinobacteria | Micromonosporales | Micromonosporaceae | Salinispora | Salinispora tropica | Salinispora tropica CNB-440 |
29 | CP000679 | 351627 | Bacteria | Firmicutes | Clostridia | Thermoanaerobacterales | Thermoanaerobacterales Family III. Incertae Sedis | Caldicellulosiruptor | Caldicellulosiruptor saccharolyticus | Caldicellulosiruptor saccharolyticus DSM 8903 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
34 | CP000850 | 391037 | Bacteria | Actinobacteria | Actinobacteria | Micromonosporales | Micromonosporaceae | Salinispora | Salinispora arenicola | Salinispora arenicola CNS-205 |
35 | CP000909 | 324602 | Bacteria | Chloroflexi | Chloroflexia | Chloroflexales | Chloroflexaceae | Chloroflexus | Chloroflexus aurantiacus | Chloroflexus aurantiacus J-10-fl |
36 | CP000924 | 340099 | Bacteria | Firmicutes | Clostridia | Thermoanaerobacterales | Thermoanaerobacteraceae | Thermoanaerobacter | Thermoanaerobacter pseudethanolicus | Thermoanaerobacter pseudethanolicus ATCC 33223 |
37 | CP000969 | 126740 | Bacteria | Thermotogae | Thermotogae | Thermotogales | Thermotogaceae | Thermotoga | Thermotoga sp. RQ2 | NaN |
38 | CP001013 | 395495 | Bacteria | Proteobacteria | Betaproteobacteria | Burkholderiales | NaN | Leptothrix | Leptothrix cholodnii | Leptothrix cholodnii SP-6 |
39 | CP001071 | 349741 | Bacteria | Verrucomicrobia | Verrucomicrobiae | Verrucomicrobiales | Akkermansiaceae | Akkermansia | Akkermansia muciniphila | Akkermansia muciniphila ATCC BAA-835 |
40 | AP009380 | 431947 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas | Porphyromonas gingivalis | Porphyromonas gingivalis ATCC 33277 |
41 | NC_010730 | 436114 | Bacteria | Aquificae | Aquificae | Aquificales | Hydrogenothermaceae | Sulfurihydrogenibium | Sulfurihydrogenibium sp. YO3AOP1 | NaN |
42 | CP001097 | 290315 | Bacteria | Chlorobi | Chlorobia | Chlorobiales | Chlorobiaceae | Chlorobium | Chlorobium limicola | Chlorobium limicola DSM 245 |
43 | CP001110 | 324925 | Bacteria | Chlorobi | Chlorobia | Chlorobiales | Chlorobiaceae | Pelodictyon | Pelodictyon phaeoclathratiforme | Pelodictyon phaeoclathratiforme BU-1 |
44 | CP001130 | 380749 | Bacteria | Aquificae | Aquificae | Aquificales | Aquificaceae | Hydrogenobaculum | Hydrogenobaculum sp. Y04AAS1 | NaN |
45 | NZ_CH959311 | 52598 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodobacterales | Rhodobacteraceae | Sulfitobacter | Sulfitobacter sp. EE-36 | NaN |
46 | NZ_CH959317 | 314267 | Bacteria | Proteobacteria | Alphaproteobacteria | Rhodobacterales | Rhodobacteraceae | Sulfitobacter | Sulfitobacter sp. NAS-14.1 | NaN |
47 | CP001251 | 515635 | Bacteria | Dictyoglomi | Dictyoglomia | Dictyoglomales | Dictyoglomaceae | Dictyoglomus | Dictyoglomus turgidum | Dictyoglomus turgidum DSM 6724 |
48 | NC_011663 | 407976 | Bacteria | Proteobacteria | Gammaproteobacteria | Alteromonadales | Shewanellaceae | Shewanella | Shewanella baltica | Shewanella baltica OS223 |
49 | CP000916 | 309803 | Bacteria | Thermotogae | Thermotogae | Thermotogales | Thermotogaceae | Thermotoga | Thermotoga neapolitana | Thermotoga neapolitana DSM 4359 |
50 | NZ_DS996397 | 411464 | Bacteria | Proteobacteria | Deltaproteobacteria | Desulfovibrionales | Desulfovibrionaceae | Desulfovibrio | Desulfovibrio piger | Desulfovibrio piger ATCC 29098 |
51 | CP001230 | 123214 | Bacteria | Aquificae | Aquificae | Aquificales | Hydrogenothermaceae | Persephonella | Persephonella marina | Persephonella marina EX-H1 |
52 | CP001472 | 240015 | Bacteria | Acidobacteria | Acidobacteriia | Acidobacteriales | Acidobacteriaceae | Acidobacterium | Acidobacterium capsulatum | Acidobacterium capsulatum ATCC 51196 |
53 | AP009153 | 379066 | Bacteria | Gemmatimonadetes | Gemmatimonadetes | Gemmatimonadales | Gemmatimonadaceae | Gemmatimonas | Gemmatimonas aurantiaca | Gemmatimonas aurantiaca T-27 |
54 | CP001941 | 439481 | Archaea | Euryarchaeota | NaN | NaN | NaN | Aciduliprofundum | Aciduliprofundum boonei | Aciduliprofundum boonei T469 |
55 | NC_013968 | 309800 | Archaea | Euryarchaeota | Halobacteria | Haloferacales | Haloferacaceae | Haloferax | Haloferax volcanii | Haloferax volcanii DS2 |
56 | NZ_KE136524 | 226185 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Enterococcaceae | Enterococcus | Enterococcus faecalis | Enterococcus faecalis V583 |
57 | NZ_KQ961402 | 542 | Bacteria | Proteobacteria | Alphaproteobacteria | Sphingomonadales | Sphingomonadaceae | Zymomonas | Zymomonas mobilis | NaN |
58 | NZ_CP015081 | 243230 | Bacteria | Deinococcus-Thermus | Deinococci | Deinococcales | Deinococcaceae | Deinococcus | Deinococcus radiodurans | Deinococcus radiodurans R1 |
59 | NZ_ABZS01000228 | 432331 | Bacteria | Aquificae | Aquificae | Aquificales | Hydrogenothermaceae | Sulfurihydrogenibium | Sulfurihydrogenibium yellowstonense | Sulfurihydrogenibium yellowstonense SS-5 |
60 | NZ_JGWU01000001 | 1458259 | Bacteria | Proteobacteria | Betaproteobacteria | Burkholderiales | Alcaligenaceae | Bordetella | Bordetella bronchiseptica | Bordetella bronchiseptica D989 |
61 | NZ_FWDH01000003 | 31899 | Bacteria | Firmicutes | Clostridia | Thermoanaerobacterales | Thermoanaerobacterales Family III. Incertae Sedis | Caldicellulosiruptor | Caldicellulosiruptor bescii | NaN |
62 | NC_009972 | 316274 | Bacteria | Chloroflexi | Chloroflexia | Herpetosiphonales | Herpetosiphonaceae | Herpetosiphon | Herpetosiphon aurantiacus | Herpetosiphon aurantiacus DSM 785 |
63 | NC_005213 | 228908 | Archaea | Nanoarchaeota | NaN | Nanoarchaeales | Nanoarchaeaceae | Nanoarchaeum | Nanoarchaeum equitans | Nanoarchaeum equitans Kin4-M |
64 rows × 10 columns
!sourmash lca index podar-lineage.csv taxdb big_genomes/*.sig -C 3 --split-identifiers
== This is sourmash version 2.0.0a12.dev48+ga92289b. == == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. == examining spreadsheet headers... ** assuming column 'accession' is identifiers in spreadsheet 64 distinct identities in spreadsheet out of 64 rows. 64 distinct lineages in spreadsheet out of 64 rows. 64 assigned lineages out of 64 distinct lineages in spreadsheet. 64) 64 identifiers used out of 64 distinct identifiers in spreadsheet. saving to LCA DB: taxdb.lca.json
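To make the accession-to-lineage mapping concrete, here is a small sketch that builds such a lookup from a lineage CSV and reduces a signature name to the accession used as the key. It uses the standard library csv module rather than pandas, reads two rows copied from the table above from an in-memory string, and assumes that --split-identifiers keeps the text before the first space and that the version suffix (".1") is dropped too. signature_ident and lineage_map are hypothetical helper names.

```python
import csv
import io

RANKS = ["superkingdom", "phylum", "class", "order",
         "family", "genus", "species", "strain"]


def signature_ident(name):
    """Reduce a signature name to a bare accession.

    Assumption: the identifier is the text before the first space, with any
    '.version' suffix removed.
    """
    return name.split(" ")[0].split(".")[0]


def lineage_map(csv_file):
    """Map each accession to its lineage as (rank, name) pairs, skipping blanks."""
    mapping = {}
    for row in csv.DictReader(csv_file):
        mapping[row["accession"]] = tuple(
            (rank, row[rank]) for rank in RANKS if row[rank])
    return mapping


# Two rows from the table above; blank fields stand for the NaN entries.
example_csv = io.StringIO(
    "accession,taxid,superkingdom,phylum,class,order,family,genus,species,strain\n"
    "NC_011663,407976,Bacteria,Proteobacteria,Gammaproteobacteria,Alteromonadales,"
    "Shewanellaceae,Shewanella,Shewanella baltica,Shewanella baltica OS223\n"
    "NC_003272,103690,Bacteria,Cyanobacteria,,Nostocales,Nostocaceae,Nostoc,"
    "Nostoc sp. PCC 7120,\n"
)
lineages = lineage_map(example_csv)
ident = signature_ident("NC_011663.1 Shewanella baltica OS223, complete genome")
print(ident, lineages[ident][-1])
```

Keeping the rank name next to each value means a missing rank (such as the class of Nostoc sp. PCC 7120) can be skipped without shifting the remaining ranks.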
This database 'taxdb.lca.json' can be used for search and gather as above:
!sourmash gather fake-metagenome.fa.sig taxdb.lca.json
== This is sourmash version 2.0.0a12.dev48+ga92289b. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
select query k=31 automatically.
loaded query: fake-metagenome.fa... (k=31, DNA)
loaded 1 databases.
overlap     p_query p_match
---------   ------- -------
0.6 Mbp       46.7%   11.6%    NC_011663.1 Shewanella baltica OS223,...
0.5 Mbp       38.7%   19.3%    CP001071.1 Akkermansia muciniphila AT...
0.5 Mbp       14.6%    3.9%    NC_009665.1 Shewanella baltica OS185,...
found 3 matches total;
the recovered matches hit 100.0% of the query
...but can also be used for taxonomic summarization:
!sourmash lca summarize --query fake-metagenome.fa.sig --db taxdb.lca.json
== This is sourmash version 2.0.0a12.dev48+ga92289b. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loaded 1 LCA databases. ksize=31, scaled=10000
finding query signatures...
loaded 1 signatures from 1 files total.
38.7%    53   Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835
38.7%    53   Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila
38.7%    53   Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia
38.7%    53   Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae
38.7%    53   Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales
38.7%    53   Bacteria;Verrucomicrobia;Verrucomicrobiae
38.7%    53   Bacteria;Verrucomicrobia
100.0%   137   Bacteria
61.3%    84   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica
61.3%    84   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella
61.3%    84   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae
61.3%    84   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales
61.3%    84   Bacteria;Proteobacteria;Gammaproteobacteria
61.3%    84   Bacteria;Proteobacteria
22.6%    31   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS223
14.6%    20   Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Shewanellaceae;Shewanella;Shewanella baltica;Shewanella baltica OS185
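The counts in this summary are consistent with a simple roll-up: each hash is assigned to the lowest common ancestor (LCA) of all genomes that contain it, and every lineage then accumulates the hashes assigned at or below it. The sketch below illustrates that arithmetic; it is not sourmash's implementation. The figure of 33 hashes shared by both Shewanella baltica strains is inferred from the output (84 - 31 - 20), not reported directly.

```python
from collections import Counter


def lca(lineages):
    """Return the longest common prefix of several lineage tuples."""
    common = []
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        common.append(names[0])
    return tuple(common)


def summarize(assignments):
    """Roll per-lineage hash counts up to every ancestor.

    `assignments` maps a lineage tuple to the number of hashes whose LCA is
    exactly that lineage.
    """
    totals = Counter()
    for lineage, count in assignments.items():
        for depth in range(1, len(lineage) + 1):
            totals[lineage[:depth]] += count
    return totals


baltica = ("Bacteria", "Proteobacteria", "Gammaproteobacteria", "Alteromonadales",
           "Shewanellaceae", "Shewanella", "Shewanella baltica")
os223 = baltica + ("Shewanella baltica OS223",)
os185 = baltica + ("Shewanella baltica OS185",)
akk = ("Bacteria", "Verrucomicrobia", "Verrucomicrobiae", "Verrucomicrobiales",
       "Akkermansiaceae", "Akkermansia", "Akkermansia muciniphila",
       "Akkermansia muciniphila ATCC BAA-835")

# Hashes found in both strains can only be placed at their LCA, the species.
assert lca([os223, os185]) == baltica

# 31, 20 and 53 come from the output above; 33 is inferred (84 - 31 - 20).
totals = summarize({os223: 31, os185: 20, baltica: 33, akk: 53})
total_hashes = totals[("Bacteria",)]
for lineage in (akk, baltica, os223, os185):
    percent = 100 * totals[lineage] / total_hashes
    print(f"{percent:.1f}% {totals[lineage]:>5}   {';'.join(lineage)}")
```

This is why the two strain-level rows (22.6% and 14.6%) add up to less than the species-level row (61.3%): the remaining hashes are shared by both strains and cannot be resolved below the species.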