sourmash is research software from the Lab for Data Intensive Biology at UC Davis. It implements MinHash and modulo hash.
Below are some examples of using sourmash. They are computed in a Jupyter Notebook so you can run them yourself if you like!
Sourmash works on signature files, which are just saved collections of hashes.
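To make "collections of hashes" concrete, here is a minimal pure-Python sketch of a scaled (modulo-style) hash sketch. It is an illustration under stated assumptions, not sourmash's implementation: `hashlib.blake2b` stands in for the hash function sourmash actually uses, reverse-complement handling is omitted, and the names `kmers`, `hash_kmer`, and `scaled_sketch` are hypothetical.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer.

    Assumption: blake2b is only a stand-in for the real hash function.
    """
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_sketch(seq, k=31, scaled=1000):
    """Keep only hashes below 2**64 / scaled (about 1 in `scaled` k-mers)."""
    max_hash = 2**64 // scaled
    return {h for h in map(hash_kmer, kmers(seq.upper(), k)) if h < max_hash}


# With scaled=1 every distinct k-mer hash is kept.
full = scaled_sketch("ACGTACGTAC", k=4, scaled=1)
# A larger scaled value keeps a subset of those hashes.
half = scaled_sketch("ACGTACGTAC", k=4, scaled=2)
print(len(full), half <= full)
```

Because the same cutoff is applied to every sequence, two sketches built with the same `k` and `scaled` can be compared directly as sets.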
Let's try it out!
You can run this notebook interactively via mybinder.
A rendered version of this notebook is available at sourmash.readthedocs.io under "Tutorials and notebooks".
You can also get this notebook from the doc/ subdirectory of the sourmash GitHub repository. See binder/environment.yaml for installation dependencies.
This is a Jupyter Notebook using Python 3. If you are running this via binder, you can use Shift-ENTER to run cells, and double click on code cells to edit them.
Contact: C. Titus Brown. Please file issues on GitHub if you have any questions or comments!
!rm -f *.sig
!sourmash sketch dna -p k=21,k=31,k=51,scaled=1000 genomes/*.fa --name-from-first -f
This outputs three signature files, each containing three signatures (one calculated at k=21, one at k=31, and one at k=51).
!ls *.sig
We can now use these signature files for various comparisons.
The command below searches all of the signature files in the directory, using the shew_os223 signature as the query, and reports the best matches by Jaccard similarity:
!sourmash search -k 31 shew_os223.fa.sig *.sig
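For intuition, Jaccard similarity between two sketches is the size of the intersection of their hash sets divided by the size of their union. The sketch below computes it on plain Python sets; the function name `jaccard` and the small integer "hashes" are made up for illustration and are not sourmash output.

```python
def jaccard(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections of hashes."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sketches share nothing; report 0 rather than divide by zero.
        return 0.0
    return len(a & b) / len(union)


# Made-up hash values: 2 shared hashes out of 6 distinct hashes overall.
query = {1, 2, 3, 4}
subject = {3, 4, 5, 6}
print(jaccard(query, subject))
```

Jaccard similarity is symmetric, so swapping the query and the subject gives the same value.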
The command below uses containment instead of Jaccard similarity:
!sourmash search -k 31 shew_os223.fa.sig *.sig --containment
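Containment asks a slightly different question from Jaccard similarity: what fraction of the query's hashes also appear in the other sketch? A minimal sketch on plain Python sets, with a hypothetical function name `containment` and made-up hash values:

```python
def containment(query, subject):
    """Return |Q ∩ S| / |Q|: the fraction of the query found in the subject."""
    query, subject = set(query), set(subject)
    if not query:
        return 0.0
    return len(query & subject) / len(query)


# Made-up hash values: the small sketch sits entirely inside the large one.
small = {1, 2}
large = {1, 2, 3, 4}
print(containment(small, large))  # all of `small` is in `large`
print(containment(large, small))  # only half of `large` is in `small`
```

Unlike Jaccard similarity, containment is asymmetric, which makes it useful when one sketch is much larger than the other, as with a genome inside a metagenome.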
We can also compare all three signatures:
!sourmash compare -k 31 *.sig
...and produce a similarity matrix that we can use for plotting:
!sourmash compare -k 31 *.sig -o genome_compare.mat
!sourmash plot genome_compare.mat
from IPython.display import Image
Image(filename='genome_compare.mat.matrix.png')
And for the R aficionados, you can output a CSV version of the matrix:
!sourmash compare -k 31 *.sig --csv genome_compare.csv
!cat genome_compare.csv
This is now a file that you can load into R and examine - see our documentation on that.
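If you would rather stay in Python, the same CSV can be read with the standard library. This sketch assumes the layout is a header row of signature names followed by one row of similarity values per signature; check that against the `cat` output above. The helper `read_similarity_csv` and the two-genome example data are hypothetical.

```python
import csv
import io


def read_similarity_csv(handle):
    """Parse a labelled square matrix: one header row, then numeric rows."""
    reader = csv.reader(handle)
    labels = next(reader)
    matrix = [[float(value) for value in row] for row in reader if row]
    return labels, matrix


# Made-up example with the assumed layout (not real sourmash output).
example = io.StringIO("genomeA,genomeB\n1.0,0.25\n0.25,1.0\n")
labels, matrix = read_similarity_csv(example)
print(labels)
print(matrix)
```

To read the real file, pass an open file handle instead, for example `open("genome_compare.csv", newline="")`.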
Let's make a fake metagenome:
!rm -f fake-metagenome.fa*
!cat genomes/*.fa > fake-metagenome.fa
!sourmash sketch dna -p k=31,scaled=1000 fake-metagenome.fa
We can use the sourmash gather command to see what's in it:
!sourmash gather fake-metagenome.fa.sig shew*.sig akker*.sig
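One way to think about what gather does is as a greedy decomposition: pick the reference that overlaps most with the query, set aside the hashes it explains, and repeat on what remains. The sketch below implements that idea on plain Python sets. It is a simplified illustration, not sourmash's code: the function `greedy_gather`, the `min_overlap` parameter, and the toy data are all assumptions.

```python
def greedy_gather(query, references, min_overlap=1):
    """Greedily assign query hashes to references.

    Returns a list of (name, fraction of the original query explained),
    in the order the references were chosen.
    """
    query = set(query)
    remaining = set(query)
    candidates = {name: set(hashes) for name, hashes in references.items()}
    results = []
    while remaining and candidates:
        # Choose the reference sharing the most still-unexplained hashes.
        best = max(candidates, key=lambda name: len(candidates[name] & remaining))
        overlap = candidates[best] & remaining
        if len(overlap) < min_overlap:
            break
        results.append((best, len(overlap) / len(query)))
        remaining -= overlap
        del candidates[best]
    return results


# Toy data: "a" explains 6 of 10 query hashes, "b" adds 2 more, "c" adds none.
query = set(range(10))
references = {"a": {0, 1, 2, 3, 4, 5}, "b": {4, 5, 6, 7}, "c": {100}}
print(greedy_gather(query, references))
```

Note that hashes shared between "a" and "b" are credited only to "a", the first reference chosen, so the reported fractions never add up to more than 1.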