tetrad
species tree inference¶The ipyrad.analysis
package includes a module and command-line tool called tetrad
that can be used to infer quartets from very large SNP alignments, and to join the quartets into a species tree for large numbers of samples following the methodology outlined by Chifman and Kubatko (2014) and implemented in SVDQuartets
.
The tetrad
approach varies in several ways: (1) it uses the same mode of parallelization as ipyrad
and is therefore easy to distribute work across multiple nodes of a large HPC cluster; (2) it can be easily implemented from the command-line tool or in a jupyter-notebook without having to write commands into the large sequence file as a nexus block; (3) it implements a strategy to perform bootstrap re-sampling of RAD loci, and of SNPs within RAD loci; (4) it calculates admixture statistics while it runs (i.e., ABBA-BABA); (5) [coming soon] advanced sampling methods for large trees when the number of quartets is too large to sample.
This is a Jupyter notebook, a reproducible and executable document. The code in this notebook is Python (2.7), and should be executed in a jupyter-notebook like this one. Execute each cell in order to reproduce our entire analysis. We make use of the ipyparallel Python library to distribute STRUCTURE jobs across processers in parallel. If that is confusing, see our tutorial on using ipcluster with jupyter. The example data set used in this analysis is from the empirical example ipyrad tutorial.
All software required for this notebook can be installed using conda.
## conda install ipyrad -c ipyrad
## conda install toytree -c eaton-lab
import ipyrad as ip
import ipyrad.analysis as ipa
import ipyparallel as ipp
import toytree
tetrad
uses a program called ipcluster
to distribute work across a computing cluster. When using the command-line program tetrad
will automatically start up an ipcluster
instance, however, when using the API we leave it to the use to start the ipcluster
instance so that you have more fine-tuned control over the parallelization. For this notebook we assume that you will have started an ipcluster instance running. You can start one by running the commented command below in a separate terminal. This will be connected to when you call the .run()
command further below.
##
## ipcluster start --n=40
##
tetrad
tree inference¶The first step is to create a named tetrad
Class object, which requires a minimum of two argument, a name and a sequence file. The sequence that you will typically want to enter is the '.snps.phy'
file from ipyrad
, which is a phylip formatted file with all SNPs. You can also pass it the '.snps.map'
file, which tells tetrad
how to the SNPS are linked within loci, so that a single SNP can be randomly sampled in each bootstrap replicate.
## create a tetrad Class object
tet = ipa.tetrad(
name='tutorial',
data="./analysis-ipyrad/pedicularis_outfiles/pedicularis.snps.phy",
mapfile="./analysis-ipyrad/pedicularis_outfiles/pedicularis.snps.map",
)
loading seq array [13 taxa x 196435 bp] max unlinked SNPs per quartet (nloci): 39613
A large number of additional paremeter settings are available that you can set when you create the tetrad
Class object, or which you can set afterwards, as we do here. Let's set the 'nboots'
parameter and the 'method'
parameter which tells tetrad
how to sample quartets. Use tab-completion after the tet
variable below to see what other options are available.
## set additional parameters
tet.nboots = 10
tet.method = "all"
The 'run()'
command distributes the computation across your cluster, print progress, and blocks until the job in complete. The results of the analysis will be written to file and also stored in your tetrad
Class object, and can be accessed from its .stats
and .trees
attributes.
tet.run()
host compute node: [4 cores] on oud inferring 715 induced quartet trees [####################] 100% initial tree | 0:00:04 | [####################] 100% boot 10 | 0:00:26 |
Use the toytree.tree()
function to generate a tree plot of the unrooted consensus tree with bootstrap support values. The command 'draw()'
returns a number of objects, the first of which is the canvas
. To save the plot as an SVG image use the toyplot.svg.render()
function like below.
## the newick tree files produced by tetrad
tet.trees
boots ~/Documents/ipyrad/tests/analysis-tetrad/tutorial.boots cons ~/Documents/ipyrad/tests/analysis-tetrad/tutorial.cons nhx ~/Documents/ipyrad/tests/analysis-tetrad/tutorial.nhx tree ~/Documents/ipyrad/tests/analysis-tetrad/tutorial.tree
## create a toytree object from the newick file path
tre = toytree.tree(tet.trees.nhx)
## draw unrooted consensus tree with support values
canvas, axes = tre.draw(
width=300,
node_labels=tre.get_node_values("support"),
);
## draw unrooted consensus tree with the number of quartets
## that can possibly be informative about any given split
## given the 'true' tree.
canvas, axes = tre.draw(
width=300,
node_labels=tre.get_node_values("quartets_total"),
);
## save the tree figure in [format]
import toyplot.svg
toyplot.svg.render(canvas, "analysis-tetrad/pedic-tree.svg")
If you want to add more bootstrap replicates later simply increase the the 'nboots'
attribute and execute 'run()'
again.
tet.nboots = 50
tet.run()
host compute node: [4 cores] on oud initial tree already inferred [####################] 100% boot 50 | 0:01:43 |
## load in the tree
tre = toytree.tree(tet.trees.nhx)
## draw the unrooted consensus tree with support values
canvas, axes = tre.draw(
width=300,
node_labels=tre.get_node_values("support"),
);
Whenever you execute the 'run()'
command your tetrad
Class object will be saved in your working directory as a JSON file ({name}.tet.json
). You can load an existing tetrad
object back into memory by using the load
argument when you create a tetrad
object. The default working directory (workdir
) is './analysis-tetrad'
unless you change it.
## load an old tetrad object
## (the json file will be found from {workdir}/{name}.tet.json)
oldtet = ipa.tetrad(
name="tutorial",
workdir="analysis-tetrad/",
load=True)
## draw the tree from it
tre = toytree.tree(oldtet.trees.nhx)
tre.draw(width=300, node_labels=tre.get_node_values("support"));
loading seq array [13 taxa x 196435 bp] max unlinked SNPs per quartet (nloci): 39613