## (📗) ipyrad Cookbook: tetrad species tree inference¶

The ipyrad.analysis package includes a module and command-line tool called tetrad that can be used to infer quartets from very large SNP alignments, and to join the quartets into a species tree for large numbers of samples following the methodology outlined by Chifman and Kubatko (2014) and implemented in SVDQuartets.

The tetrad approach varies in several ways: (1) it uses the same mode of parallelization as ipyrad and is therefore easy to distribute work across multiple nodes of a large HPC cluster; (2) it can be easily implemented from the command-line tool or in a jupyter-notebook without having to write commands into the large sequence file as a nexus block; (3) it implements a strategy to perform bootstrap re-sampling of RAD loci, and of SNPs within RAD loci; (4) it calculates admixture statistics while it runs (i.e., ABBA-BABA); (5) [coming soon] advanced sampling methods for large trees when the number of quartets is too large to sample.

## Notebook setup¶

This is a Jupyter notebook, a reproducible and executable document. The code in this notebook is Python (2.7), and should be executed in a jupyter-notebook like this one. Execute each cell in order to reproduce our entire analysis. We make use of the ipyparallel Python library to distribute STRUCTURE jobs across processers in parallel. If that is confusing, see our tutorial on using ipcluster with jupyter. The example data set used in this analysis is from the empirical example ipyrad tutorial.

#### Required software¶

All software required for this notebook can be installed using conda.

In [1]:
## conda install ipyrad -c ipyrad
## conda install toytree -c eaton-lab

In [9]:
import ipyrad as ip
import ipyrad.analysis as ipa
import ipyparallel as ipp
import toytree


#### Connect to an ipyparallel cluster¶

tetrad uses a program called ipcluster to distribute work across a computing cluster. When using the command-line program tetrad will automatically start up an ipcluster instance, however, when using the API we leave it to the use to start the ipcluster instance so that you have more fine-tuned control over the parallelization. For this notebook we assume that you will have started an ipcluster instance running. You can start one by running the commented command below in a separate terminal. This will be connected to when you call the .run() command further below.

In [6]:
##
## ipcluster start --n=40
##


### Setup tetrad tree inference¶

The first step is to create a named tetrad Class object, which requires a minimum of two argument, a name and a sequence file. The sequence that you will typically want to enter is the '.snps.phy' file from ipyrad, which is a phylip formatted file with all SNPs. You can also pass it the '.snps.map' file, which tells tetrad how to the SNPS are linked within loci, so that a single SNP can be randomly sampled in each bootstrap replicate.

In [3]:
## create a tetrad Class object
name='tutorial',
)

loading seq array [13 taxa x 196435 bp]
max unlinked SNPs per quartet (nloci): 39613


A large number of additional paremeter settings are available that you can set when you create the tetrad Class object, or which you can set afterwards, as we do here. Let's set the 'nboots' parameter and the 'method' parameter which tells tetrad how to sample quartets. Use tab-completion after the tet variable below to see what other options are available.

In [4]:
## set additional parameters
tet.nboots = 10
tet.method = "all"


### Infer the tree¶

The 'run()' command distributes the computation across your cluster, print progress, and blocks until the job in complete. The results of the analysis will be written to file and also stored in your tetrad Class object, and can be accessed from its .stats and .trees attributes.

In [5]:
tet.run()

host compute node: [4 cores] on oud
inferring 715 induced quartet trees
[####################] 100%  initial tree | 0:00:04 |
[####################] 100%  boot 10      | 0:00:26 |


### Draw the tree¶

Use the toytree.tree() function to generate a tree plot of the unrooted consensus tree with bootstrap support values. The command 'draw()' returns a number of objects, the first of which is the canvas. To save the plot as an SVG image use the toyplot.svg.render() function like below.

In [7]:
## the newick tree files produced by tetrad
tet.trees

Out[7]:
boots   ~/Documents/ipyrad/tests/analysis-tetrad/tutorial.boots
tree    ~/Documents/ipyrad/tests/analysis-tetrad/tutorial.tree
In [10]:
## create a toytree object from the newick file path
tre = toytree.tree(tet.trees.nhx)

## draw unrooted consensus tree with support values
canvas, axes = tre.draw(
width=300,
node_labels=tre.get_node_values("support"),
);

## draw unrooted consensus tree with the number of quartets
## that can possibly be informative about any given split
## given the 'true' tree.
canvas, axes = tre.draw(
width=300,
node_labels=tre.get_node_values("quartets_total"),
);

In [11]:
## save the tree figure in [format]
import toyplot.svg


### Checkpointed analysis¶

If you want to add more bootstrap replicates later simply increase the the 'nboots' attribute and execute 'run()' again.

In [12]:
tet.nboots = 50
tet.run()

host compute node: [4 cores] on oud
initial tree already inferred
[####################] 100%  boot 50      | 0:01:43 |

In [13]:
## load in the tree
tre = toytree.tree(tet.trees.nhx)

## draw the unrooted consensus tree with support values
canvas, axes = tre.draw(
width=300,
node_labels=tre.get_node_values("support"),
);


Whenever you execute the 'run()' command your tetrad Class object will be saved in your working directory as a JSON file ({name}.tet.json). You can load an existing tetrad object back into memory by using the load argument when you create a tetrad object. The default working directory (workdir) is './analysis-tetrad' unless you change it.

In [14]:
## load an old tetrad object
## (the json file will be found from {workdir}/{name}.tet.json)

loading seq array [13 taxa x 196435 bp]