This IPython notebook contains a combination of Python, bash, and R scripts that describe the full analysis used in the
pyRAD manuscript. If you wish to execute the scripts yourself you can copy and paste commands from the input cells,
or you can download this notebook as a .ipynb file and re-execute all scripts in an IPython notebook of your own.
If you are unfamiliar with IPython notebooks (you are reading one now) you can read more about them here; briefly:
the default language in any code cell is Python, but other languages can be executed within cells as well. An easy way
to do this is to designate the language at the beginning of the cell with a %% magic. Cells that begin with %%bash are
executed like a shell script, and those that begin with %%R execute R code, including the ability to embed high-quality
graphics in the output below the cell, which is demonstrated near the end of this notebook. Below I use
the unix command ls -l to check what is in my current working directory.
If you plan to execute scripts in this notebook you will need the following Python modules:
In addition you will need pyRAD v.1.64 and Stacks 1.1 or higher (the MySQL installation is not required). You will also need
two directories containing the necessary scripts and input files, which are located in a zipped directory in the supplemental
materials. Unzip this archive and move the contents into your current working directory.
You will need to change the locations of pyRAD and Stacks in all of the following scripts to reflect their locations on your machine.
!ls -l
total 400
-rw-rw-r-- 1 deren deren 43695 Dec  2 18:31 PyRAD_STACKS.ipynb
drwxrwxr-x 2 deren deren  4096 Dec  2 18:21 REALDATA
drwxrwxr-x 2 deren deren  4096 Dec  2 18:22 SIMbig
-rw-rw-rw- 1 deren deren  7687 Nov 20 14:14 simradsMANU.py
drwxrwxr-x 2 deren deren  4096 Dec  2 18:23 SIMsmall
At this point you see there is an .ipynb file, which is this IPython notebook, as well as three directories and a Python script for
simulating RADseq data under a coalescent model. The script also creates a barcodes output file for each data set. The SIMsmall
directory contains the params input files for three pyRAD analyses.
!ls -l SIMsmall/
total 24
-rw-rw-r-- 1 deren deren 4369 Dec  2 18:23 params_high.txt
-rw-rw-r-- 1 deren deren 4370 Dec  2 18:23 params_low.txt
-rw-rw-r-- 1 deren deren 4365 Dec  2 18:24 params_no.txt
The script simradsMANU.py (included with supplemental materials) uses the egglib module to simulate sequences under a coalescent
model and then trims them to emulate RADseq-like data. To begin, I simulate three different data sets which I use to compare the
performance of PyRAD and Stacks on data containing indels.
The first argument to the script is the indel mutation rate, and the second is whether mutations can arise in the restriction
recognition site (causing locus dropout), which I do not use here (set to 0). The third argument is the number of loci to simulate
and the fourth is the number of individuals to sample per species. The final argument is the name of the output file. Data are
simulated on a species tree as described in the manuscript text. All other simulation parameters are kept constant, because I am
primarily interested in examining the effect of indel variation.
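To make the read structure concrete: each simulated read begins with a 6-bp inline barcode, followed by the PstI overhang (TGCAG) and the sequenced fragment, as can be seen in the example reads shown further below. A minimal sketch of this layout (make_read is a hypothetical helper for illustration only, not part of simradsMANU.py):

```python
# Sketch of how a simulated RAD read is laid out: 6-bp inline barcode,
# then the restriction-site overhang, then the fragment.
# make_read is a hypothetical helper for illustration only.

def make_read(barcode, fragment, overhang="TGCAG"):
    """Assemble a RAD-like read from its parts."""
    return barcode + overhang + fragment

read = make_read("ATGGAT", "TGATATATTCC")
print(read)  # ATGGATTGCAGTGATATATTCC
```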
%%bash
## you must have the egglib Python module installed to do this step.
python simradsMANU.py 0.00 0 10000 1 SIMsmall/simRADs_no.fastq
python simradsMANU.py 0.02 0 10000 1 SIMsmall/simRADs_low.fastq
python simradsMANU.py 0.05 0 10000 1 SIMsmall/simRADs_high.fastq
simulating 10000 RAD loci across 12 tip taxa with 1 samples per taxon
indels arise at frequency of 0.000000
mutations in restriction site = False
theta=4Nu= 0.0014
creating new barcode map
. . . . . . . . . .
simulating 10000 RAD loci across 12 tip taxa with 1 samples per taxon
indels arise at frequency of 0.020000
mutations in restriction site = False
theta=4Nu= 0.0014
creating new barcode map
. . . . . . . . . .
simulating 10000 RAD loci across 12 tip taxa with 1 samples per taxon
indels arise at frequency of 0.050000
mutations in restriction site = False
theta=4Nu= 0.0014
creating new barcode map
. . . . . . . . . .
The raw simulated data are in fastQ format, as shown below, with uniformly high quality scores.
!head -n12 SIMsmall/simRADs_no.fastq
@lane1_fakedata_R1_0 1:N:0:
ATGGATTGCAGTGATATATTCCCATTTCTGGTTCTGGAGATACCCTCGTCCGGGAATCTAACTATCTGTGTCTTTACGGCTGAGGTTGTCCCATTGTATTTCCAGTCTTAG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata_R1_1 1:N:0:
ATGGATTGCAGTGATATATTCCCATTTCTGGTTCTGGAGATACCCTCGTCCGGGAATCTAACTATCTGTGTCTTTACGGCTGAGGTTGTCCCATTGTATTTCCAGTCTTAG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@lane1_fakedata_R1_2 1:N:0:
ATGGATTGCAGTGATATATTCCCATTTCTGGTTCTGGAGATACCCTCGTCCGGGAATCTAACTATCTGTGTCTTTACGGCTGAGGTTGTCCCATTGTATTTCCAGTCTTAG
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
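Each fastQ record spans four lines: a header beginning with @, the sequence, a + separator, and a quality string the same length as the sequence. A minimal parser sketch (parse_fastq is a hypothetical helper, not used elsewhere in this notebook):

```python
def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from a list of fastQ
    lines; each record is four lines: @header, sequence, '+', quality."""
    for i in range(0, len(lines), 4):
        header = lines[i].rstrip()[1:]   # drop the leading '@'
        seq = lines[i + 1].rstrip()
        qual = lines[i + 3].rstrip()
        yield header, seq, qual

record = ["@lane1_fakedata_R1_0 1:N:0:", "ATGGATTGCAG", "+", "BBBBBBBBBBB"]
for name, seq, qual in parse_fastq(record):
    # quality string always matches the sequence length
    print(name, len(seq) == len(qual))
```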
Now I run the pyRAD analyses. To begin, I create a separate working directory for each analysis, then execute pyRAD on the three
params files, each of which points to one of the three working directories. The following cells will each take some time to run;
the time command measures how long, and each analysis is run in a separate cell. The params files specify parallel processing
on 12 processors.
%%bash
## create working directories for each PyRAD analysis
mkdir SIMsmall/Py_no
mkdir SIMsmall/Py_low
mkdir SIMsmall/Py_high
%time ! ~/pyrad_v.1.62/pyRAD -p SIMsmall/params_no.txt 2> SIMsmall/log.pyno
usearch_i86linux32 v6.0.307, 4.0Gb RAM (49.5Gb total), 24 cores
(C) Copyright 2010-12 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
Licensed to: daeaton@uchicago.edu

ingroup ['1A0', '1B0', '1C0', '1D0', '2E0', '2F0', '2G0', '2H0', '3I0', '3J0', '3K0', '3L0']
addon []
exclude []
............
final stats written to:
 /home/deren/Dropbox/PYRADMANU/SIMsmall/Py_no/stats/c85d6m2pN.stats
output files written to:
 /home/deren/Dropbox/PYRADMANU/SIMsmall/Py_no/outfiles/ directory
CPU times: user 1.92 s, sys: 0.34 s, total: 2.26 s
Wall time: 1128.01 s
%time ! ~/pyrad_v.1.62/pyRAD -p SIMsmall/params_low.txt 2> SIMsmall/log.pylow
usearch_i86linux32 v6.0.307, 4.0Gb RAM (49.5Gb total), 24 cores
(C) Copyright 2010-12 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
Licensed to: daeaton@uchicago.edu

ingroup ['1A0', '1B0', '1C0', '1D0', '2E0', '2F0', '2G0', '2H0', '3I0', '3J0', '3K0', '3L0']
addon []
exclude []
............
final stats written to:
 /home/deren/Dropbox/PYRADMANU/SIMsmall/Py_low/stats/c85d6m2pN.stats
output files written to:
 /home/deren/Dropbox/PYRADMANU/SIMsmall/Py_low/outfiles/ directory
CPU times: user 1.81 s, sys: 0.30 s, total: 2.10 s
Wall time: 1112.46 s
%time ! ~/pyrad_v.1.62/pyRAD -p SIMsmall/params_high.txt 2> SIMsmall/log.pyhigh
usearch_i86linux32 v6.0.307, 4.0Gb RAM (49.5Gb total), 24 cores
(C) Copyright 2010-12 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
Licensed to: daeaton@uchicago.edu

ingroup ['1A0', '1B0', '1C0', '1D0', '2E0', '2F0', '2G0', '2H0', '3I0', '3J0', '3K0', '3L0']
addon []
exclude []
............
final stats written to:
 /home/deren/Dropbox/PYRADMANU/SIMsmall/Py_high/stats/c85d6m2pN.stats
output files written to:
 /home/deren/Dropbox/PYRADMANU/SIMsmall/Py_high/outfiles/ directory
CPU times: user 1.73 s, sys: 0.33 s, total: 2.06 s
Wall time: 1133.35 s
%%bash
## The following commands remove unneeded files leftover from the analysis to save disk space.
rm -r SIMsmall/Py_no/fastq/ SIMsmall/Py_no/edits/
rm -r SIMsmall/Py_low/fastq/ SIMsmall/Py_low/edits/
rm -r SIMsmall/Py_high/fastq/ SIMsmall/Py_high/edits/
Now I run the Stacks analysis. To begin, I create three output directories, one for each analysis.
%%bash
## create working directories for each stacks analysis
mkdir SIMsmall/stackf_no
mkdir SIMsmall/stackf_low
mkdir SIMsmall/stackf_high
I then use process_radtags to sort reads by their barcodes, and the denovo_map.pl pipeline to run the three Stacks analyses.
Parameter settings are analogous (as close as possible) to those used in the pyRAD analyses. Because the barcodes were randomly
generated by the simulation and must be entered by hand in the Stacks script, I use the commands below to generate a Stacks input
script as a .sh file. I show the file and then execute it.
%%bash
## create a barcode file for stacks analyses
cut -f 2 SIMsmall/simRADs_no.fastq.barcodes > SIMsmall/simRADs_no.stacks.barcodes
cut -f 2 SIMsmall/simRADs_low.fastq.barcodes > SIMsmall/simRADs_low.stacks.barcodes
cut -f 2 SIMsmall/simRADs_high.fastq.barcodes > SIMsmall/simRADs_high.stacks.barcodes
%%bash
process_radtags -f SIMsmall/simRADs_no.fastq -o SIMsmall/stackf_no/ \
-b SIMsmall/simRADs_no.stacks.barcodes -e pstI -E phred33
Using Phred+33 encoding for quality scores.
Found 1 input file(s).
Searching for single-end, inlined barcodes.
Loaded 12 barcodes (6bp).
Processing file 1 of 1 [simRADs_no.fastq]
2600000 total reads; -200000 ambiguous barcodes; -0 ambiguous RAD-Tags; +0 recovered; -0 low quality reads; 2400000 retained reads.
Closing files, flushing buffers...
Outputing details to log: 'SIMsmall/stackf_no/process_radtags.log'

2600000 total sequences;
200000 ambiguous barcode drops;
0 low quality read drops;
0 ambiguous RAD-Tag drops;
2400000 retained reads.
The function below creates the shell script (".sh") used to run a Stacks analysis.
import glob

def makestackslist(dset):
    infiles = ["-s "+ff+" " for ff in glob.glob("SIMsmall/stackf_"+dset+"/*.fq")]
    denovo = "denovo_map.pl -m 2 -M 13 -n 15 -T 12 -b 1 -B 1 -S -X "+\
             "'populations:-m 6' -D 'SIMsmall_no' -o SIMsmall/stackf_"+dset+" "+"".join(infiles)
    with open("SIMsmall/stacks_"+dset+".sh",'w') as outfile:
        outfile.write(denovo)

makestackslist('no')
For example, the stacks_no.sh script below will run stacks on the data set with no indels.
!cat SIMsmall/stacks_no.sh
denovo_map.pl -m 2 -M 13 -n 15 -T 12 -b 1 -B 1 -S -X 'populations:-m 6' -D 'SIMsmall_no' -o SIMsmall/stackf_no -s SIMsmall/stackf_no/sample_AAGAGG.fq -s SIMsmall/stackf_no/sample_GAGAGA.fq -s SIMsmall/stackf_no/sample_GGTTTA.fq -s SIMsmall/stackf_no/sample_ATAGTA.fq -s SIMsmall/stackf_no/sample_TAAATG.fq -s SIMsmall/stackf_no/sample_TGAAAT.fq -s SIMsmall/stackf_no/sample_GAGAAT.fq -s SIMsmall/stackf_no/sample_GAGTAT.fq -s SIMsmall/stackf_no/sample_GATAGA.fq -s SIMsmall/stackf_no/sample_ATGGAT.fq -s SIMsmall/stackf_no/sample_TGTAGA.fq -s SIMsmall/stackf_no/sample_ATGGTT.fq
%time ! sh SIMsmall/stacks_no.sh 2> SIMsmall/log.stackno
Identifying unique stacks; file 1 of 12 [sample_AAGAGG]
Identifying unique stacks; file 2 of 12 [sample_GAGAGA]
Identifying unique stacks; file 3 of 12 [sample_GGTTTA]
Identifying unique stacks; file 4 of 12 [sample_ATAGTA]
Identifying unique stacks; file 5 of 12 [sample_TAAATG]
Identifying unique stacks; file 6 of 12 [sample_TGAAAT]
Identifying unique stacks; file 7 of 12 [sample_GAGAAT]
Identifying unique stacks; file 8 of 12 [sample_GAGTAT]
Identifying unique stacks; file 9 of 12 [sample_GATAGA]
Identifying unique stacks; file 10 of 12 [sample_ATGGAT]
Identifying unique stacks; file 11 of 12 [sample_TGTAGA]
Identifying unique stacks; file 12 of 12 [sample_ATGGTT]
CPU times: user 7.00 s, sys: 0.99 s, total: 7.99 s
Wall time: 2490.63 s
%%bash
process_radtags -f SIMsmall/simRADs_low.fastq -o SIMsmall/stackf_low/ \
-b SIMsmall/simRADs_low.stacks.barcodes -e pstI -E phred33
Using Phred+33 encoding for quality scores.
Found 1 input file(s).
Searching for single-end, inlined barcodes.
Loaded 12 barcodes (6bp).
Processing file 1 of 1 [simRADs_low.fastq]
2600000 total reads; -200000 ambiguous barcodes; -0 ambiguous RAD-Tags; +0 recovered; -0 low quality reads; 2400000 retained reads.
Closing files, flushing buffers...
Outputing details to log: 'SIMsmall/stackf_low/process_radtags.log'

2600000 total sequences;
200000 ambiguous barcode drops;
0 low quality read drops;
0 ambiguous RAD-Tag drops;
2400000 retained reads.
makestackslist("low")
!cat SIMsmall/stacks_low.sh
denovo_map.pl -m 2 -M 13 -n 15 -T 12 -b 1 -B 1 -S -X 'populations:-m 6' -D 'SIMsmall_no' -o SIMsmall/stackf_low -s SIMsmall/stackf_low/sample_ATGTGG.fq -s SIMsmall/stackf_low/sample_AAAAAG.fq -s SIMsmall/stackf_low/sample_GTTGGA.fq -s SIMsmall/stackf_low/sample_GTATGT.fq -s SIMsmall/stackf_low/sample_GGTAGG.fq -s SIMsmall/stackf_low/sample_TAGTAA.fq -s SIMsmall/stackf_low/sample_GGTAGT.fq -s SIMsmall/stackf_low/sample_GGTATG.fq -s SIMsmall/stackf_low/sample_GGGTTG.fq -s SIMsmall/stackf_low/sample_GGGGAG.fq -s SIMsmall/stackf_low/sample_AAGATT.fq -s SIMsmall/stackf_low/sample_TTTGTG.fq
%time ! sh SIMsmall/stacks_low.sh 2> SIMsmall/log.stacklow
Identifying unique stacks; file 1 of 12 [sample_ATGTGG]
Identifying unique stacks; file 2 of 12 [sample_AAAAAG]
Identifying unique stacks; file 3 of 12 [sample_GTTGGA]
Identifying unique stacks; file 4 of 12 [sample_GTATGT]
Identifying unique stacks; file 5 of 12 [sample_GGTAGG]
Identifying unique stacks; file 6 of 12 [sample_TAGTAA]
Identifying unique stacks; file 7 of 12 [sample_GGTAGT]
Identifying unique stacks; file 8 of 12 [sample_GGTATG]
Identifying unique stacks; file 9 of 12 [sample_GGGTTG]
Identifying unique stacks; file 10 of 12 [sample_GGGGAG]
Identifying unique stacks; file 11 of 12 [sample_AAGATT]
Identifying unique stacks; file 12 of 12 [sample_TTTGTG]
CPU times: user 7.28 s, sys: 1.13 s, total: 8.42 s
Wall time: 2485.87 s
%%bash
process_radtags -f SIMsmall/simRADs_high.fastq -o SIMsmall/stackf_high/ \
-b SIMsmall/simRADs_high.stacks.barcodes -e pstI -E phred33
Using Phred+33 encoding for quality scores.
Found 1 input file(s).
Searching for single-end, inlined barcodes.
Loaded 12 barcodes (6bp).
Processing file 1 of 1 [simRADs_high.fastq]
2600000 total reads; -200000 ambiguous barcodes; -0 ambiguous RAD-Tags; +0 recovered; -0 low quality reads; 2400000 retained reads.
Closing files, flushing buffers...
Outputing details to log: 'SIMsmall/stackf_high/process_radtags.log'

2600000 total sequences;
200000 ambiguous barcode drops;
0 low quality read drops;
0 ambiguous RAD-Tag drops;
2400000 retained reads.
makestackslist("high")
!cat SIMsmall/stacks_high.sh
denovo_map.pl -m 2 -M 13 -n 15 -T 12 -b 1 -B 1 -S -X 'populations:-m 6' -D 'SIMsmall_no' -o SIMsmall/stackf_high -s SIMsmall/stackf_high/sample_TAGATT.fq -s SIMsmall/stackf_high/sample_GGATAT.fq -s SIMsmall/stackf_high/sample_AGAATG.fq -s SIMsmall/stackf_high/sample_TTAGGG.fq -s SIMsmall/stackf_high/sample_ATATGT.fq -s SIMsmall/stackf_high/sample_TTAAAG.fq -s SIMsmall/stackf_high/sample_TTGAAG.fq -s SIMsmall/stackf_high/sample_AGATAT.fq -s SIMsmall/stackf_high/sample_TTAAAA.fq -s SIMsmall/stackf_high/sample_GAAATG.fq -s SIMsmall/stackf_high/sample_ATAGGA.fq -s SIMsmall/stackf_high/sample_GGAAAT.fq
%time ! sh SIMsmall/stacks_high.sh 2> SIMsmall/log.stackhigh
Identifying unique stacks; file 1 of 12 [sample_TAGATT]
Identifying unique stacks; file 2 of 12 [sample_GGATAT]
Identifying unique stacks; file 3 of 12 [sample_AGAATG]
Identifying unique stacks; file 4 of 12 [sample_TTAGGG]
Identifying unique stacks; file 5 of 12 [sample_ATATGT]
Identifying unique stacks; file 6 of 12 [sample_TTAAAG]
Identifying unique stacks; file 7 of 12 [sample_TTGAAG]
Identifying unique stacks; file 8 of 12 [sample_AGATAT]
Identifying unique stacks; file 9 of 12 [sample_TTAAAA]
Identifying unique stacks; file 10 of 12 [sample_GAAATG]
Identifying unique stacks; file 11 of 12 [sample_ATAGGA]
Identifying unique stacks; file 12 of 12 [sample_GGAAAT]
CPU times: user 7.86 s, sys: 1.15 s, total: 9.00 s
Wall time: 2777.69 s
I create a list measuring taxon coverage for each Stacks analysis, and similarly for each PyRAD analysis below. This measures the
total number of loci that have data for at least N taxa for each value of N.
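The sum(x)-sum(x[:i-1]) expressions used in the cells below are simply a cumulative sum of a coverage histogram from N upward. The same computation can be written as a small function (loci_with_min_coverage is a hypothetical helper for illustration):

```python
def loci_with_min_coverage(per_locus_coverage, ntaxa=12):
    """Given the number of taxa sharing each locus, return a list whose
    (N-1)th entry is the number of loci with data for at least N taxa."""
    hist = [per_locus_coverage.count(n) for n in range(1, ntaxa + 1)]
    return [sum(hist[n - 1:]) for n in range(1, ntaxa + 1)]

# three loci shared by 12, 5, and 12 taxa, respectively
print(loci_with_min_coverage([12, 5, 12]))
# → [3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2]
```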
## Stacks results ##
lines = open("SIMsmall/stackf_high/batch_1.haplotypes.tsv").readlines()
cnts = [int(field.strip().split("\t")[1]) for field in lines[1:]]
shigh = [cnts.count(i) for i in range(1,13)]
SHIGH = [int(sum(shigh)-sum(shigh[:i-1])) for i in range(1,13)]
lines = open("SIMsmall/stackf_low/batch_1.haplotypes.tsv").readlines()
cnts = [int(field.strip().split("\t")[1]) for field in lines[1:]]
slow = [cnts.count(i) for i in range(1,13)]
SLOW = [int(sum(slow)-sum(slow[:i-1])) for i in range(1,13)]
lines = open("SIMsmall/stackf_no/batch_1.haplotypes.tsv").readlines()
cnts = [int(field.strip().split("\t")[1]) for field in lines[1:]]
sno = [cnts.count(i) for i in range(1,13)]
SNO = [int(sum(sno)-sum(sno[:i-1])) for i in range(1,13)]
## PyRAD results ##
infile = open("SIMsmall/Py_no/stats/c85d6m2pN.stats").readlines()
pno = [int(infile[j].strip().split("\t")[1]) for j in range(26,38)]
PNO = [sum(pno)-sum(pno[:i-1]) for i in range(1,13)]
infile = open("SIMsmall/Py_low/stats/c85d6m2pN.stats").readlines()
plow = [int(infile[j].strip().split("\t")[1]) for j in range(26,38)]
PLOW = [sum(plow)-sum(plow[:i-1]) for i in range(1,13)]
infile = open("SIMsmall/Py_high/stats/c85d6m2pN.stats").readlines()
phigh = [int(infile[j].strip().split("\t")[1]) for j in range(26,38)]
PHIGH = [sum(phigh)-sum(phigh[:i-1]) for i in range(1,13)]
## push the Python results lists into R ##
%load_ext rmagic
%Rpush PNO PLOW PHIGH SNO SLOW SHIGH
%%R
RES = cbind(PNO,PLOW,PHIGH,SNO,SLOW,SHIGH)
print(RES)
        PNO  PLOW PHIGH   SNO  SLOW SHIGH
 [1,] 10000  9997 10049 10000 12085 16041
 [2,] 10000  9997 10033 10000 10918 12683
 [3,] 10000  9997 10021 10000 10707 11986
 [4,] 10000  9996 10010 10000 10466 11100
 [5,] 10000  9995  9988 10000 10062 10162
 [6,] 10000  9995  9982 10000  9991  9929
 [7,] 10000  9993  9974 10000  9966  9781
 [8,] 10000  9993  9968 10000  9882  9481
 [9,] 10000  9992  9945 10000  9523  8651
[10,] 10000  9991  9934 10000  9301  8000
[11,] 10000  9991  9923 10000  9110  7432
[12,] 10000  9991  9911  9996  8301  5755
%%R
makeplot <- function() {
par(mfcol=c(1,3),
cex.axis=1.1,
cex.main=1.1,
cex.lab=1.1,
mar=c(4,4,3,0),
oma=c(1,2,0,2))
plot(RES[,1], ylim=c(0,18000),xlim=c(1,12),
xlab="Minimum taxon coverage",
ylab="Number of loci recovered",
main="No indels")
points(RES[,1], pch=20, col='black', cex=2.5)
lines(RES[,1], lwd=3)
points(RES[,4], pch=20, col='darkgrey', cex=1.35)
lines(RES[,4], lwd=2.5, col='darkgrey')
plot(RES[,2], ylim=c(0,18000),xlim=c(1,12),
xlab="Minimum taxon coverage",
ylab="",#loci with at least N taxa",
main="Low indels")
points(RES[,2], pch=20, col='black', cex=2.5)
lines(RES[,2], lwd=3)
points(RES[,5], pch=20, col='darkgrey', cex=1.35)
lines(RES[,5], lwd=2.5, col='darkgrey')
plot(RES[,3], ylim=c(0,18000),xlim=c(1,12),
xlab="Minimum taxon coverage",
ylab="",#loci with at least N taxa",
main="High indels")
points(RES[,3], pch=20, col='black', cex=2.5)
lines(RES[,3], lwd=3)
points(RES[,6], pch=20, col='darkgrey', cex=1.35)
lines(RES[,6], lwd=2.5, col='darkgrey')
}
%%R
pdf("SIMsmall/FIG_1.pdf",
width=6.5, height=2.8,
family="Times")
makeplot()
dev.off()
%R makeplot()
import glob

RES = []
for run in ["no", "low", "high"]:
    handle = "SIMsmall/stackf_"+run+"/sample_*.tags.tsv"
    c = [open(ff,'r').read().count("consensus") for ff in glob.glob(handle)]
    RES.append(['stacks',run,float(sum(c))/len(c)])
    handle = "SIMsmall/Py_"+run+"/stats/s5.consens.txt"
    c = map(float,[line.split("\t")[3] for line in open(handle,'r').readlines()[1:-7]])
    RES.append(['pyrad',run,float(sum(c))/len(c)])

for res in RES:
    print 'mean N loci per sample from %s indel %s run \t%f' % (res[1],res[0],res[2])
mean N loci per sample from no indel stacks run 	10000.000000
mean N loci per sample from no indel pyrad run 	10000.000000
mean N loci per sample from low indel stacks run 	10027.083333
mean N loci per sample from low indel pyrad run 	10000.500000
mean N loci per sample from high indel stacks run 	10087.083333
mean N loci per sample from high indel pyrad run 	10002.000000
infile = open("SIMsmall/stackf_high/batch_1.haplotypes.tsv").readlines()
tcov = [line.split("\t")[1] for line in infile[1:]]
print mean(map(int,tcov)), 'mean coverage stacks highindel'
infile = open("SIMsmall/stackf_low/batch_1.haplotypes.tsv").readlines()
tcov = [line.split("\t")[1] for line in infile[1:]]
print mean(map(int,tcov)), 'mean coverage stacks lowindel'
infile = open("SIMsmall/stackf_no/batch_1.haplotypes.tsv").readlines()
tcov = [line.split("\t")[1] for line in infile[1:]]
print mean(map(int,tcov)), 'mean coverage stacks noindel'
infile = open("SIMsmall/Py_high/stats/c85d6m2pN.stats").readlines()[26:38]
tcov = []
for line in infile:
    a = line.split("\t")
    tcov += [int(a[0])]*int(a[1])
print mean(tcov), 'mean coverage pyrad highindel'

infile = open("SIMsmall/Py_low/stats/c85d6m2pN.stats").readlines()[26:38]
tcov = []
for line in infile:
    a = line.split("\t")
    tcov += [int(a[0])]*int(a[1])
print mean(tcov), 'mean coverage pyrad lowindel'

infile = open("SIMsmall/Py_no/stats/c85d6m2pN.stats").readlines()[26:38]
tcov = []
for line in infile:
    a = line.split("\t")
    tcov += [int(a[0])]*int(a[1])
print mean(tcov), 'mean coverage pyrad noindel'
7.54323296553 mean coverage stacks highindel
9.95548200248 mean coverage stacks lowindel
11.9996 mean coverage stacks noindel
11.9154144691 mean coverage pyrad highindel
11.9963989197 mean coverage pyrad lowindel
12.0 mean coverage pyrad noindel
An empirical analysis was conducted on the published data set from Eaton & Ree (2013), available de-multiplexed in fastQ format at the
NCBI sequence read archive (SRA072507). This step takes many hours using 12 processors, as described in the paper. The data are already
sorted, so pyRAD is started from step 2. After trimming the barcode and adapter, read lengths are 69 bp; at .85 similarity this is
equivalent to allowing 10 base differences. The params.txt file is available in the supplementary materials. Because Stacks and pyRAD
use different quality filtering methods, I employed filtering by pyRAD only, and the data are converted back to fastQ format to be read
into Stacks. This way quality filtering does not affect the comparison between the two programs.
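The 10-base figure follows from simple arithmetic on the clustering threshold:

```python
# At a 0.85 clustering threshold, two 69-bp reads may differ at up to
# 15% of their sites before being split into separate clusters.
read_len = 69
similarity = 0.85
max_diffs = int(read_len * (1 - similarity))
print(max_diffs)  # 10
```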
%time ! ~/pyrad_v.1.62/pyRAD -p REALDATA/paramsPAR.txt -s 2
 ------------------------------------------------------------------
  pyRAD : RAD/GBS for phylogenetics & introgression analyses
 ------------------------------------------------------------------
sorted .fastq from /home/deren/Dropbox/raw_cyatho/*.fq being used
	step 2: editing raw reads
.............
CPU times: user 0.37 s, sys: 0.04 s, total: 0.40 s
Wall time: 157.25 s
To remove the effect of different error filtering by the two programs, the filtered data from pyRAD were used for both analyses. Because
pyRAD converts the data from fastq to fasta at this step, discarding read quality scores, the data were converted back to fastq with an
arbitrarily high quality score (b) added for each base, for use in Stacks.
! mkdir REALDATA/edits4stacks
for ff in glob.glob("REALDATA/edits/*.edit"):
    dat = open(ff).readlines()
    outfile = open("REALDATA/edits4stacks/"+ff.split("/")[-1].replace(".edit",".fq"),'w')
    writing = []
    for line in dat:
        if ">" in line:
            x,y = line.split("_")[2:4]
            nn = "@GRC13_0027_FC:4:1:"+x+":"+y+"#0/1"
            writing.append(nn.strip())
        else:
            seq = "TGCAG"+line.strip()
            writing.append(seq)
            writing.append("+")
            writing.append("b"*len(seq)+"\n")
    outfile.write("\n".join(writing))
    writing = []
    outfile.close()
The pyRAD analysis is done both with and without paralog filtering.
%time ! ~/pyrad_v.1.62/pyRAD -p REALDATA/paramsNOPAR.txt -s 34567 2> REALDATA/log.py.txt
usearch_i86linux32 v6.0.307, 4.0Gb RAM (49.5Gb total), 24 cores
(C) Copyright 2010-12 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
Licensed to: daeaton@uchicago.edu

ingroup ['FieldMuseum_Ree_29154', 'FieldMuseum_Ree_30556', 'FieldMuseum_Ree_30686', 'FieldMuseum_Ree_32082', 'FieldMuseum_Ree_33405', 'FieldMuseum_Ree_33413', 'FieldMuseum_Ree_33588', 'FieldMuseum_Ree_35855', 'FieldMuseum_Ree_38362', 'FieldMuseum_Ree_39618', 'FieldMuseum_Ree_40578', 'FieldMuseum_Ree_41478', 'FieldMuseum_Ree_41954']
addon []
exclude []
............
final stats written to:
 /home/deren/Dropbox/PYRADMANU/REALDATA/stats/c85d6m2pN.stats
output files written to:
 /home/deren/Dropbox/PYRADMANU/REALDATA/outfiles/ directory
CPU times: user 208.31 s, sys: 32.33 s, total: 240.64 s
Wall time: 127436.99 s
%%bash
mv REALDATA/stats REALDATA/stats.nopar
mkdir REALDATA/stats
rm REALDATA/clust.85/*.consens
~/pyrad_v.1.62/pyRAD -p REALDATA/paramsPAR.txt -s 4567 2> REALDATA/log.py2.txt
usearch_i86linux32 v6.0.307, 4.0Gb RAM (49.5Gb total), 24 cores
(C) Copyright 2010-12 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
Licensed to: daeaton@uchicago.edu

ingroup ['FieldMuseum_Ree_29154', 'FieldMuseum_Ree_30556', 'FieldMuseum_Ree_30686', 'FieldMuseum_Ree_32082', 'FieldMuseum_Ree_33405', 'FieldMuseum_Ree_33413', 'FieldMuseum_Ree_33588', 'FieldMuseum_Ree_35855', 'FieldMuseum_Ree_38362', 'FieldMuseum_Ree_39618', 'FieldMuseum_Ree_40578', 'FieldMuseum_Ree_41478', 'FieldMuseum_Ree_41954']
addon []
exclude []
............
final stats written to:
 /home/deren/Dropbox/PYRADMANU/REALDATA/stats/c85d6m2p3H3N3.stats
output files written to:
 /home/deren/Dropbox/PYRADMANU/REALDATA/outfiles/ directory
!mkdir REALDATA/stacks85
import glob

infiles = ["-s "+ff+" " for ff in glob.glob("REALDATA/edits4stacks/*")]
denovo = "denovo_map.pl -m 2 -M 10 -N 10 -n 10 -T 12 -b 1 -B 1 -S -X "+\
         "'populations:-m 6' -X 'ustacks:--max_locus_stacks 2' -D 'stacks85' "+\
         "-o REALDATA/stacks85 "+"".join(infiles)
with open("REALDATA/stacks85/stacks.sh",'w') as outfile:
    outfile.write(denovo)
!cat REALDATA/stacks85/stacks.sh
denovo_map.pl -m 2 -M 10 -N 10 -n 10 -T 12 -b 1 -B 1 -S -X 'populations:-m 6' -X 'ustacks:--max_locus_stacks 2' -D 'stacks85' -o REALDATA/stacks85 -s REALDATA/edits4stacks/FieldMuseum_Ree_41478.fq -s REALDATA/edits4stacks/FieldMuseum_Ree_35855.fq -s REALDATA/edits4stacks/FieldMuseum_Ree_30556.fq -s REALDATA/edits4stacks/FieldMuseum_Ree_33588.fq -s REALDATA/edits4stacks/FieldMuseum_Ree_39618.fq -s REALDATA/edits4stacks/FieldMuseum_Ree_38362.fq -s REALDATA/edits4stacks/FieldMuseum_Ree_40578.fq -s REALDATA/edits4stacks/FieldMuseum_Ree_29154.fq -s REALDATA/edits4stacks/FieldMuseum_Ree_33413.fq -s REALDATA/edits4stacks/FieldMuseum_Ree_33405.fq -s REALDATA/edits4stacks/FieldMuseum_Ree_41954.fq -s REALDATA/edits4stacks/FieldMuseum_Ree_32082.fq -s REALDATA/edits4stacks/FieldMuseum_Ree_30686.fq
%time !sh REALDATA/stacks85/stacks.sh 2> REALDATA/log.stacks.txt
Identifying unique stacks; file 1 of 13 [FieldMuseum_Ree_41478]
Identifying unique stacks; file 2 of 13 [FieldMuseum_Ree_35855]
Identifying unique stacks; file 3 of 13 [FieldMuseum_Ree_30556]
Identifying unique stacks; file 4 of 13 [FieldMuseum_Ree_33588]
Identifying unique stacks; file 5 of 13 [FieldMuseum_Ree_39618]
Identifying unique stacks; file 6 of 13 [FieldMuseum_Ree_38362]
Identifying unique stacks; file 7 of 13 [FieldMuseum_Ree_40578]
Identifying unique stacks; file 8 of 13 [FieldMuseum_Ree_29154]
Identifying unique stacks; file 9 of 13 [FieldMuseum_Ree_33413]
Identifying unique stacks; file 10 of 13 [FieldMuseum_Ree_33405]
Identifying unique stacks; file 11 of 13 [FieldMuseum_Ree_41954]
Identifying unique stacks; file 12 of 13 [FieldMuseum_Ree_32082]
Identifying unique stacks; file 13 of 13 [FieldMuseum_Ree_30686]
CPU times: user 800.48 s, sys: 143.01 s, total: 943.49 s
Wall time: 266088.77 s
infile = open("REALDATA/stats.nopar/c85d6m2pN.stats").readlines()
pno = [int(infile[j].strip().split("\t")[1]) for j in range(27,40)]
PEMPN = [sum(pno)-sum(pno[:i-1]) for i in range(1,14)]
infile = open("REALDATA/stats/c85d6m2p3H3N3.stats").readlines()
fno = [int(infile[j].strip().split("\t")[1]) for j in range(27,40)]
PEMPF = [sum(fno)-sum(fno[:i-1]) for i in range(1,14)]
Stacks does not have the same paralog filtering options that pyRAD does, so I implement three additional filters on the data.
First, I remove loci that have more than two alleles in a locus; second, I remove loci with more than three heterozygous sites;
and lastly, I remove stacks from the final output that are heterozygous at the same site across more than four samples.
handle = "REALDATA/stacks85/batch_1.haplotypes.tsv"
lines = open(handle).readlines()

FILTER = []
NOFILTER = []
for line in lines[1:]:
    dat = line.strip().split("\t")[1:]
    cnt = int(dat[0])
    ocnt = int(dat[0])
    hets = []
    for loc in dat:
        # remove loci with more than 2 alleles
        if loc.count("/") > 1:
            dat.remove(loc)
            cnt -= 1
        else:
            if "/" in loc:
                a,b = loc.split("/")
                # remove loci with more than 3 hetero sites
                if len(a) > 3:
                    dat.remove(loc)
                    cnt -= 1
                else:
                    # record hetero site positions to detect sharing across samples
                    h = [i for i in range(len(a)) if a[i]!=b[i]]
                    for i in h:
                        hets.append(i)
    # remove stack if a hetero site is shared across more than 4 samples
    if [hets.count(i) for i in set(hets)]:
        if max([hets.count(i) for i in set(hets)]) > 4:
            cnt = 0
    FILTER.append(cnt)
    NOFILTER.append(ocnt)
filt = [FILTER.count(i) for i in range(1,14)]
nofilt = [NOFILTER.count(i) for i in range(1,14)]
SEMPF = [int(sum(filt)-sum(filt[:i-1])) for i in range(1,14)]
SEMPN = [int(sum(nofilt)-sum(nofilt[:i-1])) for i in range(1,14)]
print "mean coverage stacks no filter", mean(NOFILTER)
print "%fullcov stacks no filter",float([NOFILTER.count(13)][0])/sum([FILTER.count(i) for i in range(1,14)])*100
print ""
print "mean coverage stacks paralog filtered", mean(FILTER)
print "%fullcov stacks filtered",float([FILTER.count(13)][0])/sum([FILTER.count(i) for i in range(1,14)])*100
mean coverage stacks no filter 2.65351217823
%fullcov stacks no filter 0.93914895677

mean coverage stacks paralog filtered 2.44040946562
%fullcov stacks filtered 0.543641380646
def meancovpyrad(lines):
    tcov = []
    for line in lines:
        a = line.split("\t")
        tcov += [int(a[0])]*int(a[1])
    full = 100*float(tcov.count(13))/len(tcov)
    return mean(tcov), full
infile = open("REALDATA/stats.nopar/c85d6m2pN.stats").readlines()[27:40]
nofiltpycov, fullnofiltpy = meancovpyrad(infile)
infile = open("REALDATA/stats/c85d6m2p3H3N3.stats").readlines()[27:40]
filtpycov, fullfiltpy = meancovpyrad(infile)
print "mean coverage pyrad no filter", nofiltpycov
print "%fullcov pyrad no filter", fullnofiltpy
print ""
print "mean coverage pyrad paralog filtered", filtpycov
print "%fullcov pyrad paralog filtered", fullfiltpy
mean coverage pyrad no filter 2.99135473853
%fullcov pyrad no filter 1.35862030721

mean coverage pyrad paralog filtered 2.96574310049
%fullcov pyrad paralog filtered 0.985513914028
%load_ext rmagic
%Rpush SEMPF SEMPN PEMPF PEMPN
%%R
DIFFN <- c(SEMPN-PEMPN)
DIFFF <- c(SEMPF-PEMPF)
%%R
REALRES = cbind(PEMPN,SEMPN,PEMPF,SEMPF,DIFFN,DIFFF)
print(REALRES)
       PEMPN  SEMPN  PEMPF  SEMPF  DIFFN  DIFFF
 [1,] 180330 213742 165193 206570  33412  41377
 [2,]  78326  86858  72317  82252   8532   9935
 [3,]  51231  52562  47542  48868   1331   1326
 [4,]  42819  42429  39505  39032   -390   -473
 [5,]  37216  35963  34146  32694  -1253  -1452
 [6,]  33022  31270  30151  28009  -1752  -2142
 [7,]  29293  27159  26532  23780  -2134  -2752
 [8,]  25804  23508  23162  20024  -2296  -3138
 [9,]  22171  19807  19470  16182  -2364  -3288
[10,]  17596  15489  15031  11873  -2107  -3158
[11,]  12330  10649  10078   7533  -1681  -2545
[12,]   6843   5791   5165   3678  -1052  -1487
[13,]   2450   1940   1628   1123   -510   -505
The difference in the number of loci recovered by the two programs is shown, plotted on a log scale; this offers a better way to
visualize the difference in performance between the two programs. Above the zero line (grey) Stacks recovers more loci, and below
the line (black) pyRAD recovers more. pyRAD returns more loci at high coverage than Stacks, which instead recovers many more
singleton and low-coverage loci.
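The plot relies on a sign-preserving log transform so that differences of either sign can share one log-scaled axis; the R code below achieves this by negating the first three (grey) rows and relabeling the axis. An equivalent transform written in Python (signed_log is a hypothetical helper, shown only to clarify the idea):

```python
import math

def signed_log(diffs):
    """Map each difference d to log(|d|) with the sign of d preserved
    (0 maps to 0), so positive and negative differences land on
    opposite sides of the zero line."""
    return [math.copysign(math.log(abs(d)), d) if d else 0.0 for d in diffs]

print([round(v, 2) for v in signed_log([33412, -390, -2364])])
# → [10.42, -5.97, -7.77]
```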
%%R
diffplot <- function() {
par(cex.axis=1.1,
cex.main=1.1,
cex.lab=1.6,
mar=c(4,4,3,0),
oma=c(1,2,0,2))
D <- log(abs(DIFFN)) * c(-1,-1,-1,rep(1,10))
E <- log(abs(DIFFF)) * c(-1,-1,-1,rep(1,10))
names(D) <- seq(1,13)
df.bar <- barplot(D, col=c(rep('grey',3),rep('black',10)),
ylim=c(-12,10), space=0.5,
ylab="Log difference in number of loci recovered",
xlab="Minimum taxon coverage",axes=F)
box()
abline(h=0,lwd=2,lty=2)
points(df.bar, E, bg=c(rep('grey',3),rep('black',10)),
cex=2, pch=21)
axis(2,c(-10,-5,0,5,10),c("10","5","0","5","10"))
}
diffplot()
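The plotting trick above is a signed log transform: each difference's magnitude is log-compressed, while its sign (hard-coded in the R code via the `c(-1,-1,-1, rep(1,10))` multiplier) decides which side of the zero line the bar falls on. A minimal Python sketch of the same transform, using hypothetical difference values:

```python
import math

def signed_log(d):
    """Log-compress the magnitude of d while preserving its sign, so that
    positive and negative differences share one symmetric axis.
    Assumes d != 0 (log(0) is undefined)."""
    return math.copysign(math.log(abs(d)), d)

# hypothetical loci-count differences (pyRAD minus Stacks)
diffs = [-33412, -8532, 390, 2296]
heights = [signed_log(d) for d in diffs]
```

Bars with negative heights fall below the zero line, exactly as the first three (grey) bars do in the R plot.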
%%R
pdf("diffplot.pdf", width=7, height=6, family='Times')
diffplot()
dev.off()
import numpy

def counts(infile, mincov):
    """Return the fraction of loci (with >= mincov sequences) whose
    alignment contains at least one indel block."""
    lines = [i.split("\n") for i in infile]
    INDS = []
    for loc in lines[:-1]:
        seqs = []
        for line in loc:
            if ">" in line:
                seqs.append(line.split(" ")[-1])
        if len(seqs) >= mincov:
            N = numpy.array([list(i) for i in seqs])
            cols = iter(N.transpose())
            inds = 0
            col = cols.next()
            ## walk across alignment columns, counting each
            ## gap ('-') block followed by a non-gap column
            while 1:
                if "-" in col:
                    try: col = cols.next()
                    except StopIteration: break
                    if "-" not in col:
                        inds += 1
                else:
                    try: col = cols.next()
                    except StopIteration: break
            INDS.append(inds)
    ## fraction of retained loci containing at least one indel
    i = [i for i in INDS if i != 0]
    return float(len(i))/len(INDS)
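To make the logic concrete, here is a simplified, self-contained Python 3 sketch of the same calculation on a toy input. It counts a locus as containing an indel if any alignment column includes `-`; the `counts()` function above additionally tallies separate indel blocks per locus, but the returned fraction is based on the same at-least-one-indel criterion.

```python
def indel_locus_fraction(loci, mincov):
    """Fraction of loci (with >= mincov sequences) whose alignment
    contains at least one indel ('-') column. Simplified Python 3
    sketch of the counts() logic above."""
    with_indel = 0
    kept = 0
    for seqs in loci:
        if len(seqs) < mincov:
            continue
        kept += 1
        # zip(*seqs) iterates over alignment columns (like N.transpose())
        if any("-" in col for col in zip(*seqs)):
            with_indel += 1
    return with_indel / kept

# toy input: two loci pass mincov=2; only the first contains an indel column
loci = [["AAT-G", "AATCG"], ["AATCG", "AATCG"], ["AA"]]
```

Here `indel_locus_fraction(loci, 2)` returns 0.5: the single-sequence locus is excluded, and one of the two remaining loci carries an indel.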
#infile1 = open("REALDATA/outfiles/c85d6m2pN.loci").read().split("//")
infile1 = open("REALDATA/outfiles/c85d6m2p3H3N3.loci").read().split("//")
infile2 = open("SIMsmall/Py_high/outfiles/c85d6m2pN.loci").read().split("//")
infile3 = open("SIMsmall/Py_low/outfiles/c85d6m2pN.loci").read().split("//")
for i in [13,12,10,8,6,4]:
dat = counts(infile1,i)
print 'cyatho',i,"%0.3f" % dat
indelesthigh = counts(infile2,12)
indelestlow = counts(infile3,12)
print 'simhigh', 12, "%0.3f" % indelesthigh
print 'simlow', 12, "%0.3f" % indelestlow
cyatho 13 0.141
cyatho 12 0.156
cyatho 10 0.214
cyatho 8 0.260
cyatho 6 0.287
cyatho 4 0.306
simhigh 12 0.467
simlow 12 0.195
Simulate the large RADseq data set on the same species tree as before, with no indel variation, but allowing mutations to arise in the restriction recognition site (a 25bp region in this case, to simulate extensive locus dropout). I simulate 500K loci to create a data set that, due to locus dropout, has many fewer loci within any individual sample (~150K), such that within-sample clustering is fast, but many loci to cluster across samples (500K), such that across-sample clustering is slow. I compare run times for analyzing these data in Stacks and pyRAD, including when hierarchical clustering is implemented in pyRAD.
!python simradsMANU.py 0.0 50 500000 1 SIMbig/big12.fastq
simulating 500000 RAD loci across 12 tip taxa with 1 samples per taxon
indels arise at frequency of 0.000000
mutations in restriction site at length (freq) = 50
theta=4Nu= 0.0014
creating new barcode map
%%bash
mkdir SIMbig/big0/
mkdir SIMbig/big4/
%time ! ~/pyrad_v.1.62/pyRAD -p SIMbig/paramsbig.txt -s 12345 2> SIMbig/log.big
CPU times: user 10.67 s, sys: 1.66 s, total: 12.33 s
Wall time: 7884.25 s
%time ! ~/pyrad_v.1.62/pyRAD -p SIMbig/paramsbig.txt -s67 2> SIMbig/log.big
usearch_i86linux32 v6.0.307, 4.0Gb RAM (49.5Gb total), 24 cores
(C) Copyright 2010-12 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
Licensed to: daeaton@uchicago.edu

ingroup ['1A0', '1B0', '1C0', '1D0', '2E0', '2F0', '2G0', '2H0', '3I0', '3J0', '3K0', '3L0']
addon []
exclude []
............
final stats written to: SIMbig/big0/stats/big.c85d6m2pN.stats
output files written to: SIMbig/big0/outfiles/ directory
CPU times: user 44.70 s, sys: 7.13 s, total: 51.83 s
Wall time: 33415.28 s
%%bash
mv SIMbig/big0/clust.85/ SIMbig/big4/
mkdir SIMbig/big4/stats
%time ! ~/pyrad_v.1.62/pyRAD -p SIMbig/paramsbigH4.txt -s 67 2> SIMbig/log.bigH4
1 4
2 4
3 4
usearch_i86linux32 v6.0.307, 4.0Gb RAM (49.5Gb total), 24 cores
(C) Copyright 2010-12 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
Licensed to: daeaton@uchicago.edu

ingroup ['1A0', '1B0', '1C0', '1D0', '2E0', '2F0', '2G0', '2H0', '3I0', '3J0', '3K0', '3L0']
addon []
exclude []
. . . . . .
final stats written to: SIMbig/big4/stats/big.c85d6m2pN.stats
output files written to: SIMbig/big4/outfiles/ directory
CPU times: user 8.23 s, sys: 1.12 s, total: 9.36 s
Wall time: 6755.55 s
%%bash
mkdir SIMbig/stackf
cut -f 2 SIMbig/big12.fastq.barcodes > SIMbig/big12.stacks.barcodes
! process_radtags -f SIMbig/big12.fastq -o SIMbig/stackf -b SIMbig/big12.stacks.barcodes -e pstI -E phred33
Using Phred+33 encoding for quality scores.
Found 1 input file(s).
Searching for single-end, inlined barcodes.
Loaded 12 barcodes (6bp).
22173620 total reads;
  -10000000 ambiguous barcodes;
  -0 ambiguous RAD-Tags;
  +0 recovered;
  -0 low quality reads;
  12173620 retained reads.
Closing files, flushing buffers...
Outputing details to log: 'SIMbig/stackf/process_radtags.log'

22173620 total sequences;
  10000000 ambiguous barcode drops;
  0 low quality read drops;
  0 ambiguous RAD-Tag drops;
12173620 retained reads.
Because the barcodes were generated randomly, and each of the 120 input files would otherwise have to be entered by hand for the Stacks command denovo_map.pl, I wrote a script here that finds the sample names, as output by process_radtags, and creates an input command for denovo_map.pl, written to a bash script called stacks.sh.
makestackslist("nom")
import glob

infiles = ["-s "+ff+" " for ff in glob.glob("SIMbig/stackf/*.fq")]
denovo = "denovo_map.pl -m 2 -M 13 -n 15 -T 12 -b 1 -B 1 -S -X "+\
         "'populations:-m 6' -D 'SimRADsBIG' -o SIMbig/stackf "+"".join(infiles)
with open("SIMbig/stacks.sh",'w') as outfile:
    outfile.write(denovo)
%time ! sh SIMbig/stacks.sh
Found 12 sample file(s).
Identifying unique stacks; file 1 of 12 [sample_GTTAGG]
/usr/local/bin//ustacks -t fastq -f SIMbig/stackf/sample_GTTAGG.fq -o SIMbig/stackf -i 1 -m 2 -M 13 -p 12 2>&1
Identifying unique stacks; file 2 of 12 [sample_ATGTGT]
/usr/local/bin//ustacks -t fastq -f SIMbig/stackf/sample_ATGTGT.fq -o SIMbig/stackf -i 2 -m 2 -M 13 -p 12 2>&1
Identifying unique stacks; file 3 of 12 [sample_TGGTGG]
/usr/local/bin//ustacks -t fastq -f SIMbig/stackf/sample_TGGTGG.fq -o SIMbig/stackf -i 3 -m 2 -M 13 -p 12 2>&1
Identifying unique stacks; file 4 of 12 [sample_TGATGA]
/usr/local/bin//ustacks -t fastq -f SIMbig/stackf/sample_TGATGA.fq -o SIMbig/stackf -i 4 -m 2 -M 13 -p 12 2>&1
Identifying unique stacks; file 5 of 12 [sample_ATTTGG]
/usr/local/bin//ustacks -t fastq -f SIMbig/stackf/sample_ATTTGG.fq -o SIMbig/stackf -i 5 -m 2 -M 13 -p 12 2>&1
Identifying unique stacks; file 6 of 12 [sample_TGGGTG]
/usr/local/bin//ustacks -t fastq -f SIMbig/stackf/sample_TGGGTG.fq -o SIMbig/stackf -i 6 -m 2 -M 13 -p 12 2>&1
Identifying unique stacks; file 7 of 12 [sample_ATTGAG]
/usr/local/bin//ustacks -t fastq -f SIMbig/stackf/sample_ATTGAG.fq -o SIMbig/stackf -i 7 -m 2 -M 13 -p 12 2>&1
Identifying unique stacks; file 8 of 12 [sample_GGAGAA]
/usr/local/bin//ustacks -t fastq -f SIMbig/stackf/sample_GGAGAA.fq -o SIMbig/stackf -i 8 -m 2 -M 13 -p 12 2>&1
Identifying unique stacks; file 9 of 12 [sample_AGTTAT]
/usr/local/bin//ustacks -t fastq -f SIMbig/stackf/sample_AGTTAT.fq -o SIMbig/stackf -i 9 -m 2 -M 13 -p 12 2>&1
Identifying unique stacks; file 10 of 12 [sample_TAGTTA]
/usr/local/bin//ustacks -t fastq -f SIMbig/stackf/sample_TAGTTA.fq -o SIMbig/stackf -i 10 -m 2 -M 13 -p 12 2>&1
Identifying unique stacks; file 11 of 12 [sample_GAAATA]
/usr/local/bin//ustacks -t fastq -f SIMbig/stackf/sample_GAAATA.fq -o SIMbig/stackf -i 11 -m 2 -M 13 -p 12 2>&1
Identifying unique stacks; file 12 of 12 [sample_AGTTAA]
/usr/local/bin//ustacks -t fastq -f SIMbig/stackf/sample_AGTTAA.fq -o SIMbig/stackf -i 12 -m 2 -M 13 -p 12 2>&1
Generating catalog...
/usr/local/bin//cstacks -b 1 -o SIMbig/stackf -s SIMbig/stackf/sample_GTTAGG -s SIMbig/stackf/sample_ATGTGT -s SIMbig/stackf/sample_TGGTGG -s SIMbig/stackf/sample_TGATGA -s SIMbig/stackf/sample_ATTTGG -s SIMbig/stackf/sample_TGGGTG -s SIMbig/stackf/sample_ATTGAG -s SIMbig/stackf/sample_GGAGAA -s SIMbig/stackf/sample_AGTTAT -s SIMbig/stackf/sample_TAGTTA -s SIMbig/stackf/sample_GAAATA -s SIMbig/stackf/sample_AGTTAA -n 15 -p 12 2>&1
Matching RAD-Tags to catalog; file 1 of 12 [sample_GTTAGG]
/usr/local/bin//sstacks -b 1 -c SIMbig/stackf/batch_1 -s SIMbig/stackf/sample_GTTAGG -o SIMbig/stackf -p 12 2>&1
Matching RAD-Tags to catalog; file 2 of 12 [sample_ATGTGT]
/usr/local/bin//sstacks -b 1 -c SIMbig/stackf/batch_1 -s SIMbig/stackf/sample_ATGTGT -o SIMbig/stackf -p 12 2>&1
Matching RAD-Tags to catalog; file 3 of 12 [sample_TGGTGG]
/usr/local/bin//sstacks -b 1 -c SIMbig/stackf/batch_1 -s SIMbig/stackf/sample_TGGTGG -o SIMbig/stackf -p 12 2>&1
Matching RAD-Tags to catalog; file 4 of 12 [sample_TGATGA]
/usr/local/bin//sstacks -b 1 -c SIMbig/stackf/batch_1 -s SIMbig/stackf/sample_TGATGA -o SIMbig/stackf -p 12 2>&1
Matching RAD-Tags to catalog; file 5 of 12 [sample_ATTTGG]
/usr/local/bin//sstacks -b 1 -c SIMbig/stackf/batch_1 -s SIMbig/stackf/sample_ATTTGG -o SIMbig/stackf -p 12 2>&1
Matching RAD-Tags to catalog; file 6 of 12 [sample_TGGGTG]
/usr/local/bin//sstacks -b 1 -c SIMbig/stackf/batch_1 -s SIMbig/stackf/sample_TGGGTG -o SIMbig/stackf -p 12 2>&1
Matching RAD-Tags to catalog; file 7 of 12 [sample_ATTGAG]
/usr/local/bin//sstacks -b 1 -c SIMbig/stackf/batch_1 -s SIMbig/stackf/sample_ATTGAG -o SIMbig/stackf -p 12 2>&1
Matching RAD-Tags to catalog; file 8 of 12 [sample_GGAGAA]
/usr/local/bin//sstacks -b 1 -c SIMbig/stackf/batch_1 -s SIMbig/stackf/sample_GGAGAA -o SIMbig/stackf -p 12 2>&1
Matching RAD-Tags to catalog; file 9 of 12 [sample_AGTTAT]
/usr/local/bin//sstacks -b 1 -c SIMbig/stackf/batch_1 -s SIMbig/stackf/sample_AGTTAT -o SIMbig/stackf -p 12 2>&1
Matching RAD-Tags to catalog; file 10 of 12 [sample_TAGTTA]
/usr/local/bin//sstacks -b 1 -c SIMbig/stackf/batch_1 -s SIMbig/stackf/sample_TAGTTA -o SIMbig/stackf -p 12 2>&1
Matching RAD-Tags to catalog; file 11 of 12 [sample_GAAATA]
/usr/local/bin//sstacks -b 1 -c SIMbig/stackf/batch_1 -s SIMbig/stackf/sample_GAAATA -o SIMbig/stackf -p 12 2>&1
Matching RAD-Tags to catalog; file 12 of 12 [sample_AGTTAA]
/usr/local/bin//sstacks -b 1 -c SIMbig/stackf/batch_1 -s SIMbig/stackf/sample_AGTTAA -o SIMbig/stackf -p 12 2>&1
Calculating population-level summary statistics
/usr/local/bin//populations -b 1 -P SIMbig/stackf -s -t 12 -m 6 2>&1
CPU times: user 221.47 s, sys: 34.13 s, total: 255.60 s
Wall time: 73780.51 s
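To put the reported run times on a common scale, a quick conversion to hours (values copied from the %time outputs above; the pyRAD runs are labeled by their -s step flags):

```python
# wall times (seconds) reported in the outputs above
wall = {
    "pyRAD steps 1-5 (within-sample)": 7884.25,
    "pyRAD steps 6-7 (across-sample)": 33415.28,
    "pyRAD steps 6-7 (hierarchical)": 6755.55,
    "Stacks denovo_map.pl": 73780.51,
}
# convert each wall time to hours, rounded to two decimals
hours = {name: round(secs / 3600, 2) for name, secs in wall.items()}
```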
Measure stats for big simulation analysis
from numpy import mean

infile = open("SIMbig/stackf/batch_1.haplotypes.tsv").readlines()
tcov = [line.split("\t")[1] for line in infile[1:]]
print mean(map(int, tcov)), 'mean coverage stacks'
infile = open("SIMbig/big0/stats/big.c85d6m2pN.stats").readlines()[26:38]
tcov = []
for line in infile:
a = line.split("\t")
tcov += [int(a[0])]*int(a[1])
print mean(tcov), 'mean coverage pyrad'
infile = open("SIMbig/big4/stats/big.c85d6m2pN.stats").readlines()[26:38]
tcov = []
for line in infile:
a = line.split("\t")
tcov += [int(a[0])]*int(a[1])
print mean(tcov), 'mean coverage pyrad hierarchical'
4.4786988998 mean coverage stacks
4.47936475229 mean coverage pyrad
2.8736130253 mean coverage pyrad hierarchical
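The three parsing loops above repeat the same weighted-mean calculation; it could be factored into a small helper. A sketch, using hypothetical stats rows of the form `<coverage>\t<count>`:

```python
def mean_taxon_coverage(stat_lines):
    """Mean taxon coverage from pyRAD stats rows of the form
    '<coverage>\t<count>' (sketch mirroring the parsing above)."""
    tcov = []
    for line in stat_lines:
        cov, cnt = line.split("\t")[:2]
        # expand each coverage value by its observed count
        tcov += [int(cov)] * int(cnt)
    return sum(tcov) / float(len(tcov))

# hypothetical rows: coverage 2 seen 3 times, coverage 4 seen once
result = mean_taxon_coverage(["2\t3", "4\t1"])  # (2*3 + 4) / 4 = 2.5
```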
from numpy import array

infile = open("SIMbig/big0/stats/big.c85d6m2pN.stats").readlines()
phigh = [int(infile[j].strip().split("\t")[1]) for j in range(26,38)]
P0 = [sum(phigh)-sum(phigh[:i-1]) for i in range(1,13)]
infile = open("SIMbig/big4/stats/big.c85d6m2pN.stats").readlines()
phigh = [int(infile[j].strip().split("\t")[1]) for j in range(26,38)]
P4 = [sum(phigh)-sum(phigh[:i-1]) for i in range(1,13)]
lines = open("SIMbig/stackf/batch_1.haplotypes.tsv").readlines()
cnts = [int(field.strip().split("\t")[1]) for field in lines[1:]]
sno = [cnts.count(i) for i in range(1,13)]
S0 = [int(sum(sno)-sum(sno[:i-1])) for i in range(1,13)]
print array([P0,P4,S0]).transpose()
[[ 135884.   85618.  135885.]
 [ 121263.   45613.  121228.]
 [ 106282.   45613.  106261.]
 [  82504.   45613.   82487.]
 [  59113.    5609.   59106.]
 [  43673.    5609.   43670.]
 [  29057.    5609.   29054.]
 [  16368.    5609.   16367.]
 [   8528.     285.    8528.]
 [   4206.     285.    4206.]
 [   1511.     285.    1511.]
 [    285.     285.     285.]]
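The `sum(phigh)-sum(phigh[:i-1])` comprehensions above compute, for each minimum taxon coverage i, the number of loci recovered in at least i samples. The same reverse cumulative sum can be written more directly; a sketch with hypothetical per-coverage counts:

```python
from itertools import accumulate

# counts[k] = number of loci recovered in exactly k+1 samples (hypothetical)
counts = [50, 30, 15, 5]

# loci recovered in at least i samples, for i = 1..4; equivalent to
# [sum(counts) - sum(counts[:i-1]) for i in range(1, len(counts) + 1)]
at_least = list(accumulate(counts[::-1]))[::-1]
```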
100*(253./500000)
0.050600000000000006