Paired-end GBS (or similarly, paired Ez-RAD) data may commonly contain first and second reads that overlap to some degree and therefore should be merged. This notebook will outline how to analyze such data in pyRAD (v.3.03 or newer). I will demonstrate this example using an empirical data set of mine. To begin, we will create an empty directory within our current directory in which to perform the assembly.
%%bash
mkdir -p analysis_pyrad
Next we create an empty params file by calling pyRAD with the -n option.
%%bash
pyrad -n
new params.txt file created
Then we enter some arguments into the params file, which can be done using any text editor. I use the script below to automate it here. Some important arguments for dealing with PE-GBS data in particular include the following:
Set the correct datatype. In the case of this analysis, we will use 'pairgbs' for de-multiplexing in step 1, but then switch it to 'merged' for steps 2-7 after we merge the paired reads.
The filter setting (param 21) is not needed in this case because we will be using an external read merging program that applies this filtering.
We will, however, still probably want to trim the overhanging edges of reads (param 29), which helps to remove barcodes or adapters that were missed.
It can also be helpful with this data type to use a low mindepth setting in conjunction with majority rule consensus base calls (param 31). This will tell pyrad to make statistical base calls for all sites with depth >mindepth, but to make majority rule base calls at sites with depth lower than mindepth, but higher than the majority rule minimum depth. I first run the analysis with the option turned off, and then at the end of the tutorial show the difference with it turned on.
%%bash
sed -i '/## 1. /c\analysis_pyrad/ ## 1. working directory ' params.txt
sed -i '/## 2. /c\/home/deren/Longiflorae/*.gz ## 2. working directory ' params.txt
sed -i '/## 3. /c\/home/deren/barcodes/cyatho_longi.barcodes ## 3. barcodes ' params.txt
sed -i '/## 7. /c\12 ## 7. N processors ' params.txt
sed -i '/## 9. /c\6 ## 9. NQual ' params.txt
sed -i '/## 10. /c\.85 ## 10. Wclust ' params.txt
sed -i '/## 11. /c\pairgbs ## 11. datatype ' params.txt
sed -i '/## 14. /c\c85d6m4p3 ## 14. output prefix ' params.txt
sed -i '/## 29./c\1,1 ## 29. trim edges ' params.txt
sed -i '/## 30./c\p ## 30. output formats ' params.txt
cat params.txt
==** parameter inputs for pyRAD version 3.0.3 **======================== affected step == analysis_pyrad/ ## 1. working directory /home/deren/Longiflorae/*.gz ## 2. working directory /home/deren/barcodes/cyatho_longi.barcodes ## 3. barcodes vsearch ## 4. command (or path) to call vsearch (or usearch) (s3,s6) muscle ## 5. command (or path) to call muscle (s3,s7) TGCAG ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG) (s1,s2) 12 ## 7. N processors 6 ## 8. Mindepth: min coverage for a cluster (s4,s5) 6 ## 9. NQual .85 ## 10. Wclust pairgbs ## 11. datatype 4 ## 12. MinCov: min samples in a final locus (s7) 3 ## 13. MaxSH: max inds with shared hetero site (s7) c85d6m4p3 ## 14. output prefix ==== optional params below this line =================================== affected step == ## 15.opt.: select subset (prefix* only selector) (s2-s7) ## 16.opt.: add-on (outgroup) taxa (list or prefix*) (s6,s7) ## 17.opt.: exclude taxa (list or prefix*) (s7) ## 18.opt.: loc. of de-multiplexed data (s2) ## 19.opt.: maxM: N mismatches in barcodes (def= 1) (s1) ## 20.opt.: phred Qscore offset (def= 33) (s2) ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict (s2) ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5) ## 23.opt.: maxN: max Ns in a cons seq (def=5) (s5) ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5) (s5) ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5) ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100) (s7) ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7) ## 28.opt.: random number seed (def. 112233) (s3,s6,s7) 1,1 ## 29. trim edges p ## 30. output formats ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5) ## 32.opt.: keep trimmed reads (def=0). Enter min length. (s2) ## 33.opt.: max stack size (int), def= max(500,mean+2*SD) (s3) ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3) ## 35.opt.: use hierarchical clustering (def.=0, 1=yes) (s6) ## 36.opt.: repeat masking (def.=1='dust' method, 0=no) (s3,s6) ## 37.opt.: vsearch max threads per job (def.=6; see docs) (s3,s6) ==== optional: list group/clade assignments below this line (see docs) ==================
%%bash
pyrad -p params.txt -s 1
------------------------------------------------------------ pyRAD : RADseq for phylogenetics & introgression analyses ------------------------------------------------------------ step 1: sorting reads by barcode ...........................................................
It looks like most reads from each file were matched to a barcode.
%%bash
head -n 100 analysis_pyrad/stats/s1.sorting.txt
file Nreads cut_found bar_matched lane2_NoIndex_L002_049.fastq.gz 4000000 3721343 3563126 lane2_NoIndex_L002_039.fastq.gz 4000000 3775744 3746367 lane2_NoIndex_L002_038.fastq.gz 4000000 3749702 3712199 lane2_NoIndex_L002_008.fastq.gz 4000000 3788251 3763415 lane2_NoIndex_L002_030.fastq.gz 4000000 3756116 3721905 lane2_NoIndex_L002_042.fastq.gz 4000000 3821669 3761365 lane2_NoIndex_L002_027.fastq.gz 4000000 3809029 3786687 lane2_NoIndex_L002_046.fastq.gz 4000000 3800963 3717126 lane2_NoIndex_L002_037.fastq.gz 4000000 3746059 3721532 lane2_NoIndex_L002_045.fastq.gz 4000000 3698917 3593579 lane2_NoIndex_L002_019.fastq.gz 4000000 3778683 3750877 lane2_NoIndex_L002_044.fastq.gz 4000000 3751756 3661419 lane2_NoIndex_L002_058.fastq.gz 4000000 3755612 3729931 lane2_NoIndex_L002_032.fastq.gz 4000000 3823466 3804376 lane2_NoIndex_L002_025.fastq.gz 4000000 3821273 3800999 lane2_NoIndex_L002_055.fastq.gz 4000000 3789040 3767778 lane2_NoIndex_L002_023.fastq.gz 4000000 3832294 3814024 lane2_NoIndex_L002_014.fastq.gz 4000000 3832280 3812920 lane2_NoIndex_L002_013.fastq.gz 4000000 3835097 3816100 lane2_NoIndex_L002_031.fastq.gz 4000000 3827911 3808997 lane2_NoIndex_L002_011.fastq.gz 4000000 3845888 3828585 lane2_NoIndex_L002_022.fastq.gz 4000000 3837573 3819811 lane2_NoIndex_L002_024.fastq.gz 4000000 3829959 3810628 lane2_NoIndex_L002_034.fastq.gz 4000000 3611235 3590798 lane2_NoIndex_L002_056.fastq.gz 4000000 3802234 3779975 lane2_NoIndex_L002_026.fastq.gz 4000000 3819201 3799143 lane2_NoIndex_L002_021.fastq.gz 4000000 3835070 3816075 lane2_NoIndex_L002_052.fastq.gz 4000000 3828882 3810331 lane2_NoIndex_L002_041.fastq.gz 4000000 3835218 3791068 lane2_NoIndex_L002_051.fastq.gz 4000000 3833143 3814474 lane2_NoIndex_L002_047.fastq.gz 4000000 3759517 3607421 lane2_NoIndex_L002_059.fastq.gz 4000000 3757946 3719770 lane2_NoIndex_L002_054.fastq.gz 4000000 3819129 3798453 lane2_NoIndex_L002_028.fastq.gz 4000000 3795559 3771229 lane2_NoIndex_L002_048.fastq.gz 4000000 3748548 3623691 lane2_NoIndex_L002_035.fastq.gz 4000000 3266305 3243997 lane2_NoIndex_L002_043.fastq.gz 4000000 3817002 3744305 lane2_NoIndex_L002_057.fastq.gz 4000000 3743088 3719312 lane2_NoIndex_L002_018.fastq.gz 4000000 3794176 3769286 lane2_NoIndex_L002_017.fastq.gz 4000000 3807601 3784147 lane2_NoIndex_L002_050.fastq.gz 4000000 3745707 3651533 lane2_NoIndex_L002_033.fastq.gz 4000000 3821688 3802696 lane2_NoIndex_L002_040.fastq.gz 4000000 3761738 3722996 lane2_NoIndex_L002_053.fastq.gz 4000000 3822961 3803422 lane2_NoIndex_L002_029.fastq.gz 4000000 3781432 3755276 lane2_NoIndex_L002_016.fastq.gz 4000000 3822901 3802934 lane2_NoIndex_L002_036.fastq.gz 4000000 3805280 3783717 lane2_NoIndex_L002_020.fastq.gz 4000000 3732223 3681151 lane2_NoIndex_L002_005.fastq.gz 4000000 3814967 3793880 lane2_NoIndex_L002_006.fastq.gz 4000000 3812538 3792209 lane2_NoIndex_L002_001.fastq.gz 4000000 3838930 3821090 lane2_NoIndex_L002_007.fastq.gz 4000000 3799427 3776091 lane2_NoIndex_L002_015.fastq.gz 4000000 3822598 3802203 lane2_NoIndex_L002_004.fastq.gz 4000000 3823816 3803572 lane2_NoIndex_L002_010.fastq.gz 4000000 3761201 3730496 lane2_NoIndex_L002_009.fastq.gz 4000000 3777890 3750566 lane2_NoIndex_L002_002.fastq.gz 4000000 3826291 3807448 lane2_NoIndex_L002_003.fastq.gz 4000000 3824097 3804683 lane2_NoIndex_L002_012.fastq.gz 4000000 3838714 3820626 sample true_bar obs_bars N_obs d31733 AAAAGTT AAAAGTT 5114026 d31733 AAAAGTT AANAGTT 813193 d31733 AAAAGTT AAGAGTT 90500 d31733 AAAAGTT AACAGTT 57040 d31733 AAAAGTT AAAAGGT 19502 d31733 AAAAGTT AAAAGAT 19214 d31733 AAAAGTT AATAGTT 13373 d31733 AAAAGTT AAAGGTT 9539 d31733 AAAAGTT AAAAGCT 8946 d31733 AAAAGTT GAAAGTT 8772 d31733 AAAAGTT NAAAGTT 8297 d31733 AAAAGTT AAAAGTC 7090 d31733 AAAAGTT AAAAATT 6140 d31733 AAAAGTT AGAAGTT 6042 d31733 AAAAGTT AAAATTT 4801 d31733 AAAAGTT CAAAGTT 4089 d31733 AAAAGTT AAAAGTG 3887 d31733 AAAAGTT AAACGTT 3255 d31733 AAAAGTT ACAAGTT 2946 d31733 AAAAGTT TAAAGTT 1997 d31733 AAAAGTT AAAAGTA 1857 d31733 AAAAGTT ATAAGTT 1145 d31733 AAAAGTT AAATGTT 994 d31733 AAAAGTT AAAACTT 896 d31733 AAAAGTT ANAAGTT 512 d33291 AACGCCT AACGCCT 949865 d33291 AACGCCT AANGCCT 150664 d33291 AACGCCT AAGGCCT 18149 d33291 AACGCCT AAAGCCT 14152 d33291 AACGCCT AATGCCT 3499 d33291 AACGCCT AACGACT 2636 d33291 AACGCCT AACGCTT 2091 d33291 AACGCCT AACGTCT 2053 d33291 AACGCCT GACGCCT 1775 d33291 AACGCCT AACGCCC 1714 d33291 AACGCCT AGCGCCT 1543 d33291 AACGCCT NACGCCT 1474
For this we will use the program PEAR. This step is very important. I can pretty much assure you that you will have many paired end reads in your library that overlap, and if you do not merge these reads they can really mess up your analysis. In this case the vast majority of paired reads are overlapping (~80%), and in fact, the merged reads are the only portion of the data that we will be using for analysis.
The first thing we have to do is unzip the reads if they are gzipped. If you demultiplexed your data in pyrad they will be in the fastq/ directory.
%%bash
## For this example I subselect only the samples from
## this library that start with the letter 'd'
gunzip analysis_pyrad/fastq/d*.gz
I prefer using a stringent filter setting, including the -q
option which trims parts of the read that are low quality. This often results in removing all reads that are not merged. You can assemble merged and non-merged reads separately, as described in the 'non-merged PE-GBS tutorial', but here we will focus on just analyzing the merged reads.
%%bash
for gfile in analysis_pyrad/fastq/*_R1.fq;
do pear -f $gfile \
-r ${gfile/_R1.fq/_R2.fq} \
-o ${gfile/_R1.fq/} \
-n 33 \
-t 33 \
-q 10 \
-j 20 >> pear.log 2>&1;
done
%%bash
## set location of demultiplexed data that are 'pear' filtered
sed -i '/## 18./c\analysis_pyrad/fastq/*.assembled.fastq ## 18. demulti data loc ' ./params.txt
%%bash
sed -i '/## 11./c\merged ## 11. data type ' ./params.txt
%%bash
pyrad -p params.txt -s 2
------------------------------------------------------------ pyRAD : RADseq for phylogenetics & introgression analyses ------------------------------------------------------------ sorted .fastq from analysis_pyrad/fastq/*.assembled.fastq being used step 2: editing raw reads ........................
%%bash
cat analysis_pyrad/stats/s2.rawedit.txt
sample Nreads passed passed.w.trim passed.total d19long1.assembled 2973258 2865913 0 2865913 d30181.assembled 4178962 4048613 0 4048613 d30695.assembled 4435144 4249855 0 4249855 d31733.assembled 4757784 4598646 0 4598646 d33291.assembled 847503 819072 0 819072 d34041.assembled 988252 969387 0 969387 d35178.assembled 2479633 2429555 0 2429555 d35320.assembled 1308443 1269046 0 1269046 d35371.assembled 572512 559953 0 559953 d35422.assembled 3069039 3016381 0 3016381 d39103.assembled 2629656 2523785 0 2523785 d39104.assembled 3415956 3353178 0 3353178 d39114.assembled 2215501 2133972 0 2133972 d39187.assembled 907321 875889 0 875889 d39253.assembled 5665909 5444296 0 5444296 d39404.assembled 2567688 2447216 0 2447216 d39531.assembled 3254639 3128372 0 3128372 d39968.assembled 3561187 3441543 0 3441543 d40328.assembled 1651265 1601896 0 1601896 d41058.assembled 2444960 2395622 0 2395622 d41237.assembled 1847033 1815168 0 1815168 d41389.assembled 2491305 2433467 0 2433467 d41732.assembled 2344042 2291187 0 2291187 decor21.assembled 5056601 4926349 0 4926349 Nreads = total number of reads for a sample passed = retained reads that passed quality filtering at full length passed.w.trim= retained reads that were trimmed due to detection of adapters passed.total = total kept reads of sufficient length note: you can set the option in params file to include trimmed reads of xx length.
%%bash
pyrad -p params.txt -s 3
%%bash
cat analysis_pyrad/stats/s3
%%bash
gzip -cd analysis_pyrad/clust.85/d35422.assembled.clustS.gz | head -n 100 | cut -b 1-80
>d35422.assembled_2120_r1;size=51; TGCAGCGCTACCTGCCCGATCTTTTCGAGAAGCGGACGAACTGAGAATGAACCTGCA >d35422.assembled_274175_r1;size=15;+ TGCAGCGCTACCTGCCCGATCTGTTCGAGAAGCGGACGAACTGAGAATGAACCTGCA >d35422.assembled_2265219_r1;size=2;+ TGCAGCGCTACCTGCCGGATCTGTTCGAGAAGCGGACGAACTGAGAATGAACCTGCA >d35422.assembled_216079_r1;size=1;+ TGCAGCGCTACCTGCCCGATCTGCTCGAGAAGCGGACGAACTGAGAATGAACCTGCA >d35422.assembled_2495995_r1;size=1;+ TGCAGCGCTACCTGCCCGGTCTTTTCGAGAAGCGGACGAACTGAGAATGAACCTGCA >d35422.assembled_590416_r1;size=1;+ TGCAGCGCTATCTGCCCGATCTTTTCGAGAAGCGGACGAACTGAGAATGAACCTGCA >d35422.assembled_1947641_r1;size=1;+ TGCAGCGCTACCTGCCCGATCTTTTCGAGAAGCGGACGAGCTGAGAATGAACCTGCA >d35422.assembled_1601389_r1;size=1;+ TGCAGCGCTACCTGCCGGATCTGTTCGAGAAGCGGACGAACTGAGAATGACACTGCA >d35422.assembled_190367_r1;size=1;+ TGCAGCGCTACCTGCCCGATCCTTTCGAGAAGCGGACGAACTGAGAATGAACCTGCA >d35422.assembled_2631365_r1;size=1;+ TGCAGCGCTACCTGCCCGATCTTTTCGAGAAGCGGACGAACTGAGAATGAACCTGCG >d35422.assembled_126737_r1;size=1;+ TGCAGCGCTACCTGCCCGATCTTTTCGAGAAGCGGACGAACTGAGAATGAACCTNCA >d35422.assembled_288692_r1;size=1;+ TGCAGCGCTACCTGCCGGATCNGTTCGAGAAGCGGACGAACTGAGAATGACACTGCA >d35422.assembled_2194406_r1;size=1;+ TGCAGCGCTANCTNCCCGATCTTTTCGAGAAGCGGANNANCTGAGAATGAACCTGCA >d35422.assembled_831862_r1;size=1;+ TGCAGCGCTACCTGNCCGATCTTTTCGAGAAGCGGACGAACTGAGAATGAACCTGCA // // >d35422.assembled_61639_r1;size=24; TGCAGCGCCAGTACGAGCAGGCCGAGATGGGCCTGCGCCAGTTCATCCAGTCGCATCCCCGCGACGCCCTGGTGCCGGCC >d35422.assembled_825340_r1;size=8;+ TGCAGCGCCAGTACGAGCAGGCCGAGATGGGCCTGCGTCAGTTCATCCAGTCGCATCCGCGCGACGCCCTGGTGCCGTCC >d35422.assembled_88436_r1;size=2;+ TGCAGCGCCAGTACGAGCAGGCCGAGATGGGCCTGCGCCAGTTCATCCAGTCGCATCCCCGCGACGCCCTGGTGCCGGCC >d35422.assembled_354389_r1;size=2;+ TGCAGCGCCAGTACGAGCAGGCCGAGATGGGCCTGCGCCAGTTCATCCAGTCGCATCCCCGCGACGCCCTGGTGCCGGCC >d35422.assembled_2229078_r1;size=1;+ TGCAGCGCCAGTACGAGCAGGCCGAGATGGGCCTACGTCAGTTCATCCAGTCGCATCCGCGCGACTCCCTGGTGCCGTCC >d35422.assembled_1405977_r1;size=1;+ TGCAGCGCCAGTACGAGCAGGCCGAGATGGGCCTGCGCCAGTTCATNCAGTCGCANCCCCGCGACGCCCTGGTGCCGNCC >d35422.assembled_2738651_r1;size=1;+ TGCAGCGCCAGTACGAGCAGGCCGAGATGGGCCTGCGCCAGTTCATCCAGTCGCATCCCCGCGACGCCCTGGTGCCGGCC >d35422.assembled_2565401_r1;size=1;+ TGCAGCGTCAGTACGAGCTGGCGGAGATGGGCCTGCGTCAGTTCATCCAGTCCCATCCGCGTGATGCACTGGTACCGGCC >d35422.assembled_2060896_r1;size=1;+ TGCAGCGCCAGTACGAGCAGGCCGAGATGGGCCTGCGCCAGTTCATCCAGTCGCATCCGCGCGACGCCCTCGTGCCGGCC >d35422.assembled_27774_r1;size=1;+ TGCAGCGTCAGTACGAGCTGGCGGAGATGGGCCTGCGTCAGTTCATCCAGTCCCATCCGCGTGATGCACTGGTACCGGCC >d35422.assembled_2776613_r1;size=1;+ TGCAGCGCCAGTACGAGCAGGCCGAGATGGGCCTGCGCCAGTTCATCCAGTCGCATCCCCGCGACGCCCTGGTGCCGGCC >d35422.assembled_2566934_r1;size=1;+ TGCAGCGCCAGTACGAGCAGGCCGGGATGGGCCTGCGTCAGTTCATCCAGTCGCATCCGCGCGACGCCCTGGTGCCGTCC // // >d35422.assembled_781655_r1;size=2;+ TGCAGGACCAGGGCTTCGCCGTGACCGCGAACATGGTGACCAGCCGCATCGTCGTGGACGCCGTGGGCACCTTCGACGCC >d35422.assembled_1218064_r1;size=2; TGCAGGACCAGGGCTTCGCCGTGACCGCGAACATGGTGACCAGTCGTGCTGTGGTCGAAGCCGTCGGCGCCTTCGACGCC >d35422.assembled_81417_r1;size=1;+ TGCAGGACCAGGGCTTCGCCGTGNCCGCGAACATGGTNACCACCCGCGCCGTTGTCGACGCCGTCGGCACCTTCGACGCC >d35422.assembled_1447392_r1;size=1;+ TGCAGGACCAGGGCTTCGCNGTGACCNCGAACATGGTGACCAGCCGCATCGTCGTGGACGCCGTGGGCACCTTCGACGCC >d35422.assembled_1916059_r1;size=1;+ TGCAGGACCAGGGCTTCGCCGTGACCGCGAACATGGTGACCAGCCGCGTCGTCGTGGACGCCGTGGGCACCTTCGACGCC // // >d35422.assembled_1086507_r1;size=1; TGCAGNNANC-AAAAAAAATATATG-TTTTTTTTTTTTGCTGCA >d35422.assembled_1935566_r1;size=1;+ TGCAGNTAACNAAAAAAAATATATGTTTTTTTTTTTTTGCTGCA // // >d35422.assembled_105061_r1;size=2; TGCAGGTCTCCTCCGACGTCTACTACTACACGGTCGGCTCGAAGCTCTTCTCGAGGCCGAACAAGCCCCTGCA >d35422.assembled_405948_r1;size=2;+ TGCAGGTCTCCTCCGACGTCTACTACTACACGGTCGGCGCGAACCTCTTCGAGAGGAAGAACAAGCCGCTGCA >d35422.assembled_1112236_r1;size=1;+ TGCAGGTNNCCTCNGACGTCTACTACTACACGGTCGGCGCGAACCTCTTCTCCAAGCCGAACAAGCCGCTGCA >d35422.assembled_129605_r1;size=1;+ TGCAGGTGTCCTCCGACGTCTACTACTACACCGTCGGGTCGAAGCTCTTCTCGAGGCCGAACAAGCCCCTGCA >d35422.assembled_108168_r1;size=1;+ TGCAGGTGTCCTCCGACGTCTACTACTACACGGTCGGCGCGAACCTCTTCTCGAAGCCGAACAAGCCCCTGCA >d35422.assembled_1796775_r1;size=1;+ TGCAGGTGTCCTCCGACGTCTACTACTACACGGTCGGCGCGAACCTCTTCTCCAAGCCGAACAAGCCGCTGCA // // >d35422.assembled_2015519_r1;size=1;+ TGCAGGGGGNCGCGGGGAGCAACCCCAGCTCGGTGCACGTGAGACAGGTGTCGATCACGGGCGGGGACCTCCTGGGGCCT >d35422.assembled_1326334_r1;size=1; TGCAGGGGGTCGCGGGGAGCAACCCCAGCTCGGTGCACGTGAGACAGGTGTCGATCACGGGCGGGGACCTCCTGGGGCCT // // >d35422.assembled_735556_r1;size=2; TGCAGCGCCGCCAGCGGCTGCAGAACTGCCCCCGAGTTTCCAGCTTCTGCA >d35422.assembled_2203593_r1;size=1;+ TGCAGCGCCTGGACGCCCTGCAGAACTGCCCCCGAGTTTCCAGCTTCTGCA // //
gzip: stdout: Broken pipe
%%bash
pyrad -p params.txt -s 4
%%bash
cat analysis_pyrad/stats/Pi_E_estimate.txt
taxa H E d35371.assembled 0.0182707 0.00219242 d39187.assembled 0.02781985 0.00272968 d39104.assembled 0.02073342 0.0015139 d35320.assembled 0.02621882 0.00175474 d34041.assembled 0.02897603 0.00381241 d33291.assembled 0.02675046 0.00311296 d40328.assembled 0.02060546 0.00267209 d35422.assembled 0.04092042 0.00224104 d19long1.assembled 0.02280653 0.00095164 d41389.assembled 0.02985094 0.00193657 d41237.assembled 0.05196614 0.00204698 d39114.assembled 0.02263828 0.00168706 d39404.assembled 0.04004115 0.00165295 d41058.assembled 0.03910496 0.00179648 d39531.assembled 0.03286906 0.0014618 d41732.assembled 0.0367377 0.00251955 d35178.assembled 0.0248532 0.00261068 d31733.assembled 0.03005955 0.00205113 d30181.assembled 0.02449057 0.00222622 d39968.assembled 0.03075244 0.00329168 d39253.assembled 0.03181217 0.00343958 decor21.assembled 0.03917313 0.00253141 d30695.assembled 0.02899983 0.00255832 d39103.assembled 0.03842947 0.00396182
You may want to adjust:
%%bash
pyrad -p params.txt -s 5
%%bash
cat analysis_pyrad/stats/s5.consens.txt
taxon nloci f1loci f2loci nsites npoly poly d35371.assembled 60885 7839 6432 489780 2313 0.0047225 d41237.assembled 293759 10550 5791 433633 2696 0.0062172 d35320.assembled 156419 12616 9039 777103 3193 0.0041089 d41058.assembled 329719 13087 8587 620904 3583 0.0057706 d35422.assembled 169511 13976 8440 588597 3865 0.0065665 d19long1.assembled 161936 11666 8611 763559 2339 0.0030633 d33291.assembled 169929 14920 10442 913599 3976 0.004352 d34041.assembled 197867 19992 14191 931867 6604 0.0070868 d39187.assembled 96755 12685 8535 752923 3913 0.0051971 d39104.assembled 101266 8902 6827 546599 1931 0.0035328 d41732.assembled 452240 23593 15770 1129545 7020 0.0062149 d41389.assembled 211387 21245 14809 1053031 6331 0.0060122 d39404.assembled 243761 23389 13769 1106220 7083 0.0064029 d40328.assembled 136690 24752 18911 1544641 7199 0.0046606 d39114.assembled 218883 25257 18878 1562653 7444 0.0047637 d39531.assembled 332792 26717 16869 1393571 7705 0.005529 d35178.assembled 450842 47869 38116 2467261 21384 0.0086671 decor21.assembled 577428 46634 28116 2021202 14304 0.007077 d31733.assembled 439816 40309 27103 2178840 11820 0.0054249 d30181.assembled 516149 49126 36637 2879253 13306 0.0046213 d39103.assembled 642379 48620 27395 2247944 13442 0.0059797 d30695.assembled 485553 48503 32034 2797840 15100 0.005397 d39253.assembled 431026 51267 31529 2654672 14942 0.0056286 d39968.assembled 456174 61773 39746 3309319 18475 0.0055827 ## nloci = number of loci ## f1loci = number of loci with >N depth coverage ## f2loci = number of loci with >N depth and passed paralog filter ## nsites = number of sites across f loci ## npoly = number of polymorphic sites in nsites ## poly = frequency of polymorphic sites
%%bash
pyrad -p params.txt -s 6
%%bash
pyrad -p params.txt -s 7
%%bash
cat analysis_pyrad/stats/c88d6m4p3.stats
17401 ## loci with > minsp containing data 16418 ## loci with > minsp containing data & paralogs removed 16418 ## loci with > minsp containing data & paralogs removed & final filtering ## number of loci recovered in final data set for each taxon. taxon nloci d19long1.assembled 2232 d30181.assembled 5279 d30695.assembled 8028 d31733.assembled 5717 d33291.assembled 3529 d34041.assembled 4291 d35178.assembled 4252 d35320.assembled 3451 d35371.assembled 2535 d35422.assembled 2776 d39103.assembled 6100 d39104.assembled 2311 d39114.assembled 4695 d39187.assembled 3249 d39253.assembled 7324 d39404.assembled 5203 d39531.assembled 6176 d39968.assembled 6724 d40328.assembled 4921 d41058.assembled 1872 d41237.assembled 1308 d41389.assembled 4676 d41732.assembled 4007 decor21.assembled 7367 ## nloci = number of loci with data for exactly ntaxa ## ntotal = number of loci for which at least ntaxa have data ntaxa nloci saved ntotal 1 - 2 - - 3 - - 4 5937 * 16418 5 3301 * 10481 6 2050 * 7180 7 1386 * 5130 8 910 * 3744 9 588 * 2834 10 412 * 2246 11 329 * 1834 12 241 * 1505 13 167 * 1264 14 140 * 1097 15 121 * 957 16 97 * 836 17 106 * 739 18 91 * 633 19 76 * 542 20 93 * 466 21 107 * 373 22 105 * 266 23 98 * 161 24 63 * 63 ## nvar = number of loci containing n variable sites (pis+autapomorphies). ## sumvar = sum of variable sites (SNPs). ## pis = number of loci containing n parsimony informative sites. ## sumpis = sum of parsimony informative sites. nvar sumvar PIS sumPIS 0 2451 0 6616 0 1 2053 2053 3490 3490 2 1932 5917 1952 7394 3 1655 10882 1254 11156 4 1376 16386 865 14616 5 1193 22351 611 17671 6 959 28105 383 19969 7 806 33747 266 21831 8 660 39027 210 23511 9 499 43518 184 25167 10 429 47808 118 26347 11 324 51372 110 27557 12 266 54564 77 28481 13 250 57814 70 29391 14 214 60810 48 30063 15 172 63390 39 30648 16 171 66126 31 31144 17 139 68489 24 31552 18 104 70361 16 31840 19 107 72394 10 32030 20 79 73974 16 32350 21 87 75801 10 32560 22 59 77099 4 32648 23 58 78433 2 32694 24 43 79465 5 32814 25 35 80340 2 32864 26 36 81276 2 32916 27 27 82005 0 32916 28 33 82929 2 32972 29 36 83973 1 33001 30 23 84663 0 33001 31 23 85376 0 33001 32 13 85792 0 33001 33 20 86452 0 33001 34 12 86860 0 33001 35 14 87350 0 33001 36 10 87710 0 33001 37 7 87969 0 33001 38 8 88273 0 33001 39 7 88546 0 33001 40 3 88666 0 33001 41 2 88748 0 33001 42 2 88832 0 33001 43 2 88918 0 33001 44 4 89094 0 33001 45 2 89184 0 33001 46 4 89368 0 33001 47 0 89368 0 33001 48 2 89464 0 33001 49 3 89611 0 33001 50 2 89711 0 33001 51 0 89711 0 33001 52 1 89763 0 33001 53 1 89816 0 33001 total var= 89816 total pis= 33001
%%bash
head -n 100 analysis_pyrad/outfiles/c88d6m4p3.loci | cut -b 1-88
>d31733.assembled GTACAGCACCGGGTAGCGGCGGRGGCTAGCGGCGTAGCCGGGCGGCAGGTACACCCACACCCGGC >d39114.assembled GTACAGCACCGGGTAGCGGCGGRGGCTAGCGGCGTAGCCGGGCGGCAGGTACACCCACACCCGGC >d39968.assembled GTACAGCACCGGGTAGCGGCGGGGGCTAGCNGCGTAGCCGGGCGGCAGGTACACCCACACCCGGC >decor21.assembled GTACAGCACCGGGTAGCGGCGGRGGCTAGCGGCGTAGCCGGGCGGCAGGTACACCCACACCCGGC // * >d30695.assembled AATTGTGAGTACCGTCGTGACCGA-GGCACCGAGGC >d31733.assembled AATTGTGAGTACCGTCGTGACCGANGGCACCGAGGC >d39103.assembled AATTGTGAGTACCGTCGTGACCGA-GGCACCGAGGC >d39531.assembled AATTGTGAGTACCGTCGTGACCGA-GGCACCGAGGC >d41389.assembled AATTGTGAGTACCGTCGTRACCGA-GGCACCGAGGC >d41732.assembled AATTGTGAGTACCGTCGTGACCGA-GGCACCGAGGC >decor21.assembled AATTGTGAGTACCGTCGTGACCGA-GGCACCGAGGC // - | >d30695.assembled CCRTGGTGCGGCTGGCGCAAGGTGCGSGASCTSGCGGTGGCGCAGGGC >d31733.assembled CCATGGTGCGGCTGGCGCAAGGTGCGSGACCTSGCGGTGGCGCAGGGC >d33291.assembled CCGTGGTGCGGCTGGCGCAAGGTGCGCGAGCTCGCGGTGGCGCAGGGC >d34041.assembled CCGTGGTGCGGCTGGCGCAAGGTGCGCGARCTCGCGGTGGCGCAGGGY >d35320.assembled CCGTGGTGCGGCTGGCGCAAGGTGCGCGAGCTCGCGGTGNCGCAGGGC >d35422.assembled CCGTGGTGCGGCTGGCGCAAGGTGCGCGAGCTCGCGGTGGCGCAGGGC >d39253.assembled CCGTGGTGCGGCTGGCGCAAGGTGCGCGAGCTCGCGGTGGCGCAGGGY >d39531.assembled CCGTGGTGCGGCTGGCGCAAGGTGCGCGAGCTCGCGGTGGCGCAGGGC // * * * * *| >d30181.assembled GCCCGGGAAGCACAAGGCTCCACGTTCCTTGGCCAGGGCATCGCCATTCCCCATGGTACGCCGCA >d30695.assembled GCCCGGGAAGCACAAGGCTCCACGTTCCTTGGCCAGGGCATCGCCATTCCCCATGGTACGCCGCA >d31733.assembled GCCCGGGAAGCACAAGGCTCCACGTTCCTTGGCCAGGGCATCGCCATTCCCCATGGTACGCCGCA >d39114.assembled GCCCGGGAAGCACAAGGCTCCACGTTCCTTGGCCAGGGCATCNCCATTCCCCATGGTACGCCGCA >d39404.assembled GNCCGGGAAGCACAAGGCTCCACGTTCCTTGGCCAGGGCATCGCCATTCCCCATGGTACGCCGCA // >d30181.assembled CCAGGGCAGCGAAGGCCTGAGCGTCGGCAGCAGGGGCAGCAGCTTGACCTTGCAGGTTCACGACG >d30695.assembled CCAGGGCAGCGAAGGCCTGAGCGTCGGCAGCAGGGGCAGCAGCTTGACCTTGCAGGTTCACGACG >d31733.assembled CCAGGGCAGCGAAGGCCTGAGCGTCGGCAGCAGGGGCAGCAGCTTGACCTTGCAGGTTCACGACG >d39114.assembled CCAGGGCAGCGAAGGCCTGAGCGTCGGCAGNAGGGGCAGNAGCTTGACCTTNCAGGTTCACGACG >d39253.assembled CCAGGGCAGCGAAGGCCTGAGCGTCGGCAGCAGGGGCAGCAGCTTGACCTTGCAGGTTCACGACG >d39404.assembled CCAGGGCAGCGAAGGCCTGAGCGTCGNCAGCAGGGGCAGCAGCTTGACCTTGCAGGNTCACGACG // >d35320.assembled GCCCGAGTCGGCCCCGACCAGCCAGCCCAGGACGTCGGAGCGGTAGGTGGCGAAGTCGTAGAGCT >d39114.assembled ACCGGAGTCCGCGCCGACCAGCCAGCCCAGGACGTCGGAGCGGTAGGTGGCGAAGTCGTAGAGCT >d39531.assembled RCCGGAGNCCGCRCCGACCAGCCAGCCCAGGACGTCGGAGCGGTAGGTGGCGAAGTCGTAGAGCT >decor21.assembled GCCGGAGNCCGCACCGACCAGCCAGCCCAGGACGTCGGAGCGGTAGGTGGCGAAGTCGTAGAGCT // * - - * >d19long1.assembled ATATTTTTGGAGGATGAAGAGTTTCTGTCTTCTGCATTCTCTTCTTCTTCTTCTTCCTCTTAGGT >d35422.assembled ATATTTTTGGAGGATGACGAGTTTCTGTCTTCTGCATTC---TCTTCTTCTTCTTCCTCTTAGGT >d39104.assembled ATATTTTTGGAGGATGACGAGTTTCTGTCTTCTGCATTC---TCTTCTTCTTCCTCCTCTTAGGT >d41058.assembled ATATTTTTGGAGGATGAAGATTTTCTGTCTTCTGCATTC---TCTTCTTCTTCCTCCTCTTAGGT // * - * >d35320.assembled CGCCGGGCGTTCGAACCGAGCGCGGGCGTCGCGCCGCACGCGGCGCGGCGCGNC >d39103.assembled CGCCGGGCGTTCGAACCGAGCGCGGGCGTCGCGCCGCNCGCGGCGCGGCGCGCC >d39404.assembled CGCCGGGCGTTCGARCCGAGCGCGGGCRTCGCGCCGCACGCGGCGCGSCGNGCC >d39531.assembled CGCCGSGCGTTCGARCCGAGCGCGGGCRTCGCGCCGCACGCGGCGCGGCGCGCC >d39968.assembled CGCCGGGCGTTCGAACCGAGCGCGGGCGTCGCGCCNCACGCGNCGCGGCGCGCC // - * * - | >d30695.assembled GGCGTACGACTCGTCGTTCAGCCAGGCCAGCGT >d31733.assembled GGCGTAYGACTCGTCGTTCAGCCAGGCCAGCGT >d34041.assembled GGCGTACGACTCGTCGTTCAGCCAGGCCAGCGT >d39114.assembled NGCGTACGACTCGTCGTTCAGCCAGGCCAGCGT >d39253.assembled GGCGTACGACTCGTCGTTCAGCCAGGCCAGCGT >d41732.assembled GGCGTAYNACTCGTCGTTCAGCCAGGCCAGCGT >decor21.assembled GGCGTACGACTCGTCGTTCAGCCAGGCCAGCGT // * | >d30695.assembled ACGCGAGGCCCATTCCAAA-TTCGAGAGGACTTACACAGATATCGCCGCAGATAACTTTTACGAT >d39253.assembled ACGCGAGGCCCATTCCAAAGTTYGAGAGGACTTACACAGATATCGCCGCAGATAACTTTTACGAT >d39968.assembled ACGCGAGGCCCATTCCAAAGTTCGAGAGGACTTACACAGATATCGCCGCAGATAACTTTTACGAT >d40328.assembled ACGCGAGGCCCATTCCAAAGTTCGAGAGGACTTACACAGATATCGCCGCAGAYAAYTNTTACGAT // - - - >d30181.assembled TCGGTGCTTCCGGAACTCCGCAACGNG >d30695.assembled TCGGTGCTTCCGGAACTCCGCAACGCG >d39103.assembled TCGGTGCTTCCGGAACTGCGCAACGCG >d39968.assembled TCGGTGCTTCCGGAACTCCGCAACGCG // - | >d33291.assembled GTTGCGCGCGATCTCCACCATCTGCTGCTGGCCGATGCCCAGCGTGCCGACCCGCGTGCCCGGGT >d35178.assembled GTTGCGCGCGATCTCCACCATCTGCTGCTGGCCGATGCCSAGCGTGCCGACCMGCGTGCCCGGGT >d35320.assembled GTTGCGSGCGATCTCCACCATCTGCTGCTGGCCGATGCCSAGCGTGCCGACCCGCGTGCCCGGGT >d39253.assembled GTTGCGCGCGATCTCCACCATCTGCTGCTGGCCGATGCCCAGCGTGCCGACCNGCGTGCCCGGGT >d40328.assembled GTTGCGCGCGATCTCSACCATCTGCTGCTGSCCGATGCCGAGCGTGCCGACCMGCGTGCCSGGGT // - - - * * - >d30695.assembled CGACGACACGTCGGAGACCGGCTCCTCGACGAGAACCACTCCGGCAGCGCGACACGCACTCAGTA >d39531.assembled CGACGACACGTCGGAGACCRGCTCCTCGACGAGAACCACTCCGGCAGCGCGACACGCANTCAGTA >d41389.assembled CGACGACACGTCGGAGACCGGCTCCTCGACGAGAACCACTCCGGCAGCGCGACACGCACTCAGTA >decor21.assembled CGACGACACGTCGGAGACCRNCTCCTCGACGAGAACCACTCCGGCAGCGCGACACGCANTCAGTA // * >d31733.assembled CGGCAGCTCCAGCCGGGGCTGCGCCAGCCGCTGCCGCAGCTC >d35178.assembled CNGCAGCTCCAGCCGGGGCTGCGCCAGCCGCTGCCGCAGCTC >d35320.assembled CNGCAGCTCCAGCTGGGGCCGCGCCAGCCGCTGCCGCAGCTC >d35422.assembled CGGCAGCTCNAGCTGGGGCCGCGCCAGCCGCTGCCGCNGNTC >d39103.assembled CGGCAGCTCCAGCCGGGGCTGCGCCAGCCGCTGCCGCAGCTC >d39187.assembled CGGCAGCTCCAGCCGGGGCTGCGCCAGCCGCTGCCGCAGCTC >d39253.assembled CGGCAGCTCCAGCCGGGGCTGCGCCAGCCGCTGCCGCAGCTC >d39404.assembled CGGYNGCNCCAGCCGGGGCTGCGCCAGCCGCTGCCGCAGCTC >d39968.assembled CGGCAGCTCCAGCCGGGGCTGCNCCAGCCGCTGNCGCAGCTC >decor21.assembled CGGNAGCTCCAGCCGGGGCTGCGCCAGCCGCTGCCGCAGCTC // - * * | >d30181.assembled AGCCTGCTSGGCTACGTNGTCCGCTGGGTCGACGCCGGGGTGGG >d39187.assembled AGCCTGCTSGGCTACGTCGTCCGCTGGGTCGACGCCGGNGTGGG >d39253.assembled AGCCTGCTGGGCTACGTCGTCCGCTGGGTCGACGCCGGNGTGGG >d39968.assembled AGCCTGCTGGGCTACGTCGTCCGCTGGGTCGACGCCGGCGTGGG // * - | >d30181.assembled CAGCGAGGTGTCCACGCGGCTGCCCTGGCCGGTGCGCAGCCGCTTCACGTAGGCGGTGACCACGC >d33291.assembled CAGCGAGGTGTCCACGCGGCTGCCCTGGCCGGTGCGCAGCCGCTTCACGTAGGCGGTGACCACGC >d39103.assembled CAGCGAGGTGTCCACGCGGCTGCCCTGGCCGGTGCGCAGCCGCTTCACGTAGGCGGTGACCACGC >d39187.assembled CAGCGAGGTGTCCACGCGGCTGCCCTGGCCGGTGCGCAGCCGCTTCACGTAGGCGGTGACCACGC
%%bash
raxmlHPC-PTHREADS-AVX -s analysis_pyrad/outfiles/c88d6m4p3.phy \
-m GTRGAMMA \
-f a \
-x 12345 \
-p 12345 \
-N 100 \
-w /home/deren/Documents/PEGBS/analysis_raxml/ \
-n c88d6m4p3 \
-o "d35320.assembled" \
-T 20 &> /dev/null
%load_ext rmagic
%%R
library(ape)
tre <- read.tree("analysis_raxml/RAxML_bipartitions.c88d6m4p3")
plot(tre)
nodelabels(tre$node.label)
If you look at the stats output for step 5 you can see that we discarded most of our data for being too low of coverage. One way to salvage these data is to set the minimum depth for majority rule consensus base calls to 2 (param 31). In this way, statistical base calls are still made for all stacks in which the depth is >mindepth (param 8), but for everything between (2-mindepth) majority rule is used to make base calls. This will probably increase the size of this data set ~10X, as you can see in the example S5 stats file below.
%%bash
## This is the step 5 results after I moved the consens.gz files out of
## the clust.85/ directory and replaced them with new consens calls
## generated by running step 5 with the majority rule consensus
## option turned on.
tail -n 32 analysis_pyrad/stats/s5.consens.txt
taxon nloci f1loci f2loci nsites npoly poly d35371.assembled 60885 24564 23062 1780882 2350 0.0013196 d41058.assembled 329719 90853 84689 6244501 3656 0.0005855 d41237.assembled 293759 90845 84120 5741117 2794 0.0004867 d33291.assembled 169929 67772 62376 5502056 4098 0.0007448 d34041.assembled 197867 89535 83219 5919976 6740 0.0011385 d35422.assembled 169511 56563 50274 3627878 3974 0.0010954 d35320.assembled 156419 52912 48780 4266120 3268 0.000766 d39187.assembled 96755 45614 41058 3775082 4079 0.0010805 d39104.assembled 101266 34065 31273 2410273 1958 0.0008124 d19long1.assembled 161936 39462 35591 3167415 2402 0.0007583 d41389.assembled 211387 77211 70021 5475971 6483 0.0011839 d40328.assembled 136690 65435 59329 5145197 7386 0.0014355 d41732.assembled 452240 181118 170999 12626514 7201 0.0005703 d39114.assembled 218883 92037 85134 7595069 7721 0.0010166 d39404.assembled 243761 89814 78903 6913918 7306 0.0010567 d39531.assembled 332792 112016 101057 8879145 7975 0.0008982 d35178.assembled 450842 201307 190564 13878623 21667 0.0015612 d31733.assembled 439816 175627 160987 13694620 12159 0.0008879 decor21.assembled 577428 227921 207326 15697724 14727 0.0009382 d30181.assembled 516149 203114 189290 15726272 13709 0.0008717 d39103.assembled 642379 235667 211348 18531562 13982 0.0007545 d30695.assembled 485553 174196 155764 14331377 15685 0.0010945 d39253.assembled 431026 172635 151210 13507456 15403 0.0011403 d39968.assembled 456174 186053 162430 14528464 19145 0.0013178 ## nloci = number of loci ## f1loci = number of loci with >N depth coverage ## f2loci = number of loci with >N depth and passed paralog filter ## nsites = number of sites across f loci ## npoly = number of polymorphic sites in nsites ## poly = frequency of polymorphic sites