!fastqc --version
FastQC v0.11.2
!fastqc 2112_lane1_NoIndex_L001_R1_001.fastq.gz
Started analysis of 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 5% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 10% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 15% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 20% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 25% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 30% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 35% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 40% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 45% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 50% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 55% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 60% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 65% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 70% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 75% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 80% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 85% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 90% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 95% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Approx 100% complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz Analysis complete for 2112_lane1_NoIndex_L001_R1_001.fastq.gz
!cp 2112_lane1_NoIndex_L001_R1_001_fastqc.* \
/Volumes/Eagle/Arabidopsis/iPythonNotebooks/
!mv /Volumes/Eagle/Arabidopsis/iPythonNotebooks/2112_lane1_NoIndex_L001_R1_001_fastqc.html \
/Volumes/Eagle/Arabidopsis/iPythonNotebooks/20150313_2112_lane1_NoIndex_L001_R1_001_fastqc.html
!mv /Volumes/Eagle/Arabidopsis/iPythonNotebooks/2112_lane1_NoIndex_L001_R1_001_fastqc.zip \
/Volumes/Eagle/Arabidopsis/iPythonNotebooks/20150313_2112_lane1_NoIndex_L001_R1_001_fastqc.zip
from IPython.display import HTML
HTML('<iframe src=http://eagle.fish.washington.edu/Arabidopsis/iPythonNotebooks/20150313_2112_lane1_NoIndex_L001_R1_001_fastqc.html width=100% height=700></iframe>')
pwd
u'/Users/Sam/Documents'
This command uses bash commands to count the number of lines in the FASTQ file (wc-l
),
divides the total number of lines by 4
(there are 4 lines per read in Illumina FASTQ files).
The echo
command is used to print the result to the screen, which gets stored in the variable:
TotalSeqs
TotalSeqs = !echo $((`wc -l < 2112_lane1_NoIndex_L001_R1_001.fastq` / 4))
#Prints the value stored in TotalSeqs.
#Notice that this is a Python string list and is not an integer!
print TotalSeqs
['16000000']
#Converts the value in the TotalSeqs string list at index 0 (TotalSeqs[0]) to
#an integer value of base 10.
#This conversion will be used repeatedly throughout this notebook to allow
#mathematical calculations using the numbers generated by bash commands.
TotalSeqs = int(TotalSeqs[0], 10)
print TotalSeqs
16000000
grep
and wc -l
to count all the instances of each of the full-length TruSeq adaptor/index sequences.¶The index sequence is indicated in each of the respective variable names.
Additionally, the Epigentek barcode number is indicated in the variable names (e.g. BC1 = barcode 1).
BC1_ATCACG_full = !grep -o 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGC' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
#Converts the value in the BC1_ATCACG_full string list at index 0 (BC1_ATCACG_full[0]) to
#an integer value of base 10.
BC1_ATCACG_full = int(BC1_ATCACG_full[0] ,10)
print BC1_ATCACG_full
294374
BC3_TTAGGC_full = !grep -o 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGC' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
#Converts the value in the BC3_TTAGGC_full string list at index 0 (BC3_TTAGGC_full[0]) to
#an integer value of base 10.
BC3_TTAGGC_full = int(BC3_TTAGGC_full[0], 10)
print BC3_TTAGGC_full
244638
BC4_TGACCA_full = !grep -o 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGC' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
#Converts the value in the BC4_TGACCA_full string list at index 0 (BC4_TGACCA_full[0]) to
#an integer value of base 10.
BC4_TGACCA_full = int(BC4_TGACCA_full[0], 10)
print BC4_TGACCA_full
388498
BC5_ACAGTG_full = !grep -o 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGC' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
#Converts the value in the BC5_ACAGTG_full string list at index 0 (BC5_ACAGTG_full[0]) to
#an integer value of base 10.
BC5_ACAGTG_full = int(BC5_ACAGTG_full[0], 10)
print BC5_ACAGTG_full
308463
BC6_GCCAAT_full = !grep -o 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGC' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
#Converts the value in the BC6_GCCAAT_full string list at index 0 (BC6_GCCAAT_full[0]) to
#an integer value of base 10.
BC6_GCCAAT_full = int(BC6_GCCAAT_full[0], 10)
print BC6_GCCAAT_full
211205
BC7_CAGATC_full = !grep -o 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGC' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
#Converts the value in the BC7_CAGATC_full string list at index 0 (BC7_CAGATC_full[0]) to
#an integer value of base 10.
BC7_CAGATC_full = int(BC7_CAGATC_full[0], 10)
print BC7_CAGATC_full
504685
#Adds all of the counts from each full-length Illumina adaptor/index sequence.
#Saves to variable "sum_full".
sum_full = BC1_ATCACG_full + BC3_TTAGGC_full + BC4_TGACCA_full + BC5_ACAGTG_full + BC6_GCCAAT_full + BC7_CAGATC_full
print sum_full
1951863
#Calculates percentage of reads having full-lenght Illumina adaptor/index sequences.
#Uses "float" to convert integer values to floating point decimals. Necessary since
#the calculation on integers would be < 1 & would result in an answer of '0'.
print ((float(sum_full)/TotalSeqs)*100)
12.19914375
grep
and wc -l
to count all the instances of each of the TruSeq index sequences.¶The index sequence is indicated in each of the respective variable names.
Additionally, the Epigentek barcode number is indicated in the variable names (e.g. BC1 = barcode 1).
BC1_ATCACG = !grep -o 'ATCACG' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
#Converts the value in the BC1_ATCACG string list at index 0 (BC1_ATCACG[0]) to
#an integer value of base 10.
BC1_ATCACG = int(BC1_ATCACG[0] ,10)
print BC1_ATCACG
700818
BC3_TTAGGC = !grep -o 'TTAGGC' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
#Converts the value in the BC3_TTAGGC string list at index 0 (BC3_TTAGGC[0]) to
#an integer value of base 10.
BC3_TTAGGC = int(BC3_TTAGGC[0] ,10)
print BC3_TTAGGC
387329
BC4_TGACCA = !grep -o 'TGACCA' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
#Converts the value in the BC4_TGACCA string list at index 0 (BC4_TGACCA[0]) to
#an integer value of base 10.
BC4_TGACCA = int(BC4_TGACCA[0])
print BC4_TGACCA
727528
BC5_ACAGTG = !grep -o 'ACAGTG' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
#Converts the value in the BC5_ACAGTG string list at index 0 (BC5_ACAGTG[0]) to
#an integer value of base 10.
BC5_ACAGTG = int(BC5_ACAGTG[0] ,10)
print BC5_ACAGTG
544521
BC6_GCCAAT = !grep -o 'GCCAAT' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
#Converts the value in the BC6_GCCAAT string list at index 0 (BC6_GCCAAT[0]) to
#an integer value of base 10.
BC6_GCCAAT = int(BC6_GCCAAT[0] ,10)
print BC6_GCCAAT
469213
BC7_CAGATC = !grep -o 'CAGATC' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
#Converts the value in the BC7_CAGATC string list at index 0 (BC7_CAGATC[0]) to
#an integer value of base 10.
BC7_CAGATC = int(BC7_CAGATC[0] ,10)
print BC7_CAGATC
1107028
#Adds all of the counts from each Illumina adaptor/index sequence.
#Saves to variable "sum_short".
sum_short = BC1_ATCACG + BC3_TTAGGC + BC4_TGACCA + BC5_ACAGTG + BC6_GCCAAT + BC7_CAGATC
print sum_short
3936437
#Calculates percentage of reads having full-lenght Illumina adaptor/index sequences.
#Uses "float" to convert integer values to floating point decimals. Necessary since
#the calculation on integers would be < 1 & would result in an answer of '0'.
print ((float(sum_short)/TotalSeqs)*100)
24.60273125
fastx_barcode_splitter
to identify full-length TruSeq adaptor/index sequences.¶fastx_barcode_splitter
is a component of fastx_toolkit-0.0.13.2¶#The full-lengths barcode file used by fastx_barcode_splitter.
!head TruSeqBarcodesLong.txt
BC7_CAGATC GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGC BC4_TGACCA GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGC BC5_ACAGTG GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGC BC1_ATCACG GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGC BC3_TTAGGC GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGC BC6_GCCAAT GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGC
#Gunzip the gzipped FASTQ file.
#Pipe the output of that to fastx_barcode_splitter.pl
#fastx_barcode_splitter uses a default mismatch value = 1
#Specify barcode file (--bcfile TruSeqBarcodesLong.txt)
#Specify to look for barcode at beginning of file (--bol)
#Specify output location and append a prefix to new file name (--prefix ./bol_)
#Specify new file name suffix (--suffix ".fastq")
!gunzip -c 2112_lane1_NoIndex_L001_R1_001.fastq.gz | \
fastx_barcode_splitter.pl \
--bcfile TruSeqBarcodesLong.txt \
--bol \
--prefix ./bol_long_ \
--suffix ".fastq" | \
tee bol_long_stats.txt
Barcode Count Location BC1_ATCACG 359478 ./bol_long_BC1_ATCACG.fastq BC3_TTAGGC 299176 ./bol_long_BC3_TTAGGC.fastq BC4_TGACCA 472062 ./bol_long_BC4_TGACCA.fastq BC5_ACAGTG 378448 ./bol_long_BC5_ACAGTG.fastq BC6_GCCAAT 257815 ./bol_long_BC6_GCCAAT.fastq BC7_CAGATC 605680 ./bol_long_BC7_CAGATC.fastq unmatched 13627341 ./bol_long_unmatched.fastq total 16000000
#Gunzip the gzipped FASTQ file.
#Pipe the output of that to fastx_barcode_splitter.pl
#fastx_barcode_splitter uses a default mismatch value = 1
#Specify barcode file (--bcfile TruSeqBarcodesLong.txt)
#Specify to look for barcode at beginning of file (--eol)
#Specify output location and append a prefix to new file name (--prefix ./eol_)
#Specify new file name suffix (--suffix ".fastq")
!gunzip -c 2112_lane1_NoIndex_L001_R1_001.fastq.gz | \
fastx_barcode_splitter.pl \
--bcfile TruSeqBarcodes.txt \
--eol \
--prefix ./eol_long_ \
--suffix ".fastq" | \
tee eol_long_stats.txt
Barcode Count Location BC1_ATCACG 136 ./eol_long_BC1_ATCACG.fastq BC3_TTAGGC 180 ./eol_long_BC3_TTAGGC.fastq BC4_TGACCA 388 ./eol_long_BC4_TGACCA.fastq BC5_ACAGTG 138 ./eol_long_BC5_ACAGTG.fastq BC6_GCCAAT 99 ./eol_long_BC6_GCCAAT.fastq BC7_CAGATC 336 ./eol_long_BC7_CAGATC.fastq unmatched 15998723 ./eol_long_unmatched.fastq total 16000000
fastx_barcode_splitter
to identify TruSeq index sequences.¶#The full-lenghts barcode file used by fastx_barcode_splitter.
!head TruSeqBarcodesShort.txt
BC7_CAGATC CAGATC BC4_TGACCA TGACCA BC5_ACAGTG ACAGTG BC1_ATCACG ATCACG BC3_TTAGGC TTAGGC BC6_GCCAAT GCCAAT
#Gunzip the gzipped FASTQ file.
#Pipe the output of that to fastx_barcode_splitter.pl
#fastx_barcode_splitter uses a default mismatch value = 1
#Specify barcode file (--bcfile TruSeqBarcodesShort.txt)
#Specify to look for barcode at beginning of file (--bol)
#Specify output location and append a prefix to new file name (--prefix ./bol_)
#Specify new file name suffix (--suffix ".fastq")
!gunzip -c 2112_lane1_NoIndex_L001_R1_001.fastq.gz | \
fastx_barcode_splitter.pl \
--bcfile TruSeqBarcodesShort.txt \
--bol \
--prefix ./bol_ \
--suffix ".fastq" | \
tee bol_short_stats.txt
Barcode Count Location BC1_ATCACG 54841 ./bol_BC1_ATCACG.fastq BC3_TTAGGC 24864 ./bol_BC3_TTAGGC.fastq BC4_TGACCA 63480 ./bol_BC4_TGACCA.fastq BC5_ACAGTG 12874 ./bol_BC5_ACAGTG.fastq BC6_GCCAAT 25066 ./bol_BC6_GCCAAT.fastq BC7_CAGATC 24092 ./bol_BC7_CAGATC.fastq unmatched 15794783 ./bol_unmatched.fastq total 16000000
#Gunzip the gzipped FASTQ file.
#Pipe the output of that to fastx_barcode_splitter.pl
#fastx_barcode_splitter uses a default mismatch value = 1
#Specify barcode file (--bcfile TruSeqBarcodesShort.txt)
#Specify to look for barcode at beginning of file (--eol)
#Specify output location and append a prefix to new file name (--prefix ./eol_)
#Specify new file name suffix (--suffix ".fastq")
!gunzip -c 2112_lane1_NoIndex_L001_R1_001.fastq.gz | \
fastx_barcode_splitter.pl \
--bcfile TruSeqBarcodesShort.txt \
--eol \
--prefix ./eol_ \
--suffix ".fastq" | \
tee eol_short_stats.txt
Barcode Count Location BC1_ATCACG 86091 ./eol_BC1_ATCACG.fastq BC3_TTAGGC 27144 ./eol_BC3_TTAGGC.fastq BC4_TGACCA 63478 ./eol_BC4_TGACCA.fastq BC5_ACAGTG 30680 ./eol_BC5_ACAGTG.fastq BC6_GCCAAT 54759 ./eol_BC6_GCCAAT.fastq BC7_CAGATC 162844 ./eol_BC7_CAGATC.fastq unmatched 15575004 ./eol_unmatched.fastq total 16000000