This command uses bash to count the number of lines in the FASTQ file (wc -l) and divides the total by 4 (there are 4 lines per read in Illumina FASTQ files). The echo command prints the result, which IPython captures and stores in the variable TotalSeqs.
TotalSeqs = !echo $((`wc -l < 2112_lane1_NoIndex_L001_R1_001.fastq` / 4))
#Prints the value stored in TotalSeqs.
#Notice that this is a Python string list and is not an integer!
print TotalSeqs
['16000000']
#Converts the value in the TotalSeqs string list at index 0 (TotalSeqs[0]) to
#an integer value of base 10.
#This conversion will be used repeatedly throughout this notebook to allow
#mathematical calculations using the numbers generated by bash commands.
TotalSeqs = int(TotalSeqs[0])
print TotalSeqs
16000000
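The same read count can be sketched in plain Python, without shelling out to bash. This is a minimal illustration that assumes an uncompressed FASTQ file with exactly 4 lines per read; the function name count_fastq_reads is hypothetical and is not used elsewhere in this notebook.

```python
def count_fastq_reads(path):
    """Return the number of reads in an uncompressed, 4-line-per-read FASTQ file."""
    with open(path) as handle:
        # Count lines one at a time so the whole file is never held in memory.
        line_count = sum(1 for _ in handle)
    # Floor division mirrors the bash arithmetic $(( lines / 4 )).
    return line_count // 4
```

Unlike the bash version, this returns an integer directly, so no conversion from a string list is needed. Note that wc -l counts newline characters, so the two approaches can differ by one line if the file does not end with a newline.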
grep and wc -l to count all the instances of the Epinext adaptor 1 sequence:
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
TruSeq_adaptor1_grep = !grep -o 'ACACTCTTTCCCTACACGACGCTCTTCCGATCT' 2112_lane1_NoIndex_L001_R1_001.fastq \
| wc -l
#Converts the value in the TruSeq_adaptor1_grep string list at index 0 (TruSeq_adaptor1_grep[0]) to
#an integer value of base 10.
TruSeq_adaptor1_grep = int(TruSeq_adaptor1_grep[0])
print TruSeq_adaptor1_grep
0
#Calculates the percentage of reads containing the adaptor sequence counted above.
#Uses "float" to convert the integer count to a floating point decimal. Necessary since
#Python 2 integer division truncates: the ratio is < 1 & would result in an answer of '0'.
print ((float(TruSeq_adaptor1_grep)/TotalSeqs)*100)
0.0
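As a rough Python counterpart to the grep step and the percentage calculation above, the sketch below counts reads whose sequence line contains the adaptor and converts a count into a percentage. Both function names are hypothetical. It is not an exact equivalent of the grep command: grep -o piped to wc -l counts every occurrence on any line, while this counts each read at most once and only looks at sequence lines.

```python
def count_reads_with_adaptor(path, adaptor):
    """Count reads whose sequence line contains the adaptor (exact match)."""
    hits = 0
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            # In a 4-line FASTQ record, the sequence is the second line.
            if line_number % 4 == 1 and adaptor in line:
                hits += 1
    return hits


def percent_of_reads(count, total):
    """Return count as a percentage of total, forcing floating point division."""
    return (float(count) / total) * 100
```

The explicit float() call keeps the division correct under Python 2, where dividing two integers would truncate to 0.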
fastx_barcode_splitter to identify Epinext adaptor 1 sequence.
fastx_barcode_splitter is a component of fastx_toolkit-0.0.13.2
#The full-length barcode file used by fastx_barcode_splitter.
!head EpinextAdaptor1.txt
Epinext_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT
#Gunzip the gzipped FASTQ file.
#Pipe the output of that to fastx_barcode_splitter.pl
#fastx_barcode_splitter uses a default mismatch value = 1
#Specify barcode file (--bcfile EpinextAdaptor1.txt)
#Specify to look for barcode at the beginning of each sequence, i.e. the 5' end (--bol)
#Specify output location and append a prefix to new file name (--prefix ./bol_)
#Specify new file name suffix (--suffix ".fastq")
#Print data to screen and output file (tee bol_EpinextAdaptor1_stats.txt)
!gunzip -c 2112_lane1_NoIndex_L001_R1_001.fastq.gz | \
fastx_barcode_splitter.pl \
--bcfile EpinextAdaptor1.txt \
--bol \
--prefix ./bol_ \
--suffix ".fastq" | \
tee bol_EpinextAdaptor1_stats.txt
Barcode     Count      Location
Epinext_1   5          ./bol_Epinext_1.fastq
unmatched   15999995   ./bol_unmatched.fastq
total       16000000
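To illustrate what matching at the beginning of a sequence with a mismatch allowance means, here is a simplified sketch. It is not fastx_barcode_splitter's actual implementation; it only compares the barcode to the start of the read position by position and tolerates up to max_mismatches differing bases. The function name and the behavior for reads shorter than the barcode are assumptions made for the example.

```python
def matches_at_start(read, barcode, max_mismatches=1):
    """Return True if the read starts with the barcode, allowing a few mismatches."""
    # Assumption for this sketch: a read shorter than the barcode never matches.
    if len(read) < len(barcode):
        return False
    # zip() stops at the end of the barcode, so only the read's prefix is compared.
    mismatches = sum(1 for r, b in zip(read, barcode) if r != b)
    return mismatches <= max_mismatches
```

Allowing one mismatch is one possible reason a tool of this kind can report a few matches even when an exact-match search with grep finds none.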
#Uses awk to capture the second field (the "Count" column; print $2) from
#the second line (FNR == 2; the Epinext_1 row) of the bol_EpinextAdaptor1_stats.txt file.
#Stores the value in the variable EpinextAdaptor1_fastx_bol as a Python string list.
EpinextAdaptor1_fastx_bol = !awk 'FNR == 2 {print $2}' bol_EpinextAdaptor1_stats.txt
print EpinextAdaptor1_fastx_bol
['5']
#Converts the value in the EpinextAdaptor1_fastx_bol string list at index 0 (EpinextAdaptor1_fastx_bol[0]) to
#an integer value of base 10.
EpinextAdaptor1_fastx_bol = int(EpinextAdaptor1_fastx_bol[0])
#Calculates percentage of reads having Epinext adaptor 1 sequences.
#Uses "float" to convert the integer count to a floating point decimal. Necessary since
#Python 2 integer division truncates: the ratio is < 1 & would result in an answer of '0'.
print ((float(EpinextAdaptor1_fastx_bol)/TotalSeqs)*100)
3.125e-05
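The awk step can also be sketched in Python by reading the whole stats table into a dictionary. This assumes the file is a whitespace-delimited table with a single header row, as in the output shown above; the function name read_barcode_counts is hypothetical.

```python
def read_barcode_counts(stats_path):
    """Map each row label in the stats table to its integer count."""
    counts = {}
    with open(stats_path) as handle:
        next(handle, None)  # Skip the header row (Barcode, Count, Location).
        for line in handle:
            fields = line.split()
            # Keep only rows that have at least a label and a count.
            if len(fields) >= 2:
                counts[fields[0]] = int(fields[1])
    return counts
```

Looking up rows by label, for example counts["Epinext_1"] and counts["total"], avoids depending on a fixed line number the way FNR == 2 does, and the values are already integers.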