Simple adapter trimming vs. adapter trimming and trimming to expect sRNA lengths.
%%bash
echo "TODAY'S DATE:"
date
echo "------------"
echo ""
#Display operating system info
lsb_release -a
echo ""
echo "------------"
echo "HOSTNAME: "; hostname
echo ""
echo "------------"
echo "Computer Specs:"
echo ""
lscpu
echo ""
echo "------------"
echo ""
echo "Memory Specs"
echo ""
free -mh
TODAY'S DATE: Wed May 24 10:44:03 PDT 2023 ------------
No LSB modules are available.
Distributor ID: Ubuntu Description: Ubuntu 18.04.6 LTS Release: 18.04 Codename: bionic ------------ HOSTNAME: raven ------------ Computer Specs: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 48 On-line CPU(s) list: 0-47 Thread(s) per core: 2 Core(s) per socket: 24 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz Stepping: 7 CPU MHz: 1000.108 CPU max MHz: 4000.0000 CPU min MHz: 1000.0000 BogoMIPS: 4400.00 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 1024K L3 cache: 36608K NUMA node0 CPU(s): 0-47 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities ------------ Memory Specs total used free shared buff/cache available Mem: 247G 9.5G 76G 672K 161G 235G Swap: 99G 583M 99G
%env
indicates a bash variable
without %env
is Python variable
# Set directories, input/output files
%env data_dir=/home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq
%env analysis_dir=/home/shared/8TB_HDD_01/sam/analyses/20230524-E5-coral-sRNAseq_trimmings_comparisons
analysis_dir="/home/shared/8TB_HDD_01/sam/20230524-E5-coral-sRNAseq_trimmings_comparisons"
%env R1_fastq=/home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/sRNA-ACR-140-S1-TP2_R1_001.fastq.gz
%env R2_fastq=/home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/sRNA-ACR-140-S1-TP2_R2_001.fastq.gz
# Set CPU threads
%env threads=40
# Max read length
%env max_read_length=50
# Set program locations
%env fastqc=/home/shared/FastQC/fastqc
%env flexbar=/home/shared/flexbar-3.5.0-linux/flexbar
# Set some formatting stuff
%env break_line=--------------------------------------------------------------------------
env: data_dir=/home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq env: analysis_dir=/home/shared/8TB_HDD_01/sam/analyses/20230524-E5-coral-sRNAseq_trimmings_comparisons env: R1_fastq=/home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/sRNA-ACR-140-S1-TP2_R1_001.fastq.gz env: R2_fastq=/home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/sRNA-ACR-140-S1-TP2_R2_001.fastq.gz env: threads=40 env: max_read_length=50 env: fastqc=/home/shared/FastQC/fastqc env: flexbar=/home/shared/flexbar-3.5.0-linux/flexbar env: break_line=--------------------------------------------------------------------------
%%bash
# Make analysis and data directory, if doesn't exist
mkdir --parents "${analysis_dir}"
mkdir --parents "${data_dir}"
Adapter sequences are in the NEB sRNA kit protocol used by Azenta for library construction.
%%bash
cat "${data_dir}/NEB-adapters.fasta"
>first AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC >second GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT
Options:
-ap
: For paired-end analysis; recommended by NEB
-qf il.8
: Sets quality type as Illumina v1.8
qt
: Mean quality score of 25
--target
: Sets output filename
--zip-output GZ
: Sets gzip compression for trimmed files
%%bash
cd ${analysis_dir}
${flexbar} \
-r ${R1_fastq} \
-p ${R2_fastq} \
-a ${data_dir}/NEB-adapters.fasta \
-ap ON \
-qf i1.8 \
-qt 25 \
--threads ${threads} \
--target sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only \
--zip-output GZ
ls -lh
total 633M -rw-rw-r-- 1 sam sam 311M May 24 10:48 sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only_1.fastq.gz -rw-rw-r-- 1 sam sam 322M May 24 10:48 sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only_2.fastq.gz -rw-rw-r-- 1 sam sam 2.6K May 24 10:48 sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only.log
%%bash
cd ${analysis_dir}
cat sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only.log
________ __ / ____/ /__ _ __/ /_ ____ ______ / /_ / / _ \| |/ / __ \/ __ `/ ___/ / __/ / / __/> </ /_/ / /_/ / / /_/ /_/\___/_/|_/_.___/\__._/_/ Flexbar - flexible barcode and adapter removal, version 3.5.0 Developed with SeqAn, the library for sequence analysis Available on github.com/seqan/flexbar Local time: Wed May 24 10:44:10 2023 Number of threads: 40 Bundled fragments: 256 Target name: sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only File type: fastq Reads file: /home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/sRNA-ACR-140-S1-TP2_R1_001.fastq.gz Reads file 2: /home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/sRNA-ACR-140-S1-TP2_R2_001.fastq.gz (paired run) Adapter file: /home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/NEB-adapters.fasta max-uncalled: 0 min-read-length: 18 adapter-pair-overlap: ON adapter-trim-end: RIGHT adapter-min-overlap: 3 adapter-min-poverlap: 40 adapter-error-rate: 0.1 adapter-match: 1 adapter-mismatch: -1 adapter-gap: -6 Adapter: Sequence: first AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC second GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT Processing reads ...done. Elapsed time: 4 min 2 sec Processing speed: 152950 reads/s Min, max, mean and median overlap of paired reads: 40 / 149 / 54 / 44 Adapter removal statistics ========================== Adapter: Overlap removal: Full length: first 17854577 17749958 second 17688528 17594166 Min, max, mean and median overlap: 1 / 60 / 44 / 34 Output file statistics ====================== Read file: sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only_1.fastq.gz written reads 17551962 short reads 948699 Read file 2: sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only_2.fastq.gz written reads 17551962 short reads 941364 Filtering statistics ==================== Processed reads 37013944 skipped due to uncalled bases 10070 (5373 uncalled in 5035 pairs) short prior to adapter removal 0 finally skipped short reads 1890063 skipped paired single reads 9887 Discarded reads overall 1910020 Remaining reads 35103924 (94%) Processed bases 5552091600 Remaining bases 1201741240 (21% of input) Flexbar completed adapter removal.
Options:
-ap
: For paired-end analysis; recommended by NEB
-qf il.8
: Sets quality type as Illumina v1.8
qt
: Mean quality score of 25
--post-trim-length
: Trim reads from 3' end to length specified after adapter and quality trimming.
--target
: Sets output filename
--zip-output GZ
: Sets gzip compression for trimmed files
%%bash
cd ${analysis_dir}
${flexbar} \
-r ${R1_fastq} \
-p ${R2_fastq} \
-a ${data_dir}/NEB-adapters.fasta \
-ap ON \
-qf i1.8 \
-qt 25 \
--post-trim-length ${max_read_length} \
--threads ${threads} \
--target sRNA-ACR-140-S1-TP2_R1_001-adapter-and-length-50 \
--zip-output GZ
ls -lh
total 1.2G -rw-rw-r-- 1 sam sam 292M May 24 10:52 sRNA-ACR-140-S1-TP2_R1_001-adapter-and-length-50_1.fastq.gz -rw-rw-r-- 1 sam sam 293M May 24 10:52 sRNA-ACR-140-S1-TP2_R1_001-adapter-and-length-50_2.fastq.gz -rw-rw-r-- 1 sam sam 2.7K May 24 10:52 sRNA-ACR-140-S1-TP2_R1_001-adapter-and-length-50.log -rw-rw-r-- 1 sam sam 311M May 24 10:48 sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only_1.fastq.gz -rw-rw-r-- 1 sam sam 322M May 24 10:48 sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only_2.fastq.gz -rw-rw-r-- 1 sam sam 2.6K May 24 10:48 sRNA-ACR-140-S1-TP2_R1_001-adapter_trim_only.log
%%bash
cd ${analysis_dir}
cat sRNA-ACR-140-S1-TP2_R1_001-adapter-and-length-50.log
________ __ / ____/ /__ _ __/ /_ ____ ______ / /_ / / _ \| |/ / __ \/ __ `/ ___/ / __/ / / __/> </ /_/ / /_/ / / /_/ /_/\___/_/|_/_.___/\__._/_/ Flexbar - flexible barcode and adapter removal, version 3.5.0 Developed with SeqAn, the library for sequence analysis Available on github.com/seqan/flexbar Local time: Wed May 24 10:48:12 2023 Number of threads: 40 Bundled fragments: 256 Target name: sRNA-ACR-140-S1-TP2_R1_001-adapter-and-length-50 File type: fastq Reads file: /home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/sRNA-ACR-140-S1-TP2_R1_001.fastq.gz Reads file 2: /home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/sRNA-ACR-140-S1-TP2_R2_001.fastq.gz (paired run) Adapter file: /home/shared/8TB_HDD_01/sam/data/A_pulchra/sRNAseq/NEB-adapters.fasta max-uncalled: 0 post-trim-length: 50 min-read-length: 18 adapter-pair-overlap: ON adapter-trim-end: RIGHT adapter-min-overlap: 3 adapter-min-poverlap: 40 adapter-error-rate: 0.1 adapter-match: 1 adapter-mismatch: -1 adapter-gap: -6 Adapter: Sequence: first AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC second GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT Processing reads ...done. Elapsed time: 3 min 49 sec Processing speed: 161632 reads/s Min, max, mean and median overlap of paired reads: 40 / 149 / 54 / 44 Adapter removal statistics ========================== Adapter: Overlap removal: Full length: first 17854577 17749958 second 17688528 17594166 Min, max, mean and median overlap: 1 / 60 / 44 / 34 Output file statistics ====================== Read file: sRNA-ACR-140-S1-TP2_R1_001-adapter-and-length-50_1.fastq.gz written reads 17551962 short reads 948699 Read file 2: sRNA-ACR-140-S1-TP2_R1_001-adapter-and-length-50_2.fastq.gz written reads 17551962 short reads 941364 Filtering statistics ==================== Processed reads 37013944 skipped due to uncalled bases 10070 (5373 uncalled in 5035 pairs) short prior to adapter removal 0 finally skipped short reads 1890063 skipped paired single reads 9887 Discarded reads overall 1910020 Remaining reads 35103924 (94%) Processed bases 5552091600 Remaining bases 1048763452 (18% of input) Flexbar completed adapter removal.
%%bash
cd ${analysis_dir}
trimmed_fastq_array=(*.fastq.gz)
# Pass array contents to new variable as space-delimited list
trimmed_fastqc_list=$(echo "${trimmed_fastq_array[*]}")
${fastqc} \
${trimmed_fastqc_list} \
--threads ${threads} \
--outdir ./ \
--quiet
%%bash
cd ${analysis_dir}
multiqc .
/// ]8;id=757692;https://multiqc.info\MultiQC]8;;\ 🔍 | v1.14 | multiqc | Search path : /home/shared/8TB_HDD_01/sam/analyses/20230524-E5-coral-sRNAseq_trimmings_comparisons | searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 14/14 0m | flexbar | Found 2 logs | fastqc | Found 4 reports | multiqc | Compressing plot data | multiqc | Report : multiqc_report.html | multiqc | Data : multiqc_data | multiqc | MultiQC complete
%%bash
${flexbar} -hh
flexbar - flexible barcode and adapter removal ============================================== SYNOPSIS flexbar -r reads [-b barcodes] [-a adapters] [options] DESCRIPTION The program Flexbar preprocesses high-throughput sequencing data efficiently. It demultiplexes barcoded runs and removes adapter sequences. Several adapter removal presets for Illumina libraries are included. Flexbar computes exact overlap alignments using SIMD and multicore parallelism. Moreover, trimming and filtering features are provided, e.g. trimming of homopolymers at read ends. Flexbar increases read mapping rates and improves genome as well as transcriptome assemblies. Unique molecular identifiers can be extracted in a flexible way. The software supports data in fasta and fastq format from multiple sequencing platforms. Refer to the manual on github.com/seqan/flexbar/wiki or contact Johannes Roehr on github.com/jtroehr for support with this application. OPTIONS -h, --help Display the help message. -hh, --full-help Display the help message with advanced options. --version-check BOOL Turn this option off to disable version update notifications of the application. One of 1, ON, TRUE, T, YES, 0, OFF, FALSE, F, and NO. Default: 1. -hm, --man-help Print advanced options as man document. -v, --versions Print Flexbar and SeqAn version numbers. -c, --cite Show program references for citation. Basic options: -n, --threads INTEGER Number of threads to employ. Default: 1. -N, --bundle INTEGER Number of (paired) reads per thread. Default: 256. -M, --bundles INTEGER Process only certain number of bundles for testing. -t, --target OUTPUT_PREFIX Prefix for output file names or paths. Default: flexbarOut. -r, --reads INPUT_FILE Fasta/q file or stdin (-) with reads that may contain barcodes. -p, --reads2 INPUT_FILE Second input file of paired reads, gz and bz2 files supported. -i, --interleaved Interleaved format for first input set with paired reads. -I, --iupac Accept iupac symbols in reads and convert to N if not ATCG. Barcode detection: -b, --barcodes INPUT_FILE Fasta file with barcodes for demultiplexing, may contain N. -b2, --barcodes2 INPUT_FILE Additional barcodes file for second read set in paired mode. -br, --barcode-reads INPUT_FILE Fasta/q file containing separate barcode reads for detection. -bo, --barcode-min-overlap INTEGER Minimum overlap of barcode and read. Default: barcode length. -be, --barcode-error-rate DOUBLE Error rate threshold for mismatches and gaps. Default: 0.0. -bt, --barcode-trim-end STRING Type of detection, see section trim-end modes. Default: LTAIL. -bn, --barcode-tail-length INTEGER Region size in tail trim-end modes. Default: barcode length. -bk, --barcode-keep Keep barcodes within reads instead of removal. -bu, --barcode-unassigned Include unassigned reads in output generation. -bm, --barcode-match INTEGER Alignment match score. Default: 1. -bi, --barcode-mismatch INTEGER Alignment mismatch score. Default: -1. -bg, --barcode-gap INTEGER Alignment gap score. Default: -9. Adapter removal: -a, --adapters INPUT_FILE Fasta file with adapters for removal that may contain N. -a2, --adapters2 INPUT_FILE File with extra adapters for second read set in paired mode. -as, --adapter-seq STRING Single adapter sequence as alternative to adapters option. -aa, --adapter-preset STRING One of TruSeq, SmallRNA, Methyl, Ribo, Nextera, and NexteraMP. -ao, --adapter-min-overlap INTEGER Minimum overlap for removal without pair overlap. Default: 3. -ae, --adapter-error-rate DOUBLE Error rate threshold for mismatches and gaps. Default: 0.1. -at, --adapter-trim-end STRING Type of removal, see section trim-end modes. Default: RIGHT. -an, --adapter-tail-length INTEGER Region size for tail trim-end modes. Default: adapter length. -ax, --adapter-relaxed Skip restriction to pass read ends in right and left modes. -ap, --adapter-pair-overlap STRING Overlap detection of paired reads. One of ON, SHORT, and ONLY. -av, --adapter-min-poverlap INTEGER Minimum overlap of paired reads for detection. Default: 40. -ac, --adapter-revcomp STRING Include reverse complements of adapters. One of ON and ONLY. -ad, --adapter-revcomp-end STRING Use different trim-end for reverse complements of adapters. -ab, --adapter-add-barcode Add reverse complement of detected barcode to adapters. -ar, --adapter-read-set STRING Consider only single read set for adapters. One of 1 and 2. -ak, --adapter-trimmed-out STRING Modify that trimmed reads are kept. One of OFF and ONLY. -ay, --adapter-cycles INTEGER Number of adapter removal cycles. Default: 1. -am, --adapter-match INTEGER Alignment match score. Default: 1. -ai, --adapter-mismatch INTEGER Alignment mismatch score. Default: -1. -ag, --adapter-gap INTEGER Alignment gap score. Default: -6. Filtering and trimming: -u, --max-uncalled INTEGER Allowed uncalled bases N for each read. Default: 0. -x, --pre-trim-left INTEGER Trim given number of bases on 5' read end before detection. -y, --pre-trim-right INTEGER Trim specified number of bases on 3' end prior to detection. -k, --post-trim-length INTEGER Trim to specified read length from 3' end after removal. -m, --min-read-length INTEGER Minimum read length to remain after removal. Default: 18. Quality-based trimming: -q, --qtrim STRING Quality-based trimming mode. One of TAIL, WIN, and BWA. -qf, --qtrim-format STRING Quality format. One of sanger, solexa, i1.3, i1.5, and i1.8. -qt, --qtrim-threshold INTEGER Minimum quality as threshold for trimming. Default: 20. -qw, --qtrim-win-size INTEGER Region size for sliding window approach. Default: 5. -qa, --qtrim-post-removal Perform quality-based trimming after removal steps. Trimming of homopolymers: -hl, --htrim-left STRING Trim specific homopolymers on left read end after removal. -hr, --htrim-right STRING Trim certain homopolymers on right read end after removal. -hi, --htrim-min-length INTEGER Minimum length of homopolymers at read ends. Default: 3. -h2, --htrim-min-length2 INTEGER Minimum length for homopolymers specified after first one. -hx, --htrim-max-length INTEGER Maximum length of homopolymers on left and right read end. -hf, --htrim-max-first Apply maximum length of homopolymers only for first one. -he, --htrim-error-rate DOUBLE Error rate threshold for mismatches. Default: 0.1. -ha, --htrim-adapter Trim only in case of adapter removal on same side. Output selection: -f, --fasta-output Prefer non-quality format fasta for output. -z, --zip-output STRING Direct compression of output files. One of GZ and BZ2. -1, --stdout-reads Write reads to stdout, tagged and interleaved if needed. -R, --output-reads OUTPUT_FILE Output file for reads instead of target prefix usage. -P, --output-reads2 OUTPUT_FILE Output file for reads2 instead of target prefix usage. -j, --length-dist Generate length distribution for read output files. -s, --single-reads Write single reads for too short counterparts in pairs. -S, --single-reads-paired Write paired single reads with N for short counterparts. Logging and tagging: -l, --align-log STRING Print chosen read alignments. One of ALL, MOD, and TAB. -o, --stdout-log Write statistics to stdout instead of target log file. -O, --output-log OUTPUT_FILE Output file for logging instead of target prefix usage. -g, --removal-tags Tag reads that are subject to adapter or barcode removal. -e, --number-tags Replace read tags by ascending number to save space. -d, --umi-tags Capture UMIs in reads at barcode or adapter N positions. TRIM-END MODES ANY: longer side of read remains after removal of overlap LEFT: right side remains after removal, align <= read end RIGHT: left part remains after removal, align >= read start LTAIL: consider first n bases of reads in alignment RTAIL: use only last n bases, see tail-length options EXAMPLES flexbar -r reads.fq -t target -q TAIL -qf i1.8 flexbar -r reads.fq -b barcodes.fa -bt LTAIL flexbar -r reads.fq -a adapters.fa -ao 3 -ae 0.1 flexbar -r r1.fq -p r2.fq -a a1.fa -a2 a2.fa -ap ON flexbar -r r1.fq -p r2.fq -aa TruSeq -ap ON VERSION Last update: May 2019 flexbar version: 3.5.0 SeqAn version: 2.4.0 Available on github.com/seqan/flexbar