1. Get data & packages

Data was downloaded from Globus (globus.org). It was sent in two zipped files, one from each lane, containing separate files for each sample (demultiplexed).

I then checked that the md5 values were the same using md5 [filename].tar.gz >> [filename].md5 (this appended the new value to the .md5 file, so I could compare the two checksums).
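The same comparison can be automated rather than done by eye. The following is a minimal sketch, assuming the .md5 file from the sequencing facility holds the hex digest as its first whitespace-separated token. The helper names `md5_of_file` and `md5_matches` are hypothetical and are not part of this workflow.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(archive, md5_file):
    """Compare an archive's digest with the first token stored in md5_file."""
    # Assumption: the expected digest is the first token in the .md5 file.
    expected = Path(md5_file).read_text().split()[0].lower()
    return md5_of_file(archive) == expected
```

For example, `md5_matches("Batch1_69plex_lane1.tar.gz", "Batch1_69plex_lane1.md5")` would return True when the download is intact.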

A note on data accessibility & my working environment

I originally downloaded data to Ostrich, thinking that I could work on that computer remotely using Remote Desktop and Jupyter Notebook. However, my internet connection is too slow to work productively that way. I therefore also downloaded the data to my external hard drive, and worked locally. This also allows me to use all the packages that I had installed on my personal computer in 2018 (which is helpful).

Canonical versions of the data are saved to Owl/Nightingales in the zipped format here: nightingales/O_lurida/2020-04-21_QuantSeq-data/ and as individual fastq files per sample here: nightingales/O_lurida/, and to Gannet as individual fastq files per sample here: Atumefaciens/20200426_olur_fastqc_quantseq/.

Sam also ran MultiQC on my samples; check out his notebook entry, and the MultiQC report.

Create path variables, i.e. shortcuts to certain directories

To start, create some variables for commonly accessed paths. NOTE: many of the steps in this workflow require me to be located within a specific directory to access files. So, while I try to use these path variables, I often have to hard-code my paths.

In [11]:
# create path variable to raw data, saved on my external hard drive 
workingdir = "/Volumes/Bumblebee/O.lurida_QuantSeq-2020/"
In [2]:
cd {workingdir}
/Volumes/Bumblebee/O.lurida_QuantSeq-2020
In [3]:
# create path variable to fastqc directory 
fastqc = "/Applications/bioinformatics/FastQC/"
In [5]:
# test fastqc 
! {fastqc}fastqc --version
FastQC v0.11.8

I installed MultiQC using git clone via the following:

git clone https://github.com/ewels/MultiQC.git
cd MultiQC
pip install .
In [6]:
# test multiqc 
! multiqc --version
multiqc, version 1.9.dev0

Install Cutadapt

The Cutadapt program is used in the tagseq processing pipeline. I installed cutadapt using python3 -m pip install --user --upgrade cutadapt. During install, I received this warning:

WARNING: The script cutadapt is installed in '/Users/laura/.local/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.

So, I added that path to my PATH using the following in Terminal:
PATH=$PATH:/Users/laura/.local/bin

For some reason Jupyter Notebook doesn't recognize cutadapt, even though I can access it via the Terminal.

In [109]:
# Try to add the path here 
! PATH=$PATH:/Users/laura/.local/bin
In [113]:
# Still doesn't work 
! cutadapt --version
/bin/sh: cutadapt: command not found
In [7]:
# Hard coding the cutadapt path works, though 
! /Users/laura/.local/bin/cutadapt --version
2.10
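
A likely explanation is that each `!` command runs in its own temporary subshell, so the PATH assignment above is discarded as soon as that line finishes. The assignment made in Terminal likewise applies only to that Terminal session. One possible workaround, sketched below under the assumption that cutadapt lives in the directory named in the pip warning, is to extend PATH for the notebook's own process from Python. Commands started afterwards with `!` should then inherit the updated value.

```python
import os
import shutil

# Directory taken from the pip warning above; adjust if cutadapt lives elsewhere.
local_bin = "/Users/laura/.local/bin"

# Prepend the directory only if it is not already on PATH.
path_entries = os.environ.get("PATH", "").split(os.pathsep)
if local_bin not in path_entries:
    os.environ["PATH"] = os.pathsep.join([local_bin] + path_entries)

# shutil.which returns the full path to the executable, or None if not found.
print(shutil.which("cutadapt"))
```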
In [10]:
# test fastq_quality_filter (see if it's correctly added to my PATH)
! fastq_quality_filter -h
usage: fastq_quality_filter [-h] [-v] [-q N] [-p N] [-z] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.14 by A. Gordon

   [-h]         = This helpful help screen.
   [-q N]       = Minimum quality score to keep.
   [-p N]       = Minimum percent of bases that must have [-q] quality.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA/Q output file. default is STDOUT.
   [-v]         = Verbose - report number of sequences.
                  If [-o] is specified,  report will be printed to STDOUT.
                  If [-o] is not specified (and output goes to STDOUT),
                  report will be printed to STDERR.

Unpack raw data

Currently, the data is still compressed (that's how it arrived via Globus). I therefore need to extract each lane's tar.gz archive before I can gunzip the individual library files.

In [12]:
cd raw-data/
/Volumes/Bumblebee/O.lurida_QuantSeq-2020/raw-data
In [15]:
ls
Batch1_69plex_lane1.md5*    key*
Batch1_69plex_lane1.tar.gz* quantseq2020_key.csv*
Batch2_77plex_lane2_md5*    quantseq2020_key.xlsx*
Batch2_77plex_lane2_tar.gz* test-batch/
In [21]:
# check md5's for batch1.tar.gz compared to md5 that seq. facility sent 
! cat Batch1_69plex_lane1.md5
! md5 Batch1_69plex_lane1.tar.gz
a856ffbe2bf102b07ee78607946a463b  Batch1_69plex_lane1.tar.gz
MD5 (Batch1_69plex_lane1.tar.gz) = a856ffbe2bf102b07ee78607946a463b
In [20]:
# check md5's for batch2.tar.gz compared to md5 that seq. facility sent
! cat Batch2_77plex_lane2_md5
! md5 Batch2_77plex_lane2_tar.gz
56decbd72a38b6076de6a421cfddf9c5  Batch2_77plex_lane2_tar.gz
MD5 (Batch2_77plex_lane2_tar.gz) = 56decbd72a38b6076de6a421cfddf9c5
In [22]:
# extract batch/lane 1 data 
! gunzip -c Batch1_69plex_lane1.tar.gz | tar xopf -
In [23]:
# extract batch/lane 2 data 
! gunzip -c Batch2_77plex_lane2_tar.gz | tar xopf -
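
For reference, the same extraction can be done without leaving Python. This is a minimal sketch using the standard-library tarfile module; the function name `extract_archive` is hypothetical. Because the mode is given explicitly as "r:gz", it does not matter that the second archive is named with `_tar.gz` rather than `.tar.gz`. As with any tar extraction, it should only be run on archives from a trusted source.

```python
import tarfile
from pathlib import Path


def extract_archive(archive, dest="."):
    """Extract a gzip-compressed tar archive into dest and return its member names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" forces gzip decompression regardless of the file's extension.
    with tarfile.open(archive, mode="r:gz") as tar:
        names = tar.getnames()
        tar.extractall(path=dest)
    return names
```

Calling `extract_archive("Batch1_69plex_lane1.tar.gz")` from within raw-data/ would unpack the archive into the current directory, as the shell command above does.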
In [24]:
# check out resulting file structure 
! ls
Batch1_69plex_lane1.md5    Batch2_77plex_lane2_tar.gz
Batch1_69plex_lane1.tar.gz key
Batch1_69plex_lane1_done   quantseq2020_key.csv
Batch2_77plex_lane2_done   quantseq2020_key.xlsx
Batch2_77plex_lane2_md5    test-batch
In [25]:
! ls Batch1_69plex_lane1_done/
137_S63_L001_R1_001.fastq.gz         314_S49_L001_R1_001.fastq.gz
139_S54_L001_R1_001.fastq.gz         315_S26_L001_R1_001.fastq.gz
140_S64_L001_R1_001.fastq.gz         316_S9_L001_R1_001.fastq.gz
141_S61_L001_R1_001.fastq.gz         317_S33_L001_R1_001.fastq.gz
156_S66_L001_R1_001.fastq.gz         318_S6_L001_R1_001.fastq.gz
159_S68_L001_R1_001.fastq.gz         319_S52_L001_R1_001.fastq.gz
161_S57_L001_R1_001.fastq.gz         321_S29_L001_R1_001.fastq.gz
162_S62_L001_R1_001.fastq.gz         322_S8_L001_R1_001.fastq.gz
168_S67_L001_R1_001.fastq.gz         323_S39_L001_R1_001.fastq.gz
169_S65_L001_R1_001.fastq.gz         324_S47_L001_R1_001.fastq.gz
171_S58_L001_R1_001.fastq.gz         325_S13_L001_R1_001.fastq.gz
172_S59_L001_R1_001.fastq.gz         326_S38_L001_R1_001.fastq.gz
181_S69_L001_R1_001.fastq.gz         327_S37_L001_R1_001.fastq.gz
183_S56_L001_R1_001.fastq.gz         328_S12_L001_R1_001.fastq.gz
184_S55_L001_R1_001.fastq.gz         329_S46_L001_R1_001.fastq.gz
185_S60_L001_R1_001.fastq.gz         331_S53_L001_R1_001.fastq.gz
291_S42_L001_R1_001.fastq.gz         332_S48_L001_R1_001.fastq.gz
292_S32_L001_R1_001.fastq.gz         333_S30_L001_R1_001.fastq.gz
293_S11_L001_R1_001.fastq.gz         334_S50_L001_R1_001.fastq.gz
294_S7_L001_R1_001.fastq.gz          335_S31_L001_R1_001.fastq.gz
295_S41_L001_R1_001.fastq.gz         336_S51_L001_R1_001.fastq.gz
296_S40_L001_R1_001.fastq.gz         337_S35_L001_R1_001.fastq.gz
298_S25_L001_R1_001.fastq.gz         338_S4_L001_R1_001.fastq.gz
299_S21_L001_R1_001.fastq.gz         339_S10_L001_R1_001.fastq.gz
301_S22_L001_R1_001.fastq.gz         341_S19_L001_R1_001.fastq.gz
302_S15_L001_R1_001.fastq.gz         342_S23_L001_R1_001.fastq.gz
303_S17_L001_R1_001.fastq.gz         343_S14_L001_R1_001.fastq.gz
304_S24_L001_R1_001.fastq.gz         344_S27_L001_R1_001.fastq.gz
305_S1_L001_R1_001.fastq.gz          345_S16_L001_R1_001.fastq.gz
306_S44_L001_R1_001.fastq.gz         346_S18_L001_R1_001.fastq.gz
307_S45_L001_R1_001.fastq.gz         347_S43_L001_R1_001.fastq.gz
308_S5_L001_R1_001.fastq.gz          348_S3_L001_R1_001.fastq.gz
309_S20_L001_R1_001.fastq.gz         349_S34_L001_R1_001.fastq.gz
311_S2_L001_R1_001.fastq.gz          Reports
312_S28_L001_R1_001.fastq.gz         Stats
313_S36_L001_R1_001.fastq.gz         Undetermined_S0_L001_R1_001.fastq.gz
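
The file names in this listing share one pattern: a sample ID, an S-number, the lane (L001), the read (R1), and a trailing 001. A minimal sketch for splitting a name into those fields is shown below. The regular expression, the `FastqName` type, and `parse_fastq_name` are hypothetical helpers written for illustration. Reading the S-number as the sample's position on the sample sheet follows the usual Illumina convention and is an assumption here.

```python
import re
from typing import NamedTuple, Optional

# Matches names such as 137_S63_L001_R1_001.fastq.gz (the .gz is optional).
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<s_number>\d+)_L(?P<lane>\d{3})_R(?P<read>\d)_001"
    r"\.fastq(?:\.gz)?$"
)


class FastqName(NamedTuple):
    sample: str
    s_number: int
    lane: int
    read: int


def parse_fastq_name(filename: str) -> Optional[FastqName]:
    """Split a file name into its fields; return None if it does not match."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        return None
    return FastqName(
        sample=match["sample"],
        s_number=int(match["s_number"]),
        lane=int(match["lane"]),
        read=int(match["read"]),
    )


print(parse_fastq_name("137_S63_L001_R1_001.fastq.gz"))
```

Entries that are not sequence files, such as Reports and Stats, return None and can be skipped.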

Move to each batch's directory containing demultiplexed library files, and gunzip all fastq files in that folder

In [26]:
cd Batch1_69plex_lane1_done/
/Volumes/Bumblebee/O.lurida_QuantSeq-2020/raw-data/Batch1_69plex_lane1_done
In [27]:
! gunzip *.fastq.gz
In [28]:
cd ../Batch2_77plex_lane2_done/
/Volumes/Bumblebee/O.lurida_QuantSeq-2020/raw-data/Batch2_77plex_lane2_done
In [29]:
! gunzip *.fastq.gz 
In [30]:
# Check out contents after gunzip 
! ls 
34_S68_L002_R1_001.fastq          482_S25_L002_R1_001.fastq
35_S72_L002_R1_001.fastq          483_S7_L002_R1_001.fastq
37_S70_L002_R1_001.fastq          484_S43_L002_R1_001.fastq
39_S52_L002_R1_001.fastq          485_S21_L002_R1_001.fastq
401_S10_L002_R1_001.fastq         487_S6_L002_R1_001.fastq
402_S5_L002_R1_001.fastq          488_S26_L002_R1_001.fastq
403_S30_L002_R1_001.fastq         489_S35_L002_R1_001.fastq
404_S42_L002_R1_001.fastq         490_S19_L002_R1_001.fastq
411_S9_L002_R1_001.fastq          491_S50_L002_R1_001.fastq
412_S74_L002_R1_001.fastq         492_S40_L002_R1_001.fastq
413_S38_L002_R1_001.fastq         506_S47_L002_R1_001.fastq
414_S49_L002_R1_001.fastq         513_S56_L002_R1_001.fastq
41_S62_L002_R1_001.fastq          521_S65_L002_R1_001.fastq
421_S22_L002_R1_001.fastq         522_S1_L002_R1_001.fastq
431b_S8_L002_R1_001.fastq         523_S4_L002_R1_001.fastq
432_S75_L002_R1_001.fastq         524_S11_L002_R1_001.fastq
434_S55_L002_R1_001.fastq         525_S18_L002_R1_001.fastq
43_S46_L002_R1_001.fastq          526_S15_L002_R1_001.fastq
441_S73_L002_R1_001.fastq         527_S39_L002_R1_001.fastq
442b_S60_L002_R1_001.fastq        528_S27_L002_R1_001.fastq
443_S36_L002_R1_001.fastq         529_S53_L002_R1_001.fastq
444_S34_L002_R1_001.fastq         531_S44_L002_R1_001.fastq
445_S45_L002_R1_001.fastq         532_S33_L002_R1_001.fastq
44_S71_L002_R1_001.fastq          533_S61_L002_R1_001.fastq
451_S28_L002_R1_001.fastq         541_S41_L002_R1_001.fastq
452b_S2_L002_R1_001.fastq         542_S3_L002_R1_001.fastq
453_S12_L002_R1_001.fastq         543_S29_L002_R1_001.fastq
45_S63_L002_R1_001.fastq          551_S24_L002_R1_001.fastq
461b_S31_L002_R1_001.fastq        552b_S54_L002_R1_001.fastq
462b_S64_L002_R1_001.fastq        553_S23_L002_R1_001.fastq
46_S66_L002_R1_001.fastq          554_S13_L002_R1_001.fastq
471b_S51_L002_R1_001.fastq        561_S69_L002_R1_001.fastq
472b_S48_L002_R1_001.fastq        562_S77_L002_R1_001.fastq
473_S20_L002_R1_001.fastq         563_S59_L002_R1_001.fastq
474_S14_L002_R1_001.fastq         564_S32_L002_R1_001.fastq
475_S16_L002_R1_001.fastq         565_S67_L002_R1_001.fastq
476_S17_L002_R1_001.fastq         571_S76_L002_R1_001.fastq
477_S37_L002_R1_001.fastq         Reports
47_S58_L002_R1_001.fastq          Stats
481_S57_L002_R1_001.fastq         Undetermined_S0_L002_R1_001.fastq
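
The decompression step can also be written in Python, which avoids having to cd into each batch directory first. This is a minimal sketch with a hypothetical helper, `gunzip_fastqs`. Like gunzip, it removes each .gz file once the uncompressed copy has been written.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_fastqs(directory):
    """Decompress every *.fastq.gz in directory, then delete the .gz original."""
    written = []
    for gz_path in sorted(Path(directory).glob("*.fastq.gz")):
        out_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)  # stream, so large files are fine
        gz_path.unlink()  # mirror gunzip, which removes the compressed file
        written.append(out_path)
    return written
```

It could be called once per batch directory, for example `gunzip_fastqs("Batch1_69plex_lane1_done")` from within raw-data/.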

2. Initial QC

In [45]:
! mkdir {workingdir}qc-processing/fastqc/
In [31]:
! mkdir {workingdir}qc-processing/fastqc/untrimmed/
In [49]:
# test fastqc on one sample file 
! {fastqc}fastqc \
506_S47_L002_R1_001.fastq \
--outdir ../fastqc/untrimmed/
Started analysis of 506_S47_L002_R1_001.fastq
Approx 5% complete for 506_S47_L002_R1_001.fastq
Approx 10% complete for 506_S47_L002_R1_001.fastq
Approx 15% complete for 506_S47_L002_R1_001.fastq
Approx 20% complete for 506_S47_L002_R1_001.fastq
Approx 25% complete for 506_S47_L002_R1_001.fastq
Approx 30% complete for 506_S47_L002_R1_001.fastq
Approx 35% complete for 506_S47_L002_R1_001.fastq
Approx 40% complete for 506_S47_L002_R1_001.fastq
Approx 45% complete for 506_S47_L002_R1_001.fastq
Approx 50% complete for 506_S47_L002_R1_001.fastq
Approx 55% complete for 506_S47_L002_R1_001.fastq
Approx 60% complete for 506_S47_L002_R1_001.fastq
Approx 65% complete for 506_S47_L002_R1_001.fastq
Approx 70% complete for 506_S47_L002_R1_001.fastq
Approx 75% complete for 506_S47_L002_R1_001.fastq
Approx 80% complete for 506_S47_L002_R1_001.fastq
Approx 85% complete for 506_S47_L002_R1_001.fastq
Approx 90% complete for 506_S47_L002_R1_001.fastq
Approx 95% complete for 506_S47_L002_R1_001.fastq
Analysis complete for 506_S47_L002_R1_001.fastq

Run fastqc on all .fastq files in Batch/Lane 2, the larval data (current directory)

In [32]:
! {fastqc}fastqc \
*.fastq \
--outdir {workingdir}qc-processing/fastqc/untrimmed/ \
--quiet
In [33]:
# check out resulting fastqc files. 
! ls {workingdir}qc-processing/fastqc/untrimmed/
34_S68_L002_R1_001_fastqc.html          481_S57_L002_R1_001_fastqc.html
34_S68_L002_R1_001_fastqc.zip           481_S57_L002_R1_001_fastqc.zip
35_S72_L002_R1_001_fastqc.html          482_S25_L002_R1_001_fastqc.html
35_S72_L002_R1_001_fastqc.zip           482_S25_L002_R1_001_fastqc.zip
37_S70_L002_R1_001_fastqc.html          483_S7_L002_R1_001_fastqc.html
37_S70_L002_R1_001_fastqc.zip           483_S7_L002_R1_001_fastqc.zip
39_S52_L002_R1_001_fastqc.html          484_S43_L002_R1_001_fastqc.html
39_S52_L002_R1_001_fastqc.zip           484_S43_L002_R1_001_fastqc.zip
401_S10_L002_R1_001_fastqc.html         485_S21_L002_R1_001_fastqc.html
401_S10_L002_R1_001_fastqc.zip          485_S21_L002_R1_001_fastqc.zip
402_S5_L002_R1_001_fastqc.html          487_S6_L002_R1_001_fastqc.html
402_S5_L002_R1_001_fastqc.zip           487_S6_L002_R1_001_fastqc.zip
403_S30_L002_R1_001_fastqc.html         488_S26_L002_R1_001_fastqc.html
403_S30_L002_R1_001_fastqc.zip          488_S26_L002_R1_001_fastqc.zip
404_S42_L002_R1_001_fastqc.html         489_S35_L002_R1_001_fastqc.html
404_S42_L002_R1_001_fastqc.zip          489_S35_L002_R1_001_fastqc.zip
411_S9_L002_R1_001_fastqc.html          490_S19_L002_R1_001_fastqc.html
411_S9_L002_R1_001_fastqc.zip           490_S19_L002_R1_001_fastqc.zip
412_S74_L002_R1_001_fastqc.html         491_S50_L002_R1_001_fastqc.html
412_S74_L002_R1_001_fastqc.zip          491_S50_L002_R1_001_fastqc.zip
413_S38_L002_R1_001_fastqc.html         492_S40_L002_R1_001_fastqc.html
413_S38_L002_R1_001_fastqc.zip          492_S40_L002_R1_001_fastqc.zip
414_S49_L002_R1_001_fastqc.html         506_S47_L002_R1_001_fastqc.html
414_S49_L002_R1_001_fastqc.zip          506_S47_L002_R1_001_fastqc.zip
41_S62_L002_R1_001_fastqc.html          513_S56_L002_R1_001_fastqc.html
41_S62_L002_R1_001_fastqc.zip           513_S56_L002_R1_001_fastqc.zip
421_S22_L002_R1_001_fastqc.html         521_S65_L002_R1_001_fastqc.html
421_S22_L002_R1_001_fastqc.zip          521_S65_L002_R1_001_fastqc.zip
431b_S8_L002_R1_001_fastqc.html         522_S1_L002_R1_001_fastqc.html
431b_S8_L002_R1_001_fastqc.zip          522_S1_L002_R1_001_fastqc.zip
432_S75_L002_R1_001_fastqc.html         523_S4_L002_R1_001_fastqc.html
432_S75_L002_R1_001_fastqc.zip          523_S4_L002_R1_001_fastqc.zip
434_S55_L002_R1_001_fastqc.html         524_S11_L002_R1_001_fastqc.html
434_S55_L002_R1_001_fastqc.zip          524_S11_L002_R1_001_fastqc.zip
43_S46_L002_R1_001_fastqc.html          525_S18_L002_R1_001_fastqc.html
43_S46_L002_R1_001_fastqc.zip           525_S18_L002_R1_001_fastqc.zip
441_S73_L002_R1_001_fastqc.html         526_S15_L002_R1_001_fastqc.html
441_S73_L002_R1_001_fastqc.zip          526_S15_L002_R1_001_fastqc.zip
442b_S60_L002_R1_001_fastqc.html        527_S39_L002_R1_001_fastqc.html
442b_S60_L002_R1_001_fastqc.zip         527_S39_L002_R1_001_fastqc.zip
443_S36_L002_R1_001_fastqc.html         528_S27_L002_R1_001_fastqc.html
443_S36_L002_R1_001_fastqc.zip          528_S27_L002_R1_001_fastqc.zip
444_S34_L002_R1_001_fastqc.html         529_S53_L002_R1_001_fastqc.html
444_S34_L002_R1_001_fastqc.zip          529_S53_L002_R1_001_fastqc.zip
445_S45_L002_R1_001_fastqc.html         531_S44_L002_R1_001_fastqc.html
445_S45_L002_R1_001_fastqc.zip          531_S44_L002_R1_001_fastqc.zip
44_S71_L002_R1_001_fastqc.html          532_S33_L002_R1_001_fastqc.html
44_S71_L002_R1_001_fastqc.zip           532_S33_L002_R1_001_fastqc.zip
451_S28_L002_R1_001_fastqc.html         533_S61_L002_R1_001_fastqc.html
451_S28_L002_R1_001_fastqc.zip          533_S61_L002_R1_001_fastqc.zip
452b_S2_L002_R1_001_fastqc.html         541_S41_L002_R1_001_fastqc.html
452b_S2_L002_R1_001_fastqc.zip          541_S41_L002_R1_001_fastqc.zip
453_S12_L002_R1_001_fastqc.html         542_S3_L002_R1_001_fastqc.html
453_S12_L002_R1_001_fastqc.zip          542_S3_L002_R1_001_fastqc.zip
45_S63_L002_R1_001_fastqc.html          543_S29_L002_R1_001_fastqc.html
45_S63_L002_R1_001_fastqc.zip           543_S29_L002_R1_001_fastqc.zip
461b_S31_L002_R1_001_fastqc.html        551_S24_L002_R1_001_fastqc.html
461b_S31_L002_R1_001_fastqc.zip         551_S24_L002_R1_001_fastqc.zip
462b_S64_L002_R1_001_fastqc.html        552b_S54_L002_R1_001_fastqc.html
462b_S64_L002_R1_001_fastqc.zip         552b_S54_L002_R1_001_fastqc.zip
46_S66_L002_R1_001_fastqc.html          553_S23_L002_R1_001_fastqc.html
46_S66_L002_R1_001_fastqc.zip           553_S23_L002_R1_001_fastqc.zip
471b_S51_L002_R1_001_fastqc.html        554_S13_L002_R1_001_fastqc.html
471b_S51_L002_R1_001_fastqc.zip         554_S13_L002_R1_001_fastqc.zip
472b_S48_L002_R1_001_fastqc.html        561_S69_L002_R1_001_fastqc.html
472b_S48_L002_R1_001_fastqc.zip         561_S69_L002_R1_001_fastqc.zip
473_S20_L002_R1_001_fastqc.html         562_S77_L002_R1_001_fastqc.html
473_S20_L002_R1_001_fastqc.zip          562_S77_L002_R1_001_fastqc.zip
474_S14_L002_R1_001_fastqc.html         563_S59_L002_R1_001_fastqc.html
474_S14_L002_R1_001_fastqc.zip          563_S59_L002_R1_001_fastqc.zip
475_S16_L002_R1_001_fastqc.html         564_S32_L002_R1_001_fastqc.html
475_S16_L002_R1_001_fastqc.zip          564_S32_L002_R1_001_fastqc.zip
476_S17_L002_R1_001_fastqc.html         565_S67_L002_R1_001_fastqc.html
476_S17_L002_R1_001_fastqc.zip          565_S67_L002_R1_001_fastqc.zip
477_S37_L002_R1_001_fastqc.html         571_S76_L002_R1_001_fastqc.html
477_S37_L002_R1_001_fastqc.zip          571_S76_L002_R1_001_fastqc.zip
47_S58_L002_R1_001_fastqc.html          Undetermined_S0_L002_R1_001_fastqc.html
47_S58_L002_R1_001_fastqc.zip           Undetermined_S0_L002_R1_001_fastqc.zip

Run fastqc on all fastq files in Batch/Lane 1, the adult ctenidia+juvenile samples

In [34]:
cd {workingdir}raw-data/Batch1_69plex_lane1_done/
/Volumes/Bumblebee/O.lurida_QuantSeq-2020/raw-data/Batch1_69plex_lane1_done
In [35]:
! {fastqc}fastqc \
*.fastq \
--outdir {workingdir}qc-processing/fastqc/untrimmed/ \
--quiet
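
Because the same FastQC call is repeated for each batch, it could be wrapped in a small function that takes absolute paths and so does not depend on the current directory. This is a minimal sketch, not the method used above. `build_fastqc_command` and `run_fastqc` are hypothetical helpers, and the only FastQC options used are the --outdir and --quiet flags already shown in this notebook.

```python
import subprocess
from pathlib import Path


def build_fastqc_command(fastqc_exe, fastq_dir, outdir):
    """Return the FastQC argument list for every .fastq file in fastq_dir."""
    fastqs = sorted(str(p) for p in Path(fastq_dir).glob("*.fastq"))
    if not fastqs:
        raise FileNotFoundError(f"no .fastq files found in {fastq_dir}")
    return [str(fastqc_exe), *fastqs, "--outdir", str(outdir), "--quiet"]


def run_fastqc(fastqc_exe, fastq_dir, outdir):
    """Create outdir if needed, then run FastQC and raise if it exits non-zero."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_fastqc_command(fastqc_exe, fastq_dir, outdir), check=True)
```

With the variables defined earlier, a call might look like `run_fastqc(fastqc + "fastqc", workingdir + "raw-data/Batch1_69plex_lane1_done", workingdir + "qc-processing/fastqc/untrimmed")`, repeated for the lane 2 directory.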

Generate a MultiQC report on all untrimmed data files

In [36]:
! multiqc {workingdir}qc-processing/fastqc/untrimmed/ \
--filename {workingdir}qc-processing/fastqc/untrimmed/multiqc_report_untrimmed.html
[INFO   ]         multiqc : This is MultiQC v1.9.dev0
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching   : /Volumes/Bumblebee/O.lurida_QuantSeq-2020/qc-processing/fastqc/untrimmed
[INFO   ]          fastqc : Found 148 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : ../../qc-processing/fastqc/untrimmed/multiqc_report_untrimmed.html
[INFO   ]         multiqc : Data        : ../../qc-processing/fastqc/untrimmed/multiqc_report_untrimmed_data
[INFO   ]         multiqc : MultiQC complete
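
The 148 reports found by MultiQC match what the two lane listings predict: 69 libraries plus one Undetermined file in lane 1, and 77 libraries plus one Undetermined file in lane 2. A minimal sketch of that check is below; `count_fastqc_reports` is a hypothetical helper that counts the per-sample FastQC HTML files in a directory.

```python
from pathlib import Path


def count_fastqc_reports(report_dir):
    """Count FastQC HTML reports (one per input FASTQ file) in report_dir."""
    return len(list(Path(report_dir).glob("*_fastqc.html")))


# Expected number of reports, from the plex sizes in the archive names.
lane1_files = 69 + 1  # 69 libraries + Undetermined
lane2_files = 77 + 1  # 77 libraries + Undetermined
expected_reports = lane1_files + lane2_files
print(expected_reports)  # 148
```

Comparing `count_fastqc_reports(...)` on the untrimmed directory against `expected_reports` would flag any sample whose FastQC run failed silently. The MultiQC report itself is not counted, because its file name does not end in _fastqc.html.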

Inspect MultiQC report

In [38]:
from IPython.display import HTML
HTML(filename='/Volumes/Bumblebee/O.lurida_QuantSeq-2020/qc-processing/fastqc/untrimmed/multiqc_report_untrimmed.html')
Out[38]:
MultiQC Report