Data was downloaded from Globus (globus.org). It was sent in two zipped files, one from each lane, containing separate files for each sample (demultiplexed).
I then checked that the md5 values matched using md5 [filename].tar.gz >> [filename].md5 (this appended the newly computed value to the .md5 file, so I could compare the two).
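The comparison could also be scripted rather than checked by eye. This is a minimal sketch, assuming the sequencing facility's .md5 file contains the 32-character hex digest somewhere in its text. The helper names `md5_of_file` and `matches_expected` are hypothetical and are not used elsewhere in this notebook.

```python
import hashlib
import re
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(archive_path, md5_path):
    """Check whether the archive's digest appears in the .md5 file.

    Assumption: the .md5 file holds a 32-character hex digest in its text.
    """
    text = Path(md5_path).read_text()
    expected = {m.lower() for m in re.findall(r"\b[0-9a-fA-F]{32}\b", text)}
    return md5_of_file(archive_path) in expected
```

Note that this check is only meaningful against the facility's original .md5 file. Once the locally computed value has been appended to that file, the digest will always be found in it.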
I originally downloaded data to Ostrich, thinking that I could work on that computer remotely using Remote Desktop and Jupyter Notebook. However, my internet connection is too slow to work productively that way. I therefore also downloaded the data to my external hard drive, and worked locally. This also allows me to use all the packages that I had installed on my personal computer in 2018 (which is helpful).
Canonical versions of the data are saved to Owl/Nightingales in the zipped format here: nightingales/O_lurida/2020-04-21_QuantSeq-data/, as individual fastq/sample here: nightingales/O_lurida/, and to Gannet as individual fastq/sample here: Atumefaciens/20200426_olur_fastqc_quantseq/.
Sam also ran MultiQC on my samples; check out his notebook entry, and the MultiQC report.
To start, create some variables for commonly accessed paths. NOTE: many of the steps in this workflow require me to be located within a specific directory to access files. So, while I try to use these path variables, I often have to hard-code my paths.
# create path variable to raw data, saved on my external hard drive
workingdir = "/Volumes/Bumblebee/O.lurida_QuantSeq-2020/"
cd {workingdir}
# create path variable to fastqc directory
fastqc = "/Applications/bioinformatics/FastQC/"
# test fastqc
! {fastqc}fastqc --version
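One way to reduce the dependence on the current directory is to build absolute paths once and reuse them. This is only a sketch: the variable names `raw_data` and `untrimmed_qc` are hypothetical, and the subdirectory names are taken from the paths used elsewhere in this notebook.

```python
from pathlib import Path

# Root of the project on the external hard drive (same location as workingdir)
workingdir = Path("/Volumes/Bumblebee/O.lurida_QuantSeq-2020")

# Hypothetical names for two directories used elsewhere in this notebook
raw_data = workingdir / "raw-data"
untrimmed_qc = workingdir / "qc-processing" / "fastqc" / "untrimmed"

print(raw_data)
print(untrimmed_qc)
```

Because these are absolute paths, a shell escape such as `! ls {untrimmed_qc}` should give the same result regardless of which directory the notebook is currently in.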
# install MultiQC from the GitHub source
git clone https://github.com/ewels/MultiQC.git
cd MultiQC
pip install .
# test multiqc
! multiqc --version
The Cutadapt program is used in the tagseq processing pipeline. I installed cutadapt using python3 -m pip install --user --upgrade cutadapt. During install, I received this warning:
WARNING: The script cutadapt is installed in '/Users/laura/.local/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
So, I added that path to my PATH using the following in Terminal:
PATH=$PATH:/Users/laura/.local/bin
For some reason, Jupyter Notebook doesn't recognize cutadapt, even though I can access it via the Terminal.
# Try to add the path here
! PATH=$PATH:/Users/laura/.local/bin
# Still doesn't work
! cutadapt --version
# Hard coding the cutadapt path works, though
! /Users/laura/.local/bin/cutadapt --version
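A probable reason the `! PATH=...` attempt has no effect is that each `!` command runs in its own temporary subshell, so the assignment is lost as soon as that command finishes. Shell commands started from the notebook inherit the kernel's environment, so changing `os.environ` inside the kernel should persist for later cells. The sketch below assumes that behavior; the helper name `append_to_path` is hypothetical.

```python
import os


def append_to_path(directory, env=os.environ):
    """Append a directory to PATH unless it is already listed."""
    entries = [e for e in env.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        entries.append(directory)
    env["PATH"] = os.pathsep.join(entries)
    return env["PATH"]


# Directory taken from the pip warning above
append_to_path("/Users/laura/.local/bin")
```

After running this in a cell, a later `! cutadapt --version` should be able to find the program without the hard-coded path, for as long as the kernel stays running.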
# test fastq_quality_filter (see if it's correctly added to my PATH)
! fastq_quality_filter -h
Currently, the data is still zipped (that's how it arrived via Globus). I therefore need to untar/gunzip the lane files before I can gunzip the individual library files.
cd raw-data/
ls
# check md5's for batch1.tar.gz compared to md5 that seq. facility sent
! cat Batch1_69plex_lane1.md5
! md5 Batch1_69plex_lane1.tar.gz
# check md5's for batch2.tar.gz compared to md5 that seq. facility sent
! cat Batch2_77plex_lane2_md5
! md5 Batch2_77plex_lane2_tar.gz
# extract batch/lane 1 data
! gunzip -c Batch1_69plex_lane1.tar.gz | tar xopf -
# extract batch/lane 2 data
! gunzip -c Batch2_77plex_lane2_tar.gz | tar xopf -
# check out resulting file structure
! ls
! ls Batch1_69plex_lane1_done/
cd Batch1_69plex_lane1_done/
! gunzip *.fastq.gz
cd ../Batch2_77plex_lane2_done/
! gunzip *.fastq.gz
# Check out contents after gunzip
! ls
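The two extraction steps above (unpack each lane archive, then decompress the per-sample files) could also be done from Python with the standard library. This is a rough sketch rather than what was actually run here; the function names `extract_lane_archive` and `gunzip_fastqs` are hypothetical.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def extract_lane_archive(archive, dest):
    """Unpack a .tar.gz lane archive into dest (similar to gunzip -c | tar xf -)."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)


def gunzip_fastqs(directory):
    """Decompress every *.fastq.gz in directory and remove the .gz copies."""
    outputs = []
    for gz_path in sorted(Path(directory).glob("*.fastq.gz")):
        out_path = gz_path.with_suffix("")  # drops only the trailing .gz
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_path.unlink()  # gunzip also removes the compressed original
        outputs.append(out_path)
    return outputs
```

For example, `extract_lane_archive("Batch1_69plex_lane1.tar.gz", ".")` followed by `gunzip_fastqs("Batch1_69plex_lane1_done")` would mirror the commands used for lane 1.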
# create output directories for fastqc reports (-p also creates any missing parent directories)
! mkdir -p {workingdir}qc-processing/fastqc/untrimmed/
# test fastqc on one sample file
! {fastqc}fastqc \
506_S47_L002_R1_001.fastq \
--outdir {workingdir}qc-processing/fastqc/untrimmed/
# run fastqc on all lane 2 samples (current directory is Batch2_77plex_lane2_done/)
! {fastqc}fastqc \
*.fastq \
--outdir {workingdir}qc-processing/fastqc/untrimmed/ \
--quiet
# check out resulting fastqc files.
! ls {workingdir}qc-processing/fastqc/untrimmed/
# move to the lane 1 directory and run fastqc on all lane 1 samples
cd {workingdir}raw-data/Batch1_69plex_lane1_done/
! {fastqc}fastqc \
*.fastq \
--outdir {workingdir}qc-processing/fastqc/untrimmed/ \
--quiet
# combine the individual fastqc reports into a single MultiQC report
! multiqc {workingdir}qc-processing/fastqc/untrimmed/ \
--filename {workingdir}qc-processing/fastqc/untrimmed/multiqc_report_untrimmed.html
# display the MultiQC report inline in the notebook
import IPython
IPython.display.HTML(filename='/Volumes/Bumblebee/O.lurida_QuantSeq-2020/qc-processing/fastqc/untrimmed/multiqc_report_untrimmed.html')