Started Docker container with the following command:
docker run -p 8888:8888 -v /home/sam/data/pacbio_oly/:/home/data -it 9ce16ff93ef9 /bin/bash
The command allows access to Jupyter Notebook over port 8888 and makes my Jupyter Notebook GitHub repo and my data files (/home/sam/data/pacbio_oly/) accessible to the Docker container.
Once the container was started, I started Jupyter Notebook with the following command inside the Docker container:
jupyter notebook --allow-root
This is configured in the Docker container to launch a Jupyter Notebook without a browser on port 8888. The Docker container is running on an image created from this Dockerfile (Git commit 832008c).
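For reference, here is how the pieces of the docker run command above fit together. This is only a dry-run sketch that prints the command rather than running it; the variable names are mine and are not part of the original setup.

```bash
# Dry-run sketch: assemble the docker run command from labeled parts and print it.
# Variable names are illustrative; the values are the ones used in this notebook.
host_port=8888                        # port on the host machine
container_port=8888                   # port Jupyter listens on inside the container
host_dir=/home/sam/data/pacbio_oly/   # directory on the host
container_dir=/home/data              # where that directory appears in the container
image=9ce16ff93ef9                    # Docker image ID

# -p publishes the container port on the host, -v bind-mounts the host directory,
# and -it gives an interactive session with a terminal.
docker_cmd="docker run -p ${host_port}:${container_port} -v ${host_dir}:${container_dir} -it ${image} /bin/bash"

# Print instead of executing.
echo "$docker_cmd"
```

Because of the -v mapping, everything under /home/sam/data/pacbio_oly/ on the host shows up as /home/data inside the container, which is why the later cells run from /home/data.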
%%bash
date
Thu Sep 7 22:03:11 UTC 2017
%%bash
hostname
ff9e68310edc
%%bash
lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Model name:            Intel(R) Xeon(R) CPU X5670 @ 2.93GHz
Stepping:              2
CPU MHz:               2926.129
BogoMIPS:              5851.98
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-23
%%bash
free -mh
             total       used       free     shared    buffers     cached
Mem:           70G        32G        38G       178M       447M        25G
-/+ buffers/cache:       6.1G        64G
Swap:         4.7G         0B       4.7G
%%bash
pwd
/home/data
%%bash
ls -l
total 24738112
-rwxrwxr-x 1 1000 1000 2852947472 Sep 7 21:26 170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 3126996263 Sep 7 21:27 170228_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2843320527 Sep 7 21:27 170228_PCB-CC_AL_20kb_P6v2_D01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 3114876304 Sep 7 21:28 170228_PCB-CC_AL_20kb_P6v2_E01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2960438946 Sep 7 21:28 170307_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2995066419 Sep 7 21:28 170307_PCB-CC_AL_20kb_P6v2_C02_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2092190052 Sep 7 21:29 170314_PCB-CC_20kb_P6v2_A01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 1842836662 Sep 7 21:29 170314_PCB-CC_20kb_P6v2_A02_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 1672061431 Sep 7 21:29 170314_PCB-CC_20kb_P6v2_A03_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 1831019208 Sep 7 21:30 170314_PCB-CC_20kb_P6v2_A04_1_filtered_subreads.fastq
-rw-r--r-- 1 root root 3799 Sep 7 22:12 20170907_docker_pacbio_oly_minimap2.ipynb
-rw-rw-r-- 1 1000 1000 902 Sep 7 21:30 md5sums.txt
%%bash
which minimap2
/usr/local/bioinformatics/minimap2-2.1.1_x64-linux/minimap2
Run minimap2 with the -x ava-pb option. This is listed in the minimap2 manual as a preset for "PacBio all-vs-all overlap mapping."
%%bash
time minimap2 -x ava-pb \
170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fastq \
170228_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fastq \
170228_PCB-CC_AL_20kb_P6v2_D01_1_filtered_subreads.fastq \
170228_PCB-CC_AL_20kb_P6v2_E01_1_filtered_subreads.fastq \
170307_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fastq \
170307_PCB-CC_AL_20kb_P6v2_C02_1_filtered_subreads.fastq \
170314_PCB-CC_20kb_P6v2_A01_1_filtered_subreads.fastq \
170314_PCB-CC_20kb_P6v2_A02_1_filtered_subreads.fastq \
170314_PCB-CC_20kb_P6v2_A03_1_filtered_subreads.fastq \
170314_PCB-CC_20kb_P6v2_A04_1_filtered_subreads.fastq \
> 20170905_minimap2_pacibio_oly.paf
[M::mm_idx_gen::44.362*1.37] collected minimizers
[M::mm_idx_gen::53.324*1.64] sorted minimizers
[M::main::53.324*1.64] loaded/built the index for 220331 target sequence(s)
[M::mm_mapopt_update::59.437*1.57] mid_occ = 150; max_occ = 1012
[M::mm_idx_stat] kmer size: 19; skip: 5; is_HPC: 1; #seq: 220331
[M::mm_idx_stat::61.424*1.56] distinct minimizers: 143263923 (59.16% are singletons); average occurrences: 2.317; average spacing: 4.270
[M::worker_pipeline::99.245*2.10] mapped 61679 sequences
[M::worker_pipeline::136.871*2.35] mapped 61589 sequences
[M::worker_pipeline::173.658*2.49] mapped 61193 sequences
[M::worker_pipeline::177.803*2.50] mapped 6791 sequences
[M::worker_pipeline::216.438*2.58] mapped 61406 sequences
[M::worker_pipeline::254.175*2.65] mapped 61190 sequences
[M::worker_pipeline::285.102*2.69] mapped 50466 sequences
[M::worker_pipeline::323.885*2.72] mapped 61057 sequences
[M::worker_pipeline::360.993*2.75] mapped 60462 sequences
[M::worker_pipeline::398.020*2.78] mapped 60668 sequences
[M::worker_pipeline::401.684*2.78] mapped 6051 sequences
[M::worker_pipeline::440.719*2.79] mapped 74133 sequences
[M::worker_pipeline::478.095*2.81] mapped 72450 sequences
[M::worker_pipeline::513.634*2.83] mapped 69360 sequences
[M::worker_pipeline::552.319*2.84] mapped 72499 sequences
[M::worker_pipeline::589.710*2.85] mapped 73630 sequences
[M::worker_pipeline::626.184*2.86] mapped 71104 sequences
[M::worker_pipeline::666.204*2.86] mapped 76847 sequences
[M::worker_pipeline::705.398*2.87] mapped 76285 sequences
[M::worker_pipeline::708.543*2.87] mapped 6132 sequences
[M::worker_pipeline::748.995*2.88] mapped 77534 sequences
[M::worker_pipeline::782.140*2.88] mapped 64988 sequences
[M::worker_pipeline::823.985*2.89] mapped 78021 sequences
[M::worker_pipeline::850.287*2.89] mapped 51988 sequences
[M::worker_pipeline::891.063*2.89] mapped 78098 sequences
[M::worker_pipeline::922.855*2.90] mapped 63965 sequences
[M::main] Version: 2.1.1-r341
[M::main] CMD: minimap2 -x ava-pb 170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fastq 170228_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fastq 170228_PCB-CC_AL_20kb_P6v2_D01_1_filtered_subreads.fastq 170228_PCB-CC_AL_20kb_P6v2_E01_1_filtered_subreads.fastq 170307_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fastq 170307_PCB-CC_AL_20kb_P6v2_C02_1_filtered_subreads.fastq 170314_PCB-CC_20kb_P6v2_A01_1_filtered_subreads.fastq 170314_PCB-CC_20kb_P6v2_A02_1_filtered_subreads.fastq 170314_PCB-CC_20kb_P6v2_A03_1_filtered_subreads.fastq 170314_PCB-CC_20kb_P6v2_A04_1_filtered_subreads.fastq
[M::main] Real time: 923.248 sec; CPU: 2673.740 sec
real 15m23.428s
user 44m21.720s
sys 0m12.200s
%%bash
ls -l
total 24738120
-rwxrwxr-x 1 1000 1000 2852947472 Sep 7 21:26 170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 3126996263 Sep 7 21:27 170228_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2843320527 Sep 7 21:27 170228_PCB-CC_AL_20kb_P6v2_D01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 3114876304 Sep 7 21:28 170228_PCB-CC_AL_20kb_P6v2_E01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2960438946 Sep 7 21:28 170307_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2995066419 Sep 7 21:28 170307_PCB-CC_AL_20kb_P6v2_C02_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2092190052 Sep 7 21:29 170314_PCB-CC_20kb_P6v2_A01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 1842836662 Sep 7 21:29 170314_PCB-CC_20kb_P6v2_A02_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 1672061431 Sep 7 21:29 170314_PCB-CC_20kb_P6v2_A03_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 1831019208 Sep 7 21:30 170314_PCB-CC_20kb_P6v2_A04_1_filtered_subreads.fastq
-rw-r--r-- 1 root root 0 Sep 7 22:40 20170905_minimap2_pacibio_oly.paf
-rw-r--r-- 1 root root 10901 Sep 11 17:48 20170907_docker_pacbio_oly_minimap2.ipynb
-rw-rw-r-- 1 1000 1000 902 Sep 7 21:30 md5sums.txt
The output file (20170905_minimap2_pacibio_oly.paf) appears to be empty... Let's verify.
%%bash
head 20170905_minimap2_pacibio_oly.paf
SYNOPSIS
* Indexing the target sequences (optional):
minimap2 [-x preset] -d target.mmi target.fa
minimap2 [-H] [-k kmer] [-w miniWinSize] [-I batchSize] -d target.mmi target.fa
* Long-read alignment with CIGAR:
minimap2 -a [-x preset] target.mmi query.fa > output.sam
minimap2 -c [-H] [-k kmer] [-w miniWinSize] [...] target.fa query.fa > output.paf
* Long-read overlap without CIGAR:
minimap2 -x ava-ont [-t nThreads] target.fa query.fa > output.paf
Each of the usage examples only specifies a single target and a single query.
So, let's test it out by only aligning two sequences and see if the output file actually contains some data.
Of course, I guess there's always the possibility that none of the PacBio reads actually overlap with each other (which could explain the empty output file in the initial run that used all of the files)?
%%bash
time minimap2 -x ava-pb \
170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fastq \
170228_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fastq \
> 20170911_minimap2_pacibio_oly_170210_vs_170228C01.paf
[M::mm_idx_gen::44.417*1.37] collected minimizers
[M::mm_idx_gen::53.100*1.63] sorted minimizers
[M::main::53.100*1.63] loaded/built the index for 220331 target sequence(s)
[M::mm_mapopt_update::59.320*1.57] mid_occ = 150; max_occ = 1012
[M::mm_idx_stat] kmer size: 19; skip: 5; is_HPC: 1; #seq: 220331
[M::mm_idx_stat::61.303*1.55] distinct minimizers: 143263923 (59.16% are singletons); average occurrences: 2.317; average spacing: 4.270
[M::worker_pipeline::98.772*2.09] mapped 61679 sequences
[M::worker_pipeline::136.002*2.35] mapped 61589 sequences
[M::worker_pipeline::173.195*2.49] mapped 61193 sequences
[M::worker_pipeline::177.334*2.50] mapped 6791 sequences
[M::main] Version: 2.1.1-r341
[M::main] CMD: minimap2 -x ava-pb 170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fastq 170228_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fastq
[M::main] Real time: 177.408 sec; CPU: 443.432 sec
real 2m57.995s
user 7m16.116s
sys 0m7.900s
%%bash
ls -l
total 24738124
-rwxrwxr-x 1 1000 1000 2852947472 Sep 7 21:26 170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 3126996263 Sep 7 21:27 170228_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2843320527 Sep 7 21:27 170228_PCB-CC_AL_20kb_P6v2_D01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 3114876304 Sep 7 21:28 170228_PCB-CC_AL_20kb_P6v2_E01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2960438946 Sep 7 21:28 170307_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2995066419 Sep 7 21:28 170307_PCB-CC_AL_20kb_P6v2_C02_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2092190052 Sep 7 21:29 170314_PCB-CC_20kb_P6v2_A01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 1842836662 Sep 7 21:29 170314_PCB-CC_20kb_P6v2_A02_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 1672061431 Sep 7 21:29 170314_PCB-CC_20kb_P6v2_A03_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 1831019208 Sep 7 21:30 170314_PCB-CC_20kb_P6v2_A04_1_filtered_subreads.fastq
-rw-r--r-- 1 root root 0 Sep 7 22:40 20170905_minimap2_pacibio_oly.paf
-rw-r--r-- 1 root root 16091 Sep 11 18:08 20170907_docker_pacbio_oly_minimap2.ipynb
-rw-r--r-- 1 root root 0 Sep 11 18:03 20170911_minimap2_pacibio_oly_170210_vs_170228C01.paf
-rw-rw-r-- 1 1000 1000 902 Sep 7 21:30 md5sums.txt
Well, this output file (20170911_minimap2_pacibio_oly_170210_vs_170228C01.paf) is also empty. I'll just post an issue to the minimap2 GitHub page and see if I get any help there.
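Rather than scanning the ls -l listing for zero-byte files, the emptiness check can be done directly with the shell's -s test, which is true only when a file exists and has a size greater than zero. A minimal sketch; report_empty is a helper name I made up for illustration.

```bash
# Report whether each given file is empty, using the POSIX `-s` test
# (true when the file exists and has a size greater than zero).
# report_empty is a hypothetical helper name, not part of minimap2.
report_empty() {
  for f in "$@"; do
    if [ -s "$f" ]; then
      echo "$f: has data ($(wc -c < "$f" | tr -d ' ') bytes)"
    else
      echo "$f: empty or missing"
    fi
  done
}

# Example: check every PAF file in the current directory.
report_empty *.paf
```

This only says whether anything was written; it does not say why minimap2 produced no overlaps.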
%%bash
ls -l *.fasta
-rw-rw-r-- 1 1000 1000 1463327832 Sep 11 18:51 170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fasta
-rw-rw-r-- 1 1000 1000 1601947458 Sep 11 18:49 170228_PCB-CC_AL_20kb_P6v2_C01_1.fasta
%%bash
mv 170228_PCB-CC_AL_20kb_P6v2_C01_1.fasta 170228_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fasta
%%bash
ls -l *.fasta
-rw-rw-r-- 1 1000 1000 1463327832 Sep 11 18:51 170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fasta
-rw-rw-r-- 1 1000 1000 1601947458 Sep 11 18:49 170228_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fasta
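The step that produced the FASTA files listed above is not shown in this notebook. One common way to do such a conversion, assuming standard four-line FASTQ records (no wrapped sequence lines), is an awk one-liner along these lines. The function name fastq_to_fasta and the example file names are placeholders of mine, not what was actually run.

```bash
# Convert four-line FASTQ records to FASTA.
# Assumes each record is exactly four lines (header, sequence, "+", qualities).
# fastq_to_fasta is a hypothetical helper name.
fastq_to_fasta() {
  awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # header: swap leading "@" for ">"
       NR % 4 == 2 { print }                     # sequence line, copied as-is
      ' "$1"
}

# Example usage (placeholder file names):
# fastq_to_fasta reads.fastq > reads.fasta
```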
%%bash
time minimap2 -x ava-pb \
170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fasta \
170228_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fasta \
> 20170911_minimap2_pacibio_oly_170210_vs_170228C01.paf
[M::mm_idx_gen::44.207*1.36] collected minimizers
[M::mm_idx_gen::52.985*1.63] sorted minimizers
[M::main::52.985*1.63] loaded/built the index for 220331 target sequence(s)
[M::mm_mapopt_update::59.202*1.57] mid_occ = 150; max_occ = 1012
[M::mm_idx_stat] kmer size: 19; skip: 5; is_HPC: 1; #seq: 220331
[M::mm_idx_stat::61.148*1.55] distinct minimizers: 143263923 (59.16% are singletons); average occurrences: 2.317; average spacing: 4.270
[M::worker_pipeline::99.409*2.10] mapped 61679 sequences
[M::worker_pipeline::136.320*2.35] mapped 61589 sequences
[M::worker_pipeline::172.797*2.49] mapped 61193 sequences
[M::worker_pipeline::176.903*2.50] mapped 6791 sequences
[M::main] Version: 2.1.1-r341
[M::main] CMD: minimap2 -x ava-pb 170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fasta 170228_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fasta
[M::main] Real time: 176.953 sec; CPU: 442.168 sec
real 2m57.459s
user 7m15.560s
sys 0m7.112s
%%bash
ls -l
total 27731572
-rw-rw-r-- 1 1000 1000 1463327832 Sep 11 18:51 170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fasta
-rwxrwxr-x 1 1000 1000 2852947472 Sep 7 21:26 170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fastq
-rw-rw-r-- 1 1000 1000 1601947458 Sep 11 18:49 170228_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fasta
-rwxrwxr-x 1 1000 1000 3126996263 Sep 7 21:27 170228_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2843320527 Sep 7 21:27 170228_PCB-CC_AL_20kb_P6v2_D01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 3114876304 Sep 7 21:28 170228_PCB-CC_AL_20kb_P6v2_E01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2960438946 Sep 7 21:28 170307_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2995066419 Sep 7 21:28 170307_PCB-CC_AL_20kb_P6v2_C02_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2092190052 Sep 7 21:29 170314_PCB-CC_20kb_P6v2_A01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 1842836662 Sep 7 21:29 170314_PCB-CC_20kb_P6v2_A02_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 1672061431 Sep 7 21:29 170314_PCB-CC_20kb_P6v2_A03_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 1831019208 Sep 7 21:30 170314_PCB-CC_20kb_P6v2_A04_1_filtered_subreads.fastq
-rw-r--r-- 1 root root 0 Sep 7 22:40 20170905_minimap2_pacibio_oly.paf
-rw-r--r-- 1 root root 20221 Sep 11 18:54 20170907_docker_pacbio_oly_minimap2.ipynb
-rw-r--r-- 1 root root 0 Sep 11 18:54 20170911_minimap2_pacibio_oly_170210_vs_170228C01.paf
-rw-rw-r-- 1 1000 1000 902 Sep 7 21:30 md5sums.txt
Well, that didn't do anything. Like a doofus, I should've thought of the next step (mapping a file against itself) as my initial test!
(I'm also adding the thread argument -t 12 to use 12 computing threads.)
%%bash
time minimap2 -x ava-pb -t 12 \
170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fasta \
170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fasta \
> 20170911_minimap2_pacibio_oly_170210.paf
[M::mm_idx_gen::44.598*1.36] collected minimizers
[M::mm_idx_gen::47.690*2.02] sorted minimizers
[M::main::47.690*2.02] loaded/built the index for 220331 target sequence(s)
[M::mm_mapopt_update::53.712*1.90] mid_occ = 150; max_occ = 1012
[M::mm_idx_stat] kmer size: 19; skip: 5; is_HPC: 1; #seq: 220331
[M::mm_idx_stat::55.658*1.87] distinct minimizers: 143263923 (59.16% are singletons); average occurrences: 2.317; average spacing: 4.270
[M::worker_pipeline::79.382*4.78] mapped 78070 sequences
[M::worker_pipeline::103.565*6.50] mapped 77485 sequences
[M::worker_pipeline::118.661*7.11] mapped 64776 sequences
[M::main] Version: 2.1.1-r341
[M::main] CMD: minimap2 -x ava-pb -t 12 170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fasta 170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fasta
[M::main] Real time: 119.067 sec; CPU: 844.484 sec
real 1m59.437s
user 13m53.832s
sys 0m10.868s
%%bash
ls -l
total 29050584
-rw-rw-r-- 1 1000 1000 1463327832 Sep 11 18:51 170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fasta
-rwxrwxr-x 1 1000 1000 2852947472 Sep 7 21:26 170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fastq
-rw-rw-r-- 1 1000 1000 1601947458 Sep 11 18:49 170228_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fasta
-rwxrwxr-x 1 1000 1000 3126996263 Sep 7 21:27 170228_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2843320527 Sep 7 21:27 170228_PCB-CC_AL_20kb_P6v2_D01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 3114876304 Sep 7 21:28 170228_PCB-CC_AL_20kb_P6v2_E01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2960438946 Sep 7 21:28 170307_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2995066419 Sep 7 21:28 170307_PCB-CC_AL_20kb_P6v2_C02_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2092190052 Sep 7 21:29 170314_PCB-CC_20kb_P6v2_A01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 1842836662 Sep 7 21:29 170314_PCB-CC_20kb_P6v2_A02_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 1672061431 Sep 7 21:29 170314_PCB-CC_20kb_P6v2_A03_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 1831019208 Sep 7 21:30 170314_PCB-CC_20kb_P6v2_A04_1_filtered_subreads.fastq
-rw-r--r-- 1 root root 0 Sep 7 22:40 20170905_minimap2_pacibio_oly.paf
-rw-r--r-- 1 root root 25434 Sep 11 19:04 20170907_docker_pacbio_oly_minimap2.ipynb
-rw-r--r-- 1 root root 1350653569 Sep 11 19:04 20170911_minimap2_pacibio_oly_170210.paf
-rw-r--r-- 1 root root 0 Sep 11 18:54 20170911_minimap2_pacibio_oly_170210_vs_170228C01.paf
-rw-rw-r-- 1 1000 1000 902 Sep 7 21:30 md5sums.txt
%%bash
ls -lh 20170911_minimap2_pacibio_oly_170210.paf
-rw-r--r-- 1 root root 1.3G Sep 11 19:04 20170911_minimap2_pacibio_oly_170210.paf
Hey, it worked! For fun, I'm just going to test with FASTQ files and see what happens...
%%bash
time minimap2 -x ava-pb -t 12 \
170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fastq \
170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fastq \
> 20170911_minimap2_pacbio_oly_170210_fq.paf
[M::mm_idx_gen::44.018*1.37] collected minimizers
[M::mm_idx_gen::47.074*2.04] sorted minimizers
[M::main::47.074*2.04] loaded/built the index for 220331 target sequence(s)
[M::mm_mapopt_update::53.133*1.92] mid_occ = 150; max_occ = 1012
[M::mm_idx_stat] kmer size: 19; skip: 5; is_HPC: 1; #seq: 220331
[M::mm_idx_stat::55.089*1.89] distinct minimizers: 143263923 (59.16% are singletons); average occurrences: 2.317; average spacing: 4.270
[M::worker_pipeline::79.106*4.84] mapped 78070 sequences
[M::worker_pipeline::103.637*6.56] mapped 77485 sequences
[M::worker_pipeline::118.680*7.17] mapped 64776 sequences
[M::main] Version: 2.1.1-r341
[M::main] CMD: minimap2 -x ava-pb -t 12 170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fastq 170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fastq
[M::main] Real time: 118.727 sec; CPU: 851.000 sec
real 1m59.282s
user 14m0.536s
sys 0m11.016s
%%bash
ls -lh 20170911_minimap2_pacbio_oly_170210_fq.paf
-rw-r--r-- 1 root root 1.3G Sep 11 19:11 20170911_minimap2_pacbio_oly_170210_fq.paf
OK, this worked, too! So, this leads me to believe I need to run a query against itself in order for us to get the data we need to proceed to the next step in this pipeline (miniasm).
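Before handing a PAF file to miniasm, a quick sanity check of its contents might be worthwhile. PAF is tab-delimited, with the query name in column 1 and the target name in column 6. The sketch below counts records, distinct query reads, and records where a read overlaps itself. paf_summary is a hypothetical helper name, and on a file this large the scan will take a while.

```bash
# Summarize a PAF file: total records, distinct query reads, and self-hits.
# PAF is tab-delimited; column 1 is the query name and column 6 the target name.
# paf_summary is a hypothetical helper name.
paf_summary() {
  awk -F '\t' '
    { records++; if (!($1 in seen)) { seen[$1] = 1; queries++ } }
    $1 == $6 { self++ }
    END { printf "records=%d queries=%d self_hits=%d\n", records, queries, self }
  ' "$1"
}

# Example usage:
# paf_summary 20170911_minimap2_pacbio_oly_170210_fq.paf
```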
%%bash
time for i in *.fastq
do cat "$i" >> 201709011_oly_pacbio_cat.fastq
done
real 1m58.624s
user 0m0.116s
sys 0m20.216s
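The loop above works on a first run because the *.fastq glob is expanded before the combined file exists. On a second run the combined file would itself match the glob, and the original inputs would be appended to it again. A slightly more defensive sketch, assuming four-line FASTQ records, skips the output file, writes to a temporary name, and reports the read count. concat_fastq is a name I made up for illustration.

```bash
# Concatenate FASTQ files into one file and report how many reads it holds.
# concat_fastq is a hypothetical helper; assumes four-line FASTQ records.
concat_fastq() {
  out=$1; shift
  tmp_out="${out}.tmp"                  # temporary name that does not match *.fastq
  for f in "$@"; do
    [ "$f" = "$out" ] && continue       # never feed an old output back in
    cat "$f"
  done > "$tmp_out" && mv "$tmp_out" "$out"
  lines=$(wc -l < "$out")
  echo "$((lines / 4)) reads in $out"
}

# Example usage:
# concat_fastq 20170911_oly_pacbio_cat.fastq *.fastq
```

The reported read count can then be compared against the number of sequences minimap2 says it indexed.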
%%bash
ls -lh
total 53G
-rw-rw-r-- 1 1000 1000 1.4G Sep 11 18:51 170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fasta
-rwxrwxr-x 1 1000 1000 2.7G Sep 7 21:26 170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1_filtered_subreads.fastq
-rw-rw-r-- 1 1000 1000 1.5G Sep 11 18:49 170228_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fasta
-rwxrwxr-x 1 1000 1000 3.0G Sep 7 21:27 170228_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2.7G Sep 7 21:27 170228_PCB-CC_AL_20kb_P6v2_D01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 3.0G Sep 7 21:28 170228_PCB-CC_AL_20kb_P6v2_E01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2.8G Sep 7 21:28 170307_PCB-CC_AL_20kb_P6v2_C01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2.8G Sep 7 21:28 170307_PCB-CC_AL_20kb_P6v2_C02_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 2.0G Sep 7 21:29 170314_PCB-CC_20kb_P6v2_A01_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 1.8G Sep 7 21:29 170314_PCB-CC_20kb_P6v2_A02_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 1.6G Sep 7 21:29 170314_PCB-CC_20kb_P6v2_A03_1_filtered_subreads.fastq
-rwxrwxr-x 1 1000 1000 1.8G Sep 7 21:30 170314_PCB-CC_20kb_P6v2_A04_1_filtered_subreads.fastq
-rw-r--r-- 1 root root 24G Sep 11 19:31 201709011_oly_pacbio_cat.fastq
-rw-r--r-- 1 root root 0 Sep 7 22:40 20170905_minimap2_pacibio_oly.paf
-rw-r--r-- 1 root root 31K Sep 11 19:32 20170907_docker_pacbio_oly_minimap2.ipynb
-rw-r--r-- 1 root root 1.3G Sep 11 19:11 20170911_minimap2_pacbio_oly_170210_fq.paf
-rw-r--r-- 1 root root 1.3G Sep 11 19:04 20170911_minimap2_pacibio_oly_170210.paf
-rw-r--r-- 1 root root 0 Sep 11 18:54 20170911_minimap2_pacibio_oly_170210_vs_170228C01.paf
-rw-rw-r-- 1 1000 1000 902 Sep 7 21:30 md5sums.txt
%%bash
mv 201709011_oly_pacbio_cat.fastq 20170911_oly_pacbio_cat.fastq
%%bash
time minimap2 -x ava-pb -t 23 \
20170911_oly_pacbio_cat.fastq \
20170911_oly_pacbio_cat.fastq \
> 20170911_minimap2_pacbio_oly.paf
[M::mm_idx_gen::130.358*1.51] collected minimizers
[M::mm_idx_gen::136.461*2.38] sorted minimizers
[M::main::136.461*2.38] loaded/built the index for 537516 target sequence(s)
[M::mm_mapopt_update::151.239*2.25] mid_occ = 216; max_occ = 1259
[M::mm_idx_stat] kmer size: 19; skip: 5; is_HPC: 1; #seq: 537516
[M::mm_idx_stat::155.085*2.22] distinct minimizers: 263663554 (42.00% are singletons); average occurrences: 3.579; average spacing: 4.239
[M::worker_pipeline::188.540*5.77] mapped 78070 sequences
[M::worker_pipeline::219.029*8.19] mapped 77485 sequences
[M::worker_pipeline::242.542*9.64] mapped 75465 sequences
[M::worker_pipeline::259.888*10.54] mapped 61141 sequences
[M::worker_pipeline::280.680*11.47] mapped 61571 sequences
[M::worker_pipeline::299.354*12.19] mapped 61587 sequences
[M::worker_pipeline::316.902*12.80] mapped 61048 sequences
[M::worker_pipeline::335.060*13.35] mapped 61149 sequences
[M::worker_pipeline::350.760*13.78] mapped 60782 sequences
[M::worker_pipeline::365.411*14.16] mapped 60930 sequences
[M::worker_pipeline::380.131*14.50] mapped 60739 sequences
[M::worker_pipeline::394.948*14.82] mapped 62543 sequences
[M::worker_pipeline::410.157*15.13] mapped 73661 sequences
[M::worker_pipeline::425.195*15.41] mapped 72563 sequences
[M::worker_pipeline::440.318*15.67] mapped 73439 sequences
[M::worker_pipeline::455.319*15.91] mapped 72620 sequences
[M::worker_pipeline::470.257*16.14] mapped 73624 sequences
[M::worker_pipeline::485.364*16.35] mapped 73660 sequences
[M::worker_pipeline::501.031*16.56] mapped 76838 sequences
[M::worker_pipeline::516.778*16.76] mapped 76351 sequences
[M::worker_pipeline::532.511*16.94] mapped 77711 sequences
[M::worker_pipeline::548.370*17.12] mapped 78261 sequences
[M::worker_pipeline::564.239*17.28] mapped 78012 sequences
[M::worker_pipeline::579.895*17.44] mapped 78098 sequences
[M::worker_pipeline::595.449*17.58] mapped 78393 sequences
[M::worker_pipeline::598.265*17.61] mapped 14176 sequences
[M::mm_idx_gen::723.834*14.81] collected minimizers
[M::mm_idx_gen::729.497*14.86] sorted minimizers
[M::main::729.497*14.86] loaded/built the index for 537277 target sequence(s)
[M::mm_mapopt_update::750.838*14.47] mid_occ = 193; max_occ = 1097
[M::mm_idx_stat] kmer size: 19; skip: 5; is_HPC: 1; #seq: 537277
[M::mm_idx_stat::754.770*14.40] distinct minimizers: 269906881 (41.64% are singletons); average occurrences: 3.508; average spacing: 4.225
[M::worker_pipeline::793.777*14.55] mapped 78070 sequences
[M::worker_pipeline::819.415*14.82] mapped 77485 sequences
[M::worker_pipeline::843.192*15.05] mapped 75465 sequences
[M::worker_pipeline::860.856*15.22] mapped 61141 sequences
[M::worker_pipeline::881.694*15.41] mapped 61571 sequences
[M::worker_pipeline::902.016*15.58] mapped 61587 sequences
[M::worker_pipeline::924.371*15.76] mapped 61048 sequences
[M::worker_pipeline::946.292*15.93] mapped 61149 sequences
[M::worker_pipeline::967.784*16.09] mapped 60782 sequences
[M::worker_pipeline::988.697*16.24] mapped 60930 sequences
[M::worker_pipeline::1010.371*16.39] mapped 60739 sequences
[M::worker_pipeline::1030.331*16.52] mapped 62543 sequences
[M::worker_pipeline::1049.598*16.64] mapped 73661 sequences
[M::worker_pipeline::1068.946*16.76] mapped 72563 sequences
[M::worker_pipeline::1086.343*16.86] mapped 73439 sequences
[M::worker_pipeline::1101.711*16.94] mapped 72620 sequences
[M::worker_pipeline::1116.774*17.03] mapped 73624 sequences
[M::worker_pipeline::1130.965*17.10] mapped 73660 sequences
[M::worker_pipeline::1144.947*17.17] mapped 76838 sequences
[M::worker_pipeline::1159.121*17.24] mapped 76351 sequences
[M::worker_pipeline::1173.292*17.32] mapped 77711 sequences
[M::worker_pipeline::1187.516*17.38] mapped 78261 sequences
[M::worker_pipeline::1201.661*17.45] mapped 78012 sequences
[M::worker_pipeline::1215.707*17.52] mapped 78098 sequences
[M::worker_pipeline::1229.665*17.58] mapped 78393 sequences
[M::worker_pipeline::1232.203*17.59] mapped 14176 sequences
[M::mm_idx_gen::1379.041*15.87] collected minimizers
[M::mm_idx_gen::1384.750*15.89] sorted minimizers
[M::main::1384.750*15.89] loaded/built the index for 612555 target sequence(s)
[M::mm_mapopt_update::1400.855*15.72] mid_occ = 283; max_occ = 1800
[M::mm_idx_stat] kmer size: 19; skip: 5; is_HPC: 1; #seq: 612555
[M::mm_idx_stat::1404.738*15.68] distinct minimizers: 251016217 (43.01% are singletons); average occurrences: 3.728; average spacing: 4.274
[M::worker_pipeline::1453.808*15.78] mapped 78070 sequences
[M::worker_pipeline::1487.401*15.94] mapped 77485 sequences
[M::worker_pipeline::1513.014*15.97] mapped 75465 sequences
[M::worker_pipeline::1562.314*15.88] mapped 61141 sequences
[M::worker_pipeline::1586.832*15.99] mapped 61571 sequences
[M::worker_pipeline::1610.594*16.09] mapped 61587 sequences
[M::worker_pipeline::1637.485*16.21] mapped 61048 sequences
[M::worker_pipeline::1662.537*16.31] mapped 61149 sequences
[M::worker_pipeline::1686.189*16.37] mapped 60782 sequences
[M::worker_pipeline::1716.947*16.48] mapped 60930 sequences
[M::worker_pipeline::1739.231*16.54] mapped 60739 sequences
[M::worker_pipeline::1767.526*16.65] mapped 62543 sequences
[M::worker_pipeline::1795.989*16.75] mapped 73661 sequences
[M::worker_pipeline::1821.595*16.84] mapped 72563 sequences
[M::worker_pipeline::1846.132*16.88] mapped 73439 sequences
[M::worker_pipeline::1881.615*16.95] mapped 72620 sequences
[M::worker_pipeline::1909.401*17.04] mapped 73624 sequences
[M::worker_pipeline::1936.466*17.13] mapped 73660 sequences
[M::worker_pipeline::1969.038*17.22] mapped 76838 sequences
[M::worker_pipeline::1998.795*17.31] mapped 76351 sequences
[M::worker_pipeline::2023.857*17.38] mapped 77711 sequences
[M::worker_pipeline::2048.376*17.45] mapped 78261 sequences
[M::worker_pipeline::2070.530*17.51] mapped 78012 sequences
[M::worker_pipeline::2090.002*17.56] mapped 78098 sequences
[M::worker_pipeline::2107.277*17.61] mapped 78393 sequences
[M::worker_pipeline::2110.051*17.61] mapped 14176 sequences
[M::mm_idx_gen::2133.572*17.43] collected minimizers
[M::mm_idx_gen::2134.303*17.43] sorted minimizers
[M::main::2134.303*17.43] loaded/built the index for 92569 target sequence(s)
[M::mm_mapopt_update::2137.287*17.41] mid_occ = 96; max_occ = 702
[M::mm_idx_stat] kmer size: 19; skip: 5; is_HPC: 1; #seq: 92569
[M::mm_idx_stat::2138.223*17.40] distinct minimizers: 79280749 (71.19% are singletons); average occurrences: 1.736; average spacing: 4.295
[M::worker_pipeline::2150.329*17.42] mapped 78070 sequences
[M::worker_pipeline::2158.745*17.42] mapped 77485 sequences
[M::worker_pipeline::2169.431*17.43] mapped 75465 sequences
[M::worker_pipeline::2175.999*17.44] mapped 61141 sequences
[M::worker_pipeline::2201.521*17.31] mapped 61571 sequences
[M::worker_pipeline::2210.181*17.31] mapped 61587 sequences
[M::worker_pipeline::2219.166*17.32] mapped 61048 sequences
[M::worker_pipeline::2230.005*17.32] mapped 61149 sequences
[M::worker_pipeline::2236.932*17.34] mapped 60782 sequences
[M::worker_pipeline::2243.696*17.36] mapped 60930 sequences
[M::worker_pipeline::2250.382*17.38] mapped 60739 sequences
[M::worker_pipeline::2257.541*17.39] mapped 62543 sequences
[M::worker_pipeline::2265.596*17.42] mapped 73661 sequences
[M::worker_pipeline::2272.442*17.43] mapped 72563 sequences
[M::worker_pipeline::2279.949*17.45] mapped 73439 sequences
[M::worker_pipeline::2286.867*17.47] mapped 72620 sequences
[M::worker_pipeline::2294.249*17.49] mapped 73624 sequences
[M::worker_pipeline::2301.783*17.51] mapped 73660 sequences
[M::worker_pipeline::2311.212*17.53] mapped 76838 sequences
[M::worker_pipeline::2320.151*17.56] mapped 76351 sequences
[M::worker_pipeline::2329.772*17.58] mapped 77711 sequences
[M::worker_pipeline::2338.490*17.60] mapped 78261 sequences
[M::worker_pipeline::2347.768*17.62] mapped 78012 sequences
[M::worker_pipeline::2354.384*17.64] mapped 78098 sequences
[M::worker_pipeline::2364.010*17.66] mapped 78393 sequences
[M::worker_pipeline::2364.329*17.66] mapped 14176 sequences
[M::main] Version: 2.1.1-r341
[M::main] CMD: minimap2 -x ava-pb -t 23 20170911_oly_pacbio_cat.fastq 20170911_oly_pacbio_cat.fastq
[M::main] Real time: 2364.639 sec; CPU: 41757.964 sec
real 39m25.094s
user 690m58.980s
sys 4m59.240s
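The log for this run shows the index being built four separate times (for 537516, 537277, 612555 and 92569 target sequences), so minimap2 evidently processed the concatenated file in batches, presumably because of its index batch size setting (-I). The sketch below pulls those numbers out of a saved log and also computes CPU time divided by real time as a rough measure of how many threads were kept busy. It assumes the stderr output was saved to a file with one message per line, which this notebook did not do; summarize_minimap2_log is a hypothetical helper name.

```bash
# Summarize a saved minimap2 log (stderr): number of index batches, total target
# sequences across batches, and CPU time / real time.
# summarize_minimap2_log is a hypothetical helper; the notebook did not save its log.
summarize_minimap2_log() {
  awk '
    /loaded\/built the index for/ {
      batches++
      # the sequence count is the field right after the word "for"
      for (i = 1; i <= NF; i++) if ($i == "for") total += $(i + 1)
    }
    /Real time:/ {
      for (i = 1; i <= NF; i++) {
        if ($i == "time:") real = $(i + 1)
        if ($i == "CPU:")  cpu  = $(i + 1)
      }
    }
    END {
      printf "index batches: %d\n", batches
      printf "target sequences: %d\n", total
      if (real > 0) printf "CPU/real: %.2f\n", cpu / real
    }
  ' "$1"
}

# Example usage (minimap2.log is a placeholder name):
# minimap2 -x ava-pb -t 23 reads.fastq reads.fastq > out.paf 2> minimap2.log
# summarize_minimap2_log minimap2.log
```

For the numbers in this run, the four batches add up to 1779917 target sequences, which is also the sum of the "mapped" counts within each batch, and 41757.964 / 2364.639 comes to about 17.66, matching the multiplier in the final log lines.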
%%bash
ls -lh 20170911_minimap2_pacbio_oly.paf
-rw-r--r-- 1 root root 40G Sep 11 22:02 20170911_minimap2_pacbio_oly.paf
%%bash
date
Mon Sep 18 13:57:31 UTC 2017