#!/usr/bin/env python
# coding: utf-8

# ### Running in Docker container on Ostrich
#
# #### Started Docker container with the following command:
#
# ```docker run -p 8888:8888 -v /Users/sam/data/:/data -v /Users/sam/owl_home/:/owl_home -v /Users/sam/owl_web/:/owl_web -v /Users/sam/gitrepos:/gitrepos -it f99537d7e06a```
#
# The command allows access to Jupyter Notebook over port 8888 and makes my Jupyter Notebook GitHub repo and my data files on Owl/home and Owl/web accessible to the Docker container.
#
# Once the container was running, I started Jupyter Notebook with the following command inside the Docker container:
#
# ```jupyter notebook```
#
# This is configured in the Docker container to launch a Jupyter Notebook without a browser on port 8888.
#
# The Docker container is running on an image created from this [Dockerfile (Git commit 443bc42)](https://github.com/sr320/LabDocs/blob/443bc425cd36d23a07cf12625f38b7e3a397b9be/code/dockerfiles/Dockerfile.bio)

# In[1]:

get_ipython().run_cell_magic('bash', '', 'date\n')

# ### Check computer specs

# In[2]:

get_ipython().run_cell_magic('bash', '', 'hostname\n')

# In[3]:

get_ipython().run_cell_magic('bash', '', 'lscpu\n')

# ### View/compare MD5 checksums of different "versions" of good files
#
# #### Recently discovered that two FASTQ files from BGI were corrupt. Communication with BGI has indicated that all "versions" of the FASTQ files we've received over the year (there are three different "versions" of each FASTQ) are actually the same file; they've just been renamed. So, let's take a look at this...
#
# #### Original FASTQ files from BGI

# In[5]:

get_ipython().run_cell_magic('bash', '', 'ls -lh /owl_web/nightingales/O_lurida/\n')

# #### View original checksum for one of the "non-corrupt" files

# In[7]:

get_ipython().run_cell_magic('bash', '', 'grep 151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz /owl_web/nightingales/O_lurida/checksums.md5\n')

# #### View checksum for subsequent "versions" of this file

# In[8]:

get_ipython().run_cell_magic('bash', '', 'grep 151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz /owl_web/O_lurida_genome_assemblies_BGI/20160512/checksums.md5\ngrep 151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5\n')

# ##### Albeit a bit difficult to read, you should be able to see that the checksum values for all three "versions" of this FASTQ file are the same (despite the changes in filename). This confirms BGI's information that all the data files provided at each stage are exactly the same file.

# #### View checksums for one of the "corrupt" files

# In[10]:

get_ipython().run_cell_magic('bash', '', 'grep 151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz /owl_web/nightingales/O_lurida/checksums.md5\ngrep 151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz /owl_web/O_lurida_genome_assemblies_BGI/20160512/checksums.md5\ngrep 151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5\n')

# ##### This appears confusing because all three stored checksum values match. However, this [notebook](https://github.com/sr320/LabDocs/blob/2f6c1b43d4dba60c7a4f3e6dd34d9e2d2eb1f85a/jupyter_nbs/sam/20161117_docker_oly_genome_fastq_corruption.ipynb) reviews what has happened.
# To make this easier to follow (without needing to leave this notebook to get some idea of what's going on), let's generate MD5 checksums for all three "versions" of this file and see what they look like.

# #### Generate MD5 checksums for all three "versions" of this "corrupt" file.

# In[11]:

get_ipython().run_cell_magic('bash', '', 'time md5sum /owl_web/nightingales/O_lurida/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz\ntime md5sum /owl_web/O_lurida_genome_assemblies_BGI/20160512/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz\ntime md5sum /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz\n')

# ##### Notice that the checksum for the original file differs from that of the subsequent "versions", even though the checksums.md5 file has the correct checksum value! Something happened to this file (as well as 151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz) during copying to owl/nightingales/O_lurida.

# #### For completeness, let's view the MD5 checksums for 151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.
# This is the other "corrupt" file identified in this [notebook entry](https://github.com/sr320/LabDocs/blob/2f6c1b43d4dba60c7a4f3e6dd34d9e2d2eb1f85a/jupyter_nbs/sam/20161117_docker_oly_genome_fastq_corruption.ipynb)

# In[12]:

get_ipython().run_cell_magic('bash', '', 'grep 151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz /owl_web/nightingales/O_lurida/checksums.md5\ngrep 151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz /owl_web/O_lurida_genome_assemblies_BGI/20160512/checksums.md5\ngrep 151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5\n')

# In[14]:

get_ipython().run_cell_magic('bash', '', 'time md5sum /owl_web/nightingales/O_lurida/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz\ntime md5sum /owl_web/O_lurida_genome_assemblies_BGI/20160512/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz\ntime md5sum /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz\n')

# ##### We see the same thing: the original file has a different checksum value than the subsequent "versions".
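# ##### The comparison done by hand above (stored value from checksums.md5 vs. freshly computed md5sum) can also be sketched in Python. This is only an illustration, not what was run in this notebook. The helper names (`md5_of_file`, `stored_md5`, `matches_manifest`) are hypothetical, and the code assumes checksums.md5 uses the usual md5sum layout of one `<md5>  <filename>` pair per line.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large FASTQ files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stored_md5(manifest_path, filename):
    """Look up the stored checksum for `filename` in a manifest.

    Assumption: each line looks like '<md5>  <name>' (md5sum output).
    Returns None when the file is not listed.
    """
    with open(manifest_path) as manifest:
        for line in manifest:
            parts = line.split()
            if len(parts) < 2:
                continue
            # md5sum marks binary mode with a leading '*' on the name.
            listed = os.path.basename(parts[-1].lstrip("*"))
            if listed == filename:
                return parts[0].lower()
    return None


def matches_manifest(file_path, manifest_path, name=None):
    """True if the file's computed MD5 equals the value stored for it.

    `name` overrides the manifest entry to look up, for files that
    were renamed after the manifest was written.
    """
    lookup = name or os.path.basename(file_path)
    stored = stored_md5(manifest_path, lookup)
    return stored is not None and stored == md5_of_file(file_path)
```

# ##### A file whose stored and computed values disagree, like the two "corrupt" files above, would make `matches_manifest` return False.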
# #### Let's replace the bad files with the good ones

# In[15]:

get_ipython().run_cell_magic('bash', '', 'time cp /owl_web/O_lurida_genome_assemblies_BGI/20160512/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz \\\n/owl_web/nightingales/O_lurida/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz\n')

# #### Look at the checksums now

# In[16]:

get_ipython().run_cell_magic('bash', '', 'time md5sum /owl_web/nightingales/O_lurida/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz\ntime md5sum /owl_web/O_lurida_genome_assemblies_BGI/20160512/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz\ntime md5sum /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz\n')

# ##### Everything looks good. Time to copy the other file and check it out.

# In[17]:

get_ipython().run_cell_magic('bash', '', 'time cp /owl_web/O_lurida_genome_assemblies_BGI/20160512/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz \\\n/owl_web/nightingales/O_lurida/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz\ntime md5sum /owl_web/nightingales/O_lurida/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz\ntime md5sum /owl_web/O_lurida_genome_assemblies_BGI/20160512/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz\ntime md5sum /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz\n')

# ##### Everything looks good.
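# ##### Since the original corruption happened during a copy, the "cp, then md5sum" steps above could be combined so that a bad copy fails loudly. A minimal sketch, not what was run here: `file_md5` and `copy_and_verify` are hypothetical helper names, and comparing against the source file's MD5 (rather than the stored checksums.md5 value) is an assumption made for brevity.

```python
import hashlib
import shutil


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst):
    """Copy src over dst, then confirm both files share one MD5.

    Raises IOError if the copy does not match the source, so a
    silently corrupted copy cannot go unnoticed.
    """
    shutil.copyfile(src, dst)
    src_md5 = file_md5(src)
    dst_md5 = file_md5(dst)
    if src_md5 != dst_md5:
        raise IOError(
            "MD5 mismatch after copy: %s (source) != %s (copy)"
            % (src_md5, dst_md5)
        )
    return dst_md5
```

# ##### The returned digest can then be compared against the entry in checksums.md5 as a final check.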