#!/usr/bin/env python
# coding: utf-8

# In[1]:

get_ipython().run_cell_magic('bash', '', 'date\n')


# In[2]:

get_ipython().run_cell_magic('bash', '', 'system_profiler SPSoftwareDataType\n')


# In[3]:

get_ipython().run_cell_magic('bash', '', '#Uses grep to exclude lines that display serial number and hardware UUID\nsystem_profiler SPHardwareDataType | grep -v [SH][ea]\n')


# ---
# # The goal of this notebook is to copy the PacBio fastq.gz files for the oly genome sequencing project to the ```/owl/nightingales/O_lurida``` folder to be in compliance with our data management plan. Since these files do not have unique names (they're all named the same thing, but are stored in different subdirectories), they need to be renamed. Additionally, to confirm that the files were copied and renamed correctly, an md5 checksum verification needs to be performed.
# 
# ---

# ### Create md5 checksums of original gzipped fastq files

# In[1]:

get_ipython().run_cell_magic('bash', '', 'cd /Volumes/owl/nightingales/O_lurida/20170323_pacbio/\n')


# #### Use find to locate all the gzipped fastq files, create an md5 checksum, and write the output from each file to a new file

# In[2]:

get_ipython().run_cell_magic('bash', '', 'time find . -maxdepth 2 -name "*.fastq.gz" -exec md5 {} + > md5checksums_fastq.gz.md5\n')


# Well, that didn't work. I think it didn't change directories. Should have caught this after executing cell #1 above, since there was no output listed. Let's see where I am...

# In[3]:

get_ipython().run_cell_magic('bash', '', 'pwd\n')


# Yep, didn't change directories. It's because I used the ```bash``` cell magic for the ```cd``` command, so the directory change only applied within that cell. Annoying!
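# A quick note on why that happens: each ```%%bash``` cell runs in its own bash subprocess, so a ```cd``` in one cell doesn't carry over to the next cell (or to the notebook kernel). A minimal sketch of one workaround, assuming the same PacBio directory as above, is to keep the ```cd``` and the command that needs it in the same cell:
# 
# ```bash
# %%bash
# # cd and pwd run in the same subprocess, so pwd reports the PacBio directory
# cd /Volumes/owl/nightingales/O_lurida/20170323_pacbio/
# pwd
# ```
# 
# The other workaround, used in the next cell, is to change the kernel's working directory with the bare ```cd``` line magic, which subsequent ```%%bash``` subprocesses then inherit.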
# In[4]:

cd /Volumes/owl/nightingales/O_lurida/20170323_pacbio/


# #### Use find to locate all the gzipped fastq files, create an md5 checksum, and write the output from each file to a new file

# In[5]:

get_ipython().run_cell_magic('bash', '', 'time find . -maxdepth 2 -name "*.fastq.gz" -exec md5 {} + > md5checksums_fastq.gz.md5\n')


# In[6]:

get_ipython().run_cell_magic('bash', '', 'cat md5checksums_fastq.gz.md5\n')


# ### Copy, rename, and generate new md5 checksums
# 

# In[7]:

get_ipython().run_cell_magic('bash', '', '# Find all xml files and store results in array\nxml_array=($(find . -maxdepth 2 -name "*.xml"))\n\n# Print contents of array, with some formatting for easier reading.\necho "Contents of xml_array:"\nprintf \'%s\\n\' "${xml_array[@]}"\necho ""\necho "-------------"\n\n# Find all fastq.gz files and store results in array\nfastq_array=($(find . -maxdepth 2 -name "*.fastq.gz"))\necho "Contents of xml_array:"\nprintf \'%s\\n\' "${fastq_array[@]}"\necho ""\necho "-------------"\n\n# Use parameter expansion to remove path from each component in fastq_array.\n# Store results in new array.\nfastq_nopath_array=($(echo "${fastq_array[@]##*/}"))\n\n# Print contents of array, with some formatting for easier reading.\necho "Contents of fastq_nopath_array:"\nprintf \'%s\\n\' "${fastq_nopath_array[@]}"\necho ""\necho "-------------"\n\n# Use parameter expansion to remove path from each component in xml_array.\n# Store results in new array.\nxml_nopath_array=($(echo "${xml_array[@]##*/}"))\n\n# Print contents of array, with some formatting for easier reading.\necho "Contents of xml_nopath_array:"\nprintf \'%s\\n\' "${xml_nopath_array[@]}"\necho ""\necho "-------------"\n\n# Use parameter expansion to remove the suffix (.xml) from each component in xml_nopath_array.\n# Store results in new array.\nxml_nosuffix_array=($(echo "${xml_nopath_array[@]%%.*}"))\n\n# Print contents of array, with some formatting for easier reading.\necho "Contents of xml_nosuffix_array:"\nprintf \'%s\\n\' "${xml_nosuffix_array[@]}"\necho ""\necho "-------------"\n\n# Loop through each index (i.e. the number corresponding to each element in the array).\n# Using the paths to each fastq.gz stored in the fastq_array, copy the fastq.gz file to the O_lurida nightingales folder &\n# use array elements to provide new name to copied file.\n# List the newly copied/named file to verify it got copied/renamed.\n# Create md5 checksums for each newly copied/renamed file and append to checksums.md5 file.\n# Use grep to verify info was written to checksums.md5 file\nfor item in "${!fastq_array[@]}"; do\n cp "${fastq_array[$item]}" /Volumes/owl/nightingales/O_lurida/"${xml_nosuffix_array[$item]}_${fastq_nopath_array[$item]}"\n ls /Volumes/owl/nightingales/O_lurida/"${xml_nosuffix_array[$item]}_${fastq_nopath_array[$item]}"\n md5 /Volumes/owl/nightingales/O_lurida/"${xml_nosuffix_array[$item]}_${fastq_nopath_array[$item]}" >> \\\n /Volumes/owl/nightingales/O_lurida/checksums.md5\n grep "${xml_nosuffix_array[$item]}_${fastq_nopath_array[$item]}" /Volumes/owl/nightingales/O_lurida/checksums.md5\ndone\n')


# Wow! I'm impressed with myself! It nearly all came out as intended!! Forgot to change the second ```echo``` statement to read "Contents of fastq_array:". Other than that, it all worked. Feels good!

# ### Compare md5 checksums
# 

# #### Need to compare the original checksums with those of the copied/renamed files to ensure file integrity didn't change during the copying process.

# Quick test of using ```grep``` and ```awk``` to isolate just the checksum values...

# In[8]:

grep "filtered_subreads.fastq.gz" /Volumes/owl/nightingales/O_lurida/checksums.md5 | awk '{print "$4"}'


# In[9]:

get_ipython().run_cell_magic('bash', '', 'grep "filtered_subreads.fastq.gz" /Volumes/owl/nightingales/O_lurida/checksums.md5 | awk \'{print "$4"}\'\n')


# In[10]:

get_ipython().run_cell_magic('bash', '', 'grep "filtered_subreads.fastq.gz" /Volumes/owl/nightingales/O_lurida/checksums.md5 | awk \'{print $4}\'\n')


# Well, I just looked through some previously used code and realized I didn't need to perform the above grep/awk test.
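# For future reference, the difference between the two attempts above comes down to awk quoting: inside double quotes, awk treats ```"$4"``` as a literal string and prints the characters ```$4```, while an unquoted ```$4``` refers to the fourth field of each input line. A minimal sketch (not part of the original workflow) illustrating the behavior:
# 
# ```bash
# echo "one two three four" | awk '{print "$4"}'   # prints the literal text: $4
# echo "one two three four" | awk '{print $4}'     # prints the fourth field: four
# ```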
# In[11]:

get_ipython().run_cell_magic('bash', '', 'original_md5=($(awk \'/filtered_subreads.fastq.gz/{print $4}\' /Volumes/owl/nightingales/O_lurida/20170323_pacbio/md5checksums_fastq.gz.md5))\ncurrent_md5=($(awk \'/filtered_subreads.fastq.gz/{print $4}\' /Volumes/owl/nightingales/O_lurida/checksums.md5\nfor ((i=0;i<=$count;++i))\n do\n printf "%s\\n" "${original_md5[$i]}"\n printf "%s\\n\\n" "${current_md5[$i]}"\n done\n')


# Whoops - left off the closing parentheses on the ```current_md5``` assignment and forgot to set ```count```. Both are fixed in the next cell.

# In[13]:

get_ipython().run_cell_magic('bash', '', 'original_md5=($(awk \'/filtered_subreads.fastq.gz/{print $4}\' /Volumes/owl/nightingales/O_lurida/20170323_pacbio/md5checksums_fastq.gz.md5))\ncurrent_md5=($(awk \'/filtered_subreads.fastq.gz/{print $4}\' /Volumes/owl/nightingales/O_lurida/checksums.md5))\ncount=$(( ${#original_md5[@]} - 1 ))\nfor ((i=0;i<=$count;++i))\n do\n printf "%s\\n" "${original_md5[$i]}"\n printf "%s\\n\\n" "${current_md5[$i]}"\n done\n')


# Great! Visual inspection indicates the MD5 checksums have not changed during the copying/renaming process!

# #### Explanation of the above command:
# 
# ##### The gist is that the output from the awk command is saved to an array. Then a for loop is run which iterates over the array and prints the element at each position.
# 
# ---
# ###### Break down the first line:
# 
# ```original_md5=()``` - This is an empty array called "original_md5".
# 
# ```$()``` - This is an empty command substitution. The stdout of the commands within the parentheses is captured.
# 
# ```awk '/filtered_subreads.fastq.gz/{print $4}' md5_file``` - Awk looks for any lines from the input file (md5_file) with "filtered_subreads.fastq.gz" in them. If a line contains "filtered_subreads.fastq.gz", awk prints the fourth field (i.e. the fourth column).
# 
# Summary - Each line printed by awk is stored as the next element of the array called "original_md5".
# 
# ---
# ###### Break down the third line:
# 
# ```count=$(())``` - A variable called "count", assigned the result of bash arithmetic expansion. Double parentheses are required for bash arithmetic.
# 
# 
# ```${#original_md5[@]} - 1``` - This returns the number of elements (#) in the array called "original_md5" and subtracts 1 from that number. Subtracting one is necessary because bash arrays are zero-indexed (i.e. the array starts at index 0).
# 
# Summary - The length of the array minus one is saved to the variable called "count".
# 
# ---
# ###### Break down the for loop:
# 
# ```((i=0;i<=$count;++i))``` - Sets variable "i" to 0. Then, the loop evaluates whether or not the value of "i" is less than/equal to the value in the variable "count". If that condition is met, the loop body runs; after each pass, the value stored in "i" is increased by 1 and the condition is checked again.
# 
# ```printf "%s\n" "${original_md5[$i]}"``` - Prints the value at the array index designated by the value currently stored in "i" (the printing is specified by the "%s", which means string). This is followed by printing a new line (\n).
# 
# Summary - This prints the value at each position within the array and uses printf to improve legibility of output.

# ### Compress the original UW PacBio folder

# In[14]:

cd ..


# In[15]:

get_ipython().run_cell_magic('bash', '', 'time tar -zcf 20170323_pacbio.tar.gz /Volumes/owl/nightingales/O_lurida/20170323_pacbio/\n')


# Well... That's a bummer. I think I actually need to run this again, as I'm not entirely sure how to tell if the tarball completed properly.
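# A possible way to check whether a gzipped tarball completed properly (a sketch, not something run in this notebook): test the gzip stream and list the archive contents. A non-zero exit status from either command indicates a truncated or corrupt archive.
# 
# ```bash
# gzip -t 20170323_pacbio.tar.gz && echo "gzip stream OK"
# tar -tzf 20170323_pacbio.tar.gz > /dev/null && echo "tar listing OK"
# ```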
# In[16]:

pwd


# In[17]:

get_ipython().run_cell_magic('bash', '', 'pwd\n')


# In[18]:

cd /Volumes/


# In[19]:

cd ~/Downloads/


# Ok... It seems like the notebook is just totally jacked up. Guess I'll restart it. Ugh.

# In[1]:

get_ipython().run_cell_magic('bash', '', 'date\n')


# In[2]:

cd /Volumes/owl/nightingales/O_lurida/


# In[3]:

ls -lr 20170323_pacbio*


# In[4]:

ls -lh 20170323*


# Well, I still have no way of knowing whether or not that tarball is legit. Guess I'll delete it and re-run the compression command. Sigh...

# In[5]:

get_ipython().run_cell_magic('bash', '', 'time rm 20170323_pacbio.tar.gz\n')


# In[6]:

ls -lh 20170323*


# In[7]:

get_ipython().run_cell_magic('bash', '', 'time tar -zcf 20170323_pacbio.tar.gz /Volumes/owl/nightingales/O_lurida/20170323_pacbio/\n')


# Dude! Lost the connection to Owl. AGAIN! What is going on??!! I vaguely remember Sean mentioning some data transfer issues involving Owl and Hyak (mox). I wonder if this is related. Will discuss with Steven to see if he has experienced any weird network issues this week.
# 
# Will have to consider another approach to generating the tarball (likely ssh in as admin and run ```tar``` directly on Owl - not over the network).
# 
# In the meantime, I'm going to delete this incomplete tarball.
# 
# I'll remount Owl outside of this notebook and then proceed.

# In[1]:

$$bash
date


# In[2]:

get_ipython().run_cell_magic('bash', '', 'date\n')


# In[3]:

ls -lh /Volumes/web/nightingales/O_lurida/20170323_pacbio.tar.gz


# In[4]:

get_ipython().run_cell_magic('bash', '', 'time rm /Volumes/web/nightingales/O_lurida/20170323_pacbio.tar.gz\n')


# In[5]:

ls -lh /Volumes/web/nightingales/O_lurida/20170323_pacbio.tar.gz


# Well, I'm just going to try to run this again. Third time's the charm?

# In[6]:

cd /Volumes/web/nightingales/O_lurida/


# In[7]:

get_ipython().run_cell_magic('bash', '', 'time tar -zcf 20170323_pacbio.tar.gz /Volumes/owl/nightingales/O_lurida/20170323_pacbio/\n')


# In[8]:

get_ipython().run_cell_magic('bash', '', 'time tar -zcf 20170323_pacbio.tar.gz /Volumes/web/nightingales/O_lurida/20170323_pacbio/\n')


# OK, this is annoying. The problem is caused by the connection to Owl getting lost/unmounted (see below). This is an issue that Sean had previously experienced as well. He had some conversations with the people who run Hyak (mox); they did some quick tests and experienced a similar problem moving data to/from Owl. According to Sean, the Hyak (mox) people and/or UW IT indicated that there's a wiring issue in FTR that's causing this problem, and that it's not a problem related to Owl.

# In[9]:

ls /Volumes/web/nightingales/O_lurida


# I've reconnected to Owl (via Finder) in order to remove the incomplete tarball.

# In[1]:

get_ipython().run_cell_magic('bash', '', 'date\n')


# In[2]:

ls /Volumes/web/nightingales/O_lurida/20170323_pacbio.tar.gz


# In[ ]:
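# For reference, a rough sketch of the alternative approach mentioned above: ssh to Owl and run ```tar``` locally on the server so the data never crosses the network mount. The hostname and server-side path below are hypothetical placeholders, not the actual values.
# 
# ```bash
# # hostname and path are placeholders; substitute the real Owl login and directory
# ssh admin@owl.example.org \
#   'cd /path/on/owl/nightingales/O_lurida && time tar -czf 20170323_pacbio.tar.gz 20170323_pacbio/'
# ```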