Code explanation (by line)
!blastn -task blastn \
-query Owl/halfshell/EmmaBS400.fa \
-db nt \
-outfmt "6 sscinames" \
-max_target_seqs 1 \
-num_threads 16 \
-out 20150501_nt_blastn.tab \
2> 20150501_nt_blastn.err
^C
Interrupted kernel because I glanced at the output file and saw this (after running for ~6hrs!):
!head -30 20150501_nt_blastn.tab
Methanohalophilus mahii DSM 5219 uncultured organism uncultured organism Staphylococcus saprophyticus subsp. saprophyticus ATCC 15305 Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Mus musculus Brassica rapa subsp. pekinensis Vanderwaltozyma polyspora DSM 70294 Leishmania braziliensis MHOM/BR/75/M2904 Dictyostelium discoideum AX4 Dictyostelium discoideum AX4 Homo sapiens Maribacter sp. HTCC2170 Dictyostelium discoideum AX4 Mus musculus Dictyostelium discoideum AX4 Dictyostelium fasciculatum Chrysemys picta bellii Rattus norvegicus Rattus norvegicus Rattus norvegicus Rattus norvegicus Rattus norvegicus Beta vulgaris Beta vulgaris Beta vulgaris
The issue is that the output only contains the species info and not the rest of the formatting (e-vals, bit scores, etc) that are part of BLASTn format 6.
It turns out that specifying format 6 just specifies tab-delimited output. If an "additional" feature is specified from the default output of format 6, then the defaults no longer apply; you have to specify them all.
!blastn -task blastn \
-query Owl/halfshell/EmmaBS400.fa \
-db nt \
-outfmt "6 qseqid sseqid pident length evalue stitle sscinames" \
-max_target_seqs 1 \
-num_threads 16 \
-out 20150501_nt_blastn.tab \
2> 20150501_nt_blastn.err
!head -10 20150501_nt_blastn.tab
20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_1 gi|292665689|gb|CP001994.1| 86.00 50 1e-04 Methanohalophilus mahii DSM 5219, complete genome Methanohalophilus mahii DSM 5219 20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_2 gi|534503879|gb|KF524813.1| 99.70 4658 0.0 Uncultured organism clone 89_8 Microvirus J protein, Capsid proteins, Major spike protein (G protein), Microvirus H protein, Bacteriophage replication gene A protein, and Phage protein C genes, complete cds; and Bacteriophage scaffolding protein D gene, partial cds uncultured organism 20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_2 gi|534503879|gb|KF524813.1| 100.00 802 0.0 Uncultured organism clone 89_8 Microvirus J protein, Capsid proteins, Major spike protein (G protein), Microvirus H protein, Bacteriophage replication gene A protein, and Phage protein C genes, complete cds; and Bacteriophage scaffolding protein D gene, partial cds uncultured organism 20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_3 gi|72493824|dbj|AP008934.1| 81.13 53 0.009 Staphylococcus saprophyticus subsp. saprophyticus ATCC 15305 DNA, complete genome Staphylococcus saprophyticus subsp. saprophyticus ATCC 15305 20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_4 gi|28475607|gb|AC117201.2| 70.56 180 2e-04 Mus musculus BAC clone RP23-216E6 from 16, complete sequence Mus musculus 20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_4 gi|28475607|gb|AC117201.2| 70.81 161 0.033 Mus musculus BAC clone RP23-216E6 from 16, complete sequence Mus musculus 20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_4 gi|28475607|gb|AC117201.2| 70.25 158 0.11 Mus musculus BAC clone RP23-216E6 from 16, complete sequence Mus musculus 20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_4 gi|28475607|gb|AC117201.2| 69.44 180 0.11 Mus musculus BAC clone RP23-216E6 from 16, complete sequence Mus musculus 20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_4 gi|28475607|gb|AC117201.2| 68.78 189 0.40 Mus musculus BAC clone RP23-216E6 from 16, complete sequence Mus musculus 20150414_trimmed_2212_lane2_400ppm_GCCAAT_contig_4 gi|28475607|gb|AC117201.2| 69.70 165 1.4 Mus musculus BAC clone RP23-216E6 from 16, complete sequence Mus musculus
!wc -l 20150501_nt_blastn.tab
67326 20150501_nt_blastn.tab
The code below cuts columns (fields; -f) 5 and 7 (e-value and species) of the input file, sorts and displays first 100 entries (head -100)
!cut -f5,7 20150501_nt_blastn.tab | sort | head -100
0.0 Alteromonas macleodii str. 'Ionian Sea U4' 0.0 Alteromonas macleodii str. 'Ionian Sea U4' 0.0 Canis lupus familiaris 0.0 Comamonas sp. 7D-2 0.0 Comamonas sp. 7D-2 0.0 Comamonas sp. 7D-2 0.0 Comamonas sp. 7D-2 0.0 Crassostrea angulata 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Crassostrea gigas 0.0 Cycloclasticus zancles 7-ME 0.0 Cycloclasticus zancles 7-ME 0.0 Cycloclasticus zancles 7-ME 0.0 Dinoroseobacter shibae DFL 12 0.0 Dinoroseobacter shibae DFL 12 0.0 Dinoroseobacter shibae DFL 12 0.0 Dinoroseobacter shibae DFL 12 0.0 Dinoroseobacter shibae DFL 12 0.0 Dinoroseobacter shibae DFL 12 0.0 Dinoroseobacter shibae DFL 12 0.0 Dinoroseobacter shibae DFL 12 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens 0.0 Homo sapiens sort: write failed: standard output: Broken pipe sort: write error
The code uses awk to obtain all lines in the file that have e-values less than or equal to 0.001 ($5<=0.001
). The e-values are found in column (i.e. field) 5 of the input file ($5).
%%bash
awk ' $5<=0.001' 20150501_nt_blastn.tab > 20150501_Cgigas_larvae_OA_blastn_evals_0.001.txt
!wc -l 20150501_Cgigas_larvae_OA_blastn_evals_0.001.txt
40700 20150501_Cgigas_larvae_OA_blastn_evals_0.001.txt
The code stores the number of lines in the Python variable "total" from the output of the bash command (designated by the "!").
total = !wc -l < 20150501_Cgigas_larvae_OA_blastn_evals_0.001.txt
Notice that the number of lines is stored as a string list. This is denoted by the brackets and single quotes in the output.
print total
[' 40700']
This enables the number to be more easily used for subsequent calculations.
total = int(total[0])
This is confirmed by the lack of brackets around the output.
print total
40700
The code uses awk again to extract all lines with e-values <=0.001. Those lines are then cut on column (i.e. field; -f) 7, sorted (which is necessary for the subsequent "uniq" command), and the uniq entries are counted (uniq -c
).
%%bash
awk ' $5<=0.001' 20150501_nt_blastn.tab | cut -f7 | sort | uniq -c > 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt
!head 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt
2 'Nostoc azollae' 0708 1 Acanthamoeba castellanii 13 Acanthamoeba castellanii mamavirus 89 Acanthamoeba polyphaga mimivirus 1 Acanthocheilonema viteae 2 Acanthocystis turfacea Chlorella virus TN603.4.2 1 Acanthopagrus schlegelii 4 Acaryochloris marina MBIC11017 1 Acaryochloris sp. HICR111A 2 Acetobacter pasteurianus 386B
Although difficult to notice, the previous command where the number of unique entities are counted results in an output with a bunch of spaces at the beginning of each line that I don't want.
The command below utilizes sed. Sed creates a backup of the input file (-i.bu
; ".bu" is a custom suffix for the backup file - you can enter anything you'd like) and then removes (substitutes, thus the s/
) all spaces (*/
) found only at the beginning (^
) of each line.
!sed -i.bu 's/^ *//g' 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt
Note how the output is shifted to the left compared to how it was above.
!head -30 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt
2 'Nostoc azollae' 0708 1 Acanthamoeba castellanii 13 Acanthamoeba castellanii mamavirus 89 Acanthamoeba polyphaga mimivirus 1 Acanthocheilonema viteae 2 Acanthocystis turfacea Chlorella virus TN603.4.2 1 Acanthopagrus schlegelii 4 Acaryochloris marina MBIC11017 1 Acaryochloris sp. HICR111A 2 Acetobacter pasteurianus 386B 3 Acetobacterium woodii DSM 1030 8 Acetohalobium arabaticum DSM 5501 2 Achromobacter denitrificans 1 Acidianus hospitalis W1 1 Acidiphilium cryptum JF-5 1 Acidovorax sp. KKS102 1 Aciduliprofundum boonei T469 2 Acinetobacter baumannii 2 Acinetobacter baumannii AYE 1 Acinetobacter baumannii BJAB07104 4 Acinetobacter baumannii BJAB0715 2 Acinetobacter baumannii BJAB0868 1 Acinetobacter baumannii SDF 1 Acinetobacter calcoaceticus 11 Acinetobacter calcoaceticus PHEA-2 4 Acinetobacter oleivorans DR1 8 Acinetobacter sp. ADP1 2 Acinetobacter sp. C42 1 Acinetobacter sp. M-1 1 Acontias meleagris
!wc -l 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt
1608 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt
!cp 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt /Volumes/Eagle/Arabidopsis/
The code sorts the file in reverse (-r
) numerical (-n
) order on the first column (-k1
) and displays the first 15 lines (head -15
).
!sort -rn -k1 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt | head -15
9737 Dictyostelium discoideum AX4 8248 Danio rerio 6047 Homo sapiens 2739 Mus musculus 956 Rattus norvegicus 411 Solanum lycopersicum 395 Volvox carteri f. nagariensis 377 Dictyostelium fasciculatum 369 Hucho taimen 368 Dictyostelium purpureum 365 Schistosoma mansoni 360 Botryotinia fuckeliana 312 Lodderomyces elongisporus NRRL YB-4239 265 Crassostrea gigas 248 Octadecabacter antarcticus 307
The code uses bash grep to find any lines with "Crassostrea gigas" on them and then uses awk to print the first column (i.e. field; $1) which contains the count.
gigas_count = !grep 'Crassostrea gigas' 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt | awk '{ print $1 }'
print gigas_count
['265']
Conversion to float is necessary based on the desired output we want later. The output I want later will be a value less than 1. Integers can only display whole numbers.
gigas_count = float(gigas_count[0])
print gigas_count
265.0
print (gigas_count/total) * 100
0.651105651106