See this GitHub Issue.
This notebook uses the NCBI BLAST and DIAMOND BLAST annotations generated by our GenSAS P. generosa genome annotation.
It compares the two sets of SwissProt ID annotations (SPIDs), identifies the entry with the lowest E-value for each gene, and uses that entry as the gene's representative. It then uses that canonical list of SPIDs to pull gene names and gene ontology (GO) IDs from UniProt and creates a tab-delimited annotation mapping file.
%%bash
echo "TODAY'S DATE"
date
echo "------------"
echo ""
lsb_release -a
echo ""
echo "------------"
echo "HOSTNAME: "
hostname
echo ""
echo "------------"
echo "Computer Specs:"
echo ""
lscpu
echo ""
echo "------------"
echo ""
echo "Memory Specs"
echo ""
free -mh
TODAY'S DATE Wed 20 Apr 2022 06:24:50 AM PDT ------------ Distributor ID: Ubuntu Description: Ubuntu 20.04.4 LTS Release: 20.04 Codename: focal ------------ HOSTNAME: computer ------------ Computer Specs: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 45 bits physical, 48 bits virtual CPU(s): 2 On-line CPU(s) list: 0,1 Thread(s) per core: 1 Core(s) per socket: 1 Socket(s): 2 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 165 Model name: Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz Stepping: 2 CPU MHz: 2400.006 BogoMIPS: 4800.01 Hypervisor vendor: VMware Virtualization type: full L1d cache: 64 KiB L1i cache: 64 KiB L2 cache: 512 KiB L3 cache: 32 MiB NUMA node0 CPU(s): 0,1 Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported Vulnerability L1tf: Mitigation; PTE Inversion Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown Vulnerability Meltdown: Mitigation; PTI Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat flush_l1d arch_capabilities ------------ Memory Specs total used free shared buff/cache available Mem: 54Gi 2.0Gi 
50Gi 38Mi 2.6Gi 52Gi Swap: 2.0Gi 0B 2.0Gi
No LSB modules are available.
%env indicates a bash variable; a line without %env sets a Python variable.
######################################################################
### Set directories
%env data_dir=/home/sam/data/P_generosa/genomes
%env analysis_dir=/home/sam/analyses/20220419-pgen-gene_annotation_mapping
analysis_dir="/home/sam/analyses/20220419-pgen-gene_annotation_mapping"
#####################################################################
### Input files
%env base_url=https://gannet.fish.washington.edu/Atumefaciens/20190928_Pgenerosa_v074.a4_gensas_annotation
%env blast_annotations=Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab
%env diamond_annotations=Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab
######################################################################
### Output files
# UniProt batch output
%env uniprot_output=20220419-pgen-uniprot_batch-results.txt
# Gene name list for UniProt batch submission
%env spid_list=Panopea-generosa-v1.0.a4-blast-diamond-functional-SPIDs.txt
# Genome IDs and SPIDs
%env genome_IDs_SPIDs=Panopea-generosa-v1.0.a4-blast-diamond-functional-genome_IDs-SPIDs.txt
%env blast_diamond_cat=Panopea-generosa-v1.0.a4-blast-diamond-functional.tab
%env blast_diamond_cat_best=Panopea-generosa-v1.0.a4-blast-diamond-functional_best.tab
# Parsed UniProt
%env parsed_uniprot=20220419-pgen-accession-gene_name-gene_description-go_ids.tab
# Final output
%env joined_output=20220419-pgen-gene-accessions-gene_id-gene_name-gene_description-alt_gene_description-go_ids.tab
######################################################################
### Programs
# UniProt batch submission/retrieval script
%env uniprot_mapping_script=/home/sam/programs/uniprot_mapping.pl
env: data_dir=/home/sam/data/P_generosa/genomes
env: analysis_dir=/home/sam/analyses/20220419-pgen-gene_annotation_mapping
env: base_url=https://gannet.fish.washington.edu/Atumefaciens/20190928_Pgenerosa_v074.a4_gensas_annotation
env: blast_annotations=Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab
env: diamond_annotations=Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab
env: uniprot_output=20220419-pgen-uniprot_batch-results.txt
env: spid_list=Panopea-generosa-v1.0.a4-blast-diamond-functional-SPIDs.txt
env: genome_IDs_SPIDs=Panopea-generosa-v1.0.a4-blast-diamond-functional-genome_IDs-SPIDs.txt
env: blast_diamond_cat=Panopea-generosa-v1.0.a4-blast-diamond-functional.tab
env: blast_diamond_cat_best=Panopea-generosa-v1.0.a4-blast-diamond-functional_best.tab
env: parsed_uniprot=20220419-pgen-accession-gene_name-gene_description-go_ids.tab
env: joined_output=20220419-pgen-gene-accessions-gene_id-gene_name-gene_description-alt_gene_description-go_ids.tab
env: uniprot_mapping_script=/home/sam/programs/uniprot_mapping.pl
%%bash
# If directories don't exist, make them
mkdir --parents "${data_dir}" "${analysis_dir}"
--quiet: Prevents wget output from overwhelming Jupyter Notebook.
--continue: If a download was previously initiated, continues where it left off and does not create a second file if one already exists.
%%bash
cd "${data_dir}"
wget --quiet --continue ${base_url}/${blast_annotations}
wget --quiet --continue ${base_url}/${diamond_annotations}
ls -ltrh
echo ""
echo "---------------------------------------------------------"
echo ""
head -n 25 *.tab
total 2.6G -rw-rw-r-- 1 sam sam 1.5M Oct 3 2019 Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab -rw-rw-r-- 1 sam sam 1.3M Oct 3 2019 Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab -rwxr-xr-x 1 sam sam 914M Nov 5 2019 Panopea-generosa-v1.0.fasta -rw-rw-r-- 1 sam sam 454M Mar 19 07:58 Panopea-generosa-v1.0.a4.gff3 -rw-r--r-- 1 sam sam 503M Mar 22 06:48 Panopea-generosa-v1.0.a4_biotype.gff -rw-r--r-- 1 sam sam 4.8M Mar 24 07:30 Panopea-generosa-v1.0.a4_biotype-trna_strand_converted-no_RNAmmer.bed -rw-rw-r-- 1 sam sam 658 Mar 25 06:11 Panopea-generosa-v1.0.fa.fai -rw-rw-r-- 1 sam sam 9.7M Mar 30 11:03 Panopea-generosa-v1.0.a4_biotype.gtf -rw-rw-r-- 1 sam sam 507M Mar 30 11:43 Panopea-generosa-v1.0.a4_biotype.bed -rw-rw-r-- 1 sam sam 378 Mar 30 13:20 Panopea-generosa-v1.0.fa.lengths -rw-rw-r-- 1 sam sam 9.7M Mar 30 13:34 Panopea-generosa-v1.0.a4_biotype.sorted.gtf -rw-rw-r-- 1 sam sam 996K Mar 30 13:34 Panopea-generosa-v1.0.a4_biotype_non-coding.bed drwxrwxr-x 2 sam sam 4.0K Mar 30 13:46 feelnc_codpot_out -rw-rw-r-- 1 sam sam 70M Mar 31 06:00 Panopea-generosa-v1.0.a4.gtf -rw-rw-r-- 1 sam sam 101K Apr 6 12:22 spids.txt -rw-rw-r-- 1 sam sam 138M Apr 6 12:23 uniprot_mapping-all.txt -rw-rw-r-- 1 sam sam 282K Apr 7 07:13 uniprot_mapping-all-AC_only.txt -rw-rw-r-- 1 sam sam 78 Apr 7 13:57 File1.txt -rw-rw-r-- 1 sam sam 357 Apr 7 13:57 File2.txt --------------------------------------------------------- ==> Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab <== # # Output is generated by GenSAS 7.x-5.0 # #name : mRNA #start : Start of alignment in subject #end : End of alignment in subject #m_start : Start of alignment in query #m_end : End of alignment in query #al : Alignment length #score : Row score of the match #evalue : E value of the match #identity : Percentage of identical matches mame start end score Accession Match ID m_start m_end E-value identity al 21910-PGEN_.00g000010.m01 121 229 165 Q86IC9 sp|Q86IC9|CAMT1_DICDI 11 122 
8.93e-14 35.652 115 21910-PGEN_.00g000020.m01 147 467 968 P04177 sp|P04177|TY3H_RAT 20 339 3.47e-127 55.140 321 21910-PGEN_.00g000050.m01 566 722 182 Q8L840 sp|Q8L840|RQL4A_ARATH 2 167 2.67e-14 35.119 168 21910-PGEN_.00g000080.m01 268 322 152 A1E2V0 sp|A1E2V0|BIRC3_CANLF 163 220 3.91e-10 53.448 58 21910-PGEN_.00g000090.m01 199 327 161 P34456 sp|P34456|YMD2_CAEEL 7 134 7.52e-12 26.357 129 21910-PGEN_.00g000210.m01 18 200 263 O00463 sp|O00463|TRAF5_HUMAN 5 191 2.24e-25 34.921 189 21910-PGEN_.00g000230.m01 48 155 287 Q00945 sp|Q00945|CONO_LYMST 31 134 1.59e-32 50.000 108 21910-PGEN_.00g000240.m01 4 605 1091 Q5SWK7 sp|Q5SWK7|RN145_MOUSE 13 601 2.65e-139 39.607 611 21910-PGEN_.00g000280.m01 4 153 210 Q8ZXT3 sp|Q8ZXT3|Y1111_PYRAE 853 1012 1.10e-17 38.750 160 21910-PGEN_.00g000300.m01 159 347 480 Q5REG4 sp|Q5REG4|DTX3_PONAB 1135 1320 1.20e-51 50.794 189 21910-PGEN_.00g000300.m02 159 347 480 Q5REG4 sp|Q5REG4|DTX3_PONAB 1138 1323 1.18e-51 50.794 189 21910-PGEN_.00g000380.m01 381 508 205 Q8QG60 sp|Q8QG60|CRY2_CHICK 2 145 4.92e-18 36.111 144 ==> Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab <== # # Output is generated by GenSAS 7.x-5.0 # #name : mRNA #start : Start of alignment in subject #end : End of alignment in subject #m_start : Start of alignment in query #m_end : End of alignment in query #al : Alignment length #score : Row score of the match #evalue : E value of the match #identity : Percentage of identical matches mame start end score Accession Match ID m_start m_end E-value identity al 21910-PGEN_.00g000020.m01 147 467 945 P04177 sp|P04177|TY3H_RAT 20 339 7.9e-101 55.1 321 21910-PGEN_.00g000050.m01 566 722 180 Q8L840 sp|Q8L840|RQL4A_ARATH 2 167 2.4e-12 35.1 168 21910-PGEN_.00g000060.m01 1957 2106 129 Q61043 sp|Q61043|NIN_MOUSE 31 184 1.7e-06 26.0 154 21910-PGEN_.00g000080.m01 233 304 134 Q24307 sp|Q24307|DIAP2_DROME 174 255 6.2e-07 34.1 82 21910-PGEN_.00g000120.m01 6 49 118 P34457 sp|P34457|YMD3_CAEEL 90 133 3.2e-05 47.7 44 
21910-PGEN_.00g000230.m01 49 155 216 Q00945 sp|Q00945|CONO_LYMST 32 134 9.9e-17 49.5 107 21910-PGEN_.00g000240.m01 4 585 1144 Q5SWK7 sp|Q5SWK7|RN145_MOUSE 13 592 1.2e-123 40.1 591 21910-PGEN_.00g000280.m01 433 592 230 Q9WYX8 sp|Q9WYX8|Y508_THEMA 863 1022 2.2e-17 40.2 164 21910-PGEN_.00g000300.m01 161 347 474 Q80V91 sp|Q80V91|DTX3_MOUSE 1137 1320 1.2e-45 51.3 187 21910-PGEN_.00g000300.m02 161 347 474 Q80V91 sp|Q80V91|DTX3_MOUSE 1140 1323 1.2e-45 51.3 187 21910-PGEN_.00g000380.m01 381 508 201 Q8QG60 sp|Q8QG60|CRY2_CHICK 2 145 7.0e-15 35.4 144 21910-PGEN_.00g000440.m01 234 1796 1606 Q9H583 sp|Q9H583|HEAT1_HUMAN 1 1575 8.0e-177 30.0 1624
Count comment lines (beginning with #) in each annotation file.
%%bash
cd "${data_dir}"
grep -c "^#" "${blast_annotations}" "${diamond_annotations}"
Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab:12
Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab:12
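Each file reports 12 comment lines, and the column-header row follows them, which is why 13 lines are skipped later on. The same check can be sketched in Python. This is only an illustration on a made-up five-line sample, not the real GenSAS files, and `count_comment_lines` is a hypothetical helper name. It assumes all comment lines sit at the top of the file.

```python
def count_comment_lines(lines):
    """Count lines that start with '#', like grep -c '^#'."""
    return sum(1 for line in lines if line.startswith("#"))


# Made-up sample: three comment lines, one column-header row, one data row.
sample = [
    "#",
    "# Output is generated by GenSAS 7.x-5.0",
    "#name : mRNA",
    "mame\tstart\tend",
    "21910-PGEN_.00g000010.m01\t121\t229",
]

n_comments = count_comment_lines(sample)
# Skip the comment lines plus the single column-header row.
data_rows = sample[n_comments + 1:]
print(n_comments, len(data_rows))
```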
Concatenates both annotation files and sorts for best e-value.
Also modifies mRNA names to generate gene names instead.
awk 'NR > 13': Skips the first 13 lines (12 comment lines plus the column-header row).
sort -k1,1 -k9,9: Sorts on first field (mRNA name), then on 9th field (e-value). Without a numeric option (e.g. -k9,9g), sort compares the e-value field as text.
sed 's/^21910-//': Removes leading info from each mRNA name, at the beginning of each line (^).
sed 's/.m0[0-9]//': Removes .m0N from each mRNA name.
awk '!array[$1]++': awk array that only prints a line if it's the first occurrence of the gene name (first field; $1), i.e. no duplicates.
Also replaces two obsolete SPIDs, as identified in a previous notebook run.
%%bash
cd "${data_dir}"
# Concatenate both annotation files
for file in ${blast_annotations} ${diamond_annotations}
do
awk 'NR > 13' ${file}
done \
>> "${analysis_dir}"/"${blast_diamond_cat}"
# Sort for best e-value and perform formatting of genome IDs
sort -k1,1 -k9,9 "${analysis_dir}"/"${blast_diamond_cat}" \
| sed 's/^21910-//' \
| sed 's/.m0[0-9]//' \
| awk '!array[$1]++' \
>> "${analysis_dir}"/"${blast_diamond_cat_best}"
# Find/replace two obsolete SPIDs
sed -i 's/Q6ZRR9/M0R2J8/g' "${analysis_dir}"/"${blast_diamond_cat_best}"
sed -i 's/Q9NPA5/Q9NTW7/g' "${analysis_dir}"/"${blast_diamond_cat_best}"
echo ""
echo "Line count:"
wc -l "${analysis_dir}"/"${blast_diamond_cat}"
echo "--------------------------------------------------"
echo ""
echo "Line count:"
wc -l "${analysis_dir}"/"${blast_diamond_cat_best}"
echo "--------------------------------------------------"
echo ""
head -n 25 "${analysis_dir}"/"${blast_diamond_cat_best}"
Line count: 31216 /home/sam/analyses/20220419-pgen-gene_annotation_mapping/Panopea-generosa-v1.0.a4-blast-diamond-functional.tab -------------------------------------------------- Line count: 14676 /home/sam/analyses/20220419-pgen-gene_annotation_mapping/Panopea-generosa-v1.0.a4-blast-diamond-functional_best.tab -------------------------------------------------- PGEN_.00g000010 121 229 165 Q86IC9 sp|Q86IC9|CAMT1_DICDI 11 122 8.93e-14 35.652 115 PGEN_.00g000020 147 467 968 P04177 sp|P04177|TY3H_RAT 20 339 3.47e-127 55.140 321 PGEN_.00g000050 566 722 180 Q8L840 sp|Q8L840|RQL4A_ARATH 2 167 2.4e-12 35.1 168 PGEN_.00g000060 1957 2106 129 Q61043 sp|Q61043|NIN_MOUSE 31 184 1.7e-06 26.0 154 PGEN_.00g000080 268 322 152 A1E2V0 sp|A1E2V0|BIRC3_CANLF 163 220 3.91e-10 53.448 58 PGEN_.00g000090 199 327 161 P34456 sp|P34456|YMD2_CAEEL 7 134 7.52e-12 26.357 129 PGEN_.00g000120 6 49 118 P34457 sp|P34457|YMD3_CAEEL 90 133 3.2e-05 47.7 44 PGEN_.00g000210 18 200 263 O00463 sp|O00463|TRAF5_HUMAN 5 191 2.24e-25 34.921 189 PGEN_.00g000230 48 155 287 Q00945 sp|Q00945|CONO_LYMST 31 134 1.59e-32 50.000 108 PGEN_.00g000240 4 585 1144 Q5SWK7 sp|Q5SWK7|RN145_MOUSE 13 592 1.2e-123 40.1 591 PGEN_.00g000280 4 153 210 Q8ZXT3 sp|Q8ZXT3|Y1111_PYRAE 853 1012 1.10e-17 38.750 160 PGEN_.00g000300 159 347 480 Q5REG4 sp|Q5REG4|DTX3_PONAB 1135 1320 1.20e-51 50.794 189 PGEN_.00g000380 381 508 205 Q8QG60 sp|Q8QG60|CRY2_CHICK 2 145 4.92e-18 36.111 144 PGEN_.00g000440 792 1796 1362 Q9H583 sp|Q9H583|HEAT1_HUMAN 539 1575 7.24e-156 34.692 1055 PGEN_.00g000450 347 753 1038 A0JMR6 sp|A0JMR6|MYSM1_XENLA 138 535 3.16e-130 48.426 413 PGEN_.00g000460 478 1045 859 O88917 sp|O88917|AGRL1_RAT 150 740 2.17e-95 33.443 610 PGEN_.00g000490 189 393 152 Q7D513 sp|Q7D513|EGTB_MYCTO 307 565 1.7e-08 24.8 266 PGEN_.00g000520 118 339 495 Q92968 sp|Q92968|PEX13_HUMAN 134 356 1.34e-56 49.565 230 PGEN_.00g000530 1 191 593 A6H769 sp|A6H769|RS7_BOVIN 1 160 2.3e-60 68.1 191 PGEN_.00g000540 1876 2081 565 Q14676 sp|Q14676|MDC1_HUMAN 1857 
2060 1.19e-56 46.117 206 PGEN_.00g000560 17 114 135 Q54W11 sp|Q54W11|MCFL_DICDI 21 113 3.1e-07 32.7 98 PGEN_.00g000600 49 474 1301 Q9FKK7 sp|Q9FKK7|XYLA_ARATH 215 641 1.35e-173 55.245 429 PGEN_.00g000660 5 308 899 Q3ZCD7 sp|Q3ZCD7|TECR_BOVIN 2 303 1.4e-95 55.7 305 PGEN_.00g000670 60 1237 3493 Q9D180 sp|Q9D180|CFA57_MOUSE 1 1143 0.0 59.102 1181 PGEN_.00g000680 17 206 520 Q9D6Z0 sp|Q9D6Z0|ALKB7_MOUSE 89 278 1.11e-64 48.947 190
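E-values are written in scientific notation, so a text comparison and a numeric comparison can disagree: as text, `2.4e-12` sorts before `2.67e-14`, although the latter is the smaller number. Below is a minimal Python sketch of choosing the numerically lowest E-value per gene. It is not the notebook's own method. It assumes whitespace-separated rows with the E-value in the 9th column, and `best_hit_per_gene` is a hypothetical helper. The two sample rows are the BLAST and DIAMOND hits for the same mRNA shown in the file previews above.

```python
import re


def best_hit_per_gene(rows):
    """Keep, for each gene, the row with the numerically lowest E-value.

    rows: iterable of field lists; field 0 is the mRNA name and
    field 8 is the E-value (9th column of the GenSAS tables).
    """
    best = {}
    for fields in rows:
        # Strip the leading "21910-" and the trailing ".m0N" isoform suffix.
        gene = re.sub(r"\.m0[0-9]$", "", re.sub(r"^21910-", "", fields[0]))
        evalue = float(fields[8])
        if gene not in best or evalue < float(best[gene][8]):
            best[gene] = [gene] + list(fields[1:])
    return best


rows = [
    # DIAMOND hit
    "21910-PGEN_.00g000050.m01 566 722 180 Q8L840 sp|Q8L840|RQL4A_ARATH 2 167 2.4e-12 35.1 168".split(),
    # BLAST hit
    "21910-PGEN_.00g000050.m01 566 722 182 Q8L840 sp|Q8L840|RQL4A_ARATH 2 167 2.67e-14 35.119 168".split(),
]

best = best_hit_per_gene(rows)
print(best["PGEN_.00g000050"][8])
```

Converting the field with `float()` before comparing is the design choice here; it also handles values such as `0.0`.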
%%bash
cd "${analysis_dir}"
awk '{print $5"\t"$1}' "${blast_diamond_cat_best}" > "$genome_IDs_SPIDs"
echo ""
echo "Line count:"
wc -l "$genome_IDs_SPIDs"
echo "--------------------------------------------------"
head "$genome_IDs_SPIDs"
Line count:
14676 Panopea-generosa-v1.0.a4-blast-diamond-functional-genome_IDs-SPIDs.txt
--------------------------------------------------
Q86IC9 PGEN_.00g000010
P04177 PGEN_.00g000020
Q8L840 PGEN_.00g000050
Q61043 PGEN_.00g000060
A1E2V0 PGEN_.00g000080
P34456 PGEN_.00g000090
P34457 PGEN_.00g000120
O00463 PGEN_.00g000210
Q00945 PGEN_.00g000230
Q5SWK7 PGEN_.00g000240
%%bash
cd "${analysis_dir}"
awk '{print $5}' "${blast_diamond_cat_best}" > "${spid_list}"
echo ""
echo "Line count:"
wc -l "${spid_list}"
echo "--------------------------------------------------"
echo ""
head "${spid_list}"
Line count:
14676 Panopea-generosa-v1.0.a4-blast-diamond-functional-SPIDs.txt
--------------------------------------------------

Q86IC9
P04177
Q8L840
Q61043
A1E2V0
P34456
P34457
O00463
Q00945
Q5SWK7
Perl script obtained from UniProt: https://www.uniprot.org/help/api_batch_retrieval
Modified to accept file with list of IDs and to map SPID to UniProt Accession
%%bash
# Print script for viewing
cat "${uniprot_mapping_script}"
use strict;
use warnings;
use LWP::UserAgent;

my $list = $ARGV[0]; # File containing list of UniProt identifiers.

my $base = 'https://www.uniprot.org';
my $tool = 'uploadlists';

my $contact = 'samwhite@uw.edu'; # Please set a contact email address here to help us debug in case of problems (see https://www.uniprot.org/help/privacy).
my $agent = LWP::UserAgent->new(agent => "libwww-perl $contact");
push @{$agent->requests_redirectable}, 'POST';

my $response = $agent->post("$base/$tool/",
                            [ 'file' => [$list],
                              'format' => 'txt',
                              'from' => 'SWISSPROT',
                              'to' => 'ACC',
                            ],
                            'Content_Type' => 'form-data');

while (my $wait = $response->header('Retry-After')) {
  print STDERR "Waiting ($wait)...\n";
  sleep $wait;
  $response = $agent->get($response->base);
}

$response->is_success ?
  print $response->content :
  die 'Failed, got ' . $response->status_line .
    ' for ' . $response->request->uri . "\n";
%%bash
cd "${analysis_dir}"
# Run UniProt Perl mapping script and time how long it takes
time \
perl "${uniprot_mapping_script}" "${spid_list}" > "${uniprot_output}"
ls -ltrh
echo ""
echo ""
echo "--------------------------------------------------"
echo ""
echo "Line count:"
wc -l "${uniprot_output}"
echo "--------------------------------------------------"
total 142M
-rw-rw-r-- 1 sam sam 2.8M Apr 20 21:20 Panopea-generosa-v1.0.a4-blast-diamond-functional.tab
-rw-rw-r-- 1 sam sam 1.2M Apr 20 21:20 Panopea-generosa-v1.0.a4-blast-diamond-functional_best.tab
-rw-rw-r-- 1 sam sam 359K Apr 20 21:20 Panopea-generosa-v1.0.a4-blast-diamond-functional-genome_IDs-SPIDs.txt
-rw-rw-r-- 1 sam sam 101K Apr 20 21:20 Panopea-generosa-v1.0.a4-blast-diamond-functional-SPIDs.txt
-rw-rw-r-- 1 sam sam 138M Apr 20 21:21 20220419-pgen-uniprot_batch-results.txt


--------------------------------------------------

Line count:
2850419 20220419-pgen-uniprot_batch-results.txt
--------------------------------------------------
real 0m43.226s
user 0m3.070s
sys 0m3.310s
Counting accession lines (beginning with AC) should show a lower count than the number of SwissProt IDs submitted, as UniProt automatically removes duplicates upon submission.
%%bash
cd "${analysis_dir}"
head -n 30 "${uniprot_output}"
echo ""
echo "----------------------------------------------------"
echo ""
echo "Number of accessions:"
echo ""
grep -c "^AC" "${uniprot_output}"
ID   CAMT1_DICDI Reviewed; 230 AA.
AC   Q86IC9; Q552T5;
DT   05-MAY-2009, integrated into UniProtKB/Swiss-Prot.
DT   01-JUN-2003, sequence version 1.
DT   23-FEB-2022, entry version 92.
DE   RecName: Full=Probable caffeoyl-CoA O-methyltransferase 1;
DE            EC=2.1.1.104;
DE   AltName: Full=O-methyltransferase 5;
GN   Name=omt5; ORFNames=DDB_G0275499;
OS   Dictyostelium discoideum (Slime mold).
OC   Eukaryota; Amoebozoa; Evosea; Eumycetozoa; Dictyostelia; Dictyosteliales;
OC   Dictyosteliaceae; Dictyostelium.
OX   NCBI_TaxID=44689;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=AX4;
RX   PubMed=12097910; DOI=10.1038/nature00847;
RA   Gloeckner G., Eichinger L., Szafranski K., Pachebat J.A., Bankier A.T.,
RA   Dear P.H., Lehmann R., Baumgart C., Parra G., Abril J.F., Guigo R.,
RA   Kumpf K., Tunggal B., Cox E.C., Quail M.A., Platzer M., Rosenthal A.,
RA   Noegel A.A.;
RT   "Sequence and analysis of chromosome 2 of Dictyostelium discoideum.";
RL   Nature 418:79-85(2002).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=AX4;
RX   PubMed=15875012; DOI=10.1038/nature03481;
RA   Eichinger L., Pachebat J.A., Gloeckner G., Rajandream M.A., Sucgang R.,
RA   Berriman M., Song J., Olsen R., Szafranski K., Xu Q., Tunggal B.,
RA   Kummerfeld S., Madera M., Konfortov B.A., Rivero F., Bankier A.T.,

----------------------------------------------------

Number of accessions:

10634
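The relationship between submitted IDs, AC lines, and records can be sketched in Python. The helper names below are hypothetical and the flat-file sample is abbreviated for illustration. The sketch shows two things: duplicate IDs collapse to a smaller unique set, and a record whose accession list wraps onto a second AC line contributes more than one AC line, so counting AC lines is not quite the same as counting records (records end with `//`).

```python
def count_ac_lines(lines):
    """Count lines starting with 'AC', like grep -c '^AC'."""
    return sum(1 for line in lines if line.startswith("AC"))


def count_records(lines):
    """Count record terminators ('//'), one per UniProt record."""
    return sum(1 for line in lines if line.strip() == "//")


# Five submitted IDs, three of them unique.
submitted = ["Q86IC9", "P04177", "Q14676", "P04177", "Q14676"]

# Abbreviated sample: the third record has two AC lines.
flat_file = [
    "ID   CAMT1_DICDI",
    "AC   Q86IC9; Q552T5;",
    "//",
    "ID   TY3H_RAT",
    "AC   P04177;",
    "//",
    "ID   MDC1_HUMAN",
    "AC   Q14676; A2AB04; A2BF04; A2RRA8; A7YY86; B0S8A2; Q0EFC2; Q2L6H7; Q2TAZ4;",
    "AC   Q5JP55; Q5JP56; Q5ST83; Q68CQ3; Q86Z06; Q96QC2;",
    "//",
]

unique_submitted = len(set(submitted))
print(len(submitted), unique_submitted,
      count_ac_lines(flat_file), count_records(flat_file))
```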
UniProt accession
Gene name/abbreviation
Gene description
GO IDs
Checks lines beginning with DE to identify values in the 2nd field with Name in them.
Identifies unique values. This will determine how to parse properly after this.
%%bash
cd "${analysis_dir}"
grep "^DE" "${uniprot_output}" | awk '$2 ~ /Name/ { print $2 }' | sort -u
AltName:
RecName:
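The same survey of DE name types can be sketched in Python. `de_name_types` is a hypothetical helper, and the sample lines are the DE lines from the record preview above plus one non-DE line.

```python
def de_name_types(lines):
    """Return the sorted unique 2nd-field values containing 'Name' on DE lines."""
    types = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "DE" and "Name" in fields[1]:
            types.add(fields[1])
    return sorted(types)


sample = [
    "DE   RecName: Full=Probable caffeoyl-CoA O-methyltransferase 1;",
    "DE            EC=2.1.1.104;",
    "DE   AltName: Full=O-methyltransferase 5;",
    "GN   Name=omt5; ORFNames=DDB_G0275499;",
]

print(de_name_types(sample))
```

The `GN` line contains `Name=` but is ignored because only lines whose first field is `DE` are inspected.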
%%bash
cd "${analysis_dir}"
# Loop through UniProt records
time \
while read -r line
do
# Get record line descriptor
descriptor=$(echo "${line}" | awk '{print $1}')
# Capture second field for evaluation
go_line=$(echo "${line}" | awk '{print $2}')
# Append GO IDs to array
if [[ "${go_line}" == "GO;" ]]; then
go_id=$(echo "${line}" | awk '{print $3}')
go_ids_array+=("${go_id}")
elif [[ "${go_line}" == "GeneID;" ]]; then
# Uses sed to strip trailing semi-colon
gene_id=$(echo "${line}" | awk '{print $3}' | sed 's/;$//')
fi
# Get gene description
if [[ "${descriptor}" == "DE" ]] && [[ "${go_line}" == "RecName:" ]]; then
# Uses sed to strip trailing spaces at end of line, remove commas, and strip the trailing semi-colon
gene_description=$(echo "${line}" | awk -F "[={]" '{print $2}' | sed 's/[[:blank:]]*$//' | sed 's/,//g' | sed 's/;$//')
# Get alternate name
elif [[ "${descriptor}" == "DE" ]] && [[ "${go_line}" == "AltName:" ]]; then
# Uses sed to strip trailing spaces at end of line, remove commas, and strip the trailing semi-colon
alt_gene_description=$(echo "${line}" | awk -F "[={]" '{print $2}' | sed 's/[[:blank:]]*$//' | sed 's/,//g' | sed 's/;$//')
# Get gene name
elif [[ "${descriptor}" == "GN" ]] && [[ $(echo "${line}" | awk -F "=" '{print $1}') == "GN   Name" ]]; then
# Uses sed to strip trailing spaces at end of line
gene=$(echo "${line}" | awk -F 'Name=|{|;' '{print $2}' | sed 's/[[:blank:]]*$//')
# Get UniProt accession
elif [[ "${descriptor}" == "AC" ]]; then
# awk removes "AC" denotation
# sed removes all spaces
# sed removes trailing semi-colon
# Uses array to handle accessions being on multiple lines of UniProt records file
accession=$(echo "${line}" | awk '{$1="";print $0}' | sed 's/[[:space:]]*//g' | sed 's/;$//')
accessions_array+=("${accession}")
# Identify end of record ("//" terminates each UniProt record)
elif [[ "${descriptor}" == "//" ]]; then
# Prints the other fields tab-separated, then GOID1;GOID2;GOIDn
# IFS prevents spaces from being added between GO IDs
# sed removes ";" after final GO ID
(IFS=; printf "%s\t%s\t%s\t%s\t%s\t%s\n" "${accessions_array[*]}" "${gene_id}" "${gene}" "${gene_description}" "${alt_gene_description}" "${go_ids_array[*]}" | sed 's/;$//')
# Re-initialize variables
accession=""
accessions_array=()
descriptor=""
gene=""
gene_description=""
gene_id=""
go_id=""
go_ids_array=()
fi
done < "${uniprot_output}" >> "${parsed_uniprot}"
real 529m51.428s
user 427m33.240s
sys 79m42.704s
%%bash
cd "${analysis_dir}"
wc -l "${parsed_uniprot}"
echo ""
echo "------------------------------------------------------------------"
echo ""
head -n 25 "${parsed_uniprot}"
10304 20220419-pgen-accession-gene_name-gene_description-go_ids.tab ------------------------------------------------------------------ Q86IC9;Q552T5 8620183 omt5 Probable caffeoyl-CoA O-methyltransferase 1 O-methyltransferase 5 GO:0042409;GO:0046872;GO:0008757;GO:0032259 P04177 25085 Th Tyrosine 3-monooxygenase Tyrosine 3-hydroxylase GO:0030424;GO:0005737;GO:0009898;GO:0031410;GO:0030659;GO:0005829;GO:0030425;GO:0033162;GO:0005739;GO:0043005;GO:0043025;GO:0005634;GO:0043204;GO:0048471;GO:0005790;GO:0008021;GO:0043195;GO:0016597;GO:0035240;GO:0019899;GO:0008199;GO:0008198;GO:0042802;GO:0004497;GO:0019825;GO:0019904;GO:0034617;GO:0004511;GO:0015842;GO:0009887;GO:0042423;GO:0071312;GO:0071333;GO:0071363;GO:0071287;GO:0071316;GO:0071466;GO:0021987;GO:0042745;GO:0050890;GO:0042416;GO:0006585;GO:0042755;GO:0048596;GO:0042418;GO:0042462;GO:0006631;GO:0016137;GO:0007507;GO:1990384;GO:0033076;GO:0007612;GO:0007626;GO:0007617;GO:0007613;GO:0010259;GO:0042136;GO:0042421;GO:0018963;GO:0052314;GO:0008016;GO:0014823;GO:0001975;GO:0051412;GO:0051602;GO:0032355;GO:0045471;GO:0045472;GO:0070848;GO:0009635;GO:0001666;GO:0035902;GO:0017085;GO:0035900;GO:0009416;GO:0032496;GO:0010038;GO:0035094;GO:0031667;GO:0014070;GO:0043434;GO:0046684;GO:0009651;GO:0048545;GO:0009414;GO:0009410;GO:0010043;GO:0007605;GO:0035176;GO:0006665;GO:0001963;GO:0042214;GO:0007601 Q8L840;O04092;Q9FT71 837636 RECQL4A ATP-dependent DNA helicase Q-like 4A SGS1-like protein GO:0005694;GO:0005737;GO:0005634;GO:0009506;GO:0043138;GO:0005524;GO:0016887;GO:0009378;GO:0046872;GO:0003676;GO:0071215;GO:0070417;GO:0006974;GO:0051276;GO:0032508;GO:0006310;GO:0006281;GO:0006268;GO:0000724 Q61043;A0A1Y7VJL5;B2RQ73;B7ZMZ9;E9Q488;E9Q4S3;Q674R4;Q6ZPM7 18080 Nin Ninein SGS1-like protein 
GO:0045177;GO:0030424;GO:0044295;GO:0120103;GO:0005814;GO:0005813;GO:0097539;GO:0005881;GO:0030425;GO:0072686;GO:0097431;GO:0005730;GO:0005654;GO:0000242;GO:0005886;GO:0000922;GO:0005509;GO:0005525;GO:0019900;GO:0051011;GO:0010457;GO:0051642;GO:0090222;GO:0048668;GO:0021540;GO:0021957;GO:0034454;GO:0050772;GO:0031116;GO:0008104 A1E2V0 489433 BIRC3 Baculoviral IAP repeat-containing protein 3 RING-type E3 ubiquitin transferase BIRC3 GO:0005737;GO:0005829;GO:0005654;GO:0005634;GO:0043027;GO:0046872;GO:0061630;GO:0043066;GO:0060546;GO:0031398;GO:0051726 P34456 186266 Uncharacterized protein F54H12.2 RING-type E3 ubiquitin transferase BIRC3 GO:0005829;GO:0004748;GO:0009263 P34457 Putative uncharacterized transposon-derived protein F54H12.3 RING-type E3 ubiquitin transferase BIRC3 GO:0003676;GO:0015074 O00463;B4DIS9;B4E0A2;Q6FHY1 7188 TRAF5 TNF receptor-associated factor 5 RING finger protein 84 GO:0035631;GO:0005813;GO:0009898;GO:0005829;GO:0042802;GO:0031996;GO:0005164;GO:0031625;GO:0008270;GO:0006915;GO:0097400;GO:0048255;GO:0008284;GO:0051091;GO:0043123;GO:0046330;GO:0051092;GO:0070534;GO:0042981;GO:0043122;GO:0007165;GO:0023019;GO:0033209 Q00945 Neurophysin RING finger protein 84 GO:0005576;GO:0005185 Q5SWK7;Q8BXX5;Q9CXG1 74315 Rnf145 RING finger protein 145 RING finger protein 84 GO:0012505;GO:0005783;GO:0005789;GO:0016021;GO:0061630;GO:0008270 Q8ZXT3 Uncharacterized protein PAE1111 RING finger protein 84 Q5REG4 100171717 DTX3 Probable E3 ubiquitin-protein ligase DTX3 RING-type E3 ubiquitin transferase DTX3 GO:0005737;GO:0046872;GO:0016740;GO:0007219;GO:0016567 Q8QG60;Q8QG52;Q8QGQ5 374092 CRY2 Cryptochrome-2 RING-type E3 ubiquitin transferase DTX3 GO:0005737;GO:0005634;GO:0003677;GO:0071949;GO:0009881;GO:0032922;GO:0007623;GO:0043153;GO:0042754;GO:0045892;GO:0018298;GO:0042752;GO:0009416 Q9H583;Q5T3Q8;Q6P197;Q9NW23 55127 HEATR1 HEAT repeat-containing protein 1 N-terminally processed U3 small nucleolar RNA-associated protein 10 homolog 
GO:0030686;GO:0001650;GO:0016020;GO:0005739;GO:0005730;GO:0005654;GO:0032040;GO:0034455;GO:0003723;GO:0030515;GO:0000462;GO:2000234;GO:0045943 A0JMR6 779416 mysm1 Histone H2A deubiquitinase MYSM1 Myb-like SWIRM and MPN domain-containing protein 1 GO:0005634;GO:0003677;GO:0042393;GO:0070122;GO:0046872;GO:0140492;GO:0004843;GO:0003713;GO:0006338;GO:0035522;GO:0045944 O88917;O09026;O35818;O88916 65096 Adgrl1 Adhesion G protein-coupled receptor L1 Latrophilin-1 GO:0030424;GO:0098978;GO:0030426;GO:0005887;GO:0099056;GO:0043005;GO:0005886;GO:0014069;GO:0042734;GO:0045202;GO:0030246;GO:0050839;GO:0004930;GO:0016524;GO:0015643;GO:0007189;GO:0007420;GO:0035584;GO:0007166;GO:0007157;GO:0051965;GO:0090129 Q7D513 egtB Hercynine oxygenase Gamma-glutamyl hercynylcysteine S-oxide synthase GO:0044875;GO:0005506;GO:0004497 Q92968;B2RCS1 5194 PEX13 Peroxisomal membrane protein PEX13 Peroxin-13 GO:0005829;GO:0005779;GO:0016020;GO:1990429;GO:0005778;GO:0005777;GO:0021795;GO:0001561;GO:0007626;GO:0060152;GO:0001764;GO:0016560;GO:0001967 A6H769 505507 RPS7 40S ribosomal protein S7 Peroxin-13 GO:0022627;GO:0005815;GO:0032040;GO:0003735;GO:0042274;GO:0006364;GO:0006412 Q14676;A2AB04;A2BF04;A2RRA8;A7YY86;B0S8A2;Q0EFC2;Q2L6H7;Q2TAZ4Q5JP55;Q5JP56;Q5ST83;Q68CQ3;Q86Z06;Q96QC2 9656 MDC1 Mediator of DNA damage checkpoint protein 1 Nuclear factor with BRCT domains 1 GO:0005694;GO:0005925;GO:0016604;GO:0005654;GO:0005634;GO:0070975;GO:0008022;GO:0006281;GO:0031573 Q54W11 8622324 mcfL Mitochondrial substrate carrier family protein L Nuclear factor with BRCT domains 1 GO:0016021;GO:0005743;GO:0055085 Q9FKK7;Q8L759 835871 XYLA Xylose isomerase Nuclear factor with BRCT domains 1 GO:0005783;GO:0005794;GO:0000325;GO:0009536;GO:0099503;GO:0046872;GO:0009045;GO:0042843 Q3ZCD7 614105 TECR Very-long-chain enoyl-CoA reductase Trans-23-enoyl-CoA reductase GO:0005783;GO:0030176;GO:0016491;GO:0102758;GO:0030497;GO:0006665;GO:0006694;GO:0042761 Q9D180;A2ACY9 68625 Cfap57 Cilia- and flagella-associated protein 57 
WD repeat-containing protein 65 Q9D6Z0;Q8K1H3;Q9CY41;Q9D942 66400 Alkbh7 Alpha-ketoglutarate-dependent dioxygenase alkB homolog 7 mitochondrial Alkylated DNA repair protein alkB homolog 7 GO:0005759;GO:0005739;GO:0051213;GO:0046872;GO:0006974;GO:0006631;GO:0010883;GO:1902445
%%html
# Sets markdown table align left in subsequent cell
<style>
table {margin-left: 0 !important;}
</style>
Output format (tab-delimited):
gene_ID | SPIDs | UniProt_gene_ID | gene | gene_description | alternate_gene_description | GO_IDs |
---|---|---|---|---|---|---|
Explanation:
awk -v FS='[;[:space:]]+': Sets the Field Separator variable to handle ; in UniProt accessions. Allows for proper searching.
NR==FNR: Restricts the next block (designated by {}) to work only on the first input file.
{array[$1]=$0; next}: Adds the entire line ($0) of the first file to the array named array, keyed on the first field ($1, the first accession), and then skips to the next line, so the remaining commands run only on the second input file.
($1 in array): Looks for the value of the first column ($1, which is the SPID) from the second file to see if there's a match in the array (which contains the lines from the first file).
{print $2"\t"array[$1]}: If there's a match, prints the second column ($2, which is the gene ID) from the second file, followed by a tab and the line from the first file.
"${parsed_uniprot}" "${genome_IDs_SPIDs}": The first and second input files.
"${joined_output}": Result of the join.
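The join described above can also be sketched in Python with a dictionary keyed on the first accession. This is an illustration under assumptions, not the notebook's method: `join_annotations` is a hypothetical helper, the sample lines are shortened versions of rows shown earlier, and `X00000` / `PGEN_.00g999999` are made-up values standing in for an SPID with no parsed UniProt record.

```python
import re

# Same idea as FS='[;[:space:]]+': split on runs of ';' and whitespace.
SEP = re.compile(r"[;\s]+")


def join_annotations(parsed_lines, id_lines):
    """Prefix each parsed UniProt line with the gene ID whose SPID matches it."""
    by_accession = {}
    for line in parsed_lines:
        line = line.rstrip("\n")
        first_accession = SEP.split(line, maxsplit=1)[0]
        by_accession[first_accession] = line

    joined = []
    for line in id_lines:
        fields = SEP.split(line.strip())
        if len(fields) >= 2 and fields[0] in by_accession:
            joined.append(fields[1] + "\t" + by_accession[fields[0]])
    return joined


parsed = [
    "Q86IC9;Q552T5\t8620183\tomt5\tProbable caffeoyl-CoA O-methyltransferase 1\tO-methyltransferase 5\tGO:0042409",
    "P04177\t25085\tTh\tTyrosine 3-monooxygenase\tTyrosine 3-hydroxylase\tGO:0030424",
]
ids = [
    "Q86IC9\tPGEN_.00g000010",
    "P04177\tPGEN_.00g000020",
    "X00000\tPGEN_.00g999999",  # made-up SPID with no parsed record
]

joined = join_annotations(parsed, ids)
print(len(ids), len(joined))
```

Gene IDs whose SPID has no matching first accession are dropped from the result, so the joined file can be shorter than the ID list.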
%%bash
cd "${analysis_dir}"
awk \
-v FS='[;[:space:]]+' \
'NR==FNR \
{array[$1]=$0; next} \
($1 in array) \
{print $2"\t"array[$1]}' \
"${parsed_uniprot}" "${genome_IDs_SPIDs}" \
> "${joined_output}"
%%bash
cd "${analysis_dir}"
wc -l "${joined_output}"
echo ""
echo "------------------------------------------------------------------"
echo ""
head -n 25 "${joined_output}"
14672 20220419-pgen-gene-accessions-gene_id-gene_name-gene_description-alt_gene_description-go_ids.tab ------------------------------------------------------------------ PGEN_.00g000010 Q86IC9;Q552T5 8620183 omt5 Probable caffeoyl-CoA O-methyltransferase 1 O-methyltransferase 5 GO:0042409;GO:0046872;GO:0008757;GO:0032259 PGEN_.00g000020 P04177 25085 Th Tyrosine 3-monooxygenase Tyrosine 3-hydroxylase GO:0030424;GO:0005737;GO:0009898;GO:0031410;GO:0030659;GO:0005829;GO:0030425;GO:0033162;GO:0005739;GO:0043005;GO:0043025;GO:0005634;GO:0043204;GO:0048471;GO:0005790;GO:0008021;GO:0043195;GO:0016597;GO:0035240;GO:0019899;GO:0008199;GO:0008198;GO:0042802;GO:0004497;GO:0019825;GO:0019904;GO:0034617;GO:0004511;GO:0015842;GO:0009887;GO:0042423;GO:0071312;GO:0071333;GO:0071363;GO:0071287;GO:0071316;GO:0071466;GO:0021987;GO:0042745;GO:0050890;GO:0042416;GO:0006585;GO:0042755;GO:0048596;GO:0042418;GO:0042462;GO:0006631;GO:0016137;GO:0007507;GO:1990384;GO:0033076;GO:0007612;GO:0007626;GO:0007617;GO:0007613;GO:0010259;GO:0042136;GO:0042421;GO:0018963;GO:0052314;GO:0008016;GO:0014823;GO:0001975;GO:0051412;GO:0051602;GO:0032355;GO:0045471;GO:0045472;GO:0070848;GO:0009635;GO:0001666;GO:0035902;GO:0017085;GO:0035900;GO:0009416;GO:0032496;GO:0010038;GO:0035094;GO:0031667;GO:0014070;GO:0043434;GO:0046684;GO:0009651;GO:0048545;GO:0009414;GO:0009410;GO:0010043;GO:0007605;GO:0035176;GO:0006665;GO:0001963;GO:0042214;GO:0007601 PGEN_.00g000050 Q8L840;O04092;Q9FT71 837636 RECQL4A ATP-dependent DNA helicase Q-like 4A SGS1-like protein GO:0005694;GO:0005737;GO:0005634;GO:0009506;GO:0043138;GO:0005524;GO:0016887;GO:0009378;GO:0046872;GO:0003676;GO:0071215;GO:0070417;GO:0006974;GO:0051276;GO:0032508;GO:0006310;GO:0006281;GO:0006268;GO:0000724 PGEN_.00g000060 Q61043;A0A1Y7VJL5;B2RQ73;B7ZMZ9;E9Q488;E9Q4S3;Q674R4;Q6ZPM7 18080 Nin Ninein SGS1-like protein 
GO:0045177;GO:0030424;GO:0044295;GO:0120103;GO:0005814;GO:0005813;GO:0097539;GO:0005881;GO:0030425;GO:0072686;GO:0097431;GO:0005730;GO:0005654;GO:0000242;GO:0005886;GO:0000922;GO:0005509;GO:0005525;GO:0019900;GO:0051011;GO:0010457;GO:0051642;GO:0090222;GO:0048668;GO:0021540;GO:0021957;GO:0034454;GO:0050772;GO:0031116;GO:0008104 PGEN_.00g000080 A1E2V0 489433 BIRC3 Baculoviral IAP repeat-containing protein 3 RING-type E3 ubiquitin transferase BIRC3 GO:0005737;GO:0005829;GO:0005654;GO:0005634;GO:0043027;GO:0046872;GO:0061630;GO:0043066;GO:0060546;GO:0031398;GO:0051726 PGEN_.00g000090 P34456 186266 Uncharacterized protein F54H12.2 RING-type E3 ubiquitin transferase BIRC3 GO:0005829;GO:0004748;GO:0009263 PGEN_.00g000120 P34457 Putative uncharacterized transposon-derived protein F54H12.3 RING-type E3 ubiquitin transferase BIRC3 GO:0003676;GO:0015074 PGEN_.00g000210 O00463;B4DIS9;B4E0A2;Q6FHY1 7188 TRAF5 TNF receptor-associated factor 5 RING finger protein 84 GO:0035631;GO:0005813;GO:0009898;GO:0005829;GO:0042802;GO:0031996;GO:0005164;GO:0031625;GO:0008270;GO:0006915;GO:0097400;GO:0048255;GO:0008284;GO:0051091;GO:0043123;GO:0046330;GO:0051092;GO:0070534;GO:0042981;GO:0043122;GO:0007165;GO:0023019;GO:0033209 PGEN_.00g000230 Q00945 Neurophysin RING finger protein 84 GO:0005576;GO:0005185 PGEN_.00g000240 Q5SWK7;Q8BXX5;Q9CXG1 74315 Rnf145 RING finger protein 145 RING finger protein 84 GO:0012505;GO:0005783;GO:0005789;GO:0016021;GO:0061630;GO:0008270 PGEN_.00g000280 Q8ZXT3 Uncharacterized protein PAE1111 RING finger protein 84 PGEN_.00g000300 Q5REG4 100171717 DTX3 Probable E3 ubiquitin-protein ligase DTX3 RING-type E3 ubiquitin transferase DTX3 GO:0005737;GO:0046872;GO:0016740;GO:0007219;GO:0016567 PGEN_.00g000380 Q8QG60;Q8QG52;Q8QGQ5 374092 CRY2 Cryptochrome-2 RING-type E3 ubiquitin transferase DTX3 GO:0005737;GO:0005634;GO:0003677;GO:0071949;GO:0009881;GO:0032922;GO:0007623;GO:0043153;GO:0042754;GO:0045892;GO:0018298;GO:0042752;GO:0009416 PGEN_.00g000440 
Q9H583;Q5T3Q8;Q6P197;Q9NW23 55127 HEATR1 HEAT repeat-containing protein 1 N-terminally processed U3 small nucleolar RNA-associated protein 10 homolog GO:0030686;GO:0001650;GO:0016020;GO:0005739;GO:0005730;GO:0005654;GO:0032040;GO:0034455;GO:0003723;GO:0030515;GO:0000462;GO:2000234;GO:0045943 PGEN_.00g000450 A0JMR6 779416 mysm1 Histone H2A deubiquitinase MYSM1 Myb-like SWIRM and MPN domain-containing protein 1 GO:0005634;GO:0003677;GO:0042393;GO:0070122;GO:0046872;GO:0140492;GO:0004843;GO:0003713;GO:0006338;GO:0035522;GO:0045944 PGEN_.00g000460 O88917;O09026;O35818;O88916 65096 Adgrl1 Adhesion G protein-coupled receptor L1 Latrophilin-1 GO:0030424;GO:0098978;GO:0030426;GO:0005887;GO:0099056;GO:0043005;GO:0005886;GO:0014069;GO:0042734;GO:0045202;GO:0030246;GO:0050839;GO:0004930;GO:0016524;GO:0015643;GO:0007189;GO:0007420;GO:0035584;GO:0007166;GO:0007157;GO:0051965;GO:0090129 PGEN_.00g000490 Q7D513 egtB Hercynine oxygenase Gamma-glutamyl hercynylcysteine S-oxide synthase GO:0044875;GO:0005506;GO:0004497 PGEN_.00g000520 Q92968;B2RCS1 5194 PEX13 Peroxisomal membrane protein PEX13 Peroxin-13 GO:0005829;GO:0005779;GO:0016020;GO:1990429;GO:0005778;GO:0005777;GO:0021795;GO:0001561;GO:0007626;GO:0060152;GO:0001764;GO:0016560;GO:0001967 PGEN_.00g000530 A6H769 505507 RPS7 40S ribosomal protein S7 Peroxin-13 GO:0022627;GO:0005815;GO:0032040;GO:0003735;GO:0042274;GO:0006364;GO:0006412 PGEN_.00g000540 Q14676;A2AB04;A2BF04;A2RRA8;A7YY86;B0S8A2;Q0EFC2;Q2L6H7;Q2TAZ4Q5JP55;Q5JP56;Q5ST83;Q68CQ3;Q86Z06;Q96QC2 9656 MDC1 Mediator of DNA damage checkpoint protein 1 Nuclear factor with BRCT domains 1 GO:0005694;GO:0005925;GO:0016604;GO:0005654;GO:0005634;GO:0070975;GO:0008022;GO:0006281;GO:0031573 PGEN_.00g000560 Q54W11 8622324 mcfL Mitochondrial substrate carrier family protein L Nuclear factor with BRCT domains 1 GO:0016021;GO:0005743;GO:0055085 PGEN_.00g000600 Q9FKK7;Q8L759 835871 XYLA Xylose isomerase Nuclear factor with BRCT domains 1 
GO:0005783;GO:0005794;GO:0000325;GO:0009536;GO:0099503;GO:0046872;GO:0009045;GO:0042843 PGEN_.00g000660 Q3ZCD7 614105 TECR Very-long-chain enoyl-CoA reductase Trans-23-enoyl-CoA reductase GO:0005783;GO:0030176;GO:0016491;GO:0102758;GO:0030497;GO:0006665;GO:0006694;GO:0042761 PGEN_.00g000670 Q9D180;A2ACY9 68625 Cfap57 Cilia- and flagella-associated protein 57 WD repeat-containing protein 65 PGEN_.00g000680 Q9D6Z0;Q8K1H3;Q9CY41;Q9D942 66400 Alkbh7 Alpha-ketoglutarate-dependent dioxygenase alkB homolog 7 mitochondrial Alkylated DNA repair protein alkB homolog 7 GO:0005759;GO:0005739;GO:0051213;GO:0046872;GO:0006974;GO:0006631;GO:0010883;GO:1902445
The cells below are for reference, as they were used to identify a set of obsolete/missing SPIDs. These were then used in a subsequent re-run of the notebook to refine things. The remaining cells document this process.
The original number of SPIDs was 14676, but we ended up with only 14668 matches, leaving eight unaccounted for.
Let's see if we can figure out why...
%%bash
cd "${analysis_dir}"
diff <(awk '{print $1}' "${joined_output}") <(awk '{print $2}' "${genome_IDs_SPIDs}")
2344a2345
> PGEN_.00g050810
5161a5163
> PGEN_.00g119250
6930a6933
> PGEN_.00g162250
7914a7918
> PGEN_.00g185820
8976a8981
> PGEN_.00g209090
9601a9607
> PGEN_.00g222140
11142a11149
> PGEN_.00g258380
12942a12950
> PGEN_.00g304800
--------------------------------------------------------------------------- CalledProcessError Traceback (most recent call last) <ipython-input-36-0ccf4752df95> in <module> ----> 1 get_ipython().run_cell_magic('bash', '', '\ncd "${analysis_dir}"\n\ndiff <(awk \'{print $1}\' "${joined_output}") <(awk \'{print $2}\' "${genome_IDs_SPIDs}")\n') ~/programs/miniconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell) 2401 with self.builtin_trap: 2402 args = (magic_arg_s, cell) -> 2403 result = fn(*args, **kwargs) 2404 return result 2405 ~/programs/miniconda3/lib/python3.9/site-packages/IPython/core/magics/script.py in named_script_magic(line, cell) 140 else: 141 line = script --> 142 return self.shebang(line, cell) 143 144 # write a basic docstring: ~/programs/miniconda3/lib/python3.9/site-packages/decorator.py in fun(*args, **kw) 230 if not kwsyntax: 231 args, kw = fix(args, kw, sig) --> 232 return caller(func, *(extras + args), **kw) 233 fun.__name__ = func.__name__ 234 fun.__doc__ = func.__doc__ ~/programs/miniconda3/lib/python3.9/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k) 185 # but it's overkill for just that one bit of state. 186 def magic_deco(arg): --> 187 call = lambda f, *a, **k: f(*a, **k) 188 189 if callable(arg): ~/programs/miniconda3/lib/python3.9/site-packages/IPython/core/magics/script.py in shebang(self, line, cell) 243 sys.stderr.flush() 244 if args.raise_error and p.returncode!=0: --> 245 raise CalledProcessError(p.returncode, cell, output=out, stderr=err) 246 247 def _run_script(self, p, cell, to_close): CalledProcessError: Command 'b'\ncd "${analysis_dir}"\n\ndiff <(awk \'{print $1}\' "${joined_output}") <(awk \'{print $2}\' "${genome_IDs_SPIDs}")\n'' returned non-zero exit status 1.
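The CalledProcessError above is not a real failure: diff exits with status 1 whenever its two inputs differ, and the %%bash magic raises on any non-zero exit status. As a minimal sketch of an alternative, the missing gene IDs could be listed with awk, which exits 0 whether or not anything is missing. The function name missing_gene_ids and the demo files are hypothetical; the column positions (gene ID in column 1 of the joined output, column 2 of the SPID list) are taken from the diff command above.

```shell
# Hypothetical helper: print gene IDs found in column 2 of the SPID list
# but absent from column 1 of the joined annotation output.
missing_gene_ids() {
  joined="$1"   # e.g. "${joined_output}"
  id_map="$2"   # e.g. "${genome_IDs_SPIDs}"
  awk '
    FILENAME == ARGV[1] { seen[$1] = 1; next }   # first file: remember gene IDs
    !($2 in seen)       { print $2 }             # second file: report unseen IDs
  ' "${joined}" "${id_map}"
}

# Demo on made-up data (not real annotations).
demo_dir="$(mktemp -d)"
printf 'GENE_A\tP00001\nGENE_C\tP00003\n' > "${demo_dir}/joined.tab"
printf 'P00001\tGENE_A\nP00002\tGENE_B\nP00003\tGENE_C\n' > "${demo_dir}/ids.tab"
missing_gene_ids "${demo_dir}/joined.tab" "${demo_dir}/ids.tab"
```

Unlike the positional diff, this compares by ID, so it does not depend on both files being in the same order.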
%%bash
cd "${analysis_dir}"
for id in PGEN_.00g050810 PGEN_.00g119250 PGEN_.00g162250 PGEN_.00g185820 PGEN_.00g209090 PGEN_.00g222140 PGEN_.00g258380 PGEN_.00g304800
do
grep "${id}" "${genome_IDs_SPIDs}"
done
Q3UN21 PGEN_.00g050810
P0DN79 PGEN_.00g119250
Q6ZRR9 PGEN_.00g162250
Q9NPA5 PGEN_.00g185820
Q6ZR98 PGEN_.00g209090
Q5T699 PGEN_.00g222140
Q9NPA5 PGEN_.00g258380
Q9NPA5 PGEN_.00g304800
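The eight gene IDs above map to only six distinct SPIDs, because Q9NPA5 is assigned to three genes. A quick tally like the sketch below makes that visible, so each SPID only needs to be checked once. The function name count_genes_per_spid and the demo file are hypothetical; the demo lines are copied from the grep output above.

```shell
# Hypothetical helper: count how many gene IDs share each SPID (column 1),
# most frequent first. Output is "<count><TAB><SPID>".
count_genes_per_spid() {
  awk '{ n[$1]++ } END { for (s in n) printf "%d\t%s\n", n[s], s }' "$1" \
    | sort -k1,1nr -k2,2
}

# Demo using the eight SPID/gene ID pairs printed above.
tally_dir="$(mktemp -d)"
cat > "${tally_dir}/missing.txt" <<'EOF'
Q3UN21 PGEN_.00g050810
P0DN79 PGEN_.00g119250
Q6ZRR9 PGEN_.00g162250
Q9NPA5 PGEN_.00g185820
Q6ZR98 PGEN_.00g209090
Q5T699 PGEN_.00g222140
Q9NPA5 PGEN_.00g258380
Q9NPA5 PGEN_.00g304800
EOF
count_genes_per_spid "${tally_dir}/missing.txt"
```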
Using the information above, I will add code to find and replace the redirected SPIDs and then re-run the rest of the notebook.
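A minimal sketch of what that find/replace step might look like is below. It is not the code actually added to the notebook. It assumes a hypothetical two-column, tab-delimited redirect file (old SPID, new SPID) and assumes the SPID list is tab-delimited with the SPID in column 1. The function name replace_redirected_spids and the accessions in the demo are placeholders, since the replacement accessions are not listed here.

```shell
# Hypothetical helper: swap old SPIDs (column 1) for their replacements.
# Assumes both files are tab-delimited.
replace_redirected_spids() {
  redirects="$1"   # hypothetical file: old SPID <TAB> new SPID
  id_map="$2"      # e.g. "${genome_IDs_SPIDs}": SPID <TAB> gene ID
  awk -F '\t' -v OFS='\t' '
    FILENAME == ARGV[1] { new_id[$1] = $2; next }   # load old -> new map
    ($1 in new_id)      { $1 = new_id[$1] }         # replace redirected SPID
                        { print }                   # emit every mapping line
  ' "${redirects}" "${id_map}"
}

# Demo on placeholder accessions (OLD001/NEW001 are not real SPIDs).
redirect_dir="$(mktemp -d)"
printf 'OLD001\tNEW001\n' > "${redirect_dir}/redirects.tab"
printf 'OLD001\tGENE_A\nP00002\tGENE_B\n' > "${redirect_dir}/ids.tab"
replace_redirected_spids "${redirect_dir}/redirects.tab" "${redirect_dir}/ids.tab"
```

Keeping the redirects in a separate file, rather than hard-coding sed substitutions, means the same step can be reused if more obsolete SPIDs turn up in a later run.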