
Create P.generosa primary gene annotations mapping file¶

This notebook uses the NCBI BLAST and DIAMOND BLAST annotations generated by our GenSAS P.generosa genome annotation.

It compares the two sets of SwissProt ID annotations (SPIDs) and selects the entry with the lowest E-value as the representative entry for each gene. It then uses that canonical list of SPIDs to pull gene names and gene ontology (GO) IDs from UniProt, and creates a tab-delimited annotation mapping file.

List computer specs¶

In [1]:

%%bash
echo "TODAY'S DATE"
date
echo "------------"
echo ""
lsb_release -a
echo ""
echo "------------"
echo "HOSTNAME: "
hostname
echo ""
echo "------------"
echo "Computer Specs:"
echo ""
lscpu
echo ""
echo "------------"
echo ""
echo "Memory Specs"
echo ""
free -mh

TODAY'S DATE
Wed 20 Apr 2022 06:24:50 AM PDT
------------

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.4 LTS
Release:	20.04
Codename:	focal

------------
HOSTNAME: 
computer

------------
Computer Specs:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   45 bits physical, 48 bits virtual
CPU(s):                          2
On-line CPU(s) list:             0,1
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       2
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           165
Model name:                      Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
Stepping:                        2
CPU MHz:                         2400.006
BogoMIPS:                        4800.01
Hypervisor vendor:               VMware
Virtualization type:             full
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        512 KiB
L3 cache:                        32 MiB
NUMA node0 CPU(s):               0,1
Vulnerability Itlb multihit:     KVM: Mitigation: VMX unsupported
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat flush_l1d arch_capabilities

------------

Memory Specs

              total        used        free      shared  buff/cache   available
Mem:           54Gi       2.0Gi        50Gi        38Mi       2.6Gi        52Gi
Swap:         2.0Gi          0B       2.0Gi

No LSB modules are available.

Set variables¶

Lines beginning with %env set environment variables, which are available in %%bash cells.
Lines without %env set Python variables.

In [2]:

######################################################################
### Set directories
%env data_dir=/home/sam/data/P_generosa/genomes
%env analysis_dir=/home/sam/analyses/20220419-pgen-gene_annotation_mapping
analysis_dir="/home/sam/analyses/20220419-pgen-gene_annotation_mapping"

#####################################################################
### Input files
%env base_url=https://gannet.fish.washington.edu/Atumefaciens/20190928_Pgenerosa_v074.a4_gensas_annotation
%env blast_annotations=Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab
%env diamond_annotations=Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab

######################################################################
### Output files
# UniProt batch output
%env uniprot_output=20220419-pgen-uniprot_batch-results.txt

# Gene name list for UniProt batch submission
%env spid_list=Panopea-generosa-v1.0.a4-blast-diamond-functional-SPIDs.txt

# Genome IDs and SPIDs
%env genome_IDs_SPIDs=Panopea-generosa-v1.0.a4-blast-diamond-functional-genome_IDs-SPIDs.txt

%env blast_diamond_cat=Panopea-generosa-v1.0.a4-blast-diamond-functional.tab

%env blast_diamond_cat_best=Panopea-generosa-v1.0.a4-blast-diamond-functional_best.tab

# Parsed UniProt
%env parsed_uniprot=20220419-pgen-accession-gene_name-gene_description-go_ids.tab

# Final output
%env joined_output=20220419-pgen-gene-accessions-gene_id-gene_name-gene_description-alt_gene_description-go_ids.tab

######################################################################

### Programs

# UniProt batch submission/retrieval script
%env uniprot_mapping_script=/home/sam/programs/uniprot_mapping.pl

env: data_dir=/home/sam/data/P_generosa/genomes
env: analysis_dir=/home/sam/analyses/20220419-pgen-gene_annotation_mapping
env: base_url=https://gannet.fish.washington.edu/Atumefaciens/20190928_Pgenerosa_v074.a4_gensas_annotation
env: blast_annotations=Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab
env: diamond_annotations=Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab
env: uniprot_output=20220419-pgen-uniprot_batch-results.txt
env: spid_list=Panopea-generosa-v1.0.a4-blast-diamond-functional-SPIDs.txt
env: genome_IDs_SPIDs=Panopea-generosa-v1.0.a4-blast-diamond-functional-genome_IDs-SPIDs.txt
env: blast_diamond_cat=Panopea-generosa-v1.0.a4-blast-diamond-functional.tab
env: blast_diamond_cat_best=Panopea-generosa-v1.0.a4-blast-diamond-functional_best.tab
env: parsed_uniprot=20220419-pgen-accession-gene_name-gene_description-go_ids.tab
env: joined_output=20220419-pgen-gene-accessions-gene_id-gene_name-gene_description-alt_gene_description-go_ids.tab
env: uniprot_mapping_script=/home/sam/programs/uniprot_mapping.pl

Make input/output directories¶

In [3]:

%%bash
# If directories don't exist, make them
mkdir --parents "${data_dir}" "${analysis_dir}"

Download and inspect annotation files¶

--quiet: Prevents wget output from overwhelming Jupyter Notebook

--continue: If a download was previously started, resumes where it left off and does not create a second file if one already exists.

In [4]:

%%bash
cd "${data_dir}"

wget --quiet --continue ${base_url}/${blast_annotations}
wget --quiet --continue ${base_url}/${diamond_annotations}

ls -ltrh

echo ""
echo "---------------------------------------------------------"
echo ""
head -n 25 *.tab

total 2.6G
-rw-rw-r-- 1 sam sam 1.5M Oct  3  2019 Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab
-rw-rw-r-- 1 sam sam 1.3M Oct  3  2019 Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab
-rwxr-xr-x 1 sam sam 914M Nov  5  2019 Panopea-generosa-v1.0.fasta
-rw-rw-r-- 1 sam sam 454M Mar 19 07:58 Panopea-generosa-v1.0.a4.gff3
-rw-r--r-- 1 sam sam 503M Mar 22 06:48 Panopea-generosa-v1.0.a4_biotype.gff
-rw-r--r-- 1 sam sam 4.8M Mar 24 07:30 Panopea-generosa-v1.0.a4_biotype-trna_strand_converted-no_RNAmmer.bed
-rw-rw-r-- 1 sam sam  658 Mar 25 06:11 Panopea-generosa-v1.0.fa.fai
-rw-rw-r-- 1 sam sam 9.7M Mar 30 11:03 Panopea-generosa-v1.0.a4_biotype.gtf
-rw-rw-r-- 1 sam sam 507M Mar 30 11:43 Panopea-generosa-v1.0.a4_biotype.bed
-rw-rw-r-- 1 sam sam  378 Mar 30 13:20 Panopea-generosa-v1.0.fa.lengths
-rw-rw-r-- 1 sam sam 9.7M Mar 30 13:34 Panopea-generosa-v1.0.a4_biotype.sorted.gtf
-rw-rw-r-- 1 sam sam 996K Mar 30 13:34 Panopea-generosa-v1.0.a4_biotype_non-coding.bed
drwxrwxr-x 2 sam sam 4.0K Mar 30 13:46 feelnc_codpot_out
-rw-rw-r-- 1 sam sam  70M Mar 31 06:00 Panopea-generosa-v1.0.a4.gtf
-rw-rw-r-- 1 sam sam 101K Apr  6 12:22 spids.txt
-rw-rw-r-- 1 sam sam 138M Apr  6 12:23 uniprot_mapping-all.txt
-rw-rw-r-- 1 sam sam 282K Apr  7 07:13 uniprot_mapping-all-AC_only.txt
-rw-rw-r-- 1 sam sam   78 Apr  7 13:57 File1.txt
-rw-rw-r-- 1 sam sam  357 Apr  7 13:57 File2.txt

---------------------------------------------------------

==> Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab <==
#
# Output is generated by GenSAS 7.x-5.0
#
#name     : mRNA
#start    : Start of alignment in subject
#end      : End of alignment in subject
#m_start  : Start of alignment in query
#m_end    : End of alignment in query
#al       : Alignment length
#score    : Row score of the match
#evalue   : E value of the match
#identity : Percentage of identical matches
mame	start	end	score	Accession	Match ID	m_start	m_end	E-value	identity	al
21910-PGEN_.00g000010.m01	121	229	165	Q86IC9	sp|Q86IC9|CAMT1_DICDI	11	122	8.93e-14	35.652	115
21910-PGEN_.00g000020.m01	147	467	968	P04177	sp|P04177|TY3H_RAT	20	339	3.47e-127	55.140	321
21910-PGEN_.00g000050.m01	566	722	182	Q8L840	sp|Q8L840|RQL4A_ARATH	2	167	2.67e-14	35.119	168
21910-PGEN_.00g000080.m01	268	322	152	A1E2V0	sp|A1E2V0|BIRC3_CANLF	163	220	3.91e-10	53.448	58
21910-PGEN_.00g000090.m01	199	327	161	P34456	sp|P34456|YMD2_CAEEL	7	134	7.52e-12	26.357	129
21910-PGEN_.00g000210.m01	18	200	263	O00463	sp|O00463|TRAF5_HUMAN	5	191	2.24e-25	34.921	189
21910-PGEN_.00g000230.m01	48	155	287	Q00945	sp|Q00945|CONO_LYMST	31	134	1.59e-32	50.000	108
21910-PGEN_.00g000240.m01	4	605	1091	Q5SWK7	sp|Q5SWK7|RN145_MOUSE	13	601	2.65e-139	39.607	611
21910-PGEN_.00g000280.m01	4	153	210	Q8ZXT3	sp|Q8ZXT3|Y1111_PYRAE	853	1012	1.10e-17	38.750	160
21910-PGEN_.00g000300.m01	159	347	480	Q5REG4	sp|Q5REG4|DTX3_PONAB	1135	1320	1.20e-51	50.794	189
21910-PGEN_.00g000300.m02	159	347	480	Q5REG4	sp|Q5REG4|DTX3_PONAB	1138	1323	1.18e-51	50.794	189
21910-PGEN_.00g000380.m01	381	508	205	Q8QG60	sp|Q8QG60|CRY2_CHICK	2	145	4.92e-18	36.111	144

==> Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab <==
#
# Output is generated by GenSAS 7.x-5.0
#
#name     : mRNA
#start    : Start of alignment in subject
#end      : End of alignment in subject
#m_start  : Start of alignment in query
#m_end    : End of alignment in query
#al       : Alignment length
#score    : Row score of the match
#evalue   : E value of the match
#identity : Percentage of identical matches
mame	start	end	score	Accession	Match ID	m_start	m_end	E-value	identity	al
21910-PGEN_.00g000020.m01	147	467	945	P04177	sp|P04177|TY3H_RAT	20	339	7.9e-101	55.1	321
21910-PGEN_.00g000050.m01	566	722	180	Q8L840	sp|Q8L840|RQL4A_ARATH	2	167	2.4e-12	35.1	168
21910-PGEN_.00g000060.m01	1957	2106	129	Q61043	sp|Q61043|NIN_MOUSE	31	184	1.7e-06	26.0	154
21910-PGEN_.00g000080.m01	233	304	134	Q24307	sp|Q24307|DIAP2_DROME	174	255	6.2e-07	34.1	82
21910-PGEN_.00g000120.m01	6	49	118	P34457	sp|P34457|YMD3_CAEEL	90	133	3.2e-05	47.7	44
21910-PGEN_.00g000230.m01	49	155	216	Q00945	sp|Q00945|CONO_LYMST	32	134	9.9e-17	49.5	107
21910-PGEN_.00g000240.m01	4	585	1144	Q5SWK7	sp|Q5SWK7|RN145_MOUSE	13	592	1.2e-123	40.1	591
21910-PGEN_.00g000280.m01	433	592	230	Q9WYX8	sp|Q9WYX8|Y508_THEMA	863	1022	2.2e-17	40.2	164
21910-PGEN_.00g000300.m01	161	347	474	Q80V91	sp|Q80V91|DTX3_MOUSE	1137	1320	1.2e-45	51.3	187
21910-PGEN_.00g000300.m02	161	347	474	Q80V91	sp|Q80V91|DTX3_MOUSE	1140	1323	1.2e-45	51.3	187
21910-PGEN_.00g000380.m01	381	508	201	Q8QG60	sp|Q8QG60|CRY2_CHICK	2	145	7.0e-15	35.4	144
21910-PGEN_.00g000440.m01	234	1796	1606	Q9H583	sp|Q9H583|HEAT1_HUMAN	1	1575	8.0e-177	30.0	1624

Count number of header lines (i.e. beginning with a `#`)¶

In [5]:

%%bash
cd "${data_dir}"

grep -c "^#" "${blast_annotations}" "${diamond_annotations}"

Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab:12
Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab:12
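
The count of 12 covers only the `#` lines; the column-name row that follows them is a 13th line to skip. As a minimal sketch (strip_header is a hypothetical helper, not part of this notebook), the header could be removed by pattern instead of by a fixed line count:

```shell
# Hypothetical helper: drop "#" comment lines and the first non-comment line
# (the column-name row), without hard-coding how many lines to skip.
strip_header() {
  awk '
    /^#/ { next }                          # skip comment lines
    !seen_names { seen_names = 1; next }   # skip the column-name row once
    { print }                              # everything else is data
  ' "$@"
}

# Toy input that mimics the layout of the GenSAS tables
printf '#\n# comment\nname\tstart\ngene1\t10\ngene2\t20\n' | strip_header
```

This only assumes that the column-name row is the first line not starting with `#`, which holds for both files shown above.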

Concatenate annotation files and keep only one with best `e-value`¶

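Also modifies mRNA names to generate gene names instead.

awk 'NR > 13': Skips the first 13 lines of each file (the 12 `#` header lines counted above, plus the column-name row)
sort -k1,1 -k9,9: Sorts on first field (mRNA name), then on 9th field (e-value). Without a numeric option the e-value key is compared as text, so the first row for an mRNA is not always its lowest e-value (e.g. PGEN_.00g000050 below keeps 2.4e-12 over 2.67e-14); GNU sort's -k9,9g compares the key numerically.
sed 's/^21910-//': Removes leading info from each mRNA name, at the beginning of each line (^)
sed 's/.m0[0-9]//': Removes .m0N from each mRNA name. Because this happens after sorting, a gene with several mRNAs keeps the row from its first-sorted mRNA (usually .m01).
awk '!array[$1]++': awk array that only prints a line if it's the first occurrence of the gene name (first field; $1), i.e. no duplicates

Also replaces two obsolete SPIDs, as identified in previous notebook run.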
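
A minimal sketch of how the sort key type affects which row comes first, using three made-up rows for a single hypothetical mRNA (the e-values are copied from the tables above; `-g` is GNU sort's general-numeric option):

```shell
# Three toy rows: name<TAB>e-value
rows=$(printf 'mRNA1\t2.4e-12\nmRNA1\t2.67e-14\nmRNA1\t3.47e-127\n')

# Default key comparison is textual (byte order under LC_ALL=C)
lexical_first=$(printf '%s\n' "$rows" | LC_ALL=C sort -k1,1 -k2,2 | head -n 1 | cut -f 2)

# -g parses the key as a floating-point number, including scientific notation
numeric_first=$(printf '%s\n' "$rows" | LC_ALL=C sort -k1,1 -k2,2g | head -n 1 | cut -f 2)

echo "lexical: ${lexical_first}"
echo "numeric: ${numeric_first}"
```

The textual sort puts 2.4e-12 first because "4" sorts before "6", while the numeric sort puts the smallest value, 3.47e-127, first.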

In [50]:

%%bash
cd "${data_dir}"

# Concatenate both annotation files
# ">" overwrites, so re-running this cell does not append duplicate rows
for file in ${blast_annotations} ${diamond_annotations}
do
    awk 'NR > 13' "${file}"
done \
> "${analysis_dir}"/"${blast_diamond_cat}"

# Sort by mRNA name and e-value, format genome IDs, keep first row per gene
sort -k1,1 -k9,9 "${analysis_dir}"/"${blast_diamond_cat}" \
| sed 's/^21910-//' \
| sed 's/.m0[0-9]//' \
| awk '!array[$1]++' \
> "${analysis_dir}"/"${blast_diamond_cat_best}"

# Find/replace two obsolete SPIDs
sed -i 's/Q6ZRR9/M0R2J8/g' "${analysis_dir}"/"${blast_diamond_cat_best}"
sed -i 's/Q9NPA5/Q9NTW7/g' "${analysis_dir}"/"${blast_diamond_cat_best}"

echo ""
echo "Line count:"

wc -l "${analysis_dir}"/"${blast_diamond_cat}"

echo "--------------------------------------------------"

echo ""
echo "Line count:"

wc -l "${analysis_dir}"/"${blast_diamond_cat_best}"

echo "--------------------------------------------------"
echo ""

head -n 25 "${analysis_dir}"/"${blast_diamond_cat_best}"

Line count:
31216 /home/sam/analyses/20220419-pgen-gene_annotation_mapping/Panopea-generosa-v1.0.a4-blast-diamond-functional.tab
--------------------------------------------------

Line count:
14676 /home/sam/analyses/20220419-pgen-gene_annotation_mapping/Panopea-generosa-v1.0.a4-blast-diamond-functional_best.tab
--------------------------------------------------

PGEN_.00g000010	121	229	165	Q86IC9	sp|Q86IC9|CAMT1_DICDI	11	122	8.93e-14	35.652	115
PGEN_.00g000020	147	467	968	P04177	sp|P04177|TY3H_RAT	20	339	3.47e-127	55.140	321
PGEN_.00g000050	566	722	180	Q8L840	sp|Q8L840|RQL4A_ARATH	2	167	2.4e-12	35.1	168
PGEN_.00g000060	1957	2106	129	Q61043	sp|Q61043|NIN_MOUSE	31	184	1.7e-06	26.0	154
PGEN_.00g000080	268	322	152	A1E2V0	sp|A1E2V0|BIRC3_CANLF	163	220	3.91e-10	53.448	58
PGEN_.00g000090	199	327	161	P34456	sp|P34456|YMD2_CAEEL	7	134	7.52e-12	26.357	129
PGEN_.00g000120	6	49	118	P34457	sp|P34457|YMD3_CAEEL	90	133	3.2e-05	47.7	44
PGEN_.00g000210	18	200	263	O00463	sp|O00463|TRAF5_HUMAN	5	191	2.24e-25	34.921	189
PGEN_.00g000230	48	155	287	Q00945	sp|Q00945|CONO_LYMST	31	134	1.59e-32	50.000	108
PGEN_.00g000240	4	585	1144	Q5SWK7	sp|Q5SWK7|RN145_MOUSE	13	592	1.2e-123	40.1	591
PGEN_.00g000280	4	153	210	Q8ZXT3	sp|Q8ZXT3|Y1111_PYRAE	853	1012	1.10e-17	38.750	160
PGEN_.00g000300	159	347	480	Q5REG4	sp|Q5REG4|DTX3_PONAB	1135	1320	1.20e-51	50.794	189
PGEN_.00g000380	381	508	205	Q8QG60	sp|Q8QG60|CRY2_CHICK	2	145	4.92e-18	36.111	144
PGEN_.00g000440	792	1796	1362	Q9H583	sp|Q9H583|HEAT1_HUMAN	539	1575	7.24e-156	34.692	1055
PGEN_.00g000450	347	753	1038	A0JMR6	sp|A0JMR6|MYSM1_XENLA	138	535	3.16e-130	48.426	413
PGEN_.00g000460	478	1045	859	O88917	sp|O88917|AGRL1_RAT	150	740	2.17e-95	33.443	610
PGEN_.00g000490	189	393	152	Q7D513	sp|Q7D513|EGTB_MYCTO	307	565	1.7e-08	24.8	266
PGEN_.00g000520	118	339	495	Q92968	sp|Q92968|PEX13_HUMAN	134	356	1.34e-56	49.565	230
PGEN_.00g000530	1	191	593	A6H769	sp|A6H769|RS7_BOVIN	1	160	2.3e-60	68.1	191
PGEN_.00g000540	1876	2081	565	Q14676	sp|Q14676|MDC1_HUMAN	1857	2060	1.19e-56	46.117	206
PGEN_.00g000560	17	114	135	Q54W11	sp|Q54W11|MCFL_DICDI	21	113	3.1e-07	32.7	98
PGEN_.00g000600	49	474	1301	Q9FKK7	sp|Q9FKK7|XYLA_ARATH	215	641	1.35e-173	55.245	429
PGEN_.00g000660	5	308	899	Q3ZCD7	sp|Q3ZCD7|TECR_BOVIN	2	303	1.4e-95	55.7	305
PGEN_.00g000670	60	1237	3493	Q9D180	sp|Q9D180|CFA57_MOUSE	1	1143	0.0	59.102	1181
PGEN_.00g000680	17	206	520	Q9D6Z0	sp|Q9D6Z0|ALKB7_MOUSE	89	278	1.11e-64	48.947	190

Create list of genome IDs and SwissProt IDs¶

In [51]:

%%bash
cd "${analysis_dir}"

awk '{print $5,"\t",$1}' "${blast_diamond_cat_best}" > "$genome_IDs_SPIDs"

echo ""
echo "Line count:"

wc -l "$genome_IDs_SPIDs"

echo "--------------------------------------------------"

head "$genome_IDs_SPIDs"

Line count:
14676 Panopea-generosa-v1.0.a4-blast-diamond-functional-genome_IDs-SPIDs.txt
--------------------------------------------------
Q86IC9 	 PGEN_.00g000010
P04177 	 PGEN_.00g000020
Q8L840 	 PGEN_.00g000050
Q61043 	 PGEN_.00g000060
A1E2V0 	 PGEN_.00g000080
P34456 	 PGEN_.00g000090
P34457 	 PGEN_.00g000120
O00463 	 PGEN_.00g000210
Q00945 	 PGEN_.00g000230
Q5SWK7 	 PGEN_.00g000240
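
The comma-separated print above leaves a space on each side of the tab, which the later join tolerates because it splits on any run of whitespace. If a strictly tab-delimited file is wanted, a sketch along these lines would work (spid_gene_pairs is a hypothetical helper name):

```shell
# Hypothetical helper: print SPID and genome ID separated by a single tab.
# -F sets the input separator; OFS sets what "," in print expands to.
spid_gene_pairs() {
  awk -F '\t' -v OFS='\t' '{ print $5, $1 }' "$@"
}

# One toy row with the same leading columns as the "best" table
printf 'PGEN_.00g000010\t121\t229\t165\tQ86IC9\tsp|Q86IC9|CAMT1_DICDI\n' \
| spid_gene_pairs
```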

Create list of SwissProt IDs¶

In [52]:

%%bash
cd "${analysis_dir}"

awk '{print $5}' "${blast_diamond_cat_best}" > "${spid_list}"

echo ""
echo "Line count:"

wc -l "${spid_list}"

echo "--------------------------------------------------"

echo ""

head "${spid_list}"

Line count:
14676 Panopea-generosa-v1.0.a4-blast-diamond-functional-SPIDs.txt
--------------------------------------------------

Q86IC9
P04177
Q8L840
Q61043
A1E2V0
P34456
P34457
O00463
Q00945
Q5SWK7

Batch submission/retrieval to/from UniProt¶

Perl script obtained from UniProt: https://www.uniprot.org/help/api_batch_retrieval

Modified to accept file with list of IDs and to map SPID to UniProt Accession

In [53]:

%%bash
# Print script for viewing
cat "${uniprot_mapping_script}"

use strict;
use warnings;
use LWP::UserAgent;

my $list = $ARGV[0]; # File containing list of UniProt identifiers.

my $base = 'https://www.uniprot.org';
my $tool = 'uploadlists';

my $contact = 'samwhite@uw.edu'; # Please set a contact email address here to help us debug in case of problems (see https://www.uniprot.org/help/privacy).
my $agent = LWP::UserAgent->new(agent => "libwww-perl $contact");
push @{$agent->requests_redirectable}, 'POST';

my $response = $agent->post("$base/$tool/",
                            [ 'file' => [$list],
                              'format' => 'txt',
                              'from' => 'SWISSPROT',
                              'to' => 'ACC',
                            ],
                            'Content_Type' => 'form-data');

while (my $wait = $response->header('Retry-After')) {
  print STDERR "Waiting ($wait)...\n";
  sleep $wait;
  $response = $agent->get($response->base);
}

$response->is_success ?
  print $response->content :
  die 'Failed, got ' . $response->status_line .
    ' for ' . $response->request->uri . "\n";

In [54]:

%%bash
cd "${analysis_dir}"

# Run UniProt Perl mapping script and time how long it takes
time \
perl "${uniprot_mapping_script}" "${spid_list}" > "${uniprot_output}"

ls -ltrh

echo ""
echo ""
echo "--------------------------------------------------"
echo ""
echo "Line count:"

wc -l "${uniprot_output}"

echo "--------------------------------------------------"

total 142M
-rw-rw-r-- 1 sam sam 2.8M Apr 20 21:20 Panopea-generosa-v1.0.a4-blast-diamond-functional.tab
-rw-rw-r-- 1 sam sam 1.2M Apr 20 21:20 Panopea-generosa-v1.0.a4-blast-diamond-functional_best.tab
-rw-rw-r-- 1 sam sam 359K Apr 20 21:20 Panopea-generosa-v1.0.a4-blast-diamond-functional-genome_IDs-SPIDs.txt
-rw-rw-r-- 1 sam sam 101K Apr 20 21:20 Panopea-generosa-v1.0.a4-blast-diamond-functional-SPIDs.txt
-rw-rw-r-- 1 sam sam 138M Apr 20 21:21 20220419-pgen-uniprot_batch-results.txt


--------------------------------------------------

Line count:
2850419 20220419-pgen-uniprot_batch-results.txt
--------------------------------------------------

real	0m43.226s
user	0m3.070s
sys	0m3.310s

Check mapping output¶

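Counting Accession lines (beginning with AC) should show a lower count than the number of SwissProt IDs submitted, as UniProt automatically removes duplicates upon submission. A record's accessions can wrap onto more than one AC line, so this count can be higher than the number of records returned.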

In [55]:

%%bash
cd "${analysis_dir}"

head -n 30 "${uniprot_output}"

echo ""

echo "----------------------------------------------------"

echo ""

echo "Number of accessions:"

echo ""

grep -c "^AC" "${uniprot_output}"

ID   CAMT1_DICDI             Reviewed;         230 AA.
AC   Q86IC9; Q552T5;
DT   05-MAY-2009, integrated into UniProtKB/Swiss-Prot.
DT   01-JUN-2003, sequence version 1.
DT   23-FEB-2022, entry version 92.
DE   RecName: Full=Probable caffeoyl-CoA O-methyltransferase 1;
DE            EC=2.1.1.104;
DE   AltName: Full=O-methyltransferase 5;
GN   Name=omt5; ORFNames=DDB_G0275499;
OS   Dictyostelium discoideum (Slime mold).
OC   Eukaryota; Amoebozoa; Evosea; Eumycetozoa; Dictyostelia; Dictyosteliales;
OC   Dictyosteliaceae; Dictyostelium.
OX   NCBI_TaxID=44689;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=AX4;
RX   PubMed=12097910; DOI=10.1038/nature00847;
RA   Gloeckner G., Eichinger L., Szafranski K., Pachebat J.A., Bankier A.T.,
RA   Dear P.H., Lehmann R., Baumgart C., Parra G., Abril J.F., Guigo R.,
RA   Kumpf K., Tunggal B., Cox E.C., Quail M.A., Platzer M., Rosenthal A.,
RA   Noegel A.A.;
RT   "Sequence and analysis of chromosome 2 of Dictyostelium discoideum.";
RL   Nature 418:79-85(2002).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=AX4;
RX   PubMed=15875012; DOI=10.1038/nature03481;
RA   Eichinger L., Pachebat J.A., Gloeckner G., Rajandream M.A., Sucgang R.,
RA   Berriman M., Song J., Olsen R., Szafranski K., Xu Q., Tunggal B.,
RA   Kummerfeld S., Madera M., Konfortov B.A., Rivero F., Bankier A.T.,

----------------------------------------------------

Number of accessions:

10634
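
As a rough cross-check, the number of unique submitted IDs can be compared with the number of records, counting the `//` terminator lines instead of AC lines. The sketch below uses small made-up stand-ins for the ID list and the UniProt output, not the real files:

```shell
# Toy stand-ins: a submitted ID list with one duplicate, and two flat-file
# records, the first of which has its accessions split across two AC lines.
ids=$(printf 'Q86IC9\nP04177\nQ86IC9\n')
records=$(printf 'ID   A\nAC   Q86IC9;\nAC   Q552T5;\n//\nID   B\nAC   P04177;\n//\n')

unique_ids=$(printf '%s\n' "$ids" | sort -u | wc -l | tr -d ' ')
ac_lines=$(printf '%s\n' "$records" | grep -c '^AC')
record_count=$(printf '%s\n' "$records" | grep -c '^//')

echo "unique IDs: ${unique_ids}; AC lines: ${ac_lines}; records: ${record_count}"
```

In this toy case the unique ID count matches the record count, while the AC line count is higher.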

Parse the stuff we want¶

UniProt accession
Gene name/abbreviation
Gene description
GO IDs

Check DE descriptor lines to decide pattern matching¶

Checks lines beginning with DE to identify values in the 2nd field with Name in them.

Identifies unique values. This will determine how to parse properly after this.

In [56]:

%%bash
cd "${analysis_dir}"

grep "^DE" "${uniprot_output}" | awk '$2 ~ /Name/ { print $2 }' | sort -u

AltName:
RecName:

In [57]:

%%bash
cd "${analysis_dir}"

# Loop through UniProt records
time \
while read -r line
do
  # Get record line descriptor
  descriptor=$(echo "${line}" | awk '{print $1}')

  # Capture second field for evaluation
  go_line=$(echo "${line}" | awk '{print $2}')

  # Append GO IDs to array
  if [[ "${go_line}" == "GO;" ]]; then
    go_id=$(echo "${line}" | awk '{print $3}')
    go_ids_array+=("${go_id}")
  elif [[ "${go_line}" == "GeneID;" ]]; then
    # Uses sed to strip trailing semi-colon
    gene_id=$(echo "${line}" | awk '{print $3}' | sed 's/;$//')
  fi

  # Get gene description
  if [[ "${descriptor}" == "DE" ]] && [[ "${go_line}" == "RecName:" ]]; then
    # Uses sed to strip trailing spaces at end of line and remove commas
    gene_description=$(echo "${line}" | awk -F "[={]" '{print $2}' | sed 's/[[:blank:]]*$//' | sed 's/,//g' | sed 's/;$//')

  # Get alternate name
  elif [[ "${descriptor}" == "DE" ]] && [[ "${go_line}" == "AltName:" ]]; then
    # Uses sed to strip trailing spaces at end of line and remove commas
    alt_gene_description=$(echo "${line}" | awk -F "[={]" '{print $2}' | sed 's/[[:blank:]]*$//' | sed 's/,//g' | sed 's/;$//')

  # Get gene name
  elif [[ "${descriptor}" == "GN"  ]] && [[ $(echo "${line}" | awk -F "=" '{print $1}') == "GN   Name" ]]; then
    # Uses sed to strip trailing spaces at end of line
    gene=$(echo "${line}" | awk -F 'Name=|{|;' '{print $2}' | sed 's/[[:blank:]]*$//')

  # Get UniProt accession
  elif [[ "${descriptor}" == "AC" ]]; then
    # awk removes "AC" denotation
    # sed removes all spaces
    # sed removes trailing semi-colon
    # Uses array to handle accessions being on multiple lines of UniProt records file
    accession=$(echo "${line}" | awk '{$1="";print $0}' | sed 's/[[:space:]]*//g' | sed 's/;$//')
    accessions_array+=("${accession}")

  # Identify end of record ("//" terminator line)
  elif [[ "${descriptor}" == "//" ]]; then

    # Prints tab-separated fields, then GOID1;GOID2;GOIDn
    # IFS prevents spaces from being added between GO IDs
    # sed removes ";" after final GO ID
    (IFS=; printf "%s\t%s\t%s\t%s\t%s\t%s\n" "${accessions_array[*]}" "${gene_id}" "${gene}" "${gene_description}" "${alt_gene_description}" "${go_ids_array[*]}" | sed 's/;$//')

    # Re-initialize variables
    accession=""  
    accessions_array=()
    descriptor=""
    gene=""
    gene_description=""
    gene_id=""
    go_id=""
    go_ids_array=()
  fi


done < "${uniprot_output}" >> "${parsed_uniprot}"

real	529m51.428s
user	427m33.240s
sys	79m42.704s
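
Most of the runtime above likely comes from starting several awk/sed processes for every input line. A single awk pass is one alternative. The sketch below is not the method used in this notebook: parse_uniprot is a hypothetical helper, it keeps the first RecName, AltName, gene name and GeneID per record (the loop above keeps the last), it joins wrapped AC lines with ";", and the sample records are abridged and partly made up.

```shell
# Sketch only: single-pass awk parser for UniProt flat-file text.
parse_uniprot() {
  awk -v OFS='\t' '
    function reset() { acc = gene_id = gene = desc = alt = go = "" }
    function full_name(line,    s) {
      s = line
      sub(/^.*Full=/, "", s)        # drop everything up to Full=
      sub(/[ \t]*[{;].*$/, "", s)   # drop evidence tags and the trailing semicolon
      gsub(/,/, "", s)              # remove commas, as the loop above does
      return s
    }
    function append(list, item) { return (list == "" ? item : list ";" item) }

    $1 == "AC" {
      s = $0; sub(/^AC +/, "", s); gsub(/[ \t]/, "", s); sub(/;$/, "", s)
      acc = append(acc, s); next
    }
    $1 == "DE" && $2 == "RecName:" && /Full=/ && desc == "" { desc = full_name($0); next }
    $1 == "DE" && $2 == "AltName:" && /Full=/ && alt == ""  { alt = full_name($0); next }
    $1 == "GN" && $2 ~ /^Name=/ && gene == "" {
      s = $0; sub(/^GN +Name=/, "", s); sub(/[ \t]*[{;].*$/, "", s)
      gene = s; next
    }
    $1 == "DR" && $2 == "GeneID;" && gene_id == "" { s = $3; sub(/;$/, "", s); gene_id = s; next }
    $1 == "DR" && $2 == "GO;" { s = $3; sub(/;$/, "", s); go = append(go, s); next }
    # End of record: print the fields, then clear every per-record variable
    $1 == "//" { print acc, gene_id, gene, desc, alt, go; reset() }
  ' "$@"
}

# Abridged toy records (the DR GO term text and the AC line split are made up)
sample=$(cat <<'EOF'
ID   CAMT1_DICDI             Reviewed;         230 AA.
AC   Q86IC9;
AC   Q552T5;
DE   RecName: Full=Probable caffeoyl-CoA O-methyltransferase 1;
DE            EC=2.1.1.104;
DE   AltName: Full=O-methyltransferase 5;
GN   Name=omt5; ORFNames=DDB_G0275499;
DR   GeneID; 8620183; -.
DR   GO; GO:0042409; F:example term; IEA:example.
//
ID   YMD3_CAEEL              Reviewed;         100 AA.
AC   P34457;
DE   RecName: Full=Putative uncharacterized transposon-derived protein F54H12.3;
DR   GO; GO:0003676; F:example term; IEA:example.
//
EOF
)

printf '%s\n' "$sample" | parse_uniprot
```

Because reset() clears every field at each `//` line, the second toy record prints an empty alternate description instead of inheriting one from the first.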

Inspect parsed UniProt file¶

Note: alt_gene_description is not cleared in the re-initialization step of the parsing loop, so a record without an AltName line inherits the previous record's value (e.g. P34456 and P34457 both show the BIRC3 alternate name). Accessions that wrap onto a second AC line are joined without a separator (e.g. Q2TAZ4Q5JP55 in the Q14676 row).

In [58]:

%%bash
cd "${analysis_dir}"

wc -l "${parsed_uniprot}"

echo ""
echo "------------------------------------------------------------------"
echo ""

head -n 25 "${parsed_uniprot}"

10304 20220419-pgen-accession-gene_name-gene_description-go_ids.tab

------------------------------------------------------------------

Q86IC9;Q552T5	8620183	omt5	Probable caffeoyl-CoA O-methyltransferase 1	O-methyltransferase 5	GO:0042409;GO:0046872;GO:0008757;GO:0032259
P04177	25085	Th	Tyrosine 3-monooxygenase	Tyrosine 3-hydroxylase	GO:0030424;GO:0005737;GO:0009898;GO:0031410;GO:0030659;GO:0005829;GO:0030425;GO:0033162;GO:0005739;GO:0043005;GO:0043025;GO:0005634;GO:0043204;GO:0048471;GO:0005790;GO:0008021;GO:0043195;GO:0016597;GO:0035240;GO:0019899;GO:0008199;GO:0008198;GO:0042802;GO:0004497;GO:0019825;GO:0019904;GO:0034617;GO:0004511;GO:0015842;GO:0009887;GO:0042423;GO:0071312;GO:0071333;GO:0071363;GO:0071287;GO:0071316;GO:0071466;GO:0021987;GO:0042745;GO:0050890;GO:0042416;GO:0006585;GO:0042755;GO:0048596;GO:0042418;GO:0042462;GO:0006631;GO:0016137;GO:0007507;GO:1990384;GO:0033076;GO:0007612;GO:0007626;GO:0007617;GO:0007613;GO:0010259;GO:0042136;GO:0042421;GO:0018963;GO:0052314;GO:0008016;GO:0014823;GO:0001975;GO:0051412;GO:0051602;GO:0032355;GO:0045471;GO:0045472;GO:0070848;GO:0009635;GO:0001666;GO:0035902;GO:0017085;GO:0035900;GO:0009416;GO:0032496;GO:0010038;GO:0035094;GO:0031667;GO:0014070;GO:0043434;GO:0046684;GO:0009651;GO:0048545;GO:0009414;GO:0009410;GO:0010043;GO:0007605;GO:0035176;GO:0006665;GO:0001963;GO:0042214;GO:0007601
Q8L840;O04092;Q9FT71	837636	RECQL4A	ATP-dependent DNA helicase Q-like 4A	SGS1-like protein	GO:0005694;GO:0005737;GO:0005634;GO:0009506;GO:0043138;GO:0005524;GO:0016887;GO:0009378;GO:0046872;GO:0003676;GO:0071215;GO:0070417;GO:0006974;GO:0051276;GO:0032508;GO:0006310;GO:0006281;GO:0006268;GO:0000724
Q61043;A0A1Y7VJL5;B2RQ73;B7ZMZ9;E9Q488;E9Q4S3;Q674R4;Q6ZPM7	18080	Nin	Ninein	SGS1-like protein	GO:0045177;GO:0030424;GO:0044295;GO:0120103;GO:0005814;GO:0005813;GO:0097539;GO:0005881;GO:0030425;GO:0072686;GO:0097431;GO:0005730;GO:0005654;GO:0000242;GO:0005886;GO:0000922;GO:0005509;GO:0005525;GO:0019900;GO:0051011;GO:0010457;GO:0051642;GO:0090222;GO:0048668;GO:0021540;GO:0021957;GO:0034454;GO:0050772;GO:0031116;GO:0008104
A1E2V0	489433	BIRC3	Baculoviral IAP repeat-containing protein 3	RING-type E3 ubiquitin transferase BIRC3	GO:0005737;GO:0005829;GO:0005654;GO:0005634;GO:0043027;GO:0046872;GO:0061630;GO:0043066;GO:0060546;GO:0031398;GO:0051726
P34456	186266		Uncharacterized protein F54H12.2	RING-type E3 ubiquitin transferase BIRC3	GO:0005829;GO:0004748;GO:0009263
P34457			Putative uncharacterized transposon-derived protein F54H12.3	RING-type E3 ubiquitin transferase BIRC3	GO:0003676;GO:0015074
O00463;B4DIS9;B4E0A2;Q6FHY1	7188	TRAF5	TNF receptor-associated factor 5	RING finger protein 84	GO:0035631;GO:0005813;GO:0009898;GO:0005829;GO:0042802;GO:0031996;GO:0005164;GO:0031625;GO:0008270;GO:0006915;GO:0097400;GO:0048255;GO:0008284;GO:0051091;GO:0043123;GO:0046330;GO:0051092;GO:0070534;GO:0042981;GO:0043122;GO:0007165;GO:0023019;GO:0033209
Q00945			Neurophysin	RING finger protein 84	GO:0005576;GO:0005185
Q5SWK7;Q8BXX5;Q9CXG1	74315	Rnf145	RING finger protein 145	RING finger protein 84	GO:0012505;GO:0005783;GO:0005789;GO:0016021;GO:0061630;GO:0008270
Q8ZXT3			Uncharacterized protein PAE1111	RING finger protein 84	
Q5REG4	100171717	DTX3	Probable E3 ubiquitin-protein ligase DTX3	RING-type E3 ubiquitin transferase DTX3	GO:0005737;GO:0046872;GO:0016740;GO:0007219;GO:0016567
Q8QG60;Q8QG52;Q8QGQ5	374092	CRY2	Cryptochrome-2	RING-type E3 ubiquitin transferase DTX3	GO:0005737;GO:0005634;GO:0003677;GO:0071949;GO:0009881;GO:0032922;GO:0007623;GO:0043153;GO:0042754;GO:0045892;GO:0018298;GO:0042752;GO:0009416
Q9H583;Q5T3Q8;Q6P197;Q9NW23	55127	HEATR1	HEAT repeat-containing protein 1 N-terminally processed	U3 small nucleolar RNA-associated protein 10 homolog	GO:0030686;GO:0001650;GO:0016020;GO:0005739;GO:0005730;GO:0005654;GO:0032040;GO:0034455;GO:0003723;GO:0030515;GO:0000462;GO:2000234;GO:0045943
A0JMR6	779416	mysm1	Histone H2A deubiquitinase MYSM1	Myb-like SWIRM and MPN domain-containing protein 1	GO:0005634;GO:0003677;GO:0042393;GO:0070122;GO:0046872;GO:0140492;GO:0004843;GO:0003713;GO:0006338;GO:0035522;GO:0045944
O88917;O09026;O35818;O88916	65096	Adgrl1	Adhesion G protein-coupled receptor L1	Latrophilin-1	GO:0030424;GO:0098978;GO:0030426;GO:0005887;GO:0099056;GO:0043005;GO:0005886;GO:0014069;GO:0042734;GO:0045202;GO:0030246;GO:0050839;GO:0004930;GO:0016524;GO:0015643;GO:0007189;GO:0007420;GO:0035584;GO:0007166;GO:0007157;GO:0051965;GO:0090129
Q7D513		egtB	Hercynine oxygenase	Gamma-glutamyl hercynylcysteine S-oxide synthase	GO:0044875;GO:0005506;GO:0004497
Q92968;B2RCS1	5194	PEX13	Peroxisomal membrane protein PEX13	Peroxin-13	GO:0005829;GO:0005779;GO:0016020;GO:1990429;GO:0005778;GO:0005777;GO:0021795;GO:0001561;GO:0007626;GO:0060152;GO:0001764;GO:0016560;GO:0001967
A6H769	505507	RPS7	40S ribosomal protein S7	Peroxin-13	GO:0022627;GO:0005815;GO:0032040;GO:0003735;GO:0042274;GO:0006364;GO:0006412
Q14676;A2AB04;A2BF04;A2RRA8;A7YY86;B0S8A2;Q0EFC2;Q2L6H7;Q2TAZ4Q5JP55;Q5JP56;Q5ST83;Q68CQ3;Q86Z06;Q96QC2	9656	MDC1	Mediator of DNA damage checkpoint protein 1	Nuclear factor with BRCT domains 1	GO:0005694;GO:0005925;GO:0016604;GO:0005654;GO:0005634;GO:0070975;GO:0008022;GO:0006281;GO:0031573
Q54W11	8622324	mcfL	Mitochondrial substrate carrier family protein L	Nuclear factor with BRCT domains 1	GO:0016021;GO:0005743;GO:0055085
Q9FKK7;Q8L759	835871	XYLA	Xylose isomerase	Nuclear factor with BRCT domains 1	GO:0005783;GO:0005794;GO:0000325;GO:0009536;GO:0099503;GO:0046872;GO:0009045;GO:0042843
Q3ZCD7	614105	TECR	Very-long-chain enoyl-CoA reductase	Trans-23-enoyl-CoA reductase	GO:0005783;GO:0030176;GO:0016491;GO:0102758;GO:0030497;GO:0006665;GO:0006694;GO:0042761
Q9D180;A2ACY9	68625	Cfap57	Cilia- and flagella-associated protein 57	WD repeat-containing protein 65	
Q9D6Z0;Q8K1H3;Q9CY41;Q9D942	66400	Alkbh7	Alpha-ketoglutarate-dependent dioxygenase alkB homolog 7 mitochondrial	Alkylated DNA repair protein alkB homolog 7	GO:0005759;GO:0005739;GO:0051213;GO:0046872;GO:0006974;GO:0006631;GO:0010883;GO:1902445

In [59]:

%%html
# Sets markdown table align left in subsequent cell
<style>
  table {margin-left: 0 !important;}
</style>

Combine with original list of genes and SPIDs¶

Output format (tab-delimited):

gene_ID	SPIDs	UniProt_gene_ID	gene	gene_description	alternate_gene_description	GO_IDs

Explanation:

awk -v FS='[;[:space:]]+': Sets the field separator to any run of semicolons and/or whitespace, so the first UniProt accession on each line becomes $1. Allows for proper searching.
NR == FNR: Restricts the next block (designated by {}) to work only on the first input file.
{array[$1]=$0; next}: Stores the entire line ($0) of the first file in the array named array, keyed by its first field (the primary accession), then skips to the next line.
($1 in array): For the second file, checks whether the first column ($1, which is SPID) is a key in the array (which contains the lines from the first file).
{print $2"\t"array[$1]}: If there's a match, prints the second column ($2, which is genome ID) from the second file, a tab, then the stored line from the first file.
"${parsed_uniprot}" "${genome_IDs_SPIDs}": The first and second input files.
"${joined_output}": Result of the join.

In [60]:

%%bash

cd "${analysis_dir}"

awk \
-v FS='[;[:space:]]+' \
'NR==FNR \
{array[$1]=$0; next} \
($1 in array) \
{print $2"\t"array[$1]}' \
"${parsed_uniprot}" "${genome_IDs_SPIDs}" \
> "${joined_output}"
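
The join can be checked on a few toy lines before running it on the real files. The sketch below is only an illustration: the file names (toy_uniprot.tab, toy_spids.tab, toy_joined.tab) and the temporary directory are hypothetical, and the rows are abbreviated from values shown elsewhere in this notebook. It assumes the second file uses the "SPID, whitespace, gene ID" layout.

```shell
# Hypothetical toy inputs; these are NOT the notebook's real files.
tmp_dir="$(mktemp -d)"
toy_uniprot="${tmp_dir}/toy_uniprot.tab"
toy_spids="${tmp_dir}/toy_spids.tab"
toy_joined="${tmp_dir}/toy_joined.tab"

# First file: UniProt accessions (first-listed accession, then others after ';'),
# followed by abbreviated annotation columns.
printf 'Q86IC9;Q552T5\t8620183\tomt5\n' > "${toy_uniprot}"
printf 'P04177\t25085\tTh\n' >> "${toy_uniprot}"

# Second file: SPID, then gene ID. Q3UN21 has no record in the first file.
printf 'P04177 \t PGEN_.00g000020\n' > "${toy_spids}"
printf 'Q86IC9 \t PGEN_.00g000010\n' >> "${toy_spids}"
printf 'Q3UN21 \t PGEN_.00g050810\n' >> "${toy_spids}"

# Same join logic as the cell above, applied to the toy files.
awk -v FS='[;[:space:]]+' '
NR == FNR { array[$1] = $0; next }            # first file: store line keyed by SPID
($1 in array) { print $2 "\t" array[$1] }     # second file: print gene ID + stored line
' "${toy_uniprot}" "${toy_spids}" > "${toy_joined}"

cat "${toy_joined}"
```

Only the two genes whose SPID appears as a first-listed accession in the first file are written; the Q3UN21 line is dropped without any warning, so a gene with an unmatched SPID is simply absent from the joined file.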

Inspect final annotation file¶

In [61]:

%%bash

cd "${analysis_dir}"

wc -l "${joined_output}"

echo ""
echo "------------------------------------------------------------------"
echo ""

head -n 25 "${joined_output}"

14672 20220419-pgen-gene-accessions-gene_id-gene_name-gene_description-alt_gene_description-go_ids.tab

------------------------------------------------------------------

PGEN_.00g000010	Q86IC9;Q552T5	8620183	omt5	Probable caffeoyl-CoA O-methyltransferase 1	O-methyltransferase 5	GO:0042409;GO:0046872;GO:0008757;GO:0032259
PGEN_.00g000020	P04177	25085	Th	Tyrosine 3-monooxygenase	Tyrosine 3-hydroxylase	GO:0030424;GO:0005737;GO:0009898;GO:0031410;GO:0030659;GO:0005829;GO:0030425;GO:0033162;GO:0005739;GO:0043005;GO:0043025;GO:0005634;GO:0043204;GO:0048471;GO:0005790;GO:0008021;GO:0043195;GO:0016597;GO:0035240;GO:0019899;GO:0008199;GO:0008198;GO:0042802;GO:0004497;GO:0019825;GO:0019904;GO:0034617;GO:0004511;GO:0015842;GO:0009887;GO:0042423;GO:0071312;GO:0071333;GO:0071363;GO:0071287;GO:0071316;GO:0071466;GO:0021987;GO:0042745;GO:0050890;GO:0042416;GO:0006585;GO:0042755;GO:0048596;GO:0042418;GO:0042462;GO:0006631;GO:0016137;GO:0007507;GO:1990384;GO:0033076;GO:0007612;GO:0007626;GO:0007617;GO:0007613;GO:0010259;GO:0042136;GO:0042421;GO:0018963;GO:0052314;GO:0008016;GO:0014823;GO:0001975;GO:0051412;GO:0051602;GO:0032355;GO:0045471;GO:0045472;GO:0070848;GO:0009635;GO:0001666;GO:0035902;GO:0017085;GO:0035900;GO:0009416;GO:0032496;GO:0010038;GO:0035094;GO:0031667;GO:0014070;GO:0043434;GO:0046684;GO:0009651;GO:0048545;GO:0009414;GO:0009410;GO:0010043;GO:0007605;GO:0035176;GO:0006665;GO:0001963;GO:0042214;GO:0007601
PGEN_.00g000050	Q8L840;O04092;Q9FT71	837636	RECQL4A	ATP-dependent DNA helicase Q-like 4A	SGS1-like protein	GO:0005694;GO:0005737;GO:0005634;GO:0009506;GO:0043138;GO:0005524;GO:0016887;GO:0009378;GO:0046872;GO:0003676;GO:0071215;GO:0070417;GO:0006974;GO:0051276;GO:0032508;GO:0006310;GO:0006281;GO:0006268;GO:0000724
PGEN_.00g000060	Q61043;A0A1Y7VJL5;B2RQ73;B7ZMZ9;E9Q488;E9Q4S3;Q674R4;Q6ZPM7	18080	Nin	Ninein	SGS1-like protein	GO:0045177;GO:0030424;GO:0044295;GO:0120103;GO:0005814;GO:0005813;GO:0097539;GO:0005881;GO:0030425;GO:0072686;GO:0097431;GO:0005730;GO:0005654;GO:0000242;GO:0005886;GO:0000922;GO:0005509;GO:0005525;GO:0019900;GO:0051011;GO:0010457;GO:0051642;GO:0090222;GO:0048668;GO:0021540;GO:0021957;GO:0034454;GO:0050772;GO:0031116;GO:0008104
PGEN_.00g000080	A1E2V0	489433	BIRC3	Baculoviral IAP repeat-containing protein 3	RING-type E3 ubiquitin transferase BIRC3	GO:0005737;GO:0005829;GO:0005654;GO:0005634;GO:0043027;GO:0046872;GO:0061630;GO:0043066;GO:0060546;GO:0031398;GO:0051726
PGEN_.00g000090	P34456	186266		Uncharacterized protein F54H12.2	RING-type E3 ubiquitin transferase BIRC3	GO:0005829;GO:0004748;GO:0009263
PGEN_.00g000120	P34457			Putative uncharacterized transposon-derived protein F54H12.3	RING-type E3 ubiquitin transferase BIRC3	GO:0003676;GO:0015074
PGEN_.00g000210	O00463;B4DIS9;B4E0A2;Q6FHY1	7188	TRAF5	TNF receptor-associated factor 5	RING finger protein 84	GO:0035631;GO:0005813;GO:0009898;GO:0005829;GO:0042802;GO:0031996;GO:0005164;GO:0031625;GO:0008270;GO:0006915;GO:0097400;GO:0048255;GO:0008284;GO:0051091;GO:0043123;GO:0046330;GO:0051092;GO:0070534;GO:0042981;GO:0043122;GO:0007165;GO:0023019;GO:0033209
PGEN_.00g000230	Q00945			Neurophysin	RING finger protein 84	GO:0005576;GO:0005185
PGEN_.00g000240	Q5SWK7;Q8BXX5;Q9CXG1	74315	Rnf145	RING finger protein 145	RING finger protein 84	GO:0012505;GO:0005783;GO:0005789;GO:0016021;GO:0061630;GO:0008270
PGEN_.00g000280	Q8ZXT3			Uncharacterized protein PAE1111	RING finger protein 84	
PGEN_.00g000300	Q5REG4	100171717	DTX3	Probable E3 ubiquitin-protein ligase DTX3	RING-type E3 ubiquitin transferase DTX3	GO:0005737;GO:0046872;GO:0016740;GO:0007219;GO:0016567
PGEN_.00g000380	Q8QG60;Q8QG52;Q8QGQ5	374092	CRY2	Cryptochrome-2	RING-type E3 ubiquitin transferase DTX3	GO:0005737;GO:0005634;GO:0003677;GO:0071949;GO:0009881;GO:0032922;GO:0007623;GO:0043153;GO:0042754;GO:0045892;GO:0018298;GO:0042752;GO:0009416
PGEN_.00g000440	Q9H583;Q5T3Q8;Q6P197;Q9NW23	55127	HEATR1	HEAT repeat-containing protein 1 N-terminally processed	U3 small nucleolar RNA-associated protein 10 homolog	GO:0030686;GO:0001650;GO:0016020;GO:0005739;GO:0005730;GO:0005654;GO:0032040;GO:0034455;GO:0003723;GO:0030515;GO:0000462;GO:2000234;GO:0045943
PGEN_.00g000450	A0JMR6	779416	mysm1	Histone H2A deubiquitinase MYSM1	Myb-like SWIRM and MPN domain-containing protein 1	GO:0005634;GO:0003677;GO:0042393;GO:0070122;GO:0046872;GO:0140492;GO:0004843;GO:0003713;GO:0006338;GO:0035522;GO:0045944
PGEN_.00g000460	O88917;O09026;O35818;O88916	65096	Adgrl1	Adhesion G protein-coupled receptor L1	Latrophilin-1	GO:0030424;GO:0098978;GO:0030426;GO:0005887;GO:0099056;GO:0043005;GO:0005886;GO:0014069;GO:0042734;GO:0045202;GO:0030246;GO:0050839;GO:0004930;GO:0016524;GO:0015643;GO:0007189;GO:0007420;GO:0035584;GO:0007166;GO:0007157;GO:0051965;GO:0090129
PGEN_.00g000490	Q7D513		egtB	Hercynine oxygenase	Gamma-glutamyl hercynylcysteine S-oxide synthase	GO:0044875;GO:0005506;GO:0004497
PGEN_.00g000520	Q92968;B2RCS1	5194	PEX13	Peroxisomal membrane protein PEX13	Peroxin-13	GO:0005829;GO:0005779;GO:0016020;GO:1990429;GO:0005778;GO:0005777;GO:0021795;GO:0001561;GO:0007626;GO:0060152;GO:0001764;GO:0016560;GO:0001967
PGEN_.00g000530	A6H769	505507	RPS7	40S ribosomal protein S7	Peroxin-13	GO:0022627;GO:0005815;GO:0032040;GO:0003735;GO:0042274;GO:0006364;GO:0006412
PGEN_.00g000540	Q14676;A2AB04;A2BF04;A2RRA8;A7YY86;B0S8A2;Q0EFC2;Q2L6H7;Q2TAZ4Q5JP55;Q5JP56;Q5ST83;Q68CQ3;Q86Z06;Q96QC2	9656	MDC1	Mediator of DNA damage checkpoint protein 1	Nuclear factor with BRCT domains 1	GO:0005694;GO:0005925;GO:0016604;GO:0005654;GO:0005634;GO:0070975;GO:0008022;GO:0006281;GO:0031573
PGEN_.00g000560	Q54W11	8622324	mcfL	Mitochondrial substrate carrier family protein L	Nuclear factor with BRCT domains 1	GO:0016021;GO:0005743;GO:0055085
PGEN_.00g000600	Q9FKK7;Q8L759	835871	XYLA	Xylose isomerase	Nuclear factor with BRCT domains 1	GO:0005783;GO:0005794;GO:0000325;GO:0009536;GO:0099503;GO:0046872;GO:0009045;GO:0042843
PGEN_.00g000660	Q3ZCD7	614105	TECR	Very-long-chain enoyl-CoA reductase	Trans-23-enoyl-CoA reductase	GO:0005783;GO:0030176;GO:0016491;GO:0102758;GO:0030497;GO:0006665;GO:0006694;GO:0042761
PGEN_.00g000670	Q9D180;A2ACY9	68625	Cfap57	Cilia- and flagella-associated protein 57	WD repeat-containing protein 65	
PGEN_.00g000680	Q9D6Z0;Q8K1H3;Q9CY41;Q9D942	66400	Alkbh7	Alpha-ketoglutarate-dependent dioxygenase alkB homolog 7 mitochondrial	Alkylated DNA repair protein alkB homolog 7	GO:0005759;GO:0005739;GO:0051213;GO:0046872;GO:0006974;GO:0006631;GO:0010883;GO:1902445

DO NOT EDIT CELLS BELOW THIS!¶

The cells below are for reference, as they were used to identify a set of obsolete/missing SPIDs. These were then used in a subsequent re-run of the notebook to refine the results. The remaining cells document this process.

The original list contained 14676 SPID entries, but the join produced only 14668 matches, leaving 8 genes unaccounted for.

Let's see if we can figure out why...

Identify differences in lists of genome IDs¶

In [36]:

%%bash

cd "${analysis_dir}"

diff <(awk '{print $1}' "${joined_output}") <(awk '{print $2}' "${genome_IDs_SPIDs}")

2344a2345
> PGEN_.00g050810
5161a5163
> PGEN_.00g119250
6930a6933
> PGEN_.00g162250
7914a7918
> PGEN_.00g185820
8976a8981
> PGEN_.00g209090
9601a9607
> PGEN_.00g222140
11142a11149
> PGEN_.00g258380
12942a12950
> PGEN_.00g304800

---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
<ipython-input-36-0ccf4752df95> in <module>
----> 1 get_ipython().run_cell_magic('bash', '', '\ncd "${analysis_dir}"\n\ndiff <(awk \'{print $1}\' "${joined_output}") <(awk \'{print $2}\' "${genome_IDs_SPIDs}")\n')

~/programs/miniconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2401             with self.builtin_trap:
   2402                 args = (magic_arg_s, cell)
-> 2403                 result = fn(*args, **kwargs)
   2404             return result
   2405 

~/programs/miniconda3/lib/python3.9/site-packages/IPython/core/magics/script.py in named_script_magic(line, cell)
    140             else:
    141                 line = script
--> 142             return self.shebang(line, cell)
    143 
    144         # write a basic docstring:

~/programs/miniconda3/lib/python3.9/site-packages/decorator.py in fun(*args, **kw)
    230             if not kwsyntax:
    231                 args, kw = fix(args, kw, sig)
--> 232             return caller(func, *(extras + args), **kw)
    233     fun.__name__ = func.__name__
    234     fun.__doc__ = func.__doc__

~/programs/miniconda3/lib/python3.9/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    185     # but it's overkill for just that one bit of state.
    186     def magic_deco(arg):
--> 187         call = lambda f, *a, **k: f(*a, **k)
    188 
    189         if callable(arg):

~/programs/miniconda3/lib/python3.9/site-packages/IPython/core/magics/script.py in shebang(self, line, cell)
    243             sys.stderr.flush()
    244         if args.raise_error and p.returncode!=0:
--> 245             raise CalledProcessError(p.returncode, cell, output=out, stderr=err)
    246 
    247     def _run_script(self, p, cell, to_close):

CalledProcessError: Command 'b'\ncd "${analysis_dir}"\n\ndiff <(awk \'{print $1}\' "${joined_output}") <(awk \'{print $2}\' "${genome_IDs_SPIDs}")\n'' returned non-zero exit status 1.
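
The traceback above does not indicate a failed comparison: diff exits with status 1 whenever its inputs differ, and the %%bash magic raises CalledProcessError on any non-zero exit status. A minimal sketch of one way to tolerate that status follows. The toy ID lists and file names are hypothetical stand-ins for the two awk process substitutions used in the cell.

```shell
# Hypothetical toy ID lists standing in for the two awk outputs compared above.
tmp_dir="$(mktemp -d)"
printf 'PGEN_.00g000010\nPGEN_.00g000020\n' > "${tmp_dir}/joined_ids.txt"
printf 'PGEN_.00g000010\nPGEN_.00g000020\nPGEN_.00g050810\n' > "${tmp_dir}/all_ids.txt"

# diff exit status: 0 = identical, 1 = differences found, >1 = real error.
# Capture the status instead of letting it abort the cell.
diff_status=0
diff "${tmp_dir}/joined_ids.txt" "${tmp_dir}/all_ids.txt" \
  > "${tmp_dir}/id_diff.txt" || diff_status=$?

if [ "${diff_status}" -gt 1 ]; then
  echo "diff itself failed with status ${diff_status}" >&2
  exit "${diff_status}"
fi

# Lines starting with '> ' are IDs present only in the second list.
missing_ids="$(sed -n 's/^> //p' "${tmp_dir}/id_diff.txt")"
echo "${missing_ids}"
```

Capturing the status this way keeps the cell from raising while still surfacing a genuine diff error (status greater than 1).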

Get SPIDs of "missing" genome IDs¶

In [21]:

%%bash

cd "${analysis_dir}"

for id in PGEN_.00g050810 PGEN_.00g119250 PGEN_.00g162250 PGEN_.00g185820 PGEN_.00g209090 PGEN_.00g222140 PGEN_.00g258380 PGEN_.00g304800
do
  grep "${id}" "${genome_IDs_SPIDs}"
done

Q3UN21 	 PGEN_.00g050810
P0DN79 	 PGEN_.00g119250
Q6ZRR9 	 PGEN_.00g162250
Q9NPA5 	 PGEN_.00g185820
Q6ZR98 	 PGEN_.00g209090
Q5T699 	 PGEN_.00g222140
Q9NPA5 	 PGEN_.00g258380
Q9NPA5 	 PGEN_.00g304800

Manually look up SPIDs in UniProt¶

Q3UN21: Obsolete/deleted from UniProt.
P0DN79: Obsolete/deleted from UniProt.
Q6ZRR9: Redirects to SPID M0R2J8.
Q9NPA5: Redirects to SPID Q9NTW7.
Q6ZR98: Obsolete/deleted from UniProt.
Q5T699: Obsolete/deleted from UniProt.

Using the information above, I will add code to find/replace the two redirected SPIDs (Q6ZRR9 and Q9NPA5) and then re-run the rest of the notebook.
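
The replacement code itself is not shown in this part of the notebook. The sketch below is an assumption about how the two redirects listed above could be applied with sed, not the code that was actually used. The file names are hypothetical, and the rows are copied from the grep output above.

```shell
# Hypothetical toy copy of a few "SPID <whitespace> gene ID" lines.
tmp_dir="$(mktemp -d)"
toy_spids="${tmp_dir}/toy_spids.tab"
toy_spids_updated="${tmp_dir}/toy_spids_updated.tab"

printf 'Q6ZRR9 \t PGEN_.00g162250\n' > "${toy_spids}"
printf 'Q9NPA5 \t PGEN_.00g185820\n' >> "${toy_spids}"
printf 'Q3UN21 \t PGEN_.00g050810\n' >> "${toy_spids}"

# Replace each redirected SPID only when it is the whole first field:
# anchored at line start and followed by whitespace, which is kept via \1.
sed \
  -e 's/^Q6ZRR9\([[:space:]]\)/M0R2J8\1/' \
  -e 's/^Q9NPA5\([[:space:]]\)/Q9NTW7\1/' \
  "${toy_spids}" > "${toy_spids_updated}"

cat "${toy_spids_updated}"
```

Obsolete SPIDs with no redirect (such as Q3UN21) are left untouched here, so they would still fail to match in the join.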

In [ ]: