%%bash
echo "TODAY'S DATE:"
date
echo "------------"
echo ""
#Display operating system info
lsb_release -a
echo ""
echo "------------"
echo "HOSTNAME: "; hostname
echo ""
echo "------------"
echo "Computer Specs:"
echo ""
lscpu
echo ""
echo "------------"
echo ""
echo "Memory Specs"
echo ""
free -mh
TODAY'S DATE: Thu May 13 09:07:32 PDT 2021 ------------ Distributor ID: Ubuntu Description: Ubuntu 20.04.2 LTS Release: 20.04 Codename: focal ------------ HOSTNAME: computer ------------ Computer Specs: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 45 bits physical, 48 bits virtual CPU(s): 8 On-line CPU(s) list: 0-7 Thread(s) per core: 1 Core(s) per socket: 1 Socket(s): 8 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 165 Model name: Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz Stepping: 2 CPU MHz: 2400.001 BogoMIPS: 4800.00 Hypervisor vendor: VMware Virtualization type: full L1d cache: 256 KiB L1i cache: 256 KiB L2 cache: 2 MiB L3 cache: 128 MiB NUMA node0 CPU(s): 0-7 Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat pku ospke md_clear flush_l1d arch_capabilities ------------ Memory Specs total used free shared buff/cache available Mem: 53Gi 1.4Gi 43Gi 89Mi 9.1Gi 51Gi Swap: 2.0Gi 0B 2.0Gi
No LSB modules are available.
%env
indicates a bash variable; without %env
is Python variable# Set directories, input/output files
%env data_dir=/home/samb/data/O_lurida/genomes
%env analysis_dir=/home/samb/analyses/20210513_olur_NCBI_genome-submission-prep
%env genome_fasta=Olurida_v081.fa
%env orig_gff=Olurida_v081_genome_snap02.all.renamed.putative_function.domain_added.gff
%env new_gff=Olurida_v081_genome_snap02.all.renamed.putative_function.domain_added_no-fasta.gff
%env sqn=20210513_Olurida-v081.sqn
# NCBI verification program
%env table2asn=/home/samb/programs/linux64.table2asn_GFF
# Locus tag prefix from NCBI BioProject PRJNA393719
%env locus_tag=CGC61
env: data_dir=/home/samb/data/O_lurida/genomes env: analysis_dir=/home/samb/analyses/20210513_olur_NCBI_genome-submission-prep env: genome_fasta=Olurida_v081.fa env: orig_gff=Olurida_v081_genome_snap02.all.renamed.putative_function.domain_added.gff env: new_gff=Olurida_v081_genome_snap02.all.renamed.putative_function.domain_added_no-fasta.gff env: sqn=20210513_Olurida-v081.sqn env: table2asn=/home/samb/programs/linux64.table2asn_GFF env: locus_tag=CGC61
%%bash
mkdir --parents "${analysis_dir}"
ls -lh "${analysis_dir}"
total 0
If needing to use URLs:
f54512bd964f45645c34b1e8e403a2b0
FastA: http://owl.fish.washington.edu/halfshell/genomic-databank/Olurida_v081.fa
3ac56372bd62038f264d27eef0883bd3
%%bash
rsync -avp owl:/volume1/web/halfshell/genomic-databank/Olurida_v081.fa "${data_dir}"
rsync -avp owl:/volume1/web/halfshell/genomic-databank/Olurida_v081_genome_snap02.all.renamed.putative_function.domain_added.gff "${data_dir}"
ls -lh "${data_dir}"
receiving incremental file list Olurida_v081.fa sent 30 bytes received 1,143,221,718 bytes 7,447,698.68 bytes/sec total size is 1,143,082,078 speedup is 1.00 receiving incremental file list Olurida_v081_genome_snap02.all.renamed.putative_function.domain_added.gff sent 30 bytes received 3,105,037,889 bytes 11,478,883.25 bytes/sec total size is 3,104,658,743 speedup is 1.00 total 4.0G -rw-rw-rw- 1 samb samb 1.1G Jun 7 2018 Olurida_v081.fa -rw-rw-r-- 1 samb samb 2.9G Dec 13 2019 Olurida_v081_genome_snap02.all.renamed.putative_function.domain_added.gff
GFF is not in a standard format. When generated by MAKER on 20190709, the setting used included the genome sequences appended to the end of the GFF. These need to be removed.
Uses the -n
option to display line numbers before each result.
%%bash
cd ${data_dir}
grep -n -B 5 "^>" ${orig_gff} | head -n 5
11574856-Contig58125 repeatmasker match_part 19166 19724 1427 + . ID=Contig58125:hsp:53138:1.3.0.0;Parent=Contig58125:hit:25067:1.3.0.0;Target=species:Helitron-N14_CGi|genus:RC%252FHelitron 474 1043 + 11574857-Contig58125 repeatmasker match 19780 19810 232 + . ID=Contig58125:hit:25068:1.3.0.0;Name=species:rnd-4_family-195|genus:RC%2FHelitron;Target=species:rnd-4_family-195|genus:RC%2FHelitron 413 443 + 11574858-Contig58125 repeatmasker match_part 19780 19810 232 + . ID=Contig58125:hsp:53139:1.3.0.0;Parent=Contig58125:hit:25068:1.3.0.0;Target=species:rnd-4_family-195|genus:RC%252FHelitron 413 443 + 11574859-### 11574860-##FASTA
%%bash
cd ${data_dir}
head -n 11574858 ${orig_gff} > ${new_gff}
# Check the new file
tail -n 1 ${new_gff}
Contig58125 repeatmasker match_part 19780 19810 232 + . ID=Contig58125:hsp:53139:1.3.0.0;Parent=Contig58125:hit:25068:1.3.0.0;Target=species:rnd-4_family-195|genus:RC%252FHelitron 413 443 +
%%bash
cd ${data_dir}
md5sum ${new_gff} > checksums.md5
Options were taken from here (and are the most basic options):
https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/
See end of notebook for the full help menu with explanations of each/every possible option.
Based on experience, there's a lot werror output, so I've redirected stderr to /dev/null
. Plus, all of that info is described in the resulting output files.
%%bash
time \
${table2asn} \
-M n \
-J \
-c w \
-euk \
-locus-tag-prefix ${locus_tag} \
-gaps-min 10 \
-l unspecified \
-i ${data_dir}/${genome_fasta} \
-f ${data_dir}/${new_gff} \
-o ${analysis_dir}/${sqn} \
-Z \
2> /dev/null
# List files
ls -lh ${analysis_dir}
total 1.1G -rw-rw-r-- 1 samb samb 47M May 13 13:23 20210513_Olurida-v081.dr -rw-rw-r-- 1 samb samb 928M May 13 13:23 20210513_Olurida-v081.sqn -rw-rw-r-- 1 samb samb 867 May 13 13:23 20210513_Olurida-v081.stats -rw-rw-r-- 1 samb samb 67M May 13 13:23 20210513_Olurida-v081.val
real 3m4.836s user 3m4.025s sys 0m0.748s
.stats
explanations are here: https://www.ncbi.nlm.nih.gov/genbank/genome_validation/
.dr
explanations are here: https://www.ncbi.nlm.nih.gov/genbank/asndisc/#fatal
%%bash
cat ${analysis_dir}/20210513_Olurida-v081.stats
Total messages: 293302 ================================================================= 35926 WARNING-level messages exist SEQ_INST.TerminalGap: 2 SEQ_INST.LeadingX: 1 SEQ_FEAT.NotSpliceConsensusDonor: 12488 SEQ_FEAT.NotSpliceConsensusAcceptor: 10811 SEQ_FEAT.IntervalBeginsOrEndsInGap: 72 SEQ_FEAT.ShortExon: 1182 SEQ_FEAT.PartialProblemNotSpliceConsensus3Prime: 6154 SEQ_FEAT.PartialProblemNotSpliceConsensus5Prime: 5214 SEQ_FEAT.PartialProblem5Prime: 2 ================================================================= 257376 ERROR-level messages exist SEQ_INST.ShortSeq: 2 SEQ_DESCR.BioSourceMissing: 26218 SEQ_DESCR.NoPubFound: 1 SEQ_DESCR.NoSourceDescriptor: 1 GENERIC.MissingPubRequirement: 1 SEQ_FEAT.IllegalDbXref: 230426 SEQ_FEAT.FeatureBeginsOrEndsInGap: 1 SEQ_FEAT.ShortIntron: 726 =================================================================
%%bash
head -n 20 ${analysis_dir}/20210513_Olurida-v081.dr
Discrepancy Report Results Summary COUNT_NUCLEOTIDES: 159429 nucleotide Bioseqs are present LONG_NO_ANNOTATION: 46845 bioseqs are longer than 5000nt and have no features NO_ANNOTATION: 133211 bioseqs have no features GAPS: 114914 sequences contain gaps LOW_QUALITY_REGION: 809 sequences contain low quality region FEATURE_COUNT: CDS: 32210 present FEATURE_COUNT: gene: 32210 present FEATURE_COUNT: mRNA: 32210 present PROTEIN_NAMES: All proteins have same name "hypothetical protein" BAD_GENE_STRAND: 3 feature locations conflict with gene location strands FATAL: CONTAINED_CDS: 10 coding regions are completely contained in another coding region, but on the opposite strand. FEATURE_LOCATION_CONFLICT: 16978 features have inconsistent gene locations. FATAL: BACTERIAL_JOINED_FEATURES_NO_EXCEPTION: 29207 coding regions with joined locations have no exceptions SHORT_INTRON: 726 introns are shorter than 10 nt FEATURE_LIST: Feature List Detailed Report