Convert P.acuta GFF to GTF¶

Notebook relies on:¶

GffRead

List computer specs¶

In [1]:

%%bash
echo "TODAY'S DATE:"
date
echo "------------"
echo ""
#Display operating system info
lsb_release -a
echo ""
echo "------------"
echo "HOSTNAME: "; hostname 
echo ""
echo "------------"
echo "Computer Specs:"
echo ""
lscpu
echo ""
echo "------------"
echo ""
echo "Memory Specs"
echo ""
free -mh

TODAY'S DATE:
Tue Jan 31 07:37:12 AM PST 2023
------------

Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.1 LTS
Release:	22.04
Codename:	jammy

------------
HOSTNAME: 
computer

------------
Computer Specs:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   45 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
CPU family:                      6
Model:                           165
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       4
Stepping:                        2
BogoMIPS:                        4800.01
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat flush_l1d arch_capabilities
Hypervisor vendor:               VMware
Virtualization type:             full
L1d cache:                       128 KiB (4 instances)
L1i cache:                       128 KiB (4 instances)
L2 cache:                        1 MiB (4 instances)
L3 cache:                        64 MiB (4 instances)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     KVM: Mitigation: VMX unsupported
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:          Mitigation; IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Unknown: Dependent on hypervisor status
Vulnerability Tsx async abort:   Not affected

------------

Memory Specs

               total        used        free      shared  buff/cache   available
Mem:            54Gi       3.3Gi        47Gi       221Mi       3.6Gi        50Gi
Swap:          2.0Gi          0B       2.0Gi

No LSB modules are available.

Set variables¶

Variables set with %env are environment variables, so they are available inside the %%bash cells.
Variables set without %env are Python variables.

In [2]:

# Set directories, input/output files
%env data_dir=/home/sam/data/P_acuta/genomes
%env analysis_dir=/home/sam/analyses/20230126-pacu-gff_to_gtf
analysis_dir="20230126-pacu-gff_to_gtf"

# Input files (from NCBI)
%env gff=Pocillopora_acuta_HIv2.genes.gff3

# URL of file directory
%env url=https://owl.fish.washington.edu/halfshell/genomic-databank

# Output file(s)
%env gtf=Pocillopora_acuta_HIv2.gtf


# Set program locations
%env gffread=/home/sam/programs/gffread-0.12.7.Linux_x86_64/gffread

env: data_dir=/home/sam/data/P_acuta/genomes
env: analysis_dir=/home/sam/analyses/20230126-pacu-gff_to_gtf
env: gff=Pocillopora_acuta_HIv2.genes.gff3
env: url=https://owl.fish.washington.edu/halfshell/genomic-databank
env: gtf=Pocillopora_acuta_HIv2.gtf
env: gffread=/home/sam/programs/gffread-0.12.7.Linux_x86_64/gffread

Create analysis directory¶

In [3]:

%%bash
# Make analysis and data directories, if they don't exist
mkdir --parents "${analysis_dir}"

mkdir --parents "${data_dir}"

Download GFF¶

In [4]:

%%bash
cd "${data_dir}"

# Download with wget.
# Use --quiet option to prevent wget output from printing too many lines to notebook
# Use --continue to prevent re-downloading the file if it's already been downloaded.
# Use --no-check-certificate to avoid download error from gannet
wget --quiet \
--continue \
--no-check-certificate \
${url}/${gff}

ls -ltrh "${gff}"

-rw-rw-r-- 1 sam sam 55M May 23  2022 Pocillopora_acuta_HIv2.genes.gff3

Examine GFF¶

In [5]:

%%bash
head -n 20 "${data_dir}"/"${gff}"

Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	transcript	151	2746	.	+	.	ID=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	151	172	.	+	0	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	151	172	.	+	0	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	264	304	.	+	2	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	264	304	.	+	2	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	1491	1602	.	+	0	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	1491	1602	.	+	0	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	1889	1990	.	+	2	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	1889	1990	.	+	2	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	2107	2127	.	+	2	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	2107	2127	.	+	2	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	2727	2746	.	+	2	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	2727	2746	.	+	2	Parent=Pocillopora_acuta_HIv2___RNAseq.g24100.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	transcript	12326	13844	.	-	.	ID=Pocillopora_acuta_HIv2___RNAseq.g24101.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	12326	12381	.	-	2	Parent=Pocillopora_acuta_HIv2___RNAseq.g24101.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	12326	12381	.	-	2	Parent=Pocillopora_acuta_HIv2___RNAseq.g24101.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	12709	12765	.	-	2	Parent=Pocillopora_acuta_HIv2___RNAseq.g24101.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	12709	12765	.	-	2	Parent=Pocillopora_acuta_HIv2___RNAseq.g24101.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	13453	13492	.	-	0	Parent=Pocillopora_acuta_HIv2___RNAseq.g24101.t1
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	13453	13492	.	-	0	Parent=Pocillopora_acuta_HIv2___RNAseq.g24101.t1

Convert GFF to GTF¶

In [6]:

%%bash
cd "${data_dir}"

${gffread} -E \
${data_dir}/"${gff}" -T \
1> ${analysis_dir}/"${gtf}" \
2> ${analysis_dir}/gffread-gff_to_gtf.stderr

Inspect GTF¶

In [7]:

%%bash
head ${analysis_dir}/"${gtf}"

Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	transcript	151	2746	.	+	.	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1"; gene_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1"
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	151	172	.	+	.	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1";
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	264	304	.	+	.	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1";
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	1491	1602	.	+	.	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1";
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	1889	1990	.	+	.	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1";
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	2107	2127	.	+	.	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1";
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	2727	2746	.	+	.	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1";
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	151	172	.	+	0	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1";
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	264	304	.	+	2	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1";
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	1491	1602	.	+	0	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1";

Fix malformatted GTF¶

For use with HISAT2's extract_exons.py, each line needs to have a corresponding gene_id.

Check GTF field counts¶

Expecting two values when lines are split on whitespace:

One value for transcript lines, which have 12 fields (i.e. these have the gene_id and corresponding gene name).
A second value for all other lines, which have only 10 fields, because those lines do not have a gene_id and corresponding gene name.

In [8]:

%%bash
# Use awk variable NF (number of fields) to count number of fields on each line
# Followed by sorting and only printing the unique counts
awk '{print NF}' ${analysis_dir}/"${gtf}" | sort -u

10
12
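A per-count tally shows how many lines fall into each group, not just which field counts occur. The following is a minimal sketch, run on a small made-up GTF written to a temporary file; the sample records and the `tmp_gtf` and `field_tally` names are illustrative assumptions, not data from this analysis.

```shell
# Build a tiny example GTF (hypothetical records) in a temporary file
tmp_gtf=$(mktemp)
printf 'chr1\tAUGUSTUS\ttranscript\t1\t100\t.\t+\t.\ttranscript_id "g1.t1"; gene_id "g1.t1"\n' > "${tmp_gtf}"
printf 'chr1\tAUGUSTUS\texon\t1\t50\t.\t+\t.\ttranscript_id "g1.t1";\n' >> "${tmp_gtf}"
printf 'chr1\tAUGUSTUS\tCDS\t1\t50\t.\t+\t0\ttranscript_id "g1.t1";\n' >> "${tmp_gtf}"

# Tally how many lines have each whitespace-delimited field count,
# printed as "<field count>: <number of lines>"
field_tally=$(awk '{print NF}' "${tmp_gtf}" | sort -n | uniq -c | awk '{print $2": "$1}')
echo "${field_tally}"

# Clean up
rm "${tmp_gtf}"
```

On the real file, the number of 12-field lines should equal the number of transcript records.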

Add `gene_id` and corresponding gene name.¶

In [9]:

%%bash
time \
while read -r line
do
  # Count number of whitespace-delimited fields in line
  fields=$(echo ${line} | awk '{print NF}')
  
  # Transcript lines have 12 fields: capture the gene_id value (field 12) and print the line.
  # NOTE: ${line} is unquoted here, so the shell collapses tabs into single spaces
  # and transcript lines are written space-delimited.
  if [[ ${fields} == "12" ]]; then
    gene_id=$(echo ${line} | awk '{print $12}')
    echo ${line}
  fi
  
  # If a line has only 10 fields, print the line and append the captured gene_id
  if [[ ${fields} == "10" ]]; then
    printf "%s%s%s\n" "${line} " "gene_id " "${gene_id};"
  fi

done < ${analysis_dir}/"${gtf}" \
> ${analysis_dir}/reformatted.gtf

# Rename reformatted GTF
mv ${analysis_dir}/reformatted.gtf ${analysis_dir}/"${gtf}"

real	50m49.012s
user	51m37.132s
sys	8m56.399s
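The loop above starts several processes for every line, which likely accounts for most of the roughly 50 minute run time. A single awk pass could do the same job much faster. The following is a minimal sketch, not the method used in this notebook. It assumes the GTF is tab-delimited with attributes in column 9, and that each transcript line precedes its exon/CDS lines, as in the gffread output shown earlier. The sample records and the `tmp_gtf`, `fixed_gtf` and `gene_attr` names are hypothetical.

```shell
# Build a tiny example GTF (hypothetical records) in a temporary file
tmp_gtf=$(mktemp)
printf 'chr1\tAUGUSTUS\ttranscript\t1\t100\t.\t+\t.\ttranscript_id "g1.t1"; gene_id "g1.t1"\n' > "${tmp_gtf}"
printf 'chr1\tAUGUSTUS\texon\t1\t50\t.\t+\t.\ttranscript_id "g1.t1";\n' >> "${tmp_gtf}"

fixed_gtf=$(awk -F'\t' -v OFS='\t' '
  # Lines that already carry a gene_id: remember the attribute and pass the line through unchanged
  $9 ~ /gene_id/ {
    if (match($9, /gene_id "[^"]*"/)) {
      gene_attr = substr($9, RSTART, RLENGTH)
    }
    print
    next
  }
  # All other lines: append the most recently seen gene_id attribute to column 9
  {
    $9 = $9 " " gene_attr ";"
    print
  }
' "${tmp_gtf}")
echo "${fixed_gtf}"

# Clean up
rm "${tmp_gtf}"
```

Unlike the loop, this sketch keeps the transcript lines tab-delimited, so its output would not be byte-identical to the file produced above.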

Inspect GTF¶

In [10]:

%%bash
awk '{print NF}' ${analysis_dir}/"${gtf}" | sort -u

echo ""
echo ""
head ${analysis_dir}/"${gtf}"

12


Pocillopora_acuta_HIv2___Sc0000016 AUGUSTUS transcript 151 2746 . + . transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1"; gene_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1"
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	151	172	.	+	.	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1"; gene_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1";
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	264	304	.	+	.	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1"; gene_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1";
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	1491	1602	.	+	.	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1"; gene_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1";
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	1889	1990	.	+	.	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1"; gene_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1";
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	2107	2127	.	+	.	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1"; gene_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1";
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	exon	2727	2746	.	+	.	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1"; gene_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1";
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	151	172	.	+	0	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1"; gene_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1";
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	264	304	.	+	2	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1"; gene_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1";
Pocillopora_acuta_HIv2___Sc0000016	AUGUSTUS	CDS	1491	1602	.	+	0	transcript_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1"; gene_id "Pocillopora_acuta_HIv2___RNAseq.g24100.t1";
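The field count is an indirect check. Since the goal is a gene_id on every line, it may also be worth counting the lines that lack one, and the lines that are no longer nine tab-delimited columns (the first line of the output above is space-delimited). The following is a minimal sketch on a made-up two-line file; the sample records and variable names are illustrative assumptions.

```shell
# Build a tiny example GTF (hypothetical records); the second line deliberately lacks a gene_id
tmp_gtf=$(mktemp)
printf 'chr1\tAUGUSTUS\texon\t1\t50\t.\t+\t.\ttranscript_id "g1.t1"; gene_id "g1.t1";\n' > "${tmp_gtf}"
printf 'chr1\tAUGUSTUS\tCDS\t1\t50\t.\t+\t0\ttranscript_id "g1.t1";\n' >> "${tmp_gtf}"

# Count lines with no gene_id attribute; 0 would mean every line has one
missing_gene_id=$(awk '!/gene_id "/ {n++} END {print n + 0}' "${tmp_gtf}")
echo "Lines without gene_id: ${missing_gene_id}"

# Count lines that do not have exactly 9 tab-delimited columns
not_nine_cols=$(awk -F'\t' 'NF != 9 {n++} END {print n + 0}' "${tmp_gtf}")
echo "Lines without 9 tab-delimited columns: ${not_nine_cols}"

# Clean up
rm "${tmp_gtf}"
```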

Generate checksum(s)¶

In [11]:

%%bash
cd "${analysis_dir}"

for file in *
do
  md5sum "${file}" | tee --append checksums.md5
done

1855e771130b6ad5e66c178c0b881e0b  gffread-gff_to_gtf.stderr
34196bd945eb4965e665097648037132  Pocillopora_acuta_HIv2.gtf
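The recorded checksums can later be verified with GNU `md5sum --check`. The following is a minimal sketch in a temporary directory with a made-up file; `work_dir` and `example.gtf` are hypothetical names. It also skips `checksums.md5` itself, since a loop over `*` would otherwise checksum the checksum file when it already exists.

```shell
# Work in a throwaway directory with one example file (hypothetical)
work_dir=$(mktemp -d)
cd "${work_dir}"
printf 'example\n' > example.gtf

# Record checksums for every file except the checksum file itself
for file in *
do
  [ "${file}" = "checksums.md5" ] && continue
  md5sum "${file}"
done > checksums.md5

# Verify; md5sum exits non-zero if any listed file fails the check
check_result=$(md5sum --check checksums.md5)
echo "${check_result}"

# Clean up
cd - > /dev/null
rm -r "${work_dir}"
```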

Document GffRead program options¶

In [12]:

%%bash
${gffread} -h

gffread v0.12.7. Usage:
gffread [-g <genomic_seqs_fasta> | <dir>] [-s <seq_info.fsize>] 
 [-o <outfile>] [-t <trackname>] [-r [<strand>]<chr>:<start>-<end> [-R]]
 [--jmatch <chr>:<start>-<end>] [--no-pseudo] 
 [-CTVNJMKQAFPGUBHZWTOLE] [-w <exons.fa>] [-x <cds.fa>] [-y <tr_cds.fa>]
 [-j ][--ids <IDs.lst> | --nids <IDs.lst>] [--attrs <attr-list>] [-i <maxintron>]
 [--stream] [--bed | --gtf | --tlf] [--table <attrlist>] [--sort-by <ref.lst>]
 [<input_gff>] 

 Filter, convert or cluster GFF/GTF/BED records, extract the sequence of
 transcripts (exon or CDS) and more.
 By default (i.e. without -O) only transcripts are processed, discarding any
 other non-transcript features. Default output is a simplified GFF3 with only
 the basic attributes.
 
Options:
 --ids discard records/transcripts if their IDs are not listed in <IDs.lst>
 --nids discard records/transcripts if their IDs are listed in <IDs.lst>
 -i   discard transcripts having an intron larger than <maxintron>
 -l   discard transcripts shorter than <minlen> bases
 -r   only show transcripts overlapping coordinate range <start>..<end>
      (on chromosome/contig <chr>, strand <strand> if provided)
 -R   for -r option, discard all transcripts that are not fully 
      contained within the given range
 --jmatch only output transcripts matching the given junction
 -U   discard single-exon transcripts
 -C   coding only: discard mRNAs that have no CDS features
 --nc non-coding only: discard mRNAs that have CDS features
 --ignore-locus : discard locus features and attributes found in the input
 -A   use the description field from <seq_info.fsize> and add it
      as the value for a 'descr' attribute to the GFF record
 -s   <seq_info.fsize> is a tab-delimited file providing this info
      for each of the mapped sequences:
      <seq-name> <seq-length> <seq-description>
      (useful for -A option with mRNA/EST/protein mappings)
Sorting: (by default, chromosomes are kept in the order they were found)
 --sort-alpha : chromosomes (reference sequences) are sorted alphabetically
 --sort-by : sort the reference sequences by the order in which their
      names are given in the <refseq.lst> file
Misc options: 
 -F   keep all GFF attributes (for non-exon features)
 --keep-exon-attrs : for -F option, do not attempt to reduce redundant
      exon/CDS attributes
 -G   do not keep exon attributes, move them to the transcript feature
      (for GFF3 output)
 --attrs <attr-list> only output the GTF/GFF attributes listed in <attr-list>
    which is a comma delimited list of attribute names to
 --keep-genes : in transcript-only mode (default), also preserve gene records
 --keep-comments: for GFF3 input/output, try to preserve comments
 -O   process other non-transcript GFF records (by default non-transcript
      records are ignored)
 -V   discard any mRNAs with CDS having in-frame stop codons (requires -g)
 -H   for -V option, check and adjust the starting CDS phase
      if the original phase leads to a translation with an 
      in-frame stop codon
 -B   for -V option, single-exon transcripts are also checked on the
      opposite strand (requires -g)
 -P   add transcript level GFF attributes about the coding status of each
      transcript, including partialness or in-frame stop codons (requires -g)
 --add-hasCDS : add a "hasCDS" attribute with value "true" for transcripts
      that have CDS features
 --adj-stop stop codon adjustment: enables -P and performs automatic
      adjustment of the CDS stop coordinate if premature or downstream
 -N   discard multi-exon mRNAs that have any intron with a non-canonical
      splice site consensus (i.e. not GT-AG, GC-AG or AT-AC)
 -J   discard any mRNAs that either lack initial START codon
      or the terminal STOP codon, or have an in-frame stop codon
      (i.e. only print mRNAs with a complete CDS)
 --no-pseudo: filter out records matching the 'pseudo' keyword
 --in-bed: input should be parsed as BED format (automatic if the input
           filename ends with .bed*)
 --in-tlf: input GFF-like one-line-per-transcript format without exon/CDS
           features (see --tlf option below); automatic if the input
           filename ends with .tlf)
 --stream: fast processing of input GFF/BED transcripts as they are received
           ((no sorting, exons must be grouped by transcript in the input data)
Clustering:
 -M/--merge : cluster the input transcripts into loci, discarding
      "redundant" transcripts (those with the same exact introns
      and fully contained or equal boundaries)
 -d <dupinfo> : for -M option, write duplication info to file <dupinfo>
 --cluster-only: same as -M/--merge but without discarding any of the
      "duplicate" transcripts, only create "locus" features
 -K   for -M option: also discard as redundant the shorter, fully contained
       transcripts (intron chains matching a part of the container)
 -Q   for -M option, no longer require boundary containment when assessing
      redundancy (can be combined with -K); only introns have to match for
      multi-exon transcripts, and >=80% overlap for single-exon transcripts
 -Y   for -M option, enforce -Q but also discard overlapping single-exon 
      transcripts, even on the opposite strand (can be combined with -K)
Output options:
 --force-exons: make sure that the lowest level GFF features are considered
       "exon" features
 --gene2exon: for single-line genes not parenting any transcripts, add an
       exon feature spanning the entire gene (treat it as a transcript)
 --t-adopt:  try to find a parent gene overlapping/containing a transcript
       that does not have any explicit gene Parent
 -D    decode url encoded characters within attributes
 -Z    merge very close exons into a single exon (when intron size<4)
 -g   full path to a multi-fasta file with the genomic sequences
      for all input mappings, OR a directory with single-fasta files
      (one per genomic sequence, with file names matching sequence names)
 -j    output the junctions and the corresponding transcripts
 -w    write a fasta file with spliced exons for each transcript
 --w-add <N> for the -w option, extract additional <N> bases
       both upstream and downstream of the transcript boundaries
 --w-nocds for -w, disable the output of CDS info in the FASTA file
 -x    write a fasta file with spliced CDS for each GFF transcript
 -y    write a protein fasta file with the translation of CDS for each record
 -W    for -w, -x and -y options, write in the FASTA defline all the exon
       coordinates projected onto the spliced sequence;
 -S    for -y option, use '*' instead of '.' as stop codon translation
 -L    Ensembl GTF to GFF3 conversion, adds version to IDs
 -m    <chr_replace> is a name mapping table for converting reference 
       sequence names, having this 2-column format:
       <original_ref_ID> <new_ref_ID>
 -t    use <trackname> in the 2nd column of each GFF/GTF output line
 -o    write the output records into <outfile> instead of stdout
 -T    main output will be GTF instead of GFF3
 --bed output records in BED format instead of default GFF3
 --tlf output "transcript line format" which is like GFF
       but with exons and CDS related features stored as GFF 
       attributes in the transcript feature line, like this:
         exoncount=N;exons=<exons>;CDSphase=<N>;CDS=<CDScoords> 
       <exons> is a comma-delimited list of exon_start-exon_end coordinates;
       <CDScoords> is CDS_start:CDS_end coordinates or a list like <exons>
 --table output a simple tab delimited format instead of GFF, with columns
       having the values of GFF attributes given in <attrlist>; special
       pseudo-attributes (prefixed by @) are recognized:
       @id, @geneid, @chr, @start, @end, @strand, @numexons, @exons, 
       @cds, @covlen, @cdslen
       If any of -w/-y/-x FASTA output files are enabled, the same fields
       (excluding @id) are appended to the definition line of corresponding
       FASTA records
 -v,-E expose (warn about) duplicate transcript IDs and other potential
       problems with the given GFF/GTF records

---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
/tmp/ipykernel_249572/1000630337.py in <module>
----> 1 get_ipython().run_cell_magic('bash', '', '${gffread} -h\n')

~/programs/miniconda3/envs/gffutils_env/lib/python3.9/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2417             with self.builtin_trap:
   2418                 args = (magic_arg_s, cell)
-> 2419                 result = fn(*args, **kwargs)
   2420             return result
   2421 

~/programs/miniconda3/envs/gffutils_env/lib/python3.9/site-packages/IPython/core/magics/script.py in named_script_magic(line, cell)
    140             else:
    141                 line = script
--> 142             return self.shebang(line, cell)
    143 
    144         # write a basic docstring:

~/programs/miniconda3/envs/gffutils_env/lib/python3.9/site-packages/decorator.py in fun(*args, **kw)
    230             if not kwsyntax:
    231                 args, kw = fix(args, kw, sig)
--> 232             return caller(func, *(extras + args), **kw)
    233     fun.__name__ = func.__name__
    234     fun.__doc__ = func.__doc__

~/programs/miniconda3/envs/gffutils_env/lib/python3.9/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    185     # but it's overkill for just that one bit of state.
    186     def magic_deco(arg):
--> 187         call = lambda f, *a, **k: f(*a, **k)
    188 
    189         if callable(arg):

~/programs/miniconda3/envs/gffutils_env/lib/python3.9/site-packages/IPython/core/magics/script.py in shebang(self, line, cell)
    243             sys.stderr.flush()
    244         if args.raise_error and p.returncode!=0:
--> 245             raise CalledProcessError(p.returncode, cell, output=out, stderr=err)
    246 
    247     def _run_script(self, p, cell, to_close):

CalledProcessError: Command 'b'${gffread} -h\n'' returned non-zero exit status 1.
