Notebook

Create C.virginia long, non-coding RNA files.¶

Downloads files from NCBI.¶

Notebook relies on:¶

GffRead
GFFutils available in your $PATH.
- I accomplished this by creating/activating a conda environment for GFFutils and running this notebook from within that environment.
samtools.

Resulting files will be used for C.virginica RNAseq/DML sex/OA project (GitHub repo)¶

List computer specs¶

In [1]:

%%bash
echo "TODAY'S DATE:"
date
echo "------------"
echo ""
#Display operating system info
lsb_release -a
echo ""
echo "------------"
echo "HOSTNAME: "; hostname 
echo ""
echo "------------"
echo "Computer Specs:"
echo ""
lscpu
echo ""
echo "------------"
echo ""
echo "Memory Specs"
echo ""
free -mh

TODAY'S DATE:
Fri 18 Feb 2022 07:10:15 AM PST
------------

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.3 LTS
Release:	20.04
Codename:	focal

------------
HOSTNAME: 
computer

------------
Computer Specs:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   45 bits physical, 48 bits virtual
CPU(s):                          2
On-line CPU(s) list:             0,1
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       2
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           165
Model name:                      Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
Stepping:                        2
CPU MHz:                         2400.008
BogoMIPS:                        4800.01
Hypervisor vendor:               VMware
Virtualization type:             full
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        512 KiB
L3 cache:                        32 MiB
NUMA node0 CPU(s):               0,1
Vulnerability Itlb multihit:     KVM: Mitigation: VMX unsupported
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat flush_l1d arch_capabilities

------------

Memory Specs

              total        used        free      shared  buff/cache   available
Mem:           54Gi       3.2Gi        46Gi       138Mi       5.1Gi        50Gi
Swap:         2.0Gi          0B       2.0Gi

No LSB modules are available.

Set variables¶

%env indicates a bash variable
without %env is Python variable

In [2]:

# Set directories, input/output files
%env data_dir=/home/sam/data/C_virginica/genomes
%env analysis_dir=/home/sam/analyses/20220217-cvir-lncRNA_subsetting
analysis_dir="20220217-cvir-lncRNA_subsetting"

# Input files (from NCBI)
%env ncbi_fasta=GCF_002022765.2_C_virginica-3.0_genomic.fna
%env ncbi_fasta_index=GCF_002022765.2_C_virginica-3.0_genomic.fna.fai
%env ncbi_fasta_gz=GCF_002022765.2_C_virginica-3.0_genomic.fna.gz
%env ncbi_gff=GCF_002022765.2_C_virginica-3.0_genomic.gff
%env ncbi_gff_gz=GCF_002022765.2_C_virginica-3.0_genomic.gff.gz

# URL to download files from NCBI
%env ncbi_url=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0

# Output files
%env lncRNA_bed=GCF_002022765.2_C_virginica-3.0_lncRNA.bed
%env lncRNA_gff=GCF_002022765.2_C_virginica-3.0_lncRNA.gff
%env lncRNA_gtf=GCF_002022765.2_C_virginica-3.0_lncRNA.gtf
%env lncRNA_fasta=GCF_002022765.2_C_virginica-3.0_lncRNA.fa
%env lncRNA_fasta_index=GCF_002022765.2_C_virginica-3.0_lncRNA.fa.fai

# Set program locations
%env gffread=/home/sam/programs/gffread-0.12.7.Linux_x86_64/gffread
%env samtools=/home/sam/programs/samtools-1.12/samtools

env: data_dir=/home/sam/data/C_virginica/genomes
env: analysis_dir=/home/sam/analyses/20220217-cvir-lncRNA_subsetting
env: ncbi_fasta=GCF_002022765.2_C_virginica-3.0_genomic.fna
env: ncbi_fasta_index=GCF_002022765.2_C_virginica-3.0_genomic.fna.fai
env: ncbi_fasta_gz=GCF_002022765.2_C_virginica-3.0_genomic.fna.gz
env: ncbi_gff=GCF_002022765.2_C_virginica-3.0_genomic.gff
env: ncbi_gff_gz=GCF_002022765.2_C_virginica-3.0_genomic.gff.gz
env: ncbi_url=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0
env: lncRNA_bed=GCF_002022765.2_C_virginica-3.0_lncRNA.bed
env: lncRNA_gff=GCF_002022765.2_C_virginica-3.0_lncRNA.gff
env: lncRNA_gtf=GCF_002022765.2_C_virginica-3.0_lncRNA.gtf
env: lncRNA_fasta=GCF_002022765.2_C_virginica-3.0_lncRNA.fa
env: lncRNA_fasta_index=GCF_002022765.2_C_virginica-3.0_lncRNA.fa.fai
env: gffread=/home/sam/programs/gffread-0.12.7.Linux_x86_64/gffread
env: samtools=/home/sam/programs/samtools-1.12/samtools

Create analysis directory¶

In [3]:

%%bash
# Make analysis directory, if it doesn't exist
mkdir --parents "${analysis_dir}"

Download GFF¶

In [4]:

%%bash
cd "${data_dir}"

# Download with wget.
# Use --quiet option to prevent wget output from printing too many lines to notebook
# Use --continue to prevent re-downloading fie if it's already been downloaded.
wget --quiet \
--continue \
${ncbi_url}/${ncbi_gff_gz}

# Unzip download GFF
gunzip "${ncbi_gff_gz}"

ls -ltrh "${ncbi_gff}"

-rw-rw-r-- 1 sam sam 412M Dec 10  2019 GCF_002022765.2_C_virginica-3.0_genomic.gff

gzip: GCF_002022765.2_C_virginica-3.0_genomic.gff already exists;	not overwritten

Examine GFF¶

In [5]:

%%bash
head -n 20 "${data_dir}"/"${ncbi_gff}"

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build C_virginica-3.0
#!genome-build-accession NCBI_Assembly:GCF_002022765.2
#!annotation-source NCBI Crassostrea virginica Annotation Release 100
##sequence-region NC_035780.1 1 65668440
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6565
NC_035780.1	RefSeq	region	1	65668440	.	+	.	ID=NC_035780.1:1..65668440;Dbxref=taxon:6565;Name=1;chromosome=1;collection-date=22-Mar-2015;country=USA;gbkey=Src;genome=chromosome;isolate=RU13XGHG1-28;isolation-source=Rutgers Haskin Shellfish Research Laboratory inbred lines (NJ);mol_type=genomic DNA;tissue-type=whole sample
NC_035780.1	Gnomon	gene	13578	14594	.	+	.	ID=gene-LOC111116054;Dbxref=GeneID:111116054;Name=LOC111116054;gbkey=Gene;gene=LOC111116054;gene_biotype=lncRNA
NC_035780.1	Gnomon	lnc_RNA	13578	14594	.	+	.	ID=rna-XR_002636969.1;Parent=gene-LOC111116054;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;Name=XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1	Gnomon	exon	13578	13603	.	+	.	ID=exon-XR_002636969.1-1;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1	Gnomon	exon	14237	14290	.	+	.	ID=exon-XR_002636969.1-2;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1	Gnomon	exon	14557	14594	.	+	.	ID=exon-XR_002636969.1-3;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1	Gnomon	gene	28961	33324	.	+	.	ID=gene-LOC111126949;Dbxref=GeneID:111126949;Name=LOC111126949;gbkey=Gene;gene=LOC111126949;gene_biotype=protein_coding
NC_035780.1	Gnomon	mRNA	28961	33324	.	+	.	ID=rna-XM_022471938.1;Parent=gene-LOC111126949;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	Gnomon	exon	28961	29073	.	+	.	ID=exon-XM_022471938.1-1;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	Gnomon	exon	30524	31557	.	+	.	ID=exon-XM_022471938.1-2;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	Gnomon	exon	31736	31887	.	+	.	ID=exon-XM_022471938.1-3;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1	Gnomon	exon	31977	32565	.	+	.	ID=exon-XM_022471938.1-4;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1

Download NCBI genomic FastA¶

In [6]:

%%bash
cd "${data_dir}"

# Download with wget.
# Use --quiet option to prevent wget output from printing too many lines to notebook
# Use --continue to prevent re-downloading fie if it's already been downloaded.
wget --quiet \
--continue \
${ncbi_url}/${ncbi_fasta_gz}

# Unzip download GFF
gunzip "${ncbi_fasta_gz}"

ls -ltrh "${ncbi_fasta}"

-rw-rw-r-- 1 sam sam 662M Dec 10  2019 GCF_002022765.2_C_virginica-3.0_genomic.fna

gzip: GCF_002022765.2_C_virginica-3.0_genomic.fna already exists;	not overwritten

Create FastA index with Samtools ¶

In [7]:

%%bash
cd "${data_dir}"

${samtools} faidx "${ncbi_fasta}"

ls -ltrh "${ncbi_fasta_index}"

-rw-rw-r-- 1 sam sam 398 Feb 18 07:10 GCF_002022765.2_C_virginica-3.0_genomic.fna.fai

Inspect NCBI genomic FastA index¶

In [8]:

%%bash
cd "${data_dir}"

head "${ncbi_fasta_index}"

NC_035780.1	65668440	117	80	81
NC_035781.1	61752955	66489530	80	81
NC_035782.1	77061148	129014514	80	81
NC_035783.1	59691872	207039044	80	81
NC_035784.1	98698416	267477182	80	81
NC_035785.1	51258098	367409446	80	81
NC_035786.1	57830854	419308388	80	81
NC_035787.1	75944018	477862245	80	81
NC_035788.1	104168038	554755681	80	81
NC_035789.1	32650045	660225938	80	81

Extracts lncRNAs from genomic GFF using `gtf_extract` from GFFutils ¶

In [9]:

%%bash
cd "${data_dir}"

# Capture GFF header from NCBI gff
head -n 7 "${ncbi_gff}" > ${analysis_dir}/"${lncRNA_gff}"

# Add note about modification
printf "#%s%s\n" "!" "lncRNA only - created by Sam White $(date)" >> ${analysis_dir}/"${lncRNA_gff}"


# Finds lncRNAs in NCBI GFF
gtf_extract \
--feature lnc_RNA \
--gff "${ncbi_gff}" \
>> ${analysis_dir}/"${lncRNA_gff}"


head ${analysis_dir}/"${lncRNA_gff}"

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build C_virginica-3.0
#!genome-build-accession NCBI_Assembly:GCF_002022765.2
#!annotation-source NCBI Crassostrea virginica Annotation Release 100
##sequence-region NC_035780.1 1 65668440
#!lncRNA only - created by Sam White Fri 18 Feb 2022 07:10:32 AM PST
NC_035780.1	Gnomon	lnc_RNA	13578	14594	.	+	.	ID=rna-XR_002636969.1;Parent=gene-LOC111116054;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;Name=XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1	Gnomon	lnc_RNA	169468	170178	.	-	.	ID=rna-XR_002635081.1;Parent=gene-LOC111105702;Dbxref=GeneID:111105702,Genbank:XR_002635081.1;Name=XR_002635081.1;gbkey=ncRNA;gene=LOC111105702;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 3 samples with support for all annotated introns;product=uncharacterized LOC111105702;transcript_id=XR_002635081.1

Extract lncRNAs to BED using GffRead ¶

In [10]:

%%bash
cd "${data_dir}"

${gffread} --bed \
${analysis_dir}/"${lncRNA_gff}" \
> ${analysis_dir}/"${lncRNA_bed}"

Inspect lncRNA BED¶

In [11]:

%%bash
head ${analysis_dir}/"${lncRNA_bed}"

NC_035780.1	13577	14594	rna-XR_002636969.1	100	+	13577	14594	0,0,0	1	1017,	0,	geneID=gene-LOC111116054;gene_name=LOC111116054
NC_035780.1	169467	170178	rna-XR_002635081.1	100	-	169467	170178	0,0,0	1	711,	0,	geneID=gene-LOC111105702;gene_name=LOC111105702
NC_035780.1	900325	903430	rna-XR_002636046.1	100	+	900325	903430	0,0,0	1	3105,	0,	geneID=gene-LOC111111519;gene_name=LOC111111519
NC_035780.1	1280830	1282416	rna-XR_002638148.1	100	-	1280830	1282416	0,0,0	1	1586,	0,	geneID=gene-LOC111124195;gene_name=LOC111124195
NC_035780.1	1432943	1458091	rna-XR_002639675.1	100	+	1432943	1458091	0,0,0	1	25148,	0,	geneID=gene-LOC111135942;gene_name=LOC111135942
NC_035780.1	1503801	1513830	rna-XR_002636574.1	100	-	1503801	1513830	0,0,0	1	10029,	0,	geneID=gene-LOC111114441;gene_name=LOC111114441
NC_035780.1	1856840	1863683	rna-XR_002636864.1	100	-	1856840	1863683	0,0,0	1	6843,	0,	geneID=gene-LOC111115591;gene_name=LOC111115591
NC_035780.1	1856840	1863697	rna-XR_002636863.1	100	-	1856840	1863697	0,0,0	1	6857,	0,	geneID=gene-LOC111115591;gene_name=LOC111115591
NC_035780.1	2161222	2166803	rna-XR_002635698.1	100	+	2161222	2166803	0,0,0	1	5581,	0,	geneID=gene-LOC111109763;gene_name=LOC111109763
NC_035780.1	2928483	2930094	rna-XR_002637875.1	100	-	2928483	2930094	0,0,0	1	1611,	0,	geneID=gene-LOC111122009;gene_name=LOC111122009

Convert lncRNA GFF to GTF¶

In [12]:

%%bash
cd "${data_dir}"

${gffread} -E \
${analysis_dir}/"${lncRNA_gff}" -T \
1> ${analysis_dir}/"${lncRNA_gtf}" \
2> ${analysis_dir}/gffread-lncRNA_gff-to-lncRNA_gtf.stderr

Inspect lncRNA GTF¶

In [13]:

%%bash
head ${analysis_dir}/"${lncRNA_gtf}"

NC_035780.1	Gnomon	transcript	13578	14594	.	+	.	transcript_id "rna-XR_002636969.1"; gene_id "gene-LOC111116054"; gene_name "LOC111116054"
NC_035780.1	Gnomon	exon	13578	14594	.	+	.	transcript_id "rna-XR_002636969.1"; gene_id "gene-LOC111116054"; gene_name "LOC111116054";
NC_035780.1	Gnomon	transcript	169468	170178	.	-	.	transcript_id "rna-XR_002635081.1"; gene_id "gene-LOC111105702"; gene_name "LOC111105702"
NC_035780.1	Gnomon	exon	169468	170178	.	-	.	transcript_id "rna-XR_002635081.1"; gene_id "gene-LOC111105702"; gene_name "LOC111105702";
NC_035780.1	Gnomon	transcript	900326	903430	.	+	.	transcript_id "rna-XR_002636046.1"; gene_id "gene-LOC111111519"; gene_name "LOC111111519"
NC_035780.1	Gnomon	exon	900326	903430	.	+	.	transcript_id "rna-XR_002636046.1"; gene_id "gene-LOC111111519"; gene_name "LOC111111519";
NC_035780.1	Gnomon	transcript	1280831	1282416	.	-	.	transcript_id "rna-XR_002638148.1"; gene_id "gene-LOC111124195"; gene_name "LOC111124195"
NC_035780.1	Gnomon	exon	1280831	1282416	.	-	.	transcript_id "rna-XR_002638148.1"; gene_id "gene-LOC111124195"; gene_name "LOC111124195";
NC_035780.1	Gnomon	transcript	1432944	1458091	.	+	.	transcript_id "rna-XR_002639675.1"; gene_id "gene-LOC111135942"; gene_name "LOC111135942"
NC_035780.1	Gnomon	exon	1432944	1458091	.	+	.	transcript_id "rna-XR_002639675.1"; gene_id "gene-LOC111135942"; gene_name "LOC111135942";

Exract lncRNAs to FastA¶

Explanation of GffRead options used below:

-w: specifies output FastA file
-W: specifies to write coordinates of all exons spliced in FastA deflines
-g: specifies input FastA (needs to have a corresponding FastA index file in same directory)

In [14]:

%%bash
cd "${data_dir}"

${gffread} -E \
-w ${analysis_dir}/"${lncRNA_fasta}" -W \
-g "${ncbi_fasta}" \
${analysis_dir}/"${lncRNA_gtf}" \
2> ${analysis_dir}/gffread_lncRNA-fasta-extraction.stderr

Inspect lncRNA FastA¶

In [15]:

%%bash
head ${analysis_dir}/"${lncRNA_fasta}"

>rna-XR_002636969.1 loc:NC_035780.1|13578-14594|+ exons:13578-14594 segs:1-1017
tgatattgttgtgtGCAGAACGTggggtaagaaaacatgcaacactcataatattttacaatctgtctaG
TTTTCGTTGGACACATCCCACATACTAGAGGAAGGTCAGAAGCATGGGGGTGGTGGCATgctttttacac
tgaatgatcggcagtttgcagtgttcaactccaaatctcttctatgcacaaatcaaataacaaactttac
aCAGCTGTTACATGGAAAGTacctacatattttcataatggaaagaaataattatgaccatcacactgta
ttgaatttactagagaatatattgacttagaaggtttttttttaactttgtactggctgccaggcatgat
aacatgctacatcatacatgttgacttttaatcatcttaatagaagtaaaaacaataaaggtaatctctc
tgaaataaacttttattgatgaatgcattgatatgtatacatgtatgtcatcacagttttctcactatca
ttcctgaaatgtacagtgtcagctgatgtcatgatgatctacattttacataaaaattttcctCCTGAGA
TAAAAAGCGCAGATTAATATTTCACTCAATCccattttaactgttttattatacatattaactcttaaac

Create lncRNA FastA index¶

In [16]:

%%bash
cd "${analysis_dir}"

${samtools} faidx "${lncRNA_fasta}"

ls -ltrh "${lncRNA_fasta_index}"

-rw-rw-r-- 1 sam sam 179K Feb 18 07:11 GCF_002022765.2_C_virginica-3.0_lncRNA.fa.fai

Inspect lncRNA FastA index¶

In [17]:

%%bash
cd "${analysis_dir}"

head "${lncRNA_fasta_index}"

rna-XR_002636969.1	1017	80	70	71
rna-XR_002635081.1	711	1195	70	71
rna-XR_002636046.1	3105	2001	70	71
rna-XR_002638148.1	1586	5239	70	71
rna-XR_002639675.1	25148	6937	70	71
rna-XR_002636574.1	10029	32534	70	71
rna-XR_002636864.1	6843	42795	70	71
rna-XR_002636863.1	6857	49824	70	71
rna-XR_002635698.1	5581	56867	70	71
rna-XR_002637875.1	1611	62616	70	71

Generate checksums¶

In [18]:

%%bash
cd "${analysis_dir}"

for file in *
do
  md5sum "${file}" | tee --append checksums.md5
done

28de37c9ee1308ac1175397d16b3aafe  GCF_002022765.2_C_virginica-3.0_lncRNA.bed
7fac9e7191915f763cc7f5d22838ac25  GCF_002022765.2_C_virginica-3.0_lncRNA.fa
1b43db284950abc07afb5f50164fb264  GCF_002022765.2_C_virginica-3.0_lncRNA.fa.fai
00755b8c80166cdec94b09f231ef440a  GCF_002022765.2_C_virginica-3.0_lncRNA.gff
dedab056acd679cf4eab83629882ee10  GCF_002022765.2_C_virginica-3.0_lncRNA.gtf
7ec412a022f43cfeb7729e55aac78ef6  gffread_lncRNA-fasta-extraction.stderr
cba3ae8e2474861cd60aa304269b66a8  gffread-lncRNA_gff-to-lncRNA_gtf.stderr

Document GffRead program options¶

In [19]:

%%bash
${gffread} -h

gffread v0.12.7. Usage:
gffread [-g <genomic_seqs_fasta> | <dir>] [-s <seq_info.fsize>] 
 [-o <outfile>] [-t <trackname>] [-r [<strand>]<chr>:<start>-<end> [-R]]
 [--jmatch <chr>:<start>-<end>] [--no-pseudo] 
 [-CTVNJMKQAFPGUBHZWTOLE] [-w <exons.fa>] [-x <cds.fa>] [-y <tr_cds.fa>]
 [-j ][--ids <IDs.lst> | --nids <IDs.lst>] [--attrs <attr-list>] [-i <maxintron>]
 [--stream] [--bed | --gtf | --tlf] [--table <attrlist>] [--sort-by <ref.lst>]
 [<input_gff>] 

 Filter, convert or cluster GFF/GTF/BED records, extract the sequence of
 transcripts (exon or CDS) and more.
 By default (i.e. without -O) only transcripts are processed, discarding any
 other non-transcript features. Default output is a simplified GFF3 with only
 the basic attributes.
 
Options:
 --ids discard records/transcripts if their IDs are not listed in <IDs.lst>
 --nids discard records/transcripts if their IDs are listed in <IDs.lst>
 -i   discard transcripts having an intron larger than <maxintron>
 -l   discard transcripts shorter than <minlen> bases
 -r   only show transcripts overlapping coordinate range <start>..<end>
      (on chromosome/contig <chr>, strand <strand> if provided)
 -R   for -r option, discard all transcripts that are not fully 
      contained within the given range
 --jmatch only output transcripts matching the given junction
 -U   discard single-exon transcripts
 -C   coding only: discard mRNAs that have no CDS features
 --nc non-coding only: discard mRNAs that have CDS features
 --ignore-locus : discard locus features and attributes found in the input
 -A   use the description field from <seq_info.fsize> and add it
      as the value for a 'descr' attribute to the GFF record
 -s   <seq_info.fsize> is a tab-delimited file providing this info
      for each of the mapped sequences:
      <seq-name> <seq-length> <seq-description>
      (useful for -A option with mRNA/EST/protein mappings)
Sorting: (by default, chromosomes are kept in the order they were found)
 --sort-alpha : chromosomes (reference sequences) are sorted alphabetically
 --sort-by : sort the reference sequences by the order in which their
      names are given in the <refseq.lst> file
Misc options: 
 -F   keep all GFF attributes (for non-exon features)
 --keep-exon-attrs : for -F option, do not attempt to reduce redundant
      exon/CDS attributes
 -G   do not keep exon attributes, move them to the transcript feature
      (for GFF3 output)
 --attrs <attr-list> only output the GTF/GFF attributes listed in <attr-list>
    which is a comma delimited list of attribute names to
 --keep-genes : in transcript-only mode (default), also preserve gene records
 --keep-comments: for GFF3 input/output, try to preserve comments
 -O   process other non-transcript GFF records (by default non-transcript
      records are ignored)
 -V   discard any mRNAs with CDS having in-frame stop codons (requires -g)
 -H   for -V option, check and adjust the starting CDS phase
      if the original phase leads to a translation with an 
      in-frame stop codon
 -B   for -V option, single-exon transcripts are also checked on the
      opposite strand (requires -g)
 -P   add transcript level GFF attributes about the coding status of each
      transcript, including partialness or in-frame stop codons (requires -g)
 --add-hasCDS : add a "hasCDS" attribute with value "true" for transcripts
      that have CDS features
 --adj-stop stop codon adjustment: enables -P and performs automatic
      adjustment of the CDS stop coordinate if premature or downstream
 -N   discard multi-exon mRNAs that have any intron with a non-canonical
      splice site consensus (i.e. not GT-AG, GC-AG or AT-AC)
 -J   discard any mRNAs that either lack initial START codon
      or the terminal STOP codon, or have an in-frame stop codon
      (i.e. only print mRNAs with a complete CDS)
 --no-pseudo: filter out records matching the 'pseudo' keyword
 --in-bed: input should be parsed as BED format (automatic if the input
           filename ends with .bed*)
 --in-tlf: input GFF-like one-line-per-transcript format without exon/CDS
           features (see --tlf option below); automatic if the input
           filename ends with .tlf)
 --stream: fast processing of input GFF/BED transcripts as they are received
           ((no sorting, exons must be grouped by transcript in the input data)
Clustering:
 -M/--merge : cluster the input transcripts into loci, discarding
      "redundant" transcripts (those with the same exact introns
      and fully contained or equal boundaries)
 -d <dupinfo> : for -M option, write duplication info to file <dupinfo>
 --cluster-only: same as -M/--merge but without discarding any of the
      "duplicate" transcripts, only create "locus" features
 -K   for -M option: also discard as redundant the shorter, fully contained
       transcripts (intron chains matching a part of the container)
 -Q   for -M option, no longer require boundary containment when assessing
      redundancy (can be combined with -K); only introns have to match for
      multi-exon transcripts, and >=80% overlap for single-exon transcripts
 -Y   for -M option, enforce -Q but also discard overlapping single-exon 
      transcripts, even on the opposite strand (can be combined with -K)
Output options:
 --force-exons: make sure that the lowest level GFF features are considered
       "exon" features
 --gene2exon: for single-line genes not parenting any transcripts, add an
       exon feature spanning the entire gene (treat it as a transcript)
 --t-adopt:  try to find a parent gene overlapping/containing a transcript
       that does not have any explicit gene Parent
 -D    decode url encoded characters within attributes
 -Z    merge very close exons into a single exon (when intron size<4)
 -g   full path to a multi-fasta file with the genomic sequences
      for all input mappings, OR a directory with single-fasta files
      (one per genomic sequence, with file names matching sequence names)
 -j    output the junctions and the corresponding transcripts
 -w    write a fasta file with spliced exons for each transcript
 --w-add <N> for the -w option, extract additional <N> bases
       both upstream and downstream of the transcript boundaries
 --w-nocds for -w, disable the output of CDS info in the FASTA file
 -x    write a fasta file with spliced CDS for each GFF transcript
 -y    write a protein fasta file with the translation of CDS for each record
 -W    for -w, -x and -y options, write in the FASTA defline all the exon
       coordinates projected onto the spliced sequence;
 -S    for -y option, use '*' instead of '.' as stop codon translation
 -L    Ensembl GTF to GFF3 conversion, adds version to IDs
 -m    <chr_replace> is a name mapping table for converting reference 
       sequence names, having this 2-column format:
       <original_ref_ID> <new_ref_ID>
 -t    use <trackname> in the 2nd column of each GFF/GTF output line
 -o    write the output records into <outfile> instead of stdout
 -T    main output will be GTF instead of GFF3
 --bed output records in BED format instead of default GFF3
 --tlf output "transcript line format" which is like GFF
       but with exons and CDS related features stored as GFF 
       attributes in the transcript feature line, like this:
         exoncount=N;exons=<exons>;CDSphase=<N>;CDS=<CDScoords> 
       <exons> is a comma-delimited list of exon_start-exon_end coordinates;
       <CDScoords> is CDS_start:CDS_end coordinates or a list like <exons>
 --table output a simple tab delimited format instead of GFF, with columns
       having the values of GFF attributes given in <attrlist>; special
       pseudo-attributes (prefixed by @) are recognized:
       @id, @geneid, @chr, @start, @end, @strand, @numexons, @exons, 
       @cds, @covlen, @cdslen
       If any of -w/-y/-x FASTA output files are enabled, the same fields
       (excluding @id) are appended to the definition line of corresponding
       FASTA records
 -v,-E expose (warn about) duplicate transcript IDs and other potential
       problems with the given GFF/GTF records

---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
/tmp/ipykernel_36240/1000630337.py in <module>
----> 1 get_ipython().run_cell_magic('bash', '', '${gffread} -h\n')

~/programs/miniconda3/envs/gffutils_env/lib/python3.9/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2417             with self.builtin_trap:
   2418                 args = (magic_arg_s, cell)
-> 2419                 result = fn(*args, **kwargs)
   2420             return result
   2421 

~/programs/miniconda3/envs/gffutils_env/lib/python3.9/site-packages/IPython/core/magics/script.py in named_script_magic(line, cell)
    140             else:
    141                 line = script
--> 142             return self.shebang(line, cell)
    143 
    144         # write a basic docstring:

~/programs/miniconda3/envs/gffutils_env/lib/python3.9/site-packages/decorator.py in fun(*args, **kw)
    230             if not kwsyntax:
    231                 args, kw = fix(args, kw, sig)
--> 232             return caller(func, *(extras + args), **kw)
    233     fun.__name__ = func.__name__
    234     fun.__doc__ = func.__doc__

~/programs/miniconda3/envs/gffutils_env/lib/python3.9/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    185     # but it's overkill for just that one bit of state.
    186     def magic_deco(arg):
--> 187         call = lambda f, *a, **k: f(*a, **k)
    188 
    189         if callable(arg):

~/programs/miniconda3/envs/gffutils_env/lib/python3.9/site-packages/IPython/core/magics/script.py in shebang(self, line, cell)
    243             sys.stderr.flush()
    244         if args.raise_error and p.returncode!=0:
--> 245             raise CalledProcessError(p.returncode, cell, output=out, stderr=err)
    246 
    247     def _run_script(self, p, cell, to_close):

CalledProcessError: Command 'b'${gffread} -h\n'' returned non-zero exit status 1.

Document `gtf_extract` options¶

In [20]:

%%bash
gtf_extract -h

usage: gtf_extract [-h] [-v] [-f FEATURE_TYPE] [--fields FIELD_LIST]
                   [-o OUTFILE] [--gff] [-k]
                   GTF_FILE

Extract selected data items from a GTF file and output in tab-delimited
format. The program can also operate on GFF files provided the --gff option is
specified.

positional arguments:
  GTF_FILE              input GTF file to extract data items from

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -f FEATURE_TYPE, --feature FEATURE_TYPE
                        only extract data for lines where feature is
                        FEATURE_TYPE
  --fields FIELD_LIST   comma-separated list of fields to output in tab-
                        delimited format for each line in the GTF, e.g.
                        'chrom,start,end'. Fields can either be a GTF field
                        name (i.e. 'chrom', 'source', 'feature', 'start',
                        'end', 'score', 'strand' and 'frame') or the name of
                        an attribute (e.g. 'gene_name', 'gene_id' etc). Data
                        items are output in the order they appear in
                        FIELD_LIST. If a field doesn't exist for a line then
                        '.' will be output as the value.
  -o OUTFILE            write output to OUTFILE (default is to write to
                        stdout)
  --gff                 specify that the input file is GFF rather than GTF
                        format
  -k, --keep-headers    copy headers from input file to output

In [ ]: