ncbi.datasets
library to download and parse genome annotation data in GFF3 format¶The objective of this notebook is to use the ncbi.datasets
python library to demonstrate how to download and parse genome annotation data in gff3 format from NCBI Datasets.
There are two major types of genome data available from NCBI Datasets:
Genome datasets, which include genome, transcript, and protein sequences, genome annotation data in gff3 format, an assembly data report (genome assembly and annotation metadata) and a sequence report (a list of sequences that comprise the genome assembly).
Genome summaries, which are sets of key metadata that describe the genome datasets
As an example, we will download gff3 files for five Lactobacillus genome assemblies, then parse those gff3 files to get information about crispr gene order in the corresponding genome assemblies.
The notebook can be broken down into the following four major tasks:
After importing the various python modules we will need, create the assembly and download API instances.
from collections import defaultdict, Counter
from datetime import datetime
import json
import os
from textwrap import dedent
import zipfile
import gffutils
import matplotlib.pyplot as plt
import ncbi.datasets
import pandas as pd
from pyfaidx import Fasta
plt.style.use('ggplot')
genome_api = ncbi.datasets.GenomeApi(ncbi.datasets.ApiClient())
Let's look at RefSeq lactobacillus genome assemblies, and see how we can group them based on annotation date. Annotation date, when available, is described in the genome summary.
This section will include the following two major tasks:
Using the ncbi.datasets
library, let's first get the count of available RefSeq lactobacillus assemblies. By setting limit='none'
, we tell the API to return a genome summary with only the count of genome assemblies for the specified NCBI Taxonomy ID.
taxid = 1578 ## This is the NCBI Taxonomy ID for lactobacillus
genome_summary = genome_api.assembly_descriptors_by_taxon(
taxon=taxid,
limit='none',
filters_refseq_only=True)
print(f"Number of RefSeq lactobacillus genome assemblies: {genome_summary.total_count}")
Number of RefSeq lactobacillus genome assemblies: 777
Now let's download the genome summaries for these genome assemblies. To get the metadata for all of the genomes in scope, set limit='all'
.
## download the genome summaries for all RefSeq lactobacillus genome assemblies
genome_summary = genome_api.assembly_descriptors_by_taxon(
taxon=taxid,
limit='all',
filters_refseq_only=True)
print(f"Genome summaries for {genome_summary.total_count} genome assemblies downloaded and saved to genome_summary.")
print(f"As an example, here's the first genome summary for the first genome assembly in the list:\n", genome_summary.assemblies[0])
Genome summaries for 777 genome assemblies downloaded and saved to genome_summary. As an example, here's the first genome summary for the first genome assembly in the list: {'assembly': {'annotation_metadata': {'file': [{'estimated_size': '111943', 'type': 'GENOME_GFF'}, {'estimated_size': '1187873', 'type': 'GENOME_GBFF'}, {'estimated_size': '301643', 'type': 'PROT_FASTA'}, {'estimated_size': '126907', 'type': 'GENOME_GTF'}], 'name': 'From INSDC submitter', 'release_date': 'Oct 07, 2019', 'release_number': None, 'report_url': None, 'source': 'Korea University'}, 'assembly_accession': 'GCF_008831485.1', 'assembly_category': 'representative genome', 'assembly_level': 'Complete Genome', 'bioproject_lineages': [{'bioprojects': [{'accession': 'PRJNA566216', 'parent_accession': None, 'title': 'Lactobacillus ' 'acetotolerans ' 'Genome ' 'sequencing ' 'and ' 'assembly'}]}], 'chromosomes': ['ANONYMOUS'], 'contig_n50': 1683905, 'display_name': 'ASM883148v1', 'estimated_size': '2223244', 'org': {'assembly_count': None, 'assembly_counts': {'node': 10, 'subtree': 14}, 'blast_node': None, 'breed': None, 'children': None, 'common_name': None, 'counts': None, 'cultivar': None, 'ecotype': None, 'icon': None, 'isolate': None, 'key': '1600', 'max_ord': None, 'merged': None, 'merged_tax_ids': None, 'min_ord': None, 'parent_tax_id': '1578', 'rank': 'SPECIES', 'sci_name': 'Lactobacillus acetotolerans', 'search_text': None, 'sex': None, 'strain': 'LA749', 'tax_id': '1600', 'title': 'Lactobacillus acetotolerans'}, 'seq_length': '1683905', 'submission_date': '2019-10-07'}, 'messages': None}
Next we'll plot the assemblies by the year they were annotated.
def annotation_year(annotation_metadata):
return datetime.strptime(annotation_metadata.release_date, '%b %d, %Y').year
annots_by_year = Counter()
no_annot_assms = []
for d in map(lambda d: d.assembly, genome_summary.assemblies):
if not d.annotation_metadata:
no_annot_assms.append(d.assembly_accession)
continue
annots_by_year[annotation_year(d.annotation_metadata)] += 1
if len(no_annot_assms) > 0:
print(dedent(f'''
WARNING!
Some assemblies do not have annotation.
Most likely, this is because of an indexing delay. Skipping {len(no_annot_assms)} assemblies.
'''))
df = pd.DataFrame.from_dict(annots_by_year, orient='index', columns=['count']).sort_index()
df.plot(kind='bar', y='count', figsize=(12,6))
<AxesSubplot:>
We can make a list of the genome assemblies that were annotated in 2020 and count the number of genomes in the list.
year_of_interest = 2020
genome_accessions = []
for d in map(lambda d: d.assembly, genome_summary.assemblies):
if d.annotation_metadata:
if annotation_year(d.annotation_metadata) == year_of_interest:
genome_accessions.append(d.assembly_accession)
print(f'{len(genome_accessions)} genome assemblies were annotated in {year_of_interest}.')
print(f"Here are the first five genome assemblies in the list: {genome_accessions[0:5]}")
145 genome assemblies were annotated in 2020. Here are the first five genome assemblies in the list: ['GCF_014648715.1', 'GCF_012848655.1', 'GCF_013867605.1', 'GCF_013867555.1', 'GCF_013342945.1']
We could download genome datasets for all of the returned assemblies, based on the accessions stored in genome_accessions
. If there are quite a few of them, a download_api.download_assembly_package_post()
request would be required (see docs here)[https://github.com/ncbi/datasets/blob/master/client_docs/python/docs/DownloadApi.md#download_assembly_package_post].
For demonstration purposes, we'll just download genome datasets for five genome assemblies through a GET request.
%%time
## Download data
# Let's use accessions #21-25 in the list because these all have at least one crispr gene
print(f'Download genome datasets for 5 accessions: {genome_accessions[20:25]}')
dl_response = genome_api.download_assembly_package(
genome_accessions[20:25],
exclude_sequence=False,
include_annotation_type=['GENOME_GFF', 'PROT_FASTA'],
_preload_content=False )
## write to a zip file
zipfn = 'ncbi_genomes.zip'
with open(zipfn, 'wb') as f:
f.write(dl_response.data)
print(f'Download saved to {zipfn}')
!unzip -l {zipfn}
Download genome datasets for the first 5 accessions: ['GCF_009857225.1', 'GCF_014830125.1', 'GCF_013394955.1', 'GCF_013370935.1', 'GCF_013249185.1'] Download saved to ncbi_genomes.zip Archive: ncbi_genomes.zip Length Date Time Name --------- ---------- ----- ---- 661 11-19-2020 12:03 README.md 5811 11-19-2020 12:03 ncbi_dataset/data/assembly_data_report.jsonl 1668120 11-19-2020 12:03 ncbi_dataset/data/GCF_009857225.1/GCF_009857225.1_ASM985722v1_genomic.fna 898917 11-19-2020 12:03 ncbi_dataset/data/GCF_009857225.1/genomic.gff 597383 11-19-2020 12:03 ncbi_dataset/data/GCF_009857225.1/protein.faa 1838908 11-19-2020 12:03 ncbi_dataset/data/GCF_013249185.1/GCF_013249185.1_ASM1324918v1_genomic.fna 1062407 11-19-2020 12:03 ncbi_dataset/data/GCF_013249185.1/genomic.gff 594208 11-19-2020 12:03 ncbi_dataset/data/GCF_013249185.1/protein.faa 1786869 11-19-2020 12:03 ncbi_dataset/data/GCF_013370935.1/GCF_013370935.1_ASM1337093v1_genomic.fna 1011101 11-19-2020 12:03 ncbi_dataset/data/GCF_013370935.1/genomic.gff 590951 11-19-2020 12:03 ncbi_dataset/data/GCF_013370935.1/protein.faa 1779401 11-19-2020 12:03 ncbi_dataset/data/GCF_013394955.1/GCF_013394955.1_ASM1339495v1_genomic.fna 1010905 11-19-2020 12:03 ncbi_dataset/data/GCF_013394955.1/genomic.gff 589700 11-19-2020 12:03 ncbi_dataset/data/GCF_013394955.1/protein.faa 2064899 11-19-2020 12:03 ncbi_dataset/data/GCF_014830125.1/GCF_014830125.1_ASM1483012v1_genomic.fna 1167106 11-19-2020 12:03 ncbi_dataset/data/GCF_014830125.1/genomic.gff 694896 11-19-2020 12:03 ncbi_dataset/data/GCF_014830125.1/protein.faa 11623 11-19-2020 12:03 ncbi_dataset/data/GCF_009857225.1/sequence_report.jsonl 19774 11-19-2020 12:03 ncbi_dataset/data/GCF_013249185.1/sequence_report.jsonl 7913 11-19-2020 12:03 ncbi_dataset/data/GCF_013370935.1/sequence_report.jsonl 7909 11-19-2020 12:03 ncbi_dataset/data/GCF_013394955.1/sequence_report.jsonl 34330 11-19-2020 12:03 ncbi_dataset/data/GCF_014830125.1/sequence_report.jsonl 2536 11-19-2020 12:03 ncbi_dataset/data/dataset_catalog.json --------- ------- 17446328 23 files CPU times: user 54.5 ms, sys: 44 ms, total: 98.5 ms Wall time: 1.07 s
Next, we're going to look for crispr genes in these genome datasets, and extract the corresponding crispr gene and protein sequences.
We're going to break this down into a few steps.
First, let's make a list of the data files that are available for each genome assembly. We can use the dataset_catalog.json
file to tell us which data files are available for each genome assembly.
## function to make list of files
from ncbi.datasets.v1alpha1 import catalog_pb2
import pprint
pp = pprint.PrettyPrinter(indent=4)
def retrieve_data_catalog(zip_file):
with zipfile.ZipFile(zip_file, 'r') as zip:
data_catalog = json.loads(zip.read('ncbi_dataset/data/dataset_catalog.json'))
print(f"Catalog found with metadata for {len(data_catalog['assemblies'])-1} assemblies")
return data_catalog
def get_assemblies(data_catalog):
return [x['accession'] for x in data_catalog['assemblies'] if 'accession' in x]
# Temporary hack to support GENOMIC_NUCLEOTIDE_FASTA & PROTEIN_FASTA
# which will be present in the next release
def get_file_list(data_catalog, desired_filetype):
desired_filetype = desired_filetype.upper()
if desired_filetype not in catalog_pb2.File.FileType.keys():
raise Exception(f'Filetype {desired_filetype} is invalid.')
files = defaultdict(list)
for asm in data_catalog['assemblies']:
if 'accession' in asm:
acc = asm['accession']
for f in asm['files']:
filepath = os.path.join('ncbi_dataset', 'data', f['filePath'])
if f['fileType'] == 'FASTA' and desired_filetype in ('GENOMIC_NUCLEOTIDE_FASTA') and filepath.endswith('fna'):
files[acc].append(filepath)
continue
if f['fileType'] == 'GFF3':
if desired_filetype in ('PROTEIN_FASTA') and filepath.endswith('faa'):
files[acc].append(filepath)
if desired_filetype in ('GFF3') and filepath.endswith('gff'):
files[acc].append(filepath)
continue
if f['fileType'] == desired_filetype:
files[acc].append(filepath)
return files
data_catalog = retrieve_data_catalog(zipfn)
print(f'Assemblies:')
print(', '.join(get_assemblies(data_catalog)))
for file_type in ['GFF3', 'FASTA', 'GENOMIC_NUCLEOTIDE_FASTA', 'PROTEIN_FASTA']:
file_list = get_file_list(data_catalog, file_type)
print(f'Files for type: {file_type}')
pp.pprint(file_list)
Catalog found with metadata for 5 assemblies Assemblies: GCF_009857225.1, GCF_013249185.1, GCF_013370935.1, GCF_013394955.1, GCF_014830125.1 Files for type: GFF3 defaultdict(<class 'list'>, { 'GCF_009857225.1': [ 'ncbi_dataset/data/GCF_009857225.1/genomic.gff'], 'GCF_013249185.1': [ 'ncbi_dataset/data/GCF_013249185.1/genomic.gff'], 'GCF_013370935.1': [ 'ncbi_dataset/data/GCF_013370935.1/genomic.gff'], 'GCF_013394955.1': [ 'ncbi_dataset/data/GCF_013394955.1/genomic.gff'], 'GCF_014830125.1': [ 'ncbi_dataset/data/GCF_014830125.1/genomic.gff']}) Files for type: FASTA defaultdict(<class 'list'>, {}) Files for type: GENOMIC_NUCLEOTIDE_FASTA defaultdict(<class 'list'>, { 'GCF_009857225.1': [ 'ncbi_dataset/data/GCF_009857225.1/GCF_009857225.1_ASM985722v1_genomic.fna'], 'GCF_013249185.1': [ 'ncbi_dataset/data/GCF_013249185.1/GCF_013249185.1_ASM1324918v1_genomic.fna'], 'GCF_013370935.1': [ 'ncbi_dataset/data/GCF_013370935.1/GCF_013370935.1_ASM1337093v1_genomic.fna'], 'GCF_013394955.1': [ 'ncbi_dataset/data/GCF_013394955.1/GCF_013394955.1_ASM1339495v1_genomic.fna'], 'GCF_014830125.1': [ 'ncbi_dataset/data/GCF_014830125.1/GCF_014830125.1_ASM1483012v1_genomic.fna']}) Files for type: PROTEIN_FASTA defaultdict(<class 'list'>, { 'GCF_009857225.1': [ 'ncbi_dataset/data/GCF_009857225.1/protein.faa'], 'GCF_013249185.1': [ 'ncbi_dataset/data/GCF_013249185.1/protein.faa'], 'GCF_013370935.1': [ 'ncbi_dataset/data/GCF_013370935.1/protein.faa'], 'GCF_013394955.1': [ 'ncbi_dataset/data/GCF_013394955.1/protein.faa'], 'GCF_014830125.1': [ 'ncbi_dataset/data/GCF_014830125.1/protein.faa']})
We're going to create temporary files to store the genome annotation (gff3 format) and protein sequence data. We will put the files in a temporary directory named tempfiles.
## setting up files and directories
## temporary files; will be deleted at the end
temp_dir = 'tempfiles'
temp_gff = temp_dir + '/temp.gff'
temp_fa = temp_dir + '/temp.fa'
!mkdir -p {temp_dir}
## final output files
genes_fn = 'crispr_genes.fna'
prots_fn = 'crispr_proteins.faa'
Next, we need a few functions to process the data. One of these functions will create a gff3 database in memory using the gffutils python package. Another combines fasta files, a third function will extract gene information by gene symbol from the gff3 database.
def create_gff3_db(files_by_assembly, temp_file, zfh):
'''
create gff3 db in memory, per assembly
okay for bacterial assemblies but use caution
when parsing large assemblies like human
'''
db = {}
for assembly_accession, files in files_by_assembly.items():
with open(temp_file, 'wb') as f:
for file in files:
print(f'\tWrite {file} to {temp_file}')
f.write(zfh.read(file))
db[assembly_accession] = gffutils.create_db(
temp_file,
dbfn = ':memory:',
force=True,
keep_order=True,
merge_strategy='merge',
sort_attribute_values=True)
return db
def combine_fasta(files_by_assembly, temp_file, zfh):
'''
Combine fasta for *all* FASTA files, nt & protein
'''
with open(temp_file, 'wb') as f:
for assembly, files in files_by_assembly.items():
for file in files:
print(f'\tAppending {file} to {temp_file}')
f.write(zfh.read(file))
print(f'Create FASTA object for {temp_file}.')
return Fasta(temp_file, read_long_names=False, duplicate_action='first')
def extract_genes(gff3_db, assemblies, desired_genes):
crispr_order = defaultdict(list)
crispr_genes = {}
for assembly_accession in assemblies:
if assembly_accession in assemblies:
for gene in gff3_db[assembly_accession].features_of_type('gene'):
gene_name = gene.attributes.get('Name', None)[0]
if gene_name[:4] not in desired_genes:
continue
gene_range = (gene.start, gene.end)
prot_acc = None
if gene.attributes['gene_biotype'][0] == 'protein_coding':
cds = list(gff3_db[assembly_accession].children(gene, featuretype='CDS'))
prot_acc = cds[0].attributes.get('protein_id', None)[0]
crispr_genes[gene_name] = ([gene.chrom, gene.strand, gene_range, prot_acc])
crispr_order[assembly_accession].append(gene_name)
return crispr_genes, crispr_order
def write_fasta(fh, defline, content):
fh.write('>' + defline + '\n')
fh.write(content + '\n')
Let's use the above functions to process the datasets zip file. First, create the gff3 database, then extract crispr gene information from the gff3 data, then for each gene, store the crispr gene data and write the fasta sequences.
%%time
crispr_genes = set(['cas3', 'cse1', 'cse2', 'cas7', 'cas5', 'cas6', 'cas4', 'cas1', 'cas2', 'cas5', 'cas8', 'cas9', 'csn2'])
## create empty files to add data to
open(genes_fn, 'w').close()
open(prots_fn, 'w').close()
with zipfile.ZipFile(zipfn, 'r') as zfh:
print('Create GFF3 Database')
gff3_db = create_gff3_db(get_file_list(data_catalog, 'GFF3'), temp_gff, zfh)
print(f'Extract crispr genes ({len(crispr_genes)} genes)')
crispr_genes, crispr_order = extract_genes(gff3_db, get_assemblies(data_catalog), crispr_genes)
print(f'Write genes to {genes_fn}')
print('Create genomes object')
genomes = combine_fasta(get_file_list(data_catalog, 'GENOMIC_NUCLEOTIDE_FASTA'), temp_fa, zfh)
# Write all genes to genes_fn
with open(genes_fn, 'w') as fh:
for gene_name, gene_info in crispr_genes.items():
chrom, strand, gene_range, prot_acc = gene_info
reverse_complement = True if strand == '-' else False
gene_fasta = genomes.get_seq(chrom, gene_range[0], gene_range[1], rc=reverse_complement)
write_fasta(fh, f'{gene_fasta.fancy_name}|{gene_name}', gene_fasta.seq)
print('Create proteome object')
proteome = combine_fasta(get_file_list(data_catalog, 'PROTEIN_FASTA'), temp_fa, zfh)
# Write protein FASTA for each CRISPR gene
with open(prots_fn, 'w') as fh:
for gene_name, gene_info in crispr_genes.items():
chrom, strand, gene_range, prot_acc = gene_info
if prot_acc is not None:
prot_fasta = proteome[prot_acc][:]
write_fasta(fh, f'{prot_fasta.name}|{gene_name}', prot_fasta.seq)
Create GFF3 Database Write ncbi_dataset/data/GCF_009857225.1/genomic.gff to tempfiles/temp.gff Write ncbi_dataset/data/GCF_013249185.1/genomic.gff to tempfiles/temp.gff Write ncbi_dataset/data/GCF_013370935.1/genomic.gff to tempfiles/temp.gff Write ncbi_dataset/data/GCF_013394955.1/genomic.gff to tempfiles/temp.gff Write ncbi_dataset/data/GCF_014830125.1/genomic.gff to tempfiles/temp.gff Extract crispr genes (12 genes) Write genes to crispr_genes.fna Create genomes object Appending ncbi_dataset/data/GCF_009857225.1/GCF_009857225.1_ASM985722v1_genomic.fna to tempfiles/temp.fa Appending ncbi_dataset/data/GCF_013249185.1/GCF_013249185.1_ASM1324918v1_genomic.fna to tempfiles/temp.fa Appending ncbi_dataset/data/GCF_013370935.1/GCF_013370935.1_ASM1337093v1_genomic.fna to tempfiles/temp.fa Appending ncbi_dataset/data/GCF_013394955.1/GCF_013394955.1_ASM1339495v1_genomic.fna to tempfiles/temp.fa Appending ncbi_dataset/data/GCF_014830125.1/GCF_014830125.1_ASM1483012v1_genomic.fna to tempfiles/temp.fa Create FASTA object for tempfiles/temp.fa. Create proteome object Appending ncbi_dataset/data/GCF_009857225.1/protein.faa to tempfiles/temp.fa Appending ncbi_dataset/data/GCF_013249185.1/protein.faa to tempfiles/temp.fa Appending ncbi_dataset/data/GCF_013370935.1/protein.faa to tempfiles/temp.fa Appending ncbi_dataset/data/GCF_013394955.1/protein.faa to tempfiles/temp.fa Appending ncbi_dataset/data/GCF_014830125.1/protein.faa to tempfiles/temp.fa Create FASTA object for tempfiles/temp.fa. CPU times: user 4.83 s, sys: 59.4 ms, total: 4.89 s Wall time: 4.96 s
crispr_genes.fna and crispr_proteins.faa contain the genomic and protein FASTA sequences, respectively, for the crispr genes we identified in all of the five genome assemblies.
## order of crispr genes in various assemblies
for k,v in crispr_order.items():
print(k, v)
GCF_009857225.1 ['csn2'] GCF_013249185.1 ['cas2e', 'cas1e', 'cas6e', 'cas5e', 'cas7e', 'cas3'] GCF_013370935.1 ['cas9', 'cas1', 'cas2', 'csn2'] GCF_013394955.1 ['csn2', 'cas2', 'cas1', 'cas9'] GCF_014830125.1 ['cas3', 'cas5c', 'cas8c', 'cas7c', 'cas4', 'cas1c', 'cas2']