Use NCBI Datasets to Retrieve Information on Genes

The objective of this notebook is to demonstrate how to download and extract gene data from NCBI Datasets using the ncbi.datasets Python library.

There are two major types of gene data available from NCBI Datasets:

  1. Gene datasets, which include gene, transcript, and protein sequences, and a data report (gene metadata in JSON lines format)

  2. Gene summaries, which are brief descriptions of the gene datasets described above

As an example, we will get gene data for the Gonadotropin Releasing Hormone Receptor (GNRHR) gene family, which plays a key role in sexual development and function across vertebrates.

While the role of GnRH in reproduction is conserved across vertebrates, GnRH receptor copy number is variable. In humans and some primates, there are two GnRHR genes, while in fish and amphibians, three GnRHR genes have been identified, with additional duplications observed in some fish. The presence of additional receptors and GnRH ligands suggests that the extra gene copies could have acquired new functions[1].

We expect to observe this variable gene copy number in the data we obtain using NCBI Datasets.

[1] Moncaut N, Somoza G, Power DM, Canário AV. Five gonadotrophin-releasing hormone receptors in a teleost fish: isolation, tissue distribution and phylogenetic relationships. J Mol Endocrinol. 2005 Jun;34(3):767-79. doi: 10.1677/jme.1.01757. PMID: 15956346.

First we will load the libraries necessary to run this notebook:

  1. the ncbi.datasets library
  2. pandas, the Python data analysis library
In [1]:
# load all libraries
import pandas as pd
import ncbi.datasets

Get gene summaries for three human GnRHR genes

First we're going to get gene summaries for three human GnRHR genes, GNRHR, GNRHR2, and GNRHR2P1, by specifying the NCBI Gene IDs for these genes.

Gene summaries contain a lot of interesting metadata and it's easy to pull out just the fields that you're interested in. As an example, we'll show how to parse gene symbols, chromosome number and the corresponding SwissProt accession for the three genes.

In [1]:
# Start a Datasets gene API instance
api_client = ncbi.datasets.ApiClient()
ds_gene_instance = ncbi.datasets.GeneApi(api_client)

# Retrieve gene summaries for the three genes using NCBI Gene IDs
gene_summary = ds_gene_instance.gene_metadata_by_id([2798, 114814, 404718])

# Look up the symbols, chromosome number and SwissProt accession for each gene
def report_on_gene_descriptors(gene_summary, leader='\t', report_errors=True):
    if report_errors:
        for message in gene_summary.messages or []:
            print(f'{leader}Error for: ({",".join(message.error.invalid_identifiers)})')
            print(f'{leader}{leader}Reason: ({message.error.reason})')

    if not gene_summary.genes:
        print(f'{leader}No genes found')
        return

    for gene in map(lambda g: g.gene, gene_summary.genes):
        print(f'{leader}{gene.symbol} (GeneID: {gene.gene_id}), Chromosome: {gene.chromosomes}, SwissProt: {gene.swiss_prot_accessions}')

report_on_gene_descriptors(gene_summary)
	GNRHR2 (GeneID: 114814), Chromosome: ['1'], SwissProt: ['Q96P88']
	GNRHR (GeneID: 2798), Chromosome: ['4'], SwissProt: ['P30968']
	GNRHR2P1 (GeneID: 404718), Chromosome: ['14'], SwissProt: None
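
The same field selection can be collected into a list of records instead of being printed. The following is a minimal offline sketch, not the library's own API: the `SimpleNamespace` objects are stand-ins that model only the four attributes used above, their values are copied from the output just shown, and the helper names `_fake_match` and `gene_rows` are hypothetical.

```python
from types import SimpleNamespace


def _fake_match(symbol, gene_id, chromosomes, swiss_prot_accessions):
    # Hypothetical stand-in for one element of gene_summary.genes;
    # only the attributes read by the cell above are modelled.
    return SimpleNamespace(gene=SimpleNamespace(
        symbol=symbol,
        gene_id=gene_id,
        chromosomes=chromosomes,
        swiss_prot_accessions=swiss_prot_accessions,
    ))


# Values copied from the printed output above.
fake_genes = [
    _fake_match('GNRHR2', '114814', ['1'], ['Q96P88']),
    _fake_match('GNRHR', '2798', ['4'], ['P30968']),
    _fake_match('GNRHR2P1', '404718', ['14'], None),
]


def gene_rows(matches):
    """Flatten the selected fields of each gene into a plain dictionary."""
    rows = []
    for match in matches:
        gene = match.gene
        rows.append({
            'gene_id': gene.gene_id,
            'symbol': gene.symbol,
            # list-valued fields may be None, so fall back to an empty list
            'chromosome': ','.join(gene.chromosomes or []),
            'swiss_prot': ','.join(gene.swiss_prot_accessions or []) or None,
        })
    return rows


rows = gene_rows(fake_genes)
for row in rows:
    print(row)
```

In the notebook, `gene_rows(gene_summary.genes)` would take the place of the stand-in list, and the resulting list of dictionaries can be passed to `pd.DataFrame`.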

Finding vertebrate gene orthologs by gene symbol

Now we're going to look for GnRHR genes in vertebrates. To find these genes, we'll first query the gene summary service by gene symbol and NCBI Taxonomy ID and retrieve the NCBI GeneID from the summary.

Next we'll query the gene ortholog service by NCBI GeneID to retrieve all vertebrate orthologs for the requested gene. Ortholog sets are calculated by NCBI Gene, as described at https://www.ncbi.nlm.nih.gov/kis/info/how-are-orthologs-calculated/.

Finally, we'll print a summary of all orthologs, including gene symbol, NCBI Gene ID, chromosome, and SwissProt accession (if available).

In [1]:
# Get gene id by gene symbol + organism name
gene_symbol = 'GNRHR'
gene_taxon = 'human'
gene_descriptor = ds_gene_instance.gene_metadata_by_tax_and_symbol(symbols=[gene_symbol], taxon=gene_taxon)

if not gene_descriptor.genes:
    print(f'No gene found for {gene_taxon} {gene_symbol}')
else:
    gene_id = int(gene_descriptor.genes[0].gene.gene_id)

    # Query the gene ortholog service to get all vertebrate orthologs
    ortholog_set = ds_gene_instance.gene_orthologs_by_id(gene_id=gene_id)

    if not ortholog_set.ortholog_set_id:
        print(f'\nUnable to find orthologs for gene {gene_id}')
    else:
        orthologs_descriptors = ortholog_set.genes
        report_on_gene_descriptors(orthologs_descriptors, report_errors=False)
	GNRHR (GeneID: 100009509), Chromosome: ['15'], SwissProt: None
	GNRHR (GeneID: 100011217), Chromosome: ['5'], SwissProt: None
	GNRHR (GeneID: 100033874), Chromosome: ['3'], SwissProt: ['O18821']
	GNRHR (GeneID: 100093333), Chromosome: ['10'], SwissProt: None
	Gnrhr (GeneID: 100135532), Chromosome: ['Un'], SwissProt: ['Q8CH60']
	GNRHR (GeneID: 100385305), Chromosome: ['3'], SwissProt: None
	GNRHR (GeneID: 100437372), Chromosome: ['4'], SwissProt: None
	GNRHR (GeneID: 100483711), Chromosome: ['11'], SwissProt: None
	gnrhr (GeneID: 100552021), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 100589627), Chromosome: ['9'], SwissProt: None
	GNRHR (GeneID: 100662022), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 100758518), Chromosome: ['1'], SwissProt: None
	GNRHR (GeneID: 100860755), Chromosome: ['6'], SwissProt: None
	GNRHR (GeneID: 100916844), Chromosome: ['6'], SwissProt: None
	GNRHR (GeneID: 100947835), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 100967998), Chromosome: ['4'], SwissProt: None
	GNRHR (GeneID: 100998025), Chromosome: ['3'], SwissProt: None
	GNRHR (GeneID: 101052255), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 101088838), Chromosome: ['B1'], SwissProt: None
	GNRHR (GeneID: 101126857), Chromosome: ['4'], SwissProt: None
	GNRHR (GeneID: 101276174), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 101318246), Chromosome: ['5'], SwissProt: None
	LOC101344406 (GeneID: 101344406), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 101376569), Chromosome: ['Un'], SwissProt: None
	LOC101400209 (GeneID: 101400209), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 101437330), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 101521641), Chromosome: ['15'], SwissProt: None
	GNRHR (GeneID: 101546247), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 101574337), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 101607750), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 101623253), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 101644460), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 101673028), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 101704341), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 101965169), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 101982801), Chromosome: ['LG1'], SwissProt: None
	GNRHR (GeneID: 102073570), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 102133556), Chromosome: ['5'], SwissProt: None
	GNRHR (GeneID: 102259709), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 102277387), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 102392862), Chromosome: ['7'], SwissProt: None
	GNRHR (GeneID: 102443335), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 102491851), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 102507009), Chromosome: ['2'], SwissProt: None
	GNRHR (GeneID: 102536567), Chromosome: ['2'], SwissProt: None
	GNRHR (GeneID: 102750209), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 102770612), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 102824185), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 102849089), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 102880174), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 102923015), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 102967418), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 102987576), Chromosome: ['7'], SwissProt: None
	GNRHR (GeneID: 103006753), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 103067336), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 103079467), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 103119982), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 103198595), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 103235600), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 103273704), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 103301727), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 103552793), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 103602906), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 103679175), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 103750659), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 103757041), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 103914819), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 104037407), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 104262040), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 104273099), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 104339302), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 104381848), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 104464873), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 104481774), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 104492255), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 104508031), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 104519484), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 104548316), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 104577671), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 104632340), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 104664117), Chromosome: ['2'], SwissProt: None
	Gnrhr (GeneID: 104853276), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 104995176), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 105078888), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 105101496), Chromosome: ['2'], SwissProt: None
	GNRHR (GeneID: 105292442), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 105463605), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 105510483), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 105528593), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 105596089), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 105731588), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 105811193), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 105869284), Chromosome: ['26'], SwissProt: None
	Gnrhr (GeneID: 105980204), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 106020964), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 106843218), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 106977029), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 107139050), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 107520510), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 107537230), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 108294649), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 108392639), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 108530291), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 109278270), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 109288442), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 109375906), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 109560161), Chromosome: ['6'], SwissProt: None
	GNRHR (GeneID: 110150965), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 110217426), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 110295230), Chromosome: ['5'], SwissProt: None
	Gnrhr (GeneID: 110330873), Chromosome: ['13'], SwissProt: None
	Gnrhr (GeneID: 110543014), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 110591499), Chromosome: ['Un'], SwissProt: None
	LOC111149281 (GeneID: 111149281), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 111168121), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 111555805), Chromosome: ['3'], SwissProt: None
	GNRHR (GeneID: 111933979), Chromosome: ['10'], SwissProt: None
	GNRHR (GeneID: 112320151), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 112401456), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 112625046), Chromosome: ['5'], SwissProt: None
	GNRHR (GeneID: 112652328), Chromosome: ['13'], SwissProt: None
	GNRHR (GeneID: 112836063), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 112858519), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 112923608), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 113196079), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 113262543), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 113484062), Chromosome: ['10'], SwissProt: None
	GNRHR (GeneID: 113611048), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 113894336), Chromosome: ['6'], SwissProt: None
	GNRHR (GeneID: 113925080), Chromosome: ['2'], SwissProt: None
	GNRHR (GeneID: 113984600), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 114023360), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 114102074), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 114206858), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 114493293), Chromosome: ['1'], SwissProt: None
	Gnrhr (GeneID: 114638378), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 114709126), Chromosome: ['10'], SwissProt: None
	GNRHR (GeneID: 114902876), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 115080820), Chromosome: ['1'], SwissProt: None
	GNRHR (GeneID: 115296774), Chromosome: ['1'], SwissProt: None
	GNRHR (GeneID: 115512588), Chromosome: ['B1'], SwissProt: None
	GNRHR (GeneID: 115865686), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 116070568), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 116468321), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 116537243), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 116585139), Chromosome: ['2'], SwissProt: None
	GNRHR (GeneID: 116628008), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 116753820), Chromosome: ['5'], SwissProt: None
	GNRHR (GeneID: 116861213), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 116912421), Chromosome: ['11'], SwissProt: None
	GNRHR (GeneID: 117022409), Chromosome: ['5'], SwissProt: None
	GNRHR (GeneID: 117085552), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 117713036), Chromosome: ['7'], SwissProt: None
	GNRHR (GeneID: 118010484), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 118538754), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 118592576), Chromosome: ['10'], SwissProt: None
	GNRHR (GeneID: 118643503), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 118652677), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 118705869), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 118854441), Chromosome: ['6'], SwissProt: ['Q9TTI8']
	GNRHR (GeneID: 118895192), Chromosome: ['5'], SwissProt: None
	GNRHR (GeneID: 118928662), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 118984310), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 119052689), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 119260764), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 119528645), Chromosome: ['3'], SwissProt: None
	GNRHR (GeneID: 119705255), Chromosome: ['10'], SwissProt: None
	Gnrhr (GeneID: 119804414), Chromosome: ['1'], SwissProt: None
	GNRHR (GeneID: 119938976), Chromosome: ['17'], SwissProt: None
	GNRHR (GeneID: 120222172), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 120588150), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 120862181), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 121034888), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 121147189), Chromosome: ['Un'], SwissProt: None
	Gnrhr (GeneID: 121460951), Chromosome: ['Un'], SwissProt: None
	GNRHR (GeneID: 121472796), Chromosome: ['12'], SwissProt: None
	Gnrhr (GeneID: 14715), Chromosome: ['5'], SwissProt: ['Q01776']
	GNRHR (GeneID: 2798), Chromosome: ['4'], SwissProt: ['P30968']
	GNRHR (GeneID: 281798), Chromosome: ['6'], SwissProt: ['P32236']
	GNRHR (GeneID: 397515), Chromosome: ['8'], SwissProt: ['P49922']
	GNRHR (GeneID: 403718), Chromosome: ['13'], SwissProt: ['Q9MZI6']
	GNRHR (GeneID: 443413), Chromosome: ['6'], SwissProt: ['P32237']
	GNRHR (GeneID: 471226), Chromosome: ['4'], SwissProt: None
	GNRHR (GeneID: 711577), Chromosome: ['5'], SwissProt: ['Q8SPZ1']
	Gnrhr (GeneID: 81668), Chromosome: ['14'], SwissProt: ['P30969']
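
A listing this long is easier to read once it is tallied. The sketch below is one way to do that offline, assuming the line layout printed above. It parses a handful of lines copied from the listing (not the full set, and with the leading tab omitted) using a regular expression. In the notebook itself the same tallies could be taken directly from the gene objects; the text parsing here only keeps the example self-contained. `LINE_PATTERN` and the variable names are illustrative.

```python
import re
from collections import Counter

# A few lines copied from the listing above (leading tab omitted).
sample_output = """\
GNRHR (GeneID: 100009509), Chromosome: ['15'], SwissProt: None
GNRHR (GeneID: 100033874), Chromosome: ['3'], SwissProt: ['O18821']
Gnrhr (GeneID: 100135532), Chromosome: ['Un'], SwissProt: ['Q8CH60']
LOC101344406 (GeneID: 101344406), Chromosome: ['Un'], SwissProt: None
GNRHR (GeneID: 2798), Chromosome: ['4'], SwissProt: ['P30968']
"""

# Matches the format produced by report_on_gene_descriptors for
# single-valued chromosome and SwissProt lists.
LINE_PATTERN = re.compile(
    r"(?P<symbol>\S+) \(GeneID: (?P<gene_id>\d+)\), "
    r"Chromosome: \['(?P<chromosome>[^']*)'\], "
    r"SwissProt: (?:None|\['(?P<swiss_prot>[^']*)'\])"
)

parsed = [
    match.groupdict()
    for match in map(LINE_PATTERN.search, sample_output.splitlines())
    if match
]

# Genes whose chromosome is reported as 'Un'
unplaced = sum(1 for row in parsed if row['chromosome'] == 'Un')
# Gene IDs that carry a SwissProt accession
with_swiss_prot = [row['gene_id'] for row in parsed if row['swiss_prot']]
# How often each symbol spelling occurs
symbol_counts = Counter(row['symbol'] for row in parsed)

print(f'{len(parsed)} genes parsed, {unplaced} reported on chromosome Un')
print(f'With SwissProt accession: {with_swiss_prot}')
print(symbol_counts.most_common())
```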

Build a table of key metadata for GnRHR genes across vertebrates

Let's expand the taxonomic scope even further, and look at a selection of vertebrates.

We'll use a pre-determined list of Gene IDs to get gene summaries for these genes and build an easily readable table with key information about these genes.

In [1]:
# extract fields of interest from descriptors class to build a table
cols = '''
common_name
taxonomic_name
symbol
type
chromosome
num_transcripts
ensembl_id
omim_id
uniprot_id
nomenclature_id
nomenclature_auth
genome_coordinates
'''
cols = cols.split('\n')[1:-1]

def _range_repr(intervals):
    # format each interval as 'begin_end' and join them with commas
    return ','.join(f'{interval.begin}_{interval.end}' for interval in intervals)

def _ranges_repr(ranges):
    # prefix each range with its sequence accession, e.g. 'NC_000004.12:67737118_67754388'
    return ','.join(f'{r.accession_version}:{_range_repr(r.range)}' for r in ranges)

# specify genes of interest and retrieve descriptors
gene_ids = [2798, 114814, 404718, 14715, 109324103, 109309182, 281798, 395368, 403718, 427517, 471226, 7226731, 100001586, 100135415, 100135416, 100135417, 100136028, 100270671, 100270672, 101318246, 101932446, 101935915, 101953943, 102193667, 102202954, 102205592, 102346610, 102363373, 102364206, 102366752, 102536567, 102687824, 102694185, 102770612, 103899900, 103899926, 105916404, 105919697, 105934126, 108392639, 109987527, 109994050, 109999298, 110488224, 110495632, 110496352, 110513414, 110520912, 112994411, 112996301, 114645297, 114667483]
gene_metadata = ds_gene_instance.gene_metadata_by_id(gene_ids)

# collect elements of the descriptor class into a dictionary based on each gene ID
table_data = {}
for g in gene_metadata.genes:
    if not g.gene:
        print(f'Gene not found: {g}')
        continue
    gene = g.gene

    table_data[gene.gene_id] = [gene.common_name]
    table_data[gene.gene_id].append(gene.taxname)
    table_data[gene.gene_id].append(gene.symbol)
    table_data[gene.gene_id].append(gene.type)
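    # note: gene.chromosome is None for every gene in the output below;
    # the earlier cells read the list-valued gene.chromosomes instead
    table_data[gene.gene_id].append(gene.chromosome)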
    if gene.transcripts:
        table_data[gene.gene_id].append(len(gene.transcripts))
    else:
        table_data[gene.gene_id].append(0)
    table_data[gene.gene_id].append(gene.ensembl_gene_ids)
    table_data[gene.gene_id].append(gene.omim_ids)
    table_data[gene.gene_id].append(gene.swiss_prot_accessions)
    if gene.nomenclature_authority:
        table_data[gene.gene_id].append(gene.nomenclature_authority.identifier)
        table_data[gene.gene_id].append(gene.nomenclature_authority.authority)
    else:
        table_data[gene.gene_id].append(None)
        table_data[gene.gene_id].append(None)
    table_data[gene.gene_id].append(_ranges_repr(gene.genomic_ranges))

df = pd.DataFrame.from_dict(table_data, orient='index', columns=cols)
df.index.name = 'gene_id'
df
Gene not found: {'query': ['7226731'],
 'warnings': [{'gene_warning_code': 'DISCONTINUED_GENE_ID',
               'message': 'The gene you requested, (7226731) is a valid NCBI '
                          'Gene IDs that has been discontinued. It will be '
                          'omitted from your dataset. For more information '
                          'about the discontinued genes, visit NCBI Gene.\n',
               'reason': 'This GeneID has been discontinued.',
               'unrecognized_identifier': '7226731'}]}
Out[1]:
| gene_id | common_name | taxonomic_name | symbol | type | chromosome | num_transcripts | ensembl_id | omim_id | uniprot_id | nomenclature_id | nomenclature_auth | genome_coordinates |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100001586 | zebrafish | Danio rerio | gnrhr4 | PROTEIN_CODING | None | 2 | [ENSDARG00000038116] | None | None | ZDB-GENE-050419-76 | ZFIN | NC_007129.7:25390740_25402909,NW_018395028.1:1... |
| 100135415 | tropical clawed frog | Xenopus tropicalis | gnrhr2 | PROTEIN_CODING | None | 3 | [ENSXETG00000021161] | None | None | XB-GENE-5867415 | Xenbase | NC_030679.2:109001475_109014876 |
| 100135416 | tropical clawed frog | Xenopus tropicalis | gnrhr2/nmi | PROTEIN_CODING | None | 2 | [ENSXETG00000005637] | None | None | None | None | NC_030679.2:116455054_116472596 |
| 100135417 | tropical clawed frog | Xenopus tropicalis | gnrhr | PROTEIN_CODING | None | 2 | [ENSXETG00000001290] | None | None | XB-GENE-5753573 | Xenbase | NC_030684.2:142203708_142211556 |
| 100136028 | rainbow trout | Oncorhynchus mykiss | gnrh-r | PROTEIN_CODING | None | 1 | [ENSOMYG00000000839] | None | None | None | None | NC_048566.1:29925736_29930354 |
| 100270671 | zebrafish | Danio rerio | gnrhr2 | PROTEIN_CODING | None | 1 | [ENSDARG00000003553] | None | None | ZDB-GENE-090128-3 | ZFIN | NC_007118.7:52364748_52368843,NW_018395268.1:1... |
| 100270672 | zebrafish | Danio rerio | gnrhr1 | PROTEIN_CODING | None | 2 | [ENSDARG00000100593] | None | None | ZDB-GENE-090128-2 | ZFIN | NC_007130.7:43213293_43235554 |
| 101318246 | common bottlenose dolphin | Tursiops truncatus | GNRHR | PROTEIN_CODING | None | 1 | None | None | None | None | None | NC_047038.1:85406171_85428905 |
| 101932446 | Painted turtle | Chrysemys picta | LOC101932446 | PROTEIN_CODING | None | 1 | [ENSCPBG00000024942] | None | None | None | None | NW_007281386.1:1139258_1141674 |
| 101935915 | Painted turtle | Chrysemys picta | LOC101935915 | PROTEIN_CODING | None | 2 | [ENSCPBG00000007948] | None | None | None | None | NW_007281382.1:370463_379501 |
| 101953943 | Painted turtle | Chrysemys picta | LOC101953943 | PROTEIN_CODING | None | 6 | [ENSCPBG00000014672] | None | None | None | None | NW_007281446.1:3015207_3081999 |
| 102193667 | None | Pundamilia nyererei | LOC102193667 | PROTEIN_CODING | None | 1 | [ENSPNYG00000019741] | None | None | None | None | NW_005187452.1:3461026_3466217 |
| 102202954 | None | Pundamilia nyererei | gnrhr4 | PROTEIN_CODING | None | 1 | None | None | None | None | None | NW_005187436.1:1835841_1849078 |
| 102205592 | None | Pundamilia nyererei | LOC102205592 | PROTEIN_CODING | None | 1 | [ENSPNYG00000007579] | None | None | None | None | NW_005187426.1:621250_624817 |
| 102346610 | coelacanth | Latimeria chalumnae | LOC102346610 | PROTEIN_CODING | None | 2 | [ENSLACG00000014236] | None | None | None | None | NW_005819113.1:988319_995279 |
| 102363373 | coelacanth | Latimeria chalumnae | LOC102363373 | PROTEIN_CODING | None | 1 | None | None | None | None | None | NW_005819068.1:1628819_1638066 |
| 102364206 | coelacanth | Latimeria chalumnae | GNRHR | PROTEIN_CODING | None | 1 | [ENSLACG00000013490] | None | None | None | None | NW_005819460.1:865222_884433 |
| 102366752 | coelacanth | Latimeria chalumnae | GNRHR1/III | PROTEIN_CODING | None | 1 | None | None | None | None | None | NW_005819060.1:2885236_2899430 |
| 102536567 | alpaca | Vicugna pacos | GNRHR | PROTEIN_CODING | None | 2 | None | None | None | None | None | NW_021964157.1:55397729_55411117 |
| 102687824 | spotted gar | Lepisosteus oculatus | gnrhr4 | PROTEIN_CODING | None | 1 | [ENSLOCG00000014360] | None | None | None | None | NC_023181.1:47828941_47842900 |
| 102694185 | spotted gar | Lepisosteus oculatus | LOC102694185 | PROTEIN_CODING | None | 3 | [ENSLOCG00000008760] | None | None | None | None | NC_023202.1:8297180_8319507 |
| 102770612 | None | Myotis davidii | GNRHR | PROTEIN_CODING | None | 1 | None | None | None | None | None | NW_006283639.1:3309455_3329661 |
| 103899900 | emperor penguin | Aptenodytes forsteri | LOC103899900 | PROTEIN_CODING | None | 1 | None | None | None | None | None | NW_008795371.1:3421273_3430177 |
| 103899926 | emperor penguin | Aptenodytes forsteri | LOC103899926 | PROTEIN_CODING | None | 1 | None | None | None | None | None | NW_008795371.1:6612054_6613998 |
| 105916404 | mummichog | Fundulus heteroclitus | LOC105916404 | PROTEIN_CODING | None | 1 | [ENSFHEG00000000340] | None | None | None | None | NC_046364.1:20890429_20902473 |
| 105919697 | mummichog | Fundulus heteroclitus | gnrhr4 | PROTEIN_CODING | None | 1 | [ENSFHEG00000022489] | None | None | None | None | NC_046362.1:18441900_18463525 |
| 105934126 | mummichog | Fundulus heteroclitus | LOC105934126 | PROTEIN_CODING | None | 1 | [ENSFHEG00000001414] | None | None | None | None | NC_046364.1:19456522_19466386 |
| 108392639 | Malayan pangolin | Manis javanica | GNRHR | PROTEIN_CODING | None | 2 | None | None | None | None | None | NW_023436115.1:4749173_4760643 |
| 109309182 | Australian saltwater crocodile | Crocodylus porosus | LOC109309182 | PROTEIN_CODING | None | 2 | [ENSCPRG00005000240] | None | None | None | None | NW_017728911.1:36428760_36437582 |
| 109324103 | Australian saltwater crocodile | Crocodylus porosus | LOC109324103 | PROTEIN_CODING | None | 1 | [ENSCPRG00005008895] | None | None | None | None | NW_017728893.1:1169521_1173474 |
| 109987527 | ballan wrasse | Labrus bergylta | LOC109987527 | PROTEIN_CODING | None | 1 | [ENSLBEG00000028052] | None | None | None | None | NW_018114533.1:1324248_1334822 |
| 109994050 | ballan wrasse | Labrus bergylta | LOC109994050 | PROTEIN_CODING | None | 1 | None | None | None | None | None | NW_018114415.1:5852414_5857700 |
| 109999298 | ballan wrasse | Labrus bergylta | gnrhr4 | PROTEIN_CODING | None | 1 | [ENSLBEG00000006912] | None | None | None | None | NW_018114964.1:235812_256320 |
| 110488224 | rainbow trout | Oncorhynchus mykiss | LOC110488224 | PROTEIN_CODING | None | 1 | [ENSOMYG00000013327] | None | None | None | None | NC_050572.1:29110296_29130379 |
| 110495632 | rainbow trout | Oncorhynchus mykiss | LOC110495632 | PROTEIN_CODING | None | 1 | [ENSOMYG00000026325] | None | None | None | None | NC_048566.1:72336802_72343356 |
| 110496352 | rainbow trout | Oncorhynchus mykiss | LOC110496352 | PROTEIN_CODING | None | 1 | [ENSOMYG00000027791] | None | None | None | None | NC_048582.1:44582635_44599541 |
| 110513414 | rainbow trout | Oncorhynchus mykiss | LOC110513414 | PROTEIN_CODING | None | 1 | [ENSOMYG00000000211] | None | None | None | None | NC_048567.1:21493159_21497274 |
| 110520912 | rainbow trout | Oncorhynchus mykiss | gnrhr4 | PROTEIN_CODING | None | 1 | [ENSOMYG00000040422] | None | None | None | None | NC_048565.1:30644730_30649046 |
| 112994411 | emu | Dromaius novaehollandiae | LOC112994411 | PROTEIN_CODING | None | 1 | None | None | None | None | None | NW_020453730.1:6179_7726 |
| 112996301 | emu | Dromaius novaehollandiae | LOC112996301 | PROTEIN_CODING | None | 1 | [ENSDNVG00000011124] | None | None | None | None | NW_020453948.1:1559259_1562466 |
| 114645297 | reedfish | Erpetoichthys calabaricus | LOC114645297 | PROTEIN_CODING | None | 1 | None | None | None | None | None | NC_041395.1:82016453_82036206 |
| 114667483 | reedfish | Erpetoichthys calabaricus | gnrhr4 | PROTEIN_CODING | None | 1 | [ENSECRG00000016482] | None | None | None | None | NC_041410.1:98965514_99009309 |
| 114814 | human | Homo sapiens | GNRHR2 | PSEUDO | None | 3 | [ENSG00000211451] | [612875] | [Q96P88] | HGNC:16341 | HGNC | NC_000001.11:145919013_145925341 |
| 14715 | house mouse | Mus musculus | Gnrhr | PROTEIN_CODING | None | 4 | [ENSMUSG00000029255] | None | [Q01776] | MGI:95790 | MGI | NC_000071.7:86328613_86345760 |
| 2798 | human | Homo sapiens | GNRHR | PROTEIN_CODING | None | 2 | [ENSG00000109163] | [138850] | [P30968] | HGNC:4421 | HGNC | NC_000004.12:67737118_67754388 |
| 281798 | cattle | Bos taurus | GNRHR | PROTEIN_CODING | None | 2 | [ENSBTAG00000000438] | None | [P32236] | VGNC:29482 | VGNC | NC_037333.1:83434759_83452222 |
| 395368 | chicken | Gallus gallus | GNRHR | PROTEIN_CODING | None | 4 | [ENSGALG00000021020] | None | None | 49286 | CGNC | NC_052541.1:17827491_17845010,NC_052582.1:1770... |
| 403718 | dog | Canis lupus familiaris | GNRHR | PROTEIN_CODING | None | 1 | [ENSCAFG00000002785] | None | [Q9MZI6] | VGNC:41337 | VGNC | NC_051817.1:58865514_58878973,NC_049273.1:5859... |
| 404718 | human | Homo sapiens | GNRHR2P1 | PSEUDO | None | 0 | [ENSG00000259169] | None | None | HGNC:39424 | HGNC | NC_000014.9:60398823_60400013 |
| 427517 | chicken | Gallus gallus | GNRHR | PROTEIN_CODING | None | 2 | [ENSGALG00000020923] | None | None | 53110 | CGNC | NC_052541.1:20013512_20016275,NC_052582.1:1987... |
| 471226 | chimpanzee | Pan troglodytes | GNRHR | PROTEIN_CODING | None | 2 | [ENSPTRG00000016100] | None | None | VGNC:2019 | VGNC | NC_036883.1:61240135_61260886 |
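
The strings in the genome_coordinates column can be turned back into numbers, for example to estimate the span of each gene. The sketch below is a hedged illustration: the helper names `parse_coordinates` and `span_lengths` are hypothetical, and the span calculation assumes the begin and end positions are 1-based and inclusive, which the notebook does not state. Entries that the table display truncated with "..." cannot be parsed from the printed text; the DataFrame itself holds the full strings.

```python
def parse_coordinates(text):
    """Parse 'ACC:begin_end[,begin_end][,ACC2:begin_end]' into tuples.

    A token without an accession prefix is treated as a further interval
    on the most recently seen accession.
    """
    parsed = []
    accession = None
    for token in text.split(','):
        if ':' in token:
            accession, token = token.split(':', 1)
        begin, end = token.split('_')
        parsed.append((accession, int(begin), int(end)))
    return parsed


def span_lengths(text):
    # Assumption: coordinates are 1-based and inclusive, hence the +1.
    return [end - begin + 1 for _, begin, end in parse_coordinates(text)]


# Coordinates copied from the human GNRHR and GNRHR2P1 rows above.
gnrhr = parse_coordinates('NC_000004.12:67737118_67754388')
gnrhr_span = span_lengths('NC_000004.12:67737118_67754388')
gnrhr2p1_span = span_lengths('NC_000014.9:60398823_60400013')

print(gnrhr)
print(gnrhr_span, gnrhr2p1_span)
```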

Build a table showing GnRHR gene copy number across vertebrates

We're going to build a table showing how gene count varies among a selected group of vertebrates.
Note that rainbow trout has the most gene copies at 6, while numerous mammals, including mouse, dolphin and alpaca, have only a single annotated gene copy.

In [1]:
# count genes per organism (groupby leaves out rows with no common_name)
gene_cnt = df.groupby('common_name')['symbol'].count().reset_index()
gene_cnt.columns = ['organism', 'gene_count']
gene_cnt.sort_values('gene_count', ascending=False, inplace=True)
gene_cnt
Out[1]:
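|  | organism | gene_count |
| --- | --- | --- |
| 16 | rainbow trout | 6 |
| 8 | coelacanth | 4 |
| 20 | zebrafish | 3 |
| 2 | Painted turtle | 3 |
| 19 | tropical clawed frog | 3 |
| 4 | ballan wrasse | 3 |
| 15 | mummichog | 3 |
| 14 | human | 3 |
| 12 | emu | 2 |
| 18 | spotted gar | 2 |
| 17 | reedfish | 2 |
| 0 | Australian saltwater crocodile | 2 |
| 11 | emperor penguin | 2 |
| 6 | chicken | 2 |
| 13 | house mouse | 1 |
| 1 | Malayan pangolin | 1 |
| 9 | common bottlenose dolphin | 1 |
| 7 | chimpanzee | 1 |
| 5 | cattle | 1 |
| 3 | alpaca | 1 |
| 10 | dog | 1 |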
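
Two species in the metadata table, Pundamilia nyererei and Myotis davidii, have no common name and therefore do not appear in the counts above, because pandas `groupby` drops missing keys by default. The sketch below shows one possible workaround in plain Python: fall back to the taxonomic name when the common name is missing. The records are a small sample copied from the metadata table, and `organism_label` is a hypothetical helper.

```python
from collections import Counter

# (common_name, taxonomic_name) pairs copied from a few rows of the
# metadata table; this is a sample, not the full set of genes.
records = [
    ('zebrafish', 'Danio rerio'),
    ('zebrafish', 'Danio rerio'),
    ('zebrafish', 'Danio rerio'),
    (None, 'Pundamilia nyererei'),
    (None, 'Pundamilia nyererei'),
    (None, 'Pundamilia nyererei'),
    (None, 'Myotis davidii'),
    ('alpaca', 'Vicugna pacos'),
]


def organism_label(common_name, taxonomic_name):
    # Prefer the common name, but never lose a gene for lack of one.
    return common_name if common_name else taxonomic_name


gene_counts = Counter(organism_label(c, t) for c, t in records)

for organism, count in gene_counts.most_common():
    print(f'{organism}\t{count}')
```

With the DataFrame from the previous cell, one option along the same lines would be to group on `df['common_name'].fillna(df['taxonomic_name'])` instead of on `common_name` alone.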

Use gene datasets to build a transcript-focused table

Finally, we are going to download a gene dataset for the human GnRHR genes, and use metadata included as part of the dataset to build a transcript-focused table.
We'll take this metadata from the data report file, data_report.jsonl.
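
JSON Lines stores one JSON object per line, so a report in this format can be read one record at a time. The sketch below shows the idea with the standard library only. The two records and their key names (`gene_id`, `symbol`, `taxname`) are illustrative assumptions and may not match the actual schema of data_report.jsonl; `iter_jsonl` is a hypothetical helper.

```python
import json

# Two illustrative records. The key names are assumptions made for this
# sketch and may differ from the real data_report.jsonl schema.
jsonl_text = '\n'.join([
    json.dumps({'gene_id': '2798', 'symbol': 'GNRHR', 'taxname': 'Homo sapiens'}),
    json.dumps({'gene_id': '114814', 'symbol': 'GNRHR2', 'taxname': 'Homo sapiens'}),
])


def iter_jsonl(text):
    """Yield one parsed object per non-empty line."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(iter_jsonl(jsonl_text))
symbols = [record['symbol'] for record in records]
print(symbols)
```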

In [1]:
%%time

gene_ids = [2798, 114814, 404718]
gene_ds_download = ds_gene_instance.download_gene_package(gene_ids, _preload_content=False)

# write the download to a zip file
zipfile_name = 'gene_ds.zip'
with open(zipfile_name, 'wb') as f:
    f.write(gene_ds_download.data)

print(f'Download saved to {zipfile_name}')
Download saved to gene_ds.zip
CPU times: user 3.19 ms, sys: 154 µs, total: 3.35 ms
Wall time: 74.4 ms
In [1]:
!unzip -v {zipfile_name}
Archive:  gene_ds.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
     661  Defl:N      384  42% 07-16-2021 15:01 bc3c97af  README.md
    7081  Defl:N     1448  80% 07-16-2021 15:01 f96d97c3  ncbi_dataset/data/data_report.jsonl
    1434  Defl:N      478  67% 07-16-2021 15:01 fa82a51a  ncbi_dataset/data/data_table.tsv
     220  Defl:N      115  48% 07-16-2021 15:01 ddb3349b  ncbi_dataset/data/dataset_catalog.json
--------          -------  ---                            -------
    9396             2425  74%                            4 files
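
The archive can also be inspected from Python with the standard `zipfile` module instead of the shell. The sketch below is self-contained: it builds a small in-memory archive whose member names are copied from the listing above and whose contents are placeholders, then lists it. In the notebook, `zipfile.ZipFile(zipfile_name)` would take the place of the in-memory buffer.

```python
import io
import zipfile

# Member names copied from the unzip listing above; contents are placeholders.
member_names = [
    'README.md',
    'ncbi_dataset/data/data_report.jsonl',
    'ncbi_dataset/data/data_table.tsv',
    'ncbi_dataset/data/dataset_catalog.json',
]

# Build a stand-in archive in memory so the example needs no download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    for name in member_names:
        archive.writestr(name, 'placeholder\n')

# Re-open the archive for reading and collect name and uncompressed size.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    listing = [(info.filename, info.file_size) for info in archive.infolist()]
    has_report = 'ncbi_dataset/data/data_report.jsonl' in archive.namelist()

for name, size in listing:
    print(f'{size:>8}  {name}')
print(f'data report present: {has_report}')
```
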
In [1]:
import zipfile
from google.protobuf.json_format import ParseDict
import jsonlines
import ncbi.datasets.v1.reports.gene_pb2 as gene_report_pb2

def gene_report_for(path_to_zipfile):
    '''
    Return an object representing the data report.
    path_to_zipfile: The relative path to the zipfile containing the gene data report
    '''
    gene_report = gene_report_pb2.GeneDescriptors()
    with zipfile.ZipFile(path_to_zipfile, 'r') as archive:
        with archive.open('ncbi_dataset/data/data_report.jsonl') as file:
            reader = jsonlines.Reader(file)
            for gene_dict in reader.iter(type=dict, skip_invalid=True):
                ParseDict(gene_dict, gene_report.genes.add())
    return gene_report

def _5prime_len(transcript):
    if not transcript.cds or not transcript.cds.range:
        return None
    return transcript.cds.range[0].begin - 1

def _3prime_len(transcript):
    if not transcript.cds or not transcript.cds.range:
        return None
    return transcript.length - transcript.cds.range[0].end

gene_report = gene_report_for(zipfile_name)

rows = []
for gene in gene_report.genes:

    # transcripts for each gene are embedded as lists and require additional handling
    for transcript in gene.transcripts:
        rows.append({
            'gene_id': gene.gene_id,
            'gene_symbol': gene.symbol,
            'gene_taxonomy': gene.taxname,            
            'accVer': transcript.accession_version,
            'name': transcript.name,
            'length': transcript.length,
            '5`UTR_len': _5prime_len(transcript),
            '3`UTR_len': _3prime_len(transcript),
            'protAccVer': transcript.protein.accession_version or None,
            'protName': transcript.protein.isoform_name or None,
            'protLength': transcript.protein.length or None,
            'exonAccVer': transcript.exons.accession_version,
            'numExons': len(transcript.exons.range),
        })

transcript_table = pd.DataFrame(rows)

transcript_table
Out[1]:
|  | gene_id | gene_symbol | gene_taxonomy | accVer | name | length | 5`UTR_len | 3`UTR_len | protAccVer | protName | protLength | exonAccVer | numExons |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 114814 | GNRHR2 | Homo sapiens | NR_002328.4 | transcript variant 1 | 1626 | NaN | NaN | None | None | NaN | NC_000001.11 | 3 |
| 1 | 114814 | GNRHR2 | Homo sapiens | NR_104034.1 | transcript variant 3 | 786 | NaN | NaN | None | None | NaN | NC_000001.11 | 3 |
| 2 | 114814 | GNRHR2 | Homo sapiens | NR_104033.1 | transcript variant 2 | 1035 | NaN | NaN | None | None | NaN | NC_000001.11 | 4 |
| 3 | 2798 | GNRHR | Homo sapiens | NM_000406.3 | transcript variant 1 | 4402 | 53.0 | 3362.0 | NP_000397.1 | isoform 1 | 328.0 | NC_000004.12 | 3 |
| 4 | 2798 | GNRHR | Homo sapiens | NM_001012763.2 | transcript variant 2 | 4017 | 53.0 | 3214.0 | NP_001012781.1 | isoform 2 | 249.0 | NC_000004.12 | 3 |
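
The UTR lengths in the table can be sanity-checked against the protein lengths. The sketch below uses only the values shown in the two protein-coding rows and assumes that the coding sequence is a single interval that includes the stop codon, so that its length should equal three times the protein length plus three. The helper name `cds_length` is hypothetical.

```python
# Values copied from the two protein-coding rows of the table above.
transcripts = [
    {'accVer': 'NM_000406.3', 'length': 4402, 'utr5': 53, 'utr3': 3362, 'protLength': 328},
    {'accVer': 'NM_001012763.2', 'length': 4017, 'utr5': 53, 'utr3': 3214, 'protLength': 249},
]


def cds_length(length, utr5, utr3):
    # Whatever is not UTR is coding sequence (assumes one contiguous CDS).
    return length - utr5 - utr3


checks = {}
for t in transcripts:
    observed = cds_length(t['length'], t['utr5'], t['utr3'])
    # Assumption: the CDS includes the stop codon, hence protLength + 1 codons.
    expected = 3 * (t['protLength'] + 1)
    checks[t['accVer']] = (observed, expected)
    print(f"{t['accVer']}: CDS {observed} nt, expected {expected} nt")
```

For both transcripts the two numbers agree (987 and 750 nucleotides), which is consistent with the 5' UTR being computed as the CDS start minus one and the 3' UTR as the transcript length minus the CDS end.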