
Getting Gene data using NCBI Datasets command line tools

The objective of this Notebook is to demonstrate how to use NCBI Datasets command line tools to explore and download sequence and metadata for RefSeq annotated genes.

The datasets command-line tool currently returns two types of data:

Gene summaries are gene metadata returned in JSON format
Gene data packages are downloadable zip files including gene, transcript and protein sequence, a data table and a data report in JSON Lines format.

Getting Started

To get started, we'll first download the datasets command line tools and grant them execute permissions. Datasets has two command line tools:

The datasets tool is used to query and download sequence, annotation and metadata for all domains of life.
The dataformat tool is used to convert metadata downloaded from NCBI Datasets from JSON lines format to other formats.

In [1]:

%%bash
printf "Downloading CLI tools...\n"
for app in datasets dataformat
do
    curl --silent --remote-name "https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/${app}"
    chmod +x ${app}
    printf "[size: %s] %s v%s\n" $(du --human-readable ${app}) $(./${app} version)
done

Downloading CLI tools...
[size: 11M] datasets v11.7.0
[size: 13M] dataformat v11.7.0

We'll also download the command line tool jq to parse the datasets JSON Lines data reports into a readable format.

In [1]:

%%bash
curl --silent --location --output jq 'https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64'
chmod +x jq
printf "Downloaded %s" $(./jq --version)

Downloaded jq-1.6

Getting help

To get help with the datasets tools, or with any command or subcommand, add --help after the command.

In [1]:

!./datasets --help

datasets is a command-line tool that is used to query and download biological sequence data
across all domains of life from NCBI databases.

Refer to NCBI's [command line start](https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start) documentation for information about getting started with the command-line tools.

Usage
  datasets [command]

Data Retrieval Commands
  summary              print a summary of a gene or genome dataset
  download             download a gene, genome or coronavirus dataset as a zip file
  rehydrate            rehydrate a downloaded, dehydrated dataset

Miscellaneous Commands
  completion           generate autocompletion scripts
  version              print the version of this client and exit
  help                 Help about any command

Flags
  -h, --help   help for datasets

Use datasets help <command> for detailed help about a command.

Getting gene metadata for a list of Crassostrea virginica genes

In this step, we'll use the Datasets summary gene command to get gene metadata for a list of Crassostrea virginica genes. Datasets gene summaries can be queried using NCBI Gene ID, gene symbol or RefSeq transcript or protein accession combined with a taxon name. In this example, we'll query for three Crassostrea virginica genes, LOC111112135, LOC111112138 and LOC111110223, by specifying gene symbol and taxon name. To make the JSON output easy to read, we'll pipe it through the command line JSON parser jq.

In [1]:

!./datasets summary gene symbol LOC111112135 LOC111112138 LOC111110223 --taxon "crassostrea virginica"  | ./jq .

{
  "genes": [
    {
      "gene": {
        "annotations": [
          {
            "assemblies_in_scope": [
              {
                "accession": "GCF_002022765.2",
                "name": "C_virginica-3.0"
              }
            ],
            "release_date": "2017-09-11",
            "release_name": "NCBI Crassostrea virginica Annotation Release 100"
          }
        ],
        "chromosomes": [
          "8"
        ],
        "common_name": "eastern oyster",
        "description": "toll-like receptor 6",
        "gene_id": "111110223",
        "genomic_ranges": [
          {
            "accession_version": "NC_035787.1",
            "range": [
              {
                "begin": "70687673",
                "end": "70695429",
                "orientation": "minus"
              }
            ]
          }
        ],
        "nomenclature_authority": {},
        "orientation": "minus",
        "symbol": "LOC111110223",
        "tax_id": "6565",
        "taxname": "Crassostrea virginica",
        "transcripts": [
          {
            "accession_version": "XM_022446624.1",
            "cds": {
              "accession_version": "XM_022446624.1",
              "range": [
                {
                  "begin": "304",
                  "end": "1365"
                }
              ]
            },
            "exons": {
              "accession_version": "NC_035787.1",
              "range": [
                {
                  "begin": "70695129",
                  "end": "70695429",
                  "order": 1
                },
                {
                  "begin": "70691394",
                  "end": "70691585",
                  "order": 2
                },
                {
                  "begin": "70687673",
                  "end": "70690196",
                  "order": 3
                }
              ]
            },
            "genomic_locations": [
              {
                "exons": [
                  {
                    "begin": "70695129",
                    "end": "70695429",
                    "order": 1
                  },
                  {
                    "begin": "70691394",
                    "end": "70691585",
                    "order": 2
                  },
                  {
                    "begin": "70687673",
                    "end": "70690196",
                    "order": 3
                  }
                ],
                "genomic_accession_version": "NC_035787.1",
                "genomic_range": {
                  "begin": "70687673",
                  "end": "70695429",
                  "orientation": "minus"
                },
                "sequence_name": "Chromosome 8 Reference C_virginica-3.0 Primary Assembly"
              }
            ],
            "genomic_range": {
              "accession_version": "NC_035787.1",
              "range": [
                {
                  "begin": "70687673",
                  "end": "70695429",
                  "orientation": "minus"
                }
              ]
            },
            "length": 3017,
            "name": "transcript variant X2",
            "protein": {
              "accession_version": "XP_022302332.1",
              "isoform_name": "isoform X2",
              "length": 353,
              "name": "toll-like receptor 6"
            },
            "type": "PROTEIN_CODING_MODEL"
          },
          {
            "accession_version": "XM_022446623.1",
            "cds": {
              "accession_version": "XM_022446623.1",
              "range": [
                {
                  "begin": "304",
                  "end": "2562"
                }
              ]
            },
            "exons": {
              "accession_version": "NC_035787.1",
              "range": [
                {
                  "begin": "70695129",
                  "end": "70695429",
                  "order": 1
                },
                {
                  "begin": "70687673",
                  "end": "70691585",
                  "order": 2
                }
              ]
            },
            "genomic_locations": [
              {
                "exons": [
                  {
                    "begin": "70695129",
                    "end": "70695429",
                    "order": 1
                  },
                  {
                    "begin": "70687673",
                    "end": "70691585",
                    "order": 2
                  }
                ],
                "genomic_accession_version": "NC_035787.1",
                "genomic_range": {
                  "begin": "70687673",
                  "end": "70695429",
                  "orientation": "minus"
                },
                "sequence_name": "Chromosome 8 Reference C_virginica-3.0 Primary Assembly"
              }
            ],
            "genomic_range": {
              "accession_version": "NC_035787.1",
              "range": [
                {
                  "begin": "70687673",
                  "end": "70695429",
                  "orientation": "minus"
                }
              ]
            },
            "length": 4214,
            "name": "transcript variant X1",
            "protein": {
              "accession_version": "XP_022302331.1",
              "isoform_name": "isoform X1",
              "length": 752,
              "name": "toll-like receptor 13"
            },
            "type": "PROTEIN_CODING_MODEL"
          }
        ],
        "type": "PROTEIN_CODING"
      },
      "query": [
        "LOC111110223"
      ]
    },
    {
      "gene": {
        "annotations": [
          {
            "assemblies_in_scope": [
              {
                "accession": "GCF_002022765.2",
                "name": "C_virginica-3.0"
              }
            ],
            "release_date": "2017-09-11",
            "release_name": "NCBI Crassostrea virginica Annotation Release 100"
          }
        ],
        "chromosomes": [
          "9"
        ],
        "common_name": "eastern oyster",
        "description": "toll-like receptor 13",
        "gene_id": "111112135",
        "genomic_ranges": [
          {
            "accession_version": "NC_035788.1",
            "range": [
              {
                "begin": "101401835",
                "end": "101406321",
                "orientation": "plus"
              }
            ]
          }
        ],
        "nomenclature_authority": {},
        "orientation": "plus",
        "symbol": "LOC111112135",
        "tax_id": "6565",
        "taxname": "Crassostrea virginica",
        "transcripts": [
          {
            "accession_version": "XM_022449492.1",
            "cds": {
              "accession_version": "XM_022449492.1",
              "range": [
                {
                  "begin": "243",
                  "end": "2570"
                }
              ]
            },
            "exons": {
              "accession_version": "NC_035788.1",
              "range": [
                {
                  "begin": "101401835",
                  "end": "101401901",
                  "order": 1
                },
                {
                  "begin": "101402029",
                  "end": "101402206",
                  "order": 2
                },
                {
                  "begin": "101403913",
                  "end": "101406321",
                  "order": 3
                }
              ]
            },
            "genomic_locations": [
              {
                "exons": [
                  {
                    "begin": "101401835",
                    "end": "101401901",
                    "order": 1
                  },
                  {
                    "begin": "101402029",
                    "end": "101402206",
                    "order": 2
                  },
                  {
                    "begin": "101403913",
                    "end": "101406321",
                    "order": 3
                  }
                ],
                "genomic_accession_version": "NC_035788.1",
                "genomic_range": {
                  "begin": "101401835",
                  "end": "101406321",
                  "orientation": "plus"
                },
                "sequence_name": "Chromosome 9 Reference C_virginica-3.0 Primary Assembly"
              }
            ],
            "genomic_range": {
              "accession_version": "NC_035788.1",
              "range": [
                {
                  "begin": "101401835",
                  "end": "101406321",
                  "orientation": "plus"
                }
              ]
            },
            "length": 2654,
            "name": "transcript variant X2",
            "protein": {
              "accession_version": "XP_022305200.1",
              "isoform_name": "isoform X2",
              "length": 775,
              "name": "toll-like receptor 13"
            },
            "type": "PROTEIN_CODING_MODEL"
          },
          {
            "accession_version": "XM_022449491.1",
            "cds": {
              "accession_version": "XM_022449491.1",
              "range": [
                {
                  "begin": "49",
                  "end": "2379"
                }
              ]
            },
            "exons": {
              "accession_version": "NC_035788.1",
              "range": [
                {
                  "begin": "101401848",
                  "end": "101401901",
                  "order": 1
                },
                {
                  "begin": "101403913",
                  "end": "101406321",
                  "order": 2
                }
              ]
            },
            "genomic_locations": [
              {
                "exons": [
                  {
                    "begin": "101401848",
                    "end": "101401901",
                    "order": 1
                  },
                  {
                    "begin": "101403913",
                    "end": "101406321",
                    "order": 2
                  }
                ],
                "genomic_accession_version": "NC_035788.1",
                "genomic_range": {
                  "begin": "101401848",
                  "end": "101406321",
                  "orientation": "plus"
                },
                "sequence_name": "Chromosome 9 Reference C_virginica-3.0 Primary Assembly"
              }
            ],
            "genomic_range": {
              "accession_version": "NC_035788.1",
              "range": [
                {
                  "begin": "101401848",
                  "end": "101406321",
                  "orientation": "plus"
                }
              ]
            },
            "length": 2463,
            "name": "transcript variant X1",
            "protein": {
              "accession_version": "XP_022305199.1",
              "isoform_name": "isoform X1",
              "length": 776,
              "name": "toll-like receptor 13"
            },
            "type": "PROTEIN_CODING_MODEL"
          }
        ],
        "type": "PROTEIN_CODING"
      },
      "query": [
        "LOC111112135"
      ]
    },
    {
      "gene": {
        "annotations": [
          {
            "assemblies_in_scope": [
              {
                "accession": "GCF_002022765.2",
                "name": "C_virginica-3.0"
              }
            ],
            "release_date": "2017-09-11",
            "release_name": "NCBI Crassostrea virginica Annotation Release 100"
          }
        ],
        "chromosomes": [
          "9"
        ],
        "common_name": "eastern oyster",
        "description": "toll-like receptor 4",
        "gene_id": "111112138",
        "genomic_ranges": [
          {
            "accession_version": "NC_035788.1",
            "range": [
              {
                "begin": "101349832",
                "end": "101356947",
                "orientation": "plus"
              }
            ]
          }
        ],
        "nomenclature_authority": {},
        "orientation": "plus",
        "symbol": "LOC111112138",
        "tax_id": "6565",
        "taxname": "Crassostrea virginica",
        "transcripts": [
          {
            "accession_version": "XM_022449500.1",
            "cds": {
              "accession_version": "XM_022449500.1",
              "range": [
                {
                  "begin": "106",
                  "end": "2412"
                }
              ]
            },
            "exons": {
              "accession_version": "NC_035788.1",
              "range": [
                {
                  "begin": "101349832",
                  "end": "101349916",
                  "order": 1
                },
                {
                  "begin": "101354401",
                  "end": "101356947",
                  "order": 2
                }
              ]
            },
            "genomic_locations": [
              {
                "exons": [
                  {
                    "begin": "101349832",
                    "end": "101349916",
                    "order": 1
                  },
                  {
                    "begin": "101354401",
                    "end": "101356947",
                    "order": 2
                  }
                ],
                "genomic_accession_version": "NC_035788.1",
                "genomic_range": {
                  "begin": "101349832",
                  "end": "101356947",
                  "orientation": "plus"
                },
                "sequence_name": "Chromosome 9 Reference C_virginica-3.0 Primary Assembly"
              }
            ],
            "genomic_range": {
              "accession_version": "NC_035788.1",
              "range": [
                {
                  "begin": "101349832",
                  "end": "101356947",
                  "orientation": "plus"
                }
              ]
            },
            "length": 2632,
            "name": "transcript variant X1",
            "protein": {
              "accession_version": "XP_022305208.1",
              "length": 768,
              "name": "toll-like receptor 4"
            },
            "type": "PROTEIN_CODING_MODEL"
          },
          {
            "accession_version": "XM_022449501.1",
            "cds": {
              "accession_version": "XM_022449501.1",
              "range": [
                {
                  "begin": "143",
                  "end": "2449"
                }
              ]
            },
            "exons": {
              "accession_version": "NC_035788.1",
              "range": [
                {
                  "begin": "101352482",
                  "end": "101352603",
                  "order": 1
                },
                {
                  "begin": "101354401",
                  "end": "101356947",
                  "order": 2
                }
              ]
            },
            "genomic_locations": [
              {
                "exons": [
                  {
                    "begin": "101352482",
                    "end": "101352603",
                    "order": 1
                  },
                  {
                    "begin": "101354401",
                    "end": "101356947",
                    "order": 2
                  }
                ],
                "genomic_accession_version": "NC_035788.1",
                "genomic_range": {
                  "begin": "101352482",
                  "end": "101356947",
                  "orientation": "plus"
                },
                "sequence_name": "Chromosome 9 Reference C_virginica-3.0 Primary Assembly"
              }
            ],
            "genomic_range": {
              "accession_version": "NC_035788.1",
              "range": [
                {
                  "begin": "101352482",
                  "end": "101356947",
                  "orientation": "plus"
                }
              ]
            },
            "length": 2669,
            "name": "transcript variant X2",
            "protein": {
              "accession_version": "XP_022305209.1",
              "length": 768,
              "name": "toll-like receptor 4"
            },
            "type": "PROTEIN_CODING_MODEL"
          }
        ],
        "type": "PROTEIN_CODING"
      },
      "query": [
        "LOC111112138"
      ]
    }
  ]
}
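
The nested summary can also be flattened inside the notebook's Python kernel. The following is a minimal sketch rather than part of the Datasets tooling: it assumes the JSON has the layout shown above, embeds a trimmed excerpt of that output so it runs without a network call, and uses a hypothetical helper named summarize_genes.

```python
import json

# Trimmed excerpt of the summary output above; only a few fields are kept.
summary_json = """
{"genes": [
  {"gene": {"gene_id": "111110223", "symbol": "LOC111110223",
            "description": "toll-like receptor 6", "chromosomes": ["8"],
            "transcripts": [{"accession_version": "XM_022446624.1"},
                            {"accession_version": "XM_022446623.1"}]},
   "query": ["LOC111110223"]},
  {"gene": {"gene_id": "111112135", "symbol": "LOC111112135",
            "description": "toll-like receptor 13", "chromosomes": ["9"],
            "transcripts": [{"accession_version": "XM_022449492.1"},
                            {"accession_version": "XM_022449491.1"}]},
   "query": ["LOC111112135"]},
  {"gene": {"gene_id": "111112138", "symbol": "LOC111112138",
            "description": "toll-like receptor 4", "chromosomes": ["9"],
            "transcripts": [{"accession_version": "XM_022449500.1"},
                            {"accession_version": "XM_022449501.1"}]},
   "query": ["LOC111112138"]}
]}
"""


def summarize_genes(summary):
    """Return one flat dict per gene (hypothetical helper, not a Datasets API)."""
    rows = []
    for entry in summary.get("genes", []):
        gene = entry["gene"]
        rows.append({
            "gene_id": gene["gene_id"],
            "symbol": gene["symbol"],
            "description": gene.get("description", ""),
            "chromosomes": ",".join(gene.get("chromosomes", [])),
            "transcripts": [t["accession_version"] for t in gene.get("transcripts", [])],
        })
    return rows


rows = summarize_genes(json.loads(summary_json))
for row in rows:
    # One tab-separated line per gene, followed by its transcript accessions.
    print(row["gene_id"], row["symbol"], row["chromosomes"], row["description"],
          " ".join(row["transcripts"]), sep="\t")
```

To run this on real output, one option is to redirect the summary command to a file and load that file with json.load in place of the embedded string.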

Downloading gene sequence, annotation and metadata

Next, we'll use the Datasets command line tool to download a gene data package containing gene, transcript and protein sequence, a data report and a data table. The gene data report contains detailed gene metadata in a hierarchical JSON Lines format. The gene table contains a subset of the gene metadata in TSV format. Gene data packages can be queried using NCBI Gene ID, gene symbol or RefSeq transcript or protein accession combined with a taxon name.

The default gene dataset includes the following files:

gene.fna (gene sequences)
rna.fna (transcript sequences)
protein.faa (protein sequences)
data_report.jsonl (data report with gene metadata)
data_table.tsv (data table with gene metadata, one transcript per row)
dataset_catalog.json (a list of files and file types included in the dataset)

In this example, we'll query using the same three NCBI Gene symbols and taxon name. We'll also use the --filename flag to provide a custom name for the download package. For the purposes of this demonstration, we will redirect all messages from the datasets command to datasets.log.

In [1]:

!./datasets download gene symbol LOC111112135 LOC111112138 LOC111110223 --taxon "crassostrea virginica" --filename 3_eastern_oyster_genes.zip >datasets.log 2>&1
!printf "Downloaded:\n%s" "$(du --human-readable 3_eastern_oyster_genes.zip)"

Downloaded:
20K	3_eastern_oyster_genes.zip

We'll use the unzip command to view the contents of the gene data package.

In [1]:

!unzip -l 3_eastern_oyster_genes.zip

Archive:  3_eastern_oyster_genes.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      661  2021-03-21 16:47   README.md
    19976  2021-03-21 16:47   ncbi_dataset/data/gene.fna
    18487  2021-03-21 16:47   ncbi_dataset/data/rna.fna
     4793  2021-03-21 16:47   ncbi_dataset/data/protein.faa
     7195  2021-03-21 16:47   ncbi_dataset/data/data_report.jsonl
     1783  2021-03-21 16:47   ncbi_dataset/data/data_table.tsv
      454  2021-03-21 16:47   ncbi_dataset/data/dataset_catalog.json
---------                     -------
    53349                     7 files

Next, we'll extract the data files. Note that all NCBI Datasets packages use a similar file structure. The -o argument overwrites existing files without prompting.

In [1]:

!unzip -o 3_eastern_oyster_genes.zip

Archive:  3_eastern_oyster_genes.zip
  inflating: README.md               
  inflating: ncbi_dataset/data/gene.fna  
  inflating: ncbi_dataset/data/rna.fna  
  inflating: ncbi_dataset/data/protein.faa  
  inflating: ncbi_dataset/data/data_report.jsonl  
  inflating: ncbi_dataset/data/data_table.tsv  
  inflating: ncbi_dataset/data/dataset_catalog.json
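
Package files can also be read without extracting them, using Python's standard zipfile module. The sketch below is an illustration under stated assumptions: it builds a tiny stand-in archive in memory that only mimics the package layout (its file contents are placeholders, not real Datasets output), and list_members and read_member are hypothetical helper names.

```python
import io
import zipfile

# Stand-in archive built in memory so the sketch runs without a download.
# The member paths mirror the listing above; the contents are placeholders.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("README.md", "stand-in readme\n")
    archive.writestr(
        "ncbi_dataset/data/data_table.tsv",
        "gene_id\tdescription\n111110223\ttoll-like receptor 6\n",
    )


def list_members(package):
    """Return (path, uncompressed size in bytes) pairs, similar to `unzip -l`."""
    with zipfile.ZipFile(package) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


def read_member(package, name):
    """Return one member of the archive as text without extracting it to disk."""
    with zipfile.ZipFile(package) as archive:
        return archive.read(name).decode("utf-8")


for path, size in list_members(buffer):
    print(f"{size:>9}  {path}")

table_text = read_member(buffer, "ncbi_dataset/data/data_table.tsv")
print(table_text.splitlines()[0])
```

With a real download, the path "3_eastern_oyster_genes.zip" would be passed in place of the in-memory buffer.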

Getting metadata from the Datasets gene table

The Datasets gene data package contains two types of metadata files, the gene data report and the gene table. The gene data report contains detailed gene information in a hierarchical JSON Lines format. By contrast, the gene table contains a reduced, flattened representation of the hierarchical gene data report. In this step, we demonstrate how you can use common Unix commands to view metadata in the gene table and the headers of the sequence files.

In [1]:

!head ncbi_dataset/data/data_table.tsv | cut -f1,3-7

gene_id	description	scientific_name	common_name	tax_id	genomic_range
111110223	toll-like receptor 6	Crassostrea virginica	eastern oyster	6565	NC_035787.1:70687673-70695429
111110223	toll-like receptor 6	Crassostrea virginica	eastern oyster	6565	NC_035787.1:70687673-70695429
111112135	toll-like receptor 13	Crassostrea virginica	eastern oyster	6565	NC_035788.1:101401835-101406321
111112135	toll-like receptor 13	Crassostrea virginica	eastern oyster	6565	NC_035788.1:101401835-101406321
111112138	toll-like receptor 4	Crassostrea virginica	eastern oyster	6565	NC_035788.1:101349832-101356947
111112138	toll-like receptor 4	Crassostrea virginica	eastern oyster	6565	NC_035788.1:101349832-101356947
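
The same table can be read with Python's csv module. This is a rough sketch, not the notebook authors' method: it embeds only the six columns printed above (the real data_table.tsv has more), and it assumes the coordinates in genomic_range are 1-based and inclusive, which agrees with the exon coordinates and transcript lengths in the gene summary. The helper parse_range is a hypothetical name.

```python
import csv
import io
from collections import Counter

# Excerpt of data_table.tsv: only the six columns printed above.
HEADER = ["gene_id", "description", "scientific_name", "common_name", "tax_id", "genomic_range"]
ORGANISM = ["Crassostrea virginica", "eastern oyster", "6565"]
TLR6 = ["111110223", "toll-like receptor 6", *ORGANISM, "NC_035787.1:70687673-70695429"]
TLR13 = ["111112135", "toll-like receptor 13", *ORGANISM, "NC_035788.1:101401835-101406321"]
TLR4 = ["111112138", "toll-like receptor 4", *ORGANISM, "NC_035788.1:101349832-101356947"]
table_text = "\n".join("\t".join(row) for row in [HEADER, TLR6, TLR6, TLR13, TLR13, TLR4, TLR4])


def parse_range(text):
    """Split 'accession:begin-end' into (accession, begin, end) with integer coordinates."""
    accession, span = text.rsplit(":", 1)
    begin, end = span.split("-")
    return accession, int(begin), int(end)


rows = list(csv.DictReader(io.StringIO(table_text), delimiter="\t"))

# The table has one row per transcript, so rows per gene_id equals transcripts per gene.
transcripts_per_gene = Counter(row["gene_id"] for row in rows)

# Gene span in bases, assuming inclusive 1-based coordinates.
gene_lengths = {}
for row in rows:
    _, begin, end = parse_range(row["genomic_range"])
    gene_lengths[row["gene_id"]] = end - begin + 1

for gene_id, count in transcripts_per_gene.items():
    print(gene_id, count, gene_lengths[gene_id], sep="\t")
```

To use the extracted file, open ncbi_dataset/data/data_table.tsv and pass the file object to csv.DictReader instead of the in-memory string.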

In [1]:

!head --lines 10 ncbi_dataset/data/gene.fna

>NC_035787.1:c70695429-70687673 LOC111110223 [organism=Crassostrea virginica] [GeneID=111110223] [chromosome=8]
GTCGCGTGTACTCGATCTGCTGAACGCAGTATCGGTGTATAAATCATTTTGTTCTTCTCGATGAAAAAAA
TTAGGCAAATTTGCCATCAAGTTTAAAAGCTATTCTCACTGTTTCACGCATCGGGACATTTTAAATGGAT
TTTCCAATGCACTAGTTTCATATAAGTCTGCATACTTCCTGGTCTGTGAATAAATCAAACTTAATTATGA
TTTCATGAAGAAATGTAATGCAATGACGAGTTGCATTTTGGAGGAATTTTGAACAGATTTTTCTGAATAA
GCTAGAAACAATTTGTCGAAGGTATGTTTAGAATTTTTCCCGAATATTTAGAAGCTTTGCCTTTAAAATC
ATTGATTATGCAGGCCTTAATTACTCCTTCCAGTTAATGTGCATCCTTGATTGATTGGTTATATTGGCAG
CAGTTAAACTATTCAATGACATCATAATAAGGGGATTCATGGTCAGATTTGGTGTCAATGTTCAGAAAAC
TGTATCTACTTTCTATCTATCTGTATCTAGTTACTAAGCAAATATAATCTTCACCATCAAGTACTTATTA
TAAGACTTACTTTAAACCTGTACATGGAATATTATACATGAAAGACATGGGACTCTACCGGTAAACAAAA

In [1]:

!grep '^>' ncbi_dataset/data/gene.fna

>NC_035787.1:c70695429-70687673 LOC111110223 [organism=Crassostrea virginica] [GeneID=111110223] [chromosome=8]
>NC_035788.1:101401835-101406321 LOC111112135 [organism=Crassostrea virginica] [GeneID=111112135] [chromosome=9]
>NC_035788.1:101349832-101356947 LOC111112138 [organism=Crassostrea virginica] [GeneID=111112138] [chromosome=9]

In [1]:

!grep '^>' ncbi_dataset/data/rna.fna

>XM_022446624.1 LOC111110223 [organism=Crassostrea virginica] [GeneID=111110223] [transcript=X2]
>XM_022446623.1 LOC111110223 [organism=Crassostrea virginica] [GeneID=111110223] [transcript=X1]
>XM_022449492.1 LOC111112135 [organism=Crassostrea virginica] [GeneID=111112135] [transcript=X2]
>XM_022449491.1 LOC111112135 [organism=Crassostrea virginica] [GeneID=111112135] [transcript=X1]
>XM_022449500.1 LOC111112138 [organism=Crassostrea virginica] [GeneID=111112138] [transcript=X1]
>XM_022449501.1 LOC111112138 [organism=Crassostrea virginica] [GeneID=111112138] [transcript=X2]

In [1]:

!grep '^>' ncbi_dataset/data/protein.faa

>XP_022302331.1 LOC111110223 [organism=Crassostrea virginica] [GeneID=111110223] [isoform=X1]
>XP_022302332.1 LOC111110223 [organism=Crassostrea virginica] [GeneID=111110223] [isoform=X2]
>XP_022305199.1 LOC111112135 [organism=Crassostrea virginica] [GeneID=111112135] [isoform=X1]
>XP_022305200.1 LOC111112135 [organism=Crassostrea virginica] [GeneID=111112135] [isoform=X2]
>XP_022305208.1 LOC111112138 [organism=Crassostrea virginica] [GeneID=111112138]
>XP_022305209.1 LOC111112138 [organism=Crassostrea virginica] [GeneID=111112138]
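
The FASTA headers follow a regular pattern: an accession, the gene symbol, then bracketed key=value annotations. The sketch below parses that pattern with a regular expression. It assumes the header layout shown in the three listings above, uses the protein headers as sample input, and parse_header is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Protein FASTA headers copied from the grep output above.
headers = [
    ">XP_022302331.1 LOC111110223 [organism=Crassostrea virginica] [GeneID=111110223] [isoform=X1]",
    ">XP_022302332.1 LOC111110223 [organism=Crassostrea virginica] [GeneID=111110223] [isoform=X2]",
    ">XP_022305199.1 LOC111112135 [organism=Crassostrea virginica] [GeneID=111112135] [isoform=X1]",
    ">XP_022305200.1 LOC111112135 [organism=Crassostrea virginica] [GeneID=111112135] [isoform=X2]",
    ">XP_022305208.1 LOC111112138 [organism=Crassostrea virginica] [GeneID=111112138]",
    ">XP_022305209.1 LOC111112138 [organism=Crassostrea virginica] [GeneID=111112138]",
]

# Matches one bracketed annotation such as [GeneID=111110223].
BRACKET_FIELD = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def parse_header(line):
    """Split a header into accession, symbol and its [key=value] annotations."""
    if not line.startswith(">"):
        raise ValueError(f"not a FASTA header: {line!r}")
    accession, symbol = line[1:].split()[:2]
    record = {"accession": accession, "symbol": symbol}
    record.update(BRACKET_FIELD.findall(line))
    return record


# Group protein accessions by the GeneID annotation.
proteins_by_gene = defaultdict(list)
for header in headers:
    record = parse_header(header)
    proteins_by_gene[record["GeneID"]].append(record["accession"])

for gene_id, accessions in proteins_by_gene.items():
    print(gene_id, ", ".join(accessions))
```

Optional annotations such as isoform are simply absent from the parsed record when the header omits them, as for the two LOC111112138 proteins.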

Converting the JSON Lines gene data report to tabular format

Next, we'll show how to use the dataformat command line tool to convert the hierarchical JSON Lines gene data report into tabular formats, including Excel and TSV. First, we'll use the help command to view the fields available for conversion to tabular format.

In [1]:

!./dataformat tsv gene --help

Convert Gene Report into TSV format.

Refer to NCBI's [command line start](https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start) documentation for information about getting started with the command-line tools.

Usage
  dataformat tsv gene [flags]

Examples
  dataformat tsv gene --inputfile gene_package/ncbi_dataset/data/data_report.jsonl
  dataformat tsv gene --package genes.zip

Flags
      --fields strings     comma-separated list of fields
                               - annotation-assemblies-in-scope-accession
                               - annotation-assemblies-in-scope-name
                               - annotation-release-date
                               - annotation-release-name
                               - chromosomes
                               - common-name
                               - description
                               - ensembl-geneids
                               - gene-id
                               - gene-type
                               - genomic-range-accession
                               - genomic-range-range-orientation
                               - genomic-range-range-start
                               - genomic-range-range-stop
                               - name-authority
                               - name-id
                               - omim-ids
                               - orientation
                               - ref-standard-genomic-region-type
                               - replaced-gene-id
                               - rna-type
                               - swissprot-accessions
                               - symbol
                               - synonyms
                               - tax-id
                               - tax-name
                               - transcript-accession
                               - transcript-ensembl-transcript
                               - transcript-genomic-location-accession
                               - transcript-genomic-location-seq-name
                               - transcript-length
                               - transcript-name
                               - transcript-protein-accession
                               - transcript-protein-ensembl-protein
                               - transcript-protein-isoform
                               - transcript-protein-length
                               - transcript-protein-mat-peptide-accession
                               - transcript-protein-mat-peptide-length
                               - transcript-protein-mat-peptide-name
                               - transcript-protein-name
                               - transcript-transcript-type
  -h, --help               help for gene
      --inputfile string   input file
      --package string     datasets package (zip archive), inputfile parameter is relative to the root path inside the archive



Global Flags
      --elide-header   Do not output header

Now we'll use the dataformat tool to convert selected fields from the gene data report to TSV format. The --package flag identifies the data package that contains the gene data report, and the --fields flag lists the fields to include in the output.

In [1]:

!./dataformat tsv gene --package 3_eastern_oyster_genes.zip --fields gene-id,symbol,tax-id,tax-name

NCBI GeneID	Symbol	Taxonomic ID	Taxonomic Name
111110223	LOC111110223	6565	Crassostrea virginica
111112135	LOC111112135	6565	Crassostrea virginica
111112138	LOC111112138	6565	Crassostrea virginica

Limiting the FASTA download to a subset of transcript and protein sequences

Now we'll show you how to limit the transcript and protein FASTA files to a subset of transcripts and proteins. In this example, we'll use the --fasta-filter flag to download sequence only for the transcript that encodes the longest protein of each gene.

In [1]:

!./datasets download gene symbol LOC111112135 LOC111112138 LOC111110223 --taxon "crassostrea virginica" --filename 3_eastern_oyster_transcripts.zip --fasta-filter XM_022446623.1 XM_022449491.1 XM_022449500.1 >datasets.log 2>&1
!printf "Downloaded:\n%s" "$(du --human-readable 3_eastern_oyster_transcripts.zip)"

Downloaded:
8.0K	3_eastern_oyster_transcripts.zip
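
The accessions passed to --fasta-filter above can be derived from the gene summary. As a minimal sketch, the code below copies each transcript accession and its protein length from the summary JSON shown earlier and keeps, for each gene, the transcript with the longest protein. Breaking ties by keeping the first transcript listed is an assumption of this sketch; both LOC111112138 proteins are 768 residues long.

```python
# (transcript accession, protein length) pairs copied from the gene summary above,
# keyed by gene symbol and kept in the order the summary lists them.
transcripts = {
    "LOC111110223": [("XM_022446624.1", 353), ("XM_022446623.1", 752)],
    "LOC111112135": [("XM_022449492.1", 775), ("XM_022449491.1", 776)],
    "LOC111112138": [("XM_022449500.1", 768), ("XM_022449501.1", 768)],
}


def longest_protein_transcript(candidates):
    """Return the accession whose protein is longest; the first one wins a tie."""
    accession, _ = max(candidates, key=lambda pair: pair[1])
    return accession


selected = [longest_protein_transcript(pairs) for pairs in transcripts.values()]

# Space-separated accessions, ready to paste after --fasta-filter.
print(" ".join(selected))
```

For these three genes the selection reproduces the accessions used in the download command above.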

Downloading sequence and metadata for all RefSeq genes for a given organism

Finally, we'll show how to download a gene data package containing sequence and metadata for all genes for a given organism. In this example, we'll download all genes for Crassostrea virginica.

In [1]:

!./datasets download gene taxon "crassostrea virginica" --filename eastern_oyster_genes.zip >datasets.log 2>&1
!printf "Downloaded:\n%s" "$(du --human-readable eastern_oyster_genes.zip)"

Downloaded:
226M	eastern_oyster_genes.zip

In [1]:

!unzip -l eastern_oyster_genes.zip

Archive:  eastern_oyster_genes.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      661  2021-03-21 16:47   README.md
438079325  2021-03-21 16:47   ncbi_dataset/data/gene.fna
187895654  2021-03-21 16:48   ncbi_dataset/data/rna.fna
 45124543  2021-03-21 16:52   ncbi_dataset/data/protein.faa
135988248  2021-03-21 16:56   ncbi_dataset/data/data_report.jsonl
 17830664  2021-03-21 16:59   ncbi_dataset/data/data_table.tsv
      454  2021-03-21 17:00   ncbi_dataset/data/dataset_catalog.json
---------                     -------
824919549                     7 files
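
To see which files account for most of the package, the uncompressed sizes from the listing can be tallied. This sketch simply copies the byte counts from the unzip output above and reports each file's size in MiB and its share of the total; the helper to_mib is a hypothetical name.

```python
# Uncompressed sizes in bytes, copied from the `unzip -l` listing above.
sizes = {
    "README.md": 661,
    "gene.fna": 438079325,
    "rna.fna": 187895654,
    "protein.faa": 45124543,
    "data_report.jsonl": 135988248,
    "data_table.tsv": 17830664,
    "dataset_catalog.json": 454,
}

total_bytes = sum(sizes.values())


def to_mib(n_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / 1024 ** 2


# Largest files first, with each file's share of the uncompressed total.
for name, size in sorted(sizes.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<22}{to_mib(size):>9.1f} MiB  {size / total_bytes:6.1%}")

print(f"{'total':<22}{to_mib(total_bytes):>9.1f} MiB")
```

The total matches the last line of the listing, and the gene sequences in gene.fna make up just over half of the uncompressed data.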