The objective of this Notebook is to demonstrate how to use NCBI Datasets command line tools to explore and download sequence and metadata for RefSeq annotated genes.
The datasets command-line tool currently returns two types of data:
To get started, we'll first download the Datasets command-line tools and grant them execute permissions. Datasets has two command-line tools: datasets and dataformat.
%%bash
printf "Downloading CLI tools...\n"
for app in datasets dataformat
do
curl --silent --remote-name "https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/${app}"
chmod +x ${app}
printf "[size: %s] %s v%s\n" $(du --human-readable ${app}) $(./${app} version)
done
Downloading CLI tools...
[size: 11M] datasets v11.7.0
[size: 13M] dataformat v11.7.0
We'll also download the command line tool jq to parse the datasets JSON Lines data reports into a readable format.
%%bash
curl --silent --location --output jq 'https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64'
chmod +x jq
printf "Downloaded %s" $(./jq --version)
Downloaded jq-1.6
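JSON Lines stores one complete JSON object per line, which is why a line-oriented parser is needed. The following is a minimal Python sketch of reading such a stream without jq. The helper name `read_json_lines` and the two sample records are made up for illustration; the keys are not taken from the real data report schema.

```python
import io
import json

def read_json_lines(handle):
    """Yield one parsed object per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)

# Two made-up records; the keys are illustrative, not the real report schema.
sample = io.StringIO(
    '{"symbol": "LOC111112135", "description": "toll-like receptor 13"}\n'
    '{"symbol": "LOC111112138", "description": "toll-like receptor 4"}\n'
)
records = list(read_json_lines(sample))

# Pretty-print the first record, roughly what `jq .` does for each line.
print(json.dumps(records[0], indent=2))
```

With a real report, an open file handle can be passed in place of the `io.StringIO` sample.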
To get help with the datasets tools, or with any command or sub-command, specify --help after the command.
!./datasets --help
datasets is a command-line tool that is used to query and download biological sequence data across all domains of life from NCBI databases.

Refer to NCBI's [command line start](https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start) documentation for information about getting started with the command-line tools.

Usage
  datasets [command]

Data Retrieval Commands
  summary     print a summary of a gene or genome dataset
  download    download a gene, genome or coronavirus dataset as a zip file
  rehydrate   rehydrate a downloaded, dehydrated dataset

Miscellaneous Commands
  completion  generate autocompletion scripts
  version     print the version of this client and exit
  help        Help about any command

Flags
  -h, --help   help for datasets

Use datasets help <command> for detailed help about a command.
In this step, we'll use the Datasets summary gene command to get gene metadata for a list of Crassostrea virginica genes. Datasets gene summaries can be queried using NCBI Gene ID, gene symbol or RefSeq transcript or protein accession combined with a taxon name. In this example, we'll query for three Crassostrea virginica genes, LOC111112135, LOC111112138 and LOC111110223, by specifying gene symbol and taxon name. To make the JSON output easy to read, we'll pipe it through the command line parser jq.
!./datasets summary gene symbol LOC111112135 LOC111112138 LOC111110223 --taxon "crassostrea virginica" | ./jq .
{ "genes": [ { "gene": { "annotations": [ { "assemblies_in_scope": [ { "accession": "GCF_002022765.2", "name": "C_virginica-3.0" } ], "release_date": "2017-09-11", "release_name": "NCBI Crassostrea virginica Annotation Release 100" } ], "chromosomes": [ "8" ], "common_name": "eastern oyster", "description": "toll-like receptor 6", "gene_id": "111110223", "genomic_ranges": [ { "accession_version": "NC_035787.1", "range": [ { "begin": "70687673", "end": "70695429", "orientation": "minus" } ] } ], "nomenclature_authority": {}, "orientation": "minus", "symbol": "LOC111110223", "tax_id": "6565", "taxname": "Crassostrea virginica", "transcripts": [ { "accession_version": "XM_022446624.1", "cds": { "accession_version": "XM_022446624.1", "range": [ { "begin": "304", "end": "1365" } ] }, "exons": { "accession_version": "NC_035787.1", "range": [ { "begin": "70695129", "end": "70695429", "order": 1 }, { "begin": "70691394", "end": "70691585", "order": 2 }, { "begin": "70687673", "end": "70690196", "order": 3 } ] }, "genomic_locations": [ { "exons": [ { "begin": "70695129", "end": "70695429", "order": 1 }, { "begin": "70691394", "end": "70691585", "order": 2 }, { "begin": "70687673", "end": "70690196", "order": 3 } ], "genomic_accession_version": "NC_035787.1", "genomic_range": { "begin": "70687673", "end": "70695429", "orientation": "minus" }, "sequence_name": "Chromosome 8 Reference C_virginica-3.0 Primary Assembly" } ], "genomic_range": { "accession_version": "NC_035787.1", "range": [ { "begin": "70687673", "end": "70695429", "orientation": "minus" } ] }, "length": 3017, "name": "transcript variant X2", "protein": { "accession_version": "XP_022302332.1", "isoform_name": "isoform X2", "length": 353, "name": "toll-like receptor 6" }, "type": "PROTEIN_CODING_MODEL" }, { "accession_version": "XM_022446623.1", "cds": { "accession_version": "XM_022446623.1", "range": [ { "begin": "304", "end": "2562" } ] }, "exons": { "accession_version": "NC_035787.1", "range": [ { "begin": 
"70695129", "end": "70695429", "order": 1 }, { "begin": "70687673", "end": "70691585", "order": 2 } ] }, "genomic_locations": [ { "exons": [ { "begin": "70695129", "end": "70695429", "order": 1 }, { "begin": "70687673", "end": "70691585", "order": 2 } ], "genomic_accession_version": "NC_035787.1", "genomic_range": { "begin": "70687673", "end": "70695429", "orientation": "minus" }, "sequence_name": "Chromosome 8 Reference C_virginica-3.0 Primary Assembly" } ], "genomic_range": { "accession_version": "NC_035787.1", "range": [ { "begin": "70687673", "end": "70695429", "orientation": "minus" } ] }, "length": 4214, "name": "transcript variant X1", "protein": { "accession_version": "XP_022302331.1", "isoform_name": "isoform X1", "length": 752, "name": "toll-like receptor 13" }, "type": "PROTEIN_CODING_MODEL" } ], "type": "PROTEIN_CODING" }, "query": [ "LOC111110223" ] }, { "gene": { "annotations": [ { "assemblies_in_scope": [ { "accession": "GCF_002022765.2", "name": "C_virginica-3.0" } ], "release_date": "2017-09-11", "release_name": "NCBI Crassostrea virginica Annotation Release 100" } ], "chromosomes": [ "9" ], "common_name": "eastern oyster", "description": "toll-like receptor 13", "gene_id": "111112135", "genomic_ranges": [ { "accession_version": "NC_035788.1", "range": [ { "begin": "101401835", "end": "101406321", "orientation": "plus" } ] } ], "nomenclature_authority": {}, "orientation": "plus", "symbol": "LOC111112135", "tax_id": "6565", "taxname": "Crassostrea virginica", "transcripts": [ { "accession_version": "XM_022449492.1", "cds": { "accession_version": "XM_022449492.1", "range": [ { "begin": "243", "end": "2570" } ] }, "exons": { "accession_version": "NC_035788.1", "range": [ { "begin": "101401835", "end": "101401901", "order": 1 }, { "begin": "101402029", "end": "101402206", "order": 2 }, { "begin": "101403913", "end": "101406321", "order": 3 } ] }, "genomic_locations": [ { "exons": [ { "begin": "101401835", "end": "101401901", "order": 1 }, { "begin": 
"101402029", "end": "101402206", "order": 2 }, { "begin": "101403913", "end": "101406321", "order": 3 } ], "genomic_accession_version": "NC_035788.1", "genomic_range": { "begin": "101401835", "end": "101406321", "orientation": "plus" }, "sequence_name": "Chromosome 9 Reference C_virginica-3.0 Primary Assembly" } ], "genomic_range": { "accession_version": "NC_035788.1", "range": [ { "begin": "101401835", "end": "101406321", "orientation": "plus" } ] }, "length": 2654, "name": "transcript variant X2", "protein": { "accession_version": "XP_022305200.1", "isoform_name": "isoform X2", "length": 775, "name": "toll-like receptor 13" }, "type": "PROTEIN_CODING_MODEL" }, { "accession_version": "XM_022449491.1", "cds": { "accession_version": "XM_022449491.1", "range": [ { "begin": "49", "end": "2379" } ] }, "exons": { "accession_version": "NC_035788.1", "range": [ { "begin": "101401848", "end": "101401901", "order": 1 }, { "begin": "101403913", "end": "101406321", "order": 2 } ] }, "genomic_locations": [ { "exons": [ { "begin": "101401848", "end": "101401901", "order": 1 }, { "begin": "101403913", "end": "101406321", "order": 2 } ], "genomic_accession_version": "NC_035788.1", "genomic_range": { "begin": "101401848", "end": "101406321", "orientation": "plus" }, "sequence_name": "Chromosome 9 Reference C_virginica-3.0 Primary Assembly" } ], "genomic_range": { "accession_version": "NC_035788.1", "range": [ { "begin": "101401848", "end": "101406321", "orientation": "plus" } ] }, "length": 2463, "name": "transcript variant X1", "protein": { "accession_version": "XP_022305199.1", "isoform_name": "isoform X1", "length": 776, "name": "toll-like receptor 13" }, "type": "PROTEIN_CODING_MODEL" } ], "type": "PROTEIN_CODING" }, "query": [ "LOC111112135" ] }, { "gene": { "annotations": [ { "assemblies_in_scope": [ { "accession": "GCF_002022765.2", "name": "C_virginica-3.0" } ], "release_date": "2017-09-11", "release_name": "NCBI Crassostrea virginica Annotation Release 100" } ], 
"chromosomes": [ "9" ], "common_name": "eastern oyster", "description": "toll-like receptor 4", "gene_id": "111112138", "genomic_ranges": [ { "accession_version": "NC_035788.1", "range": [ { "begin": "101349832", "end": "101356947", "orientation": "plus" } ] } ], "nomenclature_authority": {}, "orientation": "plus", "symbol": "LOC111112138", "tax_id": "6565", "taxname": "Crassostrea virginica", "transcripts": [ { "accession_version": "XM_022449500.1", "cds": { "accession_version": "XM_022449500.1", "range": [ { "begin": "106", "end": "2412" } ] }, "exons": { "accession_version": "NC_035788.1", "range": [ { "begin": "101349832", "end": "101349916", "order": 1 }, { "begin": "101354401", "end": "101356947", "order": 2 } ] }, "genomic_locations": [ { "exons": [ { "begin": "101349832", "end": "101349916", "order": 1 }, { "begin": "101354401", "end": "101356947", "order": 2 } ], "genomic_accession_version": "NC_035788.1", "genomic_range": { "begin": "101349832", "end": "101356947", "orientation": "plus" }, "sequence_name": "Chromosome 9 Reference C_virginica-3.0 Primary Assembly" } ], "genomic_range": { "accession_version": "NC_035788.1", "range": [ { "begin": "101349832", "end": "101356947", "orientation": "plus" } ] }, "length": 2632, "name": "transcript variant X1", "protein": { "accession_version": "XP_022305208.1", "length": 768, "name": "toll-like receptor 4" }, "type": "PROTEIN_CODING_MODEL" }, { "accession_version": "XM_022449501.1", "cds": { "accession_version": "XM_022449501.1", "range": [ { "begin": "143", "end": "2449" } ] }, "exons": { "accession_version": "NC_035788.1", "range": [ { "begin": "101352482", "end": "101352603", "order": 1 }, { "begin": "101354401", "end": "101356947", "order": 2 } ] }, "genomic_locations": [ { "exons": [ { "begin": "101352482", "end": "101352603", "order": 1 }, { "begin": "101354401", "end": "101356947", "order": 2 } ], "genomic_accession_version": "NC_035788.1", "genomic_range": { "begin": "101352482", "end": "101356947", 
"orientation": "plus" }, "sequence_name": "Chromosome 9 Reference C_virginica-3.0 Primary Assembly" } ], "genomic_range": { "accession_version": "NC_035788.1", "range": [ { "begin": "101352482", "end": "101356947", "orientation": "plus" } ] }, "length": 2669, "name": "transcript variant X2", "protein": { "accession_version": "XP_022305209.1", "length": 768, "name": "toll-like receptor 4" }, "type": "PROTEIN_CODING_MODEL" } ], "type": "PROTEIN_CODING" }, "query": [ "LOC111112138" ] } ] }
Next, we'll use the Datasets command line tool to download a gene data package containing gene, transcript and protein sequences, a data report and a data table. The gene data report contains detailed gene metadata in a hierarchical JSON Lines format. The gene table contains a subset of the gene metadata in TSV format. Gene data packages can be queried using NCBI Gene ID, gene symbol or RefSeq transcript or protein accession combined with a taxon name.
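The default gene dataset includes the following files: gene.fna (gene sequences), rna.fna (transcript sequences), protein.faa (protein sequences), data_report.jsonl (gene data report), data_table.tsv (gene table), dataset_catalog.json and README.md.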
In this example, we'll query using the same three NCBI Gene symbols and taxon name. We'll also use the --filename flag to provide a custom name for the download package. For the purposes of this demonstration, we will redirect all messages from the datasets command to datasets.log.
!./datasets download gene symbol LOC111112135 LOC111112138 LOC111110223 --taxon "crassostrea virginica" --filename 3_eastern_oyster_genes.zip >datasets.log 2>&1
!printf "Downloaded:\n%s" "$(du --human-readable 3_eastern_oyster_genes.zip)"
Downloaded: 20K 3_eastern_oyster_genes.zip
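The same download can be driven from Python, which avoids shell quoting and makes the exit status easy to check. This is only a sketch: the helper names `build_download_command` and `run_logged` are hypothetical, and the flags are just the ones already used in the command above.

```python
import subprocess

def build_download_command(symbols, taxon, filename, datasets="./datasets"):
    """Assemble the argument list for `datasets download gene symbol ...`."""
    return [datasets, "download", "gene", "symbol", *symbols,
            "--taxon", taxon, "--filename", filename]

def run_logged(cmd, log_path="datasets.log"):
    """Run cmd with stdout and stderr sent to log_path (like `>log 2>&1`)."""
    with open(log_path, "w") as log:
        completed = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return completed.returncode

cmd = build_download_command(
    ["LOC111112135", "LOC111112138", "LOC111110223"],
    "crassostrea virginica",
    "3_eastern_oyster_genes.zip",
)
print(cmd)

# Uncomment to run; this needs the datasets binary in the current directory.
# exit_code = run_logged(cmd)
```

Passing the arguments as a list means the taxon name needs no extra quoting, and a non-zero return code signals that datasets.log is worth reading.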
We'll use the unzip command to view the contents of the gene data package.
!unzip -l 3_eastern_oyster_genes.zip
Archive:  3_eastern_oyster_genes.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      661  2021-03-21 16:47   README.md
    19976  2021-03-21 16:47   ncbi_dataset/data/gene.fna
    18487  2021-03-21 16:47   ncbi_dataset/data/rna.fna
     4793  2021-03-21 16:47   ncbi_dataset/data/protein.faa
     7195  2021-03-21 16:47   ncbi_dataset/data/data_report.jsonl
     1783  2021-03-21 16:47   ncbi_dataset/data/data_table.tsv
      454  2021-03-21 16:47   ncbi_dataset/data/dataset_catalog.json
---------                     -------
    53349                     7 files
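A package listing like the one above can also be produced with Python's standard zipfile module. The sketch below builds a small stand-in archive in memory so that it runs without a download; `list_package` is a hypothetical helper name, and the demo members are placeholders rather than real package contents.

```python
import io
import zipfile

def list_package(zip_source):
    """Return (name, uncompressed_size) pairs for every member of a zip archive."""
    with zipfile.ZipFile(zip_source) as package:
        return [(info.filename, info.file_size) for info in package.infolist()]

# Stand-in archive built in memory; with a real package, pass its path,
# e.g. list_package("3_eastern_oyster_genes.zip").
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("README.md", "demo\n")
    demo.writestr("ncbi_dataset/data/gene.fna", ">seq1\nACGT\n")

listing = list_package(buffer)
for name, size in listing:
    print(f"{size:>9}  {name}")
```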
Next, we'll extract the data files. Note that all NCBI Datasets packages use a similar file structure. The -o argument overwrites existing files without prompting.
!unzip -o 3_eastern_oyster_genes.zip
Archive:  3_eastern_oyster_genes.zip
  inflating: README.md
  inflating: ncbi_dataset/data/gene.fna
  inflating: ncbi_dataset/data/rna.fna
  inflating: ncbi_dataset/data/protein.faa
  inflating: ncbi_dataset/data/data_report.jsonl
  inflating: ncbi_dataset/data/data_table.tsv
  inflating: ncbi_dataset/data/dataset_catalog.json
The Datasets gene data package contains two types of metadata files: the gene data report and the gene table. The gene data report contains detailed gene information in a hierarchical JSON Lines format. By contrast, the gene table contains a reduced, flattened representation of the hierarchical gene data report. In this step, we demonstrate how you can use common Unix commands to view metadata in the gene table.
!head ncbi_dataset/data/data_table.tsv | cut -f1,3-7
gene_id    description            scientific_name        common_name     tax_id  genomic_range
111110223  toll-like receptor 6   Crassostrea virginica  eastern oyster  6565    NC_035787.1:70687673-70695429
111110223  toll-like receptor 6   Crassostrea virginica  eastern oyster  6565    NC_035787.1:70687673-70695429
111112135  toll-like receptor 13  Crassostrea virginica  eastern oyster  6565    NC_035788.1:101401835-101406321
111112135  toll-like receptor 13  Crassostrea virginica  eastern oyster  6565    NC_035788.1:101401835-101406321
111112138  toll-like receptor 4   Crassostrea virginica  eastern oyster  6565    NC_035788.1:101349832-101356947
111112138  toll-like receptor 4   Crassostrea virginica  eastern oyster  6565    NC_035788.1:101349832-101356947
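In the selected columns each gene appears twice, presumably once per transcript. A minimal Python sketch of collapsing the table to one row per gene is shown below; the sample rows are copied from the output above, and `unique_genes` is a hypothetical helper name.

```python
import csv
import io

# Sample rows copied from the table above (columns 1 and 3-7 only).
table_text = (
    "gene_id\tdescription\tscientific_name\tcommon_name\ttax_id\tgenomic_range\n"
    "111110223\ttoll-like receptor 6\tCrassostrea virginica\teastern oyster\t6565\tNC_035787.1:70687673-70695429\n"
    "111110223\ttoll-like receptor 6\tCrassostrea virginica\teastern oyster\t6565\tNC_035787.1:70687673-70695429\n"
    "111112135\ttoll-like receptor 13\tCrassostrea virginica\teastern oyster\t6565\tNC_035788.1:101401835-101406321\n"
    "111112135\ttoll-like receptor 13\tCrassostrea virginica\teastern oyster\t6565\tNC_035788.1:101401835-101406321\n"
    "111112138\ttoll-like receptor 4\tCrassostrea virginica\teastern oyster\t6565\tNC_035788.1:101349832-101356947\n"
    "111112138\ttoll-like receptor 4\tCrassostrea virginica\teastern oyster\t6565\tNC_035788.1:101349832-101356947\n"
)

def unique_genes(handle):
    """Return the first row seen for each gene_id, in file order."""
    genes = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        genes.setdefault(row["gene_id"], row)
    return list(genes.values())

genes = unique_genes(io.StringIO(table_text))
for gene in genes:
    print(gene["gene_id"], gene["description"], gene["genomic_range"], sep="\t")
```

For the extracted package, the same function can be given `open("ncbi_dataset/data/data_table.tsv", newline="")` instead of the in-memory sample.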
!head --lines 10 ncbi_dataset/data/gene.fna
>NC_035787.1:c70695429-70687673 LOC111110223 [organism=Crassostrea virginica] [GeneID=111110223] [chromosome=8]
GTCGCGTGTACTCGATCTGCTGAACGCAGTATCGGTGTATAAATCATTTTGTTCTTCTCGATGAAAAAAA
TTAGGCAAATTTGCCATCAAGTTTAAAAGCTATTCTCACTGTTTCACGCATCGGGACATTTTAAATGGAT
TTTCCAATGCACTAGTTTCATATAAGTCTGCATACTTCCTGGTCTGTGAATAAATCAAACTTAATTATGA
TTTCATGAAGAAATGTAATGCAATGACGAGTTGCATTTTGGAGGAATTTTGAACAGATTTTTCTGAATAA
GCTAGAAACAATTTGTCGAAGGTATGTTTAGAATTTTTCCCGAATATTTAGAAGCTTTGCCTTTAAAATC
ATTGATTATGCAGGCCTTAATTACTCCTTCCAGTTAATGTGCATCCTTGATTGATTGGTTATATTGGCAG
CAGTTAAACTATTCAATGACATCATAATAAGGGGATTCATGGTCAGATTTGGTGTCAATGTTCAGAAAAC
TGTATCTACTTTCTATCTATCTGTATCTAGTTACTAAGCAAATATAATCTTCACCATCAAGTACTTATTA
TAAGACTTACTTTAAACCTGTACATGGAATATTATACATGAAAGACATGGGACTCTACCGGTAAACAAAA
!grep '^>' ncbi_dataset/data/gene.fna
>NC_035787.1:c70695429-70687673 LOC111110223 [organism=Crassostrea virginica] [GeneID=111110223] [chromosome=8]
>NC_035788.1:101401835-101406321 LOC111112135 [organism=Crassostrea virginica] [GeneID=111112135] [chromosome=9]
>NC_035788.1:101349832-101356947 LOC111112138 [organism=Crassostrea virginica] [GeneID=111112138] [chromosome=9]
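The gene FASTA headers pack the location and several bracketed tags into one line. The sketch below pulls them apart with regular expressions. It assumes the header layout seen above and treats the `c` prefix on the range as marking the complement (minus) strand, which agrees with the "minus" orientation reported for LOC111110223 in the gene summary. `parse_gene_header` is a hypothetical helper name.

```python
import re

# Layout assumed from the headers above:
# >ACCESSION:[c]FIRST-SECOND SYMBOL [key=value] [key=value] ...
HEADER_RE = re.compile(
    r"^>(?P<accession>[^:\s]+):(?P<complement>c?)(?P<first>\d+)-(?P<second>\d+)"
    r"\s+(?P<symbol>\S+)\s*(?P<tags>.*)$"
)
TAG_RE = re.compile(r"\[(\w+)=([^\]]*)\]")

def parse_gene_header(line):
    """Split a gene.fna header into accession, range, orientation and tags."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized header: {line!r}")
    first, second = int(match["first"]), int(match["second"])
    record = {
        "accession": match["accession"],
        "symbol": match["symbol"],
        "begin": min(first, second),
        "end": max(first, second),
        "orientation": "minus" if match["complement"] else "plus",
    }
    # Add each [key=value] tag (organism, GeneID, chromosome) as its own entry.
    record.update(TAG_RE.findall(match["tags"]))
    return record

header = (">NC_035787.1:c70695429-70687673 LOC111110223 "
          "[organism=Crassostrea virginica] [GeneID=111110223] [chromosome=8]")
record = parse_gene_header(header)
print(record)
```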
!grep '^>' ncbi_dataset/data/rna.fna
>XM_022446624.1 LOC111110223 [organism=Crassostrea virginica] [GeneID=111110223] [transcript=X2]
>XM_022446623.1 LOC111110223 [organism=Crassostrea virginica] [GeneID=111110223] [transcript=X1]
>XM_022449492.1 LOC111112135 [organism=Crassostrea virginica] [GeneID=111112135] [transcript=X2]
>XM_022449491.1 LOC111112135 [organism=Crassostrea virginica] [GeneID=111112135] [transcript=X1]
>XM_022449500.1 LOC111112138 [organism=Crassostrea virginica] [GeneID=111112138] [transcript=X1]
>XM_022449501.1 LOC111112138 [organism=Crassostrea virginica] [GeneID=111112138] [transcript=X2]
!grep '^>' ncbi_dataset/data/protein.faa
>XP_022302331.1 LOC111110223 [organism=Crassostrea virginica] [GeneID=111110223] [isoform=X1]
>XP_022302332.1 LOC111110223 [organism=Crassostrea virginica] [GeneID=111110223] [isoform=X2]
>XP_022305199.1 LOC111112135 [organism=Crassostrea virginica] [GeneID=111112135] [isoform=X1]
>XP_022305200.1 LOC111112135 [organism=Crassostrea virginica] [GeneID=111112135] [isoform=X2]
>XP_022305208.1 LOC111112138 [organism=Crassostrea virginica] [GeneID=111112138]
>XP_022305209.1 LOC111112138 [organism=Crassostrea virginica] [GeneID=111112138]
Next, we'll show how to use the dataformat command line tool to convert the hierarchical JSON Lines gene data report into tabular formats, including Excel and TSV. First, we'll use the help command to view the fields available for conversion to tabular format.
!./dataformat tsv gene --help
Convert Gene Report into TSV format.

Refer to NCBI's [command line start](https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start) documentation for information about getting started with the command-line tools.

Usage
  dataformat tsv gene [flags]

Examples
  dataformat tsv gene --inputfile gene_package/ncbi_dataset/data/data_report.jsonl
  dataformat tsv gene --package genes.zip

Flags
      --fields strings     comma-separated list of fields
                             - annotation-assemblies-in-scope-accession
                             - annotation-assemblies-in-scope-name
                             - annotation-release-date
                             - annotation-release-name
                             - chromosomes
                             - common-name
                             - description
                             - ensembl-geneids
                             - gene-id
                             - gene-type
                             - genomic-range-accession
                             - genomic-range-range-orientation
                             - genomic-range-range-start
                             - genomic-range-range-stop
                             - name-authority
                             - name-id
                             - omim-ids
                             - orientation
                             - ref-standard-genomic-region-type
                             - replaced-gene-id
                             - rna-type
                             - swissprot-accessions
                             - symbol
                             - synonyms
                             - tax-id
                             - tax-name
                             - transcript-accession
                             - transcript-ensembl-transcript
                             - transcript-genomic-location-accession
                             - transcript-genomic-location-seq-name
                             - transcript-length
                             - transcript-name
                             - transcript-protein-accession
                             - transcript-protein-ensembl-protein
                             - transcript-protein-isoform
                             - transcript-protein-length
                             - transcript-protein-mat-peptide-accession
                             - transcript-protein-mat-peptide-length
                             - transcript-protein-mat-peptide-name
                             - transcript-protein-name
                             - transcript-transcript-type
  -h, --help               help for gene
      --inputfile string   input file
      --package string     datasets package (zip archive), inputfile parameter is relative to the root path inside the archive

Global Flags
      --elide-header   Do not output header
Now we'll use the dataformat tool to convert selected fields from the gene data report to TSV format. The --fields flag lists the fields to include, and the --package flag identifies the gene data package (zip archive) whose data report will be converted.
!./dataformat tsv gene --package 3_eastern_oyster_genes.zip --fields gene-id,symbol,tax-id,tax-name
NCBI GeneID  Symbol        Taxonomic ID  Taxonomic Name
111110223    LOC111110223  6565          Crassostrea virginica
111112135    LOC111112135  6565          Crassostrea virginica
111112138    LOC111112138  6565          Crassostrea virginica
Now we'll show you how to limit the transcript and protein FASTA files to a subset of transcripts and proteins. In this example, we'll use the --fasta-filter flag to extract sequences only for the transcript encoding the longest protein of each gene.
!./datasets download gene symbol LOC111112135 LOC111112138 LOC111110223 --taxon "crassostrea virginica" --filename 3_eastern_oyster_transcripts.zip --fasta-filter XM_022446623.1 XM_022449491.1 XM_022449500.1 >datasets.log 2>&1
!printf "Downloaded:\n%s" "$(du --human-readable 3_eastern_oyster_transcripts.zip)"
Downloaded: 8.0K 3_eastern_oyster_transcripts.zip
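The accessions passed to --fasta-filter can be derived from the gene summary rather than picked by eye. The sketch below runs on a trimmed copy of the summary JSON printed earlier in this notebook, keeping only the fields it needs, and selects the transcript with the longest protein for each gene. `longest_protein_transcripts` is a hypothetical helper name.

```python
import json

# Trimmed copy of the summary output shown earlier (only the fields used here).
summary_text = """
{"genes": [
  {"gene": {"symbol": "LOC111110223", "transcripts": [
    {"accession_version": "XM_022446624.1", "protein": {"length": 353}},
    {"accession_version": "XM_022446623.1", "protein": {"length": 752}}]}},
  {"gene": {"symbol": "LOC111112135", "transcripts": [
    {"accession_version": "XM_022449492.1", "protein": {"length": 775}},
    {"accession_version": "XM_022449491.1", "protein": {"length": 776}}]}},
  {"gene": {"symbol": "LOC111112138", "transcripts": [
    {"accession_version": "XM_022449500.1", "protein": {"length": 768}},
    {"accession_version": "XM_022449501.1", "protein": {"length": 768}}]}}
]}
"""

def longest_protein_transcripts(summary):
    """Map each gene symbol to the transcript encoding its longest protein."""
    chosen = {}
    for entry in summary["genes"]:
        gene = entry["gene"]
        coding = [t for t in gene.get("transcripts", []) if "protein" in t]
        if coding:
            # max() keeps the first transcript listed when lengths tie.
            best = max(coding, key=lambda t: t["protein"]["length"])
            chosen[gene["symbol"]] = best["accession_version"]
    return chosen

chosen = longest_protein_transcripts(json.loads(summary_text))
print(" ".join(chosen.values()))
```

For LOC111112138 both proteins are 768 residues long, so the choice between its two transcripts is arbitrary; this sketch keeps the first one listed.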
Finally, we'll show how to download a gene data package containing sequence and metadata for all genes for a given organism. In this example, we'll download all genes for Crassostrea virginica.
!./datasets download gene taxon "crassostrea virginica" --filename eastern_oyster_genes.zip >datasets.log 2>&1
!printf "Downloaded:\n%s" "$(du --human-readable eastern_oyster_genes.zip)"
Downloaded: 226M eastern_oyster_genes.zip
!unzip -l eastern_oyster_genes.zip
Archive:  eastern_oyster_genes.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      661  2021-03-21 16:47   README.md
438079325  2021-03-21 16:47   ncbi_dataset/data/gene.fna
187895654  2021-03-21 16:48   ncbi_dataset/data/rna.fna
 45124543  2021-03-21 16:52   ncbi_dataset/data/protein.faa
135988248  2021-03-21 16:56   ncbi_dataset/data/data_report.jsonl
 17830664  2021-03-21 16:59   ncbi_dataset/data/data_table.tsv
      454  2021-03-21 17:00   ncbi_dataset/data/dataset_catalog.json
---------                     -------
824919549                     7 files
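A package of this size expands to roughly 825 MB, so it can be convenient to read members directly from the zip archive instead of extracting them. The sketch below counts FASTA records by streaming one member line by line. It uses a tiny in-memory stand-in archive so that it runs without the download, and `count_fasta_records` is a hypothetical helper name.

```python
import io
import zipfile

def count_fasta_records(zip_source, member):
    """Count '>' header lines in one archive member without extracting it."""
    with zipfile.ZipFile(zip_source) as package:
        with package.open(member) as handle:
            # Members are opened in binary mode, so compare against bytes.
            return sum(1 for line in handle if line.startswith(b">"))

# Stand-in archive; with the real package, pass "eastern_oyster_genes.zip".
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as demo:
    demo.writestr("ncbi_dataset/data/gene.fna", ">gene1\nACGT\nTTGA\n>gene2\nGGCC\n")

n_records = count_fasta_records(buffer, "ncbi_dataset/data/gene.fna")
print(n_records)
```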