The objective of this notebook is to demonstrate how to use the NCBI Datasets command-line tools to explore and download genome assembly sequences and metadata.
First, we'll download the Datasets command-line tools and grant them execute permission. Datasets has two command-line tools: datasets and dataformat.
%%bash
printf "Downloading CLI tools...\n"
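for app in datasets dataformat
do
    # Fetch the Linux binary for this tool and make it executable
    curl --silent --remote-name "https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/${app}"
    chmod +x "${app}"
    # du prints "<size> <name>"; word splitting fills the first two %s fields
    printf "[size: %s] %s v%s\n" $(du --human-readable "${app}") $(./"${app}" version)
done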
Downloading CLI tools...
[size: 11M] datasets v11.7.0
[size: 13M] dataformat v11.7.0
We'll also download the command line tool jq to parse the datasets JSON Lines data reports into a readable format.
%%bash
curl --silent --location --output jq 'https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64'
chmod +x jq
printf "Downloaded %s" $(./jq --version)
Downloaded jq-1.6
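JSON Lines files hold one JSON object per line, so they can also be read from the notebook's Python kernel without jq. The snippet below is a minimal sketch of that idea, not part of the Datasets tooling: `read_json_lines` is a hypothetical helper, and the two sample records use illustrative keys rather than the real data report schema.

```python
import io
import json


def read_json_lines(handle):
    """Yield one parsed object per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-record report; the keys are illustrative only.
sample_report = io.StringIO(
    '{"accession": "GCF_003789085.1", "level": "Scaffold"}\n'
    "\n"
    '{"accession": "GCF_015228065.1", "level": "Chromosome"}\n'
)

records = list(read_json_lines(sample_report))
for record in records:
    print(record["accession"], record["level"])
```

With a real report, the same helper could be given an open file handle instead of the in-memory sample.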
To get help with the tools or any subcommand, add --help after the command:
!./datasets --help
datasets is a command-line tool that is used to query and download biological sequence data across all domains of life from NCBI databases.

Refer to NCBI's [command line start](https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start) documentation for information about getting started with the command-line tools.

Usage
  datasets [command]

Data Retrieval Commands
  summary     print a summary of a gene or genome dataset
  download    download a gene, genome or coronavirus dataset as a zip file
  rehydrate   rehydrate a downloaded, dehydrated dataset

Miscellaneous Commands
  completion  generate autocompletion scripts
  version     print the version of this client and exit
  help        Help about any command

Flags
  -h, --help   help for datasets

Use datasets help <command> for detailed help about a command.
To begin, we'll use the Datasets summary genome command to explore all the available RefSeq genomes for a group of organisms.
Genome summaries can be accessed in four ways:
In this example, we'll view metadata for all Crustacea genome assemblies using the taxon name. Additionally, we'll limit our search to genomes annotated by NCBI's RefSeq group using the --refseq flag. To make the JSON output easy to read, we'll pipe it through the command-line JSON parser jq.
!./datasets summary genome taxon Crustacea --refseq | ./jq .
{ "assemblies": [ { "assembly": { "annotation_metadata": { "file": [ { "estimated_size": "8160265", "type": "GENOME_GFF" }, { "estimated_size": "60912986", "type": "GENOME_GBFF" }, { "estimated_size": "15723551", "type": "RNA_FASTA" }, { "estimated_size": "5579866", "type": "PROT_FASTA" }, { "estimated_size": "6762708", "type": "GENOME_GTF" } ], "name": "NCBI Annotation Release 100", "release_date": "Mar 16, 2020", "release_number": "100", "report_url": "https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Daphnia_magna/100/", "source": "NCBI" }, "assembly_accession": "GCF_003990815.1", "assembly_category": "representative genome", "assembly_level": "Chromosome", "bioproject_lineages": [ { "bioprojects": [ { "accession": "PRJNA490418", "title": "Daphnia magna strain:SK Genome sequencing and assembly" } ] } ], "chromosomes": [ "LG1", "LG2", "LG3", "LG4", "LG5", "LG6", "LG7", "LG8", "LG9", "LG10", "Un", "MT" ], "contig_n50": 14466, "display_name": "ASM399081v1", "estimated_size": "131337948", "org": { "assembly_counts": { "node": 3, "subtree": 3 }, "key": "35525", "parent_tax_id": "6668", "rank": "SPECIES", "sci_name": "Daphnia magna", "sex": "pooled male and female", "strain": "SK", "tax_id": "35525", "title": "Daphnia magna" }, "seq_length": "122937721", "submission_date": "2019-01-07" } }, { "assembly": { "annotation_metadata": { "file": [ { "estimated_size": "10610954", "type": "GENOME_GFF" }, { "estimated_size": "175309680", "type": "GENOME_GBFF" }, { "estimated_size": "15496064", "type": "RNA_FASTA" }, { "estimated_size": "6791917", "type": "PROT_FASTA" }, { "estimated_size": "9576030", "type": "GENOME_GTF" } ], "name": "NCBI Annotation Release 100", "release_date": "Dec 21, 2017", "release_number": "100", "report_url": "https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Eurytemora_affinis/100/", "source": "NCBI" }, "assembly_accession": "GCF_000591075.1", "assembly_category": "representative genome", "assembly_level": "Scaffold", "bioproject_lineages": [ { 
"bioprojects": [ { "accession": "PRJNA203087", "parent_accessions": [ "PRJNA163973" ], "title": "Eurytemora affinis strain:Atlantic clade Genome sequencing and assembly" }, { "accession": "PRJNA163973", "parent_accessions": [ "PRJNA163993" ], "title": "i5k Arthropod Genome Pilot Project" }, { "accession": "PRJNA163993", "title": "i5k initiative" } ] } ], "chromosomes": [ "Un" ], "contig_n50": 67724, "display_name": "Eaff_2.0", "estimated_size": "324965786", "org": { "assembly_counts": { "node": 2, "subtree": 2 }, "key": "88015", "parent_tax_id": "88014", "rank": "SPECIES", "sci_name": "Eurytemora affinis", "strain": "Atlantic clade", "tax_id": "88015", "title": "Eurytemora affinis" }, "seq_length": "389032277", "submission_date": "2017-12-12" } }, { "assembly": { "annotation_metadata": { "file": [ { "estimated_size": "8126944", "type": "GENOME_GFF" }, { "estimated_size": "330164727", "type": "GENOME_GBFF" }, { "estimated_size": "13113008", "type": "RNA_FASTA" }, { "estimated_size": "5472117", "type": "PROT_FASTA" }, { "estimated_size": "7246018", "type": "GENOME_GTF" } ], "name": "NCBI Annotation Release 100", "release_date": "Nov 04, 2020", "release_number": "100", "report_url": "https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Pollicipes_pollicipes/100/", "source": "NCBI" }, "assembly_accession": "GCF_011947565.2", "assembly_category": "representative genome", "assembly_level": "Scaffold", "bioproject_lineages": [ { "bioprojects": [ { "accession": "PRJNA614970", "parent_accessions": [ "PRJNA649812" ], "title": "Pollicipes pollicipes isolate:AB1234 Genome sequencing and assembly" }, { "accession": "PRJNA649812", "parent_accessions": [ "PRJNA533106" ], "title": "The Global Invertebrate Genomics Alliance (GIGA) genomes and transcriptomes" }, { "accession": "PRJNA533106", "title": "Earth BioGenome Project (EBP)" } ] } ], "chromosomes": [ "Un" ], "contig_n50": 109725, "display_name": "Ppol_2", "estimated_size": "597159620", "org": { "assembly_counts": { "node": 2, 
"subtree": 2 }, "isolate": "AB1234", "key": "41117", "merged_tax_ids": [ "223993" ], "parent_tax_id": "36136", "rank": "SPECIES", "sci_name": "Pollicipes pollicipes", "tax_id": "41117", "title": "Pollicipes pollicipes" }, "seq_length": "770089732", "submission_date": "2020-10-27" } }, { "assembly": { "annotation_metadata": { "file": [ { "estimated_size": "11302958", "type": "GENOME_GFF" }, { "estimated_size": "828679130", "type": "GENOME_GBFF" }, { "estimated_size": "18240609", "type": "RNA_FASTA" }, { "estimated_size": "7235079", "type": "PROT_FASTA" }, { "estimated_size": "8792038", "type": "GENOME_GTF" } ], "name": "NCBI Annotation Release 100", "release_date": "Nov 19, 2020", "release_number": "100", "report_url": "https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Penaeus_monodon/100/", "source": "NCBI" }, "assembly_accession": "GCF_015228065.1", "assembly_category": "representative genome", "assembly_level": "Chromosome", "bioproject_lineages": [ { "bioprojects": [ { "accession": "PRJNA611030", "title": "Genomic sequences of Penaeus monodon" } ] } ], "chromosomes": [ "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "Un", "MT" ], "contig_n50": 45084, "display_name": "NSTDA_Pmon_1", "estimated_size": "1407275500", "org": { "assembly_counts": { "node": 4, "subtree": 4 }, "common_name": "black tiger shrimp", "isolate": "SGIC_2016", "key": "6687", "parent_tax_id": "133894", "rank": "SPECIES", "sci_name": "Penaeus monodon", "tax_id": "6687", "title": "black tiger shrimp" }, "seq_length": "2394331783", "submission_date": "2020-11-05" } }, { "assembly": { "annotation_metadata": { "file": [ { "estimated_size": "10526977", "type": "GENOME_GFF" }, { "estimated_size": "618319281", "type": "GENOME_GBFF" }, { "estimated_size": "18152621", "type": "RNA_FASTA" }, { 
"estimated_size": "7313578", "type": "PROT_FASTA" }, { "estimated_size": "8491987", "type": "GENOME_GTF" } ], "name": "NCBI Annotation Release 100", "release_date": "Dec 07, 2018", "release_number": "100", "report_url": "https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Penaeus_vannamei/100/", "source": "NCBI" }, "assembly_accession": "GCF_003789085.1", "assembly_category": "representative genome", "assembly_level": "Scaffold", "bioproject_lineages": [ { "bioprojects": [ { "accession": "PRJNA438564", "title": "Penaeus vannamei breed:Keihai No. 1 Genome sequencing and assembly" } ] } ], "chromosomes": [ "Un", "MT" ], "contig_n50": 86864, "display_name": "ASM378908v1", "estimated_size": "1534800868", "org": { "assembly_counts": { "node": 3, "subtree": 3 }, "breed": "Kehai No.1", "common_name": "Pacific white shrimp", "key": "6689", "merged_tax_ids": [ "583111" ], "parent_tax_id": "133894", "rank": "SPECIES", "sci_name": "Penaeus vannamei", "sex": "male", "tax_id": "6689", "title": "Pacific white shrimp" }, "seq_length": "1663565311", "submission_date": "2018-11-16" } }, { "assembly": { "annotation_metadata": { "file": [ { "estimated_size": "6482984", "type": "GENOME_GFF" }, { "estimated_size": "249136834", "type": "GENOME_GBFF" }, { "estimated_size": "15862279", "type": "RNA_FASTA" }, { "estimated_size": "6535574", "type": "PROT_FASTA" }, { "estimated_size": "5624279", "type": "GENOME_GTF" } ], "name": "NCBI Annotation Release 100", "release_date": "Sep 13, 2016", "release_number": "100", "report_url": "https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Hyalella_azteca/100/", "source": "NCBI" }, "assembly_accession": "GCF_000764305.1", "assembly_category": "representative genome", "assembly_level": "Scaffold", "bioproject_lineages": [ { "bioprojects": [ { "accession": "PRJNA243935", "parent_accessions": [ "PRJNA163973" ], "title": "Hyalella azteca isolate:HAZT.00-mixed Genome sequencing and assembly" }, { "accession": "PRJNA163973", "parent_accessions": [ 
"PRJNA163993" ], "title": "i5k Arthropod Genome Pilot Project" }, { "accession": "PRJNA163993", "title": "i5k initiative" } ] } ], "chromosomes": [ "Un" ], "contig_n50": 114415, "display_name": "Hazt_2.0", "estimated_size": "442600481", "org": { "assembly_counts": { "node": 2, "subtree": 2 }, "isolate": "HAZT.00-mixed", "key": "294128", "parent_tax_id": "199487", "rank": "SPECIES", "sci_name": "Hyalella azteca", "tax_id": "294128", "title": "Hyalella azteca" }, "seq_length": "550885727", "submission_date": "2016-07-20" } } ], "total_count": 6 }
If you just want the count of available RefSeq (GCF) genomes that fall under a particular taxon name, use the --refseq flag and set --limit to NONE:
!./datasets summary genome taxon crustacea --refseq --limit NONE
{"total_count":6}
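The JSON printed by datasets summary genome, whether the full report or the count-only form, can also be condensed in Python. The following is a sketch under stated assumptions rather than an official client: the function names are hypothetical, and the field names (`assemblies`, `assembly_accession`, `assembly_level`, `contig_n50`, `org.sci_name`, `total_count`) are taken from the v11.7.0 output shown above and may differ in other versions.

```python
import json


def total_count(summary):
    """Return the number of matching genomes reported by the summary."""
    return int(summary.get("total_count", 0))


def summarize_assemblies(summary):
    """Return (accession, organism, level, contig N50) tuples, largest N50 first."""
    rows = []
    for entry in summary.get("assemblies", []):
        assembly = entry["assembly"]
        rows.append((
            assembly["assembly_accession"],
            assembly["org"]["sci_name"],
            assembly["assembly_level"],
            assembly["contig_n50"],
        ))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Two records trimmed from the Crustacea output above (most fields omitted).
sample_summary = {
    "assemblies": [
        {"assembly": {"assembly_accession": "GCF_003990815.1",
                      "assembly_level": "Chromosome",
                      "contig_n50": 14466,
                      "org": {"sci_name": "Daphnia magna"}}},
        {"assembly": {"assembly_accession": "GCF_000764305.1",
                      "assembly_level": "Scaffold",
                      "contig_n50": 114415,
                      "org": {"sci_name": "Hyalella azteca"}}},
    ]
}

# The count-only form, as printed with --limit NONE.
count_only = json.loads('{"total_count":6}')

for row in summarize_assemblies(sample_summary):
    print("\t".join(str(value) for value in row))
print("total:", total_count(count_only))
```

Sorting by contig N50 is only one possible ordering; any of the tuple fields could serve as the sort key.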
In this section, we'll show you how to download a genome data package for one of the Crustacean genomes using the datasets download genome command. Genome data packages can be retrieved in four ways
The default genome data package includes the following data (when available):
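In this example, we'll download the Datasets genome data package for Penaeus vannamei by taxon name. Because the query is by taxon, the package contains every available assembly for the species, including the RefSeq representative genome. For the purposes of this demonstration, we will redirect all messages from the datasets command to datasets.log.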
!./datasets download genome taxon "penaeus vannamei" --filename pacific_white_shrimp.zip >datasets.log 2>&1
!printf "Downloaded:\n%s" "$(du --human-readable pacific_white_shrimp.zip)"
Downloaded: 901M pacific_white_shrimp.zip
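Because the data package is an ordinary zip archive, its contents can be listed from Python with the standard `zipfile` module before anything is extracted. This is a minimal sketch: `list_package` and `make_demo_package` are hypothetical helpers, and the demo archive is a tiny in-memory stand-in whose file contents are invented, so the sizes below say nothing about the real download.

```python
import io
import zipfile


def list_package(zip_source):
    """Return (path, uncompressed size in bytes) for every file in the archive."""
    with zipfile.ZipFile(zip_source) as package:
        return [
            (info.filename, info.file_size)
            for info in package.infolist()
            if not info.is_dir()
        ]


def make_demo_package():
    """Build a small in-memory archive that mimics the package layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("ncbi_dataset/data/assembly_data_report.jsonl", "{}\n")
        package.writestr(
            "ncbi_dataset/data/GCF_003789085.1/GCF_003789085.1_ASM378908v1_genomic.fna",
            ">demo\nACGT\n",
        )
    buffer.seek(0)
    return buffer


for path, size in list_package(make_demo_package()):
    print(f"{size:>8}  {path}")
```

For the real download, `list_package("pacific_white_shrimp.zip")` would be the equivalent call.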
The Datasets genome assembly data report can be converted to tabular format using the dataformat tool. In this step, we'll use the --help flag to view the data fields available for conversion:
!./dataformat tsv genome --help
Convert Genome Assembly Data Report into TSV format.

Refer to NCBI's [command line start](https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start) documentation for information about getting started with the command-line tools.

Usage
  dataformat tsv genome [flags]

Examples
  dataformat tsv genome --inputfile human/ncbi_dataset/data/assembly_data_report.jsonl
  dataformat tsv genome --package human.zip

Flags
      --fields strings     comma-separated list of fields
                           - annotinfo-featcount-gene-non-coding
                           - annotinfo-featcount-gene-other
                           - annotinfo-featcount-gene-protein-coding
                           - annotinfo-featcount-gene-pseudogene
                           - annotinfo-featcount-gene-total
                           - annotinfo-name
                           - annotinfo-release-date
                           - annotinfo-report-url
                           - annotinfo-source
                           - assminfo-bioproject-lineage-accession
                           - assminfo-bioproject-lineage-parent-accession
                           - assminfo-bioproject-lineage-parent-accessions
                           - assminfo-bioproject-lineage-title
                           - assminfo-biosample-accession
                           - assminfo-description
                           - assminfo-genbank-assm-accession
                           - assminfo-level
                           - assminfo-linked-assm
                           - assminfo-name
                           - assminfo-refseq-assm-accession
                           - assminfo-refseq-category
                           - assminfo-sequencing-tech
                           - assminfo-submission-date
                           - assminfo-submitter
                           - assminfo-type
                           - assminfo-ucsc-assm-name
                           - assmstats-contig-l50
                           - assmstats-contig-n50
                           - assmstats-gaps-between-scaffolds-count
                           - assmstats-number-of-component-sequences
                           - assmstats-number-of-contigs
                           - assmstats-number-of-scaffolds
                           - assmstats-scaffold-l50
                           - assmstats-scaffold-n50
                           - assmstats-total-number-of-chromosomes
                           - assmstats-total-sequence-len
                           - assmstats-total-ungapped-len
                           - breed
                           - common-name
                           - cultivar
                           - ecotype
                           - isolate
                           - organelle-assembly-name
                           - organelle-bioproject-accessions
                           - organelle-description
                           - organelle-infraspecific-name
                           - organelle-submitter
                           - organelle-total-seq-length
                           - organism-name
                           - sex
                           - strain
                           - tax-id
                           - wgs-contigs-url
                           - wgs-project-accession
                           - wgs-url
  -h, --help               help for genome
      --inputfile string   input file
      --package string     datasets package (zip archive), inputfile parameter is relative to the root path inside the archive

Global Flags
      --elide-header   Do not output header
Let's look at the catalog inside the package, converting this JSON into an easy-to-read table.
!./dataformat catalog --package pacific_white_shrimp.zip 2>/dev/null | ./jq -r '.assemblies[] | .files[] | [.filePath, .fileType] | @csv'
"GCA_003730335.1/GCA_003730335.1_ASM373033v1_genomic.fna","GENOMIC_NUCLEOTIDE_FASTA"
"GCA_003730335.1/sequence_report.jsonl","SEQUENCE_REPORT"
"GCA_003789085.1/GCA_003789085.1_ASM378908v1_genomic.fna","GENOMIC_NUCLEOTIDE_FASTA"
"GCA_003789085.1/genomic.gff","GFF3"
"GCA_003789085.1/protein.faa","PROTEIN_FASTA"
"GCA_003789085.1/sequence_report.jsonl","SEQUENCE_REPORT"
"GCF_003789085.1/GCF_003789085.1_ASM378908v1_genomic.fna","GENOMIC_NUCLEOTIDE_FASTA"
"GCF_003789085.1/genomic.gff","GFF3"
"GCF_003789085.1/protein.faa","PROTEIN_FASTA"
"GCF_003789085.1/rna.fna","RNA_NUCLEOTIDE_FASTA"
"GCF_003789085.1/sequence_report.jsonl","SEQUENCE_REPORT"
"assembly_data_report.jsonl","DATA_REPORT"
Now we'll use the dataformat tool to convert a selected set of data fields, named with the --fields flag, into TSV format.
!./dataformat tsv genome --package pacific_white_shrimp.zip --fields assminfo-name,assminfo-refseq-assm-accession,assminfo-genbank-assm-accession,assminfo-refseq-category,assmstats-number-of-contigs,assmstats-number-of-scaffolds
Assembly Name	Assembly RefSeq Accession	Assembly GenBank Accession	Assembly Refseq Dategory	Assembly Stats Number of Contigs	Assembly Stats Number of Scaffolds
ASM373033v1	na	GCA_003730335.1	na	19584	19584
ASM378908v1	GCF_003789085.1	GCA_003789085.1	representative genome	33019	4682
ASM378908v1	GCF_003789085.1	GCA_003789085.1	representative genome	33019	4682
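The table above lists the ASM378908v1 row twice. If the TSV is captured as text, Python's `csv` module can parse it and drop exact repeats. This is a sketch only: `unique_rows` is a hypothetical helper, and the sample keeps just the first three columns of the output above for brevity.

```python
import csv
import io


def unique_rows(tsv_text):
    """Parse TSV text and drop rows that repeat an earlier row exactly."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen = set()
    rows = []
    for row in reader:
        key = tuple(row.values())
        if key not in seen:
            seen.add(key)
            rows.append(row)
    return rows


# First three columns of the dataformat output shown above.
sample_tsv = (
    "Assembly Name\tAssembly RefSeq Accession\tAssembly GenBank Accession\n"
    "ASM373033v1\tna\tGCA_003730335.1\n"
    "ASM378908v1\tGCF_003789085.1\tGCA_003789085.1\n"
    "ASM378908v1\tGCF_003789085.1\tGCA_003789085.1\n"
)

for row in unique_rows(sample_tsv):
    print(row["Assembly Name"], row["Assembly GenBank Accession"])
```

Rows are compared across all columns, so two assemblies that differ in any selected field are both kept.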
Next, we can list the first 30 FASTA deflines for the ASM378908v1 RefSeq assembly:
!unzip -q -c pacific_white_shrimp.zip ncbi_dataset/data/GCF_003789085.1/GCF_003789085.1_ASM378908v1_genomic.fna | grep --max-count=30 '^>'
>NW_020868286.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1, whole genome shotgun sequence
>NW_020868287.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_10, whole genome shotgun sequence
>NW_020868288.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_100, whole genome shotgun sequence
>NW_020868289.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1000, whole genome shotgun sequence
>NW_020868290.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1001, whole genome shotgun sequence
>NW_020868291.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1002, whole genome shotgun sequence
>NW_020868292.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1003, whole genome shotgun sequence
>NW_020868293.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1004, whole genome shotgun sequence
>NW_020868294.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1005, whole genome shotgun sequence
>NW_020868295.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1006, whole genome shotgun sequence
>NW_020868296.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1007, whole genome shotgun sequence
>NW_020868297.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1008, whole genome shotgun sequence
>NW_020868298.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1009, whole genome shotgun sequence
>NW_020868299.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_101, whole genome shotgun sequence
>NW_020868300.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1010, whole genome shotgun sequence
>NW_020868301.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1011, whole genome shotgun sequence
>NW_020868302.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1012, whole genome shotgun sequence
>NW_020868303.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1013, whole genome shotgun sequence
>NW_020868304.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1014, whole genome shotgun sequence
>NW_020868305.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1015, whole genome shotgun sequence
>NW_020868306.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1016, whole genome shotgun sequence
>NW_020868307.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1017, whole genome shotgun sequence
>NW_020868308.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1018, whole genome shotgun sequence
>NW_020868309.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1019, whole genome shotgun sequence
>NW_020868310.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_102, whole genome shotgun sequence
>NW_020868311.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1020, whole genome shotgun sequence
>NW_020868312.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1021, whole genome shotgun sequence
>NW_020868313.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1022, whole genome shotgun sequence
>NW_020868314.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1023, whole genome shotgun sequence
>NW_020868315.1 Penaeus vannamei breed Kehai No.1 unplaced genomic scaffold, ASM378908v1 LVANscaffold_1024, whole genome shotgun sequence
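The same defline listing can be produced in Python by streaming a FASTA member straight out of the zip archive, which avoids extracting a large genome to disk. The sketch below assumes nothing beyond the standard library; `first_deflines`, `make_demo_package`, and the `DEMO_MEMBER` path are hypothetical, and the demo sequences are invented stand-ins.

```python
import io
import zipfile
from itertools import islice

# Hypothetical member path used only by the in-memory demo archive.
DEMO_MEMBER = "ncbi_dataset/data/demo/demo_genomic.fna"


def first_deflines(zip_source, member, limit=30):
    """Return up to `limit` FASTA deflines from one member of a zip archive."""
    with zipfile.ZipFile(zip_source) as package:
        with package.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8")
            deflines = (line.rstrip("\n") for line in text if line.startswith(">"))
            # islice stops reading as soon as enough deflines have been found.
            return list(islice(deflines, limit))


def make_demo_package():
    """Build a tiny in-memory stand-in for a downloaded package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr(
            DEMO_MEMBER,
            ">seq1 demo\nACGT\n>seq2 demo\nGGCC\n>seq3 demo\nTTAA\n",
        )
    buffer.seek(0)
    return buffer


for defline in first_deflines(make_demo_package(), DEMO_MEMBER, limit=2):
    print(defline)
```

Against the real package, the call would take `"pacific_white_shrimp.zip"` and the member path used in the unzip command above.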