#!/usr/bin/env python # coding: utf-8 # # Getting genome assembly data using NCBI Datasets command line tools # # The objective of this Notebook is to demonstrate how to use NCBI Datasets command line tools to explore and download genome assembly sequence and metadata. # ## Getting started # First, we'll download and grant execute permissions for the datasets command line tools. # Datasets has two command line tools # - The **datasets** tool is used to query and download sequence, annotation and metadata for all domains of life. # - The **dataformat** tool is used to convert metadata downloaded from NCBI Datasets from JSON lines format to other formats. # In[1]: get_ipython().run_cell_magic('bash', '', 'printf "Downloading CLI tools...\\n"\nfor app in datasets dataformat\ndo\n curl --silent --remote-name "https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/${app}"\n chmod +x ${app}\n printf "[size: %s] %s v%s\\n" $(du --human-readable ${app}) $(./${app} version)\ndone\n') # We'll also download the command line tool [jq](https://stedolan.github.io/jq/) to parse the datasets JSON Lines data reports into a readable format. # In[1]: get_ipython().run_cell_magic('bash', '', 'curl --silent --location --output jq \'https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64\'\nchmod +x jq\nprintf "Downloaded %s" $(./jq --version)\n') # ## Getting help # To get help in using the tools or any sub-commands specify --help after the command: # In[1]: get_ipython().system('./datasets --help') # ## Getting genome metadata # # To begin, we'll use the Datasets summary genome command to explore all the available RefSeq genomes for a group of organisms. # # Genome summaries can be accessed in four ways: # # - accession: an NCBI Assembly accession # - organism: an organism or a taxonomical group name # - taxid: using an NCBI Taxonomy identifier, at any level. # - BioProject: using an NCBI BioProject accession # # In this example, we'll view metadata for all Crustacea genome assemblies using taxon name. Additionally, we'll limit our search to genome annotated by NCBI's RefSeq group using the --refseq flag. To make the JSON output easy to read we'll use the command line parser jq. # In[1]: get_ipython().system('./datasets summary genome taxon Crustacea --refseq | ./jq .') # If you just want to get the count of available RefSeq (GCF) genomes that fall under a particular tax name, use the --refseq flag and set --limit to NONE: # In[1]: get_ipython().system('./datasets summary genome taxon crustacea --refseq --limit NONE') # ## Downloading genome assembly sequence and metadata # In this section, we'll show you how to download a genome data package for one of the Crustacean genomes using the datasets download genome command. Genome data packages can be retrieved in four ways # # - accession: an NCBI Assembly accession # - organism: an organism or a taxonomical group name # - taxid: using an NCBI Taxonomy identifier, at any level. # - BioProject: using an NCBI BioProject accession # # The default genome data package includes the following data (when available): # # - genomic sequence (genomic.fna) # - transcript sequences (rna.fna) # - protein sequences (protein.faa) # - annotation in gff3 format (genomic.gff) # - a data report containing genome assembly and annotation metadata (assembly_data_report.jsonl) # - a sequence report listing the nucleotide sequences that comprise the genome assembly (sequence_report.jsonl) # # In this example, we'll download the Datasets genome package for the Penaeus vannamei reference genome. For the purposes of this demonstration, we will redirect all messages from the datasets command to datasets.log. # In[1]: get_ipython().system('./datasets download genome taxon "penaeus vannamei" --filename pacific_white_shrimp.zip >datasets.log 2>&1') get_ipython().system('printf "Downloaded:\\n%s" "$(du --human-readable pacific_white_shrimp.zip)"') # ## Converting the Datasets assembly data report to tabular format # The Datasets genome assembly data report can be converted to tabular format using the dataformat tool. In this step, we'll use the help command to view the data fields available for conversion # In[1]: get_ipython().system('./dataformat tsv genome --help') # Let's look at the catalog inside the package, converting this JSON into an easy-to-read table. # In[1]: get_ipython().system("./dataformat catalog --package pacific_white_shrimp.zip 2>/dev/null | ./jq -r '.assemblies[] | .files[] | [.filePath, .fileType] | @csv'") # Now we'll use the dataformat tool to convert a default set of data fields into tsv format. # In[1]: get_ipython().system('./dataformat tsv genome --package pacific_white_shrimp.zip --fields assminfo-name,assminfo-refseq-assm-accession,assminfo-genbank-assm-accession,assminfo-refseq-category,assmstats-number-of-contigs,assmstats-number-of-scaffolds') # Next, we can list the first 30 FASTA deflines for the ASM378908v1 RefSeq assembly: # In[1]: get_ipython().system("unzip -q -c pacific_white_shrimp.zip ncbi_dataset/data/GCF_003789085.1/GCF_003789085.1_ASM378908v1_genomic.fna | grep --max-count=30 '^>'")