#!/usr/bin/env python
# coding: utf-8

# # ipyrad-analysis toolkit: sratools
# 
# For reproducibility, it helps to download the raw data for your analysis from an online repository like NCBI with a short script at the top of your notebook. We've written a simple wrapper for the sratools command line program (which is notoriously difficult to use and poorly documented) to make this easier.

# ### Required software

# In[1]:


# conda install ipyrad -c bioconda
# conda install sratools -c bioconda


# In[2]:


import ipyrad.analysis as ipa


# ### Fetch info for a published data set by its accession ID
# You can find the study ID or individual sample IDs in published papers or by searching NCBI or related databases. ipyrad accepts one or more accession IDs for individual Runs or Studies (SRR or SRP, and similarly ERR or ERP, etc.).
# 

# In[3]:


# init sratools object with an accessions argument
sra = ipa.sratools(accessions="SRP065788")


# In[4]:


# fetch info for all samples from this study, save as a dataframe
stable = sra.fetch_runinfo()


# In[5]:


# the dataframe has all information about this study
stable.head()


# ### File names
# You can select columns by their index number to use for file names. See below.

# In[8]:


stable.iloc[:5, [0, 28, 29]]


# ### Download the data
# From an sratools object you can fetch just the info, or you can download the files as well. Here we call `.run()` to download the data into a designated workdir. The `name_fields` argument controls how the files are named, using fields from the fetch_runinfo table. The accessions argument here is a list of the first five SRR sample IDs in the table above.
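
# The naming step described above can be sketched in plain Python. This is only
# an illustration, not ipyrad's own code: `build_file_name` is a hypothetical
# helper, and it assumes that name fields are 1-based column positions (which
# is what pairing `iloc` column 0 with name field 1 suggests) and that the
# selected values are joined with an underscore. Replacing spaces in the values
# is a choice made for this sketch.

```python
def build_file_name(row, name_fields, separator="_"):
    """Join the selected 1-based fields of one run-info row into a file-name prefix."""
    parts = []
    for field in name_fields:
        if field < 1 or field > len(row):
            raise IndexError(f"name field {field} is outside 1..{len(row)}")
        # field 1 is position 0 in Python's 0-based indexing
        value = str(row[field - 1])
        # replace spaces so the name is easier to handle on the command line
        parts.append(value.replace(" ", separator))
    return separator.join(parts)


# hypothetical three-column row; a real run-info row has many more columns
example_row = ["SRR0000001", "Genus species", "sample_A"]
example_name = build_file_name(example_row, name_fields=(1, 2))
print(example_name)
```

# Under these assumptions, a 1-based field number `n` corresponds to
# `stable.iloc[:, n - 1]` in the run-info dataframe.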

# In[10]:


# select first 5 samples
list_of_srrs = stable.Run[:5]
list_of_srrs


# In[11]:


# new sra object
sra2 = ipa.sratools(accessions=list_of_srrs, workdir="downloaded")

# call download (run) function
sra2.run(auto=True, name_fields=(1, 30))


# ### Check the data files
# You can see that the files were named according to the SRR and species name in the table. The intermediate .sra files were removed and only the fastq files were saved.
# 

# In[12]:


get_ipython().system(' ls -l downloaded')
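
# The same check can be done without shelling out to `ls`, which may be handy
# outside of a notebook. The sketch below uses only the standard library.
# `list_fastq_files` and `leftover_sra_files` are hypothetical helpers, and the
# `.fastq` / `.fastq.gz` suffixes are an assumption about how the downloaded
# files are named.

```python
from pathlib import Path


def list_fastq_files(workdir):
    """Return (name, size in bytes) for every fastq file in workdir, sorted by name."""
    found = []
    for path in sorted(Path(workdir).iterdir()):
        # match both plain and gzip-compressed fastq files
        if path.is_file() and path.name.endswith((".fastq", ".fastq.gz")):
            found.append((path.name, path.stat().st_size))
    return found


def leftover_sra_files(workdir):
    """Return the names of any intermediate .sra files still present in workdir."""
    return sorted(p.name for p in Path(workdir).glob("*.sra"))


# only inspect the directory if the download step has already created it
if Path("downloaded").is_dir():
    for name, size in list_fastq_files("downloaded"):
        print(f"{size:>12d}  {name}")
    print("leftover .sra files:", leftover_sra_files("downloaded"))
```

# An empty list from `leftover_sra_files` matches the behavior described above,
# where the intermediate .sra files are removed after conversion.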