Check out more notebooks at our Community Notebooks Repository!
Title: How to convert 10X BAM files to FASTQ files using dsub
Author: David L Gibbs
Created: 2019-08-07
Purpose: Demonstrate how to make FASTQ files from 10X BAM files
Notes:
In this example, we'll use DataBiosphere's dsub. dsub makes it easy to run a job in the cloud without having to manually spin up and shut down a VM; that's all handled automatically.
https://github.com/DataBiosphere/dsub
Docs for the genomics pipeline run: https://cloud.google.com/sdk/gcloud/reference/alpha/genomics/pipelines/run
For this to work, the Google Genomics API must be enabled. To do that, from the main menu in the Cloud Console, select 'APIs & Services'; the API is called genomics.googleapis.com.
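The API can also be enabled from the command line (a sketch, assuming the gcloud SDK is installed and you are authenticated against the right project):

```shell
# Enable the Genomics API for the currently configured project
gcloud services enable genomics.googleapis.com
```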
# first, install dsub
# (it's also possible to install it directly from GitHub)
!pip install dsub
Requirement already satisfied: dsub in ./.local/lib/python2.7/site-packages
...
Requirement already satisfied: cachetools>=2.0.0 in /usr/local/lib/python2.7/dist-packages (from google-auth>=1.4.1->google-api-python-client->dsub)
# let's see if it's installed OK
!pip show dsub
Name: dsub Version: 0.3.2 Summary: A command-line tool that makes it easy to submit and run batch scripts in the cloud Home-page: https://github.com/DataBiosphere/dsub Author: Verily Author-email: UNKNOWN License: Apache Location: /home/jupyter/.local/lib/python2.7/site-packages Requires: oauth2client, six, python-dateutil, pyyaml, pytz, parameterized, google-api-python-client, retrying, tabulate
# pip installs executables into ~/.local/bin, which isn't on PATH yet
!~/.local/bin/dsub
usage: /home/jupyter/.local/bin/dsub [-h] [--provider PROVIDER] [--version VERSION] [--unique-job-id] [--name NAME] [--tasks [FILE M-N [FILE M-N ...]]] [--image IMAGE] [--dry-run] [--command COMMAND] [--script SCRIPT] [--env [KEY=VALUE [KEY=VALUE ...]]] [--label [KEY=VALUE [KEY=VALUE ...]]] [--input [KEY=REMOTE_PATH [KEY=REMOTE_PATH ...]]] [--input-recursive [KEY=REMOTE_PATH [KEY=REMOTE_PATH ...]]] [--output [KEY=REMOTE_PATH [KEY=REMOTE_PATH ...]]] [--output-recursive [KEY=REMOTE_PATH [KEY=REMOTE_PATH ...]]] [--user USER] [--user-project USER_PROJECT] [--mount [KEY=PATH_SPEC [KEY=PATH_SPEC ...]]] [--wait] [--retries RETRIES] [--poll-interval POLL_INTERVAL] [--after AFTER [AFTER ...]] [--skip] [--min-cores MIN_CORES] [--min-ram MIN_RAM] [--disk-size DISK_SIZE] [--logging LOGGING] [--project PROJECT] [--boot-disk-size BOOT_DISK_SIZE] [--preemptible] [--zones ZONES [ZONES ...]] [--scopes SCOPES [SCOPES ...]] [--accelerator-type ACCELERATOR_TYPE] [--accelerator-count ACCELERATOR_COUNT] [--keep-alive KEEP_ALIVE] [--regions REGIONS [REGIONS ...]] [--machine-type MACHINE_TYPE] [--cpu-platform CPU_PLATFORM] [--network NETWORK] [--subnetwork SUBNETWORK] [--use-private-address] [--timeout TIMEOUT] [--log-interval LOG_INTERVAL] [--ssh] [--nvidia-driver-version NVIDIA_DRIVER_VERSION] [--service-account SERVICE_ACCOUNT] [--disk-type DISK_TYPE] [--enable-stackdriver-monitoring] /home/jupyter/.local/bin/dsub: error: argument --project is required
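Rather than typing the full path each time, you can prepend pip's per-user bin directory to PATH for the current session (a minimal sketch; `~/.local/bin` is pip's default location for user installs):

```shell
# Prepend pip's user-install bin directory so `dsub` resolves without a full path
export PATH="$HOME/.local/bin:$PATH"
```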
# hello world test
# using the local provider (--provider local)
# is a faster way to develop the task
! ~/.local/bin/dsub \
--provider local \
--logging /tmp/dsub-test/logging/ \
--output OUT=/tmp/dsub-test/output/out.txt \
--command 'echo "Hello World" > "${OUT}"' \
--wait
Job: echo--jupyter--190808-173557-030088 Launched job-id: echo--jupyter--190808-173557-030088 To check the status, run: dstat --provider local --jobs 'echo--jupyter--190808-173557-030088' --users 'jupyter' --status '*' To cancel the job, run: ddel --provider local --jobs 'echo--jupyter--190808-173557-030088' --users 'jupyter' Waiting for job to complete... Waiting for: echo--jupyter--190808-173557-030088. echo--jupyter--190808-173557-030088: SUCCESS echo--jupyter--190808-173557-030088
# and we can check the output
!cat /tmp/dsub-test/output/out.txt
Hello World
# dsub can also take a shell script
cmd = '''
apt-get update;
apt-get --yes install wget;
wget http://cf.10xgenomics.com/misc/bamtofastq;
chmod +x bamtofastq;
OUTPUT_DIR="$OUTPUT_FOLDER/fastq";./bamtofastq ${INPUT_FILE} ${OUTPUT_DIR};'''
with open('job.sh', 'w') as fout:
    fout.write(cmd)
!cat job.sh
apt-get update; apt-get --yes install wget; wget http://cf.10xgenomics.com/misc/bamtofastq; chmod +x bamtofastq; OUTPUT_DIR="$OUTPUT_FOLDER/fastq";./bamtofastq ${INPUT_FILE} ${OUTPUT_DIR};
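On the worker VM, dsub exports each `--input`/`--output` name as an environment variable holding a local path under `/mnt/data`, where files are localized before the script runs and delocalized afterward. A quick local simulation of that substitution (the bucket and file names here are hypothetical, just to illustrate the mechanism):

```shell
# Simulate dsub's variable injection: INPUT_FILE and OUTPUT_FOLDER arrive
# as local worker paths mirroring the gs:// URIs passed on the command line
export INPUT_FILE=/mnt/data/input/gs/my-bucket/sample.bam
export OUTPUT_FOLDER=/mnt/data/output/gs/my-bucket/testout

# job.sh writes into a fresh subdirectory of the output folder
OUTPUT_DIR="$OUTPUT_FOLDER/fastq"
echo "would run: ./bamtofastq $INPUT_FILE $OUTPUT_DIR"
```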
# dsub uses an Ubuntu image by default,
# which is convenient, since bamtofastq runs on Ubuntu
!~/.local/bin/dsub \
--provider google-v2 \
--project cgc-05-0180 \
--zones "us-west1-*" \
--script job.sh \
--input INPUT_FILE="gs://cgc_bam_bucket_007/pbmc_1k_protein_v3_possorted_genome_bam.bam" \
--output-recursive OUTPUT_FOLDER="gs://cgc_output/testout/" \
--disk-size 200 \
--logging "gs://cgc_temp_02/testout" \
--wait
# Note: bamtofastq fails if the output directory already exists, with an error like:
# error creating output directory: "/mnt/data/output/gs/cruk_data_02". Does it already exist?
Job: job--jupyter--190808-184740-70 Launched job-id: job--jupyter--190808-184740-70 To check the status, run: dstat --provider google-v2 --project cgc-05-0180 --jobs 'job--jupyter--190808-184740-70' --users 'jupyter' --status '*' To cancel the job, run: ddel --provider google-v2 --project cgc-05-0180 --jobs 'job--jupyter--190808-184740-70' --users 'jupyter' Waiting for job to complete... Waiting for: job--jupyter--190808-184740-70. job--jupyter--190808-184740-70: SUCCESS job--jupyter--190808-184740-70
That's it! We can check the output with:
!gsutil ls gs://cgc_output/testout/
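To pull the generated FASTQ files down to the local machine, a sketch (the bucket path is the `--output-recursive` destination used above; the local directory name is arbitrary):

```shell
# Recursively copy the fastq output directory from the bucket, in parallel (-m)
gsutil -m cp -r gs://cgc_output/testout/fastq ./fastq_out/
```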