ISB-CGC Community Notebooks

Check out more notebooks at our Community Notebooks Repository!

Title:   How to convert bams to fastq files using samtools
Author:  David L Gibbs
Created: 2019-07-29
Purpose: Convert a bam file, to fastq, using the Google Genomics Pipelines API

Using samtools to index, sort, and convert a bam file to a fastq file.

In this example, we'll be using the Google Genomics Pipelines API. The pipelines API makes it easy to run a job without having to spin up and shut down a VM. It's all done automatically.

The work is uses materials from An example yaml config file is found at:

For this to work, we need to make sure that the Google Genomics API is enabled. To do that, from the main menu in the cloud console, select 'APIs & Services'. The API is called:

The software needs to be in a docker image. Lots of biologically related software is already bundled up for you. See

The biocontainers folks have removed all the 'latest' tags from their docker images, so you need to specify a version

docker_image = 'docker pull biocontainers/samtools:v1.7.0_cv4'


In [28]:
name: bedtools
description: Run samtools on one bam

  - us-west1-b

  - name: datadisk
    autoDelete: True

    # Within the Docker container, specify a mount point for the disk.
    mountPoint: /mnt/data

  imageName: bhklab/samtools-1.9.0

  # The Pipelines API does not create the output directory.
  cmd: >
    mkdir /mnt/data/output &&
    find /mnt/data/input &&
    for file in $(/bin/ls /mnt/data/input/*.bam); do
       samtools index ${file};
       samtools sort -o sorted.bam [email protected] 3 ${file}; 
       samtools fastq sorted.bam  > /mnt/data/output/output.fastq;
       mv sorted.bam /mnt/data/output/;       
       mv ${file}.bai /mnt/data/output/;

- name: bamFile
  description: Cloud Storage path or pattern to input file(s)
    path: input/
    disk: datadisk

- name: outputPath
  description: Cloud Storage path for where to samtools output
    path: output/*
    disk: datadisk

fout = open('samtools.yaml','w')
In [6]:
%set_env TEMP=gs://temp_bucket_123
env: TEMP=gs://temp_bucket_123
In [29]:
!gcloud alpha genomics pipelines run \
    --pipeline-file samtools.yaml \
    --inputs bamFile=gs://genomics-public-data/ \
    --outputs outputPath=gs://bam_bucket_1/ \
    --logging "${TEMP}/samtools_run_logs.log" \
    --disk-size datadisk:100
Running [operations/EIjDkrq_LRi6vJvY9qnH138g6YmPu4QUKg9wcm9kdWN0aW9uUXVldWU].
In [23]:
!gcloud alpha genomics operations wait EM2Uzre_LRijp-vcnvHNuqkBIOmJj7uEFCoPcHJvZHVjdGlvblF1ZXVl
Waiting for [operations/EM2Uzre_LRijp-vcnvHNuqkBIOmJj7uEFCoPcHJvZHVjdGlvblF1ZXV
In [ ]: