Building an ATAC-seq count matrices from bam files

In [2]:
import episcanpy.api as epi

The initial step is to download the test dataset on: https://drive.google.com/open?id=17FoiBCFX5Gv4RQYX1nS53QPEOI-U6AQN

Then specify the path to the test dataset.

In [3]:
path_to_play_data = '../ATAC_play_data/'
file_annot_name = "cortex_enhancer.bed"

# list of the bam files you want to build a count matrix for
list_cells =['AGCGATAGAACGAATTCGACTCGTATCACAGGACGT.bam',
             'AGCGATAGAACGAATTCGCCGACTCCAAAGGCGAAG.bam',
             'AGCGATAGAACGAATTCGCATATCCTATGGCTCTGA.bam',
             'AGCGATAGAACGAATTCGACTCGTATCAAGGCGAAG.bam',
             'AGCGATAGAACGAATTCGACCTACGCCAGGCTCTGA.bam'
            ]

Load the annoation file (peaks or enhancers) with the right set of chromosome names

In [4]:
enhancers = epi.ct.load_features(file_annot_name)
enhancer_names = epi.ct.name_features(enhancers)
0.1724870204925537 seconds

Let's now generate the count matrix.

Important limitation. You can build only one count matrix with the function bld_atac_mtx whereas the methylation where you can build multiple data matrices at the same time.

In [5]:
epi.ct.bld_atac_mtx(list_bam_files=list_cells,
                    loaded_feat=enhancers,
                    output_file='test_ATAC_mtx.txt',
                    path=path_to_play_data,
                    writing_option='w',
                    header=enhancer_names)
AGCGATAGAACGAATTCGACTCGTATCACAGGACGT.bam 130527 mapped reads
AGCGATAGAACGAATTCGCCGACTCCAAAGGCGAAG.bam 32584 mapped reads
AGCGATAGAACGAATTCGCATATCCTATGGCTCTGA.bam 23923 mapped reads
AGCGATAGAACGAATTCGACTCGTATCAAGGCGAAG.bam 9743 mapped reads
AGCGATAGAACGAATTCGACCTACGCCAGGCTCTGA.bam 4952 mapped reads
6.1473798751831055 seconds

Converting to a sparse matrix (AnnData object)

In [6]:
epi.ct.save_sparse_mtx(initial_matrix='test_ATAC_mtx.txt',
               output_file='.h5ad',
               path=path_to_play_data)
0.16255593299865723 seconds