Preprocessing and clustering scATAC PBMCs

The data consist in ~3000 cells of human PBMCs. Cells were FAC sorted thus providing a ground truth for cell type identification. The data was produces by Buenrostro et al. 2018 The original count matrix (using bulk peaks as features) from Buenrostro et al. is available here: GSE96772

On a unix system, you can uncomment and run the following to download the count matri in its anndata format.

In [1]:
# !mkdir data
# !wget https://www.dropbox.com/s/cwlaaxl70t27tb2/data_tutorial_buenrostro.tar.gz?dl=0 -O data/data_tutorial_buenrostro.tar.gz
# !cd data; tar -xzf data_tutorial_buenrostro.tar.gz
# !mkdir write

If episcanpy isn't installed already. Or if some of the functions aren't working. Install the following version:

In [2]:
# !pip install git+https://github.com/colomemaria/epiScanpy

Load libraries

In [3]:
import scanpy as sc
import anndata as ad
import numpy as np
import pandas as pd
import episcanpy.api as epi
/Users/annadanese/opt/anaconda3/lib/python3.7/site-packages/scanpy/api/__init__.py:7: FutureWarning: 

In a future version of Scanpy, `scanpy.api` will be removed.
Simply use `import scanpy as sc` and `import scanpy.external as sce` instead.

  FutureWarning,
In [4]:
sc.settings.set_figure_params(dpi=80, color_map='gist_earth')
In [5]:
results_file = 'write/buenrostro_pbmc.h5ad'  # the file that will store the analysis results

Load the data

Read in the count matrix into an AnnData object, which holds many slots for annotations and different representations of the data. It also comes with its own HDF5 file format: .h5ad.

In [6]:
adata = ad.read('data/data_tutorial_buenrostro/all_buenrostro_bulk_peaks.h5ad')
adata
Out[6]:
AnnData object with n_obs × n_vars = 2034 × 491436
    obs: 'batch', 'cell_name'
In [7]:
# display information stored in adata.obs
adata.obs
Out[7]:
batch cell_name
BM1077-MPP-Frozen-160105-1.dedup.st.bam 0 BM1077-MPP-Frozen-160105-1
singles-20160726-scATAC-BM1137-GMP3high-HYC-88.dedup.st.bam 0 singles-20160726-scATAC-BM1137-GMP3high-HYC-88
singles-160808-scATAC-BM1137-GMP2mid-LS-65.dedup.st.bam 0 singles-160808-scATAC-BM1137-GMP2mid-LS-65
singles-BM0828-LMPP-frozen-151105-20.dedup.st.bam 0 singles-BM0828-LMPP-frozen-151105-20
singles-160819-BM1137-CMP-LS-95.dedup.st.bam 0 singles-160819-BM1137-CMP-LS-95
... ... ...
BM1077-LMPP-Frozen-160107-40.dedup.st.bam 1 BM1077-LMPP-Frozen-160107-40
BM1077-MPP-Frozen-160105-74.dedup.st.bam 1 BM1077-MPP-Frozen-160105-74
singles-BM1214-GMP-160421-9.dedup.st.bam 1 singles-BM1214-GMP-160421-9
singles-BM0828-LMPP-frozen-151105-62.dedup.st.bam 1 singles-BM0828-LMPP-frozen-151105-62
singles-BM0828-MEP-160420-43.dedup.st.bam 1 singles-BM0828-MEP-160420-43

2034 rows × 2 columns

In [8]:
# checking the variable names (bulk peaks coordinates)
print(adata.var_names)
Index(['chr1_10279_10779', 'chr1_13252_13752', 'chr1_16019_16519',
       'chr1_29026_29526', 'chr1_96364_96864', 'chr1_115440_115940',
       'chr1_237535_238035', 'chr1_240811_241311', 'chr1_540469_540969',
       'chr1_713909_714409',
       ...
       'chrX_154822578_154823078', 'chrX_154840821_154841321',
       'chrX_154841449_154841949', 'chrX_154841956_154842456',
       'chrX_154842517_154843017', 'chrX_154862057_154862557',
       'chrX_154870909_154871409', 'chrX_154880741_154881241',
       'chrX_154891824_154892324', 'chrX_154896342_154896842'],
      dtype='object', name='index', length=491436)

Extract the FACs information from the file names

In [9]:
adata.obs['facs_label'] = ['MEP' if 'MEP' in line else line.split('.bam')[0].lstrip('singles-').split('BM')[-1].split('-')[1] for line in adata.obs['cell_name'].tolist()]
adata
Out[9]:
AnnData object with n_obs × n_vars = 2034 × 491436
    obs: 'batch', 'cell_name', 'facs_label'

Load the additional metadata

In [10]:
!head 'data/data_tutorial_buenrostro/metadata.tsv'
label	cell_type
BM1077-CLP-Frozen-160106-13	CLP
BM1077-CLP-Frozen-160106-14	CLP
BM1077-CLP-Frozen-160106-2	CLP
BM1077-CLP-Frozen-160106-21	CLP
BM1077-CLP-Frozen-160106-27	CLP
BM1077-CLP-Frozen-160106-3	CLP
BM1077-CLP-Frozen-160106-36	CLP
BM1077-CLP-Frozen-160106-42	CLP
BM1077-CLP-Frozen-160106-44	CLP
In [11]:
# format the cell names to match the metadata file
adata.obs_names = [x.split('/')[-1].split('.dedup.st.bam')[0] for x in adata.obs_names.tolist()]
adata.obs_names
Out[11]:
Index(['BM1077-MPP-Frozen-160105-1',
       'singles-20160726-scATAC-BM1137-GMP3high-HYC-88',
       'singles-160808-scATAC-BM1137-GMP2mid-LS-65',
       'singles-BM0828-LMPP-frozen-151105-20',
       'singles-160819-BM1137-CMP-LS-95', 'BM1077-MPP-Frozen-160105-36',
       'singles-20160726-scATAC-BM1214-CMP-LS-84',
       'BM1077-CMP-Frozen-160106-21', 'singles-BM0106-HSC-SIM-160219-36',
       'singles-BM1214-GMP-160421-60',
       ...
       'singles-160822-BM1137-CMP-LS-91',
       'singles-BM0828-CMP-frozen-151118-69',
       'singles-20160617-scATAC-BM1214-CMP-LS-40',
       'singles-BM0828-GMP-151027-2',
       'singles-20160726-scATAC-BM1214-CMP-LS-19',
       'BM1077-LMPP-Frozen-160107-40', 'BM1077-MPP-Frozen-160105-74',
       'singles-BM1214-GMP-160421-9', 'singles-BM0828-LMPP-frozen-151105-62',
       'singles-BM0828-MEP-160420-43'],
      dtype='object', length=2034)
In [12]:
epi.pp.load_metadata(adata,
                     'data/data_tutorial_buenrostro/metadata.tsv',
                     separator='\t')
adata
Out[12]:
AnnData object with n_obs × n_vars = 2034 × 491436
    obs: 'batch', 'cell_name', 'facs_label', 'label', 'cell_type\n'
In [13]:
adata.obs['cell_type'] = [x.rstrip('\n') for x in adata.obs['cell_type\n']]
del adata.obs['cell_type\n'] 
In [14]:
adata
Out[14]:
AnnData object with n_obs × n_vars = 2034 × 491436
    obs: 'batch', 'cell_name', 'facs_label', 'label', 'cell_type'
In [15]:
# check the 2 different annotation obtained
pd.crosstab(adata.obs['cell_type'], adata.obs['facs_label'])
Out[15]:
facs_label CLP CMP GMP GMP1low GMP2mid GMP3high HSC LMPP MCP MEP MPP UNK mono pDC
cell_type
CLP 78 0 0 0 0 0 0 0 0 0 0 0 0 0
CMP 0 502 0 0 0 0 0 0 0 0 0 0 0 0
GMP 0 0 216 68 44 74 0 0 0 0 0 0 0 0
HSC 0 0 0 0 0 0 347 0 0 0 0 0 0 0
LMPP 0 0 0 0 0 0 0 160 0 0 0 0 0 0
MEP 0 0 0 0 0 0 0 0 0 138 0 0 0 0
MPP 0 0 0 0 0 0 0 0 0 0 142 0 0 0
UNK 0 0 0 0 0 0 0 0 0 0 0 60 0 0
mono 0 0 0 0 0 0 0 0 0 0 0 0 64 0
pDC 0 0 0 0 0 0 0 0 73 0 0 0 0 68

Load gene/transcript annotation

Download annotation file (the data are aligned on hg19)

In [16]:
# !wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz -O data/data_tutorial_buenrostro/gencode.v19.annotation.gtf.gz
# !cd data/data_tutorial_buenrostro; gunzip gencode.v19.annotation.gtf
In [17]:
epi.tl.find_genes(adata,
           gtf_file='data/data_tutorial_buenrostro/gencode.v19.annotation.gtf',
           key_added='transcript_annotation',
           upstream=2000,
           feature_type='transcript',
           annotation='HAVANA',
           raw=False)
In [18]:
adata
Out[18]:
AnnData object with n_obs × n_vars = 2034 × 491436
    obs: 'batch', 'cell_name', 'facs_label', 'label', 'cell_type'
    var: 'transcript_annotation'

Preprocessing

Check if the data matrix is binary - if not, binarize the data matrix

In [19]:
print(np.max(adata.X))
1.0
In [20]:
# Here, the data is already binary. If not, run the following commands
# epi.pp.binarize(adata)
# print(np.max(adata.X))

Proceed to basic filtering.

In [21]:
adata
Out[21]:
AnnData object with n_obs × n_vars = 2034 × 491436
    obs: 'batch', 'cell_name', 'facs_label', 'label', 'cell_type'
    var: 'transcript_annotation'
In [22]:
# remove any potential empty features or barcodes
epi.pp.filter_cells(adata, min_features=1)
epi.pp.filter_features(adata, min_cells=1)
In [23]:
# filtered out 24210 peaks
adata
Out[23]:
AnnData object with n_obs × n_vars = 2034 × 467226
    obs: 'batch', 'cell_name', 'facs_label', 'label', 'cell_type', 'nb_features'
    var: 'transcript_annotation', 'n_cells'

Quality controls

In [24]:
adata.obs['log_nb_features'] = [np.log10(x) for x in adata.obs['nb_features']]
adata
Out[24]:
AnnData object with n_obs × n_vars = 2034 × 467226
    obs: 'batch', 'cell_name', 'facs_label', 'label', 'cell_type', 'nb_features', 'log_nb_features'
    var: 'transcript_annotation', 'n_cells'
In [25]:
epi.pl.violin(adata, ['nb_features'])
epi.pl.violin(adata, ['log_nb_features'])
... storing 'facs_label' as categorical
... storing 'cell_type' as categorical
... storing 'transcript_annotation' as categorical
In [26]:
# set a minimum number of cells to keep 
min_features = 1000

epi.pp.coverage_cells(adata, binary=True, log=False, bins=50,
               threshold=min_features, save='Buenrostro_bulk_peaks_coverage_cells.png')
epi.pp.coverage_cells(adata, binary=True, log=10, bins=50,
               threshold=min_features, save='Buenrostro_bulk_peaks_coverage_cells_log10.png')
In [27]:
# minimum number of cells sharing a feature
min_cells = 5
epi.pp.coverage_features(adata, binary=True, log=False, 
                        threshold=min_cells, save='Buenrostro_bulk_peaks_coverage_peaks.png')
epi.pp.coverage_features(adata, binary=True, log=True, 
                        threshold=min_cells, save='Buenrostro_bulk_peaks_coverage_peaks_log10.png')

Actually proceed to filter the cells and peaks based on the QC plots

In [28]:
min_features = 1000
epi.pp.filter_cells(adata, min_features=min_features)
adata
Out[28]:
AnnData object with n_obs × n_vars = 1945 × 467226
    obs: 'batch', 'cell_name', 'facs_label', 'label', 'cell_type', 'nb_features', 'log_nb_features'
    var: 'transcript_annotation', 'n_cells', 'commonness'
In [29]:
min_cells = 5
epi.pp.filter_features(adata, min_cells=min_cells)
adata
Out[29]:
AnnData object with n_obs × n_vars = 1945 × 311035
    obs: 'batch', 'cell_name', 'facs_label', 'label', 'cell_type', 'nb_features', 'log_nb_features'
    var: 'transcript_annotation', 'n_cells', 'commonness'

Looking at the QC plots after filtering

In [30]:
epi.pp.coverage_cells(adata, binary=True, log='log10', bins=50, threshold=min_features)
In [31]:
epi.pp.coverage_features(adata, binary=True, log='log10', bins=50, threshold=min_cells)

Identifying the most variable features

We aim to select a cuttof after the elbow.

In [32]:
epi.pp.cal_var(adata)
In [33]:
min_score_value = 0.515
nb_feature_selected = 120000
epi.pl.variability_features(adata,log=None,
                     min_score=min_score_value, nb_features=nb_feature_selected,
                     save='variability_features_plot_bonemarrow_peakmatrix.png')

epi.pl.variability_features(adata,log='log10',
                     min_score=min_score_value, nb_features=nb_feature_selected,
                     save='variability_features_plot_bonemarrow_peakmatrix_log10.png')
In [34]:
# save the current matrix in the raw layer
adata.raw = adata
In [35]:
# create a new AnnData containing only the most variable features
adata = epi.pp.select_var_feature(adata,
                                  nb_features=nb_feature_selected,
                                  show=False,
                                  copy=True)
In [36]:
adata
Out[36]:
View of AnnData object with n_obs × n_vars = 1945 × 122511
    obs: 'batch', 'cell_name', 'facs_label', 'label', 'cell_type', 'nb_features', 'log_nb_features'
    var: 'transcript_annotation', 'n_cells', 'commonness', 'prop_shared_cells', 'variability_score'
In [37]:
epi.pl.violin(adata, ['nb_features'])
epi.pl.violin(adata, ['log_nb_features'])