%%capture --no-stdout
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scanpy as sc
import cna
np.random.seed(0)
This notebook demonstrates the basic use of the cna
python package. Before you run it, make sure you either install cna
by running pip install cna
or by cloning the cna
github repo and adding it to your python path. (In the latter case you'll likely need to manually install the multianndata
package, which is a dependency of cna
.)
Let's run cna
and visualize results for a toy dataset in the commonly used AnnData
format that already has a UMAP nearest-neighbor graph. You can learn more about data pre-processing and formatting in Section 2.
First, we'll read in the dataset and convert it to the format required for cna
. This involves placing sample-level metadata in a dedicated pandas dataframe. (See Section 2 for more info)
# read in data
# NB: if data are already in MultiAnnData format (see section 2), they can be read via `cna.read`
#. for example, try out: d = cna.read('data_multianndata.h5ad') instead of runnign this code block
import anndata as ad
d = ad.read_h5ad('data_anndata.h5ad')
from multianndata import MultiAnnData
d = MultiAnnData(d)
d.obs_to_sample(['case','male','batch'])
# perform association test for case/ctrl status, controlling for sex as a covariate and accounting for potential batch effect
res = cna.tl.association(d, #dataset
d.samplem.case, #sample-level attribute of intest (case/control status)
covs=d.samplem[['male']], #covariates to control for (in this case just one)
batches=d.samplem.batch) #batch assignments for each sample so that cna can account for batch effects
print('\nglobal association p-value:', res.p)
qcd NAM not found; computing and saving taking step 1 median kurtosis: 16.40326205754668 taking step 2 median kurtosis: 12.75799245750447 taking step 3 median kurtosis: 6.1154849181504485 taking step 4 median kurtosis: 3.177490071480447 stopping after 4 steps throwing out neighborhoods with batch kurtosis >= 6 keeping 10000 neighborhoods covariate-adjusted NAM not found; computing and saving with ridge 100000.0 median batch kurtosis = 1.981293621817167 computing SVD performing association test
/data/srlab1/yakir/py/cna/cna/tools/_association.py:74: UserWarning: global association p-value attained minimal possible value. Consider increasing Nnull warnings.warn('global association p-value attained minimal possible value. '+\
computing neighborhood-level FDRs global association p-value: 0.000999000999000999
# visualize which populations are expanded or depleted among case samples relative to cntrls
cna.pl.umap_ncorr(d, #dataset
res, #cna results object
scatter0={'alpha':0.5, 's':20}, #plotting parameters for neighborhoods that pass FDR
scatter1={'alpha':0.05, 's':20}) #plotting parameters for neighborhoods that don't pass FDR
plt.title('p = {:.2e}'.format(res.p))
plt.show()
In the plot above, each cell is colored according to its neighborhood coefficient, i.e., color corresponds to the correlation of abundance in each cell's corresponding neighborhood to case/control status. Neighborhoods that don't pass FDR 5% are shown in gray. As this plot shows, cases have a greater abundance of cells from the left-hand cluster relative to controls.
Let's next perform an association test for male/female sex rather than case/control status...
# perform association test for male/female sex
# we control for case/control status as a covariate and account for potential batch effects
res = cna.tl.association(d, d.samplem.male, covs=d.samplem[['case']], batches=d.samplem.batch)
# visualize the results
cna.pl.umap_ncorr(d, res,
scatter0={'alpha':0.5, 's':20},
scatter1={'alpha':0.05, 's':20})
plt.title('p = {:.2e}'.format(res.p))
plt.show()
covariate-adjusted NAM not found; computing and saving with ridge 100000.0 median batch kurtosis = 1.9949274142866247 computing SVD performing association test computing neighborhood-level FDRs
It's now clear that a sub-population of the left-hand cluster is more abundant in males. CNA automatically picked up on this without needing to be given a resolution parameter.
Note that it looks like some cells in the right-hand cluster pass our FDR threshold of 5%. However, these cells comprise only ~6% of the cells that pass the threshold and so are likely false discoveries.
If this dataset had been clustered, two clusters would likely have been defined.
sc.tl.leiden(d, resolution = 0.1)
plt.scatter(*d.obsm['X_umap'][d.obs.leiden=='0'].T, c='orangered', label='Cluster 1')
plt.scatter(*d.obsm['X_umap'][d.obs.leiden=='1'].T, c='purple', label='Cluster 0')
plt.legend(); plt.axis('off')
plt.show()
We can examine within-cluster heterogeneity characterized by CNA by visualizing the distribution of neighborhood coefficients within each cluster.
ax = cna.pl.violinplot(d, # dataset
res, # CNA results object
'leiden') # name of the column of obs that contains the clustering
ax.set_title('Within-Cluster Heterogeneity')
ax.set_yticks([-0.08, -0.04, 0, 0.04, 0.08])
ax.set_xlabel('Cluster')
plt.show()
To characterize the populations driving the association, we can examine per-gene correlations to neighborhood coefficient. These per-gene values may be informative on their own, or they could be further analyzed using techniques like gene set enrichment analysis.
# compute correlation between neighborhood coefficients and genes
d.var['corr_case'] = np.corrcoef(res.ncorrs.reshape(1,-1), d.X, rowvar=False)[0,1:]
# view top 5 most strongly correlated genes
d.var.sort_values('corr_case', ascending=False).head()
corr_case | |
---|---|
gene | |
b'gene_10' | 0.186798 |
b'gene_9' | 0.181042 |
b'gene_4' | 0.179933 |
b'gene_11' | 0.179853 |
b'gene_1' | 0.173886 |
# view top 5 most negatively correlated genes
d.var.sort_values('corr_case', ascending=True).head()
corr_case | |
---|---|
gene | |
b'gene_20' | -0.100297 |
b'gene_21' | -0.097420 |
b'gene_22' | -0.089048 |
b'gene_14' | -0.085773 |
b'gene_19' | -0.079371 |
Let's now understand better the data format used by cna
and what pre-processing is necessary to run cna
.
MultiAnnData
object¶The data format used by cna
a python object called MultiAnnData
, which is a derivative of the common AnnData
object used by scanpy
that is modified to explicitly accommodate multi-sample single-cell datasets by including sample-level metadata.
There are three ways to create MultiAnnData
objects. Reading from h5ad
format using cna.read
, conversion from an AnnData
object, and manual construction.
MultiAnnData
object from a file¶d = cna.read('data_multianndata.h5ad')
Like an AnnData
object, the MultiAnnData
object d
stores raw data in .X
, per-cell data in .obs
, and per-marker data in .var
.
However, d
additional stores per-sample data in d.samplem
. Let's take a look.
d.samplem.head()
case | male | batch | |
---|---|---|---|
id | |||
0 | 0 | 0 | 0 |
1 | 0 | 0 | 1 |
2 | 0 | 0 | 2 |
3 | 0 | 0 | 3 |
4 | 0 | 0 | 4 |
MultiAnnData
object from an AnnData
object¶The second way to create a MultiAnnData
object is to create one from an AnnData
object. In this case, you need to explicitly supply the sample-level metadata for .samplem
. The easiest way to do this is typically to take this information from .obs
if it is already stored there, as is often the case in single-cell datasets. Another option is to explicitly set .samplem
to a DataFrame
that's been manually created.
Either way, cna
needs to know which column of .obs
contains the sample id of each cell. cna
assumes that this columns is named id
, but this behavior can be changed by passing sampleid='someothername'
into the MultiAnnData
function. The name of this column also needs to match the name of the index of .samplem
.
# read in AnnData object from h5ad file and convert it to MultiAnnData
d = ad.read_h5ad('data_anndata.h5ad')
d = MultiAnnData(d, sampleid='id') # the default value of sampleid is 'id', here we just repeat it for clarity
print('d.samplem has', len(d.samplem.columns), 'columns')
# Since our metadata are stored per-cell, we need to copy the information that is actually per-sampl
# (case status, batch and covariates) to the per-sample metadata DataFrame `d.samplem`
d.obs_to_sample(['case','male','batch'])
print('now d.samplem has', len(d.samplem.columns), 'columns')
d.samplem.head()
d.samplem has 0 columns now d.samplem has 3 columns
case | male | batch | |
---|---|---|---|
id | |||
0 | 0 | 0 | 0 |
1 | 0 | 0 | 1 |
2 | 0 | 0 | 2 |
3 | 0 | 0 | 3 |
4 | 0 | 0 | 4 |
MultiAnnData
object manually¶Building a MultiAnnData
object manually works just like building an AnnData
object manually. The main difference is that you also need to tell cna
which column of .obs
contains the sample id of each cell. cna
will assume that this column is named id
unless it is told otherwise.
import multianndata as mad
# let's pretend we had X, obs, and var as standalone objects...
X, obs, var = d.X, d.obs, d.var
# we can create a MultiAnnData object from these
d = mad.MultiAnnData(X=X, # expression matrix
obs=obs, # cell metadata matrix
sampleid='id') # name of column of cell_meta that contains sample id for each cell
# default value is 'id' but we repeat here for clarity
# convert per-cell metadata into per-sample metadata as before
d.obs_to_sample(['batch', 'case', 'male'])
As above, if instead our raw metadata table already contained information per-sample, we could have created d.samplem
directly.
The main pre-processing step that is required before running cna
is to compute a cell-cell similarity graph. Here we'll use scanpy
to compute the standard UMAP similarity graph. (In principle, any graph can be computed here.)
%%capture --no-stdout
# compute the UMAP cell-cell similarity graph
sc.pp.neighbors(d)
# compute UMAP coordinates for plotting
sc.tl.umap(d)
# the following line would save the pre-processed data as a h5ad file
# d.write('test.h5ad')
cna
features¶cna
association
function ¶Let's re-do the cna
analysis for case status and go over the results in more detail
d = cna.read('data_multianndata.h5ad')
res = cna.tl.association(d, d.samplem.case, covs=d.samplem[['male']], batches=d.samplem.batch)
qcd NAM not found; computing and saving taking step 1 median kurtosis: 16.40326205754668 taking step 2 median kurtosis: 12.75799245750447 taking step 3 median kurtosis: 6.1154849181504485 taking step 4 median kurtosis: 3.177490071480447 stopping after 4 steps throwing out neighborhoods with batch kurtosis >= 6 keeping 10000 neighborhoods covariate-adjusted NAM not found; computing and saving with ridge 100000.0 median batch kurtosis = 1.981293621817167 computing SVD performing association test
/data/srlab1/yakir/py/cna/cna/tools/_association.py:74: UserWarning: global association p-value attained minimal possible value. Consider increasing Nnull warnings.warn('global association p-value attained minimal possible value. '+\
computing neighborhood-level FDRs
The res
object returned by the association function contains the following fields:
p
is the global p-value for associationk
is the number of NAM PCs used for the association test (automatically selected in a data-dependent way)ncorrs
is the vector of neighborhood coefficientsfdr_10p_t
and fdr_5p_t
contain the thresholds on np.abs(ncorrs)
needed to provide FDRs of 10% and 5% respectivelyfdrs
contains information on neighborhood-level FDRs at other thresholdskept
is a boolean vector that specifies which cells passed qc. So, e.g., ncorrs
is of length kept.sum()
.r2
is the global prediction r-squared achieved by the modelr2_perpc
is the prediction r-squared achieved by each pc on its ownbeta
is the coefficient vector, one coefficient per PCyresid
and yresid_hat
are the phenotype and the predicted phenotype, both with covariates and batch residualized outThe kurtosis numbers in the output above say how 'outlier-y' the neighborhood abundances are at each step in the diffusion that cna
does to define its neighborhoods. The algorithm stops when these numbers stop going down quickly. But if you see that cna
stops when kurtosis is still very high (like above 10), that's likely a sign that the data aren't qc'd properly or are otherwise unsuitable.
The association function above also automatically computes the NAM and its SVD as a prerequisite to the association testing. These can be found in the following objects
d.uns['NAM.T']
contains the transpose of the NAM (neighborhoods by samples)d.uns['NAM_sampleXpc']
contains the sample loadings of the principal components of the NAMd.uns['NAM_nbhdXpc']
contains the neighborhood loadings of the principal components of the NAMd.uns['NAM_svs']
contains the squared singular values of the NAMThese objects can also be computed independent of any association test using the function cna.tl.nam
, demonstrated below
cna.tl.nam(d, batches=d.samplem.batch) #note: this NAM is being computed *without* adjusting for sex or case/ctrl status
covariate-adjusted NAM not found; computing and saving with ridge 100000.0 median batch kurtosis = 1.9949274142866247 computing SVD
Let's visualize a few of the results...
First, we can plot the neighborhood loadings of each NAM PC to see which abundance patterns they capture
#plot neighborhood loadings of NAM PC 1
d.obs['NAMPC1'] = np.nan
d.obs.loc[d.uns['keptcells'], 'NAMPC1'] = d.uns['NAM_nbhdXpc'].PC1
sc.pl.umap(d, color='NAMPC1', cmap='seismic', title='NAM PC1 neighborhood loadings')
plt.show()
#plot neighborhood loadings of NAM PC 2
d.obs['NAMPC2'] = np.nan
d.obs.loc[d.uns['keptcells'], 'NAMPC2'] = d.uns['NAM_nbhdXpc'].PC2
sc.pl.umap(d, color='NAMPC2', cmap='seismic', title='NAM PC2 neighborhood loadings')
plt.show()
We can see here that NAM PC 1 primarily captures the extent to which a sample has cells in the left-hand cluster vs the right-hand cluster, and that NAM PC 2 primarily additionall captures the extent to which a sample has cells in the top of the left-hand cluster vs the bottom of the left-hand cluster.
We can also examine the sample loadings of each NAM PC to explicitly see which samples have high and low values of each NAM PC. Below we'll examine how this relates to the sample-level metadata in our toy dataset.
def sample_loading_plot_by_binary_pheno(pheno, case_label, ctrl_label): #assumes pheno is binary
pcs = d.uns['NAM_sampleXpc']
cases = pheno.astype(np.bool)
plt.scatter(pcs.loc[cases].PC1, pcs.loc[cases].PC2, s=40, alpha=0.8, label=case_label)
plt.scatter(pcs.loc[~cases].PC1, pcs.loc[~cases].PC2, s=40, alpha=0.8, label=ctrl_label)
plt.legend()
plt.xlabel('NAM PC1')
plt.ylabel('NAM PC2')
#plot sample loadings of NAM PC1 versus NAM PC2, colored by case/control status
sample_loading_plot_by_binary_pheno(d.samplem.case, 'case', 'ctrl')
plt.title('NAM PC1 vs NAM PC2, colored by case/control status')
plt.show()
#plot sample loadings of NAM PC1 versus NAM PC2, colored by male/female sex
sample_loading_plot_by_binary_pheno(d.samplem.male, 'male', 'female')
plt.title('NAM PC1 vs NAM PC2, colored by male/female sex')
plt.show()
You can see in these plots that NAM PC 1 roughly corresponds to case/control status and NAM PC1 roughly corresponds to sex.
Like ordinary PCs, each NAM PC also comes with a quantification of how much dataset-wide variance it explains. These are stored in the NAM_svs
field. Let's visualize them in the traditional way below.
# plot variance explained by each NAM PC
plt.plot(d.uns['NAM_svs']/d.uns['NAM_svs'].sum(), marker='o', linestyle='--')
plt.xlabel('NAM PC')
plt.ylabel('fraction of variance explained')
plt.show()
As this plot demonstrates, PC1 explains ~40% of the variance in the dataset, PC2 explains ~12%, and the rest really don't contribute much.