We analyse the synthetic scRNA-Seq data generated by scBoolSeq from the Boolean simulations of the cellular reprogramming network in 2.3 - synthetic scRNA-Seq from cellular reprogramming network.
⚠️ This notebook must be run in a STREAM environment
import pandas as pd
import stream as st
st.__version__
/opt/conda/lib/python3.7/site-packages/anndata/core/anndata.py:17: FutureWarning: pandas.core.index is deprecated and will be removed in a future version. The public classes are available in the top-level namespace. from pandas.core.index import RangeIndex /opt/conda/lib/python3.7/site-packages/umap/spectral.py:4: NumbaDeprecationWarning: No direct replacement for 'numba.targets' available. Visit https://gitter.im/numba/numba-dev to request help. Thanks! import numba.targets
'1.0'
st.set_figure_params(dpi=80,style='white',figsize=[5.4,4.8],
rc={'image.cmap': 'viridis'})
in_file = 'synthetic_data_reprogramming_counts.tsv'
adata = st.read(
file_name=in_file,
workdir='./stream_branching'
)
df = pd.read_csv(
in_file, sep="\t", index_col=0, header=0
)
df
Saving results in: ./stream_branching
common.0_0 | common.1_0 | common.2_0 | common.3_0 | common.4_0 | common.5_0 | common.6_0 | switch_0 | branch1.0_0 | branch1.1_0 | ... | common.6_299 | switch_299 | branch1.0_299 | branch1.1_299 | branch1.2_299 | stable1_299 | branch2.0_299 | branch2.1_299 | branch2.2_299 | stable2_299 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
G1 | 2.332013 | 0.983021 | 1.051408 | 1.872751 | 1.469882 | 1.842518 | 2.191463 | 2.726888 | 2.532718 | 5.669840 | ... | 2.012113 | 2.715387 | 1.653754 | 8.178878 | 7.962105 | 6.614227 | 1.781937 | 4.394633 | 0.714265 | 3.798878 |
G2 | 4.165573 | 2.765827 | 1.989230 | 3.971206 | 4.325479 | 2.463216 | 3.408604 | 3.335261 | 3.113929 | 1.902228 | ... | 4.184479 | 1.673962 | 3.425767 | 1.417074 | 6.564162 | 8.259758 | 2.736619 | 2.905205 | 1.976459 | 0.000000 |
G3 | 3.036617 | 1.046053 | 2.231362 | 4.966177 | 0.199131 | 1.873473 | 2.392596 | 3.079853 | 1.727244 | 3.098456 | ... | 2.057285 | 3.170271 | 1.694091 | 3.997272 | 3.186728 | 6.124654 | 1.064837 | 3.133291 | 1.115589 | 2.384117 |
G4 | 2.074770 | 0.505312 | 0.000000 | 3.823021 | 2.072330 | 1.324959 | 3.602416 | 0.795757 | 0.289076 | 1.590148 | ... | 0.448112 | 1.160969 | 3.232817 | 2.053166 | 4.584992 | 2.015079 | 0.819506 | 0.000000 | 7.004272 | 7.735472 |
G5 | 2.986022 | 2.334931 | 3.933310 | 1.890802 | 2.122329 | 2.034974 | 3.100866 | 3.116822 | 2.534054 | 2.018515 | ... | 2.788002 | 3.493909 | 0.771094 | 4.476590 | 3.539528 | 0.086330 | 2.788833 | 3.897975 | 9.462772 | 9.712304 |
G6 | 3.149574 | 0.517259 | 0.000000 | 0.488222 | 1.321206 | 2.584524 | 3.068832 | 3.893572 | 4.740555 | 4.392287 | ... | 2.745285 | 2.735624 | 2.152118 | 1.436913 | 4.760308 | 1.262856 | 2.586831 | 1.652100 | 1.227657 | 6.072910 |
G7 | 8.866440 | 7.515305 | 9.997864 | 8.886265 | 10.573842 | 9.932584 | 5.713261 | 3.607456 | 3.524567 | 1.688592 | ... | 8.286901 | 1.586017 | 1.374295 | 2.925640 | 2.619914 | 4.279432 | 2.540313 | 2.137910 | 3.331338 | 3.346911 |
G8 | 8.432784 | 8.184479 | 6.951664 | 8.187462 | 6.925085 | 2.181608 | 1.773672 | 2.477613 | 3.074818 | 2.913678 | ... | 5.136095 | 1.191202 | 3.747575 | 3.361277 | 3.274425 | 1.567930 | 3.326951 | 2.572199 | 2.024478 | 3.327301 |
G9 | 7.435096 | 9.757331 | 5.752679 | 3.301599 | 3.149190 | 1.724121 | 2.451425 | 0.903799 | 2.192013 | 0.318825 | ... | 1.294759 | 2.983470 | 2.350374 | 3.552211 | 2.199803 | 3.742112 | 1.588584 | 0.462587 | 2.664618 | 2.791129 |
TF1 | 3.878601 | 4.499499 | 3.678492 | 5.306327 | 3.504758 | 3.883842 | 2.766272 | 3.687321 | 0.000000 | 8.479916 | ... | 2.654812 | 2.775523 | 7.878093 | 8.962540 | 7.512333 | 6.897451 | 1.497111 | 0.000000 | 3.448575 | 3.537919 |
TF2 | 2.611752 | 3.500210 | 3.249593 | 2.982546 | 2.624572 | 3.043139 | 3.524712 | 3.499111 | 3.200565 | 0.000000 | ... | 3.862386 | 3.230344 | 3.383476 | 2.546113 | 3.805934 | 4.061830 | 7.355900 | 8.628778 | 7.366862 | 7.286984 |
TF3 | 0.739442 | 2.890799 | 2.085020 | 3.383870 | 0.000000 | 1.750129 | 7.683416 | 7.912400 | 9.802873 | 6.811959 | ... | 8.136755 | 6.778667 | 9.660202 | 5.099078 | 4.973927 | 9.670800 | 6.278719 | 7.611550 | 7.220194 | 7.070062 |
TF4 | 1.212811 | 3.691911 | 2.751313 | 0.929342 | 8.181013 | 9.069052 | 6.987849 | 6.870188 | 9.859827 | 9.640220 | ... | 5.323186 | 7.915708 | 5.073914 | 5.746404 | 8.185002 | 8.335946 | 6.536561 | 8.305878 | 6.997148 | 7.726019 |
TF5 | 2.484502 | 1.909916 | 1.685234 | 10.002958 | 5.702311 | 8.823823 | 8.715779 | 7.174744 | 8.285241 | 5.490130 | ... | 7.474936 | 7.414200 | 8.035645 | 5.555454 | 6.066159 | 10.021983 | 10.807584 | 9.600482 | 7.933167 | 8.038149 |
TF6 | 9.225411 | 0.917320 | 2.143655 | 3.271486 | 2.322816 | 2.845896 | 1.381363 | 3.787311 | 0.801867 | 3.453075 | ... | 2.547272 | 2.556986 | 4.595774 | 2.260447 | 2.693491 | 3.009934 | 1.486827 | 3.353606 | 0.000000 | 3.456716 |
TF7 | 6.342557 | 7.575461 | 6.126514 | 7.205930 | 5.391876 | 7.345926 | 7.835637 | 7.308357 | 9.872830 | 5.060935 | ... | 7.587803 | 9.586454 | 8.016264 | 6.135996 | 5.871948 | 7.980499 | 6.779317 | 8.772373 | 6.895146 | 7.619464 |
16 rows × 4800 columns
To load and use 10x Genomics single cell RNA-seq data processed with Cell Ranger:
(The variable index can be reset by choosing a different column in gene.tsv
)
adata=st.read(file_name='./filtered_gene_bc_matrices/matrix.mtx',
file_feature='./filtered_gene_bc_matrices/genes.tsv',
file_sample='./filtered_gene_bc_matrices/barcodes.tsv',
file_format='mtx',workdir='./stream_result')
adata.var.index = adata.var[1].values
If the Anndata object is already created, to run STREAM, please simply specify work directory:
st.set_workdir(adata,'./stream_result')
adata
AnnData object with n_obs × n_vars = 4800 × 16 uns: 'workdir'
st.add_metadata(
adata,
file_name='synthetic_data_reprogramming_metadata.tsv'
)
adata.obs.head()
label | label_color | |
---|---|---|
common.0_0 | common | #aec7e8 |
common.1_0 | common | #aec7e8 |
common.2_0 | common | #aec7e8 |
common.3_0 | common | #aec7e8 |
common.4_0 | common | #aec7e8 |
adata.obs.label.value_counts()
common 2100 branch2 900 branch1 900 stable1 300 stable2 300 switch 300 Name: label, dtype: int64
st.cal_qc(adata,assay='rna')
st.plot_qc(adata,jitter=0.3,)
st.filter_cells(adata,min_n_features= 3)
st.filter_features(adata,min_n_cells = 300)
filter cells based on min_n_features after filtering out low-quality cells: 4800 cells, 16 genes Filter genes based on min_n_cells After filtering out low-expressed genes: 4800 cells, 16 genes
Commented out because our simulated data is already lib_size normalised and log2 transformed
###Normalize gene expression based on library size
st.normalize(adata,method='lib_size')
###Logarithmize gene expression
st.log_transform(adata)
Please check if the blue curve fits the points well. If not, please adjust the parameter 'loess_frac' (usually by lowering it) until the blue curve fits well.
st.select_variable_genes(adata,loess_frac=0.7,n_genes=20)
adata.uns["var_genes"]
16 variable genes are selected
Index(['G7', 'TF3', 'G4', 'G9', 'G8', 'TF5', 'G5', 'G1', 'G3', 'TF6', 'G2', 'G6', 'TF2', 'TF4', 'TF1', 'TF7'], dtype='object')
Alternatively, user can also select top principal components using all genes or variable genes:
st.select_top_principal_components(adata,n_pc=15,first_pc=True)
st.select_variable_genes(adata,loess_frac=0.01, n_genes=2000)
st.select_top_principal_components(adata,feature='var_genes',n_pc=40,first_pc=True)
st.dimension_reduction(
adata,
method='se',
feature='var_genes',
n_components=2,
n_neighbors=15,
n_jobs=12
)
feature var_genes is being used ... 12 cpus are being used ...
Alternatively, using top principal components as features:
st.dimension_reduction(adata,method='se',feature='top_pcs',n_neighbors=15, n_components=2)
for gene in df.index:
st.plot_dimension_reduction(adata,color=['label', gene],
n_components=2,show_graph=False,show_text=False)
st.seed_elastic_principal_graph(adata,n_clusters=10)
Seeding initial elastic principal graph... Clustering... K-Means clustering ... The number of initial nodes is 10 Calculatng minimum spanning tree... Number of initial branches: 3
df.head()
common.0_0 | common.1_0 | common.2_0 | common.3_0 | common.4_0 | common.5_0 | common.6_0 | switch_0 | branch1.0_0 | branch1.1_0 | ... | common.6_299 | switch_299 | branch1.0_299 | branch1.1_299 | branch1.2_299 | stable1_299 | branch2.0_299 | branch2.1_299 | branch2.2_299 | stable2_299 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
G1 | 2.332013 | 0.983021 | 1.051408 | 1.872751 | 1.469882 | 1.842518 | 2.191463 | 2.726888 | 2.532718 | 5.669840 | ... | 2.012113 | 2.715387 | 1.653754 | 8.178878 | 7.962105 | 6.614227 | 1.781937 | 4.394633 | 0.714265 | 3.798878 |
G2 | 4.165573 | 2.765827 | 1.989230 | 3.971206 | 4.325479 | 2.463216 | 3.408604 | 3.335261 | 3.113929 | 1.902228 | ... | 4.184479 | 1.673962 | 3.425767 | 1.417074 | 6.564162 | 8.259758 | 2.736619 | 2.905205 | 1.976459 | 0.000000 |
G3 | 3.036617 | 1.046053 | 2.231362 | 4.966177 | 0.199131 | 1.873473 | 2.392596 | 3.079853 | 1.727244 | 3.098456 | ... | 2.057285 | 3.170271 | 1.694091 | 3.997272 | 3.186728 | 6.124654 | 1.064837 | 3.133291 | 1.115589 | 2.384117 |
G4 | 2.074770 | 0.505312 | 0.000000 | 3.823021 | 2.072330 | 1.324959 | 3.602416 | 0.795757 | 0.289076 | 1.590148 | ... | 0.448112 | 1.160969 | 3.232817 | 2.053166 | 4.584992 | 2.015079 | 0.819506 | 0.000000 | 7.004272 | 7.735472 |
G5 | 2.986022 | 2.334931 | 3.933310 | 1.890802 | 2.122329 | 2.034974 | 3.100866 | 3.116822 | 2.534054 | 2.018515 | ... | 2.788002 | 3.493909 | 0.771094 | 4.476590 | 3.539528 | 0.086330 | 2.788833 | 3.897975 | 9.462772 | 9.712304 |
5 rows × 4800 columns
st.plot_dimension_reduction(adata,color=['label','TF1','n_genes'],n_components=2,show_graph=True,show_text=False)
st.plot_branches(adata,show_text=True)
epg_alpha
, epg_mu
, epg_lambda
are the three most influential parameters for learning elastic principal graph.
epg_alpha
: penalizes spurious branching events. The larger, the fewer branches the function will learn. (by default, epg_alpha=0.02
)epg_mu
: penalizes the deviation from harmonic embedding, where harmonicity assumes that each node is the mean of its neighbor nodes. The larger, the more edges the function will use to fit into points(cells) (by default, epg_mu=0.1
)epg_lambda
: penalizes the total length of edges. The larger, the 'shorter' curves the function will use to fit into points(cells) and the fewer points(cells) the curves will reach. (by default, epg_lambda=0.02
)'epg_trimmingradius' can help get rid of noisy points (by defalut
epg_trimmingradius=Inf
)
e.g. st.elastic_principal_graph(adata,epg_trimmingradius=0.1)
st.elastic_principal_graph(adata,epg_alpha=0.07,epg_mu=0.07,epg_lambda=0.03)
Learning elastic principal graph... [1] "Constructing tree 1 of 1 / Subset 1 of 1" [1] "Computing EPG with 50 nodes on 4800 points and 2 dimensions" [1] "Using a single core" Nodes = 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 BARCODE ENERGY NNODES NEDGES NRIBS NSTARS NRAYS NRAYS2 MSE MSEP FVE FVEP UE UR URN URN2 URSD 1||50 4.53e-07 50 49 46 1 0 0 5.535e-08 3.801e-08 0.9981 0.9987 3.708e-07 2.689e-08 1.344e-06 6.722e-05 0 29.971 sec elapsed [[1]] Number of branches after learning elastic principal graph: 3
st.plot_dimension_reduction(adata,color=['label','TF1','n_genes'],n_components=2,show_graph=True,show_text=False)
st.plot_branches(adata,show_text=False)
st.optimize_branching(adata,incr_n_nodes=30)
st.plot_dimension_reduction(adata,show_graph=True,show_text=False)
st.plot_branches(adata,show_text=False)
st.prune_elastic_principal_graph(adata,epg_collapse_mode='EdgesNumber',epg_collapse_par=2)
st.plot_dimension_reduction(adata,show_graph=True,show_text=False)
st.plot_branches(adata,show_text=False)
st.shift_branching(adata,epg_shift_mode='NodeDensity',epg_shift_radius=0.1,epg_shift_max=3)
st.plot_dimension_reduction(adata,show_graph=True,show_text=False)
st.plot_branches(adata,show_text=False)
###Extend leaf branch to reach further cells
st.extend_elastic_principal_graph(adata, epg_ext_mode='WeigthedCentroid',epg_ext_par=0.8)
st.plot_dimension_reduction(adata,color=['label'],n_components=2,show_graph=True,show_text=True)
st.plot_branches(adata,show_text=True)
Extending leaves with additional nodes ... Number of branches after extending leaves: 3
st.plot_flat_tree(adata,color=['label','branch_id_alias','S2_pseudotime'],
dist_scale=0.5,show_graph=True,show_text=True)
st.plot_flat_tree(adata,color=['label','TF1','S2_pseudotime'],
dist_scale=0.5,show_graph=True,show_text=True)
root = "S1"
st.plot_stream_sc(adata,root=root,color=['label','TF1', 'TF2'],
dist_scale=0.3,show_graph=True,show_text=True)
st.plot_stream(adata,root=root,color=['label','TF1', 'TF2'], preference=["S3", "S2"])
marker_list
defines the list of genes to scan. If not specified, by default it uses all available genes. It might be time-consuming.
Here we only include variable genes.
st.detect_leaf_markers(adata,marker_list=adata.uns['var_genes'],cutoff_zscore=1.0,cutoff_pvalue=0.01,
root=root,n_jobs=10)
Scanning the specified marker list ... Filtering out markers that are expressed in less than 5 cells ... 10 cpus are being used ... 16 markers are being scanned ...
adata.uns['leaf_markers_all'].head()
zscore | H_statistic | H_pvalue | S1S0_pvalue | S0S2_pvalue | S0S3_pvalue | |
---|---|---|---|---|---|---|
G7 | 1.41421 | 3469.02 | 0 | 1 | 0 | 0 |
G8 | 1.41421 | 1773.97 | 0 | 1 | 0 | 0 |
TF2 | 1.41412 | 2652.12 | 0 | 0 | 1 | 0 |
TF1 | 1.41395 | 1922.49 | 0 | 0 | 0 | 1 |
TF3 | -1.4142 | 2549.03 | 0 | 1 | 0 | 0 |
st.detect_transition_markers(adata,marker_list=adata.uns['var_genes'],cutoff_spearman=0.4,cutoff_logfc=0.25,
root=root,n_jobs=4)
Scanning the specified marker list ... Importing precomputed scaled marker expression matrix ... 16 markers are being scanned ...
for _key in adata.uns['transition_markers'].keys():
print(f"Transition {_key}")
print(adata.uns['transition_markers'][_key].head(), end="\n\n")
Transition ('S1', 'S0') stat logfc pval qval TF4 0.724738 1.630755 0.000000e+00 0.000000e+00 TF5 0.709326 1.678000 1.976263e-323 5.928788e-323 G8 -0.635689 1.594105 6.212919e-240 1.242584e-239 G9 -0.620720 1.801819 1.169459e-225 1.754189e-225 Transition ('S0', 'S2') stat logfc pval qval G5 0.778846 1.596762 1.178426e-252 3.535277e-252 G4 0.655799 2.078970 4.317953e-153 6.476929e-153 G6 0.557264 1.456464 6.404390e-102 6.404390e-102 Transition ('S0', 'S3') stat logfc pval qval G2 0.757035 1.607611 3.440692e-269 1.376277e-268 G1 0.723168 1.482292 2.349280e-234 4.698559e-234 G3 0.478625 1.424740 1.151265e-83 1.535020e-83 TF1 0.478269 1.154688 1.584357e-83 1.584357e-83
st.detect_markers(adata,ident='label',marker_list=adata.uns['var_genes'],cutoff_zscore=1.0,cutoff_pvalue=0.01)
st.detect_markers(
adata,
ident='label',
marker_list=adata.uns['var_genes'],
cutoff_zscore=1.0,
cutoff_pvalue=0.01
)
Scanning the specified marker list ... Importing precomputed scaled marker expression matrix ... 16 markers are being scanned ...
adata.uns["markers_label"].keys()
dict_keys(['branch1', 'branch2', 'common', 'stable1', 'stable2', 'switch'])
for key in adata.uns["markers_label"].keys():
print(f"Detected population-specific markers for `{key}`")
print(f"\t{adata.uns['markers_label'][key].index.to_list()}")
Detected population-specific markers for `branch1` [] Detected population-specific markers for `branch2` [] Detected population-specific markers for `common` ['G8', 'G7', 'TF3', 'TF4', 'TF5', 'G9'] Detected population-specific markers for `stable1` ['G1', 'G2', 'G3'] Detected population-specific markers for `stable2` ['G4', 'G5', 'G6'] Detected population-specific markers for `switch` []