The pipeline consists of four successive steps: data pre-processing, cellular clustering and pseudo-temporal ordering, determining differential expressed genes and identifying biomarkers.
library(DIscBIO)
library(partykit)
library(enrichR)
Loading required package: SingleCellExperiment Loading required package: SummarizedExperiment Loading required package: MatrixGenerics Loading required package: matrixStats Attaching package: ‘MatrixGenerics’ The following objects are masked from ‘package:matrixStats’: colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, colCounts, colCummaxs, colCummins, colCumprods, colCumsums, colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats, colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds, colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads, colWeightedMeans, colWeightedMedians, colWeightedSds, colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet, rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods, rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps, rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins, rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks, rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars, rowWeightedMads, rowWeightedMeans, rowWeightedMedians, rowWeightedSds, rowWeightedVars Loading required package: GenomicRanges Loading required package: stats4 Loading required package: BiocGenerics Loading required package: parallel Attaching package: ‘BiocGenerics’ The following objects are masked from ‘package:parallel’: clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB The following objects are masked from ‘package:stats’: IQR, mad, sd, var, xtabs The following objects are masked from ‘package:base’: anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min Loading required package: S4Vectors Attaching package: ‘S4Vectors’ The following object is masked from ‘package:base’: expand.grid Loading required package: IRanges Loading required package: GenomeInfoDb Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'. Attaching package: ‘Biobase’ The following object is masked from ‘package:MatrixGenerics’: rowMedians The following objects are masked from ‘package:matrixStats’: anyMissing, rowMedians Loading required package: grid Loading required package: libcoin Loading required package: mvtnorm Attaching package: ‘partykit’ The following object is masked from ‘package:SummarizedExperiment’: width The following object is masked from ‘package:GenomicRanges’: width The following object is masked from ‘package:IRanges’: width The following object is masked from ‘package:S4Vectors’: width The following object is masked from ‘package:BiocGenerics’: width Welcome to enrichR Checking connection ... Connection is Live!
The valuesG1ms dataset consists of single cells from a myxoid liposarcoma cell line. Myxoid liposarcoma is a rare type of tumor driven by specific fusion oncogenes, normally FUS-DDIT3 [Fletcher et al., 2013], with few other genetic changes [Ståhlberg et al., 2014; Hofvander et al., 2018]. The cells were collected based on their cell cycle phase (G1, S or G2/M), assessed by analyzing their DNA content using Fluorescence Activated Cell Sorter [Karlsson et al., 2017]. Data including ERCC spike-ins, are available in the ArrayExpress database at EMBL-EBI with accession number E-MTAB-6142).
The dataset should be formatted in a data frame where columns refer to samples and rows refer to genes. We provide here the possibility to load the dataset either as ".csv" or ".rda" extensions.
FileName<-"valuesG1ms" # Name of the dataset
CSV=TRUE # If the dataset has ".csv", the user shoud set CSV to TRUE
#CSV=FALSE # If the dataset has ".rda", the user shoud set CSV to FALSE
if (CSV==TRUE){
DataSet <- read.csv(file = paste0(FileName,".csv"), sep = ";",header=T)
rownames(DataSet)<-DataSet[,1]
DataSet<-DataSet[,-1]
} else{
load(paste0(FileName,".rda"))
DataSet<-get(FileName)
}
cat(paste0("The ", FileName," contains:","\n","Genes: ",length(DataSet[,1]),"\n","cells: ",length(DataSet[1,]),"\n"))
The valuesG1ms contains: Genes: 59838 cells: 94
sc<- DISCBIO(DataSet) # The DISCBIO class is the central object storing all information generated throughout the pipeline
Prior to applying data analysis methods, it is standard to pre-process the raw read counts resulted from the sequencing. The preprocessing approach depends on the existence or absence of ERCC spike-ins. In both cases, it includes normalization of read counts and gene filtering.
To account for RNA composition and sequencing depth among samples (single-cells), the normalization method “median of ratios” is used. This method takes the ratio of the gene instantaneous median to the total counts for all genes in that cell (column median). The gene instantaneous median is the product of multiplying the median of the total counts across all cells (row median) with the read of the target gene in each cell. This normalization method makes it possible to compare the normalized counts for each gene equally between samples.
The key idea in filtering genes is to appoint the genes that manifest abundant variation across samples. Filtering genes is a critical step due to its dramatic impact on the downstream analysis. In case the raw data includes ERCC spike-ins, genes will be filtered based on variability in comparison to a noise level estimated from the ERCC spike-ins according to an algorithm developed by Brennecke et al (Brennecke et al., 2013). This algorithm utilizes the dependence of technical noise on the average read count and fits a model to the ERCC spike-ins. Further gene filtering can be implemented based on gene expression. In case the raw data does not include ERCC spike-ins, genes will be only filtered based on minimum expression in certain number of cells.
Filtering the raw data that includes ERCCs can be done by applying the “NoiseFiltering” function, which includes several parameters:
- object: the outcome of running the DISCBIO() function.
- percentile: A numeric value of the percentile. It is used to validate the ERCC spik-ins. Default is 0.8.
- CV: A numeric value of the coefficient of variation. It is used to validate the ERCC spik-ins. Default is 0.5.
- geneCol: Color of the genes that did not pass the filtration.
- FgeneCol: Color of the genes that passt the filtration.
- erccCol: Color of the ERCC spik-ins.
- Val: A logical vector that allows plotting only the validated ERCC spike-ins. Default is TRUE. If Val=FALSE will
plot all the ERCC spike-ins.
- plot: A logical vector that allows plotting the technical noise. Default is TRUE.
- export: A logical vector that allows writing the final gene list in excel file. Default is TRUE.
- quiet: if TRUE
, suppresses printed output
To normalize the raw sequencing reads the function Normalizedata() should be used, this function takes 8 parameters.
The function Normalizedata() normalizes the count reads using the normalization method “median of ratios”
To Finalize the preprocessing the function FinalPreprocessing() should be implemented by setting the parameter "GeneFlitering" to NoiseF ( whether the dditional gene filtering step based on gene expression was implemented on not).
sc<-NoiseFiltering(sc,percentile=0.8, CV=0.3)
#### Normalizing the reads without any further gene filtering
sc<-Normalizedata(sc, mintotal=1000, minexpr=0, minnumber=0, maxexpr=Inf, downsample=FALSE, dsn=1, rseed=17000)
sc<-FinalPreprocessing(sc,GeneFlitering="NoiseF",export = TRUE) ### The GeneFiltering can be either "NoiseF" or"ExpF"
Cut-off value for the ERCCs= 12.5 Coefficients of the fit:
a0 a1tilde 0.0200366 70.4893629
Explained variances of log CV^2 values= 0.84 Number of genes that passed the filtering = 5684 The gene filtering method = Noise filtering The Filtered Normalized dataset contains: Genes: 5684 cells: 94 The Filtered Normalized dataset was saved as: filteredDataset.Rdata
Cellular clustering is performed according to the gene expression profiles to detect cellular sub-population with unique properties. After clustering, pseudo-temporal ordering is generated to indicate the cellular differentiation degree.
Model-based clustering assumes that the data are generated by a model and attempts to recover the original model from the data to define cellular clusters.
sc <- Exprmclust(sc,K =3,reduce = TRUE,quiet = TRUE)
To visualize the detected clusters, two common dimensionality reduction tools are implemented: tSNE map and principal component analysis (PCA), which is a linear dimensionality reduction method that preserves the global structure and shows how the measurements themselves are related to each other.
PlotmclustMB(sc)
PCAplotSymbols(sc)
# Plotting the model-based clusters in tSNE maps
sc<- comptSNE(sc,rseed=15555,quiet = F) # to perform the computation of a t-SNE map
plottSNE(sc)
This function may take time sigma summary: Min. : 38.4914978648748 |1st Qu. : 42.9672869120877 |Median : 45.7336241172451 |Mean : 47.2423249643292 |3rd Qu. : 50.7105126779753 |Max. : 73.7307741725683 | Epoch: Iteration #500 error is: 0.608233282504434 Epoch: Iteration #1000 error is: 0.597577895108978 Epoch: Iteration #1500 error is: 0.597054005007593 Epoch: Iteration #2000 error is: 0.59478833216114 Epoch: Iteration #2500 error is: 0.785133695910099 Epoch: Iteration #3000 error is: 0.467442752493165 Epoch: Iteration #3500 error is: 0.467349316975901 Epoch: Iteration #4000 error is: 0.467341397783741 Epoch: Iteration #4500 error is: 0.467329835708217 Epoch: Iteration #5000 error is: 0.467315499731271
# Silhouette of Model-based clusters
par(mar=c(6,2,4,2))
plotSilhouette(sc,K=3)