--- title: "Empirical Bayes non-negative matrix factorization for single-cell RNA-seq data" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{Empirical Bayes non-negative matrix factorization for single-cell RNA-seq data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- The aim of this vignette is show how `flashier` can be used to perform a non-negative matrix factorization (NMF) analysis of single-cell RNA-seq data. This vignette is modeled after the [fastTopics vignette on analyzing single-cell data][fasttopics-vignette]. ```{r knitr-opts, include=FALSE} knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 6, warning = FALSE, results = "hold", fig.align = "center", dpi = 120) ``` We begin by loading the required packages. We also set the seed so that results can be fully reproduced. ```{r load-pkgs, message=FALSE} library(flashier) library(Matrix) library(fastTopics) library(ggplot2) set.seed(3) ``` Preparing the single-cell data for flashier ------------------------------------------- The single-cell RNA-seq data (we use the `pbmc_facs` data from the `fastTopics` package) are unique molecular identifier (UMI) counts stored as an $n \times p$ sparse matrix, where $n$ is the number of cells and $p$ is the number of genes: ```{r load-data} data("pbmc_facs") counts <- pbmc_facs$counts colnames(counts) <- make.names(pbmc_facs$genes$symbol, unique = TRUE) dim(counts) ``` (Note that other R packages, for example [Seurat][seurat], use the convention that rows are genes and columns are cells.) Since the `flashier` model, like other linear-model-based methods (e.g., principal components analysis), was not designed for count data, it is recommended to first transform the data in a way that makes them more suitable. A widely-used approach is to divide the counts by a "size factor," add a "pseudocount," then take the log. We call this transformation the "shifted logarithm," following [this paper][ahlmann-eltze-huber-2023], and we refer to the transformed counts as "shifted log counts". In practice, the shifted log counts are computed as follows: we first divide the counts by $\alpha s_i$, where $s_i$ is the size factor for cell $i$, and $\alpha$ is the pseudocount. Done this way, the shifted log counts maintain sparsity in the data; that is, if the original count is zero, then the shifted log count is also zero. More formally, the transformed counts $y_{ij}$ obtained from the original counts $x_{ij}$ are $$ y_{ij} = \log\bigg(1 + \frac{x_{ij}}{\alpha s_i}\bigg). $$ In our analysis below, we use library-size normalization; that is, we set $$ s_i = \frac{\text{total count for cell} \; i} {\text{average across all cells}} = \frac{\sum_{j=1}^p x_{ij}} {\frac{1}{n}\sum_{i=1}^n \sum_{j=1}^p x_{ij}}, $$ and we set the pseudocount $\alpha$ to 1. ```{r shifted-log-counts} a <- 1 size_factors <- rowSums(counts) size_factors <- size_factors / mean(size_factors) shifted_log_counts <- log1p(counts / (a * size_factors)) ``` Note that, since the counts are sparse, we could have performed this transformation more efficiently using, for example, the `mapSparse` function from the [MatrixExtra][matrixextra] package. Variance regularization ----------------------- Before we fit the model, one issue we need to confront is that `flashier` may automatically estimate the variances to be too small, which can especially be an issue for genes with low expression. We can avoid this issue by setting a sensible lower bound on the variance estimates. We use the following rule of thumb: estimate the standard deviation of the transformed data for a Poisson random variable with rate $\mu = 1 / n$, where $n$ is the number of samples. This standard deviation corresponds to a gene for which we would expect to observe a single count across all $n$ cells, so it can serve as a reasonable lower bound to prevent variance estimates from getting too small. ```{r variance-lower-bound} n <- nrow(counts) x <- rpois(1e7, 1/n) s1 <- sd(log(x + 1)) ``` Fit a flashier model -------------------- Now we can call flash to fit an NMF model to the transformed data: ```{r flash} fit <- flash(shifted_log_counts, ebnm_fn = ebnm_point_exponential, var_type = 2, greedy_Kmax = 8, S = s1, backfit = FALSE) ``` A few notes about this flash call: + `ebnm_fn = ebnm_point_exponential` forces both the ${\bf L}$ and ${\bf F}$ matrices in flashier to be non-negative, so this call will generate a non-negative matrix factorization of the (transformed) counts matrix. Other types of matrix factorizations can be produced with different choices of the `ebnm_fn` argument; this is explained in detail in the [Introduction to flashier vignette][flashier_intro]. See also the notes at the bottom of this vignette. + `var_type = 2` means that we estimate column-wise (here, gene-wise) variances; that is, we estimate a different variance for each column (gene). + `greedy_Kmax = 8` forces flashier to fit no more than 8 factors. **In practice we recommend using a larger value,** but to keep the example simple and short we set it to 8 here. + `backfit = FALSE` skips the backfitting step. Backfitting can often greatly improve the fit, and is generally recommended for better results. (And indeed, backfitting noticeably improves the fit for these data.) On the other hand, backfitting can sometimes take a long time to complete, so have omitted the backfitting step here to reduce the computational effort involved. The `plot` method with default settings produces a "scree plot" showing the proportion of variance explained by each of the 8 factors: ```{r scree-plot, fig.height=2.25, fig.width=2.5} plot(fit) ``` Factors 7 and 8 explain a very small proportion of variation (and probably offer little insight into the biology of these cells). Visualizing the cell matrix --------------------------- In this matrix factorization, the rows of the ${\bf L}$ matrix contains the estimated "membership levels," or "memberships," for each cell. For example: ```{r membership-example} cell_ids <- c("GATATATGTCAGTG-1-b_cells", "GACAGTACCTGTGA-1-memory_t", "TGAAGCACACAGCT-1-b_cells") round(fit$L_pm[cell_ids, ], digits = 3) ``` The first and third cells mainly have membership in factors 1 and 6, whereas the second cell mostly has membership in factors 1 and 5. Most of the memberships are zero. We can use plotting functions provided by `flashier` to visualize the relationship between the memberships and the cell labels, and possibly get some clues about the biological meaning of the factors. The cell labels are the "sorted" cell subpopulations: ```{r cell-samples} summary(pbmc_facs$samples$subpop) ``` We can produce different types of plots by specifying the `plot_type` argument to the `plot` method. For example, plotting overlapping histograms makes it immediately clear that each component 2 through 6 is primarily capturing a single cell type. ```{r histograms, fig.height=4, fig.width=5} plot(fit, plot_type = "histogram", pm_which = "loadings", pm_groups = pbmc_facs$samples$subpop, bins = 20) ``` Other plot types can capture more subtle structure. An especially useful plot type for visualizing nonnegative factorizations is the "structure plot" --- a stacked bar plot in which each component is represented as a bar of a different color, and the bar heights are given by the loadings on each component. ```{r structure-plot-1, fig.height=2, fig.width=5, results="hide", message=FALSE} plot(fit, plot_type = "structure", pm_which = "loadings", pm_groups = pbmc_facs$samples$subpop, gap = 25) ``` Factor 1 ("k1" in the legend) is present in all cell types, and is likely capturing a "baseline" level of expression throughout. To focus on differences between cell types, let's remove this first factor from our structure plot: ```{r structure-plot-2, fig.height=2, fig.width=5, results="hide", message=FALSE} plot(fit, plot_type = "structure", kset = 2:8, pm_which = "loadings", pm_groups = pbmc_facs$samples$subpop, gap = 25) ``` It is clear from this plot that factors 2 through 6 ("k2" through "k6" in the legend) are capturing, respectively, natural killer (NK) cells, CD14+ cells, CD34+ cells, T cells and B cells. There is also more subtle structure captured by the memberships, such as the subset of T cells with mixed memberships (factors 2 and 5) which suggests a subpopulation that is intermediate between "pure" T and "pure" NK cells. It is also interesting to note the subset of CD34+ cells that are primarily loaded on factor 3; some of these cells may be mislabeled. We can also visualize the cell matrix in a heatmap: ```{r heatmap, fig.height=3.5, fig.width=3.25, message=FALSE, results="hide"} plot(fit, plot_type = "heatmap", pm_which = "loadings", pm_groups = pbmc_facs$samples$subpop, gap = 25) ``` Visualizing the gene matrix --------------------------- Above, we used the provided cell labels to interpret five of the factors as capturing distinct cell types (B cells, T cells, etc). However, one might not always have cell labels, or the existing cell labels may not be informative for the factors. We can also look to the ${\bf F}$ matrix—the "gene matrix"—for interpreting the factors. The gene matrix contains estimates of gene expression changes. Because these estimates are based on the shifted log counts, more precisely they are *changes in the shifted log-expression.* The first factor captures a "baseline" level of expression, so, broadly speaking, we interpret these changes relative to this baseline. Genes that were estimated to have the largest increases in expression might give the most helpful clues about the underlying biology. For example, many of the genes with the largest expression increases in factor 6 are also genes characteristic of B cells: ```{r top-genes-factor-6} res <- ldf(fit, type = "i") F <- with(res, F %*% diag(D)) rownames(F) <- pbmc_facs$genes$symbol head(sort(F[,6], decreasing = TRUE), n = 16) ``` Let's now apply this simple rule of thumb and examine the top 4 genes (by expression increase) for factors 2 through 6: ```{r top-genes-all, fig.height=4, fig.width=2.5, message=FALSE} top_genes <- apply(F, 2, order, decreasing = TRUE)[1:4, 2:6] top_genes <- rownames(fit$F_pm)[top_genes] plot(fit, plot_type = "heatmap", pm_which = "factors", pm_subset = top_genes, pm_groups = factor(top_genes, levels = rev(top_genes)), kset = 2:6, gap = 0.2) ``` Indeed, this approach reveals genes characteristic of the other cell types (e.g., *GNLY* for NK cells, *S100A9* for CD14+ cells), although not always; for example, some genes show large expression increases more than one factor. While selecting genes with the largest expression increases appears to work well here, in other settings there may be more effective approaches. Plot of mean vs. change for factor 6 ("B cells"), with top 10 genes (by largest increase): ```{r plot-change-vs-mean, fig.height=3, fig.width=3} plot(fit, plot_type = "scatter", pm_which = "factors", kset = 6, labels = TRUE, n_labels = 10, label_size = 2.5) + labs(x = "increase in shifted log expression", y = "mean shifted log expression") ``` Other notes ----------- While we have focussed on NMF, flashier is very flexible, and other types of matrix factorizations are possible. One alternative to NMF that could potentially reveal other interesting substructures is a *semi-non-negative matrix factorization* (semi-NMF). This is achieved by assigning different priors to ${\bf L}$ and ${\bf F}$: a prior with non-negative support (such as the point-exponential prior) for ${\bf L}$; and a prior with support for all real numbers for ${\bf F}$ (such as the point-Laplace prior). The call to the "flash" function looks the same as above except for the "ebnm_fn" argument: ```{r flash-snmf, eval=FALSE} fit_snmf <- flash(shifted_log_counts, ebnm_fn = c(ebnm_point_exponential, ebnm_point_laplace), var_type = 2, greedy_Kmax = 8, S = s1, backfit = FALSE) ``` Session info ------------ This is the version of R and the packages that were used to generate these results. ```{r session-info} sessionInfo() ``` [flashier_intro]: https://willwerscheid.github.io/flashier/articles/flashier_intro.html [seurat]: https://github.com/satijalab/seurat [matrixextra]: https://github.com/david-cortes/MatrixExtra [zheng-2017]: https://doi.org/10.1038/ncomms14049 [fasttopics-vignette]: https://stephenslab.github.io/fastTopics/articles/single_cell_rnaseq_basic.html [ahlmann-eltze-huber-2023]: https://doi.org/10.1038/s41592-023-01814-1