---
title: "Empirical Bayes non-negative matrix factorization for single-cell RNA-seq data"
output: 
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Empirical Bayes non-negative matrix factorization for single-cell RNA-seq data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The aim of this vignette is show how `flashier` can be used to perform a
non-negative matrix factorization (NMF) analysis of single-cell
RNA-seq data. This vignette is modeled after the
[fastTopics vignette on analyzing single-cell data][fasttopics-vignette].

```{r knitr-opts, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, comment = "#>",
                      fig.width = 6, fig.height = 6, warning = FALSE,
                      results = "hold", fig.align = "center", dpi = 120)
```

We begin by loading the required packages. We also set the seed so that results
can be fully reproduced.

```{r load-pkgs, message=FALSE}
library(flashier)
library(Matrix)
library(fastTopics)
library(ggplot2)
set.seed(3)
```

Preparing the single-cell data for flashier
-------------------------------------------

The single-cell RNA-seq data (we use the `pbmc_facs` data from the
`fastTopics` package) are unique molecular identifier (UMI) counts
stored as an $n \times p$ sparse matrix, where $n$ is the number of
cells and $p$ is the number of genes:

```{r load-data}
data("pbmc_facs")
counts <- pbmc_facs$counts
colnames(counts) <- make.names(pbmc_facs$genes$symbol, unique = TRUE)
dim(counts)
```

(Note that other R packages, for example [Seurat][seurat], use the
convention that rows are genes and columns are cells.)

Since the `flashier` model, like other linear-model-based methods (e.g.,
principal components analysis), was not designed for count data, it is
recommended to first transform the data in a way that makes them more
suitable. A widely-used approach is to divide the counts
by a "size factor," add a "pseudocount," then take the log. We call
this transformation the "shifted logarithm," following
[this paper][ahlmann-eltze-huber-2023], and we refer to the
transformed counts as "shifted log counts".
  
In practice, the shifted log counts are computed as follows: we first
divide the counts by $\alpha s_i$, where $s_i$ is the size factor for
cell $i$, and $\alpha$ is the pseudocount. Done this way, the shifted
log counts maintain sparsity in the data; that is, if the original
count is zero, then the shifted log count is also zero.
  
More formally, the transformed counts $y_{ij}$ obtained from the
original counts $x_{ij}$ are
$$
y_{ij} = \log\bigg(1 + \frac{x_{ij}}{\alpha s_i}\bigg).
$$
In our analysis below, we use library-size normalization;
that is, we set
$$
s_i =
\frac{\text{total count for cell} \; i}
     {\text{average across all cells}} =
\frac{\sum_{j=1}^p x_{ij}}
     {\frac{1}{n}\sum_{i=1}^n \sum_{j=1}^p x_{ij}},
$$
and we set the pseudocount $\alpha$ to 1.

```{r shifted-log-counts}
a <- 1
size_factors <- rowSums(counts)
size_factors <- size_factors / mean(size_factors)
shifted_log_counts <- log1p(counts / (a * size_factors))
```

Note that, since the counts are sparse, we could have performed this
transformation more efficiently using, for example, the `mapSparse`
function from the [MatrixExtra][matrixextra] package.

Variance regularization
-----------------------

Before we fit the model, one issue we need to confront is that `flashier`
may automatically estimate the variances to be too small, which can
especially be an issue for genes with low expression. We can avoid
this issue by setting a sensible lower bound on the variance
estimates.

<!-- When running flashier, we need to place some regularization on the -->
<!-- variance estimates. Without regularization, it is possible for some -->
<!-- variance estimates to be very small, which can lead to overfitting. -->

We use the following rule of thumb: estimate the standard deviation of
the transformed data for a Poisson random variable with rate $\mu = 1
/ n$, where $n$ is the number of samples. This standard deviation
corresponds to a gene for which we would expect to observe a single
count across all $n$ cells, so it can serve as a reasonable lower
bound to prevent variance estimates from getting too small.

```{r variance-lower-bound}
n  <- nrow(counts)
x  <- rpois(1e7, 1/n)
s1 <- sd(log(x + 1))
```

Fit a flashier model
--------------------

Now we can call flash to fit an NMF model to the transformed data:

```{r flash}
fit <- flash(shifted_log_counts, 
             ebnm_fn = ebnm_point_exponential,
             var_type = 2, 
             greedy_Kmax = 8, 
             S = s1,
             backfit = FALSE)
```

A few notes about this flash call:

+ `ebnm_fn = ebnm_point_exponential` forces both the ${\bf L}$ and
  ${\bf F}$ matrices in flashier to be non-negative, so this call will
  generate a non-negative matrix factorization of the
  (transformed) counts matrix. Other types of matrix factorizations
  can be produced with different choices of the `ebnm_fn` argument;
  this is explained in detail in the
  [Introduction to flashier vignette][flashier_intro]. See also the
  notes at the bottom of this vignette.

+ `var_type = 2` means that we estimate column-wise (here, gene-wise)
  variances; that is, we estimate a different variance for each column
  (gene).

+ `greedy_Kmax = 8` forces flashier to fit no more than 8 factors. **In
  practice we recommend using a larger value,** but to keep the example
  simple and short we set it to 8 here.

+ `backfit = FALSE` skips the backfitting step. Backfitting can often
  greatly improve the fit, and is generally recommended for better
  results. (And indeed, backfitting noticeably improves the fit for
  these data.) On the other hand, backfitting can sometimes take a
  long time to complete, so have omitted the backfitting step here to
  reduce the computational effort involved.

The `plot` method with default settings produces a "scree plot" showing
the proportion of variance explained by each of the 8 factors:

```{r scree-plot, fig.height=2.25, fig.width=2.5}
plot(fit)
```

Factors 7 and 8 explain a very small proportion of variation (and
probably offer little insight into the biology of these cells).

Visualizing the cell matrix
---------------------------

<!-- Each cell in the single-cell data set is represented as a linear
combination of the $K$ factors. -->

In this matrix factorization, the rows of the ${\bf L}$ matrix
contains the estimated "membership levels," or "memberships," for each
cell. For example:

```{r membership-example}
cell_ids <- c("GATATATGTCAGTG-1-b_cells",
              "GACAGTACCTGTGA-1-memory_t",
              "TGAAGCACACAGCT-1-b_cells")
round(fit$L_pm[cell_ids, ], digits = 3)
```

The first and third cells mainly have membership in factors 1 and 6,
whereas the second cell mostly has membership in factors 1 and 5.
Most of the memberships are zero.

We can use plotting functions provided by `flashier` to visualize the relationship between
the memberships and the cell labels, and possibly get some clues about
the biological meaning of the factors. The cell labels are the
"sorted" cell subpopulations:

```{r cell-samples}
summary(pbmc_facs$samples$subpop)
```

We can produce different types of plots by specifying the `plot_type` argument 
to the `plot` method. For example, plotting overlapping histograms makes it 
immediately clear that each component 2 through 6 is primarily capturing a 
single cell type.

```{r histograms, fig.height=4, fig.width=5}
plot(fit, 
     plot_type = "histogram",
     pm_which = "loadings", 
     pm_groups = pbmc_facs$samples$subpop,
     bins = 20)
```

Other plot types can capture more subtle structure. An especially useful plot type
for visualizing nonnegative factorizations is the
"structure plot" --- a stacked bar plot in which each component is
represented as a bar of a different color, and the bar heights are
given by the loadings on each component.

```{r structure-plot-1, fig.height=2, fig.width=5, results="hide", message=FALSE}
plot(fit, 
     plot_type = "structure",
     pm_which = "loadings", 
     pm_groups = pbmc_facs$samples$subpop,
     gap = 25)
```

Factor 1 ("k1" in the legend) is present in all cell types, and is
likely capturing a "baseline" level of expression throughout.  To
focus on differences between cell types, let's remove this first
factor from our structure plot:

```{r structure-plot-2, fig.height=2, fig.width=5, results="hide", message=FALSE}
plot(fit, 
     plot_type = "structure",
     kset = 2:8,
     pm_which = "loadings", 
     pm_groups = pbmc_facs$samples$subpop,
     gap = 25)
```
  
It is clear from this plot that factors 2 through 6 ("k2" through "k6"
in the legend) are capturing, respectively, natural killer (NK) cells,
CD14+ cells, CD34+ cells, T cells and B cells. There is also more
subtle structure captured by the memberships, such as the subset of T
cells with mixed memberships (factors 2 and 5) which suggests a
subpopulation that is intermediate between "pure" T and "pure" NK
cells. It is also interesting to note the subset of CD34+ cells that
are primarily loaded on factor 3; some of these cells may be
mislabeled.
  
We can also visualize the cell matrix in a heatmap:

```{r heatmap, fig.height=3.5, fig.width=3.25, message=FALSE, results="hide"}
plot(fit, 
     plot_type = "heatmap",
     pm_which = "loadings", 
     pm_groups = pbmc_facs$samples$subpop,
     gap = 25)
```

Visualizing the gene matrix
---------------------------

Above, we used the provided cell labels to interpret five of the
factors as capturing distinct cell types (B cells, T cells,
etc). However, one might not always have cell labels, or the existing
cell labels may not be informative for the factors. We can also look
to the ${\bf F}$ matrix—the "gene matrix"—for interpreting the
factors.

The gene matrix contains estimates of gene expression changes.
Because these estimates are based on the shifted log counts, more
precisely they are *changes in the shifted log-expression.* The first
factor captures a "baseline" level of expression, so, broadly
speaking, we interpret these changes relative to this baseline.

Genes that were estimated to have the largest increases in expression
might give the most helpful clues about the underlying biology. For
example, many of the genes with the largest expression increases in
factor 6 are also genes characteristic of B cells:

```{r top-genes-factor-6}
res <- ldf(fit, type = "i")
F <- with(res, F %*% diag(D))
rownames(F) <- pbmc_facs$genes$symbol
head(sort(F[,6], decreasing = TRUE), n = 16)
```

Let's now apply this simple rule of thumb and examine the top 4 genes
(by expression increase) for factors 2 through 6:

```{r top-genes-all, fig.height=4, fig.width=2.5, message=FALSE}
top_genes <- apply(F, 2, order, decreasing = TRUE)[1:4, 2:6]
top_genes <- rownames(fit$F_pm)[top_genes]
plot(fit,
     plot_type = "heatmap",
     pm_which = "factors",
     pm_subset = top_genes,
     pm_groups = factor(top_genes, levels = rev(top_genes)),
     kset = 2:6,
     gap = 0.2)
```

Indeed, this approach reveals genes characteristic of the other cell
types (e.g., *GNLY* for NK cells, *S100A9* for CD14+ cells), although not
always; for example, some genes show large expression increases more
than one factor.

While selecting genes with the largest expression increases appears to
work well here, in other settings there may be more effective approaches.

Plot of mean vs. change for factor 6 ("B cells"), with top 10 genes
(by largest increase):

```{r plot-change-vs-mean, fig.height=3, fig.width=3}
plot(fit,
     plot_type = "scatter",
     pm_which = "factors",
     kset = 6,
     labels = TRUE,
     n_labels = 10,
     label_size = 2.5) +
  labs(x = "increase in shifted log expression",
       y = "mean shifted log expression") 
```

Other notes
-----------

While we have focussed on NMF, flashier is very flexible, and other
types of matrix factorizations are possible. One alternative to NMF
that could potentially reveal other interesting substructures is a
*semi-non-negative matrix factorization* (semi-NMF). This is achieved
by assigning different priors to ${\bf L}$ and ${\bf F}$: a prior with
non-negative support (such as the point-exponential prior) for ${\bf
L}$; and a prior with support for all real numbers for ${\bf F}$ (such
as the point-Laplace prior). The call to the "flash" function looks
the same as above except for the "ebnm_fn" argument:

```{r flash-snmf, eval=FALSE}
fit_snmf <- flash(shifted_log_counts, 
                  ebnm_fn = c(ebnm_point_exponential, ebnm_point_laplace),
                  var_type = 2, 
                  greedy_Kmax = 8, 
                  S = s1,
                  backfit = FALSE)
```

Session info
------------

This is the version of R and the packages that were used to generate
these results.

```{r session-info}
sessionInfo()
```

[flashier_intro]: https://willwerscheid.github.io/flashier/articles/flashier_intro.html
[seurat]: https://github.com/satijalab/seurat
[matrixextra]: https://github.com/david-cortes/MatrixExtra
[zheng-2017]: https://doi.org/10.1038/ncomms14049
[fasttopics-vignette]: https://stephenslab.github.io/fastTopics/articles/single_cell_rnaseq_basic.html
[ahlmann-eltze-huber-2023]: https://doi.org/10.1038/s41592-023-01814-1