---
title: "Introduction to flashier"
output: rmarkdown::html_vignette
bibliography: flashier.bib
vignette: >
  %\VignetteIndexEntry{Introduction to flashier}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, comment = "#>",
                      fig.width = 6, fig.height = 6, warning = FALSE,
                      results = "hold", fig.align = "center", dpi = 120)
```

The matrix we analyze in this example is a 1,000 x 44 matrix of
z-scores quantifying support for association between genetic variants
in the human genome (the rows of the matrix) and gene expression
measured across 44 human tissues (the columns):

```{r load-gtex-data, message=FALSE}
library(flashier)
data(gtex)
nrow(gtex) 
ncol(gtex)
gtex[1:2,1:2]
```

The genetic variants are linked to genes, so each row of the matrix
corresponds to a gene (the names of the rows are the genes' Ensembl
ids). These z-scores were originally calculated in @Urbut using data
made available from the Genotype Tissue Expression (GTEx) project
[@GTEx], so we refer to this as the "GTEx data set."

Our motivation for factorizing the GTEx data matrix is to identify
patterns ("factors") that give insight into the genetics of the 44
tissues: in particular, which tissues tend to act in concert, and to
what degree?  First, to get some intuition into the structure of this
data set, we visualize the tissues in a 2-d embedding generated by
$t$-SNE [@tSNE; @tsne-2015; @rtsne]:

```{r tsne, message=FALSE, fig.height=4, fig.width=4}
library(Rtsne)
library(ggrepel)
library(ggplot2)
library(cowplot)
set.seed(1)
out <- Rtsne(t(gtex), dims = 2, perplexity = 10)
pdat <- data.frame(d1 = out$Y[, 1],
                   d2 = out$Y[, 2],
                   tissue = colnames(gtex))
ggplot(pdat,aes(x = d1, y = d2, label = tissue)) +
  geom_point(size = 3, color = gtex_colors) +
  geom_text_repel(size = 2, max.overlaps = Inf) +
  labs(x = "t-SNE [1]", y = "t-SNE [2]") +
  theme_classic() +
  theme(axis.ticks = element_blank(),
        axis.text = element_blank())
```

Visually, one of the predominant trends is the clustering of brain
tissues reflecting the greater genetic similarity among brain tissues.
But there may be other interesting patterns that are revealed more
effectively by a matrix factorization.

The flashier package implements the *empirical Bayes matrix
  factorization* (EBMF) framework developed by @WangStephens.
EBMF is based on the following model of an $n \times p$ data matrix
${\bf X}$:

$$
\mathbf{X} = \mathbf{L} \mathbf{F}^\top + \mathbf{E} \\
e_{ij} \sim \mathcal{N}(0, \sigma^2) \\
\ell_{ik} \sim g_\ell^{(k)} \in \mathcal{G}_{\ell} \\
f_{jk} \sim g_f^{(k)} \in \mathcal{G}_f, 
$$

where ${\bf L}, {\bf F}, {\bf E}$ are, respectively, matrices of
dimension $n \times K$, $p \times K$ and $n \times p$ storing
real-valued elements $l_{ik}$, $f_{jk}$, and $e_{ij}$. The matrices ${\bf
  L}$ and ${\bf F}$ form an *approximate low-rank representation*
of ${\bf X}$, ${\bf X} \approx {\bf L} {\bf F}^\top$, and this
low-rank representation is learned from the data. $K$ specifies
the rank of the reduced representation, and is typically set to a
positive number much smaller than $n$ and $p$.

The flexibility of the EBMF framework comes from its ability to
accommodate, in principle, any choice of prior family for
$\mathcal{G}_{\ell}$ and $\mathcal{G}_f$. As we illustrate below,
different choices of prior families can lead to very different
factorizations. Some choices closely correspond to existing matrix
factorization methods. For example, if $\mathcal{G}_{\ell}$ and
$\mathcal{G}_f$ are both the families of zero-centered normal priors,
then the low-rank representation ${\bf L}{\bf F}^\top$ is expected to
be similar to a truncated singular value decomposition (SVD)
[@Nakajima]. When point-normal prior families are used instead, one
obtains empirical Bayes versions of sparse SVD or sparse factor
analysis [@Engelhardt; @SSVD; @Witten]. Thus, EBMF is a highly
flexible framework for matrix factorization that includes important
previous methods as special cases, but also many new combinations.

Since fitting the EBMF model can be reduced to solving a sequence of
EBNM problems [@WangStephens], flashier leverages ebnm for the core
model fitting routines. In brief, fitting an EBMF model involves
solving an EBNM problem separately for each column of ${\bf L}$ and
for each column of ${\bf F}$; the former results in a fitted prior
$\hat{g}_{\ell}^{(k)}$ and posterior estimates of entries $\ell_{ik}$,
and the latter results in a fitted prior $\hat{g}_f^{(k)}$ and
posteror estimates of entries $f_{jk}$. These EBNM models are
repeatedly re-fitted until the priors and posterior estimates converge
to a fixed point. Therefore, fitting an EBMF model can involve solving
many hundreds or even many thousands of EBNM problems. Leveraging the
efficient EBNM solvers available in ebnm makes it possible for
flashier to tackle large-scale matrix factorization problems.

The flashier function `flash()` provides the main interface for
fitting EBMF models. The following call to `flash()` fits a
matrix factorization to the GTEx data with normal priors on ${\bf L}$
and ${\bf F}$:

```{r flash-normal}
flash_n <- flash(gtex, ebnm_fn = ebnm_normal, backfit = TRUE)
```

The "ebnm_fn" argument specifies the prior family for ${\bf L}$ and
${\bf F}$, and accepts any of the prior family functions implemented
in the ebnm package (e.g., `ebnm_point_normal` or `ebnm_unimodal`; see
Table X for a near-complete list of prior families). Specifying the
number of factors $K$ is not needed because flashier automatically
tries to determine an appropriate rank by checking whether the
addition of new factors substantially improves the fit. (If one
prefers a smaller number of factors than what is determined
automatically, the maximum number of factors can be adjusted with the
"greedy_Kmax" argument.)  Here we also turned on backfitting by
setting `backfit = TRUE`. This is generally recommended because
it improves the quality of the fit [@WangStephens], with the caveat
that it may make the model fitting too slow for very large data
sets. Even with backfitting, the full computation here is very fast;
when running this example on a current MacBook Pro, for example, the
model fitting is completed in less than one second.

The `plot()` method for flash objects can be used to visualize the
estimated matrix ${\bf L}$. Adding color to distinguish among tissues
yields an effective visualization that aids interpretation of factors:

```{r plot-flash-normal, fig.width=5, fig.height=5}
plot(flash_n, pm_which = "factors", pm_colors = gtex_colors, plot_type = "bar")
```

The first factor sets the "baseline" z-score for each tissue. Other
factors appear to capture expected trends: for example, factor 2 picks
up a z-score pattern that is more dominant in brain tissues. More
tissue-specific patterns are picked up by factor 3, which captures
whole blood effects, and factor 6, which captures effects more
specific to testis.

SVD and SVD-like matrix factorizations may provide compact
approximations of a data matrix, but they are generally known to have
poor interpretability. "Sparse" factorizations can produce factors
that are much more individually interpretable. In the EBNM framework,
a sparse matrix factorization can be achieved by choosing a flexible
prior family that better encourages sparsity in ${\bf L}$ and/or ${\bf
F}$.  The flashier interface makes it straightforward to experiment
with different prior families; for example, we can use point-normal
priors by simply modifying the "ebnm_fn" argument:

```{r flash-point-normal, fig.width=6, fig.height=5.5}
flash_pn <- flash(gtex, ebnm_fn = ebnm_point_normal, backfit = TRUE)
plot(flash_pn, pm_which = "factors", pm_colors = gtex_colors, plot_type = "bar")
```

The point-normal priors require estimation of slightly more parameters
than the normal priors, but the call to `flash()` is still fast; it
took less than 5 seconds to run on our MacBook Pro.

As before, the first factor acts as a "baseline" and the second factor
captures z-scores largely specific to brain tissues. However, there
are also some major differences: some factors are much more clearly
tissue-specific (e.g., factors 3 and 4 for, respectively, whole blood
and testis); other patterns include cerebellar-specific patterns
(factor 9) and heart-specific patterns (factor 12). These same
patterns are also captured in the matrix factorization with normal
priors, but they often appear in combination with other patterns in
unintuitive ways.

Another matrix factorization approach that has been recently proposed
is *semi-nonnegative matrix factorization*
[@DingJordan; @wang2019three; @he2020sn], in which the elements of
either ${\bf L}$ or ${\bf F}$ (but not both) are constrained to be
nonnegative. Supposing that ${\bf L}$ is constrained to be
nonnegative, the elements $\ell_{ik} \geq 0$ can be viewed as
"weights" or "memberships," and each row of ${\bf X}$ is then
interpretable as a weighted combination of factors: $x_{i \cdot}
\approx \sum_{k=1}^K \ell_{ik} f_{\cdot k}$.

To obtain a semi-nonnegative matrix factorization via flashier, one
need only choose appropriate prior families. Here we use point-normal
priors for ${\bf L}$ and point-exponential priors for ${\bf F}$; thus
we combine the benefits of both semi-nonnegative and sparse matrix
factorization. In flashier, we can specify different priors for ${\bf
L}$ and ${\bf F}$ by setting the "ebnm_fn" argument to be a list in
which the first element is the prior family for ${\bf L}$ and the
second the prior family for ${\bf F}$:

```{r flash-snn, fig.width=6, fig.height=5.5}
flash_snn <- flash(gtex, 
                   ebnm_fn = c(ebnm_point_normal, ebnm_point_exponential), 
                   backfit = TRUE)
plot(flash_snn, pm_which = "factors", pm_colors = gtex_colors, plot_type = "bar")
```

This code returned results in less than 3 seconds on our MacBook
Pro.

The result is in many ways strikingly similar to the sparse
factorization obtained using point-normal priors. In some cases, the
same tissue-specific effects are recovered, up to a difference in
signs (e.g., factor 5 for fibroblasts and factor 8 for thyroid). But
the semi-nonnegative factorization identifies factors that are
arguably more intuitive; for example, point-normal factor 10, which
combines artery and adipose tissues, is split into two factors in the
semi-nonnegative factorization (factor 9 for artery and factor 11 for
adipose tissue).

The flashier package has many more options which we did not explore
here, including other choices of prior family, different options for
estimating the variances of the residuals ${\bf E}$, and options for
fine-tuning the model fitting to improve the speed and quality of EBMF
model fits for larger data sets. Many of these options are discussed
in the other vignettes.

## Session information

The following R version and packages were used to generate this vignette:

```{r session-info}
sessionInfo()
```

## Bibliography