--- title: "Introduction to flashier" output: rmarkdown::html_vignette bibliography: flashier.bib vignette: > %\VignetteIndexEntry{Introduction to flashier} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 6, warning = FALSE, results = "hold", fig.align = "center", dpi = 120) ``` The matrix we analyze in this example is a 1,000 x 44 matrix of z-scores quantifying support for association between genetic variants in the human genome (the rows of the matrix) and gene expression measured across 44 human tissues (the columns): ```{r load-gtex-data, message=FALSE} library(flashier) data(gtex) nrow(gtex) ncol(gtex) gtex[1:2,1:2] ``` The genetic variants are linked to genes, so each row of the matrix corresponds to a gene (the names of the rows are the genes' Ensembl ids). These z-scores were originally calculated in @Urbut using data made available from the Genotype Tissue Expression (GTEx) project [@GTEx], so we refer to this as the "GTEx data set." Our motivation for factorizing the GTEx data matrix is to identify patterns ("factors") that give insight into the genetics of the 44 tissues: in particular, which tissues tend to act in concert, and to what degree? First, to get some intuition into the structure of this data set, we visualize the tissues in a 2-d embedding generated by $t$-SNE [@tSNE; @tsne-2015; @rtsne]: ```{r tsne, message=FALSE, fig.height=4, fig.width=4} library(Rtsne) library(ggrepel) library(ggplot2) library(cowplot) set.seed(1) out <- Rtsne(t(gtex), dims = 2, perplexity = 10) pdat <- data.frame(d1 = out$Y[, 1], d2 = out$Y[, 2], tissue = colnames(gtex)) ggplot(pdat,aes(x = d1, y = d2, label = tissue)) + geom_point(size = 3, color = gtex_colors) + geom_text_repel(size = 2, max.overlaps = Inf) + labs(x = "t-SNE [1]", y = "t-SNE [2]") + theme_classic() + theme(axis.ticks = element_blank(), axis.text = element_blank()) ``` Visually, one of the predominant trends is the clustering of brain tissues reflecting the greater genetic similarity among brain tissues. But there may be other interesting patterns that are revealed more effectively by a matrix factorization. The flashier package implements the *empirical Bayes matrix factorization* (EBMF) framework developed by @WangStephens. EBMF is based on the following model of an $n \times p$ data matrix ${\bf X}$: $$ \mathbf{X} = \mathbf{L} \mathbf{F}^\top + \mathbf{E} \\ e_{ij} \sim \mathcal{N}(0, \sigma^2) \\ \ell_{ik} \sim g_\ell^{(k)} \in \mathcal{G}_{\ell} \\ f_{jk} \sim g_f^{(k)} \in \mathcal{G}_f, $$ where ${\bf L}, {\bf F}, {\bf E}$ are, respectively, matrices of dimension $n \times K$, $p \times K$ and $n \times p$ storing real-valued elements $l_{ik}$, $f_{jk}$, and $e_{ij}$. The matrices ${\bf L}$ and ${\bf F}$ form an *approximate low-rank representation* of ${\bf X}$, ${\bf X} \approx {\bf L} {\bf F}^\top$, and this low-rank representation is learned from the data. $K$ specifies the rank of the reduced representation, and is typically set to a positive number much smaller than $n$ and $p$. The flexibility of the EBMF framework comes from its ability to accommodate, in principle, any choice of prior family for $\mathcal{G}_{\ell}$ and $\mathcal{G}_f$. As we illustrate below, different choices of prior families can lead to very different factorizations. Some choices closely correspond to existing matrix factorization methods. For example, if $\mathcal{G}_{\ell}$ and $\mathcal{G}_f$ are both the families of zero-centered normal priors, then the low-rank representation ${\bf L}{\bf F}^\top$ is expected to be similar to a truncated singular value decomposition (SVD) [@Nakajima]. When point-normal prior families are used instead, one obtains empirical Bayes versions of sparse SVD or sparse factor analysis [@Engelhardt; @SSVD; @Witten]. Thus, EBMF is a highly flexible framework for matrix factorization that includes important previous methods as special cases, but also many new combinations. Since fitting the EBMF model can be reduced to solving a sequence of EBNM problems [@WangStephens], flashier leverages ebnm for the core model fitting routines. In brief, fitting an EBMF model involves solving an EBNM problem separately for each column of ${\bf L}$ and for each column of ${\bf F}$; the former results in a fitted prior $\hat{g}_{\ell}^{(k)}$ and posterior estimates of entries $\ell_{ik}$, and the latter results in a fitted prior $\hat{g}_f^{(k)}$ and posteror estimates of entries $f_{jk}$. These EBNM models are repeatedly re-fitted until the priors and posterior estimates converge to a fixed point. Therefore, fitting an EBMF model can involve solving many hundreds or even many thousands of EBNM problems. Leveraging the efficient EBNM solvers available in ebnm makes it possible for flashier to tackle large-scale matrix factorization problems. The flashier function `flash()` provides the main interface for fitting EBMF models. The following call to `flash()` fits a matrix factorization to the GTEx data with normal priors on ${\bf L}$ and ${\bf F}$: ```{r flash-normal} flash_n <- flash(gtex, ebnm_fn = ebnm_normal, backfit = TRUE) ``` The "ebnm_fn" argument specifies the prior family for ${\bf L}$ and ${\bf F}$, and accepts any of the prior family functions implemented in the ebnm package (e.g., `ebnm_point_normal` or `ebnm_unimodal`; see Table X for a near-complete list of prior families). Specifying the number of factors $K$ is not needed because flashier automatically tries to determine an appropriate rank by checking whether the addition of new factors substantially improves the fit. (If one prefers a smaller number of factors than what is determined automatically, the maximum number of factors can be adjusted with the "greedy_Kmax" argument.) Here we also turned on backfitting by setting `backfit = TRUE`. This is generally recommended because it improves the quality of the fit [@WangStephens], with the caveat that it may make the model fitting too slow for very large data sets. Even with backfitting, the full computation here is very fast; when running this example on a current MacBook Pro, for example, the model fitting is completed in less than one second. The `plot()` method for flash objects can be used to visualize the estimated matrix ${\bf L}$. Adding color to distinguish among tissues yields an effective visualization that aids interpretation of factors: ```{r plot-flash-normal, fig.width=5, fig.height=5} plot(flash_n, pm_which = "factors", pm_colors = gtex_colors, plot_type = "bar") ``` The first factor sets the "baseline" z-score for each tissue. Other factors appear to capture expected trends: for example, factor 2 picks up a z-score pattern that is more dominant in brain tissues. More tissue-specific patterns are picked up by factor 3, which captures whole blood effects, and factor 6, which captures effects more specific to testis. SVD and SVD-like matrix factorizations may provide compact approximations of a data matrix, but they are generally known to have poor interpretability. "Sparse" factorizations can produce factors that are much more individually interpretable. In the EBNM framework, a sparse matrix factorization can be achieved by choosing a flexible prior family that better encourages sparsity in ${\bf L}$ and/or ${\bf F}$. The flashier interface makes it straightforward to experiment with different prior families; for example, we can use point-normal priors by simply modifying the "ebnm_fn" argument: ```{r flash-point-normal, fig.width=6, fig.height=5.5} flash_pn <- flash(gtex, ebnm_fn = ebnm_point_normal, backfit = TRUE) plot(flash_pn, pm_which = "factors", pm_colors = gtex_colors, plot_type = "bar") ``` The point-normal priors require estimation of slightly more parameters than the normal priors, but the call to `flash()` is still fast; it took less than 5 seconds to run on our MacBook Pro. As before, the first factor acts as a "baseline" and the second factor captures z-scores largely specific to brain tissues. However, there are also some major differences: some factors are much more clearly tissue-specific (e.g., factors 3 and 4 for, respectively, whole blood and testis); other patterns include cerebellar-specific patterns (factor 9) and heart-specific patterns (factor 12). These same patterns are also captured in the matrix factorization with normal priors, but they often appear in combination with other patterns in unintuitive ways. Another matrix factorization approach that has been recently proposed is *semi-nonnegative matrix factorization* [@DingJordan; @wang2019three; @he2020sn], in which the elements of either ${\bf L}$ or ${\bf F}$ (but not both) are constrained to be nonnegative. Supposing that ${\bf L}$ is constrained to be nonnegative, the elements $\ell_{ik} \geq 0$ can be viewed as "weights" or "memberships," and each row of ${\bf X}$ is then interpretable as a weighted combination of factors: $x_{i \cdot} \approx \sum_{k=1}^K \ell_{ik} f_{\cdot k}$. To obtain a semi-nonnegative matrix factorization via flashier, one need only choose appropriate prior families. Here we use point-normal priors for ${\bf L}$ and point-exponential priors for ${\bf F}$; thus we combine the benefits of both semi-nonnegative and sparse matrix factorization. In flashier, we can specify different priors for ${\bf L}$ and ${\bf F}$ by setting the "ebnm_fn" argument to be a list in which the first element is the prior family for ${\bf L}$ and the second the prior family for ${\bf F}$: ```{r flash-snn, fig.width=6, fig.height=5.5} flash_snn <- flash(gtex, ebnm_fn = c(ebnm_point_normal, ebnm_point_exponential), backfit = TRUE) plot(flash_snn, pm_which = "factors", pm_colors = gtex_colors, plot_type = "bar") ``` This code returned results in less than 3 seconds on our MacBook Pro. The result is in many ways strikingly similar to the sparse factorization obtained using point-normal priors. In some cases, the same tissue-specific effects are recovered, up to a difference in signs (e.g., factor 5 for fibroblasts and factor 8 for thyroid). But the semi-nonnegative factorization identifies factors that are arguably more intuitive; for example, point-normal factor 10, which combines artery and adipose tissues, is split into two factors in the semi-nonnegative factorization (factor 9 for artery and factor 11 for adipose tissue). The flashier package has many more options which we did not explore here, including other choices of prior family, different options for estimating the variances of the residuals ${\bf E}$, and options for fine-tuning the model fitting to improve the speed and quality of EBMF model fits for larger data sets. Many of these options are discussed in the other vignettes. ## Session information The following R version and packages were used to generate this vignette: ```{r session-info} sessionInfo() ``` ## Bibliography