Package 'flashier'

Title: Empirical Bayes Matrix Factorization
Description: Methods for matrix factorization based on Wang and Stephens (2021) <https://jmlr.org/papers/v22/20-589.html>.
Authors: Jason Willwerscheid [aut, cre], Peter Carbonetto [aut], Wei Wang [aut], Matthew Stephens [aut], Eric Weine [ctb], Annie Xie [ctb], Gao Wang [ctb]
Maintainer: Jason Willwerscheid <[email protected]>
License: BSD_3_clause + file LICENSE
Version: 1.0.53
Built: 2024-08-24 05:33:36 UTC
Source: https://github.com/willwerscheid/flashier

Help Index


Fitted method for flash objects

Description

Given a flash object, returns the "fitted values" E(LF)=E(L)E(F)E(LF') = E(L) E(F)'.

Usage

## S3 method for class 'flash'
fitted(object, ...)

Arguments

object

An object inheriting from class flash.

...

Additional parameters are ignored.

Value

The matrix of "fitted values."


Fitted method for flash fit objects

Description

Given a flash_fit object, returns the "fitted values" E(LF)=E(L)E(F)E(LF') = E(L) E(F)'.

Usage

## S3 method for class 'flash_fit'
fitted(object, ...)

Arguments

object

An object inheriting from class flash_fit.

...

Additional parameters are ignored.

Value

The matrix of "fitted values."


Empirical Bayes matrix factorization

Description

Fits an empirical Bayes matrix factorization (see Details for a description of the model). The resulting fit is referred to as a "flash" object (short for Factors and Loadings using Adaptive SHrinkage). Two interfaces are provided. The flash function provides a simple interface that allows a flash object to be fit in a single pass, while flash_xxx functions are pipeable functions that allow for more complex flash objects to be fit incrementally (available functions are listed below under See Also). See the vignettes and Examples for usage.

Usage

flash(
  data,
  S = NULL,
  ebnm_fn = ebnm_point_normal,
  var_type = 0L,
  greedy_Kmax = 50L,
  backfit = FALSE,
  nullcheck = TRUE,
  verbose = 1L
)

Arguments

data

The observations. Usually a matrix, but can also be a sparse matrix of class Matrix or a low-rank matrix representation as returned by, for example, svd, irlba, rsvd, or softImpute (in general, any list that includes fields u, d, and v will be interpreted as a low-rank matrix representation).

S

The standard errors. Can be NULL (in which case all residual variance will be estimated) or a matrix, vector, or scalar. S should be a scalar if standard errors are identical across observations. It should be a vector if standard errors either vary across columns but are constant within any given row, or vary across rows but are constant within any given column (flash will use the length of the vector to determine whether the supplied values correspond to rows or columns; if the data matrix is square, then the sense must be specified using parameter S_dim in function flash_init).

ebnm_fn

The function or functions used to solve the empirical Bayes normal means (EBNM) subproblems. Most importantly, these functions specify the families of distributions G(k)G_\ell^{(k)} and Gf(k)G_f^{(k)} to which the priors on loadings and factors g(k)g_\ell^{(k)} and gf(k)g_f^{(k)} are assumed to belong. If the same function is to be used for both loadings LL and factors FF, then ebnm_fn can be a single function. If one function is to be used for loadings and a second for factors, then ebnm_fn should be a list of length two, with the first element giving the function for loadings and the second the function for factors. If different functions are to be used for different values of kk, then factor/loadings pairs must be added successively using multiple calls to either flash_greedy or flash_factors_init.

Any EBNM function provided by package ebnm can be used as input. Non-default arguments to parameters can be supplied using the helper function flash_ebnm. Custom EBNM functions can also be used: for details, see flash_ebnm.

var_type

Describes the structure of the estimated residual variance. Can be NULL, 0, 1, 2, or c(1, 2). If NULL, then S accounts for all residual variance. If var_type = 0, then the estimated residual variance (which is added to any variance given by S) is assumed to be constant across all observations. Setting var_type = 1 estimates a single variance parameter for each row; var_type = 2 estimates one parameter for each column; and var_type = c(1, 2) optimizes over all rank-one matrices (that is, it assumes that the residual variance parameter sijs_{ij} can be written sij=aibjs_{ij} = a_i b_j, where the nn-vector aa and the pp-vector bb are to be estimated).

Note that if any portion of the residual variance is to be estimated, then it is usually faster to set S = NULL and to let flash estimate all of the residual variance. Further, var_type = c(1, 2) is typically much slower than other options, so it should be used with care.

greedy_Kmax

The maximum number of factors to be added. This will not necessarily be the total number of factors added by flash, since factors are only added as long as they increase the variational lower bound on the log likelihood for the model.

backfit

A "greedy" fit is performed by adding up to greedy_Kmax factors, optimizing each newly added factor in one go without returning to optimize previously added factors. When backfit = TRUE, flash will additionally perform a final "backfit" where all factors are cyclically updated until convergence. The backfitting procedure typically takes much longer than the greedy algorithm, but it also usually improves the final fit to a significant degree.

nullcheck

If nullcheck = TRUE, then flash will check that each factor in the final flash object improves the overall fit. Any factor that fails the check will be removed.

verbose

When and how to display progress updates. Set to 0 for none, 1 for updates after a factor is added or a backfit is completed, 2 for additional notifications about the variational lower bound, and 3 for updates after every iteration. It is also possible to output a single tab-delimited table of values using function flash_set_verbose with verbose = -1.

Details

If YY is an n×pn \times p data matrix, then the rank-one empirical Bayes matrix factorization model is:

Y=f+E,Y = \ell f' + E,

where \ell is an nn-vector of loadings, ff is a pp-vector of factors, and EE is an n×pn \times p matrix of residuals (or "errors"). Additionally:

eijN(0,sij2):i=1,...,n;j=1,...,pe_{ij} \sim N(0, s_{ij}^2): i = 1, ..., n; j = 1, ..., p

gG\ell \sim g_\ell \in G_\ell

fgfGf.f \sim g_f \in G_f.

The residual variance parameters sij2s_{ij}^2 are constrained to have a simple structure and are fit via maximum likelihood. (For example, one might assume that all standard errors are identical: sij2=s2s_{ij}^2 = s^2 for some s2s^2 and for all ii, jj). The functions gg_\ell and gfg_f are assumed to belong to some families of priors GG_\ell and GfG_f that are specified in advance, and are estimated via variational approximation.

The general rank-KK empirical Bayes matrix factorization model is:

Y=LF+EY = LF' + E

or

yij=kikfjk+eij:i=1,...,n;j=1,...,p,y_{ij} = \sum_k \ell_{ik} f_{jk} + e_{ij}: i = 1, ..., n; j = 1, ..., p,

where LL is now a matrix of loadings and FF is a matrix of factors.

Separate priors g(k)g_\ell^{(k)} and gf(k)g_f^{(k)} are estimated via empirical Bayes, and different prior families may be used for different values of kk. In general, then:

eijN(0,sij2):i=1,...,n;j=1,...,pe_{ij} \sim N(0, s_{ij}^2): i = 1, ..., n; j = 1, ..., p

ikg(k)G(k):i=1,...,n;k=1,...,K\ell_{ik} \sim g_\ell^{(k)} \in G_\ell^{(k)}: i = 1, ..., n; k = 1, ..., K

fikgf(k)Gf(k):j=1,...,p;k=1,...,K.f_{ik} \sim g_f^{(k)} \in G_f^{(k)}: j = 1, ..., p; k = 1, ..., K.

Typically, G(k)G_\ell^{(k)} and Gf(k)G_f^{(k)} will be closed under scaling, in which case k\ell_k and fkf_k are only identifiable up to a scaling factor dkd_k. In other words, we can write:

Y=LDF+E,Y = LDF' + E,

where DD is a diagonal matrix with diagonal entries d1,...,dKd_1, ..., d_K. The model can then be made identifiable by constraining the scale of k\ell_k and fkf_k for k=1,...,Kk = 1, ..., K.

Value

A flash object. Contains elements:

n_factors

The total number of factor/loadings pairs KK in the fitted model.

pve

The proportion of variance explained by each factor/loadings pair. Since factors and loadings are not required to be orthogonal, this should be interpreted loosely: for example, the total proportion of variance explained could be larger than 1.

elbo

The variational lower bound achieved by the fitted model.

residuals_sd

Estimated residual standard deviations (these include any variance component given as an argument to S).

L_pm, L_psd, L_lfsr

Posterior means, standard deviations, and local false sign rates for loadings LL.

F_pm, F_psd, F_lfsr

Posterior means, standard deviations, and local false sign rates for factors FF.

L_ghat

The fitted priors on loadings g^(k)\hat{g}_\ell^{(k)}.

F_ghat

The fitted priors on factors g^f(k)\hat{g}_f^{(k)}.

sampler

A function that takes a single argument nsamp and returns nsamp samples from the posterior distributions for factors FF and loadings LL.

flash_fit

A flash_fit object. Used by flash when fitting is not performed all at once, but incrementally via calls to various flash_xxx functions.

The following methods are available:

fitted.flash

Returns the "fitted values" E(LF)=E(L)E(F)E(LF') = E(L) E(F)'.

residuals.flash

Returns the expected residuals YE(LF)=YE(L)E(F)Y - E(LF') = Y - E(L) E(F)'.

ldf.flash

Returns an LDFLDF decomposition (see Details above), with columns of LL and FF scaled as specified by the user.

References

Wei Wang and Matthew Stephens (2021). "Empirical Bayes matrix factorization." Journal of Machine Learning Research 22, 1–40.

See Also

flash_init, flash_greedy, flash_backfit, and flash_nullcheck. For more advanced functionality, see flash_factors_init, flash_factors_fix, flash_factors_set_to_zero, flash_factors_remove, flash_set_verbose, and flash_set_conv_crit. For extracting useful data from flash objects, see fitted.flash, residuals.flash, and ldf.flash.

Examples

data(gtex)

# Fit up to 3 factors and backfit.
fl <- flash(gtex, greedy_Kmax = 3L, backfit = TRUE)

# This is equivalent to the series of calls:
fl <- flash_init(gtex) |>
  flash_greedy(Kmax = 3L) |>
  flash_backfit() |>
  flash_nullcheck()

# Fit a unimodal distribution with mean zero to each set of loadings
#   and a scale mixture of normals with mean zero to each factor.
fl <- flash(gtex,
            ebnm_fn = c(ebnm_unimodal,
                        ebnm_normal_scale_mixture),
            greedy_Kmax = 3)

# Fit point-laplace priors using a non-default optimization method.
fl <- flash(gtex,
            ebnm_fn = flash_ebnm(prior_family = "point_laplace",
                                 optmethod = "trust"),
            greedy_Kmax = 3)

# Fit a "Kronecker" (rank-one) variance structure (this can be slow).
fl <- flash(gtex, var_type = c(1, 2), greedy_Kmax = 3L)

Add "intercept" to a flash object

Description

Adds an all-ones vector as a fixed set of loadings (if rowwise = TRUE) or fixed factor (if rowwise = FALSE). Assuming (without loss of generality) that the fixed factor/loadings is indexed as k=1k = 1, a fixed set of loadings gives:

yif1+k=2Kikfk,\mathbf{y}_{i \cdot} \approx \mathbf{f}_1 + \sum_{k = 2}^K \ell_{i k} \mathbf{f}_k,

so that the (estimated) factor f1Rp\mathbf{f}_1 \in \mathbf{R}^p is shared by all row-wise observations yiRp\mathbf{y}_{i \cdot} \in \mathbf{R}^p. A fixed factor gives:

yj1+k=2Kfjkk,\mathbf{y}_{\cdot j} \approx \boldsymbol{\ell}_1 + \sum_{k = 2}^K f_{j k} \boldsymbol{\ell}_k,

so that the (estimated) set of loadings 1Rn\ell_1 \in \mathbf{R}^n is shared by all column-wise observations yjRny_{\cdot j} \in \mathbf{R}^n.

Usage

flash_add_intercept(flash, rowwise = TRUE, ebnm_fn = ebnm_point_normal)

Arguments

flash

A flash or flash_fit object to which an "intercept" is to be added.

rowwise

Should the all-ones vector be added as a fixed set of loadings ("row-wise") or a fixed factor ("column-wise")? See above for details.

ebnm_fn

As with other factor/loadings pairs, a prior is put on the estimated factor (if rowwise = TRUE) or set of loadings (if rowwise = FALSE). Parameter ebnm_fn specifies the function used to estimate that prior; see flash for details.

Details

The estimated factor (if rowwise = TRUE) or set of loadings (if rowwise = FALSE) is initialized at the column- or row-wise means of the data (or, if factor/loadings pairs have previously been added, at the column- or row-wise means of the matrix of residuals) and then backfit via function flash_backfit.

Value

The flash object from argument flash, with an "intercept" added.

Examples

# The following are equivalent:
init <- list(matrix(rowMeans(gtex), ncol = 1),
             matrix(1, nrow = ncol(gtex)))
fl <- flash_init(gtex) |>
  flash_factors_init(init) |>
  flash_factors_fix(kset = 1, which_dim = "factors") |>
  flash_backfit(kset = 1)

fl <- flash_init(gtex) |>
  flash_add_intercept(rowwise = FALSE)

Backfit a flash object

Description

Backfits existing flash factor/loadings pairs. Whereas a "greedy" fit optimizes each newly added factor/loadings pair in one go without returning to optimize previously added pairs, a "backfit" updates all existing pairs in a cyclical fashion. See flash for examples of usage.

Usage

flash_backfit(
  flash,
  kset = NULL,
  extrapolate = TRUE,
  warmstart = TRUE,
  maxiter = 500,
  tol = NULL,
  verbose = NULL
)

Arguments

flash

A flash or flash_fit object.

kset

A vector of integers specifying which factors to backfit. If kset = NULL, then all existing factors will be backfitted.

extrapolate

Whether to use an extrapolation technique inspired by Ang and Gillis (2019) to accelerate the fitting process. Control parameters are handled via global options and can be set by calling options("extrapolate.control") <- control.param.

warmstart

Whether to use "warmstarts" when solving the EBNM subproblems by initializing solutions at the previous value of the fitted prior g^\hat{g}. An important side effect of warmstarts for ashr-like prior families is to fix the grid at its initial setting. Fixing the grid can lead to poor fits if there are large changes in the scale of the estimated prior over the course of the fitting process. However, allowing the grid to vary can occasionally result in decreases in ELBO.

maxiter

The maximum number of backfitting iterations. An "iteration" is defined such that all factors in kset get updated at each iteration.

tol

The convergence tolerance parameter. After each update, the fit is compared to the fit from before the update using a convergence criterion function (by default, the difference in ELBO, but the criterion can be changed via flash_set_conv_crit). The backfit is considered to have "converged" when the value of the convergence criterion function over successive updates to all factor/loadings pairs is less than or equal to tol. If, for example, factor/loadings pairs 1,,K1, \ldots, K are being sequentially backfitted, then fits are compared before and after the update to factor/loadings 1, before and after the update to factor/loadings 2, and so on through factor/loadings KK, and backfitting only terminates when the convergence criterion function returns a value less than or equal to tol for all KK updates. Note that specifying tol here will override any value set by flash_set_conv_crit; to use the "global" tolerance parameter, tol must be left unspecified (NULL). If tol = NULL and a global tolerance parameter has not been set, then the default tolerance used is npϵnp\sqrt{\epsilon}, where nn is the number of rows in the dataset, pp is the number of columns, and ϵ\epsilon is equal to .Machine$double.eps.

verbose

When and how to display progress updates. Set to 0 for none, 1 for updates after a factor is added or a backfit is completed, 2 for additional notifications about the variational lower bound, and 3 for updates after every iteration. It is also possible to output a single tab-delimited table of values using function flash_set_verbose with verbose = -1.

Value

The flash object from argument flash, backfitted as specified.


Set timeout

Description

Used in a flash pipeline to clear timeout conditions set using flash_set_timeout.

Usage

flash_clear_timeout(flash)

Arguments

flash

A flash or flash_fit object.

Value

The flash object from argument flash, with timeout settings cleared.


Calculate the difference in ELBO

Description

The default objective function used to determine convergence when fitting a flash object. Calculates the difference in the variational lower bound ("ELBO") from one iteration to the next.

Usage

flash_conv_crit_elbo_diff(curr, prev, k)

Arguments

curr

The flash_fit object from the current iteration.

prev

The flash_fit object from the previous iteration.

k

Only used during sequential backfits (that is, calls to flash_backfit where extrapolate = FALSE). It then takes the index of the factor/loadings pair currently being optimized.

Details

This function is an example of a function that may be passed to parameter fn in function flash_set_conv_crit to set the convergence criterion for a flash pipeline. See flash_set_conv_crit for details and examples.

Value

A scalar, which is compared against the tolerance parameter tol to determine whether a fit has converged.

See Also

flash_conv_crit_max_chg flash_conv_crit_max_chg_L, flash_conv_crit_max_chg_F


Calculate the maximum absolute difference in scaled loadings and factors

Description

An alternative objective function that can be used to determine convergence when fitting a flash object. Calculates the maximum (absolute) change over all (posterior expected values for) loadings ik\ell_{ik} and factors fjkf_{jk}. At each iteration, the loadings vectors 1,,K\ell_{\cdot 1}, \ldots, \ell_{\cdot K} and factors f1,,fKf_{\cdot 1}, \ldots, f_{\cdot K} are L2L^2-normalized.

Usage

flash_conv_crit_max_chg(curr, prev, k)

Arguments

curr

The flash_fit object from the current iteration.

prev

The flash_fit object from the previous iteration.

k

Only used during sequential backfits (that is, calls to flash_backfit where extrapolate = FALSE). It then takes the index of the factor/loadings pair currently being optimized.

Value

A scalar, which is compared against the tolerance parameter tol to determine whether a fit has converged.

See Also

flash_conv_crit_elbo_diff, flash_conv_crit_max_chg_L flash_conv_crit_max_chg_F


Calculate the maximum absolute difference in scaled factors

Description

An alternative objective function that can be used to determine convergence when fitting a flash object. Calculates the maximum (absolute) change over all (posterior expected values for) factors fjkf_{jk}. At each iteration, the factors f1,,fKf_{\cdot 1}, \ldots, f_{\cdot K} are L2L^2-normalized.

Usage

flash_conv_crit_max_chg_F(curr, prev, k)

Arguments

curr

The flash_fit object from the current iteration.

prev

The flash_fit object from the previous iteration.

k

Only used during sequential backfits (that is, calls to flash_backfit where extrapolate = FALSE). It then takes the index of the factor/loadings pair currently being optimized.

Value

A scalar, which is compared against the tolerance parameter tol to determine whether a fit has converged.

See Also

flash_conv_crit_elbo_diff, flash_conv_crit_max_chg flash_conv_crit_max_chg_L


Calculate the maximum absolute difference in scaled loadings

Description

An alternative objective function that can be used to determine convergence when fitting a flash object. Calculates the maximum (absolute) change over all (posterior expected values for) loadings ik\ell_{ik}. At each iteration, the loadings vectors 1,,K\ell_{\cdot 1}, \ldots, \ell_{\cdot K} are L2L^2-normalized.

Usage

flash_conv_crit_max_chg_L(curr, prev, k)

Arguments

curr

The flash_fit object from the current iteration.

prev

The flash_fit object from the previous iteration.

k

Only used during sequential backfits (that is, calls to flash_backfit where extrapolate = FALSE). It then takes the index of the factor/loadings pair currently being optimized.

Value

A scalar, which is compared against the tolerance parameter tol to determine whether a fit has converged.

See Also

flash_conv_crit_elbo_diff, flash_conv_crit_max_chg flash_conv_crit_max_chg_F


Construct an EBNM function

Description

flash_ebnm is a helper function that provides readable syntax for constructing ebnm functions that can serve as arguments to parameter ebnm_fn in functions flash, flash_greedy, and flash_factors_init (see Examples below). It is also possible to write a custom function from scratch: see Details below for a simple example. A more involved example can be found in the "Extending ebnm with custom ebnm-style functions" vignette in the ebnm package.

Usage

flash_ebnm(...)

Arguments

...

Parameters to be passed to function ebnm in package ebnm. An argument to prior_family should be provided unless the default family of point-normal priors is desired. Arguments to parameters x, s, or output must not be included. Finally, if g_init is included, then fix_g = TRUE must be as well. To fix a prior grid, use parameter scale rather than g_init.

Details

As input to parameter ebnm_fn in functions flash, flash_greedy, and flash_factors_init, it should suffice for many purposes to provide functions from package ebnm as is (for example, one might set ebnm_fn = ebnm_point_laplace). To use non-default arguments, function flash_ebnm may be used (see Examples). Custom functions may also be written; for details, see the "Extending ebnm with custom ebnm-style functions" vignette in the ebnm package.

Value

A function that can be passed as argument to parameter ebnm_fn in functions flash, flash_greedy, and flash_factors_init.

See Also

ebnm

Examples

# A custom EBNM function might be written as follows:
my_ebnm_fn <- function(x, s, g_init, fix_g, output) {
  ebnm_res <- ebnm_point_laplace(
    x = x,
    s = s,
    g_init = g_init,
    fix_g = fix_g,
    output = output,
    control = list(iterlim = 10)
  )
  return(ebnm_res)
}

# The following are equivalent:
fl1 <- flash(
  gtex,
  ebnm_fn = my_ebnm_fn,
  greedy_Kmax = 2
)
fl2 <- flash(
  gtex,
  ebnm_fn = flash_ebnm(
    prior_family = "point_laplace",
    control = list(iterlim = 10)
  ),
  greedy_Kmax = 2
)

Fix flash factors

Description

Fixes loadings k\ell_{\cdot k} or factors fkf_{\cdot k} for one or more factor/loadings pairs, so that their values are not updated during subsequent backfits. For a given pair, either the loadings or factor can be fixed, but not both, and either all entries or a subset can be fixed. To unfix, use function flash_factors_unfix. See flash_factors_init for an example of usage.

Usage

flash_factors_fix(
  flash,
  kset,
  which_dim = c("factors", "loadings"),
  fixed_idx = NULL,
  use_fixed_in_ebnm = NULL
)

Arguments

flash

A flash or flash_fit object.

kset

A vector of integers indexing the factor/loadings pairs whose loadings or factors are to be fixed.

which_dim

Whether to fix factors or loadings.

fixed_idx

If fixed_idx = NULL, then all loadings or factor values will be fixed. If only a subset are to be fixed, then fixed_idx should be an appropriately-sized vector or matrix of values that can be coerced to logical. For example, if a subset of loadings for two factor/loadings pairs are to be fixed, then fixed_idx should be a length-nn vector or an nn by 2 matrix (where nn is the number of rows in the data matrix).

use_fixed_in_ebnm

By default, fixed elements are ignored when solving the EBNM subproblem in order to estimate the prior g^\hat{g}. This behavior can be changed by setting use_fixed_in_ebnm = TRUE. This is a global setting which applies to all factor/loadings pairs; behavior cannot vary from one factor/loadings pair to another.

Value

The flash object from argument flash, with factors or loadings fixed as specified.


Initialize flash factors at specified values

Description

Initializes factor/loadings pairs at values specified by init. This function has two primary uses: 1. One can initialize multiple factor/loadings pairs at once using an SVD-like function and then optimize them via function flash_backfit. Sometimes this results in a better fit than adding them one at a time via flash_greedy. 2. One can initialize factor/loadings pairs and then fix the factor (or loadings) via function flash_factors_fix to incorporate "known" factors into a flash object. See below for examples of both use cases.

Usage

flash_factors_init(flash, init, ebnm_fn = ebnm_point_normal)

Arguments

flash

A flash or flash_fit object to which factors are to be added.

init

An SVD-like object (specifically, a list containing fields u, d, and v), a flash or flash_fit object, or a list of matrices specifying the values at which factors and loadings are to be initialized (for a data matrix of size n×pn \times p, this should be a list of length two, with the first element a matrix of size n×kn \times k and the second a matrix of size p×kp \times k). If a flash fit is supplied, then it will be used to initialize both the first and second moments of posteriors on loadings and factors. Otherwise, the supplied values will be used to initialize posterior means, with posterior second moments initialized as the squared values of the first moments. Missing entries are not allowed.

ebnm_fn

The function or functions used to solve the empirical Bayes normal means (EBNM) subproblems. Most importantly, these functions specify the families of distributions G(k)G_\ell^{(k)} and Gf(k)G_f^{(k)} to which the priors on loadings and factors g(k)g_\ell^{(k)} and gf(k)g_f^{(k)} are assumed to belong. If the same function is to be used for both loadings LL and factors FF, then ebnm_fn can be a single function. If one function is to be used for loadings and a second for factors, then ebnm_fn should be a list of length two, with the first element giving the function for loadings and the second the function for factors. If different functions are to be used for different values of kk, then factor/loadings pairs must be added successively using multiple calls to either flash_greedy or flash_factors_init.

Any EBNM function provided by package ebnm can be used as input. Non-default arguments to parameters can be supplied using the helper function flash_ebnm. Custom EBNM functions can also be used: for details, see flash_ebnm.

Value

The flash object from argument flash, with factors and loadings initialized as specified.

Examples

# Initialize several factors at once and backfit.
fl <- flash_init(gtex) |>
  flash_factors_init(init = svd(gtex, nu = 5, nv = 5)) |>
  flash_backfit()

# Add fixed loadings with \ell_i identically equal to one. This can be
#   interpreted as giving a "mean" factor that accounts for different
#   row-wise means.
ones <- matrix(1, nrow = nrow(gtex), ncol = 1)
# Initialize the factor at the least squares solution.
ls_soln <- t(solve(crossprod(ones), crossprod(ones, gtex)))
fl <- flash_init(gtex) |>
  flash_factors_init(init = list(ones, ls_soln)) |>
  flash_factors_fix(kset = 1, which_dim = "loadings") |>
  flash_backfit() |>
  flash_greedy(Kmax = 5L)

Remove factors from a flash object

Description

Sets factor/loadings pairs to zero and then removes them from the flash object. Note that this will change the indices of existing pairs.

Usage

flash_factors_remove(flash, kset)

Arguments

flash

A flash or flash_fit object.

kset

A vector of integers specifying which factor/loadings pairs to remove.

Value

The flash object from argument flash, with the factors specified by kset removed.

See Also

flash_factors_set_to_zero


Reorder factors in a flash object

Description

Reorders the factor/loadings pairs in a flash object.

Usage

flash_factors_reorder(flash, kset)

Arguments

flash

A flash or flash_fit object.

kset

A vector of integers specifying the new order of the factor/loadings pairs. All existing factors must be included in kset; to drop factors, use flash_factors_remove.

Value

The flash object from argument flash, with the factors reordered according to argument kset.


Set flash factors to zero

Description

Sets factor/loadings pairs to zero but does not remove them from the flash object (so as to keep the indices of existing pairs the same).

Usage

flash_factors_set_to_zero(flash, kset)

Arguments

flash

A flash or flash_fit object.

kset

A vector of integers specifying which factor/loadings pairs to set to zero.

Value

The flash object from argument flash, with the factors specified by kset set to zero.

See Also

flash_factors_remove


Unfix flash factors

Description

If loadings k\ell_{\cdot k} or factors fkf_{\cdot k} for one or more factor/loadings pairs have been "fixed" using function flash_factors_fix, then they can be unfixed using function flash_factors_unfix.

Usage

flash_factors_unfix(flash, kset)

Arguments

flash

A flash or flash_fit object.

kset

A vector of integers indexing the factor/loadings pairs whose values are to be unfixed.

Value

The flash object from argument flash, with values for the factor/loadings pairs specified by kset unfixed.


Extract a flash_fit object

Description

flash_fit objects are the "internal" objects used by flash functions to fit an EBMF model. Whereas flash objects (the end results of the fitting process) include user-friendly fields and methods, flash_fit objects were not designed for public consumption and can be unwieldy. Nonetheless, some advanced flash functionality requires the wielding of flash_fit objects. In particular, initialization, convergence, and "verbose" display functions all take one or more flash_fit objects as input (see parameter init_fn in function flash_greedy; parameter fn in flash_set_conv_crit; and parameter fns in flash_set_verbose). For users who would like to write custom functions, the accessor functions and methods enumerated below may prove useful. See flash_set_verbose for an example.

Usage

flash_fit(flash)

flash_fit_get_pm(f, n)

flash_fit_get_p2m(f, n)

flash_fit_get_est_tau(f)

flash_fit_get_fixed_tau(f)

flash_fit_get_tau(f)

flash_fit_get_elbo(f)

flash_fit_get_KL(f, n)

flash_fit_get_g(f, n)

Arguments

flash

A flash object.

f

A flash_fit object.

n

Set n = 1 to access loadings LL and n = 2 to access factors FF).

Details

The following S3 methods are available for flash_fit objects at all times except while optimizing new factor/loadings pairs as part of a "greedy" fit:

fitted.flash_fit

Returns the "fitted values" E(LF)=E(L)E(F)E(LF') = E(L) E(F)'.

residuals.flash_fit

Returns the expected residuals YE(LF)=YE(L)E(F)Y - E(LF') = Y - E(L) E(F)'.

ldf.flash_fit

Returns an LDFLDF decomposition, with columns of LL and FF scaled as specified by the user.

Value

See function descriptions below.

Functions

  • flash_fit_get_pm(): The posterior means for the loadings matrix LL (when parameter n is equal to 1) or factor matrix FF (when n = 2). While optimizing new factor/loadings pairs as part of a "greedy" fit, only the posterior means for the new loadings k\ell_{\cdot k} or factor fkf_{\cdot k} will be returned.

  • flash_fit_get_p2m(): The posterior second moments for the loadings matrix LL (when parameter n is equal to 1) or factor matrix FF (when n = 2). While optimizing new factor/loadings pairs, only the posterior second moments for the new loadings k\ell_{\cdot k} or factor fkf_{\cdot k} will be returned.

  • flash_fit_get_est_tau(): Equal to 1/σ21 / \sigma^2, where σ2\sigma^2 is the estimated portion of the residual variance (total, by row, or by column, depending on the variance type).

  • flash_fit_get_fixed_tau(): Equal to 1/s21 / s^2, where s2s^2 is the fixed portion of the residual variance (total, by row, or by column).

  • flash_fit_get_tau(): The overall precision 1/(σ2+s2)1 / (\sigma^2 + s^2).

  • flash_fit_get_elbo(): The variational lower bound (ELBO).

  • flash_fit_get_KL(): A vector containing the KL-divergence portions of the ELBO, with one value for each factor (when n = 2) or set of loadings (when n = 1). While optimizing new factor/loadings pairs, only the KL-divergence for the new factor or loadings will be returned.

  • flash_fit_get_g(): A list containing estimated priors on loadings g^\hat{g}_\ell (when n = 1) or factors g^f\hat{g}_f (when n = 2). While optimizing new factor/loadings pairs, only the estimated prior on the new factor or loadings will be returned.


Greedily add factors to a flash object

Description

Adds factor/loadings pairs to a flash object in a "greedy" manner. Up to Kmax pairs are added one at a time. At each step, flash_greedy attempts to find an optimal additional (rank-one) factor given all previously added factors. The additional factor is retained if it increases the variational lower bound (ELBO); otherwise, fitting terminates. See flash for examples of usage.

Usage

flash_greedy(
  flash,
  Kmax = 1,
  ebnm_fn = ebnm_point_normal,
  init_fn = NULL,
  extrapolate = FALSE,
  warmstart = FALSE,
  maxiter = 500,
  tol = NULL,
  verbose = NULL
)

Arguments

flash

A flash or flash_fit object to which factors are to be added.

Kmax

The maximum number of factors to be added. This will not necessarily be the total number of factors added by flash_greedy, since factors are only added as long as they increase the ELBO.

ebnm_fn

The function or functions used to solve the empirical Bayes normal means (EBNM) subproblems. Most importantly, these functions specify the families of distributions G(k)G_\ell^{(k)} and Gf(k)G_f^{(k)} to which the priors on loadings and factors g(k)g_\ell^{(k)} and gf(k)g_f^{(k)} are assumed to belong. If the same function is to be used for both loadings LL and factors FF, then ebnm_fn can be a single function. If one function is to be used for loadings and a second for factors, then ebnm_fn should be a list of length two, with the first element giving the function for loadings and the second the function for factors. If different functions are to be used for different values of kk, then factor/loadings pairs must be added successively using multiple calls to either flash_greedy or flash_factors_init.

Any EBNM function provided by package ebnm can be used as input. Non-default arguments to parameters can be supplied using the helper function flash_ebnm. Custom EBNM functions can also be used: for details, see flash_ebnm.

init_fn

The function used to initialize factor/loadings pairs. Functions flash_greedy_init_default, flash_greedy_init_softImpute, and flash_greedy_init_irlba have been supplied; note, in particular, that flash_greedy_init_softImpute can yield better results than the default initialization function when there is missing data. Custom initialization functions may also be used. If init_fn = NULL then flash_greedy_init_default will be used, with an attempt made to set argument sign_constraints appropriately via test calls to the EBNM function(s) specified by parameter ebnm_fn. If factors or loadings are constrained in some other fashion (e.g., bounded support), then the initialization function should be modified to account for the constraints — otherwise, the greedy algorithm can stop adding factor/loadings pairs too early. Custom initialization functions should accept a single parameter referring to a flash_fit object and should output a list consisting of two vectors, which will be used as initial values for the new loadings k\ell_{\cdot k} and the new factor fkf_{\cdot k}. Typically, a custom initialization function will extract the matrix of residuals from the flash_fit object using method residuals.flash_fit and then return a (possibly constrained) rank-one approximation to the matrix of residuals. See Examples below.

extrapolate

Whether to use an extrapolation technique inspired by Ang and Gillis (2019) to accelerate the fitting process. Control parameters are handled via global options and can be set by calling options("extrapolate.control") <- control.param.

warmstart

Whether to use "warmstarts" when solving the EBNM subproblems by initializing solutions at the previous value of the fitted prior g^\hat{g}. An important side effect of warmstarts for ashr-like prior families is to fix the grid at its initial setting. Fixing the grid can lead to poor fits if there are large changes in the scale of the estimated prior over the course of the fitting process. However, allowing the grid to vary can occasionally result in decreases in ELBO.

maxiter

The maximum number of iterations when optimizing a greedily added factor/loadings pair.

tol

The convergence tolerance parameter. At each iteration, the fit is compared to the fit from the previous iteration using a convergence criterion function (by default, the difference in ELBO, but the criterion can be changed via flash_set_conv_crit). When the value returned by this function is less than or equal to tol, the newly added factor/loadings pair is considered to have "converged," so that flash_greedy moves on and attempts to add another new factor (or, if the maximum number of factors Kmax has been reached, the process terminates). Note that specifying tol here will override any value set by flash_set_conv_crit; to use the "global" tolerance parameter, tol must be left unspecified (NULL). If tol = NULL and a global tolerance parameter has not been set, then the default tolerance used is npϵnp\sqrt{\epsilon}, where nn is the number of rows in the dataset, pp is the number of columns, and ϵ\epsilon is equal to .Machine$double.eps.

verbose

When and how to display progress updates. Set to 0 for none, 1 for updates after a factor is added or a backfit is completed, 2 for additional notifications about the variational lower bound, and 3 for updates after every iteration. It is also possible to output a single tab-delimited table of values using function flash_set_verbose with verbose = -1.

Value

The flash object from argument flash, with up to Kmax new factor/loadings pairs "greedily" added.

See Also

flash_greedy_init_default, flash_greedy_init_softImpute, flash_greedy_init_irlba

Examples

# The following are examples of advanced usage. See ?flash for basic usage.

# Increase the maximum number of iterations in the default initialization
#   method.
my_init_fn <- function(f) flash_greedy_init_default(f, maxiter = 500)
fl <- flash_init(gtex) |>
  flash_greedy(init_fn = my_init_fn)

# Use a custom initialization function that wraps function nmf from
#   package RcppML.
nmf_init_fn <- function(f) {
  nmf_res <- RcppML::nmf(resid(f), k = 1, verbose = FALSE)
  return(list(as.vector(nmf_res$w), as.vector(nmf_res$h)))
}
fl.nmf <- flash_init(gtex) |>
  flash_greedy(ebnm_fn = ebnm_unimodal_nonnegative,
               init_fn = nmf_init_fn)

Initialize a flash factor

Description

The default method for initializing the loadings k\ell_{\cdot k} and factor values fkf_{\cdot k} of a new ("greedy") flash factor. It is essentially an implementation of the power method, but unlike many existing implementations, it can handle missing data and sign constraints. For details, see Chapter 2.2.3 in the reference below.

Usage

flash_greedy_init_default(
  flash,
  sign_constraints = NULL,
  tol = NULL,
  maxiter = 100,
  seed = 666
)

Arguments

flash

A flash_fit object.

sign_constraints

This parameter can be used to constrain the sign of the initial factor and loadings. It should be a vector of length two with entries equal to -1, 0, or 1. The first entry constrains the sign of the loadings k\ell_{\cdot k}, with -1 yielding nonpositive loadings, +1 yielding nonnegative loadings, and 0 indicating that loadings should not be constrained. The second entry of sign_constraints similarly constrains the sign of factor values fkf_{\cdot k}. If sign_constraints = NULL, then no constraints will be applied.

tol

Convergence tolerance parameter. When the maximum (absolute) change over all values ik\ell_{ik} and fjkf_{jk} is less than or equal to tol, initialization terminates. At each iteration, the factor and loadings are L2L^2-normalized. The default tolerance parameter is min(1/n,1/p)\min(1 / n, 1 / p), where nn is the number of rows in the data matrix and pp is the number of columns.

maxiter

Maximum number of power iterations.

seed

Since initialization is random, a default seed is set for reproducibility.

Value

A list of length two consisting of, respectively, the vector of initial values for loadings k\ell_{\cdot k} and the vector of initial factor values fkf_{\cdot k}.

References

Jason Willwerscheid (2021), Empirical Bayes Matrix Factorization: Methods and Applications. Ph.D. thesis, University of Chicago.

See Also

flash_greedy, flash_greedy_init_softImpute, flash_greedy_init_irlba


Initialize a flash factor using IRLBA

Description

Initializes a new ("greedy") flash factor using irlba. This can be somewhat faster than flash_greedy_init_default for large, dense data matrices. For sparse matrices of class Matrix, the default initialization should generally be preferred.

Usage

flash_greedy_init_irlba(flash, seed = 666, ...)

Arguments

flash

A flash_fit object.

seed

Since initialization is random, a default seed is set for reproducibility.

...

Additional parameters to be passed to irlba.

Value

A list of length two consisting of, respectively, the vector of initial values for loadings k\ell_{\cdot k} and the vector of initial factor values fkf_{\cdot k}.

See Also

flash_greedy, flash_greedy_init_default, flash_greedy_init_softImpute


Initialize a flash factor using softImpute

Description

Initializes a new ("greedy") flash factor using softImpute. When there is missing data, this can yield better results than flash_greedy_init_default without sacrificing much (if any) speed.

Usage

flash_greedy_init_softImpute(flash, seed = 666, ...)

Arguments

flash

A flash_fit object.

seed

Since initialization is random, a default seed is set for reproducibility.

...

Additional parameters to be passed to softImpute.

Value

A list of length two consisting of, respectively, the vector of initial values for loadings k\ell_{\cdot k} and the vector of initial factor values fkf_{\cdot k}.

See Also

flash_greedy, flash_greedy_init_default, flash_greedy_init_irlba


Initialize flash object

Description

Sets up a flash object with no factors. Since all other flash_xxx functions take a flash or flash_fit object as their first argument, calling flash_init should be the first step in any flash pipeline. See flash for examples of usage.

Usage

flash_init(data, S = NULL, var_type = 0L, S_dim = NULL)

Arguments

data

The observations. Usually a matrix, but can also be a sparse matrix of class Matrix or a low-rank matrix representation as returned by, for example, svd, irlba, rsvd, or softImpute (in general, any list that includes fields u, d, and v will be interpreted as a low-rank matrix representation).

S

The standard errors. Can be NULL (in which case all residual variance will be estimated) or a matrix, vector, or scalar. S should be a scalar if standard errors are identical across observations. It should be a vector if standard errors either vary across columns but are constant within any given row, or vary across rows but are constant within any given column (flash will use the length of the vector to determine whether the supplied values correspond to rows or columns; if the data matrix is square, then the sense must be specified using parameter S_dim in function flash_init).

var_type

Describes the structure of the estimated residual variance. Can be NULL, 0, 1, 2, or c(1, 2). If NULL, then S accounts for all residual variance. If var_type = 0, then the estimated residual variance (which is added to any variance given by S) is assumed to be constant across all observations. Setting var_type = 1 estimates a single variance parameter for each row; var_type = 2 estimates one parameter for each column; and var_type = c(1, 2) optimizes over all rank-one matrices (that is, it assumes that the residual variance parameter sijs_{ij} can be written sij=aibjs_{ij} = a_i b_j, where the nn-vector aa and the pp-vector bb are to be estimated).

Note that if any portion of the residual variance is to be estimated, then it is usually faster to set S = NULL and to let flash estimate all of the residual variance. Further, var_type = c(1, 2) is typically much slower than other options, so it should be used with care.

S_dim

If the argument to S is a vector and the data matrix is square, then S_dim must specify whether S encodes row-wise or column-wise standard errors. More precisely, if S_dim = 1, then S will be interpreted as giving standard errors that vary across rows but are constant within any particular row; if S_dim = 2, then it will be interpreted as giving standard errors that vary across columns but are constant within any particular column. If S is a matrix or scalar, or if the data matrix is not square, then S_dim should be left unspecified (NULL).

Value

An initialized flash object (with no factors).


Nullcheck flash factors

Description

Sets factor/loadings pairs to zero if doing so improves the variational lower bound (ELBO). See flash for examples of usage.

Usage

flash_nullcheck(flash, kset = NULL, remove = TRUE, tol = NULL, verbose = NULL)

Arguments

flash

A flash or flash_fit object.

kset

A vector of integers specifying which factors to nullcheck. If kset = NULL, then all existing factors will be checked.

remove

Whether to remove factors that have been set to zero from the flash object. Note that this might change the indices of existing factors.

tol

The "tolerance" parameter: if a factor does not improve the ELBO by at least tol, then it will be set to zero. Note that flash_nullcheck does not respect "global" tolerance parameters set by flash_set_conv_crit (which only affects the convergence tolerance for greedy fits and backfits). The default tolerance is npϵnp\sqrt{\epsilon}, where nn is the number of rows in the dataset, pp is the number of columns, and ϵ\epsilon is equal to .Machine$double.eps.

verbose

When and how to display progress updates. For nullchecks, updates are only displayed when verbose > 0.

Value

The flash object from argument flash, with factors that do not improve the ELBO by at least tol either set to zero or removed (depending on the argument to parameter remove).

See Also

flash_factors_remove, flash_factors_set_to_zero


Create bar plots of factors or loadings for a flash fit

Description

Creates a bar plot or sequence of bar plots, one for each value of kk in kset, with bars corresponding to individual posterior means for factors fjkf_{jk} or loadings ik\ell_{ik}. Values are normalized so that the maximum absolute value for each factor fkf_{\cdot k} or set of loadings k\ell_{\cdot k} is equal to 1 (see ldf.flash). This type of plot is most useful when rows i=1,,ni = 1, \ldots, n or columns j=1,,pj = 1, \ldots, p are small in number or ordered in a logical fashion (e.g., spatially).

Usage

flash_plot_bar(
  fl,
  order_by_pve = FALSE,
  kset = NULL,
  pm_which = c("factors", "loadings"),
  pm_subset = NULL,
  pm_groups = NULL,
  pm_colors = NULL,
  labels = FALSE,
  ...
)

Arguments

fl

An object inheriting from class flash.

order_by_pve

If order_by_pve = TRUE, then factor/loadings pairs will be ordered according to proportion of variance explained, from highest to lowest. (By default, they are plotted in the same order as kset; or, if kset is NULL, then they are plotted in the same order as they are found in fl.)

kset

A vector of integers specifying the factor/loadings pairs to be plotted. If order_by_pve = FALSE, then kset also specifies the order in which they are to be plotted.

pm_which

Whether to plot loadings LL or factors FF.

pm_subset

A vector of row indices ii or column indices jj (depending on the argument to pm_which) specifying which values i\ell_{i \cdot} or fjf_{j \cdot} are to be shown. If the dataset has row or column names, then names rather than indices may be specified. If pm_subset = NULL, then all values will be plotted.

pm_groups

A vector specifying the group to which each row of the data yiy_{i \cdot} or column yjy_{\cdot j} belongs (groups may be numeric indices or strings). A group must be provided for each plotted row ii or column jj, so that the length of pm_groups is exactly equal to the number of rows or columns in the full dataset or, if pm_subset is specified, in the subsetted dataset.

pm_colors

A character vector specifying a color for each unique group specified by pm_groups, or, if pm_groups = NULL, a vector specifying a color for each plotted row ii or column jj. Defines the color (fill) of the bars.

labels

Whether to label the bars along the xx-axis. The appearance of the labels (size, angle, etc.) can be adjusted using ggplot2's theme system; see below for an example.

...

Additional arguments to be passed to facet_wrap (e.g., nrow or ncol).

Details

When there is more than one value of kk in kset, a sequence of panels is created using the facet_wrap function from the ggplot2 package. In each panel, the order of bars is determined by the order of the corresponding rows or columns in the data matrix; they can be re-arranged using the pm_subset argument.

Value

A ggplot object.

Examples

data(gtex)
fl <- flash(gtex, greedy_Kmax = 4L, backfit = FALSE)
flash_plot_bar(fl, pm_colors = gtex_colors)

# Tweaks are often required to get x-axis labels to look good:
library(ggplot2)
flash_plot_bar(fl, pm_colors = gtex_colors, labels = TRUE, ncol = 1) +
  theme(axis.text.x = element_text(size = 8, angle = 60))

Create heatmap of factors or loadings for a flash fit

Description

Creates a heatmap of posterior means for factors fjkf_{jk} or loadings ik\ell_{ik}. Values are normalized so that the maximum absolute value for each factor fkf_{\cdot k} or set of loadings k\ell_{\cdot k} is equal to 1 (see ldf.flash).

Usage

flash_plot_heatmap(
  fl,
  order_by_pve = FALSE,
  kset = NULL,
  pm_which = c("factors", "loadings"),
  pm_subset = NULL,
  pm_groups = NULL,
  pm_colors = NULL,
  gap = 1,
  ...
)

Arguments

fl

An object inheriting from class flash.

order_by_pve

If order_by_pve = TRUE, then factor/loadings pairs will be ordered according to proportion of variance explained, from highest to lowest. (By default, they are plotted in the same order as kset; or, if kset is NULL, then they are plotted in the same order as they are found in fl.)

kset

A vector of integers specifying the factor/loadings pairs to be plotted. If order_by_pve = FALSE, then kset also specifies the order in which they are to be plotted.

pm_which

Whether to plot loadings LL or factors FF.

pm_subset

A vector of row indices ii or column indices jj (depending on the argument to pm_which) specifying which values i\ell_{i \cdot} or fjf_{j \cdot} are to be shown. If the dataset has row or column names, then names rather than indices may be specified. If pm_subset = NULL, then all values will be plotted.

pm_groups

A vector specifying the group to which each row of the data yiy_{i \cdot} or column yjy_{\cdot j} belongs (groups may be numeric indices or strings). A group must be provided for each plotted row ii or column jj, so that the length of pm_groups is exactly equal to the number of rows or columns in the full dataset or, if pm_subset is specified, in the subsetted dataset.

pm_colors

A character vector of length 1, 2, or 3 defining the diverging color gradient (low-mid-high) to be used by the heatmap. The midpoint is set at zero. If one or two colors are supplied, then the "mid" color will be set to white. If one color is supplied, then the "low" and "high" colors (used for, respectively, negative and positive posterior means) will be the same. If two are supplied, then the "low" color should be provided first, followed by the "high" color. If all three are supplied, then the "low" color should be provided first, followed by the "mid" color, followed by the "high" color provided. The default color gradient is darkblue for "low" (negative posterior means), white for "mid" (zero), and darkred for "high" (positive posterior means).

gap

The horizontal spacing between groups. Ignored if pm_groups is not provided.

...

Additional parameters to be passed to structure_plot (which is primarily used to arrange the rows ii or columns jj).

Details

By default, a 1-d embedding is used to arrange the rows ii or columns jj in a "smart" manner. This behavior can be overridden via argument loadings_order, which is passed to function structure_plot.

Value

A ggplot object.


Create histograms of factors or loadings for a flash fit

Description

Creates a histogram or sequence of histograms of posterior means for factors fjkf_{jk} or loadings ik\ell_{ik}. One plot is created for each value of kk in kset. Values are normalized so that the maximum absolute value for each factor fkf_{\cdot k} or set of loadings k\ell_{\cdot k} is equal to 1 (see ldf.flash). If pm_groups is specified, then overlapping semi-transparent histograms are created, with one histogram per group specified by pm_groups. This option works best when the number of groups is small or when groups are well separated across components.

Usage

flash_plot_histogram(
  fl,
  order_by_pve = FALSE,
  kset = NULL,
  pm_which = c("factors", "loadings"),
  pm_subset = NULL,
  pm_groups = NULL,
  pm_colors = NULL,
  binwidth = NULL,
  bins = NULL,
  alpha = 0.5,
  ...
)

Arguments

fl

An object inheriting from class flash.

order_by_pve

If order_by_pve = TRUE, then factor/loadings pairs will be ordered according to proportion of variance explained, from highest to lowest. (By default, they are plotted in the same order as kset; or, if kset is NULL, then they are plotted in the same order as they are found in fl.)

kset

A vector of integers specifying the factor/loadings pairs to be plotted. If order_by_pve = FALSE, then kset also specifies the order in which they are to be plotted.

pm_which

Whether to plot loadings LL or factors FF.

pm_subset

A vector of row indices ii or column indices jj (depending on the argument to pm_which) specifying which values i\ell_{i \cdot} or fjf_{j \cdot} are to be shown. If the dataset has row or column names, then names rather than indices may be specified. If pm_subset = NULL, then all values will be plotted.

pm_groups

A vector specifying the group to which each row of the data yiy_{i \cdot} or column yjy_{\cdot j} belongs (groups may be numeric indices or strings). A group must be provided for each plotted row ii or column jj, so that the length of pm_groups is exactly equal to the number of rows or columns in the full dataset or, if pm_subset is specified, in the subsetted dataset.

pm_colors

A character vector specifying a color for each unique group specified by pm_groups. Defines the color and fill of the histograms.

binwidth

The width of the bins (a numeric value). The default is to use the number of bins in bins, covering the range of the data.

bins

Number of bins. Overriden by binwidth. Defaults to 30.

alpha

A transparency value between 0 (transparent) and 1 (opaque).

...

Additional arguments to be passed to facet_wrap (e.g., nrow or ncol).

Value

A ggplot object.


Create scatter plots of factors or loadings for a flash fit

Description

Creates a scatter plot or sequence of scatter plots, with position along the xx-axis defined by posterior means for factors fjkf_{jk} or loadings ik\ell_{ik} and position along the yy-axis defined by a user-supplied covariate. If a covariate is not supplied, then plots will use data column or row means, 1ni=1nyij\frac{1}{n} \sum_{i = 1}^n y_{ij} or 1pj=1pyij\frac{1}{p} \sum_{j = 1}^p y_{ij}. One plot is created for each value of kk in kset. Values are normalized so that the maximum absolute value for each factor fkf_{\cdot k} or set of loadings k\ell_{\cdot k} is equal to 1 (see ldf.flash).

Usage

flash_plot_scatter(
  fl,
  order_by_pve = FALSE,
  kset = NULL,
  pm_which = c("factors", "loadings"),
  pm_subset = NULL,
  pm_groups = NULL,
  pm_colors = NULL,
  covariate = NULL,
  shape = 1,
  labels = FALSE,
  n_labels = 0,
  label_size = 3,
  max_overlaps = Inf,
  ...
)

Arguments

fl

An object inheriting from class flash.

order_by_pve

If order_by_pve = TRUE, then factor/loadings pairs will be ordered according to proportion of variance explained, from highest to lowest. (By default, they are plotted in the same order as kset; or, if kset is NULL, then they are plotted in the same order as they are found in fl.)

kset

A vector of integers specifying the factor/loadings pairs to be plotted. If order_by_pve = FALSE, then kset also specifies the order in which they are to be plotted.

pm_which

Whether to plot loadings LL or factors FF.

pm_subset

A vector of row indices ii or column indices jj (depending on the argument to pm_which) specifying which values i\ell_{i \cdot} or fjf_{j \cdot} are to be shown. If the dataset has row or column names, then names rather than indices may be specified. If pm_subset = NULL, then all values will be plotted.

pm_groups

A vector specifying the group to which each row of the data yiy_{i \cdot} or column yjy_{\cdot j} belongs (groups may be numeric indices or strings). A group must be provided for each plotted row ii or column jj, so that the length of pm_groups is exactly equal to the number of rows or columns in the full dataset or, if pm_subset is specified, in the subsetted dataset.

pm_colors

A character vector specifying a color for each unique group specified by pm_groups, or, if pm_groups = NULL, a vector specifying a color for each plotted row ii or column jj. Defines the colors of the points.

covariate

A numeric vector with one value for each plotted row ii or column jj. These values are mapped onto the plots' yy-axis.

shape

The symbol used for the plots' points. See aes_linetype_size_shape.

labels

Whether to label the points with the largest (absolute) posterior means. If labels = TRUE, then n_labels points will be labelled using geom_text_repel.

n_labels

A (nonnegative) integer. The number of points to label. If n_labels is set to a positive integer but labels = FALSE, then the n_labels points with the largest (absolute) posterior means will be highlighted in blue but not labelled. This can be useful for tweaking labels using the full range of options provided by geom_text_repel. For an example, see below.

label_size

The size of the label text (in millimeters).

max_overlaps

A (nonnegative) integer. For each text label, the number of overlaps with other text labels or other data points are counted, and the text label is excluded if it has too many overlaps.

...

Additional arguments to be passed to facet_wrap (e.g., nrow or ncol).

Value

A ggplot object.

Examples

data(gtex)
fl <- flash(gtex, greedy_Kmax = 4L, backfit = FALSE)
flash_plot_scatter(fl)

# Label axes and points:
library(ggplot2)
flash_plot_scatter(fl, labels = TRUE, n_labels = 3) +
  labs(y = "mean z-score across all SNPs")

# For the full range of labelling options provided by the ggrepel package, set
#   labels = FALSE (the default setting) and add geom_text_repel() manually:
library(ggrepel)
flash_plot_scatter(fl, labels = FALSE, n_labels = 3) +
  geom_text_repel(size = 2.5, min.segment.length = 0)

Create a scree plot for a flash fit

Description

A scree plot is a line plot showing the proportion of variance explained by each factor/loadings pair in a flash fit. Note that since EBMF does not require factors and loadings to be orthogonal, "PVE" should be interpreted loosely: for example, the total proportion of variance explained could be larger than 1.

Usage

flash_plot_scree(
  fl,
  order_by_pve = FALSE,
  kset = NULL,
  labels = FALSE,
  label_size = 3,
  max_overlaps = Inf
)

Arguments

fl

An object inheriting from class flash.

order_by_pve

If order_by_pve = TRUE, then factor/loadings pairs will be ordered according to proportion of variance explained, from highest to lowest. (By default, they are plotted in the same order as kset; or, if kset is NULL, then they are plotted in the same order as they are found in fl.)

kset

A vector of integers specifying the factor/loadings pairs to be plotted. If order_by_pve = FALSE, then kset also specifies the order in which they are to be plotted.

labels

Whether to label the points in the scree plot with the indices of the factor/loading pairs they correspond to. Labels appear as "k1", "k2", "k3", etc.

label_size

The size of the label text (in millimeters).

max_overlaps

A (nonnegative) integer. For each text label, the number of overlaps with other text labels or other data points are counted, and the text label is excluded if it has too many overlaps.

Details

Unlike scree plots for PCA, a scree plot for a flash fit is not in general monotonically decreasing. To ensure a monotonically decreasing scree plot, set order_by_pve = TRUE. Note, however, that if this is done then the numbers on the xx-axis will no longer match the indices of the components in the flash fit. This can also be true if argument kset has been specified. Thus one should consider setting labels = TRUE when order_by_pve = TRUE or when kset has been specified.

Value

A ggplot object.

Examples

data(gtex)
fl <- flash(gtex, greedy_Kmax = 4L, backfit = FALSE)
flash_plot_scree(fl)

# For the full range of labelling options provided by the ggrepel package, set
#   labels = FALSE (the default setting) and add geom_text_repel() manually:
library(ggrepel)
flash_plot_scree(fl) + geom_text_repel(min.segment.length = 0)

Create structure plot of factors or loadings for a flash fit

Description

Creates a “structure plot” (stacked bar plot) of posterior means for factors fjkf_{jk} or loadings ik\ell_{ik}. Different “topics” or components (that is, the different factor/loadings pairs, as specified by kset) are represented by different colors. Values are normalized so that the maximum absolute value for each factor fkf_{\cdot k} or set of loadings k\ell_{\cdot k} is equal to 1 and then stacked (see ldf.flash). Note that structure plots were designed for nonnegative loadings or “memberships”; if posterior means are not nonnegative then a different type of plot should be used (e.g., flash_plot_heatmap). By default, a 1-d embedding is used to arrange the rows ii or columns jj. This step is usually essential to creating a readable structure plot; for details, see structure_plot.

Usage

flash_plot_structure(
  fl,
  order_by_pve = FALSE,
  kset = NULL,
  pm_which = c("factors", "loadings"),
  pm_subset = NULL,
  pm_groups = NULL,
  pm_colors = NULL,
  gap = 1,
  ...
)

Arguments

fl

An object inheriting from class flash.

order_by_pve

If order_by_pve = TRUE, then factor/loadings pairs will be ordered according to proportion of variance explained, from highest to lowest. (By default, they are plotted in the same order as kset; or, if kset is NULL, then they are plotted in the same order as they are found in fl.)

kset

A vector of integers specifying the factor/loadings pairs to be plotted. If order_by_pve = FALSE, then kset also specifies the order in which they are to be plotted.

pm_which

Whether to plot loadings LL or factors FF.

pm_subset

A vector of row indices ii or column indices jj (depending on the argument to pm_which) specifying which values i\ell_{i \cdot} or fjf_{j \cdot} are to be shown. If the dataset has row or column names, then names rather than indices may be specified. If pm_subset = NULL, then all values will be plotted.

pm_groups

A vector specifying the group to which each row of the data yiy_{i \cdot} or column yjy_{\cdot j} belongs (groups may be numeric indices or strings). A group must be provided for each plotted row ii or column jj, so that the length of pm_groups is exactly equal to the number of rows or columns in the full dataset or, if pm_subset is specified, in the subsetted dataset.

pm_colors

The colors of the “topics” or components (factor/loadings pairs).

gap

The horizontal spacing between groups. Ignored if pm_groups is not provided.

...

Additional parameters to be passed to structure_plot.

Value

A ggplot object.


Set convergence criterion and tolerance parameter

Description

Used in a flash pipeline to set the criterion for determining whether a greedy fit or backfit has "converged."

Usage

flash_set_conv_crit(flash, fn = NULL, tol)

Arguments

flash

A flash or flash_fit object.

fn

The convergence criterion function (see Details below). If NULL, then only the tolerance parameter is updated (thus a convergence criterion can be set at the beginning of a flash pipeline, allowing the tolerance parameter to be updated at will without needing to re-specify the convergence criterion each time). The default convergence criterion, which is set when the flash object is initialized, is flash_conv_crit_elbo_diff, which calculates the difference in the variational lower bound or "ELBO" from one iteration to the next.

tol

The tolerance parameter (see Details below). The default, which is set when the flash object is initialized (see flash_init), is npϵnp\sqrt{\epsilon}, where nn is the number of rows in the dataset, pp is the number of columns, and ϵ\epsilon is equal to .Machine$double.eps.

Details

Function flash_set_conv_crit can be used to customize the convergence criterion for a flash object. This criterion determines when to stop optimizing a newly added factor (see flash_greedy) and when to stop backfitting (flash_backfit). Note that, because most alternative convergence criteria do not make sense in the context of a nullcheck, it does not set the "convergence" criterion for flash_nullcheck (for example, flash_conv_crit_max_chg_L would simply return the maximum L2L^2-normalized loading for each set of loadings k\ell_{\cdot k}).

The criterion is defined by the function supplied as argument to fn, which must accept exactly three parameters, curr, prev, and k. curr refers to the flash_fit object from the current iteration; prev, to the flash_fit object from the previous iteration; and, if the iteration is a sequential backfitting iteration (that is, a flash_backfit iteration with argument extrapolate = FALSE), k identifies the factor/loadings pair that is currently being updated (in all other cases, k is NULL). The function must output a numeric value; if the value is less than or equal to tol, then the fit is considered to have "converged." The meaning of "convergence" here varies according to the operation being performed. In the greedy algorithm, fn simply compares the fit from one iteration to the next. During a backfit, it similarly compares fits from one iteration to the next, but it only considers the fit to have converged when the value of fn over successive updates to all factor/loadings pairs is less than or equal to tol. If, for example, factor/loadings pairs 1,,K1, \ldots, K are being sequentially backfitted, then fits are compared before and after the update to factor/loadings 1, before and after the update to factor/loadings 2, and so on through factor/loadings KK, and backfitting only terminates when fn returns a value less than or equal to tol for all KK updates.

Package flashier provides a number of functions that may be supplied as convergence criteria: see flash_conv_crit_elbo_diff (the default criterion), flash_conv_crit_max_chg, flash_conv_crit_max_chg_L, and flash_conv_crit_max_chg_F. Custom functions may also be defined. Typically, they will compare the fit in curr (the current iteration) to the fit in prev (the previous iteration). To facilitate working with flash_fit objects, package flashier provides a number of accessors, which are enumerated in the documentation for object flash_fit. Custom functions should return a numeric value that can be compared against tol; see Examples below.

Value

The flash object from argument flash, with the new convergence criterion reflected in updates to the "internal" flash_fit object. These settings will persist across all subsequent calls to flash_xxx functions in the same flash pipeline (unless, of course, flash_set_conv_crit is again called within the same pipeline).

Examples

fl <- flash_init(gtex) |>
  flash_set_conv_crit(flash_conv_crit_max_chg, tol = 1e-3) |>
  flash_set_verbose(
    verbose = 3,
    fns = flash_verbose_max_chg,
    colnames = "Max Chg",
    colwidths = 20
  ) |>
  flash_greedy(Kmax = 3)

Set timeout

Description

Used in a flash pipeline to set a maximum amount of fitting time. Note that timeout conditions are only checked during greedy fits and backfits, so that the total amount of fitting time can exceed the time set by flash_set_timeout (especially if, for example, there is a nullcheck involving many factor/loading pairs). Also note that timeout conditions must be cleared using function flash_clear_timeout before any re-fitting is attempted.

Usage

flash_set_timeout(
  flash,
  tim,
  units = c("hours", "secs", "mins", "days", "weeks")
)

Arguments

flash

A flash or flash_fit object.

tim

A numeric value giving the maximum amount of fitting time, with the units of time specified by parameter units.

units

The units of time according to which parameter tim is to be interpreted.

Value

The flash object from argument flash, with the timeout settings reflected in updates to the "internal" flash_fit object. These settings will persist across all subsequent calls to flash_xxx functions until they are modified either by flash_clear_timeout or by another call to flash_set_timeout.

Examples

fl <- flash_init(gtex) |>
  flash_set_timeout(1, "secs") |>
  flash_greedy(Kmax = 30) |>
  flash_backfit() |>
  flash_nullcheck() |>
  flash_clear_timeout() # Always clear timeout at the end of a pipeline.

Set verbose output

Description

Used in a flash pipeline to set the output that will be printed after each greedy or backfitting iteration.

Usage

flash_set_verbose(
  flash,
  verbose = 1L,
  fns = NULL,
  colnames = NULL,
  colwidths = NULL
)

Arguments

flash

A flash or flash_fit object.

verbose

When and how to display progress updates. Set to 0 for no updates; 1 for updates after a "greedy" factor is added or a backfit is completed; 2 for additional notifications about the variational lower bound (ELBO); and 3 for updates after every iteration. By default, per-iteration update information includes the change in ELBO and the maximum (absolute) change over all L2-normalized loadings 1,,K\ell_1, \ldots, \ell_K and factors f1,,fKf_1, \ldots, f_K. Update information is customizable via parameters fns, colnames, and colwidths.

A single tab-delimited table of values may also be output using option verbose = -1. This format is especially convenient for downstream analysis of the fitting history. For example, it may be used to plot the value of the ELBO after each iteration (see the "Advanced Flashier" vignette for an illustration).

fns

A vector of functions. Used to calculate values to display after each greedy/backfit iteration when verbose is either -1 or 3 (see Details below). Ignored for other values of verbose (0, 1, or 2).

colnames

A vector of column names, one for each function in fns.

colwidths

A vector of column widths, one for each function in fns.

Details

Function flash_set_verbose can be used to customize the output that is printed to console while fitting a flash object. After each greedy or backfitting iteration (see, respectively, flash_greedy and flash_backfit), each function in argument fns is successively evaluated and the result is printed to console in a table with column names defined by argument colnames and column widths defined by argument colwidths.

Each function in fns must accept exactly three parameters, curr, prev, and k: curr refers to the flash_fit object from the current iteration; prev, to the flash_fit object from the previous iteration; and, if the iteration is a sequential backfitting iteration (that is, a flash_backfit iteration with argument extrapolate = FALSE), k identifies the factor/loadings pair that is currently being updated (in all other cases, k is NULL). Package flashier provides a number of functions that may be used to customize output: see flash_verbose_elbo, flash_verbose_elbo_diff, flash_verbose_max_chg, flash_verbose_max_chg_L, and flash_verbose_max_chg_F. Custom functions may also be defined. They might inspect the current flash_fit object via argument curr; compare the fit in curr to the fit from the previous iteration (provided by argument prev); or ignore both flash_fit objects entirely (for example, to track progress over time, one might simply call Sys.time). To facilitate working with flash_fit objects, package flashier provides a number of accessors, which are enumerated in the documentation for object flash_fit. Custom functions should return a character string that contains the output exactly as it is to displayed; see Examples below.

Value

The flash object from argument flash, with the new verbose settings reflected in updates to the "internal" flash_fit object. These settings will persist across all subsequent calls to flash_xxx functions until they are modified by another call to flash_set_verbose.

Examples

# Suppress all verbose output.
fl <- flash_init(gtex) |>
  flash_set_verbose(0) |>
  flash_greedy(Kmax = 5)

# Set custom verbose output.
sparsity_F <- function(curr, prev, k) {
  g_F <- flash_fit_get_g(curr, n = 2)
  g_F_pi0 <- g_F$pi[1] # Mixture weight of the "null" component.
  return(g_F_pi0)
}
verbose_fns <- c(flash_verbose_elbo, flash_verbose_max_chg_F, sparsity_F)
colnames <- c("ELBO", "Max Chg (Tiss)", "Sparsity (Tiss)")
colwidths <- c(12, 18, 18)
fl <- flash_init(gtex) |>
  flash_set_verbose(
    verbose = 3,
    fns = verbose_fns,
    colnames = colnames,
    colwidths = colwidths
  ) |>
  flash_greedy(Kmax = 3)

# Output can be changed as needed.
fl <- flash_init(gtex) |>
  flash_set_verbose(verbose = 1) |>
  flash_greedy(Kmax = 5L) |>
  flash_backfit(verbose = 3) |>
  flash_greedy(Kmax = 1L)

Update data in a flash object

Description

Replaces the data in a flash object with a new set of observations. Estimates of residual variances and the ELBO are also updated.

Usage

flash_update_data(flash, newdata, Y2_diff = NULL)

Arguments

flash

A flash or flash_fit object.

newdata

The new observations. Can be a matrix, a sparse matrix of class Matrix, or a low-rank matrix representation.

Y2_diff

Optionally, users can supply the (summed) changes in the squared values of the data yij2y_{ij}^2, which are needed to estimate the residual variance parameters sij2s_{ij}^2 for simple variance structures (i.e., when var_type is set to 0, 1, or 2). If calculating entries yij2y_{ij}^2 from scratch is expensive, supplying an argument to Y2_diff can greatly speed up data updates. If specified, the argument should be a scalar i,j(yij2(new)yij2(old))\sum_{i, j} \left( y_{ij}^{2 \text{(new)}} - y_{ij}^{2 \text{(old)}} \right) when var_type = 0; a vector of length nn with entries j=1p(yij2(new)yij2(old))\sum_{j = 1}^p \left( y_{ij}^{2 \text{(new)}} - y_{ij}^{2 \text{(old)}} \right) when var_type = 1; or a vector of length pp with entries i=1n(yij2(new)yij2(old))\sum_{i = 1}^n \left( y_{ij}^{2 \text{(new)}} - y_{ij}^{2 \text{(old)}} \right) when var_type = 2. The argument is ignored when any other variance structure is used.

Value

The flash object from argument flash, with the data modified as specified by newdata. Residual variances and ELBO are also updated.


Display the current ELBO

Description

Displays the value of the variational lower bound (ELBO) at the current iteration.

Usage

flash_verbose_elbo(curr, prev, k)

Arguments

curr

The flash_fit object from the current iteration.

prev

The flash_fit object from the previous iteration.

k

Only used during sequential backfits (that is, calls to flash_backfit where extrapolate = FALSE). It then takes the index of the factor/loadings pair currently being optimized.

Details

This function is an example of a function that may be passed to parameter fns in function flash_set_verbose to customize the output that is printed after each greedy or backfitting iteration. See flash_set_verbose for details and examples.

Value

A character string, suitable for printing progress updates.

See Also

flash_verbose_elbo_diff, flash_verbose_max_chg, flash_verbose_max_chg_L, flash_verbose_max_chg_F


Display the difference in ELBO

Description

Displays the difference in the variational lower bound (ELBO) from one iteration to the next.

Usage

flash_verbose_elbo_diff(curr, prev, k)

Arguments

curr

The flash_fit object from the current iteration.

prev

The flash_fit object from the previous iteration.

k

Only used during sequential backfits (that is, calls to flash_backfit where extrapolate = FALSE). It then takes the index of the factor/loadings pair currently being optimized.

Details

This function is an example of a function that may be passed to parameter fns in function flash_set_verbose to customize the output that is printed after each greedy or backfitting iteration. See flash_set_verbose for details and examples.

Value

A character string, suitable for printing progress updates.

See Also

flash_verbose_elbo, flash_verbose_max_chg, flash_verbose_max_chg_L, flash_verbose_max_chg_F


Display the maximum difference in scaled loadings and factors

Description

Displays the maximum (absolute) change over all (posterior expected values for) loadings ik\ell_{ik} and factors fjkf_{jk}. At each iteration, the loadings vectors 1,,K\ell_{\cdot 1}, \ldots, \ell_{\cdot K} and factors f1,,fKf_{\cdot 1}, \ldots, f_{\cdot K} are L2L^2-normalized.

Usage

flash_verbose_max_chg(curr, prev, k)

Arguments

curr

The flash_fit object from the current iteration.

prev

The flash_fit object from the previous iteration.

k

Only used during sequential backfits (that is, calls to flash_backfit where extrapolate = FALSE). It then takes the index of the factor/loadings pair currently being optimized.

Details

This function is an example of a function that may be passed to parameter fns in function flash_set_verbose to customize the output that is printed after each greedy or backfitting iteration. See flash_set_verbose for details and examples.

Value

A character string, suitable for printing progress updates.

See Also

flash_verbose_elbo, flash_verbose_elbo_diff, flash_verbose_max_chg_L, flash_verbose_max_chg_F


Display the maximum difference in scaled factors

Description

Displays the maximum (absolute) change over all (posterior expected values for) factors fjkf_{jk}. At each iteration, the factors f1,,fKf_{\cdot 1}, \ldots, f_{\cdot K} are L2L^2-normalized.

Usage

flash_verbose_max_chg_F(curr, prev, k)

Arguments

curr

The flash_fit object from the current iteration.

prev

The flash_fit object from the previous iteration.

k

Only used during sequential backfits (that is, calls to flash_backfit where extrapolate = FALSE). It then takes the index of the factor/loadings pair currently being optimized.

Details

This function is an example of a function that may be passed to parameter fns in function flash_set_verbose to customize the output that is printed after each greedy or backfitting iteration. See flash_set_verbose for details and examples.

Value

A character string, suitable for printing progress updates.

See Also

flash_verbose_elbo, flash_verbose_elbo_diff, flash_verbose_max_chg, flash_verbose_max_chg_L


Display the maximum difference in scaled loadings

Description

Displays the maximum (absolute) change over all (posterior expected values for) loadings ik\ell_{ik}. At each iteration, the loadings vectors 1,,K\ell_{\cdot 1}, \ldots, \ell_{\cdot K} are L2L^2-normalized.

Usage

flash_verbose_max_chg_L(curr, prev, k)

Arguments

curr

The flash_fit object from the current iteration.

prev

The flash_fit object from the previous iteration.

k

Only used during sequential backfits (that is, calls to flash_backfit where extrapolate = FALSE). It then takes the index of the factor/loadings pair currently being optimized.

Details

This function is an example of a function that may be passed to parameter fns in function flash_set_verbose to customize the output that is printed after each greedy or backfitting iteration. See flash_set_verbose for details and examples.

Value

A character string, suitable for printing progress updates.

See Also

flash_verbose_elbo, flash_verbose_elbo_diff, flash_verbose_max_chg, flash_verbose_max_chg_F


GTEx data

Description

Derived from data made available by the Genotype Tissue Expression (GTEx) project (Lonsdale et al. 2013), which provides zz-scores for assessing the significance of effects of genetic variants (single nucleotide polymorphisms, or SNPs) on gene expression across 44 human tissues. To reduce the data to a more manageable size, Urbut et al. (2019) chose the "top" SNP for each gene — that is, the SNP associated with the largest (absolute) zz-score over all 44 tissues. This yields a 16,069×4416,069 \times 44 matrix of zz-scores, with rows corresponding to SNP-gene pairs and columns corresponding to tissues. The dataset included here is further subsampled down to 1000 rows.

Format

gtex is a matrix with 1000 rows and 44 columns, with rows corresponding to SNP-gene pairs and columns corresponding to tissues.

Source

https://github.com/stephenslab/gtexresults/blob/master/data/MatrixEQTLSumStats.Portable.Z.rds

References

Lonsdale et al. (2013). "The Genotype-Tissue Expression (GTEx) project." Nature Genetics 45(6), 580–585.

Urbut, Wang, Carbonetto, and Stephens (2019). "Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions." Nature Genetics 51(1), 187–195.

Examples

data(gtex)
summary(gtex)

Colors for plotting GTEx data

Description

A custom palette used by Wang and Stephens (2021) to plot an empirical Bayes matrix factorization of data from the GTEx project (of which the gtex data in package flashier is a subsample). The palette is designed to link similar tissues together visually. For example, brain tissues all have the same color (yellow); arterial tissues are shades of pink or red; etc.

Format

gtex_colors is a named vector of length 44, with names corresponding to tissues (columns) in the gtex dataset and values giving hexadecimal color codes.

Source

https://github.com/stephenslab/gtexresults/blob/master/data/GTExColors.txt

References

Wei Wang and Matthew Stephens (2021). "Empirical Bayes matrix factorization." Journal of Machine Learning Research 22, 1–40.

Examples

fl <- flash(gtex, greedy_Kmax = 4)
plot(fl, pm_colors = gtex_colors)

LDF method for flash and flash fit objects

Description

Given a flash or flash_fit object, returns the LDF decomposition YLDFY \approx LDF'.

Usage

ldf(object, type)

## S3 method for class 'flash'
ldf(object, type = "f")

## S3 method for class 'flash_fit'
ldf(object, type = "f")

Arguments

object

An object inheriting from class flash or flash_fit.

type

Takes identical arguments to function norm. Use "f" or "2" for the 2-norm (Euclidean norm); "o" or "1" for the 1-norm (taxicab norm); and "i" or "m" for the infinity norm (maximum norm).

Details

When the prior families G(k)G_\ell^{(k)} and Gf(k)G_f^{(k)} are closed under scaling (as is typically the case), then the EBMF model (as described in the documention to function flash) is only identifiable up to scaling by a diagonal matrix DD:

Y=LDF+E.Y = LDF' + E.

Method ldf scales columns k\ell_k and fkf_k so that, depending on the argument to parameter type, their 1-norms, 2-norms, or infinity norms are equal to 1.

Value

A list with fields L, D, and F, each of which corresponds to one of the matrices in the decomposition YLDFY \approx LDF' (with the columns of LL and FF scaled according to argument type). Note that D is returned as a vector rather than a matrix (the vector of diagonal entries in DD). Thus, "fitted values" LDFLDF' can be recovered as L %*% diag(D) %*% t(F).

Methods (by class)

  • ldf(flash): LDF decomposition for flash objects

  • ldf(flash_fit): LDF decomposition for flash_fit objects


Plot method for flash objects

Description

Plots a flash object. Several types of plot are possible: see parameter plot_type below as well as functions flash_plot_scree, flash_plot_bar, flash_plot_heatmap, flash_plot_histogram, flash_plot_scatter, and flash_plot_structure.

Usage

## S3 method for class 'flash'
plot(
  x,
  include_scree = TRUE,
  include_pm = TRUE,
  order_by_pve = FALSE,
  kset = NULL,
  pm_which = c("factors", "loadings"),
  pm_subset = NULL,
  pm_groups = NULL,
  pm_colors = NULL,
  plot_type = c("scree", "bar", "heatmap", "histogram", "scatter", "structure"),
  ...
)

Arguments

x

An object inheriting from class flash.

include_scree

[Deprecated] This parameter has been deprecated; please use plot_type instead.

include_pm

[Deprecated] This parameter has been deprecated; please use plot_type instead.

order_by_pve

If order_by_pve = TRUE, then factor/loadings pairs will be ordered according to proportion of variance explained, from highest to lowest. (By default, they are plotted in the same order as kset; or, if kset is NULL, then they are plotted in the same order as they are found in fl.)

kset

A vector of integers specifying the factor/loadings pairs to be plotted. If order_by_pve = FALSE, then kset also specifies the order in which they are to be plotted.

pm_which

Whether to plot loadings LL or factors FF.

pm_subset

A vector of row indices ii or column indices jj (depending on the argument to pm_which) specifying which values i\ell_{i \cdot} or fjf_{j \cdot} are to be shown. If the dataset has row or column names, then names rather than indices may be specified. If pm_subset = NULL, then all values will be plotted.

pm_groups

A vector specifying the group to which each row of the data yiy_{i \cdot} or column yjy_{\cdot j} belongs (groups may be numeric indices or strings). A group must be provided for each plotted row ii or column jj, so that the length of pm_groups is exactly equal to the number of rows or columns in the full dataset or, if pm_subset is specified, in the subsetted dataset.

pm_colors

A character vector specifying a color for each unique group specified by pm_groups, or, if pm_groups = NULL, a vector specifying a color for each plotted row ii or column jj. For effects, see parameter plot_type.

plot_type

The type of plot to return. Options include:

"scree"

A scree plot showing the proportion of variance explained per factor/loadings pair. See flash_plot_scree.

"bar"

A bar plot of posterior means for loadings or factors (depending on argument pm_which), with one bar per row or column. Colors of bars are specified by argument pm_colors. This type of plot is most useful when rows or columns are small in number or ordered in a logical fashion (e.g., spatially). See flash_plot_bar.

"heatmap"

A heatmap showing posterior means for loadings or factors, with rows or columns grouped using a 1-d embedding. Here pm_color specifies the diverging color gradient (low-mid-high). See flash_plot_heatmap.

"histogram"

Overlapping semi-transparent histograms of posterior means for loadings or factors, with one histogram per group specified by pm_groups (or a single histogram if pm_groups is NULL). Colors of histograms are specified by pm_colors. See flash_plot_histogram.

"scatter"

A scatter plot showing the relationship between posterior means for loadings or factors and a user-supplied covariate. If a covariate is not supplied, then data column or row means will be used. Colors of points are specified by pm_colors. See flash_plot_scatter.

"structure"

A "structure plot" (stacked bar plot) produced using function structure_plot in package fastTopics. Here pm_colors specifies the colors of different factor/loadings pairs (as specified by kset) rather than different groups (as specified by pm_groups). Note that factors/loadings must be nonnegative for structure plots to make sense. See flash_plot_structure.

...

Additional parameters to be passed to respective flash_plot_xxx functions. See flash_plot_scree, flash_plot_bar, flash_plot_heatmap, flash_plot_histogram, flash_plot_scatter, and flash_plot_structure for details.

Value

A ggplot object.


Residuals method for flash objects

Description

Given a flash object, returns the expected residuals YE(LF)=YE(L)E(F)Y - E(LF') = Y - E(L) E(F)'.

Usage

## S3 method for class 'flash'
residuals(object, ...)

Arguments

object

An object inheriting from class flash.

...

Additional parameters are ignored.

Value

The matrix of expected residuals.


Residuals method for flash fit objects

Description

Given a flash_fit object, returns the expected residuals YE(LF)=YE(L)E(F)Y - E(LF') = Y - E(L) E(F)'.

Usage

## S3 method for class 'flash_fit'
residuals(object, ...)

Arguments

object

An object inheriting from class flash_fit.

...

Additional parameters are ignored.

Value

The matrix of expected residuals.