
Inference in generalized bilinear models

Jeffrey W. Miller  
Department of Biostatistics, Harvard University
and
Scott L. Carter
Division of Computational Biology, Dana–Farber Cancer Institute
J.W.M. was supported by the Department of Defense grant W81XWH-18-1-0357, the National Institutes of Health grant R01GM083084, and the Zhu Family Center for Global Cancer Prevention. S.L.C. was supported by National Cancer Institute grant 1R01CA227156-01, the Department of Defense grant W81XWH-18-1-0357, the American Brain Tumor Association, the Wong Family Awards in Translational Oncology, and the LUNGstrong Fund. S.L.C. is also affiliated with the Department of Biostatistics at Harvard University and the Broad Institute of MIT and Harvard.
Abstract

Latent factor models are widely used to discover and adjust for hidden variation in modern applications. However, most methods do not fully account for uncertainty in the latent factors, which can lead to miscalibrated inferences such as overconfident p-values. In this article, we develop a fast and accurate method of uncertainty quantification in generalized bilinear models (GBMs), which are a flexible extension of generalized linear models to include latent factors as well as row covariates, column covariates, and interactions. In particular, we introduce delta propagation, a general technique for propagating uncertainty among model components using the delta method. Further, we provide a rapidly converging algorithm for maximum a posteriori GBM estimation that extends earlier methods by estimating row and column dispersions. In simulation studies, we find that our method provides approximately correct frequentist coverage of most parameters of interest. We demonstrate our methods on RNA-seq gene expression analysis and on copy ratio estimation in cancer genomics.


Keywords: batch effects, factor analysis, Fisher information, negative-binomial regression, uncertainty quantification.

1 Introduction

Latent factor models have become an essential tool for analyzing and adjusting for hidden sources of variation in complex data. Building on generalized linear model theory, generalized bilinear models (GBMs) provide a flexible framework incorporating latent factors along with row covariates, column covariates, and interactions to analyze matrix data (Choulakian, 1996; Gabriel, 1998; de Falguerolles, 2000; Perry and Pillai, 2013; Hoff, 2015; Buettner et al., 2017). However, uncertainty quantification is a persistent problem in large complex models, and GBMs are particularly challenging due to the non-linearity of multiplicative terms, constraints and strong dependencies among parameters, inapplicability of normal model theory, and the fact that the number of parameters grows with the data.

Most latent factor methods do not fully account for uncertainty in the latent factors. For example, to remove batch effects in gene expression analysis, several methods first estimate a factorization $UV^{\mathtt{T}}$ and then treat $V$ as a known matrix of covariates, accounting for uncertainty only in $U$ (Leek and Storey, 2007, 2008; Sun et al., 2012; Risso et al., 2014). In copy number variation detection, it is common to treat the estimated $UV^{\mathtt{T}}$ as known and subtract it off (Fromer et al., 2012; Krumm et al., 2012; Jiang et al., 2015). In principle, Bayesian inference provides full uncertainty quantification (Carvalho et al., 2008); however, Markov chain Monte Carlo tends to be slow in large parameter spaces with strong dependencies, as in the case of GBMs. Variational Bayes approaches are faster (Stegle et al., 2010; Buettner et al., 2017; Babadi et al., 2018), but rely on factorized approximations that tend to underestimate uncertainty. Meanwhile, the classical method of inverting the Fisher information matrix is computationally prohibitive in large GBMs.

In this article, we introduce a novel method for uncertainty quantification in GBMs, focusing on the case of count data with negative binomial outcomes. The basic idea is to propagate uncertainty between model components using the delta method, which can be done analytically using closed-form expressions involving the gradient and the Fisher information; we refer to this as delta propagation. The method facilitates computation of p-values and confidence intervals with approximately correct frequentist properties in GBMs. Further, we provide an algorithm for maximum a posteriori GBM estimation that extends previous work by estimating row- and column-specific dispersion parameters, improving numerical stability, and explicitly handling identifiability constraints.

In a suite of simulation studies, we find that our methods perform favorably in terms of consistency, frequentist coverage, computation time, algorithm convergence, and robustness to the outcome distribution. We then apply our methods to gene expression analysis, (a) comparing performance with DESeq2 (Love et al., 2014) on a benchmark dataset of RNA-seq samples from lymphoblastoid cell lines, and (b) testing for age-related genes using RNA-seq data from the Genotype-Tissue Expression (GTEx) project (Melé et al., 2015). Finally, we apply our methods to copy ratio estimation in cancer genomics, comparing performance with the Genome Analysis Toolkit (GATK) (Broad Institute, 2020) on whole-exome sequencing data from the Cancer Cell Line Encyclopedia (Ghandi et al., 2019).

The article is organized as follows. In Section 2, we define the GBM model and we address identifiability, interpretability, and residuals. In Sections 3 and 4, we describe our estimation and inference methods, respectively. In Section 5, we establish theoretical results, and Section 6 contains simulation studies. In Sections 7 and 8, we apply our methods to gene expression analysis and copy ratio estimation in cancer genomics. The supplementary material contains a discussion of previous work and challenges, additional empirical results, mathematical derivations and proofs, and step-by-step algorithms.

2 Model

In this section, we define the class of models considered in this paper and we provide conditions guaranteeing identifiability and interpretability of the model parameters. For $i\in\{1,\ldots,I\}$ and $j\in\{1,\ldots,J\}$, suppose $Y_{ij}\in\mathbb{R}$ is a random variable such that

g(\mathbb{E}(Y_{ij})) = \sum_{k=1}^{K} x_{ik} a_{jk} + \sum_{\ell=1}^{L} b_{i\ell} z_{j\ell} + \sum_{k=1}^{K}\sum_{\ell=1}^{L} x_{ik} c_{k\ell} z_{j\ell} + \sum_{m=1}^{M} u_{im} d_{mm} v_{jm}     (2.1)

where $x_{ik}$ and $z_{j\ell}$ are observed covariates, $a_{jk}$, $b_{i\ell}$, $c_{k\ell}$, $u_{im}$, $d_{mm}$, and $v_{jm}$ are parameters to be estimated, and $g(\cdot)$ is a smooth function such that $g^{\prime}$ is positive, referred to as the link function. In matrix form, denoting $\bm{Y}=(Y_{ij})\in\mathbb{R}^{I\times J}$, Equation 2.1 is equivalent to

g(\mathbb{E}(\bm{Y})) = XA^{\mathtt{T}} + BZ^{\mathtt{T}} + XCZ^{\mathtt{T}} + UDV^{\mathtt{T}}     (2.2)

where $g$ is applied element-wise to the matrix $\mathbb{E}(\bm{Y})$. To be able to use capital $Y$ to denote scalar random variables, we use bold $\bm{Y}$ to denote the data matrix. We refer to this as a generalized bilinear model (GBM), following the terminology of Choulakian (1996).

In the genomics applications in Sections 7 and 8, we use negative binomial outcomes with log link $g(\mu)=\log(\mu)$, and the role of each piece is as follows (see Figure 1): $Y_{ij}$ is the read count for feature $i$ in sample $j$, $X\in\mathbb{R}^{I\times K}$ contains feature covariates and $A\in\mathbb{R}^{J\times K}$ contains the corresponding coefficients, $Z\in\mathbb{R}^{J\times L}$ contains sample covariates and $B\in\mathbb{R}^{I\times L}$ contains the corresponding coefficients, $C\in\mathbb{R}^{K\times L}$ contains intercepts and coefficients for interactions between the $x$'s and $z$'s, and $UDV^{\mathtt{T}}$ is a low-rank matrix that captures latent effects due, for example, to unobserved covariates such as batch.
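To make the structure of Equation 2.2 concrete, the following is a minimal sketch (not the authors' code; all dimensions and parameter values are illustrative assumptions) of assembling the linear predictor and the implied negative binomial mean under the log link.

```python
# Hypothetical sketch: the GBM linear predictor of Equation 2.2 and the NB mean mu = exp(eta).
import numpy as np

rng = np.random.default_rng(0)
I, J, K, L, M = 100, 20, 3, 2, 2             # features, samples, covariate counts, latent rank

X = rng.normal(size=(I, K)); X[:, 0] = 1.0   # feature covariates (first column = intercept)
Z = rng.normal(size=(J, L)); Z[:, 0] = 1.0   # sample covariates (first column = intercept)
A = rng.normal(size=(J, K)) * 0.1            # per-sample coefficients of feature covariates
B = rng.normal(size=(I, L)) * 0.1            # per-feature coefficients of sample covariates
C = rng.normal(size=(K, L)) * 0.1            # intercepts / interaction coefficients
U = rng.normal(size=(I, M)) * 0.1            # latent feature factors
V = rng.normal(size=(J, M)) * 0.1            # latent sample factors
D = np.diag([2.0, 1.0])                      # latent factor scales

eta = X @ A.T + B @ Z.T + X @ C @ Z.T + U @ D @ V.T   # Equation 2.2
mu = np.exp(eta)                                       # NB mean under the log link
```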

Figure 1: Schematic of the generalized bilinear model (GBM) structure.

2.1 Identifiability and interpretation

For identifiability and interpretability, we impose certain constraints (Conditions 2.1 and 2.2). The only constraints on the covariates are that $X$ and $Z$ are full-rank, centered, and include a column of ones for intercepts; see below. The rest of the constraints are enforced on the parameters during estimation. We write $\mathrm{I}$ to denote the identity matrix to distinguish it from the number of features, $I$.

Condition 2.1 (Identifiability constraints).

Assume the following constraints:

(a) $X^{\mathtt{T}}X$ and $Z^{\mathtt{T}}Z$ are invertible,

(b) $X^{\mathtt{T}}B=0$, $Z^{\mathtt{T}}A=0$, $X^{\mathtt{T}}U=0$, and $Z^{\mathtt{T}}V=0$,

(c) $U^{\mathtt{T}}U=\mathrm{I}$ and $V^{\mathtt{T}}V=\mathrm{I}$,

(d) $D$ is a diagonal matrix such that $d_{11}>d_{22}>\cdots>d_{MM}>0$, and

(e) the first nonzero entry of each column of $U$ is positive,

where $A\in\mathbb{R}^{J\times K}$, $B\in\mathbb{R}^{I\times L}$, $C\in\mathbb{R}^{K\times L}$, $D\in\mathbb{R}^{M\times M}$, $U\in\mathbb{R}^{I\times M}$, $V\in\mathbb{R}^{J\times M}$, $X\in\mathbb{R}^{I\times K}$, $Z\in\mathbb{R}^{J\times L}$, and $M<\min\{I,J\}$.

In Theorem 5.1, we show that Condition 2.1 is sufficient to guarantee identifiability of $A$, $B$, $C$, $D$, $U$, and $V$ in any GBM satisfying Equation 2.2. More precisely, letting

\eta(A,B,C,D,U,V) := XA^{\mathtt{T}} + BZ^{\mathtt{T}} + XCZ^{\mathtt{T}} + UDV^{\mathtt{T}}     (2.3)

for some fixed full-rank $X$ and $Z$, Theorem 5.1 shows that $\eta(\cdot)$ is a one-to-one function on the set of parameters satisfying Condition 2.1.

Condition 2.2 (Interpretability constraints).

Assume that (a) $x_{i1}=1$ and $z_{j1}=1$ for all $i,j$, and (b) $\sum_{i=1}^{I}x_{ik}=0$ and $\sum_{j=1}^{J}z_{j\ell}=0$ for all $k\geq 2$, $\ell\geq 2$.

When Condition 2.2(a) holds, we can rearrange the right-hand side of Equation 2.1 as

c_{11} + a_{j1} + b_{i1} + \sum_{k=2}^{K}(c_{k1}+a_{jk})x_{ik} + \sum_{\ell=2}^{L}(c_{1\ell}+b_{i\ell})z_{j\ell} + \sum_{k=2}^{K}\sum_{\ell=2}^{L}c_{k\ell}x_{ik}z_{j\ell} + \sum_{m=1}^{M}u_{im}d_{mm}v_{jm}.

Using this and assuming Conditions 2.1 and 2.2, we show in Theorem 5.2 that the interpretation of each parameter is: (1) $c_{11}$ is the overall intercept, $c_{11}=\frac{1}{IJ}\sum_{i=1}^{I}\sum_{j=1}^{J}g(\mathbb{E}(Y_{ij}))$, (2) $a_{j1}$ is a sample-specific offset and $b_{i1}$ is a feature-specific offset, (3) $c_{k1}$ is the mean effect of the $k$th feature covariate and $a_{jk}$ is the sample-specific offset of this effect, (4) $c_{1\ell}$ is the mean effect of the $\ell$th sample covariate and $b_{i\ell}$ is the feature-specific offset of this effect, and (5) $c_{k\ell}$ is the effect of the interaction $x_{ik}z_{j\ell}$ for $k\geq 2$ and $\ell\geq 2$.

GBMs can be decomposed in terms of the sum-of-squares of each component's contribution to the overall model fit, enabling one to interpret the proportion of variation explained by each component. Specifically, in Theorem 5.3, we show that

\mathrm{SS}(XA^{\mathtt{T}}+BZ^{\mathtt{T}}+XCZ^{\mathtt{T}}+UDV^{\mathtt{T}}) = \mathrm{SS}(XA^{\mathtt{T}}) + \mathrm{SS}(BZ^{\mathtt{T}}) + \mathrm{SS}(XCZ^{\mathtt{T}}) + \mathrm{SS}(UDV^{\mathtt{T}})

whenever Condition 2.1(b) holds, where $\mathrm{SS}(Q):=\sum_{i,j}q_{ij}^{2}$ denotes the sum of squares of the entries of a matrix. This extends a similar result by Takane and Shibayama (1991).
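The decomposition is easy to verify numerically. Below is a hypothetical check (all matrices are random and are merely projected to satisfy Condition 2.1(b); this is not the authors' code) that the four cross terms vanish and the sums of squares add up.

```python
# Hypothetical numerical check of the sum-of-squares decomposition (Theorem 5.3).
import numpy as np

rng = np.random.default_rng(1)
I, J, K, L, M = 50, 15, 3, 2, 2
X, Z = rng.normal(size=(I, K)), rng.normal(size=(J, L))
A, B = rng.normal(size=(J, K)), rng.normal(size=(I, L))
C = rng.normal(size=(K, L))
U, V = rng.normal(size=(I, M)), rng.normal(size=(J, M))
D = np.diag([2.0, 1.0])

proj = lambda Q, W: W - Q @ np.linalg.lstsq(Q, W, rcond=None)[0]  # remove span(Q) from columns of W
A, B = proj(Z, A), proj(X, B)        # enforce Z^T A = 0 and X^T B = 0
U, V = proj(X, U), proj(Z, V)        # enforce X^T U = 0 and Z^T V = 0

SS = lambda Q: np.sum(Q**2)
terms = [X @ A.T, B @ Z.T, X @ C @ Z.T, U @ D @ V.T]
print(np.isclose(SS(sum(terms)), sum(SS(t) for t in terms)))      # expect True
```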

2.2 Outcome distributions

For the distribution of $Y_{ij}$, we focus on discrete exponential dispersion families (Jørgensen, 1987; Agresti, 2015). Specifically, suppose $Y_{ij}\sim f(y\mid\theta_{ij},r_{ij})$ where for $y\in\mathbb{Z}$,

f(y\mid\theta,r) = \exp(\theta y - r\kappa(\theta))\,h(y,r)     (2.4)

is a probability mass function for all $\theta\in\Theta$ and $r\in\mathcal{R}$; this is referred to as a discrete EDF. Here, $\Theta\subseteq\mathbb{R}$, $\mathcal{R}\subseteq(0,\infty)$, and $\mathbb{Z}$ denotes the integers. For any discrete EDF, the mean and variance are $\mathbb{E}(Y)=r\kappa^{\prime}(\theta)$ and $\operatorname{Var}(Y)=r\kappa^{\prime\prime}(\theta)$ (Jørgensen, 1987); also see Section S4. We refer to $r$ as the inverse dispersion parameter, and $1/r$ is the dispersion. Discrete EDFs can be translated into standard EDFs of the form $\exp(r[\theta y-\kappa(\theta)])h(y,r)$ via the transformation $y\mapsto ry$. We refer to a GBM with EDF outcomes as an EDF-GBM.

Negative binomial (NB) outcomes. In the applications in Sections 7 and 8, we use negative binomial outcomes: $Y_{ij}\sim\operatorname{NegBin}(\mu_{ij},r_{ij})$ where $\mu_{ij}$ is the mean and $1/r_{ij}$ is the dispersion. This is a discrete EDF as in Equation 2.4 with $\theta=\log(\mu/(\mu+r))$ and $\kappa(\theta)=-\log(1-\exp(\theta))$; thus, $\mathbb{E}(Y)=\mu$ and $\mathrm{Var}(Y)=\mu+\mu^{2}/r$. The NB distribution is an overdispersed Poisson since if $Y|\lambda\sim\operatorname{Poisson}(\lambda)$ and $\lambda\sim\operatorname{Gamma}(r,\,r/\mu)$, then integrating out $\lambda$, we have $Y\sim\operatorname{NegBin}(\mu,r)$. We refer to a GBM with NB outcomes as an NB-GBM.

We parametrize the dispersions as $1/r_{ij}=\exp(s_{i}+t_{j}+\omega)$ and work in terms of $S=(s_{1},\ldots,s_{I})^{\mathtt{T}}\in\mathbb{R}^{I}$, $T=(t_{1},\ldots,t_{J})^{\mathtt{T}}\in\mathbb{R}^{J}$, and $\omega\in\mathbb{R}$, subject to the identifiability constraints $\frac{1}{I}\sum_{i}e^{s_{i}}=1$ and $\frac{1}{J}\sum_{j}e^{t_{j}}=1$. Note that this makes $\frac{1}{IJ}\sum_{i,j}1/r_{ij}=\exp(\omega)$.
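The following sketch (illustrative values only, not the authors' code) shows this dispersion parametrization and generates NB outcomes via the gamma-Poisson mixture described above.

```python
# Hypothetical sketch of the NB-GBM outcome model with dispersions 1/r_ij = exp(s_i + t_j + omega).
import numpy as np

rng = np.random.default_rng(2)
I, J = 200, 30
mu = np.exp(rng.normal(2.0, 1.0, size=(I, J)))                        # NB means (e.g., from Equation 2.2)
s = rng.normal(0.0, 0.5, size=I); s -= np.log(np.mean(np.exp(s)))     # enforce (1/I) sum exp(s_i) = 1
t = rng.normal(0.0, 0.5, size=J); t -= np.log(np.mean(np.exp(t)))     # enforce (1/J) sum exp(t_j) = 1
omega = -1.0
r = np.exp(-(s[:, None] + t[None, :] + omega))                        # inverse dispersions r_ij

lam = rng.gamma(shape=r, scale=mu / r)     # lambda ~ Gamma(shape=r, rate=r/mu), so E(lambda) = mu
Y = rng.poisson(lam)                       # Y | lambda ~ Poisson(lambda), hence Y ~ NegBin(mu, r)

# Model-implied moments: E(Y) = mu and Var(Y) = mu + mu^2 / r.
print(Y.mean(), mu.mean())
```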

2.3 Residuals and adjusting out selected effects

Residuals are useful for many purposes, such as visualization, model criticism, and downstream analyses. We define GBM residuals as $\varepsilon_{ij}:=g(Y_{ij}+\epsilon)-\eta_{ij}$ where $\eta=\eta(A,B,C,D,U,V)\in\mathbb{R}^{I\times J}$ as in Equation 2.3 and $\epsilon$ is a small constant to make $\varepsilon_{ij}$ well-defined; for NB-GBMs, we use $\epsilon=1/8$ as a default. A model-based estimate of the variance of $\varepsilon_{ij}$ is given by $\sigma^{2}_{ij}g^{\prime}(\mu_{ij})^{2}$ where $\mu_{ij}$ and $\sigma_{ij}^{2}$ are the mean and variance of $Y_{ij}$ under the model. This formula can be derived either from the Fisher information or from a first-order Taylor approximation to $g$. It turns out that the corresponding precisions $w_{ij}:=1/(\sigma^{2}_{ij}g^{\prime}(\mu_{ij})^{2})$ play a key role in our GBM estimation and inference algorithms. In the NB case with $g(\mu)=\log(\mu)$, these residual precisions are $w_{ij}=r_{ij}\mu_{ij}/(r_{ij}+\mu_{ij})$.

Often, it is useful to adjust out some effects but not others. Let $\mathcal{R}_{x}$, $\mathcal{R}_{z}$, and $\mathcal{R}_{u}$ be the indices of the columns of $X$, $Z$, and $U$ (or $V$) that one does not wish to adjust out. We define the partial residuals $\varepsilon^{\mathcal{R}}_{ij}:=\eta^{\mathcal{R}}_{ij}+\varepsilon_{ij}$ where $\eta^{\mathcal{R}}\in\mathbb{R}^{I\times J}$ is defined as in Equation 2.3 but with $x_{ik}$, $z_{j\ell}$, and $u_{im}$ replaced by $x_{ik}\mathds{1}(k\in\mathcal{R}_{x})$, $z_{j\ell}\mathds{1}(\ell\in\mathcal{R}_{z})$, and $u_{im}\mathds{1}(m\in\mathcal{R}_{u})$.
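The residual, precision, and partial-residual definitions above translate directly into code. Below is a hypothetical sketch for the NB case with log link; the function names and argument conventions are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of NB-GBM residuals, residual precisions, and partial residuals.
import numpy as np

def nb_gbm_residuals(Y, eta, r, eps=1.0 / 8):
    """Residuals eps_ij = log(Y_ij + eps) - eta_ij and precisions w_ij = r*mu/(r + mu)."""
    mu = np.exp(eta)
    resid = np.log(Y + eps) - eta
    w = r * mu / (r + mu)
    return resid, w

def partial_residuals(resid, X, Z, U, A, B, C, D, V, keep_x=(), keep_z=(), keep_u=()):
    """Add back only the selected effects: eta^R + eps, zeroing out columns not in keep_*."""
    Xr = np.zeros_like(X); Xr[:, list(keep_x)] = X[:, list(keep_x)]
    Zr = np.zeros_like(Z); Zr[:, list(keep_z)] = Z[:, list(keep_z)]
    Ur = np.zeros_like(U); Ur[:, list(keep_u)] = U[:, list(keep_u)]
    eta_R = Xr @ A.T + B @ Zr.T + Xr @ C @ Zr.T + Ur @ D @ V.T
    return eta_R + resid
```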

3 Estimation

We provide a general algorithm for estimating the parameters of a discrete EDF-GBM (Equations 2.2 and 2.4), and we augment the algorithm to estimate the NB-GBM dispersion parameters as well. Here, we only give an outline; see Section S7 for step-by-step details.

Inputs. The required inputs are $\bm{Y}\in\mathbb{Z}^{I\times J}$, $X\in\mathbb{R}^{I\times K}$, $Z\in\mathbb{R}^{J\times L}$, and $M\in\{0,1,2,\ldots\}$. Optional inputs are the maximum step size $\rho>0$, prior precisions $\lambda_{a},\lambda_{b},\lambda_{c},\lambda_{d},\lambda_{u},\lambda_{v}>0$, prior means and precisions $m_{s},m_{t},\lambda_{s},\lambda_{t}$ for the log-dispersions in the NB-GBM case, and the convergence criterion. As defaults, we use $\rho=5$, $\lambda_{a}=\lambda_{b}=\lambda_{c}=\lambda_{d}=\lambda_{u}=\lambda_{v}=1$, $m_{s}=m_{t}=0$, $\lambda_{s}=\lambda_{t}=1$, convergence tolerance $\tau=10^{-6}$ for the relative change in log-likelihood+log-prior, and a maximum of 50 iterations.

Preprocessing.

(1) Ensure that $X$ and $Z$ satisfy Condition 2.1(a) and Condition 2.2.

(2) Unless the covariates are already on a common scale in terms of units, standardize $X$ and $Z$ such that $\frac{1}{I}\sum_{i=1}^{I}x_{ik}^{2}=1$ and $\frac{1}{J}\sum_{j=1}^{J}z_{j\ell}^{2}=1$ for all $k\geq 2$ and $\ell\geq 2$.

(3) Precompute the pseudoinverses $X^{+}=(X^{\mathtt{T}}X)^{-1}X^{\mathtt{T}}$ and $Z^{+}=(Z^{\mathtt{T}}Z)^{-1}Z^{\mathtt{T}}$.

Initialization.

(1) Solve for $A$, $B$, and $C$ to minimize the sum-of-squares of the GBM residuals $\varepsilon_{ij}$.

(2) Randomly initialize $D$, $U$, and $V$ by computing the truncated singular value decomposition (of rank $M$) of a random matrix with i.i.d. $\mathcal{N}(0,10^{-16})$ entries.

(3) In the case of NB-GBMs, iteratively update $S$, $T$, and $\omega$ for a few iterations.

Iteration. In each iteration, we cycle through the components of the model, updating each in turn using an optimization-projection step, consisting of an unconstrained optimization step and a likelihood-preserving projection onto the constrained parameter space. We use a bounded, regularized version of Fisher scoring to perform the unconstrained optimization step for each of $A$, $B$, $C$, $D$, $UD$, and $VD$, separately, holding all the other parameters fixed. For a generic parameter vector $\beta$, the (unbounded) regularized Fisher scoring step is $\beta\leftarrow\beta+(\mathbb{E}(-\nabla_{\beta}^{2}\mathcal{L})+\lambda\mathrm{I})^{-1}(\nabla_{\beta}\mathcal{L}-\lambda\beta)$ where $\mathcal{L}$ is the log-likelihood and $\lambda>0$ is a regularization parameter. This arises from optimizing the log-likelihood plus the log-prior, where the prior on $\beta$ is $\pi(\beta)=\mathcal{N}(\beta\mid 0,\lambda^{-1}\mathrm{I})$, since then the gradient and Fisher information are $\nabla_{\beta}(\mathcal{L}+\log\pi)=\nabla_{\beta}\mathcal{L}-\lambda\beta$ and $\mathbb{E}(-\nabla_{\beta}^{2}(\mathcal{L}+\log\pi))=\mathbb{E}(-\nabla_{\beta}^{2}\mathcal{L})+\lambda\mathrm{I}$. Since these Fisher scoring steps occasionally diverge, for numerical stability we bound them using

\xi \leftarrow (\mathbb{E}(-\nabla_{\beta}^{2}\mathcal{L})+\lambda\mathrm{I})^{-1}(\nabla_{\beta}\mathcal{L}-\lambda\beta), \qquad \beta \leftarrow \beta+\xi\min\{1,\,\rho\sqrt{\mathrm{dim}(\xi)}/\|\xi\|\}     (3.1)

where $\|\xi\|=(\sum_{i}|\xi_{i}|^{2})^{1/2}$ is the Euclidean norm. The idea is that $\xi\min\{1,\,\rho\sqrt{\mathrm{dim}(\xi)}/\|\xi\|\}$ points in the same direction as $\xi$, but its root-mean-square is capped at $\rho$. Similarly, for $S$ and $T$ in the NB-GBM, we use bounded regularized Newton steps for the unconstrained optimizations, since the (expected) Fisher information for $S$ and $T$ is not available in closed form.
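For concreteness, a minimal sketch of the bounded, regularized Fisher scoring step in Equation 3.1 is given below. Here `grad` and `fisher` stand in for the component-specific gradient and Fisher information formulas (which are given in the supplementary material); they are assumptions of this illustration, not part of the paper's notation.

```python
# Hypothetical sketch of one bounded, regularized Fisher scoring step (Equation 3.1).
import numpy as np

def bounded_fisher_step(beta, grad, fisher, lam=1.0, rho=5.0):
    d = beta.size
    xi = np.linalg.solve(fisher + lam * np.eye(d), grad - lam * beta)  # regularized step direction
    scale = min(1.0, rho * np.sqrt(d) / np.linalg.norm(xi))            # cap the root-mean-square at rho
    return beta + scale * xi
```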

See Section S7 for full step-by-step details. See Section 5 for time complexity analysis.

4 Inference

In this section, we introduce our methodology for computing approximate standard errors for the parameters of a GBM. Since the standard technique of inverting the Fisher information matrix does not work well on GBMs, we develop a novel technique for propagating uncertainty from one part of the model to another; see Section 4.1. We provide an outline of our inference algorithm in Section 4.2, and step-by-step details are in Section S8.

4.1 Delta propagation method

In fixed-dimension parametric models, the asymptotic covariance of the maximum likelihood estimator is equal to the inverse of the Fisher information matrix. Thus, classically, approximate standard errors are given by the square roots of the diagonal entries of the inverse Fisher information. However, in GBMs, inverting the full (constraint-augmented) Fisher information does not work well for two reasons: (1) it is computationally intractable for large data matrices, and (2) it does not yield well-calibrated standard errors in terms of coverage, presumably because the number of parameters grows with the amount of data. Meanwhile, the inverse Fisher information for each component individually (for instance, $F_{a}^{-1}$ for $A$) is computationally efficient, but severely underestimates uncertainty since it treats the other components as known; thus, it can be thought of as representing the conditional uncertainty in each component given the other components.

We propose a general technique for approximating, for each model component, the additional variance due to uncertainty in the other components. By adding this to the conditional variance (that is, $\operatorname{diag}(F_{a}^{-1})$ in the case of $A$), we obtain approximate variances that are better calibrated, empirically. The basic idea is to write the estimator for each component as a function of the other components, and propagate the variance of the other components through this function using the same idea as the delta method.

In general, suppose we have a model with parameters $\theta\in\mathbb{R}^{d}$ and $\nu\in\mathbb{R}^{k}$, and we wish to quantify the uncertainty in $\theta$ due to uncertainty in $\nu$. Suppose the true values are $\theta_{0}$ and $\nu_{0}$, and let $\hat{\theta}$ and $\hat{\nu}$ be the maximum likelihood estimators. Define

h(\nu) := \theta_{0} + F(\theta_{0},\nu)^{-1} g(\theta_{0},\nu)

where $g(\theta,\nu):=\nabla_{\theta}\mathcal{L}$ is the gradient of the log-likelihood and $F(\theta,\nu):=\mathbb{E}(-\nabla_{\theta}^{2}\mathcal{L})$ is the Fisher information matrix for $\theta$, evaluated at $(\theta,\nu)$. The interpretation is that $\hat{\theta}\approx h(\hat{\nu})$, since $h(\hat{\nu})$ is a Fisher scoring step on $\theta$ starting at $\theta_{0}$, when the current estimate of $\nu$ is $\hat{\nu}$.

By a Taylor approximation to $h$ at $\nu_{0}$, we have $h(\hat{\nu})\approx h(\nu_{0})+h^{\prime}(\nu_{0})^{\mathtt{T}}(\hat{\nu}-\nu_{0})$ where $h^{\prime}(\nu)\in\mathbb{R}^{k\times d}$ such that $h^{\prime}(\nu)_{ij}=\partial h_{j}/\partial\nu_{i}$. Define the random element $\varphi=(h(\nu_{0}),h^{\prime}(\nu_{0}))$, where the randomness comes from the data. Assuming $\hat{\nu}|\varphi\approx\mathcal{N}(\nu_{0},\Sigma_{\nu})$ for some $\Sigma_{\nu}$, we have $h(\hat{\nu})|\varphi\approx\mathcal{N}(h(\nu_{0}),\,h^{\prime}(\nu_{0})^{\mathtt{T}}\Sigma_{\nu}h^{\prime}(\nu_{0}))$. Meanwhile, under standard regularity conditions, $\mathbb{E}(h(\nu_{0}))=\theta_{0}$ and $\mathrm{Cov}(h(\nu_{0}))=F(\theta_{0},\nu_{0})^{-1}$. Then, by the law of total covariance,

\mathrm{Cov}(h(\hat{\nu})) = \mathrm{Cov}(\mathbb{E}(h(\hat{\nu})|\varphi)) + \mathbb{E}(\mathrm{Cov}(h(\hat{\nu})|\varphi)) \approx F(\theta_{0},\nu_{0})^{-1} + \mathbb{E}(h^{\prime}(\nu_{0})^{\mathtt{T}}\Sigma_{\nu}h^{\prime}(\nu_{0})).

The interpretation of this decomposition is that $F(\theta_{0},\nu_{0})^{-1}$ represents the uncertainty in $\theta$ given $\nu$, and $\mathbb{E}(h^{\prime}(\nu_{0})^{\mathtt{T}}\Sigma_{\nu}h^{\prime}(\nu_{0}))$ represents the uncertainty in $\theta$ due to uncertainty in $\nu$. Since $\hat{\theta}\approx h(\hat{\nu})$, plugging in empirical estimates leads to the approximation

\mathrm{Cov}(\hat{\theta}) \approx F(\hat{\theta},\hat{\nu})^{-1} + \hat{h}^{\prime}(\hat{\nu})^{\mathtt{T}}\hat{\Sigma}_{\nu}\hat{h}^{\prime}(\hat{\nu}),     (4.1)

where $\hat{h}(\nu)=\hat{\theta}+F(\hat{\theta},\nu)^{-1}g(\hat{\theta},\nu)$.

To compute the Jacobian matrix $h^{\prime}(\nu)^{\mathtt{T}}$, observe that the $i$th column of $h^{\prime}(\nu)^{\mathtt{T}}$ is

\frac{\partial h}{\partial\nu_{i}} = \frac{\partial F^{-1}}{\partial\nu_{i}}g + F^{-1}\frac{\partial g}{\partial\nu_{i}} = -F^{-1}\frac{\partial F}{\partial\nu_{i}}F^{-1}g + F^{-1}\frac{\partial g}{\partial\nu_{i}}     (4.2)

by Bishop (2006, Eqn C.21), where $F=F(\theta_{0},\nu)$, $g=g(\theta_{0},\nu)$, and $\partial/\partial\nu_{i}$ is applied element-wise. Equation 4.2 also holds for $\partial\hat{h}/\partial\nu_{i}$, but with $\hat{\theta}$ in place of $\theta_{0}$; this facilitates computing $\hat{h}^{\prime}(\hat{\nu})$ in Equation 4.1. To obtain standard errors for each element of $\theta$, computation is simplified by the fact that we only need the diagonal of $\mathrm{Cov}(\hat{\theta})$, and to simplify computation even further we use a diagonal matrix for $\hat{\Sigma}_{\nu}$ when we apply this technique to GBMs. Additionally, since we use maximum a posteriori estimates, we use the regularized Fisher information and the gradient of the log-posterior in the formulas above.
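The following is a hypothetical sketch of the delta propagation computation with a diagonal $\hat{\Sigma}_{\nu}$, combining Equations 4.1 and 4.2. The callables `dF_dnu` and `dg_dnu` are illustrative stand-ins for the closed-form component-wise derivatives of the Fisher information and gradient; they are assumptions of this sketch, not part of the paper's notation.

```python
# Hypothetical sketch of delta propagation (Equations 4.1-4.2) with a diagonal Sigma_nu.
import numpy as np

def delta_propagation_variance(F, g, dF_dnu, dg_dnu, var_nu):
    """Return diag(h'(nu)^T Sigma_nu h'(nu)): extra variance of theta-hat due to uncertainty in nu."""
    Finv = np.linalg.inv(F)
    Finv_g = Finv @ g
    extra = np.zeros(F.shape[0])
    for i, s2 in enumerate(var_nu):
        # i-th column of h'(nu)^T, per Equation 4.2
        dh_i = -Finv @ dF_dnu(i) @ Finv_g + Finv @ dg_dnu(i)
        extra += s2 * dh_i**2
    return extra

# Total approximate variance (Equation 4.1): diag(inv(F)) + delta_propagation_variance(...)
```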

4.2 Outline of inference algorithm

Here we outline our procedure for computing GBM standard errors; see Section S8 for step-by-step details. The strategy is to handle $U$ and $V$ jointly by inverting the regularized, constraint-augmented Fisher information matrix for $(U,V)$ and then employ delta propagation for $A$, $B$, $C$, $S$, and $T$; see Figure 2. Empirically, we find that some of the delta propagation terms are negligible; thus, these have been excluded.

Figure 2: Diagram of uncertainty propagation scheme for GBM inference.

For notational convenience, we vectorize the parameter matrices as follows. For $Q\in\mathbb{R}^{m\times n}$, define $\mathrm{vec}(Q):=(q_{11},q_{21},\ldots,q_{m1},q_{12},q_{22},\ldots,q_{m2},\ldots,q_{mn})\in\mathbb{R}^{mn}$. Denote $\vec{a}:=\mathrm{vec}(A^{\mathtt{T}})$, $\vec{b}:=\mathrm{vec}(B^{\mathtt{T}})$, $\vec{c}:=\mathrm{vec}(C)$, $\vec{d}:=\operatorname{diag}(D)$, $\vec{u}:=\mathrm{vec}(U^{\mathtt{T}})$, and $\vec{v}:=\mathrm{vec}(V^{\mathtt{T}})$. It is easier to work with the vectorized transposes of $A$, $B$, $U$, and $V$ (that is, $\mathrm{vec}(A^{\mathtt{T}})$ rather than $\mathrm{vec}(A)$), since then the Fisher information matrices have a block diagonal structure. We write $F_{a}$, $F_{b}$, $F_{c}$, $F_{d}$, $F_{u}$, and $F_{v}$ to denote the regularized Fisher information for $\vec{a}$, $\vec{b}$, $\vec{c}$, $\vec{d}$, $\vec{u}$, and $\vec{v}$, respectively, for instance, $F_{a}=\mathbb{E}(-\nabla_{\vec{a}}^{2}\,(\mathcal{L}+\log\pi))$. We write $F_{s,\mathrm{obs}}$ and $F_{t,\mathrm{obs}}$ for the regularized, observed Fisher information for $S$ and $T$, that is, $F_{s,\mathrm{obs}}=-\nabla_{S}^{2}\,(\mathcal{L}+\log\pi)$.
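As a small illustration of this convention (not the authors' code), note that column-stacking the transpose of a matrix is the same as row-major flattening of the matrix itself, which is how $\mathrm{vec}(A^{\mathtt{T}})$ can be formed in practice.

```python
# Hypothetical illustration of the vectorization convention: vec(Q) stacks the columns of Q.
import numpy as np

A = np.arange(6).reshape(2, 3)              # J = 2, K = 3
vec = lambda Q: Q.flatten(order="F")        # column stacking, as defined above
assert np.array_equal(vec(A.T), A.flatten(order="C"))   # vec(A^T) equals row-major flatten of A
```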


Inputs. The required inputs are $\bm{Y}$, $X$, $Z$, and the estimates of $A$, $B$, $C$, $D$, $U$, $V$, $S$, $T$, and $\omega$. Optional inputs are the prior parameters ($\lambda_{a},\lambda_{b},\lambda_{c},\lambda_{d},\lambda_{u},\lambda_{v},\lambda_{s},\lambda_{t},m_{s},m_{t}$).


Compute conditional uncertainty for each component.

(1) Compute $F_{c}^{-1}$ and the diagonal blocks of $F_{a}^{-1}$, $F_{b}^{-1}$, $F_{u}^{-1}$, and $F_{v}^{-1}$.

(2) Compute the diagonals of $F_{s,\mathrm{obs}}^{-1}$ and $F_{t,\mathrm{obs}}^{-1}$.

Compute joint uncertainty in $(U,V)$ accounting for constraints.

(1) Compute $\tilde{F}_{(u,v)}$, the regularized constraint-augmented Fisher information for $(\vec{u},\vec{v})$.

(2) Compute $\operatorname{diag}(\tilde{F}_{(u,v)}^{-1})$. (It is key to do this in a computationally efficient way.)

(3) Define $\widehat{\mathrm{var}}_{u}$ and $\widehat{\mathrm{var}}_{v}$ to be the entries of $\operatorname{diag}(\tilde{F}_{(u,v)}^{-1})$ corresponding to $U$ and $V$.

Propagate uncertainty between components using delta propagation.

(1) Propagate uncertainty in $(U,V)$ to $A$ and $B$, to obtain $\widehat{\mathrm{var}}_{(u,v)\to a}$ and $\widehat{\mathrm{var}}_{(u,v)\to b}$, the additional variance of the estimators of $A$ and $B$ due to uncertainty in $(U,V)$.

(2) Propagate uncertainty in $A$ and $B$ through to $C$, to obtain $\widehat{\mathrm{var}}_{(a,b)\to c}$.

(3) Propagate uncertainty in $A$, $B$, $U$, $V$ to $S$ and $T$, to get $\widehat{\mathrm{var}}_{(a,b,u,v)\to s}$ and $\widehat{\mathrm{var}}_{(a,b,u,v)\to t}$.

Compute approximate standard errors.

(1) $\hat{\mathrm{se}}_{a}\leftarrow\mathrm{sqrt}(\operatorname{diag}(F_{a}^{-1})+\widehat{\mathrm{var}}_{(u,v)\to a})$ and $\hat{\mathrm{se}}_{b}\leftarrow\mathrm{sqrt}(\operatorname{diag}(F_{b}^{-1})+\widehat{\mathrm{var}}_{(u,v)\to b})$

(2) $\hat{\mathrm{se}}_{c}\leftarrow\mathrm{sqrt}(\operatorname{diag}(F_{c}^{-1})+\widehat{\mathrm{var}}_{(a,b)\to c})$

(3) $\hat{\mathrm{se}}_{u}\leftarrow\mathrm{sqrt}(\widehat{\mathrm{var}}_{u})$ and $\hat{\mathrm{se}}_{v}\leftarrow\mathrm{sqrt}(\widehat{\mathrm{var}}_{v})$

(4) $\hat{\mathrm{se}}_{s}\leftarrow\mathrm{sqrt}(\operatorname{diag}(F_{s,\mathrm{obs}}^{-1})+\widehat{\mathrm{var}}_{(a,b,u,v)\to s})$ and $\hat{\mathrm{se}}_{t}\leftarrow\mathrm{sqrt}(\operatorname{diag}(F_{t,\mathrm{obs}}^{-1})+\widehat{\mathrm{var}}_{(a,b,u,v)\to t})$

Here, $\mathrm{sqrt}(\cdot)$ is the element-wise square root. We do not provide standard errors for $D$ and $\omega$, since it seems difficult to estimate them without non-negligible bias. See Section S8 for the complete step-by-step algorithm. See Section 5 for computation time complexity.

5 Theory

In this section, we provide theoretical results on GBMs. The proofs are in Section S10.

Theorem 5.1 (Identifiability).

If $(A_{1},B_{1},C_{1},D_{1},U_{1},V_{1},X,Z)$ and $(A_{2},B_{2},C_{2},D_{2},U_{2},V_{2},X,Z)$ satisfy Condition 2.1 and

XA_{1}^{\mathtt{T}}+B_{1}Z^{\mathtt{T}}+XC_{1}Z^{\mathtt{T}}+U_{1}D_{1}V_{1}^{\mathtt{T}} = XA_{2}^{\mathtt{T}}+B_{2}Z^{\mathtt{T}}+XC_{2}Z^{\mathtt{T}}+U_{2}D_{2}V_{2}^{\mathtt{T}},     (5.1)

then $A_{1}=A_{2}$, $B_{1}=B_{2}$, $C_{1}=C_{2}$, $D_{1}=D_{2}$, $U_{1}=U_{2}$, and $V_{1}=V_{2}$. In particular, for any GBM satisfying Equation 2.2 for some $X$, $Z$, and $M$, if Condition 2.1 holds, then $A$, $B$, $C$, $D$, $U$, and $V$ are identifiable in the sense that they are uniquely determined by the distribution of $\bm{Y}$; in fact, they are uniquely determined by $\mathbb{E}(\bm{Y})$.

Theorem 5.2 (Interpretation of parameters).

If Conditions 2.1 and 2.2 hold and $\mu_{ij}:=\mathbb{E}(Y_{ij})$ satisfies Equation 2.1, then:

(a) $\sum_{j=1}^{J}a_{jk}=0$ and $\sum_{i=1}^{I}b_{i\ell}=0$ for all $k\in\{1,\ldots,K\}$, $\ell\in\{1,\ldots,L\}$,

(b) $\sum_{i=1}^{I}u_{im}=0$ and $\sum_{j=1}^{J}v_{jm}=0$ for all $m\in\{1,\ldots,M\}$,

(c) $\frac{1}{I}\sum_{i=1}^{I}g(\mu_{ij})=c_{11}+a_{j1}+\sum_{\ell=2}^{L}c_{1\ell}z_{j\ell}$ for all $j\in\{1,\ldots,J\}$,

(d) $\frac{1}{J}\sum_{j=1}^{J}g(\mu_{ij})=c_{11}+b_{i1}+\sum_{k=2}^{K}c_{k1}x_{ik}$ for all $i\in\{1,\ldots,I\}$, and

(e) $\frac{1}{IJ}\sum_{i=1}^{I}\sum_{j=1}^{J}g(\mu_{ij})=c_{11}$.

For a matrix $Q\in\mathbb{R}^{m\times n}$, we write $\mathrm{SS}(Q):=\sum_{i=1}^{m}\sum_{j=1}^{n}q_{ij}^{2}$ for the sum of squares.

Theorem 5.3 (Sum-of-squares decomposition).

If Condition 2.1(b) holds, then

\mathrm{SS}(XA^{\mathtt{T}}+BZ^{\mathtt{T}}+XCZ^{\mathtt{T}}+UDV^{\mathtt{T}}) = \mathrm{SS}(XA^{\mathtt{T}})+\mathrm{SS}(BZ^{\mathtt{T}})+\mathrm{SS}(XCZ^{\mathtt{T}})+\mathrm{SS}(UDV^{\mathtt{T}}).

Theorem 5.4 shows that the projections we use in the estimation algorithm are likelihood-preserving. The idea is that, for example, $\tilde{A}$ is the result of an unconstrained optimization step on $A$, and we project $(\tilde{A},C)$ to $(A_{1},C_{1})$ to enforce the constraints in Condition 2.1 without affecting the likelihood. To interpret items 3 and 4 of Theorem 5.4, note that in the algorithm, we optimize with respect to $G:=UD$ and $H:=VD$, rather than $U$ and $V$. We write $Q^{+}$ to denote the pseudoinverse. When $(Q^{\mathtt{T}}Q)^{-1}$ exists, $Q^{+}=(Q^{\mathtt{T}}Q)^{-1}Q^{\mathtt{T}}$.

Theorem 5.4 (Likelihood-preserving projections).

Suppose $(A,B,C,D,U,V,X,Z)$ satisfies Condition 2.1. Fix $X$ and $Z$ and define $\eta(\cdot)$ as in Equation 2.3.

1. Let $\tilde{A}\in\mathbb{R}^{J\times K}$. Define $A_{1}=\tilde{A}-Z(Z^{+}\tilde{A})$ and $C_{1}=C+(Z^{+}\tilde{A})^{\mathtt{T}}$. Then $\eta(A_{1},B,C_{1},D,U,V)=\eta(\tilde{A},B,C,D,U,V)$ and $(A_{1},B,C_{1},D,U,V)$ satisfies Condition 2.1.

2. Let $\tilde{B}\in\mathbb{R}^{I\times L}$. Define $B_{1}=\tilde{B}-X(X^{+}\tilde{B})$ and $C_{1}=C+(X^{+}\tilde{B})$. Then $\eta(A,B_{1},C_{1},D,U,V)=\eta(A,\tilde{B},C,D,U,V)$ and $(A,B_{1},C_{1},D,U,V)$ satisfies Condition 2.1.

3. Let $\tilde{G}\in\mathbb{R}^{I\times M}$. Define $G_{0}=\tilde{G}-X(X^{+}\tilde{G})$ and let $U_{1}D_{1}V_{1}^{\mathtt{T}}$ be the compact SVD (of rank $M$) of $G_{0}V^{\mathtt{T}}$. Assume the singular values are distinct and positive, and choose the SVD in such a way that Conditions 2.1(d) and 2.1(e) are satisfied. Define $A_{0}=A+V(X^{+}\tilde{G})^{\mathtt{T}}$, $A_{1}=A_{0}-Z(Z^{+}A_{0})$, and $C_{1}=C+(Z^{+}A_{0})^{\mathtt{T}}$. Then $\eta(A_{1},B,C_{1},D_{1},U_{1},V_{1})=\eta(A,B,C,\mathrm{I},\tilde{G},V)$ and $(A_{1},B,C_{1},D_{1},U_{1},V_{1})$ satisfies Condition 2.1.

4. Let $\tilde{H}\in\mathbb{R}^{J\times M}$. Define $H_{0}=\tilde{H}-Z(Z^{+}\tilde{H})$ and let $U_{1}D_{1}V_{1}^{\mathtt{T}}$ be the compact SVD (of rank $M$) of $UH_{0}^{\mathtt{T}}$. Assume the singular values are distinct and positive, and choose the SVD in such a way that Conditions 2.1(d) and 2.1(e) are satisfied. Define $B_{0}=B+U(Z^{+}\tilde{H})^{\mathtt{T}}$, $B_{1}=B_{0}-X(X^{+}B_{0})$, and $C_{1}=C+X^{+}B_{0}$. Then $\eta(A,B_{1},C_{1},D_{1},U_{1},V_{1})=\eta(A,B,C,\mathrm{I},U,\tilde{H})$ and $(A,B_{1},C_{1},D_{1},U_{1},V_{1})$ satisfies Condition 2.1.
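Item 1 of Theorem 5.4 is simple enough to verify numerically. The following hypothetical sketch (random matrices, illustrative dimensions, not the authors' code) confirms that the projection leaves the $A$- and $C$-terms of $\eta$ unchanged while enforcing $Z^{\mathtt{T}}A_{1}=0$.

```python
# Hypothetical numerical check of projection 1 in Theorem 5.4.
import numpy as np

rng = np.random.default_rng(3)
I, J, K, L = 40, 12, 3, 2
X, Z = rng.normal(size=(I, K)), rng.normal(size=(J, L))
A_tilde, C = rng.normal(size=(J, K)), rng.normal(size=(K, L))

Z_pinv = np.linalg.pinv(Z)                 # Z^+ = (Z^T Z)^{-1} Z^T when Z has full column rank
A1 = A_tilde - Z @ (Z_pinv @ A_tilde)      # now Z^T A1 = 0
C1 = C + (Z_pinv @ A_tilde).T

eta_before = X @ A_tilde.T + X @ C @ Z.T   # terms of eta involving A and C
eta_after = X @ A1.T + X @ C1 @ Z.T
print(np.allclose(eta_before, eta_after), np.allclose(Z.T @ A1, 0))   # expect True True
```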

Next, we provide the computation time complexity of our estimation and inference algorithms in Sections 3 and 4. To simplify the expressions, here we assume

\max\{K^{2},L^{2},M\} \leq \min\{I,J\}.     (5.2)

For the estimation algorithm, preprocessing and initialization take $O(IJ\max\{K,L,M\})$ time, and Table 1 summarizes the computation time for updating each model component. The table first breaks out the time required to compute $\eta$ (Equation 2.3), which is a prerequisite within each other update, and then lists the time required for each update given $\eta$. In total, it takes $O(IJ\max\{K^{2},L^{2},M^{2}\})$ time to perform each overall iteration.

For the inference algorithm, Table 2 shows the computation time assuming Equation 5.2 and also assuming $I\geq J$. These are one-time costs since there are no repeated iterations. When $M>0$, the most expensive operation tends to be computing the joint uncertainty in $(U,V)$, and as $J$ grows this dominates the cost. We have experimented extensively but have not found a faster alternative that provides well-calibrated standard errors.

Table 1: Computation time complexity of each update in the estimation algorithm.
Operation: Time complexity
Computing $\eta$: $O(IJ\max\{K,L,M\})$
Updating $A$: $O(IJK^{2})$
Updating $B$: $O(IJL^{2})$
Updating $C$: $O(IJ\max\{K^{2},L^{2}\})$
Updating $D$, $U$, and $V$: $O(IJM^{2})$
Updating $S$ and $T$: $O(IJ)$
Total per iteration: $O(IJ\max\{K^{2},L^{2},M^{2}\})$

Table 2: Computation time complexity of the inference algorithm.
Operation: Time complexity
Preprocessing: $O(IJ\max\{K,L,M\})$
Conditional uncertainty for each component: $O(IJ\max\{K^{2},L^{2},M^{2}\})$
Joint uncertainty in $(U,V)$ accounting for constraints: $O(IJ^{2}M^{3})$
Propagate uncertainty between components: $O(IJ\max\{K^{3},L^{3},M^{3}\})$
Compute approximate standard errors: $O(IJ)$
Total: $O(IJ\max\{K^{3},L^{3},JM^{3}\})$

6 Simulations

In this section, we present simulation studies assessing (a) consistency and statistical efficiency, (b) accuracy of standard errors, (c) computation time and algorithm convergence, and (d) robustness to the outcome distribution. See Section S2 for more simulation results.

In each simulation run, the data are generated as follows; see Section S2 for full details. We generate the covariates using one of three schemes, Normal, Gamma, or Binary, then we generate the true parameters using either a Normal or Gamma scheme, and finally we generate the outcome data using the log link and an NB (negative binomial), LNP (log-normal Poisson), Poisson, or Geometric distribution. For brevity, we refer to each combination of choices by the triplet of outcomes/covariates/parameters, for instance, NB/Binary/Normal.

Figure 3: Scatterplots of estimated versus true parameters for a typical simulated data matrix.

Typical example. Figure 3 shows scatterplots of the estimated versus true parameters for an NB/Normal/Normal simulation with $I=1000$, $J=100$, $K=4$, $L=2$, and $M=3$. Each dot represents a single univariate parameter; for example, the plot for $A$ contains $JK$ dots, one for each entry $a_{jk}$. The error bars are $\pm 2\,\hat{\mathrm{se}}$. Visually, the estimates are close to the true values, and the standard errors look appropriate. Since the likelihood and prior are invariant to permutations and sign changes of the latent factors, in this section we permute and flip signs to find the correct assignment to the true latent factors (see the sketch below). Note that the $s_{i}$ estimates are biased upward when the true value of $s_{i}$ is very low; this is because very low values of $s_{i}$ make row $i$ roughly Poisson, and in this case any value of $s_{i}$ from $-\infty$ to $\approx -2$ could yield a reasonable fit. The prior on $s_{i}$ prevents the estimate from diverging, but also leads to an upward bias when the true value is very low.
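One way to resolve the permutation and sign ambiguity is to greedily match estimated factor columns to the true columns by absolute correlation and then flip signs. The sketch below is an illustrative assumption about how this alignment could be done (shown for $V$; the same permutation and signs would be applied to $U$ and $D$), not the authors' implementation.

```python
# Hypothetical sketch: align estimated latent factors to the truth (permutation + sign flips).
import numpy as np

def align_factors(V_hat, V_true):
    M = V_true.shape[1]
    perm, signs, used = [], [], set()
    for m in range(M):
        # correlation of true column m with each unused estimated column
        corrs = [0.0 if k in used else np.corrcoef(V_true[:, m], V_hat[:, k])[0, 1]
                 for k in range(M)]
        k = int(np.argmax(np.abs(corrs)))
        perm.append(k); signs.append(np.sign(corrs[k]) or 1.0); used.add(k)
    return V_hat[:, perm] * np.array(signs)
```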

6.1 Consistency and statistical efficiency

In many applications, $I$ is much larger than $J$. One would hope that for any $J$, the estimates of $A$, $C$, $V$, and $T$ would be consistent as $I\to\infty$ since then these parameters have fixed dimension. Meanwhile, one cannot hope for consistency in $B$, $U$, and $S$ as $I\to\infty$.

Figure 4: Relative mean-squared error between estimated and true parameter values. Each plot contains 50 thin lines, one for each run, along with the median over the 50 runs (thick blue line).

To assess consistency and efficiency, for each $I\in\{100,316,1000,3162,10000\}$ we use the NB/Normal/Normal simulation scheme to generate 50 data matrices with $J=100$, $K=4$, $L=2$, and $M=3$, each with a different set of covariates and parameters. For each data matrix, we run our NB-GBM estimation algorithm with convergence tolerance $\tau=10^{-8}$. Figure 4 shows the relative mean-squared error (MSE) between the estimates and the true values for $A$, $C$, $V$, and $T$. For $T$, we measure the relative MSE in the dispersion parametrization (rather than log-dispersion) since there is little difference between, say, $t_{j}=-5$ and $t_{j}=-100$; both make column $j$ approximately Poisson distributed.

We see that for $A$, $C$, $V$, and $T$, the relative MSE is decreasing to zero, suggesting that the estimates of these parameters are consistent as $I\to\infty$. Further, for $A$ and $V$, the relative MSE appears to be $O(1/I)$, which is the optimal rate of convergence even for fixed-dimension parametric models. For $B$, $U$, and $S$ (Figure S1), the relative MSE hovers around a small nonzero value, but does not appear to be trending to zero, as expected. For $D$ and $\omega$ (Figure S1), the relative MSE is small and the trend is suggestive but less clear.

6.2 Accuracy of standard errors

Next, we assess the accuracy of the standard errors produced by our inference algorithm, in terms of the coverage. Ideally, a 95% confidence interval would contain the true parameter 95% of the time, but even when the model is correct, this is not guaranteed since intervals are usually based on an approximation to the distribution of an estimator. To assess coverage, for each $I\in\{100,1000,10000\}$, we use the NB/Normal/Normal scheme to generate 50 data matrices with $J=100$, $K=4$, $L=2$, and $M=3$, each with a different set of covariates and parameters. For each data matrix, we run our NB-GBM estimation algorithm (with tolerance $\tau=10^{-8}$) and then we run our NB-GBM inference algorithm to obtain approximate standard errors. We construct Wald-type confidence intervals for each univariate parameter, for example, the 95% confidence interval for $a_{jk}$ is $\hat{a}_{jk}\pm 1.96\,\hat{\mathrm{se}}$ where $\hat{\mathrm{se}}$ is the approximate standard error for $\hat{a}_{jk}$.
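For reference, a minimal sketch of the interval construction and the coverage calculation used in this assessment is given below (illustrative only; arrays of estimates, standard errors, and true values are assumed inputs).

```python
# Hypothetical sketch of Wald-type intervals and empirical coverage at a target level.
import numpy as np
from scipy.stats import norm

def coverage(est, se, truth, target=0.95):
    z = norm.ppf(0.5 + target / 2)                  # e.g., 1.96 for a 95% interval
    lo, hi = est - z * se, est + z * se
    return np.mean((truth >= lo) & (truth <= hi))   # fraction of intervals containing the truth
```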

Figure 5 shows actual coverage versus target coverage, estimated by combining across all 50 runs and across all entries of each parameter matrix/vector. Perfect coverage would be a straight line on the diagonal. We exclude $c_{11}$, $D$, and $\omega$ from these coverage results since it seems challenging to estimate them without non-negligible bias, which skews the results. We see in Figure 5 that the actual coverage for $A$, $B$, $C$, $U$, and $S$ is excellent at every target coverage level from 0% to 100%. For $V$ and $T$, the coverage is reasonably good for the smaller values of $I$ but appears to degrade when $I$ increases.

Figure 5: Coverage of confidence intervals for the entries of each parameter matrix/vector.

6.3 Computation time and algorithm convergence

Computation time. For each combination of $I\in\{100,316,1000,3162,10000\}$ and $J\in\{15,30,60,120,240\}$, we generate 10 data matrices using the NB/Normal/Normal simulation scheme with $K=4$, $L=2$, and $M=3$. For each $I$ and $J$, Figure 6 shows the average computation time per iteration of the estimation algorithm, along with the average computation time for the inference algorithm. These empirical results agree with our theory in Section 5 showing that the time per iteration scales like $IJ$ (that is, linearly with the size of the data matrix) and the time for inference scales like $IJ^{2}$.

Figure 6: Computation time of our GBM algorithms as a function of $I$ and $J$.

Algorithm convergence. Next, we evaluate the number of iterations required for the estimation algorithm to converge. Similarly to before, for $I\in\{100,1000,10000\}$, we run the NB/Normal/Normal scheme 25 times with $J=100$, $K=4$, $L=2$, and $M=3$. Figure S2 shows the log-likelihood+log-prior (plus a constant) versus iteration number for each simulation run. In these simulations, the log-likelihood+log-prior levels off after around 5 or fewer iterations, indicating that the algorithm converges rapidly.

6.4 Robustness to the outcome distribution

To assess the robustness of the NB-GBM to the assumption that the outcome distribution is negative binomial, we rerun the experiments in Sections 6.1 and 6.2 using the following data simulation schemes: (a) LNP/Normal/Normal, (b) Poisson/Normal/Normal, and (c) Geometric/Normal/Normal. The results in Figures S3, S4, and S5 show that the algorithms are quite robust to misspecification of the outcome distribution.

7 Application to gene expression analysis

In this section, we evaluate our GBM algorithms on RNA-seq gene expression data. An RNA-seq dataset consists of a matrix of counts in which entry $(i,j)$ is the number of high-throughput sequencing reads that were mapped to gene $i$ for sample $j$. These read counts are related to gene expression level, but there are many biases, both sample-specific and gene-specific. More generally, there are often significant sources of unwanted variation, both biological and technical, that obscure the signal of interest. Most methods use pipelines that adjust for each bias sequentially, rather than in an integrated way. GBMs enable one to use a single coherent model that adjusts for gene covariates and sample covariates as well as unobserved factors such as batch.

7.1 Comparing to DESeq2 on lymphoblastoid cell lines

As a test of our GBM methods, we compare with DESeq2 (Love et al., 2014), a leading method for RNA-seq differential expression analysis. We first consider a benchmark dataset used by Love et al. (2014) consisting of 161 samples from lymphoblastoid cell lines (Pickrell et al., 2010). We use the subset of 20,815 genes with nonzero median count across samples.

In both DESeq2 and the GBM, we adjust for two sample covariates: sequencing center (Argonne or Yale) and cDNA concentration. To adjust for GC content, which is often the most important gene covariate, in the GBM we construct the $X$ matrix using a natural cubic spline basis with knots at the 2.5%, 25%, 50%, 75%, and 97.5% quantiles of GC content. DESeq2 does not have a built-in capacity to adjust for gene covariates, so for DESeq2, we adjust for GC using their recommended approach of pre-computing normalization factors using CQN (Hansen et al., 2012), which uses the same spline basis. Since DESeq2 does not adjust for latent factors, we first set $M=0$ for direct comparison; later, we set $M=2$.

It is natural to use negative binomial (NB) outcomes for sequencing data since the technical variability is close to Poisson (Marioni et al., 2008), and biological variability introduces overdispersion (Robinson et al., 2010). These modeling choices yield an NB-GBM with $I=20{,}815$, $J=161$, $K=7$, and $L=3$. DESeq2 also uses an NB model, so the main difference between DESeq2 and this particular GBM is the way that the parameters and standard errors are estimated. Using a 1.8GHz processor, GBM estimation and inference took 42 seconds, whereas DESeq2+CQN took 105 seconds.

Figure 7: Comparison of p-value distributions on Pickrell data using DESeq2 and the GBM. (Left) CDFs of p-values for mock null comparisons over 50 random splits. (Middle) Same, but zoomed in to region shaded in left-hand figure. (Right) CDFs of p-values for testing for a difference between sequencing centers. The x-axis ends at the significance threshold for controlling FWER at 0.05.

Correctness of p-values under mock null comparisons. First, we assess the calibration of p-values for testing for differential expression between two conditions. Under the null hypothesis of no difference, the p-values would ideally be uniformly distributed on $[0,1]$. Since the Pickrell samples appear to be relatively homogeneous (when adjusting for sequencing center and cDNA concentration), we can assess how well this ideal is attained by randomly splitting the samples into two groups and testing for differential expression.

To this end, we add a sample covariate $z_{j4}$ consisting of a dummy variable for the assignment of samples to the two random groups. Thus, the null hypothesis of no difference for gene $i$ is $b_{i4}=0$ and the alternative is $b_{i4}\neq 0$. The (two-sided) p-value for gene $i$ is $p_{i}:=2\big(1-\Phi(|\hat{b}_{i4}/\hat{\mathrm{se}}(b_{i4})|)\big)$ where $\Phi(x)$ is the standard normal CDF (cumulative distribution function). Figure 7 shows the p-value CDF over all genes, aggregating over 50 random splits into two groups containing 80 and 81 samples, respectively. Both DESeq2 and the GBM yield p-values that are very close to the ideal uniform distribution. This indicates that both methods are accurately controlling the false positive rate.
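A minimal sketch of this p-value computation and of the Bonferroni threshold used in the sensitivity comparison below is as follows (illustrative; the arrays of coefficient estimates and standard errors are assumed inputs).

```python
# Hypothetical sketch: two-sided Wald p-values p_i = 2*(1 - Phi(|b_hat/se|)) and Bonferroni cutoff.
import numpy as np
from scipy.stats import norm

def wald_pvalues(b_hat, se):
    return 2 * (1 - norm.cdf(np.abs(b_hat / se)))

# Control FWER at 0.05 across I genes with the Bonferroni correction:
# significant = wald_pvalues(b_hat, se) < 0.05 / I
```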

Sensitivity to detect actual differences. To compare sensitivity, we test for differential expression between sequencing centers by computing p-values $p_{i}=2\big(1-\Phi(|\hat{b}_{i\ell}/\hat{\mathrm{se}}(b_{i\ell})|)\big)$ where $\ell$ is the index of the sequencing center covariate. Using Bonferroni to control the family-wise error rate (FWER) at 0.05, the numbers of genes detected as differentially expressed by the GBM and DESeq2 are 1038 and 892, respectively. Figure 7 shows the lower tail of the p-value CDFs. In these results, the GBM yields equal or greater sensitivity.

Figure 8: Visualization of Pickrell data using GBM latent factors, adjusting for covariates. Each dot represents one of the 161 samples, and each subject ID is indicated by a different combination of color and shape (by cycling through 32 colors and 3 shapes).

Visualization using GBM latent factors. The latent factors of the GBM provide a model-based approach to visualizing high-dimensional count data, while adjusting for covariates. To illustrate, we modify the model to use $M=2$. Figure 8 shows a scatterplot of $v_{j2}$ versus $v_{j1}$ for the estimated $V$ matrix. Observe that this yields very tightly grouped clusters of samples from the same subject. This is analogous to plotting the first two scores in principal components analysis (PCA). Thus, for comparison, Figure S15 shows the PCA plots based on (a) log-transformed TPMs (Transcripts per Million), specifically, $\log(\texttt{TPM}_{ij}+1)$, and (b) the variance stabilizing transform (VST) in the DESeq2 package, using the GC adjustment from CQN. The DESeq2 model does not estimate latent factors, which is why PCA is used in DESeq2. The TPM plot is very noisy in terms of subject ID clusters. The VST plot is better than TPMs, but still not quite as clean as the GBM plot.

Overall, in terms of sensitivity, controlling false positives, computation time, and visualization, these results suggest that the GBM performs very well. When $J$ is very small, it may be beneficial to augment the GBM to use DESeq2-like shrinkage estimates for $s_{i}$.

7.2 Analyzing GTEx data for aging-related genes

Next, we test our methods on an application of scientific interest, using RNA-seq data from the Genotype-Tissue Expression (GTEx) project (Melé et al., 2015), consisting of 8,551 samples from 30 tissues in the human body, obtained from 544 subjects. We apply the GBM to find genes whose expression changes with age, adjusting for technical biases. See Jia et al. (2018) and Zeng et al. (2020) for studies of age-related genes using GTEx.

We use the GTEx RNA-seq data from recount2 (Collado-Torres et al., 2017), downloaded from https://jhubiostatistics.shinyapps.io/recount on 8/7/2020 and normalized using the scale_counts function in the recount R library. We use the subset of 8,551 samples that passed GTEx quality control, and the subset of genes in chromosomes 1–22 that have an HGNC-approved gene symbol and have nonzero median across all samples.

Figure 9: Visualization of GTEx data using NB-GBM latent factors, adjusting for covariates. Each dot represents one of the 8,551 samples, and the color indicates the tissue type.

To visualize the samples, we take a random subset of 5,000 genes and estimate an NB-GBM with two latent factors and no sample covariates. For the gene covariates, we use $\log(\texttt{length}_{i})$, $\texttt{gc}_{i}$, and $(\texttt{gc}_{i}-\overline{\texttt{gc}})^{2}$, where $\texttt{length}_{i}$ is the sum of the exon lengths and $\texttt{gc}_{i}$ is the GC content of gene $i$. Thus, in this initial model for visualization, $I=5{,}000$, $J=8{,}551$, $K=4$, $L=1$, and $M=2$. Figure 9 shows the latent factors ($v_{j2}$ versus $v_{j1}$), similarly to Figure 8. The samples tend to fall into clusters according to the tissue from which they were taken. Some tissues, such as brain and blood, clearly contain two or more subclusters which turn out to correspond to subtissue types (Figure S16). Meanwhile, when more latent factors are used (that is, $M>2$), some clusters that overlap in Figure 9 become well-separated in higher latent dimensions. For comparison, running PCA on the log TPMs is not nearly as clear in terms of tissue/subtissue clusters (Figure S17).

Testing for age-related genes. To find genes that are related to aging, we add subject age as a sample covariate. Each gene then has a coefficient describing how its expression changes with age, and we compute a p-value for each gene to test whether its coefficient is nonzero. Due to the heterogeneity of tissue/subtissue types, we analyze each subtissue type separately. To perform both exploratory analysis and valid hypothesis testing, we used a random subset of 108 subjects during an exploratory model-building phase and then used the remaining 436 subjects during a testing phase with the selected model.

In the exploratory phase, we considered adjusting for various technical sample covariates and gene covariates, and varied $M$ from 0 to 10. For each model and each subtissue type, we used the GBM to find the set of genes exhibiting a significant association with age, controlling FWER at 0.05 using the Bonferroni correction. To score the relevance of each of these gene sets in terms of aging biology, we computed its F1 score for overlap with the set of aging-related genes identified by De Magalhães et al. (2009) (from https://genomics.senescence.info/gene_expression/signatures.html on 8/11/2020). Based on this exploratory analysis, we chose to keep $\log(\texttt{length}_{i})$, $\texttt{gc}_{i}$, and $(\texttt{gc}_{i}-\overline{\texttt{gc}})^{2}$ as gene covariates, and use smexncrt (exonic rate, the fraction of reads that map within exons) as well as age (subject age, coded as a numerical value in $\{25,35,\ldots,75\}$) as sample covariates. For each subtissue, we chose the $M$ that yielded the highest F1 score on the exploratory data.

In the testing phase, we apply the selected model for each subtissue to test for age-associated genes. For illustration, we present results for the “Heart - Left Ventricle” subtissue (Heart-LV), which had the highest F1 score across all subtissues on the exploratory data. We ran the GBM on the 176 Heart-LV samples in the test set, using the 19,853 genes with nonzero median across these samples, with $M=3$ based on the exploratory phase. Thus, in this model, $I=19{,}853$, $J=176$, $K=4$, $L=3$, and $M=3$.

Figure 10: Estimated PCMT1 expression based on log counts, log TPMs, and GBM residuals.

Results. We found 2,444 genes to be significantly associated with age in Heart-LV, controlling FWER at 0.05. For comparison, simple linear regression on the log TPMs yields only 1 significant gene; thus, the GBM has much greater power than this simple standard approach. To validate the biological relevance of the GBM hits, we compare with what is known from the aging literature. First, the top GBM hit for Heart-LV is PCMT1 (Entrez gene ID 5110) with a p-value of 1.1×10471.1\times 10^{-47}. PCMT1 is involved in the repair and degradation of damaged proteins, and is a well-known aging gene, being one of 307 human genes in the GenAge database (build version 20) from Tacutu et al. (2018). Figure 10 shows the estimated expression of PCMT1 versus age for the Heart-LV samples. The GBM-estimated expression exhibits a clear downward linear trend with age. For comparison, Figure 10 shows that the log TPMs are considerably noisier and the trend is much less clear. Simple linear regression on the log TPMs yields a p-value of 1.1×1031.1\times 10^{-3} for PCMT1, which does not reach the Bonferroni significance threshold of 0.05/I=2.5×1060.05/I=2.5\times 10^{-6}. Here, we define the GBM-estimated expression as the partial residual c^11+a^j1+b^i1+(c^1+b^i)zj+εij\hat{c}_{11}+\hat{a}_{j1}+\hat{b}_{i1}+(\hat{c}_{1\ell}+\hat{b}_{i\ell})z_{j\ell}+\varepsilon_{ij} where \ell is the index of the age column in ZZ, and εij\varepsilon_{ij} is the GBM residual (Section 2.3).
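As an illustration of the partial-residual definition above, the following sketch (our illustration with assumed array shapes and 0-based indexing, not the authors' code) computes the GBM-estimated expression of a single gene across all samples.

```python
import numpy as np

# Partial residual for gene i and sample covariate column ell (age):
#   c_hat[0,0] + a_hat[j,0] + b_hat[i,0] + (c_hat[0,ell] + b_hat[i,ell]) * z[j,ell] + eps[i,j]
# A_hat is J x K, B_hat is I x L, C_hat is K x L, Z is J x L, eps is the I x J residual matrix.
def gbm_estimated_expression(i, ell, A_hat, B_hat, C_hat, Z, eps):
    return (C_hat[0, 0] + A_hat[:, 0] + B_hat[i, 0]
            + (C_hat[0, ell] + B_hat[i, ell]) * Z[:, ell] + eps[i, :])
```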

To evaluate the GBM hits altogether for biological relevance to aging, we test for enrichment of Gene Ontology (GO) term gene sets using DAVID v6.8 (Huang et al., 2009a, b). We run DAVID on the top 1000 GBM hits for Heart-LV, using all 19,853 tested genes as the background list. (DAVID allows at most 1000 genes.) Tables S2 and S3 show the top 20 enriched GO terms in the Biological Process and Cellular Component categories. These results are highly consistent with known aging biology (López-Otín et al., 2013).

8 Application to cancer genomics

Next, we apply the GBM to estimate copy ratios for sequencing data from cancer cell lines. Copy ratio estimation is an essential step in detecting somatic copy number alterations (SCNAs), that is, duplications or deletions of segments of the genome. The input data is a matrix of counts where entry (i,j)(i,j) is the number of reads from sample jj that map to target region ii of the genome. The goal is to estimate the copy ratio of each region, that is, the relative concentration of copies of that region in the original DNA sample.

Simple estimates based on row and column normalization are very noisy and are contaminated by significant technical biases. State-of-the-art methods employ a panel of normals (that is, sequencing samples from non-cancer tissues) to estimate technical biases using principal components analysis (PCA), and then use linear regression to remove these biases from the cancer samples of interest. We take an analogous approach, first running a GBM on a panel of normals, and then running a GBM on the cancer samples using a feature covariate matrix XX that includes the UU matrix estimated from the panel of normals.

To assess performance, we compare with the state-of-the-art method provided by the Broad Institute’s Genome Analysis Toolkit (GATK) (Broad Institute, 2020) on the 326 whole-exome sequencing samples from the Cancer Cell Line Encyclopedia (CCLE) (Ghandi et al., 2019). These samples are from a wide range of cancer types, including lung, breast, colon, prostate, brain, and many others. We use the subset of 180,495 target regions that are in chromosomes 1–22 and have nonzero median count across the 326 samples.

Since there are essentially no normal samples in the CCLE dataset, we create a panel of pseudo-normals by taking a random subset of 163 samples as a training set and de-segmenting them to adjust for copy number alterations; see Section S3.2 for details. The remaining 163 samples are used as a test set. For the GBM, we use log(lengthi)\log(\texttt{length}_{i}), gci\texttt{gc}_{i}, and (gcigc¯)2(\texttt{gc}_{i}-\overline{\texttt{gc}})^{2} as region covariates, no sample covariates, and 5 latent factors. Thus, I=180,495I=180{,}495, J=163J=163, K=4K=4, L=1L=1, and M=5M=5 on the training set, while I=180,495I=180{,}495, J=163J=163, K=9K=9, L=1L=1, and M=0M=0 on the test set. We define the GBM copy ratio estimates as the exponentiated residuals Y~ij/μ^ij\tilde{Y}_{ij}/\hat{\mu}_{ij} where Y~ij:=Yij+1/8\tilde{Y}_{ij}:=Y_{ij}+1/8; see Section 2.3. The GBM took 10 minutes and 4.3 minutes to run on the training and test sets, respectively, while GATK took 3.3 minutes and 28 minutes on training and test, respectively. The slowness of GATK on the test set is likely due to having to run it separately on every test sample.

Figure 11: Copy ratio estimates for an illustrative sample from the CCLE data. The x-axis is genomic position, and each blue dot is the estimate for one region; moving averages are in red. For the GBM, regions with high and low precision estimates are plotted in blue and cyan, respectively.

Figure 11 shows the GBM and GATK copy ratio estimates for an illustrative sample from the test set. As a baseline, we also show the simple normalization-based estimates defined as Y~ij/(αiβj)\tilde{Y}_{ij}/(\alpha_{i}\beta_{j}) where αi=1Jj=1JY~ij\alpha_{i}=\frac{1}{J}\sum_{j=1}^{J}\tilde{Y}_{ij} and βj=1Ii=1IY~ij/αi\beta_{j}=\frac{1}{I}\sum_{i=1}^{I}\tilde{Y}_{ij}/\alpha_{i}. A major advantage of the GBM is that it provides uncertainty quantification. Here, the estimated precision (that is, the inverse variance) of each log copy ratio estimate is wijw_{ij} (Section 2.3). In the GBM plot in Figure 11, this is illustrated by using cyan for the regions with low relative precision; see Section S3.2. By downweighting regions with low estimated precision, downstream analyses such as SCNA detection can be made more accurate.
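The two copy-ratio estimates being compared can be written compactly as in the following sketch (our illustration, assuming Y is the I x J count matrix and mu_hat the fitted GBM mean matrix).

```python
import numpy as np

def copy_ratio_estimates(Y, mu_hat):
    Y_tilde = Y + 1.0 / 8.0
    # GBM estimate: exponentiated residual Y~_ij / mu_hat_ij
    cr_gbm = Y_tilde / mu_hat
    # Simple normalization baseline: alpha_i = (1/J) sum_j Y~_ij,
    # beta_j = (1/I) sum_i Y~_ij / alpha_i
    alpha = Y_tilde.mean(axis=1)
    beta = (Y_tilde / alpha[:, None]).mean(axis=0)
    cr_simple = Y_tilde / (alpha[:, None] * beta[None, :])
    return cr_gbm, cr_simple
```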

Figure 12: Performance of the GBM versus GATK on the 163 test samples from the CCLE whole-exome sequencing dataset. The GBM exhibits better performance in terms of both metrics.

To quantify performance, Figure 12 compares the GBM and GATK in terms of two performance metrics. The local relative standard error quantifies the variability of the log copy ratio estimates around a weighted moving average, accounting for the precision of each estimate. Meanwhile, the weighted median absolute difference quantifies the typical magnitude of the slope of a weighted moving average. On these data, the GBM exhibits better performance in terms of both metrics; see Section S3.2 for details. Figures S18 and S19 show the GBM and GATK copy ratio estimates for all 163 test samples. The GBM estimates are visibly less noisy than the GATK estimates.

Overall, the GBM appears to perform very well in terms of removing technical biases and denoising, particularly when using uncertainty quantification to downweight low precision regions. The improved performance appears to be due to (a) model-based uncertainty quantification and (b) using a robust probabilistic model for count data.

9 Conclusion

Generalized bilinear models provide a flexible framework for the analysis of matrix data, and the delta propagation method enables accurate GBM uncertainty quantification in modern applications. In future work, it would be interesting to extend to the more general model of Gabriel (1998), provide theoretical guarantees for delta propagation, and try applying delta propagation to other models.

Acknowledgments

We would like to thank Jonathan Huggins, Will Townes, Mehrtash Babadi, Samuel Lee, David Benjamin, Robert Klein, Samuel Markson, Philipp Hähnel, and Rafael Irizarry for many helpful conversations.

References

  • Agresti (2015) Agresti, A. Foundations of Linear and Generalized Linear Models. John Wiley & Sons, 2015.
  • Aitchison and Silvey (1958) Aitchison, J. and Silvey, S. Maximum-likelihood estimation of parameters subject to restraints. The Annals of Mathematical Statistics, pages 813–828, 1958.
  • Babadi et al. (2018) Babadi, M., Lee, S. K., and Smirnov, A. N. GATK gCNV: accurate germline copy-number variant discovery from sequencing read-depth data. The International Conference on Probabilistic Programming (PROBPROG), Oct 2018.
  • Benzécri (1973) Benzécri, J. L’analyse des Correspondances. L’analyse des Données, Vol. 2. Dunod, Paris, 1973.
  • Bishop (2006) Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
  • Blum et al. (2020) Blum, A., Hopcroft, J., and Kannan, R. Foundations of Data Science. Cambridge University Press, 2020.
  • Boyd and Vandenberghe (2004) Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
  • Broad Institute (2020) Broad Institute. Genome Analysis Toolkit (GATK) v4.1.8.1, 2020. URL https://gatk.broadinstitute.org/.
  • Buettner et al. (2017) Buettner, F., Pratanwanich, N., McCarthy, D. J., Marioni, J. C., and Stegle, O. f-scLVM: scalable and versatile factor analysis for single-cell RNA-seq. Genome Biology, 18(1):212, 2017.
  • Carroll et al. (1980) Carroll, J. D., Pruzansky, S., and Kruskal, J. B. CANDELINC: A general approach to multidimensional analysis of many-way arrays with linear constraints on parameters. Psychometrika, 45(1):3–24, 1980.
  • Carvalho et al. (2008) Carvalho, C. M., Chang, J., Lucas, J. E., Nevins, J. R., Wang, Q., and West, M. High-dimensional sparse factor modeling: applications in gene expression genomics. Journal of the American Statistical Association, 103(484):1438–1456, 2008.
  • Chadoeuf and Denis (1991) Chadoeuf, J. and Denis, J. B. Asymptotic variances for the multiplicative interaction model. Journal of Applied Statistics, 18(3):331–353, 1991.
  • Choulakian (1996) Choulakian, V. Generalized bilinear models. Psychometrika, 61(2):271–283, 1996.
  • Cochran (1943) Cochran, W. The comparison of different scales of measurement for experimental results. The Annals of Mathematical Statistics, 14(3):205–216, 1943.
  • Collado-Torres et al. (2017) Collado-Torres, L., Nellore, A., Kammers, K., Ellis, S. E., Taub, M. A., Hansen, K. D., Jaffe, A. E., Langmead, B., and Leek, J. T. Reproducible RNA-seq analysis using recount2. Nature Biotechnology, 35(4):319–321, 2017.
  • Davies and Tso (1982) Davies, P. and Tso, M. K.-S. Procedures for reduced-rank regression. Journal of the Royal Statistical Society: Series C (Applied Statistics), 31(3):244–255, 1982.
  • de Falguerolles (2000) de Falguerolles, A. GBMs: GLMs with bilinear terms. In COMPSTAT, pages 53–64. Springer, 2000.
  • De Magalhães et al. (2009) De Magalhães, J. P., Curado, J., and Church, G. M. Meta-analysis of age-related gene expression profiles identifies common signatures of aging. Bioinformatics, 25(7):875–881, 2009.
  • Denis and Gower (1996) Denis, J.-B. and Gower, J. C. Asymptotic confidence regions for biadditive models: Interpreting genotype-environment interactions. Journal of the Royal Statistical Society: Series C (Applied Statistics), 45(4):479–493, 1996.
  • Dorkenoo and Mathieu (1993) Dorkenoo, K. and Mathieu, J.-R. Etude d’un modele factoriel d’analyse de la variance comme modele lineaire generalise. Revue de Statistique Appliquée, 41(2):43–57, 1993.
  • Fisher and Mackenzie (1923) Fisher, R. and Mackenzie, W. Studies in Crop Variation: The Manurial Response of Different Potato Varieties. Journal of Agricultural Sciences, 13:311–320, 1923.
  • Freeman (1973) Freeman, G. Statistical methods for the analysis of genotype-environment interactions. Heredity, 31(3):339–354, 1973.
  • Fromer et al. (2012) Fromer, M., Moran, J. L., Chambert, K., Banks, E., Bergen, S. E., Ruderfer, D. M., Handsaker, R. E., McCarroll, S. A., O’Donovan, M. C., Owen, M. J., et al. Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth. The American Journal of Human Genetics, 91(4):597–607, 2012.
  • Gabriel (1978) Gabriel, K. R. Least squares approximation of matrices by additive and multiplicative models. Journal of the Royal Statistical Society: Series B (Methodological), 40(2):186–196, 1978.
  • Gabriel (1998) Gabriel, K. R. Generalised bilinear regression. Biometrika, 85(3):689–700, 1998.
  • Gabriel and Zamir (1979) Gabriel, K. R. and Zamir, S. Lower rank approximation of matrices by least squares with any choice of weights. Technometrics, 21(4):489–498, 1979.
  • Gauch (1988) Gauch, H. G. Jr. Model selection and validation for yield trials with interaction. Biometrics, pages 705–715, 1988.
  • Gauch (2006) Gauch, H. G. Jr. Statistical analysis of yield trials by AMMI and GGE. Crop Science, 46(4):1488–1500, 2006.
  • Gauch et al. (2008) Gauch, H. G. Jr., Piepho, H.-P., and Annicchiarico, P. Statistical analysis of yield trials by AMMI and GGE: Further considerations. Crop Science, 48(3):866–889, 2008.
  • Ghandi et al. (2019) Ghandi, M., Huang, F. W., Jané-Valbuena, J., Kryukov, G. V., Lo, C. C., McDonald, E. R., Barretina, J., Gelfand, E. T., Bielski, C. M., Li, H., et al. Next-generation characterization of the cancer cell line encyclopedia. Nature, 569(7757):503–508, 2019.
  • Gilbert (1963) Gilbert, N. Non-additive combining abilities. Genetics Research, 4(1):65–73, 1963.
  • Gollob (1968) Gollob, H. F. A statistical model which combines features of factor analytic and analysis of variance techniques. Psychometrika, 33(1):73–115, 1968.
  • Goodman (1979) Goodman, L. A. Simple models for the analysis of association in cross-classifications having ordered categories. Journal of the American Statistical Association, 74(367):537–552, 1979.
  • Goodman (1981) Goodman, L. A. Association models and canonical correlation in the analysis of cross-classifications having ordered categories. Journal of the American Statistical Association, 76(374):320–334, 1981.
  • Goodman (1986) Goodman, L. A. Some useful extensions of the usual correspondence analysis approach and the usual log-linear models approach in the analysis of contingency tables. International Statistical Review/Revue Internationale de Statistique, pages 243–270, 1986.
  • Goodman (1991) Goodman, L. A. Measures, models, and graphical displays in the analysis of cross-classified data. Journal of the American Statistical Association, 86(416):1085–1111, 1991.
  • Goodman and Haberman (1990) Goodman, L. A. and Haberman, S. J. The analysis of nonadditivity in two-way analysis of variance. Journal of the American Statistical Association, 85(409):139–145, 1990.
  • Gower (1989) Gower, J. Discussion of the paper by van der Heijden, de Falguerolles and de Leeuw. Applied Statistics, 38:273–276, 1989.
  • Greenacre (1984) Greenacre, M. J. Theory and Applications of Correspondence Analysis. Academic Press, London, 1984.
  • Halko et al. (2011) Halko, N., Martinsson, P.-G., and Tropp, J. A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
  • Hansen et al. (2012) Hansen, K. D., Irizarry, R. A., and Wu, Z. Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics, 13(2):204–216, 2012.
  • Hoff (2015) Hoff, P. D. Multilinear tensor regression for longitudinal relational data. The Annals of Applied Statistics, 9(3):1169, 2015.
  • Huang et al. (2009a) Huang, D. W., Sherman, B. T., and Lempicki, R. A. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Research, 37(1):1–13, 2009a.
  • Huang et al. (2009b) Huang, D. W., Sherman, B. T., and Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols, 4(1):44, 2009b.
  • Jia et al. (2018) Jia, K., Cui, C., Gao, Y., Zhou, Y., and Cui, Q. An analysis of aging-related genes derived from the Genotype-Tissue Expression project (GTEx). Cell Death Discovery, 4(1):1–14, 2018.
  • Jiang et al. (2015) Jiang, Y., Oldridge, D. A., Diskin, S. J., and Zhang, N. R. CODEX: A normalization and copy number variation detection method for whole exome sequencing. Nucleic Acids Research, 43(6):e39–e39, 2015.
  • Jørgensen (1987) Jørgensen, B. Exponential dispersion models. Journal of the Royal Statistical Society: Series B (Methodological), 49(2):127–145, 1987.
  • Killick et al. (2012) Killick, R., Fearnhead, P., and Eckley, I. A. Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500):1590–1598, 2012.
  • Krumm et al. (2012) Krumm, N., Sudmant, P. H., Ko, A., O’Roak, B. J., Malig, M., Coe, B. P., Quinlan, A. R., Nickerson, D. A., and Eichler, E. E. Copy number variation detection and genotyping from exome sequence data. Genome Research, 22(8):1525–1532, 2012.
  • Leek and Storey (2007) Leek, J. T. and Storey, J. D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics, 3(9):e161, 2007.
  • Leek and Storey (2008) Leek, J. T. and Storey, J. D. A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences, 105(48):18718–18723, 2008.
  • López-Otín et al. (2013) López-Otín, C., Blasco, M. A., Partridge, L., Serrano, M., and Kroemer, G. The hallmarks of aging. Cell, 153(6):1194–1217, 2013.
  • Love et al. (2014) Love, M. I., Huber, W., and Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12):550, 2014.
  • Mandel (1961) Mandel, J. Non-additivity in two-way analysis of variance. Journal of the American Statistical Association, 56(296):878–888, 1961.
  • Mandel (1969) Mandel, J. The partitioning of interaction in analysis of variance. Journal of Research of the National Bureau of Standards, Series B, 73:309–328, 1969.
  • Marchenko and Pastur (1967) Marchenko, V. A. and Pastur, L. A. Distribution of eigenvalues for some sets of random matrices. Matematicheskii Sbornik, 114(4):507–536, 1967.
  • Marioni et al. (2008) Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., and Gilad, Y. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research, 18(9):1509–1517, 2008.
  • Melé et al. (2015) Melé, M., Ferreira, P. G., Reverter, F., DeLuca, D. S., Monlong, J., Sammeth, M., Young, T. R., Goldmann, J. M., Pervouchine, D. D., Sullivan, T. J., et al. The human transcriptome across tissues and individuals. Science, 348(6235):660–665, 2015.
  • Perry and Pillai (2013) Perry, P. O. and Pillai, N. S. Degrees of freedom for combining regression with factor analysis. arXiv preprint arXiv:1310.7269, 2013.
  • Pickrell et al. (2010) Pickrell, J. K., Marioni, J. C., Pai, A. A., Degner, J. F., Engelhardt, B. E., Nkadori, E., Veyrieras, J.-B., Stephens, M., Gilad, Y., and Pritchard, J. K. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature, 464(7289):768–772, 2010.
  • Price et al. (2006) Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., and Reich, D. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909, 2006.
  • Risso et al. (2014) Risso, D., Ngai, J., Speed, T. P., and Dudoit, S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nature Biotechnology, 32(9):896–902, 2014.
  • Robinson et al. (2010) Robinson, M. D., McCarthy, D. J., and Smyth, G. K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–140, 2010.
  • Silvey (1959) Silvey, S. D. The Lagrangian multiplier test. The Annals of Mathematical Statistics, 30(2):389–407, 1959.
  • Silvey (1975) Silvey, S. D. Statistical Inference. CRC Press, 1975.
  • Stegle et al. (2010) Stegle, O., Parts, L., Durbin, R., and Winn, J. A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Comput Biol, 6(5):e1000770, 2010.
  • Sun et al. (2012) Sun, Y., Zhang, N. R., and Owen, A. B. Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data. The Annals of Applied Statistics, 6(4):1664–1688, 2012.
  • Tacutu et al. (2018) Tacutu, R., Thornton, D., Johnson, E., Budovsky, A., Barardo, D., Craig, T., Diana, E., Lehmann, G., Toren, D., Wang, J., et al. Human ageing genomic resources: New and updated databases. Nucleic Acids Research, 46(D1):D1083–D1090, 2018.
  • Takane and Shibayama (1991) Takane, Y. and Shibayama, T. Principal component analysis with external information on both subjects and variables. Psychometrika, 56(1):97–120, 1991.
  • Townes (2019) Townes, F. W. Generalized principal component analysis. arXiv preprint arXiv:1907.02647, 2019.
  • Tukey (1949) Tukey, J. W. One degree of freedom for non-additivity. Biometrics, 5(3):232–242, 1949.
  • Tukey (1962) Tukey, J. W. The future of data analysis. The Annals of Mathematical Statistics, 33(1):1–67, 1962.
  • Van Eeuwijk (1995) Van Eeuwijk, F. A. Multiplicative interaction in generalized linear models. Biometrics, pages 1017–1032, 1995.
  • Williams (1952) Williams, E. J. The interpretation of interactions in factorial experiments. Biometrika, 39(1-2):65–81, 1952.
  • Zeng et al. (2020) Zeng, L., Yang, J., Peng, S., Zhu, J., Zhang, B., Suh, Y., and Tu, Z. Transcriptome analysis reveals the difference between “healthy” and “common” aging and their connection with age-related diseases. Aging Cell, 19(3):e13121, 2020.


Supplementary material for “Inference in generalized bilinear models”

S1 Discussion

S1.1 Previous work

There is an extensive literature on models involving an unknown low-rank matrix, going by a variety of names including latent factor models, factor analysis models, multiplicative models, bi-additive models, and bilinear models. In particular, a large number of models can be viewed as special cases of generalized bilinear models (GBMs). Since a full review is beyond the scope of this article, we settle for covering the main threads in the literature.

S1.1.1 Normal bilinear models without covariates.

Principal components analysis (PCA) is equivalent to maximum likelihood estimation in a GBM with only the UDV𝚃UDV^{\mathtt{T}} term (K=0K=0, L=0L=0, M>0M>0), assuming normally distributed outcomes with common variance σ2\sigma^{2}. PCA (or equivalently, the SVD) is often performed after centering the rows and columns of the data matrix, and from a model-based perspective, this is equivalent to including intercepts (K=1K=1, L=1L=1, M>0M>0):

Yij=c+ai+bj+m=1Muimdmvjm+εij\displaystyle Y_{ij}=c+a_{i}+b_{j}+\sum_{m=1}^{M}u_{im}d_{m}v_{jm}+\varepsilon_{ij} (S1.1)

where εij\varepsilon_{ij} is a normal residual. Similarly, scaling the rows and columns is analogous to using a rank-one factorization of the variance, that is, εij𝒩(0,σi2σj2)\varepsilon_{ij}\sim\mathcal{N}(0,\sigma_{i}^{2}\sigma_{j}^{2}).

Estimation. Equation S1.1 is often called the AMMI (additive main effects and multiplicative interaction) model, and a range of techniques for using it have been developed (Gauch, 1988). The least squares fit of an AMMI model can be obtained by first fitting the linear terms c+ai+bjc+a_{i}+b_{j} ignoring the non-linear term, and then estimating the non-linear term m=1Muimdmvjm\sum_{m=1}^{M}u_{im}d_{m}v_{jm} using PCA on the residuals (Gilbert, 1963; Gollob, 1968; Mandel, 1969; Gabriel, 1978). Estimation is more difficult when each entry is allowed to have a different variance, that is, when εij𝒩(0,σij2)\varepsilon_{ij}\sim\mathcal{N}(0,\sigma_{ij}^{2}) with σij2\sigma_{ij}^{2} known; this is sometimes called a weighted AMMI model (Van Eeuwijk, 1995). To handle this, Gabriel and Zamir (1979) develop the criss-cross method of estimation, which successively fits the non-linear terms m=1,,Mm=1,\ldots,M, one by one, using weighted least squares.

Hypothesis testing for model selection. While most applications of PCA only use the estimates, without any uncertainty quantification, statistical research on the AMMI model has largely focused on hypothesis testing for which factors mm to include in the model. Early contributions on testing in this model were made by Fisher and Mackenzie (1923), Cochran (1943), Tukey (1949), Williams (1952), Mandel (1961), Gollob (1968), and Mandel (1969). Methods of this type are very widely used, particularly in the study of genotype-environment interactions in agronomy; see reviews by Freeman (1973), Gauch (2006), and Gauch et al. (2008).

Confidence regions for parameters. Uncertainty quantification for the AMMI model parameters has also been studied. Asymptotic covariance formulas for the least squares estimates have been given by Goodman and Haberman (1990), Chadoeuf and Denis (1991), Dorkenoo and Mathieu (1993) and Denis and Gower (1996) for the AMMI model and various special cases. These results are based on inverting the constraint-augmented Fisher information matrix (Aitchison and Silvey, 1958; Silvey, 1959); we use the same technique in Section S6 to estimate standard errors for UU and VV in our more general GBM model and we extend it using delta propagation.

S1.1.2 Normal bilinear models with covariates.

Estimation. In a wide-ranging article, Tukey (1962) discussed the possibility of combining regression with factor analysis, by factoring the matrix of residuals after adjusting for covariates. Indeed, for the case of normal outcomes with common variance, Gabriel (1978, Cor 3.1) showed that when using a model of the form 𝒀=XA𝚃+BZ𝚃+UDV𝚃+𝜺\bm{Y}=XA^{\mathtt{T}}+BZ^{\mathtt{T}}+UDV^{\mathtt{T}}+\bm{\varepsilon}, the least squares fit can be obtained simply by first fitting AA and BB using regression (ignoring UDV𝚃UDV^{\mathtt{T}}), then fitting UDV𝚃UDV^{\mathtt{T}} to the residuals. This can be viewed as a generalization of the AMMI estimation procedure. Takane and Shibayama (1991) extend the results of Gabriel (1978) by first fitting 𝒀=XA𝚃+BZ𝚃+XCZ𝚃+𝜺\bm{Y}=XA^{\mathtt{T}}+BZ^{\mathtt{T}}+XCZ^{\mathtt{T}}+\bm{\varepsilon} using least squares, and then using PCA to analyze the residuals as well as each fitted component of the model, that is, XA^𝚃X\hat{A}^{\mathtt{T}}, B^Z𝚃\hat{B}Z^{\mathtt{T}}, XC^Z𝚃X\hat{C}Z^{\mathtt{T}}, 𝜺^\hat{\bm{\varepsilon}}, and combinations thereof.

In a complementary direction, reduced-rank regression (Davies and Tso, 1982) and CANDELINC (Carroll et al., 1980) use least squares to fit models of the form 𝒀=XA𝚃+𝜺\bm{Y}=XA^{\mathtt{T}}+\bm{\varepsilon} and 𝒀=XCZ𝚃+𝜺\bm{Y}=XCZ^{\mathtt{T}}+\bm{\varepsilon}, respectively, where AA and CC are constrained to be low-rank.

Hypothesis testing and confidence regions. In the case of normal outcomes with common variance σ2\sigma^{2}, for the model with 𝒀=XA𝚃+BZ𝚃+UDV𝚃+𝜺\bm{Y}=XA^{\mathtt{T}}+BZ^{\mathtt{T}}+UDV^{\mathtt{T}}+\bm{\varepsilon}, Perry and Pillai (2013) show how to perform inference for univariate entries of AA and BB (and univariate linear projections, more generally) accounting for the uncertainty in UDV𝚃UDV^{\mathtt{T}} via an estimate of the degrees of freedom associated with the latent factors. Further, Perry and Pillai (2013) show that the problem can be reduced to the covariate-free case, 𝒀=UDV𝚃+𝜺\bm{Y}=UDV^{\mathtt{T}}+\bm{\varepsilon}, enabling one to use results on hypothesis testing in the AMMI model (Gollob, 1968; Mandel, 1969) which provide estimates of the degrees of freedom. However, this approach appears to rely on the assumption of normal outcomes with common variance.

S1.1.3 Generalized bilinear models without covariates.

In many applications, it is unreasonable to use a normal outcome model. A classical approach is to transform the data and then apply a normal outcome model; however, as discussed by Van Eeuwijk (1995), there is unlikely to be a transformation that simultaneously achieves (a) approximate normality, (b) common variance, and (c) additive effects.

A more principled approach is to extend the generalized linear model (GLM) framework to handle latent factors, as suggested by Gower (1989). Goodman’s RC models are early contributions in this direction (Goodman, 1979, 1981, 1986, 1991), consisting of count models with multinomial or Poisson outcomes where log(𝔼(Yij))=c+ai+bj+m=1Muimdmvjm\log(\mathbb{E}(Y_{ij}))=c+a_{i}+b_{j}+\sum_{m=1}^{M}u_{im}d_{m}v_{jm}. More generally, Van Eeuwijk (1995) develops the generalized AMMI (GAMMI) model, which is a GLM version of the AMMI model in Equation S1.1, specifically, g(𝔼(Yij))=c+ai+bj+m=1Muimdmvjmg(\mathbb{E}(Y_{ij}))=c+a_{i}+b_{j}+\sum_{m=1}^{M}u_{im}d_{m}v_{jm}. Van Eeuwijk (1995) introduces a coordinate descent algorithm and discusses approaches for choosing MM, however, he does not consider uncertainty quantification for parameters, does not estimate dispersion parameters, and only demonstrates the method on very small datasets (11×511\times 5 and 17×1217\times 12).

Correspondence analysis (Benzécri, 1973; Greenacre, 1984) is an SVD-based exploratory analysis method for matrices of categorical data, and has been reinvented under many names, such as reciprocal averaging and dual scaling (de Falguerolles, 2000). Correspondence analysis bears resemblance to estimation methods for the GAMMI model; however, it is primarily descriptive in perspective, and thus typically does not involve quantification of uncertainty.

S1.1.4 Generalized bilinear models with covariates.

Choulakian (1996) defines a class of GBMs of the same form as in this article, where g(𝔼(𝒀))=XA𝚃+BZ𝚃+XCZ𝚃+UDV𝚃g(\mathbb{E}(\bm{Y}))=XA^{\mathtt{T}}+BZ^{\mathtt{T}}+XCZ^{\mathtt{T}}+UDV^{\mathtt{T}} and gg is the (a) canonical, (b) identity, or (c) logarithmic link function. For the case of no covariates (that is, the GAMMI model), Choulakian (1996) proposes an estimation algorithm that involves univariate Fisher scoring updates, which is attractive for its simplicity, but may exhibit slow convergence or failure to converge due to strong dependencies among parameters. While the defined model class is general, some limitations of the paper by Choulakian (1996) are that uncertainty quantification is not addressed, the estimation algorithm is for the special case of GAMMI, a single common dispersion is assumed and estimation of dispersion is not addressed, identifiability constraints are not enforced, no initialization procedure is provided, and only very small datasets are considered (10×710\times 7 and 11×11×211\times 11\times 2).

Gabriel (1998) considers a very general class of models of the form g(𝔼(𝒀))=k=1KXkΘkZkg(\mathbb{E}(\bm{Y}))=\sum_{k=1}^{K}X_{k}\Theta_{k}Z_{k}, where XkX_{k} and ZkZ_{k} are observed matrices (for instance, covariates) and Θk\Theta_{k} is a low-rank matrix of parameters for each k=1,,Kk=1,\ldots,K. He extends the criss-cross estimation algorithm of Gabriel and Zamir (1979) to this model. While the model of Gabriel (1998) is very elegant, some limitations are that estimation is performed using a vectorization approach that is computationally prohibitive on large matrices, uncertainty quantification is not addressed, a common dispersion parameter is assumed for all entries, and only very small datasets are considered (10×910\times 9 and 17×217\times 2). Also, it is not clear what identifiability constraints are assumed on the Θk\Theta_{k} matrices.

In recent work, Townes (2019) considers a model of the form g(𝔼(𝒀))=XA𝚃+BZ𝚃+UDV𝚃+𝟏δ𝚃g(\mathbb{E}(\bm{Y}))=XA^{\mathtt{T}}+BZ^{\mathtt{T}}+UDV^{\mathtt{T}}+\bm{1}\delta^{\mathtt{T}} where 𝟏\bm{1} is a vector of ones and δJ\delta\in\mathbb{R}^{J} is a vector of fixed column-specific offsets. Townes (2019) derives diagonal approximations to Fisher scoring updates for 2\ell_{2}-penalized maximum likelihood estimation, and in a postprocessing stage, enforces orthogonality constraints to aid interpretability. Other differences compared to the present work are that only estimation is considered (uncertainty quantification is not addressed) and overdispersion parameters are not estimated.

The overview by de Falguerolles (2000) provides an interesting and insightful discussion of several threads in the literature.

S1.1.5 Recent applications of bilinear models.

Several authors have used bilinear models or GBMs in genetics and genomics, usually to remove unwanted variation such as batch effects. However, most of these methods do not fully account for uncertainty in the latent factors, which may lead to miscalibrated inferences such as overconfident p-values. For example, to remove batch effects in gene expression analysis, several approaches involve first estimating UDV𝚃UDV^{\mathtt{T}} and then treating VV as a known matrix of covariates, accounting for uncertainty only in UDUD using standard regression (Leek and Storey, 2007, 2008; Sun et al., 2012; Risso et al., 2014); this is also done to adjust for population structure in genetic association studies (Price et al., 2006). In copy number variation detection, it is common to simply treat the estimated UDV𝚃UDV^{\mathtt{T}} as known and subtract it off (Fromer et al., 2012; Krumm et al., 2012; Jiang et al., 2015).

Carvalho et al. (2008) use a Bayesian sparse factor analysis model with covariates, employing evolutionary stochastic search for model selection and Markov chain Monte Carlo (MCMC) for posterior inference within models. Stegle et al. (2010), Buettner et al. (2017), and Babadi et al. (2018) use complex hierarchical models that can be viewed as Bayesian GBMs with additional prior structure, and they employ variational methods for approximate posterior inference. Another application in which bilinear models have seen recent use is longitudinal relational data such as networks, for which Hoff (2015) employs an interesting Bayesian model with 𝒀t=AXtB𝚃+𝜺\bm{Y}_{t}=AX_{t}B^{\mathtt{T}}+\bm{\varepsilon}, where XtX_{t} is a matrix of observed covariates that depend on time tt.

S1.2 Challenges and solutions

Estimation and inference in large GBMs is complicated by a number of nontrivial challenges. In this section, we discuss several issues and how we resolve them.

Estimating the dispersion parameters. There are several issues with estimating the negative binomial (NB) dispersions 1/rij1/r_{ij}. First, since there is insufficient information to estimate all IJIJ dispersions individually, we use the rank-one parametrization 1/rij=exp(si+tj+ω)1/r_{ij}=\exp(s_{i}+t_{j}+\omega). Second, the choice of identifiability constraints matters: the natural choice of constraints, isi=0\sum_{i}s_{i}=0 and jtj=0\sum_{j}t_{j}=0, leads to noticeably biased estimates of sis_{i} and tjt_{j}, particularly for higher values; see Figure S6. Instead, we constrain 1Iiesi=1\frac{1}{I}\sum_{i}e^{s_{i}}=1 and 1Jjetj=1\frac{1}{J}\sum_{j}e^{t_{j}}=1, which empirically mitigates this bias. Third, the maximum likelihood estimates sometimes exhibit a severe downward bias, particularly for low values of log-dispersion; we use a simple heuristic bias correction to deal with this. Fourth, to avoid arithmetic underflow/overflow in the log-dispersion update steps, we develop carefully constructed expressions for the gradient and Hessian. Finally, to prevent occasional lack of convergence due to oscillating estimates, we employ an adaptive maximum step size.
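To make the first two points concrete, the following sketch (our illustration, assuming numpy arrays s and t and a scalar omega; not the authors' implementation) computes the inverse dispersions under the rank-one parametrization and recenters the log-dispersions so that the constraints hold without changing the model.

```python
import numpy as np

def inverse_dispersions(s, t, omega):
    # 1/r_ij = exp(s_i + t_j + omega), so r_ij = exp(-(s_i + t_j + omega))
    return np.exp(-(s[:, None] + t[None, :] + omega))

def recenter_log_dispersions(s, t, omega):
    # Enforce (1/I) sum_i e^{s_i} = 1 and (1/J) sum_j e^{t_j} = 1 by shifting
    # s and t and absorbing the shifts into omega, leaving every r_ij unchanged.
    ds = np.log(np.mean(np.exp(s)))
    dt = np.log(np.mean(np.exp(t)))
    return s - ds, t - dt, omega + ds + dt
```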

Inapplicability of standard GLM methods. Since XA𝚃+BZ𝚃+XCZ𝚃XA^{\mathtt{T}}+BZ^{\mathtt{T}}+XCZ^{\mathtt{T}} is linear in the parameters, one could vectorize and write it as vec(XA𝚃+BZ𝚃+XCZ𝚃)=X~β\mathrm{vec}(XA^{\mathtt{T}}+BZ^{\mathtt{T}}+XCZ^{\mathtt{T}})=\tilde{X}\beta where β=(vec(A)𝚃,vec(B)𝚃,vec(C)𝚃)𝚃JK+IL+KL\beta=(\mathrm{vec}(A)^{\mathtt{T}},\mathrm{vec}(B)^{\mathtt{T}},\mathrm{vec}(C)^{\mathtt{T}})^{\mathtt{T}}\in\mathbb{R}^{JK+IL+KL} and X~IJ×(JK+IL+KL)\tilde{X}\in\mathbb{R}^{IJ\times(JK+IL+KL)} is a function of XX and ZZ. In principle, one could then apply standard GLM estimation methods for estimating β\beta to construct a joint update to (A,B,C)(A,B,C). However, this vectorization approach is only computationally feasible for small data matrices since computing the matrix inverse (X~𝚃X~)1(\tilde{X}^{\mathtt{T}}\tilde{X})^{-1} takes on the order of (JK+IL+KL)3(JK+IL+KL)^{3} time. Further, this update would need to be done repeatedly since DD, UU, VV, SS, TT, and ω\omega also need to be simultaneously estimated, and the vectorization approach does not help estimate these parameters.
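To see why this is infeasible, the sketch below (our illustration, small dimensions only) constructs the design matrix column by column; merely storing it requires IJ(JK + IL + KL) entries, before any matrix inversion is attempted.

```python
import numpy as np

# Naive construction of X_tilde such that vec(X A^T + B Z^T + X C Z^T) = X_tilde @ beta,
# where beta stacks vec(A), vec(B), vec(C) in column-major order.
def build_design(X, Z):
    I, K = X.shape
    J, L = Z.shape
    vec = lambda M: M.flatten(order="F")  # column-major vectorization
    cols = []
    for k in range(K):              # columns for vec(A), with A of size J x K
        for j in range(J):
            E = np.zeros((J, K)); E[j, k] = 1.0
            cols.append(vec(X @ E.T))
    for ell in range(L):            # columns for vec(B), with B of size I x L
        for i in range(I):
            E = np.zeros((I, L)); E[i, ell] = 1.0
            cols.append(vec(E @ Z.T))
    for ell in range(L):            # columns for vec(C), with C of size K x L
        for k in range(K):
            E = np.zeros((K, L)); E[k, ell] = 1.0
            cols.append(vec(X @ E @ Z.T))
    return np.column_stack(cols)
```

For small I and J one can check numerically that multiplying this matrix by the stacked vector of vec(A), vec(B), and vec(C) reproduces the vectorized linear predictor, but at the scale of the applications in Sections 7 and 8 the matrix cannot even be formed.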

Inapplicability of the singular value decomposition. At first glance, it might appear that the singular value decomposition (SVD) would make it straightforward to estimate UU, DD, and VV given the other parameters. However, when used for estimation, the SVD implicitly assumes that every entry has the same variance. This is far from true in GBMs, and consequently, naively using the SVD to update UDV𝚃UDV^{\mathtt{T}} leads to poor estimation accuracy. The criss-cross algorithm of Gabriel and Zamir (1979) yields a low-rank matrix factorization that accounts for entry-specific variances, and our algorithm provides another way of doing this while adjusting for covariates in a GBM. In our algorithm, we only directly use the SVD for enforcing the identifiability constraints, not for estimation of UDV𝚃UDV^{\mathtt{T}}.

Computational efficiency. The genomics applications in Sections 7 and 8 involve large count matrices 𝒀I×J\bm{Y}\in\mathbb{R}^{I\times J} where the number of features II is on the order of 10410^{4} to 10610^{6} and the number of samples JJ can be as large as 10410^{4} or more. Consequently, computational efficiency is essential for practical usage of the method. For estimation, we exploit the special structure of the GBM to derive computationally efficient Fisher scoring updates to each component of the model. For inference, we develop a novel method for efficiently propagating uncertainty between components of the model. Assuming max{K2,L2,M}JI\max\{K^{2},L^{2},M\}\leq J\leq I, our estimation algorithm takes O(IJmax{K2,L2,M2})O(IJ\max\{K^{2},L^{2},M^{2}\}) time per iteration, and our inference algorithm requires O(IJmax{K3,L3,JM3})O(IJ\max\{K^{3},L^{3},JM^{3}\}) time, making them computationally feasible on large data matrices.

Numerical stability. Using a good choice of initialization is crucial for numerical stability. To initialize the estimation algorithm, we analytically solve for values of AA, BB, and CC to approximate the data matrix and then, for NB-GBMs, we iteratively update SS and TT for a few iterations. Even with a good initialization, optimization methods occasionally diverge. In a large GBM, there are so many parameters that even occasional divergences cause the algorithm to fail with high probability. We reduce the frequency of divergences to be negligible by enforcing a bound on the norm of the optimization steps; see Section 3.

Enforcing identifiability constraints. Rather than performing constrained optimization steps, we use a combination of unconstrained optimization steps and likelihood-preserving projections onto the constrained parameter space. Although the construction of likelihood-preserving projections in a GBM is not obvious, we show that they can be efficiently computed using simple linear algebra operations. This optimization-projection approach has a number of advantages; see Section S1.3 for further discussion.

Dependencies in latent factors. Optimizing the latent factor term UDV𝚃UDV^{\mathtt{T}} is challenging due to the dependencies among UU, DD, and VV as well as the orthonormality constraints U𝚃U=IU^{\mathtt{T}}U=\mathrm{I} and V𝚃V=IV^{\mathtt{T}}V=\mathrm{I}. Consequently, updating UU, DD, and VV individually does not seem to work well. To resolve this issue, we relax the dependencies and constraints by defining G:=UDG:=UD and H:=VDH:=VD, and updating GG, HH, and DD separately.

Prior / regularization. To improve estimation accuracy in a high-dimensional setting, we place independent normal priors on the entries of AA, BB, CC, DD, UU, VV, SS, and TT, and use maximum a posteriori (MAP) estimation, which is equivalent to 2\ell_{2}-penalization/shrinkage for this choice of prior. An additional benefit of using priors is that it improves the numerical stability of the estimation algorithm. See Section S9 for prior details.

S1.3 Enforcing the GBM identifiability constraints

It might seem preferable to perform unconstrained optimization throughout the estimation algorithm until convergence, and then enforce the identifiability constraints as a postprocessing step. However, in general, this would not converge to a local optimum in the constrained space because the prior does not have the same invariance properties as the likelihood. Thus, we maintain the constraints throughout the algorithm by applying a projection at each step.

When updating each component (AA, BB, CC, DD, UU, VV, SS, and TT), rather than using a constrained optimization step such as equality-constrained Newton’s method (Boyd and Vandenberghe, 2004), we use an unconstrained optimization step followed by a likelihood-preserving projection onto the constrained space.

It is crucial to preserve the likelihood when projecting onto the constrained space, since otherwise the projection might undo all the gains obtained by the unconstrained optimization step; in short, we might end up “taking one step forward and two steps back.” To this end, we employ likelihood-preserving projections for each component of the GBM. By Theorem 5.4, the likelihood is invariant under these operations and the projected values satisfy the identifiability constraints. The optimization-projection approach has several major advantages.

1. In the likelihood surface, there can be strong dependencies among the parameters within each row of AA, BB, UU, and VV, whereas the between-row dependencies are much weaker (specifically, they have zero Fisher cross-information). Thus, it is desirable to optimize each row jointly; however, this is complicated by the fact that the constraints create dependencies between rows. Consequently, using equality-constrained Newton appears to be computationally infeasible since it would require a joint update of each parameter matrix in its entirety.

2. Since each optimization-projection step modifies multiple components of the GBM, it effectively performs a joint update on multiple components. For instance, the likelihood-preserving projection for AA also modifies CC, so the optimization-projection step on AA is effectively a joint update to AA and CC. This has the effect of enlarging the constrained space within which each update takes place, improving convergence.

3. For UU and VV, the constrained space is particularly difficult to optimize over since it involves not only within-column linear dependencies (X𝚃U=0X^{\mathtt{T}}U=0 and Z𝚃V=0Z^{\mathtt{T}}V=0), but also quadratic dependencies within and between columns (U𝚃U=IU^{\mathtt{T}}U=\mathrm{I} and V𝚃V=IV^{\mathtt{T}}V=\mathrm{I}). The optimization-projection approach makes it easy to handle these constraints.

4. It is straightforward to perform unconstrained optimization for each component separately, and the projections that we derive turn out to be very easy to apply.

S2 Additional simulation results and details

We present additional simulation results and details supplementing Section 6.

Simulating covariates, parameters, and data

Here, we provide the details of how the simulation data are generated. First, the covariates are generated using a copula model as follows. We describe the procedure for the feature covariate matrix XI×KX\in\mathbb{R}^{I\times K}; the sample covariate matrix ZJ×LZ\in\mathbb{R}^{J\times L} is generated in the same way but with JJ and LL in place of II and KK. We generate a random covariance matrix Σ=Q𝚃Q\Sigma=Q^{\mathtt{T}}Q where the entries of QK×KQ\in\mathbb{R}^{K\times K} are qkk𝒩(0,1)q_{kk^{\prime}}\sim\mathcal{N}(0,1) i.i.d., and then we compute the resulting correlation matrix Σ~K×K\tilde{\Sigma}\in\mathbb{R}^{K\times K} by setting Σ~kk=Σkk/ΣkkΣkk\tilde{\Sigma}_{kk^{\prime}}=\Sigma_{kk^{\prime}}/\sqrt{\Sigma_{kk}\Sigma_{k^{\prime}k^{\prime}}}. We generate (x~i1,,x~iK)𝚃𝒩(0,Σ~)(\tilde{x}_{i1},\ldots,\tilde{x}_{iK})^{\mathtt{T}}\sim\mathcal{N}(0,\tilde{\Sigma}) i.i.d. for i=1,,Ii=1,\ldots,I, and define XI×KX\in\mathbb{R}^{I\times K} by setting xik=h(F1(Φ(x~ik)))x_{ik}=h(F^{-1}(\Phi(\tilde{x}_{ik}))) where h(x)=sign(x)min{100,|x|}h(x)=\mathrm{sign}(x)\min\{100,|x|\}, Φ(x)\Phi(x) is the standard normal CDF, and F1F^{-1} is the generalized inverse CDF for the desired marginal distribution, which we take to be 𝒩(0,1)\mathcal{N}(0,1) for the Normal scheme, Gamma(2,2)\operatorname*{Gamma}(2,\sqrt{2}) for the Gamma scheme, and Bernoulli(1/2)\operatorname*{Bernoulli}(1/2) for the Binary scheme. Finally, we standardize XX by setting xi1=1x_{i1}=1 for all ii and centering/scaling so that i=1Ixik=0\sum_{i=1}^{I}x_{ik}=0 and 1Ii=1Ixik2=1\frac{1}{I}\sum_{i=1}^{I}x_{ik}^{2}=1 for k2k\geq 2.
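The covariate simulation can be sketched as follows (our illustration, not the authors' code, shown for the Normal marginal scheme; the frozen scipy distribution passed as marginal would be swapped for a Gamma or Bernoulli distribution in the other schemes).

```python
import numpy as np
from scipy import stats

def simulate_covariates(I, K, marginal=stats.norm(), seed=0):
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((K, K))
    Sigma = Q.T @ Q
    d = np.sqrt(np.diag(Sigma))
    Corr = Sigma / np.outer(d, d)                        # correlation matrix
    X = rng.multivariate_normal(np.zeros(K), Corr, size=I)
    X = marginal.ppf(stats.norm.cdf(X))                  # x = F^{-1}(Phi(x_tilde))
    X = np.sign(X) * np.minimum(100.0, np.abs(X))        # h(x): cap the magnitude at 100
    X[:, 0] = 1.0                                        # intercept column
    X[:, 1:] -= X[:, 1:].mean(axis=0)                    # sum_i x_ik = 0 for k >= 2
    X[:, 1:] /= np.sqrt((X[:, 1:] ** 2).mean(axis=0))    # (1/I) sum_i x_ik^2 = 1 for k >= 2
    return X
```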

The true parameters A0A_{0}, B0B_{0}, C0C_{0}, D0D_{0}, U0U_{0}, V0V_{0}, S0S_{0}, T0T_{0}, and ω0\omega_{0} are then generated as follows. First, we generate matrices A~\tilde{A}, B~\tilde{B}, and C~\tilde{C} with i.i.d. entries as follows: (Normal scheme) a~jk𝒩(0,1/(4K))\tilde{a}_{jk}\sim\mathcal{N}(0,1/(4K)), b~i𝒩(0,1/(4L))\tilde{b}_{i\ell}\sim\mathcal{N}(0,1/(4L)), and c~k𝒩(0,1/(KL))+3 1(k=1,=1)\tilde{c}_{k\ell}\sim\mathcal{N}(0,1/(KL))+3\,\mathds{1}(k=1,\ell=1), or (Gamma scheme) a~jkGamma(2, 22K)\tilde{a}_{jk}\sim\operatorname*{Gamma}(2,\,2\sqrt{2K}), b~iGamma(2, 22L)\tilde{b}_{i\ell}\sim\operatorname*{Gamma}(2,\,2\sqrt{2L}), and c~kGamma(2,2KL)+3 1(k=1,=1)\tilde{c}_{k\ell}\sim\operatorname*{Gamma}(2,\,\sqrt{2KL})+3\,\mathds{1}(k=1,\ell=1). These distributions are defined so that the scale of the entries of XA~𝚃X\tilde{A}^{\mathtt{T}}, B~Z𝚃\tilde{B}Z^{\mathtt{T}}, and XC~Z𝚃X\tilde{C}Z^{\mathtt{T}} is not affected by KK and LL.

Then we set A0=A~Z(Z+A~)A_{0}=\tilde{A}-Z(Z^{\texttt{\small{+}}}\tilde{A}), B0=B~X(X+B~)B_{0}=\tilde{B}-X(X^{\texttt{\small{+}}}\tilde{B}), and C0=C~C_{0}=\tilde{C}. Next, we set U0=U~X(X+U~)U_{0}=\tilde{U}-X(X^{\texttt{\small{+}}}\tilde{U}) and V0=V~Z(Z+V~)V_{0}=\tilde{V}-Z(Z^{\texttt{\small{+}}}\tilde{V}) where U~I×M\tilde{U}\in\mathbb{R}^{I\times M} and V~J×M\tilde{V}\in\mathbb{R}^{J\times M} are sampled uniformly from their respective Stiefel manifolds, that is, uniformly subject to U~𝚃U~=I\tilde{U}^{\mathtt{T}}\tilde{U}=\mathrm{I} and V~𝚃V~=I\tilde{V}^{\mathtt{T}}\tilde{V}=\mathrm{I}. The diagonal entries of D0D_{0} are evenly spaced from I+J\sqrt{I}+\sqrt{J} to 2(I+J)2(\sqrt{I}+\sqrt{J}); this scaling is motivated by the Marchenko–Pastur law for the distribution of singular values of I×JI\times J random matrices (Marchenko and Pastur, 1967). For the log-dispersion parameters (S0S_{0}, T0T_{0}, and ω0\omega_{0}), we generate s~i,t~j𝒩(0,1)\tilde{s}_{i},\tilde{t}_{j}\sim\mathcal{N}(0,1) i.i.d., and set ω0=2.3\omega_{0}=-2.3, s0i=s~ilog(1Ii=1Iexp(s~i))s_{0i}=\tilde{s}_{i}-\log(\frac{1}{I}\sum_{i=1}^{I}\exp(\tilde{s}_{i})), and t0j=t~jlog(1Jj=1Jexp(t~j))t_{0j}=\tilde{t}_{j}-\log(\frac{1}{J}\sum_{j=1}^{J}\exp(\tilde{t}_{j})).

Given the true parameters and covariates, the data matrix 𝒀{0,1,2,}I×J\bm{Y}\in\{0,1,2,\ldots\}^{I\times J} is generated as follows. We compute the mean matrix μ0:=g1(XA0𝚃+B0Z𝚃+XC0Z𝚃+U0D0V0𝚃)\mu_{0}:=g^{-1}(XA_{0}^{\mathtt{T}}+B_{0}Z^{\mathtt{T}}+XC_{0}Z^{\mathtt{T}}+U_{0}D_{0}V_{0}^{\mathtt{T}}), where the inverse link function g1(x)=exg^{-1}(x)=e^{x} is applied element-wise, and we compute the inverse dispersions r0ij:=exp(s0it0jω0)r_{0ij}:=\exp(-s_{0i}-t_{0j}-\omega_{0}). Then we sample Yij𝒟(μ0ij,r0ij)Y_{ij}\sim\mathcal{D}(\mu_{0ij},r_{0ij}) where 𝒟(μ,r)=NegBin(μ,r)\mathcal{D}(\mu,r)=\operatorname*{NegBin}(\mu,r) in the NB scheme, 𝒟(μ,r)=LNP(μ,log(1/r+1))\mathcal{D}(\mu,r)=\operatorname*{LNP}(\mu,\,\log(1/r+1)) in the LNP scheme, 𝒟(μ,r)=Poisson(μ)\mathcal{D}(\mu,r)=\operatorname*{Poisson}(\mu) in the Poisson scheme, or 𝒟(μ,r)=Geometric(1/(μ+1))\mathcal{D}(\mu,r)=\operatorname*{Geometric}(1/(\mu+1)) in the Geometric scheme, so that 𝔼(Yij)=μ0ij\mathbb{E}(Y_{ij})=\mu_{0ij} in each case. Here, for y{0,1,2,}y\in\{0,1,2,\ldots\}, Geometric(p)\operatorname*{Geometric}(p) has p.m.f. f(y)=(1p)ypf(y)=(1-p)^{y}p, whereas LNP(μ,σ2)\operatorname*{LNP}(\mu,\sigma^{2}) has p.m.f.

f(y)=Poisson(y|λ)LogNormal(λlog(μ)12σ2,σ2)dλ\displaystyle f(y)=\int\operatorname*{Poisson}(y|\lambda)\operatorname*{LogNormal}(\lambda\mid\log(\mu)-\tfrac{1}{2}\sigma^{2},\;\sigma^{2})\,d\lambda (S2.1)

for μ>0\mu>0 and σ2>0\sigma^{2}>0. These outcome distributions are defined so that in each case, if Y𝒟(μ,r)Y\sim\mathcal{D}(\mu,r) then 𝔼(Y)=μ\mathbb{E}(Y)=\mu. Further, in the LNP case, Var(Y)=μ+μ2/r\mathrm{Var}(Y)=\mu+\mu^{2}/r, so the interpretation of rr is the same as in the NB case.
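For instance, a draw from the LNP outcome distribution in Equation (S2.1) can be generated as in the following sketch (our illustration, with the hypothetical helper sample_lnp).

```python
import numpy as np

def sample_lnp(mu, r, rng=None):
    # lambda ~ LogNormal(log(mu) - sigma^2/2, sigma^2) with sigma^2 = log(1/r + 1),
    # then Y ~ Poisson(lambda), so that E(Y) = mu and Var(Y) = mu + mu^2 / r.
    rng = np.random.default_rng() if rng is None else rng
    sigma2 = np.log(1.0 / r + 1.0)
    lam = np.exp(rng.normal(np.log(mu) - sigma2 / 2.0, np.sqrt(sigma2)))
    return rng.poisson(lam)
```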

Consistency and statistical efficiency – Details on Section 6.1

In these simulations, to accurately measure the trend with increasing II, we generate the covariates, true parameters, and data with I=10000I=10000 and project them onto the lower-dimensional spaces for smaller II values; for 𝒀\bm{Y}, XX, and S0S_{0} this projection simply consists of taking the first II rows/entries, ZZ and T0T_{0} are unaffected by the projection, and A0A_{0}, B0B_{0}, C0C_{0}, D0D_{0}, U0U_{0}, and V0V_{0} are projected by matching the first II rows of the mean matrix μ0\mu_{0}.

We use the relative MSE rather than the MSE to facilitate interpretability, since this puts the errors on a common scale that does not depend on the magnitude of the parameters. For instance, the relative MSE for AA is defined as

MSErel(A,A0)=j=1Jk=1K|ajka0jk|2j=1Jk=1K|a0jk|2\mathrm{MSE}_{\mathrm{rel}}(A,A_{0})=\frac{\sum_{j=1}^{J}\sum_{k=1}^{K}|a_{jk}-a_{0jk}|^{2}}{\sum_{j=1}^{J}\sum_{k=1}^{K}|a_{0jk}|^{2}}

where AA is the estimate and A0A_{0} is the true value.
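In code, this is simply (our illustration):

```python
import numpy as np

# Relative MSE: squared error between estimate and truth, normalized by the
# squared magnitude of the truth, aggregated over all entries.
def relative_mse(estimate, truth):
    return np.sum((estimate - truth) ** 2) / np.sum(truth ** 2)
```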

Figure S1 shows the relative MSE plots for all GBM parameter components. For AA, CC, VV, and TT, the relative MSE appears to be decreasing to zero. The trend for DD and ω\omega is suggestive but not as clear, making it difficult to gauge whether DD and ω\omega are likely to be consistent based on these experiments. The relative MSEs for BB, UU, and SS are small but do not appear to be going to zero as II\to\infty with JJ fixed; this is expected since for these parameters the amount of data informing each univariate entry is fixed.

Figure S1: Relative mean-squared error between estimated and true parameter values. Each plot contains 5050 thin lines, one for each run, along with the median over the 5050 runs (thick blue line).

Accuracy of standard errors – Details on Section 6.2

To estimate the actual coverage at every target coverage level from 0% to 100%, we use the fact that the actual coverage of a 100×(1α)%100\times(1-\alpha)\% interval for some parameter θ\theta can be written as

(|θ^θ0|<zα/2se^)=(12(1Φ(|θ^θ0|/se^))<1α)\mathbb{P}(|\hat{\theta}-\theta_{0}|<z_{\alpha/2}\,\hat{\mathrm{se}})=\mathbb{P}\Big{(}1-2\big{(}1-\Phi(|\hat{\theta}-\theta_{0}|/\hat{\mathrm{se}})\big{)}<1-\alpha\Big{)}

where zα/2=Φ1(1α/2)z_{\alpha/2}=\Phi^{-1}(1-\alpha/2) and Φ(x)\Phi(x) is the 𝒩(0,1)\mathcal{N}(0,1) CDF. Thus, since 1α1-\alpha is the target coverage, the curve of actual coverage versus target coverage is simply the CDF of the random variable 12(1Φ(|θ^θ0|/se^))1-2\big{(}1-\Phi(|\hat{\theta}-\theta_{0}|/\hat{\mathrm{se}})\big{)}. The plots in Figure 5 are empirical CDFs of this random variable, aggregating across all entries of each parameter matrix/vector.
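Concretely, the curves in Figure 5 can be computed as in the following sketch (our illustration), where theta_hat, theta_0, and se_hat are flattened arrays over all entries of a parameter matrix.

```python
import numpy as np
from scipy.stats import norm

def coverage_curve(theta_hat, theta_0, se_hat):
    # Actual coverage as a function of target coverage is the empirical CDF of
    # 1 - 2*(1 - Phi(|theta_hat - theta_0| / se_hat)).
    u = 1.0 - 2.0 * (1.0 - norm.cdf(np.abs(theta_hat - theta_0) / se_hat))
    target = np.sort(u)                            # x-axis: target coverage
    actual = np.arange(1, u.size + 1) / u.size     # y-axis: empirical CDF (actual coverage)
    return target, actual
```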

Figure S2: Log posterior density (plus constant) versus iteration for the estimation algorithm.

Algorithm convergence – Details on Section 6.3

Figure S2 shows the results of the algorithm convergence experiments described in Section 6.3. For each II, the plot shows 25 curves of the log-likelihood+log-prior (plus a constant) versus iteration number, one for each of the 25 simulation runs. For visual interpretability, we add a constant to each curve such that the final value after iteration 5050 is equal to zero. Based on our experiments, the estimation algorithm converges rapidly.

Robustness to the outcome distribution – Details on Section 6.4

Figures S3, S4, and S5 show the results of the robustness experiments in Section 6.4. With LNP outcomes (Figure S3), the results are very similar to when the true outcome distribution is actually NB (Figures S1 and 5). With Poisson outcomes (Figure S4), the estimation accuracy is even better than with NB outcomes, presumably because there is less variability and the Poisson distribution is a limiting case of NB when the dispersion goes to zero. The coverage with Poisson outcomes is essentially the same as NB, slightly better for some parameters and slightly worse for others. With Geometric outcomes (Figure S5), the same parameters appear to be consistently estimated, however, the estimation accuracy in terms of relative MSE is worse by roughly a factor of 10. Compared to NB outcomes, the coverage with Geometric outcomes is similar for some parameters and slightly worse for others. Overall, it appears that the estimation and inference algorithms are quite robust to misspecification of the outcome distribution.

Figure S3: Robustness to outcome: K=4K=4, L=2L=2, M=3M=3, LNP/Normal/Normal.
Figure S4: Robustness to outcome: K=4K=4, L=2L=2, M=3M=3, Poisson/Normal/Normal.
Figure S5: Robustness to outcome: K=4K=4, L=2L=2, M=3M=3, Geometric/Normal/Normal.

Convergence to global optimum

Based on the consistency experiments reported in Figure S1, it seems unlikely that the algorithm is getting stuck in a suboptimal local mode. To more directly assess whether the algorithm is converging to a global optimum, we compare the estimates obtained when (a) initializing the estimation algorithm to the true parameter values versus (b) initializing using our proposed approach. Table S1 shows that the difference between these estimates is negligible. In detail, we generate 5050 data matrices using the NB/Normal/Normal scheme with I=1000I=1000, J=100J=100, K=4K=4, L=2L=2, and M=3M=3. For each data matrix, we run the estimation algorithm with initialization approaches (a) and (b) and compute the relative MSE between the two resulting estimates. In Table S1, for each parameter we report the largest observed value of the relative MSE over these 5050 simulation runs. The maximum relative MSE is extremely small in each case, meaning that initializing at the true parameter values yields nearly identical estimates as initializing with our proposed approach. This suggests that our estimation algorithm is not getting stuck in a suboptimal local mode.

Table S1: Initialization error. Maximum relative MSE between estimates when initializing at the true values versus the proposed initialization approach, over 50 runs with I=1000I=1000 and J=100J=100.
AA BB CC DD UU VV SS TT ω\omega
2×1072\times 10^{-7} 9×1079\times 10^{-7} 7×1097\times 10^{-9} 1×1081\times 10^{-8} 4×1064\times 10^{-6} 3×1073\times 10^{-7} 3×1073\times 10^{-7} 4×1084\times 10^{-8} 2×1092\times 10^{-9}

Choice of identifiability constraint on log-dispersions

The choice of identifiability constraint on the log-dispersions SS and TT has a significant effect on estimation accuracy. Perhaps the most obvious choice would be sum-to-zero constraints: isi=0\sum_{i}s_{i}=0 and jtj=0\sum_{j}t_{j}=0. However, it turns out that this leads to poor performance in terms of estimation accuracy. The reason is that low log-dispersions (for instance, around tj2t_{j}\approx-2 or less) are difficult to estimate accurately, and noisy estimates of a few low values of tjt_{j} can have an undue effect on jtj\sum_{j}t_{j}, causing significant error in the rest of the tjt_{j}’s. Figure S6 (left panel) illustrates the issue in a simulated example using the NB/Normal/Normal scheme with I=10000I=10000, J=100J=100, K=4K=4, L=2L=2, and M=3M=3. Here, the sum-to-zero constraints are used on both the true parameters and the estimates.

Figure S6: Scatterplots of \hat{t}_{j}-t_{0j} versus t_{0j} for a typical simulated data matrix.

In our proposed algorithm, we instead constrain \frac{1}{I}\sum_{i}e^{s_{i}}=1 and \frac{1}{J}\sum_{j}e^{t_{j}}=1, which has the effect of downweighting the low values (which are harder to estimate) and upweighting the large values (which are easier to estimate). As illustrated in Figure S6 (right panel), this effectively resolves the issue. Here, we run the same algorithm on the same data matrix, but instead use our proposed constraints. We can see that the undesirable shift is effectively removed.
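To make the constraint concrete, the following is a minimal Python/NumPy sketch (ours, not necessarily the paper's exact implementation) of one likelihood-preserving way to impose it, assuming the parametrization r_{ij}=\exp(-s_{i}-t_{j}-\omega) used in Section S5.2: shifting S and T by constants and absorbing the shifts into \omega leaves every r_{ij} unchanged.

```python
import numpy as np

def project_log_dispersions(S, T, omega):
    """Re-center S and T so that mean(exp(S)) = mean(exp(T)) = 1, absorbing the
    shifts into omega so that r_ij = exp(-s_i - t_j - omega) is unchanged."""
    cS = np.log(np.mean(np.exp(S)))
    cT = np.log(np.mean(np.exp(T)))
    return S - cS, T - cT, omega + cS + cT

rng = np.random.default_rng(0)
S, T, omega = rng.normal(size=5), rng.normal(size=4), 0.3
S2, T2, omega2 = project_log_dispersions(S, T, omega)
assert np.isclose(np.mean(np.exp(S2)), 1.0) and np.isclose(np.mean(np.exp(T2)), 1.0)
r_before = np.exp(-S[:, None] - T[None, :] - omega)
r_after = np.exp(-S2[:, None] - T2[None, :] - omega2)
assert np.allclose(r_before, r_after)   # the dispersions, and hence the likelihood, are preserved
```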

Special case of no latent factors

If the number of latent factors is zero, that is, M=0, then the model is no longer “bilinear” since the UDV^{\mathtt{T}} term is no longer present. The resulting model with g(\mathbb{E}(\bm{Y}))=XA^{\mathtt{T}}+BZ^{\mathtt{T}}+XCZ^{\mathtt{T}} can be viewed as a standard GLM via the vectorization approach discussed in Section S1.2; however, computation using standard GLM methods becomes intractable when I or J is large. Thus, our methods provide a computationally tractable approach to estimation and inference in this quite general class of GLMs for matrix data. In Figure S7, we present a subset of simulation results using the NB/Normal/Normal simulation scheme with J=100, K=4, L=2, and M=0. As expected, the estimates for A and C appear to be consistent and rapidly converging to the true values. Further, the coverage is nearly perfect for A and B, and possibly slightly conservative for C.
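To illustrate the size of the vectorized GLM, here is a small Python/NumPy sketch (ours, based on standard Kronecker-product identities) that builds the corresponding design matrix: it has IJ rows and JK + IL + KL columns, which is why off-the-shelf GLM routines quickly become impractical as I and J grow.

```python
import numpy as np

I, J, K, L = 100, 30, 4, 2                      # small sizes for illustration
rng = np.random.default_rng(1)
X, Z = rng.normal(size=(I, K)), rng.normal(size=(J, L))
A, B, C = rng.normal(size=(J, K)), rng.normal(size=(I, L)), rng.normal(size=(K, L))

# vec(X A^T + B Z^T + X C Z^T) = (I_J kron X) vec(A^T) + (Z kron I_I) vec(B) + (Z kron X) vec(C)
design = np.hstack([np.kron(np.eye(J), X),      # IJ x JK block for vec(A^T)
                    np.kron(Z, np.eye(I)),      # IJ x IL block for vec(B)
                    np.kron(Z, X)])             # IJ x KL block for vec(C)
theta = np.concatenate([A.T.ravel(order="F"), B.ravel(order="F"), C.ravel(order="F")])
eta = X @ A.T + B @ Z.T + X @ C @ Z.T
assert np.allclose(design @ theta, eta.ravel(order="F"))
print(design.shape)                             # (3000, 328); grows as (IJ, JK + IL + KL)
```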

Figure S7: Special case of no latent factors: K=4, L=2, M=0, NB/Normal/Normal.
Figure S8: Effect of small sample size: J=10, K=4, L=2, M=0, NB/Normal/Normal.
Figure S9: Effect of small sample size: J=10, K=1, L=1, M=1, NB/Normal/Normal.

Effect of small sample size

In applications such as gene expression analysis, sometimes a very small number of samples are available. To explore this regime, we run simulations using the NB/Normal/Normal scheme with J=10 and (a) K=4, L=2, M=0, and (b) K=1, L=1, M=1. As Figures S8 and S9 show, consistency for the same parameters appears to hold, although in some cases the rate is less clear than when J=100 (Figures S7 and S10). Meanwhile, the coverage performance is noticeably worse when J=10 than when J=100.

Effect of varying latent dimension

The “bilinear” latent factorization term UDV^{\mathtt{T}} is more challenging in terms of estimation and inference compared to the linear terms XA^{\mathtt{T}}, BZ^{\mathtt{T}}, and XCZ^{\mathtt{T}}. To assess the effect of the latent dimension M on consistency, efficiency, and accuracy of standard errors, we run the simulations using: (a) K=1, L=1, M=1, and (b) K=1, L=1, M=5. This is similar to the GAMMI model (Van Eeuwijk, 1995), but also includes S, T, and \omega. We see in Figure S10 that when M=1, the estimates of A and V are converging to the true values as expected, and the coverage for A, B, U, and V is excellent, except that the coverage for V degrades slightly when I=10000. When the latent dimension is increased to M=5, Figure S11 shows that the performance is similar, except that the coverage for V degrades more severely when I=10000.

Based on these and other experiments, we find that as long as the entries of D are not too small and are sufficiently well-spaced, the performance remains good. The magnitude of each entry of D controls the strength of the signal for the corresponding latent factor; thus, a small entry in D makes it difficult to estimate the corresponding columns of U and V. Meanwhile, if two entries of D are similar in magnitude, then there is near non-identifiability of U and V with respect to those two latent factors due to rotational invariance, leading to a loss of performance in terms of statistical efficiency and coverage.

Figure S10: Effect of varying latent dimension: K=1, L=1, M=1, NB/Normal/Normal.
Figure S11: Effect of varying latent dimension: K=1, L=1, M=5, NB/Normal/Normal.

Varying the distribution of covariates and parameters

As usual in regression, since the model is defined conditionally on the covariates, we would expect the results to be robust to the distribution of the covariates. To verify that this is indeed the case, we run the simulations with the following simulation schemes as defined above: (a) NB/Bernoulli/Normal (Bernoulli-distributed covariates) and (b) NB/Gamma/Normal (Gamma-distributed covariates), with J=100, K=4, L=2, and M=3; see Figures S12 and S13. Similarly, although we have a prior on the parameters, we would not expect the results to be highly sensitive to the distribution of the true parameters. To check this, we also run the simulations with the NB/Normal/Gamma scheme (Gamma-distributed true parameters) with J=100, K=4, L=2, and M=3; see Figure S14. The results are nearly identical to the case of normally distributed covariates and true parameters.

Figure S12: Varying the covariate distribution: K=4, L=2, M=3, NB/Bernoulli/Normal.
Figure S13: Varying the covariate distribution: K=4, L=2, M=3, NB/Gamma/Normal.
Figure S14: Varying the parameter distribution: K=4, L=2, M=3, NB/Normal/Gamma.

S3 Additional application results and details

Figure S15: PCA of the Pickrell data using: (top) log-transformed TPMs and (bottom) the VST method in the DESeq2 software package, including the GC bias adjustment from CQN. Compare with the GBM approach in Figure 8.
Figure S16: Visualization of GTEx data using NB-GBM latent factors, adjusting for covariates. Each dot represents one of the 8,551 samples, and the color indicates the subtissue type.
Figure S17: PCA of the GTEx data using log-transformed TPMs, specifically, \log(\texttt{TPM}_{ij}+1). Each dot represents one of the 8,551 samples, and the color indicates the tissue type.

S3.1 Gene expression application – Details on Section 7

Here we provide additional results and details on the gene expression applications in Section 7. For the Pickrell data, Figure S15 shows the PCA plots based on (a) log-transformed TPMs, specifically, \log(\texttt{TPM}_{ij}+1), and (b) the variance stabilizing transform (VST) method in the DESeq2 software package, using the GC adjustment from CQN. For the GTEx data, Figure S16 shows the latent factors (v_{j2} versus v_{j1}) as in Figure 9, but coloring the points according to tissue subtype instead of tissue type. We see that the samples tend to fall into clusters according to tissue subtype, further resolving the clustering in Figure 9. Figure S17 shows a PCA plot of the log TPMs of the GTEx data, which is not nearly as clear as Figure 9 in terms of tissue type clustering. For the analysis of aging-related genes in the Heart-LV subtissue using the GTEx data (Section 7.2), Tables S2 and S3 show the top 20 enriched GO terms in the Biological Process and Cellular Component categories.

Table S2: Top GO terms (Biological Process) for age-related expression in Heart-LV.
GO term ID Description Count p-value Benjamini
GO:0098609 cell-cell adhesion 48 5.1e-12 1.5e-08
GO:0006418 tRNA aminoacylation for protein translation 16 1.4e-09 2.0e-06
GO:0006099 tricarboxylic acid cycle 12 3.7e-07 3.6e-04
GO:1904871 positive regulation of protein localization to Cajal body 7 1.1e-06 6.1e-04
GO:1904851 positive regulation of establishment of protein localization to telomere 7 1.1e-06 6.1e-04
GO:0006607 NLS-bearing protein import into nucleus 10 1.3e-06 6.2e-04
GO:0006914 autophagy 22 1.8e-05 7.6e-03
GO:0016192 vesicle-mediated transport 24 2.6e-05 8.3e-03
GO:0006511 ubiquitin-dependent protein catabolic process 24 2.6e-05 8.3e-03
GO:0006888 ER to Golgi vesicle-mediated transport 24 3.5e-05 1.0e-02
GO:0006886 intracellular protein transport 31 4.3e-05 1.1e-02
GO:1904874 positive regulation of telomerase RNA localization to Cajal body 7 8.3e-05 2.0e-02
GO:0006090 pyruvate metabolic process 8 9.6e-05 2.1e-02
GO:0070125 mitochondrial translational elongation 16 1.1e-04 2.2e-02
GO:0006446 regulation of translational initiation 10 1.5e-04 2.8e-02
GO:0043039 tRNA aminoacylation 5 1.6e-04 3.0e-02
GO:0018107 peptidyl-threonine phosphorylation 10 2.9e-04 4.9e-02
GO:0000462 maturation of SSU-rRNA from tricistronic rRNA transcript 9 3.3e-04 5.4e-02
GO:0006610 ribosomal protein import into nucleus 5 3.7e-04 5.6e-02
GO:0016236 macroautophagy 14 4.0e-04 5.9e-02
Table S3: Top GO terms (Cellular Component) for age-related expression in Heart-LV.
GO term ID Description Count p-value Benjamini
GO:0016020 membrane 220 9.8e-21 3.7e-18
GO:0005739 mitochondrion 157 1.2e-20 3.7e-18
GO:0070062 extracellular exosome 242 4.3e-16 9.1e-14
GO:0005829 cytosol 282 1.0e-15 1.6e-13
GO:0005913 cell-cell adherens junction 57 9.5e-15 1.2e-12
GO:0005737 cytoplasm 380 2.3e-13 2.4e-11
GO:0043209 myelin sheath 36 4.7e-13 4.2e-11
GO:0005759 mitochondrial matrix 47 5.7e-09 4.5e-07
GO:0005654 nucleoplasm 217 1.1e-08 7.8e-07
GO:0000502 proteasome complex 18 1.4e-08 8.0e-07
GO:0005743 mitochondrial inner membrane 56 1.4e-08 8.0e-07
GO:0042645 mitochondrial nucleoid 14 3.5e-07 1.8e-05
GO:0014704 intercalated disc 14 8.5e-07 4.2e-05
GO:0005832 chaperonin-containing T-complex 7 2.5e-06 1.1e-04
GO:0005643 nuclear pore 16 5.2e-06 2.2e-04
GO:0043231 intracellular membrane-bounded organelle 55 2.7e-05 1.1e-03
GO:0002199 zona pellucida receptor complex 6 2.9e-05 1.1e-03
GO:0043034 costamere 8 5.4e-05 1.9e-03
GO:0043234 protein complex 42 7.8e-05 2.6e-03
GO:0045254 pyruvate dehydrogenase complex 5 1.5e-04 4.6e-03

S3.2 Cancer genomics application – Details on Section 8

Data acquisition and preprocessing. We downloaded BAM files for the 326 CCLE whole-exome samples from the Genomic Data Commons (GDC) Legacy Archive of the National Cancer Institute (https://portal.gdc.cancer.gov/legacy-archive/), using the GDC Data Transfer Tool v1.6.0. Using the PreprocessIntervals tool from the Genome Analysis Toolkit (GATK) v4.1.8.1 (https://github.com/broadinstitute/gatk/) running on Java v1.8, we preprocessed the CCLE exome target region interval list to pad the intervals by 250 base pairs on either side (options: padding 250, bin-length 0, interval-merging-rule OVERLAPPING_ONLY). Then, to convert each BAM file to a vector of counts, we counted the number of reads in each target region using the CollectReadCounts tool from GATK (options: interval-merging-rule OVERLAPPING_ONLY). For analysis, we included all target regions in chromosomes 1–22 that have a nonzero median count across the 326 samples.

De-segmenting the training samples. We randomly selected half of the samples to use as a training set, and the rest were used as a test set to evaluate performance. To be able to treat the training samples as “pseudo-normal” (non-cancer) samples, we de-segment them as follows. We first compute a rough estimate of copy ratio, defined as \rho_{ij}:=\tilde{Y}_{ij}/(\alpha_{i}\beta_{j}) where \alpha_{i}=\frac{1}{J}\sum_{j=1}^{J}\tilde{Y}_{ij}, \beta_{j}=\frac{1}{I}\sum_{i=1}^{I}\tilde{Y}_{ij}/\alpha_{i}, and \tilde{Y}_{ij}=Y_{ij}+1/8; here, 1/8 is a pseudocount that avoids issues when taking logs. For each sample j, we then run a standard binary segmentation algorithm (Killick et al., 2012, Eqn 2) on \log\rho_{ij} to detect changepoints. For binary segmentation, we use cost function \mathcal{C}(x_{1:n})=-(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}x_{i}/\sigma_{j})^{2} and penalty \beta=1000, where \sigma_{j}^{2} is the sample variance of (\log\rho_{ij}-\log\rho_{i+1,j})/\sqrt{2}. Define o_{ij} to be the average of \log\rho_{ij} over the segment containing region i.

We then compute the de-segmented counts Y_{ij}^{\textrm{deseg}}:=\mathrm{round}\big{(}\alpha_{i}\beta_{j}\exp(\log(\rho_{ij})-o_{ij})\big{)}. The idea is that this adjusts out the departures from copy neutral (that is, from normal diploid) as inferred by the segmentation algorithm, to create a panel of pseudo-normals.
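As an illustration of these two steps, here is a minimal Python/NumPy sketch (ours); it assumes the changepoint detection has already been run and is summarized by per-sample segment labels, since the binary segmentation itself is delegated to existing changepoint software.

```python
import numpy as np

def desegment(Y, segments):
    """Y: (I, J) read-count matrix. segments: (I, J) integer array giving, for each
    sample j, the segment label of region i from the changepoint algorithm.
    Returns the rough copy ratios rho and the de-segmented counts."""
    Yt = Y + 1.0 / 8.0                               # pseudocount
    alpha = Yt.mean(axis=1)                          # alpha_i = (1/J) sum_j Yt_ij
    beta = (Yt / alpha[:, None]).mean(axis=0)        # beta_j = (1/I) sum_i Yt_ij / alpha_i
    rho = Yt / (alpha[:, None] * beta[None, :])      # rough copy ratio estimate
    log_rho = np.log(rho)
    o = np.empty_like(log_rho)                       # o_ij = segment mean of log rho_ij
    for j in range(Y.shape[1]):
        for s in np.unique(segments[:, j]):
            idx = segments[:, j] == s
            o[idx, j] = log_rho[idx, j].mean()
    Y_deseg = np.round(alpha[:, None] * beta[None, :] * np.exp(log_rho - o))
    return rho, Y_deseg
```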

Running the GBM to estimate copy ratios. We run the GBM estimation algorithm on the de-segmented training data using M=5 latent factors. To avoid overfitting of the latent factors on the training data, we fix the sample-specific log-dispersion T after running the initialization procedure – that is, we do not update T at each iteration of the algorithm. We use the defaults for all other algorithm settings. For covariates, we construct X to include \log(\texttt{length}_{i}), \texttt{gc}_{i}, and (\texttt{gc}_{i}-\overline{\texttt{gc}})^{2}, and we use no sample covariates.

On the test data, we update T at each iteration as usual. We construct X to include the same covariates as before, along with 5 additional covariates equal to the columns of the U matrix that was estimated on the training data. We use no sample covariates, and we set M=0 on the test data. We define the GBM copy ratio estimates as the exponentiated residuals \rho_{ij}^{\textrm{GBM}}=\tilde{Y}_{ij}/\hat{\mu}_{ij} where \tilde{Y}_{ij}=Y_{ij}+1/8; also see Section 2.3.

Running GATK to estimate copy ratios. We run the CreateReadCountPanelOfNormals GATK tool on the de-segmented training data, and to enable comparison with the GBM, we set the options to use 5 principal components and include all regions (number-of-eigensamples 5, minimum-interval-median-percentile 0, maximum-zeros-in-interval-percentage 100). To estimate copy ratios for the test data, we run the DenoiseReadCounts GATK tool using the panel-of-normals file produced by CreateReadCountPanelOfNormals. On the test data, we use the original counts (not de-segmented).

Performance metrics. Suppose x,w\in\mathbb{R}^{n} and k\in\{0,2,4,\ldots\}, where x_{1},\ldots,x_{n} represent noisy measurements of a signal of interest, w_{i} is a weight for point x_{i}, and k is a smoothing bandwidth. We define the local relative standard error as

\mathrm{LRSE}(x,w,k)=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(w_{i}/\bar{w}_{i})^{2}(x_{i}-\bar{x}_{i})^{2}}

where \bar{w}_{i} is the moving average of w using bandwidth k, and \bar{x}_{i} is the weighted moving average of x using bandwidth k and weights w. More precisely, \bar{w}_{i}=\frac{1}{|A_{i}|}\sum_{j\in A_{i}}w_{j} and

\bar{x}_{i}=\frac{\sum_{j\in A_{i}}w_{j}x_{j}}{\sum_{j\in A_{i}}w_{j}} (S3.1)

where A_{i}=\{\max(1,\,i-k/2),\ldots,\min(n,\,i+k/2)\}. The idea is that if one is trying to estimate the mean of the signal, then a natural approach would be to use a weighted moving average, and the LRSE approximates the standard deviation (times \sqrt{k}) of this estimator.

We define the weighted median absolute difference as

\mathrm{WMAD}(x,w,k)=\mathrm{median}\big{\{}k\,|\bar{x}_{i+1}-\bar{x}_{i}|\,:\,i=1,\ldots,n-1\big{\}}

where \bar{x}_{i} is the same as above. The WMAD is similar to the median absolute deviation metric that is frequently used in this application, but it allows one to account for weights.

To assess copy ratio estimation performance for sample j, we take x_{i} to be the log copy ratio estimate for region i and we use a bandwidth of k=100 for both metrics. We take w_{i}=W_{ij} for the GBM (following Section 2.3), and for GATK we take w_{i}=1 since GATK does not provide weights/precisions.
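A direct Python/NumPy transcription of the two metrics (our own sketch, written from the definitions above) is:

```python
import numpy as np

def _window(i, k, n):
    # 0-based version of A_i = {max(1, i - k/2), ..., min(n, i + k/2)}
    return slice(max(0, i - k // 2), min(n - 1, i + k // 2) + 1)

def _weighted_moving_average(x, w, k):
    n = len(x)
    return np.array([np.average(x[_window(i, k, n)], weights=w[_window(i, k, n)])
                     for i in range(n)])

def lrse(x, w, k):
    """Local relative standard error with (even) bandwidth k."""
    n = len(x)
    wbar = np.array([w[_window(i, k, n)].mean() for i in range(n)])
    xbar = _weighted_moving_average(x, w, k)
    return np.sqrt(np.mean((w / wbar) ** 2 * (x - xbar) ** 2))

def wmad(x, w, k):
    """Weighted median absolute difference with (even) bandwidth k."""
    xbar = _weighted_moving_average(x, w, k)
    return np.median(k * np.abs(np.diff(xbar)))
```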

Estimated copy ratios for all test samples. Figures S18 and S19 show heatmaps of the GATK and GBM copy ratio estimates on the \log_{2} scale for all 163 samples in the test set. For visualization, these heatmaps are smoothed using a weighted moving average as in Equation S3.1 with k=3. The GBM estimates are visibly less noisy and appear to infer copy neutral regions (white portions of the heatmap) more accurately than GATK.

Figure S18: GATK copy ratio estimates (on the log2 scale) for all 163 test samples from the CCLE whole-exome sequencing dataset. The x-axis represents genomic position. Red and blue indicate copy gains and losses, respectively.
Figure S19: GBM copy ratio estimates (on the log2 scale) for all 163 test samples from the CCLE whole-exome sequencing dataset. The x-axis represents genomic position. Red and blue indicate copy gains and losses, respectively.

S4 Exponential dispersion families

For completeness, we state the basic results for discrete EDFs that we use in this paper. See Jørgensen (1987) or Agresti (2015) for reference on this material. Suppose

f(y\mid\theta,r)=\exp(\theta y-r\kappa(\theta))\,h(y,r) (S4.1)

is a p.m.f. on y\in\mathbb{Z} for \theta\in\Theta and r\in\mathcal{R}, where \Theta\subseteq\mathbb{R} and \mathcal{R}\subseteq(0,\infty) are convex open sets. If Y\sim f(y\mid\theta,r), then

\mathbb{E}(Y\mid\theta,r)=r\kappa^{\prime}(\theta)\text{ and }\operatorname*{Var}(Y\mid\theta,r)=r\kappa^{\prime\prime}(\theta). (S4.2)

These properties are straightforward to derive from the fact that \log\sum_{y\in\mathbb{Z}}\exp(\theta y)h(y,r)=r\kappa(\theta). For y,\theta, and r such that h(y,r)>0, let

\mathcal{L}=\mathcal{L}(y,\theta,r):=\log f(y\mid\theta,r)=\theta y-r\kappa(\theta)+\log h(y,r),
\mu=\mu(\theta,r):=\mathbb{E}(Y\mid\theta,r)=r\kappa^{\prime}(\theta),\text{ and}
\sigma^{2}=\sigma^{2}(\theta,r):=\operatorname*{Var}(Y\mid\theta,r)=r\kappa^{\prime\prime}(\theta).

Assume r is not functionally dependent on \theta. Then we have

\frac{\partial\mathcal{L}}{\partial\theta}=y-r\kappa^{\prime}(\theta)=y-\mu\text{ and }\frac{\partial^{2}\mathcal{L}}{\partial\theta^{2}}=-r\kappa^{\prime\prime}(\theta)=-\sigma^{2}.

Assume \kappa^{\prime} is invertible; this holds as long as \operatorname*{Var}(Y\mid\theta,r)>0 for all \theta and r. Then since \mu=r\kappa^{\prime}(\theta), we have \theta=\kappa^{\prime-1}(\mu/r), and it follows that \partial\theta/\partial\mu=1/(r\kappa^{\prime\prime}(\theta))=1/\sigma^{2} by the inverse function theorem. Similarly, defining \eta:=g(\mu), where g(\cdot) is a smooth function such that g^{\prime} is positive, we have \partial\mu/\partial\eta=1/g^{\prime}(\mu). Therefore, by the chain rule,

\frac{\partial\mathcal{L}}{\partial\eta}=\frac{\partial\mathcal{L}}{\partial\theta}\frac{\partial\theta}{\partial\mu}\frac{\partial\mu}{\partial\eta}=(y-r\kappa^{\prime}(\theta))\frac{1}{r\kappa^{\prime\prime}(\theta)}\frac{1}{g^{\prime}(\mu)}=\frac{y-\mu}{\sigma^{2}g^{\prime}(\mu)},

and

\frac{\partial^{2}\mathcal{L}}{\partial\eta^{2}}=\Big{(}-\frac{\partial\mu}{\partial\eta}\Big{)}\frac{1}{\sigma^{2}g^{\prime}(\mu)}+(y-\mu)\frac{\partial}{\partial\eta}\Big{(}\frac{1}{\sigma^{2}g^{\prime}(\mu)}\Big{)}.

Thus, if Y\sim f(y\mid\theta,r) and \mathcal{L}=\log f(Y\mid\theta,r), then

\frac{\partial\mathcal{L}}{\partial\eta}=\frac{Y-\mu}{\sigma^{2}g^{\prime}(\mu)}\text{ and }\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial\eta^{2}}\Big{)}=\frac{1}{\sigma^{2}g^{\prime}(\mu)^{2}}. (S4.3)

If \eta depends on some parameters \alpha and \beta (and r does not), then

\frac{\partial\mathcal{L}}{\partial\alpha}=\frac{\partial\mathcal{L}}{\partial\eta}\frac{\partial\eta}{\partial\alpha}\text{ and }\frac{\partial^{2}\mathcal{L}}{\partial\alpha\partial\beta}=\frac{\partial^{2}\mathcal{L}}{\partial\eta^{2}}\frac{\partial\eta}{\partial\alpha}\frac{\partial\eta}{\partial\beta}+\frac{\partial\mathcal{L}}{\partial\eta}\frac{\partial^{2}\eta}{\partial\alpha\partial\beta}.

Therefore, by Equation S4.3,

\frac{\partial\mathcal{L}}{\partial\alpha}=\frac{Y-\mu}{\sigma^{2}g^{\prime}(\mu)}\,\frac{\partial\eta}{\partial\alpha} (S4.4)
\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial\alpha\partial\beta}\Big{)}=\frac{1}{\sigma^{2}g^{\prime}(\mu)^{2}}\,\frac{\partial\eta}{\partial\alpha}\frac{\partial\eta}{\partial\beta} (S4.5)

since \partial^{2}\eta/\partial\alpha\partial\beta does not depend on Y and \mathbb{E}(\partial\mathcal{L}/\partial\eta)=0.

S5 Gradient and Fisher information for EDF-GBMs

In Section S5.1, we derive the gradient and Fisher information matrix with respect to each of A, B, C, D, U, and V individually, for a discrete EDF-GBM. In Section S5.2, we specialize these formulas to the case of NB-GBMs, and we derive the gradient and observed Fisher information for S and T in this case. For completeness, in Section S5.3 we also provide the cross-terms of the Fisher information between all pairs of components, although not all of these are needed for our approach. The formulas contain factors reminiscent of the gradient and Fisher information for a standard GLM, which take a standard form based on the link function and the mean-variance relationship (Agresti, 2015).

For our estimation algorithm, we only require the Fisher information matrix with respect to each parameter matrix/vector individually, rather than jointly. Meanwhile, for inference, we also use the constraint-augmented Fisher information matrix for U and V jointly; see Section S6.2. To enable comparison with the standard approach of using the full Fisher information matrix, in Section S6.3 we provide the constraint-augmented Fisher information matrix for all parameters jointly.

Consider a GBM with discrete EDF outcomes, that is, suppose Y_{ij}\sim f(y\mid\theta_{ij},r_{ij}) where \theta_{ij}:=\kappa^{\prime-1}(\mu_{ij}/r_{ij}), \mu_{ij}:=g^{-1}(\eta_{ij}), and \eta=\eta(A,B,C,D,U,V)\in\mathbb{R}^{I\times J} as in Equation 2.3. This makes \mu_{ij}=\mathbb{E}(Y_{ij}\mid\theta_{ij},r_{ij}); also, define \sigma_{ij}^{2}:=\operatorname*{Var}(Y_{ij}\mid\theta_{ij},r_{ij}). Let

\mathcal{L}:=\sum_{i=1}^{I}\sum_{j=1}^{J}\mathcal{L}_{ij}=\sum_{i=1}^{I}\sum_{j=1}^{J}\log f(Y_{ij}\mid\theta_{ij},r_{ij}) (S5.1)

denote the overall log-likelihood, where \mathcal{L}_{ij}:=\log f(Y_{ij}\mid\theta_{ij},r_{ij}). By Equations S4.4 and S4.5, for any two univariate entries of A, B, C, D, U, and V, say \alpha and \beta, we have

\frac{\partial\mathcal{L}_{ij}}{\partial\alpha}=e_{ij}\,\frac{\partial\eta_{ij}}{\partial\alpha}\text{ and }\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}_{ij}}{\partial\alpha\partial\beta}\Big{)}=w_{ij}\,\frac{\partial\eta_{ij}}{\partial\alpha}\frac{\partial\eta_{ij}}{\partial\beta} (S5.2)

where we define the matrices E\in\mathbb{R}^{I\times J} and W\in\mathbb{R}^{I\times J} such that

e_{ij}:=\frac{\partial\mathcal{L}_{ij}}{\partial\eta_{ij}}=\frac{Y_{ij}-\mu_{ij}}{\sigma_{ij}^{2}g^{\prime}(\mu_{ij})}\text{ and }w_{ij}:=\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}_{ij}}{\partial\eta_{ij}^{2}}\Big{)}=\frac{1}{\sigma_{ij}^{2}g^{\prime}(\mu_{ij})^{2}}. (S5.3)

The partial derivatives \partial\eta_{ij}/\partial\alpha with respect to each entry of A, B, C, D, U, and V are:

\frac{\partial\eta_{ij}}{\partial a_{j^{\prime}k}}=x_{ik}\,\mathds{1}(j=j^{\prime})\qquad\frac{\partial\eta_{ij}}{\partial v_{j^{\prime}m}}=u_{im}d_{mm}\,\mathds{1}(j=j^{\prime})
\frac{\partial\eta_{ij}}{\partial b_{i^{\prime}\ell}}=z_{j\ell}\,\mathds{1}(i=i^{\prime})\qquad\frac{\partial\eta_{ij}}{\partial u_{i^{\prime}m}}=v_{jm}d_{mm}\,\mathds{1}(i=i^{\prime}) (S5.4)
\frac{\partial\eta_{ij}}{\partial c_{k\ell}}=x_{ik}z_{j\ell}\qquad\frac{\partial\eta_{ij}}{\partial d_{mm}}=u_{im}v_{jm}.

S5.1 Gradient and Fisher information for each component

By Equations S5.2 and S5.4, we have

\frac{\partial\mathcal{L}}{\partial a_{jk}}=\sum_{i=1}^{I}x_{ik}e_{ij}\qquad\frac{\partial\mathcal{L}}{\partial v_{jm}}=\sum_{i=1}^{I}u_{im}d_{mm}e_{ij}
\frac{\partial\mathcal{L}}{\partial b_{i\ell}}=\sum_{j=1}^{J}z_{j\ell}e_{ij}\qquad\frac{\partial\mathcal{L}}{\partial u_{im}}=\sum_{j=1}^{J}v_{jm}d_{mm}e_{ij} (S5.5)
\frac{\partial\mathcal{L}}{\partial c_{k\ell}}=\sum_{i=1}^{I}\sum_{j=1}^{J}x_{ik}e_{ij}z_{j\ell}\qquad\frac{\partial\mathcal{L}}{\partial d_{mm}}=\sum_{i=1}^{I}\sum_{j=1}^{J}u_{im}e_{ij}v_{jm}.

To express these equations in matrix notation, we vectorize the parameter matrices as in Section 4.2: \vec{a}:=\mathrm{vec}(A^{\mathtt{T}}), \vec{b}:=\mathrm{vec}(B^{\mathtt{T}}), \vec{c}:=\mathrm{vec}(C), \vec{d}:=\operatorname*{diag}(D), \vec{u}:=\mathrm{vec}(U^{\mathtt{T}}), and \vec{v}:=\mathrm{vec}(V^{\mathtt{T}}). By Equation S5.5, the gradient of \mathcal{L} with respect to each vectorized component is then:

\nabla_{\!\vec{a}}\,\mathcal{L}=\mathrm{vec}(X^{\mathtt{T}}E)\qquad\nabla_{\!\vec{v}}\,\mathcal{L}=\mathrm{vec}((UD)^{\mathtt{T}}E)
\nabla_{\!\vec{b}}\,\mathcal{L}=\mathrm{vec}(Z^{\mathtt{T}}E^{\mathtt{T}})\qquad\nabla_{\!\vec{u}}\,\mathcal{L}=\mathrm{vec}((VD)^{\mathtt{T}}E^{\mathtt{T}}) (S5.6)
\nabla_{\!\vec{c}}\,\mathcal{L}=\mathrm{vec}(X^{\mathtt{T}}EZ)\qquad\nabla_{\!\vec{d}}\,\mathcal{L}=\operatorname*{diag}(U^{\mathtt{T}}EV).

For each component of the model, the entries of the Fisher information matrices are as follows, by Equations S5.2 and S5.4:

\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial a_{jk}\partial a_{j^{\prime}k^{\prime}}}\Big{)}=\mathds{1}(j=j^{\prime})\sum_{i=1}^{I}w_{ij}x_{ik}x_{ik^{\prime}} (S5.7)
\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial b_{i\ell}\partial b_{i^{\prime}\ell^{\prime}}}\Big{)}=\mathds{1}(i=i^{\prime})\sum_{j=1}^{J}w_{ij}z_{j\ell}z_{j\ell^{\prime}}
\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial c_{k\ell}\partial c_{k^{\prime}\ell^{\prime}}}\Big{)}=\sum_{i=1}^{I}\sum_{j=1}^{J}w_{ij}x_{ik}x_{ik^{\prime}}z_{j\ell}z_{j\ell^{\prime}}
\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial v_{jm}\partial v_{j^{\prime}m^{\prime}}}\Big{)}=\mathds{1}(j=j^{\prime})\sum_{i=1}^{I}w_{ij}u_{im}d_{mm}u_{im^{\prime}}d_{m^{\prime}m^{\prime}}
\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial u_{im}\partial u_{i^{\prime}m^{\prime}}}\Big{)}=\mathds{1}(i=i^{\prime})\sum_{j=1}^{J}w_{ij}v_{jm}d_{mm}v_{jm^{\prime}}d_{m^{\prime}m^{\prime}}
\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial d_{mm}\partial d_{m^{\prime}m^{\prime}}}\Big{)}=\sum_{i=1}^{I}\sum_{j=1}^{J}w_{ij}u_{im}v_{jm}u_{im^{\prime}}v_{jm^{\prime}}.

To express these equations in matrix form, we introduce the following notation. We write \mathrm{Diag}(Q_{1},\ldots,Q_{n}) to denote the block diagonal matrix with blocks Q_{1},\ldots,Q_{n},

\mathrm{Diag}(Q_{1},\ldots,Q_{n}):=\begin{bmatrix}Q_{1}&0&\cdots&0\\ 0&Q_{2}&&0\\ \vdots&&\ddots&\vdots\\ 0&0&\cdots&Q_{n}\end{bmatrix},

and we write \mathrm{block}(Q_{ij}:i,j\in\{1,\ldots,n\}) to denote the block matrix with block i,j equal to Q_{ij}. For a matrix Q\in\mathbb{R}^{m\times n}, we write Q_{i*} and Q_{*j} to denote the diagonal matrices constructed from the ith row and jth column, respectively, that is, Q_{i*}:=\mathrm{Diag}(q_{i1},\ldots,q_{in}) and Q_{*j}:=\mathrm{Diag}(q_{1j},\ldots,q_{mj}). Then, by Equation S5.7, the Fisher information matrices for each component of the model are:

\mathbb{E}(-\nabla_{\!\vec{a}}^{2}\,\mathcal{L})=\mathrm{Diag}(X^{\mathtt{T}}W_{*1}X,\,\ldots,\,X^{\mathtt{T}}W_{*J}X) (S5.8)
\mathbb{E}(-\nabla_{\!\vec{b}}^{2}\,\mathcal{L})=\mathrm{Diag}(Z^{\mathtt{T}}W_{1*}Z,\,\ldots,\,Z^{\mathtt{T}}W_{I*}Z)
\mathbb{E}(-\nabla_{\!\vec{c}}^{2}\,\mathcal{L})=\mathrm{block}\Big{(}{\textstyle\sum_{j=1}^{J}}z_{j\ell}z_{j\ell^{\prime}}(X^{\mathtt{T}}W_{*j}X):\ell,\ell^{\prime}\in\{1,\ldots,L\}\Big{)}
\mathbb{E}(-\nabla_{\!\vec{v}}^{2}\,\mathcal{L})=\mathrm{Diag}((UD)^{\mathtt{T}}W_{*1}(UD),\,\ldots,\,(UD)^{\mathtt{T}}W_{*J}(UD))
\mathbb{E}(-\nabla_{\!\vec{u}}^{2}\,\mathcal{L})=\mathrm{Diag}((VD)^{\mathtt{T}}W_{1*}(VD),\,\ldots,\,(VD)^{\mathtt{T}}W_{I*}(VD))
\mathbb{E}(-\nabla_{\!\vec{d}}^{2}\,\mathcal{L})=\sum_{j=1}^{J}(UV_{j*})^{\mathtt{T}}W_{*j}(UV_{j*}).

S5.2 NB-GBM with log link: Gradient and Fisher information

The negative binomial distribution has probability mass function

\operatorname*{NegBin}(y\mid\mu,r)=\frac{\Gamma(y+r)}{\Gamma(y+1)\Gamma(r)}\Big{(}\frac{\mu}{\mu+r}\Big{)}^{y}\Big{(}\frac{r}{\mu+r}\Big{)}^{r}

for y\in\{0,1,2,\ldots\}, given \mu>0 and r>0. This is a discrete EDF of the form in Equation S4.1 with \theta=\log(\mu/(\mu+r)) and \kappa(\theta)=-\log(1-\exp(\theta)). Observe that

\kappa^{\prime}(\theta)=\frac{e^{\theta}}{1-e^{\theta}}=\frac{\mu}{r}
\kappa^{\prime\prime}(\theta)=\frac{e^{\theta}}{(1-e^{\theta})^{2}}=\frac{\mu+\mu^{2}/r}{r}.

Thus, letting Y\sim\operatorname*{NegBin}(\mu,r), we have \mathbb{E}(Y)=\mu and \operatorname*{Var}(Y)=\mu+\mu^{2}/r by Equation S4.2. For an NB-GBM with log link g(\mu)=\log(\mu), the gradients and Fisher information matrices for A, B, C, D, U, and V are given by Equations S5.6 and S5.8 where, by Equation S5.3,

\mu_{ij}=\exp(\eta_{ij}),\qquad w_{ij}=\frac{r_{ij}\mu_{ij}}{r_{ij}+\mu_{ij}},\qquad e_{ij}=(Y_{ij}-\mu_{ij})\,\frac{w_{ij}}{\mu_{ij}}. (S5.9)
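To illustrate how Equations S5.6, S5.8, and S5.9 fit together computationally, here is a minimal Python/NumPy sketch (ours, for a toy NB-GBM with log link and only the XA^{\mathtt{T}} term) that forms E and W and then assembles the gradient and the block-diagonal Fisher information for A.

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K = 300, 20, 3
X = rng.normal(size=(I, K))
A = rng.normal(size=(J, K))
eta = X @ A.T                                   # linear predictor (other terms omitted)
mu = np.exp(eta)                                # log link: mu_ij = exp(eta_ij)
r = np.full((I, J), 10.0)                       # NB dispersion parameters r_ij
Y = rng.negative_binomial(r, r / (r + mu))      # mean mu_ij, variance mu_ij + mu_ij^2 / r_ij

W = r * mu / (r + mu)                           # w_ij, Equation S5.9
E = (Y - mu) * W / mu                           # e_ij, Equation S5.9

grad_a = (X.T @ E).ravel(order="F")             # nabla_a L = vec(X^T E), Equation S5.6
# Fisher information for a is block diagonal with blocks X^T W_{*j} X (Equation S5.8):
fisher_a_blocks = [X.T @ (W[:, j, None] * X) for j in range(J)]
```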

Next, we derive the gradient and observed information matrix for the log-dispersions. First, consider a single entry. Letting \psi(x) denote the digamma function, we have

\frac{\partial}{\partial r}\log\operatorname*{NegBin}(Y\mid\mu,r)=\psi(Y+r)-\psi(r)+\log(r)+1-\log(\mu+r)-\frac{Y+r}{\mu+r}
\frac{\partial^{2}}{\partial r^{2}}\log\operatorname*{NegBin}(Y\mid\mu,r)=\psi^{\prime}(Y+r)-\psi^{\prime}(r)+\frac{1}{r}-\frac{1}{\mu+r}-\frac{\mu-Y}{(\mu+r)^{2}}. (S5.10)

We work with the observed information rather than the expected information since \mathbb{E}(\psi^{\prime}(r+Y)) does not seem to have a simple expression.

It turns out that Equation S5.10 tends to lead to arithmetic overflow/underflow in extreme cases. Although these extreme cases occur only occasionally for individual entries, large GBMs have so many entries that these failures occur persistently and must be addressed. To avoid arithmetic overflow/underflow, we rewrite Equation S5.10 as follows:

\frac{\partial}{\partial r}\log\operatorname*{NegBin}(Y\mid\mu,r)=\psi_{\Delta}(Y,r)-\mathrm{log1p}(\mu/r)-\frac{Y-\mu}{r+\mu}+\mathrm{err}(Y,r)
\frac{\partial^{2}}{\partial r^{2}}\log\operatorname*{NegBin}(Y\mid\mu,r)=\psi^{\prime}_{\Delta}(Y,r)+\frac{Y+\mu^{2}/r}{(r+\mu)^{2}}+\mathrm{err}^{\prime}(Y,r) (S5.11)

where

\mathrm{log1p}(x)=\left\{\begin{array}{ll}x+\log((1+x)/e^{x})&\mbox{if }x<1\\ \log(1+x)&\mbox{if }x\geq 1\end{array}\right. (S5.14)
\psi_{\Delta}(y,r)=\left\{\begin{array}{ll}\psi(y+r)-\psi(r)&\mbox{if }r<10^{8}\\ \mathrm{log1p}(y/r)&\mbox{if }r\geq 10^{8}\end{array}\right. (S5.17)
\psi^{\prime}_{\Delta}(y,r)=\left\{\begin{array}{ll}\psi^{\prime}(y+r)-\psi^{\prime}(r)&\mbox{if }r<10^{8}\\ -(y/r)/(y+r)&\mbox{if }r\geq 10^{8}\end{array}\right. (S5.20)

and the error terms \mathrm{err}(y,r) and \mathrm{err}^{\prime}(y,r) are typically exceedingly small and can be safely ignored. Mathematically, \mathrm{log1p}(x)=\log(1+x); however, numerically, the expression in Equation S5.14 computes this value in a way that helps avoid arithmetic overflow and underflow. Similarly, \psi_{\Delta}(y,r) and \psi^{\prime}_{\Delta}(y,r) compute \psi(y+r)-\psi(r) and \psi^{\prime}(y+r)-\psi^{\prime}(r), respectively, to very high accuracy while avoiding overflow/underflow; the errors in these approximations are \mathrm{err}(y,r) and \mathrm{err}^{\prime}(y,r), respectively. To derive Equation S5.11, we group terms of similar magnitude and we use the asymptotics of \psi(x) and \psi^{\prime}(x).

Now, we derive the gradient and observed information for S and T in the NB-GBM. Recall that we parametrize r_{ij}=\exp(-s_{i}-t_{j}-\omega) and we work in terms of the vector of feature-specific log-dispersion offsets S\in\mathbb{R}^{I}, the vector of sample-specific log-dispersion offsets T\in\mathbb{R}^{J}, and the overall log-dispersion \omega, subject to the identifiability constraints \frac{1}{I}\sum_{i}e^{s_{i}}=1 and \frac{1}{J}\sum_{j}e^{t_{j}}=1. Let \mathcal{L}_{ij}=\log\operatorname*{NegBin}(Y_{ij}\mid\mu_{ij},r_{ij}) and \mathcal{L}=\sum_{i=1}^{I}\sum_{j=1}^{J}\mathcal{L}_{ij} as before. The derivatives with respect to s_{i} and t_{j} are then

\frac{\partial\mathcal{L}}{\partial s_{i}}=\sum_{j=1}^{J}\frac{\partial\mathcal{L}_{ij}}{\partial r_{ij}}(-r_{ij})\qquad\frac{\partial^{2}\mathcal{L}}{\partial s_{i}^{2}}=\sum_{j=1}^{J}\Big{(}\frac{\partial^{2}\mathcal{L}_{ij}}{\partial r_{ij}^{2}}r_{ij}^{2}+\frac{\partial\mathcal{L}_{ij}}{\partial r_{ij}}r_{ij}\Big{)} (S5.21)
\frac{\partial\mathcal{L}}{\partial t_{j}}=\sum_{i=1}^{I}\frac{\partial\mathcal{L}_{ij}}{\partial r_{ij}}(-r_{ij})\qquad\frac{\partial^{2}\mathcal{L}}{\partial t_{j}^{2}}=\sum_{i=1}^{I}\Big{(}\frac{\partial^{2}\mathcal{L}_{ij}}{\partial r_{ij}^{2}}r_{ij}^{2}+\frac{\partial\mathcal{L}_{ij}}{\partial r_{ij}}r_{ij}\Big{)}

since

\frac{\partial r_{ij}}{\partial s_{i}}=\frac{\partial r_{ij}}{\partial t_{j}}=-r_{ij}\qquad\text{ and }\qquad\frac{\partial^{2}r_{ij}}{\partial s_{i}^{2}}=\frac{\partial^{2}r_{ij}}{\partial t_{j}^{2}}=r_{ij}.

By Equation S5.11,

\frac{\partial\mathcal{L}_{ij}}{\partial r_{ij}}=\psi_{\Delta}(Y_{ij},r_{ij})-\mathrm{log1p}(\mu_{ij}/r_{ij})-\frac{Y_{ij}-\mu_{ij}}{r_{ij}+\mu_{ij}}+\mathrm{err}(Y_{ij},r_{ij})
\frac{\partial^{2}\mathcal{L}_{ij}}{\partial r_{ij}^{2}}=\psi^{\prime}_{\Delta}(Y_{ij},r_{ij})+\frac{Y_{ij}+\mu_{ij}^{2}/r_{ij}}{(r_{ij}+\mu_{ij})^{2}}+\mathrm{err}^{\prime}(Y_{ij},r_{ij}) (S5.22)

where the error terms are negligible. Note that \omega is implicitly optimized in the optimization-projection steps for S and T due to the likelihood-preserving projections, making it unnecessary to optimize with respect to \omega directly.
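A Python/SciPy sketch of this computation (ours; the negligible err terms are dropped, as discussed above) is:

```python
import numpy as np
from scipy.special import digamma, polygamma

def psi_delta(y, r):
    # psi(y + r) - psi(r), switching to log1p(y / r) for very large r
    return np.where(r < 1e8, digamma(y + r) - digamma(r), np.log1p(y / r))

def psi_delta_prime(y, r):
    # psi'(y + r) - psi'(r), switching to -(y / r) / (y + r) for very large r
    return np.where(r < 1e8, polygamma(1, y + r) - polygamma(1, r), -(y / r) / (y + r))

def log_dispersion_derivatives(Y, mu, S, T, omega):
    """Gradient and diagonal observed information for S and T (Equations S5.21 and S5.22)."""
    r = np.exp(-S[:, None] - T[None, :] - omega)
    dL_dr = psi_delta(Y, r) - np.log1p(mu / r) - (Y - mu) / (r + mu)
    d2L_dr2 = psi_delta_prime(Y, r) + (Y + mu ** 2 / r) / (r + mu) ** 2
    grad_s = -(dL_dr * r).sum(axis=1)                          # dL/ds_i
    grad_t = -(dL_dr * r).sum(axis=0)                          # dL/dt_j
    obs_info_s = -(d2L_dr2 * r ** 2 + dL_dr * r).sum(axis=1)   # -d^2 L / ds_i^2
    obs_info_t = -(d2L_dr2 * r ** 2 + dL_dr * r).sum(axis=0)   # -d^2 L / dt_j^2
    return grad_s, grad_t, obs_info_s, obs_info_t
```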

In practice, we do not use the off-diagonal terms of the observed information matrix for the log-dispersion parameters, since they seem to have a very small effect on the results. However, for completeness, we note here that

\frac{\partial^{2}\mathcal{L}}{\partial s_{i}\partial t_{j}}=\frac{\partial^{2}\mathcal{L}_{ij}}{\partial r_{ij}^{2}}r_{ij}^{2}+\frac{\partial\mathcal{L}_{ij}}{\partial r_{ij}}r_{ij}

for all i,j, and

\frac{\partial^{2}\mathcal{L}}{\partial s_{i}\partial s_{i^{\prime}}}=0\qquad\qquad\frac{\partial^{2}\mathcal{L}}{\partial t_{j}\partial t_{j^{\prime}}}=0

for all i\neq i^{\prime} and j\neq j^{\prime}.

S5.3 Fisher information between components of the model

In this section, we provide formulas for the off-block-diagonal entries of the full Fisher information matrix for a discrete EDF-GBM, that is, the entries that involve more than one of A, B, C, D, U, V, S, T, and \omega. With the exception of the entries involving U and V, we do not use these formulas in our proposed algorithms; however, we provide them here to enable comparison with the standard inference approach based on the full Fisher information matrix. As in Equation S5.3, we define

e_{ij}=\frac{\partial\mathcal{L}_{ij}}{\partial\eta_{ij}}\qquad\text{and}\qquad w_{ij}=\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}_{ij}}{\partial\eta_{ij}^{2}}\Big{)}.

For any univariate entry of A, B, C, D, U, or V, say \beta, we have

\mathbb{E}\Big{(}-\frac{\partial e_{ij}}{\partial\beta}\Big{)}=\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}_{ij}}{\partial\eta_{ij}^{2}}\Big{)}\frac{\partial\eta_{ij}}{\partial\beta}=w_{ij}\frac{\partial\eta_{ij}}{\partial\beta}. (S5.23)

Using Equation S5.23 along with Equations S5.4 and S5.5, the cross terms among A, B, and C are:

\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial a_{jk}\partial b_{i\ell}}\Big{)}=w_{ij}x_{ik}z_{j\ell} (S5.24)
\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial a_{jk}\partial c_{k^{\prime}\ell}}\Big{)}=\sum_{i=1}^{I}w_{ij}x_{ik}x_{ik^{\prime}}z_{j\ell} (S5.25)
\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial b_{i\ell}\partial c_{k\ell^{\prime}}}\Big{)}=\sum_{j=1}^{J}w_{ij}x_{ik}z_{j\ell}z_{j\ell^{\prime}}. (S5.26)

Similarly, the cross terms among U, V, and D are:

\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial u_{im}\partial v_{jm^{\prime}}}\Big{)}=w_{ij}u_{im^{\prime}}d_{m^{\prime}m^{\prime}}d_{mm}v_{jm} (S5.27)
\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial u_{im}\partial d_{m^{\prime}m^{\prime}}}\Big{)}=\sum_{j=1}^{J}w_{ij}u_{im^{\prime}}v_{jm^{\prime}}v_{jm}d_{mm} (S5.28)
\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial v_{jm}\partial d_{m^{\prime}m^{\prime}}}\Big{)}=\sum_{i=1}^{I}w_{ij}u_{im}d_{mm}u_{im^{\prime}}v_{jm^{\prime}}. (S5.29)

The cross terms between A, B, C and U, V, D are:

\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial a_{jk}\partial u_{im}}\Big{)}=w_{ij}x_{ik}v_{jm}d_{mm} (S5.30)
\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial a_{jk}\partial v_{j^{\prime}m}}\Big{)}=\mathds{1}(j=j^{\prime})\sum_{i=1}^{I}w_{ij}x_{ik}u_{im}d_{mm} (S5.31)
\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial a_{jk}\partial d_{mm}}\Big{)}=\sum_{i=1}^{I}w_{ij}x_{ik}u_{im}v_{jm} (S5.32)
\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial b_{i\ell}\partial u_{i^{\prime}m}}\Big{)}=\mathds{1}(i=i^{\prime})\sum_{j=1}^{J}w_{ij}z_{j\ell}v_{jm}d_{mm} (S5.33)
\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial b_{i\ell}\partial v_{jm}}\Big{)}=w_{ij}z_{j\ell}u_{im}d_{mm} (S5.34)
\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial b_{i\ell}\partial d_{mm}}\Big{)}=\sum_{j=1}^{J}w_{ij}z_{j\ell}u_{im}v_{jm} (S5.35)
\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial c_{k\ell}\partial u_{im}}\Big{)}=\sum_{j=1}^{J}w_{ij}x_{ik}z_{j\ell}v_{jm}d_{mm} (S5.36)
\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial c_{k\ell}\partial v_{jm}}\Big{)}=\sum_{i=1}^{I}w_{ij}x_{ik}z_{j\ell}u_{im}d_{mm} (S5.37)
\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}}{\partial c_{k\ell}\partial d_{mm}}\Big{)}=\sum_{i=1}^{I}\sum_{j=1}^{J}w_{ij}x_{ik}z_{j\ell}u_{im}v_{jm}. (S5.38)

All of the cross terms between (S,T,\omega) and (A,B,C,D,U,V) are zero. To see this, first note that by differentiating Equation S5.10 with respect to \mu,

\frac{\partial^{2}}{\partial\mu\partial r}\log\operatorname*{NegBin}(Y\mid\mu,r)=\frac{Y-\mu}{(r+\mu)^{2}}.

Hence, for any entry of A, B, C, D, U, or V, say \beta, we have \mathbb{E}(-\partial^{2}\mathcal{L}/\partial\beta\partial s_{i})=0 since

\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}_{ij}}{\partial\beta\partial s_{i}}\Big{)}=\mathbb{E}\Big{(}-\frac{\partial^{2}\mathcal{L}_{ij}}{\partial\mu_{ij}\partial r_{ij}}\frac{\partial\mu_{ij}}{\partial\beta}\frac{\partial r_{ij}}{\partial s_{i}}\Big{)}=-\mathbb{E}\Big{(}\frac{Y_{ij}-\mu_{ij}}{(r_{ij}+\mu_{ij})^{2}}\Big{)}\frac{\partial\mu_{ij}}{\partial\beta}\frac{\partial r_{ij}}{\partial s_{i}}=0.

Similarly, \mathbb{E}(-\partial^{2}\mathcal{L}/\partial\beta\partial t_{j})=0 and \mathbb{E}(-\partial^{2}\mathcal{L}/\partial\beta\partial\omega)=0.

S6 Constraint-augmented Fisher information

During estimation, we enforce constraints that ensure identifiability of the parameters; however, these constraints are not accounted for in the Fisher information matrix. Consequently, to appropriately quantify uncertainty in the parameters, it is necessary to augment the Fisher information matrix to account for the identifiability constraints; see Section S6.1.

In our proposed algorithm, we use this constraint-augmentation technique only for U and V jointly (see Section S6.2), and for the remaining components we use our delta propagation technique, which does not involve special handling of the constraints. We also compare our proposed method to the standard approach of inverting the full constraint-augmented Fisher information matrix. In Section S6.1, we review the constraint-augmentation technique in general, and in Section S6.3, we provide formulas for the full constraint-augmented Fisher information for GBMs.

S6.1 Constraint-augmentation technique

In this section, we review the constraint-augmentation technique of Aitchison and Silvey (1958) and Silvey (1959) in the setting of a finite-dimensional parametric model, and we extend it to our setting; also see Silvey (1975, Section 4.7). Consider an i.i.d. model with parameter \theta\in\mathbb{R}^{d}, and suppose we want to perform estimation and inference subject to the constraint that g(\theta)=0, where g(\theta)=(g_{1}(\theta),\ldots,g_{k}(\theta))^{\mathtt{T}}\in\mathbb{R}^{k} is a differentiable function. Suppose \hat{\theta} is the maximum likelihood estimator subject to g(\theta)=0; this is referred to as a “restricted maximum likelihood estimator” (Silvey, 1975).

Let J_{\theta}\in\mathbb{R}^{k\times d} be the Jacobian matrix of g, that is, J_{\theta,ij}=\partial g_{i}/\partial\theta_{j}. Let \mathcal{I}_{\theta}\in\mathbb{R}^{d\times d} be the Fisher information matrix for \theta, that is, \mathcal{I}_{\theta}=\mathbb{E}(-\nabla_{\theta}^{2}\mathcal{L}) where \mathcal{L} is the log-likelihood. Suppose \theta_{0} is the true value of the parameter. Aitchison and Silvey (1958) show that under regularity conditions, when \mathcal{I}_{\theta_{0}} is invertible, the covariance matrix of \hat{\theta} is approximately equal to the leading d\times d submatrix of \begin{bmatrix}\mathcal{I}_{\theta_{0}}&J_{\theta_{0}}^{\mathtt{T}}\\ J_{\theta_{0}}&0\end{bmatrix}^{-1}. However, in GBMs, we need to consider situations in which \mathcal{I}_{\theta_{0}} is not invertible since the model is overparametrized unless the identifiability constraints are imposed. When \mathcal{I}_{\theta_{0}} is not invertible, Silvey (1959) extends the technique by showing that the covariance matrix of \hat{\theta} is approximately equal to the leading d\times d submatrix of \begin{bmatrix}\mathcal{I}_{\theta_{0}}+J_{\theta_{0}}^{\mathtt{T}}J_{\theta_{0}}&J_{\theta_{0}}^{\mathtt{T}}\\ J_{\theta_{0}}&0\end{bmatrix}^{-1}.

Since we employ maximum a posteriori estimates, we modify the technique to use the regularized Fisher information matrix F_{\theta}=\mathbb{E}(-\nabla_{\theta}^{2}(\mathcal{L}+\log\pi)) in place of \mathcal{I}_{\theta}. For our choice of prior, F_{\theta_{0}} is invertible even when \mathcal{I}_{\theta_{0}} is not invertible, and consequently it turns out that the leading d\times d submatrices of \begin{bmatrix}F_{\theta_{0}}&J_{\theta_{0}}^{\mathtt{T}}\\ J_{\theta_{0}}&0\end{bmatrix}^{-1} and \begin{bmatrix}F_{\theta_{0}}+J_{\theta_{0}}^{\mathtt{T}}J_{\theta_{0}}&J_{\theta_{0}}^{\mathtt{T}}\\ J_{\theta_{0}}&0\end{bmatrix}^{-1} coincide; see Proposition S6.1. Thus, when using F_{\theta} instead of \mathcal{I}_{\theta}, it is not necessary to employ the extended version provided by Silvey (1959); this provides a big computational advantage when we apply the method to perform inference for (U,V) in GBMs. Although Aitchison and Silvey (1958) and Silvey (1959) justify the technique in the i.i.d. setting, empirically we find that it works well in our non-i.i.d. setting also.

Proposition S6.1.

Let F\in\mathbb{R}^{d\times d} and J\in\mathbb{R}^{k\times d}. If F and JF^{-1}J^{\mathtt{T}} are invertible, then the leading d\times d submatrices of \begin{bmatrix}F&J^{\mathtt{T}}\\ J&0\end{bmatrix}^{-1} and \begin{bmatrix}F+J^{\mathtt{T}}J&J^{\mathtt{T}}\\ J&0\end{bmatrix}^{-1} are equal.

Proof.

Let A:=F+J^{\mathtt{T}}J. By the formula for inverting a 2\times 2 block matrix,

\begin{bmatrix}A&J^{\mathtt{T}}\\ J&0\end{bmatrix}^{-1}=\begin{bmatrix}A^{-1}-A^{-1}J^{\mathtt{T}}(JA^{-1}J^{\mathtt{T}})^{-1}JA^{-1}&\ast\\ \ast&\ast\end{bmatrix}, (S6.1)

and the same formula applies for \begin{bmatrix}F&J^{\mathtt{T}}\\ J&0\end{bmatrix}^{-1}, but with F in place of A. Here, \ast denotes unneeded entries. By the Woodbury matrix inversion formula,

A^{-1}=F^{-1}-F^{-1}J^{\mathtt{T}}(\mathrm{I}+JF^{-1}J^{\mathtt{T}})^{-1}JF^{-1}. (S6.2)

Defining C:=JF^{-1}J^{\mathtt{T}}, this implies JA^{-1}J^{\mathtt{T}}=C-C(\mathrm{I}+C)^{-1}C=C(\mathrm{I}+C)^{-1}, hence

(JA^{-1}J^{\mathtt{T}})^{-1}=(\mathrm{I}+C)C^{-1}. (S6.3)

Using Equation S6.2 again, we see that JA^{-1}=JF^{-1}-C(\mathrm{I}+C)^{-1}JF^{-1}, and thus

(JA^{-1}J^{\mathtt{T}})^{-1}JA^{-1}=(\mathrm{I}+C)C^{-1}JF^{-1}-JF^{-1}=C^{-1}JF^{-1}. (S6.4)

Likewise, A^{-1}J^{\mathtt{T}}=F^{-1}J^{\mathtt{T}}-F^{-1}J^{\mathtt{T}}(\mathrm{I}+C)^{-1}C, so combining this with Equations S6.4 and S6.2 and canceling, we have

A^{-1}-A^{-1}J^{\mathtt{T}}(JA^{-1}J^{\mathtt{T}})^{-1}JA^{-1}=F^{-1}-F^{-1}J^{\mathtt{T}}(JF^{-1}J^{\mathtt{T}})^{-1}JF^{-1}.

The result follows by Equation S6.1. ∎
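The proposition is also easy to check numerically; the following Python/NumPy sketch compares the two leading submatrices for a random positive-definite F and a random constraint Jacobian.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 8, 3
M = rng.normal(size=(d, d))
F = M @ M.T + np.eye(d)               # positive definite, hence invertible
Jc = rng.normal(size=(k, d))          # full row rank with probability one

def leading_block(F_top):
    aug = np.block([[F_top, Jc.T], [Jc, np.zeros((k, k))]])
    return np.linalg.inv(aug)[:d, :d]

assert np.allclose(leading_block(F), leading_block(F + Jc.T @ Jc))
```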

S6.2 Constraint-augmented Fisher information for (U,V)(U,V)

We have found that it is important to quantify uncertainty in U and V jointly and account for the identifiability constraints. In this section, we derive a computationally efficient method for computing the diagonal of the inverse of the constraint-augmented Fisher information matrix (Section S6.1) to obtain approximate standard errors for U and V.

Let J_{u} and J_{v} denote the constraint Jacobian matrices for U and V; see Section S6.3. The regularized, constraint-augmented Fisher information matrix for (U,V) is then

\tilde{F}_{(u,v)}:=\begin{bmatrix}F_{u}&F_{uv}&J_{u}^{\mathtt{T}}&0\\ F_{uv}^{\mathtt{T}}&F_{v}&0&J_{v}^{\mathtt{T}}\\ J_{u}&0&0&0\\ 0&J_{v}&0&0\end{bmatrix}

where F_{u}=\mathbb{E}(-\nabla_{\!\vec{u}}^{2}\,\mathcal{L})+\lambda_{u}\mathrm{I}, F_{v}=\mathbb{E}(-\nabla_{\!\vec{v}}^{2}\,\mathcal{L})+\lambda_{v}\mathrm{I}, and F_{uv}=\mathbb{E}(-\nabla_{\!\vec{u}}\nabla_{\!\vec{v}}^{\mathtt{T}}\,\mathcal{L}); formulas for each of these expectations are given in Equations S5.8 and S5.27.

We need the first IM+JM entries of \operatorname*{diag}(\tilde{F}_{(u,v)}^{-1}) in order to obtain approximate standard errors for the entries of U and V; however, naively performing this matrix inversion is computationally intractable when I (or J) is large. To compute this efficiently, we structure the calculation as follows. First, let P be the permutation matrix such that

P\tilde{F}_{(u,v)}P^{\mathtt{T}}=\begin{bmatrix}F_{u}&J_{u}^{\mathtt{T}}&F_{uv}&0\\ J_{u}&0&0&0\\ F_{uv}^{\mathtt{T}}&0&F_{v}&J_{v}^{\mathtt{T}}\\ 0&0&J_{v}&0\end{bmatrix}=\begin{bmatrix}\mathsf{A}&\mathsf{B}\\ \mathsf{B}^{\mathtt{T}}&\mathsf{C}\end{bmatrix}

where we define

\mathsf{A}=\begin{bmatrix}F_{u}&J_{u}^{\mathtt{T}}\\ J_{u}&0\end{bmatrix},\qquad\mathsf{B}=\begin{bmatrix}F_{uv}&0\\ 0&0\end{bmatrix},\qquad\mathsf{C}=\begin{bmatrix}F_{v}&J_{v}^{\mathtt{T}}\\ J_{v}&0\end{bmatrix}.

(We use sans serif font for these block matrices, such as \mathsf{A}, to distinguish them from parameter matrices such as A.) Since \operatorname*{diag}((P\tilde{F}_{(u,v)}P^{\mathtt{T}})^{-1})=\operatorname*{diag}(P\tilde{F}_{(u,v)}^{-1}P^{\mathtt{T}})=P\operatorname*{diag}(\tilde{F}_{(u,v)}^{-1}), we can compute the diagonal of the inverse of P\tilde{F}_{(u,v)}P^{\mathtt{T}} and then permute back to get \operatorname*{diag}(\tilde{F}_{(u,v)}^{-1}). By the formula for inversion of a 2\times 2 block matrix,

(P\tilde{F}_{(u,v)}P^{\mathtt{T}})^{-1}=\begin{bmatrix}\mathsf{A}&\mathsf{B}\\ \mathsf{B}^{\mathtt{T}}&\mathsf{C}\end{bmatrix}^{-1}=\begin{bmatrix}\mathsf{A}^{-1}+\mathsf{A}^{-1}\mathsf{B}(\mathsf{C}-\mathsf{B}^{\mathtt{T}}\mathsf{A}^{-1}\mathsf{B})^{-1}\mathsf{B}^{\mathtt{T}}\mathsf{A}^{-1}&\ast\\ \ast&(\mathsf{C}-\mathsf{B}^{\mathtt{T}}\mathsf{A}^{-1}\mathsf{B})^{-1}\end{bmatrix}

where \ast denotes entries that are not needed. Similarly, by the same formula,

\mathsf{A}^{-1}=\begin{bmatrix}F_{u}&J_{u}^{\mathtt{T}}\\ J_{u}&0\end{bmatrix}^{-1}=\begin{bmatrix}\mathsf{D}&\mathsf{E}\\ \mathsf{E}^{\mathtt{T}}&\ast\end{bmatrix}

where \mathsf{D}=F_{u}^{-1}-F_{u}^{-1}J_{u}^{\mathtt{T}}(J_{u}F_{u}^{-1}J_{u}^{\mathtt{T}})^{-1}J_{u}F_{u}^{-1} and \mathsf{E}=F_{u}^{-1}J_{u}^{\mathtt{T}}(J_{u}F_{u}^{-1}J_{u}^{\mathtt{T}})^{-1}. Thus,

(\mathsf{C}-\mathsf{B}^{\mathtt{T}}\mathsf{A}^{-1}\mathsf{B})^{-1}=\begin{bmatrix}F_{v}-F_{uv}^{\mathtt{T}}\mathsf{D}F_{uv}&J_{v}^{\mathtt{T}}\\ J_{v}&0\end{bmatrix}^{-1}=\begin{bmatrix}\mathsf{G}&\ast\\ \ast&\ast\end{bmatrix},

where \mathsf{G} is defined to be the leading JM\times JM block of this matrix. Putting these pieces together justifies defining the approximate variance of each entry of \vec{v}=\mathrm{vec}(V^{\mathtt{T}}) as

\widehat{\mathrm{var}}(\vec{v}):=\operatorname*{diag}(\mathsf{G})

since these are the entries of \operatorname*{diag}((P\tilde{F}_{(u,v)}P^{\mathtt{T}})^{-1}) corresponding to \vec{v}. To approximate the variance of the entries of \vec{u}=\mathrm{vec}(U^{\mathtt{T}}), observe that

\mathsf{A}^{-1}\mathsf{B}(\mathsf{C}-\mathsf{B}^{\mathtt{T}}\mathsf{A}^{-1}\mathsf{B})^{-1}\mathsf{B}^{\mathtt{T}}\mathsf{A}^{-1}=\begin{bmatrix}\mathsf{D}&\mathsf{E}\\ \mathsf{E}^{\mathtt{T}}&\ast\end{bmatrix}\begin{bmatrix}F_{uv}&0\\ 0&0\end{bmatrix}\begin{bmatrix}\mathsf{G}&\ast\\ \ast&\ast\end{bmatrix}\begin{bmatrix}F_{uv}^{\mathtt{T}}&0\\ 0&0\end{bmatrix}\begin{bmatrix}\mathsf{D}&\mathsf{E}\\ \mathsf{E}^{\mathtt{T}}&\ast\end{bmatrix}=\begin{bmatrix}\mathsf{D}F_{uv}\mathsf{G}F_{uv}^{\mathtt{T}}\mathsf{D}&\ast\\ \ast&\ast\end{bmatrix}.

Therefore, we define

var^(u):=diag(𝖣+𝖣Fuv𝖦Fuv𝚃𝖣)\widehat{\mathrm{var}}(\vec{u}):=\operatorname*{diag}(\mathsf{D}+\mathsf{D}F_{uv}\mathsf{G}F_{uv}^{\mathtt{T}}\mathsf{D})

since these are the entries of diag((PF~(u,v)P𝚃)1)\operatorname*{diag}((P\tilde{F}_{(u,v)}P^{\mathtt{T}})^{-1}) corresponding to u\vec{u}.
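
To make the above bookkeeping concrete, the following is a minimal dense NumPy sketch of this calculation; the function and variable names (joint_uv_variances, Fu, Fv, Fuv, Ju, Jv) are illustrative only, and the step-by-step algorithm in Section S8 instead exploits the block structure of these matrices for efficiency.

```python
import numpy as np

def joint_uv_variances(Fu, Fv, Fuv, Ju, Jv):
    """Dense illustration of the block inversion above.

    Fu  : (IM, IM) regularized Fisher information for vec(U^T)
    Fv  : (JM, JM) regularized Fisher information for vec(V^T)
    Fuv : (IM, JM) cross Fisher information block
    Ju  : (cu, IM) constraint Jacobian for U;  Jv : (cv, JM) constraint Jacobian for V
    Returns element-wise variance approximations for vec(U^T) and vec(V^T).
    """
    Fu_inv = np.linalg.inv(Fu)
    # D = Fu^{-1} - Fu^{-1} Ju^T (Ju Fu^{-1} Ju^T)^{-1} Ju Fu^{-1}
    E = Fu_inv @ Ju.T @ np.linalg.inv(Ju @ Fu_inv @ Ju.T)
    D = Fu_inv - E @ (Ju @ Fu_inv)
    # Constraint-augmented Schur complement for the V block
    S = Fv - Fuv.T @ D @ Fuv
    JM, cv = Fv.shape[0], Jv.shape[0]
    aug = np.block([[S, Jv.T], [Jv, np.zeros((cv, cv))]])
    G = np.linalg.inv(aug)[:JM, :JM]     # leading JM x JM block
    var_v = np.diag(G).copy()
    var_u = np.diag(D + D @ Fuv @ G @ Fuv.T @ D).copy()
    return var_u, var_v
```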

S6.3 Full constraint-augmented Fisher information for GBMs

To facilitate comparison with the classical approach, we derive the constraint-augmented Fisher information matrix for all of the components (A,B,C,D,U,V)(A,B,C,D,U,V) jointly, even though our approach only requires the matrix for (U,V)(U,V). It is not necessary to include the log-dispersion parameters (S,T,ω)(S,T,\omega) since the Fisher information (as well as the constraint Jacobian) between these parameters and (A,B,C,D,U,V)(A,B,C,D,U,V) is zero (see Section S5.3); thus, in the classical approach, inference for (S,T,ω)(S,T,\omega) and (A,B,C,D,U,V)(A,B,C,D,U,V) can be performed independently since the constraint-augmented Fisher information decomposes.

For each of AA, BB, CC, DD, UU, and VV, we vectorize both the parameter matrix and the corresponding constraints, as follows. For AA, the constraint Z𝚃A=0Z^{\mathtt{T}}A=0 can be written as g(A)=0g(A)=0 where g(A)=vec(A𝚃Z)KLg(A)=\mathrm{vec}(A^{\mathtt{T}}Z)\in\mathbb{R}^{KL}. The constraint Jacobian for a=vec(A𝚃)\vec{a}=\mathrm{vec}(A^{\mathtt{T}}) is then Ja:=Z𝚃IKJ_{a}:=Z^{\mathtt{T}}\otimes\mathrm{I}_{K}, where \otimes denotes the Kronecker product and IK\mathrm{I}_{K} is the K×KK\times K identity matrix. Likewise, vectorizing the constraint on BB as vec(B𝚃X)=0\mathrm{vec}(B^{\mathtt{T}}X)=0, the constraint Jacobian for b=vec(B𝚃)\vec{b}=\mathrm{vec}(B^{\mathtt{T}}) is Jb:=X𝚃ILJ_{b}:=X^{\mathtt{T}}\otimes\mathrm{I}_{L}.

For uncertainty quantification, the key constraints on UU and VV are X𝚃U=0X^{\mathtt{T}}U=0, Z𝚃V=0Z^{\mathtt{T}}V=0, U𝚃U=IU^{\mathtt{T}}U=\mathrm{I}, and V𝚃V=IV^{\mathtt{T}}V=\mathrm{I}. Vectorizing, the constraints on UU can be written as vec(U𝚃X)=0\mathrm{vec}(U^{\mathtt{T}}X)=0 and vec(U𝚃UI)=0\mathrm{vec}(U^{\mathtt{T}}U-\mathrm{I})=0. Thus, the constraint Jacobian for u\vec{u} is

Ju:=[X1:IMXI:IM(U1:IM)+(IMU1:)(UI:IM)+(IMUI:)](MK+M2)×IM.\displaystyle J_{u}:=\begin{bmatrix}X_{1\bm{:}}\otimes\mathrm{I}_{M}&\cdots&X_{I\bm{:}}\otimes\mathrm{I}_{M}\\ (U_{1\bm{:}}\otimes\mathrm{I}_{M})+(\mathrm{I}_{M}\otimes U_{1\bm{:}})&\cdots&(U_{I\bm{:}}\otimes\mathrm{I}_{M})+(\mathrm{I}_{M}\otimes U_{I\bm{:}})\end{bmatrix}\in\mathbb{R}^{(MK+M^{2})\times IM}.

Here, for a matrix Qm×nQ\in\mathbb{R}^{m\times n}, we write Qi:Q_{i\bm{:}} and Q:jQ_{\bm{:}j} to denote the iith row and jjth column as column vectors, respectively, that is, Qi::=(qi1,,qin)𝚃nQ_{i\bm{:}}:=(q_{i1},\ldots,q_{in})^{\mathtt{T}}\in\mathbb{R}^{n} and Q:j:=(q1j,,qmj)𝚃mQ_{\bm{:}j}:=(q_{1j},\ldots,q_{mj})^{\mathtt{T}}\in\mathbb{R}^{m}. The constraint Jacobian for v\vec{v}, namely JvJ_{v}, is computed the same way as JuJ_{u} but with VV, ZZ, JJ, and LL in place of UU, XX, II, and KK, respectively. There are no constraints on CC, and the remaining constraints on DD, UU, and VV do not reduce the dimensionality of the parameter space. Thus, altogether, the constraint Jacobian for (A,B,C,D,U,V)(A,B,C,D,U,V) is

𝒥:=[Ja00000Jb000000Ju00000Jv]\mathcal{J}:=\begin{bmatrix}J_{a}&0&0&0&0\\ 0&J_{b}&0&0&0\\ 0&0&0&J_{u}&0\\ 0&0&0&0&J_{v}\end{bmatrix}

where the column of zero blocks corresponding to (C,D)(C,D) has width KL+MKL+M and height 2KL+M(K+L)+2M22KL+M(K+L)+2M^{2} (the total number of constraint rows).
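
As an illustration of how these blocks can be assembled, here is a minimal NumPy sketch (dense, with illustrative function names); the step-by-step algorithm in Section S8 builds J_u column-block by column-block rather than forming one large dense matrix.

```python
import numpy as np

def jacobian_A(Z, K):
    """J_a = Z^T (Kronecker) I_K: Jacobian of vec(A^T Z) with respect to vec(A^T), shape (KL, JK)."""
    return np.kron(Z.T, np.eye(K))

def jacobian_U(X, U):
    """J_u for the constraints X^T U = 0 and U^T U = I, shape (KM + M^2, IM)."""
    n_rows, M = U.shape
    IM = np.eye(M)
    cols = []
    for i in range(n_rows):
        top = np.kron(X[i].reshape(-1, 1), IM)                                      # rows from X^T U = 0
        bot = np.kron(U[i].reshape(-1, 1), IM) + np.kron(IM, U[i].reshape(-1, 1))   # rows from U^T U = I
        cols.append(np.vstack([top, bot]))
    return np.hstack(cols)
```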

Let =𝔼(2)\mathcal{I}=\mathbb{E}(-\nabla^{2}\mathcal{L}) denote the full Fisher information matrix for (A,B,C,D,U,V)(A,B,C,D,U,V). Formulas for all of the entries of \mathcal{I} are given in Equation S5.8 and in Section S5.3. The full constraint-augmented Fisher information matrix for all of the parameters is

~:=[𝒥𝚃𝒥0].\tilde{\mathcal{I}}:=\begin{bmatrix}\mathcal{I}&\mathcal{J}^{\mathtt{T}}\\ \mathcal{J}&0\end{bmatrix}.

The classical approach is then to define approximate standard errors as the square roots of the entries of diag(~1)\operatorname*{diag}(\tilde{\mathcal{I}}^{-1}) corresponding to each component (Section S6.1 and Silvey, 1975); for instance, the first JKJK entries of diag(~1)\operatorname*{diag}(\tilde{\mathcal{I}}^{-1}) are the variances of the entries of vec(A𝚃)\mathrm{vec}(A^{\mathtt{T}}).

S7 Step-by-step estimation algorithm

Given the inputs and preprocessing as described in Section 3, the algorithm is as follows.

S7.1 Initialization procedure

We initialize the GBM estimation algorithm by (a) solving for values of AA, BB, and CC to minimize the sum-of-squares of the GBM residuals εij\varepsilon_{ij}, (b) randomly initializing DD, UU, and VV, and in the NB-GBM case, (c) iteratively updating SS, TT, and ω\omega for a few iterations. This approach has the advantage of being simple, fast, and effective; see below for details.

It is somewhat tricky to initialize the algorithm well due to a chicken-and-egg problem. The issue is that having decent estimates of SS and TT is important to avoid overfitting to outliers, but we need a reasonable estimate of the mean matrix in order to estimate SS and TT. Our solution to this problem is to exclude DD, UU, and VV from the initial fitting of AA, BB, and CC in step (a) above. This prevents the latent factors from overfitting to outlier samples or outlier features, thus helping avoid getting stuck at a suboptimal point. In detail, we initialize as follows.

  1. (1)

    Compute Yˇij=g(Yij+ϵ)\check{Y}_{ij}=g(Y_{ij}+\epsilon) where ϵ=1/8\epsilon=1/8, and set Yˇ=(Yˇij)I×J\check{Y}=(\check{Y}_{ij})\in\mathbb{R}^{I\times J}.

  2. (2)

    CX+Yˇ(Z+)𝚃C\leftarrow X^{\texttt{\small{+}}}\check{Y}(Z^{\texttt{\small{+}}})^{\mathtt{T}}

  3. (3)

    A(X+YˇCZ𝚃)𝚃A\leftarrow(X^{\texttt{\small{+}}}\check{Y}-CZ^{\mathtt{T}})^{\mathtt{T}}

  4. (4)

    BYˇ(Z+)𝚃XCB\leftarrow\check{Y}(Z^{\texttt{\small{+}}})^{\mathtt{T}}-XC

  5. (5)

    Sample qij𝒩(0,1016)q_{ij}\sim\mathcal{N}(0,10^{-16}) i.i.d. for all i,ji,j, and set Q=(qij)I×JQ=(q_{ij})\in\mathbb{R}^{I\times J}.

  6. (6)

    Compute the compact SVD of QQ of rank MM, yielding UU, DD, and VV (so that UDV𝚃UDV^{\mathtt{T}} is the best rank-MM approximation of QQ).

  7. (7)

    Initialize S=0S=0, T=0T=0, and ω=0\omega=0, and then run the updates to SS and TT as defined in Section S7.2 for 4 iterations, using the current values of AA, BB, CC, DD, UU, and VV. Note that this also modifies ω\omega.
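
For concreteness, here is a minimal NumPy sketch of initialization steps (1)-(6) above, assuming the log link and inputs Y, X, Z, and M; the function name is illustrative only, and step (7) (the S, T, and ω warm-up iterations) is only started, not run.

```python
import numpy as np

def initialize_gbm(Y, X, Z, M, eps=0.125, rng=None):
    """Sketch of initialization steps (1)-(6) for the NB-GBM (log link g)."""
    rng = np.random.default_rng() if rng is None else rng
    Xp, Zp = np.linalg.pinv(X), np.linalg.pinv(Z)       # X^+ and Z^+
    Ycheck = np.log(Y + eps)                            # step (1)
    C = Xp @ Ycheck @ Zp.T                              # step (2)
    A = (Xp @ Ycheck - C @ Z.T).T                       # step (3)
    B = Ycheck @ Zp.T - X @ C                           # step (4)
    Q = rng.normal(0.0, 1e-8, size=Y.shape)             # step (5): N(0, 1e-16) noise has sd 1e-8
    Uq, d, Vt = np.linalg.svd(Q, full_matrices=False)   # step (6): rank-M truncation
    U, D, V = Uq[:, :M], np.diag(d[:M]), Vt[:M].T
    S, T, omega = np.zeros(Y.shape[0]), np.zeros(Y.shape[1]), 0.0   # start of step (7)
    return A, B, C, D, U, V, S, T, omega
```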

S7.2 Updates to each component of the model

In this section, we provide step-by-step algorithms for updating each component of the model using the optimization-projection approach. The optimization part of each update is based on the bounded regularized Fisher scoring step in Equation 3.1, using the formulas for the gradients and Fisher information matrices derived in Section S5. The projection part of each update is based on Theorem 5.4. In the updates to G=UDG=UD and H=VDH=VD, we use the priors on GG and HH induced by the priors on UU and VV, given DD; see Section S9.

For a matrix Qm×nQ\in\mathbb{R}^{m\times n}, we denote Qi::=(qi1,,qin)𝚃nQ_{i\bm{:}}:=(q_{i1},\ldots,q_{in})^{\mathtt{T}}\in\mathbb{R}^{n}, Q:j:=(q1j,,qmj)𝚃mQ_{\bm{:}j}:=(q_{1j},\ldots,q_{mj})^{\mathtt{T}}\in\mathbb{R}^{m}, Qi=Diag(Qi:)Q_{i*}=\mathrm{Diag}(Q_{i\bm{:}}), Qj=Diag(Q:j)Q_{*j}=\mathrm{Diag}(Q_{\bm{:}j}), vec(Q)\mathrm{vec}(Q) is the column-wise vectorization of QQ, and block(Qij:i,j{1,,n})\mathrm{block}(Q_{ij}:i,j\in\{1,\ldots,n\}) is the block matrix with blocks QijQ_{ij}. When multiplying by a diagonal matrix such as QiQ_{i*} or QjQ_{*j}, we do not allocate the full diagonal matrix and perform matrix multiplication. Instead, it is much more efficient to simply multiply each row (when left-multiplying) or column (when right-multiplying) by the corresponding diagonal entry.
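
For example, in NumPy both cases reduce to broadcasting, as in this small sketch:

```python
import numpy as np

# Left-multiplication by Diag(w) scales rows; right-multiplication scales columns.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 3))
w_row, w_col = rng.random(4), rng.random(3)

left = w_row[:, None] * Q      # same result as np.diag(w_row) @ Q, without forming Diag
right = Q * w_col[None, :]     # same result as Q @ np.diag(w_col)

assert np.allclose(left, np.diag(w_row) @ Q)
assert np.allclose(right, Q @ np.diag(w_col))
```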

Computing η\eta, μ\mu, WW, and EE

  1. (1)

    ηXA𝚃+BZ𝚃+XCZ𝚃+UDV𝚃\eta\leftarrow XA^{\mathtt{T}}+BZ^{\mathtt{T}}+XCZ^{\mathtt{T}}+UDV^{\mathtt{T}}

  2. (2)

    For i=1,,Ii=1,\ldots,I and j=1,,Jj=1,\ldots,J,

    1. (a)

      μijexp(ηij)\mu_{ij}\leftarrow\exp(\eta_{ij})

    2. (b)

      wijrijμij/(rij+μij)w_{ij}\leftarrow r_{ij}\mu_{ij}/(r_{ij}+\mu_{ij})

    3. (c)

      eij(Yijμij)wij/μije_{ij}\leftarrow(Y_{ij}-\mu_{ij})w_{ij}/\mu_{ij}
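
A minimal NumPy sketch of these steps for the NB-GBM (log link), assuming r is the I×J matrix of inverse dispersions r_ij; the function name is illustrative only.

```python
import numpy as np

def eta_mu_W_E(X, Z, A, B, C, U, D, V, Y, r):
    """Steps (1)-(2) above: linear predictor, mean, weights, and working residuals."""
    eta = X @ A.T + B @ Z.T + X @ C @ Z.T + U @ D @ V.T   # step (1)
    mu = np.exp(eta)                                      # step (2a): inverse log link
    W = r * mu / (r + mu)                                 # step (2b)
    E = (Y - mu) * W / mu                                 # step (2c)
    return eta, mu, W, E
```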

Updating AA

  1. (1)

    Recompute WW and EE using the current parameter estimates.

  2. (2)

    For j=1,,Jj=1,\ldots,J

    1. (a)

      ξ(X𝚃WjX+λaI)1(X𝚃E:jλaAj:)\xi\leftarrow(X^{\mathtt{T}}W_{*j}X+\lambda_{a}\mathrm{I})^{-1}(X^{\mathtt{T}}E_{\bm{:}j}-\lambda_{a}A_{j\bm{:}})     (compute Fisher scoring step)

    2. (b)

      Aj:Aj:+ξmin{1,ρK/ξ}A_{j\bm{:}}\leftarrow A_{j\bm{:}}+\xi\min\{1,\,\rho\sqrt{K}/\|\xi\|\}     (apply modified step to Aj:A_{j\bm{:}})

  3. (3)

    QZ+AQ\leftarrow Z^{\texttt{\small{+}}}A    (efficiently structure computation of projection)

  4. (4)

    AAZQA\leftarrow A-ZQ    (enforce Z𝚃A=0Z^{\mathtt{T}}A=0 by projecting onto nullspace of Z𝚃Z^{\mathtt{T}})

  5. (5)

    CC+Q𝚃C\leftarrow C+Q^{\mathtt{T}}    (compensate to preserve likelihood)
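
The following is a minimal NumPy sketch of this A update (per-row Fisher scoring followed by the projection), assuming W and E have already been recomputed; names are illustrative only and none of the efficiency devices described above are used.

```python
import numpy as np

def update_A(A, C, X, Z, W, E, lam_a, rho):
    """Fisher scoring steps for each row of A, then projection onto the constraint set."""
    J, K = A.shape
    for j in range(J):
        Fj = X.T @ (W[:, j:j+1] * X) + lam_a * np.eye(K)        # X^T W_{*j} X + lambda_a I
        xi = np.linalg.solve(Fj, X.T @ E[:, j] - lam_a * A[j])  # Fisher scoring step
        nrm = np.linalg.norm(xi)
        if nrm > 0:
            A[j] = A[j] + xi * min(1.0, rho * np.sqrt(K) / nrm)  # bounded step
    Q = np.linalg.pinv(Z) @ A        # Z^+ A
    A = A - Z @ Q                    # enforce Z^T A = 0
    C = C + Q.T                      # compensate to preserve the likelihood
    return A, C
```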

Updating BB

  1. (1)

    Recompute WW and EE using the current parameter estimates.

  2. (2)

    For i=1,,Ii=1,\ldots,I

    1. (a)

      ξ(Z𝚃WiZ+λbI)1(Z𝚃Ei:λbBi:)\xi\leftarrow(Z^{\mathtt{T}}W_{i*}Z+\lambda_{b}\mathrm{I})^{-1}(Z^{\mathtt{T}}E_{i\bm{:}}-\lambda_{b}B_{i\bm{:}})     (compute Fisher scoring step)

    2. (b)

      Bi:Bi:+ξmin{1,ρL/ξ}B_{i\bm{:}}\leftarrow B_{i\bm{:}}+\xi\min\{1,\,\rho\sqrt{L}/\|\xi\|\}     (apply modified step to Bi:B_{i\bm{:}})

  3. (3)

    QX+BQ\leftarrow X^{\texttt{\small{+}}}B    (efficiently structure computation of projection)

  4. (4)

    BBXQB\leftarrow B-XQ    (enforce X𝚃B=0X^{\mathtt{T}}B=0 by projecting onto nullspace of X𝚃X^{\mathtt{T}})

  5. (5)

    CC+QC\leftarrow C+Q    (compensate to preserve likelihood)

Updating CC

  1. (1)

    Recompute WW and EE using the current parameter estimates.

  2. (2)

    Fblock(j=1Jzjzj(X𝚃WjX):,{1,,L})F\leftarrow\mathrm{block}\Big{(}{\textstyle\sum_{j=1}^{J}}z_{j\ell}z_{j\ell^{\prime}}(X^{\mathtt{T}}W_{*j}X):\ell,\ell^{\prime}\in\{1,\ldots,L\}\Big{)}     (compute Fisher info)

  3. (3)

    ξ(F+λcI)1(vec(X𝚃EZ)λcvec(C))\xi\leftarrow(F+\lambda_{c}\mathrm{I})^{-1}(\mathrm{vec}(X^{\mathtt{T}}EZ)-\lambda_{c}\mathrm{vec}(C))     (compute Fisher scoring step)

  4. (4)

    vec(C)vec(C)+ξmin{1,ρKL/ξ}\mathrm{vec}(C)\leftarrow\mathrm{vec}(C)+\xi\min\{1,\,\rho\sqrt{KL}/\|\xi\|\}    (apply modified step to CC)

Updating DD

  1. (1)

    Recompute WW and EE using the current parameter estimates.

  2. (2)

    Fj=1J(UVj)𝚃Wj(UVj)F\leftarrow\sum_{j=1}^{J}(UV_{j*})^{\mathtt{T}}W_{*j}(UV_{j*})     (compute Fisher information)

  3. (3)

    ξ(F+λdI)1(diag(U𝚃EV)λddiag(D))\xi\leftarrow(F+\lambda_{d}\mathrm{I})^{-1}(\operatorname*{diag}(U^{\mathtt{T}}EV)-\lambda_{d}\,\mathrm{diag}(D))     (compute Fisher scoring step)

  4. (4)

    diag(D)diag(D)+ξmin{1,ρM/ξ}\mathrm{diag}(D)\leftarrow\mathrm{diag}(D)+\xi\min\{1,\,\rho\sqrt{M}/\|\xi\|\}    (apply modified step to DD)

Updating G=UDG=UD

  1. (1)

    Recompute WW and EE using the current parameter estimates.

  2. (2)

    GUDG\leftarrow UD

  3. (3)

    Λ(D2/λu)1\Lambda\leftarrow(D^{2}/\lambda_{u})^{-1}     (precision matrix for prior on each row of GG)

  4. (4)

    For i=1,,Ii=1,\ldots,I

    1. (a)

      ξ(V𝚃WiV+Λ)1(V𝚃Ei:ΛGi:)\xi\leftarrow(V^{\mathtt{T}}W_{i*}V+\Lambda)^{-1}(V^{\mathtt{T}}E_{i\bm{:}}-\Lambda G_{i\bm{:}})     (compute Fisher scoring step)

    2. (b)

      Gi:Gi:+ξmin{1,ρM/ξ}G_{i\bm{:}}\leftarrow G_{i\bm{:}}+\xi\min\{1,\,\rho\sqrt{M}/\|\xi\|\}     (apply modified step to Gi:G_{i\bm{:}})

  5. (5)

    QX+GQ\leftarrow X^{\texttt{\small{+}}}G    (efficiently structure computation of projection)

  6. (6)

    GGXQG\leftarrow G-XQ    (enforce X𝚃G=0X^{\mathtt{T}}G=0 by projecting onto nullspace of X𝚃X^{\mathtt{T}})

  7. (7)

    AA+VQ𝚃A\leftarrow A+VQ^{\mathtt{T}}    (compensate to preserve likelihood)

  8. (8)

    QZ+AQ\leftarrow Z^{\texttt{\small{+}}}A    (efficiently structure computation of projection)

  9. (9)

    AAZQA\leftarrow A-ZQ    (enforce Z𝚃A=0Z^{\mathtt{T}}A=0 by projecting onto nullspace of Z𝚃Z^{\mathtt{T}})

  10. (10)

    CC+Q𝚃C\leftarrow C+Q^{\mathtt{T}}    (compensate to preserve likelihood)

  11. (11)

    Run compact SVD of rank MM on GV𝚃GV^{\mathtt{T}}, yielding UU, DD, VV such that UDV𝚃=GV𝚃UDV^{\mathtt{T}}=GV^{\mathtt{T}}.
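
A minimal NumPy sketch of this G = UD update, assuming W and E have already been recomputed; as above, the names are illustrative and the pseudoinverses would be precomputed in practice.

```python
import numpy as np

def update_G(U, D, V, A, C, X, Z, W, E, lam_u, rho):
    """Fisher scoring on the rows of G = UD, projections, and SVD refactorization."""
    n_rows, M = U.shape
    G = U @ D
    Lam = np.linalg.inv(D @ D / lam_u)                    # prior precision for each row of G
    for i in range(n_rows):
        Fi = V.T @ (W[i][:, None] * V) + Lam              # V^T W_{i*} V + Lambda
        xi = np.linalg.solve(Fi, V.T @ E[i] - Lam @ G[i])
        nrm = np.linalg.norm(xi)
        if nrm > 0:
            G[i] = G[i] + xi * min(1.0, rho * np.sqrt(M) / nrm)
    Q = np.linalg.pinv(X) @ G; G = G - X @ Q; A = A + V @ Q.T   # enforce X^T G = 0, adjust A
    Q = np.linalg.pinv(Z) @ A; A = A - Z @ Q; C = C + Q.T       # enforce Z^T A = 0, adjust C
    Uu, d, Vt = np.linalg.svd(G @ V.T, full_matrices=False)     # refactor G V^T
    U, D, V = Uu[:, :M], np.diag(d[:M]), Vt[:M].T
    return U, D, V, A, C
```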

Updating H=VDH=VD

  1. (1)

    Recompute WW and EE using the current parameter estimates.

  2. (2)

    HVDH\leftarrow VD

  3. (3)

    Λ(D2/λv)1\Lambda\leftarrow(D^{2}/\lambda_{v})^{-1}     (precision matrix for prior on each row of HH)

  4. (4)

    For j=1,,Jj=1,\ldots,J

    1. (a)

      ξ(U𝚃WjU+Λ)1(U𝚃E:jΛHj:)\xi\leftarrow(U^{\mathtt{T}}W_{*j}U+\Lambda)^{-1}(U^{\mathtt{T}}E_{\bm{:}j}-\Lambda H_{j\bm{:}})     (compute Fisher scoring step)

    2. (b)

      Hj:Hj:+ξmin{1,ρM/ξ}H_{j\bm{:}}\leftarrow H_{j\bm{:}}+\xi\min\{1,\,\rho\sqrt{M}/\|\xi\|\}     (apply modified step to Hj:H_{j\bm{:}})

  5. (5)

    QZ+HQ\leftarrow Z^{\texttt{\small{+}}}H    (efficiently structure computation of projection)

  6. (6)

    HHZQH\leftarrow H-ZQ    (enforce Z𝚃H=0Z^{\mathtt{T}}H=0 by projecting onto nullspace of Z𝚃Z^{\mathtt{T}})

  7. (7)

    BB+UQ𝚃B\leftarrow B+UQ^{\mathtt{T}}    (compensate to preserve likelihood)

  8. (8)

    QX+BQ\leftarrow X^{\texttt{\small{+}}}B    (efficiently structure computation of projection)

  9. (9)

    BBXQB\leftarrow B-XQ    (enforce X𝚃B=0X^{\mathtt{T}}B=0 by projecting onto nullspace of X𝚃X^{\mathtt{T}})

  10. (10)

    CC+QC\leftarrow C+Q    (compensate to preserve likelihood)

  11. (11)

    Run compact SVD of rank MM on UH𝚃UH^{\mathtt{T}}, yielding UU, DD, VV such that UDV𝚃=UH𝚃UDV^{\mathtt{T}}=UH^{\mathtt{T}}.

Updating SS in an NB-GBM

For the updates to SS and TT, we employ adaptive maximum step sizes ρsi\rho_{si} and ρtj\rho_{tj} for sis_{i} and tjt_{j}, respectively. This helps prevent occasional lack of convergence due to oscillating estimates. At the start of the algorithm, we initialize ρsiρ\rho_{si}\leftarrow\rho and ρtjρ\rho_{tj}\leftarrow\rho. Define log1p(x)\mathrm{log1p}(x), ψΔ(y,r)\psi_{\Delta}(y,r), and ψΔ(y,r)\psi^{\prime}_{\Delta}(y,r) as in Equation S5.14. Note that we do not explicitly update ω\omega, since ω\omega is implicitly updated in the projection part of the updates to SS and TT.

  1. (1)

    Compute μ\mu using the current parameter estimates.

  2. (2)

    For i=1,,Ii=1,\ldots,I and j=1,,Jj=1,\ldots,J,     (differentiate each term in the log-likelihood)

    1. (a)

      δijrij(ψΔ(Yij,rij)log1p(μij/rij)(Yijμij)/(rij+μij))\delta_{ij}\leftarrow-r_{ij}\Big{(}\psi_{\Delta}(Y_{ij},r_{ij})-\mathrm{log1p}(\mu_{ij}/r_{ij})-(Y_{ij}-\mu_{ij})/(r_{ij}+\mu_{ij})\Big{)}

    2. (b)

      δijδij+rij2ψΔ(Yij,rij)+(Yij+μij2/rij)/(1+μij/rij)2\delta^{\prime}_{ij}\leftarrow-\delta_{ij}+r_{ij}^{2}\psi^{\prime}_{\Delta}(Y_{ij},r_{ij})+(Y_{ij}+\mu_{ij}^{2}/r_{ij})/(1+\mu_{ij}/r_{ij})^{2}

  3. (3)

    For i=1,,Ii=1,\ldots,I

    1. (a)

      gλs(sims)+j=1Jδijg\leftarrow-\lambda_{s}(s_{i}-m_{s})+\sum_{j=1}^{J}\delta_{ij}     (derivative of log-posterior with respect to sis_{i})

    2. (b)

      hλs+j=1Jδijh\leftarrow-\lambda_{s}+\sum_{j=1}^{J}\delta^{\prime}_{ij}     (second derivative of log-posterior with respect to sis_{i})

    3. (c)

      If h<0h<0 then ξg/h\xi\leftarrow-g/h, otherwise, ξg\xi\leftarrow g. (Newton if valid, otherwise gradient)

    4. (d)

      sisi+ξmin{1,ρsi/|ξ|}s_{i}\leftarrow s_{i}+\xi\min\{1,\,\rho_{si}/|\xi|\}    (apply modified optimization step to sis_{i})

    5. (e)

      If |ξ|>ρsi|\xi|>\rho_{si} then ρsiρsi/2\rho_{si}\leftarrow\rho_{si}/2, otherwise, ρsiρ\rho_{si}\leftarrow\rho.    (adapt maximum step size)

  4. (4)

    clog(1Ii=1Iesi)c\leftarrow\log(\frac{1}{I}\sum_{i=1}^{I}e^{s_{i}})

  5. (5)

    SScS\leftarrow S-c    (enforce constraint by projecting)

  6. (6)

    ωω+c\omega\leftarrow\omega+c    (compensate to preserve likelihood)

  7. (7)

    rijexp(sitjω)r_{ij}\leftarrow\exp(-s_{i}-t_{j}-\omega) for i=1,,Ii=1,\ldots,I and j=1,,Jj=1,\ldots,J    (update inverse dispersions)
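
A minimal NumPy sketch of steps (3)-(6) of this S update, assuming δ and δ′ have been computed as in step (2) and that rho_s holds the per-coordinate maximum step sizes ρ_si; the update of the r_ij (step (7)) is omitted.

```python
import numpy as np

def update_S(S, omega, delta, delta_prime, lam_s, m_s, rho_s, rho):
    """Newton-or-gradient steps with adaptive step bounds, then projection and compensation."""
    for i in range(S.shape[0]):
        g = -lam_s * (S[i] - m_s) + delta[i].sum()        # derivative of the log-posterior
        h = -lam_s + delta_prime[i].sum()                 # second derivative
        xi = -g / h if h < 0 else g                       # Newton if valid, otherwise gradient
        if xi != 0.0:
            S[i] = S[i] + xi * min(1.0, rho_s[i] / abs(xi))
        rho_s[i] = rho_s[i] / 2 if abs(xi) > rho_s[i] else rho   # adapt the maximum step size
    c = np.log(np.mean(np.exp(S)))
    return S - c, omega + c, rho_s                        # project S and compensate in omega
```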

Updating TT in an NB-GBM

Steps (1)-(2) and (7) are the same as in the update to SS. Steps (3)-(6) become:

  1. (3)

    For j=1,,Jj=1,\ldots,J

    1. (a)

      gλt(tjmt)+i=1Iδijg\leftarrow-\lambda_{t}(t_{j}-m_{t})+\sum_{i=1}^{I}\delta_{ij}     (derivative of log-posterior with respect to tjt_{j})

    2. (b)

      hλt+i=1Iδijh\leftarrow-\lambda_{t}+\sum_{i=1}^{I}\delta^{\prime}_{ij}     (second derivative of log-posterior with respect to tjt_{j})

    3. (c)

      If h<0h<0 then ξg/h\xi\leftarrow-g/h, otherwise, ξg\xi\leftarrow g. (Newton if valid, otherwise gradient)

    4. (d)

      tjtj+ξmin{1,ρtj/|ξ|}t_{j}\leftarrow t_{j}+\xi\min\{1,\,\rho_{tj}/|\xi|\}    (apply modified optimization step to tjt_{j})

    5. (e)

      If |ξ|>ρtj|\xi|>\rho_{tj} then ρtjρtj/2\rho_{tj}\leftarrow\rho_{tj}/2, otherwise, ρtjρ\rho_{tj}\leftarrow\rho.    (adapt maximum step size)

  2. (4)

    clog(1Jj=1Jetj)c\leftarrow\log(\frac{1}{J}\sum_{j=1}^{J}e^{t_{j}})

  3. (5)

    TTcT\leftarrow T-c    (enforce constraint by projecting)

  4. (6)

    ωω+c\omega\leftarrow\omega+c    (compensate to preserve likelihood)

Bias correction for SS and TT in an NB-GBM

Empirically, when the true values of SS and TT are low, the maximum likelihood estimates tend to exhibit a downward bias. Occasionally, this leads to massive underestimation of some of the log-dispersion values. This issue is mitigated somewhat by using a prior to shrink the estimates toward zero, however, it seems difficult to tune the prior to appropriately balance the bias. Thus, we employ the following simple bias correction procedure, applied after the final iteration of the estimation algorithm. Choose lower bounds ss_{*} and tt_{*} on sis_{i} and tjt_{j}, respectively; we use s=t=4s_{*}=t_{*}=-4 as defaults.

  1. (1)

    sis+log(exp(sis)+1)s_{i}\leftarrow s_{*}+\log(\exp(s_{i}-s_{*})+1) for i=1,,Ii=1,\ldots,I     (apply bias correction to SS)

  2. (2)

    clog(1Ii=1Iesi)c\leftarrow\log(\frac{1}{I}\sum_{i=1}^{I}e^{s_{i}})

  3. (3)

    SScS\leftarrow S-c    (enforce constraint by projecting)

  4. (4)

    ωω+c\omega\leftarrow\omega+c    (compensate to preserve likelihood)

The same procedure is applied to TT, with tt_{*} in place of ss_{*}. We find that this improves the accuracy of the log-dispersion estimates when the true values are at the low end.
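
A minimal NumPy sketch of this bias correction for S (the T case is analogous), with illustrative names:

```python
import numpy as np

def bias_correct_S(S, omega, s_star=-4.0):
    """Smoothly lower-bound S at s_star, then re-project and compensate in omega."""
    S = s_star + np.log(np.exp(S - s_star) + 1.0)   # step (1): softplus-style floor
    c = np.log(np.mean(np.exp(S)))                  # steps (2)-(4)
    return S - c, omega + c
```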

S7.3 Remarks on the estimation algorithm

We continue iterating until either (a) the relative change in log-likelihood+log-prior (Equations S5.1 and S9.2) from one iteration to the next is less than the convergence tolerance τ\tau or (b) the maximum number of iterations has been reached.

In the updates to G=UDG=UD and H=VDH=VD, we use the compact SVD to enforce the constraints on DD, UU, and VV. The compact SVD can be computed quickly using truncated SVD routines, which allow one to specify the rank (that is, the number of latent factors MM) and are available in many programming languages. It is not necessary to enforce the ordering and sign constraints on DD, UU, and VV (Conditions 2.1(d) and 2.1(e)) during the iterative updates since both the likelihood and the prior are invariant to the order and sign of the latent factors.

Note that, due to the symmetry of the model, the updates for AA and BB are similar enough that a single function can be used to compute both of them, with an option to handle the transpose for CC. Likewise, a single function can be used to compute both the UDUD and VDVD updates, with an option to handle transposes appropriately.

S8 Step-by-step inference algorithm

Notation. For matrices AA and BB, we use ABA\otimes B to denote the Kronecker product and ABA\odot B for the element-wise product. Normally, we write ABAB for matrix multiplication, but for improved clarity we use A×BA\times B to denote matrix multiplication when multi-letter variables such as invFc are involved. We write hcat(A1,,An)\mathrm{hcat}(A_{1},\ldots,A_{n}) for the horizontal concatenation of matrices A1,,AnA_{1},\ldots,A_{n}, that is, hcat(A1,,An):=[A1An]\mathrm{hcat}(A_{1},\ldots,A_{n}):=[A_{1}\;\cdots\;A_{n}]. Likewise, vcat\mathrm{vcat} denotes vertical concatenation. We define block(j,K):=((j1)K+1,(j1)K+2,,jK)\mathrm{block}(j,K):=((j-1)K+1,(j-1)K+2,\ldots,jK). For a matrix Am×nA\in\mathbb{R}^{m\times n}, colsums(A)\mathrm{colsums}(A) denotes the vector of column sums, that is, colsums(A)=(iai1,,iain)𝚃n\mathrm{colsums}(A)=(\sum_{i}a_{i1},\ldots,\sum_{i}a_{in})^{\mathtt{T}}\in\mathbb{R}^{n}. Likewise, rowsums\mathrm{rowsums} denotes the row sums. For a vector xmnx\in\mathbb{R}^{mn}, we define reshape(x,m,n)\mathrm{reshape}(x,m,n) to be the matrix Am×nA\in\mathbb{R}^{m\times n} such that x=vec(A)x=\mathrm{vec}(A). For a vector xnx\in\mathbb{R}^{n}, we write Diag(x)\mathrm{Diag}(x) to denote the n×nn\times n diagonal matrix with xx on the diagonal. For a matrix Am×nA\in\mathbb{R}^{m\times n} and vectors xmx\in\mathbb{R}^{m}, yny\in\mathbb{R}^{n}, with mnm\neq n, we extend the \odot operator as follows: Ax=xA:=Diag(x)AA\odot x=x\odot A:=\mathrm{Diag}(x)A and Ay=yA:=ADiag(y)A\odot y=y\odot A:=A\,\mathrm{Diag}(y). We write sqrt()\mathrm{sqrt}(\cdot) to denote the element-wise square root. We use In\mathrm{I}_{n} to denote the n×nn\times n identity matrix.
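
Since the vectorization here is column-wise, some care is needed when implementing these conventions in row-major tools; for example, in NumPy:

```python
import numpy as np

Q = np.arange(6.0).reshape(3, 2, order="F")   # a 3 x 2 matrix
x, y = np.array([1.0, 2.0, 3.0]), np.array([10.0, 20.0])

vecQ = Q.flatten(order="F")                   # vec(Q): column-wise vectorization
Q_back = vecQ.reshape(3, 2, order="F")        # reshape(vec(Q), 3, 2) recovers Q

row_scaled = x[:, None] * Q                   # Q ⊙ x = Diag(x) Q  (x has length m)
col_scaled = Q * y[None, :]                   # Q ⊙ y = Q Diag(y)  (y has length n)
```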


Preprocessing.

  1. (1)

    Compute the inverse dispersions rijexp(sitjω)r_{ij}\leftarrow\exp(-s_{i}-t_{j}-\omega) for all i,ji,j.

  2. (2)

    Compute μ\mu, WW, and EE as in Section S7.

  3. (3)

    Compute dWMI×J\texttt{dWM}\in\mathbb{R}^{I\times J} where dWMijμijrij2/(rij+μij)2\texttt{dWM}_{ij}\leftarrow\mu_{ij}r_{ij}^{2}/(r_{ij}+\mu_{ij})^{2}.

  4. (4)

    Compute dEMI×J\texttt{dEM}\in\mathbb{R}^{I\times J} where dEMijμijrij(rij+Yij)/(rij+μij)2\texttt{dEM}_{ij}\leftarrow-\mu_{ij}r_{ij}(r_{ij}+Y_{ij})/(r_{ij}+\mu_{ij})^{2}.

  5. (5)

    gradAE𝚃X\texttt{gradA}\leftarrow E^{\mathtt{T}}X

  6. (6)

    gradBEZ\texttt{gradB}\leftarrow EZ

  7. (7)

    gradCX𝚃EZ\texttt{gradC}\leftarrow X^{\mathtt{T}}EZ

  8. (8)

    Compute δij\delta_{ij} and δij\delta^{\prime}_{ij} for all i,ji,j using the formula from the SS update during estimation.

  9. (9)

    gradSiλs(sims)+j=1Jδij\texttt{gradS}_{i}\leftarrow-\lambda_{s}(s_{i}-m_{s})+\sum_{j=1}^{J}\delta_{ij} for i=1,,Ii=1,\ldots,I.

  10. (10)

    gradTjλt(tjmt)+i=1Iδij\texttt{gradT}_{j}\leftarrow-\lambda_{t}(t_{j}-m_{t})+\sum_{i=1}^{I}\delta_{ij} for j=1,,Jj=1,\ldots,J.

Compute conditional uncertainty for each component.

  1. (1)

    invFaj(X𝚃WjX+λaI)1\texttt{invFa}_{j}\leftarrow(X^{\mathtt{T}}W_{*j}X+\lambda_{a}\mathrm{I})^{-1} for j=1,,Jj=1,\ldots,J.

  2. (2)

    invFbi(Z𝚃WiZ+λbI)1\texttt{invFb}_{i}\leftarrow(Z^{\mathtt{T}}W_{i*}Z+\lambda_{b}\mathrm{I})^{-1} for i=1,,Ii=1,\ldots,I.

  3. (3)

    invFc(𝔼(c2)+λcI)1\texttt{invFc}\leftarrow(\mathbb{E}(-\nabla_{\!\vec{c}}^{2}\,\mathcal{L})+\lambda_{c}\mathrm{I})^{-1} where 𝔼(c2)\mathbb{E}(-\nabla_{\!\vec{c}}^{2}\,\mathcal{L}) is given in Equation S5.8.

  4. (4)

    invFui((VD)𝚃Wi(VD)+λuI)1\texttt{invFu}_{i}\leftarrow((VD)^{\mathtt{T}}W_{i*}(VD)+\lambda_{u}\mathrm{I})^{-1} for i=1,,Ii=1,\ldots,I.

  5. (5)

    invFvj((UD)𝚃Wj(UD)+λvI)1\texttt{invFv}_{j}\leftarrow((UD)^{\mathtt{T}}W_{*j}(UD)+\lambda_{v}\mathrm{I})^{-1} for j=1,,Jj=1,\ldots,J.

  6. (6)

    invFsi1/(λsj=1Jδij)\texttt{invFs}_{i}\leftarrow 1/(\lambda_{s}-\sum_{j=1}^{J}\delta^{\prime}_{ij}) for all i=1,,Ii=1,\ldots,I.

  7. (7)

    invFtj1/(λti=1Iδij)\texttt{invFt}_{j}\leftarrow 1/(\lambda_{t}-\sum_{i=1}^{I}\delta^{\prime}_{ij}) for all j=1,,Jj=1,\ldots,J.

  8. (8)

    invFs(invFs1,,invFsI)𝚃\texttt{invFs}\leftarrow(\texttt{invFs}_{1},\ldots,\texttt{invFs}_{I})^{\mathtt{T}}

  9. (9)

    invFt(invFt1,,invFtJ)𝚃\texttt{invFt}\leftarrow(\texttt{invFt}_{1},\ldots,\texttt{invFt}_{J})^{\mathtt{T}}

Compute constraint Jacobians for UU and VV.

  1. (1)

    Juivcat(Xi:IM,(Ui:IM)+(IMUi:))\texttt{Ju}_{i}\leftarrow\mathrm{vcat}(X_{i\bm{:}}\otimes\mathrm{I}_{M},\;(U_{i\bm{:}}\otimes\mathrm{I}_{M})+(\mathrm{I}_{M}\otimes U_{i\bm{:}})) for i=1,,Ii=1,\ldots,I.

  2. (2)

    Juhcat(Ju1,,JuI)\texttt{Ju}\leftarrow\mathrm{hcat}(\texttt{Ju}_{1},\ldots,\texttt{Ju}_{I})

  3. (3)

    Jvjvcat(Zj:IM,(Vj:IM)+(IMVj:))\texttt{Jv}_{j}\leftarrow\mathrm{vcat}(Z_{j\bm{:}}\otimes\mathrm{I}_{M},\;(V_{j\bm{:}}\otimes\mathrm{I}_{M})+(\mathrm{I}_{M}\otimes V_{j\bm{:}})) for j=1,,Jj=1,\ldots,J.

  4. (4)

    Jvhcat(Jv1,,JvJ)\texttt{Jv}\leftarrow\mathrm{hcat}(\texttt{Jv}_{1},\ldots,\texttt{Jv}_{J})

Compute joint uncertainty in (U,V)(U,V) accounting for constraints.

  1. (1)

    Fuvihcat(wi1(DV1:)(DUi:)𝚃,,wiJ(DVJ:)(DUi:)𝚃)\texttt{Fuv}_{i}\leftarrow\mathrm{hcat}(w_{i1}(DV_{1\bm{:}})(DU_{i\bm{:}})^{\mathtt{T}},\ldots,w_{iJ}(DV_{J\bm{:}})(DU_{i\bm{:}})^{\mathtt{T}}) for i=1,,Ii=1,\ldots,I

  2. (2)

    Fuvvcat(Fuv1,,FuvI)\texttt{Fuv}\leftarrow\mathrm{vcat}(\texttt{Fuv}_{1},\ldots,\texttt{Fuv}_{I})

  3. (3)

    FJvcat(invFu1×Ju1𝚃,,invFuI×JuI𝚃)\texttt{FJ}\leftarrow\mathrm{vcat}(\texttt{invFu}_{1}\times\texttt{Ju}_{1}^{\mathtt{T}},\ldots,\texttt{invFu}_{I}\times\texttt{Ju}_{I}^{\mathtt{T}})

  4. (4)

    FuvFJFuv𝚃×FJ\texttt{FuvFJ}\leftarrow\texttt{Fuv}^{\mathtt{T}}\times\texttt{FJ}

  5. (5)

    invJFJ(Ju×FJ)1\texttt{invJFJ}\leftarrow(\texttt{Ju}\times\texttt{FJ})^{-1}

  6. (6)

    FFuvvcat(invFu1×Fuv1,,invFuI×FuvI)\texttt{FFuv}\leftarrow\mathrm{vcat}(\texttt{invFu}_{1}\times\texttt{Fuv}_{1},\ldots,\texttt{invFu}_{I}\times\texttt{Fuv}_{I})

  7. (7)

    FuvFFuvFuv𝚃×FFuv\texttt{FuvFFuv}\leftarrow\texttt{Fuv}^{\mathtt{T}}\times\texttt{FFuv}

  8. (8)

    Fv\texttt{Fv}\leftarrow block diagonal matrix with jjth block equal to (UD)𝚃Wj(UD)+λvI(UD)^{\mathtt{T}}W_{*j}(UD)+\lambda_{v}\mathrm{I}.

  9. (9)

    AFvFuvFFuv+FuvFJ×invJFJ×FuvFJ𝚃\texttt{A}\leftarrow\texttt{Fv}-\texttt{FuvFFuv}+\texttt{FuvFJ}\times\texttt{invJFJ}\times\texttt{FuvFJ}^{\mathtt{T}}

  10. (10)

    B[AJv𝚃Jv0]1\texttt{B}\leftarrow\displaystyle\begin{bmatrix}\texttt{A}&\texttt{Jv}^{\mathtt{T}}\,\\ \texttt{Jv}&0\,\end{bmatrix}^{-1}

  11. (11)

    CB[1:JM, 1:JM]\texttt{C}\leftarrow\texttt{B}[1\!:\!JM,\,1\!:\!JM] (that is, C is the leading JM×JMJM\times JM block of B)

  12. (12)

    FuvDFFuv𝚃FuvFJ×invJFJ×FJ𝚃\texttt{FuvD}\leftarrow\texttt{FFuv}^{\mathtt{T}}-\texttt{FuvFJ}\times\texttt{invJFJ}\times\texttt{FJ}^{\mathtt{T}}

  13. (13)

    dvcat(diag(invFu1),,diag(invFuI))\texttt{d}\leftarrow\mathrm{vcat}(\operatorname*{diag}(\texttt{invFu}_{1}),\ldots,\operatorname*{diag}(\texttt{invFu}_{I}))

  14. (14)

    fcolsums(FJ𝚃(invJFJ×FJ𝚃))\texttt{f}\leftarrow\mathrm{colsums}(\texttt{FJ}^{\mathtt{T}}\odot(\texttt{invJFJ}\times\texttt{FJ}^{\mathtt{T}}))

  15. (15)

    gcolsums(FuvD(C×FuvD))\texttt{g}\leftarrow\mathrm{colsums}(\texttt{FuvD}\odot(\texttt{C}\times\texttt{FuvD}))

  16. (16)

    varUdf+g\texttt{varU}\leftarrow\texttt{d}-\texttt{f}+\texttt{g}

  17. (17)

    varVdiag(C)\texttt{varV}\leftarrow\operatorname*{diag}(\texttt{C})

Propagate uncertainty from UU to AA.

  1. (1)

    For j=1,,Jj=1,\ldots,J,

    1. (a)

      Qj(invFaj×(XdWM:j)𝚃)(X×invFaj×gradAj:)𝚃+invFaj×(XdEM:j)𝚃\texttt{Q}_{j}\leftarrow(-\texttt{invFa}_{j}\times(X\odot\texttt{dWM}_{\bm{:}j})^{\mathtt{T}})\odot(X\times\texttt{invFa}_{j}\times\texttt{gradA}_{j\bm{:}})^{\mathtt{T}}+\texttt{invFa}_{j}\times(X\odot\texttt{dEM}_{\bm{:}j})^{\mathtt{T}}

  2. (2)

    dAvcat(Q1(DV1:)𝚃,,QJ(DVJ:)𝚃)\texttt{dA}\leftarrow\mathrm{vcat}(\texttt{Q}_{1}\otimes(DV_{1\bm{:}})^{\mathtt{T}},\ldots,\texttt{Q}_{J}\otimes(DV_{J\bm{:}})^{\mathtt{T}})

  3. (3)

    varAfromUcolsums(dA𝚃(varUdA𝚃))\texttt{varAfromU}\leftarrow\mathrm{colsums}(\texttt{dA}^{\mathtt{T}}\odot(\texttt{varU}\odot\texttt{dA}^{\mathtt{T}}))
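
The core operation in this and the following propagation steps is the same delta-method assembly: given the Jacobian of a conditional estimate with respect to another component and the element-wise variances of that component, the propagated element-wise variances are the squared Jacobian entries weighted by those variances. In generic form (illustrative only, and assuming independent entries as in the element-wise approximation used throughout):

```python
import numpy as np

def propagate_variance(Jac, var_phi):
    """Element-wise delta-method propagation.

    Jac     : (n_theta, n_phi) Jacobian of the conditional estimate with respect to phi
    var_phi : (n_phi,) element-wise variance approximations for phi
    Returns sum_p Jac[t, p]^2 * var_phi[p] for each entry t of the estimate.
    """
    return (Jac ** 2) @ var_phi
```

Step (3) above is this computation with Jac equal to dA and var_phi equal to varU.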

Propagate uncertainty from VV to AA.

  1. (1)

    For j=1,,Jj=1,\ldots,J,

    1. (a)

      Initialize dAK×M\texttt{dA}\in\mathbb{R}^{K\times M} to all zeros.

    2. (b)

      For m=1,,Mm=1,\ldots,M,

      1. (i)

        XdEX𝚃(dEM:j(DmU:m))\texttt{XdE}\leftarrow X^{\mathtt{T}}(\texttt{dEM}_{\bm{:}j}\odot(D_{m}U_{\bm{:}m}))

      2. (ii)

        XdWX(X𝚃((dWM:j(DmU:m))X))\texttt{XdWX}\leftarrow(X^{\mathtt{T}}((\texttt{dWM}_{\bm{:}j}\odot(D_{m}U_{\bm{:}m}))\odot X))

      3. (iii)

        dA:m(invFaj×XdWX)×(invFaj×gradAj:)+invFaj×XdE\texttt{dA}_{\bm{:}m}\leftarrow(-\texttt{invFa}_{j}\times\texttt{XdWX})\times(\texttt{invFa}_{j}\times\texttt{gradA}_{j\bm{:}})+\texttt{invFa}_{j}\times\texttt{XdE}

    3. (c)

      varAfromVjcolsums(dA𝚃(varV[block(j,M)]dA𝚃))\texttt{varAfromV}_{j}\leftarrow\mathrm{colsums}(\texttt{dA}^{\mathtt{T}}\odot(\texttt{varV}[\mathrm{block}(j,M)]\odot\texttt{dA}^{\mathtt{T}}))

  2. (2)

    varAfromVvcat(varAfromV1,,varAfromVJ)\texttt{varAfromV}\leftarrow\mathrm{vcat}(\texttt{varAfromV}_{1},\ldots,\texttt{varAfromV}_{J})

Propagate uncertainty from UU and VV to BB.

  1. (1)

    Computing varBfromU is identical to calculating varAfromV, but with ZZ, VV, varU, invFb, gradB, dWM𝚃\texttt{dWM}^{\mathtt{T}}, and dEM𝚃\texttt{dEM}^{\mathtt{T}} in place of XX, UU, varV, invFa, gradA, dWM, and dEM, respectively.

  2. (2)

    Computing varBfromV is identical to calculating varAfromU, but with ZZ, UU, varV, invFb, gradB, dWM𝚃\texttt{dWM}^{\mathtt{T}}, and dEM𝚃\texttt{dEM}^{\mathtt{T}} in place of XX, VV, varU, invFa, gradA, dWM, and dEM, respectively.

Propagate uncertainty from AA to CC.

  1. (1)

    Initialize dCKL×JK\texttt{dC}\in\mathbb{R}^{KL\times JK} to all zeros.

  2. (2)

    For j=1,,Jj=1,\ldots,J and k=1,,Kk=1,\ldots,K,

    1. (a)

      dF(Zj:Zj:𝚃)(X𝚃((dWM:jX:k)X))\texttt{dF}\leftarrow(Z_{j\bm{:}}Z_{j\bm{:}}^{\mathtt{T}})\otimes(X^{\mathtt{T}}((\texttt{dWM}_{\bm{:}j}\odot X_{\bm{:}k})\odot X))

    2. (b)

      dgradC(X𝚃×(dEM:jX:k))Zj:𝚃\texttt{dgradC}\leftarrow(X^{\mathtt{T}}\times(\texttt{dEM}_{\bm{:}j}\odot X_{\bm{:}k}))Z_{j\bm{:}}^{\mathtt{T}}

    3. (c)

      dC[:,(j1)K+k]invFc×(dF×(invFc×vec(gradC))+vec(dgradC))\texttt{dC}[\bm{:},(j-1)K+k]\leftarrow\texttt{invFc}\times(-\texttt{dF}\times(\texttt{invFc}\times\mathrm{vec}(\texttt{gradC}))+\mathrm{vec}(\texttt{dgradC}))

  3. (3)

    invFdCjinvFaj×dC[:,block(j,K)]𝚃\texttt{invFdC}_{j}\leftarrow\texttt{invFa}_{j}\times\texttt{dC}[\bm{:},\mathrm{block}(j,K)]^{\mathtt{T}} for j=1,,Jj=1,\ldots,J

  4. (4)

    invFdCvcat(invFdC1,,invFdCJ)\texttt{invFdC}\leftarrow\mathrm{vcat}(\texttt{invFdC}_{1},\ldots,\texttt{invFdC}_{J})

  5. (5)

    varCfromAcolsums(dC𝚃invFdC)\texttt{varCfromA}\leftarrow\mathrm{colsums}(\texttt{dC}^{\mathtt{T}}\odot\texttt{invFdC})

Propagate uncertainty from BB to CC.

  1. (1)

    Initialize dCKL×IL\texttt{dC}\in\mathbb{R}^{KL\times IL} to all zeros.

  2. (2)

    For i=1,,Ii=1,\ldots,I and =1,,L\ell=1,\ldots,L,

    1. (a)

      dF(Z𝚃((dWMi:Z:)Z))(Xi:Xi:𝚃)\texttt{dF}\leftarrow(Z^{\mathtt{T}}((\texttt{dWM}_{i\bm{:}}\odot Z_{\bm{:}\ell})\odot Z))\otimes(X_{i\bm{:}}X_{i\bm{:}}^{\mathtt{T}})

    2. (b)

      dgradCXi:((dEMi:Z:)𝚃×Z)\texttt{dgradC}\leftarrow X_{i\bm{:}}((\texttt{dEM}_{i\bm{:}}\odot Z_{\bm{:}\ell})^{\mathtt{T}}\times Z)

    3. (c)

      dC[:,(i1)L+]invFc×(dF×(invFc×vec(gradC))+vec(dgradC))\texttt{dC}[\bm{:},(i-1)L+\ell]\leftarrow\texttt{invFc}\times(-\texttt{dF}\times(\texttt{invFc}\times\mathrm{vec}(\texttt{gradC}))+\mathrm{vec}(\texttt{dgradC}))

  3. (3)

    invFdCiinvFbi×dC[:,block(i,L)]𝚃\texttt{invFdC}_{i}\leftarrow\texttt{invFb}_{i}\times\texttt{dC}[\bm{:},\mathrm{block}(i,L)]^{\mathtt{T}} for i=1,,Ii=1,\ldots,I

  4. (4)

    invFdCvcat(invFdC1,,invFdCI)\texttt{invFdC}\leftarrow\mathrm{vcat}(\texttt{invFdC}_{1},\ldots,\texttt{invFdC}_{I})

  5. (5)

    varCfromBcolsums(dC𝚃invFdC)\texttt{varCfromB}\leftarrow\mathrm{colsums}(\texttt{dC}^{\mathtt{T}}\odot\texttt{invFdC})

Compute approximate variances for AA, BB, and CC.

  1. (1)

    varAvcat(diag(invFa1),,diag(invFaJ))+varAfromU+varAfromV\texttt{varA}\leftarrow\mathrm{vcat}(\operatorname*{diag}(\texttt{invFa}_{1}),\ldots,\operatorname*{diag}(\texttt{invFa}_{J}))+\texttt{varAfromU}+\texttt{varAfromV}

  2. (2)

    varBvcat(diag(invFb1),,diag(invFbI))+varBfromU+varBfromV\texttt{varB}\leftarrow\mathrm{vcat}(\operatorname*{diag}(\texttt{invFb}_{1}),\ldots,\operatorname*{diag}(\texttt{invFb}_{I}))+\texttt{varBfromU}+\texttt{varBfromV}

  3. (3)

    varCdiag(invFc)+varCfromA+varCfromB\texttt{varC}\leftarrow\operatorname*{diag}(\texttt{invFc})+\texttt{varCfromA}+\texttt{varCfromB}

Propagate uncertainty from (A,B,U,V)(A,B,U,V) to SS.
First, we describe how to compute varSfromU and varSfromV.

  1. (1)

    Compute QI×JQ\in\mathbb{R}^{I\times J} where qijwijeij/rijq_{ij}\leftarrow-w_{ij}e_{ij}/r_{ij}.

  2. (2)

    Compute PI×JP\in\mathbb{R}^{I\times J} where pij2wijqij/μijp_{ij}\leftarrow 2w_{ij}q_{ij}/\mu_{ij}.

  3. (3)

    dgradSQVD\texttt{dgradS}\leftarrow QVD

  4. (4)

    dFdgradSPVD\texttt{dF}\leftarrow\texttt{dgradS}-PVD

  5. (5)

    dS(invFsdFinvFsgradS)+(invFsdgradS)\texttt{dS}\leftarrow(-\texttt{invFs}\odot\texttt{dF}\odot\texttt{invFs}\odot\texttt{gradS})+(\texttt{invFs}\odot\texttt{dgradS})

  6. (6)

    varSfromUrowsums(dSreshape(varU,M,I)𝚃dS)\texttt{varSfromU}\leftarrow\mathrm{rowsums}(\texttt{dS}\odot\mathrm{reshape}(\texttt{varU},M,I)^{\mathtt{T}}\odot\texttt{dS})

  7. (7)

    For i=1,,Ii=1,\ldots,I,

    1. (a)

      dgradSQi:(DUi:)𝚃\texttt{dgradS}\leftarrow Q_{i\bm{:}}(DU_{i\bm{:}})^{\mathtt{T}}

    2. (b)

      dFdgradSPi:(DUi:)𝚃\texttt{dF}\leftarrow\texttt{dgradS}-P_{i\bm{:}}(DU_{i\bm{:}})^{\mathtt{T}}

    3. (c)

      dSinvFsidFinvFsigradSi+invFsidgradS\texttt{dS}\leftarrow-\texttt{invFs}_{i}\cdot\texttt{dF}\cdot\texttt{invFs}_{i}\cdot\texttt{gradS}_{i}+\texttt{invFs}_{i}\cdot\texttt{dgradS}

    4. (d)

      varSfromVivarV𝚃×vec((dSdS)𝚃)\texttt{varSfromV}_{i}\leftarrow\texttt{varV}^{\mathtt{T}}\times\mathrm{vec}((\texttt{dS}\odot\texttt{dS})^{\mathtt{T}})

  8. (8)

    varSfromV(varSfromV1,,varSfromVI)𝚃\texttt{varSfromV}\leftarrow(\texttt{varSfromV}_{1},\ldots,\texttt{varSfromV}_{I})^{\mathtt{T}}

Next, varSfromB and varSfromA are computed in exactly the same way as varSfromU and varSfromV, respectively, but with XX, ZZ, I\mathrm{I}, varB, and varA in place of UU, VV, DD, varU, and varV, respectively.


Propagate uncertainty from (A,B,U,V)(A,B,U,V) to TT.

  1. (1)

    We compute varTfromV and varTfromU in exactly the same way as varSfromU and varSfromV, respectively, but with Y𝚃Y^{\mathtt{T}}, μ𝚃\mu^{\mathtt{T}}, W𝚃W^{\mathtt{T}}, E𝚃E^{\mathtt{T}}, r𝚃r^{\mathtt{T}}, VV, UU, gradT, invFt, varV, and varU in place of YY, μ\mu, WW, EE, rr, UU, VV, gradS, invFs, varU, and varV, respectively.

  2. (2)

    We compute varTfromA and varTfromB in exactly the same way as varSfromU and varSfromV, respectively, but with Y𝚃Y^{\mathtt{T}}, μ𝚃\mu^{\mathtt{T}}, W𝚃W^{\mathtt{T}}, E𝚃E^{\mathtt{T}}, r𝚃r^{\mathtt{T}}, ZZ, XX, I\mathrm{I}, gradT, invFt, varA, and varB in place of YY, μ\mu, WW, EE, rr, UU, VV, DD, gradS, invFs, varU, and varV, respectively.


Compute approximate standard errors.

  1. (1)

    se^Areshape(sqrt(varA),K,J)𝚃\hat{\mathrm{se}}_{A}\leftarrow\mathrm{reshape}(\mathrm{sqrt}(\texttt{varA}),K,J)^{\mathtt{T}}

  2. (2)

    se^Breshape(sqrt(varB),L,I)𝚃\hat{\mathrm{se}}_{B}\leftarrow\mathrm{reshape}(\mathrm{sqrt}(\texttt{varB}),L,I)^{\mathtt{T}}

  3. (3)

    se^Creshape(sqrt(varC),K,L)\hat{\mathrm{se}}_{C}\leftarrow\mathrm{reshape}(\mathrm{sqrt}(\texttt{varC}),K,L)

  4. (4)

    se^Ureshape(sqrt(varU),M,I)𝚃\hat{\mathrm{se}}_{U}\leftarrow\mathrm{reshape}(\mathrm{sqrt}(\texttt{varU}),M,I)^{\mathtt{T}}

  5. (5)

    se^Vreshape(sqrt(varV),M,J)𝚃\hat{\mathrm{se}}_{V}\leftarrow\mathrm{reshape}(\mathrm{sqrt}(\texttt{varV}),M,J)^{\mathtt{T}}

  6. (6)

    se^Ssqrt(invFs+varSfromA+varSfromB+varSfromU+varSfromV)\hat{\mathrm{se}}_{S}\leftarrow\mathrm{sqrt}(\texttt{invFs}+\texttt{varSfromA}+\texttt{varSfromB}+\texttt{varSfromU}+\texttt{varSfromV})

  7. (7)

    se^Tsqrt(invFt+varTfromA+varTfromB+varTfromU+varTfromV)\hat{\mathrm{se}}_{T}\leftarrow\mathrm{sqrt}(\texttt{invFt}+\texttt{varTfromA}+\texttt{varTfromB}+\texttt{varTfromU}+\texttt{varTfromV})

We do not attempt to provide standard errors for DD, since it seems difficult to estimate DD without significant bias. Note that here, we reshape the vectorized standard errors to matrices having the same dimensions as the corresponding components, for instance, se^A\hat{\mathrm{se}}_{A} has the same dimensions as AA, namely J×KJ\times K.

S9 Priors for regularization

We place independent normal priors on all the entries of AA, BB, CC, DD, UU, and VV, and in the NB-GBM, on the entries of SS and TT as well. Specifically, the prior is π(A,B,C,D,U,V,S,T)=πa(A)πb(B)πc(C)πd(D)πu(U)πv(V)πs(S)πt(T)\pi(A,B,C,D,U,V,S,T)=\pi_{a}(A)\pi_{b}(B)\pi_{c}(C)\pi_{d}(D)\pi_{u}(U)\pi_{v}(V)\pi_{s}(S)\pi_{t}(T) where

πa(A)\displaystyle\pi_{a}(A) =j,k𝒩(ajk0,λa1)\displaystyle={\textstyle\prod_{j,k}\;}\mathcal{N}(a_{jk}\mid 0,\lambda_{a}^{-1})\qquad πu(U)\displaystyle\pi_{u}(U) =i,m𝒩(uim0,λu1)\displaystyle={\textstyle\prod_{i,m}\;}\mathcal{N}(u_{im}\mid 0,\lambda_{u}^{-1})
πb(B)\displaystyle\pi_{b}(B) =i,ℓ𝒩(bi0,λb1)\displaystyle={\textstyle\prod_{i,\ell}\;}\mathcal{N}(b_{i\ell}\mid 0,\lambda_{b}^{-1})\qquad πv(V)\displaystyle\pi_{v}(V) =j,m𝒩(vjm0,λv1)\displaystyle={\textstyle\prod_{j,m}\;}\mathcal{N}(v_{jm}\mid 0,\lambda_{v}^{-1}) (S9.1)
πc(C)\displaystyle\pi_{c}(C) =k,𝒩(ck0,λc1)\displaystyle={\textstyle\prod_{k,\ell}\;}\mathcal{N}(c_{k\ell}\mid 0,\lambda_{c}^{-1})\qquad πs(S)\displaystyle\pi_{s}(S) =i𝒩(sims,λs1)\displaystyle={\textstyle\prod_{i}\;}\mathcal{N}(s_{i}\mid m_{s},\lambda_{s}^{-1})
πd(D)\displaystyle\pi_{d}(D) =m𝒩(dmm0,λd1)\displaystyle={\textstyle\prod_{m}\;}\mathcal{N}(d_{mm}\mid 0,\lambda_{d}^{-1})\qquad πt(T)\displaystyle\pi_{t}(T) =j𝒩(tjmt,λt1).\displaystyle={\textstyle\prod_{j}\;}\mathcal{N}(t_{j}\mid m_{t},\lambda_{t}^{-1}).

Thus, the log-prior is

logπ=const12λaj,kajk212λbi,bi212λck,ck212λdmdmm212λui,muim212λvj,mvjm212λsi(sims)212λtj(tjmt)2.\displaystyle\begin{split}\log\pi&=\mathrm{const}-\tfrac{1}{2}\lambda_{a}\sum_{j,k}a_{jk}^{2}-\tfrac{1}{2}\lambda_{b}\sum_{i,\ell}b_{i\ell}^{2}-\tfrac{1}{2}\lambda_{c}\sum_{k,\ell}c_{k\ell}^{2}-\tfrac{1}{2}\lambda_{d}\sum_{m}d_{mm}^{2}\\ &~{}~{}~{}~{}-\tfrac{1}{2}\lambda_{u}\sum_{i,m}u_{im}^{2}-\tfrac{1}{2}\lambda_{v}\sum_{j,m}v_{jm}^{2}-\tfrac{1}{2}\lambda_{s}\sum_{i}(s_{i}-m_{s})^{2}-\tfrac{1}{2}\lambda_{t}\sum_{j}(t_{j}-m_{t})^{2}.\end{split} (S9.2)

For the prior parameters, we use the following default settings: λa=λb=λc=λd=λu=λv=1\lambda_{a}=\lambda_{b}=\lambda_{c}=\lambda_{d}=\lambda_{u}=\lambda_{v}=1, ms=mt=0m_{s}=m_{t}=0, and λs=λt=1\lambda_{s}=\lambda_{t}=1. These defaults are broadly applicable because the coefficients are all on a common scale: we standardize the covariates to have zero mean and unit variance, that is, 1Ii=1Ixik=0\frac{1}{I}\sum_{i=1}^{I}x_{ik}=0 and 1Ii=1Ixik2=1\frac{1}{I}\sum_{i=1}^{I}x_{ik}^{2}=1 for all k2k\geq 2, and 1Jj=1Jzj=0\frac{1}{J}\sum_{j=1}^{J}z_{j\ell}=0 and 1Jj=1Jzj2=1\frac{1}{J}\sum_{j=1}^{J}z_{j\ell}^{2}=1 for all 2\ell\geq 2. However, specific applications may call for departures from these defaults.

For the updates to G=UDG=UD and H=VDH=VD in the GBM estimation algorithm (Section S7), we use the priors on GG and HH induced by the priors on UU and VV, given DD. First consider GG. For any fixed DD, the induced prior on gim=uimdmmg_{im}=u_{im}d_{mm} is π(gim)=𝒩(gim0,dmm2/λu)\pi(g_{im})=\mathcal{N}(g_{im}\mid 0,d_{mm}^{2}/\lambda_{u}). Thus, given DD, the prior on each row of GG is Gi:𝒩(0,Λ1)G_{i\bm{:}}\sim\mathcal{N}(0,\Lambda^{-1}) where Λ=(D2/λu)1\Lambda=(D^{2}/\lambda_{u})^{-1}. The gradient and Hessian of the log-prior on Gi:G_{i\bm{:}} are therefore ΛGi:-\Lambda G_{i\bm{:}} and Λ-\Lambda, respectively. Hence, with this prior, the regularized Fisher scoring approach (Equation 3.1) yields the GG update formulas used in the algorithm (Section S7). The HH update is similar, except that the induced prior is Hj:𝒩(0,Λ1)H_{j\bm{:}}\sim\mathcal{N}(0,\Lambda^{-1}) where Λ=(D2/λv)1\Lambda=(D^{2}/\lambda_{v})^{-1}. It seems reasonable to hold DD fixed when computing the induced priors on GG and HH, rather than integrating it out, since DD tends to be more accurately estimated than UU or VV (in terms of relative MSE), presumably due to the fact that DD has only MM nonzero entries, each of which is informed by all of the data; see Figure S1 for an empirical example.

S10 Proofs

S10.1 Identifiability and interpretation

Proof of Theorem 5.1.

Left-multiplying both sides of Equation 5.1 by X𝚃X^{\mathtt{T}}, we have

X𝚃XA1𝚃+X𝚃XC1Z𝚃=X𝚃XA2𝚃+X𝚃XC2Z𝚃X^{\mathtt{T}}XA_{1}^{\mathtt{T}}+X^{\mathtt{T}}XC_{1}Z^{\mathtt{T}}=X^{\mathtt{T}}XA_{2}^{\mathtt{T}}+X^{\mathtt{T}}XC_{2}Z^{\mathtt{T}} (S10.1)

by Condition 2.1(b). Since X𝚃XX^{\mathtt{T}}X is invertible, this implies

A1𝚃+C1Z𝚃=A2𝚃+C2Z𝚃.A_{1}^{\mathtt{T}}+C_{1}Z^{\mathtt{T}}=A_{2}^{\mathtt{T}}+C_{2}Z^{\mathtt{T}}. (S10.2)

Right-multiplying Equation S10.2 by ZZ, we have C1Z𝚃Z=C2Z𝚃ZC_{1}Z^{\mathtt{T}}Z=C_{2}Z^{\mathtt{T}}Z by Condition 2.1(b). Since Z𝚃ZZ^{\mathtt{T}}Z is invertible, this implies that C1=C2C_{1}=C_{2}. Plugging C1=C2C_{1}=C_{2} into Equation S10.2 yields A1=A2A_{1}=A_{2}. Plugging A1=A2A_{1}=A_{2} and C1=C2C_{1}=C_{2} into Equation 5.1, we have

B1Z𝚃+U1D1V1𝚃=B2Z𝚃+U2D2V2𝚃.B_{1}Z^{\mathtt{T}}+U_{1}D_{1}V_{1}^{\mathtt{T}}=B_{2}Z^{\mathtt{T}}+U_{2}D_{2}V_{2}^{\mathtt{T}}. (S10.3)

Right-multiplying Equation S10.3 by ZZ, using Condition 2.1(b), and using the fact that Z𝚃ZZ^{\mathtt{T}}Z is invertible, we obtain B1=B2B_{1}=B_{2}. This implies that U1D1V1𝚃=U2D2V2𝚃U_{1}D_{1}V_{1}^{\mathtt{T}}=U_{2}D_{2}V_{2}^{\mathtt{T}}. By the uniqueness properties of the singular value decomposition, Conditions 2.1(c) and 2.1(d) imply that D1=D2D_{1}=D_{2}, U1=U2𝖲U_{1}=U_{2}\mathsf{S}, and V1𝚃=𝖲V2𝚃V_{1}^{\mathtt{T}}=\mathsf{S}V_{2}^{\mathtt{T}} for a diagonal matrix 𝖲\mathsf{S} of the form 𝖲=Diag(±1,,±1)\mathsf{S}=\mathrm{Diag}(\pm 1,\ldots,\pm 1) (Blum et al., 2020). By Condition 2.1(e), 𝖲=I\mathsf{S}=\mathrm{I}. Therefore, U1=U2U_{1}=U_{2} and V1=V2V_{1}=V_{2}. This proves that AA, BB, CC, DD, UU, and VV are uniquely determined by 𝔼(𝒀)\mathbb{E}(\bm{Y}) for any given XX, ZZ, MM. ∎

Proof of Theorem 5.2.

First, j=1Jajk=0\sum_{j=1}^{J}a_{jk}=0 follows from the fact that zj1=1z_{j1}=1 for all jj by Condition 2.2(a) and Z𝚃A=0Z^{\mathtt{T}}A=0 by Condition 2.1(b). Likewise, i=1Ibi=0\sum_{i=1}^{I}b_{i\ell}=0 follows from xi1=1x_{i1}=1 and X𝚃B=0X^{\mathtt{T}}B=0. In the same way, i=1Iuim=0\sum_{i=1}^{I}u_{im}=0 and j=1Jvjm=0\sum_{j=1}^{J}v_{jm}=0 since xi1=1x_{i1}=1, zj1=1z_{j1}=1, X𝚃U=0X^{\mathtt{T}}U=0, and Z𝚃V=0Z^{\mathtt{T}}V=0.

When Condition 2.2(a) holds, we can rearrange Equation 2.1 as

g(μij)\displaystyle g(\mu_{ij}) =c11+aj1+bi1+k=2K(ck1+ajk)xik+=2L(c1+bi)zj\displaystyle=c_{11}+a_{j1}+b_{i1}+\sum_{k=2}^{K}(c_{k1}+a_{jk})x_{ik}+\sum_{\ell=2}^{L}(c_{1\ell}+b_{i\ell})z_{j\ell}
+k=2K=2Lckxikzj+m=1Muimdmmvjm.\displaystyle~{}~{}~{}~{}+\sum_{k=2}^{K}\sum_{\ell=2}^{L}c_{k\ell}x_{ik}z_{j\ell}+\sum_{m=1}^{M}u_{im}d_{mm}v_{jm}. (S10.4)

Averaging Equation S10.4 over all ii, and using these sum-to-zero properties (specifically, using that i=1Ixik=0\sum_{i=1}^{I}x_{ik}=0 for k2k\geq 2, i=1Ibi=0\sum_{i=1}^{I}b_{i\ell}=0, and i=1Iuim=0\sum_{i=1}^{I}u_{im}=0),

1Ii=1Ig(μij)\displaystyle\frac{1}{I}\sum_{i=1}^{I}g(\mu_{ij}) =c11+aj1+(1Iibi1)+k=2K(ck1+ajk)(1Iixik)+=2L(c1+(1Iibi))zj\displaystyle=c_{11}+a_{j1}+({\textstyle\frac{1}{I}\sum_{i}}\,b_{i1})+\sum_{k=2}^{K}(c_{k1}+a_{jk})({\textstyle\frac{1}{I}\sum_{i}}\,x_{ik})+\sum_{\ell=2}^{L}\big{(}c_{1\ell}+({\textstyle\frac{1}{I}\sum_{i}}\,b_{i\ell})\big{)}z_{j\ell}
+k=2K=2Lck(1Iixik)zj+m=1M(1Iiuim)dmmvjm\displaystyle~{}~{}~{}~{}+\sum_{k=2}^{K}\sum_{\ell=2}^{L}c_{k\ell}({\textstyle\frac{1}{I}\sum_{i}}\,x_{ik})z_{j\ell}+\sum_{m=1}^{M}({\textstyle\frac{1}{I}\sum_{i}}\,u_{im})d_{mm}v_{jm}
=c11+aj1+=2Lc1zj.\displaystyle=c_{11}+a_{j1}+\sum_{\ell=2}^{L}c_{1\ell}z_{j\ell}. (S10.5)

In the same way, averaging Equation S10.4 over all jj (and using that j=1Jzj=0\sum_{j=1}^{J}z_{j\ell}=0 for 2\ell\geq 2, j=1Jajk=0\sum_{j=1}^{J}a_{jk}=0, and j=1Jvjm=0\sum_{j=1}^{J}v_{jm}=0), we have

1Jj=1Jg(μij)=c11+bi1+k=2Kck1xik.\frac{1}{J}\sum_{j=1}^{J}g(\mu_{ij})=c_{11}+b_{i1}+\sum_{k=2}^{K}c_{k1}x_{ik}.

Finally, averaging Equation S10.5 over all jj (using that j=1Jaj1=0\sum_{j=1}^{J}a_{j1}=0 and j=1Jzj=0\sum_{j=1}^{J}z_{j\ell}=0 for 2\ell\geq 2), we have 1IJi=1Ij=1Jg(μij)=c11\frac{1}{IJ}\sum_{i=1}^{I}\sum_{j=1}^{J}g(\mu_{ij})=c_{11}. ∎

Proof of Theorem 5.3.

For any Qm×nQ\in\mathbb{R}^{m\times n}, we have SS(Q)=tr(Q𝚃Q)\mathrm{SS}(Q)=\mathrm{tr}(Q^{\mathtt{T}}Q), where tr()\mathrm{tr}(\cdot) denotes the trace. Define Q=XA𝚃+BZ𝚃+XCZ𝚃+UDV𝚃Q=XA^{\mathtt{T}}+BZ^{\mathtt{T}}+XCZ^{\mathtt{T}}+UDV^{\mathtt{T}}. By using X𝚃B=0X^{\mathtt{T}}B=0, Z𝚃A=0Z^{\mathtt{T}}A=0, X𝚃U=0X^{\mathtt{T}}U=0, and Z𝚃V=0Z^{\mathtt{T}}V=0, we have that

Q𝚃Q\displaystyle Q^{\mathtt{T}}Q =(AX𝚃+ZB𝚃+ZC𝚃X𝚃+VDU𝚃)(XA𝚃+BZ𝚃+XCZ𝚃+UDV𝚃)\displaystyle=(AX^{\mathtt{T}}+ZB^{\mathtt{T}}+ZC^{\mathtt{T}}X^{\mathtt{T}}+VDU^{\mathtt{T}})(XA^{\mathtt{T}}+BZ^{\mathtt{T}}+XCZ^{\mathtt{T}}+UDV^{\mathtt{T}})
=AX𝚃XA𝚃+AX𝚃XCZ𝚃+ZB𝚃BZ𝚃+ZB𝚃UDV𝚃+ZC𝚃X𝚃XA𝚃\displaystyle=AX^{\mathtt{T}}XA^{\mathtt{T}}+AX^{\mathtt{T}}XCZ^{\mathtt{T}}+ZB^{\mathtt{T}}BZ^{\mathtt{T}}+ZB^{\mathtt{T}}UDV^{\mathtt{T}}+ZC^{\mathtt{T}}X^{\mathtt{T}}XA^{\mathtt{T}}
+ZC𝚃X𝚃XCZ𝚃+VDU𝚃BZ𝚃+VDU𝚃UDV𝚃.\displaystyle~{}~{}~{}+ZC^{\mathtt{T}}X^{\mathtt{T}}XCZ^{\mathtt{T}}+VDU^{\mathtt{T}}BZ^{\mathtt{T}}+VDU^{\mathtt{T}}UDV^{\mathtt{T}}.

By the cyclic property of the trace,

tr(AX𝚃XCZ𝚃)\displaystyle\mathrm{tr}(AX^{\mathtt{T}}XCZ^{\mathtt{T}}) =tr(XCZ𝚃AX𝚃)=0,\displaystyle=\mathrm{tr}(XCZ^{\mathtt{T}}AX^{\mathtt{T}})=0,
tr(ZB𝚃UDV𝚃)\displaystyle\mathrm{tr}(ZB^{\mathtt{T}}UDV^{\mathtt{T}}) =tr(B𝚃UDV𝚃Z)=0,\displaystyle=\mathrm{tr}(B^{\mathtt{T}}UDV^{\mathtt{T}}Z)=0,
tr(ZC𝚃X𝚃XA𝚃)\displaystyle\mathrm{tr}(ZC^{\mathtt{T}}X^{\mathtt{T}}XA^{\mathtt{T}}) =tr(XA𝚃ZC𝚃X𝚃)=0,\displaystyle=\mathrm{tr}(XA^{\mathtt{T}}ZC^{\mathtt{T}}X^{\mathtt{T}})=0,
tr(VDU𝚃BZ𝚃)\displaystyle\mathrm{tr}(VDU^{\mathtt{T}}BZ^{\mathtt{T}}) =tr(BZ𝚃VDU𝚃)=0.\displaystyle=\mathrm{tr}(BZ^{\mathtt{T}}VDU^{\mathtt{T}})=0.

Therefore, by the linearity of the trace,

SS(Q)\displaystyle\mathrm{SS}(Q) =tr(Q𝚃Q)=tr(AX𝚃XA𝚃)+tr(ZB𝚃BZ𝚃)+tr(ZC𝚃X𝚃XCZ𝚃)+tr(VDU𝚃UDV𝚃)\displaystyle=\mathrm{tr}(Q^{\mathtt{T}}Q)=\mathrm{tr}(AX^{\mathtt{T}}XA^{\mathtt{T}})+\mathrm{tr}(ZB^{\mathtt{T}}BZ^{\mathtt{T}})+\mathrm{tr}(ZC^{\mathtt{T}}X^{\mathtt{T}}XCZ^{\mathtt{T}})+\mathrm{tr}(VDU^{\mathtt{T}}UDV^{\mathtt{T}})
=SS(XA𝚃)+SS(BZ𝚃)+SS(XCZ𝚃)+SS(UDV𝚃).\displaystyle=\mathrm{SS}(XA^{\mathtt{T}})+\mathrm{SS}(BZ^{\mathtt{T}})+\mathrm{SS}(XCZ^{\mathtt{T}})+\mathrm{SS}(UDV^{\mathtt{T}}). ∎

S10.2 Likelihood-preserving projections

Proof of Theorem 5.4.

(1.) For the projection of A~\tilde{A}, plugging in the definitions of A1A_{1} and C1C_{1}, we have

XA1𝚃+XC1Z𝚃=XA~𝚃X(Z+A~)𝚃Z𝚃+XCZ𝚃+X(Z+A~)𝚃Z𝚃=XA~𝚃+XCZ𝚃,XA_{1}^{\mathtt{T}}+XC_{1}Z^{\mathtt{T}}=X\tilde{A}^{\mathtt{T}}-X(Z^{\texttt{\small{+}}}\tilde{A})^{\mathtt{T}}Z^{\mathtt{T}}+XCZ^{\mathtt{T}}+X(Z^{\texttt{\small{+}}}\tilde{A})^{\mathtt{T}}Z^{\mathtt{T}}=X\tilde{A}^{\mathtt{T}}+XCZ^{\mathtt{T}},

and therefore, η(A1,B,C1,D,U,V)=η(A~,B,C,D,U,V)\eta(A_{1},B,C_{1},D,U,V)=\eta(\tilde{A},B,C,D,U,V). To see that Condition 2.1 is satisfied, first note that Z𝚃(IZZ+)=Z𝚃Z𝚃Z(Z𝚃Z)1Z𝚃=0Z^{\mathtt{T}}(\mathrm{I}-ZZ^{\texttt{\small{+}}})=Z^{\mathtt{T}}-Z^{\mathtt{T}}Z(Z^{\mathtt{T}}Z)^{-1}Z^{\mathtt{T}}=0, and therefore

Z𝚃A1=Z𝚃(A~ZZ+A~)=Z𝚃(IZZ+)A~=0.Z^{\mathtt{T}}A_{1}=Z^{\mathtt{T}}(\tilde{A}-ZZ^{\texttt{\small{+}}}\tilde{A})=Z^{\mathtt{T}}(\mathrm{I}-ZZ^{\texttt{\small{+}}})\tilde{A}=0.

(2.) Similarly, for the projection of B~\tilde{B}, we have B1Z𝚃+XC1Z𝚃=B~Z𝚃+XCZ𝚃B_{1}Z^{\mathtt{T}}+XC_{1}Z^{\mathtt{T}}=\tilde{B}Z^{\mathtt{T}}+XCZ^{\mathtt{T}} and X𝚃B1=0X^{\mathtt{T}}B_{1}=0. (3.) For the projection of G~\tilde{G}, we have

XA1𝚃+XC1Z𝚃+U1D1V1𝚃\displaystyle XA_{1}^{\mathtt{T}}+XC_{1}Z^{\mathtt{T}}+U_{1}D_{1}V_{1}^{\mathtt{T}} =XA0𝚃X(Z+A0)𝚃Z𝚃+XCZ𝚃+X(Z+A0)𝚃Z𝚃+G0V𝚃\displaystyle=XA_{0}^{\mathtt{T}}-X(Z^{\texttt{\small{+}}}A_{0})^{\mathtt{T}}Z^{\mathtt{T}}+XCZ^{\mathtt{T}}+X(Z^{\texttt{\small{+}}}A_{0})^{\mathtt{T}}Z^{\mathtt{T}}+G_{0}V^{\mathtt{T}}
=XA𝚃+X(X+G~)V𝚃+XCZ𝚃+G~V𝚃X(X+G~)V𝚃\displaystyle=XA^{\mathtt{T}}+X(X^{\texttt{\small{+}}}\tilde{G})V^{\mathtt{T}}+XCZ^{\mathtt{T}}+\tilde{G}V^{\mathtt{T}}-X(X^{\texttt{\small{+}}}\tilde{G})V^{\mathtt{T}}
=XA𝚃+XCZ𝚃+G~IV𝚃,\displaystyle=XA^{\mathtt{T}}+XCZ^{\mathtt{T}}+\tilde{G}\mathrm{I}V^{\mathtt{T}},

and thus, η(A1,B,C1,D1,U1,V1)=η(A,B,C,I,G~,V)\eta(A_{1},B,C_{1},D_{1},U_{1},V_{1})=\eta(A,B,C,\mathrm{I},\tilde{G},V). To check that Condition 2.1 is satisfied, first observe that Z𝚃A1=Z𝚃(IZZ+)A0=0Z^{\mathtt{T}}A_{1}=Z^{\mathtt{T}}(\mathrm{I}-ZZ^{\texttt{\small{+}}})A_{0}=0 and X𝚃G0=X𝚃(IXX+)G~=0X^{\mathtt{T}}G_{0}=X^{\mathtt{T}}(\mathrm{I}-XX^{\texttt{\small{+}}})\tilde{G}=0. Hence,

0=(a)X𝚃G0V𝚃V1D11=(b)X𝚃U1D1V1𝚃V1D11=(c)X𝚃U10\stackrel{{\scriptstyle\text{(a)}}}{{=}}X^{\mathtt{T}}G_{0}V^{\mathtt{T}}V_{1}D_{1}^{-1}\stackrel{{\scriptstyle\text{(b)}}}{{=}}X^{\mathtt{T}}U_{1}D_{1}V_{1}^{\mathtt{T}}V_{1}D_{1}^{-1}\stackrel{{\scriptstyle\text{(c)}}}{{=}}X^{\mathtt{T}}U_{1}

where we have used (a) X𝚃G0=0X^{\mathtt{T}}G_{0}=0, (b) G0V𝚃=U1D1V1𝚃G_{0}V^{\mathtt{T}}=U_{1}D_{1}V_{1}^{\mathtt{T}}, and (c) V1𝚃V1=IV_{1}^{\mathtt{T}}V_{1}=\mathrm{I} and D1D11=ID_{1}D_{1}^{-1}=\mathrm{I}. Likewise, since V𝚃Z=0V^{\mathtt{T}}Z=0 by assumption,

0=D11U1𝚃G0V𝚃Z=D11U1𝚃U1D1V1𝚃Z=V1𝚃Z0=D_{1}^{-1}U_{1}^{\mathtt{T}}G_{0}V^{\mathtt{T}}Z=D_{1}^{-1}U_{1}^{\mathtt{T}}U_{1}D_{1}V_{1}^{\mathtt{T}}Z=V_{1}^{\mathtt{T}}Z

since U1𝚃U1=IU_{1}^{\mathtt{T}}U_{1}=\mathrm{I} and D11D1=ID_{1}^{-1}D_{1}=\mathrm{I}.

(4.) For the projection of H~\tilde{H}, in an altogether similar way, we have

B1Z𝚃+XC1Z𝚃+U1D1V1𝚃=BZ𝚃+XCZ𝚃+UIH~𝚃.B_{1}Z^{\mathtt{T}}+XC_{1}Z^{\mathtt{T}}+U_{1}D_{1}V_{1}^{\mathtt{T}}=BZ^{\mathtt{T}}+XCZ^{\mathtt{T}}+U\mathrm{I}\tilde{H}^{\mathtt{T}}.

Further, X𝚃B1=X𝚃(IXX+)B0=0X^{\mathtt{T}}B_{1}=X^{\mathtt{T}}(\mathrm{I}-XX^{\texttt{\small{+}}})B_{0}=0 and Z𝚃H0=0Z^{\mathtt{T}}H_{0}=0, thus

0\displaystyle 0 =Z𝚃H0U𝚃U1D11=Z𝚃V1D1U1𝚃U1D11=Z𝚃V1\displaystyle=Z^{\mathtt{T}}H_{0}U^{\mathtt{T}}U_{1}D_{1}^{-1}=Z^{\mathtt{T}}V_{1}D_{1}U_{1}^{\mathtt{T}}U_{1}D_{1}^{-1}=Z^{\mathtt{T}}V_{1}
0\displaystyle 0 =X𝚃UH0𝚃V1D11=X𝚃U1D1V1𝚃V1D11=X𝚃U1.\displaystyle=X^{\mathtt{T}}UH_{0}^{\mathtt{T}}V_{1}D_{1}^{-1}=X^{\mathtt{T}}U_{1}D_{1}V_{1}^{\mathtt{T}}V_{1}D_{1}^{-1}=X^{\mathtt{T}}U_{1}.

Computationally, it is highly advantageous to structure the calculation of the projections in Theorem 5.4 as follows. First, one can precompute the pseudoinverses X+X^{\texttt{\small{+}}} and Z+Z^{\texttt{\small{+}}} since XX and ZZ are fixed throughout the algorithm. In the updates to AA (or BB), it is advantageous to first compute Z+A~Z^{\texttt{\small{+}}}\tilde{A} (or X+B~X^{\texttt{\small{+}}}\tilde{B}, respectively) in order to avoid explicitly computing and storing XX+I×IXX^{\texttt{\small{+}}}\in\mathbb{R}^{I\times I} and ZZ+J×JZZ^{\texttt{\small{+}}}\in\mathbb{R}^{J\times J}. Likewise, in the projection of G~\tilde{G} (or H~\tilde{H}), first compute X+G~X^{\texttt{\small{+}}}\tilde{G} and Z+A0Z^{\texttt{\small{+}}}A_{0} (or Z+H~Z^{\texttt{\small{+}}}\tilde{H} and X+B0X^{\texttt{\small{+}}}B_{0}, respectively). We use this approach in the step-by-step algorithm provided in Section S7.

The interpretation of the operations in Theorem 5.4 is as follows. For A~\tilde{A}, the idea is that A1=(IZZ+)A~A_{1}=(\mathrm{I}-ZZ^{\texttt{\small{+}}})\tilde{A} is an orthogonal projection of the columns of A~\tilde{A} onto the nullspace of Z𝚃Z^{\mathtt{T}}, and C1C_{1} is a shifted version of CC to compensate for the shift from A~\tilde{A} to A1A_{1}. Likewise for B~\tilde{B}, but with XX instead of ZZ. For G~\tilde{G}, the idea is that (a) G0G_{0} is a projection of G~\tilde{G} onto the nullspace of X𝚃X^{\mathtt{T}}, (b) the SVD enforces the orthonormality, ordering, and sign constraints on U1U_{1}, D1D_{1}, and V1V_{1} while maintaining X𝚃U1=0X^{\mathtt{T}}U_{1}=0 and Z𝚃V1=0Z^{\mathtt{T}}V_{1}=0, (c) A0A_{0} compensates for the shift from G~\tilde{G} to G0G_{0}, (d) A1A_{1} projects A0A_{0} onto the nullspace of Z𝚃Z^{\mathtt{T}}, and (e) C1C_{1} compensates for the shift from A0A_{0} to A1A_{1}. For H~\tilde{H}, the interpretation is similar.

S10.3 Estimation time complexity

In this section, we justify the expressions in Section 5 giving the time complexity of the estimation algorithm as a function of II, JJ, KK, LL, and MM, assuming max{K2,L2,M}min{I,J}\max\{K^{2},L^{2},M\}\leq\min\{I,J\} (Equation 5.2). The outline of the estimation algorithm is in Section 3, and the step-by-step details are in Section S7. Denote xy=min{x,y}x\wedge y=\min\{x,y\} and xy=max{x,y}x\vee y=\max\{x,y\}. For the updates to each of AA, BB, CC, DD, UDUD, VDVD, SS, and TT, we report the computation time complexity after η\eta, μ\mu, WW, and EE have been recomputed.

Cost of preprocessing and initialization. Precomputing the pseudoinverses X+X^{\texttt{\small{+}}} and Z+Z^{\texttt{\small{+}}} takes O(IK2)O(IK^{2}) and O(JL2)O(JL^{2}) time, respectively; thus, both are O(IJ)O(IJ) by Equation 5.2. Computing CX+Yˇ(Z+)𝚃C\leftarrow X^{\texttt{\small{+}}}\check{Y}(Z^{\texttt{\small{+}}})^{\mathtt{T}} takes O(IJ(KL))O(IJ(K\wedge L)) time, A(X+YˇCZ𝚃)𝚃A\leftarrow(X^{\texttt{\small{+}}}\check{Y}-CZ^{\mathtt{T}})^{\mathtt{T}} takes O(IJK)O(IJK) time, and BYˇ(Z+)𝚃XCB\leftarrow\check{Y}(Z^{\texttt{\small{+}}})^{\mathtt{T}}-XC takes O(IJL)O(IJL) time. Computing DD, UU, and VV takes O(IJM)O(IJM) time, since the truncated SVD of rank MM for an I×JI\times J matrix can be done in O(IJM)O(IJM) time (Halko et al., 2011). Finally, each update to SS and TT takes O(IJ)O(IJ) time (see below). Thus, overall, preprocessing and initialization takes O(IJ(KLM))O(IJ(K\vee L\vee M)) time.

Cost of computing η\eta, μ\mu, WW, and EE. Computing η=XA𝚃+BZ𝚃+XCZ𝚃+UDV𝚃\eta=XA^{\mathtt{T}}+BZ^{\mathtt{T}}+XCZ^{\mathtt{T}}+UDV^{\mathtt{T}} takes O(IJ(KLM))O(IJ(K\vee L\vee M)) time, since XA𝚃XA^{\mathtt{T}}, BZ𝚃BZ^{\mathtt{T}}, XCZ𝚃XCZ^{\mathtt{T}}, and UDV𝚃UDV^{\mathtt{T}} take O(IJK)O(IJK), O(IJL)O(IJL), O(IJ(KL))O(IJ(K\wedge L)), and O(IJM)O(IJM) time, respectively. Computing μ\mu, WW, and EE takes O(IJ)O(IJ) time given η\eta.

Cost of updating AA. For each jj, computing the Fisher scoring step takes O(IK2)O(IK^{2}) time, so altogether the JJ steps take O(IJK2)O(IJK^{2}) time. For the projection, we compute QZ+AQ\leftarrow Z^{\texttt{\small{+}}}A and ZQZQ, which takes O(JKL)O(JKL) time. By Equation 5.2, we have LIL\leq I, so the cost of computing the projection can be absorbed into the cost of the Fisher scoring steps.

Cost of updating BB. By symmetry, this takes O(IJL2)O(IJL^{2}) time.

Cost of updating CC. Computing the Fisher information matrix FF takes O(IJK2+JK2L2)O(IJK^{2}+JK^{2}L^{2}) time, which is O(IJK2)O(IJK^{2}) since L2IL^{2}\leq I by Equation 5.2. Inverting F+λcIF+\lambda_{c}\mathrm{I} takes O((KL)3)=O(IJKL)O((KL)^{3})=O(IJKL) time (using Equation 5.2), and computing X𝚃EZX^{\mathtt{T}}EZ takes O(IJK)O(IJK) time, so the update to CC can be done in O(IJ(K2L2))O(IJ(K^{2}\vee L^{2})) time.

Cost of updating DD. Computing the Fisher information matrix FF takes O(IJM2)O(IJM^{2}) time, inverting F+λdIF+\lambda_{d}\mathrm{I} takes O(M3)O(M^{3}) time, and computing diag(U𝚃EV)\operatorname*{diag}(U^{\mathtt{T}}EV) takes O(IJM)O(IJM) time. Thus, the update to DD takes O(IJM2)O(IJM^{2}) time.

Cost of updating G=UDG=UD. By comparison with the BB update, the Fisher scoring steps cost O(IJM2)O(IJM^{2}) time. The projection steps (except for the SVD) take O(IKM+JKM+JKL)O(IKM+JKM+JKL) time, and by Equation 5.2, this is O(IJM)O(IJM). Computing the truncated SVD of rank MM for an I×JI\times J matrix can be done in O(IJM)O(IJM) time (Halko et al., 2011). Therefore, the cost of the projection can be absorbed into the Fisher scoring steps.

Cost of updating H=VDH=VD. By symmetry, this takes O(IJM2)O(IJM^{2}) time.

Cost of updating SS and TT. Given μ\mu, updating SS and TT takes O(IJ)O(IJ) time, since computing δ\delta and δ\delta^{\prime} involves a loop over all ii and jj.

S10.4 Inference time complexity

Here, we justify the expressions in Section 5 giving the time complexity of the inference algorithm, assuming max{K2,L2,M}min{I,J}\max\{K^{2},L^{2},M\}\leq\min\{I,J\} (Equation 5.2) and also assuming IJI\geq J. The outline of the inference algorithm is in Section 4.2, and the step-by-step details are in Section S8. Denote xy=min{x,y}x\wedge y=\min\{x,y\} and xy=max{x,y}x\vee y=\max\{x,y\}.

Cost of preprocessing. Computing η\eta, μ\mu, WW, and EE takes O(IJ(KLM))O(IJ(K\vee L\vee M)) time. Computing gradA, gradB, and gradC takes O(IJK)O(IJK), O(IJL)O(IJL), and O(IJK)O(IJK) time, respectively. All of the other preprocessing steps take O(IJ)O(IJ) time.

Cost of computing conditional uncertainty for each component. Computing invFa, invFb, and invFc take O(IJK2)O(IJK^{2}), O(IJL2)O(IJL^{2}), and O(IJ(K2L2))O(IJ(K^{2}\vee L^{2})) time, respectively. Both invFu and invFv take O(IJM2)O(IJM^{2}) time, and invFs and invFt take O(IJ)O(IJ) time. Thus, overall, this part is O(IJ(K2L2M2))O(IJ(K^{2}\vee L^{2}\vee M^{2})).

Cost of computing constraint Jacobians for UU and VV. Computing Ju and Jv take O(I(KM2+M3))O(I(KM^{2}+M^{3})) and O(J(LM2+M3))O(J(LM^{2}+M^{3})) time, respectively.

Cost of computing joint uncertainty in (U,V)(U,V) accounting for constraints. The most expensive steps are computing FuvFFuvFuv𝚃×FFuv\texttt{FuvFFuv}\leftarrow\texttt{Fuv}^{\mathtt{T}}\times\texttt{FFuv},

B[AJv𝚃Jv0]1,\texttt{B}\leftarrow\displaystyle\begin{bmatrix}\texttt{A}&\texttt{Jv}^{\mathtt{T}}\,\\ \texttt{Jv}&0\,\end{bmatrix}^{-1},

and gcolsums(FuvD(C×FuvD))\texttt{g}\leftarrow\mathrm{colsums}(\texttt{FuvD}\odot(\texttt{C}\times\texttt{FuvD})), which take O(IJ2M3)O(IJ^{2}M^{3}), O(J3M3)O(J^{3}M^{3}), and O(IJ2M3)O(IJ^{2}M^{3}) time, respectively. Since we assume IJI\geq J, these are all O(IJ2M3)O(IJ^{2}M^{3}). It is tedious but straightforward to check that all of the other steps in this part take less than O(IJ2M3)O(IJ^{2}M^{3}) time, assuming IJI\geq J and max{K2,L2,M}min{I,J}\max\{K^{2},L^{2},M\}\leq\min\{I,J\} (Equation 5.2).

Cost of propagating uncertainty from UU and VV to AA and BB. Computing varAfromU and varAfromV take O(IJK(KM))O(IJK(K\vee M)) and O(IJK2M)O(IJK^{2}M) time, respectively; combined, this is O(IJK2M)O(IJK^{2}M). By symmetry, varBfromU and varBfromV take O(IJL2M)O(IJL^{2}M) time.

Cost of propagating uncertainty from AA and BB to CC. First, consider computing varCfromA. Each step in the loop over jj and kk takes O(IK2)O(IK^{2}) time (since L2IL^{2}\leq I), thus, computing dC takes O(IJK3)O(IJK^{3}) time altogether. Computing invFdC takes O(JK3L)O(JK^{3}L) time, and the last step is O(JK2L)O(JK^{2}L). Thus, overall, varCfromA takes O(IJK3)O(IJK^{3}) time. By symmetry, varCfromB takes O(IJL3)O(IJL^{3}) time.

Cost of propagating uncertainty from (A,B,U,V)(A,B,U,V) to SS. Computing varSfromA, varSfromB, varSfromU, and varSfromV take O(IJK)O(IJK), O(IJL)O(IJL), O(IJM)O(IJM), and O(IJM)O(IJM) time, respectively. Thus, overall this takes O(IJ(KLM))O(IJ(K\vee L\vee M)) time.

Cost of propagating uncertainty from (A,B,U,V)(A,B,U,V) to TT. By symmetry, computing varTfromA, varTfromB, varTfromU, and varTfromV takes O(IJ(KLM))O(IJ(K\vee L\vee M)) time overall.

Cost of computing approximate standard errors. Given the approximate element-wise variances, this only takes time proportional to the dimension of each of the parameter matrices/vectors; namely, O(JK)O(JK), O(IL)O(IL), O(KL)O(KL), O(IM)O(IM), O(JM)O(JM), O(I)O(I), and O(J)O(J) for AA, BB, CC, UU, VV, SS, and TT, respectively. Using Equation 5.2 to upper bound each of these shows that, overall, this is O(IJ)O(IJ).