
Dynamic Factor Analysis with Dependent Gaussian Processes for High-Dimensional Gene Expression Trajectories

Jiachen Cai, Robert J. B. Goudie, Colin Starr, and Brian D. M. Tom
MRC Biostatistics Unit, University of Cambridge, UK
Abstract

The increasing availability of high-dimensional, longitudinal measures of gene expression can facilitate understanding of biological mechanisms, as required for precision medicine. Biological knowledge suggests that it may be best to describe complex diseases at the level of underlying pathways, which may interact with one another. We propose a Bayesian approach that characterises such correlation among different pathways through Dependent Gaussian Processes (DGP) and maps the observed high-dimensional gene expression trajectories onto unobserved low-dimensional pathway expression trajectories via Bayesian Sparse Factor Analysis. Our proposal is the first attempt to relax the classical assumption of independent factors for longitudinal data, and it demonstrates superior performance in recovering the shape of pathway expression trajectories, revealing the relationships between genes and pathways, and predicting gene expression (closer point estimates and narrower predictive intervals), as shown through simulations and real data analysis. To fit the model, we propose a Monte Carlo Expectation Maximization (MCEM) scheme that can be implemented conveniently by combining a standard Markov chain Monte Carlo sampler with the R package GPFDA (Konzen et al., 2021), which returns the maximum likelihood estimates of the DGP hyperparameters. The modular structure of MCEM makes it generalizable to other complex models involving a DGP component. Our R package DGP4LCF, which implements the proposed approach, is available on CRAN.

Keywords: Dependent Gaussian processes; High-dimensional biomarker expression trajectories; Monte Carlo Expectation Maximization; Multivariate longitudinal data; Pathways; Sparse factor analysis.

1 Introduction

The development of high-throughput technologies has enabled researchers to collect high-dimensional biomarker data, such as the transcriptome, metabolome and proteome, repeatedly over time, facilitating the discovery of disease mechanisms. Biological knowledge suggests that it may be best to describe complex diseases at the level of pathways, rather than at the level of individual biomarkers (Richardson et al., 2016). A biological pathway is a series of interactions among molecules that achieves a certain biological function. Pathways are often summarized by activity scores derived from the expression values of the corresponding biomarker set (Temate-Tiagueu et al., 2016); such scores form a basis for further comparisons between people with different clinical statuses (Kim et al., 2021).

This work was motivated by longitudinal measurements of gene expression in a human challenge study, whose main goal was to understand the biological mechanism of influenza virus infection within the host, providing a basis for developing novel diagnostic approaches for differentiating viral respiratory infection from respiratory disease caused by other pathogens (Zaas et al., 2009). In brief, 17 healthy individuals were inoculated with influenza virus H3N2, and blood samples were then collected at regular time intervals until the individuals were discharged after a fixed period of 7 days. The blood samples were assayed by DNA microarray to produce expression values for 11,961 genes. Binary labels denoting clinical status (symptomatic or asymptomatic) for each individual were also recorded.

To handle such high-dimensional data, one commonly used dimension-reduction technique is factor analysis (FA). In the context of our motivating H3N2 example, FA maps the observed high-dimensional gene expression data to a latent low-dimensional pathway expression representation (Carvalho et al., 2008). One classical assumption of FA is independence among the latent factors, which implies that pathway expressions are uncorrelated. However, it has been shown that pathways often interact with one another to achieve complex biological functions (Hsu and Yang, 2012), so this assumption may not hold biologically.

There have been attempts to relax this independence assumption for cross-sectional data. Oblique rotation has been proposed as an alternative to standard orthogonal rotation under the frequentist framework (Costello and Osborne, 2019), and Conti et al. (2014) considered correlated factors within a Bayesian framework. However, for longitudinal data, no available method can characterize the potential cross-correlation between factor trajectories. To our knowledge, this paper is the first attempt to address this gap. Specifically, we model the latent factor trajectories using Dependent Gaussian Processes (DGP), a tool from the machine learning community that allows for estimating cross-correlations and borrowing strength across correlated factors. Compared to an approach that assumes independence between factor trajectories, our approach performs better at recovering the shape of latent factor trajectories, estimating the relationship between genes and pathways, and predicting future gene expression.

In addition to the modeling innovation, another contribution is the algorithm developed for estimating the hyperparameters of the DGP model when it is embedded within another model. To obtain the maximum likelihood estimate (MLE) of the DGP hyperparameters, we developed a Monte Carlo Expectation Maximization (MCEM) algorithm, which can be conveniently implemented by combining an existing R package, GPFDA (Konzen et al., 2021), with a standard Markov chain Monte Carlo (MCMC) sampler. Our R package DGP4LCF (Dependent Gaussian Processes for Longitudinal Correlated Factors), which implements the proposed method, is available on CRAN.

The remainder of the article is organized as follows. In Section 2, we review Bayesian sparse factor analysis (BSFA) and DGP, and then propose our integrated model based on them. In Section 3, we introduce the inference method for the proposed model. In Section 4, we explore the empirical performance of our approach for predicting gene expression, estimating gene-pathway relationships and recovering the shape of pathway expression trajectories under several simulated factor generation mechanisms. In Section 5, we compare our results on the real data with those of a previous analysis in Chen et al. (2011). We conclude with a discussion of future research directions in Section 6.

2 Model

Our proposed model is based on BSFA and DGP. Therefore, before presenting the proposed model in Section 2.3, we first introduce BSFA in Section 2.1 and DGP in Section 2.2.

Let $t_{ij}$ denote the $j$th measured time point of the $i$th individual, $i=1,\dots,n$, $j=1,\dots,q_{i}$, where $n$ and $q_{i}$ are the number of subjects and subject-specific time points, respectively. At time $t_{ij}$, $x_{ijg}$ is the $g$th gene expression, $g=1,\dots,p$, where $p$ is the number of genes. We seek to describe these data in terms of $k$ latent factors/pathways $y_{ija}$, $a=1,\dots,k$. In practice, we expect $k$ to be much smaller than $p$. Throughout this paper we assume that $k$ is pre-specified and fixed (see Section 2.4 for discussion of the choice of $k$).

2.1 Uncovering sparse factor structure via BSFA

BSFA is an established approach for uncovering a sparse factor structure. Sparsity stems from the fact that each pathway (i.e., factor) involves only a small proportion of genes (i.e., observed variables). Mathematically, BSFA connects the observations $x_{ijg}$ with the unobserved latent factors $y_{ija}$ via a factor loading matrix $\mathbf{L}=\{l_{ga}\}_{g=1,\dots,p;\,a=1,\dots,k}\in\mathbb{R}^{p\times k}$, where each element $l_{ga}$ quantifies the extent to which the $g$th gene expression is related to the $a$th pathway expression, with larger absolute values indicating a stronger loading of the gene expression on the pathway expression:

x_{ijg} = \mu_{ig} + \sum_{a=1}^{k} l_{ga} y_{ija} + e_{ijg}, \qquad (1)

where $\mu_{ig}$ is the intercept term for the $g$th gene of the $i$th individual (hereafter the "subject-gene mean") and $e_{ijg}$ is the residual error. We assume $l_{ga}$ is constant across all time points.

We incorporate the prior belief of sparsity by imposing a sparsity-inducing prior distribution on $\mathbf{L}$. Specifically, we adopt point-mass mixture priors (Carvalho et al., 2008) because we want the model to shrink insignificant parameters completely to zero, without the need to further set a threshold for inclusion, as in continuous shrinkage priors (Bhattacharya and Dunson, 2011). The point-mass mixture prior is introduced by first decomposing $l_{ga}=Z_{ga}\cdot A_{ga}$, where the binary variable $Z_{ga}$ indicates inclusion and the continuous variable $A_{ga}$ denotes the regression coefficient. We then specify a Bernoulli-Beta prior for $Z_{ga}$ with $Z_{ga}\sim\text{Bern}(\pi_{a})$, $\pi_{a}\sim\text{Beta}(c_{0},d_{0})$, and a Normal-Inverse-Gamma prior for $A_{ga}$ with $A_{ga}\sim\text{N}(0,\rho_{a}^{2})$, $\rho_{a}^{2}\sim\text{Inverse-Gamma}(c_{1},d_{1})$, where $g=1,\dots,p$; $a=1,\dots,k$; and $c_{0},d_{0},c_{1},d_{1}$ are pre-specified positive constants. An a priori belief about sparsity can be encoded via $(c_{0},d_{0})$, which controls the proportion $\pi_{a}$ of genes that contribute to the $a$th pathway. If $Z_{ga}=0$, meaning that the $g$th gene is not related to the $a$th pathway, then the corresponding loading $l_{ga}=0$.
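To make the prior concrete, the following R sketch draws one column of the loading matrix from this point-mass mixture prior; the hyperparameter values are illustrative assumptions, not the settings used in the paper.

```r
set.seed(1)
p  <- 100                                 # number of genes
c0 <- 1; d0 <- 9                          # Beta prior: E[pi_a] = 0.1, i.e. sparse
c1 <- 2; d1 <- 1                          # Inverse-Gamma prior on rho_a^2

pi_a   <- rbeta(1, c0, d0)                        # inclusion probability for factor a
rho2_a <- 1 / rgamma(1, shape = c1, rate = d1)    # Inverse-Gamma(c1, d1) via 1/Gamma
Z_a    <- rbinom(p, size = 1, prob = pi_a)        # binary inclusion indicators Z_ga
A_a    <- rnorm(p, mean = 0, sd = sqrt(rho2_a))   # regression coefficients A_ga
l_a    <- Z_a * A_a                               # loadings l_ga: exact zeros where Z_ga = 0
mean(l_a == 0)                                    # realized sparsity of this column
```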

We assign normal priors to the subject-gene means, $\mu_{ig}\sim\text{N}(\mu_{g},\sigma_{g}^{2})$, where $\mu_{g}$ is fixed at the mean of the $g$th gene's expression across all time points of all people, and to the residuals, $e_{ijg}\sim\text{N}(0,\phi_{g}^{2})$; and inverse-gamma priors to the variances, $\sigma_{g}^{2}\sim\text{Inverse-Gamma}(c_{2},d_{2})$ and $\phi_{g}^{2}\sim\text{Inverse-Gamma}(c_{3},d_{3})$, where $c_{2},d_{2},c_{3},d_{3}$ are pre-specified positive constants.

To implement the BSFA model, the software BFRM has been developed (Carvalho et al., 2008). However, the current version of BFRM has two major limitations. First, it returns only point estimates of parameters, without any quantification of uncertainty. Second, it can handle only cross-sectional data; it therefore does not account for within-individual correlation when supplied with longitudinal data.

2.2 Modeling correlated, time-dependent factor trajectories via DGP

Several approaches treating the factor trajectories $y_{a}(t)$, $a=1,\dots,k$, as functional data have been proposed, including spline functions (Ramsay et al., 2009), differential equations (Ramsay and Hooker, 2019), autoregressive models (Penny and Roberts, 2002), and Gaussian Processes (GP) (Shi and Choi, 2011). As mentioned previously, we are interested in incorporating the cross-correlation among different factors into the model. GP are well-suited to this task because DGP can account for the inter-dependence of factors in a straightforward manner. Indeed, the DGP model has been widely applied to model dependent multi-output time series in the machine learning community (Alvarez and Lawrence, 2011), where the problem is also known as "multitask learning". Sharing information between tasks using DGP can improve prediction compared to using independent Gaussian Processes (IGP) (Bonilla et al., 2007; Cheng et al., 2020). DGP have also been used to model correlated, multivariate spatial data in the field of geostatistics, improving prediction performance over IGP (Dey et al., 2020).

The main difficulty with DGP modeling is how to define cross-covariance functions that imply a positive definite covariance matrix. Liu et al. (2018) reviewed existing strategies developed to address this issue and found, through simulation studies, that no single approach outperformed all others in all scenarios. When the intent is to improve the predictions of all the outputs jointly (as is the case here), the kernel convolution framework (KCF; Boyle and Frean, 2005a) was among the best performers. The KCF has also been widely employed by other researchers (Alvarez and Lawrence, 2011; Sofro et al., 2017). Therefore, we adopt the KCF strategy for DGP modeling here, which assumes that the correlation between factors is the same at each time point. A detailed description of the KCF can be found in Supplementary A.2.

For any pair of factors $a$ and $b$, the distribution of $(y_{i1a},\dots,y_{iq_{i}a},y_{i1b},\dots,y_{iq_{i}b})^{T}$ induced under the KCF is a multivariate normal (MVN) distribution with mean vector $\mathbf{0}$ and covariance matrix fully determined by the parameters of the kernel functions and a noise parameter; hereafter we use $\boldsymbol{\Theta}$ to denote all of them. The covariance matrix contains information on both the auto-correlation of each single process and the cross-correlation between different processes. In this paper, we focus on the cross-correlation, which corresponds to interactions across different biological pathways that were ignored in the previous analysis (Chen et al., 2011).
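For intuition, the sketch below builds the joint covariance matrix for two dependent processes under a simplified one-dimensional kernel convolution construction, in which each output is the convolution of a single shared white-noise process with a Gaussian kernel $k_a(t)=v_a\exp(-\tfrac{1}{2}A_a t^2)$; the full KCF in Supplementary A.2 additionally includes independent latent processes and a noise term, and the kernel parameters here are assumed values for illustration only.

```r
# Cross-covariance of two outputs under a simplified 1-D kernel convolution:
# the closed-form integral of the product of two Gaussian kernels.
kcf_cov <- function(d, va, Aa, vb, Ab) {
  va * vb * sqrt(2 * pi / (Aa + Ab)) * exp(-0.5 * d^2 * Aa * Ab / (Aa + Ab))
}

t_grid <- seq(0, 7, length.out = 8)     # common time grid for the factors
D <- outer(t_grid, t_grid, "-")         # matrix of pairwise time differences
v <- c(1.0, 0.8); A <- c(2.0, 0.5)      # assumed kernel parameters for factors a, b

# Joint covariance over (y_a(t), y_b(t)): auto-covariance blocks on the
# diagonal, cross-covariance blocks off the diagonal.
Sigma <- rbind(
  cbind(kcf_cov(D, v[1], A[1], v[1], A[1]), kcf_cov(D, v[1], A[1], v[2], A[2])),
  cbind(kcf_cov(D, v[2], A[2], v[1], A[1]), kcf_cov(D, v[2], A[2], v[2], A[2]))
)
min(eigen(Sigma, symmetric = TRUE, only.values = TRUE)$values) > 0  # positive definite
```

Because both outputs are convolutions of the same underlying white-noise process, the resulting block matrix is positive definite by construction, which is exactly the property that makes the KCF attractive.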

The KCF can be implemented via the R package GPFDA, which outputs the MLE of the DGP hyperparameters given measurements of the processes $y_{a}(t)$, $a=1,\dots,k$. Its availability inspired and facilitated the algorithm developed for the proposed model. GPFDA assumes all input processes are measured at common time points, which is often unrealistic in practice; thus, in Section 3, we adapt GPFDA to subject-specific time points.

2.3 Proposed integrated model

We propose a model that combines BSFA and DGP (referred to as "BSFA-DGP" hereafter), and present it in matrix notation below. To accommodate irregularly measured time points across individuals, we first introduce the vector of all unique observation times $\mathbf{t}=\bigcup_{i=1}^{n}\mathbf{t}_{i}$ across all individuals, and denote its length by $q$; each $\mathbf{t}_{i}=\bigcup_{j=1}^{q_{i}}t_{ij}$ is the vector of observed time points for the $i$th individual. We assume the irregularity of measurements is sufficiently limited that Bayesian imputation is appropriate. Let $\mathbf{Y}_{i}=(\mathbf{y}_{i1},\dots,\mathbf{y}_{ik})^{T}\in\mathbb{R}^{k\times q}$ be the matrix of pathway expression for the $i$th individual, with $\mathbf{y}_{ia}=(y_{i1a},\dots,y_{iqa})^{T}$ denoting the $a$th factor's expression across all observation times $\mathbf{t}$; and let $\mathbf{Y}_{i,\text{obs}}$ and $\mathbf{Y}_{i,\text{miss}}$ be the sub-matrices of $\mathbf{Y}_{i}$ corresponding to times at which the gene expression of the $i$th individual is observed and missing, respectively. Let $\operatorname{vec}(\mathbf{Y}_{i}^{T})$ denote the column vector obtained by stacking the columns of the matrix $\mathbf{Y}_{i}^{T}$ on top of one another; similar definitions apply to $\operatorname{vec}(\mathbf{Y}_{i,\text{obs}}^{T})$ and $\operatorname{vec}(\mathbf{Y}_{i,\text{miss}}^{T})$.

Let $\mathbf{X}_{i}=(\mathbf{x}_{i1},\dots,\mathbf{x}_{ip})^{T}\in\mathbb{R}^{p\times q_{i}}$ be the matrix of gene expression measurements at the $q_{i}$ observation times for the $i$th individual, with $\mathbf{x}_{ig}=(x_{i1g},\dots,x_{iq_{i}g})^{T}$ denoting the $g$th gene's trajectory; and correspondingly let $\mathbf{M}_{i}=(\boldsymbol{\mu}_{i1},\dots,\boldsymbol{\mu}_{ip})^{T}\in\mathbb{R}^{p\times q_{i}}$ be the matrix of subject-gene means, with $\boldsymbol{\mu}_{ig}=\mu_{ig}\mathbf{1}$, where $\mathbf{1}$ is a $q_{i}$-dimensional column vector of ones. Furthermore, let $\mathbf{A}=\{A_{ga}\}\in\mathbb{R}^{p\times k}$ be the matrix of regression coefficients and $\mathbf{Z}=\{Z_{ga}\}\in\mathbb{R}^{p\times k}$ be the matrix of inclusion indicators. Then

\begin{split}
\mathbf{X}_{i} &= \mathbf{M}_{i} + \mathbf{L}\,\mathbf{Y}_{i,\text{obs}} + \mathbf{E}_{i},\\
\mathbf{L} &= \mathbf{A}\circ\mathbf{Z},\\
\operatorname{vec}(\mathbf{Y}_{i}^{T}) &\sim \text{MVN}(\mathbf{0},\,\Sigma_{\mathbf{Y}}),
\end{split} \qquad (2)

where $\circ$ denotes element-wise matrix multiplication, $\Sigma_{\mathbf{Y}}\in\mathbb{R}^{kq\times kq}$ is the covariance matrix induced via the KCF modeling, and $\mathbf{E}_{i}$ is the residual matrix. The prior distributions for the components of $\mathbf{A}$, $\mathbf{Z}$, $\mathbf{M}_{i}$, and $\mathbf{E}_{i}$ were described in Section 2.1.

2.4 Choosing the number of factors

In practice, the choice of $k$ is primarily based on prior knowledge about the data at hand. Where this is unavailable, we recommend trying several different values for $k$ and comparing the results, using both quantitative metrics (such as the prediction performance adopted in this work) and practical interpretation. Note that it may be possible to identify when $k$ is unnecessarily large from the factor loading estimates (as we show in the simulation), since redundant factors will have no non-zero loadings.

3 Inference

In this section, we describe inference for the proposed model. In Section 3.1, we develop an MCEM framework to obtain the MLE of the DGP hyperparameters $\boldsymbol{\Theta}$; the framework is summarized in Algorithm 1. For fixed DGP hyperparameter values, we propose a Gibbs sampler for the other variables in the model, denoted by $\boldsymbol{\Omega}=\{\mathbf{M},\mathbf{Y},\mathbf{A},\mathbf{Z},\boldsymbol{\rho},\boldsymbol{\pi},\boldsymbol{\sigma},\boldsymbol{\phi}\}$, where $\mathbf{M}=\{\mathbf{M}_{i}\}_{i=1,\dots,n}$, $\mathbf{Y}=\{\mathbf{Y}_{i}\}_{i=1,\dots,n}$, $\boldsymbol{\rho}=\{\rho_{a}^{2}\}_{a=1,\dots,k}$, $\boldsymbol{\pi}=\{\pi_{a}\}_{a=1,\dots,k}$, $\boldsymbol{\sigma}=\{\sigma_{g}^{2}\}_{g=1,\dots,p}$, and $\boldsymbol{\phi}=\{\phi_{g}^{2}\}_{g=1,\dots,p}$ (Section 3.2). This sampler serves two purposes. First, within the MCEM algorithm ("Gibbs-within-MCEM" hereafter), it generates the samples used to approximate the expectation in Equation 5, which in turn updates the estimate of $\boldsymbol{\Theta}$. Second, after the final DGP estimate, denoted by $\widehat{\boldsymbol{\Theta}}^{\text{MLE}}$, is obtained from MCEM, we run the sampler again ("Gibbs-after-MCEM" hereafter) to obtain the final posterior distribution of interest, $f(\boldsymbol{\Omega}|\mathbf{X},\widehat{\boldsymbol{\Theta}}^{\text{MLE}})$, where $\mathbf{X}=\{\mathbf{X}_{i}\}_{i=1,\dots,n}$ denotes the observed gene expression. We discuss how we handle identifiability challenges in Section 3.3.

3.1 MCEM for estimating cross-correlation determined by DGP hyperparameters

3.1.1 Options for Estimating DGP Hyperparameters

Two strategies are widely adopted to estimate $\boldsymbol{\Theta}$: Fully Bayesian (FB) and Empirical Bayes (EB). The former assigns prior distributions to $\boldsymbol{\Theta}$ to account for uncertainty. The latter fixes $\boldsymbol{\Theta}$ at reasonable values based on the data; for example, the MLE. Compared to FB, EB sacrifices the quantification of uncertainty for a lower computational cost. Here, we adopt the EB approach because the key quantities of interest for inference in our model are the factor loading matrix $\mathbf{L}$, the latent factors $\mathbf{Y}_{i}$, and the prediction of gene expression, rather than the DGP hyperparameters $\boldsymbol{\Theta}$. Therefore, we simply want a reasonably good estimate of $\boldsymbol{\Theta}$ to proceed, without expending excessive computation time.

3.1.2 Finding the MLE for DGP Hyperparameters via an MCEM Framework

To derive $\widehat{\boldsymbol{\Theta}}^{\text{MLE}}$, we first note, with $f$ denoting a probability density function, the factorisation

f(\mathbf{X},\boldsymbol{\Omega}|\boldsymbol{\Theta}) = f(\mathbf{X}|\mathbf{M},\mathbf{Y},\mathbf{A},\mathbf{Z},\boldsymbol{\phi})\, f(\mathbf{M}|\boldsymbol{\sigma})\, f(\mathbf{Y}|\boldsymbol{\Theta})\, f(\mathbf{A}|\boldsymbol{\rho})\, f(\mathbf{Z}|\boldsymbol{\pi})\, f(\boldsymbol{\phi})\, f(\boldsymbol{\sigma})\, f(\boldsymbol{\rho})\, f(\boldsymbol{\pi}), \qquad (3)

which highlights that maximizing the marginal likelihood $f(\mathbf{X}|\boldsymbol{\Theta})=\int f(\mathbf{X},\boldsymbol{\Omega}|\boldsymbol{\Theta})\,d\boldsymbol{\Omega}$ with respect to $\boldsymbol{\Theta}$ for the model proposed in Section 2.3 requires high-dimensional integration.

To deal with this integration, one strategy is the Expectation-Maximization (EM) algorithm (Dempster et al., 1977), in which we view $\boldsymbol{\Omega}$ as hidden variables and iteratively construct a series of estimates $\widehat{\boldsymbol{\Theta}}^{(l)}$, $l=1,2,3,\dots$, that converges to $\widehat{\boldsymbol{\Theta}}^{\text{MLE}}$ (Wu, 1983) by alternating between E-steps and M-steps. At the $l$th iteration, the E-step evaluates the "Q-function", the conditional expectation of the log-likelihood of the complete data $\{\mathbf{X},\boldsymbol{\Omega}\}$ given the observed data $\mathbf{X}$ and the previous iterate $\widehat{\boldsymbol{\Theta}}^{(l-1)}$:

Q(\boldsymbol{\Theta},\widehat{\boldsymbol{\Theta}}^{(l-1)}) = \mathbb{E}_{\boldsymbol{\Omega}}\left[\ln f(\mathbf{X},\boldsymbol{\Omega}|\boldsymbol{\Theta}) \,\middle|\, \mathbf{X},\widehat{\boldsymbol{\Theta}}^{(l-1)}\right]. \qquad (4)

In the M-step, this expectation is maximized to obtain the updated parameter $\widehat{\boldsymbol{\Theta}}^{(l)}=\arg\max_{\boldsymbol{\Theta}} Q(\boldsymbol{\Theta},\widehat{\boldsymbol{\Theta}}^{(l-1)})$. EM keeps updating the parameters in this way until a pre-specified stopping condition is met.

As for many complex models, the analytic form of the conditional expectation in Equation 4 is unavailable. To address this issue, a Monte Carlo version of EM (MCEM) has been developed (Wei and Tanner, 1990; Levine and Casella, 2001), which uses Monte Carlo samples to approximate the exact expectation. Casella (2001) showed that MCEM produces consistent estimates of posterior distributions and asymptotically valid confidence sets.

Suppose that at the $l$th iteration ($l\geq 1$) we have $R^{(l)}$ samples $\{\boldsymbol{\Omega}^{r}\}_{r=1,\dots,R^{(l)}}$ drawn from the posterior distribution $f(\boldsymbol{\Omega}|\mathbf{X},\widehat{\boldsymbol{\Theta}}^{(l-1)})$ using an MCMC sampler (see Section 3.2); then Equation 4 can be approximated as

Q(\boldsymbol{\Theta},\widehat{\boldsymbol{\Theta}}^{(l-1)}) \approx \widetilde{Q}(\boldsymbol{\Theta},\widehat{\boldsymbol{\Theta}}^{(l-1)}) = \frac{1}{R^{(l)}}\sum_{r=1}^{R^{(l)}} \ln f(\mathbf{X},\boldsymbol{\Omega}^{r}|\boldsymbol{\Theta}). \qquad (5)

The M-step maximizes Equation 5 with respect to $\boldsymbol{\Theta}$; after applying the factorisation in Equation 3, the only term depending on $\boldsymbol{\Theta}$ is $f(\mathbf{Y}^{r}|\boldsymbol{\Theta})$. The Q-function can therefore be simplified to

\widetilde{Q}(\boldsymbol{\Theta},\widehat{\boldsymbol{\Theta}}^{(l-1)}) = \frac{1}{R^{(l)}}\sum_{r=1}^{R^{(l)}} \ln f(\mathbf{Y}^{r}|\boldsymbol{\Theta}). \qquad (6)

As a result, the task reduces to finding the $\boldsymbol{\Theta}$ that maximizes the likelihood of $\{\mathbf{Y}^{r}\}_{r=1,\dots,R^{(l)}}=\{\mathbf{Y}_{i}^{r}\}_{i=1,\dots,n;\,r=1,\dots,R^{(l)}}$, given that each $\mathbf{Y}_{i}^{r}$ follows a DGP distribution determined by $\boldsymbol{\Theta}$. Although gene expression is measured at irregular time points across individuals, samples of the latent factors are available at the common times $\mathbf{t}$, enabling the use of GPFDA to obtain the MLE.

The MCEM framework described above is summarized in Algorithm 1. To provide good initial values for MCEM, we implemented a two-step approach using available software (see Supplementary B for details). The implementation challenges of MCEM are rooted in one common consideration: the computational cost of the algorithm. We discuss these challenges, including the choice of the MCMC sample size $R^{(l)}$ and the stopping condition, in Supplementary A.4 and A.5, respectively.
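The following R skeleton shows the structure of this Gibbs-within-MCEM loop; `gibbs_sampler()`, `fit_gpfda()` and `caffo_lower_bound()` are hypothetical placeholders for the steps detailed in Algorithm 1 and Supplementary A.4, not functions exported by DGP4LCF or GPFDA.

```r
# Schematic Gibbs-within-MCEM loop (cf. Algorithm 1); placeholder functions only.
mcem <- function(X, t_all, Theta0, R = 100, m = 2, W = 5) {
  Theta <- Theta0
  w <- 0                                                # counts sample-size increases
  repeat {
    Omega <- gibbs_sampler(X, Theta, n_samples = R)     # E-step: MCMC draws of latent variables
    Theta_new <- fit_gpfda(Omega$Y, t_all)              # M-step: MLE of DGP hyperparameters
    LB <- caffo_lower_bound(Omega$Y, Theta, Theta_new)  # ascent check (Caffo et al., 2005)
    if (LB > 0) {
      Theta <- Theta_new               # likelihood increased: accept the update
    } else {
      R <- R + R / m                   # otherwise enlarge the Monte Carlo sample size
      w <- w + 1
      if (w > W) break                 # stop after W sample-size increases
    }
  }
  Theta
}
```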

3.2 Gibbs sampler for other variables under fixed DGP estimates

To draw samples of $\boldsymbol{\Omega}$ from the posterior distribution under fixed DGP estimates, $f(\boldsymbol{\Omega}|\mathbf{X},\widehat{\boldsymbol{\Theta}})$, we use a Gibbs sampler, since the full conditionals for the variables in $\boldsymbol{\Omega}$ are analytically available. We summarise the high-level approach here; details are available in Supplementary A.1.

For the key variables $\mathbf{Z}$, $\mathbf{A}$, and $\mathbf{Y}$, we use blocked Gibbs updates to improve mixing. To block-update the $g$th row of the binary matrix $\mathbf{Z}$, we compute the posterior probability of each of the $2^{k}$ possible values of $(Z_{g1},\dots,Z_{gk})$, then sample with the corresponding probabilities. To update the $g$th row of the regression coefficient matrix $\mathbf{A}$, $(A_{g1},\dots,A_{gk})$, we draw from an MVN distribution. When updating $\operatorname{vec}(\mathbf{Y}_{i}^{T})$, the vectorized form of the $i$th individual's factor scores, we factorise $f(\operatorname{vec}(\mathbf{Y}_{i}^{T})|\mathbf{X},\widehat{\boldsymbol{\Theta}},\boldsymbol{\Omega}\setminus\operatorname{vec}(\mathbf{Y}_{i}^{T}))$, where $\boldsymbol{\Omega}\setminus\operatorname{vec}(\mathbf{Y}_{i}^{T})$ denotes the remaining parameters excluding $\operatorname{vec}(\mathbf{Y}_{i}^{T})$, according to the partition $\operatorname{vec}(\mathbf{Y}_{i}^{T})=(\operatorname{vec}(\mathbf{Y}_{i,\text{obs}}^{T}),\operatorname{vec}(\mathbf{Y}_{i,\text{miss}}^{T}))$. The first factor, $f(\operatorname{vec}(\mathbf{Y}_{i,\text{obs}}^{T})|\mathbf{X}_{i},\widehat{\boldsymbol{\Theta}},\boldsymbol{\Omega}\setminus\operatorname{vec}(\mathbf{Y}_{i}^{T}))$, follows an MVN distribution depending on the measured gene expression. The second factor, $f(\operatorname{vec}(\mathbf{Y}_{i,\text{miss}}^{T})|\operatorname{vec}(\mathbf{Y}_{i,\text{obs}}^{T}),\widehat{\boldsymbol{\Theta}})$, also follows an MVN distribution, by standard properties of the DGP model. Our sampler draws from these two factors in turn.
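As an illustration of the blocked update for $\mathbf{Z}$, the sketch below enumerates all $2^k$ inclusion patterns for one row and samples one in proportion to its posterior weight; `log_post_weight()` is a hypothetical placeholder for the model-specific (unnormalized) log posterior of a pattern given the current state of the other parameters.

```r
# Blocked update for row g of Z: enumerate all 2^k inclusion patterns and
# sample one with probability proportional to its posterior weight.
k <- 4
patterns <- as.matrix(expand.grid(rep(list(0:1), k)))        # 2^k candidate rows
log_w <- apply(patterns, 1, function(z) log_post_weight(z))  # placeholder, see text
w <- exp(log_w - max(log_w))                                 # stabilize before normalizing
Z_g <- patterns[sample.int(nrow(patterns), size = 1, prob = w), ]
```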

3.3 Identifiability challenges

A well-known challenge of factor analysis is identifiability, and we address two types of non-identifiability (see Supplementary A.3 for details). First, for the covariance matrix of the latent factors, $\Sigma_{\mathbf{Y}}$, we constrain the main diagonal elements to be 1 (in other words, the variance of each factor at each time point is 1). Second, to address sign and permutation non-identifiability of the factor loadings $\mathbf{L}$ and factor scores $\operatorname{vec}(\mathbf{Y}_{i}^{T})$, we post-process samples using the R package "factor.switch" (Papastamoulis and Ntzoufras, 2022).

4 Simulation

To assess the performance of our approach, we simulated gene expression observations from our model as described in Section 4.1, and fitted the proposed and comparator models as described in Section 4.2. Section 4.3 introduces the assessment metrics and Section 4.4 describes the results.

4.1 Simulation setting

To mimic the real data, we chose the sample size $n=17$. The numbers of training and test time points were set to $q=8$ and $u=2$, respectively, for all individuals (i.e., we split the observed data into training and test datasets to assess the models' performance in predicting gene expression). We set the true number of latent factors to $k=4$.

Since recovering the latent factor trajectories is one of our major interests, we considered 4 different mechanisms for generating the true latent factors according to a $2\times 2$ factorial design: the first variable determined whether the factors were correlated ("C") or uncorrelated ("U"), and the second determined whether the variability of the factors was small ("S") or large ("L"). The mean of each factor score $y_{ija}$ was fixed at 0, and we generated the covariance matrix $\Sigma_{\mathbf{Y}}$ for the different scenarios as follows: (1) under scenario "C", we set the cross-correlations based on the covariance matrix estimated from the real data under $k=4$; under scenario "U", the true cross-correlations were set to 0. (2) under scenario "S", we set the standard deviations of factors 1-4 to 0.21, 0.23, 0.21, and 0.17, respectively, following the estimates from the real data under $k=4$; under scenario "L", the standard deviation of each $y_{ija}$ was set to 1. Below we name each scenario after its factor generation mechanism; for example, "scenario CS" refers to the case where the true factors were correlated (C) and had small (S) variability.

Each factor was assumed to regulate 10% of all genes, and the total number of genes was $p=100$. Note that we allow for the possibility that one gene may be regulated by more than one factor. If a gene was regulated by an underlying factor, the corresponding factor loading was generated from a normal distribution $\text{N}(4,1^{2})$; otherwise the factor loading was set to 0. Each subject-gene mean $\mu_{ig}$ was generated from $\text{N}(\mu_{g},\sigma_{g}^{2})$, where $\mu_{g}$ ranged between 4 and 16 (to match the real data) and $\sigma_{g}=0.5$. Finally, the observed gene expressions were generated according to Equation 1, $x_{ijg}\sim\text{N}(\mu_{ig}+\sum_{a=1}^{k}l_{ga}y_{ija},\,\phi_{g}^{2})$, where $\phi_{g}=0.5$.
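The following R sketch mirrors this generative mechanism for the simplest scenario UL (uncorrelated factors with unit variance); for brevity it draws factor scores independently over time, ignoring the temporal autocorrelation that $\Sigma_{\mathbf{Y}}$ induces in the actual simulation.

```r
set.seed(42)
n <- 17; q <- 8; p <- 100; k <- 4
gpf <- round(p * 0.10)                         # genes regulated per factor

L <- matrix(0, p, k)
for (a in 1:k) {                               # each factor regulates 10% of genes;
  idx <- sample.int(p, gpf)                    # overlap across factors is allowed
  L[idx, a] <- rnorm(gpf, mean = 4, sd = 1)
}

mu_g <- runif(p, 4, 16)                        # gene-level means, matching the real data range
X <- array(NA_real_, dim = c(n, q, p))
for (i in 1:n) {
  mu_ig <- rnorm(p, mu_g, 0.5)                 # subject-gene means
  Y_i <- matrix(rnorm(q * k), q, k)            # factor scores (scenario UL, no autocorrelation)
  X[i, , ] <- matrix(mu_ig, q, p, byrow = TRUE) +            # Equation (1)
    Y_i %*% t(L) + matrix(rnorm(q * p, sd = 0.5), q, p)      # plus residual noise
}
```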

In addition to the setting above, we considered further simulation scenarios to investigate the performance of our approach under different $n$, $p$, $q$ and proportions of overlap between factors. Supplementary C.2.2-C.2.5 detail the results under those scenarios; below we present results only for the setting described above.

4.2 Analysis methods

For all scenarios, we fit the data using the proposed approach BSFA-DGP with $k=4$. In addition, we investigate how mis-specification of $k$ impacts model performance, using scenario CS as an example (Supplementary C.2.1). Hyperparameters and other tuning parameters are detailed in Supplementary B.2.

We compared our approach to a traditional approach that models each latent factor independently, specifically assuming an IGP prior for each factor trajectory $\mathbf{y}_{ia}$ (the other parts of the model remain unchanged). This model specification is referred to as BSFA-IGP. We assessed convergence of the Gibbs samples of the continuous variables from three parallel chains using Rhat (Gelman and Rubin, 1992), with 1.2 as an empirical cutoff. Convergence of the latent factor scores $y_{ija}$ and the predictions of gene expression $x_{ijg}^{\text{new}}$ was of particular interest. For the factor loadings $l_{ga}$, which are the product of the binary variable $Z_{ga}$ and the continuous variable $A_{ga}$, we first summarized the corresponding $Z_{ga}$: if the proportion of samples with $Z_{ga}=0$ exceeded 0.5 in all chains, then $l_{ga}$ was summarized directly as 0; otherwise $l_{ga}=A_{ga}$, and we assessed convergence of these factor loadings using Rhat.
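For reference, a convergence check of this kind can be run with the coda package; `chain_draws` below is a hypothetical list holding the draws of one monitored scalar (e.g., one $y_{ija}$) from the three parallel chains.

```r
library(coda)
# chain_draws: hypothetical list of three numeric vectors, one per parallel chain.
chains <- mcmc.list(lapply(chain_draws, mcmc))   # wrap each chain as an mcmc object
gelman.diag(chains)$psrf                         # flag estimates above the cutoff 1.2
```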

4.3 Performance metrics

We assessed performance on four aspects: estimation of the correlation structure, prediction of gene expression, recovery of factor trajectories, and estimation of factor loadings.

To assess estimation of the correlation structure, we compared the final correlation estimate with the truth; additionally, we monitored the change in the correlation estimate throughout the iterative process to assess the speed of convergence of the algorithm.

We used three metrics to assess performance in predicting gene expression: the mean absolute error ($\text{MAE}_{\mathbf{X}}$), the mean width of the 95% predictive intervals ($\text{MWI}_{\mathbf{X}}$), and the proportion of genes within the 95% predictive intervals ($\text{PWI}_{\mathbf{X}}$) (see Supplementary A.6 for details).
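A minimal sketch of these three metrics, assuming `pred` is a matrix of posterior predictive draws (rows are samples, columns are held-out gene expression values) and `truth` is the vector of observed test values:

```r
MAE_X <- mean(abs(colMeans(pred) - truth))     # error of posterior mean predictions
lo <- apply(pred, 2, quantile, probs = 0.025)  # 95% predictive interval bounds
hi <- apply(pred, 2, quantile, probs = 0.975)
MWI_X <- mean(hi - lo)                         # mean interval width
PWI_X <- mean(truth >= lo & truth <= hi)       # proportion of values covered
```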

To assess performance in recovering the factor trajectories underlying the gene expression observed in the training data, we first calculate the mean absolute error ($\text{MAE}_{\mathbf{Y}}$) for each scenario. In addition, we plot the true and estimated factor trajectories for visual comparison. For convenience of discussion, we present the trajectory of factor 1 for person 1 as an example; this person-factor estimate is representative of all factors of all people.

To evaluate the ability to estimate factor loadings, we consider two perspectives. First, for a specific factor, can the model identify all genes that are truly affected by this factor? Second, for a specific gene, can the model identify all factors that regulate this gene? To answer these questions, we calculated 95% credible intervals for the $l_{ga}$ and presented in the heatmap, using posterior median estimates, those $l_{ga}$ whose interval did not contain 0.
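Combining the summarization rule of Section 4.2 with this interval check can be sketched as follows; `Z_draws` and `A_draws` are hypothetical vectors of posterior draws for one (gene, factor) pair.

```r
# Summarize one loading l_ga from posterior draws: set to zero when the
# inclusion indicator is mostly off across draws; otherwise report the
# posterior median and 95% credible interval of A_ga.
summarize_loading <- function(Z_draws, A_draws) {
  if (mean(Z_draws == 0) > 0.5) return(c(median = 0, lower = 0, upper = 0))
  ci <- unname(quantile(A_draws, probs = c(0.025, 0.975)))
  c(median = median(A_draws), lower = ci[1], upper = ci[2])
}
```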

4.4 Results

When the number of latent factors $k$ is correctly specified, our model returns satisfactory cross-correlation estimates under all data generation mechanisms considered (Figure 1). Furthermore, despite the initial estimate (from the two-step approach) being poor, MCEM is able to propose estimates of the cross-correlations that rapidly approach the truth (Supplementary Figure 2 shows an example under scenario CS).

For the inference of $x_{ijg}^{\text{new}}$, $y_{ija}$ and $l_{ga}$ using Gibbs-after-MCEM, Supplementary Table 1 lists the largest Rhat for each variable type, and these reflect no apparent concerns about non-convergence. In terms of predicting gene expression, similar $\text{PWI}_{\mathbf{X}}$ is observed for both models. However, the DGP specification always leads to smaller $\text{MAE}_{\mathbf{X}}$ and narrower $\text{MWI}_{\mathbf{X}}$, even when the factors are truly uncorrelated, indicating more accurate and less uncertain prediction (Table 1).

Summary results in Supplementary Table 2 suggest that estimating factors is generally easier when the true variability of the factors is larger. A closer inspection of the factor trajectories in Supplementary Figure 3 confirms this: under scenarios CL and UL, both the DGP and IGP models recover the shape of the factor trajectories very well, which can be explained by the relatively strong expression of the latent factors. In contrast, under scenarios CS and US, where expression is relatively weak, recovering the details of the factor shapes is harder: if the factors are not correlated at all (scenario US), both DGP and IGP fail to recover the details at the 2nd and 5th time points, though the overall shape remains close to the truth. However, if the factors are truly correlated (scenario CS), DGP recovers the trajectories very well, owing to its ability to borrow information from related factors.

With regard to factor loading estimation, Figure 2 shows the results under scenario CS, and Supplementary Figures 4, 5 and 6 show results for the remaining scenarios. Both DGP and IGP perform well at identifying the correct genes for a given factor under all scenarios. Both models also perform well at identifying the correct factors for a given gene under scenarios CL, UL and US, but under scenario CS, DGP continues to perform well whereas IGP performs less well. As can be seen from Figure 2, the IGP specification leads to genes estimated to load significantly on the first and second factors also loading significantly on the fourth factor. One possible explanation is that, if the factors truly have strong signals and/or are not correlated at all (as in scenarios CL, UL and US), it is relatively easy to distinguish the contributions of different factors. In contrast, under scenario CS, where the factors have weak signals and are highly correlated (the true correlation between factors 1 and 4 is $-0.69$, and that between factors 2 and 4 is $-0.71$; see Figure 1), this is more difficult. The DGP specification explicitly takes the correlation among factors into consideration, allowing improved estimation.

5 Data application

We ran 500,000 iterations of the final Gibbs sampler, with a 50% burn-in proportion, retaining only every 100th iteration. We also compared our results with two alternative models, both of which assume independence between factors: the BSFA-IGP model, and the previous model of Chen et al. (2011), which adopted spline functions to model each factor trajectory independently.

5.1 Statistical results

We fitted the model with $k=2,3,\dots,10$ factors. We display results under $k=5$ after comprehensive consideration of biological interpretation, computational cost and statistical prediction performance. Biologically, we found that 2 biological pathways (the immune response pathway and the ribosome pathway; see Section 5.2 for details) were consistently identified when $k$ ranged from 3 to 10, whereas $k=2$ led to only 1 identified pathway (the immune response pathway). Computationally, once $k$ exceeded 5, our algorithm became slower as the parameter space grew very large. Statistically, among the remaining models, $k=5$ showed the best prediction performance (see Supplementary Table 6).

In terms of prediction performance, similar $\text{PWI}_{\mathbf{X}}$ is observed under BSFA-DGP (0.951) and BSFA-IGP (0.956), whereas $\text{MAE}_{\mathbf{X}}$ and $\text{MWI}_{\mathbf{X}}$ are both smaller under BSFA-DGP ($\text{MAE}_{\mathbf{X}}$ 0.211; $\text{MWI}_{\mathbf{X}}$ 1.113) than under BSFA-IGP ($\text{MAE}_{\mathbf{X}}$ 0.217; $\text{MWI}_{\mathbf{X}}$ 1.213). This again demonstrates the reduction in prediction error and uncertainty when cross-correlations among factors are taken into consideration.

The factor trajectories estimated using BSFA-DGP are displayed in Figure 3. Among all factors, factor 1 distinguishes symptomatic from asymptomatic people most clearly, and its shape is largely similar to the "principal factor" identified in Chen et al. (2011). Despite the similarity, the trajectory of our factor 1 is actually more individualized and informative than their principal factor (see Supplementary D.2 for a detailed discussion). Compared with BSFA-IGP, BSFA-DGP reduces the variability within subject-specific trajectories and consequently slightly increases the separation between the symptomatic and asymptomatic groups. Although the results look similar at first glance (Supplementary Figure 17), a more careful inspection reveals that BSFA-DGP has a shrinkage effect relative to BSFA-IGP. Take factor 1 as an example (similar observations hold for the other factors): for the blue line at the bottom, at 38 hours it remains around $-1$ under BSFA-DGP, whereas it reaches as low as $-2$ under BSFA-IGP, making it harder under BSFA-IGP to differentiate this symptomatic subject from the asymptomatic subjects (red lines).

5.2 Biological interpretation

To identify the biological counterpart (i.e., pathway) of each statistical factor, we used the KEGG pathway analysis of the online bioinformatics platform DAVID. We uploaded the full set of 11,961 genes as the "Background" and the top 50 genes loaded on each factor as the "Gene List". This showed that factor 1 can be interpreted as the immune response pathway and factor 3 as the ribosome pathway, while the other factors have no known biological meaning (see Supplementary D.4). In addition, our MCEM algorithm estimates the magnitude of the cross-correlation between factors 1 and 3 to be 0.57. This statistically moderate but biologically noteworthy correlation between the two pathways is consistent with existing biological knowledge (e.g., Bianco and Mohr, 2019, who conclude that ribosome biogenesis restricts innate immune responses to virus infection). This interrelationship would be missed by an approach that assumes independence between pathways, such as BSFA-IGP.

6 Discussion

In this paper we propose the BSFA-DGP model, which is the first attempt to relax the classical assumption of independent factors for longitudinal data. By borrowing information from correlated factor trajectories, this model demonstrates advantages in estimating factor loadings, recovering factor trajectories and predicting gene expression compared with other models.

We also develop an MCEM algorithm for inference in the model; the main motivation for adopting this algorithm is that it can make full use of an existing package to estimate the DGP hyperparameters. In the M-step of MCEM, GPFDA can be applied directly to find the maximizer of the Q-function, thereby exploiting an existing optimization algorithm. In addition to its implementation convenience, the modular structure of the MCEM framework makes it generalizable to other complex models involving a DGP component. For example, when using a DGP model for the latent continuous variable introduced to model multivariate dependent, non-continuous data (such as binary or count data), Sofro et al. (2017) used a Laplace approximation; the MCEM framework we develop would be an alternative inference approach in this case.

The computation time of our approach depends largely on GPFDA, which returns hyperparameter estimates for the DGP maximizing the exact likelihood of the inputs $\mathbf{Y}$. The computational complexity of GPFDA at the $l$th iteration of MCEM is $O(nR^{(l)}k^{2}q^{2})$, indicating that the computational cost scales linearly with the sample size $n$ and the current number of MCMC samples $R^{(l)}$, and quadratically with the number of latent factors $k$ and the number of observed time points $q$. MCEM scales well with $p$, which does not enter the complexity expression at all; fitting the real data in Section 5 took around 1.5 hours on a standard laptop (Quad-Core Intel Core i5). Approaches that could further improve computational performance include the model approximation method proposed by Álvarez et al. (2010) and stochastic variational inference, as discussed in Hensman et al. (2013).

Throughout this paper, the number of latent factors $k$ was fixed and needed to be pre-specified. Although the results of our model can reveal redundant factors (as demonstrated in the simulation study), an automatic approach that infers this number from the data may be preferable in some contexts. A potential solution would be to introduce the Indian Buffet Process as a prior distribution over equivalence classes of infinite-dimensional binary matrices (Knowles and Ghahramani, 2011). In addition, although our empirical evidence suggests that the constraints we adopt are sufficient to avoid non-identifiability, theoretical investigation would be helpful in future work. Extensions of our methodology to handle non-normal data, limits of detection, covariates and batch effects would also be interesting avenues for future research. Furthermore, in some settings it may be helpful to incorporate the clinical outcome explicitly into the model, which could be another direction for future research.

7 Software

The R code implementing the proposed BSFA-DGP model is available as an R package, DGP4LCF, on GitHub: https://github.com/jcai-1122/DGP4LCF. The package contains vignettes illustrating the usage of its functions by applying them to a simulated dataset. The release used in this paper is available at https://doi.org/10.5281/zenodo.8108150.

Acknowledgments

This work is supported through the United Kingdom Medical Research Council programme grants MC_UU_00002/2, MC_UU_00002/20, MC_UU_00040/02 and MC_UU_00040/04. The authors are grateful to Oscar Rueda for the helpful discussion on the bioinformatics side of the project.

Conflict of Interest: None declared.

References

  • Álvarez et al., (2010) Álvarez, M., Luengo, D., Titsias, M., and Lawrence, N. D. (2010). Efficient multioutput Gaussian processes through variational inducing kernels. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 25–32. JMLR Workshop and Conference Proceedings.
  • Alvarez and Lawrence, (2011) Alvarez, M. A. and Lawrence, N. D. (2011). Computationally efficient convolved multiple output Gaussian processes. The Journal of Machine Learning Research, 12:1459–1500.
  • Bekker and ten Berge, (1997) Bekker, P. A. and ten Berge, J. M. (1997). Generic global identification in factor analysis. Linear Algebra and its Applications, 264:255–263.
  • Bhattacharya and Dunson, (2011) Bhattacharya, A. and Dunson, D. B. (2011). Sparse Bayesian infinite factor models. Biometrika, 98(2):291–306.
  • Bianco and Mohr, (2019) Bianco, C. and Mohr, I. (2019). Ribosome biogenesis restricts innate immune responses to virus infection and DNA. eLife, 8:e49551.
  • Bonilla et al., (2007) Bonilla, E. V., Chai, K., and Williams, C. (2007). Multi-task Gaussian process prediction. Advances in Neural Information Processing Systems, 20.
  • Booth and Hobert, (1999) Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(1):265–285.
  • Boyle and Frean, (2005a) Boyle, P. and Frean, M. (2005a). Dependent Gaussian processes. Advances in Neural Information Processing Systems, 17:217–224.
  • Boyle and Frean, (2005b) Boyle, P. and Frean, M. (2005b). Multiple output Gaussian process regression.
  • Caffo et al., (2005) Caffo, B. S., Jank, W., and Jones, G. L. (2005). Ascent-based Monte Carlo expectation–maximization. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):235–251.
  • Carvalho et al., (2008) Carvalho, C. M., Chang, J., Lucas, J. E., Nevins, J. R., Wang, Q., and West, M. (2008). High-dimensional sparse factor modeling: applications in gene expression genomics. Journal of the American Statistical Association, 103(484):1438–1456.
  • Casella, (2001) Casella, G. (2001). Empirical Bayes Gibbs sampling. Biostatistics, 2(4):485–500.
  • Chen et al., (2011) Chen, M., Zaas, A., Woods, C., Ginsburg, G. S., Lucas, J., Dunson, D., et al. (2011). Predicting viral infection from high-dimensional biomarker trajectories. Journal of the American Statistical Association, 106(496):1259–1279.
  • Cheng et al., (2020) Cheng, L.-F., Dumitrascu, B., Darnell, G., Chivers, C., Draugelis, M., Li, K., et al. (2020). Sparse multi-output Gaussian processes for online medical time series prediction. BMC Medical Informatics and Decision Making, 20(1):1–23.
  • Conti et al., (2014) Conti, G., Frühwirth-Schnatter, S., Heckman, J. J., and Piatek, R. (2014). Bayesian exploratory factor analysis. Journal of Econometrics, 183(1):31–57.
  • Costello and Osborne, (2019) Costello, A. B. and Osborne, J. (2019). Best practices in exploratory factor analysis: Four recommendations for getting the most from your analysis. Practical Assessment, Research, and Evaluation, 10(1):7.
  • Dempster et al., (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22.
  • Dey et al., (2020) Dey, D., Datta, A., and Banerjee, S. (2020). Graphical Gaussian process models for highly multivariate spatial data. arXiv preprint arXiv:2009.04837.
  • Gelman and Rubin, (1992) Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457–472.
  • Guan and Haran, (2019) Guan, Y. and Haran, M. (2019). Fast expectation-maximization algorithms for spatial generalized linear mixed models. arXiv preprint arXiv:1909.05440.
  • Hensman et al., (2013) Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data. arXiv preprint arXiv:1309.6835.
  • Hsu and Yang, (2012) Hsu, C.-L. and Yang, U.-C. (2012). Discovering pathway cross-talks based on functional relations between pathways. BMC Genomics, 13(7):1–15.
  • Jones, (2004) Jones, G. L. (2004). On the Markov chain central limit theorem. Probability Surveys, 1:299–320.
  • Kim et al., (2021) Kim, S. Y., Choe, E. K., Shivakumar, M., Kim, D., and Sohn, K.-A. (2021). Multi-layered network-based pathway activity inference using directed random walks: application to predicting clinical outcomes in urologic cancer. Bioinformatics, 37(16):2405–2413.
  • Knowles and Ghahramani, (2011) Knowles, D. and Ghahramani, Z. (2011). Nonparametric Bayesian sparse factor models with application to gene expression modeling. The Annals of Applied Statistics, 5(2B):1534 – 1552.
  • Konzen et al., (2021) Konzen, E., Cheng, Y., and Shi, J. Q. (2021). Gaussian process for functional data analysis: The GPFDA package for R. arXiv preprint arXiv:2102.00249.
  • Ledermann, (1937) Ledermann, W. (1937). On the rank of the reduced correlational matrix in multiple-factor analysis. Psychometrika, 2(2):85–93.
  • Levine and Casella, (2001) Levine, R. A. and Casella, G. (2001). Implementations of the Monte Carlo EM algorithm. Journal of Computational and Graphical Statistics, 10(3):422–439.
  • Levine and Fan, (2004) Levine, R. A. and Fan, J. (2004). An automated (Markov chain) Monte Carlo EM algorithm. Journal of Statistical Computation and Simulation, 74(5):349–360.
  • Li et al., (2020) Li, Y., Wen, Z., Hau, K.-T., Yuan, K.-H., and Peng, Y. (2020). Effects of cross-loadings on determining the number of factors to retain. Structural Equation Modeling: A Multidisciplinary Journal, 27(6):841–863.
  • Liu et al., (2018) Liu, H., Cai, J., and Ong, Y.-S. (2018). Remarks on multi-output Gaussian process regression. Knowledge-Based Systems, 144:102–121.
  • Lunn et al., (2013) Lunn, D., Jackson, C., Best, N., Thomas, A., and Spiegelhalter, D. (2013). The BUGS Book: A Practical Introduction to Bayesian Analysis. Chapman & Hall, London.
  • Neath, (2013) Neath, R. C. (2013). On convergence properties of the Monte Carlo EM algorithm. In Advances in modern statistical theory and applications: a Festschrift in Honor of Morris L. Eaton, pages 43–62. Institute of Mathematical Statistics.
  • Papastamoulis and Ntzoufras, (2022) Papastamoulis, P. and Ntzoufras, I. (2022). On the identifiability of Bayesian factor analytic models. Statistics and Computing, 32(2):1–29.
  • Penny and Roberts, (2002) Penny, W. and Roberts, S. (2002). Bayesian multivariate autoregressive models with structured priors. IEE Proceedings-Vision, Image and Signal Processing, 149(1):33–41.
  • Ramsay and Hooker, (2019) Ramsay, J. and Hooker, G. (2019). Dynamic data analysis: modeling data with differential equations. Springer.
  • Ramsay et al., (2009) Ramsay, J., Hooker, G., and Graves, S. (2009). Introduction to functional data analysis. In Functional data analysis with R and MATLAB, pages 1–19. Springer.
  • Richardson et al., (2016) Richardson, S., Tseng, G. C., and Sun, W. (2016). Statistical methods in integrative genomics. Annual Review of Statistics and Its Application, 3:181.
  • Shi and Choi, (2011) Shi, J. Q. and Choi, T. (2011). Gaussian process regression analysis for functional data. CRC Press.
  • Sofro et al., (2017) Sofro, A., Shi, J. Q., and Cao, C. (2017). Regression analysis for multivariate dependent count data using convolved Gaussian processes. arXiv preprint arXiv:1710.01523.
  • Temate-Tiagueu et al., (2016) Temate-Tiagueu, Y., Seesi, S. A., Mathew, M., Mandric, I., Rodriguez, A., Bean, K., et al. (2016). Inferring metabolic pathway activity levels from RNA-Seq data. BMC Genomics, 17(5):493–503.
  • Wei and Tanner, (1990) Wei, G. C. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association, 85(411):699–704.
  • Wu, (1983) Wu, C. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics, pages 95–103.
  • Ximénez et al., (2022) Ximénez, C., Revuelta, J., and Castañeda, R. (2022). What are the consequences of ignoring cross-loadings in bifactor models? a simulation study assessing parameter recovery and sensitivity of goodness-of-fit indices. Frontiers in Psychology, 13:923877.
  • Zaas et al., (2009) Zaas, A. K., Chen, M., Varkey, J., Veldman, T., Hero III, A. O., Lucas, J., et al. (2009). Gene expression signatures diagnose influenza and other symptomatic respiratory viral infections in humans. Cell Host & Microbe, 6(3):207–217.
  • Zhao et al., (2016) Zhao, S., Gao, C., Mukherjee, S., and Engelhardt, B. E. (2016). Bayesian group factor analysis with structured sparsity. Journal of Machine Learning Research, 17(196):1–47.
Input: Observed gene expression $\mathbf{X}_{i}$ and time points $\mathbf{t}_{i}$, $i=1,\dots,n$, for $n$ people.
Output: The MLE of the DGP hyperparameters, $\widehat{\boldsymbol{\Theta}}^{\text{MLE}}$.

Step 1: Initialization Step

  • Center $\mathbf{X}_{i}$ to obtain $\mathbf{X}_{i}^{c}$ (as defined in Section 4.2), which is input to BFRM for $\mathbf{Y}^{0}$.

  • Input $\mathbf{Y}^{0}$ to GPFDA for $\widehat{\boldsymbol{\Theta}}^{(0)}$, and construct $\widehat{\Sigma}_{\mathbf{Y}}^{(0)}$ using $\widehat{\boldsymbol{\Theta}}^{(0)}$ and $\mathbf{t}=\bigcup_{i=1}^{n}\mathbf{t}_{i}$.

  • Initialize the starting sample size $R^{(1)}$ and the counter $w=0$, which counts the total number of sample-size increases, and specify its upper limit $W$, as described in Supplementary A.5.

Step 2: Iteration Step, starting from $l=1$

do
  2.1 Draw $R^{(l)}$ samples of $\boldsymbol{\Omega}$ from the Gibbs sampler $f(\boldsymbol{\Omega}|\mathbf{X},\widehat{\Sigma}_{\mathbf{Y}}^{(l-1)})$ in Section 3.2, where $\widehat{\Sigma}_{\mathbf{Y}}^{(l-1)}$ is constructed from $\widehat{\boldsymbol{\Theta}}^{(l-1)}$ and $\mathbf{t}$.
  2.2 Retain $R^{(l)}_{\text{remain}}$ samples by discarding burn-in and thinning.
  2.3 Align the post-processed samples $\{\mathbf{L}^{r},\mathbf{Y}_{i}^{r}\}_{r=1,\dots,R^{(l)}_{\text{remain}}}$ using the R package "factor.switch".
  2.4 Input the aligned samples $\{\mathbf{Y}_{i}^{r}\}_{r=1,\dots,R^{(l)}_{\text{remain}}}$ to GPFDA to obtain $\widehat{\Sigma}_{\mathbf{Y}}^{(l)}$.
  2.5 Calculate the lower bound (LB) proposed by Caffo et al. (2005) using $\widehat{\Sigma}_{\mathbf{Y}}^{(l-1)}$, $\widehat{\Sigma}_{\mathbf{Y}}^{(l)}$ and $\{\mathbf{Y}_{i}^{r}\}_{r=1,\dots,R^{(l)}_{\text{remain}}}$ (see Supplementary A.4.2 for details).
  if LB > 0 then
    $R^{(l+1)} \leftarrow R^{(l)}$; $l \leftarrow l+1$;
  else
    $R^{(l)} \leftarrow R^{(l)} + R^{(l)}/m$; $w \leftarrow w+1$;
while $w \leq W$

Algorithm 1: MCEM Algorithm for the MLE of DGP Hyperparameters.
Figure 1: Comparison between true and estimated cross-correlation matrices using the DGP model: all scenarios, with the number of latent factors $k$ correctly specified as 4. The first column displays the true correlations, while the second and third columns display estimates. The first and second rows show values under correlated and uncorrelated factors, respectively. This figure appears in color in the electronic version of this article.
Figure 2: Comparison between true and estimated factor loadings: scenario CS, with the number of latent factors k correctly specified as 4. Genes displayed in the heatmap of the truth (first column) are ordered following two rules: first, genes on factors with smaller indexes are ranked first; second, genes with larger absolute factor loadings are ranked first. Genes displayed in the heatmaps of estimates (second and third columns) follow the ordering of the ground truth to facilitate comparison. This figure appears in color in the electronic version of this article.
Figure 3: Estimated trajectories for all factors in the H3N2 data. Each panel displays all subjects’ trajectories for a single factor. This figure appears in color in the electronic version of this article.
Table 1: Prediction performance for all scenarios: number of latent factors k correctly specified as 4 for both the DGP and IGP models. \text{MAE}_{\mathbf{X}}, \text{MWI}_{\mathbf{X}} and \text{PWI}_{\mathbf{X}} are short for mean absolute error, mean width of the 95% predictive intervals, and the proportion of genes within the 95% predictive intervals, respectively.
Factor Generation Mechanism  DGP, k = 4  IGP, k = 4
Correlation  Variability  \text{MAE}_{\mathbf{X}}  \text{MWI}_{\mathbf{X}}  \text{PWI}_{\mathbf{X}}  \text{MAE}_{\mathbf{X}}  \text{MWI}_{\mathbf{X}}  \text{PWI}_{\mathbf{X}}
Correlated  Large  1.29  6.52  0.95  1.42  7.85  0.95
Correlated  Small  0.53  2.66  0.95  0.55  2.76  0.95
Uncorrelated  Large  1.44  7.28  0.95  1.53  7.70  0.95
Uncorrelated  Small  0.55  2.78  0.95  0.56  2.79  0.95

Supplementary Materials

A. Mathematical derivations

A.1 Full conditionals of the Gibbs sampler for the proposed model

Throughout the following derivations, “\circ” denotes element-wise multiplication; “-” denotes the observed data and all model parameters other than the parameter under derivation; “||\mathbf{z}||^{2}” denotes the sum of squares of the elements of the vector \mathbf{z}; “\text{diag}(\mathbf{z})” denotes a diagonal matrix with the vector \mathbf{z} as its main diagonal; “\text{MVN}(\mathbf{z};\boldsymbol{\mu}_{\mathbf{z}},\Sigma_{\mathbf{z}})” denotes that \mathbf{z} follows a multivariate normal distribution with mean \boldsymbol{\mu}_{\mathbf{z}} and covariance \Sigma_{\mathbf{z}}, with similar interpretations for other distributions; “\mathbf{z}^{T}” denotes the transpose of the vector or matrix \mathbf{z}; and “pos” is short for “posterior”.

  • Full conditional for the latent factors \operatorname{vec}({\mathbf{Y}_{i}}^{T}),\ i=1,\ldots,n

    f(\operatorname{vec}({\mathbf{Y}_{i}}^{T})|-)=f(\operatorname{vec}({\mathbf{Y}_{i,\text{obs}}^{T}})|-)\cdot f(\operatorname{vec}({\mathbf{Y}_{i,\text{miss}}^{T}})|\operatorname{vec}({\mathbf{Y}_{i,\text{obs}}^{T}}),\Sigma_{\mathbf{Y}})

    We first sample \operatorname{vec}({\mathbf{Y}_{i,\text{obs}}^{T}}):

    \begin{split}f(\operatorname{vec}({\mathbf{Y}_{i,\text{obs}}^{T}})|-)&\propto\text{MVN}(\operatorname{vec}({\mathbf{X}_{i}^{T}});\ \operatorname{vec}({\mathbf{M}_{i}^{T}})+\mathbf{L}_{i}^{*}\operatorname{vec}({\mathbf{Y}_{i,\text{obs}}^{T}}),\ \Sigma_{\mathbf{X}_{i}})\cdot\text{MVN}(\operatorname{vec}({\mathbf{Y}_{i,\text{obs}}^{T}});\ \mathbf{0},\ \Sigma_{\mathbf{Y}_{i,\text{obs}}})\\&=\text{MVN}(\mu_{\mathbf{Y}_{i,\text{obs}}}^{\text{pos}},\ \Sigma_{\mathbf{Y}_{i,\text{obs}}}^{\text{pos}}),\end{split}

    with

    \begin{split}\Sigma^{\text{pos}}_{\mathbf{Y}_{i,\text{obs}}}&=[{\mathbf{L}_{i}^{*}}^{T}\Sigma_{\mathbf{X}_{i}}^{-1}\mathbf{L}_{i}^{*}+\Sigma_{\mathbf{Y}_{i,\text{obs}}}^{-1}]^{-1}\\ \mu^{\text{pos}}_{\mathbf{Y}_{i,\text{obs}}}&=\Sigma^{\text{pos}}_{\mathbf{Y}_{i,\text{obs}}}({\mathbf{L}_{i}^{*}}^{T}\Sigma_{\mathbf{X}_{i}}^{-1}(\operatorname{vec}({\mathbf{X}_{i}^{T}})-\operatorname{vec}({\mathbf{M}_{i}^{T}}))).\end{split}

    Then we sample \operatorname{vec}({\mathbf{Y}_{i,\text{miss}}^{T}}) from f(\operatorname{vec}({\mathbf{Y}_{i,\text{miss}}^{T}})|\operatorname{vec}({\mathbf{Y}_{i,\text{obs}}^{T}}),\ \Sigma_{\mathbf{Y}}), an MVN distribution due to the properties of the DGP model (Shi and Choi, 2011).

    In the above equations:

    • \Sigma_{\mathbf{Y}} is the covariance matrix of the factor scores at the full set of time points \mathbf{t}, and \Sigma_{\mathbf{Y}_{i,\text{obs}}} is its sub-matrix at the subject-specific time points;

    • \mathbf{L}_{i}^{*} is constructed using components of the factor loading matrix \mathbf{L}: \mathbf{L}_{i}^{*}=(\mathbf{L}_{1}^{*},...,\mathbf{L}_{p}^{*})^{T}\in\mathbb{R}^{pq_{i}\times kq_{i}}, where (\mathbf{L}_{g}^{*})^{T}=({\text{diag}(l_{g1})}_{q_{i}\times q_{i}},...,{\text{diag}(l_{gk})}_{q_{i}\times q_{i}})\in\mathbb{R}^{q_{i}\times kq_{i}};

    • \Sigma_{\mathbf{X}_{i}}=\text{diag}(({\phi_{1}^{2}})_{\times q_{i}},...,({\phi_{p}^{2}})_{\times q_{i}})\in\mathbb{R}^{pq_{i}\times pq_{i}}, where ({\phi_{a}^{2}})_{\times q_{i}} represents a q_{i}-dimensional row vector consisting of the scalar {\phi_{a}^{2}}.

    Note that when coding the algorithm, there is no need to create the pq_{i}\times pq_{i} diagonal matrix \Sigma_{\mathbf{X}_{i}}^{-1} explicitly, which would exhaust memory. To calculate the term {\mathbf{L}_{i}^{*}}^{T}\Sigma_{\mathbf{X}_{i}}^{-1}, we can use a property of diagonal-matrix multiplication: post-multiplying by a diagonal matrix is equivalent to scaling each column of the left matrix by the corresponding diagonal element, as sketched below.
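    A minimal base-R sketch of this memory-saving product is given below; the object names (Lstar, sigma_x_diag) are hypothetical and the matrices are random stand-ins rather than quantities produced by our sampler.

```r
# Minimal sketch: computing t(L_i*) %*% Sigma_Xi^{-1} without forming the
# (pq_i x pq_i) diagonal matrix. Object names are illustrative only.
p <- 50; q_i <- 8; k <- 4
Lstar <- matrix(rnorm((p * q_i) * (k * q_i)), nrow = p * q_i)  # stand-in for L_i*
sigma_x_diag <- rep(runif(p, 0.5, 2), each = q_i)              # diagonal of Sigma_Xi: phi_g^2 repeated q_i times

naive <- t(Lstar) %*% diag(1 / sigma_x_diag)        # builds the full diagonal matrix
fast  <- sweep(t(Lstar), 2, 1 / sigma_x_diag, `*`)  # scales each column instead
stopifnot(isTRUE(all.equal(naive, fast)))
```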

  • Full conditional for the binary matrix \mathbf{Z}

    Let \mathbf{Z}_{g\cdot}=(Z_{g1},...,Z_{gk}) denote the g-th row of the matrix \mathbf{Z},\ g=1,...,p; then,

    \begin{split}f(\mathbf{Z}_{g\cdot}|-)&\propto\prod_{i=1}^{n}\text{MVN}(\mathbf{x}_{ig};\ \boldsymbol{\mu}_{ig}+(\mathbf{A}_{g\cdot}\circ\mathbf{Z}_{g\cdot})\mathbf{Y}_{i},\ \text{diag}(\phi_{g}^{2},\ q_{i}))\cdot\prod_{a=1}^{k}\text{Bernoulli}(Z_{ga};\pi_{a}).\end{split}

    We calculate the posterior probability for each of the 2^{k} possible values of \mathbf{Z}_{g\cdot} based on the above formula, and then sample \mathbf{Z}_{g\cdot} with the corresponding normalized probabilities, as sketched below.
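    The following is a minimal sketch of this enumeration step in base R; log_post is a hypothetical placeholder for the unnormalized log-posterior of one candidate row, not the actual likelihood computation in our sampler.

```r
# Minimal sketch: enumerate all 2^k values of Z_g. and sample one row.
k <- 4
candidates <- as.matrix(expand.grid(rep(list(0:1), k)))  # 2^k candidate rows

# Placeholder for the unnormalized log-posterior of a candidate row; in the
# sampler this combines the MVN likelihood terms over subjects with the
# Bernoulli(pi_a) prior terms.
log_post <- function(z) sum(z * log(0.1) + (1 - z) * log(0.9))

log_w <- apply(candidates, 1, log_post)
w <- exp(log_w - max(log_w))                             # guard against underflow
Z_g_new <- candidates[sample(nrow(candidates), 1, prob = w), ]
```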

  • Full conditional for the regression coefficient matrix \mathbf{A}

    Let \mathbf{A}_{g\cdot}=(A_{g1},...,A_{gk}) denote the g-th row of the matrix \mathbf{A},\ g=1,...,p; then,

    \begin{split}f(\mathbf{A}_{g\cdot}|-)&\propto\prod_{i=1}^{n}\text{MVN}(\mathbf{x}_{ig};\ \boldsymbol{\mu}_{ig}+(\mathbf{A}_{g\cdot}\circ\mathbf{Z}_{g\cdot})\mathbf{Y}_{i},\ \text{diag}(\phi_{g}^{2},\ q_{i}))\cdot\text{MVN}(\mathbf{A}_{g\cdot};\ \mathbf{0},\ \text{diag}(\boldsymbol{\rho}^{2}))\\&=\text{MVN}(\mathbf{A}_{g\cdot};\ \mu_{\mathbf{A}_{g\cdot}}^{\text{pos}},\ \Sigma_{\mathbf{A}_{g\cdot}}^{\text{pos}}),\end{split}

    where

    \begin{split}\Sigma_{\mathbf{A}_{g\cdot}}^{\text{pos}}&=\left(\frac{\text{diag}(\mathbf{Z}_{g\cdot})(\sum_{i=1}^{n}\mathbf{Y}_{i}{\mathbf{Y}_{i}}^{T})\text{diag}(\mathbf{Z}_{g\cdot})}{\phi^{2}_{g}}+\text{diag}\left(\frac{1}{\boldsymbol{\rho}^{2}}\right)\right)^{-1}\\ \mu_{\mathbf{A}_{g\cdot}}^{\text{pos}}&=\frac{\Sigma_{\mathbf{A}_{g\cdot}}^{\text{pos}}(\text{diag}(\mathbf{Z}_{g\cdot})\sum_{i=1}^{n}{\mathbf{Y}_{i}}(\mathbf{x}_{ig}-\boldsymbol{\mu}_{ig}))}{\phi^{2}_{g}}\end{split}

    and \boldsymbol{\rho}^{2}=(\rho^{2}_{1},...,\rho^{2}_{k}).

  • Full conditional for the intercept \mu_{ig},\ i=1,...,n;\ g=1,...,p

    \begin{split}f(\mu_{ig}|-)&\propto\prod_{j=1}^{q_{i}}\text{N}(x_{ijg};\mu_{ig}+\sum_{a=1}^{k}l_{ga}y_{ija},\phi^{2}_{g})\cdot\text{N}(\mu_{ig};\mu_{g},\sigma^{2}_{g})\\&=\text{N}(\mu_{ig};\mu_{ig}^{\text{pos}},\ \sigma_{ig}^{2,\text{pos}}),\end{split}

    where

    \begin{split}\sigma_{ig}^{2,\text{pos}}&={\left(\frac{1}{\sigma^{2}_{g}}+\frac{q_{i}}{\phi^{2}_{g}}\right)}^{-1}\\ \mu_{ig}^{\text{pos}}&=\left(\frac{\mu_{g}}{\sigma^{2}_{g}}+\frac{\sum_{j=1}^{q_{i}}(x_{ijg}-\sum_{a=1}^{k}l_{ga}y_{ija})}{\phi^{2}_{g}}\right)\cdot\sigma_{ig}^{2,\text{pos}}.\end{split}
  • Full conditional for \pi_{a},\ a=1,...,k

    \begin{split}f(\pi_{a}|-)&\propto\prod_{g=1}^{p}\text{Bernoulli}(Z_{ga};\pi_{a})\cdot\text{Beta}(\pi_{a};c_{0},d_{0})\\&=\text{Beta}(c_{0}+\sum_{g=1}^{p}Z_{ga},\ d_{0}+\sum_{g=1}^{p}(1-Z_{ga}))\end{split}
  • Full conditional for \rho_{a}^{2},\ a=1,...,k

    \begin{split}f(\rho_{a}^{2}|-)&\propto\prod_{g=1}^{p}\text{N}(A_{ga};0,\rho^{2}_{a})\cdot\text{Inverse-Gamma}(\rho_{a}^{2};c_{1},d_{1})\\&=\text{Inverse-Gamma}(c_{1}+\frac{p}{2},\ d_{1}+\frac{1}{2}\sum_{g=1}^{p}A_{ga}^{2})\end{split}
  • Full conditional for \sigma_{g}^{2},\ g=1,...,p

    \begin{split}f(\sigma_{g}^{2}|-)&\propto\prod_{i=1}^{n}\text{N}(\mu_{ig};\mu_{g},\sigma^{2}_{g})\cdot\text{Inverse-Gamma}(\sigma^{2}_{g};c_{2},d_{2})\\&=\text{Inverse-Gamma}(c_{2}+\frac{1}{2}n,\ d_{2}+\frac{1}{2}\sum_{i=1}^{n}{(\mu_{ig}-\mu_{g}})^{2})\end{split}
  • Full conditional for \phi_{g}^{2},\ g=1,...,p

    \begin{split}f(\phi_{g}^{2}|-)&\propto\prod_{i=1}^{n}\text{MVN}(\mathbf{x}_{ig};\ \boldsymbol{\mu}_{ig}+(\mathbf{A}_{g\cdot}\circ\mathbf{Z}_{g\cdot})\mathbf{Y}_{i},\ \text{diag}(\phi_{g}^{2},\ q_{i}))\cdot\text{Inverse-Gamma}(\phi^{2}_{g};c_{3},d_{3})\\&=\text{Inverse-Gamma}(c_{3}+\frac{1}{2}\sum_{i=1}^{n}q_{i},\ d_{3}+\frac{1}{2}\sum_{i=1}^{n}{||\mathbf{x}_{ig}-\boldsymbol{\mu}_{ig}-(\mathbf{A}_{g\cdot}\circ\mathbf{Z}_{g\cdot})\mathbf{Y}_{i}||}^{2})\end{split}
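    For concreteness, a minimal sketch of the standard conjugate draws above is given below (base R, with illustrative inputs); an Inverse-Gamma(shape, rate) variate is obtained as the reciprocal of a Gamma(shape, rate) variate, and the residual sum of squares is a hypothetical stand-in rather than a value computed from data.

```r
# Inverse-Gamma(shape, rate) draw via the reciprocal of a Gamma draw.
rinvgamma <- function(n, shape, rate) 1 / rgamma(n, shape = shape, rate = rate)

p <- 100; k <- 4
Z  <- matrix(rbinom(p * k, 1, 0.1), p, k)   # current inclusion indicators
c0 <- 0.1 * p; d0 <- 0.9 * p                # Beta prior (Supplementary B.2)
pi_1 <- rbeta(1, c0 + sum(Z[, 1]), d0 + sum(1 - Z[, 1]))   # update for pi_1

c3 <- d3 <- 1e-2                            # Inverse-Gamma prior (Supplementary B.2)
q_i <- rep(8, 17)                           # q_i for each of n = 17 subjects
rss <- 350                                  # illustrative residual sum of squares for gene g
phi2_g <- rinvgamma(1, c3 + sum(q_i) / 2, d3 + rss / 2)    # update for phi_g^2
```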
  • Full conditional for predictions of gene expression (only implemented when assessing models’ prediction performance on the test dataset)

    Suppose that \mathbf{X}_{i}^{\text{new}}, \mathbf{Y}_{i}^{\text{new}} represent the predicted gene expression and factor expression of the i-th individual at new time points, respectively. The posterior predictive distribution under the MCEM-returned \widehat{\boldsymbol{\Theta}}^{\text{MLE}} can be expressed as

    \begin{split}f(\mathbf{X}_{i}^{\text{new}}|\ \widehat{\boldsymbol{\Theta}}^{\text{MLE}},\boldsymbol{\Omega})&=\int f(\mathbf{X}_{i}^{\text{new}},\mathbf{Y}_{i}^{\text{new}}|\ \widehat{\boldsymbol{\Theta}}^{\text{MLE}},\boldsymbol{\Omega})\,d\mathbf{Y}_{i}^{\text{new}}\\&=\int f(\mathbf{X}_{i}^{\text{new}}|\mathbf{Y}_{i}^{\text{new}},\boldsymbol{\Omega})\cdot f(\mathbf{Y}_{i}^{\text{new}}|\widehat{\boldsymbol{\Theta}}^{\text{MLE}},\mathbf{Y}_{i,\text{obs}})\,d\mathbf{Y}_{i}^{\text{new}},\end{split}

    where the first term of the integrand is an MVN density because of the assumed factor model, and the second term is also an MVN density because of the assumed DGP model on the latent factor trajectories (Shi and Choi, 2011). Therefore, once a sample of parameters \boldsymbol{\Omega}^{r} is generated, the r-th sample of \mathbf{Y}_{i}^{\text{new}} can be generated from f(\mathbf{Y}_{i}^{\text{new}}|\ \widehat{\boldsymbol{\Theta}}^{\text{MLE}},\mathbf{Y}_{i,\text{obs}}^{r}), and then the r-th sample of \mathbf{X}_{i}^{\text{new}} can be drawn from f(\mathbf{X}_{i}^{\text{new}}|\ \mathbf{Y}_{i}^{\text{new}},\boldsymbol{\Omega}^{r}), as sketched below.
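    A minimal sketch of this two-stage composition sampling follows, using the mvtnorm package and a placeholder squared-exponential covariance in place of the fitted DGP covariance; all object names are hypothetical.

```r
library(mvtnorm)

q <- 8; u <- 2                                   # observed and new time points (one factor)
t_all <- 1:(q + u)
Sigma_Y_full <- exp(-0.1 * outer(t_all, t_all, `-`)^2) +
                1e-8 * diag(q + u)               # placeholder for the DGP covariance

io <- 1:q; inew <- q + 1:u
S_oo <- Sigma_Y_full[io, io]
S_no <- Sigma_Y_full[inew, io]
S_nn <- Sigma_Y_full[inew, inew]

y_obs <- drop(rmvnorm(1, sigma = S_oo))          # stand-in for one Gibbs draw of Y_i,obs
cond_mean <- drop(S_no %*% solve(S_oo, y_obs))
cond_cov  <- S_nn - S_no %*% solve(S_oo, t(S_no))
cond_cov  <- (cond_cov + t(cond_cov)) / 2        # symmetrize for numerical safety
y_new <- drop(rmvnorm(1, cond_mean, cond_cov))   # draw Y_i^new | Y_i,obs
# The r-th X_i^new draw then comes from the factor model given y_new and Omega^r.
```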

A.2 Kernel convolution framework to model dependent Gaussian processes

A.2.1 Illustration using two processes

The KCF constructs correlated processes by introducing shared “base processes”. Taking two processes y_{a}(t) and y_{b}(t) as an example, the KCF constructs them as,

\begin{split}y_{a}(t)&=\eta_{a}(t)+\xi_{a}(t)+\epsilon_{a}(t),\\ y_{b}(t)&=\eta_{b}(t)+\xi_{b}(t)+\epsilon_{b}(t),\end{split}

where \epsilon_{a}(t),\ \epsilon_{b}(t) are residual errors from \text{N}(0,\psi^{2}), and \eta_{a}(t),\ \eta_{b}(t),\ \xi_{a}(t),\ \xi_{b}(t) are processes constructed in the following way, illustrated in Supplementary Figure 1 below.

Supplementary Figure 1: Illustration of the kernel convolution framework for DGPs. The star (\ast) denotes a convolution operation and directed arrows indicate direct dependence. t denotes time; a, b are indices of the factor trajectories; \tau_{a}(t), \tau_{0}(t), \tau_{b}(t) are independent Gaussian white noise processes; h_{a1}(t), h_{a0}(t), h_{b0}(t), h_{b1}(t) are Gaussian kernel functions; \epsilon_{a}(t), \epsilon_{b}(t) are residuals; y_{a}(t), y_{b}(t) are the a-th and b-th factor trajectories, respectively.

First, three independent, zero-mean base processes \tau_{0}(t), \tau_{a}(t) and \tau_{b}(t) are introduced, all of which are Gaussian white noise processes. The first process, \tau_{0}(t), is shared by y_{a}(t) and y_{b}(t), thereby inducing dependence between them. In contrast, \tau_{a}(t) and \tau_{b}(t) are specific to y_{a}(t) and y_{b}(t), respectively, and capture the unique aspects of each process.

Second, Gaussian kernel functions h_{a0}(t),\ h_{a1}(t),\ h_{b0}(t),\ h_{b1}(t) are applied to convolve the base processes, with h_{\cdot 0}(t) applied to the shared process \tau_{0}(t) and h_{\cdot 1}(t) to the output-specific processes \tau_{a}(t) and \tau_{b}(t):

\begin{split}\xi_{a}(t)&=h_{a0}(t)*\tau_{0}(t),\qquad\eta_{a}(t)=h_{a1}(t)*\tau_{a}(t),\\ \xi_{b}(t)&=h_{b0}(t)*\tau_{0}(t),\qquad\eta_{b}(t)=h_{b1}(t)*\tau_{b}(t),\end{split}

where the convolution operator * is defined as h(t)*\tau(t)=\int^{\infty}_{-\infty}h(t-s)\tau(s)\,ds. All kernel functions h(t) take the form h(t)=v\,\text{exp}\{-\frac{1}{2}Bt^{2}\}, where v and B are positive parameters specific to each kernel function.

A.2.2 Specific form of covariance function

Under the kernel convolution framework, the covariance function between times t_{j} and t_{\ell} within a single process a, denoted C_{aa}^{Y}(t_{j},t_{\ell}), can be decomposed as,

\begin{split}C_{aa}^{Y}(t_{j},t_{\ell})&=C_{aa}^{\xi}(t_{j},t_{\ell})+C_{aa}^{\eta}(t_{j},t_{\ell})+\delta_{j\ell}\psi^{2},\\ C_{aa}^{\xi}(t_{j},t_{\ell})&=v_{a0}^{2}\frac{\pi^{\frac{1}{2}}}{\sqrt{|B_{a0}|}}\,\text{exp}\{-\tfrac{1}{4}B_{a0}d_{t}^{2}\},\\ C_{aa}^{\eta}(t_{j},t_{\ell})&=v_{a1}^{2}\frac{\pi^{\frac{1}{2}}}{\sqrt{|B_{a1}|}}\,\text{exp}\{-\tfrac{1}{4}B_{a1}d_{t}^{2}\},\end{split}

where j, \ell are time indices, d_{t}=t_{j}-t_{\ell}, and \delta_{j\ell}=1 if j=\ell and \delta_{j\ell}=0 otherwise.

The covariance function between time t_{j} of process a and time t_{\ell} of process b, denoted C_{ab}^{Y}(t_{j},t_{\ell}), can be expressed as,

\begin{split}C_{ab}^{Y}(t_{j},t_{\ell})&=C_{ab}^{\xi}(t_{j},t_{\ell}),\quad a\neq b,\\ C_{ab}^{\xi}(t_{j},t_{\ell})&=v_{a0}v_{b0}\frac{(2\pi)^{\frac{1}{2}}}{\sqrt{|B_{a0}+B_{b0}|}}\,\text{exp}\{-\tfrac{1}{2}B_{ab}d_{t}^{2}\},\end{split}

where B_{ab}=\frac{B_{a0}B_{b0}}{B_{a0}+B_{b0}}.

Note that the original paper (Boyle and Frean, 2005b) provides derivations under more general settings. Let Q denote the dimension of the input variable t_{j}, and M the number of shared base processes \tau_{0}(t). In our proposed approach, Q = 1 and M = 1; in Boyle and Frean (2005b), Q and M can be arbitrary positive integers.
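A minimal sketch of these covariance functions (for our case Q = 1, M = 1, with illustrative parameter values rather than estimates) is given below.

```r
t_grid <- seq(0, 10, length.out = 8)
dt <- outer(t_grid, t_grid, `-`)

v_a0 <- 1.0; B_a0 <- 0.5       # shared-kernel parameters for process a
v_a1 <- 0.5; B_a1 <- 1.0       # specific-kernel parameters for process a
v_b0 <- 0.8; B_b0 <- 0.7       # shared-kernel parameters for process b
psi2 <- 0.1                    # residual variance

# Within-process covariance C_aa^Y (shared part + specific part + noise)
C_aa <- v_a0^2 * sqrt(pi / B_a0) * exp(-B_a0 * dt^2 / 4) +
        v_a1^2 * sqrt(pi / B_a1) * exp(-B_a1 * dt^2 / 4) +
        psi2 * diag(length(t_grid))

# Cross-covariance C_ab^Y, driven only by the shared base process
B_ab <- B_a0 * B_b0 / (B_a0 + B_b0)
C_ab <- v_a0 * v_b0 * sqrt(2 * pi / (B_a0 + B_b0)) * exp(-B_ab * dt^2 / 2)
```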

A.3 Identifiability issue of the model

As in classical factor analysis, our model has identifiability issues caused by scaling and label switching. Unlike the classical setting, however, it does not suffer from rotational invariance, because the factor trajectories are assumed to be correlated (see Zhao et al. (2016) for a detailed description of the identifiability issues caused by rotation, scaling and label switching in classical factor analysis). Instead, a new source of non-identifiability arises: the covariance matrix itself. Below we state each issue in our model mathematically and describe our corresponding solutions.

To illustrate these issues, we first re-express the proposed model of Section 2.3 as,

\begin{split}\operatorname{vec}({\mathbf{X}_{i}^{T}})&=\operatorname{vec}({\mathbf{M}_{i}^{T}})+\mathbf{L}_{i}^{*}\operatorname{vec}({\mathbf{Y}_{i,\text{obs}}^{T}})+\operatorname{vec}({\mathbf{E}_{i}^{T}}),\\ \operatorname{vec}({\mathbf{Y}_{i}}^{T})&\sim\text{MVN}(\mathbf{0},\Sigma_{\mathbf{Y}}),\\ \operatorname{vec}({\mathbf{E}_{i}^{T}})&\sim\text{MVN}(\mathbf{0},\Sigma_{\mathbf{X}_{i}}),\end{split} (7)

where \operatorname{vec}({\mathbf{X}_{i}^{T}}), \operatorname{vec}({\mathbf{M}_{i}^{T}}), and \operatorname{vec}({\mathbf{E}_{i}^{T}}) are vectorized from the matrices \mathbf{X}_{i}^{T}, \mathbf{M}_{i}^{T}, and \mathbf{E}_{i}^{T}, respectively. To make the above equations hold, \mathbf{L}_{i}^{*} and \Sigma_{\mathbf{X}_{i}} are constructed from components of \mathbf{L} and \boldsymbol{\phi}, respectively; their specific forms are given in Supplementary Section A.1.

The distribution of \operatorname{vec}({\mathbf{X}_{i}^{T}}) after integrating out \operatorname{vec}({\mathbf{Y}_{i,\text{obs}}^{T}}) is

\operatorname{vec}({\mathbf{X}_{i}^{T}})\ |\ \operatorname{vec}({\mathbf{M}_{i}^{T}}),\mathbf{L}_{i}^{*},\Sigma_{\mathbf{X}_{i}},\Sigma_{\mathbf{Y}}\sim\text{MVN}(\operatorname{vec}({\mathbf{M}_{i}^{T}}),\ \mathbf{L}_{i}^{*}\Sigma_{\mathbf{Y}_{i,\text{obs}}}(\mathbf{L}_{i}^{*})^{T}+\Sigma_{\mathbf{X}_{i}}), (8)

where \Sigma_{\mathbf{Y}_{i,\text{obs}}} is the sub-matrix of \Sigma_{\mathbf{Y}} that characterizes the covariance structure of \operatorname{vec}({\mathbf{Y}_{i,\text{obs}}^{T}}).

First, there is an identifiability issue with the covariance matrix \Sigma_{\mathbf{Y}_{i,\text{obs}}}, arising from the invariance of the covariance \mathbf{L}_{i}^{*}\Sigma_{\mathbf{Y}_{i,\text{obs}}}(\mathbf{L}_{i}^{*})^{T}+\Sigma_{\mathbf{X}_{i}} in Equation 8. The uniqueness of \Sigma_{\mathbf{X}_{i}} has been established in previous research (Ledermann, 1937; Bekker and ten Berge, 1997; Conti et al., 2014; Papastamoulis and Ntzoufras, 2022); given its identifiability, we focus on the identifiability of \Sigma_{\mathbf{Y}_{i,\text{obs}}}. Non-identifiability is present because, for any non-singular transformation matrix \mathbf{D}\in\mathbb{R}^{kq_{i}\times kq_{i}}, the expression \mathbf{L}_{i}^{*}\Sigma_{\mathbf{Y}_{i,\text{obs}}}(\mathbf{L}_{i}^{*})^{T}+\Sigma_{\mathbf{X}_{i}} is identical under the two sets of estimators \{\widehat{\mathbf{L}_{i}^{*}},\widehat{\operatorname{vec}({\mathbf{Y}_{i,\text{obs}}^{T}})},\widehat{\Sigma}_{\mathbf{Y}_{i,\text{obs}}}\} and \{\widehat{\mathbf{L}_{i}^{*}}\mathbf{D},\mathbf{D}^{-1}\widehat{\operatorname{vec}({\mathbf{Y}_{i,\text{obs}}^{T}})},\mathbf{D}^{-1}\widehat{\Sigma}_{\mathbf{Y}_{i,\text{obs}}}(\mathbf{D}^{-1})^{T}\}.

To address this issue, we place a constraint on \Sigma_{\mathbf{Y}} that requires its main diagonal elements to be 1. In other words, the covariance matrix of the latent factors \Sigma_{\mathbf{Y}} is forced to be a correlation matrix by fixing the variance of each factor at 1. This restriction has also been used in Conti et al. (2014) and serves multiple purposes: it ensures the uniqueness of \Sigma_{\mathbf{Y}}, which in turn sets the scale of \operatorname{vec}({\mathbf{Y}_{i,\text{obs}}^{T}}) and consequently the scale of \mathbf{L}_{i}^{*}. In addition, note that the rotational invariance discussed in Zhao et al. (2016) does not arise here, as \Sigma_{\mathbf{Y}_{i,\text{obs}}}\neq\mathbf{I}.
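The effect of this constraint can be illustrated with a small sketch: rescaling any candidate covariance matrix to a correlation matrix, and absorbing the scales into the loadings, leaves the marginal covariance of the data unchanged. All values below are illustrative.

```r
Sigma_Y <- matrix(c(1.5, 0.6, 0.6, 2.0), 2, 2)   # an unconstrained candidate covariance
D <- diag(sqrt(diag(Sigma_Y)))                   # factor standard deviations
R_Y <- cov2cor(Sigma_Y)                          # constrained version: unit diagonal
L <- matrix(rnorm(6), 3, 2)                      # illustrative loadings

# L Sigma_Y L^T is unchanged when (L, Sigma_Y) is replaced by (L D, R_Y)
stopifnot(isTRUE(all.equal(L %*% Sigma_Y %*% t(L),
                           (L %*% D) %*% R_Y %*% t(L %*% D))))
```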

Second, even under the above constraint, there is still an identifiability issue with \{\mathbf{L}_{i}^{*},\operatorname{vec}({\mathbf{Y}_{i,\text{obs}}^{T}})\}. This is because \mathbf{D}^{-1}\widehat{\Sigma}_{\mathbf{Y}_{i,\text{obs}}}(\mathbf{D}^{-1})^{T} may still equal \widehat{\Sigma}_{\mathbf{Y}_{i,\text{obs}}} when \mathbf{D} is a particular signed matrix (a diagonal matrix with diagonal elements 1 or -1) or permutation matrix (in which case the identifiability problem is also known as label switching), as explained in Papastamoulis and Ntzoufras (2022). The specific form of the transformation matrix leading to such invariance depends on the true correlation structure of the data.

To deal with this non-identifiability caused by signed permutations, a common approach is to post-process the samples (Zhao et al., 2016). Here, we use the R package “factor.switch” developed by Papastamoulis and Ntzoufras (2022) to align samples across Gibbs iterations. Alignment of \{\mathbf{L}^{r},\mathbf{Y}^{r}\}_{r=1,...,R} must be completed before inputting factor scores to GPFDA for estimating the DGP parameters \boldsymbol{\Theta} within the MCEM algorithm, and also before the final posterior summary of \{\mathbf{L},\mathbf{Y}\} using the Gibbs-After-MCEM samples. For the latter Gibbs sampler, we ran three chains in parallel; alignment is therefore carried out first within each chain and then across chains.

A.4 Choosing the Gibbs sample size within the MCEM algorithm

One challenge in implementing the MCEM algorithm is the choice of the Gibbs sample size. First and foremost, the Gibbs-within-MCEM sample size R determines the computational cost of both the Gibbs sampler and EM. Choosing R too small, while saving computational expense in the Gibbs sampler, results in a less precise approximation of the Q-function and thus markedly slower convergence of the EM algorithm. In contrast, choosing R too large, while providing an accurate approximation of the Q-function and faster EM convergence, makes the time taken to generate the required Gibbs samples prohibitive. Consequently, the choice of R must trade off the accuracy of the Q-function against the computational cost of generating Gibbs-within-MCEM samples. We provide a literature review on this topic in Supplementary Section A.4.1 and adapt the approach proposed by Caffo et al. (2005) to our model in Supplementary Section A.4.2. This approach automatically determines when to increase R depending on whether the ascent property of the marginal likelihood under EM is preserved (Wu, 1983).

A.4.1 Literature review

A thorough review of strategies for choosing R can be found in Levine and Fan (2004). We focus on approaches that automatically adjust the sample size, avoiding tedious manual tuning; Neath (2013) provides a review of this class of approaches. Briefly, there are primarily two methods, the main difference between them being the criterion for increasing the sample size. The first approach (Booth and Hobert, 1999; Levine and Casella, 2001; Levine and Fan, 2004) achieves automatic tuning by monitoring the Monte Carlo error associated with each individual parameter, i.e., the approximation error incurred when using Monte Carlo samples to approximate the exact expectation. Additional samples are needed if the Monte Carlo error for any of the parameters is deemed too large. However, this approach is difficult to apply here because our experiments with GPFDA revealed that individual DGP parameters are not identifiable: for the same input data, different runs of GPFDA could return differing individual estimates yet still yield similar estimates of the covariance matrix (and therefore similar marginal likelihoods).

An alternative, proposed by Caffo et al. (2005), increases the sample size depending on whether the ascent property of the marginal likelihood under EM is preserved (Wu, 1983). For exact EM, where the expectation can be calculated precisely, the likelihood function is non-decreasing as the algorithm progresses. Under MCEM, however, where the E-step is estimated with Monte Carlo samples, the likelihood can decrease because of approximation error. The algorithm of Caffo et al. (2005) ensures that the likelihood still increases with high probability as the algorithm iterates, by introducing a mechanism for rejecting proposed parameter estimates. Specifically, they use the Q-function as a proxy for the marginal likelihood: an updated \widehat{\boldsymbol{\Theta}}^{(l)} is accepted only when it increases the Q-function compared to the previous \widehat{\boldsymbol{\Theta}}^{(l-1)}; otherwise, the algorithm increases the sample size and proposes a new \widehat{\boldsymbol{\Theta}}^{(l)} using the larger sample.

A.4.2 Adaptation of Caffo’s approach to our model

To apply the method of Caffo et al. (2005) to our model, we begin by writing out the exact Q-function after the (l-1)-th iteration of EM. Following Equations 3.5 and 3.8 of the main manuscript, Q(\boldsymbol{\Theta},\widehat{\boldsymbol{\Theta}}^{(l-1)})=\mathbb{E}_{\mathbf{Y}}\left[\ln f(\mathbf{Y}|\boldsymbol{\Theta})\,\middle|\,\mathbf{X},\widehat{\boldsymbol{\Theta}}^{(l-1)}\right]. Thus, the change in the value of the Q-function after obtaining an updated \widehat{\boldsymbol{\Theta}}^{(l)} compared to the current \widehat{\boldsymbol{\Theta}}^{(l-1)} can be represented as,

\begin{split}\Delta Q&=Q(\widehat{\boldsymbol{\Theta}}^{(l)},\widehat{\boldsymbol{\Theta}}^{(l-1)})-Q(\widehat{\boldsymbol{\Theta}}^{(l-1)},\widehat{\boldsymbol{\Theta}}^{(l-1)})\\&=\mathbb{E}_{\mathbf{Y}}\left[\ln\frac{f(\mathbf{Y}|\widehat{\boldsymbol{\Theta}}^{(l)})}{f(\mathbf{Y}|\widehat{\boldsymbol{\Theta}}^{(l-1)})}\,\middle|\,\mathbf{X},\widehat{\boldsymbol{\Theta}}^{(l-1)}\right]\\&=\mathbb{E}_{\mathbf{Y}}[g(\mathbf{Y})],\end{split}

where g(\mathbf{Y})=\ln\frac{f(\mathbf{Y}|\widehat{\boldsymbol{\Theta}}^{(l)})}{f(\mathbf{Y}|\widehat{\boldsymbol{\Theta}}^{(l-1)})}, and the expectation is with respect to f(\mathbf{Y}|\mathbf{X},\widehat{\boldsymbol{\Theta}}^{(l-1)}). Under MCEM, we approximate this change \Delta Q using the approximate Q-function \widetilde{Q} in Equation 3.7,

\begin{split}\Delta\widetilde{Q}&=\widetilde{Q}(\widehat{\boldsymbol{\Theta}}^{(l)},\widehat{\boldsymbol{\Theta}}^{(l-1)})-\widetilde{Q}(\widehat{\boldsymbol{\Theta}}^{(l-1)},\widehat{\boldsymbol{\Theta}}^{(l-1)})\\&=\frac{1}{R}\sum_{r=1}^{R}\ln\frac{f(\mathbf{Y}^{r}|\widehat{\boldsymbol{\Theta}}^{(l)})}{f(\mathbf{Y}^{r}|\widehat{\boldsymbol{\Theta}}^{(l-1)})}\\&=\frac{1}{R}\sum_{r=1}^{R}g(\mathbf{Y}^{r}),\qquad\mathbf{Y}^{r}\sim f(\mathbf{Y}|\mathbf{X},\widehat{\boldsymbol{\Theta}}^{(l-1)})\\&=\overline{g}_{R}.\end{split}

A generalized version of the Central Limit Theorem (CLT) (Jones, 2004) shows that \overline{g}_{R} converges in distribution to \text{N}(\mathbb{E}_{\mathbf{Y}}[g(\mathbf{Y})],\frac{\zeta}{R}) as R\rightarrow\infty, where \zeta=\text{Var}(g(\mathbf{Y})) can be estimated using either the sample variance of g(\mathbf{Y}^{r}) when the samples \{\mathbf{Y}^{r}\}_{r=1,...,R} are independent, or the batch-means approach (Lunn et al., 2013; Guan and Haran, 2019) when the samples are dependent (as is the case here, because they are obtained using an MCMC sampler). This implies that,

\begin{split}&\text{P}\left(\frac{\overline{g}_{R}-\mathbb{E}_{\mathbf{Y}}[g(\mathbf{Y})]}{\sqrt{\widehat{\zeta}/R}}<Z_{1-\alpha}\right)\approx 1-\alpha;\quad\text{or equivalently,}\\&\text{P}\left(\Delta Q>\Delta\widetilde{Q}-\sqrt{\frac{\widehat{\zeta}}{R}}\,Z_{1-\alpha}\right)\approx 1-\alpha,\end{split}

where \widehat{\zeta} is the estimate of \zeta and Z_{1-\alpha} is the upper \alpha quantile of the standard normal distribution. (\Delta\widetilde{Q}-\sqrt{\frac{\widehat{\zeta}}{R}}Z_{1-\alpha}) is called the “Lower Bound” (LB) for \Delta Q (Caffo et al., 2005), because \Delta Q exceeds this quantity with high probability when \alpha is chosen to be small. When LB is positive, it is highly likely that \Delta Q is also positive. The automatic updating rule for the sample size is based on LB: in the l-th iteration, if LB is positive, we accept the updated \widehat{\boldsymbol{\Theta}}^{(l)} and keep the current sample size R; otherwise, we reject \widehat{\boldsymbol{\Theta}}^{(l)} and continue generating additional samples under \widehat{\boldsymbol{\Theta}}^{(l-1)} before updating again. Caffo et al. (2005) suggest a geometric rate of increase for the sample size, drawing \frac{R}{m} additional samples for some fixed m.
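A minimal sketch of the LB check follows, assuming (for simplicity) independent draws so that \zeta is estimated by the sample variance; for MCMC output the batch-means estimator would replace it. All numerical values are illustrative.

```r
R <- 500
g_r <- rnorm(R, mean = 0.05, sd = 0.4)  # illustrative values of g(Y^r)
alpha <- 0.1                            # illustrative small significance level
m <- 2                                  # rate of sample-size increase

delta_Q_tilde <- mean(g_r)              # Monte Carlo estimate of Delta Q
zeta_hat <- var(g_r)                    # batch-means estimate would be used for MCMC draws
LB <- delta_Q_tilde - sqrt(zeta_hat / R) * qnorm(1 - alpha)

if (LB > 0) {
  # accept Theta^(l) and keep the current sample size R
} else {
  R <- R + R / m                        # reject the update and enlarge the sample
}
```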

A.5 Specifying the stopping condition

The other challenge in implementing the MCEM algorithm is specifying a stopping criterion. We propose to stop the algorithm when the total number of sample-size increases exceeds a pre-specified value W; this bounds the number of Gibbs samples and the number of samples input to GPFDA, thereby ensuring that the MCEM algorithm returns results within a reasonable time.

Finally, post-processing the Gibbs samples \{\mathbf{Y}^{r}\}_{r=1,...,R} (discarding burn-in and thinning) before inputting them to GPFDA can drastically reduce the computational cost of GPFDA while preserving most of the information in the samples.
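As a small sketch (with hypothetical sizes), burn-in removal and thinning amount to subsetting the rows of the stored sample matrix before it is passed to GPFDA.

```r
R <- 10000; burn_in <- 2000; thin <- 10
Y_samples <- matrix(rnorm(R * 32), nrow = R)   # R draws of vectorized factor scores
keep <- seq(burn_in + 1, R, by = thin)
Y_thinned <- Y_samples[keep, ]                 # 800 retained draws for GPFDA
```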

A.6 Metrics used for assessing the performance of models

We assessed the performance of predicting gene expression using three metrics: the mean absolute error (\text{MAE}_{\mathbf{X}}), the mean width of the 95% predictive intervals (\text{MWI}_{\mathbf{X}}), and the proportion of genes within the 95% predictive intervals (\text{PWI}_{\mathbf{X}}):

\begin{split}\text{MAE}_{\mathbf{X}}&=\frac{\sum_{i=1}^{n}\sum_{j=q+1}^{q+u}\sum_{g=1}^{p}|x_{ijg}^{0.5}-x_{ijg}^{\text{true}}|}{nup},\\ \text{MWI}_{\mathbf{X}}&=\frac{\sum_{i=1}^{n}\sum_{j=q+1}^{q+u}\sum_{g=1}^{p}|x_{ijg}^{0.975}-x_{ijg}^{0.025}|}{nup},\\ \text{PWI}_{\mathbf{X}}&=\frac{\sum_{i=1}^{n}\sum_{j=q+1}^{q+u}\sum_{g=1}^{p}I(x_{ijg}^{0.025}<x_{ijg}^{\text{true}}<x_{ijg}^{0.975})}{nup},\end{split}

where n, p, q, u denote the number of subjects, biomarkers, training time points and test time points, respectively; x_{ijg}^{\text{true}} denotes the true value of x_{ijg}; x_{ijg}^{0.025}, x_{ijg}^{0.5}, x_{ijg}^{0.975} denote the 2.5%, 50% and 97.5% quantiles of the posterior samples, respectively; and I(x_{ijg}^{0.025}<x_{ijg}^{\text{true}}<x_{ijg}^{0.975})=1 if x_{ijg}^{\text{true}} lies within the predictive interval, and 0 otherwise.
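A minimal sketch of computing the three metrics from a posterior sample array is given below; the array layout (draws indexed by r, i, j, g) and all object names are hypothetical.

```r
R <- 200; n <- 5; u <- 2; p <- 10
draws  <- array(rnorm(R * n * u * p), c(R, n, u, p))  # posterior predictive draws
x_true <- array(rnorm(n * u * p), c(n, u, p))         # held-out truth

qs  <- apply(draws, c(2, 3, 4), quantile, probs = c(0.025, 0.5, 0.975))
MAE <- mean(abs(qs[2, , , ] - x_true))                # median vs truth
MWI <- mean(qs[3, , , ] - qs[1, , , ])                # predictive interval widths
PWI <- mean(x_true > qs[1, , , ] & x_true < qs[3, , , ])  # empirical coverage
```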

Similarly, we assessed the overall performance of estimating the factor trajectories using the mean absolute error \text{MAE}_{\mathbf{Y}},

\text{MAE}_{\mathbf{Y}}=\frac{\sum_{i=1}^{n}\sum_{j=1}^{q}\sum_{a=1}^{k}|y_{ija}^{0.5}-y_{ija}^{\text{true}}|}{nqk},

where n, k, q denote the number of subjects, factors and training time points, respectively; y_{ija}^{\text{true}} is the true value of y_{ija}, and y_{ija}^{0.5} is the median (50% quantile) of the posterior samples.

B. Implementation details of the proposed approach

B.1 Obtaining good initial values

To provide good initial values for MCEM, we implemented a two-step approach using available software. First, point estimates of the latent factor scores y_{ija} were obtained using BFRM. We centered the gene expression within each individual before inputting it into BFRM: the input data \mathbf{X}^{c} consisted of the centered values x_{ijg}^{c}=x_{ijg}-\frac{\sum_{j=1}^{q_{i}}x_{ijg}}{q_{i}}. We did so because BFRM assumes independent data, so its intercept is specific to a gene only and not to a subject; in other words, BFRM cannot estimate the subject-gene mean \mu_{ig}. Second, initial values of the GP parameters were obtained by inputting the resulting factor scores \mathbf{Y}^{0} to GPFDA, with either a DGP or IGP specification.
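A one-line sketch of this within-subject centering (with a hypothetical q_i x p expression matrix for one subject) is:

```r
q_i <- 8; p <- 100
x_i <- matrix(rnorm(q_i * p), q_i, p)          # one subject's expressions (rows = times)
x_i_centered <- sweep(x_i, 2, colMeans(x_i))   # subtract each gene's within-subject mean
```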

Note that, occasionally, the software BFRM may return an extremely poor initial value (i.e., one far from the truth), which may cause MCEM to converge to a local rather than a global optimum. To ensure that a good estimate of the DGP hyperparameters is identified, we suggest fitting the model with multiple BFRM initializations and comparing prediction performance for model selection.

B.2 Specification of hyperparameters and other tuning parameters

We specified hyperparameters and other tuning parameters as follows; these specifications were used for both the simulation study and the real-data application.

Specifically, we chose c_{1}=d_{1}=c_{2}=d_{2}=c_{3}=d_{3}=10^{-2} to obtain uninformative Inverse-Gamma prior distributions for the variance terms. We also set c_{0}=10^{-1}\cdot p and d_{0}=(1-10^{-1})\cdot p to obtain a sparsity-inducing prior on Z_{ga}, as the prior expectation of \pi_{a} under this specification is \mathbb{E}[\pi_{a}]=\frac{c_{0}}{c_{0}+d_{0}}=10^{-1}, implying that only 10% of genes are expected to be involved in each pathway. For the pre-specified values in the MCEM algorithm, we set the rate of sample-size increase to m=2 and the maximum number of sample-size increases to W=5.
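For reference, a sketch collecting these settings in one place (the names are illustrative, not the arguments of DGP4LCF):

```r
p <- 100                          # number of genes
settings <- list(
  c0 = 0.1 * p, d0 = 0.9 * p,     # Beta prior on pi_a, so E[pi_a] = 0.1
  c1 = 1e-2, d1 = 1e-2,           # Inverse-Gamma prior on rho_a^2
  c2 = 1e-2, d2 = 1e-2,           # Inverse-Gamma prior on sigma_g^2
  c3 = 1e-2, d3 = 1e-2,           # Inverse-Gamma prior on phi_g^2
  m = 2,                          # geometric rate of sample-size increase
  W = 5                           # maximum number of sample-size increases
)
```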

B.3 Summary of the Gibbs-after-MCEM algorithm

Input:
  • Observed gene expression \mathbf{X}_{i} and time points \mathbf{t}_{i}, i=1,...,n for n people.
  • Prediction time points \mathbf{t}^{\text{new}}.
  • \widehat{\boldsymbol{\Theta}}^{\text{MLE}}.
  • Number of parallel chains C.
Output:
  • Posterior samples of \boldsymbol{\Omega}, including the pathway expression \mathbf{Y}_{i}, i=1,...,n and the gene-pathway relationship \mathbf{L}.
  • Posterior samples of the predicted gene expression \mathbf{X}_{i}^{\text{new}}, i=1,...,n at \mathbf{t}^{\text{new}}.

Step 1: Initialization Step
  • Construct \widehat{\Sigma}_{\mathbf{Y}}^{\text{MLE}} using \widehat{\boldsymbol{\Theta}}^{\text{MLE}}, \mathbf{t}=\bigcup_{i=1}^{n}\mathbf{t}_{i} and \mathbf{t}^{\text{new}}.
  • Specify the Gibbs sample size R^{\text{final}}.

Step 2: Sample Generation Step
Run C chains in parallel and, within each chain, draw R^{\text{final}} samples of \boldsymbol{\Omega} using the Gibbs sampler f(\boldsymbol{\Omega}|\mathbf{X},\widehat{\Sigma}_{\mathbf{Y}}^{\text{MLE}}), then \mathbf{X}_{i}^{\text{new}} from f(\mathbf{X}_{i}^{\text{new}}|\boldsymbol{\Omega},\widehat{\Sigma}_{\mathbf{Y}}^{\text{MLE}}). Post-process samples by discarding burn-in and thinning, and denote the remaining sample size by R^{\text{final}}_{\text{remain}}.

Step 3: Sample Alignment Step
Align the post-processed samples \{\mathbf{L}^{r},\mathbf{Y}_{i}^{r}\}_{r=1,...,R^{\text{final}}_{\text{remain}}} within each chain, then across chains; both can be achieved using the R package “factor.switch”.

Supplementary Algorithm 2: Gibbs-after-MCEM Algorithm for Posterior Summary.

C. Simulation study

C.1 Additional results under the setting in the manuscript

This section presents additional results under the setting described in the manuscript, where the number of subjects n = 17, the number of biomarkers p = 100, the number of training time points q = 8, and the number of latent factors k = 4.

Supplementary Figure 2: An example of cross-correlation matrices at MCEM iterations 0 (initial value) to 6 using the DGP model: scenario CS, number of latent factors k correctly specified as 4. The true cross-correlation matrix is also shown.
Supplementary Figure 3: Comparison between true and estimated latent factor trajectories for factor 1 of person 1: all scenarios, number of latent factors k correctly specified as 4.
Supplementary Figure 4: Comparison between true and estimated factor loadings: scenario CL, number of latent factors k correctly specified as 4. Genes displayed in the heatmap of the truth (first column) are ordered following two rules: first, genes on factors with smaller indexes are ranked first; second, genes with larger absolute factor loadings are ranked first. Genes displayed in the heatmaps of estimates (second and third columns) follow the ordering of the ground truth to facilitate comparison.
Supplementary Figure 5: Comparison between true and estimated factor loadings: scenario US, number of latent factors k correctly specified as 4. Genes displayed in the heatmap of the truth (first column) are ordered following two rules: first, genes on factors with smaller indexes are ranked first; second, genes with larger absolute factor loadings are ranked first. Genes displayed in the heatmaps of estimates (second and third columns) follow the ordering of the ground truth to facilitate comparison.
Supplementary Figure 6: Comparison between true and estimated factor loadings: scenario UL, number of latent factors k correctly specified as 4. Genes displayed in the heatmap of the truth (first column) are ordered following two rules: first, genes on factors with smaller indexes are ranked first; second, genes with larger absolute factor loadings are ranked first. Genes displayed in the heatmaps of estimates (second and third columns) follow the ordering of the ground truth to facilitate comparison.
Supplementary Table 1: Maximum Rhat for each type of key variable: all scenarios, number of latent factors k correctly specified as 4. The number of MCMC iterations was increased to 100,000 whenever Rhat under 10,000 iterations suggested non-convergence; values in parentheses are the Rhats after the initial 10,000 iterations.

Scenario US
Variable  DGP (MCMC iterations = 10,000)  IGP (MCMC iterations = 10,000)
Predicted gene expression  1.01  1.01
Factor loadings  1.02  1.02
Latent factors  1.01  1.01

Scenario CS
Variable  DGP (MCMC iterations = 10,000)  IGP (MCMC iterations = 100,000)
Predicted gene expression  1.01  1.04 (1.02)
Factor loadings  1.01  1.09 (4.37)
Latent factors  1.01  1.04 (2.45)

Scenario UL
Variable  DGP (MCMC iterations = 100,000)  IGP (MCMC iterations = 100,000)
Predicted gene expression  1.04 (1.01)  1.04 (1.01)
Factor loadings  1.04 (1.48)  1.06 (1.48)
Latent factors  1.05 (1.40)  1.06 (1.40)

Scenario CL
Variable  DGP (MCMC iterations = 100,000)  IGP (MCMC iterations = 100,000)
Predicted gene expression  1.04 (1.01)  1.04 (1.01)
Factor loadings  1.02 (1.29)  1.06 (1.45)
Latent factors  1.04 (1.25)  1.06 (1.37)
Supplementary Table 2: Performance quantification for all scenarios: number of latent factors k correctly specified as 4 for both the DGP and IGP models. \text{MAE}_{\mathbf{Y}} is short for the mean absolute error in estimating the latent factors \mathbf{Y}.
Factor Generation Mechanism  \text{MAE}_{\mathbf{Y}} (DGP)  \text{MAE}_{\mathbf{Y}} (IGP)
Correlated, Large  0.15  0.16
Correlated, Small  0.23  0.41
Uncorrelated, Large  0.16  0.17
Uncorrelated, Small  0.23  0.23

C.2 Further simulation settings under the scenario CS

This section examines the performance of our model under additional simulation settings. For ease of illustration, these additional settings were all considered under scenario CS (true factors are correlated and have small variability).

C.2.1 Mis-specification of the number of latent factors k

We investigated how mis-specification of k impacts model performance using scenario CS with n = 17, p = 100, q = 8 as an example. We tried different numbers of latent factors k, ranging from 2 to 8.

Supplementary Table 3 shows that the correct specification k = 4 leads to the smallest \text{MAE}_{\mathbf{X}} and the narrowest \text{MWI}_{\mathbf{X}}.

Supplementary Table 3: Prediction performance under different numbers of factors k for scenario CS with p = 100 biomarkers. The true number of factors is k = 4 (row in bold). \text{MAE}_{\mathbf{X}} and \text{MWI}_{\mathbf{X}} are short for mean absolute error and mean width of the 95% predictive intervals, respectively.
k  \text{MAE}_{\mathbf{X}}  \text{MWI}_{\mathbf{X}}
2  0.54  2.75
3  0.55  2.82
\mathbf{4}  \mathbf{0.53}  \mathbf{2.66}
5  0.55  2.76
6  0.54  2.78
7  0.54  2.70
8  0.53  2.71

Another important output is the factor loading estimates. We found that our approach estimates the truth reasonably well regardless of the mis-specification. If k is specified to be larger than the truth, in practice the model is able to identify the redundant factors (i.e., factors without any non-zero loadings), while all true factors are recovered well. If k is specified to be smaller than the truth, our model identifies a subset of the true factors.

When we specified k = 5, 6, 7, 8 (i.e., larger than the true k = 4), the additional 1, 2, 3 or 4 redundant factors were clearly identified. Take k = 5 as an example: in the factor loading estimates displayed in Supplementary Figure 7, there is one factor with no non-zero loadings, and the remaining factors are close to the truth. Results under k = 6, 7, 8 are similar, as displayed in Supplementary Figures 8, 9 and 10, respectively. These observations indicate that our model is able to identify all redundant factors via the estimated factor loadings.

Supplementary Figure 7: Comparison between true and estimated factor loadings using the DGP model: scenario CS, with the number of latent factors k mis-specified as 5. The second column displays estimates after removal of the redundant factor and alignment with the truth. The third column displays raw estimates without alignment with the truth. Alignment is achieved by a signed permutation to facilitate comparison, as explained in Papastamoulis and Ntzoufras (2022).
Supplementary Figure 8: Comparison between true and estimated factor loadings using the DGP model: scenario CS, with the number of latent factors k mis-specified as 6. The second column displays estimates after removal of the redundant factors and alignment with the truth. The third column displays raw estimates without alignment with the truth. Alignment is achieved by a signed permutation to facilitate comparison, as explained in Papastamoulis and Ntzoufras (2022).
Supplementary Figure 9: Comparison between true and estimated factor loadings using the DGP model: scenario CS, with the number of latent factors k mis-specified as 7. The second column displays estimates after removal of the redundant factors and alignment with the truth. The third column displays raw estimates without alignment with the truth. Alignment is achieved by a signed permutation to facilitate comparison, as explained in Papastamoulis and Ntzoufras (2022).
Supplementary Figure 10: Comparison between true and estimated factor loadings using the DGP model: scenario CS, with the number of latent factors k mis-specified as 8. The second column displays estimates after removal of the redundant factors and alignment with the truth. The third column displays raw estimates without alignment with the truth. Alignment is achieved by a signed permutation to facilitate comparison, as explained in Papastamoulis and Ntzoufras (2022).

On the other hand, when k is smaller than the truth, Supplementary Figure 11 shows that the genes loaded on the ‘missing factor’ (true factor 2) are instead loaded on other factors (true factor 1), while the remaining factors approximately match the true loadings.

Supplementary Figure 11: Comparison between true and estimated factor loadings using the DGP model: scenario CS, with the number of latent factors k mis-specified as 3. The second column displays estimates after alignment with the truth. The third column displays raw estimates without alignment with the truth. Alignment is achieved by a signed permutation to facilitate comparison, as explained in Papastamoulis and Ntzoufras (2022).

C.2.2 Impact of the number of biomarkers p

We investigated the performance of our approach with a larger number of genes, p = 12,000. The other settings were: sample size n = 17 and number of training time points q = 12.

The true factor loadings are displayed in Supplementary Figure 12, and the estimates under the correct specification k = 4 and the incorrect specification k = 6 are displayed in Supplementary Figures 13 and 14, respectively. Our approach is always able to recover the true factor loadings reasonably well; when the specified number is unnecessarily large (k = 6), our model identifies the redundant factors (i.e., the 2 columns with all loadings equal to zero).

Note that the display of results under p = 12,000 differs from that under p = 100, as the previous format would render the text too small to read. We therefore changed the display format: for the truth, we display only the non-zero loadings for each factor; the estimated results are then displayed following the gene ordering of the ground truth. Redundant factors under the mis-specification k = 6 are not displayed, as the corresponding columns have no non-zero loadings.

Supplementary Figure 12: True factor loadings: scenario CS with p = 12,000 biomarkers and the true number of latent factors k = 4. Only non-zero loadings are displayed for each factor.
Supplementary Figure 13: Estimated factor loadings using the DGP model: scenario CS with p = 12,000 biomarkers and the number of latent factors correctly specified as k = 4. Estimates are displayed following the gene ordering of the ground truth in Supplementary Figure 12.
Supplementary Figure 14: Estimated factor loadings using the DGP model: scenario CS with p = 12,000 biomarkers and the number of latent factors mis-specified as k = 6. Estimates are displayed following the gene ordering of the ground truth in Supplementary Figure 12.

C.2.3 Impact of the sample size n

We investigated the impact of the number of subjects n on the results. We fixed p = 100, q = 8 and k = 4, while varying n = 10, 17 (the size in the real data), 20, 40.

We find that the performance of our model improves as the sample size n increases. Supplementary Table 4 shows the comparison for biomarker prediction.

Supplementary Table 4: Prediction performance under different numbers of subjects n for scenario CS with p = 100 biomarkers. The true number of factors is k = 4. \text{MAE}_{\mathbf{X}} and \text{MWI}_{\mathbf{X}} are short for mean absolute error and mean width of the 95% predictive intervals, respectively.
n  \text{MAE}_{\mathbf{X}}  \text{MWI}_{\mathbf{X}}
10  0.54  2.83
17  0.53  2.66
20  0.50  2.61
40  0.49  2.46

C.2.4 Impact of the number of time points q

We investigated the impact of the number of time points q on the results. We fixed p = 100, n = 17 and k = 4, while varying q = 4, 8, 16, 25. We find that the performance of our model improves as q increases. Supplementary Table 5 shows the comparison for biomarker prediction.

Supplementary Table 5: Prediction performance under different numbers of time points q for scenario CS with p = 100 biomarkers. The true number of factors is k = 4. \text{MAE}_{\mathbf{X}} and \text{MWI}_{\mathbf{X}} are short for mean absolute error and mean width of the 95% predictive intervals, respectively.
q  \text{MAE}_{\mathbf{X}}  \text{MWI}_{\mathbf{X}}
4  0.59  2.89
8  0.53  2.66
16  0.50  2.45
25  0.45  2.21

C.2.5 Impact of the proportion of overlap between factors

We investigated the impact of greater overlap across factors on model performance. In the literature, an observed variable that loads significantly on more than one latent factor is said to exhibit ‘cross-loading’ (Ximénez et al., 2022; Li et al., 2020). The simulation study in the manuscript demonstrated good performance in recovering the factor loadings when the proportion of genes with cross-loadings was 21% (8 out of the 38 significant genes are loaded on more than one factor).

This additional simulation assesses the performance of our model under a larger cross-loading proportion of 60% (16 out of the 26 significant genes are loaded on more than one factor).

Our model continued to perform well at estimating the factor loadings (Supplementary Figure 15), though a longer chain (and hence a higher computational cost) was required due to the more complex factor structure. In the original setting, 100,000 iterations sufficed, whereas this number was not enough for the new setting with greater overlap (results displayed in the third column of Supplementary Figure 15); a longer chain of 1,000,000 iterations was required (results displayed in the second column of Supplementary Figure 15).

Supplementary Figure 15: Comparison between true and estimated factor loadings using the DGP model: scenario CS with p = 100 biomarkers and the proportion of genes with cross-loadings equal to 60%. The second column displays estimates after 1,000,000 iterations. The third column displays estimates after 100,000 iterations.

It is worth highlighting that, for each gene loaded on more than one factor, our method can always identify the factor with the largest loading. Take gene 98 as an example: its true loadings are non-zero on all factors, with the largest loading on factor 3. Although our model incorrectly estimates a zero loading on factor 4, it correctly estimates a non-zero loading on factor 3.

D. Real Data

D.1 Prediction performance under different numbers of factors k

Supplementary Table 6: Prediction performance under different numbers of factors k in the H3N2 data application. \text{MAE}_{\mathbf{X}} and \text{MWI}_{\mathbf{X}} are short for mean absolute error and mean width of the 95% predictive intervals, respectively.
Specification of k  \text{MAE}_{\mathbf{X}}  \text{MWI}_{\mathbf{X}}
3  0.215  1.168
4  0.213  1.143
\mathbf{5}  \mathbf{0.212}  \mathbf{1.118}

D.2 Comparison between our model and Chen et al., (2011)’s model

To facilitate comparison between our factor 1 and the ‘principal factor’ of Chen et al. (2011), Supplementary Figure 16 displays the factor trajectories of all subjects in the same format as Figure 4 of Chen et al. (2011). For symptomatic people, both factors display an increase after inoculation (time 0), followed by a decrease to a level higher than before inoculation; for asymptomatic people, both factors change little in shape.

Supplementary Figure 16: Estimated trajectories for factor 1 in the H3N2 data. Each panel displays a single subject’s trajectory, following the same format as Figure 4 in Chen et al. (2011). The horizontal axes correspond to time in hours. This figure appears in color in the electronic version of this article.

In addition to the similarity in shape, the genes significantly loaded on factor 1 largely coincide with those loaded on the principal factor of Chen et al. (2011). Chen et al. (2011) list their top 50 genes sorted by absolute loading value (from most important to least important); we present this list alongside the full list of our top 50 genes in Supplementary Table 7 for comparison. 33 of the 50 genes are shared between the two lists.

Supplementary Table 7: Top 50 genes sorted according to absolute factor loadings. A number within parentheses in the right column gives the rank of that gene in the left column, and “-” denotes that the gene is not in the left column. Note that names beginning with “M97935” in Chen et al. (2011) are actually control sequences rather than genes.
Row Index Principal Factor by Chen et al., (2011) Factor 1 by Our Approach
1 RSAD2 LAMP3 (13)
2 IFI44L RSAD2 (1)
3 IFIT1 IFI44L (2)
4 IFI44 SERPING1 (10)
5 HERC5 SPATS2L (-)
6 OAS3 SIGLEC1 (22)
7 MX1 ISG15 (8)
8 ISG15 IFIT1 (3)
9 IFIT3 IFI44 (4)
10 SERPING1 RTP4 (35)
11 IFIT2 OAS3 (6)
12 OASL IFI6 (17)
13 LAMP3 CCL2 (-)
14 IFI27 IDO1 (-)
15 OAS1 HERC5 (5)
16 OAS2 MS4A4A (-)
17 IFI6 IFIT3 (9)
18 IFIT5 OAS1 (15)
19 IFITM3 OAS2 (16)
20 XAF1 LY6E (24)
21 DDX58 OASL (12)
22 SIGLEC1 ATF3 (-)
23 DDX60 CXCL10 (-)
24 LY6E CCL8 (-)
25 GBP1 XAF1 (20)
26 IFIH1 IFI27 (14)
27 LOC26010 SAMD4A (-)
28 ZCCHC2 MX1 (7)
29 EIF2AK2 LGALS3BP (-)
30 LAP3 C1QB (-)
31 IFI35 IFITM3 (19)
32 IRF7 LAP3 (30)
33 PLSCR1 IRF7 (32)
34 M97935_MA_at ZBP1 (42)
35 RTP4 HERC6 (37)
36 M97935_MB_at TFEC (-)
37 HERC6 IFI35 (31)
38 TNFAIP6 MT2A (-)
39 PARP12 SCO2 (41)
40 M97935_5_at DDX58 (21)
41 SCO2 IFIH1 (26)
42 ZBP1 IFIT2 (11)
43 STAT1 DHX58 (-)
44 UBE2L6 TMEM255A (-)
45 MX2 TNFAIP6 (38)
46 TOR1B VAMP5 (-)
47 M97935_3_at PARP12 (39)
48 TNFSF10 GBP1 (25)
49 TRIM22 TIMM10 (-)
50 APOL6 C1QA (-)

Despite the similarity between factor 1 estimated by our BSFA-DGP and the principal factor estimated by Chen et al. (2011), as discussed above, the trajectory of factor 1 is notably more individualized and informative than the principal factor. This can be seen by observing that, for symptomatic people, the shapes of the principal factor are exactly the same once the factor starts changing (Figure 4 in Chen et al. (2011)), whereas the shapes of factor 1 still vary locally across subjects, as can be seen in Supplementary Figure 16.

The difference in the recovered factor trajectories is caused by the different assumptions underlying the two models. Chen et al. (2011) assume that the dynamic trajectory of any factor is common to all individuals; different factor values are observed at the same time point for different people only because of individual-dependent biological time shifts and random noise. Symptomatic and asymptomatic individuals are therefore distinguished on the basis of the time shift (Figure 4 in Chen et al. (2011)). In contrast, we assume that the trajectory of each factor comes from a distribution (i.e., there is no common curve for every individual), thereby allowing different curve shapes for different people. We make direct use of this diversity of curve shapes to distinguish patients, without introducing a “time shift” quantity.

An additional advantage of our model over that of Chen et al. (2011) is its ability to detect potential outliers, owing to the subject-specific trajectories. In contrast, Chen et al. (2011) assume a universal factor trajectory, with different subjects’ trajectories obtained only by shifting this reference curve along the time axis by a subject-specific biological time shift. Under this assumption, extreme values of factor expression (outliers) would be hard to identify, so additional discoveries may be missed.

D.3 Comparison between BSFA-DGP and BSFA-IGP

Supplementary Figure 17: Comparison between DGP and IGP on the estimated factor trajectories in the real-data application.

D.4 Biological Interpretation

We interpret factor 1 as an ‘innate immune response’ pathway, because all of the pathways significantly enriched for its genes are related to the immune response (Supplementary Figure 18), and we interpret factor 3 as the ribosome pathway, because that is the only significantly enriched pathway identified (Supplementary Figure 19).

Supplementary Figure 18: Results for genes loaded on factor 1, under the KEGG Pathway analysis of the online bioinformatics platform DAVID.
Supplementary Figure 19: Results for genes loaded on factor 3, under the KEGG Pathway analysis of the online bioinformatics platform DAVID.