
Computational Approaches for Exponential-Family Factor Analysis

Liang Wang
Department of Mathematics and Statistics, Boston University
and
Luis Carvalho
Department of Mathematics and Statistics, Boston University
Abstract

We study a general factor analysis framework where the $n$-by-$p$ data matrix is assumed to follow an exponential family distribution entry-wise. While this model framework has been proposed before, we further relax its distributional assumption by using a quasi-likelihood setup. By parameterizing the mean-variance relationship of the data entries, we additionally introduce a dispersion parameter and entry-wise weights to model large variations and missing values. The resulting model is thus not only robust to distribution misspecification but also more flexible, capturing mean-dependent covariance structures of the data matrix. Our main focus is on efficient computational approaches to perform the factor analysis. Previous modeling frameworks rely on simulated maximum likelihood (SML) to find the factorization solution, but this method was shown to lead to asymptotic bias when the simulated sample size grows slower than the square root of the sample size $n$, precluding its practical application for data matrices with large $n$. Borrowing from expectation-maximization (EM) and stochastic gradient descent (SGD), we investigate three estimation procedures based on iterative factorization updates. Our proposed solutions show no asymptotic bias and scale even better for large matrix factorizations, with error $O(1/p)$. To support our findings, we conduct simulation experiments and discuss applications in three case studies.


Keywords: matrix factorization, exponential family, factor model.

1.  Introduction

Over the past decades, factor analysis has gained tremendous attention in psychology (Ford et al., 1986), computer science (Prince et al., 2008), finance (Fama and French, 2015), and biological research (Xu et al., 2021). In particular, when the data $X\in\mathbb{R}^{n\times p}$ is high dimensional ($n\ll p$), effectively modeling and estimating the covariance structure has been problematic (Basilevsky, 1994). The factor model provides an effective approach to model high dimensional data in which the covariance of the observations is assumed to lie on a lower dimensional manifold.

Despite its popularity in modeling high dimensional data, factor models have several limitations. First and foremost, both data and latent variables are assumed to follow a Gaussian distribution, which is not ideal for modeling binary, count, or other non-constant variance data. To address this first limitation, there exist prior works that extend the factor model with a more general exponential family assumption (Wedel and Kamakura, 2001; Wedel et al., 2003). However, even with improved assumptions away from Gaussianity, exponential family distributions are often too restrictive for real world, overly dispersed data. Moreover, as we show later in Section 1.1.3, the maximum likelihood estimation algorithm proposed for such an extended model suffers from both numerical and asymptotic convergence issues. As a minor issue, the latent factors are only identifiable up to a rotational transformation, potentially causing problems in interpreting the latent factors. Lastly, both these extended works and the traditional factor analysis framework lack the flexibility to model missing data, preventing several interesting applications such as matrix completion.

This paper thus aims at generalizing the existing works by:

  • Assuming only a mean-variance relationship along with column-wise dispersion parameters to model data covariance;

  • Providing interpretability for latent factors via orthogonal identifiability constraints;

  • Proposing fast, accurate, and robust optimization algorithms leveraging modern advances in stochastic optimization;

  • Facilitating application with an efficient package implementation that allows for entry-wise factor weights and covariance modeling.

To introduce appropriate notations and to understand some of the relevant attempts to address those issues, we elaborate below on the limitations of factor models along with some existing remedies proposed in the literature that motivated our generalization.

1.1.  The Factor Model and Its Limitations

For given data $X\in\mathbb{R}^{n\times p}$, the traditional rank $q$ factor model assumes that $q\ll p$ and that the data is generated by latent factors $\Lambda\in\mathbb{R}^{n\times q}$, with $\Lambda^{\top}=[\Lambda_{1}\cdots\Lambda_{n}]$, and a deterministic projection matrix $V\in\mathbb{R}^{p\times q}$, the loading matrix. We implicitly assume the following data generating process for each observation $i=1,\ldots,n$:

\Lambda_{i}\stackrel{\text{iid}}{\sim}N(0,I_{q}),\qquad X_{i}\,|\,\Lambda_{i}\stackrel{\text{ind}}{\sim}N(V\Lambda_{i},\Phi) \qquad (1)

where $\Phi$ is a $p$-th order symmetric positive definite matrix, a covariance matrix allowing for potentially heterogeneous noise.

Marginally, $X_{i}\stackrel{\text{ind}}{\sim}N(0,\Phi+VV^{\top})$, so maximum likelihood estimation of $V$ and $\Phi$ is equivalent to covariance estimation. It is common to assume that $\Phi$ is diagonal so as to not confound the effect of the loadings $V$ (Bartholomew et al., 2011), and so we adopt the same assumption from now on. In this case, the MLE for $V$ can be obtained in closed form using matrix calculus, based on the eigen-decomposition of the data covariance. Alternatively, $V$ and $\Phi$ can be estimated via expectation-maximization, especially if some of the entries in the data matrix $X$ are missing.

The model in general has interesting connections to matrix factorization. For example, probabilistic PCA (Tipping and Bishop, 1998) can be considered equivalent to the factor model, with the only difference that the factor model permits a heterogeneous noise structure through the specification of $\Phi$. Under $\Phi=\sigma^{2}I_{p}$, Anderson (1963) established the connection between these two models by demonstrating that the stationary point solution of the factor model likelihood spans the columns of the sample covariance eigenvectors. Drawing further on the analogy between probabilistic PCA and PCA, the factor model can be considered the random counterpart of matrix factorization, obtained by allowing the factorized components (or latent local factors) $\Lambda$ to be random.
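For concreteness, a minimal sketch (in Python) of this closed-form solution in the isotropic case $\Phi=\sigma^{2}I_{p}$, i.e. the probabilistic PCA MLE, is given below; the function name and interface are illustrative only, and the general diagonal-$\Phi$ factor model would instead be fit iteratively (e.g. via EM).

import numpy as np

def ppca_mle(X, q):
    # Closed-form MLE of the loadings under Phi = sigma^2 I_p (probabilistic PCA).
    # Sketch only: eigen-decompose the sample covariance and keep the top q directions.
    S = np.cov(X, rowvar=False)                       # p x p sample covariance
    evals, evecs = np.linalg.eigh(S)                  # ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]        # sort descending
    sigma2 = evals[q:].mean()                         # noise variance from discarded spectrum
    V = evecs[:, :q] * np.sqrt(np.maximum(evals[:q] - sigma2, 0.0))
    return V, sigma2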

While finding a wide range of applications, the factor model and its deterministic counterpart (matrix factorization) have, however, some limitations. We discuss them below, including a brief summary on some recent improvements, along with our proposed solutions to further generalize the factor model.

1.1.1.  Relaxing the restrictive distributional assumption

In a factor model setup, both the latent variable $\Lambda_{i}$ and the data are assumed to (conditionally) follow a Gaussian distribution, yielding a marginal Gaussian distribution for the data. Both assumptions require careful examination when dealing with real data.

For the data distribution, assuming simply a Gaussian data likelihood overlooks many interesting structures in the data. For example, network adjacency matrices take only binary values of 0 and 1, while computer images take integer values for pixel intensities. Both types of data have been shown to be better modeled with discrete distributions from the exponential family (Wang and Carvalho, 2023). One obvious relaxation is thus to extend the data likelihood assumption from Gaussian to exponential families, or, to accommodate more robust specifications, to specify mean and covariance structures, as in quasi-likelihood approaches. Moreover, existing exponential family generalizations do not consider flexible covariance modeling of the data matrix, which has been shown to be one of the most important applications of the Gaussian factor model (Fan et al., 2008). Ideally, at least a column-wise idiosyncratic error structure should be modeled for a flexible treatment of the high-dimensional data covariance.

The latent variable assumption is usually considered less restrictive than the likelihood assumption, as evidenced by similar Gaussian latent structures in hierarchical statistical models (e.g., the random effects model (Borenstein et al., 2010) and the state space model (Carter and Kohn, 1996)). This assumption is, however, frequently studied together with factor identifiability (Shapiro, 1985) to ensure unique latent representations of the data. Specifically, the factor model (1) is not identifiable (or unique) since for any orthogonal matrix $T\in O(q)$, $\Lambda^{*}_{i}\doteq T\Lambda_{i}\stackrel{\text{iid}}{\sim}N(0,I_{q})$ and, with $V^{*}=VT^{\top}$, $X_{i}\,|\,\Lambda^{*}_{i}\stackrel{\text{ind}}{\sim}N(V^{*}\Lambda^{*}_{i},\Phi)$ specify the same model since $V\Lambda_{i}=VT^{\top}T\Lambda_{i}=V^{*}\Lambda^{*}_{i}$; that is, $\Lambda^{*}=\Lambda T^{\top}$ and $V^{*}$ are not identifiable from $\Lambda$ and $V$. For this reason it is common in factor analysis to rotate factors after fitting the model to achieve better sparsity and/or interpretability, e.g. with varimax rotation (Kaiser, 1958). However, it is advantageous to address these identifiability issues from the outset to reduce the space of potential solutions and speed up estimation procedures. In this respect, it is helpful to borrow from matrix factorization research. For example, adding factorization constraints such as sparsity (Gribonval and Schnass, 2010), positivity (Lee and Choi, 1999) and orthogonality (Li et al., 2010) has been shown to provide more representative and unique latent factors. The stochastic counterpart of these factorization constraints is closely related to an evolving line of research on data manifolds (Ma and Fu, 2012).

1.1.2.  Allowing entry-wise weight and link transformation

Another potential improvement that has remained absent from factor analysis research is the specification of entry-wise likelihood weights and non-linear transformations. In matrix factorization, allowing entry-wise factorization weights and the flexibility of transforming the original data has been shown to be valuable in producing more representative factorized results. For example, in the field of natural language processing, Global Vectors for Word Representation (GloVe; Pennington et al., 2014) achieved great success in obtaining word embeddings. The method essentially applies a log transformation to the word co-occurrence matrix with heuristic entry-wise weights to avoid over- and under-weighting rare and common word co-occurrences. In the field of computer vision (Kalayeh et al., 2014), weighting matrices related to classification class frequencies are introduced to alleviate issues with class imbalance; this residual boosting weight matrix provides latent factors that are more suitable for downstream classification. In the field of matrix completion (Davenport et al., 2014), specifying zero factorization weights eliminates missing entries from the factorization, which in turn allows the latent structure to impute them.

Perhaps due to the computational complexity associated with these enhancements, such flexibility has not been transferred from matrix factorization to factor analysis. Link transformations do appear in the literature, e.g. (Reimann et al., 2002), but are mostly applied as an ad-hoc pre-processing step. In practice, it is clear that entry-wise factor weights and link transformations could greatly improve the flexibility of the factor modeling framework. Nevertheless, a unified factor modeling framework that enables such flexibility is still missing from the literature.

1.1.3.  Improving on efficient optimization

Lastly, as we seek to improve on the traditional Gaussian factor model, it is natural to consider practical computational concerns: can we scale fitting the improved model to modern large datasets?

While the marginal likelihood under model (1) is available in closed form, deriving the marginal likelihood under a non-Gaussian data assumption is typically difficult, and recent research has resorted to simulated maximum likelihood (SML; Wedel and Kamakura, 2001), Markov chain Monte Carlo (MCMC), or variational inference (Gopalan et al., 2015; Ruan and Di, 2024). However, these methods have their own difficulties. Variational inference is based on an approximation to the target marginal distribution and usually relies on oversimplified representations for computational gains at the cost of poor representativity. As for MCMC, due to the identifiability issue introduced earlier, the marginal likelihood is constant along high dimensional quotient spaces on $V$ imposed by equivalence under orthogonal operations (rotations). These equivalent spaces cause challenges for both the MCMC sampling and the assessment of convergence. Lastly, although theoretically attractive, MCMC is computationally expensive since it usually requires long running times to achieve convergence up to a desired precision when compared to other approaches such as Laplacian approximations (Rue et al., 2009).

As the optimization method originally proposed with the initial exponential family factor generalization (Wedel and Kamakura, 2001; Wu and Zhang, 2003), the SML approach is considered one of the most common estimation methods. Specifically, the maximum likelihood estimator is obtained by maximizing the following simulated likelihood based on $S$ Monte Carlo samples,

L(V;X)=\prod_{i=1}^{n}f_{V}(X_{i})\approx\prod_{i=1}^{n}\frac{1}{S}\sum_{s=1}^{S}f_{V}\big(X_{i}\,|\,\Lambda_{i}^{(s)}\big)\doteq L_{\text{MC}}(V;X),

where $\Lambda_{i}^{(s)}\stackrel{\text{iid}}{\sim}N(0,I_{q})$. To obtain the maximizer of $\log L_{\text{MC}}(V;X)$, the gradient is needed (up to a constant):

\nabla_{V}\sum_{i=1}^{n}\log\sum_{s=1}^{S}f_{V}(X_{i}\,|\,\Lambda_{i}^{(s)})=\sum_{i=1}^{n}\frac{\sum_{s=1}^{S}\nabla_{V}f_{V}(X_{i}\,|\,\Lambda_{i}^{(s)})}{\sum_{s=1}^{S}f_{V}(X_{i}\,|\,\Lambda_{i}^{(s)})}. \qquad (2)

Although $\nabla_{V}f_{V}(X_{i}|\Lambda_{i}^{(s)})$ is readily available in closed form, optimization using (2) has both numerical and theoretical issues. On the numerical side, the likelihood evaluations $f_{V}(X_{i}|\Lambda_{i}^{(s)})$ in the denominator need to be performed in log space to avoid underflows and usually require good starting points for $V$, which are particularly challenging to find when the data dimension $p$ is large.

From a theoretical perspective, the likelihood and its gradient evaluation depend heavily on the asymptotic behavior of the MC sample size $S$. It has been shown in (Lee, 1995) that the estimator is asymptotically biased if the MC sample size $S$ does not grow faster than the square root of the data sample size, $\sqrt{n}$. In fact, we verified in numerical studies that the gradient estimation can require an even larger MC sample size $S\gg\sqrt{n}$ to stabilize $\nabla_{V}f_{\theta}(X_{i}|\Lambda_{i}^{(s)})$ in (2). Consequently, when optimizing the likelihood via SML, there is a trade-off between computational efficiency and estimation bias. For modern applications with large data dimensions, the sample size $S$ required to control the MC variance can be quite large, preventing the practical application of such methods.

1.1.4.  Real world applications

Although PCA has traditionally been used in many real world applications (Li et al., 2024), over the past decades the generalization of deterministic PCA factorization to the exponential family has enabled a series of benchmark models across different fields. For example, non-negative matrix factorization (NMF; Lee and Choi, 1999) in computer vision generalized the data distributional assumption to Poisson; the non-Gaussian state space model (Kitagawa, 1987) in time series generalized the data distributional assumption to non-Gaussian distributions using non-parametric estimation; and the Skip-gram model (Levy and Goldberg, 2014; Mikolov et al., 2013) in natural language processing generalized the data assumption to multinomial. Perhaps most relevant to statistical factor model research, (Wedel and Kamakura, 2001) and (Wu and Zhang, 2003) generalized the data likelihood to exponential family distributions while allowing the random specification of a latent factor $\Lambda$.

Perhaps due to the infeasibility of the SML estimation described in the previous subsection, the lack of a practical estimation method has limited the application of the random factorization to only Bernoulli factor models with an identity link function, i.e., the random dot product model (RDPM; Hoff et al., 2002; Young and Scheinerman, 2007). Despite its restrictive identity link assumption, the RDPM has gained popularity on the strength of its empirical performance in network analysis. Beyond addressing the SML estimation problem, we also note that broader empirical evidence for such a generalized model has been missing from the literature.

1.2.  Organization of the paper

The paper is organized as follows: in Section 2 we introduce a more general exponential factor model that addresses the shortcomings listed in 1.1.1 and 1.1.2; next, in Section 3, we discuss our main contributions—a collection of efficient and robust optimization strategies for inference, tackling the points in 1.1.3; Section 4 demonstrates the effectiveness of our factorization with simulated examples and applications on benchmark data from various fields; finally, Section 5 concludes with a summary of the innovations and directions for future work.

2.  Exponential Factor Models

2.1.  Guaranteeing factor identifiability

We start by addressing the model issues raised in Section 1. Given our concern with computational efficiency, our first issue is non-identifiability; as discussed in Section 1.1.1, we need to constrain the factors to avoid lack of identifiability due to rotations. From now on we adopt the following standardization of the factors:

  (i) $\Lambda_{i}\stackrel{\text{iid}}{\sim}N(0,I_{q})$ for $i\in[n]$ as usual, with the distribution of the rows of $\Lambda$ being invariant to orthogonal transformations;

  (ii) $V$ has scaled pairwise orthogonal columns, that is, $V=UD$ with $U\in\mathcal{S}_{p,q}(\mathbb{R})$, an orthonormal $q$-frame in the Stiefel manifold of $p\times q$ matrices, and $D=\text{Diag}_{j\in[q]}\{d_{j}\}$ with $d_{1}\geq\cdots\geq d_{q}>0$, so that $V^{\top}V=D^{2}$. We denote this space for $V$ as $\widetilde{\mathcal{S}}_{p,q}(\mathbb{R})$.

This setup makes the factorization model identifiable since, for any arbitrary $T\in O(q)$, $V^{*}=VT^{\top}$ can only belong to $\widetilde{\mathcal{S}}_{p,q}(\mathbb{R})$ if $T^{\top}$ commutes with a diagonal matrix, that is, if $T\in O(1)^{q}$, and so $V$ is unique (up to column sign changes, as in the SVD). In practice, given any pair of factors $\widehat{\Lambda}$ and $\widehat{V}$ we just need to compute the singular value decomposition $\widehat{\Lambda}\widehat{V}^{\top}=\Lambda DU^{\top}$ to identify $\Lambda$ and $V=UD$.
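In code, this identification step can be sketched as follows (a minimal Python illustration; the convention of scaling $\Lambda$ to unit column variance when splitting the singular values between $\Lambda$ and $D$ is our assumption).

import numpy as np

def identify_factors(Lambda_hat, V_hat):
    # Map an arbitrary factor pair to the identified form of Section 2.1
    # via the SVD of Lambda_hat @ V_hat.T.  Sketch only.
    n, q = Lambda_hat.shape
    A, s, Bt = np.linalg.svd(Lambda_hat @ V_hat.T, full_matrices=False)
    A, s, B = A[:, :q], s[:q], Bt[:q].T
    Lambda_id = np.sqrt(n) * A            # identified factors, rows approx. N(0, I_q)
    D = s / np.sqrt(n)                    # ordered scales d_1 >= ... >= d_q > 0
    V_id = B * D                          # identified loadings V = U D
    return Lambda_id, V_id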

2.2.  Generalizing the normal likelihood

Next, we relax the convenient but often unrealistic Gaussian assumptions in the likelihood and settle for a more general mean and variance specification in the spirit of quasi-likelihood (Wedderburn, 1974). We assume that $X_{i}\,|\,\Lambda_{i}\sim F(V\Lambda_{i},\Phi_{i})$ where $F$ belongs to the exponential family with link function $g$ and variance function $\mathcal{V}$, that is,

{\mathbb{E}}(X_{i}|\Lambda_{i})\doteq\mu_{i}=g^{-1}(\eta_{i}),\text{ with }\eta_{i}=V\Lambda_{i}+\eta_{0},\quad\text{and} \qquad (3)
\text{Var}(X_{i}|\Lambda_{i})=\Phi_{i}\mathbb{V}(\mu_{i}),

where $\mathbb{V}(\mu_{i})=\text{Diag}\{\mathcal{V}(\mu_{i})\}$ is the diagonal variance and $\eta_{0}\in\mathbb{R}^{p}$ is the latent center of the factor model. This way, we can more naturally represent data $X$ belonging to fields other than the real numbers; common cases are binary data with $F$ being Bernoulli or binomial (with weights) and count data with $F$ being Poisson or negative binomial. In particular, the negative binomial distribution offers enhancements over the Poisson distribution by effectively accommodating the over-dispersion often observed in count data; see, e.g., (Xia, 2020) for a detailed treatment of the negative binomial distribution as a compound Poisson type. To accommodate entry-wise weights, as motivated in Section 1.1.2, we set $\Phi_{i}=\Phi W_{i}^{-1}$ where $W_{i}=\text{Diag}_{j=1,\ldots,p}\{w_{ij}\}$ are known weights, that is, $\Phi_{i}=\text{Diag}_{j=1,\ldots,p}\{\phi_{j}/w_{ij}\}$. This setup implies ${\mathbb{E}}(X_{ij}|\Lambda_{i})=\mu_{ij}$ and $\text{Var}(X_{ij}|\Lambda_{i})=\phi_{j}\mathcal{V}(\mu_{ij})/w_{ij}$. From these two moment conditions, we can adopt the extended quasi-likelihood (Nelder and Pregibon, 1987) to define:

\log f_{V,\eta_{0},\Phi}(X_{i}|\Lambda_{i})=-\sum_{j=1}^{p}\frac{w_{ij}}{\phi_{j}}\int_{\mu_{ij}}^{X_{ij}}\frac{X_{ij}-t}{\mathcal{V}(t)}dt-\frac{1}{2}\log\bigg(2\pi\frac{\phi_{j}\mathcal{V}(X_{ij})}{w_{ij}}\bigg). \qquad (4)

We call this the exponential factor model (EFM).
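As a concrete illustration, a minimal sketch of the entry-wise extended quasi-log-likelihood (4) for a Poisson-type variance function $\mathcal{V}(\mu)=\mu$, where the integral has the closed form $X\log(X/\mu)-(X-\mu)$, might look as follows; the function name, the clipping of $\mathcal{V}(X)$ at one for zero counts, and the array conventions are our assumptions.

import numpy as np

def eql_poisson(X, mu, phi, w):
    # Extended quasi-log-likelihood of Eq (4) for V(mu) = mu (Poisson type).
    # X, mu, w: n x p arrays; phi: length-p dispersion vector.  Sketch only.
    with np.errstate(divide="ignore", invalid="ignore"):
        xlogx = np.where(X > 0, X * np.log(X / mu), 0.0)
    dev = xlogx - (X - mu)                              # int_mu^X (X - t)/t dt
    norm = 0.5 * np.log(2 * np.pi * phi * np.maximum(X, 1.0) / w)
    return -np.sum(w / phi * dev + norm)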

The maximum likelihood estimate of $\theta=(V,\eta_{0},\Phi)$ then requires access to the marginal density for each observation $i\in[n]$,

\log f_{\theta}(X_{i})=\log\int_{\Lambda_{i}}f_{\theta}(X_{i}|\Lambda_{i})f(\Lambda_{i})d\Lambda_{i} \qquad (5)

where $f$ is the density of the standard multivariate normal distribution of order $q$:

\widehat{\theta}=\operatornamewithlimits{argmax}_{V\in\widetilde{\mathcal{S}}_{p,q}(\mathbb{R}),\,\eta_{0}\in\mathbb{R}^{p},\,\phi_{1},\ldots,\phi_{p}>0}\sum_{i=1}^{n}\log f_{\theta}(X_{i}). \qquad (6)

2.3.  Modeling covariance

Such a generalized factor framework can be used to efficiently estimate the covariance structure of high dimensional data. Specifically, applying the total variance formula we can derive the covariance of a new observation $X\,|\,\lambda\sim F(V\lambda,\Phi)$ given $\lambda\sim N(0,I_{q})$,

\text{Cov}_{\theta}(X)={\mathbb{E}}_{\lambda}\big(\text{Var}_{\theta}(X|\lambda)\big)+\text{Var}_{\lambda}\big({\mathbb{E}}_{\theta}(X|\lambda)\big). \qquad (7)

Here, the first term is a diagonal matrix but the second term requires an outer product that induces correlations among the entries of $X$,

{\mathbb{E}}_{\lambda}\big(\text{Var}_{\theta}(X|\lambda)\big)=\text{Diag}_{j\in[p]}\Big\{\phi_{j}{\mathbb{E}}_{\lambda}\big[\mathcal{V}\circ g^{-1}\big((V\lambda)_{j}\big)\big]\Big\}, \qquad (8)
\text{Var}_{\lambda}\big({\mathbb{E}}_{\theta}(X|\lambda)\big)={\mathbb{E}}_{\lambda}\Big[\big(g^{-1}(V\lambda)-\mu_{\lambda}\big)\big(g^{-1}(V\lambda)-\mu_{\lambda}\big)^{\top}\Big],

where $\mu_{\lambda}={\mathbb{E}}_{\lambda}\big(g^{-1}(V\lambda)\big)$.

In the special case of the Gaussian distribution, we have $g(\mu)=\mu$ and $\mathcal{V}(\mu)=1$ and so, as expected, $\text{Cov}_{\theta}(X)=\Phi+VV^{\top}$. Using this result, Fan et al. (2008) demonstrated the efficiency of the plug-in covariance estimator via estimation of $V,\phi,\text{Cov}(\Lambda)$. Similarly, we demonstrate that the plug-in efficiency of $(V,\Phi)$ in high dimensional covariance estimation is preserved under our generalized quasi-factor setup using Eq (8).
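Since the expectations in Eq (8) are generally unavailable in closed form for non-identity links, a simple Monte Carlo sketch of the plug-in covariance estimator is given below; the fitted quantities and the user-supplied inverse link and variance function are assumptions for illustration.

import numpy as np

def plugin_covariance(V, phi, eta0, inv_link, var_fun, n_mc=10_000, seed=0):
    # Monte Carlo evaluation of Cov_theta(X) from Eq (7)-(8).
    # V: p x q loadings; phi: length-p dispersions; eta0: length-p center.
    rng = np.random.default_rng(seed)
    lam = rng.standard_normal((n_mc, V.shape[1]))      # lambda ~ N(0, I_q)
    mu = inv_link(lam @ V.T + eta0)                    # n_mc x p conditional means
    within = np.diag(phi * var_fun(mu).mean(axis=0))   # E_lambda[ Var(X | lambda) ]
    between = np.cov(mu, rowvar=False)                 # Var_lambda[ E(X | lambda) ]
    return within + between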

3.  Approximate but Efficient and Robust Optimization

As we mentioned in Section 1.1.3, directly optimizing Eq (6) through simulated gradients has both numerical and theoretical issues. Next, we demonstrate how we can conduct maximum likelihood estimation of the factor matrix $V$ and conditional variances $\Phi$ through some approximate but efficient and robust algorithms.

3.1.  EM optimization with small $q$

As in the Gaussian factor model, a common estimation procedure for such a latent variable model is Expectation-Maximization (EM; Dempster et al., 1977a). In our setup, the EM algorithm can be formulated as follows:

E-Step:

Given the parameters $\theta^{(t)}$ at the $t$-th iteration, we compute

Q(\theta;\theta^{(t)})={\mathbb{E}}_{\Lambda|X;\theta^{(t)}}\Bigg[\sum_{i=1}^{n}\log f_{\theta}(X_{i},\Lambda_{i})\Bigg]\approx\sum_{i=1}^{n}\int\log f_{\theta}(X_{i},\Lambda_{i})\widetilde{f}_{\theta^{(t)}}(\Lambda_{i}|X_{i})d\Lambda_{i}, \qquad (9)

where $\widetilde{f}_{\theta^{(t)}}(\Lambda_{i}|X_{i})$ is a Laplace approximation to the actual conditional posterior of $\Lambda_{i}|X_{i}$ using a Gaussian density centered at $\widehat{\Lambda}_{i}$ and with precision matrix $\widehat{H}_{i}$. These parameters are found by Fisher scoring on a regularized GLM regression with quasi-likelihood

\ell_{i}^{(t)}(\Lambda_{i})=\log f_{\theta^{(t)}}(X_{i}|\Lambda_{i})+\log f(\Lambda_{i}), \qquad (10)

that is, $\Lambda_{i}|X_{i}\approx N(\widehat{\Lambda}_{i},\widehat{H}_{i}^{-1})$ with $\widehat{\Lambda}_{i}=\operatornamewithlimits{argmax}_{\Lambda_{i}}\ell_{i}^{(t)}(\Lambda_{i})$ and $\widehat{H}_{i}={\mathbb{E}}_{X_{i}}[-\partial^{2}\ell_{i}^{(t)}(\widehat{\Lambda}_{i})/\partial\Lambda_{i}\partial\Lambda_{i}^{\top}]$ the negative expected Hessian. When the latent dimension $q$ is small, we can then evaluate the integral in (9) using multivariate Gauss-Hermite cubature (Golub and Welsch, 1969) with $m$ nodes for each $\Lambda_{i}$,

Q(\theta;\theta^{(t)})\approx\sum_{i=1}^{n}\sum_{l=1}^{m}\log f_{\theta}(X_{i},\Lambda_{il})w_{il}^{(t)}, \qquad (11)

where the cubature weights $w_{il}^{(t)}$ are computed based on the Laplace approximation $\Lambda_{i}|X_{i};\theta^{(t)}\approx N(\widehat{\Lambda}_{i},\widehat{H}_{i}^{-1})$.
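For reference, a minimal sketch of the adapted tensor-product Gauss-Hermite construction for one observation is shown below, assuming the Laplace mean $\widehat{\Lambda}_{i}$ and precision $\widehat{H}_{i}$ are already available; it shows only the standard adapted-cubature nodes and weights, so the exact weighting used in Algorithm 1 may differ, and the $m^{q}$ cost is what restricts this scheme to small $q$.

import itertools
import numpy as np

def adapted_gh_nodes(Lambda_hat, H_hat, m=5):
    # Tensor-product Gauss-Hermite cubature adapted to N(Lambda_hat, H_hat^{-1}).
    # Returns m^q nodes (rows) and weights summing to one.  Sketch only.
    q = Lambda_hat.shape[0]
    x, w = np.polynomial.hermite_e.hermegauss(m)          # nodes/weights for N(0, 1)
    w = w / np.sqrt(2 * np.pi)                            # probabilists' normalization
    grid = np.array(list(itertools.product(x, repeat=q)))                 # m^q x q
    weights = np.prod(list(itertools.product(w, repeat=q)), axis=1)       # product weights
    L = np.linalg.cholesky(np.linalg.inv(H_hat))          # covariance Cholesky factor
    nodes = Lambda_hat + grid @ L.T                       # shift and scale standard nodes
    return nodes, weights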

M-Step:

We can then update the parameters to $\theta^{(t+1)}$ by maximizing the expected complete data likelihood,

\theta^{(t+1)}=\operatornamewithlimits{argmax}_{\theta=(V,\Phi,\eta_{0})}Q(\theta;\theta^{(t)})=\operatornamewithlimits{argmax}_{\theta=(V,\Phi,\eta_{0})}\sum_{i=1}^{n}\sum_{l=1}^{m}\log f_{\theta}(X_{i},\Lambda_{il})w_{il}^{(t)} \qquad (12)

The M-step is then easily seen to be a quasi-GLM weighted regression of $X$ on the cubature nodes $\Lambda_{il}$. In particular, we estimate the dispersion parameters using weighted Pearson residuals,

\widehat{\phi}_{j}^{(t+1)}=\frac{1}{nm}\sum_{i=1}^{n}\sum_{l=1}^{m}\frac{(X_{ij}-\widehat{\mu}_{ijl}^{(t+1)})^{2}}{w_{ij}w_{il}^{*}\mathcal{V}(\widehat{\mu}_{ijl}^{(t+1)})},\quad j\in[p], \qquad (13)

where $w_{il}^{*}$ are the cubature weights at the last EM iteration.

The algorithm for this EM optimization is summarized in Algorithm 1. While this computational scheme is robust and efficient, both its complexity and approximation errors are proportional to the latent dimension $q$. It is common in the literature (Fan et al., 2020) to assume that $q\ll p$ so that the factorization is parsimonious. For cases when we need a larger $q$, we explore two alternative optimization methods utilizing Stochastic Gradient Descent (SGD) next.

Data: $X\in\mathbb{F}^{n\times p}$ for some field $\mathbb{F}$, e.g. $\mathbb{F}=\mathbb{N}_{+},\mathbb{R}_{+},\ldots$
Input: factorization rank $q$, number of Gauss-Hermite nodes $m$, and maximum number of iterations $T$
Initialization: initialize $(V_{0},\eta_{0})$ using centered DMF or SVD and then $\phi_{j}$ using Eq (13)
for $t=1,\ldots,T$ do
       $\triangleright$ E-step:
       for $i=1,\ldots,n$ do
             Compute Laplace approximation $\Lambda_{i}|X_{i};\theta^{(t)}\approx N(\widehat{\Lambda}_{i},\widehat{H}_{i}^{-1})$ using Fisher scoring in (10)
             Compute Gauss-Hermite nodes $\Lambda_{il}$ and weights $w_{il}^{(t)}$ for $l=1,\ldots,m$ defining (11)
       $\triangleright$ M-step:
       Obtain $V^{(t+1)}$ and $\eta_{0}^{(t+1)}$ by the weighted quasi-GLM regression in (12)
       Obtain dispersions $\phi_{j}^{(t+1)}$ for $j=1,\ldots,p$ via Pearson residuals using (13)
Identify $\widetilde{V}_{T},\widetilde{\eta}_{T}$ according to Section 2.1
Result: MLE estimates $\widehat{V}=\widetilde{V}_{T},\widehat{\eta}_{0}=\widetilde{\eta}_{T},\widehat{\Phi}=\Phi_{T}$
Algorithm 1: Approximate EM (recommended for small $q$)

3.2.  SGD optimization with large rank $q$

Under some regularity conditions that allow the exchange of differentiation and integration, the gradient of the likelihood can be written as:

\nabla_{\theta}\sum_{i=1}^{n}\log f_{\theta}(X_{i})=\sum_{i=1}^{n}\frac{\nabla_{\theta}\int_{\Lambda_{i}}f_{\theta}(X_{i},\Lambda_{i})d\Lambda_{i}}{\int_{\Lambda_{i}}f_{\theta}(X_{i},\Lambda_{i})d\Lambda_{i}}=\sum_{i=1}^{n}\int_{\Lambda_{i}}\nabla_{\theta}[f_{\theta}(X_{i},\Lambda_{i})]\frac{1}{f_{\theta}(X_{i})}d\Lambda_{i} \qquad (14)
=\sum_{i=1}^{n}\int_{\Lambda_{i}}\nabla_{\theta}[\log f_{\theta}(X_{i},\Lambda_{i})]\frac{f_{\theta}(X_{i},\Lambda_{i})}{f_{\theta}(X_{i})}d\Lambda_{i}=\sum_{i=1}^{n}{\mathbb{E}}_{\Lambda_{i}|X_{i}}\big[\nabla_{\theta}\log f_{\theta}(X_{i},\widetilde{\Lambda}_{i})\big]

If we can evaluate Eq (14) efficiently and accurately, we can then update the EFM parameters through gradient descent with step size $\alpha$:

\theta_{t+1}=\theta_{t}-\alpha\nabla_{\theta}\Bigg[-\sum_{i=1}^{n}\log f_{\theta}(X_{i})\Bigg] \qquad (15)

Ignoring for now the expectation ${\mathbb{E}}_{\Lambda_{i}|X_{i}}(\cdot)$, the gradients $\nabla_{\theta}[\log f_{\theta}(X_{i},\Lambda_{i})]$ are in fact available in closed form given the specification of our model in Eq (3). We provide a summary of these gradients below (see the code sketch after this list for a numerical illustration):

  • Gradient for $V_{j}$:

    -\nabla_{V_{j}}\log f_{\theta}(X_{i},\Lambda_{i})=\frac{w_{ij}}{\phi_{j}}\nabla_{V_{j}}\bigg(\int_{\mu_{ij}}^{X_{ij}}\frac{X_{ij}-t}{\mathcal{V}(t)}dt\bigg)=-\frac{w_{ij}}{\phi_{j}}\frac{1}{\mathcal{V}(\mu_{ij})}(X_{ij}-\mu_{ij})\frac{\partial\mu_{ij}}{\partial\eta_{ij}}\frac{\partial\eta_{ij}}{\partial V_{j}} \qquad (16)
    =-\frac{w_{ij}}{\phi_{j}}\frac{1}{\mathcal{V}(\mu_{ij})}(X_{ij}-\mu_{ij})\frac{1}{g^{\prime}(\mu_{ij})}\Lambda_{i}.
  • Gradient for $\phi_{j}$:

    We repeat the derivation for the parameter $\phi_{j}\in\mathbb{R}_{+}$, for $j\in[p]$:

    \nabla_{\phi_{j}}(-\log f_{\theta}(X_{i},\Lambda_{i}))=-\frac{w_{ij}}{\phi_{j}^{2}}\bigg(\int_{\mu_{ij}}^{X_{ij}}\frac{X_{ij}-t}{\mathcal{V}(t)}dt\bigg)+\frac{1}{2\phi_{j}}=-\frac{w_{ij}}{2\phi_{j}^{2}}Q(X_{ij};\mu_{ij})+\frac{1}{2\phi_{j}} \qquad (17)
    =\frac{1}{2\phi_{j}}\bigg(-\frac{w_{ij}}{\phi_{j}}Q(X_{ij};\mu_{ij})+1\bigg)\approx\frac{1}{2\phi_{j}}\bigg(-\frac{w_{ij}(X_{ij}-\mu_{ij})^{2}}{\phi_{j}\mathcal{V}(\mu_{ij})}+1\bigg),

    where $Q(X_{ij};\mu_{ij})=-2\int_{X_{ij}}^{\mu_{ij}}\frac{X_{ij}-t}{\mathcal{V}(t)}dt$ is the quasi-deviance function.

  • Gradient for $\eta_{0}$:

    We repeat the derivation for $\eta_{0j}\in\mathbb{R}$, for $j\in[p]$:

    \nabla_{\eta_{0j}}(-\log f_{\theta}(X_{i},\Lambda_{i}))=\frac{w_{ij}}{\phi_{j}}\nabla_{\eta_{0j}}\bigg(\int_{\mu_{ij}}^{X_{ij}}\frac{X_{ij}-t}{\mathcal{V}(t)}dt\bigg)=-\frac{w_{ij}}{\phi_{j}}\frac{1}{\mathcal{V}(\mu_{ij})}(X_{ij}-\mu_{ij})\frac{\partial\mu_{ij}}{\partial\eta_{ij}}\frac{\partial\eta_{ij}}{\partial\eta_{0j}} \qquad (18)
    =\frac{w_{ij}}{\phi_{j}}\frac{1}{\mathcal{V}(\mu_{ij})g^{\prime}(\mu_{ij})}(\mu_{ij}-X_{ij}).
  • Hessian for $V_{j}$:

    In our later optimization, we additionally need the Hessian of $\log f_{\theta}(X_{i},\Lambda_{i})$. The Hessians for $V_{j}$ and $\Lambda_{i}$ are symmetric with respect to each other up to a simple notation change; below we use the former as an example. We need the following definitions:

    • $S_{ij}=\frac{w_{ij}}{\phi_{j}}\frac{({g^{-1}}^{\prime}(\eta_{ij}))^{2}}{\mathcal{V}(\mu_{ij})}$ and $G_{ij}=\frac{{g^{-1}}^{\prime}(\eta_{ij})}{\mathcal{V}(\mu_{ij})}\frac{w_{ij}}{\phi_{j}}(X_{ij}-\mu_{ij})$;

    • $D_{i\cdot}=\text{Diag}\{S_{i\cdot}^{(t)}\}$, where $S_{i\cdot}$ denotes the $i$-th row of $S$; similarly, $\text{Diag}\{S_{\cdot j}^{(t)}\}$, where $S_{\cdot j}$ denotes the $j$-th column of $S$.

    The expected Hessian for $V_{j}$ is then

    {\mathbb{E}}_{X_{i}}\big[\nabla_{V_{j}}^{2}(-\log f_{\theta}(X_{i},\Lambda_{i}))\big]= \qquad (19)
    {\mathbb{E}}_{X_{i}}\bigg[\bigg(\nabla_{V_{j}}\big(G(\mu_{i})^{\top}[S^{\frac{1}{2}}(\mu_{i})\Phi S^{\frac{1}{2}}(\mu_{i})]^{-1}\big)(\mu_{i}-X_{i})+G(\mu_{i})^{\top}[S^{\frac{1}{2}}(\mu_{i})\Phi S^{\frac{1}{2}}(\mu_{i})]^{-1}\nabla_{V_{j}}\big(\mu_{i}-X_{i}\big)\bigg)\Lambda_{i}\bigg]
    =\sum_{j=1}^{p}G(\mu_{ij})^{2}[S^{\frac{1}{2}}(\mu_{ij})\Phi_{jj}S^{\frac{1}{2}}(\mu_{ij})]^{-1}\frac{\partial\mu_{ij}}{\partial\eta_{ij}}\frac{\partial\eta_{ij}}{\partial V_{j}}\Lambda_{i}
    =\Lambda_{i}\bigg(\text{Diag}(G(\mu_{i})^{2})[S^{\frac{1}{2}}(\mu_{i})\Phi S^{\frac{1}{2}}(\mu_{i})]^{-1}\bigg)\Lambda_{i}^{\top},

    where the first term vanishes because ${\mathbb{E}}(\mu_{i}-X_{i})=\mathbf{0}_{p}$.
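As referenced above, a minimal numerical sketch of the closed-form gradient with respect to $V$ (Eq (16)) for a single observation is given below; the user-supplied link and variance functions are illustrative assumptions, e.g. for a Poisson model with canonical log link both inv_link and dinv_link are np.exp and var_fun is the identity.

import numpy as np

def neg_loglik_grad_V(x_i, Lambda_i, V, eta0, phi, w_i, inv_link, dinv_link, var_fun):
    # Gradient of -log f_theta(X_i, Lambda_i) with respect to V, following Eq (16).
    # inv_link = g^{-1}, dinv_link = (g^{-1})', var_fun = V(.).  Sketch only.
    eta_i = V @ Lambda_i + eta0                        # length-p linear predictor
    mu_i = inv_link(eta_i)                             # conditional mean
    # entry-wise score (w_ij/phi_j)(X_ij - mu_ij)/(V(mu_ij) g'(mu_ij)); note 1/g'(mu) = (g^{-1})'(eta)
    score = w_i / phi * (x_i - mu_i) / var_fun(mu_i) * dinv_link(eta_i)
    return -np.outer(score, Lambda_i)                  # p x q gradient matrix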

Computing the gradient for every observation $i\in[n]$ in Eq (14) is, however, expensive. We therefore additionally leverage modern stochastic optimization and randomly sample partial data to approximate the optimization gradient. The resulting optimization is termed Stochastic Gradient Descent (SGD; Robbins and Monro, 1951) due to its interpretation as a stochastic approximation of the actual gradient function. Since the method effectively reduces the sample size by sub-sampling the original dataset, SGD can be used to accelerate any optimization algorithm with an explicit gradient formulation. We use the SML optimization as an example to illustrate the implementation.

Taking the last equality in Eq (14) and applying the law of large numbers, we can compute the gradient using stochastic samples of sizes $B$ and $S$:

\sum_{i=1}^{n}\nabla_{\theta}\log f_{\theta}(X_{i})\approx n{\mathbb{E}}_{X}[\nabla_{\theta}\log f_{\theta}(X)]\approx\frac{n}{B}\sum_{b=1}^{B}\nabla_{\theta}\log f_{\theta}(X^{(b)}) \qquad (20)
=\frac{n}{B}\sum_{b=1}^{B}{\mathbb{E}}_{\widetilde{\Lambda}_{b}}[\nabla_{\theta}\log f_{\theta}(X^{(b)},\widetilde{\Lambda}_{b})],\quad\widetilde{\Lambda}_{b}\stackrel{d}{=}\Lambda_{b}\,|\,X^{(b)}
\approx\frac{n}{B}\sum_{b=1}^{B}\frac{1}{S}\sum_{s=1}^{S}\nabla_{\theta}\log f_{\theta}(X^{(b)},\widetilde{\Lambda}_{b,s}),\quad\widetilde{\Lambda}_{b,s}\stackrel{\text{iid}}{\sim}\widetilde{\Lambda}_{b}

The batch size $B$ is chosen to be smaller than the sample size $n$, which scales the original per-iteration complexity by a factor of $B/n$. To maintain distributional assumptions, we sample $X^{(b)}$ with replacement from the original data $X$. In addition, since this stochastically sampled gradient targets the true likelihood instead of the simulated likelihood, the sample size $S$ required to compute the optimization gradients does not need to grow with the actual sample size (e.g. $S=n^{1/2}$ as required for SML (Lee, 1995)).

Compared to second-order optimization methods such as Newton's method, step size selection is of crucial importance for stochastic gradient descent. A large step size makes the algorithm oscillate, while a small step size hardly improves the likelihood. Theoretical analysis suggests choosing the step size according to the condition number of the parameter Hessian matrix (Bertsekas, 1999). When the Hessian matrix is not available, recent work has refined step size selection by utilizing momentum and the scale of the parameters. Specifically, AdaGrad (Duchi et al., 2011) scales the gradient update with the gradient's second moment, while RMSProp (Tieleman and Hinton, 2012) employs exponential decay to smooth the gradient direction. More recently, combining both RMSProp and AdaGrad, the Adam method (Kingma and Ba, 2015) has gained well-deserved attention in modern stochastic optimization research. Here we adopt the Adam method with the parameters recommended in the original reference ($\beta_{1}=0.9,\beta_{2}=0.999,\epsilon=10^{-8}$). To avoid oscillation around the minimum, we also employ a decaying learning rate $\gamma_{t}=\frac{\alpha}{1+0.5t}$. When necessary, tuning of the hyperparameter $\alpha$ can be conducted by randomly sampling $\alpha$ on a log grid.
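For completeness, a minimal sketch of one Adam update with this decaying learning rate is given below; it follows the standard update of Kingma and Ba without bias correction and is not the exact implementation used in Algorithm 2.

import numpy as np

def adam_step(theta, grad, state, t, alpha=0.5, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update with decaying learning rate gamma_t = alpha / (1 + 0.5 t).
    # `state` carries the running first and second moments.  Sketch only.
    m, v = state
    m = beta1 * m + (1 - beta1) * grad                 # momentum (first moment)
    v = beta2 * v + (1 - beta2) * grad ** 2            # scale (second moment)
    gamma_t = alpha / (1 + 0.5 * t)                    # decaying learning rate
    theta = theta - gamma_t * m / (np.sqrt(v) + eps)
    return theta, (m, v)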

To tackle specifically the expectation ${\mathbb{E}}_{\widetilde{\Lambda}_{i}}(\cdot)$ with $\widetilde{\Lambda}_{i}\stackrel{d}{=}\Lambda_{i}|X_{i}$, we propose two optimization algorithms in the following subsections.

3.2.1.  Computing the gradient using Laplacian approximation

The evaluation of the gradient is thus equivalent to evaluating a posterior moment of a general function $g(\cdot)$. For such integrations, the Laplacian approximation (Tierney and Kadane, 1986) has been frequently studied. Denote

  • $g(\Lambda_{i}):\mathbb{R}^{q}\rightarrow\mathbb{R}^{q}$, $g(\Lambda_{i})=\nabla_{\theta}\log f_{\theta}(X_{i},\Lambda_{i})$

  • $h(\Lambda_{i}):\mathbb{R}^{q}\rightarrow\mathbb{R}$, $h(\Lambda_{i})=-\frac{1}{p}\sum_{j=1}^{p}\log f_{\theta}(X_{ij}|\Lambda_{i})+f(\Lambda_{i})$

It is easy to verify that $h(\Lambda_{i})$ is of constant order as $p\rightarrow\infty$ and that $g(\Lambda_{i})$ does not grow with $p$. Interestingly, even if $g(\Lambda_{i})$ is permitted to grow with $p$, provided that the growth rate is bounded by $O(e^{g_{0}p^{1-\delta}})$, a valid approximation can still be derived with an error estimate of $O(e^{p^{1-\delta}})$. The precise approximation in this general scenario can be found in Lemma 5 of (Xia and Zhang, 2023). The evaluation follows from the derivation below with $\widetilde{\Lambda}_{i}=\Lambda_{i}|X_{i}$ being the posterior:

{\mathbb{E}}_{\widetilde{\Lambda}_{i}}[g(\widetilde{\Lambda}_{i})]=\int_{\mathbb{R}^{q}}g(\Lambda_{i})f(\Lambda_{i}|X_{i})d\Lambda_{i}=\frac{\int_{\mathbb{R}^{q}}g(\Lambda_{i})\exp(-ph(\Lambda_{i}))d\Lambda_{i}}{\int_{\mathbb{R}^{q}}\exp(-ph(\Lambda_{i}))d\Lambda_{i}} \qquad (21)

To further simplify the notation, we denote $H_{\Lambda_{i}}=\nabla_{\Lambda_{i}}^{2}h(\widehat{\Lambda}_{i})$ and $U_{\Lambda_{i}}=\nabla_{\Lambda_{i}}h(\widehat{\Lambda}_{i})$. If we choose $\widehat{\Lambda}_{i}$ to maximize $-h(\Lambda_{i})$, we have $U_{\Lambda_{i}}=\nabla_{\Lambda_{i}}h(\widehat{\Lambda}_{i})=\mathbf{0}_{q}$ and the numerator can be simplified to its leading term (Tierney et al., 1989):

\int_{\mathbb{R}^{q}}g(\Lambda_{i})\exp(-ph(\Lambda_{i}))d\Lambda_{i}=g(\widehat{\Lambda}_{i})\exp(-ph(\widehat{\Lambda}_{i}))\int_{\mathbb{R}^{q}}\exp\Big(-\frac{p}{2}(\Lambda_{i}-\widehat{\Lambda}_{i})^{\top}H_{\Lambda_{i}}(\Lambda_{i}-\widehat{\Lambda}_{i})\Big)d\Lambda_{i} \qquad (22)
=g(\widehat{\Lambda}_{i})\exp(-ph(\widehat{\Lambda}_{i}))(2\pi/p)^{q/2}|\Sigma(\widehat{\Lambda}_{i})|^{-1/2}\big(1+O(1/p)\big).

As a corollary of this $O(1/p)$ approximation result, the denominator in Eq (21) is a special case of the numerator with $g(\Lambda_{i})=1$, and a more accurate approximation with relative order $O(1/p^{2})$ can be obtained by applying the approximation in Eq (22) to both the numerator and the denominator. Such an approximation, however, would additionally require either of the following components:

  • The higher order derivatives $\{\nabla_{V}^{(k)}g(\widehat{\Lambda}_{i})\}_{k=1}^{2}$ and $\nabla_{\Lambda_{i}}^{(3)}g(\widehat{\Lambda}_{i})$.

  • The maximization solution $\widehat{\Lambda}_{i}=\operatornamewithlimits{argmin}_{\Lambda_{i}}-\frac{1}{p}\log g(\Lambda_{i})+h(\Lambda_{i})$.

The approximation based on the first evaluation is named the Standard Form, while the one based on the second is named the Fully Exponential Form (Tierney et al., 1989).
However, Eq (22) still requires an evaluation of $h(\widehat{\Lambda}_{i})=-\frac{1}{p}\sum_{j=1}^{p}\log f_{\theta}(X_{ij}|\Lambda_{i})$, which will asymptotically approach 0 under a bad initialization of $V$ as $p\rightarrow\infty$. Fortunately, for an approximation of order $O(1/p)$, a joint approximation of the numerator and denominator integrals with the Standard Form (Tierney et al., 1989) shows that $g(\widehat{\Lambda}_{i})$ equivalently provides an $O(1/p)$ approximation by simply plugging $\widehat{\Lambda}_{i}=\operatornamewithlimits{argmin}_{\Lambda_{i}}h(\Lambda_{i})$ into the function $g(\cdot)$:

\sum_{i=1}^{n}{\mathbb{E}}_{\widetilde{\Lambda}_{i}}[g(\widetilde{\Lambda}_{i})]=\sum_{i=1}^{n}g(\widehat{\Lambda}_{i})\big(1+O(1/p)\big) \qquad (23)

To facilitate later reference, we name this approach the Laplacian optimization, whose gradient evaluation is accurate for large $p$ with a relative error rate of $O(1/p)$. Although our setup assumes a Gaussian latent variable, the optimization can be applied to general non-Gaussian latent priors with a simple modification of $f(\Lambda_{i})$ in

h(\Lambda_{i})=-\frac{1}{p}\sum_{j=1}^{p}\log f_{\theta}(X_{ij}|\Lambda_{i})+f(\Lambda_{i})

The optimization can then be readily carried out as a weighted MAP solution of a Bayesian GLM regression for the given prior on $\Lambda_{i}$ with density $f(\Lambda_{i})$:

\widehat{\Lambda}_{i}=\operatornamewithlimits{argmax}_{\Lambda_{i}}\frac{1}{p}\sum_{j=1}^{p}\log f_{\theta}(X_{ij}|\Lambda_{i})+f(\Lambda_{i}) \qquad (24)
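A minimal sketch of this Laplacian gradient evaluation for one data row is given below; `loglik_row` and `grad_fun` are hypothetical helpers returning $\log f_{\theta}(X_{i}|\Lambda_{i})$ and the parameter gradient of the joint log-likelihood, and the quadratic penalty encodes the $N(0,I_{q})$ log-prior (up to constants), which is our reading of how $f(\Lambda_{i})$ enters Eq (24).

import numpy as np
from scipy.optimize import minimize

def laplacian_gradient(x_i, q, loglik_row, grad_fun):
    # Per-row gradient via the Laplacian plug-in rule of Eq (23)-(24):
    # locate the MAP factor Lambda_hat, then evaluate the joint gradient there.
    p = x_i.shape[0]

    def neg_posterior(lam):
        # -(1/p) log f(x_i | lam) plus the N(0, I_q) log-prior penalty (up to constants)
        return -loglik_row(x_i, lam) / p + 0.5 * np.sum(lam ** 2)

    lam_hat = minimize(neg_posterior, np.zeros(q), method="BFGS").x   # MAP solve, Eq (24)
    return grad_fun(x_i, lam_hat)           # g(Lambda_hat); relative error O(1/p)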

3.2.2.  Computing the gradient using posterior sampling

When $p$ is of moderate dimension, the evaluation of Eq (2) would still require a good starting point, yet the gradient evaluation using Eq (22) is inaccurate. In this case, we observe from Eq (14) that if the posterior distribution of $\Lambda_{i}|X_{i}$ can be easily simulated without numerical issues, then computing Eq (14) via stochastic sampling becomes straightforward. Our second optimization idea thus comes from approximating the posterior distribution instead of approximating the likelihood gradient.

If we conduct a Taylor expansion of $\log f_{\theta}(X_{i}|\Lambda_{i})$ to second order around the stationary point $\widehat{\Lambda}_{i}$, we obtain the following approximation to the data likelihood:

\log f_{\theta}(X_{i}|\Lambda_{i})\approx\log f_{\theta}(X_{i}|\widehat{\Lambda}_{i})+(\Lambda_{i}-\widehat{\Lambda}_{i})^{\top}U_{\Lambda_{i}}(\widehat{\Lambda}_{i})+\frac{1}{2}(\Lambda_{i}-\widehat{\Lambda}_{i})^{\top}H_{\Lambda_{i}}(\widehat{\Lambda}_{i})(\Lambda_{i}-\widehat{\Lambda}_{i}) \qquad (25)

The required stationary point $\widehat{\Lambda}_{i}$ can be obtained by solving the following normal equations for the $n$ rows of $X$, indexed by $i=1,\ldots,n$:

V^{\top}D_{i\cdot}V\lambda_{i}^{(t+1)}=V^{\top}D_{i\cdot}(V\lambda_{i}^{(t)}+D_{i\cdot}^{-1}G_{i\cdot})=V^{\top}D_{i\cdot}Z_{i\cdot}^{(t)}, \qquad (26)

where $\lambda_{i}$ denotes the $i$-th row of the matrix $\Lambda$, $\phi$ is the dispersion parameter from the exponential family, $D_{i\cdot}$ and $G_{i\cdot}$ are defined as before, and $Z^{(t)}$ is the working response:

Z_{ij}^{(t)}=\eta_{ij}^{(t)}+\frac{G_{ij}^{(t)}}{S_{ij}^{(t)}}=\eta_{ij}^{(t)}+\frac{X_{ij}-\mu_{ij}^{(t)}}{{g^{-1}}^{\prime}(\eta_{ij}^{(t)})}. \qquad (27)

The posterior distribution of the latent variable $\Lambda$ can then be approximated via the Gaussian product formula:

f(\Lambda_{i}|X_{i})\propto f(X_{i}|\Lambda_{i})f(\Lambda_{i})\approx f_{N}(\widehat{\Lambda}_{i},H_{\Lambda_{i}}^{-1}(\widehat{\Lambda}_{i}))\,f_{N}(0,I_{q})=f_{N}(\mu_{i},\Sigma_{i}) \qquad (28)

where $f_{N}(\mu,\Sigma)$ denotes the multivariate normal density with mean $\mu$ and variance $\Sigma$. From the Gaussian integration formula, $\mu_{i}=[I_{q}+H_{\Lambda_{i}}^{-1}(\widehat{\Lambda}_{i})]^{-1}\widehat{\Lambda}_{i}$ and $\Sigma_{i}=[I_{q}+H_{\Lambda_{i}}(\widehat{\Lambda}_{i})]^{-1}$.

Under this closed-form expression for the posterior, the complexity of evaluating the gradient in Eq (20) reduces to that of sampling from a multivariate normal distribution with parameters $\mu_{i},\Sigma_{i}$. As for the quality of this approximation, we appeal to a similar argument in (Rue et al., 2009): we are implicitly assuming that the shape of the posterior is determined solely by the prior and that the likelihood only contributes to the location and scale. The approximation can be inaccurate when the prior is non-Gaussian, but this is not the case for the factor model where $\Lambda_{i}\sim N(0,I_{q})$.
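A minimal sketch of this posterior sampling step for one observation follows; it assumes the Fisher scoring solution $\widehat{\Lambda}_{i}$ and the Hessian $H_{\Lambda_{i}}(\widehat{\Lambda}_{i})$ are available and that `grad_fun` is a hypothetical helper returning the joint-likelihood gradient.

import numpy as np

def posterior_sampling_gradient(x_i, Lambda_hat, H_hat, grad_fun, S=50, seed=0):
    # Approximate E_{Lambda_i | X_i}[grad] in Eq (20) by drawing S samples
    # from the Gaussian posterior approximation of Eq (28).  Sketch only.
    rng = np.random.default_rng(seed)
    q = Lambda_hat.shape[0]
    Sigma_i = np.linalg.inv(np.eye(q) + H_hat)                 # [I_q + H]^{-1}
    mu_i = Sigma_i @ (H_hat @ Lambda_hat)                      # posterior mean
    draws = rng.multivariate_normal(mu_i, Sigma_i, size=S)     # S x q posterior draws
    return np.mean([grad_fun(x_i, lam) for lam in draws], axis=0)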

For convenience of later reference, this second optimization method is named Posterior Sampling optimization. Posterior sampling optimization is in fact closely related to the MCEM algorithm (Dempster et al., 1977b), whose convergence properties have been shown to be superior to those of SML (Jank and Booth, 2003). To observe this connection, notice that we can introduce a new probability measure $Q(\Lambda_{i})$ to reformulate the optimization problem in Eq (6) as:

\widehat{V}=\operatornamewithlimits{argmax}_{\theta}\sum_{i=1}^{n}\log\int_{\Lambda_{i}}f_{\theta}(X_{i}|\Lambda_{i})f(\Lambda_{i})d\Lambda_{i}=\operatornamewithlimits{argmax}_{\theta}\sum_{i=1}^{n}\log\int_{\Lambda_{i}}\frac{f_{\theta}(X_{i}|\Lambda_{i})f(\Lambda_{i})}{Q(\Lambda_{i})}Q(\Lambda_{i})d\Lambda_{i} \qquad (29)
=\operatornamewithlimits{argmax}_{\theta}\sum_{i=1}^{n}\log{\mathbb{E}}_{Q(\Lambda_{i})}\Bigg[\frac{f_{\theta}(X_{i},\Lambda_{i})}{Q(\Lambda_{i})}\Bigg]

Now, applying Jensen's inequality to switch the order of the expectation and the log operation,

\log f_{\theta}(X)=\sum_{i=1}^{n}\log{\mathbb{E}}_{Q(\Lambda_{i})}\Bigg[\frac{f_{\theta}(X_{i},\Lambda_{i})}{Q(\Lambda_{i})}\Bigg]\geq\sum_{i=1}^{n}{\mathbb{E}}_{Q(\Lambda_{i})}\Bigg[\log\frac{f_{\theta}(X_{i},\Lambda_{i})}{Q(\Lambda_{i})}\Bigg] \qquad (30)

The derived inequality becomes an equality if $\log(f_{\theta}(X_{i},\Lambda_{i})/Q(\Lambda_{i}))$ is constant, which can only be achieved by choosing the posterior $Q(\Lambda_{i})\propto f_{\theta}(X_{i},\Lambda_{i})=f_{\theta}(X_{i}|\Lambda_{i})f(\Lambda_{i})$.
The M-step of the EM algorithm then optimizes

\operatornamewithlimits{argmax}_{\theta}\sum_{i=1}^{n}{\mathbb{E}}_{Q(\Lambda_{i})}\Bigg[\log\frac{f_{\theta}(X_{i},\Lambda_{i})}{Q(\Lambda_{i})}\Bigg]

whose gradient, under similar regularity conditions allowing the exchange of derivative and integral, can be written in the following form:

\sum_{i=1}^{n}\int_{\Lambda_{i}}\nabla_{V}[\log f_{\theta}(X_{i},\Lambda_{i})]\,Q(\Lambda_{i})\,d\Lambda_{i}

Hence, our proposed stochastic gradient descent is equivalent to an EM algorithm that iteratively maximizes the marginal likelihood. However, as indicated in (Caffo et al., 2005), MCEM using a simulated gradient often requires an adaptive increase of the sample size $S_{t}$ to converge successfully. Recent research (Jank, 2006) circumvents the choice of an adaptive $S_{t}$ by averaging over past iterations. The average is often weighted with an emphasis on recent iterations, which is in fact equivalent to the step size selection of Adam optimization (Kingma and Ba, 2015).

We summarize these two SGD algorithms in Algorithm 2.

Data: $X\in\mathbb{F}^{n\times p}$ with $\mathbb{F}=\mathbb{N}_{+},\mathbb{R}_{+},\ldots$
Notation: $Z$ standard normal; $\Sigma^{-1/2}$ the Cholesky decomposition of $\Sigma$; $\circ$ the element-wise product; stochastic samples of sizes $B$ and $S$ denoted $X^{(B)}$ and $\Lambda^{(S)}$.
Input: batch size $B$, sample size $S$, learning rate $\alpha$, factorization rank $q$, and maximum number of iterations $T$
Initialization: initialize $V_{0},\eta_{0}$ using centered DMF or SVD and $\Phi_{0}$ through Pearson residuals; set Adam parameters $V_{d\theta}=\mathbf{0},S_{d\theta}=\mathbf{0},\beta_{1}=0.9,\beta_{2}=0.999,\epsilon=10^{-8}$
for $t=0,\ldots,T$ do
       Sample with replacement a batch $X^{(B)}$ from the data $X$
       if $p$ is large then
             Compute $\widehat{\Lambda}_{i}$ as the MAP solution defined in Eq (24)
             Compute the gradient $\widetilde{\nabla}_{\theta}=\sum_{i=1}^{n}g(\widehat{\Lambda}_{i})$ as in Eq (23)
       else
             Compute $\widehat{\Lambda}_{i}$ by solving Eq (26)
             Draw $S$ samples $\Lambda^{(S)}$ according to Eq (28)
             Compute $\widetilde{\nabla}_{\theta}=\sum_{i=1}^{n}\nabla_{\theta}\log f_{\theta}(X_{i})$ via Eq (20)
       Update $V_{d\theta}=\beta_{1}/(1-\beta_{1})V_{d\theta}+\widetilde{\nabla}_{V}$ // Adam momentum
       Update $S_{d\theta}=\beta_{2}/(1-\beta_{2})S_{d\theta}+\widetilde{\nabla}_{V}\circ\widetilde{\nabla}_{V}$ // Adam scale
       Update $V_{t+1}=V_{t}-\frac{\alpha}{1+0.5t}\frac{V_{d\theta}}{\sqrt{S_{d\theta}}+\epsilon}$
Decompose (SVD) $V^{\prime}_{t+1}=UDS^{\top}$ with $d_{1}\geq\cdots\geq d_{q}>0$
Identify $\widetilde{V}_{T},\widetilde{\eta}_{T}$ according to Section 2.1
Result: MLE estimates $\widehat{V}=\widetilde{V}_{T},\widehat{\eta}_{0}=\widetilde{\eta}_{T},\widehat{\Phi}=\Phi_{T}$
Algorithm 2: Adam SGD (recommended for large $q$)
  • Remark: Although the iteration ends after $T$ passes over the data, similar early stopping criteria (Yao et al., 2007) can be adopted if the main objective of the model is to make future predictions. If one instead focuses on the interpretability of the factorized components, one can stop the algorithm when the updates to $V_{t+1}$ become small.

4.  Examples and Results

In this section, we first present the results of simulation experiments, which validate the effectiveness and superiority of our SGD estimation compared to SML estimation. Then, on benchmark datasets in computer vision and network analysis, we compare our EFM factorization against other commonly applied factorizations such as non-negative matrix factorization (NMF), t-distributed stochastic neighbor embedding (t-SNE), and deviance matrix factorization (DMF; Wang and Carvalho, 2023). The factorization ranks $q$ for these empirical datasets are determined according to the rank determination proposition in (Wang and Carvalho, 2023).

4.1.  Simulated data

To validate the effectiveness of the optimization algorithms, we applied them to simulated datasets. We design small, simulated datasets where the marginalized likelihood can be evaluated using SML with a large sample size. To avoid potential confusion, we denote by $S$ the sample size used to evaluate the gradient and by $R$ the sample size used to evaluate the marginal likelihood. The notation is further clarified by the marginal likelihood evaluation below:

\mathcal{L}(V)=\sum_{i=1}^{n}\log f(X_{i}|V)\approx\sum_{i=1}^{n}\log\sum_{r=1}^{R}\frac{1}{R}f(X_{i}|\Lambda_{i}^{(r)},V) \qquad (31)
=\sum_{i=1}^{n}\bigoplus_{r=1}^{R}\log f(X_{i}|\Lambda_{i}^{(r)},V)-n\log R,

where $\oplus(x,y)=\log(e^{x}+e^{y})$ is the sum operator in log space. Note that the evaluation of Eq (31) is not required for the implementation of our Algorithm 2; it is only introduced to compare how effectively and efficiently the optimizers decrease the integrated negative log-likelihood, which can only be obtained using SML with a large $R$ for non-Gaussian data likelihoods.
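A minimal sketch of this log-space evaluation of Eq (31) is given below; `loglik_row` is a hypothetical helper returning $\log f(X_{i}|\Lambda_{i}^{(r)},V)$, and the log-sum-exp accumulation is what keeps the computation numerically stable.

import numpy as np
from scipy.special import logsumexp

def sml_marginal_loglik(X, V, loglik_row, R=1500, seed=0):
    # Monte Carlo evaluation of the marginal log-likelihood in Eq (31),
    # accumulated in log space to avoid underflow.  Sketch only.
    rng = np.random.default_rng(seed)
    n, q = X.shape[0], V.shape[1]
    total = 0.0
    for i in range(n):
        draws = rng.standard_normal((R, q))                    # Lambda_i^(r) ~ N(0, I_q)
        logs = np.array([loglik_row(X[i], lam, V) for lam in draws])
        total += logsumexp(logs) - np.log(R)                   # log (1/R) sum_r exp(.)
    return total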

To compare the quality of our optimization algorithms with the SML solution, we experimented with data simulated from negative binomial ($\phi=20$), binomial, and Poisson distributions. The parameters are chosen as $n=500,B=128,q=2,p=10,\alpha=0.5$ with their canonical link functions. Note that the sample size $n$ is chosen to be small to allow an efficient evaluation of the loss function using Monte Carlo. Based upon the true generating parameter $V^{*}$, we empirically verified that an accurate evaluation of the likelihood using Eq (31) requires $R=1{,}500$, which is three times the sample size $n=500$. This observation is consistent with the existing literature, yet interesting to practitioners, since the literature concentrates on asymptotic efficiency with a lower bound of $R>\sqrt{n}$ (Lee, 1995). In reality, the number of Monte Carlo samples required for accurate gradient or likelihood evaluation obviously depends on the variance of the gradient and likelihood for the specific dataset, which has no upper bound. We adopted $R=1{,}500$ to accurately monitor the loss decrease per unit of time from the same initialization point. The evaluation time using sample size $R$ is later subtracted from the optimization time for a fair comparison.

To investigate the effect of dimensionality and variance on the different optimization algorithms, we first conducted two experiments with $p=5$ and $p=10$ and then two additional experiments with a large $p=512$. Note that the dimension sizes $p=5$ and $p=10$ are designed to accommodate the numerical stability of the SML optimization, which has evaluation issues for large $p$ as mentioned in Section 1.1.3. For each optimization, we fix the initialization and the random seed to fairly compare the optimization paths.

We abbreviate the Laplace approximation optimization as LAPL, the posterior sampling optimization as PS, and the simulated maximum likelihood optimization as SML. To investigate the sampling requirement, we vary the sample size S over S\in\{50,300,500\}. The optimization paths for p=5 are plotted below:

Figure 1: EFM Optimization Comparison, p = 5

As we can see from Figure 1, the posterior sampling (PS) optimization is not very sensitive to the choice of sample size S, while the SML solution varies greatly with S, with larger S leading to a faster decrease. The Laplace approximation decreases the loss function at the slowest speed due to the O(1/p) approximation error in the gradient evaluation (Eq (21)). These results indicate that the EM and PS optimizations should be preferred for small data dimension p due to their efficiency, lower sensitivity to the sample size S, and numerical stability.

In theory, the LAPL optimization should become more accurate as the data dimensionality grows, with relative error O(1/p). To observe the effectiveness of LAPL at moderate dimensionality, we repeated the same experiments with p=10. We also adopted S=500 for both the SML and PS optimizations to compare optimization efficiency. The optimization paths are again recorded with respect to the Adam optimization steps.

Figure 2: EFM Optimization Comparison, p = 10

As we can see from Figure 2, the posterior sampling optimization is still the most efficient of the three, decreasing the loss the most per unit of time. However, as it reaches the convergence region, the PS optimization exhibits higher variance than LAPL. This observation indicates a potential advantage of the Laplace approximation when the gradient evaluation has large variance. The SML optimization decreases the negative likelihood at the slowest rate and should not be considered for practical applications.

To compare the optimization performance on a large-dimensional dataset, we experimented with p=512, a scenario in which SML completely loses its power due to numerical stability issues; in fact, we observe that SML consistently increases the loss over the optimization steps. We therefore compare only the PS optimization with different sample sizes S and the LAPL optimization, with their optimization paths plotted against optimization time.

Figure 3: EFM Optimization Comparison, p = 512 and small V

As we can see from Figure 3, despite the potentially larger Monte Carlo error in the gradient evaluation with the smaller sample size S=50, the PS optimization converges when compared with the LAPL and EM optimizations. This behaviour is desirable, as it indicates that a low computational budget suffices for large-dimension optimization. However, the PS optimization may require a large S when the variance of the gradient evaluation increases. To validate this argument, we increased the magnitude of V^{*}=U^{*}D^{*} in the simulation by multiplying the diagonal elements of D^{*} by a constant c>1. This multiplication enlarges the magnitude of the gradient according to the functional relationships in Table 1.

Table 1: Distributions and their respective negative log-likelihood, gradient, and Hessian functions.

Distribution F (link) | Negative log-likelihood l(X) | Gradient \nabla_{V_{j}}l(X) | Hessian \nabla^{(2)}_{\Lambda_{i}}l(X)
Gaussian (identity) | \frac{1}{2\sigma^{2}}\sum_{j=1}^{p}(X_{j}-\Lambda V_{j}^{\top})^{\top}(X_{j}-\Lambda V_{j}^{\top}) | -\frac{1}{\sigma^{2}}\Lambda^{\top}(X_{j}-\Lambda V_{j}^{\top}) | \frac{1}{\sigma^{2}}V^{\top}V
Poisson (log) | \sum_{j=1}^{p}1_{n}^{\top}\exp(\Lambda V_{j}^{\top})-X_{j}^{\top}(\Lambda V_{j}^{\top}) | -X_{j}^{\top}\Lambda+\exp(\Lambda V_{j}^{\top})^{\top}\Lambda | V^{\top}\text{Diag}[\exp(\Lambda_{i}V^{\top})]V
Gamma (log) | -\phi\sum_{j=1}^{p}[X_{j}^{\top}\Lambda V_{j}^{\top}+1_{n}^{\top}\log(-\Lambda V_{j}^{\top})] | (-1/\Lambda V_{j}^{\top})^{\top}\Lambda-X_{j}^{\top}\Lambda | V^{\top}\text{Diag}^{2}[\phi/\Lambda_{i}V^{\top}]V
Binomial (logit) | \sum_{j=1}^{p}-(w_{j}\circ X_{j})^{\top}\Lambda V_{j}^{\top}+w_{j}^{\top}\log(1_{n}+\exp(\Lambda V_{j}^{\top})) | [w_{j}/(1+\exp(-\Lambda V_{j}^{\top}))]^{\top}\Lambda-(w_{j}\circ X_{j})^{\top}\Lambda | V^{\top}\text{Diag}[w_{i}\circ\exp(\Lambda_{i}V^{\top})/(1+\exp(\Lambda_{i}V^{\top}))^{2}]V
Negative Binomial (\alpha, log) | \sum_{j=1}^{p}-X_{j}^{\top}\Lambda V_{j}^{\top}-\alpha 1_{n}^{\top}\log(1_{n}-\exp(\Lambda V_{j}^{\top})) | -X_{j}^{\top}\Lambda+\alpha[\exp(\Lambda V_{j}^{\top})/(1_{n}-\exp(\Lambda V_{j}^{\top}))]^{\top}\Lambda | V^{\top}\text{Diag}[\exp(\Lambda_{i}V^{\top})+\alpha\exp(2\Lambda_{i}V^{\top})]V
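As an illustration of the Poisson (log link) row of Table 1, the R sketch below evaluates the gradient in V_j and the Hessian in Λ_i and checks the gradient against a finite-difference approximation; the simulated inputs and dimensions are illustrative assumptions.

  # Poisson (log link) row of Table 1: negative log-likelihood, gradient in V_j,
  # and Hessian in Lambda_i, with a finite-difference check on the gradient.
  set.seed(2)
  n <- 30; p <- 4; q <- 2
  Lambda <- matrix(rnorm(n * q, sd = 0.3), n, q)
  V <- matrix(rnorm(p * q, sd = 0.3), p, q)
  X <- matrix(rpois(n * p, exp(Lambda %*% t(V))), n, p)

  nll <- function(V) sum(exp(Lambda %*% t(V)) - X * (Lambda %*% t(V)))

  grad_Vj <- function(j) {                       # gradient of nll with respect to V_j
    eta_j <- Lambda %*% V[j, ]                   # n x 1 linear predictor for column j
    as.vector(-t(X[, j]) %*% Lambda + t(exp(eta_j)) %*% Lambda)
  }
  hess_Lambda_i <- function(i) {                 # Hessian of nll with respect to Lambda_i
    eta_i <- as.vector(V %*% Lambda[i, ])        # p x 1 linear predictor for row i
    t(V) %*% diag(exp(eta_i)) %*% V
  }

  # finite-difference check of the gradient for j = 1
  j <- 1; eps <- 1e-6
  fd <- sapply(seq_len(q), function(k) {
    Vp <- V; Vp[j, k] <- Vp[j, k] + eps
    (nll(Vp) - nll(V)) / eps
  })
  max(abs(fd - grad_Vj(j)))                      # should be small
  hess_Lambda_i(1)                               # q x q Hessian block for row 1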

We then repeated the data simulation with the enlarged V^{*} and large dimension p=512 to compare the performance of PS and LAPL.

Figure 4: EFM Optimization Comparison, p = 512 and large V

As we can see from Figure 4, when the data exhibit high variance in the sampled gradients, the PS optimization deteriorates, indicating a higher requirement on the sample size S. In conclusion, the LAPL optimization should be preferred in scenarios with a non-Gaussian prior, even larger dimensionality, or potentially high variance in the gradient evaluation under posterior sampling. Due to its efficiency in the simulation studies, we adopt the posterior sampling optimization for our later empirical studies, with careful monitoring of the loss decrease.

4.2.  Covariance modeling

The Gaussian factor model is widely used due to its efficiency in covariance estimation for high dimensional data (Fan et al., 2008). Using a setup similar to Fan et al. (2008), we simulate quasi-factor datasets with n=756 and p\in\{66,116,\ldots,466\} under four families (quasi-Poisson, negative binomial, binomial, and Poisson). For comparison, we use the same prior configuration for both the factors Λ and the projection matrix V as in Table 1 of Fan et al. (2008). That is, for each p and family:

  • We first simulate, for i\in[n] and j\in[p],

    \Lambda_{i} \sim N\left[\begin{pmatrix}0.023558\\ 0.012989\\ 0.020714\end{pmatrix},\ \begin{pmatrix}1.2507&0&0\\ 0&0.31564&0\\ 0&0&0.19303\end{pmatrix}\right] (32)

    V_{j} \sim N\left[\begin{pmatrix}0.78282\\ 0.51803\\ 0.41003\end{pmatrix},\ \begin{pmatrix}0.029145&0.023873&0.010184\\ 0.023873&0.053951&-0.006967\\ 0.010184&-0.006967&0.086856\end{pmatrix}\right]

    \Phi_{j} \sim \begin{cases}\text{Gamma}(\alpha=4.0713,\ \beta=0.1623)&\text{for quasi-Poisson}\\ 1&\text{for others}\end{cases}

    w_{ij} \sim \begin{cases}\text{Poisson}(20)&\text{for binomial}\\ 1&\text{for others}\end{cases}
  • Conditional on the simulated (\Lambda,V,\Phi), we generate X\in\mathbb{F}^{n\times p} from the corresponding quasi-family with quasi-density f satisfying Eq (3).

  • Based upon the generated data X, the quasi-family defined by density f, and the prior of Λ in (32), we estimate \theta=(V,\Phi,\Lambda\,|\,X) by solving Eq (6).

  • We compute covariance estimates using

    (a) the naive sample covariance \widehat{\Sigma}_{\text{sam}}=\frac{X^{\top}X}{n-1}-\frac{X^{\top}11^{\top}X}{n(n-1)};

    (b) the total covariance \widehat{\Sigma}, obtained by plugging the estimated \theta into Eq (7);

    (c) the true covariance \Sigma, obtained by plugging the actual V, \Phi, and the Λ prior into Eq (7).

  • We compute the error of the covariance estimation via

    (a) the Frobenius norm \|\Sigma-\widehat{\Sigma}\|_{F};

    (b) the entropy loss \text{tr}(\widehat{\Sigma}\Sigma^{-1})-\log|\widehat{\Sigma}\Sigma^{-1}|-p;

    (c) the normalized loss \frac{1}{\sqrt{p}}\|\Sigma^{-1/2}(\widehat{\Sigma}-\Sigma)\Sigma^{-1/2}\|_{F}

    (a short R sketch of these losses is given after this list).

  • We repeat the above process k times.
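To make the loss computations concrete, below is a minimal R sketch of the three losses in the last step, applied to the naive sample covariance of step (a); the positive-definite Σ and the Gaussian data used here are illustrative placeholders for the quantities produced by the simulation above.

  # Covariance estimation losses used in the simulation study; Sigma and the
  # data below are illustrative placeholders for the simulated quantities.
  set.seed(3)
  n <- 756; p <- 66
  A <- matrix(rnorm(p * p), p, p)
  Sigma <- crossprod(A) / p + diag(p)                # a positive-definite "truth"
  X <- MASS::mvrnorm(n, mu = rep(0, p), Sigma = Sigma)

  # (a) naive sample covariance
  Sigma_sam <- crossprod(X) / (n - 1) -
    crossprod(X, matrix(1, n, n) %*% X) / (n * (n - 1))

  frobenius_loss <- function(S, S_hat) norm(S - S_hat, type = "F")
  entropy_loss <- function(S, S_hat) {
    M <- S_hat %*% solve(S)
    as.numeric(sum(diag(M)) - determinant(M, logarithm = TRUE)$modulus - nrow(S))
  }
  normalized_loss <- function(S, S_hat) {
    e <- eigen(S, symmetric = TRUE)
    S_inv_sqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
    norm(S_inv_sqrt %*% (S_hat - S) %*% S_inv_sqrt, type = "F") / sqrt(nrow(S))
  }

  c(frobenius = frobenius_loss(Sigma, Sigma_sam),
    entropy = entropy_loss(Sigma, Sigma_sam),
    normalized = normalized_loss(Sigma, Sigma_sam))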

With k = 5 replications, we obtain the covariance estimation errors shown in Figure 5:

Figure 5: Estimation Error on Covariance

As we observe from Figure 5, we obtain non-Gaussian covariance estimation errors similar to those reported by Fan et al. (2008) for the Gaussian case. Judging from the normalized and entropy losses, our EM optimization estimates the covariance matrix of the high dimensional data much more accurately than the naive sample covariance estimator.

4.3.  Computer vision data

To illustrate that our EFM provides more representative factorizations, we also conducted experiments on computer vision datasets.

4.3.1.  MNIST datasets

Perhaps the most popular computer vision dataset is MNIST, which contains 70,000 images of handwritten digits labeled from 0 to 9. However, modern machine learning research has shown that the MNIST classification task may be too simple, in the sense that an appropriately tuned classical machine learning algorithm can easily achieve 97% accuracy (Pishchik, 2023). With the same 70,000 pictures over 10 classes, the Fashion-MNIST dataset (Xiao et al., 2017) has been proposed to replace the original dataset with a more complex classification problem. We here examine our EFM factorization on the Fashion-MNIST dataset and compare the factorized components against DMF, NMF, and t-SNE.

To determine the rank, we adopted the rank determination proposition of Wang and Carvalho (2023). The resulting eigenvalue plot in Figure 6 indicates potential ranks three or seven for the factorization. We adopt rank three for visualization convenience; the simulated likelihood converges after 15 epochs with B=256, S=50, and \alpha=0.5.

Figure 6: Simulated Likelihood and Eigenvalue Gap for MNIST.

To illustrate the superior performance under a general model formulation, we used only the first 2,000 of the 70,000 samples as our training set to estimate \widehat{V}. After obtaining \widehat{V}, we conduct a penalized GLM regression on another 2,000-sample testing set X to estimate \widehat{\Lambda}_{i}\,|\,X; these \widehat{\Lambda}_{i}\,|\,X can be considered out-of-sample latent estimates based upon the EFM-estimated \widehat{V}. Due to the factorization algorithm setup, the t-SNE and NMF results are based on the 2,000 training samples, while the DMF and EFM results are obtained on the separate 2,000 testing samples. We summarize the factorization results in Figure 7.
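As an illustration of this out-of-sample step, the sketch below fits a ridge-penalized GLM of each held-out row on the estimated loadings using the glmnet package; the Poisson working family, the penalty level, and the simulated \widehat{V} and test rows are assumptions of the sketch rather than the exact settings of our experiments.

  # Out-of-sample latent estimation: each held-out row X_i is regressed on the
  # estimated loading matrix V_hat with a penalized GLM. The Poisson family,
  # ridge penalty, and simulated inputs are illustrative assumptions.
  library(glmnet)
  set.seed(4)
  p <- 784; q <- 3; n_test <- 5
  V_hat <- matrix(rnorm(p * q, sd = 0.2), p, q)
  Lambda_true <- matrix(rnorm(n_test * q), n_test, q)
  X_test <- matrix(rpois(n_test * p, exp(Lambda_true %*% t(V_hat))), n_test, p)

  estimate_lambda <- function(x_i, V_hat, lambda = 1e-2) {
    fit <- glmnet(x = V_hat, y = x_i, family = "poisson",
                  alpha = 0, lambda = lambda,            # ridge penalty
                  intercept = FALSE, standardize = FALSE)
    as.vector(coef(fit, s = lambda))[-1]                 # drop the intercept term
  }

  Lambda_hat <- t(apply(X_test, 1, estimate_lambda, V_hat = V_hat))
  dim(Lambda_hat)   # n_test x q out-of-sample latent scores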

Figure 7: Fashion-MNIST Factorization result (left to right shows results for EFM, DMF, NMF and t-SNE).

As indicated by the Fashion-MNIST factorization results, both our EFM and t-SNE show good separability of the 10 classes, with some mistakes on pullover, shirt, and coat, which are visually similar classes. NMF performs the worst, utilizing only two dimensions to separate the 10 classes. Without stochastic optimization and regularization, DMF performs no better than our EFM on the testing samples, potentially due to overfitting.

4.3.2.  ORL face dataset

To illustrate that our EFM method also provides reasonable uncertainty quantification, we conducted experiments on the ORL face dataset (Zhu et al., 2019), which contains 40 subjects with pictures taken under 10 different conditions. Each picture is 64x64 pixels, a high resolution for restoration. To add uncertainty to the image restoration, we cropped part of each face image by setting the pixel values to 0. For example, in Figure 8 the mouth of a person is covered with a white background.

Figure 8: Cropped ORL Face.

Similar to the eigen-face decomposition, we adopt q=41 for the factorization since we expect to find 40 individual face eigenvectors and one “average face”. We fit the EFM with a negative binomial family and estimated \widehat{\phi}=10.9 using the moment estimator. For comparison, we also conducted an eigen-face restoration with rank 40 after centering on the “average face”. The results are summarized in Figure 9.
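For reference, the sketch below implements one simple pooled moment estimator of φ under the variance function Var(X_ij) = μ_ij + μ_ij^2/φ; the fitted means are simulated placeholders for the EFM fit, and this is only one of several possible moment estimators.

  # One pooled moment estimator of the negative binomial dispersion phi based
  # on Var(X_ij) = mu_ij + mu_ij^2 / phi; mu_hat stands in for the EFM fit.
  set.seed(5)
  n <- 400; p <- 64; phi_true <- 10
  mu_hat <- matrix(exp(rnorm(n * p, mean = 1, sd = 0.5)), n, p)
  X <- matrix(rnbinom(n * p, size = phi_true, mu = mu_hat), n, p)
  phi_moment <- function(X, mu) length(X) / sum(((X - mu)^2 - mu) / mu^2)
  phi_moment(X, mu_hat)   # should be close to phi_true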

Figure 9: Restored Face: EFM (left) and Eigen-Face (right)

Our EFM not only restores the faces much more accurately than eigen-faces, but also quantifies the uncertainty in the image restoration. With the Laplace approximation, we compute the MAP estimator \widehat{\Lambda}_{i} and covariance \widehat{\Sigma}_{i} and simulate the latent variable \Lambda_{i}^{(s)} accordingly. The simulated \Lambda_{i}^{(s)} can then be combined with \widehat{V} to produce simulated human faces. As shown in Figure 10, the uncertainty due to the cropping of the image is correctly identified after centering the simulated faces, which display different mouth characteristics.
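A minimal sketch of this simulation step is given below, assuming the Laplace approximation N(\widehat{\Lambda}_{i}, \widehat{\Sigma}_{i}) has already been computed and using the log-link mean exp(\Lambda_{i}^{(s)}\widehat{V}^{\top}) for the reconstructed pixel intensities; all inputs are simulated placeholders for the estimated quantities.

  # Simulating reconstructions from the Laplace approximation
  # Lambda_i^(s) ~ N(Lambda_hat_i, Sigma_hat_i); inputs are placeholders for
  # the quantities estimated by the EFM fit.
  set.seed(6)
  q <- 41; p <- 64 * 64; S <- 100
  V_hat <- matrix(rnorm(p * q, sd = 0.05), p, q)
  Lambda_hat_i <- rnorm(q)                        # MAP estimate for one image
  Sigma_hat_i <- diag(0.01, q)                    # approximate posterior covariance
  Lambda_s <- MASS::mvrnorm(S, mu = Lambda_hat_i, Sigma = Sigma_hat_i)  # S x q draws
  faces <- exp(Lambda_s %*% t(V_hat))             # S simulated images, one per row
  face_sd <- matrix(apply(faces, 2, sd), 64, 64)  # pixel-wise uncertainty map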

Figure 10: Simulated ORL Face after Centering.

4.4.  Network analysis

Another interesting application of our weighted EFM is social network analysis, where interest often lies in summarizing a large adjacency matrix using lower-dimensional representations named “node embeddings”. Recently, emerging interest has been directed to node embedding inference based upon multiple networks. These are formally observed as multiple interaction graphs, \{\mathcal{G}^{(1)},\ldots,\mathcal{G}^{(k)}\}, consisting of multiple edge relationships, \{\mathcal{E}^{(1)},\ldots,\mathcal{E}^{(k)}\}, on the same set of vertices V. This emerging field of research is known as multi-layer or multiplex network analysis (Kivelä et al., 2014). Denoting the number of vertices by n=|V|, multiplex network inference starts by transforming these graphs into adjacency matrices \{A^{(1)},\ldots,A^{(k)}\} of the same dimension n\times n. Factorization and joint inference on these \{A^{(1)},\ldots,A^{(k)}\} have been shown to provide better node embeddings, with potential applications to community detection and link prediction (Wang et al., 2019; Jones and Rubin-Delanchy, 2020).

One way to enable joint inference on the adjacency matrices is to combine \{A^{(1)},\ldots,A^{(k)}\} into a single adjacency matrix \widetilde{A}. Draves (2022, Chapter 2) provides a comprehensive introduction to various aggregation techniques. The aggregation techniques relevant to this section are:

  • Average Adjacency Spectral Embedding (AASE; Tang et al., 2018), which simply averages the adjacency matrices:

    \widetilde{A}_{ij}=\frac{1}{k}\sum_{l=1}^{k}A_{ij}^{(l)} (33)

  • Unfolded Adjacency Spectral Embedding (UASE), which concatenates the adjacency matrices column-wise:

    \widetilde{A}=[A^{(1)},A^{(2)},\ldots,A^{(k)}]\in\mathbb{Z}^{n\times nk} (34)

  • Omnibus Embedding (OE), which constructs the following block matrix of pairwise averages:

    \widetilde{A}=\begin{bmatrix}A^{(1)}&\frac{1}{2}(A^{(1)}+A^{(2)})&\ldots&\frac{1}{2}(A^{(1)}+A^{(k)})\\ \frac{1}{2}(A^{(1)}+A^{(2)})&A^{(2)}&\ldots&\frac{1}{2}(A^{(2)}+A^{(k)})\\ \vdots&\vdots&\ddots&\vdots\\ \frac{1}{2}(A^{(k)}+A^{(1)})&\frac{1}{2}(A^{(k)}+A^{(2)})&\ldots&A^{(k)}\end{bmatrix} (35)

With D=\text{Diag}(d_{1},\ldots,d_{q}), d_{1}\geq\ldots\geq d_{q}, and \mathcal{S}_{n,q}(\mathbb{R}) the n-dimensional Stiefel manifold of order q, we can conduct an SVD for the asymmetric \widetilde{A} from UASE and an eigen-decomposition for the symmetric \widetilde{A} from AASE and OE:

\widetilde{A}^{(\text{AASE})} =UDU^{\top},\quad U\in\mathcal{S}_{n,q}, (36)
\widetilde{A}^{(\text{UASE})} =UDV^{\top},\quad U\in\mathcal{S}_{n,q},\ V\in\mathcal{S}_{nk,q},
\widetilde{A}^{(\text{OE})} =UDU^{\top},\quad U\in\mathcal{S}_{nk,q}.

The node embeddings Λ are then defined as \Lambda=UD^{1/2}, of dimension n\times q, where for the omnibus embedding we only take the first n rows of the eigenvectors (Levin et al., 2017).
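For concreteness, a minimal R sketch of the three aggregations in Eqs (33)-(35) and of the embeddings Λ = U D^{1/2} in Eq (36) follows; the adjacency matrices and the embedding dimension q = 3 are simulated, illustrative choices.

  # Aggregations of Eqs (33)-(35) and embeddings Lambda = U D^{1/2} of Eq (36)
  # for a list of k simulated adjacency matrices.
  set.seed(7)
  n <- 20; k <- 3; q <- 3
  A_list <- replicate(k, {
    A <- matrix(rbinom(n * n, 1, 0.2), n, n)
    A[lower.tri(A)] <- t(A)[lower.tri(A)]          # symmetrize
    diag(A) <- 0
    A
  }, simplify = FALSE)

  A_aase <- Reduce(`+`, A_list) / k                # AASE, Eq (33)
  A_uase <- do.call(cbind, A_list)                 # UASE, Eq (34): n x nk
  A_omni <- do.call(rbind, lapply(seq_len(k), function(l)   # OE, Eq (35): nk x nk
    do.call(cbind, lapply(seq_len(k), function(m) (A_list[[l]] + A_list[[m]]) / 2))))

  embed_eigen <- function(A, q) {                  # symmetric case: eigen-decomposition
    e <- eigen(A, symmetric = TRUE)
    e$vectors[, 1:q] %*% diag(sqrt(abs(e$values[1:q])))
  }
  embed_svd <- function(A, q) {                    # asymmetric case: SVD
    s <- svd(A, nu = q, nv = q)
    s$u %*% diag(sqrt(s$d[1:q]))
  }

  Lambda_aase <- embed_eigen(A_aase, q)                        # n x q
  Lambda_uase <- embed_svd(A_uase, q)                          # n x q
  Lambda_omni <- embed_eigen(A_omni, q)[1:n, , drop = FALSE]   # first n rows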

However, one obvious shortcoming of these aggregations is that the factorization implicitly assumes an equal contribution from each layer. In reality, the layers differ at least in their levels of sparsity, and treating interactions in a dense graph and interactions in a sparse graph equally over-emphasizes the dense graphs. Additionally, for a temporal network consisting of edge relationships among the same vertices across time steps, it is intuitively more sensible to apply higher weights to the more recent adjacency matrices, since they should be more informative for predicting future interactions, than to aggregate all time steps equally.

Naturally, one immediate improvement over an equal aggregation of the individual networks is to apply different weights in the factor inference. The EFM supports this aggregation by allowing entry-wise and layer-wise weights on the aggregated interactions. With this flexibility in heuristic weight specification, we demonstrate that we can obtain improved embedding results for multiplex network analysis.

We thus explore our EFM inference on the AUCS dataset (Dickison et al., 2016). The dataset records interactions among n=61 employees at the Department of Computer Science at Aarhus University. As we confirmed with the authors, 55 of those 61 employees carry a label from one of eight research groups, while 6 employees do not belong to any of the eight groups. The research group labels of the 55 employees can thus be treated as a known community structure to validate the effectiveness of community inference.

For the whole community, interactions are recorded according to five different online and offline relationships, whose adjacency matrices are displayed in Figure 11:

Figure 11: AUCS Network

As we observe from the interaction network data, the co-author network is clearly sparser than the other layers. Following the heuristic introduced at the beginning of this section, we propose to weight each interaction of each layer according to the sparsity of that layer. Specifically, with \lambda_{\text{max}}^{(k)} denoting the largest eigenvalue of adjacency matrix A^{(k)}, we

  • weight all diagonal terms with value 0, since the zero interaction of a node with itself does not carry any information;

  • weight each interaction by 1/\lambda_{\text{max}}^{(k)} so that each layer has its largest eigenvalue equal to 1;

  • weight all remaining zero interactions by the minimal value of the non-zero entries in the weight matrix, to reduce the bias towards zero from the absence of interactions across all layers.

The weights are then defined as:

W_{ij}=\begin{cases}0,&\text{if }i=j,\\ \sum_{l=1}^{k}\frac{1}{\lambda^{(l)}_{\text{max}}}A_{ij}^{(l)},&\text{if }i\neq j\text{ and }\sum_{l=1}^{k}A_{ij}^{(l)}>0,\\ \min_{i,j=1,\ldots,n}\Big\{\sum_{l=1}^{k}\frac{1}{\lambda^{(l)}_{\text{max}}}A_{ij}^{(l)}\,:\,\sum_{l=1}^{k}A_{ij}^{(l)}>0\Big\},&\text{if }i\neq j\text{ and }\sum_{l=1}^{k}A_{ij}^{(l)}=0.\end{cases} (37)

Finally, for the aggregated adjacency matrix we assign value 1 to A_{ij} whenever i and j interact in at least one layer, that is, A_{ij}=\max_{k}A_{ij}^{(k)}.
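A minimal sketch of this weighting scheme is given below; the simulated layer adjacency matrices stand in for the five AUCS layers.

  # Weight matrix W of Eq (37) and aggregated adjacency matrix A from a list
  # of simulated layer adjacency matrices (placeholders for the AUCS layers).
  set.seed(8)
  n <- 20; k <- 3
  A_list <- replicate(k, {
    A <- matrix(rbinom(n * n, 1, 0.2), n, n)
    A[lower.tri(A)] <- t(A)[lower.tri(A)]
    diag(A) <- 0
    A
  }, simplify = FALSE)

  # largest eigenvalue of each layer, used to normalize its interactions
  lambda_max <- sapply(A_list, function(A)
    max(eigen(A, symmetric = TRUE, only.values = TRUE)$values))

  W <- Reduce(`+`, Map(`/`, A_list, lambda_max))   # sum_l A^(l) / lambda_max^(l)
  A <- Reduce(pmax, A_list)                        # A_ij = max_k A_ij^(k)
  W[A == 0] <- min(W[W > 0])                       # entries with no interaction in any layer
  diag(W) <- 0                                     # self-interactions carry no weight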

With the weights W and the adjacency matrix A defined in Eq (37), we apply a binomial EFM with logit link to obtain three-dimensional node embeddings. We can then visualize the factorized embeddings according to the separability of the known research group community labels. For comparison with the AASE, UASE, and OE aggregation techniques, an SVD or eigen-decomposition is also conducted on the corresponding aggregated adjacency matrices \widetilde{A} to obtain the node embeddings defined in Eq (36). The visual comparison of the factorized embeddings \Lambda=UD^{1/2} is provided below:

Figure 12: AUCS embedding visualization comparison

As we can see from Figure 12, the weighted EFM separates more research groups than a naive SVD on any of the aggregated graphs. The classification result is also consistent with the existing literature (Magnani et al., 2021), which reports five major research groups identified by the publisher of the dataset.

5.  Conclusion

We propose two classes of flexible, robust, and efficient optimization algorithms for exponential family factor modeling. We start by formally addressing identifiability issues, in effect generalizing the singular value decomposition to deviance losses, which should yield more representative results for a broader range of real-world datasets. We address the main computational hurdle, integrating out the latent factors, by proposing robust approximation schemes based on EM and SGD optimization. The proposed algorithms improve on simulated maximum likelihood (SML) estimation by eliminating the asymptotic estimation bias with a moderate simulation sample size S. We provide an efficient R implementation of the proposed optimization algorithms. Additionally, by utilizing SGD optimization, our EFM generalizes better than alternative factorization models such as DMF. Both the simulation studies and the empirical studies provide compelling evidence for these advantages.

References

  • Ford et al. [1986] J. Kevin Ford, Robert C. MacCallum, and Marianne Tait. The application of exploratory factor analysis in applied psychology: A critical review and analysis. Personnel Psychology, 1986.
  • Prince et al. [2008] Simon J D Prince, James H Elder, Jonathan Warrell, and Fatima M Felisberti. Tied factor analysis for face recognition across large pose differences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
  • Fama and French [2015] Eugene F Fama and Kenneth R. French. A five-factor asset pricing model. Journal of Financial Economics, 2015.
  • Xu et al. [2021] Tianchen Xu, Ryan T. Demmer, and Gen Li. Zero-inflated poisson factor model with application to microbiome read counts. Biometrics, 2021.
  • Basilevsky [1994] Alexander Basilevsky. Statistical factor analysis and related methods. Wiley Series in Probability and Mathematical Statistics, 1994.
  • Wedel and Kamakura [2001] Michel Wedel and Wagner A. Kamakura. Factor analysis with (mixed) observed and latent variables in the exponential family. Psychometrika, 66:513–30, 2001.
  • Wedel et al. [2003] Michel Wedel, Ulf Böckenholt, and Wagner A. Kamakura. Factor models for multivariate count data. Journal of Multivariate Analysis, 2003.
  • Bartholomew et al. [2011] David J Bartholomew, Martin Knott, and Irini Moustaki. Latent variable models and factor analysis: A unified approach. John Wiley & Sons, 2011.
  • Tipping and Bishop [1998] Michael E. Tipping and Christopher M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 1998.
  • Anderson [1963] Theodore Wilbur Anderson. The use of factor analysis in the statistical analysis of multiple time series. Psychometrika, 1963.
  • Wang and Carvalho [2023] Liang Wang and Luis Carvalho. Deviance matrix factorization. Electronic Journal of Statistics, 17(2):3762–3810, 2023.
  • Fan et al. [2008] Jianqing Fan, Yingying Fan, and Jinchi Lv. High dimensional covariance matrix estimation using a factor model. Journal of Econometrics, 147(1):186–197, 2008.
  • Borenstein et al. [2010] Michael Borenstein, Larry V Hedges, Julian PT Higgins, and Hannah R Rothstein. A basic introduction to fixed-effect and random-effects models for meta-analysis. Research synthesis methods, 1(2):97–111, 2010.
  • Carter and Kohn [1996] Chris K Carter and Robert Kohn. Markov chain monte carlo in conditionally gaussian state space models. Biometrika, 83(3):589–601, 1996.
  • Shapiro [1985] Alexander Shapiro. Identifiability of factor analysis: Some results and open problems. Linear Algebra and its Applications, 70:1–7, 1985.
  • Kaiser [1958] Henry F Kaiser. The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23(3):187–200, 1958.
  • Gribonval and Schnass [2010] Rémi Gribonval and Karin Schnass. Dictionary identification—sparse matrix-factorization via l_1l\_1-minimization. IEEE Transactions on Information Theory, 56(7):3523–3539, 2010.
  • Lee and Seung [1999] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401, 1999.
  • Li et al. [2010] Zhao Li, Xindong Wu, and Hong Peng. Nonnegative matrix factorization on orthogonal subspace. Pattern Recognition Letters, 31(9):905–911, 2010.
  • Ma and Fu [2012] Yunqian Ma and Yun Fu. Manifold learning theory and applications, volume 434. CRC press Boca Raton, 2012.
  • Pennington et al. [2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
  • Kalayeh et al. [2014] Mahdi M Kalayeh, Haroon Idrees, and Mubarak Shah. Nmf-knn: Image annotation using weighted multi-view non-negative matrix factorization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 184–191, 2014.
  • Davenport et al. [2014] Mark A Davenport, Yaniv Plan, Ewout Van Den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference: A Journal of the IMA, 3(3):189–223, 2014.
  • Reimann et al. [2002] Clemens Reimann, Peter Filzmoser, and Robert G Garrett. Factor analysis applied to regional geochemical data: problems and possibilities. Applied geochemistry, 17(3):185–206, 2002.
  • Gopalan et al. [2015] Prem Gopalan, Jake M. Hofman, and David M. Blei. Scalable recommendation with hierarchical poisson factorization. Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, 2015.
  • Ruan and Di [2024] Kangrui Ruan and Xuan Di. Infostgcan: An information-maximizing spatial-temporal graph convolutional attention network for heterogeneous human trajectory prediction. Computers, 13(6), 2024. URL https://www.mdpi.com/2073-431X/13/6/151.
  • Rue et al. [2009] Havard Rue, Sara Martino, and Nicolas Chopin. Approximate bayesian inference for latent gaussian models by using integrated nested laplace approximations. Journal of the Royal Statistical Society, Series B, 2009.
  • Wu and Zhang [2003] Ningning Wu and Jing Zhang. Factor analysis based anomaly detection. IEEE Systems, Man and Cybernetics Society Information Assurance Workshop, 2003.
  • Lee [1995] Lung-fei Lee. Asymptotic bias in simulated maximum likelihood estimation of discrete choice models. Econometric Theory, 11:437–83, 1995.
  • Li et al. [2024] Zhenglin Li, Haibei Zhu, Houze Liu, Jintong Song, and Qishuo Cheng. Comprehensive evaluation of mal-api-2019 dataset by machine learning in malware detection. arXiv preprint arXiv:2403.02232, 2024.
  • Kitagawa [1987] Genshiro Kitagawa. Non-gaussian state—space modeling of nonstationary time series. Journal of the American statistical association, 82(400):1032–1041, 1987.
  • Levy and Goldberg [2014] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. Advances in neural information processing systems, 27, 2014.
  • Mikolov et al. [2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • Hoff et al. [2002] Peter D Hoff, Adrian E Raftery, and Mark S Handcock. Latent space approaches to social network analysis. Journal of the american Statistical association, 97(460):1090–1098, 2002.
  • Young and Scheinerman [2007] Stephen J Young and Edward R Scheinerman. Random dot product graph models for social networks. In Algorithms and Models for the Web-Graph: 5th International Workshop, WAW 2007, San Diego, CA, USA, December 11-12, 2007. Proceedings 5, pages 138–149. Springer, 2007.
  • Wedderburn [1974] Robert WM Wedderburn. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61(3):439–447, 1974.
  • Xia [2020] Weixuan Xia. The average of a negative-binomial lévy process and a class of lerch distributions. Communications in Statistics-Theory and Methods, 49(4):1008–1024, 2020.
  • Nelder and Pregibon [1987] John A Nelder and Daryl Pregibon. An extended quasi-likelihood function. Biometrika, 74(2):221–232, 1987.
  • Dempster et al. [1977a] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977a.
  • Golub and Welsch [1969] Gene H Golub and John H Welsch. Calculation of Gauss quadrature rules. Mathematics of computation, 23(106):221–230, 1969.
  • Fan et al. [2020] Jianqing Fan, Jianhua Guo, and Shurong Zheng. Estimating number of factors by adjusted eigenvalues thresholding. Journal of the American Statistical Association, 2020.
  • Robbins and Monro [1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
  • Bertsekas [1999] Dimitri P. Bertsekas. Nonlinear programming: 2nd edition. Athena Scientific, 1999.
  • Duchi et al. [2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011.
  • Tieleman and Hinton [2012] T. Tieleman and G. Hinton. Neural networks for machine learning. Technical report, 2012.
  • Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 3rd International Conference on Learning Representations, ICLR 2015, 2015.
  • Tierney and Kadane [1986] Luke Tierney and Joseph B Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the american statistical association, 81(393):82–86, 1986.
  • Xia and Zhang [2023] Weixuan Xia and Yuyang Zhang. On the absolute-value integral of a brownian motion with drift: Exact and asymptotic formulae. arXiv preprint arXiv:2312.04172, 2023.
  • Tierney et al. [1989] Luke Tierney, Robert E Kass, and Joseph B Kadane. Fully exponential laplace approximations to expectations and variances of nonpositive functions. Journal of the american statistical association, 84(407):710–716, 1989.
  • Dempster et al. [1977b] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39:1–22, 1977b.
  • Jank and Booth [2003] Wolfgang Jank and James Booth. Efficiency of monte carlo em and simulated maximum likelihood in two-stage hierarchical models. Journal of Computational and Graphical Statistics, 12:214–29, 2003.
  • Caffo et al. [2005] Brian S Caffo, Wolfgang Jank, and Galin L Jones. Ascent-based monte carlo expectation–maximization. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):235–251, 2005.
  • Jank [2006] Wolfgang Jank. Implementing and diagnosing the stochastic approximation em algorithm. Journal of Computational and Graphical Statistics, 15:803–29, 2006.
  • Yao et al. [2007] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
  • Pishchik [2023] Evgenii Pishchik. Trainable activations for image classification. 2023.
  • Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.
  • Zhu et al. [2019] Pengfei Zhu, Binyuan Hui, Changqing Zhang, Dawei Du, Longyin Wen, and Qinghua Hu. Multi-view deep subspace clustering networks. arXiv preprint arXiv:1908.01978, 2019.
  • Kivelä et al. [2014] Mikko Kivelä, Alex Arenas, Marc Barthelemy, James P Gleeson, Yamir Moreno, and Mason A Porter. Multilayer networks. Journal of complex networks, 2(3):203–271, 2014.
  • Wang et al. [2019] Shangsi Wang, Jesús Arroyo, Joshua T Vogelstein, and Carey E Priebe. Joint embedding of graphs. IEEE transactions on pattern analysis and machine intelligence, 43(4):1324–1336, 2019.
  • Jones and Rubin-Delanchy [2020] Andrew Jones and Patrick Rubin-Delanchy. The multilayer random dot product graph. arXiv preprint arXiv:2007.10455, 2020.
  • Draves [2022] Benjamin Draves. Joint spectral embeddings of random dot product graphs. PhD thesis, Boston University, 2022.
  • Tang et al. [2018] Runze Tang, Michael Ketcha, Alexandra Badea, Evan D Calabrese, Daniel S Margulies, Joshua T Vogelstein, Carey E Priebe, and Daniel L Sussman. Connectome smoothing via low-rank approximations. IEEE transactions on medical imaging, 38(6):1446–1456, 2018.
  • Levin et al. [2017] Keith Levin, Avanti Athreya, Minh Tang, Vince Lyzinski, Youngser Park, and Carey E Priebe. A central limit theorem for an omnibus embedding of multiple random graphs and implications for multiscale network inference. arXiv preprint arXiv:1705.09355, 2017.
  • Dickison et al. [2016] Mark E Dickison, Matteo Magnani, and Luca Rossi. Multilayer social networks. Cambridge University Press, 2016.
  • Magnani et al. [2021] Matteo Magnani, Luca Rossi, and Davide Vega. Analysis of multiplex social networks with r. Journal of Statistical Software, 98:1–30, 2021.