
Parameterizations for Gradient-based Markov Chain Monte Carlo on the Stiefel Manifold: A Comparative Study

Masahiro Tanaka ([email protected], ORCID: 0000-0002-2269-8468), Faculty of Economics, Fukuoka University, 8-19-1 Nanakuma, Jonan, Fukuoka, Japan 814-0180
Abstract.

Orthogonal matrices play an important role in probability and statistics, particularly in high-dimensional statistical models. Parameterizing these models with orthogonal matrices facilitates dimension reduction and parameter identification. However, establishing the theoretical validity of statistical inference in these models from a frequentist perspective is challenging, leading to a preference for Bayesian approaches because of their ability to offer consistent uncertainty quantification. Markov chain Monte Carlo methods are commonly used to approximate posterior distributions numerically, but sampling on the Stiefel manifold, the set of orthogonal matrices, poses significant difficulties. Although various strategies have been proposed for this purpose, gradient-based Markov chain Monte Carlo with parameterizations is the most efficient. However, a comprehensive comparison of these parameterizations is lacking in the existing literature. This study aims to fill this gap by evaluating the numerical efficiency of four alternative parameterizations of orthogonal matrices under equivalent conditions. The evaluation was conducted on four problems. The results suggest that the polar expansion parameterization is the most efficient, particularly for high-dimensional and complex problems. However, all parameterizations exhibit limitations in very high-dimensional or difficult tasks, emphasizing the need for further advancements in sampling methods for orthogonal matrices.

Markov chain Monte Carlo, orthogonal matrix, Stiefel manifold, parameterization
Copyright: ACM, 2024. Conference: 2024 3rd Asia Conference on Algorithms, Computing and Machine Learning (CACML 2024), March 22–24, 2024, Shanghai, China. DOI: 10.1145/3654823.3654895. ISBN: 979-8-4007-1641-6/24/03. CCS concepts: Mathematics of computing → Probabilistic algorithms; Mathematics of computing → Probabilistic inference problems; Computing methodologies → Uncertainty quantification; Mathematics of computing → Bayesian computation.

1. Introduction

Orthogonal matrices play essential roles in probability and statistics. Against the backdrop of the rise of big data, high-dimensional statistical models have gained increasing attention, many of which are parameterized in terms of orthogonal matrices for dimensionality reduction and parameter identification, for example, principal component analysis, factor analysis, matrix completion, reduced rank regression, and models for network data (Shi et al., 2017; Du et al., 2023). It is difficult to establish the theoretical validity of statistical inference in this class of models from a frequentist perspective. Therefore, a Bayesian approach to inference in models parameterized with orthogonal matrices is a natural choice because it can provide consistent uncertainty quantification, that is, interval estimates and hypothesis tests.

Because a posterior distribution is generally intractable, it is necessary to obtain its numerical approximation. The standard approach is Markov chain Monte Carlo (MCMC) (Liu, 2004), which can provide asymptotically exact approximation of the target distribution, unlike other methods such as the Laplace approximation and variational Bayes (Bishop, 2006). However, sampling on the Stiefel manifold, the set of orthogonal matrices, is notoriously difficult.

Several strategies have been proposed for sampling on the Stiefel manifold. Hoff (Hoff, 2009) proposed a rejection-sampling algorithm, which is employed in (Lin et al., 2017; Pal et al., 2020). Byrne and Girolami (Byrne and Girolami, 2013) introduced geodesic Monte Carlo (GMC). Currently, the most efficient approach is to parameterize orthogonal matrices with unconstrained parameters and simulate from the unconstrained space using the no-U-turn sampler (NUTS) (Hoffman and Gelman, 2014), an adaptive version of the Hamiltonian Monte Carlo method (Duane et al., 1987; Neal, 2011). For this purpose, four alternative parameterizations have been considered: the Householder parameterization (Nirwan and Bertschinger, 2019), the Cayley transform (Jauch et al., 2020), the Givens representation (Pourzanjani et al., 2021), and polar expansion (Jauch et al., 2021).

There has been no thorough comparison of these parameterizations, and the existing literature provides only partial information about their relative advantages. In (Jauch et al., 2020), it was reported that NUTS with the Cayley transform is more efficient than the rejection sampler of (Hoff, 2009) in terms of the autocorrelations of the obtained chains. Pourzanjani et al. (Pourzanjani et al., 2021) applied NUTS with the Givens representation and GMC to uniform sampling on the Stiefel manifold and to the network eigenmodel (Hoff, 2009), reporting the superiority of the former in terms of effective sample size per iteration (ESS/iter). In (Jauch et al., 2021), three strategies, namely, NUTS with polar expansion, GMC, and rejection sampling, were tested on the network eigenmodel, and it was shown that NUTS with polar expansion is the best in terms of ESS/iter. In (Byrne and Girolami, 2013), GMC and the rejection sampler were applied to the network eigenmodel; the authors only showed that the latter got stuck in local modes, whereas the former did not. The study proposing NUTS with the Householder parameterization (Nirwan and Bertschinger, 2019) did not compare the proposed approach with other approaches. From these previous studies, NUTS with some parameterization appears to be better than the rejection sampler and GMC, whereas the relative ranking of NUTS under different parameterizations has not been investigated rigorously.

The primary objective of this study is to fill this gap by comparing the numerical efficiency of the four parameterizations of orthogonal matrices under identical conditions and exploring the best current practice. We apply NUTS with the four alternative parameterizations to four problems: uniform distribution, network eigenmodel, probabilistic principal component analysis, and matrix completion for panel causal analysis. The first three problems were adopted from existing studies, whereas the last is novel to the literature.

We evaluated the numerical efficiency of the different parameterizations based on the minimum ESS per second (minESS/sec) as well as the minimum ESS per iteration (minESS/iter). Our choice of performance measure differs notably from that in the existing literature. Although some studies have measured the numerical efficiency of a posterior simulator based on ESS/iter for certain parameters (Pourzanjani et al., 2021; Jauch et al., 2021), no existing study reports minESS/sec or minESS/iter. In the MCMC literature, it is standard practice to evaluate the numerical efficiency of a posterior simulator using the minimum ESS, that is, the performance on the dimension that is the most difficult to explore. In addition, even if minESS/iter is small, we can effectively approximate the posterior distribution by running MCMC for a longer duration because MCMC algorithms are justified by an asymptotic argument. Thus, for practitioners, minESS/sec is more important than minESS/iter.

This study provides two takeaways. First, polar expansion appears to be the best choice among the four parameterizations, particularly with respect to minESS/sec. Although the other parameterizations work for low-dimensional and/or simple problems, the relative advantage of polar expansion is evident in high-dimensional and/or difficult problems. Second, when a sampling task is considerably high-dimensional and/or difficult, none of the four parameterizations performs well. This implies that the obtained posterior samples must be treated with caution. This study highlights the need for further improvements in this area.

The remainder of this paper is organized as follows. Section 2 introduces the four alternative parameterizations. In Section 3, we apply NUTS with the four parameterizations to four statistical models, and compare their numerical performance. Section 4 concludes the paper.

2. Parameterizations for Orthogonal Matrices

A $J\times K$ orthogonal matrix $\boldsymbol{\Upsilon}$ is simulated from the Stiefel manifold $\mathcal{V}^{J\times K}$, the set of $J\times K$ orthogonal matrices,

\boldsymbol{\Upsilon}\in\mathcal{V}^{J\times K}=\left\{\boldsymbol{\Upsilon}\in\mathbb{R}^{J\times K}:\boldsymbol{\Upsilon}^{\top}\boldsymbol{\Upsilon}=\boldsymbol{I}_{K}\right\},

where $\boldsymbol{I}_{K}$ is the $K\times K$ identity matrix. Let $\mathcal{D}$ denote the set of observations. The target kernel is the posterior of $\boldsymbol{\Upsilon}$ conditional on $\mathcal{D}$ evaluated up to a normalizing constant, represented by the product of the likelihood $f(\mathcal{D}|\boldsymbol{\Upsilon})$ and the prior density of $\boldsymbol{\Upsilon}$, $p(\boldsymbol{\Upsilon})$: $\pi(\boldsymbol{\Upsilon})=f(\mathcal{D}|\boldsymbol{\Upsilon})p(\boldsymbol{\Upsilon})$. For numerical efficiency, instead of dealing with $\boldsymbol{\Upsilon}$ directly, we parameterize $\boldsymbol{\Upsilon}$ using an auxiliary vector $\boldsymbol{\varphi}$ and simulate $\boldsymbol{\varphi}$ from an unconstrained space. The target kernel is modified by representing $\boldsymbol{\Upsilon}$ as a matrix-valued function of $\boldsymbol{\varphi}$ and multiplying by the determinant of the Jacobian of the transformation,

\pi\left(\boldsymbol{\Upsilon}\left(\boldsymbol{\varphi}\right)\right)=f\left(\mathcal{D}|\boldsymbol{\Upsilon}\left(\boldsymbol{\varphi}\right)\right)p\left(\boldsymbol{\Upsilon}\left(\boldsymbol{\varphi}\right)\right)\left|\frac{\partial\boldsymbol{\Upsilon}}{\partial\boldsymbol{\varphi}^{\top}}\right|.

In the following, we introduce four alternative parameterizations of $\boldsymbol{\Upsilon}$. See (Shepard et al., 2015) for a further discussion of these parameterizations from mathematical and numerical points of view.

2.1. Polar expansion

Jauch et al. (Jauch et al., 2021) parameterized an orthogonal matrix $\boldsymbol{\Upsilon}$ with a $J\times K$ unconstrained matrix $\tilde{\boldsymbol{\Upsilon}}\in\mathbb{R}^{J\times K}$ using the polar expansion, $\boldsymbol{\Upsilon}=\tilde{\boldsymbol{\Upsilon}}(\tilde{\boldsymbol{\Upsilon}}^{\top}\tilde{\boldsymbol{\Upsilon}})^{-\frac{1}{2}}$, where $(\tilde{\boldsymbol{\Upsilon}}^{\top}\tilde{\boldsymbol{\Upsilon}})^{-\frac{1}{2}}$ denotes the inverse of the symmetric square root of $\tilde{\boldsymbol{\Upsilon}}^{\top}\tilde{\boldsymbol{\Upsilon}}$. Representing the singular value decomposition of $\tilde{\boldsymbol{\Upsilon}}$ as $\tilde{\boldsymbol{\Upsilon}}=\boldsymbol{U}\boldsymbol{S}\boldsymbol{V}^{\top}$, we get $(\tilde{\boldsymbol{\Upsilon}}^{\top}\tilde{\boldsymbol{\Upsilon}})^{-\frac{1}{2}}=\boldsymbol{V}\boldsymbol{S}^{-1}\boldsymbol{V}^{\top}$, so that $\boldsymbol{\Upsilon}=\boldsymbol{U}\boldsymbol{V}^{\top}$. We sample $\boldsymbol{\varphi}=\mathrm{vec}(\tilde{\boldsymbol{\Upsilon}})$, where $\mathrm{vec}(\cdot)$ denotes the column-wise vectorization operator. Following (Jauch et al., 2021), we set the intermediate distribution to the Wishart distribution. Then, the target kernel is specified as

\pi\left(\boldsymbol{\Upsilon}\left(\boldsymbol{\varphi}\right)\right)\propto f\left(\mathcal{D}|\boldsymbol{\Upsilon}\left(\boldsymbol{\varphi}\right)\right)p\left(\boldsymbol{\Upsilon}\left(\boldsymbol{\varphi}\right)\right)\exp\left(-\frac{1}{2}\boldsymbol{\varphi}^{\top}\boldsymbol{\varphi}\right).
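To make the mapping concrete, a minimal NumPy sketch is given below; the function name is ours, and this is not the authors' Stan implementation. It uses the identity $\boldsymbol{\Upsilon}=\boldsymbol{U}\boldsymbol{V}^{\top}$ noted above.

```python
import numpy as np

def polar_expand(Upsilon_tilde):
    """Map an unconstrained J x K matrix to the Stiefel manifold via the
    polar expansion Upsilon = Upsilon_tilde (Upsilon_tilde' Upsilon_tilde)^(-1/2)."""
    U, s, Vt = np.linalg.svd(Upsilon_tilde, full_matrices=False)
    return U @ Vt  # equals Upsilon_tilde (Upsilon_tilde' Upsilon_tilde)^(-1/2)
```

In the experiments below, the corresponding map is written in Stan so that NUTS can differentiate through it automatically.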

2.2. Householder parameterization

Nirwan and Bertschinger (Nirwan and Bertschinger, 2019) proposed a parameterization based on Householder transformations. The orthogonal matrix $\boldsymbol{\Upsilon}$ is written as an ordered product of Householder reflectors:

\boldsymbol{\Upsilon}=\boldsymbol{H}_{J}\left(\boldsymbol{v}_{J}\right)\boldsymbol{H}_{J-1}\left(\boldsymbol{v}_{J-1}\right)\cdots\boldsymbol{H}_{J-K+1}\left(\boldsymbol{v}_{J-K+1}\right)\boldsymbol{I}_{J\times K},
\boldsymbol{H}_{n}=\left(\begin{array}{cc}\boldsymbol{I}_{J-n}&\boldsymbol{O}_{\left(J-n\right)\times n}\\ \boldsymbol{O}_{n\times\left(J-n\right)}&\tilde{\boldsymbol{H}}_{n}\end{array}\right),\quad\tilde{\boldsymbol{H}}_{n}=-\mathrm{sgn}\left(v_{n,1}\right)\left(\boldsymbol{I}_{n}-2\boldsymbol{u}_{n}\boldsymbol{u}_{n}^{\top}\right),
\boldsymbol{u}_{n}=\frac{\boldsymbol{v}_{n}+\mathrm{sgn}\left(v_{n,1}\right)\left\|\boldsymbol{v}_{n}\right\|\boldsymbol{e}_{n,1}}{\left\|\boldsymbol{v}_{n}+\mathrm{sgn}\left(v_{n,1}\right)\left\|\boldsymbol{v}_{n}\right\|\boldsymbol{e}_{n,1}\right\|},\quad\boldsymbol{v}_{n}\sim\mathcal{N}\left(\boldsymbol{0},\boldsymbol{I}\right),

where $\boldsymbol{O}_{m\times n}$ denotes an $m\times n$ matrix of zeros, $\boldsymbol{I}_{J\times K}$ is the $J\times K$ matrix with the identity matrix as its top block and the remaining entries zero, $\boldsymbol{e}_{n,1}=(1,\boldsymbol{0}_{n-1}^{\top})^{\top}$, $\mathrm{sgn}(\cdot)$ is the sign function, and $\left\|\cdot\right\|$ is the Euclidean norm. The vector to be sampled is $\boldsymbol{\varphi}=(\boldsymbol{v}_{J}^{\top},\boldsymbol{v}_{J-1}^{\top},...,\boldsymbol{v}_{J-K+1}^{\top})^{\top}$. Using this approach, we can avoid computing the Jacobian adjustment term.
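A minimal NumPy sketch of this construction is given below; it assumes the vectors are supplied as a list $[\boldsymbol{v}_{J},\ldots,\boldsymbol{v}_{J-K+1}]$ and forms each reflector explicitly, which is transparent but not the most economical implementation (the helper name and argument layout are ours, not those of the original Stan code).

```python
import numpy as np

def householder_orthogonal(v_list, J, K):
    """Build a J x K orthogonal matrix from unconstrained vectors
    v_J, v_{J-1}, ..., v_{J-K+1} (lengths J, J-1, ..., J-K+1)."""
    Upsilon = np.eye(J, K)                       # I_{J x K}
    # Apply the right-most reflector H_{J-K+1} first so that the final
    # product is H_J H_{J-1} ... H_{J-K+1} I_{J x K}.
    for v in reversed(v_list):
        n = v.size
        sgn = 1.0 if v[0] >= 0 else -1.0
        e1 = np.zeros(n); e1[0] = 1.0
        w = v + sgn * np.linalg.norm(v) * e1
        u = w / np.linalg.norm(w)
        H_tilde = -sgn * (np.eye(n) - 2.0 * np.outer(u, u))
        H = np.eye(J)
        H[J - n:, J - n:] = H_tilde              # block-diagonal H_n
        Upsilon = H @ Upsilon
    return Upsilon
```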

2.3. Cayley transform

Jauch et al. (Jauch et al., 2020) parameterized orthogonal matrices using the modified Cayley transform (Shepard et al., 2015). Given a $J\times J$ skew-symmetric matrix

\boldsymbol{X}\in\mathrm{Skew}\left(J\right)=\left\{\boldsymbol{X}\in\mathbb{R}^{J\times J}:\boldsymbol{X}=-\boldsymbol{X}^{\top}\right\},

the modified Cayley transform of $\boldsymbol{X}$ can be written as

\boldsymbol{\Upsilon}=\left(\boldsymbol{I}_{J}+\boldsymbol{X}\right)\left(\boldsymbol{I}_{J}-\boldsymbol{X}\right)^{-1}\boldsymbol{I}_{J\times K}.

For brevity, we hereinafter refer to this transformation as the Cayley transform. $\boldsymbol{X}$ has the following block structure:

\boldsymbol{X}=\left(\begin{array}{cc}\boldsymbol{B}&-\boldsymbol{A}^{\top}\\ \boldsymbol{A}&\boldsymbol{O}_{\left(J-K\right)\times\left(J-K\right)}\end{array}\right),

where $\boldsymbol{A}\in\mathbb{R}^{\left(J-K\right)\times K}$ and $\boldsymbol{B}\in\mathrm{Skew}\left(K\right)$. Let $\boldsymbol{b}$ be the $K\left(K-1\right)/2$-dimensional vector of independent elements of $\boldsymbol{B}$ obtained by eliminating the diagonal and supradiagonal elements of $\mathrm{vec}\left(\boldsymbol{B}\right)$. We sample $\boldsymbol{\varphi}=\left(\boldsymbol{b}^{\top},\mathrm{vec}\left(\boldsymbol{A}\right)^{\top}\right)^{\top}$. The Jacobian adjustment term is defined as follows:

\left|\frac{\partial\boldsymbol{\Upsilon}}{\partial\boldsymbol{\varphi}^{\top}}\right|=\left|2^{2}\boldsymbol{\Gamma}^{\top}\left(\boldsymbol{G}_{1}\otimes\boldsymbol{G}_{2}\right)\boldsymbol{\Gamma}\right|^{\frac{1}{2}},
\boldsymbol{G}_{1}=\left(\boldsymbol{I}_{J}-\boldsymbol{X}\right)^{-1}\boldsymbol{I}_{J\times K}\boldsymbol{I}_{J\times K}^{\top}\left(\boldsymbol{I}_{J}-\boldsymbol{X}\right)^{-\top},\quad\boldsymbol{G}_{2}=\left(\boldsymbol{I}_{J}-\boldsymbol{X}\right)^{-\top}\left(\boldsymbol{I}_{J}-\boldsymbol{X}\right)^{-1},
\boldsymbol{\Gamma}=\left(\begin{array}{cc}\boldsymbol{\Gamma}_{1}&\boldsymbol{\Gamma}_{2}\end{array}\right),\quad\boldsymbol{\Gamma}_{1}=\left(\boldsymbol{\Theta}_{1}^{\top}\otimes\boldsymbol{\Theta}_{1}^{\top}\right)\boldsymbol{D}_{K},\quad\boldsymbol{\Gamma}_{2}=\left(\boldsymbol{I}_{J^{2}}-\boldsymbol{K}_{J,J}\right)\left(\boldsymbol{\Theta}_{1}^{\top}\otimes\boldsymbol{\Theta}_{2}^{\top}\right),
\boldsymbol{\Theta}_{1}=\left(\begin{array}{cc}\boldsymbol{I}_{K}&\boldsymbol{O}_{K\times\left(J-K\right)}\end{array}\right),\quad\boldsymbol{\Theta}_{2}=\left(\begin{array}{cc}\boldsymbol{O}_{\left(J-K\right)\times K}&\boldsymbol{I}_{J-K}\end{array}\right),

where $\boldsymbol{D}_{K}$ is the $K^{2}\times K\left(K-1\right)/2$ matrix such that $\boldsymbol{D}_{K}\boldsymbol{b}=\mathrm{vec}\left(\boldsymbol{B}\right)$ and $\boldsymbol{K}_{J,J}$ is the commutation matrix satisfying $\boldsymbol{K}_{J,J}\mathrm{vec}\left(\boldsymbol{M}\right)=\mathrm{vec}\left(\boldsymbol{M}^{\top}\right)$ for any $J\times J$ matrix $\boldsymbol{M}$. See Appendix B of (Jauch et al., 2020) (pp. 1581-1582) for further details on the construction of $\boldsymbol{D}_{K}$ and $\boldsymbol{K}_{J,J}$.
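The sketch below illustrates only the deterministic map from $(\boldsymbol{b},\boldsymbol{A})$ to $\boldsymbol{\Upsilon}$; the Jacobian adjustment above is omitted, the ordering used to fill $\boldsymbol{B}$ from $\boldsymbol{b}$ is illustrative, and the function name is ours rather than that of the replication code.

```python
import numpy as np

def cayley_orthogonal(b, A, J, K):
    """Map unconstrained parameters b (K(K-1)/2 entries) and A ((J-K) x K)
    to a J x K orthogonal matrix via the modified Cayley transform."""
    B = np.zeros((K, K))
    B[np.tril_indices(K, -1)] = b   # strictly lower triangle (ordering illustrative)
    B = B - B.T                     # skew-symmetric B
    X = np.zeros((J, J))
    X[:K, :K] = B
    X[K:, :K] = A
    X[:K, K:] = -A.T                # block structure of X
    I_JK = np.eye(J, K)
    return (np.eye(J) + X) @ np.linalg.solve(np.eye(J) - X, I_JK)
```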

2.4. Givens representation

Pourzanjani et al. (Pourzanjani et al., 2021) proposed to use a Givens representation of $\boldsymbol{\Upsilon}$:

\boldsymbol{\Upsilon}=\boldsymbol{R}_{1,2}\left(\theta_{1,2}\right)\cdots\boldsymbol{R}_{1,J}\left(\theta_{1,J}\right)\boldsymbol{R}_{2,3}\left(\theta_{2,3}\right)\cdots\boldsymbol{R}_{2,J}\left(\theta_{2,J}\right)\cdots\boldsymbol{R}_{K,K+1}\left(\theta_{K,K+1}\right)\cdots\boldsymbol{R}_{K,J}\left(\theta_{K,J}\right)\boldsymbol{I}_{J\times K},

where $\theta_{k,j}$ is a coordinate variable and $\boldsymbol{R}_{k,j}\left(\theta_{k,j}\right)$ is a $J\times J$ matrix that takes the form of an identity matrix except that the $\left(j,j\right)$ and $\left(k,k\right)$ entries are replaced by $\cos\theta_{k,j}$ and the $\left(j,k\right)$ and $\left(k,j\right)$ entries are replaced by $-\sin\theta_{k,j}$ and $\sin\theta_{k,j}$, respectively. The coordinates $\theta_{1,2},\theta_{2,3},...,\theta_{K,K+1}$ lie in $\left(-\pi,\pi\right)$, and the remaining coordinates lie in $\left(-2\pi,2\pi\right)$. This parameterization can induce the posterior of $\theta_{k,j}$ to be multimodal (Pourzanjani et al., 2021). To address this, Pourzanjani et al. (Pourzanjani et al., 2021) suggested further reparameterizing $\theta_{k,j}$ with an independent auxiliary parameter $r_{k,j}$,

\varphi_{k,j}^{\flat}=r_{k,j}\cos\theta_{k,j},\quad\varphi_{k,j}^{\sharp}=r_{k,j}\sin\theta_{k,j}.

The marginal distribution of $r_{k,j}$ is a normal distribution with a mean of one and a standard deviation of 0.1, $r_{k,j}\sim\mathcal{N}\left(1,0.1^{2}\right)$. See Section 4.2 of (Pourzanjani et al., 2021) (pp. 647-653) for further discussion. The Jacobian adjustment term is:

g\left(\boldsymbol{\varphi}\right)=\prod_{k=1}^{K}\prod_{j=k+1}^{J}\left(\cos\theta_{k,j}\right)^{j-k-1}.
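For concreteness, the following sketch assembles $\boldsymbol{\Upsilon}$ from a set of angles by explicitly forming each rotation matrix; the data structure holding the angles and the helper name are illustrative rather than the original implementation.

```python
import numpy as np

def givens_orthogonal(theta, J, K):
    """Build a J x K orthogonal matrix from angles theta[(k, j)] with
    1 <= k <= K and k < j <= J (1-indexed as in the text)."""
    Upsilon = np.eye(J, K)
    # Apply the right-most rotation R_{K,J} first so that the product is
    # R_{1,2} ... R_{1,J} R_{2,3} ... R_{K,J} I_{J x K}.
    for k in reversed(range(1, K + 1)):
        for j in reversed(range(k + 1, J + 1)):
            t = theta[(k, j)]
            R = np.eye(J)
            R[k - 1, k - 1] = np.cos(t)
            R[j - 1, j - 1] = np.cos(t)
            R[j - 1, k - 1] = -np.sin(t)     # (j, k) entry
            R[k - 1, j - 1] = np.sin(t)      # (k, j) entry
            Upsilon = R @ Upsilon
    return Upsilon
```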

Table 1 summarizes the number of essential parameters, that is, the dimension of $\boldsymbol{\varphi}$, for the four alternative parameterizations. The polar expansion has the largest number of essential parameters, whereas the Householder parameterization and the Cayley transform have the smallest.

Table 1. Number of essential parameters
Approach       # of essential parameters
Polar          $JK$
Householder    $JK-K\left(K-1\right)/2$
Cayley         $K\left(K-1\right)/2+K\left(J-K\right)$
Givens         $JK-K\left(K-1\right)/2$

3. Comparative Simulation Study

This section compares the numerical efficiencies of the alternative parameterizations on four testing benches: uniform distribution, network eigenmodel, probabilistic principal component analysis, and matrix completion for panel causal analysis. We employed NUTS (Hoffman and Gelman, 2014), an adaptive version of the Hamiltonian Monte Carlo method (Duane et al., 1987; Neal, 2011). All computations were performed using the CmdStan software (version 2.33.1, https://mc-stan.org/users/interfaces/cmdstan) with R (version 4.1.2) (R Core Team, 2021, https://cran.r-project.org/) and the cmdstanr package (version 0.7.0), running on an Ubuntu 22.04.3 desktop with an AMD Ryzen Threadripper 3990X 2.9 GHz processor. All Stan files were based on the replication codes provided by the authors of the original papers: Householder parameterization, https://github.com/RSNirwan/HouseholderBPCA; Cayley transform, https://github.com/michaeljauch/cayley; polar expansion, https://github.com/michaeljauch/polar; Givens representation, https://github.com/pourzanj/TfRotationPca/tree/master. For each posterior simulation, we generated 1,000 posterior draws and used the final 500 draws for evaluation. The numerical efficiency of each approach was measured using the minimum ESS (minESS) over the main iterations, obtained using the mcmcse package (version 1.5-0). minESS was normalized by the number of iterations and by wall-clock elapsed time, denoted as minESS/iter and minESS/sec, respectively. We report averages over 64 runs.
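For clarity, the sketch below shows how the two reported measures are obtained from a vector of per-parameter ESS values; the function and its inputs are hypothetical, and in the experiments the ESS values come from the mcmcse package in R.

```python
import numpy as np

def efficiency_measures(ess, n_iter, wall_clock_sec):
    """Normalize the minimum effective sample size across all parameters
    by the number of iterations and by elapsed wall-clock time."""
    min_ess = float(np.min(ess))      # worst-explored dimension
    return {"minESS/iter": min_ess / n_iter,
            "minESS/sec": min_ess / wall_clock_sec}
```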

3.1. Uniform sampling on the Stiefel manifold

We compared the alternative parameterizations for uniform sampling on the Stiefel manifold. The dimension of the target distribution was set to combinations of $K\in\left\{3,10\right\}$ and $J\in\left\{10,100,200\right\}$. We did not examine the Cayley-transform approach in the case with $J=K$ because it requires considerable coding effort: when $J=K$, $\boldsymbol{A}$ disappears and $\boldsymbol{X}=\boldsymbol{B}$, so conditional executions must be added to virtually all lines involving $\boldsymbol{X}$. Owing to the limitations of the computational budget, we did not test the Givens representation for the highest-dimensional case, $\left(J,K\right)=\left(200,10\right)$.

Table 2 presents the simulation results. Polar expansion was the best parameterization regardless of the efficiency measure, and its margin of advantage was larger in the high-dimensional cases. In terms of the minESS/iter metric, the remaining three parameterizations were comparable. In terms of minESS/sec, Householder was the most efficient parameterization after polar expansion, whereas the Givens representation was the worst.

Table 2. Simulation result: Uniform sampling
(a) minESS/iter
J     K    Polar      Householder  Cayley   Givens
10    3    1.005      0.403        0.225    0.296
100   3    0.972      0.220        0.202    0.244
200   3    0.663      0.226        0.177    –
10    10   0.938      0.485        –        0.242
100   10   0.465      0.270        0.163    0.175
200   10   0.384      0.204        0.136    –
(b) minESS/sec
J     K    Polar      Householder  Cayley   Givens
10    3    11137.636  2420.666     225.096  18.020
100   3    1360.927   6.774        0.399    0.016
200   3    481.104    0.820        0.061    –
10    10   1804.136   965.346      –        3.363
100   10   185.496    1.671        0.019    0.001
200   10   64.135     0.134        0.001    –

3.2. Network eigenmodel

The network eigenmodel was first introduced by Hoff (Hoff, 2009) and has been widely used as a testing bench (Byrne and Girolami, 2013; Jauch et al., 2021; Pourzanjani et al., 2021); however, a thorough comparison of alternative parameterizations is absent. A graph matrix represents how proteins in a protein network interact with each other. The probability of a connection between proteins is modeled using a symmetric matrix of probabilities whose rank is much smaller than the dimension of the observed graph matrix. Given a $J\times J$ symmetric graph matrix $\boldsymbol{Y}=\left(y_{j,j^{\prime}}\right)$ with $y_{j,j^{\prime}}\in\left\{0,1\right\}$, the probability of the connections is specified as follows: for $j,j^{\prime}=1,...,J$,

y_{j,j^{\prime}}\sim\mathrm{Bernoulli}\left(\Phi\left(\left(\boldsymbol{\Upsilon}\boldsymbol{\Lambda}\boldsymbol{\Upsilon}^{\top}\right)_{j,j^{\prime}}+\mu\right)\right),

where $\boldsymbol{\Lambda}=\mathrm{diag}\left(\lambda_{1},\lambda_{2},\lambda_{3}\right)$, $\boldsymbol{\Upsilon}\in\mathcal{V}^{J\times K}$, $\mu$, and $\lambda_{k}$ are unknown parameters, and $\Phi\left(\cdot\right)$ is the standard normal cumulative distribution function. Following previous studies, we selected $K=3$. We assign normal priors to $\mu$ and $\lambda_{k}$: $\mu\sim\mathcal{N}\left(0,10^{2}\right)$ and $\lambda_{k}\sim\mathcal{N}\left(0,J\right)$. The dimension of the dataset was $J=270$.
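To make the likelihood concrete, the following sketch simulates a graph from this model; it is a hypothetical data-generating snippet (the function name and use of SciPy are ours), not part of the estimation code.

```python
import numpy as np
from scipy.stats import norm

def simulate_network(Upsilon, lam, mu, rng):
    """Simulate a symmetric binary graph from the network eigenmodel
    y_{jj'} ~ Bernoulli(Phi((Upsilon Lam Upsilon')_{jj'} + mu))."""
    J = Upsilon.shape[0]
    M = Upsilon @ np.diag(lam) @ Upsilon.T + mu   # latent probit means
    P = norm.cdf(M)                               # connection probabilities
    upper = np.triu((rng.random((J, J)) < P).astype(int), 1)
    return upper + upper.T                        # symmetric, zero diagonal
```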

As shown in Table 3, none of the four parameterizations performed well. Polar expansion was the best, but its minESS/iter and minESS/sec were too small to allow a reliable posterior analysis within a reasonable time. Previous studies reported ESS/iter for only some of the unknown parameters. Although not reported in Table 3, we confirmed the results of (Jauch et al., 2021) and (Pourzanjani et al., 2021): some of the unknown parameters (e.g., the $\lambda_{k}$s and $\mu$) are simulated effectively, but the posterior draws of some entries of $\boldsymbol{\Upsilon}$ are strongly autocorrelated.

Table 3. Simulation result: Network eigenmodel
(a) minESS/iter
J     K    Polar    Householder  Cayley   Givens
270   3    0.018    0.012        0.009    0.009
(b) minESS/sec
J     K    Polar    Householder  Cayley   Givens
270   3    0.075    0.000        0.001    0.001

3.3. Probabilistic principal component analysis

In probabilistic principal component analysis (Tipping and Bishop, 2002), a $J$-dimensional demeaned vector $\boldsymbol{y}_{i}\in\mathbb{R}^{J}$ is modeled as a linear function of a low-dimensional latent state $\boldsymbol{z}_{i}\in\mathbb{R}^{K}$ with $K<J$. The distribution of $\boldsymbol{y}_{i}$ is specified as follows: for $i=1,...,N$,

\boldsymbol{y}_{i}|\boldsymbol{z}_{i}\sim\mathcal{N}\left(\boldsymbol{W}\boldsymbol{\Lambda}\boldsymbol{z}_{i},\;\sigma^{2}\boldsymbol{I}_{J}\right),\quad\boldsymbol{z}_{i}\sim\mathcal{N}\left(\boldsymbol{0}_{K},\boldsymbol{I}_{K}\right),

where $\boldsymbol{\Lambda}=\mathrm{diag}\left(\lambda_{1},...,\lambda_{K}\right)$, $\boldsymbol{W}\in\mathcal{V}^{J\times K}$, and $\lambda_{k}$ are unknown parameters, and $\sigma^{2}$ is the variance of the idiosyncratic shocks. This model is closely related to static latent factor modeling (see (Lopes, 2014) for an overview): $\boldsymbol{z}_{i}$ can be seen as a vector of latent factors and $\boldsymbol{W}\boldsymbol{\Lambda}$ as a factor loading matrix.

We applied this model to two specifications of synthetic data. The first specification was adopted from (Nirwan and Bertschinger, 2019): $N=150$, $J=5$, $K=2$, $\left(\lambda_{1},\lambda_{2}\right)=\left(9,1\right)$, and $\sigma^{2}=0.01^{2}$. The second specification was adopted from (Jauch et al., 2020; Pourzanjani et al., 2021): $N=100$, $J=50$, $K=3$, $\left(\lambda_{1},\lambda_{2},\lambda_{3}\right)=\left(5,3,1.5\right)$, and $\sigma^{2}=1$. For both specifications, $\boldsymbol{W}$ was generated uniformly from $\mathcal{V}^{J\times K}$: each entry of $\boldsymbol{A}\in\mathbb{R}^{J\times K}$ is first simulated independently from the standard normal distribution, and $\boldsymbol{W}$ is then obtained through a polar expansion, $\boldsymbol{W}=\boldsymbol{A}\left(\boldsymbol{A}^{\top}\boldsymbol{A}\right)^{-\frac{1}{2}}$ (see Proposition 7.1 of (Eaton, 1989), pp. 100-101). Non-informative priors were employed for all the parameters and cases.
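A minimal sketch of this data-generating process, including the uniform draw of $\boldsymbol{W}$ via the polar expansion described above, is given below; the function names and the seed are illustrative assumptions, not the scripts actually used in the experiments.

```python
import numpy as np

def runif_stiefel(J, K, rng):
    """Draw W uniformly from the Stiefel manifold: W = A (A'A)^(-1/2)
    with the entries of A i.i.d. standard normal."""
    A = rng.standard_normal((J, K))
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt                             # equals A (A'A)^(-1/2)

def simulate_ppca(N, W, lam, sigma, rng):
    """Simulate y_i = W Lam z_i + eps_i, z_i ~ N(0, I), eps_i ~ N(0, sigma^2 I)."""
    J, K = W.shape
    Z = rng.standard_normal((N, K))           # latent states z_i (rows)
    E = sigma * rng.standard_normal((N, J))   # idiosyncratic noise
    return Z @ np.diag(lam) @ W.T + E         # rows are y_i'

# Second synthetic specification in the text (illustrative usage):
rng = np.random.default_rng(0)
W = runif_stiefel(50, 3, rng)
Y = simulate_ppca(100, W, np.array([5.0, 3.0, 1.5]), 1.0, rng)
```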

The first two rows of panels (a) and (b) in Table 4 summarize the results. In terms of minESS/iter, the performance order of the four parameterizations is not clear: for the first dataset, polar expansion performed the best, whereas for the second dataset, the Cayley transform performed the best. When performance is evaluated based on the minESS/sec metric, polar expansion and the Cayley transform are largely comparable and superior to the other parameterizations for the first dataset, and polar expansion performed the best for the second dataset. The Givens representation did not work well for either dataset. As in the uniform sampling cases, in terms of minESS/iter, the Givens representation was only marginally inferior to the Householder and Cayley parameterizations, but it was slow to draw, leading to a much smaller minESS/sec.

Table 4. Simulation result: Probabilistic principal component analysis
(a) minESS/iter
             J     K    Polar    Householder  Cayley   Givens
Synthetic 1  5     2    0.819    0.120        0.249    0.100
Synthetic 2  50    3    0.077    0.043        0.131    0.056
Real         569   2    0.240    0.027        0.046    –
(b) minESS/sec
             J     K    Polar    Householder  Cayley   Givens
Synthetic 1  5     2    240.253  131.043      262.117  30.558
Synthetic 2  50    3    11.828   0.131        1.784    0.009
Real         569   2    2.280    0.509        0.206    –

Following (Nirwan and Bertschinger, 2019), we applied the model to the breast cancer Wisconsin dataset retrieved from scikit-learn, a machine learning library for the Python programming language (Pedregosa et al., 2011). The model was extended by incorporating a mean vector $\boldsymbol{\mu}$ as

\boldsymbol{y}_{i}|\boldsymbol{z}_{i}\sim\mathcal{N}\left(\boldsymbol{\mu}+\boldsymbol{W}\boldsymbol{\Lambda}\boldsymbol{z}_{i},\;\sigma^{2}\boldsymbol{I}_{J}\right).

We assigned a non-informative prior to $\boldsymbol{\mu}$: $p\left(\boldsymbol{\mu}\right)\propto 1$. As shown in Table 4, polar expansion performed the best, irrespective of the performance measure. We do not report results for the Givens representation because the posterior simulation was intolerably slow.

3.4. Matrix completion for causal analysis

The final testing bench is matrix completion for causal analysis. Matrix completion has been studied extensively in the machine learning community (e.g., (Salakhutdinov and Mnih, 2007; Ding et al., 2011; Babacan et al., 2012; Mai and Alquier, 2015; Farias et al., 2022; Tanaka, 2022; Yuchi et al., 2023; Zhai and Gutman, 2023)). It has been applied to causal analysis with event study designs using panel data (e.g., (Samartsidis et al., 2020; Nethery et al., 2021; Tanaka, 2021; Pang et al., 2022; Samartsidis et al., 2024)). In this framework, outcomes under treatment are removed from an outcome matrix, and the outcome matrix is then “completed,” treating the missing entries as the unrealized potential outcomes in the absence of treatment, that is, counterfactual outcomes. By comparing the estimates of the counterfactual outcomes with the realized outcomes, we can infer the treatment effects.

Let $\boldsymbol{Y}\in\mathbb{R}^{J\times T}$ denote the matrix of outcomes for $J$ entities and $T$ time periods. The elements of $\boldsymbol{Y}$ corresponding to treated entities in treated time periods are regarded as missing data; we can infer the unrealized potential outcomes by completing $\boldsymbol{Y}$. Let $\boldsymbol{Y}^{\mathrm{miss}}$ denote the set of missing entries of $\boldsymbol{Y}$ and $\boldsymbol{Y}^{\mathrm{obs}}$ the set of observed entries.

Given $\boldsymbol{Y}^{\mathrm{miss}}$, the model of the “completed” $\boldsymbol{Y}$ is composed of three matrices, $\boldsymbol{Y}=\boldsymbol{\Xi}+\boldsymbol{\Gamma}+\boldsymbol{U}$. First, $\boldsymbol{\Xi}=\left(\xi_{j,t}\right)\in\mathbb{R}^{J\times T}$ is a matrix of covariate effects with $\xi_{j,t}=\boldsymbol{\beta}^{\top}\boldsymbol{x}_{j,t}$, where $\boldsymbol{x}_{j,t}$ is a vector of covariates. Second, $\boldsymbol{\Gamma}\in\mathbb{R}^{J\times T}$ is a matrix with rank $K<\min\left(J,T\right)$ that is factorized as in a singular value decomposition, $\boldsymbol{\Gamma}=\boldsymbol{\Phi}\boldsymbol{\Lambda}\boldsymbol{\Psi}^{\top}$, where $\boldsymbol{\Lambda}=\mathrm{diag}\left(\lambda_{1},...,\lambda_{K}\right)$ is a diagonal matrix, and $\boldsymbol{\Phi}=\left(\phi_{j,k}\right)\in\mathcal{V}^{J\times K}$ and $\boldsymbol{\Psi}=\left(\psi_{t,k}\right)\in\mathcal{V}^{T\times K}$ are orthogonal matrices. Third, $\boldsymbol{U}=\left(u_{j,t}\right)\in\mathbb{R}^{J\times T}$ is a matrix of error terms whose entries are independently normally distributed with variance $\sigma^{2}$, $u_{j,t}\sim\mathcal{N}\left(0,\sigma^{2}\right)$.
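The generative structure can be sketched as follows; this is an illustrative simulation of the completed outcome matrix under known parameters (the names are ours), not the posterior sampler itself.

```python
import numpy as np

def assemble_outcomes(Xi, Phi, lam, Psi, sigma, rng):
    """Simulate the completed outcome matrix Y = Xi + Phi Lam Psi' + U,
    with U having i.i.d. N(0, sigma^2) entries."""
    J, T = Xi.shape
    Gamma = Phi @ np.diag(lam) @ Psi.T        # low-rank component of rank K
    U = sigma * rng.standard_normal((J, T))   # error matrix
    return Xi + Gamma + U
```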

We conducted a posterior simulation treating $\boldsymbol{Y}^{\mathrm{miss}}$ as unknown parameters. Let $\boldsymbol{X}$ denote the set of covariates. Then, the likelihood of the “completed” $\boldsymbol{Y}$ has the standard form:

p\left(\boldsymbol{Y}|\boldsymbol{Y}^{\mathrm{miss}},\boldsymbol{\Phi},\boldsymbol{\Lambda},\boldsymbol{\Psi},\boldsymbol{\beta},\sigma^{2};\boldsymbol{X}\right)=\left(2\pi\sigma^{2}\right)^{-\frac{JT}{2}}\exp\left\{-\frac{1}{2\sigma^{2}}\left\|\boldsymbol{Y}-\boldsymbol{\Phi}\boldsymbol{\Lambda}\boldsymbol{\Psi}^{\top}-\boldsymbol{\Xi}\right\|_{F}^{2}\right\}.

The target kernel is represented as:

p\left(\boldsymbol{Y}^{\mathrm{miss}},\boldsymbol{\Phi},\boldsymbol{\Lambda},\boldsymbol{\Psi},\boldsymbol{\beta},\sigma^{2}|\boldsymbol{Y}^{\mathrm{obs}},\boldsymbol{X}\right)\propto p\left(\boldsymbol{Y}|\boldsymbol{Y}^{\mathrm{miss}},\boldsymbol{\Phi},\boldsymbol{\Lambda},\boldsymbol{\Psi},\boldsymbol{\beta},\sigma^{2};\boldsymbol{X}\right)p\left(\boldsymbol{Y}^{\mathrm{miss}},\boldsymbol{\Phi},\boldsymbol{\Lambda},\boldsymbol{\Psi},\boldsymbol{\beta},\sigma^{2}\right),

where $p\left(\boldsymbol{Y}^{\mathrm{miss}},\boldsymbol{\Phi},\boldsymbol{\Lambda},\boldsymbol{\Psi},\boldsymbol{\beta},\sigma^{2}\right)$ is a prior. For $\lambda_{k}$, we employed an exponential prior, $\lambda_{k}\sim\mathcal{E}\left(\eta\right)$, where $\eta$ is a prefixed rate parameter; this choice can be seen as a Bayesian counterpart to matrix completion with the nuclear norm penalty (Du et al., 2023). We assigned uniform (Haar) priors to $\boldsymbol{\Phi}$ and $\boldsymbol{\Psi}$, $p\left(\boldsymbol{\Phi}\right)\propto 1$ and $p\left(\boldsymbol{\Psi}\right)\propto 1$; non-informative priors to $\boldsymbol{Y}^{\mathrm{miss}}$ and $\boldsymbol{\beta}$, $p\left(\boldsymbol{Y}^{\mathrm{miss}}\right)\propto 1$ and $p\left(\boldsymbol{\beta}\right)\propto 1$; and a Jeffreys prior to $\sigma^{2}$, $p\left(\sigma^{2}\right)\propto\sigma^{-2}$.

Using this model, we re-examined the empirical analysis of carbon taxes on $\mathrm{CO}_{2}$ emissions in (Andersson, 2019), which uses the synthetic control method (Abadie and Gardeazabal, 2003; Abadie et al., 2010). We used annual panel data on per capita $\mathrm{CO}_{2}$ emissions from transport, covering the years 1960-2005 for 14 OECD (Organisation for Economic Co-operation and Development) countries: Australia, Belgium, Canada, Denmark, France, Greece, Iceland, Japan, New Zealand, Poland, Portugal, Spain, Sweden, Switzerland, and the United States. Thus, $J=14$ and $T=46$. Sweden implemented carbon taxes in 1990, whereas the others did not. We exploited this event as a quasi-experiment to infer the effects of the carbon taxes on $\mathrm{CO}_{2}$ emissions in Sweden, treating the other countries as a control group. See (Andersson, 2019) for further details on the dataset and a discussion of the choice of the control group. Three cases with $K\in\left\{3,5,10\right\}$ were considered.

Table 5 presents the results. With the Givens representation, some runs with $K=10$ produced extremely autocorrelated chains, and the ESS could not be computed reliably owing to numerical instabilities. Based on the minESS/iter metric, the performance of the four parameterizations was similarly poor. In terms of the minESS/sec metric, polar expansion ranked the best.

Table 5. Simulation result: Matrix completion for causal analysis
(a) minESS/iter
K     Polar    Householder  Cayley   Givens
3     0.027    0.027        0.027    0.027
5     0.028    0.027        0.028    0.027
10    0.027    0.027        0.027    –
(b) minESS/sec
K     Polar    Householder  Cayley   Givens
3     4.047    1.446        0.032    0.012
5     1.795    0.766        0.005    0.003
10    0.764    0.047        0.004    –

4. Conclusions

To determine the best practice for Monte Carlo simulation on the Stiefel manifold, we compared four parameterizations of orthogonal matrices, namely, polar expansion, the Householder parameterization, the (modified) Cayley transform, and the Givens representation, on various statistical applications, and evaluated their numerical performance based on the effective sample size. A series of simulations using NUTS revealed that polar expansion is the best among the four parameterizations. However, when the sampling space is high-dimensional and/or complex, none of the four approaches is likely to work well. Although the poor quality of a sampler does not break the theoretical (asymptotic) validity of a posterior simulation, it is necessary to generate longer chains and carefully examine the obtained draws, particularly when the sampling problem is high-dimensional and/or difficult. Therefore, further research is required in this area.

Acknowledgements.
This study was supported by JSPS KAKENHI Grant Number 20K22096.

References

  • Abadie et al. (2010) Alberto Abadie, Alexis Diamond, and Jens Hainmueller. 2010. Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program. J. Amer. Statist. Assoc. 105, 490 (2010), 493–505. https://doi.org/10.1198/jasa.2009.ap08746
  • Abadie and Gardeazabal (2003) Alberto Abadie and Javier Gardeazabal. 2003. The Economic Costs of Conflict: A Case Study of the Basque Country. American Economic Review 93, 1 (March 2003), 113–132. https://doi.org/10.1257/000282803321455188
  • Andersson (2019) Julius J. Andersson. 2019. Carbon Taxes and CO2 Emissions: Sweden as a Case Study. American Economic Journal: Economic Policy 11, 4 (November 2019), 1–30. https://doi.org/10.1257/pol.20170144
  • Babacan et al. (2012) S. Derin Babacan, Martin Luessi, Rafael Molina, and Aggelos K. Katsaggelos. 2012. Sparse Bayesian Methods for Low-Rank Matrix Estimation. IEEE Transactions on Signal Processing 60, 8 (2012), 3964–3977. https://doi.org/10.1109/TSP.2012.2197748
  • Bishop (2006) Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer.
  • Byrne and Girolami (2013) Simon Byrne and Mark Girolami. 2013. Geodesic Monte Carlo on Embedded Manifolds. Scandinavian Journal of Statistics 40 (2013), 825–845.
  • Ding et al. (2011) Xinghao Ding, Lihan He, and Lawrence Carin. 2011. Bayesian Robust Principal Component Analysis. IEEE Transactions on Image Processing 20, 12 (2011), 3419–3430. https://doi.org/10.1109/TIP.2011.2156801
  • Du et al. (2023) Ke-Lin Du, M. N. S. Swamy, Zhang-Quan Wang, and Wai Ho Mow. 2023. Matrix Factorization Techniques in Machine Learning, Signal Processing, and Statistics. Mathematics 11, 12 (2023). https://doi.org/10.3390/math11122674
  • Duane et al. (1987) Simon Duane, A.D. Kennedy, Brian J. Pendleton, and Duncan Roweth. 1987. Hybrid Monte Carlo. Physics Letters B 195, 2 (1987), 216–222. https://doi.org/10.1016/0370-2693(87)91197-X
  • Eaton (1989) Morris L. Eaton. 1989. Group Invariance Applications in Statistics. Regional Conference Series in Probability and Statistics, Vol. 1. Institute of Mathematical Statistics.
  • Farias et al. (2022) Vivek Farias, Andrew A. Li, and Tianyi Peng. 2022. Uncertainty Quantification for Low-Rank Matrix Completion with Heterogeneous and Sub-Exponential Noise. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research, Vol. 151), Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera (Eds.). PMLR, 1179–1189.
  • Hoff (2009) Peter D. Hoff. 2009. Simulation of the Matrix Bingham-von Mises-Fisher Distribution, With Applications to Multivariate and Relational Data. Journal of Computational and Graphical Statistics 18, 2 (2009), 438–456.
  • Hoffman and Gelman (2014) Matthew D. Hoffman and Andrew Gelman. 2014. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15 (2014), 1351–1381.
  • Jauch et al. (2020) Michael Jauch, Peter D. Hoff, and David B. Dunson. 2020. Random Orthogonal Matrices and the Cayley Transform. Bernoulli 26, 2 (2020), 1560–1586. https://doi.org/10.3150/19-BEJ1176
  • Jauch et al. (2021) Michael Jauch, Peter D. Hoff, and David B. Dunson. 2021. Monte Carlo Simulation on the Stiefel Manifold via Polar Expansion. Journal of Computational and Graphical Statistics 30, 3 (2021), 622–631. https://doi.org/10.1080/10618600.2020.1859382 arXiv:https://doi.org/10.1080/10618600.2020.1859382
  • Lin et al. (2017) Lizhen Lin, Vinayak Rao, and David Dunson. 2017. Bayesian Nonparametric Inference on the Stiefel Manifold. Statistica Sinica 27, 2 (2017), 535–553.
  • Liu (2004) Jun S. Liu. 2004. Monte Carlo Strategies in Scientific Computing. Springer.
  • Lopes (2014) Hedibert Freitas Lopes. 2014. Modern Bayesian Factor Analysis. In Bayesian Inference in the Social Sciences, Ivan Jeliazkov and Xin-She Yang (Eds.). Wiley Online Library, Chapter 5, 115–153.
  • Mai and Alquier (2015) The Tien Mai and Pierre Alquier. 2015. A Bayesian Approach for Noisy Matrix Completion: Optimal Rate under General Sampling Distribution. Electronic Journal of Statistics 9, 1 (2015), 823 – 841. https://doi.org/10.1214/15-EJS1020
  • Murphy (2022) Kevin P. Murphy. 2022. Probabilistic Machine Learning: An Introduction. The MIT Press.
  • Neal (2011) Radford Neal. 2011. MCMC Using Hamiltonian Dynamics. In Handbook of Markov Chain Monte Carlo, Steve Brooks, Andrew Gelman, Galin L. Jones, and Xio-Li Meng (Eds.). Chapman & Hall/CRC, 113–162.
  • Nethery et al. (2021) Rachel C. Nethery, Nina Katz-Christy, Marianthi-Anna Kioumourtzoglou, Robbie M. Parks, Andrea Schumacher, and G. Brooke Anderson. 2021. Integrated Causal-predictive Machine Learning Models for Tropical Cyclone Epidemiology. Biostatistics 24, 2 (2021), 449–464. https://doi.org/10.1093/biostatistics/kxab047
  • Nirwan and Bertschinger (2019) Rajbir Nirwan and Nils Bertschinger. 2019. Rotation Invariant Householder Parameterization for Bayesian PCA. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, 4820–4828. https://proceedings.mlr.press/v97/nirwan19a.html
  • Pal et al. (2020) Subhadip Pal, Subhajit Sengupta, Riten Mitra, and Arunava Banerjee. 2020. Conjugate Priors and Posterior Inference for the Matrix Langevin Distribution on the Stiefel Manifold. Bayesian Analysis 15, 3 (2020), 871 – 908. https://doi.org/10.1214/19-BA1176
  • Pang et al. (2022) Xun Pang, Licheng Liu, and Yiqing Xu. 2022. A Bayesian Alternative to Synthetic Control for Comparative Case Studies. Political Analysis 30, 2 (2022), 269–288. https://doi.org/10.1017/pan.2021.22
  • Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning research 12 (2011), 2825–2830.
  • Pourzanjani et al. (2021) Arya A. Pourzanjani, Richard M. Jiang, Brian Mitchell, Paul J. Atzberger, and Linda R. Petzold. 2021. Bayesian Inference over the Stiefel Manifold via the Givens Representation. Bayesian Analysis 16, 2 (2021), 639–666. https://doi.org/10.1214/20-BA1202
  • R Core Team (2021) R Core Team. 2021. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
  • Salakhutdinov and Mnih (2007) Russ Salakhutdinov and Andriy Mnih. 2007. Probabilistic Matrix Factorization. In Advances in Neural Information Processing Systems, J. Platt, D. Koller, Y. Singer, and S. Roweis (Eds.), Vol. 20. Curran Associates, Inc.
  • Samartsidis et al. (2024) Pantelis Samartsidis, Shaun R. Seaman, Abbie Harrison, Angelos Alexopoulos, Gareth J. Hughes, Christopher Rawlinson, Charlotte Anderson, André Charlett, Isabel Oliver, and Daniela De Angelis. 2024+. A Bayesian Multivariate Factor Analysis Model for Causal Inference Using Time-series Observational Data on Mixed Outcomes. Biostatistics (2024+).
  • Samartsidis et al. (2020) Pantelis Samartsidis, Shaun R. Seaman, Silvia Montagna, André Charlett, Matthew Hickman, and Daniela De Angelis. 2020. A Bayesian Multivariate Factor Analysis Model for Evaluating an Intervention by Using Observational Time Series Data on Multiple Outcomes. Journal of the Royal Statistical Society Series A: Statistics in Society 183, 4 (05 2020), 1437–1459. https://doi.org/10.1111/rssa.12569
  • Shepard et al. (2015) Ron Shepard, Scott R. Brozell, and Gergely Gidofalvi. 2015. The Representation and Parametrization of Orthogonal Matrices. The Journal of Physical Chemistry A 119, 28 (2015), 7924–7939. https://doi.org/10.1021/acs.jpca.5b02015 PMID: 25946418.
  • Shi et al. (2017) Jiarong Shi, Xiuyun Zheng, and Wei Yang. 2017. Survey on Probabilistic Models of Low-Rank Matrix Factorizations. Entropy 19, 8 (2017), 424. https://doi.org/10.3390/e19080424
  • Tanaka (2021) Masahiro Tanaka. 2021. Bayesian Matrix Completion Approach to Causal Inference with Panel Data. Journal of Statistical Theory and Practice 15 (2021), 49. https://doi.org/10.1007/s42519-021-00188-x
  • Tanaka (2022) Masahiro Tanaka. 2022. Bayesian Singular Value Regularization via a Cumulative Shrinkage Process. Communications in Statistics - Theory and Methods 51, 16 (2022), 5566–5589. https://doi.org/10.1080/03610926.2020.1843055
  • Tipping and Bishop (2002) Michael E. Tipping and Christopher M. Bishop. 2002. Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology 61, 3 (2002), 611–622. https://doi.org/10.1111/1467-9868.00196
  • Yuchi et al. (2023) Henry Shaowu Yuchi, Simon Mak, and Yao Xie. 2023. Bayesian Uncertainty Quantification for Low-Rank Matrix Completion. Bayesian Analysis 18, 2 (2023), 491 – 518. https://doi.org/10.1214/22-BA1317
  • Zhai and Gutman (2023) Ruoshui Zhai and Roee Gutman. 2023. A Bayesian Singular Value Decomposition Procedure for Missing Data Imputation. Journal of Computational and Graphical Statistics 32, 2 (2023), 470–482. https://doi.org/10.1080/10618600.2022.2107534