
Low-rank Bayesian matrix completion via geodesic Hamiltonian Monte Carlo on Stiefel manifolds

Tiangang Cui School of Mathematics and Statistics, University of Sydney, NSW 2006, Australia [email protected]  and  Alex A. Gorodetsky Department of Aerospace Engineering, University of Michigan, Ann Arbor, MI, 48109, USA [email protected]
Abstract.

We present a new sampling-based approach for enabling efficient computation of low-rank Bayesian matrix completion and for quantifying the associated uncertainty. First, we design a new prior model based on the singular-value-decomposition (SVD) parametrization of low-rank matrices. Our prior is analogous to the seminal nuclear-norm regularization used in the non-Bayesian setting and enforces orthogonality in the factor matrices by constraining them to Stiefel manifolds. We then design a geodesic Hamiltonian Monte Carlo (within Gibbs) algorithm for generating posterior samples of the SVD factor matrices. We demonstrate that our approach resolves the sampling difficulties encountered by standard Gibbs samplers for the common two-matrix factorization used in matrix completion. More importantly, the geodesic Hamiltonian sampler allows for sampling in cases with more general likelihoods than the typical Gaussian likelihood and Gaussian prior assumptions adopted in most of the existing Bayesian matrix completion literature. We demonstrate applications of our approach to a mice protein expression dataset and to the MovieLens recommendation problem. Numerical examples demonstrate superior sampling performance, including better mixing and faster convergence to a stationary distribution. Moreover, they demonstrate improved accuracy on the two real-world benchmark problems we considered.

Keywords. Matrix Completion, Markov Chain Monte Carlo, Low-rank matrices

1. Introduction

We consider the problem of Bayesian low-rank matrix completion. The matrix completion problem considers reconstructing a matrix $\mathbf{X}\in\mathbb{R}^{m\times n}$ through indirect and noisy evaluations of a subset of its elements. Low-rank matrix completion seeks to reduce the ill-posed nature of this problem by assuming that the underlying matrix has an approximately low-rank factorization. A low-rank factorization of a matrix $\mathbf{X}\in\mathbb{R}^{m\times n}$ is defined by the pair $(\mathbf{A},\mathbf{B})$, where $\mathbf{A}\in\mathbb{R}^{m\times r}$ and $\mathbf{B}\in\mathbb{R}^{n\times r}$ with $r<\min(m,n)$ and

(1) 𝐗𝐀𝐁T.\mathbf{X}\approx\mathbf{A}\mathbf{B}^{T}.

Recall that the best rank-$r$ approximation $\mathbf{X}_{r}=\operatorname*{arg\,min}_{\operatorname{rank}(\hat{\mathbf{X}})\leq r}\lVert\mathbf{X}-\hat{\mathbf{X}}\rVert_{F}^{2}$ is provided by the truncated singular value decomposition

(2) $\mathbf{X}_{r}=\mathbf{U}_{r}\mathbf{\Lambda}_{r}\mathbf{V}_{r}^{T},$

where $\mathbf{U}_{r}$ and $\mathbf{V}_{r}$ contain the first $r$ left and right singular vectors of $\mathbf{X}$; $\mathbf{\Lambda}_{r}$ is the diagonal matrix of the $r$ largest singular values; and we can identify $\mathbf{A}=\mathbf{U}_{r}\sqrt{\mathbf{\Lambda}_{r}}$ and $\mathbf{B}=\mathbf{V}_{r}\sqrt{\mathbf{\Lambda}_{r}}$.
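For concreteness, the following is a minimal NumPy sketch (not the authors' code) of this truncation and of the identification $\mathbf{A}=\mathbf{U}_{r}\sqrt{\mathbf{\Lambda}_{r}}$, $\mathbf{B}=\mathbf{V}_{r}\sqrt{\mathbf{\Lambda}_{r}}$; the function name best_rank_r is ours.

import numpy as np

def best_rank_r(X, r):
    # Illustrative sketch, not the authors' code.
    # Return A, B such that A @ B.T is the best rank-r approximation of X.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Ur, lam, Vr = U[:, :r], s[:r], Vt[:r, :].T     # leading r singular triplets
    A = Ur * np.sqrt(lam)                          # A = U_r sqrt(Lambda_r)
    B = Vr * np.sqrt(lam)                          # B = V_r sqrt(Lambda_r)
    return A, B

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 60))
A, B = best_rank_r(X, r=10)
print(np.linalg.norm(X - A @ B.T))                 # Frobenius error of the rank-10 truncation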

Bayesian low-rank matrix completion

We consider the Bayesian setting, aiming to identify distributions over the factors $\mathbf{A}$ and $\mathbf{B}$ to account for the uncertainty stemming from the inability to uniquely identify the correct matrix. The data consist of pairs of element indices and values $\{(i,j),y_{ij}\}$, where $i\in[m]$ denotes the row and $j\in[n]$ denotes the column. We use the index set $\mathcal{I}$ to denote the set of indices corresponding to the observed data points and let $|\mathcal{I}|$ be its cardinality. The full data vector is denoted by $y$. We consider the following likelihood models: a Gaussian likelihood

(3) yij\displaystyle y_{ij} 𝒩(h(𝐗ij),σ2),\displaystyle\sim\mathcal{N}(h(\mathbf{X}_{ij}),\sigma^{2}),

where hh is an observation function, e.g., identity or a soft-plus function to ensure positivity, and a Binomial likelihood

(4) yij\displaystyle y_{ij} Bin(k,h(𝐗ij)),h(𝐗ij)=(1+exp(𝐗ij))1\displaystyle\sim\text{Bin}(k,h(\mathbf{X}_{ij})),\quad h(\mathbf{X}_{ij})=(1+\exp(-\mathbf{X}_{ij}))^{-1}

where $k$ is the number of trials and $h(\mathbf{X}_{ij})$ denotes the success probability. Given the low-rank factorization of Equation (1), $\mathbf{X}_{ij}$ is obtained by multiplying the $i$th row of $\mathbf{A}$ with the $j$th row of $\mathbf{B}$. The Binomial likelihood is applied in categorical settings where the categories might have a natural ordering. However, we emphasize that our proposed sampling methodology is more broadly applicable to other likelihood formulations.

Challenge

One challenge specific to Bayesian matrix completion is the non-uniqueness of the low-rank representation. This challenge is pictorially represented in Figure 1, which demonstrates the geometry of posteriors that are prevalent in the problem. It is typically quite challenging to design a sampling scheme that is able to sample such posteriors, and even variational approaches are difficult to tune here. This challenge arises from the following symmetry in the problem: a low-rank factorization can be obtained by an arbitrary linear transformation of the factors via

(5) $\mathbf{X}=\mathbf{A}\mathbf{W}\mathbf{W}^{-1}\mathbf{B}^{T}=\hat{\mathbf{A}}\hat{\mathbf{B}}^{T}$

for any invertible $\mathbf{W}\in\mathbb{R}^{r\times r}$, where $\hat{\mathbf{A}}=\mathbf{A}\mathbf{W}$ and $\hat{\mathbf{B}}^{T}=\mathbf{W}^{-1}\mathbf{B}^{T}$. The precise ramifications of this symmetry in the Gaussian likelihood case were extensively documented in [7].
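As a quick illustration of this symmetry (an illustrative sketch, not the paper's code), one can verify numerically that transforming the factors by a random invertible $\mathbf{W}$ leaves the product unchanged:

import numpy as np

# Illustrative sketch, not the authors' code.
rng = np.random.default_rng(1)
m, n, r = 8, 6, 3
A = rng.standard_normal((m, r))
B = rng.standard_normal((n, r))
W = rng.standard_normal((r, r))            # a generic r x r matrix is invertible almost surely

A_hat = A @ W
B_hat = B @ np.linalg.inv(W).T             # so that A_hat @ B_hat.T = A W W^{-1} B^T

print(np.allclose(A @ B.T, A_hat @ B_hat.T))   # True: the data cannot distinguish (A, B) from (A_hat, B_hat)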

Refer to caption
Figure 1. Examples of 2D marginals of the posteriors obtained using standard Gaussian priors for the inference of real-valued low-rank matrix factorizations [7].

Contribution

We develop an approach that addresses these issues in two ways: (i) we adopt a factorization format that constrains the factor matrices to Stiefel manifolds; and (ii) we develop a geodesic Markov chain Monte Carlo approach that specifically targets distributions supported on such manifolds. Specifically, we propose to parameterize the low-rank matrix in the SVD format

(6) $\mathbf{X}=\mathbf{U}\mathbf{S}\mathbf{V}^{T},\quad\mathbf{U}^{T}\mathbf{U}=\mathbf{I}_{r},\quad\mathbf{V}^{T}\mathbf{V}=\mathbf{I}_{r},$

where we seek to learn semi-orthogonal matrices 𝐔\mathbf{U} and 𝐕\mathbf{V} and a positive diagonal matrix 𝐒\mathbf{S}. Within a Bayesian sampling context, we develop a geodesic Hamiltonian Monte Carlo (HMC) [4] within Gibbs algorithm to enable efficient sampling of the semi-orthogonal factor matrices.

This approach is closely related to the SVD models used in optimization-based methods [5] that solve a nuclear norm minimization problem

(7) minimize𝐗m×n𝐗subject toyh(Λ(𝐗))δ,\operatorname*{minimize}_{\mathbf{X}\in\mathbb{R}^{m\times n}}\quad\lVert\mathbf{X}\rVert_{*}\quad\text{subject to}\quad\lVert{y}-h(\Lambda(\mathbf{X}))\rVert\leq\delta,

where $\Lambda$ is an operator that subselects the observed elements of $\mathbf{X}$ and flattens them into a vector of the same size as $y$. Moreover, an SVD parameterization is commonly used in this optimization procedure. From this perspective, the optimization problem can be seen as equivalently seeking a maximum a posteriori (MAP) estimate with a uniform prior over semi-orthogonal matrices. To our knowledge, prior work in Bayesian matrix completion has not used this equivalent prior. As a result, we provide the first Bayesian analogue of this extremely popular low-rank matrix characterization.

Related work

A large amount of work has addressed the low-rank Bayesian matrix completion problem in the variational setting. For example, [18] and [25] each apply variational Bayes approximations of this inference model to analyze the Netflix prize challenge [3] to great success. Further, [21] and [22] develop a theoretical framework to analyze variational Bayes low-rank matrix factorization. One of the issues with these existing approaches is the mean-field-type approximation that assumes independence of the factors. Sampling-based approaches target the full posterior and, while more expensive, have shown that this more complete characterization can outperform variational inference [27]. This approach was later extended to other settings [6, 1], and has also been extended to the case of recovering low-rank tensor factorizations [24, 28, 29].

The majority of these algorithms leverage the bilinear nature of the low-rank factorization to develop Gibbs samplers via conditional distributions for each factor. In this setting each factor is endowed with an independent Gaussian prior, and the variance of this prior is endowed with a hyperprior. This approach is mainly used to try to circumvent the sampling difficulties previously described, but comes at the cost of increased uncertainty. This approach uses a Gaussian log likelihood function

(8) logP(y𝐀,𝐁)=||2log(2π)||log(σ)12σ2(i,j)(yij𝐀i:𝐁j:T)2.\log P(y\mid\mathbf{A},\mathbf{B})=-\frac{|\mathcal{I}|}{2}\log(2\pi)-|\mathcal{I}|\log(\sigma)-\frac{1}{2\sigma^{2}}\textstyle\sum_{(i,j)\in\mathcal{I}}\left(y_{ij}-\mathbf{A}_{i:}\mathbf{B}^{T}_{j:}\right)^{2}.

The prior on the factors assigns i.i.d. zero mean Gaussians for entries of the matrices 𝐀\mathbf{A} and 𝐁\mathbf{B} and assumes the prior variance is unknown. This leads to a hierarchical prior, conditioned on the variance

$P(\mathbf{A},\mathbf{B}\mid\delta)\propto\delta^{-(m+n)r}\exp\left(-\frac{1}{2\delta^{2}}\left(\textstyle\sum_{i=1,j=1}^{m,r}\mathbf{A}_{ij}^{2}+\textstyle\sum_{i=1,j=1}^{n,r}\mathbf{B}_{ij}^{2}\right)\right),$

where $\delta$ is an unknown hyperparameter that controls the prior variance. We endow $\delta$ with a Gamma prior. Finally, Gibbs sampling is used to estimate the factors $\mathbf{A}$ and $\mathbf{B}$, the prior variance $\delta$, and the likelihood standard deviation $\sigma$.

Outline

This paper is structured as follows. Section 2 provides background on sampling on the Stiefel manifold and presents our proposed HMC-within-Gibbs algorithm. Section 3 provides numerical results that compare our proposed approach to the Bayesian approach described above. The comparisons are made both on synthetic examples, to gain intuition, and on two real-world data sets that consider positivity constraints on the matrix elements and categorically valued matrices, respectively.

2. Sampling on low-rank manifolds

In this section we first describe geodesic Monte Carlo and then describe our Gibbs sampling procedure. The factor matrices we learn lie on Stiefel manifolds. Traditional Markov chain Monte Carlo approaches such as Hamiltonian Monte Carlo or Riemannian-manifold Hamiltonian Monte Carlo are not directly applicable to distributions over parameters that lie on such manifolds embedded in Euclidean space. Instead, so-called geodesic Monte Carlo approaches propose modifications of HMC and related methods that apply directly to distributions in the embedded space [4].

2.1. Operations on the Stiefel manifold

We begin with a description of operations on the Stiefel manifold. Let $\mathcal{M}$ denote an $m$-dimensional manifold that is embedded in a higher-dimensional Euclidean space $\mathbb{R}^{n}$. At every point $x\in\mathcal{M}$ there exists a bijective mapping $q_{x}:\mathcal{M}\to\mathbb{R}^{m}$ denoting a coordinate system on the manifold. The union of all possible coordinate systems across the manifold is called an atlas for the manifold. For our low-rank reconstruction purposes, we consider the Stiefel manifold of semi-orthogonal $n\times r$ matrices $\mathcal{M}=\{\mathbf{X}\in\mathbb{R}^{n\times r};\ \mathbf{X}^{T}\mathbf{X}=\mathbf{I}_{r}\}$.

The main challenge is describing motion along the manifold, since standard Euclidean vector-space addition does not apply. Instead, one must follow curves on the manifold between points. A curve on the manifold over a time interval starting at some point $x\in\mathcal{M}$ is defined by $\gamma:[0,T]\to\mathcal{M}$ with $\gamma(0)=x$. At each point $x$, a tangent vector is an equivalence class of these curves (two curves $\phi(t)$ and $\psi(t)$ are equivalent if $\phi(0)=\psi(0)$ and $\lim_{t\to 0}\frac{\phi(t)-\psi(t)}{t}=0$), and the set of tangent vectors forms a tangent vector space $TM_{x}\subset\mathbb{R}^{n}$ [2].

These tangent vectors are time derivatives of the curves, $v=\dot{\gamma}(0)=\left.\frac{d}{dt}\gamma(t)\right|_{t=0}$. Every point on the Stiefel manifold satisfies $\mathbf{X}^{T}(t)\mathbf{X}(t)=\mathbf{I}_{r}$, so differentiation in time yields $\mathbf{\dot{X}}^{T}(t)\mathbf{X}(t)+\mathbf{X}^{T}(t)\mathbf{\dot{X}}(t)=0$. Therefore the tangent space is $TM_{x}=\{\mathbf{Z}\in\mathbb{R}^{n\times r};\ \mathbf{Z}^{T}\mathbf{X}+\mathbf{X}^{T}\mathbf{Z}=\mathbf{0}\}$.

The geodesic Monte Carlo algorithm makes use of the space normal to the manifold. This normal space is the orthogonal complement to TMxTM_{x}, and therefore a definition of an inner product is needed. For the Stiefel manifold this inner product is defined through the trace 𝐘,𝐙x=trace(𝐘T𝐙).\langle\mathbf{Y},\mathbf{Z}\rangle_{x}=\mathrm{trace}(\mathbf{Y}^{T}\mathbf{Z}). One can then verify that for any 𝐗\mathbf{X}\in\mathcal{M} on the manifold, the orthogonal projection of 𝐙n×r\mathbf{Z}\in\mathbb{R}^{n\times r} onto the normal space is

(9) πNx(𝐙)=12𝐗(𝐗T𝐙+𝐙T𝐗),\pi_{N_{x}}(\mathbf{Z})=\frac{1}{2}\mathbf{X}\left(\mathbf{X}^{T}\mathbf{Z}+\mathbf{Z}^{T}\mathbf{X}\right),

and the orthogonal projection of $\mathbf{Z}$ onto the tangent space becomes $(I-\pi_{N_{x}})(\mathbf{Z})=\mathbf{Z}-\pi_{N_{x}}(\mathbf{Z})$. Finally, we will require geodesics, the curves of shortest length between two points on the manifold. Since the Stiefel manifold is embedded in a Euclidean space, the geodesic can be defined by the fact that the acceleration vector at each point is normal to the manifold as long as the curve is traced with uniform speed. For Stiefel manifolds we are able to obtain the following closed-form expression [11]

(10) $\left[\mathbf{X}(t),\mathbf{\dot{X}}(t)\right]=\left[\mathbf{X}(0),\mathbf{\dot{X}}(0)\right]\exp\left(t\begin{bmatrix}\mathbf{A}&-\mathbf{S}(0)\\ \mathbf{I}&\mathbf{A}\end{bmatrix}\right)\begin{bmatrix}\exp(-t\mathbf{A})&0\\ 0&\exp(-t\mathbf{A})\end{bmatrix}$

where $\mathbf{A}=\mathbf{X}^{T}(0)\mathbf{\dot{X}}(0)$ and $\mathbf{S}(0)=\mathbf{\dot{X}}^{T}(0)\mathbf{\dot{X}}(0)$.
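The geodesic flow (10) can be evaluated directly with two small matrix exponentials of size $2r\times 2r$ and $r\times r$. The following is an illustrative Python sketch (assumed helper names, not the authors' released code):

import numpy as np
from scipy.linalg import expm

def project_tangent(X, Z):
    # Illustrative sketch, not the authors' code.
    # Project Z onto the tangent space at X by removing the normal component, Eq. (9).
    return Z - 0.5 * X @ (X.T @ Z + Z.T @ X)

def stiefel_geodesic(X, V, t):
    # Flow (X, V) along the Stiefel geodesic of Eq. (10) for time t.
    r = X.shape[1]
    A = X.T @ V                                    # A = X(0)^T Xdot(0)
    S = V.T @ V                                    # S(0) = Xdot(0)^T Xdot(0)
    M = expm(t * np.block([[A, -S], [np.eye(r), A]]))
    E = expm(-t * A)
    XV = np.hstack([X, V]) @ M                     # [X(0), Xdot(0)] times the 2r x 2r flow
    return XV[:, :r] @ E, XV[:, r:] @ E

# Quick check that the flow stays on the manifold.
rng = np.random.default_rng(2)
X, _ = np.linalg.qr(rng.standard_normal((10, 3)))
V = project_tangent(X, rng.standard_normal((10, 3)))
X1, V1 = stiefel_geodesic(X, V, t=0.1)
print(np.allclose(X1.T @ X1, np.eye(3), atol=1e-8))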

2.2. Sampling on the Stiefel manifold

While the Lebesgue measure is the reference measure for probability distributions in the Euclidean space, the Hausdorff measure can be used as the reference measure on manifolds [8]. It will be useful to move between an mm-dimensional Hausdorff measure m\mathcal{H}^{m} and the mm-dimensional Lebesgue measure λm\lambda^{m}. In the context of Riemannian manifolds, such as the Stiefel manifold, the relation between these measures is provided by the following formula [12]

(11) m(dq)=|G(q)|λm(dq),\mathcal{H}^{m}(dq)=\sqrt{|G(q)|}\lambda^{m}(dq),

where $G(q)$ is the Riemannian metric provided by the trace. Using the metric $G(q)$ and the target density $\pi(q)$, we can form the non-separable Hamiltonian function [13]

(12) H(q,p)=logπ(q)+12log|G(q)|+12pTG(q)1p,H(q,p)=-\log\pi(q)+\frac{1}{2}\log|G(q)|+\frac{1}{2}p^{T}G(q)^{-1}p,

where $\pi(q)$ is the target density with respect to the Lebesgue measure in the coordinate system provided by $q$, and $p$ is an auxiliary momentum variable following $\mathcal{N}(0,G(q))$. Note that $\exp(-H(q,p))$ defines an unnormalized joint probability density over the pair $(q,p)$, in which the marginal density over $q$ recovers the target density. As shown by the landmark paper [10], the Hamiltonian system

dqidt=Hpi,dpidt=Hqi,\frac{dq_{i}}{dt}=\frac{\partial H}{\partial p_{i}},\quad\frac{dp_{i}}{dt}=-\frac{\partial H}{\partial q_{i}},

implicitly defines a time-reversible and one-to-one mapping $T$ from the state at time $t$, $(q(t),p(t))$, to the state at time $t+s$, $(q(t+s),p(t+s))$. This mapping $T$ leaves the target distribution $\pi(q)$ invariant. Thus, discretizing the Hamiltonian system using numerical integrators naturally leads to MCMC proposal distributions that in general can reduce sample correlations in the resulting Markov chain. Moreover, using time-reversible and symplectic numerical integrators often yields simplified and more dimension-scalable MCMC proposals. See [23] and references therein.

To define HMC on Riemannian manifolds [13], we can use the relation (11) to write the Hamiltonian function (12) with respect to the Hausdorff measure as $H(q,p)=-\log\pi_{\mathcal{H}}(q)+\frac{1}{2}p^{T}G(q)^{-1}p$. The corresponding Hamiltonian system evolves the pair $(q,p)$ according to

(13) dqdt=Hp=G(q)1p,dpdt=Hq=q[logπ(q)12pTG(q)1p].\displaystyle\frac{dq}{dt}=\frac{\partial H}{\partial p}=G(q)^{-1}p,\qquad\frac{dp}{dt}=-\frac{\partial H}{\partial q}=\nabla_{q}\left[\log\pi_{\mathcal{H}}(q)-\frac{1}{2}p^{T}G(q)^{-1}p\right].

A key innovation of [4] is the introduction of a time-reversible and symplectic integrator that integrates this non-separable Hamiltonian system via splitting, H(q,p)=H[1](q,p)+H[2](q,p)H(q,p)=H^{[1]}(q,p)+H^{[2]}(q,p) where H[1](q,p)=logπ(q)H^{[1]}(q,p)=-\log\pi_{\mathcal{H}}(q) and H[2](q,p)=12pTG(q)1p.H^{[2]}(q,p)=\frac{1}{2}p^{T}G(q)^{-1}p. The corresponding systems are

(14) $\displaystyle\frac{dq}{dt}=0,\qquad\frac{dp}{dt}=\nabla_{q}\log\pi_{\mathcal{H}}(q),$
(15) $\displaystyle\frac{dq}{dt}=G^{-1}(q)p,\qquad\frac{dp}{dt}=-\frac{1}{2}p^{T}\frac{\partial}{\partial q}G^{-1}(q)\,p,$

respectively. The integrator first solves (14) for a time step of $\epsilon/2$, then solves (15) for a time step of $\epsilon$, and finally propagates (14) again for a time step of $\epsilon/2$. The solution of the first system (14) is given simply by $(q(t),p(t))=(q_{0},p_{0}+t\nabla_{q}\log\pi_{\mathcal{H}}(q_{0}))$, and the solution to the second system (15) is given by the geodesic flow, which keeps the state on the manifold. If we denote the embedded parameter and momentum by $(\mathbf{X},\mathbf{V})$ rather than $(q,p)$, the final HMC algorithm, made specific to the case of the Stiefel manifold, is provided in Algorithm 1. Note that the Metropolis-Hastings rejection rule is applied at the end of the time integration in Algorithm 1 to ensure that the HMC generates Markov chains that have the target distribution $\pi(q)$ as their invariant distribution.

Algorithm 1 𝚑𝚖𝚌𝚜𝚝𝚒𝚎𝚏𝚎𝚕\mathtt{hmc-stiefel}: HMC step on the Stiefel Manifold [4]
1:timestep ϵ,\epsilon, number of steps TT, current sample on Stiefel Manifold 𝐗\mathbf{X}, target density π\pi_{\mathcal{H}}
2:v𝒩(0,𝐈nr)v\sim\mathcal{N}(0,\mathbf{I}_{nr}) and 𝐕reshape(v,(n,r))\mathbf{V}\leftarrow\textrm{reshape}(v,(n,r))
3:Hcurrent=logπ(𝐗)12vTvH_{\text{current}}=\log\pi_{\mathcal{H}}(\mathbf{X})-\frac{1}{2}v^{T}v
4:𝐕𝐕12𝐗(𝐗T𝐕+𝐕T𝐗)\mathbf{V}\leftarrow\mathbf{V}-\frac{1}{2}\mathbf{X}(\mathbf{X}^{T}\mathbf{V}+\mathbf{V}^{T}\mathbf{X}) and vreshape(𝐕,(nr))v\leftarrow\textrm{reshape}(\mathbf{V},(nr))
5:for τ=1,,T\tau=1,\ldots,T do
6:     vv+ϵ2xlogπ(𝐗)v\leftarrow v+\frac{\epsilon}{2}\nabla_{x}\log\pi_{\mathcal{H}}(\mathbf{X}) and 𝐕reshape(v,(n,r))\mathbf{V}\leftarrow\textrm{reshape}(v,(n,r)) \triangleright Propagate velocity
7:     𝐕𝐕12𝐗(𝐗T𝐕+𝐕T𝐗)\mathbf{V}\leftarrow\mathbf{V}-\frac{1}{2}\mathbf{X}(\mathbf{X}^{T}\mathbf{V}+\mathbf{V}^{T}\mathbf{X}) \triangleright Project back onto manifold
8:     (𝐗,𝐕)(\mathbf{X},\mathbf{V})\leftarrow solve Eq. (10) with initial condition (𝐗,𝐕)(\mathbf{X},\mathbf{V}) for time ϵ\epsilon \triangleright Evolve on the geodesic
9:     vreshape(𝐕,(nr))v\leftarrow\textrm{reshape}(\mathbf{V},(nr)), vv+ϵ2xlogπ(𝐗)v\leftarrow v+\frac{\epsilon}{2}\nabla_{x}\log\pi_{\mathcal{H}}(\mathbf{X}) and 𝐕reshape(v,(n,r))\mathbf{V}\leftarrow\textrm{reshape}(v,(n,r))
10:     𝐕𝐕12𝐗(𝐗T𝐕+𝐕T𝐗)\mathbf{V}\leftarrow\mathbf{V}-\frac{1}{2}\mathbf{X}(\mathbf{X}^{T}\mathbf{V}+\mathbf{V}^{T}\mathbf{X}) and vreshape(𝐕,(nr))v\leftarrow\textrm{reshape}(\mathbf{V},(nr))
11:end for
12:Hnext=logπ(𝐗)12vTvH_{\textrm{next}}=\log\pi_{\mathcal{H}}(\mathbf{X})-\frac{1}{2}v^{T}v
13:if $\mathcal{U}[0,1]<\exp(H_{\text{next}}-H_{\text{current}})$ then $\mathbf{X}_{\text{next}}\leftarrow\mathbf{X}$; otherwise $\mathbf{X}_{\text{next}}$ keeps the current sample
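The following Python sketch condenses Algorithm 1 into a single function. It is an illustrative implementation under our own naming (hmc_stiefel, log_post, and grad_log_post are assumed user-supplied names, not code released with the paper), with the gradient taken with respect to the embedded $n\times r$ coordinates.

import numpy as np
from scipy.linalg import expm

def project_tangent(X, Z):
    # Illustrative sketch, not the authors' code.
    # Remove the normal component, Eq. (9): Z - X (X^T Z + Z^T X) / 2.
    return Z - 0.5 * X @ (X.T @ Z + Z.T @ X)

def geodesic(X, V, t):
    # Closed-form Stiefel geodesic flow of Eq. (10).
    r = X.shape[1]
    A, S = X.T @ V, V.T @ V
    M = expm(t * np.block([[A, -S], [np.eye(r), A]]))
    E = expm(-t * A)
    XV = np.hstack([X, V]) @ M
    return XV[:, :r] @ E, XV[:, r:] @ E

def hmc_stiefel(X, log_post, grad_log_post, eps, T, rng):
    # One geodesic HMC step targeting log_post (w.r.t. the Hausdorff measure).
    X0 = X.copy()
    V = rng.standard_normal(X.shape)
    H_current = log_post(X) - 0.5 * np.sum(V * V)
    V = project_tangent(X, V)
    for _ in range(T):
        V = project_tangent(X, V + 0.5 * eps * grad_log_post(X))   # half step on the momentum
        X, V = geodesic(X, V, eps)                                  # evolve along the geodesic
        V = project_tangent(X, V + 0.5 * eps * grad_log_post(X))   # second half step
    H_next = log_post(X) - 0.5 * np.sum(V * V)
    if np.log(rng.uniform()) < H_next - H_current:                  # Metropolis-Hastings accept/reject
        return X
    return X0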

2.3. Application to low-rank matrix completion

We utilize geodesic HMC within Gibbs to solve the matrix completion problem. The pseudocode is provided in Algorithm 2. Here we see that the HMC on the Stiefel manifold is used to sample the left and right singular vectors, while a standard HMC approach is used for the singular values. In all cases we use the No-U-Turn-Sampler variant of HMC [17] to adaptively tune the number of steps.

Algorithm 2 Geodesic HMC within Gibbs for matrix completion
1:$\epsilon$, timestep; $T$, number of steps; $N$, number of samples; $(\mathbf{U}^{(0)},\mathbf{S}^{(0)},\mathbf{V}^{(0)})$, initial samples; conditional distributions $\pi_{U}(\mathbf{U}\mid\mathbf{S},\mathbf{V})$, $\pi_{S}(\mathbf{S}\mid\mathbf{U},\mathbf{V})$, $\pi_{V}(\mathbf{V}\mid\mathbf{S},\mathbf{U})$.
2:for k=1,,Nk=1,\ldots,N do
3:     $\mathbf{U}^{(k)}\leftarrow\mathtt{hmc\text{-}stiefel}(\pi_{U}(\cdot\mid\mathbf{S}^{(k-1)},\mathbf{V}^{(k-1)}))$
4:     𝐕(k)𝚑𝚖𝚌𝚜𝚝𝚒𝚎𝚏𝚎𝚕(πV(𝐒(k1),𝐔(k)))\mathbf{V}^{(k)}\leftarrow\mathtt{hmc-stiefel}(\pi_{V}(\cdot\mid\mathbf{S}^{(k-1)},\mathbf{U}^{(k)}))
5:     𝐒(k)𝚑𝚖𝚌(πS(𝐔(k),𝐕(k)))\mathbf{S}^{(k)}\leftarrow\mathtt{hmc}(\pi_{S}(\cdot\mid\mathbf{U}^{(k)},\mathbf{V}^{(k)}))
6:end for

Prior

For all the SVD-based matrix-completion problems in this work, we consider independent prior distributions for the matrices $\mathbf{U}$, $\mathbf{S}$, and $\mathbf{V}$. The priors of the left and right factors are uniform in the Hausdorff measure, whereas the prior of the singular values is a product of independent exponential distributions in Euclidean space. This gives the prior

(16) π(𝐔,𝐒,𝐕)=π(𝐔)P(𝐒)π(𝐕)exp(λ=1r𝐒),\pi(\mathbf{U},\mathbf{S},\mathbf{V})=\pi_{\mathcal{H}}(\mathbf{U})P(\mathbf{S})\pi_{\mathcal{H}}(\mathbf{V})\propto\exp\left(-\lambda\textstyle\sum_{\ell=1}^{r}\mathbf{S}_{\ell\ell}\right),

for some positive rate parameter λ\lambda. The exponential prior can be interpreted as the exponential of the negative nuclear norm of the underlying matrix 𝐗\mathbf{X}, i.e., λ𝐗=λ=1r𝐒\lambda\|\mathbf{X}\|_{\ast}=\lambda\sum_{\ell=1}^{r}\mathbf{S}_{\ell\ell}, which is a convex envelope of rank(𝐗)\text{rank}(\mathbf{X}) used in the optimization literature to promote low-rankness in matrix estimation problems [5, 20, 15]. Thus, the exponential prior can be viewed as a rank revealing prior. We provide further justifications and numerical illustrations in Section C of the supplementary material.
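As a small illustration (a sketch with our own function name, not the paper's code), the log prior density of (16) contributed by the singular values, with the factors contributing only constants, is:

import numpy as np

def log_prior_singular_values(s, lam):
    # Illustrative sketch, not the authors' code.
    # Exponential (nuclear-norm) prior: log p(S) = -lam * sum_l S_ll + const, for S_ll >= 0.
    s = np.asarray(s)
    if np.any(s < 0):
        return -np.inf        # singular values are constrained to be non-negative
    return -lam * np.sum(s)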

In the rest of this section, we describe three likelihood models used for the numerical experiments: (1) real-valued data, (2) positive real-valued data for the mice protein expression example, and (3) positive integer-valued rating data for the movie recommendation.

Model SVD:

In this model, we treat the observed data as real-valued, and consider the Gaussian likelihood arising from the model

(17) yij=𝐗ij+ξ=𝐔i:𝐒𝐕j:T+ξ,y_{ij}=\mathbf{X}_{ij}+\xi=\mathbf{U}_{i:}\mathbf{S}\mathbf{V}_{j:}^{T}+\xi,

where ξ𝒩(0,σ2).\xi\sim\mathcal{N}(0,\sigma^{2}). The log likelihood becomes

logP(y𝐔,𝐒,𝐕)=||2log(2π)||log(σ)12σ2(i,j)(yij𝐔i:𝐒𝐕j:T)2.\log P(y\mid\mathbf{U},\mathbf{S},\mathbf{V})=-\frac{|\mathcal{I}|}{2}\log(2\pi)-|\mathcal{I}|\log(\sigma)-\frac{1}{2\sigma^{2}}\textstyle\sum_{(i,j)\in\mathcal{I}}\left(y_{ij}-\mathbf{U}_{i:}\mathbf{S}\mathbf{V}^{T}_{j:}\right)^{2}.

Model S-SVD:

In this model, we treat the observed data as positive real-valued, and consider the Gaussian likelihood arising from the model

(18) $y_{ij}=h(\mathbf{X}_{ij})+\xi=h(\mathbf{U}_{i:}\mathbf{S}\mathbf{V}_{j:}^{T})+\xi,\quad h(z)=\log(1+\exp(z)),$

where ξ𝒩(0,σ2).\xi\sim\mathcal{N}(0,\sigma^{2}). The log likelihood becomes

logP(y𝐔,𝐒,𝐕)=||2log(2π)||log(σ)12σ2(i,j)(yijh(𝐔i:𝐒𝐕j:T))2.\log P(y\mid\mathbf{U},\mathbf{S},\mathbf{V})=-\frac{|\mathcal{I}|}{2}\log(2\pi)-|\mathcal{I}|\log(\sigma)-\frac{1}{2\sigma^{2}}\textstyle\sum_{(i,j)\in\mathcal{I}}\left(y_{ij}-h(\mathbf{U}_{i:}\mathbf{S}\mathbf{V}^{T}_{j:})\right)^{2}.

For Model SVD and Model S-SVD, the maximum a posteriori (MAP) estimate of the posterior is equivalent to a penalized version of the nuclear norm minimization problem (7). This equivalence is obtained by taking the log of the prior (16) and adding either of the log likelihoods described above.

Model B-SVD:

In this model, we treat the observed data as positive integer-valued, and consider the Binomial likelihood,

(19) yijBin(k,h(𝐗ij)),h(𝐗ij)=(1+exp(𝐗ij))1=(1+exp(𝐔i:𝐒𝐕j:T))1.y_{ij}\sim\text{Bin}(k,h(\mathbf{X}_{ij})),\quad h(\mathbf{X}_{ij})=(1+\exp(-\mathbf{X}_{ij}))^{-1}=(1+\exp(-\mathbf{U}_{i:}\mathbf{S}\mathbf{V}^{T}_{j:}))^{-1}.

This yields the log likelihood

logP(y𝐔,𝐒,𝐕)=(i,j)(log(kyij)+yijlogh(𝐔i:𝐒𝐕j:T)+(kyij)log(1h(𝐔i:𝐒𝐕j:T))).\log P(y\mid\mathbf{U},\mathbf{S},\mathbf{V})=\textstyle\sum_{(i,j)\in\mathcal{I}}\left(\log{k\choose y_{ij}}+y_{ij}\log h(\mathbf{U}_{i:}\mathbf{S}\mathbf{V}^{T}_{j:})+\left(k-y_{ij}\right)\log\left(1-h(\mathbf{U}_{i:}\mathbf{S}\mathbf{V}^{T}_{j:})\right)\right).
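For reference, the three log likelihoods above can be evaluated over the observed index set $\mathcal{I}$ at $O(|\mathcal{I}|r)$ cost. The following Python sketch is an illustrative implementation under our own naming (it stores $\mathbf{S}$ as a length-$r$ vector and $\mathcal{I}$ as integer index arrays rows, cols); it is not the authors' code.

import numpy as np
from scipy.special import gammaln

def entries(U, s, V, rows, cols):
    # Illustrative sketch, not the authors' code.
    # X_ij = U_{i:} S V_{j:}^T for the observed (i, j) pairs; s holds the diagonal of S.
    return np.sum(U[rows] * s[None, :] * V[cols], axis=1)

def loglik_svd(y, U, s, V, rows, cols, sigma):
    # Model SVD, Gaussian likelihood of Eq. (17).
    resid = y - entries(U, s, V, rows, cols)
    n = y.size
    return -0.5 * n * np.log(2 * np.pi) - n * np.log(sigma) - 0.5 * np.sum(resid**2) / sigma**2

def loglik_s_svd(y, U, s, V, rows, cols, sigma):
    # Model S-SVD, Eq. (18): softplus observation function h(z) = log(1 + exp(z)).
    resid = y - np.logaddexp(0.0, entries(U, s, V, rows, cols))
    n = y.size
    return -0.5 * n * np.log(2 * np.pi) - n * np.log(sigma) - 0.5 * np.sum(resid**2) / sigma**2

def loglik_b_svd(y, U, s, V, rows, cols, k):
    # Model B-SVD, Binomial likelihood of Eq. (19) with the logistic link.
    x = entries(U, s, V, rows, cols)
    log_h = -np.logaddexp(0.0, -x)          # log h(x)
    log_1mh = -np.logaddexp(0.0, x)         # log(1 - h(x))
    log_binom = gammaln(k + 1) - gammaln(y + 1) - gammaln(k - y + 1)
    return np.sum(log_binom + y * log_h + (k - y) * log_1mh)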

3. Numerical examples

We now demonstrate the efficacy of the proposed SVD-based models and the geodesic HMC sampling methods on three examples, including partially observed synthetic matrices, the mice protein expression data set [16], and the MovieLens data set [14].

Problem setups

Example 1 includes the following synthetic test cases that aim to benchmark the proposed methods.

  • Case #1:

    We construct a true matrix 𝐗true=𝐀𝐁T\mathbf{X}_{\text{true}}=\mathbf{A}\mathbf{B}^{T}, where entries of 𝐀,𝐁\mathbf{A},\mathbf{B} are drawn from i.i.d. standard Gaussian, and then partially observed data yij=𝐗ijy_{ij}=\mathbf{X}_{ij} are given by randomly selecting a subset of matrix indices.

  • Case #2:

We construct a true positive matrix $\mathbf{X}_{\text{true}}=\mathbf{A}\mathbf{B}^{T}$, where entries of $\mathbf{A}$ are drawn i.i.d. from $\text{Exp}(1)$ and entries of $\mathbf{B}$ are drawn i.i.d. from $\text{Uniform}([0,1])$. Then partially observed data $y_{ij}=\mathbf{X}_{ij}$ are given by randomly selecting a subset of matrix indices.

  • Case #3:

    We construct a true matrix 𝐗true=𝐀𝐁T\mathbf{X}_{\text{true}}=\mathbf{A}\mathbf{B}^{T}, where entries of 𝐀,𝐁\mathbf{A},\mathbf{B} are drawn from i.i.d. standard Gaussian. Then partially observed data are drawn from the Binomial model, i.e., yijBin(k,(1+exp(𝐗ij))1)y_{ij}\sim\text{Bin}(k,(1+\exp(-\mathbf{X}_{ij}))^{-1}), where the indices are randomly chosen.

For each of the cases, we use $100\times 60$ matrices with rank 10. We conduct two sets of experiments that respectively select 10% and 40% of matrix entries as training data. The cross validation data sets are chosen as 40% of matrix entries that do not overlap with the training data sets.
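As an illustration of the synthetic setup (a sketch under assumed details: the random seed and the exact split mechanics are ours; only the sizes and rates follow the text), Case #1 with a 10% training rate and a disjoint 40% validation set can be generated as follows.

import numpy as np

# Illustrative sketch, not the authors' code.
rng = np.random.default_rng(0)
m, n, r = 100, 60, 10
X_true = rng.standard_normal((m, r)) @ rng.standard_normal((n, r)).T   # rank-10 ground truth

idx = rng.permutation(m * n)                       # shuffle all matrix indices
n_train, n_valid = int(0.1 * m * n), int(0.4 * m * n)
rows_tr, cols_tr = np.unravel_index(idx[:n_train], (m, n))
rows_cv, cols_cv = np.unravel_index(idx[n_train:n_train + n_valid], (m, n))
y_train = X_true[rows_tr, cols_tr]                 # observed entries y_ij = X_ij
y_valid = X_true[rows_cv, cols_cv]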

Example 2 is the mice protein expression data set, which consists of 77 protein expressions, measured in terms of nuclear fractions, from 1080 mice specimens. The $1080\times 77$ matrix contains positive real-valued entries. This data set is openly available at the UCI Machine Learning Repository [9]. We use rank 20 in the training. Here we also conduct two sets of experiments that respectively select 10% and 40% of matrix entries at random as training data. The cross validation data sets are chosen as 40% of matrix entries at random that do not overlap with the training data sets.

Example 3 uses the ml-latest-small version of the MovieLens data set, which contains 100836 non-negative ratings across 9742 movies rated by 610 users. This gives a $610\times 9724$ matrix with 100836 partially observed rating data (the data set is openly available at https://files.grouplens.org/datasets/movielens/). Each observed data entry $y_{ij}$ is a categorical variable that takes values from a finite sequence in the range $[0,5]$ with increment $0.5$, i.e., $y_{ij}\in\{0,0.5,1,\ldots,4,4.5,5\}$. To apply the Binomial likelihood, we scale the rating data by a factor of two, so they become integer-valued in the range $[0,10]$ with increment $1$. We use rank 20 in the training. In this example, we randomly select 80% of the data in the data set as the training data, and use the remaining 20% as cross validation data.

Here, Case #1 of the synthetic example aims to benchmark the expressive power and the sampling efficiency of the SVD model in comparison with the $\mathbf{A}\mathbf{B}^{T}$ model. For this reason, the ground truth is generated from the prior distribution of the $\mathbf{A}\mathbf{B}^{T}$ model. Synthetic Cases #2 and #3 mimic the behavior of the mice protein expression data set [16] and the MovieLens data set [14], respectively.

Table 1. Summary of sampling performance and prediction accuracy. The model 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} is sampled by the Gibbs sampling. The SVD-based models are sampled by geodesic HMC. For table entries shown with /\nicefrac{{}}{{}}, the left and the right represent results for the 10%10\% and 40%40\% data sampling rates, respectively.
MCMC mixing    [1%, 50%, 99%] quantiles of MADs
Gibbs / HMC    $\mathbf{A}\mathbf{B}^{T}$    SVD    $\ast$-SVD
Case #1    [0.04, 2.3, 9.8] / [0.04, 2.3, 10.0]    [0.04, 2.3, 9.8] / [0.04, 2.3, 10.0]    -
Case #2    [0.20, 6.3, 24] / [0.67, 6.1, 13]    [1.9, 6.2, 13] / [0.52, 6.1, 13]    [1.9, 6.2, 13] / [0.58, 6.1, 13]
Case #3    [0.15, 6.3, 21.8] / [0.69, 6.1, 13.4]    [0.15, 5.5, 10.5] / [0.24, 5.4, 10.4]    [0, 5, 10] / [0, 5, 11]
Mice    [0, 0.33, 2.6] / [0, 0.22, 2.4]    [0, 0.33, 2.6] / [0.02, 0.22, 2.4]    [0, 0.36, 2.1] / [0.02, 0.42, 1.8]
MovieLens    [0.02, 1.2, 8.0]    [0.03, 1.6, 5.0]    [0, 1, 5]

A summary of the sampling performance and the prediction accuracy of the various methods is provided in Table 1. The table indicates whether each sampling method converges to the posterior distribution within 10000 iterations, including 2000 burn-in iterations, and reports the $[1\%,50\%,99\%]$ quantiles of the estimated median absolute deviations (MADs) on the cross validation data sets. Since both the HMC and the Gibbs sampling algorithms discussed here converge asymptotically in the limit of infinite iterations [19, 26], we only assess convergence over finitely many iterations using sample traces. For Markov chains with reasonable mixing, we then compute the autocorrelation time to benchmark the rate of convergence [19]. More details about the sampling performance and the prediction accuracy of the various models applied to the above examples are provided in the rest of this section.

Sampling performance

Overall, Gibbs sampling on the $\mathbf{A}\mathbf{B}^{T}$ model demonstrates poor mixing in our examples. The Markov chains often do not converge to the stationary distribution after many iterations. In comparison, all Markov chains generated by HMC on the SVD-based models demonstrate superior mixing. For brevity, we only report the mixing of Markov chains in Case #1 of the synthetic example here, in which the ground truth is generated by the $\mathbf{A}\mathbf{B}^{T}$ model. As shown in Figure 2, after $8\times 10^{3}$ sweeps of Gibbs updates on the parameters $(\mathbf{A},\mathbf{B},\delta,\sigma)$, the traces of the Markov chain have not reached the stationary distribution. In comparison, the traces of the geodesic HMC reach the stationary distribution very quickly. To further demonstrate the sampling efficiency of HMC, we report the estimated average autocorrelation time over all parameters in Figure 2 (right), which shows that the autocorrelation of the Markov chains generated by HMC decreases to zero after one iteration. We observe similar mixing behavior in all other test cases of the synthetic example and also in Examples 2 and 3. See the appendices for the mixing diagnostics of the remaining examples.

We observe that each iteration of the geodesic HMC requires more computational effort than one sweep of the Gibbs sampler. For Case #1 of the synthetic example, the CPU time per iteration of the geodesic HMC is about 5 times that of the Gibbs sampler. For positive matrices (Case #2 and the mice protein data), the CPU time per iteration of the geodesic HMC is about 15-30 times that of the Gibbs sampler. For the rating data (Case #3 and the MovieLens), the CPU time per iteration of the geodesic HMC is about 2-8 times that of the Gibbs sampler. However, the higher computational cost of the geodesic HMC is clearly offset by its superior sampling performance.

Refer to caption
Figure 2. One random experiment of Case #1 of the synthetic example, with 10%10\% data sampling rate. From left to right: MCMC traces produced by 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} and SVD, and the average autocorrelation times ±\pm standard deviations produced by HMC. Error bars are obtained using 30 random experiments.
Refer to caption
Figure 3. Case #1 of the synthetic example. Histograms of prediction MADs estimated using different approaches. The reported error bars are obtained using 30 random experiments. The SVD model is sampled by HMC, while the 𝐀𝐁T\mathbf{AB}^{T} model is sampled by Gibbs.

Prediction accuracy

For all the examples, we report the histograms of the posterior MADs (compared to the cross validation data sets) generated by various models. For all the synthetic test cases, the uncertainty intervals on the histogram are computed using 30 random experiments. For examples 2 and 3, the uncertainty intervals are computed using 10 random experiments.

Figure 3 shows the MADs for Case #1 of the synthetic example. For both data sampling rates, the SVD model and the 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model produce comparable MADs in this case.

Figure 4 shows the MADs for Case #2 of the synthetic example and the mice protein data set. Both examples here involve matrices with positive entries. We apply the S-SVD model (which preserves the positivity), the SVD model, and the $\mathbf{A}\mathbf{B}^{T}$ model. For Case #2, Figure 4 (a) shows that with a high data sampling rate (40%), all three models produce comparable MADs. However, with a low data sampling rate (10%), the S-SVD model and the SVD model still generate results similar to the 40% data sampling rate case, whereas the MADs of the $\mathbf{A}\mathbf{B}^{T}$ model exhibit some heavy-tail behavior. For the mice protein data set, Figure 4 (b) shows that the S-SVD model outperforms the SVD model and the $\mathbf{A}\mathbf{B}^{T}$ model in the low data sampling rate case, while all three models have similar performance in the high data sampling rate case (note that the S-SVD model has the smallest uncertainty interval). Overall, the S-SVD model demonstrates a clear advantage.

Refer to caption
(a)
Refer to caption
(b)
Figure 4. Histograms of prediction MADs for positive matrices. (a): Case #2. (b): mice data. For both (a) and (b), the top row and bottom row correspond to 10%10\% and 40%40\% data sampling rates, respectively; from the left column to the right column we have the S-SVD, SVD, and 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} models. The S-SVD model generally achieves better performance with smaller tails than the 𝐀𝐁T\mathbf{AB}^{T} model.

Figure 5 shows the MADs for Case #3 of the synthetic example and the MovieLens data set. Both examples here involve rating data. We apply the B-SVD model (which uses the Binomial likelihood), the SVD model and the 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model. For Case #3, Figure 5 (a) shows that for both data sampling rates, the two SVD-based models generate comparable results and are both more accurate compared to that of the 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model.

Refer to caption
(a)
Refer to caption
(b)
Figure 5. Histograms of prediction MADs for rating data. (a): Case #3, where the top row and bottom row correspond to 10%10\% and 40%40\% data sampling rates, respectively. (b): MovieLens. For both (a) and (b), from the left column to the right column we have the B-SVD, SVD, and 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} models.

For the MovieLens data set, Figure 5 (b) shows that the B-SVD model outperforms the SVD model and the $\mathbf{A}\mathbf{B}^{T}$ model, and it also has the smallest uncertainty interval. Overall, the B-SVD model demonstrates a clear advantage in prediction accuracy.

4. Acknowledgments

T. Cui acknowledges support from the Australian Research Council Discovery Project DP210103092. A. Gorodetsky acknowledges support from the Department of Energy Office of Scientific Research, ASCR under grant DE-SC0020364.

Appendix A Vectorization of conditional posteriors of SVD-based models

In this section we derive the conditional posteriors of 𝐔\mathbf{U}, 𝐒\mathbf{S}, and 𝐕\mathbf{V} in the singular value decomposition model with a Gaussian likelihood. The same procedure can be undertaken for the posteriors associated with all the proposed likelihood models.

As a starting point, we let γ=1/σ2\gamma=1/\sigma^{2} to denote the precision parameter of the Gaussian observation noise and assign it a conjugate Gamma prior Γ(ασ,βσ)\Gamma(\alpha_{\sigma},\beta_{\sigma}) with ασ=βσ=104\alpha_{\sigma}=\beta_{\sigma}=10^{-4}. We have the full posterior distribution

(20) $P(\mathbf{U},\mathbf{S},\mathbf{V},\gamma\mid y)\propto\gamma^{\frac{|\mathcal{I}|}{2}}\exp\left(-\frac{\gamma}{2}\sum_{(i,j)\in\mathcal{I}}\left(y_{ij}-\mathbf{U}_{i:}\mathbf{S}\mathbf{V}^{T}_{j:}\right)^{2}\right)\pi_{\mathcal{H}}(\mathbf{U})P(\mathbf{S})\pi_{\mathcal{H}}(\mathbf{V})\,\gamma^{\alpha_{\sigma}-1}\exp(-\beta_{\sigma}\gamma),$

where π(𝐔)\pi_{\mathcal{H}}(\mathbf{U}) and π(𝐕)\pi_{\mathcal{H}}(\mathbf{V}) are uniform prior densities in the Hausdorff measure and P(𝐒)P(\mathbf{S}) is some prior density for the singular values. Given (𝐒,𝐕,γ)(\mathbf{S},\mathbf{V},\gamma), the conditional posterior distribution for 𝐔\mathbf{U} is

(21) πU(𝐔y,𝐒,𝐕,γ)\displaystyle\pi_{U}\left(\mathbf{U}\mid y,\mathbf{S},\mathbf{V},\gamma\right) exp[γ2(i,j)(yij𝐔i:𝐒𝐕j:T)2]=exp[γ2(i,j)(yij𝐔i:𝐁j)2]\displaystyle\propto\exp\left[-\frac{\gamma}{2}\sum_{(i,j)\in\mathcal{I}}\left(y_{ij}-\mathbf{U}_{i:}\mathbf{S}\mathbf{V}^{T}_{j:}\right)^{2}\right]=\exp\left[-\frac{\gamma}{2}\sum_{(i,j)\in\mathcal{I}}\left(y_{ij}-\mathbf{U}_{i:}\mathbf{B}_{j}\right)^{2}\right]

where 𝐁j=𝐒𝐕j:T\mathbf{B}_{j}=\mathbf{S}\mathbf{V}^{T}_{j:}. In the implementation it is easier to work directly with 𝐔\mathbf{U} rather than repeated indexing into its rows. To this end, it is convenient to reorder yy according to the row indices so that we have

(22) y=[𝐲1𝐲m]\displaystyle y=\begin{bmatrix}\mathbf{y}_{1}\\ \vdots\\ \mathbf{y}_{m}\end{bmatrix} =[𝐁~1𝐁~m][𝐔1:T𝐔m:T]+ξ=𝐁𝚛𝚎𝚜𝚑𝚊𝚙𝚎(𝐔,(mr))+ξ,\displaystyle=\begin{bmatrix}\tilde{\mathbf{B}}_{1}&&\\ &\ddots&\\ &&\tilde{\mathbf{B}}_{m}\end{bmatrix}\begin{bmatrix}\mathbf{U}^{T}_{1:}\\ \vdots\\ \mathbf{U}^{T}_{m:}\end{bmatrix}+\mathbf{\xi}=\mathbf{B}\ \mathtt{reshape}(\mathbf{U},(mr))+\mathbf{\xi},

where $m_{k}$ is the number of observed entries in the $k$-th row of the data matrix, so that $\mathbf{y}_{k}\in\mathbb{R}^{m_{k}}$; $\tilde{\mathbf{B}}_{k}\in\mathbb{R}^{m_{k}\times r}$ is the matrix whose rows are the vectors $\mathbf{B}_{j}^{T}$ corresponding to the columns $j$ of the observed entries in row $k$; $\mathbf{U}_{k:}\in\mathbb{R}^{r}$ is the $k$-th row of $\mathbf{U}$; and $\mathbf{B}$ is a block-diagonal matrix. Together with the uniform prior in the Hausdorff measure, this results in the following conditional posterior

(23) πU(𝐔y,𝐒,𝐕,γ)exp[γ2y𝐁reshape(𝐔,(mr))2].\pi_{U}(\mathbf{U}\mid y,\mathbf{S},\mathbf{V},\gamma)\propto\exp\left[-\frac{\gamma}{2}\lVert y-\mathbf{B}\ \texttt{reshape}(\mathbf{U},(mr))\rVert^{2}\right].

This density does not correspond to that of a Gaussian distribution over Euclidean space because the matrix must satisfy orthogonality properties. Thus there is no closed-form approach for sampling from this posterior, and some MCMC method is needed. The conditional posterior for 𝐕\mathbf{V} can be obtained in a similar way.

Next we consider the conditional posterior of the singular values lying along the diagonal of the matrix 𝐒\mathbf{S}. We can assign any prior with positive support for the diagonal elements of this matrix. In this paper we choose an exponential prior P(𝐒)exp(λ=1r𝐒)P(\mathbf{S})\propto\exp(-\lambda\sum_{\ell=1}^{r}\mathbf{S}_{\ell\ell}), which draws an analogy with the nuclear norm regularization. Then, Equation (20) leads to the conditional posterior

(24) πS(𝐒y,𝐔,𝐕,γ)exp[γ2(i,j)(yij𝐔i:𝐒𝐕j:T)2λ=1r𝐒]\displaystyle\pi_{S}\left(\mathbf{S}\mid y,\mathbf{U},\mathbf{V},\gamma\right)\propto\exp\left[-\frac{\gamma}{2}\sum_{(i,j)\in\mathcal{I}}\left(y_{ij}-\mathbf{U}_{i:}\mathbf{S}\mathbf{V}^{T}_{j:}\right)^{2}-\lambda\sum_{\ell=1}^{r}\mathbf{S}_{\ell\ell}\right]

where $\lambda$ is the rate hyperparameter of the exponential distribution. To further simplify the computation, we can also express the above conditional posterior in the form

(25) πS(𝐒y,𝐔,𝐕,γ)exp[γ2y𝐌diag(𝐒)2λ=1r𝐒],\displaystyle\pi_{S}\left(\mathbf{S}\mid y,\mathbf{U},\mathbf{V},\gamma\right)\propto\exp\left[-\frac{\gamma}{2}\lVert y-\mathbf{M}\ \texttt{diag}(\mathbf{S})\rVert^{2}-\lambda\sum_{\ell=1}^{r}\mathbf{S}_{\ell\ell}\right],

where the vector y||y\in\mathbb{R}^{|\mathcal{I}|} is the same as the one defined in (22); diag(𝐒)\texttt{diag}(\mathbf{S}) maps the diagonal entries of the matrix 𝐒r×r\mathbf{S}\in\mathbb{R}^{r\times r} into a vector in r\mathbb{R}^{r} consisting of all the singular values; and the matrix 𝐌||×r\mathbf{M}\in\mathbb{R}^{|\mathcal{I}|\times r} is given as

𝐌=[𝐔i1:𝐕j1:𝐔ik:𝐕jk:],\mathbf{M}=\begin{bmatrix}\mathbf{U}_{i_{1}:}\circ\mathbf{V}_{j_{1}:}\\ \vdots\\ \mathbf{U}_{i_{k}:}\circ\mathbf{V}_{j_{k}:}\\ \vdots\end{bmatrix},

where \circ denotes the Hadamard product.

The computational complexity of evaluating the conditional densities (23) and (25) and their gradients is governed by the cost of matrix-vector products with $\mathbf{B}$ and $\mathbf{M}$, which are both $O(|\mathcal{I}|r)$.
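As an illustration of this cost (a sketch with our own function names, not the paper's code), the matrix $\mathbf{M}$ is simply the row-wise Hadamard product of the indexed rows of $\mathbf{U}$ and $\mathbf{V}$, so the conditional (25) and its gradient can be evaluated without forming any block-diagonal matrix explicitly:

import numpy as np

def log_cond_S(s, y, U, V, rows, cols, gamma, lam):
    # Illustrative sketch, not the authors' code.
    # Unnormalized log pi_S(S | y, U, V, gamma) of Eq. (25); s is the vector diag(S).
    M = U[rows] * V[cols]                      # |I| x r, with rows U_{i:} o V_{j:}
    resid = y - M @ s                          # y - M diag(S)
    return -0.5 * gamma * np.sum(resid**2) - lam * np.sum(s)

def grad_log_cond_S(s, y, U, V, rows, cols, gamma, lam):
    M = U[rows] * V[cols]
    return gamma * M.T @ (y - M @ s) - lam * np.ones_like(s)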

Finally, the conditional posterior of the precision parameter γ\gamma takes the form

(26) P(γ|y,𝐔,𝐒,𝐕)γ||2+ασ1exp[(12(i,j)(yij𝐔i:𝐒𝐕j:T)2+βσ)γ],P(\gamma|y,\mathbf{U},\mathbf{S},\mathbf{V})\propto\gamma^{\frac{|\mathcal{I}|}{2}+\alpha_{\sigma}-1}\exp\left[-\left(\frac{1}{2}\sum_{(i,j)\in\mathcal{I}}\left(y_{ij}-\mathbf{U}_{i:}\mathbf{S}\mathbf{V}^{T}_{j:}\right)^{2}+\beta_{\sigma}\right)\gamma\right],

which is another Gamma distribution that can be sampled directly.
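Since (26) is a Gamma density in $\gamma$, it can be sampled with a single call to a standard Gamma sampler. A minimal sketch follows (assumed function name; note that NumPy's Gamma sampler uses the shape/scale parametrization, so the rate is inverted):

import numpy as np

def sample_gamma_precision(y, x_obs, rng, alpha_sigma=1e-4, beta_sigma=1e-4):
    # Illustrative sketch, not the authors' code.
    # x_obs holds the model values U_{i:} S V_{j:}^T on the observed index set I.
    rss = np.sum((y - x_obs) ** 2)                 # residual sum of squares
    shape = 0.5 * y.size + alpha_sigma
    rate = 0.5 * rss + beta_sigma
    return rng.gamma(shape, scale=1.0 / rate)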

Appendix B Gibbs 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model

In this section we provide the pseudocode of the Gibbs sampler used for sampling the $\mathbf{A}\mathbf{B}^{T}$ model. Recall that the unknown parameters are the low-rank factors $\mathbf{A},\mathbf{B}$, the measurement noise variance $\sigma^{2}$, and the prior variance of the factors $\delta^{2}$. Similar to the SVD model, we use the precision parameters $\gamma=1/\sigma^{2}$ and $\tau=1/\delta^{2}$ and assign them conjugate Gamma distributions $\Gamma(\alpha_{\sigma},\beta_{\sigma})$ and $\Gamma(\alpha_{\delta},\beta_{\delta})$, respectively. Here all the rate and shape parameters of the Gamma priors are set to $10^{-4}$. Together these facts lead to the following full posterior

$P(\mathbf{A},\mathbf{B},\gamma,\tau\mid y)\propto\ \gamma^{\frac{|\mathcal{I}|}{2}}\exp\left[-\frac{\gamma}{2}\sum_{(i,j)\in\mathcal{I}}\left(y_{ij}-\mathbf{A}_{i:}\mathbf{B}^{T}_{j:}\right)^{2}\right]\gamma^{\alpha_{\sigma}-1}\exp(-\beta_{\sigma}\gamma)\times$
(27) τ(m+n)r2exp[τ2(i=1,j=1m,r𝐀ij2+i=1,j=1n,r𝐁ij2)]ταδ1exp(βδτ)\displaystyle\tau^{\frac{(m+n)r}{2}}\exp\left[-\frac{\tau}{2}\left(\sum_{i=1,j=1}^{m,r}\mathbf{A}_{ij}^{2}+\sum_{i=1,j=1}^{n,r}\mathbf{B}_{ij}^{2}\right)\right]\tau^{\alpha_{\delta}-1}\exp(-\beta_{\delta}\tau)

The conditional distributions for $\mathbf{A}$ and $\mathbf{B}$ are both Gaussian, so samples can be drawn using Gibbs sampling. However, directly sampling those Gaussian distributions requires factorizing rather large covariance matrices, e.g., a matrix of dimension $mr\times mr$ for $\mathbf{A}$. This can have superlinear computational complexity, for instance cubic complexity if directly applying the Cholesky factorization. One way to reduce the computational cost is to apply Gibbs sampling to each of $\mathbf{A}$ and $\mathbf{B}$ row by row. Denoting by $\mathbf{A}_{i:}$ the $i$-th row of $\mathbf{A}$ and by $\mathbf{A}_{-i:}$ the matrix containing the remaining rows of $\mathbf{A}$, the conditional posterior of $\mathbf{A}_{i:}$ given $(\mathbf{A}_{-i:},\mathbf{B},\gamma,\tau)$ is

$P(\mathbf{A}_{i:}\mid y,\mathbf{A}_{-i:},\mathbf{B},\gamma,\tau)\propto\left\{\begin{array}{ll}\exp\left[-\frac{\gamma}{2}\sum_{j\in\mathcal{J}_{i}}\left(y_{ij}-\mathbf{A}_{i:}\mathbf{B}^{T}_{j:}\right)^{2}-\frac{\tau}{2}\sum_{j=1}^{r}\mathbf{A}_{ij}^{2}\right]&\mathcal{J}_{i}\neq\emptyset\\ \exp\left[-\frac{\tau}{2}\sum_{j=1}^{r}\mathbf{A}_{ij}^{2}\right]&\mathcal{J}_{i}=\emptyset\end{array}\right.,$

where $\mathcal{J}_{i}$ is the index set containing all the column indices $j$ such that $(i,j)\in\mathcal{I}$ for the given row index $i$. Note that the second case refers to the situation where row $i$ of the matrix is not observed at all, in which case sampling proceeds from the prior. The conditional posterior of $\mathbf{B}_{j:}$ can be derived in a similar way. Each of the conditional Gaussian distributions has dimension $r$, and is thus computationally much cheaper to simulate. The pseudocode is provided in Algorithm 3, and a sketch of one row-wise Gaussian draw is given after it.

Algorithm 3 Gibbs sampler for the 𝐀𝐁T\mathbf{AB}^{T} model
1:NN number of samples, (𝐀(0),𝐁(0),γ(0),τ(0))(\mathbf{A}^{(0)},\mathbf{B}^{(0)},\gamma^{(0)},\tau^{(0)}) initial samples;
2:for k=1,,Nk=1,\ldots,N do
3:     γ(k)𝚜𝚊𝚖𝚙𝚕𝚎_𝚐𝚊𝚖𝚖𝚊(P(γ|y,𝐀(k1),𝐁(k1),τ(k1)))\gamma^{(k)}\leftarrow\mathtt{sample\_gamma}(P(\gamma|y,\mathbf{A}^{(k-1)},\mathbf{B}^{(k-1)},\tau^{(k-1)}))
4:     τ(k)𝚜𝚊𝚖𝚙𝚕𝚎_𝚐𝚊𝚖𝚖𝚊(P(τ|y,𝐀(k1),𝐁(k1),γ(k)))\tau^{(k)}\leftarrow\mathtt{sample\_gamma}(P(\tau|y,\mathbf{A}^{(k-1)},\mathbf{B}^{(k-1)},\gamma^{(k)}))
5:     𝐀(k)𝐀(k1)\mathbf{A}^{(k)}\leftarrow\mathbf{A}^{(k-1)}
6:     for i=1,,mi=1,\ldots,m do
7:         $\mathbf{A}_{i:}^{(k)}\leftarrow\mathtt{sample\_gaussian}(P(\mathbf{A}_{i:}\mid y,\mathbf{A}^{(k)}_{-i:},\mathbf{B}^{(k-1)},\gamma^{(k)},\tau^{(k)}))$
8:     end for
9:     𝐁(k)𝐁(k1)\mathbf{B}^{(k)}\leftarrow\mathbf{B}^{(k-1)}
10:     for j=1,,nj=1,\ldots,n do
11:         𝐁j:(k)𝚜𝚊𝚖𝚙𝚕𝚎_𝚐𝚊𝚞𝚜𝚜𝚒𝚊𝚗(P(𝐁j:|y,𝐁j:(k),𝐀(k),γ(k),τ(k)))\mathbf{B}_{j:}^{(k)}\leftarrow\mathtt{sample\_gaussian}(P(\mathbf{B}_{j:}|y,\mathbf{B}^{(k)}_{-j:},\mathbf{A}^{(k)},\gamma^{(k)},\tau^{(k)}))
12:     end for
13:end for
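For reference, the following is a minimal sketch of one such row-wise Gaussian draw (the sample_gaussian step for $\mathbf{A}_{i:}$ in Algorithm 3; the $\mathbf{B}_{j:}$ update is symmetric). The function name and argument layout are ours, not the paper's released code.

import numpy as np

def sample_row_A(y_row, cols_i, B, gamma, tau, rng):
    # Illustrative sketch, not the authors' code.
    # Draw A_{i:} | y, B, gamma, tau. y_row holds the observed entries of row i and
    # cols_i their column indices; an unobserved row is drawn from the N(0, 1/tau I_r) prior.
    r = B.shape[1]
    if cols_i.size == 0:
        return rng.standard_normal(r) / np.sqrt(tau)
    Bi = B[cols_i]                                         # |J_i| x r
    prec = gamma * Bi.T @ Bi + tau * np.eye(r)             # posterior precision
    mean = np.linalg.solve(prec, gamma * Bi.T @ y_row)     # posterior mean
    L = np.linalg.cholesky(prec)
    return mean + np.linalg.solve(L.T, rng.standard_normal(r))   # mean + Prec^{-1/2} z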

Appendix C Rank revealing of SVD-based models

Here we demonstrate the rank revealing property of the SVD-based models. Recall that the low-rank matrix 𝐗\mathbf{X} is parametrized in the SVD format

$\mathbf{X}=\mathbf{U}\mathbf{S}\mathbf{V}^{T},\quad\mathbf{U}^{T}\mathbf{U}=\mathbf{I}_{r},\quad\mathbf{V}^{T}\mathbf{V}=\mathbf{I}_{r}.$

The nuclear norm of 𝐗\mathbf{X} is given as the summation of its singular values, i.e.,

𝐗=trace(𝐗T𝐗)=i=1r𝐒ii.\lVert\mathbf{X}\rVert_{*}=\text{trace}(\mathbf{X}^{T}\mathbf{X})=\sum_{i=1}^{r}\mathbf{S}_{ii}.

Since the nuclear norm $\lVert\mathbf{X}\rVert_{*}$ is a convex envelope of the rank of a matrix, $\mathrm{rank}(\mathbf{X})$, it is often used as a regularization term in optimization problems to estimate low-rank matrices. See [5, 15, 20] and references therein for further discussion. Drawing an analogy with the nuclear norm regularization, a natural choice of the prior distribution for the singular values of the SVD-based model is the exponential of the negative of the nuclear norm of $\mathbf{X}$, which leads to

P(S)exp(λ𝐗)=exp(λi=1r𝐒ii),P(S)\propto\exp\left(-\lambda\lVert\mathbf{X}\rVert_{*}\right)=\exp\left(-\lambda\sum_{i=1}^{r}\mathbf{S}_{ii}\right),

where λ>0\lambda>0 is a penalty parameter.

Connection with nuclear norm regularization

Given a likelihood function $P(y\mid\mathbf{U},\mathbf{S},\mathbf{V})$, the maximum a posteriori (MAP) estimate can be expressed as

argmax𝐔,𝐒,𝐕P(𝐔,𝐒,𝐕|y)=argmax𝐔,𝐒,𝐕P(y|𝐔,𝐒,𝐕)π(𝐔)P(𝐒)π(𝐕),\operatorname*{arg\,max}_{\mathbf{U},\mathbf{S},\mathbf{V}}P(\mathbf{U},\mathbf{S},\mathbf{V}|y)=\operatorname*{arg\,max}_{\mathbf{U},\mathbf{S},\mathbf{V}}P(y|\mathbf{U},\mathbf{S},\mathbf{V})\pi_{\mathcal{H}}(\mathbf{U})P(\mathbf{S})\pi_{\mathcal{H}}(\mathbf{V}),

where $\pi_{\mathcal{H}}(\mathbf{U})$ and $\pi_{\mathcal{H}}(\mathbf{V})$ are uniform in the Hausdorff measure and implicitly impose the constraints $\mathbf{U}^{T}\mathbf{U}=\mathbf{I}_{r}$ and $\mathbf{V}^{T}\mathbf{V}=\mathbf{I}_{r}$ on Stiefel manifolds. In this way, the MAP estimate can be equivalently expressed as

$\operatorname*{arg\,min}_{\mathbf{U},\mathbf{S},\mathbf{V}}-\log P(y\mid\mathbf{U},\mathbf{S},\mathbf{V})+\lambda\sum_{i=1}^{r}\mathbf{S}_{ii},\quad\text{subject to}\quad\mathbf{U}^{T}\mathbf{U}=\mathbf{I}_{r},\quad\mathbf{V}^{T}\mathbf{V}=\mathbf{I}_{r}.$

The above formulation explains that the MAP estimate of our proposed SVD-based model is equivalent to a matrix estimation problem regularized by the nuclear norm, which is a widely-accepted way to impose low-rank properties.

Numerical demonstration

The above justification suggests that the SVD-based model, together with the nuclear-norm-type prior, is capable of determining the effective rank in a Bayesian matrix estimation problem. To demonstrate this rank revealing property, we use Case #1 of the synthetic example presented in Section 3 of the paper. Here we construct a true matrix $\mathbf{X}_{\text{true}}=\mathbf{A}\mathbf{B}^{T}$, where entries of $\mathbf{A}\in\mathbb{R}^{100\times 10}$ and $\mathbf{B}\in\mathbb{R}^{60\times 10}$ are drawn from i.i.d. standard Gaussians, and partially observed data $y_{ij}=\mathbf{X}_{ij}$ are given by randomly selecting a subset of 40% of matrix indices.

Although the true matrix has rank 10 by construction, we impose a maximum rank of 15 in the parameterization and run the proposed geodesic HMC algorithm to sample the posterior distribution. The 10% quantile, median, and 90% quantile of the posterior singular values are reported in Figure 6. Here we observe that beyond rank 10, the estimated singular values are close to zero and the sizes of the corresponding credible intervals are near zero. This is a clear indication that the actual rank of the estimated matrix is about 10. Using our proposed SVD-based models, the columns of $\mathbf{U}$ and $\mathbf{V}$ that do not contribute to the estimation problem are automatically assigned near-zero singular values in this example.

Refer to caption
Figure 6. Demonstration of the rank revealing property using Case #1 of the synthetic example, with 40%40\% data sampling rate. Left: quantiles of the posterior singular values. Right: quantiles of the posterior singular values on the natural logarithmic scale. The solid lines represent the median and the error bars represent 10%10\% and 90%90\% quantiles.

Appendix D Sampling efficiency

In this section, we provide a detailed summary of the sampling efficiency of the geodesic HMC and Gibbs sampling respectively applied to SVD-based models and the 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model. To preserve anonymity, the git repositories containing the code for reproducing the results will be made public after the completion of the review process. URLs and instructions for running the code will also be provided in the final submission.

Table 2 summarizes the CPU time per iteration. Here the number of steps TT and the time step ϵ\epsilon used by geodesic HMC are automatically tuned by the No-U-Turn-Sampler variant of HMC. Each iteration of Gibbs consists of a full sweep of updates of (𝐀,𝐁,γ,τ)(\mathbf{A},\mathbf{B},\gamma,\tau). The CPU times are measured on a DELL workstation with dual Intel Xeon(R) E5-2680 v4 CPU.

Table 2. Summary of the CPU time per iteration of the geodesic HMC and the Gibbs sampling, measured over 30 repetitions. The model 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} is sampled by the Gibbs sampling. The SVD-based models are sampled by geodesic HMC. For table entries shown with “/\nicefrac{{}}{{}},” the left and the right represent results for the 10%10\% and 40%40\% data sampling rates, respectively.
mean±\pmstandard deviation of CPU time (in 10310^{-3} second)
𝐀𝐁T\mathbf{A}\mathbf{B}^{T} SVD \ast-SVD
Case #1 4.1±1/5.6±1\nicefrac{{4.1\pm 1}}{{5.6\pm 1}} 19±2.1/26±2.9\nicefrac{{19\pm 2.1}}{{26\pm 2.9}} -
Case #2 4.1±0.3/4.9±0.8\nicefrac{{4.1\pm 0.3}}{{4.9\pm 0.8}} 67±14/189±12\nicefrac{{67\pm 14}}{{189\pm 12}} 85±56/159±16\nicefrac{{85\pm 56}}{{159\pm 16}}
Case #3 4.0±0.3/5.2±0.6\nicefrac{{4.0\pm 0.3}}{{5.2\pm 0.6}} 36±0.7/158±2.6\nicefrac{{36\pm 0.7}}{{158\pm 2.6}} 0.7±0.1/0.7±0.2\nicefrac{{0.7\pm 0.1}}{{0.7\pm 0.2}}
Mice 47±3.1/65±3.6\nicefrac{{47\pm 3.1}}{{65\pm 3.6}} 650±12/940±350\nicefrac{{650\pm 12}}{{940\pm 350}} 670±43/878±305\nicefrac{{670\pm 43}}{{878\pm 305}}
MovieLens 1400±3401400\pm 340 2100±2202100\pm 220 2400±1802400\pm 180

For each of the examples summarized in Table 2, we also show the MCMC traces generated by the corresponding samplers in Figures 7-20. For each example, we provide the traces of three random experiments. For all examples, the Gibbs sampling applied to the $\mathbf{A}\mathbf{B}^{T}$ model does not have satisfactory mixing properties: after several thousand iterations, the traces of the Markov chains clearly suggest that the chains are still not in the stationary phase. In comparison, all the geodesic HMC counterparts produce well-mixed chains after several hundred iterations. Although each iteration of the geodesic HMC is computationally more expensive than that of the Gibbs sampler, the geodesic HMC is able to significantly improve the convergence of the MCMC sampling in these examples.

Refer to caption
Figure 7. Three random experiments of Case #1 of the synthetic example. MCMC traces generated by Gibbs using the 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model. Top row: 10%10\% data sampling rate. Bottom row: 40%40\% data sampling rate.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 8. Three random experiments of Case #1 of the synthetic example. The SVD model and the geodesic HMC are used here. Top and middle row: MCMC traces. Bottom row: average autocorrelation function and standard deviation estimated over all parameters.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 9. Three random experiments of Case #2 of the synthetic example. MCMC traces generated by Gibbs using the 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 10. Three random experiments of Case #2 of the synthetic example. The S-SVD model and the geodesic HMC are used here. Top and middle row: MCMC traces. Bottom row: average autocorrelation function and standard deviation estimated over all parameters.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 11. Three random experiments of Case #2 of the synthetic example. The SVD model and the geodesic HMC are used here. Top and middle row: MCMC traces. Bottom row: average autocorrelation function and standard derivation estimated over all parameters.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 12. Three random experiments of Case #3 of the synthetic example. MCMC traces generated by Gibbs using the 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 13. Three random experiments of Case #3 of the synthetic example. The B-SVD model and the geodesic HMC are used here. Top and middle row: MCMC traces. Bottom row: average autocorrelation function and standard derivation estimated over all parameters.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 14. Three random experiments of Case #3 of the synthetic example. The SVD model and the geodesic HMC are used here. Top and middle row: MCMC traces. Bottom row: average autocorrelation function and standard derivation estimated over all parameters.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 15. Three random experiments of the mice protein expression data set. MCMC traces generated by Gibbs using the 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 16. Three random experiments of the mice protein expression data set. The S-SVD model and the geodesic HMC are used here. Top and middle row: MCMC traces. Bottom row: average autocorrelation function and standard derivation estimated over all parameters.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 17. Three random experiments of the mice protein expression data set. The SVD model and the geodesic HMC are used here. Top and middle row: MCMC traces. Bottom row: average autocorrelation function and standard derivation estimated over all parameters.
Refer to caption
Figure 18. Three random experiments of the MovieLens data set. MCMC traces generated by Gibbs using the 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model.
Refer to caption
Figure 19. Three random experiments of the MovieLens data set. The S-SVD model and the geodesic HMC are used here. Top and middle row: MCMC traces. Bottom row: average autocorrelation function and standard derivation estimated over all parameters.
Refer to caption
Figure 20. Three random experiments of the MovieLens data set. The SVD model and the geodesic HMC are used here. Top and middle row: MCMC traces. Bottom row: average autocorrelation function and standard derivation estimated over all parameters.

References

[1] Sungjin Ahn, Anoop Korattikara, Nathan Liu, Suju Rajan, and Max Welling. Large-scale distributed Bayesian matrix factorization using stochastic gradient MCMC. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.
[2] Vladimir Igorevich Arnol'd. Mathematical Methods of Classical Mechanics, volume 60. Springer Science & Business Media, 2013.
[3] James Bennett, Stan Lanning, et al. The Netflix prize. In Proceedings of KDD Cup and Workshop, 2007.
[4] Simon Byrne and Mark Girolami. Geodesic Monte Carlo on embedded manifolds. Scandinavian Journal of Statistics, 40(4):825–845, 2013.
[5] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[6] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, 2014.
[7] Saibal De, Hadi Salehi, and Alex Gorodetsky. Efficient MCMC sampling for Bayesian matrix factorization by breaking posterior symmetries. arXiv preprint arXiv:2006.04295, 2020.
[8] Persi Diaconis, Susan Holmes, and Mehrdad Shahshahani. Sampling from a manifold. In Advances in Modern Statistical Theory and Applications: A Festschrift in Honor of Morris L. Eaton, pages 102–125. Institute of Mathematical Statistics, 2013.
[9] Dheeru Dua and Casey Graff. UCI Machine Learning Repository, 2017.
[10] Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
[11] Alan Edelman, Tomás A. Arias, and Steven T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
[12] Herbert Federer. Geometric Measure Theory. Springer, 2014.
[13] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
[14] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1–19, 2015.
[15] Trevor Hastie, Rahul Mazumder, Jason D. Lee, and Reza Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. The Journal of Machine Learning Research, 16(1):3367–3402, 2015.
[16] Clara Higuera, Katheleen J. Gardiner, and Krzysztof J. Cios. Self-organizing feature maps identify proteins critical to learning in a mouse model of Down syndrome. PLoS ONE, 10, 2015.
[17] Matthew D. Hoffman, Andrew Gelman, et al. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
[18] Yew Jin Lim and Yee Whye Teh. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop, 2007.
[19] Jun S. Liu. Monte Carlo Strategies in Scientific Computing, volume 10. Springer, 2001.
[20] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research, 11:2287–2322, 2010.
[21] Shinichi Nakajima and Masashi Sugiyama. Theoretical analysis of Bayesian matrix factorization. Journal of Machine Learning Research, 12, 2011.
[22] Shinichi Nakajima, Masashi Sugiyama, S. Derin Babacan, and Ryota Tomioka. Global analytic solution of fully-observed variational Bayesian matrix factorization. Journal of Machine Learning Research, 14, 2013.
[23] Radford M. Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
[24] Piyush Rai, Yingjian Wang, Shengbo Guo, Gary Chen, David Dunson, and Lawrence Carin. Scalable Bayesian low-rank decomposition of incomplete multiway tensors. In International Conference on Machine Learning, 2014.
[25] Tapani Raiko, Alexander Ilin, and Juha Karhunen. Principal component analysis for large scale problems with lots of missing values. In European Conference on Machine Learning, 2007.
[26] Christian P. Robert and George Casella. Monte Carlo Statistical Methods, volume 2. Springer, 1999.
[27] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[28] Qibin Zhao, Liqing Zhang, and Andrzej Cichocki. Bayesian CP factorization of incomplete tensors with automatic rank determination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37, 2015.
[29] Qibin Zhao, Guoxu Zhou, Liqing Zhang, Andrzej Cichocki, and Shun-Ichi Amari. Bayesian robust tensor factorization for incomplete multiway data. IEEE Transactions on Neural Networks and Learning Systems, 27, 2015.