
Low-rank Bayesian matrix completion via geodesic Hamiltonian Monte Carlo on Stiefel manifolds

Tiangang Cui School of Mathematics and Statistics, University of Sydney, NSW 2006, Australia [email protected]  and  Alex A. Gorodetsky Department of Aerospace Engineering, University of Michigan, Ann Arbor, MI, 48109, USA [email protected]
Abstract.

We present a new sampling-based approach for enabling efficient computation of low-rank Bayesian matrix completion and for quantifying the associated uncertainty. First, we design a new prior model based on the singular-value-decomposition (SVD) parametrization of low-rank matrices. Our prior is analogous to the seminal nuclear-norm regularization used in the non-Bayesian setting and enforces orthogonality in the factor matrices by constraining them to Stiefel manifolds. We then design a geodesic Hamiltonian Monte Carlo (within Gibbs) algorithm for generating posterior samples of the SVD factor matrices. We demonstrate that our approach resolves the sampling difficulties encountered by standard Gibbs samplers for the common two-matrix factorization used in matrix completion. More importantly, the geodesic Hamiltonian sampler allows for sampling in cases with more general likelihoods than the typical Gaussian likelihood and Gaussian prior assumptions adopted in most of the existing Bayesian matrix completion literature. We demonstrate applications of our approach to a mice protein expression dataset and to the MovieLens recommendation problem. Numerical examples demonstrate superior sampling performance, including better mixing and faster convergence to a stationary distribution. Moreover, they demonstrate improved accuracy on the two real-world benchmark problems we considered.

Keywords. Matrix Completion, Markov Chain Monte Carlo, Low-rank matrices

1. Introduction

We consider the problem of Bayesian low-rank matrix completion. The matrix completion problem considers reconstructing a matrix $\mathbf{X}\in\mathbb{R}^{m\times n}$ through indirect and noisy evaluations of a subset of its elements. Low-rank matrix completion seeks to reduce the ill-posed nature of this problem by assuming that the underlying matrix has an approximately low-rank factorization. A low-rank factorization of a matrix $\mathbf{X}\in\mathbb{R}^{m\times n}$ is defined by the pair $(\mathbf{A},\mathbf{B})$, where $\mathbf{A}\in\mathbb{R}^{m\times r}$ and $\mathbf{B}\in\mathbb{R}^{n\times r}$ with $r<\min(m,n)$ and

(1) 𝐗𝐀𝐁T.\mathbf{X}\approx\mathbf{A}\mathbf{B}^{T}.

Recall that the best rank-$r$ approximation $\mathbf{X}_{r}=\operatorname*{arg\,min}_{\operatorname{rank}(\hat{\mathbf{X}})\leq r}\lVert\mathbf{X}-\hat{\mathbf{X}}\rVert_{F}^{2}$ is provided by the truncated singular value decomposition

(2) $\mathbf{X}_{r}=\mathbf{U}_{r}\mathbf{\Lambda}_{r}\mathbf{V}_{r}^{T},$

where $\mathbf{U}_{r}$ and $\mathbf{V}_{r}$ contain the first $r$ left and right singular vectors of $\mathbf{X}$; $\mathbf{\Lambda}_{r}$ is the diagonal matrix of the $r$ largest singular values; and we can identify $\mathbf{A}=\mathbf{U}_{r}\sqrt{\mathbf{\Lambda}_{r}}$ and $\mathbf{B}=\mathbf{V}_{r}\sqrt{\mathbf{\Lambda}_{r}}$.
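For concreteness, the following is a minimal NumPy sketch (not the authors' code) of this truncation and of the identification $\mathbf{A}=\mathbf{U}_{r}\sqrt{\mathbf{\Lambda}_{r}}$, $\mathbf{B}=\mathbf{V}_{r}\sqrt{\mathbf{\Lambda}_{r}}$; the function name best_rank_r is ours.

import numpy as np

def best_rank_r(X, r):
    # Illustrative sketch, not the authors' code.
    # Return A, B such that A @ B.T is the best rank-r approximation of X.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Ur, lam, Vr = U[:, :r], s[:r], Vt[:r, :].T     # leading r singular triplets
    A = Ur * np.sqrt(lam)                          # A = U_r sqrt(Lambda_r)
    B = Vr * np.sqrt(lam)                          # B = V_r sqrt(Lambda_r)
    return A, B

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 60))
A, B = best_rank_r(X, r=10)
print(np.linalg.norm(X - A @ B.T))                 # Frobenius error of the rank-10 truncation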

Bayesian low-rank matrix completion

We consider the Bayesian setting, aiming to identify distributions over the factors $\mathbf{A}$ and $\mathbf{B}$ to account for the uncertainty stemming from the inability to uniquely identify the correct matrix. The data consist of pairs of element indices and values $\{(i,j),y_{ij}\}$, where $i\in[m]$ denotes the row and $j\in[n]$ denotes the column. We use the index set $\mathcal{I}$ to denote the set of indices corresponding to the observed data points and let $|\mathcal{I}|$ be its cardinality. The full data vector is denoted by $y$. We consider the following likelihood models: a Gaussian likelihood

(3) yij\displaystyle y_{ij} 𝒩(h(𝐗ij),σ2),\displaystyle\sim\mathcal{N}(h(\mathbf{X}_{ij}),\sigma^{2}),

where hh is an observation function, e.g., identity or a soft-plus function to ensure positivity, and a Binomial likelihood

(4) yij\displaystyle y_{ij} Bin(k,h(𝐗ij)),h(𝐗ij)=(1+exp(𝐗ij))1\displaystyle\sim\text{Bin}(k,h(\mathbf{X}_{ij})),\quad h(\mathbf{X}_{ij})=(1+\exp(-\mathbf{X}_{ij}))^{-1}

where $k$ is the number of trials and $h(\mathbf{X}_{ij})$ denotes the success probability. Given the low-rank factorization of Equation (1), $\mathbf{X}_{ij}$ is obtained by multiplying the $i$th row of $\mathbf{A}$ with the $j$th row of $\mathbf{B}$. The Binomial likelihood is applied in categorical settings where the categories might have a natural ordering. However, we emphasize that our proposed sampling methodology is more broadly applicable to other likelihood formulations.

Challenge

One challenge specific to Bayesian matrix completion is the non-uniqueness of the low-rank representation. This challenge is pictorially represented in Figure 1, which demonstrates the geometry of posteriors that are prevalent in the problem. It is typically quite challenging to design a sampling scheme that is able to sample such posteriors, and even variational approaches are difficult to tune here. This challenge arises from the following symmetry in the problem: a low-rank factorization can be obtained by an arbitrary linear transformation of the factors via

(5) $\mathbf{X}=\mathbf{A}\mathbf{W}\mathbf{W}^{-1}\mathbf{B}^{T}=\hat{\mathbf{A}}\hat{\mathbf{B}}^{T}$

for any invertible $\mathbf{W}\in\mathbb{R}^{r\times r}$, where $\hat{\mathbf{A}}=\mathbf{A}\mathbf{W}$ and $\hat{\mathbf{B}}^{T}=\mathbf{W}^{-1}\mathbf{B}^{T}$. The precise ramifications of this symmetry in the Gaussian likelihood case were extensively documented in [7].
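As a quick illustration of this symmetry (an illustrative sketch, not the paper's code), one can verify numerically that transforming the factors by a random invertible $\mathbf{W}$ leaves the product unchanged:

import numpy as np

# Illustrative sketch, not the authors' code.
rng = np.random.default_rng(1)
m, n, r = 8, 6, 3
A = rng.standard_normal((m, r))
B = rng.standard_normal((n, r))
W = rng.standard_normal((r, r))            # a generic r x r matrix is invertible almost surely

A_hat = A @ W
B_hat = B @ np.linalg.inv(W).T             # so that A_hat @ B_hat.T = A W W^{-1} B^T

print(np.allclose(A @ B.T, A_hat @ B_hat.T))   # True: the data cannot distinguish (A, B) from (A_hat, B_hat)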

Refer to caption
Figure 1. Examples of 2D marginals of the posteriors obtained using standard Gaussian priors for the inference of real-valued low-rank matrix factorizations [7].

Contribution

We develop an approach that addresses these issues in two ways: (i) we adopt a factorization format that constrains the factor matrices to Stiefel manifolds; and (ii) we develop a geodesic Markov chain Monte Carlo approach that specifically targets distributions supported on such manifolds. Specifically, we propose to parameterize the low-rank matrix in the SVD format

(6) $\mathbf{X}=\mathbf{U}\mathbf{S}\mathbf{V}^{T},\quad\mathbf{U}^{T}\mathbf{U}=\mathbf{I}_{r},\quad\mathbf{V}^{T}\mathbf{V}=\mathbf{I}_{r},$

where we seek to learn semi-orthogonal matrices 𝐔\mathbf{U} and 𝐕\mathbf{V} and a positive diagonal matrix 𝐒\mathbf{S}. Within a Bayesian sampling context, we develop a geodesic Hamiltonian Monte Carlo (HMC) [4] within Gibbs algorithm to enable efficient sampling of the semi-orthogonal factor matrices.

This approach is closely related to the SVD models used in optimization-based methods [5] that solve a nuclear norm minimization problem

(7) minimize𝐗m×n𝐗subject toyh(Λ(𝐗))δ,\operatorname*{minimize}_{\mathbf{X}\in\mathbb{R}^{m\times n}}\quad\lVert\mathbf{X}\rVert_{*}\quad\text{subject to}\quad\lVert{y}-h(\Lambda(\mathbf{X}))\rVert\leq\delta,

where $\Lambda$ is an operator that subselects the observed elements of $\mathbf{X}$ and flattens them into a vector of the same size as $y$. Moreover, an SVD parameterization is commonly used in this optimization procedure. From this perspective, the optimization problem can be seen as equivalently seeking a maximum a posteriori (MAP) estimate with a uniform prior over semi-orthogonal matrices. To our knowledge, prior work in Bayesian matrix completion has not used this equivalent prior. As a result, we provide the first Bayesian analogue of this extremely popular low-rank matrix characterization.

Related work

A large amount of work has addressed the low-rank Bayesian matrix completion problem in the variational setting. For example, [18] and [25] each apply variational Bayes approximations of this inference model to analyze the Netflix prize challenge [3] to great success. Further, [21] and [22] develop a theoretical framework to analyze variational Bayes low-rank matrix factorization. One of the issues with these existing approaches is the mean-field-type approximation that assumes independence of the factors. Sampling-based approaches target the full posterior and, while more expensive, have shown that this more complete characterization can outperform variational inference [27]. This approach was later extended to other settings [6, 1], and has also been extended to the case of recovering low-rank tensor factorizations [24, 28, 29].

The majority of these algorithms leverage the bilinear nature of the low-rank factorization to develop Gibbs samplers via conditional distributions for each factor. In this setting each factor is endowed with an independent Gaussian prior, and the variance of this prior is endowed with a hyperprior. This approach is mainly used to try to circumvent the sampling difficulties previously described, but comes at the cost of increased uncertainty. This approach uses a Gaussian log likelihood function

(8) logP(y𝐀,𝐁)=||2log(2π)||log(σ)12σ2(i,j)(yij𝐀i:𝐁j:T)2.\log P(y\mid\mathbf{A},\mathbf{B})=-\frac{|\mathcal{I}|}{2}\log(2\pi)-|\mathcal{I}|\log(\sigma)-\frac{1}{2\sigma^{2}}\textstyle\sum_{(i,j)\in\mathcal{I}}\left(y_{ij}-\mathbf{A}_{i:}\mathbf{B}^{T}_{j:}\right)^{2}.

The prior on the factors assigns i.i.d. zero mean Gaussians for entries of the matrices 𝐀\mathbf{A} and 𝐁\mathbf{B} and assumes the prior variance is unknown. This leads to a hierarchical prior, conditioned on the variance

$P(\mathbf{A},\mathbf{B}\mid\delta)\propto\delta^{-(m+n)r}\exp\left(-\frac{1}{2\delta^{2}}\left(\textstyle\sum_{i=1,j=1}^{m,r}\mathbf{A}_{ij}^{2}+\textstyle\sum_{i=1,j=1}^{n,r}\mathbf{B}_{ij}^{2}\right)\right),$

where $\delta$ is an unknown hyperparameter that controls the prior variance. We endow $\delta$ with a Gamma prior. Finally, Gibbs sampling is used to estimate the factors $\mathbf{A}$ and $\mathbf{B}$, the prior variance $\delta$, and the likelihood standard deviation $\sigma$.

Outline

This paper is structured as follows. Section 2 provides background on sampling on the Stiefel manifold and presents our proposed HMC-within-Gibbs algorithm. Section 3 provides numerical results that compare our proposed approach to the Bayesian approach described above. The comparisons are made both on synthetic examples, to gain intuition, and on two real-world data sets that consider positivity constraints on the matrix elements and categorically valued matrices, respectively.

2. Sampling on low-rank manifolds

In this section we first describe geodesic Monte Carlo and then describe our Gibbs sampling procedure. The factor matrices we learn lie on Stiefel manifolds. Traditional Markov chain Monte Carlo approaches such as Hamiltonian Monte Carlo or Riemannian-manifold Hamiltonian Monte Carlo are not directly applicable to distributions over parameters that lie on such manifolds embedded in Euclidean space. Instead, so-called geodesic Monte Carlo approaches propose modifications of HMC and related methods that apply directly to distributions in the embedded space [4].

2.1. Operations on the Stiefel manifold

We begin with a description of operations on the Stiefel manifold. Let $\mathcal{M}$ denote an $m$-dimensional manifold that is embedded in a higher-dimensional Euclidean space $\mathbb{R}^{n}$. At every point $x\in\mathcal{M}$ there exists a bijective mapping $q_{x}:\mathcal{M}\to\mathbb{R}^{m}$ denoting a coordinate system on the manifold. The union of all possible coordinate systems across the manifold is called an atlas for the manifold. For our low-rank reconstruction purposes, we consider the Stiefel manifold of semi-orthogonal $n\times r$ matrices $\mathcal{M}=\{\mathbf{X}\in\mathbb{R}^{n\times r};\ \mathbf{X}^{T}\mathbf{X}=\mathbf{I}_{r}\}$.

The main challenge is describing motion along the manifold, since standard Euclidean vector-space addition does not apply. Instead, one must follow curves on the manifold between points. A curve on the manifold over a time interval starting at some point $x\in\mathcal{M}$ is defined by $\gamma:[0,T]\to\mathcal{M}$ with $\gamma(0)=x$. At each point $x$, a tangent vector is an equivalence class of these curves (two curves $\phi(t)$ and $\psi(t)$ are equivalent if $\phi(0)=\psi(0)$ and $\lim_{t\to 0}\frac{\phi(t)-\psi(t)}{t}=0$), and the set of tangent vectors forms a tangent vector space $TM_{x}\subset\mathbb{R}^{n}$ [2].

These tangent vectors are time derivatives of the curves, $v=\dot{\gamma}(0)=\left.\frac{d}{dt}\gamma(t)\right|_{t=0}$. Every point on the Stiefel manifold satisfies $\mathbf{X}^{T}(t)\mathbf{X}(t)=\mathbf{I}_{r}$, so differentiation in time yields $\mathbf{\dot{X}}^{T}(t)\mathbf{X}(t)+\mathbf{X}^{T}(t)\mathbf{\dot{X}}(t)=0$. Therefore the tangent space is $TM_{x}=\{\mathbf{Z}\in\mathbb{R}^{n\times r};\ \mathbf{Z}^{T}\mathbf{X}+\mathbf{X}^{T}\mathbf{Z}=\mathbf{0}\}$.

The geodesic Monte Carlo algorithm makes use of the space normal to the manifold. This normal space is the orthogonal complement to TMxTM_{x}, and therefore a definition of an inner product is needed. For the Stiefel manifold this inner product is defined through the trace 𝐘,𝐙x=trace(𝐘T𝐙).\langle\mathbf{Y},\mathbf{Z}\rangle_{x}=\mathrm{trace}(\mathbf{Y}^{T}\mathbf{Z}). One can then verify that for any 𝐗\mathbf{X}\in\mathcal{M} on the manifold, the orthogonal projection of 𝐙n×r\mathbf{Z}\in\mathbb{R}^{n\times r} onto the normal space is

(9) πNx(𝐙)=12𝐗(𝐗T𝐙+𝐙T𝐗),\pi_{N_{x}}(\mathbf{Z})=\frac{1}{2}\mathbf{X}\left(\mathbf{X}^{T}\mathbf{Z}+\mathbf{Z}^{T}\mathbf{X}\right),

and the orthogonal projection of $\mathbf{Z}$ onto the tangent space becomes $(I-\pi_{N_{x}})(\mathbf{Z})=\mathbf{Z}-\pi_{N_{x}}(\mathbf{Z})$. Finally, we will require geodesics, the curves of shortest length between two points on the manifold. Since the Stiefel manifold is embedded in a Euclidean space, the geodesic can be defined by the fact that the acceleration vector at each point is normal to the manifold as long as the curve is traced with uniform speed. For Stiefel manifolds we are able to obtain the following closed-form expression [11]

(10) $\left[\mathbf{X}(t),\mathbf{\dot{X}}(t)\right]=\left[\mathbf{X}(0),\mathbf{\dot{X}}(0)\right]\exp\left(t\begin{bmatrix}\mathbf{A}&-\mathbf{S}(0)\\ \mathbf{I}&\mathbf{A}\end{bmatrix}\right)\begin{bmatrix}\exp(-t\mathbf{A})&0\\ 0&\exp(-t\mathbf{A})\end{bmatrix}$

where $\mathbf{A}=\mathbf{X}^{T}(0)\mathbf{\dot{X}}(0)$ and $\mathbf{S}(0)=\mathbf{\dot{X}}^{T}(0)\mathbf{\dot{X}}(0)$.
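The geodesic flow (10) can be evaluated directly with two small matrix exponentials of size $2r\times 2r$ and $r\times r$. The following is an illustrative Python sketch (assumed helper names, not the authors' released code):

import numpy as np
from scipy.linalg import expm

def project_tangent(X, Z):
    # Illustrative sketch, not the authors' code.
    # Project Z onto the tangent space at X by removing the normal component, Eq. (9).
    return Z - 0.5 * X @ (X.T @ Z + Z.T @ X)

def stiefel_geodesic(X, V, t):
    # Flow (X, V) along the Stiefel geodesic of Eq. (10) for time t.
    r = X.shape[1]
    A = X.T @ V                                    # A = X(0)^T Xdot(0)
    S = V.T @ V                                    # S(0) = Xdot(0)^T Xdot(0)
    M = expm(t * np.block([[A, -S], [np.eye(r), A]]))
    E = expm(-t * A)
    XV = np.hstack([X, V]) @ M                     # [X(0), Xdot(0)] times the 2r x 2r flow
    return XV[:, :r] @ E, XV[:, r:] @ E

# Quick check that the flow stays on the manifold.
rng = np.random.default_rng(2)
X, _ = np.linalg.qr(rng.standard_normal((10, 3)))
V = project_tangent(X, rng.standard_normal((10, 3)))
X1, V1 = stiefel_geodesic(X, V, t=0.1)
print(np.allclose(X1.T @ X1, np.eye(3), atol=1e-8))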

2.2. Sampling on the Stiefel manifold

While the Lebesgue measure is the reference measure for probability distributions in the Euclidean space, the Hausdorff measure can be used as the reference measure on manifolds [8]. It will be useful to move between an mm-dimensional Hausdorff measure m\mathcal{H}^{m} and the mm-dimensional Lebesgue measure λm\lambda^{m}. In the context of Riemannian manifolds, such as the Stiefel manifold, the relation between these measures is provided by the following formula [12]

(11) m(dq)=|G(q)|λm(dq),\mathcal{H}^{m}(dq)=\sqrt{|G(q)|}\lambda^{m}(dq),

where $G(q)$ is the Riemannian metric provided by the trace. Using the metric $G(q)$ and the target density $\pi(q)$, we can form the non-separable Hamiltonian function [13]

(12) H(q,p)=logπ(q)+12log|G(q)|+12pTG(q)1p,H(q,p)=-\log\pi(q)+\frac{1}{2}\log|G(q)|+\frac{1}{2}p^{T}G(q)^{-1}p,

where $\pi(q)$ is the target density with respect to the Lebesgue measure in the coordinate system provided by $q$, and $p$ is an auxiliary momentum variable following $\mathcal{N}(0,G(q))$. Note that $\exp(-H(q,p))$ defines an unnormalized joint probability density over the pair $(q,p)$, in which the marginal density over $q$ recovers the target density. As shown by the landmark paper [10], the Hamiltonian system

dqidt=Hpi,dpidt=Hqi,\frac{dq_{i}}{dt}=\frac{\partial H}{\partial p_{i}},\quad\frac{dp_{i}}{dt}=-\frac{\partial H}{\partial q_{i}},

implicitly defines a time-reversible and one-to-one mapping $T$ from the state at time $t$, $(q(t),p(t))$, to the state at time $t+s$, $(q(t+s),p(t+s))$. This mapping $T$ leaves the target distribution $\pi(q)$ invariant. Thus, discretizing the Hamiltonian system using numerical integrators naturally leads to MCMC proposal distributions that in general can reduce sample correlations in the resulting Markov chain. Moreover, using time-reversible and symplectic numerical integrators often yields simplified and more dimension-scalable MCMC proposals. See [23] and references therein.

To define HMC on Riemannian manifolds [13], we can use the relation (11) to write the Hamiltonian function (12) with respect to the Hausdorff measure as $H(q,p)=-\log\pi_{\mathcal{H}}(q)+\frac{1}{2}p^{T}G(q)^{-1}p$. The corresponding Hamiltonian system evolves the pair $(q,p)$ according to

(13) dqdt=Hp=G(q)1p,dpdt=Hq=q[logπ(q)12pTG(q)1p].\displaystyle\frac{dq}{dt}=\frac{\partial H}{\partial p}=G(q)^{-1}p,\qquad\frac{dp}{dt}=-\frac{\partial H}{\partial q}=\nabla_{q}\left[\log\pi_{\mathcal{H}}(q)-\frac{1}{2}p^{T}G(q)^{-1}p\right].

A key innovation of [4] is the introduction of a time-reversible and symplectic integrator that integrates this non-separable Hamiltonian system via splitting, H(q,p)=H[1](q,p)+H[2](q,p)H(q,p)=H^{[1]}(q,p)+H^{[2]}(q,p) where H[1](q,p)=logπ(q)H^{[1]}(q,p)=-\log\pi_{\mathcal{H}}(q) and H[2](q,p)=12pTG(q)1p.H^{[2]}(q,p)=\frac{1}{2}p^{T}G(q)^{-1}p. The corresponding systems are

(14) $\displaystyle\frac{dq}{dt}=0,\qquad\frac{dp}{dt}=\nabla_{q}\log\pi_{\mathcal{H}}(q),$
(15) $\displaystyle\frac{dq}{dt}=G^{-1}(q)p,\qquad\frac{dp}{dt}=-\frac{1}{2}p^{T}\frac{\partial}{\partial q}G^{-1}(q)\,p,$

respectively. The integrator first solves (14) for a time step of $\epsilon/2$, then solves (15) for a time step of $\epsilon$, and finally propagates (14) again for a time step of $\epsilon/2$. The solution of the first system (14) is given simply by $(q(t),p(t))=(q_{0},p_{0}+t\nabla_{q}\log\pi_{\mathcal{H}}(q_{0}))$, and the solution to the second system (15) is given by the geodesic flow, which keeps the state on the manifold. If we denote the embedded parameter and momentum by $(\mathbf{X},\mathbf{V})$ rather than $(q,p)$, the final HMC algorithm, made specific to the case of the Stiefel manifold, is provided in Algorithm 1. Note that the Metropolis-Hastings rejection rule is applied at the end of the time integration in Algorithm 1 to ensure that the HMC generates Markov chains that have the target distribution $\pi(q)$ as their invariant distribution.

Algorithm 1 𝚑𝚖𝚌𝚜𝚝𝚒𝚎𝚏𝚎𝚕\mathtt{hmc-stiefel}: HMC step on the Stiefel Manifold [4]
1:timestep ϵ,\epsilon, number of steps TT, current sample on Stiefel Manifold 𝐗\mathbf{X}, target density π\pi_{\mathcal{H}}
2:v𝒩(0,𝐈nr)v\sim\mathcal{N}(0,\mathbf{I}_{nr}) and 𝐕reshape(v,(n,r))\mathbf{V}\leftarrow\textrm{reshape}(v,(n,r))
3:Hcurrent=logπ(𝐗)12vTvH_{\text{current}}=\log\pi_{\mathcal{H}}(\mathbf{X})-\frac{1}{2}v^{T}v
4:𝐕𝐕12𝐗(𝐗T𝐕+𝐕T𝐗)\mathbf{V}\leftarrow\mathbf{V}-\frac{1}{2}\mathbf{X}(\mathbf{X}^{T}\mathbf{V}+\mathbf{V}^{T}\mathbf{X}) and vreshape(𝐕,(nr))v\leftarrow\textrm{reshape}(\mathbf{V},(nr))
5:for τ=1,,T\tau=1,\ldots,T do
6:     vv+ϵ2xlogπ(𝐗)v\leftarrow v+\frac{\epsilon}{2}\nabla_{x}\log\pi_{\mathcal{H}}(\mathbf{X}) and 𝐕reshape(v,(n,r))\mathbf{V}\leftarrow\textrm{reshape}(v,(n,r)) \triangleright Propagate velocity
7:     𝐕𝐕12𝐗(𝐗T𝐕+𝐕T𝐗)\mathbf{V}\leftarrow\mathbf{V}-\frac{1}{2}\mathbf{X}(\mathbf{X}^{T}\mathbf{V}+\mathbf{V}^{T}\mathbf{X}) \triangleright Project back onto manifold
8:     (𝐗,𝐕)(\mathbf{X},\mathbf{V})\leftarrow solve Eq. (10) with initial condition (𝐗,𝐕)(\mathbf{X},\mathbf{V}) for time ϵ\epsilon \triangleright Evolve on the geodesic
9:     vreshape(𝐕,(nr))v\leftarrow\textrm{reshape}(\mathbf{V},(nr)), vv+ϵ2xlogπ(𝐗)v\leftarrow v+\frac{\epsilon}{2}\nabla_{x}\log\pi_{\mathcal{H}}(\mathbf{X}) and 𝐕reshape(v,(n,r))\mathbf{V}\leftarrow\textrm{reshape}(v,(n,r))
10:     𝐕𝐕12𝐗(𝐗T𝐕+𝐕T𝐗)\mathbf{V}\leftarrow\mathbf{V}-\frac{1}{2}\mathbf{X}(\mathbf{X}^{T}\mathbf{V}+\mathbf{V}^{T}\mathbf{X}) and vreshape(𝐕,(nr))v\leftarrow\textrm{reshape}(\mathbf{V},(nr))
11:end for
12:Hnext=logπ(𝐗)12vTvH_{\textrm{next}}=\log\pi_{\mathcal{H}}(\mathbf{X})-\frac{1}{2}v^{T}v
13:if $\mathcal{U}[0,1]<\exp(H_{\text{next}}-H_{\text{current}})$ then $\mathbf{X}_{\text{next}}\leftarrow\mathbf{X}$; otherwise $\mathbf{X}_{\text{next}}$ keeps the current sample
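The following Python sketch condenses Algorithm 1 into a single function. It is an illustrative implementation under our own naming (hmc_stiefel, log_post, and grad_log_post are assumed user-supplied names, not code released with the paper), with the gradient taken with respect to the embedded $n\times r$ coordinates.

import numpy as np
from scipy.linalg import expm

def project_tangent(X, Z):
    # Illustrative sketch, not the authors' code.
    # Remove the normal component, Eq. (9): Z - X (X^T Z + Z^T X) / 2.
    return Z - 0.5 * X @ (X.T @ Z + Z.T @ X)

def geodesic(X, V, t):
    # Closed-form Stiefel geodesic flow of Eq. (10).
    r = X.shape[1]
    A, S = X.T @ V, V.T @ V
    M = expm(t * np.block([[A, -S], [np.eye(r), A]]))
    E = expm(-t * A)
    XV = np.hstack([X, V]) @ M
    return XV[:, :r] @ E, XV[:, r:] @ E

def hmc_stiefel(X, log_post, grad_log_post, eps, T, rng):
    # One geodesic HMC step targeting log_post (w.r.t. the Hausdorff measure).
    X0 = X.copy()
    V = rng.standard_normal(X.shape)
    H_current = log_post(X) - 0.5 * np.sum(V * V)
    V = project_tangent(X, V)
    for _ in range(T):
        V = project_tangent(X, V + 0.5 * eps * grad_log_post(X))   # half step on the momentum
        X, V = geodesic(X, V, eps)                                  # evolve along the geodesic
        V = project_tangent(X, V + 0.5 * eps * grad_log_post(X))   # second half step
    H_next = log_post(X) - 0.5 * np.sum(V * V)
    if np.log(rng.uniform()) < H_next - H_current:                  # Metropolis-Hastings accept/reject
        return X
    return X0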

2.3. Application to low-rank matrix completion

We utilize geodesic HMC within Gibbs to solve the matrix completion problem. The pseudocode is provided in Algorithm 2. Here we see that the HMC on the Stiefel manifold is used to sample the left and right singular vectors, while a standard HMC approach is used for the singular values. In all cases we use the No-U-Turn-Sampler variant of HMC [17] to adaptively tune the number of steps.

Algorithm 2 Geodesic HMC within Gibbs for matrix completion
1:$\epsilon$, timestep; $T$, number of steps; $N$, number of samples; $(\mathbf{U}^{(0)},\mathbf{S}^{(0)},\mathbf{V}^{(0)})$, initial samples; conditional distributions $\pi_{U}(\mathbf{U}\mid\mathbf{S},\mathbf{V})$, $\pi_{S}(\mathbf{S}\mid\mathbf{U},\mathbf{V})$, $\pi_{V}(\mathbf{V}\mid\mathbf{S},\mathbf{U})$.
2:for k=1,,Nk=1,\ldots,N do
3:     $\mathbf{U}^{(k)}\leftarrow\mathtt{hmc\text{-}stiefel}(\pi_{U}(\cdot\mid\mathbf{S}^{(k-1)},\mathbf{V}^{(k-1)}))$
4:     𝐕(k)𝚑𝚖𝚌𝚜𝚝𝚒𝚎𝚏𝚎𝚕(πV(𝐒(k1),𝐔(k)))\mathbf{V}^{(k)}\leftarrow\mathtt{hmc-stiefel}(\pi_{V}(\cdot\mid\mathbf{S}^{(k-1)},\mathbf{U}^{(k)}))
5:     𝐒(k)𝚑𝚖𝚌(πS(𝐔(k),𝐕(k)))\mathbf{S}^{(k)}\leftarrow\mathtt{hmc}(\pi_{S}(\cdot\mid\mathbf{U}^{(k)},\mathbf{V}^{(k)}))
6:end for

Prior

For all the SVD-based matrix-completion problems in this work, we consider independent prior distributions for the matrices $\mathbf{U}$, $\mathbf{S}$, and $\mathbf{V}$. The priors of the left and right factors are uniform in the Hausdorff measure, whereas the prior of the singular values is a product of independent exponential distributions in Euclidean space. This gives the prior

(16) π(𝐔,𝐒,𝐕)=π(𝐔)P(𝐒)π(𝐕)exp(λ=1r𝐒),\pi(\mathbf{U},\mathbf{S},\mathbf{V})=\pi_{\mathcal{H}}(\mathbf{U})P(\mathbf{S})\pi_{\mathcal{H}}(\mathbf{V})\propto\exp\left(-\lambda\textstyle\sum_{\ell=1}^{r}\mathbf{S}_{\ell\ell}\right),

for some positive rate parameter λ\lambda. The exponential prior can be interpreted as the exponential of the negative nuclear norm of the underlying matrix 𝐗\mathbf{X}, i.e., λ𝐗=λ=1r𝐒\lambda\|\mathbf{X}\|_{\ast}=\lambda\sum_{\ell=1}^{r}\mathbf{S}_{\ell\ell}, which is a convex envelope of rank(𝐗)\text{rank}(\mathbf{X}) used in the optimization literature to promote low-rankness in matrix estimation problems [5, 20, 15]. Thus, the exponential prior can be viewed as a rank revealing prior. We provide further justifications and numerical illustrations in Section C of the supplementary material.
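As a small illustration (a sketch with our own function name, not the paper's code), the log prior density of (16) contributed by the singular values, with the factors contributing only constants, is:

import numpy as np

def log_prior_singular_values(s, lam):
    # Illustrative sketch, not the authors' code.
    # Exponential (nuclear-norm) prior: log p(S) = -lam * sum_l S_ll + const, for S_ll >= 0.
    s = np.asarray(s)
    if np.any(s < 0):
        return -np.inf        # singular values are constrained to be non-negative
    return -lam * np.sum(s)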

In the rest of this section, we describe three likelihood models used for the numerical experiments: (1) real-valued data, (2) positive real-valued data for the mice protein expression example, and (3) positive integer-valued rating data for the movie recommendation.

Model SVD:

In this model, we treat the observed data as real-valued, and consider the Gaussian likelihood arising from the model

(17) yij=𝐗ij+ξ=𝐔i:𝐒𝐕j:T+ξ,y_{ij}=\mathbf{X}_{ij}+\xi=\mathbf{U}_{i:}\mathbf{S}\mathbf{V}_{j:}^{T}+\xi,

where ξ𝒩(0,σ2).\xi\sim\mathcal{N}(0,\sigma^{2}). The log likelihood becomes

logP(y𝐔,𝐒,𝐕)=||2log(2π)||log(σ)12σ2(i,j)(yij𝐔i:𝐒𝐕j:T)2.\log P(y\mid\mathbf{U},\mathbf{S},\mathbf{V})=-\frac{|\mathcal{I}|}{2}\log(2\pi)-|\mathcal{I}|\log(\sigma)-\frac{1}{2\sigma^{2}}\textstyle\sum_{(i,j)\in\mathcal{I}}\left(y_{ij}-\mathbf{U}_{i:}\mathbf{S}\mathbf{V}^{T}_{j:}\right)^{2}.

Model S-SVD:

In this model, we treat the observed data as positive real-valued, and consider the Gaussian likelihood arising from the model

(18) $y_{ij}=h(\mathbf{X}_{ij})+\xi=h(\mathbf{U}_{i:}\mathbf{S}\mathbf{V}_{j:}^{T})+\xi,\quad h(z)=\log(1+\exp(z)),$

where ξ𝒩(0,σ2).\xi\sim\mathcal{N}(0,\sigma^{2}). The log likelihood becomes

logP(y𝐔,𝐒,𝐕)=||2log(2π)||log(σ)12σ2(i,j)(yijh(𝐔i:𝐒𝐕j:T))2.\log P(y\mid\mathbf{U},\mathbf{S},\mathbf{V})=-\frac{|\mathcal{I}|}{2}\log(2\pi)-|\mathcal{I}|\log(\sigma)-\frac{1}{2\sigma^{2}}\textstyle\sum_{(i,j)\in\mathcal{I}}\left(y_{ij}-h(\mathbf{U}_{i:}\mathbf{S}\mathbf{V}^{T}_{j:})\right)^{2}.

For Model SVD and Model S-SVD, the maximum a posteriori (MAP) estimate of the posterior is equivalent to a penalized version of the nuclear norm minimization problem (7). This equivalence is obtained by taking the log of the prior (16) and adding either of the log likelihoods described above.

Model B-SVD:

In this model, we treat the observed data as positive integer-valued, and consider the Binomial likelihood,

(19) yijBin(k,h(𝐗ij)),h(𝐗ij)=(1+exp(𝐗ij))1=(1+exp(𝐔i:𝐒𝐕j:T))1.y_{ij}\sim\text{Bin}(k,h(\mathbf{X}_{ij})),\quad h(\mathbf{X}_{ij})=(1+\exp(-\mathbf{X}_{ij}))^{-1}=(1+\exp(-\mathbf{U}_{i:}\mathbf{S}\mathbf{V}^{T}_{j:}))^{-1}.

This yields the log likelihood

logP(y𝐔,𝐒,𝐕)=(i,j)(log(kyij)+yijlogh(𝐔i:𝐒𝐕j:T)+(kyij)log(1h(𝐔i:𝐒𝐕j:T))).\log P(y\mid\mathbf{U},\mathbf{S},\mathbf{V})=\textstyle\sum_{(i,j)\in\mathcal{I}}\left(\log{k\choose y_{ij}}+y_{ij}\log h(\mathbf{U}_{i:}\mathbf{S}\mathbf{V}^{T}_{j:})+\left(k-y_{ij}\right)\log\left(1-h(\mathbf{U}_{i:}\mathbf{S}\mathbf{V}^{T}_{j:})\right)\right).
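For reference, the three log likelihoods above can be evaluated over the observed index set $\mathcal{I}$ at $O(|\mathcal{I}|r)$ cost. The following Python sketch is an illustrative implementation under our own naming (it stores $\mathbf{S}$ as a length-$r$ vector and $\mathcal{I}$ as integer index arrays rows, cols); it is not the authors' code.

import numpy as np
from scipy.special import gammaln

def entries(U, s, V, rows, cols):
    # Illustrative sketch, not the authors' code.
    # X_ij = U_{i:} S V_{j:}^T for the observed (i, j) pairs; s holds the diagonal of S.
    return np.sum(U[rows] * s[None, :] * V[cols], axis=1)

def loglik_svd(y, U, s, V, rows, cols, sigma):
    # Model SVD, Gaussian likelihood of Eq. (17).
    resid = y - entries(U, s, V, rows, cols)
    n = y.size
    return -0.5 * n * np.log(2 * np.pi) - n * np.log(sigma) - 0.5 * np.sum(resid**2) / sigma**2

def loglik_s_svd(y, U, s, V, rows, cols, sigma):
    # Model S-SVD, Eq. (18): softplus observation function h(z) = log(1 + exp(z)).
    resid = y - np.logaddexp(0.0, entries(U, s, V, rows, cols))
    n = y.size
    return -0.5 * n * np.log(2 * np.pi) - n * np.log(sigma) - 0.5 * np.sum(resid**2) / sigma**2

def loglik_b_svd(y, U, s, V, rows, cols, k):
    # Model B-SVD, Binomial likelihood of Eq. (19) with the logistic link.
    x = entries(U, s, V, rows, cols)
    log_h = -np.logaddexp(0.0, -x)          # log h(x)
    log_1mh = -np.logaddexp(0.0, x)         # log(1 - h(x))
    log_binom = gammaln(k + 1) - gammaln(y + 1) - gammaln(k - y + 1)
    return np.sum(log_binom + y * log_h + (k - y) * log_1mh)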

3. Numerical examples

We now demonstrate the efficacy of the proposed SVD-based models and the geodesic HMC sampling methods on three examples, including partially observed synthetic matrices, the mice protein expression data set [16], and the MovieLens data set [14].

Problem setups

Example 1 includes the following synthetic test cases that aim to benchmark the proposed methods.

  • Case #1:

    We construct a true matrix 𝐗true=𝐀𝐁T\mathbf{X}_{\text{true}}=\mathbf{A}\mathbf{B}^{T}, where entries of 𝐀,𝐁\mathbf{A},\mathbf{B} are drawn from i.i.d. standard Gaussian, and then partially observed data yij=𝐗ijy_{ij}=\mathbf{X}_{ij} are given by randomly selecting a subset of matrix indices.

  • Case #2:

We construct a true positive matrix $\mathbf{X}_{\text{true}}=\mathbf{A}\mathbf{B}^{T}$, where entries of $\mathbf{A}$ are drawn i.i.d. from $\text{Exp}(1)$ and entries of $\mathbf{B}$ are drawn i.i.d. from $\text{Uniform}([0,1])$. Then partially observed data $y_{ij}=\mathbf{X}_{ij}$ are given by randomly selecting a subset of matrix indices.

  • Case #3:

    We construct a true matrix 𝐗true=𝐀𝐁T\mathbf{X}_{\text{true}}=\mathbf{A}\mathbf{B}^{T}, where entries of 𝐀,𝐁\mathbf{A},\mathbf{B} are drawn from i.i.d. standard Gaussian. Then partially observed data are drawn from the Binomial model, i.e., yijBin(k,(1+exp(𝐗ij))1)y_{ij}\sim\text{Bin}(k,(1+\exp(-\mathbf{X}_{ij}))^{-1}), where the indices are randomly chosen.

For each of the cases, we use $100\times 60$ matrices with rank 10. We conduct two sets of experiments that respectively select 10% and 40% of matrix entries as training data. The cross validation data sets are chosen as 40% of matrix entries that do not overlap with the training data sets.
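As an illustration of the synthetic setup (a sketch under assumed details: the random seed and the exact split mechanics are ours; only the sizes and rates follow the text), Case #1 with a 10% training rate and a disjoint 40% validation set can be generated as follows.

import numpy as np

# Illustrative sketch, not the authors' code.
rng = np.random.default_rng(0)
m, n, r = 100, 60, 10
X_true = rng.standard_normal((m, r)) @ rng.standard_normal((n, r)).T   # rank-10 ground truth

idx = rng.permutation(m * n)                       # shuffle all matrix indices
n_train, n_valid = int(0.1 * m * n), int(0.4 * m * n)
rows_tr, cols_tr = np.unravel_index(idx[:n_train], (m, n))
rows_cv, cols_cv = np.unravel_index(idx[n_train:n_train + n_valid], (m, n))
y_train = X_true[rows_tr, cols_tr]                 # observed entries y_ij = X_ij
y_valid = X_true[rows_cv, cols_cv]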

Example 2 is the mice protein expression data set, which consists of 77 protein expressions, measured in terms of nuclear fractions, from 1080 mice specimens. The $1080\times 77$ matrix contains positive real-valued entries. This data set is openly available at the UCI Machine Learning Repository [9]. We use rank 20 in the training. Here we also conduct two sets of experiments that respectively select 10% and 40% of matrix entries at random as training data. The cross validation data sets are chosen as 40% of matrix entries at random that do not overlap with the training data sets.

Example 3 uses the ml-latest-small version of the MovieLens data set, which contains 100836 non-negative ratings across 9742 movies rated by 610 users. This gives a $610\times 9724$ matrix with 100836 partially observed rating data (the data set is openly available at https://files.grouplens.org/datasets/movielens/). Each observed data entry $y_{ij}$ is a categorical variable that takes values from a finite sequence in the range $[0,5]$ with increment $0.5$, i.e., $y_{ij}\in\{0,0.5,1,\ldots,4,4.5,5\}$. To apply the Binomial likelihood, we scale the rating data by a factor of two, so they become integer-valued in the range $[0,10]$ with increment $1$. We use rank 20 in the training. In this example, we randomly select 80% of the data in the data set as the training data, and use the remaining 20% as cross validation data.

Here, Case #1 of the synthetic example aims to benchmark the expressive power and the sampling efficiency of the SVD model in comparison with the $\mathbf{A}\mathbf{B}^{T}$ model. For this reason, the ground truth is generated from the prior distribution of the $\mathbf{A}\mathbf{B}^{T}$ model. Synthetic Cases #2 and #3 mimic the behavior of the mice protein expression data set [16] and the MovieLens data set [14], respectively.

Table 1. Summary of sampling performance and prediction accuracy. The model 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} is sampled by the Gibbs sampling. The SVD-based models are sampled by geodesic HMC. For table entries shown with /\nicefrac{{}}{{}}, the left and the right represent results for the 10%10\% and 40%40\% data sampling rates, respectively.
MCMC mixing    [1%, 50%, 99%] quantiles of MADs
Gibbs / HMC    $\mathbf{A}\mathbf{B}^{T}$    SVD    $\ast$-SVD
Case #1    [0.04, 2.3, 9.8] / [0.04, 2.3, 10.0]    [0.04, 2.3, 9.8] / [0.04, 2.3, 10.0]    -
Case #2    [0.20, 6.3, 24] / [0.67, 6.1, 13]    [1.9, 6.2, 13] / [0.52, 6.1, 13]    [1.9, 6.2, 13] / [0.58, 6.1, 13]
Case #3    [0.15, 6.3, 21.8] / [0.69, 6.1, 13.4]    [0.15, 5.5, 10.5] / [0.24, 5.4, 10.4]    [0, 5, 10] / [0, 5, 11]
Mice    [0, 0.33, 2.6] / [0, 0.22, 2.4]    [0, 0.33, 2.6] / [0.02, 0.22, 2.4]    [0, 0.36, 2.1] / [0.02, 0.42, 1.8]
MovieLens    [0.02, 1.2, 8.0]    [0.03, 1.6, 5.0]    [0, 1, 5]

A summary of the sampling performance and the prediction accuracy of the various methods is provided in Table 1. The table indicates whether each sampling method converges to the posterior distribution within 10000 iterations, including 2000 burn-in iterations, and reports the $[1\%,50\%,99\%]$ quantiles of the estimated median absolute deviations (MADs) on the cross validation data sets. Since both the HMC and the Gibbs sampling algorithms discussed here converge asymptotically in the limit of infinite iterations [19, 26], we only assess convergence over finitely many iterations using sample traces. For Markov chains with reasonable mixing, we then compute the autocorrelation time to benchmark the rate of convergence [19]. More details about the sampling performance and the prediction accuracy of the various models applied to the above examples are provided in the rest of this section.

Sampling performance

Overall, Gibbs sampling on the $\mathbf{A}\mathbf{B}^{T}$ model demonstrates poor mixing in our examples. The Markov chains often do not converge to the stationary distribution after many iterations. In comparison, all Markov chains generated by HMC on the SVD-based models demonstrate superior mixing. For brevity, we only report the mixing of Markov chains in Case #1 of the synthetic example here, in which the ground truth is generated by the $\mathbf{A}\mathbf{B}^{T}$ model. As shown in Figure 2, after $8\times 10^{3}$ sweeps of Gibbs updates on the parameters $(\mathbf{A},\mathbf{B},\delta,\sigma)$, the traces of the Markov chain have not reached the stationary distribution. In comparison, the traces of the geodesic HMC reach the stationary distribution very quickly. To further demonstrate the sampling efficiency of HMC, we report the estimated average autocorrelation time over all parameters in Figure 2 (right), which shows that the autocorrelation of the Markov chains generated by HMC decreases to zero after one iteration. We observe similar mixing behavior in all other test cases of the synthetic example and also in Examples 2 and 3. See the appendices for the mixing diagnostics of the remaining examples.

We observe that each iteration of the geodesic HMC requires more computational effort than one sweep of the Gibbs sampler. For Case #1 of the synthetic example, the CPU time per iteration of the geodesic HMC is about 5 times that of the Gibbs sampler. For positive matrices (Case #2 and the mice protein data), the CPU time per iteration of the geodesic HMC is about 15-30 times that of the Gibbs sampler. For the rating data (Case #3 and the MovieLens), the CPU time per iteration of the geodesic HMC is about 2-8 times that of the Gibbs sampler. However, the higher computational cost of the geodesic HMC is clearly offset by its superior sampling performance.

Refer to caption
Figure 2. One random experiment of Case #1 of the synthetic example, with 10%10\% data sampling rate. From left to right: MCMC traces produced by 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} and SVD, and the average autocorrelation times ±\pm standard deviations produced by HMC. Error bars are obtained using 30 random experiments.
Refer to caption
Figure 3. Case #1 of the synthetic example. Histograms of prediction MADs estimated using different approaches. The reported error bars are obtained using 30 random experiments. The SVD model is sampled by HMC, while the 𝐀𝐁T\mathbf{AB}^{T} model is sampled by Gibbs.

Prediction accuracy

For all the examples, we report the histograms of the posterior MADs (compared to the cross validation data sets) generated by various models. For all the synthetic test cases, the uncertainty intervals on the histogram are computed using 30 random experiments. For examples 2 and 3, the uncertainty intervals are computed using 10 random experiments.

Figure 3 shows the MADs for Case #1 of the synthetic example. For both data sampling rates, the SVD model and the 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model produce comparable MADs in this case.

Figure 4 shows the MADs for Case #2 of the synthetic example and the mice protein data set. Both examples here involve matrices with positive entries. We apply the S-SVD model (which preserves the positivity), the SVD model, and the $\mathbf{A}\mathbf{B}^{T}$ model. For Case #2, Figure 4 (a) shows that with a high data sampling rate (40%), all three models produce comparable MADs. However, with a low data sampling rate (10%), the S-SVD model and the SVD model still generate results similar to the 40% data sampling rate case, whereas the MADs of the $\mathbf{A}\mathbf{B}^{T}$ model exhibit some heavy-tail behavior. For the mice protein data set, Figure 4 (b) shows that the S-SVD model outperforms the SVD model and the $\mathbf{A}\mathbf{B}^{T}$ model in the low data sampling rate case, while all three models have similar performance in the high data sampling rate case (note that the S-SVD model has the smallest uncertainty interval). Overall, the S-SVD model demonstrates a clear advantage.

Refer to caption
(a)
Refer to caption
(b)
Figure 4. Histograms of prediction MADs for positive matrices. (a): Case #2. (b): mice data. For both (a) and (b), the top row and bottom row correspond to 10%10\% and 40%40\% data sampling rates, respectively; from the left column to the right column we have the S-SVD, SVD, and 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} models. The S-SVD model generally achieves better performance with smaller tails than the 𝐀𝐁T\mathbf{AB}^{T} model.

Figure 5 shows the MADs for Case #3 of the synthetic example and the MovieLens data set. Both examples here involve rating data. We apply the B-SVD model (which uses the Binomial likelihood), the SVD model and the 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model. For Case #3, Figure 5 (a) shows that for both data sampling rates, the two SVD-based models generate comparable results and are both more accurate compared to that of the 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model.

Refer to caption
(a)
Refer to caption
(b)
Figure 5. Histograms of prediction MADs for rating data. (a): Case #3, where the top row and bottom row correspond to 10%10\% and 40%40\% data sampling rates, respectively. (b): MovieLens. For both (a) and (b), from the left column to the right column we have the B-SVD, SVD, and 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} models.

For the MovieLens data set, Figure 5 (b) shows that the B-SVD model outperforms the SVD model and the $\mathbf{A}\mathbf{B}^{T}$ model, and it also has the smallest uncertainty interval. Overall, the B-SVD model demonstrates a clear advantage in prediction accuracy.

4. Acknowledgments

T. Cui acknowledges support from the Australian Research Council Discovery Project DP210103092. A. Gorodetsky acknowledges support from the Department of Energy Office of Scientific Research, ASCR under grant DE-SC0020364.

Appendix A Vectorization of conditional posteriors of SVD-based models

In this section we derive the conditional posteriors of 𝐔\mathbf{U}, 𝐒\mathbf{S}, and 𝐕\mathbf{V} in the singular value decomposition model with a Gaussian likelihood. The same procedure can be undertaken for the posteriors associated with all the proposed likelihood models.

As a starting point, we let γ=1/σ2\gamma=1/\sigma^{2} to denote the precision parameter of the Gaussian observation noise and assign it a conjugate Gamma prior Γ(ασ,βσ)\Gamma(\alpha_{\sigma},\beta_{\sigma}) with ασ=βσ=104\alpha_{\sigma}=\beta_{\sigma}=10^{-4}. We have the full posterior distribution

(20) $P(\mathbf{U},\mathbf{S},\mathbf{V},\gamma\mid y)\propto\gamma^{\frac{|\mathcal{I}|}{2}}\exp\left(-\frac{\gamma}{2}\sum_{(i,j)\in\mathcal{I}}\left(y_{ij}-\mathbf{U}_{i:}\mathbf{S}\mathbf{V}^{T}_{j:}\right)^{2}\right)\pi_{\mathcal{H}}(\mathbf{U})P(\mathbf{S})\pi_{\mathcal{H}}(\mathbf{V})\,\gamma^{\alpha_{\sigma}-1}\exp(-\beta_{\sigma}\gamma),$

where π(𝐔)\pi_{\mathcal{H}}(\mathbf{U}) and π(𝐕)\pi_{\mathcal{H}}(\mathbf{V}) are uniform prior densities in the Hausdorff measure and P(𝐒)P(\mathbf{S}) is some prior density for the singular values. Given (𝐒,𝐕,γ)(\mathbf{S},\mathbf{V},\gamma), the conditional posterior distribution for 𝐔\mathbf{U} is

(21) πU(𝐔y,𝐒,𝐕,γ)\displaystyle\pi_{U}\left(\mathbf{U}\mid y,\mathbf{S},\mathbf{V},\gamma\right) exp[γ2(i,j)(yij𝐔i:𝐒𝐕j:T)2]=exp[γ2(i,j)(yij𝐔i:𝐁j)2]\displaystyle\propto\exp\left[-\frac{\gamma}{2}\sum_{(i,j)\in\mathcal{I}}\left(y_{ij}-\mathbf{U}_{i:}\mathbf{S}\mathbf{V}^{T}_{j:}\right)^{2}\right]=\exp\left[-\frac{\gamma}{2}\sum_{(i,j)\in\mathcal{I}}\left(y_{ij}-\mathbf{U}_{i:}\mathbf{B}_{j}\right)^{2}\right]

where 𝐁j=𝐒𝐕j:T\mathbf{B}_{j}=\mathbf{S}\mathbf{V}^{T}_{j:}. In the implementation it is easier to work directly with 𝐔\mathbf{U} rather than repeated indexing into its rows. To this end, it is convenient to reorder yy according to the row indices so that we have

(22) y=[𝐲1𝐲m]\displaystyle y=\begin{bmatrix}\mathbf{y}_{1}\\ \vdots\\ \mathbf{y}_{m}\end{bmatrix} =[𝐁~1𝐁~m][𝐔1:T𝐔m:T]+ξ=𝐁𝚛𝚎𝚜𝚑𝚊𝚙𝚎(𝐔,(mr))+ξ,\displaystyle=\begin{bmatrix}\tilde{\mathbf{B}}_{1}&&\\ &\ddots&\\ &&\tilde{\mathbf{B}}_{m}\end{bmatrix}\begin{bmatrix}\mathbf{U}^{T}_{1:}\\ \vdots\\ \mathbf{U}^{T}_{m:}\end{bmatrix}+\mathbf{\xi}=\mathbf{B}\ \mathtt{reshape}(\mathbf{U},(mr))+\mathbf{\xi},

where $m_{k}$ is the number of observed entries in the $k$-th row of the data matrix, so that $\mathbf{y}_{k}\in\mathbb{R}^{m_{k}}$; $\tilde{\mathbf{B}}_{k}\in\mathbb{R}^{m_{k}\times r}$ is the matrix whose rows are the vectors $\mathbf{B}_{j}^{T}$ corresponding to the columns $j$ of the observed entries in row $k$; $\mathbf{U}_{k:}\in\mathbb{R}^{r}$ is the $k$-th row of $\mathbf{U}$; and $\mathbf{B}$ is a block-diagonal matrix. Together with the uniform prior in the Hausdorff measure, this results in the following conditional posterior

(23) πU(𝐔y,𝐒,𝐕,γ)exp[γ2y𝐁reshape(𝐔,(mr))2].\pi_{U}(\mathbf{U}\mid y,\mathbf{S},\mathbf{V},\gamma)\propto\exp\left[-\frac{\gamma}{2}\lVert y-\mathbf{B}\ \texttt{reshape}(\mathbf{U},(mr))\rVert^{2}\right].

This density does not correspond to that of a Gaussian distribution over Euclidean space because the matrix must satisfy orthogonality properties. Thus there is no closed-form approach for sampling from this posterior, and some MCMC method is needed. The conditional posterior for 𝐕\mathbf{V} can be obtained in a similar way.

Next we consider the conditional posterior of the singular values lying along the diagonal of the matrix 𝐒\mathbf{S}. We can assign any prior with positive support for the diagonal elements of this matrix. In this paper we choose an exponential prior P(𝐒)exp(λ=1r𝐒)P(\mathbf{S})\propto\exp(-\lambda\sum_{\ell=1}^{r}\mathbf{S}_{\ell\ell}), which draws an analogy with the nuclear norm regularization. Then, Equation (20) leads to the conditional posterior

(24) πS(𝐒y,𝐔,𝐕,γ)exp[γ2(i,j)(yij𝐔i:𝐒𝐕j:T)2λ=1r𝐒]\displaystyle\pi_{S}\left(\mathbf{S}\mid y,\mathbf{U},\mathbf{V},\gamma\right)\propto\exp\left[-\frac{\gamma}{2}\sum_{(i,j)\in\mathcal{I}}\left(y_{ij}-\mathbf{U}_{i:}\mathbf{S}\mathbf{V}^{T}_{j:}\right)^{2}-\lambda\sum_{\ell=1}^{r}\mathbf{S}_{\ell\ell}\right]

where $\lambda$ is the rate hyperparameter of the exponential distribution. To further simplify the computation, we can also express the above conditional posterior in the form

(25) πS(𝐒y,𝐔,𝐕,γ)exp[γ2y𝐌diag(𝐒)2λ=1r𝐒],\displaystyle\pi_{S}\left(\mathbf{S}\mid y,\mathbf{U},\mathbf{V},\gamma\right)\propto\exp\left[-\frac{\gamma}{2}\lVert y-\mathbf{M}\ \texttt{diag}(\mathbf{S})\rVert^{2}-\lambda\sum_{\ell=1}^{r}\mathbf{S}_{\ell\ell}\right],

where the vector y||y\in\mathbb{R}^{|\mathcal{I}|} is the same as the one defined in (22); diag(𝐒)\texttt{diag}(\mathbf{S}) maps the diagonal entries of the matrix 𝐒r×r\mathbf{S}\in\mathbb{R}^{r\times r} into a vector in r\mathbb{R}^{r} consisting of all the singular values; and the matrix 𝐌||×r\mathbf{M}\in\mathbb{R}^{|\mathcal{I}|\times r} is given as

𝐌=[𝐔i1:𝐕j1:𝐔ik:𝐕jk:],\mathbf{M}=\begin{bmatrix}\mathbf{U}_{i_{1}:}\circ\mathbf{V}_{j_{1}:}\\ \vdots\\ \mathbf{U}_{i_{k}:}\circ\mathbf{V}_{j_{k}:}\\ \vdots\end{bmatrix},

where \circ denotes the Hadamard product.

The computational complexity of evaluating the conditional densities (23) and (25) and their gradients is governed by the cost of matrix-vector products with $\mathbf{B}$ and $\mathbf{M}$, which are both $O(|\mathcal{I}|r)$.
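As an illustration of this cost (a sketch with our own function names, not the paper's code), the matrix $\mathbf{M}$ is simply the row-wise Hadamard product of the indexed rows of $\mathbf{U}$ and $\mathbf{V}$, so the conditional (25) and its gradient can be evaluated without forming any block-diagonal matrix explicitly:

import numpy as np

def log_cond_S(s, y, U, V, rows, cols, gamma, lam):
    # Illustrative sketch, not the authors' code.
    # Unnormalized log pi_S(S | y, U, V, gamma) of Eq. (25); s is the vector diag(S).
    M = U[rows] * V[cols]                      # |I| x r, with rows U_{i:} o V_{j:}
    resid = y - M @ s                          # y - M diag(S)
    return -0.5 * gamma * np.sum(resid**2) - lam * np.sum(s)

def grad_log_cond_S(s, y, U, V, rows, cols, gamma, lam):
    M = U[rows] * V[cols]
    return gamma * M.T @ (y - M @ s) - lam * np.ones_like(s)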

Finally, the conditional posterior of the precision parameter γ\gamma takes the form

(26) P(γ|y,𝐔,𝐒,𝐕)γ||2+ασ1exp[(12(i,j)(yij𝐔i:𝐒𝐕j:T)2+βσ)γ],P(\gamma|y,\mathbf{U},\mathbf{S},\mathbf{V})\propto\gamma^{\frac{|\mathcal{I}|}{2}+\alpha_{\sigma}-1}\exp\left[-\left(\frac{1}{2}\sum_{(i,j)\in\mathcal{I}}\left(y_{ij}-\mathbf{U}_{i:}\mathbf{S}\mathbf{V}^{T}_{j:}\right)^{2}+\beta_{\sigma}\right)\gamma\right],

which is another Gamma distribution that can be sampled directly.
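Since (26) is a Gamma density in $\gamma$, it can be sampled with a single call to a standard Gamma sampler. A minimal sketch follows (assumed function name; note that NumPy's Gamma sampler uses the shape/scale parametrization, so the rate is inverted):

import numpy as np

def sample_gamma_precision(y, x_obs, rng, alpha_sigma=1e-4, beta_sigma=1e-4):
    # Illustrative sketch, not the authors' code.
    # x_obs holds the model values U_{i:} S V_{j:}^T on the observed index set I.
    rss = np.sum((y - x_obs) ** 2)                 # residual sum of squares
    shape = 0.5 * y.size + alpha_sigma
    rate = 0.5 * rss + beta_sigma
    return rng.gamma(shape, scale=1.0 / rate)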

Appendix B Gibbs 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model

In this section we provide the pseudocode of the Gibbs sampler used for sampling the $\mathbf{A}\mathbf{B}^{T}$ model. Recall that the unknown parameters are the low-rank factors $\mathbf{A},\mathbf{B}$, the measurement noise variance $\sigma^{2}$, and the prior variance of the factors $\delta^{2}$. Similar to the SVD model, we use the precision parameters $\gamma=1/\sigma^{2}$ and $\tau=1/\delta^{2}$ and assign them conjugate Gamma distributions $\Gamma(\alpha_{\sigma},\beta_{\sigma})$ and $\Gamma(\alpha_{\delta},\beta_{\delta})$, respectively. Here all the rate and shape parameters of the Gamma priors are set to $10^{-4}$. Together these facts lead to the following full posterior

$P(\mathbf{A},\mathbf{B},\gamma,\tau\mid y)\propto\ \gamma^{\frac{|\mathcal{I}|}{2}}\exp\left[-\frac{\gamma}{2}\sum_{(i,j)\in\mathcal{I}}\left(y_{ij}-\mathbf{A}_{i:}\mathbf{B}^{T}_{j:}\right)^{2}\right]\gamma^{\alpha_{\sigma}-1}\exp(-\beta_{\sigma}\gamma)\times$
(27) τ(m+n)r2exp[τ2(i=1,j=1m,r𝐀ij2+i=1,j=1n,r𝐁ij2)]ταδ1exp(βδτ)\displaystyle\tau^{\frac{(m+n)r}{2}}\exp\left[-\frac{\tau}{2}\left(\sum_{i=1,j=1}^{m,r}\mathbf{A}_{ij}^{2}+\sum_{i=1,j=1}^{n,r}\mathbf{B}_{ij}^{2}\right)\right]\tau^{\alpha_{\delta}-1}\exp(-\beta_{\delta}\tau)

The conditional distributions for $\mathbf{A}$ and $\mathbf{B}$ are both Gaussian, so samples can be drawn using Gibbs sampling. However, directly sampling those Gaussian distributions requires factorizing rather large covariance matrices, e.g., a matrix of dimension $mr\times mr$ for $\mathbf{A}$. This can have superlinear computational complexity, for instance cubic complexity if directly applying the Cholesky factorization. One way to reduce the computational cost is to apply Gibbs sampling to each of $\mathbf{A}$ and $\mathbf{B}$ row by row. Denoting by $\mathbf{A}_{i:}$ the $i$-th row of $\mathbf{A}$ and by $\mathbf{A}_{-i:}$ the matrix containing the remaining rows of $\mathbf{A}$, the conditional posterior of $\mathbf{A}_{i:}$ given $(\mathbf{A}_{-i:},\mathbf{B},\gamma,\tau)$ is

$P(\mathbf{A}_{i:}\mid y,\mathbf{A}_{-i:},\mathbf{B},\gamma,\tau)\propto\left\{\begin{array}{ll}\exp\left[-\frac{\gamma}{2}\sum_{j\in\mathcal{J}_{i}}\left(y_{ij}-\mathbf{A}_{i:}\mathbf{B}^{T}_{j:}\right)^{2}-\frac{\tau}{2}\sum_{j=1}^{r}\mathbf{A}_{ij}^{2}\right]&\mathcal{J}_{i}\neq\emptyset\\ \exp\left[-\frac{\tau}{2}\sum_{j=1}^{r}\mathbf{A}_{ij}^{2}\right]&\mathcal{J}_{i}=\emptyset\end{array}\right.,$

where $\mathcal{J}_{i}$ is the index set containing all the column indices $j$ such that $(i,j)\in\mathcal{I}$ for the given row index $i$. Note that the second case refers to the situation where row $i$ of the matrix is not observed at all, in which case sampling proceeds from the prior. The conditional posterior of $\mathbf{B}_{j:}$ can be derived in a similar way. Each of the conditional Gaussian distributions has dimension $r$, and is thus computationally much cheaper to simulate. The pseudocode is provided in Algorithm 3, and a sketch of one row-wise Gaussian draw is given after it.

Algorithm 3 Gibbs sampler for the 𝐀𝐁T\mathbf{AB}^{T} model
1:NN number of samples, (𝐀(0),𝐁(0),γ(0),τ(0))(\mathbf{A}^{(0)},\mathbf{B}^{(0)},\gamma^{(0)},\tau^{(0)}) initial samples;
2:for k=1,,Nk=1,\ldots,N do
3:     γ(k)𝚜𝚊𝚖𝚙𝚕𝚎_𝚐𝚊𝚖𝚖𝚊(P(γ|y,𝐀(k1),𝐁(k1),τ(k1)))\gamma^{(k)}\leftarrow\mathtt{sample\_gamma}(P(\gamma|y,\mathbf{A}^{(k-1)},\mathbf{B}^{(k-1)},\tau^{(k-1)}))
4:     τ(k)𝚜𝚊𝚖𝚙𝚕𝚎_𝚐𝚊𝚖𝚖𝚊(P(τ|y,𝐀(k1),𝐁(k1),γ(k)))\tau^{(k)}\leftarrow\mathtt{sample\_gamma}(P(\tau|y,\mathbf{A}^{(k-1)},\mathbf{B}^{(k-1)},\gamma^{(k)}))
5:     𝐀(k)𝐀(k1)\mathbf{A}^{(k)}\leftarrow\mathbf{A}^{(k-1)}
6:     for i=1,,mi=1,\ldots,m do
7:         $\mathbf{A}_{i:}^{(k)}\leftarrow\mathtt{sample\_gaussian}(P(\mathbf{A}_{i:}\mid y,\mathbf{A}^{(k)}_{-i:},\mathbf{B}^{(k-1)},\gamma^{(k)},\tau^{(k)}))$
8:     end for
9:     𝐁(k)𝐁(k1)\mathbf{B}^{(k)}\leftarrow\mathbf{B}^{(k-1)}
10:     for j=1,,nj=1,\ldots,n do
11:         𝐁j:(k)𝚜𝚊𝚖𝚙𝚕𝚎_𝚐𝚊𝚞𝚜𝚜𝚒𝚊𝚗(P(𝐁j:|y,𝐁j:(k),𝐀(k),γ(k),τ(k)))\mathbf{B}_{j:}^{(k)}\leftarrow\mathtt{sample\_gaussian}(P(\mathbf{B}_{j:}|y,\mathbf{B}^{(k)}_{-j:},\mathbf{A}^{(k)},\gamma^{(k)},\tau^{(k)}))
12:     end for
13:end for
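For reference, the following is a minimal sketch of one such row-wise Gaussian draw (the sample_gaussian step for $\mathbf{A}_{i:}$ in Algorithm 3; the $\mathbf{B}_{j:}$ update is symmetric). The function name and argument layout are ours, not the paper's released code.

import numpy as np

def sample_row_A(y_row, cols_i, B, gamma, tau, rng):
    # Illustrative sketch, not the authors' code.
    # Draw A_{i:} | y, B, gamma, tau. y_row holds the observed entries of row i and
    # cols_i their column indices; an unobserved row is drawn from the N(0, 1/tau I_r) prior.
    r = B.shape[1]
    if cols_i.size == 0:
        return rng.standard_normal(r) / np.sqrt(tau)
    Bi = B[cols_i]                                         # |J_i| x r
    prec = gamma * Bi.T @ Bi + tau * np.eye(r)             # posterior precision
    mean = np.linalg.solve(prec, gamma * Bi.T @ y_row)     # posterior mean
    L = np.linalg.cholesky(prec)
    return mean + np.linalg.solve(L.T, rng.standard_normal(r))   # mean + Prec^{-1/2} z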

Appendix C Rank revealing of SVD-based models

Here we demonstrate the rank revealing property of the SVD-based models. Recall that the low-rank matrix 𝐗\mathbf{X} is parametrized in the SVD format

$\mathbf{X}=\mathbf{U}\mathbf{S}\mathbf{V}^{T},\quad\mathbf{U}^{T}\mathbf{U}=\mathbf{I}_{r},\quad\mathbf{V}^{T}\mathbf{V}=\mathbf{I}_{r}.$

The nuclear norm of 𝐗\mathbf{X} is given as the summation of its singular values, i.e.,

𝐗=trace(𝐗T𝐗)=i=1r𝐒ii.\lVert\mathbf{X}\rVert_{*}=\text{trace}(\mathbf{X}^{T}\mathbf{X})=\sum_{i=1}^{r}\mathbf{S}_{ii}.

Since the nuclear norm $\lVert\mathbf{X}\rVert_{*}$ is a convex envelope of the rank of a matrix, $\mathrm{rank}(\mathbf{X})$, it is often used as a regularization term in optimization problems to estimate low-rank matrices. See [5, 15, 20] and references therein for further discussion. Drawing an analogy with the nuclear norm regularization, a natural choice of the prior distribution for the singular values of the SVD-based model is the exponential of the negative of the nuclear norm of $\mathbf{X}$, which leads to

P(S)exp(λ𝐗)=exp(λi=1r𝐒ii),P(S)\propto\exp\left(-\lambda\lVert\mathbf{X}\rVert_{*}\right)=\exp\left(-\lambda\sum_{i=1}^{r}\mathbf{S}_{ii}\right),

where λ>0\lambda>0 is a penalty parameter.

Connection with nuclear norm regularization

Given a likelihood function $P(y\mid\mathbf{U},\mathbf{S},\mathbf{V})$, the maximum a posteriori (MAP) estimate can be expressed as

argmax𝐔,𝐒,𝐕P(𝐔,𝐒,𝐕|y)=argmax𝐔,𝐒,𝐕P(y|𝐔,𝐒,𝐕)π(𝐔)P(𝐒)π(𝐕),\operatorname*{arg\,max}_{\mathbf{U},\mathbf{S},\mathbf{V}}P(\mathbf{U},\mathbf{S},\mathbf{V}|y)=\operatorname*{arg\,max}_{\mathbf{U},\mathbf{S},\mathbf{V}}P(y|\mathbf{U},\mathbf{S},\mathbf{V})\pi_{\mathcal{H}}(\mathbf{U})P(\mathbf{S})\pi_{\mathcal{H}}(\mathbf{V}),

where $\pi_{\mathcal{H}}(\mathbf{U})$ and $\pi_{\mathcal{H}}(\mathbf{V})$ are uniform in the Hausdorff measure and implicitly impose the constraints $\mathbf{U}^{T}\mathbf{U}=\mathbf{I}_{r}$ and $\mathbf{V}^{T}\mathbf{V}=\mathbf{I}_{r}$ on Stiefel manifolds. In this way, the MAP estimate can be equivalently expressed as

$\operatorname*{arg\,min}_{\mathbf{U},\mathbf{S},\mathbf{V}}-\log P(y\mid\mathbf{U},\mathbf{S},\mathbf{V})+\lambda\sum_{i=1}^{r}\mathbf{S}_{ii},\quad\text{subject to}\quad\mathbf{U}^{T}\mathbf{U}=\mathbf{I}_{r},\quad\mathbf{V}^{T}\mathbf{V}=\mathbf{I}_{r}.$

The above formulation explains that the MAP estimate of our proposed SVD-based model is equivalent to a matrix estimation problem regularized by the nuclear norm, which is a widely-accepted way to impose low-rank properties.

Numerical demonstration

The above justification suggests that the SVD-based model, together with the nuclear-norm-type prior, is capable of determining the effective rank in a Bayesian matrix estimation problem. To demonstrate this rank revealing property, we use Case #1 of the synthetic example presented in Section 3 of the paper. Here we construct a true matrix $\mathbf{X}_{\text{true}}=\mathbf{A}\mathbf{B}^{T}$, where entries of $\mathbf{A}\in\mathbb{R}^{100\times 10}$ and $\mathbf{B}\in\mathbb{R}^{60\times 10}$ are drawn from i.i.d. standard Gaussians, and partially observed data $y_{ij}=\mathbf{X}_{ij}$ are given by randomly selecting a subset of 40% of matrix indices.

Although the true matrix has rank 10 by construction, we impose a maximum rank of 15 in the parameterization and run the proposed geodesic HMC algorithm to sample the posterior distribution. The 10% quantile, median, and 90% quantile of the posterior singular values are reported in Figure 6. Here we observe that beyond rank 10, the estimated singular values are close to zero and the sizes of the corresponding credible intervals are near zero. This is a clear indication that the actual rank of the estimated matrix is about 10. Using our proposed SVD-based models, the columns of $\mathbf{U}$ and $\mathbf{V}$ that do not contribute to the estimation problem are automatically assigned near-zero singular values in this example.

Refer to caption
Figure 6. Demonstration of the rank revealing property using Case #1 of the synthetic example, with 40%40\% data sampling rate. Left: quantiles of the posterior singular values. Right: quantiles of the posterior singular values on the natural logarithmic scale. The solid lines represent the median and the error bars represent 10%10\% and 90%90\% quantiles.

Appendix D Sampling efficiency

In this section, we provide a detailed summary of the sampling efficiency of the geodesic HMC and Gibbs sampling respectively applied to SVD-based models and the 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model. To preserve anonymity, the git repositories containing the code for reproducing the results will be made public after the completion of the review process. URLs and instructions for running the code will also be provided in the final submission.

Table 2 summarizes the CPU time per iteration. Here the number of steps TT and the time step ϵ\epsilon used by geodesic HMC are automatically tuned by the No-U-Turn-Sampler variant of HMC. Each iteration of Gibbs consists of a full sweep of updates of (𝐀,𝐁,γ,τ)(\mathbf{A},\mathbf{B},\gamma,\tau). The CPU times are measured on a DELL workstation with dual Intel Xeon(R) E5-2680 v4 CPU.

Table 2. Summary of the CPU time per iteration of the geodesic HMC and the Gibbs sampling, measured over 30 repetitions. The model 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} is sampled by the Gibbs sampling. The SVD-based models are sampled by geodesic HMC. For table entries shown with “/\nicefrac{{}}{{}},” the left and the right represent results for the 10%10\% and 40%40\% data sampling rates, respectively.
mean±\pmstandard deviation of CPU time (in 10310^{-3} second)
𝐀𝐁T\mathbf{A}\mathbf{B}^{T} SVD \ast-SVD
Case #1 4.1±1/5.6±1\nicefrac{{4.1\pm 1}}{{5.6\pm 1}} 19±2.1/26±2.9\nicefrac{{19\pm 2.1}}{{26\pm 2.9}} -
Case #2 4.1±0.3/4.9±0.8\nicefrac{{4.1\pm 0.3}}{{4.9\pm 0.8}} 67±14/189±12\nicefrac{{67\pm 14}}{{189\pm 12}} 85±56/159±16\nicefrac{{85\pm 56}}{{159\pm 16}}
Case #3 4.0±0.3/5.2±0.6\nicefrac{{4.0\pm 0.3}}{{5.2\pm 0.6}} 36±0.7/158±2.6\nicefrac{{36\pm 0.7}}{{158\pm 2.6}} 0.7±0.1/0.7±0.2\nicefrac{{0.7\pm 0.1}}{{0.7\pm 0.2}}
Mice 47±3.1/65±3.6\nicefrac{{47\pm 3.1}}{{65\pm 3.6}} 650±12/940±350\nicefrac{{650\pm 12}}{{940\pm 350}} 670±43/878±305\nicefrac{{670\pm 43}}{{878\pm 305}}
MovieLens 1400±3401400\pm 340 2100±2202100\pm 220 2400±1802400\pm 180

For each of the examples summarized in Table 2, we also show the MCMC traces generated by the corresponding samplers in Figures 7-20. For each example, we provide the traces of three random experiments. For all examples, the Gibbs sampling applied to the $\mathbf{A}\mathbf{B}^{T}$ model does not have satisfactory mixing properties: after several thousand iterations, the traces of the Markov chains clearly suggest that the chains are still not in the stationary phase. In comparison, all the geodesic HMC counterparts produce well-mixed chains after several hundred iterations. Although each iteration of the geodesic HMC is computationally more expensive than that of the Gibbs sampler, the geodesic HMC is able to significantly improve the convergence of the MCMC sampling in these examples.

Refer to caption
Figure 7. Three random experiments of Case #1 of the synthetic example. MCMC traces generated by Gibbs using the 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model. Top row: 10%10\% data sampling rate. Bottom row: 40%40\% data sampling rate.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 8. Three random experiments of Case #1 of the synthetic example. The SVD model and the geodesic HMC are used here. Top and middle row: MCMC traces. Bottom row: average autocorrelation function and standard deviation estimated over all parameters.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 9. Three random experiments of Case #2 of the synthetic example. MCMC traces generated by Gibbs using the 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 10. Three random experiments of Case #2 of the synthetic example. The S-SVD model and the geodesic HMC are used here. Top and middle row: MCMC traces. Bottom row: average autocorrelation function and standard deviation estimated over all parameters.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 11. Three random experiments of Case #2 of the synthetic example. The SVD model and the geodesic HMC are used here. Top and middle row: MCMC traces. Bottom row: average autocorrelation function and standard derivation estimated over all parameters.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 12. Three random experiments of Case #3 of the synthetic example. MCMC traces generated by Gibbs using the 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 13. Three random experiments of Case #3 of the synthetic example. The B-SVD model and the geodesic HMC are used here. Top and middle row: MCMC traces. Bottom row: average autocorrelation function and standard derivation estimated over all parameters.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 14. Three random experiments of Case #3 of the synthetic example. The SVD model and the geodesic HMC are used here. Top and middle row: MCMC traces. Bottom row: average autocorrelation function and standard derivation estimated over all parameters.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 15. Three random experiments of the mice protein expression data set. MCMC traces generated by Gibbs using the 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 16. Three random experiments of the mice protein expression data set. The S-SVD model and the geodesic HMC are used here. Top and middle row: MCMC traces. Bottom row: average autocorrelation function and standard derivation estimated over all parameters.
Refer to caption
(a) 10%10\% data sampling rate
Refer to caption
(b) 40%40\% data sampling rate
Figure 17. Three random experiments of the mice protein expression data set. The SVD model and the geodesic HMC are used here. Top and middle row: MCMC traces. Bottom row: average autocorrelation function and standard derivation estimated over all parameters.
Refer to caption
Figure 18. Three random experiments of the MovieLens data set. MCMC traces generated by Gibbs using the 𝐀𝐁T\mathbf{A}\mathbf{B}^{T} model.
Refer to caption
Figure 19. Three random experiments of the MovieLens data set. The S-SVD model and the geodesic HMC are used here. Top and middle row: MCMC traces. Bottom row: average autocorrelation function and standard derivation estimated over all parameters.
Refer to caption
Figure 20. Three random experiments of the MovieLens data set. The SVD model and the geodesic HMC are used here. Top and middle row: MCMC traces. Bottom row: average autocorrelation function and standard derivation estimated over all parameters.

References

[1] Sungjin Ahn, Anoop Korattikara, Nathan Liu, Suju Rajan, and Max Welling. Large-scale distributed Bayesian matrix factorization using stochastic gradient MCMC. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.
[2] Vladimir Igorevich Arnol'd. Mathematical Methods of Classical Mechanics, volume 60. Springer Science & Business Media, 2013.
[3] James Bennett, Stan Lanning, et al. The Netflix prize. In Proceedings of KDD Cup and Workshop, 2007.
[4] Simon Byrne and Mark Girolami. Geodesic Monte Carlo on embedded manifolds. Scandinavian Journal of Statistics, 40(4):825–845, 2013.
[5] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[6] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, 2014.
[7] Saibal De, Hadi Salehi, and Alex Gorodetsky. Efficient MCMC sampling for Bayesian matrix factorization by breaking posterior symmetries. arXiv preprint arXiv:2006.04295, 2020.
[8] Persi Diaconis, Susan Holmes, and Mehrdad Shahshahani. Sampling from a manifold. In Advances in Modern Statistical Theory and Applications: A Festschrift in Honor of Morris L. Eaton, pages 102–125. Institute of Mathematical Statistics, 2013.
[9] Dheeru Dua and Casey Graff. UCI Machine Learning Repository, 2017.
[10] Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
[11] Alan Edelman, Tomás A. Arias, and Steven T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
[12] Herbert Federer. Geometric Measure Theory. Springer, 2014.
[13] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
[14] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1–19, 2015.
[15] Trevor Hastie, Rahul Mazumder, Jason D. Lee, and Reza Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. The Journal of Machine Learning Research, 16(1):3367–3402, 2015.
[16] Clara Higuera, Katheleen J. Gardiner, and Krzysztof J. Cios. Self-organizing feature maps identify proteins critical to learning in a mouse model of Down syndrome. PLoS ONE, 10, 2015.
[17] Matthew D. Hoffman, Andrew Gelman, et al. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
[18] Yew Jin Lim and Yee Whye Teh. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop, 2007.
[19] Jun S. Liu. Monte Carlo Strategies in Scientific Computing, volume 10. Springer, 2001.
[20] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. The Journal of Machine Learning Research, 11:2287–2322, 2010.
[21] Shinichi Nakajima and Masashi Sugiyama. Theoretical analysis of Bayesian matrix factorization. Journal of Machine Learning Research, 12, 2011.
[22] Shinichi Nakajima, Masashi Sugiyama, S. Derin Babacan, and Ryota Tomioka. Global analytic solution of fully-observed variational Bayesian matrix factorization. Journal of Machine Learning Research, 14, 2013.
[23] Radford M. Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
[24] Piyush Rai, Yingjian Wang, Shengbo Guo, Gary Chen, David Dunson, and Lawrence Carin. Scalable Bayesian low-rank decomposition of incomplete multiway tensors. In International Conference on Machine Learning, 2014.
[25] Tapani Raiko, Alexander Ilin, and Juha Karhunen. Principal component analysis for large scale problems with lots of missing values. In European Conference on Machine Learning, 2007.
[26] Christian P. Robert and George Casella. Monte Carlo Statistical Methods, volume 2. Springer, 1999.
[27] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[28] Qibin Zhao, Liqing Zhang, and Andrzej Cichocki. Bayesian CP factorization of incomplete tensors with automatic rank determination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37, 2015.
[29] Qibin Zhao, Guoxu Zhou, Liqing Zhang, Andrzej Cichocki, and Shun-Ichi Amari. Bayesian robust tensor factorization for incomplete multiway data. IEEE Transactions on Neural Networks and Learning Systems, 27, 2015.