
Unitary Approximate Message Passing for Matrix Factorization

Zhengdao Yuan, Qinghua Guo, Yonina C. Eldar, and Yonghui Li

Corresponding author: Qinghua Guo. Z. Yuan is with the Artificial Intelligence Technology Engineering Research Center, Open University of Henan, Zhengzhou 450002, China. He was with the School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Wollongong, NSW 2522, Australia (e-mail: [email protected]). Q. Guo is with the School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Wollongong, NSW 2522, Australia (e-mail: [email protected]). Y. C. Eldar is with the Faculty of Mathematics and Computer Science, Weizmann Institute of Science, Rehovot 7610001, Israel (e-mail: [email protected]). Y. Li is with the School of Electrical and Information Engineering, University of Sydney, Sydney, NSW 2006, Australia (e-mail: [email protected]).
Abstract

We consider matrix factorization (MF) with certain constraints, which finds wide applications in various areas. Leveraging variational inference (VI) and unitary approximate message passing (UAMP), we develop a Bayesian approach to MF with an efficient message passing implementation, called UAMP-MF. With proper priors imposed on the factor matrices, UAMP-MF can be used to solve many problems that can be formulated as MF, such as non-negative matrix factorization, dictionary learning, compressive sensing with matrix uncertainty, robust principal component analysis, and sparse matrix factorization. Extensive numerical examples are provided to show that UAMP-MF significantly outperforms state-of-the-art algorithms in terms of recovery accuracy, robustness and computational complexity.

Index Terms:
Variational inference (VI), approximate message passing (AMP), matrix factorization (MF), compressive sensing (CS), dictionary learning (DL), robust principal component analysis (RPCA).

I Introduction

We consider the problem of matrix factorization (MF) with the following model

𝒀=𝑯𝑿+𝑾,\displaystyle\boldsymbol{Y}=\boldsymbol{H}\boldsymbol{X}+\boldsymbol{W}, (1)

where 𝒀M×L\boldsymbol{Y}\in\mathcal{R}^{M\times L} denotes a known (observation) matrix, 𝑾M×L\boldsymbol{W}\in\mathcal{R}^{M\times L} accounts for unknown perturbations or measurement errors, and the matrices 𝑯M×N\boldsymbol{H}\in\mathcal{R}^{M\times N} and 𝑿N×L\boldsymbol{X}\in\mathcal{R}^{N\times L} are two factor matrices to be obtained. Depending on concrete application scenarios, the MF problem in (1) often has specific requirements on the matrices 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X}. For example, in non-negative matrix factorization (NMF) [1], the entries of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} need to take non-negative values. In dictionary learning (DL) [2], matrix 𝑿\boldsymbol{X} is a sparse matrix and 𝑯\boldsymbol{H} is a dictionary matrix to be learned. In compressive sensing with matrix uncertainty (CSMU) [3], 𝑿\boldsymbol{X} is sparse and the sensing matrix 𝑯\boldsymbol{H} can be modeled as 𝑯=𝑯¯+𝑯\boldsymbol{H}=\bar{\boldsymbol{H}}+\boldsymbol{H}^{\prime}, where the matrix 𝑯¯\bar{\boldsymbol{H}} is known, and 𝑯\boldsymbol{H}^{\prime} denotes an unknown perturbation matrix. The robust principal component analysis (RPCA) problem [4] can also be formulated as (1), where both 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} admit specific structures [5]. We may also be interested in sparse MF, where both matrices 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} are sparse.

Many algorithms have been developed to solve the above problems. For NMF, the multiplication (Mult) update algorithm proposed in [6] is often used, which is simple to implement, and produces good results. The alternating least squares (ALS) algorithm was proposed in [7] for NMF, where two least squares steps are performed alternately. The above two algorithms are very popular and are built in the MATLAB library. The projected gradient descent (AlsPGrad) [8] algorithm, which is an improvement to Mult, usually has faster convergence speed. To solve the DL problem, one can use algorithms such as K-SVD (singular value decomposition) [9] and SPAMS (sparse modeling software) [10]. K-SVD is often effective, but can be trapped in local minima or even saddle points. The SPAMS algorithm often does not perform well when the number of training samples is small. For CSMU, the sparsity-cognizant total least squares (S-TLS) method was developed in [11] by extending the classical TLS approach. For the RPCA problem, many algorithms have been proposed. The work in [4] developed iterative thresholding algorithms with low complexity, but their convergence is usually slow. The inexact augmented Lagrange multiplier algorithm (IALM) [12] and low-rank matrix fitting algorithm (LMaFit) [13] deliver better convergence speed and performance.

Another line of techniques is based on the Bayesian framework, especially with the approximate message passing (AMP) algorithm [14], [15] and its extensions and variants. With Bayesian approaches, the requirements on matrices 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} for the factorization in (1) are incorporated by imposing proper priors on them. The AMP algorithm is employed due to its low complexity. AMP was originally developed for compressive sensing based on loopy belief propagation (BP) [14], and was later extended in [16] to solve the estimation problem with a generalized linear observation model. The bilinear generalized AMP (Bi-GAMP) algorithm was proposed in [5] to deal with the problem in (1). However, Bi-GAMP inherits the shortcoming of the AMP algorithm, i.e., it is not robust to generic matrices 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X}. To achieve better robustness, a bilinear adaptive VAMP (BAd-VAMP) algorithm was proposed in [17] and generalized in [18]. However, for the general problem in (1), to reduce computational complexity, BAd-VAMP does not allow the use of priors for matrix 𝑯\boldsymbol{H}, i.e., it is treated as a deterministic unknown matrix. This makes BAd-VAMP incapable of dealing with the requirements on 𝑯\boldsymbol{H}, e.g., RPCA or sparse MF where 𝑯\boldsymbol{H} should be sparse. In addition, there is still considerable room for improvement in terms of performance, robustness and computational complexity. In [19], with UAMP [20] (which was formerly called UTAMP), a more robust and faster approximate inference algorithm (called Bi-UTAMP in [19]) was proposed, which is designed for specially structured 𝑯\boldsymbol{H} in (1), i.e., 𝑯=k=1Kbk𝑨k\boldsymbol{H}=\sum_{k=1}^{K}b_{k}\boldsymbol{A}_{k}, where matrices {𝑨k}\{\boldsymbol{A}_{k}\} are known, and coefficients {bk}\{b_{k}\} (and matrix 𝑿\boldsymbol{X}) need to be estimated.

In this work, with the framework of variational inference (VI) [21], we develop a more efficient Bayesian algorithm leveraging UAMP. The use of VI requires the updates of two variational distributions on 𝑿\boldsymbol{X} and 𝑯\boldsymbol{H}, which may be difficult and expensive. With matrix normal distributions [22] and through a covariance matrix whitening process, we incorporate UAMP into VI to deal with the updates of the variational distributions. The VI-based method is implemented using message passing with UAMP as its key component, leading to an algorithm called UAMP-MF. Inheriting the low complexity and robustness of UAMP, UAMP-MF is highly efficient and robust. It has cubic complexity per iteration, which is considered low for an MF problem (note that even the multiplication of two matrices has cubic complexity). Enjoying the flexibility of a Bayesian approach, UAMP-MF is able to handle various problems in a unified way with different priors. We focus on its applications to NMF, DL, CSMU, RPCA and sparse MF, and compare its performance with various algorithms, including IALM, LMaFit and Bi-GAMP for RPCA, K-SVD, SPAMS and BAd-VAMP for DL, BAd-VAMP for CSMU, and AlsPGrad and Mult for NMF. Extensive numerical results are provided to demonstrate the superiority of UAMP-MF in terms of recovery performance, robustness and computational complexity.

The remainder of this paper is organized as follows. In Section II, VI, (U)AMP and matrix normal distribution are briefly introduced. The UAMP-MF algorithm is derived in Section III. Applications of UAMP-MF to NMF, RPCA, DL, CSMU and sparse MF are investigated in Section IV, followed by conclusions in Section V.

Throughout the paper, we use the following notations. Boldface lower-case and upper-case letters denote vectors and matrices, respectively, and superscript ()T(\cdot)^{T} represents the transpose operation. A Gaussian distribution of xx with mean x^\hat{x} and variance νx\nu_{x} is represented by 𝒩(x;x^,νx)\mathcal{N}(x;{\hat{x}},\nu_{x}). Notation Tr()\text{Tr}(\cdot) denotes the trace operation. The relation f(x)=cg(x)f(x)=cg(x) for some positive constant cc is written as f(x)g(x)f(x)\propto g(x). The operation diag(𝒂)diag(\boldsymbol{a}) returns a diagonal matrix with 𝒂\boldsymbol{a} on its diagonal. We use 𝑨𝑩\boldsymbol{A}\cdot\boldsymbol{B} and 𝑨/𝑩\boldsymbol{A}\cdot/\boldsymbol{B} to represent the element-wise product and division between 𝑨\boldsymbol{A} and 𝑩\boldsymbol{B}, respectively. We use |𝑨|.2|\boldsymbol{A}|^{.2} to denote element-wise magnitude squared operation for 𝑨\boldsymbol{A}, and 𝑨||\boldsymbol{A}|| is the Frobenius norm of 𝑨\boldsymbol{A}. The notations 1, 0 and 𝑰\boldsymbol{I} denote an all-one matrix, an all-zero matrix and an identity matrix with a proper size, respectively. The above operations defined for matrices can also be applied to vectors. We use Vec()\textrm{Vec}(\cdot) to denote the vectorization operation.

II Preliminaries

II-A Variational Inference

VI is a machine learning method widely used to approximate posterior densities for Bayesian models [21], [23]. Let 𝑽\boldsymbol{V} and 𝑹\boldsymbol{R} be the set of hidden (latent) variables and visible (observed) variables, respectively, with joint distribution p(𝑽,𝑹)p(\boldsymbol{V},\boldsymbol{R}). The goal of VI is to find a tractable variational distribution q(𝑽){q}(\boldsymbol{V}) that approximates the true posterior distribution p(𝑽|𝑹)p(\boldsymbol{V}|\boldsymbol{R}). With the distribution q(𝑽)q(\boldsymbol{V}), the log marginal distribution of the observed variables admits the following decomposition

lnp(𝑹)=(q(𝑽))+𝒦(q(𝑽)||p(𝑽|𝑹)),\ln p(\boldsymbol{R})=\mathcal{L}(q(\boldsymbol{V}))+\mathcal{KL}(q(\boldsymbol{V})||p(\boldsymbol{V}|\boldsymbol{R})), (2)

where the variational lower bound (q(𝑽))\mathcal{L}(q(\boldsymbol{V})) is given as

(q(𝑽))=𝑽q(𝑽)lnp(𝑽,𝑹)q(𝑽),\mathcal{L}(q(\boldsymbol{V}))=\int_{\boldsymbol{V}}q(\boldsymbol{V})\ln\frac{p(\boldsymbol{V},\boldsymbol{R})}{q(\boldsymbol{V})}, (3)

and the Kullback-Leibler (KL) divergence between q(𝑽)q(\boldsymbol{V}) and p(𝑽|𝑹)p(\boldsymbol{V}|\boldsymbol{R}) is

𝒦(q(𝑽)||p(𝑽|𝑹))=𝑽q(𝑽)logp(𝑽|𝑹)q(𝑽).\mathcal{KL}(q(\boldsymbol{V})||p(\boldsymbol{V}|\boldsymbol{R}))=-\int_{\boldsymbol{V}}q(\boldsymbol{V})\log\frac{p(\boldsymbol{V}|\boldsymbol{R})}{q(\boldsymbol{V})}. (4)

The distribution q(𝑽)q(\boldsymbol{V}) that minimizes the KL divergence 𝒦(q(𝑽)||p(𝑽|𝑹))\mathcal{KL}(q(\boldsymbol{V})||p(\boldsymbol{V}|\boldsymbol{R})) can be found by maximizing the variational lower bound (q(𝑽))\mathcal{L}(q(\boldsymbol{V})).

VI can be implemented with message passing with the assistance of graphical models such as factor graphs. If the variational distribution with some factorization is chosen, e.g.,

q(𝑽)=kqk(𝑽k),q(\boldsymbol{V})=\prod_{k}q_{k}(\boldsymbol{V}_{k}), (5)

where 𝑽={𝑽k}\boldsymbol{V}=\{\boldsymbol{V}_{k}\}, then the variational distribution may be found through an iterative procedure [21] with the update rule

qk(𝑽k)exp(𝑽~q(𝑽~)logf(𝑽k,𝑽~)).q_{k}(\boldsymbol{V}_{k})\propto\exp\left(\int_{\tilde{\boldsymbol{V}}}q(\tilde{\boldsymbol{V}})\log f(\boldsymbol{V}_{k},\tilde{\boldsymbol{V}})\right). (6)

Here f(𝑽k,𝑽~)f(\boldsymbol{V}_{k},\tilde{\boldsymbol{V}}) is a local factor associated with 𝑽k\boldsymbol{V}_{k} and 𝑽~{𝑽i,ik}\tilde{\boldsymbol{V}}\in\{\boldsymbol{V}_{i},i\neq k\}, depending on the structure of the factor graph. The update of {qk(𝑽k)}\{q_{k}(\boldsymbol{V}_{k})\} is carried out iteratively until it converges or a pre-set number of iterations is reached.
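To make the update rule (6) concrete, the following is a minimal coordinate-ascent sketch (in Python/NumPy) for a toy conjugate model with observations y_i ~ N(mu, 1/lam), a Gaussian prior on mu and a Gamma prior on lam. The hyper-parameters s0, a0, b0 and the function name are illustrative assumptions, not quantities from this paper.

import numpy as np

# Toy mean-field VI: q(mu, lam) = q(mu) q(lam); each factor update follows rule (6)
# and has a closed form because the model is conjugate.
def toy_cavi(y, s0=10.0, a0=1e-3, b0=1e-3, num_iters=50):
    n, ybar = len(y), np.mean(y)
    E_lam = 1.0                                   # initial guess for E_q[lam]
    for _ in range(num_iters):
        # update q(mu) = N(m, v) given the current q(lam)
        v = 1.0 / (1.0 / s0 ** 2 + n * E_lam)
        m = v * E_lam * n * ybar
        # update q(lam) = Gamma(a, b) given the current q(mu)
        a = a0 + 0.5 * n
        b = b0 + 0.5 * (np.sum(y ** 2) - 2.0 * m * np.sum(y) + n * (m ** 2 + v))
        E_lam = a / b
    return m, v, a, b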

II-B (U)AMP

AMP was derived based on loopy BP with Gaussian and Taylor-series approximations [24], [16], which can be used to recover 𝒙\boldsymbol{x} from the noisy measurement 𝒚=𝑨𝒙+𝒘\boldsymbol{y}=\boldsymbol{A}\boldsymbol{x}+\boldsymbol{w} with 𝒘\boldsymbol{w} being a zero-mean white Gaussian noise vector. AMP works well with an i.i.d. Gaussian 𝑨\boldsymbol{A}, but it can easily diverge for generic 𝐀\mathbf{A} [25]. Inspired by [26], the work in [20] showed that the robustness of AMP can be remarkably improved through simple pre-processing, i.e., performing a unitary transformation to the original linear model [19]. As any matrix 𝑨\boldsymbol{A} has an SVD 𝑨=𝑼𝚲𝑽\boldsymbol{A}=\boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{V} with 𝑼\boldsymbol{U} and 𝑽\boldsymbol{V} being two unitary matrices, performing a unitary transformation with 𝑼T\boldsymbol{U}^{\textrm{T}} leads to the following model

𝒓=𝚽𝒙+𝝎,\boldsymbol{r}=\mathbf{\Phi}\boldsymbol{x}+\boldsymbol{\omega}, (7)

where 𝒓=𝑼T𝒚\boldsymbol{r}=\boldsymbol{U}^{\textrm{T}}\boldsymbol{y}, 𝚽=𝑼T𝑨=𝚲𝑽\boldsymbol{\Phi}=\boldsymbol{U}^{\textrm{T}}\boldsymbol{A}=\boldsymbol{\Lambda}\boldsymbol{V}, 𝚲\mathbf{\Lambda} is an M×NM\times N rectangular diagonal matrix, and 𝝎=𝑼T𝒘\boldsymbol{\omega}=\boldsymbol{U}^{\textrm{T}}\boldsymbol{w} remains a zero-mean white Gaussian noise vector with the same covariance as 𝒘\boldsymbol{w}.

Applying the vector step size AMP [16] with model (7) leads to the first version of UAMP (called UAMPv1) shown in Algorithm 1. Note that by replacing 𝒓\boldsymbol{r} and 𝚽\mathbf{\Phi} with 𝒚\boldsymbol{y} and 𝑨\boldsymbol{A} in Algorithm 1, the original AMP algorithm is recovered. An average operation can be applied to two vectors: 𝝉x\boldsymbol{\tau}_{x} in Line 7 and |𝚽H|.2𝝉s|\mathbf{\Phi}^{H}|^{.2}\boldsymbol{\tau}_{s} in Line 5 of UAMPv1 in Algorithm 1, leading to the second version of UAMP [20] (called UAMPv2), where the operations in the brackets of Lines 1, 5 and 7 are executed (refer to [19] for the detailed derivations). Compared to AMP and UAMPv1, UAMPv2 does not require matrix-vector multiplications in Lines 1 and 5, so that the number of matrix-vector products is reduced from 4 to 2 per iteration.

Algorithm 1 UAMP (UAMPv2 executes operations in [ ])

Initialize 𝝉x(0)(orτx(0))>0\boldsymbol{\tau}_{x}^{(0)}(\mathrm{or}~{}{\tau}_{x}^{(0)})>0 and 𝒙(0){{\boldsymbol{x}}^{(0)}}. Set 𝒔(1)=𝟎\boldsymbol{s}^{(-1)}=\mathbf{0} and t=0t=0. Define vector 𝝀=𝚲𝚲T1\boldsymbol{\lambda}={\mathbf{\Lambda\Lambda}^{\textrm{T}}}\textbf{1}.
Repeat

1:  𝝉p\boldsymbol{\tau}_{p} = |𝚽|.2𝝉xt\mathbf{|\Phi|}^{.2}\boldsymbol{\tau}^{t}_{x}             [or𝝉p=τxt𝝀]\left[\mathrm{or}~{}\boldsymbol{\tau}_{p}=\tau^{t}_{x}\boldsymbol{\lambda}\right]
2:  𝒑=𝚽𝒙t𝝉p𝒔t1\boldsymbol{p}=\mathbf{\Phi}{{\boldsymbol{x}}^{t}}-\boldsymbol{\tau}_{p}\cdot\boldsymbol{s}^{t-1}
3:  𝝉s=𝟏./(𝝉p+β1𝟏)\boldsymbol{\tau}_{s}=\mathbf{1}./(\boldsymbol{\tau}_{p}+\beta^{-1}\mathbf{1})
4:  𝒔t=𝝉s(𝒓𝒑)\boldsymbol{s}^{t}=\boldsymbol{\tau}_{s}\cdot(\boldsymbol{r}-\boldsymbol{p})
5:  𝟏./𝝉q\mathbf{1}./\boldsymbol{\tau}_{q} = |𝚽T|.2𝝉s|\mathbf{\Phi}^{\textrm{T}}|^{.2}\boldsymbol{\tau}_{s}       [or𝟏./𝝉q=(1N𝝀T𝝉s)𝟏]\left[\mathrm{or}~{}\mathbf{1}./\boldsymbol{\tau}_{q}=(\frac{1}{N}\boldsymbol{\lambda}^{\textrm{T}}\boldsymbol{\tau}_{s})\mathbf{1}\right]
6:  𝒒=𝒙t+𝝉q(𝚽H𝒔t)\boldsymbol{q}={{\boldsymbol{x}}^{t}}+\boldsymbol{\tau}_{q}\cdot(\mathbf{\Phi}^{H}\boldsymbol{s}^{t})
7:  𝝉xt+1\boldsymbol{\tau}_{x}^{t+1} = 𝝉qgx(𝒒,𝝉q)\boldsymbol{\tau}_{q}\cdot g_{x}^{\prime}(\boldsymbol{q},\boldsymbol{\tau}_{q})      [orτxt+1=1N𝟏T(𝝉qgx(𝒒,τq))]\left[\mathrm{or}~{}\tau_{x}^{t+1}\!=\!\frac{1}{N}\mathbf{1}^{\textrm{T}}\left(\boldsymbol{\tau}_{q}\cdot g_{x}^{\prime}(\boldsymbol{q},\tau_{q})\right)\right]
8:  𝐱t+1=gx(𝒒,𝝉q){{\mathbf{x}}^{t+1}}=g_{x}(\boldsymbol{q},\boldsymbol{\tau}_{q})
9:  t=t+1t=t+1

Until terminated

In the (U)AMP algorithms, gx(𝒒,𝝉q)g_{x}(\boldsymbol{q},\boldsymbol{\tau}_{q}) is related to the prior of 𝒙\boldsymbol{x} and returns a column vector with nnth entry [gx(𝐪,𝝉q)]n[g_{x}(\mathbf{q},\boldsymbol{\tau}_{q})]_{n} given by

[gx(𝒒,𝝉q)]n=xnp(xn)𝒩(xn;qn,τqn)𝑑xnp(xn)𝒩(xn;qn,τqn)𝑑xn,[g_{x}(\boldsymbol{q},\boldsymbol{\tau}_{q})]_{n}=\frac{\int x_{n}p(x_{n})\mathcal{N}(x_{n};q_{n},\tau_{q_{n}})dx_{n}}{\int p(x_{n})\mathcal{N}(x_{n};q_{n},\tau_{q_{n}})dx_{n}}, (8)

where p(xn)p(x_{n}) represents a prior for xnx_{n}. The function gx(𝒒,𝝉q)g_{x}^{\prime}(\boldsymbol{q},\boldsymbol{\tau}_{q}) returns a column vector and the nnth element is denoted by [gx(𝒒,𝝉q)]n[g_{x}^{\prime}(\boldsymbol{q},\boldsymbol{\tau}_{q})]_{n}, where the derivative is taken with respect to qnq_{n}.
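As an illustration of Algorithm 1, the following is a minimal NumPy sketch of UAMPv1 for a real-valued model, assuming a known noise precision beta and a zero-mean Gaussian prior 𝒩(x_n; 0, v0), for which the denoiser (8) and its derivative take the simple closed forms noted in the comments. It uses the economy-size SVD and is only a sketch under these assumptions, not a reference implementation.

import numpy as np

def uamp_v1(y, A, beta, v0=1.0, num_iters=50):
    # Unitary transform of y = A x + w into r = Phi x + omega, cf. (7)
    U, sv, VT = np.linalg.svd(A, full_matrices=False)
    r, Phi = U.T @ y, sv[:, None] * VT            # Phi = Lambda V
    M, N = Phi.shape
    x, tau_x, s = np.zeros(N), np.ones(N), np.zeros(M)
    Phi2 = np.abs(Phi) ** 2
    for _ in range(num_iters):
        tau_p = Phi2 @ tau_x                      # Line 1
        p = Phi @ x - tau_p * s                   # Line 2
        tau_s = 1.0 / (tau_p + 1.0 / beta)        # Line 3
        s = tau_s * (r - p)                       # Line 4
        tau_q = 1.0 / (Phi2.T @ tau_s)            # Line 5
        q = x + tau_q * (Phi.T @ s)               # Line 6
        gain = v0 / (v0 + tau_q)                  # MMSE denoiser (8) for the prior N(0, v0)
        tau_x = tau_q * gain                      # Line 7
        x = gain * q                              # Line 8
    return x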

II-C Matrix Normal Distribution

In this work, the matrix normal distribution is used in the derivation of VI for the MF problem, into which UAMP is incorporated, leading to the UAMP-MF algorithm. The matrix normal distribution is a generalization of the multivariate normal distribution to matrix-valued random variables [22]. The distribution of normal random matrix 𝑿\boldsymbol{X} is

𝒩(𝑿;𝑿^,𝑼X,𝑽X)=(2π)NL2|𝑼X|L2|𝑽X|N2\displaystyle\mathcal{MN}\big{(}\boldsymbol{X};\hat{\boldsymbol{X}},\boldsymbol{U}_{X},\boldsymbol{V}_{X}\big{)}=(2\pi)^{-\frac{NL}{2}}\big{|}\boldsymbol{U}_{X}\big{|}^{-\frac{L}{2}}\big{|}\boldsymbol{V}_{X}\big{|}^{-\frac{N}{2}}
exp(12Tr(𝑽X1(𝑿𝑿^)T𝑼X1(𝑿𝑿^))),\displaystyle\exp\Big{(}-\frac{1}{2}\text{Tr}\big{(}\boldsymbol{V}_{X}^{-1}\big{(}\boldsymbol{X}-\hat{\boldsymbol{X}}\big{)}^{\textrm{T}}\boldsymbol{U}_{X}^{-1}\big{(}\boldsymbol{X}-\hat{\boldsymbol{X}}\big{)}\big{)}\Big{)}, (9)

where 𝑿^N×L\hat{\boldsymbol{X}}\in\mathcal{R}^{N\times L} is the mean of 𝑿\boldsymbol{X}, and 𝑼XN×N\boldsymbol{U}_{X}\in\mathcal{R}^{N\times N} and 𝑽XL×L\boldsymbol{V}_{X}\in\mathcal{R}^{L\times L} are the covariance matrices among the rows and among the columns of 𝑿\boldsymbol{X}, respectively. The matrix normal distribution is related to the multivariate normal distribution in the following way

𝒙𝒩(𝒙;𝒙^,𝑽X𝑼X),\displaystyle\boldsymbol{x}\sim\mathcal{N}\big{(}\boldsymbol{x};\hat{\boldsymbol{x}},\boldsymbol{V}_{X}\otimes\boldsymbol{U}_{X}\big{)}, (10)

where 𝒙=Vec(𝑿)\boldsymbol{x}=\textrm{Vec}(\boldsymbol{X}) and 𝒙^=Vec(𝑿^)\hat{\boldsymbol{x}}=\textrm{Vec}(\hat{\boldsymbol{X}}).
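The relation between (9) and (10) can be checked numerically; the short NumPy/SciPy sketch below evaluates the matrix normal log-density directly and through Vec(𝑿) with covariance 𝑽_X ⊗ 𝑼_X. All variable names and sizes are illustrative.

import numpy as np
from scipy.stats import multivariate_normal

def matrix_normal_logpdf(X, Mean, U, V):
    N, L = X.shape
    D = X - Mean
    quad = np.trace(np.linalg.solve(V, D.T) @ np.linalg.solve(U, D))  # Tr(V^-1 D^T U^-1 D)
    _, logdet_U = np.linalg.slogdet(U)
    _, logdet_V = np.linalg.slogdet(V)
    return -0.5 * (N * L * np.log(2 * np.pi) + L * logdet_U + N * logdet_V + quad)

rng = np.random.default_rng(0)
N, L = 4, 3
X, Mean = rng.standard_normal((N, L)), rng.standard_normal((N, L))
A, B = rng.standard_normal((N, N)), rng.standard_normal((L, L))
U, V = A @ A.T + N * np.eye(N), B @ B.T + L * np.eye(L)   # random SPD row/column covariances
lp1 = matrix_normal_logpdf(X, Mean, U, V)
lp2 = multivariate_normal.logpdf(X.flatten('F'), Mean.flatten('F'), np.kron(V, U))
assert np.isclose(lp1, lp2)   # (9) and (10) agree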

III Design of UAMP-MF

The Bayesian treatment of the MF problem leads to great flexibility. Imposing proper priors p(𝑯)p(\boldsymbol{H}) and p(𝑿)p(\boldsymbol{X}) on 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X}, we can accommodate many concrete MF problems. For example in DL, we aim to find sparse coding of data set 𝒀\boldsymbol{Y}, where 𝑯\boldsymbol{H} is the dictionary and 𝑿\boldsymbol{X} is the sparse representation with the dictionary. For the entries of 𝑯\boldsymbol{H}, we can use a Gaussian prior

p(𝑯)=m,np(hm,n)=m,n𝒩(hm,n;0,1),\displaystyle p(\boldsymbol{H})=\prod\nolimits_{m,n}p(h_{m,n})=\prod\nolimits_{m,n}\mathcal{N}(h_{m,n};0,1), (11)

while for the entries of 𝑿\boldsymbol{X}, we can consider a sparsity promoting hierarchical Gaussian-Gamma prior

p(𝑿)\displaystyle p(\boldsymbol{X}) =\displaystyle= n,lp(xn,l)=n,l𝒩(xn,l;0,γn,l1)\displaystyle\prod\nolimits_{n,l}p(x_{n,l})=\prod\nolimits_{n,l}\mathcal{N}(x_{n,l};0,\gamma_{n,l}^{-1}) (12)
p(γn,l)\displaystyle p(\gamma_{n,l}) =\displaystyle= Ga(γn,l;ϵ,η).\displaystyle Ga(\gamma_{n,l};\epsilon,\eta). (13)

In Section IV, more examples on the priors, including NMF, RPCA, CSMU and sparse MF, are provided.

With the priors and model (1), if the a posteriori distributions p(𝑯|𝒀)p(\boldsymbol{H}|\boldsymbol{Y}) and p(𝑿|𝒀)p(\boldsymbol{X}|\boldsymbol{Y}) can be found, then the estimates of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} are obtained, e.g., using the a posteriori means of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} to serve as their estimates. However, it is intractable to find the exact a posteriori distributions in general, so we resort to the approximate inference method VI to find their variational approximations q(𝑯)p(𝑯|𝒀)q(\boldsymbol{H})\approx p(\boldsymbol{H}|\boldsymbol{Y}) and q(𝑿)p(𝑿|𝒀)q(\boldsymbol{X})\approx p(\boldsymbol{X}|\boldsymbol{Y}). Finding the variational distributions is still challenging due to the large dimensions of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X}, and the priors of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} may also lead to intractable q(𝑯)q(\boldsymbol{H}) and q(𝑿)q(\boldsymbol{X}). In this work, through a covariance matrix (model noise) whitening process, UAMP is incorporated to solve these challenges with high efficiency. This also leads to Gaussian variational distributions q(𝑯)q(\boldsymbol{H}) and q(𝑿)q(\boldsymbol{X}), so that the estimates of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} (i.e., the a posteriori means of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X}) appear as the parameters of the variational distributions.

Leveraging UAMP, we carry out VI in a message passing manner, with the aid of a factor graph representation of the problem. This leads to the message passing algorithm UAMP-MF. In the derivation of UAMP-MF, we use the notation mnanb(𝑿)m_{n_{a}\rightarrow n_{b}}(\boldsymbol{X}) to denote a message passed from node nan_{a} to node nbn_{b}, which is a function of 𝑿\boldsymbol{X}. The algorithm UAMP-MF is shown in Algorithm 2, which we will frequently refer to in this section.

III-A Factor Graph Representation

Algorithm 2 UAMP-MF

Initialization: 𝑼H=𝑰M\boldsymbol{U}_{H}=\boldsymbol{I}_{M}, 𝑽H=𝑰N\boldsymbol{V}_{H}=\boldsymbol{I}_{N}, 𝑯^=1MN\hat{\boldsymbol{H}}=\textbf{1}_{MN}, 𝑽X=𝑰L\boldsymbol{V}_{X}=\boldsymbol{I}_{L}. 𝚵X=1NL\boldsymbol{\Xi}_{X}=\textbf{1}_{NL}, 𝑺X=0NL\boldsymbol{S}_{X}=\textbf{0}_{NL}, 𝚵H=1MN\boldsymbol{\Xi}_{H}=\textbf{1}_{MN}, 𝑺H=0MN\boldsymbol{S}_{H}=\textbf{0}_{MN}. Repeat

1:  𝑾¯X=𝑯^T𝑯^+M𝑽H\overline{\boldsymbol{W}}_{X}=\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}}+M\boldsymbol{V}_{H}
2:  [𝑪X,𝑫X]=eig(𝑾¯X)[\boldsymbol{C}_{X},\boldsymbol{D}_{X}]=\text{eig}(\overline{\boldsymbol{W}}_{X})
3:  𝑹X=𝑫X12𝑪XT𝑯^T𝒀\boldsymbol{R}_{X}=\boldsymbol{D}_{X}^{-\frac{1}{2}}\boldsymbol{C}_{X}^{\textrm{T}}\hat{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{Y}, 𝚽X=𝑫X12𝑪XT\boldsymbol{\Phi}_{X}=\boldsymbol{D}_{X}^{-\frac{1}{2}}\boldsymbol{C}_{X}^{\textrm{T}}
4:  𝑽PX=|𝚽X|.2𝚵X\boldsymbol{V}_{P_{X}}=|\boldsymbol{\Phi}_{X}|^{.2}\boldsymbol{\Xi}_{X}
5:  𝑷X=𝚽X𝑿^𝑽𝑷X𝑺X\boldsymbol{P}_{X}=\boldsymbol{\Phi}_{X}\hat{\boldsymbol{X}}-\boldsymbol{V}_{\boldsymbol{P}_{X}}\cdot\boldsymbol{S}_{X}
6:  𝑽SX=1./(𝑽PX+λ^1𝟏)\boldsymbol{V}_{S_{X}}=\textbf{1}./(\boldsymbol{V}_{P_{X}}+\hat{\lambda}^{-1}\mathbf{1})
7:  𝑺X=𝑽SX(𝑹X𝑷X)\boldsymbol{S}_{X}=\boldsymbol{V}_{S_{X}}\cdot(\boldsymbol{R}_{X}-\boldsymbol{P}_{X})
8:  𝑽QX=1./(|𝚽XT|.2𝑽SX)\boldsymbol{V}_{Q_{X}}=\textbf{1}./(|\boldsymbol{\Phi}_{X}^{\textrm{T}}|^{.2}\boldsymbol{V}_{S_{X}})
9:  𝑸X=𝑿^+𝑽QX(𝚽XT𝑺X)\boldsymbol{Q}_{X}=\hat{\boldsymbol{X}}+\boldsymbol{V}_{Q_{X}}\cdot(\boldsymbol{\Phi}_{X}^{\textrm{T}}\boldsymbol{S}_{X})
10:  𝚵X=𝑽QX𝑮X(𝑸X,𝑽QX)\boldsymbol{\Xi}_{X}=\boldsymbol{V}_{Q_{X}}\cdot\boldsymbol{G}_{X}^{\prime}(\boldsymbol{Q}_{X},\boldsymbol{V}_{Q_{X}})
11:  𝑿^=𝑮X(𝑸X,𝑽QX)\hat{\boldsymbol{X}}=\boldsymbol{G}_{X}(\boldsymbol{Q}_{X},\boldsymbol{V}_{Q_{X}})
12:  𝑼X=diag(mean(𝚵X,2))\boldsymbol{U}_{X}=\text{diag}(\text{mean}(\boldsymbol{\Xi}_{X},2))
13:  𝑾¯H=𝑿^𝑿^T+L𝑼X\overline{\boldsymbol{W}}_{H}=\hat{\boldsymbol{X}}\hat{\boldsymbol{X}}^{\textrm{T}}+L\boldsymbol{U}_{X}
14:  [𝑪H,𝑫H]=eig(𝑾¯H)[\boldsymbol{C}_{H},\boldsymbol{D}_{H}]=\text{eig}(\overline{\boldsymbol{W}}_{H})
15:  𝑹H=𝑫H12𝑪HT𝑿^𝒀T\boldsymbol{R}_{H}=\boldsymbol{D}_{H}^{-\frac{1}{2}}\boldsymbol{C}_{H}^{\textrm{T}}\hat{\boldsymbol{X}}\boldsymbol{Y}^{\textrm{T}}, 𝚽H=𝑫H12𝑪HT\boldsymbol{\Phi}_{H}=\boldsymbol{D}_{H}^{-\frac{1}{2}}\boldsymbol{C}_{H}^{\textrm{T}}
16:  𝑽PH=|𝚽H|.2𝚵H\boldsymbol{V}_{P_{H}}=|\boldsymbol{\Phi}_{H}|^{.2}\boldsymbol{\Xi}_{H}
17:  𝑷H=𝚽H𝑯^𝑽𝑷H𝑺H\boldsymbol{P}_{H}=\boldsymbol{\Phi}_{H}\hat{\boldsymbol{H}}-\boldsymbol{V}_{\boldsymbol{P}_{H}}\cdot\boldsymbol{S}_{H}
18:  𝑽SH=𝟏./(𝑽PH+λ^1𝟏)\boldsymbol{V}_{S_{H}}=\mathbf{1}./(\boldsymbol{V}_{P_{H}}+\hat{\lambda}^{-1}\mathbf{1})
19:  𝑺H=𝑽SH(𝑹H𝑷H)\boldsymbol{S}_{H}=\boldsymbol{V}_{S_{H}}\cdot(\boldsymbol{R}_{H}-\boldsymbol{P}_{H})
20:  𝑽QH=𝟏./(|𝚽HT|.2𝑽SH)\boldsymbol{V}_{Q_{H}}=\mathbf{1}./(|\boldsymbol{\Phi}_{H}^{\textrm{T}}|^{.2}\boldsymbol{V}_{S_{H}})
21:  𝑸H=𝑯^+𝑽QH(𝚽HT𝑺H)\boldsymbol{Q}_{H}=\hat{\boldsymbol{H}}+\boldsymbol{V}_{Q_{H}}\cdot(\boldsymbol{\Phi}_{H}^{\textrm{T}}\boldsymbol{S}_{H})
22:  𝚵H=𝑽QH𝑮H(𝑸H,𝑽QH)\boldsymbol{\Xi}_{H}=\boldsymbol{V}_{Q_{H}}\cdot\boldsymbol{G}_{H}^{\prime}(\boldsymbol{Q}_{H},\boldsymbol{V}_{Q_{H}})
23:  𝑯^=𝑮H(𝑸H,𝑽QH)\hat{\boldsymbol{H}}=\boldsymbol{G}_{H}(\boldsymbol{Q}_{H},\boldsymbol{V}_{Q_{H}})
24:  𝑽H=diag(mean(𝚵H,1))\boldsymbol{V}_{H}=\text{diag}(\text{mean}(\boldsymbol{\Xi}_{H},1))
25:  λ^=ML/C\hat{\lambda}=ML/C with CC given in (44)

Until terminated

With model (1), we have the following joint conditional distribution and its factorization

p(𝑿,𝑯,λ|𝒀)\displaystyle p(\boldsymbol{X},\boldsymbol{H},\lambda|\boldsymbol{Y})\!\!\!\!\! \displaystyle\propto p(𝒀|𝑿,𝑯,λ)p(𝑿)p(𝑯)p(λ)\displaystyle\!\!\!\!\!p(\boldsymbol{Y}|\boldsymbol{X},\boldsymbol{H},\lambda)p(\boldsymbol{X})p(\boldsymbol{H})p(\lambda) (14)
\displaystyle\triangleq fY(𝒀,𝑿,𝑯,λ)fX(𝑿)fH(𝑯)fλ(λ)\displaystyle\!\!\!\!\!f_{Y}(\boldsymbol{Y},\!\boldsymbol{X},\!\boldsymbol{H},\!\lambda)f_{X}(\boldsymbol{X})f_{H}(\boldsymbol{H})f_{\lambda}(\lambda)

where

fY(𝒀,𝑿,𝑯,λ)\displaystyle f_{Y}(\boldsymbol{Y},\boldsymbol{X},\boldsymbol{H},\lambda) \displaystyle\triangleq p(𝒀|𝑿,𝑯,λ)\displaystyle p\big{(}\boldsymbol{Y}|\boldsymbol{X},\boldsymbol{H},\lambda\big{)} (15)
=\displaystyle= 𝒩(𝒀;𝑯𝑿,𝑰M,λ1𝑰L).\displaystyle\mathcal{MN}\big{(}\boldsymbol{Y};\boldsymbol{H}\boldsymbol{X},\boldsymbol{I}_{M},\lambda^{-1}\boldsymbol{I}_{L}\big{)}.

The precision λ\lambda of the noise has an improper prior fλ(λ)p(λ)1/λf_{\lambda}(\lambda)\triangleq p(\lambda)\propto 1/\lambda [27], and the priors for 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} denoted by fH(𝑯)p(𝑯)f_{H}(\boldsymbol{H})\triangleq p(\boldsymbol{H}) and fX(𝑿)p(𝑿)f_{X}(\boldsymbol{X})\triangleq p(\boldsymbol{X}) are customized according to the requirements in a specific problem. The factor graph representation of (14) is depicted in Fig. 1, where the arguments of the local functions are dropped for the sake of conciseness. We define a variational distribution

q(𝑿,𝑯,λ)=q(𝑿)q(𝑯)q(λ),q(\boldsymbol{X},\boldsymbol{H},\lambda)=q(\boldsymbol{X})q(\boldsymbol{H})q(\lambda), (16)

and we expect that q(𝑿)p(𝑿|𝒀)q(\boldsymbol{X})\approx p(\boldsymbol{X}|\boldsymbol{Y}), q(𝑯)p(𝑯|𝒀)q(\boldsymbol{H})\approx p(\boldsymbol{H}|\boldsymbol{Y}) and q(λ)p(λ|𝒀)q(\lambda)\approx p(\lambda|\boldsymbol{Y}). Next, we study how to efficiently update q(𝑿)q(\boldsymbol{X}), q(𝑯)q(\boldsymbol{H}) and q(λ)q(\lambda) iteratively.

Refer to caption
Figure 1: Factor graph representation of (14).

III-B Update of q(𝐗)q(\boldsymbol{X})

According to VI, with q(𝑯)q(\boldsymbol{H}) and q(λ)q(\lambda) (updated in the previous iteration), we update q(𝑿)q(\boldsymbol{X}). As shown by the factor graph in Fig. 1, we need to compute the message mfYX(𝑿)m_{f_{Y}\to X}(\boldsymbol{X}) from the factor node fYf_{Y} to the variable node 𝑿\boldsymbol{X} and then combine it with the prior p(𝑿)p(\boldsymbol{X}). Later we will see that q(𝑯)q(\boldsymbol{H}) is a matrix normal distribution, i.e., q(𝑯)=𝒩(𝑯;𝑯^,𝑼H,𝑽H)q(\boldsymbol{H})=\mathcal{MN}(\boldsymbol{H};\hat{\boldsymbol{H}},\boldsymbol{U}_{H},\boldsymbol{V}_{H}) with 𝑼H=𝑰M\boldsymbol{U}_{H}=\boldsymbol{I}_{M}, and q(λ)q(\lambda) is a Gamma distribution. It turns out that the message mfYX(𝑿)m_{f_{Y}\to X}(\boldsymbol{X}) is matrix normal, as shown by Proposition 1.

Proposition 1: The message from fYf_{Y} to 𝑿\boldsymbol{X} is a matrix normal distribution of 𝑿\boldsymbol{X}, which can be expressed as

mfYX(𝑿)𝒩(𝑿;𝑿¯,λ^1𝑼¯X,𝑰L),m_{f_{Y}\to X}(\boldsymbol{X})\propto\mathcal{MN}(\boldsymbol{X};\overline{\boldsymbol{X}},\hat{\lambda}^{-1}\overline{\boldsymbol{U}}_{X},\boldsymbol{I}_{L}), (17)

with

𝑿¯=𝑼¯X𝑯^T𝒀,\overline{\boldsymbol{X}}=\overline{\boldsymbol{U}}_{X}\hat{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{Y}, (18)
𝑼¯X=(𝑯^T𝑯^+Tr(𝑼H)𝑽H)1,\overline{\boldsymbol{U}}_{X}=\big{(}\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}}+\text{Tr}(\boldsymbol{U}_{H})\boldsymbol{V}_{H}\big{)}^{-1}, (19)

and

λ^=λλq(λ).\hat{\lambda}=\int_{\lambda}\lambda q(\lambda). (20)

The computation of λ^\hat{\lambda} is shown in (46).

Proof.

See Appendix A. ∎

According to VI, this message is combined with the prior p(𝑿)p(\boldsymbol{X}) to obtain q(𝑿)q(\boldsymbol{X}). This can be challenging as p(𝑿)p(\boldsymbol{X}) may lead to an intractable q(𝑿)q(\boldsymbol{X}) and high computational complexity. Next, leveraging UAMP, we update q(𝑿)q(\boldsymbol{X}) efficiently. It is also worth mentioning that, with the use of UAMP, the matrix inversion involved in (19) does not need to be performed.

Note that 𝑿=[𝒙1,,𝒙L]\boldsymbol{X}=[\boldsymbol{x}_{1},...,\boldsymbol{x}_{L}] and 𝑿¯=[𝒙¯1,,𝒙¯L]\overline{\boldsymbol{X}}=[\overline{\boldsymbol{x}}_{1},...,\overline{\boldsymbol{x}}_{L}]. The result in (17) indicates that 𝒙l𝒩(𝒙l;𝒙¯l,λ^1𝑼¯X)\boldsymbol{x}_{l}\sim\mathcal{N}(\boldsymbol{x}_{l};\overline{\boldsymbol{x}}_{l},\hat{\lambda}^{-1}\overline{\boldsymbol{U}}_{X}). This leads to the following pseudo observation model

𝒙¯l=𝒙l+𝒆x,\displaystyle\overline{\boldsymbol{x}}_{l}=\boldsymbol{x}_{l}+\boldsymbol{e}_{x}, (21)

where 𝒆x𝒩(𝒆x;0,λ^1𝑼¯X)\boldsymbol{e}_{x}\sim\mathcal{N}(\boldsymbol{e}_{x};0,\hat{\lambda}^{-1}\overline{\boldsymbol{U}}_{X}), i.e., the model noise is not white. We can whiten the noise by left-multiplying both sides of (21) by 𝑼¯X12\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}, leading to

𝑼¯X12𝒙¯l=𝑼¯X12𝒙l+𝒘x,\displaystyle\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}\overline{\boldsymbol{x}}_{l}=\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}\boldsymbol{x}_{l}+\boldsymbol{w}_{x}, (22)

where 𝒘x=𝑼¯X12𝒆x\boldsymbol{w}_{x}=\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}\boldsymbol{e}_{x} is white and Gaussian with covariance λ^1𝑰\hat{\lambda}^{-1}\boldsymbol{I}. Through the whitening process, we get a standard linear model with white additive Gaussian noise, which facilitates the use of UAMP. Considering all the vectors in 𝑿¯\overline{\boldsymbol{X}}, we have

𝑼¯X12𝑿¯=𝑼¯X12𝑿+𝛀𝑿,\displaystyle\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}\overline{\boldsymbol{X}}=\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}\boldsymbol{X}+\boldsymbol{\Omega_{X}}, (23)

where 𝛀𝑿\boldsymbol{\Omega_{X}} is white and Gaussian.

With model (23) and the prior p(𝑿)p(\boldsymbol{X}), we use UAMP to update q(𝑿)q(\boldsymbol{X}). Following UAMP, a unitary transformation needs to be performed with the unitary matrix 𝑪XT\boldsymbol{C}_{X}^{T} obtained from the SVD of 𝑼¯X12\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}} (or eigenvalue decomposition (EVD), as the matrix is symmetric and positive definite), i.e.,

𝑼¯X12=𝑪X𝚲𝑪XT\displaystyle\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}=\boldsymbol{C}_{X}\boldsymbol{\Lambda}\boldsymbol{C}_{X}^{T} (24)

with 𝚲\boldsymbol{\Lambda} being a diagonal matrix. After performing the unitary transformation, we have

𝑹X=𝚽X𝑿+𝛀𝑿,\displaystyle\boldsymbol{R}_{X}=\boldsymbol{\Phi}_{X}\boldsymbol{X}+\boldsymbol{\Omega^{\prime}_{X}}, (25)

where 𝑹X=𝑪XT𝑼¯X12𝑿¯\boldsymbol{R}_{X}=\boldsymbol{C}_{X}^{T}\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}\overline{\boldsymbol{X}}, 𝚽X=𝑪XT𝑼¯X12=𝚲𝑪XT\boldsymbol{\Phi}_{X}=\boldsymbol{C}_{X}^{T}\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}=\boldsymbol{\Lambda}\boldsymbol{C}_{X}^{T} and 𝛀𝑿=𝑪XT𝛀𝑿\boldsymbol{\Omega^{\prime}_{X}}=\boldsymbol{C}_{X}^{T}\boldsymbol{\Omega_{X}}, which is still white and Gaussian.

From the above, it seems costly to obtain 𝑹X\boldsymbol{R}_{X} and 𝚽X\boldsymbol{\Phi}_{X} in model (25). Specifically, a matrix inverse operation needs to be performed according to (19) so that 𝑿¯\overline{\boldsymbol{X}} in (18) can be obtained; a matrix square root operation is needed to obtain 𝑼¯X12\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}; and an SVD operation is required to obtain 𝑪XT\boldsymbol{C}_{X}^{T}. Next, we show that the computations can be greatly simplified.

Instead of computing 𝑼¯X\overline{\boldsymbol{U}}_{X} and 𝑼¯X12\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}} followed by the SVD of 𝑼¯X12\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}, we perform an EVD of 𝑼¯X1=𝑾¯X=𝑯^T𝑯^+Tr(𝑼H)𝑽H\overline{\boldsymbol{U}}_{X}^{-1}=\overline{\boldsymbol{W}}_{X}=\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}}+\text{Tr}(\boldsymbol{U}_{H})\boldsymbol{V}_{H}, i.e.,

[𝑪X,𝑫X]=eig(𝑾¯X),\displaystyle[\boldsymbol{C}_{X},\boldsymbol{D}_{X}]=\text{eig}(\overline{\boldsymbol{W}}_{X}), (26)

where the diagonal matrix

𝑫X=𝚲2.\boldsymbol{D}_{X}=\boldsymbol{\Lambda}^{-2}. (27)

Hence 𝚽X=𝚲𝑪XT=𝑫X1/2𝑪XT\boldsymbol{\Phi}_{X}=\boldsymbol{\Lambda}\boldsymbol{C}_{X}^{T}=\boldsymbol{D}_{X}^{-1/2}\boldsymbol{C}_{X}^{T}, where the computation of 𝑫X1/2\boldsymbol{D}_{X}^{-1/2} is trivial as 𝑫X\boldsymbol{D}_{X} is a diagonal matrix. Meanwhile, the computation of the pseudo observation matrix 𝑹X\boldsymbol{R}_{X} can also be simplified:

𝑹X\displaystyle\boldsymbol{R}_{X} =\displaystyle= 𝑪XT𝑼¯X12𝑿¯\displaystyle\boldsymbol{C}_{X}^{T}\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}\overline{\boldsymbol{X}} (28)
=\displaystyle= 𝑪XT𝑼¯X12𝑼¯X𝑯^T𝒀\displaystyle\boldsymbol{C}_{X}^{T}\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}\overline{\boldsymbol{U}}_{X}\hat{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{Y}
=\displaystyle= 𝑫X12𝑪XT𝑯^T𝒀.\displaystyle{\boldsymbol{D}_{X}^{-\frac{1}{2}}\boldsymbol{C}_{X}^{\textrm{T}}\hat{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{Y}}.

For convenience, we rewrite the unitary transformed pseudo observation model as

𝑫X12𝑪XT𝑯^T𝒀𝑹X=𝑫X12𝑪XT𝚽X𝑿+𝛀𝑿.\displaystyle{\underbrace{\boldsymbol{D}_{X}^{-\frac{1}{2}}\boldsymbol{C}_{X}^{\textrm{T}}\hat{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{Y}}_{\boldsymbol{R}_{X}}}=\underbrace{\boldsymbol{D}_{X}^{-\frac{1}{2}}\boldsymbol{C}_{X}^{\textrm{T}}}_{\boldsymbol{\Phi}_{X}}\boldsymbol{X}+\boldsymbol{\Omega^{\prime}_{X}}. (29)

The above leads to Lines 1-3 of UAMP-MF in Algorithm 2.
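For illustration, Lines 1-3 can be realized with a few NumPy calls as sketched below, assuming 𝑼_H = 𝑰_M so that Tr(𝑼_H) = M as in Algorithm 2; Hhat, V_H and Y stand for the current estimate of 𝑯, its column covariance and the observation matrix, and the function name is our own.

import numpy as np

def pseudo_model_X(Y, Hhat, V_H):
    M = Hhat.shape[0]
    W_X = Hhat.T @ Hhat + M * V_H          # Line 1: W_X is the inverse of U_bar_X
    d, C_X = np.linalg.eigh(W_X)           # Line 2: EVD of the symmetric matrix W_X
    Phi_X = (d ** -0.5)[:, None] * C_X.T   # Line 3: Phi_X = D_X^{-1/2} C_X^T
    R_X = Phi_X @ (Hhat.T @ Y)             # Line 3: R_X = D_X^{-1/2} C_X^T Hhat^T Y
    return R_X, Phi_X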

Due to the prior p(𝑿)p(\boldsymbol{X}), the use of exact q(𝑿)q(\boldsymbol{X}) often makes the message update intractable. Following UAMP in Algorithm 1, we project it to be Gaussian, which can also be interpreted as performing minimum mean squared error (MMSE) estimation based on the pseudo observation model (29) with prior p(𝑿)p(\boldsymbol{X}). These correspond to Lines 4-11 of UAMP-MF in Algorithm 2. We assume that the prior p(𝑿)p(\boldsymbol{X}) is separable, i.e., p(𝑿)=n,lp(xnl)p(\boldsymbol{X})=\prod_{n,l}p(x_{nl}); then the operations in Lines 10 and 11 are element-wise, as explained in the following. As the estimation is decoupled by (U)AMP, we assume the following scalar pseudo models

qnl=xnl+wnl,n=1,,N,l=1,,L,q_{nl}=x_{nl}+w_{nl},n=1,...,N,l=1,...,L, (30)

where qnlq_{nl} is the (n,l)(n,l)th element of 𝑸X\boldsymbol{Q}_{X} in Line 9 of the UAMP-MF algorithm, wnlw_{nl} represents Gaussian noise with zero mean and variance vnlv_{nl}, and vnlv_{nl} is the (n,l)(n,l)th element of 𝑽QX\boldsymbol{V}_{Q_{X}} in Line 8 of the UAMP-MF algorithm. This is significant, as the complex estimation is reduced to a number of simple scalar MMSE estimation problems based on (30) with prior p(xnl)p(x_{nl}). With the notations in Lines 10 and 11, for each entry xnlx_{nl} in 𝑿\boldsymbol{X}, the MMSE estimation leads to a Gaussian distribution

q(xnl)=𝒩(xnl;x^nl,vxnl),\displaystyle q^{\prime}(x_{nl})=\mathcal{N}(x_{nl};\hat{x}_{nl},v_{x_{nl}}), (31)

where x^nl\hat{x}_{nl} and vxnlv_{x_{nl}} are the (n,l)(n,l)th elements of 𝑿^\hat{\boldsymbol{X}} in Line 11 and of 𝚵X\boldsymbol{\Xi}_{X} in Line 10, respectively. We can see that each element xnlx_{nl} has its own variance. To facilitate subsequent processing, we make an approximation by assuming that the elements of each row in 𝑿^\hat{\boldsymbol{X}} have the same variance, which is the average of their variances. Then they are collectively characterized by a matrix normal distribution, i.e., q(𝑿)=𝒩(𝑿;𝑿^,𝑼X,𝑽X)q(\boldsymbol{X})=\mathcal{MN}(\boldsymbol{X};\hat{\boldsymbol{X}},\boldsymbol{U}_{X},\boldsymbol{V}_{X}) with 𝑼X=diag(mean(𝚵X,2))\boldsymbol{U}_{X}=\text{diag}(\text{mean}(\boldsymbol{\Xi}_{X},2)) and 𝑽X=𝑰L\boldsymbol{V}_{X}=\boldsymbol{I}_{L}, where mean(𝚵X,2)\text{mean}(\boldsymbol{\Xi}_{X},2) represents the average operation on the rows of 𝚵X\boldsymbol{\Xi}_{X}. This leads to Line 12 of UAMP-MF in Algorithm 2.
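As a concrete (and deliberately simple) instance of Lines 10-12, the sketch below assumes a separable zero-mean Gaussian prior 𝒩(x_nl; 0, v0), for which the element-wise MMSE estimates have closed forms; the per-entry variances are then averaged along rows to form 𝑼_X. The priors used in Section IV lead to different denoisers but to the same structure; the function name and v0 are illustrative.

import numpy as np

def update_qX_gaussian_prior(Q_X, V_QX, v0=1.0):
    gain = v0 / (v0 + V_QX)              # element-wise posterior gain for the prior N(0, v0)
    X_hat = gain * Q_X                   # Line 11: element-wise posterior means
    Xi_X = gain * V_QX                   # Line 10: element-wise posterior variances
    U_X = np.diag(Xi_X.mean(axis=1))     # Line 12: U_X = diag(mean(Xi_X, 2)), with V_X = I_L
    return X_hat, Xi_X, U_X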

III-C Update of q(𝐇)q(\boldsymbol{H})

With the updated q(𝑿)=𝒩(𝑿;𝑿^,𝑼X,𝑽X)q(\boldsymbol{X})=\mathcal{MN}(\boldsymbol{X};\hat{\boldsymbol{X}},\boldsymbol{U}_{X},\boldsymbol{V}_{X}) and q(λ)q(\lambda), we compute the message from fYf_{Y} to 𝑯\boldsymbol{H} according to VI. Regarding the message, we have the following result.

Proposition 2: The message from fYf_{Y} to 𝑯\boldsymbol{H} is a matrix normal distribution about 𝑯\boldsymbol{H}, which can be expressed as

mfYH(𝑯)𝒩(𝑯;𝑯¯,𝑰M,λ^1𝑽¯H),m_{f_{Y}\to H}(\boldsymbol{H})\propto\mathcal{MN}(\boldsymbol{H};\overline{\boldsymbol{H}},\boldsymbol{I}_{M},\hat{\lambda}^{-1}\overline{\boldsymbol{V}}_{H}), (32)

with

𝑯¯=𝒀𝑿^T𝑽¯H.\overline{\boldsymbol{H}}=\boldsymbol{Y}\hat{\boldsymbol{X}}^{\textrm{T}}\overline{\boldsymbol{V}}_{H}. (33)

and

𝑽¯H=(𝑿^𝑿^T+Tr(𝑽X)𝑼X)1,\overline{\boldsymbol{V}}_{H}=\left(\hat{\boldsymbol{X}}\hat{\boldsymbol{X}}^{\textrm{T}}+\text{Tr}(\boldsymbol{V}_{X})\boldsymbol{U}_{X}\right)^{-1}, (34)

where λ^=λλq(λ)\hat{\lambda}=\int_{\lambda}\lambda q(\lambda).

Proof.

See Appendix B. ∎

Then we combine the message mfYH(𝑯)m_{f_{Y}\to H}(\boldsymbol{H}) with the prior of 𝑯\boldsymbol{H} to update q(𝑯)q(\boldsymbol{H}). This will also be realized with UAMP through a whitening process. The procedure is similar to that for q(𝑿)q(\boldsymbol{X}), and the difference is that the pseudo model is established row by row (rather than column by column) by considering the form of mfYH(𝑯)m_{f_{Y}\to H}(\boldsymbol{H}).

With the message mfYH(𝑯)m_{f_{Y}\to H}(\boldsymbol{H}) and

𝑯¯=(𝒉¯1T𝒉¯MT),\overline{\boldsymbol{H}}=\left(\begin{matrix}\overline{\boldsymbol{h}}_{1}^{\textrm{T}}\\ \vdots\\ \overline{\boldsymbol{h}}_{M}^{\textrm{T}}\end{matrix}\right), (35)

where 𝒉¯mT1×N\overline{\boldsymbol{h}}_{m}^{\textrm{T}}\in\mathcal{R}^{1\times N} is the mmth row vector, we have the pseudo observation model

𝒉¯m=𝒉m+𝒆h,\displaystyle\overline{\boldsymbol{h}}_{m}=\boldsymbol{h}_{m}+\boldsymbol{e}_{h}, (36)

where 𝒆h𝒩(𝒆h;0,λ^1𝑽¯H)\boldsymbol{e}_{h}\sim\mathcal{N}(\boldsymbol{e}_{h};0,\hat{\lambda}^{-1}\overline{\boldsymbol{V}}_{H}). Whitening operation to (36) leads to

𝑽¯H12𝒉¯m=𝑽¯H12𝒉m+𝒘h,\displaystyle\overline{\boldsymbol{V}}_{H}^{-\frac{1}{2}}\overline{\boldsymbol{h}}_{m}=\overline{\boldsymbol{V}}_{H}^{-\frac{1}{2}}\boldsymbol{h}_{m}+\boldsymbol{w}_{h}, (37)

where 𝒘h=𝑽¯H12𝒆h\boldsymbol{w}_{h}=\overline{\boldsymbol{V}}_{H}^{-\frac{1}{2}}\boldsymbol{e}_{h}, which is white and Gaussian with covariance λ^1𝑰\hat{\lambda}^{-1}\boldsymbol{I}. Collecting all rows and representing them in matrix form, we have

𝑽¯H12𝑯¯T=𝑽¯H12𝑯T+𝛀H.\displaystyle\overline{\boldsymbol{V}}_{H}^{-\frac{1}{2}}\overline{\boldsymbol{H}}^{\textrm{T}}=\overline{\boldsymbol{V}}_{H}^{-\frac{1}{2}}\boldsymbol{H}^{\textrm{T}}+\boldsymbol{\Omega}_{H}. (38)

UAMP is then employed with the model (38). Using the ideas in the previous section for updating q(𝑿)q(\boldsymbol{X}), we obtain the unitary transformed model efficiently.

We first perform an EVD to matrix 𝑽¯H1=𝑾¯H=𝑿^𝑿^T+Tr(𝑽X)𝑼X\overline{\boldsymbol{V}}_{H}^{-1}=\overline{\boldsymbol{W}}_{H}=\hat{\boldsymbol{X}}\hat{\boldsymbol{X}}^{\textrm{T}}+\text{Tr}(\boldsymbol{V}_{X})\boldsymbol{U}_{X}, i.e.,

[𝑪H,𝑫H]=eig(𝑾¯H).\displaystyle[\boldsymbol{C}_{H},\boldsymbol{D}_{H}]=\text{eig}(\overline{\boldsymbol{W}}_{H}). (39)

Then the unitary transformed model is given as

𝑹HT=𝚽HT𝑯T+𝛀H\displaystyle\boldsymbol{R}_{H}^{\textrm{T}}=\boldsymbol{\Phi}_{H}^{\textrm{T}}\boldsymbol{H}^{\textrm{T}}+\boldsymbol{\Omega}^{\prime}_{H} (40)

where

𝚽H\displaystyle\boldsymbol{\Phi}_{H} =\displaystyle= 𝑫H12𝑪HT\displaystyle\boldsymbol{D}_{H}^{-\frac{1}{2}}\boldsymbol{C}_{H}^{\textrm{T}} (41)
𝑹H\displaystyle\boldsymbol{R}_{H} =\displaystyle= 𝑫H12𝑪HT𝑿^𝒀T,\displaystyle\boldsymbol{D}_{H}^{-\frac{1}{2}}\boldsymbol{C}_{H}^{\textrm{T}}\hat{\boldsymbol{X}}\boldsymbol{Y}^{\textrm{T}}, (42)

and 𝛀H\boldsymbol{\Omega}^{\prime}_{H} is white and Gaussian. The above leads to Lines 13-15 of UAMP-MF in Algorithm 2.

Following UAMP in Algorithm 1, we obtain q(𝑯)q(\boldsymbol{H}) and project it to be Gaussian, which corresponds to Lines 16-23 of UAMP-MF in Algorithm 2. Similar to the case for updating q(𝑿)q(\boldsymbol{X}), to accommodate {q(hmn)}\{q(h_{mn})\} with a matrix normal distribution, we make an approximation by assuming the entries in each column of 𝑯\boldsymbol{H} share a common variance, which is the average of their variances. Then q(𝑯)=𝒩(𝑯;𝑯^,𝑼H,𝑽H)q(\boldsymbol{H})=\mathcal{MN}(\boldsymbol{H};\hat{\boldsymbol{H}},\boldsymbol{U}_{H},\boldsymbol{V}_{H}) with 𝑽H=diag(mean(𝚵H,1))\boldsymbol{V}_{H}=\text{diag}(\text{mean}(\boldsymbol{\Xi}_{H},1)) and 𝑼H=𝑰M\boldsymbol{U}_{H}=\boldsymbol{I}_{M} (i.e., Line 24 of UAMP-MF in Algorithm 2), where mean(𝚵H,1)\text{mean}(\boldsymbol{\Xi}_{H},1) represents the average operation on the columns of 𝚵H\boldsymbol{\Xi}_{H}.

III-D Update of q(λ)q(\lambda)

With q(𝑯)=𝒩(𝑯;𝑯^,𝑰M,𝑽H)q(\boldsymbol{H})=\mathcal{MN}\left(\boldsymbol{H};\hat{\boldsymbol{H}},\boldsymbol{I}_{M},\boldsymbol{V}_{H}\right) and q(𝑿)=𝒩(𝑿;𝑿^,𝑼X,𝑰L)q(\boldsymbol{X})=\mathcal{MN}\left(\boldsymbol{X};\hat{\boldsymbol{X}},\boldsymbol{U}_{X},\boldsymbol{I}_{L}\right), we update q(λ)q(\lambda), i.e., we combine the message mfYλ(λ)m_{f_{Y}\to\lambda}(\lambda) from fYf_{Y} to λ\lambda with its prior p(λ)p(\lambda).

Proposition 3: The message from fYf_{Y} to λ\lambda can be expressed as

mfYλ(λ)λMLexp(λC),\displaystyle m_{f_{Y}\to\lambda}(\lambda)\propto\lambda^{ML}\exp\Big{(}-\lambda C\Big{)}, (43)

where

C=𝒀𝑯^𝑿^F2+MTr(𝑿^𝑿^T𝑽H)\displaystyle\!\!C=\|\boldsymbol{Y}-\hat{\boldsymbol{H}}\hat{\boldsymbol{X}}\|_{F}^{2}+M\text{Tr}\Big{(}\hat{\boldsymbol{X}}\hat{\boldsymbol{X}}^{\textrm{T}}\boldsymbol{V}_{H}\Big{)}
+LTr(𝑼X𝑯^T𝑯^)+MLTr(𝑼X𝑽H).\displaystyle\ \ \ \ \ \ \ \ +L\text{Tr}(\boldsymbol{U}_{X}\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}})+ML\text{Tr}(\boldsymbol{U}_{X}\boldsymbol{V}_{H}). (44)
Proof.

See Appendix C. ∎

With the message mfYλ(λ)m_{f_{Y}\to\lambda}(\lambda) and the prior p(λ)1/λp(\lambda)\propto 1/\lambda, q(λ)q(\lambda) can be computed as

q(λ)mfYλ(λ)p(λ)=λML1exp(λC),q(\lambda)\propto m_{f_{Y}\to\lambda}(\lambda)p(\lambda)=\lambda^{ML-1}\exp\Big{(}-\lambda C\Big{)}, (45)

which is a Gamma distribution. Then the mean of λ\lambda is obtained as

λ^=λλq(λ)=ML/C,\displaystyle\hat{\lambda}={\int_{\lambda}\lambda q(\lambda)}={ML}/{C}, (46)

which is Line 25 of UAMP-MF.
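As a sketch, the noise-precision update in Line 25 amounts to evaluating C in (44) and setting lambda_hat = ML/C, e.g. (the function name is our own):

import numpy as np

def update_lambda(Y, H_hat, X_hat, U_X, V_H):
    M, L = Y.shape
    C = (np.linalg.norm(Y - H_hat @ X_hat, 'fro') ** 2
         + M * np.trace(X_hat @ X_hat.T @ V_H)
         + L * np.trace(U_X @ H_hat.T @ H_hat)
         + M * L * np.trace(U_X @ V_H))           # C in (44)
    return M * L / C                              # lambda_hat in (46)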

III-E Discussion and Computational Complexity

Regarding UAMP-MF in Algorithm 2, we have the following remarks:

  • We do not specify the priors of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} in Algorithm 2. The detailed implementations of Lines 10, 11, 22 and 23 of the algorithm depend on the priors in a concrete scenario. Various examples are provided in Section IV.

  • As the MF problem often has local minima, we can use the strategy of restart to mitigate the issue of being stuck at local minima. In addition, the iterative process can be terminated based on some criterion, e.g., the normalized difference between the estimates of two consecutive iterations is smaller than a threshold.

  • There are often hyper-parameters in the priors of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X}. If we do not have knowledge of these hyper-parameters, their values need to be learned or tuned automatically. Thanks to the factor graph and message passing framework, these extra tasks can be implemented by extending the factor graph in Fig. 1.

Refer to caption
Figure 2: Factor graph for learning hyper-parameters.

In the following, we show how to learn the hyper-parameters of the prior of 𝑿\boldsymbol{X}. We assume that 𝑿\boldsymbol{X} is sparse and the sparsity rate is unknown. In this case, we may employ the sparsity promoting hierarchical Gaussian-Gamma prior [27]. The prior is then p(xnl,γnl)=p(xnl|γnl)p(γnl)p(x_{nl},\gamma_{nl})=p(x_{nl}|\gamma_{nl})p(\gamma_{nl}), where p(xnl|γnl)=𝒩(xnl;0,γnl1)fxnlp(x_{nl}|\gamma_{nl})=\mathcal{N}(x_{nl};0,\gamma_{nl}^{-1})\triangleq f_{x_{nl}} and p(γnl)=Ga(γnl;ϵ,η)p(\gamma_{nl})=Ga(\gamma_{nl};\epsilon,\eta) with ϵ\epsilon and η\eta being the shape parameter and scale parameter of the Gamma distribution, respectively. While the precision γnl\gamma_{nl} is to be learned, the values for ϵ\epsilon and η\eta are often set empirically, e.g., ϵ=η=0\epsilon=\eta=0 [27]. It is worth mentioning that ϵ\epsilon can be tuned automatically to improve the performance [28].

The factor graph representation for this part is shown in Fig. 2. Also note that UAMP decouples the estimation as shown by the pseudo model (30), indicating that the incoming message to the factor graph in Fig. 2 is Gaussian, i.e.,

mxnlfxnl(xnl)=𝒩(xnl;qnl,vnl).m_{x_{nl}\to f_{x_{nl}}}(x_{nl})=\mathcal{N}(x_{nl};q_{nl},v_{nl}). (47)

We perform inference on xnlx_{nl} and γnl\gamma_{nl}, which can also be achieved by using VI with a variational distribution q(xnl,γnl)=q(xnl)q(γnl)q(x_{nl},\gamma_{nl})=q(x_{nl})q(\gamma_{nl}).

According to VI,

q(xnl)\displaystyle q(x_{nl}) \displaystyle\propto mxnlfxnl(xnl)mfxnlxnl(xnl)\displaystyle m_{x_{nl}\to f_{x_{nl}}}(x_{nl})m_{f_{x_{nl}}\to x_{nl}}(x_{nl}) (48)
=\displaystyle= 𝒩(xnl;x^nl,vxnl)\displaystyle\mathcal{N}(x_{nl};\hat{x}_{nl},v_{x_{nl}})

with mfxnlxnl(xnl)m_{f_{x_{nl}}\to x_{nl}}(x_{nl}) shown in (52), and

vxnl=vqnl1+γ^nlvqnl,x^nl=q^nl1+γ^nlvqnl.\displaystyle v_{x_{nl}}=\frac{v_{q_{nl}}}{1+\hat{\gamma}_{nl}v_{q_{nl}}},~{}\hat{x}_{nl}=\frac{\hat{q}_{nl}}{1+\hat{\gamma}_{nl}v_{q_{nl}}}. (49)

The message mfxnlγnl(γnl)m_{f_{x_{nl}}\to\gamma_{nl}}(\gamma_{nl}) can be expressed as

mfxnlγnl(γnl)\displaystyle m_{f_{x_{nl}}\to\gamma_{nl}}(\gamma_{nl}) \displaystyle\propto exp(xnlq(xnl)log𝒩(xnl;0,γnl1)),\displaystyle\exp\big{(}\int_{x_{nl}}q(x_{nl})\log\mathcal{N}(x_{nl};0,\gamma_{nl}^{-1})\big{)}, (50)
\displaystyle\propto γnl12exp(γnl2(|x^nl|2+vxnl)).\displaystyle\gamma_{nl}^{\frac{1}{2}}\exp\big{(}-\frac{\gamma_{nl}}{2}(|\hat{x}_{nl}|^{2}+v_{x_{nl}})\big{)}.

The message mfγnlγnl(γnl)m_{f_{\gamma_{nl}}\to\gamma_{nl}}(\gamma_{nl}) is the predefined Gamma distribution, i.e., mfγnlγnl(γnl)γnlϵ1exp(ηγnl)m_{f_{\gamma_{nl}}\to\gamma_{nl}}(\gamma_{nl})\propto\gamma_{nl}^{\epsilon-1}\exp(-\eta\gamma_{nl}). According to VI,

q(γnl)\displaystyle q(\gamma_{nl}) \displaystyle\propto mfγnlγnl(γnl)mfxnlγnl(γnl)\displaystyle m_{f_{\gamma_{nl}}\to\gamma_{nl}}(\gamma_{nl})m_{f_{x_{nl}}\to\gamma_{nl}}(\gamma_{nl}) (51)
\displaystyle\propto γnlϵ12exp(γnl2(|x^nl|2+vxnl+2η)).\displaystyle\gamma_{nl}^{\epsilon-\frac{1}{2}}\exp\big{(}-\frac{\gamma_{nl}}{2}(|\hat{x}_{nl}|^{2}+v_{x_{nl}}+2\eta)\big{)}.

Thus

mfxnlxnl(xnl)\displaystyle m_{f_{x_{nl}}\to x_{nl}}(x_{nl}) \displaystyle\propto exp(γnlq(γnl)log𝒩(xnl;0,γnl1)),\displaystyle\exp\big{(}\int_{\gamma_{nl}}q(\gamma_{nl})\log\mathcal{N}(x_{nl};0,\gamma_{nl}^{-1})\big{)}, (52)
\displaystyle\propto 𝒩(xnl;0,γ^nl1),\displaystyle\mathcal{N}(x_{nl};0,\hat{\gamma}_{nl}^{-1}),

where

γ^nl=1+2ϵ2η+vxnl+|x^nl|2.\displaystyle\hat{\gamma}_{nl}=\frac{1+2\epsilon}{2\eta+v_{x_{nl}}+|\hat{x}_{nl}|^{2}}. (53)

The above computations for all the entries of 𝑿\boldsymbol{X} can be collectively expressed as

𝚵X=𝑽QX./(𝟏+𝚪^𝑽QX),\displaystyle\boldsymbol{\Xi}_{X}=\boldsymbol{V}_{Q_{X}}./\left(\mathbf{1}+\hat{\boldsymbol{\Gamma}}\cdot\boldsymbol{V}_{Q_{X}}\right), (54)
𝑿^=𝑸^X./(𝟏+𝚪^𝑽QX),\displaystyle\hat{\boldsymbol{X}}={\hat{\boldsymbol{Q}}_{X}}./\left(\mathbf{1}+\hat{\boldsymbol{\Gamma}}\cdot\boldsymbol{V}_{Q_{X}}\right), (55)
𝚪^=(𝟏+2ϵ𝟏)./(2η𝟏+𝚵X+|𝑿^|.2),\displaystyle\hat{\boldsymbol{\Gamma}}=(\mathbf{1}+2\epsilon\mathbf{1})./\left(2\eta\mathbf{1}+\boldsymbol{\Xi}_{X}+|\hat{\boldsymbol{X}}|^{.2}\right), (56)

where 𝚪^\hat{\boldsymbol{\Gamma}} is a matrix with {γnl}\{\gamma_{nl}\} as its entries.
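The matrix-form updates (54)-(56) translate directly into element-wise NumPy operations, as sketched below; eps and eta are the Gamma hyper-parameters, set to zero here following the empirical choice mentioned above, and the function name is illustrative.

import numpy as np

def gaussian_gamma_update(Q_X, V_QX, Gamma_hat, eps=0.0, eta=0.0):
    denom = 1.0 + Gamma_hat * V_QX
    Xi_X = V_QX / denom                   # (54)
    X_hat = Q_X / denom                   # (55)
    Gamma_hat = (1.0 + 2.0 * eps) / (2.0 * eta + Xi_X + np.abs(X_hat) ** 2)   # (56)
    return X_hat, Xi_X, Gamma_hat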

We see that UAMP-MF in Algorithm 2 involves matrix multiplications and two EVDs per iteration, which dominate its complexity. The complexity of the algorithm is therefore in a cubic order, which is low for an MF problem (even the product of two matrices has cubic complexity). In the next section, we will compare the proposed algorithm with state-of-the-art methods in various scenarios. As the algorithms have different computational complexity per iteration and the numbers of iterations required for convergence are also different, we will compare their complexity in terms of runtime, as in the literature. Extensive examples show that UAMP-MF is much faster than state-of-the-art algorithms in many scenarios, and often delivers significantly better performance.

IV Applications and Numerical Results

We focus on the applications of UAMP-MF to NMF, RPCA, DL, CSMU and sparse MF. Extensive comparisons with state-of-the-art algorithms are provided to demonstrate its merits in terms of robustness, recovery accuracy and complexity.

IV-A UAMP-MF for Robust Principal Components Analysis

In RPCA [4], we aim to estimate a low-rank matrix observed in the presence of noise and outliers. The signal model for RPCA can be written as

𝒀=𝑨𝑩+𝑬+𝑾,\displaystyle\boldsymbol{Y}=\boldsymbol{A}\boldsymbol{B}+\boldsymbol{E}+\boldsymbol{W}, (57)

where the product of a tall matrix 𝑨M×N\boldsymbol{A}\in\mathcal{R}^{M\times N} and a wide matrix 𝑩N×L\boldsymbol{B}\in\mathcal{R}^{N\times L} is a low-rank matrix, 𝑬\boldsymbol{E} is a sparse outlier matrix, and 𝑾\boldsymbol{W} is a noise matrix, which is assumed to be white and Gaussian. To enable the use of UAMP-MF, the model (57) can be rewritten as [5]

𝒀=[𝑨,𝑰]𝑯[𝑩𝑬]𝑿+𝑾,\displaystyle{\boldsymbol{Y}}=\underbrace{\Big{[}\boldsymbol{A},\boldsymbol{I}]}_{{\boldsymbol{H}}}\underbrace{\begin{bmatrix}\boldsymbol{B}\\ \boldsymbol{E}\end{bmatrix}}_{\boldsymbol{X}}+{\boldsymbol{W}}, (58)

which is an MF problem.
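The reformulation (58) is a purely structural rearrangement, as the small NumPy sketch below illustrates; the sizes, sparsity rate and noise level used here are illustrative only.

import numpy as np

rng = np.random.default_rng(1)
M, N, L, delta = 200, 50, 200, 0.1
A = rng.standard_normal((M, N))
B = rng.standard_normal((N, L))
E = (rng.random((M, L)) < delta) * rng.uniform(-10, 10, size=(M, L))   # sparse outliers
W = 0.01 * rng.standard_normal((M, L))
Y = A @ B + E + W                       # RPCA model (57)
H = np.hstack([A, np.eye(M)])           # H = [A, I]
X = np.vstack([B, E])                   # X = [B; E]
assert np.allclose(Y, H @ X + W)        # identical to the MF model (58)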

In (58), part of matrix 𝑯{\boldsymbol{H}} is known, and part of matrix 𝑿{\boldsymbol{X}} is sparse. These can be captured by imposing proper priors on the matrix entries. Similar to [5], for matrix 𝑯{\boldsymbol{H}}, the priors are

p(hm,n)\displaystyle p({h}_{m,n}) =\displaystyle= {𝒩(hm,n;0,1)n≤N𝒩(hm,n;Im,n,0)n>N\displaystyle\begin{cases}\mathcal{N}({h}_{m,n};0,1)&n\leq N\\ \mathcal{N}({h}_{m,n};I_{m,n},0)&n>N\end{cases} (59)

where the entries corresponding to matrix 𝑰{\boldsymbol{I}} have a zero variance, indicating that they are deterministic and known. For the sparse matrix 𝑬{\boldsymbol{E}}, the sparsity rate is unknown, and we use the hierarchical Gaussian-Gamma priors. Then we have

p(xn,l)\displaystyle p({x}_{n,l}) =\displaystyle= {𝒩(xn,l;0,αx)n≤N𝒩(xn,l;0,γn,l1)n>N\displaystyle\begin{cases}\mathcal{N}({x}_{n,l};0,\alpha_{x})&n\leq N\\ \mathcal{N}({x}_{n,l};0,\gamma^{-1}_{n,l})&n>N\end{cases} (60)

where the precisions have Gamma distribution p(γnl)=Ga(γnl;ϵ,η)p(\gamma_{nl})=Ga(\gamma_{nl};\epsilon,\eta). If the knowledge about the parameter αx\alpha_{x} is unavailable, it can also be learned using VI, where the estimate of αx\alpha_{x} can be updated as α^x=𝚵X+|𝑿^|.22/NM\hat{\alpha}_{x}=||\boldsymbol{\Xi}^{\prime}_{X}+|\hat{\boldsymbol{X}}^{\prime}|^{.2}||^{2}/NM with

𝚵X\displaystyle\boldsymbol{\Xi}^{\prime}_{X} =\displaystyle= α^x𝑽QX./(α^x𝟏+𝑽QX),\displaystyle{\hat{\alpha}^{\prime}_{x}\boldsymbol{V}^{\prime}_{Q_{X}}}./{(\hat{\alpha}^{\prime}_{x}\mathbf{1}+\boldsymbol{V}^{\prime}_{Q_{X}})}, (61)
𝑿^\displaystyle\hat{\boldsymbol{X}}^{\prime} =\displaystyle= α^x𝑸^X./(α^x𝟏+𝑽QX),\displaystyle{\hat{\alpha}^{\prime}_{x}\hat{\boldsymbol{Q}}^{\prime}_{X}}./{(\hat{\alpha}^{\prime}_{x}\mathbf{1}+\boldsymbol{V}^{\prime}_{Q_{X}})}, (62)

where 𝑽QX\boldsymbol{V}^{\prime}_{Q_{X}} and 𝑸^X\hat{\boldsymbol{Q}}^{\prime}_{X} represent the upper (nNn\leq N) part of matrices 𝑽QX\boldsymbol{V}_{Q_{X}} and 𝑸^X\hat{\boldsymbol{Q}}_{X}, and α^x\hat{\alpha}^{\prime}_{x} is the estimate of αx\alpha_{x} in the last round of iteration. In addition, the rank NN of matrix 𝒁=𝑨𝑩\boldsymbol{Z}=\boldsymbol{A}\boldsymbol{B} is normally unknown, and rank-selection can be performed using the method in [29].

We compare UAMP-MF with state-of-the-art algorithms including Bi-GAMP [5], IALM [12] and LMaFit [13]. The performance is evaluated in terms of the normalized mean squared error (NMSE) of matrix 𝒁\boldsymbol{Z}, which is defined as

NMSE(Z)=𝒁𝑯^𝑿^2𝒁2.\displaystyle\text{NMSE(Z)}=\frac{\|\boldsymbol{Z}-\hat{\boldsymbol{H}}\hat{\boldsymbol{X}}\|^{2}}{\|\boldsymbol{Z}\|^{2}}. (63)

In the numerical examples, the low rank matrix 𝒁=𝑯𝑿\boldsymbol{Z}=\boldsymbol{H}\boldsymbol{X} is generated from 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} with i.i.d. standard Gaussian entries. The sparsity rate of matrix 𝑬\boldsymbol{E} is denoted by δ\delta, and the non-zero entries are independently drawn from a uniform distribution on [10,10][-10,10], which are located uniformly at random in 𝑬\boldsymbol{E}. It is noted that, with these settings, the magnitudes of the outliers and the entries of 𝒁\boldsymbol{Z} are on the same order, making the detection of the outliers challenging. The NMSE performance of various algorithms versus SNR is shown in Fig. 3, where δ=0.1\delta=0.1, M=L=200M=L=200, and the rank N=50,70N=50,70 and 9090 for (a), (b) and (c), respectively. It can be seen that IALM does not work well, and LMaFit struggles, delivering poor performance when N=70N=70 and 9090. Bi-GAMP and UAMP-MF deliver similar performance when N=50N=50, but when N=70N=70 and 9090, UAMP-MF still works very well while the performance of Bi-GAMP degrades significantly. We also show the performance of UAMP-MF with known rank NN, which is denoted by UAMP-MF(N) in Fig. 3, where we see that UAMP-MF has almost the same performance as UAMP-MF(N). Then, with M=L=200M=L=200, we vary the rank NN from 30 to 120 and the sparsity rate δ\delta from 0.05 to 0.3. The NMSE of the algorithms with different combinations of δ\delta and NN is shown in Fig. 4, where SNR = 60 dB. We see that the blue regions of UAMP-MF are significantly larger than those of the other algorithms, indicating the superior performance of UAMP-MF.

Refer to caption
Figure 3: NMSE of the algorithms for RPCA, where δ=0.1\delta=0.1, and N=50,70N=50,70 and 9090 in (a), (b) and (c), respectively.
Refer to caption
Figure 4: NMSE of the algorithms versus the combinations of rank NN and sparsity ratio δ\delta in RPCA.
Refer to caption
Figure 5: NMSE of the algorithms versus correlation parameter ρ\rho in RPCA, where (a) N=60N=60 and (b) N=80N=80.
Refer to caption
Figure 6: Runtime of the algorithms versus (a) SNR and (b) ρ\rho.

Next we consider a more challenging case. All the settings are the same as in the previous example except that the generated matrices 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} are no longer i.i.d., but correlated. Take 𝑯\boldsymbol{H} as an example; it is generated as

𝑯=𝑪L𝑮𝑪R,\displaystyle\boldsymbol{H}=\boldsymbol{C}_{L}\boldsymbol{G}\boldsymbol{C}_{R}, (64)

where 𝑮\boldsymbol{G} is an i.i.d. Gaussian matrix, 𝑪L\boldsymbol{C}_{L} is an M×MM\times M matrix with the (m,n)(m,n)th element given by ρ|mn|\rho^{|m-n|} where ρ[0,1]\rho\in[0,1], and 𝑪R\boldsymbol{C}_{R} is generated in the same way but with size N×NN\times N. The parameter ρ\rho controls the correlation of matrix 𝑯\boldsymbol{H}. Matrix 𝑿\boldsymbol{X} is generated in the same way. We show the performance of the algorithms versus the correlation parameter ρ\rho in Fig. 5, where SNR = 60 dB, the rank N=60N=60 and 8080 in (a) and (b), respectively. The results show that UAMP-MF is much more robust and it outperforms other algorithms significantly when ρ\rho is relatively large.
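For reproducibility of this type of experiment, the correlated matrices in (64) can be generated as in the following sketch; the function name and the random number generator are our own choices.

import numpy as np

def correlated_matrix(M, N, rho, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    G = rng.standard_normal((M, N))
    C_L = rho ** np.abs(np.subtract.outer(np.arange(M), np.arange(M)))   # (C_L)_{m,n} = rho^{|m-n|}
    C_R = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
    return C_L @ G @ C_R                                                 # H in (64)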

The runtime of the algorithms is shown in Fig. 6, where the simulation parameter settings in Figs. 6(a) and (b) are the same as those for Fig. 3(b) and Fig. 5(a), respectively. The results show that UAMP-MF can be much faster than Bi-GAMP while delivering better performance. IALM and LMaFit are faster than UAMP-MF, but they may suffer from severe performance degradation as shown in Figs. 3 - 5.

IV-B UAMP-MF for Dictionary Learning

In DL, we aim to find a dictionary 𝑯M×N\boldsymbol{H}\in\mathcal{R}^{M\times N} that allows the training samples 𝒀M×L\boldsymbol{Y}\in\mathcal{R}^{M\times L} to be coded as

𝒀=𝑯𝑿+𝑾\displaystyle\boldsymbol{Y}=\boldsymbol{H}\boldsymbol{X}+\boldsymbol{W} (65)

for some sparse matrix 𝑿\boldsymbol{X} and some perturbation 𝑾\boldsymbol{W}. To avoid over-fitting, a sufficiently large number of training samples is needed. For the entries of 𝑯\boldsymbol{H}, we use the Gaussian prior shown in (11), and for the entries of 𝑿\boldsymbol{X}, we use the hierarchical Gaussian-Gamma prior shown in (12) and (13), as we assume an unknown sparsity rate.

Figure 7: NMSE of the algorithms versus combinations of NN and LL for DL, where M=NM=N and ρ=0.2\rho=0.2.
Figure 8: NMSE of the algorithms versus ρ\rho for DL, where M=N=100M=N=100, (a) L=600L=600 and (b) L=800L=800.

We compare UAMP-MF with state-of-the-art algorithms, including BAd-VAMP [17], K-SVD [9] and SPAMS [10]. It is worth mentioning that, in the case of DL, BAd-VAMP offers significantly better robustness and performance than Bi-GAMP [17]. To test the robustness of the algorithms, we generate the correlated matrix 𝑯\boldsymbol{H} using (64) in the numerical examples. The sparse matrix 𝑿\boldsymbol{X} is generated by selecting the locations of the non-zero entries in each column uniformly at random and drawing their values from the standard Gaussian distribution, while all other entries are set to zero. The performance of DL is evaluated using the relative NMSE [17], defined as

NMSE(H)=min𝑱𝑯^𝑱𝑯2𝑯2,\displaystyle\text{NMSE(H)}=\underset{\boldsymbol{J}}{\text{min}}\frac{\|\hat{\boldsymbol{H}}\boldsymbol{J}-\boldsymbol{H}\|^{2}}{\|\boldsymbol{H}\|^{2}}, (66)

where 𝑱\boldsymbol{J} is a matrix accounting for the permutation and scale ambiguities.
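
If the matrix J in (66) is left unconstrained, the minimization admits a closed-form least-squares solution, which is how the metric is evaluated in the sketch below (treating J as unconstrained is an assumption consistent with (66)):

import numpy as np

def nmse_dictionary(H_hat, H):
    # Least-squares fit of the ambiguity matrix J in (66): J = pinv(H_hat) H
    J = np.linalg.pinv(H_hat) @ H
    num = np.linalg.norm(H_hat @ J - H, 'fro')**2
    return num / np.linalg.norm(H, 'fro')**2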

Figure 9: NMSE of the algorithms versus NN for DL, where M=100M=100, L=600L=600, (a) ρ=0.1\rho=0.1 and (b) ρ=0.3\rho=0.3.
Figure 10: Runtime of the algorithms for DL (a) versus LL, where M=N=100M=N=100 and ρ=0.1\rho=0.1, and (b) versus NN, where M=100M=100, L=600L=600 and ρ=0.1\rho=0.1.

We first assume square dictionaries (i.e., M=NM=N) and vary the values of NN and LL. The NMSE performance of the algorithms is shown in Fig. 7, where SNR = 50 dB. It can be seen that, when the correlation parameter ρ=0.2\rho=0.2, neither K-SVD nor SPAMS has satisfactory performance. In contrast, UAMP-MF and BAd-VAMP work well for a number of combinations of NN and LL, but the blue region of UAMP-MF is significantly larger than that of BAd-VAMP. We show the performance of the algorithms versus the correlation parameter ρ\rho in Fig. 8, where N=100N=100, L=500L=500 in (a) and L=800L=800 in (b). It can be seen that UAMP-MF exhibits much higher robustness than the other algorithms.

Next, we test the performance of the algorithms when learning an over-complete dictionary, i.e., N>MN>M. The settings are the same as those in the previous experiments, except that we fix MM and LL and vary the value of NN. The NMSE of the algorithms is shown in Fig. 9, where M=100M=100 and L=600L=600. The correlation parameter ρ\rho is set to 0.10.1 in (a) and 0.30.3 in (b). Again, K-SVD and SPAMS do not work well. Comparing BAd-VAMP with UAMP-MF, we see that UAMP-MF performs significantly better, especially when ρ=0.3\rho=0.3.

The runtime versus LL of the algorithms is shown in Fig. 10(a), where M=N=100M=N=100 and ρ=0.1\rho=0.1, and the runtime versus NN is shown in Fig. 10(b), where M=100M=100, L=600L=600, and ρ=0.1\rho=0.1. The SNR is set to 50 dB. The results show that UAMP-MF can be much faster than the other algorithms.

IV-C UAMP-MF for Compressive Sensing with Matrix Uncertainty

In practical applications of compressive sensing, we often do not know the measurement matrix 𝑯\boldsymbol{H} perfectly, e.g., due to hardware imperfections and model mismatch. We aim to recover the sparse signal 𝑿\boldsymbol{X} (and estimate the measurement matrix 𝑯\boldsymbol{H}) from the noisy measurements 𝒀=𝑯𝑿+𝑾\boldsymbol{Y}=\boldsymbol{H}\boldsymbol{X}+\boldsymbol{W} with

𝑯𝑯¯+𝑯,\boldsymbol{H}\triangleq\bar{\boldsymbol{H}}+\boldsymbol{H}^{\prime}, (67)

where 𝑯¯\bar{\boldsymbol{H}} is a known matrix and 𝑯\boldsymbol{H}^{\prime} is an unknown disturbance matrix. Compared to the case of DL, the difference lies in the prior of 𝑯\boldsymbol{H}. We assume that the entries of 𝑯\boldsymbol{H}^{\prime} are i.i.d. Gaussian variables, i.e., p(hm,n)=𝒩(hm,n;0,ν)p(h^{\prime}_{m,n})=\mathcal{N}(h^{\prime}_{m,n};0,\nu), so the prior of hm,nh_{m,n} can be represented as p(hm,n)=𝒩(hm,n;h¯m,n,ν)p(h_{m,n})=\mathcal{N}(h_{m,n};\bar{h}_{m,n},\nu). For the entries of 𝑿\boldsymbol{X}, we still use the hierarchical Gaussian-Gamma prior. It is worth noting that, if the sparsity of 𝑿\boldsymbol{X} exhibits some pattern, e.g., the columns of 𝑿\boldsymbol{X} share a common support, we can impose that the entries of each row of 𝑿\boldsymbol{X} share a common precision in their Gaussian priors to capture the common support.
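
A minimal sketch of generating a CSMU instance of the form (67), consistent with the correlated setting used in the numerical example below, is given here. The sparsity rate of X, the scaling of the perturbation, and the SNR convention are illustrative assumptions.

import numpy as np

def exp_corr(n, rho):
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def generate_csmu_data(M=100, N=100, L=200, rho=0.1, delta=0.2, snr_db=50, seed=0):
    rng = np.random.default_rng(seed)
    # Known matrix H_bar and unknown perturbation H', both correlated as in (64)
    H_bar = exp_corr(M, rho) @ rng.standard_normal((M, N)) @ exp_corr(N, rho)
    H_prime = exp_corr(M, rho) @ rng.standard_normal((M, N)) @ exp_corr(N, rho)
    H = H_bar + H_prime
    # Sparse X: a fraction delta of the entries are non-zero i.i.d. Gaussian
    X = rng.standard_normal((N, L)) * (rng.random((N, L)) < delta)
    Z = H @ X
    noise_var = np.mean(Z**2) / 10**(snr_db / 10)   # assumed SNR convention
    Y = Z + np.sqrt(noise_var) * rng.standard_normal((M, L))
    return Y, H_bar, X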

Figure 11: NMSE of UAMP-MF and BAd-VAMP for CSMU, where SNR = 50 dB and ρ=0.1\rho=0.1.
Figure 12: NMSE of UAMP-MF and BAd-VAMP for CSMU, where SNR = 50 dB and ρ=0.3\rho=0.3.
Figure 13: Runtime comparison for CSMU, (a) ρ=0.1\rho=0.1 and (b) ρ=0.3\rho=0.3, where M=N=100M=N=100 and SNR = 50 dB.

In the numerical example, to test the robustness of the algorithms, we generate correlated matrices 𝑯¯\bar{\boldsymbol{H}} and 𝑯\boldsymbol{H}^{\prime} using (64). The sparse matrix 𝑿\boldsymbol{X} is generated in the same way as before. It is noted that, different from DL, the permutation and scale ambiguity does not exist thanks to the known matrix 𝑯¯\bar{\boldsymbol{H}}. The performance is evaluated using the NMSE of 𝑿\boldsymbol{X}

NMSE(X)=𝑿^𝑿2/𝑿2.\displaystyle{\text{NMSE(X)}={\|\hat{\boldsymbol{X}}-\boldsymbol{X}\|^{2}}/{\|\boldsymbol{X}\|^{2}}.} (68)

The NMSE performance of UAMP-MF and BAd-VAMP is shown in Figs. 11 and 12 for various combinations of M=N{50,,200}M=N\in\{50,...,200\} and L{100,,600}L\in\{100,...,600\}. The parameter ρ=0.1\rho=0.1 and 0.30.3 in Figs. 11 and 12, respectively, and SNR = 50 dB. It can be seen that UAMP-MF outperforms BAd-VAMP significantly when the correlation coefficient ρ=0.3\rho=0.3. The runtime comparison is shown in Fig. 13, which indicates that UAMP-MF is much faster.

IV-D UAMP-MF for Non-Negative Matrix Factorization

The NMF problem [1] finds various applications [30], [31]. In NMF, from the observation

𝒀=𝒁+𝑾\displaystyle\boldsymbol{Y}=\boldsymbol{Z}+\boldsymbol{W} (69)

where 𝒁+M×L\boldsymbol{Z}\in\mathcal{R}_{+}^{M\times L}, we aim to find a factorization of 𝒁\boldsymbol{Z} as the product of two non-negative matrices 𝑯+M×N\boldsymbol{H}\in\mathcal{R}_{+}^{M\times N} and 𝑿+N×L\boldsymbol{X}\in\mathcal{R}_{+}^{N\times L}, i.e., 𝒁𝑯𝑿\boldsymbol{Z}\approx\boldsymbol{H}\boldsymbol{X}. To achieve NMF with UAMP-MF, we set the priors p(𝑿)p(\boldsymbol{X}) and p(𝑯)p(\boldsymbol{H}) to be i.i.d. non-negative Gaussian, i.e.,

p(xn,l)=N+(xn,l;θ,ϕ)={𝒩(x;θ,ϕ)Φc(θ/ϕ)x00x<0\displaystyle p(x_{n,l})=N_{+}(x_{n,l};\theta,\phi)=\begin{cases}\frac{\mathcal{N}(x;\theta,\phi)}{\Phi_{c}(-\theta/\sqrt{\phi})}&x\geq 0\\ 0&x<0\end{cases} (70)
p(hm,n)=N+(hm,n;θ,ϕ)={𝒩(h;θ,ϕ)Φc(θ/ϕ)h00h<0,\displaystyle p(h_{m,n})=N_{+}(h_{m,n};\theta,\phi)=\begin{cases}\frac{\mathcal{N}(h;\theta,\phi)}{\Phi_{c}(-\theta/\sqrt{\phi})}&h\geq 0\\ 0&h<0\end{cases}, (71)

where Φc\Phi_{c} is the complementary cumulative distribution function of the standard Gaussian distribution.
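
The posterior computations under the non-negative priors (70) and (71) reduce to the first two moments of a Gaussian truncated to [0, ∞). A minimal sketch of these standard truncated-Gaussian moments is given below; the function name is illustrative, and theta and phi play the roles of the Gaussian parameters appearing in (70)-(71).

import numpy as np
from scipy.stats import norm

def nonneg_gaussian_moments(theta, phi):
    # Mean and variance of N(theta, phi) restricted to [0, +inf), i.e., of
    # N_+(x; theta, phi) in (70)-(71); norm.sf is the complementary CDF Phi_c
    alpha = -theta / np.sqrt(phi)
    r = norm.pdf(alpha) / norm.sf(alpha)
    mean = theta + np.sqrt(phi) * r
    var = phi * (1.0 + alpha * r - r**2)
    return mean, var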

Figure 14: NMSE of the algorithms versus SNR for NMF.
Figure 15: Runtime of the algorithms for NMF.
Figure 16: NMSE performance comparison for sparse MF.

We compare UAMP-MF with the alternating non-negative least squares projected gradient algorithm (AlsPGrad) [8] and the multiplicative update algorithm (Mult) [7]. In the numerical examples, we set M=L=200M=L=200 and N=100N=100. The metric used to evaluate the performance is defined as

NMSE(Z)=𝒁𝑯^𝑿^2𝒁2.\displaystyle\text{NMSE(Z)}=\frac{\|\boldsymbol{Z}-\hat{\boldsymbol{H}}\hat{\boldsymbol{X}}\|^{2}}{\|\boldsymbol{Z}\|^{2}}. (72)

For Mult, we use the built-in function ‘nnmf’ in Matlab; the termination tolerance is set to 101010^{-10}, and the maximum number of iterations is set to 2000, 5000, 10000 and 20000 for comparison.
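
For reference, the Mult baseline corresponds to the classical Lee-Seung multiplicative updates for the Frobenius-norm objective; a minimal sketch (not MATLAB's ‘nnmf’ implementation itself, and with an illustrative initialization) is given below.

import numpy as np

def mult_nmf(Y, N, n_iter=2000, eps=1e-12, seed=0):
    # Multiplicative updates for min ||Y - H X||_F^2 subject to H, X >= 0
    rng = np.random.default_rng(seed)
    Yc = np.maximum(Y, 0.0)   # the update assumes a non-negative data matrix
    M, L = Yc.shape
    H = rng.random((M, N))
    X = rng.random((N, L))
    for _ in range(n_iter):
        X *= (H.T @ Yc) / (H.T @ H @ X + eps)
        H *= (Yc @ X.T) / (H @ X @ X.T + eps)
    return H, X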

The NMSE results are shown in Fig. 14 and the runtime is shown in Fig. 15. It can be seen that, as the maximum number of iterations increases, the performance of Mult improves considerably, but at the cost of a significant increase in runtime. AlsPGrad delivers performance similar to that of Mult with 20000 iterations. UAMP-MF delivers the best performance with significantly smaller runtime than Mult (except Mult-2000) and AlsPGrad.

Figure 17: Performance comparison for sparse NMF.
Figure 18: Runtime comparison (a) sparse MF; (b) sparse NMF.

IV-E UAMP-MF for Sparse Matrix Factorization

We consider the use of UAMP-MF to solve the sparse MF problem [32]. From the observation

𝒀=𝒁+𝑾,\displaystyle\boldsymbol{Y}=\boldsymbol{Z}+\boldsymbol{W}, (73)

where 𝒁M×L\boldsymbol{Z}\in\mathcal{R}^{M\times L}, we aim to find a factorization of 𝒁\boldsymbol{Z} as the product of two sparse matrices 𝑯M×N\boldsymbol{H}\in\mathcal{R}^{M\times N} and 𝑿N×L\boldsymbol{X}\in\mathcal{R}^{N\times L}, i.e., 𝒁𝑯𝑿\boldsymbol{Z}\approx\boldsymbol{H}\boldsymbol{X}.

To find the sparse matrices 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} with UAMP-MF, we use the sparsity promoting hierarchical Gaussian-Gamma prior for entries in both matrices, i.e.,

p(𝑯)\displaystyle p(\boldsymbol{H})\!\!\! =\displaystyle= m,np(hm,n)=m,nN(hm,n;0,γhm,n1)\displaystyle\!\!\!\prod_{m,n}p(h_{m,n})=\prod_{m,n}N(h_{m,n};0,\gamma_{h_{m,n}}^{-1}) (74)
p(𝑿)\displaystyle p(\boldsymbol{X})\!\!\! =\displaystyle= n,lp(xn,l)=n,lN(xn,l;0,γxn,l1)\displaystyle\!\!\!\prod_{n,l}p(x_{n,l})=\prod_{n,l}N(x_{n,l};0,\gamma_{x_{n,l}}^{-1}) (75)
p(γhm,n)\displaystyle p(\gamma_{h_{m,n}})\!\!\! =\displaystyle= Ga(γhm,n;ϵ,η)\displaystyle\!\!\!Ga(\gamma_{h_{m,n}};\epsilon,\eta) (76)
p(γxn,l)\displaystyle p(\gamma_{x_{n,l}})\!\!\! =\displaystyle= Ga(γxn,l;ϵ,η).\displaystyle\!\!\!Ga(\gamma_{x_{n,l}};\epsilon,\eta). (77)

The computations of the a posteriori means and variances of the entries of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} are the same as (54)-(56). We compare the performance of UAMP-MF and BAd-VAMP, and the results are shown in Fig. 16 for various combinations of N{50,,200}N\in\{50,...,200\} and L{100,,800}L\in\{100,...,800\}, where M=NM=N, SNR = 50 dB, and the sparsity rate δ\delta of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} is set to 0.20.2. It can be seen that the performance of UAMP-MF is significantly better than that of BAd-VAMP, as no priors can be imposed on 𝑯\boldsymbol{H} in BAd-VAMP.
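
For the Gaussian-Gamma priors (74)-(77), the mean-field update of each precision has a simple closed form: q(γ) is a Gamma distribution with shape ε + 1/2 and rate η + (x̂² + σ²)/2, whose mean serves as the plug-in precision estimate. A minimal sketch of this standard update is given below; it is assumed to be consistent with, and complementary to, the posterior mean and variance computations referenced above, and the default values of ε and η are illustrative.

import numpy as np

def update_precisions(x_hat, x_var, eps=1e-3, eta=1e-3):
    # Mean-field update for the Gaussian-Gamma prior (74)-(77):
    # q(gamma) = Ga(eps + 1/2, eta + (x_hat^2 + x_var)/2), and its mean
    # (2*eps + 1) / (2*eta + x_hat^2 + x_var) is the precision estimate.
    return (2.0 * eps + 1.0) / (2.0 * eta + x_hat**2 + x_var)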

We can also achieve sparse NMF with UAMP-MF, where 𝒁+M×L\boldsymbol{Z}\in\mathcal{R}_{+}^{M\times L} and a sparse non-negative factorization 𝒁𝑯𝑿\boldsymbol{Z}\approx\boldsymbol{H}\boldsymbol{X} is to be found. In this case, we can employ a non-negative Bernoulli-Gaussian distribution for the entries in both 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X}. We compare UAMP-MF with Mult and AlsPGrad, and the numerical results are shown in Fig. 17, where the sparsity rate δ=0.1\delta=0.1 for both 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X}, M=200M=200, N=100N=100 and L=200L=200. We can see that the performance of UAMP-MF is much better than that of the other algorithms, as they lack a sparsity-promoting mechanism.

The runtime comparisons for sparse MF and sparse NMF are shown in Fig. 18. It can be seen that UAMP-MF is much faster than BAd-VAMP. Compared to AlsPGrad and Mult with 2000 iterations, UAMP-MF has similar runtime but delivers much better performance.

V Conclusions

In this paper, leveraging variational inference and UAMP, we designed a message passing algorithm, UAMP-MF, for matrix factorization, where UAMP is incorporated into variational inference through a whitening process. By imposing proper priors, UAMP-MF can be used to solve various problems, such as RPCA, DL, CSMU, NMF, sparse MF and sparse NMF. Extensive numerical results demonstrate the superiority of UAMP-MF over state-of-the-art algorithms in terms of recovery accuracy, robustness and computational complexity.

Appendix A Proof of Proposition 1

According to VI, the message from fYf_{Y} to 𝑿\boldsymbol{X} can be expressed as

mfYX(𝑿)\displaystyle\!\!\!\!\!\!\!\!\!\!m_{f_{Y}\to X}(\boldsymbol{X})
exp(𝑯,λq(𝑯)q(λ)logfY)\displaystyle\!\!\!\!\!\!\!\!\!\!\propto\exp\left(\int_{\boldsymbol{H},\lambda}q(\boldsymbol{H})q(\lambda)\log f_{Y}\right)
exp(λ^𝑯Tr((𝒀𝑯𝑿)T(𝒀𝑯𝑿))q(𝑯))\displaystyle\!\!\!\!\!\!\!\!\!\!\propto\exp\Big{(}-\hat{\lambda}\int_{\boldsymbol{H}}\text{Tr}\Big{(}(\boldsymbol{Y}-\boldsymbol{H}\boldsymbol{X})^{\textrm{T}}(\boldsymbol{Y}-\boldsymbol{H}\boldsymbol{X})\Big{)}q(\boldsymbol{H})\Big{)}
exp(λ^𝑯(𝒚Vec(𝑯𝑿))T(𝒚Vec(𝑯𝑿))q(𝑯))\displaystyle\!\!\!\!\!\!\!\!\!\!\propto\exp\left(-\hat{\lambda}\int_{\boldsymbol{H}}(\boldsymbol{y}-\textrm{Vec}(\boldsymbol{H}\boldsymbol{X}))^{\textrm{T}}(\boldsymbol{y}-\textrm{Vec}(\boldsymbol{H}\boldsymbol{X}))q(\boldsymbol{H})\right)
=exp(λ^𝒉(𝒚𝑿~𝒉)T(𝒚𝑿~𝒉)q(𝒉))\displaystyle\!\!\!\!\!\!\!\!\!\!=\exp\left(-\hat{\lambda}\int_{\boldsymbol{h}}(\boldsymbol{y}-\tilde{\boldsymbol{X}}\boldsymbol{h})^{\textrm{T}}(\boldsymbol{y}-\tilde{\boldsymbol{X}}\boldsymbol{h})q(\boldsymbol{h})\right) (78)
=exp(λ^𝒉(𝒚T𝒚+𝒉T𝑿~T𝑿~𝒉(i)\displaystyle\!\!\!\!\!\!\!\!\!\!=\exp\Big{(}-\hat{\lambda}\int_{\boldsymbol{h}}\big{(}\boldsymbol{y}^{\textrm{T}}\boldsymbol{y}+\underbrace{\boldsymbol{h}^{\textrm{T}}\tilde{\boldsymbol{X}}^{\textrm{T}}\tilde{\boldsymbol{X}}\boldsymbol{h}}_{(i)}
𝒉T𝑿~T𝒚(ii)𝒚T𝑿~𝒉(iii))q(𝒉)),\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ -\underbrace{\boldsymbol{h}^{\textrm{T}}\tilde{\boldsymbol{X}}^{\textrm{T}}\boldsymbol{y}}_{(ii)}-\underbrace{\boldsymbol{y}^{\textrm{T}}\tilde{\boldsymbol{X}}\boldsymbol{h}}_{(iii)}\big{)}q(\boldsymbol{h})\Big{)}, (79)

where 𝒉=Vec(𝑯)\boldsymbol{h}=\text{Vec}(\boldsymbol{H}), 𝒚=Vec(𝒀)\boldsymbol{y}=\text{Vec}(\boldsymbol{Y}) and 𝑿~=𝑿T𝑰M\tilde{\boldsymbol{X}}=\boldsymbol{X}^{\textrm{T}}\otimes\boldsymbol{I}_{M}. In the derivation of (78), we use the matrix identity Vec(𝑯𝑿)=Vec(𝑰M𝑯𝑿)=(𝑿T𝑰M)𝒉\text{Vec}(\boldsymbol{H}\boldsymbol{X})=\text{Vec}(\boldsymbol{I}_{M}\boldsymbol{H}\boldsymbol{X})=\left(\boldsymbol{X}^{\textrm{T}}\otimes\boldsymbol{I}_{M}\right)\boldsymbol{h} [33].
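
This identity can be checked numerically; the short NumPy snippet below verifies Vec(HX) = (X^T ⊗ I_M)Vec(H) for randomly chosen dimensions, with Vec denoting column-wise stacking (order='F' in NumPy).

import numpy as np

rng = np.random.default_rng(0)
M, N, L = 4, 3, 5
H = rng.standard_normal((M, N))
X = rng.standard_normal((N, L))
lhs = (H @ X).flatten(order='F')                      # Vec(H X)
rhs = np.kron(X.T, np.eye(M)) @ H.flatten(order='F')  # (X^T kron I_M) Vec(H)
assert np.allclose(lhs, rhs)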

Next, we work out the integration of the three terms (i)(i), (ii)(ii) and (iii)(iii) in (79). For term (i)(i), we have

𝒉𝒉T𝑿~T𝑿~𝒉q(𝒉)\displaystyle\!\!\!\!\!\!\!\!\!\int_{\boldsymbol{h}}\boldsymbol{h}^{\textrm{T}}\tilde{\boldsymbol{X}}^{\textrm{T}}\tilde{\boldsymbol{X}}\boldsymbol{h}q(\boldsymbol{h})
=𝒉Tr(𝑿~T𝑿~𝒉𝒉T)q(𝒉)\displaystyle\!\!\!\!\!\!\!\!\!=\int_{\boldsymbol{h}}\text{Tr}\left(\tilde{\boldsymbol{X}}^{\textrm{T}}\tilde{\boldsymbol{X}}\boldsymbol{h}\boldsymbol{h}^{\textrm{T}}\right)q(\boldsymbol{h})
=Tr(𝑿~T𝑿~(𝒉^𝒉^T+𝑽H𝑼H))\displaystyle\!\!\!\!\!\!\!\!\!=\text{Tr}\left(\tilde{\boldsymbol{X}}^{\textrm{T}}\tilde{\boldsymbol{X}}(\hat{\boldsymbol{h}}\hat{\boldsymbol{h}}^{\textrm{T}}+\boldsymbol{V}_{H}\otimes\boldsymbol{U}_{H})\right)
=Tr(𝒉^T𝑿~T𝑿~𝒉^)+Tr(𝑿~(𝑽H𝑼H)𝑿~T)\displaystyle\!\!\!\!\!\!\!\!\!=\text{Tr}(\hat{\boldsymbol{h}}^{\textrm{T}}\tilde{\boldsymbol{X}}^{\textrm{T}}\tilde{\boldsymbol{X}}\hat{\boldsymbol{h}})+\text{Tr}(\tilde{\boldsymbol{X}}(\boldsymbol{V}_{H}\otimes\boldsymbol{U}_{H})\tilde{\boldsymbol{X}}^{\textrm{T}})
=Tr(Vec(𝑯^)T(𝑿T𝑰M)T(𝑿T𝑰M)Vec(𝑯^))\displaystyle\!\!\!\!\!\!\!\!\!=\text{Tr}(\textrm{Vec}(\hat{\boldsymbol{H}})^{\textrm{T}}(\boldsymbol{X}^{\textrm{T}}\otimes\boldsymbol{I}_{M})^{\textrm{T}}(\boldsymbol{X}^{\textrm{T}}\otimes\boldsymbol{I}_{M})\textrm{Vec}(\hat{\boldsymbol{H}}))
+Tr((𝑿T𝑰M)(𝑽H𝑼H)(𝑿T𝑰M))\displaystyle\ \ \ \ \ \ \ +\text{Tr}((\boldsymbol{X}^{\textrm{T}}\otimes\boldsymbol{I}_{M})(\boldsymbol{V}_{H}\otimes\boldsymbol{U}_{H})(\boldsymbol{X}^{\textrm{T}}\otimes\boldsymbol{I}_{M}))
=Tr(Vec(𝑯^𝑿)TVec(𝑯^𝑿))\displaystyle\!\!\!\!\!\!\!\!\!=\text{Tr}\big{(}\textrm{Vec}(\hat{\boldsymbol{H}}\boldsymbol{X})^{\textrm{T}}\textrm{Vec}(\hat{\boldsymbol{H}}\boldsymbol{X})\big{)}
+Tr((𝑿𝑰M)(𝑿T𝑰M)(𝑽H𝑼H))\displaystyle\ \ \ \ \ \ \ \ +\text{Tr}\big{(}(\boldsymbol{X}\otimes\boldsymbol{I}_{M})(\boldsymbol{X}^{\textrm{T}}\otimes\boldsymbol{I}_{M})(\boldsymbol{V}_{H}\otimes\boldsymbol{U}_{H})\big{)} (80)
=Tr(𝑿T𝑯^T𝑯^𝑿)\displaystyle\!\!\!\!\!\!\!\!\!=\text{Tr}(\boldsymbol{X}^{\textrm{T}}\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}}\boldsymbol{X})
+Tr(((𝑿𝑿T)𝑰M)(𝑽H𝑼H))\displaystyle\ \ \ \ \ \ \ \ +\text{Tr}(((\boldsymbol{X}\boldsymbol{X}^{\textrm{T}})\otimes\boldsymbol{I}_{M})(\boldsymbol{V}_{H}\otimes\boldsymbol{U}_{H})) (81)
=Tr(𝑿T𝑯^T𝑯^𝑿)+Tr(𝑼H)Tr(𝑿𝑿T𝑽H)\displaystyle\!\!\!\!\!\!\!\!\!=\text{Tr}(\boldsymbol{X}^{\textrm{T}}\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}}\boldsymbol{X})+\text{Tr}(\boldsymbol{U}_{H})\text{Tr}(\boldsymbol{X}\boldsymbol{X}^{\textrm{T}}\boldsymbol{V}_{H}) (82)
=Tr(𝑿T(𝑯^T𝑯^+Tr(𝑼H)𝑽H)𝑿),\displaystyle\!\!\!\!\!\!\!\!\!=\text{Tr}(\boldsymbol{X}^{\textrm{T}}(\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}}+\text{Tr}(\boldsymbol{U}_{H})\boldsymbol{V}_{H})\boldsymbol{X}), (83)

where we use the matrix identity Tr(𝑨𝑩𝑪)=Tr(𝑪𝑨𝑩)\text{Tr}\left(\boldsymbol{A}\boldsymbol{B}\boldsymbol{C}\right)=\text{Tr}\left(\boldsymbol{C}\boldsymbol{A}\boldsymbol{B}\right) in deriving (80), (𝑨𝑩)(𝑪𝑫)=(𝑨𝑪)(𝑩𝑫)\left(\boldsymbol{A}\otimes\boldsymbol{B}\right)\left(\boldsymbol{C}\otimes\boldsymbol{D}\right)=(\boldsymbol{A}\boldsymbol{C})\otimes(\boldsymbol{B}\boldsymbol{D}) and (𝑨𝑩)T=𝑨T𝑩T\left(\boldsymbol{A}\otimes\boldsymbol{B}\right)^{\textrm{T}}=\boldsymbol{A}^{\textrm{T}}\otimes\boldsymbol{B}^{\textrm{T}} in deriving (81), and Tr(𝑨𝑩)=Tr(𝑨)Tr(𝑩)\text{Tr}(\boldsymbol{A}\otimes\boldsymbol{B})=\text{Tr}(\boldsymbol{A})\text{Tr}(\boldsymbol{B}) in deriving (82).

Regarding term (ii)(ii), we have

𝒉𝒉T𝑿~T𝒚q(𝒉)\displaystyle\int_{\boldsymbol{h}}\boldsymbol{h}^{\textrm{T}}\tilde{\boldsymbol{X}}^{\textrm{T}}\boldsymbol{y}q(\boldsymbol{h})
=𝑯Vec(𝑯)T(𝑿T𝑰M)TVec(𝒀)q(𝑯)\displaystyle=\int_{\boldsymbol{H}}\text{Vec}(\boldsymbol{H})^{\textrm{T}}(\boldsymbol{X}^{\textrm{T}}\otimes\boldsymbol{I}_{M})^{\textrm{T}}\text{Vec}(\boldsymbol{Y})q(\boldsymbol{H})
=𝑯Tr(𝑿T𝑯T𝒀)q(𝑯)\displaystyle=\int_{\boldsymbol{H}}\text{Tr}(\boldsymbol{X}^{\textrm{T}}\boldsymbol{H}^{\textrm{T}}\boldsymbol{Y})q(\boldsymbol{H})
=Tr(𝑿T𝑯^T𝒀).\displaystyle=\text{Tr}\big{(}\boldsymbol{X}^{\textrm{T}}\hat{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{Y}\big{)}. (84)

Similarly, term (iii)(iii) can be expressed as

𝒉𝒚T𝑿~𝒉q(𝒉)=Tr(𝒀T𝑯^𝑿).\displaystyle\int_{\boldsymbol{h}}\boldsymbol{y}^{\textrm{T}}\tilde{\boldsymbol{X}}\boldsymbol{h}q(\boldsymbol{h})=\text{Tr}(\boldsymbol{Y}^{\textrm{T}}\hat{\boldsymbol{H}}\boldsymbol{X}). (85)

Based on the above results, the message

mfYX(𝑿)exp(λ^Tr(𝑿T(𝑯^T𝑯^+Tr(𝑼H)𝑽H)𝑿\displaystyle\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!m_{f_{Y}\to X}(\boldsymbol{X})\propto\exp\Big{(}-\hat{\lambda}\text{Tr}\big{(}\boldsymbol{X}^{\textrm{T}}(\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}}+\text{Tr}(\boldsymbol{U}_{H})\boldsymbol{V}_{H})\boldsymbol{X}
𝑿T𝑯^T𝒀𝒀T𝑯^𝑿+𝒀T𝒀)).\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ -\boldsymbol{X}^{\textrm{T}}\hat{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{Y}-\boldsymbol{Y}^{\textrm{T}}\hat{\boldsymbol{H}}\boldsymbol{X}+\boldsymbol{Y}^{\textrm{T}}\boldsymbol{Y}\big{)}\Big{)}. (86)

Comparing the result against the matrix normal distribution (9), we have the result shown in (17).

Appendix B Proof of Proposition 2

According to VI, the message mfYH(𝑯)m_{f_{Y}\to H}(\boldsymbol{H}) is computed as

mfYH(𝑯)\displaystyle m_{f_{Y}\to H}(\boldsymbol{H})
exp(𝑿,λq(𝑿)q(λ)logfY)\displaystyle\propto\exp\left(\int_{\boldsymbol{X},\lambda}q(\boldsymbol{X})q(\lambda)\log f_{Y}\right)
exp(λ^𝑿Tr((𝒀𝑯𝑿)T(𝒀𝑯𝑿))b(𝑿))\displaystyle\propto\exp\left(-\hat{\lambda}\int_{\boldsymbol{X}}\text{Tr}((\boldsymbol{Y}-\boldsymbol{H}\boldsymbol{X})^{\textrm{T}}(\boldsymbol{Y}-\boldsymbol{H}\boldsymbol{X}))b(\boldsymbol{X})\right)
=exp(λ^𝒙(𝒚𝑯~𝒙)T(𝒚𝑯~𝒙)b(𝒙))\displaystyle=\exp\left(-\hat{\lambda}\int_{\boldsymbol{x}}(\boldsymbol{y}-\tilde{\boldsymbol{H}}\boldsymbol{x})^{\textrm{T}}(\boldsymbol{y}-\tilde{\boldsymbol{H}}\boldsymbol{x})b(\boldsymbol{x})\right)
=exp(λ^𝒙(𝒚T𝒚+𝒙T𝑯~T𝑯~𝒙\displaystyle=\exp\Big{(}-\hat{\lambda}\int_{\boldsymbol{x}}\big{(}\boldsymbol{y}^{\textrm{T}}\boldsymbol{y}+\boldsymbol{x}^{\textrm{T}}\tilde{\boldsymbol{H}}^{\textrm{T}}\tilde{\boldsymbol{H}}\boldsymbol{x}
𝒙T𝑯~T𝒚𝒚T𝑯~𝒙)b(𝒙)),\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ -\boldsymbol{x}^{\textrm{T}}\tilde{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{y}-\boldsymbol{y}^{\textrm{T}}\tilde{\boldsymbol{H}}\boldsymbol{x}\big{)}b(\boldsymbol{x})\Big{)}, (87)

where 𝑯~𝑰L𝑯\tilde{\boldsymbol{H}}\triangleq\boldsymbol{I}_{L}\otimes\boldsymbol{H}. Similar to the derivation of (83), the integration of the terms in (87) can be expressed as

𝒙𝒙T𝑯~T𝒚b(𝒙)=Tr(𝑿^T𝑯T𝒀)\displaystyle\int_{\boldsymbol{x}}\boldsymbol{x}^{\textrm{T}}\tilde{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{y}b(\boldsymbol{x})=\text{Tr}\big{(}\hat{\boldsymbol{X}}^{\textrm{T}}\boldsymbol{H}^{\textrm{T}}\boldsymbol{Y}\big{)} (88)

and

𝒙𝒙T𝑯~T𝑯~𝒙b(𝒙)\displaystyle\int_{\boldsymbol{x}}\boldsymbol{x}^{\textrm{T}}\tilde{\boldsymbol{H}}^{\textrm{T}}\tilde{\boldsymbol{H}}\boldsymbol{x}b(\boldsymbol{x})
=𝒙Tr(𝑯~T𝑯~𝒙𝒙T)b(𝒙)\displaystyle=\int_{\boldsymbol{x}}\text{Tr}\left(\tilde{\boldsymbol{H}}^{\textrm{T}}\tilde{\boldsymbol{H}}\boldsymbol{x}\boldsymbol{x}^{\textrm{T}}\right)b(\boldsymbol{x})
=Tr(𝑯~T𝑯~(𝒙^𝒙^T+𝑽X𝑼X))\displaystyle=\text{Tr}\big{(}\tilde{\boldsymbol{H}}^{\textrm{T}}\tilde{\boldsymbol{H}}(\hat{\boldsymbol{x}}\hat{\boldsymbol{x}}^{\textrm{T}}+\boldsymbol{V}_{X}\otimes\boldsymbol{U}_{X})\big{)}
=Tr(𝑿^T𝑯T𝑯𝑿^)+Tr(𝑽X(𝑯T𝑯𝑼X))\displaystyle=\text{Tr}\big{(}\hat{\boldsymbol{X}}^{\textrm{T}}\boldsymbol{H}^{\textrm{T}}\boldsymbol{H}\hat{\boldsymbol{X}}\big{)}+\text{Tr}\big{(}\boldsymbol{V}_{X}\otimes(\boldsymbol{H}^{\textrm{T}}\boldsymbol{H}\boldsymbol{U}_{X})\big{)}
=Tr(𝑿^T𝑯T𝑯𝑿^)+Tr(𝑽X)Tr(𝑯𝑼X𝑯T)\displaystyle=\text{Tr}\big{(}\hat{\boldsymbol{X}}^{\textrm{T}}\boldsymbol{H}^{\textrm{T}}\boldsymbol{H}\hat{\boldsymbol{X}}\big{)}+\text{Tr}(\boldsymbol{V}_{X})\text{Tr}\big{(}\boldsymbol{H}\boldsymbol{U}_{X}\boldsymbol{H}^{\textrm{T}}\big{)}
=Tr(𝑯(𝑿^𝑿^T+Tr(𝑽X)𝑼X)𝑯T).\displaystyle=\text{Tr}\big{(}\boldsymbol{H}(\hat{\boldsymbol{X}}\hat{\boldsymbol{X}}^{\textrm{T}}+\text{Tr}(\boldsymbol{V}_{X})\boldsymbol{U}_{X})\boldsymbol{H}^{\textrm{T}}\big{)}. (89)

The message from fYf_{Y} to 𝑯\boldsymbol{H} can be represented as

mfYH(𝑯)\displaystyle m_{f_{Y}\to H}(\boldsymbol{H})
exp(λ^Tr(𝒀𝒀T+𝑯(𝑿^𝑿^T+Tr(𝑽X)𝑼X)𝑯T\displaystyle\propto\exp\Big{(}-\hat{\lambda}\text{Tr}(\boldsymbol{Y}\boldsymbol{Y}^{\textrm{T}}+\boldsymbol{H}(\hat{\boldsymbol{X}}\hat{\boldsymbol{X}}^{\textrm{T}}+\text{Tr}(\boldsymbol{V}_{X})\boldsymbol{U}_{X})\boldsymbol{H}^{\textrm{T}}
𝑯𝑿^𝒀T𝒀𝑿^T𝑯T)).\displaystyle\ \ \ \ \ \ \ \ -\boldsymbol{H}\hat{\boldsymbol{X}}\boldsymbol{Y}^{\textrm{T}}-\boldsymbol{Y}\hat{\boldsymbol{X}}^{\textrm{T}}\boldsymbol{H}^{\textrm{T}})\Big{)}. (90)

Comparing the above against the matrix normal distribution, we obtain the results shown in (32)-(34).

Appendix C Proof of Proposition 3

According to VI, the message

mfYλ(λ)\displaystyle m_{f_{Y}\to\lambda}(\lambda)
=det(λ1𝑰M𝑰L)exp(λ(Vec(𝒀)Vec(𝑯𝑿))T\displaystyle=\text{det}(\lambda^{-1}\boldsymbol{I}_{M}\otimes\boldsymbol{I}_{L})\exp\Big{(}-\lambda\big{(}\text{Vec}(\boldsymbol{Y})-\textrm{Vec}(\boldsymbol{H}\boldsymbol{X})\big{)}^{\textrm{T}}
(Vec(𝒀)Vec(𝑯𝑿)))\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \big{(}\textrm{Vec}(\boldsymbol{Y})-\textrm{Vec}(\boldsymbol{H}\boldsymbol{X})\big{)}\Big{)}
=λMLexp(λ𝑯,𝑿Tr((𝒀𝑯𝑿)T\displaystyle=\lambda^{ML}\exp\Big{(}-\lambda\int_{\boldsymbol{H},\boldsymbol{X}}\text{Tr}((\boldsymbol{Y}-\boldsymbol{H}\boldsymbol{X})^{\textrm{T}}
(𝒀𝑯𝑿))b(𝑯)b(𝑿))\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (\boldsymbol{Y}-\boldsymbol{H}\boldsymbol{X}))b(\boldsymbol{H})b(\boldsymbol{X})\Big{)}
=λMLexp(λC),\displaystyle=\lambda^{ML}\exp\Big{(}-\lambda C\Big{)}, (91)

where

C=𝑯,𝑿Tr((𝒀𝑯𝑿)T(𝒀𝑯𝑿))q(𝑯)q(𝑿)\displaystyle\!\!\!\!\!\!\!\!\!C=\int_{\boldsymbol{H},\boldsymbol{X}}\text{Tr}((\boldsymbol{Y}-\boldsymbol{H}\boldsymbol{X})^{\textrm{T}}(\boldsymbol{Y}-\boldsymbol{H}\boldsymbol{X}))q(\boldsymbol{H})q(\boldsymbol{X})
=𝒙,𝑯(𝒚T𝒚+𝒙T𝑯~T𝑯~𝒙𝒙T𝑯~T𝒚𝒚T𝑯~𝒙)q(𝒙)q(𝑯)\displaystyle\!\!\!\!\!\!\!\!\!\!=\!\int_{\boldsymbol{x},\boldsymbol{H}}(\boldsymbol{y}^{\textrm{T}}\boldsymbol{y}+\boldsymbol{x}^{\textrm{T}}\tilde{\boldsymbol{H}}^{\textrm{T}}\tilde{\boldsymbol{H}}\boldsymbol{x}-\boldsymbol{x}^{\textrm{T}}\tilde{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{y}-\boldsymbol{y}^{\textrm{T}}\tilde{\boldsymbol{H}}\boldsymbol{x})q(\boldsymbol{x})q(\boldsymbol{H})
=𝑯Tr(𝒀𝒀T+𝑯(𝑿^𝑿^T+Tr(𝑽X)𝑼X)𝑯T\displaystyle\!\!\!\!\!\!\!\!\!\!=\!\int_{\boldsymbol{H}}\text{Tr}(\boldsymbol{Y}\boldsymbol{Y}^{\textrm{T}}+\boldsymbol{H}(\hat{\boldsymbol{X}}\hat{\boldsymbol{X}}^{\textrm{T}}+\text{Tr}(\boldsymbol{V}_{X})\boldsymbol{U}_{X})\boldsymbol{H}^{\textrm{T}}
𝑯𝑿^𝒀T𝒀𝑿^T𝑯T)q(𝑯)\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ -\boldsymbol{H}\hat{\boldsymbol{X}}\boldsymbol{Y}^{\textrm{T}}-\boldsymbol{Y}\hat{\boldsymbol{X}}^{\textrm{T}}\boldsymbol{H}^{\textrm{T}})q(\boldsymbol{H})
=Tr(𝒀𝒀T𝑯^𝑿^𝒀T𝒀𝑿^T𝑯^T)\displaystyle\!\!\!\!\!\!\!\!\!\!=\text{Tr}(\boldsymbol{Y}\boldsymbol{Y}^{\textrm{T}}-\hat{\boldsymbol{H}}\hat{\boldsymbol{X}}\boldsymbol{Y}^{\textrm{T}}-\boldsymbol{Y}\hat{\boldsymbol{X}}^{\textrm{T}}\hat{\boldsymbol{H}}^{\textrm{T}})
+Tr((𝑿^𝑿^T+Tr(𝑽X)𝑼X)(𝑯^T𝑯^+Tr(𝑼H)𝑽H)T)\displaystyle+\text{Tr}((\hat{\boldsymbol{X}}\hat{\boldsymbol{X}}^{\textrm{T}}+\text{Tr}(\boldsymbol{V}_{X})\boldsymbol{U}_{X})(\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}}+\text{Tr}(\boldsymbol{U}_{H})\boldsymbol{V}_{H})^{\textrm{T}})
=Tr((𝒀𝑯^𝑿^)T(𝒀𝑯^𝑿^))+Tr(𝑿^𝑿^TTr(𝑼H)𝑽H\displaystyle\!\!\!\!\!\!\!\!\!\!=\text{Tr}\Big{(}\big{(}\boldsymbol{Y}-\hat{\boldsymbol{H}}\hat{\boldsymbol{X}}\big{)}^{\textrm{T}}\big{(}\boldsymbol{Y}-\hat{\boldsymbol{H}}\hat{\boldsymbol{X}}\big{)}\Big{)}+\text{Tr}\Big{(}\hat{\boldsymbol{X}}\hat{\boldsymbol{X}}^{\textrm{T}}\text{Tr}(\boldsymbol{U}_{H})\boldsymbol{V}_{H}
+Tr(𝑽X)𝑼X𝑯^T𝑯^+Tr(𝑽X)𝑼XTr(𝑼H)𝑽H).\displaystyle+\text{Tr}(\boldsymbol{V}_{X})\boldsymbol{U}_{X}\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}}+\text{Tr}(\boldsymbol{V}_{X})\boldsymbol{U}_{X}\text{Tr}(\boldsymbol{U}_{H})\boldsymbol{V}_{H}\Big{)}. (92)

By simplifying the above result, we obtain (44).

References

  • [1] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, pp. 788–791, 1999.
  • [2] R. Rubinstein, A. M. Bruckstein, and M. Elad, “Dictionaries for sparse representation modeling,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1045–1057, June 2010.
  • [3] H. Zhu, G. Leus, and G. B. Giannakis, “Sparsity-cognizant total least-squares for perturbed compressive sampling,” IEEE Transactions on Signal Processing, vol. 59, no. 5, pp. 2002–2016, May 2011.
  • [4] E. J. Candes, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” J. ACM, vol. 58, no. 3, Jun. 2011. [Online]. Available: https://doi.org/10.1145/1970392.1970395
  • [5] J. T. Parker, P. Schniter, and V. Cevher, “Bilinear generalized approximate message passing—Part I: Derivation,” IEEE Transactions on Signal Processing, vol. 62, no. 22, pp. 5839–5853, 2014.
  • [6] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” Advances in Neural Information Processing Systems, vol. 13, pp. 556–562, 2001.
  • [7] M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons, “Algorithms and applications for approximate nonnegative matrix factorization,” Computational Statistics and Data Analysis, vol. 52, no. 1, pp. 155–173, 2007.
  • [8] C.-J. Lin, “Projected gradient methods for nonnegative matrix factorization,” Neural Computation, vol. 19, no. 10, pp. 2756–2779, 2007.
  • [9] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.
  • [10] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” Journal of Machine Learning Research, vol. 11, pp. 19–60, 2010.
  • [11] H. Zhu, G. Leus, and G. B. Giannakis, “Sparsity-cognizant total least-squares for perturbed compressive sampling,” IEEE Transactions on Signal Processing, vol. 59, no. 5, pp. 2002–2016, 2011.
  • [12] Z. Lin, M. Chen, and Y. Ma, “The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices,” http://export.arxiv.org/abs/1009.5055, 2010.
  • [13] Z. Wen, W. Yin, and Y. Zhang, “Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm,” Mathematical Programming Computation, vol. 4, no. 4, pp. 333–361, 2012.
  • [14] D. L. Donoho, A. Maleki, and A. Montanari, “Message passing algorithms for compressed sensing: I. motivation and construction,” in 2010 IEEE Information Theory Workshop on Information Theory (ITW 2010, Cairo), Jan 2010, pp. 1–5.
  • [15] ——, “Message passing algorithms for compressed sensing: II. analysis and validation,” in 2010 IEEE Information Theory Workshop on Information Theory (ITW 2010, Cairo), Jan 2010, pp. 1–5.
  • [16] S. Rangan, “Generalized approximate message passing for estimation with random linear mixing,” in Proc. IEEE Int. Symp. Inf. Theory, July 2011, pp. 2168–2172.
  • [17] S. Sarkar, A. K. Fletcher, S. Rangan, and P. Schniter, “Bilinear recovery using adaptive vector-amp,” IEEE Transactions on Signal Processing, vol. 67, no. 13, pp. 3383–3396, July 2019.
  • [18] X. Meng and J. Zhu, “Bilinear adaptive generalized vector approximate message passing,” IEEE Access, vol. 7, pp. 4807–4815, 2019.
  • [19] Z. Yuan, Q. Guo, and M. Luo, “Approximate message passing with unitary transformation for robust bilinear recovery,” IEEE Trans. Signal Process., vol. 69, pp. 617–630, Dec. 2020.
  • [20] Q. Guo and J. Xi, “Approximate message passing with unitary transformation,” arXiv preprint arXiv:1504.04799, Apr. 2015.
  • [21] J. Winn and C. M. Bishop, “Variational message passing,” J. Mach. Learn. Res., vol. 6, pp. 661–694, Apr. 2005.
  • [22] D. J. De Waal, Matrix-Valued Distributions.   Encyclopedia of Statistical Sciences, 1985.
  • [23] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An introduction to variational methods for graphical models,” Mach. Learn., vol. 37, no. 2, pp. 183–233, Nov. 1999.
  • [24] D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proc. Nat. Acad. Sci., vol. 106, no. 45, pp. 18914–18919, Nov. 2009.
  • [25] S. Rangan, P. Schniter, A. K. Fletcher, and S. Sarkar, “On the convergence of approximate message passing with arbitrary matrices,” IEEE Trans. Inf. Theory, vol. 65, no. 9, pp. 5339–5351, Sept. 2019.
  • [26] Q. Guo, D. D. Huang, S. Nordholm, J. Xi, and Y. Yu, “Iterative frequency domain equalization with generalized approximate message passing,” IEEE Signal Process. Lett., vol. 20, no. 6, pp. 559–562, June 2013.
  • [27] M. E. Tipping, “Sparse Bayesian learning and the relevance vector machine,” J. Mach. Learn. Res., vol. 1, pp. 211–244, Sep. 2001. [Online]. Available: https://doi.org/10.1162/15324430152748236
  • [28] M. Luo, Q. Guo, M. Jin, Y. C. Eldar, D. Huang, and X. Meng, “Unitary approximate message passing for sparse Bayesian learning,” IEEE Transactions on Signal Processing, vol. 69, pp. 6023–6039, 2021.
  • [29] J. T. Parker, P. Schniter, and V. Cevher, “Bilinear generalized approximate message passing—Part II: Applications,” IEEE Transactions on Signal Processing, vol. 62, no. 22, pp. 5854–5867, 2014.
  • [30] J. M. Bioucas-Dias, A. Plaza, N. Dobigeon, M. Parente, Q. Du, P. Gader, and J. Chanussot, “Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 5, no. 2, pp. 354–379, 2012.
  • [31] W. Xu, “Document clustering based on non-negative matrix factorization,” Proceedings of SIGIR-03, 2003.
  • [32] P. O. Hoyer, “Non-negative sparse coding,” in Proc. 12th IEEE Workshop on Neural Networks for Signal Processing, 2002.
  • [33] K. B. Petersen and M. S. Pedersen, The Matrix Cookbook.   on-line, Sep. 2007, http://matrixcookbook.com.