
Unitary Approximate Message Passing for Matrix Factorization

Zhengdao Yuan, Qinghua Guo, Yonina C. Eldar, and Yonghui Li

Corresponding author: Qinghua Guo. Z. Yuan is with the Artificial Intelligence Technology Engineering Research Center, Open University of Henan, Zhengzhou 450002, China. He was with the School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Wollongong, NSW 2522, Australia (e-mail: [email protected]). Q. Guo is with the School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Wollongong, NSW 2522, Australia (e-mail: [email protected]). Y. C. Eldar is with the Faculty of Mathematics and Computer Science, Weizmann Institute of Science, Rehovot 7610001, Israel (e-mail: [email protected]). Y. Li is with the School of Electrical and Information Engineering, University of Sydney, Sydney, NSW 2006, Australia (e-mail: [email protected]).
Abstract

We consider matrix factorization (MF) with certain constraints, which finds wide applications in various areas. Leveraging variational inference (VI) and unitary approximate message passing (UAMP), we develop a Bayesian approach to MF with an efficient message passing implementation, called UAMP-MF. With proper priors imposed on the factor matrices, UAMP-MF can be used to solve many problems that can be formulated as MF, such as non-negative matrix factorization, dictionary learning, compressive sensing with matrix uncertainty, robust principal component analysis, and sparse matrix factorization. Extensive numerical examples are provided to show that UAMP-MF significantly outperforms state-of-the-art algorithms in terms of recovery accuracy, robustness and computational complexity.

Index Terms:
Variational inference (VI), approximate message passing (AMP), matrix factorization (MF), compressive sensing (CS), dictionary learning (DL), robust principal component analysis (RPCA).

I Introduction

We consider the problem of matrix factorization (MF) with the following model

𝒀=𝑯𝑿+𝑾,\displaystyle\boldsymbol{Y}=\boldsymbol{H}\boldsymbol{X}+\boldsymbol{W}, (1)

where 𝒀M×L\boldsymbol{Y}\in\mathcal{R}^{M\times L} denotes a known (observation) matrix, 𝑾M×L\boldsymbol{W}\in\mathcal{R}^{M\times L} accounts for unknown perturbations or measurement errors, and the matrices 𝑯M×N\boldsymbol{H}\in\mathcal{R}^{M\times N} and 𝑿N×L\boldsymbol{X}\in\mathcal{R}^{N\times L} are two factor matrices to be obtained. Depending on concrete application scenarios, the MF problem in (1) often has specific requirements on the matrices 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X}. For example, in non-negative matrix factorization (NMF) [1], the entries of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} need to take non-negative values. In dictionary learning (DL) [2], matrix 𝑿\boldsymbol{X} is a sparse matrix and 𝑯\boldsymbol{H} is a dictionary matrix to be learned. In compressive sensing with matrix uncertainty (CSMU) [3], 𝑿\boldsymbol{X} is sparse and the sensing matrix 𝑯\boldsymbol{H} can be modeled as 𝑯=𝑯¯+𝑯\boldsymbol{H}=\bar{\boldsymbol{H}}+\boldsymbol{H}^{\prime}, where the matrix 𝑯¯\bar{\boldsymbol{H}} is known, and 𝑯\boldsymbol{H}^{\prime} denotes an unknown perturbation matrix. The robust principal component analysis (RPCA) problem [4] can also be formulated as (1), where both 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} admit specific structures [5]. We may also be interested in sparse MF, where both matrices 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} are sparse.

Many algorithms have been developed to solve the above problems. For NMF, the multiplication (Mult) update algorithm proposed in [6] is often used, which is simple to implement, and produces good results. The alternating least squares (ALS) algorithm was proposed in [7] for NMF, where two least squares steps are performed alternately. The above two algorithms are very popular and are built in the MATLAB library. The projected gradient descent (AlsPGrad) [8] algorithm, which is an improvement to Mult, usually has faster convergence speed. To solve the DL problem, one can use algorithms such as K-SVD (singular value decomposition) [9] and SPAMS (sparse modeling software) [10]. K-SVD is often effective, but can be trapped in local minima or even saddle points. The SPAMS algorithm often does not perform well when the number of training samples is small. For CSMU, the sparsity-cognizant total least squares (S-TLS) method was developed in [11] by extending the classical TLS approach. For the RPCA problem, many algorithms have been proposed. The work in [4] developed iterative thresholding algorithms with low complexity, but their convergence is usually slow. The inexact augmented Lagrange multiplier algorithm (IALM) [12] and low-rank matrix fitting algorithm (LMaFit) [13] deliver better convergence speed and performance.

Another line of techniques is based on the Bayesian framework, especially with the approximate message passing (AMP) algorithm [14], [15] and its extensions and variants. With Bayesian approaches, the requirements on matrices 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} for the factorization in (1) are incorporated by imposing proper priors on them. The AMP algorithm is employed due to its low complexity. AMP was originally developed for compressive sensing based on loopy belief propagation (BP) [14], and was later extended in [16] to solve the estimation problem with a generalized linear observation model. The bilinear generalized AMP (Bi-GAMP) algorithm was proposed in [5] to deal with the problem in (1). However, Bi-GAMP inherits the shortcoming of the AMP algorithm, i.e., it is not robust to generic matrices 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X}. To achieve better robustness, a bilinear adaptive VAMP (BAd-VAMP) algorithm was proposed in [17] and generalized in [18]. However, for the general problem in (1), to reduce computational complexity, BAd-VAMP does not allow the use of priors for matrix 𝑯\boldsymbol{H}, i.e., it is treated as a deterministic unknown matrix. This makes BAd-VAMP incapable of dealing with the requirements on 𝑯\boldsymbol{H}, e.g., RPCA or sparse MF where 𝑯\boldsymbol{H} should be sparse. In addition, there is still considerable room for improvement in terms of performance, robustness and computational complexity. In [19], with UAMP [20] (which was formerly called UTAMP), a more robust and faster approximate inference algorithm (called Bi-UTAMP in [19]) was proposed, which is designed for specially structured 𝑯\boldsymbol{H} in (1), i.e., 𝑯=k=1Kbk𝑨k\boldsymbol{H}=\sum_{k=1}^{K}b_{k}\boldsymbol{A}_{k}, where matrices {𝑨k}\{\boldsymbol{A}_{k}\} are known, and coefficients {bk}\{b_{k}\} (and matrix 𝑿\boldsymbol{X}) need to be estimated.

In this work, with the framework of variational inference (VI) [21], we develop a more efficient Bayesian algorithm leveraging UAMP. The use of VI requires the updates of two variational distributions on 𝑿\boldsymbol{X} and 𝑯\boldsymbol{H}, which may be difficult and expensive. With matrix normal distributions [22] and through a covariance matrix whitening process, we incorporate UAMP into VI to deal with the updates of the variational distributions. The VI-based method is implemented using message passing with UAMP as its key component, leading to an algorithm called UAMP-MF. Inheriting the low complexity and robustness of UAMP, UAMP-MF is highly efficient and robust. It has cubic complexity per iteration, which is considered low for an MF problem (note that even the multiplication of two matrices has cubic complexity). Enjoying the flexibility of a Bayesian approach, UAMP-MF is able to handle various problems in a unified way with different priors. We focus on its applications to NMF, DL, CSMU, RPCA and sparse MF, and compare its performance with various algorithms, including IALM, LMaFit and Bi-GAMP for RPCA, K-SVD, SPAMS and BAd-VAMP for DL, BAd-VAMP for CSMU, and AlsPGrad and Mult for NMF. Extensive numerical results are provided to demonstrate the superiority of UAMP-MF in terms of recovery performance, robustness and computational complexity.

The remainder of this paper is organized as follows. In Section II, VI, (U)AMP and matrix normal distribution are briefly introduced. The UAMP-MF algorithm is derived in Section III. Applications of UAMP-MF to NMF, RPCA, DL, CSMU and sparse MF are investigated in Section IV, followed by conclusions in Section V.

Throughout the paper, we use the following notations. Boldface lower-case and upper-case letters denote vectors and matrices, respectively, and superscript ()T(\cdot)^{T} represents the transpose operation. A Gaussian distribution of xx with mean x^\hat{x} and variance νx\nu_{x} is represented by 𝒩(x;x^,νx)\mathcal{N}(x;{\hat{x}},\nu_{x}). Notation Tr()\text{Tr}(\cdot) denotes the trace operation. The relation f(x)=cg(x)f(x)=cg(x) for some positive constant cc is written as f(x)g(x)f(x)\propto g(x). The operation diag(𝒂)diag(\boldsymbol{a}) returns a diagonal matrix with 𝒂\boldsymbol{a} on its diagonal. We use 𝑨𝑩\boldsymbol{A}\cdot\boldsymbol{B} and 𝑨/𝑩\boldsymbol{A}\cdot/\boldsymbol{B} to represent the element-wise product and division between 𝑨\boldsymbol{A} and 𝑩\boldsymbol{B}, respectively. We use |𝑨|.2|\boldsymbol{A}|^{.2} to denote element-wise magnitude squared operation for 𝑨\boldsymbol{A}, and 𝑨||\boldsymbol{A}|| is the Frobenius norm of 𝑨\boldsymbol{A}. The notations 1, 0 and 𝑰\boldsymbol{I} denote an all-one matrix, an all-zero matrix and an identity matrix with a proper size, respectively. The above operations defined for matrices can also be applied to vectors. We use Vec()\textrm{Vec}(\cdot) to denote the vectorization operation.

II Preliminaries

II-A Variational Inference

VI is a machine learning method widely used to approximate posterior densities for Bayesian models [21], [23]. Let 𝑽\boldsymbol{V} and 𝑹\boldsymbol{R} be the set of hidden (latent) variables and visible (observed) variables, respectively, with joint distribution p(𝑽,𝑹)p(\boldsymbol{V},\boldsymbol{R}). The goal of VI is to find a tractable variational distribution q(𝑽){q}(\boldsymbol{V}) that approximates the true posterior distribution p(𝑽|𝑹)p(\boldsymbol{V}|\boldsymbol{R}). With the distribution q(𝑽)q(\boldsymbol{V}), the log marginal distribution of the observed variables admits the following decomposition

lnp(𝑹)=(q(𝑽))+𝒦(q(𝑽)||p(𝑽|𝑹)),\ln p(\boldsymbol{R})=\mathcal{L}(q(\boldsymbol{V}))+\mathcal{KL}(q(\boldsymbol{V})||p(\boldsymbol{V}|\boldsymbol{R})), (2)

where the variational lower bound (q(𝑽))\mathcal{L}(q(\boldsymbol{V})) is given as

(q(𝑽))=𝑽q(𝑽)lnp(𝑽,𝑹)q(𝑽),\mathcal{L}(q(\boldsymbol{V}))=\int_{\boldsymbol{V}}q(\boldsymbol{V})\ln\frac{p(\boldsymbol{V},\boldsymbol{R})}{q(\boldsymbol{V})}, (3)

and the Kullback-Leibler (KL) divergence between q(𝑽)q(\boldsymbol{V}) and p(𝑽|𝑹)p(\boldsymbol{V}|\boldsymbol{R}) is

𝒦(q(𝑽)||p(𝑽|𝑹))=𝑽q(𝑽)logp(𝑽|𝑹)q(𝑽).\mathcal{KL}(q(\boldsymbol{V})||p(\boldsymbol{V}|\boldsymbol{R}))=-\int_{\boldsymbol{V}}q(\boldsymbol{V})\log\frac{p(\boldsymbol{V}|\boldsymbol{R})}{q(\boldsymbol{V})}. (4)

The distribution q(𝑽)q(\boldsymbol{V}) that minimizes the KL divergence 𝒦(q(𝑽)||p(𝑽|𝑹))\mathcal{KL}(q(\boldsymbol{V})||p(\boldsymbol{V}|\boldsymbol{R})) can be found by maximizing the variational lower bound (q(𝑽))\mathcal{L}(q(\boldsymbol{V})).

VI can be implemented with message passing with the assistance of graphical models such as factor graphs. If the variational distribution with some factorization is chosen, e.g.,

q(𝑽)=kqk(𝑽k),q(\boldsymbol{V})=\prod_{k}q_{k}(\boldsymbol{V}_{k}), (5)

where 𝑽={𝑽k}\boldsymbol{V}=\{\boldsymbol{V}_{k}\}, then the variational distribution may be found through an iterative procedure [21] with the update rule

qk(𝑽k)exp(𝑽~q(𝑽~)logf(𝑽k,𝑽~)).q_{k}(\boldsymbol{V}_{k})\propto\exp\left(\int_{\tilde{\boldsymbol{V}}}q(\tilde{\boldsymbol{V}})\log f(\boldsymbol{V}_{k},\tilde{\boldsymbol{V}})\right). (6)

Here f(𝑽k,𝑽~)f(\boldsymbol{V}_{k},\tilde{\boldsymbol{V}}) is a local factor associated with 𝑽k\boldsymbol{V}_{k} and 𝑽~{𝑽i,ik}\tilde{\boldsymbol{V}}\in\{\boldsymbol{V}_{i},i\neq k\}, depending on the structure of the factor graph. The update of {qk(𝑽k)}\{q_{k}(\boldsymbol{V}_{k})\} is carried out iteratively until it converges or a pre-set number of iterations is reached.
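To make the update rule (6) concrete, the following is a minimal coordinate-ascent sketch (in Python/NumPy) for a toy conjugate model with observations y_i ~ N(mu, 1/lam), a Gaussian prior on mu and a Gamma prior on lam. The hyper-parameters s0, a0, b0 and the function name are illustrative assumptions, not quantities from this paper.

import numpy as np

# Toy mean-field VI: q(mu, lam) = q(mu) q(lam); each factor update follows rule (6)
# and has a closed form because the model is conjugate.
def toy_cavi(y, s0=10.0, a0=1e-3, b0=1e-3, num_iters=50):
    n, ybar = len(y), np.mean(y)
    E_lam = 1.0                                   # initial guess for E_q[lam]
    for _ in range(num_iters):
        # update q(mu) = N(m, v) given the current q(lam)
        v = 1.0 / (1.0 / s0 ** 2 + n * E_lam)
        m = v * E_lam * n * ybar
        # update q(lam) = Gamma(a, b) given the current q(mu)
        a = a0 + 0.5 * n
        b = b0 + 0.5 * (np.sum(y ** 2) - 2.0 * m * np.sum(y) + n * (m ** 2 + v))
        E_lam = a / b
    return m, v, a, b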

II-B (U)AMP

AMP was derived based on loopy BP with Gaussian and Taylor-series approximations [24], [16], which can be used to recover 𝒙\boldsymbol{x} from the noisy measurement 𝒚=𝑨𝒙+𝒘\boldsymbol{y}=\boldsymbol{A}\boldsymbol{x}+\boldsymbol{w} with 𝒘\boldsymbol{w} being a zero-mean white Gaussian noise vector. AMP works well with an i.i.d. Gaussian 𝑨\boldsymbol{A}, but it can easily diverge for generic 𝐀\mathbf{A} [25]. Inspired by [26], the work in [20] showed that the robustness of AMP can be remarkably improved through simple pre-processing, i.e., performing a unitary transformation to the original linear model [19]. As any matrix 𝑨\boldsymbol{A} has an SVD 𝑨=𝑼𝚲𝑽\boldsymbol{A}=\boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{V} with 𝑼\boldsymbol{U} and 𝑽\boldsymbol{V} being two unitary matrices, performing a unitary transformation with 𝑼T\boldsymbol{U}^{\textrm{T}} leads to the following model

𝒓=𝚽𝒙+𝝎,\boldsymbol{r}=\mathbf{\Phi}\boldsymbol{x}+\boldsymbol{\omega}, (7)

where 𝒓=𝑼T𝒚\boldsymbol{r}=\boldsymbol{U}^{\textrm{T}}\boldsymbol{y}, 𝚽=𝑼T𝑨=𝚲𝑽\boldsymbol{\Phi}=\boldsymbol{U}^{\textrm{T}}\boldsymbol{A}=\boldsymbol{\Lambda}\boldsymbol{V}, 𝚲\mathbf{\Lambda} is an M×NM\times N rectangular diagonal matrix, and 𝝎=𝑼T𝒘\boldsymbol{\omega}=\boldsymbol{U}^{\textrm{T}}\boldsymbol{w} remains a zero-mean white Gaussian noise vector with the same covariance as 𝒘\boldsymbol{w}.

Applying the vector step size AMP [16] with model (7) leads to the first version of UAMP (called UAMPv1) shown in Algorithm 1. Note that by replacing 𝒓\boldsymbol{r} and 𝚽\mathbf{\Phi} with 𝒚\boldsymbol{y} and 𝑨\boldsymbol{A} in Algorithm 1, the original AMP algorithm is recovered. An average operation can be applied to two vectors: 𝝉x\boldsymbol{\tau}_{x} in Line 7 and |𝚽H|.2𝝉s|\mathbf{\Phi}^{H}|^{.2}\boldsymbol{\tau}_{s} in Line 5 of UAMPv1 in Algorithm 1, leading to the second version of UAMP [20] (called UAMPv2), where the operations in the brackets of Lines 1, 5 and 7 are executed (refer to [19] for the detailed derivations). Compared to AMP and UAMPv1, UAMPv2 does not require matrix-vector multiplications in Lines 1 and 5, so that the number of matrix-vector products is reduced from 4 to 2 per iteration.

Algorithm 1 UAMP (UAMPv2 executes operations in [ ])

Initialize 𝝉x(0)(orτx(0))>0\boldsymbol{\tau}_{x}^{(0)}(\mathrm{or}~{}{\tau}_{x}^{(0)})>0 and 𝒙(0){{\boldsymbol{x}}^{(0)}}. Set 𝒔(1)=𝟎\boldsymbol{s}^{(-1)}=\mathbf{0} and t=0t=0. Define vector 𝝀=𝚲𝚲T1\boldsymbol{\lambda}={\mathbf{\Lambda\Lambda}^{\textrm{T}}}\textbf{1}.
Repeat

1:  𝝉p\boldsymbol{\tau}_{p} = |𝚽|.2𝝉xt\mathbf{|\Phi|}^{.2}\boldsymbol{\tau}^{t}_{x}             [or𝝉p=τxt𝝀]\left[\mathrm{or}~{}\boldsymbol{\tau}_{p}=\tau^{t}_{x}\boldsymbol{\lambda}\right]
2:  𝒑=𝚽𝒙t𝝉p𝒔t1\boldsymbol{p}=\mathbf{\Phi}{{\boldsymbol{x}}^{t}}-\boldsymbol{\tau}_{p}\cdot\boldsymbol{s}^{t-1}
3:  𝝉s=𝟏./(𝝉p+β1𝟏)\boldsymbol{\tau}_{s}=\mathbf{1}./(\boldsymbol{\tau}_{p}+\beta^{-1}\mathbf{1})
4:  𝒔t=𝝉s(𝒓𝒑)\boldsymbol{s}^{t}=\boldsymbol{\tau}_{s}\cdot(\boldsymbol{r}-\boldsymbol{p})
5:  𝟏./𝝉q\mathbf{1}./\boldsymbol{\tau}_{q} = |𝚽T|.2𝝉s|\mathbf{\Phi}^{\textrm{T}}|^{.2}\boldsymbol{\tau}_{s}       [or𝟏./𝝉q=(1N𝝀T𝝉s)𝟏]\left[\mathrm{or}~{}\mathbf{1}./\boldsymbol{\tau}_{q}=(\frac{1}{N}\boldsymbol{\lambda}^{\textrm{T}}\boldsymbol{\tau}_{s})\mathbf{1}\right]
6:  𝒒=𝒙t+𝝉q(𝚽H𝒔t)\boldsymbol{q}={{\boldsymbol{x}}^{t}}+\boldsymbol{\tau}_{q}\cdot(\mathbf{\Phi}^{H}\boldsymbol{s}^{t})
7:  𝝉xt+1\boldsymbol{\tau}_{x}^{t+1} = 𝝉qgx(𝒒,𝝉q)\boldsymbol{\tau}_{q}\cdot g_{x}^{\prime}(\boldsymbol{q},\boldsymbol{\tau}_{q})      [orτxt+1=1N𝟏T(𝝉qgx(𝒒,τq))]\left[\mathrm{or}~{}\tau_{x}^{t+1}\!=\!\frac{1}{N}\mathbf{1}^{\textrm{T}}\left(\boldsymbol{\tau}_{q}\cdot g_{x}^{\prime}(\boldsymbol{q},\tau_{q})\right)\right]
8:  𝐱t+1=gx(𝒒,𝝉q){{\mathbf{x}}^{t+1}}=g_{x}(\boldsymbol{q},\boldsymbol{\tau}_{q})
9:  t=t+1t=t+1

Until terminated

In the (U)AMP algorithms, gx(𝒒,𝝉q)g_{x}(\boldsymbol{q},\boldsymbol{\tau}_{q}) is related to the prior of 𝒙\boldsymbol{x} and returns a column vector with nnth entry [gx(𝐪,𝝉q)]n[g_{x}(\mathbf{q},\boldsymbol{\tau}_{q})]_{n} given by

[gx(𝒒,𝝉q)]n=xnp(xn)𝒩(xn;qn,τqn)𝑑xnp(xn)𝒩(xn;qn,τqn)𝑑xn,[g_{x}(\boldsymbol{q},\boldsymbol{\tau}_{q})]_{n}=\frac{\int x_{n}p(x_{n})\mathcal{N}(x_{n};q_{n},\tau_{q_{n}})dx_{n}}{\int p(x_{n})\mathcal{N}(x_{n};q_{n},\tau_{q_{n}})dx_{n}}, (8)

where p(xn)p(x_{n}) represents a prior for xnx_{n}. The function gx(𝒒,𝝉q)g_{x}^{\prime}(\boldsymbol{q},\boldsymbol{\tau}_{q}) returns a column vector and the nnth element is denoted by [gx(𝒒,𝝉q)]n[g_{x}^{\prime}(\boldsymbol{q},\boldsymbol{\tau}_{q})]_{n}, where the derivative is taken with respect to qnq_{n}.
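As an illustration of Algorithm 1, the following is a minimal NumPy sketch of UAMPv1 for a real-valued model, assuming a known noise precision beta and a zero-mean Gaussian prior 𝒩(x_n; 0, v0), for which the denoiser (8) and its derivative take the simple closed forms noted in the comments. It uses the economy-size SVD and is only a sketch under these assumptions, not a reference implementation.

import numpy as np

def uamp_v1(y, A, beta, v0=1.0, num_iters=50):
    # Unitary transform of y = A x + w into r = Phi x + omega, cf. (7)
    U, sv, VT = np.linalg.svd(A, full_matrices=False)
    r, Phi = U.T @ y, sv[:, None] * VT            # Phi = Lambda V
    M, N = Phi.shape
    x, tau_x, s = np.zeros(N), np.ones(N), np.zeros(M)
    Phi2 = np.abs(Phi) ** 2
    for _ in range(num_iters):
        tau_p = Phi2 @ tau_x                      # Line 1
        p = Phi @ x - tau_p * s                   # Line 2
        tau_s = 1.0 / (tau_p + 1.0 / beta)        # Line 3
        s = tau_s * (r - p)                       # Line 4
        tau_q = 1.0 / (Phi2.T @ tau_s)            # Line 5
        q = x + tau_q * (Phi.T @ s)               # Line 6
        gain = v0 / (v0 + tau_q)                  # MMSE denoiser (8) for the prior N(0, v0)
        tau_x = tau_q * gain                      # Line 7
        x = gain * q                              # Line 8
    return x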

II-C Matrix Normal Distribution

In this work, the matrix normal distribution is used in the derivation of VI for the MF problem, into which UAMP is incorporated, leading to the UAMP-MF algorithm. The matrix normal distribution is a generalization of the multivariate normal distribution to matrix-valued random variables [22]. The distribution of normal random matrix 𝑿\boldsymbol{X} is

𝒩(𝑿;𝑿^,𝑼X,𝑽X)=(2π)NL2|𝑼X|L2|𝑽X|N2\displaystyle\mathcal{MN}\big{(}\boldsymbol{X};\hat{\boldsymbol{X}},\boldsymbol{U}_{X},\boldsymbol{V}_{X}\big{)}=(2\pi)^{-\frac{NL}{2}}\big{|}\boldsymbol{U}_{X}\big{|}^{-\frac{L}{2}}\big{|}\boldsymbol{V}_{X}\big{|}^{-\frac{N}{2}}
exp(12Tr(𝑽X1(𝑿𝑿^)T𝑼X1(𝑿𝑿^))),\displaystyle\exp\Big{(}-\frac{1}{2}\text{Tr}\big{(}\boldsymbol{V}_{X}^{-1}\big{(}\boldsymbol{X}-\hat{\boldsymbol{X}}\big{)}^{\textrm{T}}\boldsymbol{U}_{X}^{-1}\big{(}\boldsymbol{X}-\hat{\boldsymbol{X}}\big{)}\big{)}\Big{)}, (9)

where 𝑿^N×L\hat{\boldsymbol{X}}\in\mathcal{R}^{N\times L} is the mean of 𝑿\boldsymbol{X}, and 𝑼XN×N\boldsymbol{U}_{X}\in\mathcal{R}^{N\times N} and 𝑽XL×L\boldsymbol{V}_{X}\in\mathcal{R}^{L\times L} are the covariance matrices among the rows and among the columns of 𝑿\boldsymbol{X}, respectively. The matrix normal distribution is related to the multivariate normal distribution in the following way

𝒙𝒩(𝒙;𝒙^,𝑽X𝑼X),\displaystyle\boldsymbol{x}\sim\mathcal{N}\big{(}\boldsymbol{x};\hat{\boldsymbol{x}},\boldsymbol{V}_{X}\otimes\boldsymbol{U}_{X}\big{)}, (10)

where 𝒙=Vec(𝑿)\boldsymbol{x}=\textrm{Vec}(\boldsymbol{X}) and 𝒙^=Vec(𝑿^)\hat{\boldsymbol{x}}=\textrm{Vec}(\hat{\boldsymbol{X}}).
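The relation between (9) and (10) can be checked numerically; the short NumPy/SciPy sketch below evaluates the matrix normal log-density directly and through Vec(𝑿) with covariance 𝑽_X ⊗ 𝑼_X. All variable names and sizes are illustrative.

import numpy as np
from scipy.stats import multivariate_normal

def matrix_normal_logpdf(X, Mean, U, V):
    N, L = X.shape
    D = X - Mean
    quad = np.trace(np.linalg.solve(V, D.T) @ np.linalg.solve(U, D))  # Tr(V^-1 D^T U^-1 D)
    _, logdet_U = np.linalg.slogdet(U)
    _, logdet_V = np.linalg.slogdet(V)
    return -0.5 * (N * L * np.log(2 * np.pi) + L * logdet_U + N * logdet_V + quad)

rng = np.random.default_rng(0)
N, L = 4, 3
X, Mean = rng.standard_normal((N, L)), rng.standard_normal((N, L))
A, B = rng.standard_normal((N, N)), rng.standard_normal((L, L))
U, V = A @ A.T + N * np.eye(N), B @ B.T + L * np.eye(L)   # random SPD row/column covariances
lp1 = matrix_normal_logpdf(X, Mean, U, V)
lp2 = multivariate_normal.logpdf(X.flatten('F'), Mean.flatten('F'), np.kron(V, U))
assert np.isclose(lp1, lp2)   # (9) and (10) agree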

III Design of UAMP-MF

The Bayesian treatment of the MF problem leads to great flexibility. Imposing proper priors p(𝑯)p(\boldsymbol{H}) and p(𝑿)p(\boldsymbol{X}) on 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X}, we can accommodate many concrete MF problems. For example in DL, we aim to find sparse coding of data set 𝒀\boldsymbol{Y}, where 𝑯\boldsymbol{H} is the dictionary and 𝑿\boldsymbol{X} is the sparse representation with the dictionary. For the entries of 𝑯\boldsymbol{H}, we can use a Gaussian prior

p(𝑯)=m,np(hm,n)=m,n𝒩(hm,n;0,1),\displaystyle p(\boldsymbol{H})=\prod\nolimits_{m,n}p(h_{m,n})=\prod\nolimits_{m,n}\mathcal{N}(h_{m,n};0,1), (11)

while for the entries of 𝑿\boldsymbol{X}, we can consider a sparsity promoting hierarchical Gaussian-Gamma prior

p(𝑿)\displaystyle p(\boldsymbol{X}) =\displaystyle= n,lp(xn,l)=n,l𝒩(xn,l;0,γn,l1)\displaystyle\prod\nolimits_{n,l}p(x_{n,l})=\prod\nolimits_{n,l}\mathcal{N}(x_{n,l};0,\gamma_{n,l}^{-1}) (12)
p(γn,l)\displaystyle p(\gamma_{n,l}) =\displaystyle= Ga(γn,l;ϵ,η).\displaystyle Ga(\gamma_{n,l};\epsilon,\eta). (13)

In Section IV, more examples on the priors, including NMF, RPCA, CSMU and sparse MF, are provided.

With the priors and model (1), if the a posteriori distributions p(𝑯|𝒀)p(\boldsymbol{H}|\boldsymbol{Y}) and p(𝑿|𝒀)p(\boldsymbol{X}|\boldsymbol{Y}) can be found, then the estimates of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} are obtained, e.g., using the a posteriori means of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} to serve as their estimates. However, it is intractable to find the exact a posteriori distributions in general, so we resort to the approximate inference method VI to find their variational approximations q(𝑯)p(𝑯|𝒀)q(\boldsymbol{H})\approx p(\boldsymbol{H}|\boldsymbol{Y}) and q(𝑿)p(𝑿|𝒀)q(\boldsymbol{X})\approx p(\boldsymbol{X}|\boldsymbol{Y}). Finding the variational distributions is still challenging due to the large dimensions of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X}, and the priors of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} may also lead to intractable q(𝑯)q(\boldsymbol{H}) and q(𝑿)q(\boldsymbol{X}). In this work, through a covariance matrix (model noise) whitening process, UAMP is incorporated to solve these challenges with high efficiency. This also leads to Gaussian variational distributions q(𝑯)q(\boldsymbol{H}) and q(𝑿)q(\boldsymbol{X}), so that the estimates of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} (i.e., the a posteriori means of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X}) appear as the parameters of the variational distributions.

Leveraging UAMP, we carry out VI in a message passing manner, with the aid of a factor graph representation of the problem. This leads to the message passing algorithm UAMP-MF. In the derivation of UAMP-MF, we use the notation mnanb(𝑿)m_{n_{a}\rightarrow n_{b}}(\boldsymbol{X}) to denote a message passed from node nan_{a} to node nbn_{b}, which is a function of 𝑿\boldsymbol{X}. The algorithm UAMP-MF is shown in Algorithm 2, which we will frequently refer to in this section.

III-A Factor Graph Representation

Algorithm 2 UAMP-MF

Initialization: 𝑼H=𝑰M\boldsymbol{U}_{H}=\boldsymbol{I}_{M}, 𝑽H=𝑰N\boldsymbol{V}_{H}=\boldsymbol{I}_{N}, 𝑯^=1MN\hat{\boldsymbol{H}}=\textbf{1}_{MN}, 𝑽X=𝑰L\boldsymbol{V}_{X}=\boldsymbol{I}_{L}. 𝚵X=1NL\boldsymbol{\Xi}_{X}=\textbf{1}_{NL}, 𝑺X=0NL\boldsymbol{S}_{X}=\textbf{0}_{NL}, 𝚵H=1MN\boldsymbol{\Xi}_{H}=\textbf{1}_{MN}, 𝑺H=0MN\boldsymbol{S}_{H}=\textbf{0}_{MN}. Repeat

1:  𝑾¯X=𝑯^T𝑯^+M𝑽H\overline{\boldsymbol{W}}_{X}=\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}}+M\boldsymbol{V}_{H}
2:  [𝑪X,𝑫X]=eig(𝑾¯X)[\boldsymbol{C}_{X},\boldsymbol{D}_{X}]=\text{eig}(\overline{\boldsymbol{W}}_{X})
3:  𝑹X=𝑫X12𝑪XT𝑯^T𝒀\boldsymbol{R}_{X}=\boldsymbol{D}_{X}^{-\frac{1}{2}}\boldsymbol{C}_{X}^{\textrm{T}}\hat{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{Y}, 𝚽X=𝑫X12𝑪XT\boldsymbol{\Phi}_{X}=\boldsymbol{D}_{X}^{-\frac{1}{2}}\boldsymbol{C}_{X}^{\textrm{T}}
4:  𝑽PX=|𝚽X|.2𝚵X\boldsymbol{V}_{P_{X}}=|\boldsymbol{\Phi}_{X}|^{.2}\boldsymbol{\Xi}_{X}
5:  𝑷X=𝚽X𝑿^𝑽𝑷X𝑺X\boldsymbol{P}_{X}=\boldsymbol{\Phi}_{X}\hat{\boldsymbol{X}}-\boldsymbol{V}_{\boldsymbol{P}_{X}}\cdot\boldsymbol{S}_{X}
6:  𝑽SX=1./(𝑽PX+λ^1𝟏)\boldsymbol{V}_{S_{X}}=\textbf{1}./(\boldsymbol{V}_{P_{X}}+\hat{\lambda}^{-1}\mathbf{1})
7:  𝑺X=𝑽SX(𝑹X𝑷X)\boldsymbol{S}_{X}=\boldsymbol{V}_{S_{X}}\cdot(\boldsymbol{R}_{X}-\boldsymbol{P}_{X})
8:  𝑽QX=1./(|𝚽XT|.2𝑽SX)\boldsymbol{V}_{Q_{X}}=\textbf{1}./(|\boldsymbol{\Phi}_{X}^{\textrm{T}}|^{.2}\boldsymbol{V}_{S_{X}})
9:  𝑸X=𝑿^+𝑽QX(𝚽XT𝑺X)\boldsymbol{Q}_{X}=\hat{\boldsymbol{X}}+\boldsymbol{V}_{Q_{X}}\cdot(\boldsymbol{\Phi}_{X}^{\textrm{T}}\boldsymbol{S}_{X})
10:  𝚵X=𝑽QX𝑮X(𝑸X,𝑽QX)\boldsymbol{\Xi}_{X}=\boldsymbol{V}_{Q_{X}}\cdot\boldsymbol{G}_{X}^{\prime}(\boldsymbol{Q}_{X},\boldsymbol{V}_{Q_{X}})
11:  𝑿^=𝑮X(𝑸X,𝑽QX)\hat{\boldsymbol{X}}=\boldsymbol{G}_{X}(\boldsymbol{Q}_{X},\boldsymbol{V}_{Q_{X}})
12:  𝑼X=diag(mean(𝚵X,2))\boldsymbol{U}_{X}=\text{diag}(\text{mean}(\boldsymbol{\Xi}_{X},2))
13:  𝑾¯H=𝑿^𝑿^T+L𝑼X\overline{\boldsymbol{W}}_{H}=\hat{\boldsymbol{X}}\hat{\boldsymbol{X}}^{\textrm{T}}+L\boldsymbol{U}_{X}
14:  [𝑪H,𝑫H]=eig(𝑾¯H)[\boldsymbol{C}_{H},\boldsymbol{D}_{H}]=\text{eig}(\overline{\boldsymbol{W}}_{H})
15:  𝑹H=𝑫H12𝑪HT𝑿^𝒀T\boldsymbol{R}_{H}=\boldsymbol{D}_{H}^{-\frac{1}{2}}\boldsymbol{C}_{H}^{\textrm{T}}\hat{\boldsymbol{X}}\boldsymbol{Y}^{\textrm{T}}, 𝚽H=𝑫H12𝑪HT\boldsymbol{\Phi}_{H}=\boldsymbol{D}_{H}^{-\frac{1}{2}}\boldsymbol{C}_{H}^{\textrm{T}}
16:  𝑽PH=|𝚽H|.2𝚵H\boldsymbol{V}_{P_{H}}=|\boldsymbol{\Phi}_{H}|^{.2}\boldsymbol{\Xi}_{H}
17:  𝑷H=𝚽H𝑯^𝑽𝑷H𝑺H\boldsymbol{P}_{H}=\boldsymbol{\Phi}_{H}\hat{\boldsymbol{H}}-\boldsymbol{V}_{\boldsymbol{P}_{H}}\cdot\boldsymbol{S}_{H}
18:  𝑽SH=𝟏./(𝑽PH+λ^1𝟏)\boldsymbol{V}_{S_{H}}=\mathbf{1}./(\boldsymbol{V}_{P_{H}}+\hat{\lambda}^{-1}\mathbf{1})
19:  𝑺H=𝑽SH(𝑹H𝑷H)\boldsymbol{S}_{H}=\boldsymbol{V}_{S_{H}}\cdot(\boldsymbol{R}_{H}-\boldsymbol{P}_{H})
20:  𝑽QH=𝟏./(|𝚽HT|.2𝑽SH)\boldsymbol{V}_{Q_{H}}=\mathbf{1}./(|\boldsymbol{\Phi}_{H}^{\textrm{T}}|^{.2}\boldsymbol{V}_{S_{H}})
21:  𝑸H=𝑯^+𝑽QH(𝚽HT𝑺H)\boldsymbol{Q}_{H}=\hat{\boldsymbol{H}}+\boldsymbol{V}_{Q_{H}}\cdot(\boldsymbol{\Phi}_{H}^{\textrm{T}}\boldsymbol{S}_{H})
22:  𝚵H=𝑽QH𝑮H(𝑸H,𝑽QH)\boldsymbol{\Xi}_{H}=\boldsymbol{V}_{Q_{H}}\cdot\boldsymbol{G}_{H}^{\prime}(\boldsymbol{Q}_{H},\boldsymbol{V}_{Q_{H}})
23:  𝑯^=𝑮H(𝑸H,𝑽QH)\hat{\boldsymbol{H}}=\boldsymbol{G}_{H}(\boldsymbol{Q}_{H},\boldsymbol{V}_{Q_{H}})
24:  𝑽H=diag(mean(𝚵H,1))\boldsymbol{V}_{H}=\text{diag}(\text{mean}(\boldsymbol{\Xi}_{H},1))
25:  λ^=ML/C\hat{\lambda}=ML/C with CC given in (44)

Until terminated

With model (1), we have the following joint conditional distribution and its factorization

p(𝑿,𝑯,λ|𝒀)\displaystyle p(\boldsymbol{X},\boldsymbol{H},\lambda|\boldsymbol{Y})\!\!\!\!\! \displaystyle\propto p(𝒀|𝑿,𝑯,λ)p(𝑿)p(𝑯)p(λ)\displaystyle\!\!\!\!\!p(\boldsymbol{Y}|\boldsymbol{X},\boldsymbol{H},\lambda)p(\boldsymbol{X})p(\boldsymbol{H})p(\lambda) (14)
\displaystyle\triangleq fY(𝒀,𝑿,𝑯,λ)fX(𝑿)fH(𝑯)fλ(λ)\displaystyle\!\!\!\!\!f_{Y}(\boldsymbol{Y},\!\boldsymbol{X},\!\boldsymbol{H},\!\lambda)f_{X}(\boldsymbol{X})f_{H}(\boldsymbol{H})f_{\lambda}(\lambda)

where

fY(𝒀,𝑿,𝑯,λ)\displaystyle f_{Y}(\boldsymbol{Y},\boldsymbol{X},\boldsymbol{H},\lambda) \displaystyle\triangleq p(𝒀|𝑿,𝑯,λ)\displaystyle p\big{(}\boldsymbol{Y}|\boldsymbol{X},\boldsymbol{H},\lambda\big{)} (15)
=\displaystyle= 𝒩(𝒀;𝑯𝑿,𝑰M,λ1𝑰L).\displaystyle\mathcal{MN}\big{(}\boldsymbol{Y};\boldsymbol{H}\boldsymbol{X},\boldsymbol{I}_{M},\lambda^{-1}\boldsymbol{I}_{L}\big{)}.

The precision λ\lambda of the noise has an improper prior fλ(λ)p(λ)1/λf_{\lambda}(\lambda)\triangleq p(\lambda)\propto 1/\lambda [27], and the priors for 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} denoted by fH(𝑯)p(𝑯)f_{H}(\boldsymbol{H})\triangleq p(\boldsymbol{H}) and fX(𝑿)p(𝑿)f_{X}(\boldsymbol{X})\triangleq p(\boldsymbol{X}) are customized according to the requirements in a specific problem. The factor graph representation of (14) is depicted in Fig. 1, where the arguments of the local functions are dropped for the sake of conciseness. We define a variational distribution

q(𝑿,𝑯,λ)=q(𝑿)q(𝑯)q(λ),q(\boldsymbol{X},\boldsymbol{H},\lambda)=q(\boldsymbol{X})q(\boldsymbol{H})q(\lambda), (16)

and we expect that q(𝑿)p(𝑿|𝒀)q(\boldsymbol{X})\approx p(\boldsymbol{X}|\boldsymbol{Y}), q(𝑯)p(𝑯|𝒀)q(\boldsymbol{H})\approx p(\boldsymbol{H}|\boldsymbol{Y}) and q(λ)p(λ|𝒀)q(\lambda)\approx p(\lambda|\boldsymbol{Y}). Next, we study how to efficiently update q(𝑿)q(\boldsymbol{X}), q(𝑯)q(\boldsymbol{H}) and q(λ)q(\lambda) iteratively.

Refer to caption
Figure 1: Factor graph representation of (14).

III-B Update of q(𝐗)q(\boldsymbol{X})

According to VI, with q(𝑯)q(\boldsymbol{H}) and q(λ)q(\lambda) (updated in the previous iteration), we update q(𝑿)q(\boldsymbol{X}). As shown by the factor graph in Fig. 1, we need to compute the message mfYX(𝑿)m_{f_{Y}\to X}(\boldsymbol{X}) from the factor node fYf_{Y} to the variable node 𝑿\boldsymbol{X} and then combine it with the prior p(𝑿)p(\boldsymbol{X}). Later we will see that q(𝑯)q(\boldsymbol{H}) is a matrix normal distribution, i.e., q(𝑯)=𝒩(𝑯;𝑯^,𝑼H,𝑽H)q(\boldsymbol{H})=\mathcal{MN}(\boldsymbol{H};\hat{\boldsymbol{H}},\boldsymbol{U}_{H},\boldsymbol{V}_{H}) with 𝑼H=𝑰M\boldsymbol{U}_{H}=\boldsymbol{I}_{M}, and q(λ)q(\lambda) is a Gamma distribution. It turns out that the message mfYX(𝑿)m_{f_{Y}\to X}(\boldsymbol{X}) is matrix normal, as shown by Proposition 1.

Proposition 1: The message from fYf_{Y} to 𝑿\boldsymbol{X} is a matrix normal distribution of 𝑿\boldsymbol{X}, which can be expressed as

mfYX(𝑿)𝒩(𝑿;𝑿¯,λ^1𝑼¯X,𝑰L),m_{f_{Y}\to X}(\boldsymbol{X})\propto\mathcal{MN}(\boldsymbol{X};\overline{\boldsymbol{X}},\hat{\lambda}^{-1}\overline{\boldsymbol{U}}_{X},\boldsymbol{I}_{L}), (17)

with

𝑿¯=𝑼¯X𝑯^T𝒀,\overline{\boldsymbol{X}}=\overline{\boldsymbol{U}}_{X}\hat{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{Y}, (18)
𝑼¯X=(𝑯^T𝑯^+Tr(𝑼H)𝑽H)1,\overline{\boldsymbol{U}}_{X}=\big{(}\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}}+\text{Tr}(\boldsymbol{U}_{H})\boldsymbol{V}_{H}\big{)}^{-1}, (19)

and

λ^=λλq(λ).\hat{\lambda}=\int_{\lambda}\lambda q(\lambda). (20)

The computation of λ^\hat{\lambda} is shown in (46).

Proof.

See Appendix A. ∎

According to VI, this message is combined with the prior p(𝑿)p(\boldsymbol{X}) to obtain q(𝑿)q(\boldsymbol{X}). This can be challenging as p(𝑿)p(\boldsymbol{X}) may lead to an intractable q(𝑿)q(\boldsymbol{X}) and high computational complexity. Next, leveraging UAMP, we update q(𝑿)q(\boldsymbol{X}) efficiently. It is also worth mentioning that, with the use of UAMP, the matrix inversion involved in (19) does not need to be performed.

Note that 𝑿=[𝒙1,,𝒙L]\boldsymbol{X}=[\boldsymbol{x}_{1},...,\boldsymbol{x}_{L}] and 𝑿¯=[𝒙¯1,,𝒙¯L]\overline{\boldsymbol{X}}=[\overline{\boldsymbol{x}}_{1},...,\overline{\boldsymbol{x}}_{L}]. The result in (17) indicates that 𝒙l𝒩(𝒙l;𝒙¯l,λ^1𝑼¯X)\boldsymbol{x}_{l}\sim\mathcal{N}(\boldsymbol{x}_{l};\overline{\boldsymbol{x}}_{l},\hat{\lambda}^{-1}\overline{\boldsymbol{U}}_{X}). This leads to the following pseudo observation model

𝒙¯l=𝒙l+𝒆x,\displaystyle\overline{\boldsymbol{x}}_{l}=\boldsymbol{x}_{l}+\boldsymbol{e}_{x}, (21)

where 𝒆x𝒩(𝒆x;0,λ^1𝑼¯X)\boldsymbol{e}_{x}\sim\mathcal{N}(\boldsymbol{e}_{x};0,\hat{\lambda}^{-1}\overline{\boldsymbol{U}}_{X}), i.e., the model noise is not white. We can whiten the noise by left-multiplying both sides of (21) by 𝑼¯X12\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}, leading to

𝑼¯X12𝒙¯l=𝑼¯X12𝒙l+𝒘x,\displaystyle\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}\overline{\boldsymbol{x}}_{l}=\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}\boldsymbol{x}_{l}+\boldsymbol{w}_{x}, (22)

where 𝒘x=𝑼¯X12𝒆x\boldsymbol{w}_{x}=\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}\boldsymbol{e}_{x} is white and Gaussian with covariance λ^1𝑰\hat{\lambda}^{-1}\boldsymbol{I}. Through the whitening process, we get a standard linear model with white additive Gaussian noise, which facilitates the use of UAMP. Considering all the vectors in 𝑿¯\overline{\boldsymbol{X}}, we have

𝑼¯X12𝑿¯=𝑼¯X12𝑿+𝛀𝑿,\displaystyle\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}\overline{\boldsymbol{X}}=\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}\boldsymbol{X}+\boldsymbol{\Omega_{X}}, (23)

where 𝛀𝑿\boldsymbol{\Omega_{X}} is white and Gaussian.

With model (23) and the prior p(𝑿)p(\boldsymbol{X}), we use UAMP to update q(𝑿)q(\boldsymbol{X}). Following UAMP, a unitary transformation needs to be performed with the unitary matrix 𝑪XT\boldsymbol{C}_{X}^{T} obtained from the SVD of 𝑼¯X12\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}} (or eigenvalue decomposition (EVD), as the matrix is symmetric and positive definite), i.e.,

𝑼¯X12=𝑪X𝚲𝑪XT\displaystyle\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}=\boldsymbol{C}_{X}\boldsymbol{\Lambda}\boldsymbol{C}_{X}^{T} (24)

with 𝚲\boldsymbol{\Lambda} being a diagonal matrix. After performing the unitary transformation, we have

𝑹X=𝚽X𝑿+𝛀𝑿,\displaystyle\boldsymbol{R}_{X}=\boldsymbol{\Phi}_{X}\boldsymbol{X}+\boldsymbol{\Omega^{\prime}_{X}}, (25)

where 𝑹X=𝑪XT𝑼¯X12𝑿¯\boldsymbol{R}_{X}=\boldsymbol{C}_{X}^{T}\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}\overline{\boldsymbol{X}}, 𝚽X=𝑪XT𝑼¯X12=𝚲𝑪XT\boldsymbol{\Phi}_{X}=\boldsymbol{C}_{X}^{T}\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}=\boldsymbol{\Lambda}\boldsymbol{C}_{X}^{T} and 𝛀𝑿=𝑪XT𝛀𝑿\boldsymbol{\Omega^{\prime}_{X}}=\boldsymbol{C}_{X}^{T}\boldsymbol{\Omega_{X}}, which is still white and Gaussian.

From the above, it seems costly to obtain 𝑹X\boldsymbol{R}_{X} and 𝚽X\boldsymbol{\Phi}_{X} in model (25). Specifically, a matrix inverse operation needs to be performed according to (19) so that 𝑿¯\overline{\boldsymbol{X}} in (18) can be obtained; a matrix square root operation is needed to obtain 𝑼¯X12\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}; and an SVD operation is required to obtain 𝑪XT\boldsymbol{C}_{X}^{T}. Next, we show that the computations can be greatly simplified.

Instead of computing 𝑼¯X\overline{\boldsymbol{U}}_{X} and 𝑼¯X12\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}} followed by the SVD of 𝑼¯X12\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}, we perform an EVD of 𝑼¯X1=𝑾¯X=𝑯^T𝑯^+Tr(𝑼H)𝑽H\overline{\boldsymbol{U}}_{X}^{-1}=\overline{\boldsymbol{W}}_{X}=\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}}+\text{Tr}(\boldsymbol{U}_{H})\boldsymbol{V}_{H}, i.e.,

[𝑪X,𝑫X]=eig(𝑾¯X),\displaystyle[\boldsymbol{C}_{X},\boldsymbol{D}_{X}]=\text{eig}(\overline{\boldsymbol{W}}_{X}), (26)

where the diagonal matrix

𝑫X=𝚲2.\boldsymbol{D}_{X}=\boldsymbol{\Lambda}^{-2}. (27)

Hence 𝚽X=𝚲𝑪XT=𝑫X1/2𝑪XT\boldsymbol{\Phi}_{X}=\boldsymbol{\Lambda}\boldsymbol{C}_{X}^{T}=\boldsymbol{D}_{X}^{-1/2}\boldsymbol{C}_{X}^{T}, where the computation of 𝑫X1/2\boldsymbol{D}_{X}^{-1/2} is trivial as 𝑫X\boldsymbol{D}_{X} is a diagonal matrix. Meanwhile, the computation of the pseudo observation matrix 𝑹X\boldsymbol{R}_{X} can also be simplified:

𝑹X\displaystyle\boldsymbol{R}_{X} =\displaystyle= 𝑪XT𝑼¯X12𝑿¯\displaystyle\boldsymbol{C}_{X}^{T}\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}\overline{\boldsymbol{X}} (28)
=\displaystyle= 𝑪XT𝑼¯X12𝑼¯X𝑯^T𝒀\displaystyle\boldsymbol{C}_{X}^{T}\overline{\boldsymbol{U}}_{X}^{-\frac{1}{2}}\overline{\boldsymbol{U}}_{X}\hat{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{Y}
=\displaystyle= 𝑫X12𝑪XT𝑯^T𝒀.\displaystyle{\boldsymbol{D}_{X}^{-\frac{1}{2}}\boldsymbol{C}_{X}^{\textrm{T}}\hat{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{Y}}.

For convenience, we rewrite the unitary transformed pseudo observation model as

𝑫X12𝑪XT𝑯^T𝒀𝑹X=𝑫X12𝑪XT𝚽X𝑿+𝛀𝑿.\displaystyle{\underbrace{\boldsymbol{D}_{X}^{-\frac{1}{2}}\boldsymbol{C}_{X}^{\textrm{T}}\hat{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{Y}}_{\boldsymbol{R}_{X}}}=\underbrace{\boldsymbol{D}_{X}^{-\frac{1}{2}}\boldsymbol{C}_{X}^{\textrm{T}}}_{\boldsymbol{\Phi}_{X}}\boldsymbol{X}+\boldsymbol{\Omega^{\prime}_{X}}. (29)

The above leads to Lines 1-3 of UAMP-MF in Algorithm 2.
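For illustration, Lines 1-3 can be realized with a few NumPy calls as sketched below, assuming 𝑼_H = 𝑰_M so that Tr(𝑼_H) = M as in Algorithm 2; Hhat, V_H and Y stand for the current estimate of 𝑯, its column covariance and the observation matrix, and the function name is our own.

import numpy as np

def pseudo_model_X(Y, Hhat, V_H):
    M = Hhat.shape[0]
    W_X = Hhat.T @ Hhat + M * V_H          # Line 1: W_X is the inverse of U_bar_X
    d, C_X = np.linalg.eigh(W_X)           # Line 2: EVD of the symmetric matrix W_X
    Phi_X = (d ** -0.5)[:, None] * C_X.T   # Line 3: Phi_X = D_X^{-1/2} C_X^T
    R_X = Phi_X @ (Hhat.T @ Y)             # Line 3: R_X = D_X^{-1/2} C_X^T Hhat^T Y
    return R_X, Phi_X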

Due to the prior p(𝑿)p(\boldsymbol{X}), the use of exact q(𝑿)q(\boldsymbol{X}) often makes the message update intractable. Following UAMP in Algorithm 1, we project it to be Gaussian, which can also be interpreted as performing minimum mean squared error (MMSE) estimation based on the pseudo observation model (29) with prior p(𝑿)p(\boldsymbol{X}). These correspond to Lines 4-11 of UAMP-MF in Algorithm 2. We assume that the prior p(𝑿)p(\boldsymbol{X}) is separable, i.e., p(𝑿)=n,lp(xnl)p(\boldsymbol{X})=\prod_{n,l}p(x_{nl}); then the operations in Lines 10 and 11 are element-wise, as explained in the following. As the estimation is decoupled by (U)AMP, we assume the following scalar pseudo models

qnl=xnl+wnl,n=1,,N,l=1,,L,q_{nl}=x_{nl}+w_{nl},n=1,...,N,l=1,...,L, (30)

where qnlq_{nl} is the (n,l)(n,l)th element of 𝑸X\boldsymbol{Q}_{X} in Line 9 of the UAMP-MF algorithm, wnlw_{nl} represents Gaussian noise with zero mean and variance vnlv_{nl}, and vnlv_{nl} is the (n,l)(n,l)th element of 𝑽QX\boldsymbol{V}_{Q_{X}} in Line 8 of the UAMP-MF algorithm. This is significant, as the complex estimation is reduced to a number of simple scalar MMSE estimation problems based on (30) with prior p(xnl)p(x_{nl}). With the notations in Lines 10 and 11, for each entry xnlx_{nl} in 𝑿\boldsymbol{X}, the MMSE estimation leads to a Gaussian distribution

q(xnl)=𝒩(xnl;x^nl,vxnl),\displaystyle q^{\prime}(x_{nl})=\mathcal{N}(x_{nl};\hat{x}_{nl},v_{x_{nl}}), (31)

where x^nl\hat{x}_{nl} and vxnlv_{x_{nl}} are the (n,l)(n,l)th elements of 𝑿^\hat{\boldsymbol{X}} in Line 11 and of 𝚵X\boldsymbol{\Xi}_{X} in Line 10, respectively. We can see that each element xnlx_{nl} has its own variance. To facilitate subsequent processing, we make an approximation by assuming that the elements of each row in 𝑿^\hat{\boldsymbol{X}} have the same variance, which is the average of their variances. Then they are collectively characterized by a matrix normal distribution, i.e., q(𝑿)=𝒩(𝑿;𝑿^,𝑼X,𝑽X)q(\boldsymbol{X})=\mathcal{MN}(\boldsymbol{X};\hat{\boldsymbol{X}},\boldsymbol{U}_{X},\boldsymbol{V}_{X}) with 𝑼X=diag(mean(𝚵X,2))\boldsymbol{U}_{X}=\text{diag}(\text{mean}(\boldsymbol{\Xi}_{X},2)) and 𝑽X=𝑰L\boldsymbol{V}_{X}=\boldsymbol{I}_{L}, where mean(𝚵X,2)\text{mean}(\boldsymbol{\Xi}_{X},2) represents the average operation on the rows of 𝚵X\boldsymbol{\Xi}_{X}. This leads to Line 12 of UAMP-MF in Algorithm 2.
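As a concrete (and deliberately simple) instance of Lines 10-12, the sketch below assumes a separable zero-mean Gaussian prior 𝒩(x_nl; 0, v0), for which the element-wise MMSE estimates have closed forms; the per-entry variances are then averaged along rows to form 𝑼_X. The priors used in Section IV lead to different denoisers but to the same structure; the function name and v0 are illustrative.

import numpy as np

def update_qX_gaussian_prior(Q_X, V_QX, v0=1.0):
    gain = v0 / (v0 + V_QX)              # element-wise posterior gain for the prior N(0, v0)
    X_hat = gain * Q_X                   # Line 11: element-wise posterior means
    Xi_X = gain * V_QX                   # Line 10: element-wise posterior variances
    U_X = np.diag(Xi_X.mean(axis=1))     # Line 12: U_X = diag(mean(Xi_X, 2)), with V_X = I_L
    return X_hat, Xi_X, U_X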

III-C Update of q(𝐇)q(\boldsymbol{H})

With the updated q(𝑿)=𝒩(𝑿;𝑿^,𝑼X,𝑽X)q(\boldsymbol{X})=\mathcal{MN}(\boldsymbol{X};\hat{\boldsymbol{X}},\boldsymbol{U}_{X},\boldsymbol{V}_{X}) and q(λ)q(\lambda), we compute the message from fYf_{Y} to 𝑯\boldsymbol{H} according to VI. Regarding the message, we have the following result.

Proposition 2: The message from fYf_{Y} to 𝑯\boldsymbol{H} is a matrix normal distribution about 𝑯\boldsymbol{H}, which can be expressed as

mfYH(𝑯)𝒩(𝑯;𝑯¯,𝑰M,λ^1𝑽¯H),m_{f_{Y}\to H}(\boldsymbol{H})\propto\mathcal{MN}(\boldsymbol{H};\overline{\boldsymbol{H}},\boldsymbol{I}_{M},\hat{\lambda}^{-1}\overline{\boldsymbol{V}}_{H}), (32)

with

𝑯¯=𝒀𝑿^T𝑽¯H.\overline{\boldsymbol{H}}=\boldsymbol{Y}\hat{\boldsymbol{X}}^{\textrm{T}}\overline{\boldsymbol{V}}_{H}. (33)

and

𝑽¯H=(𝑿^𝑿^T+Tr(𝑽X)𝑼X)1,\overline{\boldsymbol{V}}_{H}=\left(\hat{\boldsymbol{X}}\hat{\boldsymbol{X}}^{\textrm{T}}+\text{Tr}(\boldsymbol{V}_{X})\boldsymbol{U}_{X}\right)^{-1}, (34)

where λ^=λλq(λ)\hat{\lambda}=\int_{\lambda}\lambda q(\lambda).

Proof.

See Appendix B. ∎

Then we combine the message mfYH(𝑯)m_{f_{Y}\to H}(\boldsymbol{H}) with the prior of 𝑯\boldsymbol{H} to update q(𝑯)q(\boldsymbol{H}). This will also be realized with UAMP through a whitening process. The procedure is similar to that for q(𝑿)q(\boldsymbol{X}), and the difference is that the pseudo model is established row by row (rather than column by column) by considering the form of mfYH(𝑯)m_{f_{Y}\to H}(\boldsymbol{H}).

With the message mfYH(𝑯)m_{f_{Y}\to H}(\boldsymbol{H}) and

𝑯¯=(𝒉¯1T𝒉¯MT),\overline{\boldsymbol{H}}=\left(\begin{matrix}\overline{\boldsymbol{h}}_{1}^{\textrm{T}}\\ \vdots\\ \overline{\boldsymbol{h}}_{M}^{\textrm{T}}\end{matrix}\right), (35)

where 𝒉¯mT1×N\overline{\boldsymbol{h}}_{m}^{\textrm{T}}\in\mathcal{R}^{1\times N} is the mmth row vector, we have the pseudo observation model

𝒉¯m=𝒉m+𝒆h,\displaystyle\overline{\boldsymbol{h}}_{m}=\boldsymbol{h}_{m}+\boldsymbol{e}_{h}, (36)

where 𝒆h𝒩(𝒆h;0,λ^1𝑽¯H)\boldsymbol{e}_{h}\sim\mathcal{N}(\boldsymbol{e}_{h};0,\hat{\lambda}^{-1}\overline{\boldsymbol{V}}_{H}). Whitening operation to (36) leads to

𝑽¯H12𝒉¯m=𝑽¯H12𝒉m+𝒘h,\displaystyle\overline{\boldsymbol{V}}_{H}^{-\frac{1}{2}}\overline{\boldsymbol{h}}_{m}=\overline{\boldsymbol{V}}_{H}^{-\frac{1}{2}}\boldsymbol{h}_{m}+\boldsymbol{w}_{h}, (37)

where 𝒘h=𝑽¯H12𝒆h\boldsymbol{w}_{h}=\overline{\boldsymbol{V}}_{H}^{-\frac{1}{2}}\boldsymbol{e}_{h}, which is white and Gaussian with covariance λ^1𝑰\hat{\lambda}^{-1}\boldsymbol{I}. Collecting all rows and representing them in matrix form, we have

𝑽¯H12𝑯¯T=𝑽¯H12𝑯T+𝛀H.\displaystyle\overline{\boldsymbol{V}}_{H}^{-\frac{1}{2}}\overline{\boldsymbol{H}}^{\textrm{T}}=\overline{\boldsymbol{V}}_{H}^{-\frac{1}{2}}\boldsymbol{H}^{\textrm{T}}+\boldsymbol{\Omega}_{H}. (38)

UAMP is then employed with the model (38). Using the ideas in the previous section for updating q(𝑿)q(\boldsymbol{X}), we obtain the unitary transformed model efficiently.

We first perform an EVD to matrix 𝑽¯H1=𝑾¯H=𝑿^𝑿^T+Tr(𝑽X)𝑼X\overline{\boldsymbol{V}}_{H}^{-1}=\overline{\boldsymbol{W}}_{H}=\hat{\boldsymbol{X}}\hat{\boldsymbol{X}}^{\textrm{T}}+\text{Tr}(\boldsymbol{V}_{X})\boldsymbol{U}_{X}, i.e.,

[𝑪H,𝑫H]=eig(𝑾¯H).\displaystyle[\boldsymbol{C}_{H},\boldsymbol{D}_{H}]=\text{eig}(\overline{\boldsymbol{W}}_{H}). (39)

Then the unitary transformed model is given as

𝑹HT=𝚽HT𝑯T+𝛀H\displaystyle\boldsymbol{R}_{H}^{\textrm{T}}=\boldsymbol{\Phi}_{H}^{\textrm{T}}\boldsymbol{H}^{\textrm{T}}+\boldsymbol{\Omega}^{\prime}_{H} (40)

where

𝚽H\displaystyle\boldsymbol{\Phi}_{H} =\displaystyle= 𝑫H12𝑪HT\displaystyle\boldsymbol{D}_{H}^{-\frac{1}{2}}\boldsymbol{C}_{H}^{\textrm{T}} (41)
𝑹H\displaystyle\boldsymbol{R}_{H} =\displaystyle= 𝑫H12𝑪HT𝑿^𝒀T,\displaystyle\boldsymbol{D}_{H}^{-\frac{1}{2}}\boldsymbol{C}_{H}^{\textrm{T}}\hat{\boldsymbol{X}}\boldsymbol{Y}^{\textrm{T}}, (42)

and 𝛀H\boldsymbol{\Omega}^{\prime}_{H} is white and Gaussian. The above leads to Lines 13-15 of UAMP-MF in Algorithm 2.

Following UAMP in Algorithm 1, we obtain q(𝑯)q(\boldsymbol{H}) and project it to be Gaussian, which corresponds to Lines 16-23 of UAMP-MF in Algorithm 2. Similar to the case for updating q(𝑿)q(\boldsymbol{X}), to accommodate {q(hmn)}\{q(h_{mn})\} with a matrix normal distribution, we make an approximation by assuming the entries in each column of 𝑯\boldsymbol{H} share a common variance, which is the average of their variances. Then q(𝑯)=𝒩(𝑯;𝑯^,𝑼H,𝑽H)q(\boldsymbol{H})=\mathcal{MN}(\boldsymbol{H};\hat{\boldsymbol{H}},\boldsymbol{U}_{H},\boldsymbol{V}_{H}) with 𝑽H=diag(mean(𝚵H,1))\boldsymbol{V}_{H}=\text{diag}(\text{mean}(\boldsymbol{\Xi}_{H},1)) and 𝑼H=𝑰M\boldsymbol{U}_{H}=\boldsymbol{I}_{M} (i.e., Line 24 of UAMP-MF in Algorithm 2), where mean(𝚵H,1)\text{mean}(\boldsymbol{\Xi}_{H},1) represents the average operation on the columns of 𝚵H\boldsymbol{\Xi}_{H}.

III-D Update of q(λ)q(\lambda)

With q(𝑯)=𝒩(𝑯;𝑯^,𝑰M,𝑽H)q(\boldsymbol{H})=\mathcal{MN}\left(\boldsymbol{H};\hat{\boldsymbol{H}},\boldsymbol{I}_{M},\boldsymbol{V}_{H}\right) and q(𝑿)=𝒩(𝑿;𝑿^,𝑼X,𝑰L)q(\boldsymbol{X})=\mathcal{MN}\left(\boldsymbol{X};\hat{\boldsymbol{X}},\boldsymbol{U}_{X},\boldsymbol{I}_{L}\right), we update q(λ)q(\lambda), i.e., we combine the message mfYλ(λ)m_{f_{Y}\to\lambda}(\lambda) from fYf_{Y} to λ\lambda with its prior p(λ)p(\lambda).

Proposition 3: The message from fYf_{Y} to λ\lambda can be expressed as

mfYλ(λ)λMLexp(λC),\displaystyle m_{f_{Y}\to\lambda}(\lambda)\propto\lambda^{ML}\exp\Big{(}-\lambda C\Big{)}, (43)

where

C=𝒀𝑯^𝑿^F2+MTr(𝑿^𝑿^T𝑽H)\displaystyle\!\!C=\|\boldsymbol{Y}-\hat{\boldsymbol{H}}\hat{\boldsymbol{X}}\|_{F}^{2}+M\text{Tr}\Big{(}\hat{\boldsymbol{X}}\hat{\boldsymbol{X}}^{\textrm{T}}\boldsymbol{V}_{H}\Big{)}
+LTr(𝑼X𝑯^T𝑯^)+MLTr(𝑼X𝑽H).\displaystyle\ \ \ \ \ \ \ \ +L\text{Tr}(\boldsymbol{U}_{X}\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}})+ML\text{Tr}(\boldsymbol{U}_{X}\boldsymbol{V}_{H}). (44)
Proof.

See Appendix C. ∎

With the message mfYλ(λ)m_{f_{Y}\to\lambda}(\lambda) and the prior p(λ)1/λp(\lambda)\propto 1/\lambda, q(λ)q(\lambda) can be computed as

q(λ)mfYλ(λ)p(λ)=λML1exp(λC),q(\lambda)\propto m_{f_{Y}\to\lambda}(\lambda)p(\lambda)=\lambda^{ML-1}\exp\Big{(}-\lambda C\Big{)}, (45)

which is a Gamma distribution. Then the mean of λ\lambda is obtained as

λ^=λλq(λ)=ML/C,\displaystyle\hat{\lambda}={\int_{\lambda}\lambda q(\lambda)}={ML}/{C}, (46)

which is Line 25 of UAMP-MF.
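As a sketch, the noise-precision update in Line 25 amounts to evaluating C in (44) and setting lambda_hat = ML/C, e.g. (the function name is our own):

import numpy as np

def update_lambda(Y, H_hat, X_hat, U_X, V_H):
    M, L = Y.shape
    C = (np.linalg.norm(Y - H_hat @ X_hat, 'fro') ** 2
         + M * np.trace(X_hat @ X_hat.T @ V_H)
         + L * np.trace(U_X @ H_hat.T @ H_hat)
         + M * L * np.trace(U_X @ V_H))           # C in (44)
    return M * L / C                              # lambda_hat in (46)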

III-E Discussion and Computational Complexity

Regarding UAMP-MF in Algorithm 2, we have the following remarks:

  • We do not specify the priors of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} in Algorithm 2. The detailed implementations of Lines 10, 11, 22 and 23 of the algorithm depend on the priors in a concrete scenario. Various examples are provided in Section IV.

  • As the MF problem often has local minima, we can use the strategy of restart to mitigate the issue of being stuck at local minima. In addition, the iterative process can be terminated based on some criterion, e.g., the normalized difference between the estimates of two consecutive iterations is smaller than a threshold.

  • There are often hyper-parameters in the priors of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X}. If we do not have knowledge of these hyper-parameters, their values need to be learned or tuned automatically. Thanks to the factor graph and message passing framework, these extra tasks can be implemented by extending the factor graph in Fig. 1.

Refer to caption
Figure 2: Factor graph for learning hyper-parameters.

In the following, we show how to learn the hyper-parameters of the prior of 𝑿\boldsymbol{X}. We assume that 𝑿\boldsymbol{X} is sparse and the sparsity rate is unknown. In this case, we may employ the sparsity promoting hierarchical Gaussian-Gamma prior [27]. The prior is then p(xnl,γnl)=p(xnl|γnl)p(γnl)p(x_{nl},\gamma_{nl})=p(x_{nl}|\gamma_{nl})p(\gamma_{nl}), where p(xnl|γnl)=𝒩(xnl;0,γnl1)fxnlp(x_{nl}|\gamma_{nl})=\mathcal{N}(x_{nl};0,\gamma_{nl}^{-1})\triangleq f_{x_{nl}} and p(γnl)=Ga(γnl;ϵ,η)p(\gamma_{nl})=Ga(\gamma_{nl};\epsilon,\eta) with ϵ\epsilon and η\eta being the shape parameter and scale parameter of the Gamma distribution, respectively. While the precision γnl\gamma_{nl} is to be learned, the values for ϵ\epsilon and η\eta are often set empirically, e.g., ϵ=η=0\epsilon=\eta=0 [27]. It is worth mentioning that ϵ\epsilon can be tuned automatically to improve the performance [28].

The factor graph representation for this part is shown in Fig. 2. Also note that UAMP decouples the estimation as shown by the pseudo model (30), indicating that the incoming message to the factor graph in Fig. 2 is Gaussian, i.e.,

mxnlfxnl(xnl)=𝒩(xnl;qnl,vnl).m_{x_{nl}\to f_{x_{nl}}}(x_{nl})=\mathcal{N}(x_{nl};q_{nl},v_{nl}). (47)

We perform inference on xnlx_{nl} and γnl\gamma_{nl}, which can also be achieved by using VI with a variational distribution q(xnl,γnl)=q(xnl)q(γnl)q(x_{nl},\gamma_{nl})=q(x_{nl})q(\gamma_{nl}).

According to VI,

q(xnl)\displaystyle q(x_{nl}) \displaystyle\propto mxnlfxnl(xnl)mfxnlxnl(xnl)\displaystyle m_{x_{nl}\to f_{x_{nl}}}(x_{nl})m_{f_{x_{nl}}\to x_{nl}}(x_{nl}) (48)
=\displaystyle= 𝒩(xnl;x^nl,vxnl)\displaystyle\mathcal{N}(x_{nl};\hat{x}_{nl},v_{x_{nl}})

with mfxnlxnl(xnl)m_{f_{x_{nl}}\to x_{nl}}(x_{nl}) shown in (52), and

vxnl=vqnl1+γ^nlvqnl,x^nl=q^nl1+γ^nlvqnl.\displaystyle v_{x_{nl}}=\frac{v_{q_{nl}}}{1+\hat{\gamma}_{nl}v_{q_{nl}}},~{}\hat{x}_{nl}=\frac{\hat{q}_{nl}}{1+\hat{\gamma}_{nl}v_{q_{nl}}}. (49)

The message mfxnlγnl(γnl)m_{f_{x_{nl}}\to\gamma_{nl}}(\gamma_{nl}) can be expressed as

mfxnlγnl(γnl)\displaystyle m_{f_{x_{nl}}\to\gamma_{nl}}(\gamma_{nl}) \displaystyle\propto exp(xnlq(xnl)log𝒩(xnl;0,γnl1)),\displaystyle\exp\big{(}\int_{x_{nl}}q(x_{nl})\log\mathcal{N}(x_{nl};0,\gamma_{nl}^{-1})\big{)}, (50)
\displaystyle\propto γnl12exp(γnl2(|x^nl|2+vxnl)).\displaystyle\gamma_{nl}^{\frac{1}{2}}\exp\big{(}-\frac{\gamma_{nl}}{2}(|\hat{x}_{nl}|^{2}+v_{x_{nl}})\big{)}.

The message mfγnlγnl(γnl)m_{f_{\gamma_{nl}}\to\gamma_{nl}}(\gamma_{nl}) is the predefined Gamma distribution, i.e., mfγnlγnl(γnl)γnlϵ1exp(ηγnl)m_{f_{\gamma_{nl}}\to\gamma_{nl}}(\gamma_{nl})\propto\gamma_{nl}^{\epsilon-1}\exp(-\eta\gamma_{nl}). According to VI,

q(γnl)\displaystyle q(\gamma_{nl}) \displaystyle\propto mfγnlγnl(γnl)mfxnlγnl(γnl)\displaystyle m_{f_{\gamma_{nl}}\to\gamma_{nl}}(\gamma_{nl})m_{f_{x_{nl}}\to\gamma_{nl}}(\gamma_{nl}) (51)
\displaystyle\propto γnlϵ12exp(γnl2(|x^nl|2+vxnl+2η)).\displaystyle\gamma_{nl}^{\epsilon-\frac{1}{2}}\exp\big{(}-\frac{\gamma_{nl}}{2}(|\hat{x}_{nl}|^{2}+v_{x_{nl}}+2\eta)\big{)}.

Thus

mfxnlxnl(xnl)\displaystyle m_{f_{x_{nl}}\to x_{nl}}(x_{nl}) \displaystyle\propto exp(γnlq(γnl)log𝒩(xnl;0,γnl1)),\displaystyle\exp\big{(}\int_{\gamma_{nl}}q(\gamma_{nl})\log\mathcal{N}(x_{nl};0,\gamma_{nl}^{-1})\big{)}, (52)
\displaystyle\propto 𝒩(xnl;0,γ^nl1),\displaystyle\mathcal{N}(x_{nl};0,\hat{\gamma}_{nl}^{-1}),

where

γ^nl=1+2ϵ2η+vxnl+|x^nl|2.\displaystyle\hat{\gamma}_{nl}=\frac{1+2\epsilon}{2\eta+v_{x_{nl}}+|\hat{x}_{nl}|^{2}}. (53)

The above computations for all the entries of 𝑿\boldsymbol{X} can be collectively expressed as

𝚵X=𝑽QX./(𝟏+𝚪^𝑽QX),\displaystyle\boldsymbol{\Xi}_{X}=\boldsymbol{V}_{Q_{X}}./\left(\mathbf{1}+\hat{\boldsymbol{\Gamma}}\cdot\boldsymbol{V}_{Q_{X}}\right), (54)
𝑿^=𝑸^X./(𝟏+𝚪^𝑽QX),\displaystyle\hat{\boldsymbol{X}}={\hat{\boldsymbol{Q}}_{X}}./\left(\mathbf{1}+\hat{\boldsymbol{\Gamma}}\cdot\boldsymbol{V}_{Q_{X}}\right), (55)
𝚪^=(𝟏+2ϵ𝟏)./(2η𝟏+𝚵X+|𝑿^|.2),\displaystyle\hat{\boldsymbol{\Gamma}}=(\mathbf{1}+2\epsilon\mathbf{1})./\left(2\eta\mathbf{1}+\boldsymbol{\Xi}_{X}+|\hat{\boldsymbol{X}}|^{.2}\right), (56)

where 𝚪^\hat{\boldsymbol{\Gamma}} is a matrix with {γnl}\{\gamma_{nl}\} as its entries.
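The matrix-form updates (54)-(56) translate directly into element-wise NumPy operations, as sketched below; eps and eta are the Gamma hyper-parameters, set to zero here following the empirical choice mentioned above, and the function name is illustrative.

import numpy as np

def gaussian_gamma_update(Q_X, V_QX, Gamma_hat, eps=0.0, eta=0.0):
    denom = 1.0 + Gamma_hat * V_QX
    Xi_X = V_QX / denom                   # (54)
    X_hat = Q_X / denom                   # (55)
    Gamma_hat = (1.0 + 2.0 * eps) / (2.0 * eta + Xi_X + np.abs(X_hat) ** 2)   # (56)
    return X_hat, Xi_X, Gamma_hat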

We see that UAMP-MF in Algorithm 2 involves matrix multiplications and two EVDs per iteration, which dominate its complexity. The complexity of the algorithm is therefore in a cubic order, which is low for an MF problem (even the product of two matrices has cubic complexity). In the next section, we will compare the proposed algorithm with state-of-the-art methods in various scenarios. As the algorithms have different computational complexity per iteration and the numbers of iterations required for convergence are also different, we will compare their complexity in terms of runtime, as in the literature. Extensive examples show that UAMP-MF is much faster than state-of-the-art algorithms in many scenarios, and often delivers significantly better performance.

IV Applications and Numerical Results

We focus on the applications of UAMP-MF to NMF, RPCA, DL, CSMU and sparse MF. Extensive comparisons with state-of-the-art algorithms are provided to demonstrate its merits in terms of robustness, recovery accuracy and complexity.

IV-A UAMP-MF for Robust Principal Components Analysis

In RPCA [4], we aim to estimate a low-rank matrix observed in the presence of noise and outliers. The signal model for RPCA can be written as

𝒀=𝑨𝑩+𝑬+𝑾,\displaystyle\boldsymbol{Y}=\boldsymbol{A}\boldsymbol{B}+\boldsymbol{E}+\boldsymbol{W}, (57)

where the product of a tall matrix 𝑨M×N\boldsymbol{A}\in\mathcal{R}^{M\times N} and a wide matrix 𝑩N×L\boldsymbol{B}\in\mathcal{R}^{N\times L} is a low-rank matrix, 𝑬\boldsymbol{E} is a sparse outlier matrix, and 𝑾\boldsymbol{W} is a noise matrix, which is assumed to be white and Gaussian. To enable the use of UAMP-MF, the model (57) can be rewritten as [5]

𝒀=[𝑨,𝑰]𝑯[𝑩𝑬]𝑿+𝑾,\displaystyle{\boldsymbol{Y}}=\underbrace{\Big{[}\boldsymbol{A},\boldsymbol{I}]}_{{\boldsymbol{H}}}\underbrace{\begin{bmatrix}\boldsymbol{B}\\ \boldsymbol{E}\end{bmatrix}}_{\boldsymbol{X}}+{\boldsymbol{W}}, (58)

which is an MF problem.
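The reformulation (58) is a purely structural rearrangement, as the small NumPy sketch below illustrates; the sizes, sparsity rate and noise level used here are illustrative only.

import numpy as np

rng = np.random.default_rng(1)
M, N, L, delta = 200, 50, 200, 0.1
A = rng.standard_normal((M, N))
B = rng.standard_normal((N, L))
E = (rng.random((M, L)) < delta) * rng.uniform(-10, 10, size=(M, L))   # sparse outliers
W = 0.01 * rng.standard_normal((M, L))
Y = A @ B + E + W                       # RPCA model (57)
H = np.hstack([A, np.eye(M)])           # H = [A, I]
X = np.vstack([B, E])                   # X = [B; E]
assert np.allclose(Y, H @ X + W)        # identical to the MF model (58)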

In (58), part of matrix 𝑯{\boldsymbol{H}} is known, and part of matrix 𝑿{\boldsymbol{X}} is sparse. These can be captured by imposing proper priors on the matrix entries. Similar to [5], for matrix 𝑯{\boldsymbol{H}}, the priors are

p(hm,n)\displaystyle p({h}_{m,n}) =\displaystyle= {𝒩(hm,n;0,1)n≤N𝒩(hm,n;Im,n,0)n>N\displaystyle\begin{cases}\mathcal{N}({h}_{m,n};0,1)&n\leq N\\ \mathcal{N}({h}_{m,n};I_{m,n},0)&n>N\end{cases} (59)

where the entries corresponding to matrix 𝑰{\boldsymbol{I}} have a zero variance, indicating that they are deterministic and known. For the sparse matrix 𝑬{\boldsymbol{E}}, the sparsity rate is unknown, and we use the hierarchical Gaussian-Gamma priors. Then we have

p(xn,l)\displaystyle p({x}_{n,l}) =\displaystyle= {𝒩(xn,l;0,αx)n≤N𝒩(xn,l;0,γn,l1)n>N\displaystyle\begin{cases}\mathcal{N}({x}_{n,l};0,\alpha_{x})&n\leq N\\ \mathcal{N}({x}_{n,l};0,\gamma^{-1}_{n,l})&n>N\end{cases} (60)

where the precisions have Gamma distribution p(γnl)=Ga(γnl;ϵ,η)p(\gamma_{nl})=Ga(\gamma_{nl};\epsilon,\eta). If the knowledge about the parameter αx\alpha_{x} is unavailable, it can also be learned using VI, where the estimate of αx\alpha_{x} can be updated as α^x=𝚵X+|𝑿^|.22/NM\hat{\alpha}_{x}=||\boldsymbol{\Xi}^{\prime}_{X}+|\hat{\boldsymbol{X}}^{\prime}|^{.2}||^{2}/NM with

𝚵X\displaystyle\boldsymbol{\Xi}^{\prime}_{X} =\displaystyle= α^x𝑽QX./(α^x𝟏+𝑽QX),\displaystyle{\hat{\alpha}^{\prime}_{x}\boldsymbol{V}^{\prime}_{Q_{X}}}./{(\hat{\alpha}^{\prime}_{x}\mathbf{1}+\boldsymbol{V}^{\prime}_{Q_{X}})}, (61)
𝑿^\displaystyle\hat{\boldsymbol{X}}^{\prime} =\displaystyle= α^x𝑸^X./(α^x𝟏+𝑽QX),\displaystyle{\hat{\alpha}^{\prime}_{x}\hat{\boldsymbol{Q}}^{\prime}_{X}}./{(\hat{\alpha}^{\prime}_{x}\mathbf{1}+\boldsymbol{V}^{\prime}_{Q_{X}})}, (62)

where 𝑽QX\boldsymbol{V}^{\prime}_{Q_{X}} and 𝑸^X\hat{\boldsymbol{Q}}^{\prime}_{X} represent the upper (nNn\leq N) part of matrices 𝑽QX\boldsymbol{V}_{Q_{X}} and 𝑸^X\hat{\boldsymbol{Q}}_{X}, and α^x\hat{\alpha}^{\prime}_{x} is the estimate of αx\alpha_{x} in the last round of iteration. In addition, the rank NN of matrix 𝒁=𝑨𝑩\boldsymbol{Z}=\boldsymbol{A}\boldsymbol{B} is normally unknown, and rank-selection can be performed using the method in [29].

We compare UAMP-MF with state-of-the-art algorithms including Bi-GAMP [5], IALM [12] and LMaFit [13]. The performance is evaluated in terms of the normalized mean squared error (NMSE) of matrix 𝒁\boldsymbol{Z}, which is defined as

NMSE(Z)=𝒁𝑯^𝑿^2𝒁2.\displaystyle\text{NMSE(Z)}=\frac{\|\boldsymbol{Z}-\hat{\boldsymbol{H}}\hat{\boldsymbol{X}}\|^{2}}{\|\boldsymbol{Z}\|^{2}}. (63)

In the numerical examples, the low rank matrix 𝒁=𝑯𝑿\boldsymbol{Z}=\boldsymbol{H}\boldsymbol{X} is generated from 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} with i.i.d. standard Gaussian entries. The sparsity rate of matrix 𝑬\boldsymbol{E} is denoted by δ\delta, and the non-zero entries are independently drawn from a uniform distribution on [10,10][-10,10], which are located uniformly at random in 𝑬\boldsymbol{E}. It is noted that, with these settings, the magnitudes of the outliers and the entries of 𝒁\boldsymbol{Z} are on the same order, making the detection of the outliers challenging. The NMSE performance of various algorithms versus SNR is shown in Fig. 3, where δ=0.1\delta=0.1, M=L=200M=L=200, and the rank N=50,70N=50,70 and 9090 for (a), (b) and (c), respectively. It can be seen that IALM does not work well, and LMaFit struggles, delivering poor performance when N=70N=70 and 9090. Bi-GAMP and UAMP-MF deliver similar performance when N=50N=50, but when N=70N=70 and 9090, UAMP-MF still works very well while the performance of Bi-GAMP degrades significantly. We also show the performance of UAMP-MF with known rank NN, which is denoted by UAMP-MF(N) in Fig. 3, where we see that UAMP-MF has almost the same performance as UAMP-MF(N). Then, with M=L=200M=L=200, we vary the rank NN from 30 to 120 and the sparsity rate δ\delta from 0.05 to 0.3. The NMSE of the algorithms with different combinations of δ\delta and NN is shown in Fig. 4, where SNR = 60 dB. We see that the blue regions of UAMP-MF are significantly larger than those of the other algorithms, indicating the superior performance of UAMP-MF.

Refer to caption
Figure 3: NMSE of the algorithms for RPCA, where δ=0.1\delta=0.1, and N=50,70N=50,70 and 9090 in (a), (b) and (c), respectively.
Refer to caption
Figure 4: NMSE of the algorithms versus the combinations of rank NN and sparsity ratio δ\delta in RPCA.
Refer to caption
Figure 5: NMSE of the algorithms versus correlation parameter ρ\rho in RPCA, where (a) N=60N=60 and (b) N=80N=80.
Refer to caption
Figure 6: Runtime of the algorithms versus (a) SNR and (b) ρ\rho.

Next we consider a more challenging case. All the settings are the same as in the previous example except that the generated matrices 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} are no longer i.i.d., but correlated. Take 𝑯\boldsymbol{H} as an example; it is generated as

𝑯=𝑪L𝑮𝑪R,\displaystyle\boldsymbol{H}=\boldsymbol{C}_{L}\boldsymbol{G}\boldsymbol{C}_{R}, (64)

where 𝑮\boldsymbol{G} is an i.i.d. Gaussian matrix, 𝑪L\boldsymbol{C}_{L} is an M×MM\times M matrix with the (m,n)(m,n)th element given by ρ|mn|\rho^{|m-n|} where ρ[0,1]\rho\in[0,1], and 𝑪R\boldsymbol{C}_{R} is generated in the same way but with size N×NN\times N. The parameter ρ\rho controls the correlation of matrix 𝑯\boldsymbol{H}. Matrix 𝑿\boldsymbol{X} is generated in the same way. We show the performance of the algorithms versus the correlation parameter ρ\rho in Fig. 5, where SNR = 60 dB, the rank N=60N=60 and 8080 in (a) and (b), respectively. The results show that UAMP-MF is much more robust and it outperforms other algorithms significantly when ρ\rho is relatively large.
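For reproducibility of this type of experiment, the correlated matrices in (64) can be generated as in the following sketch; the function name and the random number generator are our own choices.

import numpy as np

def correlated_matrix(M, N, rho, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    G = rng.standard_normal((M, N))
    C_L = rho ** np.abs(np.subtract.outer(np.arange(M), np.arange(M)))   # (C_L)_{m,n} = rho^{|m-n|}
    C_R = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
    return C_L @ G @ C_R                                                 # H in (64)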

The runtime of the algorithms is shown in Fig. 6, where the simulation parameter settings in Figs. 6(a) and (b) are the same as those for Fig. 3(b) and Fig. 5(a), respectively. The results show that UAMP-MF can be much faster than Bi-GAMP while delivering better performance. IALM and LMaFit are faster than UAMP-MF, but they may suffer from severe performance degradation as shown in Figs. 3 - 5.

IV-B UAMP-MF for Dictionary Learning

In DL, we aim to find a dictionary 𝑯M×N\boldsymbol{H}\in\mathcal{R}^{M\times N} that allows the training samples 𝒀M×L\boldsymbol{Y}\in\mathcal{R}^{M\times L} to be coded as

𝒀=𝑯𝑿+𝑾\displaystyle\boldsymbol{Y}=\boldsymbol{H}\boldsymbol{X}+\boldsymbol{W} (65)

for some sparse matrix 𝑿\boldsymbol{X} and some perturbation 𝑾\boldsymbol{W}. To avoid over-fitting, a sufficiently large number of training samples is needed. For the entries of 𝑯\boldsymbol{H}, we use the Gaussian prior shown in (11), and for the entries of 𝑿\boldsymbol{X}, we use the hierarchical Gaussian-Gamma prior shown in (12) and (13), as we assume an unknown sparsity rate.

Figure 7: NMSE of the algorithms versus combinations of NN and LL for DL, where M=NM=N and ρ=0.2\rho=0.2.
Figure 8: NMSE of the algorithms versus ρ\rho for DL, where M=N=100M=N=100, (a) L=600L=600 and (b) L=800L=800.

We compare UAMP-MF with state-of-the-art algorithms, including BAd-VAMP [17], K-SVD [9] and SPAMS [10]. It is worth mentioning that, in the case of DL, BAd-VAMP offers significantly better robustness and performance than Bi-GAMP [17]. To test the robustness of the algorithms, we generate the correlated matrix 𝑯\boldsymbol{H} using (64) in the numerical examples. The sparse matrix 𝑿\boldsymbol{X} is generated by selecting the locations of the non-zero entries in each column uniformly at random and drawing their values from the standard Gaussian distribution, while all other entries are set to zero. The performance of DL is evaluated using the relative NMSE [17], defined as

NMSE(H)=min𝑱𝑯^𝑱𝑯2𝑯2,\displaystyle\text{NMSE(H)}=\underset{\boldsymbol{J}}{\text{min}}\frac{\|\hat{\boldsymbol{H}}\boldsymbol{J}-\boldsymbol{H}\|^{2}}{\|\boldsymbol{H}\|^{2}}, (66)

where 𝑱\boldsymbol{J} is a matrix accounting for the permutation and scale ambiguities.
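
If the matrix J in (66) is left unconstrained, the minimization admits a closed-form least-squares solution, which is how the metric is evaluated in the sketch below (treating J as unconstrained is an assumption consistent with (66)):

import numpy as np

def nmse_dictionary(H_hat, H):
    # Least-squares fit of the ambiguity matrix J in (66): J = pinv(H_hat) H
    J = np.linalg.pinv(H_hat) @ H
    num = np.linalg.norm(H_hat @ J - H, 'fro')**2
    return num / np.linalg.norm(H, 'fro')**2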

Figure 9: NMSE of the algorithms versus NN for DL, where M=100M=100, L=600L=600, (a) ρ=0.1\rho=0.1 and (b) ρ=0.3\rho=0.3.
Figure 10: Runtime of the algorithms for DL (a) versus LL, where M=N=100M=N=100 and ρ=0.1\rho=0.1, and (b) versus NN, where M=100M=100, L=600L=600 and ρ=0.1\rho=0.1.

We first assume square dictionaries (i.e., M=NM=N) and vary the values of NN and LL. The NMSE performance of the algorithms is shown in Fig. 7, where SNR = 50 dB. It can be seen that, when the correlation parameter ρ=0.2\rho=0.2, neither K-SVD nor SPAMS has satisfactory performance. In contrast, UAMP-MF and BAd-VAMP work well for a number of combinations of NN and LL, but the blue region of UAMP-MF is significantly larger than that of BAd-VAMP. We show the performance of the algorithms versus the correlation parameter ρ\rho in Fig. 8, where N=100N=100, L=500L=500 in (a) and L=800L=800 in (b). It can be seen that UAMP-MF exhibits much higher robustness than the other algorithms.

Next, we test the performance of the algorithms when learning an over-complete dictionary, i.e., N>MN>M. The settings are the same as those in the previous experiments, except that we fix MM and LL and vary the value of NN. The NMSE of the algorithms is shown in Fig. 9, where M=100M=100 and L=600L=600. The correlation parameter ρ\rho is set to 0.10.1 in (a) and 0.30.3 in (b). Again, K-SVD and SPAMS do not work well. Comparing BAd-VAMP with UAMP-MF, we see that UAMP-MF performs significantly better, especially when ρ=0.3\rho=0.3.

The runtime versus LL of the algorithms is shown in Fig. 10(a), where M=N=100M=N=100 and ρ=0.1\rho=0.1, and the runtime versus NN is shown in Fig. 10(b), where M=100M=100, L=600L=600, and ρ=0.1\rho=0.1. The SNR is set to 50 dB. The results show that UAMP-MF can be much faster than the other algorithms.

IV-C UAMP-MF for Compressive Sensing with Matrix Uncertainty

In practical applications of compressive sensing, we often do not know the measurement matrix 𝑯\boldsymbol{H} perfectly, e.g., due to hardware imperfections and model mismatch. We aim to recover the sparse signal 𝑿\boldsymbol{X} (and estimate the measurement matrix 𝑯\boldsymbol{H}) from the noisy measurements 𝒀=𝑯𝑿+𝑾\boldsymbol{Y}=\boldsymbol{H}\boldsymbol{X}+\boldsymbol{W} with

𝑯𝑯¯+𝑯,\boldsymbol{H}\triangleq\bar{\boldsymbol{H}}+\boldsymbol{H}^{\prime}, (67)

where 𝑯¯\bar{\boldsymbol{H}} is a known matrix and 𝑯\boldsymbol{H}^{\prime} is an unknown disturbance matrix. Compared to the case of DL, the difference lies in the prior of 𝑯\boldsymbol{H}. We assume that the entries of 𝑯\boldsymbol{H}^{\prime} are i.i.d. Gaussian variables, i.e., p(hm,n)=𝒩(hm,n;0,ν)p(h^{\prime}_{m,n})=\mathcal{N}(h^{\prime}_{m,n};0,\nu), so the prior of hm,nh_{m,n} can be represented as p(hm,n)=𝒩(hm,n;h¯m,n,ν)p(h_{m,n})=\mathcal{N}(h_{m,n};\bar{h}_{m,n},\nu). For the entries of 𝑿\boldsymbol{X}, we still use the hierarchical Gaussian-Gamma prior. It is worth noting that, if the sparsity of 𝑿\boldsymbol{X} exhibits some pattern, e.g., the columns of 𝑿\boldsymbol{X} share a common support, we can impose that the entries of each row of 𝑿\boldsymbol{X} share a common precision in their Gaussian priors to capture the common support.
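
A minimal sketch of generating a CSMU instance of the form (67), consistent with the correlated setting used in the numerical example below, is given here. The sparsity rate of X, the scaling of the perturbation, and the SNR convention are illustrative assumptions.

import numpy as np

def exp_corr(n, rho):
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def generate_csmu_data(M=100, N=100, L=200, rho=0.1, delta=0.2, snr_db=50, seed=0):
    rng = np.random.default_rng(seed)
    # Known matrix H_bar and unknown perturbation H', both correlated as in (64)
    H_bar = exp_corr(M, rho) @ rng.standard_normal((M, N)) @ exp_corr(N, rho)
    H_prime = exp_corr(M, rho) @ rng.standard_normal((M, N)) @ exp_corr(N, rho)
    H = H_bar + H_prime
    # Sparse X: a fraction delta of the entries are non-zero i.i.d. Gaussian
    X = rng.standard_normal((N, L)) * (rng.random((N, L)) < delta)
    Z = H @ X
    noise_var = np.mean(Z**2) / 10**(snr_db / 10)   # assumed SNR convention
    Y = Z + np.sqrt(noise_var) * rng.standard_normal((M, L))
    return Y, H_bar, X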

Figure 11: NMSE of UAMP-MF and BAd-VAMP for CSMU, where SNR = 50 dB and ρ=0.1\rho=0.1.
Figure 12: NMSE of UAMP-MF and BAd-VAMP for CSMU, where SNR = 50 dB and ρ=0.3\rho=0.3.
Figure 13: Runtime comparison for CSMU, (a) ρ=0.1\rho=0.1 and (b) ρ=0.3\rho=0.3, where M=N=100M=N=100 and SNR = 50 dB.

In the numerical example, to test the robustness of the algorithms, we generate correlated matrices 𝑯¯\bar{\boldsymbol{H}} and 𝑯\boldsymbol{H}^{\prime} using (64). The sparse matrix 𝑿\boldsymbol{X} is generated in the same way as before. It is noted that, different from DL, the permutation and scale ambiguity does not exist thanks to the known matrix 𝑯¯\bar{\boldsymbol{H}}. The performance is evaluated using the NMSE of 𝑿\boldsymbol{X}

NMSE(X)=𝑿^𝑿2/𝑿2.\displaystyle{\text{NMSE(X)}={\|\hat{\boldsymbol{X}}-\boldsymbol{X}\|^{2}}/{\|\boldsymbol{X}\|^{2}}.} (68)

The NMSE performance of UAMP-MF and BAd-VAMP is shown in Figs. 11 and 12 for various combinations of M=N{50,,200}M=N\in\{50,...,200\} and L{100,,600}L\in\{100,...,600\}. The parameter ρ=0.1\rho=0.1 and 0.30.3 in Figs. 11 and 12, respectively, and SNR = 50 dB. It can be seen that UAMP-MF outperforms BAd-VAMP significantly when the correlation coefficient ρ=0.3\rho=0.3. The runtime comparison is shown in Fig. 13, which indicates that UAMP-MF is much faster.

IV-D UAMP-MF for Non-Negative Matrix Factorization

The NMF problem [1] finds various applications [30], [31]. In NMF, from the observation

𝒀=𝒁+𝑾\displaystyle\boldsymbol{Y}=\boldsymbol{Z}+\boldsymbol{W} (69)

where 𝒁+M×L\boldsymbol{Z}\in\mathcal{R}_{+}^{M\times L}, we aim to find a factorization of 𝒁\boldsymbol{Z} as the product of two non-negative matrices 𝑯+M×N\boldsymbol{H}\in\mathcal{R}_{+}^{M\times N} and 𝑿+N×L\boldsymbol{X}\in\mathcal{R}_{+}^{N\times L}, i.e., 𝒁𝑯𝑿\boldsymbol{Z}\approx\boldsymbol{H}\boldsymbol{X}. To achieve NMF with UAMP-MF, we set the priors p(𝑿)p(\boldsymbol{X}) and p(𝑯)p(\boldsymbol{H}) to be i.i.d. non-negative Gaussian, i.e.,

p(xn,l)=N+(xn,l;θ,ϕ)={𝒩(x;θ,ϕ)Φc(θ/ϕ)x00x<0\displaystyle p(x_{n,l})=N_{+}(x_{n,l};\theta,\phi)=\begin{cases}\frac{\mathcal{N}(x;\theta,\phi)}{\Phi_{c}(-\theta/\sqrt{\phi})}&x\geq 0\\ 0&x<0\end{cases} (70)
p(hm,n)=N+(hm,n;θ,ϕ)={𝒩(h;θ,ϕ)Φc(θ/ϕ)h00h<0,\displaystyle p(h_{m,n})=N_{+}(h_{m,n};\theta,\phi)=\begin{cases}\frac{\mathcal{N}(h;\theta,\phi)}{\Phi_{c}(-\theta/\sqrt{\phi})}&h\geq 0\\ 0&h<0\end{cases}, (71)

where Φc\Phi_{c} is the complementary cumulative distribution function of the standard Gaussian distribution.
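
The posterior computations under the non-negative priors (70) and (71) reduce to the first two moments of a Gaussian truncated to [0, ∞). A minimal sketch of these standard truncated-Gaussian moments is given below; the function name is illustrative, and theta and phi play the roles of the Gaussian parameters appearing in (70)-(71).

import numpy as np
from scipy.stats import norm

def nonneg_gaussian_moments(theta, phi):
    # Mean and variance of N(theta, phi) restricted to [0, +inf), i.e., of
    # N_+(x; theta, phi) in (70)-(71); norm.sf is the complementary CDF Phi_c
    alpha = -theta / np.sqrt(phi)
    r = norm.pdf(alpha) / norm.sf(alpha)
    mean = theta + np.sqrt(phi) * r
    var = phi * (1.0 + alpha * r - r**2)
    return mean, var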

Figure 14: NMSE of the algorithms versus SNR for NMF.
Figure 15: Runtime of the algorithms for NMF.
Figure 16: NMSE performance comparison for sparse MF.

We compare UAMP-MF with the alternating non-negative least squares projected gradient algorithm (AlsPGrad) [8] and the multiplicative update algorithm (Mult) [7]. In the numerical examples, we set M=L=200M=L=200 and N=100N=100. The metric used to evaluate the performance is defined as

NMSE(Z)=𝒁𝑯^𝑿^2𝒁2.\displaystyle\text{NMSE(Z)}=\frac{\|\boldsymbol{Z}-\hat{\boldsymbol{H}}\hat{\boldsymbol{X}}\|^{2}}{\|\boldsymbol{Z}\|^{2}}. (72)

For Mult, we use the built-in function ‘nnmf’ in Matlab; the termination tolerance is set to 101010^{-10}, and the maximum number of iterations is set to 2000, 5000, 10000 and 20000 for comparison.
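
For reference, the Mult baseline corresponds to the classical Lee-Seung multiplicative updates for the Frobenius-norm objective; a minimal sketch (not MATLAB's ‘nnmf’ implementation itself, and with an illustrative initialization) is given below.

import numpy as np

def mult_nmf(Y, N, n_iter=2000, eps=1e-12, seed=0):
    # Multiplicative updates for min ||Y - H X||_F^2 subject to H, X >= 0
    rng = np.random.default_rng(seed)
    Yc = np.maximum(Y, 0.0)   # the update assumes a non-negative data matrix
    M, L = Yc.shape
    H = rng.random((M, N))
    X = rng.random((N, L))
    for _ in range(n_iter):
        X *= (H.T @ Yc) / (H.T @ H @ X + eps)
        H *= (Yc @ X.T) / (H @ X @ X.T + eps)
    return H, X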

The NMSE results are shown in Fig. 14 and the runtime is shown in Fig. 15. It can be seen that, as the maximum number of iterations increases, the performance of Mult improves considerably, but at the cost of a significant increase in runtime. AlsPGrad delivers performance similar to that of Mult with 20000 iterations. UAMP-MF delivers the best performance with significantly smaller runtime than Mult (except Mult-2000) and AlsPGrad.

Figure 17: Performance comparison for sparse NMF.
Figure 18: Runtime comparison (a) sparse MF; (b) sparse NMF.

IV-E UAMP-MF for Sparse Matrix Factorization

We consider the use of UAMP-MF to solve the sparse MF problem [32]. From the observation

𝒀=𝒁+𝑾,\displaystyle\boldsymbol{Y}=\boldsymbol{Z}+\boldsymbol{W}, (73)

where 𝒁M×L\boldsymbol{Z}\in\mathcal{R}^{M\times L}, we aim to find a factorization of 𝒁\boldsymbol{Z} as the product of two sparse matrices 𝑯M×N\boldsymbol{H}\in\mathcal{R}^{M\times N} and 𝑿N×L\boldsymbol{X}\in\mathcal{R}^{N\times L}, i.e., 𝒁𝑯𝑿\boldsymbol{Z}\approx\boldsymbol{H}\boldsymbol{X}.

To find the sparse matrices 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} with UAMP-MF, we use the sparsity promoting hierarchical Gaussian-Gamma prior for entries in both matrices, i.e.,

p(𝑯)\displaystyle p(\boldsymbol{H})\!\!\! =\displaystyle= m,np(hm,n)=m,nN(hm,n;0,γhm,n1)\displaystyle\!\!\!\prod_{m,n}p(h_{m,n})=\prod_{m,n}N(h_{m,n};0,\gamma_{h_{m,n}}^{-1}) (74)
p(𝑿)\displaystyle p(\boldsymbol{X})\!\!\! =\displaystyle= n,lp(xn,l)=n,lN(xn,l;0,γxn,l1)\displaystyle\!\!\!\prod_{n,l}p(x_{n,l})=\prod_{n,l}N(x_{n,l};0,\gamma_{x_{n,l}}^{-1}) (75)
p(γhm,n)\displaystyle p(\gamma_{h_{m,n}})\!\!\! =\displaystyle= Ga(γhm,n;ϵ,η)\displaystyle\!\!\!Ga(\gamma_{h_{m,n}};\epsilon,\eta) (76)
p(γxn,l)\displaystyle p(\gamma_{x_{n,l}})\!\!\! =\displaystyle= Ga(γxn,l;ϵ,η).\displaystyle\!\!\!Ga(\gamma_{x_{n,l}};\epsilon,\eta). (77)

The computations of the a posteriori means and variances of the entries of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} are the same as (54)-(56). We compare the performance of UAMP-MF and BAd-VAMP, and the results are shown in Fig. 16 for various combinations of N{50,,200}N\in\{50,...,200\} and L{100,,800}L\in\{100,...,800\}, where M=NM=N, SNR = 50 dB, and the sparsity rate δ\delta of 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X} is set to 0.20.2. It can be seen that the performance of UAMP-MF is significantly better than that of BAd-VAMP, as no priors can be imposed on 𝑯\boldsymbol{H} in BAd-VAMP.
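
For the Gaussian-Gamma priors (74)-(77), the mean-field update of each precision has a simple closed form: q(γ) is a Gamma distribution with shape ε + 1/2 and rate η + (x̂² + σ²)/2, whose mean serves as the plug-in precision estimate. A minimal sketch of this standard update is given below; it is assumed to be consistent with, and complementary to, the posterior mean and variance computations referenced above, and the default values of ε and η are illustrative.

import numpy as np

def update_precisions(x_hat, x_var, eps=1e-3, eta=1e-3):
    # Mean-field update for the Gaussian-Gamma prior (74)-(77):
    # q(gamma) = Ga(eps + 1/2, eta + (x_hat^2 + x_var)/2), and its mean
    # (2*eps + 1) / (2*eta + x_hat^2 + x_var) is the precision estimate.
    return (2.0 * eps + 1.0) / (2.0 * eta + x_hat**2 + x_var)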

We can also achieve sparse NMF with UAMP-MF, where 𝒁+M×L\boldsymbol{Z}\in\mathcal{R}_{+}^{M\times L} and a sparse non-negative factorization 𝒁𝑯𝑿\boldsymbol{Z}\approx\boldsymbol{H}\boldsymbol{X} is to be found. In this case, we can employ a non-negative Bernoulli-Gaussian distribution for the entries in both 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X}. We compare UAMP-MF with Mult and AlsPGrad, and the numerical results are shown in Fig. 17, where the sparsity rate δ=0.1\delta=0.1 for both 𝑯\boldsymbol{H} and 𝑿\boldsymbol{X}, M=200M=200, N=100N=100 and L=200L=200. We can see that the performance of UAMP-MF is much better than that of the other algorithms, as they lack a sparsity-promoting mechanism.

The runtime comparisons for sparse MF and sparse NMF are shown in Fig. 18. It can be seen that UAMP-MF is much faster than BAd-VAMP. Compared to AlsPGrad and Mult with 2000 iterations, UAMP-MF has similar runtime but delivers much better performance.

V Conclusions

In this paper, leveraging variational inference and UAMP, we designed a message passing algorithm, UAMP-MF, for matrix factorization, where UAMP is incorporated into variational inference through a whitening process. By imposing proper priors, UAMP-MF can be used to solve various problems, such as RPCA, DL, CSMU, NMF, sparse MF and sparse NMF. Extensive numerical results demonstrate the superiority of UAMP-MF over state-of-the-art algorithms in terms of recovery accuracy, robustness and computational complexity.

Appendix A Proof of Proposition 1

According to VI, the message from fYf_{Y} to 𝑿\boldsymbol{X} can be expressed as

mfYX(𝑿)\displaystyle\!\!\!\!\!\!\!\!\!\!m_{f_{Y}\to X}(\boldsymbol{X})
exp(𝑯,λq(𝑯)q(λ)logfY)\displaystyle\!\!\!\!\!\!\!\!\!\!\propto\exp\left(\int_{\boldsymbol{H},\lambda}q(\boldsymbol{H})q(\lambda)\log f_{Y}\right)
exp(λ^𝑯Tr((𝒀𝑯𝑿)T(𝒀𝑯𝑿))q(𝑯))\displaystyle\!\!\!\!\!\!\!\!\!\!\propto\exp\Big{(}-\hat{\lambda}\int_{\boldsymbol{H}}\text{Tr}\Big{(}(\boldsymbol{Y}-\boldsymbol{H}\boldsymbol{X})^{\textrm{T}}(\boldsymbol{Y}-\boldsymbol{H}\boldsymbol{X})\Big{)}q(\boldsymbol{H})\Big{)}
exp(λ^𝑯(𝒚Vec(𝑯𝑿))T(𝒚Vec(𝑯𝑿))q(𝑯))\displaystyle\!\!\!\!\!\!\!\!\!\!\propto\exp\left(-\hat{\lambda}\int_{\boldsymbol{H}}(\boldsymbol{y}-\textrm{Vec}(\boldsymbol{H}\boldsymbol{X}))^{\textrm{T}}(\boldsymbol{y}-\textrm{Vec}(\boldsymbol{H}\boldsymbol{X}))q(\boldsymbol{H})\right)
=exp(λ^𝒉(𝒚𝑿~𝒉)T(𝒚𝑿~𝒉)q(𝒉))\displaystyle\!\!\!\!\!\!\!\!\!\!=\exp\left(-\hat{\lambda}\int_{\boldsymbol{h}}(\boldsymbol{y}-\tilde{\boldsymbol{X}}\boldsymbol{h})^{\textrm{T}}(\boldsymbol{y}-\tilde{\boldsymbol{X}}\boldsymbol{h})q(\boldsymbol{h})\right) (78)
=exp(λ^𝒉(𝒚T𝒚+𝒉T𝑿~T𝑿~𝒉(i)\displaystyle\!\!\!\!\!\!\!\!\!\!=\exp\Big{(}-\hat{\lambda}\int_{\boldsymbol{h}}\big{(}\boldsymbol{y}^{\textrm{T}}\boldsymbol{y}+\underbrace{\boldsymbol{h}^{\textrm{T}}\tilde{\boldsymbol{X}}^{\textrm{T}}\tilde{\boldsymbol{X}}\boldsymbol{h}}_{(i)}
𝒉T𝑿~T𝒚(ii)𝒚T𝑿~𝒉(iii))q(𝒉)),\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ -\underbrace{\boldsymbol{h}^{\textrm{T}}\tilde{\boldsymbol{X}}^{\textrm{T}}\boldsymbol{y}}_{(ii)}-\underbrace{\boldsymbol{y}^{\textrm{T}}\tilde{\boldsymbol{X}}\boldsymbol{h}}_{(iii)}\big{)}q(\boldsymbol{h})\Big{)}, (79)

where 𝒉=Vec(𝑯)\boldsymbol{h}=\text{Vec}(\boldsymbol{H}), 𝒚=Vec(𝒀)\boldsymbol{y}=\text{Vec}(\boldsymbol{Y}) and 𝑿~=𝑿T𝑰M\tilde{\boldsymbol{X}}=\boldsymbol{X}^{\textrm{T}}\otimes\boldsymbol{I}_{M}. In the derivation of (78), we use the matrix identity Vec(𝑯𝑿)=Vec(𝑰M𝑯𝑿)=(𝑿T𝑰M)𝒉\text{Vec}(\boldsymbol{H}\boldsymbol{X})=\text{Vec}(\boldsymbol{I}_{M}\boldsymbol{H}\boldsymbol{X})=\left(\boldsymbol{X}^{\textrm{T}}\otimes\boldsymbol{I}_{M}\right)\boldsymbol{h} [33].
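
This identity can be checked numerically; the short NumPy snippet below verifies Vec(HX) = (X^T ⊗ I_M)Vec(H) for randomly chosen dimensions, with Vec denoting column-wise stacking (order='F' in NumPy).

import numpy as np

rng = np.random.default_rng(0)
M, N, L = 4, 3, 5
H = rng.standard_normal((M, N))
X = rng.standard_normal((N, L))
lhs = (H @ X).flatten(order='F')                      # Vec(H X)
rhs = np.kron(X.T, np.eye(M)) @ H.flatten(order='F')  # (X^T kron I_M) Vec(H)
assert np.allclose(lhs, rhs)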

Next, we work out the integration of the three terms (i)(i), (ii)(ii) and (iii)(iii) in (79). For term (i)(i), we have

𝒉𝒉T𝑿~T𝑿~𝒉q(𝒉)\displaystyle\!\!\!\!\!\!\!\!\!\int_{\boldsymbol{h}}\boldsymbol{h}^{\textrm{T}}\tilde{\boldsymbol{X}}^{\textrm{T}}\tilde{\boldsymbol{X}}\boldsymbol{h}q(\boldsymbol{h})
=𝒉Tr(𝑿~T𝑿~𝒉𝒉T)q(𝒉)\displaystyle\!\!\!\!\!\!\!\!\!=\int_{\boldsymbol{h}}\text{Tr}\left(\tilde{\boldsymbol{X}}^{\textrm{T}}\tilde{\boldsymbol{X}}\boldsymbol{h}\boldsymbol{h}^{\textrm{T}}\right)q(\boldsymbol{h})
=Tr(𝑿~T𝑿~(𝒉^𝒉^T+𝑽H𝑼H))\displaystyle\!\!\!\!\!\!\!\!\!=\text{Tr}\left(\tilde{\boldsymbol{X}}^{\textrm{T}}\tilde{\boldsymbol{X}}(\hat{\boldsymbol{h}}\hat{\boldsymbol{h}}^{\textrm{T}}+\boldsymbol{V}_{H}\otimes\boldsymbol{U}_{H})\right)
=Tr(𝒉^T𝑿~T𝑿~𝒉^)+Tr(𝑿~(𝑽H𝑼H)𝑿~T)\displaystyle\!\!\!\!\!\!\!\!\!=\text{Tr}(\hat{\boldsymbol{h}}^{\textrm{T}}\tilde{\boldsymbol{X}}^{\textrm{T}}\tilde{\boldsymbol{X}}\hat{\boldsymbol{h}})+\text{Tr}(\tilde{\boldsymbol{X}}(\boldsymbol{V}_{H}\otimes\boldsymbol{U}_{H})\tilde{\boldsymbol{X}}^{\textrm{T}})
=Tr(Vec(𝑯^)T(𝑿T𝑰M)T(𝑿T𝑰M)Vec(𝑯^))\displaystyle\!\!\!\!\!\!\!\!\!=\text{Tr}(\textrm{Vec}(\hat{\boldsymbol{H}})^{\textrm{T}}(\boldsymbol{X}^{\textrm{T}}\otimes\boldsymbol{I}_{M})^{\textrm{T}}(\boldsymbol{X}^{\textrm{T}}\otimes\boldsymbol{I}_{M})\textrm{Vec}(\hat{\boldsymbol{H}}))
+Tr((𝑿T𝑰M)(𝑽H𝑼H)(𝑿T𝑰M))\displaystyle\ \ \ \ \ \ \ +\text{Tr}((\boldsymbol{X}^{\textrm{T}}\otimes\boldsymbol{I}_{M})(\boldsymbol{V}_{H}\otimes\boldsymbol{U}_{H})(\boldsymbol{X}^{\textrm{T}}\otimes\boldsymbol{I}_{M}))
=Tr(Vec(𝑯^𝑿)TVec(𝑯^𝑿))\displaystyle\!\!\!\!\!\!\!\!\!=\text{Tr}\big{(}\textrm{Vec}(\hat{\boldsymbol{H}}\boldsymbol{X})^{\textrm{T}}\textrm{Vec}(\hat{\boldsymbol{H}}\boldsymbol{X})\big{)}
+Tr((𝑿𝑰M)(𝑿T𝑰M)(𝑽H𝑼H))\displaystyle\ \ \ \ \ \ \ \ +\text{Tr}\big{(}(\boldsymbol{X}\otimes\boldsymbol{I}_{M})(\boldsymbol{X}^{\textrm{T}}\otimes\boldsymbol{I}_{M})(\boldsymbol{V}_{H}\otimes\boldsymbol{U}_{H})\big{)} (80)
=Tr(𝑿T𝑯^T𝑯^𝑿)\displaystyle\!\!\!\!\!\!\!\!\!=\text{Tr}(\boldsymbol{X}^{\textrm{T}}\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}}\boldsymbol{X})
+Tr(((𝑿𝑿T)𝑰M)(𝑽H𝑼H))\displaystyle\ \ \ \ \ \ \ \ +\text{Tr}(((\boldsymbol{X}\boldsymbol{X}^{\textrm{T}})\otimes\boldsymbol{I}_{M})(\boldsymbol{V}_{H}\otimes\boldsymbol{U}_{H})) (81)
=Tr(𝑿T𝑯^T𝑯^𝑿)+Tr(𝑼H)Tr(𝑿𝑿T𝑽H)\displaystyle\!\!\!\!\!\!\!\!\!=\text{Tr}(\boldsymbol{X}^{\textrm{T}}\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}}\boldsymbol{X})+\text{Tr}(\boldsymbol{U}_{H})\text{Tr}(\boldsymbol{X}\boldsymbol{X}^{\textrm{T}}\boldsymbol{V}_{H}) (82)
=Tr(𝑿T(𝑯^T𝑯^+Tr(𝑼H)𝑽H)𝑿),\displaystyle\!\!\!\!\!\!\!\!\!=\text{Tr}(\boldsymbol{X}^{\textrm{T}}(\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}}+\text{Tr}(\boldsymbol{U}_{H})\boldsymbol{V}_{H})\boldsymbol{X}), (83)

where we use the matrix identity Tr(𝑨𝑩𝑪)=Tr(𝑪𝑨𝑩)\text{Tr}\left(\boldsymbol{A}\boldsymbol{B}\boldsymbol{C}\right)=\text{Tr}\left(\boldsymbol{C}\boldsymbol{A}\boldsymbol{B}\right) in deriving (80), (𝑨𝑩)(𝑪𝑫)=(𝑨𝑪)(𝑩𝑫)\left(\boldsymbol{A}\otimes\boldsymbol{B}\right)\left(\boldsymbol{C}\otimes\boldsymbol{D}\right)=(\boldsymbol{A}\boldsymbol{C})\otimes(\boldsymbol{B}\boldsymbol{D}) and (𝑨𝑩)T=𝑨T𝑩T\left(\boldsymbol{A}\otimes\boldsymbol{B}\right)^{\textrm{T}}=\boldsymbol{A}^{\textrm{T}}\otimes\boldsymbol{B}^{\textrm{T}} in deriving (81), and Tr(𝑨𝑩)=Tr(𝑨)Tr(𝑩)\text{Tr}(\boldsymbol{A}\otimes\boldsymbol{B})=\text{Tr}(\boldsymbol{A})\text{Tr}(\boldsymbol{B}) in deriving (82).

Regarding term (ii)(ii), we have

𝒉𝒉T𝑿~T𝒚q(𝒉)\displaystyle\int_{\boldsymbol{h}}\boldsymbol{h}^{\textrm{T}}\tilde{\boldsymbol{X}}^{\textrm{T}}\boldsymbol{y}q(\boldsymbol{h})
=𝑯Vec(𝑯)T(𝑿T𝑰M)TVec(𝒀)q(𝑯)\displaystyle=\int_{\boldsymbol{H}}\text{Vec}(\boldsymbol{H})^{\textrm{T}}(\boldsymbol{X}^{\textrm{T}}\otimes\boldsymbol{I}_{M})^{\textrm{T}}\text{Vec}(\boldsymbol{Y})q(\boldsymbol{H})
=𝑯Tr(𝑿T𝑯T𝒀)q(𝑯)\displaystyle=\int_{\boldsymbol{H}}\text{Tr}(\boldsymbol{X}^{\textrm{T}}\boldsymbol{H}^{\textrm{T}}\boldsymbol{Y})q(\boldsymbol{H})
=Tr(𝑿T𝑯^T𝒀).\displaystyle=\text{Tr}\big{(}\boldsymbol{X}^{\textrm{T}}\hat{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{Y}\big{)}. (84)

Similarly, term (iii)(iii) can be expressed as

𝒉𝒚T𝑿~𝒉q(𝒉)=Tr(𝒀T𝑯^𝑿).\displaystyle\int_{\boldsymbol{h}}\boldsymbol{y}^{\textrm{T}}\tilde{\boldsymbol{X}}\boldsymbol{h}q(\boldsymbol{h})=\text{Tr}(\boldsymbol{Y}^{\textrm{T}}\hat{\boldsymbol{H}}\boldsymbol{X}). (85)

Based on the above results, the message

mfYX(𝑿)exp(λ^Tr(𝑿T(𝑯^T𝑯^+Tr(𝑼H)𝑽H)𝑿\displaystyle\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!m_{f_{Y}\to X}(\boldsymbol{X})\propto\exp\Big{(}-\hat{\lambda}\text{Tr}\big{(}\boldsymbol{X}^{\textrm{T}}(\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}}+\text{Tr}(\boldsymbol{U}_{H})\boldsymbol{V}_{H})\boldsymbol{X}
𝑿T𝑯^T𝒀𝒀T𝑯^𝑿+𝒀T𝒀)).\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ -\boldsymbol{X}^{\textrm{T}}\hat{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{Y}-\boldsymbol{Y}^{\textrm{T}}\hat{\boldsymbol{H}}\boldsymbol{X}+\boldsymbol{Y}^{\textrm{T}}\boldsymbol{Y}\big{)}\Big{)}. (86)

Comparing the result against the matrix normal distribution (9), we have the result shown in (17).

Appendix B Proof of Proposition 2

According to VI, the message mfYH(𝑯)m_{f_{Y}\to H}(\boldsymbol{H}) is computed as

mfYH(𝑯)\displaystyle m_{f_{Y}\to H}(\boldsymbol{H})
exp(𝑿,λq(𝑿)q(λ)logfY)\displaystyle\propto\exp\left(\int_{\boldsymbol{X},\lambda}q(\boldsymbol{X})q(\lambda)\log f_{Y}\right)
exp(λ^𝑿Tr((𝒀𝑯𝑿)T(𝒀𝑯𝑿))b(𝑿))\displaystyle\propto\exp\left(-\hat{\lambda}\int_{\boldsymbol{X}}\text{Tr}((\boldsymbol{Y}-\boldsymbol{H}\boldsymbol{X})^{\textrm{T}}(\boldsymbol{Y}-\boldsymbol{H}\boldsymbol{X}))b(\boldsymbol{X})\right)
=exp(λ^𝒙(𝒚𝑯~𝒙)T(𝒚𝑯~𝒙)b(𝒙))\displaystyle=\exp\left(-\hat{\lambda}\int_{\boldsymbol{x}}(\boldsymbol{y}-\tilde{\boldsymbol{H}}\boldsymbol{x})^{\textrm{T}}(\boldsymbol{y}-\tilde{\boldsymbol{H}}\boldsymbol{x})b(\boldsymbol{x})\right)
=exp(λ^𝒙(𝒚T𝒚+𝒙T𝑯~T𝑯~𝒙\displaystyle=\exp\Big{(}-\hat{\lambda}\int_{\boldsymbol{x}}\big{(}\boldsymbol{y}^{\textrm{T}}\boldsymbol{y}+\boldsymbol{x}^{\textrm{T}}\tilde{\boldsymbol{H}}^{\textrm{T}}\tilde{\boldsymbol{H}}\boldsymbol{x}
𝒙T𝑯~T𝒚𝒚T𝑯~𝒙)b(𝒙)),\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ -\boldsymbol{x}^{\textrm{T}}\tilde{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{y}-\boldsymbol{y}^{\textrm{T}}\tilde{\boldsymbol{H}}\boldsymbol{x}\big{)}b(\boldsymbol{x})\Big{)}, (87)

where 𝑯~𝑰L𝑯\tilde{\boldsymbol{H}}\triangleq\boldsymbol{I}_{L}\otimes\boldsymbol{H}. Similar to the derivation of (83), the integration of the terms in (87) can be expressed as

𝒙𝒙T𝑯~T𝒚b(𝒙)=Tr(𝑿^T𝑯T𝒀)\displaystyle\int_{\boldsymbol{x}}\boldsymbol{x}^{\textrm{T}}\tilde{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{y}b(\boldsymbol{x})=\text{Tr}\big{(}\hat{\boldsymbol{X}}^{\textrm{T}}\boldsymbol{H}^{\textrm{T}}\boldsymbol{Y}\big{)} (88)

and

𝒙𝒙T𝑯~T𝑯~𝒙b(𝒙)\displaystyle\int_{\boldsymbol{x}}\boldsymbol{x}^{\textrm{T}}\tilde{\boldsymbol{H}}^{\textrm{T}}\tilde{\boldsymbol{H}}\boldsymbol{x}b(\boldsymbol{x})
=𝒙Tr(𝑯~T𝑯~𝒙𝒙T)b(𝒙)\displaystyle=\int_{\boldsymbol{x}}\text{Tr}\left(\tilde{\boldsymbol{H}}^{\textrm{T}}\tilde{\boldsymbol{H}}\boldsymbol{x}\boldsymbol{x}^{\textrm{T}}\right)b(\boldsymbol{x})
=Tr(𝑯~T𝑯~(𝒙^𝒙^T+𝑽X𝑼X))\displaystyle=\text{Tr}\big{(}\tilde{\boldsymbol{H}}^{\textrm{T}}\tilde{\boldsymbol{H}}(\hat{\boldsymbol{x}}\hat{\boldsymbol{x}}^{\textrm{T}}+\boldsymbol{V}_{X}\otimes\boldsymbol{U}_{X})\big{)}
=Tr(𝑿^T𝑯T𝑯𝑿^)+Tr(𝑽X(𝑯T𝑯𝑼X))\displaystyle=\text{Tr}\big{(}\hat{\boldsymbol{X}}^{\textrm{T}}\boldsymbol{H}^{\textrm{T}}\boldsymbol{H}\hat{\boldsymbol{X}}\big{)}+\text{Tr}\big{(}\boldsymbol{V}_{X}\otimes(\boldsymbol{H}^{\textrm{T}}\boldsymbol{H}\boldsymbol{U}_{X})\big{)}
=Tr(𝑿^T𝑯T𝑯𝑿^)+Tr(𝑽X)Tr(𝑯𝑼X𝑯T)\displaystyle=\text{Tr}\big{(}\hat{\boldsymbol{X}}^{\textrm{T}}\boldsymbol{H}^{\textrm{T}}\boldsymbol{H}\hat{\boldsymbol{X}}\big{)}+\text{Tr}(\boldsymbol{V}_{X})\text{Tr}\big{(}\boldsymbol{H}\boldsymbol{U}_{X}\boldsymbol{H}^{\textrm{T}}\big{)}
=Tr(𝑯(𝑿^𝑿^T+Tr(𝑽X)𝑼X)𝑯T).\displaystyle=\text{Tr}\big{(}\boldsymbol{H}(\hat{\boldsymbol{X}}\hat{\boldsymbol{X}}^{\textrm{T}}+\text{Tr}(\boldsymbol{V}_{X})\boldsymbol{U}_{X})\boldsymbol{H}^{\textrm{T}}\big{)}. (89)

The message from fYf_{Y} to 𝑯\boldsymbol{H} can be represented as

mfYH(𝑯)\displaystyle m_{f_{Y}\to H}(\boldsymbol{H})
exp(λ^Tr(𝒀𝒀T+𝑯(𝑿^𝑿^T+Tr(𝑽X)𝑼X)𝑯T\displaystyle\propto\exp\Big{(}-\hat{\lambda}\text{Tr}(\boldsymbol{Y}\boldsymbol{Y}^{\textrm{T}}+\boldsymbol{H}(\hat{\boldsymbol{X}}\hat{\boldsymbol{X}}^{\textrm{T}}+\text{Tr}(\boldsymbol{V}_{X})\boldsymbol{U}_{X})\boldsymbol{H}^{\textrm{T}}
𝑯𝑿^𝒀T𝒀𝑿^T𝑯T)).\displaystyle\ \ \ \ \ \ \ \ -\boldsymbol{H}\hat{\boldsymbol{X}}\boldsymbol{Y}^{\textrm{T}}-\boldsymbol{Y}\hat{\boldsymbol{X}}^{\textrm{T}}\boldsymbol{H}^{\textrm{T}})\Big{)}. (90)

Comparing the above against the matrix normal distribution, we obtain the results shown in (32)-(34).

Appendix C Proof of Proposition 3

According to VI, the message

mfYλ(λ)\displaystyle m_{f_{Y}\to\lambda}(\lambda)
=det(λ1𝑰M𝑰L)exp(λ(Vec(𝒀)Vec(𝑯𝑿))T\displaystyle=\text{det}(\lambda^{-1}\boldsymbol{I}_{M}\otimes\boldsymbol{I}_{L})\exp\Big{(}-\lambda\big{(}\text{Vec}(\boldsymbol{Y})-\textrm{Vec}(\boldsymbol{H}\boldsymbol{X})\big{)}^{\textrm{T}}
(Vec(𝒀)Vec(𝑯𝑿)))\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \big{(}\textrm{Vec}(\boldsymbol{Y})-\textrm{Vec}(\boldsymbol{H}\boldsymbol{X})\big{)}\Big{)}
=λMLexp(λ𝑯,𝑿Tr((𝒀𝑯𝑿)T\displaystyle=\lambda^{ML}\exp\Big{(}-\lambda\int_{\boldsymbol{H},\boldsymbol{X}}\text{Tr}((\boldsymbol{Y}-\boldsymbol{H}\boldsymbol{X})^{\textrm{T}}
(𝒀𝑯𝑿))b(𝑯)b(𝑿))\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (\boldsymbol{Y}-\boldsymbol{H}\boldsymbol{X}))b(\boldsymbol{H})b(\boldsymbol{X})\Big{)}
=λMLexp(λC),\displaystyle=\lambda^{ML}\exp\Big{(}-\lambda C\Big{)}, (91)

where

C=𝑯,𝑿Tr((𝒀𝑯𝑿)T(𝒀𝑯𝑿))q(𝑯)q(𝑿)\displaystyle\!\!\!\!\!\!\!\!\!C=\int_{\boldsymbol{H},\boldsymbol{X}}\text{Tr}((\boldsymbol{Y}-\boldsymbol{H}\boldsymbol{X})^{\textrm{T}}(\boldsymbol{Y}-\boldsymbol{H}\boldsymbol{X}))q(\boldsymbol{H})q(\boldsymbol{X})
=𝒙,𝑯(𝒚T𝒚+𝒙T𝑯~T𝑯~𝒙𝒙T𝑯~T𝒚𝒚T𝑯~𝒙)q(𝒙)q(𝑯)\displaystyle\!\!\!\!\!\!\!\!\!\!=\!\int_{\boldsymbol{x},\boldsymbol{H}}(\boldsymbol{y}^{\textrm{T}}\boldsymbol{y}+\boldsymbol{x}^{\textrm{T}}\tilde{\boldsymbol{H}}^{\textrm{T}}\tilde{\boldsymbol{H}}\boldsymbol{x}-\boldsymbol{x}^{\textrm{T}}\tilde{\boldsymbol{H}}^{\textrm{T}}\boldsymbol{y}-\boldsymbol{y}^{\textrm{T}}\tilde{\boldsymbol{H}}\boldsymbol{x})q(\boldsymbol{x})q(\boldsymbol{H})
=𝑯Tr(𝒀𝒀T+𝑯(𝑿^𝑿^T+Tr(𝑽X)𝑼X)𝑯T\displaystyle\!\!\!\!\!\!\!\!\!\!=\!\int_{\boldsymbol{H}}\text{Tr}(\boldsymbol{Y}\boldsymbol{Y}^{\textrm{T}}+\boldsymbol{H}(\hat{\boldsymbol{X}}\hat{\boldsymbol{X}}^{\textrm{T}}+\text{Tr}(\boldsymbol{V}_{X})\boldsymbol{U}_{X})\boldsymbol{H}^{\textrm{T}}
𝑯𝑿^𝒀T𝒀𝑿^T𝑯T)q(𝑯)\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ -\boldsymbol{H}\hat{\boldsymbol{X}}\boldsymbol{Y}^{\textrm{T}}-\boldsymbol{Y}\hat{\boldsymbol{X}}^{\textrm{T}}\boldsymbol{H}^{\textrm{T}})q(\boldsymbol{H})
=Tr(𝒀𝒀T𝑯^𝑿^𝒀T𝒀𝑿^T𝑯^T)\displaystyle\!\!\!\!\!\!\!\!\!\!=\text{Tr}(\boldsymbol{Y}\boldsymbol{Y}^{\textrm{T}}-\hat{\boldsymbol{H}}\hat{\boldsymbol{X}}\boldsymbol{Y}^{\textrm{T}}-\boldsymbol{Y}\hat{\boldsymbol{X}}^{\textrm{T}}\hat{\boldsymbol{H}}^{\textrm{T}})
+Tr((𝑿^𝑿^T+Tr(𝑽X)𝑼X)(𝑯^T𝑯^+Tr(𝑼H)𝑽H)T)\displaystyle+\text{Tr}((\hat{\boldsymbol{X}}\hat{\boldsymbol{X}}^{\textrm{T}}+\text{Tr}(\boldsymbol{V}_{X})\boldsymbol{U}_{X})(\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}}+\text{Tr}(\boldsymbol{U}_{H})\boldsymbol{V}_{H})^{\textrm{T}})
=Tr((𝒀𝑯^𝑿^)T(𝒀𝑯^𝑿^))+Tr(𝑿^𝑿^TTr(𝑼H)𝑽H\displaystyle\!\!\!\!\!\!\!\!\!\!=\text{Tr}\Big{(}\big{(}\boldsymbol{Y}-\hat{\boldsymbol{H}}\hat{\boldsymbol{X}}\big{)}^{\textrm{T}}\big{(}\boldsymbol{Y}-\hat{\boldsymbol{H}}\hat{\boldsymbol{X}}\big{)}\Big{)}+\text{Tr}\Big{(}\hat{\boldsymbol{X}}\hat{\boldsymbol{X}}^{\textrm{T}}\text{Tr}(\boldsymbol{U}_{H})\boldsymbol{V}_{H}
+Tr(𝑽X)𝑼X𝑯^T𝑯^+Tr(𝑽X)𝑼XTr(𝑼H)𝑽H).\displaystyle+\text{Tr}(\boldsymbol{V}_{X})\boldsymbol{U}_{X}\hat{\boldsymbol{H}}^{\textrm{T}}\hat{\boldsymbol{H}}+\text{Tr}(\boldsymbol{V}_{X})\boldsymbol{U}_{X}\text{Tr}(\boldsymbol{U}_{H})\boldsymbol{V}_{H}\Big{)}. (92)

By simplifying the above result, we obtain (44).

References

  • [1] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, pp. 788–791, 1999.
  • [2] R. Rubinstein, A. M. Bruckstein, and M. Elad, “Dictionaries for sparse representation modeling,” Proceedings of the IEEE, vol. 98, no. 6, pp. 1045–1057, June 2010.
  • [3] H. Zhu, G. Leus, and G. B. Giannakis, “Sparsity-cognizant total least-squares for perturbed compressive sampling,” IEEE Transactions on Signal Processing, vol. 59, no. 5, pp. 2002–2016, May 2011.
  • [4] E. J. Candes, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” J. ACM, vol. 58, no. 3, Jun. 2011. [Online]. Available: https://doi.org/10.1145/1970392.1970395
  • [5] J. T. Parker, P. Schniter, and V. Cevher, “Bilinear generalized approximate message passing—Part I: Derivation,” IEEE Transactions on Signal Processing, vol. 62, no. 22, pp. 5839–5853, 2014.
  • [6] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” Advances in Neural Information Processing Systems, vol. 13, pp. 556–562, 2001.
  • [7] M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons, “Algorithms and applications for approximate nonnegative matrix factorization,” Computational Statistics and Data Analysis, vol. 52, no. 1, pp. 155–173, 2007.
  • [8] C.-J. Lin, “Projected gradient methods for nonnegative matrix factorization,” Neural Computation, vol. 19, no. 10, pp. 2756–2779, 2007.
  • [9] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.
  • [10] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” Journal of Machine Learning Research, vol. 11, pp. 19–60, 2010.
  • [11] H. Zhu, G. Leus, and G. B. Giannakis, “Sparsity-cognizant total least-squares for perturbed compressive sampling,” IEEE Transactions on Signal Processing, vol. 59, no. 5, pp. 2002–2016, 2011.
  • [12] Z. Lin, M. Chen, and Y. Ma, “The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices,” http://export.arxiv.org/abs/1009.5055, 2010.
  • [13] Z. Wen, W. Yin, and Y. Zhang, “Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm,” Mathematical Programming Computation, vol. 4, no. 4, pp. 333–361, 2012.
  • [14] D. L. Donoho, A. Maleki, and A. Montanari, “Message passing algorithms for compressed sensing: I. motivation and construction,” in 2010 IEEE Information Theory Workshop on Information Theory (ITW 2010, Cairo), Jan 2010, pp. 1–5.
  • [15] ——, “Message passing algorithms for compressed sensing: II. analysis and validation,” in 2010 IEEE Information Theory Workshop on Information Theory (ITW 2010, Cairo), Jan 2010, pp. 1–5.
  • [16] S. Rangan, “Generalized approximate message passing for estimation with random linear mixing,” in Proc. IEEE Int. Symp. Inf. Theory, July 2011, pp. 2168–2172.
  • [17] S. Sarkar, A. K. Fletcher, S. Rangan, and P. Schniter, “Bilinear recovery using adaptive vector-amp,” IEEE Transactions on Signal Processing, vol. 67, no. 13, pp. 3383–3396, July 2019.
  • [18] X. Meng and J. Zhu, “Bilinear adaptive generalized vector approximate message passing,” IEEE Access, vol. 7, pp. 4807–4815, 2019.
  • [19] Z. Yuan, Q. Guo, and M. Luo, “Approximate message passing with unitary transformation for robust bilinear recovery,” IEEE Trans. Signal Process., vol. 69, pp. 617–630, Dec. 2020.
  • [20] Q. Guo and J. Xi, “Approximate message passing with unitary transformation,” arXiv preprint arXiv:1504.04799, Apr. 2015.
  • [21] J. Winn and C. M. Bishop, “Variational message passing,” J. Mach. Learn. Res., vol. 6, pp. 661–694, Apr. 2005.
  • [22] D. J. De Waal, Matrix-Valued Distributions.   Encyclopedia of Statistical Sciences, 1985.
  • [23] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An introduction to variational methods for graphical models,” Mach. Learn., vol. 37, no. 2, pp. 183–233, Nov. 1999.
  • [24] D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proc. Nat. Acad. Sci., vol. 106, no. 45, pp. 18914–18919, Nov. 2009.
  • [25] S. Rangan, P. Schniter, A. K. Fletcher, and S. Sarkar, “On the convergence of approximate message passing with arbitrary matrices,” IEEE Trans. Inf. Theory, vol. 65, no. 9, pp. 5339–5351, Sept. 2019.
  • [26] Q. Guo, D. D. Huang, S. Nordholm, J. Xi, and Y. Yu, “Iterative frequency domain equalization with generalized approximate message passing,” IEEE Signal Process. Lett., vol. 20, no. 6, pp. 559–562, June 2013.
  • [27] M. E. Tipping, “Sparse Bayesian learning and the relevance vector machine,” J. Mach. Learn. Res., vol. 1, pp. 211–244, Sep. 2001. [Online]. Available: https://doi.org/10.1162/15324430152748236
  • [28] M. Luo, Q. Guo, M. Jin, Y. C. Eldar, D. Huang, and X. Meng, “Unitary approximate message passing for sparse Bayesian learning,” IEEE Transactions on Signal Processing, vol. 69, pp. 6023–6039, 2021.
  • [29] J. T. Parker, P. Schniter, and V. Cevher, “Bilinear generalized approximate message passing—Part II: Applications,” IEEE Transactions on Signal Processing, vol. 62, no. 22, pp. 5854–5867, 2014.
  • [30] J. M. Bioucas-Dias, A. Plaza, N. Dobigeon, M. Parente, Q. Du, P. Gader, and J. Chanussot, “Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 5, no. 2, pp. 354–379, 2012.
  • [31] W. Xu, “Document clustering based on non-negative matrix factorization,” Proceedings of SIGIR-03, 2003.
  • [32] P. O. Hoyer, “Non-negative sparse coding,” in Proc. 12th IEEE Workshop on Neural Networks for Signal Processing, 2002.
  • [33] K. B. Petersen and M. S. Pedersen, The Matrix Cookbook.   on-line, Sep. 2007, http://matrixcookbook.com.