
Computation of Rate-Distortion-Perception Functions With Wasserstein Barycenter

Chunhui Chen$^{1,2}$, Xueyan Niu$^{2}$, Wenhao Ye$^{1,2}$, Shitong Wu$^{1,2}$, Bo Bai$^{2}$, Weichao Chen$^{2}$, Sian-Jheng Lin$^{2}$
$^{1}$Department of Mathematical Sciences, Tsinghua University, Beijing, China
$^{2}$Theory Lab, 2012 Labs, Huawei Technologies Co., Ltd., Hong Kong, China
[email protected], [email protected], {yewh20, wust20}@mails.tsinghua.edu.cn,
{baibo8, chenweichao5, lin.sian.jheng1}@huawei.com
Abstract

The nascent field of Rate-Distortion-Perception (RDP) theory is seeing a surge of research interest due to the application of machine learning techniques in the area of lossy compression. The information RDP function characterizes the three-way trade-off between description rate, average distortion, and perceptual quality measured by the discrepancy between probability distributions. However, computing RDP functions has been a challenge due to the introduction of the perceptual constraint, and existing research often resorts to data-driven methods. In this paper, we show that the information RDP function can be transformed into a Wasserstein Barycenter problem. The non-strict convexity introduced by the perceptual constraint is handled by an entropy regularization term. We prove that the entropy regularized model converges to the original problem. Furthermore, we propose an alternating iteration method based on the Sinkhorn algorithm to numerically solve the regularized optimization problem. Experimental results demonstrate the efficiency and accuracy of the proposed algorithm.

I Introduction

Lossy compression plays a vital role in the communication and storage of images, videos, and audio data [1, 2, 3, 4]. As the cornerstone of lossy compression, the classical Rate-Distortion (RD) theory [5] studies the tradeoff between the bit rate used for representing data and the distortion caused by compression [6]. The reconstruction quality is traditionally measured by a per-letter distortion metric, such as the mean-squared error. However, it has been shown that in practice, minimizing the distortion does not result in perceptually satisfying output for human subjects [7]. Since high perceptual quality may come at the expense of distortion [8, 9], researchers are motivated to extend the RD theory by taking perception into account [10, 11, 12].

Blau and Michaeli first proposed and studied the information Rate-Distortion-Perception (RDP) function in [10]. Closed-form solutions to the RDP problem are rarely available, except for special cases such as the Gaussian source with squared error distortion and Wasserstein-2 metric perception [11]. Therefore, a computational method for RDP functions is desirable.

Traditionally, the Blahut–Arimoto (BA) algorithm [13, 14] has been successful in the computation of channel capacities and RD functions. However, to the best of our knowledge, the BA algorithm has not been generalized to compute RDP functions. This may be due to the fact that RDP functions have more independent variables than RD functions, while in each alternating iteration step of the BA algorithm, all variables except the one being updated must be fixed. Moreover, RDP functions involve an additional nonlinear constraint on perception, which destroys the original simplex structure and invalidates existing numerical methods. An alternative way to compute RDP functions is based on data-driven methods, such as generative adversarial networks, which minimize a weighted combination of the distortion and the perception [10, 11, 15]. However, these deep learning-based methods often require substantial computational resources while having poor generalization ability.

In this paper, we propose a numerical method for computing RDP functions. Following the approach of a recent paper [16] on RD functions, we reformulate the RDP function in an Optimal Transport (OT) form with an additional constraint on perception. However, compared to RD functions, the additional constraint on perceptual quality destroys the original simplex structure, so the Alternating Sinkhorn (AS) algorithm proposed in [16] cannot be applied directly. To handle this issue, we prove that the additional constraint can be equivalently converted into a set of linear constraints by introducing an auxiliary variable. Consequently, the new model takes the form of the celebrated Wasserstein Barycenter problem [17, 18, 19]: it can be viewed as a minimization over two couplings (the transition mapping and the newly introduced variable) between the Barycenter (the reconstruction distribution) and the source distribution, and the objective is a weighted distance in the Wasserstein sense. Our model will therefore be referred to as the Wasserstein Barycenter model for Rate-Distortion-Perception functions (WBM-RDP).

With the operations above, we are able to design an algorithm for the WBM-RDP. To tackle the difficulty that the WBM-RDP is not strictly convex, we construct an entropy-regularized formulation of the WBM-RDP. We show that the new form admits a unique optimal solution, and that it converges to the original WBM-RDP. After obtaining the Lagrangian of the entropy regularized WBM-RDP, we observe that the degrees of freedom therein can be divided into three groups to be optimized alternately with closed-form solutions. As such, we propose an improved AS algorithm, which effectively combines the advantages of the AS algorithm for RD functions [16] and the entropy regularized algorithm for the Wasserstein Barycenter model [18]. Numerical experiments demonstrate that the proposed algorithm achieves high precision and efficiency under various circumstances.

II RDP functions and WBM-RDP

II-A Rate-Distortion-Perception Functions

Consider a discrete memoryless source $X\in\mathcal{X}$ and a reconstruction $\hat{X}\in\hat{\mathcal{X}}$, where $\mathcal{X}=\{x_{1},\cdots,x_{M}\}$ and $\hat{\mathcal{X}}=\{\hat{x}_{1},\cdots,\hat{x}_{N}\}$ are finite alphabets. Suppose $p_{X}$ and $p_{\hat{X}}$ are defined on the probability space $(\mathcal{X},\mathcal{F},\mathbb{P})$. We consider the single-letter distortion measure $\Delta:\mathcal{X}\times\hat{\mathcal{X}}\mapsto[0,\infty)$ and the perception measure between distributions $d:\mathbb{P}\times\mathbb{P}\mapsto[0,\infty)$.

Definition 1 (The information RDP function [10]).

Given a distortion fidelity $D$ and a perception quality $P$, the information RDP function is defined as

$R(D,P)=\min_{p_{\hat{X}\mid X}}\quad I(X,\hat{X})$  (1a)
$\text{s.t.}\quad\mathbb{E}[\Delta(X,\hat{X})]\leq D,$  (1b)
$\qquad\ d(p_{X},p_{\hat{X}})\leq P,$  (1c)

where the minimization is taken over all conditional distributions, and $I(X,\hat{X})$ is the mutual information.

Note that the information RDP function (1) degenerates to the RD function when the constraint (1c) is removed.

Since the alphabets $\mathcal{X}$ and $\hat{\mathcal{X}}$ are finite, we denote

$p_{i}=p_{X}(x_{i}),\quad r_{j}=p_{\hat{X}}(\hat{x}_{j}),\quad d_{ij}=\Delta(x_{i},\hat{x}_{j}),$

and $w_{ij}=W(\hat{x}_{j}\mid x_{i})$ for all $1\leq i\leq M$, $1\leq j\leq N$, where $W:\mathcal{X}\rightarrow\hat{\mathcal{X}}$ is the channel transition mapping. The discrete form of problem (1) can thus be written as

$\min_{\bm{w},\bm{r}}\quad\sum_{i=1}^{M}\sum_{j=1}^{N}(w_{ij}p_{i})[\log w_{ij}-\log r_{j}]$  (2a)
$\text{s.t.}\quad\sum_{j=1}^{N}w_{ij}=1,\ \sum_{i=1}^{M}w_{ij}p_{i}=r_{j},\ \forall i,j,$  (2b)
$\qquad\ \sum_{i=1}^{M}\sum_{j=1}^{N}w_{ij}p_{i}d_{ij}\leq D,\ \sum_{j=1}^{N}r_{j}=1,$  (2c)
$\qquad\ d(\bm{p},\bm{r})\leq P.$  (2d)

II-B Perception Measures

One commonly used measure of perceptual quality $d(\bm{p},\bm{r})$ is the Wasserstein metric,

$\mathcal{W}(\bm{p},\bm{r})=\min_{\bm{\Pi}}\quad\sum_{i=1}^{M}\sum_{j=1}^{N}\Pi_{ij}c_{ij}$  (3a)
$\text{s.t.}\quad\sum_{i=1}^{M}\Pi_{ij}=r_{j},\ \sum_{j=1}^{N}\Pi_{ij}=p_{i},\ \forall i,j,$  (3b)

where $c_{ij}$ denotes the cost between $x_{i}$ and $\hat{x}_{j}$. In some cases, the total variation (TV) distance $\delta(\bm{p},\bm{r})=\frac{1}{2}\|\bm{p}-\bm{r}\|_{1}$ and the Kullback-Leibler (KL) divergence $\mathrm{KL}(\bm{p}\|\bm{r})=\sum_{i=1}^{M}p_{i}[\log p_{i}-\log r_{i}]$ are also considered as measures of perceptual quality [10]. We focus on the Wasserstein metric; generalizations to the TV and KL measures follow straightforwardly, as discussed in Section III.
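To make these measures concrete, the following Python sketch evaluates them for discrete distributions. Solving (3) as a linear program via scipy.optimize.linprog is our illustrative choice for small alphabets, not the method used later in the paper (which relies on Sinkhorn scalings); the function names are ours.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein(p, r, c):
    """Discrete Wasserstein distance (3) as an LP over couplings Pi >= 0."""
    Mdim, Ndim = c.shape
    A_eq, b_eq = [], []
    for i in range(Mdim):                    # second marginal: sum_j Pi_ij = p_i
        row = np.zeros((Mdim, Ndim)); row[i, :] = 1.0
        A_eq.append(row.ravel()); b_eq.append(p[i])
    for j in range(Ndim):                    # first marginal: sum_i Pi_ij = r_j
        col = np.zeros((Mdim, Ndim)); col[:, j] = 1.0
        A_eq.append(col.ravel()); b_eq.append(r[j])
    res = linprog(c.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

def tv(p, r):
    """Total variation distance delta(p, r)."""
    return 0.5 * np.abs(p - r).sum()

def kl(p, r):
    """Kullback-Leibler divergence KL(p || r), with 0 log 0 = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(r[mask]))))
```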

II-C Wasserstein Barycenter Model for RDP functions

Solving the RDP function is more difficult than solving the RD function due to the new perception constraint (2d). Numerically, $d(\bm{p},\bm{r})$ must be evaluated frequently to ensure that the perception constraint is satisfied, and for metrics such as the Wasserstein metric (3) this is computationally expensive. Moreover, the introduction of (2d) changes the feasible domain of the RD problem. The feasible region of the RD function is a simplex, since (2b) and (2c) are linear constraints; this is why the BA algorithm and the AS algorithm can solve RD functions at low cost [16]. The perception constraint (2d), however, breaks this desirable simplex structure, which brings computational challenges. In this section, we develop a new model and an algorithm for solving RDP functions efficiently and accurately. Our starting point is to convert the perception constraint (2d) into a linear constraint in a higher-dimensional space, so that all the constraints (2b)-(2d) preserve the simplex structure there. This brings great convenience to our algorithm design. The main idea is summarized in the following theorem:

Theorem 1.

The value of the RDP function, i.e., the optimal value of (2), equals the optimal value of the following optimization problem (4). Moreover, the optimal $(\bm{w},\bm{r})$ of (4) is an optimal solution of (2):

$\min_{\bm{w},\bm{r},\bm{\Pi}}\quad\sum_{i=1}^{M}\sum_{j=1}^{N}(w_{ij}p_{i})[\log w_{ij}-\log r_{j}]$  (4a)
$\text{s.t.}\quad\sum_{j=1}^{N}w_{ij}=1,\ \sum_{i=1}^{M}w_{ij}p_{i}=r_{j},$  (4b)
$\qquad\ \sum_{i=1}^{M}\Pi_{ij}=r_{j},\ \sum_{j=1}^{N}\Pi_{ij}=p_{i},\ \forall i,j,$  (4c)
$\qquad\ \sum_{i=1}^{M}\sum_{j=1}^{N}w_{ij}p_{i}d_{ij}\leq D,\ \sum_{j=1}^{N}r_{j}=1,$  (4d)
$\qquad\ \sum_{i=1}^{M}\sum_{j=1}^{N}\Pi_{ij}c_{ij}\leq P.$  (4e)

Theorem 1 establishes an equivalence between the RDP problem (2) and the optimization (4). Also, we observe that the model (4) has the Wasserstein Barycenter structure. The optimization variable $\bm{r}$ can be regarded as the Barycenter, and $\mathrm{diag}(\bm{p})\cdot\bm{w}$ and $\bm{\Pi}$ are two couplings between the input $\bm{p}$ and the Barycenter $\bm{r}$. Thus, we have obtained the Wasserstein Barycenter model for RDP functions. In general, numerical methods for the Wasserstein Barycenter problem are computationally intensive [18, 20]. However, since $\bm{\Pi}$ does not appear in the objective function of (4), it is possible to design a fast and efficient algorithm for the above WBM-RDP.
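To illustrate the linear structure that Theorem 1 exposes, the following sketch checks whether a candidate triple satisfies the constraints (4b)-(4e); the helper name and tolerance are our assumptions.

```python
import numpy as np

def is_feasible(w, Pi, p, r, d, c, D, P, tol=1e-8):
    """Verify (4b)-(4e): both diag(p) @ w and Pi are couplings with
    marginals p and r, which is the Wasserstein-Barycenter structure."""
    ok = np.allclose(w.sum(axis=1), 1.0, atol=tol)       # (4b) rows of w sum to 1
    ok &= np.allclose(w.T @ p, r, atol=tol)              # (4b) Barycenter marginal
    ok &= np.allclose(Pi.sum(axis=0), r, atol=tol)       # (4c) first marginal of Pi
    ok &= np.allclose(Pi.sum(axis=1), p, atol=tol)       # (4c) second marginal of Pi
    ok &= (w * p[:, None] * d).sum() <= D + tol          # (4d) distortion constraint
    ok &= (Pi * c).sum() <= P + tol                      # (4e) perception constraint
    return bool(ok)
```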

III Entropy Regularized WBM-RDP and Improved AS Algorithm

The above section establishes the WBM-RDP model (4), the first step towards computing RDP functions. However, two difficulties arise in designing a practical algorithm. First, the WBM-RDP (4) is not strictly convex in $\bm{\Pi}$. Although the optimal value of (4) is unique, the optimal solutions may vary in the $\bm{\Pi}$ coordinates, so the convergence and numerical stability of the AS algorithm designed for RD functions [16] cannot be guaranteed. Second, the objective of the WBM-RDP contains the logarithmic term $\log r_{j}$ of the Barycenter, which does not appear in the standard formulation of the classical Wasserstein Barycenter problem. In this section, we overcome these difficulties and improve the Alternating Sinkhorn algorithm to solve the WBM-RDP.

III-A Entropy Regularized WBM-RDP

As discussed above, the WBM-RDP (4) is not strictly convex in $\bm{\Pi}$. The most direct remedy [21] is to add an entropy regularization term to the objective (4a), i.e.,

$H(\bm{\Pi})=\sum_{i=1}^{M}\sum_{j=1}^{N}\Pi_{ij}\log(\Pi_{ij}).$

This leads to the entropy regularized WBM-RDP:

$\min_{\bm{w},\bm{r},\bm{\Pi}}\quad\sum_{i=1}^{M}\sum_{j=1}^{N}(w_{ij}p_{i})[\log w_{ij}-\log r_{j}]+\varepsilon H(\bm{\Pi})$  (5a)
$\text{s.t.}\quad\sum_{j=1}^{N}w_{ij}=1,\ \sum_{i=1}^{M}w_{ij}p_{i}=r_{j},$  (5b)
$\qquad\ \sum_{i=1}^{M}\Pi_{ij}=r_{j},\ \sum_{j=1}^{N}\Pi_{ij}=p_{i},\ \forall i,j,$  (5c)
$\qquad\ \sum_{i=1}^{M}\sum_{j=1}^{N}w_{ij}p_{i}d_{ij}\leq D,\ \sum_{j=1}^{N}r_{j}=1,$  (5d)
$\qquad\ \sum_{i=1}^{M}\sum_{j=1}^{N}\Pi_{ij}c_{ij}\leq P.$  (5e)

Here, $\varepsilon>0$ is a newly introduced regularization parameter, which makes the entropy regularized WBM-RDP strictly convex. Moreover, during the alternating optimization iterations, closed-form expressions for the dual variables can always be obtained, which accelerates the algorithm while improving its accuracy. Furthermore, the following theorem guarantees that the solution of the entropy regularized WBM-RDP (5) converges to that of the WBM-RDP (4) as $\varepsilon\rightarrow 0$.

Theorem 2.

(Convergence in $\varepsilon$) The solution $\{\bm{w}_{\varepsilon},\bm{\Pi}_{\varepsilon},\bm{r}_{\varepsilon}\}$ to (5) converges, as $\varepsilon\rightarrow 0$, to the optimal solution of (4) with minimal entropy $H(\bm{\Pi})$ among all optimal solutions, i.e.,

$\{\bm{w}_{\varepsilon},\bm{\Pi}_{\varepsilon},\bm{r}_{\varepsilon}\}\underset{\varepsilon\rightarrow 0}{\longrightarrow}\operatorname*{argmin}\big\{H(\bm{\Pi})\,\big|\,\{\bm{w},\bm{\Pi},\bm{r}\}\in\mathcal{M}\big\},$  (6)

where $\mathcal{M}$ denotes the set of all optimal solutions to (4).

III-B The Improved Alternating Sinkhorn Algorithm

We construct the Lagrangian of the regularized WBM-RDP (5) by introducing dual variables $\bm{\alpha},\bm{\theta}\in\mathbb{R}^{M}$, $\bm{\beta},\bm{\tau}\in\mathbb{R}^{N}$, $\lambda,\gamma\in\mathbb{R}^{+}$, and $\eta\in\mathbb{R}$:

$\mathcal{L}(\bm{w},\bm{\Pi},\bm{r};\bm{\alpha},\bm{\beta},\bm{\theta},\bm{\tau},\lambda,\eta,\gamma)$  (7)
$=\sum_{i=1}^{M}\sum_{j=1}^{N}w_{ij}p_{i}\log\frac{w_{ij}}{r_{j}}+\varepsilon\sum_{i=1}^{M}\sum_{j=1}^{N}\Pi_{ij}\log\Pi_{ij}$
$+\sum_{i=1}^{M}\alpha_{i}\Big(\sum_{j=1}^{N}w_{ij}-1\Big)+\sum_{j=1}^{N}\beta_{j}\Big(\sum_{i=1}^{M}w_{ij}p_{i}-r_{j}\Big)$
$+\sum_{i=1}^{M}\theta_{i}\Big(\sum_{j=1}^{N}\Pi_{ij}-p_{i}\Big)+\sum_{j=1}^{N}\tau_{j}\Big(\sum_{i=1}^{M}\Pi_{ij}-r_{j}\Big)$
$+\lambda\Big(\sum_{i=1}^{M}\sum_{j=1}^{N}w_{ij}p_{i}d_{ij}-D\Big)+\eta\Big(\sum_{j=1}^{N}r_{j}-1\Big)$
$+\gamma\Big(\sum_{i=1}^{M}\sum_{j=1}^{N}\Pi_{ij}c_{ij}-P\Big).$

Here we note that the Lagrangian function (7) is convex with respect to each variable. Furthermore, we can improve the algorithm by designing the directions of the alternating iterations according to how the variables appear in (7). Next, we sketch the main ingredients of our algorithm. The pseudo-code is presented in Algorithm 1.

  1) Update $\bm{w}$ and the associated dual variables $\bm{\alpha},\bm{\beta},\lambda$ while fixing $\bm{r},\bm{\Pi}$ as constant parameters. First, we update $\alpha_{i}$ and $\beta_{j}$ via the Sinkhorn scalings

$\psi_{j}\leftarrow 1\Big/\sum_{i=1}^{M}K_{ij}\phi_{i}p_{i},\quad\phi_{i}\leftarrow 1\Big/\sum_{j=1}^{N}K_{ij}\psi_{j}r_{j},$  (8)

where $\phi_{i}=\exp(-\alpha_{i}/p_{i}-1/2)$, $\psi_{j}=\exp(-\beta_{j}-1/2)$, and $K_{ij}=\exp(-\lambda d_{ij})$. Second, we update $\lambda$ by a root search on the following monotonic single-variable function on $\mathbb{R}^{+}$:

$F(\lambda)\triangleq\sum_{i=1}^{M}\sum_{j=1}^{N}d_{ij}p_{i}\phi_{i}e^{-\lambda d_{ij}}\psi_{j}r_{j}-D=0.$  (9)
  2) Update $\bm{\Pi}$ and the associated dual variables $\bm{\theta},\bm{\tau},\gamma$ while fixing $\bm{r},\bm{w}$ as constant parameters. We alternately update $\theta_{i},\tau_{j}$ via

$\varphi_{j}\leftarrow r_{j}\Big/\sum_{i=1}^{M}M_{ij}\xi_{i},\quad\xi_{i}\leftarrow p_{i}\Big/\sum_{j=1}^{N}M_{ij}\varphi_{j},$  (10)

where $\xi_{i}=\exp(-\theta_{i}/\varepsilon-1/2)$, $\varphi_{j}=\exp(-\tau_{j}/\varepsilon-1/2)$, and $M_{ij}=\exp(-\gamma c_{ij}/\varepsilon)$. Again, we update $\gamma$ by a root search on the following monotonic single-variable function on $\mathbb{R}^{+}$:

$G(\gamma)\triangleq\sum_{i=1}^{M}\sum_{j=1}^{N}c_{ij}\xi_{i}e^{-\gamma c_{ij}/\varepsilon}\varphi_{j}-P=0.$  (11)
  3) Update $\bm{r}$ and the associated dual variable $\eta$ while fixing $\bm{w},\bm{\Pi}$ as constant parameters. We update $\eta$ by finding the root of the following single-variable function on its largest monotone interval $(\max_{j}(\beta_{j}+\tau_{j}),+\infty)$:

$H(\eta)\triangleq\sum_{j=1}^{N}\Big[\Big(\sum_{i=1}^{M}w_{ij}p_{i}\Big)\Big/(\eta-\beta_{j}-\tau_{j})\Big]-1=0,$  (12)

and finally obtain

$r_{j}=\Big(\sum_{i=1}^{M}w_{ij}p_{i}\Big)\Big/(\eta-\beta_{j}-\tau_{j}).$  (13)

We stress that the three steps above contain no inner iterations, because $\bm{\Pi}$ does not appear in the objective function of the WBM-RDP, as discussed above. Therefore, compared to existing algorithms for the Wasserstein Barycenter problem, our algorithm gains much in efficiency and simplicity (a runnable sketch is given after Algorithm 1 below). On the other hand, our algorithm adopts an alternating iteration technique similar to the Alternating Sinkhorn algorithm for RD functions. However, a vital operation we perform is altering the projection directions of the two joint-distribution variables, i.e., $\bm{w}$ and $\bm{\Pi}$, which differs essentially from the AS algorithm for RD functions. We therefore name it the Improved Alternating Sinkhorn algorithm.

Algorithm 1 The Improved Alternating Sinkhorn Algorithm
Input: Distortion measure $d_{ij}$, marginal distribution $p_{i}$, cost matrix $c_{ij}$, regularization parameter $\varepsilon$, maximum iteration number $max\_iter$.
Output: Minimal value $\sum_{i=1}^{M}\sum_{j=1}^{N}(w_{ij}p_{i})[\log w_{ij}-\log r_{j}]+\varepsilon\sum_{i=1}^{M}\sum_{j=1}^{N}\Pi_{ij}\log(\Pi_{ij})$ with respect to the variables $\bm{w}$, $\bm{r}$, and $\bm{\Pi}$.
1:  Initialization: $\bm{\phi},\bm{\xi}=\mathbf{1}_{M}$, $\bm{\psi},\bm{\varphi}=\mathbf{1}_{N}$, $\lambda,\gamma=1$
2:  Set $K_{ij}\leftarrow\exp(-\lambda d_{ij})$
3:  Set $M_{ij}\leftarrow\exp(-\gamma c_{ij}/\varepsilon)$
4:  for $\ell=1:max\_iter$ do
5:    $\psi_{j}\leftarrow 1/\sum_{i=1}^{M}K_{ij}\phi_{i}p_{i},\quad j=1,\cdots,N$
6:    $\phi_{i}\leftarrow 1/\sum_{j=1}^{N}K_{ij}\psi_{j}r_{j},\quad i=1,\cdots,M$
7:    Solve $F(\lambda)=0$ for $\lambda\in\mathbb{R}^{+}$ with Newton's method
8:    Update $K_{ij}\leftarrow\exp(-\lambda d_{ij})$ and $w_{ij}\leftarrow\phi_{i}K_{ij}\psi_{j}r_{j}$
9:    $\varphi_{j}\leftarrow r_{j}/\sum_{i=1}^{M}M_{ij}\xi_{i},\quad j=1,\cdots,N$
10:   $\xi_{i}\leftarrow p_{i}/\sum_{j=1}^{N}M_{ij}\varphi_{j},\quad i=1,\cdots,M$
11:   Solve $G(\gamma)=0$ for $\gamma\in\mathbb{R}^{+}$ with Newton's method
12:   Update $M_{ij}\leftarrow\exp(-\gamma c_{ij}/\varepsilon)$
13:   Solve $H(\eta)=0$ for $\eta\in\mathbb{R}$ with Newton's method
14:   Update $r_{j}\leftarrow\big(\sum_{i=1}^{M}w_{ij}p_{i}\big)/(\eta-\beta_{j}-\tau_{j})$
15:  end for
16:  return $\sum_{i=1}^{M}\sum_{j=1}^{N}(\phi_{i}p_{i}K_{ij}\psi_{j}r_{j})[\log(\phi_{i}K_{ij}\psi_{j})]$
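As a complement to the pseudo-code, a minimal Python sketch of Algorithm 1 is given below. It follows (8)-(13) directly; bracketed root searches (SciPy's brentq) stand in for Newton's method, $\bm{r}$ is initialized uniformly, and the multipliers $\lambda,\gamma$ are clamped at zero when the corresponding inequality constraint is slack. All three choices are our assumptions, as the pseudo-code leaves them unspecified.

```python
import numpy as np
from scipy.optimize import brentq

def improved_as(p, d, c, D, P, eps=1e-2, max_iter=300):
    """Sketch of the improved AS algorithm for WBM-RDP (assumes D, P > 0).
    p: (M,) source distribution; d, c: (M, N) distortion and cost matrices."""
    M, N = d.shape
    r = np.full(N, 1.0 / N)              # Barycenter iterate (assumed init)
    phi, xi = np.ones(M), np.ones(M)     # phi_i = e^{-alpha_i/p_i - 1/2}, xi_i = e^{-theta_i/eps - 1/2}
    psi, vphi = np.ones(N), np.ones(N)   # psi_j = e^{-beta_j - 1/2}, vphi_j = e^{-tau_j/eps - 1/2}
    lam = gam = 1.0

    def pos_root(f):
        # f is monotonically decreasing on R^+; return 0 when the
        # constraint is slack, otherwise bracket the root and solve.
        if f(0.0) <= 0.0:
            return 0.0
        hi = 1.0
        while f(hi) > 0.0:
            hi *= 2.0
        return brentq(f, 0.0, hi)

    for _ in range(max_iter):
        # Step 1: Sinkhorn scalings (8), then root of F in (9) for lambda.
        K = np.exp(-lam * d)
        psi = 1.0 / (K.T @ (phi * p))
        phi = 1.0 / (K @ (psi * r))
        lam = pos_root(lambda t: np.sum(
            d * np.exp(-t * d) * np.outer(phi * p, psi * r)) - D)
        K = np.exp(-lam * d)
        w = np.outer(phi, psi * r) * K   # w_ij = phi_i K_ij psi_j r_j

        # Step 2: Sinkhorn scalings (10), then root of G in (11) for gamma.
        Mk = np.exp(-gam * c / eps)
        vphi = r / (Mk.T @ xi)
        xi = p / (Mk @ vphi)
        gam = pos_root(lambda t: np.sum(
            c * np.exp(-t * c / eps) * np.outer(xi, vphi)) - P)

        # Step 3: recover beta, tau from the scalings, solve H in (12)
        # for eta on (max_j(beta_j + tau_j), inf), then update r by (13).
        beta = -np.log(psi) - 0.5
        tau = -eps * (np.log(vphi) + 0.5)
        a = w.T @ p                      # a_j = sum_i w_ij p_i
        b = beta + tau
        Hf = lambda e: np.sum(a / (e - b)) - 1.0
        lo, hi = b.max() + 1e-9, b.max() + 1.0
        while Hf(hi) > 0.0:
            hi = b.max() + 2.0 * (hi - b.max())
        eta = brentq(Hf, lo, hi)
        r = a / (eta - b)

    rate = np.sum(w * p[:, None] * (np.log(np.maximum(w, 1e-300))
                                    - np.log(r)[None, :]))
    return rate, w, r
```

Pairing this routine with the discretized Gaussian source of Section IV (squared-error $d_{ij}$ and Wasserstein-2 cost $c_{ij}$) would trace out one point of the RDP surface per $(D,P)$ pair.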
Remark 1.

If the perception measure in (2d) is replaced by the TV distance, we only need to set the cost matrix in (4e) as $c_{ij}=\mathbf{1}_{i\neq j}$ (see Eq. (6.11) of [22]); then our improved AS algorithm is still applicable.

Remark 2.

If the perception measure in (2d) is replaced by the KL divergence, our improved AS algorithm becomes simpler: the Sinkhorn iteration in step 2) can be omitted, since $\bm{\Pi}$ need not be introduced. In step 2), $G(\gamma)$ is replaced by

$G(\gamma)\triangleq\sum_{j=1}^{M}p_{j}\log\Big(\Big(\gamma p_{j}+\sum_{i=1}^{M}w_{ij}p_{i}\Big)\Big/(\eta-\beta_{j})\Big)-T,$

where $T=\sum_{i=1}^{M}p_{i}\log p_{i}-P$, and $H(\eta)$ is replaced by

$H(\eta)\triangleq\sum_{j=1}^{M}\Big(\gamma p_{j}+\sum_{i=1}^{M}w_{ij}p_{i}\Big)\Big/(\eta-\beta_{j})-1,$

and finally $r_{j}=\big(\gamma p_{j}+\sum_{i=1}^{M}w_{ij}p_{i}\big)/(\eta-\beta_{j})$.

IV Numerical Experiment

In this section, we numerically study the validity of the WBM-RDP and the improved AS algorithm. All experiments are conducted on a PC with 8 GB of RAM and an 11th Gen Intel(R) Core(TM) i5-1135G7 CPU @ 2.40 GHz.

We compute RDP functions under two settings with different perception measures: one is the binary source with Hamming distortion [10], and the other is the Gaussian source with squared error distortion [11]. Moreover, these two settings have analytical expressions [10, 11] when the perception measures are the TV distance and the Wasserstein-2 metric, respectively.

For the binary source, we can directly compute the result since the discrete distribution $\bm{p}$ can be set beforehand. For the Gaussian source, we first truncate the source to an interval $[-S,S]$ and then discretize it by a set of uniform grid points $\{x_{i}\}_{i=1}^{N}$ with adjacent spacing $\delta=2S/(N-1)$, i.e.,

$x_{i}=-S+(i-1)\delta,\quad i=1,\cdots,N.$

The corresponding distribution $\bm{p}$ of the Gaussian source is then given by

$p_{i}=F\Big(x_{i}+\frac{\delta}{2}\Big)-F\Big(x_{i}-\frac{\delta}{2}\Big),\quad i=1,\cdots,N,$  (14)

where $F(x)$ denotes the cumulative distribution function of the Gaussian source. Unless otherwise specified, we take $p=0.1$ for the binary source and $S=8$, $\delta=0.5$, $\mu=0$, $\sigma=2$ for the Gaussian source. For our method, we set $\varepsilon=0.01$. Additionally, we note that the space complexity is $\mathcal{O}(N^{2})$ in the dimension of the discretized Gaussian source. The following results suggest that $\delta=0.5$ is precise enough for this discretization.
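The discretization (14) is straightforward to reproduce; a short sketch with the experimental parameter values follows. The renormalization of the truncated tails is our assumption, since (14) alone does not sum exactly to one on $[-S,S]$.

```python
import numpy as np
from scipy.stats import norm

# Discretize the Gaussian source on [-S, S] following (14),
# with the parameters of Section IV: S = 8, delta = 0.5, mu = 0, sigma = 2.
S, delta, mu, sigma = 8.0, 0.5, 0.0, 2.0
x = np.arange(-S, S + delta / 2, delta)      # uniform grid with spacing delta
p = norm.cdf(x + delta / 2, mu, sigma) - norm.cdf(x - delta / 2, mu, sigma)
p /= p.sum()                                 # renormalize the truncated tails (our choice)
d = (x[:, None] - x[None, :]) ** 2           # squared-error distortion d_ij
c = d.copy()                                 # Wasserstein-2 cost c_ij coincides here
```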

Figure 1: The Rate-Distortion-Perception functions obtained by our method: a) the binary source with TV distance and its analytical solution; b) the binary source with KL divergence; c) the Gaussian source with Wasserstein-2 metric and its analytical solution; d) the Gaussian source with TV distance.

In Fig. 1, we plot the RDP curves given by our method under different perception parameters $P$ and compare them with known theoretical expressions. The results obtained by our method match the analytical expressions well in Fig. 1 a) and c). Furthermore, we can also plot the results for cases where no analytical solution is known, in Fig. 1 b) and d). We also plot the 3D RDP surfaces in Fig. 2. For the Gaussian source, we set $S=4$, $\sigma=1$ for visual effect. The results are in accord with those derived from data-driven methods in [10].

Figure 2: The 3D plot of the Rate-Distortion-Perception functions obtained by our method: a) the binary source with TV distance; b) the binary source with KL divergence; c) the Gaussian source with Wasserstein-2 metric; d) the Gaussian source with TV distance.

IV-A Algorithm Convergence Verification

We verify the convergence of the improved AS algorithm in this subsection. We take the residual errors of the Karush-Kuhn-Tucker (KKT) conditions of the optimization problem (5) as the indicator of convergence. We define the $L_{1}$ residual error $r_{\psi}$ as $r_{\psi}=\sum_{j=1}^{N}\big|\psi_{j}\sum_{i=1}^{M}K_{ij}\phi_{i}p_{i}-1\big|$, and $r_{\phi}$, $r_{\lambda}$, $r_{\eta}$, $r_{\varphi}$, $r_{\xi}$, $r_{\gamma}$ are defined similarly. We define the overall residual error $r$ as the root mean square of the above residual errors.
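A minimal sketch of this convergence indicator is given below. Only $r_{\psi}$ is stated explicitly in the text; the Sinkhorn residuals $r_{\phi}$, $r_{\varphi}$, $r_{\xi}$ follow by analogy, and passing the root-search residuals ($|F(\lambda)|$, $|G(\gamma)|$, $|H(\eta)|$) through `extra` is our reading of the remaining definitions.

```python
import numpy as np

def overall_residual(phi, psi, xi, vphi, K, Mk, p, r, extra=()):
    """Root mean square of the L1 residuals of the KKT fixed-point conditions."""
    r_psi = np.abs(psi * (K.T @ (phi * p)) - 1.0).sum()    # as defined in the text
    r_phi = np.abs(phi * (K @ (psi * r)) - 1.0).sum()      # analogous for phi
    r_vphi = np.abs(vphi * (Mk.T @ xi) - r).sum()          # analogous for varphi
    r_xi = np.abs(xi * (Mk @ vphi) - p).sum()              # analogous for xi
    errs = np.array([r_psi, r_phi, r_vphi, r_xi, *extra])
    return float(np.sqrt(np.mean(errs ** 2)))
```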

In Fig. 3, we plot the convergence trajectories of $r$ against the iteration number for the binary source with TV distance and the Gaussian source with Wasserstein-2 metric. For the binary source, we set $P=0.06$; for the Gaussian source, we set $P=2$. The results show different convergence behaviors for different distortion parameters, and all cases eventually converge below $1\times 10^{-10}$.

Figure 3: The convergence trajectories of $r$ from our method for a) the binary source with different distortion parameters; b) the Gaussian source with different distortion parameters.

IV-B The Regularization Parameter $\varepsilon$

We have theoretically shown that as $\varepsilon\rightarrow 0$ the entropy regularized problem converges to the original problem. However, according to general results on entropy regularized OT problems, smaller $\varepsilon$ yields higher precision but requires more computation [20]. Moreover, according to the theory of Sinkhorn iterations [20], an excessively small $\varepsilon$ may trigger numerical stability problems. Thus we investigate the impact of $\varepsilon$ on the WBM-RDP.

Results are shown in Table I, where the error is the $L_{1}$ difference between the algorithm's results and the explicit solutions. For the binary source with TV distance, we set $P=0.06$; for the Gaussian source with Wasserstein-2 metric, we set $P=2$. As $\varepsilon$ decreases, the computational time rises while the results become more accurate, regardless of the source. Furthermore, for $\varepsilon=1\times 10^{-4}$, the small $\varepsilon$ causes numerical instability for both sources. Therefore, from a practical perspective, $\varepsilon=0.01$ is an ideal choice for RDP functions, as it ensures a certain level of accuracy without consuming too much time.

TABLE I: Computational time and error against $\varepsilon$ for different sources.

Source          | $\varepsilon$ | Time (s) | Error
----------------|---------------|----------|---------
Binary source   | 1.00E-01      | 6.16     | 4.36E-04
                | 5.00E-02      | 6.50     | 8.75E-05
                | 1.00E-02      | 8.69     | 5.41E-06
                | 5.00E-03      | 10.62    | 3.47E-06
                | 1.00E-04      | -        | -
Gaussian source | 1.00E-01      | 70.01    | 1.49E-02
                | 5.00E-02      | 74.87    | 7.30E-03
                | 1.00E-02      | 99.56    | 2.30E-03
                | 5.00E-03      | 118.97   | 1.80E-03
                | 1.00E-04      | -        | -

V Conclusion

In this work, we present a new scheme for computing the information Rate-Distortion-Perception functions. We convert the original problem into a Wasserstein Barycenter model for Rate-Distortion-Perception functions, and we propose the improved Alternating Sinkhorn method with entropy regularization to solve the resulting optimization problem. Numerical experiments show that our algorithm achieves high accuracy and efficiency. Extensions to properties of RDP functions and applications to practical lossy compression schemes will be reported in the journal version.

References

  • [1] F. Mentzer, G. Toderici, M. Tschannen, and E. Agustsson, “High-fidelity generative image compression,” in Advances in Neural Information Processing Systems (NeurIPS 2020), H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33, Dec. 2020, pp. 11913–11924.
  • [2] S. Ma, X. Zhang, C. Jia, Z. Zhao, S. Wang, and S. Wang, “Image and video compression with neural networks: A review,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 6, pp. 1683–1698, 2020.
  • [3] G. Lu, X. Zhang, W. Ouyang, L. Chen, Z. Gao, and D. Xu, “An end-to-end learning framework for video compression,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 10, pp. 3292–3308, 2020.
  • [4] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2022.
  • [5] C. E. Shannon, “Coding Theorems for a Discrete Source with a Fidelity Criterion,” IRE International Convention Record, vol. 4, pp. 142–163, Mar. 1959.
  • [6] T. M. Cover and J. A. Thomas, Elements of Information Theory.   Wiley-Interscience, 2006.
  • [7] S. Santurkar, D. M. Budden, and N. Shavit, “Generative Compression,” in 2018 Picture Coding Symposium (PCS), San Francisco, California, USA, Jun. 2018, pp. 1–5.
  • [8] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. V. Gool, “Generative Adversarial Networks for Extreme Learned Image Compression,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, South Korea, Oct.-Nov. 2019, pp. 221–231.
  • [9] Y. Blau and T. Michaeli, “The Perception-Distortion Tradeoff,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, Utah, USA, Jun. 2018, pp. 6228–6237.
  • [10] Y. Blau and T. Michaeli, “Rethinking Lossy Compression: The Rate-Distortion-Perception Tradeoff,” in 36th International Conference on Machine Learning (ICML), Long Beach, California, USA, Jun. 2019, pp. 675–685.
  • [11] G. Zhang, J. Qian, J. Chen, and A. Khisti, “Universal Rate-Distortion-Perception Representations for Lossy Compression,” in Advances in Neural Information Processing Systems (NeurIPS 2021), vol. 34, Dec. 2021, pp. 11517–11529.
  • [12] X. Niu, D. Gündüz, B. Bai, and W. Han, “Conditional rate-distortion-perception trade-off,” in 2023 IEEE International Symposium on Information Theory (ISIT).   (to appear), 2023.
  • [13] S. Arimoto, “An Algorithm for Computing the Capacity of Arbitrary Discrete Memoryless Channels,” IEEE Transactions on Information Theory, vol. 18, no. 1, pp. 14–20, Jan. 1972.
  • [14] R. E. Blahut, “Computation of Channel Capacity and Rate-Distortion Functions,” IEEE Transactions on Information Theory, vol. 18, no. 4, pp. 460–473, Jul. 1972.
  • [15] O. Kirmemis and A. M. Tekalp, “A Practical Approach for Rate-Distortion-Perception Analysis in Learned Image Compression,” in 2021 Picture Coding Symposium (PCS), Bristol, United Kingdom, Jun. 2021, pp. 1–5.
  • [16] S. Wu, W. Ye, H. Wu, H. Wu, W. Zhang, and B. Bai, “A Communication Optimal Transport Approach to the Computation of Rate Distortion Functions,” arXiv preprint arXiv:2212.10098, 2022.
  • [17] B. Pass, “Multi-marginal Optimal Transport: Theory and Applications,” ESAIM: Mathematical Modelling and Numerical Analysis, vol. 49, no. 6, pp. 1771–1790, Feb. 2015.
  • [18] M. Cuturi and A. Doucet, “Fast Computation of Wasserstein Barycenters,” in 31st International Conference on Machine Learning (ICML), vol. 32, Beijing, China, Jun. 2014, pp. 685–693.
  • [19] M. Agueh and G. Carlier, “Barycenters in the Wasserstein Space,” Society for Industrial and Applied Mathematics Journal on Mathematical Analysis, vol. 43, no. 2, pp. 904–924, Jan. 2011.
  • [20] G. Peyré, M. Cuturi et al., “Computational Optimal Transport: With Applications to Data Science,” Foundations and Trends® in Machine Learning, vol. 11, no. 5-6, pp. 355–607, Feb. 2019.
  • [21] M. Nutz and J. Wiesel, “Entropic Optimal Transport: Convergence of Potentials,” Probability Theory and Related Fields, vol. 184, no. 1, pp. 401–424, Nov. 2022.
  • [22] C. Villani, Optimal Transport: Old and New.   Springer, 2009, vol. 338.