Estimating the Rate-Distortion Function by Wasserstein Gradient Descent
Abstract
In the theory of lossy compression, the rate-distortion (R-D) function R(D) describes how much a data source can be compressed (in bit-rate) at any given level of fidelity (distortion). Obtaining R(D) for a given data source establishes the fundamental performance limit for all compression algorithms. We propose a new method to estimate R(D) from the perspective of optimal transport. Unlike the classic Blahut–Arimoto algorithm which fixes the support of the reproduction distribution in advance, our Wasserstein gradient descent algorithm learns the support of the optimal reproduction distribution by moving particles. We prove its local convergence and analyze the sample complexity of our R-D estimator based on a connection to entropic optimal transport. Experimentally, we obtain comparable or tighter bounds than state-of-the-art neural network methods on low-rate sources while requiring considerably less tuning and computation effort. We also highlight a connection to maximum-likelihood deconvolution and introduce a new class of sources that can be used as test cases with known solutions to the R-D problem.
1 Introduction
The rate-distortion (R-D) function occupies a central place in the theory of lossy compression. For a given data source and a fidelity (or distortion) criterion, R(D) characterizes the minimum possible communication cost needed to reproduce a source sample within an error threshold of D, by any compression algorithm (Shannon, 1959). A basic scientific and practical question is therefore establishing R(D) for any given data source of interest, which helps assess the (sub)optimality of compression algorithms and guide their development. The classic algorithm by Blahut (1972) and Arimoto (1972) assumes a known discrete source and computes its R(D) by an exhaustive optimization procedure. This often has limited applicability in practice, and a line of research has sought to instead estimate R(D) from data samples (Harrison and Kontoyiannis, 2008; Gibson, 2017), with recent methods (Yang and Mandt, 2022; Lei et al., 2023a) inspired by deep generative models.
In this work, we propose a new approach to R-D estimation from the perspective of optimal transport. Our starting point is the formulation of the R-D problem as the minimization of a certain rate functional (Harrison and Kontoyiannis, 2008) over the space of probability measures on the reproduction alphabet. Optimization over such an infinite-dimensional space has long been studied under gradient flows (Ambrosio et al., 2008), and we consider a concrete algorithmic implementation based on moving particles in the reproduction space. This formulation of the R-D problem also suggests connections to entropic optimal transport and non-parametric statistics, each offering us new insight into the solution of the R-D problem under a quadratic distortion. More specifically, our contributions are three-fold:
First, we introduce a neural-network-free upper bound estimator of R(D) for continuous alphabets. We implement the estimator by Wasserstein gradient descent (WGD) over the space of reproduction distributions. Experimentally, we found the method to converge much more quickly than state-of-the-art neural methods with hand-tuned architectures, while offering comparable or tighter bounds.
Second, we theoretically characterize convergence of our WGD algorithm and the sample complexity of our estimator. The latter draws on a connection between the R-D problem and that of minimizing an entropic optimal transport (EOT) cost relative to the source measure, allowing us to turn statistical bounds for EOT (Mena and Niles-Weed, 2019) into finite-sample bounds for R-D estimation.
Finally, we introduce a new, rich class of sources with known ground truth, including Gaussian mixtures, as a benchmark for algorithms. While the literature relies on Gaussian or Bernoulli sources for this purpose, we use the connection with maximum-likelihood deconvolution to show that a Gaussian convolution of any distribution can serve as a source with a known solution to the R-D problem.
2 Lossy compression, entropic optimal transport, and MLE
This section introduces the R-D problem and its rich connections to entropic optimal transport and statistics, along with new insights into its solution. Sec. 2.1 sets the stage for our method (Sec. 4) by a known formulation of the standard R-D problem as an optimization problem over a space of probability measures. Sec. 2.2 discusses the equivalence between the R-D problem and a projection of the source distribution under entropic optimal transport; this is key to our sample complexity results in Sec. 4.3. Lastly, Sec. 2.3 gives a statistical interpretation of R-D estimation as maximum-likelihood deconvolution and uses it to analytically derive a segment of the R-D curve for a new class of sources under quadratic distortion; this allows us to assess the optimality of algorithms in the experiments of Sec. 5.1.
2.1 Setup
For a memoryless data source X with distribution P_X, its rate-distortion (R-D) function R(D) describes the minimum possible number of bits per sample needed to reproduce the source within a prescribed distortion threshold D. Let the source X and its reproduction Y take values in two sets 𝒳 and 𝒴, known as the source and reproduction alphabets, and let ρ: 𝒳 × 𝒴 → [0, ∞) be a given distortion function. The R-D function is defined by the following optimization problem (Polyanskiy and Wu, 2022),
R(D) = inf_{Q_{Y|X}: E[ρ(X, Y)] ≤ D} I(X; Y),    (1)
where Q_{Y|X} is any Markov kernel from 𝒳 to 𝒴, conceptually associated with a (possibly stochastic) compression algorithm, and I(X; Y) is the mutual information of the joint distribution P_X Q_{Y|X}.
For ease of presentation, we now switch to a more abstract notation without reference to random variables. We provide the precise definitions in the Supplementary Material. Let 𝒳 and 𝒴 be standard Borel spaces; let μ be a fixed probability measure on 𝒳, which should be thought of as the source distribution P_X. For a measure π on the product space 𝒳 × 𝒴, the notation π_1 (or π_2) denotes the first (or second) marginal of π. For any ν on 𝒴, we denote by Π(μ, ν) the set of couplings between μ and ν (i.e., measures π with π_1 = μ and π_2 = ν). Similarly, Π(μ, ·) denotes the set of measures π on 𝒳 × 𝒴 with π_1 = μ. Throughout the paper, K denotes a transition kernel (conditional distribution) from 𝒳 to 𝒴, and μ ⊗ K denotes the measure on 𝒳 × 𝒴 formed by μ and K. Then (1) is equivalent to
R(D) = inf_{π ∈ Π(μ, ·), ∫ ρ dπ ≤ D} H(π | π_1 ⊗ π_2),    (2)
where H(· | ·) denotes relative entropy, i.e., H(P | Q) := ∫ log(dP/dQ) dP for two measures P, Q defined on a common measurable space when P is absolutely continuous w.r.t. Q, and H(P | Q) := ∞ otherwise.
To make the problem more tractable, we follow the approach of the classic Blahut–Arimoto algorithm (Blahut, 1972; Arimoto, 1972) (to be discussed in Sec. 3.1) and work with an equivalent unconstrained Lagrangian problem as follows. Instead of parameterizing the R-D function via a distortion threshold D, we parameterize it via a Lagrange multiplier β > 0. For each fixed β (usually selected from a predefined grid), we aim to solve the following optimization problem,
F_β(μ) := inf_{π ∈ Π(μ, ·)} H(π | μ ⊗ π_2) + β ∫ ρ dπ.    (3)
Geometrically, F_β(μ) is the y-axis intercept of a tangent line to the R(D) curve with slope −β, and R(D) is determined by the convex envelope of all such tangent lines (Gray, 2011). To simplify notation, we often drop the dependence on μ (e.g., we write F_β for F_β(μ)) whenever it is harmless.
To set the stage for our later developments, we write the unconstrained R-D problem as
F_β(μ) = inf_{ν ∈ P(𝒴)} L_β(μ, ν),    (4)
L_β(μ, ν) := inf_{π ∈ Π(μ, ·)} H(π | μ ⊗ ν) + β ∫ ρ dπ,    (5)
where we refer to the optimization objective L_β(μ, ν) as the rate function (Harrison and Kontoyiannis, 2008). We abuse the notation to write L_β(ν) when it is viewed as a function of ν only, and refer to it as the rate functional. The rate function characterizes a generalized Asymptotic Equipartition Property, where L_β(μ, ν) is the asymptotically optimal cost of lossy compression of data distributed according to μ using a random codebook constructed from samples of ν (Dembo and Kontoyiannis, 2002). Notably, the optimization over π in (5) can be solved analytically (Csiszár, 1974a, Lemma 1.3), and L_β(μ, ν) simplifies to
L_β(μ, ν) = ∫ −log ( ∫ e^{−βρ(x,y)} ν(dy) ) μ(dx).    (6)
In practice, the source μ is only accessible via independent samples, on the basis of which we propose to estimate its R(D), or equivalently F_β. Let μ^m denote an m-sample empirical measure of μ, i.e., μ^m := (1/m) Σ_{i=1}^m δ_{X_i} with X_1, ..., X_m being independent samples from μ, which should be thought of as the "training data". Following Harrison and Kontoyiannis (2008), we consider two kinds of (plug-in) estimators for F_β: (1) the non-parametric estimator inf_ν L_β(μ^m, ν), and (2) the parametric estimator inf_{ν ∈ 𝒢} L_β(μ^m, ν), where 𝒢 is a family of probability measures on 𝒴. Harrison and Kontoyiannis (2008) showed that under rather broad conditions, both kinds of estimators are strongly consistent, i.e., they converge to F_β (and respectively, to inf_{ν ∈ 𝒢} L_β(μ, ν)) with probability one as m → ∞. Our algorithm will implement the parametric estimator with 𝒢 chosen to be the set of probability measures with finite support, and we will develop finite-sample convergence results for both kinds of estimators in the continuous setting (Proposition 4.3).
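As a concrete illustration, the plug-in estimate L_β(μ^m, ν) for an n-point candidate ν can be computed as in the minimal sketch below (Python/NumPy, assuming the squared-error distortion ρ(x, y) = ‖x − y‖²; the function and variable names are ours, not taken from the paper's code).

```python
import numpy as np
from scipy.special import logsumexp

def rate_functional(X, Y, w, beta):
    """Plug-in estimate of L_beta(mu^m, nu) in (6), where mu^m is the
    empirical measure of the rows of X (shape (m, d)) and
    nu = sum_j w[j] * delta_{Y[j]} is an n-point measure (Y: (n, d))."""
    # Pairwise squared-error distortion rho(x_i, y_j) = ||x_i - y_j||^2.
    dist = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)   # (m, n)
    # log of the inner integral int e^{-beta rho(x_i, y)} nu(dy), done stably.
    log_Z = logsumexp(np.log(w)[None, :] - beta * dist, axis=1)    # (m,)
    return -log_Z.mean()                                           # nats per sample
```

Averaging −log Z over the m training samples is exactly the Monte Carlo counterpart of the outer integral in (6).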
2.2 Connection to entropic optimal transport
The R-D problem turns out to have a close connection to entropic optimal transport (EOT) (Peyré and Cuturi, 2019), which we will exploit in Sec. 4.3 to obtain sample complexity results under our approach. For ε > 0, the entropy-regularized optimal transport problem between μ and ν is given by
OT_ε(μ, ν) := inf_{π ∈ Π(μ, ν)} ∫ ρ dπ + ε H(π | μ ⊗ ν).    (7)
We now consider the problem of projecting μ onto the space of probability measures on 𝒴 under the cost OT_ε:
inf_{ν ∈ P(𝒴)} OT_ε(μ, ν).    (8)
In the OT literature this is known as the (regularized) Kantorovich estimator (Bassetti et al., 2006) for μ, and can also be viewed as a Wasserstein barycenter problem (Agueh and Carlier, 2011).
With the identification ε = 1/β, problem (8) is in fact equivalent to the R-D problem (4): compared to (5), the extra constraint on the second marginal of π in (7) is redundant at the optimal ν. More precisely, Lemma 7.1 shows that (we omit the notational dependence on μ when it is fixed):
inf_{ν ∈ P(𝒴)} L_β(ν) = inf_{ν ∈ P(𝒴)} β OT_{1/β}(μ, ν).    (9)
Existence of a minimizer ν* holds under mild conditions, for instance if 𝒴 = ℝ^d and ρ(x, ·) is a coercive lower semicontinuous function of y for each x (Csiszár, 1974a, p. 66).
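For reference, the dual potentials of (7) for discrete marginals can be computed with a log-domain Sinkhorn iteration; below is a minimal sketch (Python/NumPy; not the paper's implementation, and the names are ours). In Sec. 4.1, the potential associated with ν plays the role of the first variation of OT_ε.

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log(dist, log_mu, log_nu, eps, num_iters=500):
    """Log-domain Sinkhorn for OT_eps between two discrete measures.
    dist: (m, n) cost matrix; log_mu, log_nu: log weights of the marginals.
    Returns dual potentials (f, g); the optimal coupling is
    pi_ij = exp((f_i + g_j - dist_ij) / eps + log_mu_i + log_nu_j)."""
    f = np.zeros(dist.shape[0])
    g = np.zeros(dist.shape[1])
    for _ in range(num_iters):
        # Alternately enforce the two marginal (Schroedinger) equations.
        f = -eps * logsumexp((g[None, :] - dist) / eps + log_nu[None, :], axis=1)
        g = -eps * logsumexp((f[:, None] - dist) / eps + log_mu[:, None], axis=0)
    return f, g
```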
2.3 Connection to maximum-likelihood deconvolution
The connection between R-D and maximum-likelihood estimation has been observed in the information theory, machine learning and compression literature (Harrison and Kontoyiannis, 2008; Alemi et al., 2018; Ballé et al., 2017; Theis et al., 2017; Yang et al., 2020; Yang and Mandt, 2022). Here, we bring attention to a basic equivalence between the R-D problem and maximum-likelihood deconvolution, where the connection is particularly natural under a quadratic distortion function. Also see (Rigollet and Weed, 2018) for a related discussion that inspired ours and extension to a non-quadratic distortion. We provide further insight from the view of variational learning and inference in Section 10.
Maximum-likelihood deconvolution is a classical problem of non-parametric statistics and mixture models (Carroll and Hall, 1988; Lindsay and Roeder, 1993). The deconvolution problem is concerned with estimating an unknown distribution μ₀ from noise-corrupted observations X_1, X_2, ..., where for each i, we have X_i = Y_i + N_i, Y_i ∼ μ₀, and N_1, N_2, ... are i.i.d. noise variables, independent of the Y_i, with a known distribution. For concreteness, suppose all variables are ℝ^d-valued and the noise distribution is N(0, σ²I_d) with Lebesgue density φ_σ. Denote the distribution of the observations by μ. Then μ has a Lebesgue density given by the convolution μ₀ * φ_σ. Here, we consider the population-level (instead of the usual sample-based) maximum-likelihood estimator (MLE) for μ₀:
max_{ν ∈ P(ℝ^d)} ∫ log (ν * φ_σ)(x) μ(dx),    (10)
and observe that μ₀ attains the maximum. Plugging in the Gaussian density φ_σ, we see that the MLE problem (10) is equivalent to the R-D problem (4) with quadratic distortion ρ(x, y) = ‖x − y‖² and β = 1/(2σ²), with the objective given by (6) in the form of a marginal log-likelihood. Thus the R-D problem has the interpretation of estimating a distribution ν from its noisy observations distributed according to μ, assuming Gaussian noise with variance 1/(2β).
This connection suggests analytical solutions to the R-D problem for a variety of sources that arise from convolving an underlying distribution with Gaussian noise. Consider an R-D problem (4) with 𝒳 = 𝒴 = ℝ^d and ρ(x, y) = ‖x − y‖², and let the source μ = μ₀ * N(0, σ²I_d) be the convolution between an arbitrary measure μ₀ and Gaussian noise with known variance σ². E.g., using a discrete measure for μ₀ results in a Gaussian mixture source with equal covariance among its components. When β = 1/(2σ²), we recover exactly the population-MLE problem (10) discussed earlier, which has the solution ν_β = μ₀. While this allows us to obtain one point of R(D), we can in fact extend this idea to any β ≥ 1/(2σ²) and obtain the analytical form for the corresponding segment of the R-D curve. Specifically, for any such β, applying the summation rule for independent Gaussians reveals the source distribution as
μ = ( μ₀ * N(0, (σ² − 1/(2β)) I_d) ) * N(0, (1/(2β)) I_d),
i.e., as the convolution between another underlying distribution ν_β := μ₀ * N(0, (σ² − 1/(2β)) I_d) and independent noise with variance 1/(2β). A solution to the R-D problem (3) is then analogously given by ν_β, with the corresponding optimal coupling given by ν_β ⊗ K*, K*(y, dx) = N(x; y, (1/(2β)) I_d) dx. (Footnote: Here K* maps from the reproduction to the source alphabet, opposite to the kernel K elsewhere in the text.) Evaluating the distortion and mutual information of the coupling then yields the point of R(D) associated with β. Fig. 1 illustrates the R(D) of a toy Gaussian mixture source, along with the upper bound estimated by our proposed WGD algorithm (Sec. 4); note that ν_β transitions from continuous (a Gaussian mixture with smaller component variances) to singular (a mixture of two Diracs) at β = 1/(2σ²). See caption for more details.
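To illustrate this family of test sources, the sketch below (Python/NumPy) samples from a Gaussian-mixture source μ = μ₀ * N(0, σ²I) with a two-point μ₀. The mixture weights, locations, and noise scale are illustrative choices of ours, and the relation β = 1/(2σ²) assumes the squared-error convention ρ(x, y) = ‖x − y‖² used above.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1                                   # known noise scale (illustrative)
d = 2

# Underlying measure mu_0: a two-point mixture (illustrative weights/locations).
locs = np.array([[-1.0, 0.0], [1.0, 0.0]])
weights = np.array([0.5, 0.5])

def sample_source(m):
    """Draw m samples from the source mu = mu_0 * N(0, sigma^2 I)."""
    idx = rng.choice(len(weights), size=m, p=weights)
    return locs[idx] + sigma * rng.standard_normal((m, d))

# Under rho(x, y) = ||x - y||^2, beta = 1/(2 sigma^2) recovers the MLE problem,
# whose optimal reproduction distribution nu_beta is mu_0 itself; for larger
# beta it is mu_0 convolved with N(0, (sigma^2 - 1/(2 beta)) I).
beta_star = 1.0 / (2 * sigma ** 2)
X_train = sample_source(1000)
```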
3 Related Work
3.1 Blahut–Arimoto
The Blahut–Arimoto (BA) algorithm (Blahut, 1972; Arimoto, 1972) is the default method for computing R(D) when the source μ is known and the alphabets are finite. For a fixed β, BA carries out the optimization problem (3) via coordinate descent. Starting from an initial measure ν_0, the BA algorithm at step t computes an updated pair (K_{t+1}, ν_{t+1}) as follows:
K_{t+1}(x, dy) = e^{−βρ(x,y)} ν_t(dy) / ∫ e^{−βρ(x,ỹ)} ν_t(dỹ),    (11)
ν_{t+1} = (μ ⊗ K_{t+1})_2,  i.e.,  ν_{t+1}(dy) = ∫ K_{t+1}(x, dy) μ(dx).    (12)
When the alphabets are finite, the above computation can be carried out in matrix and vector operations, and the resulting sequence can be shown to converge to an optimum of (3); cf. (Csiszár, 1974b, 1984). When the alphabets are not finite, e.g., 𝒳 = 𝒴 = ℝ^d, the BA algorithm no longer applies, as it is unclear how to digitally represent the measure ν_t and kernel K_t and to tractably perform the integrals required by the algorithm. The common workaround is to perform a discretization step and then apply BA on the resulting discrete problem.
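For finite alphabets, the updates (11)–(12) reduce to the matrix–vector iteration sketched below (Python/NumPy, log-domain for stability; the function and argument names are ours and assume the pairwise distortion matrix has been precomputed).

```python
import numpy as np
from scipy.special import logsumexp

def blahut_arimoto(dist, p_x, beta, num_steps=200):
    """BA iteration for a finite problem.
    dist: (m, n) distortion matrix rho(x_i, y_j); p_x: (m,) source pmf."""
    n = dist.shape[1]
    log_q = np.full(n, -np.log(n))                            # uniform init nu_0
    log_K = log_q[None, :] - beta * dist
    for _ in range(num_steps):
        # (11): K(x_i, y_j) proportional to q(y_j) exp(-beta rho_ij), row-normalized.
        log_K = log_q[None, :] - beta * dist
        log_K -= logsumexp(log_K, axis=1, keepdims=True)
        # (12): nu_{t+1}(y_j) = sum_i p(x_i) K(x_i, y_j).
        log_q = logsumexp(np.log(p_x)[:, None] + log_K, axis=0)
    return np.exp(log_q), np.exp(log_K)
```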
One standard discretization method is to tile up the alphabets with small bins (Gray and Neuhoff, 1998). This quickly becomes infeasible as the number of dimensions increases. We therefore consider discretizing the source space 𝒳 to the support of the training data distribution μ^m, i.e., the discretized source alphabet is the set of training samples; this can be justified by the consistency of the parametric R-D estimator (Harrison and Kontoyiannis, 2008). It is less clear how to discretize the reproduction space 𝒴, especially in high dimensions. Since we work with 𝒳 = 𝒴, we will discretize 𝒴 similarly and use an n-element random subset of the training samples, as also considered by Lei et al. (2023a). As we will show, this rather arbitrary placement of the support of ν results in poor performance, and can be significantly improved from our perspective of evolving particles.
3.2 Neural network-based methods for estimating
RD-VAE (Yang and Mandt, 2022): To overcome the limitations of the BA algorithm, Yang and Mandt (2022) proposed to parameterize the transition kernel and the reproduction distribution of the BA algorithm by neural density estimators (Papamakarios et al., 2021), and optimize the same objective (3) by (stochastic) gradient descent. They estimate (3) by Monte Carlo using joint samples from the source and the parameterized kernel; in particular, the relative entropy is written as the expectation of the log density ratio between the kernel and the reproduction distribution, where the integrand is computed exactly. In practice, an alternative parameterization is often used where the neural density estimators are defined on a lower-dimensional latent space than the reproduction alphabet, and the resulting approach is closely related to VAEs (Kingma and Welling, 2013). Yang and Mandt (2022) additionally propose a neural estimator for a lower bound on R(D), based on a dual representation due to Csiszár (1974a).

NERD (Lei et al., 2023a): Instead of working with the transition kernel as in the RD-VAE, Lei et al. (2023a) considered optimizing the form of the rate functional in (6), via gradient descent on the parameters of ν, which is parameterized by a neural network. Let ζ be a base distribution over a latent space, such as a standard Gaussian, and g_θ be a decoder network. The variational measure ν is then modeled as the image measure of ζ under g_θ. To evaluate and optimize the objective (6), the intractable inner integral w.r.t. ν is replaced with a plug-in estimator based on n samples y_1, ..., y_n from ν, so that for a given x,
−log ∫ e^{−βρ(x,y)} ν(dy) ≈ −log ( (1/n) Σ_{i=1}^n e^{−βρ(x, y_i)} ).    (13)
After training, we estimate an R-D upper bound using n samples from ν (to be discussed in Sec. 4.4).
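The shape of NERD's plug-in objective is shown in the minimal sketch below (Python/NumPy). The decoder here is a placeholder linear map standing in for the network of Lei et al. (2023a), and all names are ours; in practice one would differentiate this quantity w.r.t. the decoder parameters with automatic differentiation.

```python
import numpy as np
from scipy.special import logsumexp

def nerd_objective(X, theta, beta, n=128, rng=np.random.default_rng(0)):
    """Plug-in estimate of (6) with nu modeled as the pushforward of a
    standard Gaussian base distribution under a decoder.
    Placeholder decoder: a linear map theta of shape (k, d)."""
    k = theta.shape[0]
    Z = rng.standard_normal((n, k))            # base samples z_1..z_n
    Y = Z @ theta                              # decoder output y_i = g_theta(z_i)
    dist = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    # (13): replace the inner integral by an average over the n samples.
    log_Z = logsumexp(-beta * dist, axis=1) - np.log(n)
    return -log_Z.mean()
```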
3.3 Other related work
Within information theory: Recent work by Wu et al. (2022) and Lei et al. (2023b) also notes the connection between the R-D function and entropic optimal transport. Wu et al. (2022) compute the R-D function in the finite and known alphabet setting by solving a version of the EOT problem (8), whereas Lei et al. (2023b) numerically verify the equivalence (9) on a discrete problem and discuss the connection to scalar quantization. We also experimented with estimating R(D) by solving the EOT problem (8), but found it computationally much more efficient to work with the rate functional (6), and we see the primary benefit of the EOT connection as bringing in tools from statistical OT (Genevay et al., 2019; Mena and Niles-Weed, 2019; Rigollet and Stromme, 2022) for R-D estimation. Outside of information theory: Rigollet and Weed (2018) note a connection between the EOT projection problem (8) and maximum-likelihood deconvolution (10); our work complements their perspective by re-interpreting both problems through the equivalent R-D problem. Unbeknownst to us at the time, Yan et al. (2023) proposed similar algorithms to ours in the context of Gaussian mixture estimation, which we recognize as R-D estimation under quadratic distortion (see Sec. 2.3). Their work is based on gradient flow in the Fisher-Rao-Wasserstein (FRW) geometry (Chizat et al., 2018), which our hybrid algorithm can be seen as implementing. Yan et al. (2023) prove that, in an idealized setting with infinite particles, FRW gradient descent does not get stuck at local minima; by contrast, our convergence and sample-complexity results (Prop. 4.2, 4.3) hold for any finite number of particles. We additionally consider larger-scale problems and the stochastic optimization setting.
4 Proposed method
For our algorithm, we require 𝒳 = 𝒴 = ℝ^d and ρ to be continuously differentiable. We now introduce the gradient descent algorithm in Wasserstein space to solve the problems (4) and (8). We defer all proofs to the Supplementary Material. To minimize a functional F over the space of probability measures, our algorithm essentially simulates the gradient flow (Ambrosio et al., 2008) of F and follows the trajectory of steepest descent in the Wasserstein geometry. In practice, we represent a measure ν by a collection of particles and at each time step update ν in a direction of steepest descent of F, as given by its (negative) Wasserstein gradient. Denote by P_n(𝒴) the set of probability measures on 𝒴 that are supported on at most n points. Our algorithm implements the parametric R-D estimator with the choice 𝒢 = P_n(𝒴) (see discussions at the end of Sec. 2.1).
4.1 Wasserstein gradient descent (WGD)
Abstractly, Wasserstein gradient descent updates the variational measure ν to its pushforward (id − γ∇Ψ_ν)_# ν under the map id − γ∇Ψ_ν, for a function ∇Ψ_ν called the Wasserstein gradient of F at ν (see below) and a step size γ. To implement this scheme, we represent ν as a convex combination of Dirac measures, ν = Σ_{i=1}^n w_i δ_{y_i}, with locations y_1, ..., y_n and weights w_1, ..., w_n. The algorithm moves each particle y_i in the direction of −∇Ψ_ν(y_i); more precisely, y_i ← y_i − γ ∇Ψ_ν(y_i).
Since the optimization objectives (4) and (8) appear as integrals w.r.t. the data distribution μ, we can also apply stochastic optimization and perform stochastic gradient descent on mini-batches of size B. This allows us to handle a very large or infinite number of data samples, or the case where the source μ is continuous. We formalize the procedure in Algorithm 1.
The following gives a constructive definition of a Wasserstein gradient which forms the computational basis of our algorithm. In the literature, the Wasserstein gradient is instead usually defined as a Fréchet differential (cf. (Ambrosio et al., 2008, Definition 10.1.1)), but we emphasize that in smooth settings, the given definition recovers the one from the literature (cf. (Chizat, 2022, Lemma A.2)).
Definition 4.1.
For a functional F: P(𝒴) → ℝ and ν ∈ P(𝒴), we say that Ψ_ν: 𝒴 → ℝ is a first variation of F at ν if
lim_{s→0} (1/s) [ F(ν + s(ν' − ν)) − F(ν) ] = ∫ Ψ_ν d(ν' − ν)   for all ν' ∈ P(𝒴).
We call its (Euclidean) gradient ∇Ψ_ν, if it exists, the Wasserstein gradient of F at ν.
For F(ν) = OT_ε(μ, ν), the first variation is given by the Kantorovich potential, which is the solution of the convex dual of (7) and commonly computed by Sinkhorn's algorithm (Peyré and Cuturi, 2019; Nutz, 2021). Specifically, let (φ, ψ) be optimal dual potentials for OT_ε(μ, ν). Then ψ is the first variation w.r.t. ν (cf. (Carlier et al., 2022, equation (20))), and hence ∇ψ is the Wasserstein gradient. This gradient exists whenever ρ is differentiable and the marginals are sufficiently light-tailed; we give details in Sec. 9.1 of the Supplementary Material. For F(ν) = L_β(ν), the first variation can be computed explicitly. As derived in Sec. 9.1 of the Supplementary Material, the first variation at ν is
ψ_ν(y) = −∫ e^{−βρ(x,y)} / ( ∫ e^{−βρ(x,ỹ)} ν(dỹ) ) μ(dx),
and then the Wasserstein gradient is ∇ψ_ν. We observe that ψ_ν is computationally cheap; evaluating it corresponds to running a single iteration of Sinkhorn's algorithm. By contrast, finding the potential for OT_ε(μ, ν) requires running Sinkhorn's algorithm to convergence.
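Putting the pieces together, one WGD step on the rate functional can be sketched as follows (Python/NumPy, assuming the squared-error distortion ρ(x, y) = ‖x − y‖² and uniform particle weights; names are ours). It evaluates the gradient of ψ_ν at the particles, which amounts to one Sinkhorn-like normalization, and then moves them.

```python
import numpy as np
from scipy.special import logsumexp

def wgd_step(X, Y, beta, step_size):
    """One Wasserstein gradient step on the rate functional (6).
    X: (m, d) mini-batch of source samples; Y: (n, d) particle locations."""
    m, n = X.shape[0], Y.shape[0]
    diff = Y[None, :, :] - X[:, None, :]                  # (m, n, d), y_j - x_i
    dist = np.sum(diff ** 2, axis=-1)                     # rho(x_i, y_j)
    # Responsibilities e^{-beta rho_ij} / Z_nu(x_i): one Sinkhorn-like iteration.
    log_r = -beta * dist - logsumexp(-beta * dist - np.log(n), axis=1, keepdims=True)
    r = np.exp(log_r)                                     # (m, n)
    # Gradient of psi_nu at each particle: 2 beta * E_mu[(y_j - x) r(x, y_j)].
    grad = 2 * beta * np.einsum('ij,ijk->jk', r, diff) / m   # (n, d)
    return Y - step_size * grad
```

Intuitively, each particle is pulled toward a responsibility-weighted average of the source samples it currently "explains".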
Like the usual Euclidean gradient, the Wasserstein gradient can be shown to possess a linearization property, whereby the loss functional is reduced by taking a small enough step along its Wasserstein gradient. Following (Carlier et al., 2022), we state it as follows: for any ν ∈ P(𝒴) and perturbation h ∈ L²(ν; ℝ^d),
F((id + h)_# ν) − F(ν) = ∫ ⟨∇Ψ_ν, h⟩ dν + o(‖h‖_{L²(ν)}),
F((id + h)_# ν) − F(ν) ≤ ∫ ⟨∇Ψ_ν, h⟩ dν + C ‖h‖²_{L²(ν)}.    (14)
The first line of (14) is proved for F = OT_ε in (Carlier et al., 2022, Proposition 4.2) in the case that the marginals are compactly supported and ρ is twice continuously differentiable. In this setting, the second line of (14) follows using the first line and a combination of boundedness and Lipschitz continuity of the potentials, see (Carlier et al., 2022, Proposition 2.2 and Corollary 2.4).
The linearization property given by (14) enables us to show that Wasserstein gradient descent for OT_ε and L_β converges to a stationary point under mild conditions (Proposition 4.2; see Sec. 9.2 of the Supplementary Material for the proof).
4.2 Hybrid algorithm
A main limitation of the BA algorithm is that the support of ν_t is restricted to that of the (possibly bad) initialization ν_0. On the other hand, Wasserstein gradient descent (Algorithm 1) only evolves the particle locations of ν, but not the weights, which are fixed to be uniform by default. We therefore consider a hybrid algorithm where we alternate between WGD and the BA update steps, allowing us to optimize the particle weights as well. Experimentally, this translates to faster convergence than the base WGD algorithm (Sec. 5.1). Note, however, that unlike WGD, the hybrid algorithm does not directly lend itself to the stochastic optimization setting, as BA updates on mini-batches no longer guarantee monotonic improvement in the objective and can lead to divergence. We treat the convergence of the hybrid algorithm in the Supplementary Material Sec. 9.4.
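One way to realize the alternation is sketched below (Python/NumPy, squared-error distortion, full-batch; a sketch of the idea rather than the paper's exact schedule, with names of our choosing): a BA-style reweighting of the particle weights followed by a gradient move of the locations.

```python
import numpy as np
from scipy.special import logsumexp

def hybrid_step(X, Y, w, beta, step_size):
    """Alternate one BA reweighting of the weights w with one Wasserstein
    gradient step on the particle locations Y, using all samples X."""
    dist = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)   # (m, n)
    # BA step: posterior K(x_i, y_j), then w_j <- mean_i K(x_i, y_j)  (cf. (11)-(12)).
    log_K = np.log(w)[None, :] - beta * dist
    log_K -= logsumexp(log_K, axis=1, keepdims=True)
    w = np.exp(logsumexp(log_K, axis=0) - np.log(X.shape[0]))
    # WGD step on the locations, now with the updated (non-uniform) weights.
    log_r = -beta * dist - logsumexp(np.log(w)[None, :] - beta * dist,
                                     axis=1, keepdims=True)
    diff = Y[None, :, :] - X[:, None, :]
    grad = 2 * beta * np.einsum('ij,ijk->jk', np.exp(log_r), diff) / X.shape[0]
    return Y - step_size * grad, w
```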
4.3 Sample complexity
Let ρ(x, y) = ‖x − y‖² and ε = 1/β. Leveraging work on the statistical complexity of EOT (Mena and Niles-Weed, 2019), we obtain finite-sample bounds for the theoretical estimators implemented by WGD in terms of the number of particles n and the number of source samples m. The bounds hold for both the R-D problem (4) and the EOT projection problem (8), as they share the same optimizers (see Sec. 2.2), and strengthen existing asymptotic results for empirical R-D estimators (Harrison and Kontoyiannis, 2008). We note that a recent result by Rigollet and Stromme (2022) might be useful for deriving alternative bounds under distortion functions other than the quadratic.
Proposition 4.3.
Let μ be σ²-subgaussian. Then every optimizer of (4) and (8) is also σ²-subgaussian. Consider 𝒢 = P_n(𝒴). For a constant depending only on d, σ and β, we have
for all m, n ∈ ℕ, where P_n(𝒴) is the set of probability measures over 𝒴 supported on at most n points, μ^m is the empirical measure of μ with m independent samples, and the expectation is over these samples. The same inequalities hold for L_β, with the identification ε = 1/β.
4.4 Estimation of rate and distortion
Here, we describe our estimator for an upper bound of R(D) after solving the unconstrained problem (3). We provide more details in Sec. 8 of the Supplementary Material.
For any given pair of ν and K, the distortion D = ∫∫ ρ(x, y) K(x, dy) μ(dx) and rate R = H(μ ⊗ K | μ ⊗ ν) always lie on an upper bound of R(D) (Berger, 1971). The two quantities can be estimated by standard Monte Carlo, provided we can sample from μ ⊗ K and evaluate the density of K(x, ·) w.r.t. ν.
When only ν is given, e.g., obtained from optimizing (6) with WGD or NERD, we estimate an R-D upper bound as follows. As in the BA algorithm, we construct a kernel K from ν similarly to (11), i.e., K(x, dy) ∝ e^{−βρ(x,y)} ν(dy); then we estimate (D, R) using the pair (ν, K) as described earlier.
As NERD uses a continuous ν, we follow (Lei et al., 2023a) and approximate it with its n-sample empirical measure to estimate (D, R). A limitation of NERD, BA, and our method is that they tend to converge to a rate estimate of at most log n, where n is the support size of ν. This is because, as the algorithms approach an n-point minimizer of the R-D problem, the rate estimate approaches the mutual information of the corresponding coupling, which is upper-bounded by log n (Eckstein and Nutz, 2022). In practice, this means that if a target point of R(D) has rate r nats, then we need n on the order of e^r to estimate it accurately.
4.5 Computational considerations
Common to all the aforementioned methods is the evaluation of a pairwise distortion matrix between m points in 𝒳 and n points in 𝒴, which usually has a cost of O(mnd) for a d-dimensional source. While RD-VAE uses a single reproduction sample per source sample (via the reparameterization trick), the other methods (BA, WGD, NERD) typically use a much larger n and thus have the distortion computation as their main computational bottleneck. Compared to BA and WGD, the neural methods (RD-VAE, NERD) incur additional computation from neural network operations, which can be significant for large networks.
For NERD and WGD (and BA), the rate estimate upper bound of log n nats/sample (see Sec. 4.4) can present computational challenges. To target a high-rate setting, a large number of particles (large n) is required, and care needs to be taken to avoid running out of memory during the distortion matrix computation (one possibility is to use a small batch size with stochastic optimization).
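A related and simple memory-saving device is to accumulate the quantities that depend on the distortion matrix in chunks of source samples rather than all at once; the sketch below (Python/NumPy, with a hypothetical chunk size of our choosing) computes the per-sample log-partition values this way.

```python
import numpy as np
from scipy.special import logsumexp

def log_partition_chunked(X, Y, w, beta, chunk=1024):
    """Compute log Z_nu(x_i) = log sum_j w_j exp(-beta ||x_i - y_j||^2)
    without materializing the full (m, n) distortion matrix at once."""
    out = np.empty(X.shape[0])
    for start in range(0, X.shape[0], chunk):
        xb = X[start:start + chunk]
        dist = np.sum((xb[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        out[start:start + chunk] = logsumexp(np.log(w)[None, :] - beta * dist, axis=1)
    return out
```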
5 Experiments
We compare the empirical performance of our proposed method (WGD) and its hybrid variant with Blahut–Arimoto (BA) (Blahut, 1972; Arimoto, 1972), RD-VAE (Yang and Mandt, 2022), and NERD (Lei et al., 2022) on the tasks of maximum-likelihood deconvolution and estimation of R-D upper bounds. While we experimented with WGD for both the rate functional L_β and the EOT cost OT_ε, we found the former to be 10 to 100 times faster computationally while giving similar or better results; we therefore focus on WGD for L_β in the discussions below. For the neural-network baselines, we use the same (or as similar as possible) network architectures as in the original work (Yang and Mandt, 2022; Lei et al., 2023a). We use the Adam optimizer for all gradient-based methods, except that we use simple gradient descent with a decaying step size in Sec. 5.1 to better compare the convergence speed of WGD and its hybrid variant. Further experiment details and results are given in the Supplementary Material Sec. 11.
5.1 Deconvolution
To better understand the behavior of the various algorithms, we apply them to a deconvolution problem with known ground truth (see Sec. 2.3). We adopt the Gaussian noise model as before, letting μ₀ be the uniform measure on the unit circle in ℝ² and the source μ = μ₀ * N(0, σ²I₂) with a known noise variance σ².
We use the same number of particles n for BA, NERD, WGD and its hybrid variant. We use a two-layer network for NERD and RD-VAE with some hand-tuning (we replace the softplus activation in the original RD-VAE network by ReLU as it led to difficulty in optimization). Fig. 3 plots the resulting loss curves and shows that the proposed algorithms converge the fastest to the ground truth value of the objective. In Fig. 3, we visualize the final ν at the end of training, compared to the ground truth μ₀ supported on the circle (colored in cyan). Note that we initialize ν for BA, WGD, and its hybrid variant to the same random data samples. While BA is stuck with the randomly initialized particles and assigns large weights to those closer to the circle, WGD learns to move the particles to uniformly cover the circle. The hybrid algorithm, being able to reweight particles to reduce their transportation cost, learns a different solution where a cluster of particles covers the top-left portion of the circle with small weights while the remaining particles evenly cover the rest. Unlike our particle-based methods, the neural methods generally struggle to place the support of their ν exactly on the circle.
We additionally compare how the performance of BA, NERD, and the proposed algorithms scales to higher dimensions and a higher β (corresponding to a lower entropic regularization ε in OT_ε). Fig. 4 plots the gap between the converged and the optimal losses for the algorithms, and demonstrates the proposed algorithms to be more particle-efficient and to scale more favorably than the alternatives, which also use n particles in the reproduction space. We additionally visualize how the converged particles for our methods vary across the R-D trade-off in Fig. 6 of the Supplementary Material.
5.2 Higher-dimensional data
We perform R-D estimation on higher-dimensional data, including the physics and speech datasets from (Yang and Mandt, 2022) and MNIST (LeCun et al., 1998). As the memory cost of operating on the full datasets becomes prohibitive, we focus on NERD, RD-VAE, and WGD using mini-batch stochastic gradient descent. BA and hybrid WGD do not directly apply in the stochastic setting, as BA updates on random mini-batches can lead to divergence (as discussed in Sec. 4.2).
Fig. 5 plots the estimated R-D bounds on the datasets, and compares the convergence speed of WGD and the neural methods in both iteration count and compute time. Overall, we find WGD to require minimal tuning and to obtain the tightest R-D upper bounds within its rate limit (see Sec. 4.4), consistently obtaining tighter bounds than NERD given the same computation budget.
6 Discussions
In this work, we leverage tools from optimal transport to develop a new approach for estimating the rate-distortion function in the continuous setting. Compared to state-of-the-art neural approaches (Yang and Mandt, 2022; Lei et al., 2022), our Wasserstein gradient descent algorithm offers complementary strengths: 1) it requires a single main hyperparameter (the number of particles n) and no network architecture tuning; and 2) empirically we found it to converge significantly faster and rarely end up in bad local optima; increasing n almost always yielded an improvement (unless the bound is already close to being tight). From a modeling perspective, a particle representation may be inherently more efficient when the optimal reproduction distribution is singular or has many disconnected modes (shown, e.g., in Figs. 1 and 3). However, like NERD (Lei et al., 2022), our method has a fundamental limitation: it requires an n that is exponential in the rate of the targeted R-D point to estimate it accurately (see Sec. 4.4). Thus, a neural method like RD-VAE (Yang and Mandt, 2022) may still be preferable on high-rate sources, while our particle-based method stands as a state-of-the-art solution on lower-rate sources with substantially less tuning or computation requirements.
Besides R-D estimation, our algorithm also applies to the mathematically equivalent problems of maximum-likelihood deconvolution and projection under an entropic optimal transport (EOT) cost, and may find other connections and applications. Indeed, the EOT projection view of our algorithm is further related to optimization-based approaches to sampling (Wibisono, 2018), variational inference (Liu and Wang, 2016), and distribution compression (Shetty et al., 2021). Our particle-based algorithm also generalizes optimal quantization (corresponding to β → ∞ in (3), i.e., projection of the source onto n-point measures under the Wasserstein distance (Graf and Luschgy, 2007; Gray, 2013)) to incorporate a rate constraint (finite β), and it would be interesting to explore the use of the resulting rate-distortion optimal quantizer for practical data compression and communication.
References
- Shannon [1959] CE Shannon. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec., March 1959, 4:142–163, 1959.
- Blahut [1972] R. Blahut. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473, 1972. doi: 10.1109/TIT.1972.1054855.
- Arimoto [1972] Suguru Arimoto. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Transactions on Information Theory, 18(1):14–20, 1972.
- Harrison and Kontoyiannis [2008] Matthew T. Harrison and Ioannis Kontoyiannis. Estimation of the rate–distortion function. IEEE Transactions on Information Theory, 54(8):3757–3762, 2008. doi: 10.1109/tit.2008.926387.
- Gibson [2017] Jerry Gibson. Rate distortion functions and rate distortion function lower bounds for real-world sources. Entropy, 19(11):604, 2017.
- Yang and Mandt [2022] Yibo Yang and Stephan Mandt. Towards empirical sandwich bounds on the rate-distortion function. In International Conference on Learning Representations, 2022.
- Lei et al. [2023a] Eric Lei, Hamed Hassani, and Shirin Saeedi Bidokhti. Neural estimation of the rate-distortion function with applications to operational source coding. IEEE Journal on Selected Areas in Information Theory, 2023a.
- Ambrosio et al. [2008] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows in metric spaces and in the space of probability measures. Lectures in Mathematics ETH Zürich. Birkhäuser Verlag, Basel, second edition, 2008.
- Mena and Niles-Weed [2019] Gonzalo Mena and Jonathan Niles-Weed. Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem. Advances in Neural Information Processing Systems, 32, 2019.
- Polyanskiy and Wu [2022] Yury Polyanskiy and Yihong Wu. Information theory: From coding to learning. Book draft, 2022.
- Gray [2011] Robert M Gray. Entropy and information theory. Springer Science & Business Media, 2011.
- Dembo and Kontoyiannis [2002] Amir Dembo and Ioannis Kontoyiannis. Source coding, large deviations, and approximate pattern matching. IEEE Transactions on Information Theory, 48(6):1590–1615, 2002.
- Csiszár [1974a] Imre Csiszár. On an extremum problem of information theory. Studia Scientiarum Mathematicarum Hungarica, 9, 01 1974a.
- Peyré and Cuturi [2019] Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
- Bassetti et al. [2006] Federico Bassetti, Antonella Bodini, and Eugenio Regazzini. On minimum kantorovich distance estimators. Statistics & probability letters, 76(12):1298–1302, 2006.
- Agueh and Carlier [2011] Martial Agueh and Guillaume Carlier. Barycenters in the wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
- Alemi et al. [2018] Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. Fixing a broken ELBO. In International Conference on Machine Learning, pages 159–168. PMLR, 2018.
- Ballé et al. [2017] Johannes Ballé, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compression. International Conference on Learning Representations, 2017.
- Theis et al. [2017] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. Lossy image compression with compressive autoencoders. International Conference on Learning Representations, 2017.
- Yang et al. [2020] Yibo Yang, Robert Bamler, and Stephan Mandt. Improving inference for neural image compression. In Neural Information Processing Systems (NeurIPS), 2020, 2020.
- Rigollet and Weed [2018] Philippe Rigollet and Jonathan Weed. Entropic optimal transport is maximum-likelihood deconvolution. Comptes Rendus Mathematique, 356(11-12):1228–1235, 2018.
- Carroll and Hall [1988] Raymond J Carroll and Peter Hall. Optimal rates of convergence for deconvolving a density. Journal of the American Statistical Association, 83(404):1184–1186, 1988.
- Lindsay and Roeder [1993] Bruce G Lindsay and Kathryn Roeder. Uniqueness of estimation and identifiability in mixture models. Canadian Journal of Statistics, 21(2):139–147, 1993.
- Csiszár [1974b] Imre Csiszár. On the computation of rate-distortion functions (corresp.). IEEE Transactions on Information Theory, 20(1):122–124, 1974b. doi: 10.1109/TIT.1974.1055146.
- Csiszár [1984] Imre Csiszár. Information geometry and alternating minimization procedures. Statistics and Decisions, Dedewicz, 1:205–237, 1984.
- Gray and Neuhoff [1998] Robert M. Gray and David L. Neuhoff. Quantization. IEEE transactions on information theory, 44(6):2325–2383, 1998.
- Papamakarios et al. [2021] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. The Journal of Machine Learning Research, 22(1):2617–2680, 2021.
- Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Wu et al. [2022] Shitong Wu, Wenhao Ye, Hao Wu, Huihui Wu, Wenyi Zhang, and Bo Bai. A communication optimal transport approach to the computation of rate distortion functions. arXiv preprint arXiv:2212.10098, 2022.
- Lei et al. [2023b] Eric Lei, Hamed Hassani, and Shirin Saeedi Bidokhti. On a relation between the rate-distortion function and optimal transport, 2023b.
- Genevay et al. [2019] Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1574–1583. PMLR, 2019.
- Rigollet and Stromme [2022] Philippe Rigollet and Austin J Stromme. On the sample complexity of entropic optimal transport. arXiv preprint arXiv:2206.13472, 2022.
- Yan et al. [2023] Yuling Yan, Kaizheng Wang, and Philippe Rigollet. Learning gaussian mixtures using the wasserstein-fisher-rao gradient flow. arXiv preprint arXiv:2301.01766, 2023.
- Chizat et al. [2018] Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. An interpolating distance between optimal transport and fisher–rao metrics. Foundations of Computational Mathematics, 18:1–44, 2018.
- Chizat [2022] Lénaïc Chizat. Mean-field langevin dynamics: Exponential convergence and annealing. arXiv preprint arXiv:2202.01009, 2022.
- Nutz [2021] Marcel Nutz. Introduction to entropic optimal transport. Lecture notes, Columbia University, 2021. https://www.math.columbia.edu/~mnutz/docs/EOT_lecture_notes.pdf.
- Carlier et al. [2022] Guillaume Carlier, Lénaïc Chizat, and Maxime Laborde. Lipschitz continuity of the Schrödinger map in entropic optimal transport. arXiv preprint arXiv:2210.00225, 2022.
- Berger [1971] Toby Berger. Rate distortion theory, a mathematical basis for data compression. Prentice Hall, 1971.
- Eckstein and Nutz [2022] Stephan Eckstein and Marcel Nutz. Convergence rates for regularized optimal transport via quantization. arXiv preprint arXiv:2208.14391, 2022.
- Lei et al. [2022] Eric Lei, Hamed Hassani, and Shirin Saeedi Bidokhti. Neural estimation of the rate-distortion function for massive datasets. In 2022 IEEE International Symposium on Information Theory (ISIT), pages 608–613. IEEE, 2022.
- LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Wibisono [2018] Andre Wibisono. Sampling as optimization in the space of measures: The langevin dynamics as a composite optimization problem. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 2093–3027. PMLR, 06–09 Jul 2018. URL https://proceedings.mlr.press/v75/wibisono18a.html.
- Liu and Wang [2016] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. Advances in neural information processing systems, 29, 2016.
- Shetty et al. [2021] Abhishek Shetty, Raaz Dwivedi, and Lester Mackey. Distribution compression in near-linear time. arXiv preprint arXiv:2111.07941, 2021.
- Graf and Luschgy [2007] Siegfried Graf and Harald Luschgy. Foundations of quantization for probability distributions. Springer, 2007.
- Gray [2013] Robert M. Gray. Transportation distance, shannon information, and source coding. GRETSI 2013 Symposium on Signal and Image Processing, 2013. URL https://ee.stanford.edu/~gray/gretsi.pdf.
- Çinlar [2011] Erhan Çinlar. Probability and stochastics, volume 261. Springer, 2011.
- Folland [1999] Gerald B Folland. Real analysis: modern techniques and their applications, volume 40. John Wiley & Sons, 1999.
- Polyanskiy and Wu [2014] Yury Polyanskiy and Yihong Wu. Lecture notes on information theory. Lecture Notes for ECE563 (UIUC) and, 6(2012-2016):7, 2014.
- Yang et al. [2023] Yibo Yang, Stephan Mandt, and Lucas Theis. An introduction to neural data compression. Foundations and Trends® in Computer Graphics and Vision, 15(2):113–200, 2023. ISSN 1572-2740. doi: 10.1561/0600000107. URL http://dx.doi.org/10.1561/0600000107.
- Blei et al. [2017] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.
- Beal and Ghahramani [2003] MJ Beal and Z Ghahramani. The variational bayesian em algorithm for incomplete data: with application to scoring graphical model structures. Bayesian statistics, 7(453-464):210, 2003.
- Dempster et al. [1977] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society: series B (methodological), 39(1):1–22, 1977.
- Zhang et al. [2020] Mingtian Zhang, Peter Hayes, Thomas Bird, Raza Habib, and David Barber. Spread divergence. In International Conference on Machine Learning, pages 11106–11116. PMLR, 2020.
- Vincent [2011] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
- Jordan et al. [1999] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37:183–233, 1999.
- Wainwright et al. [2008] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
- Jordan [1999] Michael Irwin Jordan. Learning in graphical models. MIT press, 1999.
- Kingma and Ba [2015] Diederik P Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
- Platt and Barr [1987] John Platt and Alan Barr. Constrained differential optimization. In Neural Information Processing Systems, 1987.
- Papamakarios et al. [2017] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.
- Grosse et al. [2015] Roger B Grosse, Zoubin Ghahramani, and Ryan P Adams. Sandwiching the marginal likelihood using bidirectional monte carlo. arXiv preprint arXiv:1511.02543, 2015.
- Bradbury et al. [2018] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
Broader Impacts
Improved estimates on the fundamental cost of data compression aids the development and analysis of compression algorithms. This helps researchers and engineers make better decisions about where to allocate their resources to improve certain compression algorithms, and can translate to economic gains for the broader society. However, like most machine learning algorithms trained on data, the output of our estimator is only accurate insofar as the training data is representative of the population distribution of interest, and practitioners need to ensure this in the data collection process.
Acknowledgements
We thank anonymous reviewers for feedback on the manuscript. Yibo Yang acknowledges support from the Hasso Plattner Foundation. Marcel Nutz acknowledges support from NSF Grants DMS-1812661, DMS-2106056. Stephan Mandt acknowledges support from the National Science Foundation (NSF) under the NSF CAREER Award 2047418; NSF Grants 2003237 and 2007719, the Department of Energy, Office of Science under grant DE-SC0022331, the IARPA WRIVA program, as well as gifts from Intel, Disney, and Qualcomm.
Supplementary Material for
Estimating the Rate-Distortion Function by Wasserstein Gradient Descent
We review probability theory background and explain our notation from the main text in Section 7, give the formulas we used for numerically estimating an R-D upper bound in Section 8, provide additional discussions and proofs regarding Wasserstein gradient descent in Section 9, elaborate on the connections between the R-D estimation problem and variational inference/learning in Section 10, provide additional experimental results and details in Section 11, and list an example implementation of WGD in Section 12. Our code can be found at https://github.com/yiboyang/wgd.
7 Notions from probability theory
In this section we collect notions of probability theory used in the main text. See, e.g., [Çinlar, 2011] or [Folland, 1999] for more background.
Marginal and conditional distributions.
The source and reproduction spaces 𝒳 and 𝒴 are equipped with sigma-algebras 𝒜_𝒳 and 𝒜_𝒴, respectively. Let 𝒳 × 𝒴 denote the product space equipped with the product sigma algebra 𝒜_𝒳 ⊗ 𝒜_𝒴. For any probability measure π on 𝒳 × 𝒴, its first marginal is
π_1(A) := π(A × 𝒴),  A ∈ 𝒜_𝒳,
which is a probability measure on 𝒳. When π is the distribution of a random vector (X, Y), then π_1 is the distribution of X. The second marginal of π is defined analogously as
π_2(B) := π(𝒳 × B),  B ∈ 𝒜_𝒴.
A Markov kernel or conditional distribution from 𝒳 to 𝒴 is a map K: 𝒳 × 𝒜_𝒴 → [0, 1] such that
1. K(x, ·) is a probability measure on 𝒴 for each x ∈ 𝒳;
2. the function x ↦ K(x, B) is measurable for each set B ∈ 𝒜_𝒴.
When speaking of the conditional distribution of a random variable Y given another random variable X, we occasionally also use the notation Q_{Y|X} from information theory [Polyanskiy and Wu, 2014]. Then, Q_{Y|X=x}(B) is the conditional probability of the event Y ∈ B given X = x.
Suppose that a probability measure μ on 𝒳 is given, in addition to a kernel K. Together they define a unique measure μ ⊗ K on the product space 𝒳 × 𝒴. For a rectangle set A × B,
(μ ⊗ K)(A × B) = ∫_A K(x, B) μ(dx).
The measure μ ⊗ K has first marginal μ.
The classic product measure is a special case of this construction. Namely, when a measure ν on 𝒴 is given, using the constant kernel K(x, ·) = ν (which does not depend on x) gives rise to the product measure μ ⊗ ν,
(μ ⊗ ν)(A × B) = μ(A) ν(B).
Under mild conditions (for instance when 𝒳, 𝒴 are Polish spaces equipped with their Borel sigma algebras, as in the main text), any probability measure π on 𝒳 × 𝒴 is of the above form. Namely, the disintegration theorem asserts that π can be written as π = π_1 ⊗ K for some kernel K. When π is the joint distribution of a random vector (X, Y), this says that there is a measurable version of the conditional distribution Q_{Y|X}.
Optimal transport.
Given a measure ν on a measurable space Z and a measurable function T from Z to 𝒴, the pushforward (or image measure) of ν under T is a measure on 𝒴, given by
(T_# ν)(B) := ν(T^{-1}(B)),  B ∈ 𝒜_𝒴.
If T is seen as a random variable and ν as the baseline probability measure, then T_# ν is simply the distribution of T.
Suppose that ν and ν' are probability measures on ℝ^d with finite second moment. As introduced in the main text, Π(ν, ν') denotes the set of couplings, i.e., measures π on ℝ^d × ℝ^d with π_1 = ν and π_2 = ν'. The 2-Wasserstein distance between ν and ν' is defined as
W_2(ν, ν') := ( inf_{π ∈ Π(ν, ν')} ∫ ‖y − y'‖² π(dy, dy') )^{1/2}.
This indeed defines a metric on the space of probability measures with finite second moment.
We finish this section by giving a proof of the basic equivalence between optimization of L_β and OT_ε, which goes back to [Csiszár, 1974a, Lemma 1.3 and subsequent discussion]. Recall that Π(μ, ·) denotes the set of measures π on the product space 𝒳 × 𝒴 with π_1 = μ.
Lemma 7.1.
Set ε = 1/β. It holds that
inf_{ν ∈ P(𝒴)} L_β(ν) = inf_{ν ∈ P(𝒴)} β OT_ε(μ, ν),
and the two problems have the same set of optimizers.
Proof.
Both statements will follow from a simple property of relative entropy [Polyanskiy and Wu, 2022, Theorem 4.1, "golden formula"]: For π ∈ Π(μ, ·) with π ≪ μ ⊗ ν, the properties of the logarithm reveal
H(π | μ ⊗ ν) = H(π | μ ⊗ π_2) + H(π_2 | ν),    (15)
and hence H(π | μ ⊗ ν) ≥ H(π | μ ⊗ π_2). This implies
inf_ν β OT_ε(μ, ν) = inf_ν inf_{π ∈ Π(μ, ν)} β ∫ ρ dπ + H(π | μ ⊗ ν) = inf_{π ∈ Π(μ, ·)} β ∫ ρ dπ + H(π | μ ⊗ π_2) = inf_ν L_β(ν).
Further, any optimizer ν* of inf_ν L_β(ν) with corresponding optimal "coupling" π must satisfy π_2 = ν* (otherwise, taking ν = π_2 has a better objective) and thus π ∈ Π(μ, ν*) and L_β(ν*) = β OT_ε(μ, ν*); therefore ν* is also an optimizer of inf_ν OT_ε(μ, ν) by the above equality. Conversely, any optimizer ν* of inf_ν OT_ε(μ, ν) with coupling π ∈ Π(μ, ν*) clearly yields a feasible solution for L_β(ν*) as well, and hence ν* is also an optimizer of inf_ν L_β(ν) by the above equality. ∎
8 Numerical estimation of rate and distortion from a reproduction distribution
Given a reproduction distribution ν and a kernel K, the tuple (D(ν, K), R(ν, K)) defined by
D(ν, K) := ∫∫ ρ(x, y) K(x, dy) μ(dx)
and
R(ν, K) := H(μ ⊗ K | μ ⊗ ν)
lies above the R(D) curve. This again follows from the variational inequality (15), which gives R(D(ν, K)) ≤ H(μ ⊗ K | μ ⊗ (μ ⊗ K)_2) ≤ H(μ ⊗ K | μ ⊗ ν) = R(ν, K), so that the point (D(ν, K), R(ν, K)) lies on or above the curve.
Given only a reproduction distribution ν, we will construct a kernel K_ν from ν and use the pair (ν, K_ν) to compute (D, R), letting
K_ν(x, dy) := e^{−βρ(x,y)} ν(dy) / ∫ e^{−βρ(x,ỹ)} ν(dỹ),
as in the first step of the BA algorithm. This choice of K_ν is in fact optimal, as it achieves the minimum in the definition of L_β(μ, ν) in (5) for the given ν. Plugging K_ν into the formulas for D and R gives
D_ν := ∫∫ ρ(x, y) K_ν(x, dy) μ(dx)    (16)
and
R_ν := H(μ ⊗ K_ν | μ ⊗ ν) = −∫ log Z_ν(x) μ(dx) − β D_ν,    (17)
where we introduced the shorthand
Z_ν(x) := ∫ e^{−βρ(x,y)} ν(dy).    (18)
Note that we have the following relation (which explains (6)):
L_β(μ, ν) = R_ν + β D_ν = ∫ −log Z_ν(x) μ(dx).
Let ν = Σ_{j=1}^n w_j δ_{y_j} be an n-point measure, e.g., the output of WGD or in the inner step of NERD. Then Z_ν(x) can be evaluated exactly as a finite sum over the particles, and the expressions above for D_ν and R_ν (which are integrals w.r.t. the source distribution μ) can be estimated as sample averages. That is, given m independent samples x_1, ..., x_m from μ, we compute the unbiased estimate
D̂_ν := (1/m) Σ_{i=1}^m Σ_{j=1}^n ρ(x_i, y_j) w_j e^{−βρ(x_i, y_j)} / Z_ν(x_i),
where Z_ν(x_i) = Σ_{j=1}^n w_j e^{−βρ(x_i, y_j)}. Similarly, a sample mean estimate for R_ν is given by
R̂_ν := (1/m) Σ_{i=1}^m Σ_{j=1}^n ( w_j e^{−βρ(x_i, y_j)} / Z_ν(x_i) ) log( e^{−βρ(x_i, y_j)} / Z_ν(x_i) ).    (19)
In practice, we found it simpler and numerically more stable to instead compute R̂_ν as
R̂_ν = −(1/m) Σ_{i=1}^m log Z_ν(x_i) − β D̂_ν.
Whenever possible, we avoid exponentiation and instead use logsumexp to prevent numerical issues.
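Concretely, these estimates can be computed in the log domain as in the following sketch (Python/NumPy, squared-error distortion; variable names are ours), which uses the relation R̂_ν = −mean_i log Z_ν(x_i) − β D̂_ν.

```python
import numpy as np
from scipy.special import logsumexp

def estimate_rate_distortion(X, Y, w, beta):
    """Monte Carlo estimate of (D, R) for nu = sum_j w_j delta_{Y[j]},
    using the BA kernel K_nu and m source samples X."""
    dist = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)     # (m, n)
    log_wK = np.log(w)[None, :] - beta * dist                        # unnormalized log K
    log_Z = logsumexp(log_wK, axis=1, keepdims=True)                 # log Z_nu(x_i)
    K = np.exp(log_wK - log_Z)                                       # posterior over particles
    D_hat = np.mean(np.sum(K * dist, axis=1))                        # distortion estimate
    R_hat = -np.mean(log_Z) - beta * D_hat                           # rate estimate (nats)
    return D_hat, R_hat
```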
9 Wasserstein gradient descent
9.1 On Wasserstein gradients of the EOT and rate functionals
First, we elaborate on the Wasserstein gradient of the EOT functional ν ↦ OT_ε(μ, ν). That the dual potential from Sinkhorn's algorithm is differentiable follows from the fact that optimal dual potentials satisfy the Schrödinger equations (cf. [Nutz, 2021, Corollary 2.5]). Differentiability was shown in [Genevay et al., 2019, Theorem 2] in the compact setting, and in [Mena and Niles-Weed, 2019, Proposition 1] in unbounded settings. While Mena and Niles-Weed [2019] only state the result for the quadratic cost, the approach of Proposition 1 therein applies more generally.
Below, we compute the Wasserstein gradient of the rate functional L_β. Recall from (6),
L_β(ν) = ∫ −log ( ∫ e^{−βρ(x,y)} ν(dy) ) μ(dx).
Under sufficient integrability on ρ and ν to exchange the order of limit and integral, we can calculate the first variation as
lim_{s→0} (1/s) [ L_β(ν + s(ν' − ν)) − L_β(ν) ] = −∫ ( ∫ e^{−βρ(x,y)} (ν' − ν)(dy) / ∫ e^{−βρ(x,ỹ)} ν(dỹ) ) μ(dx) = ∫ ψ_ν d(ν' − ν),
where the last equality uses the definition of ψ_ν below and Fubini's theorem. Thus the first variation of L_β at ν is
ψ_ν(y) = −∫ e^{−βρ(x,y)} / ( ∫ e^{−βρ(x,ỹ)} ν(dỹ) ) μ(dx).    (20)
To find the desired Wasserstein gradient of L_β, it remains to take the Euclidean gradient of ψ_ν, i.e., ∇ψ_ν. Doing so gives us the desired Wasserstein gradient:
∇ψ_ν(y) = β ∫ ∇_y ρ(x, y) e^{−βρ(x,y)} / ( ∫ e^{−βρ(x,ỹ)} ν(dỹ) ) μ(dx),    (21)
again assuming suitable regularity conditions on ρ and μ to exchange the order of integral and differentiation.
9.2 Proof of Proposition 4.2 (convergence of Wasserstein gradient descent)
We first provide an auxiliary result.
Lemma 9.1.
Let and , , satisfy , , and for all . Then .
Proof.
The conclusion remains unchanged when rescaling by the constant , and thus without loss of generality .
Clearly as . Moreover, there exists a subsequence of which converges to zero (otherwise there exists such that for all but finitely many , contradicting ).
Arguing by contradiction, suppose that the conclusion fails, i.e., that there exists a subsequence of which is uniformly bounded away from zero, say along that subsequence. Using this subsequence and the convergent subsequence mentioned above, we can construct a subsequence where for odd and for even. We will show that
contradicting the finiteness of . (The notation () indicates (in)equality up to additive terms converging to zero for .)
To ease notation, fix and set and . We show that . To this end, using we find
Since , there exists a largest , , such that (and thus as well). We conclude
where we used that . This completes the proof. ∎
Proof of Proposition 4.2.
Using the linear approximation property in (14), we calculate
As is finite and is bounded from below, it follows that
The claim now follows by applying Lemma 9.1 with ; note that the assumption in the lemma, , is satisfied due to the second inequality in (14) and a short calculation below
where we use in the last step. ∎
9.3 Proof of Proposition 4.3 (sample complexity)
Recall that ρ(x, y) = ‖x − y‖² and ε = 1/β in this proposition. For the proof, we will need the following lemma, which is of independent interest. We write ν ⪯_c ν' if ν is dominated by ν' in convex order, i.e., ∫ f dν ≤ ∫ f dν' for all convex functions f.
Lemma 9.2.
Let μ have finite second moment. Given ν ∈ P(𝒴), there exists ν' ∈ P(𝒴) with ν' ⪯_c μ and
OT_ε(μ, ν') ≤ OT_ε(μ, ν).
This inequality is strict if ν is not dominated by μ in convex order. In particular, any optimizer ν* of (8) satisfies ν* ⪯_c μ.
Proof.
Because this proof uses disintegration over , it is convenient to reverse the order of the spaces in the notation and write a generic point as . Consider and its disintegration over . Define by
Define also and . From the definition of , we see that is a martingale, thus . Moreover, . The data-processing inequality now shows that
On the other hand, since the barycenter minimizes the squared distance, and this inequality is strict whenever . Thus
and the inequality is strict unless for -a.e. , which in turn is equivalent to being a martingale. The claims follow. ∎
Proof of Proposition 4.3.
Subgaussianity of the optimizer follows directly from Lemma 9.2.
Recalling that L_β and β OT_ε have the same values and minimizers (given by (9) in Sec. 2.2), it suffices to show the claim for OT_ε. Let ν* be an optimizer of (8) (i.e., an optimal reproduction distribution) and ν*_n its empirical measure from n samples; then clearly
where the expectation is taken over the n samples for ν*_n. The first inequality of Proposition 4.3 now follows from the sample complexity result for entropic optimal transport in [Mena and Niles-Weed, 2019, Theorem 2].
Denote by ν̂* the optimizer for the problem (8) with μ replaced by its empirical measure μ^m. Similarly to the above, we obtain
where the expectation is taken over the m samples from μ. In this situation, we cannot directly apply [Mena and Niles-Weed, 2019, Theorem 2]. However, the bound given by [Mena and Niles-Weed, 2019, Proposition 2] still applies, and the only dependence on the reproduction distributions is through their subgaussianity constants. By Lemma 9.2, these constants are bounded by the corresponding constants of μ and μ^m. Thus, the arguments in the proof of [Mena and Niles-Weed, 2019, Theorem 2] can be applied, yielding the second inequality of Proposition 4.3.
9.4 Convergence of the proposed hybrid algorithm
In our present work, our hybrid algorithm targets the non-stochastic setting and is motivated by the fact that the BA update is in some sense orthogonal to the Wasserstein gradient update, and can only monotonically improve the objective. While empirically we observe the additional BA steps to not hurt – but rather accelerate – the convergence of WGD (see 5.1), additional effort is required to theoretically guarantee convergence of the hybrid algorithm.
There are two key properties we use for the convergence of the base WGD algorithm: 1) a certain monotonicity of the update steps (up to higher order terms, gradient descent improves the objective) and 2) stability of gradients across iterations. If we include the BA step, we find that 1) still holds, but 2) may a-priori be lost. Indeed, 1) holds since BA updates monotonically improve the objective. Using just 1), we can still obtain a Pareto convergence of the gradients for the hybrid algorithm, (here are the outputs from the respective BA steps and is the step size of the gradient steps). Without property 2), we cannot conclude for . We emphasize that in practice, it still appears that 2) holds even after including the BA step. Motivated by this analysis, an adjusted hybrid algorithm where, e.g., the BA update is rejected if it causes a drastic change in the Wasserstein gradient, could guarantee that 2) holds with little practical changes. From a different perspective, we also believe the hybrid algorithm may be tractable to study in relation to gradient flows in the Wasserstein-Fisher-Rao geometry (cf. [Yan et al., 2023]), in which the BA step corresponds to a gradient update in the Fisher-Rao geometry with a unit step size.
In the stochastic setting, the BA (and therefore our hybrid) algorithm does not directly apply, as performing BA updates on mini-batches no longer guarantees monotonic improvement of the overall objective. Extending the BA and hybrid algorithm to this setting would be interesting future work.
Context | μ | ρ | K | ν
---|---|---|---|---
OT | source distribution | transport cost | "transport plan" | target distribution
R-D | data distribution | distortion criterion | compression algorithm | codebook distribution
LVM/NPMLE | data distribution | negative log-likelihood | variational posterior | prior distribution
deconvolution | noisy measurements | "noise kernel" | — | noiseless model
10 R-D estimation and variational inference/learning
In this section, we elaborate on the connection between the R-D problem (3) and variational inference and learning in latent variable models, of which maximum likelihood deconvolution (discussed in Sec. 2.3) can be seen as a special case. Also see Section 3 of [Yang et al., 2023] for a related discussion in the context of lossy data compression.
To make the connections clearer to a general machine learning audience, we adopt notation more common in statistics and information theory. Table 1 summarizes the notation and the correspondence to the measure-theoretic notation used in the main text.
In statistics, a common goal is to model an (unknown) data-generating distribution by some density model . In particular, here we will choose to be a latent variable model, where takes on the role of a latent space, and is the distribution of a latent variable (which may encapsulate the model parameters). As we shall see, the optimization objective defining the rate functional (5) corresponds to an aggregate Evidence Lower BOund (ELBO) [Blei et al., 2017]. Thus, computing the rate functional corresponds to variational inference [Blei et al., 2017] in a given model (see Sec. 10.2), and the parametric R-D estimation problem, i.e.,
is equivalent to estimating a latent variable model using the variational EM algorithm [Beal and Ghahramani, 2003] (see Sec. 10.3). The variational EM algorithm can be seen as a restricted version of the BA algorithm (see Sec. 10.3), whereas the EM algorithm [Dempster et al., 1977] shares its E-step with the BA algorithm but can differ in its M-step (see Sec. 10.4).
10.1 Setup
For concreteness, fix a reference measure on , and suppose has density w.r.t. . Often the latent space is a Euclidean space, and is the usual probability density function w.r.t. the Lebesgue measure ; or when the latent space is discrete/countable, is the counting measure and is a probability mass function. We will consider the typical parametric estimation problem and choose a particular parametric form for indexed by a parameter vector . This defines a parametric family for some parameter space . Finally, suppose the distortion function induces a conditional likelihood density via , with a normalization constant that is constant in .
These choices then result in a latent variable model specified by the joint density , and we model the data distribution with the marginal density:²
² To be more precise, we always fix a reference measure on , so that densities such as and are with respect to . In the applications we considered in this work, is the Lebesgue measure on .
(22)
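In our own notation (which need not match the symbols of the main text), the construction of (22) can be sketched as follows, with the distortion inducing the conditional likelihood and the prior density marginalized out:

```latex
% Sketch of the marginal density (22) in our own notation: z is the latent/reproduction
% variable, x the data point, w_theta the prior density, rho the distortion,
% lambda > 0 the multiplier, and Z(lambda) the normalizer (constant in z).
p(x \mid z) = \frac{e^{-\lambda\,\rho(x,z)}}{Z(\lambda)} ,
\qquad
p_\theta(x) = \int p(x \mid z)\, w_\theta(z)\, \mathrm{d}z .
```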
As an example, a Gaussian mixture model with isotropic component variances can be specified as follows. Let be a mixing distribution on parameterized by component weights and locations , such that . Let be a conditional Gaussian density with mean and variance . Now formula (22) gives the usual Gaussian mixture density on .
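For concreteness, the resulting mixture density can be sketched in standard notation (component weights and locations are ours and purely illustrative):

```latex
% Isotropic Gaussian mixture density obtained from the marginal formula (22);
% notation (weights w_i, locations mu_i, shared variance sigma^2) is ours.
p_\theta(x) = \sum_{i=1}^{k} w_i\, \mathcal{N}\!\left(x;\ \mu_i,\ \sigma^2 I\right),
\qquad
\theta = (w_1, \dots, w_k,\ \mu_1, \dots, \mu_k) .
```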
Maximum-likelihood estimation then ideally maximizes the population log (marginal) likelihood,
(23)
The maximum-likelihood deconvolution setup can be seen as a special case where the form of the marginal density (22) derives from knowledge of the true data generation process, with for some unknown and known noise (i.e., the model is well-specified). We note in passing that the idea of estimating an unknown data distribution by adding artificial noise to it also underlies work on spread divergence [Zhang et al., 2020] and denoising score matching [Vincent, 2011].
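The well-specified model described at the start of this paragraph can be written, in our own notation (with an unknown noiseless distribution and known isotropic Gaussian noise, both symbols ours), as:

```latex
% Well-specified deconvolution model: the data are noiseless samples corrupted
% by known isotropic Gaussian noise, so the data distribution is a convolution.
X = X_0 + \varepsilon, \qquad X_0 \sim P_0, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I),
\qquad \text{so that} \qquad P_X = P_0 * \mathcal{N}(0, \sigma^2 I) .
```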
10.2 Connection to variational inference
Given a latent variable model and any data observation , a central task in Bayesian statistics is to infer the Bayesian posterior [Jordan, 1999], which we formally view as a conditional distribution . It is given by
or, in terms of the density of , given by the following conditional density via the familiar Bayes’ rule,
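In the density notation sketched above (symbols ours), this is the usual Bayes' rule:

```latex
% Bayes' rule for the posterior density of the latent z given an observation x,
% with w the prior density and p(x) the marginal (evidence).
p(z \mid x) = \frac{p(x \mid z)\, w(z)}{p(x)},
\qquad
p(x) = \int p(x \mid z)\, w(z)\, \mathrm{d}z .
```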
Unfortunately, the true Bayesian posterior is typically intractable, as the (marginal) data likelihood in the denominator involves an often high-dimensional integral. Variational inference [Jordan et al., 1999, Wainwright et al., 2008] therefore aims to approximate the true posterior by a variational distribution by minimizing their relative divergence . The problem is equivalent to maximizing the following lower bound on the marginal log-likelihood, known as the Evidence Lower BOund (ELBO) [Blei et al., 2017]:
(24)
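For reference, a sketch of the standard ELBO identity in our own notation, with a variational distribution over the latent given an observation:

```latex
% Standard ELBO identity: the marginal log-likelihood equals the ELBO plus the
% KL gap to the true posterior, so the bound is tight iff q(. | x) = p(. | x).
\log p(x)
  = \underbrace{\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
      - D_{\mathrm{KL}}\big(q(\cdot \mid x)\,\|\,w\big)}_{\mathrm{ELBO}(q,\,x)}
    + D_{\mathrm{KL}}\big(q(\cdot \mid x)\,\|\,p(\cdot \mid x)\big) .
```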
Recall the rate functional (5) arises through an optimization problem over a transition kernel ,
Translating the above into the present notation (see Table 1), we obtain
(25)
which allows us to interpret the rate functional as the result of performing variational inference through ELBO optimization. At optimality, , the ELBO (24) is tight and recovers , and the rate functional takes on the form of a (negated) population marginal log likelihood (23), as given earlier by (6) in Sec. 2.1.
10.3 Connection to variational EM
The discussion so far concerns probabilistic inference, where a latent variable model has been given and we saw that computing the rate functional amounts to variational inference. Suppose now we wish to learn a model from data. The R-D problem (4) then corresponds to model estimation using the variational EM algorithm [Beal and Ghahramani, 2003].
To estimate a latent variable model by (approximate) maximum-likelihood, the variational EM algorithm maximizes the population ELBO
(26)
w.r.t. and . This precisely corresponds to the R-D problem , using the form of from (25).
In popular implementations of variational EM such as the VAE [Kingma and Welling, 2013], and are restricted to parametric families. When they are allowed to range over all of and all conditional distributions, variational EM then becomes equivalent to the BA algorithm.
10.4 The Blahut–Arimoto and (exact) EM algorithms
Here we focus on the distinction between the BA algorithm and the (exact) EM algorithm [Dempster et al., 1977], rather than the variational EM algorithm. Both BA and (exact) EM perform coordinate descent on the same objective function (namely the negative of the population ELBO from (26)), but they define the coordinates slightly differently — the BA algorithm uses with , whereas the EM algorithm uses with indexing a parametric family . Thus the coordinate update w.r.t. (the “E-step”) is the same in both algorithms, but the subsequent “M-step” potentially differs depending on the role of .
Given the optimization objective, which is simply the negative of (26),
(27)
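In our own notation (transition kernel, reproduction/prior distribution, and multiplier; symbols ours), this objective takes the familiar rate-distortion Lagrangian form, up to an additive constant coming from the normalizer of the conditional likelihood:

```latex
% Negative population ELBO, up to a constant: a distortion term plus a
% rate (relative entropy) term; notation is ours.
\mathcal{L}(K, \nu)
  = \mathbb{E}_{x \sim \mu}\!\left[
      \lambda\, \mathbb{E}_{y \sim K(x,\cdot)}\big[\rho(x, y)\big]
      + D_{\mathrm{KL}}\big(K(x, \cdot)\,\|\,\nu\big)
    \right] .
```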
both BA and EM optimize the transition kernel the same way in the E-step, as
(28)
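Concretely, with the conditional likelihood induced by the distortion as sketched earlier (notation ours), the shared E-step is the familiar BA/posterior update:

```latex
% Shared E-step of BA and EM: tilt the current prior/reproduction distribution nu
% by the likelihood induced by the distortion, then renormalize per data point x.
K^{*}(x, \mathrm{d}y)
  = \frac{e^{-\lambda\,\rho(x,y)}\, \nu(\mathrm{d}y)}
         {\int e^{-\lambda\,\rho(x,y')}\, \nu(\mathrm{d}y')} .
```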
For the M-step, the BA algorithm only minimizes the relative entropy term of the objective (27),
(with the optimal given by the second marginal of ) whereas the EM algorithm minimizes the full objective w.r.t. the parameters of ,
(29)
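For reference, the BA M-step admits the closed form stated above, sketched here in our own notation; the EM M-step instead minimizes the full objective over the model parameters, which enter both terms.

```latex
% BA M-step: with K fixed, the relative-entropy term alone is minimized by the
% y-marginal of the joint distribution mu (x) K (the "aggregate posterior").
\nu^{*}(\mathrm{d}y) = \int_{\mathcal{X}} K(x, \mathrm{d}y)\, \mu(\mathrm{d}x) .
```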
The difference comes from the fact that when we parameterize by in the parameter estimation problem, — and consequently both terms in the objective of (29) — will have functional dependence on through the E-step optimality condition (28).
In the Gaussian mixture example, , and its parameters consist of the component weights and location vectors . The E-step computes . For the M-step, if we regard the locations as known, so that only consists of the weights, then the two algorithms perform the same update; however, if also includes the locations, then the M-step of the EM algorithm will not only update the weights as in the BA algorithm, but also the locations, due to the distortion term . The familiar EM updates for this example are sketched below.
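A sketch of the standard EM updates for this isotropic mixture, in our own notation (responsibilities for N samples and k components); under the correspondence above, the BA M-step performs only the weight update, with the component locations held fixed:

```latex
% Familiar EM updates for an isotropic Gaussian mixture (notation ours).
r_{ni} = \frac{w_i\, \mathcal{N}(x_n;\ \mu_i,\ \sigma^2 I)}
              {\sum_{j=1}^{k} w_j\, \mathcal{N}(x_n;\ \mu_j,\ \sigma^2 I)}
\quad \text{(E-step)},
\qquad
w_i \leftarrow \frac{1}{N}\sum_{n=1}^{N} r_{ni},
\qquad
\mu_i \leftarrow \frac{\sum_{n} r_{ni}\, x_n}{\sum_{n} r_{ni}}
\quad \text{(M-step)} .
```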
11 Further experimental results
Our deconvolution experiments were run on Intel(R) Xeon(R) CPUs, and the rest of the experiments were run on Titan RTX GPUs.
In most experiments, we use the Adam [Kingma and Ba, 2015] optimizer for updating the particle locations in WGD and for updating the variational parameters in other methods. For our hybrid WGD algorithm, which adjusts the particle weights in addition to their locations, we found that applying momentum to the particle locations can in fact slow down convergence; we therefore use plain gradient descent with a decaying step size.
To trace an R-D upper bound for a given source, we solve the unconstrained R-D problem (3) for a heuristically chosen grid of values similar to those in [Yang and Mandt, 2022], e.g., on a log-linear grid . Alternatively, a constrained optimization method like the Modified Differential Multiplier Method (MDMM) [Platt and Barr, 1987] can be adopted to directly target for various values of the distortion threshold; a rough sketch of such an update is given below. The latter approach also helps us identify when we run into the rate limit of particle-based methods (Sec. 4.4): suppose we perform constrained optimization with a distortion threshold of ; if the chosen is not large enough, i.e., , then no supported on (at most) points is feasible for the given constraint (otherwise we would have a contradiction). When this happens, the Lagrange multiplier associated with the infeasible constraint ( in our case) will be observed to increase indefinitely rather than stabilize to a finite value under a method like MDMM.
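As an illustration only, a minimal Python sketch of an MDMM-style update for a distortion constraint; `distortion_fn`, `grad_rate_fn`, and `grad_distortion_fn` are hypothetical stand-ins for the relevant objective terms and their gradients, and the step sizes and damping coefficient are arbitrary choices rather than the settings used in our experiments.

```python
def mdmm_step(params, lam, distortion_fn, grad_rate_fn, grad_distortion_fn,
              D_max, lr=1e-3, lr_lam=1e-2, damping=1.0):
    """One MDMM-style update targeting the constraint distortion(params) <= D_max."""
    violation = distortion_fn(params) - D_max  # positive when the constraint is violated
    # Descent step on the augmented Lagrangian:
    #   rate + lam * violation + (damping / 2) * violation**2
    grad = grad_rate_fn(params) + (lam + damping * violation) * grad_distortion_fn(params)
    params = params - lr * grad
    # Ascent step on the multiplier, projected to remain nonnegative. If the
    # constraint is infeasible (e.g., too few particles), lam keeps growing
    # instead of stabilizing -- the diagnostic described above.
    lam = max(0.0, lam + lr_lam * violation)
    return params, lam
```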
11.1 Deconvolution
Architectures for the neural baselines
For the RD-VAE, we used an architecture similar to the one used on the banana-shaped source in [Yang and Mandt, 2022], consisting of two-layer MLPs for the encoder and decoder networks and a Masked Autoregressive Flow [Papamakarios et al., 2017] for the variational prior. For NERD, we followed architecture settings similar to those of [Lei et al., 2023a], using a two-layer MLP for the decoder network. Following Yang and Mandt [2022], we initially used the softplus activation for the MLPs, but found that it made the optimization difficult; we therefore switched to the ReLU activation, which gave much better results. We conjecture that the softplus activation led to overly smooth mappings that had difficulty matching the optimal measure, which is concentrated on the unit circle and does not have a Lebesgue density.
Experiment setup
As discussed in Sec. 4.2, the BA and our hybrid algorithms do not apply to the stochastic setting; to be able to include them in our comparison, the input to all the algorithms is an empirical measure (training distribution) with samples, given the same fixed seed. We found this number of training samples to be sufficiently large that the losses do not differ significantly between the training distribution and random/held-out samples.
Recall from Sec. 2.3 that, given data observations following , if we perform deconvolution with an assumed noise variance for some , then the optimal solution to the problem is given by . We compute the optimal loss by a Monte-Carlo estimate, using formula (19) with samples from and samples from . The resulting is positively biased (it overestimates the in expectation) due to the bias of the plug-in estimator for (18), so we use a large to reduce this bias.³
³ Another, more sophisticated solution would be annealed importance sampling [Grosse et al., 2015] or a related method developed for estimating marginal likelihoods and partition functions. Note that the same issue occurs in the NERD algorithm (13).


Optimized reproduction distributions
To visualize the (continuous) learned by RD-VAE and NERD, we fit a kernel density estimator on 10000 samples drawn from each, using seaborn.kdeplot.
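For instance, a hypothetical snippet of the kind of call we use, with `samples` standing in for an (N, 2) array of draws from the learned reproduction distribution (random data here, purely for illustration):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# samples: (N, 2) array of draws from the learned (continuous) reproduction distribution.
samples = np.random.randn(10000, 2)  # placeholder data for illustration
sns.kdeplot(x=samples[:, 0], y=samples[:, 1], fill=True)
plt.show()
```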
In Figure 6, we provide additional visualization for the particles estimated from the training samples by WGD and its hybrid variant, for a decreasing distortion penalty (equivalently, increasing entropic regularization strength ).
11.2 Higher-dimensional datasets
For NERD, we set to the default in the code provided by [Lei et al., 2023a], on all three datasets.
For WGD, we used on physics, 200000 on speech, and 40000 on MNIST (see also smaller for comparison in Fig. 5).
On speech, both NERD and WGD ran into the upper bound on the rate estimate described in Sec. 4.4. Therefore, we increased to 200000 for WGD and obtained a tighter R-D upper bound than NERD, particularly in the low-distortion regime. Such a large incurred an out-of-memory error for NERD.
We borrow the R-D upper and lower bound results on physics and speech from [Yang and Mandt, 2022]. We obtain sandwich bounds on MNIST using a neural network architecture and other hyperparameter choices similar to those in the GAN-generated image experiments of [Yang and Mandt, 2022] (using a ResNet-VAE for the upper bound and a CNN for the lower bound), except that we set a larger in the lower bound experiment; we suspect the resulting lower bound is still far from being tight.
12 Example implementation of WGD
We provide a self-contained minimal implementation of Wasserstein gradient descent on , using the Jax library [Bradbury et al., 2018]. To compute the Wasserstein gradient, we evaluate the first variation of the rate functional in compute_psi_sum according to (20), yielding , then simply take its gradient w.r.t. the particle locations using Jax’s autodiff tool on line 51.
The implementation of WGD on is similar, except the first variation is computed using Sinkhorn’s algorithm. Both versions can be easily just-in-time compiled and enjoy GPU acceleration.
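To illustrate the autodiff pattern described above, here is a minimal JAX sketch (not the paper's listing): it assumes uniform particle weights and quadratic distortion, uses our own derivation of the first variation rather than the exact expression (20), and the function and variable names are ours.

```python
import jax
import jax.numpy as jnp

def compute_psi_sum(particles, data, lam):
    """Sum over particles of (a sketch of) the first variation of the rate
    functional, assuming uniform particle weights and quadratic distortion;
    an illustration of the autodiff pattern, not the paper's implementation."""
    # pairwise distortions rho(x_n, y_j) = ||x_n - y_j||^2
    sq_dists = jnp.sum((data[:, None, :] - particles[None, :, :]) ** 2, axis=-1)
    kernel = jnp.exp(-lam * sq_dists)                    # e^{-lam * rho(x_n, y_j)}
    partition = jnp.mean(kernel, axis=1, keepdims=True)  # (1/m) sum_j e^{-lam * rho(x_n, y_j)}
    psi = -jnp.mean(kernel / partition, axis=0)          # psi(y_j), up to an additive constant
    return jnp.sum(psi)

# Per-particle Wasserstein gradient: differentiate the summed first variation
# w.r.t. the particle locations (argnums=0).
wasserstein_grad = jax.grad(compute_psi_sum, argnums=0)

def wgd_step(particles, data, lam, step_size=0.1):
    # Move each particle along (minus) the gradient of the first variation.
    return particles - step_size * wasserstein_grad(particles, data, lam)

# Tiny usage example with synthetic data (for illustration only).
data = jax.random.normal(jax.random.PRNGKey(0), (256, 2))
particles = jax.random.normal(jax.random.PRNGKey(1), (32, 2))
particles = jax.jit(wgd_step)(particles, data, 2.0)
```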