Foundations of Multivariate Distributional Reinforcement Learning
Abstract
In reinforcement learning (RL), the consideration of multivariate reward signals has led to fundamental advancements in multi-objective decision-making, transfer learning, and representation learning. This work introduces the first oracle-free and computationally-tractable algorithms for provably convergent multivariate distributional dynamic programming and temporal difference learning. Our convergence rates match the familiar rates in the scalar reward setting, and additionally provide new insights into the fidelity of approximate return distribution representations as a function of the reward dimension. Surprisingly, when the reward dimension is larger than one, we show that standard analysis of categorical TD learning fails, which we resolve with a novel projection onto the space of mass-1 signed measures. Finally, with the aid of our technical results and simulations, we identify tradeoffs between distribution representations that influence the performance of multivariate distributional RL in practice.
1 Introduction
Distributional reinforcement learning (DRL) [MSK+10, BDM17b, BDR23] focuses on the idea of learning probability distributions of an agent’s random return, rather than the classical approach of learning only its mean. This has been highly effective in combination with deep reinforcement learning [YZL+19, BCC+20, WBK+22], and DRL has found applications in risk-sensitive decision making [LM22, KEF23], neuroscience [DKNU+20], and multi-agent settings [ROH+21, SLL21].
In general, research in distributional reinforcement learning has focused on the classical setting of a scalar reward function. However, prior non-distributional approaches to multi-objective RL [RVWD13, HRB+22] and transfer learning [BDM+17a, BHB+20] model value functions of multivariate cumulants (in RL, cumulants refer to accumulated quantities such as rewards or multivariate rewards, not to be confused with statistical cumulants), rather than a scalar reward. Having learnt such a multivariate value function, it is then possible to perform zero-shot evaluation and policy improvement for any scalar reward signal contained in the span of the coordinates of the multivariate cumulants, opening up a variety of applications in transfer learning, and multi-objective and constrained RL.
Multivariate distributional RL combines these two ideas, and aims to learn the full probability distribution of returns given a multivariate cumulant function. Successfully learning the multivariate return distribution opens up a variety of unique possibilities, such as zero-shot return distribution estimation [WFG+24] and risk-sensitive policy improvement [CZZ+24].
Pioneering works have already proposed algorithms for multivariate distributional RL. While these works all demonstrate benefits from the proposed algorithmic approaches, each suffers from separate drawbacks, such as not modelling the full joint distribution [GBSL21], lacking theoretical guarantees [FSMT19, ZCZ+21], or requiring a maximum-likelihood optimisation oracle for implementation [WUS23]. Concurrently, the work of [LK24] analyzed algorithms for DRL with Banach-space-valued rewards, and provided convergence guarantees for dynamic programming with non-parametric (intractable) distribution models.
Our central contribution in this paper is to propose algorithms for dynamic programming and temporal-difference learning in multivariate DRL which are computationally tractable and theoretically justified, with convergence guarantees. We show that reward dimensions strictly larger than one introduce new computational and statistical challenges. To resolve these challenges, we introduce multiple novel algorithmic techniques, including a randomized dynamic programming operator for efficiently approximating projected updates with high probability, and a novel TD-learning algorithm operating on mass-1 signed measures. These new techniques recover existing bounds even in the scalar reward case, and provide new insights into the behavior of distributional RL algorithms as a function of the reward dimension.
2 Background
We consider a Markov decision process with Polish state space $\mathcal{X}$, action space $\mathcal{A}$, transition kernel $P$, and discount factor $\gamma \in [0, 1)$. Unlike the standard RL setting, we consider a vector-valued reward (cumulant) function $r : \mathcal{X} \to \mathbb{R}^d$, as in the literature on successor features [BDM+18]. Given a policy $\pi$, we write $P^\pi$ for the policy-conditioned transition kernel.
Multi-variate return distributions. We write $(X_t)_{t \geq 0}$ for a trajectory generated by setting $X_0 = x$ and, for each $t \geq 0$, $X_{t+1} \sim P^\pi(\cdot \mid X_t)$. The return obtained along this trajectory is then defined by $G^\pi(x) = \sum_{t=0}^{\infty} \gamma^t r(X_t)$, and the (multi-)return distribution function is $\eta^\pi(x) = \mathrm{Law}(G^\pi(x))$.
Zero-shot evaluation. An intriguing prospect of estimating multivariate return distributions is the ability to predict (scalar) return distributions for any reward function in the span of the cumulants. Indeed, [ZCZ+21, CZZ+24] show that for any reward function of the form $\tilde{r} = \langle w, r \rangle$ for some $w \in \mathbb{R}^d$, the corresponding scalar return is $\langle w, G^\pi(x) \rangle$ for $G^\pi(x) \sim \eta^\pi(x)$. Likewise, one might consider cumulants given by state indicators, in which case $G^\pi(x)$ corresponds to the per-trajectory discounted state visitation measure, and [WFG+24] demonstrated methods for learning the distribution of $G^\pi(x)$ to infer the return distribution for any bounded deterministic reward function.
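For intuition, the following minimal Python sketch (illustrative only; the array layout and names are ours, not the paper's) shows how a zero-shot scalar return-distribution estimate is read off a finitely-supported multivariate return distribution: the scalar return for a reward in the span of the cumulants is a linear pushforward of the multivariate return, so atoms are projected and probabilities are reused.

```python
import numpy as np

def zero_shot_return_distribution(atoms, probs, w):
    """Scalar return distribution induced by the reward <w, cumulant>.

    atoms: [n, d] support of the estimated multi-return distribution at a state.
    probs: [n] corresponding probabilities.
    w: [d] coefficients expressing the scalar reward in the span of the cumulants.
    The scalar return <w, G> is the pushforward of the multivariate return
    through z -> <w, z>, so the atoms are projected and probabilities kept.
    """
    return atoms @ w, probs
```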
Multivariate distributional Bellman equation. It was shown in [ZCZ+21] that multi-return distributions obey a distributional Bellman equation, similar to the scalar case [MSK+10, BDM17b]. The multivariate distributional Bellman operator $\mathcal{T}^\pi$ is defined by

$(\mathcal{T}^\pi \eta)(x) = \mathbb{E}_{X' \sim P^\pi(\cdot \mid x)}\big[ (\mathrm{b}_{r(x), \gamma})_{\#}\, \eta(X') \big], \qquad (1)$

where $\mathrm{b}_{r, \gamma}(z) = r + \gamma z$ and $f_{\#}\mu$ is the pushforward of a measure $\mu$ through a measurable function $f$. In particular, [ZCZ+21] showed that $\eta^\pi$ satisfies the multi-variate distributional Bellman equation $\eta^\pi = \mathcal{T}^\pi \eta^\pi$, and that $\mathcal{T}^\pi$ is a $\gamma$-contraction in $\overline{W}_p$, where $\overline{W}_p(\eta, \eta') = \sup_x W_p(\eta(x), \eta'(x))$ and $W_p$ is the $p$-Wasserstein metric [Vil09]. This suggests a convergent scheme for approximating $\eta^\pi$ in $\overline{W}_p$ by distributional dynamic programming, that is, computing the iterates $\eta_{k+1} = \mathcal{T}^\pi \eta_k$, following Banach’s fixed point theorem.
Approximating multivariate return distributions. In practice, however, the iterates $\eta_{k+1} = \mathcal{T}^\pi \eta_k$ cannot be computed efficiently, because the size of the support of the iterates may increase exponentially with the number of iterations. A variety of alternative approaches that aim to circumvent this computational difficulty have been considered [FSMT19, ZCZ+21, WUS23]. Many of these approaches have proven effective in combination with deep reinforcement learning, though as tabular algorithms they either lack theoretical guarantees or rely on oracles for solving possibly intractable optimisation problems. A more complete account of multivariate DRL is given in Appendix A. A central motivation of our work is the development of computationally-tractable algorithms for multivariate distributional RL with theoretical guarantees.
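To make the computational difficulty concrete, the sketch below (our own illustration with hypothetical container types, not an algorithm from the paper) performs one exact backup on finitely-supported multi-return distributions: the number of atoms at each state multiplies with every application of the operator, which is why exact iteration quickly becomes infeasible.

```python
import numpy as np

def exact_bellman_backup(eta, transition, reward, gamma):
    """One exact multivariate distributional Bellman backup.

    eta: dict state -> (atoms [n, d], probs [n]) for the current estimate.
    transition: dict state -> dict next_state -> probability (policy-conditioned).
    reward: dict state -> cumulant vector of shape [d].
    The support at each state grows to the sum of the successor states' support
    sizes, so repeated exact backups blow up the representation.
    """
    new_eta = {}
    for x in eta:
        atoms, probs = [], []
        for x_next, p in transition[x].items():
            z, w = eta[x_next]
            atoms.append(reward[x] + gamma * z)   # pushforward z -> r(x) + gamma z
            probs.append(p * w)
        new_eta[x] = (np.vstack(atoms), np.concatenate(probs))
    return new_eta
```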
Maximum mean discrepancies. A core tool in the development of our proposed algorithms, as well as some prior work [NGV20, ZCZ+21], is the notion of distance over probability distributions given by maximum mean discrepancies (MMDs) [GBR+12]. A maximum mean discrepancy assigns a notion of distance to pairs of probability distributions, and is parametrised via a choice of kernel $k$; for probability measures $p$ and $q$, it is defined by

$\mathrm{MMD}_k(p, q) = \big( \mathbb{E}[k(X, X')] + \mathbb{E}[k(Z, Z')] - 2\,\mathbb{E}[k(X, Z)] \big)^{1/2}, \qquad X, X' \overset{\mathrm{iid}}{\sim} p, \quad Z, Z' \overset{\mathrm{iid}}{\sim} q.$
A useful alternative perspective on MMD is that the choice of kernel induces a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ of functions, namely the closure of the span of functions of the form $k(y, \cdot)$ for each $y$, with respect to the norm induced by the inner product $\langle k(y, \cdot), k(y', \cdot) \rangle_{\mathcal{H}_k} = k(y, y')$. With this interpretation, $\mathrm{MMD}_k(p, q)$ is equal to $\| \mu_p - \mu_q \|_{\mathcal{H}_k}$, where $\mu_p = \mathbb{E}_{X \sim p}[k(X, \cdot)]$ is the mean embedding of $p$ (similarly for $\mu_q$). When the map $p \mapsto \mu_p$ is injective, the kernel is called characteristic, and $\mathrm{MMD}_k$ is then a proper metric on probability measures [GBR+12]. In the remainder of this work, we will assume that all spaces of measures are over compact sets; thus, with continuous kernels, we are ensured that MMDs between probability measures are bounded. When comparing return distributions, this is achieved by asserting that rewards are bounded.
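As a concrete reference point, the following sketch (not from the paper; the choice of base point at the origin is an assumption) estimates the MMD between two equally-weighted particle sets via the kernel-evaluation form above, using an energy-distance kernel of the kind discussed later in the paper.

```python
import numpy as np

def energy_kernel(x, y, alpha=1.0):
    # Kernel induced by the semimetric rho(x, y) = ||x - y||^alpha,
    # with base point taken at the origin (an arbitrary but common choice).
    return 0.5 * (np.linalg.norm(x) ** alpha
                  + np.linalg.norm(y) ** alpha
                  - np.linalg.norm(x - y) ** alpha)

def mmd(xs, ys, kernel=energy_kernel):
    """Empirical MMD between two equally-weighted particle sets xs, ys."""
    k_xx = np.mean([[kernel(a, b) for b in xs] for a in xs])
    k_yy = np.mean([[kernel(a, b) for b in ys] for a in ys])
    k_xy = np.mean([[kernel(a, b) for b in ys] for a in xs])
    return np.sqrt(max(k_xx + k_yy - 2.0 * k_xy, 0.0))
```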
We conclude this section by recalling a particular family of kernels, introduced in [SSGF13], that will be particularly useful for our analysis. These are the kernels induced by semimetrics.
Definition 1.
Let $Y$ be a nonempty set, and consider a function $\rho : Y \times Y \to [0, \infty)$. Then $\rho$ is called a semimetric if it is symmetric and $\rho(y, y') = 0$ if and only if $y = y'$. Additionally, $\rho$ is said to have strong negative type if $\int \rho \,\mathrm{d}\big( (p - q) \times (p - q) \big) < 0$ whenever $p \neq q$ are probability measures on $Y$ with finite $\rho$-moments.
Notably, certain semimetrics naturally induce characteristic kernels and probability metrics.
Theorem 1 ([SSGF13, Proposition 29]).
Let $\rho$ be a semimetric on a space $Y$ with strong negative type, in the sense that $\int \rho \,\mathrm{d}\big( (p - q) \times (p - q) \big) < 0$ whenever $p \neq q$ are probability measures on a compact set $Y$. Moreover, let $k_\rho$ denote the kernel induced by $\rho$, that is,

$k_\rho(y, y') = \tfrac{1}{2}\big( \rho(y, y_0) + \rho(y', y_0) - \rho(y, y') \big)$

for some $y_0 \in Y$. Then $k_\rho$ is characteristic, so $\mathrm{MMD}_{k_\rho}$ is a metric.
3 Multivariate Distributional Dynamic Programming
To warm up, we begin by demonstrating that the (multivariate) distributional Bellman operator is indeed contractive in a supremal form of MMD, given by $\overline{\mathrm{MMD}}_k(\eta, \eta') = \sup_{x \in \mathcal{X}} \mathrm{MMD}_k(\eta(x), \eta'(x))$, for a particular class of kernels $k$. Our first theorem generalizes the analogous results of [NGV20] in the scalar case to multivariate cumulants. The proof of Theorem 3, as well as proofs of all remaining results, are deferred to Appendix B.
Theorem (Convergent MMD dynamic programming for the multi-return distribution function). Let $k$ be a kernel induced by a semimetric $\rho$ on $\mathbb{R}^d$ with strong negative type, satisfying
1. Shift-invariance. For any $y, y', z \in \mathbb{R}^d$, $\rho(y + z, y' + z) = \rho(y, y')$.
2. Homogeneity. For any $\lambda \geq 0$, there exists $\alpha > 0$ for which $\rho(\lambda y, \lambda y') = \lambda^{\alpha} \rho(y, y')$ for all $y, y' \in \mathbb{R}^d$.
Consider the sequence given by $\eta_{k+1} = \mathcal{T}^\pi \eta_k$. Then $\eta_k \to \eta^\pi$ at a geometric rate of $\gamma^{\alpha/2}$ in $\overline{\mathrm{MMD}}_k$, as long as rewards are bounded.
Notably, the energy distance kernels, induced by the semimetrics $\rho(y, y') = \|y - y'\|_2^{\alpha}$, satisfy the conditions of Theorem 3 by the homogeneity of the Euclidean norm, so $\mathcal{T}^\pi$ is a $\gamma^{\alpha/2}$-contraction in the energy distances. This generalizes the analogous result of [NGV20] in the one-dimensional case.
While Theorem 3 illustrates a method for approximating $\eta^\pi$ in MMD, it leaves a lot to be desired. Firstly, even in tabular MDPs, just as in the case of scalar distributional RL, return distribution functions have infinitely many degrees of freedom, precluding a tractable exact representation. As such, it will be necessary to study approximate, finite parameterizations of the return distribution functions, requiring more careful convergence analysis. Moreover, in RL it is generally assumed that the transition kernel and reward function are not known analytically—we only have access to sampled state transitions and cumulants. Thus, $\mathcal{T}^\pi$ cannot be represented or computed exactly, and instead we must study algorithms for approximating $\eta^\pi$ from samples. We provide algorithms for resolving both of these concerns—the former in Section 5 and the latter in Section 6—where we illustrate the difficulties that arise once the cumulant dimension exceeds unity.
4 Particle-Based Multivariate Distributional Dynamic Programming
Our first algorithmic contribution is inspired by the empirically successful equally-weighted particle (EWP) representations of multivariate return distributions employed by [ZCZ+21].
Temporal-difference learning with EWP representations. EWP representations, expressed by the class $\mathcal{F}_{\mathrm{EWP}, m}$, are defined by

$\mathcal{F}_{\mathrm{EWP}, m} = \Big\{ \eta : \mathcal{X} \to \mathscr{P}(\mathbb{R}^d) \;:\; \eta(x) = \tfrac{1}{m} \textstyle\sum_{i=1}^{m} \delta_{\theta_i(x)}, \ \theta_i(x) \in \mathbb{R}^d \Big\}. \qquad (2)$
For simplicity, we consider the case here where at each state $x$, the multi-return distribution is approximated by $m$ atoms. We can represent $\eta \in \mathcal{F}_{\mathrm{EWP}, m}$ by the particles $\theta_{1:m}(x) \in \mathbb{R}^{m \times d}$ for each $x \in \mathcal{X}$. The work of [ZCZ+21] introduced a TD-learning algorithm for learning an EWP representation of $\eta^\pi$, computing iterates of the particles according to
$\theta_i(x) \leftarrow \theta_i(x) - \alpha_t \nabla_{\theta_i(x)} \mathrm{MMD}_k^2\Big( \tfrac{1}{m}\textstyle\sum_{j=1}^m \delta_{\theta_j(x)},\ \tfrac{1}{m}\textstyle\sum_{j=1}^m \delta_{r(x) + \gamma \tilde{\theta}_j(x')} \Big), \qquad (3)$

for step sizes $\alpha_t$ and sampled next states $x' \sim P^\pi(\cdot \mid x)$, where $\tilde{\theta}$ is a copy of $\theta$ that does not propagate gradients. Despite the empirical success of this method in combination with deep learning, no convergence analysis has been established, owing to the nonconvexity of the MMD objective with respect to the particle locations. In this section we aim to understand to what extent analysis is possible for dynamic programming and temporal-difference learning algorithms based on the EWP representations in Equation (2).
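A tabular rendering of this update might look as follows (a sketch under our own conventions, not the implementation of [ZCZ+21]): the target particles are held fixed, and the particles at the visited state take a gradient step on the squared energy-distance MMD to the target.

```python
import numpy as np

def ewp_td_update(theta, x, r, x_next, gamma, step_size, alpha=1.0):
    """One EWP MMD TD update at a sampled transition (x, r, x_next).

    theta: dict mapping each state to an [m, d] array of particles.
    The target particles r + gamma * theta[x_next] play the role of the
    stop-gradient copy; only the particles at x are updated.
    """
    z = theta[x].copy()                   # [m, d] particles being updated
    target = r + gamma * theta[x_next]    # [m, d] fixed target particles
    m = z.shape[0]

    def g(u, v):
        # gradient of ||u - v||^alpha with respect to u (zero at u == v)
        diff = u - v
        n = np.linalg.norm(diff)
        return np.zeros_like(u) if n == 0.0 else alpha * n ** (alpha - 2.0) * diff

    grad = np.zeros_like(z)
    for i in range(m):
        for j in range(m):
            # d/dz_i of the squared energy-distance MMD between the two
            # particle sets, up to the 1/m^2 factor applied below.
            grad[i] += g(z[i], target[j]) - g(z[i], z[j])
    theta[x] = z - step_size * grad / m ** 2
    return theta
```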
Dynamic programming with EWP representations. As is often the case in approximate distributional dynamic programming [RBD+18, RMA+23], we have $\mathcal{T}^\pi \mathcal{F}_{\mathrm{EWP}, m} \not\subseteq \mathcal{F}_{\mathrm{EWP}, m}$; in words, the distributional Bellman operator does not map EWP representations to themselves. Concretely, as long as there exists a state $x$ at which the support of $P^\pi(\cdot \mid x)$ is not a singleton, $(\mathcal{T}^\pi \eta)(x)$ will consist of more than $m$ atoms even when $\eta \in \mathcal{F}_{\mathrm{EWP}, m}$; and secondly, if $P^\pi(\cdot \mid x)$ is not uniform, $(\mathcal{T}^\pi \eta)(x)$ will not consist of equally-weighted particles.
Consequently, to obtain a DP algorithm over EWP representations, we must consider a projected operator of the form $\Pi_{\mathrm{EWP}} \mathcal{T}^\pi$, for a projection $\Pi_{\mathrm{EWP}}$ onto $\mathcal{F}_{\mathrm{EWP}, m}$. A natural choice for this projection is the operator that minimizes the MMD of each multi-return distribution,

$(\Pi_{\mathrm{EWP}}\, \eta)(x) \in \operatorname*{arg\,min}_{z_{1:m} \in \mathbb{R}^{m \times d}} \mathrm{MMD}_k\Big( \tfrac{1}{m}\textstyle\sum_{i=1}^m \delta_{z_i},\ \eta(x) \Big). \qquad (4)$
Unfortunately, even in the scalar-reward ($d = 1$) case, the operator $\Pi_{\mathrm{EWP}}$ is problematic; it is not uniquely defined, and it is not a non-expansion in MMD [LB22, RMA+23]. These pathologies present significant complications when analyzing even the convergence of dynamic programming routines for learning an EWP representation of the multi-return distribution — in particular, it is not even clear that $\Pi_{\mathrm{EWP}} \mathcal{T}^\pi$ has a fixed point (let alone a unique one). Another complication arises due to the computational difficulty of computing the projection (4): even in the case where $(\mathcal{T}^\pi \eta)(x)$ has finite support for each state $x$, the projection is very similar to clustering, which can be intractable to compute exactly for large $m$ [She21]. Thus, the argmin projection in Equation (4) cannot be used directly to obtain a tractable DP algorithm.
Randomised dynamic programming. Towards this end, we introduce a tractable randomized dynamic programming algorithm for the EWP representation, by using a randomized proxy for $\Pi_{\mathrm{EWP}}$ that produces accurate return distribution estimates with high probability. The method produces the following iterates,

$\eta_{k+1}(x) = \frac{1}{m} \sum_{i=1}^{m} \delta_{Z^k_i(x)}, \qquad Z^k_1(x), \dots, Z^k_m(x) \overset{\mathrm{iid}}{\sim} (\mathcal{T}^\pi \eta_k)(x). \qquad (5)$
A similar algorithm for categorical representations was discussed in concurrent work [LK24] without convergence analysis.
The intuition is that, particularly for large $m$, the Monte Carlo error associated with the sample-based approximation to $(\mathcal{T}^\pi \eta_k)(x)$ is small, and we can therefore expect the DP process, though randomised, to be accurate with high probability. This is summarised by a key theoretical result of this section; our proof of this result in the appendix provides a general approach to proving convergence for algorithms using arbitrary accurate approximations to (4) that we expect to be useful in future work.
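Concretely, one step of the randomized procedure can be sketched as follows (our own illustration of the iterates in Equation (5), with hypothetical container names): each new particle is an independent draw from the Bellman target, obtained by sampling a next state, a stored particle, and applying the pushforward.

```python
import numpy as np

def randomized_ewp_backup(theta, transition, reward, gamma, m, rng):
    """Randomized EWP dynamic-programming step (cf. Equation (5)).

    theta: dict state -> [m, d] particle array for the current iterate.
    For each state x, draw m i.i.d. samples from (T^pi eta_k)(x): sample a next
    state, then a uniformly random stored particle at that state, and push it
    through z -> r(x) + gamma * z. These samples form the new equally-weighted
    particle set, standing in for the intractable MMD projection.
    """
    new_theta = {}
    for x, particles in theta.items():
        d = particles.shape[1]
        next_states = list(transition[x].keys())
        p = np.array([transition[x][xn] for xn in next_states])
        new_particles = np.empty((m, d))
        for i in range(m):
            xn = next_states[rng.choice(len(next_states), p=p)]
            z = theta[xn][rng.integers(theta[xn].shape[0])]
            new_particles[i] = reward[x] + gamma * z
        new_theta[x] = new_particles
    return new_theta
```

Usage would be, e.g., repeated application of `theta = randomized_ewp_backup(theta, transition, reward, 0.9, m=64, rng=np.random.default_rng(0))`.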
Theorem (Convergence of randomised EWP dynamic programming). Consider a kernel induced by the semimetric with , and suppose rewards are bounded in each dimension within . For any such that , and any , for the sequence defined in Equation (5), with probability at least we have
where and , and where omits logarithmic factors in .
This shows that our novel randomised DP algorithm with EWP representations can tractably compute accurate approximations to the true multivariate return distributions, with only polynomial dependence on the dimension $d$. Appendix C illustrates explicitly how this procedure is more memory efficient than unprojected EWP dynamic programming. However, the guarantees associated with this algorithm hold only in high probability, and are weaker than the pointwise convergence guarantees of one-dimensional distributional DP algorithms [RBD+18, RMA+23, BDR23]. Consequently, these guarantees do not provide a clear understanding of the EWP-TD method described at the beginning of this section. In the sequel, however, we introduce DP and TD algorithms based on categorical representations, for which we derive dynamic programming and TD-learning convergence bounds.
The proof of Theorem 4 hinges on the following proposition, which demonstrates that the convergence of projected EWP dynamic programming is controlled by how far return distributions are transported under the projection map.
Proposition (Convergence of EWP Dynamic Programming). Consider a kernel satisfying the hypotheses of Theorem 3, suppose rewards are globally bounded in each dimension, and let $\Pi_0, \Pi_1, \dots$ be a sequence of maps satisfying
(6)
Then the iterates given by with converge to a set in , where denotes the closed ball in ,
As an immediate corollary of Proposition 4 and Theorem 4, we can derive an error rate for projected dynamic programming with the exact projection $\Pi_{\mathrm{EWP}}$ as well.
Corollary (Projected EWP dynamic programming with the exact projection). For any kernel satisfying the hypotheses of Theorem 4, and for any for which , the iterates converge to a set , where
5 Categorical Multivariate Distributional Dynamic Programming
Our next contribution is the introduction of a convergent multivariate distributional dynamic programming algorithm based on a categorical representation of return distribution functions, generalizing the algorithms and analysis of [RBD+18] to the multivariate setting.
Categorical representations. As outlined above, to model the multi-return distribution function in practice, it is necessary to restrict each multi-return distribution to a finitely-parameterized class. In this work, we take inspiration from successful distributional RL algorithms [BDM17b, RBD+18] and employ a categorical representation. The work of [WUS23] proposed a categorical representation for multivariate DRL, but their categorical projection was not justified theoretically, and it required a particular choice of fixed support. We propose a novel categorical representation with a finite (possibly state-dependent) support $\xi(x) = \{z_1(x), \dots, z_n(x)\} \subset \mathbb{R}^d$, that models the multi-return distribution function such that $\eta(x) \in \Delta_{\xi(x)}$ for each $x \in \mathcal{X}$. The notation $z_i(x)$ simply refers to the $i$th support point at state $x$ specified by $\xi$, and $\Delta_{\xi(x)}$ denotes the probability simplex on the finite set $\xi(x)$. We refer to the mapping $\xi$ as the support map (in many applications, the most natural support map is constant across the state space), and we denote the class of multi-return distribution functions under the corresponding categorical representation as $\mathcal{C}_\xi$.
Categorical projection. Once again, the distributional Bellman operator is not generally closed over $\mathcal{C}_\xi$, that is, $\mathcal{T}^\pi \mathcal{C}_\xi \not\subseteq \mathcal{C}_\xi$. As such, we cannot actually employ the procedure described in Theorem 3; rather, we need to project applications of $\mathcal{T}^\pi$ back onto $\mathcal{C}_\xi$. Roughly, we need an operator $\Pi_{\mathcal{C}}$ for which $\Pi_{\mathcal{C}} \mathcal{T}^\pi \mathcal{C}_\xi \subseteq \mathcal{C}_\xi$. Given such an operator, as in the literature on categorical distributional RL [BDM17b, RBD+18], we will study the convergence of the iterates $\eta_{k+1} = \Pi_{\mathcal{C}} \mathcal{T}^\pi \eta_k$.
Projection operators used in the scalar categorical distributional RL literature are specific to distributions over $\mathbb{R}$, so we must introduce a new projection. We propose a projection similar to (4),

$(\Pi_{\mathcal{C}}\, \eta)(x) = \operatorname*{arg\,min}_{\hat{\eta} \in \Delta_{\xi(x)}} \mathrm{MMD}_k\big( \hat{\eta}, \eta(x) \big). \qquad (7)$
We will now verify that $\Pi_{\mathcal{C}}$ is well-defined, and that it satisfies the aforementioned properties.

Lemma (Well-definedness of the categorical projection). Let $k$ be a kernel induced by a semimetric on $\mathbb{R}^d$ with strong negative type (cf. Theorem 1). Then $\Pi_{\mathcal{C}}$ is well-defined, $\Pi_{\mathcal{C}} \eta \in \mathcal{C}_\xi$ for any $\eta$, and $\Pi_{\mathcal{C}} \eta = \eta$ for any $\eta \in \mathcal{C}_\xi$.

It is worth noting that beyond simply ensuring the well-posedness of the projection $\Pi_{\mathcal{C}}$, Lemma 7 also certifies an efficient algorithm for computing the projection, namely by solving the appropriate quadratic program (QP), as observed by [SZS+08]. We demonstrate pseudocode for computing the projected Bellman operator with a QP solver in Algorithm 1.
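For concreteness, the projection at a single state can be posed as the following small convex QP in the atom probabilities (a sketch using SciPy's SLSQP solver, not the paper's Algorithm 1; the function and variable names are ours).

```python
import numpy as np
from scipy.optimize import minimize

def categorical_mmd_projection(support, target_atoms, target_weights, kernel):
    """MMD projection of a finitely-supported measure onto the simplex over `support` (cf. Eq. (7)).

    support: [n, d] fixed support points z_1..z_n at this state.
    target_atoms, target_weights: the measure nu = sum_j w_j delta_{y_j}.
    Minimizes p^T K p - 2 p^T b over the probability simplex, where
    K[i, l] = k(z_i, z_l) and b[i] = sum_j w_j k(z_i, y_j); this objective
    equals the squared MMD up to a constant independent of p.
    """
    K = np.array([[kernel(zi, zl) for zl in support] for zi in support])
    b = np.array([sum(w * kernel(zi, yj)
                      for yj, w in zip(target_atoms, target_weights))
                  for zi in support])
    n = len(support)
    objective = lambda p: p @ K @ p - 2.0 * p @ b
    constraints = ({'type': 'eq', 'fun': lambda p: p.sum() - 1.0},)
    bounds = [(0.0, 1.0)] * n
    res = minimize(objective, np.full(n, 1.0 / n), bounds=bounds, constraints=constraints)
    return res.x
```

Applying this projection state by state to $\mathcal{T}^\pi \eta$, whose per-state support is finite in tabular MDPs, yields one application of the projected operator $\Pi_{\mathcal{C}} \mathcal{T}^\pi$.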
Lemma (Nonexpansion of the categorical projection). Under the conditions of Lemma 7, $\Pi_{\mathcal{C}}$ is a nonexpansion in $\mathrm{MMD}_k$. That is, for any return distribution functions $\eta, \eta'$ and state $x$, we have $\mathrm{MMD}_k\big( (\Pi_{\mathcal{C}}\eta)(x), (\Pi_{\mathcal{C}}\eta')(x) \big) \leq \mathrm{MMD}_k\big( \eta(x), \eta'(x) \big)$.
Categorical multivariate distributional dynamic programming. As an immediate consequence of Lemma 1, it follows that projected dynamic programming under the projection $\Pi_{\mathcal{C}}$ is convergent; this is because $\mathcal{T}^\pi$ is a contraction in $\overline{\mathrm{MMD}}_k$ and $\Pi_{\mathcal{C}}$ is a nonexpansion in $\overline{\mathrm{MMD}}_k$, so the projected operator $\Pi_{\mathcal{C}} \mathcal{T}^\pi$ is a contraction in $\overline{\mathrm{MMD}}_k$; a standard invocation of the Banach fixed point theorem, appealing to the completeness of $\mathcal{C}_\xi$, certifies that repeated application of $\Pi_{\mathcal{C}} \mathcal{T}^\pi$ will result in convergence to a unique fixed point.
Corollary (Convergence of projected categorical dynamic programming). Let $k$ be a kernel satisfying the conditions of Theorem 3. Then for any $\eta_0 \in \mathcal{C}_\xi$, the iterates given by $\eta_{k+1} = \Pi_{\mathcal{C}} \mathcal{T}^\pi \eta_k$ converge geometrically to a unique fixed point $\eta^\pi_{\mathcal{C}} \in \mathcal{C}_\xi$.
Beyond the result of Theorem 3, Corollary 1 illustrates an algorithm for estimating $\eta^\pi$ provided knowledge of the transition kernel and the reward function, which is computationally tractable in tabular MDPs. Indeed, the iterates all lie in $\mathcal{C}_\xi$, having finitely many degrees of freedom. Algorithm 1 outlines a computationally tractable procedure for learning $\eta^\pi_{\mathcal{C}}$ in this setting.
We note that the work of [WUS23] provided an alternative multivariate categorical algorithm, which was not analyzed theoretically. Moreover, our method provides the additional ability to support arbitrary state-dependent support maps, while theirs requires support maps to be uniform grids.
Accurate approximations. We now provide bounds showing that the fixed point from Corollary 1 can be made arbitrarily accurate by increasing the number of atoms.
To derive a bound on the quality of the fixed point, we provide a reduction, via partitioning the space of returns, to the covering number of this space. Proceeding, we define a class of partitions $\mathscr{P}_\xi$, where each partition $\mathcal{P} \in \mathscr{P}_\xi$ satisfies
1. ;
2. for any $P, P' \in \mathcal{P}$, either $P = P'$ or $P \cap P' = \emptyset$;
3. ;
4. each element $P \in \mathcal{P}$ contains exactly one element of the support $\xi$.
For any partition $\mathcal{P} \in \mathscr{P}_\xi$, we define a notion of mesh size relative to a kernel $k$ induced by a semimetric $\rho$,

$\mathrm{mesh}_\rho(\mathcal{P}) = \max_{P \in \mathcal{P}} \sup_{y, y' \in P} \rho(y, y'). \qquad (8)$
The following result confirms that the categorical fixed point recovers $\eta^\pi$ as the mesh decreases.
Theorem (Consistency of categorical dynamic programming). Let $k$ be a kernel induced by an $\alpha$-homogeneous and shift-invariant semimetric conforming to the conditions of Theorem 3. Then the fixed point $\eta^\pi_{\mathcal{C}}$ of $\Pi_{\mathcal{C}} \mathcal{T}^\pi$ satisfies
(9)
Thus, for any sequence of supports for which the maximal spacing (in $\rho$) between neighbouring points tends to $0$, the fixed point approximates $\eta^\pi$ to arbitrary precision for sufficiently fine supports. The next corollary illustrates this in a familiar setting.
Corollary (Uniformly-spaced supports). Let $\xi(x) \equiv \xi$, where $\xi$ is a set of uniformly-spaced support points on the space of returns. For $k$ induced by the semimetric $\rho(y, y') = \|y - y'\|_2^{\alpha}$ with $\alpha \in (0, 2)$,
With $d = 1$ and $\alpha = 1$, the MMD in Corollary 1 is equivalent to the Cramér metric [SR13], the setting in which categorical (scalar) distributional dynamic programming is well understood. Our rate matches the known rate shown by [RBD+18] in this setting. Thus, our results offer a new perspective on categorical DRL, and naturally generalize the theory to the multivariate setting.
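The equivalence underlying this comparison can be checked numerically: with the kernel induced by $\rho(y, y') = |y - y'|$ in one dimension, the MMD between two discrete distributions coincides with the $\ell_2$ distance between their CDFs. A small self-contained check (ours, not from the paper):

```python
import numpy as np

def mmd_energy_1d(xs, px, ys, py):
    """MMD with the 1-D energy-distance kernel k(x, y) = (|x| + |y| - |x - y|) / 2."""
    k = lambda a, b: 0.5 * (abs(a) + abs(b) - abs(a - b))
    kxx = px @ np.array([[k(a, b) for b in xs] for a in xs]) @ px
    kyy = py @ np.array([[k(a, b) for b in ys] for a in ys]) @ py
    kxy = px @ np.array([[k(a, b) for b in ys] for a in xs]) @ py
    return np.sqrt(max(kxx + kyy - 2.0 * kxy, 0.0))

def cramer_1d(xs, px, ys, py):
    """ell_2 distance between the CDFs of two discrete 1-D distributions."""
    grid = np.sort(np.concatenate([xs, ys]))
    cdf = lambda atoms, w, t: np.sum(w[atoms <= t])
    diffs = np.array([cdf(xs, px, t) - cdf(ys, py, t) for t in grid[:-1]])
    return np.sqrt(np.sum(diffs ** 2 * np.diff(grid)))

xs, px = np.array([0.0, 1.0, 3.0]), np.array([0.5, 0.25, 0.25])
ys, py = np.array([0.5, 2.0]), np.array([0.6, 0.4])
print(mmd_energy_1d(xs, px, ys, py), cramer_1d(xs, px, ys, py))  # agree up to rounding
```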
Theorem 1 relies on the following lemma about the approximation quality of the categorical MMD projection, which may be of independent interest.
Lemma (Approximation error of the categorical MMD projection). Let $k$ be a kernel satisfying the conditions of Lemma 7, and for any finite $\xi \subset \mathbb{R}^d$, define $\Pi_\xi$ via (7). Then .
At this stage, we have shown definitively that categorical dynamic programming converges in the multivariate case. In the sequel, we build on these results to provide a convergent multivariate categorical TD-learning algorithm.
5.1 Simulation: The Distributional Successor Measure
As a preliminary example, we consider 3-state MDPs with cumulants $r(x) = e_x$, where $e_x$ denotes the $x$th standard basis vector. In this setting, $\eta^\pi$ encodes the distribution over trajectory-wise discounted state occupancies, which was discussed in the recent work of [WFG+24] and called the distributional successor measure (DSM). Particularly, [WFG+24] showed that a pushforward of the DSM recovers the return distribution function for any scalar reward function. Figure 1 shows that the projected categorical dynamic programming algorithm accurately approximates the distribution over discounted state occupancies as well as distributions over returns on held-out reward functions.
[Figure 1: projected categorical dynamic programming estimates of the distributional successor measure and of return distributions for held-out reward functions.]
6 Multivariate Distributional TD-Learning
Next, we devise an algorithm for approximating the multi-return distribution function when the transition kernel and reward function are not known, and are observed only through samples. Indeed, this is a strong motivation for TD-learning algorithms [Sut88], wherein state transition data alone is used to solve the Bellman equation. In this section, we devise a TD-learning algorithm for multivariate DRL, leveraging our results on categorical dynamic programming in .
Relaxation to signed measures. In the $d = 1$ setting, the categorical projection presented above is known to be affine [RBD+18], making scalar categorical TD-learning amenable to common techniques in stochastic approximation theory. However, the projection is not affine when $d > 1$; we give an explicit example in Appendix D. Thus, we relax the categorical representation to include signed measures, which will provide us with an affine projection [BRCM19]; this is crucial for proving our main result, Theorem 6. We denote by $\mathcal{S}_{\xi(x)}$ the set of all signed measures over $\xi(x)$ with total mass 1. We begin by noting that the MMD endows this set with a metric structure.
Lemma (MMD is a metric on mass-1 signed measures). Let $k$ be a characteristic kernel over some space $Y$. Then $(\mu, \nu) \mapsto \| m_\mu - m_\nu \|_{\mathcal{H}_k}$ defines a metric on the space of mass-1 signed measures over $Y$, where $m_\mu$ denotes the usual mean embedding of $\mu$ and $\mathcal{H}_k$ is the RKHS with kernel $k$.
We define the relaxed projection $\Pi_{\mathcal{S}}$,

$(\Pi_{\mathcal{S}}\, \eta)(x) = \operatorname*{arg\,min}_{\hat{\eta} \in \mathcal{S}_{\xi(x)}} \mathrm{MMD}_k\big( \hat{\eta}, \eta(x) \big). \qquad (10)$
Note that $\Pi_{\mathcal{S}}$ is very similar to the definition of the categorical MMD projection in (7); the only difference is that the optimization occurs over the larger class of mass-1 signed measures. It is also worth noting that the distributional Bellman operator can be applied directly to signed measures, which yields the following convenient result.
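Without the nonnegativity constraint, the per-state projection is an equality-constrained quadratic program with a closed-form solution, and that solution is an affine function of the target measure. A minimal sketch (ours, with hypothetical names) is below.

```python
import numpy as np

def signed_mass1_projection(support, target_atoms, target_weights, kernel):
    """MMD projection onto mass-1 signed measures over `support` (cf. Eq. (10)).

    Minimizes p^T K p - 2 p^T b subject only to sum(p) = 1, where
    K[i, l] = kernel(z_i, z_l) and b[i] = sum_j w_j kernel(z_i, y_j).
    K is positive definite for a characteristic kernel and distinct atoms, so
    the KKT conditions give a closed form, which is affine in b and hence
    affine in the target measure.
    """
    K = np.array([[kernel(zi, zl) for zl in support] for zi in support])
    b = np.array([sum(w * kernel(zi, yj)
                      for yj, w in zip(target_atoms, target_weights))
                  for zi in support])
    ones = np.ones(len(support))
    Kb, K1 = np.linalg.solve(K, b), np.linalg.solve(K, ones)
    lam = (ones @ Kb - 1.0) / (ones @ K1)   # Lagrange multiplier for sum(p) = 1
    return Kb - lam * K1
```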
Lemma (The signed projected operator). Under the conditions of Corollary 1, the projected operator $\Pi_{\mathcal{S}} \mathcal{T}^\pi$ is affine, is contractive with contraction factor $\gamma^{\alpha/2}$, and has a unique fixed point $\eta^\pi_{\mathcal{S}}$.
While we have “relaxed” the projection, the fixed point $\eta^\pi_{\mathcal{S}}$ is a good approximation of $\eta^\pi$.
Notably, the second statement of Theorem 6 states that projecting $\eta^\pi_{\mathcal{S}}$ back onto the space of multi-return distribution functions yields approximately the same error as $\eta^\pi_{\mathcal{S}}$ itself when $\eta^\pi_{\mathcal{S}}$ is near $\eta^\pi$.
In the remainder of the section, we assume access to a stream of MDP transitions consisting of elements with the following structure,

$(X_t, R_t, X'_t)_{t \geq 0}, \qquad X_t \sim \nu_t, \quad R_t = r(X_t), \quad X'_t \sim P^\pi(\cdot \mid X_t), \qquad (11)$

where $\nu_t$ is some probability measure over $\mathcal{X}$ and $(\mathcal{F}_t)_{t \geq 0}$ is the canonical filtration $\mathcal{F}_t = \sigma\big( (X_s, R_s, X'_s) : s \leq t \big)$. Based on these transitions, we can define stochastic distributional Bellman backups by

$(\hat{\mathcal{T}}_t \eta)(X_t) = (\mathrm{b}_{R_t, \gamma})_{\#}\, \eta(X'_t), \qquad (12)$

which notably can be computed exactly without knowledge of $P^\pi$ or $r$. Due to the stronger convergence guarantees shown for projected multivariate distributional dynamic programming, we introduce an asynchronous incremental algorithm leveraging the categorical representation, which produces iterates according to

$\eta_{t+1}(X_t) = (1 - \alpha_t)\, \eta_t(X_t) + \alpha_t\, \big( \Pi_{\mathcal{S}} \hat{\mathcal{T}}_t \eta_t \big)(X_t), \qquad \eta_{t+1}(x) = \eta_t(x) \ \text{ for } x \neq X_t, \qquad (13)$

for $t \geq 0$, where $(\alpha_t)_{t \geq 0}$ is any sequence of (possibly) random step sizes adapted to the filtration $(\mathcal{F}_t)_{t \geq 0}$. The iterates of (13) closely resemble those of classic stochastic approximation algorithms [RM51] and particularly asynchronous TD learning algorithms [JJS93, Tsi94, BT96], but with iterates taking values in the space of state-indexed signed measures. Indeed, our next result draws on the techniques from these works to establish convergence of TD-learning on mass-1 signed measure representations.
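One asynchronous update of this scheme can be sketched as follows (our own tabular illustration of the iterates in (13), reusing the `signed_mass1_projection` sketch given after Equation (10); the names are ours, not the paper's).

```python
import numpy as np

def signed_categorical_td_step(probs, support, x, r, x_next, gamma, step, kernel):
    """One TD update on mass-1 signed categorical measures at transition (x, r, x_next).

    probs: dict state -> signed weight vector over support[state].
    support: dict state -> [n, d] array of atoms.
    The stochastic backup (Eq. (12)) pushes the signed measure at x_next through
    z -> r + gamma * z; its MMD projection onto the atoms at x is a mass-1 signed
    weight vector, and the weights at x move toward it with the given step size.
    """
    target_atoms = r + gamma * support[x_next]          # pushforward of the atoms
    backup_weights = signed_mass1_projection(
        support[x], target_atoms, probs[x_next], kernel)
    probs[x] = (1.0 - step) * probs[x] + step * backup_weights
    return probs
```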
Theorem (Convergence of categorical TD learning). For a kernel induced by a semimetric of strong negative type, the sequence given by (11)-(13) converges to $\eta^\pi_{\mathcal{S}}$ with probability 1.
6.1 Simulations: Distributional Successor Features
[Figure 2: accuracy of inferred return distributions on held-out reward functions as a function of the number of atoms, comparing categorical TD learning with EWP TD learning.]
To illustrate the behavior of our categorical TD algorithm, we employ it to learn the multi-return distributions for several tabular MDPs with random cumulants. We focus on the case of low-dimensional cumulants, which is the setting studied in recent works regarding multivariate distributional RL [ZCZ+21, WUS23]. Interpreting the multi-return distributions as joint distributions over successor features (SFs) [BDM+18], we additionally evaluate the return distributions for random reward functions in the span of the cumulants. We compare our categorical TD approach with a tabular implementation of the EWP TD algorithm of [ZCZ+21], for which no convergence bounds are known.
Figure 2 compares TD learning approaches based on their ability to accurately infer (scalar) return distributions on held-out reward functions, averaged over 100 random MDPs, with transitions drawn from Dirichlet priors and cumulants drawn from uniform priors. The performance of the categorical algorithms sharply increases as the number of atoms increases. On the other hand, the EWP TD algorithm performs well with few atoms, but does not improve very much with higher-resolution representations. We posit this is due to the algorithm getting stuck in local minima, given the non-convexity of the EWP MMD objective. This hypothesis is supported as well by Figure 4, which depicts the learned distributional SFs and return distribution predictions.
[Figure 4: learned distributional SFs and return distribution predictions under EWP TD and categorical TD.]
In particular, we observe that the learned particle locations in the EWP TD approach tend to cluster in some areas or get stuck in low-density regions, which suggests the presence of a local optimum. On the other hand, our provably-convergent categorical TD method learns a high-fidelity quantization of the true multi-return distributions.
Naturally, however, the benefits of the bounds for EWP suggested by Theorem 4 become more apparent as we increase the cumulant dimension. Figure 3 repeats the experiment of Figure 2 with a larger cumulant dimension, using randomized support points for the categorical algorithm to avoid a cubic growth in the cardinality of the supports. Notably, our method is the first capable of supporting such unstructured supports. While the categorical TD approach can still outperform EWP, a much larger number of atoms is required. This is unsurprising in light of our theoretical results.
[Figure 3: held-out return distribution accuracy with higher-dimensional cumulants, comparing categorical TD with randomized supports against EWP TD.]
7 Conclusion
We have provided the first provably convergent and computationally tractable algorithms for learning multivariate return distributions in tabular MDPs. Our theoretical results include convergence guarantees that indicate how accuracy scales with the number of particles used in distribution representations, and interestingly motivate the use of signed measures to develop provably convergent TD algorithms.
While it is difficult to scale categorical representations to high-dimensional cumulants, our algorithm is highly performant in the low-dimensional setting, which has been the focus of recent work in multivariate distributional RL. Notably, even the $d = 2$ setting has important applications—indeed, efforts in safe RL depend on distinguishing a cost signal from a reward signal (see, e.g., [YSTS23]), which can be modeled by bivariate distributional RL. In this setting, our method can easily be scaled to large state spaces by approximating the categorical signed measures with neural networks; an illustrated example is given in Appendix F.
On the other hand, the prospect of learning multi-return distributions for high-dimensional cumulants also has many important applications, such as modeling close approximations to distributional successor measures [WFG+24] for zero-shot risk-sensitive policy evaluation. In this setting, we believe EWP-based multivariate DRL will be highly impactful. Our results concerning EWP dynamic programming provide promising evidence that the accuracy of EWP representations scales gracefully with $d$ for a fixed number of atoms. Thus, we believe that understanding the convergence of EWP TD-learning algorithms is a very interesting and important open problem for future investigation.
Acknowledgements
The authors would like to thank Yunhao Tang, Tyler Kastner, Arnav Jain, Yash Jhaveri, Will Dabney, David Meger, and Marc Bellemare for their helpful feedback, as well as insightful suggestions from anonymous reviewers. This work was supported by the Fonds de Recherche du Québec, the National Sciences and Engineering Research Council of Canada, and the compute resources provided by Mila (mila.quebec).
References
- [BB96] Steven J. Bradtke and Andrew G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1):33–57, 1996.
- [BBC+21] Mathieu Blondel, Quentin Berthet, Marco Cuturi, Roy Frostig, Stephan Hoyer, Felipe Llinares-López, Fabian Pedregosa, and Jean-Philippe Vert. Efficient and modular implicit differentiation. In Advances in Neural Information Processing Systems, 2021.
- [BCC+20] Marc G. Bellemare, Salvatore Candido, Pablo Samuel Castro, Jun Gong, Marlos C. Machado, Subhodeep Moitra, Sameera S. Ponda, and Ziyu Wang. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature, 588(7836):77–82, December 2020.
- [BDM+17a] André Barreto, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, Hado P. van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, 2017.
- [BDM17b] Marc G. Bellemare, Will Dabney, and Rémi Munos. A Distributional Perspective on Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, 2017.
- [BDM+18] André Barreto, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, Hado van Hasselt, and David Silver. Successor Features for Transfer in Reinforcement Learning, April 2018.
- [BDR23] Marc G. Bellemare, Will Dabney, and Mark Rowland. Distributional Reinforcement Learning. The MIT Press, 2023.
- [BEKS17] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017.
- [BFH+18] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.
- [BHB+20] André Barreto, Shaobo Hou, Diana Borsa, David Silver, and Doina Precup. Fast reinforcement learning with generalized policy updates. Proceedings of the National Academy of Sciences (PNAS), 117(48):30079–30087, 2020.
- [BRCM19] Marc G Bellemare, Nicolas Le Roux, Pablo Samuel Castro, and Subhodeep Moitra. Distributional reinforcement learning with linear function approximation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2019.
- [BT96] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, 1996.
- [CGH+96] Robert M. Corless, Gaston H. Gonnet, David E.G. Hare, David J. Jeffrey, and Donald E. Knuth. On the Lambert W function. Advances in Computational Mathematics, 5:329–359, 1996.
- [CZZ+24] Xin-Qiang Cai, Pushi Zhang, Li Zhao, Jiang Bian, Masashi Sugiyama, and Ashley Llorens. Distributional Pareto-optimal multi-objective reinforcement learning. In Advances in Neural Information Processing Systems, 2024.
- [DKNU+20] Will Dabney, Zeb Kurth-Nelson, Naoshige Uchida, Clara Kwon Starkweather, Demis Hassabis, Rémi Munos, and Matthew Botvinick. A distributional code for value in dopamine-based reinforcement learning. Nature, 577(7792):671–675, 2020.
- [DRBM18] Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional Reinforcement Learning with Quantile Regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
- [FSMT19] Dror Freirich, Tzahi Shimkin, Ron Meir, and Aviv Tamar. Distributional Multivariate Policy Evaluation and Exploration with the Bellman GAN. In Proceedings of the International Conference on Machine Learning, 2019.
- [GBR+12] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A Kernel Two-Sample Test. Journal of Machine Learning Research, 13(25):723–773, 2012.
- [GBSL21] Michael Gimelfarb, André Barreto, Scott Sanner, and Chi-Guhn Lee. Risk-Aware Transfer in Reinforcement Learning using Successor Features. In Advances in Neural Information Processing Systems, 2021.
- [HRB+22] Conor F. Hayes, Roxana Radulescu, Eugenio Bargiacchi, Johan Källström, Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, Luisa M. Zintgraf, Richard Dazeley, Fredrik Heintz, Enda Howley, Athirai A. Irissappane, Patrick Mannion, Ann Nowé, Gabriel de Oliveira Ramos, Marcello Restelli, Peter Vamplew, and Diederik M. Roijers. A practical guide to multi-objective reinforcement learning and planning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2022.
- [JJS93] Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the Convergence of Stochastic Iterative Dynamic Programming Algorithms. In Advances in Neural Information Processing Systems, 1993.
- [KEF23] Tyler Kastner, Murat A. Erdogdu, and Amir-massoud Farahmand. Distributional Model Equivalence for Risk-Sensitive Reinforcement Learning. In Advances in Neural Information Processing Systems, 2023.
- [Lax02] Peter D. Lax. Functional analysis. John Wiley & Sons, 2002.
- [LB22] Alix Lhéritier and Nicolas Bondoux. A Cramér Distance perspective on Quantile Regression based Distributional Reinforcement Learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2022.
- [LK24] Dong Neuck Lee and Michael R. Kosorok. Off-policy reinforcement learning with high dimensional reward. arXiv preprint arXiv:2408.07660, 2024.
- [LM22] Shiau Hong Lim and Ilyas Malik. Distributional Reinforcement Learning for Risk-Sensitive Policies. In Advances in Neural Information Processing Systems, 2022.
- [MSK+10] Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2010.
- [NGV20] Thanh Tang Nguyen, Sunil Gupta, and Svetha Venkatesh. Distributional Reinforcement Learning via Moment Matching. In AAAI, 2020.
- [RBD+18] Mark Rowland, Marc G. Bellemare, Will Dabney, Remi Munos, and Yee Whye Teh. An Analysis of Categorical Distributional Reinforcement Learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2018.
- [RM51] Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407, September 1951.
- [RMA+23] Mark Rowland, Rémi Munos, Mohammad Gheshlaghi Azar, Yunhao Tang, Georg Ostrovski, Anna Harutyunyan, Karl Tuyls, Marc G. Bellemare, and Will Dabney. An Analysis of Quantile Temporal-Difference Learning. arXiv, 2023.
- [ROH+21] Mark Rowland, Shayegan Omidshafiei, Daniel Hennes, Will Dabney, Andrew Jaegle, Paul Muller, Julien Pérolat, and Karl Tuyls. Temporal difference and return optimism in cooperative multi-agent reinforcement learning. In AAMAS ALA Workshop, 2021.
- [RVWD13] Diederik M. Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research (JAIR), 48:67–113, 2013.
- [Sch00] Bernhard Schölkopf. The Kernel Trick for Distances. In Advances in Neural Information Processing Systems, 2000.
- [SFS24] Eiki Shimizu, Kenji Fukumizu, and Dino Sejdinovic. Neural-kernel conditional mean embeddings. arXiv, 2024.
- [She21] Vladimir Shenmaier. On the Complexity of the Geometric Median Problem with Outliers. arXiv, 2021.
- [SLL21] Wei-Fang Sun, Cheng-Kuang Lee, and Chun-Yi Lee. DFAC framework: Factorizing the value function via quantile mixture for multi-agent distributional Q-learning. In Proceedings of the International Conference on Machine Learning, 2021.
- [SR13] Gábor J. Székely and Maria L. Rizzo. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249–1272, August 2013.
- [SSGF13] Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5), October 2013.
- [Sut88] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
- [SZS+08] Le Song, Xinhua Zhang, Alex Smola, Arthur Gretton, and Bernhard Schölkopf. Tailoring density estimation via reproducing kernel moment matching. In Proceedings of the 25th international conference on Machine learning, pages 992–999, 2008.
- [Tsi94] John N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine learning, 16:185–202, 1994.
- [TSM17] Ilya Tolstikhin, Bharath K. Sriperumbudur, and Krikamol Mu. Minimax estimation of kernel mean embeddings. Journal of Machine Learning Research, 18(86):1–47, 2017.
- [TVR97] J.N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.
- [Vil09] Cédric Villani. Optimal transport: Old and new, volume 338. Springer, 2009.
- [VN02] Jan Van Neerven. Approximating Bochner integrals by Riemann sums. Indagationes Mathematicae, 13(2):197–208, June 2002.
- [WBK+22] Peter R Wurman, Samuel Barrett, Kenta Kawamoto, James MacGlashan, Kaushik Subramanian, Thomas J Walsh, Roberto Capobianco, Alisa Devlic, Franziska Eckert, Florian Fuchs, et al. Outracing champion gran turismo drivers with deep reinforcement learning. Nature, 602(7896):223–228, 2022.
- [WFG+24] Harley Wiltzer, Jesse Farebrother, Arthur Gretton, Yunhao Tang, André Barreto, Will Dabney, Marc G. Bellemare, and Mark Rowland. A distributional analogue to the successor representation. In Proceedings of the International Conference on Machine Learning, 2024.
- [WUS23] Runzhe Wu, Masatoshi Uehara, and Wen Sun. Distributional Offline Policy Evaluation with Predictive Error Guarantees. In Proceedings of the International Conference on Machine Learning, 2023.
- [YSTS23] Qisong Yang, Thiago D. Simão, Simon H. Tindemans, and Matthijs T. J. Spaan. Safety-constrained reinforcement learning with a distributional safety critic. Machine Learning, 112(3):859–887, 2023.
- [YZL+19] Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. Fully parameterized quantile function for distributional reinforcement learning. Advances in neural information processing systems, 32, 2019.
- [ZCZ+21] Pushi Zhang, Xiaoyu Chen, Li Zhao, Wei Xiong, Tao Qin, and Tie-Yan Liu. Distributional Reinforcement Learning for Multi-Dimensional Reward Functions. In Advances in Neural Information Processing Systems, 2021.
Appendix A In-Depth Summary of Related Work
In Sections 1 and 2, we provided a high-level synopsis of the state of existing work in multivariate distributional RL. In this section, we elaborate further.
Analysis techniques. Our results in this paper mostly draw on the analysis of one-dimensional distributional RL algorithms such as categorical and quantile dynamic programming, as well as their temporal-difference learning counterparts [RBD+18, DRBM18, RMA+23, BDR23]. The proof techniques in these works are themselves related to contraction-based arguments for reinforcement learning with function approximation [Tsi94, BT96, TVR97].
Multivariate distributional RL algorithms. Several prior works have contributed algorithms for multivariate distributional reinforcement learning, along with empirical demonstrations and theoretical analysis, though as we note in the main paper, the approaches proposed in this paper are the first algorithms with strong theoretical guarantees and efficient tabular implementations. [FSMT19] propose a deep-learning-based approach in which generative adversarial networks are used to model multivariate return distributions, and use these predictions to inform exploration strategies. [ZCZ+21] propose the TD algorithm combining equally-weighted particle representations and an MMD loss that we recall in Equation (3). They demonstrate the effectiveness of this algorithm in combination with deep learning function approximators, though do not analyze it. [WUS23] propose a family of algorithms for multivariate distributional RL termed fitted likelihood evaluation. These methods mirror LSTD algorithms [BB96], iteratively minimising a batch objective function (in this case, a negative log-likelihood, NLL) over a growing dataset. [WUS23] demonstrate empirically that these algorithms are performant in low-dimensional settings, and provide theoretical analysis for FLE algorithms, assuming an oracle which can approximately optimise the NLL objective at each algorithm step. [SFS24] also propose a TD learning algorithm for one-dimensional distributional RL using categorical support and an MMD-based loss. They demonstrate strong performance of this algorithm in classic RL domains such as CartPole and Mountain Car, but do not analyze the algorithm. Our analysis in this paper suggests our novel relaxation to mass-1 signed measures may be crucial to obtaining a straightforwardly analyzable TD algorithm.
Finally, the concurrent work of [LK24] studied distributional Bellman operators for Banach-space-valued reward functions. Their work focuses on analyzing how well the fixed point of a distributional finite-dimensional multivariate Bellman equation can approximate the fixed point of a distributional Banach-space-valued Bellman equation. In contrast, our work only studies finite-dimensional reward functions, but provides explicit convergence rates and approximation bounds when the distribution representations are finite dimensional, unlike [LK24]. Moreover, [LK24] considers a similar algorithm to that discussed in Theorem 4 but for categorical representations, though its convergence is not proved. Furthermore, [LK24] did not prove convergence of any TD-learning algorithms, although they did propose some TD-learning algorithms which achieved interesting results in simulation.
Appendix B Proofs
B.1 Multivariate Distributional Dynamic Programming: Section 3
In this section, we will state some lemmas building up to the proof of Theorem 3. These lemmas generalize corresponding results of [NGV20] that were specific to the scalar reward setting. We begin with a lemma that demonstrates a notion of convexity for the MMD induced by a conditional positive definite kernel.
Lemma 1.
Let and be collections of probability measures indexed by an index set . Suppose . Then for any characteristic kernel , the following holds,
Proof.
It is known from [Sch00] that characteristic kernels generate RKHSs into which probability measures can be embedded. As such, it holds that
where is the norm in the Hilbert space and is the mean embedding of – that is, the unique element of such that for every , and where is the inner product in .
Let and define analogously. We claim that . To see this, let , and observe that
where the third step invokes Fubini’s theorem, and the penultimate step leverages the linearity of the inner product. Notably, acts as a linear operator on mean embeddings. As a result, we see that
where the penultimate inequality is due to Jensen’s inequality, and the final inequality holds since upper bounds the integrand, and the integral is a monotone operator. ∎
Next, we show how the MMDs induced by the kernels hypothesized in Theorem 3 behave under affine transformations of random variables.
Lemma 2.
Let be a kernel induced by a semimetric of strong negative type defined over a compact subset that is both shift invariant and -homogeneous (cf. Theorem 3). Then for any and ,
Proof.
It is known [GBR+12] that the MMD can be expressed in terms of expected kernel evaluations, according to
where the last step invokes the definition of a kernel induced by a semimetric, and the linearity of expectation. Then, defining as independent samples from and as independent samples from , we have
where the second step is a change of variables, the third step invokes the shift invariance of , and the fourth step invokes the homogeneity of .
Thus, it follows that . ∎
We are now ready to prove the main result of this section.
Proof.
We begin by showing that the distributional Bellman operator is contractive in . We have
We apply Lemma 1 with and , yielding
Next, invoking Lemma 2 with the shift-invariance and -homogeneity of , we have
It follows that , since is a fixed point of . Continuing, we see that . Since is a metric on for any characteristic kernel , it follows that approaches at a geometric rate. ∎
B.2 EWP Dynamic Programming: Section 4
In this section, we provide the proof of Theorem 4. To do so, we prove an abstract, general result, regarding any projection mapping that approximates the argmin MMD projection given in Equation (4).
Proof.
Let . Then we have
where the first step invokes the identity that is the fixed point of (which is well-defined by Theorem 3), the second step leverages the triangle inequality, and the third step follows by the definition of and the contractivity of established in Theorem 3. Unrolling the recurrence above, we have
As such, as , we have that
proving our claim. ∎
Proposition 4, despite its simplicity, reveals an interesting fact: given a good enough approximate MMD projection in the sense that decays quickly with , the dynamic programming iterates will eventually be contained in a (arbitrarily) small neighborhood of . The next result provides an example consequence of this abstract result, and establishes that is enough for convergence to an arbitrarily small set with projected distributional dynamic programming under the EWP representation.
Finally, we can now prove Theorem 4.
Proof.
For each and , denote by the event given by
for a constant to be chosen shortly. Moreover, with , it holds that under , for all . Following the proof of Proposition 4, we have that, conditioned on ,
Now, by [TSM17, Proposition A.1], event holds with probability at least , since each is generated independently by sampling independent draws from the distribution . Therefore, event holds with probability at least . Choosing , we have that, with probability at least ,
As such, there exist universal constants such that
(14)
∎
Proof.
Proposition 4 shows that projected EWP dynamic programming converges to a set with radius controlled by the quantity that upper bounds the distance between and at the worst state . In the proof of Theorem 4, we saw that with nonzero probability, the randomized projections satisfy . Thus, there exists a projection satisfying this bound. Since the projection is, by definition, the projection with the smallest possible , the claim follows directly by Proposition 4. ∎
B.3 Categorical Dynamic Programming: Section 5
Proof.
Firstly, note that is a bounded, finite-dimensional subspace for each . Thus, is compact, and by the continuity of the MMD, the infimum in (7) is attained.
Following the technique of [SZS+08], we establish that can be computed as the solution to a particular quadratic program with convex constraints.
Let denote a matrix where . Since is a positive definite kernel when is characteristic [GBR+12], it follows that is a positive definite matrix. Then, for any , we have
where is independent of , so it does not influence the minimization. Now, since is positive definite (by virtue of being characteristic) and is a closed convex subset of , it is well-known that there is a unique optimum, and the infimum above is attained for some . Therefore, is indeed well-defined, and its range is contained in , confirming the first two claims. Finally, since is well-defined and since is nonnegative and separates points, must map elements of to themselves – this is because for the kernels we consider. ∎
Proof.
Fix any and denote , where is the RKHS induced by the kernel and denotes the mean embedding of in this RKHS. It is simple to verify that is linear: for any and , for all with we have
which implies that . As a consequence, inherits convexity from .
We claim that is closed as a subset of . Since is an injection [GBR+12], by Lemma 7, since there is a unique minimizing , there is a unique attaining the infimum over . Let . Then there exists such that , and since , it follows that . Since this is true for any , it follows that is open, so is closed.
We will now show that is a nonexpansion in . Let , and denote by the mean embeddings of . We slightly abuse notation and write to denote the mean embedding of . Since is convex, for any and we have
Now, by expanding the squared norms and taking , since is closed we have
where the second inequality follows by applying the same logic to . Choosing and adding these inequalities yields
Expanding, we see that
confirming that is a non-expansion. It follows that
∎
Proof.
Combining Theorem 3 and Lemma 1, we see that
for some . Thus, is a contraction on . If is the RKHS induced by , we showed in Lemma 1 that is isometric to a product of closed, convex subsets of . Therefore, by the completeness of , is isometric to a complete subspace, and consequently is a complete subspace under the metric . Invoking the Banach fixed-point theorem, it follows that has a unique fixed point , and geometrically. ∎
B.3.1 Quality of the Categorical Fixed Point
As we saw in our analysis of multivariate DRL with EWP representations, the distance between a distribution and its projection (as a function of ) plays a major role in controlling the approximation error in projected distributional dynamic programming. Before proving the main results of this section, we begin by analyzing this quantity by reducing it to the largest distance between points among certain partitions of the space of returns.
Proof.
Our proof proceeds by establishing approximation bounds of Riemann sums to the Bochner integral , similar to [VN02]. Let . Abusing notation to denote by the probability of the th atom of the discrete support under , we have
where for some . Since optimizes the MMD over all probability vectors in , for with for the th element of , we have
It was shown by [SSGF13] that is an isometry from to , where is the RKHS induced by . Thus, we have
Since this is true for any partition , the claim follows by taking the infimum over . ∎
We now move on to the main results.
Proof.
The proof begins in a similar manner to [RBD+18, Proposition 3]. Given that is a nonexpansion as shown in Lemma 1, we have
where leverages the fact that is the fixed point of and that is the fixed point of , follows since is a nonexpansion by Lemma 1, and follows by the contractivity of established in Theorem 3. Finally, by Lemma 1, we have
∎
Finally, we explicitly derive a convergence rate for a particular support map under the energy distance kernels.
Proof.
We begin bounding . Assume for some . We consider a partition consisting of -dimensional hypercubes with side length . By definition of , it is clear that these hypercubes cover and each contain exactly one support point. Now, for each , we have
where is the vector of all ones, and is any element in . Expanding, we have
Since this bound holds for any , invoking Theorem 1 yields
If instead , we omit all but of the support points, and achieve
Alternatively, we may write
∎
B.4 Categorical TD Learning: Section 6
In this section, we prove results leading up to and including the convergence of the categorical TD-learning algorithm over mass-1 signed measures. First, in Section B.4.1, we show that is in fact a metric on the space of mass-1 signed measures, and establish that the multivariate distributional Bellman operator is contractive under these distribution representations. Subsequently, in Section B.4.2, we analyze the temporal difference learning algorithm leveraging the results from Section B.4.1.
B.4.1 The Signed Measure Relaxation
We begin by establishing that the MMD is a metric on mass-1 signed measures for spaces on which it is a metric on probability measures.
Proof.
Naturally, since is given by a norm, it is non-negative, symmetric, and satisfies the triangle inequality. We must show that . Firstly, it is clear that by the positive homogeneity of the norm. It remains to show that .
Let . For the sake of contradiction, assume that . We will show that this implies that for a pair of distinct probability measures, which is a contradiction since with characteristic is known to be a metric on .
By the Hahn decomposition theorem, we may write for non-negative measures . Therefore, for some , we may express
where . Likewise, there exist and probability measure for which . Since by hypothesis, and by linearity of , we have
where and are convex combinations of probability measures, and are therefore probability measures themselves. So, we have that
which contradicts our hypothesis. Therefore, for any , and it follows that is a metric. ∎
Next, we show that the distributional Bellman operator is contractive on the space of mass-1 signed measure return distribution representations.
Proof.
Indeed, is, in a sense, a simpler operator than . Since is an affine subspace of , it holds that is simply a Hilbertian projection, which is known to be affine and a nonexpansion [Lax02]. Moreover, since acts identically on as it does on , it immediately follows that is a -contraction on , inheriting the result from Theorem 3. Thus, we have that for any ,
confirming that the projected operator is a contraction. Since is a metric on for each , it follows that is a metric on . The existence and uniqueness of the fixed point follows by the Banach fixed point theorem. ∎
Finally, we show that the fixed point of distributional dynamic programming with signed measure representations is roughly as accurate as .
Proof.
Since is a nonexpansion in by Lemma 6, following the procedure of Theorem 1, we have
Note that identifies the closest point (in ) to in and identifies the closest point to in . Since it is clear that , it follows that
The first statement then directly follows since it was shown in Lemma 1 that .
To prove the second statement, we apply the triangle inequality to yield
where the second step leverages the fact that is a nonexpansion in by Lemma 1. Applying the conclusion of the first statement as well as the bound on , we have
∎
B.4.2 Convergence of Categorical TD Learning
Convergence of the proposed categorical TD-learning algorithm will rely on studying the iterates through an isometry to an affine subspace of . This affine subspace is that consisting of the set of state-conditioned “signed probability vectors”. We define as an affine subspace of for any according to
(15) |
We note that any element of can be encoded in by expressing as the sequence of signed masses associated with each atom of . In Lemma B.4.2, we exhibit an isometry between and .
lemmacatisometry Let be a characteristic kernel. There exists an affine isometric isomorphism between and an affine subspace (cf. (15)).
Proof.
Since is characteristic, it is positive definite [GBR+12]. Thus, for each , define according to
Then each is positive definite since is a positive definite kernel. Let , and define and such that and . Then, we have
Since is positive definite, is a norm on . Therefore, the map given by is a linear isometric isomorphism onto the affine subspace of with entries summing to , which we denote . Moreover, since is a finite dimensional Hilbert space, it is well known that there exists a linear isometric isomorphism with the usual norm. Thus, is a linear isometric isomorphism. Consequently, it follows that given by is a linear isometric isomorphism, where . ∎
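The finite-dimensional isometry can also be checked numerically. The following sketch (with a Gaussian kernel and randomly placed atoms, both purely illustrative) verifies that one concrete choice of the linear isometry, the transpose of the Cholesky factor of the Gram matrix, maps the MMD-induced distance between signed mass vectors to the ordinary Euclidean distance.

```python
import numpy as np

rng = np.random.default_rng(0)
atoms = rng.normal(size=(5, 2))            # five fixed atoms in R^2 (illustrative)

def gaussian_kernel(x, y, sigma=1.0):
    # A characteristic kernel, so the Gram matrix below is positive definite.
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

K = np.array([[gaussian_kernel(zi, zj) for zj in atoms] for zi in atoms])
L = np.linalg.cholesky(K)                  # K = L L^T

# Two signed mass vectors, each summing to one.
p = np.array([0.5, 0.4, 0.3, -0.1, -0.1])
q = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

mmd = np.sqrt((p - q) @ K @ (p - q))       # MMD between the two signed measures
euclidean = np.linalg.norm(L.T @ (p - q))  # distance between the embedded vectors
print(np.isclose(mmd, euclidean))          # True: the embedding is an isometry
```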
lemmadpisometry Define the operator by , where is the isometry of Lemma B.4.2. Let be given by , where are the dynamic programming iterates . Then . Moreover, is contractive whenever is, and in this case, , where is the unique fixed point of .
Proof.
By definition, we have
which proves the first claim. Moreover, for and , we have
where the second step transforms the to since is an isometry between those metric spaces. Therefore, if is contractive with contraction factor , we have
so that has the same contraction factor as . Consequently, by the Banach fixed point theorem, has a unique fixed point whenever is contractive, and the iterates converge to it at the same rate as . ∎
The main theorem in this section is that temporal difference learning on the finite dimensional representations converges.
[Convergence of categorical temporal difference learning]propositioncattdisomconvergence Let be given by , and let be a kernel satisfying the conditions of Theorem 3. Suppose that, for each , the states and step sizes satisfy the Robbins-Monro conditions
Then, with probability 1, , where is the fixed point of .
The proof of this result is a natural extension of the convergence analysis of Categorical TD Learning given in [BDR23] to the multivariate return setting under the supremal MMD metric. The analysis hinges on the following general lemma.
Lemma 3 ([BDR23, Theorem 6.9]).
Let be a contractive operator with respect to with fixed point , and let be a filtered probability space. Define a map such that
(16) |
For a stochastic process adapted to with , consider a sequence given by
(17) |
where is adapted to and satisfies the Robbins-Monro conditions for each ,
Finally, for the processes where , assume the following moment condition holds,
(18) |
for finite universal constants . Then, with probability 1, .
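To illustrate the mechanism behind Lemma 3, the following toy sketch (all constants illustrative, and not taken from the paper) runs a stochastic-approximation recursion of this kind for a simple scalar contraction with unbiased noisy updates and Robbins-Monro step sizes; the iterates approach the fixed point.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, b = 0.5, 1.0                  # O(x) = gamma * x + b is a gamma-contraction
fixed_point = b / (1 - gamma)        # = 2.0

x = 0.0
for t in range(200_000):
    noisy_backup = gamma * x + b + rng.normal(scale=0.5)  # unbiased: E[.] = O(x)
    alpha = 1.0 / (t + 1)                                  # Robbins-Monro step sizes
    x = (1 - alpha) * x + alpha * noisy_backup
print(x, fixed_point)                # x ends up close to 2.0
```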
The operator of Lemma 3 will be substituted with , governing the dynamics of the encoded iterates of the multi-return distribution. The stochastic operator will be substituted with the corresponding stochastic TD operator for , given by
(19) |
This corresponds to applying a Bellman backup from a stochastic reward and next state , followed by projecting back onto , and applying the isometry back to .
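For intuition, the sketch below carries out one such stochastic update at a single sampled transition, shown with one-dimensional returns; the atoms, masses, reward, discount, and step size are all illustrative rather than taken from the paper. The next-state signed categorical estimate is pushed through the sampled Bellman backup, projected back onto the fixed atoms over mass-1 signed measures, and mixed into the current estimate.

```python
import numpy as np

def energy_kernel(x, y):
    return abs(x) + abs(y) - abs(x - y)

def signed_projection(atoms, target_atoms, target_masses):
    # MMD projection onto mass-1 signed measures on the fixed atoms (KKT system).
    m = len(atoms)
    K = np.array([[energy_kernel(zi, zj) for zj in atoms] for zi in atoms])
    b = np.array([sum(w * energy_kernel(zi, y) for y, w in zip(target_atoms, target_masses))
                  for zi in atoms])
    A = np.block([[K, np.ones((m, 1))], [np.ones((1, m)), np.zeros((1, 1))]])
    return np.linalg.solve(A, np.append(b, 1.0))[:m]

atoms = np.array([0.5, 1.5, 2.5, 3.5])     # fixed categorical support (illustrative)
p_x = np.array([0.25, 0.25, 0.25, 0.25])   # current signed masses at the updated state
p_next = np.array([0.1, 0.4, 0.4, 0.1])    # current signed masses at the sampled next state
r, gamma, alpha = 0.5, 0.9, 0.1            # sampled reward, discount, step size

# Bootstrap target: push the next-state atoms through y -> r + gamma * y,
# then project back onto the fixed atom grid as a mass-1 signed measure.
target = signed_projection(atoms, r + gamma * atoms, p_next)
p_x = (1 - alpha) * p_x + alpha * target   # stochastic mixing step
print(p_x, p_x.sum())                      # the updated masses still sum to one
```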
Proof of Proposition B.4.2.
Let . Note that for any , can be isometrically embedded into by zero-padding. Thus, can be isometrically embedded into , so without loss of generality, we will assume that .
Define the map according to
(20) |
Then, defining with as in (11), we have
Note that, since is a Hilbert projection onto an affine subspace, it is affine [Lax02]. Consequently, is an unbiased estimator of the operator in the following sense,
where the first step invokes the linearity of , the second step invokes the linearity of the isometry established in Lemma B.4.2, and the third step is due to the definition of . As a result, we see that the conditions (16) and (17) of Lemma 3 are satisfied by , the iterates , and the step sizes . Moreover, for defined by
*
Proof.
By Proposition B.4.2, the sequence with converges to a unique fixed point with probability 1. Note that
Therefore, is a fixed point of . Since it was shown in Lemma 6 that has a unique fixed point, it follows that . Since is an isometry, with probability 1, so indeed with probability 1. ∎
Lemma 4.
Proof.
Since is an isometry, we have that
where is the RKHS induced by the kernel . Moreover, since is a nonexpansion in as argued in Lemma 6, we have that
Proceeding, we will bound the terms . To bound , we simply have
where we invoke the contraction of in from Theorem 3. Note that , so it follows that for some constant , since the kernel is bounded on compact domains. Expanding the norm of the difference above yields
for a finite constant , again invoking the isometry in the last step.
Our bound for is similar. Choose any . We consider the operator given by
This operator is a contraction in , and correspondingly has a fixed point . To see this, we note that is simply a special case of for the case , so the contractivity and existence of the fixed point are inherited from Theorem 3. Proceeding in a manner similar to the bound on , we have
where the final step mirrors the bound on . Therefore, we have shown that
completing the proof. ∎
Appendix C Memory Efficiency of Randomized EWP Dynamic Programming
In Section 4, we argued for the necessity of considering a projection operator in EWP dynamic programming. While we provided a randomized projection, Theorem 4 requires that we apply only a finite number of DP iterations. Thus, one might ask whether, given that we apply only finitely many iterations, naive unprojected EWP dynamic programming can produce accurate enough approximations of without costing too much in memory.
In this section, we demonstrate that, in fact, the algorithm described in Theorem 4 can approximate to any desired accuracy with many fewer particles. Suppose our goal is to derive some such that
for some . We will derive bounds on the number of particles required to attain such an approximation with unprojected EWP dynamic programming (denoting the number of particles ) as well as with our algorithm described in Theorem 4 (denoting the number of particles ). In both cases, we will compute iterates starting with some with . For simplicity, we will consider the energy distance kernel with .
The remainder of this section will show that the dependence of the number of atoms on both and is substantially worse in the unprojected case (that is, for large state spaces or low error tolerance). We demonstrate this with concrete lower bounds on and upper bounds on below; note that these bounds are not optimized for tightness or generality, and instead aim to provide straightforward evidence of our core points above.
We will begin by bounding . In the best case, is supported on particle for each . If any state can be reached from any other state in the MDP with non-zero probability, then applying the distributional Bellman operator to will result in having support on atoms at each state (due to the mixture over successor states in the Bellman backup). Consequently, the iterate will be supported on atoms. Since by Theorem 3, we require
to ensure that . Thus, we have
On the other hand, the following lemma bounds ; we prove the lemma at the end of this section.
lemmamprojbound Let denote the output of the projected EWP algorithm described by Theorem 4 with particles. Then under the assumptions of Theorem 4 and with the energy distance kernel with , is achievable with
(21) |
For any fixed MDP with and , we have that
since and does not depend on . Meanwhile, we have by Lemma C, indicating a much more graceful dependence on relative to the unprojected algorithm.
On the other hand, for any fixed tolerance , we immediately have
In the worst case, we may have (any larger will induce linearly dependent cumulants). Thus, we have
so the projected algorithm scales much more gracefully with as well.
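As a back-of-the-envelope illustration of this gap (with purely illustrative numbers, not constants from the paper), the following sketch contrasts the per-state support growth of unprojected EWP dynamic programming under the full-reachability assumption above with a fixed particle budget of the kind used by the projected algorithm.

```python
# Illustrative comparison only; the constants below are not taken from the paper.
num_states = 50          # size of the state space
num_iterations = 20      # number of DP iterations performed
m_projected = 1_000      # fixed per-state particle budget of the projected algorithm

support = 1              # unprojected EWP: start from one particle per state
for _ in range(num_iterations):
    support *= num_states            # each backup mixes over all successor states
print(f"unprojected per-state support: {support:.3e}")
print(f"projected per-state support:   {m_projected}")
```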
Proof of Lemma C
Finally, we prove Lemma C, which determines the number of atoms required to achieve an -accurate return distribution estimate with the algorithm of Theorem 4.
*
Proof.
Note that, by Theorem 4, increasing can only decrease the error as long as . Therefore, as shown in (14) in the proof of Theorem 4, there exists a universal constant such that
(22) |
Now, we write , and , yielding
Then, after isolating the logarithmic term and exponentiating, we see that
We now rearrange this expression and invoke the identity where is a Lambert W-function [CGH+96]:
There are two branches of the Lambert W-function on the reals, namely and . These two branches satisfy when and when . In our case, we know that is negative, and it is known [CGH+96] that when . Consequently, when , we have , and substituting , we have
(23) |
On the other hand, when , we have
The upper bound given above will generally not be an integer. However, increasing can only improve the approximation error, as shown in Theorem 4, since decreases monotonically when . So, we can round up to the nearest integer (or round it down when ), incurring a penalty of at most one atom. It follows that the randomized EWP dynamic programming algorithm of Theorem 4 run with given by (21) produces a return distribution function for which . ∎
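As a generic numerical illustration of the Lambert-W manipulation used above (with an arbitrary constant, not the exact quantities of Lemma C), the following sketch solves an equation of the form m log m = c on the principal branch.

```python
import numpy as np
from scipy.special import lambertw

# Solve m * log(m) = c: writing u = log(m) gives u * exp(u) = c,
# hence u = W_0(c) and m = exp(W_0(c)) = c / W_0(c) on the principal branch.
c = 1.0e6
m = np.exp(lambertw(c, k=0).real)
print(m, m * np.log(m))   # the second value recovers c up to floating-point error
```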
Appendix D Nonlinearity of the Categorical MMD Projection
In Section 6, we noted that the categorical projection is non-affine. Here, we provide an explicit example certifying this phenomenon.
We consider a single-state MDP, since the nonlinearity issue is independent of the cardinality of the state space (the projection is applied to each state-conditioned distribution independently). We write , and consider the kernel induced by ; the resulting MMD is known as the energy distance, which is what we used in our experiments. We consider two distributions, and .
[Table 1: probabilities assigned to each support point by the two projections.]
We consider and compare with , and we note that , confirming that is not an affine map. The results are tabulated in Table 1, with bolded entries depicting the atoms with non-negligible differences in probability under .
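For readers who wish to reproduce such a certificate numerically, the following sketch solves the probability-constrained categorical MMD projection as a small QP with SciPy. The atoms and the two point-mass distributions are illustrative choices in dimension two (they are not the distributions of Table 1); for these inputs, the projection of the mixture visibly differs from the mixture of the projections.

```python
import numpy as np
from scipy.optimize import minimize

atoms = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # illustrative 2-d support

def energy_kernel(x, y):
    return np.linalg.norm(x) + np.linalg.norm(y) - np.linalg.norm(x - y)

K = np.array([[energy_kernel(a, b) for b in atoms] for a in atoms])

def embed(point):
    # b_i = k(z_i, y) for a point-mass target at y.
    return np.array([energy_kernel(a, point) for a in atoms])

def probability_projection(b):
    # Minimize p^T K p - 2 p^T b over the probability simplex (a small QP).
    res = minimize(lambda p: p @ K @ p - 2 * p @ b,
                   x0=np.full(3, 1 / 3), method="SLSQP",
                   bounds=[(0.0, None)] * 3,
                   constraints={"type": "eq", "fun": lambda p: p.sum() - 1.0})
    return res.x

mu1, mu2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
mix_of_projections = 0.5 * probability_projection(embed(mu1)) \
                   + 0.5 * probability_projection(embed(mu2))
projection_of_mix = probability_projection(0.5 * embed(mu1) + 0.5 * embed(mu2))
print(np.round(mix_of_projections, 3))  # approx. [0.5   0.25  0.25 ]
print(np.round(projection_of_mix, 3))   # approx. [0.453 0.273 0.273], so the
                                        # probability-constrained projection is not affine
```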
Appendix E Experiment Details
TD-learning experiments were conducted on an NVIDIA A100 80GB GPU, which allowed runs to be parallelized. Methods were implemented in JAX [BFH+18], using JAXopt [BBC+21] to vectorize the QP solutions needed for the categorical projections discussed in this work. SGD was used for optimization, with an annealed learning rate schedule with , which satisfies the conditions of Lemma 3. Experiments with constant learning rates yielded similar results, but were less stable; this indicates that the choice of learning rate schedule did not impede learning.
The dynamic programming experiments were implemented in the Julia programming language [BEKS17].
In all experiments, we used the kernel induced by with reference point for MMD optimization—this corresponds to the energy distance, and satisfies the requisite assumptions for convergent multivariate distributional dynamic programming outlined in Theorem 3.
Appendix F Neural Multivariate Distributional TD-Learning
[Figure 5: an example pixel observation from the parking environment.]
For the sake of illustration, in this section, we demonstrate that the signed categorical TD learning algorithm presented in Section 6 can be scaled to continuous state spaces with neural networks. We consider an environment with visual (pixel) observations of a car in a parking lot; an example observation is shown in Figure 5.
Here, we consider 2-dimensional cumulants, where the first dimension tracks the coordinate of the car, and the second dimension is an indicator that is if and only if the car is parked in the parking spot. We learn a multivariate return distribution function with transitions sampled from trajectories that navigate around the obstacle to the parking spot. Notably, the successor features (expectation of multivariate return distribution) will be zero in the first dimension, since the set of trajectories is horizontally symmetric. Thus, from the successor features alone, one cannot distinguish the observed policy from one that traverses straight through the obstacle!
Fortunately, when modeling a distribution over multivariate returns, we should see that the support of the multivariate return distribution does not include points with vanishing first dimension.
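As a concrete (hypothetical) instance of this cumulant, one could write the two-dimensional signal as follows; the dictionary-valued state and its fields are illustrative and not part of any particular environment API.

```python
import numpy as np

def cumulant(state: dict) -> np.ndarray:
    # First coordinate: the car's lateral position; second: parking indicator.
    return np.array([state["x"], 1.0 if state["parked"] else 0.0])

print(cumulant({"x": -0.3, "parked": False}))  # roughly [-0.3, 0.0]
print(cumulant({"x": 0.0, "parked": True}))    # roughly [0.0, 1.0]
```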
[Figure 6: convolutional architecture used to predict the signed masses of the categorical representation.]
To learn the multivariate return distribution function from images, we use a convolutional neural architecture as shown in Figure 6.
Notably, we simply use convolutional networks to model the signed masses for the fixed atoms of the categorical representation. The projection is computed by a QP solver as discussed in Section 5, and is applied only to the target distributions (thus we do not backpropagate through it).
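To make the training signal concrete, the following sketch (with illustrative atoms and mass vectors, not the actual network outputs) computes a squared-MMD loss between predicted signed masses and a projected bootstrap target that is treated as a constant; in a JAX implementation, the target would be wrapped in jax.lax.stop_gradient so that no gradient flows through the projection.

```python
import numpy as np

def energy_kernel(x, y):
    return np.linalg.norm(x) + np.linalg.norm(y) - np.linalg.norm(x - y)

# Fixed 2-d atom grid for the categorical representation (illustrative).
grid = np.linspace(-1.0, 1.0, 5)
atoms = np.array([[i, j] for i in grid for j in grid])
K = np.array([[energy_kernel(a, b) for b in atoms] for a in atoms])

def squared_mmd_loss(p_pred, p_target):
    # Squared MMD between two signed categorical measures on the same atoms.
    diff = p_pred - p_target
    return diff @ K @ diff

p_pred = np.full(len(atoms), 1 / len(atoms))   # stand-in for the network output
p_target = np.zeros(len(atoms))
p_target[7] = 1.0                              # stand-in for the projected TD target
print(squared_mmd_loss(p_pred, p_target))      # nonnegative scalar training loss
```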
We compared the multi-return distributions learned by our signed categorical TD method with those of [ZCZ+21]. Our results are shown in Figure 7. We see that both TD-learning methods accurately estimate the distribution over multivariate returns, indicating that no multivariate return will have a vanishing lateral component. Qualitatively, however, the EWP algorithm appears to be stuck in a local optimum, with some particles lying in regions of low probability mass.
Moreover, on the right side of Figure 7, we show predicted return distributions for two randomly sampled reward vectors, and quantitatively evaluate the two methods. The leftmost reward vector incentivizes the agent to take paths that conservatively avoid the obstacle on the left. The rightmost reward vector incentivizes the agent to reach the parking spot as quickly as possible. We see that the EWP TD-learning algorithm of [ZCZ+21] more accurately estimates the return distribution function corresponding to the latter reward vector, while our signed categorical TD algorithm more accurately estimates the return distribution function corresponding to the former. In both cases, however, both methods produce accurate estimates.
[Figure 7: multi-return distributions learned by signed categorical TD and the EWP method of [ZCZ+21], together with predicted return distributions for two sampled reward vectors.]