
Appendix

Proof of Lemma \ref{lemma:nashcov-bound}

Proof.
\begin{align}
\delta_{i}^{A}(\pi) &= \max_{\pi_{i}^{\prime}}\left(\pi_{i}^{\prime T}A\pi_{-i}\right)-\pi_{i}^{T}A\pi_{-i} \tag{1}\\
&= \max_{\pi_{i}^{\prime}}\left(\pi_{i}^{\prime T}(\hat{A}+E)\pi_{-i}\right)-\pi_{i}^{T}(\hat{A}+E)\pi_{-i} \nonumber\\
&\leq \left\{\max_{\pi_{i}^{\prime}}\left(\pi_{i}^{\prime T}\hat{A}\pi_{-i}\right)-\pi_{i}^{T}\hat{A}\pi_{-i}\right\}+\left\{\max_{\pi_{i}^{\prime}}\left(\pi_{i}^{\prime T}E\pi_{-i}\right)-\pi_{i}^{T}E\pi_{-i}\right\} \nonumber\\
&= \delta_{i}^{\hat{A}}(\pi)+\delta_{i}^{E}(\pi) \nonumber\\
&\leq \delta_{i}^{\hat{A}}(\pi)+2\|E\|_{\infty} \nonumber
\end{align}
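The final inequality holds because each policy is a probability distribution; assuming $\|E\|_{\infty}$ here denotes the maximum absolute entry of $E$, a short justification is

\begin{equation*}
\max_{\pi_{i}^{\prime}}\left(\pi_{i}^{\prime T}E\pi_{-i}\right)\leq\max_{j,k}|E_{jk}|=\|E\|_{\infty},
\qquad
-\pi_{i}^{T}E\pi_{-i}\leq\|E\|_{\infty},
\end{equation*}

since any bilinear form $\pi_{i}^{T}E\pi_{-i}$ is a convex combination of the entries of $E$.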

Consequently,

\begin{equation}
\textsc{NashConv}_{A}(\pi)\leq\sum_{i}\left[\delta^{i}_{\hat{A}^{i}}(\pi)+2\|E^{i}\|_{\infty}\right]. \tag{2}
\end{equation}
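The step from the per-player bound (1) to (2) uses the definition of $\textsc{NashConv}$ as the sum of per-player deviation incentives, where $A^{i}$ and $E^{i}$ denote player $i$'s payoff and error matrices (so $A^{i}=\hat{A}^{i}+E^{i}$, matching the decomposition above):

\begin{equation*}
\textsc{NashConv}_{A}(\pi)=\sum_{i}\delta^{i}_{A^{i}}(\pi)\leq\sum_{i}\left[\delta^{i}_{\hat{A}^{i}}(\pi)+2\|E^{i}\|_{\infty}\right].
\end{equation*}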

Proof of Lemma \ref{lemma:approx-dev-incentive}

Proof.

Suppose that in a surrogate approximate game, player $i$ has deviation incentive $\delta^{i}_{s}$. If we allow player $i$ the policy space afforded by the full approximate game, their deviation incentive $\delta^{i}$ is unchanged.

\begin{equation}
\delta^{i}_{s}=\max_{\sigma^{i}_{s}\in\Sigma_{s}}\left[\hat{U}^{i}(\sigma^{i}_{s},\pi^{-i})\right]-\hat{U}(\pi^{i},\pi^{-i}) \tag{3}
\end{equation}
\begin{equation}
\hat{U}^{i}(\sigma^{i},\pi^{-i})=\sum_{\sigma^{-i}\in\Sigma^{-i}}\hat{U}(\sigma^{i},\sigma^{-i})\,\pi^{-i}(\sigma^{-i}) \tag{4}
\end{equation}
\begin{equation}
\delta^{i}=\max_{\sigma^{i}\in\Sigma}\left[\hat{U}^{i}(\sigma^{i},\pi^{-i})\right]-\hat{U}(\pi^{i},\pi^{-i}) \tag{5}
\end{equation}

As a consequence of Lemma \ref{lemma:tree-equiv-utility}, we obtain the following relation:

\begin{align}
\delta^{i}-\delta^{i}_{s} &= \max_{\sigma^{i}\in\Sigma}\left[\hat{U}^{i}(\sigma^{i},\pi^{-i})\right]-\max_{\sigma^{i}_{s}\in\Sigma_{s}}\left[\hat{U}^{i}(\sigma^{i}_{s},\pi^{-i})\right] \tag{6}\\
&= \max_{\sigma^{i}\in\Sigma}\left[\hat{U}^{i}(\phi(\sigma^{i}),\pi^{-i})\right]-\max_{\sigma^{i}_{s}\in\Sigma_{s}}\left[\hat{U}^{i}(\sigma^{i}_{s},\pi^{-i})\right] \nonumber\\
&= \max_{\sigma^{i}_{s}\in\Sigma_{s}}\left[\hat{U}^{i}(\sigma^{i}_{s},\pi^{-i})\right]-\max_{\sigma^{i}_{s}\in\Sigma_{s}}\left[\hat{U}^{i}(\sigma^{i}_{s},\pi^{-i})\right] \nonumber\\
&= 0\,. \nonumber
\end{align}
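The second and third lines are equal provided the mapping $\phi$ from the referenced lemma maps the full policy space $\Sigma$ onto the surrogate policy space $\Sigma_{s}$ (an assumption on $\phi$ consistent with its use above), so that maximizing over the image of $\phi$ is the same as maximizing over $\Sigma_{s}$:

\begin{equation*}
\{\phi(\sigma^{i}):\sigma^{i}\in\Sigma\}=\Sigma_{s}
\quad\Longrightarrow\quad
\max_{\sigma^{i}\in\Sigma}\hat{U}^{i}(\phi(\sigma^{i}),\pi^{-i})=\max_{\sigma^{i}_{s}\in\Sigma_{s}}\hat{U}^{i}(\sigma^{i}_{s},\pi^{-i}).
\end{equation*}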

Therefore, $\delta^{i}=\delta^{i}_{s}$.

Proof of Theorem \ref{theorem:single-policy-error-bound}

Theorem 0.1.

(Theorem 3 of [lev2024simplifying]) Assume that the immediate state reward estimate is probabilistically bounded such that $P(|r_{j}^{i}-\tilde{r}_{j}^{i}|\geq\nu)\leq\delta_{r}(\nu,N_{r})$ for a number of reward samples $N_{r}$ and state sample $x_{i}^{j}$. Assume that $\delta_{r}(\nu,N_{r})\rightarrow 0$ as $N_{r}\rightarrow\infty$. For all policies $\pi$, $t=0,\dots,L$, and $a\in A$, the following bounds hold with probability at least $1-5(4C)^{D+1}\left(\exp(-C\cdot\acute{k}^{2})+\delta_{r}(\nu,N_{r})\right)$:

\begin{equation}
\left|Q_{\mathbf{P},t}^{\pi,\left[p_{Z}/q_{Z}\right]}\left(b_{t},a\right)-Q_{\mathbf{M}_{\mathbf{P}},t}^{\pi,\left[p_{Z}/q_{Z}\right]}\left(\bar{b}_{t},a\right)\right|\leq\alpha_{t}+\beta_{t}=\epsilon, \tag{7}
\end{equation}

where

\begin{equation}
\alpha_{t}=(1+\gamma)\lambda+\gamma\alpha_{t+1},\qquad\alpha_{L}=\lambda\geq 0, \tag{8}
\end{equation}
\begin{equation}
\beta_{t}=2\nu+\gamma\beta_{t+1},\qquad\beta_{L}=2\nu\geq 0, \tag{9}
\end{equation}
\begin{equation}
k_{\max}(\lambda,C)=\frac{\lambda}{4V_{\max}d_{\infty}^{\max}}-\frac{1}{\sqrt{C}}>0, \tag{10}
\end{equation}
\begin{equation}
\acute{k}=\min\left\{k_{\max},\,\lambda/4\sqrt{2}V_{\max}\right\}. \tag{11}
\end{equation}

While [lev2024simplifying] further generalize particle-belief MDP policy evaluation to simplified observation models, we assume that the observation model is known. This simplifies the Rényi divergence back to the definition provided in [lim2023optimality], where $\mathcal{P}^{d}$ is the target distribution and $\mathcal{Q}^{d}$ is the sampling distribution for particle importance sampling:

\begin{equation}
d_{\infty}(\mathcal{P}^{d}\|\mathcal{Q}^{d})=\operatorname*{esssup}_{x\sim\mathcal{Q}^{d}}w_{\mathcal{P}^{d}/\mathcal{Q}^{d}}(x)\leq d_{\infty}^{\max}<+\infty \tag{12}
\end{equation}
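As a simple illustration (assuming $w_{\mathcal{P}^{d}/\mathcal{Q}^{d}}(x)$ denotes the importance weight $\mathrm{d}\mathcal{P}^{d}/\mathrm{d}\mathcal{Q}^{d}(x)$), sampling directly from the target distribution gives the smallest possible bound:

\begin{equation*}
\mathcal{Q}^{d}=\mathcal{P}^{d}\;\Longrightarrow\;w_{\mathcal{P}^{d}/\mathcal{Q}^{d}}(x)=1\ \text{for all }x,\qquad d_{\infty}(\mathcal{P}^{d}\|\mathcal{Q}^{d})=1.
\end{equation*}

Since $w$ integrates to one under $\mathcal{Q}^{d}$, we have $d_{\infty}\geq 1$ in general, with equality exactly when the two distributions coincide almost everywhere.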

In order to extend $V_{\max}$ from Theorem 0.1 to the multiagent setting, we specify that $V_{\max,i}$ bounds the value for player $i$ via a finite geometric sum such that

\begin{equation}
V_{\max,i}=\frac{\max_{s,a}|R^{i}(s,a)|\left(1-\gamma^{D}\right)}{1-\gamma}\,. \tag{13}
\end{equation}
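This is the usual truncated geometric series bound on a discounted return; a short check, assuming rewards accumulate over a horizon of $D$ steps ($t=0,\dots,D-1$) with discount $\gamma$:

\begin{equation*}
\left|\sum_{t=0}^{D-1}\gamma^{t}R^{i}(s_{t},a_{t})\right|\leq\max_{s,a}|R^{i}(s,a)|\sum_{t=0}^{D-1}\gamma^{t}=\frac{\max_{s,a}|R^{i}(s,a)|\left(1-\gamma^{D}\right)}{1-\gamma}.
\end{equation*}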

Furthermore, we assume the reward to be a deterministic function of a given state and action. As such, we have $\delta_{r}(\nu,N_{r})=0$ for all $N_{r}>0$ and $\nu\geq 0$. Consequently, we can simply choose $\nu=0$ and $N_{r}=1$.

By the definition of $\beta_{t}$, we now have $\beta_{t}=0$ for all $t$.
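Explicitly, substituting $\nu=0$ into the recursion (9) and inducting backward from the terminal step gives

\begin{equation*}
\beta_{L}=2\nu=0,\qquad\beta_{t}=2\nu+\gamma\beta_{t+1}=0+\gamma\cdot 0=0\quad\text{for all }t<L.
\end{equation*}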

This reduces the concentration bound to

\begin{equation}
\left|Q_{\mathbf{P},t}^{\pi,\left[p_{Z}/q_{Z}\right]}\left(b_{t},a\right)-Q_{\mathbf{M}_{\mathbf{P}},t}^{\pi,\left[p_{Z}/q_{Z}\right]}\left(\bar{b}_{t},a\right)\right|\leq\alpha_{t}=\epsilon, \tag{14}
\end{equation}

where

\begin{equation}
\alpha_{t}=(1+\gamma)\lambda+\gamma\alpha_{t+1},\qquad\alpha_{D}=\lambda\geq 0. \tag{15}
\end{equation}

By noting the sequence of $\alpha_{0}^{D}$ for different horizons $D$,

\begin{align}
\alpha^{0}_{0} &= \lambda \tag{16}\\
\alpha^{1}_{0} &= \lambda(1+2\gamma) \nonumber\\
\alpha^{2}_{0} &= \lambda(1+2\gamma+2\gamma^{2}) \nonumber\\
&\quad\vdots \nonumber\\
\alpha^{D}_{0} &= \lambda\Bigl(1+2\sum_{t=1}^{D}\gamma^{t}\Bigr), \nonumber
\end{align}

we can establish a closed form via a finite geometric sum, where

\begin{equation}
\alpha_{0}=\lambda\left[2\left(\frac{1-\gamma^{D+1}}{1-\gamma}\right)-1\right], \tag{17}
\end{equation}

for a given problem horizon $D$.
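The closed form follows from rewriting the bracketed sum as a full geometric series:

\begin{equation*}
1+2\sum_{t=1}^{D}\gamma^{t}=2\sum_{t=0}^{D}\gamma^{t}-1=2\left(\frac{1-\gamma^{D+1}}{1-\gamma}\right)-1,
\end{equation*}

which matches (17). For illustration, with the hypothetical values $\gamma=0.9$ and $D=10$ this factor evaluates to approximately $12.72$, so $\alpha_{0}\approx 12.72\,\lambda$.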

We apply the established guarantee at the root belief, where

\begin{align}
U^{i}(\sigma) &= Q_{\mathbf{P},0}^{\sigma,\left[p_{Z}/q_{Z}\right]}\left(b_{0},\sigma(h_{0})\right) \tag{18}\\
\hat{U}^{i}(\sigma) &= Q_{\mathbf{M}_{\mathbf{P}},0}^{\sigma,\left[p_{Z}/q_{Z}\right]}\left(\bar{b}_{0},\sigma(h_{0})\right)\,, \nonumber
\end{align}

for a given player $i$.

For all policies $\sigma$, $t=0,\dots,D$, and $a\in A$, the following bound holds with probability at least $1-5(4C)^{D+1}\exp(-C\cdot\acute{k}_{i}^{2})$:

\begin{equation}
\left|U^{i}(\sigma)-\hat{U}^{i}(\sigma)\right|\leq\epsilon, \tag{19}
\end{equation}

where

\begin{equation}
\epsilon=\lambda\left[2\left(\frac{1-\gamma^{D+1}}{1-\gamma}\right)-1\right] \tag{20}
\end{equation}
\begin{equation}
k_{\max,i}(\lambda,C)=\frac{\lambda}{4V_{\max,i}d_{\infty}^{\max}}-\frac{1}{\sqrt{C}}>0, \tag{21}
\end{equation}
\begin{equation}
\acute{k}_{i}=\min\left\{k_{\max,i},\,\lambda/4\sqrt{2}V_{\max,i}\right\}. \tag{22}
\end{equation}
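If a target accuracy $\epsilon$ is fixed instead, (20) can be rearranged to obtain the corresponding $\lambda$ (a simple algebraic inversion, not part of the original theorem statement):

\begin{equation*}
\lambda=\frac{\epsilon(1-\gamma)}{2\left(1-\gamma^{D+1}\right)-(1-\gamma)}=\frac{\epsilon(1-\gamma)}{1+\gamma-2\gamma^{D+1}},
\end{equation*}

which can then be substituted into (21) and (22).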

For zero-sum games, we have $R^{i}(s,a)=-R^{-i}(s,a)$. As such, $|R^{i}(s,a)|=|R^{-i}(s,a)|$, which yields the following equivalences:

\begin{align}
V_{\max} &= V_{\max,i}=V_{\max,-i} \tag{23}\\
k_{\max} &= k_{\max,i}=k_{\max,-i} \nonumber\\
\acute{k} &= \acute{k}_{i}=\acute{k}_{-i} \nonumber
\end{align}