
Appendix

Proof of Lemma \ref{lemma:nashcov-bound}

Proof.
\begin{align}
\delta_{i}^{A}(\pi) &= \max_{\pi_{i}^{\prime}}\left(\pi_{i}^{\prime T}A\pi_{-i}\right)-\pi_{i}^{T}A\pi_{-i} \tag{1}\\
&= \max_{\pi_{i}^{\prime}}\left(\pi_{i}^{\prime T}(\hat{A}+E)\pi_{-i}\right)-\pi_{i}^{T}(\hat{A}+E)\pi_{-i} \nonumber\\
&\leq \left\{\max_{\pi_{i}^{\prime}}\left(\pi_{i}^{\prime T}\hat{A}\pi_{-i}\right)-\pi_{i}^{T}\hat{A}\pi_{-i}\right\}+\left\{\max_{\pi_{i}^{\prime}}\left(\pi_{i}^{\prime T}E\pi_{-i}\right)-\pi_{i}^{T}E\pi_{-i}\right\} \nonumber\\
&= \delta_{i}^{\hat{A}}(\pi)+\delta_{i}^{E}(\pi) \nonumber\\
&\leq \delta_{i}^{\hat{A}}(\pi)+2\|E\|_{\infty} \nonumber
\end{align}
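The final inequality holds because each policy is a probability distribution; assuming $\|E\|_{\infty}$ here denotes the maximum absolute entry of $E$, a short justification is

\begin{equation*}
\max_{\pi_{i}^{\prime}}\left(\pi_{i}^{\prime T}E\pi_{-i}\right)\leq\max_{j,k}|E_{jk}|=\|E\|_{\infty},
\qquad
-\pi_{i}^{T}E\pi_{-i}\leq\|E\|_{\infty},
\end{equation*}

since any bilinear form $\pi_{i}^{T}E\pi_{-i}$ is a convex combination of the entries of $E$.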

Consequently,

\begin{equation}
\textsc{NashConv}_{A}(\pi)\leq\sum_{i}\left[\delta^{i}_{\hat{A}^{i}}(\pi)+2\|E^{i}\|_{\infty}\right]. \tag{2}
\end{equation}
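The step from the per-player bound (1) to (2) uses the definition of $\textsc{NashConv}$ as the sum of per-player deviation incentives, where $A^{i}$ and $E^{i}$ denote player $i$'s payoff and error matrices (so $A^{i}=\hat{A}^{i}+E^{i}$, matching the decomposition above):

\begin{equation*}
\textsc{NashConv}_{A}(\pi)=\sum_{i}\delta^{i}_{A^{i}}(\pi)\leq\sum_{i}\left[\delta^{i}_{\hat{A}^{i}}(\pi)+2\|E^{i}\|_{\infty}\right].
\end{equation*}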

Proof of Lemma \ref{lemma:approx-dev-incentive}

Proof.

Suppose that in a surrogate approximate game, player $i$ has deviation incentive $\delta^{i}_{s}$. If we allow player $i$ the policy space afforded by the full approximate game, their deviation incentive $\delta^{i}$ is unchanged.

\begin{equation}
\delta^{i}_{s}=\max_{\sigma^{i}_{s}\in\Sigma_{s}}\left[\hat{U}^{i}(\sigma^{i}_{s},\pi^{-i})\right]-\hat{U}(\pi^{i},\pi^{-i}) \tag{3}
\end{equation}
\begin{equation}
\hat{U}^{i}(\sigma^{i},\pi^{-i})=\sum_{\sigma^{-i}\in\Sigma^{-i}}\hat{U}(\sigma^{i},\sigma^{-i})\,\pi^{-i}(\sigma^{-i}) \tag{4}
\end{equation}
\begin{equation}
\delta^{i}=\max_{\sigma^{i}\in\Sigma}\left[\hat{U}^{i}(\sigma^{i},\pi^{-i})\right]-\hat{U}(\pi^{i},\pi^{-i}) \tag{5}
\end{equation}

As a consequence of Lemma \ref{lemma:tree-equiv-utility}, we obtain the following relation:

\begin{align}
\delta^{i}-\delta^{i}_{s} &= \max_{\sigma^{i}\in\Sigma}\left[\hat{U}^{i}(\sigma^{i},\pi^{-i})\right]-\max_{\sigma^{i}_{s}\in\Sigma_{s}}\left[\hat{U}^{i}(\sigma^{i}_{s},\pi^{-i})\right] \tag{6}\\
&= \max_{\sigma^{i}\in\Sigma}\left[\hat{U}^{i}(\phi(\sigma^{i}),\pi^{-i})\right]-\max_{\sigma^{i}_{s}\in\Sigma_{s}}\left[\hat{U}^{i}(\sigma^{i}_{s},\pi^{-i})\right] \nonumber\\
&= \max_{\sigma^{i}_{s}\in\Sigma_{s}}\left[\hat{U}^{i}(\sigma^{i}_{s},\pi^{-i})\right]-\max_{\sigma^{i}_{s}\in\Sigma_{s}}\left[\hat{U}^{i}(\sigma^{i}_{s},\pi^{-i})\right] \nonumber\\
&= 0\,. \nonumber
\end{align}
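The second and third lines are equal provided the mapping $\phi$ from the referenced lemma maps the full policy space $\Sigma$ onto the surrogate policy space $\Sigma_{s}$ (an assumption on $\phi$ consistent with its use above), so that maximizing over the image of $\phi$ is the same as maximizing over $\Sigma_{s}$:

\begin{equation*}
\{\phi(\sigma^{i}):\sigma^{i}\in\Sigma\}=\Sigma_{s}
\quad\Longrightarrow\quad
\max_{\sigma^{i}\in\Sigma}\hat{U}^{i}(\phi(\sigma^{i}),\pi^{-i})=\max_{\sigma^{i}_{s}\in\Sigma_{s}}\hat{U}^{i}(\sigma^{i}_{s},\pi^{-i}).
\end{equation*}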

Therefore, $\delta^{i}=\delta^{i}_{s}$.

Proof of Theorem \ref{theorem:single-policy-error-bound}

Theorem 0.1.

(Theorem 3 of [lev2024simplifying]) Assume that the immediate state reward estimate is probabilistically bounded such that $P(|r_{j}^{i}-\tilde{r}_{j}^{i}|\geq\nu)\leq\delta_{r}(\nu,N_{r})$ for a number of reward samples $N_{r}$ and state sample $x_{i}^{j}$. Assume that $\delta_{r}(\nu,N_{r})\rightarrow 0$ as $N_{r}\rightarrow\infty$. For all policies $\pi$, $t=0,\dots,L$, and $a\in A$, the following bounds hold with probability at least $1-5(4C)^{D+1}\left(\exp(-C\cdot\acute{k}^{2})+\delta_{r}(\nu,N_{r})\right)$:

\begin{equation}
\left|Q_{\mathbf{P},t}^{\pi,\left[p_{Z}/q_{Z}\right]}\left(b_{t},a\right)-Q_{\mathbf{M}_{\mathbf{P}},t}^{\pi,\left[p_{Z}/q_{Z}\right]}\left(\bar{b}_{t},a\right)\right|\leq\alpha_{t}+\beta_{t}=\epsilon, \tag{7}
\end{equation}

where

\begin{equation}
\alpha_{t}=(1+\gamma)\lambda+\gamma\alpha_{t+1},\qquad\alpha_{L}=\lambda\geq 0, \tag{8}
\end{equation}
\begin{equation}
\beta_{t}=2\nu+\gamma\beta_{t+1},\qquad\beta_{L}=2\nu\geq 0, \tag{9}
\end{equation}
\begin{equation}
k_{\max}(\lambda,C)=\frac{\lambda}{4V_{\max}d_{\infty}^{\max}}-\frac{1}{\sqrt{C}}>0, \tag{10}
\end{equation}
\begin{equation}
\acute{k}=\min\left\{k_{\max},\,\lambda/4\sqrt{2}V_{\max}\right\}. \tag{11}
\end{equation}

While [lev2024simplifying] further generalize particle-belief MDP policy evaluation to simplified observation models, we assume that the observation model is known. This simplifies the Rényi divergence back to the definition provided in [lim2023optimality], where $\mathcal{P}^{d}$ is the target distribution and $\mathcal{Q}^{d}$ is the sampling distribution for particle importance sampling:

\begin{equation}
d_{\infty}(\mathcal{P}^{d}\|\mathcal{Q}^{d})=\operatorname*{esssup}_{x\sim\mathcal{Q}^{d}}w_{\mathcal{P}^{d}/\mathcal{Q}^{d}}(x)\leq d_{\infty}^{\max}<+\infty \tag{12}
\end{equation}
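As a simple illustration (assuming $w_{\mathcal{P}^{d}/\mathcal{Q}^{d}}(x)$ denotes the importance weight $\mathrm{d}\mathcal{P}^{d}/\mathrm{d}\mathcal{Q}^{d}(x)$), sampling directly from the target distribution gives the smallest possible bound:

\begin{equation*}
\mathcal{Q}^{d}=\mathcal{P}^{d}\;\Longrightarrow\;w_{\mathcal{P}^{d}/\mathcal{Q}^{d}}(x)=1\ \text{for all }x,\qquad d_{\infty}(\mathcal{P}^{d}\|\mathcal{Q}^{d})=1.
\end{equation*}

Since $w$ integrates to one under $\mathcal{Q}^{d}$, we have $d_{\infty}\geq 1$ in general, with equality exactly when the two distributions coincide almost everywhere.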

In order to extend $V_{\max}$ from Theorem 0.1 to the multiagent setting, we specify that $V_{\max,i}$ bounds the value for player $i$ via a finite geometric sum such that

\begin{equation}
V_{\max,i}=\frac{\max_{s,a}|R^{i}(s,a)|\left(1-\gamma^{D}\right)}{1-\gamma}\,. \tag{13}
\end{equation}
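This is the usual truncated geometric series bound on a discounted return; a short check, assuming rewards accumulate over a horizon of $D$ steps ($t=0,\dots,D-1$) with discount $\gamma$:

\begin{equation*}
\left|\sum_{t=0}^{D-1}\gamma^{t}R^{i}(s_{t},a_{t})\right|\leq\max_{s,a}|R^{i}(s,a)|\sum_{t=0}^{D-1}\gamma^{t}=\frac{\max_{s,a}|R^{i}(s,a)|\left(1-\gamma^{D}\right)}{1-\gamma}.
\end{equation*}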

Furthermore, we assume the reward to be a deterministic function of a given state and action. As such, we have $\delta_{r}(\nu,N_{r})=0$ for all $N_{r}>0$ and $\nu\geq 0$. Consequently, we can simply choose $\nu=0$ and $N_{r}=1$.

By the definition of $\beta_{t}$, we now have $\beta_{t}=0$ for all $t$.
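Explicitly, substituting $\nu=0$ into the recursion (9) and inducting backward from the terminal step gives

\begin{equation*}
\beta_{L}=2\nu=0,\qquad\beta_{t}=2\nu+\gamma\beta_{t+1}=0+\gamma\cdot 0=0\quad\text{for all }t<L.
\end{equation*}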

This reduces the concentration bound to

\begin{equation}
\left|Q_{\mathbf{P},t}^{\pi,\left[p_{Z}/q_{Z}\right]}\left(b_{t},a\right)-Q_{\mathbf{M}_{\mathbf{P}},t}^{\pi,\left[p_{Z}/q_{Z}\right]}\left(\bar{b}_{t},a\right)\right|\leq\alpha_{t}=\epsilon, \tag{14}
\end{equation}

where

\begin{equation}
\alpha_{t}=(1+\gamma)\lambda+\gamma\alpha_{t+1},\qquad\alpha_{D}=\lambda\geq 0. \tag{15}
\end{equation}

By noting the sequence of $\alpha_{0}^{D}$ for different horizons $D$,

\begin{align}
\alpha^{0}_{0} &= \lambda \tag{16}\\
\alpha^{1}_{0} &= \lambda(1+2\gamma) \nonumber\\
\alpha^{2}_{0} &= \lambda(1+2\gamma+2\gamma^{2}) \nonumber\\
&\quad\vdots \nonumber\\
\alpha^{D}_{0} &= \lambda\Bigl(1+2\sum_{t=1}^{D}\gamma^{t}\Bigr), \nonumber
\end{align}

we can establish a closed form via a finite geometric sum, where

\begin{equation}
\alpha_{0}=\lambda\left[2\left(\frac{1-\gamma^{D+1}}{1-\gamma}\right)-1\right], \tag{17}
\end{equation}

for a given problem horizon $D$.
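The closed form follows from rewriting the bracketed sum as a full geometric series:

\begin{equation*}
1+2\sum_{t=1}^{D}\gamma^{t}=2\sum_{t=0}^{D}\gamma^{t}-1=2\left(\frac{1-\gamma^{D+1}}{1-\gamma}\right)-1,
\end{equation*}

which matches (17). For illustration, with the hypothetical values $\gamma=0.9$ and $D=10$ this factor evaluates to approximately $12.72$, so $\alpha_{0}\approx 12.72\,\lambda$.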

We apply the established guarantee at the root belief, where

\begin{align}
U^{i}(\sigma) &= Q_{\mathbf{P},0}^{\sigma,\left[p_{Z}/q_{Z}\right]}\left(b_{0},\sigma(h_{0})\right) \tag{18}\\
\hat{U}^{i}(\sigma) &= Q_{\mathbf{M}_{\mathbf{P}},0}^{\sigma,\left[p_{Z}/q_{Z}\right]}\left(\bar{b}_{0},\sigma(h_{0})\right)\,, \nonumber
\end{align}

for a given player $i$.

For all policies $\sigma$, $t=0,\dots,D$, and $a\in A$, the following bound holds with probability at least $1-5(4C)^{D+1}\exp(-C\cdot\acute{k}_{i}^{2})$:

\begin{equation}
\left|U^{i}(\sigma)-\hat{U}^{i}(\sigma)\right|\leq\epsilon, \tag{19}
\end{equation}

where

\begin{equation}
\epsilon=\lambda\left[2\left(\frac{1-\gamma^{D+1}}{1-\gamma}\right)-1\right] \tag{20}
\end{equation}
\begin{equation}
k_{\max,i}(\lambda,C)=\frac{\lambda}{4V_{\max,i}d_{\infty}^{\max}}-\frac{1}{\sqrt{C}}>0, \tag{21}
\end{equation}
\begin{equation}
\acute{k}_{i}=\min\left\{k_{\max,i},\,\lambda/4\sqrt{2}V_{\max,i}\right\}. \tag{22}
\end{equation}
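If a target accuracy $\epsilon$ is fixed instead, (20) can be rearranged to obtain the corresponding $\lambda$ (a simple algebraic inversion, not part of the original theorem statement):

\begin{equation*}
\lambda=\frac{\epsilon(1-\gamma)}{2\left(1-\gamma^{D+1}\right)-(1-\gamma)}=\frac{\epsilon(1-\gamma)}{1+\gamma-2\gamma^{D+1}},
\end{equation*}

which can then be substituted into (21) and (22).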

For zero-sum games, we have $R^{i}(s,a)=-R^{-i}(s,a)$. As such, $|R^{i}(s,a)|=|R^{-i}(s,a)|$, which yields the following equivalences:

\begin{align}
V_{\max} &= V_{\max,i}=V_{\max,-i} \tag{23}\\
k_{\max} &= k_{\max,i}=k_{\max,-i} \nonumber\\
\acute{k} &= \acute{k}_{i}=\acute{k}_{-i} \nonumber
\end{align}