Proof of LABEL:theorem:single-policy-error-bound
Theorem 0.1.
(Theorem 3 of [lev2024simplifying])
Assume that the immediate state reward estimate $\hat{r}(s,a)$ is probabilistically bounded such that $\Pr\left(\left|\hat{r}(s,a) - r(s,a)\right| \leq \lambda\right) \geq 1 - \delta_r$, for a number of reward samples $N_r$ and state sample $s$. Assume that $\lambda \to 0$ as $N_r \to \infty$. For all policies $\pi$, beliefs $b$ and corresponding particle beliefs $\bar{b}$, the following bounds hold with probability at least $1 - \delta$:
$$\left| V^{\pi}_{\mathbf{M}}(b) - V^{\pi}_{\mathbf{M}_C}(\bar{b}) \right| \leq \epsilon \qquad (7)$$
where $\epsilon$ and the total failure probability $\delta$ are defined as in Theorem 3 of [lev2024simplifying]: $\epsilon$ accumulates, over the discounted horizon, a per-step particle-approximation term $t$, governed by the number of particles $C$, the maximal reward magnitude $R_{\max}$ and the maximal Rényi divergence $d_\infty$ defined below, together with the reward-estimation error $\lambda$.
While [lev2024simplifying] further generalize particle-belief MDP policy evaluation to simplified observation models, we assume that the observation model is known. This simplifies the Rényi divergence back to the definition provided in [lim2023optimality], where $p$ is the target distribution and $q$ is the sampling distribution for particle importance sampling:
$$d_{\infty}(p \,\|\, q) = \operatorname*{ess\,sup}_{x \sim q} \frac{p(x)}{q(x)}. \qquad (12)$$
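As a concrete illustration of this quantity (an example of ours, not drawn from [lim2023optimality]): for a target $p = (0.5, 0.5)$ and a sampling distribution $q = (0.75, 0.25)$ over two outcomes,
$$d_{\infty}(p \,\|\, q) = \max\left\{ \frac{0.5}{0.75},\; \frac{0.5}{0.25} \right\} = 2,$$
so the divergence, and with it the particle-approximation error, grows as the sampling distribution places little mass where the target does.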
In order to extend Theorem 0.1 to the multiagent setting, we specify that $\epsilon_i$ bounds the value for player $i$ with a finite geometric sum such that
$$\left| V^{\pi,i}_{\mathbf{M}}(b) - V^{\pi,i}_{\mathbf{M}_C}(\bar{b}) \right| \leq \epsilon_i. \qquad (13)$$
Furthermore, we assume the reward to be a deterministic function of a given state and action. As such, we have $\hat{r}(s,a) = r(s,a)$. Consequently, we can simply choose $\lambda = 0$ and $\delta_r = 0$.
By the definition of $\epsilon$ in Theorem 0.1, the reward-estimation terms now vanish from the bound.
This reduces the concentration bound to
$$\left| V^{\pi,i}_{\mathbf{M}}(b_d) - V^{\pi,i}_{\mathbf{M}_C}(\bar{b}_d) \right| \leq \epsilon^{i}_{d} \qquad (14)$$
where
$$\epsilon^{i}_{d} = t_i + \gamma\,\epsilon^{i}_{d+1}, \qquad \epsilon^{i}_{D} = 0, \qquad (15)$$
with $t_i$ denoting the per-step particle-approximation term of Theorem 0.1 for player $i$.
By noting the sequence of $\epsilon^{i}_{d}$ for different horizons
$$\begin{aligned}
\epsilon^{i}_{D} &= 0, \\
\epsilon^{i}_{D-1} &= t_i, \\
\epsilon^{i}_{D-2} &= t_i + \gamma\, t_i, \\
&\;\;\vdots \\
\epsilon^{i}_{0} &= t_i + \gamma\, t_i + \dots + \gamma^{D-1} t_i,
\end{aligned} \qquad (16)$$
we can establish a closed form via finite geometric sum where
$$\epsilon^{i}_{0} = t_i \sum_{d=0}^{D-1} \gamma^{d} = t_i\, \frac{1-\gamma^{D}}{1-\gamma} \qquad (17)$$
for a given problem horizon $D$.
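As a quick numerical sanity check of this closed form (an illustration of ours, with arbitrarily chosen values $\gamma = 0.9$, $D = 3$ and $t_i = 1$): unrolling the sum gives $1 + 0.9 + 0.81 = 2.71$, which matches
$$t_i\, \frac{1-\gamma^{D}}{1-\gamma} = \frac{1 - 0.9^{3}}{1 - 0.9} = \frac{0.271}{0.1} = 2.71.$$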
We take the established guarantee at the root belief $b_0$, and obtain
$$\left| V^{\pi,i}_{\mathbf{M}}(b_0) - V^{\pi,i}_{\mathbf{M}_C}(\bar{b}_0) \right| \leq \epsilon_i = t_i\, \frac{1-\gamma^{D}}{1-\gamma} \qquad (18)$$
for a given player $i$.
For all policies $\pi$, players $i$ and root beliefs $b_0$, the following bounds hold with probability at least $1 - \delta$:
$$\left| V^{\pi,i}_{\mathbf{M}}(b_0) - V^{\pi,i}_{\mathbf{M}_C}(\bar{b}_0) \right| \leq \epsilon_i \qquad (19)$$
where
$$\epsilon_i = t_i\, \frac{1-\gamma^{D}}{1-\gamma}, \qquad (20)$$
and $t_i$ is the per-step concentration term of Theorem 0.1 instantiated with player $i$'s maximal reward magnitude $R^{i}_{\max} = \max_{s,a} \left| r_i(s,a) \right|$ and the number of particles $C$.
For zero-sum games, we have $r_1(s,a) = -r_2(s,a)$. As such, $R^{1}_{\max} = R^{2}_{\max}$, consequently allowing for the following equivalences:
$$t_1 = t_2, \qquad \epsilon_1 = \epsilon_2. \qquad (23)$$
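To spell out why a single bound suffices under the zero-sum assumption (a restatement in our notation rather than an additional result): since $V^{\pi,1}_{\mathbf{M}} = -V^{\pi,2}_{\mathbf{M}}$ and $V^{\pi,1}_{\mathbf{M}_C} = -V^{\pi,2}_{\mathbf{M}_C}$,
$$\left| V^{\pi,1}_{\mathbf{M}}(b_0) - V^{\pi,1}_{\mathbf{M}_C}(\bar{b}_0) \right| = \left| -\left( V^{\pi,2}_{\mathbf{M}}(b_0) - V^{\pi,2}_{\mathbf{M}_C}(\bar{b}_0) \right) \right| = \left| V^{\pi,2}_{\mathbf{M}}(b_0) - V^{\pi,2}_{\mathbf{M}_C}(\bar{b}_0) \right|,$$
so both players share the same approximation error and a single $\epsilon \equiv \epsilon_1 = \epsilon_2$ covers both.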