
Near-Optimal Reinforcement Learning with Self-Play

Yu Bai
Salesforce Research
[email protected]
   Chi Jin
Princeton University
[email protected]
   Tiancheng Yu
MIT
[email protected]
Abstract

This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms which learn the optimal policy by playing against themselves without any direct supervision. In a tabular episodic Markov game with $S$ states, $A$ max-player actions and $B$ min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires $\tilde{\mathcal{O}}(S^2AB)$ steps of game playing, when only highlighting the dependency on $(S,A,B)$. In contrast, the best existing lower bound scales as $\Omega(S(A+B))$ and has a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new Nash V-learning algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$. The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. In addition, we present a computational hardness result for learning the best responses against a fixed opponent in Markov games—a learning objective different from finding the Nash equilibrium.

1 Introduction

A wide range of modern artificial intelligence challenges can be cast as a multi-agent reinforcement learning (multi-agent RL) problem, in which more than one agent performs sequential decision making in an interactive environment. Multi-agent RL has achieved significant recent success on traditionally challenging tasks, for example in the game of GO [30, 31], Poker [6], real-time strategy games [33, 22], decentralized controls or multiagent robotics systems [5], autonomous driving [27], as well as complex social scenarios such as hide-and-seek [3]. In many scenarios, the learning agents even outperform the best human experts.

Despite the great empirical success, a major bottleneck for many existing RL algorithms is that they require a tremendous number of samples. For example, the biggest AlphaGo Zero model was trained on tens of millions of games and took more than a month to train [31]. While requiring such an amount of samples may be acceptable in simulatable environments such as GO, it is not so in other sample-expensive real-world settings such as robotics and autonomous driving. It is thus important for us to understand the sample complexity in RL—how can we design algorithms that find a near-optimal policy with a small number of samples, and what is the fundamental limit, i.e., the minimum number of samples required for any algorithm to find a good policy?

Theoretical understanding of the sample complexity of multi-agent RL is rather limited, especially when compared with single-agent settings. The standard model for a single-agent setting is an episodic Markov Decision Process (MDP) with $S$ states, $A$ actions, and $H$ steps per episode. The best known algorithm can find an $\epsilon$ near-optimal policy in $\tilde{\Theta}(\mathrm{poly}(H)SA/\epsilon^2)$ episodes, which matches the lower bound up to a single $H$ factor [1, 8]. In contrast, in multi-agent settings, the optimal sample complexity remains open even in the basic setting of two-player tabular Markov games [28], where the agents are required to find the solutions of the games—the Nash equilibria. The best known algorithm, VI-ULCB, finds an $\epsilon$-approximate Nash equilibrium in $\tilde{\mathcal{O}}(\mathrm{poly}(H)S^2AB/\epsilon^2)$ episodes [2], where $B$ is the number of actions for the other player. The information-theoretic lower bound is $\Omega(\mathrm{poly}(H)S(A+B)/\epsilon^2)$. Specifically, the number of episodes required by the algorithm scales quadratically in both $S$ and $(A,B)$, and exhibits a gap from the linear dependency in the lower bound. This motivates the following question:

Can we design algorithms with near-optimal sample complexity for learning Markov games?

In this paper, we present the first line of near-optimal algorithms for two-player Markov games that match the aforementioned lower bound up to a $\mathrm{poly}(H)$ factor. This closes the open problem of achieving the optimal sample complexity in the dependency on $(S,A,B)$. Our algorithm learns by playing against itself without requiring any direct supervision, and is thus a self-play algorithm.

1.1 Our contributions

  • We propose an optimistic variant of Nash Q-learning [11], and prove that it achieves sample complexity $\tilde{\mathcal{O}}(H^5SAB/\epsilon^2)$ for finding an $\epsilon$-approximate Nash equilibrium in two-player Markov games (Section 3). Our algorithm builds optimistic upper and lower estimates of $Q$-values, and computes the Coarse Correlated Equilibrium (CCE) over this pair of $Q$ estimates as its execution policies for both players.

  • We design a new algorithm—Nash V-learning—for finding approximate Nash equilibria, and show that it achieves sample complexity $\tilde{\mathcal{O}}(H^6S(A+B)/\epsilon^2)$ (Section 4). This improves upon Nash Q-learning when $\min\{A,B\}>H$. It is also the first result that matches the minimax lower bound up to only a $\mathrm{poly}(H)$ factor. This algorithm builds optimistic upper and lower estimates of $V$-values, and features a novel combination of Follow-the-Regularized-Leader (FTRL) and the standard Q-learning algorithm to determine its execution policies.

  • Apart from finding Nash equilibria, we prove that learning the best responses of fixed opponents in Markov games is as hard as learning parity with noise—a notoriously difficult problem that is believed to be computationally hard (Section 5). As a corollary, this hardness result directly implies that achieving sublinear regret against adversarial opponents in Markov games is also computationally hard, a result that first appeared in [25]. This in turn rules out the possibility of designing efficient algorithms for finding Nash equilibria by running no-regret algorithms for each player separately.

In addition to the above contributions, this paper also features a novel approach of extracting certified policies—from the estimates produced by reinforcement learning algorithms such as Nash Q-learning and Nash V-learning—that are certified to have similar performance as Nash equilibrium policies, even when facing their best responses (see Section 3 for more details). We believe this technique could be of broader interest to the community.

1.2 Related Work

Markov games

Markov games (or stochastic games) were proposed in the early 1950s [28]. They are widely used to model multi-agent RL. Learning the Nash equilibria of Markov games has been studied in classical work [18, 19, 11, 10], where the transition matrix and reward are assumed to be known, or in the asymptotic setting where the amount of data goes to infinity. These results do not directly apply to the non-asymptotic setting where the transition and reward are unknown and only a limited amount of data is available for estimating them.

A recent line of work tackles self-play algorithms for Markov games in the non-asymptotic setting with strong reachability assumptions. Specifically, Wei et al. [35] assume that no matter what strategy one agent sticks to, the other agent can always reach all states by playing a certain policy, and Jia et al. [13], Sidford et al. [29] assume access to simulators (or generative models) that enable the agent to directly sample transition and reward information for any state-action pair. These settings ensure that all states can be reached directly, so no sophisticated exploration is required.

Very recently, [2, 36] study learning Markov games without these reachability assumptions, where exploration becomes essential. However, both results suffer from highly suboptimal sample complexity. We compare them with our results in Table 1. The results of [36] also apply to the linear function approximation setting. We remark that the R-max algorithm [4] does provide provable guarantees for learning Markov games, even in the setting of playing against adversarial opponents, but using a definition of regret that is weaker than the standard regret. Their result does not imply any sample complexity result for finding Nash equilibrium policies.

Table 1: Sample complexity (the required number of episodes) for algorithms to find $\epsilon$-approximate Nash equilibrium policies in zero-sum Markov games.

Algorithm | Sample Complexity | Runtime
VI-ULCB [2] | $\tilde{\mathcal{O}}(H^4S^2AB/\epsilon^2)$ | PPAD-complete
VI-explore [2] | $\tilde{\mathcal{O}}(H^5S^2AB/\epsilon^2)$ | Polynomial
OMVI-SM [36] | $\tilde{\mathcal{O}}(H^4S^3A^3B^3/\epsilon^2)$ |
Optimistic Nash Q-learning | $\tilde{\mathcal{O}}(H^5SAB/\epsilon^2)$ |
Optimistic Nash V-learning | $\tilde{\mathcal{O}}(H^6S(A+B)/\epsilon^2)$ |
Lower Bound [14, 2] | $\Omega(H^3S(A+B)/\epsilon^2)$ | -

Adversarial MDP

Another line of related work focuses on provably efficient algorithms for adversarial MDPs. Most work in this line considers the setting with adversarial rewards [38, 26, 15], because adversarial MDP with changing dynamics is computationally hard even under full-information feedback [37]. These results do not directly imply provable self-play algorithms in our setting, because the opponent in Markov games can affect both the reward and the transition.

Single-agent RL

There is a rich literature on reinforcement learning in MDPs [see e.g. 12, 24, 1, 7, 32, 14]. An MDP is a special case of a Markov game, where only a single agent interacts with a stochastic environment. For the tabular episodic setting with nonstationary dynamics and no simulators, the best sample complexities achieved by existing model-based and model-free algorithms are $\tilde{\mathcal{O}}(H^3SA/\epsilon^2)$ [1] and $\tilde{\mathcal{O}}(H^4SA/\epsilon^2)$ [14], respectively, where $S$ is the number of states, $A$ is the number of actions, and $H$ is the length of each episode. Both of them (nearly) match the lower bound $\Omega(H^3SA/\epsilon^2)$ [12, 23, 14].

2 Preliminaries

We consider zero-sum Markov Games (MGs) [28, 18], which are also known as stochastic games in the literature. Zero-sum Markov games are a generalization of standard Markov Decision Processes (MDPs) to the two-player setting, in which the max-player seeks to maximize the total return and the min-player seeks to minimize the total return.

Formally, we denote a tabular episodic Markov game as $\mathrm{MG}(H,\mathcal{S},\mathcal{A},\mathcal{B},\mathbb{P},r)$, where $H$ is the number of steps in each episode, $\mathcal{S}$ is the set of states with $|\mathcal{S}|\leq S$, $(\mathcal{A},\mathcal{B})$ are the sets of actions of the max-player and the min-player respectively, $\mathbb{P}=\{\mathbb{P}_h\}_{h\in[H]}$ is a collection of transition matrices, so that $\mathbb{P}_h(\cdot|s,a,b)$ gives the distribution over states if action pair $(a,b)$ is taken at state $s$ at step $h$, and $r=\{r_h\}_{h\in[H]}$ is a collection of reward functions, where $r_h\colon\mathcal{S}\times\mathcal{A}\times\mathcal{B}\to[0,1]$ is the deterministic reward function at step $h$.¹

¹We assume the rewards are in $[0,1]$ for normalization. Our results directly generalize to randomized reward functions, since learning the transition is more difficult than learning the reward.

In each episode of this MG, we start with a fixed initial state $s_1$. Then, at each step $h\in[H]$, both players observe the state $s_h\in\mathcal{S}$, and the max-player picks action $a_h\in\mathcal{A}$ while the min-player picks action $b_h\in\mathcal{B}$ simultaneously. Both players observe the actions of the opponents, receive reward $r_h(s_h,a_h,b_h)$, and then the environment transitions to the next state $s_{h+1}\sim\mathbb{P}_h(\cdot|s_h,a_h,b_h)$. The episode ends when $s_{H+1}$ is reached.

Markov policy, value function

A Markov policy $\mu$ of the max-player is a collection of $H$ functions $\{\mu_h:\mathcal{S}\rightarrow\Delta_{\mathcal{A}}\}_{h\in[H]}$, each of which maps a state to a distribution over actions. Here $\Delta_{\mathcal{A}}$ is the probability simplex over action set $\mathcal{A}$. Similarly, a policy $\nu$ of the min-player is a collection of $H$ functions $\{\nu_h:\mathcal{S}\rightarrow\Delta_{\mathcal{B}}\}_{h\in[H]}$. We use the notation $\mu_h(a|s)$ and $\nu_h(b|s)$ to denote the probability of taking action $a$ or $b$ at state $s$ at step $h$ under Markov policy $\mu$ or $\nu$ respectively.

We use $V^{\mu,\nu}_h\colon\mathcal{S}\to\mathbb{R}$ to denote the value function at step $h$ under policies $\mu$ and $\nu$, so that $V^{\mu,\nu}_h(s)$ gives the expected cumulative reward received under policies $\mu$ and $\nu$, starting from $s$ at step $h$:

V^{\mu,\nu}_{h}(s) := \mathbb{E}_{\mu,\nu}\left[\left.\sum_{h'=h}^{H}r_{h'}(s_{h'},a_{h'},b_{h'})\,\right|\,s_{h}=s\right]. \qquad (1)

We also define $Q^{\mu,\nu}_h:\mathcal{S}\times\mathcal{A}\times\mathcal{B}\to\mathbb{R}$ to denote the $Q$-value function at step $h$, so that $Q^{\mu,\nu}_h(s,a,b)$ gives the expected cumulative reward received under policies $\mu$ and $\nu$, starting from $(s,a,b)$ at step $h$:

Q^{\mu,\nu}_{h}(s,a,b) := \mathbb{E}_{\mu,\nu}\left[\left.\sum_{h'=h}^{H}r_{h'}(s_{h'},a_{h'},b_{h'})\,\right|\,s_{h}=s,a_{h}=a,b_{h}=b\right]. \qquad (2)

For simplicity, we use the operator notation $\mathbb{P}_h$ so that $[\mathbb{P}_hV](s,a,b):=\mathbb{E}_{s'\sim\mathbb{P}_h(\cdot|s,a,b)}V(s')$ for any value function $V$. We also use the notation $[\mathbb{D}_\pi Q](s):=\mathbb{E}_{(a,b)\sim\pi(\cdot,\cdot|s)}Q(s,a,b)$ for any action-value function $Q$. By the definition of value functions, we have the Bellman equation

Q^{\mu,\nu}_{h}(s,a,b)=(r_{h}+\mathbb{P}_{h}V^{\mu,\nu}_{h+1})(s,a,b),\qquad V^{\mu,\nu}_{h}(s)=(\mathbb{D}_{\mu_{h}\times\nu_{h}}Q^{\mu,\nu}_{h})(s)

for all $(s,a,b,h)\in\mathcal{S}\times\mathcal{A}\times\mathcal{B}\times[H]$. We define $V^{\mu,\nu}_{H+1}(s)=0$ for all $s\in\mathcal{S}_{H+1}$.
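
To make the backward-induction structure of the Bellman equation concrete, the following minimal Python sketch evaluates $V^{\mu,\nu}$ and $Q^{\mu,\nu}$ for a known tabular Markov game. The array layout and the names (P, r, mu, nu, evaluate_markov_game) are our own illustrative choices and are not part of the paper.

import numpy as np

def evaluate_markov_game(P, r, mu, nu):
    # P  : array (H, S, A, B, S), P[h, s, a, b, s'] = transition probability
    # r  : array (H, S, A, B),    deterministic rewards in [0, 1]
    # mu : array (H, S, A),       max-player Markov policy mu_h(a|s)
    # nu : array (H, S, B),       min-player Markov policy nu_h(b|s)
    H, S, A, B, _ = P.shape
    V = np.zeros((H + 1, S))      # V_{H+1} = 0 by convention
    Q = np.zeros((H, S, A, B))
    for h in reversed(range(H)):
        # Bellman equation: Q_h = r_h + P_h V_{h+1}
        Q[h] = r[h] + np.einsum('sabt,t->sab', P[h], V[h + 1])
        # V_h(s) = E_{a ~ mu_h(.|s), b ~ nu_h(.|s)} Q_h(s, a, b)
        V[h] = np.einsum('sa,sb,sab->s', mu[h], nu[h], Q[h])
    return V, Q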

Best response and Nash equilibrium

For any Markov policy of the max-player $\mu$, there exists a best response of the min-player, which is a Markov policy $\nu^{\dagger}(\mu)$ satisfying $V_h^{\mu,\nu^{\dagger}(\mu)}(s)=\inf_{\nu}V_h^{\mu,\nu}(s)$ for any $(s,h)\in\mathcal{S}\times[H]$. Here the infimum is taken over all possible policies, which are not necessarily Markovian (we will define general policies later in this section). We define $V_h^{\mu,\dagger}:=V_h^{\mu,\nu^{\dagger}(\mu)}$. By symmetry, we can also define $\mu^{\dagger}(\nu)$ and $V_h^{\dagger,\nu}$. It is further known (cf. [9]) that there exist Markov policies $\mu^{\star}$, $\nu^{\star}$ that are optimal against the best responses of the opponents, in the sense that

V^{\mu^{\star},\dagger}_{h}(s)=\sup_{\mu}V^{\mu,\dagger}_{h}(s),\qquad V^{\dagger,\nu^{\star}}_{h}(s)=\inf_{\nu}V^{\dagger,\nu}_{h}(s),\qquad\textrm{for all}~(s,h).

We call these optimal strategies $(\mu^{\star},\nu^{\star})$ the Nash equilibrium of the Markov game, which satisfies the following minimax equation:²

\sup_{\mu}\inf_{\nu}V^{\mu,\nu}_{h}(s)=V^{\mu^{\star},\nu^{\star}}_{h}(s)=\inf_{\nu}\sup_{\mu}V^{\mu,\nu}_{h}(s).

²The minimax theorem here is different from the one for matrix games, i.e. $\max_{\phi}\min_{\psi}\phi^{\top}A\psi=\min_{\psi}\max_{\phi}\phi^{\top}A\psi$ for any matrix $A$, since here $V^{\mu,\nu}_h(s)$ is in general not bilinear in $\mu,\nu$.

Intuitively, a Nash equilibrium gives a solution in which no player has anything to gain by changing only her own policy. We further abbreviate the values of the Nash equilibrium $V_h^{\mu^{\star},\nu^{\star}}$ and $Q_h^{\mu^{\star},\nu^{\star}}$ as $V_h^{\star}$ and $Q_h^{\star}$. We refer readers to Appendix A for the Bellman optimality equations for the values of best responses or Nash equilibria.

General (non-Markovian) policy

In certain situations, it is beneficial to consider general, history-dependent policies that are not necessarily Markovian. A (general) policy $\mu$ of the max-player is a set of $H$ maps $\mu:=\big\{\mu_h:\mathbb{R}\times(\mathcal{S}\times\mathcal{A}\times\mathcal{B}\times\mathbb{R})^{h-1}\times\mathcal{S}\rightarrow\Delta_{\mathcal{A}}\big\}_{h\in[H]}$, from a random number $z\in\mathbb{R}$ and a history of length $h$—say $(s_1,a_1,b_1,r_1,\cdots,s_h)$, to a distribution over actions in $\mathcal{A}$. By symmetry, we can also define the (general) policy $\nu$ of the min-player, by replacing the action set $\mathcal{A}$ in the definition by the set $\mathcal{B}$. The random number $z$ is sampled from some underlying distribution $\mathcal{D}$, but may be shared among all steps $h\in[H]$.

For a pair of general policies $(\mu,\nu)$, we can still use definition (1) to define their value $V_1^{\mu,\nu}(s_1)$ at step $1$. We can also define the best response $\nu^{\dagger}(\mu)$ of a general policy $\mu$ as the minimizing policy so that $V_1^{\mu,\dagger}(s_1)\equiv V_1^{\mu,\nu^{\dagger}(\mu)}(s_1)=\inf_{\nu}V_1^{\mu,\nu}(s_1)$ at step 1. We remark that the best response of a general policy is not necessarily Markovian.

Learning Objective

There are two possible learning objectives in the setting of Markov games. The first one is to find the best response for a fixed opponent. Without loss of generality, we consider the case where the learning agent is the max-player, and the min-player is the opponent.

Definition 1 ($\epsilon$-approximate best response).

For an opponent with a fixed unknown general policy $\nu$, a general policy $\hat{\mu}$ is an $\epsilon$-approximate best response if $V^{\dagger,\nu}_1(s_1)-V^{\hat{\mu},\nu}_1(s_1)\leq\epsilon$.

The second goal is to find a Nash equilibrium of the Markov game. We measure the suboptimality of any pair of general policies $(\hat{\mu},\hat{\nu})$ using the gap between their performance and the performance of the optimal strategy (i.e. the Nash equilibrium) when playing against the best responses respectively:

V^{\dagger,\hat{\nu}}_{1}(s_{1})-V^{\hat{\mu},\dagger}_{1}(s_{1})=\left[V^{\dagger,\hat{\nu}}_{1}(s_{1})-V^{{\star}}_{1}(s_{1})\right]+\left[V^{{\star}}_{1}(s_{1})-V^{\hat{\mu},\dagger}_{1}(s_{1})\right]

Definition 2 ($\epsilon$-approximate Nash equilibrium).

A pair of general policies $(\hat{\mu},\hat{\nu})$ is an $\epsilon$-approximate Nash equilibrium if $V^{\dagger,\hat{\nu}}_1(s_1)-V^{\hat{\mu},\dagger}_1(s_1)\leq\epsilon$.

Loosely speaking, Nash equilibria can be viewed as “the best responses to the best responses”. In most applications, they are the ultimate solutions to the games. In Sections 3 and 4, we present sharp guarantees for learning an approximate Nash equilibrium with near-optimal sample complexity. However, rather surprisingly, learning a best response in the worst case is more challenging than learning the Nash equilibrium. In Section 5, we present a computational hardness result for learning an approximate best response.

3 Optimistic Nash Q-learning

In this section, we present our first algorithm Optimistic Nash Q-learning and its corresponding theoretical guarantees.

Algorithm 1 Optimistic Nash Q-learning
1:  Initialize: for any $(s,a,b,h)$: $\overline{Q}_h(s,a,b)\leftarrow H$, $\underline{Q}_h(s,a,b)\leftarrow 0$, $N_h(s,a,b)\leftarrow 0$, $\pi_h(a,b|s)\leftarrow 1/(AB)$.
2:  for episode $k=1,\dots,K$ do
3:     receive $s_1$.
4:     for step $h=1,\dots,H$ do
5:        take action $(a_h,b_h)\sim\pi_h(\cdot,\cdot|s_h)$.
6:        observe reward $r_h(s_h,a_h,b_h)$ and next state $s_{h+1}$.
7:        $t=N_h(s_h,a_h,b_h)\leftarrow N_h(s_h,a_h,b_h)+1$.
8:        $\overline{Q}_h(s_h,a_h,b_h)\leftarrow(1-\alpha_t)\overline{Q}_h(s_h,a_h,b_h)+\alpha_t(r_h(s_h,a_h,b_h)+\overline{V}_{h+1}(s_{h+1})+\beta_t)$.
9:        $\underline{Q}_h(s_h,a_h,b_h)\leftarrow(1-\alpha_t)\underline{Q}_h(s_h,a_h,b_h)+\alpha_t(r_h(s_h,a_h,b_h)+\underline{V}_{h+1}(s_{h+1})-\beta_t)$.
10:       $\pi_h(\cdot,\cdot|s_h)\leftarrow\textsc{CCE}(\overline{Q}_h(s_h,\cdot,\cdot),\underline{Q}_h(s_h,\cdot,\cdot))$.
11:       $\overline{V}_h(s_h)\leftarrow(\mathbb{D}_{\pi_h}\overline{Q}_h)(s_h)$; $\underline{V}_h(s_h)\leftarrow(\mathbb{D}_{\pi_h}\underline{Q}_h)(s_h)$.

Algorithm part I: learning values

Our algorithm Optimistic Nash Q-learning (Algorithm 1) is an optimistic variant of Nash Q-learning [11]. For each step in each episode, it (a) takes actions according to the previously computed policy $\pi_h$, and observes the reward and next state, (b) performs incremental updates on $Q$-values, and (c) computes new greedy policies and updates $V$-values. Part (a) is straightforward; we now focus on explaining part (b) and part (c).

In part (b), the incremental updates on $Q$-values (Lines 8, 9) are almost the same as in standard Q-learning [34], except that here we maintain two separate $Q$-values—$\overline{Q}_h$ and $\underline{Q}_h$—as upper and lower confidence versions respectively. We add and subtract a bonus term $\beta_t$ in the corresponding updates, which depends on $t=N_h(s_h,a_h,b_h)$—the number of times $(s_h,a_h,b_h)$ has been visited at step $h$. We pick the parameters $\alpha_t$ and $\beta_t$ as follows, for some large constant $c$ and log factor $\iota$:

\alpha_t=(H+1)/(H+t),\qquad\beta_t=c\sqrt{H^3\iota/t} \qquad (3)
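
For concreteness, here is a minimal Python sketch of one such incremental update (lines 7-9 of Algorithm 1), assuming the tabular quantities are stored as numpy arrays; the function name and the way the constant $c$ and log factor $\iota$ are passed in are our own illustrative choices.

import numpy as np

def nash_q_update(Q_up, Q_lo, V_up, V_lo, N, h, s, a, b, reward, s_next, H, c, iota):
    # Q_up, Q_lo : arrays (H, S, A, B), upper / lower Q estimates
    # V_up, V_lo : arrays (H + 1, S),   upper / lower V estimates (V_{H+1} = 0)
    # N          : array (H, S, A, B),  visit counts
    N[h, s, a, b] += 1
    t = N[h, s, a, b]
    alpha = (H + 1) / (H + t)               # learning rate alpha_t from (3)
    beta = c * np.sqrt(H ** 3 * iota / t)   # bonus beta_t from (3)
    Q_up[h, s, a, b] = (1 - alpha) * Q_up[h, s, a, b] \
        + alpha * (reward + V_up[h + 1, s_next] + beta)
    Q_lo[h, s, a, b] = (1 - alpha) * Q_lo[h, s, a, b] \
        + alpha * (reward + V_lo[h + 1, s_next] - beta)
    return t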

In part (c), our greedy policies are computed using a Coarse Correlated Equilibrium (CCE) subroutine, which was first introduced by [36] to solve Markov games using value iteration algorithms. For any pair of matrices $\overline{Q},\underline{Q}\in[0,H]^{A\times B}$, $\textsc{CCE}(\overline{Q},\underline{Q})$ returns a distribution $\pi\in\Delta_{\mathcal{A}\times\mathcal{B}}$ such that

\mathbb{E}_{(a,b)\sim\pi}\overline{Q}(a,b)\geq\max_{a^{\star}}\mathbb{E}_{(a,b)\sim\pi}\overline{Q}(a^{\star},b) \qquad (4)
\mathbb{E}_{(a,b)\sim\pi}\underline{Q}(a,b)\leq\min_{b^{\star}}\mathbb{E}_{(a,b)\sim\pi}\underline{Q}(a,b^{\star})

It can be shown that a CCE always exists, and it can be computed by linear programming in polynomial time (see Appendix B for more details).
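
As an illustration of how such a subroutine could be implemented, the following Python sketch solves (4) as a linear feasibility program with scipy; the function name cce and the flattening convention (index $a\cdot B+b$) are our own choices.

import numpy as np
from scipy.optimize import linprog

def cce(Q_up, Q_lo):
    # Return a joint distribution pi over A x B action pairs satisfying (4).
    A, B = Q_up.shape
    n = A * B  # pi is flattened with index a * B + b

    rows = []
    # For every deviation a*: sum_{a,b} pi(a,b) [Q_up(a*,b) - Q_up(a,b)] <= 0
    for a_star in range(A):
        rows.append((Q_up[a_star][None, :] - Q_up).ravel())
    # For every deviation b*: sum_{a,b} pi(a,b) [Q_lo(a,b) - Q_lo(a,b*)] <= 0
    for b_star in range(B):
        rows.append((Q_lo - Q_lo[:, b_star][:, None]).ravel())

    # Feasibility LP: zero objective, simplex constraints on pi
    res = linprog(c=np.zeros(n), A_ub=np.array(rows), b_ub=np.zeros(A + B),
                  A_eq=np.ones((1, n)), b_eq=[1.0], bounds=[(0, 1)] * n)
    return res.x.reshape(A, B)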

Now we are ready to state an intermediate guarantee for optimistic Nash Q-learning. We assume the algorithm has played the game for $K$ episodes, and we use $V^k,Q^k,N^k,\pi^k$ to denote the values, visitation counts, and policies at the beginning of the $k$-th episode in Algorithm 1.

Lemma 3.

For any $p\in(0,1]$, choose hyperparameters $\alpha_t,\beta_t$ as in (3) for a large absolute constant $c$ and $\iota=\log(SABT/p)$. Then, with probability at least $1-p$, Algorithm 1 has the following guarantees:

  • $\overline{V}_h^k(s)\geq V^{\star}_h(s)\geq\underline{V}_h^k(s)$ for all $(s,h,k)\in\mathcal{S}\times[H]\times[K]$.

  • $(1/K)\cdot\sum_{k=1}^K(\overline{V}_1^k-\underline{V}_1^k)(s_1)\leq\mathcal{O}\left(\sqrt{H^5SAB\iota/K}\right)$.

Lemma 3 makes two statements. First, it claims that the $\overline{V}_h^k(s)$ and $\underline{V}_h^k(s)$ computed in Algorithm 1 are indeed upper and lower bounds on the value of the Nash equilibrium. Second, Lemma 3 claims that the averages of the upper bounds and the lower bounds are also very close to the value of the Nash equilibrium $V_1^{\star}(s_1)$, where the gap decreases as $1/\sqrt{K}$. This implies that in order to learn the value $V_1^{\star}(s_1)$ up to $\epsilon$-accuracy, we only need $\mathcal{O}(H^5SAB\iota/\epsilon^2)$ episodes.

However, Lemma 3 has a significant drawback: it only guarantees learning the value of the Nash equilibrium. It does not imply that the policies $(\mu^k,\nu^k)$ used in Algorithm 1 are close to the Nash equilibrium, which would require the policies to have near-optimal performance even against their best responses. This is a major difference between Markov games and standard MDPs, and is the reason why standard techniques from the MDP literature do not apply here. To resolve this problem, we propose a novel way to extract a certified policy from the optimistic Nash Q-learning algorithm.

Algorithm 2 Certified Policy $\hat{\mu}$ of Nash Q-learning
1:  sample $k\leftarrow\text{Uniform}([K])$.
2:  for step $h=1,\dots,H$ do
3:     observe $s_h$, and take action $a_h\sim\mu^k_h(\cdot|s_h)$.
4:     observe $b_h$, and set $t\leftarrow N^k_h(s_h,a_h,b_h)$.
5:     sample $m\in[t]$ with $\mathbb{P}(m=i)=\alpha^i_t$.
6:     $k\leftarrow k^m_h(s_h,a_h,b_h)$.

Algorithm part II: certified policies

Our procedure for executing the certified policy $\hat{\mu}$ of the max-player is described in Algorithm 2. Above, $\mu^k_h,\nu^k_h$ denote the marginal distributions of $\pi^k_h$ produced in Algorithm 1 over the action sets $\mathcal{A},\mathcal{B}$ respectively. We also introduce the following quantities that are directly induced by $\alpha_t$:

\alpha_t^0:=\prod_{j=1}^t(1-\alpha_j),\qquad\alpha_t^i:=\alpha_i\prod_{j=i+1}^t(1-\alpha_j) \qquad (5)

whose properties are listed in Lemma 11 (Appendix C.1). In particular, $\sum_{i=1}^t\alpha_t^i=1$, so $\{\alpha_t^i\}_{i=1}^t$ defines a distribution over $[t]$. We use $k^m_h(s,a,b)$ to denote the index of the episode where $(s,a,b)$ is observed at step $h$ for the $m$-th time. The certified policy $\hat{\nu}$ of the min-player is easily defined by symmetry. We note that $\hat{\mu},\hat{\nu}$ are clearly general policies, but they are no longer Markov policies.
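
As a quick sanity check on (5) and on step 5 of Algorithm 2, the short Python sketch below computes the weights $\{\alpha_t^i\}_{i=1}^t$ induced by the learning rate $\alpha_j=(H+1)/(H+j)$ and samples an index $m$ from them; the function name is our own.

import numpy as np

def alpha_weights(t, H):
    # alpha_t^i = alpha_i * prod_{j=i+1}^t (1 - alpha_j), with alpha_j = (H+1)/(H+j)
    alpha = (H + 1) / (H + np.arange(1, t + 1))
    tail = np.append(np.cumprod((1 - alpha)[::-1])[::-1][1:], 1.0)
    return alpha * tail

H, t = 5, 10
w = alpha_weights(t, H)
assert np.isclose(w.sum(), 1.0)                   # {alpha_t^i} is a distribution over [t]
m = np.random.choice(np.arange(1, t + 1), p=w)    # step 5 of Algorithm 2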

The intuitive reason why the policy $\hat{\mu}$ defined in Algorithm 2 is certified by the Nash Q-learning algorithm is that the update equation in line 8 of Algorithm 1, together with equation (5), gives the relation:

\overline{Q}_h^k(s,a,b)=\alpha_t^0H+\sum_{i=1}^t\alpha_t^i\left[r_h(s,a,b)+\overline{V}_{h+1}^{k_h^i(s,a,b)}(s_{h+1}^{k_h^i(s,a,b)})+\beta_i\right]

This certifies good performance against the best responses if the max-player plays a mixture of policies $\{\mu_{h+1}^{k_h^i(s,a,b)}\}_{i=1}^t$ at step $h+1$ with mixing weights $\{\alpha_t^i\}_{i=1}^t$ (see Appendix C.2 for more details). A recursion of this argument leads to the certified policy $\hat{\mu}$—a nested mixture of policies.

We now present our main result for Nash Q-learning, using the certified policies $(\hat{\mu},\hat{\nu})$.

Theorem 4 (Sample Complexity of Nash Q-learning).

For any $p\in(0,1]$, choose hyperparameters $\alpha_t,\beta_t$ as in (3) for a large absolute constant $c$ and $\iota=\log(SABT/p)$. Then, with probability at least $1-p$, if we run Nash Q-learning (Algorithm 1) for $K$ episodes where

K\geq\Omega\left(H^5SAB\iota/\epsilon^2\right),

the certified policies $(\hat{\mu},\hat{\nu})$ (Algorithm 2) will be $\epsilon$-approximate Nash, i.e. $V^{\dagger,\hat{\nu}}_1(s_1)-V^{\hat{\mu},\dagger}_1(s_1)\leq\epsilon$.

Theorem 4 asserts that if we run the optimistic Nash Q-learning algorithm for more than $\mathcal{O}(H^5SAB\iota/\epsilon^2)$ episodes, the certified policies $(\hat{\mu},\hat{\nu})$ extracted using Algorithm 2 will be an $\epsilon$-approximate Nash equilibrium (Definition 2).

We make two remarks. First, executing the certified policies $\hat{\mu},\hat{\nu}$ requires storing $\{\mu^k_h\}$ and $\{\nu^k_h\}$ for all $(k,h)\in[K]\times[H]$. This makes the space complexity of our algorithm scale linearly in the total number of episodes $K$. Second, Q-learning style algorithms (especially online updates) are crucial in our analysis for achieving sample complexity linear in $S$. They enjoy the property that every sample is used only once, on a value function that is independent of this sample. In contrast, value iteration type algorithms do not enjoy such an independence property, which is why the best existing sample complexity scales as $S^2$ [2].³

³Although [1] provides techniques to improve the sample complexity from $S^2$ to $S$ for value iteration in MDPs, the same techniques cannot be applied to Markov games due to the unique challenge that, in Markov games, we aim at finding policies that are good against their best responses.

4 Optimistic Nash V-learning

In this section, we present our new algorithm Optimistic Nash V-learning and its corresponding theoretical guarantees. This algorithm improves over Nash Q-learning in sample complexity from $\tilde{\mathcal{O}}(SAB)$ to $\tilde{\mathcal{O}}(S(A+B))$, when only highlighting the dependency on $S,A,B$.

Algorithm 3 Optimistic Nash V-learning (the max-player version)
1:  Initialize: for any $(s,a,b,h)$: $\overline{V}_h(s)\leftarrow H$, $\overline{L}_h(s,a)\leftarrow 0$, $N_h(s)\leftarrow 0$, $\mu_h(a|s)\leftarrow 1/A$.
2:  for episode $k=1,\dots,K$ do
3:     receive $s_1$.
4:     for step $h=1,\dots,H$ do
5:        take action $a_h\sim\mu_h(\cdot|s_h)$, observe the action $b_h$ from the opponent.
6:        observe reward $r_h(s_h,a_h,b_h)$ and next state $s_{h+1}$.
7:        $t=N_h(s_h)\leftarrow N_h(s_h)+1$.
8:        $\overline{V}_h(s_h)\leftarrow\min\{H,(1-\alpha_t)\overline{V}_h(s_h)+\alpha_t(r_h(s_h,a_h,b_h)+\overline{V}_{h+1}(s_{h+1})+\overline{\beta}_t)\}$.
9:        for all $a\in\mathcal{A}$ do
10:          $\overline{\ell}_h(s_h,a)\leftarrow[H-r_h(s_h,a_h,b_h)-\overline{V}_{h+1}(s_{h+1})]\,\mathbb{I}\{a_h=a\}/[\mu_h(a_h|s_h)+\overline{\eta}_t]$.
11:          $\overline{L}_h(s_h,a)\leftarrow(1-\alpha_t)\overline{L}_h(s_h,a)+\alpha_t\overline{\ell}_h(s_h,a)$.
12:       set $\mu_h(\cdot|s_h)\propto\exp[-(\overline{\eta}_t/\alpha_t)\overline{L}_h(s_h,\cdot)]$.

Algorithm description

Nash V-learning combines the idea of Follow-the-Regularized-Leader (FTRL) from the bandit literature with the Q-learning algorithm in reinforcement learning. This algorithm does not require extra information exchange between players other than standard game playing, and thus can be run separately by the two players. We describe the max-player version in Algorithm 3. See Algorithm 7 in Appendix D for the min-player version, where $\underline{V}_h$, $\underline{L}_h$, $\nu_h$, $\underline{\eta}_t$ and $\underline{\beta}_t$ are defined symmetrically.

For each step in each episode, the algorithm (a) first takes an action according to $\mu_h$, and observes the action of the opponent, the reward, and the next state, (b) performs an incremental update on $\overline{V}$, and (c) updates the policy $\mu_h$. The first two parts are very similar to Nash Q-learning. In the third part, the agent first computes $\overline{\ell}_h(s_h,\cdot)$ as an importance-weighted estimator of the current loss. She then computes the weighted cumulative loss $\overline{L}_h(s_h,\cdot)$. Finally, the policy $\mu_h$ is updated using the FTRL principle:

\mu_h(\cdot|s_h)\leftarrow\mathop{\rm argmin}_{\mu\in\Delta_{\mathcal{A}}}~\overline{\eta}_t\langle\overline{L}_h(s_h,\cdot),\mu\rangle+\alpha_t\,\text{KL}(\mu\|\mu_0)

Here $\mu_0$ is the uniform distribution over all actions in $\mathcal{A}$. Solving the above minimization problem gives the update equation in Line 12 of Algorithm 3. In multi-armed bandits, FTRL can defend against adversarial losses, with regret independent of the number of the opponent's actions. This property turns out to be crucial for Nash V-learning to achieve sharper sample complexity than Nash Q-learning (see the analog of Lemma 3 in Lemma 15).
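
To illustrate lines 10-12 of Algorithm 3 in code, here is a minimal Python sketch of one FTRL step at a fixed $(h,s_h)$; the array layout and the function name are our own, and the exponentiation is stabilized by subtracting the maximum logit.

import numpy as np

def ftrl_step(L, mu, a_taken, loss, t, H, A):
    # L       : array (A,), weighted cumulative loss L_h(s_h, .)
    # mu      : array (A,), current policy mu_h(.|s_h)
    # a_taken : index of the action a_h actually played
    # loss    : the realized loss H - r_h(s_h, a_h, b_h) - V_up_{h+1}(s_{h+1})
    alpha = (H + 1) / (H + t)
    eta = np.sqrt(np.log(A) / (A * t))
    # Line 10: importance-weighted loss estimator (nonzero only for the played action)
    ell = np.zeros(A)
    ell[a_taken] = loss / (mu[a_taken] + eta)
    # Line 11: weighted cumulative loss
    L = (1 - alpha) * L + alpha * ell
    # Line 12: exponential weights, i.e. FTRL with an entropy regularizer
    logits = -(eta / alpha) * L
    mu = np.exp(logits - logits.max())
    mu /= mu.sum()
    return L, mu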

Algorithm 4 Certified Policy $\hat{\mu}$ of Nash V-learning
1:  sample $k\leftarrow\text{Uniform}([K])$.
2:  for step $h=1,\dots,H$ do
3:     observe $s_h$, and set $t\leftarrow N^k_h(s_h)$.
4:     sample $m\in[t]$ with $\mathbb{P}(m=i)=\alpha^i_t$.
5:     $k\leftarrow k^m_h(s_h)$.
6:     take action $a_h\sim\mu^k_h(\cdot|s_h)$.

Similar to Nash Q-learning, we also propose a new algorithm (Algorithm 4) to extract a certified policy from the optimistic Nash V-learning algorithm. The certified policies are again non-Markovian. We choose all hyperparameters as follows, for some large constant $c$ and log factor $\iota$:

\alpha_t=\frac{H+1}{H+t},\quad\overline{\eta}_t=\sqrt{\frac{\log A}{At}},\quad\underline{\eta}_t=\sqrt{\frac{\log B}{Bt}},\quad\overline{\beta}_t=c\sqrt{\frac{H^4A\iota}{t}},\quad\underline{\beta}_t=c\sqrt{\frac{H^4B\iota}{t}}. \qquad (6)
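
A small Python helper restating the schedules in (6); the unspecified absolute constant $c$ and log factor $\iota$ are passed in as plain arguments, and the naming is our own.

import numpy as np

def nash_v_hyperparams(t, H, A, B, iota, c):
    alpha_t = (H + 1) / (H + t)
    eta_up_t = np.sqrt(np.log(A) / (A * t))         # max-player FTRL learning rate
    eta_lo_t = np.sqrt(np.log(B) / (B * t))         # min-player FTRL learning rate
    beta_up_t = c * np.sqrt(H ** 4 * A * iota / t)  # max-player bonus
    beta_lo_t = c * np.sqrt(H ** 4 * B * iota / t)  # min-player bonus
    return alpha_t, eta_up_t, eta_lo_t, beta_up_t, beta_lo_t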

We now present our main result on the sample complexity of Nash V-learning.

Theorem 5 (Sample Complexity of Nash V-learning).

For any $p\in(0,1]$, choose hyperparameters as in (6) for a large absolute constant $c$ and $\iota=\log(SABT/p)$. Then, with probability at least $1-p$, if we run Nash V-learning (Algorithms 3 and 7) for $K$ episodes with

K\geq\Omega\left(H^6S(A+B)\iota/\epsilon^2\right),

its induced policies $(\hat{\mu},\hat{\nu})$ (Algorithm 4) will be $\epsilon$-approximate Nash, i.e. $V^{\dagger,\hat{\nu}}_1(s_1)-V^{\hat{\mu},\dagger}_1(s_1)\leq\epsilon$.

Theorem 5 asserts that if we run optimistic Nash V-learning for more than $\mathcal{O}(H^6S(A+B)\iota/\epsilon^2)$ episodes, the certified policies $(\hat{\mu},\hat{\nu})$ extracted from Algorithm 4 will be an $\epsilon$-approximate Nash equilibrium (Definition 2). Nash V-learning is the first algorithm whose sample complexity matches the information-theoretic lower bound $\Omega(H^3S(A+B)/\epsilon^2)$ up to $\mathrm{poly}(H)$ factors and logarithmic terms.

5 Hardness for Learning the Best Response

In this section, we present a computational hardness result for computing the best response against an opponent with a fixed unknown policy. We further show that this implies a computational hardness result for achieving sublinear regret in Markov games when playing against adversarial opponents, which rules out a popular approach to designing algorithms for finding Nash equilibria.

We first remark that if the opponent is restricted to only play Markov policies, then learning the best response is as easy as learning an optimal policy in a standard single-agent Markov decision process, where efficient algorithms are known to exist. Nevertheless, when the opponent can also play policies that may be non-Markovian, we show that finding the best response against those policies is computationally challenging.

We say an algorithm is a polynomial time algorithm for learning the best response if, for any policy $\nu$ of the opponent and for any $\epsilon>0$, the algorithm finds an $\epsilon$-approximate best response to policy $\nu$ (Definition 1) with probability at least $1/2$, in time polynomial in $S,H,A,B,\epsilon^{-1}$.

We can show the following hardness result for finding the best response in polynomial time.

Theorem 6 (Hardness for learning the best response).

There exists a Markov game with deterministic transitions and rewards defined for any horizon $H\geq 1$ with $S=2$, $A=2$, and $B=2$, such that if there exists a polynomial time algorithm for learning the best response for this Markov game, then there exists a polynomial time algorithm for learning parity with noise (see problem description in Appendix E).

We remark that learning parity with noise is a notoriously difficult problem that has been used to design efficient cryptographic schemes. It is conjectured by the community to be hard.

Conjecture 7 ([16]).

There is no polynomial time algorithm for learning parity with noise.

Theorem 6 together with Conjecture 7 demonstrates the fundamental difficulty—if not strict impossibility—of designing a polynomial time algorithm for learning the best responses in Markov games. The intuitive reason for such computational hardness is that, while the underlying system has Markov transitions, the opponent can play policies that encode long-term correlations of a non-Markovian nature, such as parity with noise, which makes it very challenging to find the best response. It is known that learning many other sequential models with long-term correlations (such as hidden Markov models or partially observable MDPs) is as hard as learning parity with noise [20].

5.1 Hardness for Playing Against Adversarial Opponent

Theorem 6 directly implies the difficulty of achieving sublinear regret when playing against adversarial opponents in Markov games. Our construction of hard instances in the proof of Theorem 6 further allows the adversarial opponent to play only Markov policies in each episode. Since playing against an adversarial opponent is a different problem of independent interest, we present the full result here.

Without loss of generality, we still consider the setting where the algorithm can only control the max-player, while the min-player is an adversarial opponent. At the beginning of every episode $k$, both players pick their own policies $\mu^k$ and $\nu^k$, and execute them throughout the episode. The adversarial opponent can pick her policy $\nu^k$ adaptively, based on all the observations in the earlier episodes.

We say an algorithm for the learner is a polynomial time no-regret algorithm if there exists a $\delta>0$ such that for any adversarial opponent and any fixed $K>0$, the algorithm outputs policies $\{\mu^k\}_{k=1}^K$ which satisfy the following, with probability at least $1/2$, in time polynomial in $S,H,A,B,K$:

{\rm Regret}(K)=\sup_{\mu}\sum_{k=1}^K V^{\mu,\nu^k}_1(s_1)-\sum_{k=1}^K V^{\mu^k,\nu^k}_1(s_1)\leq{\rm poly}(S,H,A,B)\,K^{1-\delta} \qquad (7)

Theorem 6 directly implies the following hardness result for achieving no-regret against adversarial opponents, a result that first appeared in [25].

Corollary 8 (Hardness for playing against adversarial opponent).

There exists a Markov game with deterministic transitions and rewards defined for any horizon $H\geq 1$ with $S=2$, $A=2$, and $B=2$, such that if there exists a polynomial time no-regret algorithm for this Markov game, then there exists a polynomial time algorithm for learning parity with noise (see problem description in Appendix E). The claim remains true even if we restrict the adversarial opponents in the Markov game to be non-adaptive, and to only play Markov policies in each episode.

Similar to Theorem 6, Corollary 8 combined with Conjecture 7 demonstrates the fundamental difficulty of designing a polynomial time no-regret algorithm against adversarial opponents for Markov games.

Implications on algorithm design for finding Nash Equilibria

Corollary 8 also rules out a natural approach for designing efficient algorithms for finding approximate Nash equilibria through combining two no-regret algorithms. In fact, it is not hard to see that if the min-player also runs a no-regret algorithm and obtains a regret bound symmetric to (7), then summing the two regret bounds shows that the pair of mixture policies $(\hat{\mu},\hat{\nu})$—which assign uniform mixing weights to the policies $\{\mu^k\}_{k=1}^K$ and $\{\nu^k\}_{k=1}^K$ respectively—is an approximate Nash equilibrium. Corollary 8 together with Conjecture 7 implies that any algorithm designed using this approach is not a polynomial time algorithm.

6 Conclusion

In this paper, we designed the first line of near-optimal self-play algorithms for finding an approximate Nash equilibrium in two-player Markov games. The sample complexity of our algorithms matches the information-theoretic lower bound up to only a polynomial factor in the length of each episode. Apart from finding Nash equilibria, we also prove the fundamental computational hardness of finding the best responses to fixed opponents, as well as of achieving sublinear regret against adversarial opponents, in Markov games.

References

  • Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR. org, 2017.
  • Bai and Jin [2020] Yu Bai and Chi Jin. Provable self-play algorithms for competitive reinforcement learning. arXiv preprint arXiv:2002.04017, 2020.
  • Baker et al. [2020] Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. Emergent tool use from multi-agent autocurricula. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkxpxJBKwS.
  • Brafman and Tennenholtz [2002] Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
  • Brambilla et al. [2013] Manuele Brambilla, Eliseo Ferrante, Mauro Birattari, and Marco Dorigo. Swarm robotics: a review from the swarm engineering perspective. Swarm Intelligence, 7(1):1–41, 2013.
  • Brown and Sandholm [2019] Noam Brown and Tuomas Sandholm. Superhuman ai for multiplayer poker. Science, 365(6456):885–890, 2019.
  • Dann et al. [2017] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
  • Dann et al. [2019] Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, pages 1507–1516, 2019.
  • Filar and Vrieze [2012] Jerzy Filar and Koos Vrieze. Competitive Markov decision processes. Springer Science & Business Media, 2012.
  • Hansen et al. [2013] Thomas Dueholm Hansen, Peter Bro Miltersen, and Uri Zwick. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM (JACM), 60(1):1–16, 2013.
  • Hu and Wellman [2003] Junling Hu and Michael P Wellman. Nash q-learning for general-sum stochastic games. Journal of machine learning research, 4(Nov):1039–1069, 2003.
  • Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • Jia et al. [2019] Zeyu Jia, Lin F Yang, and Mengdi Wang. Feature-based q-learning for two-player stochastic games. arXiv preprint arXiv:1906.00423, 2019.
  • Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4868–4878, 2018.
  • Jin et al. [2019] Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu. Learning adversarial markov decision processes with bandit feedback and unknown transition. arXiv preprint arXiv:1912.01192, 2019.
  • Kearns [1998] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.
  • Lattimore and Szepesvári [2018] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. 2018.
  • Littman [1994] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pages 157–163. Elsevier, 1994.
  • Littman [2001] Michael L Littman. Friend-or-foe q-learning in general-sum games. In ICML, volume 1, pages 322–328, 2001.
  • Mossel and Roch [2005] Elchanan Mossel and Sébastien Roch. Learning nonsingular phylogenies and hidden markov models. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 366–375, 2005.
  • Neu [2015] Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, pages 3168–3176, 2015.
  • OpenAI [2018] OpenAI. Openai five. https://blog.openai.com/openai-five/, 2018.
  • Osband and Van Roy [2016] Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.
  • Osband et al. [2014] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
  • Radanovic et al. [2019] Goran Radanovic, Rati Devidze, David Parkes, and Adish Singla. Learning to collaborate in markov decision processes. In International Conference on Machine Learning, pages 5261–5270, 2019.
  • Rosenberg and Mansour [2019] Aviv Rosenberg and Yishay Mansour. Online convex optimization in adversarial markov decision processes. arXiv preprint arXiv:1905.07773, 2019.
  • Shalev-Shwartz et al. [2016] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
  • Shapley [1953] Lloyd S Shapley. Stochastic games. Proceedings of the national academy of sciences, 39(10):1095–1100, 1953.
  • Sidford et al. [2019] Aaron Sidford, Mengdi Wang, Lin F Yang, and Yinyu Ye. Solving discounted stochastic two-player games with near-optimal time and sample complexity. arXiv preprint arXiv:1908.11071, 2019.
  • Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
  • Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
  • Strehl et al. [2006] Alexander L Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L Littman. PAC model-free reinforcement learning. In International Conference on Machine Learning, pages 881–888, 2006.
  • Vinyals et al. [2019] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • Watkins [1989] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. 1989.
  • Wei et al. [2017] Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. Online reinforcement learning in stochastic games. In Advances in Neural Information Processing Systems, pages 4987–4997, 2017.
  • Xie et al. [2020] Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. Learning zero-sum simultaneous-move markov games using function approximation and correlated equilibrium. arXiv preprint arXiv:2002.07066, 2020.
  • Yadkori et al. [2013] Yasin Abbasi Yadkori, Peter L Bartlett, Varun Kanade, Yevgeny Seldin, and Csaba Szepesvári. Online learning in markov decision processes with adversarially chosen transition probability distributions. In Advances in neural information processing systems, pages 2508–2516, 2013.
  • Zimin and Neu [2013] Alexander Zimin and Gergely Neu. Online learning in episodic markovian decision processes by relative entropy policy search. In Advances in neural information processing systems, pages 1583–1591, 2013.

Appendix A Bellman Equations for Markov Games

In this section, we present the Bellman equations for different types of values in Markov games.

Fixed policies.

For any pair of Markov policies $(\mu,\nu)$, by the definition of their values in (1) and (2), we have the following Bellman equations:

Q^{\mu,\nu}_h(s,a,b)=(r_h+\mathbb{P}_hV^{\mu,\nu}_{h+1})(s,a,b),\qquad V^{\mu,\nu}_h(s)=(\mathbb{D}_{\mu_h\times\nu_h}Q^{\mu,\nu}_h)(s)

for all $(s,a,b,h)\in\mathcal{S}\times\mathcal{A}\times\mathcal{B}\times[H]$, where $V^{\mu,\nu}_{H+1}(s)=0$ for all $s\in\mathcal{S}_{H+1}$.

Best responses.

For any Markov policy $\mu$ of the max-player, by definition, we have the following Bellman equations for the values of its best response:

Q^{\mu,\dagger}_h(s,a,b)=(r_h+\mathbb{P}_hV^{\mu,\dagger}_{h+1})(s,a,b),\qquad V^{\mu,\dagger}_h(s)=\inf_{\nu\in\Delta_{\mathcal{B}}}(\mathbb{D}_{\mu_h\times\nu}Q^{\mu,\dagger}_h)(s),

for all $(s,a,b,h)\in\mathcal{S}\times\mathcal{A}\times\mathcal{B}\times[H]$, where $V^{\mu,\dagger}_{H+1}(s)=0$ for all $s\in\mathcal{S}_{H+1}$.

Similarly, for any Markov policy $\nu$ of the min-player, we also have the following symmetric version of the Bellman equations for the values of its best response:

Q^{\dagger,\nu}_h(s,a,b)=(r_h+\mathbb{P}_hV^{\dagger,\nu}_{h+1})(s,a,b),\qquad V^{\dagger,\nu}_h(s)=\sup_{\mu\in\Delta_{\mathcal{A}}}(\mathbb{D}_{\mu\times\nu_h}Q^{\dagger,\nu}_h)(s),

for all $(s,a,b,h)\in\mathcal{S}\times\mathcal{A}\times\mathcal{B}\times[H]$, where $V^{\dagger,\nu}_{H+1}(s)=0$ for all $s\in\mathcal{S}_{H+1}$.

Nash equilibria.

Finally, by definition of Nash equilibria in Markov games, we have the following Bellman optimality equations:

Q^{\star}_h(s,a,b)=(r_h+\mathbb{P}_hV^{\star}_{h+1})(s,a,b)
V^{\star}_h(s)=\sup_{\mu\in\Delta_{\mathcal{A}}}\inf_{\nu\in\Delta_{\mathcal{B}}}(\mathbb{D}_{\mu\times\nu}Q^{\star}_h)(s)=\inf_{\nu\in\Delta_{\mathcal{B}}}\sup_{\mu\in\Delta_{\mathcal{A}}}(\mathbb{D}_{\mu\times\nu}Q^{\star}_h)(s)

for all $(s,a,b,h)\in\mathcal{S}\times\mathcal{A}\times\mathcal{B}\times[H]$, where $V^{\star}_{H+1}(s)=0$ for all $s\in\mathcal{S}_{H+1}$.

Appendix B Properties of Coarse Correlated Equilibrium

Recall the definition of CCE in the main paper (4); we restate it here after rescaling. For any pair of matrices $P,Q\in[0,1]^{n\times m}$, the subroutine $\textsc{CCE}(P,Q)$ returns a distribution $\pi\in\Delta_{n\times m}$ that satisfies:

\mathbb{E}_{(a,b)\sim\pi}P(a,b)\geq\max_{a^{\star}}\mathbb{E}_{(a,b)\sim\pi}P(a^{\star},b) \qquad (8)
\mathbb{E}_{(a,b)\sim\pi}Q(a,b)\leq\min_{b^{\star}}\mathbb{E}_{(a,b)\sim\pi}Q(a,b^{\star})

We make three remarks on CCE. First, a CCE always exists, since a Nash equilibrium of the general-sum game with payoff matrices $(P,Q)$ is also a CCE defined by $(P,Q)$, and a Nash equilibrium always exists. Second, a CCE can be efficiently computed, since the constraints (8) for a CCE can be rewritten as $n+m$ linear constraints on $\pi\in\Delta_{n\times m}$, which can be efficiently solved by standard linear programming algorithms. Third, a CCE in general-sum games need not be a Nash equilibrium. However, a CCE in zero-sum games is guaranteed to be a Nash equilibrium.

Proposition 9.

Let $\pi=\textsc{CCE}(Q,Q)$, and $(\mu,\nu)$ be the marginal distributions over both players' actions induced by $\pi$. Then $(\mu,\nu)$ is a Nash equilibrium for payoff matrix $Q$.

Proof of Proposition 9.

Let $N^{\star}$ be the value of the Nash equilibrium for $Q$. Since $\pi=\textsc{CCE}(Q,Q)$, by definition, we have:

\mathbb{E}_{(a,b)\sim\pi}Q(a,b)\geq\max_{a^{\star}}\mathbb{E}_{(a,b)\sim\pi}Q(a^{\star},b)=\max_{a^{\star}}\mathbb{E}_{b\sim\nu}Q(a^{\star},b)\geq N^{\star}
\mathbb{E}_{(a,b)\sim\pi}Q(a,b)\leq\min_{b^{\star}}\mathbb{E}_{(a,b)\sim\pi}Q(a,b^{\star})=\min_{b^{\star}}\mathbb{E}_{a\sim\mu}Q(a,b^{\star})\leq N^{\star}

This gives:

\max_{a^{\star}}\mathbb{E}_{b\sim\nu}Q(a^{\star},b)=\min_{b^{\star}}\mathbb{E}_{a\sim\mu}Q(a,b^{\star})=N^{\star}

which finishes the proof. ∎

Intuitively, a CCE procedure can be used in Nash Q-learning for finding an approximate Nash equilibrium because the upper and lower confidence values $\overline{Q}$ and $\underline{Q}$ will eventually be very close, so that the preconditions of Proposition 9 become approximately satisfied.

Appendix C Proof for Nash Q-learning

In this section, we present proofs for results in Section 3.

We write $V^k,Q^k,\pi^k$ for the values and policies at the beginning of the $k$-th episode. We also introduce the following short-hand notation $[\widehat{\mathbb{P}}_h^kV](s,a,b):=V(s_{h+1}^k)$.

We will use the following notation several times later: suppose $(s,a,b)$ was taken in episodes $k^1,k^2,\ldots$ at the $h$-th step. Since the definition of $k^i$ depends on the tuple $(s,a,b)$ and $h$, we will show the dependence explicitly by writing $k_h^i(s,a,b)$ when necessary and omit it when there is no confusion. We also define $N_h^k(s,a,b)$ to be the number of times $(s,a,b)$ has been taken at the beginning of the $k$-th episode. Finally, we denote $n_h^k=N_h^k(s_h^k,a_h^k,b_h^k)$.

The following lemma is a simple consequence of the update rule in Algorithm 1, which will be used several times later.

Lemma 10.

Let t=Nhk(s,a,b)t=N_{h}^{k}\left(s,a,b\right) and suppose (s,a,b)(s,a,b) was previously taken at episodes k1,,kt<kk^{1},\ldots,k^{t}<k at the hh-th step. The update rule in Algorithm 1 is equivalent to the following equations.

Q¯hk(s,a,b)=αt0H+i=1tαti[rh(s,a,b)+V¯h+1ki(sh+1ki)+βi]\overline{Q}_{h}^{k}(s,a,b)=\alpha_{t}^{0}H+\sum_{i=1}^{t}{\alpha_{t}^{i}\left[r_{h}(s,a,b)+\overline{V}_{h+1}^{k^{i}}(s_{h+1}^{k^{i}})+\beta_{i}\right]} (9)
Q¯hk(s,a,b)=i=1tαti[rh(s,a,b)+V¯h+1ki(sh+1ki)βi]\underline{Q}_{h}^{k}(s,a,b)=\sum_{i=1}^{t}{\alpha_{t}^{i}\left[r_{h}(s,a,b)+\underline{V}_{h+1}^{k^{i}}(s_{h+1}^{k^{i}})-\beta_{i}\right]} (10)
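As a sanity check of Lemma 10, here is a short numerical verification (a sketch only; we assume the step size αt=(H+1)/(H+t)\alpha_{t}=(H+1)/(H+t) from the main text, and use arbitrary sequences in place of the realized rewards, values, and bonuses, since the identity is purely algebraic):

```python
import numpy as np

H, T = 10, 50
rng = np.random.default_rng(0)
alpha = lambda t: (H + 1) / (H + t)     # step size assumed from the main text

r = rng.random(T)                       # stand-ins for r_h(s, a, b)
V_next = rng.random(T) * H              # stand-ins for the next-step value estimates
beta = rng.random(T)                    # stand-ins for the bonuses beta_i

# Incremental update as performed by Algorithm 1, starting from the optimistic init H.
Q_incr = float(H)
for t in range(1, T + 1):
    Q_incr = (1 - alpha(t)) * Q_incr + alpha(t) * (r[t - 1] + V_next[t - 1] + beta[t - 1])

# Closed form (9): alpha_t^0 * H + sum_i alpha_t^i * (r_i + V_i + beta_i), where
# alpha_t^0 = prod_{j=1}^t (1 - alpha_j) and alpha_t^i = alpha_i * prod_{j=i+1}^t (1 - alpha_j).
def alpha_ti(i, t):
    out = alpha(i)
    for j in range(i + 1, t + 1):
        out *= 1 - alpha(j)
    return out

alpha_t0 = np.prod([1 - alpha(j) for j in range(1, T + 1)])
Q_closed = alpha_t0 * H + sum(
    alpha_ti(i, T) * (r[i - 1] + V_next[i - 1] + beta[i - 1]) for i in range(1, T + 1)
)
assert np.isclose(Q_incr, Q_closed)
```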

C.1 Learning values

We begin with an auxiliary lemma. Some of the analysis in this section is adapted from [14], which studies Q-learning in the single-agent MDP setting.

Lemma 11.

([14, Lemma 4.1]) The following properties hold for αti\alpha_{t}^{i}:

  1. 1.

    1ti=1tαtii2t\frac{1}{\sqrt{t}}\leq\sum_{i=1}^{t}\frac{\alpha^{i}_{t}}{\sqrt{i}}\leq\frac{2}{\sqrt{t}} for every t1t\geq 1.

  2. 2.

    maxi[t]αti2Ht\max_{i\in[t]}\alpha^{i}_{t}\leq\frac{2H}{t} and i=1t(αti)22Ht\sum_{i=1}^{t}(\alpha^{i}_{t})^{2}\leq\frac{2H}{t} for every t1t\geq 1.

  3. 3.

    t=iαti=1+1H\sum_{t=i}^{\infty}\alpha^{i}_{t}=1+\frac{1}{H} for every i1i\geq 1.

We also define β~t:=2i=1tαtiβi𝒪(H3ι/t)\tilde{\beta}_{t}:=2\sum_{i=1}^{t}{\alpha_{t}^{i}\beta_{i}}\leq\mathcal{O}(\sqrt{H^{3}\iota/t}). Now we are ready to prove Lemma 3.
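For concreteness, the stated bound on β~t\tilde{\beta}_{t} follows directly from property 1 of Lemma 11, assuming the bonus βi=cH3ι/i\beta_{i}=c\sqrt{H^{3}\iota/i} chosen in the main text:

\tilde{\beta}_{t}=2\sum_{i=1}^{t}\alpha_{t}^{i}\beta_{i}=2c\sqrt{H^{3}\iota}\sum_{i=1}^{t}\frac{\alpha_{t}^{i}}{\sqrt{i}}\leq\frac{4c\sqrt{H^{3}\iota}}{\sqrt{t}}=\mathcal{O}\left(\sqrt{H^{3}\iota/t}\right).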

Proof of Lemma 3.

We give the proof for one direction; the other direction is similar. For the proof of the first claim, let t=Nhk(s,a,b)t=N_{h}^{k}\left(s,a,b\right) and suppose (s,a,b)(s,a,b) was previously taken at episodes k1,,kt<kk^{1},\ldots,k^{t}<k at the hh-th step. Let i\mathcal{F}_{i} be the σ\sigma-algebra generated by all the random variables up to the kik^{i}-th episode. Then {αti[(^hkih)Vh+1](s,a,b)}i=1t\{\alpha_{t}^{i}[(\widehat{\mathbb{P}}_{h}^{k^{i}}-\mathbb{P}_{h})V_{h+1}^{\star}]\left(s,a,b\right)\}_{i=1}^{t} is a martingale difference sequence w.r.t. the filtration {i}i=1t\{\mathcal{F}_{i}\}_{i=1}^{t}. By Azuma-Hoeffding,

|i=1tαti[(^hkih)Vh+1](s,a,b)|2Hi=1t(αti)2ιβ~t\left|\sum_{i=1}^{t}{\alpha_{t}^{i}\left[\left(\widehat{\mathbb{P}}_{h}^{k^{i}}-\mathbb{P}_{h}\right)V_{h+1}^{\star}\right]\left(s,a,b\right)}\right|\leq 2H\sqrt{\sum_{i=1}^{t}{\left(\alpha_{t}^{i}\right)^{2}\iota}}\leq\tilde{\beta}_{t}

Here we prove a stronger version of the first claim by induction: for any (s,a,b,h,k)𝒮×𝒜××[H]×[K](s,a,b,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{B}\times[H]\times[K],

Q¯hk(s,a,b)Qh(s,a,b)Q¯hk(s,a,b),V¯hk(s)Vh(s)V¯hk(s).\overline{Q}_{h}^{k}(s,a,b)\geq Q^{\star}_{h}(s,a,b)\geq\underline{Q}_{h}^{k}(s,a,b),\,\,\,\,\overline{V}_{h}^{k}(s)\geq V^{\star}_{h}(s)\geq\underline{V}_{h}^{k}(s).

Suppose the guarantee is true for h+1h+1, then by the above concentration result,

(Q¯hkQh)(s,a,b)αt0H+i=1tαti(V¯h+1kiVh+1)(sh+1ki)0.(\overline{Q}_{h}^{k}-Q^{\star}_{h})(s,a,b)\geq\alpha_{t}^{0}H+\sum_{i=1}^{t}{\alpha_{t}^{i}\left(\overline{V}_{h+1}^{k^{i}}-V_{h+1}^{\star}\right)\left(s_{h+1}^{k^{i}}\right)}\geq 0.

Also,

\overline{V}_{h}^{k}(s)-V^{\star}_{h}(s)=(\mathbb{D}_{\pi^{k}_{h}}\overline{Q}_{h}^{k})(s)-\max_{\mu\in\Delta_{\mathcal{A}}}\min_{\nu\in\Delta_{\mathcal{B}}}(\mathbb{D}_{\mu\times\nu}Q_{h}^{\star})(s)
\geq\max_{\mu\in\Delta_{\mathcal{A}}}(\mathbb{D}_{\mu\times\nu^{k}_{h}}\overline{Q}_{h}^{k})(s)-\max_{\mu\in\Delta_{\mathcal{A}}}(\mathbb{D}_{\mu\times\nu^{k}_{h}}Q_{h}^{\star})(s)\geq 0

where Q¯hk(s,a,b)Qh(s,a,b)\overline{Q}_{h}^{k}(s,a,b)\geq Q_{h}^{\star}(s,a,b) has just been proved. The other direction is proved similarly.

Now we continue with the proof of the second claim. Let t=nhkt=n_{h}^{k} and define δhk:=(V¯hkV¯hk)(shk)\delta_{h}^{k}:=\left(\overline{V}_{h}^{k}-\underline{V}_{h}^{k}\right)\left(s_{h}^{k}\right), then by definition

δhk=\displaystyle\delta_{h}^{k}= 𝔼(a,b)πhk(Q¯hkQ¯hk)(shk,a,b)=(Q¯hkQ¯hk)(shk,ahk,bhk)+ζhk\displaystyle\mathbb{E}_{\left(a,b\right)\sim\pi_{h}^{k}}\left(\overline{Q}_{h}^{k}-\underline{Q}_{h}^{k}\right)\left(s_{h}^{k},a,b\right)=\left(\overline{Q}_{h}^{k}-\underline{Q}_{h}^{k}\right)\left(s_{h}^{k},a_{h}^{k},b_{h}^{k}\right)+\zeta_{h}^{k}
=(i)αt0H+i=1tαtiδh+1khi(shk,ahk,bhk)+2β~t+ζhk\displaystyle\overset{\left(i\right)}{=}\alpha_{t}^{0}H+\sum_{i=1}^{t}{\alpha_{t}^{i}\delta_{h+1}^{k_{h}^{i}(s_{h}^{k},a_{h}^{k},b_{h}^{k})}}+2\tilde{\beta}_{t}+\zeta_{h}^{k}

where (i)(i) is by taking the difference of equation (9) and equation (10) and

ζhk:=𝔼(a,b)πhk(Q¯hkQ¯hk)(shk,a,b)(Q¯hkQ¯hk)(shk,ahk,bhk)\zeta_{h}^{k}:=\mathbb{E}_{\left(a,b\right)\sim\pi_{h}^{k}}\left(\overline{Q}_{h}^{k}-\underline{Q}_{h}^{k}\right)\left(s_{h}^{k},a,b\right)-\left(\overline{Q}_{h}^{k}-\underline{Q}_{h}^{k}\right)\left(s_{h}^{k},a_{h}^{k},b_{h}^{k}\right)

is a martingale difference sequence.

Taking the summation w.r.t. kk, we begin with the first two terms,

k=1Kαnhk0H=k=1KH𝕀{nhk=0}SABH\sum_{k=1}^{K}{\alpha_{n_{h}^{k}}^{0}H}=\sum_{k=1}^{K}{H\mathbb{I}\left\{n_{h}^{k}=0\right\}}\leq SABH
k=1Ki=1nhkαnhkiδh+1khi(shk,ahk,bhk)(i)k=1Kδh+1ki=nhk+1αinhk(ii)(1+1H)k=1Kδh+1k.\sum_{k=1}^{K}{\sum_{i=1}^{n_{h}^{k}}{\alpha_{n_{h}^{k}}^{i}\delta_{h+1}^{k_{h}^{i}\left(s_{h}^{k},a_{h}^{k},b_{h}^{k}\right)}}}\overset{\left(i\right)}{\leq}\sum_{k^{\prime}=1}^{K}{\delta_{h+1}^{k^{\prime}}\sum_{i=n_{h}^{k^{\prime}}+1}^{\infty}{\alpha_{i}^{n_{h}^{k^{\prime}}}}}\overset{\left(ii\right)}{\leq}\left(1+\frac{1}{H}\right)\sum_{k=1}^{K}{\delta_{h+1}^{k}}.

where (i)(i) is by changing the order of summation and (ii)(ii) is by Lemma 11.

Plugging them in,

k=1KδhkSABH+(1+1H)k=1Kδh+1k+k=1K(2β~nhk+ζhk).\displaystyle\sum_{k=1}^{K}{\delta_{h}^{k}}\leq SABH+\left(1+\frac{1}{H}\right)\sum_{k=1}^{K}{\delta_{h+1}^{k}}+\sum_{k=1}^{K}{\left(2\tilde{\beta}_{n_{h}^{k}}+\zeta_{h}^{k}\right)}.

Recursing this argument for h[H]h\in[H] gives

k=1Kδ1keSABH2+2eh=1Hk=1Kβ~nhk+h=1Hk=1K(1+1/H)h1ζhk\sum_{k=1}^{K}{\delta_{1}^{k}}\leq eSABH^{2}+2e\sum_{h=1}^{H}\sum_{k=1}^{K}{\tilde{\beta}_{n_{h}^{k}}}+\sum_{h=1}^{H}\sum_{k=1}^{K}{(1+1/H)^{h-1}\zeta_{h}^{k}}

By pigeonhole argument,

k=1Kβ~nhk𝒪(1)k=1KH3ιnhk=𝒪(1)s,a,bn=1NhK(s,a,b)H3ιn𝒪(H3SABKι)=𝒪(H2SABTι)\sum_{k=1}^{K}{\tilde{\beta}_{n_{h}^{k}}}\leq\mathcal{O}\left(1\right)\sum_{k=1}^{K}{\sqrt{\frac{H^{3}\iota}{n_{h}^{k}}}}=\mathcal{O}\left(1\right)\sum_{s,a,b}{\sum_{n=1}^{N_{h}^{K}\left(s,a,b\right)}{\sqrt{\frac{H^{3}\iota}{n}}}}\leq\mathcal{O}\left(\sqrt{H^{3}SABK\iota}\right)=\mathcal{O}\left(\sqrt{H^{2}SABT\iota}\right)

By Azuma-Hoeffding,

h=1Hk=1K(1+1/H)h1ζhke2H3Kι=eH2Tι\sum_{h=1}^{H}{\sum_{k=1}^{K}{(1+1/H)^{h-1}\zeta_{h}^{k}}}\leq e\sqrt{2H^{3}K\iota}=eH\sqrt{2T\iota}

with high probability. The proof is completed by putting everything together.
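Concretely, the last step combines the three displays above (the additive SABH2SABH^{2} term is dominated by the square-root term once TSABT\geq SAB, which holds in the regime of Theorem 4):

\sum_{k=1}^{K}\delta_{1}^{k}\leq eSABH^{2}+\mathcal{O}\left(H\sqrt{H^{2}SABT\iota}\right)+\mathcal{O}\left(H\sqrt{T\iota}\right)=\mathcal{O}\left(\sqrt{H^{4}SABT\iota}\right).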

C.2 Certified policies

Algorithm 1 only learns the value of the game, but by itself it does not give a near-optimal policy for each player. In this section, we analyze the certified policies based on the above exploration process (Algorithm 2) and prove the sample complexity guarantee. To this end, we first need to define a new group of policies μ^hk\hat{\mu}_{h}^{k} (Algorithm 5) to facilitate the proof; ν^hk\hat{\nu}_{h}^{k} is defined similarly. Notice μ^hk\hat{\mu}_{h}^{k} is related to μ^\hat{\mu} defined in Algorithm 2 by μ^=1ki=1kμ^1i\hat{\mu}=\frac{1}{k}\sum_{i=1}^{k}{\hat{\mu}_{1}^{i}}.

Algorithm 5 Policy μ^hk\hat{\mu}_{h}^{k}
1:  Initialize: kkk^{\prime}\leftarrow k.
2:  for step h=h,h+1,,Hh^{\prime}=h,h+1,\dots,H do
3:     Observe shs_{h^{\prime}}.
4:     Sample ah′∼μh′k′(sh′)a_{h^{\prime}}\sim\mu^{k^{\prime}}_{h^{\prime}}(s_{h^{\prime}}).
5:     Observe bhb_{h^{\prime}}.
6:     t←Nh′k′(sh′,ah′,bh′)t\leftarrow N^{k^{\prime}}_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}},b_{h^{\prime}}).
7:     Sample ii from [t][t] with (i)=αti\mathbb{P}(i)=\alpha^{i}_{t}.
8:     kkhi(sh,ah,bh)k^{\prime}\leftarrow k^{i}_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}},b_{h^{\prime}})

We also define μ^h+1k[s,a,b]\hat{\mu}_{h+1}^{k}[s,a,b] for hH1h\leq H-1, which is an intermediate policy (Algorithm 6) only involved in the analysis. The above two policies are related by μ^h+1k[s,a,b]=i=1tαtiμ^h+1ki\hat{\mu}_{h+1}^{k}\left[s,a,b\right]=\sum_{i=1}^{t}{\alpha_{t}^{i}}\hat{\mu}_{h+1}^{k^{i}} where t=Nhk(s,a,b)t=N_{h}^{k}\left(s,a,b\right). ν^h+1k[s,a,b]\hat{\nu}_{h+1}^{k}[s,a,b] is defined similarly.

Algorithm 6 Policy μ^h+1k[s,a,b]\hat{\mu}_{h+1}^{k}[s,a,b]
1:  tNhk(s,a,b)t\leftarrow N^{k}_{h}(s,a,b).
2:  Sample ii from [t][t] with (i)=αti\mathbb{P}(i)=\alpha^{i}_{t}.
3:  kkhi(s,a,b)k^{\prime}\leftarrow k^{i}_{h}(s,a,b)
4:  for step h=h+1,,Hh^{\prime}=h+1,\dots,H do
5:     Observe shs_{h^{\prime}}.
6:     Sample ah′∼μh′k′(sh′)a_{h^{\prime}}\sim\mu^{k^{\prime}}_{h^{\prime}}(s_{h^{\prime}}).
7:     Observe bhb_{h^{\prime}}.
8:     t←Nh′k′(sh′,ah′,bh′)t\leftarrow N^{k^{\prime}}_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}},b_{h^{\prime}}).
9:     Sample ii from [t][t] with (i)=αti\mathbb{P}(i)=\alpha^{i}_{t}.
10:     kkhi(sh,ah,bh)k^{\prime}\leftarrow k^{i}_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}},b_{h^{\prime}})

Since the policies defined in Algorithm 5 and Algorithm 6 are non-Markov, many notations for values of Markov policies are no longer valid here. To this end, we need to define the value and Q-value of general policies starting from step hh, as long as those policies do not depend on the history before the hh-th step. Notice the special case h=1h=1 has already been covered in Section 2. For a pair of general policies (μ,ν)(\mu,\nu) which do not depend on the history before the hh-th step, we can still use the same definitions (1) and (2) to define their value Vhμ,ν(s)V_{h}^{\mu,\nu}(s) and Qhμ,ν(s,a,b)Q_{h}^{\mu,\nu}(s,a,b) at step hh. We can also define the best response ν(μ)\nu^{\dagger}(\mu) of a general policy μ\mu as the minimizing policy so that Vhμ,(s)Vhμ,ν(μ)(s)=infνVhμ,ν(s)V_{h}^{\mu,\dagger}(s)\equiv V_{h}^{\mu,\nu^{\dagger}(\mu)}(s)=\inf_{\nu}V_{h}^{\mu,\nu}(s) at step hh. Similarly, we can define Qhμ,(s,a,b)Qhμ,ν(μ)(s,a,b)=infνQhμ,ν(s,a,b)Q_{h}^{\mu,\dagger}(s,a,b)\equiv Q_{h}^{\mu,\nu^{\dagger}(\mu)}(s,a,b)=\inf_{\nu}Q_{h}^{\mu,\nu}(s,a,b). As before, the best response of a general policy is not necessarily Markovian.

It should be clear from the definitions in Algorithm 5 and Algorithm 6 that μ^hk\hat{\mu}_{h}^{k}, ν^hk\hat{\nu}_{h}^{k}, μ^h+1k[s,a,b]\hat{\mu}_{h+1}^{k}[s,a,b] and ν^h+1k[s,a,b]\hat{\nu}_{h+1}^{k}[s,a,b] do not depend on the history before step hh; therefore the related value and Q-value functions are well defined for the corresponding steps. Now we can show the policies defined above are indeed certified.

Lemma 12.

For any p(0,1)p\in(0,1), with probability at least 1p1-p, the following holds for any (s,a,b,h,k)𝒮×𝒜××[H]×[K](s,a,b,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{B}\times[H]\times[K],

Q¯hk(s,a,b)Qh,ν^h+1k[s,a,b](s,a,b),V¯hk(s)Vh,ν^hk(s)\overline{Q}_{h}^{k}(s,a,b)\geq Q_{h}^{{\dagger},\hat{\nu}_{h+1}^{k}[s,a,b]}(s,a,b),\,\,\,\overline{V}_{h}^{k}(s)\geq V_{h}^{{\dagger},\hat{\nu}_{h}^{k}}(s)
Q¯hk(s,a,b)Qhμ^h+1k[s,a,b],(s,a,b),V¯hk(s)Vhμ^hk,(s)\underline{Q}_{h}^{k}(s,a,b)\leq Q_{h}^{\hat{\mu}_{h+1}^{k}[s,a,b],{\dagger}}(s,a,b),\,\,\,\underline{V}_{h}^{k}(s)\leq V_{h}^{\hat{\mu}_{h}^{k},{\dagger}}(s)
Proof of Lemma 12.

We first prove this for h=Hh=H.

Q¯Hk(s,a,b)=\displaystyle\overline{Q}_{H}^{k}(s,a,b)= αt0H+i=1tαti[rH(s,a,b)+βi]\displaystyle\alpha_{t}^{0}H+\sum_{i=1}^{t}{\alpha_{t}^{i}\left[r_{H}(s,a,b)+\beta_{i}\right]}
\displaystyle\geq rH(s,a,b)=QH,ν^H+1k(s,a,b)\displaystyle r_{H}(s,a,b)=Q_{H}^{{\dagger},\hat{\nu}_{H+1}^{k}}\left(s,a,b\right)

because HH is the last step and

V¯Hk(s)=\displaystyle\overline{V}_{H}^{k}(s)= (𝔻πHkQ¯Hk)(s)supμΔ𝒜(𝔻μ×νHkQ¯Hk)(s)\displaystyle(\mathbb{D}_{\pi^{k}_{H}}\overline{Q}_{H}^{k})(s)\geq\sup_{\mu\in\Delta_{\mathcal{A}}}(\mathbb{D}_{\mu\times\nu^{k}_{H}}\overline{Q}_{H}^{k})(s)
\displaystyle\geq supμΔ𝒜(𝔻μ×νHkrH)(s)=VH,νHk(s)=VH,ν^Hk(s)\displaystyle\sup_{\mu\in\Delta_{\mathcal{A}}}(\mathbb{D}_{\mu\times\nu^{k}_{H}}r_{H})(s)=V_{H}^{{\dagger},\nu_{H}^{k}}\left(s\right)=V_{H}^{{\dagger},\hat{\nu}_{H}^{k}}\left(s\right)

because πHk\pi_{H}^{k} is a CCE, and by definition ν^Hk=νHk\hat{\nu}_{H}^{k}=\nu_{H}^{k}.

Now suppose the claim is true for h+1h+1, and consider the hh case. Consider a fixed tuple (s,a,b)(s,a,b) and let t=Nhk(s,a,b)t=N_{h}^{k}\left(s,a,b\right). Suppose (s,a,b)(s,a,b) was previously taken at episodes k1,,kt<kk^{1},\ldots,k^{t}<k at the hh-th step. Let i\mathcal{F}_{i} be the σ\sigma-algebra generated by all the random variables up to the kik^{i}-th episode. Then {αti[rh(s,a,b)+Vh+1,ν^h+1ki(sh+1ki)+βi]}i=1t\{\alpha_{t}^{i}[r_{h}(s,a,b)+V_{h+1}^{{\dagger},\hat{\nu}_{h+1}^{k^{i}}}(s_{h+1}^{k^{i}})+\beta_{i}]\}_{i=1}^{t} is a martingale difference sequence w.r.t. the filtration {i}i=1t\{\mathcal{F}_{i}\}_{i=1}^{t}. By Azuma-Hoeffding and the definition of βi\beta_{i},

i=1tαti[rh(s,a,b)+Vh+1,ν^h+1ki(sh+1ki)+βi]i=1tαtiQh,ν^h+1ki(s,a,b)\sum_{i=1}^{t}{\alpha_{t}^{i}\left[r_{h}(s,a,b)+V_{h+1}^{{\dagger},\hat{\nu}_{h+1}^{k^{i}}}(s_{h+1}^{k^{i}})+\beta_{i}\right]}\geq\sum_{i=1}^{t}{\alpha_{t}^{i}Q_{h}^{{\dagger},\hat{\nu}_{h+1}^{k^{i}}}(s,a,b)}

with high probability. Combining this with the induction hypothesis,

Q¯hk(s,a,b)=\displaystyle\overline{Q}_{h}^{k}(s,a,b)= αt0H+i=1tαti[rh(s,a,b)+V¯h+1ki(sh+1ki)+βi]\displaystyle\alpha_{t}^{0}H+\sum_{i=1}^{t}{\alpha_{t}^{i}\left[r_{h}(s,a,b)+\overline{V}_{h+1}^{k^{i}}(s_{h+1}^{k^{i}})+\beta_{i}\right]}
\displaystyle\geq i=1tαti[rh(s,a,b)+Vh+1,ν^h+1ki(sh+1ki)+βi]i=1tαtiQh,ν^h+1ki(s,a,b)\displaystyle\sum_{i=1}^{t}{\alpha_{t}^{i}\left[r_{h}(s,a,b)+V_{h+1}^{{\dagger},\hat{\nu}_{h+1}^{k^{i}}}(s_{h+1}^{k^{i}})+\beta_{i}\right]}\geq\sum_{i=1}^{t}{\alpha_{t}^{i}Q_{h}^{{\dagger},\hat{\nu}_{h+1}^{k^{i}}}(s,a,b)}
(i)\displaystyle\overset{\left(i\right)}{\geq} max𝜇i=1tαtiQhμ,ν^h+1ki(s,a,b)=Qh,ν^h+1k[s,a,b](s,a,b)\displaystyle\underset{\mu}{\max}\sum_{i=1}^{t}{\alpha_{t}^{i}Q_{h}^{\mu,\hat{\nu}_{h+1}^{k^{i}}}(s,a,b)}=Q_{h}^{{\dagger},\hat{\nu}_{h+1}^{k}[s,a,b]}(s,a,b)

where in (i)(i) we have taken the maximum operator out of the summation, which does not increase the sum.

On the other hand,

V¯hk(s)=\displaystyle\overline{V}_{h}^{k}(s)= (𝔻πhkQ¯hk)(s)(i)supμΔ𝒜(𝔻μ×νhkQ¯hk)(s)\displaystyle(\mathbb{D}_{\pi^{k}_{h}}\overline{Q}_{h}^{k})(s)\overset{\left(i\right)}{\geq}\sup_{\mu\in\Delta_{\mathcal{A}}}(\mathbb{D}_{\mu\times\nu^{k}_{h}}\overline{Q}_{h}^{k})(s)
(ii)\displaystyle\overset{\left(ii\right)}{\geq} maxa𝒜𝔼bνhkQh,ν^h+1k[s,a,b](s,a,b)=Vh,ν^hk(s)\displaystyle\max_{a\in\mathcal{A}}\mathbb{E}_{b\sim\nu^{k}_{h}}Q_{h}^{{\dagger},\hat{\nu}_{h+1}^{k}[s,a,b]}(s,a,b)=V_{h}^{{\dagger},\hat{\nu}_{h}^{k}}(s)

where (i)(i) is by the definition of CCE and (ii)(ii) uses the bound on Q¯hk\overline{Q}_{h}^{k} established above. The other direction is proved by performing similar arguments on Q¯hk(s,a,b)\underline{Q}_{h}^{k}(s,a,b), Qhμ^h+1k[s,a,b],(s,a,b)Q_{h}^{\hat{\mu}_{h+1}^{k}[s,a,b],{\dagger}}(s,a,b), V¯hk(s)\underline{V}_{h}^{k}(s) and Vhμ^hk,(s)V_{h}^{\hat{\mu}_{h}^{k},{\dagger}}(s). ∎

Finally we give the theoretical guarantee of the policies defined above.

Proof of Theorem 4.

By Lemma 12, we have

k=1K(V1,ν^1kV1μ^1k,)(s1)k=1K(V¯1kV¯1k)(s1)\sum_{k=1}^{K}{\left(V_{1}^{{\dagger},\hat{\nu}_{1}^{k}}-V_{1}^{\hat{\mu}_{1}^{k},{\dagger}}\right)\left(s_{1}\right)}\leq\sum_{k=1}^{K}{\left(\overline{V}_{1}^{k}-\underline{V}_{1}^{k}\right)\left(s_{1}\right)}

and Lemma 3 upper bounds this quantity by

k=1K(V1,ν^1kV1μ^1k,)(s1)𝒪(H4SABTι)\sum_{k=1}^{K}{\left(V_{1}^{{\dagger},\hat{\nu}_{1}^{k}}-V_{1}^{\hat{\mu}_{1}^{k},{\dagger}}\right)\left(s_{1}\right)}\leq\mathcal{O}\left(\sqrt{H^{4}SABT\iota}\right)
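Since the certified policies μ^\hat{\mu} and ν^\hat{\nu} of Algorithm 2 play μ^1k\hat{\mu}_{1}^{k} and ν^1k\hat{\nu}_{1}^{k} with kk drawn uniformly from [K][K], their values are the corresponding averages over kk; a short calculation (using T=HKT=HK in the display above) then gives

V_{1}^{\dagger,\hat{\nu}}(s_{1})-V_{1}^{\hat{\mu},\dagger}(s_{1})\leq\frac{1}{K}\sum_{k=1}^{K}\left(V_{1}^{\dagger,\hat{\nu}_{1}^{k}}-V_{1}^{\hat{\mu}_{1}^{k},\dagger}\right)(s_{1})\leq\mathcal{O}\left(\sqrt{H^{5}SAB\iota/K}\right).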

By definition of the induced policy, with probability at least 1p1-p, if we run Nash Q-learning (Algorithm 1) for KK episodes with

KΩ(H5SABιϵ2),K\geq\Omega\left(\frac{H^{5}SAB\iota}{\epsilon^{2}}\right),

its induced policies (μ^,ν^)(\hat{\mu},\hat{\nu}) (Algorithm 2) will be ϵ\epsilon-optimal in the sense V1,ν^(s1)V1μ^,(s1)ϵV^{\dagger,\hat{\nu}}_{1}(s_{1})-V^{\hat{\mu},\dagger}_{1}(s_{1})\leq\epsilon. ∎

Appendix D Proof for Nash V-learning

In this section, we present proofs of the results in Section 4. We denote Vk,μk,νkV^{k},\mu^{k},\nu^{k} for values and policies at the beginning of the kk-th episode. We also introduce the following short-hand notation [^hkV](s,a,b):=V(sh+1k)[\widehat{\mathbb{P}}_{h}^{k}V](s,a,b):=V(s_{h+1}^{k}).

We will use the following notations several times later: suppose the state ss was visited at episodes k1,k2,k^{1},k^{2},\ldots at the hh-th step. Since the definition of kik^{i} depends on the state ss, we will show the dependence explicitly by writing khi(s)k_{h}^{i}(s) when necessary and omit it when there is no confusion. We also define Nhk(s)N_{h}^{k}(s) to be the number of times the state ss has been visited at the beginning of the kk-th episode. Finally we denote nhk=Nhk(shk)n_{h}^{k}=N_{h}^{k}\left(s_{h}^{k}\right). Notice the definitions here are different from those in Appendix C.

The following lemma is a simple consequence of the update rule in Algorithm 3, which will be used several times later.

Lemma 13.

Let t=Nhk(s)t=N_{h}^{k}\left(s\right) and suppose ss was previously visited at episodes k1,,kt<kk^{1},\ldots,k^{t}<k at the hh-th step. The update rule in Algorithm 3 is equivalent to the following equations.

V¯hk(s)=αt0H+i=1tαti[rh(s,ahki,bhki)+V¯h+1ki(sh+1ki)+β¯i]\overline{V}_{h}^{k}(s)=\alpha_{t}^{0}H+\sum_{i=1}^{t}{\alpha_{t}^{i}\left[r_{h}(s,a_{h}^{k^{i}},b_{h}^{k^{i}})+\overline{V}_{h+1}^{k^{i}}(s_{h+1}^{k^{i}})+\overline{\beta}_{i}\right]} (11)
V¯hk(s)=i=1tαti[rh(s,ahkhi,bhkhi)+V¯h+1khi(sh+1khi)β¯i]\underline{V}_{h}^{k}(s)=\sum_{i=1}^{t}{\alpha_{t}^{i}\left[r_{h}(s,a_{h}^{k_{h}^{i}},b_{h}^{k_{h}^{i}})+\underline{V}_{h+1}^{k_{h}^{i}}(s_{h+1}^{k_{h}^{i}})-\underline{\beta}_{i}\right]} (12)

D.1 Missing algorithm details

We first give Algorithm 7, the min-player counterpart of Algorithm 3. Almost everything is symmetric, except that the loss function is defined differently so that it remains non-negative.

Algorithm 7 Optimistic Nash V-learning (the min-player version)
1:  Initialize: for any (s,a,b,h)(s,a,b,h), V¯h(s)0\underline{V}_{h}(s)\leftarrow 0, L¯h(s,b)0\underline{L}_{h}(s,b)\leftarrow 0, Nh(s)0N_{h}(s)\leftarrow 0, νh(b|s)1/B\nu_{h}(b|s)\leftarrow 1/B.
2:  for episode k=1,,Kk=1,\dots,K do
3:     receive s1s_{1}.
4:     for step h=1,,Hh=1,\dots,H do
5:        take action bhνh(|sh)b_{h}\sim\nu_{h}(\cdot|s_{h}), observe the action aha_{h} from opponent
6:        observe reward rh(sh,ah,bh)r_{h}(s_{h},a_{h},b_{h}) and next state sh+1s_{h+1}.
7:        t=Nh(sh)Nh(sh)+1t=N_{h}(s_{h})\leftarrow N_{h}(s_{h})+1.
8:        V¯h(sh)max{0,(1αt)V¯h(sh)+αt(rh(sh,ah,bh)+V¯h+1(sh+1)β¯t)}\underline{V}_{h}(s_{h})\leftarrow\max\{0,(1-\alpha_{t})\underline{V}_{h}(s_{h})+\alpha_{t}(r_{h}(s_{h},a_{h},b_{h})+\underline{V}_{h+1}(s_{h+1})-\underline{\beta}_{t})\}
9:        for all bb\in\mathcal{B} do
10:           ¯h(sh,b)[rh(sh,ah,bh)+V¯h+1(sh+1)]𝕀{bh=b}/[νh(bh|sh)+η¯t]\underline{\ell}_{h}(s_{h},b)\leftarrow[r_{h}(s_{h},a_{h},b_{h})+\underline{V}_{h+1}(s_{h+1})]\mathbb{I}\{b_{h}=b\}/[\nu_{h}(b_{h}|s_{h})+\underline{\eta}_{t}].
11:           L¯h(sh,b)(1αt)L¯h(sh,b)+αt¯h(sh,b)\underline{L}_{h}(s_{h},b)\leftarrow(1-\alpha_{t})\underline{L}_{h}(s_{h},b)+\alpha_{t}\underline{\ell}_{h}(s_{h},b).
12:        set νh(|sh)exp[(η¯t/αt)L¯h(sh,)]\nu_{h}(\cdot|s_{h})\propto\exp[-(\underline{\eta}_{t}/\alpha_{t})\underline{L}_{h}(s_{h},\cdot)].
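To illustrate lines 7–12 of Algorithm 7 in isolation, here is a minimal sketch of the min-player's update at a fixed (sh,h)(s_{h},h); it is not the full learner, and it takes as given the step size αt=(H+1)/(H+t)\alpha_{t}=(H+1)/(H+t) from the main text, the choice \underline{\eta}_{t}=\sqrt{\log B/(Bt)} from Lemma 14, and a bonus function beta_fn standing for \underline{\beta}_{t}:

```python
import numpy as np

def min_player_step(V_low, V_low_next, nu, L_low, n_visit, b_t, r, H, B, beta_fn):
    """One update of Algorithm 7 at a fixed (s_h, h) after observing (a_h, b_h, r_h, s_{h+1}).

    V_low: current lower value at s_h, V_low_next: lower value at s_{h+1},
    nu: current policy over the B min-player actions, L_low: accumulated losses,
    n_visit: visit count of s_h before this visit, b_t: index of the action taken,
    beta_fn(t): the bonus at visit count t.  Returns the updated quantities.
    """
    t = n_visit + 1                                  # line 7: increment the count
    alpha_t = (H + 1) / (H + t)                      # step size (assumed from the main text)
    eta_t = np.sqrt(np.log(B) / (B * t))             # step size from Lemma 14
    # Line 8: pessimistic value update, clipped at 0.
    V_low = max(0.0, (1 - alpha_t) * V_low + alpha_t * (r + V_low_next - beta_fn(t)))
    # Lines 9-10: importance-weighted loss estimate, nonzero only for the action taken.
    ell = np.zeros(B)
    ell[b_t] = (r + V_low_next) / (nu[b_t] + eta_t)
    # Line 11: weighted running average of the losses.
    L_low = (1 - alpha_t) * L_low + alpha_t * ell
    # Line 12: exponential weights on the accumulated losses (numerically stabilized).
    logits = -(eta_t / alpha_t) * L_low
    nu = np.exp(logits - logits.max())
    nu /= nu.sum()
    return V_low, nu, L_low, t
```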

D.2 Learning values

As usual, we begin with learning the value VV^{\star} of the Markov game, starting with an auxiliary lemma which justifies our choice of confidence bound.

Lemma 14.

Let t=Nhk(s)t=N_{h}^{k}\left(s\right) and suppose state ss was previously visited at episodes k1,,kt<kk^{1},\ldots,k^{t}<k at the hh-th step. Choosing η¯t=logAAt\overline{\eta}_{t}=\sqrt{\frac{\log A}{At}} and η¯t=logBBt\underline{\eta}_{t}=\sqrt{\frac{\log B}{Bt}}, with probability 1p1-p, for any (s,h,t)𝒮×[H]×[K](s,h,t)\in\mathcal{S}\times[H]\times[K], there exists a constant cc s.t.

max𝜇i=1tαti𝔻μ×νhki(rh+hV¯h+1ki)(s)i=1tαti[rh(s,ahki,bhki)+V¯h+1ki(sh+1ki)]c2H4Aι/t\underset{\mu}{\max}\sum_{i=1}^{t}{\alpha_{t}^{i}\mathbb{D}_{\mu\times\nu_{h}^{k^{i}}}\left(r_{h}+\mathbb{P}_{h}\overline{V}_{h+1}^{k^{i}}\right)\left(s\right)}-\sum_{i=1}^{t}{\alpha_{t}^{i}\left[r_{h}\left(s,a_{h}^{k^{i}},b_{h}^{k^{i}}\right)+\overline{V}_{h+1}^{k^{i}}\left(s_{h+1}^{k^{i}}\right)\right]}\leq c\sqrt{2H^{4}A\iota/t}
\sum_{i=1}^{t}{\alpha_{t}^{i}\left[r_{h}\left(s,a_{h}^{k^{i}},b_{h}^{k^{i}}\right)+\underline{V}_{h+1}^{k^{i}}\left(s_{h+1}^{k^{i}}\right)\right]}-\underset{\nu}{\min}\sum_{i=1}^{t}{\alpha_{t}^{i}\mathbb{D}_{\mu_{h}^{k^{i}}\times\nu}\left(r_{h}+\mathbb{P}_{h}\underline{V}_{h+1}^{k^{i}}\right)\left(s\right)}\leq c\sqrt{2H^{4}B\iota/t}
Proof of Lemma 14.

We prove the first inequality; the proof of the second is similar. Throughout the proof we consider a fixed (s,h,t)𝒮×[H]×[K](s,h,t)\in\mathcal{S}\times[H]\times[K]. Define i\mathcal{F}_{i} as the σ\sigma-algebra generated by all the random variables before the khik_{h}^{i}-th episode. Conditioned on i\mathcal{F}_{i}, the term r_{h}(s,a_{h}^{k^{i}},b_{h}^{k^{i}})+\overline{V}_{h+1}^{k^{i}}(s_{h+1}^{k^{i}}) has conditional expectation \mathbb{D}_{\mu_{h}^{k^{i}}\times\nu_{h}^{k^{i}}}(r_{h}+\mathbb{P}_{h}\overline{V}_{h+1}^{k^{i}})(s), so the centered and αti\alpha_{t}^{i}-weighted terms form a martingale difference sequence w.r.t. the filtration {i}i=1t\{\mathcal{F}_{i}\}_{i=1}^{t}. By Azuma-Hoeffding,

i=1tαti𝔻μhki×νhki(rh+hV¯h+1ki)(s)i=1tαti[rh(s,ahki,bhki)+V¯h+1ki(sh+1ki)]2H3ι/t\sum_{i=1}^{t}{\alpha_{t}^{i}\mathbb{D}_{\mu_{h}^{k^{i}}\times\nu_{h}^{k^{i}}}\left(r_{h}+\mathbb{P}_{h}\overline{V}_{h+1}^{k^{i}}\right)\left(s\right)}-\sum_{i=1}^{t}{\alpha_{t}^{i}\left[r_{h}\left(s,a_{h}^{k^{i}},b_{h}^{k^{i}}\right)+\overline{V}_{h+1}^{k^{i}}\left(s_{h+1}^{k^{i}}\right)\right]}\leq 2\sqrt{H^{3}\iota/t}

So we only need to bound

max𝜇i=1tαti𝔻μ×νhki(rh+hV¯h+1ki)(s)i=1tαti𝔻μhki×νhki(rh+hV¯h+1ki)(s):=Rt\underset{\mu}{\max}\sum_{i=1}^{t}{\alpha_{t}^{i}\mathbb{D}_{\mu\times\nu_{h}^{k^{i}}}\left(r_{h}+\mathbb{P}_{h}\overline{V}_{h+1}^{k^{i}}\right)\left(s\right)}-\sum_{i=1}^{t}{\alpha_{t}^{i}\mathbb{D}_{\mu_{h}^{k^{i}}\times\nu_{h}^{k^{i}}}\left(r_{h}+\mathbb{P}_{h}\overline{V}_{h+1}^{k^{i}}\right)\left(s\right)}:=R^{\star}_{t} (13)

where RtR^{\star}_{t} is the weighted regret in the first tt times of visiting state ss, with respect to the optimal policy in hindsight, in the following adversarial bandit problem. The loss function is defined by

li(a)=𝔼bνhki(s){Hh+1rh(s,a,b)hV¯h+1ki(s,a,b)}l_{i}(a)=\mathbb{E}_{b\sim\nu_{h}^{k^{i}}\left(s\right)}\{H-h+1-r_{h}\left(s,a,b\right)-\mathbb{P}_{h}\overline{V}_{h+1}^{k^{i}}\left(s,a,b\right)\}

with weight wi=αtiw_{i}=\alpha^{i}_{t}. We note the weighted regret can be rewritten as Rt=i=1twiμhμhki,liR_{t}^{\star}=\sum_{i=1}^{t}{w_{i}\left<\mu_{h}^{\star}-\mu_{h}^{k_{i}},l_{i}\right>} where μh\mu_{h}^{\star} is the argmax in (13), and the loss function satisfies li(a)[0,H]l_{i}(a)\in[0,H].

Therefore, Algorithm 3 is essentially performing the Follow-the-Regularized-Leader (FTRL) algorithm with changing step size, separately for each state, to solve this adversarial bandit problem. The policy being used is μhki(s,)\mu_{h}^{k^{i}}\left(s,\cdot\right), and the optimistically biased estimator

l^i(a)=Hh+1rh(shki,ahki,bhki)V¯h+1ki(sh+1ki)μhki(s,a)+η¯i𝕀{ahki=a}\hat{l}_{i}(a)=\frac{H-h+1-r_{h}(s_{h}^{k^{i}},a_{h}^{k^{i}},b_{h}^{k^{i}})-\overline{V}^{k^{i}}_{h+1}(s_{h+1}^{k^{i}})}{\mu_{h}^{k^{i}}\left(s,a\right)+\overline{\eta}_{i}}\cdot\mathbb{I}\left\{a_{h}^{k^{i}}=a\right\}

is used to handle the bandit feedback.

A more detailed discussion on how to solve the weighted adversarial bandit problem is included in Appendix F. Note that wi=αtiw_{i}=\alpha_{t}^{i} is monotonically increasing in ii, i.e. maxitwi=wt\max_{i\leq t}w_{i}=w_{t}. By Lemma 17, we have

Rt\displaystyle R^{\star}_{t} 2HαttAtι+3HAι2i=1tαtii+12Hαttι+H2ιi=1t(αti)2\displaystyle\leq 2H\alpha_{t}^{t}\sqrt{At\iota}+\frac{3H\sqrt{A\iota}}{2}\sum_{i=1}^{t}{\frac{\alpha_{t}^{i}}{\sqrt{i}}}+\frac{1}{2}H\alpha_{t}^{t}\iota+H\sqrt{2\iota\sum_{i=1}^{t}{\left(\alpha_{t}^{i}\right)^{2}}}
4H2Aι/t+3HAι/t+H2ι/t+4H3ι/t\displaystyle\leq 4H^{2}\sqrt{A\iota/t}+3H\sqrt{A\iota/t}+H^{2}\iota/t+\sqrt{4H^{3}\iota/t}
10H2Aι/t\displaystyle\leq 10H^{2}\sqrt{A\iota/t}

with probability 1p/(SHK)1-p/(SHK). Finally by a union bound over all (s,h,t)𝒮×[H]×[K](s,h,t)\in\mathcal{S}\times[H]\times[K], we finish the proof. ∎

We now prove the following Lemma 15, which is an analogue of Lemma 3 for Nash Q-learning.

Lemma 15.

For any p(0,1]p\in(0,1], choose hyperparameters as in (6) for large absolute constant cc and ι=log(SABT/p)\iota=\log(SABT/p). Then, with probability at least 1p1-p, Algorithm 3 and 7 will jointly provide the following guarantees

  • V¯hk(s)Vh(s)V¯hk(s)\overline{V}_{h}^{k}(s)\geq V^{\star}_{h}(s)\geq\underline{V}_{h}^{k}(s) for all (s,h,k)𝒮×[K]×[H](s,h,k)\in\mathcal{S}\times[K]\times[H].

  • (1/K)k=1K(V¯1kV¯1k)(s1)𝒪(H6S(A+B)ι/K)(1/K)\cdot\sum_{k=1}^{K}(\overline{V}_{1}^{k}-\underline{V}_{1}^{k})(s_{1})\leq\mathcal{O}{\left(\sqrt{H^{6}S(A+B)\iota/K}\right)}.

Proof of Lemma 15.

We prove the first claim by backward induction. The claim is true for h=H+1h=H+1. Assume that for any ss, V¯h+1k(s)Vh+1(s)\overline{V}_{h+1}^{k}(s)\geq V_{h+1}^{\star}\left(s\right) and V¯h+1k(s)Vh+1(s)\underline{V}_{h+1}^{k}(s)\leq V_{h+1}^{\star}\left(s\right). For a fixed (s,h)𝒮×[H](s,h)\in\mathcal{S}\times[H] and episode k[K]k\in[K], let t=Nhk(s)t=N_{h}^{k}\left(s\right) and suppose ss was previously visited at episodes k1,,kt<kk^{1},\ldots,k^{t}<k at the hh-th step. By the Bellman equation,

Vh(s)=\displaystyle V_{h}^{\star}\left(s\right)= max𝜇min𝜈𝔻μ×ν(rh+hVh+1)(s)\displaystyle\underset{\mu}{\max}\underset{\nu}{\min}\mathbb{D}_{\mu\times\nu}\left(r_{h}+\mathbb{P}_{h}V_{h+1}^{\star}\right)\left(s\right)
=\displaystyle= max𝜇i=1tαtimin𝜈𝔻μ×ν(rh+hVh+1)(s)\displaystyle\underset{\mu}{\max}\sum_{i=1}^{t}{\alpha_{t}^{i}}\underset{\nu}{\min}\mathbb{D}_{\mu\times\nu}\left(r_{h}+\mathbb{P}_{h}V_{h+1}^{\star}\right)\left(s\right)
\displaystyle\leq max𝜇i=1tαti𝔻μ×νhki(rh+hVh+1)(s)\displaystyle\underset{\mu}{\max}\sum_{i=1}^{t}{\alpha_{t}^{i}\mathbb{D}_{\mu\times\nu_{h}^{k^{i}}}\left(r_{h}+\mathbb{P}_{h}V_{h+1}^{\star}\right)\left(s\right)}
\displaystyle\leq max𝜇i=1tαti𝔻μ×νhki(rh+hV¯h+1ki)(s)\displaystyle\underset{\mu}{\max}\sum_{i=1}^{t}{\alpha_{t}^{i}\mathbb{D}_{\mu\times\nu_{h}^{k^{i}}}\left(r_{h}+\mathbb{P}_{h}\overline{V}_{h+1}^{k^{i}}\right)\left(s\right)}

Comparing with the decomposition of V¯hk(s)\overline{V}_{h}^{k}(s) in Equation (11) and using Lemma 14, we can see that if β¯t=cAH4ι/t\overline{\beta}_{t}=c\sqrt{AH^{4}\iota/t}, then V¯hk(s)Vh(s)\overline{V}_{h}^{k}(s)\geq V_{h}^{\star}\left(s\right). Similarly, by taking β¯t=cBH4ι/t\underline{\beta}_{t}=c\sqrt{BH^{4}\iota/t}, we also have V¯hk(s)Vh(s)\underline{V}_{h}^{k}(s)\leq V_{h}^{\star}\left(s\right).

The second claim is to bound δhk:=V¯hk(shk)V¯hk(shk)0\delta_{h}^{k}:=\overline{V}_{h}^{k}(s_{h}^{k})-\underline{V}_{h}^{k}(s_{h}^{k})\geq 0. Similar to what we have done in the Nash Q-learning analysis, taking the difference of Equation (11) and Equation (12),

δhk=\displaystyle\delta_{h}^{k}= V¯hk(shk)V¯hk(shk)\displaystyle\overline{V}_{h}^{k}(s_{h}^{k})-\underline{V}_{h}^{k}(s_{h}^{k})
=\displaystyle= αnhk0H+i=1nhkαnhki[(V¯h+1khi(shk)V¯h+1khi(shk))(sh+1khi(shk))+β¯i+β¯i]\displaystyle\alpha_{n_{h}^{k}}^{0}H+\sum_{i=1}^{n_{h}^{k}}{\alpha_{n_{h}^{k}}^{i}\left[\left(\overline{V}_{h+1}^{k_{h}^{i}(s_{h}^{k})}-\underline{V}_{h+1}^{k_{h}^{i}(s_{h}^{k})}\right)\left(s_{h+1}^{k_{h}^{i}(s_{h}^{k})}\right)+\overline{\beta}_{i}+\underline{\beta}_{i}\right]}
=\displaystyle= αnhk0H+i=1nhkαnhkiδh+1khi(shk)+β~nhk\displaystyle\alpha_{n_{h}^{k}}^{0}H+\sum_{i=1}^{n_{h}^{k}}{\alpha_{n_{h}^{k}}^{i}\delta_{h+1}^{k_{h}^{i}(s_{h}^{k})}}+\tilde{\beta}_{n_{h}^{k}}

where

β~j:=i=1jαji(β¯i+β¯i)≤c√((A+B)H4ι/j).\tilde{\beta}_{j}:=\sum_{i=1}^{j}{\alpha_{j}^{i}(\overline{\beta}_{i}+\underline{\beta}_{i})}\leq c\sqrt{(A+B)H^{4}\iota/j}.

Taking the summation w.r.t. kk, we begin with the first two terms,

k=1Kαnhk0H=k=1KH𝕀{nhk=0}SH\sum_{k=1}^{K}{\alpha_{n_{h}^{k}}^{0}H}=\sum_{k=1}^{K}{H\mathbb{I}\left\{n_{h}^{k}=0\right\}}\leq SH
k=1Ki=1nhkαnhkiδh+1khi(shk)(i)k=1Kδh+1ki=nhk+1αinhk(ii)(1+1H)k=1Kδh+1k.\sum_{k=1}^{K}{\sum_{i=1}^{n_{h}^{k}}{\alpha_{n_{h}^{k}}^{i}\delta_{h+1}^{k_{h}^{i}\left(s_{h}^{k}\right)}}}\overset{\left(i\right)}{\leq}\sum_{k^{\prime}=1}^{K}{\delta_{h+1}^{k^{\prime}}\sum_{i=n_{h}^{k^{\prime}}+1}^{\infty}{\alpha_{i}^{n_{h}^{k^{\prime}}}}}\overset{\left(ii\right)}{\leq}\left(1+\frac{1}{H}\right)\sum_{k=1}^{K}{\delta_{h+1}^{k}}.

where (i)(i) is by changing the order of summation and (ii)(ii) is by Lemma 11. Putting them together,

k=1Kδhk=\displaystyle\sum_{k=1}^{K}{\delta_{h}^{k}}= k=1Kαnhk0H+k=1Ki=1nhkαnhkiδh+1khi(shk)+k=1Kβ~nhk\displaystyle\sum_{k=1}^{K}{\alpha_{n_{h}^{k}}^{0}H}+\sum_{k=1}^{K}{\sum_{i=1}^{n_{h}^{k}}{\alpha_{n_{h}^{k}}^{i}\delta_{h+1}^{k_{h}^{i}\left(s_{h}^{k}\right)}}}+\sum_{k=1}^{K}{\tilde{\beta}_{n_{h}^{k}}}
\displaystyle\leq HS+(1+1H)k=1Kδh+1k+k=1Kβ~nhk\displaystyle HS+\left(1+\frac{1}{H}\right)\sum_{k=1}^{K}{\delta_{h+1}^{k}}+\sum_{k=1}^{K}{\tilde{\beta}_{n_{h}^{k}}}

Recursing this argument for h[H]h\in[H] gives

k=1Kδ1keSH2+eh=1Hk=1Kβ~nhk\sum_{k=1}^{K}{\delta_{1}^{k}}\leq eSH^{2}+e\sum_{h=1}^{H}\sum_{k=1}^{K}{\tilde{\beta}_{n_{h}^{k}}}

By pigeonhole argument,

k=1Kβ~nhk\displaystyle\sum_{k=1}^{K}{\tilde{\beta}_{n_{h}^{k}}} 𝒪(1)k=1K(A+B)H4ιnhk=𝒪(1)sn=1nhK(s)(A+B)H4ιn\displaystyle\leq\mathcal{O}\left(1\right)\sum_{k=1}^{K}{\sqrt{\frac{(A+B)H^{4}\iota}{n_{h}^{k}}}}=\mathcal{O}\left(1\right)\sum_{s}{\sum_{n=1}^{n_{h}^{K}\left(s\right)}{\sqrt{\frac{(A+B)H^{4}\iota}{n}}}}
𝒪(H4S(A+B)Kι)=𝒪(H3S(A+B)Tι)\displaystyle\leq\mathcal{O}\left(\sqrt{H^{4}S(A+B)K\iota}\right)=\mathcal{O}\left(\sqrt{H^{3}S(A+B)T\iota}\right)

Putting the above two displays together, we have

\sum_{k=1}^{K}{[\overline{V}_{1}^{k}-\underline{V}_{1}^{k}](s_{1})}\leq\mathcal{O}(\sqrt{H^{5}S(A+B)T\iota}),

which finishes the proof. ∎

D.3 Certified policies

As before, we construct a series of new policies μ^hk\hat{\mu}_{h}^{k} in Algorithm 8. Notice μ^hk\hat{\mu}_{h}^{k} is related to μ^\hat{\mu} defined in Algorithm 4 by μ^=1ki=1kμ^1i\hat{\mu}=\frac{1}{k}\sum_{i=1}^{k}{\hat{\mu}_{1}^{i}}. Also, we need to consider value and Q-value functions of general policies which do not depend on the history before the hh-th step. See Appendix C.2 for details. Again, we can show the policies defined above are indeed certified.

Algorithm 8 Policy μ^hk\hat{\mu}_{h}^{k}
1:  sample kUniform([K])k\leftarrow\text{Uniform}([K]).
2:  for step h=h,h+1,,Hh^{\prime}=h,h+1,\dots,H do
3:     observe shs_{h^{\prime}}, and set tNhk(sh)t\leftarrow N^{k}_{h^{\prime}}(s_{h^{\prime}}).
4:     sample m[t]m\in[t] with (m=i)=αti\mathbb{P}(m=i)=\alpha^{i}_{t}.
5:     kkhm(sh)k\leftarrow k^{m}_{h^{\prime}}(s_{h^{\prime}}).
6:     take action ahμhk(|sh)a_{h^{\prime}}\sim\mu^{k}_{h^{\prime}}(\cdot|s_{h^{\prime}}).
Lemma 16.

For any p(0,1)p\in(0,1), with probability at least 1p1-p, the following holds for any (s,a,b,h,k)𝒮×𝒜××[H]×[K](s,a,b,h,k)\in\mathcal{S}\times\mathcal{A}\times\mathcal{B}\times[H]\times[K],

V¯hk(s)Vh,ν^hk(s),V¯hk(s)Vhμ^hk,(s)\overline{V}_{h}^{k}(s)\geq V_{h}^{{\dagger},\hat{\nu}_{h}^{k}}(s),\,\,\,\underline{V}_{h}^{k}(s)\leq V_{h}^{\hat{\mu}_{h}^{k},{\dagger}}(s)
Proof of Lemma 16.

We prove one side by induction; the other side is similar. The claim is trivially satisfied for h=H+1h=H+1. Suppose it is true for h+1h+1, and consider a fixed state ss. Let t=Nhk(s)t=N_{h}^{k}\left(s\right) and suppose ss was previously visited at episodes k1,,kt<kk^{1},\ldots,k^{t}<k at the hh-th step. Then using Lemma 13,

V¯hk(s)\displaystyle\overline{V}_{h}^{k}(s) =αt0H+i=1tαti[rh(s,ahki,bhkhi)+V¯h+1ki(sh+1ki)+β¯i]\displaystyle=\alpha_{t}^{0}H+\sum_{i=1}^{t}{\alpha_{t}^{i}\left[r_{h}(s,a_{h}^{k^{i}},b_{h}^{k_{h}^{i}})+\overline{V}_{h+1}^{k^{i}}(s_{h+1}^{k^{i}})+\overline{\beta}_{i}\right]}
(i)max𝜇i=1tαti𝔻μ×νhki(rh+hV¯h+1ki)(s)\displaystyle\overset{\left(i\right)}{\geq}\underset{\mu}{\max}\sum_{i=1}^{t}{\alpha_{t}^{i}\mathbb{D}_{\mu\times\nu_{h}^{k^{i}}}\left(r_{h}+\mathbb{P}_{h}\overline{V}_{h+1}^{k^{i}}\right)\left(s\right)}
(ii)max𝜇i=1tαti𝔻μ×νhki(rh+hVh+1,ν^h+1ki)(s)\displaystyle\overset{\left(ii\right)}{\geq}\underset{\mu}{\max}\sum_{i=1}^{t}{\alpha_{t}^{i}\mathbb{D}_{\mu\times\nu_{h}^{k^{i}}}\left(r_{h}+\mathbb{P}_{h}V_{h+1}^{{\dagger},\hat{\nu}_{h+1}^{k^{i}}}\right)\left(s\right)}
=Vh,ν^hk(s)\displaystyle=V_{h}^{{\dagger},\hat{\nu}_{h}^{k}}(s)

where (i)(i) is by using Lemma 14 and the definition of β¯i\overline{\beta}_{i}, and (ii)(ii) is by induction hypothesis. ∎

Equipped with the above lemmas, we are now ready to prove Theorem 5.

Proof of Theorem 5.

By Lemma 16, we have

k=1K(V1,ν^1kV1μ^1k,)(s1)k=1K(V¯1kV¯1k)(s1)\sum_{k=1}^{K}{\left(V_{1}^{{\dagger},\hat{\nu}_{1}^{k}}-V_{1}^{\hat{\mu}_{1}^{k},{\dagger}}\right)\left(s_{1}\right)}\leq\sum_{k=1}^{K}{\left(\overline{V}_{1}^{k}-\underline{V}_{1}^{k}\right)\left(s_{1}\right)}

and Lemma 15 upper bounds this quantity by

k=1K(V1,ν^1kV1μ^1k,)(s1)𝒪(H5S(A+B)Tι)\sum_{k=1}^{K}{\left(V_{1}^{{\dagger},\hat{\nu}_{1}^{k}}-V_{1}^{\hat{\mu}_{1}^{k},{\dagger}}\right)\left(s_{1}\right)}\leq\mathcal{O}\left(\sqrt{H^{5}S(A+B)T\iota}\right)

By definition of the induced policy, with probability at least 1p1-p, if we run Nash V-learning (Algorithm 3) for KK episodes with

KΩ(H6S(A+B)ιϵ2),K\geq\Omega\left(\frac{H^{6}S(A+B)\iota}{\epsilon^{2}}\right),

its induced policies (μ^,ν^)(\hat{\mu},\hat{\nu}) (Algorithm 4) will be ϵ\epsilon-optimal in the sense V1,ν^(s1)V1μ^,(s1)ϵV^{\dagger,\hat{\nu}}_{1}(s_{1})-V^{\hat{\mu},\dagger}_{1}(s_{1})\leq\epsilon. ∎

Appendix E Proofs of Hardness for Learning the Best Responses

In this section we give the proofs of Theorem 6 and Corollary 8. Our proof is inspired by a computational hardness result for adversarial MDPs in [37, Section 4.2], which constructs a family of adversarial MDPs that are computationally as hard as an agnostic parity learning problem.

Sections E.1, E.2 and E.3 are devoted to proving Theorem 6, while Corollary 8 is proved in Section E.4. Towards proving Theorem 6, we will:

  • (Section E.1) Construct a Markov game.

  • (Section E.2) Define a series of problems where a solution to one problem implies a solution to the next.

  • (Section E.3) Based on the believed computational hardness of learning parities with noise (Conjecture 7), we conclude that finding the best response against non-Markov policies is computationally hard.

E.1 Markov game construction

We now describe a Markov game inspired by the adversarial MDP in [37, Section 4.2]. We define a Markov game with 2H2H states: {i0,i1}i=2H\left\{i_{0},i_{1}\right\}_{i=2}^{H}, 101_{0} (the initial state) and \bot (the terminal state). (In [37] the states are denoted by {ia,ib}i=2H\left\{i_{a},i_{b}\right\}_{i=2}^{H} instead; we slightly change the notation to distinguish the states from the actions.) In each state the max-player has two actions a0a_{0} and a1a_{1}, while the min-player has two actions b0b_{0} and b1b_{1}. The transition kernel is deterministic, and the next state for steps hH1h\leq H-1 is defined in Table 2:

State/Action (a0,b0)(a_{0},b_{0}) (a0,b1)(a_{0},b_{1}) (a1,b0)(a_{1},b_{0}) (a1,b1)(a_{1},b_{1})
i0i_{0} (i+1)0(i+1)_{0} (i+1)0(i+1)_{0} (i+1)0(i+1)_{0} (i+1)1(i+1)_{1}
i1i_{1} (i+1)1(i+1)_{1} (i+1)1(i+1)_{1} (i+1)1(i+1)_{1} (i+1)0(i+1)_{0}
Table 2: Transition kernel of the hard instance.

At the HH-th step, i.e. at states H0H_{0} and H1H_{1}, the next state is always \bot regardless of the actions chosen by the two players. The reward function is always 0 except at the HH-th step, where the reward is determined by the current state and the action of the min-player, as defined in Table 3:

State/Action (,b0)(\cdot,b_{0}) (,b1)(\cdot,b_{1})
H0H_{0} 11 0
H1H_{1} 0 11
Table 3: Reward of the hard instance.

At the beginning of every episode kk, both players pick their own policies μk\mu_{k} and νk\nu_{k}, and execute them throughout the episode. The min-player can pick her policy νk\nu_{k} adaptively, based on all the observations in earlier episodes. The only difference from the standard Markov game protocol is that the min-player's actions at all steps except the last are revealed at the beginning of each episode, to match the setting of agnostic learning of parities (Problem 2 below). Therefore we are actually considering an easier problem (for the max-player), and the lower bound naturally applies.
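A minimal sketch of one episode of this hard instance is given below (an illustration only; it assumes the transition rule in which the state bit flips exactly on the joint action (a1,b1)(a_{1},b_{1}), as in Table 2, which is what makes the final state encode the parity used in Section E.2):

```python
import numpy as np

def play_episode(T_hat, x, y, H):
    """Simulate one episode of the hard instance in Appendix E.1.

    T_hat: the set of steps at which the max-player plays a_1 (its parity guess),
    x: the min-player's bits for steps 1, ..., H-1, y: its action bit at step H.
    Returns the max-player's reward for this episode.
    """
    bit = 0                              # start in state 1_0
    for h in range(1, H):                # steps 1, ..., H-1
        a = 1 if h in T_hat else 0       # max-player action determined by T_hat
        b = int(x[h - 1])                # min-player action encodes the bit x_h
        bit ^= a & b                     # Table 2: the state bit flips iff (a_1, b_1)
    # Step H (Table 3): reward 1 iff the state bit matches the min-player's action y.
    return 1 if bit == y else 0

# Sanity check: the final bit is the parity of x restricted to T_hat, so the
# reward is exactly the indicator of {phi_{T_hat}(x) = y}.
rng = np.random.default_rng(0)
H, T_true = 6, {2, 4}
x = rng.integers(0, 2, size=H - 1)
y = int(np.bitwise_xor.reduce([int(x[h - 1]) for h in T_true]))  # noiseless label
assert play_episode(T_true, x, y, H) == 1
```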

E.2 A series of computationally hard problems

We first introduce a series of problems and then show how the reduction works.

Problem 1

The max-player ϵ\epsilon-approximates the best response for any general policy ν\nu in the Markov game defined in Appendix E.1 with probability at least 1/21/2, in poly(H,1/ϵ)\mathrm{poly}(H,1/\epsilon) time.

Problem 2

Let x=(x1,,xn)x=\left(x_{1},\cdots,x_{n}\right) be a vector in {0,1}n\{0,1\}^{n}, T[n]T\subseteq\left[n\right] and 0<α<1/20<\alpha<1/2. The parity of xx on TT is the boolean function ϕT(x)=iTxi\phi_{T}\left(x\right)=\oplus_{i\in T}x_{i}. In words, ϕT(x)\phi_{T}\left(x\right) outputs 0 if the number of ones in the subvector (xi)iT(x_{i})_{i\in T} is even and 11 otherwise. A uniform query oracle for this problem is a randomized algorithm that returns a uniformly random vector xx, as well as a noisy classification f(x)f(x) which is equal to ϕT(x)\phi_{T}(x) w.p. 1α1-\alpha and to 1ϕT(x)1-\phi_{T}(x) w.p. α\alpha. All examples returned by the oracle are independent. The learning parity with noise problem consists in designing an algorithm with access to the oracle such that,

  • (Problem 2.1) w.p. at least 1/21/2, find a (possibly random) function h:{0,1}n{0,1}h:\{0,1\}^{n}\rightarrow\{0,1\} satisfying 𝔼hPx[h(x)ϕT(x)]ϵ\mathbb{E}_{h}P_{x}[h(x)\neq\phi_{T}(x)]\leq\epsilon, in poly(n,1/ϵ)\mathrm{poly}(n,1/\epsilon) time.

  • (Problem 2.2) w.p. at least 1/41/4, find h:{0,1}n{0,1}h:\{0,1\}^{n}\rightarrow\{0,1\} satisfying Px[h(x)ϕT(x)]ϵP_{x}[h(x)\neq\phi_{T}(x)]\leq\epsilon, in poly(n,1/ϵ)\mathrm{poly}(n,1/\epsilon) time.

  • (Problem 2.3) w.p. at least 1p1-p, find h:{0,1}n{0,1}h:\{0,1\}^{n}\rightarrow\{0,1\} satisfying Px[h(x)ϕT(x)]ϵP_{x}[h(x)\neq\phi_{T}(x)]\leq\epsilon, in poly(n,1/ϵ,1/p)\mathrm{poly}(n,1/\epsilon,1/p) time.

We remark that Problem 2.3 is the formal definition of learning parity with noise [20, Definition 2], which is conjectured to be computationally hard in the community (see also Conjecture 7).

Problem 2.3 reduces to Problem 2.2

Step 1: Repeatedly apply the algorithm for Problem 2.2 \ell times to get h1,,hh_{1},\dots,h_{\ell} such that miniPx[hi(x)ϕT(x)]ϵ\min_{i}P_{x}[h_{i}(x)\neq\phi_{T}(x)]\leq\epsilon with probability at least 1(3/4)1-(3/4)^{\ell}. This costs poly(n,,1/ϵ)\mathrm{poly}(n,\ell,1/\epsilon) time. Let i=argminierrii_{\star}=\mathop{\rm argmin}_{i}{\rm err}_{i} where erri=Px[hi(x)ϕT(x)]{\rm err}_{i}=P_{x}[h_{i}(x)\neq\phi_{T}(x)].

Step 2: Construct estimators using NN additional samples (x(j),y(j))j=1N{(x^{(j)},y^{(j)})}_{j=1}^{N} from the oracle,

err^i:=1Nj=1N𝕀{hi(x(j))y(j)}α12α.\hat{\rm err}_{i}\mathrel{\mathop{:}}=\frac{\frac{1}{N}\sum_{j=1}^{N}\mathbb{I}\{h_{i}(x^{(j)})\neq y^{(j)}\}-\alpha}{1-2\alpha}.
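This estimator is unbiased: since the label is flipped w.p. α\alpha,

\mathbb{E}\,\mathbb{I}\{h_{i}(x^{(j)})\neq y^{(j)}\}=(1-\alpha)\,{\rm err}_{i}+\alpha(1-{\rm err}_{i})=\alpha+(1-2\alpha)\,{\rm err}_{i},

so 𝔼err^i=erri\mathbb{E}\,\hat{\rm err}_{i}={\rm err}_{i}.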

Pick i^=argminierr^i\hat{i}=\mathop{\rm argmin}_{i}\hat{\rm err}_{i}. When Nlog(1/p)/ϵ2N\geq\log(1/p)/\epsilon^{2}, with probability at least 1p/21-p/2, we have

maxi|err^ierri|ϵ12α.\max_{i}\left|\hat{\rm err}_{i}-{\rm err}_{i}\right|\leq\frac{\epsilon}{1-2\alpha}.

This means that

erri^err^i^+ϵ12αerr^i+ϵ12αerri+2ϵ12αO(1)ϵ.{\rm err}_{\hat{i}}\leq\hat{\rm err}_{\hat{i}}+\frac{\epsilon}{1-2\alpha}\leq\hat{\rm err}_{i_{\star}}+\frac{\epsilon}{1-2\alpha}\leq{\rm err}_{i_{\star}}+\frac{2\epsilon}{1-2\alpha}\leq O(1)\epsilon.

This step uses poly(n,N,)=poly(n,1/ϵ,log(1/p),)\mathrm{poly}(n,N,\ell)=\mathrm{poly}(n,1/\epsilon,\log(1/p),\ell) time.

Step 3: Picking =Θ(log(1/p))\ell=\Theta(\log(1/p)), we are guaranteed that the good events in Step 1 and Step 2 each happen with probability 1p/2\geq 1-p/2, so both happen with probability at least 1p1-p. The total time used is poly(n,1/ϵ,log(1/p))\mathrm{poly}(n,1/\epsilon,\log(1/p)). Note the dependence on pp is better than required.
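The three steps above can be summarized by the following sketch (hypothetical interfaces: weak_learner() is any routine solving Problem 2.2, and oracle(N) returns NN fresh samples (x,y)(x,y) from the uniform query oracle; the validation set size includes a union bound over the \ell hypotheses):

```python
import numpy as np

def boost_confidence(weak_learner, oracle, alpha, eps, p):
    """Reduction from Problem 2.3 to Problem 2.2 (a sketch).

    weak_learner() returns a hypothesis h: x -> {0, 1} that is eps-accurate with
    probability >= 1/4; oracle(N) returns a list of N noisy samples (x, y).
    Returns a hypothesis that is O(eps)-accurate with probability >= 1 - p.
    """
    # Step 1: repeat the weak learner so that at least one hypothesis is good w.h.p.
    ell = int(np.ceil(np.log(2.0 / p) / np.log(4.0 / 3.0)))
    hypotheses = [weak_learner() for _ in range(ell)]
    # Step 2: estimate the de-biased error of each hypothesis on fresh noisy data.
    N = int(np.ceil(np.log(2.0 * ell / p) / eps ** 2))
    samples = oracle(N)
    errs = []
    for h in hypotheses:
        empirical = np.mean([h(x) != y for x, y in samples])
        errs.append((empirical - alpha) / (1.0 - 2.0 * alpha))  # the estimator \hat{err}_i
    # Step 3: return the hypothesis with the smallest estimated error.
    return hypotheses[int(np.argmin(errs))]
```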

Problem 2.2 reduces to Problem 2.1:

Suppose we have an algorithm that outputs a distribution 𝒟\mathcal{D} over hypotheses with 𝔼h𝒟Px[h(x)ϕT(x)]ϵ\mathbb{E}_{h\sim\mathcal{D}}P_{x}[h(x)\neq\phi_{T}(x)]\leq\epsilon with probability 1/21/2. Then if we sample h^𝒟\hat{h}\sim\mathcal{D}, by Markov’s inequality, we have with probability 1/4\geq 1/4 that

Px[h^(x)ϕT(x)]2ϵP_{x}[\hat{h}(x)\neq\phi_{T}(x)]\leq 2\epsilon

Problem 2.1 reduces to Problem 1:

Consider the Markov game constructed above with H1=nH-1=n. The only missing piece we fill in here is the policy ν\nu of the min-player, which is constructed as follows. The min-player draws a sample (x,y)(x,y) from the uniform query oracle, then takes action b0b_{0} at step hH1h\leq H-1 if xh=0x_{h}=0 and b1b_{1} otherwise. At the HH-th step, the min-player takes action b0b_{0} if y=0y=0 and b1b_{1} otherwise. Also notice the policy μ^\hat{\mu} of the max-player can be described by a set T^[H]\hat{T}\subseteq[H], where he takes action a1a_{1} at step hh if hT^h\in\hat{T} and a0a_{0} otherwise. As a result, the max-player receives a non-zero reward iff ϕT^(x)=y\phi_{\hat{T}}(x)=y.

In the Markov game, we have V1μ^,ν(s1)=(ϕT^(x)=y)V_{1}^{\hat{\mu},\nu}(s_{1})=\mathbb{P}(\phi_{\hat{T}}(x)=y). Consequently, the optimal policy μ\mu^{*} corresponds to the true parity set TT, and

(V1,νV1μ^,ν)(s1)=x,y(ϕT(x)=y)x,y(ϕT^(x)=y)ϵ(V_{1}^{\dagger,\nu}-V_{1}^{\hat{\mu},\nu})(s_{1})=\mathbb{P}_{x,y}(\phi_{T}(x)=y)-\mathbb{P}_{x,y}(\phi_{\hat{T}}(x)=y)\leq\epsilon

by the ϵ\epsilon-approximation guarantee.

Also notice

x,y(ϕT^(x)y)x,y(ϕT(x)y)=\displaystyle\mathbb{P}_{x,y}(\phi_{\hat{T}}(x)\neq y)-\mathbb{P}_{x,y}(\phi_{T}(x)\neq y)= (1α)x(ϕT^(x)ϕT(x))+αx(ϕT^(x)=ϕT(x))α\displaystyle(1-\alpha)\mathbb{P}_{x}(\phi_{\hat{T}}(x)\neq\phi_{T}(x))+\alpha\mathbb{P}_{x}(\phi_{\hat{T}}(x)=\phi_{T}(x))-\alpha
=\displaystyle= (12α)x(ϕT^(x)ϕT(x))\displaystyle(1-2\alpha)\mathbb{P}_{x}(\phi_{\hat{T}}(x)\neq\phi_{T}(x))

This implies:

x(ϕT^(x)ϕT(x))ϵ12α\mathbb{P}_{x}(\phi_{\hat{T}}(x)\neq\phi_{T}(x))\leq\frac{\epsilon}{1-2\alpha}

E.3 Putting them together

So far, we have proved that solving Problem 1 implies solving Problem 2.3, where Problem 1 is the problem of learning an ϵ\epsilon-approximate best response in Markov games (the problem we are interested in), and Problem 2.3 is precisely the problem of learning parity with noise [20]. This concludes the proof.

E.4 Proofs of Hardness Against Adversarial Opponents

Corollary 8 is a direct consequence of Theorem 6, as we will show now.

Proof of Corollary 8.

We only need to prove that a polynomial time no-regret algorithm also learns the best response in a Markov game where the min-player follows the non-Markov policy ν\nu. Then the no-regret guarantee implies

V1,ν(s1)1Kk=1KV1μk,ν(s1)poly(S,H,A,B)KδV^{\dagger,\nu}_{1}(s_{1})-\frac{1}{K}\sum_{k=1}^{K}V^{\mu^{k},\nu}_{1}(s_{1})\leq{\rm poly}(S,H,A,B)K^{-\delta}

where μk\mu^{k} is the policy of the max-player in the kk-th episode. If we choose μ^\hat{\mu} uniformly at random from {μk}k=1K{\left\{\mu^{k}\right\}}_{k=1}^{K}, then

V1,ν(s1)V1μ^,ν(s1)poly(S,H,A,B)Kδ.V^{\dagger,\nu}_{1}(s_{1})-V^{\hat{\mu},\nu}_{1}(s_{1})\leq{\rm poly}(S,H,A,B)K^{-\delta}.

Choosing ϵ=poly(S,H,A,B)Kδ\epsilon={\rm poly}(S,H,A,B)K^{-\delta}, we have K=poly(S,H,A,B,1/ϵ)K={\rm poly}(S,H,A,B,1/\epsilon), so the running time of the no-regret algorithm remains poly(S,H,A,B,1/ϵ){\rm poly}(S,H,A,B,1/\epsilon) for learning the ϵ\epsilon-approximate best response.

To see that Corollary 8 continues to hold for policies that are Markovian in each episode and non-adaptive, we can take the hard instance in Theorem 6 and let νk\nu^{k} denote the min-player’s policy in the kk-th episode. Note that each νk\nu^{k} is Markovian and non-adaptive to the observations in previous episodes. If there is a polynomial time no-regret algorithm against such {νk}{\left\{\nu^{k}\right\}}, then by an online-to-batch conversion similar to the above, the mixture of {μk}k=1K{\left\{\mu^{k}\right\}}_{k=1}^{K} learns a best response against ν\nu in polynomial time.

Appendix F Auxiliary Lemmas for Weighted Adversarial Bandit

In this section, we formulate the bandit problem we reduced to in the proof of Lemma 14. Although the mechanisms are already well understood, we did not find a good reference for the Follow-the-Regularized-Leader (FTRL) algorithm with

  1. 1.

    changing step size

  2. 2.

    weighted regret

  3. 3.

    high probability regret bound

For completeness, we give the detailed derivation here.

Algorithm 9 FTRL for Weighted Regret with Changing Step Size
1:  for episode t=1,,Kt=1,\dots,K do
2:     θt(a)exp[(ηt/wt)i=1t1wil^i(a)]\theta_{t}(a)\propto\exp[-(\eta_{t}/w_{t})\cdot\sum_{i=1}^{t-1}w_{i}\hat{l}_{i}(a)]
3:     Take action atθt()a_{t}\sim\theta_{t}(\cdot), and observe loss l~t(at)\tilde{l}_{t}(a_{t}).
4:     l^t(a)l~t(a)𝕀{at=a}/(θt(a)+γt)\hat{l}_{t}(a)\leftarrow\tilde{l}_{t}(a)\mathbb{I}\{a_{t}=a\}/(\theta_{t}(a)+\gamma_{t}) for all a𝒜a\in\mathcal{A}.

We assume l~i[0,1]A\tilde{l}_{i}\in[0,1]^{A} and 𝔼il~i=li\mathbb{E}_{i}\tilde{l}_{i}=l_{i}. Defining A=|𝒜|A=|\mathcal{A}|, we set the hyperparameters as

ηt=γt=logAAt\eta_{t}=\gamma_{t}=\sqrt{\frac{\log A}{At}}
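For concreteness, a direct transcription of Algorithm 9 as a sketch (the bandit feedback loss_fn and the weights wtw_{t} are supplied by the caller; the hyperparameters follow the display above):

```python
import numpy as np

def ftrl_weighted(loss_fn, weights, A, K, seed=0):
    """A sketch of Algorithm 9: FTRL with changing step size and weighted losses.

    loss_fn(t, a) returns the observed loss of action a at round t (in [0, 1]);
    weights[t-1] is the weight w_t.  Returns the list of actions played.
    """
    rng = np.random.default_rng(seed)
    L_hat = np.zeros(A)                  # running sum of weighted loss estimates w_i * l_hat_i
    actions = []
    for t in range(1, K + 1):
        eta_t = gamma_t = np.sqrt(np.log(A) / (A * t))
        w_t = weights[t - 1]
        # Line 2: theta_t(a) proportional to exp(-(eta_t / w_t) * sum_{i<t} w_i l_hat_i(a)).
        logits = -(eta_t / w_t) * L_hat
        theta = np.exp(logits - logits.max())
        theta /= theta.sum()
        # Line 3: sample an action and observe its loss.
        a_t = rng.choice(A, p=theta)
        actions.append(a_t)
        # Line 4: biased importance-weighted loss estimate, nonzero only at a_t.
        l_hat = np.zeros(A)
        l_hat[a_t] = loss_fn(t, a_t) / (theta[a_t] + gamma_t)
        L_hat += w_t * l_hat
    return actions
```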

Define the filtration t\mathcal{F}_{t} by the σ\sigma-algebra generated by {ai,li}i=1t1\{a_{i},l_{i}\}_{i=1}^{t-1}. Then the regret can be defined as

Rt(θ):=i=1twi𝔼aθ[li(a)li(ai)|i]=i=1twiθiθ,liR_{t}\left(\theta^{*}\right):=\sum_{i=1}^{t}{w_{i}\mathbb{E}_{a\sim\theta^{*}}[l_{i}(a)-l_{i}(a_{i})|\mathcal{F}_{i}]}=\sum_{i=1}^{t}{w_{i}\left<\theta_{i}-\theta^{*},l_{i}\right>}

We can easily check that the definitions here are just an abstract version of those in the proof of Lemma 14, up to rescaling. To state the regret guarantee, we also define ι=log(AK/p)\iota=\log(AK/p) for any p(0,1]p\in(0,1]. Now we can upper bound the regret as follows.

Lemma 17.

Following Algorithm 9, with probability 13p1-3p, for any θΔA\theta^{*}\in\Delta^{A} and tKt\leq K we have

Rt(θ)2maxitwiAtι+3Aι2i=1twii+12maxitwiι+2ιi=1twi2R_{t}\left(\theta^{*}\right)\leq 2\max_{i\leq t}w_{i}\sqrt{At\iota}+\frac{3\sqrt{A\iota}}{2}\sum_{i=1}^{t}{\frac{w_{i}}{\sqrt{i}}}+\frac{1}{2}\max_{i\leq t}w_{i}\iota+\sqrt{2\iota\sum_{i=1}^{t}{w_{i}^{2}}}
Proof.

The regret Rt(θ)R_{t}(\theta^{*}) can be decomposed into three terms

Rt(θ)=\displaystyle R_{t}\left(\theta^{*}\right)= i=1twiθiθ,li\displaystyle\sum_{i=1}^{t}{w_{i}\left<\theta_{i}-\theta^{*},l_{i}\right>}
=\displaystyle= i=1twiθiθ,l^i(A)+i=1twiθi,lil^i(B)+i=1twiθ,l^ili(C)\displaystyle\underset{\left(A\right)}{\underbrace{\sum_{i=1}^{t}{w_{i}\left<\theta_{i}-\theta^{*},\hat{l}_{i}\right>}}}+\underset{\left(B\right)}{\underbrace{\sum_{i=1}^{t}{w_{i}\left<\theta_{i},l_{i}-\hat{l}_{i}\right>}}}+\underset{\left(C\right)}{\underbrace{\sum_{i=1}^{t}{w_{i}\left<\theta^{*},\hat{l}_{i}-l_{i}\right>}}}

and we bound (A)(A) in Lemma 19, (B)(B) in Lemma 20 and (C)(C) in Lemma 21.

Setting ηt=γt=logAAt\eta_{t}=\gamma_{t}=\sqrt{\frac{\log A}{At}}, the conditions in Lemma 19 and Lemma 21 are satisfied. Putting them together and taking a union bound, we have with probability 13p1-3p

Rt(θ)\displaystyle R_{t}\left(\theta^{*}\right)\leq wtlogAηt+A2i=1tηiwi+12maxitwiι+Ai=1tγiwi+2ιi=1twi2+maxitwiι/γt\displaystyle\frac{w_{t}\log A}{\eta_{t}}+\frac{A}{2}\sum_{i=1}^{t}{\eta_{i}w_{i}}+\frac{1}{2}\max_{i\leq t}w_{i}\iota+A\sum_{i=1}^{t}{\gamma_{i}w_{i}}+\sqrt{2\iota\sum_{i=1}^{t}{w_{i}^{2}}}+\max_{i\leq t}w_{i}\iota/\gamma_{t}
\displaystyle\leq 2maxitwiAtι+3Aι2i=1twii+12maxitwiι+2ιi=1twi2\displaystyle 2\max_{i\leq t}w_{i}\sqrt{At\iota}+\frac{3\sqrt{A\iota}}{2}\sum_{i=1}^{t}{\frac{w_{i}}{\sqrt{i}}}+\frac{1}{2}\max_{i\leq t}w_{i}\iota+\sqrt{2\iota\sum_{i=1}^{t}{w_{i}^{2}}}

The rest of this section is devoted to the proofs of the lemmas used in the proof of Lemma 17. We begin with the following useful lemma, adapted from Lemma 1 in [21], which is crucial for obtaining high probability guarantees.

Lemma 18.

For any sequence of coefficients c1,c2,,ctc_{1},c_{2},\ldots,c_{t} s.t. ci[0,2γi]Ac_{i}\in[0,2\gamma_{i}]^{A} is i\mathcal{F}_{i}-measurable, we have with probability 1p/AK1-p/AK,

i=1twici,l^ilimaxitwiι\sum_{i=1}^{t}{w_{i}\left<c_{i},\hat{l}_{i}-l_{i}\right>}\leq\max_{i\leq t}w_{i}\iota
Proof.

Define w=maxitwiw=\max_{i\leq t}w_{i}. By definition,

wil^i(a)=\displaystyle w_{i}\hat{l}_{i}\left(a\right)= wil~i(a)𝕀{ai=a}θi(a)+γiwil~i(a)𝕀{ai=a}θi(a)+wil~i(a)𝕀{ai=a}wγi\displaystyle\frac{w_{i}\tilde{l}_{i}\left(a\right)\mathbb{I}\left\{a_{i}=a\right\}}{\theta_{i}\left(a\right)+\gamma_{i}}\leq\frac{w_{i}\tilde{l}_{i}\left(a\right)\mathbb{I}\left\{a_{i}=a\right\}}{\theta_{i}\left(a\right)+\frac{w_{i}\tilde{l}_{i}\left(a\right)\mathbb{I}\left\{a_{i}=a\right\}}{w}\gamma_{i}}
=\displaystyle= w2γi2γiwil~i(a)𝕀{ai=a}wθi(a)1+γiwil~i(a)𝕀{ai=a}wθi(a)(i)w2γilog(1+2γiwil~i(a)𝕀{ai=a}wθi(a))\displaystyle\frac{w}{2\gamma_{i}}\frac{\frac{2\gamma_{i}w_{i}\tilde{l}_{i}\left(a\right)\mathbb{I}\left\{a_{i}=a\right\}}{w\theta_{i}\left(a\right)}}{1+\frac{\gamma_{i}w_{i}\tilde{l}_{i}\left(a\right)\mathbb{I}\left\{a_{i}=a\right\}}{w\theta_{i}\left(a\right)}}\overset{\left(i\right)}{\leq}\frac{w}{2\gamma_{i}}\log\left(1+\frac{2\gamma_{i}w_{i}\tilde{l}_{i}\left(a\right)\mathbb{I}\left\{a_{i}=a\right\}}{w\theta_{i}\left(a\right)}\right)

where (i)(i) is because z1+z/2log(1+z)\frac{z}{1+z/2}\leq\log\left(1+z\right) for all z0z\geq 0.

Defining the sum

S^i=wiwci,l^i,Si=wiwci,li,\hat{S}_{i}=\frac{w_{i}}{w}\left<c_{i},\hat{l}_{i}\right>,\,\,\,S_{i}=\frac{w_{i}}{w}\left<c_{i},l_{i}\right>,

we have

𝔼i[exp(S^i)]\displaystyle\mathbb{E}_{i}\left[\exp\left(\hat{S}_{i}\right)\right] 𝔼i[exp(aci(a)2γilog(1+2γiwil~i(a)𝕀{ai=a}wθi(a)))]\displaystyle\leq\mathbb{E}_{i}\left[\exp\left(\sum_{a}{\frac{c_{i}\left(a\right)}{2\gamma_{i}}\log\left(1+\frac{2\gamma_{i}w_{i}\tilde{l}_{i}\left(a\right)\mathbb{I}\left\{a_{i}=a\right\}}{w\theta_{i}\left(a\right)}\right)}\right)\right]
(i)𝔼i[a(1+ci(a)wil~i(a)𝕀{ai=a}wθi(a))]\displaystyle\overset{\left(i\right)}{\leq}\mathbb{E}_{i}\left[\prod_{a}{\left(1+\frac{c_{i}\left(a\right)w_{i}\tilde{l}_{i}\left(a\right)\mathbb{I}\left\{a_{i}=a\right\}}{w\theta_{i}\left(a\right)}\right)}\right]
=𝔼i[1+aci(a)wil~i(a)𝕀{ai=a}wθi(a)]\displaystyle=\mathbb{E}_{i}\left[1+\sum_{a}{\frac{c_{i}\left(a\right)w_{i}\tilde{l}_{i}\left(a\right)\mathbb{I}\left\{a_{i}=a\right\}}{w\theta_{i}\left(a\right)}}\right]
=1+Siexp(Si)\displaystyle=1+S_{i}\leq\exp\left(S_{i}\right)

where (i)(i) is because z1log(1+z2)log(1+z1z2)z_{1}\log\left(1+z_{2}\right)\leq\log\left(1+z_{1}z_{2}\right) for any 0z110\leq z_{1}\leq 1 and z21z_{2}\geq-1. Here we are using the condition ci(a)2γic_{i}\left(a\right)\leq 2\gamma_{i} to guarantee that the corresponding z1=ci(a)/(2γi)z_{1}=c_{i}(a)/(2\gamma_{i}) lies in [0,1][0,1].

Equipped with the above bound, we can now prove the concentration result.

[i=1t(S^iSi)ι]\displaystyle\mathbb{P}\left[\sum_{i=1}^{t}{\left(\hat{S}_{i}-S_{i}\right)}\geq\iota\right] =[exp[i=1t(S^iSi)]AKp]\displaystyle=\mathbb{P}\left[\exp\left[\sum_{i=1}^{t}{\left(\hat{S}_{i}-S_{i}\right)}\right]\geq\frac{AK}{p}\right]
pAK𝔼t[exp[i=1t(S^iSi)]]\displaystyle\leq\frac{p}{AK}\mathbb{E}_{t}\left[\exp\left[\sum_{i=1}^{t}{\left(\hat{S}_{i}-S_{i}\right)}\right]\right]
pAK𝔼t1[exp[i=1t1(S^iSi)]Et[exp(S^tSt)]]\displaystyle\leq\frac{p}{AK}\mathbb{E}_{t-1}\left[\exp\left[\sum_{i=1}^{t-1}{\left(\hat{S}_{i}-S_{i}\right)}\right]E_{t}\left[\exp\left(\hat{S}_{t}-S_{t}\right)\right]\right]
pAK𝔼t1[exp[i=1t1(S^iSi)]]\displaystyle\leq\frac{p}{AK}\mathbb{E}_{t-1}\left[\exp\left[\sum_{i=1}^{t-1}{\left(\hat{S}_{i}-S_{i}\right)}\right]\right]
pAK\displaystyle\leq\cdots\leq\frac{p}{AK}

The claim is proved by taking the union bound. ∎

Using Lemma 18, we can bound the terms (A)(A), (B)(B) and (C)(C) separately as below.

Lemma 19.

If ηi2γi\eta_{i}\leq 2\gamma_{i} for all iti\leq t, with probability 1p1-p, for any t[K]t\in[K] and θΔA\theta^{*}\in\Delta^{A},

i=1twiθiθ,l^iwtlogAηt+A2i=1tηiwi+12maxitwiι\sum_{i=1}^{t}{w_{i}\left<\theta_{i}-\theta^{*},\hat{l}_{i}\right>}\leq\frac{w_{t}\log A}{\eta_{t}}+\frac{A}{2}\sum_{i=1}^{t}{\eta_{i}w_{i}}+\frac{1}{2}\max_{i\leq t}w_{i}\iota
Proof.

We use the standard analysis of FTRL with changing step size, see for example Exercise 28.13 in [17]. Notice the essential step size is ηt/wt\eta_{t}/w_{t},

i=1twiθiθ,l^i\displaystyle\sum_{i=1}^{t}{w_{i}\left<\theta_{i}-\theta^{*},\hat{l}_{i}\right>} wtlogAηt+12i=1tηiwiθi,l^i2\displaystyle\leq\frac{w_{t}\log A}{\eta_{t}}+\frac{1}{2}\sum_{i=1}^{t}{\eta_{i}w_{i}}\left<\theta_{i},\hat{l}_{i}^{2}\right>
wtlogAηt+12i=1ta𝒜ηiwil^i(a)\displaystyle\leq\frac{w_{t}\log A}{\eta_{t}}+\frac{1}{2}\sum_{i=1}^{t}{\sum_{a\in\mathcal{A}}{\eta_{i}}w_{i}\hat{l}_{i}\left(a\right)}
(i)wtlogAηt+12i=1ta𝒜ηiwili(a)+12maxitwiι\displaystyle\overset{\left(i\right)}{\leq}\frac{w_{t}\log A}{\eta_{t}}+\frac{1}{2}\sum_{i=1}^{t}{\sum_{a\in\mathcal{A}}{\eta_{i}}w_{i}l_{i}\left(a\right)}+\frac{1}{2}\max_{i\leq t}w_{i}\iota
wtlogAηt+A2i=1tηiwi+12maxitwiι\displaystyle\leq\frac{w_{t}\log A}{\eta_{t}}+\frac{A}{2}\sum_{i=1}^{t}{\eta_{i}w_{i}}+\frac{1}{2}\max_{i\leq t}w_{i}\iota

where (i)(i) is by using Lemma 18 with ci(a)=ηic_{i}(a)=\eta_{i}. The any-time guarantee is justified by taking a union bound. ∎

Lemma 20.

With probability 1p1-p, for any t[K]t\in[K],

i=1twiθi,lil^iAi=1tγiwi+2ιi=1twi2\sum_{i=1}^{t}{w_{i}\left<\theta_{i},l_{i}-\hat{l}_{i}\right>}\leq A\sum_{i=1}^{t}{\gamma_{i}w_{i}}+\sqrt{2\iota\sum_{i=1}^{t}{w_{i}^{2}}}
Proof.

We further decompose it into

i=1twiθi,lil^i=i=1twiθi,li𝔼il^i+i=1twiθi,𝔼il^il^i\sum_{i=1}^{t}{w_{i}\left<\theta_{i},l_{i}-\hat{l}_{i}\right>}=\sum_{i=1}^{t}{w_{i}\left<\theta_{i},l_{i}-\mathbb{E}_{i}\hat{l}_{i}\right>}+\sum_{i=1}^{t}{w_{i}\left<\theta_{i},\mathbb{E}_{i}\hat{l}_{i}-\hat{l}_{i}\right>}

The first term is bounded by

i=1twiθi,li𝔼il^i=\displaystyle\sum_{i=1}^{t}{w_{i}\left<\theta_{i},l_{i}-\mathbb{E}_{i}\hat{l}_{i}\right>}= i=1twiθi,liθiθi+γili\displaystyle\sum_{i=1}^{t}{w_{i}\left<\theta_{i},l_{i}-\frac{\theta_{i}}{\theta_{i}+\gamma_{i}}l_{i}\right>}
=\displaystyle= i=1twiθi,γiθi+γiliAi=1tγiwi\displaystyle\sum_{i=1}^{t}{w_{i}\left<\theta_{i},\frac{\gamma_{i}}{\theta_{i}+\gamma_{i}}l_{i}\right>}\leq A\sum_{i=1}^{t}{\gamma_{i}w_{i}}

To bound the second term, notice

\left<\theta_{i},\hat{l}_{i}\right>\leq\sum_{a\in\mathcal{A}}{\theta_{i}\left(a\right)}\frac{\mathbb{I}\left\{a_{i}=a\right\}}{\theta_{i}(a)+\gamma_{i}}\leq\sum_{a\in\mathcal{A}}{\mathbb{I}\left\{a_{i}=a\right\}}=1,

thus {wiθi,𝔼il^il^i}i=1t\{w_{i}\left<\theta_{i},\mathbb{E}_{i}\hat{l}_{i}-\hat{l}_{i}\right>\}_{i=1}^{t} is a bounded martingale difference sequence w.r.t. the filtration {i}i=1t\{\mathcal{F}_{i}\}_{i=1}^{t}. By Azuma-Hoeffding,

\sum_{i=1}^{t}{w_{i}\left<\theta_{i},\mathbb{E}_{i}\hat{l}_{i}-\hat{l}_{i}\right>}\leq\sqrt{2\iota\sum_{i=1}^{t}{w_{i}^{2}}}

Lemma 21.

With probability 1p1-p, for any t[K]t\in[K] and any θΔA\theta^{*}\in\Delta^{A}, if γi\gamma_{i} is non-increasing in ii,

i=1twiθ,l^ilimaxitwiι/γt\sum_{i=1}^{t}{w_{i}\left<\theta^{*},\hat{l}_{i}-l_{i}\right>}\leq\max_{i\leq t}w_{i}\iota/\gamma_{t}
Proof.

Define a basis {ej}j=1A\left\{e_{j}\right\}_{j=1}^{A} of A\mathbb{R}^{A} by

ej(a)={1 if a=j0 otherwisee_{j}\left(a\right)=\begin{cases}\text{1 if }a=j\\ \text{0 otherwise}\\ \end{cases}

Then for all j[A]j\in[A], apply Lemma 18 with ci=γtejc_{i}=\gamma_{t}e_{j}. Since now ci(a)γtγic_{i}(a)\leq\gamma_{t}\leq\gamma_{i}, the condition in Lemma 18 is satisfied. As a result,

i=1twiej,l^ilimaxitwiι/γt\sum_{i=1}^{t}{w_{i}\left<e_{j},\hat{l}_{i}-l_{i}\right>}\leq\max_{i\leq t}w_{i}\iota/\gamma_{t}

Since any θ\theta^{*} is a convex combination of {ej}j=1A\left\{e_{j}\right\}_{j=1}^{A}, by taking the union bound over j[A]j\in[A], we have

i=1twiθ,l^ilimaxitwiι/γt\sum_{i=1}^{t}{w_{i}\left<\theta^{*},\hat{l}_{i}-l_{i}\right>}\leq\max_{i\leq t}w_{i}\iota/\gamma_{t}