Near-Optimal Reinforcement Learning with Self-Play
Abstract
This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms which learn the optimal policy by playing against themselves without any direct supervision. In a tabular episodic Markov game with $S$ states, $A$ max-player actions and $B$ min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires $\tilde{\mathcal{O}}(S^2AB)$ steps of game playing, when only highlighting the dependency on $(S, A, B)$. In contrast, the best existing lower bound scales as $\Omega(S(A+B))$, leaving a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new Nash V-learning algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$. The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. In addition, we present a computational hardness result for learning the best responses against a fixed opponent in Markov games—a learning objective different from finding the Nash equilibrium.
1 Introduction
A wide range of modern artificial intelligence challenges can be cast as a multi-agent reinforcement learning (multi-agent RL) problem, in which more than one agent performs sequential decision making in an interactive environment. Multi-agent RL has achieved significant recent success on traditionally challenging tasks, for example in the game of GO [30, 31], Poker [6], real-time strategy games [33, 22], decentralized control and multi-agent robotics systems [5], autonomous driving [27], as well as complex social scenarios such as hide-and-seek [3]. In many scenarios, the learning agents even outperform the best human experts.
Despite the great empirical success, a major bottleneck for many existing RL algorithms is that they require a tremendous number of samples. For example, the biggest AlphaGo Zero model was trained on tens of millions of games and took more than a month to train [31]. While requiring such an amount of samples may be acceptable in simulatable environments such as GO, it is not so in other sample-expensive real-world settings such as robotics and autonomous driving. It is thus important for us to understand the sample complexity in RL—how can we design algorithms that find a near-optimal policy with a small number of samples, and what is the fundamental limit, i.e., the minimum number of samples required for any algorithm to find a good policy?
Theoretical understanding of the sample complexity of multi-agent RL is rather limited, especially when compared with single-agent settings. The standard model for the single-agent setting is an episodic Markov Decision Process (MDP) with $S$ states, $A$ actions, and $H$ steps per episode. The best known algorithm can find an $\epsilon$ near-optimal policy in $\tilde{\mathcal{O}}(\mathrm{poly}(H)SA/\epsilon^2)$ episodes, which matches the lower bound up to a single $H$ factor [1, 8]. In contrast, in multi-agent settings, the optimal sample complexity remains open even in the basic setting of two-player tabular Markov games [28], where the agents are required to find the solutions of the games—the Nash equilibria. The best known algorithm, VI-ULCB, finds an $\epsilon$-approximate Nash equilibrium in $\tilde{\mathcal{O}}(\mathrm{poly}(H)S^2AB/\epsilon^2)$ episodes [2], where $B$ is the number of actions of the other player. The information-theoretic lower bound is $\Omega(\mathrm{poly}(H)S(A+B)/\epsilon^2)$. In particular, the number of episodes required by the algorithm scales quadratically in both $S$ and the number of actions, exhibiting a gap from the linear dependency in the lower bound. This motivates the following question:
Can we design algorithms with near-optimal sample complexity for learning Markov games?
In this paper, we present the first line of near-optimal algorithms for two-player Markov games that match the aforementioned lower bound up to a $\mathrm{poly}(H)$ factor. This closes the open problem of achieving the optimal sample complexity in all $(S, A, B)$ dependency. Our algorithm learns by playing against itself without requiring any direct supervision, and is thus a self-play algorithm.
1.1 Our contributions
• We propose an optimistic variant of Nash Q-learning [11], and prove that it achieves sample complexity $\tilde{\mathcal{O}}(H^5SAB/\epsilon^2)$ for finding an $\epsilon$-approximate Nash equilibrium in two-player Markov games (Section 3). Our algorithm builds optimistic upper and lower estimates of $Q$-values, and computes a Coarse Correlated Equilibrium (CCE) over this pair of estimates as its execution policies for both players.
• We design a new algorithm—Nash V-learning—for finding approximate Nash equilibria, and show that it achieves sample complexity $\tilde{\mathcal{O}}(H^6S(A+B)/\epsilon^2)$ (Section 4). This improves upon Nash Q-learning when the action sets are large relative to $H$ (roughly, when $\min\{A,B\} \gtrsim H$). It is also the first result that matches the minimax lower bound up to only a $\mathrm{poly}(H)$ factor. This algorithm builds optimistic upper and lower estimates of $V$-values, and features a novel combination of Follow-the-Regularized-Leader (FTRL) and the standard Q-learning algorithm to determine its execution policies.
• Apart from finding Nash equilibria, we prove that learning the best responses of fixed opponents in Markov games is as hard as learning parity with noise—a notoriously difficult problem that is believed to be computationally hard (Section 5). As a corollary, this hardness result directly implies that achieving sublinear regret against adversarial opponents in Markov games is also computationally hard, a result that first appeared in [25]. This in turn rules out the possibility of designing efficient algorithms for finding Nash equilibria by running no-regret algorithms for each player separately.
In addition to the above contributions, this paper also features a novel approach for extracting certified policies—from the estimates produced by reinforcement learning algorithms such as Nash Q-learning and Nash V-learning—that are certified to have performance similar to Nash equilibrium policies, even when facing their best responses (see Section 3 for more details). We believe this technique could be of broader interest to the community.
1.2 Related Work
Markov games
Markov games (or stochastic games) were proposed in the early 1950s [28]. They are widely used to model multi-agent RL. Learning the Nash equilibria of Markov games has been studied in classical work [18, 19, 11, 10], where the transition matrix and reward are assumed to be known, or in the asymptotic setting where the amount of data goes to infinity. These results do not directly apply to the non-asymptotic setting where the transition and reward are unknown and only a limited amount of data is available for estimating them.
A recent line of work tackles self-play algorithms for Markov games in the non-asymptotic setting with strong reachability assumptions. Specifically, Wei et al. [35] assume that no matter what strategy one agent sticks to, the other agent can always reach all states by playing a certain policy, and Jia et al. [13] and Sidford et al. [29] assume access to simulators (or generative models) that enable the agent to directly sample transition and reward information for any state-action pair. These settings ensure that all states can be reached directly, so no sophisticated exploration is required.
Very recently, [2, 36] studied learning Markov games without these reachability assumptions, where exploration becomes essential. However, both results suffer from highly suboptimal sample complexity. We compare them with our results in Table 1. The results of [36] also apply to the linear function approximation setting. We remark that the R-max algorithm [4] does provide provable guarantees for learning Markov games, even in the setting of playing against an adversarial opponent, but under a definition of regret that is weaker than the standard regret. Their result does not imply any sample complexity result for finding Nash equilibrium policies.
| Algorithm | Sample Complexity | Runtime |
| --- | --- | --- |
| VI-ULCB [2] | $\tilde{\mathcal{O}}(H^4S^2AB/\epsilon^2)$ | PPAD-complete |
| VI-explore [2] | $\tilde{\mathcal{O}}(H^5S^2AB/\epsilon^2)$ | Polynomial |
| OMVI-SM [36] | $\tilde{\mathcal{O}}(H^4S^3A^3B^3/\epsilon^2)$ | Polynomial |
| Optimistic Nash Q-learning | $\tilde{\mathcal{O}}(H^5SAB/\epsilon^2)$ | Polynomial |
| Optimistic Nash V-learning | $\tilde{\mathcal{O}}(H^6S(A+B)/\epsilon^2)$ | Polynomial |
| Lower Bound [14, 2] | $\Omega(H^3S(A+B)/\epsilon^2)$ | - |
Adversarial MDP
Another line of related work focuses on provably efficient algorithms for adversarial MDPs. Most work in this line considers the setting with adversarial rewards [38, 26, 15], because adversarial MDP with changing dynamics is computationally hard even under full-information feedback [37]. These results do not directly imply provable self-play algorithms in our setting, because the opponent in Markov games can affect both the reward and the transition.
Single-agent RL
There is a rich literature on reinforcement learning in MDPs [see e.g. 12, 24, 1, 7, 32, 14]. An MDP is a special case of a Markov game, where only a single agent interacts with a stochastic environment. For the tabular episodic setting with nonstationary dynamics and no simulators, the best sample complexities achieved by existing model-based and model-free algorithms are $\tilde{\mathcal{O}}(H^3SA/\epsilon^2)$ [1] and $\tilde{\mathcal{O}}(H^4SA/\epsilon^2)$ [14], respectively, where $S$ is the number of states, $A$ is the number of actions, and $H$ is the length of each episode. Both of them (nearly) match the lower bound $\Omega(H^3SA/\epsilon^2)$ [12, 23, 14].
2 Preliminaries
We consider zero-sum Markov games (MG) [28, 18], which are also known as stochastic games in the literature. Zero-sum Markov games are a generalization of standard Markov Decision Processes (MDPs) to the two-player setting, in which the max-player seeks to maximize the total return and the min-player seeks to minimize the total return.
Formally, we denote a tabular episodic Markov game as $\mathrm{MG}(H, \mathcal{S}, \mathcal{A}, \mathcal{B}, \mathbb{P}, r)$, where $H$ is the number of steps in each episode, $\mathcal{S}$ is the set of states with $|\mathcal{S}| \le S$, $\mathcal{A}$ and $\mathcal{B}$ (with $|\mathcal{A}| \le A$ and $|\mathcal{B}| \le B$) are the sets of actions of the max-player and the min-player respectively, $\mathbb{P} = \{\mathbb{P}_h\}_{h\in[H]}$ is a collection of transition matrices, so that $\mathbb{P}_h(\cdot \mid s, a, b)$ gives the distribution over states if action pair $(a, b)$ is taken at state $s$ at step $h$, and $r = \{r_h\}_{h\in[H]}$ is a collection of reward functions, where $r_h$ is the deterministic reward function at step $h$. ¹We assume the rewards lie in $[0, 1]$ for normalization. Our results directly generalize to randomized reward functions, since learning the transition is more difficult than learning the reward.
In each episode of this MG, we start with a fixed initial state $s_1$. Then, at each step $h \in [H]$, both players observe the state $s_h$, and the max-player picks action $a_h \in \mathcal{A}$ while the min-player picks action $b_h \in \mathcal{B}$ simultaneously. Both players observe the actions of their opponents, receive reward $r_h(s_h, a_h, b_h)$, and then the environment transitions to the next state $s_{h+1} \sim \mathbb{P}_h(\cdot \mid s_h, a_h, b_h)$. The episode ends when $s_{H+1}$ is reached.
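To make the interaction protocol concrete, the following is a minimal sketch (ours) of one episode of play. The `MarkovGame` environment interface with `reset()` and `step(a, b)` is a hypothetical illustration, not part of the paper.

```python
import numpy as np

def play_episode(env, mu, nu, H, rng=None):
    """Roll out one episode of a two-player zero-sum Markov game.

    env    : hypothetical environment with reset() -> s and step(a, b) -> (s_next, r);
             the reward r is the max-player's gain and the min-player's loss.
    mu, nu : Markov policies; mu[h][s] and nu[h][s] are probability vectors
             over the max-player's and min-player's action sets.
    """
    rng = rng or np.random.default_rng()
    s = env.reset()
    total_return = 0.0
    for h in range(H):
        # Both players act simultaneously based on the current state.
        a = rng.choice(len(mu[h][s]), p=mu[h][s])
        b = rng.choice(len(nu[h][s]), p=nu[h][s])
        s_next, r = env.step(a, b)
        total_return += r      # max-player maximizes, min-player minimizes this sum
        s = s_next
    return total_return
```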
Markov policy, value function
A Markov policy $\mu$ of the max-player is a collection of $H$ functions $\{\mu_h: \mathcal{S} \to \Delta_{\mathcal{A}}\}_{h\in[H]}$, each mapping a state to a distribution over actions. Here $\Delta_{\mathcal{A}}$ is the probability simplex over the action set $\mathcal{A}$. Similarly, a policy $\nu$ of the min-player is a collection of $H$ functions $\{\nu_h: \mathcal{S} \to \Delta_{\mathcal{B}}\}_{h\in[H]}$. We use the notation $\mu_h(a \mid s)$ and $\nu_h(b \mid s)$ to denote the probability of taking action $a$ or $b$ at state $s$ at step $h$ under Markov policies $\mu$ and $\nu$ respectively.
We use $V_h^{\mu,\nu}: \mathcal{S} \to \mathbb{R}$ to denote the value function at step $h$ under policies $\mu$ and $\nu$, so that $V_h^{\mu,\nu}(s)$ gives the expected cumulative rewards received under $\mu$ and $\nu$, starting from $s$ at step $h$:

$$V_h^{\mu,\nu}(s) := \mathbb{E}_{\mu,\nu}\Big[\textstyle\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}, b_{h'}) \,\Big|\, s_h = s\Big]. \qquad (1)$$

We also define $Q_h^{\mu,\nu}: \mathcal{S}\times\mathcal{A}\times\mathcal{B} \to \mathbb{R}$ to denote the Q-value function at step $h$, so that $Q_h^{\mu,\nu}(s,a,b)$ gives the expected cumulative rewards received under $\mu$ and $\nu$, starting from $(s, a, b)$ at step $h$:

$$Q_h^{\mu,\nu}(s,a,b) := \mathbb{E}_{\mu,\nu}\Big[\textstyle\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}, b_{h'}) \,\Big|\, s_h = s, a_h = a, b_h = b\Big]. \qquad (2)$$
For simplicity, we use the operator notation $\mathbb{P}_h$ so that $(\mathbb{P}_h V)(s,a,b) := \mathbb{E}_{s' \sim \mathbb{P}_h(\cdot\mid s,a,b)} V(s')$ for any value function $V$. We also use the notation $(\mathbb{D}_{\pi}Q)(s) := \mathbb{E}_{(a,b)\sim\pi(\cdot,\cdot\mid s)} Q(s,a,b)$ for any action-value function $Q$. By the definition of the value functions, we have the Bellman equations

$$Q_h^{\mu,\nu}(s,a,b) = (r_h + \mathbb{P}_h V_{h+1}^{\mu,\nu})(s,a,b), \qquad V_h^{\mu,\nu}(s) = (\mathbb{D}_{\mu_h\times\nu_h} Q_h^{\mu,\nu})(s)$$

for all $(s, a, b, h) \in \mathcal{S}\times\mathcal{A}\times\mathcal{B}\times[H]$. We define $V_{H+1}^{\mu,\nu}(s) = 0$ for all $s \in \mathcal{S}$.
Best response and Nash equilibrium
For any Markov policy $\mu$ of the max-player, there exists a best response of the min-player, which is a Markov policy $\nu^\dagger(\mu)$ satisfying $V_h^{\mu,\nu^\dagger(\mu)}(s) = \inf_{\nu} V_h^{\mu,\nu}(s)$ for any $(s, h)$. Here the infimum is taken over all possible policies, which are not necessarily Markovian (we will define general policies later in this section). We define $V_h^{\mu,\dagger} := V_h^{\mu,\nu^\dagger(\mu)}$. By symmetry, we can also define $\mu^\dagger(\nu)$ and $V_h^{\dagger,\nu}$. It is further known (cf. [9]) that there exist Markov policies $\mu^\star, \nu^\star$ that are optimal against the best responses of their opponents, in the sense that

$$V_h^{\mu^\star,\dagger}(s) = \sup_{\mu} V_h^{\mu,\dagger}(s), \qquad V_h^{\dagger,\nu^\star}(s) = \inf_{\nu} V_h^{\dagger,\nu}(s), \quad \text{for all } (s, h).$$
We call these optimal strategies $(\mu^\star, \nu^\star)$ the Nash equilibrium of the Markov game, which satisfies the following minimax equation: ²

$$\sup_{\mu}\inf_{\nu} V_h^{\mu,\nu}(s) \;=\; V_h^{\mu^\star,\nu^\star}(s) \;=\; \inf_{\nu}\sup_{\mu} V_h^{\mu,\nu}(s).$$

²The minimax theorem here is different from the one for matrix games, i.e. $\max_{\phi}\min_{\psi}\phi^\top A\psi = \min_{\psi}\max_{\phi}\phi^\top A\psi$ for any matrix $A$, since here $V_h^{\mu,\nu}(s)$ is in general not bilinear in $(\mu, \nu)$.
Intuitively, a Nash equilibrium gives a solution in which no player has anything to gain by changing only her own policy. We further abbreviate the values of the Nash equilibrium, $V_h^{\mu^\star,\nu^\star}$ and $Q_h^{\mu^\star,\nu^\star}$, as $V_h^\star$ and $Q_h^\star$. We refer readers to Appendix A for the Bellman optimality equations for the values of best responses and Nash equilibria.
General (non-Markovian) policy
In certain situations, it is beneficial to consider general, history-dependent policies that are not necessarily Markovian. A (general) policy $\mu$ of the max-player is a set of maps $\{\mu_h\}_{h\in[H]}$ from a random number $\omega$ and a history of length $h$—say $(s_1, a_1, b_1, \ldots, s_h)$—to a distribution over actions in $\mathcal{A}$. By symmetry, we can also define a (general) policy $\nu$ of the min-player by replacing the action set $\mathcal{A}$ in the definition with $\mathcal{B}$. The random number $\omega$ is sampled from some underlying distribution, and may be shared among all steps $h \in [H]$.
For a pair of general policies $(\mu, \nu)$, we can still use definition (1) to define their value $V_1^{\mu,\nu}(s_1)$ at step 1. We can also define the best response $\nu^\dagger(\mu)$ of a general policy $\mu$ as the minimizing policy, so that $V_1^{\mu,\dagger}(s_1) := V_1^{\mu,\nu^\dagger(\mu)}(s_1) = \inf_{\nu} V_1^{\mu,\nu}(s_1)$ at step 1. We remark that the best response to a general policy is not necessarily Markovian.
Learning Objective
There are two possible learning objectives in the setting of Markov games. The first one is to find the best response for a fixed opponent. Without loss of generality, we consider the case where the learning agent is the max-player, and the min-player is the opponent.
Definition 1 ($\epsilon$-approximate best response).
For an opponent with a fixed unknown general policy $\nu$, a general policy $\hat{\mu}$ is an $\epsilon$-approximate best response if $V_1^{\dagger,\nu}(s_1) - V_1^{\hat{\mu},\nu}(s_1) \le \epsilon$.
The second goal is to find a Nash equilibrium of the Markov game. We measure the suboptimality of any pair of general policies $(\hat{\mu}, \hat{\nu})$ using the gap between their performance and the performance of the optimal strategy (i.e., the Nash equilibrium) when playing against the respective best responses:
Definition 2 ($\epsilon$-approximate Nash equilibrium).
A pair of general policies $(\hat{\mu}, \hat{\nu})$ is an $\epsilon$-approximate Nash equilibrium if $V_1^{\dagger,\hat{\nu}}(s_1) - V_1^{\hat{\mu},\dagger}(s_1) \le \epsilon$.
Loosely speaking, Nash equilibria can be viewed as “the best responses to the best responses”. In most applications, they are the ultimate solutions to the games. In Sections 3 and 4, we present sharp guarantees for learning an approximate Nash equilibrium with near-optimal sample complexity. However, rather surprisingly, learning a best response in the worst case is more challenging than learning the Nash equilibrium. In Section 5, we present a computational hardness result for learning an approximate best response.
3 Optimistic Nash Q-learning
In this section, we present our first algorithm Optimistic Nash Q-learning and its corresponding theoretical guarantees.
Algorithm part I: learning values
Our algorithm Optimistic Nash Q-learning (Algorithm 1) is an optimistic variant of Nash Q-learning [11]. For each step in each episode, it (a) takes actions according to the previously computed policy $\pi_h$, and observes the reward and next state, (b) performs incremental updates on Q-values, and (c) computes new greedy policies and updates $V$-values. Part (a) is straightforward; we now focus on explaining parts (b) and (c).
In part (b), the incremental updates on Q-values (Lines 8 and 9) are almost the same as in standard Q-learning [34], except that here we maintain two separate Q-values—an upper and a lower confidence version respectively. We add and subtract a bonus term $\beta_t$ in the corresponding updates, which depends on $t$—the number of times $(s, a, b)$ has been visited at step $h$. We pick the learning rate $\alpha_t$ and the bonus $\beta_t$ as follows, for some large constant $c$ and log factor $\iota$:

$$\alpha_t = \frac{H+1}{H+t}, \qquad \beta_t = c\sqrt{H^3\iota/t}. \qquad (3)$$
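The following sketch (ours) illustrates the flavor of part (b): incremental, optimistic updates of the upper and lower Q-estimates at the visited tuple. The data structures, the constant `c`, and the initialization convention are our own illustration and follow the choices in (3) above; this is not a complete implementation of Algorithm 1.

```python
import numpy as np

def update_q(Q_up, Q_low, N, h, s, a, b, r, V_up_next, V_low_next, H, c=2.0, iota=1.0):
    """One optimistic Nash Q-learning update at (h, s, a, b).

    Q_up, Q_low : dict-like upper/lower Q estimates keyed by (h, s, a, b),
                  assumed pre-initialized (e.g., optimistically to H and 0).
    N           : visit counts keyed by (h, s, a, b).
    V_up_next, V_low_next : current upper/lower V estimates of the observed next state.
    """
    N[h, s, a, b] += 1
    t = N[h, s, a, b]
    alpha = (H + 1) / (H + t)              # learning rate alpha_t = (H+1)/(H+t)
    beta = c * np.sqrt(H**3 * iota / t)    # optimism bonus; c is an illustrative constant
    Q_up[h, s, a, b] = (1 - alpha) * Q_up[h, s, a, b] + alpha * (r + V_up_next + beta)
    Q_low[h, s, a, b] = (1 - alpha) * Q_low[h, s, a, b] + alpha * (r + V_low_next - beta)
```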
In part (c), our greedy policies are computed using a Coarse Correlated Equilibrium (CCE) subroutine, which was first introduced by [36] to solve Markov games using value iteration algorithms. For any pair of matrices $\bar{Q}, \underline{Q} \in \mathbb{R}^{A\times B}$, $\mathrm{CCE}(\bar{Q}, \underline{Q})$ returns a distribution $\pi \in \Delta_{\mathcal{A}\times\mathcal{B}}$ such that

$$\mathbb{E}_{(a,b)\sim\pi}\,\bar{Q}(a,b) \;\ge\; \max_{a^\star}\,\mathbb{E}_{(a,b)\sim\pi}\,\bar{Q}(a^\star,b), \qquad \mathbb{E}_{(a,b)\sim\pi}\,\underline{Q}(a,b) \;\le\; \min_{b^\star}\,\mathbb{E}_{(a,b)\sim\pi}\,\underline{Q}(a,b^\star). \qquad (4)$$
It can be shown that a CCE always exists, and that it can be computed by linear programming in polynomial time (see Appendix B for more details).
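Since the constraints in (4) are linear in the joint distribution, one way to implement the CCE subroutine is a feasibility linear program. The sketch below (ours) uses `scipy.optimize.linprog`; it is one possible implementation under these assumptions, not necessarily the one intended by the paper.

```python
import numpy as np
from scipy.optimize import linprog

def cce(Q_up, Q_low):
    """Return a coarse correlated equilibrium pi over the A x B joint actions.

    Q_up, Q_low : (A, B) payoff matrices (upper/lower Q estimates).
    pi satisfies (4): no unilateral max-player deviation improves the expected
    value under Q_up, and no min-player deviation lowers it under Q_low.
    """
    A, B = Q_up.shape
    n = A * B
    A_ub, b_ub = [], []
    # Max-player: for every deviation a*, sum_{a,b} pi(a,b) * (Q_up[a*,b] - Q_up[a,b]) <= 0.
    for a_star in range(A):
        A_ub.append((Q_up[a_star, :][None, :] - Q_up).reshape(n))
        b_ub.append(0.0)
    # Min-player: for every deviation b*, sum_{a,b} pi(a,b) * (Q_low[a,b] - Q_low[a,b*]) <= 0.
    for b_star in range(B):
        A_ub.append((Q_low - Q_low[:, b_star][:, None]).reshape(n))
        b_ub.append(0.0)
    # Feasibility LP: zero objective, pi is a probability distribution.
    res = linprog(c=np.zeros(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.ones((1, n)), b_eq=np.array([1.0]),
                  bounds=[(0.0, None)] * n, method="highs")
    return res.x.reshape(A, B)
```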
Now we are ready to state an intermediate guarantee for optimistic Nash Q-learning. We assume the algorithm has played the game for $K$ episodes, and we use superscript $k$ to denote the values, visitation counts, and policies at the beginning of the $k$-th episode in Algorithm 1.
Lemma 3.
Lemma 3 makes two statements. First, it claims that the upper and lower values computed in Algorithm 1 are indeed upper and lower bounds on the value of the Nash equilibrium. Second, Lemma 3 claims that the averages of the upper bounds and the lower bounds are also very close to the value $V_1^\star(s_1)$ of the Nash equilibrium, where the gap decreases as $\tilde{\mathcal{O}}(\sqrt{H^5SAB/K})$. This implies that in order to learn the value up to $\epsilon$-accuracy, we only need $\tilde{\mathcal{O}}(H^5SAB/\epsilon^2)$ episodes.
However, Lemma 3 has a significant drawback: it only guarantees learning of the value of the Nash equilibrium. It does not imply that the policies used in Algorithm 1 are close to the Nash equilibrium, which would require the policies to have near-optimal performance even against their best responses. This is a major difference between Markov games and standard MDPs, and is the reason why standard techniques from the MDP literature do not apply here. To resolve this problem, we propose a novel way to extract a certified policy from the optimistic Nash Q-learning algorithm.
Algorithm part II: certified policies
Our procedure for executing the certified policy $\hat{\mu}$ of the max-player is described in Algorithm 2. Above, $\mu^k_h$ and $\nu^k_h$ denote the marginal distributions of the joint policy $\pi^k_h$ produced in Algorithm 1 over the action sets $\mathcal{A}$ and $\mathcal{B}$ respectively. We also introduce the following quantities, directly induced by the learning rates $\{\alpha_t\}$:

$$\alpha_t^0 := \prod_{j=1}^{t}(1-\alpha_j), \qquad \alpha_t^i := \alpha_i\prod_{j=i+1}^{t}(1-\alpha_j), \qquad (5)$$

whose properties are listed in Lemma 11. In particular, $\sum_{i=1}^{t}\alpha_t^i = 1$, so $(\alpha_t^1,\ldots,\alpha_t^t)$ defines a distribution over $[t]$. We use $k^i_h(s,a,b)$ to denote the index of the episode in which $(s,a,b)$ is observed at step $h$ for the $i$-th time. The certified policy $\hat{\nu}$ of the min-player is defined by symmetry. We note that $\hat{\mu}$ and $\hat{\nu}$ are clearly general policies, but they are no longer Markov policies.
The intuitive reason why the policy $\hat{\mu}$ defined in Algorithm 2 is certified by the Nash Q-learning algorithm is that the update equation in line 8 of Algorithm 1, combined with equation (5), expresses the upper Q-estimate at $(s,a,b)$ as a weighted mixture, over the past visits $i \in [t]$ with weights $\alpha_t^i$, of the immediate reward plus the upper value estimates at the corresponding next states (plus bonus terms).
This certifies good performance against the best response if, from step $h+1$ onwards, the max-player plays a mixture of the policies from the episodes of past visits with mixing weights $\{\alpha_t^i\}_{i\in[t]}$ (see Appendix C.2 for more details). A recursion of this argument leads to the certified policy $\hat{\mu}$—a nested mixture of policies.
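For intuition, here is a minimal sketch (ours) of the nested-mixture structure behind the certified max-player policy. All names are our own, and the exact bookkeeping (in particular the order of acting and re-sampling, and the indexing of visit counts) should be read off Algorithm 2 itself.

```python
import numpy as np

class CertifiedMaxPolicy:
    """Sketch of the certified (non-Markov) max-player policy, cf. Algorithm 2.

    Inputs are quantities recorded while running Algorithm 1 (names are ours):
      mu[k][h][s]           -- max-player marginal of the joint policy at episode k,
      N[k][h][(s, a, b)]    -- visits of (s, a, b) at step h before episode k,
      k_index[h][(s, a, b)] -- episode indices of the successive visits to (s, a, b),
      alpha_weights(t)      -- mixing weights (alpha_t^1, ..., alpha_t^t), summing to 1.
    """

    def __init__(self, K, mu, N, k_index, alpha_weights, rng=None):
        self.mu, self.N, self.k_index, self.alpha = mu, N, k_index, alpha_weights
        self.rng = rng or np.random.default_rng()
        self.k = self.rng.integers(K)   # start by sampling a past episode uniformly

    def act(self, h, s):
        # At step h, play the policy recorded in the currently tracked episode.
        p = self.mu[self.k][h][s]
        return self.rng.choice(len(p), p=p)

    def observe(self, h, s, a, b):
        # After seeing the opponent's action, re-sample which earlier episode to
        # follow from step h+1 on, weighted by the alpha-weights over past visits.
        t = self.N[self.k][h][(s, a, b)]
        i = self.rng.choice(t, p=self.alpha(t))
        self.k = self.k_index[h][(s, a, b)][i]
```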
We now present our main result for Nash Q-learning, using the certified policies .
Theorem 4 (Sample Complexity of Nash Q-learning).
Theorem 4 asserts that if we run the optimistic Nash Q-learning algorithm for more than $\tilde{\mathcal{O}}(H^5SAB/\epsilon^2)$ episodes, the certified policies $(\hat{\mu}, \hat{\nu})$ extracted using Algorithm 2 will form an $\epsilon$-approximate Nash equilibrium (Definition 2).
We make two remarks. First, executing the certified policies requires storing all the policies encountered during training. This makes the space complexity of our algorithm scale linearly in the total number of episodes $K$. Second, Q-learning style algorithms (especially with online updates) are crucial in our analysis for achieving sample complexity linear in $S$. They enjoy the property that every sample is used only once, on a value function that is independent of this sample. In contrast, value iteration type algorithms do not enjoy such an independence property, which is why the best existing sample complexity scales as $S^2$ [2]. ³Although [1] provides techniques to improve the sample complexity from $S^2$ to $S$ for value iteration in MDPs, the same techniques cannot be applied to Markov games due to the unique challenge that, in Markov games, we aim at finding policies that are good against their best responses.
4 Optimistic Nash V-learning
In this section, we present our new algorithm, Optimistic Nash V-learning, and its corresponding theoretical guarantees. This algorithm improves over Nash Q-learning in sample complexity from $\tilde{\mathcal{O}}(SAB)$ to $\tilde{\mathcal{O}}(S(A+B))$, when only highlighting the dependency on $(S, A, B)$.
Algorithm description
Nash V-learning combines the idea of Follow-the-Regularized-Leader (FTRL) from the bandit literature with the Q-learning algorithm from reinforcement learning. This algorithm does not require extra information exchange between players other than standard game playing, and thus can be run separately by the two players. We describe the max-player version in Algorithm 3. See Algorithm 7 in Appendix D for the min-player version, where the corresponding quantities are defined symmetrically.
For each step in each episode, the algorithm (a) first takes an action according to $\mu_h$, and observes the action of the opponent, the reward, and the next state, (b) performs an incremental update on the upper value estimate, and (c) updates the policy $\mu_h$. The first two parts are very similar to Nash Q-learning. In the third part, the agent first computes an importance-weighted estimator of the current loss, and then the weighted cumulative loss. Finally, the policy $\mu_h$ is updated following the FTRL principle: it minimizes the weighted cumulative estimated loss plus a regularizer toward the uniform distribution over all actions $\mathcal{A}$. Solving this minimization problem gives the closed-form update in Line 12 of Algorithm 3. In multi-armed bandits, FTRL can defend against adversarial losses, with regret independent of the number of the opponent's actions. This property turns out to be crucial for Nash V-learning to achieve sharper sample complexity than Nash Q-learning (see the analog of Lemma 3 in Lemma 15).
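A minimal sketch (ours) of part (c), assuming an entropy regularizer so that FTRL reduces to an exponential-weights update over importance-weighted cumulative losses. The function name, the optional implicit-exploration bias `gamma`, and the weighting are our own illustration; the precise choices are those of Algorithm 3 and equation (6).

```python
import numpy as np

def ftrl_policy_update(cum_loss, a_taken, loss_value, prob_taken, eta, gamma=0.0):
    """One FTRL step for the max-player at a fixed (s, h).

    cum_loss   : running (weighted) cumulative loss estimates, one entry per action.
    a_taken    : action played at this visit; loss_value is its observed loss.
    prob_taken : probability with which a_taken was played (for importance weighting).
    eta        : FTRL step size; gamma is an optional implicit-exploration bias.
    Returns the updated cumulative losses and the new policy.
    """
    A = len(cum_loss)
    # Importance-weighted (optionally biased) estimate of the full loss vector.
    loss_hat = np.zeros(A)
    loss_hat[a_taken] = loss_value / (prob_taken + gamma)
    cum_loss = cum_loss + loss_hat
    # Entropy-regularized FTRL => exponential weights over cumulative losses.
    logits = -eta * cum_loss
    logits -= logits.max()        # numerical stability
    policy = np.exp(logits)
    policy /= policy.sum()
    return cum_loss, policy
```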
Similar to Nash Q-learning, we also propose a new algorithm (Algorithm 4) to extract a certified policy from the optimistic Nash V-learning algorithm. The certified policies are again non-Markovian. We choose all hyperparameters (learning rates, FTRL step sizes, and bonus terms) as follows, for some large constant $c$ and log factor $\iota$:
(6)
We now present our main result on the sample complexity of Nash V-learning.
Theorem 5 (Sample Complexity of Nash V-learning).
Theorem 5 claims that if we run optimistic Nash V-learning for more than $\tilde{\mathcal{O}}(H^6S(A+B)/\epsilon^2)$ episodes, the certified policies extracted from Algorithm 4 will form an $\epsilon$-approximate Nash equilibrium (Definition 2). Nash V-learning is the first algorithm whose sample complexity matches the information-theoretic lower bound up to $\mathrm{poly}(H)$ factors and logarithmic terms.
5 Hardness for Learning the Best Response
In this section, we present a computational hardness result for computing the best response against an opponent with a fixed unknown policy. We further show that this implies a computational hardness result for achieving sublinear regret in Markov games when playing against adversarial opponents, which rules out a popular approach to designing algorithms for finding Nash equilibria.
We first remark that if the opponent is restricted to only play Markov policies, then learning the best response is as easy as learning an optimal policy in a standard single-agent Markov decision process, where efficient algorithms are known to exist. Nevertheless, when the opponent can play any policy, which may be non-Markovian, we show that finding the best response against such policies is computationally challenging.
We say an algorithm is a polynomial time algorithm for learning the best response if, for any general policy $\nu$ of the opponent and any $\epsilon > 0$, the algorithm finds an $\epsilon$-approximate best response to policy $\nu$ (Definition 1) with at least constant probability, in time polynomial in $S$, $A$, $B$, $H$, and $1/\epsilon$.
We can show the following hardness result for finding the best response in polynomial time.
Theorem 6 (Hardness for learning the best response).
There exists a Markov game with deterministic transitions and rewards defined for any horizon with , , and , such that if there exists a polynomial time algorithm for learning the best response for this Markov game, then there exists a polynomial time algorithm for learning parity with noise (see problem description in Appendix E).
We remark that learning parity with noise is a notoriously difficult problem that has been used to design efficient cryptographic schemes. It is conjectured by the community to be hard.
Conjecture 7 ([16]).
There is no polynomial time algorithm for learning parity with noise.
Theorem 6 together with Conjecture 7 demonstrates the fundamental difficulty—if not strict impossibility—of designing a polynomial time algorithm for learning the best responses in Markov games. The intuitive reason for such computational hardness is that, while the underlying system has Markov transitions, the opponent can play policies that encode long-term correlations of a non-Markovian nature, such as parity with noise, which makes it very challenging to find the best response. It is known that learning many other sequential models with long-term correlations (such as hidden Markov models or partially observable MDPs) is as hard as learning parity with noise [20].
5.1 Hardness for Playing Against Adversarial Opponent
Theorem 6 directly implies the difficulty of achieving sublinear regret when playing against adversarial opponents in Markov games. Our construction of hard instances in the proof of Theorem 6 further allows the adversarial opponent to play only Markov policies in each episode. Since playing against an adversarial opponent is a different problem of independent interest, we present the full result here.
Without loss of generality, we still consider the setting where the algorithm controls only the max-player, while the min-player is an adversarial opponent. At the beginning of every episode $k$, both players pick their own policies $\mu^k$ and $\nu^k$, and execute them throughout the episode. The adversarial opponent can possibly pick her policy adaptively, based on all the observations in the earlier episodes.
We say an algorithm for the learner is a polynomial time no-regret algorithm if there exists a $\delta > 0$ such that for any adversarial opponent and any fixed $K$, the algorithm outputs policies $\{\mu^k\}_{k=1}^{K}$ which satisfy the following, with high probability, in time polynomial in $S$, $A$, $B$, $H$, and $K$:

$$\sup_{\mu}\sum_{k=1}^{K}\big(V_1^{\mu,\nu^k} - V_1^{\mu^k,\nu^k}\big)(s_1) \;\le\; \mathrm{poly}(S,A,B,H)\cdot K^{1-\delta}. \qquad (7)$$
Theorem 6 directly implies the following hardness result for achieving no-regret against adversarial opponents, a result that first appeared in [25].
Corollary 8 (Hardness for playing against adversarial opponent).
There exists a Markov game with deterministic transitions and rewards defined for any horizon with , , and , such that if there exists a polynomial time no-regret algorithm for this Markov game, then there exists a polynomial time algorithm for learning parity with noise (see problem description in Appendix E). The claim remains to hold even if we restrict the adversarial opponents in the Markov game to be non-adaptive, and to only play Markov policies in each episode.
Similar to Theorem 6, Corollary 8 combined with Conjecture 7 demonstrates the fundamental difficulty of designing a polynomial time no-regret algorithm against adversarial opponents for Markov games.
Implications on algorithm design for finding Nash Equilibria
Corollary 8 also rules out a natural approach for designing efficient algorithms for finding an approximate Nash equilibrium through combining two no-regret algorithms. In fact, it is not hard to see that if the min-player also runs a no-regret algorithm and obtains a regret bound symmetric to (7), then summing the two regret bounds shows that the mixture policies $(\hat{\mu}, \hat{\nu})$—which assign uniform mixing weights to the policies $\{\mu^k\}_{k=1}^{K}$ and $\{\nu^k\}_{k=1}^{K}$ respectively—form an approximate Nash equilibrium (see the derivation below). Corollary 8 together with Conjecture 7 thus asserts that any algorithm designed through this approach cannot be a polynomial time algorithm.
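To spell out this reasoning (in our notation, with $\mu^k, \nu^k$ the policies in episode $k$ and $\hat{\mu}, \hat{\nu}$ the uniform mixtures over them), summing the two hypothetical regret bounds of the form (7) gives

$$\sup_{\mu}\sum_{k=1}^{K}\big(V_1^{\mu,\nu^k}-V_1^{\mu^k,\nu^k}\big)(s_1)+\sup_{\nu}\sum_{k=1}^{K}\big(V_1^{\mu^k,\nu^k}-V_1^{\mu^k,\nu}\big)(s_1)\;\le\;2\,\mathrm{poly}(S,A,B,H)\,K^{1-\delta}.$$

Since $V_1^{\mu,\hat{\nu}}(s_1)=\frac{1}{K}\sum_{k}V_1^{\mu,\nu^k}(s_1)$ for any fixed $\mu$ (and symmetrically for $\hat{\mu}$ against any fixed $\nu$), dividing by $K$ yields

$$V_1^{\dagger,\hat{\nu}}(s_1)-V_1^{\hat{\mu},\dagger}(s_1)=\sup_{\mu}\frac{1}{K}\sum_{k=1}^{K}V_1^{\mu,\nu^k}(s_1)-\inf_{\nu}\frac{1}{K}\sum_{k=1}^{K}V_1^{\mu^k,\nu}(s_1)\;\le\;2\,\mathrm{poly}(S,A,B,H)\,K^{-\delta},$$

so $(\hat{\mu},\hat{\nu})$ is an $\epsilon$-approximate Nash equilibrium once $K \ge \big(2\,\mathrm{poly}(S,A,B,H)/\epsilon\big)^{1/\delta}$.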
6 Conclusion
In this paper, we designed the first line of near-optimal self-play algorithms for finding an approximate Nash equilibrium in two-player Markov games. The sample complexity of our algorithms matches the information-theoretic lower bound up to a polynomial factor in the length of each episode. Apart from finding Nash equilibria, we also proved the computational hardness of learning the best responses to fixed opponents, as well as of achieving sublinear regret against adversarial opponents, in Markov games.
References
- Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR. org, 2017.
- Bai and Jin [2020] Yu Bai and Chi Jin. Provable self-play algorithms for competitive reinforcement learning. arXiv preprint arXiv:2002.04017, 2020.
- Baker et al. [2020] Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. Emergent tool use from multi-agent autocurricula. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkxpxJBKwS.
- Brafman and Tennenholtz [2002] Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
- Brambilla et al. [2013] Manuele Brambilla, Eliseo Ferrante, Mauro Birattari, and Marco Dorigo. Swarm robotics: a review from the swarm engineering perspective. Swarm Intelligence, 7(1):1–41, 2013.
- Brown and Sandholm [2019] Noam Brown and Tuomas Sandholm. Superhuman ai for multiplayer poker. Science, 365(6456):885–890, 2019.
- Dann et al. [2017] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
- Dann et al. [2019] Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, pages 1507–1516, 2019.
- Filar and Vrieze [2012] Jerzy Filar and Koos Vrieze. Competitive Markov decision processes. Springer Science & Business Media, 2012.
- Hansen et al. [2013] Thomas Dueholm Hansen, Peter Bro Miltersen, and Uri Zwick. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM (JACM), 60(1):1–16, 2013.
- Hu and Wellman [2003] Junling Hu and Michael P Wellman. Nash q-learning for general-sum stochastic games. Journal of machine learning research, 4(Nov):1039–1069, 2003.
- Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
- Jia et al. [2019] Zeyu Jia, Lin F Yang, and Mengdi Wang. Feature-based q-learning for two-player stochastic games. arXiv preprint arXiv:1906.00423, 2019.
- Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4868–4878, 2018.
- Jin et al. [2019] Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu. Learning adversarial markov decision processes with bandit feedback and unknown transition. arXiv preprint arXiv:1912.01192, 2019.
- Kearns [1998] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.
- Lattimore and Szepesvári [2018] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. 2018.
- Littman [1994] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pages 157–163. Elsevier, 1994.
- Littman [2001] Michael L Littman. Friend-or-foe q-learning in general-sum games. In ICML, volume 1, pages 322–328, 2001.
- Mossel and Roch [2005] Elchanan Mossel and Sébastien Roch. Learning nonsingular phylogenies and hidden markov models. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pages 366–375, 2005.
- Neu [2015] Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, pages 3168–3176, 2015.
- OpenAI [2018] OpenAI. Openai five. https://blog.openai.com/openai-five/, 2018.
- Osband and Van Roy [2016] Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.
- Osband et al. [2014] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
- Radanovic et al. [2019] Goran Radanovic, Rati Devidze, David Parkes, and Adish Singla. Learning to collaborate in markov decision processes. In International Conference on Machine Learning, pages 5261–5270, 2019.
- Rosenberg and Mansour [2019] Aviv Rosenberg and Yishay Mansour. Online convex optimization in adversarial markov decision processes. arXiv preprint arXiv:1905.07773, 2019.
- Shalev-Shwartz et al. [2016] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
- Shapley [1953] Lloyd S Shapley. Stochastic games. Proceedings of the national academy of sciences, 39(10):1095–1100, 1953.
- Sidford et al. [2019] Aaron Sidford, Mengdi Wang, Lin F Yang, and Yinyu Ye. Solving discounted stochastic two-player games with near-optimal time and sample complexity. arXiv preprint arXiv:1908.11071, 2019.
- Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
- Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
- Strehl et al. [2006] Alexander L Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L Littman. PAC model-free reinforcement learning. In International Conference on Machine Learning, pages 881–888, 2006.
- Vinyals et al. [2019] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
- Watkins [1989] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. 1989.
- Wei et al. [2017] Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. Online reinforcement learning in stochastic games. In Advances in Neural Information Processing Systems, pages 4987–4997, 2017.
- Xie et al. [2020] Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. Learning zero-sum simultaneous-move markov games using function approximation and correlated equilibrium. arXiv preprint arXiv:2002.07066, 2020.
- Yadkori et al. [2013] Yasin Abbasi Yadkori, Peter L Bartlett, Varun Kanade, Yevgeny Seldin, and Csaba Szepesvári. Online learning in markov decision processes with adversarially chosen transition probability distributions. In Advances in neural information processing systems, pages 2508–2516, 2013.
- Zimin and Neu [2013] Alexander Zimin and Gergely Neu. Online learning in episodic markovian decision processes by relative entropy policy search. In Advances in neural information processing systems, pages 1583–1591, 2013.
Appendix A Bellman Equations for Markov Games
In this section, we present the Bellman equations for different types of values in Markov games.
Fixed policies.
Best responses.
For any Markov policy of the max-player, by definition, we have the following Bellman equations for values of its best response:
for all , where for all .
Similarly, for any Markov policy of the min-player, we also have the following symmetric version of Bellman equations for values of its best response:
for all , where for all .
Nash equilibria.
Finally, by definition of Nash equilibria in Markov games, we have the following Bellman optimality equations:
for all , where for all .
Appendix B Properties of Coarse Correlated Equilibrium
Recall the definition of CCE in the main paper (4); we restate it here after rescaling. For any pair of matrices $\bar{Q}, \underline{Q}$, the subroutine $\mathrm{CCE}(\bar{Q}, \underline{Q})$ returns a distribution $\pi \in \Delta_{\mathcal{A}\times\mathcal{B}}$ that satisfies:

$$\mathbb{E}_{(a,b)\sim\pi}\,\bar{Q}(a,b) \;\ge\; \max_{a^\star}\,\mathbb{E}_{(a,b)\sim\pi}\,\bar{Q}(a^\star,b), \qquad \mathbb{E}_{(a,b)\sim\pi}\,\underline{Q}(a,b) \;\le\; \min_{b^\star}\,\mathbb{E}_{(a,b)\sim\pi}\,\underline{Q}(a,b^\star). \qquad (8)$$
We make three remarks on the CCE. First, a CCE always exists, since a Nash equilibrium of the corresponding general-sum game is also a CCE, and a Nash equilibrium always exists. Second, a CCE can be efficiently computed, since the constraints (8) for a CCE can be rewritten as linear constraints on $\pi$, which can be solved efficiently by standard linear programming algorithms. Third, a CCE in general-sum games need not be a Nash equilibrium. However, a CCE in zero-sum games is guaranteed to be a Nash equilibrium.
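As a small illustrative example (ours): take the matching-pennies payoff matrix $P=\begin{pmatrix}1 & -1\\ -1 & 1\end{pmatrix}$. The uniform joint distribution $\pi(a,b)=1/4$ is a CCE of $(P,P)$: the expected payoff under $\pi$ is $0$, and every unilateral deviation of either player also yields expected payoff $0$ against the other player's uniform marginal, so no constraint in (8) is violated. Its marginals, $(1/2, 1/2)$ for each player, form the unique Nash equilibrium of $P$, consistent with Proposition 9 below.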
Proposition 9.
Let $\pi = \mathrm{CCE}(Q, Q)$, and let $(\mu, \nu)$ be the marginal distributions over the two players' actions induced by $\pi$. Then $(\mu, \nu)$ is a Nash equilibrium for the payoff matrix $Q$.
Proof of Proposition 9.
Let be the value of Nash equilibrium for . Since , by definition, we have:
This gives:
which finishes the proof. ∎
Intuitively, a CCE subroutine can be used in Nash Q-learning for finding an approximate Nash equilibrium because the upper and lower confidence values will eventually be very close to each other, so that the preconditions of Proposition 9 become approximately satisfied.
Appendix C Proof for Nash Q-learning
In this section, we present proofs for results in Section 3.
We denote for values and policies at the beginning of the -th episode. We also introduce the following short-hand notation .
We will use the following notations several times later: suppose $(s, a, b)$ was taken at the $h$-th step in certain episodes. Since the definition of the corresponding episode indices depends on the tuple $(s, a, b)$ and $h$, we will show this dependence explicitly when necessary and omit it when there is no confusion. We also define the visitation count to be the number of times $(s, a, b)$ has been taken at the $h$-th step before the beginning of the $k$-th episode. Finally, we denote the corresponding count at the current episode by a shorthand used below.
The following lemma is a simple consequence of the update rule in Algorithm 1, which will be used several times later.
Lemma 10.
Let and suppose was previously taken at episodes at the -th step. The update rule in Algorithm 1 is equivalent to the following equations.
(9)
(10)
C.1 Learning values
We begin with an auxiliary lemma. Some of the analysis in this section is adapted from [14], which studies Q-learning in the single-agent MDP setting.
Lemma 11.
([14, Lemma 4.1]) The following properties hold for $\alpha_t^i$:
1. $\frac{1}{\sqrt{t}} \le \sum_{i=1}^{t}\frac{\alpha_t^i}{\sqrt{i}} \le \frac{2}{\sqrt{t}}$ for every $t \ge 1$.
2. $\max_{i\in[t]}\alpha_t^i \le \frac{2H}{t}$ and $\sum_{i=1}^{t}(\alpha_t^i)^2 \le \frac{2H}{t}$ for every $t \ge 1$.
3. $\sum_{t=i}^{\infty}\alpha_t^i = 1 + \frac{1}{H}$ for every $i \ge 1$.
We also define . Now we are ready to prove Lemma 3.
Proof of Lemma 3.
We give the proof for one direction; the other direction is similar. For the proof of the first claim, let and suppose was previously taken at episodes at the -th step. Let be the $\sigma$-algebra generated by all the random variables in until the -th episode. Then is a martingale difference sequence w.r.t. the filtration . By Azuma-Hoeffding,
Here we prove a stronger version of the first claim by induction: for any ,
Suppose the guarantee is true for , then by the above concentration result,
Also,
where has just been proved. The other direction is proved similarly.
Now we continue with the proof of the second claim. Let and define , then by definition
where is by taking the difference of equation (9) and equation (10) and
is a martingale difference sequence.
Taking the summation w.r.t. , we begin with the first two terms,
where is by changing the order of summation and is by Lemma 11.
Plugging them in,
Recursing this argument for gives
By pigeonhole argument,
By Azuma-Hoeffding,
with high probability. The proof is completed by putting everything together.
∎
C.2 Certified policies
Algorithm 1 only learns the value of the game, but by itself it cannot give a near-optimal policy for each player. In this section, we analyze the certified policies based on the above exploration process (Algorithm 2) and prove the sample complexity guarantee. To this end, we need to first define a new group of policies to facilitate the proof; and are defined similarly. Notice is related to defined in Algorithm 2 by .
We also define for , which is an intermediate algorithm only involved in the analysis. The above two policies are related by where ; is defined similarly.
Since the policies defined in Algorithm 5 and Algorithm 6 are non-Markov, many notations for values of Markov policies are no longer valid here. To this end, we need to define the value and Q-value of general policies starting from step $h$, provided that the general policies starting from the $h$-th step do not depend on the history before the $h$-th step. Notice the special case $h=1$ has already been covered in Section 2. For a pair of general policies which do not depend on the history before the $h$-th step, we can still use the same definitions (1) and (2) to define their value and Q-value at step $h$. We can also define the best response of such a general policy as the minimizing policy of its value at step $h$; the other direction is defined similarly. As before, the best response of a general policy is not necessarily Markovian.
It should be clear from the definitions of Algorithm 5 and Algorithm 6 that these policies do not depend on the history before the step at which they start; therefore, the related value and Q-value functions are well defined for the corresponding steps. Now we can show that the policies defined above are indeed certified.
Lemma 12.
For any , with probability at least , the following holds for any ,
Proof of Lemma 12.
We first prove this for .
because is the last step and
because is CCE, and by definition .
Now suppose the claim is true for step $h+1$; consider step $h$. Consider a fixed tuple and let . Suppose it was previously taken at the $h$-th step in certain episodes. Let be the $\sigma$-algebra generated by all the random variables until the corresponding episode. Then is a martingale difference sequence w.r.t. the filtration . By Azuma-Hoeffding and the definition of ,
with high probability. Combining this with the induction hypothesis,
where we have taken the maximum operator out of the summation in , which does not increase the sum.
On the other hand,
where is by the definition of CCE and is by the induction hypothesis. The other direction is proved by performing similar arguments on , , and . ∎
Finally we give the theoretical guarantee of the policies defined above.
Appendix D Proof for Nash V-learning
In this section, we present proofs of the results in Section 4. We denote for values and policies at the beginning of the -th episode. We also introduce the following short-hand notation .
We will use the following notations several times later: suppose the state $s$ was visited at the $h$-th step in certain episodes. Since the definition of the corresponding episode indices depends on the state $s$, we will show this dependence explicitly when necessary and omit it when there is no confusion. We also define the visitation count to be the number of times the state $s$ has been visited at the $h$-th step before the beginning of the $k$-th episode. Finally, we denote the corresponding count at the current episode by a shorthand used below. Notice the definitions here are different from those in Appendix C.
The following lemma is a simple consequence of the update rule in Algorithm 3, which will be used several times later.
Lemma 13.
Let and suppose was previously visited at episodes at the -th step. The update rule in Algorithm 3 is equivalent to the following equations.
(11)
(12)
D.1 Missing algorithm details
We first give Algorithm 7: the min-player counterpart of Algorithm 3. Almost everything is symmetric except the definition of the loss function, which is chosen so as to keep it non-negative.
D.2 Learning values
As usual, we start with learning the value of the Markov game, beginning with an auxiliary lemma which justifies our choice of confidence bound.
Lemma 14.
Let and suppose the state was previously visited at the $h$-th step in certain episodes. Choosing and , with probability , for any , there exists a constant such that
Proof of Lemma 14.
We prove the first inequality; the proof of the second is similar. Throughout the proof we consider a fixed tuple. Define as the $\sigma$-algebra generated by all the random variables before the -th episode. Then is a martingale sequence w.r.t. the filtration . By Azuma-Hoeffding,
So we only need to bound
(13)
where is the weighted regret in the first times of visiting state , with respect to the optimal policy in hindsight, in the following adversarial bandit problem. The loss function is defined by
with weight . We note that the weighted regret can be rewritten as , where is the argmax in (13), and the loss function satisfies
Therefore, Algorithm 3 is essentially performing the Follow-the-Regularized-Leader (FTRL) algorithm with a changing step size for each state to solve this adversarial bandit problem. The policy we are using is , and the optimistically biased estimator
is used to handle the bandit feedback.
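For concreteness, one standard form of such an optimistically biased, importance-weighted estimator, in the style of the implicit-exploration estimator of [21] (our notation; the exact estimator and scaling in Algorithm 3 may differ), is

$$\hat{\ell}_t(a) \;=\; \frac{\ell_t(a)\,\mathbf{1}\{a_t = a\}}{\mu_t(a) + \gamma_t},$$

which under-estimates the true loss in expectation (hence "optimistic" for losses) and admits high-probability concentration bounds of the type developed in Lemma 18.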
Lemma 15.
Proof of Lemma 15.
We prove the first claim by backward induction. The claim is true for the last step. Assume it holds for any $s$ at the later step. For a fixed and episode , let and suppose was previously visited at the $h$-th step in certain episodes. By the Bellman equation,
Comparing with the decomposition of in Equation (11) and using Lemma 14, we can see that if , then . Similarly, by taking , we also have .
The second claim is to bound . Similar to what we have done in the Nash Q-learning analysis, taking the difference of Equation (11) and Equation (12),
where
Taking the summation w.r.t. , we begin with the first two terms,
where is by changing the order of summation and is by Lemma 11. Putting them together,
Recursing this argument for gives
By pigeonhole argument,
Expanding this formula repeatedly and applying the pigeonhole argument, we have
which finishes the proof. ∎
D.3 Certified policies
As before, we construct a series of new policies in Algorithm 8. Notice they are related to the policies defined in Algorithm 4 in the natural way. Also, we need to consider value and Q-value functions of general policies which do not depend on the history before the $h$-th step; see Appendix C.2 for details. Again, we can show that the policies defined above are indeed certified.
Lemma 16.
For any , with probability at least , the following holds for any ,
Proof of Lemma 16.
We prove one side by induction; the other side is similar. The claim is trivially satisfied for the last step. Suppose it is true for step $h+1$, and consider a fixed state . Let and suppose it was previously visited at the $h$-th step in certain episodes. Then using Lemma 13,
where is by using Lemma 14 and the definition of , and is by induction hypothesis. ∎
Equipped with the above lemmas, we are now ready to prove Theorem 5.
Appendix E Proofs of Hardness for Learning the Best Responses
In this section we give the proof of Theorem 6, and Corollary 8. Our proof is inspired by a computational hardness result for adversarial MDPs in [37, Section 4.2], which constructs a family of adversarial MDPs that are computationally as hard as an agnostic parity learning problem.
Sections E.1, E.2 and E.3 are devoted to proving Theorem 6, while Corollary 8 is proved in Section E.4. Towards proving Theorem 6, we proceed through the chain of reductions described below.
E.1 Markov game construction
We now describe a Markov game inspired by the adversarial MDP in [37, Section 4.2]. We define a Markov game in which we have states, including (the initial state) and (the terminal state).⁴ In each state, the max-player has two actions and , while the min-player has two actions and . The transition kernel is deterministic, and the next state for the earlier steps is defined in Table 2:
⁴In [37] the states are denoted differently. Here we slightly change the notation to make it different from the notation of the actions.
State/Action | ||||
At the last step, i.e. at states and , the next state is always the terminal state regardless of the actions chosen by both players. The reward is always zero except at the last step, where it is determined by the action of the min-player, as defined by
State/Action | ||
At the beginning of every episode, both players pick their own policies and , and execute them throughout the episode. The min-player can possibly pick her policy adaptively, based on all the observations in the earlier episodes. The only difference from the standard Markov game protocol is that the actions of the min-player, except at the last step, are revealed at the beginning of each episode, to match the setting of agnostic learning of parities (Problem 2 below). Therefore we are actually considering an easier problem (for the max-player), and the lower bound naturally applies.
E.2 A series of computationally hard problems
We first introduce a series of problems and then show how the reduction works.
Problem 1
The max-player $\epsilon$-approximates the best response for any general policy in the Markov game defined in Appendix E.1, with probability at least , in time.
Problem 2
Let $x$ be a vector in $\{0,1\}^n$ and $T \subseteq [n]$. The parity of $x$ on $T$ is the boolean function $\phi_T(x) = \bigoplus_{i\in T} x_i$. In words, $\phi_T(x)$ is $0$ if the number of ones in the subvector $(x_i)_{i\in T}$ is even and $1$ otherwise. A uniform query oracle for this problem is a randomized algorithm that returns a random uniform vector $x$, as well as a noisy classification $y$ which is equal to the true parity with probability $1-\alpha$ and flipped with probability $\alpha$. All examples returned by the oracle are independent. The learning parity with noise problem consists of designing an algorithm with access to the oracle (a code sketch of such an oracle is given after the list below) such that:
• (Problem 2.1) w.p. at least , find a (possibly random) function satisfying , in time.
• (Problem 2.2) w.p. at least , find satisfying , in time.
• (Problem 2.3) w.p. at least , find satisfying , in time.
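For intuition, here is a minimal Python sketch (ours) of the uniform query oracle for learning parity with noise. The vector length, subset, and noise level are illustrative parameters, not values taken from the construction above.

```python
import numpy as np

def make_parity_oracle(n, T, noise=0.1, rng=None):
    """Uniform query oracle for parity with noise.

    T     : the hidden subset of coordinates defining the parity function.
    noise : probability that the returned label is flipped.
    Each call returns (x, y) with x uniform in {0,1}^n and y a noisy parity of x on T.
    """
    rng = rng or np.random.default_rng()

    def query():
        x = rng.integers(0, 2, size=n)
        y = int(x[list(T)].sum() % 2)   # parity of x restricted to T
        if rng.random() < noise:
            y ^= 1                      # flip the label with probability `noise`
        return x, y

    return query

# Example usage (illustrative): samples from a noisy parity on coordinates {0, 2}.
oracle = make_parity_oracle(n=6, T={0, 2}, noise=0.25)
x, y = oracle()
```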
Problem 2.3 reduces to Problem 2.2
Step 1: Repeatedly apply the algorithm for Problem 2.2 times to get such that with probability at least . This costs time. Let where .
Step 2: Construct estimators using additional data ,
Pick . When , with probability at least , we have
This means that
This step uses time.
Step 3: Pick ; then the good events in step 1 and step 2 happen individually with probability and altogether with probability at least . The total time used is . Note that the dependence on is better than required.
Problem 2.2 reduces to Problem 2.1:
Suppose we have an algorithm that gives with probability . Then if we sample , by Markov's inequality, we have with probability that
Problem 2.1 reduces to Problem 1:
Consider the Markov game constructed above with . The only missing piece we fill in here is the policy of the min-player, which is constructed as follows. The min-player draws a sample from the uniform query oracle, then takes action at step if and otherwise. At the last step, the min-player takes action if and otherwise. Also notice that the policy of the max-player can be described by a set where he takes action at step if and otherwise. As a result, the max-player receives a non-zero reward iff .
In the Markov game, we have . As a result, the optimal policy corresponds to the true parity set . Therefore,
by the -approximation guarantee.
Also notice
This implies:
E.3 Putting them together
So far, we have proved that solving Problem 1 implies solving Problem 2.3, where Problem 1 is the problem of learning an $\epsilon$-approximate best response in Markov games (the problem we are interested in), and Problem 2.3 is precisely the problem of learning parity with noise [20]. This concludes the proof.
E.4 Proofs of Hardness Against Adversarial Opponents
Proof of Corollary 8.
We only need to prove that a polynomial time no-regret algorithm also learns the best response in a Markov game where the min-player follows a fixed non-Markov policy. Then the no-regret guarantee implies,
where is the policy of the max-player in the -th episode. If we choose uniformly at random from , then
Choosing appropriately, the running time of the no-regret algorithm remains polynomial, and it learns the $\epsilon$-approximate best response.
To see that Corollary 8 continues to hold for opponents that are Markovian in each episode and non-adaptive, we can take the hard instance in Theorem 6 and let denote the min-player's policy in the -th episode. Note that each such policy is Markovian and does not adapt to the observations in previous episodes. If there is a polynomial time no-regret algorithm against such an opponent, then by an online-to-batch conversion similar to the above, the uniform mixture of the learner's policies learns a best response in polynomial time.
∎
Appendix F Auxiliary Lemmas for Weighted Adversarial Bandit
In this section, we formulate the bandit problem to which we reduced in the proof of Lemma 14. Although the mechanisms are already well understood, we did not find a good reference for the Follow-the-Regularized-Leader (FTRL) algorithm with
1. a changing step size,
2. weighted regret, and
3. a high-probability regret bound.
For completeness, we give the detailed derivation here.
We assume and . Define , we set the hyperparameters by
Define the filtration by the $\sigma$-algebra generated by . Then the regret can be defined as
We can easily check that the definitions here are just an abstract version of those in the proof of Lemma 14 after rescaling. To state the regret guarantee, we also define for any . Now we can upper bound the regret as follows.
Lemma 17.
Following Algorithm 9, with probability , for any and we have
Proof.
The regret can be decomposed into three terms
Setting , the conditions in Lemma 19 and Lemma 21 are satisfied. Putting them together and taking a union bound, we have with probability
∎
The rest of this section is devoted to the proofs of the lemmas used in the proof of Lemma 17. We begin with the following useful lemma adapted from Lemma 1 in [21], which is crucial for constructing high-probability guarantees.
Lemma 18.
For any sequence of coefficients s.t. is -measurable, we have with probability ,
Proof.
Define . By definition,
where is because for all .
Defining the sum
we have
where is because for any and . Here we are using the condition to guarantee the condition is satisfied.
Equipped with the above bound, we can now prove the concentration result.
The claim is proved by taking the union bound. ∎
Using Lemma 18, we can bound the terms separately as below.
Lemma 19.
If for all , with probability , for any and ,
Proof.
Lemma 20.
With probability , for any ,
Proof.
We further decompose it into
The first term is bounded by
To bound the second term, notice
thus is a bounded martingale difference sequence w.r.t. the filtration . By Azuma-Hoeffding,
∎
Lemma 21.
With probability , for any and any , if is non-increasing in ,