
Easy as ABCs: Unifying Boltzmann Q-Learning and Counterfactual Regret Minimization

Luca D’Amico-Wong    Hugh Zhang    Marc Lanctot    David C. Parkes
Abstract

We propose ABCs (Adaptive Branching through Child stationarity), a best-of-both-worlds algorithm combining Boltzmann Q-learning (BQL), a classic reinforcement learning algorithm for single-agent domains, and counterfactual regret minimization (CFR), a central algorithm for learning in multi-agent domains. ABCs adaptively chooses what fraction of the environment to explore each iteration by measuring the stationarity of the environment's reward and transition dynamics. In Markov decision processes, ABCs converges to the optimal policy with at most an $O(A)$ factor slowdown compared to BQL, where $A$ is the number of actions in the environment. In two-player zero-sum games, ABCs is guaranteed to converge to a Nash equilibrium (assuming access to a perfect oracle for detecting stationarity), while BQL has no such guarantees. Empirically, ABCs demonstrates strong performance when benchmarked across environments drawn from the OpenSpiel game library and OpenAI Gym and exceeds all prior methods in environments which are neither fully stationary nor fully nonstationary.

Keywords: counterfactual regret minimization, Q-learning

1 Introduction

The ultimate dream of reinforcement learning (RL) is a general algorithm that can learn in any environment. Nevertheless, present-day RL often requires assuming that the environment is stationary (i.e., that the transition and reward dynamics do not change over time). When this assumption is violated, many RL methods, such as Boltzmann Q-learning (BQL), fail to learn good policies or even to converge at all, both in theory and in practice (Zinkevich et al., 2008; Laurent et al., 2011; Brown et al., 2020).

Meanwhile, breakthroughs in no-regret learning, such as counterfactual regret minimization (CFR) (Zinkevich et al., 2008), have led to tremendous progress in imperfect-information, multi-agent games like Poker and Diplomacy (Moravčík et al., 2017; Brown & Sandholm, 2018, 2019b; FAIR et al., 2022; Bakhtin et al., 2023). Such algorithms are able to guarantee convergence to Nash equilibria in two-player zero-sum games, which are typically not Markov Decision Processes (MDPs). However, CFR has poor scaling properties due to the need to perform updates across the entire game tree at every learning iteration, as opposed to only along a single trajectory like BQL. As a result, CFR algorithms are typically substantially less efficient than their RL counterparts when used on stationary environments such as MDPs. While Monte Carlo-based methods such as MCCFR have been proposed to help alleviate this issue (Lanctot et al., 2009), CFR algorithms often remain impractical even in toy reinforcement learning environments, such as Cartpole or the Arcade Learning Environment (Sutton & Barto, 2018; Bellemare et al., 2013).

We propose ABCs (Adaptive Branching through Child stationarity), a best-of-both-worlds approach that combines the strengths of BQL and CFR to create a single, versatile algorithm capable of learning in both stationary and nonstationary environments. The key insight behind ABCs is to dynamically adapt the algorithm’s learning strategy by measuring the stationarity of the environment. If an information state (a set of observationally equivalent states) is deemed to be approximately stationary, ABCs performs a relatively cheap BQL-like update, only exploring a single trajectory. On the other hand, if it is nonstationary, ABCs applies a more expensive CFR-like update, branching out all possible actions, to preserve convergence guarantees. This selective updating mechanism enables ABCs to exploit the efficiency of BQL in stationary settings while maintaining the robustness of CFR in nonstationary environments, only exploring the game tree more thoroughly when necessary.

The precise notion of stationarity that ABCs tests for is child stationarity, a weaker notion than the Markovian stationarity typically assumed in MDPs. In an MDP, the reward and transition functions must remain stationary for the full environment, regardless of the policy chosen by the agent. Instead, child stationarity isolates the transition associated with a specific infostate $s$ and action $a$ and requires only that this transition remain stationary with respect to the policies encountered over the course of learning. The primary advantage of testing for child stationarity rather than full Markov stationarity is that it allows ABCs to run efficiently in partially stationary environments. In empirical experiments across Cartpole, Kuhn/Leduc poker, and a third, novel environment combining the two to create a partially stationary environment, we find that ABCs performs comparably to BQL on stationary environments while still guaranteeing convergence to equilibria in all environments. Moreover, in the partially stationary experiments, ABCs outperforms both methods. Theoretically, under the slightly stronger assumption that we are given access to a perfect oracle for detecting child stationarity, we show that running a (fast) BQL update for transitions that satisfy child stationarity allows convergence to equilibria even if the remainder of the environment is nonstationary.

1.1 Related Work

Discovering general algorithms capable of learning in varied environments is a common theme in RL and more generally in machine learning. For example, DQN (Mnih et al., 2015) and its successors (Wang et al., 2016; Bellemare et al., 2017; Hessel et al., 2018) are capable of obtaining human-level performance on a test suite of games drawn from the Arcade Learning Environment, and algorithms such as AlphaZero (Silver et al., 2017) attain superhuman performance in perfect-information games such as Go, Shogi, and Chess. However, such algorithms still fail in multi-agent, imperfect-information settings such as Poker (Brown et al., 2020). More recently, several algorithms, such as Player of Games (PoG) (Schmid et al., 2021), ReBeL (Brown et al., 2020), and MMD (Sokota et al., 2023), have been able to simultaneously achieve reasonable performance on both perfect- and imperfect-information games. However, unlike ABCs, none of these algorithms adapt to the stationarity of their environment. As such, they can neither guarantee performance comparable to their reinforcement learning counterparts in MDPs nor efficiently allocate resources in environments which are partially stationary, as ABCs is able to do.

2 Preliminaries

2.1 Markov Decision Processes

Classical single-agent RL typically assumes that the learning environment is a Markov Decision Process (MDP). An MDP is defined by a tuple $(S, A, T, R, \gamma)$. $S$ and $A$ are the sets of states and actions, respectively. The state transition function $T(s' \mid s, a)$ defines the probability of transitioning to state $s'$ after taking action $a$ at state $s$. $R(r \mid s, a, s')$ is the reward function, which specifies the reward for taking action $a$ at state $s$, given that we transition to state $s'$. Importantly, in an MDP, both the transition and the reward functions are assumed to be stationary (fixed). $\gamma \in [0,1]$ is the discount factor determining the importance of future rewards relative to immediate rewards. In an episodic MDP, the agent interacts with the environment over a series of separate episodes via a policy, which is a function $\pi: S \rightarrow \Delta A$ mapping states to probabilities over valid actions. In a Partially Observable Markov Decision Process (POMDP), the tuple contains an additional element $\mathcal{H}$: the agent only observes information states $S$ while the true hidden states $\mathcal{H}$ of the world remain unknown. The transition and reward dynamics of the environment follow an MDP over the hidden states $\mathcal{H}$, but the agent's policy $\pi: S \to \Delta A$ must depend on the observed infostates rather than the hidden states. However, since POMDPs still contain only a single agent and fix the transition probabilities over time, they remain stationary environments, and there exist standard methods for reducing POMDPs to MDPs (Shani et al., 2013).

2.2 Finite Extensive-Form Games

While MDPs and POMDPs model a large class of single-agent environments, the stationarity assumption can easily fail in multi-agent environments. Specifically, from a single agent’s perspective, the transition and reward dynamics are no longer fixed as they now depend on the other agents, who are also learning. To capture these multi-agent environments, we cast our environment as a finite, imperfect information, extensive-form game with perfect recall.

In defining this, we have a set of $M$ players (or agents), and an additional chance player, denoted by $c$, whose moves capture any potential stochasticity in the game. There is a finite set $\mathcal{H}$ of possible histories/hidden states, each corresponding to a finite sequence of states encountered and actions taken by all players in the game. Let $h \sqsubseteq h'$ denote that $h$ is a prefix of $h'$, as sequences. The set of terminal histories, $Z \subseteq \mathcal{H}$, is the set of histories at which the game terminates. For each history $h \in \mathcal{H}$, we let $r_i(h)$ denote the reward obtained by player $i$ upon reaching history $h$. The total utility in a given episode for player $i$ after reaching terminal history $z \in Z$ is $u_i(z) = \sum_{h_t \sqsubseteq z} \gamma^t r_i(h_t)$, where $h_t$ denotes the prefix of $z$ at timestep $t$ (i.e., $t$ indexes a step within an episode). Let $u_i(h, h')$ denote the total utility for player $i$ when reaching history $h'$, starting from history $h$. Many of the guarantees in this paper focus on two-player zero-sum games, which have the additional property that the sum of the two players' utility functions at any $z \in Z$ is guaranteed to be 0.

To model imperfect information, we define the set of infostates $S$ as a partition of the possible histories $\mathcal{H}$. Let $S(h)$ denote the information state (abbreviated infostate) corresponding to history $h$. There is a player function $P: S \to \{1, \dots, M\} \cup \{c\}$ that maps infostates to the player whose turn it is to move. Each infostate $s \in S$ is composed of states that are observationally equivalent to the player $P(s)$ whose turn it is to play. As is standard, we assume perfect recall, in that players never forget information obtained through prior states and actions (Zinkevich et al., 2008; Brown & Sandholm, 2018; Moravčík et al., 2017; Celli et al., 2020).

Let $\mathcal{A}(s)$ denote the set of possible actions available to player $P(s)$ at infostate $s$. We denote the policy of player $i$ by $\pi_i: S \rightarrow \Delta\mathcal{A}$ and let $\pi = \{\pi_1, \cdots, \pi_M\}$ denote the joint policy. Let $\mathcal{H}_{\pi}(s)$ denote the distribution over true hidden states corresponding to infostate $s$, assuming all players play according to joint policy $\pi$.

In order to bridge the gap between POMDPs and general extensive-form games, it will be helpful to recast our multi-agent environment as $M$ separate single-agent environments. From the perspective of any single agent $i$, the policy of the other agents $\pi_{-i}$ uniquely defines a POMDP with which agent $i$ interacts alone. To construct this POMDP, after player $i$ takes an action, we simulate the play of the other agents according to $\pi_{-i}$ and return the state/rewards corresponding to the point at which it is next $i$'s turn to play. Within these single-agent POMDPs, defined by the joint policy $\pi$, we use $T_{\pi}\left(h' \mid h, a\right)$ to denote the probability that $h'$ is the next hidden state after player $i$ plays action $a$ at true hidden state $h$ (note that these transition probabilities are only defined for the player whose turn it is to play at history $h$). We can similarly define the transition probability to infostate $s'$ from hidden state $h$ as $T_{\pi}\left(s' \mid h, a\right) = \sum_{h'} \mathds{1}\left(S(h') = s'\right) T_{\pi}\left(h' \mid h, a\right)$, and the transition probabilities from infostate to infostate as $T_{\pi}\left(s' \mid s, a\right) = \mathbb{E}_{h \sim \mathcal{H}_{\pi}(s)}\left[T_{\pi}\left(s' \mid h, a\right)\right]$. Analogously, let $R_{\pi}\left(r \mid s, a, s'\right)$ denote the probability of obtaining immediate reward $r$ after taking action $a$ at infostate $s$ under joint policy $\pi$ and transitioning to infostate $s'$. For the above transitions to always be well defined, we require that all policies be fully mixed, in that $\pi(a \mid s) > 0$ for all valid actions $a$ and all infostates $s$. It is important to note that the $h', s'$ above do not refer to the next history/infostate seen in the full game but rather to the next in the single-agent POMDP of the current player, i.e., the agent who actually takes action $a$.

3 Unifying Q-Learning and Counterfactual Regret Minimization

The following notation will be useful for describing both BQL and CFR. Let $\eta^{\pi}(h)$ be the probability of reaching history $h$ if all players play policy $\pi$ from the start of the game. Let $\eta^{\pi}_{-i}(h)$ and $\eta^{\pi}_{i}(h)$ denote, respectively, the contribution of all players except player $i$ and the contribution of player $i$ to this probability. Analogously, let $\eta^{\pi}(s) = \sum_{h} \mathds{1}\left(S(h) = s\right) \eta^{\pi}(h)$ be the probability of reaching infostate $s$, and let $\eta^{\pi}(h, h')$ be the probability of reaching history $h'$ starting from history $h$. Finally, let $Z_s$ denote the set of all pairs $(h, z)$ such that $S(h) = s$ and $z \in Z$ is a possible terminal history when starting from $h$.

3.1 Boltzmann Q-Learning

Boltzmann Q-learning (BQL) is a variant of the standard Q-learning algorithm, where Boltzmann exploration is used as the exploration policy. Formally, we define the value $V$ of a state and the $Q$-value of an action at that state as:

$$V_{i}^{\pi}(s) = \sum_{(h,z)\in Z_{s}} \frac{\eta^{\pi}(h)}{\eta^{\pi}(s)}\, \eta^{\pi}(h,z)\, u_{i}(h,z)$$
$$Q^{\pi}(s,a) = \mathbb{E}_{\substack{s'\sim T_{\pi}(\cdot \mid s,a)\\ r\sim R_{\pi}(\cdot \mid s,a,s')}} \left[\, r + \gamma V_{i}^{\pi}(s')\, \right]$$

where the $i$ in the definition of $Q^{\pi}(s,a)$ refers to the player who plays at state $s$.

We adapt BQL to the multi-agent setting as follows. At each iteration, BQL first freezes the policies of each agent. It then samples a single trajectory for each agent. These trajectories take the form of $(s, a, r, s')$ tuples, recording the reward $r$ and new infostate $s'$ observed after taking action $a$ at infostate $s$. Q-values are updated for each $(s, a)$ pair in the trajectory using a temporal difference update $Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha\left(r + \gamma \max_{a'} Q(s', a')\right)$, where $\alpha$ denotes the learning rate. These average Q-values are stored, and at the end of each iteration $n$, a new joint policy $\pi^{n+1}(s,a) = \frac{\exp\left(Q(s,a)/\tau\right)}{\sum_{a'\in A}\exp\left(Q(s,a')/\tau\right)}$ is computed, where $\tau$ is the temperature parameter. While BQL converges to the optimal policy in MDPs and in perfect-information games (given a suitable temperature schedule) (Cesa-Bianchi et al., 2017; Singh et al., 2000), it fails to converge to a Nash equilibrium in imperfect-information multi-agent games such as Poker (Brown et al., 2020).
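To make the update above concrete, here is a minimal tabular sketch of one BQL step (our own illustration, not the paper's implementation). `Q` is assumed to be a `defaultdict(float)` keyed by `(infostate, action)` pairs; all other names are illustrative.

```python
import numpy as np

def boltzmann_policy(Q, s, actions, tau):
    """pi(a|s) proportional to exp(Q(s,a)/tau): Boltzmann exploration."""
    q = np.array([Q[(s, a)] for a in actions], dtype=float)
    logits = q / tau
    logits -= logits.max()               # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def bql_update(Q, trajectory, actions, alpha, gamma):
    """Temporal-difference update over one sampled trajectory of (s, a, r, s') tuples."""
    for s, a, r, s_next in trajectory:
        bootstrap = 0.0 if s_next is None else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * bootstrap)
    return Q
```

In the multi-agent setting described above, each agent's policy would be frozen for the iteration, one trajectory sampled per agent, and `bql_update` applied to each agent's Q-table before recomputing the joint policy with `boltzmann_policy`.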

3.2 Counterfactual Regret Minimization

Counterfactual regret minimization (CFR) is a popular method for solving imperfect-information extensive-form games and is based on the notion of minimizing counterfactual regrets. The counterfactual value of a given infostate is $v_{i}^{\pi}(s) = \sum_{(h,z)\in Z_{s}} \eta_{-i}^{\pi}(h)\, \eta^{\pi}(h,z)\, u_{i}(h,z)$. The counterfactual regret for not taking action $a$ at infostate $s$ is then given by $r^{\pi}(s,a) = v_{i}^{\pi_{(s\to a)}}(s) - v_{i}^{\pi}(s)$, where $i = P(s)$ and $\pi_{(s\to a)}$ is identical to policy $\pi$ except that action $a$ is always taken at infostate $s$. Notice that because $\eta^{\pi}_{i}(s) = \eta^{\pi}_{i}(h)$ for every $h$ in $s$ (due to the assumption of perfect recall), the counterfactual values are simply the standard RL value functions scaled by the opponents' contribution to the probability of reaching $s$; that is, $v_{i}^{\pi}(s) = \eta^{\pi}_{-i}(s)\, V_{i}^{\pi}(s)$, where $\eta^{\pi}_{-i}(s) = \sum_{h}\mathds{1}\left(S(h)=s\right)\eta^{\pi}_{-i}(h)$. (Intuitively, the counterfactual regret upper bounds how much an agent could possibly "regret" not playing action $a$ at infostate $s$ if she modified her policy to maximize the probability that state $s$ is encountered. As such, the scaling factor excludes player $i$'s own probability contribution to reaching $s$ under $\pi$.) This connection was first pointed out by Srinivasan et al. (2018). As with multi-agent BQL, at each iteration we freeze all agents' policies before traversing the game tree and computing the counterfactual regrets for each $(s,a)$ pair. A new current policy $\pi^{n+1}$ is then computed using a regret minimizer, typically regret matching (Hart & Mas-Colell, 2000) or Hedge (Arora et al., 2012a), on the cumulative regrets $R^{N}(s,a) = \sum_{n=1}^{N} r^{\pi^{n}}(s,a)$, and the process continues. A learning procedure has vanishing regret if $\lim_{N\rightarrow\infty}\frac{1}{N}R^{N}(s,a) = 0$ for all $s, a$. In two-player zero-sum games, Zinkevich et al. (2008) prove that CFR has vanishing regret and, by extension, that the average policy converges to a Nash equilibrium of the game. In general-sum games, computing a Nash equilibrium is computationally hard (Daskalakis et al., 2009), but CFR guarantees convergence to a weaker solution concept known as a coarse-correlated equilibrium (Morrill et al., 2021).

3.3 External Sampling MCCFR with Hedge

External sampling MCCFR (ES-MCCFR) is a popular variant of CFR which uses sampling to obtain unbiased estimates of the counterfactual regrets without exploring the full game tree (Lanctot et al., 2009). For each player $i$, ES-MCCFR samples a single action at every infostate at which a player other than $i$ acts, but explores all actions for player $i$. Let $W$ denote the set of terminal histories reached during this sampling process. For any infostate $s$, let $h(s)$ denote the corresponding history that was reached during exploration (there can only be one, due to perfect recall). For each visited infostate $s$, the sampled counterfactual regret, $\tilde{r}_{i}(s,a) = \tilde{v}^{\pi_{(s\to a)}}_{i}(s,a) - \tilde{v}^{\pi}_{i}(s,a) = \sum_{z\in W} u_{i}(h(s),z)\left(\eta_{i}^{\pi}(h(s'),z) - \eta_{i}^{\pi}(h(s),z)\right)$, is added to the cumulative regret, where $s'$ is the infostate reached after taking action $a$ at infostate $s$. After this process has been completed for each player, a new joint policy is computed using Hedge on the new cumulative regrets, and the procedure continues. Under Hedge, the joint policy at episode $n+1$ is given by $\pi^{n+1}(s,a) = \frac{e^{R^{n}(s,a)/\tau_{n}}}{\sum_{a'\in A(s)} e^{R^{n}(s,a')/\tau_{n}}}$, where $\tau_{n}$ is a temperature hyperparameter. As Hedge is invariant to any additive constants in the exponent, we can replace the sampled counterfactual regrets with the sampled counterfactual values, i.e., $\tilde{v}^{\pi}_{i}(s,a) = \sum_{z\in W} u_{i}(z)\cdot\eta_{i}^{\pi}(h(s'),z)$, and similarly with the cumulative regrets.
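As a small illustration of the Hedge step just described, the sketch below computes a policy from cumulative regrets (or, equivalently up to an additive shift, cumulative sampled counterfactual values) at a single infostate. This is our own minimal example; the argument names are illustrative.

```python
import numpy as np

def hedge_policy(cumulative_values, tau_n):
    """Policy with pi(a) proportional to exp(R(s, a) / tau_n).

    Subtracting the maximum entry is an additive shift inside the exponent,
    which Hedge is invariant to; it only improves numerical stability.
    """
    logits = np.asarray(cumulative_values, dtype=float) / tau_n
    logits -= logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()

# Example: three actions with cumulative regret (or value) estimates.
print(hedge_policy([2.0, 0.5, -1.0], tau_n=1.0))
```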

3.4 BQL vs. CFR

The most important difference between CFR and BQL is which infostates are updated at each iteration of learning. By default, BQL only performs updates along a single trajectory of encountered infostates ("trajectory based"). In a game of depth $D$, this means that there will be at most $O(D)$ updates per iteration. By contrast, CFR and most of its variants are typically not trajectory-based learning algorithms, because many of the infostates that are updated are not on the path of play. (In theory, variants such as OS-MCCFR can learn with $O(D)$ updates per iteration, but the variance of such methods is often too high for practical usage. Other approaches, such as ESCHER (McAleer et al., 2022), have attempted to address this issue by avoiding the use of importance sampling.) In particular, CFR (Zinkevich et al., 2008) requires performing counterfactual value updates at every infostate, which can require up to $O(A^D)$ updates per iteration. While MCCFR reduces this cost by performing Monte Carlo updates, variants such as ES-MCCFR still require an exponential number of updates per iteration.

Despite the differences between BQL and CFR, the two algorithms are intimately linked. Brown et al. (2019, Lemma 1) first pointed out that the sampled counterfactual values $\tilde{v}_{i}^{\pi}(s,a)$ from ES-MCCFR have an alternative interpretation as Monte Carlo estimates of the Q-value $Q^{\pi}(s,a)$. Thus, ES-MCCFR with Hedge has a gradient update that ends up being very similar to BQL, with two exceptions. Firstly, BQL estimates its Q-values by bootstrapping, whereas ES-MCCFR uses a Monte Carlo update and thus updates based on the current policy of all players. Secondly, BQL chooses a policy based on a softmax over the average Q-values, while CFR(Hedge) uses a softmax over cumulative Q-values. As not every action is tested in BQL, the average Q-values may be based on a different number of samples for each action, making it unclear how to convert average Q-values into cumulative Q-values.

4 Learning our ABCs

As described in Section 3.4, while the updates that CFR(Hedge) and BQL perform at each infoset are quite similar, BQL learns over trajectories, whereas CFR expands most (if not all) of the game tree at every iteration. This exploration is necessary in nonstationary environments, where the Q-values may change over time, but wasteful in stationary environments. As a concrete example, consider the following game. Half of the game consists of a standard game of poker, and half of the game is a "dummy" game identical to poker except that the payoff is always 0. In the "poker" subtree, BQL will fail to converge because this subtree is not an MDP. However, on the "dummy" subtree of the game, CFR will continue to perform full updates and explore the entire subtree, even though simply sampling trajectories on this "dummy" subtree would be sufficient for convergence.

What if we could formulate an algorithm that could itself discover when it needed to expand a child node, and when it could simply sample a trajectory? The hope is that this algorithm could closely match the performance of RL algorithms in stationary environments while retaining performance guarantees in nonstationary ones. Moreover, one can easily imagine general environments containing elements of both – playing a game of Poker followed by a game of Atari, for example – where such an algorithm could yield better performance than either CFR or BQL.

4.1 Child Stationarity

Motivated by the goal of adaptive exploration, we define the important concept of child stationarity, which is a relaxation of the standard Markov stationarity requirement. Let $\sigma \in \Delta\Pi$ denote a distribution over joint policies.

Definition 4.1 (Child stationarity).

An infostate $s$ and action $a$ satisfy child stationarity with respect to distributions over joint policies $\sigma$ and $\sigma'$ if $\mathbb{E}_{\pi\sim\sigma}\left[T_{\pi}(h'\mid s,a)\right] = \mathbb{E}_{\pi\sim\sigma'}\left[T_{\pi}(h'\mid s,a)\right]$ and $\mathbb{E}_{\pi\sim\sigma}\left[R_{\pi}(r\mid s,a,s')\right] = \mathbb{E}_{\pi\sim\sigma'}\left[R_{\pi}(r\mid s,a,s')\right]$, where $s'$ denotes the infostate observed after playing $a$ at $s$.

An infostate-action pair $(s,a)$ satisfies child stationarity for policy distributions $\sigma$ and $\sigma'$ if the local reward and state transition functions, from the perspective of the current player, remain stationary despite the change in policy distribution, even if other portions of the environment, possibly including ancestors of $s$, are nonstationary. Importantly, the transition function must remain constant with respect to the true, hidden states and not merely the observed infostates. (We give a more detailed discussion of the similarities and differences between child stationarity and Markovian stationarity in Appendix B.) By definition, an MDP satisfies child stationarity for all infostates $s$ and actions $a$ with respect to all possible distributions over joint policies $\sigma, \sigma'$. In contrast, in imperfect-information games such as Poker, the distribution over the hidden state at your next turn, after playing a given action $a$ at infostate $s$, will generally depend on the policies of the other players, violating child stationarity.

The utility of child stationarity can be seen as follows. While MDPs and perfect-information games can be decomposed into subgames that are solved independently of each other (Tesauro et al., 1995; Silver et al., 2016; Sutton & Barto, 2018), the same is not true of their imperfect-information counterparts (Brown et al., 2020). Finding transitions that satisfy child stationarity allows for natural decompositions of the environment that can be solved independently from the rest of the environment. As such, child stationarity is a core reason why ABCs is able to unify BQL and CFR into a single algorithm.

Define $G^{\sigma}(s,a)$ as a game whose initial hidden state is drawn from $\mathbb{E}_{\pi\sim\sigma}\left[T_{\pi}(h'\mid s,a)\right]$ and then played identically to $G$, the original game, from that point onward. $G^{\sigma}(s,a)$ is analogous to the concept of Public Belief States introduced by ReBeL (Brown et al., 2020, Section 4).

Theorem 4.2.

If $s, a$ satisfy child stationarity with respect to $\sigma, \sigma'$, then $G^{\sigma}(s,a) = G^{\sigma'}(s,a)$. Furthermore, if $\pi^{*}$ is a Nash equilibrium of $G^{\sigma}(s,a)$, then $\pi^{*}$ is also a Nash equilibrium of $G^{\sigma'}(s,a)$.

Algorithm 1 Returns True if $(s,a)$ is not child stationary.
infostate $s$, action $a$, rewards $r_{1:N}$ and child hidden states $h_{1:N}$ (observed after playing $(s,a)$), total episodes $N$, significance level $\alpha_s > 0$.
function DETECT($s, a, r_{1:N}, h_{1:N}, \alpha_s$)
     $X_1 \leftarrow \{(r_n, h_n) \mid n \in 1 \cdots \lfloor N/2 \rfloor\}$
     $X_2 \leftarrow \{(r_n, h_n) \mid n \in \lfloor N/2 \rfloor + 1 \cdots N\}$
     pval $\leftarrow$ ChiSquared($X_1, X_2$)
     return pval $< \alpha_s$
Proof.

By the definition of child stationarity, $\mathbb{E}_{\pi\sim\sigma}\left[T_{\pi}(h'\mid s,a)\right] = \mathbb{E}_{\pi\sim\sigma'}\left[T_{\pi}(h'\mid s,a)\right]$. Thus, $G^{\sigma}(s,a)$ and $G^{\sigma'}(s,a)$ are informationally equivalent, as all players have the same belief distribution over hidden states at the beginning of the game, and the two games will thus have identical equilibria. ∎

Child stationarity forms the crux of the adaptive exploration policy of ABCs, allowing us to employ trajectory-style updates at stationary $(s,a)$ pairs without affecting the convergence guarantees of CFR in two-player zero-sum games.

4.2 Detecting Child Stationarity

How can we measure child stationarity in practice? Notice that the definition only requires that the transition and reward dynamics stay fixed with respect to two particular distributions over joint policies, not all of them. Let $\pi_1, \pi_2, \cdots, \pi_N$ be the policies used by our learning procedure over iterations $1 \cdots N$. Define $\sigma^{N}_{A}$ as the uniform distribution over $\{\pi_1 \cdots \pi_{\lfloor N/2\rfloor}\}$ and $\sigma^{N}_{B}$ as the uniform distribution over $\{\pi_{\lfloor N/2\rfloor+1} \cdots \pi_N\}$. At iteration $N$, Algorithm 1 directly tests child stationarity with respect to the distributions over joint policies $\sigma^{N}_{A}, \sigma^{N}_{B}$ as follows: for each infostate $s$ and action $a$, it keeps track of the rewards and true hidden states that follow playing action $a$ at $s$. Then, it runs a Chi-Squared goodness-of-fit test (Pearson, 1900) to determine whether the distribution of rewards and true hidden states has changed between the first and second halves of training. We adopt child stationarity as the null hypothesis. While Algorithm 1 is not a perfect detector, Theorem 4.3 shows that Algorithm 1 will asymptotically never claim that a given $s, a, \sigma_A, \sigma_B$ satisfies child stationarity when it does not, assuming $(s,a)$ is queried infinitely often. However, the Chi-Squared test retains a fixed rate of false positives given by the significance level $\alpha_s$, so Algorithm 2 will incorrectly reject the null hypothesis of child stationarity with some positive probability.
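A minimal sketch of such a detector is below (our own illustration; the paper's Algorithm 1 is the reference). It compares the empirical distribution of (reward, hidden state) outcomes between the first and second halves of the samples using a chi-squared test of homogeneity, which plays the role of the test described above; hidden states are assumed to be hashable, and all names are illustrative.

```python
from collections import Counter
from scipy.stats import chi2_contingency

def detect_nonstationarity(outcomes, alpha_s=0.05):
    """Return True if (s, a) appears NOT to be child stationary.

    `outcomes` is the chronological list of (reward, hidden_state) pairs
    observed after playing a at s.  Rejecting the null hypothesis (child
    stationarity) means the local transition/reward distribution appears to
    have changed between the first and second halves of training.
    """
    n = len(outcomes)
    if n < 4:
        return False                       # too few samples to reject the null
    first = Counter(outcomes[: n // 2])
    second = Counter(outcomes[n // 2:])
    support = list(set(first) | set(second))
    if len(support) < 2:
        return False                       # identical single outcome in both halves
    table = [[first[o] for o in support],
             [second[o] for o in support]]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha_s
```

In ABCs, such a check would be applied to the per-$(s,a)$ history $H(s,a)$ maintained by Algorithm 2, with $\alpha_s$ playing the role of the significance value $p$.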

Theorem 4.3.

As $|X_1|, |X_2| \rightarrow \infty$, the probability that Algorithm 1 falsely claims $s, a$ satisfies child stationarity when it does not goes to 0.

Proof.

This statement is directly implied by the fact that the Chi-Squared Goodness-of-Fit test guarantees that the Type II error vanishes in the limit (Pearson, 1900). ∎

4.3 The ABCs Algorithm

We are now ready to learn our ABCs. At a high level, ABCs chooses between a BQL update and a CFR update for each $Q(s,a)$ value based on whether or not $(s,a)$ satisfies child stationarity with respect to the joint policies encountered over the course of training. One can view ABCs from two equivalent perspectives. From one perspective, ABCs runs a variant of ES-MCCFR, except that at stationary infostates, where it is unnecessary to run a full CFR update, it defaults to a cheaper BQL update. Alternatively, one can view ABCs as defaulting to a variant of BQL until it realizes that an infostate is too nonstationary for BQL to converge, at which point it backs off to more expensive CFR-style updates. We provide the formal description of ABCs in Algorithm 2. In the multi-agent setting, ABCs freezes all players' policies at the beginning of each learning iteration and then runs independently for each player, exactly as multi-agent BQL and ES-MCCFR do.

Algorithm 2 ABCs
$Q$, CNT initialized to 0, discount factor $\gamma$, total episodes $N$. Initial observation and hidden states $s_0, h_0$. $H = \varnothing$. Significance value $p$, exploration $\epsilon$.
for $n \leftarrow 1$ to $N$ do
     ABCs$\left(h_0, s_0, Q, \gamma, n, a_0, H, p, \epsilon\right)$ ▷ $a_0$ is null
function GetChild($h, s, a, \pi, Q, \gamma$)
     Sample $h' \sim T_{\pi}\left(h' \mid h, a\right)$, $s' \leftarrow S(h')$
     Sample $r \sim R_{\pi}\left(r \mid s, a, s'\right)$
     $a^{*} \leftarrow \arg\max_{a'} Q\left(s', a'\right)$
     $\nabla_{s,a} \leftarrow \left(r + \gamma Q\left(s', a^{*}\right)\right) - Q(s,a)$
     return $h', s', r, a^{*}, \nabla_{s,a}$
function ABCs($h, s, Q, \gamma, n, a_{opt}, H, p, \epsilon$)
     if $h$ is terminal then return 0
     $\pi^{n} \leftarrow \texttt{softmax}\left(Q\left(s, *\right), \tau = 1/(\tau_{n}\cdot\text{CNT}(s))\right)$
     $\text{CNT}(s) \leftarrow \text{CNT}(s) + 1$
     Sample $a_{traj}$ from $\pi^{n}(s)$ with $\epsilon$-exploration
     for $a \in A(s)$ do
          $h', s', r, a^{*}, \nabla_{s,a} \leftarrow$ GetChild$\left(h, s, a, \pi^{n}, Q, \gamma\right)$
          Append $(r, h')$ to $H(s,a)$
          if DETECT$\left(s, a, H(s,a), p\right)$ OR $a = a_{traj}$ then
               $\nabla_{s',a^{*}} \leftarrow$ ABCs$\left(h', s', Q, \gamma, n, a^{*}, H, p, \epsilon\right)$
               if $(s,a)$ is not child stationary then
                    $\nabla_{s,a} \leftarrow \nabla_{s,a} + \text{CNT}(s')\cdot\nabla_{s',a^{*}}$
          $Q(s,a) \leftarrow Q(s,a) + \frac{1}{\text{CNT}(s)}\nabla_{s,a}$
     $a_{opt} \leftarrow \arg\max Q(s, *)$
     return $\nabla_{s,a_{opt}}$
Where ABCs Runs Updates

As described in Section 3.4, one major difference between BQL and ES-MCCFR is the number of infostates each algorithm updates per learning iteration. While BQL updates only the infostates encountered along a single trajectory, ABCs adaptively adjusts the proportion of the game tree it expands by measuring child stationarity at each $(s,a)$ pair. Concretely, at every infostate encountered (regardless of its stationarity), ABCs chooses an action $a_{traj}$ from its policy (with $\epsilon$-exploration) and recursively runs itself on the resulting successor state $s'$. For every other action $a' \neq a_{traj}$, it tests whether the tuple $(s, a')$ satisfies child stationarity (a sketch of this branching rule follows below). Unless the null is rejected, it prunes all further updates on that subtree for this iteration. Otherwise, when $(s,a')$ is detected as nonstationary, it also recursively runs on the successor state sampled from $T(s' \mid s, a')$. As such, if all actions $a$ fail the child stationarity test, then ABCs will expand all child nodes, just like ES-MCCFR; but if the environment is an MDP and all $(s,a)$ satisfy child stationarity (according to the detector), then ABCs will expand only one child at each state and thus approximate BQL-style trajectory updates. With sufficient $\epsilon$-exploration (and because at least one action $a_{traj}$ is expanded at each infostate $s$), ABCs satisfies the requirement of Theorem 4.3 that every infostate $s$ be visited infinitely often in the limit.
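The branching rule just described can be summarized in a few lines (our own illustrative sketch, mirroring the DETECT interface of Algorithm 1; `history` and `detect` are assumed names):

```python
def children_to_expand(infostate, actions, a_traj, history, detect, p):
    """Return the actions whose subtrees ABCs expands this iteration.

    The sampled trajectory action is always expanded; any other action is
    expanded only if the detector flags its (s, a) pair as nonstationary.
    """
    return [a for a in actions
            if a == a_traj or detect(infostate, a, history[(infostate, a)], p)]
```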

Q-value updates

In terms of the actual Q-value update that ABCs performs, recall that in Section 3.4 we describe two differences between BQL and ES-MCCFR. Firstly, ES-MCCFR updates its Q-values based on the current policy while BQL performs updates based on the average policy. To determine which update to make, ABCs measures the level of child stationarity for a given $(s,a)$ pair, as it does when choosing which infostates to update. If $(s,a)$ satisfies child stationarity, then ABCs does a regular BQL update. If not, ABCs adds a nonstationarity correction factor $\text{CNT}(s')\cdot\nabla_{s',a^{*}}$ to the Q-value update, correcting for the difference between the instantaneous Q-values sampled via Monte Carlo sampling and the average bootstrapped Q-values. This correction exactly recovers the update function of MAX-CFR, a variant of ES-MCCFR described in more detail in Appendix C.

Secondly, ES-MCCFR chooses its policy as a function of cumulative Q-values, whereas BQL uses average Q-values. To address this, we force ABCs to test each action upon reaching an infostate, and thus update $Q(s,a)$ for all $a$. This allows us to set the temperature to recover a scaled multiple of the cumulative Q-values. Specifically, by setting a temperature of $\frac{1}{\tau_{n}\cdot \text{CNT}(s)}$, where $\text{CNT}(s)$ counts the number of visits to $s$, we recover a softmax over the cumulative Q-values scaled by $\tau_{n}$, just like with CFR(Hedge). BQL is well known to converge to the optimal policy in an MDP as long as we are greedy in the limit and visit every state-action pair infinitely often, which is guaranteed with an appropriate choice of $\tau_{n}$ and $\epsilon$-schedule (Singh et al., 2000; Cesa-Bianchi et al., 2017).
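To spell out this temperature choice, here is a short verification in our own notation, writing $\bar{Q}(s,a)$ for the stored average Q-values and $Q^{\mathrm{cum}}(s,a) = \text{CNT}(s)\,\bar{Q}(s,a)$ for the corresponding cumulative values (valid because every action is updated on every visit to $s$):
$$\pi^{n}(s,a) \;=\; \frac{\exp\!\big(\bar{Q}(s,a)\,\tau_{n}\,\text{CNT}(s)\big)}{\sum_{a'}\exp\!\big(\bar{Q}(s,a')\,\tau_{n}\,\text{CNT}(s)\big)} \;=\; \frac{\exp\!\big(\tau_{n}\,Q^{\mathrm{cum}}(s,a)\big)}{\sum_{a'}\exp\!\big(\tau_{n}\,Q^{\mathrm{cum}}(s,a')\big)},$$
so a Boltzmann policy over average Q-values with temperature $\frac{1}{\tau_{n}\cdot\text{CNT}(s)}$ coincides with a Hedge-style softmax over cumulative Q-values scaled by $\tau_{n}$.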

Convergence Guarantees

We formally prove the following theorems describing the performance of ABCs in MDPs and in two-player zero-sum games. Theorem 4.4 proves that, in an MDP, ABCs converges with at most an $O(A)$ factor slowdown compared to BQL, using only the detector described in Section 4.2. Additionally, Theorem 4.5 proves that a minor variant of ABCs finds a Nash equilibrium in a two-player zero-sum game (assuming access to a perfect oracle). The full proofs are included in the supplementary material.

Theorem 4.4.

Given an appropriate choice of significance levels $p$ (to control the rate of false positives), ABCs (as given in Algorithm 2) asymptotically converges to the optimal policy with only a worst-case $O(A)$ factor slowdown compared to BQL in an MDP, where $A$ is the maximum number of actions available at any infoset in the game.

To guarantee convergence to a Nash equilibrium in two-player zero-sum games, we require two modifications to Algorithm 2. Firstly, we assume access to a perfect oracle for detecting child stationarity. Secondly, to ensure that stationary and nonstationary updates do not interfere with each other, we keep track of separate Q-values, choosing which to update and use for exploration according to whether the current $(s,a)$ pair satisfies child stationarity.

Theorem 4.5.

Assume that Algorithm 2 tracks separate $Q^{stat}(s,a)$ and $Q^{nonstat}(s,a)$ values for stationary and nonstationary updates and has access to a perfect oracle for detecting child stationarity. Then, the average policy $\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{n=1}^{N}\pi_{n}$ converges to a Nash equilibrium in a two-player zero-sum game with high probability.

In our practical implementation of ABCs, we make several simplifications compared to the assumptions required in theory. Since assuming access to a perfect oracle is unrealistic in practice, we use the detector described in Algorithm 1, which guarantees asymptotically vanishing Type II error, but nevertheless retains some probability of a false positive in the limit. As described in Algorithm 2, we only track a single Q-value for both stationary and nonstationary updates. Finally, we use a constant significance value for all infostates. Our practical implementation of ABCs, despite not fully satisfying the assumptions of Theorems 4.4 and 4.5, nonetheless achieves comparable performance with both BQL and CFR in our experiments.

We also allow for different temperature schedules for stationary and nonstationary infostates, and we only perform the stationarity check with probability 0.05 at each iteration, significantly reducing the time spent on these checks. We find that these modifications simplify our implementation without greatly affecting performance.

5 Experimental Results

We evaluate ABCs across several single and multi-agent settings drawn from both the OpenSpiel game library (Lanctot et al., 2019) and OpenAI Gym (Brockman et al., 2016), benchmarking it against MAX-CFR, BQL (Sutton & Barto, 2018), OS-MCCFR (Lanctot et al., 2009), and ES-MCCFR (Lanctot et al., 2009) (when computationally feasible). All experiments with ABCs were run on a single 2020 Macbook Air M1 with 8GB of RAM. We use a single set of hyperparameters and run across three random seeds, with 95% error bars included. Hyperparameters and detailed descriptions of the environments are given in Appendix E. Auxiliary experiments with TicTacToe are also included in Appendix F. All plots describe regret/exploitability, so a lower line indicates superior performance.

Figure 1 (panels: (a) Cartpole, (b) Weighted RPS, (c) Kuhn Poker, (d) Leduc Poker, (e) Cartpole (Stacked Environment), (f) Leduc (Stacked Environment)): ABCs matches the performance of BQL on stationary environments like Cartpole (a) and the performance of CFR methods across several nonstationary environments including weighted rock-paper-scissors (b), Kuhn poker (c), and Leduc poker (d). In a partially stationary environment with elements of both stationarity and nonstationarity, ABCs outperforms both BQL and CFR, being the only algorithm capable of efficiently solving both the Cartpole (e) and Leduc poker (f) portions of the stacked environment.
Cartpole

We evaluate ABCs on OpenAI Gym's Cartpole environment, a classic benchmark in reinforcement learning (we modify Cartpole slightly to ensure that it satisfies the Markovian property; see Appendix E for details). Figure 1(a) shows that ABCs performs comparably to BQL and substantially better than OS-MCCFR. ES-MCCFR is infeasible to run due to the large state space of Cartpole.

Weighted Rock Paper Scissors

Weighted rock-paper-scissors is a classic nonstationary environment in which rock-paper-scissors is played, but there is twice as much reward if a player wins with Rock. Figure 1(b) demonstrates that while BQL fails to learn the optimal policy, ABCs performs comparably to the standard MCCFR benchmarks.

Kuhn and Leduc Poker

Kuhn and Leduc Poker are simplified versions of the full game of poker which operate as classic benchmarks for equilibrium finding algorithms. As illustrated in Figures 1(c) and 1(d), BQL fails to converge on either Kuhn or Leduc Poker as they are not Markovian environments, and we show similar results for a wide variety of hyperparameters in Appendix F. However, both ABCs and MCCFR converge to a Nash equilibrium of the game, with ABCs closely matching the performance of the standard MCCFR benchmarks.

A Partially Nonstationary Environment

We also test ABCs's ability to adapt to a partially nonstationary environment, where its strengths are most properly utilized. Specifically, we consider a stacked environment, where each round consists of a game of Cartpole followed by a game of Leduc poker against another player. Figures 1(e) and 1(f) show that BQL does not achieve the maximum score in this combined game, failing to converge on Leduc poker. Although CFR-based methods are theoretically capable of learning both games, MAX-CFR wastes substantial time conducting unnecessary exploration of the game tree in the larger Cartpole setting, failing to converge on Leduc. OS-MCCFR similarly fails to learn either environment, though in this case the issue preventing convergence on Leduc relates more to the variance induced by importance sampling on such a deep game tree than to unnecessary exploration. As such, ABCs outperforms all MCCFR variants and BQL by better allocating resources between the stationary and nonstationary portions of the environment.

6 Conclusion

We introduce ABCs, a unified algorithm combining the best parts of both Boltzmann Q-Learning and Counterfactual Regret Minimization. Theoretically, we show that ABCs simultaneously matches the performance of BQL up to an $O(A)$ factor in MDPs while also guaranteeing convergence to Nash equilibria in two-player zero-sum games (under the assumption of access to a perfect oracle for stationarity). To our knowledge, this is the first such algorithm of its kind. Empirically, we show that these guarantees do not just lie in the realm of theory. In experiments, ABCs performs comparably to BQL on stationary environments and to CFR on nonstationary ones, and is able to exceed the performance of both algorithms in partially nonstationary domains. Our empirical experiments do not require any assumptions regarding access to a perfect oracle and use the detector described in Algorithm 1.

We give a number of limitations of ABCs, which we leave to future work. While Theorem 4.5 guarantees that ABCs converges to a Nash equilibrium in two-player zero-sum games, it both requires access to a perfect oracle and does not provide a bound on the rate of convergence. Additionally, we run experiments for ABCs only on relatively small environments due to computational reasons. For future work, we hope to scale ABCs to larger settings such as the Arcade Learning Environment (Bellemare et al., 2013) and larger games such as Go or Poker using function approximation and methods such as DQN (Mnih et al., 2013) or DeepCFR (Brown et al., 2019).

7 Broader Impacts

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Arora et al. (2012a) Arora, R., Dekel, O., and Tewari, A. Online bandit learning against an adaptive adversary: From regret to policy regret. In Proceedings of the 29th International Coference on International Conference on Machine Learning, ICML’12, pp.  1747–1754. Omnipress, 2012a. ISBN 978-1-4503-1285-1.
  • Arora et al. (2012b) Arora, S., Hazan, E., and Kale, S. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012b.
  • Bakhtin et al. (2023) Bakhtin, A., Wu, D. J., Lerer, A., Gray, J., Jacob, A. P., Farina, G., Miller, A. H., and Brown, N. Mastering the game of no-press diplomacy via human-regularized reinforcement learning and planning. In ICLR, 2023.
  • Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, Jun 2013.
  • Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp.  449–458. JMLR.org, 2017.
  • Billingsley (2012) Billingsley, P. Probability and Measure. John Wiley & Sons, 2012.
  • Brafman & Tennenholtz (2002) Brafman, R. I. and Tennenholtz, M. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
  • Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • Brown & Sandholm (2018) Brown, N. and Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.
  • Brown & Sandholm (2019a) Brown, N. and Sandholm, T. Solving imperfect-information games via discounted regret minimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  1829–1836, 2019a.
  • Brown & Sandholm (2019b) Brown, N. and Sandholm, T. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019b.
  • Brown et al. (2019) Brown, N., Lerer, A., Gross, S., and Sandholm, T. Deep counterfactual regret minimization. In International Conference on Machine Learning, pp.  793–802. PMLR, 2019.
  • Brown et al. (2020) Brown, N., Bakhtin, A., Lerer, A., and Gong, Q. Combining deep reinforcement learning and search for imperfect-information games. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  17057–17069. Curran Associates, Inc., 2020.
  • Celli et al. (2020) Celli, A., Marchesi, A., Farina, G., and Gatti, N. No-Regret Learning Dynamics for Extensive-Form Correlated Equilibrium. Advances in Neural Information Processing Systems, 33:7722–7732, 2020.
  • Cesa-Bianchi et al. (2017) Cesa-Bianchi, N., Gentile, C., Lugosi, G., and Neu, G. Boltzmann Exploration Done Right. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • Daskalakis et al. (2009) Daskalakis, C., Goldberg, P. W., and Papadimitriou, C. H. The Complexity of Computing a Nash Equilibrium. SIAM Journal on Computing, 39(1):195–259, 2009. ISSN 0097-5397. doi: 10.1137/070699652.
  • Freund & Schapire (1997) Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. In Journal of Computer and System Sciences, volume 55, pp.  119–139, 1997.
  • Hart & Mas-Colell (2000) Hart, S. and Mas-Colell, A. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
  • Hessel et al. (2018) Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. G., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 2018.
  • Hu & Wellman (2003) Hu, J. and Wellman, M. P. Nash q-learning for general-sum stochastic games. J. Mach. Learn. Res., 4(null):1039–1069, dec 2003. ISSN 1532-4435.
  • Lanctot et al. (2009) Lanctot, M., Waugh, K., Zinkevich, M., and Bowling, M. Monte Carlo Sampling for Regret Minimization in Extensive Games. In Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc., 2009.
  • Lanctot et al. (2019) Lanctot, M., Lockhart, E., Lespiau, J.-B., Zambaldi, V., Upadhyay, S., Pérolat, J., Srinivasan, S., Timbers, F., Tuyls, K., Omidshafiei, S., et al. OpenSpiel: A framework for reinforcement learning in games. arXiv preprint arXiv:1908.09453, 2019.
  • Laurent et al. (2011) Laurent, G. J., Matignon, L., and Le Fort-Piat, N. The world of independent learners is not Markovian. International Journal of Knowledge-Based and Intelligent Engineering Systems, 15(1):55–64, 2011. doi: 10.3233/KES-2010-0206.
  • McAleer et al. (2022) McAleer, S., Farina, G., Lanctot, M., and Sandholm, T. Escher: Eschewing importance sampling in games by computing a history value function to estimate regret. arXiv preprint arXiv:2206.04122, 2022.
  • FAIR et al. (2022) Meta Fundamental AI Research Diplomacy Team (FAIR), Bakhtin, A., Brown, N., Dinan, E., Farina, G., Flaherty, C., Fried, D., Goff, A., Gray, J., Hu, H., Jacob, A. P., Komeili, M., Konath, K., Kwon, M., Lerer, A., Lewis, M., Miller, A. H., Mitts, S., Renduchintala, A., Roller, S., Rowe, D., Shi, W., Spisak, J., Wei, A., Wu, D., Zhang, H., and Zijlstra, M. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science, 378(6624):1067–1074, 2022. doi: 10.1126/science.ade9097.
  • Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. doi: 10.1038/nature14236.
  • Moravčík et al. (2017) Moravčík, M., Schmid, M., Burch, N., Lisý, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., and Bowling, M. DeepStack: Expert-Level Artificial Intelligence in No-Limit Poker. Science, 356(6337):508–513, 2017. ISSN 0036-8075, 1095-9203. doi: 10.1126/science.aam6960.
  • Morrill et al. (2021) Morrill, D., D’Orazio, R., Sarfati, R., Lanctot, M., Wright, J. R., Greenwald, A., and Bowling, M. Hindsight and Sequential Rationality of Correlated Play. 2021. doi: 10.48550/arXiv.2012.05874.
  • OpenAI (2019) OpenAI. Openai five. https://openai.com/blog/openai-five/, 2019.
  • Pearson (1900) Pearson, K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302):157–175, 1900.
  • Schmid et al. (2021) Schmid, M., Moravcik, M., Burch, N., Kadlec, R., Davidson, J., Waugh, K., Bard, N., Timbers, F., Lanctot, M., Holland, Z., Davoodi, E., Christianson, A., and Bowling, M. Player of games. CoRR, abs/2112.03178, 2021.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Shani et al. (2013) Shani, G., Pineau, J., and Kaplow, R. A survey of point-based pomdp solvers. Autonomous Agents and Multi-Agent Systems, 27:1–15, 2013.
  • Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016. ISSN 0028-0836, 1476-4687. doi: 10.1038/nature16961.
  • Silver et al. (2017) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
  • Singh et al. (2000) Singh, S., Jaakkola, T., Littman, M. L., and Szepesvári, C. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine learning, 38:287–308, 2000.
  • Sokota et al. (2023) Sokota, S., D’Orazio, R., Kolter, J. Z., Loizou, N., Lanctot, M., Mitliagkas, I., Brown, N., and Kroer, C. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. In ICLR, 2023.
  • Srinivasan et al. (2018) Srinivasan, S., Lanctot, M., Zambaldi, V., Pérolat, J., Tuyls, K., Munos, R., and Bowling, M. Actor-critic policy optimization in partially observable multiagent environments. Advances in Neural Information Processing Systems, 31, 2018.
  • Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning Series. The MIT Press, second edition edition, 2018. ISBN 978-0-262-03924-6.
  • Tammelin (2014) Tammelin, O. Solving large imperfect information games using CFR+. 2014.
  • Tesauro et al. (1995) Tesauro, G. et al. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68, 1995.
  • Vinyals et al. (2019) Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J. P., Jaderberg, M., Vezhnevets, A. S., Leblond, R., Pohlen, T., Dalibard, V., Budden, D., Sulsky, Y., Molloy, J., Paine, T. L., Gulcehre, C., Wang, Z., Pfaff, T., Wu, Y., Ring, R., Yogatama, D., Wünsch, D., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Kavukcuoglu, K., Hassabis, D., Apps, C., and Silver, D. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019. ISSN 1476-4687. doi: 10.1038/s41586-019-1724-z.
  • von Neumann (1928) von Neumann, J. Zur theorie der gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928. ISSN 1432-1807. doi: 10.1007/BF01448847.
  • Wang et al. (2016) Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pp.  1995–2003, 2016.
  • Zhang et al. (2022) Zhang, H., Lerer, A., and Brown, N. Equilibrium finding in normal-form games via greedy regret minimization. Proceedings of the AAAI Conference on Artificial Intelligence, 36(9):9484–9492, June 2022. doi: 10.1609/aaai.v36i9.21181.
  • Zinkevich et al. (2008) Zinkevich, M., Johanson, M., Bowling, M., and Piccione, C. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems, pp.  1729–1736, 2008.

Appendix A Nash Equilibria and Exploitability

We formally describe a Nash equilibrium and the concept of exploitability. Let π:SΔ(𝒜)\pi:S\to\Delta(\mathcal{A}) denote the joint policy of all players in the game. We assume that π\pi has full support: for all actions a𝒜(s)a\in\mathcal{A}(s), π(as)>0\pi(a\mid s)>0. Let Σ\Sigma denote the space of all possible joint policies, with Σi\Sigma_{i} defined accordingly. Let ui(π)u_{i}(\pi) denote the expected utility to player ii under joint policy π\pi. A joint policy π\pi is a Nash equilibrium of a game if for all players ii, πiargmaxπiΣiui(πi,πi)\pi_{i}\in\operatorname*{arg\,max}_{\pi^{\prime}_{i}\in\Sigma_{i}}u_{i}(\pi^{\prime}_{i},\pi_{-i}). Similarly, policy π\pi is an ϵ\epsilon-Nash equilibrium if for all players ii, ui(π)maxπiΣiui(πi,πi)ϵu_{i}(\pi)\geq\max_{\pi^{\prime}_{i}\in\Sigma_{i}}u_{i}(\pi^{\prime}_{i},\pi_{-i})-\epsilon.

In a two-player zero-sum game, playing a Nash equilibrium guarantees a player at least the minimax value of the game; in other words, it maximizes the utility a player receives even when the opponent responds perfectly to their strategy. Exploitability is often used to measure the distance of a policy π\pi from a Nash equilibrium, bounding the worst possible losses from playing a given policy. The total exploitability of a policy π\pi is given by iMmaxπiΣiui(πi,πi)ui(π)\sum_{i\in M}\max_{\pi^{\prime}_{i}\in\Sigma_{i}}u_{i}(\pi^{\prime}_{i},\pi_{-i})-u_{i}(\pi). In a two-player zero-sum game, the exploitability of a policy π\pi is given by maxπ1Σ1u1(π1,π2)+maxπ2Σ2u2(π1,π2).\max_{\pi^{\prime}_{1}\in\Sigma_{1}}u_{1}(\pi^{\prime}_{1},\pi_{2})+\max_{\pi^{\prime}_{2}\in\Sigma_{2}}u_{2}(\pi_{1},\pi^{\prime}_{2}).
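As a concrete illustration (our own sketch, not code from the paper), the snippet below computes exploitability in the weighted rock-paper-scissors game described in Appendix E.1.2, where winning with Rock is worth 2 and any other win is worth 1; the equilibrium (1/4, 1/2, 1/4) used in the check is what this particular payoff matrix works out to.

```python
import numpy as np

# Row player's payoff matrix for weighted rock-paper-scissors (rows/cols: R, P, S);
# winning with Rock is worth 2, other wins are worth 1, and the game is zero-sum.
A = np.array([[ 0., -1.,  2.],
              [ 1.,  0., -1.],
              [-2.,  1.,  0.]])

def exploitability(x, y):
    """Sum over both players of (best-response value) minus (value under (x, y))."""
    value = x @ A @ y              # player 1's expected utility; player 2 receives -value
    br1 = np.max(A @ y)            # player 1's best-response value against y
    br2 = np.max(-(x @ A))         # player 2's best-response value against x
    return (br1 - value) + (br2 + value)

uniform = np.ones(3) / 3
nash = np.array([0.25, 0.50, 0.25])       # equilibrium of this weighted RPS matrix
print(exploitability(uniform, uniform))   # positive: uniform play can be exploited
print(exploitability(nash, nash))         # approximately zero
```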

Appendix B Markov Stationarity vs. Child Stationarity

In this section, we highlight the relationship between Markov stationarity and child stationarity. An MDP satisfies child stationarity at all infostates and actions s,as,a and with respect to all possible distributions over joint policies σ,σ\sigma,\sigma^{\prime} (see Section 4.1).

Here we highlight an environment that does not satisfy Markov stationarity but satisfies child stationarity. Consider an environment with a single infostate and two possible hidden states. Furthermore, let hn=hnmod2h_{n}=h_{n\mod 2}, so that the hidden state alternates deterministically between episodes. Such an environment is not stationary in the Markov sense, as the hidden states change with each episode. However, defining σA,σB\sigma_{A},\sigma_{B} as the empirical distributions of joint policies across the first and second halves of timesteps as described in Section 4.2, such an environment would (asymptotically) satisfy child stationarity, as the distribution of hidden states for each half would approach (12,12)\left(\frac{1}{2},\frac{1}{2}\right) in the limit. As shown in Appendix D, child stationarity, despite being weaker than Markov stationarity, still allows BQL to converge to the optimal policy.
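To make this concrete, the short simulation below (our own sketch, not code from the paper) alternates the hidden state deterministically and checks that the empirical hidden-state distributions over the first and second halves of episodes both approach (1/2, 1/2), so a detector comparing the two halves accepts child stationarity even though the environment is not Markov-stationary.

```python
import numpy as np

def hidden_state_distributions(num_episodes):
    """Empirical hidden-state distributions over the first and second halves of episodes."""
    hidden = np.arange(num_episodes) % 2          # hidden state seen in episode n is n mod 2
    first, second = np.array_split(hidden, 2)
    dist = lambda h: np.bincount(h, minlength=2) / len(h)
    return dist(first), dist(second)

for n in (10, 100, 10_000):
    d1, d2 = hidden_state_distributions(n)
    # The L1 gap between the two halves shrinks toward zero as n grows.
    print(n, d1, d2, np.abs(d1 - d2).sum())
```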

Appendix C The MAX-CFR Algorithm

MAX-CFR is a variant of ES-MCCFR that ABCs reverts to if the environment fails child stationarity at every possible s,as,a (i.e., is “maximally nonstationary”). MAX-CFR can also be viewed as a bootstrapped variant of external-sampling MCCFR (Lanctot et al., 2009). See Algorithm 3.

Algorithm 3 MAX-CFR
QQ, CNT initialized to 0, discount factor γ\gamma, total episodes NN. Initial observation and hidden states s0,h0s_{0},h_{0}.
for n1 to Nn\leftarrow 1\textrm{ to }N do
     MAXCFR(h0,s0,Q,γ,n)(h_{0},s_{0},Q,\gamma,n)
function MAXCFR(h,s,Q,γ,nh,s,Q,\gamma,n)
     if h is terminal then return 0      
     πn(s)\pi^{n}(s) \leftarrow softmax(Q(s,),τ=1CNT(s))\texttt{softmax}\left(Q\left(s,*\right),\tau=\frac{1}{\text{CNT}(s)}\right)
     CNT(s)\text{CNT}(s) \leftarrow CNT(s)+1\text{CNT}(s)+1
     for aA(s)a\in A(s) do
         α\alpha \leftarrow 1/CNT(s)1/\text{CNT}(s)
         h,s,a,s,ah^{\prime},s^{\prime},a^{\prime},\nabla_{s,a} \leftarrow GetChild(h,s,a,n,Q,γ)\left(h,s,a,n,Q,\gamma\right)
         s,a\nabla_{s^{\prime},a^{\prime}} \leftarrow MAXCFR(h,s,Q,γ,n)\left(h^{\prime},s^{\prime},Q,\gamma,n\right)
         s,a\nabla_{s,a} \leftarrow s,a+CNT(s)s,a\nabla_{s,a}+\text{CNT}(s^{\prime})\cdot\nabla_{s^{\prime},a^{\prime}}
         Q(s,a)Q(s,a) \leftarrow Q(s,a)+αs,aQ(s,a)+\alpha\nabla_{s,a}
     
     aa^{*} \leftarrow argmaxaQ(s,a)\arg\max_{a}Q(s,a)
     return s,a\nabla_{s,a^{*}}

C.1 Connecting ABCs, MAX-CFR, and ES-MCCFR

To make explicit the connection between MAX-CFR and traditional external-sampling MCCFR (Lanctot et al., 2009) we adopt the Multiplicative Weights (MW) algorithm (also known as Hedge) as the underlying regret minimizer. In MW, a player’s policy at iteration n+1n+1 is given by

πn+1(s,a)=eτnRn(s,a)aA(s)eτnRn(s,a),\pi^{n+1}(s,a)=\frac{e^{\tau_{n}R^{n}(s,a)}}{\sum_{a^{\prime}\in A(s)}e^{\tau_{n}R^{n}(s,a^{\prime})}},

where Rn(s,a)R^{n}(s,a) denotes the cumulative regret at iteration nn, for action aa at infostate ss. Hedge also includes a tunable hyperparameter τn\tau_{n}, which may be adjusted from iteration to iteration. Letting τn=1\tau_{n}=1 across all iterations, we recover the softmax operator on cumulative regrets as a special case of Hedge.
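The sketch below (ours, for illustration; the regret values are made up) implements this update with numpy. With τ_n = 1 the policy is exactly a softmax over cumulative regrets, and because Hedge is shift-invariant (Lemma C.1 below), adding the same constant to every regret leaves the policy unchanged.

```python
import numpy as np

def hedge_policy(cum_regret, tau=1.0):
    """Multiplicative Weights / Hedge: play proportionally to exp(tau * cumulative regret)."""
    z = tau * cum_regret
    z -= z.max()                 # subtracting a constant is safe: Hedge is shift-invariant
    p = np.exp(z)
    return p / p.sum()

R = np.array([1.5, -0.2, 0.7])   # hypothetical cumulative regrets at one infostate
print(hedge_policy(R))           # with tau = 1, exactly softmax over cumulative regrets
print(hedge_policy(R + 10.0))    # identical policy after shifting every regret by 10
```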

We will assume there are no non-terminal rewards rtr_{t}, and use a discount factor of γ=1\gamma=1, both of which are standard for the two-player zero-sum environments on which CFR is typically applied. We adopt τn=1\tau_{n}=1 for simplicity, but note that the following equivalences hold for any schedule of τn\tau_{n}.

We first present Boot-CFR, a “bootstrapped” version of external-sampling MCCFR that is identical to the original ES-MCCFR algorithm. To make the comparison between Boot-CFR and MAX-CFR clear, we highlight the difference between the two algorithms in cyan: the computation of the evaluation policy πevaln(s)\pi_{\mathrm{eval}}^{n}(s^{\prime}).

Algorithm 4 Boot-CFR
Same preconditions as MAX-CFR (Algorithm 3).
for n1 to Nn\leftarrow 1\textrm{ to }N do
     BOOTCFR(h0,s0,Q,γ,n)(h_{0},s_{0},Q,\gamma,n)
function ModGetChild(h,s,a,π,Q,γh,s,a,\pi,Q,\gamma)
     Sample h,sTπ(hh,a),S(h)h^{\prime},s^{\prime}\sim T^{\pi}\left(h^{\prime}\mid h,a\right),S(h^{\prime})
     Sample rRπ(rs,a,s)r\sim R^{\pi}\left(r\mid s,a,s^{\prime}\right)
     return h,s,rh^{\prime},s^{\prime},r
function BOOTCFR(h,s,Q,γ,nh,s,Q,\gamma,n)
     if h is terminal then
         for aA(s)a\in A(s) do
              Δs,a\Delta_{s,a} \leftarrow 0
         return
     πn(s)\pi^{n}(s) \leftarrow softmax(Q(s,),τ=1CNT(s))\texttt{softmax}\left(Q\left(s,*\right),\tau=\frac{1}{\text{CNT}(s)}\right)
     CNT(s)\text{CNT}(s) \leftarrow CNT(s)+1\text{CNT}(s)+1
     for aA(s)a\in A(s) do
         Δs,a\Delta_{s,a} \leftarrow 0
         h,s,rh^{\prime},s^{\prime},r \leftarrow ModGetChild(h,s,a,πn,Q,γ)\left(h,s,a,\pi^{n},Q,\gamma\right)
         for aA(s)a^{\prime}\in A(s^{\prime}) do
              s,a,a\nabla_{s,a,a^{\prime}} \leftarrow r+γQ(s,a)Q(s,a)r+\gamma Q(s^{\prime},a^{\prime})-Q(s,a)
         QoldQ_{\mathrm{old}} \leftarrow Q(s,)Q(s^{\prime},*)
         BOOTCFR(h,s,Q,γ,n)\left(h^{\prime},s^{\prime},Q,\gamma,n\right)
         QnewQ_{\mathrm{new}} \leftarrow Q(s,)Q(s^{\prime},*)
         πevaln(s)\pi_{\mathrm{eval}}^{n}(s^{\prime}) \leftarrow softmax(Qold(s,),τ=1CNT(s)1)\texttt{softmax}\left(Q_{\mathrm{old}}\left(s^{\prime},*\right),\tau=\frac{1}{\text{CNT}(s^{\prime})-1}\right)
         for aA(s)a^{\prime}\in A(s^{\prime}) do
              Δs,a\Delta_{s,a} \leftarrow Δs,a+πevaln(s,a)[1CNT(s)(s,a,a+CNT(s)Δs,a)]\Delta_{s,a}+\pi_{\mathrm{eval}}^{n}(s^{\prime},a^{\prime})\cdot\left[\frac{1}{\text{CNT}(s)}\left(\nabla_{s,a,a^{\prime}}+\text{CNT}(s^{\prime})\cdot\Delta_{s^{\prime},a^{\prime}}\right)\right]          
         Q(s,a)Q(s,a) \leftarrow Q(s,a)+Δs,aQ(s,a)+\Delta_{s,a}      
return
Lemma C.1.

Multiplicative Weights / Hedge produces the same policy whether it is run on cumulative regrets or on cumulative rewards/utilities.

Proof.

It is well known and easy to verify that Multiplicative Weights / Hedge is shift-invariant: for any constant cc added to every coordinate,

Hedge(x)=Hedge(x+c).\mathrm{Hedge}(x)=\mathrm{Hedge}(x+c).

Over NN iterations, the cumulative counterfactual regret of not taking action aa at infostate ss is given by

RN(s,a)=1Nn=1Nvnπ(sa)n(s)vnπn(s),R^{N}(s,a)=\frac{1}{N}\sum_{n=1}^{N}v_{n}^{\pi_{(s\to a)}^{n}}(s)-v_{n}^{\pi^{n}}(s),

where πsan\pi^{n}_{s\to a} is identical to πn\pi^{n}, except that action aa is always taken given infostate ss. Since the second term does not depend on the action aa, this is equivalent to using Hedge on the cumulative counterfactual rewards

1Nn=1Nvnπ(sa)n(s),\frac{1}{N}\sum_{n=1}^{N}v_{n}^{\pi_{(s\to a)}^{n}}(s),

by the shift-invariance of Hedge. ∎

Lemma C.2.

Boot-CFR and external-sampling MCCFR use the same current policy and traverse the game tree in the same way.

Proof.

First, note that game-tree traversal is identical to external-sampling MCCFR by construction. For a given player, we branch all their actions and sample all opponent actions from the current policy. Additionally, while external-sampling MCCFR uses softmax on cumulative regrets, by Lemma C.1, this is equivalent to softmax on cumulative rewards. Note that the policy in Boot-CFR is given by

πn(s)\displaystyle\pi^{n}(s) =softmax(Q(s,),τ=1CNT(s))\displaystyle=\texttt{softmax}\left(Q(s,*),\tau=\frac{1}{\text{CNT}(s)}\right)
=softmax(Q(s,)CNT(s)).\displaystyle=\texttt{softmax}\left(Q(s,*)\cdot\text{CNT}(s)\right).

As Q(s,a)Q(s,a) is the average reward from taking action aa at infostate ss, it follows that Q(s,a)CNT(s,a)Q(s,a)\cdot\text{CNT}(s,a) is the cumulative reward. Further, since we branch all actions every time we visit an infostate, CNT(s,a)=CNT(s)\text{CNT}(s,a)=\text{CNT}(s), and the policy in Boot-CFR is equivalent to a softmax on cumulative rewards. ∎

Theorem C.3.

Boot-CFR and external-sampling MCCFR are identical.

Proof.

Boot-CFR and external-sampling MCCFR are identical regarding traversal of the game tree and computation of the current policy (Lemma C.2). Thus, it suffices to show that the updates to cumulative/average rewards are identical.

Let WW denote the set of terminal states that are reached on a given iteration of Boot-CFR or external-sampling MCCFR. For any zWz\in W, we define ui(z)u_{i}(z) to be the reward for player ii at terminal history zz. Consider current history hh, current infostate ss, and action aa. Let hh^{\prime} be the history reached on the current iteration after taking action aa. For external-sampling MCCFR, the sampled counterfactual regret of taking action aa at infostate ss is given by

r~(s,a)=zWui(z)(ηiπ(h,z)ηiπ(h,z)).\tilde{r}(s,a)=\sum_{z\in W}u_{i}(z)(\eta_{i}^{\pi}(h^{\prime},z)-\eta_{i}^{\pi}(h,z)).

By Lemma C.1, when using Hedge, we can consider only the sampled counterfactual rewards

r(s,a)=zWui(z)ηπ(h,z).r(s,a)=\sum_{z\in W}u_{i}(z)\cdot\eta^{\pi}(h^{\prime},z).

While the sampled counterfactual reward is typically only defined for non-terminal observations, for the sake of the proof, we extend the definition to terminal observations, letting r(s,a)=ui(z)r(s,a)=u_{i}(z), where zz is the terminal history corresponding to observation hh.

We will prove that Δs,a=r(s,a)Q(s,a)CNT(s)\Delta_{s,a}=\frac{r(s,a)-Q(s,a)}{\text{CNT}(s)} for all ss that we visit on a given iteration and all actions aa.

We proceed by induction. First, we have this equivalence for all terminal nodes, since Δs,a=0\Delta_{s,a}=0 at a terminal node. Now, consider an arbitrary infostate ss and action aa, and suppose that Δs,a=r(s,a)Q(s,a)CNT(s,a)\Delta_{s^{\prime},a^{\prime}}=\frac{r(s^{\prime},a^{\prime})-Q(s^{\prime},a^{\prime})}{\text{CNT}(s^{\prime},a^{\prime})} for all actions aa^{\prime} and ss^{\prime} reachable from ss after playing action aa. By our inductive hypothesis, we have

s,a,a+CNT(s)Δs,a\displaystyle\nabla_{s,a,a^{\prime}}+\text{CNT}(s^{\prime})\cdot\Delta_{s^{\prime},a^{\prime}}
=Q(s,a)Q(s,a)+CNT(s)r(s,a)Q(s,a)CNT(s)\displaystyle=Q(s^{\prime},a^{\prime})-Q(s,a)+\text{CNT}(s^{\prime})\cdot\frac{r(s^{\prime},a^{\prime})-Q(s^{\prime},a^{\prime})}{\text{CNT}(s^{\prime})}
=r(s,a)Q(s,a).\displaystyle=r(s^{\prime},a^{\prime})-Q(s,a).

Additionally, we have

r(s,a)\displaystyle r(s,a) =zWui(z)ηiπ(h,z)\displaystyle=\sum_{z\in W}u_{i}(z)\cdot\eta_{i}^{\pi}(h^{\prime},z)
=zWui(z)[aA(s)π(a|s)ηiπ(ha′′,z)]\displaystyle=\sum_{z\in W}u_{i}(z)\cdot\left[\sum_{a^{\prime}\in A(s^{\prime})}\pi(a^{\prime}|s^{\prime})\cdot\eta_{i}^{\pi}(h^{\prime\prime}_{a^{\prime}},z)\right]
=aA(s)π(a|s)zWui(z)ηiπ(ha′′,z)\displaystyle=\sum_{a^{\prime}\in A(s^{\prime})}\pi(a^{\prime}|s^{\prime})\sum_{z\in W}u_{i}(z)\eta_{i}^{\pi}(h^{\prime\prime}_{a^{\prime}},z)
=aA(s)π(a|s)r(s,a),\displaystyle=\sum_{a^{\prime}\in A(s^{\prime})}\pi(a^{\prime}|s^{\prime})\cdot r(s^{\prime},a^{\prime}),

where ha′′h^{\prime\prime}_{a^{\prime}} is the history reached after playing action aa^{\prime} at history hh^{\prime}. It follows that

Δs,a\displaystyle\Delta_{s,a} =aA(s)1CNT(s)π(a|s)[s,a,a+CNT(s)Δs,a]\displaystyle=\sum_{a^{\prime}\in A(s^{\prime})}\frac{1}{CNT(s)}\cdot\pi(a^{\prime}|s^{\prime})\cdot\left[\nabla_{s,a,a^{\prime}}+\text{CNT}(s^{\prime})\cdot\Delta_{s^{\prime},a^{\prime}}\right]
=aA(s)1CNT(s)π(a|s)[r(s,a)Q(s,a)]\displaystyle=\sum_{a^{\prime}\in A(s^{\prime})}\frac{1}{CNT(s)}\cdot\pi(a^{\prime}|s^{\prime})\cdot\left[r(s^{\prime},a^{\prime})-Q(s,a)\right]
=1CNT(s)[Q(s,a)+aA(s)π(a|s)r(s,a)]\displaystyle=\frac{1}{\text{CNT}(s)}\cdot\left[-Q(s,a)+\sum_{a^{\prime}\in A(s^{\prime})}\pi(a^{\prime}|s^{\prime})\cdot r(s^{\prime},a^{\prime})\right]
=r(s,a)Q(s,a)CNT(s),\displaystyle=\frac{r(s,a)-Q(s,a)}{CNT(s)},

as desired. The Q-update becomes

Q(s,a)\displaystyle Q(s,a) Q(s,a)+r(s,a)Q(s,a)CNT(s)\displaystyle\leftarrow Q(s,a)+\frac{r(s,a)-Q(s,a)}{CNT(s)}
=(CNT(s)1)Q(s,a)+r(s,a)CNT(s),\displaystyle=\frac{(CNT(s)-1)\cdot Q(s,a)+r(s,a)}{CNT(s)},

and corresponds precisely to updating the average Q-value with the external-sampling sampled counterfactual reward. ∎
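This final update is the standard incremental-mean rule; the toy check below (ours) verifies that repeatedly applying Q ← Q + (r − Q)/CNT to a stream of sampled rewards reproduces their running average.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = rng.normal(size=1000)   # a stream of sampled counterfactual rewards

q, cnt = 0.0, 0
for r in rewards:
    cnt += 1
    q += (r - q) / cnt            # Q <- Q + (r - Q) / CNT, as in the proof above

print(q, rewards.mean())          # the two agree up to floating-point error
```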

Corollary C.4.

MAX-CFR and External-Sampling MCCFR are identical, up to minor differences in the calculation of an action’s expected utility/reward.

Proof.

One can verify that there is a single difference between Boot-CFR and MAX-CFR: the line highlighted in cyan in Algorithm 4, which computes the evaluation policy πevaln(s)\pi_{\mathrm{eval}}^{n}(s^{\prime}), is replaced by

πevaln(s)hardmax(Qnew(s,)CNT(s)),\pi_{\mathrm{eval}}^{n}(s^{\prime})\leftarrow\texttt{hardmax}\left(Q_{\mathrm{new}}\left(s^{\prime},*\right)\cdot\text{CNT}(s^{\prime})\right),

where hardmax places all probability mass on the entry/action with the highest value. Thus, instead of calculating the sampled reward of taking action aa at infostate ss with respect to the current softmax policy, MAX-CFR calculates the sampled reward with respect to a greedy policy that always selects the action with the greatest cumulative reward/smallest cumulative regret. This greedy policy is taken with respect to the new Q-values after they have been updated on the current iteration.

Importantly, the current policy and cumulative policy of all players are still calculated using softmax on cumulative rewards. It is only the calculation of sampled rewards that relies on this new hardmax policy. ∎
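The following small sketch (ours; it ignores the old-versus-new Q-values and the CNT offsets spelled out in Algorithms 3 and 4) contrasts the two evaluation policies: the softmax over cumulative rewards used by Boot-CFR / ES-MCCFR versus the greedy hardmax that MAX-CFR substitutes in its place.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def hardmax(x):
    """All probability mass on the entry with the highest value (ties broken by argmax)."""
    p = np.zeros_like(x, dtype=float)
    p[np.argmax(x)] = 1.0
    return p

Q = np.array([0.3, 1.1, 0.7])   # hypothetical average Q-values at the successor infostate s'
cnt = 5                         # hypothetical visit count CNT(s')
print(softmax(Q * cnt))         # Boot-CFR-style evaluation: softmax on cumulative rewards
print(hardmax(Q * cnt))         # MAX-CFR-style evaluation: greedy w.r.t. the same values
```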

Finally, we can relate MAX-CFR to ABCs. We have the following simple relationship.

Theorem C.5.

Up to ϵ\epsilon-exploration, ABCs reduces to MAX-CFR in the case where all (s,a)(s,a) pairs are detected as nonstationary.

Proof.

This follows from the definition of Algorithm 2, where removing the ϵ\epsilon-exploration recovers MAX-CFR. ∎

Theorem C.6.

With high probability, MAX-CFR minimizes regrets at a rate of O(1N)O\left(\frac{1}{\sqrt{N}}\right).

Proof.

Written in terms of Q-value notation, the local regrets that are minimized by CFR(Hedge) are given by

rN(s,a)=ηiπ(s)(Qπ(s,a)𝔼aπ(s)[Qπ(s,a)]).r^{N}(s,a)=\eta_{-i}^{\pi}(s)\left(Q^{\pi}(s,a)-\mathbb{E}_{a^{\prime}\sim\pi(s)}\left[Q^{\pi}(s,a^{\prime})\right]\right).

Let

rN+(s,a)=max(0,rN(s,a))r^{N+}(s,a)=\max\left(0,r^{N}(s,a)\right)

Define πs,an\pi^{n}_{s,a} as the policy in which:

  1. All players except player P(s)P(s) follow π\pi at all states in the game,

  2. Player P(s)P(s) plays as necessary to reach ss and plays action aa at ss, and

  3. Player P(s)P(s) plays the action with the highest Q value at ss^{\prime}, where ss^{\prime} is the state directly following ss.

Policy πs,an\pi^{n}_{s,a} is identical to the counterfactual target policy that CFR(Hedge) follows, except that πs,an\pi^{n}_{s,a} alters the policy at the successor state ss^{\prime} to follow the action with the highest Q value instead of sampling an action according to πn\pi^{n}. Analogously, define the modified MAXCFR local regrets as

rMAXCFRN(s,a)\displaystyle r^{N}_{MAXCFR}(s,a) =ηiπ(s)(Qπs,an(s,a)𝔼aπ(s)[Qπ(s,a)])\displaystyle=\eta_{-i}^{\pi}(s)\left(Q^{\pi^{n}_{s,a}}(s,a)-\mathbb{E}_{a^{\prime}\sim\pi(s)}\left[Q^{\pi}(s,a^{\prime})\right]\right)
rMAXCFRN+(s,a)\displaystyle r^{N+}_{MAXCFR}(s,a) =max(0,rMAXCFRN(s,a))\displaystyle=\max\left(0,r^{N}_{MAXCFR}(s,a)\right)

These regrets are identical to the standard CFR regrets rN(s,a)r^{N}(s,a), except for the fact that the target policy is modified to πs,an\pi^{n}_{s,a}.

Lemma C.7.

For any s,πs^{\prime},\pi and any possible QQ values,

maxaQ(s,a)𝔼aπ[Q(s,a)]0.\max_{a}Q(s^{\prime},a)-\mathbb{E}_{a\sim\pi}\left[Q(s^{\prime},a)\right]\geq 0.
Proof.

This follows immediately from the properties of the max operator. ∎

Lemma C.8.

rN(s,a)rMAXCFRN(s,a)r^{N}(s,a)\leq r^{N}_{MAXCFR}(s,a).

Proof.

By definition of πs,an\pi^{n}_{s,a}, we can write:

Qπs,an(s,a)\displaystyle Q^{\pi^{n}_{s,a}}(s,a) =Qπ(s,a)\displaystyle=Q^{\pi}(s,a)
+𝔼sTπ(ss,a)[maxaQ(s,a)𝔼aπ[Q(s,a)]].\displaystyle+\mathbb{E}_{s^{\prime}\sim T^{\pi}(s^{\prime}\mid s,a)}\left[\max_{a^{\prime}}Q(s^{\prime},a^{\prime})-\mathbb{E}_{a^{\prime}\sim\pi}\left[Q(s^{\prime},a^{\prime})\right]\right].

Since the second term is nonnegative by Lemma C.7, we have rMAXCFRN(s,a)rN(s,a)r^{N}_{MAXCFR}(s,a)\geq r^{N}(s,a). ∎

Lemma C.9.

Let Δ\Delta be the difference between the highest and lowest possible rewards achievable for any player in GG. Let Ns,aN_{s,a} denote the number of times MAX-CFR has visited s,as,a. We have rMAXCFRN(s,a)ΔNs,ar^{N}_{MAXCFR}(s,a)\leq\frac{\Delta}{\sqrt{N_{s,a}}}.

Proof.

Since we choose all policies via a softmax, all policies have full support and ηiπ(s)>0\eta^{\pi}_{-i}(s)>0 for all ss. Thus, Ns,a=NN_{s,a}=N in MAX-CFR. Noting that MAX-CFR chooses its policy πs,an\pi^{n}_{s,a} proportionally to exp(rMAXCFRN(s,a))\exp\left(r^{N}_{MAXCFR}(s,a)\right), it follows from the Multiplicative Weights algorithm (Freund & Schapire, 1997; Arora et al., 2012b) that the regrets given by rMAXCFRN(s,a)r^{N}_{MAXCFR}(s,a) after NN iterations are bounded above by ΔN\frac{\Delta}{\sqrt{N}} with high probability. ∎

Notice that like CFR, the guarantee of Lemma C.9 holds regardless of the strategy employed at all other s,as,a in the game, so long as the policy at iteration nn chooses action aa at state ss proportionally to exp(rMAXCFRN(s,a))\exp\left(r^{N}_{MAXCFR}(s,a)\right). Since Lemma C.8 shows that local regrets for MAX-CFR form an upper bound on the standard CFR regrets (over the sequence of joint policies π1πN\pi^{1}\cdots\pi^{N}), we can apply (Zinkevich et al., 2008, Theorem 3) along with Lemma C.9 to show that the overall regret of the game satisfies

rG\displaystyle r^{G} 𝒮𝒜maxs,arMAXCFRN+(s,a)\displaystyle\leq\left\lVert\mathcal{S}\right\rVert\left\lVert\mathcal{A}\right\rVert\max_{s,a}r^{N+}_{MAXCFR}(s,a)
𝒮𝒜ΔN=O(1N)\displaystyle\leq\left\lVert\mathcal{S}\right\rVert\left\lVert\mathcal{A}\right\rVert\frac{\Delta}{\sqrt{N}}=O\left(\frac{1}{\sqrt{N}}\right)

Appendix D Convergence of ABCs

Theorem 4.4. Assume that Algorithm 2 tracks separate Q(s,a)Q(s,a) values for stationary and nonstationary updates and has access to a perfect oracle for detecting child stationarity. Then, the average policy limN1Nn=1Nπn\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{n=1}^{N}\pi_{n} in Algorithm 2 converges to a Nash equilibrium in a two-player zero-sum game with high probability.

Proof.

Call our current game GG. We will prove convergence to a Nash equilibrium in GG by showing that the average policy in Algorithm 2 has vanishing local regret. This is sufficient to show that the algorithm as a whole has no regret and thus converges to a Nash equilibrium (Zinkevich et al., 2008, Theorem 2).

We first define a perfect oracle. At every episode NN, a perfect oracle has full access to the specification of game GG and the joint policy, π1πN\pi_{1}\cdots\pi_{N}, of all players in each episode 11 through NN. As per Section 4.2, define σAN\sigma^{N}_{A} as the uniform distribution over {π1πN/2}\{\pi_{1}\cdots\pi_{\left\lfloor N/2\right\rfloor}\} and σBN\sigma^{N}_{B} as the uniform distribution over {πN/2+1πN}\{\pi_{\left\lfloor N/2\right\rfloor+1}\cdots\pi_{N}\}. Define the following two distributions:

D1N(h)\displaystyle D_{1}^{N}(h) =𝔼πσAN[Tπ(hs,a)]\displaystyle=\mathbb{E}_{\pi\in\sigma^{N}_{A}}\left[T_{\pi}(h^{\prime}\mid s,a)\right]
D2N(h)\displaystyle D_{2}^{N}(h) =𝔼πσBN[Tπ(hs,a)].\displaystyle=\mathbb{E}_{\pi\in\sigma^{N}_{B}}\left[T_{\pi}(h^{\prime}\mid s,a)\right].

Define the total variation distance between these two distributions (Billingsley, 2012), as

ΔN=hD1N(h)D2N(h).\Delta_{N}=\sum_{h\in\mathcal{H}}\left\lVert D^{N}_{1}(h)-D^{N}_{2}(h)\right\rVert.

A perfect detector will return that s,as,a is child stationary at iteration NN if and only if ΔN=0\Delta_{N}=0.
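In practice the oracle is approximated by a statistical test rather than exact knowledge of Δ_N; the sketch below (ours, with made-up counts and an assumed two-sample chi-squared form) illustrates the kind of comparison performed by the detector in Algorithm 1: the empirical child distributions of (s, a) from the two halves of training are compared, and stationarity is rejected when they differ significantly.

```python
import numpy as np
from scipy.stats import chi2_contingency

def fails_child_stationarity(counts_first_half, counts_second_half, alpha=0.05):
    """Reject child stationarity if the child-infostate counts from the two halves differ."""
    table = np.array([counts_first_half, counts_second_half])
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha

# Hypothetical counts of the children reached after playing a at s in each half of training.
print(fails_child_stationarity([50, 30, 20], [52, 28, 20]))   # similar halves -> False
print(fails_child_stationarity([50, 30, 20], [10, 20, 70]))   # shifted halves -> True
```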

We assume separate Q-value tables for the stationary and nonstationary updates, so that the updates do not interfere with each other. To determine the policy at each iteration (or episode) nn, we choose πn(a)Q(s,a)\pi^{n}(a)\propto Q(s,a), where QQ is the nonstationary Q-table if s,as,a fails child stationarity and the stationary table otherwise. The result is that, with respect to a given s,as,a, updates done at iterations where child stationarity holds have no impact on updates done at iterations where it fails, and vice versa.

To prove Theorem 4.4, we proceed by induction. As the base case, consider any terminal infostate ss of the game tree where there are no actions. In this case, all policies are no-regret. For the inductive hypothesis, consider any infostate ss and assume that ABCs guarantees no-regret for any s,as^{\prime},a^{\prime} where ss^{\prime} is a descendant of ss and aa^{\prime} is a valid action at ss^{\prime}. It remains to show that ABCs also guarantees the no-regret property at s,as,a. Formally, define ηiπ(s)=hH(s)ηiπ(h)\eta_{-i}^{\pi}(s)=\sum_{h\in H(s)}\eta_{-i}^{\pi}(h). We write the local regret for a policy π\pi as rπ(s,a)=max(0,ηiπ(s)(Qπ(s,a)𝔼aπ(s)[Qπ(s,a)]))r^{\pi}(s,a)=\max\left(0,\eta_{-i}^{\pi}(s)\left(Q^{\pi}(s,a)-\mathbb{E}_{a^{\prime}\sim\pi(s)}\left[Q^{\pi}(s,a^{\prime})\right]\right)\right).

Consider the sequence (dn)n\left(d_{n}\right)_{n}, where dn{0,1}d_{n}\in\{0,1\} indicates whether the oracle reports that s,as,a satisfies child stationarity after nn iterations (or episodes) of training. There are three cases to consider: (dn)n\left(d_{n}\right)_{n} converges to 0 and s,as,a fails child stationarity in the limit, converges to 11 and s,as,a satisfies child stationarity in the limit, or fails to converge (e.g., it oscillates between satisfying and failing child stationarity).

Case 1:

s,as,a fails child stationarity in the limit (i.e., (dn)n0\left(d_{n}\right)_{n}\rightarrow 0).

In Appendix C, we show that ABCs runs MAX-CFR at every s,as,a that does not satisfy child stationarity according to the detector. Lemma C.9 shows that running MAX-CFR at a given state and action s,as,a achieves vanishing local regret at s,as,a, and the assumption of the perfect oracle means that ABCs will asymptotically converge to MAX-CFR at this particular s,as,a. Combined with our inductive hypothesis, this guarantees that ABCs is no-regret for (s,a)(s,a) and all of its descendant infostate-action pairs.

Case 2:

s,as,a satisfies child stationarity in the limit (i.e., (dn)n1\left(d_{n}\right)_{n}\rightarrow 1). Let π¯N\overline{\pi}^{N} denote the average joint policy at episode NN. The Multiplicative Weights / Hedge algorithm (Freund & Schapire, 1997; Arora et al., 2012b) guarantees that selecting πn+1(s)softmax(Qπ¯N(s,))\pi_{n+1}(s)\propto\text{softmax}\left(Q^{\overline{\pi}_{N}}(s,*)\right) yields limNrπ¯N(s,a)=0\lim_{N\rightarrow\infty}r^{\overline{\pi}_{N}}(s,a)=0. When the detector successfully detects s,as,a as child stationary, Algorithm 2 performs exactly such an update at every infostate visited (except that it uses estimated Q^π¯N(s,a)\hat{Q}^{\overline{\pi}_{N}}(s,a) values). We also show that the estimated Q^π¯N(s,a)\hat{Q}^{\overline{\pi}_{N}}(s,a) values converge to the true Qπ¯N(s,a){Q}^{\overline{\pi}_{N}}(s,a) values, resulting in Hedge / Multiplicative Weights performing the softmax update on the correct counterfactual regrets. We do so as follows.

Lemma D.1.

Define G(s,a)=limNGπ¯N(s,a)G^{*}(s,a)=\lim_{N\rightarrow\infty}G^{\overline{\pi}^{N}}(s,a). If (dt)t1\left(d_{t}\right)_{t}\rightarrow 1, then G(s,a)G^{*}(s,a) is guaranteed to exist.

Proof.

While G(s,a)G^{*}(s,a) is not a well-defined subgame in general, we show that it exists if s,as,a is asymptotically child stationary. Considering a vectorized form of the payoff entries in a finite-action normal-form game, the set of all games can be written as a subset of q\mathcal{R}^{q}, for a suitable qq, and forms a compact metric space. (dn)n1\left(d_{n}\right)_{n}\rightarrow 1 implies ΔN0\Delta_{N}\rightarrow 0 by construction. Let πAN,πBN\pi_{A}^{N},\pi_{B}^{N} represent the average joint policies as defined in Section 4.2 at iteration NN. π¯N\overline{\pi}^{N} is, by definition, the average joint policy between πAN\pi_{A}^{N} and πBN\pi_{B}^{N}. Using the symmetry of the total variation distance along with the triangle inequality, we have

d(Gπ¯N(s,a),Gπ¯N+1(s,a))\displaystyle d\left(G^{\overline{\pi}^{N}}(s,a),G^{\overline{\pi}^{N+1}}(s,a)\right)
=d(G12πAN+12πBN(s,a),G12πAN+1+12πBN+1(s,a))\displaystyle=d\left(G^{\frac{1}{2}\pi_{A}^{N}+\frac{1}{2}\pi_{B}^{N}}(s,a),G^{\frac{1}{2}\pi_{A}^{N+1}+\frac{1}{2}\pi_{B}^{N+1}}(s,a)\right)
d(G12πAN+12πBN(s,a),GπAN(s,a))\displaystyle\leq d\left(G^{\frac{1}{2}\pi_{A}^{N}+\frac{1}{2}\pi_{B}^{N}}(s,a),G^{\pi_{A}^{N}}(s,a)\right)
+d(GπAN+1(s,a),G12πAN+1+12πBN+1(s,a))\displaystyle+d\left(G^{\pi_{A}^{N+1}}(s,a),G^{\frac{1}{2}\pi_{A}^{N+1}+\frac{1}{2}\pi_{B}^{N+1}}(s,a)\right)
+d(GπAN(s,a),GπAN+1(s,a))\displaystyle+d\left(G^{\pi_{A}^{N}}(s,a),G^{\pi_{A}^{N+1}}(s,a)\right)
ΔN+ΔN+1+1N+1\displaystyle\leq\Delta_{N}+\Delta_{N+1}+\frac{1}{N+1}

where we have that d(G12πAN+12πBN(s,a),GπAN(s,a))d(GπBN(s,a),GπAN(s,a))d\left(G^{\frac{1}{2}\pi_{A}^{N}+\frac{1}{2}\pi_{B}^{N}}(s,a),G^{\pi_{A}^{N}}(s,a)\right)\leq d\left(G^{\pi_{B}^{N}}(s,a),G^{\pi_{A}^{N}}(s,a)\right) due to the convexity of the total variation distance (Billingsley, 2012). Additionally, d(GπAN(s,a),GπAN+1(s,a))1N+1d\left(G^{\pi_{A}^{N}}(s,a),G^{\pi_{A}^{N+1}}(s,a)\right)\leq\frac{1}{N+1} because, by construction, πAN\pi_{A}^{N} and πAN+1\pi_{A}^{N+1} are identical in all but at most a 1N+1\frac{1}{N+1} fraction of the joint policies in the support. Since limNd(Gπ¯N(s,a),Gπ¯N+1(s,a))=0\lim_{N\rightarrow\infty}d\left(G^{\overline{\pi}^{N}}(s,a),G^{\overline{\pi}^{N+1}}(s,a)\right)=0, the sequence of games given by Gπ¯N(s,a)G^{\overline{\pi}^{N}}(s,a) forms a Cauchy sequence and G(s,a)G^{*}(s,a) is guaranteed to exist since Cauchy sequences converge in compact metric spaces. ∎

By the assumption that GG is a two-player zero-sum game, G(s,a)G^{*}(s,a) must also be a two-player zero-sum game. By our inductive hypothesis, ABCs has no regret for any s,as^{\prime},a^{\prime} that is a descendant of s,as,a. This, combined with the existence of G(s,a)G^{*}(s,a) and (Zinkevich et al., 2008, Theorem 2), means that limNπ¯N\lim_{N\rightarrow\infty}\overline{\pi}^{N} converges to a Nash equilibrium of G(s,a)G^{*}(s,a).

The minimax theorem (von Neumann, 1928) states that in two-player zero-sum games, players receive the same payoff, known as the minimax value of the game, regardless of which Nash equilibrium is discovered. Call this value QQ^{*}. Furthermore, by the assumption that (dt)t1\left(d_{t}\right)_{t}\rightarrow 1, the immediate rewards received after taking action aa at state ss must converge to some value rr^{*}. As such, the estimated Q values must converge to the correct value Q^π¯N(s,a)=r+γQ\hat{Q}^{\overline{\pi}_{N}}(s,a)=r^{*}+\gamma Q^{*}, as desired.

Case 3:

Suppose that s,as,a neither satisfies nor fails child stationarity in the limit, i.e., (dn)n\left(d_{n}\right)_{n} does not converge. Let

AnonstatN={tNdt=0},AstatN={tNdt=1}.A^{N}_{nonstat}=\{t\leq N\mid d_{t}=0\},\quad A^{N}_{stat}=\{t\leq N\mid d_{t}=1\}.

Both limNAstatN\lim_{N\rightarrow\infty}\left\lVert A^{N}_{stat}\right\rVert and limNAnonstatN\lim_{N\rightarrow\infty}\left\lVert A^{N}_{nonstat}\right\rVert must diverge, since if either remained finite, this would contradict the assumption that (dt)t(d_{t})_{t} does not converge.

As described in our setup, because we track separate Q values for stationary and nonstationary updates and use a perfect oracle, the updates done in iterations AstatNA^{N}_{stat} have no impact on the updates done on AnonstatNA^{N}_{nonstat} and vice versa. As such, consider the subset of timesteps given by AnonstatNA^{N}_{nonstat}. By direct reduction to Case 1, the distribution over joint policies given by sampling uniformly over {πkkAnonstatN}\{\pi_{k}\mid k\in A^{N}_{nonstat}\} must satisfy the no-regret property for s,as,a. Analogously, the distribution over joint policies given by sampling uniformly over {πkkAstatN}\{\pi_{k}\mid k\in A^{N}_{stat}\} must satisfy the no-regret property for s,as,a by direct reduction to Case 2.

As the inductive hypothesis is satisfied for both partitions AstatNA^{N}_{stat} and AnonstatNA^{N}_{nonstat}, we can continue to induct independently for each partition. The induction terminates at the initial infostate s0s_{0} of our game because we assume finite depth. Suppose we are left at this step with KK nonoverlapping partitions of {1N}\{1\cdots N\}. Denote them A1NAKNA^{N}_{1}\cdots A^{N}_{K}, and analogously define πNk=1AkNnAkNπn\pi_{N}^{k}=\frac{1}{\left\lVert A^{N}_{k}\right\rVert}\sum_{n\in A^{N}_{k}}\pi_{n}. By our inductive hypothesis and the minimax theorem, we are guaranteed that limNrπNk(s,a)=0\lim_{N\rightarrow\infty}r^{\pi_{N}^{k}}(s,a)=0, for all s,a,ks,a,k. Since π¯N=1NkAkNπNk\overline{\pi}^{N}=\frac{1}{N}\sum_{k}\left\lVert A^{N}_{k}\right\rVert\pi_{N}^{k}, we can write

maxarπ¯N(s,a)\displaystyle\max_{a}r^{\overline{\pi}^{N}}(s,a) =maxa1NkAkNrπNk(s,a)\displaystyle=\max_{a}\frac{1}{N}\sum_{k}\left\lVert A^{N}_{k}\right\rVert r^{\pi_{N}^{k}}(s,a)
1NkAkNmaxarπNk(s,a).\displaystyle\leq\frac{1}{N}\sum_{k}\left\lVert A^{N}_{k}\right\rVert\max_{a}r^{\pi_{N}^{k}}(s,a).

Lastly, since limNrπNk(s,a)=0\lim_{N\rightarrow\infty}r^{\pi_{N}^{k}}(s,a)=0 for all s,a,ks,a,k, it must also be true that rπ¯N(s,a)0r^{\overline{\pi}^{N}}(s,a)\rightarrow 0, as desired. ∎

Theorem 4.5. Given an appropriate choice of significance levels α\alpha, Algorithm 2 asymptotically converges to the optimal policy with only a worst-case O(A)O(A) factor slowdown compared to BQL in an MDP.

Proof.

We choose the significance levels α\alpha across different infostates as follows. As described in Algorithm 2, at each infostate ss encountered, ABCs samples a single action atraja_{traj} and branches this action, regardless of whether s,atrajs,a_{traj} is determined to be child stationary or not.

Define the sequence of infostates s0,,sts_{0},\dots,s_{t}, where s0s_{0} is the initial infostate and sks_{k} is the infostate observed after playing the corresponding atraja_{traj} action at sk1s_{k-1}. In other words, this is the trajectory of infostates visited from the beginning to the end of the game, taking the actions sampled by ABCs, independent of the child stationarity detection. This “trajectory” is the set of infostates that ABCs would have updated even if the environment had been fully child stationary, and it is the direct analogue of the trajectory that BQL would have updated. Note that the chosen “trajectory” may change at every iteration of the learning procedure.

We select the significance levels α\alpha as follows. For each infostate sis_{i} on the constructed trajectory described above, set α1Ad\alpha\leq\frac{1}{A^{d}}, where AA is the maximum number of actions at any infostate and dd is the depth of infostate sis_{i}. At all other infostates reached during the current learning iteration, set α1A\alpha\leq\frac{1}{A}.

This choice of significance levels guarantees only an O(A)O(A) factor slowdown compared to BQL. Let XABCs(d)X_{ABCs}(d) count the number of infostates of depth dd that ABCs updates at each iteration.

Since the environment is an MDP, an off-trajectory infostate is updated only if the detection algorithm produces a false positive at its deepest on-trajectory ancestor and at every ancestor that is not on the trajectory path. We bound this probability as follows. Consider an infostate ss with depth dd. Let dd^{*} be the depth of the deepest infostate in the ancestry of ss that was on the “trajectory” path. As false positives asymptotically occur with probability α\alpha in a Chi-Squared test (Pearson, 1900), the probability that ss is visited approaches

1Adi=d+1d1A=Ad.\frac{1}{A^{d^{*}}}\cdot\prod_{i=d^{*}+1}^{d}\frac{1}{A}=A^{-d}.

There are at most AdA^{d} nodes at depth dd in the tree, and thus we can expect at most AdAd=O(1)A^{-d}A^{d}=O(1) extra updates at level dd in the tree, or O(D)O(D) total extra updates for a game tree of depth DD. Since MAX-CFR is a contraction operator over the potential function which bounds the regret (Appendix C), additional, unnecessary updates at infostates falsely deemed to be nonstationary by the detector in Algorithm 1 will not hurt the asymptotic convergence rate, as MAX-CFR also converges to the optimal policy in an MDP.

As our environment is acyclic, ABCs will visit O(D)O(D) infostates in a game tree of depth DD, the same as BQL. At each infostate, ABCs updates every action whereas BQL only updates a single action, giving the additional O(A)O(A) inefficiency relative to BQL. ∎
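The counting argument in the proof above is easy to check numerically; the snippet below (ours, with hypothetical values of A and D) multiplies the asymptotic false-positive probability A^{-d} by the at most A^d infostates at each depth.

```python
A, D = 3, 6                          # hypothetical branching factor and tree depth
for d in range(1, D + 1):
    false_positive_prob = A ** (-d)  # probability an off-trajectory infostate at depth d
                                     # is wrongly expanded, given the alpha schedule above
    nodes_at_depth = A ** d          # at most A^d infostates at depth d
    print(d, false_positive_prob * nodes_at_depth)   # expected extra updates: 1 per level
```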

Appendix E Experimental Details

E.1 Testing Environments

We present more detailed descriptions of our testing environments, including any modifications made to their standard implementations, below.

E.1.1 Cartpole

Cartpole is a classic control game from the OpenAI Gym, in which the agent must work to keep the pole atop the cart upright while keeping the cart within the boundaries of the game. While Cartpole has a continuous state space, represented by a tuple containing (cart position, cart velocity, pole angle, pole angular velocity), we discretize the state space for the tabular setting as follows:

  • Cart position: 1010 equally-spaced bins in the range [2.4,2.4][-2.4,2.4]

  • Cart velocity: 1010 equally-spaced bins in the range [3.0,3.0][-3.0,3.0]

  • Pole angle: 1010 equally-spaced bins in the range [0.5,0.5][-0.5,0.5]

  • Pole angular velocity: 1010 equally-spaced bins in the range [2.0,2.0][-2.0,2.0]

Additionally, to make convergence feasible, we parameterize the game such that the infostates exactly correspond to the discretized states, rather than the full sequence of states seen and actions taken. Note that this means that our version of Cartpole does not satisfy perfect recall; to account for this and ensure that we are not repeatedly branching the same infostate, we prevent ABCs and MAX-CFR from performing a CFR exploratory update on the same infoset twice in a given iteration.

The standard implementation of Cartpole is technically non-Markovian as each episode is limited to a finite number of steps (200200 in the v0 implementation). To make the environment Markovian, we allow the episode to run indefinitely, but we introduce a 1/2001/200 probability of terminating at any given round, ensuring that the expected episode length is at most 200200 (and exactly 200200 if no other termination occurs).
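A minimal sketch of this preprocessing (ours; the bin edges simply follow the ranges listed above, and all helper names are our own) is given below, together with the geometric termination rule.

```python
import numpy as np

# 11 equally spaced edges per dimension yield 10 bins over each listed range.
BINS = {
    "position":    np.linspace(-2.4, 2.4, 11),
    "velocity":    np.linspace(-3.0, 3.0, 11),
    "angle":       np.linspace(-0.5, 0.5, 11),
    "angular_vel": np.linspace(-2.0, 2.0, 11),
}

def discretize(obs):
    """Map (position, velocity, angle, angular velocity) to a tuple of 10-way bin indices."""
    return tuple(
        int(np.clip(np.digitize(x, edges) - 1, 0, 9))
        for x, edges in zip(obs, BINS.values())
    )

def should_terminate(rng, p=1.0 / 200.0):
    """Terminate each round with probability 1/200, capping the expected episode length at 200."""
    return rng.random() < p

rng = np.random.default_rng(0)
print(discretize((0.1, -0.4, 0.02, 1.3)))
print(sum(should_terminate(rng) for _ in range(10_000)))   # roughly 10_000 / 200 = 50
```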

E.1.2 Weighted Rock-Paper-Scissors

Weighted rock-paper-scissors is a slight modification of classic rock-paper-scissors in which winning with Rock yields a reward of 22 whereas winning with any other move only yields a reward of 11 (a draw results in a reward of 0). The game is played sequentially, but player 2 does not observe the move player 1 makes beforehand (implemented via imperfect information, i.e., infostates).

E.1.3 Kuhn Poker

Kuhn poker is a simplified version of poker with only three cards: a Jack, a Queen, and a King. At the start of the round, each player receives a card (with no duplicates allowed). Each player antes 11; player 11 then has the choice to bet 11 or check. If player 11 bets, player 22 can either call (with both players then revealing their cards) or fold. If player 11 checks, player 22 can either bet 11 or check (ending the game). If player 22 bets, player 11 then has the choice to call or fold, with the game terminating in either case.

We use the standard OpenSpiel implementation of Kuhn poker for all our experiments.

E.1.4 Leduc Poker

Leduc poker is another simplified variant of poker, though it has a deeper game tree than Kuhn poker. As with Kuhn poker, there are three card ranks, though there are now two suits. Each player receives a single card, and there is an ante of 11 along with two betting rounds. Players can call, check, and raise, with a maximum of two raises per round. Raises in the first round are two chips while they are four chips in the second round.

Again, we use the standard OpenSpiel implementation of Leduc poker for all our experiments.

E.1.5 Cartpole + Leduc Poker

For our stacked environment, player 11 first plays a round of Cartpole. Upon termination of this round of Cartpole, they then play a single round of Leduc poker against player 22. Note that player 22 only ever interacts with the Leduc poker environment.

We use the same implementations of Cartpole and Leduc poker discussed above, with the sole change being that we set Cartpole’s termination probability to 1/1001/100 rather than 1/2001/200 to feasibly run MAX-CFR on the stacked environment.

E.2 Hyperparameters Used

For all of the experiments in the main section of the paper, we use the same hyperparameters for all of the algorithms tested. We enumerate the hyperparameters used for BQL, MAX-CFR, and the MCCFR methods in Table 1.

We tuned the hyperparameters for the MCCFR methods, BQL, and ABCs so as to optimize performance over all five environments for which experiments were conducted.

Table 1: Benchmark Hyperparameter Values
Hyperparameters Final Value
Discount Factor γ=1\gamma=1
BQL Temperature Schedule τn=10(0.99)n/50\tau_{n}=10\cdot\left(0.99\right)^{\lfloor n/50\rfloor}
MAX-CFR Temperature Schedule τn=1\tau_{n}=1
OS-MCCFR ϵ\epsilon-greedy Policy ϵ=0.6\epsilon=0.6

For ABCs, we use the following set of hyperparameters, enumerated in Table 2. As noted in the main paper, we make a number of practical modifications to Algorithm 2. In particular, we use a fixed p-value threshold, allow for different temperature schedules for stationary/nonstationary infosets, choose not to use ϵ\epsilon-exploration, and only perform the stationarity check with some relatively small probability.

Table 2: ABCs Hyperparameter Values
Hyperparameters Final Value
Discount Factor γ=1\gamma=1
Nonstationary Temperature Schedule τn=1\tau_{n}=1
Stationary Temperature Schedule τn=(0.99)n/20\tau_{n}=\left(0.99\right)^{\lfloor n/20\rfloor}
Cutoff p-value αs=0.05\alpha_{s}=0.05
ϵ\epsilon-exploration ϵ=0\epsilon=0
Probability of Stationarity Check pcheck=0.05p_{check}=0.05

E.3 Evaluation Metrics

For all our experiments, we plot “nodes touched” on the x-axis, a standard proxy for wall clock time that is independent of any hardware or implementation details. Specifically, it counts the number of nodes traversed in the game tree throughout the learning procedure. On each environment, we run all algorithms for a fixed number of nodes touched, as seen in the corresponding figures.

For all multi-agent settings, we plot exploitability on the y-axis. For Cartpole, we plot regret, which is simply the difference between the reward obtained by the current policy and that obtained by the optimal policy (200200 minus the current reward).

E.4 Random Seeds

We use three sequential random seeds {0,1,2}\{0,1,2\} for each experiment.

Appendix F Additional Experiments

F.1 Stationarity Detection

We plot percentage of infostates detected as nonstationary over time for each of the non-stacked environments in Figure 2(a). We normalize the x-axis to plot the fraction of total runtime/nodes touched so that all environments can be more easily compared.

Figure 2: (a) Stationarity detection for Cartpole, RPS, Leduc poker, and Kuhn poker. The fraction of infostates detected as non-stationary is plotted over time. (b) Stationarity detection on the stacked environment.

Note that the multi-agent environments possess varying levels of stationarity. The policies in weighted RPS are cyclical, meaning that the corresponding transition functions change throughout the learning procedure. In contrast, in our two poker environments, there are several transition functions that may stay fixed, either because the outcome of that round is determined by the cards drawn or because agents’ policies become fixed over time.

To take an example, in the Nash equilibrium of Kuhn poker, player 1 only ever bets in the opening round with a Jack or a King. Thus, if player 2 is in a position to call with a King, player 1 must hold the Jack, so calling necessarily yields the history in which the players hold cards (K, J) and player 2 calls player 1’s initial bet.

To better visualize the performance of ABCs on our stacked Cartpole/Leduc environment, we plot the fraction of infostates detected as nonstationary in each environment separately in Figure 2(b).

F.2 TicTacToe

We also run experiments on TicTacToe. While TicTacToe is not a fully stationary game – as it is still a multi-agent setting – it is a perfect information game, meaning that BQL should still be able to learn the optimal policy of the game. As with Cartpole, we modify the infostates to depend only on the current state of the board.

We plot results for ABCs and benchmarks in Figure 3(a), with exploitability clipped at 10510^{-5} to facilitate easier comparison. To accelerate convergence, we evaluate the current greedy policy rather than the average policy – note that last-iterate convergence implies average policy convergence. We additionally modify the temperature schedules for ABCs and BQL to accommodate the larger game tree. These are listed in Table 3.

Table 3: TicTacToe Hyperparameter Values
Hyperparameters Final Value
ABCs Nonstationary Temperature Schedule τn=1\tau_{n}=1
ABCs Stationary Temperature Schedule τn=10(0.99)n/50\tau_{n}=10\cdot\left(0.99\right)^{\lfloor n/50\rfloor}
BQL Temperature Schedule τn=10(0.99)n/100\tau_{n}=10\cdot\left(0.99\right)^{\lfloor n/100\rfloor}
Figure 3: (a) Performance on TicTacToe; exploitability is clipped at 10510^{-5} and plotted on a log scale. (b) Nonstationarity detection on TicTacToe.

Appendix G Additional Related Work

We expand on the related work from Section 1.1.

G.1 Methods from Game Theory

While CFR guarantees polynomial time convergence in MDPs (assuming the MDP satisfies perfect recall), it is empirically far slower than its reinforcement learning counterparts despite the same worst-case bound. This is primarily because, in practice, reinforcement learning methods such as PPO (Schulman et al., 2017) or DQN (Mnih et al., 2015) are able to learn reasonable policies in benchmarks such as the Arcade Learning Environment (Bellemare et al., 2013) while exploring only a tiny fraction of the possible states in the environment. Meanwhile, CFR requires updating every infostate in the entire game tree at every iteration.

Monte Carlo Counterfactual Regret Minimization

External sampling Monte-Carlo CFR (ES-MCCFR) and Outcome sampling Monte-Carlo CFR (OS-MCCFR) (Lanctot et al., 2009) are two methods that have been proposed to alleviate this issue. At each iteration, ES-MCCFR samples a single action for each player other than the player conducting the learning procedure (including the chance player), resulting in substantially fewer updates compared to the full version of CFR. However, because external sampling still requires performing a single update at every infostate of every player, it still requires O(AD)O(A^{D}) updates per iteration, where AA is the number of actions and DD is the maximum depth of the game tree, with respect to any single player. OS-MCCFR is similar to ES-MCCFR except that it only samples a single trajectory and then corrects its estimates of the counterfactual values using importance sampling. While this is indeed a trajectory-style update like Q-learning, the variance of the estimated values is exceedingly high, as the estimates involve dividing by the probability of reaching an infostate, which can be on the order of O(AD)O(A^{-D}). As such, OS-MCCFR will still require on the order of O(AD)O(A^{D}) trajectories before it learns a reasonable policy.

Several methods have been proposed to empirically improve the convergence rate with respect to the number of iterations. CFR+ (Tammelin, 2014) is a popular method that modifies the regret estimate to be an upper bound of the actual regrets by clipping cumulative regrets at zero. The same work also introduced alternating updates, in which each player’s regrets are updated only every PP iterations, where PP is the number of players in the game. Linear CFR (Brown & Sandholm, 2019a) and its follow-up work, greedy weights (Zhang et al., 2022), modify the weighting scheme of the CFR updates. All of the above methods empirically improve the convergence rate but maintain the same worst-case bound.
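To make the CFR+ modification concrete, the sketch below (ours, not taken from any of the cited implementations) shows the regret-matching+ update that CFR+ is built on: cumulative regrets are clipped at zero after every iteration, so they upper bound the usual cumulative regrets, and the policy plays proportionally to the positive part.

```python
import numpy as np

def regret_matching_plus_update(cum_regret_plus, instantaneous_regret):
    """CFR+-style update: add the new regrets, then clip the running total at zero."""
    return np.maximum(cum_regret_plus + instantaneous_regret, 0.0)

def regret_matching_policy(cum_regret):
    """Play proportionally to positive cumulative regret; uniform if none is positive."""
    positive = np.maximum(cum_regret, 0.0)
    if positive.sum() == 0:
        return np.ones_like(cum_regret) / len(cum_regret)
    return positive / positive.sum()

R_plus = np.zeros(3)
for r in ([1.0, -2.0, 0.5], [-0.5, -1.0, 0.2]):   # made-up per-iteration regrets
    R_plus = regret_matching_plus_update(R_plus, np.array(r))
print(R_plus, regret_matching_policy(R_plus))
```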

G.2 Methods from Reinforcement Learning

While reinforcement learning was designed for finding optimal policies on MDPs, many attempts have been made to adapt such algorithms to the multi-agent setting. Algorithms such as Q-learning and PPO are generally not guaranteed to converge to equilibria in such settings (Brown & Sandholm, 2019a).

AlphaStar

AlphaStar (Vinyals et al., 2019) attempted to learn the real-time strategy game StarCraft II via reinforcement learning. Since StarCraft II is a two-player imperfect information game, to combat the nonstationarity involved in playing multiple different opponents who required different strategies to defeat, AlphaStar trained against a league of agents instead of a single agent as is usually the case in self-play. Such a league simulates the distribution of human adversaries AlphaStar is likely to encounter while playing on the Blizzard StarCraft ladder. While this method performed reasonably well in practice, it did not come with guarantees of convergence to a Nash equilibrium of the game.

OpenAI Five

OpenAI Five (OpenAI, 2019) was a similar project to learn DOTA 2, a popular multiplayer online battle arena video game. In order to combat the nonstationarity, OpenAI Five played in a restricted version of DOTA 2, in which aspects of the game that involved imperfect information (such as wards, which granted vision of enemy areas) were removed from the game. Indeed, while the resulting agent was able to beat some of the best human teams in the world, online players quickly found strategies that could exploit the agents.

RMAX

On the more theoretical side, RMAX (Brafman & Tennenholtz, 2002) is a classic algorithm from reinforcement learning that achieves both near-optimal reward in an MDP and the minimax value of a stochastic game in polynomial time. RMAX operates by maintaining an optimistic estimate of the Q-values of each state, which upper bounds the possible value of each state and thus encourages exploration of insufficiently visited states. However, there are several differences between RMAX and ABCs. For starters, RMAX is not model free: it maintains a model of the environment to estimate the Q values. Additionally, unlike CFR (and by extension ABCs), it is not Hannan consistent, meaning it is not guaranteed to play optimally against an arbitrary adversary in the limit. On a more practical level, RMAX leads to substantially more exploration than related methods such as Q-learning by design, and as a result it is used less often in practice than variants of Q-learning that are often able to learn reasonable policies while exploring only a tiny fraction of the environment.

Nash Q-Learning

Nash Q-Learning (Hu & Wellman, 2003) is an adaptation of the standard Q-learning algorithm for finding the minimax value of a stochastic game, which corresponds to the Nash equilibrium in two-player zero-sum games. Instead of performing a standard Q-learning update, Nash Q-learning calculates the Nash equilibrium of the subgame implied by the Q-values of all players and performs an update according to that equilibrium policy rather than the policy currently being followed by the players. However, there are several important differences between ABCs and Nash Q-learning. Firstly, in our multi-agent learning setup, ABCs is run for each player of the game, but the learning algorithms for each player are run separately, with no interaction between players except via the game itself. In contrast, Nash-Q requires a centralized setup in which players collectively choose their individual strategies as a function of the Q-values of all players in the game. In this sense, Nash Q-learning is a collective algorithm for finding equilibria of the game as opposed to a learning algorithm trying to optimize the reward of each agent. Additionally, Nash Q-learning requires a Nash equilibrium solver for every update of the Q-values. While this is possible in polynomial time in a two-player zero-sum game, it is more expensive than the analogous update in ABCs and Q-learning, which can be done in constant time.

G.3 Unified Methods from Game Theory and Reinforcement Learning

Magnetic Mirror Descent

Magnetic Mirror Descent (MMD) (Sokota et al., 2023) is similarly a unified algorithm capable of solving for equilibria in two-player zero-sum games and in single-agent settings. However, unlike ABCs, MMD does not adaptively determine whether or not to branch infostates based on whether or not they are stationary. As such, it cannot simultaneously guarantee convergence to Nash in two-player zero-sum games and matching performance against BQL with the same set of hyperparameters. MMD is evaluated on a variety of single-agent and multi-agent environments, but it requires different hyperparameters for each one, most importantly those governing how many infostates to explore. In contrast, ABCs manages to roughly match the performance of Boltzmann Q-learning in stationary environments and of counterfactual regret minimization in nonstationary ones, all with the same set of hyperparameters, by adaptively choosing how much of the tree to explore based on how stationary each infostate is.

Player of Games

Player of Games (Schmid et al., 2021) is an algorithm that is able to simultaneously achieve superhuman status on many perfect and imperfect information games by combining lessons from both AlphaZero and counterfactual regret minimization. Instead of exploring the whole game tree as is the case with standard CFR, Player of Games expands a fixed number of infostates at each timestep. This is in contrast to ABCs, which adaptively chooses how many children to expand depending on whether the infostate-action pair satisfies child stationarity. As a result, PoG is asymptotically slower than its counterparts designed for perfect-information or single-agent environments as a price for its generality. In contrast, we show that ABCs provably incurs only a constant-factor slowdown compared to its reinforcement learning counterpart in the limit for a stationary environment, since it converges to updating the same set of infostates as BQL.