
Anticipatory Fictitious Play

Alex Cloud, Albert Wang, Wesley Kerr
Riot Games Technology Research, Los Angeles, CA
Affiliations refer to where this work was carried out and are no longer current. Corresponding author: Albert Wang, email: [email protected].
(December 2, 2022)
Abstract

Fictitious play is an algorithm for computing Nash equilibria of matrix games. Recently, machine learning variants of fictitious play have been successfully applied to complicated real-world games. This paper presents a simple modification of fictitious play which is a strict improvement over the original: it has the same theoretical worst-case convergence rate, is equally applicable in a machine learning context, and enjoys superior empirical performance. We conduct an extensive comparison of our algorithm with fictitious play, proving an optimal $O(t^{-1})$ convergence rate for certain classes of games, demonstrating superior performance numerically across a variety of games, and concluding with experiments that extend these algorithms to the setting of deep multiagent reinforcement learning.

1 Introduction

Matrix games (also known as normal-form games) are an abstract model for interactions between multiple decision makers. Fictitious play (FP) (Brown, 1951) is a simple algorithm for two-player matrix games. In FP, each player starts by playing an arbitrary strategy, then proceeds iteratively by playing the best strategy against the empirical average of what the other has played so far. In some cases, such as two-player, zero-sum games, the empirical average strategies will converge to a Nash equilibrium.

Although there are more efficient algorithms for computing Nash equilibria in matrix games (Adler, 2013; Shoham and Leyton-Brown, 2008), there are a few reasons why fictitious play remains a topic of interest. First, it serves as a model for how humans might arrive at Nash equilibria in real-world interactions (Luce and Raiffa, 1989; Brown, 1951; Conlisk, 1993a). Second, FP is extensible to real-world games which are large and complicated. Our work is primarily motivated by the latter application.

The initial step towards extending FP to real-world games was taken by Kuhn (1953), which established the equivalence of normal-form games (represented by matrices) and extensive-form games (represented by trees with additional structure). Loosely speaking, this means that results which apply for matrix games may also apply to much more complicated decision making problems, such as ones that incorporate temporal elements or varying amounts of hidden information. Leveraging this equivalence, Heinrich et al. (2015) proposed an extension of FP to the extensive-form setting, full-width extensive-form fictitious play (XFP), and proved that it converges to a Nash equilibrium in two-player, zero-sum games. Heinrich et al. (2015) also proposed Fictitious Self Play (FSP), a machine learning approximation to XFP. In contrast to XFP, which is intractable for real-world games whose states cannot be enumerated in practice, FSP relies only on basic operations which can be approximated in a machine learning setting, like averaging (via supervised learning) and computing best responses (via reinforcement learning). In this way, FSP provides a version of fictitious play suitable for arbitrarily complex two-player, zero-sum games. Not long after the introduction of FSP, Lanctot et al. (2017) presented Policy Space Response Oracles (PSRO), a general framework for fictitious-play-like reinforcement learning algorithms in two-player, zero-sum games. These ideas were employed as part of the groundbreaking AlphaStar system that defeated professional players at StarCraft II (Vinyals et al., 2019).

We introduce anticipatory fictitious play (AFP), a simple variant of fictitious play which is also reinforcement-learning-friendly. In contrast to FP, where players iteratively update to exploit an estimate of the opponent’s strategy, players in AFP update proactively to respond to the strategy that the opponent would use to exploit them.

We prove that AFP is guaranteed to converge to a Nash equilibrium in two-player zero-sum games and establish an optimal convergence rate for two classes of games that are of particular interest in learning for real-world games (Balduzzi et al., 2019): a class of “cyclic” games and a class of “transitive” games. Numerical comparisons suggest that AFP eventually outperforms FP on virtually any game, and that its advantage over FP grows as games get larger. Finally, we propose a reinforcement learning version of AFP that is implementable as a one-line modification of an RL implementation of FP, such as FSP. These algorithms are applied to Running With Scissors (Vezhnevets et al., 2020), a stochastic, competitive multiagent environment with cyclic dynamics.

1.1 Related work

Aside from the literature on fictitious play and its extension to reinforcement learning, there has been substantial work on “opponent-aware” learning algorithms. These algorithms incorporate information about opponent updates and are quite similar to anticipatory fictitious play.

In the context of evolutionary game theory, Conlisk (1993a) proposed an “extrapolation process,” whereby two players in a repeated game each forecast their opponents’ strategies and respond to those forecasts. Unlike AFP, where opponent responses are explicitly calculated, the forecasts are made by linear extrapolation based on the change in the opponent’s strategy over the last two timesteps. Conlisk (1993b) proposed two types of “defensive adaptation,” which are quite similar in spirit to AFP but differ in some important details; most importantly, while they consider the opponent’s empirical payoffs at each step, they do not respond directly to what the opponent is likely to play given those payoffs.

Shamma and Arslan (2005) proposed derivative action fictitious play, a variant of fictitious play in the continuous-time setting in which a best response to a forecasted strategy is played, as in Conlisk (1993a). The algorithm uses a derivative-based forecast that is analogous to the discrete-time anticipated response of AFP. However, their convergence results rely on a fixed, positive entropy bonus that incentivizes players to play more randomly, and they do not consider the discrete-time case.

Zhang and Lesser (2010) proposed Infinitesimal Gradient Ascent with Policy Prediction, in which two policy gradient learning algorithms continuously train against a forecast of the other’s policy. Their algorithm represents the core idea of AFP, albeit implemented in a different setting. However, their proof of convergence is limited to $2\times 2$ games. Foerster et al. (2018) and Letcher et al. (2018) take this idea further, modifying the objective of a reinforcement learning agent so that it accounts for how changes in the agent will change the anticipated learning of the other agents. This line of research is oriented more towards equilibrium finding in general-sum games (e.g., social dilemmas), and less towards efficient estimation of equilibria in strictly competitive two-player environments.

2 Preliminaries

A (finite) two-player zero-sum game (2p0s game) is represented by a matrix $A\in\mathbb{R}^{m\times n}$, so that when player 1 plays $i$ and player 2 plays $j$, the players observe payoffs $(A_{i,j},-A_{i,j})$, respectively. Let $\Delta^{k}\subset\mathbb{R}^{k}$ be the set of probability vectors representing distributions over $\{1,\dots,k\}$. Then a strategy for player 1 is an element $x\in\Delta^{m}$ and, similarly, a strategy for player 2 is an element $y\in\Delta^{n}$.

A Nash equilibrium in a 2p0s game $A$ is a pair of strategies $(x^{*},y^{*})$ such that each strategy is optimal against the other, i.e.,

\displaystyle x^{*}\in\underset{x\in\Delta^{m}}{\arg\max}\;x^{\intercal}Ay^{*}\quad\text{and}\quad y^{*}\in\underset{y\in\Delta^{n}}{\arg\min}\;(x^{*})^{\intercal}Ay.

The Nash equilibrium represents a pair of strategies that are “stable” in the sense that no player can earn a higher payoff by changing their strategy. At least one Nash equilibrium is guaranteed to exist in any finite game (Nash Jr,, 1950).

Nash equilibria in 2p0s games enjoy a nice property not shared by Nash equilibria in general: in 2p0s games, if $(x_{1},y_{1})$ and $(x_{2},y_{2})$ are Nash equilibria, then $(x_{2},y_{1})$ is a Nash equilibrium. In a 2p0s game, we define a Nash strategy to be one that occurs as part of some Nash equilibrium. Note that the aforementioned property does not hold in general, so normally it is only valid to describe collections of strategies (one per player) as equilibria.

A solution to a 2p0s game $A$ is a pair of strategies $(x^{*},y^{*})$ such that

\displaystyle\min_{y\in\Delta^{n}}(x^{*})^{\intercal}Ay\geq(x^{*})^{\intercal}Ay^{*}\geq\max_{x\in\Delta^{m}}x^{\intercal}Ay^{*}.

We call $v^{*}=(x^{*})^{\intercal}Ay^{*}$, which is unique, the value of the game. Nash equilibria are equivalent to solutions of 2p0s games (Shoham and Leyton-Brown, 2008), which is why we use the same notation. Finally, the exploitability of a strategy is the difference between the value of the game and the worst-case payoff of that strategy: the exploitability of $x\in\Delta^{m}$ is $v^{*}-\min x^{\intercal}A$, and the exploitability of $y\in\Delta^{n}$ is $\max Ay-v^{*}$.
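For concreteness, these quantities can be computed in a few lines; the sketch below uses NumPy and Rock Paper Scissors as an example (the function name and strategies are illustrative, not part of any particular implementation).

```python
import numpy as np

def exploitability(A, x, y, v_star):
    """Exploitability of row strategy x and column strategy y in the
    zero-sum game A, given the game's value v_star."""
    worst_case_x = np.min(x @ A)   # column player best-responds to x
    worst_case_y = np.max(A @ y)   # row player best-responds to y
    return v_star - worst_case_x, worst_case_y - v_star

# Rock Paper Scissors has value 0.
A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
x = np.array([0.5, 0.25, 0.25])   # a non-equilibrium row strategy
y = np.ones(3) / 3                # the equilibrium column strategy
print(exploitability(A, x, y, v_star=0.0))  # (0.25, 0.0)
```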

2.1 Fictitious play

Let $e_{1},e_{2},\dots$ denote the standard basis vectors in $\mathbb{R}^{m}$ or $\mathbb{R}^{n}$. Let $\texttt{BR}^{k}_{A}$ be the best response operator for player $k$, so that

\displaystyle(\forall y\in\Delta^{n})\quad\texttt{BR}^{1}_{A}(y)=\{e_{i}\in\mathbb{R}^{m}:i\in\arg\max Ay\};
\displaystyle(\forall x\in\Delta^{m})\quad\texttt{BR}^{2}_{A}(x)=\{e_{j}\in\mathbb{R}^{n}:j\in\arg\min x^{\intercal}A\}.

Fictitious play is given by the following process. Let $x_{1}=\overline{x}_{1}=e_{i}$ and $y_{1}=\overline{y}_{1}=e_{j}$ be initial strategies for some $i$, $j$. For each $t\in\mathbb{N}$, let

\displaystyle x_{t+1}\in\texttt{BR}^{1}_{A}(\overline{y}_{t});\qquad y_{t+1}\in\texttt{BR}^{2}_{A}(\overline{x}_{t});
\displaystyle\overline{x}_{t+1}=\frac{1}{t+1}\sum_{k=1}^{t+1}x_{k};\qquad\overline{y}_{t+1}=\frac{1}{t+1}\sum_{k=1}^{t+1}y_{k}.

In other words, at each timestep $t$, each player calculates the strategy that is the best response to their opponent’s average strategy so far. Robinson (1951) proved that the pair of average strategies $(\overline{x}_{t},\overline{y}_{t})$ converges to a solution of the game by showing that the exploitability of both strategies converges to zero.
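As a concrete illustration, a minimal NumPy sketch of this process follows; breaking ties via np.argmax (lowest index) is one choice among many, and the horizon and game are arbitrary examples.

```python
import numpy as np

def fictitious_play(A, T):
    """Run T iterations of fictitious play on the zero-sum game A.
    Returns the average strategies (x_bar, y_bar)."""
    m, n = A.shape
    x_bar, y_bar = np.eye(m)[0], np.eye(n)[0]   # arbitrary pure strategies
    for t in range(1, T):
        x_next = np.eye(m)[np.argmax(A @ y_bar)]   # BR^1 to y_bar
        y_next = np.eye(n)[np.argmin(x_bar @ A)]   # BR^2 to x_bar
        x_bar = (t * x_bar + x_next) / (t + 1)     # update running averages
        y_bar = (t * y_bar + y_next) / (t + 1)
    return x_bar, y_bar

A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)  # RPS
x_bar, y_bar = fictitious_play(A, T=1000)
print(x_bar, np.min(x_bar @ A))  # near [1/3, 1/3, 1/3]; worst case near 0
```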

Theorem 1.

(Robinson, 1951) If $\{(x_{t},y_{t})\}_{t\in\mathbb{N}}$ is an FP process for a 2p0s game with payoff matrix $A\in\mathbb{R}^{m\times n}$, then

\displaystyle\lim_{t\rightarrow\infty}\min\overline{x}_{t}^{\intercal}A=\lim_{t\rightarrow\infty}\max A\overline{y}_{t}=v^{*},

where $v^{*}$ is the value of the game. Furthermore, a bound on the rate of convergence is given by

\displaystyle\max A\overline{y}_{t}-\min\overline{x}_{t}^{\intercal}A=O(t^{-1/(m+n-2)})\text{ for all }t\in\mathbb{N},

where the implied constant depends on $a=\max_{i,j}A_{i,j}$.

(Robinson did not explicitly state the rate, but it follows directly from her proof, as noted in Daskalakis and Pan (2014) and explicated in our Appendix B.1.)

3 Anticipatory Fictitious Play

Although FP converges to a Nash equilibrium in 2p0s games, it may take an indirect path. For example, in Rock Paper Scissors with tiebreaking towards the minimum strategy index, the sequence of average strategies $\{\overline{x}_{1},\overline{x}_{2},\dots\}$ orbits the Nash equilibrium, slowly spiraling in with decreasing radius, as shown on the left in Figure 1. This tiebreaking scheme is not special; under random tiebreaking, the path traced by FP is qualitatively the same, resulting in slow convergence with high probability, as shown in Figure 2.

Figure 1: A visualization of the first 50 steps of FP $(\overline{x}_{1}^{\textnormal{FP}},\overline{x}_{2}^{\textnormal{FP}},\dots,\overline{x}_{50}^{\textnormal{FP}})$ and the first 25 steps of AFP $(\overline{x}_{1}^{\textnormal{AFP}},\overline{x}_{2}^{\textnormal{AFP}},\dots,\overline{x}_{25}^{\textnormal{AFP}})$ on Rock Paper Scissors. This corresponds to an equal amount of computation per algorithm (50 best responses). Ties between best response strategies were broken according to the ordering ‘Rock,’ ‘Paper,’ ‘Scissors.’ The Nash equilibrium is marked by a star. The shading indicates the exploitability of the strategy at that point, with darker colors representing greater exploitability.
Figure 2: Comparison of FP and AFP performance ($\min\overline{x}_{t}^{\intercal}A$) on RPS with random tiebreaking. The highlighted region depicts the 10th and 90th percentiles across 10,000 runs. All variation is due to randomly sampled tiebreaking. The value of the game is $v^{*}=0$.

Given that FP appears to do an especially poor job of decreasing exploitability in the case above, we consider alternatives. Inspired by gradient descent, we ask whether there is a way to follow the gradient of exploitability towards the Nash equilibrium (without explicitly computing it, as is done in Lockhart et al. (2019)). By definition, the best response to the average strategy is a strategy that maximally exploits the average strategy. So, a natural choice of update to reduce exploitability is to move the average strategy in a direction that counters this best response.

To this end, we propose anticipatory fictitious play (AFP), a version of fictitious play that “anticipates” the best response an adversary might play against the current strategy, and then plays the best response to an average of that and the adversary’s current average strategy. (Simply responding directly to the opponent’s response does not work; see Appendix A.) Alternatively, one can think of AFP as a version of FP that “forgets” every other best response it calculates. This latter interpretation enables a convenient implementation of AFP as a modification of FP as demonstrated in Algorithm 2.

Remark.

The AFP update is seemingly consistent with human psychology: it is quite intuitive to imagine how an adversary might try to exploit oneself and to respond in order to best counter that strategy. Given that fictitious play provides a model for how humans or other non-algorithmic decision makers might arrive at an equilibrium in practice (Luce and Raiffa, 1989; Brown, 1951), anticipatory fictitious play offers a new model for how this may occur. We leave further consideration of this topic to future work.

AFP is given by the following process. For some $i\in\{1,\dots,m\}$ and $j\in\{1,\dots,n\}$, let $x_{1}=\overline{x}_{1}=e_{i}$ and $y_{1}=\overline{y}_{1}=e_{j}$ be initial strategies for each player. For each $t\in\mathbb{N}$, define

\displaystyle x^{\prime}_{t+1}\in\texttt{BR}^{1}_{A}(\overline{y}_{t});\qquad y^{\prime}_{t+1}\in\texttt{BR}^{2}_{A}(\overline{x}_{t});
\displaystyle\overline{x}^{\prime}_{t+1}=\tfrac{t}{t+1}\overline{x}_{t}+\tfrac{1}{t+1}x^{\prime}_{t+1};\qquad\overline{y}^{\prime}_{t+1}=\tfrac{t}{t+1}\overline{y}_{t}+\tfrac{1}{t+1}y^{\prime}_{t+1};
\displaystyle x_{t+1}\in\texttt{BR}^{1}_{A}(\overline{y}^{\prime}_{t+1});\qquad y_{t+1}\in\texttt{BR}^{2}_{A}(\overline{x}^{\prime}_{t+1});
\displaystyle\overline{x}_{t+1}=\frac{1}{t+1}\sum_{k=1}^{t+1}x_{k};\qquad\overline{y}_{t+1}=\frac{1}{t+1}\sum_{k=1}^{t+1}y_{k}.\qquad(1)

Here, $x^{\prime}_{t+1}$ and $y^{\prime}_{t+1}$ are the best responses to the opponent’s average strategy; they are the strategies that FP would have played at the current timestep. In AFP, each player “anticipates” this attack and defends against it by computing the opponent’s average strategy including the attack ($\overline{x}^{\prime}_{t+1}$ and $\overline{y}^{\prime}_{t+1}$), and then playing the best response to this anticipated average strategy of the opponent.
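Update (1) translates directly into code. The sketch below mirrors the FP sketch in Section 2.1; only the anticipation step is new, and all names are illustrative.

```python
import numpy as np

def anticipatory_fictitious_play(A, T):
    """Run T iterations of AFP (update (1)) on the zero-sum game A."""
    m, n = A.shape
    x_bar, y_bar = np.eye(m)[0], np.eye(n)[0]
    for t in range(1, T):
        # Anticipate: the best responses FP would play right now.
        x_prime = np.eye(m)[np.argmax(A @ y_bar)]
        y_prime = np.eye(n)[np.argmin(x_bar @ A)]
        # Averages that include the anticipated attack.
        x_bar_prime = (t * x_bar + x_prime) / (t + 1)
        y_bar_prime = (t * y_bar + y_prime) / (t + 1)
        # Respond to the anticipated averages.
        x_next = np.eye(m)[np.argmax(A @ y_bar_prime)]
        y_next = np.eye(n)[np.argmin(x_bar_prime @ A)]
        x_bar = (t * x_bar + x_next) / (t + 1)
        y_bar = (t * y_bar + y_next) / (t + 1)
    return x_bar, y_bar

A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)  # RPS
x_bar, _ = anticipatory_fictitious_play(A, T=500)
print(np.min(x_bar @ A))  # worst-case payoff, close to the value 0
```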

In Figure 1, we see the effect of anticipation geometrically: AFP “cuts corners,” limiting the extent to which it overshoots its target. In contrast, FP aggressively overshoots, spending increasingly many steps playing strategies that take it further from its goal. The effect on algorithm performance is pronounced, with AFP hovering near equilibrium while FP slowly winds its way there.

Of course, RPS is a very specific example. It is natural to wonder: is AFP good in general? The rest of the paper seeks to answer that question. We begin by proving that AFP converges to a Nash equilibrium.

Proposition 1.

If $\{(x_{t},y_{t})\}$ is an AFP process for a 2p0s game with payoff matrix $A\in\mathbb{R}^{m\times n}$, the conclusion of Theorem 1 holds for this process. Namely, AFP converges to a Nash equilibrium, and it converges no slower than the rate that bounds FP.

Proof.

(Idea) Generalize the original proof of Theorem 1. We work with accumulating payoff vectors $U(t)=tA^{\intercal}\overline{x}_{t}$ and $V(t)=tA\overline{y}_{t}$. In the original proof, a player 1 strategy index $i\in\{1,\dots,m\}$ is called eligible at time $t$ if $i\in\arg\max V(t)$ (similarly for player 2). We replace eligibility with the notion of $E$-eligibility, satisfied by an index $i\in\arg\max[V(t)+E]$, for any $E\in\mathbb{R}^{m}$ with $\|E\|_{\infty}<\max_{i,j}|A_{i,j}|$. Essentially, an index is $E$-eligible if it corresponds to a best response to a perturbation of the opponent’s history $\overline{y}_{t}$ or a perturbation of the game itself. The original proof structure can be preserved in light of this replacement, requiring only minor modifications to some arguments and new constants. Treating the in-between strategies in AFP, $\overline{x}^{\prime}_{t}$ and $\overline{y}^{\prime}_{t}$, as perturbations of $\overline{x}_{t}$ and $\overline{y}_{t}$, it follows that AFP satisfies the conditions for the generalized result. A complete proof is given in Appendix B. ∎

4 Application to normal form games

Proposition 1 establishes that AFP converges and that AFP’s worst-case convergence rate satisfies the same bound as FP’s, where the worst-case is with respect to games and tiebreaking rules. The next proposition shows that for two classes of games of interest, AFP not only outperforms FP, but attains an optimal rate. In both classes, our proofs will reveal that AFP succeeds where FP fails because AFP avoids playing repeated strategies. The results hold for general applications of FP and AFP rather than relying on specific tiebreaking rules.

The classes of games that we analyze are intended to serve as abstract models of two fundamental aspects of real-world games: transitivity (akin to “skillfulness”: some ways of acting are strictly better than others) and nontransitivity (most notably, in the form of strategy cycles like Rock < Paper < Scissors < Rock). Learning algorithms for real-world games must reliably improve along the transitive dimension while accounting for the existence of strategy cycles; see Balduzzi et al. (2019) for further discussion.

For each integer $n\geq 3$, define payoff matrices $C^{n}$ and $T^{n}$ by

C^{n}_{i,j}=\begin{cases}1&\text{if }i=j+1\mod n;\\ -1&\text{if }i=j-1\mod n;\\ 0&\text{otherwise;}\end{cases}\qquad\text{and}\qquad T^{n}_{i,j}=\begin{cases}(n-i+2)/n&\text{if }i=j+1;\\ -(n-i+2)/n&\text{if }i=j-1;\\ 0&\text{otherwise,}\end{cases}

for $i,j\in\{1,\dots,n\}$. The game given by $C^{n}$ is a purely cyclic game: each strategy beats the one before it and loses to the one after it; $C^{3}$ is the game Rock Paper Scissors. For each $C^{n}$, a Nash equilibrium strategy is $[n^{-1},\dots,n^{-1}]^{\intercal}$. The game given by $T^{n}$ could be considered “transitive”: each strategy is in some sense better than the last, and $[0,\dots,0,1]^{\intercal}$ is a Nash equilibrium strategy. The payoffs are chosen so that each strategy $i$ is the unique best response to $i-1$, so that an algorithm that learns by playing best responses will progress one strategy at a time rather than skipping directly to strategy $n$ (as would happen if FP or AFP were applied to a game that is transitive in a stronger sense, such as one with a single dominant strategy; cf. the definition of transitivity in Balduzzi et al. (2019)). Note: $T^{n}$ could be defined equivalently without the $n^{-1}$ factor, but this would create a spurious dependence on $n$ for the rates we derive.
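As a sanity check on the indexing above, the following sketch constructs $C^{n}$ and $T^{n}$ exactly as defined (NumPy, with the 1-based indices of the definition mapped to 0-based arrays):

```python
import numpy as np

def cyclic_game(n):
    """C^n: strategy i beats strategy i-1 and loses to strategy i+1 (mod n)."""
    C = np.zeros((n, n))
    for i in range(n):
        C[i, (i - 1) % n] = 1.0    # case i = j + 1 (mod n)
        C[i, (i + 1) % n] = -1.0   # case i = j - 1 (mod n)
    return C

def transitive_game(n):
    """T^n as defined above: strategy i+1 is the unique best response to i."""
    T = np.zeros((n, n))
    for i in range(1, n + 1):            # 1-based row index, as in the text
        if i >= 2:
            T[i - 1, i - 2] = (n - i + 2) / n     # case i = j + 1
        if i <= n - 1:
            T[i - 1, i] = -(n - i + 2) / n        # case i = j - 1
    return T

print(cyclic_game(3))      # Rock Paper Scissors
print(transitive_game(4))
```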

The following proposition establishes a convergence rate of $O(t^{-1})$ for AFP applied to $C^{n}$ and $T^{n}$. This rate is optimal within the class of time-averaging algorithms, because the rate at which an average changes is $t^{-1}$. Note: we say a random variable $Y_{t}=\Omega_{p}(g(t))$ if, for any $\epsilon>0$, there exists $c>0$ such that $P[Y_{t}<cg(t)]<\epsilon$ for all $t$.

Proposition 2.

FP and AFP applied symmetrically to $C^{n}$ and $T^{n}$ obtain the rates given in Table 1. In particular, if $\{x_{t},x_{t}\}_{t\in\mathbb{N}}$ is an FP or AFP process for a 2p0s game with payoff matrix $G\in\{C^{n},T^{n}\}$ with tiebreaking as indicated, then $\max G\overline{x}_{t}=R(t)$. Tiebreaking refers to the choice of $x_{t+1}\in\texttt{BR}^{1}_{G}(\overline{x}_{t})$ when there are multiple maximizers. “Random” tiebreaking chooses between tied strategies independently and uniformly at random. For entries marked with “arbitrary” tiebreaking, the convergence rate holds no matter how tiebreaks are chosen.

Table 1: Convergence rates for FP and AFP on $C^{n}$ and $T^{n}$.
Algorithm | Game $G$ | Tiebreaking | Rate $R(t)$ | Caveats
FP | $C^{n}$ | random | $\Omega_{p}(t^{-1/2})$ |
AFP | $C^{n}$ | arbitrary | $O(t^{-1})$ | $n=3,4$
FP | $T^{n}$ | arbitrary | $\Omega(t^{-1/2})$ | $t<t^{*}(n)$
AFP | $T^{n}$ | arbitrary | $O(t^{-1})$ |
Proof.

(Sketch, $C^{n}$) Full proofs of all cases are provided in Appendix C. Define $\Delta_{0}=[0,\dots,0]^{\intercal}\in\mathbb{Z}^{n}$ and $\Delta_{t}=tC^{n}\,\overline{x}_{t}$ for each $t\in\mathbb{N}$. The desired results are equivalent to $\max\Delta_{t}=\Omega_{p}(\sqrt{t})$ under FP for the given tiebreaking rule, and to $\max\Delta_{t}$ being bounded under AFP for $n=3,4$. Let $i_{t}$ be the index played by FP (AFP) at time $t$ (so $x_{t}=e_{i_{t}}$). It follows that

\displaystyle\Delta_{t+1,j}=\begin{cases}\Delta_{t,j}-1&\text{if }j=i_{t}-1\mod n;\\ \Delta_{t,j}+1&\text{if }j=i_{t}+1\mod n;\\ \Delta_{t,j}&\text{otherwise;}\end{cases}\qquad(2)

for each $t\in\mathbb{N}_{0}$ and $j\in\{1,\dots,n\}$. Note that the entries of $\Delta_{t}$ always sum to zero.

In the case of FP, it is easy to verify that $\max\Delta_{t}$ is nondecreasing for any choice of tiebreaking. For each $m\in\mathbb{N}_{0}$, define $t_{m}=\inf\{t\in\mathbb{N}_{0}:\max\Delta_{t}=m\}$. Then by Markov’s inequality,

\displaystyle P(\max\Delta_{t}<m)=P(t_{m}>t)\leq E(t_{m})/t=\frac{1}{t}\sum_{k=0}^{m-1}E(t_{k+1}-t_{k}).

Examining the timesteps at which $i_{t+1}\neq i_{t}$ and relating them to $\{t_{k}\}$, we show in the appendix that the time to increment the max from $k$ to $k+1$ satisfies $E(t_{k+1}-t_{k})=O(k)$. Thus the bound above becomes $P(\max\Delta_{t}<m)\leq O(m^{2})/t$. Now let $c\in\mathbb{R}_{\geq 0}$ be arbitrary and plug in $c\lceil\sqrt{t}\rceil$ for $m$, so that $P(\max\Delta_{t}<c\sqrt{t})\leq c^{2}O(1)\rightarrow 0$ as $c\rightarrow 0$. So $\max\Delta_{t}=\Omega_{p}(\sqrt{t})$.

For the AFP case, consider the first timestep at which $\max\Delta_{t}=m+1$. Working backwards and checking cases, it can be shown that in order for the maximum value to increment from $m$ to $m+1$, there must first be a timestep at which there are two non-adjacent entries equal to $m$ with an entry equal to $m-1$ between them. This cannot happen in the $n=3$, $m=2$ case because three positive entries $(2,1,2)$ do not sum to zero. Similarly, in the $n=4$, $m=2$ case, it turns out by (2) that $\Delta_{t}=[a,b,-a,-b]$ for some $a$, $b$, so there cannot be three positive entries in this case either. Therefore $\max_{t}\Delta_{t}\leq 2$ for $n=3,4$. ∎
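The bounded-$\Delta_{t}$ behavior is easy to check empirically. The following sketch runs AFP symmetrically on $C^{n}$ and tracks $\max\Delta_{t}$; it is an illustrative check under one particular tiebreaking rule (lowest index), not part of the proof.

```python
import numpy as np

def max_delta_afp(n, T):
    """Track max over t of max(Delta_t), Delta_t = t * C^n x_bar_t,
    for AFP applied symmetrically to the cyclic game C^n."""
    C = np.zeros((n, n))
    for i in range(n):                 # build C^n as defined in Section 4
        C[i, (i - 1) % n] = 1.0
        C[i, (i + 1) % n] = -1.0
    x_bar = np.eye(n)[0]
    worst = np.max(C @ x_bar)          # max of Delta_1
    for t in range(1, T):
        x_prime = np.eye(n)[np.argmax(C @ x_bar)]        # anticipate
        x_bar_prime = (t * x_bar + x_prime) / (t + 1)
        x_next = np.eye(n)[np.argmax(C @ x_bar_prime)]   # respond
        x_bar = (t * x_bar + x_next) / (t + 1)
        worst = max(worst, (t + 1) * np.max(C @ x_bar))
    return worst

# Expected to remain bounded (at most 2) for n = 3, 4 per Proposition 2,
# in contrast to FP, whose maximum grows on the order of sqrt(t).
print(max_delta_afp(3, 2000), max_delta_afp(4, 2000))
```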

The proofs of Proposition 2 establish a theme: FP can be slow because it spends increasingly large amounts of time progressing between strategies (playing $x_{t}=x_{t+1}=\dots=x_{t+k}$ with $k$ increasing as $t$ increases), whereas AFP avoids this. (Return to Figure 1 for a visual example.)

Some further comments on the results: we only obtain the $O(t^{-1})$ rate for AFP applied to $C^{n}$ in the $n=3,4$ case. We conjecture that (i) for a specific tiebreaking rule, AFP has the same worst-case rate as FP but with a better constant, and (ii) under random tiebreaking, AFP is $O_{p}(t^{-1})$ for all $n$. This is reflected in numerical simulations for large $n$, as shown in Appendix D.

Our results are noteworthy for their lack of dependence on tiebreaking: worst-case analyses of FP typically rely on specific tiebreaking rules; see Daskalakis and Pan (2014), for example. As for the “$t<t^{*}(n)$” caveat for FP applied to $T^{n}$, this is an unremarkable consequence of analyzing a game with a pure strategy equilibrium (all probability assigned to a single strategy). We write $t^{*}(n)$ for the first index at which FP plays $e_{n}$. Both FP and AFP will play $e_{n}$ forever starting some finite number of steps after they play it for the first time, thus attaining a $t^{-1}$ rate as the average strategy “catches up” to $e_{n}$. Our result shows that until this point, FP is slow, whereas AFP is always fast. As before, AFP’s superior performance is reflected in numerical simulations, as shown in Appendix D.

4.1 Numerical results

In order to compare FP and AFP more generally, we sample large numbers of random payoff matrices and compute aggregate statistics across them. Matrix entries are sampled as independent, identically distributed, standard Gaussian variables (note that the shift- and scale-invariance of matrix game equilibria implies that the choice of mean and variance is inconsequential). Since FP and AFP are so similar, and AFP computes two best responses per timestep, it is natural to wonder: is AFP’s superior performance just an artifact of using more computation per timestep? So, in order to make a fair comparison, we compare the algorithms by the number of best responses calculated instead of the number of timesteps (algorithm iterations). Using the worst-case payoff as the measure of performance, we compare FP and AFP based on the number of best responses computed and based on matrix size in Figures 3 and 4.
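A stripped-down version of this comparison is sketched below. The replicate count, budget, and tie-breaking are placeholders rather than the exact settings used for Figures 3 and 4; the point is the equal-compute bookkeeping (two best responses per FP step and four per AFP step, counting both players).

```python
import numpy as np

def worst_case_after(A, br_budget, anticipatory):
    """Worst-case payoff of player 1's average strategy after spending
    br_budget best-response computations."""
    m, n = A.shape
    x_bar, y_bar = np.eye(m)[0], np.eye(n)[0]
    steps = br_budget // (4 if anticipatory else 2)
    for t in range(1, steps):
        if anticipatory:
            # Each player responds to the opponent's anticipated average.
            y_target = (t * y_bar + np.eye(n)[np.argmin(x_bar @ A)]) / (t + 1)
            x_target = (t * x_bar + np.eye(m)[np.argmax(A @ y_bar)]) / (t + 1)
        else:
            y_target, x_target = y_bar, x_bar
        x_next = np.eye(m)[np.argmax(A @ y_target)]
        y_next = np.eye(n)[np.argmin(x_target @ A)]
        x_bar = (t * x_bar + x_next) / (t + 1)
        y_bar = (t * y_bar + y_next) / (t + 1)
    return np.min(x_bar @ A)

rng = np.random.default_rng(0)
wins, reps = 0, 200                       # fewer replicates than the 1,000 in Figure 3
for _ in range(reps):
    A = rng.standard_normal((30, 30))
    wins += worst_case_after(A, 200, True) >= worst_case_after(A, 200, False)
print(wins / reps)                        # fraction of games where AFP >= FP
```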

The result is that AFP is clearly better on both counts. Although FP is better for a substantial proportion of $30\times 30$ games at very early timesteps, AFP quickly outpaces FP, eventually winning on each of the 1,000 matrices sampled. In terms of matrix size, FP and AFP appear equivalent on average for small matrices, but they quickly grow separated as matrix size grows, with AFP likely to be much better.

Figure 3: For 1,000 randomly sampled $30\times 30$ matrices $A$, the proportion of the time that $\min\,(\overline{x}_{r/2}^{\textnormal{AFP}})^{\intercal}A\geq\min\,(\overline{x}_{r}^{\textnormal{FP}})^{\intercal}A$ for $r=2,4,\dots,200$. A 95% Agresti-Coull confidence interval (Agresti and Coull, 1998) for the true proportion is highlighted. Note that after only about six best responses, AFP is better half the time, and by 130, AFP is better than FP essentially 100% of the time.
Figure 4: Average performance of FP vs. AFP at the 100th best response (timestep 100 for FP, timestep 50 for AFP) as matrix size is varied. All matrices are square. Highlighted regions show the 10th and 90th percentiles.

5 Application to reinforcement learning

We apply reinforcement learning (RL) (Sutton and Barto, 2018) versions of FP and AFP in the context of a (two-player, zero-sum, symmetric) stochastic game (Shapley, 1953), defined by the tuple $(\mathcal{S},\mathcal{O},\mathcal{X},\mathcal{A},\mathcal{P},\mathcal{R},p_{0})$, where $\mathcal{S}$ is the set of possible states of the environment, $\mathcal{O}$ is the set of possible observations received by an agent, $\mathcal{X}:\mathcal{S}\rightarrow\mathcal{O}\times\mathcal{O}$ gives the observations for each player based on the current state, $\mathcal{A}$ is the set of available actions, $\mathcal{P}:\mathcal{S}\times\mathcal{A}\times\mathcal{A}\rightarrow\Delta(\mathcal{S})$ defines the transition dynamics of the environment given each player’s action, $\mathcal{R}:\mathcal{S}\rightarrow\mathbb{R}\times\mathbb{R}$ defines the reward for both players such that $\mathcal{R}(s_{t})=(r_{t},-r_{t})$ are the rewards observed by each player at time $t$, and $p_{0}\in\Delta(\mathcal{S})$ is the initial distribution of states, so that $s_{0}\sim p_{0}$. Let $\mathcal{H}$ be the set of possible sequences of observations. Then a policy is a map $\pi:\mathcal{H}\rightarrow\Delta(\mathcal{A})$. An episode is played by iteratively transitioning the environment according to the actions sampled from each player’s policy at each state. Players 1 and 2 earn returns $(\sum_{t}r_{t},-\sum_{t}r_{t})$. The reinforcement learning algorithms we consider take sequences of observations, actions, and rewards from both players and use them to incrementally update policies toward earning greater expected returns. For background on reinforcement learning, see Sutton and Barto (2018). For details on machine learning approximations to FP, see Heinrich et al. (2015). Table 2 gives a high-level overview of the relationship.

Table 2: The normal-form game analogies used to extend FP and AFP to reinforcement learning.
Normal-form game | Stochastic/extensive-form game
Strategy | Policy
Payoff $A_{i,j}$ | Expected return $E_{\pi_{i},\pi_{j}}(\sum_{t}r_{t})$
Best response | Approximate best response by RL
Strategy mixture $\sum\alpha_{i}x_{i}$, $\sum\alpha_{i}=1$, $\alpha_{i}\geq 0$ | At the start of an episode, sample policy $\pi_{i}$ with probability $\alpha_{i}$; play the entire episode with $\pi_{i}$.
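The last row of Table 2 is the only subtle translation: a mixture is realized by committing to a single sampled policy for a whole episode. A minimal sketch (names are illustrative):

```python
import numpy as np

def sample_mixture_policy(policies, weights, rng):
    """Realize a policy mixture: draw one policy per episode according to
    `weights` and use it for the entire episode."""
    weights = np.asarray(weights, dtype=float)
    idx = rng.choice(len(policies), p=weights / weights.sum())
    return policies[idx]

# Example: three placeholder policies mixed with weights (0.5, 0.25, 0.25).
policies = ["pi_1", "pi_2", "pi_3"]
rng = np.random.default_rng(0)
episode_policy = sample_mixture_policy(policies, [0.5, 0.25, 0.25], rng)
```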

We use two environments: our own TinyFighter, and Running With Scissors from Vezhnevets et al. (2020).

5.1 Environments

TinyFighter is a minimal version of an arcade-style fighting game shown in Figure 5(a). It features two players with four possible actions: Move Left, Move Right, Kick, and Do Nothing. Players are represented by a rectangular body and when kicking, extend a rectangular leg towards the opponent.

Kicking consists of three phases: Startup, Active, and Recovery. Each phase of a kick lasts for a certain number of frames, and if the Active phase of the kick intersects with any part of the opponent (body or leg), a hit is registered. When a hit occurs, the players are pushed back, the opponent takes damage, and the opponent is stunned (unable to take actions) for a period of time. In the Startup and Recovery phases, the leg is extended, and like the body, can be hit by the opponent if the opponent has a kick in the active phase that intersects the player. The game is over when a player’s health is reduced to zero or when time runs out.

Player observations are vectors in $\mathbb{R}^{13}$ and contain information about player and opponent state: position, health, an ‘attacking’ indicator, a ‘stunned’ indicator, and how many frames a player has been in the current action. The observation also includes the distance between players, time remaining, and the direction of the opponent (left or right of self). The game is partially observable, so information about the opponent’s state is hidden from the player for some number of frames (we use four, and the game runs at 15 frames per second). This means a strong player must guess the distribution of actions the opponent may have taken recently and respond to that distribution; playing deterministically would allow the opponent to exploit the player, so a stochastic strategy is required to play well.

Running With Scissors (RWS) is a spatiotemporal environment with partial observability and the potential for nontransitive relationships between policies. As shown in Figure 5(b), RWS is a 2D gridworld with a few types of entities: two agents; three types of items (rock, paper, and scissors), which can be picked up by the agents to add to their inventories; and impassable walls. In addition to moving around and picking up items, agents in RWS have a “tag” action, which projects a cone in front of them for a single frame. If the cone hits the other agent, the episode ends and each agent receives rewards based on the payoff matrix $C^{n}$ according to the ratios of each item in their inventory. Agents can only see their own inventory and a $5\times 5$ grid in front of them and can remember the last four frames they’ve seen, so in order to perform effectively they must infer what items the opponent has picked up. Vezhnevets et al. (2020) and Liu et al. (2022) (Appendix B.1) provide further discussion of the environment.

(a) TinyFighter, where players dance back and forth, attempting to land kicks on each other in order to reduce the other’s health from 100 to zero.
(b) Running With Scissors, a 2D gridworld where players collect items before tagging each other to earn reward based on the proportion of items in their inventory.
Figure 5: Screenshots of multiagent RL environments.

5.2 Adapting FP and AFP to reinforcement learning

Neural Population Learning (NeuPL) (Liu et al., 2022) is a framework for multiagent reinforcement learning wherein a collection of policies is learned and represented by a single neural network and all policies train continuously. For our experiments, we implement FP and AFP within NeuPL, as shown in Algorithm 1. For reference, we also include a simple RL version of FP and AFP in the style of PSRO in Appendix F.

Algorithm 1 NeuPL-FP/AFP
1: $\mathfrak{O}\in\{\mathfrak{O}^{\text{FP}},\mathfrak{O}^{\text{AFP}}\}$ ▷ Input: FP or AFP opponent sampler.
2: $\{\Pi_{\theta}(t):\mathcal{H}\rightarrow\Delta(\mathcal{A})\}_{t=1}^{n}$ ▷ Input: neural population net.
3: for batch $b=1,2,3,\dots$ do
4:  $B\leftarrow\{\}$
5:  while per-batch compute budget remains do
6:   $T_{\text{learner}}\sim\text{Uniform}(\{1,\dots,n\})$
7:   $T_{\text{opponent}}\sim\mathfrak{O}(T_{\text{learner}})$
8:   $D_{\text{learner}}\leftarrow\textsc{PlayEpisode}(\Pi_{\theta}(T_{\text{learner}}),\Pi_{\theta}(T_{\text{opponent}}))$
9:   $B\leftarrow B\cup D_{\text{learner}}$
10:  end while
11:  $\Pi_{\theta}\leftarrow\textsc{ReinforcementLearningUpdate}(B)$
12: end for
Figure 6: A visual depiction of the distributions of opponents (“meta-strategies” in PSRO or NeuPL) that each learner faces in a population learning implementation of FP or AFP. The $(i,j)$ entry is the probability that, given that agent $i$ is training, it will face agent $j$ in a particular episode. Dark blue indicates probability 1; white indicates probability 0.

In NeuPL-FP/AFP, the opponent sampler $\mathfrak{O}$ determines the distributions of opponents that each agent faces and is the only difference between the FP and AFP implementations. We have, for each $t>1$,

\displaystyle\mathfrak{O}^{\text{FP}}(t)=\text{Uniform}(\{1,2,3,\dots,t-1\}),\text{ and}
\displaystyle\mathfrak{O}^{\text{AFP}}(t)=\text{Uniform}(\{k<t:k\text{ odd}\}\cup\{t-1\}).

These distributions are depicted in Figure 6. Just as each step of FP involves computing a best response to the average of all prior strategies, sampling from $\mathfrak{O}^{\text{FP}}(t)$ corresponds to training agent $t$ uniformly against the prior policies; just as AFP can be thought of as “forgetting” every other best response it calculates, $\mathfrak{O}^{\text{AFP}}(t)$ trains learner index $t$ uniformly against every odd-indexed policy plus the most recent policy. The neural population net $\Pi_{\theta}(t):\mathcal{H}\rightarrow\Delta(\mathcal{A})$ defines a different policy for each agent index $t$, and can equivalently be represented as $\Pi_{\theta}(a|s,t)$.
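In code, the two samplers differ by essentially one line; the following is a minimal sketch (function names are ours, not NeuPL's API):

```python
import numpy as np

def opponent_sampler_fp(t, rng):
    """FP: train agent t against all earlier agents uniformly."""
    return int(rng.choice(np.arange(1, t)))

def opponent_sampler_afp(t, rng):
    """AFP: train agent t against odd-indexed earlier agents plus agent t-1."""
    candidates = sorted({k for k in range(1, t) if k % 2 == 1} | {t - 1})
    return int(rng.choice(candidates))

rng = np.random.default_rng(0)
print([opponent_sampler_fp(6, rng) for _ in range(5)])   # uniform over 1..5
print([opponent_sampler_afp(6, rng) for _ in range(5)])  # uniform over {1, 3, 5}
```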

5.3 Experimental setup

For the neural population net, we use an actor-critic (Sutton and Barto, 2018) architecture similar to that used for RWS in Liu et al. (2022): first, a module processes environment observations into a dense representation that is shared between an actor head and a critic head. The actor takes this vector, concatenates it with a vector representing the distribution of opponents faced by the currently training agent $t$ (e.g., $[0.5,0.5,0,\dots,0]$ for agent $t=3$ in FP or AFP), then processes it with a dense MLP with ReLU activations, with action masking applied prior to a final softmax layer. The critic is similar, except that it uses an additional input: the index of the opponent sampled at the beginning of this episode, matching the original implementation. (Note that the actor, i.e., the policy network, does not observe which opponent it faces, only the distribution over agents it faces; this is important because otherwise our agent would not learn a best response to an average policy as intended in FP and AFP. The reason for including this information for the critic, i.e., the value network, is that it may reduce the variance of the value function estimator.) For the exact architectures used, see Appendix E. We use a neural population size of $n=11$. Based on matrix game simulations and preliminary RL experiments, we determined that AFP performs slightly better over a short time horizon when initialized with two steps of FP, so we do this, as shown in the final panel of Figure 6. See Appendix D for a figure comparing performance in the matrix setting.

We implemented NeuPL within a basic self-play reinforcement learning loop by wrapping the base environment (TinyFighter or RWS) within a lightweight environment that handles NeuPL logic, such as opponent sampling. For reinforcement learning, we use the Asynchronous Proximal Policy Optimization (APPO) algorithm (Schulman et al., 2017), a distributed actor-critic RL algorithm, as implemented in RLlib (Moritz et al., 2018) with a single GPU learner. Hyperparameter settings are given in Appendix E. We train the entire neural population net (agents 1-11) for 12,500 steps, where a step is roughly 450 minibatch updates of stochastic gradient descent. This corresponds to about five days of training. For each of the two environments, we repeat this procedure independently 10 times for FP and 10 times for AFP.

5.4 Results

To evaluate exploitability, we make use of the fact that each FP and AFP neural population is made up of agents trained to “exploit” the ones that came before them. Specifically, each agent is trained to approximate a best response to the average policy returned by the algorithm at the previous timestep. So, to estimate the exploitability of NeuPL-FP or NeuPL-AFP at step $t\in\{1,\dots,n-1\}$, we simply use the average return earned by agent $t+1$ against agents $\{1,\dots,t\}$ to obtain the within-population exploitability at $t$. This is a convenient metric, but insufficient on its own. In order for it to be meaningful, the agents in the population must have learned approximate best responses that are close to actual best responses; if they have not, it could be that within-population exploitability is low, not because the average policy approximates a Nash policy but because nothing has been learned at all. To account for this, we also evaluate the populations learned using relative population performance (Balduzzi et al., 2019), which measures the strength of one population of agents against the other. The purpose of using relative population performance is simply to verify that one algorithm did not produce generally more competent agents than the other.
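Given an estimated cross-play return table for a population, the within-population exploitability estimate reduces to a one-line average. A sketch, assuming such a table has already been estimated by playing many episodes between each pair of agents:

```python
import numpy as np

def within_population_exploitability(returns, t):
    """Estimated exploitability of the average policy at step t, computed as
    the average return of agent t+1 against agents 1..t.
    `returns[i, j]` is the estimated return of agent i+1 versus agent j+1."""
    return returns[t, :t].mean()

# Example with a placeholder 11x11 return table.
rng = np.random.default_rng(0)
returns = rng.normal(size=(11, 11))
print([within_population_exploitability(returns, t) for t in range(1, 11)])
```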

We paired each of the 10 replicates of FP and AFP and computed the relative population performance for each pair. On TinyFighter, the average was -0.73, with a Z-test-based 90% confidence interval width of 1.37. On RWS, the average was 4.60, with a Z-test-based 90% confidence interval width of 3.50. We conclude that the agents learned by FP and AFP are not statistically significantly different in terms of performance on TinyFighter, but the agents learned by FP have a slight, statistically significant advantage on RWS. However, these differences are small relative to the total obtainable reward in either environment (20 for TinyFighter, roughly 60 for RWS), so we conclude it is reasonable to use within-population exploitability to compare FP and AFP, as shown in Figure 7. For consistency with the matrix game simulation results, we plot worst-case payoff, which is simply the negative of exploitability in this case.

(a) TinyFighter
(b) Running With Scissors
Figure 7: Estimated worst-case payoffs for FP and AFP on two stochastic games. Highlighting indicates a pointwise 90% confidence region.

We find that AFP has a significantly better worst-case payoff of -7.4, versus -10.6 for FP, at the final timestep in TinyFighter. This corresponds to a noteworthy 16% reduction in exploitability relative to the total possible reward of 20 that can be earned in TinyFighter. In RWS, the algorithms have essentially identical performance. The fact that AFP’s advantage varies widely by environment is not surprising. The matrix game simulations in Figure 3 showed that until over 100 steps of each algorithm, there is some proportion of games for which FP performs better. Correspondingly, we would expect that there is a nontrivial proportion of stochastic games where NeuPL-FP outperforms NeuPL-AFP for small population sizes. Although we expect NeuPL will not be able to support over 100 policies (the original paper used population size 8), it would be possible to do so within the PSRO framework. This remains a topic for further investigation.

6 Conclusion

We proposed a variant of fictitious play for faster estimation of Nash equilibria in two-player, zero-sum games. Anticipatory fictitious play is intuitive, easy to implement, and supported by theory and numerical simulations which suggest that it is virtually always preferable to fictitious play. Consequently, we shed new light on two motivating problems for fictitious play: primarily, large-scale multiagent reinforcement learning for complicated real-world games, and secondarily, modeling strategic decision making in humans. Further work is needed to understand the conditions under which AFP outperforms FP in the reinforcement learning setting.

Acknowledgements

The authors sincerely thank Ryan Martin for continued mentorship, especially around theory and the presentation of the paper; Philip Wardlaw for orchestrating preliminary reinforcement learning experiments; Adam Venis and Angelo Olcese for contributions to the proofs of FP and AFP applied to $C^{n}$; Jesse Clifton and Eric Laber for helpful comments on a draft; Jesse Clifton and Marc Lanctot for suggesting related works that had been overlooked; and Siqi Liu for helpful explanations and implementation details for NeuPL.

References

  • Adler, (2013) Adler, I. (2013). The equivalence of linear programs and zero-sum games. International Journal of Game Theory, 42(1):165–177.
  • Agresti and Coull, (1998) Agresti, A. and Coull, B. A. (1998). Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician, 52(2):119–126.
  • Balduzzi et al., (2019) Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W., Perolat, J., Jaderberg, M., and Graepel, T. (2019). Open-ended learning in symmetric zero-sum games. In International Conference on Machine Learning, pages 434–443. PMLR.
  • Brown, (1951) Brown, G. W. (1951). Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1):374–376.
  • Chollet et al., (2015) Chollet, F. et al. (2015). Keras.
  • Conlisk, (1993a) Conlisk, J. (1993a). Adaptation in games: Two solutions to the Crawford puzzle. Journal of Economic Behavior & Organization, 22(1):25–50.
  • Conlisk, (1993b) Conlisk, J. (1993b). Adaptive tactics in games: Further solutions to the Crawford puzzle. Journal of Economic Behavior & Organization, 22(1):51–68.
  • Daskalakis and Pan, (2014) Daskalakis, C. and Pan, Q. (2014). A counter-example to Karlin’s strong conjecture for fictitious play. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 11–20, Philadelphia, PA, USA. IEEE.
  • Foerster et al., (2018) Foerster, J., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., and Mordatch, I. (2018). Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 122–130. AAMAS.
  • Heinrich et al., (2015) Heinrich, J., Lanctot, M., and Silver, D. (2015). Fictitious self-play in extensive-form games. In International Conference on Machine Learning, pages 805–813.
  • Kuhn, (1953) Kuhn, H. (1953). Extensive games and the problem of information. In Contributions to the Theory of Games, pages 193–216. Princeton University Press.
  • Lanctot et al., (2017) Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Perolat, J., Silver, D., and Graepel, T. (2017). A unified game-theoretic approach to multiagent reinforcement learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  • Letcher et al., (2018) Letcher, A., Foerster, J., Balduzzi, D., Rocktäschel, T., and Whiteson, S. (2018). Stable opponent shaping in differentiable games. In International Conference on Learning Representations, Vancouver, BC, Canada. ICLR.
  • Liu et al., (2022) Liu, S., Marris, L., Hennes, D., Merel, J., Heess, N., and Graepel, T. (2022). NeuPL: Neural population learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
  • Lockhart et al., (2019) Lockhart, E., Lanctot, M., Pérolat, J., Lespiau, J.-B., Morrill, D., Timbers, F., and Tuyls, K. (2019). Computing approximate equilibria in sequential adversarial games by exploitability descent. arXiv preprint arXiv:1903.05614.
  • Luce and Raiffa, (1989) Luce, R. D. and Raiffa, H. (1989). Games and Decisions: Introduction and Critical Survey. Courier Corporation.
  • Moritz et al., (2018) Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan, M. I., and Stoica, I. (2018). Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577.
  • Nash Jr, (1950) Nash Jr, J. F. (1950). Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1):48–49.
  • Robinson, (1951) Robinson, J. (1951). An iterative method of solving a game. Annals of Mathematics, 54(2):296–301.
  • Schulman et al., (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Shamma and Arslan, (2005) Shamma, J. S. and Arslan, G. (2005). Dynamic fictitious play, dynamic gradient play, and distributed convergence to nash equilibria. IEEE Transactions on Automatic Control, 50(3):312–327.
  • Shapley, (1953) Shapley, L. S. (1953). Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100.
  • Shoham and Leyton-Brown, (2008) Shoham, Y. and Leyton-Brown, K. (2008). Multiagent Systems: Algorithmic, Game-theoretic, and Logical Foundations. Cambridge University Press.
  • Sutton and Barto, (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT press, Cambridge, Massachusetts.
  • Vezhnevets et al., (2020) Vezhnevets, A., Wu, Y., Eckstein, M., Leblond, R., and Leibo, J. Z. (2020). Options as responses: Grounding behavioural hierarchies in multi-agent reinforcement learning. In International Conference on Machine Learning, pages 9733–9742. PMLR.
  • Vinyals et al., (2019) Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J. P., Jaderberg, M., Vezhnevets, A. S., Leblond, R., Pohlen, T., Dalibard, V., Budden, D., Sulsky, Y., Molloy, J., Paine, T. L., Gulcehre, C., Wang, Z., Pfaff, T., Wu, Y., Ring, R., Yogatama, D., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Kavukcuoglu, K., Hassabis, D., Apps, C., and Silver, D. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354.
  • Zhang and Lesser, (2010) Zhang, C. and Lesser, V. (2010). Multi-agent learning with policy prediction. In Twenty-Fourth AAAI Conference on Artificial Intelligence, pages 927–937. AAAI.

Appendix

Appendix A Why naive AFP doesn’t work

The original idea for the AFP algorithm, which we refer to as naive AFP, was: at each timestep, play the best response to the opponent’s best response to your history. Formally, this is given by the following updates (cf. the AFP update (1)):

\displaystyle x^{\prime}_{t+1}\in\texttt{BR}^{1}_{A}(\overline{y}_{t});\qquad y^{\prime}_{t+1}\in\texttt{BR}^{2}_{A}(\overline{x}_{t});
\displaystyle x_{t+1}\in\texttt{BR}^{1}_{A}(y^{\prime}_{t+1});\qquad y_{t+1}\in\texttt{BR}^{2}_{A}(x^{\prime}_{t+1});
\displaystyle\overline{x}_{t+1}=\frac{1}{t+1}\sum_{k=1}^{t+1}x_{k};\qquad\overline{y}_{t+1}=\frac{1}{t+1}\sum_{k=1}^{t+1}y_{k}.

In preliminary simulations, naive AFP performed well in cyclic games and seemed to be competitive with FP in other games.

However, upon inspection it becomes clear that naive AFP is not guaranteed to converge to a Nash equilibrium. This is because naive AFP can only play best responses to strategies returned by the best response operator, which are pure strategies, but the Nash equilibria of some games assign nonzero probability to strategies that are not best responses to any pure strategy. Thus there are games where naive AFP is incapable of assigning any probability to actions which must be assigned nonzero probability in a Nash equilibrium, so naive AFP cannot converge to a Nash equilibrium. For example, see the game given in Table 3.

(row \ column) | Rock | Paper | Scissors
Rock | 0 | -1 | 1
Paper | 1 | 0 | -1
Scissors | -1 | 1 | 0
SafeRock | 0 | 0 | 0.99
Table 3: Rock Paper Scissors SafeRock. In this game, SafeRock is not a best response to any of the opponent’s pure strategies, so it won’t be played by naive AFP. However, SafeRock is included in the Nash equilibrium, which is given approximately by the probabilities ([0.33,0.47,0.2],[0,0.32,0.18,0.15]), with value $v^{*}=0.133$. If SafeRock were removed, the game would be symmetric and have value 0, so we know SafeRock must be included in any Nash support.
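For readers who want to check claims like this one, a generic way to solve small matrix games is the standard linear program for the row player (maximize $v$ subject to $x^{\intercal}A\geq v$ and $x$ in the simplex). The sketch below uses SciPy; it is a generic solver, not the code used for this paper.

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Nash strategy and value for the row (maximizing) player of the
    zero-sum game A, via the standard linear programming formulation."""
    m, n = A.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                 # minimize -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])    # v - x^T A_j <= 0 for each column j
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]])[None, :]   # sum(x) = 1
    b_eq = [1.0]
    bounds = [(0, None)] * m + [(None, None)]    # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Rock, Paper, Scissors, SafeRock (rows) vs Rock, Paper, Scissors (columns).
A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0], [0, 0, 0.99]])
x_star, v_star = solve_zero_sum(A)
print(x_star, v_star)   # SafeRock receives positive probability; value > 0
```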

Appendix B Proof of Proposition 1

Our analysis closely follows the original proof of FP’s convergence given in Robinson (1951), which consists of four key lemmas. We introduce the notion of a perturbed fictitious play system, a slight generalization of Robinson’s “vector system.” We modify Robinson’s second lemma accordingly, which causes inconsequential changes in the third and fourth lemmas, leaving the final result the same. In this way, we extend Robinson’s proof to a broader class of algorithms which include fictitious play and anticipatory fictitious play as special cases.

Definition 1.

An iterative play system $(U,V)$ for $A\in\mathbb{R}^{m\times n}$ is a pair of sequences of vectors $U=\{U(0),U(1),\dots\}$ and $V=\{V(0),V(1),\dots\}$ with each $U(t)\in\mathbb{R}^{n}$ and $V(t)\in\mathbb{R}^{m}$ satisfying the following properties:

  • $\min U(0)=\max V(0)$, and

  • for each $t$, $U(t+1)=U(t)+A_{i(t),*}$ and $V(t+1)=V(t)+A_{*,j(t)}$,

where $A_{i,*}$ and $A_{*,j}$ are the $i$th row and $j$th column of $A$, and $i(t)$ and $j(t)$ are the row and column “played” by players 1 and 2 at time $t$.

The interpretation of an iterative play system is as follows. Suppose we choose $U(0)=0$ and $V(0)=0$. Write $e_{h}$ to indicate a vector with a 1 at index $h$ and 0’s elsewhere. Then $\overline{x}_{t}=t^{-1}\sum_{k=1}^{t}e_{i(k)}$ and $\overline{y}_{t}=t^{-1}\sum_{k=1}^{t}e_{j(k)}$ are the empirical strategies played by players 1 and 2, and $t^{-1}V(t)=A\overline{y}_{t}$ and $t^{-1}U(t)=A^{\intercal}\overline{x}_{t}$ are the payoffs for players 1 and 2 when faced with those empirical strategies. In this way, $V(t)$ and $U(t)$ can be seen as “accumulating empirical payoffs” for players 1 and 2.

Definition 2.

A perturbed fictitious play system (PFP-system) is an iterative play system with the additional property that $i(t)$ and $j(t)$ are best responses to a perturbed set of empirical payoffs. Precisely, an iterative play system is a PFP-system if, for some values $E_{V}(t)\in\mathbb{R}^{m}$ and $E_{U}(t)\in\mathbb{R}^{n}$ with $\|E_{V}(t)\|_{\infty}<a:=\max_{i,j}|A_{i,j}|$ and $\|E_{U}(t)\|_{\infty}<a$ for each $t$,

\displaystyle i(t+1)\in\arg\max\big[V(t)+E_{V}(t)\big]\text{ and }j(t+1)\in\arg\min\big[U(t)+E_{U}(t)\big].

A special case of a PFP-system is what Robinson calls a “vector system,” which describes fictitious play. This is obtained by setting all entries of $E_{V}(t)$ and $E_{U}(t)$ to zero at all timesteps.

Lemma 1.

If $(U,V)$ is an iterative play system for $A$, then

\displaystyle\liminf_{t\rightarrow\infty}\;t^{-1}\bigl\{\max V(t)-\min U(t)\bigr\}\geq 0.

This lemma follows from the minimax nature of two-player zero-sum games and holds regardless of which rows and columns of $A$ are used to update the elements of $U$ and $V$.

Definition 3.

Given an iterative play system $(U,V)$ for matrix $A$, we say row $A_{i,*}$ is $E$-eligible in the interval $[t,t^{\prime}]$ if for some $t_{1}\in[t,t^{\prime}]$, $i$ could have been played as part of AFP. Precisely, the condition is that there exists $t_{1}\in[t,t^{\prime}]$ such that

\displaystyle i\in\arg\max\big[V(t_{1})+E\big]

for some $E\in\mathbb{R}^{m}$ with $\|E\|_{\infty}<a$. Or, equivalently,

\displaystyle v_{i}(t_{1})\geq\max V(t_{1})-2a.

Similarly, we say column $j$ is $E$-eligible if

\displaystyle u_{j}(t_{1})\leq\min U(t_{1})+2a.
Lemma 2.

If $(U,V)$ is an iterative play system for $A$ and all rows and columns are $E$-eligible in $[s,s+t]$, then we have

\displaystyle\max U(s+t)-\min U(s+t)\leq 2a(t+1),\text{ and}
\displaystyle\max V(s+t)-\min V(s+t)\leq 2a(t+1).
Proof.

Let jargmaxU(s+t)j\in\arg\max\,U(s+t). By the EE-eligibility of jj, there must exist t[s,s+t]t^{\prime}\in[s,s+t] such that

uj(t)minU(t)2a.\displaystyle u_{j}(t^{\prime})-\min U(t^{\prime})\leq 2a.

So, because the jjth entry can’t change by more than aa per timestep,

\displaystyle\max\,U(s+t)=u_{j}(s+t)
\displaystyle\leq u_{j}(t^{\prime})+at
\displaystyle\leq\min\,U(t^{\prime})+2a+at
\displaystyle\leq\min\,U(s+t)+at+2a+at,

where the last inequality holds because each entry of U changes by at most a per timestep, so \min U(t^{\prime})\leq\min U(s+t)+at for t^{\prime}\in[s,s+t]. A similar argument shows the result for V. ∎

Lemma 3.

If (U,V)(U,V) is an iterative play system for AA and all rows and columns are EE-eligible in [s,s+t][s,s+t], then

maxV(s+t)minU(s+t)4a(t+1).\displaystyle\max V(s+t)-\min U(s+t)\leq 4a(t+1).
Proof.

As shown in Robinson, (1951), this follows immediately from Lemma 2. The only difference here is that we replace 2at2at with 2a(t+1)2a(t+1). ∎

Lemma 4.

For every matrix A and every \epsilon>0, there exists t_{0} such that, for any PFP-system (U,V) for A,

maxV(t)minU(t)ϵt for all tt0.\displaystyle\max V(t)-\min U(t)\leq\epsilon t\text{ for all }t\geq t_{0}.
Proof.

We follow the proof of Robinson, (1951), replacing the notion of eligibility with E-eligibility, and argue by induction on the size of A. If A\in\mathbb{R}^{1\times 1}, the result is trivial because V(t)=U(t) for all t. Now assume that the property holds for every submatrix A^{\prime} obtained by deleting one or more rows or columns from A. We wish to show that the property also holds for A.

Using the inductive hypothesis, pick tt^{*} such that

maxV(t)minU(t)<12ϵt for all tt\displaystyle\max V^{\prime}(t)-\min U^{\prime}(t)<\tfrac{1}{2}\epsilon t\text{ for all }t\geq t^{*}

for any proper submatrix A^{\prime} of A and any PFP-system (U^{\prime},V^{\prime}) for A^{\prime}.

As in Robinson, (1951), we wish to show that if in (U,V)(U,V), some row or column is not EE-eligible in [s,s+t][s,s+t^{*}], then

maxV(s+t)minU(s+t)<maxV(s)minU(s)+12ϵt\displaystyle\max V(s+t^{*})-\min U(s+t^{*})<\max V(s)-\min U(s)+\tfrac{1}{2}\epsilon t^{*} (3)

Suppose without loss of generality that row A_{m,*} is not E-eligible in [s,s+t^{*}]. Then we construct a new PFP-system (U^{\prime},V^{\prime}) for the matrix A^{\prime}, which is A with row m deleted. Define

U(t)\displaystyle U^{\prime}(t) =U(s+t)+c 1n,\displaystyle=U(s+t)+c\,\mathds{1}_{n},
V(t)\displaystyle V^{\prime}(t) =ProjmV(s+t),\displaystyle=\textnormal{Proj}_{m}V(s+t),

for t=0,1,\dots,t^{*}, where \mathds{1}_{n} is a vector of 1’s with n entries, c=\max V(s)-\min U(s), and \textnormal{Proj}_{k}:\mathbb{R}^{m}\rightarrow\mathbb{R}^{m-1} is the operator that removes entry k. We now check that (U^{\prime},V^{\prime}) satisfies the conditions of a PFP-system:

  • We have minU(0)=min{U(s)+[maxV(s)minU(s)]𝟙n}=maxV(s)=maxV(0)\min U^{\prime}(0)=\min\{U(s)+[\max V(s)-\min U(s)]\mathds{1}_{n}\}=\max V(s)=\max V^{\prime}(0), where the last equality holds because mm is not EE-eligible, so it could not be a maximizer of V(s)V(s), so deleting it to form V(0)V^{\prime}(0) does not change the maximum.

  • We have that for each tt,

    U(t+1)\displaystyle U^{\prime}(t+1) =U(s+t+1)+c𝟙n=U(s+t)+Ai(s+t),+c𝟙n=U(t)+Ai(s+t),,\displaystyle=U(s+t+1)+c\mathds{1}_{n}=U(s+t)+A_{i(s+t),*}+c\mathds{1}_{n}=U^{\prime}(t)+A^{\prime}_{i(s+t),*},
    V(t+1)\displaystyle V^{\prime}(t+1) =ProjmV(s+t+1)=Projm[V(s+t)+A,j(s+t)]=V(t)+A,j(s+t),\displaystyle=\textnormal{Proj}_{m}V(s+t+1)=\textnormal{Proj}_{m}[V(s+t)+A_{*,j(s+t)}]=V^{\prime}(t)+A^{\prime}_{*,j(s+t)},

    where A_{i(s+t),*}=A^{\prime}_{i(s+t),*}, because m is not E-eligible in [s,s+t^{*}] and hence i(s+t)\neq m.

  • Finally, we must show that the rows and columns selected still qualify as perturbed best responses within the context of (U^{\prime},V^{\prime}) and A^{\prime}, i.e., that

    vi(s+t)(t)maxV(t)2a and uj(s+t)(t)minU(t)+2a\displaystyle v^{\prime}_{i(s+t)}(t)\geq\max V^{\prime}(t)-2a\text{ and }u^{\prime}_{j(s+t)}(t)\leq\min U^{\prime}(t)+2a

    for each t=0,1,\dots,t^{*}. By the definition of V^{\prime} and the fact that i(s+t)\neq m, we have

    \displaystyle v^{\prime}_{i(s+t)}(t)=v_{i(s+t)}(s+t)
    \displaystyle\geq\max V(s+t)-2a ((U,V) is a PFP-system)
    \displaystyle=\max V^{\prime}(t)-2a, (m is not E-eligible, so deleting entry m does not change the maximum)

    and U(t)U^{\prime}(t) is just a shifted version of U(s+t)U(s+t), so the fact that uj(s+t)(s+t)minU(s+t)+2au_{j(s+t)}(s+t)\leq\min U(s+t)+2a implies the result.

These points verify that (U^{\prime},V^{\prime}) satisfies the conditions of a PFP-system for A^{\prime} on t=0,\dots,t^{*}. We can choose the remaining values of both sequences for t=t^{*}+1,t^{*}+2,\dots so that the PFP conditions continue to hold. So, using the inductive hypothesis, we have

maxV(s+t)minU(s+t)\displaystyle\max V(s+t^{*})-\min U(s+t^{*}) =maxV(t)min{U(t)[maxV(s)minU(s)]𝟙n}\displaystyle=\max V^{\prime}(t^{*})-\min\big{\{}U^{\prime}(t^{*})-[\max V(s)-\min U(s)]\mathds{1}_{n}\big{\}}
=maxV(t)minU(t)+maxV(s)minU(s)\displaystyle=\max V^{\prime}(t^{*})-\min U^{\prime}(t^{*})+\max V(s)-\min U(s)
<12ϵt+maxV(s)minU(s).\displaystyle<\tfrac{1}{2}\epsilon t^{*}+\max V(s)-\min U(s).

With (3) established, we are ready to finish the proof: under the inductive hypothesis, we split time into length-t^{*} intervals and treat two cases: if some row or column is not E-eligible in an interval, we apply (3); if all rows and columns are E-eligible, we apply Lemma 3.

Specifically, we show that for any PFP-system (U,V) for A and t\geq 16at^{*}/\epsilon,

maxV(t)minU(t)<ϵt.\displaystyle\max V(t)-\min U(t)<\epsilon t.

Let t>t^{*} and express it as t=(\theta+q)t^{*}, where q\in\mathbb{N} and \theta\in[0,1). We consider the collection of length-t^{*} intervals [(\theta+r-1)t^{*},(\theta+r)t^{*}] for r=1,\dots,q.

  • Case 1. There is at least one interval in which all rows and columns are E-eligible. Let [(\theta+s-1)t^{*},(\theta+s)t^{*}] be the latest such interval. By Lemma 3,

    \displaystyle\max V[(\theta+s)t^{*}]-\min U[(\theta+s)t^{*}]\leq 4a(t^{*}+1)\leq 8at^{*},

    where the last inequality holds because t^{*}\geq 1. By the choice of this interval, every subsequent interval with r=s+1,\dots,q contains some row or column that is not E-eligible, so (3) gives that

    \displaystyle\max V(t)-\min U(t)\leq\max V[(\theta+s)t^{*}]-\min U[(\theta+s)t^{*}]+\tfrac{1}{2}\epsilon t^{*}(q-s),

    noting that the result holds trivially if q=sq=s. Combining the previous two results and loosening the bound, we have

    maxV(t)minU(t)8at+12ϵt(qs)(8a+12ϵq)t.\displaystyle\max V(t)-\min U(t)\leq 8at^{*}+\tfrac{1}{2}\epsilon t^{*}(q-s)\leq(8a+\tfrac{1}{2}\epsilon q)t^{*}. (4)
  • Case 2. In each interval [(\theta+r-1)t^{*},(\theta+r)t^{*}] for r=1,\dots,q, some row or column is not E-eligible. Applying (3) repeatedly,

    maxV(t)minU(t)\displaystyle\max V(t)-\min U(t) =maxV[(θ+q)t]minU[(θ+q)t]\displaystyle=\max V[(\theta+q)t^{*}]-\min U[(\theta+q)t^{*}]
    <maxV[(θ+q1)t]minU[(θ+q1)t]+12ϵt\displaystyle<\max V[(\theta+q-1)t^{*}]-\min U[(\theta+q-1)t^{*}]+\tfrac{1}{2}\epsilon t^{*}
    <maxV[(θ+q2)t]minU[(θ+q2)t]+12ϵt+12ϵt\displaystyle<\max V[(\theta+q-2)t^{*}]-\min U[(\theta+q-2)t^{*}]+\tfrac{1}{2}\epsilon t^{*}+\tfrac{1}{2}\epsilon t^{*}
    <\displaystyle<\dots
    <maxV(θt)minU(θt)+12qϵt\displaystyle<\max V(\theta t^{*})-\min U(\theta t^{*})+\tfrac{1}{2}q\epsilon t^{*}
    2aθt+12qϵt\displaystyle\leq 2a\theta t^{*}+\tfrac{1}{2}q\epsilon t^{*} (5)
    =(2aθ+12qϵ)t,\displaystyle=(2a\theta+\tfrac{1}{2}q\epsilon)t^{*}, (6)

    where the last inequality holds because \max V(\theta t^{*})\leq a\theta t^{*} and \min U(\theta t^{*})\geq-a\theta t^{*}.

So, comparing (4) and (6) and noting that θ[0,1)\theta\in[0,1), in either case we have that

maxV(t)minU(t)8at+12ϵ(qt)8at+12ϵtϵt\displaystyle\max V(t)-\min U(t)\leq 8at^{*}+\tfrac{1}{2}\epsilon(qt^{*})\leq 8at^{*}+\tfrac{1}{2}\epsilon t\leq\epsilon t

for all t16at/ϵt\geq 16at^{*}/\epsilon.∎

Finally, we are ready for the proof of Proposition 1, which is essentially identical to the final proof of Theorem 1 in Robinson, (1951).

Proof.

Let V(0)=0mV(0)=0\in\mathbb{R}^{m}, U(0)=0nU(0)=0\in\mathbb{R}^{n}, and V(t)=tAy¯tV(t)=tA\overline{y}_{t}, U(t)=tAx¯tU(t)=tA^{\intercal}\overline{x}_{t} for tt\in\mathbb{N}, where x¯t\overline{x}_{t} and y¯t\overline{y}_{t} are as given in (1). Clearly, (U,V)(U,V) forms an iterative play system. It follows from (1) that (U,V)(U,V) is also a PFP-system with EV(t)=ABRA2(x¯t)E_{V}(t)=A\cdot\texttt{BR}^{2}_{A}(\overline{x}_{t}) and EU(t)=ABRA1(y¯t)E_{U}(t)=A^{\intercal}\cdot\texttt{BR}^{1}_{A}(\overline{y}_{t}). This is because

y¯t\displaystyle\overline{y}^{\prime}_{t} =t1ty¯t1+1tBRA2(x¯t1) implies\displaystyle=\tfrac{t-1}{t}\overline{y}_{t-1}+\tfrac{1}{t}\texttt{BR}^{2}_{A}(\overline{x}_{t-1})\text{ implies }
tAy¯t\displaystyle tA\overline{y}^{\prime}_{t} =(t1)Ay¯t1+ABRA2(x¯t1)\displaystyle=(t-1)A\overline{y}_{t-1}+A\cdot\texttt{BR}^{2}_{A}(\overline{x}_{t-1})
=V(t1)+EV(t1),\displaystyle=V(t-1)+E_{V}(t-1),

so x_{t}=\texttt{BR}^{1}_{A}(\overline{y}^{\prime}_{t})=e_{i(t)}, where i(t)\in\arg\max[V(t-1)+E_{V}(t-1)]. A similar argument holds for y_{t}.

So, by Lemmas 1 and 4,

limt(maxAy¯tminx¯tA)=limtmaxV(t)minU(t)t=0,\displaystyle\lim_{t\rightarrow\infty}\big{(}\!\max A\overline{y}_{t}-\min\overline{x}_{t}^{\intercal}A\big{)}=\lim_{t\rightarrow\infty}\frac{\max V(t)-\min U(t)}{t}=0,

where the first equality follows from the definition of V(t)V(t) and U(t)U(t). Combining this with the fact that, for all tt,

maxAy¯t\displaystyle\max A\overline{y}_{t} infyΔn(maxAy)=v, and\displaystyle\geq\inf_{y\in\Delta^{n}}(\max Ay)=v^{*},\text{ and}
minx¯tA\displaystyle\min\overline{x}_{t}^{\intercal}A supxΔm(minxA)=v,\displaystyle\leq\sup_{x\in\Delta^{m}}(\min x^{\intercal}A)=v^{*},

we have that

limtmaxAy¯t=limtminx¯tA=v,\displaystyle\lim_{t\rightarrow\infty}\max A\overline{y}_{t}=\lim_{t\rightarrow\infty}\min\overline{x}_{t}^{\intercal}A=v^{*},

concluding the proof of convergence of AFP. ∎
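As a sanity check on this correspondence, the following sketch (ours; the AFP update is transcribed from the display above, and the random payoff matrix and initialization are illustrative choices) verifies numerically that the anticipatory step is a perturbed best response in the sense of Definition 2:

import numpy as np

def br1(A, y):
    # Player 1's pure best response: a maximizer of A @ y.
    e = np.zeros(A.shape[0])
    e[np.argmax(A @ y)] = 1.0
    return e

def br2(A, x):
    # Player 2's pure best response: a minimizer of A.T @ x.
    e = np.zeros(A.shape[1])
    e[np.argmin(A.T @ x)] = 1.0
    return e

rng = np.random.default_rng(0)
A = rng.uniform(-1.0, 1.0, size=(4, 5))
a = np.max(np.abs(A))
x_bar = np.eye(A.shape[0])[0]                       # arbitrary first plays
y_bar = np.eye(A.shape[1])[0]
for t in range(1, 500):
    # Anticipated averages: each player folds in the opponent's best response.
    y_ant = (t * y_bar + br2(A, x_bar)) / (t + 1)
    x_ant = (t * x_bar + br1(A, y_bar)) / (t + 1)
    x_next, y_next = br1(A, y_ant), br2(A, x_ant)
    # PFP bookkeeping from the proof: V(t) = t A y_bar, E_V(t) = A BR^2_A(x_bar).
    V, E_V = t * (A @ y_bar), A @ br2(A, x_bar)
    assert np.allclose(V + E_V, (t + 1) * (A @ y_ant))   # the identity derived above
    assert np.max(np.abs(E_V)) <= a + 1e-12              # perturbation bounded by a
    # Hence argmax(V + E_V) is exactly the row AFP plays next, as required.
    x_bar = (t * x_bar + x_next) / (t + 1)
    y_bar = (t * y_bar + y_next) / (t + 1)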

B.1 Convergence rate of perturbed fictitious play

Given a 2p0s matrix game with payoff matrix Am×nA\in\mathbb{R}^{m\times n}, write t(ϵ;m,n)t^{*}(\epsilon;m,n) to denote the value of tt^{*} given by Lemma 4 such that

maxV(t)minU(t)<12ϵt for tt.\displaystyle\max V(t)-\min U(t)<\tfrac{1}{2}\epsilon t\text{ for }t\geq t^{*}. (7)

We have that t^{*}(\epsilon;1,1)=1. So, by the inductive step of the proof of Lemma 4, t^{*}(\epsilon;2,1)=t^{*}(\epsilon;1,2)=\tfrac{16a}{\epsilon}, which then implies that t^{*}(\epsilon;3,1)=t^{*}(\epsilon;2,2)=t^{*}(\epsilon;1,3)=(\tfrac{16a}{\epsilon})^{2}. Continuing inductively, we see that t^{*}(\epsilon;m,n)=(\tfrac{16a}{\epsilon})^{m+n-2}. Substituting into (7) and rearranging terms gives that

maxV(t)minU(t)t<ϵ for t(8aϵ)m+n2.\displaystyle\frac{\max V(t)-\min U(t)}{t}<\epsilon\text{ for }t\geq(\tfrac{8a}{\epsilon})^{m+n-2}.

Choosing ϵt=8at1/(m+n2)\epsilon_{t}=\frac{8a}{t^{1/(m+n-2)}} for each tt gives the result.
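Explicitly, the last step solves the threshold for \epsilon: setting t=(\tfrac{8a}{\epsilon})^{m+n-2} and rearranging gives

\displaystyle\epsilon=\frac{8a}{t^{1/(m+n-2)}},

so for every t, taking \epsilon=\epsilon_{t} as above yields \frac{\max V(t)-\min U(t)}{t}<\epsilon_{t}=8a\,t^{-1/(m+n-2)}, i.e., a convergence rate of order t^{-1/(m+n-2)}.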

Appendix C Proof of Proposition 2

For the first two subsections, we restate the definition of \Delta_{t}, t\in\mathbb{N}_{0} from (2): \Delta_{0}=[0,\dots,0]^{\intercal}\in\mathbb{Z}^{n}, \Delta_{t}=tC^{n}\,\overline{x}_{t} for each t\in\mathbb{N}, and i_{t} is the index played by FP (respectively, AFP) at time t, so

\displaystyle\Delta_{t+1,j}=\begin{cases}\Delta_{t,j}-1&\text{if }j=i_{t}-1\mod n;\\ \Delta_{t,j}+1&\text{if }j=i_{t}+1\mod n;\\ \Delta_{t,j}&\text{otherwise;}\end{cases}\qquad(2)

for each t0t\in\mathbb{N}_{0} and j{1,,n}j\in\{1,\dots,n\}.

C.1 FP: CnC^{n}

We must show that \max\Delta_{t}=\Omega_{p}(\sqrt{t}) under random tiebreaking. Throughout, whenever performing arithmetic with indices, that arithmetic is done modulo n. As in the body of the paper, define t_{m}=\inf\{t\in\mathbb{N}_{0}:\max\Delta_{t}=m\} and note the Markov inequality bound:

P(maxΔt<m)=P(tm>t)E(tm)/t=1tk=0m1E(tk+1tk).\displaystyle P(\max\Delta_{t}<m)=P(t_{m}>t)\leq E(t_{m})/t=\frac{1}{t}\sum_{k=0}^{m-1}E(t_{k+1}-t_{k}).

The bulk of the argument is in finding a bound for E(tk+1tk)E(t_{k+1}-t_{k}).

It follows by the definition of Δt\Delta_{t} and the FP update that itargmaxΔti_{t}\in\arg\max\Delta_{t}. Index i1i_{1} may be chosen arbitrarily, but for t>1t>1, it follows from (2) that

i(t+1){{it}if Δt,it>Δt,it+1;{it,it+1}if Δt,it=Δt,it+1;{it+1}if Δt,it<Δt,it+1;\displaystyle i_{(t+1)}\in\begin{cases}\{i_{t}\}&\text{if }\Delta_{t,i_{t}}>\Delta_{t,i_{t}+1};\\ \{i_{t},i_{t}+1\}&\text{if }\Delta_{t,i_{t}}=\Delta_{t,i_{t}+1};\\ \{i_{t}+1\}&\text{if }\Delta_{t,i_{t}}<\Delta_{t,i_{t}+1};\end{cases}

because the value that is incremented at time tt is the value adjacent to index iti_{t}, Δt,it+1\Delta_{t,i_{t}+1}. Let τ1=1\tau_{1}=1 and inductively define, for \ell\in\mathbb{N}, τ+1=inf{t>τ:Δt,it=Δt,it+1}\tau_{\ell+1}=\inf\{t>\tau_{\ell}:\Delta_{t,i_{t}}=\Delta_{t,i_{t}+1}\} to be the next time at which there are two possible choices for it+1i_{t+1}. This is depicted in Table 4, writing m=maxΔτm=\max\Delta_{\tau_{\ell}} and m{m,m+1}m^{\prime}\in\{m,m+1\}.

Table 4: The process of incrementing the index played under FP on C^{n}. Each column is one entry of \Delta; the middle row gives the net change applied to that entry between \tau_{\ell} and \tau_{\ell+1}.
\Delta_{\tau_{\ell}}     = [ …,  ≤0,    m,    m,   ≤0,  ≤0, … ]
net change                      −1,   −a,   +1,   +a
\Delta_{\tau_{\ell+1}}   = [ …,  ≤0,  m−a,   m′,   m′,  ≤0, … ]

As shown in the table, all entries of \Delta_{t} other than the two maximum values must be nonpositive at each t=\tau_{\ell}. This follows by induction, since it holds for \tau_{1} (\Delta_{\tau_{1}} has one positive entry) and, if it is true for some \ell\in\mathbb{N}, then in order to progress to \tau_{\ell+1} we must add some number a\geq m (over the course of a timesteps) to the next entry, which by (2) means we subtract a from the previous entry m, leaving m-a\leq 0. Finally, note that between \tau_{\ell} and \tau_{\ell+1} either we will have incremented the max from m to m+1 (if i_{\tau_{\ell}}=i_{\tau_{\ell}-1}), or we will not have (if i_{\tau_{\ell}}=i_{\tau_{\ell}-1}+1), in which case the max remains at m until some further number of increments of the index played occur. It is only on these timesteps that the maximum value can increment.

Based on this reasoning, we know that for any k there must exist \ell(k) such that t_{k}=\tau_{\ell(k)}+1. Consider the random variable \ell(k+1)-\ell(k), the number of times the index played could have been incremented while the maximum equals k. Under uniform random tiebreaking, we have that \ell(k+1)-\ell(k)\sim\text{Geometric}(1/2), since at each \tau_{\ell} there is a 1/2 chance of incrementing (“success”) or not incrementing (“failure”). So, E[\ell(k+1)-\ell(k)]=2. Now suppose that we had the bound \tau_{\ell+1}-\tau_{\ell}\leq d\max\Delta_{\tau_{\ell}} for some d>0. That would imply that

tk+1tk=τ(k+1)τ(k)=r=(k)(k+1)1τr+1τr\displaystyle t_{k+1}-t_{k}=\tau_{\ell(k+1)}-\tau_{\ell(k)}=\sum_{r=\ell(k)}^{\ell(k+1)-1}\tau_{r+1}-\tau_{r} r=(k)(k+1)1dmaxΔtk+1\displaystyle\leq\sum_{r=\ell(k)}^{\ell(k+1)-1}d\max\Delta_{t_{k+1}}
=d[(k+1)(k)](k+1).\displaystyle=d[\ell(k+1)-\ell(k)](k+1).

Taking the expectation of both sides, we get E(tk+1tk)2d(k+1)E(t_{k+1}-t_{k})\leq 2d(k+1). Plugging this into the Markov bound is sufficient to finish the argument, as explained in the proof sketch for Proposition 2 (in the paper).

All that remains is to show that τ+1τdmaxΔτ=dm\tau_{\ell+1}-\tau_{\ell}\leq d\max\Delta_{\tau_{\ell}}=dm. From the argument depicted in Table 4, we know τ+1τa+1\tau_{\ell+1}-\tau_{\ell}\leq a+1, and that am+1minΔτa\leq m+1-\min\Delta_{\tau_{\ell}}. Because Δτ\Delta_{\tau_{\ell}} has only two positive entries and i=1nΔt,i=0\sum_{i=1}^{n}\Delta_{t,i}=0, we have minΔτ2m\min\Delta_{\tau_{\ell}}\geq-2m, so τ+1τ3m+25m\tau_{\ell+1}-\tau_{\ell}\leq 3m+2\leq 5m, concluding the proof.
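The following simulation sketch (ours; the values of n, the horizon, and the seed are illustrative) implements the \Delta recursion (2) with i_{t}\in\arg\max\Delta_{t} and uniform random tiebreaking, and tracks \max\Delta_{t}, the quantity the argument above bounds from below at rate \sqrt{t}:

import numpy as np

def simulate_fp_on_cycle(n=20, T=100_000, seed=0):
    rng = np.random.default_rng(seed)
    delta = np.zeros(n, dtype=int)            # Delta_0 = [0, ..., 0]
    max_trace = np.zeros(T, dtype=int)
    for t in range(T):
        maximizers = np.flatnonzero(delta == delta.max())
        i = rng.choice(maximizers)            # i_t in argmax Delta_t, random tiebreak
        delta[(i - 1) % n] -= 1               # entry i_t - 1 (mod n) decreases by 1
        delta[(i + 1) % n] += 1               # entry i_t + 1 (mod n) increases by 1
        max_trace[t] = delta.max()
    return max_trace

trace = simulate_fp_on_cycle()
for t in (10**3, 10**4, 10**5 - 1):
    # If max Delta_t = Omega_p(sqrt(t)), this ratio stays bounded away from zero.
    print(t, trace[t], trace[t] / np.sqrt(t))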

C.2 AFP: CnC^{n}, n=3,4n=3,4

For n=3,4 we show that \max\Delta_{t}<3 for all t, which proves the result.

Based on the AFP update, it is impossible to reach \max\Delta_{t}=m+1 unless there are at least two non-adjacent entries of \Delta_{t-1} equal to m with an entry equal to m-1 in between. Otherwise, the two-step nature of the AFP update will not allow an m to be incremented to m+1. However, for n=3 and m=2 it is impossible to have two non-adjacent m’s with an m-1 in between, because the entries of \Delta_{t-1} sum to 0. Furthermore, in the n=4 case it must be that \Delta_{t}=[a,b,-a,-b] for some a and b at each t, by (2), so there cannot be three positive entries in this case either.

C.3 FP: TnT^{n}

Assume without loss of generality that x_{1}=e_{1}. Let \tau_{k}=\min\{t:x_{t}=e_{k}\} and note that the form of T^{n} implies that the indices played by FP must be nondecreasing and can increase by at most 1 at a time. We argue by strong induction that \tau_{k+1}-\tau_{k}>\tau_{k}-\tau_{k-1} for each k<n. Checking the first few terms, we have

τ1=1, so\displaystyle\tau_{1}=1,\text{ so } Tnx¯1=n1[0,n,0,,0], so\displaystyle T^{n}\overline{x}_{1}=n^{-1}[0,n,0,\dots,0]^{\intercal},\text{ so}
τ2=2, so\displaystyle\tau_{2}=2,\text{ so } Tnx¯2=21n1[n,n,n1,,0], so\displaystyle T^{n}\overline{x}_{2}=2^{-1}n^{-1}[-n,n,n-1,\dots,0]^{\intercal},\text{ so}

x3=e2x_{3}=e_{2}, and therefore τ3>3\tau_{3}>3, so τ3τ2>τ2τ1=1\tau_{3}-\tau_{2}>\tau_{2}-\tau_{1}=1. Now assume that for some fixed k<nk<n and all k{1,,k}k^{\prime}\in\{1,\dots,k\} that τk+1τk>τkτk1\tau_{k^{\prime}+1}-\tau_{k^{\prime}}>\tau_{k^{\prime}}-\tau_{k^{\prime}-1}. Note that

x¯τ(k+1)1\displaystyle\overline{x}_{\tau_{(k+1)}-1} =[τ2τ1,,τk+1τk,0,0,,0], so,\displaystyle=[\tau_{2}-\tau_{1},\dots,\tau_{k+1}-\tau_{k},0,0,\dots,0]^{\intercal},\text{ so,}
Tnx¯τ(k+1)1\displaystyle T^{n}\overline{x}_{\tau_{(k+1)}-1} [<,<,,<,(τk+1τk)(nk+1),0,,0], and\displaystyle\propto[<,<,\dots,<,(\tau_{k+1}-\tau_{k})(n-k+1),0,\dots,0]^{\intercal},\text{ and}
Tnx¯\displaystyle T^{n}\overline{x}_{\ell} [<,<,,<,(τk+1τk)(nk+1),(τk+1+1)(nk),,0],\displaystyle\propto[<,<,\dots,<,(\tau_{k+1}-\tau_{k})(n-k+1),(\ell-\tau_{k+1}+1)(n-k),\dots,0]^{\intercal},

for {τk+1,,τk+21}\ell\in\{\tau_{k+1},\dots,\tau_{k+2}-1\}, where the ‘<<’ signs indicate values that are no greater than their neighbors on the right; this holds by the inductive hypothesis and definition of TnT^{n}. We know that for steps τk+1,τk+21\tau_{k+1},\dots\tau_{k+2}-1, FP will play ek+1e_{k+1}, and we know that τk+2\tau_{k+2} is the first timestep at which (τk+2τk+1)(nk)(τk+1τk)(nk+1)(\tau_{k+2}-\tau_{k+1})(n-k)\geq(\tau_{k+1}-\tau_{k})(n-k+1) (or else FP would have played k+1k+1 at τk+2\tau_{k+2}, a contradiction). It follows that τk+2τk+1>τk+1τk\tau_{k+2}-\tau_{k+1}>\tau_{k+1}-\tau_{k}, as desired.

This result implies that \tau_{k+1}-\tau_{k}\geq k for each k<n, so we have \tau_{k}\geq 1+\sum_{j=1}^{k-1}j=1+k(k-1)/2\geq(k-1)^{2}/2 for each k. Inverting this, we get that t\mapsto\sqrt{2t}+1 is an upper bound on k(t), the index played by FP at time t. Combining the expression for T^{n}\overline{x}_{\ell}, \ell\in\{\tau_{k+1},\dots,\tau_{k+2}-1\}, with this bound, we get that \max\,T^{n}\overline{x}_{t}=n^{-1}t^{-1}(\tau_{k(t)+1}-\tau_{k(t)})(n-k(t)+1)\geq n^{-1}t^{-1}(n-k(t)+1)\geq n^{-1}t^{-1}(n-\sqrt{2t})=\Omega(1/\sqrt{t}).

C.4 AFP: TnT^{n}

We argue first by strong induction that xt=emin(t,n)x_{t}=e_{\min(t,n)} for all tt. Assume without loss of generality that x1=e1x_{1}=e_{1}. Now assume that, for some fixed τ\tau, xt=emin(t,n)x_{t}=e_{\min(t,n)} for tτt\leq\tau.

If τ<n\tau<n, then under the inductive hypothesis,

Tnx¯τ\displaystyle T^{n}\overline{x}_{\tau} =τ1Tn[1,1,,1τ, 0,,0]\displaystyle=\tau^{-1}T^{n}\;[\overbrace{1,1,\dots,1}^{\tau},\;0,\dots,0]^{\intercal}
=τ1n1[n,1,,(nτ+2),(nτ+1),,0],\displaystyle=\tau^{-1}n^{-1}[-n,1,\dots,(n-\tau+2),\;(n-\tau+1),\dots,0]^{\intercal},

for which the largest value is at index τ\tau, so x=eτx^{\prime}=e_{\tau}, so then Tnx¯τT^{n}\overline{x}^{\prime}_{\tau} will have largest value 2n1(nτ+1)2n^{-1}(n-\tau+1) at index τ+1\tau+1, so xτ+1=eτ+1x_{\tau+1}=e_{\tau+1}. If τn\tau\geq n, then under the inductive hypothesis, Tnx¯τ=τ1Tn[1,1,,1,τn+1]=τ1n1[n,1,1,,1,12(τn),2]T^{n}\overline{x}_{\tau}=\tau^{-1}T^{n}[1,1,\dots,1,\tau-n+1]^{\intercal}=\tau^{-1}n^{-1}[-n,1,1,\dots,1,1-2(\tau-n),2]^{\intercal}, at which point xτ=xτ+1=enx^{\prime}_{\tau}=x_{\tau+1}=e_{n}, as desired.

Finally, we are interested in \max T^{n}\overline{x}_{t} for t<n, which, by the calculation above, equals \frac{n-t+2}{nt}=O(1/t).

Appendix D Additional figures

Figure 8: Comparisons of FP and AFP on C20C^{20} and T20T^{20} with random tiebreaking. As before, highlighted regions indicate 10th and 90th percentiles across 10,000 runs.
Figure 9: Comparison of FP with versions of AFP initialized with different numbers of steps of FP, based on worst case performance, as in Figure 3. When there is an exact tie, credit is split evenly, resulting in a solid line at 0.5 for FP compared with itself.

Appendix E RL experiment hyperparameters

APPO hyperparameters              TinyFighter      RWS
Discount factor γ                 0.99             0.995
Value function coefficient        0.5              0.5
Entropy coefficient               0.01             0.02
Learning rate                     3e-4             1.5e-4
Gradient clipping                 10               80
Batch size                        5,120            6,000
Number of workers                 80               319
Rollout fragment length           64               64
Other                             Ray defaults (as of Nov 2022) (Moritz et al., 2018)
Running With Scissors Environment
Reward shaping                    Policy t=1 played against a random policy and received a unit of reward for each ‘scissors’ picked up and a negative unit of reward for each ‘rock’ and ‘paper’ picked up.
Reward scaling (t=1)              10
Reward scaling (t>1)              100
Sprite size                       3x3 pixels
Agent field of view               5x5 grid units (15x15 pixels)
Agent memory                      4 frames
Table 5: Population neural network \Pi_{\theta}(t) architecture for TinyFighter. The network takes as inputs a 13-dimensional state vector, an action mask, and the NeuPL opponent distribution and opponent index (see the Name and Input columns). The Shape column does not include the batch dimension. We use ReLU activation functions except for the output layers, which use the identity function (‘linear’). We used the Keras library (Chollet et al., 2015) to implement the model.
Name Type Shape # Parameters Input
state InputLayer [13] 0
dense Dense [128] 1,792 state
dense_1 Dense [128] 16,512 dense
neupl_opponent_dist InputLayer [4] 0
policy_head_inputs Concatenate [145] 0 state, dense_1, neupl_opponent_dist
neupl_opponent_idx InputLayer [5] 0
dense_2 Dense [256] 37,376 policy_head_inputs
value_head_inputs Concatenate [150] 0 state, dense_1, neupl_opponent_dist, neupl_opp…
dense_3 Dense [128] 32,896 dense_2
dense_6 Dense [256] 38,656 value_head_inputs
dense_4 Dense [64] 8,256 dense_3
action_mask InputLayer [4] 0
dense_7 Dense [128] 32,896 dense_6
dense_5 Dense [4] 260 dense_4
lambda Lambda [4] 0 action_mask
dense_8 Dense [64] 8,256 dense_7
policy_out Add [4] 0 dense_5, lambda
value_out Dense [1] 65 dense_8
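For reference, here is a Keras sketch that reproduces the layer sizes and wiring in Table 5. Layer names, shapes, and connections follow the table; the masking logic inside the Lambda layer (adding a large negative value to the logits of unavailable actions) is our assumption, since the table records only that layer’s type and shape.

from tensorflow import keras
from tensorflow.keras import layers

def build_tinyfighter_network():
    state = keras.Input(shape=(13,), name="state")
    opp_dist = keras.Input(shape=(4,), name="neupl_opponent_dist")
    opp_idx = keras.Input(shape=(5,), name="neupl_opponent_idx")
    action_mask = keras.Input(shape=(4,), name="action_mask")

    # Shared torso: two 128-unit ReLU layers on the raw state.
    x = layers.Dense(128, activation="relu", name="dense")(state)
    x = layers.Dense(128, activation="relu", name="dense_1")(x)

    # Policy head sees the state, torso features, and opponent distribution.
    policy_in = layers.Concatenate(name="policy_head_inputs")([state, x, opp_dist])
    p = layers.Dense(256, activation="relu", name="dense_2")(policy_in)
    p = layers.Dense(128, activation="relu", name="dense_3")(p)
    p = layers.Dense(64, activation="relu", name="dense_4")(p)
    logits = layers.Dense(4, activation="linear", name="dense_5")(p)

    # Assumed masking: push logits of unavailable actions toward -inf.
    mask_penalty = layers.Lambda(lambda m: (1.0 - m) * -1e9, name="lambda")(action_mask)
    policy_out = layers.Add(name="policy_out")([logits, mask_penalty])

    # Value head additionally sees the opponent index.
    value_in = layers.Concatenate(name="value_head_inputs")([state, x, opp_dist, opp_idx])
    v = layers.Dense(256, activation="relu", name="dense_6")(value_in)
    v = layers.Dense(128, activation="relu", name="dense_7")(v)
    v = layers.Dense(64, activation="relu", name="dense_8")(v)
    value_out = layers.Dense(1, activation="linear", name="value_out")(v)

    return keras.Model(inputs=[state, opp_dist, opp_idx, action_mask],
                       outputs=[policy_out, value_out])

The parameter counts implied by this wiring match the “# Parameters” column of the table.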
Table 6: Population neural network \Pi_{\theta}(t) architecture for RWS. The neural network takes as inputs: an image array, the ratios of rock, paper, and scissors in the player’s inventory, and an agent index t which gets mapped to a vector in \mathbb{R}^{128} by an embedding layer. The Shape column does not include the batch dimension. We use ReLU activation functions except for the output layers, which use the identity function (‘linear’). We used the Keras library (Chollet et al., 2015) to implement the model.
Name Type Shape # Parameters Input
rgb_history InputLayer [4, 15, 15, 3] 0
rescaling Rescaling [4, 15, 15, 3] 0 rgb_history
conv2d Conv2D [4, 5, 5, 32] 896 rescaling
activation Activation [4, 5, 5, 32] 0 conv2d
conv2d_1 Conv2D [4, 5, 5, 16] 528 activation
activation_1 Activation [4, 5, 5, 16] 0 conv2d_1
reshape Reshape [4, 400] 0 activation_1
inv_ratios_history InputLayer [4, 3] 0
concatenate Concatenate [4, 403] 0 reshape, inv_ratios_history
dense Dense [4, 256] 103,424 concatenate
dense_1 Dense [4, 256] 65,792 dense
lstm LSTM [256] 525,312 dense_1
neupl_opponent_dist InputLayer [11] 0
neupl_opponent_idx InputLayer [12] 0
policy_head_inputs Concatenate [267] 0 lstm, neupl_opponent_dist
value_head_inputs Concatenate [279] 0 lstm, neupl_opponent_dist, neupl_opponent_idx
policy_1 Dense [512] 137,216 policy_head_inputs
value_1 Dense [512] 143,360 value_head_inputs
policy_2 Dense [256] 131,328 policy_1
value_2 Dense [256] 131,328 value_1
policy_3 Dense [128] 32,896 policy_2
value_3 Dense [128] 32,896 value_2
policy_out Dense [30] 3,870 policy_3
value_out Dense [1] 129 value_3

Appendix F Vanilla RL implementations of FP and AFP

Algorithm 2 gives a simple reinforcement learning implementation of FP and AFP that does not rely on Neural Population Learning or the reservoir buffer sampling of Heinrich et al., (2015). Notes:

  • i-i refers to the opponent of ii.

  • ReinforcementLearning(π,μ)\textsc{ReinforcementLearning}(\pi,\mu) plays π\pi against μ\mu for some number of episodes, gathers data, and updates π\pi by reinforcement learning.

  • Lines 8 and 9 are the approximate reinforcement learning analogue to computing a best response to the average of all previous strategies. For details, see Heinrich et al., (2015), which uses a more complicated setup in order to limit storage requirements (constantly learning β\beta by supervised learning).

Algorithm 2 FP/AFP with reinforcement learning
1:Choose setting: FP or AFP.
2:Initialize policies π11\pi^{1}_{1} and π12\pi^{2}_{1}.
3:Initialize policy stores Π1={π11},Π2={π12}\Pi^{1}=\{\pi^{1}_{1}\},\Pi^{2}=\{\pi^{2}_{1}\}.
4:for t=1,2,t=1,2,\dots do
5:     Initialize πt+11,πt+12\pi^{1}_{t+1},\pi^{2}_{t+1}.
6:     for i=1,2i=1,2 do
7:         while per-timestep compute budget remains do
8:              Sample βUniform(Πi)\beta\sim\text{Uniform}(\Pi^{-i}).
9:              πt+1iReinforcementLearning(πt+1i,β)\pi^{i}_{t+1}\leftarrow\textsc{ReinforcementLearning}(\pi^{i}_{t+1},\beta).
10:         end while
11:         ΠiΠi{πt+1i}\Pi^{i}\leftarrow\Pi^{i}\cup\{\pi^{i}_{t+1}\}.
12:     end for
13:     if setting is AFP and tt is odd then
14:         Π1Π1{πt1}\Pi^{1}\leftarrow\Pi^{1}\setminus\{\pi^{1}_{t}\} and Π2Π2{πt2}\Pi^{2}\leftarrow\Pi^{2}\setminus\{\pi^{2}_{t}\}.
15:     end if
16:end for
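For concreteness, here is a minimal Python skeleton of Algorithm 2’s population bookkeeping. ReinforcementLearning and the policy objects are stubs (the actual experiments use the APPO setup of Appendix E), and a fixed per-iteration update count stands in for the compute-budget check on line 7.

import random

def reinforcement_learning(policy, opponent):
    # Stub: play `policy` against `opponent`, gather data, apply RL updates (line 9).
    return policy

def make_policy():
    # Stub: initialize a fresh policy (lines 2 and 5).
    return object()

def fp_afp_rl(setting="AFP", iterations=10, updates_per_iteration=100):
    populations = {1: [make_policy()], 2: [make_policy()]}       # Pi^1, Pi^2 (line 3)
    for t in range(1, iterations + 1):
        new_policies = {1: make_policy(), 2: make_policy()}      # pi^i_{t+1} (line 5)
        for i in (1, 2):
            opponents = populations[2 if i == 1 else 1]          # Pi^{-i}
            for _ in range(updates_per_iteration):               # compute budget (line 7)
                beta = random.choice(opponents)                  # beta ~ Uniform(Pi^{-i}) (line 8)
                new_policies[i] = reinforcement_learning(new_policies[i], beta)
            populations[i].append(new_policies[i])               # line 11
        if setting == "AFP" and t % 2 == 1:                      # lines 13-15
            for i in (1, 2):
                populations[i].pop(-2)                           # discard pi^i_t
    return populations

populations = fp_afp_rl(setting="AFP", iterations=6)
print(len(populations[1]), len(populations[2]))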