
Anticipatory Fictitious Play

Alex Cloud, Albert Wang, Wesley Kerr
Riot Games Technology Research, Los Angeles, CA
Affiliations refer to where this work was carried out and are no longer current. Corresponding author: Albert Wang, email: [email protected].
(December 2, 2022)
Abstract

Fictitious play is an algorithm for computing Nash equilibria of matrix games. Recently, machine learning variants of fictitious play have been successfully applied to complicated real-world games. This paper presents a simple modification of fictitious play which is a strict improvement over the original: it has the same theoretical worst-case convergence rate, is equally applicable in a machine learning context, and enjoys superior empirical performance. We conduct an extensive comparison of our algorithm with fictitious play, proving an optimal $O(t^{-1})$ convergence rate for certain classes of games, demonstrating superior performance numerically across a variety of games, and concluding with experiments that extend these algorithms to the setting of deep multiagent reinforcement learning.

1 Introduction

Matrix games (also known as normal-form games) are an abstract model for interactions between multiple decision makers. Fictitious play (FP) (Brown, 1951) is a simple algorithm for two-player matrix games. In FP, each player starts by playing an arbitrary strategy, then proceeds iteratively by playing the best strategy against the empirical average of what the other has played so far. In some cases, such as two-player, zero-sum games, the empirical average strategies will converge to a Nash equilibrium.

Although there are more efficient algorithms for computing Nash equilibria in matrix games (Adler, 2013; Shoham and Leyton-Brown, 2008), there are a few reasons why fictitious play remains a topic of interest. First, it serves as a model for how humans might arrive at Nash equilibria in real-world interactions (Luce and Raiffa, 1989; Brown, 1951; Conlisk, 1993a). Second, FP is extensible to real-world games which are large and complicated. Our work is primarily motivated by the latter application.

The initial step towards extending FP to real-world games was taken by Kuhn (1953), which established the equivalence of normal-form games (represented by matrices) and extensive-form games (represented by trees with additional structure). Loosely speaking, this means that results which apply for matrix games may also apply to much more complicated decision making problems, such as ones that incorporate temporal elements or varying amounts of hidden information. Leveraging this equivalence, Heinrich et al. (2015) proposed an extension of FP to the extensive-form setting, full-width extensive-form fictitious play (XFP), and proved that it converges to a Nash equilibrium in two-player, zero-sum games. Heinrich et al. (2015) also proposed Fictitious Self Play (FSP), a machine learning approximation to XFP. In contrast to XFP, which is intractable for real-world games whose states cannot be enumerated in practice, FSP relies only on basic operations which can be approximated in a machine learning setting, like averaging (via supervised learning) and computing best responses (via reinforcement learning). In this way, FSP provides a version of fictitious play suitable for arbitrarily complex two-player, zero-sum games. Not long after the introduction of FSP, Lanctot et al. (2017) presented Policy Space Response Oracles (PSRO), a general framework for fictitious-play-like reinforcement learning algorithms in two-player, zero-sum games. These ideas were employed as part of the groundbreaking AlphaStar system that defeated professional players at StarCraft II (Vinyals et al., 2019).

We introduce anticipatory fictitious play (AFP), a simple variant of fictitious play which is also reinforcement-learning-friendly. In contrast to FP, where players iteratively update to exploit an estimate of the opponent’s strategy, players in AFP update proactively to respond to the strategy that the opponent would use to exploit them.

We prove that AFP is guaranteed to converge to a Nash equilibrium in two-player zero-sum games and establish an optimal convergence rate for two classes of games that are of particular interest in learning for real-world games (Balduzzi et al., 2019): a class of “cyclic” games and a class of “transitive” games. Numerical comparisons suggest that AFP eventually outperforms FP on virtually any game, and that its advantage over FP grows as games get larger. Finally, we propose a reinforcement learning version of AFP that is implementable as a one-line modification of an RL implementation of FP, such as FSP. These algorithms are applied to Running With Scissors (Vezhnevets et al., 2020), a stochastic, competitive multiagent environment with cyclic dynamics.

1.1 Related work

Aside from the literature on fictitious play and its extension to reinforcement learning, there has been substantial work on “opponent-aware” learning algorithms. These algorithms incorporate information about opponent updates and are quite similar to anticipatory fictitious play.

In the context of evolutionary game theory, Conlisk (1993a) proposed an “extrapolation process,” whereby two players in a repeated game each forecast their opponents’ strategies and respond to those forecasts. Unlike AFP, where opponent responses are explicitly calculated, the forecasts are made by linear extrapolation based on the change in the opponent’s strategy over the last two timesteps. Conlisk (1993b) proposed two types of “defensive adaptation,” which are quite similar in spirit to AFP but differ in some important details; most importantly, while they consider the opponent’s empirical payoffs at each step, they do not respond directly to what the opponent is likely to play given those payoffs.

Shamma and Arslan (2005) proposed derivative action fictitious play, a variant of fictitious play in the continuous-time setting in which a best response to a forecasted strategy is played, as in Conlisk (1993a). The algorithm uses a derivative-based forecast that is analogous to the discrete-time anticipated response of AFP. However, their convergence results rely on a fixed, positive entropy bonus that incentivizes players to play more randomly, and they do not consider the discrete-time case.

Zhang and Lesser (2010) proposed Infinitesimal Gradient Ascent with Policy Prediction, in which two policy gradient learning algorithms continuously train against a forecast of the other’s policy. Their algorithm represents the core idea of AFP, albeit implemented in a different setting. However, their proof of convergence is limited to $2\times 2$ games. Foerster et al. (2018) and Letcher et al. (2018) take this idea further, modifying the objective of a reinforcement learning agent so that it accounts for how changes in the agent will change the anticipated learning of the other agents. This line of research is oriented more towards equilibrium finding in general-sum games (e.g., social dilemmas), and less towards efficient estimation of equilibria in strictly competitive two-player environments.

2 Preliminaries

A (finite) two-player zero-sum game (2p0s game) is represented by a matrix $A\in\mathbb{R}^{m\times n}$, so that when player 1 plays $i$ and player 2 plays $j$, the players observe payoffs $(A_{i,j},-A_{i,j})$, respectively. Let $\Delta^{k}\subset\mathbb{R}^{k}$ be the set of probability vectors representing distributions over $\{1,\dots,k\}$. Then a strategy for player 1 is an element $x\in\Delta^{m}$ and, similarly, a strategy for player 2 is an element $y\in\Delta^{n}$.

A Nash equilibrium in a 2p0s game $A$ is a pair of strategies $(x^{*},y^{*})$ such that each strategy is optimal against the other, i.e.,

\displaystyle x^{*}\in\underset{x\in\Delta^{m}}{\arg\max}\;x^{\intercal}Ay^{*}\quad\text{and}\quad y^{*}\in\underset{y\in\Delta^{n}}{\arg\min}\;(x^{*})^{\intercal}Ay.

The Nash equilibrium represents a pair of strategies that are “stable” in the sense that no player can earn a higher payoff by changing their strategy. At least one Nash equilibrium is guaranteed to exist in any finite game (Nash Jr,, 1950).

Nash equilibria in 2p0s games enjoy a nice property not shared by Nash equilibria in general: in 2p0s games, if $(x_{1},y_{1})$ and $(x_{2},y_{2})$ are Nash equilibria, then $(x_{2},y_{1})$ is a Nash equilibrium. In a 2p0s game, we define a Nash strategy to be one that occurs as part of some Nash equilibrium. Note that the aforementioned property does not hold in general, so normally it is only valid to describe collections of strategies (one per player) as equilibria.

A solution to a 2p0s game $A$ is a pair of strategies $(x^{*},y^{*})$ such that

\displaystyle\min_{y\in\Delta^{n}}(x^{*})^{\intercal}Ay\geq(x^{*})^{\intercal}Ay^{*}\geq\max_{x\in\Delta^{m}}x^{\intercal}Ay^{*}.

We call $v^{*}=(x^{*})^{\intercal}Ay^{*}$, which is unique, the value of the game. Nash equilibria are equivalent to solutions of 2p0s games (Shoham and Leyton-Brown, 2008), which is why we use the same notation. Finally, the exploitability of a strategy is the difference between the value of the game and the worst-case payoff of that strategy: the exploitability of $x\in\Delta^{m}$ is $v^{*}-\min x^{\intercal}A$, and the exploitability of $y\in\Delta^{n}$ is $\max Ay-v^{*}$.
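For concreteness, these quantities can be computed in a few lines; the sketch below uses NumPy and Rock Paper Scissors as an example (the function name and strategies are illustrative, not part of any particular implementation).

```python
import numpy as np

def exploitability(A, x, y, v_star):
    """Exploitability of row strategy x and column strategy y in the
    zero-sum game A, given the game's value v_star."""
    worst_case_x = np.min(x @ A)   # column player best-responds to x
    worst_case_y = np.max(A @ y)   # row player best-responds to y
    return v_star - worst_case_x, worst_case_y - v_star

# Rock Paper Scissors has value 0.
A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
x = np.array([0.5, 0.25, 0.25])   # a non-equilibrium row strategy
y = np.ones(3) / 3                # the equilibrium column strategy
print(exploitability(A, x, y, v_star=0.0))  # (0.25, 0.0)
```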

2.1 Fictitious play

Let $e_{1},e_{2},\dots$ denote the standard basis vectors in $\mathbb{R}^{m}$ or $\mathbb{R}^{n}$. Let $\texttt{BR}^{k}_{A}$ be the best response operator for player $k$, so that

\displaystyle(\forall y\in\Delta^{n})\quad\texttt{BR}^{1}_{A}(y)=\{e_{i}\in\mathbb{R}^{m}:i\in\arg\max Ay\};
\displaystyle(\forall x\in\Delta^{m})\quad\texttt{BR}^{2}_{A}(x)=\{e_{j}\in\mathbb{R}^{n}:j\in\arg\min x^{\intercal}A\}.

Fictitious play is given by the following process. Let $x_{1}=\overline{x}_{1}=e_{i}$ and $y_{1}=\overline{y}_{1}=e_{j}$ be initial strategies for some $i$, $j$. For each $t\in\mathbb{N}$, let

\displaystyle x_{t+1}\in\texttt{BR}^{1}_{A}(\overline{y}_{t});\qquad y_{t+1}\in\texttt{BR}^{2}_{A}(\overline{x}_{t});
\displaystyle\overline{x}_{t+1}=\frac{1}{t+1}\sum_{k=1}^{t+1}x_{k};\qquad\overline{y}_{t+1}=\frac{1}{t+1}\sum_{k=1}^{t+1}y_{k}.

In other words, at each timestep $t$, each player calculates the strategy that is the best response to their opponent’s average strategy so far. Robinson (1951) proved that the pair of average strategies $(\overline{x}_{t},\overline{y}_{t})$ converges to a solution of the game by showing that the exploitability of both strategies converges to zero.
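As a concrete illustration, a minimal NumPy sketch of this process follows; breaking ties via np.argmax (lowest index) is one choice among many, and the horizon and game are arbitrary examples.

```python
import numpy as np

def fictitious_play(A, T):
    """Run T iterations of fictitious play on the zero-sum game A.
    Returns the average strategies (x_bar, y_bar)."""
    m, n = A.shape
    x_bar, y_bar = np.eye(m)[0], np.eye(n)[0]   # arbitrary pure strategies
    for t in range(1, T):
        x_next = np.eye(m)[np.argmax(A @ y_bar)]   # BR^1 to y_bar
        y_next = np.eye(n)[np.argmin(x_bar @ A)]   # BR^2 to x_bar
        x_bar = (t * x_bar + x_next) / (t + 1)     # update running averages
        y_bar = (t * y_bar + y_next) / (t + 1)
    return x_bar, y_bar

A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)  # RPS
x_bar, y_bar = fictitious_play(A, T=1000)
print(x_bar, np.min(x_bar @ A))  # near [1/3, 1/3, 1/3]; worst case near 0
```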

Theorem 1.

(Robinson, 1951) If $\{(x_{t},y_{t})\}_{t\in\mathbb{N}}$ is an FP process for a 2p0s game with payoff matrix $A\in\mathbb{R}^{m\times n}$, then

\displaystyle\lim_{t\rightarrow\infty}\min\overline{x}_{t}^{\intercal}A=\lim_{t\rightarrow\infty}\max A\overline{y}_{t}=v^{*},

where $v^{*}$ is the value of the game. Furthermore, a bound on the rate of convergence is given by

\displaystyle\max A\overline{y}_{t}-\min\overline{x}_{t}^{\intercal}A=O(t^{-1/(m+n-2)})\text{ for all }t\in\mathbb{N},

where the implied constant depends on $a=\max_{i,j}A_{i,j}$.

(Robinson did not explicitly state the rate, but it follows directly from her proof, as noted in Daskalakis and Pan (2014) and explicated in our Appendix B.1.)

3 Anticipatory Fictitious Play

Although FP converges to a Nash equilibrium in 2p0s games, it may take an indirect path. For example, in Rock Paper Scissors with tiebreaking towards the minimum strategy index, the sequence of average strategies $\{\overline{x}_{1},\overline{x}_{2},\dots\}$ orbits the Nash equilibrium, slowly spiraling in with decreasing radius, as shown on the left in Figure 1. This tiebreaking scheme is not special; under random tiebreaking, the path traced by FP is qualitatively the same, resulting in slow convergence with high probability, as shown in Figure 2.

Figure 1: A visualization of the first 50 steps of FP $(\overline{x}_{1}^{\textnormal{FP}},\overline{x}_{2}^{\textnormal{FP}},\dots,\overline{x}_{50}^{\textnormal{FP}})$ and the first 25 steps of AFP $(\overline{x}_{1}^{\textnormal{AFP}},\overline{x}_{2}^{\textnormal{AFP}},\dots,\overline{x}_{25}^{\textnormal{AFP}})$ on Rock Paper Scissors. This corresponds to an equal amount of computation per algorithm (50 best responses). Ties between best response strategies were broken according to the ordering ‘Rock,’ ‘Paper,’ ‘Scissors.’ The Nash equilibrium is marked by a star. The shading indicates the exploitability of the strategy at that point, with darker colors representing greater exploitability.
Figure 2: Comparison of FP and AFP performance ($\min\overline{x}_{t}^{\intercal}A$) on RPS with random tiebreaking. The highlighted region depicts the 10th and 90th percentiles across 10,000 runs. All variation is due to randomly sampled tiebreaking. The value of the game is $v^{*}=0$.

Given that FP appears to do an especially poor job of decreasing exploitability in the case above, we consider alternatives. Inspired by gradient descent, we ask whether there is a way to follow the gradient of exploitability towards the Nash equilibrium (without explicitly computing it, as is done in Lockhart et al. (2019)). By definition, the best response to the average strategy is a strategy that maximally exploits the average strategy. So, a natural choice of update to reduce exploitability is to move the average strategy in a direction that counters this best response.

To this end, we propose anticipatory fictitious play (AFP), a version of fictitious play that “anticipates” the best response an adversary might play against the current strategy, and then plays the best response to an average of that and the adversary’s current average strategy. (Simply responding directly to the opponent’s response does not work; see Appendix A.) Alternatively, one can think of AFP as a version of FP that “forgets” every other best response it calculates. This latter interpretation enables a convenient implementation of AFP as a modification of FP as demonstrated in Algorithm 2.

Remark.

The AFP update is seemingly consistent with human psychology: it is quite intuitive to imagine how an adversary might try to exploit oneself and to respond in order to best counter that strategy. Given that fictitious play provides a model for how humans or other non-algorithmic decision makers might arrive at an equilibrium in practice (Luce and Raiffa, 1989; Brown, 1951), anticipatory fictitious play offers a new model for how this may occur. We leave further consideration of this topic to future work.

AFP is given by the following process. For some $i\in\{1,\dots,m\}$ and $j\in\{1,\dots,n\}$, let $x_{1}=\overline{x}_{1}=e_{i}$ and $y_{1}=\overline{y}_{1}=e_{j}$ be initial strategies for each player. For each $t\in\mathbb{N}$, define

\displaystyle x^{\prime}_{t+1}\in\texttt{BR}^{1}_{A}(\overline{y}_{t});\qquad y^{\prime}_{t+1}\in\texttt{BR}^{2}_{A}(\overline{x}_{t});
\displaystyle\overline{x}^{\prime}_{t+1}=\tfrac{t}{t+1}\overline{x}_{t}+\tfrac{1}{t+1}x^{\prime}_{t+1};\qquad\overline{y}^{\prime}_{t+1}=\tfrac{t}{t+1}\overline{y}_{t}+\tfrac{1}{t+1}y^{\prime}_{t+1};
\displaystyle x_{t+1}\in\texttt{BR}^{1}_{A}(\overline{y}^{\prime}_{t+1});\qquad y_{t+1}\in\texttt{BR}^{2}_{A}(\overline{x}^{\prime}_{t+1});
\displaystyle\overline{x}_{t+1}=\frac{1}{t+1}\sum_{k=1}^{t+1}x_{k};\qquad\overline{y}_{t+1}=\frac{1}{t+1}\sum_{k=1}^{t+1}y_{k}.\qquad(1)

Here, $x^{\prime}_{t+1}$ and $y^{\prime}_{t+1}$ are the best responses to the opponent’s average strategy; they are the strategies that FP would have played at the current timestep. In AFP, each player “anticipates” this attack and defends against it by computing the opponent’s average strategy including the attack ($\overline{x}^{\prime}_{t+1}$ and $\overline{y}^{\prime}_{t+1}$), and then playing the best response to this anticipated average strategy of the opponent.
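Update (1) translates directly into code. The sketch below mirrors the FP sketch in Section 2.1; only the anticipation step is new, and all names are illustrative.

```python
import numpy as np

def anticipatory_fictitious_play(A, T):
    """Run T iterations of AFP (update (1)) on the zero-sum game A."""
    m, n = A.shape
    x_bar, y_bar = np.eye(m)[0], np.eye(n)[0]
    for t in range(1, T):
        # Anticipate: the best responses FP would play right now.
        x_prime = np.eye(m)[np.argmax(A @ y_bar)]
        y_prime = np.eye(n)[np.argmin(x_bar @ A)]
        # Averages that include the anticipated attack.
        x_bar_prime = (t * x_bar + x_prime) / (t + 1)
        y_bar_prime = (t * y_bar + y_prime) / (t + 1)
        # Respond to the anticipated averages.
        x_next = np.eye(m)[np.argmax(A @ y_bar_prime)]
        y_next = np.eye(n)[np.argmin(x_bar_prime @ A)]
        x_bar = (t * x_bar + x_next) / (t + 1)
        y_bar = (t * y_bar + y_next) / (t + 1)
    return x_bar, y_bar

A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)  # RPS
x_bar, _ = anticipatory_fictitious_play(A, T=500)
print(np.min(x_bar @ A))  # worst-case payoff, close to the value 0
```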

In Figure 1, we see the effect of anticipation geometrically: AFP “cuts corners,” limiting the extent to which it overshoots its target. In contrast, FP aggressively overshoots, spending increasingly many steps playing strategies that take it further from its goal. The effect on algorithm performance is pronounced, with AFP hovering near equilibrium while FP slowly winds its way there.

Of course, RPS is a very specific example. It is natural to wonder: is AFP good in general? The rest of the paper seeks to answer that question. We begin by proving that AFP converges to a Nash equilibrium.

Proposition 1.

If $\{(x_{t},y_{t})\}$ is an AFP process for a 2p0s game with payoff matrix $A\in\mathbb{R}^{m\times n}$, the conclusion of Theorem 1 holds for this process. Namely, AFP converges to a Nash equilibrium, and it converges no slower than the rate that bounds FP.

Proof.

(Idea) Generalize the original proof of Theorem 1. We work with accumulating payoff vectors $U(t)=tA^{\intercal}\overline{x}_{t}$ and $V(t)=tA\overline{y}_{t}$. In the original proof, a player 1 strategy index $i\in\{1,\dots,m\}$ is called eligible at time $t$ if $i\in\arg\max V(t)$ (similarly for player 2). We replace eligibility with the notion of $E$-eligibility, satisfied by an index $i\in\arg\max[V(t)+E]$, for any $E\in\mathbb{R}^{m}$ with $\|E\|_{\infty}<\max_{i,j}|A_{i,j}|$. Essentially, an index is $E$-eligible if it corresponds to a best response to a perturbation of the opponent’s history $\overline{y}_{t}$ or a perturbation of the game itself. The original proof structure can be preserved in light of this replacement, requiring only minor modifications to some arguments and new constants. Treating the in-between strategies in AFP, $\overline{x}^{\prime}_{t}$ and $\overline{y}^{\prime}_{t}$, as perturbations of $\overline{x}_{t}$ and $\overline{y}_{t}$, it follows that AFP satisfies the conditions for the generalized result. A complete proof is given in Appendix B. ∎

4 Application to normal form games

Proposition 1 establishes that AFP converges and that AFP’s worst-case convergence rate satisfies the same bound as FP’s, where the worst-case is with respect to games and tiebreaking rules. The next proposition shows that for two classes of games of interest, AFP not only outperforms FP, but attains an optimal rate. In both classes, our proofs will reveal that AFP succeeds where FP fails because AFP avoids playing repeated strategies. The results hold for general applications of FP and AFP rather than relying on specific tiebreaking rules.

The classes of games that we analyze are intended to serve as abstract models of two fundamental aspects of real-world games: transitivity (akin to “skillfulness”: some ways of acting are strictly better than others) and nontransitivity (most notably, in the form of strategy cycles like Rock < Paper < Scissors < Rock). Learning algorithms for real-world games must reliably improve along the transitive dimension while accounting for the existence of strategy cycles; see Balduzzi et al. (2019) for further discussion.

For each integer $n\geq 3$, define payoff matrices $C^{n}$ and $T^{n}$ by

C^{n}_{i,j}=\begin{cases}1&\text{if }i=j+1\mod n;\\ -1&\text{if }i=j-1\mod n;\\ 0&\text{otherwise;}\end{cases}\qquad\text{and}\qquad T^{n}_{i,j}=\begin{cases}(n-i+2)/n&\text{if }i=j+1;\\ -(n-i+2)/n&\text{if }i=j-1;\\ 0&\text{otherwise,}\end{cases}

for $i,j\in\{1,\dots,n\}$. The game given by $C^{n}$ is a purely cyclic game: each strategy beats the one before it and loses to the one after it; $C^{3}$ is the game Rock Paper Scissors. For each $C^{n}$, a Nash equilibrium strategy is $[n^{-1},\dots,n^{-1}]^{\intercal}$. The game given by $T^{n}$ could be considered “transitive”: each strategy is in some sense better than the last, and $[0,\dots,0,1]^{\intercal}$ is a Nash equilibrium strategy. The payoffs are chosen so that each strategy $i$ is the unique best response to $i-1$, so that an algorithm that learns by playing best responses will progress one strategy at a time rather than skipping directly to strategy $n$ (as would happen if FP or AFP were applied to a game that is transitive in a stronger sense, such as one with a single dominant strategy; cf. the definition of transitivity in Balduzzi et al. (2019)). Note: $T^{n}$ could be defined equivalently without the $n^{-1}$ factor, but this would create a spurious dependence on $n$ for the rates we derive.
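As a sanity check on the indexing above, the following sketch constructs $C^{n}$ and $T^{n}$ exactly as defined (NumPy, with the 1-based indices of the definition mapped to 0-based arrays):

```python
import numpy as np

def cyclic_game(n):
    """C^n: strategy i beats strategy i-1 and loses to strategy i+1 (mod n)."""
    C = np.zeros((n, n))
    for i in range(n):
        C[i, (i - 1) % n] = 1.0    # case i = j + 1 (mod n)
        C[i, (i + 1) % n] = -1.0   # case i = j - 1 (mod n)
    return C

def transitive_game(n):
    """T^n as defined above: strategy i+1 is the unique best response to i."""
    T = np.zeros((n, n))
    for i in range(1, n + 1):            # 1-based row index, as in the text
        if i >= 2:
            T[i - 1, i - 2] = (n - i + 2) / n     # case i = j + 1
        if i <= n - 1:
            T[i - 1, i] = -(n - i + 2) / n        # case i = j - 1
    return T

print(cyclic_game(3))      # Rock Paper Scissors
print(transitive_game(4))
```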

The following proposition establishes a convergence rate of $O(t^{-1})$ for AFP applied to $C^{n}$ and $T^{n}$. This rate is optimal within the class of time-averaging algorithms, because the rate at which an average changes is $t^{-1}$. Note: we say a random variable $Y_{t}=\Omega_{p}(g(t))$ if, for any $\epsilon>0$, there exists $c>0$ such that $P[Y_{t}<cg(t)]<\epsilon$ for all $t$.

Proposition 2.

FP and AFP applied symmetrically to $C^{n}$ and $T^{n}$ obtain the rates given in Table 1. In particular, if $\{x_{t},x_{t}\}_{t\in\mathbb{N}}$ is an FP or AFP process for a 2p0s game with payoff matrix $G\in\{C^{n},T^{n}\}$ with tiebreaking as indicated, then $\max G\overline{x}_{t}=R(t)$. Tiebreaking refers to the choice of $x_{t+1}\in\texttt{BR}^{1}_{G}(\overline{x}_{t})$ when there are multiple maximizers. “Random” tiebreaking chooses between tied strategies independently and uniformly at random. For entries marked with “arbitrary” tiebreaking, the convergence rate holds no matter how tiebreaks are chosen.

Table 1: Convergence rates for FP and AFP on $C^{n}$ and $T^{n}$.
Algorithm | Game $G$ | Tiebreaking | Rate $R(t)$ | Caveats
FP | $C^{n}$ | random | $\Omega_{p}(t^{-1/2})$ |
AFP | $C^{n}$ | arbitrary | $O(t^{-1})$ | $n=3,4$
FP | $T^{n}$ | arbitrary | $\Omega(t^{-1/2})$ | $t<t^{*}(n)$
AFP | $T^{n}$ | arbitrary | $O(t^{-1})$ |
Proof.

(Sketch, $C^{n}$) Full proofs of all cases are provided in Appendix C. Define $\Delta_{0}=[0,\dots,0]^{\intercal}\in\mathbb{Z}^{n}$ and $\Delta_{t}=tC^{n}\,\overline{x}_{t}$ for each $t\in\mathbb{N}$. The desired results are equivalent to $\max\Delta_{t}=\Omega_{p}(\sqrt{t})$ under FP for the given tiebreaking rule, and to $\max\Delta_{t}$ being bounded under AFP for $n=3,4$. Let $i_{t}$ be the index played by FP (AFP) at time $t$ (so $x_{t}=e_{i_{t}}$). It follows that

\displaystyle\Delta_{t+1,j}=\begin{cases}\Delta_{t,j}-1&\text{if }j=i_{t}-1\mod n;\\ \Delta_{t,j}+1&\text{if }j=i_{t}+1\mod n;\\ \Delta_{t,j}&\text{otherwise;}\end{cases}\qquad(2)

for each $t\in\mathbb{N}_{0}$ and $j\in\{1,\dots,n\}$. Note that the entries of $\Delta_{t}$ always sum to zero.

In the case of FP, it is easy to verify that $\max\Delta_{t}$ is nondecreasing for any choice of tiebreaking. For each $m\in\mathbb{N}_{0}$, define $t_{m}=\inf\{t\in\mathbb{N}_{0}:\max\Delta_{t}=m\}$. Then by Markov’s inequality,

\displaystyle P(\max\Delta_{t}<m)=P(t_{m}>t)\leq E(t_{m})/t=\frac{1}{t}\sum_{k=0}^{m-1}E(t_{k+1}-t_{k}).

Examining the timesteps at which $i_{t+1}\neq i_{t}$ and relating them to $\{t_{k}\}$, we show in the appendix that the time to increment the max from $k$ to $k+1$ satisfies $E(t_{k+1}-t_{k})=O(k)$. Thus the bound above becomes $P(\max\Delta_{t}<m)\leq O(m^{2})/t$. Now let $c\in\mathbb{R}_{\geq 0}$ be arbitrary and plug in $c\lceil\sqrt{t}\rceil$ for $m$, so that $P(\max\Delta_{t}<c\sqrt{t})\leq c^{2}O(1)\rightarrow 0$ as $c\rightarrow 0$. So $\max\Delta_{t}=\Omega_{p}(\sqrt{t})$.

For the AFP case, consider the first timestep at which $\max\Delta_{t}=m+1$. Working backwards and checking cases, it can be shown that in order for the maximum value to increment from $m$ to $m+1$, there must first be a timestep at which there are two non-adjacent entries equal to $m$ with an entry equal to $m-1$ between them. This cannot happen in the $n=3$, $m=2$ case because three positive entries $(2,1,2)$ do not sum to zero. Similarly, in the $n=4$, $m=2$ case, it turns out by (2) that $\Delta_{t}=[a,b,-a,-b]$ for some $a$, $b$, so there cannot be three positive entries in this case either. Therefore $\max_{t}\Delta_{t}\leq 2$ for $n=3,4$. ∎
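The bounded-$\Delta_{t}$ behavior is easy to check empirically. The following sketch runs AFP symmetrically on $C^{n}$ and tracks $\max\Delta_{t}$; it is an illustrative check under one particular tiebreaking rule (lowest index), not part of the proof.

```python
import numpy as np

def max_delta_afp(n, T):
    """Track max over t of max(Delta_t), Delta_t = t * C^n x_bar_t,
    for AFP applied symmetrically to the cyclic game C^n."""
    C = np.zeros((n, n))
    for i in range(n):                 # build C^n as defined in Section 4
        C[i, (i - 1) % n] = 1.0
        C[i, (i + 1) % n] = -1.0
    x_bar = np.eye(n)[0]
    worst = np.max(C @ x_bar)          # max of Delta_1
    for t in range(1, T):
        x_prime = np.eye(n)[np.argmax(C @ x_bar)]        # anticipate
        x_bar_prime = (t * x_bar + x_prime) / (t + 1)
        x_next = np.eye(n)[np.argmax(C @ x_bar_prime)]   # respond
        x_bar = (t * x_bar + x_next) / (t + 1)
        worst = max(worst, (t + 1) * np.max(C @ x_bar))
    return worst

# Expected to remain bounded (at most 2) for n = 3, 4 per Proposition 2,
# in contrast to FP, whose maximum grows on the order of sqrt(t).
print(max_delta_afp(3, 2000), max_delta_afp(4, 2000))
```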

The proofs of Proposition 2 establish a theme: FP can be slow because it spends increasingly large amounts of time progressing between strategies (playing $x_{t}=x_{t+1}=\dots=x_{t+k}$ with $k$ increasing as $t$ increases), whereas AFP avoids this. (Return to Figure 1 for a visual example.)

Some further comments on the results: we only obtain the $O(t^{-1})$ rate for AFP applied to $C^{n}$ in the $n=3,4$ case. We conjecture that (i) for a specific tiebreaking rule, AFP has the same worst-case rate as FP but with a better constant, and (ii) under random tiebreaking, AFP is $O_{p}(t^{-1})$ for all $n$. This is reflected in numerical simulations for large $n$, as shown in Appendix D.

Our results are noteworthy for their lack of dependence on tiebreaking: worst-case analyses of FP typically rely on specific tiebreaking rules; see Daskalakis and Pan (2014), for example. As for the “$t<t^{*}(n)$” caveat for FP applied to $T^{n}$, this is an unremarkable consequence of analyzing a game with a pure strategy equilibrium (all probability assigned to a single strategy). We write $t^{*}(n)$ for the first index at which FP plays $e_{n}$. Both FP and AFP will play $e_{n}$ forever starting some finite number of steps after they play it for the first time, thus attaining a $t^{-1}$ rate as the average strategy “catches up” to $e_{n}$. Our result shows that until this point, FP is slow, whereas AFP is always fast. As before, AFP’s superior performance is reflected in numerical simulations, as shown in Appendix D.

4.1 Numerical results

In order to compare FP and AFP more generally, we sample large numbers of random payoff matrices and compute aggregate statistics across them. Matrix entries are sampled as independent, identically distributed, standard Gaussian variables (note that the shift- and scale-invariance of matrix game equilibria implies that the choice of mean and variance is inconsequential). Since FP and AFP are so similar, and AFP computes two best responses per timestep, it is natural to wonder: is AFP’s superior performance just an artifact of using more computation per timestep? So, in order to make a fair comparison, we compare the algorithms by the number of best responses calculated instead of the number of timesteps (algorithm iterations). Using the worst-case payoff as the measure of performance, we compare FP and AFP based on the number of best responses computed and based on matrix size in Figures 3 and 4.
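A stripped-down version of this comparison is sketched below. The replicate count, budget, and tie-breaking are placeholders rather than the exact settings used for Figures 3 and 4; the point is the equal-compute bookkeeping (two best responses per FP step and four per AFP step, counting both players).

```python
import numpy as np

def worst_case_after(A, br_budget, anticipatory):
    """Worst-case payoff of player 1's average strategy after spending
    br_budget best-response computations."""
    m, n = A.shape
    x_bar, y_bar = np.eye(m)[0], np.eye(n)[0]
    steps = br_budget // (4 if anticipatory else 2)
    for t in range(1, steps):
        if anticipatory:
            # Each player responds to the opponent's anticipated average.
            y_target = (t * y_bar + np.eye(n)[np.argmin(x_bar @ A)]) / (t + 1)
            x_target = (t * x_bar + np.eye(m)[np.argmax(A @ y_bar)]) / (t + 1)
        else:
            y_target, x_target = y_bar, x_bar
        x_next = np.eye(m)[np.argmax(A @ y_target)]
        y_next = np.eye(n)[np.argmin(x_target @ A)]
        x_bar = (t * x_bar + x_next) / (t + 1)
        y_bar = (t * y_bar + y_next) / (t + 1)
    return np.min(x_bar @ A)

rng = np.random.default_rng(0)
wins, reps = 0, 200                       # fewer replicates than the 1,000 in Figure 3
for _ in range(reps):
    A = rng.standard_normal((30, 30))
    wins += worst_case_after(A, 200, True) >= worst_case_after(A, 200, False)
print(wins / reps)                        # fraction of games where AFP >= FP
```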

The result is that AFP is clearly better on both counts. Although FP is better for a substantial proportion of $30\times 30$ games at very early timesteps, AFP quickly outpaces FP, eventually winning on each of the 1,000 matrices sampled. In terms of matrix size, FP and AFP appear equivalent on average for small matrices, but they quickly grow separated as matrix size grows, with AFP likely to be much better.

Figure 3: For 1,000 randomly sampled $30\times 30$ matrices $A$, the proportion of the time that $\min\,(\overline{x}_{r/2}^{\textnormal{AFP}})^{\intercal}A\geq\min\,(\overline{x}_{r}^{\textnormal{FP}})^{\intercal}A$ for $r=2,4,\dots,200$. A 95% Agresti-Coull confidence interval (Agresti and Coull, 1998) for the true proportion is highlighted. Note that after only about six best responses, AFP is better half the time, and by 130, AFP is better than FP essentially 100% of the time.
Figure 4: Average performance of FP vs. AFP at the 100th best response (timestep 100 for FP, timestep 50 for AFP) as matrix size is varied. All matrices are square. Highlighted regions show the 10th and 90th percentiles.

5 Application to reinforcement learning

We apply reinforcement learning (RL) (Sutton and Barto, 2018) versions of FP and AFP in the context of a (two-player, zero-sum, symmetric) stochastic game (Shapley, 1953), defined by the tuple $(\mathcal{S},\mathcal{O},\mathcal{X},\mathcal{A},\mathcal{P},\mathcal{R},p_{0})$, where $\mathcal{S}$ is the set of possible states of the environment, $\mathcal{O}$ is the set of possible observations received by an agent, $\mathcal{X}:\mathcal{S}\rightarrow\mathcal{O}\times\mathcal{O}$ gives the observations for each player based on the current state, $\mathcal{A}$ is the set of available actions, $\mathcal{P}:\mathcal{S}\times\mathcal{A}\times\mathcal{A}\rightarrow\Delta(\mathcal{S})$ defines the transition dynamics of the environment given each player’s action, $\mathcal{R}:\mathcal{S}\rightarrow\mathbb{R}\times\mathbb{R}$ defines the reward for both players such that $\mathcal{R}(s_{t})=(r_{t},-r_{t})$ are the rewards observed by each player at time $t$, and $p_{0}\in\Delta(\mathcal{S})$ is the initial distribution of states, so that $s_{0}\sim p_{0}$. Let $\mathcal{H}$ be the set of possible sequences of observations. Then a policy is a map $\pi:\mathcal{H}\rightarrow\Delta(\mathcal{A})$. An episode is played by iteratively transitioning the environment according to the actions sampled from each player’s policy at each state. Players 1 and 2 earn returns $(\sum_{t}r_{t},-\sum_{t}r_{t})$. The reinforcement learning algorithms we consider take sequences of observations, actions, and rewards from both players and use them to incrementally update policies toward earning greater expected returns. For background on reinforcement learning, see Sutton and Barto (2018). For details on machine learning approximations to FP, see Heinrich et al. (2015). Table 2 gives a high-level overview of the relationship.

Table 2: The normal-form game analogies used to extend FP and AFP to reinforcement learning.
Normal-form game | Stochastic/extensive-form game
Strategy | Policy
Payoff $A_{i,j}$ | Expected return $E_{\pi_{i},\pi_{j}}(\sum_{t}r_{t})$
Best response | Approximate best response by RL
Strategy mixture $\sum\alpha_{i}x_{i}$, $\sum\alpha_{i}=1$, $\alpha_{i}\geq 0$ | At the start of an episode, sample policy $\pi_{i}$ with probability $\alpha_{i}$; play the entire episode with $\pi_{i}$.
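The last row of Table 2 is the only subtle translation: a mixture is realized by committing to a single sampled policy for a whole episode. A minimal sketch (names are illustrative):

```python
import numpy as np

def sample_mixture_policy(policies, weights, rng):
    """Realize a policy mixture: draw one policy per episode according to
    `weights` and use it for the entire episode."""
    weights = np.asarray(weights, dtype=float)
    idx = rng.choice(len(policies), p=weights / weights.sum())
    return policies[idx]

# Example: three placeholder policies mixed with weights (0.5, 0.25, 0.25).
policies = ["pi_1", "pi_2", "pi_3"]
rng = np.random.default_rng(0)
episode_policy = sample_mixture_policy(policies, [0.5, 0.25, 0.25], rng)
```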

We use two environments: our own TinyFighter, and Running With Scissors from Vezhnevets et al. (2020).

5.1 Environments

TinyFighter is a minimal version of an arcade-style fighting game shown in Figure 5(a). It features two players with four possible actions: Move Left, Move Right, Kick, and Do Nothing. Players are represented by a rectangular body and when kicking, extend a rectangular leg towards the opponent.

Kicking consists of three phases: Startup, Active, and Recovery. Each phase of a kick lasts for a certain number of frames, and if the Active phase of the kick intersects with any part of the opponent (body or leg), a hit is registered. When a hit occurs, the players are pushed back, the opponent takes damage, and the opponent is stunned (unable to take actions) for a period of time. In the Startup and Recovery phases, the leg is extended, and like the body, can be hit by the opponent if the opponent has a kick in the active phase that intersects the player. The game is over when a player’s health is reduced to zero or when time runs out.

Player observations are vectors in $\mathbb{R}^{13}$ and contain information about player and opponent state: position, health, an ‘attacking’ indicator, a ‘stunned’ indicator, and how many frames a player has been in the current action. The observation also includes the distance between players, time remaining, and the direction of the opponent (left or right of self). The game is partially observable, so information about the opponent’s state is hidden from the player for some number of frames (we use four, and the game runs at 15 frames per second). This means a strong player must guess the distribution of actions the opponent may have taken recently and respond to that distribution; playing deterministically would allow the opponent to exploit the player, so a stochastic strategy is required to play well.

Running With Scissors (RWS) is a spatiotemporal environment with partial observability and the potential for nontransitive relationships between policies. As shown in Figure 5(b), RWS is a 2D gridworld with a few types of entities: two agents; three types of items (rock, paper, and scissors), which can be picked up by the agents to add to their inventories; and impassable walls. In addition to moving around and picking up items, agents in RWS have a “tag” action, which projects a cone in front of them for a single frame. If the cone hits the other agent, the episode ends and each agent receives rewards based on the payoff matrix $C^{n}$ according to the ratios of each item in their inventory. Agents can only see their own inventory and a $5\times 5$ grid in front of them and can remember the last four frames they’ve seen, so in order to perform effectively they must infer what items the opponent has picked up. Vezhnevets et al. (2020) and Liu et al. (2022) (Appendix B.1) provide further discussion of the environment.

(a) TinyFighter, where players dance back and forth, attempting to land kicks on each other in order to reduce the other’s health from 100 to zero.
(b) Running With Scissors, a 2D gridworld where players collect items before tagging each other to earn reward based on the proportion of items in their inventory.
Figure 5: Screenshots of multiagent RL environments.

5.2 Adapting FP and AFP to reinforcement learning

Neural Population Learning (NeuPL) (Liu et al., 2022) is a framework for multiagent reinforcement learning wherein a collection of policies is learned and represented by a single neural network and all policies train continuously. For our experiments, we implement FP and AFP within NeuPL, as shown in Algorithm 1. For reference, we also include a simple RL version of FP and AFP in the style of PSRO in Appendix F.

Algorithm 1 NeuPL-FP/AFP
1: $\mathfrak{O}\in\{\mathfrak{O}^{\text{FP}},\mathfrak{O}^{\text{AFP}}\}$ ▷ Input: FP or AFP opponent sampler.
2: $\{\Pi_{\theta}(t):\mathcal{H}\rightarrow\Delta(\mathcal{A})\}_{t=1}^{n}$ ▷ Input: neural population net.
3: for batch $b=1,2,3,\dots$ do
4:  $B\leftarrow\{\}$
5:  while per-batch compute budget remains do
6:   $T_{\text{learner}}\sim\text{Uniform}(\{1,\dots,n\})$
7:   $T_{\text{opponent}}\sim\mathfrak{O}(T_{\text{learner}})$
8:   $D_{\text{learner}}\leftarrow\textsc{PlayEpisode}(\Pi_{\theta}(T_{\text{learner}}),\Pi_{\theta}(T_{\text{opponent}}))$
9:   $B\leftarrow B\cup D_{\text{learner}}$
10:  end while
11:  $\Pi_{\theta}\leftarrow\textsc{ReinforcementLearningUpdate}(B)$
12: end for
Figure 6: A visual depiction of the distributions of opponents (“meta-strategies” in PSRO or NeuPL) that each learner faces in a population learning implementation of FP or AFP. The $(i,j)$ entry is the probability that, given that agent $i$ is training, it will face agent $j$ in a particular episode. Dark blue indicates probability 1; white indicates probability 0.

In NeuPL-FP/AFP, the opponent sampler $\mathfrak{O}$ determines the distributions of opponents that each agent faces and is the only difference between the FP and AFP implementations. We have, for each $t>1$,

\displaystyle\mathfrak{O}^{\text{FP}}(t)=\text{Uniform}(\{1,2,3,\dots,t-1\}),\text{ and}
\displaystyle\mathfrak{O}^{\text{AFP}}(t)=\text{Uniform}(\{k<t:k\text{ odd}\}\cup\{t-1\}).

These distributions are depicted in Figure 6. Just as each step of FP involves computing a best response to the average of all prior strategies, sampling from $\mathfrak{O}^{\text{FP}}(t)$ corresponds to training agent $t$ uniformly against the prior policies; just as AFP can be thought of as “forgetting” every other best response it calculates, $\mathfrak{O}^{\text{AFP}}(t)$ trains learner index $t$ uniformly against every odd-indexed policy plus the most recent policy. The neural population net $\Pi_{\theta}(t):\mathcal{H}\rightarrow\Delta(\mathcal{A})$ defines a different policy for each agent index $t$, and can equivalently be represented as $\Pi_{\theta}(a|s,t)$.
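In code, the two samplers differ by essentially one line; the following is a minimal sketch (function names are ours, not NeuPL's API):

```python
import numpy as np

def opponent_sampler_fp(t, rng):
    """FP: train agent t against all earlier agents uniformly."""
    return int(rng.choice(np.arange(1, t)))

def opponent_sampler_afp(t, rng):
    """AFP: train agent t against odd-indexed earlier agents plus agent t-1."""
    candidates = sorted({k for k in range(1, t) if k % 2 == 1} | {t - 1})
    return int(rng.choice(candidates))

rng = np.random.default_rng(0)
print([opponent_sampler_fp(6, rng) for _ in range(5)])   # uniform over 1..5
print([opponent_sampler_afp(6, rng) for _ in range(5)])  # uniform over {1, 3, 5}
```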

5.3 Experimental setup

For the neural population net, we use an actor-critic (Sutton and Barto, 2018) architecture similar to that used for RWS in Liu et al. (2022): first, a module processes environment observations into a dense representation that is shared between an actor head and a critic head. The actor takes this vector, concatenates it with a vector representing the distribution of opponents faced by the currently training agent $t$ (e.g., $[0.5,0.5,0,\dots,0]$ for agent $t=3$ in FP or AFP), then processes it with a dense MLP with ReLU activations, with action masking applied prior to a final softmax layer. The critic is similar, except that it uses an additional input: the index of the opponent sampled at the beginning of this episode, matching the original implementation. (Note that the actor, i.e., the policy network, does not observe which opponent it faces, only the distribution over agents it faces; this is important because otherwise our agent would not learn a best response to an average policy as intended in FP and AFP. The reason for including this information for the critic, i.e., the value network, is that it may reduce the variance of the value function estimator.) For the exact architectures used, see Appendix E. We use a neural population size of $n=11$. Based on matrix game simulations and preliminary RL experiments, we determined that AFP performs slightly better over a short time horizon when initialized with two steps of FP, so we do this, as shown in the final panel of Figure 6. See Appendix D for a figure comparing performance in the matrix setting.

We implemented NeuPL within a basic self-play reinforcement learning loop by wrapping the base environment (TinyFighter or RWS) within a lightweight environment that handles NeuPL logic, such as opponent sampling. For reinforcement learning, we use the Asynchronous Proximal Policy Optimization (APPO) algorithm (Schulman et al., 2017), a distributed actor-critic RL algorithm, as implemented in RLlib (Moritz et al., 2018) with a single GPU learner. Hyperparameter settings are given in Appendix E. We train the entire neural population net (agents 1-11) for 12,500 steps, where a step is roughly 450 minibatch updates of stochastic gradient descent. This corresponds to about five days of training. For each of the two environments, we repeat this procedure independently 10 times for FP and 10 times for AFP.

5.4 Results

To evaluate exploitability, we make use of the fact that each FP and AFP neural population is made up of agents trained to “exploit” the ones that came before them. Specifically, each agent is trained to approximate a best response to the average policy returned by the algorithm at the previous timestep. So, to estimate the exploitability of NeuPL-FP or NeuPL-AFP at step $t\in\{1,\dots,n-1\}$, we simply use the average return earned by agent $t+1$ against agents $\{1,\dots,t\}$ to obtain the within-population exploitability at $t$. This is a convenient metric, but insufficient on its own. In order for it to be meaningful, the agents in the population must have learned approximate best responses that are close to actual best responses; if they have not, it could be that within-population exploitability is low, not because the average policy approximates a Nash policy but because nothing has been learned at all. To account for this, we also evaluate the populations learned using relative population performance (Balduzzi et al., 2019), which measures the strength of one population of agents against the other. The purpose of using relative population performance is simply to verify that one algorithm did not produce generally more competent agents than the other.
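Given an estimated cross-play return table for a population, the within-population exploitability estimate reduces to a one-line average. A sketch, assuming such a table has already been estimated by playing many episodes between each pair of agents:

```python
import numpy as np

def within_population_exploitability(returns, t):
    """Estimated exploitability of the average policy at step t, computed as
    the average return of agent t+1 against agents 1..t.
    `returns[i, j]` is the estimated return of agent i+1 versus agent j+1."""
    return returns[t, :t].mean()

# Example with a placeholder 11x11 return table.
rng = np.random.default_rng(0)
returns = rng.normal(size=(11, 11))
print([within_population_exploitability(returns, t) for t in range(1, 11)])
```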

We paired each of the 10 replicates of FP and AFP and computed the relative population performance for each pair. On TinyFighter, the average was -0.73, with a Z-test-based 90% confidence interval width of 1.37. On RWS, the average was 4.60, with a Z-test-based 90% confidence interval width of 3.50. We conclude that the agents learned by FP and AFP are not statistically significantly different in terms of performance on TinyFighter, but the agents learned by FP have a slight, statistically significant advantage on RWS. However, these differences are small relative to the total obtainable reward in either environment (20 for TinyFighter, roughly 60 for RWS), so we conclude it is reasonable to use within-population exploitability to compare FP and AFP, as shown in Figure 7. For consistency with the matrix game simulation results, we plot worst-case payoff, which is simply the negative of exploitability in this case.

(a) TinyFighter
(b) Running With Scissors
Figure 7: Estimated worst-case payoffs for FP and AFP on two stochastic games. Highlighting indicates a pointwise 90% confidence region.

We find that AFP has a significantly better worst-case payoff of -7.4, versus -10.6 for FP, at the final timestep in TinyFighter. This corresponds to a noteworthy 16% reduction in exploitability relative to the total possible reward of 20 that can be earned in TinyFighter. In RWS, the algorithms have essentially identical performance. The fact that AFP’s advantage varies widely by environment is not surprising. The matrix game simulations in Figure 3 showed that until over 100 steps of each algorithm, there is some proportion of games for which FP performs better. Correspondingly, we would expect that there is a nontrivial proportion of stochastic games where NeuPL-FP outperforms NeuPL-AFP for small population sizes. Although we expect NeuPL will not be able to support over 100 policies (the original paper used population size 8), it would be possible to do so within the PSRO framework. This remains a topic for further investigation.

6 Conclusion

We proposed a variant of fictitious play for faster estimation of Nash equilibria in two-player, zero-sum games. Anticipatory fictitious play is intuitive, easy to implement, and supported by theory and numerical simulations which suggest that it is virtually always preferable to fictitious play. Consequently, we shed new light on two motivating problems for fictitious play: primarily, large-scale multiagent reinforcement learning for complicated real-world games, and secondarily, modeling strategic decision making in humans. Further work is needed to understand the conditions under which AFP outperforms FP in the reinforcement learning setting.

Acknowledgements

The authors sincerely thank Ryan Martin for continued mentorship, especially around theory and the presentation of the paper; Philip Wardlaw for orchestrating preliminary reinforcement learning experiments; Adam Venis and Angelo Olcese for contributions to the proofs of FP and AFP applied to $C^{n}$; Jesse Clifton and Eric Laber for helpful comments on a draft; Jesse Clifton and Marc Lanctot for suggesting related works that had been overlooked; and Siqi Liu for helpful explanations and implementation details for NeuPL.

References

  • Adler, (2013) Adler, I. (2013). The equivalence of linear programs and zero-sum games. International Journal of Game Theory, 42(1):165–177.
  • Agresti and Coull, (1998) Agresti, A. and Coull, B. A. (1998). Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician, 52(2):119–126.
  • Balduzzi et al., (2019) Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W., Perolat, J., Jaderberg, M., and Graepel, T. (2019). Open-ended learning in symmetric zero-sum games. In International Conference on Machine Learning, pages 434–443. PMLR.
  • Brown, (1951) Brown, G. W. (1951). Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1):374–376.
  • Chollet et al., (2015) Chollet, F. et al. (2015). Keras.
  • Conlisk, (1993a) Conlisk, J. (1993a). Adaptation in games: Two solutions to the Crawford puzzle. Journal of Economic Behavior & Organization, 22(1):25–50.
  • Conlisk, (1993b) Conlisk, J. (1993b). Adaptive tactics in games: Further solutions to the Crawford puzzle. Journal of Economic Behavior & Organization, 22(1):51–68.
  • Daskalakis and Pan, (2014) Daskalakis, C. and Pan, Q. (2014). A counter-example to Karlin’s strong conjecture for fictitious play. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 11–20, Philadelphia, PA, USA. IEEE.
  • Foerster et al., (2018) Foerster, J., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., and Mordatch, I. (2018). Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 122–130. AAMAS.
  • Heinrich et al., (2015) Heinrich, J., Lanctot, M., and Silver, D. (2015). Fictitious self-play in extensive-form games. In International Conference on Machine Learning, pages 805–813.
  • Kuhn, (1953) Kuhn, H. (1953). Extensive games and the problem of information. In Contributions to the Theory of Games, pages 193–216. Princeton University Press.
  • Lanctot et al., (2017) Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Perolat, J., Silver, D., and Graepel, T. (2017). A unified game-theoretic approach to multiagent reinforcement learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  • Letcher et al., (2018) Letcher, A., Foerster, J., Balduzzi, D., Rocktäschel, T., and Whiteson, S. (2018). Stable opponent shaping in differentiable games. In International Conference on Learning Representations, Vancouver, BC, Canada. ICLR.
  • Liu et al., (2022) Liu, S., Marris, L., Hennes, D., Merel, J., Heess, N., and Graepel, T. (2022). NeuPL: Neural population learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
  • Lockhart et al., (2019) Lockhart, E., Lanctot, M., Pérolat, J., Lespiau, J.-B., Morrill, D., Timbers, F., and Tuyls, K. (2019). Computing approximate equilibria in sequential adversarial games by exploitability descent. arXiv preprint arXiv:1903.05614.
  • Luce and Raiffa, (1989) Luce, R. D. and Raiffa, H. (1989). Games and Decisions: Introduction and Critical Survey. Courier Corporation.
  • Moritz et al., (2018) Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan, M. I., and Stoica, I. (2018). Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577.
  • Nash Jr, (1950) Nash Jr, J. F. (1950). Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1):48–49.
  • Robinson, (1951) Robinson, J. (1951). An iterative method of solving a game. Annals of Mathematics, 54(2):296–301.
  • Schulman et al., (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Shamma and Arslan, (2005) Shamma, J. S. and Arslan, G. (2005). Dynamic fictitious play, dynamic gradient play, and distributed convergence to nash equilibria. IEEE Transactions on Automatic Control, 50(3):312–327.
  • Shapley, (1953) Shapley, L. S. (1953). Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100.
  • Shoham and Leyton-Brown, (2008) Shoham, Y. and Leyton-Brown, K. (2008). Multiagent Systems: Algorithmic, Game-theoretic, and Logical Foundations. Cambridge University Press.
  • Sutton and Barto, (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT press, Cambridge, Massachusetts.
  • Vezhnevets et al., (2020) Vezhnevets, A., Wu, Y., Eckstein, M., Leblond, R., and Leibo, J. Z. (2020). Options as responses: Grounding behavioural hierarchies in multi-agent reinforcement learning. In International Conference on Machine Learning, pages 9733–9742. PMLR.
  • Vinyals et al., (2019) Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J. P., Jaderberg, M., Vezhnevets, A. S., Leblond, R., Pohlen, T., Dalibard, V., Budden, D., Sulsky, Y., Molloy, J., Paine, T. L., Gulcehre, C., Wang, Z., Pfaff, T., Wu, Y., Ring, R., Yogatama, D., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Kavukcuoglu, K., Hassabis, D., Apps, C., and Silver, D. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354.
  • Zhang and Lesser, (2010) Zhang, C. and Lesser, V. (2010). Multi-agent learning with policy prediction. In Twenty-Fourth AAAI Conference on Artificial Intelligence, pages 927–937. AAAI.

Appendix

Appendix A Why naive AFP doesn’t work

The original idea for the AFP algorithm, which we refer to as naive AFP, was: at each timestep, play the best response to the opponent’s best response to your history. Formally, this is given by the following updates (cf. the AFP update (1)):

\displaystyle x^{\prime}_{t+1}\in\texttt{BR}^{1}_{A}(\overline{y}_{t});\qquad y^{\prime}_{t+1}\in\texttt{BR}^{2}_{A}(\overline{x}_{t});
\displaystyle x_{t+1}\in\texttt{BR}^{1}_{A}(y^{\prime}_{t+1});\qquad y_{t+1}\in\texttt{BR}^{2}_{A}(x^{\prime}_{t+1});
\displaystyle\overline{x}_{t+1}=\frac{1}{t+1}\sum_{k=1}^{t+1}x_{k};\qquad\overline{y}_{t+1}=\frac{1}{t+1}\sum_{k=1}^{t+1}y_{k}.

In preliminary simulations, naive AFP performed well in cyclic games and seemed to be competitive with FP in other games.

However, upon inspection it becomes clear that naive AFP is not guaranteed to converge to a Nash equilibrium. This is because naive AFP can only play best responses to strategies returned by the best response operator, which are pure strategies, but the Nash equilibria of some games assign nonzero probability to strategies that are not best responses to any pure strategy. Thus there are games where naive AFP is incapable of assigning any probability to actions which must be assigned nonzero probability in a Nash equilibrium, so naive AFP cannot converge to a Nash equilibrium. For example, see the game given in Table 3.

(row \ column) | Rock | Paper | Scissors
Rock | 0 | -1 | 1
Paper | 1 | 0 | -1
Scissors | -1 | 1 | 0
SafeRock | 0 | 0 | 0.99
Table 3: Rock Paper Scissors SafeRock. In this game, SafeRock is not a best response to any of the opponent’s pure strategies, so it won’t be played by naive AFP. However, SafeRock is included in the Nash equilibrium, which is given approximately by the probabilities ([0.33,0.47,0.2],[0,0.32,0.18,0.15]), with value $v^{*}=0.133$. If SafeRock were removed, the game would be symmetric and have value 0, so we know SafeRock must be included in any Nash support.
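For readers who want to check claims like this one, a generic way to solve small matrix games is the standard linear program for the row player (maximize $v$ subject to $x^{\intercal}A\geq v$ and $x$ in the simplex). The sketch below uses SciPy; it is a generic solver, not the code used for this paper.

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Nash strategy and value for the row (maximizing) player of the
    zero-sum game A, via the standard linear programming formulation."""
    m, n = A.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                 # minimize -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])    # v - x^T A_j <= 0 for each column j
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]])[None, :]   # sum(x) = 1
    b_eq = [1.0]
    bounds = [(0, None)] * m + [(None, None)]    # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Rock, Paper, Scissors, SafeRock (rows) vs Rock, Paper, Scissors (columns).
A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0], [0, 0, 0.99]])
x_star, v_star = solve_zero_sum(A)
print(x_star, v_star)   # SafeRock receives positive probability; value > 0
```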

Appendix B Proof of Proposition 1

Our analysis closely follows the original proof of FP’s convergence given in Robinson (1951), which consists of four key lemmas. We introduce the notion of a perturbed fictitious play system, a slight generalization of Robinson’s “vector system.” We modify Robinson’s second lemma accordingly, which causes inconsequential changes in the third and fourth lemmas, leaving the final result the same. In this way, we extend Robinson’s proof to a broader class of algorithms which include fictitious play and anticipatory fictitious play as special cases.

Definition 1.

An iterative play system $(U,V)$ for $A\in\mathbb{R}^{m\times n}$ is a pair of sequences of vectors $U=\{U(0),U(1),\dots\}$ and $V=\{V(0),V(1),\dots\}$ with each $U(t)\in\mathbb{R}^{n}$ and $V(t)\in\mathbb{R}^{m}$ satisfying the following properties:

  • $\min U(0)=\max V(0)$, and

  • for each $t$, $U(t+1)=U(t)+A_{i(t),*}$ and $V(t+1)=V(t)+A_{*,j(t)}$,

where $A_{i,*}$ and $A_{*,j}$ are the $i$th row and $j$th column of $A$, and $i(t)$ and $j(t)$ are the row and column “played” by players 1 and 2 at time $t$.

The interpretation of an iterative play system is as follows. Suppose we choose $U(0)=0$ and $V(0)=0$. Write $e_{h}$ to indicate a vector with a 1 at index $h$ and 0’s elsewhere. Then $\overline{x}_{t}=t^{-1}\sum_{k=1}^{t}e_{i(k)}$ and $\overline{y}_{t}=t^{-1}\sum_{k=1}^{t}e_{j(k)}$ are the empirical strategies played by players 1 and 2, and $t^{-1}V(t)=A\overline{y}_{t}$ and $t^{-1}U(t)=A^{\intercal}\overline{x}_{t}$ are the payoffs for players 1 and 2 when faced with those empirical strategies. In this way, $V(t)$ and $U(t)$ can be seen as “accumulating empirical payoffs” for players 1 and 2.

Definition 2.

A perturbed fictitious play system (PFP-system) is an iterative play system with the additional property that $i(t)$ and $j(t)$ are best responses to a perturbed set of empirical payoffs. Precisely, an iterative play system is a PFP-system if, for some values $E_{V}(t)\in\mathbb{R}^{m}$ and $E_{U}(t)\in\mathbb{R}^{n}$ with $\|E_{V}(t)\|_{\infty}<a:=\max_{i,j}|A_{i,j}|$ and $\|E_{U}(t)\|_{\infty}<a$ for each $t$,

\displaystyle i(t+1)\in\arg\max\big[V(t)+E_{V}(t)\big]\text{ and }j(t+1)\in\arg\min\big[U(t)+E_{U}(t)\big].

A special case of a PFP-system is what Robinson calls a “vector system,” which describes fictitious play. This is obtained by setting all entries of $E_{V}(t)$ and $E_{U}(t)$ to zero at all timesteps.

Lemma 1.

If $(U,V)$ is an iterative play system for $A$, then

\displaystyle\liminf_{t\rightarrow\infty}\;t^{-1}\bigl\{\max V(t)-\min U(t)\bigr\}\geq 0.

This lemma follows from the minimax nature of two-player zero-sum games and holds regardless of which rows and columns of $A$ are used to update the elements of $U$ and $V$.

Definition 3.

Given an iterative play system $(U,V)$ for matrix $A$, we say row $A_{i,*}$ is $E$-eligible in the interval $[t,t^{\prime}]$ if for some $t_{1}\in[t,t^{\prime}]$, $i$ could have been played as part of AFP. Precisely, the condition is that there exists $t_{1}\in[t,t^{\prime}]$ such that

\displaystyle i\in\arg\max\big[V(t_{1})+E\big]

for some $E\in\mathbb{R}^{m}$ with $\|E\|_{\infty}<a$. Or, equivalently,

\displaystyle v_{i}(t_{1})\geq\max V(t_{1})-2a.

Similarly, we say column $j$ is $E$-eligible if

\displaystyle u_{j}(t_{1})\leq\min U(t_{1})+2a.
Lemma 2.

If $(U,V)$ is an iterative play system for $A$ and all rows and columns are $E$-eligible in $[s,s+t]$, then we have

\displaystyle\max U(s+t)-\min U(s+t)\leq 2a(t+1),\text{ and}
\displaystyle\max V(s+t)-\min V(s+t)\leq 2a(t+1).
Proof.

Let jargmaxU(s+t)j\in\arg\max\,U(s+t). By the EE-eligibility of jj, there must exist t[s,s+t]t^{\prime}\in[s,s+t] such that

uj(t)minU(t)2a.\displaystyle u_{j}(t^{\prime})-\min U(t^{\prime})\leq 2a.

So, because the jjth entry can’t change by more than aa per timestep,

\displaystyle\max\,U(s+t)=u_{j}(s+t)
\displaystyle\leq u_{j}(t^{\prime})+at
\displaystyle\leq\min\,U(t^{\prime})+2a+at
\displaystyle\leq\min\,U(s+t)+at+2a+at,

where the last inequality holds because each entry of U changes by at most a per timestep, so \min U(t^{\prime})\leq\min U(s+t)+at for t^{\prime}\in[s,s+t]. A similar argument shows the result for V. ∎

Lemma 3.

If (U,V)(U,V) is an iterative play system for AA and all rows and columns are EE-eligible in [s,s+t][s,s+t], then

maxV(s+t)minU(s+t)4a(t+1).\displaystyle\max V(s+t)-\min U(s+t)\leq 4a(t+1).
Proof.

As shown in Robinson, (1951), this follows immediately from Lemma 2. The only difference here is that we replace 2at2at with 2a(t+1)2a(t+1). ∎

Lemma 4.

For every matrix A and every \epsilon>0, there exists t_{0} such that, for any PFP-system (U,V) for A,

maxV(t)minU(t)ϵt for all tt0.\displaystyle\max V(t)-\min U(t)\leq\epsilon t\text{ for all }t\geq t_{0}.
Proof.

We follow the proof of Robinson, (1951), replacing the notion of eligibility with E-eligibility, and argue by induction on the size of A. If A\in\mathbb{R}^{1\times 1}, the result is trivial because V(t)=U(t) for all t. Now assume that the property holds for every submatrix A^{\prime} obtained by deleting one or more rows or columns from A. We wish to show that the property also holds for A.

Using the inductive hypothesis, pick tt^{*} such that

maxV(t)minU(t)<12ϵt for all tt\displaystyle\max V^{\prime}(t)-\min U^{\prime}(t)<\tfrac{1}{2}\epsilon t\text{ for all }t\geq t^{*}

for any proper submatrix A^{\prime} of A and any PFP-system (U^{\prime},V^{\prime}) for A^{\prime}.

As in Robinson, (1951), we wish to show that if in (U,V)(U,V), some row or column is not EE-eligible in [s,s+t][s,s+t^{*}], then

maxV(s+t)minU(s+t)<maxV(s)minU(s)+12ϵt\displaystyle\max V(s+t^{*})-\min U(s+t^{*})<\max V(s)-\min U(s)+\tfrac{1}{2}\epsilon t^{*} (3)

Suppose without loss of generality that row A_{m,*} is not E-eligible in [s,s+t^{*}]. Then we construct a new PFP-system (U^{\prime},V^{\prime}) for the matrix A^{\prime}, which is A with row m deleted. Define

U(t)\displaystyle U^{\prime}(t) =U(s+t)+c 1n,\displaystyle=U(s+t)+c\,\mathds{1}_{n},
V(t)\displaystyle V^{\prime}(t) =ProjmV(s+t),\displaystyle=\textnormal{Proj}_{m}V(s+t),

for t=0,1,\dots,t^{*}, where \mathds{1}_{n} is a vector of 1’s with n entries, c=\max V(s)-\min U(s), and \textnormal{Proj}_{k}:\mathbb{R}^{m}\rightarrow\mathbb{R}^{m-1} is the operator that removes entry k. We now check that (U^{\prime},V^{\prime}) satisfies the conditions of a PFP-system:

  • We have minU(0)=min{U(s)+[maxV(s)minU(s)]𝟙n}=maxV(s)=maxV(0)\min U^{\prime}(0)=\min\{U(s)+[\max V(s)-\min U(s)]\mathds{1}_{n}\}=\max V(s)=\max V^{\prime}(0), where the last equality holds because mm is not EE-eligible, so it could not be a maximizer of V(s)V(s), so deleting it to form V(0)V^{\prime}(0) does not change the maximum.

  • We have that for each tt,

    U(t+1)\displaystyle U^{\prime}(t+1) =U(s+t+1)+c𝟙n=U(s+t)+Ai(s+t),+c𝟙n=U(t)+Ai(s+t),,\displaystyle=U(s+t+1)+c\mathds{1}_{n}=U(s+t)+A_{i(s+t),*}+c\mathds{1}_{n}=U^{\prime}(t)+A^{\prime}_{i(s+t),*},
    V(t+1)\displaystyle V^{\prime}(t+1) =ProjmV(s+t+1)=Projm[V(s+t)+A,j(s+t)]=V(t)+A,j(s+t),\displaystyle=\textnormal{Proj}_{m}V(s+t+1)=\textnormal{Proj}_{m}[V(s+t)+A_{*,j(s+t)}]=V^{\prime}(t)+A^{\prime}_{*,j(s+t)},

    where A_{i(s+t),*}=A^{\prime}_{i(s+t),*}, because m is not E-eligible in [s,s+t^{*}] and hence i(s+t)\neq m.

  • Finally, we must show that the rows and columns selected still qualify as perturbed best responses within the context of (U^{\prime},V^{\prime}) and A^{\prime}, i.e., that

    vi(s+t)(t)maxV(t)2a and uj(s+t)(t)minU(t)+2a\displaystyle v^{\prime}_{i(s+t)}(t)\geq\max V^{\prime}(t)-2a\text{ and }u^{\prime}_{j(s+t)}(t)\leq\min U^{\prime}(t)+2a

    for each t=0,1,\dots,t^{*}. By the definition of V^{\prime} and the fact that i(s+t)\neq m, we have

    \displaystyle v^{\prime}_{i(s+t)}(t)=v_{i(s+t)}(s+t)
    \displaystyle\geq\max V(s+t)-2a ((U,V) is a PFP-system)
    \displaystyle=\max V^{\prime}(t)-2a, (m is not E-eligible, so deleting entry m does not change the maximum)

    and U(t)U^{\prime}(t) is just a shifted version of U(s+t)U(s+t), so the fact that uj(s+t)(s+t)minU(s+t)+2au_{j(s+t)}(s+t)\leq\min U(s+t)+2a implies the result.

These points verify that (U^{\prime},V^{\prime}) satisfies the conditions of a PFP-system for A^{\prime} on t=0,\dots,t^{*}. We can choose the remaining values of both sequences for t=t^{*}+1,t^{*}+2,\dots so that the PFP conditions continue to hold. So, using the inductive hypothesis, we have

maxV(s+t)minU(s+t)\displaystyle\max V(s+t^{*})-\min U(s+t^{*}) =maxV(t)min{U(t)[maxV(s)minU(s)]𝟙n}\displaystyle=\max V^{\prime}(t^{*})-\min\big{\{}U^{\prime}(t^{*})-[\max V(s)-\min U(s)]\mathds{1}_{n}\big{\}}
=maxV(t)minU(t)+maxV(s)minU(s)\displaystyle=\max V^{\prime}(t^{*})-\min U^{\prime}(t^{*})+\max V(s)-\min U(s)
<12ϵt+maxV(s)minU(s).\displaystyle<\tfrac{1}{2}\epsilon t^{*}+\max V(s)-\min U(s).

With (3) established, we are ready to finish the proof: under the inductive hypothesis, we split time into length-t^{*} intervals and treat two cases: if some row or column is not E-eligible in an interval, we apply (3); if all rows and columns are E-eligible, we apply Lemma 3.

Specifically, we show that for any PFP-system (U,V) for A and t\geq 16at^{*}/\epsilon,

maxV(t)minU(t)<ϵt.\displaystyle\max V(t)-\min U(t)<\epsilon t.

Let t>t^{*} and express it as t=(\theta+q)t^{*}, where q\in\mathbb{N} and \theta\in[0,1). We consider the collection of length-t^{*} intervals [(\theta+r-1)t^{*},(\theta+r)t^{*}] for r=1,\dots,q.

  • Case 1. There is at least one interval in which all rows and columns are E-eligible. Let [(\theta+s-1)t^{*},(\theta+s)t^{*}] be the latest such interval. By Lemma 3,

    \displaystyle\max V[(\theta+s)t^{*}]-\min U[(\theta+s)t^{*}]\leq 4a(t^{*}+1)\leq 8at^{*},

    where the last inequality holds because t^{*}\geq 1. By the choice of this interval, every subsequent interval with r=s+1,\dots,q contains some row or column that is not E-eligible, so (3) gives that

    \displaystyle\max V(t)-\min U(t)\leq\max V[(\theta+s)t^{*}]-\min U[(\theta+s)t^{*}]+\tfrac{1}{2}\epsilon t^{*}(q-s),

    noting that the result holds trivially if q=sq=s. Combining the previous two results and loosening the bound, we have

    maxV(t)minU(t)8at+12ϵt(qs)(8a+12ϵq)t.\displaystyle\max V(t)-\min U(t)\leq 8at^{*}+\tfrac{1}{2}\epsilon t^{*}(q-s)\leq(8a+\tfrac{1}{2}\epsilon q)t^{*}. (4)
  • Case 2. In each interval [(\theta+r-1)t^{*},(\theta+r)t^{*}] for r=1,\dots,q, some row or column is not E-eligible. Applying (3) repeatedly,

    maxV(t)minU(t)\displaystyle\max V(t)-\min U(t) =maxV[(θ+q)t]minU[(θ+q)t]\displaystyle=\max V[(\theta+q)t^{*}]-\min U[(\theta+q)t^{*}]
    <maxV[(θ+q1)t]minU[(θ+q1)t]+12ϵt\displaystyle<\max V[(\theta+q-1)t^{*}]-\min U[(\theta+q-1)t^{*}]+\tfrac{1}{2}\epsilon t^{*}
    <maxV[(θ+q2)t]minU[(θ+q2)t]+12ϵt+12ϵt\displaystyle<\max V[(\theta+q-2)t^{*}]-\min U[(\theta+q-2)t^{*}]+\tfrac{1}{2}\epsilon t^{*}+\tfrac{1}{2}\epsilon t^{*}
    <\displaystyle<\dots
    <maxV(θt)minU(θt)+12qϵt\displaystyle<\max V(\theta t^{*})-\min U(\theta t^{*})+\tfrac{1}{2}q\epsilon t^{*}
    2aθt+12qϵt\displaystyle\leq 2a\theta t^{*}+\tfrac{1}{2}q\epsilon t^{*} (5)
    =(2aθ+12qϵ)t,\displaystyle=(2a\theta+\tfrac{1}{2}q\epsilon)t^{*}, (6)

    where the last inequality holds because \max V(\theta t^{*})\leq a\theta t^{*} and \min U(\theta t^{*})\geq-a\theta t^{*}.

So, comparing (4) and (6) and noting that θ[0,1)\theta\in[0,1), in either case we have that

maxV(t)minU(t)8at+12ϵ(qt)8at+12ϵtϵt\displaystyle\max V(t)-\min U(t)\leq 8at^{*}+\tfrac{1}{2}\epsilon(qt^{*})\leq 8at^{*}+\tfrac{1}{2}\epsilon t\leq\epsilon t

for all t16at/ϵt\geq 16at^{*}/\epsilon.∎

Finally, we are ready for the proof of Proposition 1, which is essentially identical to the final proof of Theorem 1 in Robinson, (1951).

Proof.

Let V(0)=0mV(0)=0\in\mathbb{R}^{m}, U(0)=0nU(0)=0\in\mathbb{R}^{n}, and V(t)=tAy¯tV(t)=tA\overline{y}_{t}, U(t)=tAx¯tU(t)=tA^{\intercal}\overline{x}_{t} for tt\in\mathbb{N}, where x¯t\overline{x}_{t} and y¯t\overline{y}_{t} are as given in (1). Clearly, (U,V)(U,V) forms an iterative play system. It follows from (1) that (U,V)(U,V) is also a PFP-system with EV(t)=ABRA2(x¯t)E_{V}(t)=A\cdot\texttt{BR}^{2}_{A}(\overline{x}_{t}) and EU(t)=ABRA1(y¯t)E_{U}(t)=A^{\intercal}\cdot\texttt{BR}^{1}_{A}(\overline{y}_{t}). This is because

y¯t\displaystyle\overline{y}^{\prime}_{t} =t1ty¯t1+1tBRA2(x¯t1) implies\displaystyle=\tfrac{t-1}{t}\overline{y}_{t-1}+\tfrac{1}{t}\texttt{BR}^{2}_{A}(\overline{x}_{t-1})\text{ implies }
tAy¯t\displaystyle tA\overline{y}^{\prime}_{t} =(t1)Ay¯t1+ABRA2(x¯t1)\displaystyle=(t-1)A\overline{y}_{t-1}+A\cdot\texttt{BR}^{2}_{A}(\overline{x}_{t-1})
=V(t1)+EV(t1),\displaystyle=V(t-1)+E_{V}(t-1),

so x_{t}=\texttt{BR}^{1}_{A}(\overline{y}^{\prime}_{t})=e_{i(t)}, where i(t)\in\arg\max[V(t-1)+E_{V}(t-1)]. A similar argument holds for y_{t}.

So, by Lemmas 1 and 4,

limt(maxAy¯tminx¯tA)=limtmaxV(t)minU(t)t=0,\displaystyle\lim_{t\rightarrow\infty}\big{(}\!\max A\overline{y}_{t}-\min\overline{x}_{t}^{\intercal}A\big{)}=\lim_{t\rightarrow\infty}\frac{\max V(t)-\min U(t)}{t}=0,

where the first equality follows from the definition of V(t)V(t) and U(t)U(t). Combining this with the fact that, for all tt,

maxAy¯t\displaystyle\max A\overline{y}_{t} infyΔn(maxAy)=v, and\displaystyle\geq\inf_{y\in\Delta^{n}}(\max Ay)=v^{*},\text{ and}
minx¯tA\displaystyle\min\overline{x}_{t}^{\intercal}A supxΔm(minxA)=v,\displaystyle\leq\sup_{x\in\Delta^{m}}(\min x^{\intercal}A)=v^{*},

we have that

limtmaxAy¯t=limtminx¯tA=v,\displaystyle\lim_{t\rightarrow\infty}\max A\overline{y}_{t}=\lim_{t\rightarrow\infty}\min\overline{x}_{t}^{\intercal}A=v^{*},

concluding the proof of convergence of AFP. ∎
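As a sanity check on this correspondence, the following sketch (ours; the AFP update is transcribed from the display above, and the random payoff matrix and initialization are illustrative choices) verifies numerically that the anticipatory step is a perturbed best response in the sense of Definition 2:

import numpy as np

def br1(A, y):
    # Player 1's pure best response: a maximizer of A @ y.
    e = np.zeros(A.shape[0])
    e[np.argmax(A @ y)] = 1.0
    return e

def br2(A, x):
    # Player 2's pure best response: a minimizer of A.T @ x.
    e = np.zeros(A.shape[1])
    e[np.argmin(A.T @ x)] = 1.0
    return e

rng = np.random.default_rng(0)
A = rng.uniform(-1.0, 1.0, size=(4, 5))
a = np.max(np.abs(A))
x_bar = np.eye(A.shape[0])[0]                       # arbitrary first plays
y_bar = np.eye(A.shape[1])[0]
for t in range(1, 500):
    # Anticipated averages: each player folds in the opponent's best response.
    y_ant = (t * y_bar + br2(A, x_bar)) / (t + 1)
    x_ant = (t * x_bar + br1(A, y_bar)) / (t + 1)
    x_next, y_next = br1(A, y_ant), br2(A, x_ant)
    # PFP bookkeeping from the proof: V(t) = t A y_bar, E_V(t) = A BR^2_A(x_bar).
    V, E_V = t * (A @ y_bar), A @ br2(A, x_bar)
    assert np.allclose(V + E_V, (t + 1) * (A @ y_ant))   # the identity derived above
    assert np.max(np.abs(E_V)) <= a + 1e-12              # perturbation bounded by a
    # Hence argmax(V + E_V) is exactly the row AFP plays next, as required.
    x_bar = (t * x_bar + x_next) / (t + 1)
    y_bar = (t * y_bar + y_next) / (t + 1)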

B.1 Convergence rate of perturbed fictitious play

Given a 2p0s matrix game with payoff matrix Am×nA\in\mathbb{R}^{m\times n}, write t(ϵ;m,n)t^{*}(\epsilon;m,n) to denote the value of tt^{*} given by Lemma 4 such that

maxV(t)minU(t)<12ϵt for tt.\displaystyle\max V(t)-\min U(t)<\tfrac{1}{2}\epsilon t\text{ for }t\geq t^{*}. (7)

We have that t^{*}(\epsilon;1,1)=1. So, by the inductive step of the proof of Lemma 4, t^{*}(\epsilon;2,1)=t^{*}(\epsilon;1,2)=\tfrac{16a}{\epsilon}, which then implies that t^{*}(\epsilon;3,1)=t^{*}(\epsilon;2,2)=t^{*}(\epsilon;1,3)=(\tfrac{16a}{\epsilon})^{2}. Continuing inductively, we see that t^{*}(\epsilon;m,n)=(\tfrac{16a}{\epsilon})^{m+n-2}. Substituting into (7) and rearranging terms gives that

maxV(t)minU(t)t<ϵ for t(8aϵ)m+n2.\displaystyle\frac{\max V(t)-\min U(t)}{t}<\epsilon\text{ for }t\geq(\tfrac{8a}{\epsilon})^{m+n-2}.

Choosing ϵt=8at1/(m+n2)\epsilon_{t}=\frac{8a}{t^{1/(m+n-2)}} for each tt gives the result.
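Explicitly, the last step solves the threshold for \epsilon: setting t=(\tfrac{8a}{\epsilon})^{m+n-2} and rearranging gives

\displaystyle\epsilon=\frac{8a}{t^{1/(m+n-2)}},

so for every t, taking \epsilon=\epsilon_{t} as above yields \frac{\max V(t)-\min U(t)}{t}<\epsilon_{t}=8a\,t^{-1/(m+n-2)}, i.e., a convergence rate of order t^{-1/(m+n-2)}.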

Appendix C Proof of Proposition 2

For the first two subsections, we restate the definition of \Delta_{t}, t\in\mathbb{N}_{0} from (2): \Delta_{0}=[0,\dots,0]^{\intercal}\in\mathbb{Z}^{n}, \Delta_{t}=tC^{n}\,\overline{x}_{t} for each t\in\mathbb{N}, and i_{t} is the index played by FP (respectively, AFP) at time t, so

\displaystyle\Delta_{t+1,j}=\begin{cases}\Delta_{t,j}-1&\text{if }j=i_{t}-1\mod n;\\ \Delta_{t,j}+1&\text{if }j=i_{t}+1\mod n;\\ \Delta_{t,j}&\text{otherwise;}\end{cases}\qquad(2)

for each t0t\in\mathbb{N}_{0} and j{1,,n}j\in\{1,\dots,n\}.

C.1 FP: CnC^{n}

We must show that \max\Delta_{t}=\Omega_{p}(\sqrt{t}) under random tiebreaking. Throughout, whenever performing arithmetic with indices, that arithmetic is done modulo n. As in the body of the paper, define t_{m}=\inf\{t\in\mathbb{N}_{0}:\max\Delta_{t}=m\} and note the Markov inequality bound:

P(maxΔt<m)=P(tm>t)E(tm)/t=1tk=0m1E(tk+1tk).\displaystyle P(\max\Delta_{t}<m)=P(t_{m}>t)\leq E(t_{m})/t=\frac{1}{t}\sum_{k=0}^{m-1}E(t_{k+1}-t_{k}).

The bulk of the argument is in finding a bound for E(tk+1tk)E(t_{k+1}-t_{k}).

It follows by the definition of Δt\Delta_{t} and the FP update that itargmaxΔti_{t}\in\arg\max\Delta_{t}. Index i1i_{1} may be chosen arbitrarily, but for t>1t>1, it follows from (2) that

i(t+1){{it}if Δt,it>Δt,it+1;{it,it+1}if Δt,it=Δt,it+1;{it+1}if Δt,it<Δt,it+1;\displaystyle i_{(t+1)}\in\begin{cases}\{i_{t}\}&\text{if }\Delta_{t,i_{t}}>\Delta_{t,i_{t}+1};\\ \{i_{t},i_{t}+1\}&\text{if }\Delta_{t,i_{t}}=\Delta_{t,i_{t}+1};\\ \{i_{t}+1\}&\text{if }\Delta_{t,i_{t}}<\Delta_{t,i_{t}+1};\end{cases}

because the value that is incremented at time tt is the value adjacent to index iti_{t}, Δt,it+1\Delta_{t,i_{t}+1}. Let τ1=1\tau_{1}=1 and inductively define, for \ell\in\mathbb{N}, τ+1=inf{t>τ:Δt,it=Δt,it+1}\tau_{\ell+1}=\inf\{t>\tau_{\ell}:\Delta_{t,i_{t}}=\Delta_{t,i_{t}+1}\} to be the next time at which there are two possible choices for it+1i_{t+1}. This is depicted in Table 4, writing m=maxΔτm=\max\Delta_{\tau_{\ell}} and m{m,m+1}m^{\prime}\in\{m,m+1\}.

Table 4: The process of incrementing the index played under FP on C^{n}. Each column is one entry of \Delta; the middle row gives the net change applied to that entry between \tau_{\ell} and \tau_{\ell+1}.
\Delta_{\tau_{\ell}}     = [ …,  ≤0,    m,    m,   ≤0,  ≤0, … ]
net change                      −1,   −a,   +1,   +a
\Delta_{\tau_{\ell+1}}   = [ …,  ≤0,  m−a,   m′,   m′,  ≤0, … ]

As shown in the table, all entries of \Delta_{t} other than the two maximum values must be nonpositive at each t=\tau_{\ell}. This follows by induction, since it holds for \tau_{1} (\Delta_{\tau_{1}} has one positive entry) and, if it is true for some \ell\in\mathbb{N}, then in order to progress to \tau_{\ell+1} we must add some number a\geq m (over the course of a timesteps) to the next entry, which by (2) means we subtract a from the previous entry m, leaving m-a\leq 0. Finally, note that between \tau_{\ell} and \tau_{\ell+1} either we will have incremented the max from m to m+1 (if i_{\tau_{\ell}}=i_{\tau_{\ell}-1}), or we will not have (if i_{\tau_{\ell}}=i_{\tau_{\ell}-1}+1), in which case the max remains at m until some further number of increments of the index played occur. It is only on these timesteps that the maximum value can increment.

Based on this reasoning, we know that for any k there must exist \ell(k) such that t_{k}=\tau_{\ell(k)}+1. Consider the random variable \ell(k+1)-\ell(k), the number of times the index played could have been incremented while the maximum equals k. Under uniform random tiebreaking, we have that \ell(k+1)-\ell(k)\sim\text{Geometric}(1/2), since at each \tau_{\ell} there is a 1/2 chance of incrementing (“success”) or not incrementing (“failure”). So, E[\ell(k+1)-\ell(k)]=2. Now suppose that we had the bound \tau_{\ell+1}-\tau_{\ell}\leq d\max\Delta_{\tau_{\ell}} for some d>0. That would imply that

tk+1tk=τ(k+1)τ(k)=r=(k)(k+1)1τr+1τr\displaystyle t_{k+1}-t_{k}=\tau_{\ell(k+1)}-\tau_{\ell(k)}=\sum_{r=\ell(k)}^{\ell(k+1)-1}\tau_{r+1}-\tau_{r} r=(k)(k+1)1dmaxΔtk+1\displaystyle\leq\sum_{r=\ell(k)}^{\ell(k+1)-1}d\max\Delta_{t_{k+1}}
=d[(k+1)(k)](k+1).\displaystyle=d[\ell(k+1)-\ell(k)](k+1).

Taking the expectation of both sides, we get E(tk+1tk)2d(k+1)E(t_{k+1}-t_{k})\leq 2d(k+1). Plugging this into the Markov bound is sufficient to finish the argument, as explained in the proof sketch for Proposition 2 (in the paper).

All that remains is to show that τ+1τdmaxΔτ=dm\tau_{\ell+1}-\tau_{\ell}\leq d\max\Delta_{\tau_{\ell}}=dm. From the argument depicted in Table 4, we know τ+1τa+1\tau_{\ell+1}-\tau_{\ell}\leq a+1, and that am+1minΔτa\leq m+1-\min\Delta_{\tau_{\ell}}. Because Δτ\Delta_{\tau_{\ell}} has only two positive entries and i=1nΔt,i=0\sum_{i=1}^{n}\Delta_{t,i}=0, we have minΔτ2m\min\Delta_{\tau_{\ell}}\geq-2m, so τ+1τ3m+25m\tau_{\ell+1}-\tau_{\ell}\leq 3m+2\leq 5m, concluding the proof.
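The following simulation sketch (ours; the values of n, the horizon, and the seed are illustrative) implements the \Delta recursion (2) with i_{t}\in\arg\max\Delta_{t} and uniform random tiebreaking, and tracks \max\Delta_{t}, the quantity the argument above bounds from below at rate \sqrt{t}:

import numpy as np

def simulate_fp_on_cycle(n=20, T=100_000, seed=0):
    rng = np.random.default_rng(seed)
    delta = np.zeros(n, dtype=int)            # Delta_0 = [0, ..., 0]
    max_trace = np.zeros(T, dtype=int)
    for t in range(T):
        maximizers = np.flatnonzero(delta == delta.max())
        i = rng.choice(maximizers)            # i_t in argmax Delta_t, random tiebreak
        delta[(i - 1) % n] -= 1               # entry i_t - 1 (mod n) decreases by 1
        delta[(i + 1) % n] += 1               # entry i_t + 1 (mod n) increases by 1
        max_trace[t] = delta.max()
    return max_trace

trace = simulate_fp_on_cycle()
for t in (10**3, 10**4, 10**5 - 1):
    # If max Delta_t = Omega_p(sqrt(t)), this ratio stays bounded away from zero.
    print(t, trace[t], trace[t] / np.sqrt(t))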

C.2 AFP: CnC^{n}, n=3,4n=3,4

For n=3,4 we show that \max\Delta_{t}<3 for all t, which proves the result.

Based on the AFP update, it is impossible to reach \max\Delta_{t}=m+1 unless there are at least two non-adjacent entries of \Delta_{t-1} equal to m with an entry equal to m-1 in between. Otherwise, the two-step nature of the AFP update will not allow an m to be incremented to m+1. However, for n=3 and m=2 it is impossible to have two non-adjacent m’s with an m-1 in between, because the entries of \Delta_{t-1} sum to 0. Furthermore, in the n=4 case it must be that \Delta_{t}=[a,b,-a,-b] for some a and b at each t, by (2), so there cannot be three positive entries in this case either.

C.3 FP: TnT^{n}

Assume without loss of generality that x_{1}=e_{1}. Let \tau_{k}=\min\{t:x_{t}=e_{k}\} and note that the form of T^{n} implies that the indices played by FP must be nondecreasing and can increase by at most 1 at a time. We argue by strong induction that \tau_{k+1}-\tau_{k}>\tau_{k}-\tau_{k-1} for each k<n. Checking the first few terms, we have

τ1=1, so\displaystyle\tau_{1}=1,\text{ so } Tnx¯1=n1[0,n,0,,0], so\displaystyle T^{n}\overline{x}_{1}=n^{-1}[0,n,0,\dots,0]^{\intercal},\text{ so}
τ2=2, so\displaystyle\tau_{2}=2,\text{ so } Tnx¯2=21n1[n,n,n1,,0], so\displaystyle T^{n}\overline{x}_{2}=2^{-1}n^{-1}[-n,n,n-1,\dots,0]^{\intercal},\text{ so}

x3=e2x_{3}=e_{2}, and therefore τ3>3\tau_{3}>3, so τ3τ2>τ2τ1=1\tau_{3}-\tau_{2}>\tau_{2}-\tau_{1}=1. Now assume that for some fixed k<nk<n and all k{1,,k}k^{\prime}\in\{1,\dots,k\} that τk+1τk>τkτk1\tau_{k^{\prime}+1}-\tau_{k^{\prime}}>\tau_{k^{\prime}}-\tau_{k^{\prime}-1}. Note that

x¯τ(k+1)1\displaystyle\overline{x}_{\tau_{(k+1)}-1} =[τ2τ1,,τk+1τk,0,0,,0], so,\displaystyle=[\tau_{2}-\tau_{1},\dots,\tau_{k+1}-\tau_{k},0,0,\dots,0]^{\intercal},\text{ so,}
Tnx¯τ(k+1)1\displaystyle T^{n}\overline{x}_{\tau_{(k+1)}-1} [<,<,,<,(τk+1τk)(nk+1),0,,0], and\displaystyle\propto[<,<,\dots,<,(\tau_{k+1}-\tau_{k})(n-k+1),0,\dots,0]^{\intercal},\text{ and}
Tnx¯\displaystyle T^{n}\overline{x}_{\ell} [<,<,,<,(τk+1τk)(nk+1),(τk+1+1)(nk),,0],\displaystyle\propto[<,<,\dots,<,(\tau_{k+1}-\tau_{k})(n-k+1),(\ell-\tau_{k+1}+1)(n-k),\dots,0]^{\intercal},

for {τk+1,,τk+21}\ell\in\{\tau_{k+1},\dots,\tau_{k+2}-1\}, where the ‘<<’ signs indicate values that are no greater than their neighbors on the right; this holds by the inductive hypothesis and definition of TnT^{n}. We know that for steps τk+1,τk+21\tau_{k+1},\dots\tau_{k+2}-1, FP will play ek+1e_{k+1}, and we know that τk+2\tau_{k+2} is the first timestep at which (τk+2τk+1)(nk)(τk+1τk)(nk+1)(\tau_{k+2}-\tau_{k+1})(n-k)\geq(\tau_{k+1}-\tau_{k})(n-k+1) (or else FP would have played k+1k+1 at τk+2\tau_{k+2}, a contradiction). It follows that τk+2τk+1>τk+1τk\tau_{k+2}-\tau_{k+1}>\tau_{k+1}-\tau_{k}, as desired.

This result implies that \tau_{k+1}-\tau_{k}\geq k for each k<n, so we have \tau_{k}\geq 1+\sum_{j=1}^{k-1}j=1+k(k-1)/2\geq(k-1)^{2}/2 for each k. Inverting this, we get that t\mapsto\sqrt{2t}+1 is an upper bound on k(t), the index played by FP at time t. Combining the expression for T^{n}\overline{x}_{\ell}, \ell\in\{\tau_{k+1},\dots,\tau_{k+2}-1\}, with this bound, we get that \max\,T^{n}\overline{x}_{t}=n^{-1}t^{-1}(\tau_{k(t)+1}-\tau_{k(t)})(n-k(t)+1)\geq n^{-1}t^{-1}(n-k(t)+1)\geq n^{-1}t^{-1}(n-\sqrt{2t})=\Omega(1/\sqrt{t}).

C.4 AFP: TnT^{n}

We argue first by strong induction that xt=emin(t,n)x_{t}=e_{\min(t,n)} for all tt. Assume without loss of generality that x1=e1x_{1}=e_{1}. Now assume that, for some fixed τ\tau, xt=emin(t,n)x_{t}=e_{\min(t,n)} for tτt\leq\tau.

If τ<n\tau<n, then under the inductive hypothesis,

Tnx¯τ\displaystyle T^{n}\overline{x}_{\tau} =τ1Tn[1,1,,1τ, 0,,0]\displaystyle=\tau^{-1}T^{n}\;[\overbrace{1,1,\dots,1}^{\tau},\;0,\dots,0]^{\intercal}
=τ1n1[n,1,,(nτ+2),(nτ+1),,0],\displaystyle=\tau^{-1}n^{-1}[-n,1,\dots,(n-\tau+2),\;(n-\tau+1),\dots,0]^{\intercal},

for which the largest value is at index τ\tau, so x=eτx^{\prime}=e_{\tau}, so then Tnx¯τT^{n}\overline{x}^{\prime}_{\tau} will have largest value 2n1(nτ+1)2n^{-1}(n-\tau+1) at index τ+1\tau+1, so xτ+1=eτ+1x_{\tau+1}=e_{\tau+1}. If τn\tau\geq n, then under the inductive hypothesis, Tnx¯τ=τ1Tn[1,1,,1,τn+1]=τ1n1[n,1,1,,1,12(τn),2]T^{n}\overline{x}_{\tau}=\tau^{-1}T^{n}[1,1,\dots,1,\tau-n+1]^{\intercal}=\tau^{-1}n^{-1}[-n,1,1,\dots,1,1-2(\tau-n),2]^{\intercal}, at which point xτ=xτ+1=enx^{\prime}_{\tau}=x_{\tau+1}=e_{n}, as desired.

Finally, we are interested in \max T^{n}\overline{x}_{t} for t<n, which, by the calculation above, equals \frac{n-t+2}{nt}=O(1/t).

Appendix D Additional figures

Figure 8: Comparisons of FP and AFP on C20C^{20} and T20T^{20} with random tiebreaking. As before, highlighted regions indicate 10th and 90th percentiles across 10,000 runs.
Figure 9: Comparison of FP with versions of AFP initialized with different numbers of steps of FP, based on worst case performance, as in Figure 3. When there is an exact tie, credit is split evenly, resulting in a solid line at 0.5 for FP compared with itself.

Appendix E RL experiment hyperparameters

APPO hyperparameters              TinyFighter      RWS
Discount factor γ                 0.99             0.995
Value function coefficient        0.5              0.5
Entropy coefficient               0.01             0.02
Learning rate                     3e-4             1.5e-4
Gradient clipping                 10               80
Batch size                        5,120            6,000
Number of workers                 80               319
Rollout fragment length           64               64
Other                             Ray defaults (as of Nov 2022) (Moritz et al., 2018)
Running With Scissors Environment
Reward shaping                    Policy t=1 played against a random policy and received a unit of reward for each ‘scissors’ picked up and a negative unit of reward for each ‘rock’ and ‘paper’ picked up.
Reward scaling (t=1)              10
Reward scaling (t>1)              100
Sprite size                       3x3 pixels
Agent field of view               5x5 grid units (15x15 pixels)
Agent memory                      4 frames
Table 5: Population neural network \Pi_{\theta}(t) architecture for TinyFighter. The network takes as inputs a 13-dimensional state vector, an action mask, and the NeuPL opponent distribution and opponent index (see the Name and Input columns). The Shape column does not include the batch dimension. We use ReLU activation functions except for the output layers, which use the identity function (‘linear’). We used the Keras library (Chollet et al., 2015) to implement the model.
Name Type Shape # Parameters Input
state InputLayer [13] 0
dense Dense [128] 1,792 state
dense_1 Dense [128] 16,512 dense
neupl_opponent_dist InputLayer [4] 0
policy_head_inputs Concatenate [145] 0 state, dense_1, neupl_opponent_dist
neupl_opponent_idx InputLayer [5] 0
dense_2 Dense [256] 37,376 policy_head_inputs
value_head_inputs Concatenate [150] 0 state, dense_1, neupl_opponent_dist, neupl_opp…
dense_3 Dense [128] 32,896 dense_2
dense_6 Dense [256] 38,656 value_head_inputs
dense_4 Dense [64] 8,256 dense_3
action_mask InputLayer [4] 0
dense_7 Dense [128] 32,896 dense_6
dense_5 Dense [4] 260 dense_4
lambda Lambda [4] 0 action_mask
dense_8 Dense [64] 8,256 dense_7
policy_out Add [4] 0 dense_5, lambda
value_out Dense [1] 65 dense_8
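For reference, here is a Keras sketch that reproduces the layer sizes and wiring in Table 5. Layer names, shapes, and connections follow the table; the masking logic inside the Lambda layer (adding a large negative value to the logits of unavailable actions) is our assumption, since the table records only that layer’s type and shape.

from tensorflow import keras
from tensorflow.keras import layers

def build_tinyfighter_network():
    state = keras.Input(shape=(13,), name="state")
    opp_dist = keras.Input(shape=(4,), name="neupl_opponent_dist")
    opp_idx = keras.Input(shape=(5,), name="neupl_opponent_idx")
    action_mask = keras.Input(shape=(4,), name="action_mask")

    # Shared torso: two 128-unit ReLU layers on the raw state.
    x = layers.Dense(128, activation="relu", name="dense")(state)
    x = layers.Dense(128, activation="relu", name="dense_1")(x)

    # Policy head sees the state, torso features, and opponent distribution.
    policy_in = layers.Concatenate(name="policy_head_inputs")([state, x, opp_dist])
    p = layers.Dense(256, activation="relu", name="dense_2")(policy_in)
    p = layers.Dense(128, activation="relu", name="dense_3")(p)
    p = layers.Dense(64, activation="relu", name="dense_4")(p)
    logits = layers.Dense(4, activation="linear", name="dense_5")(p)

    # Assumed masking: push logits of unavailable actions toward -inf.
    mask_penalty = layers.Lambda(lambda m: (1.0 - m) * -1e9, name="lambda")(action_mask)
    policy_out = layers.Add(name="policy_out")([logits, mask_penalty])

    # Value head additionally sees the opponent index.
    value_in = layers.Concatenate(name="value_head_inputs")([state, x, opp_dist, opp_idx])
    v = layers.Dense(256, activation="relu", name="dense_6")(value_in)
    v = layers.Dense(128, activation="relu", name="dense_7")(v)
    v = layers.Dense(64, activation="relu", name="dense_8")(v)
    value_out = layers.Dense(1, activation="linear", name="value_out")(v)

    return keras.Model(inputs=[state, opp_dist, opp_idx, action_mask],
                       outputs=[policy_out, value_out])

The parameter counts implied by this wiring match the “# Parameters” column of the table.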
Table 6: Population neural network \Pi_{\theta}(t) architecture for RWS. The neural network takes as inputs: an image array, the ratios of rock, paper, and scissors in the player’s inventory, and an agent index t which gets mapped to a vector in \mathbb{R}^{128} by an embedding layer. The Shape column does not include the batch dimension. We use ReLU activation functions except for the output layers, which use the identity function (‘linear’). We used the Keras library (Chollet et al., 2015) to implement the model.
Name Type Shape # Parameters Input
rgb_history InputLayer [4, 15, 15, 3] 0
rescaling Rescaling [4, 15, 15, 3] 0 rgb_history
conv2d Conv2D [4, 5, 5, 32] 896 rescaling
activation Activation [4, 5, 5, 32] 0 conv2d
conv2d_1 Conv2D [4, 5, 5, 16] 528 activation
activation_1 Activation [4, 5, 5, 16] 0 conv2d_1
reshape Reshape [4, 400] 0 activation_1
inv_ratios_history InputLayer [4, 3] 0
concatenate Concatenate [4, 403] 0 reshape, inv_ratios_history
dense Dense [4, 256] 103,424 concatenate
dense_1 Dense [4, 256] 65,792 dense
lstm LSTM [256] 525,312 dense_1
neupl_opponent_dist InputLayer [11] 0
neupl_opponent_idx InputLayer [12] 0
policy_head_inputs Concatenate [267] 0 lstm, neupl_opponent_dist
value_head_inputs Concatenate [279] 0 lstm, neupl_opponent_dist, neupl_opponent_idx
policy_1 Dense [512] 137,216 policy_head_inputs
value_1 Dense [512] 143,360 value_head_inputs
policy_2 Dense [256] 131,328 policy_1
value_2 Dense [256] 131,328 value_1
policy_3 Dense [128] 32,896 policy_2
value_3 Dense [128] 32,896 value_2
policy_out Dense [30] 3,870 policy_3
value_out Dense [1] 129 value_3

Appendix F Vanilla RL implementations of FP and AFP

Algorithm 2 gives a simple reinforcement learning implementation of FP and AFP that does not rely on Neural Population Learning or the reservoir buffer sampling of Heinrich et al., (2015). Notes:

  • i-i refers to the opponent of ii.

  • ReinforcementLearning(π,μ)\textsc{ReinforcementLearning}(\pi,\mu) plays π\pi against μ\mu for some number of episodes, gathers data, and updates π\pi by reinforcement learning.

  • Lines 8 and 9 are the approximate reinforcement learning analogue to computing a best response to the average of all previous strategies. For details, see Heinrich et al., (2015), which uses a more complicated setup in order to limit storage requirements (constantly learning β\beta by supervised learning).

Algorithm 2 FP/AFP with reinforcement learning
1:Choose setting: FP or AFP.
2:Initialize policies π11\pi^{1}_{1} and π12\pi^{2}_{1}.
3:Initialize policy stores Π1={π11},Π2={π12}\Pi^{1}=\{\pi^{1}_{1}\},\Pi^{2}=\{\pi^{2}_{1}\}.
4:for t=1,2,t=1,2,\dots do
5:     Initialize πt+11,πt+12\pi^{1}_{t+1},\pi^{2}_{t+1}.
6:     for i=1,2i=1,2 do
7:         while per-timestep compute budget remains do
8:              Sample βUniform(Πi)\beta\sim\text{Uniform}(\Pi^{-i}).
9:              πt+1iReinforcementLearning(πt+1i,β)\pi^{i}_{t+1}\leftarrow\textsc{ReinforcementLearning}(\pi^{i}_{t+1},\beta).
10:         end while
11:         ΠiΠi{πt+1i}\Pi^{i}\leftarrow\Pi^{i}\cup\{\pi^{i}_{t+1}\}.
12:     end for
13:     if setting is AFP and tt is odd then
14:         Π1Π1{πt1}\Pi^{1}\leftarrow\Pi^{1}\setminus\{\pi^{1}_{t}\} and Π2Π2{πt2}\Pi^{2}\leftarrow\Pi^{2}\setminus\{\pi^{2}_{t}\}.
15:     end if
16:end for
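For concreteness, here is a minimal Python skeleton of Algorithm 2’s population bookkeeping. ReinforcementLearning and the policy objects are stubs (the actual experiments use the APPO setup of Appendix E), and a fixed per-iteration update count stands in for the compute-budget check on line 7.

import random

def reinforcement_learning(policy, opponent):
    # Stub: play `policy` against `opponent`, gather data, apply RL updates (line 9).
    return policy

def make_policy():
    # Stub: initialize a fresh policy (lines 2 and 5).
    return object()

def fp_afp_rl(setting="AFP", iterations=10, updates_per_iteration=100):
    populations = {1: [make_policy()], 2: [make_policy()]}       # Pi^1, Pi^2 (line 3)
    for t in range(1, iterations + 1):
        new_policies = {1: make_policy(), 2: make_policy()}      # pi^i_{t+1} (line 5)
        for i in (1, 2):
            opponents = populations[2 if i == 1 else 1]          # Pi^{-i}
            for _ in range(updates_per_iteration):               # compute budget (line 7)
                beta = random.choice(opponents)                  # beta ~ Uniform(Pi^{-i}) (line 8)
                new_policies[i] = reinforcement_learning(new_policies[i], beta)
            populations[i].append(new_policies[i])               # line 11
        if setting == "AFP" and t % 2 == 1:                      # lines 13-15
            for i in (1, 2):
                populations[i].pop(-2)                           # discard pi^i_t
    return populations

populations = fp_afp_rl(setting="AFP", iterations=6)
print(len(populations[1]), len(populations[2]))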