Anticipatory Fictitious Play
Abstract
Fictitious play is an algorithm for computing Nash equilibria of matrix games. Recently, machine learning variants of fictitious play have been successfully applied to complicated real-world games. This paper presents a simple modification of fictitious play which is a strict improvement over the original: it has the same theoretical worst-case convergence rate, is equally applicable in a machine learning context, and enjoys superior empirical performance. We conduct an extensive comparison of our algorithm with fictitious play, proving an optimal convergence rate for certain classes of games, demonstrating superior performance numerically across a variety of games, and concluding with experiments that extend these algorithms to the setting of deep multiagent reinforcement learning.
1 Introduction
Matrix games (also known as normal-form games) are an abstract model for interactions between multiple decision makers. Fictitious play (FP) (Brown, 1951) is a simple algorithm for two-player matrix games. In FP, each player starts by playing an arbitrary strategy, then proceeds iteratively by playing the best strategy against the empirical average of what the other has played so far. In some cases, such as two-player, zero-sum games, the empirical average strategies will converge to a Nash equilibrium.
Although there are more efficient algorithms for computing Nash equilibria in matrix games (Adler, 2013; Shoham and Leyton-Brown, 2008), there are a few reasons why fictitious play remains a topic of interest. First, it serves as a model for how humans might arrive at Nash equilibria in real-world interactions (Luce and Raiffa, 1989; Brown, 1951; Conlisk, 1993a). Second, FP is extensible to real-world games which are large and complicated. Our work is primarily motivated by this second application.
The initial step towards extending FP to real-world games was taken by Kuhn (1953), which established the equivalence of normal-form games (represented by matrices) and extensive-form games (represented by trees with additional structure). Loosely speaking, this means that results which apply to matrix games may also apply to much more complicated decision making problems, such as ones that incorporate temporal elements or varying amounts of hidden information. Leveraging this equivalence, Heinrich et al. (2015) proposed an extension of FP to the extensive-form setting, full-width extensive-form fictitious play (XFP), and proved that it converges to a Nash equilibrium in two-player, zero-sum games. Heinrich et al. (2015) also proposed Fictitious Self-Play (FSP), a machine learning approximation to XFP. In contrast to XFP, which is intractable for real-world games whose states cannot be enumerated in practice, FSP relies only on basic operations which can be approximated in a machine learning setting, like averaging (via supervised learning) and computing best responses (via reinforcement learning). In this way, FSP provides a version of fictitious play suitable for arbitrarily complex two-player, zero-sum games. Not long after the introduction of FSP, Lanctot et al. (2017) presented Policy Space Response Oracles (PSRO), a general framework for fictitious-play-like reinforcement learning algorithms in two-player, zero-sum games. These ideas were employed as part of the groundbreaking AlphaStar system that defeated professional players at StarCraft II (Vinyals et al., 2019).
We introduce anticipatory fictitious play (AFP), a simple variant of fictitious play which is also reinforcement-learning-friendly. In contrast to FP, where players iteratively update to exploit an estimate of the opponent’s strategy, players in AFP update proactively to respond to the strategy that the opponent would use to exploit them.
We prove that AFP is guaranteed to converge to a Nash equilibrium in two-player zero-sum games and establish an optimal convergence rate for two classes of games that are of particular interest in learning for real-world games (Balduzzi et al., 2019): a class of “cyclic” games and a class of “transitive” games. Numerical comparisons suggest that AFP eventually outperforms FP on virtually any game, and that its advantage over FP grows as games get larger. Finally, we propose a reinforcement learning version of AFP that is implementable as a one-line modification of an RL implementation of FP, such as FSP. These algorithms are applied to Running With Scissors (Vezhnevets et al., 2020), a stochastic, competitive multiagent environment with cyclic dynamics.
1.1 Related work
Aside from the literature on fictitious play and its extension to reinforcement learning, there has been substantial work on “opponent-aware” learning algorithms. These algorithms incorporate information about opponent updates and are quite similar to anticipatory fictitious play.
In the context of evolutionary game theory, Conlisk (1993a) proposed an “extrapolation process,” whereby two players in a repeated game each forecast their opponent’s strategy and respond to that forecast. Unlike AFP, where opponent responses are explicitly calculated, the forecasts are made by linear extrapolation based on the change in the opponent’s strategy over the last two timesteps. Conlisk (1993b) proposed two types of “defensive adaptation,” which are quite similar in spirit to AFP but differ in some important details; most importantly, while they consider the opponent’s empirical payoffs at each step, they do not respond directly to what the opponent is likely to play given those payoffs.
Shamma and Arslan (2005) proposed derivative action fictitious play, a variant of fictitious play in the continuous-time setting in which a best response to a forecasted strategy is played, as in Conlisk (1993a). The algorithm uses a derivative-based forecast that is analogous to the discrete-time anticipated response of AFP. However, their convergence results rely on a fixed, positive entropy bonus that incentivizes players to play more randomly, and they do not consider the discrete-time case.
Zhang and Lesser (2010) proposed Infinitesimal Gradient Ascent with Policy Prediction, in which two policy gradient learning algorithms continuously train against a forecast of the other’s policy. Their algorithm represents the core idea of AFP, albeit implemented in a different setting. However, their proof of convergence is limited to two-player, two-action games. Foerster et al. (2018) and Letcher et al. (2018) take this idea further, modifying the objective of a reinforcement learning agent so that it accounts for how changes in the agent will change the anticipated learning of the other agents. This line of research is oriented more towards equilibrium finding in general-sum games (e.g., social dilemmas), and less towards efficient estimation of equilibria in strictly competitive two-player environments.
2 Preliminaries
A (finite) two-player zero-sum game (2p0s game) is represented by a matrix $A \in \mathbb{R}^{n_1 \times n_2}$, so that when player 1 plays $i \in \{1, \dots, n_1\}$ and player 2 plays $j \in \{1, \dots, n_2\}$, the players observe payoffs $A_{ij}$ and $-A_{ij}$, respectively. Let $\Delta^k$ be the set of probability vectors representing distributions over $k$ elements. Then a strategy for player 1 is an element $p \in \Delta^{n_1}$ and, similarly, a strategy for player 2 is an element $q \in \Delta^{n_2}$.
A Nash equilibrium in a 2p0s game is a pair of strategies $(p^*, q^*)$ such that each strategy is optimal against the other, i.e.,
$$ p^{*\top} A q^* \;=\; \max_{p \in \Delta^{n_1}} p^\top A q^* \;=\; \min_{q \in \Delta^{n_2}} p^{*\top} A q. $$
The Nash equilibrium represents a pair of strategies that are “stable” in the sense that no player can earn a higher payoff by changing their strategy. At least one Nash equilibrium is guaranteed to exist in any finite game (Nash Jr, 1950).
Nash equilibria in 2p0s games enjoy a nice property not shared by Nash equilibria in general: in 2p0s games, if $(p_1, q_1)$ and $(p_2, q_2)$ are Nash equilibria, then $(p_1, q_2)$ is also a Nash equilibrium. In a 2p0s game, we define a Nash strategy to be one that occurs as part of some Nash equilibrium. Note that the aforementioned property does not hold in general, so normally it is only valid to describe collections of strategies (one per player) as equilibria.
A solution to a 2p0s game is a pair of strategies $(p^*, q^*)$ such that
$$ \min_{q} \, p^{*\top} A q \;=\; \max_{p} \min_{q} \, p^\top A q \qquad \text{and} \qquad \max_{p} \, p^\top A q^* \;=\; \min_{q} \max_{p} \, p^\top A q. $$
We call $v = \max_{p} \min_{q} p^\top A q = \min_{q} \max_{p} p^\top A q$, which is unique, the value of the game. Nash equilibria are equivalent to solutions of 2p0s games (Shoham and Leyton-Brown, 2008), which is why we use the same notation. Finally, the exploitability of a strategy is the difference between the value of the game and the worst-case payoff of that strategy. So the exploitability of $p$ is $v - \min_q p^\top A q$, and the exploitability of $q$ is $\max_p p^\top A q - v$.
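For concreteness, the following numpy sketch (function and variable names are ours) evaluates the worst-case payoffs of a strategy pair and the sum of the two exploitabilities, using the fact that some pure strategy is always among the best deviations:

```python
import numpy as np

def worst_case_payoffs(A, p, q):
    """Worst-case payoffs of row strategy p and column strategy q in the game A."""
    # A pure deviation is always among the best deviations, so it suffices to
    # scan the columns (for the row player) and the rows (for the column player).
    return np.min(p @ A), -np.max(A @ q)

def nash_conv(A, p, q):
    """Sum of the two exploitabilities; it is zero exactly when (p, q) is a solution."""
    return np.max(A @ q) - np.min(p @ A)

# Example: uniform play is a Nash equilibrium of Rock Paper Scissors.
rps = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
print(nash_conv(rps, np.ones(3) / 3, np.ones(3) / 3))  # 0.0
```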
2.1 Fictitious play
Let $e_1, e_2, \dots$ denote the standard basis vectors in $\mathbb{R}^{n_1}$ or $\mathbb{R}^{n_2}$. Let $\mathrm{BR}_i$ be the best response operator for player $i$, so that
$$ \mathrm{BR}_1(q) = e_i \ \text{ for some } i \in \arg\max_{i'} (A q)_{i'}, \qquad \mathrm{BR}_2(p) = e_j \ \text{ for some } j \in \arg\min_{j'} (p^\top A)_{j'}. $$
Fictitious play is given by the following process. Let $\bar p_1 = e_{i_0}$ and $\bar q_1 = e_{j_0}$ be initial strategies for some $i_0 \in \{1, \dots, n_1\}$, $j_0 \in \{1, \dots, n_2\}$. For each $t \ge 1$, let
$$ p_{t+1} = \mathrm{BR}_1(\bar q_t), \qquad q_{t+1} = \mathrm{BR}_2(\bar p_t), \qquad \bar p_{t+1} = \frac{t\,\bar p_t + p_{t+1}}{t+1}, \qquad \bar q_{t+1} = \frac{t\,\bar q_t + q_{t+1}}{t+1}. $$
In other words, at each timestep $t$, each player calculates the strategy that is the best response to their opponent’s average strategy so far. Robinson (1951) proved that the pair of average strategies $(\bar p_t, \bar q_t)$ converges to a solution of the game by showing that the exploitabilities of both strategies converge to zero.
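As a concrete reference, here is a minimal numpy sketch of this process; the function names and the lowest-index tiebreaking (via np.argmax / np.argmin) are our choices, not a transcription of the paper’s code:

```python
import numpy as np

def br_row(A, q):
    """Pure best response of the row player to column strategy q (lowest index wins ties)."""
    e = np.zeros(A.shape[0])
    e[np.argmax(A @ q)] = 1.0
    return e

def br_col(A, p):
    """Pure best response of the column player to row strategy p (lowest index wins ties)."""
    e = np.zeros(A.shape[1])
    e[np.argmin(p @ A)] = 1.0
    return e

def fictitious_play(A, n_steps, i0=0, j0=0):
    """Run fictitious play for n_steps iterations; return the average strategies."""
    p_bar = np.zeros(A.shape[0]); p_bar[i0] = 1.0
    q_bar = np.zeros(A.shape[1]); q_bar[j0] = 1.0
    for t in range(1, n_steps):
        p, q = br_row(A, q_bar), br_col(A, p_bar)
        p_bar += (p - p_bar) / (t + 1)   # running average of the strategies played
        q_bar += (q - q_bar) / (t + 1)
    return p_bar, q_bar
```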
Theorem 1.
(Robinson, 1951) If $(\bar p_t, \bar q_t)$ is an FP process for a 2p0s game with payoff matrix $A$, then
$$ \lim_{t \to \infty} \min_q \bar p_t^\top A q \;=\; \lim_{t \to \infty} \max_p p^\top A \bar q_t \;=\; v, $$
where $v$ is the value of the game. Furthermore, a bound on the rate of convergence is given by
$$ \max_p p^\top A \bar q_t \;-\; \min_q \bar p_t^\top A q \;=\; O\!\left(t^{-1/(n_1 + n_2 - 2)}\right), $$
where $n_1$ and $n_2$ are the numbers of pure strategies available to the two players.
3 Anticipatory Fictitious Play
Although FP converges to a Nash equilibrium in 2p0s games, it may take an indirect path. For example, in Rock Paper Scissors with tiebreaking towards the minimum strategy index, the sequence of average strategies orbits the Nash equilibrium, slowly spiraling in with decreasing radius, as shown on the left in Figure 1. This tiebreaking scheme is not special; under random tiebreaking, the path traced by FP is qualitatively the same, resulting in slow convergence with high probability, as shown in Figure 2.


Given that FP appears to do an especially poor job of decreasing exploitability in the case above, we consider alternatives. Inspired by gradient descent, we ask whether there is a way to follow the gradient of exploitability towards the Nash equilibrium (without explicitly computing it, as is done in Lockhart et al. (2019)). By definition, the best response to the average strategy is a strategy that maximally exploits the average strategy. So, a natural choice of update to reduce exploitability is to move the average strategy in a direction that counters this best response.
To this end, we propose anticipatory fictitious play (AFP), a version of fictitious play that “anticipates” the best response an adversary might play against the current strategy, and then plays the best response to an average of that and the adversary’s current average strategy. (Simply responding directly to the opponent’s response does not work; see Appendix A.) Alternatively, one can think of AFP as a version of FP that “forgets” every other best response it calculates. This latter interpretation enables a convenient implementation of AFP as a modification of FP as demonstrated in Algorithm 2.
Remark.
The AFP update is seemingly consistent with human psychology: it is quite intuitive to imagine how an adversary might try to exploit oneself and to respond in order to best counter that strategy. Given that fictitious play provides a model for how humans or other non-algorithmic decision makers might arrive at an equilibrium in practice (Luce and Raiffa, 1989; Brown, 1951), anticipatory fictitious play offers a new model for how this may occur. We leave further consideration of this topic to future work.
AFP is given by the following process. For some $i_0$ and $j_0$, let $\bar p_1 = e_{i_0}$ and $\bar q_1 = e_{j_0}$ be initial strategies for each player. For each $t \ge 1$, define
$$
\begin{aligned}
p_{t+1} &= \mathrm{BR}_1(\bar q_t), & q_{t+1} &= \mathrm{BR}_2(\bar p_t), \\
\tilde p_{t+1} &= \frac{t\,\bar p_t + p_{t+1}}{t+1}, & \tilde q_{t+1} &= \frac{t\,\bar q_t + q_{t+1}}{t+1}, \\
\bar p_{t+1} &= \frac{t\,\bar p_t + \mathrm{BR}_1(\tilde q_{t+1})}{t+1}, & \bar q_{t+1} &= \frac{t\,\bar q_t + \mathrm{BR}_2(\tilde p_{t+1})}{t+1}.
\end{aligned}
\tag{1}
$$
Here, $p_{t+1}$ and $q_{t+1}$ are the best responses to the opponent’s average strategy; they are the strategies that FP would have played at the current timestep. In AFP, each player “anticipates” this attack and defends against it by calculating the opponent’s average strategy that includes this attack ($\tilde q_{t+1}$ and $\tilde p_{t+1}$, respectively), and then playing the best response to this anticipated average strategy of the opponent.
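The following sketch implements this update, reusing br_row and br_col from the FP sketch in Section 2.1; it follows our reading of update (1) above (two extra best-response computations per step), so treat it as illustrative rather than as the authors’ code:

```python
import numpy as np

def anticipatory_fictitious_play(A, n_steps, i0=0, j0=0):
    """Run AFP for n_steps iterations; return the average strategies."""
    p_bar = np.zeros(A.shape[0]); p_bar[i0] = 1.0
    q_bar = np.zeros(A.shape[1]); q_bar[j0] = 1.0
    for t in range(1, n_steps):
        # Anticipate: the best responses FP would have played at this step.
        p_fp, q_fp = br_row(A, q_bar), br_col(A, p_bar)
        # Form the opponent averages as if those responses had been played...
        p_tilde = p_bar + (p_fp - p_bar) / (t + 1)
        q_tilde = q_bar + (q_fp - q_bar) / (t + 1)
        # ...and respond to them; only these responses enter the averages.
        p, q = br_row(A, q_tilde), br_col(A, p_tilde)
        p_bar += (p - p_bar) / (t + 1)
        q_bar += (q - q_bar) / (t + 1)
    return p_bar, q_bar
```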
In Figure 1, we see the effect of anticipation geometrically: AFP “cuts corners,” limiting the extent to which it overshoots its target. In contrast, FP aggressively overshoots, spending increasingly many steps playing strategies that take it further from its goal. The effect on algorithm performance is pronounced, with AFP hovering near equilibrium while FP slowly winds its way there.
Of course, RPS is a very specific example. It is natural to wonder: is AFP good in general? The rest of the paper seeks to answer that question. We begin by proving that AFP converges to a Nash equilibrium.
Proposition 1.
If $(\bar p_t, \bar q_t)$ is an AFP process for a 2p0s game with payoff matrix $A$, the conclusion of Theorem 1 holds for this process. Namely, AFP converges to a Nash equilibrium, and it converges no slower than the rate that bounds FP.
Proof.
(Idea) Generalize the original proof of Theorem 1. We work with accumulating payoff vectors $U(t)$ and $V(t)$. In the original proof, a player 1 strategy index $i$ is called eligible at time $t$ if $U_i(t) = \max_k U_k(t)$ (similarly for player 2). We replace eligibility with the notion of $\delta$-eligibility, satisfied by an index $i$ if it maximizes a perturbed payoff vector $U(t) + \varepsilon$ for some perturbation $\varepsilon$ whose size is controlled by $\delta$. Essentially, an index is $\delta$-eligible if it corresponds to a best response to a perturbation of the opponent’s history or a perturbation of the game itself. The original proof structure can be preserved in light of this replacement, requiring only minor modifications to some arguments and new constants. Treating the in-between strategies in AFP, $\tilde p_t$ and $\tilde q_t$, as perturbations of $\bar p_t$ and $\bar q_t$, it follows that AFP satisfies the conditions for the generalized result. A complete proof is given in Appendix B. ∎
4 Application to normal form games
Proposition 1 establishes that AFP converges and that AFP’s worst-case convergence rate satisfies the same bound as FP’s, where the worst-case is with respect to games and tiebreaking rules. The next proposition shows that for two classes of games of interest, AFP not only outperforms FP, but attains an optimal rate. In both classes, our proofs will reveal that AFP succeeds where FP fails because AFP avoids playing repeated strategies. The results hold for general applications of FP and AFP rather than relying on specific tiebreaking rules.
The classes of games that we analyze are intended to serve as abstract models of two fundamental aspects of real-world games: transitivity (akin to “skillfulness;” some ways of acting are strictly better than others) and nontransitivity (most notably, in the form of strategy cycles like Rock < Paper < Scissors < Rock). Learning algorithms for real-world games must reliably improve along the transitive dimension while accounting for the existence of strategy cycles; see Balduzzi et al. (2019) for further discussion.
For each integer $n$, define a cyclic payoff matrix $C_n$ and a “transitive” payoff matrix $T_n$ as follows. The game given by $C_n$ is purely cyclic: each strategy beats the one before it and loses to the one after it (indices modulo $n$), and $C_3$ is the game Rock Paper Scissors. For each $n$, the uniform strategy $(1/n, \dots, 1/n)$ is a Nash equilibrium strategy of $C_n$. The game given by $T_n$ could be considered “transitive:” each strategy is in some sense better than the last, and the pure strategy $e_n$ is a Nash equilibrium strategy. The payoffs are chosen so that each strategy $k+1$ is the unique best response to strategy $k$, so that an algorithm that learns by playing best responses will progress one strategy at a time rather than skipping directly to strategy $n$ (as would happen if FP or AFP were applied to a game that is transitive in a stronger sense, such as one with a single dominant strategy; cf. the definition of transitivity in Balduzzi et al. (2019)). Note: $T_n$ could be defined equivalently without its scaling factor, but this would create a spurious dependence on $n$ for the rates we derive.
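For reference, a sketch of the cyclic matrix just described (strategy $i$ beats $i-1$ and loses to $i+1$, indices modulo $n$); the exact scaling used in $T_n$ is not reproduced, so only $C_n$ is shown:

```python
import numpy as np

def cyclic_game(n):
    """Payoff matrix of the purely cyclic game C_n described in the text."""
    C = np.zeros((n, n))
    for i in range(n):
        C[i, (i - 1) % n] = 1.0    # strategy i beats the strategy before it
        C[i, (i + 1) % n] = -1.0   # strategy i loses to the strategy after it
    return C

print(cyclic_game(3))  # recovers the Rock Paper Scissors payoff matrix
```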
The following proposition establishes a convergence rate of $O(1/t)$ for AFP applied to $C_n$ and $T_n$. This rate is optimal within the class of time-averaging algorithms, because the rate at which an average changes is $O(1/t)$. Note: we say a random variable $X_t = \Omega_P(f(t))$ if, for any $\epsilon > 0$, there exists $\delta > 0$ such that $P\big(X_t < \delta f(t)\big) < \epsilon$ for all $t$.
Proposition 2.
FP and AFP applied symmetrically to $C_n$ and $T_n$ obtain the rates given in Table 1. In particular, if $(\bar p_t, \bar q_t)$ is an FP or AFP process for a 2p0s game with payoff matrix $C_n$ or $T_n$ with tiebreaking as indicated, then the exploitability of the average strategy satisfies the bound given in the table. Tiebreaking refers to the choice of best response when there are multiple maximizers. The “random” tiebreaking chooses between tied strategies independently and uniformly at random. For entries marked with “arbitrary” tiebreaking, the convergence rate holds no matter how tiebreaks are chosen.
Algorithm | Game | Tiebreaking | Rate | Caveats
---|---|---|---|---
FP | $C_n$ | random | $\Omega_P(1/\sqrt{t})$ |
AFP | $C_n$ | arbitrary | $O(1/t)$ | $n \in \{3, 4\}$
FP | $T_n$ | arbitrary | | $t \le \tau_n$
AFP | $T_n$ | arbitrary | $O(1/t)$ |
Proof.
(Sketch, $C_n$) Full proofs of all cases are provided in Appendix C. Define $u(t) = t\, C_n \bar q_t$ (the accumulated payoffs of the pure strategies against the opponent’s history) and $M_t = \max_i u_i(t)$ for each $t$. The desired results are equivalent to $M_t = \Omega_P(\sqrt{t})$ under FP for the given tiebreaking rule, and to $M_t$ being bounded under AFP for $n \in \{3, 4\}$. Let $k_t$ be the index played by FP (AFP) at time $t$ (so $u(t+1) = u(t) + C_n e_{k_t}$). It follows that
$$ u_i(t+1) \;=\; u_i(t) + \mathbb{1}\{i \equiv k_t + 1\} - \mathbb{1}\{i \equiv k_t - 1\} \pmod{n} \tag{2} $$
for each $i$ and $t$. Note that the entries of $u(t)$ always sum to zero.
In the case of FP, it is easy to verify that $M_t$ is nondecreasing for any choice of tiebreaking. For each $k$, define $\nu_k$ to be the first time at which $M_t = k$. Then by Markov’s inequality,
$$ P(\nu_k > t) \;\le\; \frac{E[\nu_k]}{t}. $$
Examining the timesteps at which the index played increments and relating them to $\nu_k$, we show in the appendix that the time to increment the max from $k$ to $k+1$ satisfies $E[\nu_{k+1} - \nu_k] \le c\,k$ for a constant $c$, so that $E[\nu_k] \le c\,k^2$. Thus the bound above becomes $P(M_t < k) = P(\nu_k > t) \le c\,k^2 / t$. Now let $\epsilon > 0$ be arbitrary and plug in $\sqrt{\epsilon t / c}$ for $k$, so we have $P\big(M_t < \sqrt{\epsilon t / c}\big) \le \epsilon$, as required. So $M_t = \Omega_P(\sqrt{t})$.
For the AFP case, consider the first timestep at which $M_t = 3$. Working backwards and checking cases, it can be shown that in order for the maximum value to increment from $2$ to $3$, there must first be a timestep with two non-adjacent entries of $u(t)$ equal to $2$ with an entry equal to $1$ between them. This cannot happen in the $n = 3$ case because three positive entries $(2, 1, 2)$ don’t sum to zero. Similarly, in the $n = 4$ case, it turns out by (2) that $u(t) = (a, b, -a, -b)$ for some $a$ and $b$, so there cannot be three positive entries in this case either. Therefore $M_t \le 2$ for all $t$ when $n \in \{3, 4\}$. ∎
The proofs of Proposition 2 establish a theme: FP can be slow because it spends increasingly large amounts of time progressing between strategies (playing the same strategy for increasingly many consecutive timesteps as $t$ increases), whereas AFP avoids this. (Return to Figure 1 for a visual example.)
Some further comments on the results: we only obtain the $O(1/t)$ rate for AFP applied to $C_n$ in the $n \in \{3, 4\}$ case. We conjecture that: (i) for a specific tiebreaking rule, AFP has the same worst-case rate as FP but with a better constant, and (ii) under random tiebreaking, AFP is $O_P(1/t)$ for all $n$. This is reflected in numerical simulations for large $n$, as shown in Appendix D.
Our results are noteworthy for their lack of dependence on tiebreaking: worst-case analyses of FP typically rely on specific tiebreaking rules; see Daskalakis and Pan (2014), for example. As for the “$t \le \tau_n$” caveat for FP applied to $T_n$, this is an unremarkable consequence of analyzing a game with a pure strategy equilibrium (all probability assigned to a single strategy). We write $\tau_n$ to indicate the first timestep at which FP plays strategy $n$. Both FP and AFP will play strategy $n$ forever some finite number of steps after they play it for the first time, thus attaining an $O(1/t)$ rate as the average strategy “catches up” to $e_n$. Our result shows that until this point, FP is slow, whereas AFP is always fast. As before, AFP’s superior performance is reflected in numerical simulations, as shown in Appendix D.
4.1 Numerical results
In order to compare FP and AFP more generally, we sample large numbers of random payoff matrices and compute aggregate statistics across them. Matrix entries are sampled as independent, identically distributed, standard Gaussian variables (note that the shift- and scale-invariance of matrix game equilibria implies that the choice of mean and variance is inconsequential). Since FP and AFP are so similar, and AFP computes two best responses per timestep, it’s natural to wonder: is AFP’s superior performance just an artifact of using more computation per timestep? So, in order to make a fair comparison, we compare the algorithms by the number of best responses calculated instead of the number of timesteps (algorithm iterations). Using the worst-case payoff as the measure of performance, we compare FP and AFP based on the number of responses computed and based on matrix size in Figures 3 and 4.
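The comparison can be sketched as follows, reusing the FP and AFP sketches from earlier sections; the matrix size, budget, and aggregation below are placeholders rather than the exact settings behind Figures 3 and 4:

```python
import numpy as np

def worst_case_payoff(A, p):
    """Row player's guaranteed payoff under strategy p."""
    return np.min(p @ A)

def fraction_afp_better(size=30, budget=200, n_games=1000, seed=0):
    """Fraction of random Gaussian games where AFP beats FP at an equal
    budget of best-response computations."""
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(n_games):
        A = rng.standard_normal((size, size))
        # AFP computes two best responses per player per step versus FP's one,
        # so it gets half as many iterations for the same budget.
        p_fp, _ = fictitious_play(A, budget)
        p_afp, _ = anticipatory_fictitious_play(A, budget // 2)
        wins += worst_case_payoff(A, p_afp) > worst_case_payoff(A, p_fp)
    return wins / n_games
```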
The result is that AFP is clearly better on both counts. Although FP is better for a substantial proportion of games at very early timesteps, AFP quickly outpaces FP, eventually performing better across each of the 1,000 matrices sampled. In terms of matrix size, FP and AFP appear equivalent on average for small matrices, but quickly grow separated as matrix size grows, with AFP likely to be much better.


5 Application to reinforcement learning
We apply reinforcement learning (RL) (Sutton and Barto, 2018) versions of FP and AFP in the context of a (two-player, zero-sum, symmetric) stochastic game (Shapley, 1953), defined by the tuple $(\mathcal{S}, \mathcal{O}, O, \mathcal{A}, P, R, \rho_0)$, where $\mathcal{S}$ is the set of possible states of the environment, $\mathcal{O}$ is the set of possible observations received by an agent, $O : \mathcal{S} \to \mathcal{O} \times \mathcal{O}$ gives the observations for each player based on the current state, $\mathcal{A}$ is the set of available actions, $P(s_{t+1} \mid s_t, a^1_t, a^2_t)$ defines the transition dynamics for the environment given each player’s action, $R$ defines the reward for both players such that $(r^1_t, r^2_t) = R(s_t, a^1_t, a^2_t)$ with $r^2_t = -r^1_t$ are the rewards observed by each player at time $t$, and $\rho_0$ is the initial distribution of states, so that $s_0 \sim \rho_0$. Let $\mathcal{O}^*$ be the set of possible sequences of observations. Then a policy is a map $\pi : \mathcal{O}^* \to \Delta(\mathcal{A})$. An episode is played by iteratively transitioning the environment according to the actions sampled from each player’s policy at each state. Players 1 and 2 earn returns $G^1 = \sum_t r^1_t$ and $G^2 = \sum_t r^2_t$. The reinforcement learning algorithms we consider take sequences of observations, actions, and rewards from both players and use them to incrementally update policies toward earning greater expected returns. For background on reinforcement learning, see Sutton and Barto (2018). For details on machine learning approximations to FP, see Heinrich et al. (2015). Table 2 gives a high-level overview of the relationship.
Normal-form game | Stochastic/extensive-form game |
---|---|
Strategy | Policy |
Payoff | Expected return |
Best response | Approximate best response by RL |
Strategy mixture of $\pi_1, \dots, \pi_k$ with weights $w_1, \dots, w_k$ | At start of episode, sample policy $\pi_i$ with probability $w_i$. Play the entire episode with $\pi_i$ (see the sketch following the table).
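A minimal sketch of the correspondence in the last row of Table 2 (names are ours): the RL analogue of a strategy mixture draws one policy per episode and commits to it for the whole episode.

```python
import numpy as np

def sample_episode_policy(policies, weights, rng=None):
    """Draw one policy for the entire episode with probability given by `weights`."""
    if rng is None:
        rng = np.random.default_rng()
    return policies[rng.choice(len(policies), p=weights)]
```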
We use two environments: our own TinyFighter, and Running With Scissors, from Vezhnevets et al. (2020).
5.1 Environments
TinyFighter is a minimal version of an arcade-style fighting game shown in Figure 5(a). It features two players with four possible actions: Move Left, Move Right, Kick, and Do Nothing. Players are represented by a rectangular body and when kicking, extend a rectangular leg towards the opponent.
Kicking consists of three phases: Startup, Active, and Recovery. Each phase of a kick lasts for a certain number of frames, and if the Active phase of the kick intersects with any part of the opponent (body or leg), a hit is registered. When a hit occurs, the players are pushed back, the opponent takes damage, and the opponent is stunned (unable to take actions) for a period of time. In the Startup and Recovery phases, the leg is extended, and like the body, can be hit by the opponent if the opponent has a kick in the active phase that intersects the player. The game is over when a player’s health is reduced to zero or when time runs out.
Player observations are vectors in $\mathbb{R}^{13}$ and contain information about player and opponent state: position, health, an ‘attacking’ indicator, a ‘stunned’ indicator, and how many frames a player has been in the current action. The observation also includes the distance between players, time remaining, and the direction of the opponent (left or right of self). The game is partially observable, so information about the opponent’s state is hidden from the player for some number of frames (we use four, and the game runs at 15 frames per second). This means a strong player must guess about the distribution of actions the opponent may have taken recently and respond to that distribution; playing deterministically will allow the opponent to exploit the player, so a stochastic strategy is required to play well.
Running With Scissors (RWS) is a spatiotemporal environment with partial observability and the potential for nontransitive relationships between policies. As shown in Figure 5(b), RWS is a 2D gridworld with a few types of entities: two agents; three types of items (rock, paper, and scissors), which can be picked up by the agents to add to their inventories; and impassable walls. In addition to moving around and picking up items, agents in RWS have a “tag” action, which projects a cone in front of them for a single frame. If the cone hits the other agent, the episode ends and each agent receives rewards based on the Rock Paper Scissors payoff matrix according to the ratios of each item in their inventory. Agents can only see their own inventory and a grid in front of them, and can remember the last four frames they’ve seen, so in order to perform effectively they must infer what items the opponent has picked up. Vezhnevets et al. (2020) and Liu et al. (2022) (Appendix B.1) feature further discussion of the environment.


5.2 Adapting FP and AFP to reinforcement learning
Neural Population Learning (NeuPL) (Liu et al., 2022) is a framework for multiagent reinforcement learning wherein a collection of policies is learned and represented by a single neural network, and all policies train continuously. For our experiments, we implement FP and AFP within NeuPL, as shown in Algorithm 1. For reference, we also include a simple RL version of FP and AFP in the style of PSRO in Appendix F.

In NeuPL-FP/AFP, the opponent sampler $\nu$ determines the distribution of opponents that each agent faces and is the only difference between the FP and AFP implementations. We have, for each agent index $i > 1$,
$$ \nu^{\mathrm{FP}}_i = \mathrm{Uniform}\{1, \dots, i-1\}, \qquad \nu^{\mathrm{AFP}}_i = \mathrm{Uniform}\big(\{j < i : j \text{ odd}\} \cup \{i - 1\}\big). $$
These distributions are depicted in Figure 6. Just as each step of FP involves computing a best response to an average of all prior strategies, sampling from $\nu^{\mathrm{FP}}_i$ corresponds to training agent $i$ uniformly against the prior policies; just as AFP can be thought of as “forgetting” every other index, $\nu^{\mathrm{AFP}}_i$ trains learner index $i$ uniformly against every odd-indexed policy plus the most recent policy. The neural population net defines a different policy for each agent index $i$ by conditioning on that agent’s opponent distribution, and can equivalently be represented as a collection of policies $\pi_1, \dots, \pi_n$.
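Written out, the two samplers differ by a single set of indices; this sketch assumes 1-indexed agents ($i \ge 2$) and returns a weight vector over the $i-1$ earlier policies, leaving all other NeuPL bookkeeping aside:

```python
import numpy as np

def fp_opponent_dist(i):
    """NeuPL-FP: agent i trains uniformly against all prior policies 1, ..., i-1."""
    return np.full(i - 1, 1.0 / (i - 1))

def afp_opponent_dist(i):
    """NeuPL-AFP: agent i trains uniformly against the odd-indexed prior policies
    plus the most recent policy i-1 ("forgetting" every other best response)."""
    opponents = {j for j in range(1, i) if j % 2 == 1} | {i - 1}
    w = np.zeros(i - 1)
    w[[j - 1 for j in opponents]] = 1.0 / len(opponents)
    return w

# e.g. afp_opponent_dist(6) puts weight 1/3 on each of policies 1, 3, and 5,
# while fp_opponent_dist(6) puts weight 1/5 on each of policies 1 through 5.
```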
5.3 Experimental setup
For the neural population net, we use an actor-critic (Sutton and Barto, 2018) architecture similar to that used for RWS in Liu et al. (2022): first, a module is used to process environment observations into a dense representation that is shared between an actor head and a critic head. The actor takes this vector, concatenates it with a vector representing the distribution of opponents faced by the currently training agent (e.g., $\nu_i$ for agent $i$ in FP or AFP), then processes it with a dense MLP with ReLU activations, with action masking applied prior to a final Softmax layer. The critic is similar, except an additional input is used: the index of the opponent sampled at the beginning of this episode, matching the original implementation. (The actor (policy network) does not observe which opponent it faces, only the distribution over agents it faces; this is important because otherwise our agent would not learn a best response to an average policy, as intended in FP and AFP. The reason for including this information for the critic (value network) is that it may reduce the variance of the value function estimator.) For the exact architectures used, see Appendix E. We use a neural population size of 11. Based on matrix game simulations and preliminary RL experiments, we determined that AFP performs slightly better over a short time horizon when initialized with two steps of FP, so we do this, as shown in the final panel of Figure 6. See Appendix D for a figure comparing performance in the matrix setting.
We implemented NeuPL within a basic self-play reinforcement learning loop by wrapping the base environment (TinyFighter or RWS) within a lightweight environment that handles NeuPL logic, such as opponent sampling. For reinforcement learning, we use the Asynchronous Proximal Policy Optimization (APPO) algorithm (Schulman et al., 2017), a distributed actor-critic RL algorithm, as implemented in RLlib (Moritz et al., 2018) with a single GPU learner. Hyperparameter settings are given in Appendix E. We train the entire neural population net (agents 1-11) for 12,500 steps, where a step is roughly 450 minibatch updates of stochastic gradient descent. This corresponds to about five days of training. For each of the two environments, we repeat this procedure independently 10 times for FP and 10 times for AFP.
5.4 Results
To evaluate exploitability, we made use of the fact that each FP and AFP neural population is made up of agents trained to “exploit” the ones that came before them. Specifically, each agent is trained to approximate a best response to the average policy returned by the algorithm at the previous timestep. So, to estimate the exploitability of NeuPL-FP or NeuPL-AFP at step $i$, we simply use the average return earned by agent $i+1$ against agents $1, \dots, i$ to obtain the within-population exploitability at step $i$. This is a convenient metric, but insufficient on its own. In order for it to be meaningful, the agents in the population must have learned approximate best responses that are close to actual best responses; if they have not, it could be that within-population exploitability is low, not because the average policy approximates a Nash policy but because nothing has been learned at all. To account for this, we also evaluate the populations learned using relative population performance (Balduzzi et al., 2019), which measures the strength of one population of agents against the other. The purpose of using relative population performance is simply to verify that one algorithm did not produce generally more competent agents than the other.
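As a sketch of how this metric can be computed from head-to-head evaluation data (the exact weighting used for Figure 7 may differ), score the average policy at step $i$ by the return of its designated exploiter, agent $i+1$:

```python
import numpy as np

def within_population_exploitability(returns, i, opponent_dist):
    """Return of agent i+1 (the designated exploiter) against the mixture of agents
    1..i it was trained to exploit. `returns[a-1, b-1]` is agent a's mean return
    against agent b; `opponent_dist(i+1)` gives the training weights over agents 1..i."""
    w = opponent_dist(i + 1)          # e.g. fp_opponent_dist or afp_opponent_dist above
    return float(w @ returns[i, :i])  # row i holds agent i+1 in 0-indexed storage
```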
We paired each of the 10 replicates of FP and AFP and computed the relative population performance for each. On TinyFighter, the average was -0.73, with a Z-test-based 90% confidence interval width of 1.37. On RWS, the average was 4.60, with a Z-test-based 90% confidence interval width of 3.50. We conclude that the agents learned by FP and AFP are not statistically significantly different in terms of performance for TinyFighter, but the agents learned by FP have a slight, statistically significant advantage in RWS. However, these differences are small relative to the total obtainable reward in either environment (20 for TinyFighter, roughly 60 for RWS), so we conclude it is reasonable to use within-population exploitability to compare FP and AFP, as shown in Figure 7. For consistency with the matrix game simulation results, we plot worst case payoff, which is simply the negative of exploitability in this case.


We find that AFP has a significantly better worst-case payoff of -7.4, versus -10.6 for FP, at the final timestep in TinyFighter. This corresponds to a noteworthy 16% reduction in exploitability relative to the total possible reward of 20 that can be earned in TinyFighter. In RWS, the algorithms have essentially identical performance. The fact that AFP’s advantage varies widely by environment is not surprising. The matrix game simulations in Figure 3 showed that until over 100 steps of each algorithm, there is some proportion of games for which FP performs better. Correspondingly, we would expect that there is a nontrivial proportion of stochastic games where NeuPL-FP outperforms NeuPL-AFP for small population sizes. Although we expect NeuPL will not be able to support over 100 policies (the original paper used population size 8), it would be possible to do so within the PSRO framework. This remains a topic for further investigation.
6 Conclusion
We proposed a variant of fictitious play for faster estimation of Nash equilibria in two-player, zero-sum games. Anticipatory fictitious play is intuitive, easy to implement, and supported by theory and numerical simulations which suggest that it is virtually always preferable to fictitious play. Consequently, we shed new light on two motivating problems for fictitious play: primarily, large-scale multiagent reinforcement learning for complicated real-world games; also, modeling strategic decision making in humans. Further work is needed to understand the conditions under which AFP outperforms FP in the reinforcement learning setting.
Acknowledgements
The authors sincerely thank Ryan Martin for continued mentorship, especially around theory and the presentation of the paper; Philip Wardlaw for orchestrating preliminary reinforcement learning experiments; Adam Venis and Angelo Olcese for contributions to the proofs of FP and AFP applied to ; Jesse Clifton and Eric Laber for helpful comments on a draft; Jesse Clifton and Marc Lanctot for suggesting related works that had been overlooked; and Siqi Liu for helpful explanations and implementation details for NeuPL.
References
- Adler, (2013) Adler, I. (2013). The equivalence of linear programs and zero-sum games. International Journal of Game Theory, 42(1):165–177.
- Agresti and Coull, (1998) Agresti, A. and Coull, B. A. (1998). Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician, 52(2):119–126.
- Balduzzi et al., (2019) Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W., Perolat, J., Jaderberg, M., and Graepel, T. (2019). Open-ended learning in symmetric zero-sum games. In International Conference on Machine Learning, pages 434–443. PMLR.
- Brown, (1951) Brown, G. W. (1951). Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1):374–376.
- Chollet et al., (2015) Chollet, F. et al. (2015). Keras.
- Conlisk, (1993a) Conlisk, J. (1993a). Adaptation in games: Two solutions to the Crawford puzzle. Journal of Economic Behavior & Organization, 22(1):25–50.
- Conlisk, (1993b) Conlisk, J. (1993b). Adaptive tactics in games: Further solutions to the Crawford puzzle. Journal of Economic Behavior & Organization, 22(1):51–68.
- Daskalakis and Pan, (2014) Daskalakis, C. and Pan, Q. (2014). A counter-example to Karlin’s strong conjecture for fictitious play. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 11–20, Philadelphia, PA, USA. IEEE.
- Foerster et al., (2018) Foerster, J., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., and Mordatch, I. (2018). Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 122–130. AAMAS.
- Heinrich et al., (2015) Heinrich, J., Lanctot, M., and Silver, D. (2015). Fictitious self-play in extensive-form games. In International Conference on Machine Learning, pages 805–813.
- Kuhn, (1953) Kuhn, H. (1953). Extensive games and the problem of information. In Contributions to the Theory of Games, pages 193–216. Princeton University Press.
- Lanctot et al., (2017) Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Perolat, J., Silver, D., and Graepel, T. (2017). A unified game-theoretic approach to multiagent reinforcement learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Letcher et al., (2018) Letcher, A., Foerster, J., Balduzzi, D., Rocktäschel, T., and Whiteson, S. (2018). Stable opponent shaping in differentiable games. In International Conference on Learning Representations, Vancouver, BC, Canada. ICLR.
- Liu et al., (2022) Liu, S., Marris, L., Hennes, D., Merel, J., Heess, N., and Graepel, T. (2022). Neupl: Neural population learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
- Lockhart et al., (2019) Lockhart, E., Lanctot, M., Pérolat, J., Lespiau, J.-B., Morrill, D., Timbers, F., and Tuyls, K. (2019). Computing approximate equilibria in sequential adversarial games by exploitability descent. arXiv preprint arXiv:1903.05614.
- Luce and Raiffa, (1989) Luce, R. D. and Raiffa, H. (1989). Games and Decisions: Introduction and Critical Survey. Courier Corporation.
- Moritz et al., (2018) Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan, M. I., and Stoica, I. (2018). Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577.
- Nash Jr, (1950) Nash Jr, J. F. (1950). Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1):48–49.
- Robinson, (1951) Robinson, J. (1951). An iterative method of solving a game. Annals of Mathematics, 54(2):296–301.
- Schulman et al., (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Shamma and Arslan, (2005) Shamma, J. S. and Arslan, G. (2005). Dynamic fictitious play, dynamic gradient play, and distributed convergence to nash equilibria. IEEE Transactions on Automatic Control, 50(3):312–327.
- Shapley, (1953) Shapley, L. S. (1953). Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100.
- Shoham and Leyton-Brown, (2008) Shoham, Y. and Leyton-Brown, K. (2008). Multiagent Systems: Algorithmic, Game-theoretic, and Logical Foundations. Cambridge University Press.
- Sutton and Barto, (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT press, Cambridge, Massachusetts.
- Vezhnevets et al., (2020) Vezhnevets, A., Wu, Y., Eckstein, M., Leblond, R., and Leibo, J. Z. (2020). Options as responses: Grounding behavioural hierarchies in multi-agent reinforcement learning. In International Conference on Machine Learning, pages 9733–9742. PMLR.
- Vinyals et al., (2019) Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J. P., Jaderberg, M., Vezhnevets, A. S., Leblond, R., Pohlen, T., Dalibard, V., Budden, D., Sulsky, Y., Molloy, J., Paine, T. L., Gulcehre, C., Wang, Z., Pfaff, T., Wu, Y., Ring, R., Yogatama, D., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Kavukcuoglu, K., Hassabis, D., Apps, C., and Silver, D. (2019). Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354.
- Zhang and Lesser, (2010) Zhang, C. and Lesser, V. (2010). Multi-agent learning with policy prediction. In Twenty-Fourth AAAI Conference on Artificial Intelligence, pages 927–937. AAAI.
Appendix
Appendix A Why naive AFP doesn’t work
The original idea for the AFP algorithm, which we refer to as naive AFP, was: at each timestep, play the best response to the opponent’s best response to your history. Formally, this is given by the updates (cf. the AFP update (1)):
$$ p_{t+1} = \mathrm{BR}_1\big(\mathrm{BR}_2(\bar p_t)\big), \qquad q_{t+1} = \mathrm{BR}_2\big(\mathrm{BR}_1(\bar q_t)\big), \qquad \bar p_{t+1} = \frac{t\,\bar p_t + p_{t+1}}{t+1}, \qquad \bar q_{t+1} = \frac{t\,\bar q_t + q_{t+1}}{t+1}. $$
In preliminary simulations, naive AFP performed well in cyclic games and seemed to be competitive with FP in other games.
However, upon inspection it becomes clear that naive AFP is not guaranteed to converge to a Nash equilibrium. This is because naive AFP can only play best responses to strategies returned by the best response operator, which are pure strategies, but Nash equilibria of some games assign nonzero probability to strategies that are not best responses to any pure strategy. Thus there are some games where naive AFP is incapable of assigning any probability to actions which must be assigned nonzero probability in a Nash equilibrium, so naive AFP cannot converge to a Nash equilibrium. For example, see the game given in Table 3; a code sketch of the naive update follows the table.
 | Rock | Paper | Scissors
---|---|---|---
Rock | 0 | -1 | 1 |
Paper | 1 | 0 | -1 |
Scissors | -1 | 1 | 0 |
SafeRock | 0 | 0 | 0.99 |
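A sketch of the naive update, reusing br_row and br_col from the FP sketch in Section 2.1. As an illustration, if the Table 3 game is read as the row player’s payoff matrix (rows Rock, Paper, Scissors, SafeRock against columns Rock, Paper, Scissors, which is our reading of the table), naive AFP can never assign probability to SafeRock, since SafeRock is not a best response to any pure strategy.

```python
import numpy as np

def naive_afp(A, n_steps, i0=0, j0=0):
    """Naive AFP: each player best-responds to the opponent's best response to their
    own history, so only responses to pure strategies ever enter the averages."""
    p_bar = np.zeros(A.shape[0]); p_bar[i0] = 1.0
    q_bar = np.zeros(A.shape[1]); q_bar[j0] = 1.0
    for t in range(1, n_steps):
        p = br_row(A, br_col(A, p_bar))   # respond to the response to my history
        q = br_col(A, br_row(A, q_bar))
        p_bar += (p - p_bar) / (t + 1)
        q_bar += (q - q_bar) / (t + 1)
    return p_bar, q_bar
```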
Appendix B Proof of Proposition 1
Our analysis closely follows the original proof of FP’s convergence given in Robinson, (1951), which consists of four key lemmas. We introduce the notion of a perturbed fictitious play system, a slight generalization of Robinson’s “vector system.” We modify Robinson’s second lemma accordingly, which causes inconsequential changes in the third and fourth lemma, leaving the final result the same. In this way, we extend Robinson’s proof to a broader class of algorithms which include fictitious play and anticipatory fictitious play as special cases.
Definition 1.
An iterative play system for is a pair of sequences of vectors and with each and satisfying the following properties:
-
•
, and
-
•
for each , and ,
where and are the th row and th column of , and and are the row and column “played” by players 1 and 2 at time .
The interpretation of an iterative play system is as follows. Suppose we choose and . Write to indicate a vector with a 1 at index and 0’s elsewhere. Then and are the empirical strategies played by players 1 and 2, and and are the payoffs for player 1 and 2 when faced with those empirical strategies. In this way, and can be seen as “accumulating empirical payoffs” for players and .
Definition 2.
A perturbed fictitious play system (PFP-system) is an iterative play system with the additional property that and are best responses to a perturbed set of empirical payoffs. Precisely, an iterative play system is a PFP-system such that for any values and with and for each ,
A special case of a PFP-system is what Robinson calls a “vector system,” which describes fictitious play. This is obtained by setting all entries of and to zero at all timesteps.
Lemma 1.
If is an iterative play system for , then
This lemma follows from the minimax nature of two-player zero-sum games and holds regardless of what rows and columns of are used to update elements of and .
Definition 3.
Given an iterative play system for matrix , we say row is -eligible in the interval if for some , could have been played as part of AFP. Precisely, the condition is that there exists such that
for some with . Or, equivalently,
Similarly, we say column is -eligible if
Lemma 2.
If is an iterative play system for and all rows and columns are -eligible in , then we have
Proof.
Let . By the -eligibility of , there must exist such that
So, because the th entry can’t change by more than per timestep,
where the last inequality holds because for , the minimum of versus can’t change by more than in timesteps. A similar argument shows the result for . ∎
Lemma 3.
If is an iterative play system for and all rows and columns are -eligible in , then
Proof.
Lemma 4.
For every matrix , , there exists a such that for any anticipatory fictitious play system,
Proof.
We follow the proof of Robinson, (1951), replacing the notion of eligibility with -eligibility. The strategy is induction. If , the result is trivial because for all . Now assume that the property holds for an arbitrary submatrix obtained by deleting any number of columns or rows from . We wish to show that the property also holds for .
Using the inductive hypothesis, pick such that
for any a submatrix of and an anticipatory fictitious play system for .
As in Robinson, (1951), we wish to show that if in , some row or column is not -eligible in , then
(3) |
Suppose without loss of generality that row is not -eligible . Then we construct a new anticipatory fictitious play system for matrix , which is with row deleted. Define
for , where is a vector of ’s with entries, , and is the operator that removes entry . We now check the conditions for an anticipatory fictitious play system:
-
•
We have , where the last equality holds because is not -eligible, so it could not be a maximizer of , so deleting it to form does not change the maximum.
-
•
We have that for each ,
where because the AFP-ineligibility of implies .
-
•
Finally, we must show that the rows and columns selected still qualify as “anticipatory” responses within the context of and , i.e. that
for each . By the definition of and fact that , we have
( is a PFP-system) and is just a shifted version of , so the fact that implies the result.
These points verify that satisfy the conditions for an anticipatory fictitious play system for on . We can choose remaining values for both sequences for to satisfy the anticipatory fictitious play conditions. So, using the inductive hypothesis, we have
With (3) established, we are ready to finish the proof: under the inductive hypothesis, we will be able to deal with time intervals by splitting them into two cases: if a row or column is not -eligible, we apply (3); if all rows or columns are -eligible, we apply Lemma 3.
Specifically, we show that for any AFP system for and ,
Let and express it as , where and . We consider the collection of length- intervals for .
-
•
Case 1. There is at least one interval where all rows and columns are -eligible. Let be the latest such interval. By Lemma 3,
where the last inequality holds because . By choice of the interval, all subsequent intervals with have no -eligible rows or columns, so (3) gives that
noting that the result holds trivially if . Combining the previous two results and loosening the bound, we have
(4) -
•
Case 2. In each interval for , some row or column is not -eligible. Applying (3) repeatedly,
(5) (6) where the last inequality holds because and .
So, comparing (4) and (6) and noting that , in either case we have that
for all .∎
Finally, we are ready for the proof of Proposition 1, which is essentially identical to the final proof of Theorem 1 in Robinson, (1951).
Proof.
B.1 Convergence rate of perturbed fictitious play
Given a 2p0s matrix game with payoff matrix , write to denote the value of given by Lemma 4 such that
(7) |
Appendix C Proof of Proposition 2
For the first two subsections, we restate the definitions of $u(t)$ and $k_t$ from (2): $u(t) = t\, C_n \bar q_t$ and $M_t = \max_i u_i(t)$ for each $t$, and $k_t$ is the index played by FP (AFP) at time $t$, so
$$ u_i(t+1) = u_i(t) + \mathbb{1}\{i \equiv k_t + 1\} - \mathbb{1}\{i \equiv k_t - 1\} \pmod{n} $$
for each $i$ and $t$.
C.1 FP: $C_n$
We must show that $M_t = \Omega_P(\sqrt{t})$ under random tiebreaking. Throughout, whenever performing arithmetic with indices, that arithmetic is done modulo $n$. As in the body of the paper, define $\nu_k$ to be the first time at which $M_t = k$ and note the Markov inequality bound:
$$ P(\nu_k > t) \le \frac{E[\nu_k]}{t}. $$
The bulk of the argument is in finding a bound for $E[\nu_k]$.
It follows by the definition of and the FP update that . Index may be chosen arbitrarily, but for , it follows from (2) that
because the value that is incremented at time is the value adjacent to index , . Let and inductively define, for , to be the next time at which there are two possible choices for . This is depicted in Table 4, writing and .
0 | |||||||
---|---|---|---|---|---|---|---|
As shown in the table, all entries of other than the two maximum values must be nonpositive at each . This follows by induction, since it holds for ( has one positive entry) and if it’s true for some , then in order to progress to , we must add some number (over the course of timesteps) to the next entry, which means by (2) we will subtract from the previous entry , with . Finally, note that between and , either we will have incremented the max from to if , or we will not have not (if , and the max will remain at until we repeat some further number of increments of the index played. It is only on these timesteps that the maximum value can increment.
Based on this reasoning, we know that for any , there must exist such that . Consider the random variable , which is the number of increments of the index played that occurred between the increment of the max from to . Under uniform random tiebreaking, we have that , since at each there is a 1/2 chance of incrementing (“success”) or not incrementing (“failure”). So, . Now suppose that we had the bound for some . That would imply that
Taking the expectation of both sides, we get . Plugging this into the Markov bound is sufficient to finish the argument, as explained in the proof sketch for Proposition 2 (in the paper).
All that remains is to show that . From the argument depicted in Table 4, we know , and that . Because has only two positive entries and , we have , so , concluding the proof.
C.2 AFP: $C_n$, $n \in \{3, 4\}$
For $n \in \{3, 4\}$ we show that $M_t \le 2$ for all $t$, which proves the result.
Based on the AFP update, it is impossible to reach $M_t = 3$ unless there are at least two non-adjacent $2$’s in $u(t)$ with a $1$ in between. Otherwise, the two-step nature of the AFP update will not allow a $2$ to be incremented to $3$. However, it is impossible to have two non-adjacent $2$’s with a $1$ in between for $n = 3$ because the entries of $u(t)$ sum to 0. Furthermore, in the $n = 4$ case, for each $t$, it must be that $u(t) = (a, b, -a, -b)$ for some $a$ and $b$, by (2). So there also cannot be three positive entries in this case.
C.3 FP: $T_n$
Assume without loss of generality that . Let and note that the form of implies that the strategies played by FP must be nondecreasing and increment by at most 1 at a time. We argue by strong induction that for each . Checking the first few terms, we have
, and therefore , so . Now assume that for some fixed and all that . Note that
for , where the ‘’ signs indicate values that are no greater than their neighbors on the right; this holds by the inductive hypothesis and definition of . We know that for steps , FP will play , and we know that is the first timestep at which (or else FP would have played at , a contradiction). It follows that , as desired.
This result implies that for each , so we have for each . Inverting this, we get that is an upper bound on , the index played by FP at time . Combining the expression for , , with this bound, we get that .
C.4 AFP: $T_n$
We argue first by strong induction that for all . Assume without loss of generality that . Now assume that, for some fixed , for .
If , then under the inductive hypothesis,
for which the largest value is at index , so , so then will have largest value at index , so . If , then under the inductive hypothesis, , at which point , as desired.
Finally, we are interested in for , which we obtain from the calculation above, .
Appendix D Additional figures



Appendix E RL experiment hyperparameters
APPO | TinyFighter | RWS |
---|---|---|
Discount factor | 0.99 | 0.995 |
Value function coefficient | 0.5 | 0.5 |
Entropy coefficient | 0.01 | 0.02 |
Learning rate | 3e-4 | 1.5e-4 |
Gradient clipping | 10 | 80 |
Batch size | 5,120 | 6,000 |
Number of workers | 80 | 319 |
Rollout fragment length | 64 | 64 |
Other | Ray defaults (as of Nov 2022) (Moritz et al., 2018)
Running With Scissors Environment | |
---|---|
Reward shaping | Policy played against a random opponent and received a unit of reward for each ‘scissors’ picked up and a negative unit of reward for each ‘rock’ and ‘paper’ picked up.
Reward scaling () | 10 |
Reward scaling () | 100 |
Sprite size | 3x3 pixels |
Agent field of view | 5x5 grid units (15x15 pixels) |
Agent memory | 4 frames |
Name | Type | Shape | # Parameters | Input |
---|---|---|---|---|
state | InputLayer | [13] | 0 | |
dense | Dense | [128] | 1,792 | state |
dense_1 | Dense | [128] | 16,512 | dense |
neupl_opponent_dist | InputLayer | [4] | 0 | |
policy_head_inputs | Concatenate | [145] | 0 | state, dense_1, neupl_opponent_dist |
neupl_opponent_idx | InputLayer | [5] | 0 | |
dense_2 | Dense | [256] | 37,376 | policy_head_inputs |
value_head_inputs | Concatenate | [150] | 0 | state, dense_1, neupl_opponent_dist, neupl_opp… |
dense_3 | Dense | [128] | 32,896 | dense_2 |
dense_6 | Dense | [256] | 38,656 | value_head_inputs |
dense_4 | Dense | [64] | 8,256 | dense_3 |
action_mask | InputLayer | [4] | 0 | |
dense_7 | Dense | [128] | 32,896 | dense_6 |
dense_5 | Dense | [4] | 260 | dense_4 |
lambda | Lambda | [4] | 0 | action_mask |
dense_8 | Dense | [64] | 8,256 | dense_7 |
policy_out | Add | [4] | 0 | dense_5, lambda |
value_out | Dense | [1] | 65 | dense_8 |
Name | Type | Shape | # Parameters | Input |
---|---|---|---|---|
rgb_history | InputLayer | [4, 15, 15, 3] | 0 | |
rescaling | Rescaling | [4, 15, 15, 3] | 0 | rgb_history |
conv2d | Conv2D | [4, 5, 5, 32] | 896 | rescaling |
activation | Activation | [4, 5, 5, 32] | 0 | conv2d |
conv2d_1 | Conv2D | [4, 5, 5, 16] | 528 | activation |
activation_1 | Activation | [4, 5, 5, 16] | 0 | conv2d_1 |
reshape | Reshape | [4, 400] | 0 | activation_1 |
inv_ratios_history | InputLayer | [4, 3] | 0 | |
concatenate | Concatenate | [4, 403] | 0 | reshape, inv_ratios_history |
dense | Dense | [4, 256] | 103,424 | concatenate |
dense_1 | Dense | [4, 256] | 65,792 | dense |
lstm | LSTM | [256] | 525,312 | dense_1 |
neupl_opponent_dist | InputLayer | [11] | 0 | |
neupl_opponent_idx | InputLayer | [12] | 0 | |
policy_head_inputs | Concatenate | [267] | 0 | lstm, neupl_opponent_dist |
value_head_inputs | Concatenate | [279] | 0 | lstm, neupl_opponent_dist, neupl_opponent_idx |
policy_1 | Dense | [512] | 137,216 | policy_head_inputs |
value_1 | Dense | [512] | 143,360 | value_head_inputs |
policy_2 | Dense | [256] | 131,328 | policy_1 |
value_2 | Dense | [256] | 131,328 | value_1 |
policy_3 | Dense | [128] | 32,896 | policy_2 |
value_3 | Dense | [128] | 32,896 | value_2 |
policy_out | Dense | [30] | 3,870 | policy_3 |
value_out | Dense | [1] | 129 | value_3 |
Appendix F Vanilla RL implementations of FP and AFP
Algorithm 2 gives a simple reinforcement learning implementation of FP and AFP that does not rely on Neural Population Learning or the reservoir buffer sampling of Heinrich et al. (2015). A hypothetical sketch of this kind of loop follows the notes below. Notes:
- $\pi_{-i}$ refers to the opponent of $\pi_i$.
- $\mathrm{train}(\pi, \mu)$ plays $\pi$ against opponent(s) $\mu$ for some number of episodes, gathers data, and updates $\pi$ by reinforcement learning.
- Lines 8 and 9 are the approximate reinforcement learning analogue of computing a best response to the average of all previous strategies. For details, see Heinrich et al. (2015), which uses a more complicated setup in order to limit storage requirements (constantly learning the average strategy by supervised learning).
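The following is a hypothetical sketch of the kind of loop Algorithm 2 describes, assembled from the notes above and the opponent distributions of Section 5.2; it is not a transcription of Algorithm 2, and `new_policy` and `train` are placeholders for an RL implementation.

```python
def rl_fictitious_play(new_policy, train, n_policies, anticipatory=False):
    """PSRO-style loop: each new policy is trained by RL against an opponent drawn
    uniformly (per episode) from a set of earlier policies, approximating a best
    response to their average strategy (cf. the note on lines 8 and 9 above)."""
    population = [new_policy()]                      # policy 1: arbitrary initial strategy
    for i in range(2, n_policies + 1):
        if anticipatory:
            # AFP "forgets" every other best response: respond to the odd-indexed
            # policies plus the most recent one, as in the NeuPL sampler of Section 5.2.
            opponents = [population[j - 1] for j in range(1, i) if j % 2 == 1 or j == i - 1]
        else:
            opponents = list(population)             # FP: respond to all prior policies
        pi = new_policy()
        train(pi, opponents)
        population.append(pi)
    return population
```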