Steering control of payoff-maximizing players in adaptive learning dynamics†
†X.C. gratefully acknowledges the generous faculty startup fund provided by BUPT (No. 505022023).
Abstract
Evolutionary game theory provides a mathematical foundation for cross-disciplinary fertilization, especially for integrating ideas from artificial intelligence and game theory. Such integration offers a transparent and rigorous approach to complex decision-making problems in a variety of important contexts, ranging from evolutionary computation to machine behavior. Despite the astronomically large space of individual behavioral strategies in the iterated Prisoner's Dilemma (IPD), the so-called zero-determinant (ZD) strategies are a set of rather simple memory-one strategies that can nonetheless unilaterally set a linear payoff relationship between themselves and their opponent. Although knowledge of ZD strategies gives players an upper hand in IPD games, we find and characterize unbending strategies that can force ZD players to be fair in their own interest. Moreover, our analysis reveals that unbending properties are ubiquitous among common IPD strategies, a fact previously overlooked. In this work, we demonstrate the important steering role of unbending strategies in fostering fairness and cooperation in pairwise interactions. Our results bring a new perspective, combining game theory and multi-agent learning systems, to optimizing winning strategies that are robust to noise, errors, and deception in non-zero-sum games.
Index Terms:
steering control, adaptive learning, Prisoner's Dilemma, direct reciprocity, evolutionary game theory
I Introduction
Evolutionary game theory provides a mathematical foundation for studying mechanisms of cooperation and myriad learning theories toward altruistic behavior [1, 2]. Adaptive learning strategies will evolve in scenarios where self-interested individuals interact with one another. Among others, the Prisoner's Dilemma (PD) game is a symmetric game involving two players, X and Y, and two actions: to cooperate or to defect. In a one-shot PD game, the four possible outcomes correspond to different payoffs from the focal player's perspective: if both are cooperators, one gets the reward $R$; if a cooperator is against a defector, the sucker's payoff $S$; if a defector is against a cooperator, the temptation $T$; and if both are defectors, the punishment $P$. The game is considered a paradigm for understanding the conflict between self-interest and collective interest as the payoff structure satisfies $T > R > P > S$. As a particular example, the donation game is a simplified form of the PD game, where a cooperator offers the other player a benefit $b$ at a cost $c$ to itself with $b > c > 0$, whereas a defector offers nothing. The iterated Prisoner's Dilemma (IPD) game further assumes repeated encounters between the same two players and sheds light on the idea of direct reciprocity [3].
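For concreteness, the payoff structure just described can be encoded in a few lines; the sketch below uses illustrative numbers for $b$ and $c$ (not values taken from this paper) and simply checks the Prisoner's Dilemma ordering:

```python
def donation_game(b, c):
    """Payoffs (R, S, T, P) of the donation game with benefit b and cost c (b > c > 0)."""
    R, S, T, P = b - c, -c, b, 0.0
    assert T > R > P > S, "Prisoner's Dilemma payoff ordering"
    return R, S, T, P

# Illustrative example: benefit b = 3, cost c = 1 gives (R, S, T, P) = (2, -1, 3, 0).
print(donation_game(3.0, 1.0))
```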
Despite the astronomically huge individual behavioral strategy space for IPD game interactions [4, 5], the so-called zero-determinant (ZD) strategies, including extortioners (extortionate ZD), compliers (generous ZD), and equalizers, are a set of rather simple memory-one strategies that can nonetheless unilaterally set a linear relation between their own payoff $s_X$ and their co-player's payoff $s_Y$ [6, 7, 8, 9, 10]. The three free parameters that determine a ZD strategy, and how unequal the corresponding payoff relation is, are the baseline payoff $O$, the extortion factor $\chi$, and the normalization factor $\phi$. An equalizer (obtained in the limit $\chi \to \pm\infty$) will fix their co-player's payoff to the given $O$, which can take any value between the punishment $P$ and the reward $R$. An extortioner with $O = P$ and $\chi > 1$ can always get an unfair share of the payoffs in conventional IPD games with $T + S > 2P$ [7]. In contrast, a complier with $O = R$ and $\chi > 1$ can guarantee that their own payoff is never above their co-player's [8].
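To make the three ZD parameters concrete, the following sketch constructs a memory-one ZD strategy from $(O, \chi, \phi)$ using the standard parameterization that goes back to Press and Dyson; the exact algebraic form is assumed here and may differ in notation from the paper's own Eq. (1), and the numeric values are illustrative:

```python
import numpy as np

def zd_strategy(O, chi, phi, R, S, T, P):
    """Memory-one ZD strategy (cooperation probabilities after CC, CD, DC, DD) that
    enforces the linear relation s_X - O = chi * (s_Y - O) for its user X."""
    p = np.array([
        1 - phi * (chi - 1) * (R - O),        # after mutual cooperation (CC)
        1 - phi * ((O - S) + chi * (T - O)),  # after being exploited (CD)
        phi * ((T - O) + chi * (O - S)),      # after exploiting the co-player (DC)
        phi * (chi - 1) * (O - P),            # after mutual defection (DD)
    ])
    if not np.all((p >= 0) & (p <= 1)):
        raise ValueError("(O, chi, phi) lies outside the admissible range")
    return p

R, S, T, P = 2.0, -1.0, 3.0, 0.0  # donation game with b = 3, c = 1
extortioner = zd_strategy(O=P, chi=2.0, phi=0.1, R=R, S=S, T=T, P=P)  # O = P, chi > 1
complier = zd_strategy(O=R, chi=2.0, phi=0.1, R=R, S=S, T=T, P=P)     # O = R, chi > 1
```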
As such, the finding of ZD strategies has greatly spurred new waves of work from diverse fields [11, 12], aiming to (i) elucidate the robustness and resilience of cooperation by means of the natural selection of IPD strategies from a population dynamics perspective [13, 14, 15] and (ii) explain reactions of human subjects to extortion that could be out of a desire for profit and/or a concern for fairness [16, 17]. Despite this known recipe for gaining the upper hand in IPD games, there still exists one open issue regarding ZD strategies: how do the payoff structure and the three parameters $(O, \chi, \phi)$ determine their pairwise dominance and extortion ability?
We have fully addressed the above question and revealed the unforeseen Achilles' heel of ZD strategies. Most notably, we have found and characterized multiple general classes of strategies that are unbending to extortion and can trigger the backfire of being extortionate, even outperforming an extortioner under certain conditions. A fixed unbending strategy will compel a self-interested and adaptive extortioner, who tries to maximize their payoff, to be fair and ultimately behave like Tit-for-Tat (TFT) [1, 18] by letting $\chi \to 1$. Examples of unbending strategies include general ZD strategies with the baseline payoff satisfying $P < O \le R$ and the simple adaptive player Win-Stay Lose-Shift (WSLS) [19], of which the latter can even outperform an extortioner in interactions of a more adversarial nature where $T + S < 2P$ [20].
Remarkably, the insistence of an unbending player on fairness can rein in not only extortioners but also a much broader family of strategies. In the present work, we further illustrate such a steering role toward altruism by considering the adaptive learning dynamics of a focal reactive player, whose move in the current round depends on what the co-player did in the last round [18], against a fixed unbending co-player in a donation game [13]. In this setting, reactive players not only are de facto ZD strategies but also can be conveniently visualized within a unit square. It turns out that unbending strategies from different classes either "train" the reactive player to behave like generous TFT [19] or turn them into a full cooperator. Therefore, unbending players can helm the adaptive learning dynamics of extortionate and reactive players toward fairness even in the absence of population dynamics and evolution. Our findings help illuminate ideas for multi-agent optimization in computational learning theory.
II Adaptive Learning Dynamics against Unbending Players
Payoff structure – equal gains from switching
As a simplified form of the PD game, the donation game features two parameters, $b$ and $c$, representing the benefit and the cost of cooperation. The payoff structure follows $R = b - c$, $S = -c$, $T = b$, and $P = 0$. As a particular type of memory-one strategy, a reactive strategy can be described by the vector $(p, q)$, where $p$ and $q$ are the probabilities to cooperate after a cooperation or a defection by the co-player, respectively, and are known as the reactive norm. Reactive strategies in fact become ZD strategies under the setting of a donation game satisfying "equal gains from switching", namely, $T + S = R + P$.
Proposition 1 (Reactive strategies versus Zero-Determinant strategies).
In the donation game, reactive strategies are a subset of general zero-determinant strategies, with the normalization factor $\phi$ and the extortion factor $\chi$ being either positive or negative.
Proof.
For any PD game, the set of ZD strategies is a collection of memory-one strategies $\mathbf{p} = (p_1, p_2, p_3, p_4)$ with three free parameters $(O, \chi, \phi)$:
$p_1 = 1 - \phi(\chi - 1)(R - O), \quad p_2 = 1 - \phi[(O - S) + \chi(T - O)], \quad p_3 = \phi[(T - O) + \chi(O - S)], \quad p_4 = \phi(\chi - 1)(O - P). \qquad (1)$
The four $p_i$'s describe the probability to cooperate after each pairwise outcome in {CC, CD, DC, DD}, respectively.
In the donation game, where the payoff matrix is replaced by $(R, S, T, P) = (b - c, -c, b, 0)$, the set of ZD strategies can be further written as
$p_1 = 1 - \phi(\chi - 1)(b - c - O), \quad p_2 = 1 - \phi[(O + c) + \chi(b - O)], \quad p_3 = \phi[(b - O) + \chi(O + c)], \quad p_4 = \phi(\chi - 1)O. \qquad (2)$
Given that the baseline payoff $O$ is between the punishment $P = 0$ and the reward $R = b - c$ and that $0 \le p_i \le 1$ for $1 \le i \le 4$, the admissible ranges of $O$ and $\chi$ are
$0 \le O \le b - c, \quad \text{and} \quad \chi \ge 1 \ (\phi > 0) \ \text{ or } \ \chi \le -\max\{(b - O)/(O + c),\, (O + c)/(b - O)\} \ (\phi < 0). \qquad (3)$
Once these two parameters are decided, the range of $\phi$ can be derived accordingly ($\mathbf{p}$ needs to be a probability vector). If $p_1 = p_3$ and $p_2 = p_4$, it is straightforward to show that
$\phi = \dfrac{1}{\chi b + c}, \qquad (4)$
and the linear payoff relation $s_X - O = \chi(s_Y - O)$ holds, where $s_X$ and $s_Y$ denote the expected payoffs of players X and Y. That is, any reactive strategy $(p, q)$ can be obtained by letting
$O = \dfrac{(b - c)\,q}{1 - p + q}, \quad \chi = \dfrac{(b - c) + (1 - p + q)\,c}{(b - c) - (1 - p + q)\,b}, \quad \phi = \dfrac{1}{\chi b + c}. \qquad (5)$
∎
Geometrically, the unit square defined by the two components $p$ and $q$ of reactive strategies contains the isosceles right triangle representing reactive ZD strategies with a positive extortion factor $\chi$, whose size is determined by the benefit-to-cost ratio $b/c$. The lengths of both legs are in fact $1 - c/b$. As is shown in Fig. 1, the horizontal leg AB, the vertical leg AC, and the hypotenuse BC of the triangle ABC correspond to extortioners, compliers, and equalizers, respectively. More detailed illustrations are given in Table I.
Figure 1: The unit square of reactive strategies $(p, q)$, containing the triangle ABC of reactive ZD strategies, whose legs AB and AC host extortioners and compliers and whose hypotenuse BC hosts equalizers.
Type | ZD parameters | Reactive expressions
extortioner | $O = P = 0$, $\chi > 1$ | $q = 0$, $c/b < p < 1$
complier | $O = R = b - c$, $\chi > 1$ | $p = 1$, $0 < q < 1 - c/b$
equalizer | $0 \le O \le b - c$, $\chi \to \pm\infty$ | $p - q = c/b$ ($c/b \le p \le 1$)
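As a sanity check on Proposition 1 and Table I, a reactive strategy $(p, q)$ can be mapped back to ZD parameters $(O, \chi, \phi)$; the closed-form expressions below follow the correspondence in Eq. (5) as reconstructed above and should be read as an illustration under that assumed parameterization rather than a verbatim transcription of the paper:

```python
def reactive_to_zd(p, q, b, c):
    """Map a reactive strategy (p, q) of the donation game to ZD parameters (O, chi, phi),
    assuming the enforced relation s_X - O = chi * (s_Y - O)."""
    s = 1.0 - p + q                 # common factor; s = 0 corresponds to TFT (degenerate)
    if s == 0:
        raise ValueError("TFT is a degenerate limit of the parameterization")
    O = (b - c) * q / s                           # baseline payoff, between 0 and b - c
    chi = ((b - c) + s * c) / ((b - c) - s * b)   # extortion factor; blows up on the equalizer line
    phi = 1.0 / (chi * b + c)                     # normalization factor
    return O, chi, phi

# Generous tit-for-tat (p, q) = (1, 1/3) with b = 3, c = 1 recovers a complier:
# O = R = 2, chi = 7/3, phi = 1/8.
print(reactive_to_zd(1.0, 1/3, b=3.0, c=1.0))
```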
Steering control by unbending strategies
As a routine, we assume that the focal player X uses the memory-one strategy $\mathbf{p}$ and the co-player Y uses the memory-one strategy $\mathbf{q}$. We then denote the average payoff of player X by $s_X$ and that of player Y by $s_Y$. Earlier on, we uncovered multiple general classes of fair-minded co-players who decide to fix their strategies such that a focal extortionate player can maximize their payoff only by trying to be fair, that is, by letting $\chi \to 1$.
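The expected payoffs $s_X$ and $s_Y$ used throughout can be computed from the stationary distribution of the Markov chain over the four outcomes {CC, CD, DC, DD}, following the memory-one formalism of Press and Dyson; the generic numerical sketch below is our own helper (it assumes the chain has a well-defined stationary distribution) and is reused in later examples:

```python
import numpy as np

def average_payoffs(p, q, R, S, T, P):
    """Long-run payoffs (s_X, s_Y) when player X uses memory-one strategy p and player Y uses q.
    Both strategies list cooperation probabilities after CC, CD, DC, DD from their own viewpoint."""
    p1, p2, p3, p4 = p
    q1, q2, q3, q4 = q[0], q[2], q[1], q[3]  # Y's view swaps the roles of CD and DC
    # Transition matrix over the states (CC, CD, DC, DD) as seen by player X.
    M = np.array([
        [p1 * q1, p1 * (1 - q1), (1 - p1) * q1, (1 - p1) * (1 - q1)],
        [p2 * q2, p2 * (1 - q2), (1 - p2) * q2, (1 - p2) * (1 - q2)],
        [p3 * q3, p3 * (1 - q3), (1 - p3) * q3, (1 - p3) * (1 - q3)],
        [p4 * q4, p4 * (1 - q4), (1 - p4) * q4, (1 - p4) * (1 - q4)],
    ])
    w, v = np.linalg.eig(M.T)                 # stationary distribution: left eigenvector for eigenvalue 1
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    pi = pi / pi.sum()
    return pi @ np.array([R, S, T, P]), pi @ np.array([R, T, S, P])

# Example: a reactive player (p, q) = (0.8, 0.2) against the complier built earlier.
R, S, T, P = 2.0, -1.0, 3.0, 0.0
print(average_payoffs([0.8, 0.2, 0.8, 0.2], [1.0, 0.5, 0.7, 0.2], R, S, T, P))
```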
Definition 1 (Unbending strategies).
Let player X use an extortionate zero-determinant strategy whose parameters satisfy $O = P$, $\chi > 1$, and $\phi > 0$. An unbending strategy used by player Y is a memory-one strategy that (i) neutralizes the parameter $\phi$ in the first place, such that both expected payoffs $s_X$ and $s_Y$ are independent of $\phi$, and (ii) guarantees that the derivative of $s_X$ with respect to $\chi$ is strictly negative, $\partial s_X / \partial \chi < 0$.
To scrutinize the behavior and hence the learning dynamics of a self-interested focal player drawn from a much broader strategy space than those in previous studies (see Fig. 1: the entire square instead of only the horizontal line AB), we now consider the scenario where player X adapts a reactive strategy $(p, q)$ and player Y holds a fixed unbending strategy from class A or class D.
Example 1 (Class A of unbending strategies).
In the donation game, an unbending strategy from class A can be described as:
(6)
(7)
The critical values for $p$ and $q$, to which we also refer later, are given by:
(8) |
where and .
Example 2 (Class D of unbending strategies).
In the donation game, an unbending strategy from class D can be described as:
(9)
(10)
Equivalently, class D of unbending strategies is the set of general zero-determinant (ZD) strategies with the baseline payoff satisfying $P < O \le R$. To be more specific, it is the relative complement of the extortionate ZD strategies with respect to the set of general ZD strategies with the extortion factor satisfying $\chi > 1$, consisting of
- (i) intermediate ZD: $P < O < R$ and $\chi > 1$,
- (ii) generous ZD: $O = R$ and $\chi > 1$.
Remark.
For any unbending strategy from class A, and always hold. An intuitive comparison between , , and is given in Fig. 2. For any unbending strategy from class D, always holds.
Figure 2: An intuitive comparison of the critical values in (8) for unbending strategies from class A (panels (a)–(c)).
As shown forthwith, we study how a fixed unbending player Y can curb extortion and thus promote the cooperation of a self-interested, adaptive focal player X, who explores the entire strategy space to get the highest possible payoff in the corresponding learning dynamics. Player X is assumed to use a reactive strategy $(p, q)$, a parameter space broader than that of extortionate ZD strategies (see Table I). To prove convergence results on possible learning outcomes, we investigate the monotonicity of the payoff $s_X$ with respect to both $p$ and $q$ and find where the maximum value can be achieved against unbending strategies from class A and class D, separately. We demonstrate that any evolutionary reactive player aiming to maximize their own payoff will be steered by an unbending co-player from extortion to fairness. For the sake of demonstration and ease of visualization, we have focused on the donation game, in which reactive strategies are actually a subset of ZD strategies. It is straightforward to extend our results to more general IPD games.
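To illustrate the adaptive learning dynamics studied below, the sketch that follows lets the reactive player climb the payoff gradient on $(p, q)$ against a fixed memory-one co-player. It reuses the zd_strategy and average_payoffs helpers sketched earlier; the finite-difference gradient, step size, initial condition, and the choice of a complier (a class D unbending strategy) as the fixed co-player are our own illustrative assumptions rather than the paper's setup:

```python
import numpy as np

def learn_reactive(q_fixed, R, S, T, P, p0=0.4, q0=0.05, lr=0.05, steps=2000, eps=1e-4):
    """Payoff-gradient ascent of a reactive player X with strategy (p, q) in the unit square,
    against a fixed memory-one co-player Y with strategy q_fixed."""
    x = np.array([p0, q0])
    for _ in range(steps):
        grad = np.zeros(2)
        for i in range(2):
            up, down = x.copy(), x.copy()
            up[i] = min(1.0, x[i] + eps)
            down[i] = max(0.0, x[i] - eps)
            f_up, _ = average_payoffs([up[0], up[1], up[0], up[1]], q_fixed, R, S, T, P)
            f_dn, _ = average_payoffs([down[0], down[1], down[0], down[1]], q_fixed, R, S, T, P)
            grad[i] = (f_up - f_dn) / (up[i] - down[i])
        x = np.clip(x + lr * grad, 0.0, 1.0)
    return x

R, S, T, P = 2.0, -1.0, 3.0, 0.0                                   # donation game, b = 3, c = 1
complier = zd_strategy(O=R, chi=2.0, phi=0.1, R=R, S=S, T=T, P=P)  # fixed unbending co-player (class D)
print(learn_reactive(complier, R, S, T, P))                        # drifts toward the cooperative edge p = 1
```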
III Steering Control Over Self-Interested Players
In this section, we demonstrate the unilateral steering control by unbending players that use a fixed strategy from class A. Later on, we also show similar steering control by unbending players from class D. In order to visualize the learning dynamics, we choose to focus on the two-dimensional reactive strategies without loss of generality. Our analysis can be extrapolated to other memory-one strategies. In what follows, we explicitly prove the two major objectives of steering control over opponents, that is, we characterize conditions for (i) yielding the possible maximum payoffs and (ii) rendering fair and cooperative strategies, by using fixed unbending strategies in the adaptive learning dynamics.
The Impact of Unbending Class A on Steered Learning Dynamics
Let $(p, q)$ be a reactive strategy used by player X and let player Y use a fixed unbending strategy from class A (as shown in Example 1). Using the method by Press and Dyson [7], we obtain the payoff $s_X$ as a quadratic rational function of $p$ and $q$, resulting from the quotient of two determinants. If , for example, . If , for another example, . In general, we have
(11)
where the denominator is always positive and
(12)
III-A Maximum of the Payoff
We claim that the maximal value of $s_X$ is obtained either at or . In addition to the proofs below, specific examples can be found in Table II.
Table II: Specific examples of the expected payoff $s_X$ of a reactive player X against unbending strategies from class A.
Theorem 1 (Maximum on the boundaries).
Assume that player X uses a reactive strategy $(p, q)$ and that player Y uses an unbending strategy from class A. The maximum of the expected payoff $s_X$ is
(13)
where is given in (8).
Proof.
We can further figure out whether the payoff always takes its maximum at by comparing and in (8).
Corollary 1 (Rule of the golden ratio).
Assume that player X uses a reactive strategy $(p, q)$ and that player Y uses an unbending strategy from class A. The maximum of the expected payoff $s_X$ is always achieved at if the benefit-to-cost ratio $b/c \ge (1 + \sqrt{5})/2$ (the golden ratio).
Proof.
The solutions to are and . A routine calculation shows that and . Therefore, whether determines whether always holds, which further yields a comparison between the benefit-to-cost ratio $b/c$ and the golden ratio $(1 + \sqrt{5})/2$. ∎
More specifically,
- (i) if $b/c \ge (1 + \sqrt{5})/2$,
(15)
- (ii) if $b/c < (1 + \sqrt{5})/2$, .
III-B Monotonicity of the Payoff
We first take the partial derivative of the payoff $s_X$ with respect to $p$ and get
(16)
where is defined as before and is a quadratic function of . We have
(17)
and
(18)
Theorem 2 (Monotonicity along the first axis).
Assume that player X uses a reactive strategy $(p, q)$ and that player Y uses an unbending strategy from class A. Let $q$ be fixed. The expected payoff $s_X$ either always increases or first decreases and then increases with $p$.
Proof.
Since and , the graph of as a function of is a parabola opening downward whose axis of symmetry is to the right of . A routine calculation shows that . Therefore, as $p$ increases, $\partial s_X / \partial p$ is either always nonnegative or first negative and then positive for any fixed $q$. ∎
We can obtain more detailed results about $\partial s_X / \partial p$ on the boundaries of the unit square.
Corollary 2 (Monotonicity on the boundaries).
Assume that player X uses a reactive strategy $(p, q)$ and that player Y uses an unbending strategy from class A. Let $q$ be fixed. The monotonicity of the expected payoff $s_X$ with respect to $p$ satisfies
- (i) if , $\partial s_X / \partial p$ is always nonnegative,
- (ii) if , $\partial s_X / \partial p$ is first negative and then positive,
where the critical value is given in (8).
Proof.
Based on the proposition above, it suffices to consider $\partial s_X / \partial p$ at $p = 0$, that is,
(19)
The sign of $\partial s_X / \partial p$ at $p = 0$ is determined by the relation between $q$ and the critical value given in (8). ∎
Remark.
In fact, $\partial s_X / \partial p$ is always positive at $q = 0$ and hence positive for any $p$. Namely, if player X uses a reactive extortionate ZD strategy and Y uses an unbending strategy from class A, the expected payoff $s_X$ will be irrelevant to $\phi$ and an increasing function of $p$, which echoes what we have found in previous studies.
Example 3.
Assume that player X uses a reactive strategy $(p, q)$ and that player Y uses an unbending strategy from class A. Let
(20)
It is easy to tell that . The partial derivative of the expected payoff $s_X$ with respect to $p$ satisfies
(21)
We then take the partial derivative of the payoff $s_X$ with respect to $q$. The general expression of $\partial s_X / \partial q$ is long and complicated. We only enumerate the results on the boundaries.
Proposition 2 (Monotonicity along the second axis).
Assume that player X uses a reactive strategy $(p, q)$ and that player Y uses an unbending strategy from class A. The monotonicity of the expected payoff $s_X$ with respect to $q$ satisfies
- (i) when , $\partial s_X / \partial q$ is always negative,
- (ii) when , $\partial s_X / \partial q$ is always zero,
- (iii) when and , $\partial s_X / \partial q$ is always negative.
The detailed proof is omitted due to space limitations.
To summarize, the fixed unbending strategy used by the co-player will determine the performance and the learning dynamics of the focal player. The global maximum payoff is always achieved at the cooperative edge if the unbending strategy is taken from a subset of class A (Fig. 2(a)) or even from the entire set of class A of unbending strategies (Fig. 2(b) and 2(c)) provided that the benefit-to-cost ratio is at least the golden ratio $(1 + \sqrt{5})/2$. Furthermore, the learning dynamics can exhibit either global convergence (the shaded orange areas in Fig. 2) or bistability along the direction of change of $p$, where the final state of the focal player depends on their initial state. Examples are given in Table II and Fig. 3.
Figure 3: Examples of the steered learning dynamics of a reactive player X against unbending strategies from class A.
The Impact of Unbending Class D on Steered Learning Dynamics
Let $(p, q)$ be a reactive strategy used by player X and let player Y use an unbending strategy from class D (as shown in Example 2). We obtain $s_X$ as a linear rational function of $p$ and $q$. If , for example, . If and , for another example, the unbending co-player becomes a complier with $O = R$ and $\chi > 1$. In general, we have
(22)
where and .
III-C Maximum of the Payoff
We claim that the maximal value of $s_X$ is obtained either at the corner $(p, q) = (1, 1)$ or along the edge $p = 1$.
Theorem 3 (Maximum on the boundaries).
Assume that player X uses a reactive strategy $(p, q)$ and that player Y uses an unbending strategy from class D. The maximum of the expected payoff $s_X$ is
(23)
where the value of the maximum is determined by the baseline payoff $O$ of player Y's strategy as a general ZD strategy.
III-D Monotonicity of the Payoff
We take the partial derivatives of the payoff $s_X$ with respect to $p$ and $q$ and get
(24)
where the denominator is the same as that in (22).
Theorem 4 (Monotonicity across the unit square).
Assume that player X uses a reactive strategy $(p, q)$ and that player Y uses an unbending strategy from class D. The expected payoff $s_X$ is a strictly increasing function of $p$ and an increasing function of $q$:
$\partial s_X / \partial p > 0 \quad \text{and} \quad \partial s_X / \partial q \ge 0. \qquad (25)$
The equality holds if and only if $p = 1$ and $O = R$.
Proof.
We know that and . Moreover, implies that . Hence, we have . On the other hand, it is easy to tell that , which is an equality if and only if and . ∎
To summarize,
- (i) if Y uses an intermediate ZD strategy,
(26)
- (ii) if Y uses a generous ZD strategy,
(27)
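As a numerical illustration of Theorem 4 and the dichotomy above, the following sketch evaluates $s_X$ on a grid of reactive strategies against a fixed complier (generous ZD) co-player and checks that the payoff never decreases in $p$ or $q$; it reuses the helpers sketched earlier, and the grid resolution and parameter values are illustrative assumptions:

```python
import numpy as np

R, S, T, P = 2.0, -1.0, 3.0, 0.0                            # donation game, b = 3, c = 1
Y = zd_strategy(O=R, chi=2.0, phi=0.1, R=R, S=S, T=T, P=P)  # generous ZD (complier) co-player

grid = np.linspace(0.01, 0.99, 25)
payoff = np.array([[average_payoffs([p, q, p, q], Y, R, S, T, P)[0] for q in grid] for p in grid])

# Monotonicity check: s_X should be nondecreasing along p (rows) and along q (columns).
print(np.all(np.diff(payoff, axis=0) >= -1e-9))   # nondecreasing in p
print(np.all(np.diff(payoff, axis=1) >= -1e-9))   # nondecreasing in q
print(payoff.max(), "approaches R =", R)          # maximum is attained near the cooperative edge p = 1
```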
We have shown the role of unbending strategies in steering control and particularly their ability to enforce fair and cooperative outcomes against payoff-maximizing players. Our further analysis reveals that some common IPD strategies are in fact unbending, including the PSO Gambler (a machine-trained strategy) [5] and generous ZD strategies [8]. In this regard, the framework of the IPD has the potential for synergistically combining artificial intelligence (AI) and evolutionary game theory to enhance cooperation and foster fairness in various multi-agent systems. Also, all the unbending strategies from class A, and at least the generous ZD strategies from class D, are able to establish mutual cooperation among themselves. Moreover, even if noisy games are considered, for example, with probability $\epsilon$ the intended move is implemented as the opposite one, their mutual cooperation is only impacted to the order of $O(\epsilon)$. These findings suggest the robustness and winning advantage of unbending strategies in population competition dynamics beyond the pairwise interactions considered above.
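A quick way to probe the noise claim above is to perturb both strategies with a small implementation error $\epsilon$ and recompute the long-run payoffs; the sketch below does so for two compliers facing each other, reusing the earlier helpers, and the specific noise model (each intended move flipped independently with probability $\epsilon$) is our own simplifying assumption:

```python
import numpy as np

def with_noise(strategy, eps):
    """Implementation error: with probability eps the intended move is flipped, so an
    intended cooperation probability x becomes (1 - eps) * x + eps * (1 - x)."""
    s = np.asarray(strategy, dtype=float)
    return (1 - eps) * s + eps * (1 - s)

R, S, T, P = 2.0, -1.0, 3.0, 0.0
complier = zd_strategy(O=R, chi=2.0, phi=0.1, R=R, S=S, T=T, P=P)

for eps in [0.0, 0.01, 0.05]:
    noisy = with_noise(complier, eps)
    s_X, s_Y = average_payoffs(noisy, noisy, R, S, T, P)
    print(eps, round(s_X, 4))   # mutual-cooperation payoff degrades gently from R = b - c as eps grows
```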
IV Discussion & Conclusion
Our search for unbending strategies is directly motivated by suppressing extortion, thereby requiring targeted interactions with ZD opponents. Nevertheless, our work is more broadly motivated by the question of how to foster and enforce fairness and cooperation in pairwise interactions in IPD games. We reveal the previously unforeseen unbending properties of many well-known strategies, in that they actually have the ability to resist extortion. For example, class A contains the "PSO Gambler", an optimized memory-one strategy that is unbending when . Other well-known examples of unbending strategies include WSLS from class A when , "willing" from class C, and all the strategies from class D, which are ZD players themselves but with a higher level of generosity than their opponents.
As such, unbending strategies can be used to rein in seemingly formidable extortionate ZD players, from whom a fair offer can ultimately be cultivated in their own interest. Our findings are in line with recent experimental evidence suggesting that human players often choose not to accede to extortion out of concern for fairness [16]. The conclusion can be further generalized and sheds light on the backfire of extortion caused by unbending players in a broad sense. Our analysis shows that an adapting reactive player X, in pursuit of a higher payoff, will cooperate more by increasing $p$ and in some cases $q$ so as to avoid potential punishment from a fixed unbending co-player Y in a donation game. Unbending strategies from class A would "train" the reactive player to behave like generous TFT ($p = 1$, $q > 0$) whereas those from class D would even "train" the reactive player to act as a full cooperator ($p = q = 1$).
Under the influence of a co-player Y from class A of unbending strategies, the learning dynamics of a reactive player X may exhibit bistability along the direction of change of $p$, hence allowing two different learning outcomes, depending on both the original state of the focal player and the specific strategy of the co-player. After entering full cooperation with respect to $p$, player X will stay neutral along the direction of change of $q$. A reactive strategy can therefore converge to the full defector corner $(0, 0)$ or otherwise to the generous TFT edge $p = 1$. Player X will always get the greatest payoff at the cooperative edge if the ratio $b/c \ge (1 + \sqrt{5})/2$ (the "golden ratio") or converge to the edge if player Y takes a strategy from a subset of class A. On the other hand, unbending strategies from class D are able to guarantee global convergence to the same cooperative edge, after which the baseline payoff $O$ of player Y can further decide the direction of change of $q$. Player X will increase the value of $q$ until reaching the full cooperator corner $(1, 1)$ if player Y uses an intermediate ZD strategy with $P < O < R$ or remain neutral along the edge if player Y uses the generous ZD strategy with $O = R$.
In a nutshell, we have found and characterized general classes of unbending strategies that are able to steer the learning dynamics of selfish players in head-to-head encounters. Unbending strategies can trigger the backfire of greedy exploitation, punish extortion, and thus turn the interaction into an Ultimatum-like game: to "win each battle" or to "win the war". These strategies helm the learning dynamics of extortionate and reactive players toward fairness without introducing population dynamics and evolution through generations.
Our work helps pave the way for the promising initiative of combining game theory with artificial intelligence to gain more analytical insights into computational learning theory. Of particular interest are extensions to multi-agent learning systems that are fraught with perception and implementation errors and beyond pairwise interactions in a changing environment [21]. In doing so, integrating theoretical and empirical approaches will help enhance our understanding of cooperation in various advanced AI systems besides social and biological systems [22].
References
- [1] Robert Axelrod and William D Hamilton. The evolution of cooperation. Science, 211(4489):1390–1396, 1981.
- [2] Martin A Nowak. Five rules for the evolution of cooperation. Science, 314(5805):1560–1563, 2006.
- [3] Robert L Trivers. The evolution of reciprocal altruism. The Quarterly Review of Biology, 46(1):35–57, 1971.
- [4] Christian Hilbe, Krishnendu Chatterjee, and Martin A Nowak. Partners and rivals in direct reciprocity. Nature Human Behaviour, 2(7):469–477, 2018.
- [5] Marc Harper, Vincent Knight, Martin Jones, Georgios Koutsovoulos, Nikoleta E Glynatsi, and Owen Campbell. Reinforcement learning produces dominant strategies for the iterated prisoner's dilemma. PLoS ONE, 12(12):e0188046, 2017.
- [6] Maarten C Boerlijst, Martin A Nowak, and Karl Sigmund. Equal pay for all prisoners. The American Mathematical Monthly, 104(4):303–305, 1997.
- [7] William H Press and Freeman J Dyson. Iterated prisoner's dilemma contains strategies that dominate any evolutionary opponent. Proceedings of the National Academy of Sciences, 109(26):10409–10413, 2012.
- [8] Alexander J Stewart and Joshua B Plotkin. From extortion to generosity, evolution in the iterated prisoner's dilemma. Proceedings of the National Academy of Sciences, 110(38):15348–15353, 2013.
- [9] Ethan Akin. What you gotta know to play good in the iterated prisoner's dilemma. Games, 6(3):175–190, 2015.
- [10] Xingru Chen, Long Wang, and Feng Fu. The intricate geometry of zero-determinant strategies underlying evolutionary adaptation from extortion to generosity. New Journal of Physics, 24(10):103001, 2022.
- [11] Dong Hao, Zhihai Rong, and Tao Zhou. Extortion under uncertainty: Zero-determinant strategies in noisy games. Physical Review E, 91(5):052803, 2015.
- [12] Alex McAvoy and Christoph Hauert. Autocratic strategies for iterated games with arbitrary action spaces. Proceedings of the National Academy of Sciences, 113(13):3573–3578, 2016.
- [13] Christian Hilbe, Martin A Nowak, and Karl Sigmund. Evolution of extortion in iterated prisoner's dilemma games. Proceedings of the National Academy of Sciences, 110(17):6913–6918, 2013.
- [14] Jing Chen and Aleksey Zinger. The robustness of zero-determinant strategies in iterated prisoner's dilemma games. Journal of Theoretical Biology, 357:46–54, 2014.
- [15] Fang Chen, Te Wu, and Long Wang. Evolutionary dynamics of zero-determinant strategies in repeated multiplayer games. Journal of Theoretical Biology, 549:111209, 2022.
- [16] Christian Hilbe, Torsten Röhl, and Manfred Milinski. Extortion subdues human players but is finally punished in the prisoner's dilemma. Nature Communications, 5(1):1–6, 2014.
- [17] Lutz Becks and Manfred Milinski. Extortion strategies resist disciplining when higher competitiveness is rewarded with extra gain. Nature Communications, 10(1):1–9, 2019.
- [18] Martin A Nowak and Karl Sigmund. Tit for tat in heterogeneous populations. Nature, 355(6357):250–253, 1992.
- [19] Martin Nowak and Karl Sigmund. A strategy of win-stay, lose-shift that outperforms tit-for-tat in the prisoner's dilemma game. Nature, 364(6432):56–58, 1993.
- [20] Maria R D'Orsogna and Matjaž Perc. Statistical physics of crime: A review. Physics of Life Reviews, 12:1–21, 2015.
- [21] Daan Bloembergen, Karl Tuyls, Daniel Hennes, and Michael Kaisers. Evolutionary dynamics of multi-agent learning: A survey. Journal of Artificial Intelligence Research, 53:659–697, 2015.
- [22] Allan Dafoe, Yoram Bachrach, Gillian Hadfield, Eric Horvitz, Kate Larson, and Thore Graepel. Cooperative AI: machines must learn to find common ground, 2021.