
Steering control of payoff-maximizing players in adaptive learning dynamics

X.C. gratefully acknowledges the generous faculty startup fund provided by BUPT (No. 505022023).

1st Xingru Chen
School of Sciences
Beijing University of Posts and Telecommunications
Beijing 100876, China
[email protected]

2nd Feng Fu
Department of Mathematics
Department of Biomedical Data Science
Dartmouth College
Hanover, NH 03755, USA
[email protected]
Abstract

Evolutionary game theory provides a mathematical foundation for cross-disciplinary fertilization, especially for integrating ideas from artificial intelligence and game theory. Such integration offers a transparent and rigorous approach to complex decision-making problems in a variety of important contexts, ranging from evolutionary computation to machine behavior. Despite the astronomically huge individual behavioral strategy space for interactions in the iterated Prisoner's Dilemma (IPD) games, the so-called Zero-Determinant (ZD) strategies are a set of rather simple memory-one strategies that can nevertheless unilaterally set a linear payoff relationship between themselves and their opponent. Although wielding ZD strategies gives players an upper hand in the IPD games, we find and characterize unbending strategies that can force ZD players to be fair in their own interest. Moreover, our analysis reveals the ubiquity of unbending properties in common IPD strategies that were previously overlooked. In this work, we demonstrate the important steering role of unbending strategies in fostering fairness and cooperation in pairwise interactions. Our results will help bring a new perspective by combining game theory and multi-agent learning systems for optimizing winning strategies that are robust to noise, errors, and deception in non-zero-sum games.

Index Terms:
steering control, adaptive learning, Prisoner’s Dilemma, direct reciprocity, evolutionary game theory

I Introduction

Evolutionary game theory provides a mathematical foundation for studying mechanisms of cooperation and myriad learning theories toward altruistic behavior [1, 2]. Adaptive learning strategies will evolve in scenarios where self-interested individuals interact with one another. Among others, the Prisoner's Dilemma (PD) game is a symmetric game involving two players X and Y and two actions: to cooperate or to defect. In a one-shot PD game, the four possible outcomes correspond to different payoffs from the focal player's perspective: if both are cooperators, one gets the reward $R$; if a cooperator is against a defector, the sucker's payoff $S$; if a defector is against a cooperator, the temptation $T$; and if both are defectors, the punishment $P$. The game is considered a paradigm for understanding the conflict between self-interest and collective interest, as the payoff structure satisfies $T > R > P > S$. As a particular example, the donation game is a simplified form of the PD game, where a cooperator offers the other player a benefit $b$ at a cost $c$ to itself with $b > c$ but a defector offers nothing. The iterated Prisoner's Dilemma (IPD) game further assumes repeated encounters between the same two players and sheds light on the idea of direct reciprocity [3].
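As a minimal illustration of this payoff structure, the short sketch below (with $b = 3$ and $c = 1$ chosen purely as example values) encodes the donation-game payoffs and checks the PD ordering and the equal-gains-from-switching identity used later:

```python
# Donation-game payoffs for assumed example values b = 3, c = 1.
b, c = 3.0, 1.0

R, S, T, P = b - c, -c, b, 0.0      # reward, sucker's payoff, temptation, punishment
assert T > R > P > S                # the PD ordering holds whenever b > c > 0
assert R + P == T + S == b - c      # "equal gains from switching"
print(f"R={R}, S={S}, T={T}, P={P}")
```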

Despite the astronomically huge individual behavioral strategy space for IPD game interactions [4, 5], the so-called zero-determinant (ZD) strategies, including extortioners (extortionate ZD), compliers (generous ZD), and equalizers, are a set of rather simple memory-one strategies that can nevertheless unilaterally set a linear relation between their own payoff $s_X$ and their co-player's payoff $s_Y$ [6, 7, 8, 9, 10]. The three free parameters that determine a ZD strategy $\bm{p} = [p_1, p_2, p_3, p_4]$, and how unequal the corresponding payoff relation $s_X - O = \chi(s_Y - O)$ is, are the baseline payoff $O$, the extortion factor $\chi$, and the normalization factor $\phi$. An equalizer with $\chi = +\infty$ will fix their co-player's payoff to the given $O$, which can take any value between the punishment $P$ and the reward $R$. An extortioner with $O = P$ and $\chi > 1$ can always get an unfair share of the payoffs in conventional IPD games with $T + S > 2P$ [7]. In contrast, a complier with $O = R$ and $\chi > 1$ can guarantee that their own payoff is never above their co-player's [8].

As such, the finding of ZD strategies has greatly spurred new waves of work from diverse fields [11, 12], aiming to (i) elucidate the robustness and resilience of cooperation by means of the natural selection of IPD strategies from a population dynamics perspective [13, 14, 15] and (ii) explain reactions of human subjects to extortion that could be out of a desire for profit and/or a concern for fairness [16, 17]. Despite the upper hand that ZD strategies confer in IPD games, one open issue remains: how do the payoff structure and the three parameters determine their pairwise dominance and extortion ability?

We have fully addressed the above question and revealed the unforeseen Achilles' heel of ZD strategies. Most notably, we have found and characterized multiple general classes of strategies that are unbending to extortion, can trigger the backfire of being extortionate, and can even outperform an extortioner under certain conditions. A fixed unbending strategy will compel a self-interested and adaptive extortioner who tries to maximize their payoff to be fair and ultimately behave like Tit-for-Tat (TFT) [1, 18] by letting $\chi \to 1$. Examples of unbending strategies include general ZD strategies with the baseline payoff $O$ satisfying $P < O \leq R$ and the simple adaptive player Win-Stay Lose-Shift (WSLS) [19], the latter of which can even outperform an extortioner in interactions of a more adversarial nature where $T + S < 2P$ [20].

Remarkably, the insistence of an unbending player on fairness can rein in not only extortioners but also a much broader family of strategies. In the present work, we further illustrate such a steering role toward altruism by considering the adaptive learning dynamics of a focal reactive player, whose move in the current round depends on what the co-player did in the last round [18], against a fixed unbending co-player in a donation game [13]. In so doing, reactive players not only are de facto ZD strategies but also can be conveniently visualized within a unit square. It turns out that unbending strategies from different classes either "train" the reactive player to behave like generous TFT [19] or turn it into a full cooperator. Therefore, unbending players can helm the adaptive learning dynamics of extortionate and reactive players to fairness even in the absence of population dynamics and evolution. Our findings help illuminate ideas for multi-agent optimization in computational learning theory.

II Adaptive Learning Dynamics against Unbending Players

Payoff structure – equal gains from switching

As a simplified form of the PD game, the donation game features two parameters $b$ and $c$, representing the benefit and the cost of cooperation. The payoff structure follows $R = b - c$, $S = -c$, $T = b$, and $P = 0$. As a particular type of memory-one strategy, reactive strategies can be described by the vector $\bm{p} = [p_1, p_2, p_1, p_2]$, where $p_1$ and $p_2$ are the probabilities to cooperate after a cooperation or a defection by the co-player, respectively, and are known as the reactive norm. Reactive strategies in fact become ZD strategies under the setting of a donation game satisfying "equal gains from switching", namely, $R + P = T + S = b - c$.

Proposition 1 (Reactive strategies versus Zero-Determinant strategies).

In the donation game, the set of reactive strategies is a subset of general Zero-Determinant strategies with the normalization factor $\phi = 1/(b\chi + c)$ and the extortion factor $\chi$ being either positive or negative.

Proof.

For any PD game, the set of ZD strategies is a collection of memory-one strategies $[p_1, p_2, p_3, p_4]$ with three free parameters $(O, \chi, \phi)$:

\begin{split}p_{1}&=1-\phi(R-O)(\chi-1),\\ p_{2}&=1-\phi[(T-O)\chi+(O-S)],\\ p_{3}&=\phi[(O-S)\chi+(T-O)],\\ p_{4}&=\phi(O-P)(\chi-1).\end{split} (1)

The four $p_i$'s describe the probability to cooperate after each pairwise outcome in {CC, CD, DC, DD}, respectively.

In the donation game where the payoff matrix $[R, S, T, P]$ is replaced by $[b-c, -c, b, 0]$, the set of ZD strategies can be further written as

\begin{split}p_{1}&=1-\phi(b-c-O)(\chi-1),\\ p_{2}&=1-\phi[(b-O)\chi+(O+c)],\\ p_{3}&=\phi[(O+c)\chi+(b-O)],\\ p_{4}&=\phi O(\chi-1).\end{split} (2)

Given that the baseline payoff is between the punishment $P$ and the reward $R$ and that $0 \leq p_i \leq 1$ for $i = 1, 2, 3, 4$, the admissible ranges of $(O, \chi)$ are

\begin{gathered}0\leq O\leq b-c,\\ 1\leq\chi<+\infty\,\text{or}\,-\infty<\chi\leq-\max\{\frac{O+c}{b-O},\frac{b-O}{O+c}\}.\end{gathered} (3)

Once these two parameters are decided, the range of $\phi$ can be derived accordingly ($\bm{p}$ needs to be a probability vector). If $\phi = 1/(b\chi + c)$, it is straightforward to show that

\begin{split}p_{1}&=p_{3}=1-\frac{(b-c-O)(\chi-1)}{b\chi+c},\\ p_{2}&=p_{4}=\frac{O(\chi-1)}{b\chi+c},\end{split} (4)

and the linear relation $O(1-p_1) - (b-c-O)p_2 = 0$ holds. That is, any reactive strategy $\bm{p} = [p_1, p_2, p_1, p_2]$ can be obtained by letting

\begin{cases}O=(b-c)p_{2}/(1-p_{1}+p_{2}),\\ \chi=[b-c(p_{1}-p_{2})]/[b(p_{1}-p_{2})-c],\\ \phi=1/(b\chi+c).\end{cases} (5)

Geometrically, the unit square defined by the two components $p_1$ and $p_2$ of reactive strategies contains the isosceles right triangle representing reactive ZD strategies with the extortion factor $\chi \geq 1$, whose size is determined by the benefit-to-cost ratio $r = b/c$. The lengths of both legs are in fact $1 - 1/r$. As shown in Fig. 1, the horizontal leg AB, the vertical leg AC, and the hypotenuse BC of the triangle ABC correspond to extortioners, compliers, and equalizers, respectively. More detailed illustrations are given in Table I.
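The mapping in (4) and (5) can be checked numerically. The sketch below (a rough illustration rather than the paper's code; $b = 3$, $c = 1$, and the reactive strategy $(p_1, p_2) = (0.8, 0.2)$ are assumed example values) recovers the ZD parameters of an arbitrary reactive strategy and then reconstructs the memory-one vector from them:

```python
import numpy as np

b, c = 3.0, 1.0            # assumed benefit and cost of cooperation
p1, p2 = 0.8, 0.2          # an arbitrary reactive strategy inside the unit square

# ZD parameters from Eq. (5)
O = (b - c) * p2 / (1 - p1 + p2)
chi = (b - c * (p1 - p2)) / (b * (p1 - p2) - c)
phi = 1 / (b * chi + c)

# Rebuild the memory-one strategy from the donation-game ZD form of Eq. (2)
zd = [1 - phi * (b - c - O) * (chi - 1),
      1 - phi * ((b - O) * chi + (O + c)),
      phi * ((O + c) * chi + (b - O)),
      phi * O * (chi - 1)]

print(np.round(zd, 6))                                   # expected: [p1, p2, p1, p2]
print(np.isclose(O * (1 - p1) - (b - c - O) * p2, 0))    # the linear relation below Eq. (4)
```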

Figure 1: Reactive strategies as a subset of general Zero-Determinant (ZD) strategies with the normalization factor $\phi = 1/(b\chi + c)$ in the donation game. In each panel, the set of ZD strategies with a positive extortion factor $\chi$ is shaded in white (the isosceles right triangle), that with a negative $\chi$ in gray (the pentagon), and their union is the set of reactive strategies (the unit square). The three particular types of ZD strategies, known as extortioners, compliers, and equalizers, are highlighted in orange, green, and blue, respectively.
TABLE I: Examples of reactive strategies in the donation game that in fact have properties of Zero-Determinant (ZD) strategies.
Type | ZD parameters | Reactive expressions
extortioner | $O = 0$, $\chi > 1$ | $p_1 = 1 - \frac{(b-c)(\chi-1)}{b\chi+c}$, $p_2 = 0$
complier | $O = b - c$, $\chi > 1$ | $p_1 = 1$, $p_2 = \frac{(b-c)(\chi-1)}{b\chi+c}$
equalizer | $0 \leq O \leq b - c$, $\chi = +\infty$ | $p_1 = \frac{O+c}{b}$, $p_2 = \frac{O}{b}$

Steering control by unbending strategies

As a routine, we assume that the focal player X uses strategy $\bm{p}$ and the co-player Y uses strategy $\bm{q}$. We then denote the average payoff of player X by $s_X$ and that of player Y by $s_Y$. Earlier on, we uncovered multiple general classes of fair-minded co-players who fix their strategies such that a focal extortionate player can maximize their payoff only by trying to be fair, that is, by letting $\chi \to 1$.

Definition 1 (Unbending strategies).

Let player X use an extortionate Zero-Determinant strategy $\bm{p}$ whose parameters satisfy $O = P$, $\chi > 1$, and $\phi > 0$. An unbending strategy $\bm{q}$ used by player Y is a memory-one strategy that (i) neutralizes the parameter $\phi$ in the first place such that both expected payoffs $s_X$ and $s_Y$ are independent of $\phi$, i.e., $\partial s_X/\partial\phi = \partial s_Y/\partial\phi = 0$, and (ii) guarantees that the derivative of $s_X$ with respect to $\chi$ is strictly negative, $\partial s_X/\partial\chi < 0$.
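Definition 1 can be probed numerically by pairing extortionate ZD strategies with a candidate co-player and computing long-run payoffs from the stationary distribution of the induced Markov chain, in the spirit of Press and Dyson [7]. The sketch below is only an illustration under assumed example values ($b = 3$, $c = 1$, and the candidate $\bm{q} = [1, 0.75, 0, 0.2]$, which satisfies the class A conditions introduced in Example 1 below):

```python
import numpy as np

def payoffs(p, q, b=3.0, c=1.0):
    """Long-run payoffs (s_X, s_Y) of two memory-one strategies in the donation
    game, computed from the stationary distribution over {CC, CD, DC, DD}
    (states labeled from X's perspective)."""
    swap = [0, 2, 1, 3]                              # CD and DC swap in Y's perspective
    M = np.zeros((4, 4))
    for i in range(4):
        px, qy = p[i], q[swap[i]]
        M[i] = [px * qy, px * (1 - qy), (1 - px) * qy, (1 - px) * (1 - qy)]
    A = np.vstack([M.T - np.eye(4), np.ones(4)])     # stationarity + normalization
    v = np.linalg.lstsq(A, np.array([0, 0, 0, 0, 1.0]), rcond=None)[0]
    return v @ np.array([b - c, -c, b, 0.0]), v @ np.array([b - c, b, -c, 0.0])

def extortionate_zd(chi, phi, b=3.0, c=1.0):
    """Extortionate ZD strategy (baseline payoff O = P = 0) from Eq. (2)."""
    return [1 - phi * (b - c) * (chi - 1), 1 - phi * (b * chi + c),
            phi * (c * chi + b), 0.0]

q = [1.0, 0.75, 0.0, 0.2]                            # a candidate unbending (class A) strategy
for chi in (1.5, 2.0, 3.0):
    row = [payoffs(extortionate_zd(chi, phi), q)[0] for phi in (0.02, 0.05, 0.08)]
    print(chi, np.round(row, 6))   # s_X should stay constant across phi and drop as chi grows
```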

To scrutinize the behavior and hence the learning dynamics of a self-interested focal player over a much broader strategy space than those in previous studies (see Fig. 1: the entire square instead of only the horizontal leg AB), we now consider the scenario where $\bm{p}$ is a reactive strategy used by an adapting player and $\bm{q}$ is a fixed unbending strategy from class A or class D.

Example 1 (Class A of unbending strategies).

In the donation game, an unbending strategy $\bm{q} = [q_1, q_2, q_3, q_4]$ from class A can be described as:

q_{1}=1,\quad q_{3}=0, (6)
q_{a}<q_{2}<1,\quad 0<q_{4}\leq h_{A}(q_{2}). (7)

The critical values for $q_2$ and $q_4$ are $q_a = (b-c)c/(b^2+bc-c^2)$, $h_a$, $h_{Aa}$, and $h_A$ (for future reference as well):

\begin{cases}h_{a}(q_{2})=(bq_{2}-c)(1-q_{2})/[b(1-q_{2})+c],\\ h_{Aa}(q_{2})=(b-c)(1-q_{2})/c,\\ h_{A}(q_{2})=(\ast)/(\ast\ast),\end{cases} (8)

where $(\ast) = (b-c)(1-q_2)[(b^2+bc-c^2)q_2-(b-c)c]$ and $(\ast\ast) = bc^2q_2^2-(b-c)(b^2-bc-c^2)q_2+(b-c)^2(b+c)$.
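For concreteness, these critical curves can be evaluated directly. The snippet below (again with assumed example values $b = 3$, $c = 1$, and $q_2 = 0.75$) computes $q_a$, $h_a$, $h_{Aa}$, and $h_A$ from (8):

```python
b, c = 3.0, 1.0    # assumed example values

q_a = (b - c) * c / (b**2 + b*c - c**2)

def h_a(q2):
    return (b*q2 - c) * (1 - q2) / (b*(1 - q2) + c)

def h_Aa(q2):
    return (b - c) * (1 - q2) / c

def h_A(q2):
    num = (b - c) * (1 - q2) * ((b**2 + b*c - c**2)*q2 - (b - c)*c)
    den = b*c**2*q2**2 - (b - c)*(b**2 - b*c - c**2)*q2 + (b - c)**2*(b + c)
    return num / den

q2 = 0.75
print(q_a, h_a(q2), h_Aa(q2), h_A(q2))   # here h_A > h_a and h_Aa > h_a, as stated in the Remark below
```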

Example 2 (Class D of unbending strategies).

In the donation game, an unbending strategy $\bm{q} = [q_1, q_2, q_3, q_4]$ from class D can be described as:

q_{4}=h_{D}(q_{1},q_{2},q_{3})=q_{2}+q_{3}-q_{1}, (9)
\begin{split}d_{D}(q_{1},q_{2},q_{3})=b(q_{2}-q_{1})+c(q_{3}-q_{1})+c&<0,\\ q_{2}+q_{3}-q_{1}&>0.\end{split} (10)

Equivalently, class D of unbending strategies is the set of general Zero-Determinant (ZD) strategies with the baseline payoff $O$ satisfying $0 < O \leq b - c$. To be more specific, it is the relative complement of extortionate ZD strategies with respect to the set of general ZD strategies with the extortion factor $\chi$ satisfying $\chi > 1$, consisting of

  (i) intermediate ZD: $0 < O < b - c$ and $q_1 < 1$,

  (ii) generous ZD: $O = b - c$ and $q_1 = 1$.

Remark.

For any unbending strategy $\bm{q}$ from class A, $h_A(q_2) > h_a(q_2)$ and $h_{Aa}(q_2) > h_a(q_2)$ always hold. An intuitive comparison between $h_A$, $h_{Aa}$, and $h_a$ is given in Fig. 2. For any unbending strategy $\bm{q}$ from class D, $q_3 > 0$ always holds.
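The membership conditions (9) and (10) are easy to test in code. The following sketch (illustrative only; $b = 3$, $c = 1$, and the generous ZD instance with $O = b - c$, $\chi = 2$, $\phi = 1/(b\chi + c)$ are assumed choices) checks that a generous ZD strategy indeed falls into class D:

```python
b, c = 3.0, 1.0    # assumed example values

def in_class_D(q, tol=1e-12):
    q1, q2, q3, q4 = q
    on_plane = abs(q4 - (q2 + q3 - q1)) < tol             # Eq. (9)
    d_D = b*(q2 - q1) + c*(q3 - q1) + c                    # Eq. (10), must be negative
    return on_plane and d_D < -tol and (q2 + q3 - q1) > tol

# A generous ZD strategy (O = b - c, chi = 2) written in memory-one form via Eq. (2)
chi = 2.0
phi = 1.0 / (b*chi + c)
generous = [1.0, 1 - phi*(c*chi + b), phi*(b*chi + c), phi*(b - c)*(chi - 1)]

print(generous, in_class_D(generous))   # expected: [1.0, 2/7, 1.0, 2/7] and True
```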

Figure 2: Steering control by unbending strategies from class A. A fixed unbending player can exert unilateral influence on both the possible maximum payoff and the direction of strategy choices of any adaptive learning co-player. We find a golden ratio of the underlying payoff matrix, i.e., the benefit-to-cost ratio $r = b/c > (1+\sqrt{5})/2$, that ensures the global maximum payoff at $(1, 0)$.

In what follows, we study how a fixed unbending player Y can curb the extortion and thus promote the cooperation of a self-interested, adaptive focal player X who explores the entire strategy space to get the highest possible payoff in the corresponding learning dynamics. Player X is assumed to use a reactive strategy within the parameter space $(p_1, p_2)$, which is broader than that of extortionate ZD strategies (see Table I). To prove convergence results on possible learning outcomes, we investigate the monotonicity of the payoff $s_X$ with respect to both $p_1$ and $p_2$ and find where the maximum value can be achieved against unbending strategies from class A and class D, separately. We demonstrate that any evolutionary reactive player aiming to maximize their own payoff will be steered by an unbending co-player from extortion to fairness. For the sake of demonstration and ease of visualization, we have focused on the donation game in which reactive strategies are actually a subset of ZD strategies. It is straightforward to extend our results to more general IPD games.

III Steering Control Over Self-Interested Players

In this section, we demonstrate the unilateral steering control by unbending players that use a fixed strategy from class A. Later on, we also show similar steering control by unbending players from class D. In order to visualize the learning dynamics, we choose to focus on the two-dimensional reactive strategies without loss of generality. Our analysis can be extended to other memory-one strategies. In what follows, we explicitly establish the two major objectives of steering control over opponents, that is, we characterize conditions for (i) yielding the possible maximum payoffs and (ii) rendering fair and cooperative strategies, by using fixed unbending strategies in the adaptive learning dynamics.

The Impact of Unbending Class A on Steered Learning Dynamics

Let $\bm{p} = [p_1, p_2, p_1, p_2]$ be a reactive strategy and $\bm{q} = [1, q_2, 0, q_4]$ an unbending strategy from class A (as shown in Example 1). Using the method by Press and Dyson [7], we obtain the payoff $s_X(p_1, p_2)$ as a quadratic rational function of $p_1$ and $p_2$, resulting from the quotient of two determinants. If $p_1 = 1$, for example, $s_X(1, p_2) = s_Y(1, p_2) = b - c$. If $p_1 = p_2 = 0$, for another example, $s_X(0, 0) = bq_4/(1 - q_2 + q_4)$. In general, we have

s_{X}(p_{1},p_{2})-s_{X}(1,p_{2})=\frac{(1-p_{1})\Delta(p_{1},p_{2})}{f_{A}(p_{1},p_{2})}, (11)

where the denominator $f_A(p_1, p_2) = (1-q_2)[(1-p_1)-q_4(p_2-p_1)^2]+q_4(1-p_2)$ is always positive and

\begin{split}\Delta(p_{1},p_{2})&=(\ast)-(\ast\ast),\\ (\ast)&=[bq_{2}q_{4}+c(1-q_{2})-(b-c)q_{4}](1-p_{2}),\\ (\ast\ast)&=b(1-q_{2})(q_{4}p_{1}+1-q_{4}).\end{split} (12)
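As a sanity check of (11) and (12), the short sketch below (with the assumed example values $b = 3$, $c = 1$, $q_2 = 0.75$, $q_4 = 0.2$) recovers $s_X(0, 0)$ from the difference formula and compares it with the closed form $bq_4/(1-q_2+q_4)$ quoted above:

```python
b, c = 3.0, 1.0
q2, q4 = 0.75, 0.2                     # with q1 = 1 and q3 = 0, a class A strategy

def delta(p1, p2):
    return ((b*q2*q4 + c*(1 - q2) - (b - c)*q4) * (1 - p2)
            - b*(1 - q2) * (q4*p1 + 1 - q4))

def f_A(p1, p2):
    return (1 - q2)*((1 - p1) - q4*(p2 - p1)**2) + q4*(1 - p2)

# Eq. (11) with s_X(1, p_2) = b - c gives s_X(0, 0) = (b - c) + Delta(0, 0)/f_A(0, 0)
sX_00 = (b - c) + delta(0, 0) / f_A(0, 0)
print(sX_00, b*q4 / (1 - q2 + q4))     # both values should agree (4/3 here)
```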

III-A Maximum of the Payoff

We claim that the maximal value of $s_X$ is obtained either at $p_1 = 1$ or at $p_1 = p_2 = 0$. In addition to the proofs below, specific examples can be found in Table II.

TABLE II: Adaptive learning dynamics towards fairness and cooperation against unbending strategies from class A. The 2-dimensional stream plots (gradient vector field) from the adaptive learning dynamics of a self-interested reactive player X against a fixed unbending player Y are given with respect to $p_1$ and $p_2$. The color of the curves and dots $(p_1, p_2)$ corresponds to the payoff values of X in situ, as specified by the given color bar. The solid point on the corner denotes where the maximum value of $s_X$ lies (if there is more than one maximum point, only the bottom one with $p_2 = 0$ is shown). The other solid point indicates where $\partial s_X(p_1, 0)/\partial p_1$ becomes zero and the empty point is $(c/b, 0)$ (the left boundary of extortionate ZD strategies).
Columns: $r < r^{\ast}$ with $\max s_X = s_X(0,0)$; $r < r^{\ast}$ with $\max s_X = s_X(1,p_2)$; $r = r^{\ast}$ with $\max s_X = s_X(1,p_2)$; $r > r^{\ast}$ with $\max s_X = s_X(1,p_2)$.
Rows: $\partial s_X(0,0)/\partial p_1 < 0$ and $\partial s_X(0,0)/\partial p_1 > 0$ (the stream-plot panels themselves are not reproduced here).
Theorem 1 (Maximum on the boundaries).

Assume that player X uses a reactive strategy $\bm{p}$ and that player Y uses an unbending strategy $\bm{q}$ from class A. The maximum of the expected payoff $s_X$ is

\max s_{X}(p_{1},p_{2})=\begin{cases}s_{X}(1,p_{2}),&q_{4}\leq h_{Aa}(q_{2})\\ s_{X}(0,0),&q_{4}>h_{Aa}(q_{2})\end{cases} (13)

where $h_{Aa}(q_2)$ is given in (8).

Proof.

According to (12), $\Delta(p_1, p_2)$ decreases with respect to both $p_1$ and $p_2$; its maximal and minimal values are thus

\begin{cases}\Delta(0,0)=c(q_{4}-h_{Aa}),\\ \Delta(1,1)=-b(1-q_{2}),\end{cases} (14)

respectively. Combining this observation with (11), we conclude that $p_1 = 1$ is a level curve with a local maximum. Moreover, the greatest payoff is achieved at $p_1 = 1$ if $\Delta(0,0) \leq 0$ and at $(0,0)$ otherwise. ∎

We can further figure out whether the payoff $s_X$ always takes its maximum at $p_1 = 1$ by comparing $h_A$ and $h_{Aa}$ in (8).

Corollary 1 (Rule of the golden ratio).

Assume that player X uses a reactive strategy $\bm{p}$ and that player Y uses an unbending strategy $\bm{q}$ from class A. The maximum of the expected payoff $s_X$ is always achieved at $p_1 = 1$ if the benefit-to-cost ratio $r \geq (1+\sqrt{5})/2$ (the golden ratio).

Proof.

The solutions to $h_A(q_2) = h_{Aa}(q_2)$ are $q_2 = 1$ and $q_2 = q_A = b(b-c)/c^2$. A routine calculation shows that $q_A > q_a$ and $h_{Aa}(q_A) > 0$. Therefore, whether $q_A \geq 1$ determines whether $\max s_X(p_1, p_2) = s_X(1, p_2)$ always holds, which further yields a comparison between the benefit-to-cost ratio $r$ and the golden ratio $(1+\sqrt{5})/2$. ∎

More specifically (a numerical illustration follows this list),

  (i) if $1 < r < (1+\sqrt{5})/2$,

  \max s_{X}(p_{1},p_{2})=\begin{cases}s_{X}(1,p_{2}),&0<q_{4}\leq h_{Aa}\\ s_{X}(0,0),&h_{Aa}<q_{4}\leq h_{A}\end{cases} (15)

  (ii) if $r \geq (1+\sqrt{5})/2$, $\max s_X(p_1, p_2) = s_X(1, p_2)$.
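The dichotomy above hinges only on whether $q_A = b(b-c)/c^2$ exceeds $1$, or equivalently whether $r = b/c$ reaches the golden ratio. A tiny sketch (with two assumed example ratios, $r = 3$ and $r = 1.5$) makes the comparison explicit:

```python
import math

golden = (1 + math.sqrt(5)) / 2

for b, c in [(3.0, 1.0), (1.5, 1.0)]:     # assumed example ratios r = 3 and r = 1.5
    q_A = b * (b - c) / c**2              # crossing point of h_A and h_Aa (Corollary 1)
    always_at_p1_edge = q_A >= 1          # equivalent to r >= (1 + sqrt(5))/2
    print(b / c, q_A, always_at_p1_edge, b / c >= golden)
```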

III-B Monotonicity of the Payoff

We first take the partial derivative of the payoff $s_X$ with respect to $p_1$ and get

\frac{\partial s_{X}(p_{1},p_{2})}{\partial p_{1}}=\frac{(1-p_{2})q_{4}g_{A}(p_{1},p_{2})}{f_{A}^{2}(p_{1},p_{2})}, (16)

where $f_A(p_1, p_2)$ is defined as before and $g_A(p_1, p_2) = e_2p_1^2 + e_1p_1 + e_0$ is a quadratic function of $p_1$. We have

\begin{cases}e_{2}=-(1-q_{2})[b(1-q_{2})+c]q_{4}-c(1-q_{2})^{2},\\ e_{1}=2(1-q_{2})[b(1-q_{2})p_{2}+b+c]q_{4}+2c(1-q_{2})^{2},\end{cases} (17)

and

\begin{split}e_{2}+e_{1}+e_{0}&=[p_{2}+q_{2}(1-p_{2})][(\ast)q_{4}+(\ast\ast)],\\ (\ast)&=[b(1-q_{2})-c](1-p_{2}),\\ (\ast\ast)&=(1-q_{2})[b-c(1-p_{2})].\end{split} (18)
Theorem 2 (Monotonicity along the first axis).

Assume that player X uses a reactive strategy $\bm{p}$ and that player Y uses an unbending strategy $\bm{q}$ from class A. Let $p_2$ be fixed. The expected payoff $s_X$ either always increases or first decreases and then increases with $p_1$.

Proof.

Since $e_2 < 0$ and $2e_2 + e_1 = 2b(1-q_2)q_4[p_2 + q_2(1-p_2)] > 0$, the graph of $g_A$ as a function of $p_1$ is a parabola opening downward whose axis of symmetry is to the right of $p_1 = 1$. A routine calculation shows that $g_A(1, p_2) = e_2 + e_1 + e_0 > 0$. Therefore, as $p_1$ increases, $\partial s_X/\partial p_1$ is either always nonnegative or first negative and then positive for any fixed $p_2$. ∎

We can obtain more detailed results about $\partial s_X/\partial p_1$ on the boundaries of the unit square.

Corollary 2 (Monotonicity on the boundaries).

Assume that player X uses a reactive strategy $\bm{p}$ and that player Y uses an unbending strategy $\bm{q}$ from class A. Let $p_2 = 0$. The monotonicity of the expected payoff $s_X$ with respect to $p_1$ satisfies

  (i) if $0 < q_4 \leq h_a(q_2)$, $\partial s_X/\partial p_1$ is always nonnegative,

  (ii) if $h_a(q_2) < q_4 < h_A(q_2)$, $\partial s_X/\partial p_1$ is first negative and then positive,

where $h_a(q_2)$ is given in (8).

Proof.

Based on Theorem 2, it suffices to consider $\partial s_X/\partial p_1$ at $(0,0)$, that is,

\left.\frac{\partial s_{X}(p_{1},p_{2})}{\partial p_{1}}\right|_{(0,0)}=\frac{[b(1-q_{2})+c][h_{a}(q_{2})-q_{4}]q_{4}}{(1-q_{2}+q_{4})^{2}}. (19)

The sign of $\partial s_X(p_1, p_2)/\partial p_1$ at $(0,0)$ is determined by the relation between $q_4$ and $h_a(q_2)$. ∎

Remark.

In fact, $\partial s_X/\partial p_1$ is always positive at $(c/b, 0)$ and hence positive for any $c/b \leq p_1 \leq 1$. Namely, if player X uses a reactive extortionate ZD strategy and Y uses an unbending strategy from class A, the expected payoff $s_X$ will be independent of $p_2$ and an increasing function of $p_1$, which echoes what we have found in previous studies.

Example 3.

Assume that player X uses a reactive strategy $\bm{p}$ and that player Y uses an unbending strategy $\bm{q}$ from class A. Let

\bm{q}=[1,(b-c)/b,0,(2b-3c)/4b]. (20)

It is easy to tell that $(b-2c)/2b = h_a < q_4 < h_A = (b-c)/2b$. The partial derivative of the expected payoff $s_X$ with respect to $p_1$ satisfies

\left.\frac{\partial s_{X}(p_{1},p_{2})}{\partial p_{1}}\right|_{p_{2}=0}\begin{cases}<0,&0\leq p_{1}<c/(2b-c)\\ =0,&p_{1}=c/(2b-c)\\ >0,&c/(2b-c)<p_{1}\leq 1\end{cases} (21)
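The thresholds quoted in Example 3 are easy to verify for specific parameter values. The snippet below (with the assumed example values $b = 3$ and $c = 1$, so that $\bm{q} = [1, 2/3, 0, 1/4]$) confirms the ordering $h_a < q_4 < h_A$ and the location $c/(2b-c)$ of the sign change:

```python
b, c = 3.0, 1.0                              # assumed example values
q2, q4 = (b - c)/b, (2*b - 3*c)/(4*b)        # the class A strategy of Eq. (20)

h_a = (b - 2*c) / (2*b)                      # h_a(q2) evaluated at q2 = (b - c)/b
h_A = (b - c) / (2*b)                        # h_A(q2) evaluated at q2 = (b - c)/b

print(h_a < q4 < h_A)                        # expected: True
print(c / (2*b - c))                         # the sign of ds_X/dp_1 at p_2 = 0 flips here (p_1 = 0.2)
```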

We then take the partial derivative of the payoff $s_X$ with respect to $p_2$. The general expression of $\partial s_X/\partial p_2$ is long and complicated. We only enumerate the results on the boundaries.

Proposition 2 (Monotonicity along the second axis).

Assume that player X uses a reactive strategy $\bm{p}$ and that player Y uses an unbending strategy $\bm{q}$ from class A. The monotonicity of the expected payoff $s_X$ with respect to $p_2$ satisfies

  (i) when $p_1 = 0$, $\partial s_X/\partial p_2$ is always negative,

  (ii) when $p_1 = 1$, $\partial s_X/\partial p_2$ is always zero,

  (iii) when $p_2 = 1$ and $p_1 < 1$, $\partial s_X/\partial p_2$ is always negative.

The detailed proof is omitted due to space limitations.

To summarize, the fixed unbending strategy $\bm{q}$ used by the co-player will determine the performance and the learning dynamics of the focal player. The global maximum payoff is always achieved at $(1, p_2)$ if $\bm{q}$ is taken from a subset of class A (Fig. 2(a), $q_4 \leq h_{Aa}$) or even the entire set of class A of unbending strategies (Fig. 2(b) and 2(c), $q_4 < h_A$) provided that the benefit-to-cost ratio $r$ is at least the golden ratio $(1+\sqrt{5})/2$. Furthermore, the learning dynamics can exhibit either global convergence (the shaded orange areas in Fig. 2, $q_4 \leq h_a$) or bistability ($h_a < q_4 < h_A$) along the direction of change of $p_1$, where the final state of the focal player depends on their initial state. Examples are given in Table II and Fig. 3.
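The bistable behavior can be reproduced with a simple numerical experiment. The sketch below uses projected gradient ascent on $s_X$ over the unit square as a crude stand-in for the adaptive learning dynamics (the rate $\omega$, the example values $b = 3$, $c = 1$, the class A strategy of Example 3, and the two initial conditions are all assumptions made for illustration):

```python
import numpy as np

def payoff_reactive(p1, p2, q, b=3.0, c=1.0):
    """Long-run payoff s_X of a reactive player (p1, p2) against a memory-one q."""
    P = [p1, p2, p1, p2]
    swap = [0, 2, 1, 3]
    M = np.zeros((4, 4))
    for i in range(4):
        px, qy = P[i], q[swap[i]]
        M[i] = [px*qy, px*(1 - qy), (1 - px)*qy, (1 - px)*(1 - qy)]
    A = np.vstack([M.T - np.eye(4), np.ones(4)])
    v = np.linalg.lstsq(A, np.array([0, 0, 0, 0, 1.0]), rcond=None)[0]
    return v @ np.array([b - c, -c, b, 0.0])

def learn(p0, q, omega=0.05, steps=3000, eps=1e-5):
    """Projected gradient ascent on s_X (a rough stand-in for the learning dynamics)."""
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        g1 = (payoff_reactive(p[0] + eps, p[1], q) - payoff_reactive(p[0] - eps, p[1], q)) / (2*eps)
        g2 = (payoff_reactive(p[0], p[1] + eps, q) - payoff_reactive(p[0], p[1] - eps, q)) / (2*eps)
        p = np.clip(p + omega * np.array([g1, g2]), 0.0, 1.0)
    return p

b, c = 3.0, 1.0
q = [1.0, (b - c)/b, 0.0, (2*b - 3*c)/(4*b)]     # the class A strategy of Example 3
print(np.round(learn([0.1, 0.1], q), 3))         # expected to settle near the defector corner (0, 0)
print(np.round(learn([0.6, 0.3], q), 3))         # expected to settle on the generous-TFT edge p_1 = 1
```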

Figure 3: Numerical learning processes enforced by unbending strategies from class A. The learning curves of $p_1$ and $p_2$ of a reactive player are given with respect to time. The empty and the solid points represent the initial and the final states and the arrows indicate the direction. Different time scales are considered for different players, illustrated by solid versus dashed curves: a larger $\omega$ corresponds to faster learning and a smaller $\omega$ to slower learning.

The Impact of Unbending Class D on Steered Learning Dynamics

Let $\bm{p} = [p_1, p_2, p_1, p_2]$ be a reactive strategy and $\bm{q} = [q_1, q_2, q_3, q_4]$ an unbending strategy from class D (as shown in Example 2). We obtain $s_X(p_1, p_2)$ as a linear rational function of $p_1$ and $p_2$. If $p_1 = p_2 = 1$, for example, $s_X(1, 1) = [(b-c)q_3 - c(1-q_1)]/(1-q_1+q_3)$. If $p_1 = 1$ and $q_1 = 1$, for another example, the unbending co-player becomes a complier with $O = b - c$ and $s_X = s_Y = b - c$. In general, we have

s_{X}(p_{1},p_{2})-s_{X}(1,1)=\frac{d_{D}(q_{1},q_{2},q_{3})}{1-q_{1}+q_{3}}\cdot\frac{(\ast)}{(\ast\ast)}, (22)

where $(\ast) = q_3(1-p_1) + (1-q_1)(1-p_2)$ and $(\ast\ast) = 1 - q_1 + q_3 - (q_2-q_1)(p_2-p_1)$.

III-C Maximum of the Payoff

We claim that the maximal value of $s_X$ is obtained either at $p_1 = p_2 = 1$ or at $p_1 = 1$.

Theorem 3 (Maximum on the boundaries).

Assume that player X uses a reactive strategy $\bm{p}$ and that player Y uses an unbending strategy $\bm{q}$ from class D. The maximum of the expected payoff $s_X$ is

\max s_{X}(p_{1},p_{2})=\begin{cases}s_{X}(1,1),&q_{1}<1\\ s_{X}(1,p_{2}),&q_{1}=1\end{cases} (23)

where the value of $q_1$ is determined by the baseline payoff $O$ of $\bm{q}$ as a general ZD strategy.

Proof.

The sign of the right-hand side in (22) determines whether $\max s_X(p_1, p_2) = s_X(1, 1)$. According to (10), $-q_3 < q_2 - q_1 < 0$. Thus, $(\ast\ast) > 1 - q_1 + q_3 - q_3 \geq 0$. Therefore, we only need to consider the sign of $(\ast)$. It is straightforward to show that $(\ast) \geq 0$, which is an equality if and only if $p_1 = p_2 = 1$, or $p_1 = 1$ and $q_1 = 1$. ∎

III-D Monotonicity of the Payoff

We take the partial derivatives of the payoff $s_X$ with respect to $p_1$ and $p_2$ and get

\begin{split}\frac{\partial s_{X}(p_{1},p_{2})}{\partial p_{1}}&=-\frac{[(q_{2}-q_{1})(1-p_{2})+q_{3}]d_{D}(q_{1},q_{2},q_{3})}{(\ast\ast)^{2}},\\ \frac{\partial s_{X}(p_{1},p_{2})}{\partial p_{2}}&=-\frac{[(q_{2}-q_{1})p_{1}+1-q_{2}]d_{D}(q_{1},q_{2},q_{3})}{(\ast\ast)^{2}},\end{split} (24)

where $(\ast\ast)$ is the same as that in (22).

Theorem 4 (Monotonicity across the unit square).

Assume that player X uses a reactive strategy $\bm{p}$ and that player Y uses an unbending strategy $\bm{q}$ from class D. The expected payoff $s_X$ is a strictly increasing function of $p_1$ and a non-decreasing function of $p_2$:

\begin{split}\frac{\partial s_{X}(p_{1},p_{2})}{\partial p_{1}}&>0,\\ \frac{\partial s_{X}(p_{1},p_{2})}{\partial p_{2}}&\geq 0.\end{split} (25)

The equality in the second relation holds if and only if $p_1 = 1$ and $q_1 = 1$.

Proof.

We know that $d_D < 0$ and $(\ast\ast) > 0$. Moreover, $q_2 - q_1 > -q_3$ implies that $(q_2-q_1)(1-p_2) + q_3 > q_3p_2 \geq 0$. Hence, we have $\partial s_X/\partial p_1 > 0$. On the other hand, it is easy to tell that $(q_2-q_1)p_1 + 1 - q_2 \geq 0$, which is an equality if and only if $p_1 = 1$ and $q_1 = 1$. ∎

To summarize (a numerical sketch follows this list),

  (i) if Y uses an intermediate ZD strategy,

  \begin{gathered}\max s_{X}(p_{1},p_{2})=s_{X}(1,1)=\frac{(b-c)q_{3}-c(1-q_{1})}{1-q_{1}+q_{3}},\\ \frac{\partial s_{X}(p_{1},p_{2})}{\partial p_{1}}>0\,\text{and}\,\frac{\partial s_{X}(p_{1},p_{2})}{\partial p_{2}}>0,\end{gathered} (26)

  (ii) if Y uses a generous ZD strategy,

  \begin{gathered}\max s_{X}(p_{1},p_{2})=s_{X}(1,p_{2})=b-c,\\ \frac{\partial s_{X}(p_{1},p_{2})}{\partial p_{1}}>0\,\text{and}\,\frac{\partial s_{X}(p_{1},p_{2})}{\partial p_{2}}\begin{cases}>0,&p_{1}<1\\ =0,&p_{1}=1\end{cases}\end{gathered} (27)
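The derivative signs in (24) can be checked directly on a grid of reactive strategies. The sketch below (with assumed example values: $b = 3$, $c = 1$, and the intermediate ZD co-player obtained from Eq. (2) with $O = 1$, $\chi = 2$, $\phi = 0.1$, i.e., $\bm{q} = [0.9, 0.4, 0.6, 0.1]$) confirms that $s_X$ climbs toward the full-cooperator corner $(1, 1)$:

```python
import numpy as np

b, c = 3.0, 1.0
q1, q2, q3, q4 = 0.9, 0.4, 0.6, 0.1          # intermediate ZD, hence class D (q4 = q2 + q3 - q1)
d_D = b*(q2 - q1) + c*(q3 - q1) + c          # negative for class D strategies

def grads(p1, p2):
    """Partial derivatives of s_X from Eq. (24)."""
    denom = (1 - q1 + q3 - (q2 - q1)*(p2 - p1))**2
    dp1 = -((q2 - q1)*(1 - p2) + q3) * d_D / denom
    dp2 = -((q2 - q1)*p1 + 1 - q2) * d_D / denom
    return dp1, dp2

grid = np.linspace(0, 1, 11)
ok = all(grads(x, y)[0] > 0 and grads(x, y)[1] >= 0 for x in grid for y in grid)
print(d_D < 0, ok)   # expected: True True, i.e., s_X increases toward (1, 1)
```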

We have shown the role of unbending strategies in steering control and particularly their ability to enforce fair and cooperative outcomes against payoff-maximizing players. Our further analysis reveals that some common IPD strategies are in fact unbending, including the PSO gambler (a machine-trained strategy) [5] and generous ZD strategies [8]. In this regard, the framework of IPD has the potential for synergistically combining artificial intelligence (AI) and evolutionary game theory to enhance cooperation and foster fairness in various multi-agent systems. Also, all the unbending strategies from class A and at least the generous ZD strategies from class D are able to establish mutual cooperation among themselves. Moreover, even if noisy games are considered, for example, with probability $\varepsilon$ the intended move is implemented as the opposite one, their mutual cooperation is only degraded to $1 - \mathcal{O}(\varepsilon)$. These findings suggest the robustness and winning advantage of unbending strategies in population competition dynamics beyond the pairwise interactions considered above.

IV Discussion & Conclusion

Our search for unbending strategies is directly motivated by suppressing extortion, thereby requiring targeted interactions with ZD opponents. Nevertheless, our work is more broadly motivated by how to foster and enforce fairness and cooperation in pairwise interactions in the IPD games. We reveal the previously unforeseen unbending properties of many well-known strategies in that they actually have the ability to resist extortion. For example, class A contains "PSO Gambler", an optimized memory-one strategy that is unbending when $T + S > 2P$. Other well-known examples of unbending strategies include WSLS from class A when $T + S < 2P$, "willing" from class C, and all the strategies from class D, which are ZD players themselves but with a higher level of generosity than their opponents.

As such, unbending strategies can be used to rein in seemingly formidable extortionate ZD players, from whom a fair offer can ultimately be cultivated in their own interest. Our findings are in line with recent experimental evidence suggesting that human players often choose not to accede to extortion out of concern for fairness [16]. The conclusion can be further generalized and shed light on the backfire of extortion by unbending players in a broad sense. Our analysis shows that an adaptive reactive player X seeking a higher payoff will cooperate more by increasing $p_1$, and in some cases $p_2$, to avoid potential punishment from a fixed unbending co-player Y in a donation game. Unbending strategies from class A would "train" the reactive player to behave like generous TFT ($p_1 = 1$), whereas those from class D would even "train" the reactive player to act as a full cooperator ($p_1 = p_2 = 1$).

Under the influence of a co-player Y from class A of unbending strategies, the learning dynamics of a reactive player X may exhibit bistability along the direction of change of $p_1$, hence allowing two different learning outcomes, depending on both the original state of the focal player and the specific strategy of the co-player. After entering full cooperation with respect to $p_1$, player X will stay neutral along the direction of change of $p_2$. A reactive strategy can therefore converge to the full defector corner $(0, 0)$ or otherwise to the generous TFT edge $(1, p_2)$. Player X will always get the greatest payoff at the cooperative edge $(1, p_2)$ if the ratio $b/c > (\sqrt{5}+1)/2$ (the 'golden ratio') or converge to the edge if player Y takes a strategy from a subset of class A. On the other hand, unbending strategies from class D are able to guarantee global convergence to the same cooperative edge, after which the baseline payoff $O$ of player Y can further decide the direction of change of $p_2$. Player X will increase the value of $p_2$ until reaching the full cooperator corner $(1, 1)$ if player Y uses an intermediate ZD strategy with $P < O < R$ or remain neutral along the edge $(1, p_2)$ if player Y uses the generous ZD strategy with $O = R$.

In a nutshell, we have found and characterized general classes of unbending strategies that are able to steer the learning dynamics of selfish players in head-to-head encounters. Unbending strategies can trigger the backfire of greedy exploitation, punish extortion, and thus turn the interactions into an Ultimatum game: to "win each battle" or to "win the war". These strategies helm the learning dynamics of extortionate and reactive players toward fairness without introducing population dynamics and evolution through generations.

Our work helps pave the way for the promising initiative of combining game theory with artificial intelligence to gain more analytical insights into computational learning theory. Of particular interest are extensions to multi-agent learning systems that are fraught with perception and implementation errors and beyond pairwise interactions in a changing environment [21]. In doing so, integrating theoretical and empirical approaches will help enhance our understanding of cooperation in various advanced AI systems besides social and biological systems [22].

References

  • [1] Robert Axelrod and William D Hamilton. The evolution of cooperation. Science, 211(4489):1390–1396, 1981.
  • [2] Martin A Nowak. Five rules for the evolution of cooperation. Science, 314(5805):1560–1563, 2006.
  • [3] Robert L Trivers. The evolution of reciprocal altruism. The Quarterly Review of Biology, 46(1):35–57, 1971.
  • [4] Christian Hilbe, Krishnendu Chatterjee, and Martin A Nowak. Partners and rivals in direct reciprocity. Nature Human Behaviour, 2(7):469–477, 2018.
  • [5] Marc Harper, Vincent Knight, Martin Jones, Georgios Koutsovoulos, Nikoleta E Glynatsi, and Owen Campbell. Reinforcement learning produces dominant strategies for the iterated prisoner's dilemma. PLoS ONE, 12(12):e0188046, 2017.
  • [6] Maarten C Boerlijst, Martin A Nowak, and Karl Sigmund. Equal pay for all prisoners. The American Mathematical Monthly, 104(4):303–305, 1997.
  • [7] William H Press and Freeman J Dyson. Iterated prisoner’s dilemma contains strategies that dominate any evolutionary opponent. Proceedings of the National Academy of Sciences, 109(26):10409–10413, 2012.
  • [8] Alexander J Stewart and Joshua B Plotkin. From extortion to generosity, evolution in the iterated prisoner’s dilemma. Proceedings of the National Academy of Sciences, 110(38):15348–15353, 2013.
  • [9] Ethan Akin. What you gotta know to play good in the iterated prisoner’s dilemma. Games, 6(3):175–190, 2015.
  • [10] Xingru Chen, Long Wang, and Feng Fu. The intricate geometry of zero-determinant strategies underlying evolutionary adaptation from extortion to generosity. New Journal of Physics, 24(10):103001, 2022.
  • [11] Dong Hao, Zhihai Rong, and Tao Zhou. Extortion under uncertainty: Zero-determinant strategies in noisy games. Physical Review E, 91(5):052803, 2015.
  • [12] Alex McAvoy and Christoph Hauert. Autocratic strategies for iterated games with arbitrary action spaces. Proceedings of the National Academy of Sciences, 113(13):3573–3578, 2016.
  • [13] Christian Hilbe, Martin A Nowak, and Karl Sigmund. Evolution of extortion in iterated prisoner’s dilemma games. Proceedings of the National Academy of Sciences, 110(17):6913–6918, 2013.
  • [14] Jing Chen and Aleksey Zinger. The robustness of zero-determinant strategies in iterated prisoner's dilemma games. Journal of Theoretical Biology, 357:46–54, 2014.
  • [15] Fang Chen, Te Wu, and Long Wang. Evolutionary dynamics of zero-determinant strategies in repeated multiplayer games. Journal of Theoretical Biology, 549:111209, 2022.
  • [16] Christian Hilbe, Torsten Röhl, and Manfred Milinski. Extortion subdues human players but is finally punished in the prisoner's dilemma. Nature Communications, 5(1):1–6, 2014.
  • [17] Lutz Becks and Manfred Milinski. Extortion strategies resist disciplining when higher competitiveness is rewarded with extra gain. Nature Communications, 10(1):1–9, 2019.
  • [18] Martin A Nowak and Karl Sigmund. Tit for tat in heterogeneous populations. Nature, 355(6357):250–253, 1992.
  • [19] Martin Nowak and Karl Sigmund. A strategy of win-stay, lose-shift that outperforms tit-for-tat in the prisoner’s dilemma game. Nature, 364(6432):56–58, 1993.
  • [20] Maria R D'Orsogna and Matjaž Perc. Statistical physics of crime: A review. Physics of Life Reviews, 12:1–21, 2015.
  • [21] Daan Bloembergen, Karl Tuyls, Daniel Hennes, and Michael Kaisers. Evolutionary dynamics of multi-agent learning: A survey. Journal of Artificial Intelligence Research, 53:659–697, 2015.
  • [22] Allan Dafoe, Yoram Bachrach, Gillian Hadfield, Eric Horvitz, Kate Larson, and Thore Graepel. Cooperative AI: machines must learn to find common ground, 2021.