
Decentralized Online Learning in General-Sum Stackelberg Games

Yaolong Yu Department of Computer Science and Engineering
The Chinese University of Hong Kong
Hong Kong
Haipeng Chen Data Science
William & Mary
Williamsburg, Virginia, USA
Abstract

We study an online learning problem in general-sum Stackelberg games, where players act in a decentralized and strategic manner. We study two settings depending on the information available to the follower: (1) the limited information setting, where the follower only observes its own reward, and (2) the side information setting, where the follower has extra side information about the leader's reward. We show that for the follower, myopically best responding to the leader's action is the best strategy in the limited information setting, but not necessarily so in the side information setting: the follower can manipulate the leader's reward signals with strategic actions, and hence induce the leader's strategy to converge to an equilibrium that is better for the follower. Based on these insights, we study decentralized online learning for both players in the two settings. Our main contribution is to derive last-iterate convergence and sample complexity results in both settings. Notably, we design a new manipulation strategy for the follower in the latter setting, and show that it has an intrinsic advantage over the best response strategy. Our theoretical results are further supported by empirical results.

1 Introduction

Many real-world problems such as optimal auction [Myerson, 1981, Cole and Roughgarden, 2014] and security games [Tambe, 2011] can be modeled as a hierarchical game, where the two levels of players have asymmetric roles and can be partitioned into a leader who first takes an action and a follower who responds to the leader’s action. This class of games is called Stackelberg or leader-follower games [Stackelberg, 1934, Sherali et al., 1983]. General-sum Stackelberg games [Roughgarden, 2010] are a class of Stackelberg games where the sum of the leader’s and follower’s rewards is not necessarily zero. They have broad implications in many other real-world problems such as taxation policy making [Zheng et al., 2022], automated mechanism design [Conitzer and Sandholm, 2002], reward shaping [Leibo et al., 2017], security games [Gan et al., 2018, Blum et al., 2014], anti-poaching [Fang et al., 2015], and autonomous driving [Shalev-Shwartz et al., 2016].

We focus on the pure-strategy setting of repeated general-sum Stackelberg games, following Bai et al. [2021], where players act in an online, decentralized, and strategic manner. Online means that the players learn on the go as opposed to learning in batches, and regret is usually used as the learning objective. Decentralized means that there is no central controller, and each player acts independently. Strategic means that the players are self-interested and aim to maximize their own utility. Moreover, we take a learning perspective, meaning that the players learn from (noisy) samples by playing the game repeatedly. A comparable framework has been investigated in related studies [Kao et al., 2022] and finds widespread applications in diverse real-world scenarios, such as the optimal taxation problem in the AI Economist [Zheng et al., 2022] and the optimization of auction procedures [Amin et al., 2013].

There have been extensive studies on decentralized learning in multi-agent games [Blum and Mansour, 2007, Wu et al., 2022, Jin et al., 2021, Meng et al., 2021, Wei et al., 2021, Song et al., 2021, Mao and Başar, 2023, Zhong et al., 2023, Ghosh, 2023], mostly focusing on settings where all agents act simultaneously, without a hierarchical structure. Multi-agent learning in Stackelberg games has been relatively less explored. For example, Goktas et al. [2022], Sun et al. [2023] study zero-sum games, where the sum of rewards of the two players is zero. Kao et al. [2022], Zhao et al. [2023] study cooperative games, where the leader and the follower share the same reward. Though these studies make significant contributions to understanding learning in Stackelberg games, they make limiting assumptions about the reward structures of the game. The first part of this paper studies the learning problem in a more generalized setting of general-sum Stackelberg games. Due to the lack of knowledge of the reward structures, learning in general-sum Stackelberg games is much more challenging, and thus remains open.

Furthermore, in these studies [Goktas et al., 2022, Sun et al., 2023, Kao et al., 2022, Zhao et al., 2023], a hidden assumption is that the optimal strategy of the follower is to best respond to the leader's strategy in each round. Recent studies [Gan et al., 2019a, b, Nguyen and Xu, 2019, Birmpas et al., 2020, Chen et al., 2022, 2023] show that under information asymmetry, a strategic follower can manipulate a commitment leader by misreporting its payoffs, so as to induce an equilibrium different from the Stackelberg equilibrium that is better for the follower. The second part of this paper extends this intuition to an online learning setting, where the leader learns the commitment strategy via no-regret learning algorithms (e.g., EXP3 [Auer et al., 2002b] or UCB [Auer et al., 2002a]).

We use an example online Stackelberg game to further illustrate this intuition; its (expected) payoff matrix is shown in Table 1. By definition, (a_{se},b_{se})=(a_{1},b_{1}) is its Stackelberg equilibrium: the follower's best responses are \mathcal{F}_{br}(a_{1})=b_{1} and \mathcal{F}_{br}(a_{2})=b_{1}, which give the leader expected payoffs of 0.3 and 0.2, respectively, so a leader facing a best-responding follower commits to a_{1}. In online learning settings, when the leader uses a no-regret learning algorithm and the (strategic) follower forms a response function \mathcal{F}=\{\mathcal{F}(a_{1})=b_{2},\mathcal{F}(a_{2})=b_{1}\}, the leader will be tricked into believing that the expected payoffs of actions a_{1} and a_{2} are respectively 0.1 and 0.2. Hence, the leader will take action a_{2} when using a no-regret learning algorithm, and the game will converge to (a_{2},b_{1}), where the follower obtains a payoff of 1 instead of the 0.1 it receives at the Stackelberg equilibrium.

Leader / Follower b_{1} b_{2}
a_{1} (0.3, 0.1) (0.1, 0.05)
a_{2} (0.2, 1) (0.3, 0.1)
Table 1: Payoff matrix of an example Stackelberg game. The row player is the leader. Each tuple denotes the payoffs of the leader (left) and the follower (right).
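As a concrete check of this example, the following small Python sketch (our own illustration, not code from the paper) replays the Table 1 payoffs and compares what a no-regret leader would eventually play under the best response set versus the manipulated response set.

```python
# Illustrative sketch: leader's perceived values under best response vs. the
# manipulated response set of Table 1 (values copied from the table).
mu_l = {("a1", "b1"): 0.3, ("a1", "b2"): 0.1, ("a2", "b1"): 0.2, ("a2", "b2"): 0.3}
mu_f = {("a1", "b1"): 0.1, ("a1", "b2"): 0.05, ("a2", "b1"): 1.0, ("a2", "b2"): 0.1}

best_response = {a: max(("b1", "b2"), key=lambda b: mu_f[(a, b)]) for a in ("a1", "a2")}
manipulation = {"a1": "b2", "a2": "b1"}  # the response set F from the text

for name, F in [("best response", best_response), ("manipulation", manipulation)]:
    perceived = {a: mu_l[(a, F[a])] for a in ("a1", "a2")}  # what the leader learns
    a_star = max(perceived, key=perceived.get)              # leader's long-run action
    print(f"{name}: perceived={perceived}, leader plays {a_star}, "
          f"follower gets {mu_f[(a_star, F[a_star])]}")
# best response -> leader plays a1, follower gets 0.1
# manipulation  -> leader plays a2, follower gets 1.0
```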

The insights gained from the above example can be generalized. In the following, we introduce two different categories of general-sum Stackelberg games, distinguished by whether the follower has access to information about the leader's reward structure: (i) limited information, where the follower only observes its own reward, and (ii) side information, where the follower has extra side information about the leader's reward in addition to its own.

In the limited information setting, the follower is not able to manipulate the game without the leader’s reward information. Therefore, best responding to the leader’s action is indeed the best strategy. This constitutes a typical general-sum Stackelberg game. Our contribution in this setting is to prove the convergence of general-sum Stackelberg equilibrium when both players use (variants of) no-regret learning algorithms. Note that this is a further step after [Kao et al., 2022] who focus on cooperative Stackelberg games where the two players share the same reward. In addition, we prove last-iterate convergence [Mertikopoulos et al., 2018, Daskalakis and Panageas, 2018, Lin et al., 2020, Wu et al., 2022] results, which are considered stronger than the average convergence results.

In the side information setting, we first consider the case when the follower is omniscient, i.e., it knows both players’ exact true reward functions. Building on the intuition from the above example, we design FBM, a manipulation strategy for the follower and prove that it gains an intrinsic advantage compared to the best response strategy. Then, we study a more intricate case called noisy side information, where the follower needs to learn the leader’s reward information from noisy bandit feedback in the online process. We design FMUCB, a variant of FBM that finds the follower’s best manipulation strategy in this case, and derive its sample complexity as well as last-iterate convergence. Our results complement existing works [Birmpas et al., 2020, Chen et al., 2022, 2023] that focus on the learning perspective of only the follower against a commitment leader in offline settings.

To validate the theoretical results, we conduct synthetic experiments for the above settings. Empirical results show that: 1) in the limited information setting, (variants of) no-regret learning algorithms lead to convergence of Stackelberg equilibrium in general-sum Stackelberg games, 2) in the side information setting, our proposed follower manipulation strategy does introduce an intrinsic reward advantage compared to best responses, both in the cases of an omniscient follower and noisy side information.

2 Related Work

Decentralized learning in simultaneous multi-agent games. This line of work has broad real-world applications, e.g., when multiple self-interested teams patrol the same targets in security domains [Jiang et al., 2013] or wildlife conservation [Ministry of WWF-Pakistan, 2015], or when different countries independently plan their own actions in international waters against illegal fishing [Klein, 2017]. There have been extensive studies in this category [Blum and Mansour, 2007, Wu et al., 2022, Jin et al., 2021, Meng et al., 2021, Wei et al., 2021]. The seminal work of Blum and Mansour [2007] shows that decentralized no-regret learning in multi-agent general-sum games leads to a coarse correlated equilibrium (CCE). The result has been improved using more sophisticated methods such as optimistic hedge [Daskalakis et al., 2021, Chen and Peng, 2020, Anagnostides et al., 2022]. Jin et al. [2021], Liu et al. [2021] study a similar question in Markov games, which involve sequential decision-making and reinforcement learning.

Learning in Stackelberg games. While our work also focuses on decentralized learning, we focus on learning in Stackelberg games, where the players act sequentially. Lauffer et al. [2022] study a different setting from ours, where the leader first commits to a randomized strategy, and the follower observes the randomized strategy and best responds to it. Such a setting also appears in Balcan et al. [2015]. In our setting, the leader first plays a deterministic action, and the follower responds to the action after observing it, a setting that is closer to Bai et al. [2021], Kao et al. [2022]. Many past works on learning in Stackelberg games, such as security games [Blum et al., 2014, Peng et al., 2019, Balcan et al., 2015, Letchford et al., 2009], only focus on learning from the leader's perspective, assuming access to an oracle of the follower's best response. More recently, there has been rising interest in Stackelberg games where both players learn [Goktas et al., 2022, Sun et al., 2023, Kao et al., 2022, Zhao et al., 2023]. These works focus on sub-classes of games with specific reward structures. Goktas et al. [2022], Sun et al. [2023] study zero-sum stochastic Stackelberg games, where the sum of the rewards of the two players is zero. Kao et al. [2022], Zhao et al. [2023] study cooperative Stackelberg games, where the leader and the follower share the same reward. We study general-sum Stackelberg games, a more general setting that does not assume specific reward structures like the above. The lack of prior knowledge of the reward structures, however, makes the learning problem much more challenging; in fact, learning in repeated general-sum Stackelberg games remains an open problem. The setting in Bai et al. [2021] is closer to ours, in that it also learns general-sum Stackelberg equilibria from noisy bandit feedback. However, their learning process needs a central device that queries each leader-follower action pair a sufficient number of times, making it essentially a centralized or offline learning model. We study an online learning problem in repeated general-sum Stackelberg games, where the players act in a decentralized and strategic manner. Last and most importantly, a hidden assumption in these works is that the follower's best strategy is to best respond to the leader's actions. In addition to studying Stackelberg games with a best-responding follower (Section 4), we also take a new perspective of a manipulative follower (Sections 5-6), and show that manipulative strategies can indeed yield higher payoffs for the follower in an online learning setting.

Follower manipulation in Stackelberg games. It is known that the leader has a first-mover advantage in Stackelberg games: an optimal commitment in a Stackelberg equilibrium always yields a higher (or equal) payoff for the leader than any Nash equilibrium of the same game [Von Stengel and Zamir, 2010]. This result relies on the assumption that the leader has full information about the follower's reward, or that it can learn such information by interacting with a truthful follower [Letchford et al., 2009, Blum et al., 2014, Haghtalab et al., 2016, Roth et al., 2016, Peng et al., 2019]. In other words, these works assume that the follower always plays myopic best responses to the leader's strategy. Recent studies show that a follower can actually manipulate a commitment leader and induce an equilibrium different from the Stackelberg equilibrium by misreporting its payoffs [Gan et al., 2019a, b, Nguyen and Xu, 2019, Birmpas et al., 2020, Chen et al., 2022, 2023]. In parallel with these works, we consider an online learning setup where both players learn their best strategies from noisy bandit feedback. This is in contrast to the above existing works, which take the learning perspective of only the follower against a commitment leader in an offline setup.

3 Preliminary

A general-sum Stackelberg game is represented as a tuple \{\mathcal{A},\mathcal{B},\mu_{l},\mu_{f}\}. \mathcal{A} is the action set of the leader l, \mathcal{B} is the action set of the follower f, and |\mathcal{A}|=A, |\mathcal{B}|=B. We denote by \mathcal{A}\times\mathcal{B} the joint action set of the leader and the follower. For any joint action (a,b)\in\mathcal{A}\times\mathcal{B}, we use r_{l}(a,b)\in[0,1] and r_{f}(a,b)\in[0,1] to denote the noisy rewards of the leader and the follower, with expectations \mu_{l}(a,b)\in[0,1] and \mu_{f}(a,b)\in[0,1], respectively. The follower has a response set to the leader's actions, \mathcal{F}=\{\mathcal{F}(a)\,|\,a\in\mathcal{A}\}, where \mathcal{F}(a) is the response to the leader's action a. The follower's best response to a is defined as the action that maximizes the follower's true reward:

\mathcal{F}_{br}(a)=\operatorname*{arg\,max}_{b\in\mathcal{B}}\mu_{f}(a,b). (1)

We denote \mathcal{F}_{br}=\{\mathcal{F}_{br}(a)\,|\,a\in\mathcal{A}\} as the best response set. For simplicity, we assume that the best response to each action a is unique and that the game has a unique Stackelberg equilibrium (a_{se},b_{se}):

a_{se}=\operatorname*{arg\,max}_{a\in\mathcal{A}}\mu_{l}(a,\mathcal{F}_{br}(a)),\quad b_{se}=\mathcal{F}_{br}(a_{se}). (2)

A repeated general-sum Stackelberg game is played iteratively in a total of T rounds. In each round t, the procedure is as follows:

  • The leader plays an action a_{t}\in\mathcal{A}.

  • The follower observes the leader's action a_{t}, and plays b_{t}=\mathcal{F}_{t}(a_{t}) as the response.

  • The leader receives a noisy reward r_{l,t}(a_{t},b_{t}).

  • The follower receives reward information r_{t}(a_{t},b_{t}), which we elaborate on below.

In the last step, as motivated by the example in Table 1, we study two types of settings depending on the kind of reward information that is available to the follower:

  • Limited information. The follower observes a (noisy) reward of its own: r_{t}(a_{t},b_{t})=r_{f,t}(a_{t},b_{t}).

  • Side information. The follower has extra side information, meaning that it also knows the reward of the leader in addition to its own. This setting can be further divided into the case of an omniscient follower, who knows the exact reward functions \mu_{l} and \mu_{f} directly, and the case of noisy side information, where the follower learns the rewards from noisy bandit feedback in each round, i.e., r_{t}(a_{t},b_{t})=(r_{l,t}(a_{t},b_{t}),r_{f,t}(a_{t},b_{t})).

In the following, we will present our results for the limited information setting (Section 4), the omniscient follower setting (Section 5) and the noisy side information setting (Section 6).
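To fix ideas, here is a minimal Python sketch of the interaction protocol above in the noisy side information case. The leader and follower objects and their act/respond/update methods are placeholders of ours, not an interface defined in the paper.

```python
import numpy as np

def play_repeated_stackelberg(leader, follower, mu_l, mu_f, T, rng=None):
    """Decentralized protocol: the leader acts, the follower responds after
    observing the leader's action, and both receive noisy (here Bernoulli)
    reward feedback. `mu_l` and `mu_f` are A x B arrays of mean rewards."""
    rng = rng or np.random.default_rng(0)
    for t in range(T):
        a = leader.act(t)                      # leader plays a_t
        b = follower.respond(a, t)             # follower plays b_t = F_t(a_t)
        r_l = rng.binomial(1, mu_l[a, b])      # leader's noisy reward
        r_f = rng.binomial(1, mu_f[a, b])      # follower's noisy reward
        leader.update(a, r_l)
        follower.update(a, b, r_f, r_l)        # r_l is only used with side information
```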

4 A myopic follower with limited information

We first investigate the follower's learning problem with limited information, where the best response is truly the follower's best strategy. Then, we design appropriate online learning algorithms for the leader and prove last-iterate convergence of the entire game involving both players.

4.1 Algorithm for the myopic follower

In the limited information setting, the follower does not know the entire game's payoff matrix and therefore cannot manipulate the game. Hence, myopically learning the best response to each action a\in\mathcal{A} is indeed the best strategy for the follower. The learning problem for the follower is then equivalent to A independent stochastic bandit problems, one for each action a\in\mathcal{A} that the leader plays. The Upper Confidence Bound (UCB) algorithm is a natural candidate for the follower, as it is asymptotically optimal for stochastic bandit problems [Bubeck et al., 2012].

In each round tt, the follower selects its response based on UCB:

b_{t}=\operatorname*{arg\,max}_{b\in\mathcal{B}}\text{ucb}_{f,t}(a_{t},b),

where \text{ucb}_{f,t}(a,b)=\hat{\mu}_{f,t}(a,b)+\sqrt{\frac{2\log(T/\delta)}{n_{t}(a,b)}}, n_{t}(a,b)=\max\big\{1,\sum_{\tau=1}^{t-1}\mathbb{I}\left\{a_{\tau}=a\land b_{\tau}=b\right\}\big\} is the number of times the action pair (a,b) has been selected, and \hat{\mu}_{f,t}(a,b) is the empirical mean of the follower's reward for (a,b):

\hat{\mu}_{f,t}(a,b)=\frac{1}{n_{t}(a,b)}\sum_{\tau=1}^{t-1}r_{f,\tau}(a_{\tau},b_{\tau})\mathbb{I}\left\{a_{\tau}=a\land b_{\tau}=b\right\}.
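A minimal Python sketch of the follower's UCB response rule above, with array-based bookkeeping of our own (the class and field names are illustrative, not from the paper):

```python
import numpy as np

class UCBFollower:
    """Limited information follower: one UCB instance per leader action a,
    over the B possible responses."""
    def __init__(self, A, B, T, delta):
        self.n = np.zeros((A, B))        # n_t(a, b): selection counts
        self.sum_r = np.zeros((A, B))    # cumulative follower reward for (a, b)
        self.bonus_scale = 2 * np.log(T / delta)

    def respond(self, a, t=None):
        n = np.maximum(self.n[a], 1)
        ucb = self.sum_r[a] / n + np.sqrt(self.bonus_scale / n)
        return int(np.argmax(ucb))       # b_t = argmax_b ucb_{f,t}(a_t, b)

    def update(self, a, b, r_f, r_l=None):
        self.n[a, b] += 1
        self.sum_r[a, b] += r_f
```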

Following the result of Auer et al. [2002a], for each stochastic bandit problem associated with a\in\mathcal{A}, UCB needs approximately \mathcal{O}\left(\frac{B\log T}{\Delta^{2}_{1}}\right) rounds to find the best response, where \Delta_{1} is the minimum suboptimality gap to the best response over all a\in\mathcal{A}:

\Delta_{1}=\min_{a\in\mathcal{A}}\min_{b\neq\mathcal{F}_{br}(a)}\big(\mu_{f}(a,\mathcal{F}_{br}(a))-\mu_{f}(a,b)\big).

This directly leads to the following proposition:

Proposition 1.

In a repeated general-sum Stackelberg game, if the follower uses UCB as its learning algorithm, then with probability at least 1-\delta, the follower's regret is bounded as:

\mathcal{R}^{S}_{f}(T)=\mathbb{E}\left[\sum_{t=1}^{T}\mu_{f}\left(a_{t},\mathcal{F}_{br}(a_{t})\right)-\mu_{f}\left(a_{t},b_{t}\right)\right]\leq\mathcal{O}\left(\frac{AB\log(T/\delta)}{\Delta_{1}}\right).

This proposition shows that a myopic follower achieves no-Stackelberg-regret learning when it decomposes the Stackelberg game into A stochastic bandit problems and uses the UCB algorithm to respond, no matter what learning algorithm the leader uses.

We emphasize that using arbitrary no-regret algorithms does not by itself ensure convergence to the Stackelberg equilibrium in general-sum Stackelberg games. The following result supports this observation:

Theorem 1.

[Non-convergence to the Stackelberg equilibrium] Applying the UCB-UCB algorithm in a repeated general-sum Stackelberg game can result in the leader suffering regret linear in T, which implies that the game fails to converge to the Stackelberg equilibrium.

To further understand the convergence of the entire game, we investigate two kinds of algorithms for the leader, a UCB-style algorithm and an EXP3-style algorithm, two of the most prevalent algorithm families in the online learning literature. Combined with the follower's UCB algorithm, this gives two decentralized learning paradigms: EXP3-UCB and UCBE-UCB. It is worth noting that UCB-UCB is also studied in Kao et al. [2022], which focuses on a cooperative Stackelberg game setting. In the following, we present our first two major results on the last-iterate convergence of the two learning paradigms.

4.2 Last-iterate convergence of EXP3-UCB

We next introduce a variant of EXP3 for the leader. We denote the unbiased estimate of the reward r_{l,t}(a,\mathcal{F}_{t}(a)) that the leader receives in round t as \tilde{r}_{t}(a)=\frac{r_{l,t}(a,\mathcal{F}_{t}(a))}{\tilde{x}_{t}(a)}\mathbb{I}\{a_{t}=a\}, where \tilde{x}_{t}(a) is the probability of playing action a, defined below. The weight x_{t}(a) of each action a\in\mathcal{A} is updated as x_{t+1}(a)=\frac{\exp(y_{t}(a))}{\sum_{a^{\prime}\in\mathcal{A}}\exp(y_{t}(a^{\prime}))}, where y_{t}(a)=\sum_{\tau=1}^{t}\eta\tilde{r}_{\tau}(a). Let {\bm{x}}_{t}=\langle x_{t}(a)\rangle_{a\in\mathcal{A}} be the weight vector over all actions, initialized as the discrete uniform distribution {\bm{x}}_{1}=[1/A,\cdots,1/A]^{\top}. In round t, the leader selects action a_{t} according to the distribution

{\tilde{\bm{x}}}_{t}=(1-\alpha){\bm{x}}_{t}+\alpha[1/A,\cdots,1/A]^{\top}.

The second term on the right-hand side enforces a random probability of taking any action in \mathcal{A} to guarantee a minimum amount of exploration. As we will show in the proof of Theorem 2, the extra \alpha exploration is essential for convergence to the Stackelberg equilibrium, as it allows the follower to explore enough to find its best responses using its UCB sub-routine. In the following, we simply use EXP3 to refer to the above variant of EXP3 with explicit uniform exploration.
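A minimal Python sketch of this EXP3 variant with explicit uniform exploration (our own illustrative implementation; the class and method names are placeholders):

```python
import numpy as np

class EXP3Leader:
    """EXP3 with an extra alpha-uniform exploration term, as described above."""
    def __init__(self, A, eta, alpha, rng=None):
        self.A, self.eta, self.alpha = A, eta, alpha
        self.rng = rng or np.random.default_rng(0)
        self.y = np.zeros(A)                       # y_t(a) = sum of eta * r_tilde

    def act(self, t=None):
        x = np.exp(self.y - self.y.max())          # softmax weights x_t (stabilized)
        x /= x.sum()
        self.x_tilde = (1 - self.alpha) * x + self.alpha / self.A
        self.a = int(self.rng.choice(self.A, p=self.x_tilde))
        return self.a

    def update(self, a, r_l):
        r_tilde = r_l / self.x_tilde[a]            # importance-weighted estimate
        self.y[a] += self.eta * r_tilde            # only the played action is updated
```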

Theorem 2.

[Last-iterate convergence of EXP3-UCB under limited information] In the limited information setting of a repeated general-sum Stackelberg game with noisy bandit feedback, applying EXP3-UCB with \alpha=\mathcal{O}\left(T^{-\frac{1}{3}}\right) and \eta=\mathcal{O}\left(T^{-\frac{1}{3}}\right), with probability at least 1-3\delta,

\mathbb{P}\big[a_{T}\neq a_{se}\big]\leq\alpha+A\exp\!\left(-\Delta_{2}T^{\frac{2}{3}}+2\sqrt{2A\log\frac{2}{\delta}}\,T^{\frac{1}{3}}+\frac{C_{1}AB}{\Delta_{1}^{2}}\log\frac{T}{\delta}\log\frac{1}{\delta}\right),

where C_{1} is a constant and \Delta_{2} is the leader's suboptimality reward gap to the Stackelberg equilibrium:

\Delta_{2}=\min_{a\neq a_{se}}\big(\mu_{l}(a_{se},b_{se})-\mu_{l}(a,\mathcal{F}_{br}(a))\big).

The proof is in Appendix A.2.

4.3 Last-iterate convergence of UCBE-UCB

When the leader uses a UCB-style algorithm, guaranteeing that the game converges to the Stackelberg equilibrium requires that the UCB term always upper bounds \mu_{l}(a,\mathcal{F}_{br}(a)) with high probability. This is not guaranteed by a vanilla UCB algorithm (Theorem 1), since the follower's responses have not yet stabilized and are not necessarily best responses before the follower's UCB algorithm converges. Therefore, we introduce a variant of UCB, called UCB with extra Exploration (UCBE), whose bonus term S_{0} ensures that \mu_{l}(a,\mathcal{F}_{br}(a)) is upper bounded with high probability. In UCBE, the leader chooses action a_{t}=\operatorname*{arg\,max}_{a\in\mathcal{A}}\text{ucbe}_{l,t}(a) in round t, where

\text{ucbe}_{l,t}(a)=\hat{\mu}_{l,t}(a)+\sqrt{\frac{S_{0}}{n_{t}(a)}},\quad S_{0}=\mathcal{O}\left(\frac{B}{\varepsilon^{3}}\log\frac{ABT}{\delta}\right). (3)

Here \varepsilon=\min\left\{\Delta_{1},\Delta_{2}\right\}, \hat{\mu}_{l,t}(a) is the empirical mean reward of the leader's action a: \hat{\mu}_{l,t}(a)=\frac{1}{n_{t}(a)}\sum_{\tau=1}^{t-1}r_{l,\tau}(a_{\tau},b_{\tau})\mathbb{I}\left\{a_{\tau}=a\right\}, and n_{t}(a)=\max\big\{1,\sum_{\tau=1}^{t-1}\mathbb{I}\left\{a_{\tau}=a\right\}\big\}.
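A minimal Python sketch of the UCBE leader, with the exploration constant S_{0} passed in precomputed (again our own illustration):

```python
import numpy as np

class UCBELeader:
    """UCB with extra Exploration: bonus sqrt(S0 / n_t(a)) in place of the
    usual sqrt(2 log(T/delta) / n_t(a))."""
    def __init__(self, A, S0):
        self.S0 = S0
        self.n = np.zeros(A)       # n_t(a)
        self.sum_r = np.zeros(A)   # cumulative leader reward for action a

    def act(self, t=None):
        n = np.maximum(self.n, 1)
        ucbe = self.sum_r / n + np.sqrt(self.S0 / n)
        return int(np.argmax(ucbe))

    def update(self, a, r_l):
        self.n[a] += 1
        self.sum_r[a] += r_l
```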

Theorem 3.

[Last-iterate convergence of UCBE-UCB under limited information] In the limited information setting of a repeated general-sum Stackelberg game with noisy bandit feedback, applying UCBE-UCB with S_{0}=\mathcal{O}\left(\frac{B}{\varepsilon^{3}}\log\frac{ABT}{\delta}\right) and \varepsilon=\min\{\Delta_{1},\Delta_{2}\}, for T\geq\mathcal{O}\left(\frac{AS_{0}}{\Delta_{1}^{2}}\right), with probability at least 1-3\delta, we have:

\mathbb{P}\big(a_{T}\neq a_{se}\big)\leq\frac{\delta}{T}.

The proof is in Appendix A.3. This result shows that the game converges in the last iterate to the Stackelberg equilibrium under the decentralized online learning paradigm UCBE-UCB. It is worth noting that this is a further step beyond Kao et al. [2022], who prove average convergence of UCB-UCB in cooperative Stackelberg games.

5 A manipulative and omniscient follower

In some real-world settings, the follower has extra side information about the leader’s reward structure. As shown in the example in Table 1, a strategic follower can use that information to manipulate the leader’s learning and induce the convergence of the game into a more desirable equilibrium. To start with, we study a simpler case where the follower is omniscient [Zhao et al., 2023], i.e., it knows the exact true rewards of the players. A manipulative follower aims to maximize its average reward:

\max_{\mathcal{F}}\frac{1}{T}\sum\nolimits_{t=1}^{T}\mu_{f}(a_{t},\mathcal{F}(a_{t})),\ T\to\infty, (4)

where the leader's action sequence \{a_{t}\}_{t=1}^{T} is generated by the leader's no-regret learning algorithm. When the follower uses the response set \mathcal{F}, the leader (who only observes its own reward signals) takes actions that maximize its own reward given \mathcal{F}. When \mathcal{F} is fixed, the learning problem for the leader reduces to a classical stochastic bandit problem in which the true reward of each action a\in\mathcal{A} is \mu_{l}(a,\mathcal{F}(a)). The leader will therefore eventually learn to take an action a^{\prime}\in\operatorname*{arg\,max}_{a\in\mathcal{A}}\mu_{l}(a,\mathcal{F}(a)) when using no-regret learning. This implies that the leader can be misled by the follower's manipulation.

Formulating the leader's action selection as a constraint, when T\to\infty (i.e., at equilibrium), the above equation can be rewritten in a more concrete form as:

\max_{\mathcal{F},(a^{\prime},b^{\prime})}\ \mu_{f}(a^{\prime},b^{\prime})\quad\text{s.t.}\quad\mu_{l}(a^{\prime},b^{\prime})>\max_{a\neq a^{\prime}}\mu_{l}(a,\mathcal{F}(a)),\ b^{\prime}=\mathcal{F}(a^{\prime}) (5)

For simplicity, we do not consider tie-breaking rules for the leader; a more general formulation that accounts for tie-breaking is discussed in Appendix C. A manipulation strategy \mathcal{F} is associated with an action pair (a^{\prime},b^{\prime}). We call \left\{\mathcal{F},(a^{\prime},b^{\prime})\right\} a qualified manipulation if it satisfies the constraint in Eq. (5). We denote an optimal solution for the follower by \mathcal{F}_{opt}, and let a_{fm}=\operatorname*{arg\,max}_{a\in\mathcal{A}}\mu_{l}(a,\mathcal{F}_{opt}(a)) and b_{fm}=\mathcal{F}_{opt}(a_{fm}). We assume for simplicity that (a_{fm},b_{fm}) is unique. We can verify that \left\{\mathcal{F}_{br},(a_{se},b_{se})\right\} is a qualified manipulation:

\mu_{l}(a_{se},b_{se})>\max_{a\neq a_{se}}\mu_{l}(a,\mathcal{F}_{br}(a)).

Because (a_{fm},b_{fm}) corresponds to the optimal manipulation (and thus the maximal follower reward), we have:

Proposition 2.

For any general-sum Stackelberg game with reward functions \mu_{l}(\cdot,\cdot) and \mu_{f}(\cdot,\cdot), if the Stackelberg equilibrium is unique, then the reward gap between (a_{fm},b_{fm}) and (a_{se},b_{se}) satisfies:

\text{Gap}(a_{fm},a_{se})=\mu_{f}(a_{fm},b_{fm})-\mu_{f}(a_{se},b_{se})\geq 0. (6)

The equality holds if and only if (a_{se},b_{se})=(a_{fm},b_{fm}).
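As a concrete instance, in the example game of Table 1 the follower's highest-reward pair is (a_{2},b_{1}); pairing it with \mathcal{F}(a_{1})=b_{2} satisfies the constraint in Eq. (5) since \mu_{l}(a_{2},b_{1})=0.2>\mu_{l}(a_{1},b_{2})=0.1, so (a_{fm},b_{fm})=(a_{2},b_{1}) and

\text{Gap}(a_{fm},a_{se})=\mu_{f}(a_{2},b_{1})-\mu_{f}(a_{1},b_{1})=1-0.1=0.9.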

5.1 Follower’s best manipulation (FBM)

0:  Candidate set \mathcal{K}=\mathcal{A}\times\mathcal{B}, \mu_{l}(\cdot,\cdot), \mu_{f}(\cdot,\cdot)
1:  Candidate manipulation pair (a^{\prime},b^{\prime})=\operatorname*{arg\,max}_{(a,b)\in\mathcal{K}}\mu_{f}(a,b)
2:  \mathcal{F}=\{\mathcal{F}(a^{\prime})=b^{\prime}\}\cup\{\mathcal{F}(a)=\text{wr}(a)=\operatorname*{arg\,min}_{b\in\mathcal{B}}\mu_{l}(a,b):a\neq a^{\prime}\}
3:  if \max_{a\neq a^{\prime}}\mu_{l}(a,\mathcal{F}(a))\geq\mu_{l}(a^{\prime},b^{\prime}) then
4:     Eliminate (a^{\prime},b^{\prime}) from the candidate set: \mathcal{K}\leftarrow\mathcal{K}\backslash(a^{\prime},b^{\prime})
5:     Return to Line 1
6:  end if
7:  Return the response function \mathcal{F}_{opt}=\mathcal{F}
Algorithm 1 Follower's best manipulation (FBM)

Based on the insights from the example in Table 1 and Proposition 2, we propose Follower's Best Manipulation (FBM, Algorithm 1), a greedy algorithm that finds the best manipulation strategy for an omniscient follower. At a high level, the key idea is to find a response set \mathcal{F} that exaggerates the difference in the leader's reward between a^{\prime} (an action that leads to a large follower reward) and all other actions (Line 2).

\mathcal{K} is the candidate manipulation pair set, initialized as the entire action space \mathcal{A}\times\mathcal{B}. The algorithm starts by selecting the potential manipulation pair (a^{\prime},b^{\prime}) from the candidate set \mathcal{K} that maximizes \mu_{f}(\cdot,\cdot) (Line 1). It then forms the manipulation strategy (response set) that returns the minimum leader reward \min_{b\in\mathcal{B}}\mu_{l}(a,b) for each a\neq a^{\prime}, and reward \mu_{l}(a^{\prime},b^{\prime}) when the leader plays a^{\prime} (Line 2). Here, with a slight abuse of notation, \text{wr}(a) stands for the follower's worst response, i.e., the response that induces the lowest leader reward:

\text{wr}(a)=\operatorname*{arg\,min}_{b\in\mathcal{B}}\mu_{l}(a,b).

Next (Line 3), the algorithm verifies whether the current \{\mathcal{F},(a^{\prime},b^{\prime})\} is a qualified manipulation. If it is, then \mathcal{F} is the optimal manipulation strategy, since the associated manipulation pair yields the maximum reward for the follower; the algorithm exits the loop and returns the final solution \mathcal{F}_{opt}=\mathcal{F}. Otherwise, it eliminates (a^{\prime},b^{\prime}) from \mathcal{K} and repeats the process (Lines 4-5).
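A minimal Python sketch of FBM operating on dense payoff matrices (our own illustrative implementation of Algorithm 1):

```python
import numpy as np

def follower_best_manipulation(mu_l, mu_f):
    """FBM (Algorithm 1): try candidate pairs (a', b') in decreasing order of
    the follower's reward; pair a' with b' and every other leader action with
    its worst response wr(a), until the qualified-manipulation check passes."""
    A, B = mu_l.shape
    wr = mu_l.argmin(axis=1)                              # wr(a) for each a
    candidates = sorted(np.ndindex(A, B), key=lambda ab: -mu_f[ab])
    for a_p, b_p in candidates:
        F = wr.copy()
        F[a_p] = b_p                                      # respond b' to a'
        others = [mu_l[a, F[a]] for a in range(A) if a != a_p]
        if not others or max(others) < mu_l[a_p, b_p]:    # constraint of Eq. (5)
            return F, (a_p, b_p)
    return None                                           # no qualified manipulation
```

On the Table 1 payoffs, this sketch returns the response set used in the introduction, with (a_{fm},b_{fm})=(a_{2},b_{1}).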

Under the follower's best manipulation strategy, the leader's problem reduces to a stochastic bandit problem in which the reward of each action a\in\mathcal{A} is \mu_{l}(a,\mathcal{F}_{opt}(a)), and the suboptimality gap is

\Delta_{3}=\min_{a\neq a_{fm}}\big(\mu_{l}(a_{fm},b_{fm})-\mu_{l}(a,\text{wr}(a))\big).
Proposition 3.

In a repeated general-sum Stackelberg game with a unique Stackelberg equilibrium, if the leader uses a no-regret learning algorithm \mathcal{C} and the follower uses the best manipulation \mathcal{F}_{opt},

\frac{1}{T}\sum\nolimits_{t=1}^{T}\mu_{f}\left(a_{t},\mathcal{F}_{opt}(a_{t})\right)=\frac{1}{T}\sum\nolimits_{t=1}^{T}\mu_{f}\left(\bar{a}_{t},\mathcal{F}_{br}(\bar{a}_{t})\right)+\text{Gap}(a_{fm},a_{se})+\delta(T),

where \delta(T)\to 0 as T\to\infty, \{a_{t}\}_{t=1}^{T} is generated by Algorithm \mathcal{C} and \mathcal{F}_{opt}, and \{\bar{a}_{t}\}_{t=1}^{T} is generated by Algorithm \mathcal{C} and \mathcal{F}_{br}.

When the follower keeps using the best manipulation strategy, the game will eventually converge to the best manipulation pair (a_{fm},b_{fm}). Similarly, if the follower uses the best response strategy, the game will eventually converge to the Stackelberg equilibrium (a_{se},b_{se}). According to Propositions 2 and 3, the best manipulation strategy gains an extra average reward of \text{Gap}(a_{fm},a_{se}) compared to the best response strategy.

6 Manipulative follower with noisy side information

6.1 Follower’s manipulation strategy

0:  Candidate set \mathcal{K}=\mathcal{A}\times\mathcal{B}, \text{U}_{l,t}(\cdot,\cdot), \text{U}_{f,t}(\cdot,\cdot), \hat{\mu}_{l,t}(\cdot,\cdot)
1:  Candidate manipulation pair (a^{\prime},b^{\prime})=\operatorname*{arg\,max}_{(a,b)\in\mathcal{K}}\text{U}_{f,t}(a,b)
2:  \mathcal{F}=\{\mathcal{F}(a^{\prime})=b^{\prime}\}\cup\{\mathcal{F}(a)=\operatorname*{arg\,min}_{b\in\mathcal{B}}\text{U}_{l,t}(a,b):a\neq a^{\prime}\}
3:  if \max_{a\neq a^{\prime}}\text{U}_{l,t}(a,\mathcal{F}(a))\geq\hat{\mu}_{l,t}(a^{\prime},b^{\prime}) then
4:     Eliminate (a^{\prime},b^{\prime}) from the candidate set: \mathcal{K}\leftarrow\mathcal{K}\backslash(a^{\prime},b^{\prime})
5:     Return to Line 1
6:  end if
7:  Return the response function \mathcal{F}
Algorithm 2 FMUCB(t): Follower's manipulation by UCB at round t

We now study the more intricate setting of learning the best manipulation strategy with noisy side information. We first present the follower manipulation algorithm FMUCB in Algorithm 2, a variant of FBM (Algorithm 1) that uses UCB to learn the best follower manipulation strategy from noisy side information. Algorithm 2 largely resembles Algorithm 1, so we omit a detailed description. Besides the idea inherited from FBM of manipulating the leader's learning, another key ingredient of FMUCB is to design appropriate confidence-bound terms in place of the true reward terms \mu_{l}(a,b) and \mu_{f}(a,b) in FBM, which balance the trade-off between exploration and manipulation.

In Line 1, the goal is to find the candidate manipulation pair that maximizes \mu_{f}(\cdot,\cdot). Therefore, we maximize the upper confidence bound \text{U}_{f,t}(a,b):

\text{U}_{f,t}(a,b)=\hat{\mu}_{f,t}(a,b)+\sqrt{\frac{2\log(ABT/\delta)}{n_{t}(a,b)}}. (7)

A suboptimal manipulation pair will not be selected after enough exploration, thanks to the following suboptimality gap: for any action pair (a_{k},b_{k})\in\mathcal{A}\times\mathcal{B} satisfying \mu_{f}(a_{k},b_{k})<\mu_{f}(a_{fm},b_{fm}),

\Delta_{4}=\min_{(a_{k},b_{k})}\big(\mu_{f}(a_{fm},b_{fm})-\mu_{f}(a_{k},b_{k})\big).

Conversely, in Line 2, the goal is to find the “worst response” \text{wr}(a)=\operatorname*{arg\,min}_{b\in\mathcal{B}}\mu_{l}(a,b) for each action a\neq a_{fm}. Therefore, we define \text{U}_{l,t}(a,b) to be the lower confidence bound:

\text{U}_{l,t}(a,b)=\hat{\mu}_{l,t}(a,b)-\sqrt{\frac{2\log(ABT/\delta)}{n_{t}(a,b)}}. (8)

For each action a, the suboptimality gap against \text{wr}(a) is defined as \Delta_{5,a}=\min_{b\neq\text{wr}(a)}\big(\mu_{l}(a,b)-\mu_{l}(a,\text{wr}(a))\big), and the minimum suboptimality gap over all a\in\mathcal{A} is

\Delta_{5}=\min_{a\in\mathcal{A}}\Delta_{5,a}.

Given \Delta_{5}, a suboptimal \text{wr}(a) will not be selected after enough exploration.

When verifying whether the current \{\mathcal{F},(a^{\prime},b^{\prime})\} is a qualified manipulation (Line 3), we also use \text{U}_{l,t}(a,\mathcal{F}(a)), as it lower bounds \mu_{l}(a,\mathcal{F}(a)). This guarantees that the true best manipulation solution satisfies the constraint, since the following holds with high probability for any action a\neq a_{fm}:

\text{U}_{l,t}(a,\text{wr}(a))\leq\mu_{l}(a,\text{wr}(a))<\mu_{l}(a_{fm},b_{fm}).

Infeasible solutions are eventually eliminated, since \text{U}_{l,t}(a,\text{wr}(a)) approaches \mu_{l}(a,\text{wr}(a)) and the following suboptimality gap exists: for any action pair (a_{k},b_{k})\in\mathcal{A}\times\mathcal{B} satisfying \mu_{f}(a_{fm},b_{fm})<\mu_{f}(a_{k},b_{k}),

\Delta_{6}=\min_{(a_{k},b_{k})}\big(\max_{a\neq a_{k}}\mu_{l}(a,\text{wr}(a))-\mu_{l}(a_{k},b_{k})\big).

Hence, Algorithm 2 can find the follower’s best manipulation with noisy side information.
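A minimal Python sketch of FMUCB at round t, reusing the array bookkeeping of the UCB follower sketch in Section 4.1 (our own illustration, not the paper's code):

```python
import numpy as np

def fmucb_response_set(n, sum_f, sum_l, T, delta):
    """FMUCB (Algorithm 2) at round t. `n`, `sum_f`, `sum_l` are A x B arrays
    of visit counts and cumulative follower/leader rewards (our bookkeeping)."""
    A, B = n.shape
    n_safe = np.maximum(n, 1)
    bonus = np.sqrt(2 * np.log(A * B * T / delta) / n_safe)
    mu_hat_l = sum_l / n_safe
    U_f = sum_f / n_safe + bonus          # Eq. (7): upper bound on follower reward
    U_l = mu_hat_l - bonus                # Eq. (8): lower bound on leader reward
    wr_hat = U_l.argmin(axis=1)           # pessimistic "worst response" per action
    candidates = sorted(np.ndindex(A, B), key=lambda ab: -U_f[ab])
    for a_p, b_p in candidates:
        F = wr_hat.copy()
        F[a_p] = b_p
        others = [U_l[a, F[a]] for a in range(A) if a != a_p]
        if not others or max(others) < mu_hat_l[a_p, b_p]:   # Line 3 check
            return F
    return wr_hat                         # fallback if no pair qualifies yet
```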

[Figure 1 appears here, with three panels: (a) Limited information, (b) Omniscient follower, (c) Noisy side information.]
Figure 1: Experiments for the limited information and side information settings. Each experiment is run 5 times to calculate the mean and std (standard deviation). The shaded area shows ±std.

6.2 Last-iterate convergence analysis

We also provide theoretical guarantees for the sample efficiency and convergence of this algorithm. As a continuation of Section 4, we analyze the performance of our designed algorithm against both kinds of leader algorithms, EXP3 and UCBE.

Theorem 4 (Last-iterate convergence of EXP3-FMUCB with noisy side information).

When the leader uses EXP3 with parameters \alpha=\mathcal{O}\left(T^{-\frac{1}{3}}\right) and \eta=\mathcal{O}\left(T^{-\frac{1}{3}}\right) and the follower uses FMUCB, let T_{f,w} be the total number of rounds in which the follower does not play the best manipulation strategy over T rounds. Then, with probability at least 1-3\delta,

T_{f,w}\leq\mathcal{O}\left(\frac{A^{2}B}{\alpha\varepsilon^{2}}\log\frac{ABT}{\delta}\log\frac{1}{\delta}\right),\quad\text{and}
\mathbb{P}\big[a_{T}\neq a_{fm}\big]\leq\alpha+A\exp\left(-\Delta_{3}T^{\frac{2}{3}}+2\sqrt{2A\log\frac{2}{\delta}}\,T^{\frac{1}{3}}+\frac{C_{2}A^{2}B}{\varepsilon^{2}}\log\frac{ABT}{\delta}\log\frac{1}{\delta}\right),

where \varepsilon=\min\{\Delta_{4},\Delta_{5},\Delta_{6}\} and C_{2} is a constant.

Theorem 5 (Last-iterate convergence of UCBE-FMUCB with noisy side information).

When the leader uses UCBE with S_{0}=\mathcal{O}\left(\frac{B}{\varepsilon^{3}}\log\frac{ABT}{\delta}\right) and \varepsilon=\min\{\Delta_{4},\Delta_{5},\Delta_{6}\} and the follower uses FMUCB, let T_{f,w} be the total number of rounds in which the follower does not play the best manipulation strategy over T rounds. Then, with probability at least 1-3\delta,

T_{f,w}\leq\mathcal{O}\left(\frac{AB}{\varepsilon^{2}}\log\frac{ABT}{\delta}\right),

and for T\geq\mathcal{O}\left(\frac{AS_{0}}{\Delta_{3}^{2}}\right), \mathbb{P}\big[a_{T}\neq a_{fm}\big]\leq\frac{\delta}{T}.

The proofs of Theorems 4 and 5 are in Appendix B.1 and Appendix B.2, respectively. The two theorems give the sample complexity of learning the best manipulation strategy using FMUCB with noisy side information. They also guarantee that the game achieves last-iterate convergence to the best manipulation pair (a_{fm},b_{fm}) under the learning paradigms EXP3-FMUCB and UCBE-FMUCB.

7 Empirical Results

To validate the theoretical results, we conduct experiments on a synthetic Stackelberg game. For both players, the number of actions is A=B=5. The players' mean rewards \mu_{l}(a,b) and \mu_{f}(a,b) for each action pair (a,b) are generated independently and identically by sampling from the uniform distribution \text{U}(0,1). For any action pair (a,b), the noisy rewards in round t are generated from Bernoulli distributions: r_{l,t}(a,b)\sim\text{Ber}\left(\mu_{l}(a,b)\right) and r_{f,t}(a,b)\sim\text{Ber}\left(\mu_{f}(a,b)\right), respectively. All experiments were run on a MacBook Pro laptop with a 1.4 GHz quad-core Intel Core i5 processor, Intel Iris Plus Graphics 645 (1536 MB), and 8 GB of memory.
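A minimal Python sketch of the synthetic setup and one EXP3-UCB run, reusing the illustrative classes and protocol sketch from earlier sections; the horizon T and confidence level delta below are placeholders of ours, while alpha and eta follow the values stated in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
A = B = 5
T = 200_000          # horizon (placeholder; not specified in the text)
delta = 0.01         # confidence level (placeholder)

# Mean rewards drawn i.i.d. from U(0, 1); noisy feedback is Bernoulli(mu).
mu_l = rng.uniform(0, 1, size=(A, B))
mu_f = rng.uniform(0, 1, size=(A, B))

leader = EXP3Leader(A, eta=0.001, alpha=0.01, rng=rng)   # alpha, eta as in this section
follower = UCBFollower(A, B, T, delta)
play_repeated_stackelberg(leader, follower, mu_l, mu_f, T, rng)
```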

Our first experiment validates the convergence of the no-regret learning paradigms EXP3-UCB and UCBE-UCB (Theorems 2 and 3 in Section 4). Figure 1(a) shows the average regret of the leader w.r.t. round t. The blue and orange curves correspond to EXP3-UCB (\alpha=0.01, \eta=0.001) and UCBE-UCB, respectively. We can see that the leader achieves no-Stackelberg-regret learning via both algorithms, implying that the game converges to the Stackelberg equilibrium in both cases.

The second experiment validates the intrinsic reward gap between FBM and BR (best response) when the follower is omniscient (Proposition 3 in Section 5). In this experiment, the leader uses EXP3 (\alpha=0.01, \eta=0.001), and the follower uses BR (blue curve) or FBM (red curve). The result is shown in Figure 1(b), where the x-axis is the number of rounds t and the y-axis is the follower's average reward. It shows that after sufficiently many rounds, the follower's reward converges, and FBM yields a significant intrinsic advantage (the gap between the two dashed lines) of approximately 0.22 over BR.

The third experiment validates the intrinsic reward gap between FMUCB and UCB when the follower learns the leader's reward structure via noisy bandit feedback (Theorems 4 and 5 in Section 6). The leader uses either EXP3 (\alpha=0.1, \eta=0.001) or UCBE, and we compare the follower's average reward w.r.t. round t when it uses FMUCB versus the baseline best response learned by vanilla UCB. This gives four curves in Figure 1(c): EXP3-UCB, EXP3-FMUCB, UCBE-UCB, and UCBE-FMUCB. We can see that regardless of whether the leader uses EXP3 or UCBE, FMUCB yields a significant reward advantage of about 0.3 for the follower compared to UCB.

8 Conclusion

We study decentralized online learning in general-sum Stackelberg games under the limited information and side information settings. For both settings, we design online learning algorithms for both players that learn from noisy bandit feedback, prove last-iterate convergence, and derive sample complexity bounds for the designed algorithms. Our theoretical results also show that the follower can indeed gain an advantage by using a manipulative strategy instead of the best response strategy, a result that complements existing work in offline settings. Our empirical results are consistent with our theoretical findings.

9 Acknowledgment

The authors would like to thank Haifeng Xu and Mengxiao Zhang for valuable discussions and feedback.

References

  • Abe et al. [2022] Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto, Kentaro Toyoshima, and Atsushi Iwasaki. Last-iterate convergence with full and noisy feedback in two-player zero-sum games. arXiv preprint arXiv:2208.09855, 2022.
  • Amin et al. [2013] Kareem Amin, Afshin Rostamizadeh, and Umar Syed. Learning prices for repeated auctions with strategic buyers. Advances in Neural Information Processing Systems, 26, 2013.
  • Anagnostides et al. [2022] Ioannis Anagnostides, Constantinos Daskalakis, Gabriele Farina, Maxwell Fishelson, Noah Golowich, and Tuomas Sandholm. Near-optimal no-regret learning for correlated equilibria in multi-player general-sum games. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 736–749, 2022.
  • Auer et al. [2002a] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256, 2002a.
  • Auer et al. [2002b] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002b.
  • Bai et al. [2021] Yu Bai, Chi Jin, Huan Wang, and Caiming Xiong. Sample-efficient learning of stackelberg equilibria in general-sum games. Advances in Neural Information Processing Systems, 34:25799–25811, 2021.
  • Balcan et al. [2015] Maria-Florina Balcan, Avrim Blum, Nika Haghtalab, and Ariel D Procaccia. Commitment without regrets: Online learning in Stackelberg security games. In Proceedings of the sixteenth ACM conference on economics and computation, pages 61–78, 2015.
  • Birmpas et al. [2020] Georgios Birmpas, Jiarui Gan, Alexandros Hollender, Francisco Marmolejo, Ninad Rajgopal, and Alexandros Voudouris. Optimally deceiving a learning leader in stackelberg games. Advances in Neural Information Processing Systems, 33:20624–20635, 2020.
  • Blum and Mansour [2007] Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8(6), 2007.
  • Blum et al. [2014] Avrim Blum, Nika Haghtalab, and Ariel D Procaccia. Learning optimal commitment to overcome insecurity. Advances in Neural Information Processing Systems, 27, 2014.
  • Bubeck et al. [2012] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • Cai et al. [2024] Yang Cai, Haipeng Luo, Chen-Yu Wei, and Weiqiang Zheng. Uncoupled and convergent learning in two-player zero-sum markov games with bandit feedback. Advances in Neural Information Processing Systems, 36, 2024.
  • Chen and Peng [2020] Xi Chen and Binghui Peng. Hedging in games: Faster convergence of external and swap regrets. Advances in Neural Information Processing Systems, 33:18990–18999, 2020.
  • Chen et al. [2022] Yurong Chen, Xiaotie Deng, and Yuhao Li. Optimal private payoff manipulation against commitment in extensive-form games. In In Proceedings of 18th International Conference on Web and Internet Economics, 2022.
  • Chen et al. [2023] Yurong Chen, Xiaotie Deng, Jiarui Gan, and Yuhao Li. Learning to manipulate a commitment optimizer. arXiv preprint arXiv:2302.11829, 2023.
  • Cole and Roughgarden [2014] Richard Cole and Tim Roughgarden. The sample complexity of revenue maximization. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 243–252, 2014.
  • Conitzer and Sandholm [2002] Vincent Conitzer and Tuomas Sandholm. Complexity of mechanism design. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pages 103–110, 2002.
  • Daskalakis and Panageas [2018] Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. Advances in neural information processing systems, 31, 2018.
  • Daskalakis et al. [2021] Constantinos Daskalakis, Maxwell Fishelson, and Noah Golowich. Near-optimal no-regret learning in general games. Advances in Neural Information Processing Systems, 34:27604–27616, 2021.
  • Fang et al. [2015] Fei Fang, Peter Stone, and Milind Tambe. When security games go green: designing defender strategies to prevent poaching and illegal fishing. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 2589–2595, 2015.
  • Gan et al. [2018] Jiarui Gan, Edith Elkind, and Michael Wooldridge. Stackelberg security games with multiple uncoordinated defenders. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 703–711, 2018.
  • Gan et al. [2019a] Jiarui Gan, Qingyu Guo, Long Tran-Thanh, Bo An, and Michael Wooldridge. Manipulating a learning defender and ways to counteract. Advances in Neural Information Processing Systems, 32, 2019a.
  • Gan et al. [2019b] Jiarui Gan, Haifeng Xu, Qingyu Guo, Long Tran-Thanh, Zinovi Rabinovich, and Michael Wooldridge. Imitative follower deception in stackelberg games. In Proceedings of the 2019 ACM Conference on Economics and Computation, pages 639–657, 2019b.
  • Ghosh [2023] Arnob Ghosh. Provably efficient model-free rl in leader-follower mdp with linear function approximation. In Learning for Dynamics and Control Conference, pages 1112–1124. PMLR, 2023.
  • Goktas et al. [2022] Denizalp Goktas, Sadie Zhao, and Amy Greenwald. Zero-sum stochastic stackelberg games. Advances in Neural Information Processing Systems, 35:11658–11672, 2022.
  • Haghtalab et al. [2016] Nika Haghtalab, Fei Fang, Thanh H Nguyen, Arunesh Sinha, Ariel D Procaccia, and Milind Tambe. Three strategies to success: learning adversary models in security games. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 308–314, 2016.
  • Jiang et al. [2013] Albert Xin Jiang, Ariel D Procaccia, Yundi Qian, Nisarg Shah, and Milind Tambe. Defender (mis) coordination in security games. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
  • Jin et al. [2021] Chi Jin, Qinghua Liu, Yuanhao Wang, and Tiancheng Yu. V-learning–a simple, efficient, decentralized algorithm for multiagent rl. arXiv preprint arXiv:2110.14555, 2021.
  • Kao et al. [2022] Hsu Kao, Chen-Yu Wei, and Vijay Subramanian. Decentralized cooperative reinforcement learning with hierarchical information structure. In International Conference on Algorithmic Learning Theory, pages 573–605, 2022.
  • Klein [2017] Natalie Klein. Can international litigation solve the india-sri lanka fishing dispute? https://researchers.mq.edu.au/en/publications/can-international-litigation-solve-the-india-sri-lanka-fishing-di, 2017. Accessed on 01 October 2023.
  • Lauffer et al. [2022] Niklas Lauffer, Mahsa Ghasemi, Abolfazl Hashemi, Yagiz Savas, and Ufuk Topcu. No-regret learning in dynamic stackelberg games. arXiv preprint arXiv:2202.04786, 2022.
  • Leibo et al. [2017] Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 464–473, 2017.
  • Letchford et al. [2009] Joshua Letchford, Vincent Conitzer, and Kamesh Munagala. Learning and approximating the optimal strategy to commit to. In Algorithmic Game Theory: Second International Symposium, SAGT 2009, Paphos, Cyprus, October 18-20, 2009. Proceedings 2, pages 250–262. Springer, 2009.
  • Lin et al. [2020] Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, pages 6083–6093. PMLR, 2020.
  • Liu et al. [2021] Qinghua Liu, Tiancheng Yu, Yu Bai, and Chi Jin. A sharp analysis of model-based reinforcement learning with self-play. In International Conference on Machine Learning, pages 7001–7010, 2021.
  • Mao and Başar [2023] Weichao Mao and Tamer Başar. Provably efficient reinforcement learning in decentralized general-sum markov games. Dynamic Games and Applications, 13(1):165–186, 2023.
  • Meng et al. [2021] Min Meng, Xiuxian Li, Yiguang Hong, Jie Chen, and Long Wang. Decentralized online learning for noncooperative games in dynamic environments. arXiv preprint arXiv:2105.06200, 2021.
  • Mertikopoulos et al. [2018] Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In Proceedings of the twenty-ninth annual ACM-SIAM symposium on discrete algorithms, pages 2703–2717. SIAM, 2018.
  • Ministry of WWF-Pakistan [2015] Ministry of WWF-Pakistan. National plan of action for combating illegal wildlife trade in Pakistan, 2015.
  • Myerson [1981] Roger B Myerson. Optimal auction design. Mathematics of operations research, 6(1):58–73, 1981.
  • Nguyen and Xu [2019] Thanh Nguyen and Haifeng Xu. Imitative attacker deception in Stackelberg security games. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 528–534, 2019.
  • Orabona [2019] Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.
  • Peng et al. [2019] Binghui Peng, Weiran Shen, Pingzhong Tang, and Song Zuo. Learning optimal strategies to commit to. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2149–2156, 2019.
  • Roth et al. [2016] Aaron Roth, Jonathan Ullman, and Zhiwei Steven Wu. Watch and learn: Optimizing from revealed preferences feedback. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 949–962, 2016.
  • Roughgarden [2010] Tim Roughgarden. Algorithmic game theory. Communications of the ACM, 53(7):78–86, 2010.
  • Shalev-Shwartz et al. [2016] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
  • Sherali et al. [1983] Hanif D Sherali, Allen L Soyster, and Frederic H Murphy. Stackelberg-nash-cournot equilibria: characterizations and computations. Operations Research, 31(2):253–276, 1983.
  • Song et al. [2021] Ziang Song, Song Mei, and Yu Bai. When can we learn general-sum markov games with a large number of players sample-efficiently? In International Conference on Learning Representations, 2021.
  • Stackelberg [1934] Heinrich von Stackelberg. Marktform und gleichgewicht. J. Springer, 1934.
  • Sun et al. [2023] Jingrui Sun, Hanxiao Wang, and Jiaqiang Wen. Zero-sum stackelberg stochastic linear-quadratic differential games. SIAM Journal on Control and Optimization, 61(1):252–284, 2023.
  • Tambe [2011] Milind Tambe. Security and game theory: algorithms, deployed systems, lessons learned. Cambridge university press, 2011.
  • Von Stengel and Zamir [2010] Bernhard Von Stengel and Shmuel Zamir. Leadership games with convex strategy sets. Games and Economic Behavior, 69(2):446–457, 2010.
  • Wei et al. [2021] Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, and Haipeng Luo. Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive markov games. In Conference on learning theory, pages 4259–4299, 2021.
  • Wu et al. [2022] Jibang Wu, Haifeng Xu, and Fan Yao. Multi-agent learning for iterative dominance elimination: Formal barriers and new algorithms. In Conference on Learning Theory, pages 543–543. PMLR, 2022.
  • Yu et al. [2022] Yaolong Yu, Haifeng Xu, and Haipeng Chen. Learning correlated stackelberg equilibrium in general-sum multi-leader-single-follower games. arXiv preprint arXiv:2210.12470, 2022.
  • Zhao et al. [2023] Geng Zhao, Banghua Zhu, Jiantao Jiao, and Michael Jordan. Online learning in Stackelberg games with an omniscient follower. In International Conference on Machine Learning, pages 42304–42316. PMLR, 2023.
  • Zheng et al. [2022] Stephan Zheng, Alexander Trott, Sunil Srinivasa, Nikhil Naik, Melvin Gruesbeck, David C Parkes, and Richard Socher. The AI economist: Improving equality and productivity with ai-driven tax policies. Science Advances, 2022.
  • Zhong et al. [2023] Han Zhong, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Can reinforcement learning find stackelberg-nash equilibria in general-sum markov games with myopically rational followers? Journal of Machine Learning Research, 24(35):1–52, 2023.


Appendix A Proofs for Section 4

A.1 Proof for Theorem 1

Leader / Follower b_{1} b_{2}
a_{1} (0.95, 0.3) (0.9, 0.2)
a_{2} (1, 0.8) (0, 0.79)
Table 2: Payoff matrix of a specific Stackelberg game.

We apply the UCB-UCB algorithm to the specific Stackelberg game depicted in Table 2, whose Stackelberg equilibrium is (a_{2},b_{1}). The UCB algorithms of the leader and the follower are defined respectively as:

\text{ucb}_{l,t}(a)=\hat{\mu}_{l,t}(a)+\sqrt{\frac{2\log(T/\delta)}{n_{t}(a)}},

and

\text{ucb}_{f,t}(a,b)=\hat{\mu}_{f,t}(a,b)+\sqrt{\frac{2\log(T/\delta)}{n_{t}(a,b)}}.

For simplicity of exposition, it is assumed that the players receive their actual (noiseless) rewards at each round. Since both of the leader's payoffs under a_{1} are at least 0.9 (Table 2), we have \text{ucb}_{l,t}(a_{1})>0.9. For a_{2} and the follower's actions b_{1}, b_{2}, the follower's UCB values are:

\text{ucb}_{f,t}(a_{2},b_{1})=\mu_{f}(a_{2},b_{1})+\sqrt{\frac{2\log(T/\delta)}{n_{t}(a_{2},b_{1})}},
\text{ucb}_{f,t}(a_{2},b_{2})=\mu_{f}(a_{2},b_{2})+\sqrt{\frac{2\log(T/\delta)}{n_{t}(a_{2},b_{2})}}.

Let n_{t}(a_{2},b_{1})=2k_{1}\log(T/\delta)+1 and n_{t}(a_{2},b_{2})=2k_{2}\log(T/\delta). The UCB mechanism implies that for the follower to prefer b_{1} over b_{2}, the following inequality must hold:

\mu_{f}(a_{2},b_{1})+\sqrt{\frac{2\log(T/\delta)}{n_{t}(a_{2},b_{1})-1}}>\mu_{f}(a_{2},b_{2})+\sqrt{\frac{2\log(T/\delta)}{n_{t}(a_{2},b_{2})}}.

Given \mu_{f}(a_{2},b_{1})-\mu_{f}(a_{2},b_{2})=0.01, it follows that:

k_{2}>\frac{1}{\left(0.01+\sqrt{\frac{1}{k_{1}}}\right)^{2}}.

Assuming k_{2}=10^{4} yields k_{1}>\frac{1}{4}k_{2}. If k_{1}+k_{2}=\frac{5}{4}\times 10^{4}, then k_{1}<4k_{2}. Given n_{t}(a_{2})=n_{t}(a_{2},b_{1})+n_{t}(a_{2},b_{2}) and assuming n_{t}(a_{2})=\frac{5}{4}\times 10^{4}\log(T/\delta), for sufficiently large T, the leader's UCB for a_{2} becomes:

\text{ucb}_{l,t}(a_{2})=\frac{n_{t}(a_{2},b_{1})\mu_{l}(a_{2},b_{1})+n_{t}(a_{2},b_{2})\mu_{l}(a_{2},b_{2})}{n_{t}(a_{2})}+\sqrt{\frac{2\log(T/\delta)}{n_{t}(a_{2})}}<0.9,

which leads to a contradiction. Thus, n_{t}(a_{2})<\frac{5}{4}\times 10^{4}\log(T/\delta)=\mathcal{O}(\log(T/\delta)), indicating that the leader will play the optimal action no more than \mathcal{O}(\log(T/\delta)) times. Consequently, for sufficiently large T, the leader is expected to incur regret linear in T.
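For illustration (not part of the formal argument), the following Python sketch simulates UCB-UCB on the game of Table 2 with noiseless rewards, mirroring the simplification above; the horizon, confidence level, and all variable names are our own illustrative choices.

import numpy as np

# Payoff matrices from Table 2: rows are leader actions a1, a2; columns are follower actions b1, b2.
mu_l = np.array([[0.95, 0.9], [1.0, 0.0]])   # leader rewards
mu_f = np.array([[0.3, 0.2], [0.8, 0.79]])   # follower rewards

T, delta = 100_000, 0.01
bonus = lambda n: np.sqrt(2.0 * np.log(T / delta) / np.maximum(n, 1))

n_l, s_l = np.zeros(2), np.zeros(2)            # leader counts n_t(a) and reward sums
n_f, s_f = np.zeros((2, 2)), np.zeros((2, 2))  # follower counts n_t(a, b) and reward sums

for t in range(T):
    # Leader picks the action with the largest UCB index (unvisited actions first).
    ucb_l = np.where(n_l == 0, np.inf, s_l / np.maximum(n_l, 1) + bonus(n_l))
    a = int(np.argmax(ucb_l))
    # Follower, having observed a, picks the action with the largest UCB index for that a.
    ucb_f = np.where(n_f[a] == 0, np.inf, s_f[a] / np.maximum(n_f[a], 1) + bonus(n_f[a]))
    b = int(np.argmax(ucb_f))
    # Noiseless feedback, as assumed in the argument above.
    n_l[a] += 1; s_l[a] += mu_l[a, b]
    n_f[a, b] += 1; s_f[a, b] += mu_f[a, b]

print("leader pull counts (a1, a2):", n_l)
print("fraction of rounds on the Stackelberg action a2:", n_l[1] / T)

With these (illustrative) parameters one can observe the leader quickly abandoning a_{2}, which is the behaviour the argument above formalizes.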

A.2 Proof for Theorem 2

Theorem (Last iterate convergence of EXP3-UCB under limited information).

In a limited information setting of a repeated general-sum Stackelberg game with noisy bandit feedback, applying EXP3-UCB with \alpha=\mathcal{O}\left(T^{-\frac{1}{3}}\right), \eta=\mathcal{O}\left(T^{-\frac{1}{3}}\right), for a\neq a_{se}, with probability at least 1-3\delta,

\mathbb{P}\big{[}a\neq a_{se}\big{]}\leq\alpha+A\exp\left(-\Delta_{2}T^{\frac{2}{3}}+2\sqrt{2A\log\frac{2}{\delta}}T^{\frac{1}{3}}+\frac{C_{1}AB}{\Delta_{1}^{2}}\log\frac{T}{\delta}\log\frac{1}{\delta}\right),

where C_{1} is a constant and \Delta_{2} is the leader's suboptimality reward gap to the Stackelberg equilibrium:

\Delta_{2}=\min_{a\neq a_{se}}\mu_{l}(a_{se},b_{se})-\mu_{l}(a,\mathcal{F}_{br}(a)).

We first prove two lemmas which will then be used to prove Theorem 2. We denote \tilde{r}_{t}(a)=\frac{1}{\tilde{x}_{t}(a)}r_{l,t}(a,\mathcal{F}_{t}(a))\mathbb{I}\{a_{t}=a\}. We also write r_{t}(a)=\mu_{l}(a,\mathcal{F}_{t}(a)) and \mu(a)=\mu_{l}(a,\mathcal{F}_{br}(a)) in this section when there is no ambiguity.
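For concreteness, the following is a minimal Python sketch of the leader-side EXP3 step that this notation describes; it assumes the sampling distribution is the uniform mixture \tilde{x}_{t}=(1-\alpha)x_{t}+\alpha/A (consistent with the lower bound \tilde{x}_{t}(a)\geq\alpha/A used below), and the follower's response and the reward draw are abstracted as callbacks, so all function and variable names are placeholders.

import numpy as np

def exp3_leader_step(y, alpha, eta, follower_response, reward_fn, rng):
    """One leader round of EXP3 with uniform exploration and importance weighting.

    y[a] stores the cumulative weighted estimates y_t(a) = sum_tau eta * r_tilde_tau(a).
    follower_response(a) returns the follower's action b (e.g., from its UCB rule).
    reward_fn(a, b) returns the sampled leader reward r_{l,t}(a, b) in [0, 1].
    """
    A = len(y)
    x = np.exp(y - y.max()); x /= x.sum()        # softmax weights x_t(a)
    x_tilde = (1 - alpha) * x + alpha / A        # exploration mixture, so x_tilde(a) >= alpha / A
    a = int(rng.choice(A, p=x_tilde))
    b = follower_response(a)
    r = reward_fn(a, b)
    r_tilde = np.zeros(A)
    r_tilde[a] = r / x_tilde[a]                  # importance-weighted estimate r_tilde_t(a)
    y += eta * r_tilde                           # y_{t+1}(a) = y_t(a) + eta * r_tilde_t(a)
    return a, b

Here rng is a numpy.random.Generator; this is only a sketch of the update rule analyzed in this section, not a full specification of EXP3-UCB.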

Lemma 1.

[Wu et al., 2022] For a\in\mathcal{A}, T>0, with probability at least 1-\delta,

|\sum_{t=1}^{T}\eta(\tilde{r}_{t}(a)-r_{t}(a))|<\sqrt{2\eta^{2}\frac{A}{\alpha}T\log\frac{2}{\delta}}.
Proof of Lemma 1.

Fix T>0 and any action a. Let \Xi_{t}=\eta(\tilde{r}_{t}(a)-r_{t}(a)) and S_{t}=\sum_{\tau=1}^{t}\Xi_{\tau}. Since T is fixed, \mathbb{E}[S_{t}] is bounded. Moreover, we have

\mathbb{E}[S_{t}|S_{t-1},\cdots,S_{1}]=S_{t-1}+\eta\cdot\mathbb{E}[\tilde{r}_{t}(a)-r_{t}(a)|\mathcal{G}_{t-1}]=S_{t-1}.

By definition, \{S_{t}\}_{t=1}^{T} is a martingale. Applying Azuma's inequality to \{S_{t}\}, for any x>0 and 1\leq t\leq T, we have

\mathbb{P}[|S_{t}|\geq x]\leq 2\exp{\Big{(}-\frac{x^{2}}{2W}\Big{)}}. (9)

Here W is an upper bound on \sum_{\tau=1}^{t}\mathbb{E}[\Xi_{\tau}^{2}|\mathcal{G}_{\tau-1}]. Note that

\mathbb{E}[\Xi_{t}^{2}|\mathcal{G}_{t-1}]=\eta^{2}\mathbb{E}\Big{[}\big{(}0-r_{t}(a)\big{)}^{2}\cdot(1-\tilde{x}_{t}(a))+\Big{(}\tilde{r}_{t}(a)-r_{t}(a)\Big{)}^{2}\cdot\tilde{x}_{t}(a)\Big{|}\mathcal{G}_{t-1}\Big{]}
=\eta^{2}\mathbb{E}\Big{[}r_{l,t}^{2}(a,\mathcal{F}_{t}(a))\cdot\frac{1}{\tilde{x}_{t}(a)}-r_{t}^{2}(a)\Big{|}\mathcal{G}_{t-1}\Big{]}
\leq\frac{A}{\alpha}\eta^{2}.

We take t=T and W=\frac{A}{\alpha}\eta^{2}T in Eq. (9), and set

x=\sqrt{2W\log\frac{2}{\delta}},

which implies that with probability at least 1-\delta,

|\sum_{t=1}^{T}\eta(\tilde{r}_{t}(a)-r_{t}(a))|<\sqrt{2\eta^{2}\frac{A}{\alpha}T\log\frac{2}{\delta}}. ∎
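As a quick numerical sanity check of Lemma 1 (not part of the proof), the sketch below simulates the martingale S_{T} under a fixed sampling distribution with \tilde{x}(a)\geq\alpha/A and a fixed reward, and compares the empirical deviations with the stated bound; the particular distribution and reward value are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
A, T = 4, 20_000
alpha, eta, delta = T ** (-1 / 3), T ** (-1 / 3), 0.05
x_tilde = np.full(A, 1.0 / A)             # any fixed distribution with x_tilde(a) >= alpha / A
r_fixed = 0.9                             # stands in for r_t(a) = mu_l(a, F_t(a)), held constant here
a_star = 0

deviations = []
for _ in range(200):                      # 200 independent realizations of S_T
    hits = rng.random(T) < x_tilde[a_star]              # indicator of {a_t = a_star} in each round
    r_tilde = np.where(hits, r_fixed / x_tilde[a_star], 0.0)
    deviations.append(abs(eta * np.sum(r_tilde - r_fixed)))

bound = np.sqrt(2 * eta**2 * (A / alpha) * T * np.log(2 / delta))
print(f"max observed deviation: {max(deviations):.3f}   Lemma 1 bound: {bound:.3f}")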

Lemma 2.

We set \delta=T^{-\beta} in \text{ucb}_{f,t}(\cdot,\cdot), with \beta\geq 3. Denote by T_{wbr}(a) the number of rounds in which the follower did not play the best response when the leader played a. With probability at least 1-\delta, for all a\in\mathcal{A}, we have

T_{wbr}(a)\leq\mathcal{O}\left(B\frac{\log(T/\delta)}{\Delta_{1}^{2}}\right).

Similarly, we denote by T_{wwr}(a) the number of rounds in which \text{wr}(a) is wrongly recognized in Algorithm 2, for every action a\in\mathcal{A}. With probability at least 1-\delta, for all a\in\mathcal{A}, we have

T_{wwr}(a)\leq\mathcal{O}\left(B\frac{\log(ABT/\delta)}{\Delta_{1}^{2}}\right).
Proof.

Combining Theorem 10.14 of Orabona [2019] with the Bernstein inequality yields the results. ∎

Proof of Theorem 2.

We denote \mu(a)=\mu_{l}(a,\mathcal{F}_{br}(a)) in this section when there is no ambiguity. For a\neq a_{se}, based on E.4 of Yu et al. [2022], we have

\sum_{t=1}^{T}r_{t}(a)-r_{t}(a_{se})=\sum_{t=1}^{T}\Big{(}r_{t}(a)-\mu(a)\Big{)}+\Big{(}\mu(a)-\mu(a_{se})\Big{)}+\Big{(}\mu(a_{se})-r_{t}(a_{se})\Big{)} (10)
\leq\sum_{t=1}^{T}\Big{(}r_{t}(a)-\mu(a)\Big{)}-\Delta_{2}T+\sum_{t=1}^{T}\Big{(}\mu(a_{se})-r_{t}(a_{se})\Big{)}
\leq 4s_{m}\sum_{a\in\mathcal{A}}T_{wbr}(a)-\Delta_{2}T,

where s_{m} denotes the largest interval between the i-th and the (i+1)-th time the leader chooses action a, over any i\in[n(a)-1]. With probability at least 1-\delta, we have

s_{m}\leq\mathcal{O}\left(\frac{A}{\alpha}\log\frac{1}{\delta}\right).

We denote y_{t}(a)=\sum_{\tau=1}^{t}\eta\tilde{r}_{\tau}(a). For a\neq a_{se}, combining Lemma 1 and Eq. (10), we have

y_{T}(a)-y_{T}(a_{se})=\sum_{t=1}^{T}\eta\left[\tilde{r}_{t}(a)-\tilde{r}_{t}(a_{se})\right]
=\sum_{t=1}^{T}\eta\left[\tilde{r}_{t}(a)-r_{t}(a)\right]+\sum_{t=1}^{T}\eta\left[r_{t}(a)-r_{t}(a_{se})\right]+\sum_{t=1}^{T}\eta\left[r_{t}(a_{se})-\tilde{r}_{t}(a_{se})\right]
\leq 2\sqrt{2\eta^{2}\frac{A}{\alpha}T\log\frac{2}{\delta}}+4\eta s_{m}\sum_{a\in\mathcal{A}}T_{wbr}(a)-\Delta_{2}\eta T.

For any a\neq a_{se}, we set \alpha=\mathcal{O}\left(T^{-\frac{1}{3}}\right), \eta=\mathcal{O}\left(T^{-\frac{1}{3}}\right); by the union bound, with probability at least 1-3\delta,

x_{T+1}(a)=\frac{\exp(y_{T}(a))}{\sum_{a^{\prime}\in\mathcal{A}}\exp(y_{T}(a^{\prime}))}\leq\frac{\exp(y_{T}(a))}{\exp(y_{T}(a_{se}))} (11)
\leq\exp\left(-\Delta_{2}T^{\frac{2}{3}}+2\sqrt{2A\log\frac{2}{\delta}}T^{\frac{1}{3}}+\frac{C_{1}AB}{\Delta_{1}^{2}}\log\frac{T}{\delta}\log\frac{1}{\delta}\right).

Note that

\mathbb{P}\big{[}a_{T}\neq a_{se}\big{]}=\sum_{a\neq a_{se}}\tilde{x}_{T}(a)\leq\alpha+\sum_{a\neq a_{se}}x_{T}(a).

Combining this with Eq. (11), we reach our conclusion.

Remark: Last-iterate convergence refers to the phenomenon where the sequence of actions converges to the optimal action, or the sequence of policies converges to the optimal policy, formulated as \lim_{t\to\infty}a_{t}=a_{\star}. It represents a particularly strong form of convergence and is a strictly stronger notion than average-iterate convergence [Wu et al., 2022, Abe et al., 2022, Cai et al., 2024]: an algorithm with last-iterate convergence is also capable of achieving average-iterate convergence, whereas the reverse is not necessarily true. Our findings corroborate this view, indicating that \lim_{t\to\infty}P(a_{t}\neq a_{\star})=0 leads to \lim_{t\to\infty}a_{t}=a_{\star}.

The formulations of the bounds in Theorems 2 and 3 are conventional in the learning theory community, and their difference is attributable to the divergent analytical approaches applied to EXP3- and UCB-type algorithms. Roughly speaking, Theorem 2 gives a decay rate of \mathcal{O}(T^{-1/3}+\exp(-T^{2/3})), contrasting with the \mathcal{O}(1/T) decay of Theorem 3. This suggests a comparative advantage for Theorem 3 once T exceeds a certain threshold. Nevertheless, since both bounds depend on the gaps \Delta_{1},\Delta_{2}, a thorough comparison must also take these gap values into account.
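To make this comparison concrete, the short sketch below evaluates the two decay rates with all gap-dependent constants deliberately dropped, so it only illustrates the dependence on T rather than the actual bounds.

import numpy as np

delta = 0.05
for T in [10**3, 10**4, 10**5, 10**6]:
    rate_thm2 = T ** (-1 / 3) + np.exp(-T ** (2 / 3))   # Theorem 2: O(T^{-1/3} + exp(-T^{2/3}))
    rate_thm3 = delta / T                               # Theorem 3: delta / T
    print(f"T = {T:>7}   Theorem 2 style: {rate_thm2:.2e}   Theorem 3 style: {rate_thm3:.2e}")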

A.3 Proof for Theorem 3

Theorem (Last iterate convergence of UCBE-UCB under limited information).

In a limited information setting of a repeated general-sum Stackelberg game with noisy bandit feedback, applying UCBE-UCB with S_{0}=\mathcal{O}\left(\frac{B}{\varepsilon^{3}}\log\frac{ABT}{\delta}\right) and \varepsilon=\min\{\Delta_{1},\Delta_{2}\}, for T\geq\mathcal{O}\left(\frac{AS_{0}}{\Delta_{1}^{2}}\right), with probability at least 1-3\delta, we have:

\mathbb{P}\big{(}a_{T}\neq a_{se}\big{)}\leq\frac{\delta}{T}.
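As an illustration of how the exploration budget and the burn-in horizon in the statement scale, one can plug example values into these expressions, treating each \mathcal{O}(\cdot) as if its constant were 1 (which is of course only an assumption) and taking \Delta_{1}=\varepsilon for simplicity:

import numpy as np

A, B = 5, 5                      # illustrative action-set sizes
eps, delta, T = 0.1, 0.05, 10**6
S0 = (B / eps**3) * np.log(A * B * T / delta)   # S0 = O((B / eps^3) log(ABT / delta)), constant taken as 1
Delta1 = eps                                     # illustrative choice Delta_1 = eps
T_min = A * S0 / Delta1**2                       # burn-in requirement T >= O(A S0 / Delta_1^2)
print(f"S0 ≈ {S0:.0f}, required burn-in T ≳ {T_min:.2e}")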

Lemma 3.

Let U_{l}(a,b)=\hat{\mu}(a,b)+\sqrt{\frac{S_{1}}{n_{t}(a,b)}} with n(a,b)\geq 1, and set S_{1}=2\log\frac{ABT}{\delta}. When n(a,b)\geq\mathcal{O}\left(\frac{\log(ABT/\delta)}{\varepsilon^{2}}\right), with probability at least 1-\frac{\delta}{ABT}, we have

U_{l}(a,b)\in\left[\mu(a,b)-\frac{\varepsilon}{4},\mu(a,b)+\frac{\varepsilon}{4}\right].

A similar result holds for the follower.

Proof of Lemma 3.

(1) When n(a,b)\geq\mathcal{O}\left(\frac{\log(ABT/\delta)}{\varepsilon^{2}}\right), by the Hoeffding inequality, with probability at least 1-\frac{\delta}{ABT}, we have

\hat{\mu}(a,b)\in\left[\mu(a,b)-\frac{\varepsilon}{8},\mu(a,b)+\frac{\varepsilon}{8}\right].

(2) When n(a,b)\geq\mathcal{O}\left(\frac{S_{1}}{\varepsilon^{2}}\right),

\sqrt{\frac{S_{1}}{n_{t}(a,b)}}\leq\frac{\varepsilon}{8}.

So combining (1) and (2), when n(a,b)\geq\mathcal{O}\left(\max\left\{\frac{\log(ABT/\delta)}{\varepsilon^{2}},\frac{S_{1}}{\varepsilon^{2}}\right\}\right), with probability at least 1-\frac{\delta}{ABT}, we have

U_{l}(a,b)\in\left[\mu(a,b)-\frac{\varepsilon}{4},\mu(a,b)+\frac{\varepsilon}{4}\right].

The bound for the UCB term of the follower is completely analogous and thus omitted. ∎

Proof of Theorem 3.

In what follows, we only consider rounds in which the leader plays a\in\mathcal{A}. With a slight abuse of notation, t refers to the t-th time that the leader plays action a. We denote \text{ucb}_{l}(a)=\hat{\mu}(a)+\sqrt{\frac{S_{0}}{n(a)}} and omit the subscript t when there is no ambiguity.

Notice that when n(a)<\frac{S_{0}}{4}, \text{ucb}_{l}(a)>2, and when n(a)\geq S_{0}, \text{ucb}_{l}(a)\leq 2. So when the total number of rounds T is large enough (T\geq\mathcal{O}\left(AS_{0}\right)), the leader will play each action a for at least \frac{S_{0}}{4} rounds.

We next bound the difference between the cumulative reward the leader receives and the cumulative reward under best responses:

\left|\sum_{t=1}^{n(a)}r_{l,t}(a,b_{t})-\mu_{l}(a,\mathcal{F}_{br}(a))\right|
=\left|\sum_{t=1}^{n(a)}\big{(}r_{l,t}(a,b_{t})-\mu_{l}(a,\mathcal{F}_{br}(a))\big{)}\mathbb{I}\left\{b_{t}=\mathcal{F}_{br}(a)\right\}+\sum_{t=1}^{n(a)}\big{(}r_{l,t}(a,b_{t})-\mu_{l}(a,\mathcal{F}_{br}(a))\big{)}\mathbb{I}\left\{b_{t}\neq\mathcal{F}_{br}(a)\right\}\right|
\leq\left|\sum_{t=1}^{n(a)}\big{(}r_{l,t}\left(a,\mathcal{F}_{br}\left(a\right)\right)-\mu_{l}\left(a,\mathcal{F}_{br}(a)\right)\big{)}\mathbb{I}\left\{b_{t}=\mathcal{F}_{br}(a)\right\}\right|+T_{wbr}(a).

And when n(a)-T_{wbr}(a)\geq\mathcal{O}\left(\frac{\log(AB/\delta)}{\varepsilon^{2}}\right), by Lemma 3, with probability at least 1-\frac{\delta}{ABT},

\frac{1}{n(a)}\left|\sum_{t=1}^{n(a)}\big{(}r_{l,t}\left(a,\mathcal{F}_{br}\left(a\right)\right)-\mu_{l}\left(a,\mathcal{F}_{br}(a)\right)\big{)}\mathbb{I}\left\{b_{t}=\mathcal{F}_{br}(a)\right\}\right|\leq\frac{\varepsilon}{8}.

So when n(a)\geq T_{wbr}(a)+\mathcal{O}\left(\frac{\log\frac{AB}{\delta}}{\varepsilon^{2}}\right) and n(a)\geq\frac{8}{\varepsilon}T_{wbr}(a), with probability at least 1-\frac{\delta}{ABT}, we have

\frac{1}{n(a)}\left|\sum_{t=1}^{n(a)}r_{l,t}(a,b_{t})-\mu_{l}(a,\mathcal{F}_{br}(a))\right|\leq\frac{\varepsilon}{8}+\frac{1}{n(a)}T_{wbr}(a)\leq\frac{\varepsilon}{8}+\frac{\varepsilon}{8}=\frac{\varepsilon}{4}.

We set S_{0}\geq\mathcal{O}\left(\frac{B}{\varepsilon^{3}}\log\frac{ABT}{\delta}\right), which guarantees n(a)\geq\mathcal{O}\left(\frac{B}{\varepsilon^{3}}\log\frac{ABT}{\delta}\right), and by the above analysis the following inequality holds:

\left|\hat{\mu}_{l}(a)-\mu_{l}(a,\mathcal{F}_{br}(a))\right|\leq\frac{\varepsilon}{4}.

When n(a)\geq\mathcal{O}\left(\frac{S_{0}}{\varepsilon^{2}}\right), \sqrt{\frac{S_{0}}{n(a)}}\leq\frac{\varepsilon}{8}, and then with probability at least 1-\frac{\delta}{T}, for all a\in\mathcal{A}, we have

\left|\text{ucb}_{l}(a)-\mu_{l}(a,\mathcal{F}_{br}(a))\right|\leq\frac{3\varepsilon}{8}<\frac{\varepsilon}{2}.

So \operatorname*{arg\,max}_{a\in\mathcal{A}}\text{ucb}_{l}(a)=a_{se}. Then by the union bound, with probability at least 1-3\delta, we have

\mathbb{P}\big{[}a\neq a_{se}\big{]}\leq\frac{\delta}{T}.
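The following Python sketch pairs the leader's UCBE index \hat{\mu}(a)+\sqrt{S_{0}/n(a)} used in this proof with a standard UCB follower, on an illustrative game with large gaps; the payoff matrices, noise level, horizon, and S_{0} are placeholders rather than the constants prescribed by the theorem.

import numpy as np

def ucbe_ucb(mu_l, mu_f, T, S0, delta, rng, noise=0.05):
    """Leader: UCBE index mu_hat(a) + sqrt(S0 / n(a)). Follower: UCB over b given the observed a."""
    A, B = mu_l.shape
    n_l, s_l = np.zeros(A), np.zeros(A)
    n_f, s_f = np.zeros((A, B)), np.zeros((A, B))
    for t in range(T):
        idx_l = np.where(n_l == 0, np.inf,
                         s_l / np.maximum(n_l, 1) + np.sqrt(S0 / np.maximum(n_l, 1)))
        a = int(np.argmax(idx_l))
        idx_f = np.where(n_f[a] == 0, np.inf,
                         s_f[a] / np.maximum(n_f[a], 1)
                         + np.sqrt(2 * np.log(T / delta) / np.maximum(n_f[a], 1)))
        b = int(np.argmax(idx_f))
        n_l[a] += 1; s_l[a] += mu_l[a, b] + noise * rng.standard_normal()     # noisy bandit feedback
        n_f[a, b] += 1; s_f[a, b] += mu_f[a, b] + noise * rng.standard_normal()
    return n_l

# Illustrative game: the Stackelberg equilibrium is (a2, b1), and all gaps are large.
mu_l = np.array([[0.6, 0.4], [0.9, 0.1]])
mu_f = np.array([[0.2, 0.8], [0.7, 0.3]])
counts = ucbe_ucb(mu_l, mu_f, T=50_000, S0=2_000.0, delta=0.01, rng=np.random.default_rng(1))
print("leader pull counts:", counts)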

Appendix B Proofs for Section 6

B.1 Proof for Theorem 4

Theorem (Last iterate convergence of EXP3-FMUCB with noisy side information).

When the leader uses EXP3 with parameters \alpha=\mathcal{O}\left(T^{-\frac{1}{3}}\right), \eta=\mathcal{O}\left(T^{-\frac{1}{3}}\right) and the follower uses FMUCB, let T_{f,w} be the total number of rounds in which the follower did not play the best manipulation strategy in T rounds. Then with probability at least 1-3\delta,

T_{f,w}\leq\mathcal{O}\left(\frac{A^{2}B}{\alpha\varepsilon^{2}}\log\frac{ABT}{\delta}\right),\text{ and}
\mathbb{P}\big{[}a\neq a_{fm}\big{]}\leq\alpha+A\exp\left(-\Delta_{3}T^{\frac{2}{3}}+2\sqrt{2A\log\frac{2}{\delta}}T^{\frac{1}{3}}+\frac{C_{2}A^{2}B}{\varepsilon^{2}}\log\frac{ABT}{\delta}\log\frac{1}{\delta}\right),

where \varepsilon=\min\{\Delta_{4},\Delta_{5},\Delta_{6}\} and C_{2} is a constant.


Proof of Theorem 4.

We study the cases in which the follower fails to perform the manipulation when the leader plays a; there are two such cases, discussed below. We omit the subscript t when there is no ambiguity.

Wrong recognition of \text{wr}(a)

We denote by t_{1} the number of rounds in which the follower has not correctly identified \text{wr}(a) for all a\in\mathcal{A}. By Lemma 2, we have

t_{1}\leq\sum_{a\in\mathcal{A}}T_{wwr}(a)\leq\mathcal{O}\left(AB\frac{\log(ABT/\delta)}{\Delta_{1}^{2}}+AB\log\left(\frac{AB}{\delta}\right)\right).

Wrong recognition of the manipulation when \text{wr}(a) is correct for all a\in\mathcal{A}

We next study the wrong manipulation rounds when \text{wr}(a) is correct for all a\neq a_{se}. For any action pair (a,b)\in\mathcal{A}\times\mathcal{B}, we consider whether it will be selected as the best manipulation pair or, when it is not the best manipulation pair, under what condition it will be eliminated by Algorithm 2. We distinguish two situations:

(i) \mu_{f}(a,b)\geq\mu_{f}(a_{fm},b_{fm}), but for \hat{a}=\operatorname*{arg\,min}_{\hat{a}\neq a}\mu_{l}(\hat{a},\text{wr}(\hat{a})) we have \mu_{l}(\hat{a},\text{wr}(\hat{a}))\geq\mu_{l}(a,b). As a reminder, \mu_{l}(\hat{a},\text{wr}(\hat{a}))-\mu_{l}(a,b)=\Delta_{6}\geq\varepsilon. When n(a,b)>t_{2}=\mathcal{O}\left(\frac{\log(ABT/\delta)}{\varepsilon^{2}}\right) and n(\hat{a},\text{wr}(\hat{a}))>t_{2}, by Lemma 3, the following inequality holds:

\hat{\mu}_{l,t}(a,b)\leq\mu_{l}(a,b)+\frac{\varepsilon}{8}<\mu_{l}(\hat{a},\text{wr}(\hat{a}))-\frac{\varepsilon}{4}\leq U_{l,t}(\hat{a},\text{wr}(\hat{a})),

which means that (a,b) will be eliminated by Algorithm 2 through Line 4. So at most 2t_{2} extra rounds are needed before the wrong manipulation pair (a,b) is eliminated.

(ii) \mu_{f}(a,b)<\mu_{f}(a_{fm},b_{fm}). When n(a,b)>t_{3}=\mathcal{O}\left(\max\left\{\frac{\log(ABT/\delta)}{\varepsilon^{2}},\frac{S_{1}}{\varepsilon^{2}}\right\}\right), by Lemma 3, we have

U_{f}(a,b)\leq\mu_{f}(a,b)+\frac{\varepsilon}{4}<\mu_{f}(a_{fm},b_{fm})<U_{f}(a_{fm},b_{fm}).

So the number of extra wrong manipulation rounds for (a,b) is at most t_{3}.

When the leader plays more than \mathcal{O}\left(\frac{A\log(1/\delta)}{\alpha}t_{0}\right) rounds, the uniform exploration parameter \alpha guarantees that, with probability at least 1-\delta, n(a)\geq t_{0}. So by the union bound, with probability at least 1-3\delta,

T_{f,w}\leq\mathcal{O}\left(\frac{A}{\alpha}\log\frac{1}{\delta}\left(t_{1}+ABt_{2}+ABt_{3}\right)\right)=\mathcal{O}\left(\frac{A^{2}B}{\alpha\varepsilon^{2}}\log\frac{ABT}{\delta}\log\frac{1}{\delta}\right).

We denote \mu_{m}(a)=\mu_{l}(a,\mathcal{F}_{fm}(a)) in this section when there is no ambiguity. For a\neq a_{fm}, we have

\sum_{t=1}^{T}r_{t}(a)-r_{t}(a_{fm})=\sum_{t=1}^{T}\Big{(}r_{t}(a)-\mu_{m}(a)\Big{)}+\Big{(}\mu_{m}(a)-\mu_{m}(a_{fm})\Big{)}+\Big{(}\mu_{m}(a_{fm})-r_{t}(a_{fm})\Big{)}
\leq\sum_{t=1}^{T}\Big{(}r_{t}(a)-\mu_{m}(a)\Big{)}-\Delta_{3}T+\sum_{t=1}^{T}\Big{(}\mu_{m}(a_{fm})-r_{t}(a_{fm})\Big{)}
\leq\mathcal{O}\left(\frac{A\log(1/\delta)}{\alpha}t_{1}\right)-\Delta_{3}T+2s_{m}B(t_{2}+t_{3}).

Analogous to the proof of Theorem 2 in Appendix A.2, we can complete the proof of Theorem 4. ∎

B.2 Proof for Theorem 5

Theorem (Last iterate convergence of UCBE-FMUCB with noisy side information).

When the leader uses UCBE with S_{0}=\mathcal{O}\left(\frac{B}{\varepsilon^{3}}\log\frac{ABT}{\delta}\right) and \varepsilon=\min\{\Delta_{4},\Delta_{5},\Delta_{6}\}, and the follower uses FMUCB, let T_{f,w} be the total number of rounds in which the follower did not play the best manipulation strategy in T rounds. Then with probability at least 1-3\delta,

T_{f,w}\leq\mathcal{O}\left(\frac{AB}{\varepsilon^{2}}\log\frac{ABT}{\delta}\right),

and for T\geq\mathcal{O}\left(\frac{AS_{0}}{\Delta_{3}^{2}}\right), \mathbb{P}\big{[}a_{T}\neq a_{fm}\big{]}\leq\frac{\delta}{T}.


Proof of Theorem 5.

By the analysis in Appendix A.3, the leader who uses UCBE performs pure exploration of each a\in\mathcal{A} for \mathcal{O}\left(\frac{B}{\varepsilon^{3}}\log\frac{ABT}{\delta}\right) rounds, where \varepsilon=\min\{\Delta_{4},\Delta_{5},\Delta_{6}\}. So by the union bound, with probability at least 1-3\delta,

T_{f,w}\leq\mathcal{O}\left(t_{1}+ABt_{2}+ABt_{3}\right)=\mathcal{O}\left(\frac{AB}{\varepsilon^{2}}\log\frac{ABT}{\delta}\right).

Similar to the proof of Theorem 3, we bound the difference between the cumulative reward the leader receives and the cumulative reward under the best manipulation strategy:

\left|\sum_{t=1}^{n(a)}r_{l,t}(a,b_{t})-\mu_{l}(a,\mathcal{F}_{fm}(a))\right|
\leq\left|\sum_{t=1}^{n(a)}\left(r_{l,t}\left(a,\mathcal{F}_{fm}(a)\right)-\mu_{l}\left(a,\mathcal{F}_{fm}(a)\right)\right)\mathbb{I}\left\{b_{t}=\mathcal{F}_{fm}(a)\right\}\right|+\sum_{t=1}^{n(a)}\mathbb{I}\left\{b_{t}\neq\mathcal{F}_{fm}(a)\right\}.

Based on the proof of Theorem 4 in Appendix B.1 and Lemma 2, we have

\sum_{t=1}^{n(a)}\mathbb{I}\left\{b_{t}\neq\mathcal{F}_{fm}(a)\right\}\leq T_{wwr}(a)+Bt_{2}+Bt_{3}=\mathcal{O}\left(\frac{B}{\varepsilon^{2}}\log\frac{ABT}{\delta}\right).

Analogous to the proof of Theorem 3 in Appendix A.3, we can complete the proof of Theorem 5. ∎

Appendix C Follower’s pessimistic manipulation

We denote \mathbf{Q}(\mathcal{F})=\operatorname*{arg\,max}_{a\in\mathcal{A}}\mu_{l}(a,\mathcal{F}(a)). Because several actions in \mathcal{A} may attain the same largest value of \mu_{l}(a,\mathcal{F}(a)), \mathbf{Q}(\mathcal{F}) may contain more than one element. From the follower's perspective, a pessimistic view is that the leader breaks ties using a=\operatorname*{arg\,min}_{a\in\mathbf{Q}(\mathcal{F})}\mu_{f}(a,\mathcal{F}(a)). Assuming pessimistic tie-breaking is reasonable when the leader uses a no-regret learning algorithm, since such a leader may not stably play a single action in \mathbf{Q}(\mathcal{F}). It is therefore important for the follower to choose a manipulation strategy that maximizes its own reward in the long run, taking the leader's tendency into account. When the follower assumes that the leader breaks ties pessimistically, the follower's optimization problem is

\max_{\mathcal{F}}\min_{(a^{\prime},b^{\prime})\in\mathcal{A}\times\mathcal{B}}\mu_{f}(a^{\prime},b^{\prime}) (12)
s.t. a^{\prime}\in\operatorname*{arg\,max}_{a\in\mathcal{A}}\mu_{l}(a,\mathcal{F}(a))
     b^{\prime}=\mathcal{F}(a^{\prime})

We present Algorithm 3 for solving optimization problem (12); it is similar to Algorithm 1.

Input: Candidate set \mathcal{K}=\mathcal{A}\times\mathcal{B}, \mathbf{H}=\{\}, \mu_{f}(a_{p},b_{p}) initialized to -\infty, \mu_{l}(\cdot,\cdot), \mu_{f}(\cdot,\cdot)
1:  Candidate manipulation pair (a^{\prime},b^{\prime})=\operatorname*{arg\,max}_{(a,b)\in\mathcal{K}}\mu_{f}(a,b)
2:  if \mu_{f}(a_{p},b_{p})<\mu_{f}(a^{\prime},b^{\prime}) then
3:     \mathcal{F}=\{\mathcal{F}(a^{\prime})=b^{\prime}\}\cup\{\mathcal{F}(a)=\operatorname*{arg\,min}_{b\in\mathcal{B}}\mu_{l}(a,b):a\neq a^{\prime}\}
4:     Let \mathbf{Q}=\operatorname*{arg\,max}_{a\in\mathcal{A}}\mu_{l}(a,\mathcal{F}(a)), viewed as the set of pairs (a,\mathcal{F}(a))
5:     if (a^{\prime},b^{\prime})\in\mathbf{Q} then
6:        Let (a^{\prime\prime},b^{\prime\prime})=\operatorname*{arg\,min}_{(a,b)\in\mathbf{Q}}\mu_{f}(a,b), denote the current \mathcal{F} as \mathcal{F}_{a^{\prime\prime},b^{\prime\prime}}, update \mathbf{H}\leftarrow\mathbf{H}\cup\{(a^{\prime\prime},b^{\prime\prime})\} and (a_{p},b_{p})=\operatorname*{arg\,max}_{(a,b)\in\mathbf{H}}\mu_{f}(a,b)
7:     end if
8:     Eliminate (a^{\prime},b^{\prime}) from the candidate set: \mathcal{K}\leftarrow\mathcal{K}\backslash(a^{\prime},b^{\prime}) and return to Line 1
9:  else
10:     \mathcal{F}_{opt}=\mathcal{F}_{a_{p},b_{p}}
11:  end if
Output: The response function \mathcal{F}_{opt}
Algorithm 3 Follower's best manipulation towards a pessimistic leader
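Below is a minimal Python sketch of Algorithm 3, assuming exact knowledge of \mu_{l} and \mu_{f} as in the algorithm's input; the data structures (a dictionary for \mathcal{F}, tuples for action pairs) and the handling of ties via np.isclose are our own choices.

import numpy as np

def best_manipulation_pessimistic(mu_l, mu_f):
    """Sketch of Algorithm 3: follower's best manipulation against a pessimistically tie-breaking leader."""
    A, B = mu_l.shape
    K = {(a, b) for a in range(A) for b in range(B)}     # candidate set K = A x B
    H = []                                               # stored (pessimistic value, pair, F) triples
    best_val, best_F = -np.inf, None                     # play the roles of mu_f(a_p, b_p) and F_{a_p, b_p}
    while K:
        a1, b1 = max(K, key=lambda ab: mu_f[ab])         # Line 1: candidate pair maximizing follower reward
        if not (best_val < mu_f[a1, b1]):                # Line 2 fails: no remaining candidate can improve
            break
        # Line 3: respond with b1 to a1, and with the leader's worst response everywhere else.
        F = {a: int(np.argmin(mu_l[a])) for a in range(A)}
        F[a1] = b1
        induced = np.array([mu_l[a, F[a]] for a in range(A)])
        Q = {(a, F[a]) for a in range(A) if np.isclose(induced[a], induced.max())}   # Line 4
        if (a1, b1) in Q:                                # Line 5: (a1, b1) is inducible
            worst = min(Q, key=lambda ab: mu_f[ab])      # Line 6: pessimistic tie-breaking within Q
            H.append((mu_f[worst], worst, F))
            best_val, _, best_F = max(H, key=lambda item: item[0])
        K.discard((a1, b1))                              # Line 8: eliminate the candidate and continue
    return best_F                                        # the response function F_opt

# Example usage on the payoffs of Table 2.
mu_l = np.array([[0.95, 0.9], [1.0, 0.0]])
mu_f = np.array([[0.3, 0.2], [0.8, 0.79]])
print(best_manipulation_pessimistic(mu_l, mu_f))   # e.g. {0: 1, 1: 0}, i.e. F(a1) = b2, F(a2) = b1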