
Repeated Contracting with Multiple Non-Myopic Agents:
Policy Regret and Limited Liability

Natalie Collina, Varun Gupta, and Aaron Roth
Department of Computer and Information Sciences, University of Pennsylvania

We study a repeated contracting setting in which a Principal adaptively chooses amongst $k$ Agents at each of $T$ rounds. The Agents are non-myopic, and so a mechanism for the Principal induces a $T$-round extensive form game amongst the Agents. We give several results aimed at understanding an under-explored aspect of contract theory — the game induced when choosing an Agent to contract with. First, we show that this game admits a pure-strategy non-responsive equilibrium amongst the Agents — informally, an equilibrium in which the Agents’ actions depend on the history of realized states of nature, but not on the history of each other’s actions, and so avoids the complexities of collusion and threats. Next, we show that if the Principal selects Agents using a monotone bandit algorithm, then for any concave contract, in any such equilibrium, the Principal obtains no regret to contracting with the best Agent in hindsight — not just given their realized actions, but also to the counterfactual world in which they had offered a guaranteed $T$-round contract to the best Agent in hindsight, which would have induced a different sequence of actions. Finally, we show that if the Principal selects Agents using a monotone bandit algorithm which guarantees no swap-regret, then the Principal can additionally offer only limited liability contracts (in which the Agent never needs to pay the Principal) while getting no-regret to the counterfactual world in which she offered a linear contract to the best Agent in hindsight — despite the fact that linear contracts are not limited liability. We instantiate this theorem by demonstrating the existence of a monotone no swap-regret bandit algorithm, which to our knowledge has not previously appeared in the literature.

1 Introduction

Principal-Agent problems arise when a Principal seeks to hire a third party (an Agent) to perform some costly task on their behalf. Typically the action of the Agent (e.g. the effort that they put in) is unobservable to the Principal — all that the Principal observes is the resulting outcome, which is a function both of the Agent’s action and some unknown state of nature. The design problem for the Principal is to find a contract (a mapping from outcomes to payments to the Agent) that maximizes their expected utility less the payment, when the Agent best responds so as to maximize their expected payment, less their cost. Classical contract theory concerns itself with the design of the optimal contract for the Principal, in a static setting with a single Agent, in which the Principal and the Agent share a common prior belief about the distribution over states of nature. This standard setting is well understood, and linear contracts (which pay the Agent a constant fraction of the Principal’s realized value) turn out to be robustly optimal in several different senses (Carroll, 2015; Dütting et al., 2019, 2022; Castiglioni et al., 2021).

Of course, each of the standard assumptions can be subject to challenge. For example, how does the Principal choose the Agent to hire? In many markets there is an aspect of competition between Agents that is unmodelled in a standard Principal/Agent setting. What if the distribution over states of nature deviates from the expectations of the Principal and Agent—i.e. what if their beliefs are mis-specified? In this case, what guarantees can be given to the Principal? And finally, many contracts (like linear contracts) can require negative payments to the Agent in certain realizations — i.e. they can obligate the Agent to pay the Principal, which is generally unimplementable in practice. Contracts that never obligate the Agent to pay the Principal are said to satisfy the limited liability condition, and are more realistic, but complicate the design of optimal contracts. These are the problems that we focus on in this paper.

As a running example, consider a college endowment that wants to hire a fund manager to invest its capital. The goal of the endowment is to maximize the return on its investment (minus the fees it pays). There are $k$ fund managers that the endowment can choose between, each of whom must be paid a fee, as well as a costless outside option (say, investing the money in an index fund). A linear contract in this setting pays the Agent a fixed percentage (say 10%) of its excess performance above that of the index fund. Since this might be negative (if the Agent does not outperform the index fund), a linear contract is not limited liability.
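
Concretely, if $r$ denotes the fund's return in excess of the index fund, the 10% linear contract of the running example pays

$$v(r)=0.1\cdot r,$$

which is negative whenever $r<0$ (that is, whenever the manager underperforms the index), so it obligates the Agent to pay the Principal and is not limited liability.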

In our running example, the problem that the Principal needs to solve in the most complex setting considered in our paper is to choose a fund manager amongst the $k$ candidates, or the outside option, in each of $T$ sequential rounds. In each round, the Principal offers the chosen fund manager a contract, which must be limited liability. In fixing a method for doing this, the Principal is inducing a $T$-round extensive form game amongst the $k$ fund managers, who are not myopic, and who will play in an equilibrium of this game, given their beliefs about the distribution on the sequence of states of nature to be realized. We wish to guarantee to the Principal that in the equilibrium of this induced game, their cumulative utility is at least what it would have been had they selected the best of the fund managers in hindsight and offered them a (non-limited-liability) linear contract, guaranteed for all $T$ rounds, no matter what the sequence of states of nature turns out to be (i.e. even if the beliefs of the fund managers turn out to be incorrect). Note that what we are aiming for is a form of policy regret: in the actual play-out, the fund managers are playing an equilibrium strategy in a $k$-player extensive form game, but in the counterfactual that we use as our benchmark, the fund managers are guaranteed a fixed contract over all $T$ rounds, and so are simply best responding given their beliefs.

We now lay out our results in a bit more detail.

We study a repeated Principal/Agent problem over $T$ rounds. We model the action space for the Agent as being the set of real valued effort levels $a\in[0,1]$, associated with a convex increasing cost function $c(a)$. In each round $t$:

  1. The Principal chooses amongst a set of $k$ Agents as well as a costless outside option. If the Principal chooses one of the $k$ Agents, he offers them a concave contract $v^{t}:\mathbb{R}\rightarrow\mathbb{R}$ (a mapping from Principal payoffs to Agent payments) for that round.

  2. The chosen Agent then chooses an action $a^{t}$. Then, a state of nature $y^{t}$ is realized according to an arbitrary (possibly adversarial) process, which results in payoff $r(a^{t},y^{t})$ for the Principal and a payment $v^{t}(r(a^{t},y^{t}))$ for the Agent. In that round, the Principal gets utility $r(a^{t},y^{t})-v^{t}(r(a^{t},y^{t}))$ and the Agent gets utility $v^{t}(r(a^{t},y^{t}))-c(a^{t})$. We assume throughout that for each state of nature $y$, the Principal’s reward $r(a,y)$ is a non-decreasing linear function of the Agent’s effort level $a$ (and can depend on the state of nature $y$ in arbitrary ways).

At the end of each round, the Principal as well as the Agents observe $y^{t}$ and $r(a^{t},y^{t})$, which can be used to inform the Principal’s choice of Agents in the following rounds. Counterfactuals of the sort “what would the reward have been had the Principal chosen a different Agent” are not observed. A strategy for an Agent in this game is a mapping from transcripts (which record both the realized states of nature and the actions of the other players at all previous rounds) and current-round contracts to effort levels. The Agents each have (possibly mis-specified) beliefs over the distribution of the state of nature at each round (which can also be arbitrary functions of the transcript). An algorithm for the Principal maps their observed history to a (distribution over) Agents to choose at each round, as well as a contract to offer. Once the Principal fixes an algorithm, this induces a $T$-round extensive form game amongst the Agents. We first show the existence of a particularly simple kind of equilibrium in this game.

Non-Responsive Pure Strategy Equilibrium Existence

Extensive form games can have many equilibria that support a wide range of behavior through threats between players, which limits the predictive power of Nash equilibria on their own. We say that an equilibrium is non-responsive if each player’s strategy at each round depends on the realized sequence of states of nature, but not on the history of actions of the other players. In Section 3.2, we prove a structural result: the game we study always admits a non-responsive pure-strategy Nash equilibrium. With this existence result in hand, in all of our subsequent results, we will give guarantees to the Principal under the assumption that the Agents are playing a non-responsive pure-strategy equilibrium of the extensive form game induced by the Principal’s algorithm.

Regret to the Best Fixed Agent

For our first algorithmic result (in Section 4), we fix any contract $v(r)$ which is concave in $r$ (i.e. offers the Agent a payment that is linear or has diminishing marginal returns in the Principal’s reward), and give an extremely simple algorithm for the Principal. At each round $t$, the Principal selects an Agent using a monotone bandit algorithm with an external regret guarantee and offers contract $v$. The regret guarantee of the bandit algorithm immediately implies that the Principal has diminishing regret to the realized rewards of the best Agent in hindsight. However, these can differ from the counterfactual rewards of giving the best Agent in hindsight a fixed contract, because the realized rewards are generated by actions the Agent played in a strategic game. What we show is that in the game induced by this mechanism, the rewards realized by each Agent are only higher than what they would have been counterfactually, had that Agent been given a fixed contract $v$ for all $T$ rounds. Thus the Principal has no regret to our counterfactual benchmark as well. This result hinges on the monotonicity of the bandit algorithm, which means that unilaterally increasing the reward of a single Agent in a single round only increases the probability that this Agent is chosen in future rounds. Intuitively, the result is that in the competitive game induced by the mechanism, Agents are incentivized to spend more effort than they would have otherwise, because the payoff they get is not just their one-round payoff, but also their continuation payoff over future rounds, which is increasing in their current-round effort level. Monotone bandit algorithms with external regret guarantees are already known to exist (Babaioff et al., 2009).
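
As a rough illustration of this mechanism (a sketch of ours, not code from the paper), the Principal's side can be written as follows. Here the bandit object is a hypothetical interface (select/update) standing in for any monotone no-external-regret bandit algorithm, such as that of Babaioff et al. (2009), and agents[i].respond(v) is an assumed hook returning the realized return induced by the chosen Agent's response.

  # Sketch of the Section 4 mechanism: one fixed concave contract v,
  # Agents selected by a monotone no-external-regret bandit algorithm.
  # bandit.select() returns an arm in {0, ..., k}, where arm k is the outside
  # option; bandit.update(arm, reward) feeds the realized reward back.
  def run_principal(bandit, v, agents, T):
      total_utility = 0.0
      for t in range(T):
          i = bandit.select()
          if i == len(agents):          # outside option: zero reward, no payment
              bandit.update(i, 0.0)
              continue
          r = agents[i].respond(v)      # chosen Agent's effort induces a realized return r
          total_utility += r - v(r)     # Principal keeps the return minus the contract payment
          bandit.update(i, r)           # monotone selection depends only on realized returns
      return total_utility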

Swap Regret and Limited Liability Contracts

Our first result gives a mechanism that uses the same contract $v$ that its regret benchmark is defined with respect to. Thus if $v$ is not a limited liability contract, our mechanism also is not limited liability. This is problematic, because linear contracts, which are known to be robustly optimal in a variety of settings (Carroll, 2015; Dütting et al., 2019, 2022; Castiglioni et al., 2021) and so would make ideal benchmarks, are not limited liability. Our second result (Section 5) gives an incentive-preserving reduction to limited liability contracts for any linear contract $v$. We use a construction similar to the one Chassang (2013) used to give a limited-liability mechanism for a single Agent. Informally, each Agent can open a “tab” with our Principal, which records the cumulative payment owed from the Principal to the Agent after each round. The Principal does not pay Agents throughout, and instead pays each Agent in a lump sum at the end of the game according to their tab — except that if the final tab indicates a negative payment from the Principal to that Agent, the Agent’s debt is forgiven and the Principal pays them zero. A property of this setup is that if at the end of $T$ rounds, the linear contract $v$ would have resulted in a net transfer from the Principal to the Agent, the cumulative payments made under the tab are identical to what they would have been under the linear contract; if on the other hand the linear contract $v$ would have resulted in a net transfer from the Agent to the Principal, the cumulative payments made under the tab deviate from that of the linear contract. The tab enforces limited liability, but serves to limit the downside of the Agent, and so can be incentive-distorting. We show that if the Principal chooses Agents using an algorithm that has diminishing swap regret (to the realized rewards), then the cumulative payment to the Agent under the tab is guaranteed to be non-negative (up to diminishing regret terms). This means that (again, up to low order regret terms) the utility for the Agent of playing any strategy is the same as it would have been had the mechanism been offering the linear contract $v$ rather than its limited liability proxy. We show that if the bandit algorithm used by the Principal obtains no swap-regret and is monotone (once again in the sense that unilaterally increasing the reward of a single Agent in a single round only increases the probability of choosing that Agent in future rounds), then in equilibrium, the Agents are incentivized to expend more effort than they would have had they been offered a guaranteed linear contract for all $T$ rounds, and hence the no (swap) regret guarantee of the Principal’s mechanism over the realized rewards extends to a no (swap) regret guarantee to the counterfactual rewards that would have been obtained had the Principal picked the best fixed Agent in hindsight. We show that standard no swap-regret algorithms like Blum and Mansour (2007) are not monotone in this sense — but that the recent no swap-regret algorithms given by Dagan et al. (2023) and Peng and Rubinstein (2023) are. These algorithms operate in the full information setting, but we give a monotonicity preserving reduction to the bandit setting. Thus we can use these algorithms in our reduction to promise the Principal no counterfactual (swap) regret to choosing the best Agent in hindsight (and offering them a fixed linear contract), while needing to offer only limited liability contracts.
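
The tab bookkeeping itself is simple. The following sketch is our illustration (not code from the paper) of the settlement rule for a linear contract $v(r)=\alpha r$: all payments are deferred, and a negative final balance is forgiven rather than collected.

  # "Tab" settlement simulating a linear contract under limited liability:
  # defer all payments, forgive any negative final balance.
  def settle_tabs(alpha, selected, returns, k):
      # selected[t] is the Agent chosen in round t (None for the outside option);
      # returns[t] is the realized return in round t; alpha is the linear share.
      tabs = [0.0] * k
      for i, r in zip(selected, returns):
          if i is not None:
              tabs[i] += alpha * r                 # what the linear contract would owe this round
      return [max(tab, 0.0) for tab in tabs]       # limited liability: debts are forgiven

When an Agent's final tab is non-negative, their lump-sum payment coincides exactly with what the linear contract would have paid; the swap-regret condition is what guarantees this happens up to low-order regret terms.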

1.1 Related Work

The origins of contract theory date to Holmström (1979) and Grossman and Hart (1992). This literature is too large to survey — we refer the reader to Bolton and Dewatripont (2004) for a textbook introduction.

Our work fits into a recent literature studying learning problems in contract theory. Ho et al. (2014) study online contract design by approaching it as a bandit problem in which an unknown distribution over myopic Agents arrive and respond to an offered contract by optimizing their expected utility with respect to a common prior. Cohen et al. (2022) extend this to the case in which the Agent has bounded risk aversion. Zhu et al. (2022) revisit this problem and characterize the sample complexity of online contract design in general (with nearly matching upper and lower bounds) and for the special case of linear contracts (with exactly matching upper and lower bounds). Camara et al. (2020) and Collina et al. (2023) study the problem of repeatedly contracting with a single non-myopic Agent in a setting in which there are no common prior assumptions, and the single Agent is assumed to follow various learning-theoretic behavioural assumptions (that their empirical play history satisfies various strengthenings of swap regret conditions). We distinguish ourselves from this line of work by studying the game induced by competition amongst multiple non-myopic Agents.

Chassang (2013) studies a repeated interaction between a Principal and a long-lived Agent, with a focus on the limited liability problem. In the context of a single Agent and an outside option (and in a fully observable setting), Chassang (2013) gives a similar “debt scheme” to simulate the incentives of a linear contract using limited liability contracts. Our work extends Chassang (2013) in several ways: we study a competitive setting amongst many Agents, operate in a partially observable setting (we do not observe the outcome of Agents we did not select), and give counterfactual policy regret bounds (rather than regret bounds to the realized rewards).

Our limited liability results hold for linear contracts, which have become focal in the contract theory literature because of their strong robustness properties. Carroll (2015) shows that linear contracts are minimax optimal for a Principal who knows some but not all of the Agent’s actions. Dütting et al. (2019) shows that if the Principal only knows the costs and expected rewards for each Agent action, then linear contracts are minimax optimal over the set of all reward distributions with the given expectation. Dütting et al. (2022) extends this robustness result to a combinatorial setting. Dütting et al. (2019) also show that linear contracts are bounded approximations to optimal contracts, where the approximation factor can be bounded in terms of various quantities (e.g. the number of Agent actions, the ratio of the largest to smallest reward, or the ratio of the largest to smallest cost). Castiglioni et al. (2021) studies linear contracts in Bayesian settings (when the Principal knows a distribution over types from which the Agent’s type is drawn) and analyzes how well linear contracts can approximate optimal contracts.

Our results leverage the fact that in our mechanisms, the Principal is using a monotone selection algorithm. The same kind of monotonicity was used by Babaioff et al. (2009) in the context of using bandit algorithms to give truthful auctions. Babaioff et al. (2009) used a monotone algorithm for obtaining no external regret. We give a monotone algorithm for obtaining no swap-regret by giving a monotonicity preserving reduction to the full information swap regret algorithms recently given by Dagan et al. (2023); Peng and Rubinstein (2023), which we note are monotone. Prior swap regret algorithms (e.g. Blum and Mansour (2007)) even in the full information setting are (perhaps surprisingly) not monotone, as we show.

2 Preliminaries

We study a multi-round, multi-Agent contracting problem. To formalize the model, we first focus on what happens in a single round.

2.1 Single Round Contracting Model

In each round, the Principal would like to incentivize some Agent to complete a task. The Principal will offer a fixed contract to one of $k$ Agents, or pick the “outside option” of not contracting at all. The task has multiple possible outcomes, each of which corresponds to a different value for the Principal.

Definition 1 (Outcome $o$ and Return $r(o)$).

There are $m$ possible outcomes $o_{1},\ldots,o_{m}$, each of which is associated with a return for the Principal $r(o)\in[-1,1]$.

Definition 2 (Contract).

A contract is a payment rule $v:\mathbb{R}\rightarrow\mathbb{R}$ mapping the Principal’s return to a payment made to an Agent.

Throughout this paper, we will make the following natural assumption on contracts — that higher reward to the Principal corresponds to higher payment to the Agent, and that payments exhibit (weakly) decreasing marginal returns.

Assumption 1.

Contracts are concave and increasing functions.
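
For illustration, a linear contract and a capped linear contract (the cap $B>0$ here is just an illustrative parameter, not one used in our results) both satisfy Assumption 1:

$$v_{\mathrm{lin}}(r)=\alpha r,\qquad v_{\mathrm{cap}}(r)=\min\{\alpha r,\,B\},\qquad \alpha\in[0,1],$$

since each is (weakly) increasing and concave in $r$. By contrast, a contract that pays a discrete bonus only once the return crosses a threshold is not concave, and is excluded by this assumption.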

If selected by the Principal, an Agent responds to the contract with an action $a$ in action set $\mathcal{A}$.

Definition 3 (Agent Action and Cost Function).

The action space for each Agent, $\mathcal{A}=[0,1]$, is all “effort levels” that they can exert towards the Principal’s task. Each effort level has an associated cost, possibly distinct for each Agent, given by a cost function $c_{i}:\mathcal{A}\to\mathbb{R}$.

Throughout the paper, we will assume that Agent cost increases in effort and exhibits (weakly) increasing marginal cost to effort. Note that increasing marginal cost is consistent with an assumption of decreasing marginal return to effort.

Assumption 2.

Agent cost functions $c_{i}(\cdot)$ are increasing and convex in effort level.

There is also some hidden state of nature $y\in\mathcal{Y}$ in each round. The probability distribution over outcomes for an Agent is dictated by their chosen action and the state of nature.

Definition 4 (Outcome Distribution $\mathcal{D}_{i,y,a}$).

There are $k\cdot m$ functions $p_{i,o}:\mathcal{A}\times\mathcal{Y}\rightarrow[0,1]$, each associated with a particular Agent $i$ and outcome $o$, such that for all $y\in\mathcal{Y}$, $p_{i,o}(\cdot,y)$ is affine in its first argument. We will represent these functions as $p_{i,o}(a,y)=m_{i,o,y}\cdot a+b_{i,o,y}$. Furthermore, for all Agents $i$, effort levels $a$, and all states of nature $y$, we have that

$$\sum_{j=1}^{m}p_{i,o_{j}}(a,y)=1.$$

We assume that the probability that outcome $o$ occurs for Agent $i$ given effort level $a$ and state of nature $y$ is $p_{i,o}(a,y)$. By the constraints given above on the functions $p$, this is always a valid probability distribution. We will refer to this distribution as $\mathcal{D}_{i,y,a}$. Additionally, we will refer to the randomness used to sample from this distribution as $\mathcal{R}_{i}$, so that fixing $a,y$, and $\mathcal{R}_{i}$ also fixes the sampled outcome $o$. Let $\bar{r}(i,a,y)=\mathbb{E}_{o\sim\mathcal{D}_{i,y,a}}[r(o)]$ be the expected reward under the distribution of outcomes $\mathcal{D}_{i,y,a}$. Let $\bar{v}(i,a,y)=\mathbb{E}_{o\sim\mathcal{D}_{i,y,a}}[v(r(o))]$ be the expected payment to Agent $i$ over the same distribution of outcomes.
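
To make this parametrization concrete, the following sketch (ours, purely illustrative) evaluates $\mathcal{D}_{i,y,a}$, the expected reward $\bar{r}(i,a,y)$, and the expected payment $\bar{v}(i,a,y)$. Here m[i][j][y] and b[i][j][y] stand for the coefficients $m_{i,o_{j},y}$ and $b_{i,o_{j},y}$ of Definition 4, with states of nature indexed by integers, and ret[j] is the return $r(o_{j})$.

  import random

  # Definition 4 in code: outcome probabilities are affine in the effort level a.
  def outcome_distribution(m, b, i, y, a):
      probs = [m[i][j][y] * a + b[i][j][y] for j in range(len(m[i]))]
      assert abs(sum(probs) - 1.0) < 1e-9      # Definition 4: probabilities sum to one
      return probs

  def expected_reward_and_payment(m, b, ret, v, i, y, a):
      probs = outcome_distribution(m, b, i, y, a)
      r_bar = sum(p * ret[j] for j, p in enumerate(probs))       # expected reward \bar{r}(i, a, y)
      v_bar = sum(p * v(ret[j]) for j, p in enumerate(probs))    # expected payment \bar{v}(i, a, y)
      return r_bar, v_bar

  def sample_outcome(m, b, i, y, a, rng=random):
      # Draw an outcome index from D_{i,y,a} using the affine probabilities.
      return rng.choices(range(len(m[i])), weights=outcome_distribution(m, b, i, y, a))[0]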

Remark 1.

If the returns $r(\cdot)$ and the outcome distribution functions $p_{i,o}(\cdot)$ are such that effort level zero leads to zero return with probability one and the cost associated with zero effort is also zero, then we can model the ability of each Agent to leave the interaction at any time by playing effort level zero for all remaining time steps. This also ensures that Agents always have non-negative expected utility (no matter what their beliefs are) under their best response.

After a selected Agent takes an action which induces an outcome $o$ and corresponding Principal reward $r(o)$, the Principal pays them according to the contract $v(r(o))$.

Definition 5 (Expected Single Round Agent Utility $u_{i}$).

The expected single round utility of Agent $i$ given that they are selected is $u_{i}(a,y)=\bar{v}(i,a,y)-c_{i}(a)$.

The Principal’s total utility that round is their reward minus this payment.

Definition 6 (Expected Single Round Principal Utility $u_{p}$).

The expected single round utility of the Principal given that they select Agent $i$ playing action $a$ is $u_{p}(i,a,y)=\bar{r}(i,a,y)-\bar{v}(i,a,y)$. The single round utility if they select the outside option is $u_{p}(\emptyset,a,y)=0$ for all $a\in\mathcal{A},y\in\mathcal{Y}$.

Finally, we will make the natural assumption that if the Agent expends more effort, this only improves their expected return.

Assumption 3 (Monotone Relationship between Effort and Expected Return).

For any Agent $i$, any $y\in\mathcal{Y}$ and any two effort levels $a_{1},a_{2}$ such that $a_{1}\geq a_{2}$, the outcome values satisfy $\mathbb{E}_{o\sim\mathcal{D}_{i,y,a_{1}}}[r(o)]\geq\mathbb{E}_{o\sim\mathcal{D}_{i,y,a_{2}}}[r(o)]$.

2.2 Repeated Contracting Game

In the repeated version of this game, the Principal can select which Agent to contract with each round based on the history of play. However, the Principal cannot observe the history of actions or the states of nature $y$. The Principal only observes the history of realizations of the returns of the selected Agents (this is a crucial aspect of contract theory — that the Principal must contract on the returns, and not directly on the actions of the Agent). Thus, we can think of the state of the Principal at round $t$ as being determined by the history of his selections and of returns realized from rounds $1$ to $t-1$.

Definition 7 (Principal Transcript).

We call the history of the selected Agents in each round and their realized returns to the Principal the Principal transcript, denoted $\pi^{p,1:t}=[(i^{t^{\prime}},r(a^{t^{\prime}},y^{t^{\prime}}))]_{t^{\prime}\in[t]}$. The space of all possible Principal transcripts is denoted $\Pi^{p}$.

Using only this transcript, the Principal must decide which Agent to contract with in each round. It will be useful to explicitly refer to the randomness used by the Principal in their selection mechanism when choosing an Agent in each round $t$. We refer to this randomness as $\mathcal{R}^{t}$, and view the randomized selection of the Principal as a deterministic selection mechanism that takes $\mathcal{R}^{t}$ as input.

Definition 8 (Principal Selection Mechanism $f$).

In each round $t$, the Principal uses a deterministic selection mechanism $f^{t}:\Pi^{p}\times\mathcal{R}\to[k]\cup\emptyset$ to select an Agent (or the outside option). We will refer to the entire mechanism, consisting of all $T$ components, as $f$. The mechanism $f$ is known to the Agents.

In order to model Agent decisions, it is not enough that they know the Principal selection mechanism — they must also have beliefs over the future realizations of the states of nature. We allow each Agent to have a (potentially incorrect) belief about the future states of nature at each round, which may be a function of the previously realized states of nature. While Agent beliefs may differ, we assume that all Agents are aware of the beliefs of all other Agents at that round. The Principal does not have any knowledge of these beliefs.

Definition 9 (Belief functions $B_{i}(\cdot)$).

Each Agent has a function $B^{t}_{i}:\mathcal{Y}^{t-1}\to\Delta(\mathcal{Y}^{T-t})$ for any $t\in[T]$ which is a mapping from a prefix of states of nature to a distribution over the remaining states of nature. This represents Agent $i$’s belief at the start of round $t$ of the remaining states of nature given the $(t-1)$-length prefix of states of nature. Let $B^{1}_{i}=B^{1}_{i}(\emptyset)$ be the initial belief of Agent $i$ of the possible length-$T$ sequence of nature states. Additionally, for any $t_{2}\geq t_{1}\geq t$, we let $B_{i}^{t,t_{1}:t_{2}}(\cdot)$ represent Agent $i$’s belief in round $t$ of the states of nature in between rounds $t_{1}$ and $t_{2}$, given some input of previous states of nature. Furthermore, $B^{t^{\prime}}(y^{1:t})$ represents the set of beliefs of all Agents at round $t^{\prime}$ given $y^{1:t}$. Agents are aware of each others’ belief functions.

Assumption 4 (Full Support Beliefs).

We assume that the beliefs of each Agent have full support over all possible sequences of states of nature; that is, $\forall i\in[k]$, $B^{1}_{i}$ has full support. We make no assumption on the magnitude of the minimum probability, simply that it is non-zero on each sequence.

The selection function $f$ and the set of Agent belief functions $B$ together define a $k$-player extensive form game amongst the Agents, in which each Agent aims to maximize their utility over the entire time horizon of the interaction.

The Agents are more informed than the Principal, so they do have access to the history of states of nature and the actions of other Agents. Therefore, their state at each round is more complex, and determined by the following transcript.

Definition 10 (Agent Transcript).

We call the history of the realized states of nature $y^{1},\ldots,y^{t-1}$, the history of the random bits used to sample the selected Agent’s outcome $\mathcal{R}^{1}_{i^{1}},\ldots,\mathcal{R}^{t-1}_{i^{t-1}}$, the history of random bits used by the Principal $\mathcal{R}^{1},\ldots,\mathcal{R}^{t-1}$, and the history of actions of the selected Agents $a^{1},\ldots,a^{t-1}$ an Agent transcript, denoted $\pi^{1:t}=[(y^{t^{\prime}},\mathcal{R}^{t^{\prime}}_{i^{t^{\prime}}},\mathcal{R}^{t^{\prime}},a^{t^{\prime}})]_{t^{\prime}\in[t]}$. We refer to a single element $t^{\prime}$ of an Agent transcript as $\pi^{t^{\prime}}$. The space of all possible Agent transcripts is denoted $\Pi$.

Note that the Agent transcript contains strictly more information than the Principal transcript in any round. The strategy space of each Agent $i$ consists of policies mapping Agent transcripts to actions in each round.

Definition 11 (Agent Policy).

An Agent policy $p_{i}:\Pi\to\Delta(\mathcal{A})$ is a mapping from any $t$-length transcript $\pi^{t}$ to a distribution over actions. We will write $p^{t}_{i}(\pi^{t-1})$ as the distribution over actions output by Agent $i$ at round $t$. Let $\mathcal{P}$ be the set of all Agent policies.

Game 1 Agent Selection Game
  Input: fixed contract $v$, time horizon $T$
  Agent cost functions $c_{i}(\cdot)$ and belief functions $B_{i}(\cdot)$ are common knowledge amongst all Agents.
  Principal fixes their selection mechanism $f$, which is common knowledge to all Agents.
  Agent transcript is initialized to $\pi=\{\}$
  Principal transcript is initialized to $\pi^{p}=\{\}$
  for $t\in[T]$ do
     Principal selects Agent $f^{t}(\pi^{p,t-1},\mathcal{R}^{t})=i^{t}$ to whom to award the contract
     Agent $i^{t}$ takes action $a^{t}$, observed by all Agents, at a cost of $c_{i^{t}}(a^{t})$
     Agents observe the state of nature $y^{t}$
     Agents observe the outcome $o_{i^{t}}^{t}\sim\mathcal{D}_{i^{t},y^{t},a^{t}}$ and the randomness $\mathcal{R}^{t}_{i^{t}}$ used to sample it
     Principal receives the return from the realized outcome $r(o_{i^{t}}^{t})$
     Agent $i^{t}$ receives payment $v(r(o_{i^{t}}^{t}))$
     Principal publishes randomness $\mathcal{R}^{t}$ used in round $t$
     Principal transcript is updated with $(i^{t},r(o_{i^{t}}^{t}))$
     Agent transcript is updated with $(y^{t},\mathcal{R}^{t}_{i^{t}},\mathcal{R}^{t},a^{t}_{i^{t}})$
  end for
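
For concreteness, the following is a minimal runnable rendering of one playout of Game 1 (a sketch of ours, not code from the paper); f, nature, agents[i].act, and sample_outcome are assumed interfaces for the selection mechanism, the adversarial state sequence, the Agents' policies, and the outcome draw of Definition 4, and the handling of the outside option is our own simplification.

  import random

  # One playout of Game 1: the Principal and Agent transcripts are the only
  # state carried between rounds.
  def agent_selection_game(f, v, agents, nature, sample_outcome, T, rng=random):
      principal_transcript = []     # pi^p : list of (i^t, r(o^t))
      agent_transcript = []         # pi   : list of (y^t, R^t_{i^t}, R^t, a^t)
      for t in range(T):
          R_t = rng.random()                               # Principal's published randomness
          i_t = f(t, principal_transcript, R_t)            # selected Agent, or None for the outside option
          y_t = nature(t)                                  # realized state of nature
          if i_t is None:
              agent_transcript.append((y_t, None, R_t, None))
              continue
          a_t = agents[i_t].act(t, agent_transcript, v)    # chosen Agent's effort level
          R_i, r_t = sample_outcome(i_t, y_t, a_t)         # outcome randomness and realized return
          principal_transcript.append((i_t, r_t))          # Principal sees only selections and returns
          agent_transcript.append((y_t, R_i, R_t, a_t))    # Agents also see y^t, a^t, and all randomness
      return principal_transcript, agent_transcript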

Taken together, a Principal selection mechanism $f$, a set of Agent policies $p$, and a distribution over states of nature $\hat{y}$ fully define a distribution over transcripts of the extensive form game. When useful, we will consider the function which generates a distribution over transcripts given these inputs. This function generates the distribution of all possible transcripts that may occur based on a distribution over the future states of nature, which is taken as an input. Namely, the additional randomness over which it is generating a distribution is the randomness in the other components of the transcript, which the Agent beliefs do not concern themselves with: the randomness used by the Principal in their selection mechanism and the randomness used to sample the outcome from the selected Agent’s outcome distribution in each round.

Definition 12 (Transcript Generating Function).

Given a Principal with selection mechanism $f$ and Agent policies $p$, the function $g_{f,p}:\Delta(\mathcal{Y})^{T}\to\Delta(\Pi)$ takes as input a distribution over $T$ states of nature $\mathcal{Y}$ and outputs a distribution over transcripts. Given some prefix of the transcript of length $t$, we can also write $g^{1:t^{\prime}}_{f,p}:\Pi^{1:t}\times\Delta(\mathcal{Y})^{t^{\prime}-t}\to\Delta(\Pi^{1:t^{\prime}})$ for any $t^{\prime}\geq t$ to denote the distribution over transcripts until round $t^{\prime}$. When useful, we will also allow this function to take as input a deterministic sequence of nature states.

Now, we are ready to describe the utility of each Agent $i$ in the game defined by $f$ and belief function $B_{i}(\cdot)$. We will describe this for every subgame of the game, defined by an Agent transcript prefix. While the equilibrium guarantees of our mechanisms do not require subgame perfection, it will be useful for us in our analysis to think about the payoffs in particular subgames.

Definition 13 (Agent Subgame Utility).

At round $t$, given a previously realized transcript $\bar{\pi}^{1:t-1}$ with states of nature $\bar{y}^{1:t-1}$, Agent $i$’s payoff function for policy $p_{i}$ is defined as

$$\mathcal{U}_{i}(t,\bar{\pi}^{1:t-1},p_{j\neq i},p_{i})=\mathbb{E}_{\pi^{1:T}\sim g^{1:T}_{f,p}(\bar{\pi}^{1:t-1},B^{t,t:T}_{i}(\bar{y}^{1:t-1}))}\left[\sum_{t^{\prime}=t}^{T}\mathbb{I}[(f^{t}(\pi^{p,1:t^{\prime}-1},\mathcal{R}^{t})=i)]\cdot u_{i}(p^{t^{\prime}}_{i}(\pi^{1:t^{\prime}-1}),y^{t^{\prime}})\right].$$

In some cases, it will also be useful to refer to Agent $i$’s realized utility over a transcript prefix $\pi^{1:t}$, which includes all elements relevant to the Agent’s utility — namely, the Principal selection in each round and the outcome of the selected Agent’s action — as $\mathcal{U}_{i}(\pi^{1:t})$.

Definition 14 (Principal Utility).

The Principal’s payoff function for selection mechanism $f$, Agent policies $p^{*}$, and a distribution over state of nature sequences $y^{1:T}$ is defined as

$$\mathcal{U}_{p}(p^{*},f,y^{1:T})=\mathbb{E}_{\pi^{1:T}\sim g^{1:T}_{f,p^{*}}(y^{1:T})}\left[\sum_{t^{\prime}=1}^{T}\mathbb{I}[(f^{t}(\pi^{p,1:t^{\prime}-1},\mathcal{R}^{t})=i)]\cdot u_{p}(i,p^{*}_{i}(\pi^{1:t^{\prime}-1}),y^{t^{\prime}})\right].$$

We also let $\mathcal{U}_{p}(\pi^{p})$ denote the utility of the Principal over some Principal transcript $\pi^{p}$.

Definition 15 (Agent Best Response).

Given a set of belief functions $B$ and other Agent policies $p_{j\neq i}$, Agent $i$’s policy $p^{*}_{i}$ is a best response if

$$p^{*}_{i}\in\operatorname*{arg\,max}_{p_{i}\in\mathcal{P}}\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},p_{i}).$$
Definition 16 (Agent Nash Equilibrium).

Given a set of belief functions $B$ and a Principal selection mechanism $f$, Agent policies $p^{*}_{1...k}$ are a Nash equilibrium if, for each Agent $i$, $p^{*}_{i}$ is a best response to $p^{*}_{j\neq i}$.

2.3 Non-Responsive Policies

Since effort levels are continuous, the set of policies $\mathcal{P}$ is an infinite set, and in fact each $p\in\mathcal{P}$ is also an infinite-dimensional object. Thus, it is not immediately obvious that the game we study (Game 1) even admits a Nash equilibrium, and if so what properties it might have. Additionally, Nash equilibria in extensive form games can lack predictive power and support a wide range of behaviors through threats and collusion. For these reasons we will want to prove the existence not only of Nash equilibria, but of non-responsive Nash equilibria, in which Agent policies do not in fact depend on the history of actions of the other players, or the history of choices of the mechanism (which themselves depend on the history of actions) — but rather only on the history of nature states.

Definition 17 (Non-responsive Equilibrium).

Given a set of belief functions $B$ and a Principal selection function $f$, a set of Agent policies $p^{*}_{1...k}$ is a non-responsive equilibrium if it is an equilibrium and all policies in the set are non-responsive.

3 Existence of Non-Responsive Equilibrium

In this section we show that the game induced by the Agent selection interaction described in Game 1 admits a pure strategy non-responsive Nash equilibrium amongst the Agents. We will refer to Game 1 as the “general game,” to distinguish it from a related game we define in this section.

Theorem 1.

There exists a pure non-responsive Nash equilibrium of the Agent Selection Game (Game 1).

The remainder of this section is devoted to the sequence of arguments that result in the proof of Theorem 1. The outline of the proof is as follows:

  • First, we define the restricted game, in which Agents have a restricted strategy space and are only able to employ non-responsive strategies — i.e. strategies that are independent of the actions of other Agents.

  • Next, we apply the result of Glicksberg (1951) to show that the restricted game admits a mixed strategy Nash equilibrium (Lemma 2).

  • We show that we can purify any mixed strategy Nash equilibrium of the restricted game while maintaining its equilibrium properties; this establishes that the restricted game admits a pure strategy Nash equilibrium (Lemma 5).

  • Finally we show that any pure strategy Nash equilibrium of the restricted game must also be a pure strategy Nash equilibrium of the general game. This establishes that the general game admits a pure strategy Nash equilibrium in which all of the players are employing non-responsive strategies (Theorem 3).

3.1 Restricted Contracting Game

To execute the first step of the outline, we define the restricted contracting game in which Agents are restricted to playing non-responsive strategies. Unlike the general game, the strategy space in the restricted contracting game is finite-dimensional, and so it is easier to argue equilibrium existence; and of course, equilibria in the restricted contracting game are non-responsive by definition. Once we establish the existence of pure-strategy Nash equilibria in the restricted game, we will show how to “lift” them to pure strategy non-responsive equilibria of the general game. Agent policies in the restricted game will be functions of restricted transcripts. Restricted transcripts contain a history of the states of nature and randomness used in the playout, but crucially do not contain records of Agent actions or functions of Agent actions (like outcomes or Principal selections). The result is that policies that are defined as functions of restricted transcripts will be independent of the actions of other players.

Definition 18 (Restricted Transcript).

We call the collection of the history of the realized states of nature $y^{1},\ldots,y^{t-1}$, the history of the random bits used to sample the selected Agent’s outcome in each round $\mathcal{R}^{t}_{i^{t}}$, and the history of the random bits used by the Principal $\mathcal{R}^{1},\ldots,\mathcal{R}^{t-1}$ a restricted transcript, denoted $\pi^{r,1:t}=[(y^{t^{\prime}},\mathcal{R}^{t^{\prime}}_{i^{t^{\prime}}},\mathcal{R}^{t^{\prime}})]_{t^{\prime}\in[t]}$. We refer to a single element $t^{\prime}$ of a restricted transcript as $\pi^{r,t^{\prime}}$. The space of all possible restricted transcripts is denoted $\Pi^{r}$.

We will refer to Agent $i$’s utility over the restricted transcript prefix $\pi^{r,1:t}$ by $\mathcal{U}_{i}(\pi^{r,1:t})$.

Definition 19 (Restricted Agent Policy).

A restricted Agent policy $p_{i}:\Pi^{r}\to\Delta(\mathcal{A})$ for an Agent $i$ is a mapping from any $t$-length restricted transcript to a distribution over actions $\Delta(\mathcal{A})$. We will write $p^{t}_{i}(\pi^{r,t-1})$ as the distribution over actions output by Agent $i$ at round $t$. We will call the set of all restricted policies $\mathcal{P}^{r}$.

Note that all restricted Agent policies are non-responsive by construction: they do not depend on other Agents’ actions. In the general game, we refer to any policy that can be written as a function of the restricted transcript as a “non-responsive” strategy.

Definition 20 (Restricted Transcript Generating Function).

Given a Principal selection mechanism $f$, the function $g^{r}_{f}:\Delta(\mathcal{Y}^{T})\to\Delta(\Pi^{r})$ takes as input a distribution over a sequence of $T$ states of nature $\mathcal{Y}$ and outputs a distribution over restricted transcripts. Given some prefix of the restricted transcript of length $t$, we can also write $g^{r,1:t^{\prime}}_{f}:\Pi^{1:t}\times\Delta(\mathcal{Y})^{t^{\prime}-t}\to\Delta(\Pi^{1:t^{\prime}})$ for any $t^{\prime}\geq t$ to denote the distribution over transcripts through round $t^{\prime}$.

3.2 Equilibrium Existence

We begin by applying the following result to the restricted game.

Theorem 2 (Glicksberg (1951)).

Every continuous and compact game has a mixed strategy Nash equilibrium.

To do this, we will first show a result that will be useful throughout this paper for relating the local changes in an Agent’s policy to their global utility in the restricted game.

Lemma 1.

Fix a selection function $f$ and a strategy profile $p$ of Agents in the restricted game. Let $\hat{p}_{i}$ be a policy for Agent $i$ that is identical to $p_{i}$ except for the action in round $t$ for some transcript prefix $\pi^{r,1:t-1}$: so $p^{t}_{i}(\pi^{r,1:t-1})\neq\hat{p}^{t}_{i}(\pi^{r,1:t-1})$. Then, the difference in Agent $i$’s utilities between these two policies can be written as:

$$\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},\hat{p}_{i})-\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},p_{i})=\Pr[g^{r,1:t-1}_{f}(B^{1}_{i})=\pi^{r,1:t-1}]\cdot\left(\mathcal{U}_{i}(t,\pi^{r,1:t-1},p_{j\neq i},\hat{p}_{i})-\mathcal{U}_{i}(t,\pi^{r,1:t-1},p_{j\neq i},p_{i})\right).$$

The proof of Lemma 1 is deferred to Appendix A.1.

Lemma 2.

The restricted game has a mixed strategy Nash equilibrium.

Proof.

We show that the restricted game is continuous and compact and then apply Theorem 2.

Each player’s strategy space is all mappings from restricted transcripts to distributions over effort levels. As we assume that the coin flips used by the Principal, the randomness used to sample an outcome, and the space of states of nature are finite, there are a finite number, say $d$, of restricted transcripts, so this is a finite, $d$-dimensional game. Each Agent’s strategy in the restricted game can be represented as a real-valued $d$-dimensional vector lying in a compact space $[0,1]^{d}$, as effort levels are defined to lie in $[0,1]$.

Next, we show that the utility functions are continuous in Agent $i$’s policy $p_{i}$. Consider a fixed restricted transcript prefix $\pi^{r,1:t-1}$. We show that the change in Agent $i$’s expected payoff is continuous in $p^{t}_{i}(\pi^{r,1:t-1})$. Consider perturbing the action played by $p^{t}_{i}(\pi^{r,1:t-1})$ to some new action $a^{\prime}$. Let $\hat{p}_{i}$ represent this perturbed policy. This corresponds to changing a single coordinate in the $d$-dimensional vector representing Agent $i$’s policy. As this change is confined to a single subgame, by Lemma 1, the only change to the utility is in the expression $\mathcal{U}_{i}(t,\pi^{r,1:t-1},p_{j\neq i},\hat{p}_{i})$. So, it is sufficient to show that the change in this expression is continuous.

Recall that an Agent’s utility in this subgame $\bar{\pi}^{r,1:t-1}$ can be written as

\begin{align*}
\mathcal{U}_{i}(t,\bar{\pi}^{r,1:t-1},p_{j\neq i},p_{i})&=\mathbb{E}_{\pi^{r,1:T}\sim g^{r,1:T}_{f}(\bar{\pi}^{r,1:t-1},B^{t}_{i}(\bar{y}^{1:t-1}))}\left[\sum_{t^{\prime}=t}^{T}\mathbb{I}[(f^{t}(\pi^{p,1:t^{\prime}-1},\mathcal{R}^{t})=i)]\cdot u_{i}(p^{t^{\prime}}_{i}(\pi^{r,1:t^{\prime}-1}),y^{t^{\prime}})\right]\\
&=\underbrace{\mathbb{E}_{\pi^{r,1:t}\sim g^{r,1:t}_{f}(\bar{\pi}^{r,1:t-1},B^{t,t}_{i}(\bar{y}^{1:t-1}))}\left[\mathbb{I}[(f^{t}(\pi^{p,1:t-1},\mathcal{R}^{t})=i)]\cdot u_{i}(p^{t}_{i}(\pi^{r,1:t-1}),y^{t})\right]}_{\text{immediate payoff}}\\
&\quad+\underbrace{\mathbb{E}_{\pi^{r,t+1:T}\sim g^{r,1:T}_{f}(\bar{\pi}^{r,1:t-1},B^{t}_{i}(\bar{y}^{1:t-1}))}\left[\sum_{t^{\prime}=t+1}^{T}\mathbb{I}[(f^{t}(\pi^{p,1:t^{\prime}-1},\mathcal{R}^{t})=i)]\cdot u_{i}(p^{t^{\prime}}_{i}(\pi^{r,1:t^{\prime}-1}),y^{t^{\prime}})\right]}_{\text{continuation payoff}}.
\end{align*}

The immediate payoff changes as a result of this perturbation only through the single round utility the Agent sees: $u_{i}(p^{t}_{i}(\pi^{r,1:t-1}),y^{t})=\mathbb{E}_{a\sim p^{t}_{i}(\pi^{r,1:t-1})}[\bar{v}(i,a,y^{t})-c_{i}(a)]$. The cost function is, by assumption, convex on a closed set, and thus continuous. It remains to be shown that the expected payment to the Agent is also continuous in the effort level. The expected payment is

$$\mathbb{E}_{a\sim p^{t}_{i}(\pi^{r,1:t-1})}[\bar{v}(i,a,y^{t})]=\mathbb{E}_{a\sim p^{t}_{i}(\pi^{r,1:t-1})}[\mathbb{E}_{o\sim\mathcal{D}_{i,y,a}}[v(r(o))]].$$

This is continuous in the effort level because the expected outcome is continuous in the effort level and the payment is a post-processing of the outcome via the return $r(\cdot)$ and the contract $v(\cdot)$. The expected outcome is continuous in the effort level because the probability of each outcome, for a fixed state of nature $y$, is affine in the effort level:

$$\mathbb{E}_{o\sim\mathcal{D}_{i,y,a}}[o]=\sum_{j\in[m]}p_{i,o_{j}}(a,y)\cdot o_{j}.$$

Therefore, the immediate payoff for Agent $i$ is continuous in their action.

Next, observe that the continuation payoff also changes only as a function of how the perturbation affects the return Agent $i$ realizes for the Principal. This is because an Agent’s action in round $t$ affects the continuation payoff only via the probability that they are selected by the Principal’s selection mechanism in future rounds, which takes as input the previously realized returns. Thus, by the same argument as above, we also have that the continuation payoff is continuous in an Agent’s action. Therefore, we have established that utility functions are continuous in Agent policies.

As this game is continuous and compact, by Theorem 2, it must have a mixed strategy Nash equilibrium. ∎

Next, we show that without loss of generality, any mixed strategy Nash equilibrium in the restricted game can be converted into a pure strategy Nash equilibrium. Towards that end, the next lemma shows that if we change any Agent’s mixed strategy at any single round into a pure strategy by having them deterministically play the expected action under their mixed strategy, then the distribution over Principal transcripts is unchanged.

Lemma 3.

Fix a Principal selection algorithm $f$ and a set of non-responsive, possibly mixed, Agent strategies $p^{*}$. Let $i$ be an Agent who is playing a mixed strategy under $p^{*}_{i}$ in some round $t^{\prime}$ for a restricted Agent transcript $\pi^{r,1:t^{\prime}-1}$. Let $p^{\prime}$ be the strategy profile that is identical to $p^{*}$ for all Agents $j\neq i$ and differs for Agent $i$ only for transcript $\pi^{r,1:t^{\prime}-1}$, where $p^{\prime,t^{\prime}}_{i}(\pi^{r,t^{\prime}-1})=\mathbb{E}_{a\sim p^{*,t^{\prime}}_{i}(\pi^{r,t^{\prime}-1})}[a]$. Then, the distribution over future Principal transcripts $\pi^{p,t^{\prime}:T}$ induced by strategy profile $p^{*}$ is identical to the distribution over future Principal transcripts $\pi^{p,t^{\prime}:T}$ induced by strategy profile $p^{\prime}$.

Proof.

Observe that before time $t^{\prime}$, there is no difference in the game, and thus the distributions of transcripts until $t^{\prime}$ will be identical. Since all Agents are non-responsive, their actions are not a function of other Agent policies. Thus, the distributions of actions of all Agents except Agent $i$ will remain the same in both strategy profiles $p^{*}$ and $p^{\prime}$. This also means that for all Agents $j\neq i$, the outcome distributions they induce are the same in both worlds. Furthermore, since the randomness used by the Principal is fixed between both worlds, we know the Agent selections are the same.

It remains to be shown that the outcome distributions for Agent $i$ under both strategy profiles $p^{*}_{i}$ and $p^{\prime}_{i}$ are the same. At time $t$, under strategy $p^{*}_{i}$ for Agent $i$, we have outcome distribution $\bar{o}(p^{*}_{i}(\pi^{r,1:t-1}),y^{t})$.

We will show that for Agent $i$, the probability for any outcome is identical under strategy $p^{*}_{i}$ and $p^{\prime}_{i}$. Recall that per Definition 4, the probability that an outcome $o$ occurs for an Agent $i$ given an effort level $a$ and state of nature $y$ is an affine function $p_{i,o}(a,y)$, which we represent as $p_{i,o}(a,y)=m_{i,o,y}\cdot a+b_{i,o,y}$.

Let $P^{*}(a_{j})=\Pr_{a\sim p^{*}_{i}(\pi^{r,1:t-1})}[a_{j}]$ and let $P^{\prime}(a_{j})=\Pr_{a\sim p^{\prime}_{i}(\pi^{r,1:t-1})}[a_{j}]$.

Now, we can write the probability of an outcome $o_{s}$ for Agent $i$ occurring under policy $p^{*}_{i}$ as

\begin{align*}
\int_{a=0}^{1} p_{o_{s},y^{t}}(a)\ dP^{*}(a)
&=\int_{a=0}^{1}(m_{i,o_{s},y^{t}}\cdot a+b_{i,o_{s},y^{t}})\ dP^{*}(a)\\
&=\int_{a=0}^{1}(m_{i,o_{s},y^{t}}\cdot a)\ dP^{*}(a)+\int_{a=0}^{1}b_{i,o_{s},y^{t}}\ dP^{*}(a)\\
&=m_{i,o_{s},y^{t}}\cdot\int_{a=0}^{1}a\ dP^{*}(a)+b_{i,o_{s},y^{t}}\\
&=m_{i,o_{s},y^{t}}\cdot\mathbb{E}_{a\sim p^{*}_{i}(\pi^{r,1:t-1})}[a]+b_{i,o_{s},y^{t}}\\
&=m_{i,o_{s},y^{t}}\cdot\mathbb{E}_{a\sim p^{\prime}_{i}(\pi^{r,1:t-1})}[a]+b_{i,o_{s},y^{t}} && \text{(the expected value of $a$ is the same under $p^{*}$ and $p^{\prime}$)}\\
&=\int_{a=0}^{1}p_{o_{s},y^{t}}(a)\ dP^{\prime}(a) && \text{(by reversing the above steps with $P^{\prime}$)}
\end{align*}

Thus, any $p^{\prime}_{i}(\pi^{r,1:t-1})$ with the same expected effort level will give an equal distribution over outcomes. In particular, this is true for the $p^{\prime}_{i}$ that deterministically plays the expected effort level of $p^{*}_{i}$. Therefore, we have shown that under both strategy profiles $p^{*}$ and $p^{\prime}$, the distribution over Principal transcripts is identical. ∎
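
As a quick numeric sanity check of this step (ours, purely illustrative), the snippet below verifies for a single affine outcome probability $p(a)=m\cdot a+b$ that a two-point mixed effort distribution and its mean induce the same outcome probability.

  # Lemma 3's key step in numbers: affine outcome probabilities depend on the
  # effort distribution only through its mean (illustrative coefficients).
  m, b = 0.4, 0.3                               # p(a) = m*a + b for one fixed outcome and state
  mixed = [(0.2, 0.5), (0.9, 0.5)]              # effort 0.2 or 0.9, each with probability 1/2
  mean_effort = sum(a * w for a, w in mixed)    # the purified (deterministic) effort level

  prob_under_mixed = sum(w * (m * a + b) for a, w in mixed)
  prob_under_mean = m * mean_effort + b
  assert abs(prob_under_mixed - prob_under_mean) < 1e-12    # both equal 0.52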

The next lemma shows that if we purify a single Agent’s equilibrium strategy by having them play the expectation of their mixed strategy, then their opponents are still playing best responses.

Lemma 4.

Let $p^{*}$ be a strategy profile of an equilibrium of the restricted game. Let $p^{*}_{i}$ be the policy for an Agent $i$ who is playing a mixed strategy at some round $t$ for transcript prefix $\pi^{r,1:t-1}$. Let $p^{\prime}_{i}$ be a policy for Agent $i$ that is almost identical to $p^{*}_{i}$, except that it plays a pure strategy at round $t$, specifically the mean effort level of the distribution $p^{*}_{i}(\pi^{r,1:t-1})$: that is, $p^{\prime,t}_{i}(\pi^{r,1:t-1})=\mathbb{E}_{a\sim p^{*,t}_{i}(\pi^{r,1:t-1})}[a]$. Then, for all Agents $j\neq i$, it is the case that $p^{*}_{j}$ is a best response to $p^{*}_{:-i,-j},p^{\prime}_{i}$:

$$p^{*}_{j}\in\operatorname*{arg\,max}_{p_{j}\in\mathcal{P}^{r}}\mathcal{U}_{j}(1,\emptyset,\{p^{*}_{:-i,-j},p^{\prime}_{i}\},p_{j}).$$
Proof.

Consider some Agent $j$, where $j\neq i$. Fix some prefix $\bar{\pi}^{r,1:t-1}$ of the restricted game including states of nature $\bar{y}^{1:t-1}$. Finally, let $\tilde{\pi}^{p,1:t}$ denote the Principal transcript given an arbitrary restricted Agent transcript $\pi^{r,1:t}$ and the Agent policies $(p^{*}_{:-i,-j},p_{j},p^{\prime}_{i})$, and define $\pi^{*,p,1:t}$ as the Principal transcript given an arbitrary restricted Agent transcript $\pi^{r,1:t}$ and the Agent policies $(p^{*}_{:-i,-j},p_{j},p^{*}_{i})$.

Then, the expected payoff of Agent $j$ over their own belief given any policy $p_{j}$ against the policies $p^{*}_{:-i,-j},p_{j},p^{\prime}_{i}$ is:

$$\mathbb{E}_{\pi\sim g^{r,1:T}_{f}(\bar{\pi}^{r,1:t-1},B^{t}_{j}(\bar{y}^{1:t-1}))}\left[\sum_{t^{\prime}=t}^{T}\mathbb{I}[f^{t}(\tilde{\pi}^{p,1:t^{\prime}-1},\mathcal{R}^{t})=j]\cdot u_{j}(p^{t^{\prime}}_{j}(\pi^{r,1:t^{\prime}-1}),y^{t^{\prime}})\right].$$

While their payoff using $p_{j}$ against $p^{*}$ is:

$$\mathbb{E}_{\pi\sim g^{r,1:T}_{f}(\bar{\pi}^{r,1:t-1},B^{t}_{j}(\bar{y}^{1:t-1}))}\left[\sum_{t^{\prime}=t}^{T}\mathbb{I}[f^{t}(\pi^{*,p,1:t^{\prime}-1},\mathcal{R}^{t})=j]\cdot u_{j}(p^{t^{\prime}}_{j}(\pi^{r,1:t^{\prime}-1}),y^{t^{\prime}})\right].$$

Recall that the restricted transcripts exclude information about the Agent actions, and therefore both these expressions are evaluated over the exact same restricted transcript.

Also note that, by Lemma 3, it is the case that $\tilde{\pi}^{p,1:t^{\prime}}$ and $\pi^{*,p,1:t^{\prime}}$ are identically distributed. Therefore, the two payoff expressions above are the same, and the utility of Agent $j$ is the same.

Now, assume for contradiction that there is some policy $p_{j}$ which strictly outperforms $p^{*}_{j}$ against $p^{*}_{:-i,-j},p^{\prime}_{i}$. Given the equivalence of the utility functions, this implies that $p_{j}$ strictly outperforms $p^{*}_{j}$ against $p^{*}_{:-j}$, leading to a contradiction. ∎

By iteratively applying this kind of strategy purification, we can establish the existence of a pure strategy Nash equilibrium in the restricted game.

Lemma 5.

There exists a pure Nash equilibrium of the restricted game.

The proof of Lemma 5 is deferred to Appendix A.2.

The next step is to “lift” a pure strategy equilibrium of the restricted game to a pure strategy (non-responsive) equilibrium of the general game. The difference between a strategy of the restricted game and a strategy of the general game is that a strategy of the general game can depend on a richer transcript which includes information about the actions of other Agents. We first observe, however, that when all Agents are playing deterministic, non-responsive strategies, the richer “full” transcript of the general game in fact contains no additional information beyond the transcript of the restricted game.

Lemma 6.

Whenever all Agents are playing deterministic policies, the full transcript $\pi^{1:t}$ for an arbitrary round $t$ can be computed as a deterministic function of the restricted transcript $\pi^{r,1:t}$.

Proof.

Consider a deterministic Agent strategy profile $p$ and a restricted transcript $\pi^{r,1:t}$. We will show that we can compute the full Agent transcript $\pi^{1:t}$ using only $\pi^{r,1:t}$ and information known to all Agents, namely Agent policies and belief functions $B_{i}(\cdot)$ for all $i\in[k]$. Note that the only difference between the transcripts is that the full Agent transcript contains the action of the selected Agent in each round, so to show the claim we must simply show that in any round, any Agent can compute which Agent was selected by the Principal and what action the selected Agent took.

We proceed via induction. In round $t=1$, any Agent can compute the action $a^{1}$ since the Agent transcript is empty, all Agent strategies are deterministic, and both Agent belief functions and the initial state of the Principal’s selection mechanism are known to all Agents.

Now, consider some arbitrary round $t>1$. Assume that $a^{t-1}$ can be derived from $\pi^{r,t-1}$. We will show that $a^{t}$ can be derived from $\pi^{r,t}$. First, observe that an arbitrary Agent $j$ can compute which Agent will be selected by the Principal in round $t$. The Principal’s selection mechanism takes as input the Principal transcript, which includes the historical selections of Agents and their realized outcomes. By the inductive hypothesis, all Agents know the history of actions of selected Agents through round $t-1$. Additionally, they know, from the restricted transcript $\pi^{r,t-1}$, the randomness used to sample the selected Agent’s outcome in each prior round. Both of these together exactly determine the state of the Principal’s selection mechanism at round $t$. Thus, using $\mathcal{R}^{t}$, which is contained in $\pi^{r,t}$, Agents know which Agent was selected in round $t$.

Secondly, observe that the selected Agent’s action ata^{t} is a known, deterministic function that depends on πt1\pi^{t-1} and Bi()B_{i}(\cdot). Therefore, once again, applying the inductive hypothesis, Agents can compute the selected Agent’s action ata^{t}. ∎
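
To make the induction concrete, the following sketch traces it in code. It is a minimal illustration only: the helper names (select, policies, sample_outcome) and the tuple layout of the restricted transcript are hypothetical stand-ins for the paper's formal objects, not an implementation of them.

# A minimal sketch of the induction in Lemma 6: given a restricted transcript
# (states of nature, outcome-sampling randomness, Principal randomness) and
# deterministic non-responsive policies, any Agent can recover who was selected
# and what action they took in every round. All names below (select, policies,
# sample_outcome, the tuple layout) are hypothetical stand-ins.

def reconstruct_full_transcript(restricted_transcript, policies, select, sample_outcome):
    """restricted_transcript: list of (y_t, R_outcome_t, R_select_t) tuples."""
    full_transcript = []        # entries (y_t, R_outcome_t, R_select_t, i_t, a_t)
    principal_transcript = []   # entries (i_t, return_t), the Principal's view

    for y_t, r_outcome, r_select in restricted_transcript:
        # Which Agent the Principal selects is a deterministic function of the
        # Principal transcript so far and the published randomness R_select_t.
        i_t = select(principal_transcript, r_select)
        # The selected Agent's action is a deterministic function of the full
        # transcript so far (reconstructed inductively) and their policy.
        a_t = policies[i_t](full_transcript)
        # The realized return is determined by (i_t, a_t, y_t) and the revealed
        # outcome-sampling randomness R_outcome_t.
        return_t = sample_outcome(i_t, a_t, y_t, r_outcome)

        full_transcript.append((y_t, r_outcome, r_select, i_t, a_t))
        principal_transcript.append((i_t, return_t))

    return full_transcript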

To reason about utilities in the general game, we will show an extension of Lemma 1 to this setting.

Lemma 7.

In the general game, fix a selection function ff and a set of (possibly responsive) policies of other Agents pjip_{j\neq i}. Consider two (possibly responsive) policies for Agent ii, denoted p^i\hat{p}_{i} and pip_{i}, which only differ in a given transcript prefix π1:t1\pi^{1:t-1}. Then,

𝒰i(1,,pji,p^i)𝒰i(1,,pji,pi)=Pr[gf(Bi1)=π1:t1](𝒰i(t,π1:t1,pji,p^i)𝒰i(t,π1:t1,pji,pi)).\displaystyle\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},\hat{p}_{i})-\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},p_{i})=\Pr[g_{f}(B^{1}_{i})=\pi^{1:t-1}]\cdot\left(\mathcal{U}_{i}(t,\pi^{1:t-1},p_{j\neq i},\hat{p}_{i})-\mathcal{U}_{i}(t,\pi^{1:t-1},p_{j\neq i},p_{i})\right).
Proof.

This proof follows the same argument as Lemma 1. ∎

Finally, combining these lemmas lets us establish that pure strategy Nash equilibria of the restricted game are in fact also pure strategy Nash equilibria of the general game.

Lemma 8.

Every pure Nash equilibrium of the restricted game is also a pure Nash equilibrium of the general game.

Proof.

Consider any pure strategy equilibrium pp in the restricted game. Each Agent is playing a policy in their best response set over all restricted pure strategies. Say Agent ii plays policy pip_{i} in this equilibrium. We will show that in the general game, for any Agent ii, there is no beneficial responsive deviation from their restricted policy pip_{i}, assuming all other Agents are playing p:ip_{:-i}. Suppose for the sake of contradiction that there is a beneficial deviation for Agent ii to responsive strategy qiq_{i} in the general game.

We can show that, in fact, this is a valid strategy to play in the restricted game, i.e. that it can be implemented as a non-responsive strategy, as follows. By Lemma 6, when Agents are playing deterministic, non-responsive policies (as all policies in the profile pp are) any Agent ii can in round tt compute, using only πr,1:t1\pi^{r,1:t-1}, the full transcript π1:t1\pi^{1:t-1}, which is the exact information they would have access to in the general game. Therefore, qiq_{i} is also implementable as a non-responsive strategy q~i\tilde{q}_{i}, which is a function only of the restricted transcript: in round tt, let q~it(πr,t1)\tilde{q}^{t}_{i}(\pi^{r,t-1}) first compute the full transcript from the restricted transcript and then output the same action as qit(π1:t1)q^{t}_{i}(\pi^{1:t-1}). Since the policy qiq_{i} can be written as an equivalent policy which is only a function of the restricted transcript, it is by definition a non-responsive policy. Thus, q~i\tilde{q}_{i} would necessarily also be a beneficial deviation from pip_{i} in the restricted game. This gives us a contradiction to the fact that pp was a pure strategy equilibrium of the restricted game.

We can see this as follows.

\displaystyle\max_{q_{i}\in\mathcal{P}}\ \mathcal{U}_{i}(1,\emptyset,p_{j\neq i},q_{i})-\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},p_{i})>0 (qiq_{i} is assumed to be a beneficial responsive deviation from pip_{i})
\displaystyle\rightarrow\ \mathcal{U}_{i}(1,\emptyset,p_{j\neq i},\tilde{q}_{i})-\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},p_{i})>0 (qiq_{i} can be implemented as an equivalent restricted policy q~i\tilde{q}_{i})
\displaystyle\rightarrow\ \max_{q\in\mathcal{P}^{r}}\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},q)-\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},p_{i})>0 (q~i\tilde{q}_{i} is a restricted policy, so this contradicts pp being an equilibrium of the restricted game)

where 𝒫\mathcal{P} and 𝒫r\mathcal{P}^{r} are the set of responsive and restricted policies, respectively. ∎

Putting all of these lemmas together yields the main structural result of this section:

Theorem 3.

For every Principal selection mechanism, there exists a pure, non-responsive Nash equilibrium of the Agent selection game (Game 1).

Proof.

Follows directly from combining Lemmas 5 and 8. ∎

4 Monotone Selection Mechanisms And Counterfactual Regret

Having established in Section 3 that the Agent Selection Game (Game 1) always admits a pure strategy non-responsive Nash equilibrium, we now go on to study the guarantees that the Principal can obtain in such equilibria. In particular, we will compare the utility obtained by the Principal in the dynamic Agent selection game to the utility that they could have obtained counterfactually had they offered a fixed contract to a single Agent ii, guaranteed across all TT rounds. In evaluating the utility of the Principal under both their selection mechanism ff and the counterfactual fixed-agent mechanism, we assume that all agents are playing a non-responsive equilibrium in the game formed by the respective selection mechanism.

The main result we show in this section is that if the Principal uses a selection mechanism that guarantees diminishing external regret (to the realized rewards) and is in addition monotone, in a sense that we will define, then the Principal in fact obtains utility at least as high as they would have in the counterfactual benchmark, for all Agents ii. In other words, we show that in every non-responsive equilibrium of the Agent selection game, monotone selection mechanisms suffice to convert external regret bounds into policy regret bounds.

Benchmark Comparison

In the benchmark setting, since the Agent is promised a guaranteed contract across all TT rounds, they can always act so as to maximize their per-round payoff, without needing to worry about the effect that their actions today will have on whether they are chosen tomorrow; that is, they will always play the action that is myopically optimal. This is formalized via the definitions below.

Definition 21 (Constant Principal Selection Mechanism fif_{i}).

A constant Principal selection mechanism fi:Π[k]f_{i}:\Pi\to[k]\cup\emptyset is defined as the mechanism which deterministically selects a fixed agent ii each round, regardless of the transcript.

Definition 22 (Myopic Optimal Action).

An Agent ii’s myopic optimal action in some round tt given transcript π1:t1\pi^{1:t-1} and corresponding previous states of nature y¯1,,y¯t1\bar{y}^{1},\ldots,\bar{y}^{t-1} is an action aita^{t}_{i} satisfying aitargmaxa𝒜𝔼ytBi1,t(y¯1:t1)[ui(a,yt)]a^{t}_{i}\in\operatorname*{arg\,max}_{a\in\mathcal{A}}\mathbb{E}_{y^{t}\sim B^{1,t}_{i}(\bar{y}^{1:t-1})}\left[u_{i}(a,y^{t})\right] Note that for an Agent ii this action may differ between rounds and may not be unique within a fixed round. Let the set of all such actions in round tt be denoted 𝒜~it\tilde{\mathcal{A}}^{t}_{i}.

Definition 23 (Myopic Optimal Policy).

An Agent’s myopic optimal policy is defined to be a policy in which each action pit(π1:t1)p^{t}_{i}(\pi^{1:t-1}) is a myopic optimal action for all πΠ\pi\in\Pi and t[T]t\in[T]. The set of all myopic optimal policies for player ii is defined to be pimp_{i}^{m}.
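
For intuition, a myopic optimal action is just a one-round argmax against the Agent's current belief. The sketch below assumes a finite effort grid and a belief represented as a finite distribution over states of nature; both discretizations, and the particular utility function in the example, are illustrative assumptions rather than part of the model.

# A minimal sketch of Definition 22: a myopic optimal action maximizes the
# Agent's expected single-round utility u_i(a, y) under their belief about y,
# ignoring any effect on future selection. The effort grid and the belief
# representation (a dict mapping state -> probability) are illustrative.

def myopic_optimal_action(belief, utility, effort_grid):
    """belief: dict mapping state y -> probability; utility: u_i(a, y)."""
    def expected_utility(a):
        return sum(prob * utility(a, y) for y, prob in belief.items())
    return max(effort_grid, key=expected_utility)

# Example: payment alpha * a * y (linear in effort and state) minus quadratic cost.
if __name__ == "__main__":
    alpha = 0.5
    cost = lambda a: a ** 2
    u = lambda a, y: alpha * a * y - cost(a)
    belief = {1.0: 0.3, 2.0: 0.7}             # E[y] = 1.7
    grid = [i / 100 for i in range(101)]      # effort levels in [0, 1]
    print(myopic_optimal_action(belief, u, grid))  # close to alpha * E[y] / 2 = 0.425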

Lemma 9.

In every equilibrium of the game in which the Principal plays a constant Principal selection mechanism selecting agent ii, agent ii plays a myopic optimal policy.

The proof of Lemma 9 is deferred to Appendix B.1.

We now define a monotone selection mechanism. Informally, a monotone selection mechanism is one in which if the Principal’s reward is increased on a round in which Agent ii was chosen, and no other changes are made to the transcript, then this change only increases the probability that Agent ii is chosen at future rounds.

Definition 24 (Monotone Selection Mechanism).

Let ff be a Principal selection mechanism (Definition 8). Let πpΠp\pi^{p}\in\Pi^{p} be an arbitrary Principal transcript. Let π~p\tilde{\pi}^{p} be another Principal transcript that is identical to πp\pi^{p} except that it differs in exactly a single entry tt such that π~p,t=(i,r~(oit))\tilde{\pi}^{p,t}=(i,\tilde{r}(o^{t}_{i})) and πp,t=(i,r(oit))\pi^{p,t}=(i,r(o^{t}_{i})) where r~(oit)r(oit)\tilde{r}(o^{t}_{i})\geq r(o^{t}_{i}). The Principal’s selection mechanism ff is said to be monotone if for all ttt^{\prime}\geq t, it is the case that Pr[ft+1(π~p,1:t,)=i]Pr[ft+1(πp,1:t,)=i]\Pr_{\mathcal{R}}[f^{t^{\prime}+1}(\tilde{\pi}^{p,{1:t^{\prime}}},\mathcal{R})=i]\geq\Pr_{\mathcal{R}}[f^{t^{\prime}+1}({\pi}^{p,{1:t^{\prime}}},\mathcal{R})=i] — that is, the selection mechanism is only more likely to select ii in future rounds under π~p,t\tilde{\pi}^{p,t} than πp,t\pi^{p,t}.
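
Monotonicity of a selection mechanism can be sanity-checked numerically: perturb a single realized return of Agent i upward, hold everything else fixed, and verify that the probability of selecting i never decreases in later rounds. The harness below treats the mechanism as a black box returning selection probabilities; this interface is a hypothetical assumption made for illustration (Definition 24 constrains behavior, not an API).

# A minimal sketch of a numerical check for Definition 24. The mechanism is
# modeled as a black-box function mapping a Principal transcript prefix
# (a list of (agent, return) pairs) to a vector of next-round selection
# probabilities. This interface is an assumption for illustration only.

def is_monotone_on(mechanism, principal_transcript, t, bump, tol=1e-12):
    """Check the Definition 24 condition for one transcript and one perturbation."""
    i, r = principal_transcript[t]
    bumped = list(principal_transcript)
    bumped[t] = (i, r + bump)            # increase Agent i's realized return at round t
    for t_prime in range(t, len(principal_transcript)):
        p_orig = mechanism(principal_transcript[: t_prime + 1])
        p_bump = mechanism(bumped[: t_prime + 1])
        if p_bump[i] < p_orig[i] - tol:  # selecting i must not become less likely
            return False
    return True

# Example mechanism: "follow the leader" on cumulative returns, which is monotone.
def follow_the_leader(prefix, num_agents=2):
    totals = [0.0] * num_agents
    for agent, ret in prefix:
        totals[agent] += ret
    best = max(range(num_agents), key=lambda j: totals[j])
    return [1.0 if j == best else 0.0 for j in range(num_agents)]

if __name__ == "__main__":
    transcript = [(0, 1.0), (1, 2.0), (0, 0.5), (1, 1.0)]
    print(is_monotone_on(follow_the_leader, transcript, t=0, bump=1.0))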

We first establish that when the Principal employs a monotone selection mechanism, then in any non-responsive pure strategy equilibrium of the game, Agents all exert effort levels that are only higher than they would have in their myopically optimal strategies. Informally, this is because we can decompose Agent utilities into their single-round utility and their continuation utility. Any effort level that is below the myopically optimal level by definition will only improve single-round utility if raised — and because of monotonicity, will also only improve the continuation utility. By the non-responsiveness of the equilibrium, it will not affect the behavior of other Agents, and hence we can conclude that acting at any effort level below the myopically optimal level is dominated, and hence cannot be supported in equilibrium.

Lemma 10.

If the Principal is selecting Agents via a monotone algorithm, then in any non-responsive, pure strategy Nash equilibrium of the selection game (Game 1), the selected Agent iti^{t}’s effort level in each round tt is at least as high as p~itt=inf𝒜~itt\tilde{p}^{t}_{i^{t}}=\inf{\tilde{\mathcal{A}}^{t}_{i^{t}}}.

The proof of Lemma 10 is deferred to Appendix B.1.

With this result in hand, we are almost ready to state and prove our main result, which is that if the Principal employs a bandit learning algorithm that is both monotone and has a diminishing external regret guarantee, then if agent policies always form a non-responsive equilibrium, the Principal will have no regret to the best counterfactual world induced by picking a fixed agent across all rounds. To state the result, we first need to give some online learning preliminaries.

4.1 Online Learning Preliminaries

When the Principal selects an Agent each round, their utility is determined by an unknown sequence of states of nature and an unknown set of Agent policies. This can be embedded into an “online learning from experts” setting, in which the Principal selects one of the kk actions at each step and obtains some return based upon this selection. As the Principal does not receive full feedback and only has access to the return of the Agent selected each round, our setting is in fact a special case of online learning with bandit feedback — recalling of course that we ultimately want guarantees stronger than typical regret bounds. In this section we will make use of algorithms with general no-regret properties to derive no-regret guarantees for our setting.

We begin with a brief overview of the general online learning from experts setting with bandit feedback and then introduce some definitions, including notions of external regret and swap regret. In this section, to align with standard notation for no-regret algorithms, we refer to losses instead of rewards. At every round, an adaptive adversary will choose a vector of kk losses — one for each action. Simultaneously, the learning algorithm will choose a distribution over kk actions, and will experience loss equal to the loss of their chosen action. The learning algorithm will observe the loss of the action chosen, but will not observe the loss of actions not chosen. Since the adversary is allowed to be adaptive, they are able to react to the history of the game when selecting their loss vector in the current round, including to the realizations of randomness used previously by the learning algorithm (though not the algorithm’s randomness in the current round). We make this dependence explicit in our definitions, as it will be important to our later analysis—in our applications, the loss sequence will be explicitly adaptive.

Definition 25 (Transcript ϕ\phi).

Let ϕ1:t\phi^{1:t} denote a transcript of length tt consisting of the history of the losses observed by the algorithm b1,,btb^{1},\ldots,b^{t}, the actions selected by the algorithm s1,,sts^{1},\ldots,s^{t}, and the randomness used by the selection algorithm 1,,t\mathcal{R}^{1},\ldots,\mathcal{R}^{t}. Note that since we are in a setting with bandit feedback, the transcript does not include the entire loss vector \ell but only the loss of the selected action in that round, i.e. bt(ϕ1:t1)[st]b^{t}\coloneqq\ell(\phi^{1:t-1})[s^{t}].

Definition 26 (Adaptive Adversary \ell).

An adaptive adversary plays a loss function t:Φt1k\ell^{t}:\Phi^{t-1}\to\mathbb{R}^{k} in round tt as a function of the transcript ϕ1:t1\phi^{1:t-1}. We refer to the adaptive adversary across all rounds as \ell.

Definition 27 (Learning Algorithm).

A learning algorithm ff is a mapping from t1t-1-length transcript prefixes ϕ1:t11:t1\phi^{1:t-1}_{\mathcal{R}^{1:t-1}} and random bits used by the algorithm in round tt, denoted t\mathcal{R}^{t}, to an action selection in the current round tt.
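
The interaction protocol behind Definitions 25–27 can be summarized as a short loop; the sketch below is illustrative only, with the algorithm and adversary treated as hypothetical callables.

# A minimal sketch of the bandit interaction protocol behind Definitions 25-27.
# The adversary sees the transcript so far (including past randomness), the
# algorithm additionally uses fresh randomness R_t, and only the chosen
# action's loss enters the transcript.
import random

def run_interaction(algorithm, adversary, T, seed=0):
    rng = random.Random(seed)
    transcript = []  # entries (observed loss b_t, selected action s_t, randomness R_t)
    for t in range(T):
        loss_vector = adversary(transcript)        # adaptive: depends on the past
        R_t = rng.random()                         # fresh randomness for this round
        s_t = algorithm(transcript, R_t)           # action chosen this round
        b_t = loss_vector[s_t]                     # bandit feedback: only this entry
        transcript.append((b_t, s_t, R_t))
    return transcript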

Definition 28 (External Regret).

The external regret of an algorithm ff against an adaptive adversary \ell over TT rounds is defined as

Regf(T,)=maxi[k]𝔼1:T[t=1T(ϕ1:t11:t1)[f(ϕ1:t11:t1,t)](ϕ1:t11:t1)[i]],\displaystyle\textsc{Reg}_{f}(T,\ell)=\max_{i\in[k]}\mathbb{E}_{\mathcal{R}^{1:T}}\left[\sum_{t=1}^{T}\ell(\phi^{1:t-1}_{\mathcal{R}^{1:t-1}})[f(\phi^{1:t-1}_{\mathcal{R}^{1:t-1}},\mathcal{R}^{t})]-\ell(\phi^{1:t-1}_{\mathcal{R}^{1:t-1}})[i]\right],

where we use ϕ1:t11:t1\phi^{1:t-1}_{\mathcal{R}^{1:t-1}} to denote the transcript prefix deterministically resulting from the algorithm ff, adversary \ell and a particular realization of 1:t1\mathcal{R}^{1:t-1}.

Definition 29 (Swap Function).

Let the mapping χ:[k][k]\chi:[k]\to[k] represent a strategy modification function.

Definition 30 (Swap Regret).

The swap regret of an algorithm ff against an adaptive adversary \ell over TT rounds is defined as

\displaystyle\textsc{SReg}_{f}(T,\ell)=\max_{\chi}\mathbb{E}_{\mathcal{R}^{1:T}}\left[\sum_{t=1}^{T}\ell(\phi^{1:t-1}_{\mathcal{R}^{1:t-1}})[f(\phi^{1:t-1}_{\mathcal{R}^{1:t-1}},\mathcal{R}^{t})]-\ell(\phi^{1:t-1}_{\mathcal{R}^{1:t-1}})[\chi(f(\phi^{1:t-1}_{\mathcal{R}^{1:t-1}},\mathcal{R}^{t}))]\right]

where we use ϕ1:t11:t1\phi^{1:t-1}_{\mathcal{R}^{1:t-1}} to denote the transcript prefix deterministically resulting from the algorithm ff, adversary \ell and a particular realization of 1:t1\mathcal{R}^{1:t-1}.
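
Both notions can be computed from a realized play-out. The sketch below does so from a full loss matrix and the sequence of chosen actions; it is for intuition only, since in the bandit setting the algorithm itself would only ever observe the loss of its chosen action, and the definitions above additionally take an expectation over the algorithm's randomness.

# A minimal sketch computing realized external and swap regret from a history.
# losses[t][j] is the loss of action j at round t; chosen[t] is the action the
# algorithm played. With a full loss matrix this is the full-information view;
# a bandit algorithm would only see losses[t][chosen[t]].

def external_regret(losses, chosen, k):
    algo_loss = sum(losses[t][chosen[t]] for t in range(len(chosen)))
    best_fixed = min(sum(losses[t][j] for t in range(len(chosen))) for j in range(k))
    return algo_loss - best_fixed

def swap_regret(losses, chosen, k):
    # The best swap function chooses, independently for each action j, the best
    # replacement on the rounds in which j was actually played.
    algo_loss = sum(losses[t][chosen[t]] for t in range(len(chosen)))
    swapped = 0.0
    for j in range(k):
        rounds_j = [t for t in range(len(chosen)) if chosen[t] == j]
        if rounds_j:
            swapped += min(sum(losses[t][jp] for t in rounds_j) for jp in range(k))
    return algo_loss - swapped

if __name__ == "__main__":
    losses = [[0.2, 0.9], [0.8, 0.1], [0.7, 0.2]]
    chosen = [1, 0, 0]
    print(external_regret(losses, chosen, k=2))  # 2.4 - 1.2 = 1.2
    print(swap_regret(losses, chosen, k=2))      # 2.4 - (0.2 + 0.3) = 1.9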

Note that in both of these definitions of regret, the benchmark comparison is to the realized loss vectors, not the counterfactual loss vectors that would have been realized had the learning algorithm made different choices. However, we will be able to show an even stronger guarantee in our setting: that the learning algorithm in fact satisfies a regret guarantee with respect to the counterfactual loss vectors that it did not see, but which would have been realized under different choices of the algorithm.

4.2 Counterfactual Guarantees in Equilibrium

We are now in a position to state the main results of this section. We first observe that if the Principal is selecting Agents using a bandit algorithm that has worst-case external regret guarantees, then it immediately follows that they obtain utility at least as high as they would have had they consistently selected the best Agent in hindsight, fixing the actions that the Agents actually played in Game 1.

Lemma 11.

If the Principal is selecting Agents with a selection mechanism with external regret bounded by R(T)R(T) for all adaptive adversaries \ell, then in any non-responsive, pure strategy equilibrium of the game induced by Game 1, the Principal has regret at most R(T)R(T) to the sequence of returns realized by the best fixed agent in hindsight.

Proof.

This follows immediately from instantiating the regret bound for the Principal in the Agent selection game. Note that the regret bound holds for an arbitrary adaptive adversary, and so in fact holds for any adversary with control over the states of nature in the Agent selection game. ∎

Now, we state stronger regret guarantees. In order to state our counterfactual guarantees, we must introduce a definition for what sorts of Principal transcripts can arise given any non-responsive equilibrium of the agents.

Definition 31 (Equilibrium Policy Set pe(f,y)p^{e}(f,y)).

Given an agent selection mechanism ff and a state of nature sequence yy, pe(f,y)p^{e}(f,y) is the set of all collections of policies that Agents can play which jointly form a non-responsive equilibrium.

Now we can formally define our notion of policy regret. Informally, we compare the utility of the Principal given the mechanism they actually implemented, to the utility that the Principal would have obtained had they implemented the best fixed-agent mechanism fif_{i}, given the actual sequence of realized states of nature and the beliefs of the Agents. When evaluating the utility for the Principal in both cases, we assume that the Agents play according to a non-responsive pure-strategy Nash equilibrium. Because there may be multiple such equilibria, we take the minimum Principal utility over this set.

Definition 32 (Policy Regret).

For some Principal selection mechanism ff, agent beliefs BB, and state of nature sequence y1:Ty^{1:T}, the policy regret of the Principal is

maxi[k]minppe(fi,y)𝒰p(gfi,p(y1:T))minppe(f,y)𝒰p(gf,p(y1:T)).\max_{i\in[k]}\min_{p\in p^{e}(f_{i},y)}\mathcal{U}_{p}(g_{f_{i},p}(y^{1:T}))-\min_{p\in p^{e}(f,y)}\mathcal{U}_{p}(g_{f,p}(y^{1:T})).

Finally, we can state our main theorem: informally, if the Principal’s algorithm has an external regret guarantee and is also monotone, then not only do they obtain no regret to the realized actions of the Agents in Game 1, but in any deterministic, non-responsive equilibrium of the induced game, they obtain utility as high as they would have had they interacted with the best of the Agents every round.

Theorem 4.

If the Principal is selecting Agents with a monotone selection mechanism with external regret bounded by R(T)R(T), the Principal has policy regret at most R(T)R(T).

Proof.

This follows from Lemma 10, Assumption 3, and Lemma 11. By Lemma 10, in the Agent selection game each Agent exerts effort at least as high as their myopically optimal effort level, which by Lemma 9 is the effort they would exert if they were guaranteed the contract in every round. Next, by Assumption 3, higher effort only increases expected returns, so the counterfactual sequence of expected returns in the game in which an Agent was guaranteed the contract would have been only lower than the expected returns that Agent generates in the Agent selection game. Combined with Lemma 11, which states that the Principal has regret at most R(T)R(T) to the realized sequence of returns of the best fixed Agent in hindsight in the Agent selection game, we have the claim. ∎

To instantiate this guarantee with a concrete mechanism, we only need that a monotone bandit learning algorithm of this sort exists with a diminishing regret guarantee. Indeed, we show in Section 6 that MonoBandit(MW) is a monotone bandit selection mechanism with external regret bounded by 2klog14(k)T342\sqrt{k}\cdot\log^{\frac{1}{4}}(k)\cdot T^{\frac{3}{4}} (Theorem 10).

Theorem 5.

If the Principal selects agents each round via the MonoBandit algorithm instantiated with Multiplicative Weights, the Principal has policy regret at most 2klog14(k)T342\sqrt{k}\cdot\log^{\frac{1}{4}}(k)\cdot T^{\frac{3}{4}}.

Proof.

This follows from Theorem 4 and Theorem 10. ∎

5 Swap Regret and Limited Liability Contracts

In Section 4, we showed that by using a monotone selection mechanism with external regret guarantees, the Principal is able to guarantee payoff similar to what they could have obtained by using the same contract in a benchmark setting in which the contract was guaranteed to any single fixed Agent across all TT rounds. However, if we take the benchmark contract to be one that is not limited liability — i.e. that can sometimes obligate the Agent to pay the Principal — then our mechanism also must offer contracts which fail to satisfy the limited liability condition. This is problematic, because on the one hand, the natural, robust set of linear benchmark policies do not satisfy the limited liability condition, but on the other hand, contracts that fail to satisfy the limited liability condition can be difficult to implement in practice. In this section, we turn our attention to mechanisms for the Principal that only offer limited liability contracts, while still being able to offer the same kinds of counterfactual regret guarantees we derived in Section 4 with respect to linear contracts, which are not limited liability.

Definition 33 (Limited Liability Contract).

A limited liability contract is a payment rule v:0v:\mathbb{R}\to\mathbb{R}_{\geq 0} mapping the Principal’s return to a payment made to an Agent.

Definition 34 (Linear Contract).

A linear contract vα:v_{\alpha}:\mathbb{R}\to\mathbb{R} with parameter α[0,1]\alpha\in[0,1] is a payment rule mapping the Principal’s realized return rr to a payment to the Agent equal to an α\alpha-fraction of that return: vα(r)=αrv_{\alpha}(r)=\alpha\cdot r.

To start, we prove that, by offering a linear contract each round and playing a monotone no swap-regret selection mechanism, the Principal guarantees that the liability of each agent is bounded by the swap regret term. Then, we prove that, if we equip this selection mechanism with a tab that ensures that the payment to each agent fulfills limited liability, it is an approximate (up to regret terms) equilibrium for the Agents to play a strategy profile corresponding to a non-responsive equilibrium of the general game. We use these results to show that, if we assume that agents will play one of these approximate non-responsive equilibria, the Principal gets low policy regret while ensuring limited liability for the Agents.

Lemma 12.

If the Principal selection mechanism ff has swap regret bounded by SReg(T)\textsc{SReg}(T) against any adaptive adversary \ell, and the Principal is providing a fixed linear contract vαv_{\alpha} each round, then each agent’s cumulative payment is at least αSReg(T)-\alpha\textsc{SReg}(T).

Proof.

Let 𝒰p(f,P,y|i)\mathcal{U}_{p}(f,P,y|i) represent the Principal utility from the rounds in which they select Agent ii, and let 𝒰p(f,P,y)\mathcal{U}_{p}(f,P,y) denote their total utility. The Principal selection mechanism ff has swap regret at most SReg(T)\textsc{SReg}(T) against any adaptive adversary \ell. Furthermore, the Principal always has an outside option, denoted \emptyset, that guarantees them return 0 at zero cost in every round. Consider in particular the swap function χ(j)={j,if ji,,if j=i\chi(j)=\begin{cases}j,\text{if }j\neq i,\\ \emptyset,\text{if }j=i\end{cases}. Applying the swap regret guarantee to χ\chi on the realized sequence of returns gives

\displaystyle\mathcal{U}_{p}(f,P,y)\geq\mathcal{U}_{p}(\chi(f),P,y)-\textsc{SReg}(T),

where 𝒰p(χ(f),P,y)\mathcal{U}_{p}(\chi(f),P,y) denotes the Principal’s realized utility had every selection of Agent ii been replaced by the outside option, with all other selections and realized returns left unchanged. Since χ\chi modifies only the rounds in which Agent ii is selected, and the outside option yields utility 0 in each such round, we have

\displaystyle\mathcal{U}_{p}(\chi(f),P,y)=\mathcal{U}_{p}(f,P,y)-\mathcal{U}_{p}(f,P,y|i).

Combining the two displays yields 𝒰p(f,P,y|i)SReg(T)\mathcal{U}_{p}(f,P,y|i)\geq-\textsc{SReg}(T). Thus, for each Agent ii, the total utility the Principal gains during the rounds in which ii is selected is at least SReg(T)-\textsc{SReg}(T). As the Principal is providing a fixed linear contract vαv_{\alpha} each round, the total payment from the Principal to Agent ii is α𝒰p(f,P,y|i)αSReg(T)\alpha\cdot\mathcal{U}_{p}(f,P,y|i)\geq-\alpha\textsc{SReg}(T). ∎

We show in Section 6 that our algorithm MonoBandit, when instantiated with the algorithm TreeSwap from Dagan et al. (2023), is a monotone bandit selection mechanism with swap regret bounded by 2ko(T)2\sqrt{k}\cdot o(T) (Theorem 11).

Theorem 6.

If the Principal selects agents each round via the MonoBandit algorithm instantiated with TreeSwap, the Principal has policy regret at most 2ko(T)2\sqrt{k}\cdot o(T).

Proof.

This follows from Theorem 4 and Theorem 11. ∎

Theorem 7.

If the Principal selects agents each round via the MonoBandit algorithm (Algorithm 5) instantiated with TreeSwap and provides a fixed linear contract, then:
1) the Principal has policy regret at most 2ko(T)2\sqrt{k}\cdot o(T), and
2) all agents have expected liability no more than 2ko(T)2\sqrt{k}\cdot o(T).

Proof.

The first claim follows from Theorem 6. The second claim follows from Lemma 12. ∎

Thus, by running a monotone no swap-regret algorithm in the bandit setting and offering a linear contract, the Principal ensures that after the interaction, the net payment from the Principal to the Agent must be non-negative (up to a diminishing regret term).

We now introduce a limited liability mechanism. Informally, it works as follows: the Principal opens a “tab” for each Agent that allows them to accumulate debt. The Principal maintains a running total of the cumulative payments that would be made under the linear contract, but defers the payments to a single lump sum transfer at the end of the interaction. At the end of the interaction, if the tab is negative, any remaining debt owed from the Agent to the Principal is forgiven, and so no payments ever need to be made from the Agent to the Principal. Otherwise, the Principal pays the agent the tab. Introducing this tab does not impact the net payment made from the Principal to the Agent whenever the original net payment is non-negative. However, since the tab results in a zero payment when the net payment from the Principal to the Agent would have been negative, this mechanism is incentive distorting. Informally, since we have established that the net payment from the Principal to the Agent cannot have been very negative, we can show that the distortion of incentives is minimal.

We formalize the limited liability game in Game 2. It is similar to the standard game, with the exception of the addition of the three boldface lines.

Game 2 Agent Selection with Limited Liability Game
  Input fixed linear contract α\alpha, time horizon TT
  Agent cost functions ci()c_{i}(\cdot) and belief functions Bi()B_{i}(\cdot) are common knowledge amongst all Agents.
  Principal fixes their selection mechanism ff, which is common knowledge to all Agents.
  Agents initialize their transcript π={}\pi=\{\}
  Principal initializes their transcript of returns πp={}\pi^{p}=\{\}
  Principal initializes a tab 𝒯i\mathcal{T}_{i} for each Agent i[k]i\in[k]
  for t[T]t\in[T] do
     Principal selects Agent f(πp,t1)=itf(\pi^{p,t-1})=i^{t} to whom to award contract using random bits t\mathcal{R}^{t}
     Agent iti^{t} takes action ata^{t}, observed by all Agents, at a cost of cit(at)c_{i^{t}}(a^{t})
     Agents observe the state of nature yty^{t}
     Agent iti^{t} observes their outcome oitt𝒟it,yt,ato_{i^{t}}^{t}\sim\mathcal{D}_{i^{t},y^{t},a^{t}}
     Principal receives the return from realized outcome r(oitt)r(o_{i^{t}}^{t})
     Principal updates Agent iti^{t}’s tab: 𝒯it=𝒯it+αr(oitt)\mathcal{T}_{i^{t}}=\mathcal{T}_{i^{t}}+\alpha\cdot r(o_{i^{t}}^{t})
     Nature reveals randomness used to sample the outcome itt\mathcal{R}^{t}_{i^{t}}
     Principal publishes randomness t\mathcal{R}^{t} used in round tt
     Principal transcript is updated with (it,r(oitt))(i^{t},r(o_{i^{t}}^{t}))
     Agent transcript is updated with (yt,itt,t,aitt)(y^{t},\mathcal{R}^{t}_{i^{t}},\mathcal{R}^{t},a^{t}_{i^{t}})
  end for
  Principal pays max(0,𝒯i)\max(0,\mathcal{T}_{i}) to each Agent i[k]i\in[k]
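
A minimal sketch of the bookkeeping that the boldface lines add, assuming the Principal's selection rule and the environment (Agent actions, states of nature, outcome sampling) are provided by hypothetical helpers: the Principal accrues α·r(o) on the selected Agent's tab each round and settles max(0, tab) at the end, so no Agent ever pays the Principal.

# A minimal sketch of the limited-liability "tab" of Game 2. select_agent and
# realize_return stand in for the Principal's selection mechanism and the
# environment (Agent action, state of nature, outcome sampling); they are
# hypothetical placeholders, not part of the paper's formal model.

def run_limited_liability_game(select_agent, realize_return, alpha, k, T):
    tabs = [0.0] * k                 # one tab per Agent, may go negative mid-game
    principal_transcript = []        # (selected agent, realized return) per round
    for t in range(T):
        i_t = select_agent(principal_transcript)   # Principal picks an Agent
        r_t = realize_return(i_t, t)               # realized return r(o_{i_t}^t)
        tabs[i_t] += alpha * r_t                   # defer the linear-contract payment
        principal_transcript.append((i_t, r_t))
    # Settle at the end: negative tabs (debt owed by an Agent) are forgiven.
    payments = [max(0.0, tab) for tab in tabs]
    return payments, principal_transcript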

We first establish that if the Principal employs a selection mechanism that has a swap regret guarantee, that for any set of policies played by the Agents, the Agents’ utilities are almost identical (up to the swap regret bound) in the Limited Liability Game (Game 2) and in the original game (Game 1). Informally, this is because whenever the Agent ends the interaction without “debt”, their payments are identical in the two settings, and the swap regret guarantee that the Principal has to the outside option (which always guarantees payoff 0) will similarly guarantee that no Agent will ever end the game with significant debt (because by the nature of linear contracts, this would also imply significant swap regret for the Principal).

Definition 35 (Agent Utility in the Limited Liability Game 𝒰Lim\mathcal{U}^{Lim}).

Let v¯i(π)\bar{v}_{i}(\pi) represent the total payment from the Principal to the Agent ii under the transcript π\pi. Furthermore, let ci(π)c_{i}(\pi) represent the sum of the costs of all actions taken by Agent ii (on rounds they are selected). Then, let 𝒰Lim\mathcal{U}^{Lim} be defined as the utility of Agent ii in the limited liability game:

𝒰iLim(1,,pji,pi)=πΠ(max(0,v¯i(π))ci(π))Pr[gf,pji,pi(Bi1)=π]\mathcal{U}^{Lim}_{i}(1,\emptyset,p_{j\neq i},p_{i})=\sum_{\pi\in\Pi}(\max(0,\bar{v}_{i}(\pi))-c_{i}(\pi))\cdot\Pr[g_{f,p_{j\neq i},p_{i}}(B^{1}_{i})=\pi]
Lemma 13.

Fix any set of Agent beliefs BB, set of Agent policies pp, and linear contract vαv_{\alpha}. If the Principal selection mechanism ff has swap regret bounded by SReg(T)\textsc{SReg}(T) against any adaptive adversary \ell, then each Agent will have expected payoff that differs by no more than αSReg(T)\alpha\textsc{SReg}(T) between the standard Agent selection game (Game 1) and the limited liability Agent selection game (Game 2). That is, for any Agent i[k]i\in[k],

𝒰i(1,,pji,pi)+αSReg(T)𝒰iLim(1,,pji,pi)𝒰i(1,,pji,pi).\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},p_{i})+\alpha\textsc{SReg}(T)\geq\mathcal{U}^{Lim}_{i}(1,\emptyset,p_{j\neq i},p_{i})\geq\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},p_{i}).
Proof.

We can write the utility for an Agent ii under the limited liability game as

\displaystyle\mathcal{U}^{Lim}_{i}(1,\emptyset,p_{j\neq i},p_{i})=\sum_{\pi\in\Pi}(\max(0,\bar{v}_{i}(\pi))-c_{i}(\pi))\cdot\Pr[g_{f,p_{j\neq i},p_{i}}(B^{1}_{i})=\pi]
\displaystyle=\sum_{\pi\in\Pi}(\bar{v}_{i}(\pi)+\max(0,-\bar{v}_{i}(\pi))-c_{i}(\pi))\cdot\Pr[g_{f,p_{j\neq i},p_{i}}(B^{1}_{i})=\pi]
\displaystyle=\sum_{\pi\in\Pi}(\bar{v}_{i}(\pi)-c_{i}(\pi))\cdot\Pr[g_{f,p_{j\neq i},p_{i}}(B^{1}_{i})=\pi]+\sum_{\pi\in\Pi}\max(0,-\bar{v}_{i}(\pi))\cdot\Pr[g_{f,p_{j\neq i},p_{i}}(B^{1}_{i})=\pi]
\displaystyle=\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},p_{i})+\sum_{\pi\in\Pi}\max(0,-\bar{v}_{i}(\pi))\cdot\Pr[g_{f,p_{j\neq i},p_{i}}(B^{1}_{i})=\pi].

First, observe that the right hand side of the desired inequality follows immediately from the above. This is simply because the total payment made to any Agent in the limited liability game is exactly the max of zero and the total payments made to that Agent in the non-limited liability game. Therefore, each Agent only has weakly higher utility in the limited liability game.

Now, note that by Lemma 12, any Agent’s expected cumulative payment in the non-limited liability game is bounded below by αSReg(T)-\alpha\textsc{SReg}(T). In the non-limited liability game, this would mean a transfer from the Agent to the Principal of at most αSReg(T)\alpha\textsc{SReg}(T). Since payments to the Agents between the limited liability and non-limited liability games only differ when the latter are negative, we have established that the utility of an agent is at most αSReg(T)\alpha\textsc{SReg}(T) higher in the limited liability game. ∎

We now establish that any equilibrium of the general Agent selection Game (Game 1) remains an approximate equilibrium of the limited liability game (Game 2) whenever the Principal employs a no swap-regret algorithm. Hence, the guarantees that we showed in Section 4 for the Principal in equilibrium (that they obtain payoff at least as high as they would have in the counterfactual world in which the Principal engaged in a constant selection mechanism with the best Agent ii in hindsight) continue to hold here (in approximate equilibrium), whenever the Principal employs a monotone selection algorithm with swap-regret guarantees.

Theorem 8.

Fix a linear contract vαv_{\alpha} and a Principal selection mechanism ff with swap regret bounded by SReg(T)\textsc{SReg}(T) against any adaptive adversary. Any Nash equilibrium in the game with Agent selection (Game 1) is an αSReg(T)\alpha\textsc{SReg}(T)-approximate equilibrium in the limited liability game (Game 2).

Proof.

Let pp be any pure non-responsive equilibrium of the Agent selection game. We show that no Agent can unilaterally deviate to improve their expected payoff by more than αSReg(T)\alpha\textsc{SReg}(T). Let pp^{\prime} be a strategy profile that is defined as p:i,pip_{:-i},p^{\prime}_{i}, i.e. is identical to pp except for an arbitrary deviation by Agent ii. The claim follows by two applications of Lemma 13:

\displaystyle\mathcal{U}^{Lim}_{i}(1,\emptyset,p_{j\neq i},p_{i})\geq\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},p_{i})
\displaystyle\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},p^{\prime}_{i})+\alpha\textsc{SReg}(T)\geq\mathcal{U}^{Lim}_{i}(1,\emptyset,p_{j\neq i},p^{\prime}_{i}).

As pp is assumed to be an exact Nash equilibrium of the general game, we also know

𝒰i(1,,pji,pi)𝒰i(1,,pji,pi).\displaystyle\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},p_{i})\geq\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},p^{\prime}_{i}).

Together, these give the claim:

𝒰iLim(1,,pji,pi)\displaystyle\mathcal{U}^{Lim}_{i}(1,\emptyset,p_{j\neq i},p_{i}) 𝒰iLim(1,,pji,pi)αSReg(T).\displaystyle\geq\mathcal{U}^{Lim}_{i}(1,\emptyset,p_{j\neq i},p^{\prime}_{i})-\alpha\textsc{SReg}(T).

Defining the policy space over which to evaluate policy regret is slightly more delicate now; previously, we assumed that all agents were playing a non-responsive equilibrium of the full liability game. However, now that we are in a limited liability setting, the incentives of agents may have shifted by regret terms. Thus, we have strong guarantees not over all non-responsive pure-strategy equilibria of this new game but over another reasonable set of equilibria the agents could play: the set of non-responsive pure-strategy equilibria of the general, full liability game, which we have shown are all approximate equilibria of the limited liability game.

Definition 36 (Policy Regret in the Limited-Liability Space).

For some set of agent beliefs BB and state of nature sequence y1:Ty^{1:T}, the policy regret of a Principal implementing selection mechanism ff in Game 2 is

maxi[k]minpipim𝒰p(gpii(y1:T))minppe(f,y)𝒰p(gf,pl(y1:T))\max_{i\in[k]}\min_{p_{i}\in p_{i}^{m}}\mathcal{U}_{p}(g^{i}_{p_{i}}(y^{1:T}))-\min_{p\in p^{e}(f,y)}\mathcal{U}_{p}(g^{l}_{f,p}(y^{1:T}))
Theorem 9.

When the Principal uses the MonoBandit selection mechanism (Algorithm 5) and employs a fixed, linear contract vαv_{\alpha}, for any non-responsive equilibrium pp of the general game:

  • pp is a 2αko(T)2\alpha\sqrt{k}\cdot o(T)-approximate equilibrium of the limited liability game

  • the Principal has policy regret at most 2αko(T)2\alpha\sqrt{k}\cdot o(T).

Proof.

By Theorem 11, MonoBandit(TreeSwap) is a monotone algorithm with swap regret bounded by 2ko(T)2\sqrt{k}\cdot o(T). Thus, by Theorem 8, if all agents play according to an equilibrium of the full liability game, this is a 2αko(T)2\alpha\sqrt{k}\cdot o(T)-approximate equilibrium of the limited liability game. Finally, we get the policy regret bound for the Principal by Theorem 4. ∎

6 Monotone No Swap-Regret Algorithms

Finally, it remains to establish the existence of a bandit no-regret learning algorithm that is simultaneously monotone and obtains a non-trivial swap regret guarantee against an adaptive adversary. First we note that even in the full information setting (when the learning algorithm gets to observe the entire loss vector, and not just the loss of the action they choose), standard no swap-regret algorithms fail to satisfy the monotonicity property.

Lemma 14.

The Blum-Mansour algorithm (Algorithm 3) is not monotone (Definition 24).

We verify this numerically by showing two loss sequences in which Blum-Mansour violates the monotonicity condition. This example can be found in Appendix C.1.

On the other hand, the very recent swap regret algorithm given by Dagan et al. (2023) in the full information setting is indeed monotone.

Lemma 15.

The TreeSwap algorithm (Algorithm 1 of Dagan et al. (2023)), when instantiated with a monotone no-external regret algorithm, is monotone (Definition 24).

The proof of this claim is deferred to Appendix C.2.

6.1 A Monotone Bandit No Swap-Regret Algorithm in the Adaptive Setting

For our application, however, we need a no swap-regret algorithm that is monotone and operates in the bandit setting. In this section we give a monotonicity-preserving reduction from the full-information setting to the bandit learning setting, such that when paired with a monotone swap-regret algorithm in the full information setting (like that of Dagan et al. (2023)) we obtain the object we need.

The MonoBandit algorithm preserves any combination of the following appealing properties from a full-feedback learning algorithm: monotonicity, sublinear swap regret, and sublinear external regret. These proofs are all deferred to Appendix C.

Lemma 16.

If SReg𝒜(T,)R(T)\textsc{SReg}_{\mathcal{A}}(T,\ell)\leq R(T) \forall\ell, then SRegMonoBandit(𝒜)(T,)2kTR(T)\textsc{SReg}_{MonoBandit(\mathcal{A})}(T,\ell)\leq 2\sqrt{k\cdot T\cdot R(T)}, \forall\ell.

Lemma 17.

If Reg𝒜(T,)R(T)\textsc{Reg}_{\mathcal{A}}(T,\ell)\leq R(T) \forall\ell, then RegMonoBandit(𝒜)(T,)2kTR(T)\textsc{Reg}_{MonoBandit(\mathcal{A})}(T,\ell)\leq 2\sqrt{k\cdot T\cdot R(T)}, \forall\ell.

Lemma 18.

If algorithm 𝒜\mathcal{A} is monotone, then MonoBandit(𝒜\mathcal{A}) is monotone.

Theorem 10.

MonoBandit(MW)MonoBandit(MW) is a monotone algorithm and

\textsc{Reg}_{MonoBandit(MW)}(T)\leq 2\sqrt{k\cdot\sqrt{\log(k)}}\cdot T^{\frac{3}{4}}.
Proof.

As Multiplicative Weights is a monotone algorithm with RegMW(T)=log(k)T\textsc{Reg}_{MW}(T)=\sqrt{\log(k)\cdot T}, by Lemmas 17 and 18, MonoBandit(MW) is monotone and

RegMonoBandit(MW)(T)\displaystyle\textsc{Reg}_{MonoBandit(MW)}(T) 2Tklog(k)T\displaystyle\leq 2\sqrt{T\cdot k\cdot\sqrt{\log(k)\cdot T}}
=2klog(k)T34.\displaystyle=2\sqrt{k\cdot\sqrt{\log(k)}}\cdot T^{\frac{3}{4}}.
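
A quick numeric sanity check of the simplification in the last step (purely illustrative):

# Numeric check that 2*sqrt(T*k*sqrt(log(k)*T)) equals 2*sqrt(k*sqrt(log(k)))*T**(3/4).
from math import log, sqrt, isclose

for k in (2, 10, 100):
    for T in (10, 1000, 1000000):
        lhs = 2 * sqrt(T * k * sqrt(log(k) * T))
        rhs = 2 * sqrt(k * sqrt(log(k))) * T ** 0.75
        assert isclose(lhs, rhs, rel_tol=1e-9), (k, T, lhs, rhs)
print("bound simplification checks out")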

Theorem 11.

MonoBandit(TreeSwap)MonoBandit(TreeSwap) is a monotone algorithm and

SRegMonoBandit(TreeSwap)(T)=2ko(T).\textsc{SReg}_{MonoBandit(TreeSwap)}(T)=2\sqrt{k}\cdot o(T).
Proof.

From Dagan et al. (2023), the algorithm TreeSwap has the property that SRegTreeSwap(T)=o(T)\textsc{SReg}_{TreeSwap}(T)=o(T).

Combining this with Lemma 15, we get that TreeSwap is a monotone algorithm with SRegTreeSwap(T)=o(T)\textsc{SReg}_{TreeSwap}(T)=o(T). Therefore, by Lemmas 16 and 18, MonoBandit(TreeSwap) is monotone and

SRegMonoBandit(TreeSwap)(T)\displaystyle\textsc{SReg}_{MonoBandit(TreeSwap)}(T) 2Tko(T)\displaystyle\leq 2\sqrt{T\cdot k\cdot o(T)}
=2ko(T2)\displaystyle=2\sqrt{k}\cdot\sqrt{o(T^{2})}
=2ko(T).\displaystyle=2\sqrt{k}\cdot o(T).

7 Conclusion and Discussion

In this paper we have studied the utility of a Principal in equilibrium in a complex kk-player, TT-stage extensive form Agent selection game, and have given mechanisms that ensure that in equilibrium, the Principal has no regret to the counterfactual world in which they had offered a guaranteed contract to the best fixed Agent in hindsight, even taking into account the Agent’s differing behavior in this world. The primary tools that we have used to derive guarantees in this complex setting are online learning algorithms and their properties. In particular, we have connected external regret guarantees for bandit learning algorithms paired with monotonicity with counterfactual policy regret guarantees in these games, and have connected swap regret guarantees with the ability to preserve the incentive properties of linear contracts in limited liability implementations. Moreover, we have established the existence of bandit learning algorithms that are simultaneously monotone and have swap regret guarantees, which instantiates our theorems with concrete mechanisms.

An objection that one can level at our limited liability mechanism is that it defers all payments until the final round. As such, the Principal can incur substantial debt to the Agent. It would be preferable if the Principal immediately paid any debt owed to the Agent, while still never needing to collect debt owed from the Agent to the Principal. We note that in such an implementation, the mechanism would still induce similar incentives to those of the full liability mechanism if the Principal’s selection rule had no adaptive swap regret — i.e. no swap regret not just overall (as summed from rounds 11 to TT), but also on every subsequence [t,T][t,T] simultaneously for all t<Tt<T. Algorithms with no adaptive swap regret exist — see e.g. Lee et al. (2022) for a generic recipe for deriving such algorithms — but we are unaware of any such algorithms that are simultaneously monotone, which is a necessary ingredient for our policy regret guarantees. This is a concrete setting where a new result in online learning (the existence of a monotone bandit algorithm obtaining adaptive swap regret guarantees) would have an immediate application in mechanism design.

References

  • Babaioff et al. [2009] Moshe Babaioff, Yogeshwer Sharma, and Aleksandrs Slivkins. Characterizing truthful multi-armed bandit mechanisms. In Proceedings of the 10th ACM conference on Electronic commerce, pages 79–88, 2009.
  • Blum and Mansour [2007] Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8(6), 2007.
  • Bolton and Dewatripont [2004] Patrick Bolton and Mathias Dewatripont. Contract theory. MIT press, 2004.
  • Camara et al. [2020] Modibo K Camara, Jason D Hartline, and Aleck Johnsen. Mechanisms for a no-regret agent: Beyond the common prior. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 259–270. IEEE, 2020.
  • Carroll [2015] Gabriel Carroll. Robustness and linear contracts. American Economic Review, 105(2):536–563, 2015.
  • Castiglioni et al. [2021] Matteo Castiglioni, Alberto Marchesi, and Nicola Gatti. Bayesian agency: Linear versus tractable contracts. arXiv e-prints, pages arXiv–2106, 2021.
  • Chassang [2013] Sylvain Chassang. Calibrated incentive contracts. Econometrica, 81(5):1935–1971, 2013.
  • Cohen et al. [2022] Alon Cohen, Argyrios Deligkas, and Moran Koren. Learning approximately optimal contracts. In International Symposium on Algorithmic Game Theory, pages 331–346. Springer, 2022.
  • Collina et al. [2023] Natalie Collina, Aaron Roth, and Han Shao. Efficient prior-free mechanisms for no-regret agents. arXiv preprint arXiv:2311.07754, 2023.
  • Dagan et al. [2023] Yuval Dagan, Constantinos Daskalakis, Maxwell Fishelson, and Noah Golowich. From external to swap regret 2.0: An efficient reduction and oblivious adversary for large action spaces, 2023.
  • Dütting et al. [2019] Paul Dütting, Tim Roughgarden, and Inbal Talgam-Cohen. Simple versus optimal contracts. In Proceedings of the 2019 ACM Conference on Economics and Computation, EC ’19, page 369–387, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450367929. doi: 10.1145/3328526.3329591. URL https://doi.org/10.1145/3328526.3329591.
  • Dütting et al. [2022] Paul Dütting, Tomer Ezra, Michal Feldman, and Thomas Kesselheim. Combinatorial contracts. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 815–826. IEEE, 2022.
  • Glicksberg [1951] I. L. Glicksberg. A further generalization of the Kakutani fixed point theorem, with applications to Nash equilibrium points, 1951.
  • Grossman and Hart [1992] Sanford J Grossman and Oliver D Hart. An analysis of the principal-agent problem. In Foundations of Insurance Economics: Readings in Economics and Finance, pages 302–340. Springer, 1992.
  • Ho et al. [2014] Chien-Ju Ho, Aleksandrs Slivkins, and Jennifer Wortman Vaughan. Adaptive contract design for crowdsourcing markets: Bandit algorithms for repeated principal-agent problems. In Proceedings of the fifteenth ACM conference on Economics and computation, pages 359–376, 2014.
  • Holmström [1979] Bengt Holmström. Moral hazard and observability. The Bell journal of economics, pages 74–91, 1979.
  • Lee et al. [2022] Daniel Lee, Georgy Noarov, Mallesh Pai, and Aaron Roth. Online minimax multiobjective optimization: Multicalibeating and other applications. Advances in Neural Information Processing Systems, 35:29051–29063, 2022.
  • Peng and Rubinstein [2023] Binghui Peng and Aviad Rubinstein. Fast swap regret minimization and applications to approximate correlated equilibria. arXiv preprint arXiv:2310.19647, 2023.
  • Zhu et al. [2022] Banghua Zhu, Stephen Bates, Zhuoran Yang, Yixin Wang, Jiantao Jiao, and Michael I Jordan. The sample complexity of online contract design. arXiv preprint arXiv:2211.05732, 2022.

Appendix A Proofs from Section 3.2

A.1 Proof of Lemma 1

Proof of Lemma 1.
\displaystyle\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},\hat{p}_{i})=\mathbb{E}_{\pi^{1:T}\sim g^{r,1:T}_{f}(B^{1}_{i})}\left[\sum_{t^{\prime}=1}^{T}\mathbb{I}[(f^{t^{\prime}}(\pi^{p,1:t^{\prime}-1},\mathcal{R}^{t^{\prime}})=i)]\cdot u_{i}(\hat{p}^{t^{\prime}}_{i}(\pi^{1:t^{\prime}-1}),y^{t^{\prime}})\right]
=πΠr,1:t1Pr[gfr(Bi1)=π](𝒰(π)+𝒰i(t,π,pji,p^i))\displaystyle=\sum_{\pi^{*}\in\Pi^{r,1:t-1}}\Pr[g^{r}_{f}(B^{1}_{i})=\pi^{*}]\cdot(\mathcal{U}(\pi^{*})+\mathcal{U}_{i}(t,\pi^{*},p_{j\neq i},\hat{p}_{i}))
=ππr,1:t1Πr,1:t1Pr[gfr(Bi1)=π](𝒰(π)+𝒰i(t,π,pji,p^i))\displaystyle=\sum_{\pi^{*}\neq\pi^{r,1:t-1}\in\Pi^{r,1:t-1}}\Pr[g^{r}_{f}(B^{1}_{i})=\pi^{*}]\cdot(\mathcal{U}(\pi^{*})+\mathcal{U}_{i}(t,\pi^{*},p_{j\neq i},\hat{p}_{i}))
+Pr[gfr,1:t1(Bi1)=πr,1:t1](𝒰(πr,1:t1)+𝒰i(t,πr,1:t1,pji,p^i))\displaystyle\quad\quad+\Pr[g^{r,1:t-1}_{f}(B^{1}_{i})=\pi^{r,1:t-1}]\cdot(\mathcal{U}(\pi^{r,1:t-1})+\mathcal{U}_{i}(t,\pi^{r,1:t-1},p_{j\neq i},\hat{p}_{i}))
=ππr,1:t1Πr,1:t1Pr[gfr(Bi1)=π](𝒰(π)+𝒰i(t,π,pji,pi))\displaystyle=\sum_{\pi^{*}\neq\pi^{r,1:t-1}\in\Pi^{r,1:t-1}}\Pr[g^{r}_{f}(B^{1}_{i})=\pi^{*}]\cdot(\mathcal{U}(\pi^{*})+\mathcal{U}_{i}(t,\pi^{*},p_{j\neq i},p_{i}))
+Pr[gfr,1:t1(Bi1)=πr,1:t1](𝒰(πr,1:t1)+𝒰i(t,πr,1:t1,pji,p^i))\displaystyle\quad\quad+\Pr[g^{r,1:t-1}_{f}(B^{1}_{i})=\pi^{r,1:t-1}]\cdot(\mathcal{U}(\pi^{r,1:t-1})+\mathcal{U}_{i}(t,\pi^{r,1:t-1},p_{j\neq i},\hat{p}_{i})) (as p^i\hat{p}_{i} is equivalent to pip_{i} under any transcript prefixes that are not πr,1:t1\pi^{r,1:t-1})

Therefore,

𝒰i(1,,pji,p^i)\displaystyle\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},\hat{p}_{i}) 𝒰i(1,,pji,pi)\displaystyle-\mathcal{U}_{i}(1,\emptyset,p_{j\neq i},p_{i})
=ππr,1:t1Πr,1:t1Pr[gfr(Bi1)=π](𝒰(π)+𝒰i(t,π,pji,pi))\displaystyle=\sum_{\pi^{*}\neq\pi^{r,1:t-1}\in\Pi^{r,1:t-1}}\Pr[g^{r}_{f}(B^{1}_{i})=\pi^{*}]\cdot\left(\mathcal{U}(\pi^{*})+\mathcal{U}_{i}(t,\pi^{*},p_{j\neq i},p_{i})\right)
+Pr[gfr,1:t1(Bi1)=πr,1:t1](𝒰(πr,1:t1)+𝒰i(t,πr,1:t1,pji,p^i))\displaystyle\quad\quad\quad+\Pr[g^{r,1:t-1}_{f}(B^{1}_{i})=\pi^{r,1:t-1}]\cdot\left(\mathcal{U}(\pi^{r,1:t-1})+\mathcal{U}_{i}(t,\pi^{r,1:t-1},p_{j\neq i},\hat{p}_{i})\right)
ππr,1:t1Πr,1:t1Pr[gfr(Bi1)=π](𝒰(π)+𝒰i(t,π,pji,pi))\displaystyle\quad\quad\quad-\sum_{\pi^{*}\neq\pi^{r,1:t-1}\in\Pi^{r,1:t-1}}\Pr[g^{r}_{f}(B^{1}_{i})=\pi^{*}]\cdot\left(\mathcal{U}(\pi^{*})+\mathcal{U}_{i}(t,\pi^{*},p_{j\neq i},p_{i})\right)
Pr[gfr,1:t1(Bi1)=πr,1:t1](𝒰(πr,1:t1)+𝒰i(t,πr,1:t1,pji,pi))\displaystyle\quad\quad\quad-\Pr[g^{r,1:t-1}_{f}(B^{1}_{i})=\pi^{r,1:t-1}]\cdot\left(\mathcal{U}(\pi^{r,1:t-1})+\mathcal{U}_{i}(t,\pi^{r,1:t-1},p_{j\neq i},p_{i})\right)
=Pr[gfr,1:t1(Bi1)=πr,1:t1](𝒰i(t,πr,1:t1,pji,p^i)𝒰i(t,πr,1:t1,pji,pi)).\displaystyle=\Pr[g^{r,1:t-1}_{f}(B^{1}_{i})=\pi^{r,1:t-1}]\cdot\left(\mathcal{U}_{i}(t,\pi^{r,1:t-1},p_{j\neq i},\hat{p}_{i})-\mathcal{U}_{i}(t,\pi^{r,1:t-1},p_{j\neq i},p_{i})\right).

A.2 Proof of Lemma 5

Proof of Lemma 5.

By Theorem 2, we know that the restricted game has a mixed strategy Nash equilibrium. Let us call this equilibrium strategy profile pp^{*}. We will now show that there is also an exact pure strategy Nash equilibrium in the restricted game. To do this, we will alter pp^{*} until it is pure, while still ensuring that it is a Nash equilibrium. First, we will consider some transcript prefix π¯r,1:t1\bar{\pi}^{r,1:t-1} with states of nature y¯1:t1\bar{y}^{1:t-1} in which player ii’s policy is outputting a distribution, and we will modify their policy to p^\hat{p} so that they instead deterministically output the expectation of their effort levels in pi(πr,1:t1)p^{*}_{i}(\pi^{r,1:t-1}). By Lemma 1, this change will be confined only to the Agent utility at that subgame.

Recall that an Agent ii’s expected payoff of playing pip_{i} is as follows:

𝔼πr,t:Tgf,pr(π¯r,1:t1,Bit,t:T(y¯1:t1))[t=tT𝕀[f(πp,1:t1,t)=i]ui(pit(πr,1:t1),yt)]\displaystyle\mathbb{E}_{\pi^{r,t:T}\sim g^{r}_{f,p}(\bar{\pi}^{r,1:t-1},B^{t,t:T}_{i}(\bar{y}^{1:t-1}))}\left[\sum_{t^{\prime}=t}^{T}\mathbb{I}[f(\pi^{p,1:t^{\prime}-1},\mathcal{R}^{t})=i]\cdot u_{i}(p^{t^{\prime}}_{i}(\pi^{r,1:t^{\prime}-1}),y^{t^{\prime}})\right]
\displaystyle=\mathbb{E}_{\pi^{r,1:t}\sim g^{r,1:t}_{f,p}(\bar{\pi}^{r,1:t-1},B^{t,t}_{i}(\bar{y}^{1:t-1}))}\left[\mathbb{I}[f(\pi^{p,1:t-1},\mathcal{R}^{t})=i]\cdot u_{i}(p^{t}_{i}(\pi^{r,1:t-1}),y^{t})\right]
+𝔼πr,t:Tgf,pr(π¯r,1:t1,Bit,t:T(y¯1:t1))[t=t+1T𝕀[f(πp,1:t1,t)=i]ui(pit(πr,1:t1),yt)].\displaystyle\quad+\mathbb{E}_{\pi^{r,t:T}\sim g^{r}_{f,p}(\bar{\pi}^{r,1:t-1},B^{t,t:T}_{i}(\bar{y}^{1:t-1}))}\left[\sum_{t^{\prime}=t+1}^{T}\mathbb{I}[f(\pi^{p,1:t^{\prime}-1},\mathcal{R}^{t})=i]\cdot u_{i}(p^{t^{\prime}}_{i}(\pi^{r,1:t^{\prime}-1}),y^{t^{\prime}})\right].

We will first prove that if Agent ii plays a pure strategy given by the expectation of the mixed strategy pi,tp^{*,t}_{i} against pi,tp^{*,t}_{-i}, they will receive only higher utility in expectation. Note that player ii changing their actions will not affect the actions of any other players under pip^{*}_{-i}, as all players are playing restricted policies. Thus, we can consider the actions of the other k1k-1 Agents as fixed when considering the payoff change for Agent ii. We consider the two parts of player ii’s utility separately.

First, we will consider the difference in their immediate payoff in round tt when switching from their original mixed strategy pp to the deterministic strategy p^\hat{p}. Note that before round tt, the policies pip_{i} and p^i\hat{p}_{i} are identical.

\displaystyle\mathbb{E}_{\pi^{r,1:t}\sim g^{r,1:t}_{f}(\bar{\pi}^{r,1:t-1},B^{t,t}_{i}(\bar{y}^{1:t-1}))}\left[\mathbb{I}[f(\pi^{p,1:t-1},\mathcal{R}^{t})=i]\cdot u_{i}(\hat{p}^{t}_{i}(\pi^{r,1:t-1}),y^{t})\right]
\displaystyle-\mathbb{E}_{\pi^{r,1:t}\sim g^{r,1:t}_{f}(\bar{\pi}^{r,1:t-1},B^{t,t}_{i}(\bar{y}^{1:t-1}))}\left[\mathbb{I}[f(\pi^{p,1:t-1},\mathcal{R}^{t})=i]\cdot u_{i}(p^{t}_{i}(\pi^{r,1:t-1}),y^{t})\right]
\displaystyle=\mathbb{E}_{\pi^{r,1:t}\sim g^{r,1:t}_{f}(\bar{\pi}^{r,1:t-1},B^{t,t}_{i}(\bar{y}^{1:t-1}))}\left[\mathbb{I}[f(\pi^{p,1:t-1},\mathcal{R}^{t})=i]\cdot\left(u_{i}(\hat{p}^{t}_{i}(\pi^{r,1:t-1}),y^{t})-u_{i}(p^{t}_{i}(\pi^{r,1:t-1}),y^{t})\right)\right].

Recall that the probability of an outcome oo for Agent ii given action aa and state of nature yy is pi,o(a,y)=mi,o,ya+bi,o,yp_{i,o}(a,y)=m_{i,o,y}\cdot a+b_{i,o,y}. We can now write the difference in the expected single round utility for Agent ii when deviating from pitp^{t}_{i} to p^it\hat{p}^{t}_{i}.

𝔼a^p^it(π1:t1)[ui(a^,yt)]𝔼apit(π1:t1)[ui(a,yt)]\displaystyle\mathbb{E}_{\hat{a}\sim\hat{p}^{t}_{i}(\pi^{1:t-1})}[u_{i}(\hat{a},y^{t})]-\mathbb{E}_{a\sim p^{t}_{i}(\pi^{1:t-1})}[u_{i}(a,y^{t})]
=𝔼a^p^it(π1:t1)[v¯(i,a^,yt)ci(a^)]𝔼apit(π1:t1)[v¯(i,a,yt)ci(a)].\displaystyle=\mathbb{E}_{\hat{a}\sim\hat{p}^{t}_{i}(\pi^{1:t-1})}[\bar{v}(i,\hat{a},y^{t})-c_{i}(\hat{a})]-\mathbb{E}_{a\sim p^{t}_{i}(\pi^{1:t-1})}[\bar{v}(i,a,y^{t})-c_{i}(a)].

By the concavity of v()v(\cdot) and the convexity of ci()c_{i}(\cdot), this value is always non-negative. Thus, the expected immediate payoff change for Agent ii from moving from action pitp^{t}_{i} to p^it\hat{p}^{t}_{i} in round tt is non-negative.
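
The non-negativity here is an instance of Jensen's inequality: replacing a mixed effort level by its mean weakly raises a concave expected payment and weakly lowers a convex expected cost. A small numeric illustration, with an illustrative concave payment and quadratic cost that are not taken from the paper:

# Numeric illustration of the Jensen step: for concave payment v and convex
# cost c, playing the mean of a mixed action gives at least the mixed action's
# expected utility. The payment sqrt(a) and cost a**2 are illustrative choices.
import random

random.seed(0)
v = lambda a: a ** 0.5        # concave expected payment as a function of effort
c = lambda a: a ** 2          # convex cost of effort

mixed = [random.random() for _ in range(1000)]   # a mixed strategy over effort levels
mean_effort = sum(mixed) / len(mixed)

expected_mixed = sum(v(a) - c(a) for a in mixed) / len(mixed)
deterministic = v(mean_effort) - c(mean_effort)
assert deterministic >= expected_mixed           # Jensen: concave v, convex c
print(deterministic, expected_mixed)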

Next, consider the change in the expected continuation payoff at round tt. Note that for rounds t>tt^{\prime}>t, the strategies pitp^{t^{\prime}}_{i} and p^it\hat{p}^{t^{\prime}}_{i} are identical once again.

\displaystyle\mathbb{E}_{\pi^{r,t:T}\sim g^{r,1:T}_{f,p_{j\neq i},\hat{p}_{i}}(\bar{\pi}^{r,1:t-1},B^{t,t:T}_{i}(\bar{y}^{1:t-1}))}\left[\sum_{t^{\prime}=t+1}^{T}\mathbb{I}[f(\pi^{p,1:t^{\prime}-1},\mathcal{R}^{t^{\prime}})=i]\cdot u_{i}(p^{t^{\prime}}_{i}(\pi^{r,1:t^{\prime}-1}),y^{t^{\prime}})\right]
\displaystyle-\mathbb{E}_{\pi^{r,t:T}\sim g^{r,1:T}_{f,p_{j\neq i},p_{i}}(\bar{\pi}^{r,1:t-1},B^{t,t:T}_{i}(\bar{y}^{1:t-1}))}\left[\sum_{t^{\prime}=t+1}^{T}\mathbb{I}[f(\pi^{p,1:t^{\prime}-1},\mathcal{R}^{t^{\prime}})=i]\cdot u_{i}(p^{t^{\prime}}_{i}(\pi^{r,1:t^{\prime}-1}),y^{t^{\prime}})\right].

By Lemma 3, the distributions over Principal transcripts in both expressions are equal. Since Agents are playing restricted policies, no future Agent actions will change if Agent ii moves from strategy pip_{i} to p^i\hat{p}_{i}. Thus, this difference is also non-negative.

Therefore, under any mixed strategy Nash equilibrium, a player ii can always modify one of their actions in some subgame to be deterministic while only increasing their payoff, and so they remain best responding. Furthermore, by Lemma 4, all other non-responsive Agents remain best responding.

Thus, we can iteratively turn a mixed strategy Nash equilibrium in the restricted game into a pure strategy Nash equilibrium. ∎

Appendix B Remainder of Section 4

B.1 Proof of Lemma 10

Proof of Lemma 10.

Suppose for contradiction that this is not the case. Then, there is some strategy profile p1kp^{*}_{1\ldots k} which is a non-responsive, pure strategy Nash equilibrium where there exists some Agent i[k]i\in[k] such that their action pit(π1:t1)p^{t}_{i}(\pi^{1:t-1}) given some transcript prefix π1:t1\pi^{1:t-1} is strictly less than the effort of a myopic optimal action p~it\tilde{p}^{t}_{i}. Also let y¯1:t1\bar{y}^{1:t-1} be the state of nature sequence associated with this transcript prefix. Call p¯i\bar{p}_{i} the Agent policy where Agent ii does exert the myopic optimal effort p~it(π1:t1)\tilde{p}^{t}_{i}(\pi^{1:t-1}) under the prefix π1:t1\pi^{1:t-1} and otherwise behaves exactly as pip^{*}_{i}. As pp^{*} is a best response, we have that

\displaystyle\mathcal{U}_{i}(1,\emptyset,p^{*}_{j\neq i},\bar{p}_{i})-\mathcal{U}_{i}(1,\emptyset,p^{*}_{j\neq i},p^{*}_{i})\leq 0
\displaystyle\rightarrow\mathbb{P}(g_{f,p^{*}_{j\neq i},\bar{p}_{i}}(B^{1}_{i})=\pi^{r,1:t-1})\cdot\left(\mathcal{U}_{i}(t,\pi^{r,1:t-1},p^{*}_{j\neq i},\bar{p}_{i})-\mathcal{U}_{i}(t,\pi^{r,1:t-1},p^{*}_{j\neq i},p^{*}_{i})\right)\leq 0
\displaystyle\rightarrow\mathcal{U}_{i}(t,\pi^{r,1:t-1},p^{*}_{j\neq i},\bar{p}_{i})-\mathcal{U}_{i}(t,\pi^{r,1:t-1},p^{*}_{j\neq i},p^{*}_{i})\leq 0,

where the first implication follows from Lemma 7, and the second follows from the assumption that B^{1}_{i}(\cdot) places non-zero probability on the realized sequence of states of nature, together with the fact that the transcript generating function (Definition 12) assigns positive probability to every valid transcript consistent with the distribution over states of nature that it is given – where the additional randomness comes from the Principal's selections and from the sampling of the selected Agent's outcome distribution in each round.

Thus, to derive a contradiction it is sufficient to prove that \mathcal{U}_{i}(t,\pi^{r,1:t-1},p^{*}_{j\neq i},\bar{p}_{i})>\mathcal{U}_{i}(t,\pi^{r,1:t-1},p^{*}_{j\neq i},p^{*}_{i}).

Observe that Agent ii can weakly improve their expected immediate payoff at round tt, by definition of a myopic optimal action.

Agent ii can also improve their continuation payoff for future rounds t>tt^{\prime}>t, denoted

𝔼πt:Tgf,p(π¯1:t1,Bit,t:T(y¯1:t1))[t=t+1T𝕀[ft(πp,1:t1,t)=i]ui(pit(π1:t1),yt)],\displaystyle\mathbb{E}_{\pi^{t:T}\sim g_{f,p}(\bar{\pi}^{1:t-1},B^{t,t:T}_{i}(\bar{y}^{1:t-1}))}\left[\sum_{t^{\prime}=t+1}^{T}\mathbb{I}[f^{t}(\pi^{p,1:t^{\prime}-1},\mathcal{R}^{t})=i]\cdot u_{i}(p^{t^{\prime}}_{i}(\pi^{1:t^{\prime}-1}),y^{t^{\prime}})\right],

by deviating to play \tilde{p}^{t}_{i} in round t. Since all Agents are playing non-responsive strategies, the only effect Agent i's action in round t has on their continuation payoff is through its effect on the probability that they are selected in future rounds. This selection probability depends only on the returns that Agent i realizes, as the Principal's selection mechanism takes as input only the history of realized returns, and it is monotonically increasing in Agent i's effort in round t, by the monotonicity of the Principal's selection mechanism (Definition 24) together with Assumption 3, which establishes that increasing the effort level of Agent i in a single round t leads to higher expected returns.

In particular, let a_{1} be the action played deterministically by strategy p^{*,t}_{i} and a_{2} be the higher effort level played deterministically by strategy \tilde{p}^{t}_{i}. Then, it is the case that

𝔼o𝒟i,y,a2[r(o)]𝔼o𝒟i,y,a1[r(o)].\displaystyle\mathbb{E}_{o\sim\mathcal{D}_{i,y,a_{2}}}[r(o)]\geq\mathbb{E}_{o\sim\mathcal{D}_{i,y,a_{1}}}[r(o)].

Thus, we know that Agent ii’s deviation will only increase their continuation payoff via an increased selection probability in future rounds.

Therefore, we have shown that if Agent ii is playing a policy pip^{*}_{i} that in some round tt plays an effort level less than p~it\tilde{p}^{t}_{i} they can unilaterally deviate to strictly improve their payoff, and thus the strategy profile p1kp^{*}_{1\ldots k} cannot be a non-responsive Nash equilibrium. ∎

B.2 Proof of Lemma 9

Proof of Lemma 9.

Suppose for the sake of contradiction that there is some equilibrium strategy for agent i, denoted p_{i}, that is not a myopic optimal policy. Then there is some round t and transcript prefix \hat{\pi}^{1:t-1} at which p_{i} does not play a myopic optimal action from the set \tilde{A}^{t}_{i}. Let the agent's expected immediate payoff at this round fall short of the myopic optimal payoff by \epsilon>0. We consider the policy \hat{p}_{i} such that \hat{p}_{i}(\hat{\pi}^{1:t-1}) is a myopic optimal action from \tilde{A}^{t}_{i}, and \hat{p}_{i}(\pi)=p_{i}(\pi) for all \pi\neq\hat{\pi}^{1:t-1}. By Lemma 7, the difference in the expected utility of the agent under \hat{p}_{i} and p_{i} is exactly

Pr[gf(Bi1)\displaystyle\Pr[g_{f}(B^{1}_{i}) =π^1:t1](𝒰i(t,π^1:t1,pji,p^i)𝒰i(t,π^1:t1,pji,pi))\displaystyle=\hat{\pi}^{1:t-1}]\cdot\left(\mathcal{U}_{i}(t,\hat{\pi}^{1:t-1},p_{j\neq i},\hat{p}_{i})-\mathcal{U}_{i}(t,\hat{\pi}^{1:t-1},p_{j\neq i},p_{i})\right)
=Pr[gf(Bi1)=π^1:t1](𝔼πtgf(π^1:t1,Bit,t(y¯1:t1))[ui(p^i(π^1:t1),yt)ui(pi(π^1:t1),yt)\displaystyle=\Pr[g_{f}(B^{1}_{i})=\hat{\pi}^{1:t-1}]\cdot\big{(}\mathbb{E}_{\pi^{t}\sim g_{f}(\hat{\pi}^{1:t-1},B^{t,t}_{i}(\bar{y}^{1:t-1}))}[u_{i}(\hat{p}_{i}(\hat{\pi}^{1:t-1}),y^{t})-u_{i}(p_{i}(\hat{\pi}^{1:t-1}),y^{t})
+𝔼πt+1:Tgf(π^1:t1πt,Bit:T,t(y¯1:t1))[𝒰i(t+1,π^1:t1πt,pji,p^i)𝒰i(t+1,π^1:t1πt,pji,pi)]])\displaystyle\quad\quad+\mathbb{E}_{\pi^{t+1:T}\sim g_{f}(\hat{\pi}^{1:t-1}\cup\pi^{t},B^{t:T,t}_{i}(\bar{y}^{1:t-1}))}[\mathcal{U}_{i}(t+1,\hat{\pi}^{1:t-1}\cup\pi^{t},p_{j\neq i},\hat{p}_{i})-\mathcal{U}_{i}(t+1,\hat{\pi}^{1:t-1}\cup\pi^{t},p_{j\neq i},p_{i})]]\big{)}
\displaystyle=\Pr[g_{f}(B^{1}_{i})=\hat{\pi}^{1:t-1}]\cdot\left(\mathbb{E}_{\pi^{t}\sim g_{f}(\hat{\pi}^{1:t-1},B^{t,t}_{i}(\bar{y}^{1:t-1}))}[u_{i}(\hat{p}_{i}(\hat{\pi}^{1:t-1}),y^{t})-u_{i}(p_{i}(\hat{\pi}^{1:t-1}),y^{t})]\right) (By the fact that the action of the agent in the current round has no impact on their payoff in future rounds)
=Pr[gf(Bi1)=π^1:t1]ϵ\displaystyle=\Pr[g_{f}(B^{1}_{i})=\hat{\pi}^{1:t-1}]\cdot\epsilon
>0\displaystyle>0 (By the fact that the agent has full support beliefs.)

As agent i can strictly improve their payoff by deviating to \hat{p}_{i}, p_{i} is in fact not an equilibrium strategy, which yields the desired contradiction. ∎

Appendix C Monotone No Swap-Regret Algorithms

Algorithm 3 The Blum-Mansour Swap Regret Algorithm Blum and Mansour [2007]
  Input learning rate η\eta
  Initialize kk copies of Algorithm 4 with parameter η\eta, where for each Agent i[k]i\in[k] we maintain a distribution qitq^{t}_{i}
  for t[T]t\in[T] do
     Experience loss for each Agent, denoted lt=(lt1,,ltk)l_{t}=(l_{t}^{1},\ldots,l_{t}^{k})
     Report the loss vector p_{t}^{i}\cdot l_{t} to the i^{\text{th}} copy of Algorithm 4
     Update q_{t}^{i} for each of the k copies i\in[k] of Algorithm 4
     Combine the q_{t}^{i} for i\in[k] into a single probability distribution p_{t} over the k Agents by finding the distribution p_{t} satisfying Ap_{t}=p_{t}, where A is the k\times k matrix whose i^{\text{th}} column is q_{t}^{i} (that is, p_{t} is a stationary distribution of the Markov chain defined by the q_{t}^{i})
  end for

Algorithm 4 Exponential Weights Algorithm
  Input learning rate η\eta
  For each Agent i[k]i\in[k], initialize wi1=1w^{1}_{i}=1
  Define w^{t}=\sum_{i\in[k]}w^{t}_{i} for each round t\in[T]
  for t[T]t\in[T] do
     Select Agent i^{t} from the distribution p^{t}, where p^{t}_{i}=w^{t}_{i}/w^{t} for all i\in[k]
     Observe loss ltl^{t}
     Update weight as wit+1=wit(exp(ηlit))w^{t+1}_{i}=w^{t}_{i}(\exp{(-\eta\cdot l^{t}_{i})}) for all i[k]i\in[k]
  end for
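For concreteness, the following Python sketch gives one possible implementation of Algorithms 3 and 4 together (the function names blum_mansour and stationary_distribution are ours, and the update order – form p_t from the current weights, observe l_t, then update each copy – is one natural reading of the pseudocode above).

import numpy as np

def stationary_distribution(Q):
    # Find p with A p = p, where A = Q^T (the i-th column of A is q_t^i), together with sum(p) = 1.
    # p is a stationary distribution of the Markov chain whose i-th row of transition probabilities is q_t^i.
    k = Q.shape[0]
    A = np.vstack([Q.T - np.eye(k), np.ones(k)])
    b = np.zeros(k + 1)
    b[-1] = 1.0
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    p = np.clip(p, 0.0, None)
    return p / p.sum()

def blum_mansour(losses, eta):
    # losses has shape (T, k); returns the length-T sequence of master distributions p_t.
    T, k = losses.shape
    W = np.ones((k, k))                        # W[i] is the weight vector kept by the i-th copy of Algorithm 4
    history = []
    for t in range(T):
        Q = W / W.sum(axis=1, keepdims=True)   # row i of Q is q_t^i, the distribution of the i-th copy
        p = stationary_distribution(Q)         # the master distribution played at round t
        history.append(p)
        l = losses[t]
        for i in range(k):
            W[i] = W[i] * np.exp(-eta * p[i] * l)   # copy i is updated with the scaled loss vector p_t^i * l_t
    return np.array(history)

Here the fixed point Ap_{t}=p_{t} is computed by a least-squares solve of the stationarity and normalization constraints; any method for finding a stationary distribution of the chain defined by the q_{t}^{i} would serve equally well.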

C.1 Proof of Lemma 14

Proof.

Below we give an example with T=100 and k=3 actions. Note that, given a fixed loss sequence, the probability distributions that Blum-Mansour plays in each round are deterministic, so we can compute these probabilities numerically to establish non-monotonicity.

Let the first 50 loss vectors of l_{1} each be [-0.1,1,0], and let the final 50 loss vectors of l_{1} each be [1,-1,0]. Let l_{2} be exactly the same as l_{1}, except that l_{2}^{1}=[-2,1,0], so that action 1 has a strictly lower loss in round 1 under l_{2}. Then, when running Blum-Mansour as defined in Algorithm 3 with \eta=0.2, the probability distributions maintained over the actions in rounds 1:T under l_{1} (rows are rounds, columns are actions; the middle rounds are truncated for clarity) are

[0.342155640.317962160.339882190.350853860.302944220.346201920.359439210.288261340.352299450.367919830.273901920.358178250.376300810.259860540.363838650.028540180.705270450.266189360.023313780.733398730.243287490.018731130.760826390.220442480.014802890.787218320.197978790.011511080.812264420.1762245]\begin{bmatrix}0.34215564&0.31796216&0.33988219\\ 0.35085386&0.30294422&0.34620192\\ 0.35943921&0.28826134&0.35229945\\ 0.36791983&0.27390192&0.35817825\\ 0.37630081&0.25986054&0.36383865\\ ...&...&...\\ 0.02854018&0.70527045&0.26618936\\ 0.02331378&0.73339873&0.24328749\\ 0.01873113&0.76082639&0.22044248\\ 0.01480289&0.78721832&0.19797879\\ 0.01151108&0.81226442&0.1762245\\ \end{bmatrix}

While the corresponding distributions under l2l_{2} are

[0.380281780.289137960.330580270.389243220.274675750.336081030.398097920.260528290.341373790.406847860.246696520.346455630.415492090.233186490.351321420.029110130.720013360.250876510.023594280.748412940.227992780.01880280.775857840.205339360.014737060.802002570.183260370.011366430.826547060.1620865]\begin{bmatrix}0.38028178&0.28913796&0.33058027\\ 0.38924322&0.27467575&0.33608103\\ 0.39809792&0.26052829&0.34137379\\ 0.40684786&0.24669652&0.34645563\\ 0.41549209&0.23318649&0.35132142\\ ...&...&...\\ 0.02911013&0.72001336&0.25087651\\ 0.02359428&0.74841294&0.22799278\\ 0.0188028&0.77585784&0.20533936\\ 0.01473706&0.80200257&0.18326037\\ 0.01136643&0.82654706&0.1620865\\ \end{bmatrix}

Finally, we can look at the difference in move probabilities, computed as the probabilities under l_{2} minus those under l_{1}:

[2.90528236e021.40423773e021.50104462e022.94279178e021.38062655e021.56216522e022.98040066e021.35855831e021.62184235e023.01780959e021.33736323e021.68044637e023.05470484e021.31640258e021.73830227e025.69949881e041.47429088e021.53128587e022.80500303e041.50142094e021.52947097e027.16695363e051.50314536e021.51031232e026.58236902e051.47842499e021.47184262e021.44649313e041.42826443e021.41379950e02]\begin{bmatrix}2.90528236e-02&-1.40423773e-02&-1.50104462e-02\\ 2.94279178e-02&-1.38062655e-02&-1.56216522e-02\\ 2.98040066e-02&-1.35855831e-02&-1.62184235e-02\\ 3.01780959e-02&-1.33736323e-02&-1.68044637e-02\\ 3.05470484e-02&-1.31640258e-02&-1.73830227e-02\\ ...&...&...\\ 5.69949881e-04&1.47429088e-02&-1.53128587e-02\\ 2.80500303e-04&1.50142094e-02&-1.52947097e-02\\ 7.16695363e-05&1.50314536e-02&-1.51031232e-02\\ -6.58236902e-05&1.47842499e-02&-1.47184262e-02\\ -1.44649313e-04&1.42826443e-02&-1.41379950e-02\\ \end{bmatrix}

Note that at rounds t=99t=99 and t=100t=100, Blum-Mansour under l1l_{1} plays action 11 with a higher probability than Blum-Mansour under l2l_{2}. This violates the monotonicity condition. ∎
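For reference, this computation can be reproduced with a short script along the following lines, reusing the blum_mansour sketch given after Algorithm 4 (the exact values may differ slightly depending on implementation details, such as how the stationary distribution is computed):

import numpy as np

# The loss sequences from the proof above: T = 100 rounds and k = 3 actions.
l1 = np.array([[-0.1, 1.0, 0.0]] * 50 + [[1.0, -1.0, 0.0]] * 50)
l2 = l1.copy()
l2[0] = [-2.0, 1.0, 0.0]          # round 1 is the only round on which l2 differs, with a lower loss for action 1

P1 = blum_mansour(l1, eta=0.2)    # blum_mansour as defined in the sketch following Algorithm 4
P2 = blum_mansour(l2, eta=0.2)

# Monotonicity would require action 1 to be played with (weakly) higher probability under l2 in every round;
# the proof exhibits rounds 99 and 100 where the difference has the opposite sign.
print(np.round(P2 - P1, 6)[-5:])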

C.2 Proof of Lemma 15

Proof of Lemma 15.

We begin with a general description of the TreeSwap algorithm; refer to Dagan et al. [2023] for a more detailed description. TreeSwap maintains multiple instances of a base no-external-regret algorithm, which we refer to as Alg. Let T be the total number of rounds that TreeSwap is run. Each instance of Alg is lazily updated: each instance has some period length m and is updated only T/m times, at the end of each period, using the average of all the loss vectors given to the algorithm since the previous update. In each round t, the action output by TreeSwap is the uniform mixture over the distributions output by a fixed number of these instances. The claim follows immediately from the fact that averaging preserves monotonicity: neither the lazy updating of each instance of Alg, nor playing the average of the distributions output by multiple instances of Alg in each round, violates the monotonicity of the base algorithm. ∎
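The following Python sketch illustrates only the two structural features used in this argument – lazy updates with period m and a uniform mixture of the instances' outputs – and is not the full tree construction of Dagan et al. [2023]; all class and function names here are ours, with a simple exponential-weights rule standing in for Alg.

import numpy as np

class ExpWeights:
    # A minimal exponential-weights base algorithm, standing in for Alg.
    def __init__(self, k, eta):
        self.w, self.eta = np.ones(k), eta
    def distribution(self):
        return self.w / self.w.sum()
    def update(self, loss):
        self.w = self.w * np.exp(-self.eta * np.asarray(loss))

class LazyInstance:
    # An instance of the base algorithm that is updated only once per period of length m,
    # using the average of the loss vectors accumulated during that period.
    def __init__(self, base_alg, m):
        self.base_alg, self.m, self.buffer = base_alg, m, []
    def distribution(self):
        return self.base_alg.distribution()
    def observe(self, loss):
        self.buffer.append(np.asarray(loss))
        if len(self.buffer) == self.m:         # end of a period: perform one lazy update
            self.base_alg.update(np.mean(self.buffer, axis=0))
            self.buffer = []

def round_distribution(instances):
    # The round-t output: a uniform mixture of the instances' current distributions.
    # A pointwise average of monotone outputs is again monotone, which is the observation used above.
    return np.mean([inst.distribution() for inst in instances], axis=0)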

C.3 Monobandit Algorithm

In addition to the definitions in Section 4.1, we introduce additional notation for a full-feedback learning algorithm, which is used in the implementation of our bandit-feedback algorithm MonoBandit.

Definition 37 (Full-Feedback Learning algorithm 𝒜\mathcal{A}, loss sequence ll).

A full-feedback learning algorithm \mathcal{A} consists of a deterministic function which takes as input \mathcal{R}^{\mathcal{A}}_{1:t}, the realizations of the randomness used by the algorithm in the current round and all previous rounds, and l^{1:t-1}, the k-length loss vectors from all previous rounds, and outputs an action for the current round t.

We can now define our algorithm MonoBandit, which implements this reduction from full feedback to bandit feedback while preserving the monotonicity and sublinear regret (either external or swap) guarantees of some full-feedback learning algorithm \mathcal{A}. When we reason about the behavior of MonoBandit, we must be able to distinguish between the randomness inherent to \mathcal{A} and the additional randomness that the bandit algorithm uses. The past realizations of both are visible to the adaptive adversary, but \mathcal{A} only has access to its own randomness. We therefore define the following two sources of randomness, both of which are used by the bandit-feedback algorithm:

Definition 38 (\mathcal{R}, f\mathcal{R}_{f} and b\mathcal{R}_{b}).

Let \mathcal{R} be composed of the following two sets of randomness: f\mathcal{R}_{f} and b\mathcal{R}_{b}. f\mathcal{R}_{f} is the randomness utilized by the full-information algorithm 𝒜\mathcal{A}. b\mathcal{R}_{b} is additional randomness utilized by MonoBandit, which is independent from f\mathcal{R}_{f}. In particular b\mathcal{R}_{b} is composed of TT independent random variables bt\mathcal{R}^{t}_{b}, where bt=EXPLOIT\mathcal{R}^{t}_{b}=EXPLOIT with probability 1ϵ1-\epsilon and bt=i\mathcal{R}^{t}_{b}=i with probability ϵk\frac{\epsilon}{k} for all i[k]i\in[k].

Algorithm 5 MonoBandit
  Inputs TT, Full-feedback Algorithm 𝒜\mathcal{A}
  Parameter ϵ\epsilon
  Probability vector pp (initialized to the uniform distribution)
  Sequence of sampled loss vectors L^\hat{L} (initialized to empty)
  Randomness of full-feedback algorithm f\mathcal{R}_{f}
  Additional randomness of bandit algorithm b\mathcal{R}_{b}
  for t[T]t\in[T] do
     if btEXPLOIT\mathcal{R}_{b}^{t}\neq EXPLOIT  then
        Play action bt\mathcal{R}_{b}^{t}
        Receive bandit feedback bb
        l^t[bt]kbϵ\hat{l}_{t}[\mathcal{R}_{b}^{t}]\leftarrow\frac{kb}{\epsilon}
        l^t[i]0\hat{l}_{t}[i]\leftarrow 0 for all ibti\neq\mathcal{R}_{b}^{t}
        L^L^l^t\hat{L}\leftarrow\hat{L}\cup\hat{l}_{t}
     else
        Select action from probability distribution pp
        L^L^0k\hat{L}\leftarrow\hat{L}\cup{0}^{k}
     end if
     p𝒜(L^,f1:t)p\leftarrow\mathcal{A}(\hat{L},\mathcal{R}_{f}^{1:t})
  end for
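A minimal Python sketch of Algorithm 5 follows, with an exponential-weights rule standing in for the full-feedback algorithm \mathcal{A}; the names mono_bandit, exp_weights_full and loss_fn are ours, and the adversary is assumed to be supplied as a callable that returns the bandit feedback for the chosen action.

import numpy as np

def exp_weights_full(sampled_losses, eta=0.05):
    # A stand-in for the full-feedback algorithm A: exponential weights on the cumulative sampled losses.
    cum = np.sum(sampled_losses, axis=0)
    w = np.exp(-eta * cum)
    return w / w.sum()

def mono_bandit(T, k, eps, loss_fn, full_feedback_alg=exp_weights_full, seed=0):
    # Sketch of Algorithm 5 (MonoBandit). loss_fn(t, action) returns the bandit feedback for the chosen action.
    rng = np.random.default_rng(seed)
    sampled_losses = []                          # the sequence of sampled loss vectors fed to A
    p = np.ones(k) / k                           # initial distribution over the k actions
    actions = []
    for t in range(T):
        if rng.random() < eps:                   # R_b^t != EXPLOIT: explore a uniformly random action
            i = int(rng.integers(k))
            b = loss_fn(t, i)
            lhat = np.zeros(k)
            lhat[i] = k * b / eps                # importance-weighted loss estimate for the explored action
            sampled_losses.append(lhat)
            actions.append(i)
        else:                                    # R_b^t = EXPLOIT: play from the current distribution p
            a = int(rng.choice(k, p=p))
            loss_fn(t, a)                        # feedback is received but not passed on to A
            sampled_losses.append(np.zeros(k))
            actions.append(a)
        p = full_feedback_alg(sampled_losses)    # A only ever sees the sampled loss vectors
    return actions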

C.4 Proof of Lemma 16

Proof of Lemma 16.

Recall that we previously defined Swap Regret to be

\displaystyle\textsc{SReg}_{f}(T,\ell)=\max_{\chi}\mathbb{E}_{\mathcal{R}^{1:T}}\left[\sum_{t=1}^{T}\ell(\phi^{1:t-1}_{\mathcal{R}^{1:t-1}})[f(\phi^{1:t-1}_{\mathcal{R}^{1:t-1}},\mathcal{R}^{t})]-\ell(\phi^{1:t-1}_{\mathcal{R}^{1:t-1}})[\chi(f(\phi^{1:t-1}_{\mathcal{R}^{1:t-1}},\mathcal{R}^{t}))]\right],

where we use \phi^{1:t-1}_{\mathcal{R}^{1:t-1}} to denote the transcript prefix deterministically resulting from the algorithm f, the adversary \ell, and a particular realization of \mathcal{R}^{1:t-1}. Note that, given a particular f and \ell, the transcript prefix \phi^{1:t-1}_{\mathcal{R}^{1:t-1}} is uniquely determined by the realization \mathcal{R}^{1:t-1} of the randomness of the learning algorithm. Therefore, for simplicity, for the entirety of this section we will write swap regret as

\displaystyle\textsc{SReg}_{f}(T,\ell)=\max_{\chi}\mathbb{E}_{\mathcal{R}^{1:T}}\left[\sum_{t=1}^{T}\ell(\mathcal{R}^{1:t-1})[f(\mathcal{R}^{1:t})]-\ell(\mathcal{R}^{1:t-1})[\chi(f(\mathcal{R}^{1:t}))]\right].

Let l^t\hat{l}_{t} represent the loss vector that is given to 𝒜\mathcal{A} at round tt. First, we will show that l^t\hat{l}_{t} is an unbiased sample of t\ell_{t}, regardless of the behavior of the adaptive adversary.

𝔼1:t[l^t[i]]\displaystyle\mathbb{E}_{\mathcal{R}^{1:t}}[\hat{l}_{t}[i]] =𝔼1:t[𝕀[bt=i]kϵ(1:t1)[i]]\displaystyle=\mathbb{E}_{\mathcal{R}^{1:t}}[\mathbb{I}[\mathcal{R}^{t}_{b}=i]\cdot\frac{k}{\epsilon}\ell(\mathcal{R}^{1:t-1})[i]]
=kϵ𝔼1:t[𝕀[bt=i](1:t1)[i]]\displaystyle=\frac{k}{\epsilon}\mathbb{E}_{\mathcal{R}^{1:t}}[\mathbb{I}[\mathcal{R}^{t}_{b}=i]\cdot\ell(\mathcal{R}^{1:t-1})[i]]
=kϵ𝔼bt[𝕀[bt=i]]𝔼1:t1[(1:t1)[i]]\displaystyle=\frac{k}{\epsilon}\mathbb{E}_{\mathcal{R}^{t}_{b}}[\mathbb{I}[\mathcal{R}^{t}_{b}=i]]\cdot\mathbb{E}_{\mathcal{R}^{1:t-1}}[\ell(\mathcal{R}^{1:t-1})[i]] (By the independence of bt\mathcal{R}^{t}_{b} and 1:t1\mathcal{R}^{1:t-1})
=𝔼1:t[(1:t1)[i]]\displaystyle=\mathbb{E}_{\mathcal{R}^{1:t}}[\ell(\mathcal{R}^{1:t-1})[i]]
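As a quick numerical sanity check of this unbiasedness, the following short Monte Carlo sketch (with arbitrary illustrative values of k, \epsilon and the loss vector) averages the estimator used by Algorithm 5 over many independent draws of \mathcal{R}^{t}_{b} and recovers the true losses.

import numpy as np

rng = np.random.default_rng(0)
k, eps, trials = 3, 0.3, 200_000
l = np.array([0.7, 0.2, 0.9])        # an arbitrary loss vector for the current round

est = np.zeros(k)
for _ in range(trials):
    if rng.random() < eps:            # explore: R_b^t = i with probability eps / k for each i
        i = int(rng.integers(k))
        est[i] += k * l[i] / eps      # the only non-zero coordinate of the sampled loss vector
    # exploit draws contribute the all-zeros vector, exactly as in Algorithm 5

print(est / trials)                   # approximately [0.7, 0.2, 0.9], matching l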

Let us write the swap regret of MonoBandit in terms of the swap regret incurred on EXPLORE and EXPLOIT rounds. For brevity in notation, we let ff refer to MonoBandit.

\displaystyle\textsc{SReg}_{MonoBandit}(T,\ell)\leq\max_{\chi}\mathbb{E}_{\mathcal{R}^{1:T}}\left[\sum_{t\in EXPLOIT}\ell(\mathcal{R}^{1:t-1})[f(\mathcal{R}^{1:t})]-\ell(\mathcal{R}^{1:t-1})[\chi(f(\mathcal{R}^{1:t}))]\right]+
\displaystyle\max_{\chi}\mathbb{E}_{\mathcal{R}^{1:T}}\left[\sum_{t\in EXPLORE}\ell(\mathcal{R}^{1:t-1})[f(\mathcal{R}^{1:t})]-\ell(\mathcal{R}^{1:t-1})[\chi(f(\mathcal{R}^{1:t}))]\right]

We will upper bound both terms, starting with the first. For all swap functions χ\chi, we have

\displaystyle\mathbb{E}_{\mathcal{R}^{1:T}}\left[\sum_{t\in EXPLOIT}\ell(\mathcal{R}^{1:t-1})[f(\mathcal{R}^{1:t})]-\ell(\mathcal{R}^{1:t-1})[\chi(f(\mathcal{R}^{1:t}))]\right]
𝔼1:T[t=1T(1:t1)[f(1:t)](1:t1)[χ(f(1:t))]]\displaystyle\leq\mathbb{E}_{\mathcal{R}^{1:T}}\left[\sum_{t=1}^{T}\ell(\mathcal{R}^{1:t-1})[f(\mathcal{R}^{1:t})]-\ell(\mathcal{R}^{1:t-1})[\chi(f(\mathcal{R}^{1:t}))]\right]
=t=1T𝔼1:t[(1:t1)[f(1:t)](1:t1)[χ(f(1:t))]]\displaystyle=\sum_{t=1}^{T}\mathbb{E}_{\mathcal{R}^{1:t}}\left[\ell(\mathcal{R}^{1:t-1})[f(\mathcal{R}^{1:t})]-\ell(\mathcal{R}^{1:t-1})[\chi(f(\mathcal{R}^{1:t}))]\right]
=t=1T𝔼1:t[(1:t1)[𝒜(f1:t,(1:t2))](1:t1)[χ(𝒜(f1:t,(1:t2)))]]\displaystyle=\sum_{t=1}^{T}\mathbb{E}_{\mathcal{R}^{1:t}}\left[\ell(\mathcal{R}^{1:t-1})[\mathcal{A}(\mathcal{R}_{f}^{1:t},\ell(\mathcal{R}^{1:t-2}))]-\ell(\mathcal{R}^{1:t-1})[\chi(\mathcal{A}(\mathcal{R}_{f}^{1:t},\ell(\mathcal{R}^{1:t-2})))]\right] (By the definition of MonoBandit)
=t=1T𝔼1:t[l^[𝒜(f1:t,(1:t2))]l^[χ(𝒜(f1:t,(1:t2)))]]\displaystyle=\sum_{t=1}^{T}\mathbb{E}_{\mathcal{R}^{1:t}}\left[\hat{l}[\mathcal{A}(\mathcal{R}_{f}^{1:t},\ell(\mathcal{R}^{1:t-2}))]-\hat{l}[\chi(\mathcal{A}(\mathcal{R}_{f}^{1:t},\ell(\mathcal{R}^{1:t-2})))]\right] (By the fact that l^\hat{l} is an unbiased estimator of \ell )
=t=1T𝔼1:t[l^[𝒜(f1:t,(f1:t2))]l^[χ(𝒜(f1:t,(f1:t2)))]]\displaystyle=\sum_{t=1}^{T}\mathbb{E}_{\mathcal{R}^{1:t}}\left[\hat{l}[\mathcal{A}(\mathcal{R}_{f}^{1:t},\ell(\mathcal{R}_{f}^{1:t-2}))]-\hat{l}[\chi(\mathcal{A}(\mathcal{R}_{f}^{1:t},\ell(\mathcal{R}_{f}^{1:t-2})))]\right] (By the fact that f\mathcal{R}^{f} is independent of b\mathcal{R}^{b} )
=t=1T𝔼f1:t[l^[𝒜(f1:t,(f1:t2))]l^[χ(𝒜(f1:t,(f1:t2)))]]\displaystyle=\sum_{t=1}^{T}\mathbb{E}_{\mathcal{R}_{f}^{1:t}}\left[\hat{l}[\mathcal{A}(\mathcal{R}_{f}^{1:t},\ell(\mathcal{R}_{f}^{1:t-2}))]-\hat{l}[\chi(\mathcal{A}(\mathcal{R}_{f}^{1:t},\ell(\mathcal{R}_{f}^{1:t-2})))]\right]
|l^maxl^min|R(T)\displaystyle\leq|\hat{l}_{max}-\hat{l}_{min}|\cdot R(T) (By the definition of Swap Regret, and the assumption in the Lemma statement.)
=kϵR(T)\displaystyle=\frac{k}{\epsilon}\cdot R(T)

As for the second term,

\displaystyle\max_{\chi}\mathbb{E}_{\mathcal{R}^{1:T}}\left[\sum_{t\in EXPLORE}\ell(\mathcal{R}^{1:t-1})[f(\mathcal{R}^{1:t})]-\ell(\mathcal{R}^{1:t-1})[\chi(f(\mathcal{R}^{1:t}))]\right]\leq\mathbb{E}_{\mathcal{R}^{1:T}}\left[\sum_{t\in EXPLORE}1\right]
=ϵT\displaystyle=\epsilon T

Putting these together, we can upper bound the swap regret of MonoBandit as

kϵR(T)+ϵT\frac{k}{\epsilon}\cdot R(T)+\epsilon T
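This expression is minimized by balancing the two terms:

\displaystyle\frac{d}{d\epsilon}\left(\frac{k}{\epsilon}\cdot R(T)+\epsilon T\right)=-\frac{k\cdot R(T)}{\epsilon^{2}}+T=0\quad\Longrightarrow\quad\epsilon=\sqrt{\frac{k\cdot R(T)}{T}},

at which point each term equals \sqrt{T\cdot k\cdot R(T)}.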

By setting \epsilon=\sqrt{\frac{k\cdot R(T)}{T}}, we get that

SRegMonoBandit(T,)2TkR(T),\textsc{SReg}_{MonoBandit}(T,\ell)\leq 2\sqrt{T\cdot k\cdot R(T)},\ \forall\ell

C.5 Proof of Lemma 17

Proof of Lemma 17.

Recall that we previously defined external regret to be

\displaystyle\textsc{Reg}_{f}(T,\ell)=\max_{i\in[k]}\mathbb{E}_{\mathcal{R}^{1:T}}\left[\sum_{t=1}^{T}\ell(\phi^{1:t-1}_{\mathcal{R}^{1:t-1}})[f(\phi^{1:t-1}_{\mathcal{R}^{1:t-1}},\mathcal{R}^{t})]-\ell(\phi^{1:t-1}_{\mathcal{R}^{1:t-1}})[i]\right],

where we use \phi^{1:t-1}_{\mathcal{R}^{1:t-1}} to denote the transcript prefix deterministically resulting from the algorithm f, the adversary \ell, and a particular realization of \mathcal{R}^{1:t-1}. Note that, given a particular f and \ell, the transcript prefix \phi^{1:t-1}_{\mathcal{R}^{1:t-1}} is uniquely determined by the realization \mathcal{R}^{1:t-1} of the randomness of the learning algorithm. Therefore, for simplicity, for the entirety of this section we will write external regret as

Regf(T,)=maxi[k]𝔼1:T[t=1T(1:t1)[f(1:t)](1:t1)[i]]\displaystyle\textsc{Reg}_{f}(T,\ell)=\max_{i\in[k]}\mathbb{E}_{\mathcal{R}^{1:T}}\left[\sum_{t=1}^{T}\ell(\mathcal{R}^{1:t-1})[f(\mathcal{R}^{1:t})]-\ell(\mathcal{R}^{1:t-1})[i]\right]

Let l^t\hat{l}_{t} represent the loss vector that is given to 𝒜\mathcal{A} at round tt. First, we will show that l^t\hat{l}_{t} is an unbiased sample of t\ell_{t}, regardless of the behavior of the adaptive adversary.

𝔼1:t[l^t[i]]\displaystyle\mathbb{E}_{\mathcal{R}^{1:t}}[\hat{l}_{t}[i]] =𝔼1:t[𝕀[bt=i]kϵ(1:t1)[i]]\displaystyle=\mathbb{E}_{\mathcal{R}^{1:t}}[\mathbb{I}[\mathcal{R}^{t}_{b}=i]\cdot\frac{k}{\epsilon}\ell(\mathcal{R}^{1:t-1})[i]]
=kϵ𝔼1:t[𝕀[bt=i](1:t1)[i]]\displaystyle=\frac{k}{\epsilon}\mathbb{E}_{\mathcal{R}^{1:t}}[\mathbb{I}[\mathcal{R}^{t}_{b}=i]\cdot\ell(\mathcal{R}^{1:t-1})[i]]
=kϵ𝔼bt[𝕀[bt=i]]𝔼1:t1[(1:t1)[i]]\displaystyle=\frac{k}{\epsilon}\mathbb{E}_{\mathcal{R}^{t}_{b}}[\mathbb{I}[\mathcal{R}^{t}_{b}=i]]\cdot\mathbb{E}_{\mathcal{R}^{1:t-1}}[\ell(\mathcal{R}^{1:t-1})[i]] (By the independence of bt\mathcal{R}^{t}_{b} and 1:t1\mathcal{R}^{1:t-1})
=𝔼1:t[(1:t1)[i]]\displaystyle=\mathbb{E}_{\mathcal{R}^{1:t}}[\ell(\mathcal{R}^{1:t-1})[i]]

Let us write the external regret of MonoBandit in terms of the external regret incurred on EXPLORE and EXPLOIT rounds. For brevity in notation, we let ff refer to MonoBandit.

\displaystyle\textsc{Reg}_{MonoBandit}(T,\ell)\leq\max_{i\in[k]}\mathbb{E}_{\mathcal{R}^{1:T}}\left[\sum_{t\in EXPLOIT}\ell(\mathcal{R}^{1:t-1})[f(\mathcal{R}^{1:t})]-\ell(\mathcal{R}^{1:t-1})[i]\right]+
\displaystyle\max_{i\in[k]}\mathbb{E}_{\mathcal{R}^{1:T}}\left[\sum_{t\in EXPLORE}\ell(\mathcal{R}^{1:t-1})[f(\mathcal{R}^{1:t})]-\ell(\mathcal{R}^{1:t-1})[i]\right]

We will upper bound both terms, starting with the first. For all fixed actions ii, we have

\displaystyle\mathbb{E}_{\mathcal{R}^{1:T}}\left[\sum_{t\in EXPLOIT}\ell(\mathcal{R}^{1:t-1})[f(\mathcal{R}^{1:t})]-\ell(\mathcal{R}^{1:t-1})[i]\right]
\displaystyle\leq\mathbb{E}_{\mathcal{R}^{1:T}}\left[\sum_{t=1}^{T}\ell(\mathcal{R}^{1:t-1})[f(\mathcal{R}^{1:t})]-\ell(\mathcal{R}^{1:t-1})[i]\right]
=t=1T𝔼1:t[(1:t1)[f(1:t)](1:t1)[i]]\displaystyle=\sum_{t=1}^{T}\mathbb{E}_{\mathcal{R}^{1:t}}\left[\ell(\mathcal{R}^{1:t-1})[f(\mathcal{R}^{1:t})]-\ell(\mathcal{R}^{1:t-1})[i]\right]
=t=1T𝔼1:t[(1:t1)[𝒜(f1:t,(1:t2))](1:t1)[i]]\displaystyle=\sum_{t=1}^{T}\mathbb{E}_{\mathcal{R}^{1:t}}\left[\ell(\mathcal{R}^{1:t-1})[\mathcal{A}(\mathcal{R}_{f}^{1:t},\ell(\mathcal{R}^{1:t-2}))]-\ell(\mathcal{R}^{1:t-1})[i]\right] (By the definition of MonoBandit)
=t=1T𝔼1:t[l^[𝒜(f1:t,(1:t2))]l^[i]]\displaystyle=\sum_{t=1}^{T}\mathbb{E}_{\mathcal{R}^{1:t}}\left[\hat{l}[\mathcal{A}(\mathcal{R}_{f}^{1:t},\ell(\mathcal{R}^{1:t-2}))]-\hat{l}[i]\right] (By the fact that l^\hat{l} is an unbiased estimator of \ell )
=t=1T𝔼1:t[l^[𝒜(f1:t,(f1:t2))]l^[i]]\displaystyle=\sum_{t=1}^{T}\mathbb{E}_{\mathcal{R}^{1:t}}\left[\hat{l}[\mathcal{A}(\mathcal{R}_{f}^{1:t},\ell(\mathcal{R}_{f}^{1:t-2}))]-\hat{l}[i]\right] (By the fact that f\mathcal{R}^{f} is independent of b\mathcal{R}^{b} )
=t=1T𝔼f1:t[l^[𝒜(f1:t,(f1:t2))]l^[i]]\displaystyle=\sum_{t=1}^{T}\mathbb{E}_{\mathcal{R}_{f}^{1:t}}\left[\hat{l}[\mathcal{A}(\mathcal{R}_{f}^{1:t},\ell(\mathcal{R}_{f}^{1:t-2}))]-\hat{l}[i]\right]
|l^maxl^min|R(T)\displaystyle\leq|\hat{l}_{max}-\hat{l}_{min}|\cdot R(T) (By the definition of External Regret, and the assumption in the lemma statement.)
=kϵR(T)\displaystyle=\frac{k}{\epsilon}\cdot R(T)

As for the second term,

\displaystyle\max_{i\in[k]}\mathbb{E}_{\mathcal{R}^{1:T}}\left[\sum_{t\in EXPLORE}\ell(\mathcal{R}^{1:t-1})[f(\mathcal{R}^{1:t})]-\ell(\mathcal{R}^{1:t-1})[i]\right]\leq\mathbb{E}_{\mathcal{R}^{1:T}}\left[\sum_{t\in EXPLORE}1\right]
=ϵT\displaystyle=\epsilon T

Putting these together, we can upper bound the external regret of MonoBandit as

kϵR(T)+ϵT\frac{k}{\epsilon}\cdot R(T)+\epsilon T

By setting \epsilon=\sqrt{\frac{k\cdot R(T)}{T}}, we get that

RegMonoBandit(T,)2TkR(T),\textsc{Reg}_{MonoBandit}(T,\ell)\leq 2\sqrt{T\cdot k\cdot R(T)},\ \forall\ell

C.6 Proof of Lemma 18

Proof of Lemma 18.

For ease of notation, throughout this proof we write the MonoBandit algorithm as taking as input the full feedback vector ll, while keeping in mind that it can only change its state based on the bandit feedback it receives from sampling ll each round. Furthermore, we let l^t(lt)\hat{l}_{t}(l_{t}) denote the sampled loss vector fed to the internal full-information algorithm 𝒜\mathcal{A} by MonoBandit at round tt.

For any fixed sequence of losses ll, the sampled loss vector l^t\hat{l}_{t^{\prime}} that MonoBandit feeds to 𝒜\mathcal{A} in round tt^{\prime} is independent of the sampled loss vector l^t\hat{l}_{t} seen at round t<tt<t^{\prime}. To see this, note that it depends only on the current loss vector at round tt^{\prime} and the realization of tb\mathcal{R}^{b}_{t^{\prime}}, both of which are independent of any previous losses or realizations of randomness:

\displaystyle\hat{l}_{t^{\prime}}[j]=\begin{cases}\frac{k}{\epsilon}\cdot l_{t^{\prime}}[j]&\text{if }\mathcal{R}^{b}_{t^{\prime}}=j\\ 0&\text{otherwise (in particular, whenever }\mathcal{R}^{b}_{t^{\prime}}=EXPLOIT\text{)}\end{cases}

Now, consider the impact that decreasing the loss of a single action in one round has on the probability that this action is played later in the game. Let the loss sequence l^{*} be the same as the loss sequence l, except that at round \bar{t}, action a has a decreased loss. Let the bandit feedback loss sequence b^{*} be the loss sequence seen by MonoBandit under l^{*}, and let b be the loss sequence seen by MonoBandit under l. Furthermore, let v(l,a) represent the k-length vector which is zero everywhere except at index a, where it equals l.

On any exploit round t>t¯t>\bar{t}:

1:t(MonoBanditt(l,1:t1,1:t)=a)\displaystyle\mathbb{P}_{\mathcal{R}^{1:t}}(MonoBandit_{t}(l^{*,1:t-1},\mathcal{R}^{1:t})=a)
=1:t(MonoBanditt(l,1:t1,1:t)=a|btEXPLOIT)(btEXPLOIT)\displaystyle=\mathbb{P}_{\mathcal{R}^{1:t}}(MonoBandit_{t}(l^{*,1:t-1},\mathcal{R}^{1:t})=a|\mathcal{R}_{b}^{t}\neq EXPLOIT)\cdot\mathbb{P}(\mathcal{R}_{b}^{t}\neq EXPLOIT)
+1:t(MonoBanditt(l,1:t1,1:t)=a|bt=EXPLOIT)(bt=EXPLOIT)\displaystyle+\mathbb{P}_{\mathcal{R}^{1:t}}(MonoBandit_{t}(l^{*,1:t-1},\mathcal{R}^{1:t})=a|\mathcal{R}_{b}^{t}=EXPLOIT)\cdot\mathbb{P}(\mathcal{R}_{b}^{t}=EXPLOIT)
=ϵk+1:t(MonoBanditt(l,1:t1,1:t)=a|bt=EXPLOIT)(bt=EXPLOIT)\displaystyle=\frac{\epsilon}{k}+\mathbb{P}_{\mathcal{R}^{1:t}}(MonoBandit_{t}(l^{*,1:t-1},\mathcal{R}^{1:t})=a|\mathcal{R}^{t}_{b}=EXPLOIT)\cdot\mathbb{P}(\mathcal{R}^{t}_{b}=EXPLOIT)
=ϵk+1:t(𝒜t(l^1:t1(l,1:t1),1:tf)=a|bt=EXPLOIT)(1ϵ)\displaystyle=\frac{\epsilon}{k}+\mathbb{P}_{\mathcal{R}^{1:t}}(\mathcal{A}_{t}(\hat{l}^{1:t-1}(l^{*,1:t-1}),\mathcal{R}^{f}_{1:t})=a|\mathcal{R}^{t}_{b}=EXPLOIT)\cdot(1-\epsilon)
=ϵk+1:t(𝒜t(l^1:t1(l1:t1),1:tf)=a|bt=EXPLOIT,bt¯a)(1ϵ)(1ϵk)\displaystyle=\frac{\epsilon}{k}+\mathbb{P}_{\mathcal{R}^{1:t}}(\mathcal{A}_{t}(\hat{l}^{1:t-1}(l^{1:t-1}),\mathcal{R}^{f}_{1:t})=a|\mathcal{R}^{t}_{b}=EXPLOIT,\mathcal{R}^{\bar{t}}_{b}\neq a)\cdot(1-\epsilon)(1-\frac{\epsilon}{k})
\displaystyle+\mathbb{P}_{\mathcal{R}^{1:t}}(\mathcal{A}_{t}(\hat{l}^{1:t-1}(l^{*,1:t-1}),\mathcal{R}^{f}_{1:t})=a|\mathcal{R}^{t}_{b}=EXPLOIT,\mathcal{R}_{b}^{\bar{t}}=a)\cdot(1-\epsilon)(\frac{\epsilon}{k}) (By the fact that \hat{l} only differs between l^{*} and l if action a is sampled in round \bar{t}.)

We can compare this probability with that of the unaltered sequence ll:

\displaystyle\mathbb{P}_{\mathcal{R}^{1:t}}(MonoBandit_{t}(l^{*,1:t-1},\mathcal{R}^{1:t})=a)-\mathbb{P}_{\mathcal{R}^{1:t}}(MonoBandit_{t}(l^{1:t-1},\mathcal{R}^{1:t})=a)
=1:t(𝒜t(l^1:t1(l,1:t1),1:tf)=a|bt=EXPLOIT,bt¯=a)(1ϵ)(ϵk)\displaystyle=\mathbb{P}_{\mathcal{R}^{1:t}}(\mathcal{A}_{t}(\hat{l}^{1:t-1}(l^{*,1:t-1}),\mathcal{R}^{f}_{1:t})=a|\mathcal{R}^{t}_{b}=EXPLOIT,\mathcal{R}_{b}^{\bar{t}}=a)\cdot(1-\epsilon)(\frac{\epsilon}{k})-
1:t(𝒜t(l^1:t1(l1:t1),1:tf)=a|bt=EXPLOIT,bt¯=a)(1ϵ)(ϵk)\displaystyle\mathbb{P}_{\mathcal{R}^{1:t}}(\mathcal{A}_{t}(\hat{l}^{1:t-1}(l^{1:t-1}),\mathcal{R}^{f}_{1:t})=a|\mathcal{R}^{t}_{b}=EXPLOIT,\mathcal{R}_{b}^{\bar{t}}=a)\cdot(1-\epsilon)(\frac{\epsilon}{k})
\displaystyle=\mathbb{P}_{\mathcal{R}^{1:t}}(\mathcal{A}_{t}(\hat{l}^{1:\bar{t}-1}(l^{*,1:\bar{t}-1})\cup\hat{l}_{\bar{t}}(l^{*}_{\bar{t}})\cup\hat{l}^{\bar{t}+1:t-1}(l^{*,\bar{t}+1:t-1}),\mathcal{R}^{f}_{1:t})=a|\mathcal{R}^{t}_{b}=EXPLOIT,\mathcal{R}_{b}^{\bar{t}}=a)\cdot(1-\epsilon)(\frac{\epsilon}{k})-
\displaystyle\mathbb{P}_{\mathcal{R}^{1:t}}(\mathcal{A}_{t}(\hat{l}^{1:\bar{t}-1}(l^{1:\bar{t}-1})\cup\hat{l}_{\bar{t}}(l_{\bar{t}})\cup\hat{l}^{\bar{t}+1:t-1}(l^{\bar{t}+1:t-1}),\mathcal{R}^{f}_{1:t})=a|\mathcal{R}^{t}_{b}=EXPLOIT,\mathcal{R}_{b}^{\bar{t}}=a)\cdot(1-\epsilon)(\frac{\epsilon}{k})
\displaystyle=\mathbb{P}_{\mathcal{R}^{1:t}}(\mathcal{A}_{t}(\hat{l}^{1:\bar{t}-1}(l^{1:\bar{t}-1})\cup v(\frac{k}{\epsilon}l^{*}_{\bar{t}}[a],a)\cup\hat{l}^{\bar{t}+1:t-1}(l^{\bar{t}+1:t-1}),\mathcal{R}^{f}_{1:t})=a|\mathcal{R}^{t}_{b}=EXPLOIT,\mathcal{R}_{b}^{\bar{t}}=a)\cdot(1-\epsilon)(\frac{\epsilon}{k})-
\displaystyle\mathbb{P}_{\mathcal{R}^{1:t}}(\mathcal{A}_{t}(\hat{l}^{1:\bar{t}-1}(l^{1:\bar{t}-1})\cup v(\frac{k}{\epsilon}l_{\bar{t}}[a],a)\cup\hat{l}^{\bar{t}+1:t-1}(l^{\bar{t}+1:t-1}),\mathcal{R}^{f}_{1:t})=a|\mathcal{R}^{t}_{b}=EXPLOIT,\mathcal{R}_{b}^{\bar{t}}=a)\cdot(1-\epsilon)(\frac{\epsilon}{k})
0\displaystyle\geq 0 (By the monotonicity of 𝒜\mathcal{A})