
On the Robustness of Epoch-Greedy in Multi-Agent Contextual Bandit Mechanisms

Yinglun Xu
University of Illinois at Urbana-Champaign
[email protected]
Bhuvesh Kumar
Georgia Institute of Technology
[email protected]
Jacob Abernethy
Georgia Institute of Technology
[email protected]
Abstract

Efficient learning in multi-armed bandit mechanisms such as pay-per-click (PPC) auctions typically involves three challenges: 1) inducing truthful bidding behavior (incentives), 2) personalizing allocations to users (context), and 3) circumventing manipulations in click patterns (corruptions). Each of these challenges has been studied orthogonally in the literature; incentives have been addressed by a line of work on truthful multi-armed bandit mechanisms, context has been extensively tackled by contextual bandit algorithms, while corruptions have been discussed in a recent line of work on bandits with adversarial corruptions. Since these challenges co-exist, it is important to understand the robustness of each of these approaches in addressing the other challenges, provide algorithms that can handle all of them simultaneously, and highlight inherent limitations in this combination. In this work, we show that the most prominent contextual bandit algorithm, $\epsilon$-greedy, can be extended to handle the challenges introduced by strategic arms in the contextual multi-armed bandit mechanism setting. We further show that $\epsilon$-greedy is inherently robust to adversarial data corruption attacks and achieves performance that degrades linearly with the amount of corruption.

1 Introduction

Auctions for allocating online advertisements are run many billions of times per day and constitute the main business models for many digital platforms. How these auctions are designed and managed can therefore have major implications for both the auctioneers and the many participants competing for ad space. Under a simple model of bidding behavior, and with good information about the quality of submitted ads, classical theory in mechanism design provides the auctioneer with what is essentially the uniquely optimal allocation and payment scheme. The VCG mechanism vickrey1961counterspeculation allocates each advertisement slot to the highest bidder, weighted by the “click-through-rate” (CTR), and charges an amount equal to the second-highest bid. This scheme for selecting and charging bidders is welfare-maximizing and truthful, i.e. the bidders do not have any incentive to misreport their true value for clicks.

The simple VCG strategy, while nice in theory, is unfortunately lacking in one critical way: it requires that the auctioneer possess knowledge of the click-through-rates of the various ads submitted by bidders. From this perspective, selecting an ad to display to the user is akin to pulling an arm in a bandit problem, where the welfare associated with a pull is the value the bidder receives in the event of a click. Indeed, this $K$-armed bandit perspective on repeated ad auctions has been quite thoroughly studied; in particular, DBLP:journals/siamcomp/BabaioffSS14; devanur2009price model each arm as an advertiser with an associated fixed click-through-rate. The key hurdle for applying bandit methods to repeated auctions is the conflict between minimizing regret and maintaining reasonable long-term incentives for the bidders, since the auctioneer’s learning strategy may lead to unpredictable bid behavior. DBLP:journals/siamcomp/BabaioffSS14; devanur2009price design a VCG-style mechanism using an explore-then-commit bandit strategy. The resulting mechanism is truthful and achieves $O(T^{2/3})$ welfare regret, matching the lower bound, where regret is measured with respect to the fixed arm with the highest expected welfare.

In this paper, we turn our attention to the contextual version of the same bandit mechanism design problem. In the above setting, while the auctioneer’s estimates of the CTRs may change, the underlying CTRs of all the arms are fixed. But these CTRs may not only be unknown, they may also fluctuate according to some context that arrives before each auction and provides details about ad desirability (e.g. the profile of the user, time of day, etc.). When the mechanism is given access to this context, as well as to a family of experts (i.e. hypotheses, policies, etc.), it will want to minimize regret not with respect to simply the best arm (ad) but with respect to the optimal expert for selecting an arm given any context. This more general setting provides additional power to the mechanism designer, who can now consider much more complex allocation policies that achieve higher welfare.

Although contexts have been extensively studied in the multi-armed bandit literature lu2010contextual; chu2011contextual; LangfordZhang07; AuerCeFrSc2003, there has been little focus on the interplay between contexts and strategic arms that care about their own long-term utility. Giving the mechanism more flexibility to select ads, in the form of contexts and policies, presents new challenges for the mechanism design problem, as it can affect bidder incentives in ways that have not been previously addressed. Consider just one example: a participant can bid higher for a particular ad slot, yet this action can (all else equal) decrease her likelihood of winning the slot, since the change can lead the learning algorithm to move to a different policy. In order to manage this awkward incentive issue we slightly weaken our notion of truthfulness by considering a property we call $\alpha$-rationality, under which agents only deviate from truthful bidding if doing so improves their utility by at least $\alpha$. Ultimately, we are able to show that the mechanism can guarantee regret on the order of $\tilde{O}\left(T^{2/3}K^{4/3}+T^{1/3}K^{2/3}/\alpha^{2}\right)$, with respect to the optimal arm selection policy, by using a relatively standard $\epsilon$-greedy style algorithm. Pay-per-click auctions in the contextual multi-slot setting have also been studied by gatti2012truthful, where the principal directly tries to learn the click-through-rate as a function of the contexts for each arm under strong assumptions about the class of functions it is learning over. Their results do not apply to our setting as we do not aim to learn the click-through-rate for each context and arm directly but instead utilize the information given by the contexts through a class of experts.

Apart from the challenges induced by contexts and strategic agents, online ad auctions are also vulnerable to data corruption attacks such as click fraud wilbur2009click. For example, an agent may create a bot that either clicks her own ads or does not click competing ads in order to adversarially bias the click-through-rate estimates and make the ad seem less desirable to the principal. Corruption robustness in multi-armed bandits has been studied at great length in recent years lykouris2018stochastic; gupta2019better, but never in multi-armed bandit mechanisms. Although these algorithms perform well in the stochastic bandit setting, they lack a key property called exploration separation DBLP:journals/siamcomp/BabaioffSS14 which has been shown to be fundamental for ensuring truthfulness in bandit mechanisms.

Finally, we show that the contextual $\epsilon$-greedy mechanism (Algorithm LABEL:alg:contextmab) has a surprisingly nice additional property: it is tolerant to a certain amount of adversarial click corruptions. In the easier case of the stochastic multi-armed bandit mechanism, we show that the contextual $\epsilon$-greedy mechanism is truthful without any relaxation of the rationality of the agents and is inherently robust to adversarial click corruptions without any changes to the mechanism. If the number of rounds corrupted by the adversary is $C$, which is unknown to the auctioneer, the welfare regret of the mechanism in this setting is $\tilde{O}(K^{1/3}T^{2/3}+C)$, which matches the lower bound up to polylogarithmic terms. In the presence of contexts and $\alpha$-rational agents, the welfare regret of the contextual $\epsilon$-greedy mechanism is $\tilde{O}\left(T^{2/3}K^{4/3}+T^{1/3}K^{2/3}/\alpha^{2}+C/\alpha+KC\right)$.

Other related work:

Our work contributes to the literature on principal learning in auction design in repeated settings with strategic agents. Outside of pay-per-click auctions there are multiple auction settings where principal learning with strategic agents arises, including posted prices amin2013learning; amin2014repeated; mohri2014optimal; mohri2015revenue, reserve prices liu2018learning, and revenue-maximizing auctions abernethy2019learning. These works assume either that the bidders discount their future utilities or that the same bidder only shows up for a limited number of rounds and thus has limited impact on principal learning. We do not make such assumptions about the agents’ behavior. Another line of work focuses on multi-armed bandits in the presence of strategic arms nazerzadeh2016sell; braverman2017multi which pass on part of their reward to the principal when picked. A further related research direction involves mechanism design questions when the principal knows that the agents are employing no-regret learning to decide their actions deng2019prior; heidari2016pricing. The inverse problem of reacting as a buyer to a no-regret seller has also been studied in heidari2016pricing. Finally, we note that repeated auction design with strategic agents also arises in many other works within dynamic mechanism design; see bergemann2006efficient; nazerzadeh2008dynamic; kakade2013optimal for a non-exhaustive list.

2 Preliminaries

We study the problem of pay-per-click (PPC) ad auctions in a multi-armed contextual bandit setting. There is a set of advertisers (also referred to as agents or arms), denoted $\mathcal{A}$, of cardinality $K$; the agents repeatedly compete for a single slot over $T$ time steps (each corresponding to a user). If an advertiser $a\in\mathcal{A}$ is selected, then a click occurs with probability equal to the click-through-rate. Upon a click, the agent earns value $\mu(a)\in[0,1]$ (advertisers earn a fixed reward whenever a click happens, but different users may have different probabilities of clicking their ad, depending on the context of the user). Contexts come from a context space $\mathcal{X}$ and are identically and independently distributed (i.i.d.) across rounds. The click-through-rate function $\rho:\mathcal{A}\times\mathcal{X}\rightarrow[0,1]$ maps agents and contexts to the probability of being clicked upon display.

In each round $t=1,\cdots,T$, a context $x^{t}\in\mathcal{X}$ is drawn i.i.d. from a fixed distribution. Each agent $a\in\mathcal{A}$ submits a bid $b^{t}(a)$ and, in response, the platform selects an agent $a^{t}$ and displays her ad. The selected ad is clicked with probability $\rho(a^{t},x^{t})$, and the corresponding click result $c^{t}$ is $1$ if the ad is clicked and $0$ otherwise. The platform observes the click result $c^{t}$, and if the ad gets clicked, i.e. $c^{t}=1$, it charges agent $a^{t}$ a payment $p^{t}$ and the agent gains utility $u^{t}(a^{t})=\mu(a^{t})-p^{t}$. If the ad is not clicked, i.e. $c^{t}=0$, the payment and the utility of the agent are $0$.
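The following is a minimal simulation sketch of one such round, included only for concreteness; the allocation and payment routines (select_agent, compute_payment) are hypothetical placeholders of ours, not the mechanism of this paper, which is specified in Section 3.

import random

def run_round(agents, mu, rho, draw_context, select_agent, compute_payment):
    x_t = draw_context()                       # context x^t drawn i.i.d.
    bids = {a: mu[a] for a in agents}          # agents bid (truthfully, for illustration)
    a_t = select_agent(bids, x_t)              # platform selects an agent a^t
    clicked = random.random() < rho(a_t, x_t)  # click with probability rho(a^t, x^t)
    if clicked:                                # c^t = 1: charge payment p^t
        p_t = compute_payment(a_t, bids, x_t)
        utility = mu[a_t] - p_t                # u^t(a^t) = mu(a^t) - p^t
    else:                                      # c^t = 0: no payment, no utility
        p_t, utility = 0.0, 0.0
    return a_t, int(clicked), p_t, utility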

Following classical works on contextual bandits, we also assume a class of hypotheses (also referred to as experts), denoted $\mathcal{H}$, with $|\mathcal{H}|=m$. Each expert $h:\mathcal{X}\rightarrow\mathcal{A}$ is a mapping from the context space to agents, where $h(x)$ is the agent recommended by expert $h$ given context $x$. The auctioneer can use the experts’ recommendations to make decisions.

Objective. The platform wishes to maximize the cumulative social welfare, defined as $\sum_{t}c^{t}\cdot\mu(a^{t})$. Note that $c^{t}$ is a random variable that depends on the (possibly randomized) selection of agent $a^{t}$ and on the randomness in the context $x^{t}$. To evaluate how well the platform performs, we need a benchmark, and the typical benchmark is to compare against the expected social welfare of the best expert $h\in\mathcal{H}$, where the expectation is taken over the randomness in the contexts. When the value profile of the agents is $\vec{\mu}\in[0,1]^{K}$, where $\mu(a)$ is the value of agent $a$, the expected welfare of an expert $h$ is $R(h,\vec{\mu})=\mathbbm{E}_{x}[\mu(h(x))\cdot\rho(h(x),x)]$. Note that $\rho(h(x),x)$ is the expected value of the click result when the context is $x$ and the arm suggested by expert $h$, i.e. $h(x)$, is displayed.

Let $h^{\star}=\operatorname*{argmax}_{h\in\mathcal{H}}\{R(h,\vec{\mu})\}$ be the expert with the highest expected welfare. The performance of our algorithm against $h^{\star}$ is measured by the notion of regret:

\textsc{Reg}_{\text{context}}=T\cdot R(h^{\star},\vec{\mu})-\sum_{t=1}^{T}c^{t}\cdot\mu(a^{t}).

If we assume that agents report their values truthfully (truthfulness is defined below), then this reduces to the contextual bandit setting AuerCeFrSc2003; LangfordZhang07.

A special case of this setting is the multi-armed bandit setting, where the hypothesis class is the same as the agent space, $\mathcal{H}=\mathcal{A}$, and each hypothesis always suggests the same agent. In this special case, the expected welfare of hypothesis $h=a$ is $R(h,\vec{\mu})=\mu(a)\cdot\mathbbm{E}_{x}[\rho(a,x)]$.
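As an aside, the benchmark above is straightforward to approximate numerically. The sketch below is illustrative only (the function and variable names are ours, not the paper's): it estimates $R(h,\vec{\mu})$ for each expert by Monte Carlo over contexts and returns the benchmark term $T\cdot R(h^{\star},\vec{\mu})$ appearing in the regret definition.

import random

def expected_welfare(h, mu, rho, draw_context, n_samples=100000):
    # Monte Carlo estimate of R(h, mu) = E_x[ mu(h(x)) * rho(h(x), x) ]
    total = 0.0
    for _ in range(n_samples):
        x = draw_context()
        a = h(x)
        total += mu[a] * rho(a, x)
    return total / n_samples

def regret_benchmark(experts, mu, rho, draw_context, T):
    # T * R(h_star, mu), the first term in Reg_context
    return T * max(expected_welfare(h, mu, rho, draw_context) for h in experts)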

Incentives. DBLP:journals/siamcomp/BabaioffSS14 and devanur2009price introduced incentives in the multi-armed bandit setting by considering agents who bid strategically with the goal of maximizing their own long-term utility, defined as the difference between the total welfare they collect and the price they pay over the duration of the auction. They focus on designing mechanisms that are truthful, i.e., agents cannot benefit by bidding something other than their value. They show that the strategic behavior of agents poses additional challenges by providing an $\Omega(T^{2/3})$ lower bound for this setting, compared to the $O(\sqrt{T})$ regret that is achievable in the absence of incentives. This $O(T^{2/3})$ regret can be achieved by classical algorithms that separate exploration from exploitation, such as explore-then-commit and $\epsilon$-greedy.

In this work, we introduce the study of truthful mechanism design in contextual bandit mechanisms. Our definition of truthfulness generalizes the one in multi-armed bandits:

Definition 1 (Truthful Mechanism).

A mechanism is truthful if for any click-through-rate function $\rho(\cdot,\cdot)$ and value profile $\vec{\mu}$, for every agent $a\in\mathcal{A}$, irrespective of the bidding strategies used by the other agents $\mathcal{A}\setminus a$, agent $a$ obtains the maximum expected utility over all $T$ rounds by bidding her true value, $b^{t}(a)=\mu(a)$, at every round $t$.

We now introduce some definitions needed to characterize the setting. Suppose the expert that the mechanism follows is $h$ and the context is $x$; then the probability that agent $a$ gets a click is her click probability under context $x$ times the indicator of whether $h$ suggests agent $a$, i.e., $\rho(a,x)\mathbbm{1}\{h(x)=a\}$. Since the context in each round is identically and independently distributed, the probability that the agent will receive a click is this value in expectation over the contexts, which gives the following definition.

For the rest of the paper, we assume that the agents cannot observe the context. This is necessary, as we show in Appendix LABEL:sec:observe that if agents are allowed to observe the context, then it is hard to design a truthful and low-regret mechanism without any assumptions on the expert class. This is a justifiable assumption as the context may contain sensitive information about the users which should not be shared with the advertisers.

Definition 2 (Click probability for an agent induced by an expert).

The click probability for an agent $a$ induced by an expert $h$ is the probability that the agent will receive a click if the principal follows the recommendations of expert $h$, which is given by

\Pi(a,h):=\mathbbm{E}_{x}[\rho(a,x)\mathbbm{1}\{h(x)=a\}],

where the expectation is taken over the distribution of contexts $x$.

Note that based on the definition of $\Pi(a,h)$, the expected welfare of the mechanism following an expert $h$ can be written as $R(h,\bm{\mu})=\sum_{a\in\mathcal{A}}\mu(a)\cdot\Pi(a,h)$. Next we introduce the definition of reported welfare, which is a proxy for expected welfare.
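As a small illustrative example (the numbers are ours, chosen only to make the formula concrete), suppose $\mathcal{X}=\{x_{1},x_{2}\}$ with $\Pr[x_{1}]=\Pr[x_{2}]=1/2$, two agents with values $\mu(a_{1})=0.8$ and $\mu(a_{2})=0.5$, click-through-rates $\rho(a_{1},\cdot)=0.2$, $\rho(a_{2},x_{1})=0.9$, $\rho(a_{2},x_{2})=0.1$, and an expert $h$ with $h(x_{1})=a_{2}$ and $h(x_{2})=a_{1}$. Then

\Pi(a_{1},h)=\tfrac{1}{2}\cdot 0.2=0.1,\qquad\Pi(a_{2},h)=\tfrac{1}{2}\cdot 0.9=0.45,

so that $R(h,\bm{\mu})=0.8\cdot 0.1+0.5\cdot 0.45=0.305$.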

Definition 3 (Reported welfare of an expert).

The reported welfare of an expert $h$ when the bid profile is $\bm{b}$ is:

R(h,\bm{b})=\sum_{a\in\mathcal{A}}b(a)\cdot\Pi(a,h)

Note that when the mechanism is truthful, agents bid their true values, i.e. $\bm{b}=\bm{\mu}$, and the reported welfare of an expert is identical to its expected welfare. In this case, the expert with the highest reported welfare is exactly the best expert, that is, $\operatorname*{argmax}_{h\in\mathcal{H}}R(h,\bm{b})=h^{\star}$.

Next we want to characterize the incentives of the agents. One important quantity these incentives depend on is the relation between how much an agent bids and how likely she is to get a click. Recall that the mechanism allocates the slot based on the expert class $\mathcal{H}$, the history $H^{t}$, which includes all the information available to the principal up to round $t$, the agents’ bids $\bm{b}$, and the context $x$. So for an agent $a$, the probability that she wins the slot when bidding $b(a)=b$ is $\Pr\{A^{t}(\mathcal{H},H^{t},x,b,\bm{b}^{t}_{-a})=a\}$, where $A^{t}$ is the allocation rule of the mechanism at round $t$ and $\bm{b}^{t}_{-a}$ denotes the bids of the other agents; multiplied by the click-through-rate $\rho(a,x)$, this gives the probability that she receives a click. Since the agent cannot observe the context, her overall probability of receiving a click is this value in expectation over the contexts, which gives the following definition.

Definition 4 (Click probability of an agent).

Let $A^{t}$ be the allocation rule used by the mechanism at round $t$, $H^{t}$ be the history up to round $t$, and $\mathcal{H}$ be the expert class. For an agent $a$, fixing the bids of the other agents as $\bm{b}^{t}_{-a}$, the probability that agent $a$ gets clicked in round $t$ by bidding $b(a)=b$ is given by

g^{t}(a,b)=\mathbbm{E}_{x}[\Pr\{A^{t}(\mathcal{H},H^{t},x,b,\bm{b}^{t}_{-a})=a\}\rho(a,x)]

where $\rho(a,x)$ is the click-through-rate for agent $a$ on context $x$.

Based on the definition of $g^{t}(a,b)$, we can formally write the expected utility of an agent $a$ when her bid is $b$ as:

u^{t}(a,b)=(\mu(a)-p^{t}(a,b))g^{t}(a,b),

where $p^{t}(a,b)$ is the payment at round $t$ if agent $a$ is selected and clicked and her bid is $b$. Note that the function $p^{t}(\cdot,\cdot)$ can depend on the allocation rule, the bids of the other arms, the history, and the expert class.

Finally, we introduce the definition of an exploration-separated allocation rule, which has been shown to be fundamental for truthful bandit mechanisms DBLP:journals/siamcomp/BabaioffSS14.

Definition 5 (Exploration-separated allocation rule (Definition 1.2, DBLP:journals/siamcomp/BabaioffSS14)).

An allocation rule is called exploration-separated if the allocation in any influential round does not depend on any bids, where influential rounds are the rounds whose click results and bids can influence the allocation in future rounds.

3 Contextual bandit mechanisms

In the stochastic setting without contexts, where each expert is an arm and we are competing with the best arm in expectation, a simple VCG-style mechanism recovers the lower bound on regret, as shown in DBLP:journals/siamcomp/BabaioffSS14; kakade2013optimal. In the presence of a more complex class of experts, which can utilize the context to make better decisions, the benchmark becomes harder as we are now competing with the best expert in expectation. To compete with the best expert, the platform has to take the experts’ predictions into account when making decisions, and the interaction between the agents and the experts makes truthfulness harder to achieve.

We focus our attention on designing mechanisms using exploration-separated allocation rules. For such allocation rules, in any non-influential round, i.e. a round whose click results and bids do not influence the future decisions of the principal, the bid reported by an agent can only influence the utility obtained in the current round. Hence the agents optimize their bids only to maximize the utility from the current non-influential round. In other words, if for each agent $a$ and each non-influential round $t$, $u^{t}(a,b)$ is maximized at $b=\mu(a)$, then all agents will bid truthfully in every non-influential round. This fact gives a necessary condition for a mechanism that uses an exploration-separated allocation rule to be truthful.

Lemma 1.

If a mechanism uses an exploration-separated allocation rule, then the mechanism is truthful only if, for every non-influential round $t$ and every agent $a$, the probability that she gets clicked in round $t$, i.e. $g^{t}(a,b)$, is monotonically increasing in her bid $b$ and the payment rule satisfies

p^{t}(a,b)\cdot g^{t}(a,b)=b\cdot g^{t}(a,b)-\int_{y=0}^{b}g^{t}(a,y)\mathrm{d}y

In fact, this payment rule can be viewed as the pay-per-click auction version of the Myerson payment identity (myerson1981optimal). This lemma shows that a mechanism using an exploration-separated allocation rule can be truthful under some payment rule only if $g^{t}(a,b)$ is monotonically increasing in $b$ in every non-influential round $t$.
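For concreteness, the identity can be evaluated numerically once $g^{t}(a,\cdot)$ is known. The sketch below is ours (a simple grid approximation of the integral; g_fn is an assumed monotone stand-in for $g^{t}(a,\cdot)$), not the payment routine of the mechanism itself.

def lemma1_payment(g_fn, b, grid_size=10000):
    # Per-click payment implied by Lemma 1:
    #   p^t(a,b) = b - (1 / g^t(a,b)) * integral_0^b g^t(a,y) dy,
    # with the integral approximated by a midpoint rule on a uniform grid.
    g_b = g_fn(b)
    if g_b == 0.0:
        return 0.0                     # the ad is never clicked at this bid
    step = b / grid_size
    integral = sum(g_fn((i + 0.5) * step) for i in range(grid_size)) * step
    return b - integral / g_b

# Example: a threshold allocation (the ad wins iff the bid exceeds d), with
# click-through rate 0.7, yields a per-click charge of d.
d = 0.3
g = lambda y: 0.7 if y > d else 0.0
print(lemma1_payment(g, b=0.9))        # approximately 0.3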

Note that one simple allocation rule with monotone $g^{t}(a,b)$ is to pick an arm uniformly at random in each round, but such an allocation rule is clearly far worse than the benchmark we want to achieve. We need a more sophisticated allocation rule that guarantees monotonicity of $g^{t}(a,b)$ while recovering the benchmark.

Perfect information.

If the click probability induced by each expert for each agent, $\Pi(a,h)$ for all $a\in\mathcal{A}$ and $h\in\mathcal{H}$, is known to the principal at the beginning, then a straightforward idea for recovering the benchmark is to always follow the expert with the highest reported welfare, i.e. $h^{t}=\operatorname*{argmax}_{h}\{R(h,\bm{b})\}$, in each round. A mechanism with such an allocation rule recovers the benchmark if all agents bid truthfully, and the lemma below shows that there exist payment rules that make the mechanism truthful.

Lemma 2.

In the contextual PPC auction, there exist payment rules such that the mechanism which always follows the expert with the highest reported welfare, $h^{t}=\operatorname*{argmax}_{h}R(h,\bm{b}^{t})$, is truthful.

The intuition behind the proof is to show that the click probability of each agent $a$ as a function of her bid is monotonically increasing. This is true because if an agent increases her bid and the expert selected by the mechanism changes, then the click probability for that agent induced by the new expert can only be higher.
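A sketch of this perfect-information allocation rule is given below (illustrative only; Pi is the table of known click probabilities $\Pi(a,h)$ and bids is the current bid profile, both represented as plain dictionaries of our choosing).

def best_reported_expert(Pi, bids):
    # Pi: dict mapping expert -> dict mapping agent -> Pi(a, h)
    # bids: dict mapping agent -> b(a)
    def reported_welfare(h):
        # R(h, b) = sum_a b(a) * Pi(a, h)
        return sum(bids[a] * Pi[h][a] for a in bids)
    return max(Pi, key=reported_welfare)

# In each round the mechanism would follow h_t = best_reported_expert(Pi, bids)
# and display the arm h_t(x_t) suggested for the realized context.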

Aside from the requirement of a monotone $g^{t}(a,b)$, the payment rule poses an additional difficulty in the contextual setting. In the easier case where experts are arms themselves, as in devanur2009price, the welfare-maximizing allocation rule results in a simple $g^{t}(a,b)$ whose value is the click-through-rate of agent $a$ times the indicator of $b>d$, where $d$ is the weighted highest bid of the other bidders. The corresponding truthful payment rule given by Lemma 1 is a fixed per-click charge, $p^{t}(a,b)=d$, which does not require knowledge of the click-through-rate. With a more complex class of experts, the allocation rule that chooses the expert with the highest reported welfare can result in a much more complicated $g^{t}(a,b)$, and finding the corresponding truthful payment rule requires full knowledge of $g^{t}(a,b)$, which in turn requires knowledge of $\Pi(a,h)$ for all $a$ and $h$. If the payment rule a mechanism actually uses deviates only slightly from the truthful payment rule, the utility from bidding truthfully is still close to optimal, but the optimal bid can be far from the agent’s true value. We give an example in Appendix LABEL:sec:example to illustrate this fact. For this reason, we relax the requirement of truthfulness by introducing the notion of $\alpha$-rational agents, who bid truthfully as long as bidding truthfully yields utility close to the optimal utility.

Definition 6 ($\alpha$-rational).

In mechanisms where an agent’s bid in a round does not affect future rounds, an $\alpha$-rational agent bids truthfully in round $t$ if the extra utility she could obtain in that round by misreporting her value is bounded by $\alpha$. That is, an $\alpha$-rational agent $a$ bids truthfully in round $t$ if $\max_{b}u^{t}(a,b)-u^{t}(a,\mu(a))\leq\alpha$. Note that $\alpha\in[0,1]$.

Since estimating the optimal bid requires an agent to know information about the competition and about the mechanism’s internal estimates, it is reasonable to assume that if the extra utility she could gain from trying to estimate her optimal bid is very small, she will simply bid truthfully.

3.1 $\epsilon$-Greedy for Contextual Bandit Mechanisms

We now present the contextual $\epsilon$-greedy mechanism (Algorithm LABEL:alg:contextmab) for the contextual multi-armed bandit PPC auction setting. In each round, with a fixed probability $\epsilon$, the mechanism selects an arm uniformly at random. Such rounds are called explore rounds, and the mechanism learns about each expert based only on the information from the explore rounds. For each arm $a\in\mathcal{A}$ and expert $h$, the mechanism learns $\Pi(a,h)$, the click probability of agent $a$ induced by expert $h$, through an empirical estimate $\hat{\Pi}(a,h)$ computed from the explore rounds. No payment is charged in the explore rounds. The remaining rounds are called exploit rounds, in which the principal follows the empirically best expert under the current bids and selects the arm suggested by this expert. In case of a click, the mechanism also uses the estimates $\hat{\Pi}(a,h)$ to charge the payments given by Lemma 1, with the true click probabilities of the agents replaced by their empirical estimates.
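One plausible form of the estimation step, consistent with the proof sketch of Lemma 3 below, is given here for concreteness (this is our reading, since Algorithm LABEL:alg:contextmab is not reproduced in full): in an explore round the displayed arm is uniform over the $K$ agents, so $K\cdot\mathbbm{1}\{a^{t}=h(x^{t})=a\}\,c^{t}$ is an unbiased sample of $\Pi(a,h)$.

class PiEstimator:
    # Empirical estimates hat-Pi(a, h) built from explore rounds only.
    # Unbiasedness relies on the explored arm being uniform over the K agents.
    def __init__(self, agents, experts):
        self.sums = {(a, h): 0.0 for a in agents for h in experts}
        self.n_explore = 0

    def update(self, x_t, a_t, c_t, experts):
        # Call only on explore rounds, with the uniformly explored arm a_t
        # and the observed click result c_t.
        self.n_explore += 1
        for h in experts:
            if h(x_t) == a_t:
                self.sums[(a_t, h)] += c_t

    def estimate(self, a, h, K):
        if self.n_explore == 0:
            return 0.0
        return K * self.sums[(a, h)] / self.n_explore   # hat-Pi(a, h)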

First, we show that the contextual $\epsilon$-greedy mechanism maintains accurate empirical estimates of the click probabilities for agents induced by experts.

Lemma 3.

For the contextual $\epsilon$-greedy mechanism (Algorithm LABEL:alg:contextmab), let the number of experts in $\mathcal{H}$ be $m$, the number of rounds be $T$, and the number of agents be $K$. With probability at least $1-3/T$, for every expert $h$, every arm $a$, and every round $t\geq 24\log T/\epsilon$, the estimation error on the click probability is bounded by

|\hat{\Pi}^{t}(a,h)-\Pi(a,h)|\leq K\sqrt{\frac{2\log(T\cdot m\cdot K)}{t\epsilon}}.

Since each arm is explored with a fixed probability $\epsilon/K$ in each round, each expert obtains a sample with this probability, and the result follows from a standard concentration bound.
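In slightly more detail (a sketch under the estimator form described above, with constants not matching the lemma exactly): each explore round $s$ contributes a sample $Y_{s}=\mathbbm{1}\{a^{s}=h(x^{s})=a\}\,c^{s}\in[0,1]$ with $\mathbbm{E}[Y_{s}]=\Pi(a,h)/K$. If $n_{e}$ is the number of explore rounds up to round $t$, Hoeffding's inequality with a union bound over all rounds, experts, and arms gives, with high probability,

\left|\frac{1}{n_{e}}\sum_{s}Y_{s}-\frac{\Pi(a,h)}{K}\right|\leq\sqrt{\frac{\log(T\cdot m\cdot K)}{n_{e}}},

while a Chernoff bound ensures $n_{e}\geq t\epsilon/2$ once $t\geq 24\log T/\epsilon$; multiplying through by $K$ gives a bound of the same order as in Lemma 3.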

Since we use empirical estimates to calculate the payments, in the initial few rounds agents might be able to take advantage of the inaccurate estimates. However, after a few rounds, as the estimates become more accurate, we show that the extra utility agents can obtain by misreporting is bounded. Thus, if the agents are $\alpha$-rational, after a few rounds they will start bidding truthfully.

Theorem 4.

For the contextual $\epsilon$-greedy mechanism (Algorithm LABEL:alg:contextmab), let the number of experts in $\mathcal{H}$ be $m$, the number of rounds be $T$, and the number of agents be $K$. With probability at least $1-3/T$, an $\alpha$-rational agent bids truthfully in all rounds $t\geq\frac{8K^{2}\log(T\cdot m\cdot K)}{\epsilon\alpha^{2}}$.

The intuition behind Theorem 4 is that as the estimates $\hat{\Pi}(a,h)$ become more accurate for all arms $a\in\mathcal{A}$ and experts $h\in\mathcal{H}$, the corresponding payment rule also becomes more accurate, so the extra utility an agent can obtain by trying to exploit the errors in the estimates decreases as the rounds go by.

Using Lemma 3 and Theorem 4, we now present the main regret guarantee for this setting. We show that the contextual $\epsilon$-greedy mechanism (Algorithm LABEL:alg:contextmab) has diminishing regret compared to the best expert when all the agents are $\alpha$-rational.

Theorem 5.

For the contextual $\epsilon$-greedy mechanism (Algorithm LABEL:alg:contextmab), let the number of experts in $\mathcal{H}$ be $m$, the number of rounds be $T$, and the number of agents be $K$. If all agents are $\alpha$-rational, the expected welfare regret of the contextual $\epsilon$-greedy mechanism satisfies

\textsc{Reg}(T)\leq T\epsilon+K^{2}\sqrt{\frac{8T\log(T\cdot m\cdot K)}{\epsilon}}+3K+\frac{8K^{2}\log(T\cdot m\cdot K)}{\epsilon\alpha^{2}}

Setting $\epsilon=2T^{-1/3}\log(T\cdot m\cdot K)^{1/3}K^{4/3}$, we have

\textsc{Reg}(T)\leq\widetilde{O}\left(T^{2/3}K^{4/3}+T^{1/3}K^{2/3}/\alpha^{2}\right)

The result follows from the observation that the regret can be decomposed into four parts: (a) explore rounds; (b) the rare cases where the estimates are not accurate at all; (c) the early rounds when $\alpha$-rational agents do not bid truthfully; and (d) exploit rounds when the estimates are accurate and agents are bidding truthfully. Parts (a) and (b) can be bounded as a direct consequence of concentration inequalities, and (c) is bounded using Theorem 4. For (d), we show that even if we follow the suggestions of a sub-optimal expert, its expected welfare cannot be too far from that of the best expert.
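For intuition on the choice of $\epsilon$ (a routine calculation on the bound above, ignoring the additive $3K$ term), write $L=\log(T\cdot m\cdot K)$. With $\epsilon=2T^{-1/3}L^{1/3}K^{4/3}$,

T\epsilon=2T^{2/3}L^{1/3}K^{4/3},\qquad K^{2}\sqrt{\frac{8TL}{\epsilon}}=2T^{2/3}L^{1/3}K^{4/3},\qquad\frac{8K^{2}L}{\epsilon\alpha^{2}}=\frac{4T^{1/3}L^{2/3}K^{2/3}}{\alpha^{2}},

so the first two terms balance at $\widetilde{O}(T^{2/3}K^{4/3})$ and the last term is $\widetilde{O}(T^{1/3}K^{2/3}/\alpha^{2})$.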

Parameters: Number of arms $K$, number of rounds $T$, exploration rate $\epsilon$, a hypothesis class $H$
Initialize: $\hat{h}\leftarrow$ an arbitrary expert in $H$, $t_{e}\leftarrow\{\}$
  /* $\hat{h}$ is the empirical best expert, $t_{e}$ is the set of indices of explore rounds */
for $t=1,\ldots,T$ do

      Receive bids $b^{t}_{j}$ from arm $j$ for all $j\in[K]$
      $\ell^{t}\leftarrow\begin{cases}1&\textnormal{w.p. }\epsilon\\ 0&\textnormal{otherwise}\end{cases}$
        /* Explore or exploit */

      if $\ell^{t}=1$ then
             /* ... Explore Round ... */