
Learning against Non-credible Auctions

Qian Wang1, Xuanzhi Xia2, Zongjun Yang3, Xiaotie Deng1, Yuqing Kong1,
Zhilin Zhang4, Liang Wang4, Chuan Yu4, Jian Xu4, Bo Zheng4
Abstract

The standard framework of online bidding algorithm design assumes that the seller commits himself to faithfully implementing the rules of the adopted auction. However, the seller may attempt to cheat in execution to increase his revenue if the auction belongs to the class of non-credible auctions. For example, in a second-price auction, the seller could create a fake bid between the highest bid and the second highest bid. This paper focuses on one such case of online bidding in repeated second-price auctions. At each time $t$, the winner with bid $b_t$ is charged not the highest competing bid $d_t$ but a manipulated price $p_t=\alpha_0 d_t+(1-\alpha_0)b_t$, where the parameter $\alpha_0\in[0,1]$ in essence measures the seller's credibility. Unlike classic repeated-auction settings where the bidder has access to samples $(d_s)_{s=1}^{t-1}$, she can only receive mixed signals of $(b_s)_{s=1}^{t-1}$, $(d_s)_{s=1}^{t-1}$ and $\alpha_0$ in this problem. The task for the bidder is to learn not only the bid distributions of her competitors but also the seller's credibility. We establish regret lower bounds in various information models and provide corresponding online bidding algorithms that can achieve near-optimal performance. Specifically, we consider three cases of prior information based on whether the credibility $\alpha_0$ and the distribution of the highest competing bids are known. Our goal is to characterize the landscape of online bidding in non-credible auctions and understand the impact of the seller's credibility on online bidding algorithm design under different information structures.

1 Introduction

Digital advertising has experienced significant expansion due to the rapid rise of online activities, surpassing traditional advertising as the dominant marketing influence in various industries. Between 2021 and 2022, digital advertising revenues in the U.S. grew 10.8% year-over-year to a total of $209.7 billion (IAB 2023). In practice, a huge amount of online ads are sold via real-time auctions implemented on advertising platforms, and advertisers participate in such repeated online auctions to purchase advertising opportunities. This has motivated a flourishing line of work on the problem of online bidding algorithm design. In particular, learning to bid in repeated second-price auctions—often with constraints or unknown own valuations—has been well studied due to the popularity of this auction format in practice (Weed, Perchet, and Rigollet 2016; Balseiro and Gur 2019; Golrezaei et al. 2021; Balseiro, Lu, and Mirrokni 2022; Feng, Padmanabhan, and Wang 2022; Chen et al. 2022, 2023). However, these studies are in fact based on an implicit assumption that the seller commits himself to faithfully implementing the rules of the announced second-price auction.

The possibility that a seller can profitably cheat in a second-price auction was pointed out as early as the seminal paper (Vickrey 1961) that introduced this auction format. After observing all the bids, the seller can strategically exaggerate the highest competing bid and overcharge the winner up to the amount of her own bid. Several following papers studied the issue of seller cheating in second-price auctions (Rothkopf and Harstad 1995; Porter and Shoham 2005; McAdams and Schwarz 2007). Recent theoretical work by Akbarpour and Li (2020) formally modelled credibility in an extensive-form game where the seller is allowed to deviate from the auction rules as long as the deviation cannot be detected by bidders. They defined an auction to be credible if it is incentive-compatible for the seller to follow the rules in the presence of cheating opportunities. In a second-price auction, the seller can even charge the winner the amount of her own bid to obtain higher utility, with an innocent explanation that the highest and second-highest bids are identical. Therefore, the prevalent second-price auction belongs to the class of non-credible auctions in this framework. Taking credibility into consideration, advertisers are confronted with a question of practical importance: how should an advertiser bid in repeated non-credible second-price auctions to maximize her cumulative utility?

In this work, we formulate the above problem as an online learning problem for a single bidder. We consider the scenario with a single seller who runs repeated non-credible second-price auctions. At each time $t$, the winner with bid $b_t$ is charged not the highest competing bid $d_t$ but a manipulated price $p_t=\alpha_0 d_t+(1-\alpha_0)b_t$ for some $\alpha_0\in[0,1]$. The parameter $\alpha_0$ in essence captures the seller's credibility, the extent to which the seller deviates from the second-price auction rules. Our linear model is equivalent to the classic bid-shilling model (Rothkopf and Harstad 1995; Porter and Shoham 2005; McAdams and Schwarz 2007) in expectation. In the bid-shilling model, the seller cheats by inserting a shill bid after observing all of the bids with probability $P^c$. The seller is assumed to take full advantage of his power, so the winner will pay her own bid if the seller does cheat. Then the winner's expected payment is $P^c b_t+(1-P^c)d_t$. Moreover, no matter what charging rule the seller actually uses, as long as the estimated credibility within this linear model is away from $1$, it can be confirmed that the seller is cheating. We believe our results and techniques have implications for the more complicated setting with $p_t=h(b_t,d_t;\alpha)$.

We assume the highest competing bids $(d_t)_{t=1}^{T}$ are i.i.d. sampled from a distribution $G$. The bidder aims to maximize her expected cumulative utility, which is given by the expected difference between the total value and the total payment. Moving to the information model, we investigate three cases of prior information: (1) known $\alpha_0$ and unknown $G$; (2) unknown $\alpha_0$ and known $G$; (3) unknown $\alpha_0$ and unknown $G$. For all three cases, we consider bandit feedback where the bidder can observe the realized allocation and cost at each round. For the last case, where neither is known, we additionally consider full feedback where the price $p_t$ is always observable regardless of the auction outcome. More discussions on modeling are deferred to Section 2.

The key challenge of this problem lies in the lack of credibility and its impact on the learning process. If the seller has full commitment, it is well known that truthful bidding is the dominant strategy in second-price auctions, but this truthful property no longer holds in non-credible second-price auctions. Identifying optimal bidding strategies for utility maximization requires not only knowing the bidder's own values but also considering the strategies of her competitors and the seller. In classic repeated-auction settings (assuming a trustworthy seller), online bidding algorithms can collect historical samples $(d_s)_{s=1}^{t-1}$ to estimate the distribution $G$. However, the bidder in non-credible auctions needs to cope with an additional dimension of uncertainty: the available observations under either bandit or full feedback are all manipulated prices, i.e., mixed signals of $(b_s)_{s=1}^{t-1}$, $(d_s)_{s=1}^{t-1}$ and $\alpha_0$. As a result, difficulties arise in the estimation of both the distribution of her competitors' bids and the seller's credibility.

1.1 Main Contributions

First, we characterize the optimal clairvoyant bidding strategy in non-credible second-price auctions when the bidder knows both the credibility $\alpha_0$ and the distribution $G$. This optimal clairvoyant bidding strategy is also used as the benchmark strategy in the regret definition.

Next, we establish regret lower bounds in various information models and provide corresponding online bidding algorithms that are optimal up to log factors. Our results are summarized in Table 1.

  • For the case where $G$ is unknown and $\alpha_0$ is known, we explore the landscape by discussing how the problem varies with the credibility parameter: $\alpha_0=0$, $\alpha_0=1$ and $\alpha_0\in(0,1)$. We mainly contribute to the regret analysis for $\alpha_0\in(0,1)$, with a proven $\Omega(\sqrt{T})$ lower bound and a concrete near-optimal $\widetilde{O}(\sqrt{T})$ algorithm.

  • For the case where $G$ is known and $\alpha_0$ is unknown, we develop an $O(\log^{2}T)$ algorithm, which adopts a dynamic estimation approach to approximate $\alpha_0$.

  • For the challenging case where both $G$ and $\alpha_0$ are unknown, we observe that under bandit feedback, an $\Omega(T^{2/3})$ lower bound and an $\widetilde{O}(T^{2/3})$ algorithm follow directly from existing algorithms. We then turn to the more interesting setting with full information feedback, for which we propose an episodic bidding algorithm that learns $\alpha_0$ and $G$ simultaneously in an efficient manner, while achieving a near-optimal regret of $\widetilde{O}(\sqrt{T})$.

Overall, this work provides a theoretical regret analysis for learning against non-credible auctions. We aim to characterize the landscape of online bidding in non-credible auctions and analyze how the seller's credibility influences the design of online bidding algorithms under different information structures.

Table 1: Result Summary.

| $\alpha_0$ | $G$ | Feedback | Upper bound | Lower bound | Theorem |
| Known, $\alpha_0=1$ | Unknown | Bandit | $0$ | $0$ | Theorem 3.1 |
| Known, $\alpha_0\in(0,1)$ | Unknown | Bandit | $\widetilde{O}(T^{1/2})$ | $\Omega(T^{1/2})$ | Theorem 3.1 |
| Known, $\alpha_0=0$ | Unknown | Bandit | $\widetilde{O}(T^{2/3})$ | $\Omega(T^{2/3})$ | Theorem 3.1 |
| Unknown | Known | Bandit | $\widetilde{O}(1)$ | $\Omega(1)$ | Theorem 4.3¹ |
| Unknown | Unknown | Bandit | $\widetilde{O}(T^{2/3})$ | $\Omega(T^{2/3})$ | Corollary 5.1 |
| Unknown | Unknown | Full | $\widetilde{O}(T^{1/2})$ | $\Omega(T^{1/2})$ | Theorem 5.6¹ |

¹ The regret bounds of these theorems rely on corresponding assumptions.

1.2 Related Work

Credibility in auctions. The issue of seller cheating has been studied in the game-theoretic literature within a strategic framework (McAdams and Schwarz 2007; Porter and Shoham 2005; Rothkopf and Harstad 1995). Recently, Akbarpour and Li (2020) explored the setting where the seller deviates from the auction rules in a way that can be innocently explained. To this end, they defined credibility based on the detectability of seller deviations. They further established an impossibility result that no optimal auction simultaneously achieves staticity, credibility and strategy-proofness, and showed that the first-price auction is the unique static optimal auction that achieves credibility. Hakimov and Raghavan (2023) considered the general allocation problem and introduced the definitions of verifiability (i.e., allowing participants to check the correctness of their assignments) and transparency (i.e., allowing participants to check whether the seller deviates from the allocation rule). These are stronger security notions than the credibility concept investigated by Akbarpour and Li (2020). Dianat and Freer (2021) studied how the credibility of an auction format affects bidding behavior and final outcomes via laboratory experiments. Their empirical findings confirm the theory that sellers do have incentives to break the auction rules and overcharge the winning bidder. These pioneering works discussed how a participant can potentially detect and learn non-credible mechanisms, as we do. In contrast, our work is based on the online learning framework, where information revelation is partial and sequential in nature.

Learning to bid. Our work is closely related to the line of literature on learning to bid in repeated auctions. Balseiro et al. (2019); Han, Zhou, and Weissman (2020); Han et al. (2020); Badanidiyuru, Feng, and Guruganesh (2023); Zhang et al. (2022) studied the problem of learning in repeated first-price auctions. Castiglioni, Celli, and Kroer (2022); Wang et al. (2023) studied no-regret learning in repeated first-price auctions with budget constraints. As for repeated second-price auctions, Balseiro and Gur (2019); Balseiro, Lu, and Mirrokni (2022); Chen et al. (2022) considered the bidding problem with budget constraints, and Feng, Padmanabhan, and Wang (2022); Golrezaei et al. (2021) further considered return-on-spend (RoS) constraints. All these works assume that the seller has full commitment to the announced auction rules.

2 Problem Formulation

We consider the problem of online learning in repeated non-credible second-price auctions. We focus on a single bidder in a large population of bidders during a time horizon $T$. In each round $t=1,\ldots,T$, there is an available item auctioned by a single seller. The bidder perceives a value $v_t\in[0,1]$ for this item, and then submits a bid $b_t\in\mathbb{R}_{+}$ based on $v_t$ and all historical observations available to her. We denote the maximum bid of all other bidders by $d_t\in\mathbb{R}_{+}$. As usual, we use the bold symbol $\bm{v}$ without subscript $t$ to denote the vector $(v_1,\ldots,v_T)$; the same goes for other variables in the present paper.

We consider a stochastic setting where $v_t$ is i.i.d. sampled from a distribution $F$ and $d_t$ is i.i.d. sampled from a distribution $G$. The latter assumption follows from the standard mean-field approximation (Iyer, Johari, and Sundararajan 2014; Balseiro, Besbes, and Weintraub 2015) and is a common practice in the literature. The main rationale behind this assumption is that when the number of other bidders is large, on average their valuations and bidding strategies are static over time. Whether $G$ is known to the bidder depends on the information structure, while $F$ is always unknown to the bidder.

The seller claims that he follows the rules of the second-price auction, but he actually uses a combination of the first-price auction and the second-price auction. For simplicity, we assume a linear model. The auction outcome in round $t$ is then as follows: if $b_t\geq d_t$, the bidder wins the item and pays $p_t\coloneqq\alpha_0 d_t+(1-\alpha_0)b_t$, where $\alpha_0$ is assumed to be a fixed weight throughout the period; if $b_t<d_t$, the bidder loses the auction and pays nothing. Here we assume that ties are broken in favor of the bidder under consideration to simplify exposition. We only consider $\alpha_0\in[0,1]$, since $\alpha_0<0$ would be immediately detected by the winner and $\alpha_0>1$ would lead to lower revenue than the honest second-price auction. One can observe that the pricing rule follows the second-price auction when $\alpha_0=1$, and it follows the first-price auction when $\alpha_0=0$.

Let $x_t\coloneqq\mathbb{I}\left\{b_t\geq d_t\right\}$ be the binary variable indicating whether the bidder wins the item. Let $c_t\coloneqq x_t p_t$ be the bidder's cost and let $r_t\coloneqq x_t v_t-c_t$ be the corresponding reward.
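
To make the round-level mechanics concrete, the following minimal sketch simulates one round of this model; the uniform draw for $d_t$ is only an illustrative stand-in for $G$ and is not part of the formulation.

```python
import numpy as np

def play_round(b_t, v_t, alpha0, rng):
    """One round of a repeated non-credible second-price auction."""
    d_t = rng.uniform(0.0, 1.0)                # highest competing bid d_t ~ G (uniform as a placeholder)
    x_t = 1.0 if b_t >= d_t else 0.0           # allocation: ties broken in favor of the bidder
    p_t = alpha0 * d_t + (1.0 - alpha0) * b_t  # manipulated price
    c_t = x_t * p_t                            # realized cost (zero when the bidder loses)
    r_t = x_t * v_t - c_t                      # realized reward
    return x_t, p_t, c_t, r_t

rng = np.random.default_rng(0)
x, p, c, r = play_round(b_t=0.8, v_t=0.8, alpha0=0.5, rng=rng)  # e.g. a truthful bid, credibility 0.5
```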

Information structure.

In this paper, we investigate three cases of prior information: 1) known credibility $\alpha_0$ and unknown distribution $G$; 2) unknown credibility $\alpha_0$ and known distribution $G$; 3) unknown credibility $\alpha_0$ and unknown distribution $G$.

The first two cases not only serve as warm-up analyses, but also have practical significance in their own right. In reality, a bidder may receive additional signals beyond the learning process to construct her belief over the seller's credibility or the strategies of other bidders, e.g., the seller's reputation reported by other bidders, or bidding data collected through other credible channels. Her bidding algorithm then mainly aims to learn the remaining unknown component of the competing environment.

We consider two cases of information feedback:

  1. Bandit information feedback. The bidder can observe the allocation $x_t$ and the cost $c_t$ at the end of each round $t$.

  2. Full information feedback. The bidder can observe the allocation $x_t$ and the price $p_t$ at the end of each round $t$.

The second feedback model makes sense in non-censored auctions where the seller-side platform (SSP) is supposed to provide the minimum winning price to every bidder regardless of the outcome. If the bidder wins, a dishonest seller will overstate the minimum winning price to overcharge the winner; if the bidder loses, a dishonest seller will understate the minimum winning price to trick the bidder into raising her bids in the following rounds. For simplicity, the full-feedback model assumes that these two types of deception are symmetric, controlled by the same parameter $\alpha_0$. Note that when $\alpha_0=0$, both feedback models are equivalent to the binary feedback model in first-price auctions.

We denote the historical observations available to the bidder before submitting a bid in round $t$ by $\mathcal{H}_t$. For the two cases of information feedback, we have, respectively,

\mathcal{H}^{B}_{t}\coloneqq\left(v_{s},x_{s},c_{s}\right)_{s=1}^{t-1},\qquad\mathcal{H}^{F}_{t}\coloneqq\left(v_{s},x_{s},p_{s}\right)_{s=1}^{t-1}.

We will omit the superscript $B$ or $F$ in the remainder of this paper when the context is clear.

Bidding strategy and regret.

A bidding strategy maps $(\mathcal{H}_t,v_t)$ to a (possibly random) bid $b_t$ for each $t$. For a strategy $\pi$, we denote by $\mathcal{R}(\pi)$ the expected performance of $\pi$, defined as follows:

\mathcal{R}(\pi)=\mathbb{E}^{\pi}_{\bm{v},\bm{d}}\left[\sum_{t=1}^{T}r^{\pi}_{t}\right]=\mathbb{E}^{\pi}_{\bm{v},\bm{d}}\left[\sum_{t=1}^{T}\mathbb{I}\left\{b^{\pi}_{t}\geq d_{t}\right\}\left(v_{t}-p_{t}^{\pi}\right)\right],

where the expectation is taken with respect to the values $\bm{v}$, the highest competing bids $\bm{d}$, and any possible randomness embedded in the strategy $\pi$. The expected regret of the bidder is defined to be the difference between the expected cumulative rewards of the bidder's strategy and of the optimal bidding strategy, which has perfect knowledge of $\alpha_0$, $F$ and $G$ to maximize the target:

\mathrm{Regret}(\pi)=\max_{\pi^{\prime}}\mathcal{R}(\pi^{\prime})-\mathcal{R}(\pi).

Now we look at the optimal bidding strategy. Using the independence of $d_t$ from $v_t$, one has

\mathcal{R}(\pi)=\sum_{t=1}^{T}\mathbb{E}^{\pi}_{\mathcal{H}_{t},v_{t}}\left[(v_{t}-b_{t}^{\pi})G(b_{t}^{\pi})+\alpha_{0}\int_{0}^{b_{t}^{\pi}}G(y)\,dy\right].
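
Indeed, conditioning on the bid and integrating over $d_t\sim G$ (using $\int_{0}^{b}y\,dG(y)=bG(b)-\int_{0}^{b}G(y)\,dy$), for any bid $b$ and value $v$,

\mathbb{E}_{d\sim G}\left[\mathbb{I}\left\{b\geq d\right\}\left(v-\alpha_{0}d-(1-\alpha_{0})b\right)\right]=\left(v-(1-\alpha_{0})b\right)G(b)-\alpha_{0}\int_{0}^{b}y\,dG(y)=(v-b)G(b)+\alpha_{0}\int_{0}^{b}G(y)\,dy.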

Let $r(v,b,\alpha)=\left(v-b\right)G(b)+\alpha\int_{0}^{b}G(y)\,dy$ and let $b^{*}(v,\alpha)=\arg\max_{b}r(v,b,\alpha)$ (taking the largest in the case of a tie). Then, with perfect knowledge of $\alpha_0$ and $G$, the optimal strategy in each round submits $b^{*}(v_t,\alpha_0)$, denoted by $b_t^{*}$ for short. The expected regret of strategy $\pi$ can be written as

\mathrm{Regret}(\pi)=\mathbb{E}^{\pi}_{\bm{v},\bm{d}}\left[\sum_{t=1}^{T}r(v_{t},b_{t}^{*},\alpha_{0})-\sum_{t=1}^{T}r(v_{t},b_{t}^{\pi},\alpha_{0})\right].

We will omit the superscript $\pi$ in the remainder of this paper when the context is clear.
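
As a simple illustration (the uniform choice is ours and is not assumed elsewhere), if $G$ is the uniform distribution on $[0,1]$, then $r(v,b,\alpha)=(v-b)b+\alpha b^{2}/2$ for $b\in[0,1]$, and the first-order condition $v-(2-\alpha)b=0$ gives

b^{*}(v,\alpha)=\frac{v}{2-\alpha},

which recovers truthful bidding $b^{*}=v$ at $\alpha=1$ and the first-price best response $b^{*}=v/2$ against a uniform competing bid at $\alpha=0$.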

3 Learning with Known $\alpha$ and Unknown $G$

We start with the scenario where the credibility parameter $\alpha_0$ is known but the distribution $G$ of the highest competing bids is unknown. The algorithms and proofs of this section are deferred to Appendix A.

Observe that when $\alpha_0$ reaches an endpoint of $[0,1]$, this problem degenerates into a bidding problem in repeated first-price or second-price auctions. Therefore, we consider three cases separately: $\alpha_0=0$, $\alpha_0=1$, and $\alpha_0\in(0,1)$. The following Theorem 3.1 provides a comprehensive characterization of this setting. For each case, it establishes a regret lower bound and gives an algorithm with optimal performance up to log factors.

Theorem 3.1.

For repeated non-credible second-price auctions with known credibility $\alpha_0$, unknown distribution $G$ and bandit feedback:

  (1) when $\alpha_0=1$, truthful bidding achieves no regret;

  (2) when $\alpha_0=0$, there exists a bidding algorithm (Algorithm A.1) that achieves an $\widetilde{O}(T^{2/3})$ regret, and the lower bound on regret for this case is $\Omega(T^{2/3})$;

  (3) when $\alpha_0\in(0,1)$, there exists a bidding algorithm (Algorithm A.2) that achieves an $\widetilde{O}(T^{1/2})$ regret, and the lower bound on regret for this case is $\Omega(T^{1/2})$.

The lower bounds in Theorem 3.1 show that the hardness of this learning problem increases as $\alpha_0$ decreases, which demonstrates the impact of reduced credibility on online bidding optimization. When $\alpha_0$ deviates from $1$, truthful bidding is no longer a dominant strategy. The bidder has to learn the competitors' bids to make decisions, and thus the distribution estimation error introduces an inevitable regret of order $\Omega(T^{1/2})$. As long as $\alpha_0>0$, the bidder can infer the highest competing bid $d_t$ from her payment once she wins an auction, by measuring the difference between $c_t$ and $b_t$. However, when $\alpha_0$ becomes $0$, $c_t\equiv b_t$ in winning rounds provides no additional information about $d_t$. The complete loss of credibility cripples the bidder's ability to observe the competitive environment and estimate the distribution $G$, so the regret lower bound leaps from $\Omega(T^{1/2})$ to $\Omega(T^{2/3})$.

The first statement of Theorem 3.1 is trivial due to the nature of the second-price auction.

The second case is equivalent to bidding in repeated first-price auctions with binary feedback (receiving only $x_t=1$ or $0$), which can be modeled as a contextual bandits problem with cross learning:

  • Cross learning over contexts. The bidder in round $t$ not only receives the reward $r_t$ under $(v_t,b_t)$, but also observes the reward $r^{\prime}$ under $(v^{\prime},b_t)$ for every other $v^{\prime}$.

For the contextual bandits problem with cross-learning over contexts in the stochastic setting, Balseiro et al. (2019) proposed a UCB-based algorithm that can achieve an $O(\sqrt{KT})$ regret, where $K$ is the number of actions. Applying this algorithm to the auction setting results in a regret bound of $\widetilde{O}(\sqrt{KT}+T/K)$, where the last term comes from the discretization error; the upper bound becomes $\widetilde{O}(T^{2/3})$ with $K\sim T^{1/3}$. They also proved that the regret lower bound is $\Omega(T^{2/3})$ via a reduction to the problem of dynamic pricing.
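
The choice $K\sim T^{1/3}$ simply balances the two error terms: $\sqrt{KT}$ increases in $K$ while $T/K$ decreases in $K$, and equating them at $K=T^{1/3}$ gives $\sqrt{KT}=T/K=T^{2/3}$.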

Lemma 3.2 (Balseiro et al. (2019)).

For repeated first-price auctions with binary feedback, Algorithm A.1 can achieve an $\widetilde{O}(T^{2/3})$ regret, and there exists a problem instance where any algorithm must incur a regret of at least $\Omega(T^{2/3})$.

The third case is similar to bidding in repeated first-price auctions with censored feedback, where the seller runs first-price auctions and always reveals the winner's bid to each bidder, so the bidder can see the highest competing bid only if she loses the auction. Han, Zhou, and Weissman (2020) modeled that problem as a contextual bandits problem with cross learning and partial ordering:

  • Cross learning over contexts.

  • Partial ordering over actions. There exists a partial order $\preceq_{\mathcal{B}}$ over the action set $\mathcal{B}$. The bidder in round $t$ not only receives the reward $r_t$ under $(v_t,b_t)$, but also observes the reward $r^{\prime}$ under $(v_t,b^{\prime})$ for every other $b^{\prime}\preceq_{\mathcal{B}}b_t$.

  • Partial ordering over contexts. There exists a partial order $\preceq_{\mathcal{V}}$ over the context set $\mathcal{V}$ such that if $v_1\preceq_{\mathcal{V}}v_2$, then $b^{*}(v_1)\preceq_{\mathcal{B}}b^{*}(v_2)$, where $b^{*}(v)$ is the optimal action under context $v$.

Lemma 3.3.

[Han, Zhou, and Weissman (2020)] For the contextual bandits problem with cross-learning over contexts and partial ordering over actions and contexts in the stochastic setting, there exists a bidding algorithm that achieves an $\widetilde{O}(T^{1/2})$ regret.

We carefully adapt their algorithm to our setting (Algorithm A.2) and verify that the third case with $\alpha_0\in(0,1)$ satisfies the above three properties:

  • Cross learning over values. The bidder can calculate the reward $r^{\prime}$ under $(v^{\prime},b_t)$ by

    r^{\prime}=r_{t}+x_{t}(v^{\prime}-v_{t}). (1)

  • Partial ordering over bids. If the bidder wins in round $t$, she can infer the highest competing bid $d_t$ by

    d_{t}=\left(c_{t}-(1-\alpha_{0})b_{t}\right)/\alpha_{0}. (2)

    Therefore, the reward $r^{\prime}$ under $(v_t,b^{\prime})$ for any other $b^{\prime}<b_t$ can be calculated by using the corresponding allocation $\mathbb{I}\left\{b^{\prime}\geq d_t\right\}$ and price $\alpha_0 d_t+(1-\alpha_0)b^{\prime}$. If the bidder loses the auction with $b_t$, she should also lose with $b^{\prime}<b_t$ and the reward $r^{\prime}$ is $0$.

  • Partial ordering over values. We have shown in the previous section that the optimal bid under value $v$ is $b^{*}(v,\alpha_0)=\arg\max_{b}r(v,b,\alpha_0)$. The following lemma shows it is a non-decreasing function in $v$.

Lemma 3.4.

$b^{*}(v,\alpha)=\arg\max_{b}r(v,b,\alpha)$ (taking the largest in the case of a tie) is a non-decreasing function in both $v$ and $\alpha$.

Therefore, Algorithm A.2 can achieve an $\widetilde{O}(T^{1/2})$ regret when $\alpha_0\in(0,1)$. For the last piece of the puzzle, we prove the following regret lower bound. Remark that Lemma 3.5 holds for any $\alpha_0$, though the bound is not tight when $\alpha_0=0$.

Lemma 3.5.

For repeated non-credible second-price auctions with known credibility $\alpha_0$ and unknown distribution $G$, there exists a constant $c>0$ such that

\inf_{\pi}\sup_{G}\mathrm{Regret}(\pi)\geq c\cdot(1-\alpha_{0})\sqrt{T},

even in the special case with $v_t\equiv 1$ and full feedback.

4 Learning with Unknown $\alpha$ and Known $G$

We next consider the scenario where the distribution $G$ is known but the credibility $\alpha_0$ is unknown. The proofs of this section are deferred to Appendix B.

Input: Time horizon $T$; distribution $G$.
Initialization: The bidder submits $b_1=1$.
for $t\leftarrow 2$ to $T$ do
      The bidder receives the value $v_t\in[0,1]$.
      The bidder estimates the seller's credibility by
      \widetilde{\alpha}_{t}=\arg\min_{\alpha\in[0,1]}\left|\sum_{s=1}^{t-1}\left(r_{s}-r(v_{s},b_{s},\alpha)\right)\right|. (3)
      The bidder submits $b_t=\arg\max_{b}r(v_t,b,\widetilde{\alpha}_t)$.
end for
Algorithm 4.1 Learning with unknown $\alpha_0$, known $G$ and bandit feedback

Our bidding algorithm for this setting is depicted in Algorithm 4.1. The bidder first conducts a one-round exploration to make an appropriate initialization. After receiving the value $v_t$ in each round $t=2,\ldots,T$, the bidder computes $\widetilde{\alpha}_t$, the estimate of $\alpha_0$ based on the historical observations in the past $t-1$ rounds. Recall that the optimal bid $b_t^{*}$ shown in Section 2 maximizes $r(v_t,b,\alpha_0)$. Thus, by the choice of $b_t$, if the estimator $\widetilde{\alpha}_t$ is close to $\alpha_0$, the expected reward $r(v_t,b_t,\alpha_0)$ will be close to the optimal reward $r(v_t,b_t^{*},\alpha_0)$.

Equation 3 in Algorithm 4.1 aims to estimate $\alpha_0$ by fitting the observed rewards $\{r_s\}_{s=1}^{t-1}$ to the expected rewards $\{r(v_s,b_s,\alpha)\}_{s=1}^{t-1}$, which can be computed given that $G$ is known. We apply the Azuma-Hoeffding inequality to obtain the following result.

Lemma 4.1.

Under Algorithm 4.1, we have with probability at least $1-\delta$, $\forall t\in[T]$,

|\widetilde{\alpha}_{t}-\alpha_{0}|\leq w_{t},

where $w_t$ is given by

w_{t}=\frac{2\sqrt{2(t-1)\log{(2T/\delta)}}}{\sum_{s=1}^{t-1}\int_{0}^{b_{s}}G(y)\,dy}.
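
Since $r(v,b,\alpha)$ is linear in $\alpha$, the minimizer in Equation 3 admits a closed form (clipped to $[0,1]$). The sketch below computes it; the uniform $G$ and the helper names are our illustrative assumptions, not part of the algorithm's specification.

```python
import numpy as np

def G(y):
    """CDF of the highest competing bid; uniform on [0, 1] is used purely as an example."""
    return np.clip(y, 0.0, 1.0)

def int_G(b, n=1000):
    """Approximate the integral of G from 0 to b with a simple Riemann sum."""
    ys = np.linspace(0.0, b, n)
    return float(np.mean(G(ys)) * b)

def estimate_alpha(values, bids, rewards):
    """Closed-form minimizer of |sum_s (r_s - r(v_s, b_s, alpha))| over alpha in [0, 1].

    Since r(v, b, alpha) = (v - b) G(b) + alpha * int_0^b G(y) dy, the sum is linear in alpha.
    """
    base = sum((v - b) * G(b) for v, b in zip(values, bids))   # alpha-independent part of the fit
    slope = sum(int_G(b) for b in bids)                        # coefficient of alpha
    if slope == 0.0:
        return 1.0                                             # no informative rounds yet
    alpha_hat = (sum(rewards) - base) / slope
    return float(np.clip(alpha_hat, 0.0, 1.0))

# example: three past rounds (v_s, b_s, r_s)
alpha_tilde = estimate_alpha([0.9, 0.7, 0.8], [0.6, 0.5, 0.55], [0.25, 0.14, 0.20])
```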

For technical purposes, we make the following assumption.

Assumption 4.2.

$G$ is twice differentiable and log-concave with density function $g$. There exist positive constants $B_1,B_2$ such that $B_1\leq g(x)\leq B_2$ for $x\in[0,1]$.

The CDFs of many common distributions, such as gamma distributions, Gaussian distributions and uniform distributions, are all log-concave. The existence of positive bounds on the density function is also a standard and common assumption in various learning problems.

Theorem 4.3.

Suppose that Assumption 4.2 holds. For repeated non-credible second-price auctions with unknown credibility $\alpha_0$, known distribution $G$ and bandit feedback, there exists a bidding algorithm (Algorithm 4.1) that achieves an $O(\log^{2}T)$ regret, and any algorithm must incur at least a constant regret.

Remark that due to the additional assumption, we cannot actually draw the conclusion that estimating $\alpha$ is generally easier than estimating $G$ by comparing the two lower bounds in Theorem 4.3 and Theorem 3.1. For example, the two-point method used in the proof of Lemma 3.5 constructs two discrete distributions $G$ to show that no bidding algorithm can obtain low regret simultaneously under both distributions, while Assumption 4.2 has ruled out such bad cases.

A key step in the proof of Theorem 4.3 involves showing that the reward per round obtained by the bidder is close to the reward under the optimal bid with high probability. We have with probability at least $1-\delta$, $\forall t$,

r(v_{t},b_{t}^{*},\alpha_{0})-r(v_{t},b_{t},\alpha_{0})\leq w_{t}^{2}/B_{1}.

Then the regret upper bound can be established by showing $w_t\sim\sqrt{\log T/t}$ and summing up from $1$ to $T$. It is also worth discussing the robustness of this result. Even without Assumption 4.2, we can get

r(v_{t},b_{t}^{*},\alpha_{0})-r(v_{t},b_{t},\alpha_{0})\leq w_{t}.

And with a weaker continuity condition, it still holds that $w_t\sim\sqrt{\log T/t}$.
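
Concretely, plugging $w_t\sim\sqrt{\log T/t}$ into the two per-round bounds and summing over rounds gives the two rates:

\sum_{t=1}^{T}\frac{w_{t}^{2}}{B_{1}}\sim\frac{\log T}{B_{1}}\sum_{t=1}^{T}\frac{1}{t}=O(\log^{2}T),\qquad\sum_{t=1}^{T}w_{t}\sim\sqrt{\log T}\sum_{t=1}^{T}\frac{1}{\sqrt{t}}=O\big(\sqrt{T\log T}\big).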

Corollary 4.4.

Suppose that $G$ is continuous. For repeated non-credible second-price auctions with unknown credibility $\alpha_0$, known distribution $G$ and bandit feedback, Algorithm 4.1 can achieve an $\widetilde{O}(\sqrt{T})$ regret.

An intuition on why the proof of the regret upper bound may fail under some discontinuous distributions is given in Example B.3. In spite of this, we conjecture that Algorithm 4.1 can also guarantee low regret in the case of a discontinuous $G$, and we leave that as a future direction.

5 Learning with Unknown $\alpha$ and Unknown $G$

Bandit feedback.

For the last scenario, with both $\alpha$ and $G$ unknown, we first consider bandit feedback. Although the seller's credibility is unknown, this case still satisfies the cross-learning property over values, i.e., Equation 1 holds. Thus, Algorithm A.1, which actually does not use the value of $\alpha_0$, still works in this case and achieves an $\widetilde{O}(T^{2/3})$ regret. The regret lower bound also directly follows from the second statement of Theorem 3.1, since no bidding strategy can obtain a better regret guarantee than $O(T^{2/3})$ when $\alpha_0=0$. Hence, we have the following result.

Corollary 5.1.

For repeated non-credible second-price auctions with unknown credibility $\alpha_0$, unknown distribution $G$ and bandit feedback, there exists a bidding algorithm (Algorithm A.1) that achieves an $\widetilde{O}(T^{2/3})$ regret, and the lower bound on regret for this problem is $\Omega(T^{2/3})$.

In spite of the sublinear regret, the result of Corollary 5.1 is not satisfactory. Algorithm A.1 falls into the category of usual UCB policies with cross learning, but does not make full use of the properties of this auction setting. In fact, it treats the seller's mechanism as a black box without really estimating the seller's credibility or the other bidders' strategies.

Full feedback.

With the above in mind, we explore whether we can make any improvements with richer feedback and a more meticulous estimation of $\alpha_0$ and $G$. In the full feedback model, $p_t=\alpha_0 d_t+(1-\alpha_0)b_t$ is always observable, which intuitively helps the estimation. Nevertheless, when $\alpha_0=0$, binary feedback in first-price auctions still results in the $\Omega(T^{2/3})$ lower bound. Thus, in what follows we exclude the extreme case of $\alpha_0=0$, where it is impossible to achieve a better bound than $O(T^{2/3})$. Note that sellers in reality often do not set $\alpha_0=0$ due to concerns about reputation or cheating costs (McAdams and Schwarz 2007). This assumption is also consistent with the recent empirical findings by Dianat and Freer (2021) that although sellers in non-credible second-price auctions often overcharge, they typically do not use the rules of the first-price auction to maximize revenue.

Assumption 5.2.

There exists a constant $\underline{\alpha}>0$ such that $\alpha_0\in[\underline{\alpha},1]$.

We also make a slightly stronger assumption on $G$.

Assumption 5.3.

$G$ is twice differentiable with log-concave density function $g$. There exist positive constants $B_1,B_2,B_3,W$ such that $B_1<g(x)<B_2$ and $|g^{\prime}(x)|\leq B_3$ for $x\in[0,1+W]$.

Input: Time horizon $T$.
for $t\in\Gamma_1$ do
      The bidder receives the value $v_t\in[0,1]$.
      The bidder submits a bid $b_t=1$.
end for
Estimate $\alpha_0$ by $\widetilde{\alpha}_1$, computed as
\widetilde{\alpha}_{1}=\arg\min_{\alpha\in[\underline{\alpha},1]}\mathcal{L}_{1}(\alpha), (4)
where $\mathcal{L}_1(\alpha)$ is defined in Equation 13.
The bidder computes $\hat{\Psi}_1(\cdot;\widetilde{\alpha}_1)$ by
\hat{\Psi}_{1}(\cdot,\widetilde{\alpha}_{1})=\mathop{\arg\max}\limits_{\Psi\,\text{is concave}}\frac{1}{T_{1}}\sum_{t\in\Gamma_{1}}\Psi\left(\frac{p_{t}-(1-\widetilde{\alpha}_{1})b_{t}}{\widetilde{\alpha}_{1}}\right)-\int\exp(\Psi(y;\widetilde{\alpha}_{1}))\,dy. (5)
The bidder estimates $G$ by
\hat{G}_{1}(\cdot;\widetilde{\alpha}_{1})=\int\exp(\hat{\Psi}_{1}(y;\widetilde{\alpha}_{1}))\,dy. (6)
for $s=2,3,\cdots,S$ do
      for $t\in\Gamma_s$ do
            The bidder receives the value $v_t\in[0,1]$.
            The bidder submits a bid $b_t$, computed by
            b_{t}=\arg\max_{b}\Big[\left(v_{t}-b\right)\hat{G}_{s-1}(b)+\widetilde{\alpha}_{s-1}\int_{0}^{b}\hat{G}_{s-1}(y)\,dy\Big]. (7)
      end for
      The bidder updates the estimator of $\alpha_0$ in episode $s$ to $\widetilde{\alpha}_s$, computed as
      \widetilde{\alpha}_{s}=\arg\min_{\alpha\in[\underline{\alpha},1]}\mathcal{L}_{s}(\alpha), (8)
      where $\mathcal{L}_s(\alpha)$ is defined in Equation 13. The bidder updates $\hat{\Psi}_s(\cdot;\widetilde{\alpha}_s)$ by
      \hat{\Psi}_{s}(\cdot,\widetilde{\alpha}_{s})=\mathop{\arg\max}\limits_{\Psi\,\text{is concave}}\frac{1}{T_{s}}\sum_{t\in\Gamma_{s}}\Psi\left(\frac{p_{t}-(1-\widetilde{\alpha}_{s})b_{t}}{\widetilde{\alpha}_{s}}\right)-\int\exp(\Psi(y;\widetilde{\alpha}_{s}))\,dy. (9)
      The bidder updates the estimate of $G$ by
      \hat{G}_{s}(\cdot;\widetilde{\alpha}_{s})=\int\exp(\hat{\Psi}_{s}(y;\widetilde{\alpha}_{s}))\,dy. (10)
end for
Algorithm 5.1 Learning with unknown $\alpha_0$, unknown $G$ and full feedback

Our bidding algorithm for full feedback is presented in Algorithm 5.1. It runs in an episodic manner, similar to Cesa-Bianchi, Gentile, and Mansour (2014), Javanmard and Nazerzadeh (2019), and Badanidiyuru, Feng, and Guruganesh (2023). The time horizon $T$ is divided into $S$ episodes, with episode $s$ containing $T_s=T^{1-2^{-s}}$ time steps. Denote by $\Gamma_s$ the set of time steps in episode $s$, so that $|\Gamma_s|=T_s$. For any time step $t$ in the first episode, we set $b_t=1$ for a proper initialization. For any time step $t$ in episode $s$ ($s\geq 2$), we use the estimated parameter $\widetilde{\alpha}_{s-1}$ and distribution $\hat{G}_{s-1}$ from the $(s-1)$-th episode to set the bid

b_{t}=\arg\max_{b}\left(v_{t}-b\right)\hat{G}_{s-1}(b)+\widetilde{\alpha}_{s-1}\int_{0}^{b}\hat{G}_{s-1}(y)\,dy,

and we only update these estimators at the end of episode $s$ by using the data observed in episode $s$.
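
A minimal sketch of the episode schedule is given below; within each episode the bid rule of Equation 7 uses the previous episode's estimates, and the estimates themselves are refreshed only at episode boundaries (Equations 8-10), which we omit here.

```python
import numpy as np

def episode_lengths(T):
    """Episode sizes T_s = T^{1 - 2^{-s}}, truncated so the total does not exceed the horizon T."""
    lengths, s, used = [], 1, 0
    while used < T:
        T_s = min(int(np.ceil(T ** (1.0 - 2.0 ** (-s)))), T - used)
        lengths.append(T_s)
        used += T_s
        s += 1
    return lengths

print(episode_lengths(10**6))  # e.g. [1000, 31623, 177828, ...]: the exponent gap halves each episode
```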

The main difficulty is how to update the estimators of $G$ and $\alpha_0$ in each episode. Recall that when $G$ is known, the estimation step (Equation 3) in Algorithm 4.1 is essentially similar to using the maximum likelihood estimation (MLE) method to find the most probable $\alpha$. However, without knowledge of $G$, the bidder cannot directly estimate $\alpha_0$ by matching observed rewards to expected rewards. To handle this challenge, we combine a non-parametric log-concave density estimator with the MLE method to learn $\alpha_0$ and $G$ simultaneously.

We first introduce the non-parametric estimator of the log-concave density function $g$, adopted from Dümbgen and Rufibach (2009). In each episode $s$, given the realized prices $p_t$, $t\in\Gamma_s$, Algorithm 5.1 for any parameter $\alpha$ gives an estimator $\hat{g}_s(\cdot,\alpha)$,

\hat{g}_{s}(\cdot,\alpha)=\mathop{\arg\max}\limits_{g\,\text{is log-concave}}\frac{1}{T_{s}}\sum_{t\in\Gamma_{s}}\log g\left(\frac{p_{t}-(1-\alpha)b_{t}}{\alpha}\right)-\int g(y)\,dy. (11)

In this work, we restrict the function class of $\hat{g}_s(\cdot;\alpha)$ for any $\alpha$ and $s$ as below,

\mathcal{P}=\big\{p:p(z)\leq B_{2},\forall z\in[0,1+W],\int p(z)\,dz=1\big\}.

It is without loss of generality to re-parameterize $g(y)=\exp(\Psi(y))$, where $\Psi(y)$ is a concave function w.r.t. $y$. Then it is equivalent to obtaining an estimator $\hat{\Psi}_s(\cdot,\alpha)$,

\hat{\Psi}_{s}(\cdot,\alpha)=\mathop{\arg\max}\limits_{\Psi\,\text{is concave}}\frac{1}{T_{s}}\sum_{t\in\Gamma_{s}}\Psi\left(\frac{p_{t}-(1-\alpha)b_{t}}{\alpha}\right)-\int\exp(\Psi(y;\alpha))\,dy. (12)

Let $\hat{G}_{s}(y;\alpha)=\int_{0}^{y}\hat{g}_{s}(z,\alpha)\,dz=\int_{0}^{y}\exp(\hat{\Psi}_{s}(z;\alpha))\,dz$ be the estimated distribution based on the above estimator. Let $\mathbb{G}_s$ be the empirical distribution of the samples $\{d_t\}_{t\in\Gamma_s}$ in episode $s$, i.e., $\mathbb{G}_{s}(y)=\frac{1}{T_{s}}\sum_{t\in\Gamma_{s}}\mathbb{I}\{d_{t}\leq y\}$. Dümbgen and Rufibach (2009) proved the following result.

Lemma 5.4 (Dümbgen and Rufibach (2009)).

The optimizer $\hat{\Psi}_{s}(\cdot,\alpha_{0})$ exists and is unique. For any $d_{t}=(p_{t}-(1-\alpha_{0})b_{t})/\alpha_{0}$, $t\in\Gamma_{s}$, $\mathbb{G}_{s}(d_{t})-\frac{1}{T_{s}}\leq\hat{G}_{s}(d_{t};\alpha_{0})\leq\mathbb{G}_{s}(d_{t})$.

Given the above characterization of $\hat{\Psi}_{s}(\cdot,\alpha_{0})$ and $\hat{G}_{s}(\cdot;\alpha_{0})$, we provide a uniform convergence bound for $|\hat{G}_{s}(d;\alpha_{0})-G(d)|$ in the following lemma.

Lemma 5.5.

Suppose that $T_{s}\gg\log^{2}(2/\delta)$ for some $\delta>0$; then with probability at least $1-\delta$, $|\hat{G}_{s}(d;\alpha_{0})-G(d)|\leq O(\sqrt{\log(1/\delta)/T_{s}})$ holds for all $d\in[0,1]$.

Given the non-parametric estimator $\hat{G}_{s}(\cdot;\alpha)$ introduced above, we minimize the following MLE loss function to compute $\widetilde{\alpha}_s$,

\mathcal{L}_{s}(\alpha)=-\frac{1}{T_{s}}\sum_{t\in\Gamma_{s}}\Big[\xi_{t}\log\hat{G}_{s}(\varepsilon_{t}(\alpha);\alpha)+(1-\xi_{t})\log\big(1-\hat{G}_{s}(\varepsilon_{t}(\alpha);\alpha)\big)\Big], (13)

where $\xi_{t}=\mathbb{I}\{p_{t}\leq b_{t}+\underline{\alpha}W/2\}$ and $\varepsilon_{t}(\alpha)=b_{t}+\underline{\alpha}W/(2\alpha)$. Here, the indicator $\xi_t$ is carefully chosen so that

\mathbb{E}\left[\xi_{t}\right]=G(\varepsilon_{t}(\alpha_{0})),

and $\forall\alpha\in[\underline{\alpha},1]$, $0<\varepsilon_{t}(\alpha)<1+W$.
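
The loss in Equation 13 only needs the estimated CDF as a black box. A minimal sketch of evaluating and minimizing it is below; `G_hat` stands in for the log-concave estimate $\hat{G}_s(\cdot;\alpha)$, and the grid search is our simplification of the abstract minimization in Equation 8.

```python
import numpy as np

def mle_loss(alpha, bids, prices, G_hat, alpha_lower, W):
    """Negative log-likelihood L_s(alpha) from Equation 13 for one episode of data (NumPy arrays)."""
    xi = (prices <= bids + alpha_lower * W / 2.0).astype(float)  # xi_t = I{p_t <= b_t + alpha_lower*W/2}
    eps = bids + alpha_lower * W / (2.0 * alpha)                 # eps_t(alpha)
    q = np.clip(G_hat(eps, alpha), 1e-12, 1.0 - 1e-12)           # guard against log(0)
    return -float(np.mean(xi * np.log(q) + (1.0 - xi) * np.log(1.0 - q)))

def estimate_alpha(bids, prices, G_hat, alpha_lower, W, grid=200):
    """Grid-search approximation of the argmin over alpha in [alpha_lower, 1]."""
    alphas = np.linspace(alpha_lower, 1.0, grid)
    losses = [mle_loss(a, bids, prices, G_hat, alpha_lower, W) for a in alphas]
    return float(alphas[int(np.argmin(losses))])
```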

Theorem 5.6.

Suppose that Assumption 5.2 holds. For repeated non-credible second-price auctions with unknown credibility $\alpha_0$, unknown distribution $G$ and full feedback, there exists a bidding algorithm (Algorithm 5.1) that achieves an $\widetilde{O}(T^{1/2})$ regret if Assumption 5.3 holds. And the lower bound on regret for this problem is $\Omega(T^{1/2})$.

We first give a bound on the distance between $\widetilde{\alpha}_s$ and $\alpha_0$ in Lemma C.3. Next we show that the distribution estimator $\hat{G}_{s}(\cdot;\widetilde{\alpha}_{s})$ uniformly converges to the true distribution $G$ in Lemma C.4. Our algorithm and the regret analysis follow the same spirit as Theorem 4.3 in Badanidiyuru, Feng, and Guruganesh (2023). The main difference is that we use $\xi_t$ rather than $x_t$ as our indicator function. We defer the proofs of this section to Appendix C.

6 Conclusion

In this paper, we formulate and study the problem of learning against non-credible auctions and evaluate several algorithms under various information structures.

It remains an open problem whether we can further relax the assumptions on the distribution $G$, especially for Theorem 4.3 and Theorem 5.6. Can we design a near-optimal algorithm for general distributions with unknown $\alpha_0$ and known $G$? Can we extend the upper bound of Theorem 5.6 to the bandit feedback model? In this work, we assume the seller's credibility $\alpha_0$ is fixed. Another research direction is to consider a seller who uses a time-dependent charging rule. Will learning become impossible in this more complicated setting?

References

  • Akbarpour and Li (2020) Akbarpour, M.; and Li, S. 2020. Credible auctions: A trilemma. Econometrica, 88(2): 425–467.
  • An (1996) An, M. 1996. Log-concave Probability Distributions: Theory and Statistical Testing. Game theory and information, University Library of Munich, Germany.
  • Badanidiyuru, Feng, and Guruganesh (2023) Badanidiyuru, A.; Feng, Z.; and Guruganesh, G. 2023. Learning to Bid in Contextual First Price Auctions. In Proceedings of the ACM Web Conference 2023, 3489–3497.
  • Balseiro et al. (2019) Balseiro, S.; Golrezaei, N.; Mahdian, M.; Mirrokni, V.; and Schneider, J. 2019. Contextual bandits with cross-learning. Advances in Neural Information Processing Systems, 32.
  • Balseiro, Besbes, and Weintraub (2015) Balseiro, S. R.; Besbes, O.; and Weintraub, G. Y. 2015. Repeated auctions with budgets in ad exchanges: Approximations and design. Management Science, 61(4): 864–884.
  • Balseiro and Gur (2019) Balseiro, S. R.; and Gur, Y. 2019. Learning in repeated auctions with budgets: Regret minimization and equilibrium. Management Science, 65(9): 3952–3968.
  • Balseiro, Lu, and Mirrokni (2022) Balseiro, S. R.; Lu, H.; and Mirrokni, V. 2022. The best of many worlds: Dual mirror descent for online allocation problems. Operations Research.
  • Castiglioni, Celli, and Kroer (2022) Castiglioni, M.; Celli, A.; and Kroer, C. 2022. Online learning with knapsacks: the best of both worlds. In International Conference on Machine Learning, 2767–2783. PMLR.
  • Cesa-Bianchi, Gentile, and Mansour (2014) Cesa-Bianchi, N.; Gentile, C.; and Mansour, Y. 2014. Regret minimization for reserve prices in second-price auctions. IEEE Transactions on Information Theory, 61(1): 549–564.
  • Chen et al. (2023) Chen, Y.; Wang, Q.; Duan, Z.; Sun, H.; Chen, Z.; Yan, X.; and Deng, X. 2023. Coordinated Dynamic Bidding in Repeated Second-Price Auctions with Budgets. arXiv preprint arXiv:2306.07709.
  • Chen et al. (2022) Chen, Z.; Wang, C.; Wang, Q.; Pan, Y.; Shi, Z.; Tang, C.; Cai, Z.; Ren, Y.; Zhu, Z.; and Deng, X. 2022. Dynamic Budget Throttling in Repeated Second-Price Auctions. arXiv preprint arXiv:2207.04690.
  • Dianat and Freer (2021) Dianat, A.; and Freer, M. 2021. Credibility in Second-Price Auctions: An Experimental Test. arXiv preprint arXiv:2105.00204.
  • Dümbgen and Rufibach (2009) Dümbgen, L.; and Rufibach, K. 2009. Maximum likelihood estimation of a log-concave density and its distribution function: Basic properties and uniform consistency. Bernoulli, 15(1): 40–68.
  • Feng, Padmanabhan, and Wang (2022) Feng, Z.; Padmanabhan, S.; and Wang, D. 2022. Online Bidding Algorithms for Return-on-Spend Constrained Advertisers. arXiv preprint arXiv:2208.13713.
  • Golrezaei et al. (2021) Golrezaei, N.; Jaillet, P.; Liang, J. C. N.; and Mirrokni, V. 2021. Bidding and pricing in budget and ROI constrained markets. arXiv preprint arXiv:2107.07725.
  • Hakimov and Raghavan (2023) Hakimov, R.; and Raghavan, M. 2023. Improving Transparency and Verifiability in School Admissions: Theory and Experiment.
  • Han et al. (2020) Han, Y.; Zhou, Z.; Flores, A.; Ordentlich, E.; and Weissman, T. 2020. Learning to bid optimally and efficiently in adversarial first-price auctions. arXiv preprint arXiv:2007.04568.
  • Han, Zhou, and Weissman (2020) Han, Y.; Zhou, Z.; and Weissman, T. 2020. Optimal no-regret learning in repeated first-price auctions. arXiv preprint arXiv:2003.09795.
  • IAB (2023) IAB, T. I. A. B. 2023. Internet Advertising Revenue Report, Full year 2022 results.
  • Iyer, Johari, and Sundararajan (2014) Iyer, K.; Johari, R.; and Sundararajan, M. 2014. Mean field equilibria of dynamic auctions with learning. Management Science, 60(12): 2949–2970.
  • Javanmard and Nazerzadeh (2019) Javanmard, A.; and Nazerzadeh, H. 2019. Dynamic pricing in high-dimensions. The Journal of Machine Learning Research, 20(1): 315–363.
  • McAdams and Schwarz (2007) McAdams, D.; and Schwarz, M. 2007. Who pays when auction rules are bent? International Journal of Industrial Organization, 25(5): 1144–1157.
  • Porter and Shoham (2005) Porter, R.; and Shoham, Y. 2005. On cheating in sealed-bid auctions. Decision Support Systems, 39(1): 41–54.
  • Rothkopf and Harstad (1995) Rothkopf, M. H.; and Harstad, R. M. 1995. Two models of bid-taker cheating in Vickrey auctions. Journal of Business, 257–267.
  • Tsybakov (2009) Tsybakov, A. B. 2009. Introduction to Nonparametric Estimation. New York: Springer-Verlag.
  • Vickrey (1961) Vickrey, W. 1961. Counterspeculation, auctions, and competitive sealed tenders. The Journal of finance, 16(1): 8–37.
  • Wang et al. (2023) Wang, Q.; Yang, Z.; Deng, X.; and Kong, Y. 2023. Learning to bid in repeated first-price auctions with budgets. arXiv preprint arXiv:2304.13477.
  • Weed, Perchet, and Rigollet (2016) Weed, J.; Perchet, V.; and Rigollet, P. 2016. Online learning in repeated auctions. Journal of Machine Learning Research, 49(June): 1562–1583.
  • Zhang et al. (2022) Zhang, W.; Han, Y.; Zhou, Z.; Flores, A.; and Weissman, T. 2022. Leveraging the Hints: Adaptive Bidding in Repeated First-Price Auctions. arXiv preprint arXiv:2211.06358.

Supplementary Material of “Learning against Non-credible Auctions”

Appendix A Missing Algorithms and Proofs of Section 3

Input: Time horizon $T$.
Initialization: Let $\mathcal{B}=\{b^{1},\cdots,b^{K}\}$ be the bid set with $b^{k}=(k-1)/K$; for $t\in[K]$, the bidder submits a bid $b_t=b^{t}$.
for $t\leftarrow K+1$ to $T$ do
      The bidder receives the value $v_t\in[0,1]$.
      For every $k\in[K]$, the bidder counts the observations by
      n_{t}^{k}\leftarrow\sum_{s=1}^{t-1}\mathbb{I}\left\{b_{s}=b^{k}\right\}. (14)
      For every $k\in[K]$, the bidder computes the confidence bound
      w_{t}^{k}=\sqrt{\frac{2\ln{T}}{n_{t}^{k}}}. (15)
      For every $k\in[K]$, the bidder estimates the reward by
      \widetilde{r}_{t}(v_{t},b^{k})=\frac{1}{n_{t}^{k}}\sum_{s=1}^{t-1}\mathbb{I}\left\{b_{s}=b^{k}\right\}\left(x_{s}v_{t}-c_{s}\right). (16)
      The bidder submits a bid $b_{t}=\arg\max_{b^{k}\in\mathcal{B}}\left(\widetilde{r}_{t}(v_{t},b^{k})+w_{t}^{k}\right)$.
end for
Algorithm A.1 Learning with known $\alpha_0=0$ (or unknown $\alpha_0$), unknown $G$ and bandit feedback
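
A compact sketch of Algorithm A.1's loop might look as follows; the value distribution, the environment call, and the grid size are illustrative placeholders. Cross-learning over values enters through re-evaluating all past outcomes at the current value $v_t$.

```python
import numpy as np

def run_alg_a1(T, K, env, rng):
    """UCB over a discretized bid grid with cross-learning over values (bandit feedback)."""
    bids = np.arange(K) / K        # b^k = (k-1)/K
    n = np.zeros(K)                # pull counts n_t^k
    sum_x = np.zeros(K)            # running sum of allocations x_s per arm
    sum_c = np.zeros(K)            # running sum of costs c_s per arm
    for t in range(T):
        v_t = rng.uniform(0.0, 1.0)                  # value draw (placeholder for F)
        if t < K:
            k = t                                    # initialization: pull each arm once
        else:
            r_hat = (sum_x * v_t - sum_c) / n        # Equation 16 evaluated at the current value
            k = int(np.argmax(r_hat + np.sqrt(2.0 * np.log(T) / n)))   # UCB rule with Equation 15
        x_t, c_t = env(bids[k])                      # observe allocation and cost
        n[k] += 1.0
        sum_x[k] += x_t
        sum_c[k] += c_t

def env(b, rng=np.random.default_rng(1)):
    """Placeholder environment: alpha0 = 0 (first-price pricing) against uniform competing bids."""
    d = rng.uniform(0.0, 1.0)
    x = 1.0 if b >= d else 0.0
    return x, x * b

run_alg_a1(T=2000, K=12, env=env, rng=np.random.default_rng(0))
```
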
Input: Time horizon $T$; credibility parameter $\alpha_0\in(0,1)$.
Initialization: Let $\mathcal{V}=\{v^{1},\cdots,v^{M}\}$ be the value set with $v^{m}=(m-1)/M$; let $\mathcal{B}=\{b^{1},\cdots,b^{K}\}$ be the bid set with $b^{k}=(k-1)/K$; set $\mathcal{B}_{1}^{m}\leftarrow\mathcal{B}$ for each $v^{m}\in\mathcal{V}$; select failure probability $\delta\in(0,1)$.
for $t\leftarrow 1$ to $T$ do
      The bidder receives the value $v_t\in[0,1]$ and rounds it to $v^{m(t)}$ by
      v^{m(t)}=\max\{u\in\mathcal{V}:u\leq v_{t}\}. (17)
      The bidder submits a bid $b_{t}=\sup\mathcal{B}_{t}^{m(t)}$.
      For every $k\in[K]$, the bidder counts the observations by
      n_{t}^{k}\leftarrow\sum_{s=1}^{t}\mathbb{I}\left\{b_{s}\geq b^{k}\right\}. (18)
      For every $m\in[M]$, $k\in[K]$, the bidder estimates the reward by
      \widetilde{r}_{t}(v^{m},b^{k},\alpha_{0})=\frac{1}{n_{t}^{k}}\sum_{s=1}^{t}\mathbb{I}\left\{b_{s}\geq b^{k}\right\}\mathbb{I}\left\{b^{k}\geq\widetilde{d}_{s}\right\}x_{s}\left(v^{m}-\alpha_{0}\widetilde{d}_{s}-(1-\alpha_{0})b^{k}\right), (19)
      where $\widetilde{d}_{s}=\left(c_{s}-(1-\alpha_{0})b_{s}\right)/\alpha_{0}$.
      for $m=M,M-1,\ldots,1$ do
            The bidder eliminates bids by
            \mathcal{B}_{t}^{m}=\left\{b^{k}\in\mathcal{B}_{t}^{m}:b^{k}\leq\min_{s>m}\sup\mathcal{B}_{t+1}^{s}\right\}. (20)
            The bidder computes the confidence bound
            w_{t}^{m}=\sqrt{\frac{4\ln{T}\log(KT/\delta)}{N_{t}^{m}}}, (21)
            where $N_{t}^{m}=\min_{b^{k}\in\mathcal{B}_{t}^{m}}n_{t}^{k}$.
            The bidder eliminates bids by
            \mathcal{B}_{t+1}^{m}\leftarrow\Big\{b^{k}\in\mathcal{B}_{t}^{m}:\widetilde{r}_{t}(v^{m},b^{k},\alpha_{0})\geq\max_{b\in\mathcal{B}_{t}^{m}}\widetilde{r}_{t}(v^{m},b,\alpha_{0})-2w_{t}^{m}\Big\}. (22)
      end for
end for
Algorithm A.2 Learning with known $\alpha_0\in(0,1)$, unknown $G$ and bandit feedback
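
The heart of Algorithm A.2 is that, with $\alpha_0\in(0,1)$, a single winning round reveals $\widetilde{d}_s$ and hence the counterfactual outcome of every value-bid pair on the grid. A minimal sketch of folding one observed round into the estimates of Equations 18-19 (array shapes and names are ours) is:

```python
import numpy as np

def counterfactual_update(v_grid, b_grid, b_s, x_s, c_s, alpha0, sum_r, n):
    """Fold one observed round into the running sums behind Equations 18-19.

    v_grid has shape (M,), b_grid has shape (K,); sum_r has shape (M, K), n has shape (K,).
    The estimate r_tilde(v^m, b^k, alpha0) is then sum_r[m, k] / n[k].
    """
    informative = (b_s >= b_grid).astype(float)      # I{b_s >= b^k}: arms whose outcome is revealed
    n += informative                                  # Equation 18
    if x_s == 1.0:                                    # a losing round contributes zero reward
        d_s = (c_s - (1.0 - alpha0) * b_s) / alpha0   # inferred highest competing bid (Equation 2)
        win = (b_grid >= d_s).astype(float)           # counterfactual allocation I{b^k >= d_s}
        pay = alpha0 * d_s + (1.0 - alpha0) * b_grid  # counterfactual price at each grid bid
        sum_r += (informative * win)[None, :] * (v_grid[:, None] - pay[None, :])

# example: 10x10 grids, one winning round observed under credibility alpha0 = 0.5
M, K = 10, 10
v_grid, b_grid = np.arange(M) / M, np.arange(K) / K
sum_r, n = np.zeros((M, K)), np.zeros(K)
counterfactual_update(v_grid, b_grid, b_s=0.7, x_s=1.0, c_s=0.55, alpha0=0.5, sum_r=sum_r, n=n)
```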


Proof of Lemma 3.4.

Fix any $v_1,v_2$ with $v_1\leq v_2$. For any $b\leq b^{*}(v_1,\alpha)$, it holds that

r(v_{2},b,\alpha)=r(v_{1},b,\alpha)+(v_{2}-v_{1})G(b)\leq r(v_{1},b^{*}(v_{1},\alpha),\alpha)+(v_{2}-v_{1})G(b^{*}(v_{1},\alpha))=r(v_{2},b^{*}(v_{1},\alpha),\alpha),

where the inequality follows from the definition of $b^{*}(v_1,\alpha)$ and the conditions $v_1\leq v_2$, $b\leq b^{*}(v_1,\alpha)$. Thus, no bid smaller than $b^{*}(v_1,\alpha)$ can be the largest maximizer for $v_2$, i.e., $b^{*}(v_2,\alpha)\geq b^{*}(v_1,\alpha)$.

Fix any $\alpha_1,\alpha_2$ with $\alpha_1\leq\alpha_2$. For any $b\leq b^{*}(v,\alpha_1)$, it holds that

r(v,b,\alpha_{2})=r(v,b,\alpha_{1})+(\alpha_{2}-\alpha_{1})\int_{0}^{b}G(y)\,dy\leq r(v,b^{*}(v,\alpha_{1}),\alpha_{1})+(\alpha_{2}-\alpha_{1})\int_{0}^{b^{*}(v,\alpha_{1})}G(y)\,dy=r(v,b^{*}(v,\alpha_{1}),\alpha_{2}),

where the inequality follows from the definition of $b^{*}(v,\alpha_1)$ and the conditions $\alpha_1\leq\alpha_2$, $b\leq b^{*}(v,\alpha_1)$. Thus, no bid smaller than $b^{*}(v,\alpha_1)$ can be the largest maximizer for $\alpha_2$, i.e., $b^{*}(v,\alpha_2)\geq b^{*}(v,\alpha_1)$. ∎


Proof of Lemma 3.5.

The proof follows Le Cam's two-point method (Tsybakov 2009). For any $\alpha_0\in(0,1)$, consider the following two candidate distributions supported on $[0,1]$:

G_{1}(x)=\begin{cases}0&\text{if }x\in\left[0,\frac{1-\alpha_{0}}{3-2\alpha_{0}}\right)\\ \frac{1}{2}+\Delta&\text{if }x\in\left[\frac{1-\alpha_{0}}{3-2\alpha_{0}},\frac{2-\alpha_{0}}{3-2\alpha_{0}}\right)\\ 1&\text{if }x\in\left[\frac{2-\alpha_{0}}{3-2\alpha_{0}},1\right]\end{cases},\qquad G_{2}(x)=\begin{cases}0&\text{if }x\in\left[0,\frac{1-\alpha_{0}}{3-2\alpha_{0}}\right)\\ \frac{1}{2}-\Delta&\text{if }x\in\left[\frac{1-\alpha_{0}}{3-2\alpha_{0}},\frac{2-\alpha_{0}}{3-2\alpha_{0}}\right)\\ 1&\text{if }x\in\left[\frac{2-\alpha_{0}}{3-2\alpha_{0}},1\right]\end{cases},

where $\Delta$ is some parameter to be chosen later. In other words, $G_1$ corresponds to a discrete random variable taking values in $\left\{\frac{1-\alpha_{0}}{3-2\alpha_{0}},\frac{2-\alpha_{0}}{3-2\alpha_{0}}\right\}$ with probabilities $(1/2+\Delta,1/2-\Delta)$, and $G_2$ corresponds to the probabilities $(1/2-\Delta,1/2+\Delta)$. Fixing $v_t\equiv 1$, let $R_1(b)$ and $R_2(b)$ be the expected per-round reward when bidding $b$ under $G_1$ and $G_2$, respectively. After some algebra, it is straightforward to check that

\max_{b\in[0,1]}R_{1}(b)=\max_{b\in[0,1]}\left[(1-b)G_{1}(b)+\alpha_{0}\int_{0}^{b}G_{1}(x)\,\mathrm{d}x\right]=\frac{2-\alpha_{0}}{3-2\alpha_{0}}\left(\frac{1}{2}+\Delta\right),

\max_{b\in[0,1]}R_{2}(b)=\max_{b\in[0,1]}\left[(1-b)G_{2}(b)+\alpha_{0}\int_{0}^{b}G_{2}(x)\,\mathrm{d}x\right]=\frac{1}{3-2\alpha_{0}}\left(1-\frac{\alpha_{0}}{2}+\alpha_{0}\Delta\right),

\max_{b\in[0,1]}\left(R_{1}(b)+R_{2}(b)\right)=\max_{b\in[0,1]}\left[(1-b)\left(G_{1}(b)+G_{2}(b)\right)+\alpha_{0}\int_{0}^{b}\left(G_{1}(x)+G_{2}(x)\right)\mathrm{d}x\right]=\frac{2-\alpha_{0}}{3-2\alpha_{0}}.

Hence, for any $b_t\in[0,1]$, we have

\left(\max_{b\in[0,1]}R_{1}(b)-R_{1}(b_{t})\right)+\left(\max_{b\in[0,1]}R_{2}(b)-R_{2}(b_{t})\right)\geq\max_{b\in[0,1]}R_{1}(b)+\max_{b\in[0,1]}R_{2}(b)-\max_{b\in[0,1]}\left(R_{1}(b)+R_{2}(b)\right)=\frac{2-2\alpha_{0}}{3-2\alpha_{0}}\Delta. (23)

The inequality (23) is the separation condition required in the two-point method: there is no single bid btb_{t} that can obtain a uniformly small instantaneous regret under both G1G_{1} and G2G_{2}.
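
To make the two-point construction concrete, the following Python sketch (illustrative only; alpha0 = 0.6 and Delta = 0.1 are arbitrary choices) evaluates R1 and R2 on a bid grid and compares the maxima with the closed-form expressions above, as well as the separation bound in (23).

import numpy as np

# Numerical sanity check of the two-point construction (not part of the proof).
alpha0, Delta = 0.6, 0.1
lo = (1 - alpha0) / (3 - 2 * alpha0)   # lower support point of the competing bid
hi = (2 - alpha0) / (3 - 2 * alpha0)   # upper support point of the competing bid

def reward(b, p_lo):
    # (1 - b) G(b) + alpha0 * \int_0^b G(y) dy with value v = 1, where the
    # competing bid equals `lo` w.p. p_lo and `hi` w.p. 1 - p_lo.
    G_b = 0.0 if b < lo else (p_lo if b < hi else 1.0)
    integral = p_lo * max(0.0, min(b, hi) - lo) + max(0.0, b - hi)
    return (1 - b) * G_b + alpha0 * integral

bids = np.linspace(0.0, 1.0, 100001)
R1 = np.array([reward(b, 0.5 + Delta) for b in bids])
R2 = np.array([reward(b, 0.5 - Delta) for b in bids])

print(R1.max(), (2 - alpha0) / (3 - 2 * alpha0) * (0.5 + Delta))
print(R2.max(), (1 - alpha0 / 2 - alpha0 * Delta) / (3 - 2 * alpha0))
print((R1 + R2).max(), (2 - alpha0) / (3 - 2 * alpha0))
# No single bid keeps both instantaneous regrets small: the minimum of the summed
# regrets matches the right-hand side of (23).
print(((R1.max() - R1) + (R2.max() - R2)).min(),
      (2 - 2 * alpha0) / (3 - 2 * alpha0) * Delta)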

In the full information model, when α00\alpha_{0}\neq 0, the bidder can always infer dtd_{t} by

dt=(pt(1α0)bt)/α0.\displaystyle d_{t}=\left(p_{t}-(1-\alpha_{0})b_{t}\right)/\alpha_{0}.

Let Pit,i{1,2}P_{i}^{t},i\in\{1,2\} be the distribution of all historical samples (d1,,dt1)(d_{1},\cdots,d_{t-1}) at the beginning of time tt. Then for any policy π\pi,

supGRegret(π;G)\displaystyle\sup_{G}\mathrm{Regret}(\pi;G) (a)12Regret(π;G1)+12Regret(π;G2)\displaystyle\overset{(\text{a})}{\geq}\frac{1}{2}\mathrm{Regret}(\pi;G_{1})+\frac{1}{2}\mathrm{Regret}(\pi;G_{2})
=12t=1T(𝔼P1t[maxb[0,1]R1(b)R1(bt)]+𝔼P2t[maxb[0,1]R2(b)R2(bt)])\displaystyle=\frac{1}{2}\sum_{t=1}^{T}\left(\mathbb{E}_{P_{1}^{t}}\left[\max_{b\in[0,1]}R_{1}(b)-R_{1}(b_{t})\right]+\mathbb{E}_{P_{2}^{t}}\left[\max_{b\in[0,1]}R_{2}(b)-R_{2}(b_{t})\right]\right)
(b)12t=1T22α032α0Δmin{dP1t,dP2t}\displaystyle\overset{(\text{b})}{\geq}\frac{1}{2}\sum_{t=1}^{T}\frac{2-2\alpha_{0}}{3-2\alpha_{0}}\Delta\int\min\{\mathrm{d}P_{1}^{t},\mathrm{d}P_{2}^{t}\}
=(c)12t=1T22α032α0Δ(1P1tP2tTV)\displaystyle\overset{(\text{c})}{=}\frac{1}{2}\sum_{t=1}^{T}\frac{2-2\alpha_{0}}{3-2\alpha_{0}}\Delta\left(1-\|P_{1}^{t}-P_{2}^{t}\|_{\mathrm{TV}}\right)
1α03Δt=1T(1P1tP2tTV),\displaystyle\geq\frac{1-\alpha_{0}}{3}\Delta\sum_{t=1}^{T}\left(1-\|P_{1}^{t}-P_{2}^{t}\|_{\mathrm{TV}}\right), (24)

where (a) is due to the fact that the maximum is no smaller than the average, (b) follows from (23), and (c) is due to the identity min{dP,dQ}=1PQTV\int\min\{\mathrm{d}P,\mathrm{d}Q\}=1-\|P-Q\|_{\mathrm{TV}}. Invoking Lemma A.1 and using the fact that for Δ(0,1/4)\Delta\in(0,1/4),

DKL(P1t||P2t)\displaystyle D_{\mathrm{KL}}(P_{1}^{t}||P_{2}^{t}) =(t1)DKL(G1||G2)\displaystyle=(t-1)D_{\mathrm{KL}}(G_{1}||G_{2})
=(t1)((12+Δ)log1/2+Δ1/2Δ+(12Δ)log1/2Δ1/2+Δ)\displaystyle=(t-1)\left(\left(\frac{1}{2}+\Delta\right)\log\frac{1/2+\Delta}{1/2-\Delta}+\left(\frac{1}{2}-\Delta\right)\log\frac{1/2-\Delta}{1/2+\Delta}\right)
16TΔ2,\displaystyle\leq 16T\Delta^{2}, (25)

we have the following inequality on total variation distance:

1P1tP2tTV12exp(16TΔ2).1-\|P_{1}^{t}-P_{2}^{t}\|_{\mathrm{TV}}\geq\frac{1}{2}\exp\left(-16T\Delta^{2}\right). (26)

Finally, choosing Δ=14T\Delta=\frac{1}{4\sqrt{T}}, combining inequalities (24) and (26), we conclude that Lemma 3.5 holds with the constant c=124ec=\frac{1}{24e}. ∎
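
As a quick numerical check of the per-sample KL bound in (25) (illustrative only), the symmetric Bernoulli KL divergence stays below 16 Delta^2 for every Delta in (0, 1/4):

import numpy as np

Delta = np.linspace(1e-4, 0.25 - 1e-4, 1000)
kl = (0.5 + Delta) * np.log((0.5 + Delta) / (0.5 - Delta)) \
   + (0.5 - Delta) * np.log((0.5 - Delta) / (0.5 + Delta))
print(bool(np.all(kl <= 16 * Delta ** 2)))   # expected: True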

Lemma A.1 (Tsybakov (2009)).

Let P,QP,Q be two probability measures on the same space. It holds that

1PQTV12exp(DKL(P||Q)+DKL(Q||P)2).1-\|P-Q\|_{\mathrm{TV}}\geq\frac{1}{2}\exp\left(-\frac{D_{\mathrm{KL}}(P||Q)+D_{\mathrm{KL}}(Q||P)}{2}\right).
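
For a concrete illustration of Lemma A.1 (not needed for the proofs), one can compare both sides for two Bernoulli measures; the choice p = 0.55, q = 0.45 below is arbitrary.

import numpy as np

p, q = 0.55, 0.45
tv = abs(p - q)                                           # TV distance of Bern(p), Bern(q)
kl = lambda a, c: a * np.log(a / c) + (1 - a) * np.log((1 - a) / (1 - c))
print(1 - tv, 0.5 * np.exp(-(kl(p, q) + kl(q, p)) / 2))   # LHS >= RHS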

Appendix B Missing Proofs of Section 4

See 4.1

Proof of Lemma 4.1.

Taking expectation on rsr_{s} w.r.t. dsd_{s}, one has

𝔼[rs]=r(vs,bs,α0).\displaystyle\mathbb{E}\left[r_{s}\right]=r(v_{s},b_{s},\alpha_{0}).

Then by applying the Azuma–Hoeffding inequality, with probability at least 1δ/T1-\delta/T,

|s=1t1(rsr(vs,bs,α0))|2(t1)log(2T/δ).\displaystyle\left|\sum_{s=1}^{t-1}\left(r_{s}-r(v_{s},b_{s},\alpha_{0})\right)\right|\leq\sqrt{2(t-1)\log{(2T/\delta)}}.

Then by (3) in Algorithm 4.1,

|s=1t1(α~tα0)0bsG(y)𝑑y|\displaystyle\left|\sum_{s=1}^{t-1}\left(\widetilde{\alpha}_{t}-\alpha_{0}\right)\int_{0}^{b_{s}}G(y)\,dy\right|\leq |s=1t1(rsr(vs,bs,α~t))|+|s=1t1(rsr(vs,bs,α0))|\displaystyle\left|\sum_{s=1}^{t-1}\left(r_{s}-r(v_{s},b_{s},\widetilde{\alpha}_{t})\right)\right|+\left|\sum_{s=1}^{t-1}\left(r_{s}-r(v_{s},b_{s},\alpha_{0})\right)\right|
\displaystyle\leq 2|s=1t1(rsr(vs,bs,α0))|\displaystyle 2\left|\sum_{s=1}^{t-1}\left(r_{s}-r(v_{s},b_{s},\alpha_{0})\right)\right|
\displaystyle\leq 22(t1)log(2T/δ).\displaystyle 2\sqrt{2(t-1)\log{(2T/\delta)}}.

Then the proof can be concluded by applying a union bound. ∎
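
Since equation (3) of Algorithm 4.1 is not restated here, the following Python sketch only illustrates an estimator in the spirit of Lemma 4.1, under the assumption that the estimate of the credibility parameter is chosen to minimize |sum_s (r_s - r(v_s, b_s, alpha))| over alpha in [0, 1] with G known; the uniform value, bid, and competing-bid streams are purely illustrative.

import numpy as np

rng = np.random.default_rng(1)
alpha0, T = 0.7, 50_000
G = lambda x: np.clip(x, 0.0, 1.0)                  # known CDF of d, here U[0, 1]
int_G = lambda x: 0.5 * np.clip(x, 0.0, 1.0) ** 2   # \int_0^x G(y) dy for U[0, 1]

v = rng.uniform(0.5, 1.0, size=T)                   # values
b = rng.uniform(0.0, 0.5, size=T)                   # past bids (arbitrary stream)
d = rng.uniform(0.0, 1.0, size=T)                   # highest competing bids ~ G
p = alpha0 * d + (1 - alpha0) * b                   # manipulated prices
r = (v - p) * (b >= d)                              # realized per-round rewards

# Minimizer of |sum_s (r_s - r(v_s, b_s, alpha))| over alpha in [0, 1],
# where r(v, b, alpha) = (v - b) G(b) + alpha * \int_0^b G(y) dy.
num = np.sum(r - (v - b) * G(b))
den = np.sum(int_G(b))
alpha_hat = float(np.clip(num / den, 0.0, 1.0))
print(alpha_hat, alpha0)                            # estimate should be close to alpha0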

See 4.3

Proof of Theorem 4.3.

First, we show that the loss incurred by choosing such a bid btb_{t} is small with probability at least 1δ1-\delta. For all t2t\geq 2,

r(vt,bt,α0)\displaystyle r(v_{t},b_{t},\alpha_{0}) =r(vt,bt,α~t)(α~tα0)0btG(y)𝑑y,\displaystyle=r(v_{t},b_{t},\widetilde{\alpha}_{t})-\left(\widetilde{\alpha}_{t}-\alpha_{0}\right)\int_{0}^{b_{t}}G(y)\,dy,
(a)r(vt,bt,α~t)(α~tα0)0btG(y)𝑑y,\displaystyle\overset{(\text{a})}{\geq}r(v_{t},b_{t}^{*},\widetilde{\alpha}_{t})-\left(\widetilde{\alpha}_{t}-\alpha_{0}\right)\int_{0}^{b_{t}}G(y)\,dy,
=r(vt,bt,α0)+(α~tα0)btbtG(y)𝑑y,\displaystyle=r(v_{t},b_{t}^{*},\alpha_{0})+\left(\widetilde{\alpha}_{t}-\alpha_{0}\right)\int_{b_{t}}^{b_{t}^{*}}G(y)\,dy,
(b)r(vt,bt,α0)1B1(α~tα0)2,\displaystyle\overset{(\text{b})}{\geq}r(v_{t},b_{t}^{*},\alpha_{0})-\frac{1}{B_{1}}(\widetilde{\alpha}_{t}-\alpha_{0})^{2},
(c)r(vt,bt,α0)1B1wt2,\displaystyle\overset{(\text{c})}{\geq}r(v_{t},b_{t}^{*},\alpha_{0})-\frac{1}{B_{1}}w_{t}^{2}, (27)

where (a) holds by the choice of btb_{t}; (b) holds since bt=b(vt,α0)b_{t}^{*}=b^{*}(v_{t},\alpha_{0}) and bt=b(vt,α~t)b_{t}=b^{*}(v_{t},\widetilde{\alpha}_{t}), so that |btbtG(y)𝑑y||btbt||α~tα0|/B1\left|\int_{b_{t}}^{b_{t}^{*}}G(y)\,dy\right|\leq|b_{t}^{*}-b_{t}|\leq|\widetilde{\alpha}_{t}-\alpha_{0}|/B_{1} by Lemma B.1; and (c) follows from Lemma 4.1.

Next, we have

s=1t10bsG(y)𝑑y\displaystyle\sum_{s=1}^{t-1}\int_{0}^{b_{s}}G(y)dy =s=1t10b(vs,α~s)G(y)𝑑y.\displaystyle=\sum_{s=1}^{t-1}\int_{0}^{b^{*}(v_{s},\widetilde{\alpha}_{s})}G(y)\,dy.
(a)s=1t10b(vs,0)G(y)𝑑y.\displaystyle\overset{(\text{a})}{\geq}\sum_{s=1}^{t-1}\int_{0}^{b^{*}(v_{s},0)}G(y)\,dy.
(b)(t1)𝔼v[0b(v,0)G(y)𝑑y]C02(t1)log(T/δ),\displaystyle\overset{(\text{b})}{\geq}(t-1)\cdot\underbrace{\mathbb{E}_{v}\left[\int_{0}^{b^{*}(v,0)}G(y)\,dy\right]}_{C_{0}}-\sqrt{2(t-1)\log{(T/\delta)}},

where (a) holds by Lemma 3.4 (or Lemma B.1) and (b) holds with probability at least 1δ/T1-\delta/T by applying the Azuma–Hoeffding inequality.

By Lemma B.2, if C0=0C_{0}=0, no algorithm can obtain positive expected reward regardless of the seller’s credibility α0\alpha_{0}, which also implies that any algorithm guaranteeing btvtb_{t}\leq v_{t} incurs zero regret. Therefore, we can assume C0C_{0} is a positive constant depending on the known distribution GG. Then for some constant C1(0,C0)C_{1}\in(0,C_{0}), when t>t0=2log(T/δ)(C0C1)2t>t_{0}=\frac{2\log{(T/\delta)}}{(C_{0}-C_{1})^{2}}, we have

wt\displaystyle w_{t} 22(t1)log(2T/δ)C0(t1)2(t1)log(T/δ)\displaystyle\leq\frac{2\sqrt{2(t-1)\log{(2T/\delta)}}}{C_{0}(t-1)-\sqrt{2(t-1)\log{(T/\delta)}}}
22log(2T/δ)C0t12log(T/δ)\displaystyle\leq\frac{2\sqrt{2\log{(2T/\delta)}}}{C_{0}\sqrt{t-1}-\sqrt{2\log{(T/\delta)}}}
22log(2T/δ)C1t1.\displaystyle\leq\frac{2\sqrt{2\log{(2T/\delta)}}}{C_{1}\sqrt{t-1}}. (28)

Combining (27) and (28), we have with probability at least 12δ1-2\delta,

t=1Tr(vt,bt,α0)t=1Tr(vt,bt,α0)\displaystyle\sum_{t=1}^{T}r(v_{t},b_{t}^{*},\alpha_{0})-\sum_{t=1}^{T}r(v_{t},b_{t},\alpha_{0})\leq t0+1B1t=t0+1Twt2\displaystyle t_{0}+\frac{1}{B_{1}}\sum_{t=t_{0}+1}^{T}w_{t}^{2}
\displaystyle\leq t0+8log(2T/δ)B1C12t=t0+1T1t1\displaystyle t_{0}+\frac{8\log{(2T/\delta)}}{B_{1}C_{1}^{2}}\sum_{t=t_{0}+1}^{T}\frac{1}{t-1}
\displaystyle\leq 2log(T/δ)(C0C1)2+8log(2T/δ)B1C12logT,\displaystyle\frac{2\log{(T/\delta)}}{(C_{0}-C_{1})^{2}}+\frac{8\log{(2T/\delta)}}{B_{1}C_{1}^{2}}\log{T},

which is O(log2T)O(\log^{2}{T}) by taking δT1\delta\sim T^{-1}. The proof of the regret upper bound can be finished by taking expectation.

The lower bound holds trivially. For some constant cc, we have

infπsupα0Regret(π)c,\displaystyle\inf_{\pi}\sup_{\alpha_{0}}\mathrm{Regret}(\pi)\geq c,

since already in the first round, no single bid btb_{t} can obtain a uniformly small instantaneous regret for all α0\alpha_{0}. ∎

See 4.4

Proof of Corollary 4.4.

The proof is analogous to that of Theorem 4.3. We first have

r(vt,bt,α0)\displaystyle r(v_{t},b_{t},\alpha_{0}) =r(vt,bt,α~t)(α~tα0)0btG(y)𝑑y,\displaystyle=r(v_{t},b_{t},\widetilde{\alpha}_{t})-\left(\widetilde{\alpha}_{t}-\alpha_{0}\right)\int_{0}^{b_{t}}G(y)\,dy,
r(vt,bt,α~t)(α~tα0)0btG(y)𝑑y,\displaystyle\geq r(v_{t},b_{t}^{*},\widetilde{\alpha}_{t})-\left(\widetilde{\alpha}_{t}-\alpha_{0}\right)\int_{0}^{b_{t}}G(y)\,dy,
=r(vt,bt,α0)+(α~tα0)btbtG(y)𝑑y,\displaystyle=r(v_{t},b_{t}^{*},\alpha_{0})+\left(\widetilde{\alpha}_{t}-\alpha_{0}\right)\int_{b_{t}}^{b_{t}^{*}}G(y)\,dy,
r(vt,bt,α0)wt,\displaystyle\geq r(v_{t},b_{t}^{*},\alpha_{0})-w_{t}, (29)

where the second inequality uses the fact that btbtG(y)𝑑y1\int_{b_{t}}^{b_{t}^{*}}G(y)\,dy\leq 1 rather than Lemma B.1. Together with inequality (28), which still holds given that GG is continuous, the proof can be finished by bounding t=1Twt\sum_{t=1}^{T}w_{t}. ∎

Lemma B.1.

Suppose that Assumption 4.2 holds. Then b(v,α)b^{*}(v,\alpha) is strictly increasing in α\alpha with derivative bounded by 1/B11/B_{1}.

Proof of Lemma B.1.

First, we have

r(v,b,α)b=01(vb)g(b)G(b)=α.\displaystyle\frac{\partial r(v,b,\alpha)}{\partial b}=0\Longrightarrow 1-(v-b)\frac{g(b)}{G(b)}=\alpha.

Denote ϕ(b)=1(vb)g(b)G(b)\phi(b)=1-(v-b)\frac{g(b)}{G(b)}. Given that Assumption 4.2 holds, we have ϕ(b)=logG(b)(vb)log′′G(b)B1\phi^{\prime}(b)=\log^{\prime}G(b)-(v-b)\log^{\prime\prime}G(b)\geq B_{1}. Thus, ϕ(b)\phi(b) is a strictly increasing function and we have

b(v,α)α=(ϕ1(α))=1ϕ(ϕ1(α))1B1.\displaystyle\frac{\partial b^{*}(v,\alpha)}{\partial\alpha}=(\phi^{-1}(\alpha))^{\prime}=\frac{1}{\phi^{\prime}(\phi^{-1}(\alpha))}\leq\frac{1}{B_{1}}.
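
As a quick sanity check of Lemma B.1 on a concrete log-concave instance (illustrative only), take G to be the uniform distribution on [0, 1] and v = 0.8; then r(v, b, alpha) = (v - b) b + alpha b^2 / 2, so b*(v, alpha) = v / (2 - alpha), which is increasing in alpha with a bounded derivative.

import numpy as np

v = 0.8
bids = np.linspace(0.0, 1.0, 100001)
alphas = np.linspace(0.0, 1.0, 11)
# Grid argmax of r(v, b, alpha) for G = U[0, 1]
b_star = np.array([bids[np.argmax((v - bids) * bids + a * bids ** 2 / 2)]
                   for a in alphas])
print(bool(np.allclose(b_star, v / (2 - alphas), atol=1e-3)))  # matches closed form
print(bool(np.all(np.diff(b_star) > 0)))                       # monotone in alpha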

Lemma B.2.

Suppose that GG is continuous. If 𝔼v[0b(v,0)G(y)𝑑y]=0\mathbb{E}_{v}\left[\int_{0}^{b^{*}(v,0)}G(y)\,dy\right]=0, then for any α0\alpha_{0}, no algorithm can obtain positive expected reward.

Proof of Lemma B.2.

First, as b(v,0)b^{*}(v,0) is increasing in vv by Lemma 3.4, 0b(v,0)G(y)𝑑y\int_{0}^{b^{*}(v,0)}G(y)\,dy is also increasing in vv. Thus, if 𝔼v[0b(v,0)G(y)𝑑y]=0\mathbb{E}_{v}\left[\int_{0}^{b^{*}(v,0)}G(y)\,dy\right]=0, it holds that for nearly all vv,

0b(v,0)G(y)𝑑y=0,\displaystyle\int_{0}^{b^{*}(v,0)}G(y)\,dy=0,

except for a set contained in [F1(1),1][F^{-1}(1),1] that has measure zero under the value distribution FF.

Next, by the continuity and monotonicity of GG, for all v[0,F1(1))v\in[0,F^{-1}(1)), we have G(y)0G(y)\equiv 0 on [0,b(v,0)][0,b^{*}(v,0)]. As

r(v,b(v,0),0)=(vb(v,0))G(b(v,0))=0=r(v,v,0),r(v,b^{*}(v,0),0)=(v-b^{*}(v,0))G(b^{*}(v,0))=0=r(v,v,0),

we have b(v,0)vb^{*}(v,0)\geq v by the tie-breaking rule in the definition of b(v,0)b^{*}(v,0). Thus, G(y)G(y) remains zero on [0,v][0,v] for all v<F1(1)v<F^{-1}(1). Again by the continuity of GG, we have G(y)0G(y)\equiv 0 on [0,F1(1)][0,F^{-1}(1)] where F1(1)=inf{v:F(v)=1}F^{-1}(1)=\inf\{v:F(v)=1\}.

For nearly all vv except a zero-measure set,

r(v,b,α0)\displaystyle r(v,b,\alpha_{0}) =(vb)G(b)+α00bG(y)𝑑y\displaystyle=(v-b)G(b)+\alpha_{0}\int_{0}^{b}G(y)\,dy
=vb(α0G(y)G(b))𝑑y\displaystyle=\int_{v}^{b}\left(\alpha_{0}G(y)-G(b)\right)dy
0,\displaystyle\leq 0,

where the second equality uses G(y)0G(y)\equiv 0 on [0,v][0,v], and the inequality holds (1) with equality when bvb\leq v, since GG vanishes on [0,b][0,b], and (2) when bvb\geq v, since α0G(y)G(b)\alpha_{0}G(y)\leq G(b) for y[v,b]y\in[v,b]. Therefore, no algorithm can obtain positive expected reward under such a distribution GG regardless of the seller’s credibility α0\alpha_{0}. ∎

Example B.3.

Let v1v\equiv 1. Consider the following distribution:

G(y)={0if 0y<1334y+14if 13y1.G(y)=\begin{cases}0&\text{if }0\leq y<\frac{1}{3}\\ \frac{3}{4}y+\frac{1}{4}&\text{if }\frac{1}{3}\leq y\leq 1\end{cases}.

Then for b[1/3,1]b\in[1/3,1],

r(v,b,α0)\displaystyle r(v,b,\alpha_{0}) =(1b)G(b)+α00bG(y)𝑑y\displaystyle=(1-b)G(b)+\alpha_{0}\int_{0}^{b}G(y)\,dy
=63α08b2+2+α04b+2α08.\displaystyle=-\frac{6-3\alpha_{0}}{8}b^{2}+\frac{2+\alpha_{0}}{4}b+\frac{2-\alpha_{0}}{8}.

The optimal bid is calculated by

b\displaystyle b^{*} =argmaxbr(v,b,α0)=13(1+22/α01).\displaystyle=\arg\max_{b}r(v,b,\alpha_{0})=\frac{1}{3}\left(1+\frac{2}{2/\alpha_{0}-1}\right).

Taking α0=1T+1/2\alpha_{0}=\frac{1}{\sqrt{T}+1/2}, we have

0bG(y)𝑑y\displaystyle\int_{0}^{b^{*}}G(y)dy =1313(1+1T)(34y+14)𝑑y=124(1T+4T)0.\displaystyle=\int_{\frac{1}{3}}^{\frac{1}{3}\cdot\left(1+\frac{1}{\sqrt{T}}\right)}\left(\frac{3}{4}y+\frac{1}{4}\right)dy=\frac{1}{24}(\frac{1}{T}+\frac{4}{\sqrt{T}})\longrightarrow 0.

Thus, C0C_{0} in the proof of Theorem 4.3 is neither a positive constant nor 0, but a vanishing o(1) quantity. As a result, we are not able to give an O(logT/t1)O(\log{T}/\sqrt{t-1}) upper bound for wtw_{t}.

Intuitively, it requires btbtb_{t}\rightarrow b_{t}^{*} for an algorithm to achieve low regret. However, when btb_{t} is approaching 13(1+1T)\frac{1}{3}\left(1+\frac{1}{\sqrt{T}}\right), although the bidder can still win with positive probability, it becomes harder and harder for the bidder to distinguish different α0\alpha_{0} under bandit feedback since in the winning rounds btdtbt130b_{t}-d_{t}\leq b_{t}-\frac{1}{3}\longrightarrow 0.
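
As a numerical check of the calculations in Example B.3 (illustrative only, with T = 10,000), the following snippet verifies the quadratic form of r(1, b, alpha0), the closed form of the optimal bid, and the value of the integral.

import numpy as np

T = 10_000
alpha0 = 1.0 / (np.sqrt(T) + 0.5)
b = np.linspace(1 / 3, 1.0, 400001)
G = 0.75 * b + 0.25                        # G(b) on [1/3, 1]
int_G = 3 * b ** 2 / 8 + b / 4 - 1 / 8     # \int_0^b G(y) dy for b >= 1/3
r = (1 - b) * G + alpha0 * int_G           # r(1, b, alpha0)

b_star_numeric = b[np.argmax(r)]
b_star_closed = (1 / 3) * (1 + 2 / (2 / alpha0 - 1))   # = (1/3)(1 + 1/sqrt(T))
print(b_star_numeric, b_star_closed)
print(3 * b_star_closed ** 2 / 8 + b_star_closed / 4 - 1 / 8,
      (1 / T + 4 / np.sqrt(T)) / 24)       # both vanish as T grows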

Appendix C Missing Proofs of Section 5

Proposition C.1.

Suppose that Assumption 5.3 holds. There exist positive constants lWl_{W} and hWh_{W} such that x[0,1+W]\forall x\in[0,1+W],

max{|logG(x)|,|log(1G(x))|}\displaystyle\max\{|\log^{\prime}G(x)|,|\log^{\prime}(1-G(x))|\} hW,\displaystyle\leq h_{W}, (30)
min{log′′G(x),log′′(1G(x))}\displaystyle\min\{-\log^{\prime\prime}G(x),-\log^{\prime\prime}(1-G(x))\} lW.\displaystyle\geq l_{W}. (31)

Note that if the density function gg is log-concave, then the cumulative distribution function GG and 1G1-G are both log-concave (An 1996). Thus, Equations 30 and 31 hold trivially on the bounded interval [0,1+W][0,1+W].

See 5.5

Proof of Lemma 5.5.

Denote rs=12B1Tsr_{s}=\frac{1}{2B_{1}\sqrt{T_{s}}}. For any d[0,1+W]d\in[0,1+W], the probability that there exists a point y{dt}tΓsy\in\{d_{t}\}_{t\in\Gamma_{s}} such that |yd|rs|y-d|\leq r_{s} is at least

1(12B1rs)Ts=1(11Ts)Ts1eTs1δ/2.\displaystyle 1-(1-2B_{1}\cdot r_{s})^{T_{s}}=1-\left(1-\frac{1}{\sqrt{T_{s}}}\right)^{T_{s}}\geq 1-e^{-\sqrt{T_{s}}}\geq 1-\delta/2.

Then for any dd and yy such that |yd|rs|y-d|\leq r_{s}, we can decompose |G^s(d;α0)G(d)||\hat{G}_{s}(d;\alpha_{0})-G(d)| as follows,

|G^s(d;α0)G(d)|\displaystyle|\hat{G}_{s}(d;\alpha_{0})-G(d)|
\displaystyle\leq\, |G^s(d;α0)G^s(y;α0)|+|G^s(y;α0)𝔾s(y)|+|𝔾s(y)G(y)|+|G(y)G(d)|\displaystyle|\hat{G}_{s}(d;\alpha_{0})-\hat{G}_{s}(y;\alpha_{0})|+|\hat{G}_{s}(y;\alpha_{0})-\mathbb{G}_{s}(y)|+|\mathbb{G}_{s}(y)-G(y)|+|G(y)-G(d)|
\displaystyle\leq\, B2rs+1Ts+log(4/δ)2Ts+B2rs\displaystyle B_{2}\cdot r_{s}+\frac{1}{T_{s}}+\sqrt{\frac{log(4/\delta)}{2T_{s}}}+B_{2}\cdot r_{s}
=\displaystyle=\, B2B1Ts+log(4/δ)2Ts+1Ts.\displaystyle\frac{B_{2}}{B_{1}\sqrt{T_{s}}}+\sqrt{\frac{log(4/\delta)}{2T_{s}}}+\frac{1}{T_{s}}.

The second inequality bounds the four terms respectively via the B2B_{2}-Lipschitz continuity of G^s(;α0)\hat{G}_{s}(\cdot;\alpha_{0}), Lemma 5.4, the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality, and the B2B_{2}-Lipschitz continuity of GG. It holds with probability at least 1δ/21-\delta/2 since Tslog2(2/δ)T_{s}\gg\log^{2}(2/\delta). ∎
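
The following snippet (illustrative only, with d_t drawn i.i.d. from U[0, 1]) isolates the DKW term in the decomposition above: the supremum distance between the empirical CDF and G stays below sqrt(log(4/delta) / (2 T_s)) with probability at least 1 - delta/2; the remaining terms come from Lemma 5.4 and the Lipschitz continuity of the estimated and true CDFs.

import numpy as np

rng = np.random.default_rng(2)
Ts, delta = 10_000, 0.05
d = np.sort(rng.uniform(0.0, 1.0, size=Ts))
emp = np.arange(1, Ts + 1) / Ts                          # empirical CDF at the samples
sup_err = np.max(np.maximum(np.abs(emp - d), np.abs(emp - 1 / Ts - d)))
print(sup_err, np.sqrt(np.log(4 / delta) / (2 * Ts)))    # DKW bound holds w.h.p.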

Given Lemma 5.5, G^s(;α0)\hat{G}_{s}(\cdot;\alpha_{0}) is arbitrarily close to GG when TsT_{s} is sufficiently large. In addition, Dümbgen and Rufibach (2009) also show that g^s(;α0)\hat{g}_{s}(\cdot;\alpha_{0}) is arbitrarily close to gg when TsT_{s} is sufficiently large. Therefore, we can show the following.

Proposition C.2.

Suppose that Assumption 5.3 holds. Under Algorithm 5.1, there exists positive constants l~W\tilde{l}_{W} and h~W\tilde{h}_{W}, such that both

max{|logG^s(x;α0)|,|log(1G^s(x;α0))|}h~W,x[0,1+W],s[S]\displaystyle max\{|\log^{\prime}\hat{G}_{s}(x;\alpha_{0})|,|\log^{\prime}(1-\hat{G}_{s}(x;\alpha_{0}))|\}\leq\tilde{h}_{W},\forall x\in[0,1+W],\forall s\in[S]

and

min{log′′G^s(x;α),log′′(1G^s(x;α))}l~W,x[0,1+W],α[α¯,1],s[S]\displaystyle min\{-\log^{\prime\prime}\hat{G}_{s}(x;\alpha),-\log^{\prime\prime}(1-\hat{G}_{s}(x;\alpha))\}\geq\tilde{l}_{W},\forall x\in[0,1+W],\forall\alpha\in[\underline{\alpha},1],\forall s\in[S]

hold almost surely.

See 5.6

Proof of Theorem 5.6.

The lower bound directly follows from Lemma 3.5 since no algorithm can avoid Ω(T)\Omega(\sqrt{T}) regret even with known α0\alpha_{0}. Then we mainly focus on the upper bound. We first rewrite the regret per round in the following way,

r(vt,bt,α0)r(vt,bt,α0)\displaystyle r(v_{t},b_{t}^{*},\alpha_{0})-r(v_{t},b_{t},\alpha_{0})
=\displaystyle= 0bt(vtα0x(1α0)bt)g(x)𝑑x0bt(vtα0x(1α0)bt)g(x)𝑑x\displaystyle\int_{0}^{b_{t}^{*}}(v_{t}-\alpha_{0}x-(1-\alpha_{0})b_{t}^{*})g(x)dx-\int_{0}^{b_{t}}(v_{t}-\alpha_{0}x-(1-\alpha_{0})b_{t})g(x)dx
=\displaystyle= 0bt(vtα0x(1α0)bt)g(x)𝑑x0bt(vtα0x(1α0)bt)g^s1(x;α~s1)𝑑x\displaystyle\int_{0}^{b_{t}^{*}}(v_{t}-\alpha_{0}x-(1-\alpha_{0})b_{t}^{*})g(x)dx-\int_{0}^{b_{t}^{*}}(v_{t}-\alpha_{0}x-(1-\alpha_{0})b_{t}^{*})\hat{g}_{s-1}(x;\widetilde{\alpha}_{s-1})dx
+0bt(vtα0x(1α0)bt)g^s1(x;α~s1)𝑑x0bt(vtα0x(1α0)bt)g^s1(x;α~s1)𝑑x\displaystyle+\int_{0}^{b_{t}^{*}}(v_{t}-\alpha_{0}x-(1-\alpha_{0})b_{t}^{*})\hat{g}_{s-1}(x;\widetilde{\alpha}_{s-1})dx-\int_{0}^{b_{t}}(v_{t}-\alpha_{0}x-(1-\alpha_{0})b_{t})\hat{g}_{s-1}(x;\widetilde{\alpha}_{s-1})dx
+0bt(vtα0x(1α0)bt)g^s1(x;α~s1)𝑑x0bt(vtα0x(1α0)bt)g(x)𝑑x\displaystyle+\int_{0}^{b_{t}}(v_{t}-\alpha_{0}x-(1-\alpha_{0})b_{t})\hat{g}_{s-1}(x;\widetilde{\alpha}_{s-1})dx-\int_{0}^{b_{t}}(v_{t}-\alpha_{0}x-(1-\alpha_{0})b_{t})g(x)dx
\displaystyle\leq 0bt(vtα0x(1α0)bt)g(x)𝑑x0bt(vtα0x(1α0)bt)g^s1(x;α~s1)𝑑x\displaystyle\int_{0}^{b_{t}^{*}}(v_{t}-\alpha_{0}x-(1-\alpha_{0})b_{t}^{*})g(x)dx-\int_{0}^{b_{t}^{*}}(v_{t}-\alpha_{0}x-(1-\alpha_{0})b_{t}^{*})\hat{g}_{s-1}(x;\widetilde{\alpha}_{s-1})dx
+0bt(vtα0x(1α0)bt)g^s1(x;α~s1)𝑑x0bt(vtα0x(1α0)bt)g(x)𝑑x\displaystyle+\int_{0}^{b_{t}}(v_{t}-\alpha_{0}x-(1-\alpha_{0})b_{t})\hat{g}_{s-1}(x;\widetilde{\alpha}_{s-1})dx-\int_{0}^{b_{t}}(v_{t}-\alpha_{0}x-(1-\alpha_{0})b_{t})g(x)dx
=\displaystyle= (vtbt)[G(bt)G^s1(bt;α~s1)](vtbt)[G(bt)G^s1(bt;α~s1)]+α0btbt[G(x)G^s1(x;α~s1)]𝑑x\displaystyle(v_{t}-b_{t}^{*})\cdot[G(b_{t}^{*})-\hat{G}_{s-1}(b_{t}^{*};\widetilde{\alpha}_{s-1})]-(v_{t}-b_{t})\cdot[G(b_{t})-\hat{G}_{s-1}(b_{t};\widetilde{\alpha}_{s-1})]+\alpha_{0}\int_{b_{t}}^{b_{t}^{*}}[G(x)-\hat{G}_{s-1}(x;\widetilde{\alpha}_{s-1})]dx
\displaystyle\leq |G(bt)G^s1(bt;α~s1)|+|G(bt)G^s1(bt;α~s1)|+supx|G(x)G^s1(x;α~s1)|,\displaystyle|G(b_{t}^{*})-\hat{G}_{s-1}(b_{t}^{*};\widetilde{\alpha}_{s-1})|+|G(b_{t})-\hat{G}_{s-1}(b_{t};\widetilde{\alpha}_{s-1})|+\sup_{x}|G(x)-\hat{G}_{s-1}(x;\widetilde{\alpha}_{s-1})|,

where the first inequality holds by the choice of btb_{t}, i.e.,

0bt(vtα0x(1α0)bt)g^s1(x;α~s1)𝑑x0bt(vtα0x(1α0)bt)g^s1(x;α~s1)𝑑x.\displaystyle\int_{0}^{b_{t}^{*}}(v_{t}-\alpha_{0}x-(1-\alpha_{0})b_{t}^{*})\hat{g}_{s-1}(x;\widetilde{\alpha}_{s-1})dx\leq\int_{0}^{b_{t}}(v_{t}-\alpha_{0}x-(1-\alpha_{0})b_{t})\hat{g}_{s-1}(x;\widetilde{\alpha}_{s-1})dx.

Let Regrets\mathrm{Regret}_{s} be the regret achieved in episode ss. We have

Regrets\displaystyle\mathrm{Regret}_{s} tΓs|G(bt)G^s1(bt;α~s1)|+tΓs|G(bt)G^s1(bt;α~s1)|+Tssupx|G(x)G^s1(x;α~s1)|\displaystyle\leq\sum_{t\in\Gamma_{s}}|G(b_{t}^{*})-\hat{G}_{s-1}(b_{t}^{*};\widetilde{\alpha}_{s-1})|+\sum_{t\in\Gamma_{s}}|G(b_{t})-\hat{G}_{s-1}(b_{t};\widetilde{\alpha}_{s-1})|+T_{s}\sup_{x}|G(x)-\hat{G}_{s-1}(x;\widetilde{\alpha}_{s-1})|
12AB2(2+W)Tsα¯2W2l~W+3B2TsB1+3log(16S/δ)Ts2+3,\displaystyle\leq\frac{12AB_{2}(2+W)T_{s}}{\underline{\alpha}^{2}W^{2}\tilde{l}_{W}}+\frac{3B_{2}\sqrt{T_{s}}}{B_{1}}+3\sqrt{\frac{\log(16S/\delta)T_{s}}{2}}+3,

where the second inequality holds by Lemma C.3 and Lemma C.4 (setting 𝒦s=4Aα¯2W2l~W\mathcal{K}_{s}=\frac{4A}{\underline{\alpha}^{2}W^{2}\tilde{l}_{W}} and δ:=δ/2S\delta:=\delta/2S). Hence, the inequality holds with probability at least 1δ/S1-\delta/S. Finally, by a union bound over SS episodes and the fact that Ts/Ts1=TT_{s}/T_{s-1}=\sqrt{T} and SloglogTS\leq\log\log T, we complete our proof. ∎

Lemma C.3.

For each episode ss, we have with probability at least 1δ/2S1-\delta/2S,

|α~sα0|4Aα¯2W2l~W,\displaystyle|\widetilde{\alpha}_{s}-\alpha_{0}|\leq\frac{4A}{\underline{\alpha}^{2}W^{2}\tilde{l}_{W}},

where

A=Wh~W2β¯Ts(B2B1Ts+log(16S/δ)2Ts+1Ts)+Wh~W2β¯log(2S/δ)Ts.A=\frac{W\tilde{h}_{W}}{2\overline{\beta}T_{s}}\cdot\left(\frac{B_{2}}{B_{1}\sqrt{T_{s}}}+\sqrt{\frac{log(16S/\delta)}{2T_{s}}}+\frac{1}{T_{s}}\right)+\frac{W\tilde{h}_{W}}{2\overline{\beta}}\sqrt{\frac{\log(2S/\delta)}{T_{s}}}.
Proof.

For notational simplicity, we re-parameterize α\alpha by denoting β=1/α\beta=1/\alpha. By Assumption 5.2, β\beta has an upper bound β¯=1/α¯\overline{\beta}=1/\underline{\alpha}. To compute β^s\hat{\beta}_{s}, we minimize the following MLE loss function,

s(β)=1TstΓs\displaystyle\mathcal{L}_{s}(\beta)=-\frac{1}{T_{s}}\sum_{t\in\Gamma_{s}} [𝕀{bt+W2β¯pt}logG^s(bt+W2β¯β;β)\displaystyle\left[\mathbb{I}\left\{b_{t}+\frac{W}{2\overline{\beta}}\geq p_{t}\right\}\log\hat{G}_{s}\left(b_{t}+\frac{W}{2\overline{\beta}}\beta;\beta\right)\right.
+𝕀{bt+W2β¯<pt}log(1G^s(bt+W2β¯β;β))].\displaystyle+\left.\mathbb{I}\left\{b_{t}+\frac{W}{2\overline{\beta}}<p_{t}\right\}\log\left(1-\hat{G}_{s}\left(b_{t}+\frac{W}{2\overline{\beta}}\beta;\beta\right)\right)\right].

We abuse notation and denote εt(β)=bt+W2β¯β\varepsilon_{t}(\beta)=b_{t}+\frac{W}{2\overline{\beta}}\beta. Observe that

𝔼[𝕀{bt+W2β¯pt}]\displaystyle\mathbb{E}\left[\mathbb{I}\left\{b_{t}+\frac{W}{2\overline{\beta}}\geq p_{t}\right\}\right] =𝔼[𝕀{bt+W2β¯α0dt+(1α0)bt}]=𝔼[𝕀{bt+W2β¯β0dt}]=G(εt(β0)).\displaystyle=\mathbb{E}\left[\mathbb{I}\left\{b_{t}+\frac{W}{2\overline{\beta}}\geq\alpha_{0}d_{t}+(1-\alpha_{0})b_{t}\right\}\right]=\mathbb{E}\left[\mathbb{I}\left\{b_{t}+\frac{W}{2\overline{\beta}}\beta_{0}\geq d_{t}\right\}\right]=G(\varepsilon_{t}(\beta_{0})).

When G^s\hat{G}_{s} equals the true GG, β0=1/α0\beta_{0}=1/\alpha_{0} minimizes the expected loss. By the second-order Taylor theorem, we have

s(β^s)s(β0)=s(β0)(β^sβ0)+12s′′(β~)(β^sβ0)2\displaystyle\mathcal{L}_{s}(\hat{\beta}_{s})-\mathcal{L}_{s}(\beta_{0})=\mathcal{L}^{\prime}_{s}(\beta_{0})(\hat{\beta}_{s}-\beta_{0})+\frac{1}{2}\mathcal{L}^{\prime\prime}_{s}(\tilde{\beta})(\hat{\beta}_{s}-\beta_{0})^{2}

for some β~\tilde{\beta} on the line segment between β0\beta_{0} and β^s\hat{\beta}_{s}. Given the definition of s(β)\mathcal{L}_{s}(\beta), we have

s(β)=1TstΓsW2β¯ηt(β),s′′(β)=1TstΓsW24β¯2ζt(β)\displaystyle\mathcal{L}^{\prime}_{s}(\beta)=\frac{1}{T_{s}}\sum_{t\in\Gamma_{s}}\frac{W}{2\overline{\beta}}\eta_{t}(\beta),\qquad\mathcal{L}^{\prime\prime}_{s}(\beta)=\frac{1}{T_{s}}\sum_{t\in\Gamma_{s}}\frac{W^{2}}{4\overline{\beta}^{2}}\zeta_{t}(\beta)

where ηt(β)\eta_{t}(\beta) and ζt(β)\zeta_{t}(\beta) are defined as follows,

ηt(β)=𝕀{bt+W2β¯pt}logG^s(εt(β);β)𝕀{bt+W2β¯<pt}log(1G^s(εt(β);β)),\displaystyle\eta_{t}(\beta)=-\mathbb{I}\left\{b_{t}+\frac{W}{2\overline{\beta}}\geq p_{t}\right\}\log^{\prime}\hat{G}_{s}(\varepsilon_{t}(\beta);\beta)-\mathbb{I}\left\{b_{t}+\frac{W}{2\overline{\beta}}<p_{t}\right\}\log^{\prime}(1-\hat{G}_{s}(\varepsilon_{t}(\beta);\beta)),
ζt(β)=𝕀{bt+W2β¯pt}log′′G^s(εt(β);β)𝕀{bt+W2β¯<pt}log′′(1G^s(εt(β);β)).\displaystyle\zeta_{t}(\beta)=-\mathbb{I}\left\{b_{t}+\frac{W}{2\overline{\beta}}\geq p_{t}\right\}\log^{\prime\prime}\hat{G}_{s}(\varepsilon_{t}(\beta);\beta)-\mathbb{I}\left\{b_{t}+\frac{W}{2\overline{\beta}}<p_{t}\right\}\log^{\prime\prime}(1-\hat{G}_{s}(\varepsilon_{t}(\beta);\beta)).

By the construction of the algorithm, btb_{t} is independent of dtd_{t}. Therefore, εt(β0)=bt+W2β¯β0\varepsilon_{t}(\beta_{0})=b_{t}+\frac{W}{2\overline{\beta}}\beta_{0} is independent of dtd_{t} for any tΓst\in\Gamma_{s}, and we have

𝔼[ηt(β0)]\displaystyle\mathbb{E}[\eta_{t}(\beta_{0})] =g^s(εt(β0);β0)G^s(εt(β0);β0)G(εt(β0))+g^s(εt(β0);β0)1G^s(εt(β0);β0)(1G(εt(β0)))\displaystyle=-\frac{\hat{g}_{s}(\varepsilon_{t}(\beta_{0});\beta_{0})}{\hat{G}_{s}(\varepsilon_{t}(\beta_{0});\beta_{0})}\cdot G(\varepsilon_{t}(\beta_{0}))+\frac{\hat{g}_{s}(\varepsilon_{t}(\beta_{0});\beta_{0})}{1-\hat{G}_{s}(\varepsilon_{t}(\beta_{0});\beta_{0})}\cdot(1-G(\varepsilon_{t}(\beta_{0})))
=[G^s(εt(β0);β0)G(εt(β0))][g^s(εt(β0);β0)G^s(εt(β0);β0)+g^s(εt(β0);β0)1G^s(εt(β0);β0)].\displaystyle=[\hat{G}_{s}(\varepsilon_{t}(\beta_{0});\beta_{0})-G(\varepsilon_{t}(\beta_{0}))]\cdot[\frac{\hat{g}_{s}(\varepsilon_{t}(\beta_{0});\beta_{0})}{\hat{G}_{s}(\varepsilon_{t}(\beta_{0});\beta_{0})}+\frac{\hat{g}_{s}(\varepsilon_{t}(\beta_{0});\beta_{0})}{1-\hat{G}_{s}(\varepsilon_{t}(\beta_{0});\beta_{0})}].

Thus, by Lemma 5.5 and Proposition C.2, we have

|𝔼[ηt(β0)]|\displaystyle|\mathbb{E}[\eta_{t}(\beta_{0})]| =|G^s(εt(β0);β0)G(εt(β0))|[g^s(εt(β0);β0)G^s(εt(β0);β0)+g^s(εt(β0);β0)1G^s(εt(β0);β0)]\displaystyle=|\hat{G}_{s}(\varepsilon_{t}(\beta_{0});\beta_{0})-G(\varepsilon_{t}(\beta_{0}))|\cdot[\frac{\hat{g}_{s}(\varepsilon_{t}(\beta_{0});\beta_{0})}{\hat{G}_{s}(\varepsilon_{t}(\beta_{0});\beta_{0})}+\frac{\hat{g}_{s}(\varepsilon_{t}(\beta_{0});\beta_{0})}{1-\hat{G}_{s}(\varepsilon_{t}(\beta_{0});\beta_{0})}]
2h~W(B2B1Ts+log(16S/δ)2Ts+1Ts)\displaystyle\leq 2\tilde{h}_{W}\cdot\left(\frac{B_{2}}{B_{1}\sqrt{T_{s}}}+\sqrt{\frac{log(16S/\delta)}{2T_{s}}}+\frac{1}{T_{s}}\right)

holds with probability at least 1δ/4S1-\delta/4S. Then by Hoeffding’s inequality and union bound

|s(β0)|\displaystyle|\mathcal{L}^{\prime}_{s}(\beta_{0})| W2β¯TstΓs|𝔼[ηt(β0)]|+Wh~W2β¯log(8S/δ)Ts\displaystyle\leq\frac{W}{2\overline{\beta}T_{s}}\sum_{t\in\Gamma_{s}}|\mathbb{E}[\eta_{t}(\beta_{0})]|+\frac{W\tilde{h}_{W}}{2\overline{\beta}}\sqrt{\frac{\log(8S/\delta)}{T_{s}}}
Wh~W2β¯Ts(B2B1Ts+log(16S/δ)2Ts+1Ts)+Wh~W2β¯log(8S/δ)Ts:=A\displaystyle\leq\frac{W\tilde{h}_{W}}{2\overline{\beta}T_{s}}\cdot\left(\frac{B_{2}}{B_{1}\sqrt{T_{s}}}+\sqrt{\frac{log(16S/\delta)}{2T_{s}}}+\frac{1}{T_{s}}\right)+\frac{W\tilde{h}_{W}}{2\overline{\beta}}\sqrt{\frac{\log(8S/\delta)}{T_{s}}}:=A

holds with probability at least 1δ/2S1-\delta/2S. By the optimality of β^s\hat{\beta}_{s},

s(β^s)s(β0)0.\displaystyle\mathcal{L}_{s}(\hat{\beta}_{s})-\mathcal{L}_{s}(\beta_{0})\leq 0.

Plugging this into the Taylor expansion above, we have

12s′′(β~)(β^sβ0)2s(β0)(β^sβ0)A|β^sβ0|,\displaystyle\frac{1}{2}\mathcal{L}^{\prime\prime}_{s}(\tilde{\beta})(\hat{\beta}_{s}-\beta_{0})^{2}\leq-\mathcal{L}^{\prime}_{s}(\beta_{0})(\hat{\beta}_{s}-\beta_{0})\leq A|\hat{\beta}_{s}-\beta_{0}|,
|β^sβ0|2A2l~WW2/(4β¯2)=4β¯2AW2l~W.\displaystyle|\hat{\beta}_{s}-\beta_{0}|\leq\frac{2A}{2\tilde{l}_{W}\cdot W^{2}/(4\overline{\beta}^{2})}=\frac{4\overline{\beta}^{2}A}{W^{2}\tilde{l}_{W}}.

holds with probability at least 1δ/2S1-\delta/2S. Since α~s=1/β^s\widetilde{\alpha}_{s}=1/\hat{\beta}_{s} and both β^s\hat{\beta}_{s} and β0\beta_{0} are at least 11, we have |α~sα0||β^sβ0||\widetilde{\alpha}_{s}-\alpha_{0}|\leq|\hat{\beta}_{s}-\beta_{0}|, and hence

|α~sα0|4β¯2AW2l~W=4Aα¯2W2l~W.\displaystyle|\widetilde{\alpha}_{s}-\alpha_{0}|\leq\frac{4\overline{\beta}^{2}A}{W^{2}\tilde{l}_{W}}=\frac{4A}{\underline{\alpha}^{2}W^{2}\tilde{l}_{W}}.

holds with probability at least 1δ/2S1-\delta/2S. ∎
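
The following Python sketch (illustrative only) mimics the MLE step analyzed above, with the true CDF G = U[0, 1] standing in for the estimate, alpha0 = 0.5, a lower bound alpha_low = 0.25, W = 0.5, and an arbitrary bid stream; it evaluates the loss on a grid of beta and checks that the minimizer is close to beta0 = 1 / alpha0, as Lemma C.3 predicts.

import numpy as np

rng = np.random.default_rng(3)
alpha0, alpha_low, W, Ts = 0.5, 0.25, 0.5, 50_000
beta0, beta_bar = 1 / alpha0, 1 / alpha_low
b = rng.uniform(0.0, 0.5, size=Ts)                # bids placed in the episode
d = rng.uniform(0.0, 1.0, size=Ts)                # highest competing bids, G = U[0, 1]
p = alpha0 * d + (1 - alpha0) * b                 # manipulated prices
win = b + W / (2 * beta_bar) >= p                 # indicator appearing in the loss

G = lambda x: np.clip(x, 0.0, 1.0)                # true CDF, standing in for the estimate

def loss(beta):
    eps = b + W / (2 * beta_bar) * beta           # epsilon_t(beta)
    return -np.mean(win * np.log(G(eps)) + (~win) * np.log(1 - G(eps)))

grid = np.linspace(1.0, beta_bar, 601)
beta_hat = grid[np.argmin([loss(beta) for beta in grid])]
print(beta_hat, beta0)                            # minimizer should be close to beta0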

Lemma C.4.

For any fixed δ>0\delta>0, suppose Tslog2(2/δ)T_{s}\gg\log^{2}(2/\delta). Conditioned on |β^sβ0|𝒦s|\hat{\beta}_{s}-\beta_{0}|\leq\mathcal{K}_{s}, we have, for all d[0,1+W]d\in[0,1+W],

|G^s(d;β^s)G(d)|3B2(2+W)𝒦s+B2B1Ts+log(8/δ)2Ts+1Ts\displaystyle|\hat{G}_{s}(d;\hat{\beta}_{s})-G(d)|\leq 3B_{2}(2+W)\mathcal{K}_{s}+\frac{B_{2}}{B_{1}\sqrt{T_{s}}}+\sqrt{\frac{\log(8/\delta)}{2T_{s}}}+\frac{1}{T_{s}}

holds with probability at least 1δ1-\delta.

Proof.

Let 𝔾^s\hat{\mathbb{G}}_{s} be the empirical distribution of samples {β^s(ptbt)+bt}tΓs\{\hat{\beta}_{s}(p_{t}-b_{t})+b_{t}\}_{t\in\Gamma_{s}}, i.e.

𝔾^s(d)=1TstΓs𝕀{β^s(ptbt)+btd}=1TstΓs𝕀{dtd+(β0β^s)(ptbt)}.\displaystyle\hat{\mathbb{G}}_{s}(d)=\frac{1}{T_{s}}\sum_{t\in\Gamma_{s}}\mathbb{I}\{\hat{\beta}_{s}(p_{t}-b_{t})+b_{t}\leq d\}=\frac{1}{T_{s}}\sum_{t\in\Gamma_{s}}\mathbb{I}\{d_{t}\leq d+(\beta_{0}-\hat{\beta}_{s})(p_{t}-b_{t})\}.

First, we give a uniform convergence bound for |𝔾^s(d)G(d)||\hat{\mathbb{G}}_{s}(d)-G(d)|. The main challenge is that we cannot directly apply the DKW inequality, since β^s\hat{\beta}_{s} depends on dtd_{t}, tΓst\in\Gamma_{s}. To handle this challenge, we bound 𝔾^s(d)\hat{\mathbb{G}}_{s}(d) from above and from below separately. Since |β^sβ0|𝒦s|\hat{\beta}_{s}-\beta_{0}|\leq\mathcal{K}_{s}, bt1b_{t}\leq 1, and pt1+Wp_{t}\leq 1+W, we have

1TstΓs𝕀{dtd(2+W)𝒦s}𝔾^s(d)1TstΓs𝕀{dtd+(2+W)𝒦s}\displaystyle\frac{1}{T_{s}}\sum_{t\in\Gamma_{s}}\mathbb{I}\{d_{t}\leq d-(2+W)\mathcal{K}_{s}\}\leq\hat{\mathbb{G}}_{s}(d)\leq\frac{1}{T_{s}}\sum_{t\in\Gamma_{s}}\mathbb{I}\{d_{t}\leq d+(2+W)\mathcal{K}_{s}\}

Thus, conditioned on |β^sβ0|𝒦s|\hat{\beta}_{s}-\beta_{0}|\leq\mathcal{K}_{s}, for any γ>0\gamma>0, we have

(𝔾^s(d)G(d+(2+W)𝒦s)γ)\displaystyle\mathbb{P}(\hat{\mathbb{G}}_{s}(d)-G(d+(2+W)\mathcal{K}_{s})\leq\gamma)
\displaystyle\geq\; (1TstΓs𝕀{dtd+(2+W)𝒦s}G(d+(2+W)𝒦s)γ)\displaystyle\mathbb{P}(\frac{1}{T_{s}}\sum_{t\in\Gamma_{s}}\mathbb{I}\{d_{t}\leq d+(2+W)\mathcal{K}_{s}\}-G(d+(2+W)\mathcal{K}_{s})\leq\gamma)
\displaystyle\geq\; 1(supd|1TstΓs𝕀{dtd+(2+W)𝒦s}G(d+(2+W)𝒦s)|>γ)\displaystyle 1-\mathbb{P}(\sup_{d}|\frac{1}{T_{s}}\sum_{t\in\Gamma_{s}}\mathbb{I}\{d_{t}\geq d+(2+W)\mathcal{K}_{s}\}-G(d+(2+W)\mathcal{K}_{s})|>\gamma)
\displaystyle\geq\; 12exp(2Tsγ2).\displaystyle 1-2exp(-2T_{s}\gamma^{2}).

Similarly, we have (G(d(2+W)𝒦s)𝔾^s(d)γ)12exp(2Tsγ2)\mathbb{P}\left(G(d-(2+W)\mathcal{K}_{s})-\hat{\mathbb{G}}_{s}(d)\leq\gamma\right)\geq 1-2\exp(-2T_{s}\gamma^{2}) for any γ>0\gamma>0, conditioned on |β^sβ0|𝒦s|\hat{\beta}_{s}-\beta_{0}|\leq\mathcal{K}_{s}.
Therefore, applying a union bound and the Lipschitz continuity of GG, we have

|𝔾^s(d)G(d)|log(8/δ)2Ts+B2(2+W)𝒦s\displaystyle|\hat{\mathbb{G}}_{s}(d)-G(d)|\leq\sqrt{\frac{\log(8/\delta)}{2T_{s}}}+B_{2}(2+W)\mathcal{K}_{s} (32)

holds with probability at least 1δ/21-\delta/2.
Second, we apply a technique similar to that used in Lemma 5.5 to bound |G^s(d;β^s)G(d)||\hat{G}_{s}(d;\hat{\beta}_{s})-G(d)|. Denote d^t=β^s(ptbt)+bt\hat{d}_{t}=\hat{\beta}_{s}(p_{t}-b_{t})+b_{t}, tΓs\forall t\in\Gamma_{s}. Each d^t\hat{d}_{t} corresponds to a sample dtd_{t} with |dtd^t|=|(β0β^s)(ptbt)|(2+W)𝒦s|d_{t}-\hat{d}_{t}|=|(\beta_{0}-\hat{\beta}_{s})(p_{t}-b_{t})|\leq(2+W)\mathcal{K}_{s}. Let rs=12B1Tsr_{s}=\frac{1}{2B_{1}\sqrt{T_{s}}}. Then for any d[0,1+W]d\in[0,1+W], the probability that there exists a point y{d^t}tΓsy\in\{\hat{d}_{t}\}_{t\in\Gamma_{s}} such that |yd|rs+(2+W)𝒦s|y-d|\leq r_{s}+(2+W)\mathcal{K}_{s} is at least

1(12B1rs)Ts=1(11Ts)Ts1eTs1δ/2.\displaystyle 1-(1-2B_{1}\cdot r_{s})^{T_{s}}=1-\left(1-\frac{1}{\sqrt{T_{s}}}\right)^{T_{s}}\geq 1-e^{-\sqrt{T_{s}}}\geq 1-\delta/2.

Therefore, for any d[0,1+W]d\in[0,1+W], we can decompose G^s(d;β^s)G(d)\hat{G}_{s}(d;\hat{\beta}_{s})-G(d) in the following,

|G^s(d;β^s)G(d)|\displaystyle|\hat{G}_{s}(d;\hat{\beta}_{s})-G(d)|
\displaystyle\leq |G^s(d;β^s)G^s(y;β^s)|+|G^s(y;β^s)𝔾^s(y)|+|𝔾^s(y)G(y)|+|G(y)G(d)|.\displaystyle|\hat{G}_{s}(d;\hat{\beta}_{s})-\hat{G}_{s}(y;\hat{\beta}_{s})|+|\hat{G}_{s}(y;\hat{\beta}_{s})-\hat{\mathbb{G}}_{s}(y)|+|\hat{\mathbb{G}}_{s}(y)-G(y)|+|G(y)-G(d)|.

Indeed, the characterization result of Lemma 5.4 also applies to the samples d^t\hat{d}_{t}, so we have |G^s(y;β^s)𝔾^s(y)|1Ts|\hat{G}_{s}(y;\hat{\beta}_{s})-\hat{\mathbb{G}}_{s}(y)|\leq\frac{1}{T_{s}}. By the Lipschitz continuity of G^s(;β^s)\hat{G}_{s}(\cdot;\hat{\beta}_{s}) and GG, Equation 32, and a union bound, we have

|G^s(d;β^s)G(d)|\displaystyle|\hat{G}_{s}(d;\hat{\beta}_{s})-G(d)| 2B2[rs+(2+W)𝒦s]+1Ts+log(8/δ)2Ts+B2(2+W)𝒦s\displaystyle\leq 2B_{2}[r_{s}+(2+W)\mathcal{K}_{s}]+\frac{1}{T_{s}}+\sqrt{\frac{\log(8/\delta)}{2T_{s}}}+B_{2}(2+W)\mathcal{K}_{s}
=3B2(2+W)𝒦s+B2B1Ts+log(8/δ)2Ts+1Ts\displaystyle=3B_{2}(2+W)\mathcal{K}_{s}+\frac{B_{2}}{B_{1}\sqrt{T_{s}}}+\sqrt{\frac{\log(8/\delta)}{2T_{s}}}+\frac{1}{T_{s}}

holds with probability at least 1δ1-\delta when Tslog2(2/δ)T_{s}\gg\log^{2}(2/\delta). ∎
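
Finally, a small numerical illustration of the first step (32) above (all choices illustrative, with G = U[0, 1] so that B2 = 1): the plug-in empirical CDF built from beta_hat (p_t - b_t) + b_t stays uniformly within B2 (2 + W) K_s + sqrt(log(8/delta) / (2 T_s)) of G whenever |beta_hat - beta0| <= K_s.

import numpy as np

rng = np.random.default_rng(4)
alpha0, W, Ts, delta, K = 0.5, 0.5, 20_000, 0.05, 0.02
beta0 = 1 / alpha0
b = rng.uniform(0.0, 0.5, size=Ts)
d = rng.uniform(0.0, 1.0, size=Ts)                 # G = U[0, 1], so B2 = 1
p = alpha0 * d + (1 - alpha0) * b
beta_hat = beta0 + K                               # any estimate with |beta_hat - beta0| <= K
d_hat = np.sort(beta_hat * (p - b) + b)            # samples defining the plug-in empirical CDF

grid = np.linspace(0.0, 1.0 + W, 2001)
emp = np.searchsorted(d_hat, grid, side="right") / Ts
G = np.clip(grid, 0.0, 1.0)
print(np.max(np.abs(emp - G)),
      (2 + W) * K + np.sqrt(np.log(8 / delta) / (2 * Ts)))   # bound in (32) holds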