
No-regret Learning in Repeated First-Price Auctions with Budget Constraints

Rui Ai (equal contribution)
Peking University
[email protected]
   Chang Wang (equal contribution)
Peking University
[email protected]
   Chenchen Li
JD.COM
[email protected]
   Jinshan Zhang
Zhejiang University
[email protected]
   Wenhan Huang
Shanghai Jiaotong University
[email protected]
Xiaotie Deng (corresponding author)
Peking University
[email protected]
Abstract

Recently, the online advertising market has exhibited a gradual shift from second-price auctions to first-price auctions. Although there is a line of work on online bidding strategies in first-price auctions, it remains open how to handle budget constraints in this problem. In the present paper, we initiate the study of online bidding strategies for a budget-constrained buyer in repeated first-price auctions. We propose an RL-based bidding algorithm that competes against the optimal non-anticipating strategy under stationary competition. Our algorithm obtains $\widetilde{O}(\sqrt{T})$ regret if all bids are revealed at the end of each round. Under the restriction that the buyer only observes the winning bid after each round, our modified algorithm obtains $\widetilde{O}(T^{\frac{7}{12}})$ regret via techniques developed from survival analysis. Our analysis extends to the more general scenario where the buyer has any bounded instantaneous utility function, with regret of the same order.

1 Introduction

The market for online advertising has grown extensively in recent years. It is estimated that the volume of online advertising worldwide will reach 500 billion dollars in 2022 (Statista, 2021). In such a market, advertising platforms use auctions to allocate ad opportunities. Typically, the advertiser has a limited amount of capital for an advertising campaign, and thus all rounds of competition are interconnected by the budget. Furthermore, the advertiser has very limited knowledge of 1) her valuation of certain keywords and 2) the competition she is facing. Many works have been devoted to developing algorithms for learning how to optimally spend the budget in repeated second-price auctions (see Section 1.1).

Nevertheless, we have witnessed numerous switches from second-price auctions to first-price auctions in the online advertising market. A recent remarkable example is Google AdSense’s integrated move at the end of 2021 (LLC, 2021). Earlier examples include AppNexus, Index Exchange, and OpenX (Sluis, 2017). This industry-wide shift is driven by various factors, including a fairer transactional process and increased transparency. Moreover, the shift to first-price auctions lends major importance to the following open question, which is barely considered in previous work:

How should budget-constrained advertisers learn to compete in repeated first-price auctions?

This paper thus initiates the study of learning to bid with budget constraints in repeated first-price auctions. We note that the application of first-price auctions with budgets is not limited to the online advertising market mentioned above. Traditional competitive environments such as the mussel trade in the Netherlands (van Schaik et al., 2001), modern price competition, and procurement auctions (e.g., the U.S. Treasury securities auction (Chari and Weber, 1992)) are examples as well.

Challenges and contributions

The challenges in this setting are two-fold.

The first challenge relates to the information structure. In practical auctions, it is often the case that only the highest bid is revealed (Esponda, 2008). This is known as censored feedback or the winner’s curse in the literature (Capen et al., 1971). It affects the information structure of learning, since the buyer learns less about her opponents precisely in the rounds she wins. This makes the problem challenging compared to standard contextual bandits (cf. Section 1.1).

The second challenge is more fundamental. Strategies in first-price auctions are notoriously complex to analyze, even in the static case (Lebrun, 1996). To get an intuitive feeling for this difficulty compared to repeated second-price auctions, consider the offline case where the opponents’ bids are all known. Given the budget, the problem for second-price auctions reduces to a pure knapsack problem, where the budget is regarded as the weight capacity and prices as weights. This structure enables mature techniques, including duality theory, to be applied to study the benchmark strategy. Unfortunately, in first-price auctions the payment depends on the buyer’s own bid, so the previous approach/benchmark is not directly usable. We provide a concrete example to further illustrate this difficulty.

Example 1.1.

Consider a case where the buyer’s value $v$ follows a uniform distribution on $[0.4,1]$ and the highest bid $m$ of the opponents follows a uniform distribution on $[0,0.5]$. The budget $B$ is set to be half of the time horizon, $B=0.5T$. Now, the first-best benchmark (an anticipating strategy; an algorithm is anticipating if bid selection depends on future observations, see Flajolet and Jaillet (2017)) is

$$\operatorname*{\mathbb{E}}_{\substack{\bm{v}\sim G^{T}\\ \bm{m}\sim F^{T}}}\left[\max_{b_{1},\dotsc,b_{T}}\sum_{t=1}^{T}(v_{t}-b_{t})\mathbf{1}_{\{b_{t}\geq m_{t}\}}\right]$$
$$\text{subject to}\quad\sum_{t=1}^{T}\mathbf{1}_{\{b_{t}\geq m_{t}\}}b_{t}\leq B\quad\forall v_{1},\dotsc,v_{T};\,m_{1},\dotsc,m_{T},$$
where $G$ is the value distribution and $F$ the distribution of the highest competing bid (formally defined in Section 2).

In hindsight, we need to pay as little as possible, i.e., exactly $m_t$ upon winning. Using the theory of knapsack (the budget constraint is essentially non-binding here), the utility turns out to be $T\cdot\operatorname*{\mathbb{E}}[(V-M)^{+}]\approx 0.45T$. In contrast, the optimal non-anticipating bidding strategy in a first-price auction is to bid $\frac{V}{2}$, and the utility is $T\cdot\operatorname*{\mathbb{E}}[\mathbf{1}_{\{\frac{V}{2}\geq M\}}\frac{V}{2}]=0.26T$. There is already an $\Omega(T)$ separation between the first-best benchmark and the optimal non-anticipating strategy, even with full knowledge of the distributions.
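These constants are easy to verify numerically. Below is a quick Monte Carlo check (a sketch assuming only the distributions of Example 1.1; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.uniform(0.4, 1.0, 10**6)   # buyer values, V ~ U[0.4, 1]
M = rng.uniform(0.0, 0.5, 10**6)   # highest competing bids, M ~ U[0, 0.5]

# First-best benchmark: in hindsight, pay exactly M whenever V >= M.
print(np.maximum(V - M, 0.0).mean())        # ~0.45

# Optimal non-anticipating strategy: bid V/2, keep V - V/2 upon winning.
print(((V / 2 >= M) * (V / 2)).mean())      # ~0.26
```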

This example shows that the simple characterizations of the optimal strategy used in previous works are not applicable here. Indeed, it remains unclear what methodology can be applied to first-price auctions with budgets (see (Balseiro and Gur, 2019, §2.4) for further discussion).

The present paper takes the first step to combat the challenges mentioned above with a dynamic programming approach. Correspondingly, our contribution is also two-fold:

  • We provide an RL-based learning algorithm. Through a characterization of the optimal strategy, we obtain an $\widetilde{O}(\sqrt{T})$-regret guarantee for the algorithm in the full-feedback case (this case is especially practical in public-sector auctions (Chari and Weber, 1992), where regulations mandate that all bids be revealed).

  • In the censored-feedback setting, using techniques developed from survival analysis, we modify our algorithm and obtain a regret of $\widetilde{O}(T^{\frac{7}{12}})$.

1.1 Related Work

Repeated second-price auctions with budgets

There is a flourishing body of literature on bidding strategies in repeated auctions with budgets. Through the lens of online learning, Balseiro and Gur (2019) identify asymptotically optimal online bidding strategies known as pacing (a.k.a. bid-shading) in repeated second-price auctions with budgets. Inspired by the pacing strategy, Flajolet and Jaillet (2017) develop no-regret non-anticipating algorithms for learning with contextual information in repeated second-price auctions. Another line of work that uses techniques similar to ours includes Amin et al. (2012); Tran-Thanh et al. (2014); Gummadi et al. (2012). Gummadi et al. (2012) and Amin et al. (2012) study bidding strategies in repeated second-price auctions with budget constraints, but the former does not involve any learning and the latter does not provide any regret analysis (their estimator is biased). Tran-Thanh et al. (2014) derive regret bounds for the same scenario, but the optimization objective is the number of items won rather than value or surplus. Baltaoglu et al. (2017) also use dynamic programming to tackle repeated second-price auctions with budgets; however, they assume per-round budget constraints, and their dynamic programming algorithm is for allocating bids among multiple items. Again, we emphasize that no prior work addresses repeated first-price auctions with budgets, since the structure of the problem (compared to second-price variants) is fundamentally different (recall Example 1.1).

Repeated first-price auctions without budgets

Two notable works concerning repeated first-price auctions are Han et al. (2020b) and Han et al. (2020a). Han et al. (2020b) introduce a new problem called monotone group contextual bandits and obtain an $O(\sqrt{T}\ln^{2}T)$-regret algorithm for repeated first-price auctions without budget constraints in stationary settings. Han et al. (2020a) concentrate on an adversarial setting and develop a minimax-optimal online bidding algorithm with $O(\sqrt{T}\ln T)$ regret against all Lipschitz bidding strategies. A crucial difference is that the present paper involves budgets; thus the bandit algorithm of Han et al. (2020b) is not suitable for our needs.

Bandits with knapsacks

From the bandit side, Badanidiyuru et al. (2013) develop bandit algorithms under resource constraints. They show that their algorithm can be used in dynamic procurement, dynamic posted pricing with limited supply, etc. However, since the bidder observes her value before bidding in our problem, the results of Badanidiyuru et al. (2013) cannot be directly applied to our setting. Our setting also relates to contextual bandit problems with resource constraints (Badanidiyuru et al., 2014; Agrawal and Devanur, 2016; Agrawal et al., 2016). Nevertheless, applying this contextual bandit approach requires discretizing the action space, which needs Lipschitz continuity of the distributions; our approach does not rely on any continuity assumption. Further, the resulting performance guarantee (typically $\widetilde{O}(T^{\frac{2}{3}})$) is worse than ours, and the approach does not fit our information structure when the feedback is censored.

2 Preliminaries

Auction mechanism

We consider a repeated first-price auction with budgets. Specifically, we suppose that the buyer has a limited budget $B$ to spend over a time horizon of $T\leq+\infty$ (possibly unknown to her) rounds. At the beginning of each round $t=1,2,\dotsc,T$, the buyer privately observes a value $v_{t}$ for a fresh copy of the item and bids $b_{t}$ according to her past observations $\bm{h}_{t}$ and value $v_{t}$. Denote the strategy she employs by $\pi\colon(v_{t},B_{t},\bm{h}_{t})\to b_{t}$, which maps her current budget $B_{t}$, value $v_{t}$, and past history $\bm{h}_{t}$ to a bid. Let $m_{t}$ be the maximum bid of the other bidders. Since the auction is first-price, if $b_{t}$ is larger than $m_{t}$, the buyer wins the auction, is charged $b_{t}$, and obtains a utility of $r_{t}$; otherwise she loses and $r_{t}=0$. Therefore, the instantaneous utility of the buyer is

$$r_{t}=(v_{t}-b_{t})\mathbf{1}_{\{b_{t}\geq m_{t}\}}.$$

The exact information structure of the history the buyer observes is dictated by how the mechanism reveals $m_{t}$. In full generality, we assume that the feedback is censored, i.e., only the highest bid is revealed at the end of each round, so the winner does not observe $m_{t}$ exactly. This is considered an informational version of the winner’s curse (Capen et al., 1971) and is of practical interest (Esponda, 2008). For the purpose of modeling, we suppose that ties are broken in favor of the buyer; this choice is arbitrary and by no means a limitation of our approach.

Next, we state the assumptions on $m_{t}$ and $v_{t}$. Without loss of generality, we assume that $b_{t},m_{t},v_{t}$ are normalized to lie in $[0,1]$. In the present paper, we consider a stochastic setting where $m_{t}$ and $v_{t}$ are drawn from distributions $F$ and $G$, respectively, unknown to the buyer and independent of the past. We also refer to the cumulative distribution functions of $F,G$ with the same notation. No further assumptions are made on $F,G$. Now, the expected instantaneous utility of the buyer at time $t$ with strategy $\pi$ is

$$R^{\pi}(v_{t},b_{t})=\operatorname*{\mathbb{E}}_{m_{t}\sim F}[r_{t}]=(v_{t}-b_{t})F(b_{t}).$$

To argue that this assumption is reasonable, note that although other buyers may also exhibit learning behavior, a real advertising market typically contains a large number of buyers. A specific buyer faces only a small, changing portion of them and is completely oblivious of whom she is facing in each round. Therefore, the buyer’s sole objective is to maximize her utility (instead of fooling other buyers), and by the law of large numbers, the prices $m_{t}$ and values $v_{t}$ the buyer observes are independent and identically distributed, at least for a period of time.

Buyer’s target and regret

The buyer aims to maximize her long-term accumulated utility subject to the budget constraint. Recall that the instantaneous utility of the buyer is $r_{t}=(v_{t}-b_{t})\mathbf{1}_{\{b_{t}\geq m_{t}\}}$. The payment is $c_{t}=b_{t}\mathbf{1}_{\{b_{t}\geq m_{t}\}}$, and the budget decreases accordingly as payments incur. She can continue to bid as long as her budget has not run out, but must stop at

$$\tau^{*}=\min\left\{T+1,\;\min\left\{t\in\mathbb{N}:\sum_{\tau=1}^{t}c_{\tau}=B\right\}+1\right\}.$$

The buyer’s problem is then to determine how much to bid in each round to maximize her accumulated utility. In line with Gummadi et al. (2012); Golrezaei et al. (2019); Deng and Zhang (2021), the buyer adopts a discount factor $\lambda\in(0,1)$. She discounts since she knows neither $T$ nor $\tau^{*}$. Discount factors can be interpreted as estimates of the probability that the repeated auction will last for at least $t$ rounds (Devanur et al., 2014; Drutsa, 2018). On the economic side, in important real-world markets like online advertising platforms, buyers are impatient for opportunities since companies of different sizes have different capabilities; discount factors model how impatient the buyer is (Drutsa, 2017; Vanunts and Drutsa, 2019). The buyer’s optimization problem is thus to determine a non-anticipating strategy $\pi$ for the following:

$$\max_{\pi}\quad\operatorname*{\mathbb{E}}_{\substack{\bm{v}\sim G^{T}\\ \bm{m}\sim F^{T}}}\left[\sum_{t=1}^{T}\lambda^{t-1}r_{t}\right]$$
$$\text{subject to}\quad\sum_{t=1}^{T}\mathbf{1}_{\{b_{t}\geq m_{t}\}}b_{t}\leq B\quad\forall v_{1},\dotsc,v_{T};\,m_{1},\dotsc,m_{T},$$

where $b_{t}=\pi(v_{t},B_{t},\bm{h}_{t})$. Here, $\bm{v}\coloneqq(v_{1},\dotsc,v_{T})$ denotes the sequence of private values the buyer observes and $\bm{m}\coloneqq(m_{1},\dotsc,m_{T})$ is the sequence of the highest bids of the other bidders. $V^{\pi}(B,t)$ denotes the expected accumulated utility of strategy $\pi$ with budget $B$ starting from time $t$. Let $\pi^{*}$ denote the optimal bidding strategy with knowledge of the underlying distributions $F$ and $G$; the corresponding expected accumulated utility is $V^{\pi^{*}}$.

We now define the regret. Given time $T$, the regret is the difference between the accumulated utility of the optimal non-anticipating bidding strategy $\pi^{*}$ and that of the buyer’s bidding strategy $\pi$:

$$\text{Regret}=\operatorname*{\mathbb{E}}_{\bm{v}\sim G^{T}}\left[\sum_{t=1}^{T}R^{\pi^{*}}(v_{t},b_{t}^{*})-R^{\pi}(v_{t},b_{t})\right].\quad(1)$$

3 Bidding Algorithm and Analysis

In this section, we present our bidding algorithm and the high-level ideas in the analysis of its regret. We first consider the case where the feedback is not censored, i.e., the buyer observes $m_{t}$ regardless of whether she wins. We then extend our algorithm to the case of censored feedback using techniques developed from survival analysis.

3.1 Full Feedback

When $F$ and $G$ are known, the buyer’s problem can be viewed as offline. The technical challenge lies in the observation that even when the distributions are known, the buyer’s problem cannot be directly analyzed as a knapsack problem. To tackle this challenge, we use a dynamic programming approach. In particular, the optimal strategy $\pi^{*}$ satisfies the following Bellman equation:

$$b^{*}(B_{\tau},v)\in\operatorname*{argmax}_{b}\;[(v-b)+\lambda V(B_{\tau}-b,\tau+1)]F(b)+\lambda V(B_{\tau},\tau+1)(1-F(b)),$$
$$V(B_{\tau},\tau)=\operatorname*{\mathbb{E}}_{v}\Big[[(v-b^{*})+\lambda V(B_{\tau}-b^{*},\tau+1)]F(b^{*})+\lambda V(B_{\tau},\tau+1)(1-F(b^{*}))\Big],$$

for all $\tau\in\mathbb{N}$ and $0\leq B_{\tau}\leq B$. Note that $V(B_{\tau},\tau)=-\infty$ for any $B_{\tau}<0$. By choosing appropriate initialization conditions, we can solve the equation recursively and obtain the optimal bidding strategy together with the function $V(\cdot,\cdot)$. The above recursion also characterizes the optimal solution, which will be used in the analysis later.
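To make the recursion concrete, the following is a minimal sketch that solves the Bellman equation by backward induction on discretized budget and bid grids. The grid sizes, the nearest-grid-point budget lookup, and the discrete approximation of $G$ are illustrative assumptions, not part of the paper’s algorithm.

```python
import numpy as np

def solve_bellman(F, G_values, G_probs, B, horizon, lam, grid=100):
    """Backward induction for the Bellman equation above.

    F        : callable CDF of the highest competing bid (accepts arrays).
    G_values : support points of a discrete approximation of G.
    G_probs  : probabilities attached to G_values.
    Returns V[budget_index, tau] with V(., horizon) = 0 as the base case,
    and the greedy bid table bid_of[budget_index, tau, value_index].
    """
    budgets = np.linspace(0.0, B, grid)
    bids = np.linspace(0.0, 1.0, grid)
    Fb = F(bids)                                  # win probability per bid
    V = np.zeros((grid, horizon + 1))
    bid_of = np.zeros((grid, horizon, len(G_values)))
    for tau in range(horizon - 1, -1, -1):        # backward in time
        for i, Bt in enumerate(budgets):
            ok = bids <= Bt                       # V = -inf below zero budget
            j = np.searchsorted(budgets, Bt - bids[ok])   # budget if we win
            win, lose = V[j, tau + 1], V[i, tau + 1]
            for k, (v, p) in enumerate(zip(G_values, G_probs)):
                q = ((v - bids[ok]) + lam * win) * Fb[ok] \
                    + lam * lose * (1.0 - Fb[ok])
                best = int(np.argmax(q))
                bid_of[i, tau, k] = bids[ok][best]
                V[i, tau] += p * q[best]          # E_{v ~ G} of the maximum
    return V, bid_of
```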

When the buyer does not know $F$ and $G$, she can learn the distributions from past observations. It is therefore natural to maintain estimates $\hat{F}$ and $\hat{G}$ of the true distributions. Our algorithm for the full-feedback case is depicted in Algorithm 1. To ease the technical load, we first assume knowledge of $G$ and only estimate $F$ in Algorithm 1. Later, we add the estimation of $G$; its analysis is presented in Theorem 3.6.

Algorithm 1 Algorithm for the full-feedback case
1: Input: initial budget $B$ and constant $C_{1}$ ▷ $C_{1}$ is an arbitrary positive constant
2: Initialize the estimate $\hat{F}$ of $F$ to a uniform distribution over $[0,1]$ and $B_{1}\leftarrow B$
3: for $t=1,2,\dotsc$ do
4:     Observe the value $v_{t}$ in round $t$
5:     Let $t_{0}$ be the smallest integer that satisfies $\lambda^{t_{0}-t}\frac{1}{1-\lambda}<\frac{C_{1}}{\sqrt{t}}$
6:     Set $V_{\hat{F}}(B_{t_{0}},t_{0})=0$ for any $B_{t_{0}}$ ▷ $V_{\hat{F}}$ is the algorithm’s estimate of $V$
7:     for $\tau=t_{0},t_{0}-1,\dotsc,t$ do
8:         $Q_{v,\hat{F}}(B_{\tau},\tau,b)\leftarrow[(v-b)+\lambda V_{\hat{F}}(B_{\tau}-b,\tau+1)]\hat{F}(b)+\lambda V_{\hat{F}}(B_{\tau},\tau+1)(1-\hat{F}(b))$
9:         Solve the optimization problem $\hat{b}_{\tau}^{*}\leftarrow\operatorname*{argmax}_{b}Q_{v,\hat{F}}(B_{\tau},\tau,b)$
10:         $V_{\hat{F}}(B_{\tau},\tau)\leftarrow\operatorname*{\mathbb{E}}_{v\sim G}[Q_{v,\hat{F}}(B_{\tau},\tau,\hat{b}_{\tau}^{*})]$
11:     end for
12:     Place a bid $\hat{b}_{t}\leftarrow\operatorname*{argmax}_{b}Q_{v,\hat{F}}(B_{t},t,b)$
13:     Observe $m_{t},c_{t}$ and $r_{t}$ from this round of the auction and update $\hat{F}(x)=\frac{1}{t}\sum_{i=1}^{t}\mathbf{1}_{\{m_{i}\leq x\}}$
14:     $B_{t+1}\leftarrow B_{t}-c_{t}$. If $B_{t+1}\leq 0$ then halt
15: end for

Similar to prior work (Amin et al., 2012), Algorithm 1 performs exploration and exploitation simultaneously. The buyer initializes the estimate of $F$ to a uniform distribution (step 2). At round $t$, the buyer observes her valuation. Then, she uses her estimate of $F$ to solve the dynamic programming problem recursively (for the non-trivial case $B\leq T$, this can be solved in $O\big(\frac{T^{4.5}}{(1-\lambda)^{6}}\big)$ time with only $O(T^{-\frac{1}{2}})$ loss; see, e.g., Chow and Tsitsiklis, 1989) to obtain an estimate of the optimal bid (steps 7–11). To provide a base case for the recursion, note that for sufficiently large $t_{0}\gg t$, the impact of $V_{\hat{F}}(\cdot,t_{0})$ on $V_{\hat{F}}(\cdot,t)$ is negligible due to the discount $\lambda^{t_{0}-t}$; therefore, the buyer can approximate $V_{\hat{F}}(\cdot,t_{0})$ by zero (steps 5–6). Finally, the auction proceeds with $m_{t},r_{t},c_{t}$ revealed, and the buyer updates her information accordingly (steps 13–14).
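For concreteness, here is a minimal sketch of this online loop, reusing the hypothetical solve_bellman helper above; sample_v and sample_m are stand-ins for the environment, and all names are illustrative rather than the paper’s implementation.

```python
import numpy as np

def run_algorithm1(sample_v, sample_m, G_values, G_probs, B, T, lam, C1=1.0):
    past_m, Bt, utility = [], float(B), 0.0
    for t in range(1, T + 1):
        vt = sample_v()                                    # step 4
        if past_m:                                         # step 13's ECDF
            sm = np.sort(past_m)
            F_hat = lambda b: np.searchsorted(sm, b, side="right") / len(sm)
        else:
            F_hat = lambda b: np.clip(b, 0.0, 1.0)         # uniform prior
        # step 5: smallest t0 with lam^(t0 - t) / (1 - lam) < C1 / sqrt(t)
        horizon = max(int(np.ceil(np.log(C1 * (1 - lam) / np.sqrt(t))
                                  / np.log(lam))), 1)
        # steps 6-11: backward recursion with V(., t0) = 0
        V, _ = solve_bellman(F_hat, G_values, G_probs, Bt, horizon, lam)
        # step 12: greedy bid from the one-step Q-values at (B_t, t)
        grid = V.shape[0]
        budgets = np.linspace(0.0, Bt, grid)
        bids = np.linspace(0.0, 1.0, grid)
        ok = bids <= Bt
        j = np.searchsorted(budgets, Bt - bids[ok])
        q = ((vt - bids[ok]) + lam * V[j, 1]) * F_hat(bids[ok]) \
            + lam * V[-1, 1] * (1.0 - F_hat(bids[ok]))
        bt = bids[ok][np.argmax(q)]
        mt = sample_m()
        if bt >= mt:                                       # win: pay own bid
            utility, Bt = utility + (vt - bt), Bt - bt
        past_m.append(mt)                                  # full feedback
        if Bt <= 0:                                        # step 14
            break
    return utility
```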

Analysis of regret

We establish the regret bound in two steps. First, we show that the buyer’s estimate $V_{\hat{F}}$ is approximately accurate given a sufficient number of samples. This relies on the estimation of $F$, so concentration inequalities are intrinsic to our analysis. Second, we bridge the regret defined in Equation 1 with $V$ and $V_{\hat{F}}$; this is done through a series of auxiliary quantities measuring the regret. Combining the two steps yields the desired result.

We first present the following lemma (Dvoretzky et al., 1956; Massart, 1990), which states a uniform convergence result for the estimation of cumulative distribution functions.

Lemma 3.1 (Dvoretzky–Kiefer–Wolfowitz).

Given $t\in\mathbb{N}$, let $m_{1},m_{2},\dotsc,m_{t}$ be real-valued independent and identically distributed random variables with cumulative distribution function $F$. Let $\hat{F}_{t}$ denote the associated empirical distribution function defined by $\hat{F}_{t}(x)=\frac{1}{t}\sum_{i=1}^{t}\mathbf{1}_{\{m_{i}\leq x\}}$ for $x\in\mathbb{R}$. Then with probability $1-\delta$,

$$\sup_{x}|\hat{F}_{t}(x)-F(x)|\leq\sqrt{\frac{1}{2}\ln\frac{2}{\delta}}\,t^{-\frac{1}{2}}.$$
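As an illustration, the following sketch computes the empirical CDF and the corresponding DKW band width; the uniform opponent distribution in the usage example is a hypothetical stand-in.

```python
import numpy as np

def dkw_band(samples, delta):
    """Empirical CDF of the samples plus the DKW radius of Lemma 3.1:
    with probability 1 - delta, sup_x |F_hat(x) - F(x)| <= eps."""
    t = len(samples)
    eps = np.sqrt(0.5 * np.log(2.0 / delta) / t)
    xs = np.sort(samples)
    return (lambda x: np.searchsorted(xs, x, side="right") / t), eps

rng = np.random.default_rng(0)
m = rng.uniform(0.0, 0.5, size=1000)    # hypothetical opponents, m ~ U[0, 0.5]
F_hat, eps = dkw_band(m, delta=0.05)
print(F_hat(0.25), "+/-", eps)          # near the true F(0.25) = 0.5
```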

With Lemma 3.1 in hand, given any $T$, we bound the difference between $V(B_{t},t)$ and $V_{\hat{F}}(B_{t},t)$ for any $1\leq t\leq T$ (recall that $V(B_{t},t)$ is the accumulated utility of the optimal strategy with knowledge of $F$). We prove this by induction. The induction basis is a little tricky due to the base case of the recursion in Algorithm 1; therefore, in the following lemma, we first deal with the induction step.

Lemma 3.2.

For any round $t\leq T$ and budget $B_{t}$, with probability at least $1-\frac{\delta}{T}$, we have the following bound relating the estimate $V_{\hat{F}}$ and the ground truth under $F$, provided $\sup_{B_{t_{0}}}|V(B_{t_{0}},t_{0})-V_{\hat{F}}(B_{t_{0}},t_{0})|\leq\sqrt{\frac{1}{2}\ln\frac{2T}{\delta}}\frac{1+\lambda}{(1-\lambda)^{2}}t^{-\frac{1}{2}}$:

$$|V(B_{t},t)-V_{\hat{F}}(B_{t},t)|\leq Ct^{-\frac{1}{2}}=\widetilde{O}\left(\frac{1}{\sqrt{t}}\right)\quad\text{where}\quad C=\sqrt{\frac{1}{2}\ln\frac{2T}{\delta}}\frac{1+\lambda}{(1-\lambda)^{2}}.$$

Note that the lemma above assumes that $\sup_{B_{t_{0}}}|V(B_{t_{0}},t_{0})-V_{\hat{F}}(B_{t_{0}},t_{0})|$ is bounded above by $\sqrt{\frac{1}{2}\ln\frac{2T}{\delta}}\frac{1+\lambda}{(1-\lambda)^{2}}t^{-\frac{1}{2}}$, which holds if the base case $V_{\hat{F}}(B_{t_{0}},t_{0})$ is set accurately (i.e., $V(B_{t_{0}},t_{0})=V_{\hat{F}}(B_{t_{0}},t_{0})$). We use $\widetilde{V}_{\hat{F}}(B_{t},t)$ to denote the estimated value function using $\hat{F}$ when the base case is indeed accurate. In the following lemma, we show that under the initialization specified in steps 5–6 of Algorithm 1, $\sup_{B_{t}}|V_{\hat{F}}(B_{t},t)-\widetilde{V}_{\hat{F}}(B_{t},t)|$ indeed satisfies the condition of Lemma 3.2.

Lemma 3.3.

Suppose $\widetilde{V}_{\hat{F}}(B_{t_{0}},t_{0})=V(B_{t_{0}},t_{0})$ and $\widetilde{V}_{\hat{F}}(B_{t},t)$ is then computed by the recursive procedure in Algorithm 1. Then for any $\tau\leq t_{0}$ and $B_{\tau}$:

$$|V_{\hat{F}}(B_{\tau},\tau)-\widetilde{V}_{\hat{F}}(B_{\tau},\tau)|\leq\frac{1}{1-\lambda}\lambda^{t_{0}-\tau}.$$

In particular, when $\tau=t$, we have $\sup_{B_{t}}|V_{\hat{F}}(B_{t},t)-\widetilde{V}_{\hat{F}}(B_{t},t)|\leq\frac{C_{1}}{\sqrt{t}}$ (by the construction of $t_{0}$).

Synthesizing Lemma 3.2 and Lemma 3.3, we have $|V(B_{t},t)-V_{\hat{F}}(B_{t},t)|\leq(C+C_{1})t^{-\frac{1}{2}}$. A crucial next step is to relate this bound to the final regret. This is achieved by two transformations. Roughly speaking, the buyer’s “regret” can be viewed in two parts: 1) she does not bid according to the optimal strategy; 2) her strategy does not spend the budget optimally, which leads to future losses. The two transformations follow this intuitive observation and are summarized in the following lemma bounding the performance of the buyer’s strategy. Below we first condition on the good event that the estimation succeeds for every $t$; the contribution of the bad event is added to the regret in Theorem 3.5.

Lemma 3.4.

For any given $B_{t}$ and $t$, denote $V^{\pi}(B_{t},t)=\operatorname*{\mathbb{E}}_{\bm{v}\sim G^{T}}\left[\sum_{\tau=t}^{T}\lambda^{\tau-t}R^{\pi}(v_{\tau},b_{\tau})\right]$; then

$$|V(B_{t},t)-V^{\pi}(B_{t},t)|\leq\frac{4\left(\sqrt{\frac{1}{2}\ln\frac{2T}{\delta}}\frac{1+\lambda}{(1-\lambda)^{2}}+C_{1}\right)}{(1-\lambda)\sqrt{t}}.$$

By Lemma 3.4 and further transformations, we can now establish the regret bound of Algorithm 1.

Theorem 3.5.

Under the circumstance that $F$ is unknown, the worst-case regret of Algorithm 1 is $\widetilde{O}(\sqrt{T})$, where the regret is computed according to Equation 1. Explicitly, if we take $C_{1}=1$,

$$\text{Regret}\leq\left(4\sqrt{\frac{1}{2}\ln(2T^{2})}\frac{1+\lambda}{(1-\lambda)^{2}}+5-\lambda\right)\sqrt{T}+\frac{\lambda}{2}\log_{\frac{1}{\lambda}}\frac{T}{(1-\lambda)^{2}}+1.$$

Note that the regret bound is meaningful only if $B=\omega(T^{\frac{1}{2}})$. Let us now take the budget $B$ to scale linearly with $T$, as in Balseiro and Gur (2019); Flajolet and Jaillet (2017). Specifically, assume that $T<+\infty$ and that there exists a constant $\beta$ such that $B=\beta T$; then the regret is again $\widetilde{O}(\sqrt{T})$ in this special case. Indeed, under this condition, we can simply set $t_{0}=T+1$ and $V_{\hat{F}}(B_{T+1},T+1)=0$ for any $B_{T+1}$ in Algorithm 1. Therefore, $C_{1}=0$ and the worst-case regret is bounded by $\left(4\sqrt{\frac{1}{2}\ln(2T^{2})}\frac{1+\lambda}{(1-\lambda)^{2}}+1-\lambda\right)\sqrt{T}+\frac{\lambda}{2}\log_{\frac{1}{\lambda}}\frac{T}{(1-\lambda)^{2}}+1$.

Next we deal with the case where $G$ is also initially unknown. Based on Algorithm 1, we additionally maintain an estimate $\hat{G}$ of $G$ from past observations of the valuations. $\hat{G}$ is initialized to a uniform distribution and is used to solve the dynamic programming problem (see step 7 of Algorithm 2). Using similar techniques as before (with more work), we obtain the following theorem.

Theorem 3.6.

Under the circumstance that $F,G$ are both unknown, the worst-case regret of Algorithm 1 using empirical distribution functions to estimate $F$ and $G$ is $\widetilde{O}(\sqrt{T})$. Explicitly, if we take $C_{1}=1$,

$$\text{Regret}\leq\left(\sqrt{\frac{1}{2}\ln(4T^{2})}\frac{6(1+\lambda)}{(1-\lambda)^{2}}+5-\lambda\right)\sqrt{T}+\frac{\lambda}{2}\log_{\frac{1}{\lambda}}\frac{T}{(1-\lambda)^{2}}+1.$$

3.2 Censored Feedback

In this subsection, we deal with the case where the buyer only sees the winner’s bid after each round. In other words, the feedback is left-censored. Concretely, the buyer’s observation is

$$o_{t}=\max\{b_{t},m_{t}\}.$$

When she wins, the exact value of $m_{t}$ is not revealed; the buyer only knows that $m_{t}$ is smaller than her bid in the current round. For estimating the distribution of $m_{t}$ in this scenario, there is a classical statistical tool, the Kaplan–Meier (KM) estimator (Kaplan and Meier, 1958). However, the KM estimator assumes the sequence $(m_{t})_{t=1}^{T}$ is deterministic, which does not fit our needs. Although Suzukawa (2004)’s extension allows random censorship, it requires independence between $b_{t}$ and $m_{t}$, which is unrealistic here since we use the estimated distribution to place bids.

To tackle this problem, we introduce an estimator proposed by Zeng (2004), denoted by $\hat{F}_{n}$, to replace the empirical distribution function used in Algorithm 1.

Estimation procedure

We now present the procedure for estimating $F$ under censored feedback. It suffices to estimate the distribution function of $1-m_{t}$, which is right-censored by $1-b_{t}$. Let $y_{t}=\min\{1-m_{t},1-b_{t}\}$ and $r_{t}=\mathbf{1}_{\{m_{t}\geq b_{t}\}}$ (overloading $r_{t}$ as the non-censoring indicator in this subsection). The observations can now be described as $(y_{t},r_{t},\bm{h}_{t})_{t=1}^{T}$.
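Both quantities are computable from the censored feedback alone, since $y_{t}=1-\max\{b_{t},m_{t}\}=1-o_{t}$; a small sketch (with illustrative names, ignoring measure-zero ties):

```python
import numpy as np

def to_survival_format(b, o):
    """Map per-round bids b_t and observations o_t = max(b_t, m_t) to the
    survival-analysis data (y_t, r_t) used below."""
    b, o = np.asarray(b), np.asarray(o)
    y = 1.0 - o          # y_t = min(1 - m_t, 1 - b_t) = 1 - max(b_t, m_t)
    r = o > b            # the buyer lost, so m_t is observed exactly
    return y, r
```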

Roughly speaking, to decouple the dependence between $m_{t}$ and $b_{t}$, we use the fact that $b_{t}$ and $m_{t}$ are independent conditional on $\bm{h}_{t}$; intuitively, the history $\bm{h}_{t}$ provides enough information to extract effective samples of $m_{t}$. Next we fit models for the hazard rate functions of $1-m_{t}$ and $1-b_{t}$ using $\bm{h}_{t}$ (the hazard rate function of a random variable $X$ with p.d.f. $f$ and c.d.f. $F$ is $H_{X}(x)=\frac{f(x)}{1-F(x)}$). With the hazard rate functions, we use the maximum likelihood method with a kernel to compute the final estimate $\hat{F}_{t}$, given in Equation 3.

Details follow. We initialize with a bandwidth sequence $(a_{t})_{t=1}^{T}$ such that $\frac{\log^{2}a_{t}}{ta_{t}^{2}}\to 0$, $ta_{t}^{2}\to\infty$, $ta_{t}^{4}\to 0$ as $t\to+\infty$, and a symmetric kernel function $K(\cdot,\cdot)\in C^{2}(\mathbb{R}^{2})$ with bounded gradient. At each time $t$, we compute two vectors $\bm{\beta}_{t},\bm{\gamma}_{t}$ that maximize, respectively,

$$\mathcal{L}_{1}(\bm{\beta})=\frac{1}{t}\sum_{\tau=1}^{t}r_{\tau}\left[\bm{\beta}^{\top}\bm{h}_{\tau}-\log\left(\sum_{y_{i}\geq y_{\tau}}e^{\bm{\beta}^{\top}\bm{h}_{i}}\right)\right],\qquad\mathcal{L}_{2}(\bm{\gamma})=\frac{1}{t}\sum_{\tau=1}^{t}(1-r_{\tau})\left[\bm{\gamma}^{\top}\bm{h}_{\tau}-\log\left(\sum_{y_{i}\geq y_{\tau}}e^{\bm{\gamma}^{\top}\bm{h}_{i}}\right)\right].\quad(2)$$
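These objectives are Cox partial log-likelihoods (see the consistency discussion below). A minimal sketch of maximizing $\mathcal{L}_{1}$, assuming the histories have already been padded into a feature matrix H; the names and the BFGS choice are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def cox_partial_loglik(beta, H, y, r):
    """L_1(beta) from Equation 2. H: (t, d) padded histories h_tau;
    y, r: censored observations and non-censoring indicators."""
    scores = H @ beta
    total = 0.0
    for tau in range(len(y)):
        if r[tau]:
            risk = y >= y[tau]                  # risk set {i : y_i >= y_tau}
            total += scores[tau] - np.log(np.exp(scores[risk]).sum())
    return total / len(y)

# beta_t maximizes L_1; L_2 is identical with r replaced by 1 - r:
# beta_t = minimize(lambda b: -cox_partial_loglik(b, H, y, r),
#                   np.zeros(H.shape[1]), method="BFGS").x
```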

We arbitrarily pad $\bm{h}_{1},\dotsc,\bm{h}_{t}$ with zeros to make their lengths equal (we will show shortly that this is without loss of generality). Compute $\bm{Z}_{t}=(\bm{\beta}_{t}^{\top}\bm{h}_{t},\bm{\gamma}_{t}^{\top}\bm{h}_{t})^{\top}$. The survival function of $1-m_{t}$, or equivalently the cumulative distribution function of $m_{t}$, is then estimated by Zeng (2004)’s estimator

$$\hat{F}_{t}(x)=\frac{1}{t}\sum_{i=1}^{t}\prod_{j=1}^{t}\left(1-\frac{K((\bm{Z}_{i}-\bm{Z}_{j})/a_{t})\,\mathbf{1}_{\{y_{j}\leq x\}}\,r_{j}}{\sum_{m=1}^{t}K((\bm{Z}_{i}-\bm{Z}_{m})/a_{t})\,\mathbf{1}_{\{y_{j}\leq y_{m}\}}}\right).\quad(3)$$
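A direct transcription of Equation 3, assuming a product Gaussian kernel for $K$ (the paper only requires a symmetric $C^{2}$ kernel, so this choice and all names are illustrative):

```python
import numpy as np

def zeng_estimate(x, y, r, Z, a_t):
    """Evaluate the estimator of Equation 3 at x.

    y : (t,) observations y_tau;  r : (t,) non-censoring indicators;
    Z : (t, 2) covariates (beta' h_tau, gamma' h_tau);  a_t : bandwidth.
    """
    t = len(y)
    diff = (Z[:, None, :] - Z[None, :, :]) / a_t
    K = np.exp(-0.5 * (diff ** 2).sum(axis=-1))    # K((Z_i - Z_j)/a_t)
    total = 0.0
    for i in range(t):
        # kernel-weighted risk-set mass for each j: sum_m K * 1{y_j <= y_m}
        denom = np.array([(K[i] * (y >= y[j])).sum() for j in range(t)])
        total += np.prod(1.0 - K[i] * (y <= x) * r / denom)
    return total / t
```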

Now we are ready to apply the estimator; the algorithm for the censored-feedback case is depicted in Algorithm 2. Note that the new estimator converges more slowly than the empirical distribution used in the full-feedback case. Therefore, compared to Algorithm 1, Algorithm 2 is a multi-phase algorithm: it only updates the estimate $\hat{F}$ at the end of each phase (see Figure 1 for an illustration). The other elements of each phase in Algorithm 2 are similar to Algorithm 1.

Algorithm 2 Algorithm for the censored-feedback case
1: Input: initial budget $B$ and constant $C_{1}$ ▷ $C_{1}$ is an arbitrary positive constant
2: Initialize the estimates $\hat{F}$ of $F$ and $\hat{G}$ of $G$ to uniform distributions over $[0,1]$
3: $B_{1}\leftarrow B$
4: for phase $i=1,2,\dotsc$ do ▷ Phase $i\,(i>1)$ lasts for $2^{i}$ rounds; phase 1 lasts for 2 rounds
5:     for each $t$ in the time interval of phase $i$ do
6:         Observe the value $v_{t}$ in round $t$
7:         Update $\hat{G}(x)=\frac{1}{t}\sum_{i=1}^{t}\mathbf{1}_{\{v_{i}\leq x\}}$
8:         Let $t_{0}$ be the smallest integer that satisfies $\lambda^{t_{0}-t}\frac{1}{1-\lambda}<\frac{C_{1}}{\sqrt{t}}$
9:         Set $V_{\hat{F},\hat{G}}(B_{t_{0}},t_{0})=0$ for any $B_{t_{0}}$ ▷ $V_{\hat{F},\hat{G}}$ is the algorithm’s estimate of $V$
10:         for $\tau=t_{0},t_{0}-1,\dotsc,t$ do ▷ This loop can be moved to the end of each phase to reduce the number of invocations from $T$ to $O(\ln T)$
11:              $Q_{\hat{F},\hat{G}}(B_{\tau},\tau,b)\leftarrow[(v-b)+\lambda V_{\hat{F},\hat{G}}(B_{\tau}-b,\tau+1)]\hat{F}(b)+\lambda V_{\hat{F},\hat{G}}(B_{\tau},\tau+1)(1-\hat{F}(b))$
12:              Solve the optimization problem $\hat{b}_{\tau}^{*}\leftarrow\operatorname*{argmax}_{b}Q_{\hat{F},\hat{G}}(B_{\tau},\tau,b)$
13:              $V_{\hat{F},\hat{G}}(B_{\tau},\tau)\leftarrow\operatorname*{\mathbb{E}}_{v\sim\hat{G}}[Q_{\hat{F},\hat{G}}(B_{\tau},\tau,\hat{b}_{\tau}^{*})]$
14:         end for
15:         Place a bid $\hat{b}_{t}\leftarrow\operatorname*{argmax}_{b}Q_{\hat{F},\hat{G}}(B_{t},t,b)$
16:         Observe $o_{t},c_{t}$ and $r_{t}$ from this round of the auction
17:         $B_{t+1}\leftarrow B_{t}-c_{t}$. If $B_{t+1}\leq 0$ then halt
18:     end for
19:     Update $\hat{F}$ using the estimator specified in Equation 3 with the data observed before this phase
20: end for
Figure 1: Schematic representation of the phases in Algorithm 2 (phase 1 through phase $k$, with $\hat{F}$ re-estimated via Equation 3 at each phase boundary). Algorithm 2 updates its estimate of $F$ at the end of each phase.

Analysis of regret

To analyze the performance of Algorithm 2, we will prove a series of lemmas on the estimation error of Equation 3. In particular, our proof relies on the following convergence result.

Lemma 3.7 (Zeng).

Let $\hat{F}_{n}$ be the estimate of $F$ computed from $n$ observations. We have

$$\sqrt{n}(\hat{F}_{n}(1-x)-F(1-x))\Longrightarrow\mathcal{B}(x)\quad\text{in}\quad\ell^{\infty}([0,1]),$$

where $\mathcal{B}(x)$ is a Gaussian process.

Before we proceed to apply the lemma, we verify a series of prerequisites from Zeng (2004) to make sure it holds. First, we check that conditional on $\bm{h}_{t}$, the random variables $1-m_{t}$ and $1-b_{t}$ are independent. Indeed, $b_{t}$ is completely determined by $\bm{h}_{t}$, and $m_{t}$ is independent of $\bm{h}_{t}$. Second, we note that the maximizers in Equation 2 are essentially performing Cox’s proportional hazards regression. To establish consistency of the estimator, we show that at least one of $\widetilde{m}_{t}\coloneqq 1-m_{t}$ and $1-b_{t}$ follows Cox’s proportional hazards model; that is, there exist $\bm{\beta}$ and a function $f(y)$ such that the hazard rate function of $\widetilde{m}_{t}$ or $1-b_{t}$ conditional on $\bm{h}_{t}$ exactly follows

$$H(y\mid\bm{h}_{t})=f(y)e^{\bm{\beta}^{\top}\bm{h}_{t}}.\quad(4)$$

Equation 4 holds for $\widetilde{m}_{t}$: taking $\bm{\beta}=\bm{0}$ and $f(y)=\frac{F^{\prime}(1-y)}{F(1-y)}$ suffices (note that the survival function of $\widetilde{m}_{t}$ at $y$ is $\Pr(m_{t}<1-y)=F(1-y)$ for continuous $F$). Since we take $\bm{\beta}=\bm{0}$, consistency holds regardless of how we pad $\bm{h}_{t}$.

Next, consider a phase at time $2^{n}\leq t\leq 2^{n+1}-1$. The estimate $\hat{F}_{n}$ is computed using the first $2^{n}$ observed data points. Applying techniques similar to those for the rate of convergence of empirical processes (Norvaiša and Paulauskas, 1991), we obtain

Lemma 3.8.

Under Algorithm 2, let $\hat{F}_{n}$ be the empirical process of updates and $\mathcal{B}$ a general Gaussian process, respectively, indexed by a class $\mathscr{F}$ of real measurable functions. We have

$$\left|\Pr(\{\|\hat{F}_{n}\|_{\mathscr{F}}\geq r\})-\Pr(\{\|\mathcal{B}\|_{\mathscr{F}}\geq r\})\right|\leq C_{2}(1+r)^{-3}\ln^{2}t\cdot t^{-\frac{1}{6}},$$

where $C_{2}$ is a constant depending only on $\hat{F}_{n}$, and $t$ is the size of the data used to update the estimate.

Synthesizing the convergence results of Lemma 3.7 and Lemma 3.8, we establish the following lemma on the performance of $\hat{F}$ in Algorithm 2.

Lemma 3.9.

Under the update process of Algorithm 2, for any $2^{n}\leq t\leq 2^{n+1}-1$, we have the following bound for the estimate $\hat{F}_{n}$:

$$\left|\Pr\left(\sup_{b}|\sqrt{2^{n}}(\hat{F}_{n}(1-b)-F(1-b))|\geq r\right)-\Pr\left(\sup_{b}|\mathcal{B}(1-b)|\geq r\right)\right|\leq M(1+r)^{-3}\ln^{2}t\cdot t^{-\frac{1}{6}},$$

where $M$ is a constant depending only on $F$ and Algorithm 2.

Finally, we bound the difference between $\hat{F}_{n}$ and $F$ with the help of Lemma 3.9.

Lemma 3.10.

Recall that we use the first $2^{n}$ data points to compute $\hat{F}$. Under the update procedure of Algorithm 2, for any $2^{n}\leq t\leq 2^{n+1}-1$, with probability at least $1-T^{-\frac{5}{12}}/(2\ln T)$,

$$\sup_{x}|\hat{F}(x)-F(x)|\leq\sqrt{2}C_{5}(4M\ln^{3}T)^{\frac{1}{3}}t^{-\frac{5}{9}}T^{\frac{5}{36}}.$$

With Lemma 3.10 in hand and employing a methodology similar to that in the proof of Theorem 3.5, we now have

Theorem 3.11.

Under the circumstance that $F,G$ are both unknown and the feedback is censored, the worst-case regret of Algorithm 2 is $\widetilde{O}(T^{\frac{7}{12}})$. Explicitly, if we take $C_{1}=1$,

$$\text{Regret}\leq\left(\frac{9\sqrt{2}(1+\lambda)}{2(1-\lambda)^{2}}C_{5}(4M\ln^{3}T)^{\frac{1}{3}}+1\right)T^{\frac{7}{12}}+\left(\sqrt{\frac{1}{2}\ln(4T^{\frac{17}{12}})}\frac{2(1+\lambda)}{(1-\lambda)^{2}}+5-\lambda\right)\sqrt{T}+\frac{\lambda}{2}\log_{\frac{1}{\lambda}}\frac{T}{(1-\lambda)^{2}}.$$
Remark 3.12.

The setting of Han et al. (2020b) is a special case of ours: there are no budget constraints and $\lambda=0$ (which removes the $\frac{1}{1-\lambda}$ factors from our results), and the buyer aims to maximize $(v-b)F(b)$ in each round. This is equivalent to the case $V_{\hat{F}}=0$ in our setting, with no need to estimate $G$, yielding $\widetilde{O}(\sqrt{T})$ regret in the full-feedback case and $\widetilde{O}(T^{\frac{7}{12}})$ regret in the censored-feedback case.

4 Discussion and Conclusion

In this paper, we develop a learning algorithm that adaptively bids in repeated first-price auctions with budgets. On the theoretical side, our algorithm, together with its analysis of $\widetilde{O}(\sqrt{T})$ regret in the full-feedback case and $\widetilde{O}(T^{\frac{7}{12}})$ regret in the censored-feedback case, takes the first step toward understanding the problem. On the practical side, our algorithm is simple and readily applicable to a digital world that has shifted to first-price auctions.

Several questions present themselves for future exploration. We observe that in the limiting case $\lambda\to 1$, the optimal bidding strategy in Algorithm 2 is similar to a pacing strategy, which partially answers the open question raised in Balseiro and Gur (2019): “A question of theoretical and practical interest is how to extend the adaptive pacing approach that is suggested in this paper to nontruthful mechanisms such as first-price auctions.” In the limiting case $\lambda\to 1$, the optimal bid of Algorithm 2 takes the form $\frac{v_{t}}{1+x_{t}}$, where $x_{t}$ is a pacing multiplier that depends only on $B_{t}$ and $F$; further, $x_{t}$ can be computed without explicitly solving the dynamic programming problem. This observation can be viewed as a direct corollary of (Gummadi et al., 2012, Theorem 3.1). The connection between Algorithm 2 and pacing suggests extending current adaptive pacing techniques to other auction forms (we note that, in parallel to our work, Gaitonde et al. (2022) extend pacing techniques to a class of auction forms called core auctions, obtaining $\widetilde{O}(T^{\frac{3}{4}})$-regret bounds against the best linear policy under a value-maximization objective). Other immediate open questions include establishing lower bounds on the regret for our problem.
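To illustrate the pacing form (not the closed-form multiplier of Gummadi et al. (2012, Theorem 3.1)), here is a minimal sketch of a pacing-style bidder with a gradient-style multiplier update in the spirit of Balseiro and Gur (2019); the update rule, step size, and names are assumptions for illustration:

```python
import numpy as np

def paced_bids(values, prices, B, T, eps=0.01, x_max=10.0):
    """Bid v_t / (1 + x_t), nudging x_t so spending tracks rho = B / T."""
    rho = B / T                     # target per-round spend rate
    x, spend, utility = 0.0, 0.0, 0.0
    for v, m in zip(values, prices):
        b = v / (1.0 + x)           # the pacing form from the lambda -> 1 limit
        if b >= m and spend + b <= B:
            utility += v - b
            spend += b
            z = b                   # realized expenditure this round
        else:
            z = 0.0
        # overspending (z > rho) raises x and shades bids down, and vice versa
        x = float(np.clip(x - eps * (rho - z), 0.0, x_max))
    return utility
```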

References

  • Agrawal and Devanur [2016] Shipra Agrawal and Nikhil Devanur. Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems, 29, 2016.
  • Agrawal et al. [2016] Shipra Agrawal, Nikhil R. Devanur, and Lihong Li. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. In Conference on Learning Theory, pages 4–18. PMLR, 2016.
  • Amin et al. [2012] Kareem Amin, Michael Kearns, Peter Key, and Anton Schwaighofer. Budget optimization for sponsored search: Censored learning in MDPs. arXiv preprint arXiv:1210.4847, 2012.
  • Badanidiyuru et al. [2013] Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 207–216. IEEE, 2013.
  • Badanidiyuru et al. [2014] Ashwinkumar Badanidiyuru, John Langford, and Aleksandrs Slivkins. Resourceful contextual bandits. In Conference on Learning Theory, pages 1109–1134. PMLR, 2014.
  • Balseiro and Gur [2019] Santiago R. Balseiro and Yonatan Gur. Learning in repeated auctions with budgets: Regret minimization and equilibrium. Management Science, 65(9):3952–3968, 2019.
  • Baltaoglu et al. [2017] M. Sevi Baltaoglu, Lang Tong, and Qing Zhao. Online learning of optimal bidding strategy in repeated multi-commodity auctions. Advances in Neural Information Processing Systems, 30, 2017.
  • Capen et al. [1971] Edward C. Capen, Robert V. Clapp, and William M. Campbell. Competitive bidding in high-risk situations. Journal of Petroleum Technology, 23(06):641–653, 1971.
  • Chari and Weber [1992] V. Chari and Robert Weber. How the US Treasury should auction its debt. Quarterly Review, 16(4), 1992.
  • Chow and Tsitsiklis [1989] Chee-Seng Chow and John N. Tsitsiklis. The complexity of dynamic programming. Journal of Complexity, 5(4):466–488, 1989.
  • Deng and Zhang [2021] Yuan Deng and Hanrui Zhang. Prior-independent dynamic auctions for a value-maximizing buyer. Advances in Neural Information Processing Systems, 34, 2021.
  • Devanur et al. [2014] Nikhil R. Devanur, Yuval Peres, and Balasubramanian Sivan. Perfect Bayesian equilibria in repeated sales. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 983–1002. SIAM, 2014.
  • Drutsa [2017] Alexey Drutsa. Horizon-independent optimal pricing in repeated auctions with truthful and strategic buyers. In Proceedings of the 26th International Conference on World Wide Web, pages 33–42, 2017.
  • Drutsa [2018] Alexey Drutsa. Weakly consistent optimal pricing algorithms in repeated posted-price auctions with strategic buyer. In International Conference on Machine Learning, pages 1319–1328. PMLR, 2018.
  • Dvoretzky et al. [1956] Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642–669, 1956.
  • Esponda [2008] Ignacio Esponda. Information feedback in first price auctions. The RAND Journal of Economics, 39(2):491–508, 2008.
  • Flajolet and Jaillet [2017] Arthur Flajolet and Patrick Jaillet. Real-time bidding with side information. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, 2017.
  • Gaitonde et al. [2022] Jason Gaitonde, Yingkai Li, Bar Light, Brendan Lucier, and Aleksandrs Slivkins. Budget Pacing in Repeated Auctions: Regret and Efficiency without Convergence. arXiv e-prints, art. arXiv:2205.08674, May 2022.
  • Golrezaei et al. [2019] Negin Golrezaei, Adel Javanmard, and Vahab Mirrokni. Dynamic incentive-aware learning: Robust pricing in contextual auctions. Advances in Neural Information Processing Systems, 32, 2019.
  • Gummadi et al. [2012] Ramakrishna Gummadi, Peter Key, and Alexandre Proutiere. Repeated auctions under budget constraints: Optimal bidding strategies and equilibria. In the Eighth Ad Auction Workshop, volume 4, 2012.
  • Han et al. [2020a] Yanjun Han, Zhengyuan Zhou, Aaron Flores, Erik Ordentlich, and Tsachy Weissman. Learning to bid optimally and efficiently in adversarial first-price auctions. arXiv preprint arXiv:2007.04568, 2020a.
  • Han et al. [2020b] Yanjun Han, Zhengyuan Zhou, and Tsachy Weissman. Optimal no-regret learning in repeated first-price auctions. arXiv preprint arXiv:2003.09795, 2020b.
  • He et al. [2021] Jiafan He, Dongruo Zhou, and Quanquan Gu. Nearly minimax optimal reinforcement learning for discounted MDPs. Advances in Neural Information Processing Systems, 34:22288–22300, 2021.
  • Kaplan and Meier [1958] Edward L. Kaplan and Paul Meier. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282):457–481, 1958.
  • Lebrun [1996] Bernard Lebrun. Existence of an equilibrium in first price auctions. Economic Theory, 7(3):421–443, 1996.
  • Liu and Su [2020] Shuang Liu and Hao Su. Regret bounds for discounted MDPs. arXiv preprint arXiv:2002.05138, 2020.
  • LLC [2021] Google LLC. Faqs about adsense moving to a first-price auction - google adsense help. https://support.google.com/adsense/answer/10858748?hl=en, 2021. Accessed: 2022-05-06.
  • Massart [1990] Pascal Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, pages 1269–1283, 1990.
  • Norvaiša and Paulauskas [1991] R. Norvaiša and V. Paulauskas. Rate of convergence in the central limit theorem for empirical processes. Journal of Theoretical Probability, 4(3):511–534, 1991.
  • Sluis [2017] Sarah Sluis. Big changes coming to auctions, as exchanges roll the dice on first-price. https://www.adexchanger.com/platforms/big-changes-coming-auctions-exchanges-roll-dice-first-price/, 2017.
  • Statista [2021] Statista. Digital advertising spending worldwide from 2019 to 2024 (in billion U.S. dollars). https://www.statista.com/statistics/237974/online-advertising-spending-worldwide/, 2021. Accessed: 2022-05-05.
  • Suzukawa [2004] Akio Suzukawa. Unbiased estimation of functionals under random censorship. Journal of the Japan Statistical Society, 34(2):153–172, 2004.
  • Tran-Thanh et al. [2014] Long Tran-Thanh, Lampros Stavrogiannis, Victor Naroditskiy, Valentin Robu, Nicholas R Jennings, and Peter Key. Efficient regret bounds for online bid optimisation in budget-limited sponsored search auctions. In 30th Conference on Uncertainty in AI, 2014.
  • van Schaik et al. [2001] Frans D. J. van Schaik, Jack P. C. Kleijnen, et al. Sealed-bid auctions: case study. Center for Economic Research, 2001.
  • Vanunts and Drutsa [2019] Arsenii Vanunts and Alexey Drutsa. Optimal pricing in repeated posted-price auctions with different patience of the seller and the buyer. Advances in Neural Information Processing Systems, 32, 2019.
  • Yang et al. [2021] Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2021.
  • Zeng [2004] Donglin Zeng. Estimating marginal survival function by adjusting for dependent censoring using many covariates. The Annals of Statistics, 32(4):1533–1555, 2004.
  • Zhou et al. [2021] Dongruo Zhou, Jiafan He, and Quanquan Gu. Provably efficient reinforcement learning for discounted MDPs with feature mapping. In International Conference on Machine Learning, pages 12793–12802. PMLR, 2021.

Appendix A Omitted Proofs in Section 3.1

A.1 Proof of Lemma 3.2

We will use backward induction to show that

$$|V(B_{t},t)-V_{\hat{F}}(B_{t},t)|\leq Ct^{-\frac{1}{2}}.$$

The inequality holds trivially, by the condition of the lemma, for the base case ($t=t_{0}$). Suppose the bound holds for $t+1$. We now write out the difference of the value functions:

$$V(B_{t},t)-V_{\hat{F}}(B_{t},t)=\operatorname*{\mathbb{E}}_{v\sim G}[Q_{v,F}(B_{t},t,b^{*}_{t})-Q_{v,\hat{F}}(B_{t},t,b_{t})]\leq\operatorname*{\mathbb{E}}_{v\sim G}[Q_{v,F}(B_{t},t,b^{*}_{t})-Q_{v,\hat{F}}(B_{t},t,b_{t}^{*})],$$

where $b_{t}$ is Algorithm 1’s estimated optimal bid and $b^{*}_{t}$ is the bid of the benchmark. The inequality holds because $b_{t}$ maximizes $Q_{v,\hat{F}}$, so replacing it with the (sub-optimal under $\hat{F}$) bid $b^{*}_{t}$ can only decrease $Q_{v,\hat{F}}$. Next, consider the term inside the expectation, which we rewrite as follows:

$$\begin{split}Q_{v,F}(B_{t},t,b^{*}_{t})-Q_{v,\hat{F}}(B_{t},t,b^{*}_{t})&\leq\underbrace{|(v_{t}-b^{*}_{t})(F(b^{*}_{t})-\hat{F}(b^{*}_{t}))|}_{\Delta_{1}}\\&\quad+\underbrace{\lambda F(b^{*}_{t})|V(B_{t}-b^{*}_{t},t+1)-V_{\hat{F}}(B_{t}-b^{*}_{t},t+1)|+\lambda|(F(b^{*}_{t})-\hat{F}(b^{*}_{t}))V_{\hat{F}}(B_{t}-b^{*}_{t},t+1)|}_{\Delta_{2}}\\&\quad+\underbrace{\lambda(1-F(b^{*}_{t}))|V(B_{t},t+1)-V_{\hat{F}}(B_{t},t+1)|+\lambda|(\hat{F}(b^{*}_{t})-F(b^{*}_{t}))V_{\hat{F}}(B_{t},t+1)|}_{\Delta_{3}}.\end{split}$$

To bound the right-hand side, we treat the three terms $\Delta_{1},\Delta_{2},\Delta_{3}$ separately. Using Lemma 3.1 and a union bound over the $T$ rounds, $\Delta_{1}\leq\sqrt{\frac{1}{2}\ln\frac{2T}{\delta}}t^{-\frac{1}{2}}$ with probability at least $1-\delta$. By the induction hypothesis $|V(B_{t+1},t+1)-V_{\hat{F}}(B_{t+1},t+1)|\leq C(t+1)^{-\frac{1}{2}}\leq Ct^{-\frac{1}{2}}$ and the fact that any value function is bounded by $1+\lambda+\lambda^{2}+\dotsb=\frac{1}{1-\lambda}$, we have

$$\Delta_{2}\leq\lambda CF(b^{*}_{t})t^{-\frac{1}{2}}+\frac{\lambda}{1-\lambda}\sqrt{\frac{1}{2}\ln\frac{2T}{\delta}}t^{-\frac{1}{2}},\qquad\Delta_{3}\leq\lambda C(1-F(b^{*}_{t}))t^{-\frac{1}{2}}+\frac{\lambda}{1-\lambda}\sqrt{\frac{1}{2}\ln\frac{2T}{\delta}}t^{-\frac{1}{2}}.$$

Therefore,

$$Q_{v,F}(B_{t},t,b^{*}_{t})-Q_{v,\hat{F}}(B_{t},t,b^{*}_{t})\leq\left(\frac{2\lambda}{1-\lambda}\sqrt{\frac{1}{2}\ln\frac{2T}{\delta}}+\lambda C+\sqrt{\frac{1}{2}\ln\frac{2T}{\delta}}\right)t^{-\frac{1}{2}}=\sqrt{\frac{1}{2}\ln\frac{2T}{\delta}}\frac{1+\lambda}{(1-\lambda)^{2}}t^{-\frac{1}{2}}.$$

This establishes $V(B_{t},t)-V_{\hat{F}}(B_{t},t)\leq Ct^{-\frac{1}{2}}$. Finally, by symmetry (swapping $F$ and $\hat{F}$ and repeating the argument above), it holds that

$$|V(B_{t},t)-V_{\hat{F}}(B_{t},t)|\leq Ct^{-\frac{1}{2}}.$$

This finishes the induction step and the claim follows.

A.2 Proof of Lemma 3.3

We will state a general form of Lemma 3.3 concerning the error in the initialization of the base case. This lemma will come in handy in the following sections.

Lemma A.1.

For any fixed distributions $F,G$, consider the value function $V_{F,G}(B_{t},t)$. Suppose we use an arbitrary value in $\left[0,\frac{1}{1-\lambda}\right]$ to initialize the base case $V_{F,G}(B_{t_{0}},t_{0})$ and recursively compute thereon to obtain $\widetilde{V}_{F,G}(B_{t},t)$. Then for any $t\leq t_{0}$:

$$\sup_{B_{t}}|V_{F,G}(B_{t},t)-\widetilde{V}_{F,G}(B_{t},t)|\leq\frac{1}{1-\lambda}\lambda^{t_{0}-t}.$$
Proof.

When $\tau=t_{0}$, the claim holds because $V_{F,G}$ and $\widetilde{V}_{F,G}$ are both upper bounded by $\frac{1}{1-\lambda}$ and lower bounded by 0.

Suppose the claim holds for $\tau=t+1$; then for $\tau=t$ we have

$$\begin{split}\widetilde{V}_{F,G}(B_{t},t)-V_{F,G}(B_{t},t)&\leq\operatorname*{\mathbb{E}}_{v\sim G}\big[(v_{t}-b_{t}^{*})F(b_{t}^{*})+\lambda F(b_{t}^{*})\widetilde{V}_{F,G}(B_{t}-b_{t}^{*},t+1)+\lambda(1-F(b_{t}^{*}))\widetilde{V}_{F,G}(B_{t},t+1)\\&\qquad-(v_{t}-b_{t}^{*})F(b_{t}^{*})-\lambda F(b_{t}^{*})V_{F,G}(B_{t}-b_{t}^{*},t+1)-\lambda(1-F(b_{t}^{*}))V_{F,G}(B_{t},t+1)\big]\\&\leq\frac{1}{1-\lambda}\lambda^{t_{0}-t-1}\cdot\lambda=\frac{1}{1-\lambda}\lambda^{t_{0}-t},\end{split}$$

where $b_{t}^{*}$ denotes the optimal bid obtained by computing $\widetilde{V}_{F,G}(B_{t},t)$. The first inequality holds since $b_{t}^{*}$ need not be the optimal bid from $V_{F,G}(\cdot,t)$’s perspective. The second inequality holds since $|\widetilde{V}_{F,G}(B_{t+1},t+1)-V_{F,G}(B_{t+1},t+1)|\leq\frac{1}{1-\lambda}\lambda^{t_{0}-t-1}$ for any $B_{t+1}$.

By symmetry, we also have $|\widetilde{V}_{F,G}(B_{t},t)-V_{F,G}(B_{t},t)|\leq\frac{1}{1-\lambda}\lambda^{t_{0}-t}$. This concludes the induction step and yields the lemma. ∎

In particular, under the condition of Lemma 3.3, the initialization is taken to be 0, so applying Lemma A.1 gives

$$\sup_{B_{t}}|V_{\hat{F}}(B_{t},t)-\widetilde{V}_{\hat{F}}(B_{t},t)|\leq\frac{1}{1-\lambda}\lambda^{t_{0}-t}.$$
Remark A.2.

For convenience, mirroring the notation of this lemma, for any value function $\nu$, we will use $\widetilde{\nu}$ to denote its approximately-initialized counterpart. Furthermore, we will invoke the lemma many times for other value functions in the rest of the proofs.

A.3 Proof of Theorem 3.5

Below we first condition on the good event that the estimation (of Lemma 3.1) succeeds for every $t$. The contribution of the bad event is added to the regret at the end.

To prove this theorem, we will first bound the following auxiliary “regret” using Lemma 3.2 and Lemma A.1.

Let us first give an intuitive and approximate description of the regret. The buyer’s “regret” can be viewed in two parts: 1) she does not bid according to the optimal strategy; 2) her strategy does not spend the budget optimally, which leads to future losses. Given the remaining budget $B_{t}$ at time $t$ with strategy $\pi$, this intuition guides us to first look at

$$\begin{split}R_{1}&=\operatorname*{\mathbb{E}}_{\bm{v}\sim G^{T}}\Biggl[\sum_{t=1}^{T}(R^{\pi^{*}}_{t}(v_{t},b^{*}_{t})-R^{\pi}_{t}(v_{t},b_{t}))\\&\qquad+\lambda\big(F(b_{t}^{*})V(B_{t}-b_{t}^{*},t+1)+(1-F(b_{t}^{*}))V(B_{t},t+1)\big)\\&\qquad-\lambda\big(F(b_{t})V(B_{t}-b_{t},t+1)+(1-F(b_{t}))V(B_{t},t+1)\big)\Biggr].\end{split}$$
Lemma A.3.

Suppose Lemma 3.2 and Lemma A.1 hold with constants $C$ and $C_{1}$, i.e., $|V(B_{t},t)-V_{\hat{F}}(B_{t},t)|\leq(C+C_{1})t^{-\frac{1}{2}}$. Assume further that $\sup_{x}|F(x)-\hat{F}(x)|\leq Kt^{-\frac{1}{2}}$ for some constant $K$. We have

$$R_{1}\leq 2\left(\frac{K(1+\lambda)}{1-\lambda}+(1+\lambda)C+2C_{1}\right)\sqrt{T}.$$
Proof.

To ease the description, we first define

$$\begin{split}\hat{H}_{t}&\coloneqq(v_{t}-b_{t})\hat{F}(b_{t})+\lambda\big(\hat{F}(b_{t})V_{\hat{F}}(B_{t}-b_{t},t+1)+(1-\hat{F}(b_{t}))V_{\hat{F}}(B_{t},t+1)\big),\\H_{t}&\coloneqq(v_{t}-b_{t}^{*})F(b_{t}^{*})+\lambda\big(F(b_{t}^{*})V(B_{t}-b_{t}^{*},t+1)+(1-F(b_{t}^{*}))V(B_{t},t+1)\big),\\\widetilde{H}_{t}&\coloneqq(v_{t}-b_{t})F(b_{t})+\lambda\big(F(b_{t})V(B_{t}-b_{t},t+1)+(1-F(b_{t}))V(B_{t},t+1)\big).\end{split}$$

(Recall that btb_{t} is Algorithm 1’s estimated optimal bid and btb^{*}_{t} is the bid of the benchmark.) Using the notations above, R1R_{1} now becomes

𝔼𝒗GT[t=1T(HtH~t)]𝔼𝒗GT[t=1T(|HtH^t|+|H^tH~t|)].\operatorname*{\mathbb{E}}_{\bm{v}\sim G^{T}}\left[\sum_{t=1}^{T}(H_{t}-\widetilde{H}_{t})\right]\leq\operatorname*{\mathbb{E}}_{\bm{v}\sim G^{T}}\left[\sum_{t=1}^{T}(|H_{t}-\hat{H}_{t}|+|\hat{H}_{t}-\widetilde{H}_{t}|)\right].

Using the induction arguments in the proofs of Lemma 3.2 and Lemma A.1, the bound |H_{t}-\hat{H}_{t}|\leq(C+C_{1})t^{-\frac{1}{2}} follows from the assumed conditions. In order to bound |\hat{H}_{t}-\widetilde{H}_{t}|, we write

|H^tH~t|\displaystyle|\hat{H}_{t}-\widetilde{H}_{t}|\leq |(vtbt)(F(bt)F^(bt))|+λ|F^(bt)F(bt)|VF^(Btbt,t+1)\displaystyle|(v_{t}-b_{t})(F(b_{t})-\hat{F}(b_{t}))|+\lambda|\hat{F}(b_{t})-F(b_{t})|V_{\hat{F}}(B_{t}-b_{t},t+1)
+λF(bt)|V(Btbt,t+1)VF^(Btbt,t+1)|\displaystyle+\lambda F(b_{t})|V(B_{t}-b_{t},t+1)-V_{\hat{F}}(B_{t}-b_{t},t+1)|
+λ|F(bt)F^(bt)|VF^(Bt,t+1)+λ(1F(bt))|V(Bt,t+1)VF^(Bt,t+1)|\displaystyle+\lambda|F(b_{t})-\hat{F}(b_{t})|V_{\hat{F}}(B_{t},t+1)+\lambda(1-F(b_{t}))|V(B_{t},t+1)-V_{\hat{F}}(B_{t},t+1)|
Kt12(1+2λ1λ)+λ(Ct12+C1λt12)\displaystyle\leq Kt^{-\frac{1}{2}}\left(1+\frac{2\lambda}{1-\lambda}\right)+\lambda\left(Ct^{-\frac{1}{2}}+\frac{C_{1}}{\lambda}{t}^{-\frac{1}{2}}\right)
=(K(1+λ)1λ+λC+C1)t12.\displaystyle=\left(\frac{K(1+\lambda)}{1-\lambda}+\lambda C+C_{1}\right)t^{-\frac{1}{2}}.

The first inequality holds by the triangle inequality, and the second inequality is established using the specified conditions (Lemma 3.1, Lemma 3.2 and Lemma A.1). Note that, by the choice of t_{0}, it holds that |\widetilde{V}_{\hat{F}}(B_{t+1},t+1)-V_{\hat{F}}(B_{t+1},t+1)|\leq\frac{1}{1-\lambda}\lambda^{t_{0}-t-1}=\frac{C_{1}}{\lambda\sqrt{t}}.

Finally, we sum up the above estimates and obtain

R1t=1T(K(1+λ)1λ+(1+λ)C+2C1)t12,R_{1}\leq\sum_{t=1}^{T}\left(\frac{K(1+\lambda)}{1-\lambda}+(1+\lambda)C+2C_{1}\right)t^{-\frac{1}{2}},

as desired, since \sum_{t=1}^{T}t^{-\frac{1}{2}}\leq 1+\int_{1}^{T}t^{-\frac{1}{2}}\,\mathrm{d}t\leq 2\sqrt{T}. ∎

In particular, taking C=\sqrt{\frac{1}{2}\ln\frac{2T}{\delta}}\frac{1+\lambda}{(1-\lambda)^{2}} and K=\sqrt{\frac{1}{2}\ln\frac{2T}{\delta}} as in the analysis of Algorithm 1, we have R_{1}\leq 4(C+C_{1})\sqrt{T}.
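To spell out the arithmetic behind this instantiation: since C=\sqrt{\frac{1}{2}\ln\frac{2T}{\delta}}\frac{1+\lambda}{(1-\lambda)^{2}} means \frac{K(1+\lambda)}{1-\lambda}=(1-\lambda)C, the bound of Lemma A.3 becomes

2\left(\frac{K(1+\lambda)}{1-\lambda}+(1+\lambda)C+2C_{1}\right)\sqrt{T}=2\bigl((1-\lambda)C+(1+\lambda)C+2C_{1}\bigr)\sqrt{T}=4(C+C_{1})\sqrt{T}.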

Next, we build a connection between R_{1} and the accumulated differences in the value functions. The definition of R_{2} is inspired by recent advances in Yang et al. (2021); He et al. (2021); Liu and Su (2020); Zhou et al. (2021). We define

R2=𝔼𝒗GT[t=1T(V(Bt,t)Vπ(Bt,t))].R_{2}=\operatorname*{\mathbb{E}}_{\bm{v}\sim G^{T}}\left[\sum_{t=1}^{T}(V(B_{t},t)-V^{\pi}(B_{t},t))\right].
Lemma A.4.

For any given B_{t} and t, suppose that the conditions of Lemma A.3 hold. We have

|V(Bt,t)Vπ(Bt,t)|C(1λ)twhereC=K(1+λ)1λ+(1+λ)C+2C1.|V(B_{t},t)-V^{\pi}(B_{t},t)|\leq\frac{C^{\prime}}{(1-\lambda)\sqrt{t}}\quad\text{where}\quad C^{\prime}=\frac{K(1+\lambda)}{1-\lambda}+(1+\lambda)C+2C_{1}.
Proof.

To start, we introduce the following notation:

Htπ=(vtbt)F(bt)+λ(F(bt)Vπ(Btbt,t+1)+(1F(bt))Vπ(Bt,t+1)).H_{t}^{\pi}=(v_{t}-b_{t})F(b_{t})+\lambda(F(b_{t})V^{\pi}(B_{t}-b_{t},t+1)+(1-F(b_{t}))V^{\pi}(B_{t},t+1)).

Note that we have \operatorname*{\mathbb{E}}_{\bm{v}\sim G^{T}}[H_{t}]=V(B_{t},t) and \operatorname*{\mathbb{E}}_{\bm{v}\sim G^{T}}[H_{t}^{\pi}]=V^{\pi}(B_{t},t). We use backward induction to bound the difference between V(B_{t},t) and V^{\pi}(B_{t},t) for any B_{t}. First choose a sufficiently large t_{0} and assume V(B_{t_{0}},t_{0})=\widetilde{V}^{\pi}(B_{t_{0}},t_{0}), so the induction basis holds at \tau=t_{0}. Assume the induction hypothesis |V(B_{\tau},\tau)-\widetilde{V}^{\pi}(B_{\tau},\tau)|\leq\frac{C^{\prime}}{(1-\lambda)\sqrt{\tau}} holds for \tau=t+1. Then for \tau=t, it holds that

H~tHtπ\displaystyle\widetilde{H}_{t}-H_{t}^{\pi} =𝔼vG[λF(bt)(V(Btbt,t+1)V~π(Btbt,t+1))\displaystyle=\operatorname*{\mathbb{E}}_{v\sim G}[\lambda F(b_{t})(V(B_{t}-b_{t},t+1)-\widetilde{V}^{\pi}(B_{t}-b_{t},t+1))
+λ(1F(bt))(V(Bt,t+1)V~π(Bt,t+1))]\displaystyle+\lambda(1-F(b_{t}))(V(B_{t},t+1)-\widetilde{V}^{\pi}(B_{t},t+1))]
Cλ(1λ)t+1.\displaystyle\leq\frac{C^{\prime}\lambda}{(1-\lambda)\sqrt{t+1}}.

It follows that

|V(Bt,t)V~π(Bt,t)|𝔼𝒗G[|HtH~t|+|H~tHtπ|]Ct+Cλ(1λ)t+1C(1λ)t.|V(B_{t},t)-\widetilde{V}^{\pi}(B_{t},t)|\leq\operatorname*{\mathbb{E}}_{\bm{v}\sim G}[|H_{t}-\widetilde{H}_{t}|+|\widetilde{H}_{t}-H_{t}^{\pi}|]\leq\frac{C^{\prime}}{\sqrt{t}}+\frac{C^{\prime}\lambda}{(1-\lambda)\sqrt{t+1}}\leq\frac{C^{\prime}}{(1-\lambda)\sqrt{t}}.

This concludes the induction step.

Applying the proof technique of Lemma A.1 with the corresponding condition for V^{\pi}, we have

|Vπ(Bt,t)V~π(Bt,t)|11λλt0t.|V^{\pi}(B_{t},t)-\widetilde{V}^{\pi}(B_{t},t)|\leq\frac{1}{1-\lambda}\lambda^{t_{0}-t}.

Since t0t_{0} is arbitrarily chosen (as long as it is sufficiently large), we can take t0+t_{0}\to+\infty. Note that we have limt0+V~π(Bt,t)=Vπ(Bt,t)\lim_{t_{0}\to+\infty}\widetilde{V}^{\pi}(B_{t},t)=V^{\pi}(B_{t},t). Therefore, it holds that

|V(Bt,t)Vπ(Bt,t)|C(1λ)t.|V(B_{t},t)-V^{\pi}(B_{t},t)|\leq\frac{C^{\prime}}{(1-\lambda)\sqrt{t}}.

Summing over t and using \sum_{t=1}^{T}t^{-\frac{1}{2}}\leq 2\sqrt{T}, this establishes

R22C1λT,R_{2}\leq\frac{2C^{\prime}}{1-\lambda}\sqrt{T},

and concludes the proof. ∎

In particular, taking C=\sqrt{\frac{1}{2}\ln\frac{2T}{\delta}}\frac{1+\lambda}{(1-\lambda)^{2}} and K=\sqrt{\frac{1}{2}\ln\frac{2T}{\delta}} as in the analysis of Algorithm 1, we have R_{2}\leq\frac{4(C+C_{1})}{1-\lambda}\sqrt{T}.

Finally, we relate the regret defined in Equation 1 to the quantity R_{2} from the previous lemma.

Lemma A.5.

Suppose that R_{2}\leq\frac{2C^{\prime}}{1-\lambda}\sqrt{T}. Then \text{Regret}\leq(2C^{\prime}+1-\lambda)\sqrt{T}+\frac{\lambda}{2}\log_{\frac{1}{\lambda}}\frac{T}{(1-\lambda)^{2}}.

Proof.

In fact, note that

R2=𝔼𝒗GT[t=1T(1+λ++λt1)(R(vt,bt)Rπ(vt,bt))],R_{2}=\operatorname*{\mathbb{E}}_{\bm{v}\sim G^{T}}\left[\sum_{t=1}^{T}(1+\lambda+\dotsb+\lambda^{t-1})(R^{*}(v_{t},b_{t}^{*})-R^{\pi}(v_{t},b_{t}))\right],

and any expected instantaneous reward is no greater than 1. We have

|11λRegretR2|=𝔼𝒗GT[|t=1Tλt1λ(R(vt,bt)Rπ(vt,bt))|]\displaystyle\left|\frac{1}{1-\lambda}\text{Regret}-R_{2}\right|=\operatorname*{\mathbb{E}}_{\bm{v}\sim G^{T}}\left[\left|\sum_{t=1}^{T}\frac{\lambda^{t}}{1-\lambda}(R^{*}(v_{t},b_{t}^{*})-R^{\pi}(v_{t},b_{t}))\right|\right]
λ2(1λ)log1λT(1λ)2+1TT.\displaystyle\leq\frac{\lambda}{2(1-\lambda)}\log_{\frac{1}{\lambda}}\frac{T}{(1-\lambda)^{2}}+\frac{1}{\sqrt{T}}T.

The above bound holds because when t\geq\log_{\frac{1}{\lambda}}\frac{\sqrt{T}}{1-\lambda}, we have \frac{\lambda^{t}}{1-\lambda}\leq\frac{1}{\sqrt{T}}. The sum is split into two parts at this threshold \log_{\frac{1}{\lambda}}\frac{\sqrt{T}}{1-\lambda}: the first part is at most \log_{\frac{1}{\lambda}}\frac{\sqrt{T}}{1-\lambda}\cdot\frac{\lambda}{1-\lambda}=\frac{\lambda}{2(1-\lambda)}\log_{\frac{1}{\lambda}}\frac{T}{(1-\lambda)^{2}}, and the second part is at most \frac{1}{\sqrt{T}}\cdot T=\sqrt{T}. The result follows after simple rearrangement. ∎
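The following small Python check (our addition, not from the paper) verifies this splitting bound numerically in the worst case where every per-round difference R^{*}-R^{\pi} equals its upper bound 1:

```python
import math

# Worst-case numeric check of the tail-splitting step in Lemma A.5,
# assuming every per-round difference equals 1 (the proof only uses
# boundedness by 1).
for lam in (0.5, 0.9, 0.99):
    for T in (10**3, 10**5):
        lhs = sum(lam**t / (1 - lam) for t in range(1, T + 1))
        log_inv = lambda x: math.log(x) / math.log(1 / lam)  # log base 1/lam
        rhs = lam / (2 * (1 - lam)) * log_inv(T / (1 - lam) ** 2) + math.sqrt(T)
        assert lhs <= rhs
        print(f"lambda={lam}, T={T}: {lhs:9.1f} <= {rhs:9.1f}")
```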

In particular, taking C^{\prime}=\frac{K(1+\lambda)}{1-\lambda}+(1+\lambda)C+2C_{1} as in the analysis of Algorithm 1, we have

Regret((412ln2Tδ1+λ(1λ)2+4C1)+1λ)T+λ2log1λT(1λ)2,\text{Regret}\leq\left(\left(4\sqrt{\frac{1}{2}\ln\frac{2T}{\delta}}\frac{1+\lambda}{(1-\lambda)^{2}}+4C_{1}\right)+1-\lambda\right)\sqrt{T}+\frac{\lambda}{2}\log_{\frac{1}{\lambda}}\frac{T}{(1-\lambda)^{2}},

conditioning on the good event that the estimation (of Lemma 3.1) succeeds for every 1\leq t\leq T. Now take \delta=\frac{1}{T}, so that \Pr[\text{bad event}]\leq\frac{1}{T}. Using the trivial regret bound T for the failure event, we have

Regret((412ln2T21+λ(1λ)2+4C1)+1λ)T+λ2log1λT(1λ)2+1TT,\text{Regret}\leq\left(\left(4\sqrt{\frac{1}{2}\ln 2T^{2}}\frac{1+\lambda}{(1-\lambda)^{2}}+4C_{1}\right)+1-\lambda\right)\sqrt{T}+\frac{\lambda}{2}\log_{\frac{1}{\lambda}}\frac{T}{(1-\lambda)^{2}}+\frac{1}{T}\cdot T,

and this concludes the proof of Theorem 3.5.

A.4 Proof of Theorem 3.6

Note that the function Q depends on both F and G, as V does. Since G also needs to be estimated in this setting, we use Q_{\cdot,\hat{G}} and Q_{\cdot,G} to denote the versions computed with \hat{G} and G respectively. We extend this notation to V as well (cf. Algorithm 2).

We start with two simple lemmas, which are direct corollaries of Lemma 3.1 and Lemma A.1, respectively.

Lemma A.6.

Let G^t\hat{G}_{t} be the estimated distribution at round tt in Algorithm 1. With probability 1δ1-\delta,

supx|G^t(x)G(x)|12ln2δt12.\sup_{x}|\hat{G}_{t}(x)-G(x)|\leq\sqrt{\frac{1}{2}\ln\frac{2}{\delta}}t^{-\frac{1}{2}}.
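As an illustration of the estimate invoked here, the Python sketch below (our addition) forms the empirical CDF from t i.i.d. samples and compares its uniform deviation with the stated DKW-type bound; taking G uniform on [0,1] is an assumption made only for this example.

```python
import numpy as np

# Empirical CDF vs. the deviation bound sqrt(ln(2/delta) / (2t)) of
# Lemma A.6. G = U[0, 1] is an illustrative assumption, so G(x) = x.
rng = np.random.default_rng(1)
t, delta = 4000, 0.05
samples = np.sort(rng.uniform(size=t))      # v_1, ..., v_t ~ G

def G_hat(x):
    """Empirical CDF: fraction of observed values <= x."""
    return np.searchsorted(samples, x, side="right") / t

xs = np.linspace(0.0, 1.0, 1001)
deviation = np.max(np.abs(G_hat(xs) - xs))  # sup_x |G_hat(x) - G(x)|
bound = np.sqrt(np.log(2 / delta) / (2 * t))
print(f"sup deviation {deviation:.4f}, bound {bound:.4f} (holds w.p. {1 - delta})")
```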
Lemma A.7.

For any round tTt\leq T, budget BtB_{t}, it holds that

supBt|VF^,G^(Bt,t)V~F^,G^(Bt,t)|11λλt0t.\sup_{B_{t}}|V_{\hat{F},\hat{G}}(B_{t},t)-\widetilde{V}_{\hat{F},\hat{G}}(B_{t},t)|\leq\frac{1}{1-\lambda}\lambda^{t_{0}-t}.

In the following proofs, we gradually bridge V(B_{t},t) and the estimated value function V_{\hat{F},\hat{G}}(B_{t},t). For distributions A and B, the notation V_{A,B} refers to the value function computed with F=A and G=B.

Lemma A.8.

With probability at least 1δ2T1-\frac{\delta}{2T}, for any given tTt\leq T and budget BtB_{t}, we have

|VF^,G^(Bt,t)VF^,G(Bt,t)|12ln4Tδ1(1λ)2t12.|V_{\hat{F},\hat{G}}(B_{t},t)-V_{\hat{F},G}(B_{t},t)|\leq\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}\frac{1}{(1-\lambda)^{2}}t^{-\frac{1}{2}}.
Proof.

First we note that Lemma A.6 states that with probability at least 1-\frac{\delta}{2T}, \sup_{x}|\hat{G}_{t}(x)-G(x)|\leq\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}t^{-\frac{1}{2}}. Now we apply backward induction. When t=t_{0}, the induction basis holds trivially since |V_{\hat{F},\hat{G}}(B_{t},t)-V_{\hat{F},G}(B_{t},t)|=0\leq\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}\frac{1}{(1-\lambda)^{2}}t^{-\frac{1}{2}}. Assume the induction hypothesis holds for \tau=t+1. For any t<t_{0}, v_{t} and B_{t}, it holds that

Q_{\hat{F},G}(B_{t},t,b_{t}^{*})-Q_{\hat{F},\hat{G}}(B_{t},t,b_{t}) \leq Q_{\hat{F},G}(B_{t},t,b_{t}^{*})-Q_{\hat{F},\hat{G}}(B_{t},t,b_{t}^{*})
=\lambda\hat{F}(b_{t}^{*})V_{\hat{F},G}(B_{t}-b_{t}^{*},t+1)+\lambda(1-\hat{F}(b_{t}^{*}))V_{\hat{F},G}(B_{t},t+1)
-\lambda\hat{F}(b_{t}^{*})V_{\hat{F},\hat{G}}(B_{t}-b_{t}^{*},t+1)-\lambda(1-\hat{F}(b_{t}^{*}))V_{\hat{F},\hat{G}}(B_{t},t+1)
\leq\lambda\hat{F}(b_{t}^{*})|V_{\hat{F},G}(B_{t}-b_{t}^{*},t+1)-V_{\hat{F},\hat{G}}(B_{t}-b_{t}^{*},t+1)|
+\lambda(1-\hat{F}(b_{t}^{*}))|V_{\hat{F},G}(B_{t},t+1)-V_{\hat{F},\hat{G}}(B_{t},t+1)|
\leq\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}\frac{\lambda}{(1-\lambda)^{2}}t^{-\frac{1}{2}}.

The first inequality holds since by construction b_{t}^{*} is the optimal bid under \hat{F} and G’s view while b_{t} is the optimal bid under \hat{F} and \hat{G}’s view. The second inequality holds by the triangle inequality, and the third inequality is due to the induction hypothesis. Finally, by symmetry (swapping G and \hat{G} and repeating the argument above), we obtain |Q_{\hat{F},G}(B_{t},t,b_{t}^{*})-Q_{\hat{F},\hat{G}}(B_{t},t,b_{t})|\leq\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}\frac{\lambda}{(1-\lambda)^{2}}t^{-\frac{1}{2}}.

Next, since V_{\hat{F},\hat{G}}(B_{t},t)=\operatorname*{\mathbb{E}}_{\hat{G}}[Q_{\hat{F},\hat{G}}(B_{t},t,b_{t})] and V_{\hat{F},G}(B_{t},t)=\operatorname*{\mathbb{E}}_{G}[Q_{\hat{F},G}(B_{t},t,b_{t}^{*})], we have

|VF^,G(Bt,t)VF^,G^(Bt,t)|Δ1+Δ2,|V_{\hat{F},G}(B_{t},t)-V_{\hat{F},\hat{G}}(B_{t},t)|\leq\Delta_{1}+\Delta_{2},

where

Δ1\displaystyle\Delta_{1} =|𝔼G[QF^,G(Bt,t,bt)]𝔼G^[QF^,G(Bt,t,bt)]| and\displaystyle=|{\operatorname*{\mathbb{E}}_{G}[Q_{\hat{F},G}(B_{t},t,b_{t}^{*})]-\operatorname*{\mathbb{E}}_{\hat{G}}[Q_{\hat{F},G}(B_{t},t,b_{t}^{*})]}|\text{ and}
Δ2\displaystyle\Delta_{2} =|𝔼G^[QF^,G^(Bt,t,bt)]𝔼G^[QF^,G(Bt,t,bt)]|.\displaystyle=|{\operatorname*{\mathbb{E}}_{\hat{G}}[Q_{\hat{F},\hat{G}}(B_{t},t,b_{t})]-\operatorname*{\mathbb{E}}_{\hat{G}}[Q_{\hat{F},G}(B_{t},t,b_{t}^{*})]}|.
  • To bound \Delta_{1}: Q_{\hat{F},G} takes values in \left[0,\frac{1}{1-\lambda}\right] and the difference between G and \hat{G} is uniformly bounded by \sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}t^{-\frac{1}{2}}. By integration by parts, \int_{0}^{1}Q\,\mathrm{d}(G-\hat{G})=Q(G-\hat{G})|_{0}^{1}-\int_{0}^{1}(G-\hat{G})\,\mathrm{d}Q, so |\Delta_{1}|\leq\int_{0}^{1}|G-\hat{G}||\mathrm{d}Q|. Since Q is monotone w.r.t. v_{t} and bounded on both sides, it holds that |\Delta_{1}|\leq\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}t^{-\frac{1}{2}}\frac{1}{1-\lambda}.

  • To bound \Delta_{2}: by the bound on the difference of the Q functions established above, \Delta_{2}\leq\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}\frac{\lambda}{(1-\lambda)^{2}}t^{-\frac{1}{2}}.

Therefore, we obtain

|VF^,G(Bt,t)VF^,G^(Bt,t)|12ln4Tδ1(1λ)2t12,|V_{\hat{F},G}(B_{t},t)-V_{\hat{F},\hat{G}}(B_{t},t)|\leq\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}\frac{1}{(1-\lambda)^{2}}t^{-\frac{1}{2}},

which finishes the induction step and concludes the proof. ∎

Lemma A.9.

For any tTt\leq T and budget BtB_{t}, with probability at least 1δT1-\frac{\delta}{T}, it holds that:

|VF,G(Bt,t)V~F^,G^(Bt,t)|(12ln4Tδ2+λ(1λ)2+C1)t12.|V_{F,G}(B_{t},t)-\widetilde{V}_{\hat{F},\hat{G}}(B_{t},t)|\leq\left(\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}\frac{2+\lambda}{(1-\lambda)^{2}}+C_{1}\right)t^{-\frac{1}{2}}.
Proof.

In order to bound the difference between V_{F,G}(B_{t},t) and \widetilde{V}_{\hat{F},\hat{G}}(B_{t},t), we first rewrite

|VF,G(Bt,t)V~F^,G^(Bt,t)|Δ1+Δ2+Δ3,|V_{F,G}(B_{t},t)-\widetilde{V}_{\hat{F},\hat{G}}(B_{t},t)|\leq\Delta_{1}+\Delta_{2}+\Delta_{3},

where

Δ1\displaystyle\Delta_{1} =|V~F^,G^(Bt,t)VF^,G^(Bt,t)|,\displaystyle=|\widetilde{V}_{\hat{F},\hat{G}}(B_{t},t)-V_{\hat{F},\hat{G}}(B_{t},t)|,
Δ2\displaystyle\Delta_{2} =|VF^,G^(Bt,t)VF^,G(Bt,t)| and\displaystyle=|V_{\hat{F},\hat{G}}(B_{t},t)-V_{\hat{F},G}(B_{t},t)|\text{ and}
Δ3\displaystyle\Delta_{3} =|VF^,G(Bt,t)VF,G(Bt,t)|.\displaystyle=|V_{\hat{F},G}(B_{t},t)-V_{F,G}(B_{t},t)|.
  • To bound Δ1\Delta_{1}: using Lemma A.7 and the definition of t0t_{0}, we conclude that Δ1C1t\Delta_{1}\leq\frac{C_{1}}{\sqrt{t}}.

  • To bound Δ2\Delta_{2}: using Lemma A.8, it holds that Δ212ln4Tδ1(1λ)2t12\Delta_{2}\leq\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}\frac{1}{(1-\lambda)^{2}}t^{-\frac{1}{2}}.

  • To bound \Delta_{3}: using Lemma 3.2, we obtain \Delta_{3}\leq\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}\frac{1+\lambda}{(1-\lambda)^{2}}t^{-\frac{1}{2}}. Note that we use a Bonferroni (union bound) argument to split \delta between the estimation errors of F and G.

Therefore, it holds that

|V_{F,G}(B_{t},t)-\widetilde{V}_{\hat{F},\hat{G}}(B_{t},t)|\leq\left(\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}\frac{2+\lambda}{(1-\lambda)^{2}}+C_{1}\right)t^{-\frac{1}{2}},

which concludes the proof. ∎

Now, we provide a bound similar to Lemma A.3 for Theorem 3.6. Similarly, we define

\hat{H}_{t}\coloneqq(v_{t}-b_{t})\hat{F}(b_{t})+\lambda\hat{F}(b_{t})\widetilde{V}_{\hat{F},\hat{G}}(B_{t}-b_{t},t+1)+\lambda(1-\hat{F}(b_{t}))\widetilde{V}_{\hat{F},\hat{G}}(B_{t},t+1),
\widetilde{H}_{t}\coloneqq(v_{t}-b_{t})F(b_{t})+\lambda{F}(b_{t})V_{F,G}(B_{t}-b_{t},t+1)+\lambda(1-F(b_{t}))V_{F,G}(B_{t},t+1),
H_{t}\coloneqq(v_{t}-b_{t}^{*})F(b_{t}^{*})+\lambda{F}(b_{t}^{*})V_{F,G}(B_{t}-b_{t}^{*},t+1)+\lambda(1-F(b_{t}^{*}))V_{F,G}(B_{t},t+1).

Recall that R1=𝔼𝒗GT[t=1T(HtH~t)]R_{1}=\operatorname*{\mathbb{E}}_{\bm{v}\sim G^{T}}\left[\sum_{t=1}^{T}(H_{t}-\widetilde{H}_{t})\right]. The following series of arguments holds. First, we have

R1|𝔼𝒗GT[t=1T(HtH^t)]|+𝔼𝒗GT[t=1T|H~tH^t|].R_{1}\leq\Biggl{|}{\operatorname*{\mathbb{E}}_{\bm{v}\sim G^{T}}\left[\sum_{t=1}^{T}(H_{t}-\hat{H}_{t})\right]}\Biggr{|}+\operatorname*{\mathbb{E}}_{\bm{v}\sim G^{T}}\left[\sum_{t=1}^{T}|\widetilde{H}_{t}-\hat{H}_{t}|\right].

It follows from Lemma A.9 that

|𝔼𝒗GT[t=1THtH^t]|(12ln4Tδ2+λ(1λ)2+C1)t=1Tt12.\Biggl{|}{\operatorname*{\mathbb{E}}_{\bm{v}\sim G^{T}}\left[\sum_{t=1}^{T}H_{t}-\hat{H}_{t}\right]}\Biggr{|}\leq\left(\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}\frac{2+\lambda}{(1-\lambda)^{2}}+C_{1}\right)\sum_{t=1}^{T}t^{-\frac{1}{2}}.

It suffices to bound |H~tH^t||\widetilde{H}_{t}-\hat{H}_{t}|. To do so we rewrite

|H~tH^t|Δ1+Δ2+Δ3+Δ4+Δ5,|\widetilde{H}_{t}-\hat{H}_{t}|\leq\Delta_{1}+\Delta_{2}+\Delta_{3}+\Delta_{4}+\Delta_{5},

where

Δ1\displaystyle\Delta_{1} =|(vtbt)F^(bt)(vtbt)F(bt)|,\displaystyle=|(v_{t}-b_{t})\hat{F}(b_{t})-(v_{t}-b_{t})F(b_{t})|,
Δ2\displaystyle\Delta_{2} =λF^(bt)|V~F^,G^(Btbt,t+1)VF,G(Btbt,t+1)|,\displaystyle=\lambda\hat{F}(b_{t})|\widetilde{V}_{\hat{F},\hat{G}}(B_{t}-b_{t},t+1)-V_{F,G}(B_{t}-b_{t},t+1)|,
Δ3\displaystyle\Delta_{3} =|λF^(bt)VF,G(Btbt,t+1)λF(bt)VF,G(Btbt,t+1)|,\displaystyle=|\lambda\hat{F}(b_{t})V_{F,G}(B_{t}-b_{t},t+1)-\lambda{F}(b_{t})V_{F,G}(B_{t}-b_{t},t+1)|,
Δ4\displaystyle\Delta_{4} =λ(1F^(bt))|V~F^,G^(Bt,t+1)VF,G(Bt,t+1)| and\displaystyle=\lambda(1-\hat{F}(b_{t}))|\widetilde{V}_{\hat{F},\hat{G}}(B_{t},t+1)-V_{F,G}(B_{t},t+1)|\text{ and}
Δ5\displaystyle\Delta_{5} =|λ(1F^(bt))VF,G(Bt,t+1)λ(1F(bt))VF,G(Bt,t+1)|.\displaystyle=|\lambda(1-\hat{F}(b_{t}))V_{F,G}(B_{t},t+1)-\lambda(1-F(b_{t}))V_{F,G}(B_{t},t+1)|.
  • To bound \Delta_{1}: using Lemma 3.1, it holds that \Delta_{1}\leq\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}t^{-\frac{1}{2}}.

  • To bound \Delta_{2}+\Delta_{4}: using Lemma A.9 at round t+1 (with C_{1} replaced by \frac{C_{1}}{\lambda}), it holds that \Delta_{2}+\Delta_{4}\leq\lambda\left(\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}\frac{2+\lambda}{(1-\lambda)^{2}}+\frac{C_{1}}{\lambda}\right)t^{-\frac{1}{2}}.

  • To bound \Delta_{3}: since V_{F,G}(\cdot,\cdot)\leq\frac{1}{1-\lambda}, it holds that \Delta_{3}\leq\frac{\lambda}{1-\lambda}\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}t^{-\frac{1}{2}}.

  • To bound \Delta_{5}: in the same way as \Delta_{3}, it holds that \Delta_{5}\leq\frac{\lambda}{1-\lambda}\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}t^{-\frac{1}{2}}.

After summing them up, we have

|H~tH^t|(12ln4Tδ1+2λ(1λ)2+C1)t12.|\widetilde{H}_{t}-\hat{H}_{t}|\leq\left(\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}\frac{1+2\lambda}{(1-\lambda)^{2}}+C_{1}\right)t^{-\frac{1}{2}}.

Therefore, conditioning on the good event that the estimations of both F and G succeed for every t, R_{1} is bounded by

R1(12ln4Tδ3+3λ(1λ)2+2C1)t=1Tt12(12ln4Tδ6(1+λ)(1λ)2+4C1)T.R_{1}\leq\left(\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}\frac{3+3\lambda}{(1-\lambda)^{2}}+2C_{1}\right)\sum_{t=1}^{T}t^{-\frac{1}{2}}\leq\left(\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}\frac{6(1+\lambda)}{(1-\lambda)^{2}}+4C_{1}\right)\sqrt{T}.
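For completeness, the constants above combine via

\frac{2+\lambda}{(1-\lambda)^{2}}+\frac{1+2\lambda}{(1-\lambda)^{2}}=\frac{3(1+\lambda)}{(1-\lambda)^{2}},\qquad\sum_{t=1}^{T}t^{-\frac{1}{2}}\leq 1+\int_{1}^{T}t^{-\frac{1}{2}}\,\mathrm{d}t\leq 2\sqrt{T},

which yields the factors \frac{6(1+\lambda)}{(1-\lambda)^{2}} and 4C_{1} in the second inequality.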

We then apply the same techniques as in Lemma A.4 and Lemma A.5 to yield

Regret(12ln4Tδ6(1+λ)(1λ)2+4C1+1λ)T+λ2log1λT(1λ)2,\text{Regret}\leq\left(\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}\frac{6(1+\lambda)}{(1-\lambda)^{2}}+4C_{1}+1-\lambda\right)\sqrt{T}+\frac{\lambda}{2}\log_{\frac{1}{\lambda}}\frac{T}{(1-\lambda)^{2}},

conditioning on the good event.

Now take δ=1T\delta=\frac{1}{T} and note that Pr[bad event]1T\Pr[\text{bad event}]\leq\frac{1}{T}. By using the trivial regret bound TT for the failure event, we obtain the desired bound

\text{Regret}\leq\left(\sqrt{\frac{1}{2}\ln 4T^{2}}\frac{6(1+\lambda)}{(1-\lambda)^{2}}+4C_{1}+1-\lambda\right)\sqrt{T}+\frac{\lambda}{2}\log_{\frac{1}{\lambda}}\frac{T}{(1-\lambda)^{2}}+1.

Appendix B Omitted proof in Section 3.2

B.1 Proof of Lemma 3.9

Using Lemma 3.8, we have

\left|\Pr\left(\sup_{b}\left|\sqrt{2^{n}}(\hat{F}_{n}(1-b)-F(1-b))\right|\geq r\right)-\Pr\left(\sup_{b}|B(1-b)|\geq r\right)\right|\leq C_{2}(1+r)^{-3}\ln^{2}(2^{n})\,2^{-\frac{n}{6}}.

Since 2^{n}\leq t\leq 2^{n+1}-1, it holds that

\left|\Pr\left(\sup_{b}\left|\sqrt{2^{n}}(\hat{F}_{n}(1-b)-F(1-b))\right|\geq r\right)-\Pr\left(\sup_{b}|B(1-b)|\geq r\right)\right|
\leq C_{2}(1+r)^{-3}\ln^{2}t\left(\frac{t}{2}\right)^{-\frac{1}{6}}.

Taking M=C2216M=C_{2}2^{\frac{1}{6}} concludes the proof.

B.2 Proof of Lemma 3.10

Before we proceed to bound the difference between \hat{F}_{n} and F, we need the following lemma characterizing a tail property of Gaussian processes.

Lemma B.1.

Let \mathcal{B}(1-b) be a Gaussian process. Then we have

\Pr\left(\sup_{b}|\mathcal{B}(1-b)|\geq r\right)\leq C_{3}e^{-C_{4}r^{2}},

where C3C_{3} and C4C_{4} are constants.

The supremum of a Gaussian process has a Gaussian-type tail. For the standard normal distribution, we have the following well-known inequality:

x12πet22dtC6eC7x2,\int_{x}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{t^{2}}{2}}\,\mathrm{d}t\leq C_{6}e^{-C_{7}x^{2}},

where x0x\geq 0 and C6C_{6}, C7C_{7} are constants. For example, we can take C6=12e12C_{6}=\frac{1}{2}e^{\frac{1}{2}} and C7=12C_{7}=\frac{1}{2} in this inequality.
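A quick numeric check (our addition) of this inequality with the constants named above, using the identity Q(x)=\frac{1}{2}\operatorname{erfc}(x/\sqrt{2}) for the standard normal tail:

```python
import math

# Check the stated Gaussian tail bound with C6 = (1/2) * exp(1/2), C7 = 1/2.
C6, C7 = 0.5 * math.exp(0.5), 0.5
for x in (0.0, 0.5, 1.0, 2.0, 4.0):
    tail = math.erfc(x / math.sqrt(2)) / 2   # Q(x) = P(N(0,1) >= x)
    assert tail <= C6 * math.exp(-C7 * x * x)
    print(f"x = {x}: {tail:.3e} <= {C6 * math.exp(-C7 * x * x):.3e}")
```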

For a general Gaussian process, we can rescale the random variable so that its tail satisfies the same property (a standard consequence of concentration bounds for suprema of Gaussian processes, e.g., the Borell–TIS inequality). Hence there exist constants C_{3} and C_{4} that make Lemma B.1 hold.

Now we bound the estimation error of \hat{F}. First, we have

\Pr\left(\sup_{x}|\sqrt{2^{n}}(\hat{F}(x)-F(x))|\geq r\right)\leq\Pr\left(\sup_{b}|\mathcal{B}(1-b)|\geq r\right)+M(1+r)^{-3}\ln^{2}t\cdot t^{-\frac{1}{6}}.

We show that Lemma 3.10 is established if we take r=\max\{r_{1},r_{2}\}, where r_{1}=\sqrt{\frac{1}{C_{4}}\ln\frac{4C_{3}\ln T}{\delta}} and r_{2}=(4M\ln^{3}T)^{\frac{1}{3}}t^{-\frac{1}{18}}\delta^{-\frac{1}{3}}. Indeed, if r\geq r_{1} then the first term on the right-hand side is at most \frac{\delta}{4\ln T} by Lemma B.1, and if r\geq r_{2} then the second term is at most \frac{\delta}{4\ln T}. Take \delta=T^{-\frac{5}{12}}. Comparing the growth rates of r_{1} and r_{2} shows that there exists a constant C_{5} such that \max\{r_{1},r_{2}\}\leq C_{5}r_{2} for any t and T. Now,

Pr(supx|t2(F^(x)F(x))|C5r2)Pr(supx|2n(F^(x)F(x))|C5r2)12T512lnT\Pr\left(\sup_{x}\left|\sqrt{\frac{t}{2}}(\hat{F}(x)-F(x))\right|\geq C_{5}r_{2}\right)\leq\Pr(\sup_{x}|\sqrt{2^{n}}(\hat{F}(x)-F(x))|\geq C_{5}r_{2})\leq\frac{1}{2T^{\frac{5}{12}}\ln T}

since t2n+1t\leq 2^{n+1}. This concludes the proof of the lemma.
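The growth-rate comparison can be visualized with the short sketch below (our illustration; the placeholder constants C_{3}=C_{4}=M=1 are assumptions, as the paper only asserts their existence): with \delta=T^{-\frac{5}{12}}, r_{1} grows like \sqrt{\ln T} while r_{2} grows polynomially in T, so r_{1}/r_{2}\to 0.

```python
import math

# r1 vs. r2 with placeholder constants C3 = C4 = M = 1 (assumptions).
C3, C4, M = 1.0, 1.0, 1.0
for T in (10**3, 10**6, 10**9):
    delta = T ** (-5 / 12)
    t = T                              # r2 is smallest at t = T
    r1 = math.sqrt(math.log(4 * C3 * math.log(T) / delta) / C4)
    r2 = (4 * M * math.log(T) ** 3) ** (1 / 3) * t ** (-1 / 18) * delta ** (-1 / 3)
    print(f"T = 1e{round(math.log10(T))}: r1 = {r1:.2f}, r2 = {r2:.2f}, "
          f"r1/r2 = {r1 / r2:.3f}")
```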

B.3 Proof of Theorem 3.11

Similar to the proof in the full-feedback case, let us first assume knowledge of G; the regret of learning G will be accounted for later. The proof of Theorem 3.11 is short thanks to the toolset established in the previous sections. As usual, we first condition on the good event that the estimation succeeds for every t, and add the contribution of the bad event to the regret at the end.

In order to bound the final regret, we need to bound the difference between FF and F^n\hat{F}_{n} and then apply the methodology used in the proof of Theorem 3.5.

Dropping the first two rounds, for any 2nt2n+112^{n}\leq t\leq 2^{n+1}-1, the estimation of FF is F^n\hat{F}_{n}. Then using Lemma 3.10, we obtain that, with probability at least 1δ2lnT1-\frac{\delta}{2\ln T},

supx|F^n(x)F(x)|2C5(4Mln3T)13t59T536,\sup_{x}|\hat{F}_{n}(x)-F(x)|\leq\sqrt{2}C_{5}(4M\ln^{3}T)^{\frac{1}{3}}t^{-\frac{5}{9}}T^{\frac{5}{36}},

where \delta=T^{-\frac{5}{12}}. Note that we use the fact that for 1\leq t\leq T, the algorithm updates its estimate at most \lfloor\ln T\rfloor times. Now, the new concentration bound (Lemma 3.8) effectively changes K, C and the convergence rate in Lemma A.3, Lemma A.4 and Lemma A.5. Therefore, by substituting \frac{C}{\sqrt{t}} with \sqrt{2}C_{5}(4M\ln^{3}T)^{\frac{1}{3}}t^{-\frac{5}{9}}T^{\frac{5}{36}}\frac{1+\lambda}{(1-\lambda)^{2}} and \frac{K}{\sqrt{t}} with \sqrt{2}C_{5}(4M\ln^{3}T)^{\frac{1}{3}}t^{-\frac{5}{9}}T^{\frac{5}{36}} in those lemmas, we obtain that, conditioning on the good event (w.p. 1-\frac{\delta}{2}) that the estimation of F succeeds every time,

R14C1T+92(1+λ)2(1λ)2C5(4Mln3T)13T712,R_{1}\leq 4C_{1}\sqrt{T}+\frac{9\sqrt{2}(1+\lambda)}{2(1-\lambda)^{2}}C_{5}(4M\ln^{3}T)^{\frac{1}{3}}T^{\frac{7}{12}},

if Algorithm 2 has the knowledge of GG.
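To make the exponent explicit, each round contributes an estimation error of order t^{-\frac{5}{9}}T^{\frac{5}{36}}, and

\sum_{t=1}^{T}t^{-\frac{5}{9}}\leq 1+\int_{1}^{T}t^{-\frac{5}{9}}\,\mathrm{d}t\leq\frac{9}{4}T^{\frac{4}{9}},\qquad T^{\frac{4}{9}}\cdot T^{\frac{5}{36}}=T^{\frac{16}{36}+\frac{5}{36}}=T^{\frac{7}{12}},

which is where the T^{\frac{7}{12}} rate comes from.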

Next we add the estimation of G. Note that Lemma A.8 decouples the regret of estimating F and G; hence we may apply the same approach as in the proof of Theorem 3.6, taking C=\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}\frac{1+\lambda}{(1-\lambda)^{2}} and K=\sqrt{\frac{1}{2}\ln\frac{4T}{\delta}}, and we have

R14C1T+92(1+λ)2(1λ)2C5(4Mln3T)13T712+[12ln4Tδ2(1+λ)(1λ)2]T,R_{1}\leq 4C_{1}\sqrt{T}+\frac{9\sqrt{2}(1+\lambda)}{2(1-\lambda)^{2}}C_{5}(4M\ln^{3}T)^{\frac{1}{3}}T^{\frac{7}{12}}+\left[\sqrt{\frac{1}{2}\ln{\frac{4T}{\delta}}}\frac{2(1+\lambda)}{(1-\lambda)^{2}}\right]\sqrt{T},

conditioning on the good event (w.p. 1-\delta) that the estimations of F and G succeed every time.

Finally, we use Lemma A.4 and Lemma A.5 to transform R1R_{1} into the final regret, which yields

Regret92(1+λ)2(1λ)2C5(4Mln3T)13T712+[12ln4T17122(1+λ)(1λ)2]T\displaystyle\text{Regret}\leq\frac{9\sqrt{2}(1+\lambda)}{2(1-\lambda)^{2}}C_{5}(4M\ln^{3}T)^{\frac{1}{3}}T^{\frac{7}{12}}+\left[\sqrt{\frac{1}{2}\ln 4T^{\frac{17}{12}}}\frac{2(1+\lambda)}{(1-\lambda)^{2}}\right]\sqrt{T}
+[(4C1+1λ)T+λ2log1λT(1λ)2],\displaystyle+\left[(4C_{1}+1-\lambda)\sqrt{T}+\frac{\lambda}{2}\log_{\frac{1}{\lambda}}\frac{T}{(1-\lambda)^{2}}\right],

conditioning on the good event that the estimations of F and G succeed for every 1\leq t\leq T. Note that \Pr[\text{bad event}]\leq\frac{1}{T^{\frac{5}{12}}}. By using the trivial regret bound T for the failure event, we have

Regret(92(1+λ)2(1λ)2C5(4Mln3T)13+1)T712\displaystyle\text{Regret}\leq\left(\frac{9\sqrt{2}(1+\lambda)}{2(1-\lambda)^{2}}C_{5}(4M\ln^{3}T)^{\frac{1}{3}}+1\right)T^{\frac{7}{12}}
+(12ln4T17122(1+λ)(1λ)2+4C1+1λ)T+λ2log1λT(1λ)2.\displaystyle+\left(\sqrt{\frac{1}{2}\ln{4T^{\frac{17}{12}}}}\frac{2(1+\lambda)}{(1-\lambda)^{2}}+4C_{1}+1-\lambda\right)\sqrt{T}+\frac{\lambda}{2}\log_{\frac{1}{\lambda}}\frac{T}{(1-\lambda)^{2}}.

This concludes the proof.