No-regret Learning in Repeated First-Price Auctions with Budget Constraints
Abstract
Recently the online advertising market has exhibited a gradual shift from second-price auctions to first-price auctions. Although there has been a line of work on online bidding strategies in first-price auctions, it remains open how to handle budget constraints in this problem. In the present paper, we initiate the study of how a budget-constrained buyer should learn online bidding strategies in repeated first-price auctions. We propose an RL-based bidding algorithm that competes against the optimal non-anticipating strategy under stationary competition. Our algorithm obtains -regret if all bids are revealed at the end of each round. Under the restriction that the buyer only sees the winning bid after each round, our modified algorithm obtains -regret using techniques developed from survival analysis. Our analysis extends to the more general scenario where the buyer has any bounded instantaneous utility function, with regrets of the same order.
1 Introduction
The market for online advertising has grown extensively in recent years. It is estimated that the volume of online advertising worldwide will reach 500 billion dollars in 2022 (Statista, 2021). In such a market, advertising platforms use auctions to allocate ad opportunities. Typically, the advertiser has a limited amount of capital for an advertising campaign, and thus all rounds of competition are interconnected through the budget. Furthermore, the advertiser has very limited knowledge of 1) her valuation of certain keywords and 2) the competition she is facing. Many works have been devoted to developing algorithms for learning to optimally spend the budget in repeated second-price auctions (see Section 1.1).
Nevertheless, we have witnessed numerous switches from second-price auctions to first-price auctions in the online advertising market. A recent remarkable example is Google AdSense's integrated move at the end of 2021 (LLC, 2021). Earlier examples include AppNexus, Index Exchange, and OpenX (Sluis, 2017). This industry-wide shift is due to various factors, including a fairer transactional process and increased transparency. Further, the shift to first-price auctions lends major importance to the following open question, which has barely been considered in previous work:
How should budget-constrained advertisers learn to compete in repeated first-price auctions?
This paper thus initiates the study of learning to bid with budget constraints in repeated first-price auctions. We note that the application of first-price auctions with budgets is not limited to the online advertising setting mentioned above. Traditional competitive environments such as the mussel trade in the Netherlands (van Schaik et al., 2001), modern price competition, and procurement auctions (e.g., the U.S. Treasury Securities auction (Chari and Weber, 1992)) are examples as well.
Challenges and contributions
The challenges in this setting are two-fold.
The first challenge relates to the information structure. In practical auctions, it is often the case that only the highest bid is revealed (Esponda, 2008). This is known in the literature as censored feedback and is related to the winner's curse (Capen et al., 1971). It affects the information structure of learning, since the buyer obtains less information when she wins. This makes the problem challenging compared to standard contextual bandits (cf. Section 1.1).
The second challenge is more fundamental. Bidding strategies in first-price auctions are notoriously complex to analyze, even in the static case (Lebrun, 1996). To get an intuitive feeling for this difficulty compared to repeated second-price auctions, let us consider the offline case where the opponents' bids are all known. Given the budget, the problem for second-price auctions reduces to a pure knapsack problem, where the budget is regarded as the weight capacity and prices as weights. This structure allows mature techniques, including duality theory, to be applied to study the benchmark strategy. Unfortunately, in first-price auctions the payment depends on the buyer's own bid, so the previous approach and benchmark are not directly usable. We provide a concrete example to further illustrate this difficulty.
Example 1.1.
Consider a case where the buyer's value follows a uniform distribution on and the highest bid of the opponents follows a uniform distribution on . The budget is set to be half of the time horizon . Now, the first-best benchmark (an anticipating strategy; an algorithm is anticipating if bid selection depends on future observations, see Flajolet and Jaillet (2017)) is
In hindsight, we need to pay as little as possible. Using the theory of knapsacks, the utility turns out to be . On the contrary, the optimal non-anticipating bidding strategy in a first-price auction is to bid and the utility is . There is already a separation between the first-best benchmark and the ideal case with full information.
This example shows that the simple characterizations of the optimal strategy in previous works are not applicable here. Indeed, it remains unclear what methodology can be applied to first-price auctions with budgets (see (Balseiro and Gur, 2019, §2.4) for further discussions).
The present paper takes the first step to combat the challenges mentioned above with a dynamic programming approach. Correspondingly, our contribution is also two-fold:
• We provide an RL-based learning algorithm. Through the characterization of the optimal strategy, we obtain a -regret guarantee for the algorithm in the full-feedback case (this case is especially practical in public-sector auctions (Chari and Weber, 1992), as regulations mandate that all bids be revealed).
• In the censored-feedback setting, using techniques developed from survival analysis, we modify our algorithm and obtain a regret of .
1.1 Related Work
Repeated second-price auctions with budgets
There is a flourishing literature on bidding strategies in repeated auctions with budgets. Through the lens of online learning, Balseiro and Gur (2019) identify asymptotically optimal online bidding strategies known as pacing (also known as bid shading in the literature) in repeated second-price auctions with budgets. Inspired by the pacing strategy, Flajolet and Jaillet (2017) develop no-regret non-anticipating algorithms for learning with contextual information in repeated second-price auctions. Another line of work that uses techniques similar to those in the present paper includes Amin et al. (2012); Tran-Thanh et al. (2014); Gummadi et al. (2012). Gummadi et al. (2012) and Amin et al. (2012) study bidding strategies in repeated second-price auctions with budget constraints, but the former does not involve any learning and the latter does not provide any regret analysis (their estimator is biased). Tran-Thanh et al. (2014) derive regret bounds for the same scenario, but the optimization objective is the number of items won instead of value or surplus. Baltaoglu et al. (2017) also use dynamic programming to tackle repeated second-price auctions with budgets. However, they assume per-round budget constraints, and their dynamic programming algorithm is for allocating bids among multiple items. Again, we emphasize that no prior work addresses repeated first-price auctions with budgets, since the structure of the problem (compared to its second-price variant) is fundamentally different (recall Example 1.1).
Repeated first-price auctions without budgets
Two notable works concerning repeated first-price auctions are Han et al. (2020b) and Han et al. (2020a). Han et al. (2020b) introduce a new problem called monotone group contextual bandits and obtain an -regret algorithm for repeated first-price auctions without budget constraints under stationary settings. Han et al. (2020a) concentrate on an adversarial setting and develop a minimax-optimal online bidding algorithm with regret against all Lipschitz bidding strategies. A crucial difference is that in the present paper budgets are involved, and thus the bandit algorithm of Han et al. (2020b) is not suitable for our needs.
Bandits with knapsacks
From the bandit side, Badanidiyuru et al. (2013) develop bandit algorithms under resource constraints. They show that their algorithm can be used in dynamic procurement, dynamic posted pricing with limited supply, etc. However, since the bidder observes her value before bidding in our problem, the results of Badanidiyuru et al. (2013) cannot be directly applied to our setting. Our setting also relates to contextual bandit problems with resource constraints (Badanidiyuru et al., 2014; Agrawal and Devanur, 2016; Agrawal et al., 2016). Nevertheless, applying this contextual bandit approach requires discretizing the action space, which needs Lipschitz continuity of the distributions. Our approach does not rely on any continuity assumption. Further, the performance guarantee (typically ) is worse than ours. It also does not fit our information structure when the feedback is censored.
2 Preliminaries
Auction mechanism
We consider a repeated first-price auction with budgets. Specifically, we suppose that the buyer has a limited budget to spend over a time horizon of rounds (which can be unknown to her). At the beginning of each round , the bidder privately observes a value for a fresh copy of the item and bids according to her past observations and value . Denote the strategy she employs by , which maps her current budget , value , and past history to a bid. Let be the maximum bid of the other bidders. Since the auction is a first-price auction, if is larger than , then the buyer wins the auction, is charged , and obtains a utility of ; otherwise she loses and . Therefore, the instantaneous utility of the buyer is
The exact information structure of the history the buyer observes is dictated by how the mechanism reveals . In full generality, we assume that the feedback is censored, i.e., only the highest bid is revealed at the end of each round and the winner does not observe exactly. This is considered an informational version of the winner's curse (Capen et al., 1971) and is of practical interest (Esponda, 2008). For modeling purposes, we suppose that ties are broken in favor of the buyer, but this choice is arbitrary and by no means a limitation of our approach.
Next, we state the assumptions on and . Without loss of generality, we assume that are normalized to lie in . In the present paper, we consider a stochastic setting where and are drawn from distributions unknown to the buyer, respectively, independently of the past. We will also refer to the cumulative distribution functions of by the same notation. No further assumptions are made on . Now, the expected instantaneous utility of the buyer at time with strategy is
To argue that this assumption is reasonable, note that although other buyers may also exhibit learning behavior, a real advertising market typically contains a large number of buyers. A specific buyer only faces a small, changing portion of them and is completely oblivious to whom she is facing in each round. Therefore, the buyer's sole objective is to maximize her utility (instead of fooling other buyers), and by the law of large numbers, the price and value the buyer observes are independent and identically distributed, at least for a period of time.
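For concreteness, the following minimal Python sketch simulates this stationary model for a single bidder; the uniform distributions, the symbol names (value v, highest competing bid d, bid b), and the fixed shading rule are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_repeated_fpa(T=10_000, budget=2_500.0, shade=0.5):
    """Simulate a budget-constrained bidder in repeated first-price auctions.

    Assumptions (illustrative only): values v and highest competing bids d are
    i.i.d. Uniform[0, 1]; the bidder uses a fixed shading rule b = shade * v,
    truncated to the remaining budget.
    """
    remaining, utility = budget, 0.0
    for _ in range(T):
        v = rng.uniform()              # private value, observed before bidding
        d = rng.uniform()              # highest competing bid
        b = min(shade * v, remaining)  # never bid more than the remaining budget
        if b >= d:                     # ties broken in favor of the buyer
            utility += v - b           # first price: the winner pays her own bid
            remaining -= b
        if remaining <= 0:             # budget exhausted: the bidder must stop
            break
    return utility

print(simulate_repeated_fpa())
```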
Buyer’s target and regret
The buyer aims to maximize her long-term accumulated utility subject to the budget constraint. Recall that the instantaneous utility of the buyer is . The payment is , and the budget decreases accordingly as payments are incurred. She can continue to bid as long as the budget has not run out, but must stop at
The buyer's problem now becomes determining how much to bid in each round to maximize her accumulated utility. In line with Gummadi et al. (2012); Golrezaei et al. (2019); Deng and Zhang (2021), the buyer adopts a discount factor . She discounts since she does not know or . Discount factors can be interpreted as the estimated probability that the repeated auction will last for at least rounds (Devanur et al., 2014; Drutsa, 2018). On the economic side, in important real-world markets like online advertising platforms, buyers are impatient for opportunities since companies of different sizes have different capabilities. Discount factors model how impatient the buyer is (Drutsa, 2017; Vanunts and Drutsa, 2019). Now the buyer's optimization problem is to determine a non-anticipating strategy for the following:
where . Here, denotes the sequence of private values the buyer observes and is the sequence of the highest bids of the other bidders. denotes the expected accumulated utility using strategy with budget and starting from time . Let denote the optimal bidding strategy with the knowledge of the underlying distributions and . The corresponding expected accumulated utility is .
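Since the displayed program did not survive extraction, the following is only a plausible reconstruction of the buyer's discounted, budget-constrained objective under the model described above; the symbols $v_t$ (value), $b_t$ (bid), $d_t$ (highest competing bid), $\gamma$ (discount factor), and $B$ (budget) are placeholders of ours and need not match the paper's notation:

```latex
\max_{\pi\ \text{non-anticipating}} \;
\mathbb{E}\!\left[\sum_{t\ge 1}\gamma^{\,t-1}\,(v_t-b_t)\,\mathbf{1}\{b_t\ge d_t\}\right]
\quad\text{subject to}\quad
\sum_{t\ge 1} b_t\,\mathbf{1}\{b_t\ge d_t\}\;\le\; B .
```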
We now define the regret. Given time , the regret is defined as the difference between the accumulated utility of the buyer's bidding strategy and that of the optimal non-anticipating bidding strategy:
(1)
3 Bidding Algorithm and Analysis
In this section, we present our bidding algorithm and the high-level ideas in the regret analysis. We first consider the case where the feedback is not censored, i.e., the buyer observes regardless of whether she wins. We then extend our algorithm to the case of censored feedback using techniques developed from survival analysis.
3.1 Full Feedback
When and are known, the buyer’s problem can be viewed as offline. The technical challenge lies in the observation that even when the distributions are known, the buyer’s problem cannot be directly analyzed as a knapsack problem. To tackle this challenge, we use a dynamic programming approach to solve the problem. In particular, the optimal strategy satisfies the following Bellman equation:
for all and . Note that for any , . By choosing appropriate initialization conditions, we can solve the equation recursively and obtain the optimal bidding strategy together with the function . The above recursion also characterizes the optimal solution, which will be used in the analysis later.
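As a concrete illustration of the recursive computation, the sketch below runs backward induction over a discretized budget and bid grid. The Bellman equation itself did not survive extraction, so the recursion used here (win probability given by the c.d.f. G of the highest competing bid, discounted continuation values, zero terminal values) is a plausible reconstruction under stated assumptions, not the paper's exact formulation.

```python
import numpy as np

def solve_bellman(G, gamma, horizon, budget_grid, bid_grid):
    """Backward induction for a plausible form of the bidding Bellman recursion.

    Assumptions (ours, for illustration): G(b) is the c.d.f. of the highest
    competing bid, so G(b) is the win probability when bidding b (ties favor
    the buyer); the private value is uniform on `bid_grid`; and the value-to-go
    beyond `horizon` is approximated by zero (the discount makes the tail
    negligible).
    """
    n = len(budget_grid)
    V = np.zeros((horizon + 2, n))           # V[t, i]: value-to-go at time t with budget_grid[i] left
    for t in range(horizon, 0, -1):
        for i, B in enumerate(budget_grid):
            total = 0.0
            for v in bid_grid:               # expectation over the fresh private value
                best = 0.0
                for b in bid_grid[bid_grid <= min(B, v)]:
                    # nearest budget level at or above B - b (coarse approximation)
                    j = int(np.searchsorted(budget_grid, B - b))
                    win = G(b) * (v - b + gamma * V[t + 1, j])
                    lose = (1.0 - G(b)) * gamma * V[t + 1, i]
                    best = max(best, win + lose)
                total += best
            V[t, i] = total / len(bid_grid)
    return V

# Toy instance: highest competing bid Uniform[0, 1].
V = solve_bellman(G=lambda b: float(np.clip(b, 0.0, 1.0)), gamma=0.95, horizon=30,
                  budget_grid=np.linspace(0.0, 3.0, 16), bid_grid=np.linspace(0.0, 1.0, 11))
print(V[1, -1])   # estimated value-to-go at t = 1 with the full budget
```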
When the buyer does not know and , she can learn the distributions from past observations. Therefore, it is natural to maintain estimates and of the true distributions. Our algorithm for the full-feedback case is depicted in Algorithm 1. To ease the technical load, we first assume knowledge of and only estimate in Algorithm 1. Later, we add the estimation of ; its analysis is presented in Theorem 3.6.
Similar to prior work (Amin et al., 2012), Algorithm 1 performs exploration and exploitation simultaneously. The buyer initializes the estimate of to a uniform distribution (2). At round , the buyer observes her valuation. Then, she uses her estimate of to solve the dynamic programming problem recursively (for the non-trivial case , this can be solved in time with only loss; see, e.g., Chow and Tsitsiklis, 1989) to obtain an estimate of the optimal bid (7~11). To provide a base case for the recursion, note that for sufficiently large , 's impact on is negligible due to the discount . Therefore, the buyer can approximate with zero for (5). Finally, the auction proceeds with revealed, and the buyer updates her information accordingly (13~14).
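The following self-contained sketch shows the shape of this explore-while-exploiting loop: maintain an empirical c.d.f. of the observed highest bids, re-plan each round, bid, and update. To keep the sketch short, the dynamic programming step is abstracted into a `planner` callable and replaced by a myopic stand-in that maximizes the estimated one-round utility; in the paper's Algorithm 1 this step would instead solve the Bellman recursion sketched above. All names and distributions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_cdf(samples):
    """Empirical c.d.f. built from past observations of the highest competing bid."""
    data = np.sort(np.asarray(samples, dtype=float))
    return lambda b: np.searchsorted(data, b, side="right") / len(data)

def myopic_planner(G_hat, remaining, v, bid_grid):
    """Stand-in for the dynamic programming step: maximize G_hat(b) * (v - b)."""
    feasible = bid_grid[bid_grid <= min(remaining, v)]
    return max(feasible, key=lambda b: G_hat(b) * (v - b), default=0.0)

def full_feedback_loop(T=5_000, budget=1_250.0, planner=myopic_planner):
    bid_grid = np.linspace(0.0, 1.0, 101)
    observed_d, remaining, utility = [], budget, 0.0
    for _ in range(T):
        if remaining <= 0:                         # stop once the budget runs out
            break
        v = rng.uniform()                          # observe the fresh private value
        # Uniform initialization of the estimate, as described for Algorithm 1.
        G_hat = empirical_cdf(observed_d) if observed_d else (lambda b: b)
        b = planner(G_hat, remaining, v, bid_grid)
        d = rng.uniform()                          # highest competing bid this round
        if b >= d:                                 # win: pay own bid (first price)
            utility += v - b
            remaining -= b
        observed_d.append(d)                       # full feedback: d revealed win or lose
    return utility

print(full_feedback_loop())
```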
Analysis of regret
We establish the regret bound in two steps. First, we show that the buyer's estimate is approximately accurate given a sufficient number of samples. This relies on the estimation of ; concentration inequalities are thus intrinsic to our analysis. Second, we bridge the regret defined in Equation 1 with and . This is done through a series of auxiliary quantities measuring the regret. Combining the two steps yields the desired result.
We first present the following lemma (Dvoretzky et al., 1956; Massart, 1990), which states a uniform convergence result for the estimation of cumulative distribution functions.
Lemma 3.1 (Dvoretzky–Kiefer–Wolfowitz).
Given $n$, let $X_1,\ldots,X_n$ be real-valued independent and identically distributed random variables with cumulative distribution function $F$. Let $F_n$ denote the associated empirical distribution function defined by $F_n(x)=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{X_i\le x\}$. Then with probability $1-\delta$, it holds that
$$\sup_{x\in\mathbb{R}}\,\bigl|F_n(x)-F(x)\bigr|\;\le\;\sqrt{\frac{\ln(2/\delta)}{2n}}.$$
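As a quick, purely illustrative sanity check (not part of the paper's analysis), the following Monte Carlo experiment verifies that the empirical violation rate of the bound above stays below $\delta$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Check the DKW bound: with probability 1 - delta, the sup-distance between the
# empirical and true c.d.f. is at most sqrt(ln(2 / delta) / (2 n)).
n, delta, trials = 2_000, 0.05, 500
bound = np.sqrt(np.log(2 / delta) / (2 * n))
grid = np.linspace(0.0, 1.0, 1_000)
violations = 0
for _ in range(trials):
    x = np.sort(rng.uniform(size=n))                   # true c.d.f. is F(t) = t on [0, 1]
    F_n = np.searchsorted(x, grid, side="right") / n   # empirical c.d.f. evaluated on a grid
    violations += np.max(np.abs(F_n - grid)) > bound   # grid maximum approximates the sup
print(f"bound = {bound:.4f}, violation rate = {violations / trials:.3f} (should be <= {delta})")
```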
With Lemma 3.1 in hand, given any , we show a bound for the difference between and for any (recall that is the accumulated utility of the optimal strategy with the knowledge of ). We prove this using induction. Note that the induction basis is a little tricky due to the base case of recursion in Algorithm 1. Therefore, in the following lemma, we first deal with the induction step.
Lemma 3.2.
For any round , budget and with probability at least , we have the following bounds for the estimated and the ground truth with if :
Note that the lemma above assumes that is bounded above by , which holds if the base case is set accurately (i.e. ). We use to denote the estimated value function using if the base is indeed accurate. In the following lemma, we show that using the alternative initialization method specified in 5 of Algorithm 1, actually satisfies the condition of Lemma 3.2.
Lemma 3.3.
Suppose and is then computed by the recursive procedure in Algorithm 1. Then it holds that for any and :
In particular, when , we have (by construction of ).
Synthesizing Lemma 3.2 and Lemma 3.3, we have . A crucial next step is to relate this bound to the final regret. This is achieved via two transformations. Roughly speaking, the buyer's “regret” can be viewed in two parts: 1) she does not bid according to the optimal strategy; 2) her strategy does not spend the budget optimally, which leads to future losses. The two transformations are guided by this intuitive observation and are summarized in the following lemma, which bounds the performance of the buyer's strategy. Below we first condition on the good event that the estimation succeeds for every . We then add the contribution of the bad event to the regret in Theorem 3.5.
Lemma 3.4.
For any given and , denote , then
By Lemma 3.4 and further transformations, we can now establish the regret bound of Algorithm 1.
Theorem 3.5.
Under the circumstance that is unknown, the worst-case regret of Algorithm 1 is , where the regret is computed according to Equation 1. Explicitly, if we take ,
Note that the regret bound is meaningful only if . Let us now take the budget to scale linearly with , as in Balseiro and Gur (2019); Flajolet and Jaillet (2017). Specifically, assume that and that there exists a constant such that the budget ; then the regret is in this special case. Indeed, under this condition, we can simply set and for any in Algorithm 1. Therefore, and the worst-case regret is bounded by .
Next we deal with the case where is also initially unknown. Based on Algorithm 1, we additionally maintain an estimation of based on past observations of valuations. is initialized to be a uniform distribution and will be used to solve the dynamic programming problem (see 7 of Algorithm 2). Using similar techniques as before (with more work), we obtain the following theorem.
Theorem 3.6.
Under the circumstance that are both unknown, it holds that the worst-case regret of Algorithm 1 using empirical distribution functions to estimate and is . Explicitly, if we take ,
3.2 Censored Feedback
In this subsection, we deal with the case that the buyer can only see the winner’s bid after each round. In other words, the feedback is left-censored. Concretely, the buyer’s observation is
When she wins, the exact value of is not revealed; the buyer only knows that is smaller than her bid in the current round. A classical statistical tool for estimating in this scenario is the Kaplan–Meier (KM) estimator developed by Kaplan and Meier (1958). However, the KM estimator assumes that the censoring sequence is deterministic, which does not fit our needs. Although the extension by Suzukawa (2004) allows random censorship, it requires independence between and , which is not realistic here since we use the estimated distribution to place bids.
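For reference, here is a minimal sketch of the classical Kaplan–Meier product-limit estimator mentioned above (the textbook formula $S(t)=\prod_{t_i\le t}(1-d_i/n_i)$); it is shown only for contrast and is not the covariate-adjusted estimator of Zeng (2004) that our algorithm uses.

```python
import numpy as np

def kaplan_meier(times, observed):
    """Classical Kaplan-Meier product-limit estimate of the survival function.

    `times`    : observed values min(X_i, C_i) for right-censored samples
    `observed` : True if the event X_i was observed, False if it was censored
    Returns the distinct event times t_i and S(t_i) = prod_{t_j <= t_i} (1 - d_j / n_j),
    where d_j counts events at t_j and n_j counts samples still at risk at t_j.
    """
    times = np.asarray(times, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    event_times, survival, S = [], [], 1.0
    for t in np.unique(times[observed]):
        at_risk = np.sum(times >= t)                 # n_j: still at risk just before t
        events = np.sum((times == t) & observed)     # d_j: uncensored observations at t
        S *= 1.0 - events / at_risk
        event_times.append(t)
        survival.append(S)
    return np.array(event_times), np.array(survival)

# Toy usage: exponential event times, right-censored by independent uniform times.
rng = np.random.default_rng(3)
X, C = rng.exponential(1.0, 500), rng.uniform(0.0, 2.0, 500)
t, S = kaplan_meier(np.minimum(X, C), X <= C)
print(t[:3], S[:3])
```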
To tackle this problem, we introduce an estimator proposed by Zeng (2004), denoted by , to take the place of the empirical distribution used previously in Algorithm 1.
Estimation procedure
We now present the procedure for estimating under censored feedback. It suffices to estimate the distribution function of which is right-censored by . Let . The observations can now be described as .
Roughly speaking, to decouple the dependency between , we use the fact that and are independent conditioning on . Intuitively, the history provides information for obtaining enough effective samples for . Next, we build models to estimate the hazard rate functions of using (the hazard rate function of a random variable with p.d.f. $f$ and c.d.f. $F$ is $f(t)/\bigl(1-F(t)\bigr)$). With the hazard rate functions, we use the maximum likelihood method with a kernel to compute the final estimate and obtain Equation 3.
The details follow. We initialize with a sequence (bandwidth) such that as , and a symmetric kernel function with bounded gradient. Now, at each time , we compute two vectors which maximize each of the following:
(2)
We arbitrarily pad with zeros to make their lengths equal (we will show in a moment that this is without loss of generality). Compute . The survival function of , or equivalently the cumulative distribution function of , is estimated via Zeng (2004)'s estimator
(3)
Now, we are ready to apply the estimator; the algorithm for censored feedback is depicted in Algorithm 2. Note that the new estimator's convergence rate is slower than that of the full-feedback case. Therefore, compared to Algorithm 1, Algorithm 2 is a multi-phase algorithm: it only updates the estimate of at the end of each phase (see Figure 1 for an illustration). The other elements of each phase in Algorithm 2 are similar to those of Algorithm 1.
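A minimal sketch of this multi-phase structure follows; the doubling phase lengths are an illustrative assumption of ours, whereas Algorithm 2 chooses its phase lengths to match the slower convergence rate of the censored-feedback estimator.

```python
def phase_boundaries(T, first_length=1):
    """Return the rounds at which a doubling-phase schedule refreshes its estimate."""
    t, length, ends = 0, first_length, []
    while t < T:
        t = min(t + length, T)   # the estimate of the bid distribution is frozen within a phase
        ends.append(t)           # ... and refreshed only at these phase boundaries
        length *= 2
    return ends

print(phase_boundaries(100))     # e.g. [1, 3, 7, 15, 31, 63, 100]
```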
Analysis of regret
To analyze the performance of Algorithm 2, we will prove a series of lemmas on the estimation error of Equation 3. In particular, our proof relies on the following convergence result.
Lemma 3.7 (Zeng).
Let be the estimation of after using observations. We have
where is a Gaussian process.
Before we proceed to apply the lemma, we verify a series of prerequisites from Zeng (2004) to make sure it holds. First, we make sure that, conditioning on , the random variables and are independent. Indeed, is completely decided by and is independent of . Second, we note that the maximizer shown in Equation 2 is essentially performing Cox's proportional hazards regression. To establish consistency of the estimator, we show that at least one of and follows Cox's proportional hazards model. That is to say, there exist and a function such that the hazard rate function of or conditioning on exactly follows
(4)
Equation 4 holds for . In fact, taking and suffices. Since we take , consistency is established regardless of how we pad .
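For reference, the standard Cox proportional hazards form that Equation 4 instantiates is the textbook model below; the generic covariate $Z$, coefficient $\beta$, and baseline hazard $\lambda_0$ are placeholders rather than the paper's notation:

```latex
\lambda(t \mid Z) \;=\; \lambda_0(t)\,\exp\!\bigl(\beta^{\top} Z\bigr).
```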
Next, consider some phase at time . The estimate is computed using the first observed data points. Applying techniques similar to those for the rate of convergence of empirical processes (Norvaiša and Paulauskas, 1991), we obtain
Lemma 3.8.
Under Algorithm 2, let be an empirical process of updates and be a general Gaussian process, respectively, indexed by a class of real measurable functions. We have:
where is a constant depending only on and is the size of data used to update the estimation.
Synthesizing the convergence results in Lemma 3.7 and Lemma 3.8, we establish the following lemma on the performance of in Algorithm 2.
Lemma 3.9.
Under the update process in Algorithm 2, for any , we have the following bounds for the estimation :
where is a constant depending only on and Algorithm 2.
Finally, we now bound the difference between and with the help of Lemma 3.9.
Lemma 3.10.
Recall that we use the first data points to estimate . Under the update procedure of Algorithm 2, for any , with probability at least
With Lemma 3.10 in hand and employing a methodology similar to that in the proof of Theorem 3.5, we now have
Theorem 3.11.
Under the circumstance that are both unknown and the feedback is censored, the worst-case regret of Algorithm 2 is . Explicitly, if we take ,
Remark 3.12.
The setting of Han et al. (2020b) is a special case of ours. In Han et al. (2020b), there are no budget constraints and (thus removing the factor in our results), and the buyer's aim is to maximize in each round. This is equivalent to the case in our setting with no need to estimate , yielding regret in the full-feedback case and regret in the censored-feedback case.
4 Discussion and Conclusion
In this paper, we develop a learning algorithm to adaptively bid in repeated first-price auctions with budgets. On the theoretical side, our algorithm, together with its analysis of -regret in the full-feedback case and -regret in the censored-feedback case, takes the first step in understanding the problem. On the practical side, our algorithm is simple and readily applicable to the digital world that has shifted to first-price auctions.
Several questions present themselves for future exploration. We observe that in the limiting case , the optimal bidding strategy in Algorithm 2 is similar to a pacing strategy, which partially answers the open question raised in Balseiro and Gur (2019): “A question of theoretical and practical interest is how to extend the adaptive pacing approach that is suggested in this paper to nontruthful mechanisms such as first-price auctions.” In the limiting case of , the optimal bid of Algorithm 2 is of the form , where is a pacing multiplier that depends only on and . Further, can be computed without explicitly solving the dynamic programming problem. This observation can be viewed as a direct corollary of Theorem 3.1 in Gummadi et al. (2012). The connection between Algorithm 2 and pacing suggests extending current adaptive pacing techniques to other auction forms as a future direction. (We note that, in parallel to our work, Gaitonde et al. (2022) extend pacing techniques to a class of auction forms called core auctions; they obtain -regret bounds against the best linear policy under the value-maximization objective.) Other immediate open questions include establishing lower bounds on the regret for our problem.
References
- Agrawal and Devanur [2016] Shipra Agrawal and Nikhil Devanur. Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems, 29, 2016.
- Agrawal et al. [2016] Shipra Agrawal, Nikhil R. Devanur, and Lihong Li. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. In Conference on Learning Theory, pages 4–18. PMLR, 2016.
- Amin et al. [2012] Kareem Amin, Michael Kearns, Peter Key, and Anton Schwaighofer. Budget optimization for sponsored search: Censored learning in MDPs. arXiv preprint arXiv:1210.4847, 2012.
- Badanidiyuru et al. [2013] Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 207–216. IEEE, 2013.
- Badanidiyuru et al. [2014] Ashwinkumar Badanidiyuru, John Langford, and Aleksandrs Slivkins. Resourceful contextual bandits. In Conference on Learning Theory, pages 1109–1134. PMLR, 2014.
- Balseiro and Gur [2019] Santiago R. Balseiro and Yonatan Gur. Learning in repeated auctions with budgets: Regret minimization and equilibrium. Management Science, 65(9):3952–3968, 2019.
- Baltaoglu et al. [2017] M. Sevi Baltaoglu, Lang Tong, and Qing Zhao. Online learning of optimal bidding strategy in repeated multi-commodity auctions. Advances in Neural Information Processing Systems, 30, 2017.
- Capen et al. [1971] Edward C. Capen, Robert V. Clapp, and William M. Campbell. Competitive bidding in high-risk situations. Journal of Petroleum Technology, 23(06):641–653, 1971.
- Chari and Weber [1992] V. Chari and Robert Weber. How the US Treasury should auction its debt. Quarterly Review, 16(4), 1992.
- Chow and Tsitsiklis [1989] Chee-Seng Chow and John N. Tsitsiklis. The complexity of dynamic programming. Journal of Complexity, 5(4):466–488, 1989.
- Deng and Zhang [2021] Yuan Deng and Hanrui Zhang. Prior-independent dynamic auctions for a value-maximizing buyer. Advances in Neural Information Processing Systems, 34, 2021.
- Devanur et al. [2014] Nikhil R. Devanur, Yuval Peres, and Balasubramanian Sivan. Perfect Bayesian equilibria in repeated sales. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 983–1002. SIAM, 2014.
- Drutsa [2017] Alexey Drutsa. Horizon-independent optimal pricing in repeated auctions with truthful and strategic buyers. In Proceedings of the 26th International Conference on World Wide Web, pages 33–42, 2017.
- Drutsa [2018] Alexey Drutsa. Weakly consistent optimal pricing algorithms in repeated posted-price auctions with strategic buyer. In International Conference on Machine Learning, pages 1319–1328. PMLR, 2018.
- Dvoretzky et al. [1956] Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642–669, 1956.
- Esponda [2008] Ignacio Esponda. Information feedback in first price auctions. The RAND Journal of Economics, 39(2):491–508, 2008.
- Flajolet and Jaillet [2017] Arthur Flajolet and Patrick Jaillet. Real-time bidding with side information. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, 2017.
- Gaitonde et al. [2022] Jason Gaitonde, Yingkai Li, Bar Light, Brendan Lucier, and Aleksandrs Slivkins. Budget Pacing in Repeated Auctions: Regret and Efficiency without Convergence. arXiv e-prints, art. arXiv:2205.08674, May 2022.
- Golrezaei et al. [2019] Negin Golrezaei, Adel Javanmard, and Vahab Mirrokni. Dynamic incentive-aware learning: Robust pricing in contextual auctions. Advances in Neural Information Processing Systems, 32, 2019.
- Gummadi et al. [2012] Ramakrishna Gummadi, Peter Key, and Alexandre Proutiere. Repeated auctions under budget constraints: Optimal bidding strategies and equilibria. In the Eighth Ad Auction Workshop, volume 4, 2012.
- Han et al. [2020a] Yanjun Han, Zhengyuan Zhou, Aaron Flores, Erik Ordentlich, and Tsachy Weissman. Learning to bid optimally and efficiently in adversarial first-price auctions. arXiv preprint arXiv:2007.04568, 2020a.
- Han et al. [2020b] Yanjun Han, Zhengyuan Zhou, and Tsachy Weissman. Optimal no-regret learning in repeated first-price auctions. arXiv preprint arXiv:2003.09795, 2020b.
- He et al. [2021] Jiafan He, Dongruo Zhou, and Quanquan Gu. Nearly minimax optimal reinforcement learning for discounted MDPs. Advances in Neural Information Processing Systems, 34:22288–22300, 2021.
- Kaplan and Meier [1958] Edward L. Kaplan and Paul Meier. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282):457–481, 1958.
- Lebrun [1996] Bernard Lebrun. Existence of an equilibrium in first price auctions. Economic Theory, 7(3):421–443, 1996.
- Liu and Su [2020] Shuang Liu and Hao Su. Regret bounds for discounted MDPs. arXiv preprint arXiv:2002.05138, 2020.
- LLC [2021] Google LLC. FAQs about AdSense moving to a first-price auction - Google AdSense Help. https://support.google.com/adsense/answer/10858748?hl=en, 2021. Accessed: 2022-05-06.
- Massart [1990] Pascal Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, pages 1269–1283, 1990.
- Norvaiša and Paulauskas [1991] R. Norvaiša and V. Paulauskas. Rate of convergence in the central limit theorem for empirical processes. Journal of Theoretical Probability, 4(3):511–534, 1991.
- Sluis [2017] Sarah Sluis. Big changes coming to auctions, as exchanges roll the dice on first-price. https://www.adexchanger.com/platforms/big-changes-coming-auctions-exchanges-roll-dice-first-price/, 2017.
- Statista [2021] Statista. Digital advertising spending worldwide from 2019 to 2024 (in billion U.S. dollars). https://www.statista.com/statistics/237974/online-advertising-spending-worldwide/, 2021. Accessed: 2022-05-05.
- Suzukawa [2004] Akio Suzukawa. Unbiased estimation of functionals under random censorship. Journal of the Japan Statistical Society, 34(2):153–172, 2004.
- Tran-Thanh et al. [2014] Long Tran-Thanh, Lampros Stavrogiannis, Victor Naroditskiy, Valentin Robu, Nicholas R Jennings, and Peter Key. Efficient regret bounds for online bid optimisation in budget-limited sponsored search auctions. In 30th Conference on Uncertainty in AI, 2014.
- van Schaik et al. [2001] Frans D. J. van Schaik, Jack P. C. Kleijnen, et al. Sealed-bid auctions: case study. Center for Economic Research, 2001.
- Vanunts and Drutsa [2019] Arsenii Vanunts and Alexey Drutsa. Optimal pricing in repeated posted-price auctions with different patience of the seller and the buyer. Advances in Neural Information Processing Systems, 32, 2019.
- Yang et al. [2021] Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2021.
- Zeng [2004] Donglin Zeng. Estimating marginal survival function by adjusting for dependent censoring using many covariates. The Annals of Statistics, 32(4):1533–1555, 2004.
- Zhou et al. [2021] Dongruo Zhou, Jiafan He, and Quanquan Gu. Provably efficient reinforcement learning for discounted MDPs with feature mapping. In International Conference on Machine Learning, pages 12793–12802. PMLR, 2021.
Appendix A Omitted Proofs in Section 3.1
A.1 Proof of Lemma 3.2
We will use backward induction to show that
The inequality holds trivially with the condition of the lemma for the basis (). Suppose the bound holds for . Now we write out the difference of the value functions
where is Algorithm 1's estimated optimal bid and is the bid of the benchmark. The inequality is established by noting that is sub-optimal under . Next, consider the term inside the expectation, which is rewritten as follows:
To bound the above equation, we deal with the three terms separately. Using Lemma 3.1 and a union bound over rounds, with probability at least . By the induction hypothesis and the fact that any value function is bounded by , we have
Therefore,
This establishes that . Finally, by symmetry (swap and , and repeat the proof above), it holds that
This finishes the induction step and the claim follows.
A.2 Proof of Lemma 3.3
We will state a general form of Lemma 3.3 concerning the error in the initialization of the base case. This lemma will come in handy in the following sections.
Lemma A.1.
For any fixed distributions , consider the value function . Suppose we use an arbitrary value in to initialize the base case and recursively compute thereon to obtain , then it holds that for any :
Proof.
When , the claim holds because and are both upper bounded by and lower bounded by 0.
Suppose the claim holds when ; then for , we have
In the derivation above, denotes the optimal bidding strategy obtained by computing . The first inequality holds since is not necessarily the optimal bidding strategy under 's view. The second inequality holds since for any .
And by symmetry, we have . This concludes the induction step and yields the lemma. ∎
In particular, using Lemma A.1, under the condition of Lemma 3.3, the initialization is taken to be 0. We have
Remark A.2.
For convenience, similar to the notations used in this lemma, for any value function , we will use to denote its approximately-initialized counterpart. Furthermore, we will invoke the lemma many times for other value functions in the rest of the proofs.
A.3 Proof of Theorem 3.5
Below we first condition on the good event that the estimation (of Lemma 3.1) succeeds for every . We then add the contribution of the bad event to the regret in Theorem 3.5.
To prove this theorem, we will first bound the following auxiliary “regret” with Lemma 3.2 and Lemma A.1.
Let us first give an intuitive and approximate description of the regret. The buyer's “regret” can be viewed in two parts: 1) she does not bid according to the optimal strategy; 2) her strategy does not spend the budget optimally, which leads to future losses. Given remaining budget at time with strategy , the above intuition guides us to first look at
Lemma A.3.
Proof.
To ease description, we first let
(Recall that is Algorithm 1’s estimated optimal bid and is the bid of the benchmark.) Using the notations above, now becomes
Using the induction part in the proof of Lemma 3.2 and Lemma A.1, follows from the condition. In order to bound , we write
The first inequality holds because of the triangle inequality, and the second inequality is established using the conditions specified (Lemma 3.1, Lemma 3.2, and Lemma A.1). Note that it holds that
Finally, we sum up the above estimation and obtain
as desired. ∎
In particular, taking and as in the analysis of Algorithm 1, we have .
Next, we build connections between and the accumulated differences in the values. The definition of is inspired by recent advances in Yang et al. (2021); He et al. (2021); Liu and Su (2020); Zhou et al. (2021). We define
Lemma A.4.
For any given and , suppose that the conditions for Lemma A.3 hold. We have
Proof.
To start, we introduce the following notation:
Note that we have and . We will use backward induction to bound the difference between and for any . First choose a sufficiently large and assume . When , the induction basis holds since . Now consider . By the induction hypothesis, , and when , it holds that
It follows that
and this concludes the induction step.
Applying the proof techniques in Lemma A.1 with the condition for , we have
Since is arbitrarily chosen (as long as it is sufficiently large), we can take . Note that we have . Therefore, it holds that
This establishes
and concludes the proof. ∎
In particular, taking and as in the analysis of Algorithm 1, we have .
Finally, we relate the regret defined in Equation 1 with in the previous lemma.
Lemma A.5.
Suppose that , then .
Proof.
In fact, note that
and any expected instantaneous reward is no greater than 1. We have
The above bound holds because when , we have . The considered quantity is divided into two parts at this threshold . The first part is no greater than and the second part is no greater than . The result follows after simple rearrangement. ∎
In particular, taking as in the analysis of Algorithm 1, we have
conditioning on the good event that the estimation (of Lemma 3.1) succeeds for every . Now take and note that . By using the trivial regret bound for the failure event, we have
and this concludes the proof of Theorem 3.5.
A.4 Proof of Theorem 3.6
Note that the function depends on both and , as does. With the additional condition that also needs to be estimated, we use and to denote the corresponding versions computed using and , respectively. We also extend this notation to in this scenario (cf. Algorithm 2).
We start with two simple lemmas. These two lemmas are direct corollaries of Lemma 3.1 and Lemma A.1.
Lemma A.6.
Let be the estimated distribution at round in Algorithm 1. With probability ,
Lemma A.7.
For any round , budget , it holds that
In the following proof, we will gradually bridge and , the estimated value function. For distributions , the notation refers to the value function computed when and .
Lemma A.8.
With probability at least , for any given and budget , we have
Proof.
First we note that Lemma A.6 states with probability at least , . Now we apply backward induction. When , the induction basis holds trivially since . Assume the induction hypothesis holds for . For any and , it holds that
The first inequality holds since, by construction, is the optimal bid under and 's view while is the optimal bid under and 's view. The second inequality holds by the triangle inequality, and the third inequality is due to the induction hypothesis. Finally, by symmetry (swap and , and repeat the proof above), it holds that .
Next, since and , we have
where
• To bound : since is supported on and the difference between and is upper bounded by , is bounded by . Note that we use the fact that, with integration by parts, . Therefore, it holds that . Since is monotone w.r.t. and is bounded on both sides, it holds that .
• To bound : it is clear that by linearity of expectation.
Therefore, we obtain
which finishes the induction step and concludes the proof. ∎
Lemma A.9.
For any and budget , with probability at least , it holds that:
Proof.
In order to bound the difference between and , we first rewrite
where
Therefore, it holds that:
∎
Now, we provide a bound similar to Lemma A.3 for Theorem 3.6. Similarly, we define
Recall that . The following series of arguments holds. First, we have
It follows from Lemma A.8 that
It suffices to bound . To do so, we rewrite
where
After summing them up, we have
Therefore, conditioning on the good event that the estimation (of both and ) succeeds for every , is bounded by
We then apply the same techniques in Lemma A.4 and Lemma A.5 to yield
conditioning on the good event.
Now take and note that . By using the trivial regret bound for the failure event, we obtain the desired bound
Appendix B Omitted proof in Section 3.2
B.1 Proof of Lemma 3.9
B.2 Proof of Lemma 3.10
Before we proceed to bound the difference between and , we also need the following lemma to characterize a property of Gaussian processes.
Lemma B.1.
Let be a Gaussian process. We have
where and are constants.
For a Gaussian process, the tail follows a Gaussian distribution. For a normal distribution, we have the following well-known inequality:
where and , are constants. For example, we can take and in this inequality.
To bound for a certain Gaussian process, we rescale the random variable; the tail distribution then also satisfies the above property. So, there exist and such that Lemma B.1 holds.
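For reference, the well-known tail bound alluded to above can be stated for a standard normal random variable $X$ and any $t \ge 0$ (a standard fact; the constants in Lemma B.1 are obtained after rescaling):

```latex
\Pr[X > t] \;\le\; e^{-t^2/2}, \qquad\text{hence}\qquad \Pr\bigl[|X| > t\bigr] \;\le\; 2\,e^{-t^2/2}.
```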
Now we give a bound for the estimation of . First, we have
We show that Lemma 3.10 is established if we take where and . Indeed, if then the first part on the right side is not greater than by Lemma B.1, and if then the second part is not greater than . Take . It follows from a simple comparison (between the growth rates of and ) that there exists a constant such that for any and , . Now,
since . This concludes the proof of the lemma.
B.3 Proof of Theorem 3.11
Similar to the proof in the full-feedback case, let us first assume knowledge of ; the regret of learning will be considered later. The proof of Theorem 3.11 is short thanks to the toolset established in previous sections. As usual, we first condition on the good event that the estimation succeeds for every . Then we add the contribution of the bad event to the regret in the final proof.
In order to bound the final regret, we need to bound the difference between and and then apply the methodology used in the proof of Theorem 3.5.
Dropping the first two rounds, for any , the estimation of is . Then using Lemma 3.10, we obtain that, with probability at least ,
where . Note that we use the fact that if , the algorithm updates for at most times. Now, the new concentration bound (Lemma 3.8) effectively changes and the convergence rate in Lemma A.3, Lemma A.4, and Lemma A.5. Therefore, by substituting with and with in these lemmas, we obtain that, conditioning on the good event (w.p. ) that the estimation of succeeds every time,
if Algorithm 2 has the knowledge of .
Next, we add the estimation of . Note that Lemma A.8 decouples the regret of estimating and ; hence, we may apply the same approach as in the proof of Theorem 3.6 by taking and , and we have
conditioning on the good event (w.p. ) that the estimation of succeeds every time.