No-regret Learning in Repeated First-Price Auctions with Budget Constraints
Abstract
Recently the online advertising market has exhibited a gradual shift from second-price auctions to first-price auctions. Although there has been a line of work on online bidding strategies in first-price auctions, it remains open how to handle budget constraints in this problem. In the present paper, we initiate the study of how a budget-constrained buyer should learn online bidding strategies in repeated first-price auctions. We propose an RL-based bidding algorithm that competes against the optimal non-anticipating strategy under stationary competition. Our algorithm obtains -regret if all bids are revealed at the end of each round. Under the restriction that the buyer only sees the winning bid after each round, our modified algorithm obtains -regret using techniques developed from survival analysis. Our analysis extends to the more general scenario where the buyer has any bounded instantaneous utility function, with regrets of the same order.
1 Introduction
The market for online advertising has grown extensively in recent years. It is estimated that the volume of online advertising worldwide will reach 500 billion dollars in 2022 (Statista, 2021). In such a market, advertising platforms use auctions to allocate ad opportunities. Typically, the advertiser has a limited amount of capital for an advertising campaign, and thus all rounds of competition are interconnected through the budget. Furthermore, the advertiser has very limited knowledge of 1) her valuation of certain keywords and 2) the competition she is facing. Many works have been devoted to developing algorithms for learning to optimally spend the budget in repeated second-price auctions (see Section 1.1).
Nevertheless, we have witnessed numerous switches from second-price auctions to first-price auctions in the online advertising market. A recent remarkable example is Google AdSense's integrated move at the end of 2021 (LLC, 2021). Earlier examples include AppNexus, Index Exchange, and OpenX (Sluis, 2017). This industry-wide shift is due to various factors, including a fairer transactional process and increased transparency. Further, the shift to first-price auctions lends major importance to the following open question, which has barely been considered in previous work:
How should budget-constrained advertisers learn to compete in repeated first-price auctions?
This paper thus initiates the study of learning to bid with budget constraints in repeated first-price auctions. We note that the application of first-price auctions with budgets is not limited to the online advertising setting mentioned above. Traditional competitive environments such as the mussel trade in the Netherlands (van Schaik et al., 2001), modern price competition, and procurement auctions (e.g., the U.S. Treasury Securities auction (Chari and Weber, 1992)) are examples as well.
Challenges and contributions
The challenges in this setting are two-fold.
The first challenge relates to the information structure. In practical auctions, it is often the case that only the highest bid is revealed (Esponda, 2008). This is known in the literature as censored feedback and is related to the winner's curse (Capen et al., 1971). It affects the information structure of learning, since the buyer obtains less information when she wins. This makes the problem challenging compared to standard contextual bandits (cf. Section 1.1).
The second challenge is more fundamental. Bidding strategies in first-price auctions are notoriously complex to analyze, even in the static case (Lebrun, 1996). To get an intuitive feeling for this difficulty compared to repeated second-price auctions, let us consider the offline case where the opponents' bids are all known. Given the budget, the problem for second-price auctions reduces to a pure knapsack problem, where the budget is regarded as the weight capacity and prices as weights. This structure allows mature techniques, including duality theory, to be applied to study the benchmark strategy. Unfortunately, in first-price auctions the payment depends on the buyer's own bid, so the previous approach and benchmark are not directly usable. We provide a concrete example to further illustrate this difficulty.
Example 1.1.
Consider a case where the buyer's value follows a uniform distribution on and the highest bid of the opponents follows a uniform distribution on . The budget is set to be half of the time horizon . Now, the first-best benchmark (an anticipating strategy; an algorithm is anticipating if bid selection depends on future observations, see Flajolet and Jaillet (2017)) is
In hindsight, we need to pay as little as possible. Using the theory of knapsacks, the utility turns out to be . On the contrary, the optimal non-anticipating bidding strategy in a first-price auction is to bid and the utility is . There is already a separation between the first-best benchmark and the ideal case with full information.
This example shows that the simple characterizations of the optimal strategy in previous works are not applicable here. Indeed, it remains unclear what methodology can be applied to first-price auctions with budgets (see (Balseiro and Gur, 2019, §2.4) for further discussions).
The present paper takes the first step to combat the challenges mentioned above with a dynamic programming approach. Correspondingly, our contribution is also two-fold:
• We provide an RL-based learning algorithm. Through the characterization of the optimal strategy, we obtain a -regret guarantee for the algorithm in the full-feedback case (this case is especially practical in public-sector auctions (Chari and Weber, 1992), as regulations mandate that all bids be revealed).
• In the censored-feedback setting, using techniques developed from survival analysis, we modify our algorithm and obtain a regret of .
1.1 Related Work
Repeated second-price auctions with budgets
There is a flourishing literature on bidding strategies in repeated auctions with budgets. Through the lens of online learning, Balseiro and Gur (2019) identify asymptotically optimal online bidding strategies known as pacing (also known as bid shading in the literature) in repeated second-price auctions with budgets. Inspired by the pacing strategy, Flajolet and Jaillet (2017) develop no-regret non-anticipating algorithms for learning with contextual information in repeated second-price auctions. Another line of work that uses techniques similar to those in the present paper includes Amin et al. (2012); Tran-Thanh et al. (2014); Gummadi et al. (2012). Gummadi et al. (2012) and Amin et al. (2012) study bidding strategies in repeated second-price auctions with budget constraints, but the former does not involve any learning and the latter does not provide any regret analysis (their estimator is biased). Tran-Thanh et al. (2014) derive regret bounds for the same scenario, but the optimization objective is the number of items won instead of value or surplus. Baltaoglu et al. (2017) also use dynamic programming to tackle repeated second-price auctions with budgets. However, they assume per-round budget constraints, and their dynamic programming algorithm is for allocating bids among multiple items. Again, we emphasize that no prior work addresses repeated first-price auctions with budgets, since the structure of the problem (compared to its second-price variant) is fundamentally different (recall Example 1.1).
Repeated first-price auctions without budgets
Two notable works concerning repeated first-price auctions are Han et al. (2020b) and Han et al. (2020a). Han et al. (2020b) introduce a new problem called monotone group contextual bandits and obtain an -regret algorithm for repeated first-price auctions without budget constraints under stationary settings. Han et al. (2020a) concentrate on an adversarial setting and develop a minimax-optimal online bidding algorithm with regret against all Lipschitz bidding strategies. A crucial difference is that in the present paper budgets are involved, and thus the bandit algorithm of Han et al. (2020b) is not suitable for our needs.
Bandits with knapsacks
From the bandit side, Badanidiyuru et al. (2013) develop bandit algorithms under resource constraints. They show that their algorithm can be used in dynamic procurement, dynamic posted pricing with limited supply, etc. However, since the bidder observes her value before bidding in our problem, the results of Badanidiyuru et al. (2013) cannot be directly applied to our setting. Our setting also relates to contextual bandit problems with resource constraints (Badanidiyuru et al., 2014; Agrawal and Devanur, 2016; Agrawal et al., 2016). Nevertheless, applying this contextual bandit approach requires discretizing the action space, which needs Lipschitz continuity of the distributions. Our approach does not rely on any continuity assumption. Further, the performance guarantee (typically ) is worse than ours. It also does not fit our information structure when the feedback is censored.
2 Preliminaries
Auction mechanism
We consider a repeated first-price auction with budgets. Specifically, we suppose that the buyer has a limited budget to spend over a time horizon of rounds (which can be unknown to her). At the beginning of each round , the bidder privately observes a value for a fresh copy of the item and bids according to her past observations and value . Denote the strategy she employs by , which maps her current budget , value , and past history to a bid. Let be the maximum bid of the other bidders. Since the auction is a first-price auction, if is larger than , then the buyer wins the auction, is charged , and obtains a utility of ; otherwise she loses and . Therefore, the instantaneous utility of the buyer is
The exact information structure of the history the buyer observes is dictated by how the mechanism reveals . In full generality, we assume that the feedback is censored, i.e., only the highest bid is revealed at the end of each round and the winner does not observe exactly. This is considered an informational version of the winner's curse (Capen et al., 1971) and is of practical interest (Esponda, 2008). For modeling purposes, we suppose that ties are broken in favor of the buyer, but this choice is arbitrary and by no means a limitation of our approach.
Next, we state the assumptions on and . Without loss of generality, we assume that are normalized to lie in . In the present paper, we consider a stochastic setting where and are drawn from distributions unknown to the buyer, respectively, independently of the past. We will also refer to the cumulative distribution functions of by the same notation. No further assumptions are made on . Now, the expected instantaneous utility of the buyer at time with strategy is
To argue that this assumption is reasonable, note that although other buyers may also exhibit learning behavior, a real advertising market typically contains a large number of buyers. A specific buyer only faces a small, changing portion of them and is completely oblivious to whom she is facing in each round. Therefore, the buyer's sole objective is to maximize her utility (instead of fooling other buyers), and by the law of large numbers, the price and value the buyer observes are independent and identically distributed, at least for a period of time.
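For concreteness, the following minimal Python sketch simulates this stationary model for a single bidder; the uniform distributions, the symbol names (value v, highest competing bid d, bid b), and the fixed shading rule are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_repeated_fpa(T=10_000, budget=2_500.0, shade=0.5):
    """Simulate a budget-constrained bidder in repeated first-price auctions.

    Assumptions (illustrative only): values v and highest competing bids d are
    i.i.d. Uniform[0, 1]; the bidder uses a fixed shading rule b = shade * v,
    truncated to the remaining budget.
    """
    remaining, utility = budget, 0.0
    for _ in range(T):
        v = rng.uniform()              # private value, observed before bidding
        d = rng.uniform()              # highest competing bid
        b = min(shade * v, remaining)  # never bid more than the remaining budget
        if b >= d:                     # ties broken in favor of the buyer
            utility += v - b           # first price: the winner pays her own bid
            remaining -= b
        if remaining <= 0:             # budget exhausted: the bidder must stop
            break
    return utility

print(simulate_repeated_fpa())
```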
Buyer’s target and regret
The buyer aims to maximize her long-term accumulated utility subject to the budget constraint. Recall that the instantaneous utility of the buyer is . The payment is , and the budget decreases accordingly as payments are incurred. She can continue to bid as long as the budget has not run out, but must stop at
The buyer's problem now becomes determining how much to bid in each round to maximize her accumulated utility. In line with Gummadi et al. (2012); Golrezaei et al. (2019); Deng and Zhang (2021), the buyer adopts a discount factor . She discounts since she does not know or . Discount factors can be interpreted as the estimated probability that the repeated auction will last for at least rounds (Devanur et al., 2014; Drutsa, 2018). On the economic side, in important real-world markets like online advertising platforms, buyers are impatient for opportunities since companies of different sizes have different capabilities. Discount factors model how impatient the buyer is (Drutsa, 2017; Vanunts and Drutsa, 2019). Now the buyer's optimization problem is to determine a non-anticipating strategy for the following:
where . Here, denotes the sequence of private values the buyer observes and is the sequence of the highest bids of the other bidders. denotes the expected accumulated utility using strategy with budget and starting from time . Let denote the optimal bidding strategy with the knowledge of the underlying distributions and . The corresponding expected accumulated utility is .
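Since the displayed program did not survive extraction, the following is only a plausible reconstruction of the buyer's discounted, budget-constrained objective under the model described above; the symbols $v_t$ (value), $b_t$ (bid), $d_t$ (highest competing bid), $\gamma$ (discount factor), and $B$ (budget) are placeholders of ours and need not match the paper's notation:

```latex
\max_{\pi\ \text{non-anticipating}} \;
\mathbb{E}\!\left[\sum_{t\ge 1}\gamma^{\,t-1}\,(v_t-b_t)\,\mathbf{1}\{b_t\ge d_t\}\right]
\quad\text{subject to}\quad
\sum_{t\ge 1} b_t\,\mathbf{1}\{b_t\ge d_t\}\;\le\; B .
```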
We now define the regret. Given time , the regret is defined as the difference between the accumulated utility of the buyer's bidding strategy and that of the optimal non-anticipating bidding strategy:
(1)
3 Bidding Algorithm and Analysis
In this section, we present our bidding algorithm and the high-level ideas in the regret analysis. We first consider the case where the feedback is not censored, i.e., the buyer observes regardless of whether she wins. We then extend our algorithm to the case of censored feedback using techniques developed from survival analysis.
3.1 Full Feedback
When and are known, the buyer’s problem can be viewed as offline. The technical challenge lies in the observation that even when the distributions are known, the buyer’s problem cannot be directly analyzed as a knapsack problem. To tackle this challenge, we use a dynamic programming approach to solve the problem. In particular, the optimal strategy satisfies the following Bellman equation:
for all and . Note that for any , . By choosing appropriate initialization conditions, we can solve the equation recursively and obtain the optimal bidding strategy together with the function . The above recursion also characterizes the optimal solution, which will be used in the analysis later.
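As a concrete illustration of the recursive computation, the sketch below runs backward induction over a discretized budget and bid grid. The Bellman equation itself did not survive extraction, so the recursion used here (win probability given by the c.d.f. G of the highest competing bid, discounted continuation values, zero terminal values) is a plausible reconstruction under stated assumptions, not the paper's exact formulation.

```python
import numpy as np

def solve_bellman(G, gamma, horizon, budget_grid, bid_grid):
    """Backward induction for a plausible form of the bidding Bellman recursion.

    Assumptions (ours, for illustration): G(b) is the c.d.f. of the highest
    competing bid, so G(b) is the win probability when bidding b (ties favor
    the buyer); the private value is uniform on `bid_grid`; and the value-to-go
    beyond `horizon` is approximated by zero (the discount makes the tail
    negligible).
    """
    n = len(budget_grid)
    V = np.zeros((horizon + 2, n))           # V[t, i]: value-to-go at time t with budget_grid[i] left
    for t in range(horizon, 0, -1):
        for i, B in enumerate(budget_grid):
            total = 0.0
            for v in bid_grid:               # expectation over the fresh private value
                best = 0.0
                for b in bid_grid[bid_grid <= min(B, v)]:
                    # nearest budget level at or above B - b (coarse approximation)
                    j = int(np.searchsorted(budget_grid, B - b))
                    win = G(b) * (v - b + gamma * V[t + 1, j])
                    lose = (1.0 - G(b)) * gamma * V[t + 1, i]
                    best = max(best, win + lose)
                total += best
            V[t, i] = total / len(bid_grid)
    return V

# Toy instance: highest competing bid Uniform[0, 1].
V = solve_bellman(G=lambda b: float(np.clip(b, 0.0, 1.0)), gamma=0.95, horizon=30,
                  budget_grid=np.linspace(0.0, 3.0, 16), bid_grid=np.linspace(0.0, 1.0, 11))
print(V[1, -1])   # estimated value-to-go at t = 1 with the full budget
```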
When the buyer does not know and , she can learn the distributions from past observations. Therefore, it is natural to maintain estimates and of the true distributions. Our algorithm for the full-feedback case is depicted in Algorithm 1. To ease the technical load, we first assume knowledge of and only estimate in Algorithm 1. Later, we add the estimation of ; its analysis is presented in Theorem 3.6.
Similar to prior work (Amin et al., 2012), Algorithm 1 performs exploration and exploitation simultaneously. The buyer initializes the estimate of to a uniform distribution (2). At round , the buyer observes her valuation. Then, she uses her estimate of to solve the dynamic programming problem recursively (for the non-trivial case , this can be solved in time with only loss; see, e.g., Chow and Tsitsiklis, 1989) to obtain an estimate of the optimal bid (7~11). To provide a base case for the recursion, note that for sufficiently large , 's impact on is negligible due to the discount . Therefore, the buyer can approximate with zero for (5). Finally, the auction proceeds with revealed, and the buyer updates her information accordingly (13~14).
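The following self-contained sketch shows the shape of this explore-while-exploiting loop: maintain an empirical c.d.f. of the observed highest bids, re-plan each round, bid, and update. To keep the sketch short, the dynamic programming step is abstracted into a `planner` callable and replaced by a myopic stand-in that maximizes the estimated one-round utility; in the paper's Algorithm 1 this step would instead solve the Bellman recursion sketched above. All names and distributions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_cdf(samples):
    """Empirical c.d.f. built from past observations of the highest competing bid."""
    data = np.sort(np.asarray(samples, dtype=float))
    return lambda b: np.searchsorted(data, b, side="right") / len(data)

def myopic_planner(G_hat, remaining, v, bid_grid):
    """Stand-in for the dynamic programming step: maximize G_hat(b) * (v - b)."""
    feasible = bid_grid[bid_grid <= min(remaining, v)]
    return max(feasible, key=lambda b: G_hat(b) * (v - b), default=0.0)

def full_feedback_loop(T=5_000, budget=1_250.0, planner=myopic_planner):
    bid_grid = np.linspace(0.0, 1.0, 101)
    observed_d, remaining, utility = [], budget, 0.0
    for _ in range(T):
        if remaining <= 0:                         # stop once the budget runs out
            break
        v = rng.uniform()                          # observe the fresh private value
        # Uniform initialization of the estimate, as described for Algorithm 1.
        G_hat = empirical_cdf(observed_d) if observed_d else (lambda b: b)
        b = planner(G_hat, remaining, v, bid_grid)
        d = rng.uniform()                          # highest competing bid this round
        if b >= d:                                 # win: pay own bid (first price)
            utility += v - b
            remaining -= b
        observed_d.append(d)                       # full feedback: d revealed win or lose
    return utility

print(full_feedback_loop())
```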
Analysis of regret
We establish the regret bound in two steps. First, we show that the buyer's estimate is approximately accurate given a sufficient number of samples. This relies on the estimation of ; concentration inequalities are thus intrinsic to our analysis. Second, we bridge the regret defined in Equation 1 with and . This is done through a series of auxiliary quantities measuring the regret. Combining the two steps yields the desired result.
We first present the following lemma (Dvoretzky et al., 1956; Massart, 1990), which states a uniform convergence result for the estimation of cumulative distribution functions.
Lemma 3.1 (Dvoretzky–Kiefer–Wolfowitz).
Given $n$, let $X_1,\ldots,X_n$ be real-valued independent and identically distributed random variables with cumulative distribution function $F$. Let $F_n$ denote the associated empirical distribution function defined by $F_n(x)=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{X_i\le x\}$. Then with probability $1-\delta$, it holds that
$$\sup_{x\in\mathbb{R}}\,\bigl|F_n(x)-F(x)\bigr|\;\le\;\sqrt{\frac{\ln(2/\delta)}{2n}}.$$
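As a quick, purely illustrative sanity check (not part of the paper's analysis), the following Monte Carlo experiment verifies that the empirical violation rate of the bound above stays below $\delta$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Check the DKW bound: with probability 1 - delta, the sup-distance between the
# empirical and true c.d.f. is at most sqrt(ln(2 / delta) / (2 n)).
n, delta, trials = 2_000, 0.05, 500
bound = np.sqrt(np.log(2 / delta) / (2 * n))
grid = np.linspace(0.0, 1.0, 1_000)
violations = 0
for _ in range(trials):
    x = np.sort(rng.uniform(size=n))                   # true c.d.f. is F(t) = t on [0, 1]
    F_n = np.searchsorted(x, grid, side="right") / n   # empirical c.d.f. evaluated on a grid
    violations += np.max(np.abs(F_n - grid)) > bound   # grid maximum approximates the sup
print(f"bound = {bound:.4f}, violation rate = {violations / trials:.3f} (should be <= {delta})")
```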
With Lemma 3.1 in hand, given any , we show a bound for the difference between and for any (recall that is the accumulated utility of the optimal strategy with the knowledge of ). We prove this using induction. Note that the induction basis is a little tricky due to the base case of recursion in Algorithm 1. Therefore, in the following lemma, we first deal with the induction step.
Lemma 3.2.
For any round , budget and with probability at least , we have the following bounds for the estimated and the ground truth with if :
Note that the lemma above assumes that is bounded above by , which holds if the base case is set accurately (i.e. ). We use to denote the estimated value function using if the base is indeed accurate. In the following lemma, we show that using the alternative initialization method specified in 5 of Algorithm 1, actually satisfies the condition of Lemma 3.2.
Lemma 3.3.
Suppose and is then computed by the recursive procedure in Algorithm 1. Then it holds that for any and :
In particular, when , we have (by construction of ).
Synthesizing Lemma 3.2 and Lemma 3.3, we have . A crucial next step is to relate this bound to the final regret. This is achieved via two transformations. Roughly speaking, the buyer's “regret” can be viewed in two parts: 1) she does not bid according to the optimal strategy; 2) her strategy does not spend the budget optimally, which leads to future losses. The two transformations are guided by this intuitive observation and are summarized in the following lemma, which bounds the performance of the buyer's strategy. Below we first condition on the good event that the estimation succeeds for every . We then add the contribution of the bad event to the regret in Theorem 3.5.
Lemma 3.4.
For any given and , denote , then
By Lemma 3.4 and further transformations, we can now establish the regret bound of Algorithm 1.
Theorem 3.5.
Under the circumstance that is unknown, the worst-case regret of Algorithm 1 is , where the regret is computed according to Equation 1. Explicitly, if we take ,
Note that the regret bound is meaningful only if . Let us now take the budget to scale linearly with , as in Balseiro and Gur (2019); Flajolet and Jaillet (2017). Specifically, assume that and that there exists a constant such that the budget ; then the regret is in this special case. Indeed, under this condition, we can simply set and for any in Algorithm 1. Therefore, and the worst-case regret is bounded by .
Next we deal with the case where is also initially unknown. Based on Algorithm 1, we additionally maintain an estimation of based on past observations of valuations. is initialized to be a uniform distribution and will be used to solve the dynamic programming problem (see 7 of Algorithm 2). Using similar techniques as before (with more work), we obtain the following theorem.
Theorem 3.6.
Under the circumstance that are both unknown, it holds that the worst-case regret of Algorithm 1 using empirical distribution functions to estimate and is . Explicitly, if we take ,
3.2 Censored Feedback
In this subsection, we deal with the case that the buyer can only see the winner’s bid after each round. In other words, the feedback is left-censored. Concretely, the buyer’s observation is
When she wins, the exact value of is not revealed; the buyer only knows that is smaller than her bid in the current round. A classical statistical tool for estimating in this scenario is the Kaplan–Meier (KM) estimator developed by Kaplan and Meier (1958). However, the KM estimator assumes that the censoring sequence is deterministic, which does not fit our needs. Although the extension by Suzukawa (2004) allows random censorship, it requires independence between and , which is not realistic here since we use the estimated distribution to place bids.
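For reference, here is a minimal sketch of the classical Kaplan–Meier product-limit estimator mentioned above (the textbook formula $S(t)=\prod_{t_i\le t}(1-d_i/n_i)$); it is shown only for contrast and is not the covariate-adjusted estimator of Zeng (2004) that our algorithm uses.

```python
import numpy as np

def kaplan_meier(times, observed):
    """Classical Kaplan-Meier product-limit estimate of the survival function.

    `times`    : observed values min(X_i, C_i) for right-censored samples
    `observed` : True if the event X_i was observed, False if it was censored
    Returns the distinct event times t_i and S(t_i) = prod_{t_j <= t_i} (1 - d_j / n_j),
    where d_j counts events at t_j and n_j counts samples still at risk at t_j.
    """
    times = np.asarray(times, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    event_times, survival, S = [], [], 1.0
    for t in np.unique(times[observed]):
        at_risk = np.sum(times >= t)                 # n_j: still at risk just before t
        events = np.sum((times == t) & observed)     # d_j: uncensored observations at t
        S *= 1.0 - events / at_risk
        event_times.append(t)
        survival.append(S)
    return np.array(event_times), np.array(survival)

# Toy usage: exponential event times, right-censored by independent uniform times.
rng = np.random.default_rng(3)
X, C = rng.exponential(1.0, 500), rng.uniform(0.0, 2.0, 500)
t, S = kaplan_meier(np.minimum(X, C), X <= C)
print(t[:3], S[:3])
```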
To tackle this problem, we introduce an estimator proposed by Zeng (2004), denoted by , to take the place of the empirical distribution used previously in Algorithm 1.
Estimation procedure
We now present the procedure for estimating under censored feedback. It suffices to estimate the distribution function of which is right-censored by . Let . The observations can now be described as .
Roughly speaking, to decouple the dependency between , we use the fact that and are independent conditioning on . Intuitively, the history provides information for obtaining enough effective samples for . Next, we build models to estimate the hazard rate functions of using (the hazard rate function of a random variable with p.d.f. $f$ and c.d.f. $F$ is $f(t)/\bigl(1-F(t)\bigr)$). With the hazard rate functions, we use the maximum likelihood method with a kernel to compute the final estimate and obtain Equation 3.
The details follow. We initialize with a sequence (bandwidth) such that as , and a symmetric kernel function with bounded gradient. Now, at each time , we compute two vectors which maximize each of the following:
(2)
We arbitrarily pad with zeros to make their lengths equal (we will show in a moment that this is without loss of generality). Compute . The survival function of , or equivalently the cumulative distribution function of , is estimated via Zeng (2004)'s estimator
(3)
Now, we are ready to apply the estimator; the algorithm for censored feedback is depicted in Algorithm 2. Note that the new estimator's convergence rate is slower than that of the full-feedback case. Therefore, compared to Algorithm 1, Algorithm 2 is a multi-phase algorithm: it only updates the estimate of at the end of each phase (see Figure 1 for an illustration). The other elements of each phase in Algorithm 2 are similar to those of Algorithm 1.
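A minimal sketch of this multi-phase structure follows; the doubling phase lengths are an illustrative assumption of ours, whereas Algorithm 2 chooses its phase lengths to match the slower convergence rate of the censored-feedback estimator.

```python
def phase_boundaries(T, first_length=1):
    """Return the rounds at which a doubling-phase schedule refreshes its estimate."""
    t, length, ends = 0, first_length, []
    while t < T:
        t = min(t + length, T)   # the estimate of the bid distribution is frozen within a phase
        ends.append(t)           # ... and refreshed only at these phase boundaries
        length *= 2
    return ends

print(phase_boundaries(100))     # e.g. [1, 3, 7, 15, 31, 63, 100]
```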
Analysis of regret
To analyze the performance of Algorithm 2, we will prove a series of lemmas on the estimation error of Equation 3. In particular, our proof relies on the following convergence result.
Lemma 3.7 (Zeng).
Let be the estimation of after using observations. We have
where is a Gaussian process.
Before we proceed to apply the lemma, we verify a series of prerequisites from Zeng (2004) to make sure it holds. First, we make sure that, conditioning on , the random variables and are independent. Indeed, is completely decided by and is independent of . Second, we note that the maximizer shown in Equation 2 is essentially performing Cox's proportional hazards regression. To establish consistency of the estimator, we show that at least one of and follows Cox's proportional hazards model. That is to say, there exist and a function such that the hazard rate function of or conditioning on exactly follows
(4)
Equation 4 holds for . In fact, taking and suffices. Since we take , consistency is established regardless of how we pad .
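For reference, the standard Cox proportional hazards form that Equation 4 instantiates is the textbook model below; the generic covariate $Z$, coefficient $\beta$, and baseline hazard $\lambda_0$ are placeholders rather than the paper's notation:

```latex
\lambda(t \mid Z) \;=\; \lambda_0(t)\,\exp\!\bigl(\beta^{\top} Z\bigr).
```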
Next, consider some phase at time . The estimate is computed using the first observed data points. Applying techniques similar to those for the rate of convergence of empirical processes (Norvaiša and Paulauskas, 1991), we obtain
Lemma 3.8.
Under Algorithm 2, let be an empirical process of updates and be a general Gaussian process, respectively, indexed by a class of real measurable functions. We have:
where is a constant depending only on and is the size of data used to update the estimation.
Synthesizing the convergence results in Lemma 3.7 and Lemma 3.8, we establish the following lemma on the performance of in Algorithm 2.
Lemma 3.9.
Under the update process in Algorithm 2, for any , we have the following bounds for the estimation :
where is a constant depending only on and Algorithm 2.
Finally, we now bound the difference between and with the help of Lemma 3.9.
Lemma 3.10.
Recall that we use the first data points to estimate . Under the update procedure of Algorithm 2, for any , with probability at least
With Lemma 3.10 in hand and employing a methodology similar to that in the proof of Theorem 3.5, we now have
Theorem 3.11.
Under the circumstance that are both unknown and the feedback is censored, the worst-case regret of Algorithm 2 is . Explicitly, if we take ,
Remark 3.12.
The setting of Han et al. (2020b) is a special case of ours. In Han et al. (2020b), there are no budget constraints and (thus removing the factor in our results), and the buyer's aim is to maximize in each round. This is equivalent to the case in our setting with no need to estimate , yielding regret in the full-feedback case and regret in the censored-feedback case.
4 Discussion and Conclusion
In this paper, we develop a learning algorithm to adaptively bid in repeated first-price auctions with budgets. On the theoretical side, our algorithm, together with its analysis of -regret in the full-feedback case and -regret in the censored-feedback case, takes the first step in understanding the problem. On the practical side, our algorithm is simple and readily applicable to the digital world that has shifted to first-price auctions.
Several questions present themselves for future exploration. We observe that in the limiting case , the optimal bidding strategy in Algorithm 2 is similar to a pacing strategy, which partially answers the open question raised in Balseiro and Gur (2019): “A question of theoretical and practical interest is how to extend the adaptive pacing approach that is suggested in this paper to nontruthful mechanisms such as first-price auctions.” In the limiting case of , the optimal bid of Algorithm 2 is of the form , where is a pacing multiplier that depends only on and . Further, can be computed without explicitly solving the dynamic programming problem. This observation can be viewed as a direct corollary of Theorem 3.1 in Gummadi et al. (2012). The connection between Algorithm 2 and pacing suggests extending current adaptive pacing techniques to other auction forms as a future direction. (We note that, in parallel to our work, Gaitonde et al. (2022) extend pacing techniques to a class of auction forms called core auctions; they obtain -regret bounds against the best linear policy under the value-maximization objective.) Other immediate open questions include establishing lower bounds on the regret for our problem.
References
- Agrawal and Devanur [2016] Shipra Agrawal and Nikhil Devanur. Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems, 29, 2016.
- Agrawal et al. [2016] Shipra Agrawal, Nikhil R. Devanur, and Lihong Li. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. In Conference on Learning Theory, pages 4–18. PMLR, 2016.
- Amin et al. [2012] Kareem Amin, Michael Kearns, Peter Key, and Anton Schwaighofer. Budget optimization for sponsored search: Censored learning in MDPs. arXiv preprint arXiv:1210.4847, 2012.
- Badanidiyuru et al. [2013] Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 207–216. IEEE, 2013.
- Badanidiyuru et al. [2014] Ashwinkumar Badanidiyuru, John Langford, and Aleksandrs Slivkins. Resourceful contextual bandits. In Conference on Learning Theory, pages 1109–1134. PMLR, 2014.
- Balseiro and Gur [2019] Santiago R. Balseiro and Yonatan Gur. Learning in repeated auctions with budgets: Regret minimization and equilibrium. Management Science, 65(9):3952–3968, 2019.
- Baltaoglu et al. [2017] M. Sevi Baltaoglu, Lang Tong, and Qing Zhao. Online learning of optimal bidding strategy in repeated multi-commodity auctions. Advances in Neural Information Processing Systems, 30, 2017.
- Capen et al. [1971] Edward C. Capen, Robert V. Clapp, and William M. Campbell. Competitive bidding in high-risk situations. Journal of Petroleum Technology, 23(06):641–653, 1971.
- Chari and Weber [1992] V. Chari and Robert Weber. How the US Treasury should auction its debt. Quarterly Review, 16(4), 1992.
- Chow and Tsitsiklis [1989] Chee-Seng Chow and John N. Tsitsiklis. The complexity of dynamic programming. Journal of Complexity, 5(4):466–488, 1989.
- Deng and Zhang [2021] Yuan Deng and Hanrui Zhang. Prior-independent dynamic auctions for a value-maximizing buyer. Advances in Neural Information Processing Systems, 34, 2021.
- Devanur et al. [2014] Nikhil R. Devanur, Yuval Peres, and Balasubramanian Sivan. Perfect Bayesian equilibria in repeated sales. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 983–1002. SIAM, 2014.
- Drutsa [2017] Alexey Drutsa. Horizon-independent optimal pricing in repeated auctions with truthful and strategic buyers. In Proceedings of the 26th International Conference on World Wide Web, pages 33–42, 2017.
- Drutsa [2018] Alexey Drutsa. Weakly consistent optimal pricing algorithms in repeated posted-price auctions with strategic buyer. In International Conference on Machine Learning, pages 1319–1328. PMLR, 2018.
- Dvoretzky et al. [1956] Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642–669, 1956.
- Esponda [2008] Ignacio Esponda. Information feedback in first price auctions. The RAND Journal of Economics, 39(2):491–508, 2008.
- Flajolet and Jaillet [2017] Arthur Flajolet and Patrick Jaillet. Real-time bidding with side information. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, 2017.
- Gaitonde et al. [2022] Jason Gaitonde, Yingkai Li, Bar Light, Brendan Lucier, and Aleksandrs Slivkins. Budget Pacing in Repeated Auctions: Regret and Efficiency without Convergence. arXiv e-prints, art. arXiv:2205.08674, May 2022.
- Golrezaei et al. [2019] Negin Golrezaei, Adel Javanmard, and Vahab Mirrokni. Dynamic incentive-aware learning: Robust pricing in contextual auctions. Advances in Neural Information Processing Systems, 32, 2019.
- Gummadi et al. [2012] Ramakrishna Gummadi, Peter Key, and Alexandre Proutiere. Repeated auctions under budget constraints: Optimal bidding strategies and equilibria. In the Eighth Ad Auction Workshop, volume 4, 2012.
- Han et al. [2020a] Yanjun Han, Zhengyuan Zhou, Aaron Flores, Erik Ordentlich, and Tsachy Weissman. Learning to bid optimally and efficiently in adversarial first-price auctions. arXiv preprint arXiv:2007.04568, 2020a.
- Han et al. [2020b] Yanjun Han, Zhengyuan Zhou, and Tsachy Weissman. Optimal no-regret learning in repeated first-price auctions. arXiv preprint arXiv:2003.09795, 2020b.
- He et al. [2021] Jiafan He, Dongruo Zhou, and Quanquan Gu. Nearly minimax optimal reinforcement learning for discounted MDPs. Advances in Neural Information Processing Systems, 34:22288–22300, 2021.
- Kaplan and Meier [1958] Edward L. Kaplan and Paul Meier. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53(282):457–481, 1958.
- Lebrun [1996] Bernard Lebrun. Existence of an equilibrium in first price auctions. Economic Theory, 7(3):421–443, 1996.
- Liu and Su [2020] Shuang Liu and Hao Su. Regret bounds for discounted MDPs. arXiv preprint arXiv:2002.05138, 2020.
- LLC [2021] Google LLC. FAQs about AdSense moving to a first-price auction - Google AdSense Help. https://support.google.com/adsense/answer/10858748?hl=en, 2021. Accessed: 2022-05-06.
- Massart [1990] Pascal Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, pages 1269–1283, 1990.
- Norvaiša and Paulauskas [1991] R. Norvaiša and V. Paulauskas. Rate of convergence in the central limit theorem for empirical processes. Journal of Theoretical Probability, 4(3):511–534, 1991.
- Sluis [2017] Sarah Sluis. Big changes coming to auctions, as exchanges roll the dice on first-price. https://www.adexchanger.com/platforms/big-changes-coming-auctions-exchanges-roll-dice-first-price/, 2017.
- Statista [2021] Statista. Digital advertising spending worldwide from 2019 to 2024 (in billion U.S. dollars). https://www.statista.com/statistics/237974/online-advertising-spending-worldwide/, 2021. Accessed: 2022-05-05.
- Suzukawa [2004] Akio Suzukawa. Unbiased estimation of functionals under random censorship. Journal of the Japan Statistical Society, 34(2):153–172, 2004.
- Tran-Thanh et al. [2014] Long Tran-Thanh, Lampros Stavrogiannis, Victor Naroditskiy, Valentin Robu, Nicholas R Jennings, and Peter Key. Efficient regret bounds for online bid optimisation in budget-limited sponsored search auctions. In 30th Conference on Uncertainty in AI, 2014.
- van Schaik et al. [2001] Frans D. J. van Schaik, Jack P. C. Kleijnen, et al. Sealed-bid auctions: case study. Center for Economic Research, 2001.
- Vanunts and Drutsa [2019] Arsenii Vanunts and Alexey Drutsa. Optimal pricing in repeated posted-price auctions with different patience of the seller and the buyer. Advances in Neural Information Processing Systems, 32, 2019.
- Yang et al. [2021] Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2021.
- Zeng [2004] Donglin Zeng. Estimating marginal survival function by adjusting for dependent censoring using many covariates. The Annals of Statistics, 32(4):1533–1555, 2004.
- Zhou et al. [2021] Dongruo Zhou, Jiafan He, and Quanquan Gu. Provably efficient reinforcement learning for discounted MDPs with feature mapping. In International Conference on Machine Learning, pages 12793–12802. PMLR, 2021.
Appendix A Omitted Proofs in Section 3.1
A.1 Proof of Lemma 3.2
We will use backward induction to show that
The inequality holds trivially with the condition of the lemma for the basis (). Suppose the bound holds for . Now we write out the difference of the value functions
where is Algorithm 1's estimated optimal bid and is the bid of the benchmark. The inequality is established by noting that is sub-optimal under . Next, consider the term inside the expectation, which is rewritten as follows:
To bound the above equation, we deal with the three terms separately. Using Lemma 3.1 and a union bound over rounds, with probability at least . By the induction hypothesis and the fact that any value function is bounded by , we have
Therefore,
This establishes that . Finally, by symmetry (swap and , and repeat the proof above), it holds that
This finishes the induction step and the claim follows.
A.2 Proof of Lemma 3.3
We will state a general form of Lemma 3.3 concerning the error in the initialization of the base case. This lemma will come in handy in the following sections.
Lemma A.1.
For any fixed distributions , consider the value function . Suppose we use an arbitrary value in to initialize the base case and recursively compute thereon to obtain , then it holds that for any :
Proof.
When , the claim holds because and are both upper bounded by and lower bounded by 0.
Suppose the claim holds when ; then for , we have
In the derivation above, denotes the optimal bidding strategy obtained by computing . The first inequality holds since is not necessarily the optimal bidding strategy under 's view. The second inequality holds since for any .
And by symmetry, we have . This concludes the induction step and yields the lemma. ∎
In particular, using Lemma A.1, under the condition of Lemma 3.3, the initialization is taken to be 0. We have
Remark A.2.
For convenience, similar to the notations used in this lemma, for any value function , we will use to denote its approximately-initialized counterpart. Furthermore, we will invoke the lemma many times for other value functions in the rest of the proofs.
A.3 Proof of Theorem 3.5
Below we first condition on the good event that the estimation (of Lemma 3.1) succeeds for every . We then add the contribution of the bad event to the regret in Theorem 3.5.
To prove this theorem, we will first bound the following auxiliary “regret” with Lemma 3.2 and Lemma A.1.
Let us first give an intuitive and approximate description of the regret. The buyer's “regret” can be viewed in two parts: 1) she does not bid according to the optimal strategy; 2) her strategy does not spend the budget optimally, which leads to future losses. Given remaining budget at time with strategy , the above intuition guides us to first look at
Lemma A.3.
Proof.
To ease description, we first let
(Recall that is Algorithm 1’s estimated optimal bid and is the bid of the benchmark.) Using the notations above, now becomes
Using the induction part in the proof of Lemma 3.2 and Lemma A.1, follows from the condition. In order to bound , we write
The first inequality holds because of the triangle inequality, and the second inequality is established using the conditions specified (Lemma 3.1, Lemma 3.2, and Lemma A.1). Note that it holds that
Finally, we sum up the above estimation and obtain
as desired. ∎
In particular, taking and as in the analysis of Algorithm 1, we have .
Next, we build connections between and the accumulated differences in the values. The definition of is inspired by recent advances in Yang et al. (2021); He et al. (2021); Liu and Su (2020); Zhou et al. (2021). We define
Lemma A.4.
For any given and , suppose that the conditions for Lemma A.3 hold. We have
Proof.
To start, we introduce the following notation:
Note that we have and . We will use backward induction to bound the difference between and for any . First choose a sufficiently large and assume . When , the induction basis holds since . Now consider . By the induction hypothesis, , and when , it holds that
It follows that
and this concludes the induction step.
Applying the proof techniques in Lemma A.1 with the condition for , we have
Since is arbitrarily chosen (as long as it is sufficiently large), we can take . Note that we have . Therefore, it holds that
This establishes
and concludes the proof. ∎
In particular, taking and as in the analysis of Algorithm 1, we have .
Finally, we relate the regret defined in Equation 1 with in the previous lemma.
Lemma A.5.
Suppose that , then .
Proof.
In fact, note that
and any expected instantaneous reward is no greater than 1. We have
The above bound holds because when , we have . The considered quantity is divided into two parts at this threshold . The first part is no greater than and the second part is no greater than . The result follows after simple rearrangement. ∎
In particular, taking as in the analysis of Algorithm 1, we have
conditioning on the good event that the estimation (of Lemma 3.1) succeeds for every . Now take and note that . By using the trivial regret bound for the failure event, we have
and this concludes the proof of Theorem 3.5.
A.4 Proof of Theorem 3.6
Note that the function depends on both and , as does. With the additional condition that also needs to be estimated, we use and to denote the corresponding versions computed using and , respectively. We also extend this notation to in this scenario (cf. Algorithm 2).
We start with two simple lemmas. These two lemmas are direct corollaries of Lemma 3.1 and Lemma A.1.
Lemma A.6.
Let be the estimated distribution at round in Algorithm 1. With probability ,
Lemma A.7.
For any round , budget , it holds that
In the following proof, we will gradually bridge and , the estimated value function. For distributions , the notation refers to the value function computed when and .
Lemma A.8.
With probability at least , for any given and budget , we have
Proof.
First we note that Lemma A.6 states with probability at least , . Now we apply backward induction. When , the induction basis holds trivially since . Assume the induction hypothesis holds for . For any and , it holds that
The first inequality holds since, by construction, is the optimal bid under and 's view while is the optimal bid under and 's view. The second inequality holds by the triangle inequality, and the third inequality is due to the induction hypothesis. Finally, by symmetry (swap and , and repeat the proof above), it holds that .
Next, since and , we have
where
• To bound : since is supported on and the difference between and is upper bounded by , is bounded by . Note that we use the fact that, with integration by parts, . Therefore, it holds that . Since is monotone w.r.t. and is bounded on both sides, it holds that .
• To bound : it is clear that by linearity of expectation.
Therefore, we obtain
which finishes the induction step and concludes the proof. ∎
Lemma A.9.
For any and budget , with probability at least , it holds that:
Proof.
In order to bound the difference between and , we first rewrite
where
Therefore, it holds that:
∎
Now, we provide a bound similar to Lemma A.3 for Theorem 3.6. Similarly, we define
Recall that . The following series of arguments holds. First, we have
It follows from Lemma A.8 that
It suffices to bound . To do so, we rewrite
where
After summing them up, we have
Therefore, conditioning on the good event that the estimation (of both and ) succeeds for every , is bounded by
We then apply the same techniques in Lemma A.4 and Lemma A.5 to yield
conditioning on the good event.
Now take and note that . By using the trivial regret bound for the failure event, we obtain the desired bound
Appendix B Omitted proof in Section 3.2
B.1 Proof of Lemma 3.9
B.2 Proof of Lemma 3.10
Before we proceed to bound the difference between and , we also need the following lemma to characterize a property of Gaussian processes.
Lemma B.1.
Let be a Gaussian process. We have
where and are constants.
For a Gaussian process, the tail follows a Gaussian distribution. For a normal distribution, we have the following well-known inequality:
where and , are constants. For example, we can take and in this inequality.
To bound for a certain Gaussian process, we rescale the random variable; the tail distribution then also satisfies the above property. So, there exist and such that Lemma B.1 holds.
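For reference, the well-known tail bound alluded to above can be stated for a standard normal random variable $X$ and any $t \ge 0$ (a standard fact; the constants in Lemma B.1 are obtained after rescaling):

```latex
\Pr[X > t] \;\le\; e^{-t^2/2}, \qquad\text{hence}\qquad \Pr\bigl[|X| > t\bigr] \;\le\; 2\,e^{-t^2/2}.
```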
Now we give a bound for the estimation of . First, we have
We show that Lemma 3.10 is established if we take where and . Indeed, if then the first part on the right side is not greater than by Lemma B.1, and if then the second part is not greater than . Take . It follows from a simple comparison (between the growth rates of and ) that there exists a constant such that for any and , . Now,
since . This concludes the proof of the lemma.
B.3 Proof of Theorem 3.11
Similar to the proof in the full-feedback case, let us first assume knowledge of ; the regret of learning will be considered later. The proof of Theorem 3.11 is short thanks to the toolset established in previous sections. As usual, we first condition on the good event that the estimation succeeds for every . Then we add the contribution of the bad event to the regret in the final proof.
In order to bound the final regret, we need to bound the difference between and and then apply the methodology used in the proof of Theorem 3.5.
Dropping the first two rounds, for any , the estimation of is . Then using Lemma 3.10, we obtain that, with probability at least ,
where . Note that we use the fact that if , the algorithm updates for at most times. Now, the new concentration bound (Lemma 3.8) effectively changes and the convergence rate in Lemma A.3, Lemma A.4, and Lemma A.5. Therefore, by substituting with and with in these lemmas, we obtain that, conditioning on the good event (w.p. ) that the estimation of succeeds every time,
if Algorithm 2 has the knowledge of .
Next, we add the estimation of . Note that Lemma A.8 decouples the regret of estimating and ; hence, we may apply the same approach as in the proof of Theorem 3.6 by taking and , and we have
conditioning on the good event (w.p. ) that the estimation of succeeds every time.