
Logarithmic Regret in Feature-based Dynamic Pricing

Jianyu Xu
Department of Computer Science
University of California, Santa Barbara
Santa Barbara, CA 93106
[email protected]
Yu-Xiang Wang
Department of Computer Science
University of California, Santa Barbara
Santa Barbara, CA 93106
[email protected]
Abstract

Feature-based dynamic pricing is an increasingly popular model of setting prices for highly differentiated products with applications in digital marketing, online sales, real estate and so on. The problem was formally studied as an online learning problem (Javanmard and Nazerzadeh, 2019) where a seller needs to propose prices on the fly for a sequence of T products based on their features x while having a small regret relative to the best ("omniscient") pricing strategy she could have come up with in hindsight. We revisit this problem and provide two algorithms (EMLP and ONSP) for stochastic and adversarial feature settings, respectively, and prove the optimal O(d\log{T}) regret bounds for both. In comparison, the best existing results are O\left(\min\left\{\frac{1}{\lambda_{\min}^{2}}\log{T},\sqrt{T}\right\}\right) and O(T^{2/3}) respectively, with \lambda_{\min} being the smallest eigenvalue of \mathbb{E}[xx^{\top}] that could be arbitrarily close to 0. We also prove an \Omega(\sqrt{T}) information-theoretic lower bound for a slightly more general setting, which demonstrates that "knowing-the-demand-curve" leads to an exponential improvement in feature-based dynamic pricing.

Keywords: dynamic pricing, online learning, adversarial features, optimal regret, affine invariant, distribution-free.

1 Introduction

The problem of pricing, i.e., finding a price that is both high and acceptable, has been studied since Cournot (1897). In order to locate the optimal price that maximizes revenue, a firm may adjust the prices of its products frequently, which motivates the dynamic pricing problem. Existing works (Kleinberg and Leighton, 2003; Broder and Rusmevichientong, 2012; Chen and Farias, 2013; Besbes and Zeevi, 2015) primarily focus on pricing a single product, which usually does not work well in the setting where thousands of new products are listed every day with no prior experience in selling them. Therefore, we seek methods that approach an acceptable-and-profitable price using only the observations on this single product and some historical selling records of other products.

In this work, we consider a “feature-based dynamic pricing” problem, which was studied by Amin et al. (2014); Cohen et al. (2020); Javanmard and Nazerzadeh (2019). In this problem setting, a sales session (product, customer and other environmental variables) is described by a feature vector, and the customer’s expected valuation is modeled as a linear function of this feature vector.

Feature-based dynamic pricing. For t=1,2,\ldots,T:
1. A feature vector x_{t}\in\mathbb{R}^{d} is revealed that describes a sales session (product, customer and context).
2. The customer values the product at w_{t}=x_{t}^{\top}\theta^{*}+N_{t}.
3. The seller concurrently proposes a price v_{t}>0 (according to x_{t} and historical sales records).
4. The transaction is successful if v_{t}\leq w_{t}, i.e., the seller receives a reward (payment) of r_{t}=v_{t}\cdot\mathbbm{1}(v_{t}\leq w_{t}).

Here T is unknown to the seller (and thus can go to infinity), the x_{t}'s can be either stochastic (e.g., each sales session is drawn i.i.d.) or adversarial (e.g., the sessions arrive in a strategic sequence), \theta^{*}\in\mathbb{R}^{d} is a fixed parameter for all time periods, N_{t} is a zero-mean noise, and \mathbbm{1}_{t}=\mathbbm{1}(v_{t}\leq w_{t}) is an indicator that equals 1 if v_{t}\leq w_{t} and 0 otherwise. In this online setting, we only see and sell one product at each time. Also, the feedback is Boolean and censored, which means we can only observe \mathbbm{1}_{t} instead of knowing w_{t} directly. The best pricing policy for this problem is the one that maximizes the expected reward, and the regret of a pricing policy is accordingly defined as the difference in expected reward between the best policy and the selected policy.
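The interaction protocol above is easy to simulate; below is a minimal sketch, assuming Gaussian noise and uniform features purely for illustration (the noise model, parameter values and the naive placeholder policy are assumptions, not part of the paper).

```python
# Minimal simulation of the feature-based dynamic pricing protocol (a sketch).
import numpy as np

rng = np.random.default_rng(0)
d, T, sigma = 2, 1000, 0.25
theta_star = np.array([0.6, 0.3])            # hidden parameter theta*

def naive_policy(x, history):
    """Placeholder policy: posts the expected valuation (a real learner must estimate theta*)."""
    return float(x @ theta_star)

history, revenue = [], 0.0
for t in range(T):
    x_t = rng.uniform(0.0, 1.0, size=d)      # feature of the sales session (i.i.d. here)
    w_t = x_t @ theta_star + sigma * rng.standard_normal()   # noisy valuation
    v_t = naive_policy(x_t, history)         # seller posts a price
    sold = int(v_t <= w_t)                   # Boolean censored feedback
    revenue += v_t * sold
    history.append((x_t, v_t, sold))         # hist(t): only (x, v, indicator) is recorded
print(f"total revenue over {T} rounds: {revenue:.1f}")
```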

Summary of Results. Our contributions are threefold.

  1. When the x_{t}'s are independently and identically distributed (i.i.d.) from an unknown distribution, we propose an "Epoch-based Max-Likelihood Pricing (EMLP)" algorithm that guarantees an O(d\log{T}) regret bound. The design of EMLP is similar to that of the RMLP algorithm in Javanmard and Nazerzadeh (2019), but our new analysis improves on their O(\sqrt{T}) regret bound when \mathbb{E}[xx^{\top}] is near singular.

  2. When the x_{t}'s are adversarial, we propose an "Online-Newton-Step Pricing (ONSP)" algorithm that achieves O(d\log{T}) regret under constant-level noise for the first time, which exponentially improves the best existing result of O(T^{2/3}) (Cohen et al., 2020). (Previous works (Cohen et al., 2020; Krishnamurthy et al., 2021) did achieve polylog regrets, but only for negligible noise with \sigma=O(\frac{1}{T\log{T}}).)

  3. Our methods that achieve logarithmic regret require knowing the exact distribution of N_{t} in advance, as is also assumed in Javanmard and Nazerzadeh (2019). We prove an \Omega(\sqrt{T}) lower bound on the regret if N_{t}\sim\mathcal{N}(0,\sigma^{2}) where \sigma is unknown, even with \theta^{*} given and x_{t} fixed for all t.

The O(\log{T}) regret of EMLP and ONSP meets the information-theoretic lower bound (Theorem 5, Javanmard and Nazerzadeh, 2019). In fact, the bound is optimal even when w_{t} is revealed to the learner (Mourtada, 2019). From the perspective of characterizing the hardness of dynamic pricing problems, we generalize the classical results on "The Value of Knowing a Demand Curve" (Kleinberg and Leighton, 2003) by further dividing the random-valuation class with an exponential separation of: (1) O(\log{T}) regret when the demand curve is known exactly (even with adversarial features), and (2) \Omega(\sqrt{T}) regret when the demand curve is almost known (up to a one-parameter parametric family).

2 Related Works

In this section, we discuss our results relative to existing works on feature-based dynamic pricing, and highlight the connections and differences to the related settings of contextual bandits and contextual search (for a broader discussion, see Appendix A).

Feature-based Dynamic Pricing. There is a growing body of work on dynamic pricing with linear features (Amin et al., 2014; Qiang and Bayati, 2016; Cohen et al., 2020; Javanmard and Nazerzadeh, 2019). Table 1 summarizes the differences in their settings and results. (We only concern ourselves with the dependence on T, since there are various different assumptions on d.)

Table 1: Related Works and Regret Bounds w.r.t. T

Algorithm | Work | Regret (upper) bound | Feature | Noise
LEAP | Amin et al. (2014) | \tilde{O}(T^{2/3}) | i.i.d. | Noise-free
EllipsoidPricing | Cohen et al. (2020) | O(\log{T}) | adversarial | Noise-free
EllipsoidEXP4 | Cohen et al. (2020) | \tilde{O}(T^{2/3}) | adversarial | Sub-Gaussian
PricingSearch | Leme and Schneider (2018) | O(\log\log(T)) | adversarial | Noise-free
RMLP | Javanmard and Nazerzadeh (2019) | O(\min\{\log{T}/C_{\min}^{2},\sqrt{T}\}) | i.i.d. | Log-concave, distribution-known
RMLP-2 | Javanmard and Nazerzadeh (2019) | O(\sqrt{T}) | i.i.d. | Known parametric family of log-concave
ShallowPricing | Cohen et al. (2020) | O(\mathrm{poly}(\log{T})) | adversarial | Sub-Gaussian, known \sigma=O(\frac{1}{T\log{T}})
CorPV | Krishnamurthy et al. (2021) | O(\mathrm{poly}(\log{T})) | adversarial | Sub-Gaussian, known \sigma=O(\frac{1}{T\log{T}})
Algorithm 2 (MSPP) | Liu et al. (2021) | O(\log\log(T)) | adversarial | Noise-free
EMLP | This paper | O(\log T) | i.i.d. | Strictly log-concave, distribution-known
ONSP | This paper | O(\log{T}) | adversarial | Strictly log-concave, distribution-known

C_{\min} is the restricted eigenvalue condition; it reduces to the smallest eigenvalue of \mathbb{E}[xx^{\top}] in the low-dimensional case we consider.

Among these works, our paper directly builds upon Cohen et al. (2020) and Javanmard and Nazerzadeh (2019), as we share the same setting of online feature vectors, linear and noisy valuations, and Boolean-censored feedback. Relative to the results in Javanmard and Nazerzadeh (2019), we obtain O(d\log T) regret under weaker assumptions on the sequence of input features, in both the distribution-free stochastic feature setting and the adversarial feature setting. It is to be noted that Javanmard and Nazerzadeh (2019) also covers the sparse high-dimensional setting and handles a slightly broader class of demand curves. Relative to Cohen et al. (2020), in which adversarial feature-based dynamic pricing was first studied, our algorithm ONSP enjoys the optimal O(d\log T) regret when the noise level is a constant. In comparison, Cohen et al. (2020) reduces the problem to contextual bandits and applies the (computationally inefficient) "EXP-4" algorithm (Auer et al., 2002) to achieve an \tilde{O}(T^{2/3}) regret. The "bisection"-style algorithms in both Cohen et al. (2020) and Krishnamurthy et al. (2021) could achieve \tilde{O}(\mathrm{poly}(d)\mathrm{poly}\log(T)) regrets but require a small-variance sub-Gaussian noise satisfying \sigma=O(\frac{1}{T\log{T}}).

Lower Bounds. Most existing works focus on lower regret bounds for non-feature-based models. Kleinberg and Leighton (2003) divides the problem into fixed, random, and adversarial valuation settings, and proves \Theta(\log\log{T}), \Theta(\sqrt{T}), and \Theta(T^{2/3}) regret bounds for them, respectively. Broder and Rusmevichientong (2012) further proves a \Theta(\sqrt{T}) regret for general parametric valuation models. In this work, we generalize the methods of Broder and Rusmevichientong (2012) to our feature-based setting and further narrow the construction down to a linear-feature Gaussian-noise model. As a complement to Kleinberg and Leighton (2003), we further separate by an exponential regret gap: (1) O(\log{T}) for the hardest (adversarial feature) fully parametric model, and (2) \Omega(\sqrt{T}) for the simplest (fixed, known expectation) unknown-\sigma Gaussian model.

Contextual Bandits. For readers familiar with the online learning literature, our problem can be reduced to a contextual bandits problem (Langford and Zhang, 2007; Agarwal et al., 2014) by discretizing the prices. But this reduction only yields O(T^{2/3}) regret, as it does not capture the special structure of the feedback: an accepted price indicates the acceptance of all lower prices, and vice versa. Moreover, compared to linear bandits (Chu et al., 2011), it is the valuation rather than the expected reward that we assume to be linear.

Contextual Search. Feature-based dynamic pricing is also related to the contextual search problem (Lobel et al., 2018; Leme and Schneider, 2018; Liu et al., 2021; Krishnamurthy et al., 2021), which often involves learning from Boolean feedback, sometimes with a "pricing loss" and "noisy" feedback. This shared jargon makes contextual search appear very similar to our problem. However, except for the noiseless cases (Lobel et al., 2018; Leme and Schneider, 2018), contextual search algorithms, even with "pricing losses" and "noisy Boolean feedback" (e.g., Liu et al., 2021), do not imply meaningful regret bounds in our problem setup due to several subtle but important differences in the problem settings. Specifically, the noisy Boolean feedback model of Liu et al. (2021) randomly toggles the "purchase decision" determined by the noiseless valuation x^{\top}\theta^{*} with probability 0.5-\epsilon. This is incompatible with our problem setting, where the purchase decision is determined by a noisy valuation x^{\top}\theta^{*}+\text{Noise}. Ultimately, in the setting of Liu et al. (2021) the optimal policy always plays x^{\top}\theta^{*}, but our problem is harder in that we need to exploit the noise, and the optimal price could be very different from x^{\top}\theta^{*}. (As an explicit example, suppose the valuation x^{\top}\theta^{*}=0; then the optimal price must be >0 in order to avoid zero revenue.) Krishnamurthy et al. (2021) also discussed this issue explicitly and considered the more natural noisy Boolean feedback model studied in this paper. Their result, similar to Cohen et al. (2020), only achieves a logarithmic regret when the noise on the valuation vanishes at an \tilde{O}(1/T) rate.

3 Problem Setup

Symbols and Notations. We now introduce the mathematical symbols and notations used in the following pages. The game consists of T rounds. x_{t}\in\mathbb{R}^{d}, v_{t}\in\mathbb{R}_{+} and N_{t}\in\mathbb{R} denote the feature vector, the proposed price and the noise, respectively, at round t=1,2,\ldots,T. (In an epoch-based design, a subscript (k,t) indicates round t of epoch k.) We denote the product u_{t}:=x_{t}^{\top}\theta^{*} as the expected valuation. At each round, we receive a payoff (reward) r_{t}=v_{t}\cdot\mathbbm{1}_{t}, where the binary variable \mathbbm{1}_{t} indicates whether the price is accepted or not, i.e., \mathbbm{1}_{t}=\mathbf{1}(v_{t}\leq w_{t}). As we may estimate \theta^{*} in our algorithms, we denote by \hat{\theta}_{t} an estimator of \theta^{*}, which we will formally define in the algorithms. Furthermore, we denote some functions related to the noise distribution: F(\omega) and f(\omega) denote the cumulative distribution function (CDF) and probability density function (PDF), respectively. We have F^{\prime}(\omega)=f(\omega) if we assume differentiability. To concisely denote all data observed up to round \tau (i.e., feature, price and payoff of all past rounds), we define hist(\tau)=\{(x_{t},v_{t},\mathbbm{1}_{t})\text{ for }t=1,2,\ldots,\tau\}; hist(\tau) represents the transcript of all observed random variables before round (\tau+1).

We define

l_{t}(\theta):=-\mathbbm{1}_{t}\cdot\log\big(1-F(v_{t}-x_{t}^{\top}\theta)\big)-(1-\mathbbm{1}_{t})\cdot\log\big(F(v_{t}-x_{t}^{\top}\theta)\big) (1)

as the negative log-likelihood function at round t. Also, we define an expected log-likelihood function L_{t}(\theta):

L_{t}(\theta):=\mathbb{E}_{N_{t}}[l_{t}(\theta)\,|\,x_{t}] (2)

Notice that we will later define an \hat{L}_{k}(\theta) which is, however, not an expectation.
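As a concrete instance of Equation (1), the following sketch evaluates l_{t}(\theta) for Gaussian noise (the noise model and the value of sigma are illustrative assumptions; any known strictly log-concave F works the same way).

```python
# Per-round negative log-likelihood l_t(theta) from Equation (1), for Gaussian noise.
import numpy as np
from scipy.stats import norm

sigma = 0.25
F = lambda w: norm.cdf(w, scale=sigma)

def neg_log_likelihood(theta, x_t, v_t, sold):
    """l_t(theta) = -1_t*log(1 - F(v_t - x_t^T theta)) - (1 - 1_t)*log F(v_t - x_t^T theta)."""
    p_accept = np.clip(1.0 - F(v_t - x_t @ theta), 1e-12, 1.0 - 1e-12)  # P[sale], clipped for stability
    return -(sold * np.log(p_accept) + (1 - sold) * np.log(1.0 - p_accept))
```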

Definitions of Key Quantities. We first define an expected reward function g(v,u):

g(v,u):=\mathbb{E}[r_{t}\,|\,v_{t}=v,x_{t}^{\top}\theta^{*}=u]=v\cdot P[v\leq x_{t}^{\top}\theta^{*}+N_{t}]=v\cdot(1-F(v-u)). (3)

This indicates that if the expected valuation is u and the proposed price is v, then the (conditionally) expected reward is g(v,u). Now we formally define the regret of a policy (algorithm) \mathcal{A}, as promised in Section 1.

Definition 1 (Regret).

Let \mathcal{A}:\mathbb{R}^{d}\times\left(\mathbb{R}^{d},\mathbb{R},\{0,1\}\right)^{t-1}\rightarrow\mathbb{R} be a pricing policy, i.e., \mathcal{A}(x_{t},hist(t-1))=v_{t}. The regret of \mathcal{A} is defined as follows.

Reg_{\mathcal{A}}=\sum_{t=1}^{T}\max_{v}g(v,x_{t}^{\top}\theta^{*})-g(\mathcal{A}(x_{t},hist(t-1)),x_{t}^{\top}\theta^{*}). (4)

Here hist(t-1) is the historical record up to the (t-1)^{\text{th}} round.

Summary of Assumptions. We specify the problem settings by proposing three assumptions.

Assumption 1 (Known, bounded, strictly log-concave distribution).

The noise N_{t} is independently and identically sampled from a distribution whose CDF is F. Assume that F\in\mathbb{C}^{2} is strictly increasing and that F and (1-F) are strictly log-concave. Also assume that f and f^{\prime} are bounded, and denote B_{f}:=\sup_{\omega\in\mathbb{R}}{f(\omega)} and B_{f^{\prime}}:=\sup_{\omega\in\mathbb{R}}|f^{\prime}(\omega)| as two constants.

Assumption 2 (Bounded convex parameter space).

The true parameter \theta^{*}\in\mathbb{H}, where \mathbb{H}\subseteq\{\theta:||\theta||_{2}\leq B_{1}\} is a bounded convex set and B_{1} is a constant. Assume \mathbb{H} is known to us (but \theta^{*} is not).

Assumption 3 (Bounded feature space).

Assume x_{t}\in D\subseteq\{x:||x||_{2}\leq B_{2}\} for all t=1,2,\ldots,T. Also, 0\leq x^{\top}\theta\leq B for all x\in D and \theta\in\mathbb{H}, where B=B_{1}\cdot B_{2} is a constant.

Assumptions 2 and 3 are mild, as we can choose B_{1} and B_{2} large enough. In Section 4.1, we add a further complement to Assumption 3 to form the stochastic setting. Assumption 1 is stronger since we might not know the exact CDF in practice, but it is still acceptable from an information-theoretic perspective. There are at least three reasons for this assumption. Primarily, it is necessary if we hope to achieve an O(\log(T)) regret: we prove in Section 5.3 that an \Omega(\sqrt{T}) regret is unavoidable if even one parameter of the distribution is unknown. Moreover, the pioneering work of Javanmard and Nazerzadeh (2019) also assumes a known noise distribution with log-concave CDF, and many common distributions are in fact strictly log-concave, such as the Gaussian and logistic distributions. (In fact, F and (1-F) are both log-concave if the PDF is log-concave, according to Prekopa's Inequality.) Besides, although we do not present a method to precisely estimate \sigma in this work, it is reasonable to replace the known parameter with a plug-in estimator computed from historical offline data. As we show, not knowing \sigma incurs \Omega(\sqrt{T}) regret in general, but the lower bound does not rule out the plug-in approach achieving a smaller regret for interesting subclasses of problems in practice.

Finally, we state a lemma and define an argmax function helpful for our algorithm design.

Lemma 2 (Uniqueness).

For any u\geq 0, there exists a unique v^{*}\geq 0 such that g(v^{*},u)=\max_{v\in\mathbb{R}}{g}(v,u). Thus, we can define a greedy pricing function that maximizes the expected reward:

J(u):=\mathop{\arg\max}_{v}g(v,u) (5)

Please see the proof of Lemma 2 in Appendix B.1.
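Since g(\cdot,u) is unimodal with a unique maximizer by Lemma 2, J(u) can be computed by a simple one-dimensional search. Below is a minimal sketch assuming Gaussian noise; the value of sigma and the search interval are illustrative assumptions, not values from the paper.

```python
# Computing the greedy price J(u) = argmax_v v * (1 - F(v - u)) from Equation (5).
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

sigma = 0.25
F = lambda w: norm.cdf(w, scale=sigma)

def J(u):
    """Greedy price maximizing g(v, u) = v * (1 - F(v - u)); unique by Lemma 2."""
    res = minimize_scalar(lambda v: -v * (1.0 - F(v - u)),
                          bounds=(0.0, u + 3.0), method="bounded")  # generous bracket for this sigma
    return res.x

print(J(0.5))   # optimal posted price when the expected valuation is 0.5
```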

4 Algorithms

In this section, we propose two dynamic pricing algorithms: EMLP and ONSP, for stochastic and adversarial features respectively.

Algorithm 1 Epoch-based max-likelihood pricing (EMLP)
  Input: Convex and bounded set \mathbb{H}
  Observe x_{1}, randomly choose v_{1} and get r_{1}.
  Solve \hat{\theta}_{1}=\arg\min_{\theta\in\mathbb{H}}l_{1}(\theta);
  for k=1 to \lfloor\log_{2}{T}\rfloor+1 do
     Set \tau_{k}=2^{k-1};
     for t=1 to \tau_{k} do
        Observe x_{k,t};
        Set price v_{k,t}=J(x_{k,t}^{\top}\hat{\theta}_{k});
        Receive r_{k,t}=v_{k,t}\cdot\mathbbm{1}_{k,t};
     end for
     Solve \hat{\theta}_{k+1}=\arg\min_{\theta\in\mathbb{H}}\hat{L}_{k}(\theta), where \hat{L}_{k}(\theta)=\frac{1}{\tau_{k}}\sum_{t=1}^{\tau_{k}}l_{k,t}(\theta).
  end for
Algorithm 2 Online Newton Step Pricing (ONSP)
  Input: Convex and bounded set \mathbb{H}, initial \theta_{1}, parameters \gamma,\epsilon>0
  Set A_{0}=\epsilon\cdot{I_{d}};
  for t=1 to T do
     Observe x_{t};
     Set price v_{t}=J(x_{t}^{\top}\theta_{t});
     Receive r_{t}=v_{t}\cdot\mathbbm{1}_{t};
     Set the surrogate loss function l_{t}(\theta);
     Calculate \nabla_{t}=\nabla{l}_{t}(\theta_{t});
     Rank-1 update: A_{t}=A_{t-1}+\nabla_{t}\nabla_{t}^{\top};
     Newton step: \hat{\theta}_{t+1}=\theta_{t}-\frac{1}{\gamma}A_{t}^{-1}\nabla_{t};
     Projection: \theta_{t+1}=\Pi_{\mathbb{H}}^{A_{t}}(\hat{\theta}_{t+1}).
  end for

4.1 Pricing with Distribution-Free Stochastic Features

Assumption 4 (Stochastic features).

Assume the x_{t} are drawn independently and identically distributed (i.i.d.) from an unknown distribution \mathbb{D} supported on D, for all t=1,2,\ldots,T.

The first algorithm, the Epoch-based Max-Likelihood Pricing (EMLP) algorithm, is suitable for the stochastic setting defined by Assumption 4. EMLP proceeds in epochs, with each epoch doubling the length of the previous one. At the end of each epoch, we consolidate the observed data and solve a maximum likelihood estimation problem to learn \theta. A maximum likelihood estimator (MLE) is obtained by minimizing \hat{L}_{k}(\theta):=\frac{1}{\tau_{k}}\sum_{t=1}^{\tau_{k}}l_{k,t}(\theta), and is then used in the next epoch as if it were the true parameter vector. In this equation, k and \tau_{k} denote the index and length of epoch k, respectively. The estimator is computed using the data in hist(k), which denotes the transcript for epochs 1 through k. The pseudo-code of EMLP is summarized in Algorithm 1. In the remainder of this section, we discuss the computational efficiency and prove the O(d\log{T}) upper regret bound.

Computational Efficiency. The calculations in EMLP are straightforward except for \arg\min\hat{L}_{k}(\theta) and J(u). As g(v,u) is proved unimodal in Lemma 2, we can efficiently compute J(u) by binary search. We will prove that l_{k,t} is exp-concave (and thus also convex), so we may apply any off-the-shelf tools for convex optimization.
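As a concrete illustration of the per-epoch step, the sketch below solves the MLE with an off-the-shelf solver; the loss argument is a per-round negative log-likelihood such as the neg_log_likelihood sketch in Section 3, and the box relaxation of \mathbb{H} is an illustrative simplification rather than the paper's exact constraint handling.

```python
# Per-epoch MLE step of EMLP: theta_hat_{k+1} = argmin_theta L_hat_k(theta) (a sketch).
import numpy as np
from scipy.optimize import minimize

def epoch_mle(epoch_data, d, loss, B1=1.0):
    """epoch_data: list of (x_t, v_t, sold) tuples observed during the epoch."""
    def L_hat(theta):
        return np.mean([loss(theta, x, v, s) for (x, v, s) in epoch_data])
    res = minimize(L_hat, x0=np.zeros(d), method="L-BFGS-B",
                   bounds=[(-B1, B1)] * d)   # box relaxation of {||theta||_2 <= B1}
    return res.x
```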

MLE and Probit Regression. A closer inspection reveals that this log-likelihood function corresponds to a probit (Aldrich et al., 1984) or a logit model (Wright, 1995) for Gaussian or logistic noise, respectively. See Appendix C.2.1.

Affine Invariance. Both optimization problems involved depend only on x^{\top}\theta, so if we apply any affine transformation to x to obtain \tilde{x}=Ax, the agent can instead learn a new parameter \tilde{\theta}^{*}=(A^{\top})^{-1}\theta^{*} and achieve the same u_{t}=x_{t}^{\top}\theta^{*}. Also, the regret bound is not affected, as the upper bound B on x^{\top}\theta does not change. (Here A is assumed invertible; otherwise the mapping from \tilde{x}_{t} to u_{t} does not necessarily exist.) Therefore, it is only natural that the regret bound depends neither on the distribution of x nor on the condition number of \mathbb{E}[xx^{\top}] (i.e., the ratio \lambda_{\max}/\lambda_{\min}).
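A small numerical check of this remark (an illustration only): transforming x to Ax and \theta^{*} to (A^{\top})^{-1}\theta^{*} leaves x^{\top}\theta^{*} unchanged for any invertible A, so the learning problem is the same up to reparametrization.

```python
# Verify affine invariance of u = x^T theta* under x -> Ax, theta* -> (A^T)^{-1} theta*.
import numpy as np

rng = np.random.default_rng(1)
d = 3
theta_star = rng.random(d)
A = rng.random((d, d)) + d * np.eye(d)        # strictly diagonally dominant, hence invertible
theta_tilde = np.linalg.inv(A.T) @ theta_star
x = rng.random(d)
assert np.isclose(x @ theta_star, (A @ x) @ theta_tilde)
```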

4.2 Pricing with Adversarial Features

In this part, we propose an "Online Newton Step Pricing (ONSP)" algorithm that deals with an adversarial sequence \{x_{t}\} and guarantees O(d\log{T}) regret. The pseudo-code of ONSP is shown as Algorithm 2. In each round, it uses the likelihood function as a surrogate loss and applies the "Online Newton Step" (ONS) method to update \hat{\theta}. In the next round, it adopts the updated \hat{\theta} and sets a price greedily. In the remainder of this section, we discuss some properties of ONSP and prove the regret bound; a minimal sketch of a single ONSP round is given below.
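The sketch below assumes Gaussian noise with known sigma. The pricing function J is passed in (e.g., the sketch after Lemma 2), the true valuation w_t is supplied only to simulate the censored feedback, and the generalized projection onto \mathbb{H}=\{||\theta||_{2}\leq B_{1}\} is solved with a generic constrained solver; all of these are illustrative assumptions, not the paper's implementation.

```python
# One round of ONSP (Algorithm 2), as a hedged sketch for Gaussian noise.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def onsp_round(theta, A, x_t, w_t, J, gamma=1.0, B1=1.0, sigma=0.25):
    v_t = J(x_t @ theta)                              # greedy price from the current estimate
    sold = int(v_t <= w_t)                            # Boolean censored feedback
    m = v_t - x_t @ theta
    F, f = norm.cdf(m, scale=sigma), norm.pdf(m, scale=sigma)
    grad = (-f / (1.0 - F) if sold else f / F) * x_t  # gradient of l_t at theta_t (cf. Eq. (20))
    A = A + np.outer(grad, grad)                      # rank-1 update
    y = theta - np.linalg.solve(A, grad) / gamma      # Newton step
    proj = minimize(lambda th: (th - y) @ A @ (th - y), x0=theta,
                    constraints=[{"type": "ineq",
                                  "fun": lambda th: B1 - np.linalg.norm(th)}])
    return proj.x, A, v_t * sold                      # new estimate, updated A_t, realized reward
```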

The calculations of ONSP are straightforward. The time complexity of computing the matrix inverse A_{t}^{-1} is O(d^{3}), which is acceptable as d is small. In high-dimensional cases, we may use the Woodbury matrix identity, (A+xx^{\top})^{-1}=A^{-1}-\frac{1}{1+x^{\top}A^{-1}x}A^{-1}x(A^{-1}x)^{\top}, to reduce it to O(d^{2}), since we can obtain A^{-1} directly from the previous round.
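A minimal sketch of that rank-1 update, maintaining A_{t}^{-1} directly in O(d^{2}) per round instead of re-inverting in O(d^{3}):

```python
# Woodbury rank-1 update of the inverse, with a quick self-check.
import numpy as np

def woodbury_rank1_update(A_inv, x):
    """Return (A + x x^T)^{-1} given A^{-1}, via
    (A + x x^T)^{-1} = A^{-1} - (A^{-1} x)(A^{-1} x)^T / (1 + x^T A^{-1} x)."""
    Ax = A_inv @ x
    return A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)

rng = np.random.default_rng(0)
M = rng.random((3, 3))
A = np.eye(3) + M @ M.T          # a positive-definite matrix
x = rng.random(3)
assert np.allclose(woodbury_rank1_update(np.linalg.inv(A), x),
                   np.linalg.inv(A + np.outer(x, x)))
```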

5 Regret Analysis

In this section, we prove the logarithmic regret bounds of EMLP and ONSP for the stochastic and adversarial settings, respectively. Besides, we also prove an \Omega(\sqrt{T}) regret lower bound for a fully parametric F with one parameter unknown.

5.1 O(d\log{T}) Regret of EMLP

In this part, we present the regret analysis of Algorithm 1. First of all, we propose the following theorem as our main result on EMLP.

Theorem 3 (Overall regret).

With Assumptions 1, 2, 3 and 4, the expected regret of EMLP can be bounded by:

\mathbb{E}[Reg_{\text{EMLP}}]\leq 2C_{s}{d}{\log{T}}, (6)

where C_{s} is a constant that depends only on F(\omega) and is independent of \mathbb{D}.

The proof of Theorem 3 is involved. For the sake of clarity, we next present an inequality system as a roadmap toward the proof, and then formally establish each of its steps with lemmas.

Since EMLP proposes J(x_{k,t}^{\top}\hat{\theta}_{k}) in every round of epoch k, we may denote the per-round regret as Reg_{t}(\hat{\theta}_{k}), where:

Reg_{t}(\theta):=g(J(x_{t}^{\top}\theta^{*}),x_{t}^{\top}\theta^{*})-g(J(x_{t}^{\top}\theta),x_{t}^{\top}\theta^{*}). (7)

Therefore, it is sufficient to prove the following Theorem:

Theorem 4 (Expected per-round regret).

For the per-round regret defined in Equation (7), we have:

\mathbb{E}[Reg_{k,t}(\hat{\theta}_{k})]\leq C_{s}\cdot\frac{d}{\tau_{k}}.

The proof roadmap of Theorem 4 can be written as the following inequality system.

\mathbb{E}[Reg_{k,t}(\hat{\theta}_{k})]\leq C\cdot\mathbb{E}[(\hat{\theta}_{k}-\theta^{*})^{\top}x_{k,t}x_{k,t}^{\top}(\hat{\theta}_{k}-\theta^{*})]\leq\frac{2C}{C_{\text{down}}}\mathbb{E}[\hat{L}_{k}(\hat{\theta}_{k})-\hat{L}_{k}(\theta^{*})]\leq\frac{2C\cdot C_{\text{exp}}}{C_{\text{down}}^{2}}\frac{d}{\tau_{k}}. (8)

We explain Equation (8) in detail. The first inequality comes from the following Lemma 5.

Lemma 5 (Quadratic regret bound).

We have:

Reg_{t}(\theta)\leq C\cdot(\theta-\theta^{*})^{\top}x_{t}x_{t}^{\top}(\theta-\theta^{*}),\quad\forall\theta\in\mathbb{H},\forall x_{t}\in\mathbb{D}. (9)

Here C=2B_{f}+(B+J(0))\cdot B_{f^{\prime}}.

The intuition is that the function g(J(u),u) is second-order smooth at (J(u^{*}),u^{*}). A detailed proof of Lemma 5 is in Appendix B.2.1. Note that C is highly dependent on the distribution F. Next, we propose Lemma 6, which yields the second inequality of Equation (8).

Lemma 6 (Quadratic likelihood bound).

For the expected likelihood function L_{t}(\theta) defined in Equation (2), we have:

L_{t}(\theta)-L_{t}(\theta^{*})\geq\frac{1}{2}C_{\text{down}}(\theta-\theta^{*})^{\top}x_{t}x_{t}^{\top}(\theta-\theta^{*}),\quad\forall\theta\in\mathbb{H},\forall x\in\mathbb{D}, (10)
where C_{\text{down}}:=\inf_{\omega\in[-B,B+J(0)]}\min\left\{-\frac{\text{d}^{2}\log(1-F(\omega))}{\text{d}\omega^{2}},-\frac{\text{d}^{2}\log(F(\omega))}{\text{d}\omega^{2}}\right\}>0. (11)
Proof.

Since the true parameter always maximizes the expected likelihood function (Murphy, 2012), we have \nabla L(\theta^{*})=0, and hence by Taylor expansion L_{t}(\theta)-L_{t}(\theta^{*})=\frac{1}{2}(\theta-\theta^{*})^{\top}\nabla^{2}L_{t}(\tilde{\theta})(\theta-\theta^{*}) for some \tilde{\theta}=\alpha\theta^{*}+(1-\alpha)\theta. Therefore, we only need to prove the following lemma:

Lemma 7 (Strong convexity and Exponential Concavity).

Suppose l_{t}(\theta) is the negative log-likelihood function in epoch k at time t. For any \theta\in\mathbb{H} and x_{t}\sim\mathbb{D}, we have:

\nabla^{2}l_{t}(\theta)\succeq C_{\text{down}}x_{t}x_{t}^{\top}\succeq\frac{C_{\text{down}}}{C_{\text{exp}}}\nabla l_{t}(\theta)\nabla l_{t}(\theta)^{\top}\succeq 0, (12)
where C_{\text{exp}}:=\sup_{\omega\in[-B,B+J(0)]}\max\left\{\frac{f(\omega)^{2}}{F(\omega)^{2}},\frac{f(\omega)^{2}}{(1-F(\omega))^{2}}\right\}<+\infty. (13)

The proof of Lemma 7 is in Appendix B.2.2. With this lemma, we see that Lemma 6 holds. ∎

With Lemma 5 and Lemma 6, we immediately get the following Lemma 8.

Lemma 8 (Surrogate Regret).

The relationship between Reg_{t}(\theta) and the likelihood function can be shown as follows:

Reg_{t}(\theta)\leq\frac{2\cdot C}{C_{\text{down}}}\left(L_{t}(\theta)-L_{t}(\theta^{*})\right), (14)

for all \theta\in\mathbb{H} and x\in\mathbb{D}, where C and C_{\text{down}} are defined in Lemmas 5 and 6, respectively.

Lemma 8 enables us to use the negative log-likelihood function as a surrogate loss. This is not only an important insight in the regret analysis of EMLP, but also the foundation of the design of ONSP.

The last inequality of Equation (8) comes from this lemma:

Lemma 9 (Per-epoch surrogate regret bound).

Denoting by \hat{\theta}_{k} the estimator computed from epoch (k-1) and used in epoch k, we have:

\mathbb{E}_{h}[\hat{L}_{k}(\hat{\theta}_{k})-\hat{L}_{k}(\theta^{*})]\leq\frac{C_{\text{exp}}}{C_{\text{down}}}\cdot\frac{d}{\tau_{k}+1}. (15)

Here C_{\text{exp}} is defined in Equation (13), and \mathbb{E}_{h}[\cdot]=\mathbb{E}[\cdot\,|\,hist(k-1)].

The proof of Lemma 9 is partly derived from Koren and Levy (2015); here we give a proof sketch without detailed derivations. A full proof is in Appendix B.2.3.

Proof sketch of Lemma 9. We list the four main points that contribute to the proof:

  • Since l_{k,t}(\theta) is strongly convex w.r.t. the seminorm induced by x_{k,t}x_{k,t}^{\top}, we know \hat{L}_{k}(\theta) is also strongly convex w.r.t. \sum_{t=1}^{\tau_{k}}x_{k,t}x_{k,t}^{\top}.

  • For two strongly convex functions g_{1} and g_{2}, we can upper bound the distance between their minimizers (measured in some norm \|\cdot\|) by the dual norm of \nabla(g_{1}-g_{2}).

  • Since a seminorm has no dual norm, we apply two methods to convert it into a norm: (1) separation of the parameters and likelihood functions with a "leave-one-out" method (to take expectations separately), and (2) separation of the spanned space and the null space.

  • As the dual data-dependent norm offsets the sum of xx^{\top} to a constant, Lemma 9 holds.

We have thus proved Inequality (8) after establishing Lemmas 5, 6 and 9. Therefore, Theorem 4 holds.

5.2 O(d\log{T}) Regret of ONSP

Here we present the regret analysis of Algorithm 2 (ONSP). First, we state the main theorem.

Theorem 10.

With Assumptions 1, 2, 3, the regret of Algorithm 2 (ONSP) satisfies:

Reg_{\text{ONSP}}\leq C_{a}\cdot d\log{T}, (16)

where C_{a} depends only on F.

Proof.

The proof of Theorem 10 is more concise than that in Section 5.1, because the key Lemmas 7 and 8 have already been proved there. From Lemma 8, we have:

g(J(u_{t}^{*}),u_{t}^{*})-g(J(u_{t}),u_{t}^{*})\leq\frac{2\cdot C}{C_{\text{down}}}\cdot\mathbb{E}_{N_{t}}[l_{t}(\theta_{t})-l_{t}(\theta^{*})]. (17)

With Equation (17), we may use the regret of the likelihood functions as a surrogate for the pricing regret. From Lemma 7 we see that the log-likelihood function is \frac{C_{\text{down}}}{C_{\text{exp}}}-exponentially concave (a function f(\mu) is \alpha-exponentially concave iff \nabla^{2}f(\mu)\succeq\alpha\nabla f(\mu)\nabla f(\mu)^{\top}). This enables an application of the Online Newton Step method to achieve a logarithmic regret. Therefore, citing a result from online convex optimization (Hazan, 2016), we have the following lemma.

Lemma 11 (Online Newton Step).

With parameters \gamma=\frac{1}{2}\min\{\frac{1}{4GD},\alpha\} and \epsilon=\frac{1}{\gamma^{2}D^{2}}, and T>4, Algorithm 2 guarantees:

\sup_{\{x_{t}\}}\left\{\sum_{t=1}^{T}l_{t}(\theta_{t})-\min_{\theta\in\mathbb{H}}\sum_{t=1}^{T}l_{t}(\theta)\right\}\leq 5\left(\frac{1}{\alpha}+GD\right)d\log T.

Here \alpha=\frac{C_{\text{down}}}{C_{\text{exp}}}, D=2\cdot B_{1} and G=\sqrt{C_{\text{exp}}}\cdot B_{2}.

With Equation (17) and Lemma 11, we have:

Reg=\sum_{t=1}^{T}\left(g(J(u_{t}^{*}),u_{t}^{*})-\mathbb{E}_{N_{1},N_{2},\ldots,N_{t-1}}[g(J(u_{t}),u_{t}^{*})]\right)\leq\frac{2\cdot C}{C_{\text{down}}}\cdot 5\left(\frac{1}{\alpha}+GD\right)d\log T. (18)

Therefore, we have proved Theorem 10. ∎

5.3 Lower Bound for Unknown Distribution

In this part, we examine Assumption 1 and prove that an \Omega(\sqrt{T}) lower regret bound is unavoidable under even a slight relaxation: a Gaussian noise with unknown \sigma. Our proof is inspired by Theorem 3.1 of Broder and Rusmevichientong (2012), while our lower bound relies on more specific assumptions (and thus applies to more general cases).

We first state Assumption 5 covering this part, and then state Theorem 12 as a lower bound:

Assumption 5.

The noise N_{t}\sim\mathcal{N}(0,\sigma^{2}) independently, where 0<\sigma\leq 1 is fixed and unknown.

Theorem 12 (Lower bound with unknown \sigma).

Under Assumptions 2, 3, 4 and 5, for any policy (algorithm) \Psi:\mathbb{R}^{d}\times\left(\mathbb{R}^{d},\mathbb{R},\{0,1\}\right)^{t-1}\rightarrow\mathbb{R}^{+} and any T>2, there exist a Gaussian parameter \sigma\in\mathbb{R}^{+}, a distribution \mathbb{D} of features and a fixed parameter \theta^{*} such that Reg_{\Psi}\geq{\frac{1}{24000}}\cdot\sqrt{T}.

Remark: Here we assume the x_{t} to be i.i.d., which also implies applicability to adversarial features. However, the minimax regret of the stochastic feature setting is \Theta(\sqrt{T}) (Javanmard and Nazerzadeh, 2019), while existing results have not yet closed the gap in the adversarial feature setting.

Proof sketch of Theorem 12. Here we assume a fixed expected valuation, i.e., u^{*}=x_{t}^{\top}\theta^{*} for all t=1,2,\ldots; equivalently, we assume a fixed feature. The main idea of the proof is similar to that in Broder and Rusmevichientong (2012): we take \sigma_{1}=1 and \sigma_{2}=1-T^{-\frac{1}{4}}, and we prove that (1) it is costly for an algorithm to perform well in both cases if the \sigma's differ by a lot, and (2) it is costly for an algorithm to distinguish the two cases if the \sigma's are close to each other. The detailed proof is in Appendix B.3.

6 Numerical Result

(a) Stochastic feature. (b) Adversarial feature.
Figure 1: The regret of EMLP, ONSP and EXP-4 on simulated examples (we only run EXP-4 up to T=2^{12} due to its exponential running time), with Figure 1(a) for stochastic features and Figure 1(b) for adversarial ones. The plots are in log-log scale with all regrets divided by a \log(t) factor to show the convergence. For EXP-4, we discretize the parameter space with T^{-\frac{1}{3}}-size grids, which would incur an \tilde{O}(T^{\frac{2}{3}}) regret according to Cohen et al. (2020). We also plot linear fits for some regret curves, where a slope-\alpha line indicates an O(T^{\alpha}) regret. Besides, we draw error bars and bands with 0.95 coverage using Wald's test. The two diagrams reveal (i) logarithmic regrets of EMLP and ONSP in the stochastic setting, (ii) a nearly linear regret of EMLP in the adversarial setting, and (iii) O(T^{\frac{2}{3}}) regrets of EXP-4 in both settings.

In this section, we conduct numerical experiments to validate EMLP and ONSP. For comparison with existing work, we implement a discretized EXP-4 (Auer et al., 2002) algorithm for pricing, as introduced in Cohen et al. (2020) (in a slightly different setting). We test these three algorithms in both the stochastic and the adversarial setting.

Throughout, we set d=2, B_{1}=B_{2}=B=1 and N_{t}\sim\mathcal{N}(0,\sigma^{2}) with \sigma=0.25. In both settings, we run EMLP and ONSP for T=2^{16} rounds. For ONSP, we empirically select \gamma and \epsilon to accelerate convergence, instead of using the values specified in Lemma 11. Since EXP-4 consumes exponential time and requires knowledge of T in advance to discretize the policy and valuation spaces, we execute EXP-4 for a series of horizons T=2^{k}, k=1,2,\ldots,12. We repeat every experiment 5 times for each setting and then take the average.

Stochastic Setting. We implement and test EMLP, ONSP and EXP-4 with stochastic \{x_{t}\}'s. The numerical results are shown in Figure 1(a) on a log-log diagram, with the regrets divided by \log(t). It shows \log(t)-convergence for EMLP and ONSP, while EXP-4 grows at a t^{\alpha} rate with \alpha\approx 0.699.

Adversarial Setting. We implement the three algorithms and test them on an adversarial sequence \{x_{t}\}: for the k-th epoch, i.e., t=2^{k-1},2^{k-1}+1,\ldots,2^{k}-1, we let x_{t}=[1,0]^{\top} if k\equiv 1\ (\mathrm{mod}\ 2) and x_{t}=[0,1]^{\top} if k\equiv 0\ (\mathrm{mod}\ 2); a sketch of this sequence is given below. The numerical results are shown in Figure 1(b) on a log-log diagram, with the regrets divided by \log(t). The log-log plots of ONSP and EXP-4 are almost the same as those in Figure 1(a). However, EMLP shows an almost linear regret (a t^{\alpha} rate with \alpha\approx 0.912) in this adversarial setting. This is because the adversarial series only trains one dimension of \theta in each epoch, while the other is arbitrarily initialized and does not necessarily converge; in the next epoch, however, the incorrect dimension is exploited. Hence the linear regret.
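A minimal sketch of the adversarial feature sequence just described: epoch k covers rounds t=2^{k-1},\ldots,2^{k}-1 and uses the first standard basis vector for odd k and the second for even k.

```python
# Generate the alternating adversarial feature sequence used in Figure 1(b).
import numpy as np

def adversarial_features(T):
    xs, t, k = [], 1, 1
    while t <= T:
        x = np.array([1.0, 0.0]) if k % 2 == 1 else np.array([0.0, 1.0])
        epoch_len = 2 ** (k - 1)
        xs.extend([x] * min(epoch_len, T - t + 1))
        t += epoch_len
        k += 1
    return xs   # xs[t-1] is the feature revealed at round t
```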

7 Discussion

Here we discuss the coefficients in our regret bounds as a potential direction for future work. In Appendix C we discuss more on algorithmic design, problem modeling, and ethical issues.

Coefficients on Regret Bounds. The exact regret bounds of both EMLP and ONSP contain a constant \frac{C_{\text{exp}}}{C_{\text{down}}} that highly depends on the noise CDF F and could be large. A detailed analysis in Appendix C.1 shows that \frac{C_{\text{exp}}}{C_{\text{down}}} is exponentially large w.r.t. \frac{B}{\sigma} (see Equation 39 and Lemma 21) for Gaussian noise \mathcal{N}(0,\sigma^{2}), which implies that a smaller noise variance leads to a (much) larger regret bound. This is counter-intuitive, as a larger noise usually leads to a more complicated situation, but similar phenomena also occur in existing algorithms suited to constant-variance noise, such as RMLP in Javanmard and Nazerzadeh (2019) and OORMLP in Wang et al. (2020). In fact, it is because a (constantly) large noise helps explore the unknown parameter \theta^{*} and smooths the expected regret. In this work, this can be addressed by increasing T, since we are mainly concerned with the asymptotic regret as T\rightarrow\infty with a fixed noise distribution. However, we admit that it is indeed a nontrivial issue for finite T and small \sigma. The "ShallowPricing" method in Cohen et al. (2020) can deal with a very-small-variance noise setting (when \sigma=\tilde{O}(\frac{1}{T})) and achieve a logarithmic regret. Specifically, its regret bound decreases as the noise variance \sigma decreases (but still does not reach O(\log\log{T}) as the noise vanishes). We might also apply this method as a preprocessing step to cut the parameter domain and decrease \frac{B}{\sigma} within logarithmically many trials (see Cohen et al. (2020), Thm. 3), but it remains open whether a \log(T) regret is achievable when \sigma=\Theta(T^{-\alpha}) for \alpha\in(0,1).

8 Conclusion

In this work, we studied the problem of online feature-based dynamic pricing with a noisy linear valuation in both stochastic and adversarial settings. We proposed a max-likelihood-estimate-based algorithm (EMLP) for stochastic features and an online-Newton-step-based algorithm (ONSP) for adversarial features. Both enjoy a regret guarantee of O(d\log{T}), which attains the information-theoretic limit up to a constant factor. Compared with existing works, EMLP gets rid of strong assumptions on the distribution of the feature vectors in the stochastic setting, and ONSP improves the regret bound exponentially, from O(T^{2/3}) to O(\log{T}), in the adversarial setting. We also showed that knowing the noise distribution (or the demand curve) is required to obtain logarithmic regret, by proving a lower bound of \Omega(\sqrt{T}) on the regret for the case when the noise is known to be Gaussian but with an unknown \sigma. In addition, we conducted numerical experiments to empirically validate the scaling of our algorithms. Finally, we discussed the dependence of the regret on the noise variance, and proposed a subtle open problem for further study.

Acknowledgments

The work is partially supported by the Adobe Data Science Award and a start-up grant from the UCSB Department of Computer Science. We appreciate the input from anonymous reviewers and AC as well as a discussion with Akshay Krishnamurthy for clarifying some details of Krishnamurthy et al. (2021).

References

  • Agarwal et al. [2014] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning (ICML-14), pages 1638–1646, 2014.
  • Aldrich et al. [1984] J. H. Aldrich, F. D. Nelson, and E. S. Adler. Linear Probability, Logit, and Probit Models. Number 45. Sage, 1984.
  • Amin et al. [2014] K. Amin, A. Rostamizadeh, and U. Syed. Repeated contextual auctions with strategic buyers. In Advances in Neural Information Processing Systems (NIPS-14), pages 622–630, 2014.
  • Araman and Caldentey [2009] V. F. Araman and R. Caldentey. Dynamic pricing for nonperishable products with demand learning. Operations Research, 57(5):1169–1188, 2009.
  • Auer et al. [2002] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
  • Aydin and Ziya [2009] G. Aydin and S. Ziya. Personalized dynamic pricing of limited inventories. Operations Research, 57(6):1523–1531, 2009.
  • Besanko et al. [1998] D. Besanko, S. Gupta, and D. Jain. Logit demand estimation under competitive pricing behavior: An equilibrium framework. Management Science, 44(11-part-1):1533–1547, 1998.
  • Besanko et al. [2003] D. Besanko, J.-P. Dubé, and S. Gupta. Competitive price discrimination strategies in a vertical channel using aggregate retail data. Management Science, 49(9):1121–1138, 2003.
  • Besbes and Zeevi [2015] O. Besbes and A. Zeevi. On the (surprising) sufficiency of linear models for dynamic pricing with demand learning. Management Science, 61(4):723–739, 2015.
  • Broder and Rusmevichientong [2012] J. Broder and P. Rusmevichientong. Dynamic pricing under a general parametric choice model. Operations Research, 60(4):965–980, 2012.
  • Chan et al. [2009] T. Chan, V. Kadiyali, and P. Xiao. Structural models of pricing. Handbook of pricing research in marketing, pages 108–131, 2009.
  • Chen and Gallego [2021] N. Chen and G. Gallego. A primal-dual learning algorithm for personalized dynamic pricing with an inventory constraint. Mathematics of Operations Research, 2021.
  • Chen and Farias [2013] Y. Chen and V. F. Farias. Simple policies for dynamic pricing with imperfect forecasts. Operations Research, 61(3):612–624, 2013.
  • Chu et al. [2011] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics (AISTATS-11), pages 208–214, 2011.
  • Cohen et al. [2020] M. C. Cohen, I. Lobel, and R. Paes Leme. Feature-based dynamic pricing. Management Science, 66(11):4921–4943, 2020.
  • Cournot [1897] A. A. Cournot. Researches into the Mathematical Principles of the Theory of Wealth. Macmillan, 1897.
  • den Boer [2015] A. V. den Boer. Dynamic pricing and learning: historical origins, current research, and new directions. Surveys in Operations Research and Management Science, 20(1):1–18, 2015.
  • Draganska and Jain [2006] M. Draganska and D. C. Jain. Consumer preferences and product-line pricing strategies: An empirical analysis. Marketing science, 25(2):164–174, 2006.
  • Evans [1924] G. C. Evans. The dynamics of monopoly. The American Mathematical Monthly, 31(2):77–83, 1924.
  • Hazan [2016] E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
  • Iyengar et al. [2007] R. Iyengar, A. Ansari, and S. Gupta. A model of consumer learning for service quality and usage. Journal of Marketing Research, 44(4):529–544, 2007.
  • Javanmard and Nazerzadeh [2019] A. Javanmard and H. Nazerzadeh. Dynamic pricing in high-dimensions. The Journal of Machine Learning Research, 20(1):315–363, 2019.
  • Joskow and Wolfram [2012] P. L. Joskow and C. D. Wolfram. Dynamic pricing of electricity. American Economic Review, 102(3):381–85, 2012.
  • Kadiyali et al. [1996] V. Kadiyali, N. J. Vilcassim, and P. K. Chintagunta. Empirical analysis of competitive product line pricing decisions: Lead, follow, or move together? Journal of Business, pages 459–487, 1996.
  • Keskin and Zeevi [2014] N. B. Keskin and A. Zeevi. Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies. Operations Research, 62(5):1142–1167, 2014.
  • Kincaid and Darling [1963] W. Kincaid and D. Darling. An inventory pricing problem. Journal of Mathematical Analysis and Applications, 7:183–208, 1963.
  • Kleinberg and Leighton [2003] R. Kleinberg and T. Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In IEEE Symposium on Foundations of Computer Science (FOCS-03), pages 594–605. IEEE, 2003.
  • Koren and Levy [2015] T. Koren and K. Levy. Fast rates for exp-concave empirical risk minimization. In Advances in Neural Information Processing Systems (NIPS-15), pages 1477–1485, 2015.
  • Krämer et al. [2018] A. Krämer, M. Friesen, and T. Shelton. Are airline passengers ready for personalized dynamic pricing? a study of german consumers. Journal of Revenue and Pricing Management, 17(2):115–120, 2018.
  • Krishnamurthy et al. [2021] A. Krishnamurthy, T. Lykouris, C. Podimata, and R. Schapire. Contextual search in the presence of irrational agents. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (STOC-21), pages 910–918, 2021.
  • Lambrecht et al. [2007] A. Lambrecht, K. Seim, and B. Skiera. Does uncertainty matter? consumer behavior under three-part tariffs. Marketing Science, 26(5):698–710, 2007.
  • Langford and Zhang [2007] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems (NIPS-07), pages 817–824, 2007.
  • Leme and Schneider [2018] R. P. Leme and J. Schneider. Contextual search via intrinsic volumes. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS-18), pages 268–282. IEEE, 2018.
  • Liu et al. [2021] A. Liu, R. P. Leme, and J. Schneider. Optimal contextual pricing and extensions. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA-21), pages 1059–1078. SIAM, 2021.
  • Lobel et al. [2018] I. Lobel, R. P. Leme, and A. Vladu. Multidimensional binary search for contextual decision-making. Operations Research, 66(5):1346–1361, 2018.
  • Mazumdar et al. [2005] T. Mazumdar, S. P. Raj, and I. Sinha. Reference price research: Review and propositions. Journal of Marketing, 69(4):84–102, 2005.
  • Mourtada [2019] J. Mourtada. Exact minimax risk for linear least squares, and the lower tail of sample covariance matrices. arXiv preprint arXiv:1912.10754, 2019.
  • Murphy [2012] K. P. Murphy. Machine Learning: a Probabilistic Perspective. MIT press, 2012.
  • Qiang and Bayati [2016] S. Qiang and M. Bayati. Dynamic pricing with demand covariates. arXiv preprint arXiv:1604.07463, 2016.
  • Schultz et al. [1938] H. Schultz et al. Theory and Measurement of Demand. The University of Chicago Press, 1938.
  • Wang et al. [2020] C.-H. Wang, Z. Wang, W. W. Sun, and G. Cheng. Online regularization for high-dimensional dynamic pricing algorithms. arXiv preprint arXiv:2007.02470, 2020.
  • Whittle [1980] P. Whittle. Multi-armed bandits and the gittins index. Journal of the Royal Statistical Society: Series B (Methodological), 42(2):143–149, 1980.
  • Wright [1995] R. E. Wright. Logistic regression. 1995.

APPENDIX

Appendix A Other related works

Here we briefly review the history and recent studies related to our work. For the historical introduction, we mainly refer to the survey of den Boer [2015]. For bandit approaches, we review works that apply bandit algorithms to pricing problems. For structural models, we introduce different modules based on the review in Chan et al. [2009]. These existing works help put our problem setting and methodology in perspective.

A.1 History of Pricing

It was the work of Cournot [1897] in 1897 that first applied mathematics to analyze the relationship between prices and demands. In that work, the price was denoted as p and the demand was defined as a demand function F(p). Therefore, the revenue could be written as pF(p). This was a straightforward interpretation of the general pricing problem, and the key to solving it was the estimation of F(p) for different products. Later, in 1938, Schultz et al. [1938] proposed price-demand measurements on exclusive kinds of products. It is worth mentioning that these problems are "static pricing" ones, because F is totally determined by the price p and we only need to stick to the optimal price to maximize profits.

However, the static settings were challenged by the following two observations: on the one hand, a demand function may not only depend on the static value of p, but may also be affected by the trend of p's changes [Evans, 1924, Mazumdar et al., 2005]; on the other hand, even if F(p) is static, p itself might change over time according to other factors such as the inventory level [Kincaid and Darling, 1963]. As a result, it is necessary to consider dynamics in both demand and price, which leads to the "dynamic pricing" problem setting.

A.2 Dynamic Pricing as Bandits

As discussed in Section 2, the pricing problem can be viewed as a stochastic contextual bandits problem [see, e.g., Langford and Zhang, 2007, Agarwal et al., 2014]. Even though we may not know the form of the demand function, we can observe feedback on demand, i.e., how many products are sold, which enables us to learn a better decision-making policy. Therefore, it can be studied in a bandit framework. If the demand function is totally agnostic, i.e., the valuations (the highest prices that customers would accept) come at random or even adversarially over time, then it can be modeled exactly as a multi-armed bandit (MAB) problem [Whittle, 1980]. In our paper, instead, we focus on selling different products with a great variety of features. This can be characterized as a contextual bandit (CB) problem [Auer et al., 2002, Langford and Zhang, 2007]. The work of Cohen et al. [2020], which applies the "EXP-4" algorithm from Auer et al. [2002], also mentions that "the arms represent prices and the payoffs from the different arms are correlated since the measures of demand evaluated at different price points are correlated random variables". A variety of existing works, including Kleinberg and Leighton [2003], Araman and Caldentey [2009], Chen and Farias [2013], Keskin and Zeevi [2014], Besbes and Zeevi [2015], have approached the demand function from the perspective of either parametric or non-parametric bandits.

However, our problem setting differs from a contextual bandits setting in at least two respects: feedback and regret. The pricing problem has a specially structured feedback that sits between the full-information and bandit settings. Specifically, r_{t}>0 implies that all policies producing v<v_{t} would have ended up receiving r^{\prime}_{t}=v, and r_{t}=0 implies that all policies producing v>v_{t} would have ended up receiving r^{\prime}_{t}=0. However, the missing patterns are confounded with the rewards, so it is non-trivial to leverage this structure to improve the importance sampling approach underlying the algorithm of Agarwal et al. [2014]. We instead consider the natural analog of the linear contextual bandits setting [Chu et al., 2011] (but do notice that our expected reward is not linear, even if the valuation function is) and demonstrate that in this case an exponential improvement in the regret is possible using the additional information from the censored feedback. As for regret, while in contextual bandits it refers to a comparison with the optimal policy, here it refers to a comparison with the optimal action. In other words, though our approaches (both EMLP and ONSP) are finding the true parameter \theta^{*}, the regret is defined as the "revenue gap" between the optimal price and our proposed prices. These are actually equivalent in our fully parametric setting (where we assume a linear-valuation, known-noise model), but differ a lot in partially parametric and totally agnostic settings.

A.3 Structural Model

While a totally agnostic model guarantees the most generality, a structural model helps us better understand the mechanism behind the observed prices and demands. The key to a structural pricing model is the behavior of agents in the market, including customers and/or firms. In other words, the behavior of each side can be described as a decision model. From the perspective of demand (customers), Kadiyali et al. [1996] adopts a linear model for the laundry detergent market, and Iyengar et al. [2007] and Lambrecht et al. [2007] study three-part-tariff pricing problems on wireless and internet services with mixed logit models. Besanko et al. assumed an aggregate logit model on customers in Besanko et al. [1998] and Besanko et al. [2003] in order to study the competitive behavior of manufacturers in the ketchup market. Meanwhile, the supply side is usually assumed to be more strategic, e.g., exhibiting Bertrand-Nash behaviors [Kadiyali et al., 1996, Besanko et al., 1998, Draganska and Jain, 2006]. For more details, please see Chan et al. [2009].

Appendix B Proofs

B.1 Proof of Lemma 2

Proof.

Since v^{*}=\mathop{\mathrm{argmax}}_{v}g(v,u), we have:

\frac{\partial g(v,u)}{\partial v}\Big|_{v=v^{*}}=0
\Leftrightarrow 1-F(v^{*}-u)-v^{*}\cdot f(v^{*}-u)=0
\Leftrightarrow \frac{1-F(v^{*}-u)}{f(v^{*}-u)}-(v^{*}-u)=u

Define \varphi(\omega)=\frac{1-F(\omega)}{f(\omega)}-\omega, and take derivatives:

\varphi^{\prime}(\omega)=\frac{-f^{2}(\omega)-(1-F(\omega))f^{\prime}(\omega)}{f^{2}(\omega)}-1=\frac{\text{d}^{2}\log(1-F(\omega))}{\text{d}\omega^{2}}\cdot\frac{(1-F(\omega))^{2}}{(f(\omega))^{2}}-1<-1,

where the last inequality comes from the strict log-concavity of (1-F(\omega)). Therefore, \varphi(\omega) is decreasing and \varphi(+\infty)=-\infty. Also, noticing \varphi(-\infty)=+\infty, we know that for any u\in\mathbb{R}, there exists an \omega such that \varphi(\omega)=u. For u\geq 0, we know that g(v,u)\geq 0 for v\geq 0 and g(v,u)<0 for v<0. Therefore, v^{*}\geq 0 if u\geq 0. ∎

B.2 Proofs in Section 5.1

B.2.1 Proof of Lemma 5

Proof.

We again define φ(ω)=1F(ω)f(ω)ω\varphi(\omega)=\frac{1-F(\omega)}{f(\omega)}-\omega as in Appendix B.1. According to Equation 5, we have:

g(v,u)v|v=J(u)=0\displaystyle\frac{\partial g(v,u)}{\partial v}|_{v=J(u)}=0 (19)
\displaystyle\Rightarrow 1F(J(u)u)J(u)f(J(u)u)=0\displaystyle 1-F(J(u)-u)-J(u)\cdot f(J(u)-u)=0
\displaystyle\Rightarrow φ(J(u)u)=u\displaystyle\varphi(J(u)-u)=u
\displaystyle\Rightarrow J(u)=u+φ1(u)\displaystyle J(u)=u+\varphi^{-1}(u)
\displaystyle\Rightarrow J(u)=1+1φ(φ1(u)).\displaystyle J^{\prime}(u)=1+\frac{1}{\varphi^{\prime}(\varphi^{-1}(u))}.

The last line of Equation 19 follows from the implicit function theorem. From the result in Appendix B.1, we know that \varphi^{\prime}(\omega)<-1 for all \omega\in\mathbb{R}. Therefore, J^{\prime}(u)\in(0,1) for all u\in\mathbb{R}, and hence 0\leq J(u)<u+J(0) for u\geq 0. Since u\in[0,B], we may assume that v\in[0,B+J(0)] without loss of generality. We will use this range frequently in what follows.
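As an illustrative aside (not part of the proof), the following sketch computes J(u)=u+\varphi^{-1}(u) for standard Gaussian noise by one-dimensional root finding and numerically confirms J(0)>0 and J^{\prime}(u)\in(0,1); the bracketing interval and the grid are assumptions of the demo.

```python
# Compute the pricing function J(u) = u + phi^{-1}(u) for standard Gaussian noise.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def phi(w):
    return norm.sf(w) / norm.pdf(w) - w

def J(u):
    # phi is strictly decreasing, so phi(w) = u has a unique root w = phi^{-1}(u)
    w = brentq(lambda t: phi(t) - u, -20.0, 20.0)
    return u + w

us = np.linspace(0.0, 5.0, 51)
Js = np.array([J(u) for u in us])
dJ = np.diff(Js) / np.diff(us)
print(J(0.0))                            # J(0) > 0
assert np.all((dJ > 0) & (dJ < 1))       # J'(u) in (0, 1)
assert np.all(Js <= us + J(0.0) + 1e-8)  # J(u) < u + J(0), equality only at u = 0
```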

Denote u:=xtθ,u=xtθu:=x_{t}^{\top}\theta,u^{*}=x_{t}^{\top}\theta^{*}. According to Equation 7, we know that:

Regt(θ)\displaystyle Reg_{t}(\theta) =g(J(u),u)g(J(u),u)\displaystyle=g(J(u^{*}),u^{*})-g(J(u),u^{*})
=-\frac{\partial g(v,u^{*})}{\partial v}\Big|_{v=J(u^{*})}\left(J(u)-J(u^{*})\right)+\frac{1}{2}\left(-\frac{\partial^{2}g(v,u^{*})}{\partial v^{2}}\Big|_{v=\tilde{v}}\right)(J(u^{*})-J(u))^{2}
0+12maxv~[0,B+J(0)](2g(v,u)v2|v=v~)(J(u)J(u))2\displaystyle\leq 0+\frac{1}{2}\max_{\tilde{v}\in[0,B+J(0)]}\left(-\frac{\partial^{2}g(v,u^{*})}{\partial v^{2}}|_{v=\tilde{v}}\right)\cdot(J(u^{*})-J(u))^{2}
=12maxv~[0,B+J(0)](2f(v~u)+v~f(v~u))(J(u)J(u))2\displaystyle=\frac{1}{2}\max_{\tilde{v}\in[0,B+J(0)]}\left(2f(\tilde{v}-u^{*})+\tilde{v}\cdot{f^{\prime}}(\tilde{v}-u^{*})\right)\cdot(J(u^{*})-J(u))^{2}
12(2Bf+(B+J(0))Bf)(J(u)J(u))2\displaystyle\leq\frac{1}{2}(2B_{f}+(B+J(0))\cdot B_{f^{\prime}})(J(u^{*})-J(u))^{2}
12(2Bf+(B+J(0))Bf)(uu)2\displaystyle\leq\frac{1}{2}(2B_{f}+(B+J(0))\cdot B_{f^{\prime}})(u^{*}-u)^{2}
=12(2Bf+(B+J(0))Bf)(θθ)xtxt(θθ).\displaystyle=\frac{1}{2}(2B_{f}+(B+J(0))\cdot B_{f^{\prime}})(\theta^{*}-\theta)^{\top}x_{t}x_{t}^{\top}(\theta^{*}-\theta).

Here the first line is from the definitions of g and Reg_{t}(\theta), the second line is Taylor's expansion, the third line uses the fact that J(u^{*}) maximizes g(v,u^{*}) with respect to v (so the first-order term vanishes), the fourth line is by direct calculation, the fifth line is from the assumptions 0<f(\omega)\leq B_{f} and |f^{\prime}(\omega)|\leq B_{f^{\prime}} together with \tilde{v}\in[0,B+J(0)], the sixth line is because J^{\prime}(u)\in(0,1) for all u\in\mathbb{R}, and the seventh line is from the definitions of u and u^{*}. ∎
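The quadratic bound just derived can be spot-checked numerically; the sketch below does so for standard Gaussian noise with the illustrative choice B=2 (here B_{f}=f(0) and B_{f^{\prime}}=|f^{\prime}(1)| are the Gaussian-specific constants, and everything else is an assumption of the demo).

```python
# Spot check of the quadratic regret bound in Lemma 5 for standard Gaussian noise.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

B = 2.0
B_f = norm.pdf(0.0)            # max of the Gaussian pdf
B_fp = 1.0 * norm.pdf(1.0)     # max of |f'(w)| = |w| f(w), attained at |w| = 1

def phi(w):
    return norm.sf(w) / norm.pdf(w) - w

def J(u):
    return u + brentq(lambda t: phi(t) - u, -20.0, 20.0)

def g(v, u):
    return v * norm.sf(v - u)  # expected reward v * (1 - F(v - u))

C = 0.5 * (2 * B_f + (B + J(0.0)) * B_fp)
for u_star in np.linspace(0.0, B, 9):
    for u in np.linspace(0.0, B, 9):
        reg = g(J(u_star), u_star) - g(J(u), u_star)
        assert reg <= C * (u_star - u) ** 2 + 1e-10
```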

B.2.2 Proof of Lemma 7

Proof.

We take derivatives of lt(θ)l_{t}(\theta), and we get:

lt(θ)=\displaystyle l_{t}(\theta)= 𝟙t(log(1F(vtxtθ)))+(1𝟙t)(log(F(vtxtθ)))\displaystyle\mathbbm{1}_{t}\left(-\log(1-F(v_{t}-x_{t}^{\top}\theta))\right)+(1-\mathbbm{1}_{t})\left(-\log(F(v_{t}-x_{t}^{\top}\theta))\right) (20)
lt(θ)=\displaystyle\nabla l_{t}(\theta)= 𝟙t(f(vtxtθ)1F(vtxtθ))xt+(1𝟙t)(f(vtxtθ)F(vtxtθ))xt\displaystyle\mathbbm{1}_{t}\left(-\frac{f(v_{t}-x_{t}^{\top}\theta)}{1-F(v_{t}-x_{t}^{\top}\theta)}\right)\cdot x_{t}+(1-\mathbbm{1}_{t})\left(\frac{f(v_{t}-x_{t}^{\top}\theta)}{F(v_{t}-x_{t}^{\top}\theta)}\right)\cdot x_{t}
2lt(θ)=\displaystyle\nabla^{2}l_{t}(\theta)= 𝟙tf(vtxtθ)2+f(vtxtθ)(1F(vtxtθ))(1F(vtxtθ))2xtxt\displaystyle\mathbbm{1}_{t}\cdot\frac{f(v_{t}-x_{t}^{\top}\theta)^{2}+f^{\prime}(v_{t}-x_{t}^{\top}\theta)\cdot(1-F(v_{t}-x_{t}^{\top}\theta))}{(1-F(v_{t}-x_{t}^{\top}\theta))^{2}}\cdot{x_{t}x_{t}^{\top}}
+(1𝟙t)f(vtxtθ)2f(vtxtθ)F(vtxtθ)F(vtxtθ)2xtxt\displaystyle+(1-\mathbbm{1}_{t})\cdot\frac{f(v_{t}-x_{t}^{\top}\theta)^{2}-f^{\prime}(v_{t}-x_{t}^{\top}\theta)F(v_{t}-x_{t}^{\top}\theta)}{F(v_{t}-x_{t}^{\top}\theta)^{2}}\cdot{x_{t}x_{t}^{\top}}
=\displaystyle= 𝟙td2log(1F(ω))dω2|ω=vtxtθxtxt+(1𝟙t)d2log(F(ω))dω2|ω=vtxtθxtxt\displaystyle\mathbbm{1}_{t}\cdot\frac{-\text{d}^{2}\log(1-F(\omega))}{\text{d}\omega^{2}}|_{\omega=v_{t}-x_{t}^{\top}\theta}\cdot{x_{t}x_{t}^{\top}}+(1-\mathbbm{1}_{t})\frac{-\text{d}^{2}\log(F(\omega))}{\text{d}\omega^{2}}|_{\omega=v_{t}-x_{t}^{\top}\theta}\cdot{x_{t}x_{t}^{\top}}
\succeq\ \inf_{\omega\in[-B,B+J(0)]}\min\left\{-\frac{\text{d}^{2}\log(1-F(\omega))}{\text{d}\omega^{2}},\,-\frac{\text{d}^{2}\log(F(\omega))}{\text{d}\omega^{2}}\right\}\cdot x_{t}x_{t}^{\top}
=\displaystyle= Cdownxtxt,\displaystyle C_{\text{down}}x_{t}x_{t}^{\top},

which directly proves the first inequality. For the second inequality, just notice that

lt(θ)lt(θ)=\displaystyle\nabla l_{t}(\theta)\nabla l_{t}(\theta)^{\top}= 𝟙t(f(vtxtθ)1F(vtxtθ))2xtxt+(1𝟙t)(f(vtxtθ)F(vtxtθ))2xtxt\displaystyle\mathbbm{1}_{t}\left(\frac{f(v_{t}-x_{t}^{\top}\theta)}{1-F(v_{t}-x_{t}^{\top}\theta)}\right)^{2}x_{t}x_{t}^{\top}+(1-\mathbbm{1}_{t})\left(\frac{f(v_{t}-x_{t}^{\top}\theta)}{F(v_{t}-x_{t}^{\top}\theta)}\right)^{2}x_{t}x_{t}^{\top} (21)
\displaystyle\preceq supω[B,B+J(0)]max{(f(ω)F(ω))2,(f(ω)1F(ω))2}xtxt\displaystyle\sup_{\omega\in[-B,B+J(0)]}\max\{\left(\frac{f(\omega)}{F(\omega)}\right)^{2},\left(\frac{f(\omega)}{1-F(\omega)}\right)^{2}\}x_{t}x_{t}^{\top}
=\displaystyle= Cexpxtxt.\displaystyle C_{\text{exp}}x_{t}x_{t}^{\top}.

The only thing to point out is that \frac{f(\omega)}{F(\omega)} and \frac{f(\omega)}{1-F(\omega)} are both continuous on \omega\in[-B,B+J(0)], since F(\omega) is strictly increasing and thus 0<F(\omega)<1 for all \omega\in\mathbb{R}. ∎
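For concreteness, the sketch below evaluates the two constants of Lemma 7 numerically for standard Gaussian noise on an illustrative range \omega\in[-B,B+J(0)] (B=2 and J(0)\approx 0.75 are assumed values for the demo), using the identities -\frac{\text{d}^{2}}{\text{d}\omega^{2}}\log(1-F(\omega))=\lambda(\omega)(\lambda(\omega)-\omega) and -\frac{\text{d}^{2}}{\text{d}\omega^{2}}\log F(\omega)=\mu(\omega)(\mu(\omega)+\omega), where \lambda=f/(1-F) and \mu=f/F.

```python
# Numerical values of C_down and C_exp from Lemma 7 for standard Gaussian noise.
import numpy as np
from scipy.stats import norm

B, J0 = 2.0, 0.75
w = np.linspace(-B, B + J0, 4001)

lam = norm.pdf(w) / norm.sf(w)    # hazard rate f / (1 - F)
mu = norm.pdf(w) / norm.cdf(w)    # reversed hazard rate f / F

C_down = np.min(np.minimum(lam * (lam - w), mu * (mu + w)))
C_exp = np.max(np.maximum(lam ** 2, mu ** 2))
print(C_down, C_exp)   # both strictly positive and finite, as used in the proof
assert C_down > 0 and np.isfinite(C_exp)
```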

B.2.3 Proof of Lemma 9

Proof.

In the following part, we consider an epoch in which n\geq 2 rounds of pricing are conducted, generating negative log-likelihood functions l_{j}(\theta), j=1,2,\ldots,n. Define the "leave-one-out" negative log-likelihood function

L~i(θ)=1nj=1,jinlj(θ),\tilde{L}_{i}(\theta)=\frac{1}{n}\sum_{j=1,j\neq i}^{n}l_{j}(\theta),

and let

θ~i:=argminθL~i(θ).\tilde{\theta}_{i}:=\mathop{\arg\min}\limits_{\theta}\tilde{L}_{i}(\theta).

Based on this definition, we know that \tilde{\theta}_{i} is independent of l_{i}(\theta) given the historical data, and that the \tilde{\theta}_{i} are identically distributed for i=1,2,3,\ldots,n.

In the following part, we first state and prove the following inequality:

1ni=1n(li(θ~i)li(θ^))CexpCdowndn=O(dn),\frac{1}{n}\sum_{i=1}^{n}(l_{i}(\tilde{\theta}_{i})-l_{i}(\hat{\theta}))\leq\frac{C_{\text{exp}}}{C_{down}}\frac{d}{n}=O(\frac{d}{n}), (22)

where θ^\hat{\theta} is the short-hand notation of θ^k\hat{\theta}_{k} as we do not specify the epoch kk in this part. We now cite a lemma from Koren and Levy [2015]:

Lemma 13.

Let g_{1}, g_{2} be two convex functions defined over a closed and convex domain \mathcal{K}\subseteq\mathbb{R}^{d}, and let x_{1}=\arg\min_{x\in\mathcal{K}}g_{1}(x) and x_{2}=\arg\min_{x\in\mathcal{K}}g_{2}(x). Assume g_{2} is locally \delta-strongly-convex at x_{1} with respect to a norm ||\cdot||. Then, for h=g_{2}-g_{1} we have

x2x12δh(x1).||x_{2}-x_{1}||\leq\frac{2}{\delta}||\nabla h(x_{1})||_{*}.

Here ||||||\cdot||_{*} denotes a dual norm.

The following is a proof of this lemma.

Proof.

(of Lemma 13) According to convexity of g2g_{2}, we have:

g2(x1)g2(x2)+g2(x2)(x1x2).g_{2}(x_{1})\geq g_{2}(x_{2})+\nabla g_{2}(x_{2})^{\top}(x_{1}-x_{2}). (23)

According to strong convexity of g2g_{2} at x1x_{1}, we have:

g2(x2)g2(x1)+g2(x1)(x2x1)+δ2x2x12.g_{2}(x_{2})\geq g_{2}(x_{1})+\nabla g_{2}(x_{1})^{\top}(x_{2}-x_{1})+\frac{\delta}{2}||x_{2}-x_{1}||^{2}. (24)

Add Equation (23) and (24), and we have:

g2(x1)+g2(x2)g2(x2)+g2(x1)+(g2(x1)g2(x2))(x2x1)+δ2x2x12\displaystyle g_{2}(x_{1})+g_{2}(x_{2})\geq g_{2}(x_{2})+g_{2}(x_{1})+(\nabla g_{2}(x_{1})-\nabla g_{2}(x_{2}))^{\top}(x_{2}-x_{1})+\frac{\delta}{2}||x_{2}-x_{1}||^{2} (25)
\displaystyle\Leftrightarrow (g2(x1)g2(x2))(x1x2)δ2x1x22\displaystyle(\nabla g_{2}(x_{1})-\nabla g_{2}(x_{2}))^{\top}(x_{1}-x_{2})\geq\frac{\delta}{2}||x_{1}-x_{2}||^{2}
\displaystyle\Leftrightarrow (g1(x1)+h(x1)g2(x2))(x1x2)δ2x1x22\displaystyle(\nabla g_{1}(x_{1})+\nabla h(x_{1})-\nabla g_{2}(x_{2}))^{\top}(x_{1}-x_{2})\geq\frac{\delta}{2}||x_{1}-x_{2}||^{2}
\displaystyle\Leftrightarrow h(x1)(x1x2)δ2x1x22\displaystyle\nabla h(x_{1})^{\top}(x_{1}-x_{2})\geq\frac{\delta}{2}||x_{1}-x_{2}||^{2}
\displaystyle\Rightarrow h(x1)x1x2δ2x1x22\displaystyle||\nabla h(x_{1})||_{*}||x_{1}-x_{2}||\geq\frac{\delta}{2}||x_{1}-x_{2}||^{2}
\displaystyle\Rightarrow h(x1)δ2x1x2.\displaystyle||\nabla h(x_{1})||_{*}\geq\frac{\delta}{2}||x_{1}-x_{2}||.

The first step is trivial. The second step is a consequence of g_{2}=g_{1}+h. The third step is derived from the following two first-order optimality conditions: \nabla g_{1}(x_{1})^{\top}(x_{1}-x_{2})\leq 0 and \nabla g_{2}(x_{2})^{\top}(x_{2}-x_{1})\leq 0. The fourth step is derived from Hölder's inequality:

h(x1)x1x2h(x1)(x1x2).||\nabla h(x_{1})||_{*}||x_{1}-x_{2}||\geq\nabla h(x_{1})^{\top}(x_{1}-x_{2}).

Therefore, the lemma holds. ∎
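A toy numerical check of Lemma 13 (two simple quadratics over \mathcal{K}=\mathbb{R}^{d}; an illustrative instance rather than anything used in the paper):

```python
# Toy check of Lemma 13 with g1(x) = ||x - a||^2 and g2(x) = ||x - b||^2.
import numpy as np

rng = np.random.default_rng(0)
d = 5
a, b = rng.normal(size=d), rng.normal(size=d)

# g2 is delta-strongly convex with delta = 2; the unconstrained minimizers are a and b.
x1, x2 = a, b
grad_h_x1 = 2 * (x1 - b) - 2 * (x1 - a)   # gradient of h = g2 - g1 at x1
delta = 2.0
lhs = np.linalg.norm(x2 - x1)
rhs = (2.0 / delta) * np.linalg.norm(grad_h_x1)
print(lhs, rhs)
assert lhs <= rhs + 1e-12
```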

In the following part, we construct the strongly convex function g_{2}. Denote H=\sum_{t=1}^{n}x_{t}x_{t}^{\top}. From Lemma 7, we know that

2L^(θ)Cdown1nH.\nabla^{2}\hat{L}(\theta)\succeq C_{down}\frac{1}{n}H.

Here \hat{L}(\theta) is the short-hand notation of \hat{L}_{k}(\theta), as we do not specify k in this part. Since we do not know whether H is invertible, i.e., whether a norm can be induced by H, we cannot simply take g_{2}(\theta)=\hat{L}(\theta). Instead, we perform the following change of variables:

We first apply a singular value decomposition to H, i.e., H=U\Sigma U^{\top}, where U\in\mathbb{R}^{d\times r}, U^{\top}U=I_{r}, \Sigma=\text{diag}\{\lambda_{1},\lambda_{2},\ldots,\lambda_{r}\}\succ 0. After that, we introduce a new variable \eta:=U^{\top}\theta. Therefore, we have \theta=U\eta+V\epsilon, where V\in\mathbb{R}^{d\times(d-r)} with V^{\top}V=I_{d-r} and V^{\top}U=0 is an orthonormal basis of the null space of U^{\top} (the orthogonal complement of the column space of U), and \epsilon\in\mathbb{R}^{d-r}. Similarly, we define \tilde{\eta}_{i}=U^{\top}\tilde{\theta}_{i} and \hat{\eta}=U^{\top}\hat{\theta}. Based on these, we define the following functions:

fi(η)\displaystyle f_{i}(\eta) :=li(θ)=li(Uη+Vϵ)\displaystyle:=l_{i}(\theta)=l_{i}(U\eta+V\epsilon) (26)
F~i(η)\displaystyle\tilde{F}_{i}(\eta) :=L~i(θ)=L~i(Uη+Vϵ)\displaystyle:=\tilde{L}_{i}(\theta)=\tilde{L}_{i}(U\eta+V\epsilon)
F^(η)\displaystyle\hat{F}(\eta) :=L^(θ)=L^(Uη+Vϵ).\displaystyle:=\hat{L}(\theta)=\hat{L}(U\eta+V\epsilon).

Now we prove that F^(η)\hat{F}(\eta) is locally-strongly-convex. Similar to the proof of Lemma 7, we have:

2F^(η)=\displaystyle\nabla^{2}\hat{F}(\eta)= 1ni=1n2fi(η)\displaystyle\frac{1}{n}\sum_{i=1}^{n}\nabla^{2}f_{i}(\eta) (27)
=\displaystyle= 1ni=1n2li(xiθ)2(xiθη)(xiθη)\displaystyle\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^{2}l_{i}}{\partial(x_{i}^{\top}\theta)^{2}}(\frac{\partial x_{i}^{\top}\theta}{\partial\eta})(\frac{\partial x_{i}^{\top}\theta}{\partial\eta})^{\top}
=\displaystyle= 1ni=1n2li(xiθ)2(xi(Uη+Vϵ)η)(xi(Uη+Vϵ)η)\displaystyle\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^{2}l_{i}}{\partial(x_{i}^{\top}\theta)^{2}}(\frac{\partial x_{i}^{\top}(U\eta+V\epsilon)}{\partial\eta})(\frac{\partial x_{i}^{\top}(U\eta+V\epsilon)}{\partial\eta})^{\top}
=\displaystyle= 1ni=1n2li(xiθ)2(Uxi)(Uxi)\displaystyle\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^{2}l_{i}}{\partial(x_{i}^{\top}\theta)^{2}}(U^{\top}x_{i})(U^{\top}x_{i})^{\top}
\displaystyle\succeq 1ni=1nCdownUxixiU\displaystyle\frac{1}{n}\sum_{i=1}^{n}C_{down}U^{\top}x_{i}x_{i}^{\top}U
=\ \frac{1}{n}C_{\text{down}}U^{\top}\left(\sum_{i=1}^{n}x_{i}x_{i}^{\top}\right)U
=\displaystyle= 1nCdownUHU\displaystyle\frac{1}{n}C_{down}U^{\top}HU
=\displaystyle= 1nCdownUUΣUU\displaystyle\frac{1}{n}C_{down}U^{\top}U\Sigma U^{\top}U
=\displaystyle= 1nCdownΣ\displaystyle\frac{1}{n}C_{down}\Sigma
\displaystyle\succ 0\displaystyle 0

That is to say, F^(η)\hat{F}(\eta) is locally Cdownn\frac{C_{down}}{n}-strongly convex w.r.t Σ\Sigma at η\eta. Similarly, we can verify that F~i(η)\tilde{F}_{i}(\eta) is convex (not necessarily strongly convex). Therefore, according to Lemma 13, let g1(η)=F~i(η),g2(η)=F^(η)g_{1}(\eta)=\tilde{F}_{i}(\eta),g_{2}(\eta)=\hat{F}(\eta), and then x1=η~i=Uθ~ix_{1}=\tilde{\eta}_{i}=U^{\top}\tilde{\theta}_{i}, x2=η^=Uθ^x_{2}=\hat{\eta}=U^{\top}\hat{\theta}. Therefore, we have:

η^η~iΣ1Cdownfi(η~i)Σ.||\hat{\eta}-\tilde{\eta}_{i}||_{\Sigma}\leq\frac{1}{C_{down}}||\nabla f_{i}(\tilde{\eta}_{i})||_{\Sigma}^{*}. (28)

We can now bound each term l_{i}(\tilde{\theta}_{i})-l_{i}(\hat{\theta}) and thereby establish Inequality (22):

li(θ~i)li(θ^)=\displaystyle l_{i}(\tilde{\theta}_{i})-l_{i}(\hat{\theta})= fi(η~i)fi(η^)\displaystyle f_{i}(\tilde{\eta}_{i})-f_{i}(\hat{\eta}) (29)
convexity\displaystyle\underset{\mathclap{\overset{\uparrow}{\text{convexity}}}}{\leq} fi(η~i)(η~iη^)\displaystyle\nabla f_{i}(\tilde{\eta}_{i})^{\top}(\tilde{\eta}_{i}-\hat{\eta})
Holder inequality\displaystyle\underset{\mathclap{\overset{\uparrow}{\text{Holder\ inequality}}}}{\leq} fi(η~i)Ση~iη^Σ\displaystyle||\nabla f_{i}(\tilde{\eta}_{i})||^{*}_{\Sigma}||\tilde{\eta}_{i}-\hat{\eta}||_{\Sigma}
Lemma 13\displaystyle\underset{\mathclap{\overset{\uparrow}{\text{Lemma \ref{lemma_dual_norm_of_gradient_convex}}}}}{\leq} 1Cdown(fi(η~i)Σ)2.\displaystyle\frac{1}{C_{down}}(||\nabla f_{i}(\tilde{\eta}_{i})||^{*}_{\Sigma})^{2}.

And thus we have

\sum_{i=1}^{n}l_{i}(\tilde{\theta}_{i})-l_{i}(\hat{\theta})\ \leq\frac{1}{C_{\text{down}}}\sum_{i=1}^{n}\left(||\nabla f_{i}(\tilde{\eta}_{i})||^{*}_{\Sigma}\right)^{2} (30)
\leq\frac{1}{C_{\text{down}}}\sum_{i=1}^{n}C_{\text{exp}}\,x_{i}^{\top}U\Sigma^{-1}U^{\top}x_{i}
=\frac{C_{\text{exp}}}{C_{\text{down}}}\sum_{i=1}^{n}tr(x_{i}^{\top}U\Sigma^{-1}U^{\top}x_{i})
=\frac{C_{\text{exp}}}{C_{\text{down}}}\sum_{i=1}^{n}tr(U\Sigma^{-1}U^{\top}x_{i}x_{i}^{\top})
=\frac{C_{\text{exp}}}{C_{\text{down}}}tr\left(U\Sigma^{-1}U^{\top}\sum_{i=1}^{n}x_{i}x_{i}^{\top}\right)
=\frac{C_{\text{exp}}}{C_{\text{down}}}tr(U\Sigma^{-1}U^{\top}H)
=\frac{C_{\text{exp}}}{C_{\text{down}}}tr(U\Sigma^{-1}U^{\top}U\Sigma U^{\top})
=\frac{C_{\text{exp}}}{C_{\text{down}}}tr(UU^{\top})
=\frac{C_{\text{exp}}}{C_{\text{down}}}tr(U^{\top}U)
=\frac{C_{\text{exp}}}{C_{\text{down}}}tr(I_{r})
=\frac{C_{\text{exp}}}{C_{\text{down}}}r
\leq\frac{C_{\text{exp}}}{C_{\text{down}}}d,

where the second inequality holds because, by Lemma 7, (||\nabla f_{i}(\tilde{\eta}_{i})||^{*}_{\Sigma})^{2}=\nabla f_{i}(\tilde{\eta}_{i})^{\top}\Sigma^{-1}\nabla f_{i}(\tilde{\eta}_{i})\leq C_{\text{exp}}\,x_{i}^{\top}U\Sigma^{-1}U^{\top}x_{i}.

Thus Inequality (22) is proved. Based on it, we have:

𝔼h[L(θ~n)]L(θ)\displaystyle\mathbb{E}_{h}[L(\tilde{\theta}_{n})]-L(\theta^{*})
=\displaystyle= 𝔼h[L(θ~n)]𝔼h[L^(θ)]\displaystyle\mathbb{E}_{h}[L(\tilde{\theta}_{n})]-\mathbb{E}_{h}[\hat{L}(\theta^{*})]
\displaystyle\leq 𝔼h[L(θ~n)]𝔼h[L^(θ^)]\displaystyle\mathbb{E}_{h}[L(\tilde{\theta}_{n})]-\mathbb{E}_{h}[\hat{L}(\hat{\theta})]
=\displaystyle= 1ni=1n𝔼h[L(θ~i)]𝔼h[L^(θ^)]\displaystyle\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{h}[L(\tilde{\theta}_{i})]-\mathbb{E}_{h}[\hat{L}(\hat{\theta})]
=\displaystyle= 1ni=1n𝔼h[li(θ~i)]𝔼h[L^(θ^)]\displaystyle\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{h}[l_{i}(\tilde{\theta}_{i})]-\mathbb{E}_{h}[\hat{L}(\hat{\theta})]
=\ \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{h}[l_{i}(\tilde{\theta}_{i})]-\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{h}[l_{i}(\hat{\theta})]
=\displaystyle= 1ni=1n𝔼h[li(θ~i)li(θ^)]\displaystyle\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{h}[l_{i}(\tilde{\theta}_{i})-l_{i}(\hat{\theta})]
\displaystyle\leq CexpCdowndn\displaystyle\frac{C_{\text{exp}}}{C_{down}}\frac{d}{n}
=\displaystyle= O(dn)\displaystyle O(\frac{d}{n})

Thus we have proved that \mathbb{E}_{h}[L(\tilde{\theta}_{n})]-L(\theta^{*})\leq\frac{C_{\text{exp}}}{C_{\text{down}}}\cdot\frac{d}{n}. Notice that \tilde{\theta}_{n} is obtained by optimizing the leave-one-out likelihood function \tilde{L}_{n}(\theta)=\frac{1}{n}\sum_{j=1}^{n-1}l_{j}(\theta), which does not contain l_{n}(\theta), and that the expected likelihood function L(\theta) does not depend on any specific outcome of this round. That is to say, no term in this inequality depends on the last round (x_{n},v_{n},\mathbbm{1}_{n}) at all. In other words, the inequality remains valid if we only conduct this epoch from round 1 to round (n-1).

Now let n=τ+1n=\tau+1, and then we know that θ~τ+1=θ^\tilde{\theta}_{\tau+1}=\hat{\theta}. Therefore, the theorem holds. ∎

B.3 Proof of Lower bound in Section 5.3

Proof.

We assume a fixed u^{*} such that x^{\top}\theta^{*}=u^{*} for all x\in\mathbb{D}. In other words, we consider a context-free setting. Therefore, we can define a policy as \Psi:\{0,1\}^{t}\rightarrow\mathbb{R}^{+}, t=1,2,\ldots, that does not observe x_{t} at all. Before the proof begins, we first define some notation: we denote by \Phi_{\sigma}(\omega) and p_{\sigma}(\omega) the CDF and PDF of the Gaussian distribution \mathcal{N}(0,\sigma^{2}), and define the corresponding pricing function J_{\sigma}(u)=\arg\max_{v}v(1-\Phi_{\sigma}(v-u)).

Since we have proved that J^{\prime}(u)\in(0,1) for u\in\mathbb{R} in Appendix B.2.1, we have the following lemma:

Lemma 14.

uJσ(u)u-J_{\sigma}(u) monotonically increases as u(0,+),σ>0u\in(0,+\infty),\forall\sigma>0. Also, we know that Jσ(0)>0,σ>0J_{\sigma}(0)>0,\forall\sigma>0.

Now consider the following cases: σ1=1,σ2=1f(T)\sigma_{1}=1,\sigma_{2}=1-f(T), where limTf(T)=0,f(T)<0,0<f(T)<12\lim_{T\rightarrow\infty}f(T)=0,f^{\prime}(T)<0,0<f(T)<\frac{1}{2}. We will later determine the explicit form of f(T)f(T).

Suppose u^{*} satisfies J_{\sigma_{1}}(u^{*})=u^{*}. Solving this equation gives u^{*}=\sqrt{\frac{\pi}{2}}. Therefore, we have u\in(0,u^{*})\Leftrightarrow J_{1}(u)>u, and u\in(u^{*},+\infty)\Leftrightarrow J_{1}(u)<u. As a result, we have the following lemma.

Lemma 15.

For any σ(12,1)\sigma\in(\frac{1}{2},1), we have:

Jσ(u)(0,u)J_{\sigma}(u^{*})\in(0,u^{*}) (31)
Proof.

Firstly, we have:

Jσ(u)=\displaystyle J_{\sigma}(u)= argmaxvvΦσ(uv)\displaystyle\mathop{\arg\max}\limits_{v}v\Phi_{\sigma}(u-v)
=\displaystyle= argmaxvvΦ1(uvσ)\displaystyle\mathop{\arg\max}\limits_{v}v\Phi_{1}(\frac{u-v}{\sigma})
=\displaystyle= argmaxω=vσσωΦ1(uσω)\displaystyle\mathop{\arg\max}\limits_{\omega=\frac{v}{\sigma}}\sigma\omega\Phi_{1}(\frac{u}{\sigma}-\omega)
=\ \sigma\mathop{\arg\max}\limits_{\omega}\ \omega\Phi_{1}\!\left(\frac{u}{\sigma}-\omega\right)
=\displaystyle= σJ1(uσ).\displaystyle\sigma J_{1}(\frac{u}{\sigma}).

When σ(12,1)\sigma\in(\frac{1}{2},1), we know uσ>u\frac{u^{*}}{\sigma}>u^{*}. Since J1(u)=uJ_{1}(u^{*})=u^{*} and that u(u,+)J1(u)<uu\in(u^{*},+\infty)\Leftrightarrow J_{1}(u)<u, we have uσ>J1(uσ)\frac{u^{*}}{\sigma}>J_{1}(\frac{u^{*}}{\sigma}). Hence

u>σJ1(uσ)=Jσ(u).u^{*}>\sigma J_{1}(\frac{u^{*}}{\sigma})=J_{\sigma}(u^{*}). (32)

Therefore, without loss of generality, we assume that for the problem parameterized by \sigma_{2}, the price satisfies v\in(0,u^{*}). To be specific, let v^{*}(\sigma)=J_{\sigma}(u^{*}). Define \Psi_{t+1}:\{0,1\}^{t}\rightarrow(0,u^{*}) as any policy that proposes a price at time t+1, and define \Psi=\{\Psi_{1},\Psi_{2},\ldots,\Psi_{T-1},\Psi_{T}\}.

Define the sequence of prices as V=\{v_{1},v_{2},\ldots,v_{T-1},v_{T}\} and the sequence of decisions as \mathbbm{1}=\{\mathbbm{1}_{1},\mathbbm{1}_{2},\ldots,\mathbbm{1}_{T-1},\mathbbm{1}_{T}\}. Denote V^{t}=\{v_{1},v_{2},\ldots,v_{t}\}.

Define the probability (also the likelihood if we change uu^{*} to other parameter uu):

QTV,σ(𝟙)=t=1TΦσ(uvt)𝟙tΦσ(vtu)1𝟙t.Q^{V,\sigma}_{T}(\mathbbm{1})=\prod_{t=1}^{T}\Phi_{\sigma}(u^{*}-v_{t})^{\mathbbm{1}_{t}}\Phi_{\sigma}(v_{t}-u^{*})^{1-\mathbbm{1}_{t}}. (33)

Define a random variable Yt{0,1}t,YtQtVt,σY_{t}\in\{0,1\}^{t},Y_{t}\sim Q^{V^{t},\sigma}_{t} and one possible assignment
yt={𝟙1,𝟙2,,𝟙t1,𝟙t}y_{t}=\{\mathbbm{1}_{1},\mathbbm{1}_{2},\ldots,\mathbbm{1}_{t-1},\mathbbm{1}_{t}\} . For any price vv and any parameter σ\sigma, define the expected reward function as r(v,σ):=vΦσ(uv)r(v,\sigma):=v\Phi_{\sigma}(u^{*}-v). Based on this, we can further define the expected regret Regret(σ,T,Ψ)\mathrm{Regret}(\sigma,T,\Psi):

Regret(σ,T,Ψ)=𝔼[t=1Tr(Jσ(u),σ)r(Ψt(yt1),σ)]\mathrm{Regret}(\sigma,T,\Psi)=\mathbb{E}[\sum_{t=1}^{T}r(J_{\sigma}(u^{*}),\sigma)-r(\Psi_{t}(y_{t-1}),\sigma)] (34)

Now we have the following properties:

Lemma 16.
  1.

    r(v(σ),σ)r(v,σ)160(v(σ)v)2r(v^{*}(\sigma),\sigma)-r(v,\sigma)\geq\frac{1}{60}(v^{*}(\sigma)-v)^{2};

  2.

    |v(σ)u|25|1σ||v^{*}(\sigma)-u^{*}|\geq\frac{2}{5}|1-\sigma|;

  3.

    |Φσ(uv)Φ1(uv)||uv||σ1||\Phi_{\sigma}(u^{*}-v)-\Phi_{1}(u^{*}-v)|\leq|u^{*}-v|\cdot|\sigma-1|.

Proof.
  1.

    We have:

    \frac{\partial r(v,\sigma)}{\partial v}\Big|_{v=v^{*}(\sigma)}=0,\qquad\frac{\partial^{2}r(v,\sigma)}{\partial v^{2}}=\frac{1}{\sigma^{2}}(v^{2}-u^{*}v-2\sigma^{2})\,p_{\sigma}(u^{*}-v).

    Since v(0,u)v\in(0,u^{*}), we have (v2uv2σ2)<2σ2(v^{2}-u^{*}v-2\sigma^{2})<-2\sigma^{2}. Also, since σ(1/2,1)\sigma\in(1/2,1), we have pσ(uv)>12πe(u)22(1/2)2=12πeπ>0.017p_{\sigma}(u^{*}-v)>\frac{1}{\sqrt{2\pi}}\cdot e^{-\frac{(u^{*})^{2}}{2\cdot(1/2)^{2}}}=\frac{1}{\sqrt{2\pi}e^{\pi}}>0.017. Therefore, we have

    2r(v,σ)v2<20.017<130\frac{\partial^{2}r(v,\sigma)}{\partial v^{2}}<-2*0.017<-\frac{1}{30}

    As a result, we have:

    r(v^{*}(\sigma),\sigma)-r(v,\sigma)=-(v-v^{*}(\sigma))\frac{\partial r(v,\sigma)}{\partial v}\Big|_{v=v^{*}(\sigma)}-\frac{1}{2}(v^{*}(\sigma)-v)^{2}\frac{\partial^{2}r(v,\sigma)}{\partial v^{2}}\Big|_{v=\tilde{v}} (35)
    =0-\frac{1}{2}(v^{*}(\sigma)-v)^{2}\frac{\partial^{2}r(v,\sigma)}{\partial v^{2}}\Big|_{v=\tilde{v}}
    \geq\frac{1}{2}\cdot\frac{1}{30}(v^{*}(\sigma)-v)^{2}.
  2.

    According to Equation 32, we know that:

    v(σ)=σJ1(uσ)v^{*}(\sigma)=\sigma J_{1}(\frac{u^{*}}{\sigma})

    For u\in(u^{*},+\infty), J_{1}(u)<u. Plugging the standard Gaussian CDF into the expression for J^{\prime}(u) derived in Appendix B.2.1, we have:

    J1(u)\displaystyle J^{\prime}_{1}(u) =1+1J1(u)(J1(u)u)2\displaystyle=1+\frac{1}{J_{1}(u)(J_{1}(u)-u)-2}
    >1+102\displaystyle>1+\frac{1}{0-2}
    =12.\displaystyle=\frac{1}{2}.

    Also, for u(u,uσ)u\in(u^{*},\frac{u^{*}}{\sigma}), we have:

    J1(u)\displaystyle J^{\prime}_{1}(u) =112+J1(u)(uJ1(u))\displaystyle=1-\frac{1}{2+J_{1}(u)(u-J_{1}(u))}
    0<J1(u)<u112+u(u0)\displaystyle\underset{\mathclap{\overset{\uparrow}{0<J_{1}(u)<u}}}{\leq}1-\frac{1}{2+u(u-0)}
    u<uσ<u2112+(u2)2\displaystyle\underset{\mathclap{\overset{\uparrow}{u<\frac{u^{*}}{\sigma}<\frac{u^{*}}{2}}}}{\leq}1-\frac{1}{2+(\frac{u^{*}}{2})^{2}}
    =112+π8\displaystyle=1-\frac{1}{2+\frac{\pi}{8}}
    <35.\displaystyle<\frac{3}{5}.

    Therefore, we have:

    J1(u)Jσ(u)\displaystyle J_{1}(u^{*})-J_{\sigma}(u^{*}) =J1(u)σJ1(uσ)\displaystyle=J_{1}(u^{*})-\sigma J_{1}(\frac{u^{*}}{\sigma})
    =J1(u)σJ1(u)+σ(J1(u)J1(uσ))\displaystyle=J_{1}(u^{*})-\sigma J_{1}(u^{*})+\sigma(J_{1}(u^{*})-J_{1}(\frac{u^{*}}{\sigma}))
    =J1(u)(1σ)σ(J1(uσ)J1(u))\displaystyle=J_{1}(u^{*})(1-\sigma)-\sigma(J_{1}(\frac{u^{*}}{\sigma})-J_{1}(u^{*}))
    >u(1σ)σ35(uσu)\displaystyle>u^{*}(1-\sigma)-\sigma\cdot\frac{3}{5}(\frac{u^{*}}{\sigma}-u^{*})
    =u(1σ)35σ(1σ1)u\displaystyle=u^{*}(1-\sigma)-\frac{3}{5}\sigma(\frac{1}{\sigma}-1)u^{*}
    =u(1σ)(135)\displaystyle=u^{*}(1-\sigma)(1-\frac{3}{5})
    >25(1σ).\displaystyle>\frac{2}{5}(1-\sigma).
  3.

    This is because:

    |Φσ(uv)Φ1(uv)|\displaystyle|\Phi_{\sigma}(u^{*}-v)-\Phi_{1}(u^{*}-v)| =|Φσ(uv)Φσ(σ(uv))|\displaystyle=|\Phi_{\sigma}(u^{*}-v)-\Phi_{\sigma}(\sigma(u^{*}-v))|
    max|pσ||(uv)σ(uv)|\displaystyle\leq\max|p_{\sigma}|\cdot|(u^{*}-v)-\sigma(u^{*}-v)|
    12πσ|(uv)σ(uv)|\displaystyle\leq\frac{1}{\sqrt{2\pi}\sigma}\cdot|(u^{*}-v)-\sigma(u^{*}-v)|
    σ>12(1σ)|uv|.\displaystyle\underset{\mathclap{\overset{\uparrow}{\sigma>\frac{1}{2}}}}{\leq}(1-\sigma)|u^{*}-v|.
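The three properties can also be verified numerically for a few values of \sigma; the sketch below does so on an illustrative grid (the optimizer bounds, grids and tolerances are assumptions of the demo).

```python
# Numerical spot check of the three properties in Lemma 16 for Gaussian demand.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

u_star = np.sqrt(np.pi / 2.0)

def r(v, sigma):
    # expected reward v * Phi_sigma(u* - v), with Phi_sigma(x) = Phi_1(x / sigma)
    return v * norm.cdf((u_star - v) / sigma)

def v_opt(sigma):
    res = minimize_scalar(lambda v: -r(v, sigma), bounds=(0.0, 3.0), method="bounded")
    return res.x

for sigma in np.linspace(0.55, 0.95, 9):
    vs = v_opt(sigma)
    assert abs(vs - u_star) >= 0.4 * abs(1 - sigma) - 1e-6              # Property 2
    for v in np.linspace(1e-3, u_star - 1e-3, 50):
        assert r(vs, sigma) - r(v, sigma) >= (vs - v) ** 2 / 60 - 1e-7  # Property 1
        gap = abs(norm.cdf((u_star - v) / sigma) - norm.cdf(u_star - v))
        assert gap <= abs(u_star - v) * abs(sigma - 1) + 1e-12          # Property 3
```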

In the following part, we will propose two theorems, which balance the cost of learning against the cost of uncertainty. This part largely follows Section 3 of Broder and Rusmevichientong [2012], but we adopt a different family of demand curves here.

Theorem 17 (Learning is costly).

Let σ(1/2,1)\sigma\in(1/2,1) and vt(0,u)v_{t}\in(0,u^{*}), and we have:

𝒦(QV,1;QV,σ)<9900(1σ)2Regret(1,T,Ψ).\mathcal{K}(Q^{V,1};Q^{V,\sigma})<9900(1-\sigma)^{2}\mathrm{Regret}(1,T,\Psi). (36)

Here vt=Ψ(yt1),t=1,2,,Tv_{t}=\Psi(y_{t-1}),t=1,2,\ldots,T.

Proof.

First of all, we cite the following lemma that would facilitate the proof.

Lemma 18 (Corollary 3.1 in Taneja and Kumar, 2004).

Suppose B1B_{1} and B2B_{2} are distributions of Bernoulli random variables with parameters q1q_{1} and q2q_{2}, respectively, with q1,q2(0,1)q_{1},q_{2}\in(0,1). Then,

𝒦(B1;B2)(q1q2)2q2(1q2).\mathcal{K}(B_{1};B_{2})\leq\frac{(q_{1}-q_{2})^{2}}{q_{2}(1-q_{2})}.
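Lemma 18 is the standard bound of the Bernoulli KL-divergence by the corresponding \chi^{2}-divergence; a quick numerical check over an illustrative grid of parameters:

```python
# Check that KL(B1; B2) <= (q1 - q2)^2 / (q2 (1 - q2)) for Bernoulli distributions.
import numpy as np

def kl_bernoulli(q1, q2):
    return q1 * np.log(q1 / q2) + (1 - q1) * np.log((1 - q1) / (1 - q2))

qs = np.linspace(0.01, 0.99, 99)
for q1 in qs:
    for q2 in qs:
        assert kl_bernoulli(q1, q2) <= (q1 - q2) ** 2 / (q2 * (1 - q2)) + 1e-12
```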

According to the definition of KL-divergence, we have:

𝒦(QTV,1;QTV,σ)=s=1T𝒦(QsVs,1;QsVs,σ|Ys1).\mathcal{K}(Q_{T}^{V,1};Q_{T}^{V,\sigma})=\sum_{s=1}^{T}\mathcal{K}(Q_{s}^{V^{s},1};Q_{s}^{V^{s},\sigma}|Y_{s-1}).

For each term of the RHS, we have:

𝒦(QsVs,1,QsVs,σ|Ys1)\displaystyle\qquad\mathcal{K}(Q_{s}^{V^{s},1},Q_{s}^{V^{s},\sigma}|Y_{s-1})
=ys{0,1}sQsVs,1(ys)log(QsVs,1(𝟙s|ys1)QsVs,σ(𝟙s|ys1))\displaystyle=\quad\sum_{y_{s}\in\{0,1\}^{s}}Q_{s}^{V^{s},1}(y_{s})\log\left(\frac{Q_{s}^{V^{s},1}(\mathbbm{1}_{s}|y_{s-1})}{Q_{s}^{V^{s},\sigma}(\mathbbm{1}_{s}|y_{s-1})}\right)
\underset{\mathclap{\overset{\uparrow}{\text{split $y_{s}$ into $y_{s-1}$ and $\mathbbm{1}_{s}$}}}}{=}\qquad\qquad\sum_{y_{s-1}\in\{0,1\}^{s-1}}Q_{s-1}^{V^{s-1},1}(y_{s-1})\cdot\sum_{\mathbbm{1}_{s}\in\{0,1\}}Q_{s}^{V^{s},1}(\mathbbm{1}_{s}|y_{s-1})\log\left(\frac{Q_{s}^{V^{s},1}(\mathbbm{1}_{s}|y_{s-1})}{Q_{s}^{V^{s},\sigma}(\mathbbm{1}_{s}|y_{s-1})}\right)
=ys1{0,1}s1Qs1Vs1,1(ys1)𝒦(QsVs,1(|ys1),QsVs,σ(|ys1))\displaystyle=\qquad\sum_{y_{s-1}\in\{0,1\}^{s-1}}Q_{s-1}^{V^{s-1},1}(y_{s-1})\mathcal{K}\left(Q_{s}^{V^{s},1}(\cdot|y_{s-1}),Q_{s}^{V^{s},\sigma}(\cdot|y_{s-1})\right)
Lemma 18ys1{0,1}s1Qs1Vs1,1(ys1)(Φ1(uvs)Φσ(uvs))2Φσ(uvs)(1Φσ(uvs))\displaystyle\underset{\mathclap{\overset{\uparrow}{\text{Lemma \ref{lemma_Taneja_and_Kumar_bernoulli_divergence}}}}}{\leq}\qquad\sum_{y_{s-1}\in\{0,1\}^{s-1}}Q_{s-1}^{V^{s-1},1}(y_{s-1})\frac{(\Phi_{1}(u^{*}-v_{s})-\Phi_{\sigma}(u^{*}-v_{s}))^{2}}{\Phi_{\sigma}(u^{*}-v_{s})(1-\Phi_{\sigma}(u^{*}-v_{s}))}
=1Φσ(uvs)(1Φσ(uvs))ys1{0,1}s1Qs1Vs1,1(ys1)(Φ1(uvs)Φσ(uvs))2\displaystyle=\quad\frac{1}{\Phi_{\sigma}(u^{*}-v_{s})(1-\Phi_{\sigma}(u^{*}-v_{s}))}\sum_{y_{s-1}\in\{0,1\}^{s-1}}Q_{s-1}^{V^{s-1},1}(y_{s-1})(\Phi_{1}(u^{*}-v_{s})-\Phi_{\sigma}(u^{*}-v_{s}))^{2}
()165ys1{0,1}s1Qs1Vs1,1(ys1)(Φ1(uvs)Φσ(uvs))2\displaystyle\underset{\mathclap{\overset{\uparrow}{(**)}}}{\leq}\qquad 165\cdot\sum_{y_{s-1}\in\{0,1\}^{s-1}}Q_{s-1}^{V^{s-1},1}(y_{s-1})(\Phi_{1}(u^{*}-v_{s})-\Phi_{\sigma}(u^{*}-v_{s}))^{2}
Lemma 16 Property 3165ys1{0,1}s1Qs1Vs1,1(ys1)(uvs)2(1σ)2\displaystyle\underset{\mathclap{\overset{\uparrow}{\text{Lemma \ref{lemma_properties} Property 3}}}}{\leq}\qquad 165\cdot\sum_{y_{s-1}\in\{0,1\}^{s-1}}Q_{s-1}^{V^{s-1},1}(y_{s-1})(u^{*}-v_{s})^{2}(1-\sigma)^{2}
=165(1σ)2𝔼Ys1[(uvs)2].\displaystyle=\qquad 165(1-\sigma)^{2}\mathbb{E}_{Y_{s-1}}[(u^{*}-v_{s})^{2}].

Here inequality (**) above is proved as follows: since vs(0,u)v_{s}\in(0,u^{*}) as is assumed, we have:

\frac{1}{2}<\Phi_{\sigma}(u^{*}-v_{s})<\Phi_{\sigma}(u^{*})
=\ \Phi_{1}\left(\frac{u^{*}}{\sigma}\right)
\leq\ \Phi_{1}\left(\frac{\sqrt{\pi/2}}{1/2}\right)
=\ \Phi_{1}(\sqrt{2\pi})
\leq\ 0.9939.

As a result, we have \frac{1}{\Phi_{\sigma}(u^{*}-v_{s})(1-\Phi_{\sigma}(u^{*}-v_{s}))}\leq\frac{1}{0.9939\times 0.0061}\approx 164.94\leq 165. Therefore, by summing over all s, we have:

𝒦(QTV,1;QTV,σ)\displaystyle\mathcal{K}(Q_{T}^{V,1};Q_{T}^{V,\sigma}) =s=1T𝒦(QsVs,1;QsVs,σ|Ys1)\displaystyle=\sum_{s=1}^{T}\mathcal{K}(Q_{s}^{V^{s},1};Q_{s}^{V^{s},\sigma}|Y_{s-1})
165(1σ)2s=1T𝔼Ys1[(uvs)2]\displaystyle\leq 165(1-\sigma)^{2}\sum_{s=1}^{T}\mathbb{E}_{Y_{s-1}}[(u^{*}-v_{s})^{2}]
Lemma 16 Property 1165×60(1σ)2s=1T(r(u,1)r(vs,1))\displaystyle\underset{\mathclap{\overset{\uparrow}{\text{Lemma~{}\ref{lemma_properties} Property 1}}}}{\leq}165\times 60\cdot(1-\sigma)^{2}\sum_{s=1}^{T}\left(r(u^{*},1)-r(v_{s},1)\right)
=definition of regret and vs=Ψ(ys1).9900(1σ)2Regret(1,T,Ψ),\displaystyle\underset{\mathclap{\overset{\uparrow}{\text{definition of regret and }v_{s}=\Psi(y_{s-1}).}}}{=}9900(1-\sigma)^{2}\mathrm{Regret}(1,T,\Psi),

which concludes the proof. ∎

Theorem 19 (Uncertainty is costly).

Let σ1T14\sigma\leq 1-T^{-\frac{1}{4}}, and we have:

Regret(1,T,Ψ)+Regret(σ,T,Ψ)124000Te𝒦(QV,1;QV,σ).\mathrm{Regret}(1,T,\Psi)+\mathrm{Regret}(\sigma,T,\Psi)\geq\frac{1}{24000}\cdot\sqrt{T}\cdot e^{-\mathcal{K}(Q^{V,1};Q^{V,\sigma})}. (37)

Here vt=Ψ(yt1),t=1,2,,Tv_{t}=\Psi(y_{t-1}),t=1,2,\ldots,T.

Proof.

First of all, we cite a lemma that would facilitate our proof:

Lemma 20.

Let Q0Q_{0} and Q1Q_{1} be two probability distributions on a finite space 𝒴\mathcal{Y}; with Q0(y),Q1(y)>0,y𝒴Q_{0}(y),Q_{1}(y)>0,\forall y\in\mathcal{Y}. Then for any function F:𝒴{0,1}F:\mathcal{Y}\rightarrow\{0,1\},

Q_{0}\{F=1\}+Q_{1}\{F=0\}\geq\frac{1}{2}e^{-\mathcal{K}(Q_{0};Q_{1})},

where 𝒦(Q0;Q1)\mathcal{K}(Q_{0};Q_{1}) denotes the KL-divergence of Q0Q_{0} and Q1Q_{1}.
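On a finite space, the left-hand side of Lemma 20 minimized over all tests F equals \sum_{y}\min(Q_{0}(y),Q_{1}(y)); the following sketch checks the inequality on randomly drawn distributions (the support size and number of trials are illustrative assumptions).

```python
# Numerical check of Lemma 20 on a small finite space.
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    q0 = rng.dirichlet(np.ones(6))
    q1 = rng.dirichlet(np.ones(6))
    lhs = np.minimum(q0, q1).sum()        # min over F of Q0{F=1} + Q1{F=0}
    kl = np.sum(q0 * np.log(q0 / q1))     # KL-divergence K(Q0; Q1)
    assert lhs >= 0.5 * np.exp(-kl) - 1e-12
```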

Define two intervals of prices:

C_{1}=\left\{v:|v-u^{*}|\leq\frac{1}{10T^{1/4}}\right\}\ \ \text{and}\ \ C_{2}=\left\{v:|J_{\sigma}(u^{*})-v|\leq\frac{1}{10T^{1/4}}\right\}

Note that C_{1} and C_{2} are disjoint, since |u^{*}-J_{\sigma}(u^{*})|\geq\frac{2}{5}|1-\sigma|\geq\frac{2}{5}T^{-1/4}>2\cdot\frac{1}{10}T^{-1/4} according to Lemma 16 Property 2. Also, for v\in(0,u^{*})\backslash C_{2}, the regret is large according to Lemma 16 Property 1, because:

r(v(σ),σ)r(v,σ)160(vv(σ))216000T12.r(v^{*}(\sigma),\sigma)-r(v,\sigma)\geq\frac{1}{60}(v-v^{*}(\sigma))^{2}\geq\frac{1}{6000{T}^{\frac{1}{2}}}.

Then, we have:

Regret(1,T,Ψ)+Regret(σ,T,Ψ)\displaystyle\mathrm{Regret}(1,T,\Psi)+\mathrm{Regret}(\sigma,T,\Psi)
\displaystyle\geq t=1T1𝔼1[r(u,1)r(vt+1,1)]+𝔼σ[r(Jσ(u),σ)r(vt+1,σ)]\displaystyle\ \sum_{t=1}^{T-1}\mathbb{E}_{1}[r(u^{*},1)-r(v_{t+1},1)]+\mathbb{E}_{\sigma}[r(J_{\sigma}(u^{*}),\sigma)-r(v_{t+1},\sigma)]
\geq\ \frac{1}{6000\sqrt{T}}\sum_{t=1}^{T-1}\left(\mathbb{P}_{1}[v_{t+1}\notin C_{1}]+\mathbb{P}_{\sigma}[v_{t+1}\notin C_{2}]\right)
SupposeFt+1=𝟙[vt+1C2]\displaystyle\underset{\mathclap{\overset{\uparrow}{Suppose\ F_{t+1}=\mathbbm{1}[v_{t+1}\in C_{2}]}}}{\geq} 16000Tt=1T11[Ft+1=1]+σ[Ft+1=0]\displaystyle\ \frac{1}{6000\sqrt{T}}\sum_{t=1}^{T-1}\mathbb{P}_{1}[F_{t+1}=1]+\mathbb{P}_{\sigma}[F_{t+1}=0]
Lemma20\displaystyle\underset{\mathclap{\overset{\uparrow}{Lemma\ \ref{lemma_Tsybakov2009}}}}{\geq} 16000Tt=1T112e𝒦(QtVt,1;QtVt,σ)\displaystyle\ \frac{1}{6000\sqrt{T}}\sum_{t=1}^{T-1}\frac{1}{2}e^{-\mathcal{K}(Q_{t}^{V^{t},1};Q_{t}^{V^{t},\sigma})}
𝒦(QtVt,1;QtVt,σ)notdecreasing\displaystyle\underset{\mathclap{\overset{\uparrow}{\mathcal{K}(Q_{t}^{V^{t},1};Q_{t}^{V^{t},\sigma})\ not\ decreasing\ }}}{\geq} 16000TT12e𝒦(QTV,1;QTV,σ)\displaystyle\ \frac{1}{6000\sqrt{T}}\frac{T-1}{2}e^{-\mathcal{K}(Q_{T}^{V,1};Q_{T}^{V,\sigma})}
\displaystyle\geq 124000Te𝒦(QTV,1;QTV,σ).\displaystyle\ \frac{1}{24000}\sqrt{T}e^{-\mathcal{K}(Q_{T}^{V,1};Q_{T}^{V,\sigma})}.

According to Theorem 17 and Theorem 19, we can now prove Theorem 12. Let \sigma=1-T^{-\frac{1}{4}}. Then:

2(Regret(1,T,Ψ)+Regret(σ,T,Ψ))\displaystyle 2\left(\mathrm{Regret}(1,T,\Psi)+\mathrm{Regret}(\sigma,T,\Psi)\right)
\displaystyle\geq Regret(1,T,Ψ)+(Regret(1,T,Ψ)+Regret(σ,T,Ψ))\displaystyle\mathrm{Regret}(1,T,\Psi)+\left(\mathrm{Regret}(1,T,\Psi)+\mathrm{Regret}(\sigma,T,\Psi)\right)
\displaystyle\geq 19900T1/2𝒦(QV,1;QV,σ)+124000Te𝒦(QV,1;QV,σ)\displaystyle\frac{1}{9900T^{-1/2}}\mathcal{K}(Q^{V,1};Q^{V,\sigma})+\frac{1}{24000}\cdot\sqrt{T}\cdot e^{-\mathcal{K}(Q^{V,1};Q^{V,\sigma})}
\displaystyle\geq 124000T(𝒦(QV,1;QV,σ)+e𝒦(QV,1;QV,σ))\displaystyle\frac{1}{24000}\sqrt{T}\left(\mathcal{K}(Q^{V,1};Q^{V,\sigma})+e^{-\mathcal{K}(Q^{V,1};Q^{V,\sigma})}\right)
Thefactexx+1,x\displaystyle\underset{\mathclap{\overset{\uparrow}{The\ fact\ e^{x}\geq x+1,\forall{x}\in\mathbb{R}}}}{\geq} 124000T.\displaystyle\frac{1}{24000}\sqrt{T}.

Thus Theorem 12 is proved valid. ∎

Appendix C More Discussions

C.1 Dependence on BB and Noise Variance

Here we use a concrete example to analyze the coefficients of regret bounds. Again, we assume that Nt𝒩(0,σ2)N_{t}\sim\mathcal{N}(0,\sigma^{2}). Notice that both CsC_{s} and CaC_{a} have a component of CexpCdown\frac{C_{\exp}}{C_{down}}. In order to analyze CexpCdown\frac{C_{\exp}}{C_{down}}, we define a hazard function denoted as λ(ω)\lambda(\omega) with ω\omega\in\mathbb{R}:

λ(ω):=p1(ω)1Φ1(ω)=p1(ω)Φ1(ω),\lambda(\omega):=\frac{p_{1}(\omega)}{1-\Phi_{1}(\omega)}=\frac{p_{1}(-\omega)}{\Phi_{1}(-\omega)}, (38)

where \Phi_{1} and p_{1} are the CDF and PDF of the standard Gaussian distribution. The concept of a hazard function comes from the area of survival analysis. Plugging Equation 38 into Equations 11 and 13, we get:

C_{\text{down}}\ \geq\inf_{\omega\in[-\frac{B}{\sigma},\frac{B}{\sigma}]}\left\{\frac{1}{\sigma^{2}}\left(\lambda(-\omega)^{2}+\omega\cdot\lambda(-\omega)\right)\right\} (39)
Cexp\displaystyle C_{\text{exp}} supω[Bσ,Bσ]{1σ2λ(ω)2}.\displaystyle\leq\sup_{\omega\in[-\frac{B}{\sigma},\frac{B}{\sigma}]}\left\{\frac{1}{\sigma^{2}}\lambda(-\omega)^{2}\right\}.

In Lemma 21, we will show that \lambda(\omega) is exponentially small as \omega\rightarrow-\infty, and is asymptotically close to \omega as \omega\rightarrow+\infty. Therefore, C_{\text{down}} is exponentially small and C_{\text{exp}} is quadratically large with respect to B/\sigma. Although we assume that B and \sigma are constants, we should be alert that the scale of B/\sigma can be very large as \sigma goes to zero, i.e., as the noise becomes "insignificant". In practice (especially when T is finite), this may cause extremely large regret at the beginning. The "Shallow Pricing" method introduced by Cohen et al. [2020] (as well as other domain-cutting methods in contextual search) may serve as a good pre-processing step, as it repeatedly bisects the feasible region of \theta^{*} with high probability. According to Theorem 3 in Cohen et al. [2020], their Shallow Pricing algorithm bisects the parameter set at most a logarithmic number of times to ensure that \frac{B}{\sigma} is small enough (i.e., upper bounded by O(\text{poly}\log(T))). However, this does not necessarily mean that we can use an O(\log{T})-step pre-process to achieve the same effect, since they run the algorithm throughout the session while we would only take it as a pre-process. Intuitively, at least under the adversarial feature assumption, we cannot rely entirely on a few features occurring at the beginning (as they might be misleading) to cut the parameter set once and for all. A mixed approach of Shallow Pricing and EMLP/ONSP might work, as the algorithm can detect whether the current \frac{B}{\sigma} exceeds a bisection threshold. However, this requires new regret analysis, since the parameter domain being operated on changes over time. Therefore, we claim in Section 7 that the regret bound is still open if \sigma=\Theta(T^{-\alpha}) for \alpha\in(0,1).

Lemma 21 (Properties of λ(ω)\lambda(\omega)).

For λ(ω):=p1(ω)1Φ1(ω)\lambda(\omega):=\frac{p_{1}(\omega)}{1-\Phi_{1}(\omega)}, we have:

1,\displaystyle 1, ddωλ(ω)>0.\displaystyle\qquad\frac{d}{d\omega}\lambda(\omega)>0.
2,\displaystyle 2, limωωkλ(ω)=0,k>0.\displaystyle\qquad\lim_{\omega\rightarrow-\infty}\omega^{k}\lambda(\omega)=0,\ \forall k>0.
3,\displaystyle 3, limω+λ(ω)ω=0.\displaystyle\qquad\lim_{\omega\rightarrow+\infty}\lambda(\omega)-\omega=0.
4,\displaystyle 4, limω+ω(λ(ω)ω)=1.\displaystyle\qquad\lim_{\omega\rightarrow+\infty}\omega\left(\lambda(\omega)-\omega\right)=1.
Proof.

We prove the Lemma 21 sequentially:

  1.

    We have:

    λ(ω)=\displaystyle\lambda^{\prime}(\omega)= p12(ω)p1(ω)Φ1(ω)Φ1(ω)2\displaystyle\frac{p_{1}^{2}(-\omega)-p^{\prime}_{1}(-\omega)\Phi_{1}(-\omega)}{\Phi_{1}(-\omega)^{2}} (40)
    =\displaystyle= p12(ω)ωp1(ω)Φ1(ω)Φ1(ω)2\displaystyle\frac{p_{1}^{2}(-\omega)-\omega p_{1}(-\omega)\Phi_{1}(-\omega)}{\Phi_{1}(-\omega)^{2}}
    =\displaystyle= p1(ω)(p1(ω)ωΦ1(ω))Φ1(ω)2.\displaystyle\frac{p_{1}(-\omega)\left(p_{1}(-\omega)-\omega\Phi_{1}(-\omega)\right)}{\Phi_{1}(-\omega)^{2}}.

    Therefore, it is equivalent to prove that p1(ω)ωΦ1(ω)>0p_{1}(-\omega)-\omega\Phi_{1}(-\omega)>0.

    Suppose f(ω)=p1(ω)+ωΦ1(ω)f(\omega)=p_{1}(\omega)+\omega\Phi_{1}(\omega). We now take its derivatives as follows:

    f(ω)\displaystyle f^{\prime}(\omega) =p1(ω)+(Φ1(ω)+ωp1(ω))\displaystyle=p^{\prime}_{1}(\omega)+(\Phi_{1}(\omega)+\omega\cdot p_{1}(\omega)) (41)
    =(ω)p1(ω)+Φ1(ω)+ωp1(ω)\displaystyle=(-\omega)p_{1}(\omega)+\Phi_{1}(\omega)+\omega\cdot p_{1}(\omega)
    =Φ1(ω)\displaystyle=\Phi_{1}(\omega)
    >0\displaystyle>0

    Therefore, we know that f(ω)f(\omega) monotonically increases in \mathbb{R}. Additionally, since we have:

    \lim_{\omega\rightarrow-\infty}f(\omega) (42)
    =\ \lim_{\omega\rightarrow-\infty}p_{1}(\omega)+\lim_{\omega\rightarrow-\infty}\omega\Phi_{1}(\omega)
    =\ 0+\lim_{\omega\rightarrow-\infty}\frac{\Phi_{1}(\omega)}{1/\omega}
    =\ \lim_{\omega\rightarrow-\infty}\frac{p_{1}(\omega)}{-1/\omega^{2}}
    =\ \lim_{\omega\rightarrow-\infty}\left(-\frac{1}{\sqrt{2\pi}}\cdot\frac{\omega^{2}}{\exp\{\frac{\omega^{2}}{2}\}}\right)
    =\ 0,

    where the third equality follows from L'Hospital's rule.

    Therefore, we know that f(ω)>0f(\omega)>0, ω\forall\omega\in\mathbb{R}, and as a result, λ(ω)>0\lambda^{\prime}(\omega)>0.

  2.

    We have:

    limωωkλ(ω)\displaystyle\lim_{\omega\rightarrow-\infty}\omega^{k}\lambda(\omega) (43)
    =\displaystyle= limωωkp1(ω)Φ1(ω)\displaystyle\lim_{\omega\rightarrow-\infty}\omega^{k}\frac{p_{1}(-\omega)}{\Phi_{1}(-\omega)}
    =\displaystyle= limωωkp1(ω)limωΦ1(ω)\displaystyle\frac{\mathop{\lim}\limits_{\omega\rightarrow-\infty}\omega^{k}p_{1}(-\omega)}{\mathop{\lim}\limits_{\omega\rightarrow-\infty}\Phi_{1}(-\omega)}
    =\displaystyle= limωωk(12πexp{ω22})1\displaystyle\frac{\mathop{\lim}\limits_{\omega\rightarrow-\infty}\omega^{k}(\frac{1}{\sqrt{2\pi}}\exp\{-\frac{\omega^{2}}{2}\})}{1}
    =\displaystyle= 01\displaystyle\frac{0}{1}
    =\displaystyle= 0.\displaystyle 0.
  3.

    We only need to prove that

    limω+λ(ω)ω=0.\lim_{\omega\rightarrow+\infty}\lambda(\omega)-\omega=0.

    Actually, we have:

    limω+λ(ω)ω\displaystyle\lim_{\omega\rightarrow+\infty}\lambda(\omega)-\omega (44)
    =\displaystyle= limω+p1(ω)ωΦ1(ω)Φ1(ω)\displaystyle\lim_{\omega\rightarrow+\infty}\frac{p_{1}(-\omega)-\omega\Phi_{1}(-\omega)}{\Phi_{1}(-\omega)}
    =\displaystyle= limωp1(ω)+ωΦ1(ω)Φ1(ω)\displaystyle\lim_{\omega\rightarrow-\infty}\frac{p_{1}(\omega)+\omega\Phi_{1}(\omega)}{\Phi_{1}(\omega)}
    =L’Hospital’s rule\displaystyle\overset{\mathclap{\underset{\downarrow}{\text{L'Hospital's rule}}}}{=} limω(ω)p1(ω)+Φ1(ω)+ωp1(ω)p1(ω)\displaystyle\lim_{\omega\rightarrow-\infty}\frac{(-\omega)p_{1}(\omega)+\Phi_{1}(\omega)+\omega p_{1}(\omega)}{p_{1}(\omega)}
    =\displaystyle= limωΦ1(ω)p1(ω)\displaystyle\lim_{\omega\rightarrow-\infty}\frac{\Phi_{1}(\omega)}{p_{1}(\omega)}
    =\displaystyle= limωp1(ω)(ω)p1(ω)\displaystyle\lim_{\omega\rightarrow-\infty}\frac{p_{1}(\omega)}{(-\omega)p_{1}(\omega)}
    =\displaystyle= 0\displaystyle 0
  4.
    limω+ω(λ(ω)ω)\displaystyle\lim_{\omega\rightarrow+\infty}\omega(\lambda(\omega)-\omega) (45)
    =\displaystyle= limω+ω(p1(ω)ωΦ1(ω))Φ1(ω)\displaystyle\lim_{\omega\rightarrow+\infty}\frac{\omega\left(p_{1}(-\omega)-\omega\Phi_{1}(-\omega)\right)}{\Phi_{1}(-\omega)}
    =\displaystyle= limωωp1(ω)ω2Φ1(ω)Φ1(ω)\displaystyle\lim_{\omega\rightarrow-\infty}\frac{-\omega p_{1}(\omega)-\omega^{2}\Phi_{1}(\omega)}{\Phi_{1}(\omega)}
    =L’Hospital’s rule\displaystyle\overset{\mathclap{\underset{\downarrow}{\text{L'Hospital's rule}}}}{=} limωp1(ω)ω(ω)p1(ω)ω2p1(ω)2ωΦ1(ω)p1(ω)\displaystyle\lim_{\omega\rightarrow-\infty}\frac{-p_{1}(\omega)-\omega(-\omega)p_{1}(\omega)-\omega^{2}p_{1}(\omega)-2\omega\cdot\Phi_{1}(\omega)}{p_{1}(\omega)}
    =\displaystyle= 12limωωΦ1(ω)p1(ω)\displaystyle-1-2\lim_{\omega\rightarrow-\infty}\frac{\omega\Phi_{1}(\omega)}{p_{1}(\omega)}
    =\displaystyle= 1+2limω+1λ(ω)ω\displaystyle-1+2\lim_{\omega\rightarrow+\infty}\frac{1}{\frac{\lambda(\omega)}{\omega}}
    =\displaystyle= 1+2\displaystyle-1+2
    =\displaystyle= 1.\displaystyle 1.

Thus the lemma holds.
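The following sketch illustrates Lemma 21 and the discussion above numerically: it evaluates the Gaussian hazard rate via logpdf/logsf for numerical stability and shows how the quantities controlling C_{\text{down}} and C_{\text{exp}} scale with B/\sigma (the grid of ratios is an illustrative assumption).

```python
# Numerical illustration of the Gaussian hazard rate lambda(w) = p(w) / (1 - Phi(w)).
import numpy as np
from scipy.stats import norm

def hazard(w):
    return np.exp(norm.logpdf(w) - norm.logsf(w))

print(hazard(-6.0))                    # exponentially small as w -> -infinity
print(hazard(6.0) - 6.0, 1.0 / 6.0)    # lambda(w) - w ~ 1/w as w -> +infinity

for ratio in [1.0, 2.0, 4.0, 8.0]:     # ratio plays the role of B / sigma
    w = np.linspace(-ratio, ratio, 2001)
    lam = hazard(w)
    c_down = np.min(lam * (lam - w))   # proportional to C_down (up to a 1/sigma^2 factor)
    c_exp = np.max(lam ** 2)           # proportional to C_exp (up to a 1/sigma^2 factor)
    print(ratio, c_down, c_exp)        # c_down decays exponentially, c_exp grows ~ (B/sigma)^2
```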

C.2 Algorithmic Design

C.2.1 Probit and Logistic Regressions

A probit/logit model is described as follows: a Boolean random variable Y satisfies \mathbb{P}[Y=1|X]=F(X^{\top}\beta), where X\in\mathbb{R}^{d} is a random vector, \beta\in\mathbb{R}^{d} is a parameter vector, and F is the cumulative distribution function (CDF) of a (standard) Gaussian/logistic distribution. In our problem, we may treat \mathbbm{1}_{t} as Y, [x_{t}^{\top},v_{t}]^{\top} as X and [\theta^{*\top},-1]^{\top} as \beta, which exactly fits this model if we assume the noise is Gaussian or logistic. Therefore, \hat{\theta}_{k}=\mathop{\arg\min}_{\theta}\hat{L}_{k}(\theta) can be solved via highly efficient implementations of generalized linear models, e.g., GLMnet, rather than resorting to generic tools for convex programming. As a heuristic, we could leverage the vast body of statistical work on probit and logit models and adopt a fully Bayesian approach that jointly estimates \theta and the hyper-parameters of F. This would make the algorithm more practical by eliminating the need to choose the hyper-parameters when running the algorithm.
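As a minimal illustration of this reduction (a sketch on simulated data under the Gaussian-noise assumption, not the paper's implementation), the following code fits the probit model by directly minimizing the negative log-likelihood; \theta^{*}, the feature and price distributions, and the use of scipy are illustrative choices.

```python
# Probit-regression estimate of theta from censored sales feedback (simulated data).
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, n = 3, 5000
theta_star = np.array([0.5, -0.3, 0.8])
X = rng.uniform(-1, 1, size=(n, d))
V = rng.uniform(0, 2, size=n)                             # posted prices
sold = (V <= X @ theta_star + rng.normal(size=n)).astype(float)

def neg_log_lik(theta):
    z = V - X @ theta                                     # P(sale) = 1 - F(v - x^T theta)
    return -np.sum(sold * norm.logsf(z) + (1 - sold) * norm.logcdf(z))

theta_hat = minimize(neg_log_lik, np.zeros(d), method="BFGS").x
print(theta_hat)   # close to theta_star for moderately large n
```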

C.2.2 Advantages of EMLP over ONSP.

For the stochastic setting, we specifically propose EMLP even though ONSP also works. This is because EMLP only "switches" the pricing parameter \hat{\theta} O(\log{T}) times. This makes it appealing in many applications (especially brick-and-mortar sales) where the number of policy updates is a bottleneck. In fact, the iterations within one epoch can be carried out entirely in parallel.

C.2.3 Agnostic Dynamic Pricing: Explorations versus Exploitation

At the moment, the proposed algorithms rely on the assumption of a linear valuation function (see Appendix C.3 for more discussion on problem modeling). It would be interesting to investigate model-misspecified cases and fully agnostic settings. The key would be to exploit the structural feedback in model-free policy-evaluation methods such as importance sampling. The main reason why we do not need explicit exploration lies in the noise model: essentially, we are implicitly exploring a higher (permitted) price using the naturally occurring noise in the data. In comparison, there is another problem setting named "adversarial irrationality" in which some of the customers valuate the product adaptively and adversarially (an adaptive adversary may take actions adversarially in response to the environmental changes; in comparison, the "adversarial features" we allow are chosen by an oblivious adversary before the interactions start). Existing work by Krishnamurthy et al. [2021] adopts this setting and shows a regret that depends linearly on the number of irrational customers, but they consider a different loss function (see the Related Works section).

C.3 Problem Modeling

C.3.1 Noise Distributions

In this work, we have made four assumptions on the noise distribution: strict log-concavity, 2^{\text{nd}}-order smoothness, known distribution, and i.i.d. noise. Here we explain each of them specifically.

  • The assumption of knowing the exact F is critical to the regret bound: if we have this knowledge, then we achieve O(\log{T}) regret even with adversarial features; otherwise, an \Omega(\sqrt{T}) regret is unavoidable even with stochastic features.

  • The strictly log-concave distribution family includes the Gaussian and logistic distributions as two common noise models. In comparison, Javanmard and Nazerzadeh [2019] assume log-concavity, which further covers the Laplacian, exponential and uniform distributions. Javanmard and Nazerzadeh [2019] also consider the cases where (1) the noise distribution is unknown but log-concave, and (2) the noise distribution is zero-mean and bounded with support [-\delta,\delta]. For case (1), they propose an algorithm with O(\sqrt{T}) regret and prove a matching lower bound. For case (2), they propose an algorithm with linear regret.

  • The assumption that F is 2^{\text{nd}}-order smooth is also made by Javanmard and Nazerzadeh [2019], who take the derivative f^{\prime}(v) and apply its upper bound in the proof. Consequently, we are still unaware of the regret bound when the noise distribution is discrete, in which case a lower bound of \Omega(\sqrt{T}) can be directly applied from Kleinberg and Leighton [2003].

  • We further assume that the noise is identically distributed. In reality, however, the noise would vary across customers. The same concern applies to the parameter \theta^{*}: can we assume that different people share the same valuation parameter? We may interpret it in the following two ways, but both still have flaws: (1) the "customer" is the public, i.e., its aggregate behavior is quite stable in general; or (2) the customer is the same individual over the whole time series. However, the former interpretation does not match the assumption that we sell only one product at each time, and the latter would undermine the independence assumption on the noise: a person learns as well and might gradually reduce the noise in their decisions. In that respect, it would be closer to reality to assume the noises form a martingale. This assumption has been stated in Qiang and Bayati [2016].

C.3.2 Linear Valuations on Features

There exist many products whose valuations do not depend linearly on their features. One famous instance is the diamond: a kilogram of diamond powder is very cheap because it can be produced artificially, but a single 5-carat (i.e., 1 gram) diamond might cost more than $100,000. This is because of an intrinsic non-linear property of diamonds: large ones are rare and cannot be (at least not easily) composed from smaller ones. Another example lies in electricity pricing [Joskow and Wolfram, 2012], where the more you consume, the higher the unit price you pay. On the contrary, goods bought in bulk tend to be cheaper than at retail prices. These are both consequences of marginal costs: consuming a large volume of electricity may require extra maintenance and increase the cost, while purchasing a large amount frees up storage and thus reduces the cost. In a word, our problem setting might not be suitable when the features are large enough, and thus an upper bound on x^{\top}\theta becomes a necessity.

C.4 Ex Ante v.s. Ex Post Regrets

In this work, we considered the ex ante regret Reg_{ea}=\sum_{t=1}^{T}\max_{\theta}\mathbb{E}[v^{\theta}_{t}\cdot\mathbbm{1}(v^{\theta}_{t}\leq w_{t})]-\mathbb{E}[v_{t}\cdot\mathbbm{1}(v_{t}\leq w_{t})], where v^{\theta}_{t}=J(x_{t}^{\top}\theta) is the greedy price with parameter \theta and w_{t}=x_{t}^{\top}\theta^{*}+N_{t} is the realized random valuation. The ex post definition of the cumulative regret, i.e., Reg_{ep}=\max_{\theta}\sum_{t=1}^{T}v^{\theta}_{t}\mathbbm{1}(v^{\theta}_{t}\leq w_{t})-v_{t}\mathbbm{1}(v_{t}\leq w_{t}), makes sense too. Note that we can decompose \mathbb{E}\left[Reg_{ep}\right]=Reg_{ea}+\mathbb{E}[\max_{\theta}\sum_{t=1}^{T}v^{\theta}_{t}\mathbbm{1}(v^{\theta}_{t}\leq w_{t})-\sum_{t=1}^{T}v^{\theta^{*}}_{t}\mathbbm{1}(v^{\theta^{*}}_{t}\leq w_{t})]. While the second term might be as large as \Omega(\sqrt{dT}), it does not depend on the algorithm. For this reason, we believe using Reg_{ea} is without loss of generality, and it reveals more nuanced performance differences among algorithms.

For the ex post dynamic regret, i.e., Reg_{d}=\sum_{t=1}^{T}w_{t}-v_{t}\cdot\mathbbm{1}(v_{t}\leq w_{t}), it is argued in Cohen et al. [2020] that any policy must suffer an expected regret of \Omega(T) (even if \theta^{*} is known). A good illustrative example is N_{t}\sim\mathcal{N}(0,1) and x_{t}^{\top}\theta^{*}=\sqrt{\frac{\pi}{2}}: the optimal price is \sqrt{\frac{\pi}{2}} as well, but the probability of acceptance is only 1/2, which leads to a constant per-step regret of \frac{1}{2}\sqrt{\frac{\pi}{2}}.
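A quick Monte Carlo check of this example (the sample size is an illustrative choice):

```python
# Per-step dynamic regret when N_t ~ N(0, 1), x_t^T theta* = sqrt(pi/2) and we post
# the optimal price v = sqrt(pi/2): the expectation equals (1/2) * sqrt(pi/2).
import numpy as np

rng = np.random.default_rng(0)
u = np.sqrt(np.pi / 2)
w = u + rng.normal(size=1_000_000)          # realized valuations
per_step = np.mean(w - u * (u <= w))        # E[w_t - v_t * 1(v_t <= w_t)]
print(per_step, 0.5 * np.sqrt(np.pi / 2))   # both approximately 0.6267
```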

C.5 Ethical Issues

A field of study addresses "personalized dynamic pricing" [Aydin and Ziya, 2009, Chen and Gallego, 2021], where a firm makes use of information about individual customers and sets a unique price for each of them. This has been frequently applied in airline pricing [Krämer et al., 2018]. However, it amounts to first-degree price discrimination. Even though this "discrimination" is not necessarily immoral, it is certainly awkward to be caught proposing the same product at different prices to different customers. For example, if we know that the coming customer is wealthy and not very price-sensitive (e.g., their valuation noise has a larger variance than other customers'), then we could probably raise the price without taking much risk. Or, if the customer habitually purchases goods from us, then they might have a higher expectation of our products (e.g., their parameter is \theta=a\theta^{*} with a>1), and we might take advantage of this and propose a higher price than we do to others. These situations would not arise in an auction-based setting (such as a live sale), but might frequently occur in less transparent settings, for instance, customized travel plans.