
Logarithmic Regret in Feature-based Dynamic Pricing

Jianyu Xu
Department of Computer Science
University of California, Santa Barbara
Santa Barbara, CA 93106
[email protected]
Yu-Xiang Wang
Department of Computer Science
University of California, Santa Barbara
Santa Barbara, CA 93106
[email protected]
Abstract

Feature-based dynamic pricing is an increasingly popular model of setting prices for highly differentiated products with applications in digital marketing, online sales, real estate and so on. The problem was formally studied as an online learning problem (Javanmard and Nazerzadeh, 2019) where a seller needs to propose prices on the fly for a sequence of T products based on their features x while having a small regret relative to the best ("omniscient") pricing strategy she could have come up with in hindsight. We revisit this problem and provide two algorithms (EMLP and ONSP) for stochastic and adversarial feature settings, respectively, and prove the optimal O(d\log{T}) regret bounds for both. In comparison, the best existing results are O\left(\min\left\{\frac{1}{\lambda_{\min}^{2}}\log{T},\sqrt{T}\right\}\right) and O(T^{2/3}) respectively, with \lambda_{\min} being the smallest eigenvalue of \mathbb{E}[xx^{\top}] that could be arbitrarily close to 0. We also prove an \Omega(\sqrt{T}) information-theoretic lower bound for a slightly more general setting, which demonstrates that "knowing-the-demand-curve" leads to an exponential improvement in feature-based dynamic pricing.

Keywords: dynamic pricing, online learning, adversarial features, optimal regret, affine invariant, distribution-free.

1 Introduction

The problem of pricing, i.e., finding a price that is both high and acceptable, has been studied since Cournot (1897). In order to locate the optimal price that maximizes revenue, a firm may adjust the prices of its products frequently, which motivates the dynamic pricing problem. Existing works (Kleinberg and Leighton, 2003; Broder and Rusmevichientong, 2012; Chen and Farias, 2013; Besbes and Zeevi, 2015) primarily focus on pricing a single product, which usually does not work well in the setting where thousands of new products are listed every day with no prior experience in selling them. Therefore, we seek methods that approach an acceptable-and-profitable price using only the observations on this single product and some historical selling records of other products.

In this work, we consider a “feature-based dynamic pricing” problem, which was studied by Amin et al. (2014); Cohen et al. (2020); Javanmard and Nazerzadeh (2019). In this problem setting, a sales session (product, customer and other environmental variables) is described by a feature vector, and the customer’s expected valuation is modeled as a linear function of this feature vector.

Feature-based dynamic pricing. For t=1,2,\ldots,T:
1. A feature vector x_{t}\in\mathbb{R}^{d} is revealed that describes a sales session (product, customer and context).
2. The customer values the product at w_{t}=x_{t}^{\top}\theta^{*}+N_{t}.
3. The seller concurrently proposes a price v_{t}>0 (according to x_{t} and historical sales records).
4. The transaction is successful if v_{t}\leq w_{t}, i.e., the seller receives a reward (payment) of r_{t}=v_{t}\cdot\mathbbm{1}(v_{t}\leq w_{t}).

Here T is unknown to the seller (and thus can go to infinity), the x_{t}'s can be either stochastic (e.g., each sales session is drawn i.i.d.) or adversarial (e.g., the sessions arrive in a strategic sequence), \theta^{*}\in\mathbb{R}^{d} is a fixed parameter for all time periods, N_{t} is a zero-mean noise, and \mathbbm{1}_{t}=\mathbbm{1}(v_{t}\leq w_{t}) is an indicator that equals 1 if v_{t}\leq w_{t} and 0 otherwise. In this online setting, we only see and sell one product at each time. Also, the feedback is Boolean and censored, which means we can only observe \mathbbm{1}_{t} instead of knowing w_{t} directly. The best pricing policy for this problem is the one that maximizes the expected reward, and the regret of a pricing policy is accordingly defined as the difference in expected reward between the best policy and the selected policy.
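The interaction protocol above is easy to simulate; below is a minimal sketch, assuming Gaussian noise and uniform features purely for illustration (the noise model, parameter values and the naive placeholder policy are assumptions, not part of the paper).

```python
# Minimal simulation of the feature-based dynamic pricing protocol (a sketch).
import numpy as np

rng = np.random.default_rng(0)
d, T, sigma = 2, 1000, 0.25
theta_star = np.array([0.6, 0.3])            # hidden parameter theta*

def naive_policy(x, history):
    """Placeholder policy: posts the expected valuation (a real learner must estimate theta*)."""
    return float(x @ theta_star)

history, revenue = [], 0.0
for t in range(T):
    x_t = rng.uniform(0.0, 1.0, size=d)      # feature of the sales session (i.i.d. here)
    w_t = x_t @ theta_star + sigma * rng.standard_normal()   # noisy valuation
    v_t = naive_policy(x_t, history)         # seller posts a price
    sold = int(v_t <= w_t)                   # Boolean censored feedback
    revenue += v_t * sold
    history.append((x_t, v_t, sold))         # hist(t): only (x, v, indicator) is recorded
print(f"total revenue over {T} rounds: {revenue:.1f}")
```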

Summary of Results. Our contributions are threefold.

  1. When the x_{t}'s are independently and identically distributed (i.i.d.) from an unknown distribution, we propose an "Epoch-based Max-Likelihood Pricing (EMLP)" algorithm that guarantees an O(d\log{T}) regret bound. The design of EMLP is similar to that of the RMLP algorithm in Javanmard and Nazerzadeh (2019), but our new analysis improves on their O(\sqrt{T}) regret bound when \mathbb{E}[xx^{\top}] is near singular.

  2. When the x_{t}'s are adversarial, we propose an "Online-Newton-Step Pricing (ONSP)" algorithm that achieves O(d\log{T}) regret under constant-level noise for the first time, which exponentially improves the best existing result of O(T^{2/3}) (Cohen et al., 2020). (Previous works (Cohen et al., 2020; Krishnamurthy et al., 2021) did achieve polylog regrets, but only for negligible noise with \sigma=O(\frac{1}{T\log{T}}).)

  3. Our methods that achieve logarithmic regret require knowing the exact distribution of N_{t} in advance, as is also assumed in Javanmard and Nazerzadeh (2019). We prove an \Omega(\sqrt{T}) lower bound on the regret if N_{t}\sim\mathcal{N}(0,\sigma^{2}) where \sigma is unknown, even with \theta^{*} given and x_{t} fixed for all t.

The O(\log{T}) regret of EMLP and ONSP meets the information-theoretic lower bound (Theorem 5, Javanmard and Nazerzadeh, 2019). In fact, the bound is optimal even when w_{t} is revealed to the learner (Mourtada, 2019). From the perspective of characterizing the hardness of dynamic pricing problems, we generalize the classical results on "The Value of Knowing a Demand Curve" (Kleinberg and Leighton, 2003) by further dividing the random-valuation class with an exponential separation of: (1) O(\log{T}) regret when the demand curve is known exactly (even with adversarial features), and (2) \Omega(\sqrt{T}) regret when the demand curve is almost known (up to a one-parameter parametric family).

2 Related Works

In this section, we discuss our results relative to existing works on feature-based dynamic pricing, and highlight the connections and differences to the related settings of contextual bandits and contextual search (for a broader discussion, see Appendix A).

Feature-based Dynamic Pricing. There is a growing body of work on dynamic pricing with linear features (Amin et al., 2014; Qiang and Bayati, 2016; Cohen et al., 2020; Javanmard and Nazerzadeh, 2019). Table 1 summarizes the differences in their settings and results. (We only concern ourselves with the dependence on T, since there are various different assumptions on d.)

Table 1: Related Works and Regret Bounds w.r.t. T

Algorithm | Work | Regret (upper) bound | Feature | Noise
LEAP | Amin et al. (2014) | \tilde{O}(T^{2/3}) | i.i.d. | Noise-free
EllipsoidPricing | Cohen et al. (2020) | O(\log{T}) | adversarial | Noise-free
EllipsoidEXP4 | Cohen et al. (2020) | \tilde{O}(T^{2/3}) | adversarial | Sub-Gaussian
PricingSearch | Leme and Schneider (2018) | O(\log\log(T)) | adversarial | Noise-free
RMLP | Javanmard and Nazerzadeh (2019) | O(\min\{\log{T}/C_{\min}^{2},\sqrt{T}\}) | i.i.d. | Log-concave, distribution-known
RMLP-2 | Javanmard and Nazerzadeh (2019) | O(\sqrt{T}) | i.i.d. | Known parametric family of log-concave
ShallowPricing | Cohen et al. (2020) | O(\mathrm{poly}(\log{T})) | adversarial | Sub-Gaussian, known \sigma=O(\frac{1}{T\log{T}})
CorPV | Krishnamurthy et al. (2021) | O(\mathrm{poly}(\log{T})) | adversarial | Sub-Gaussian, known \sigma=O(\frac{1}{T\log{T}})
Algorithm 2 (MSPP) | Liu et al. (2021) | O(\log\log(T)) | adversarial | Noise-free
EMLP | This paper | O(\log T) | i.i.d. | Strictly log-concave, distribution-known
ONSP | This paper | O(\log{T}) | adversarial | Strictly log-concave, distribution-known

C_{\min} is the restricted eigenvalue condition; it reduces to the smallest eigenvalue of \mathbb{E}[xx^{\top}] in the low-dimensional case we consider.

Among these works, our paper directly builds upon Cohen et al. (2020) and Javanmard and Nazerzadeh (2019), as we share the same setting of online feature vectors, linear and noisy valuations, and Boolean-censored feedback. Relative to the results in Javanmard and Nazerzadeh (2019), we obtain O(d\log T) regret under weaker assumptions on the sequence of input features, in both the distribution-free stochastic feature setting and the adversarial feature setting. It is to be noted that Javanmard and Nazerzadeh (2019) also covers the sparse high-dimensional setting and handles a slightly broader class of demand curves. Relative to Cohen et al. (2020), in which adversarial feature-based dynamic pricing was first studied, our algorithm ONSP enjoys the optimal O(d\log T) regret when the noise level is a constant. In comparison, Cohen et al. (2020) reduces the problem to contextual bandits and applies the (computationally inefficient) "EXP-4" algorithm (Auer et al., 2002) to achieve an \tilde{O}(T^{2/3}) regret. The "bisection"-style algorithms in both Cohen et al. (2020) and Krishnamurthy et al. (2021) could achieve \tilde{O}(\mathrm{poly}(d)\mathrm{poly}\log(T)) regrets but require a small-variance sub-Gaussian noise satisfying \sigma=O(\frac{1}{T\log{T}}).

Lower Bounds. Most existing works focus on lower regret bounds for non-feature-based models. Kleinberg and Leighton (2003) divides the problem into fixed, random, and adversarial valuation settings, and proves \Theta(\log\log{T}), \Theta(\sqrt{T}), and \Theta(T^{2/3}) regret bounds for them, respectively. Broder and Rusmevichientong (2012) further proves a \Theta(\sqrt{T}) regret for general parametric valuation models. In this work, we generalize the methods of Broder and Rusmevichientong (2012) to our feature-based setting and further narrow the construction down to a linear-feature Gaussian-noise model. As a complement to Kleinberg and Leighton (2003), we further separate by an exponential regret gap: (1) O(\log{T}) for the hardest (adversarial feature) fully parametric model, and (2) \Omega(\sqrt{T}) for the simplest (fixed, known expectation) unknown-\sigma Gaussian model.

Contextual Bandits. For readers familiar with the online learning literature, our problem can be reduced to a contextual bandits problem (Langford and Zhang, 2007; Agarwal et al., 2014) by discretizing the prices. But this reduction only yields O(T^{2/3}) regret, as it does not capture the special structure of the feedback: an accepted price indicates the acceptance of all lower prices, and vice versa. Moreover, compared to linear bandits (Chu et al., 2011), it is the valuation rather than the expected reward that we assume to be linear.

Contextual Search. Feature-based dynamic pricing is also related to the contextual search problem (Lobel et al., 2018; Leme and Schneider, 2018; Liu et al., 2021; Krishnamurthy et al., 2021), which often involves learning from Boolean feedback, sometimes with a "pricing loss" and "noisy" feedback. This shared jargon makes contextual search appear very similar to our problem. However, except for the noiseless cases (Lobel et al., 2018; Leme and Schneider, 2018), contextual search algorithms, even with "pricing losses" and "noisy Boolean feedback" (e.g., Liu et al., 2021), do not imply meaningful regret bounds in our problem setup due to several subtle but important differences in the problem settings. Specifically, the noisy Boolean feedback model of Liu et al. (2021) randomly toggles the "purchase decision" determined by the noiseless valuation x^{\top}\theta^{*} with probability 0.5-\epsilon. This is incompatible with our problem setting, where the purchase decision is determined by a noisy valuation x^{\top}\theta^{*}+\text{Noise}. Ultimately, in the setting of Liu et al. (2021) the optimal policy always plays x^{\top}\theta^{*}, but our problem is harder in that we need to exploit the noise, and the optimal price could be very different from x^{\top}\theta^{*}. (As an explicit example, suppose the valuation x^{\top}\theta^{*}=0; then the optimal price must be >0 in order to avoid zero revenue.) Krishnamurthy et al. (2021) also discussed this issue explicitly and considered the more natural noisy Boolean feedback model studied in this paper. Their result, similar to Cohen et al. (2020), only achieves a logarithmic regret when the noise on the valuation vanishes at an \tilde{O}(1/T) rate.

3 Problem Setup

Symbols and Notations. We now introduce the mathematical symbols and notations used in the following pages. The game consists of T rounds. x_{t}\in\mathbb{R}^{d}, v_{t}\in\mathbb{R}_{+} and N_{t}\in\mathbb{R} denote the feature vector, the proposed price and the noise, respectively, at round t=1,2,\ldots,T. (In an epoch-based design, a subscript (k,t) indicates round t of epoch k.) We denote the product u_{t}:=x_{t}^{\top}\theta^{*} as the expected valuation. At each round, we receive a payoff (reward) r_{t}=v_{t}\cdot\mathbbm{1}_{t}, where the binary variable \mathbbm{1}_{t} indicates whether the price is accepted or not, i.e., \mathbbm{1}_{t}=\mathbf{1}(v_{t}\leq w_{t}). As we may estimate \theta^{*} in our algorithms, we denote by \hat{\theta}_{t} an estimator of \theta^{*}, which we will formally define in the algorithms. Furthermore, we denote some functions related to the noise distribution: F(\omega) and f(\omega) denote the cumulative distribution function (CDF) and probability density function (PDF), respectively. We have F^{\prime}(\omega)=f(\omega) if we assume differentiability. To concisely denote all data observed up to round \tau (i.e., feature, price and payoff of all past rounds), we define hist(\tau)=\{(x_{t},v_{t},\mathbbm{1}_{t})\text{ for }t=1,2,\ldots,\tau\}; hist(\tau) represents the transcript of all observed random variables before round (\tau+1).

We define

l_{t}(\theta):=-\mathbbm{1}_{t}\cdot\log\big(1-F(v_{t}-x_{t}^{\top}\theta)\big)-(1-\mathbbm{1}_{t})\cdot\log\big(F(v_{t}-x_{t}^{\top}\theta)\big) (1)

as the negative log-likelihood function at round t. Also, we define an expected log-likelihood function L_{t}(\theta):

L_{t}(\theta):=\mathbb{E}_{N_{t}}[l_{t}(\theta)\,|\,x_{t}] (2)

Notice that we will later define an \hat{L}_{k}(\theta) which is, however, not an expectation.
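As a concrete instance of Equation (1), the following sketch evaluates l_{t}(\theta) for Gaussian noise (the noise model and the value of sigma are illustrative assumptions; any known strictly log-concave F works the same way).

```python
# Per-round negative log-likelihood l_t(theta) from Equation (1), for Gaussian noise.
import numpy as np
from scipy.stats import norm

sigma = 0.25
F = lambda w: norm.cdf(w, scale=sigma)

def neg_log_likelihood(theta, x_t, v_t, sold):
    """l_t(theta) = -1_t*log(1 - F(v_t - x_t^T theta)) - (1 - 1_t)*log F(v_t - x_t^T theta)."""
    p_accept = np.clip(1.0 - F(v_t - x_t @ theta), 1e-12, 1.0 - 1e-12)  # P[sale], clipped for stability
    return -(sold * np.log(p_accept) + (1 - sold) * np.log(1.0 - p_accept))
```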

Definitions of Key Quantities. We first define an expected reward function g(v,u):

g(v,u):=\mathbb{E}[r_{t}\,|\,v_{t}=v,x_{t}^{\top}\theta^{*}=u]=v\cdot P[v\leq x_{t}^{\top}\theta^{*}+N_{t}]=v\cdot(1-F(v-u)). (3)

This indicates that if the expected valuation is u and the proposed price is v, then the (conditionally) expected reward is g(v,u). Now we formally define the regret of a policy (algorithm) \mathcal{A}, as promised in Section 1.

Definition 1 (Regret).

Let \mathcal{A}:\mathbb{R}^{d}\times\left(\mathbb{R}^{d},\mathbb{R},\{0,1\}\right)^{t-1}\rightarrow\mathbb{R} be a pricing policy, i.e., \mathcal{A}(x_{t},hist(t-1))=v_{t}. The regret of \mathcal{A} is defined as follows.

Reg_{\mathcal{A}}=\sum_{t=1}^{T}\max_{v}g(v,x_{t}^{\top}\theta^{*})-g(\mathcal{A}(x_{t},hist(t-1)),x_{t}^{\top}\theta^{*}). (4)

Here hist(t-1) is the historical record up to the (t-1)^{\text{th}} round.

Summary of Assumptions. We specify the problem settings by proposing three assumptions.

Assumption 1 (Known, bounded, strictly log-concave distribution).

The noise N_{t} is independently and identically sampled from a distribution whose CDF is F. Assume that F\in\mathbb{C}^{2} is strictly increasing and that F and (1-F) are strictly log-concave. Also assume that f and f^{\prime} are bounded, and denote B_{f}:=\sup_{\omega\in\mathbb{R}}{f(\omega)} and B_{f^{\prime}}:=\sup_{\omega\in\mathbb{R}}|f^{\prime}(\omega)| as two constants.

Assumption 2 (Bounded convex parameter space).

The true parameter \theta^{*}\in\mathbb{H}, where \mathbb{H}\subseteq\{\theta:||\theta||_{2}\leq B_{1}\} is a bounded convex set and B_{1} is a constant. Assume \mathbb{H} is known to us (but \theta^{*} is not).

Assumption 3 (Bounded feature space).

Assume x_{t}\in D\subseteq\{x:||x||_{2}\leq B_{2}\} for all t=1,2,\ldots,T. Also, 0\leq x^{\top}\theta\leq B for all x\in D and \theta\in\mathbb{H}, where B=B_{1}\cdot B_{2} is a constant.

Assumptions 2 and 3 are mild, as we can choose B_{1} and B_{2} large enough. In Section 4.1, we add a further complement to Assumption 3 to form the stochastic setting. Assumption 1 is stronger since we might not know the exact CDF in practice, but it is still acceptable from an information-theoretic perspective. There are at least three reasons for this assumption. Primarily, it is necessary if we hope to achieve an O(\log(T)) regret: we prove in Section 5.3 that an \Omega(\sqrt{T}) regret is unavoidable if even one parameter of the distribution is unknown. Moreover, the pioneering work of Javanmard and Nazerzadeh (2019) also assumes a known noise distribution with log-concave CDF, and many common distributions are in fact strictly log-concave, such as the Gaussian and logistic distributions. (In fact, F and (1-F) are both log-concave if the PDF is log-concave, according to Prekopa's Inequality.) Besides, although we do not present a method to precisely estimate \sigma in this work, it is reasonable to replace the known parameter with a plug-in estimator computed from historical offline data. As we show, not knowing \sigma incurs \Omega(\sqrt{T}) regret in general, but the lower bound does not rule out the plug-in approach achieving a smaller regret for interesting subclasses of problems in practice.

Finally, we state a lemma and define an argmax function helpful for our algorithm design.

Lemma 2 (Uniqueness).

For any u\geq 0, there exists a unique v^{*}\geq 0 such that g(v^{*},u)=\max_{v\in\mathbb{R}}{g}(v,u). Thus, we can define a greedy pricing function that maximizes the expected reward:

J(u):=\mathop{\arg\max}_{v}g(v,u) (5)

Please see the proof of Lemma 2 in Appendix B.1.
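Since g(\cdot,u) is unimodal with a unique maximizer by Lemma 2, J(u) can be computed by a simple one-dimensional search. Below is a minimal sketch assuming Gaussian noise; the value of sigma and the search interval are illustrative assumptions, not values from the paper.

```python
# Computing the greedy price J(u) = argmax_v v * (1 - F(v - u)) from Equation (5).
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

sigma = 0.25
F = lambda w: norm.cdf(w, scale=sigma)

def J(u):
    """Greedy price maximizing g(v, u) = v * (1 - F(v - u)); unique by Lemma 2."""
    res = minimize_scalar(lambda v: -v * (1.0 - F(v - u)),
                          bounds=(0.0, u + 3.0), method="bounded")  # generous bracket for this sigma
    return res.x

print(J(0.5))   # optimal posted price when the expected valuation is 0.5
```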

4 Algorithms

In this section, we propose two dynamic pricing algorithms: EMLP and ONSP, for stochastic and adversarial features respectively.

Algorithm 1 Epoch-based max-likelihood pricing (EMLP)
  Input: Convex and bounded set \mathbb{H}
  Observe x_{1}, randomly choose v_{1} and get r_{1}.
  Solve \hat{\theta}_{1}=\arg\min_{\theta\in\mathbb{H}}l_{1}(\theta);
  for k=1 to \lfloor\log_{2}{T}\rfloor+1 do
     Set \tau_{k}=2^{k-1};
     for t=1 to \tau_{k} do
        Observe x_{k,t};
        Set price v_{k,t}=J(x_{k,t}^{\top}\hat{\theta}_{k});
        Receive r_{k,t}=v_{k,t}\cdot\mathbbm{1}_{k,t};
     end for
     Solve \hat{\theta}_{k+1}=\arg\min_{\theta\in\mathbb{H}}\hat{L}_{k}(\theta), where \hat{L}_{k}(\theta)=\frac{1}{\tau_{k}}\sum_{t=1}^{\tau_{k}}l_{k,t}(\theta).
  end for
Algorithm 2 Online Newton Step Pricing (ONSP)
  Input: Convex and bounded set \mathbb{H}, initial \theta_{1}, parameters \gamma,\epsilon>0
  Set A_{0}=\epsilon\cdot{I_{d}};
  for t=1 to T do
     Observe x_{t};
     Set price v_{t}=J(x_{t}^{\top}\theta_{t});
     Receive r_{t}=v_{t}\cdot\mathbbm{1}_{t};
     Set the surrogate loss function l_{t}(\theta);
     Calculate \nabla_{t}=\nabla{l}_{t}(\theta_{t});
     Rank-1 update: A_{t}=A_{t-1}+\nabla_{t}\nabla_{t}^{\top};
     Newton step: \hat{\theta}_{t+1}=\theta_{t}-\frac{1}{\gamma}A_{t}^{-1}\nabla_{t};
     Projection: \theta_{t+1}=\Pi_{\mathbb{H}}^{A_{t}}(\hat{\theta}_{t+1}).
  end for

4.1 Pricing with Distribution-Free Stochastic Features

Assumption 4 (Stochastic features).

Assume the x_{t} are drawn independently and identically distributed (i.i.d.) from an unknown distribution \mathbb{D} supported on D, for all t=1,2,\ldots,T.

The first algorithm, the Epoch-based Max-Likelihood Pricing (EMLP) algorithm, is suitable for the stochastic setting defined by Assumption 4. EMLP proceeds in epochs, with each epoch doubling the length of the previous one. At the end of each epoch, we consolidate the observed data and solve a maximum likelihood estimation problem to learn \theta. A maximum likelihood estimator (MLE) is obtained by minimizing \hat{L}_{k}(\theta):=\frac{1}{\tau_{k}}\sum_{t=1}^{\tau_{k}}l_{k,t}(\theta), and is then used in the next epoch as if it were the true parameter vector. In this equation, k and \tau_{k} denote the index and length of epoch k, respectively. The estimator is computed using the data in hist(k), which denotes the transcript for epochs 1 through k. The pseudo-code of EMLP is summarized in Algorithm 1. In the remainder of this section, we discuss the computational efficiency and prove the O(d\log{T}) upper regret bound.

Computational Efficiency. The calculations in EMLP are straightforward except for \arg\min\hat{L}_{k}(\theta) and J(u). As g(v,u) is proved unimodal in Lemma 2, we can efficiently compute J(u) by binary search. We will prove that l_{k,t} is exp-concave (and thus also convex), so we may apply any off-the-shelf tools for convex optimization.
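As a concrete illustration of the per-epoch step, the sketch below solves the MLE with an off-the-shelf solver; the loss argument is a per-round negative log-likelihood such as the neg_log_likelihood sketch in Section 3, and the box relaxation of \mathbb{H} is an illustrative simplification rather than the paper's exact constraint handling.

```python
# Per-epoch MLE step of EMLP: theta_hat_{k+1} = argmin_theta L_hat_k(theta) (a sketch).
import numpy as np
from scipy.optimize import minimize

def epoch_mle(epoch_data, d, loss, B1=1.0):
    """epoch_data: list of (x_t, v_t, sold) tuples observed during the epoch."""
    def L_hat(theta):
        return np.mean([loss(theta, x, v, s) for (x, v, s) in epoch_data])
    res = minimize(L_hat, x0=np.zeros(d), method="L-BFGS-B",
                   bounds=[(-B1, B1)] * d)   # box relaxation of {||theta||_2 <= B1}
    return res.x
```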

MLE and Probit Regression. A closer inspection reveals that this log-likelihood function corresponds to a probit (Aldrich et al., 1984) or a logit model (Wright, 1995) for Gaussian or logistic noise, respectively. See Appendix C.2.1.

Affine Invariance. Both optimization problems involved depend only on x^{\top}\theta, so if we apply any affine transformation to x to obtain \tilde{x}=Ax, the agent can instead learn a new parameter \tilde{\theta}^{*}=(A^{\top})^{-1}\theta^{*} and achieve the same u_{t}=x_{t}^{\top}\theta^{*}. Also, the regret bound is not affected, as the upper bound B on x^{\top}\theta does not change. (Here A is assumed invertible; otherwise the mapping from \tilde{x}_{t} to u_{t} does not necessarily exist.) Therefore, it is only natural that the regret bound depends neither on the distribution of x nor on the condition number of \mathbb{E}[xx^{\top}] (i.e., the ratio \lambda_{\max}/\lambda_{\min}).
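A small numerical check of this remark (an illustration only): transforming x to Ax and \theta^{*} to (A^{\top})^{-1}\theta^{*} leaves x^{\top}\theta^{*} unchanged for any invertible A, so the learning problem is the same up to reparametrization.

```python
# Verify affine invariance of u = x^T theta* under x -> Ax, theta* -> (A^T)^{-1} theta*.
import numpy as np

rng = np.random.default_rng(1)
d = 3
theta_star = rng.random(d)
A = rng.random((d, d)) + d * np.eye(d)        # strictly diagonally dominant, hence invertible
theta_tilde = np.linalg.inv(A.T) @ theta_star
x = rng.random(d)
assert np.isclose(x @ theta_star, (A @ x) @ theta_tilde)
```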

4.2 Pricing with Adversarial Features

In this part, we propose an "Online Newton Step Pricing (ONSP)" algorithm that deals with an adversarial sequence \{x_{t}\} and guarantees O(d\log{T}) regret. The pseudo-code of ONSP is shown as Algorithm 2. In each round, it uses the likelihood function as a surrogate loss and applies the "Online Newton Step" (ONS) method to update \hat{\theta}. In the next round, it adopts the updated \hat{\theta} and sets a price greedily. In the remainder of this section, we discuss some properties of ONSP and prove the regret bound; a minimal sketch of a single ONSP round is given below.
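The sketch below assumes Gaussian noise with known sigma. The pricing function J is passed in (e.g., the sketch after Lemma 2), the true valuation w_t is supplied only to simulate the censored feedback, and the generalized projection onto \mathbb{H}=\{||\theta||_{2}\leq B_{1}\} is solved with a generic constrained solver; all of these are illustrative assumptions, not the paper's implementation.

```python
# One round of ONSP (Algorithm 2), as a hedged sketch for Gaussian noise.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def onsp_round(theta, A, x_t, w_t, J, gamma=1.0, B1=1.0, sigma=0.25):
    v_t = J(x_t @ theta)                              # greedy price from the current estimate
    sold = int(v_t <= w_t)                            # Boolean censored feedback
    m = v_t - x_t @ theta
    F, f = norm.cdf(m, scale=sigma), norm.pdf(m, scale=sigma)
    grad = (-f / (1.0 - F) if sold else f / F) * x_t  # gradient of l_t at theta_t (cf. Eq. (20))
    A = A + np.outer(grad, grad)                      # rank-1 update
    y = theta - np.linalg.solve(A, grad) / gamma      # Newton step
    proj = minimize(lambda th: (th - y) @ A @ (th - y), x0=theta,
                    constraints=[{"type": "ineq",
                                  "fun": lambda th: B1 - np.linalg.norm(th)}])
    return proj.x, A, v_t * sold                      # new estimate, updated A_t, realized reward
```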

The calculations of ONSP are straightforward. The time complexity of computing the matrix inverse A_{t}^{-1} is O(d^{3}), which is acceptable as d is small. In high-dimensional cases, we may use the Woodbury matrix identity, (A+xx^{\top})^{-1}=A^{-1}-\frac{1}{1+x^{\top}A^{-1}x}A^{-1}x(A^{-1}x)^{\top}, to reduce it to O(d^{2}), since we can obtain A^{-1} directly from the previous round.
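A minimal sketch of that rank-1 update, maintaining A_{t}^{-1} directly in O(d^{2}) per round instead of re-inverting in O(d^{3}):

```python
# Woodbury rank-1 update of the inverse, with a quick self-check.
import numpy as np

def woodbury_rank1_update(A_inv, x):
    """Return (A + x x^T)^{-1} given A^{-1}, via
    (A + x x^T)^{-1} = A^{-1} - (A^{-1} x)(A^{-1} x)^T / (1 + x^T A^{-1} x)."""
    Ax = A_inv @ x
    return A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)

rng = np.random.default_rng(0)
M = rng.random((3, 3))
A = np.eye(3) + M @ M.T          # a positive-definite matrix
x = rng.random(3)
assert np.allclose(woodbury_rank1_update(np.linalg.inv(A), x),
                   np.linalg.inv(A + np.outer(x, x)))
```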

5 Regret Analysis

In this section, we prove the logarithmic regret bounds of EMLP and ONSP for the stochastic and adversarial settings, respectively. Besides, we also prove an \Omega(\sqrt{T}) regret lower bound for a fully parametric F with one parameter unknown.

5.1 O(d\log{T}) Regret of EMLP

In this part, we present the regret analysis of Algorithm 1. First of all, we propose the following theorem as our main result on EMLP.

Theorem 3 (Overall regret).

With Assumptions 1, 2, 3 and 4, the expected regret of EMLP can be bounded by:

\mathbb{E}[Reg_{\text{EMLP}}]\leq 2C_{s}{d}{\log{T}}, (6)

where C_{s} is a constant that depends only on F(\omega) and is independent of \mathbb{D}.

The proof of Theorem 3 is involved. For the sake of clarity, we next present an inequality system as a roadmap toward the proof, and then formally establish each of its steps with lemmas.

Since EMLP proposes J(x_{k,t}^{\top}\hat{\theta}_{k}) in every round of epoch k, we may denote the per-round regret as Reg_{t}(\hat{\theta}_{k}), where:

Reg_{t}(\theta):=g(J(x_{t}^{\top}\theta^{*}),x_{t}^{\top}\theta^{*})-g(J(x_{t}^{\top}\theta),x_{t}^{\top}\theta^{*}). (7)

Therefore, it is sufficient to prove the following Theorem:

Theorem 4 (Expected per-round regret).

For the per-round regret defined in Equation (7), we have:

\mathbb{E}[Reg_{k,t}(\hat{\theta}_{k})]\leq C_{s}\cdot\frac{d}{\tau_{k}}.

The proof roadmap of Theorem 4 can be written as the following inequality system.

\mathbb{E}[Reg_{k,t}(\hat{\theta}_{k})]\leq C\cdot\mathbb{E}[(\hat{\theta}_{k}-\theta^{*})^{\top}x_{k,t}x_{k,t}^{\top}(\hat{\theta}_{k}-\theta^{*})]\leq\frac{2C}{C_{\text{down}}}\mathbb{E}[\hat{L}_{k}(\hat{\theta}_{k})-\hat{L}_{k}(\theta^{*})]\leq\frac{2C\cdot C_{\text{exp}}}{C_{\text{down}}^{2}}\frac{d}{\tau_{k}}. (8)

We explain Equation (8) in detail. The first inequality comes from the following Lemma 5.

Lemma 5 (Quadratic regret bound).

We have:

Reg_{t}(\theta)\leq C\cdot(\theta-\theta^{*})^{\top}x_{t}x_{t}^{\top}(\theta-\theta^{*}),\quad\forall\theta\in\mathbb{H},\forall x_{t}\in\mathbb{D}. (9)

Here C=2B_{f}+(B+J(0))\cdot B_{f^{\prime}}.

The intuition is that the function g(J(u),u) is second-order smooth at (J(u^{*}),u^{*}). A detailed proof of Lemma 5 is in Appendix B.2.1. Note that C is highly dependent on the distribution F. Next, we propose Lemma 6, which yields the second inequality of Equation (8).

Lemma 6 (Quadratic likelihood bound).

For the expected likelihood function L_{t}(\theta) defined in Equation (2), we have:

L_{t}(\theta)-L_{t}(\theta^{*})\geq\frac{1}{2}C_{\text{down}}(\theta-\theta^{*})^{\top}x_{t}x_{t}^{\top}(\theta-\theta^{*}),\quad\forall\theta\in\mathbb{H},\forall x\in\mathbb{D}, (10)
where C_{\text{down}}:=\inf_{\omega\in[-B,B+J(0)]}\min\left\{-\frac{\text{d}^{2}\log(1-F(\omega))}{\text{d}\omega^{2}},-\frac{\text{d}^{2}\log(F(\omega))}{\text{d}\omega^{2}}\right\}>0. (11)
Proof.

Since the true parameter always maximizes the expected likelihood function (Murphy, 2012), we have \nabla L(\theta^{*})=0, and hence by Taylor expansion L_{t}(\theta)-L_{t}(\theta^{*})=\frac{1}{2}(\theta-\theta^{*})^{\top}\nabla^{2}L_{t}(\tilde{\theta})(\theta-\theta^{*}) for some \tilde{\theta}=\alpha\theta^{*}+(1-\alpha)\theta. Therefore, we only need to prove the following lemma:

Lemma 7 (Strong convexity and Exponential Concavity).

Suppose l_{t}(\theta) is the negative log-likelihood function in epoch k at time t. For any \theta\in\mathbb{H} and x_{t}\sim\mathbb{D}, we have:

\nabla^{2}l_{t}(\theta)\succeq C_{\text{down}}x_{t}x_{t}^{\top}\succeq\frac{C_{\text{down}}}{C_{\text{exp}}}\nabla l_{t}(\theta)\nabla l_{t}(\theta)^{\top}\succeq 0, (12)
where C_{\text{exp}}:=\sup_{\omega\in[-B,B+J(0)]}\max\left\{\frac{f(\omega)^{2}}{F(\omega)^{2}},\frac{f(\omega)^{2}}{(1-F(\omega))^{2}}\right\}<+\infty. (13)

The proof of Lemma 7 is in Appendix B.2.2. With this lemma, we see that Lemma 6 holds. ∎

With Lemma 5 and Lemma 6, we immediately get the following Lemma 8.

Lemma 8 (Surrogate Regret).

The relationship between Reg_{t}(\theta) and the likelihood function can be shown as follows:

Reg_{t}(\theta)\leq\frac{2\cdot C}{C_{\text{down}}}\left(L_{t}(\theta)-L_{t}(\theta^{*})\right), (14)

for all \theta\in\mathbb{H} and x\in\mathbb{D}, where C and C_{\text{down}} are defined in Lemmas 5 and 6, respectively.

Lemma 8 enables us to use the negative log-likelihood function as a surrogate loss. This is not only an important insight in the regret analysis of EMLP, but also the foundation of the design of ONSP.

The last inequality of Equation (8) comes from this lemma:

Lemma 9 (Per-epoch surrogate regret bound).

Denoting by \hat{\theta}_{k} the estimator computed from epoch (k-1) and used in epoch k, we have:

\mathbb{E}_{h}[\hat{L}_{k}(\hat{\theta}_{k})-\hat{L}_{k}(\theta^{*})]\leq\frac{C_{\text{exp}}}{C_{\text{down}}}\cdot\frac{d}{\tau_{k}+1}. (15)

Here C_{\text{exp}} is defined in Equation (13), and \mathbb{E}_{h}[\cdot]=\mathbb{E}[\cdot\,|\,hist(k-1)].

The proof of Lemma 9 is partly derived from Koren and Levy (2015); here we give a proof sketch without detailed derivations. A full proof is in Appendix B.2.3.

Proof sketch of Lemma 9. We list the four main points that contribute to the proof:

  • Since l_{k,t}(\theta) is strongly convex w.r.t. the seminorm induced by x_{k,t}x_{k,t}^{\top}, we know \hat{L}_{k}(\theta) is also strongly convex w.r.t. \sum_{t=1}^{\tau_{k}}x_{k,t}x_{k,t}^{\top}.

  • For two strongly convex functions g_{1} and g_{2}, we can upper bound the distance between their minimizers (measured in some norm \|\cdot\|) by the dual norm of \nabla(g_{1}-g_{2}).

  • Since a seminorm has no dual norm, we apply two methods to convert it into a norm: (1) separation of the parameters and likelihood functions with a "leave-one-out" method (to take expectations separately), and (2) separation of the spanned space and the null space.

  • As the dual data-dependent norm offsets the sum of xx^{\top} to a constant, Lemma 9 holds.

We have thus proved Inequality (8) after establishing Lemmas 5, 6 and 9. Therefore, Theorem 4 holds.

5.2 O(d\log{T}) Regret of ONSP

Here we present the regret analysis of Algorithm 2 (ONSP). First, we state the main theorem.

Theorem 10.

With Assumptions 1, 2, 3, the regret of Algorithm 2 (ONSP) satisfies:

Reg_{\text{ONSP}}\leq C_{a}\cdot d\log{T}, (16)

where C_{a} depends only on F.

Proof.

The proof of Theorem 10 is more concise than that in Section 5.1, because the key Lemmas 7 and 8 have already been proved there. From Lemma 8, we have:

g(J(u_{t}^{*}),u_{t}^{*})-g(J(u_{t}),u_{t}^{*})\leq\frac{2\cdot C}{C_{\text{down}}}\cdot\mathbb{E}_{N_{t}}[l_{t}(\theta_{t})-l_{t}(\theta^{*})]. (17)

With Equation (17), we may use the regret of the likelihood functions as a surrogate for the pricing regret. From Lemma 7 we see that the log-likelihood function is \frac{C_{\text{down}}}{C_{\text{exp}}}-exponentially concave (a function f(\mu) is \alpha-exponentially concave iff \nabla^{2}f(\mu)\succeq\alpha\nabla f(\mu)\nabla f(\mu)^{\top}). This enables an application of the Online Newton Step method to achieve a logarithmic regret. Therefore, citing a result from online convex optimization (Hazan, 2016), we have the following lemma.

Lemma 11 (Online Newton Step).

With parameters \gamma=\frac{1}{2}\min\{\frac{1}{4GD},\alpha\} and \epsilon=\frac{1}{\gamma^{2}D^{2}}, and T>4, Algorithm 2 guarantees:

\sup_{\{x_{t}\}}\left\{\sum_{t=1}^{T}l_{t}(\theta_{t})-\min_{\theta\in\mathbb{H}}\sum_{t=1}^{T}l_{t}(\theta)\right\}\leq 5\left(\frac{1}{\alpha}+GD\right)d\log T.

Here \alpha=\frac{C_{\text{down}}}{C_{\text{exp}}}, D=2\cdot B_{1} and G=\sqrt{C_{\text{exp}}}\cdot B_{2}.

With Equation (17) and Lemma 11, we have:

Reg=\sum_{t=1}^{T}\left(g(J(u_{t}^{*}),u_{t}^{*})-\mathbb{E}_{N_{1},N_{2},\ldots,N_{t-1}}[g(J(u_{t}),u_{t}^{*})]\right)\leq\frac{2\cdot C}{C_{\text{down}}}\cdot 5\left(\frac{1}{\alpha}+GD\right)d\log T. (18)

Therefore, we have proved Theorem 10. ∎

5.3 Lower Bound for Unknown Distribution

In this part, we examine Assumption 1 and prove that an \Omega(\sqrt{T}) lower regret bound is unavoidable under even a slight relaxation: a Gaussian noise with unknown \sigma. Our proof is inspired by Theorem 3.1 of Broder and Rusmevichientong (2012), while our lower bound relies on more specific assumptions (and thus applies to more general cases).

We first state Assumption 5 covering this part, and then state Theorem 12 as a lower bound:

Assumption 5.

The noise N_{t}\sim\mathcal{N}(0,\sigma^{2}) independently, where 0<\sigma\leq 1 is fixed and unknown.

Theorem 12 (Lower bound with unknown \sigma).

Under Assumptions 2, 3, 4 and 5, for any policy (algorithm) \Psi:\mathbb{R}^{d}\times\left(\mathbb{R}^{d},\mathbb{R},\{0,1\}\right)^{t-1}\rightarrow\mathbb{R}^{+} and any T>2, there exist a Gaussian parameter \sigma\in\mathbb{R}^{+}, a distribution \mathbb{D} of features and a fixed parameter \theta^{*} such that Reg_{\Psi}\geq{\frac{1}{24000}}\cdot\sqrt{T}.

Remark: Here we assume the x_{t} to be i.i.d., which also implies applicability to adversarial features. However, the minimax regret of the stochastic feature setting is \Theta(\sqrt{T}) (Javanmard and Nazerzadeh, 2019), while existing results have not yet closed the gap in the adversarial feature setting.

Proof sketch of Theorem 12. Here we assume a fixed expected valuation, i.e., u^{*}=x_{t}^{\top}\theta^{*} for all t=1,2,\ldots; equivalently, we assume a fixed feature. The main idea of the proof is similar to that in Broder and Rusmevichientong (2012): we take \sigma_{1}=1 and \sigma_{2}=1-T^{-\frac{1}{4}}, and we prove that (1) it is costly for an algorithm to perform well in both cases if the \sigma's differ by a lot, and (2) it is costly for an algorithm to distinguish the two cases if the \sigma's are close to each other. The detailed proof is in Appendix B.3.

6 Numerical Result

(a) Stochastic feature. (b) Adversarial feature.
Figure 1: The regret of EMLP, ONSP and EXP-4 on simulated examples (we only run EXP-4 up to T=2^{12} due to its exponential running time), with Figure 1(a) for stochastic features and Figure 1(b) for adversarial ones. The plots are in log-log scale with all regrets divided by a \log(t) factor to show the convergence. For EXP-4, we discretize the parameter space with T^{-\frac{1}{3}}-size grids, which would incur an \tilde{O}(T^{\frac{2}{3}}) regret according to Cohen et al. (2020). We also plot linear fits for some regret curves, where a slope-\alpha line indicates an O(T^{\alpha}) regret. Besides, we draw error bars and bands with 0.95 coverage using Wald's test. The two diagrams reveal (i) logarithmic regrets of EMLP and ONSP in the stochastic setting, (ii) a nearly linear regret of EMLP in the adversarial setting, and (iii) O(T^{\frac{2}{3}}) regrets of EXP-4 in both settings.

In this section, we conduct numerical experiments to validate EMLP and ONSP. For comparison with existing work, we implement a discretized EXP-4 (Auer et al., 2002) algorithm for pricing, as introduced in Cohen et al. (2020) (in a slightly different setting). We test these three algorithms in both the stochastic and the adversarial setting.

Throughout, we set d=2, B_{1}=B_{2}=B=1 and N_{t}\sim\mathcal{N}(0,\sigma^{2}) with \sigma=0.25. In both settings, we run EMLP and ONSP for T=2^{16} rounds. For ONSP, we empirically select \gamma and \epsilon to accelerate convergence, instead of using the values specified in Lemma 11. Since EXP-4 consumes exponential time and requires knowledge of T in advance to discretize the policy and valuation spaces, we execute EXP-4 for a series of horizons T=2^{k}, k=1,2,\ldots,12. We repeat every experiment 5 times for each setting and then take the average.

Stochastic Setting. We implement and test EMLP, ONSP and EXP-4 with stochastic \{x_{t}\}'s. The numerical results are shown in Figure 1(a) on a log-log diagram, with the regrets divided by \log(t). It shows \log(t)-convergence for EMLP and ONSP, while EXP-4 grows at a t^{\alpha} rate with \alpha\approx 0.699.

Adversarial Setting. We implement the three algorithms and test them on an adversarial sequence \{x_{t}\}: for the k-th epoch, i.e., t=2^{k-1},2^{k-1}+1,\ldots,2^{k}-1, we let x_{t}=[1,0]^{\top} if k\equiv 1\ (\mathrm{mod}\ 2) and x_{t}=[0,1]^{\top} if k\equiv 0\ (\mathrm{mod}\ 2); a sketch of this sequence is given below. The numerical results are shown in Figure 1(b) on a log-log diagram, with the regrets divided by \log(t). The log-log plots of ONSP and EXP-4 are almost the same as those in Figure 1(a). However, EMLP shows an almost linear regret (a t^{\alpha} rate with \alpha\approx 0.912) in this adversarial setting. This is because the adversarial series only trains one dimension of \theta in each epoch, while the other is arbitrarily initialized and does not necessarily converge; in the next epoch, however, the incorrect dimension is exploited. Hence the linear regret.
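A minimal sketch of the adversarial feature sequence just described: epoch k covers rounds t=2^{k-1},\ldots,2^{k}-1 and uses the first standard basis vector for odd k and the second for even k.

```python
# Generate the alternating adversarial feature sequence used in Figure 1(b).
import numpy as np

def adversarial_features(T):
    xs, t, k = [], 1, 1
    while t <= T:
        x = np.array([1.0, 0.0]) if k % 2 == 1 else np.array([0.0, 1.0])
        epoch_len = 2 ** (k - 1)
        xs.extend([x] * min(epoch_len, T - t + 1))
        t += epoch_len
        k += 1
    return xs   # xs[t-1] is the feature revealed at round t
```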

7 Discussion

Here we discuss the coefficients in our regret bounds as a potential direction for future work. In Appendix C we discuss more on algorithmic design, problem modeling, and ethical issues.

Coefficients on Regret Bounds. The exact regret bounds of both EMLP and ONSP contain a constant \frac{C_{\text{exp}}}{C_{\text{down}}} that highly depends on the noise CDF F and could be large. A detailed analysis in Appendix C.1 shows that \frac{C_{\text{exp}}}{C_{\text{down}}} is exponentially large w.r.t. \frac{B}{\sigma} (see Equation 39 and Lemma 21) for Gaussian noise \mathcal{N}(0,\sigma^{2}), which implies that a smaller noise variance leads to a (much) larger regret bound. This is counter-intuitive, as a larger noise usually leads to a more complicated situation, but similar phenomena also occur in existing algorithms suited to constant-variance noise, such as RMLP in Javanmard and Nazerzadeh (2019) and OORMLP in Wang et al. (2020). In fact, it is because a (constantly) large noise helps explore the unknown parameter \theta^{*} and smooths the expected regret. In this work, this can be addressed by increasing T, since we are mainly concerned with the asymptotic regret as T\rightarrow\infty with a fixed noise distribution. However, we admit that it is indeed a nontrivial issue for finite T and small \sigma. The "ShallowPricing" method in Cohen et al. (2020) can deal with a very-small-variance noise setting (when \sigma=\tilde{O}(\frac{1}{T})) and achieve a logarithmic regret. Specifically, its regret bound decreases as the noise variance \sigma decreases (but still does not reach O(\log\log{T}) as the noise vanishes). We might also apply this method as a preprocessing step to cut the parameter domain and decrease \frac{B}{\sigma} within logarithmically many trials (see Cohen et al. (2020), Thm. 3), but it remains open whether a \log(T) regret is achievable when \sigma=\Theta(T^{-\alpha}) for \alpha\in(0,1).

8 Conclusion

In this work, we studied the problem of online feature-based dynamic pricing with a noisy linear valuation in both stochastic and adversarial settings. We proposed a max-likelihood-estimate-based algorithm (EMLP) for stochastic features and an online-Newton-step-based algorithm (ONSP) for adversarial features. Both enjoy a regret guarantee of O(d\log{T}), which attains the information-theoretic limit up to a constant factor. Compared with existing works, EMLP gets rid of strong assumptions on the distribution of the feature vectors in the stochastic setting, and ONSP improves the regret bound exponentially, from O(T^{2/3}) to O(\log{T}), in the adversarial setting. We also showed that knowing the noise distribution (or the demand curve) is required to obtain logarithmic regret, by proving a lower bound of \Omega(\sqrt{T}) on the regret for the case when the noise is known to be Gaussian but with an unknown \sigma. In addition, we conducted numerical experiments to empirically validate the scaling of our algorithms. Finally, we discussed the dependence of the regret on the noise variance, and proposed a subtle open problem for further study.

Acknowledgments

The work is partially supported by the Adobe Data Science Award and a start-up grant from the UCSB Department of Computer Science. We appreciate the input from anonymous reviewers and AC as well as a discussion with Akshay Krishnamurthy for clarifying some details of Krishnamurthy et al. (2021).

References

  • Agarwal et al. [2014] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning (ICML-14), pages 1638–1646, 2014.
  • Aldrich et al. [1984] J. H. Aldrich, F. D. Nelson, and E. S. Adler. Linear Probability, Logit, and Probit Models. Number 45. Sage, 1984.
  • Amin et al. [2014] K. Amin, A. Rostamizadeh, and U. Syed. Repeated contextual auctions with strategic buyers. In Advances in Neural Information Processing Systems (NIPS-14), pages 622–630, 2014.
  • Araman and Caldentey [2009] V. F. Araman and R. Caldentey. Dynamic pricing for nonperishable products with demand learning. Operations Research, 57(5):1169–1188, 2009.
  • Auer et al. [2002] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
  • Aydin and Ziya [2009] G. Aydin and S. Ziya. Personalized dynamic pricing of limited inventories. Operations Research, 57(6):1523–1531, 2009.
  • Besanko et al. [1998] D. Besanko, S. Gupta, and D. Jain. Logit demand estimation under competitive pricing behavior: An equilibrium framework. Management Science, 44(11-part-1):1533–1547, 1998.
  • Besanko et al. [2003] D. Besanko, J.-P. Dubé, and S. Gupta. Competitive price discrimination strategies in a vertical channel using aggregate retail data. Management Science, 49(9):1121–1138, 2003.
  • Besbes and Zeevi [2015] O. Besbes and A. Zeevi. On the (surprising) sufficiency of linear models for dynamic pricing with demand learning. Management Science, 61(4):723–739, 2015.
  • Broder and Rusmevichientong [2012] J. Broder and P. Rusmevichientong. Dynamic pricing under a general parametric choice model. Operations Research, 60(4):965–980, 2012.
  • Chan et al. [2009] T. Chan, V. Kadiyali, and P. Xiao. Structural models of pricing. Handbook of pricing research in marketing, pages 108–131, 2009.
  • Chen and Gallego [2021] N. Chen and G. Gallego. A primal-dual learning algorithm for personalized dynamic pricing with an inventory constraint. Mathematics of Operations Research, 2021.
  • Chen and Farias [2013] Y. Chen and V. F. Farias. Simple policies for dynamic pricing with imperfect forecasts. Operations Research, 61(3):612–624, 2013.
  • Chu et al. [2011] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics (AISTATS-11), pages 208–214, 2011.
  • Cohen et al. [2020] M. C. Cohen, I. Lobel, and R. Paes Leme. Feature-based dynamic pricing. Management Science, 66(11):4921–4943, 2020.
  • Cournot [1897] A. A. Cournot. Researches into the Mathematical Principles of the Theory of Wealth. Macmillan, 1897.
  • den Boer [2015] A. V. den Boer. Dynamic pricing and learning: historical origins, current research, and new directions. Surveys in Operations Research and Management Science, 20(1):1–18, 2015.
  • Draganska and Jain [2006] M. Draganska and D. C. Jain. Consumer preferences and product-line pricing strategies: An empirical analysis. Marketing science, 25(2):164–174, 2006.
  • Evans [1924] G. C. Evans. The dynamics of monopoly. The American Mathematical Monthly, 31(2):77–83, 1924.
  • Hazan [2016] E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
  • Iyengar et al. [2007] R. Iyengar, A. Ansari, and S. Gupta. A model of consumer learning for service quality and usage. Journal of Marketing Research, 44(4):529–544, 2007.
  • Javanmard and Nazerzadeh [2019] A. Javanmard and H. Nazerzadeh. Dynamic pricing in high-dimensions. The Journal of Machine Learning Research, 20(1):315–363, 2019.
  • Joskow and Wolfram [2012] P. L. Joskow and C. D. Wolfram. Dynamic pricing of electricity. American Economic Review, 102(3):381–85, 2012.
  • Kadiyali et al. [1996] V. Kadiyali, N. J. Vilcassim, and P. K. Chintagunta. Empirical analysis of competitive product line pricing decisions: Lead, follow, or move together? Journal of Business, pages 459–487, 1996.
  • Keskin and Zeevi [2014] N. B. Keskin and A. Zeevi. Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies. Operations Research, 62(5):1142–1167, 2014.
  • Kincaid and Darling [1963] W. Kincaid and D. Darling. An inventory pricing problem. Journal of Mathematical Analysis and Applications, 7:183–208, 1963.
  • Kleinberg and Leighton [2003] R. Kleinberg and T. Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In IEEE Symposium on Foundations of Computer Science (FOCS-03), pages 594–605. IEEE, 2003.
  • Koren and Levy [2015] T. Koren and K. Levy. Fast rates for exp-concave empirical risk minimization. In Advances in Neural Information Processing Systems (NIPS-15), pages 1477–1485, 2015.
  • Krämer et al. [2018] A. Krämer, M. Friesen, and T. Shelton. Are airline passengers ready for personalized dynamic pricing? a study of german consumers. Journal of Revenue and Pricing Management, 17(2):115–120, 2018.
  • Krishnamurthy et al. [2021] A. Krishnamurthy, T. Lykouris, C. Podimata, and R. Schapire. Contextual search in the presence of irrational agents. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (STOC-21), pages 910–918, 2021.
  • Lambrecht et al. [2007] A. Lambrecht, K. Seim, and B. Skiera. Does uncertainty matter? consumer behavior under three-part tariffs. Marketing Science, 26(5):698–710, 2007.
  • Langford and Zhang [2007] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems (NIPS-07), pages 817–824, 2007.
  • Leme and Schneider [2018] R. P. Leme and J. Schneider. Contextual search via intrinsic volumes. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS-18), pages 268–282. IEEE, 2018.
  • Liu et al. [2021] A. Liu, R. P. Leme, and J. Schneider. Optimal contextual pricing and extensions. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA-21), pages 1059–1078. SIAM, 2021.
  • Lobel et al. [2018] I. Lobel, R. P. Leme, and A. Vladu. Multidimensional binary search for contextual decision-making. Operations Research, 66(5):1346–1361, 2018.
  • Mazumdar et al. [2005] T. Mazumdar, S. P. Raj, and I. Sinha. Reference price research: Review and propositions. Journal of Marketing, 69(4):84–102, 2005.
  • Mourtada [2019] J. Mourtada. Exact minimax risk for linear least squares, and the lower tail of sample covariance matrices. arXiv preprint arXiv:1912.10754, 2019.
  • Murphy [2012] K. P. Murphy. Machine Learning: a Probabilistic Perspective. MIT press, 2012.
  • Qiang and Bayati [2016] S. Qiang and M. Bayati. Dynamic pricing with demand covariates. arXiv preprint arXiv:1604.07463, 2016.
  • Schultz et al. [1938] H. Schultz et al. Theory and Measurement of Demand. The University of Chicago Press, 1938.
  • Wang et al. [2020] C.-H. Wang, Z. Wang, W. W. Sun, and G. Cheng. Online regularization for high-dimensional dynamic pricing algorithms. arXiv preprint arXiv:2007.02470, 2020.
  • Whittle [1980] P. Whittle. Multi-armed bandits and the gittins index. Journal of the Royal Statistical Society: Series B (Methodological), 42(2):143–149, 1980.
  • Wright [1995] R. E. Wright. Logistic regression. 1995.

APPENDIX

Appendix A Other related works

Here we briefly review the history and recent studies related to our work. For the historical introduction, we mainly refer to the survey of den Boer [2015]. For bandit approaches, we review works that apply bandit algorithms to pricing problems. For structural models, we introduce different modules based on the review in Chan et al. [2009]. These existing works help put our problem setting and methodology in perspective.

A.1 History of Pricing

It was the work of Cournot [1897] in 1897 that first applied mathematics to analyze the relationship between prices and demands. In that work, the price was denoted as p and the demand was defined as a demand function F(p). Therefore, the revenue could be written as pF(p). This was a straightforward interpretation of the general pricing problem, and the key to solving it was the estimation of F(p) for different products. Later, in 1938, Schultz et al. [1938] proposed price-demand measurements on exclusive kinds of products. It is worth mentioning that these problems are "static pricing" ones, because F is totally determined by the price p and we only need to stick to the optimal price to maximize profits.

However, the static settings were challenged by the following two observations: on the one hand, a demand function may not only depend on the static value of p, but may also be affected by the trend of p's changes [Evans, 1924, Mazumdar et al., 2005]; on the other hand, even if F(p) is static, p itself might change over time according to other factors such as the inventory level [Kincaid and Darling, 1963]. As a result, it is necessary to consider dynamics in both demand and price, which leads to the "dynamic pricing" problem setting.

A.2 Dynamic Pricing as Bandits

As discussed in Section 2, the pricing problem can be viewed as a stochastic contextual bandits problem [see, e.g., Langford and Zhang, 2007, Agarwal et al., 2014]. Even though we may not know the form of the demand function, we can observe feedback on demand, i.e., how many products are sold, which enables us to learn a better decision-making policy. Therefore, it can be studied in a bandit framework. If the demand function is totally agnostic, i.e., the valuations (the highest prices that customers would accept) come at random or even adversarially over time, then it can be modeled exactly as a multi-armed bandit (MAB) problem [Whittle, 1980]. In our paper, instead, we focus on selling different products with a great variety of features. This can be characterized as a contextual bandit (CB) problem [Auer et al., 2002, Langford and Zhang, 2007]. The work of Cohen et al. [2020], which applies the "EXP-4" algorithm from Auer et al. [2002], also mentions that "the arms represent prices and the payoffs from the different arms are correlated since the measures of demand evaluated at different price points are correlated random variables". A variety of existing works, including Kleinberg and Leighton [2003], Araman and Caldentey [2009], Chen and Farias [2013], Keskin and Zeevi [2014], Besbes and Zeevi [2015], have approached the demand function from the perspective of either parametric or non-parametric bandits.

However, our problem setting differs from a contextual bandits setting in at least two respects: feedback and regret. The pricing problem has a specially structured feedback that sits between the full-information and bandit settings. Specifically, r_{t}>0 implies that all policies producing v<v_{t} would have ended up receiving r^{\prime}_{t}=v, and r_{t}=0 implies that all policies producing v>v_{t} would have ended up receiving r^{\prime}_{t}=0. However, the missing patterns are confounded with the rewards, so it is non-trivial to leverage this structure to improve the importance sampling approach underlying the algorithm of Agarwal et al. [2014]. We instead consider the natural analog of the linear contextual bandits setting [Chu et al., 2011] (but do notice that our expected reward is not linear, even if the valuation function is) and demonstrate that in this case an exponential improvement in the regret is possible using the additional information from the censored feedback. As for regret, while in contextual bandits it refers to a comparison with the optimal policy, here it refers to a comparison with the optimal action. In other words, though our approaches (both EMLP and ONSP) are finding the true parameter \theta^{*}, the regret is defined as the "revenue gap" between the optimal price and our proposed prices. These are actually equivalent in our fully parametric setting (where we assume a linear-valuation, known-noise model), but differ a lot in partially parametric and totally agnostic settings.

A.3 Structural Model

While a totally agnostic model guarantees the most generality, a structural model helps us better understand the mechanism behind the observed prices and demands. The key to a structural pricing model is the behavior of agents in the market, including customers and/or firms. In other words, the behavior of each side can be described as a decision model. From the perspective of demand (customers), Kadiyali et al. [1996] adopts a linear model for the laundry detergent market, and Iyengar et al. [2007] and Lambrecht et al. [2007] study three-part-tariff pricing problems on wireless and internet services with mixed logit models. Besanko et al. assumed an aggregate logit model on customers in Besanko et al. [1998] and Besanko et al. [2003] in order to study the competitive behavior of manufacturers in the ketchup market. Meanwhile, the supply side is usually assumed to be more strategic, e.g., exhibiting Bertrand-Nash behaviors [Kadiyali et al., 1996, Besanko et al., 1998, Draganska and Jain, 2006]. For more details, please see Chan et al. [2009].

Appendix B Proofs

B.1 Proof of Lemma 2

Proof.

Since v^{*}=\mathop{\mathrm{argmax}}_{v}g(v,u), we have:

\frac{\partial g(v,u)}{\partial v}\Big|_{v=v^{*}}=0
\Leftrightarrow 1-F(v^{*}-u)-v^{*}\cdot f(v^{*}-u)=0
\Leftrightarrow \frac{1-F(v^{*}-u)}{f(v^{*}-u)}-(v^{*}-u)=u

Define \varphi(\omega)=\frac{1-F(\omega)}{f(\omega)}-\omega, and take derivatives:

\varphi^{\prime}(\omega)=\frac{-f^{2}(\omega)-(1-F(\omega))f^{\prime}(\omega)}{f^{2}(\omega)}-1=\frac{\text{d}^{2}\log(1-F(\omega))}{\text{d}\omega^{2}}\cdot\frac{(1-F(\omega))^{2}}{(f(\omega))^{2}}-1<-1,

where the last inequality comes from the strict log-concavity of (1-F(\omega)). Therefore, \varphi(\omega) is decreasing and \varphi(+\infty)=-\infty. Also, noticing \varphi(-\infty)=+\infty, we know that for any u\in\mathbb{R}, there exists an \omega such that \varphi(\omega)=u. For u\geq 0, we know that g(v,u)\geq 0 for v\geq 0 and g(v,u)<0 for v<0. Therefore, v^{*}\geq 0 if u\geq 0. ∎

B.2 Proofs in Section 5.1

B.2.1 Proof of Lemma 5

Proof.

We again define φ(ω)=1F(ω)f(ω)ω\varphi(\omega)=\frac{1-F(\omega)}{f(\omega)}-\omega as in Appendix B.1. According to Equation 5, we have:

g(v,u)v|v=J(u)=0\displaystyle\frac{\partial g(v,u)}{\partial v}|_{v=J(u)}=0 (19)
\displaystyle\Rightarrow 1F(J(u)u)J(u)f(J(u)u)=0\displaystyle 1-F(J(u)-u)-J(u)\cdot f(J(u)-u)=0
\displaystyle\Rightarrow φ(J(u)u)=u\displaystyle\varphi(J(u)-u)=u
\displaystyle\Rightarrow J(u)=u+φ1(u)\displaystyle J(u)=u+\varphi^{-1}(u)
\displaystyle\Rightarrow J(u)=1+1φ(φ1(u)).\displaystyle J^{\prime}(u)=1+\frac{1}{\varphi^{\prime}(\varphi^{-1}(u))}.

The last line of Equation 19 follows from the implicit function theorem. From the result in Appendix B.1, we know that \varphi^{\prime}(\omega)<-1 for all \omega\in\mathbb{R}. Therefore, J^{\prime}(u)\in(0,1) for all u\in\mathbb{R}, and hence 0\leq J(u)<u+J(0) for u\geq 0. Since u\in[0,B], we may assume that v\in[0,B+J(0)] without loss of generality. We will use this range frequently in what follows.
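As an illustrative aside (not part of the proof), the following sketch computes J(u)=u+\varphi^{-1}(u) for standard Gaussian noise by one-dimensional root finding and numerically confirms J(0)>0 and J^{\prime}(u)\in(0,1); the bracketing interval and the grid are assumptions of the demo.

```python
# Compute the pricing function J(u) = u + phi^{-1}(u) for standard Gaussian noise.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def phi(w):
    return norm.sf(w) / norm.pdf(w) - w

def J(u):
    # phi is strictly decreasing, so phi(w) = u has a unique root w = phi^{-1}(u)
    w = brentq(lambda t: phi(t) - u, -20.0, 20.0)
    return u + w

us = np.linspace(0.0, 5.0, 51)
Js = np.array([J(u) for u in us])
dJ = np.diff(Js) / np.diff(us)
print(J(0.0))                            # J(0) > 0
assert np.all((dJ > 0) & (dJ < 1))       # J'(u) in (0, 1)
assert np.all(Js <= us + J(0.0) + 1e-8)  # J(u) < u + J(0), equality only at u = 0
```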

Denote u:=xtθ,u=xtθu:=x_{t}^{\top}\theta,u^{*}=x_{t}^{\top}\theta^{*}. According to Equation 7, we know that:

Regt(θ)\displaystyle Reg_{t}(\theta) =g(J(u),u)g(J(u),u)\displaystyle=g(J(u^{*}),u^{*})-g(J(u),u^{*})
=-\frac{\partial g(v,u^{*})}{\partial v}\Big|_{v=J(u^{*})}\left(J(u)-J(u^{*})\right)+\frac{1}{2}\left(-\frac{\partial^{2}g(v,u^{*})}{\partial v^{2}}\Big|_{v=\tilde{v}}\right)(J(u^{*})-J(u))^{2}
0+12maxv~[0,B+J(0)](2g(v,u)v2|v=v~)(J(u)J(u))2\displaystyle\leq 0+\frac{1}{2}\max_{\tilde{v}\in[0,B+J(0)]}\left(-\frac{\partial^{2}g(v,u^{*})}{\partial v^{2}}|_{v=\tilde{v}}\right)\cdot(J(u^{*})-J(u))^{2}
=12maxv~[0,B+J(0)](2f(v~u)+v~f(v~u))(J(u)J(u))2\displaystyle=\frac{1}{2}\max_{\tilde{v}\in[0,B+J(0)]}\left(2f(\tilde{v}-u^{*})+\tilde{v}\cdot{f^{\prime}}(\tilde{v}-u^{*})\right)\cdot(J(u^{*})-J(u))^{2}
12(2Bf+(B+J(0))Bf)(J(u)J(u))2\displaystyle\leq\frac{1}{2}(2B_{f}+(B+J(0))\cdot B_{f^{\prime}})(J(u^{*})-J(u))^{2}
12(2Bf+(B+J(0))Bf)(uu)2\displaystyle\leq\frac{1}{2}(2B_{f}+(B+J(0))\cdot B_{f^{\prime}})(u^{*}-u)^{2}
=12(2Bf+(B+J(0))Bf)(θθ)xtxt(θθ).\displaystyle=\frac{1}{2}(2B_{f}+(B+J(0))\cdot B_{f^{\prime}})(\theta^{*}-\theta)^{\top}x_{t}x_{t}^{\top}(\theta^{*}-\theta).

Here the first line is from the definitions of g and Reg_{t}(\theta), the second line is Taylor's expansion, the third line uses the fact that J(u^{*}) maximizes g(v,u^{*}) with respect to v (so the first-order term vanishes), the fourth line is by direct calculation, the fifth line is from the assumptions 0<f(\omega)\leq B_{f} and |f^{\prime}(\omega)|\leq B_{f^{\prime}} together with \tilde{v}\in[0,B+J(0)], the sixth line is because J^{\prime}(u)\in(0,1) for all u\in\mathbb{R}, and the seventh line is from the definitions of u and u^{*}. ∎
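The quadratic bound just derived can be spot-checked numerically; the sketch below does so for standard Gaussian noise with the illustrative choice B=2 (here B_{f}=f(0) and B_{f^{\prime}}=|f^{\prime}(1)| are the Gaussian-specific constants, and everything else is an assumption of the demo).

```python
# Spot check of the quadratic regret bound in Lemma 5 for standard Gaussian noise.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

B = 2.0
B_f = norm.pdf(0.0)            # max of the Gaussian pdf
B_fp = 1.0 * norm.pdf(1.0)     # max of |f'(w)| = |w| f(w), attained at |w| = 1

def phi(w):
    return norm.sf(w) / norm.pdf(w) - w

def J(u):
    return u + brentq(lambda t: phi(t) - u, -20.0, 20.0)

def g(v, u):
    return v * norm.sf(v - u)  # expected reward v * (1 - F(v - u))

C = 0.5 * (2 * B_f + (B + J(0.0)) * B_fp)
for u_star in np.linspace(0.0, B, 9):
    for u in np.linspace(0.0, B, 9):
        reg = g(J(u_star), u_star) - g(J(u), u_star)
        assert reg <= C * (u_star - u) ** 2 + 1e-10
```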

B.2.2 Proof of Lemma 7

Proof.

We take derivatives of lt(θ)l_{t}(\theta), and we get:

lt(θ)=\displaystyle l_{t}(\theta)= 𝟙t(log(1F(vtxtθ)))+(1𝟙t)(log(F(vtxtθ)))\displaystyle\mathbbm{1}_{t}\left(-\log(1-F(v_{t}-x_{t}^{\top}\theta))\right)+(1-\mathbbm{1}_{t})\left(-\log(F(v_{t}-x_{t}^{\top}\theta))\right) (20)
lt(θ)=\displaystyle\nabla l_{t}(\theta)= 𝟙t(f(vtxtθ)1F(vtxtθ))xt+(1𝟙t)(f(vtxtθ)F(vtxtθ))xt\displaystyle\mathbbm{1}_{t}\left(-\frac{f(v_{t}-x_{t}^{\top}\theta)}{1-F(v_{t}-x_{t}^{\top}\theta)}\right)\cdot x_{t}+(1-\mathbbm{1}_{t})\left(\frac{f(v_{t}-x_{t}^{\top}\theta)}{F(v_{t}-x_{t}^{\top}\theta)}\right)\cdot x_{t}
2lt(θ)=\displaystyle\nabla^{2}l_{t}(\theta)= 𝟙tf(vtxtθ)2+f(vtxtθ)(1F(vtxtθ))(1F(vtxtθ))2xtxt\displaystyle\mathbbm{1}_{t}\cdot\frac{f(v_{t}-x_{t}^{\top}\theta)^{2}+f^{\prime}(v_{t}-x_{t}^{\top}\theta)\cdot(1-F(v_{t}-x_{t}^{\top}\theta))}{(1-F(v_{t}-x_{t}^{\top}\theta))^{2}}\cdot{x_{t}x_{t}^{\top}}
+(1𝟙t)f(vtxtθ)2f(vtxtθ)F(vtxtθ)F(vtxtθ)2xtxt\displaystyle+(1-\mathbbm{1}_{t})\cdot\frac{f(v_{t}-x_{t}^{\top}\theta)^{2}-f^{\prime}(v_{t}-x_{t}^{\top}\theta)F(v_{t}-x_{t}^{\top}\theta)}{F(v_{t}-x_{t}^{\top}\theta)^{2}}\cdot{x_{t}x_{t}^{\top}}
=\displaystyle= 𝟙td2log(1F(ω))dω2|ω=vtxtθxtxt+(1𝟙t)d2log(F(ω))dω2|ω=vtxtθxtxt\displaystyle\mathbbm{1}_{t}\cdot\frac{-\text{d}^{2}\log(1-F(\omega))}{\text{d}\omega^{2}}|_{\omega=v_{t}-x_{t}^{\top}\theta}\cdot{x_{t}x_{t}^{\top}}+(1-\mathbbm{1}_{t})\frac{-\text{d}^{2}\log(F(\omega))}{\text{d}\omega^{2}}|_{\omega=v_{t}-x_{t}^{\top}\theta}\cdot{x_{t}x_{t}^{\top}}
\succeq\ \inf_{\omega\in[-B,B+J(0)]}\min\left\{-\frac{\text{d}^{2}\log(1-F(\omega))}{\text{d}\omega^{2}},\,-\frac{\text{d}^{2}\log(F(\omega))}{\text{d}\omega^{2}}\right\}\cdot x_{t}x_{t}^{\top}
=\displaystyle= Cdownxtxt,\displaystyle C_{\text{down}}x_{t}x_{t}^{\top},

which directly proves the first inequality. For the second inequality, just notice that

lt(θ)lt(θ)=\displaystyle\nabla l_{t}(\theta)\nabla l_{t}(\theta)^{\top}= 𝟙t(f(vtxtθ)1F(vtxtθ))2xtxt+(1𝟙t)(f(vtxtθ)F(vtxtθ))2xtxt\displaystyle\mathbbm{1}_{t}\left(\frac{f(v_{t}-x_{t}^{\top}\theta)}{1-F(v_{t}-x_{t}^{\top}\theta)}\right)^{2}x_{t}x_{t}^{\top}+(1-\mathbbm{1}_{t})\left(\frac{f(v_{t}-x_{t}^{\top}\theta)}{F(v_{t}-x_{t}^{\top}\theta)}\right)^{2}x_{t}x_{t}^{\top} (21)
\displaystyle\preceq supω[B,B+J(0)]max{(f(ω)F(ω))2,(f(ω)1F(ω))2}xtxt\displaystyle\sup_{\omega\in[-B,B+J(0)]}\max\{\left(\frac{f(\omega)}{F(\omega)}\right)^{2},\left(\frac{f(\omega)}{1-F(\omega)}\right)^{2}\}x_{t}x_{t}^{\top}
=\displaystyle= Cexpxtxt.\displaystyle C_{\text{exp}}x_{t}x_{t}^{\top}.

The only thing to point out is that \frac{f(\omega)}{F(\omega)} and \frac{f(\omega)}{1-F(\omega)} are both continuous on \omega\in[-B,B+J(0)], since F(\omega) is strictly increasing and thus 0<F(\omega)<1 for all \omega\in\mathbb{R}. ∎
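For concreteness, the sketch below evaluates the two constants of Lemma 7 numerically for standard Gaussian noise on an illustrative range \omega\in[-B,B+J(0)] (B=2 and J(0)\approx 0.75 are assumed values for the demo), using the identities -\frac{\text{d}^{2}}{\text{d}\omega^{2}}\log(1-F(\omega))=\lambda(\omega)(\lambda(\omega)-\omega) and -\frac{\text{d}^{2}}{\text{d}\omega^{2}}\log F(\omega)=\mu(\omega)(\mu(\omega)+\omega), where \lambda=f/(1-F) and \mu=f/F.

```python
# Numerical values of C_down and C_exp from Lemma 7 for standard Gaussian noise.
import numpy as np
from scipy.stats import norm

B, J0 = 2.0, 0.75
w = np.linspace(-B, B + J0, 4001)

lam = norm.pdf(w) / norm.sf(w)    # hazard rate f / (1 - F)
mu = norm.pdf(w) / norm.cdf(w)    # reversed hazard rate f / F

C_down = np.min(np.minimum(lam * (lam - w), mu * (mu + w)))
C_exp = np.max(np.maximum(lam ** 2, mu ** 2))
print(C_down, C_exp)   # both strictly positive and finite, as used in the proof
assert C_down > 0 and np.isfinite(C_exp)
```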

B.2.3 Proof of Lemma 9

Proof.

In the following part, we consider an epoch in which n\geq 2 rounds of pricing are conducted, generating negative log-likelihood functions l_{j}(\theta), j=1,2,\ldots,n. Define the "leave-one-out" negative log-likelihood function

L~i(θ)=1nj=1,jinlj(θ),\tilde{L}_{i}(\theta)=\frac{1}{n}\sum_{j=1,j\neq i}^{n}l_{j}(\theta),

and let

θ~i:=argminθL~i(θ).\tilde{\theta}_{i}:=\mathop{\arg\min}\limits_{\theta}\tilde{L}_{i}(\theta).

Based on this definition, we know that \tilde{\theta}_{i} is independent of l_{i}(\theta) given the historical data, and that the \tilde{\theta}_{i} are identically distributed for i=1,2,3,\ldots,n.

In the following part, we first state and prove the following inequality:

1ni=1n(li(θ~i)li(θ^))CexpCdowndn=O(dn),\frac{1}{n}\sum_{i=1}^{n}(l_{i}(\tilde{\theta}_{i})-l_{i}(\hat{\theta}))\leq\frac{C_{\text{exp}}}{C_{down}}\frac{d}{n}=O(\frac{d}{n}), (22)

where θ^\hat{\theta} is the short-hand notation of θ^k\hat{\theta}_{k} as we do not specify the epoch kk in this part. We now cite a lemma from Koren and Levy [2015]:

Lemma 13.

Let g_{1}, g_{2} be two convex functions defined over a closed and convex domain \mathcal{K}\subseteq\mathbb{R}^{d}, and let x_{1}=\arg\min_{x\in\mathcal{K}}g_{1}(x) and x_{2}=\arg\min_{x\in\mathcal{K}}g_{2}(x). Assume g_{2} is locally \delta-strongly-convex at x_{1} with respect to a norm ||\cdot||. Then, for h=g_{2}-g_{1} we have

x2x12δh(x1).||x_{2}-x_{1}||\leq\frac{2}{\delta}||\nabla h(x_{1})||_{*}.

Here ||||||\cdot||_{*} denotes a dual norm.

The following is a proof of this lemma.

Proof.

(of Lemma 13) According to convexity of g2g_{2}, we have:

g2(x1)g2(x2)+g2(x2)(x1x2).g_{2}(x_{1})\geq g_{2}(x_{2})+\nabla g_{2}(x_{2})^{\top}(x_{1}-x_{2}). (23)

According to strong convexity of g2g_{2} at x1x_{1}, we have:

g2(x2)g2(x1)+g2(x1)(x2x1)+δ2x2x12.g_{2}(x_{2})\geq g_{2}(x_{1})+\nabla g_{2}(x_{1})^{\top}(x_{2}-x_{1})+\frac{\delta}{2}||x_{2}-x_{1}||^{2}. (24)

Add Equation (23) and (24), and we have:

g2(x1)+g2(x2)g2(x2)+g2(x1)+(g2(x1)g2(x2))(x2x1)+δ2x2x12\displaystyle g_{2}(x_{1})+g_{2}(x_{2})\geq g_{2}(x_{2})+g_{2}(x_{1})+(\nabla g_{2}(x_{1})-\nabla g_{2}(x_{2}))^{\top}(x_{2}-x_{1})+\frac{\delta}{2}||x_{2}-x_{1}||^{2} (25)
\displaystyle\Leftrightarrow (g2(x1)g2(x2))(x1x2)δ2x1x22\displaystyle(\nabla g_{2}(x_{1})-\nabla g_{2}(x_{2}))^{\top}(x_{1}-x_{2})\geq\frac{\delta}{2}||x_{1}-x_{2}||^{2}
\displaystyle\Leftrightarrow (g1(x1)+h(x1)g2(x2))(x1x2)δ2x1x22\displaystyle(\nabla g_{1}(x_{1})+\nabla h(x_{1})-\nabla g_{2}(x_{2}))^{\top}(x_{1}-x_{2})\geq\frac{\delta}{2}||x_{1}-x_{2}||^{2}
\displaystyle\Leftrightarrow h(x1)(x1x2)δ2x1x22\displaystyle\nabla h(x_{1})^{\top}(x_{1}-x_{2})\geq\frac{\delta}{2}||x_{1}-x_{2}||^{2}
\displaystyle\Rightarrow h(x1)x1x2δ2x1x22\displaystyle||\nabla h(x_{1})||_{*}||x_{1}-x_{2}||\geq\frac{\delta}{2}||x_{1}-x_{2}||^{2}
\displaystyle\Rightarrow h(x1)δ2x1x2.\displaystyle||\nabla h(x_{1})||_{*}\geq\frac{\delta}{2}||x_{1}-x_{2}||.

The first step is trivial. The second step is a consequence of g_{2}=g_{1}+h. The third step is derived from the following two first-order optimality conditions: \nabla g_{1}(x_{1})^{\top}(x_{1}-x_{2})\leq 0 and \nabla g_{2}(x_{2})^{\top}(x_{2}-x_{1})\leq 0. The fourth step is derived from Hölder's inequality:

h(x1)x1x2h(x1)(x1x2).||\nabla h(x_{1})||_{*}||x_{1}-x_{2}||\geq\nabla h(x_{1})^{\top}(x_{1}-x_{2}).

Therefore, the lemma holds. ∎
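A toy numerical check of Lemma 13 (two simple quadratics over \mathcal{K}=\mathbb{R}^{d}; an illustrative instance rather than anything used in the paper):

```python
# Toy check of Lemma 13 with g1(x) = ||x - a||^2 and g2(x) = ||x - b||^2.
import numpy as np

rng = np.random.default_rng(0)
d = 5
a, b = rng.normal(size=d), rng.normal(size=d)

# g2 is delta-strongly convex with delta = 2; the unconstrained minimizers are a and b.
x1, x2 = a, b
grad_h_x1 = 2 * (x1 - b) - 2 * (x1 - a)   # gradient of h = g2 - g1 at x1
delta = 2.0
lhs = np.linalg.norm(x2 - x1)
rhs = (2.0 / delta) * np.linalg.norm(grad_h_x1)
print(lhs, rhs)
assert lhs <= rhs + 1e-12
```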

In the following part, we construct the strongly convex function g_{2}. Denote H=\sum_{t=1}^{n}x_{t}x_{t}^{\top}. From Lemma 7, we know that

2L^(θ)Cdown1nH.\nabla^{2}\hat{L}(\theta)\succeq C_{down}\frac{1}{n}H.

Here \hat{L}(\theta) is the short-hand notation of \hat{L}_{k}(\theta), as we do not specify k in this part. Since we do not know whether H is invertible, i.e., whether a norm can be induced by H, we cannot simply take g_{2}(\theta)=\hat{L}(\theta). Instead, we perform the following change of variables:

We first apply a singular value decomposition to H, i.e., H=U\Sigma U^{\top}, where U\in\mathbb{R}^{d\times r}, U^{\top}U=I_{r}, \Sigma=\text{diag}\{\lambda_{1},\lambda_{2},\ldots,\lambda_{r}\}\succ 0. After that, we introduce a new variable \eta:=U^{\top}\theta. Therefore, we have \theta=U\eta+V\epsilon, where V\in\mathbb{R}^{d\times(d-r)} with V^{\top}V=I_{d-r} and V^{\top}U=0 is an orthonormal basis of the null space of U^{\top} (the orthogonal complement of the column space of U), and \epsilon\in\mathbb{R}^{d-r}. Similarly, we define \tilde{\eta}_{i}=U^{\top}\tilde{\theta}_{i} and \hat{\eta}=U^{\top}\hat{\theta}. Based on these, we define the following functions:

fi(η)\displaystyle f_{i}(\eta) :=li(θ)=li(Uη+Vϵ)\displaystyle:=l_{i}(\theta)=l_{i}(U\eta+V\epsilon) (26)
F~i(η)\displaystyle\tilde{F}_{i}(\eta) :=L~i(θ)=L~i(Uη+Vϵ)\displaystyle:=\tilde{L}_{i}(\theta)=\tilde{L}_{i}(U\eta+V\epsilon)
F^(η)\displaystyle\hat{F}(\eta) :=L^(θ)=L^(Uη+Vϵ).\displaystyle:=\hat{L}(\theta)=\hat{L}(U\eta+V\epsilon).

Now we prove that F^(η)\hat{F}(\eta) is locally-strongly-convex. Similar to the proof of Lemma 7, we have:

2F^(η)=\displaystyle\nabla^{2}\hat{F}(\eta)= 1ni=1n2fi(η)\displaystyle\frac{1}{n}\sum_{i=1}^{n}\nabla^{2}f_{i}(\eta) (27)
=\displaystyle= 1ni=1n2li(xiθ)2(xiθη)(xiθη)\displaystyle\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^{2}l_{i}}{\partial(x_{i}^{\top}\theta)^{2}}(\frac{\partial x_{i}^{\top}\theta}{\partial\eta})(\frac{\partial x_{i}^{\top}\theta}{\partial\eta})^{\top}
=\displaystyle= 1ni=1n2li(xiθ)2(xi(Uη+Vϵ)η)(xi(Uη+Vϵ)η)\displaystyle\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^{2}l_{i}}{\partial(x_{i}^{\top}\theta)^{2}}(\frac{\partial x_{i}^{\top}(U\eta+V\epsilon)}{\partial\eta})(\frac{\partial x_{i}^{\top}(U\eta+V\epsilon)}{\partial\eta})^{\top}
=\displaystyle= 1ni=1n2li(xiθ)2(Uxi)(Uxi)\displaystyle\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^{2}l_{i}}{\partial(x_{i}^{\top}\theta)^{2}}(U^{\top}x_{i})(U^{\top}x_{i})^{\top}
\displaystyle\succeq 1ni=1nCdownUxixiU\displaystyle\frac{1}{n}\sum_{i=1}^{n}C_{down}U^{\top}x_{i}x_{i}^{\top}U
=\ \frac{1}{n}C_{\text{down}}U^{\top}\left(\sum_{i=1}^{n}x_{i}x_{i}^{\top}\right)U
=\displaystyle= 1nCdownUHU\displaystyle\frac{1}{n}C_{down}U^{\top}HU
=\displaystyle= 1nCdownUUΣUU\displaystyle\frac{1}{n}C_{down}U^{\top}U\Sigma U^{\top}U
=\displaystyle= 1nCdownΣ\displaystyle\frac{1}{n}C_{down}\Sigma
\displaystyle\succ 0\displaystyle 0

That is to say, F^(η)\hat{F}(\eta) is locally Cdownn\frac{C_{down}}{n}-strongly convex w.r.t Σ\Sigma at η\eta. Similarly, we can verify that F~i(η)\tilde{F}_{i}(\eta) is convex (not necessarily strongly convex). Therefore, according to Lemma 13, let g1(η)=F~i(η),g2(η)=F^(η)g_{1}(\eta)=\tilde{F}_{i}(\eta),g_{2}(\eta)=\hat{F}(\eta), and then x1=η~i=Uθ~ix_{1}=\tilde{\eta}_{i}=U^{\top}\tilde{\theta}_{i}, x2=η^=Uθ^x_{2}=\hat{\eta}=U^{\top}\hat{\theta}. Therefore, we have:

η^η~iΣ1Cdownfi(η~i)Σ.||\hat{\eta}-\tilde{\eta}_{i}||_{\Sigma}\leq\frac{1}{C_{down}}||\nabla f_{i}(\tilde{\eta}_{i})||_{\Sigma}^{*}. (28)

We can now bound each term l_{i}(\tilde{\theta}_{i})-l_{i}(\hat{\theta}) and thereby establish Inequality (22):

li(θ~i)li(θ^)=\displaystyle l_{i}(\tilde{\theta}_{i})-l_{i}(\hat{\theta})= fi(η~i)fi(η^)\displaystyle f_{i}(\tilde{\eta}_{i})-f_{i}(\hat{\eta}) (29)
convexity\displaystyle\underset{\mathclap{\overset{\uparrow}{\text{convexity}}}}{\leq} fi(η~i)(η~iη^)\displaystyle\nabla f_{i}(\tilde{\eta}_{i})^{\top}(\tilde{\eta}_{i}-\hat{\eta})
Holder inequality\displaystyle\underset{\mathclap{\overset{\uparrow}{\text{Holder\ inequality}}}}{\leq} fi(η~i)Ση~iη^Σ\displaystyle||\nabla f_{i}(\tilde{\eta}_{i})||^{*}_{\Sigma}||\tilde{\eta}_{i}-\hat{\eta}||_{\Sigma}
Lemma 13\displaystyle\underset{\mathclap{\overset{\uparrow}{\text{Lemma \ref{lemma_dual_norm_of_gradient_convex}}}}}{\leq} 1Cdown(fi(η~i)Σ)2.\displaystyle\frac{1}{C_{down}}(||\nabla f_{i}(\tilde{\eta}_{i})||^{*}_{\Sigma})^{2}.

And thus we have

\sum_{i=1}^{n}l_{i}(\tilde{\theta}_{i})-l_{i}(\hat{\theta})\ \leq\frac{1}{C_{\text{down}}}\sum_{i=1}^{n}\left(||\nabla f_{i}(\tilde{\eta}_{i})||^{*}_{\Sigma}\right)^{2} (30)
\leq\frac{1}{C_{\text{down}}}\sum_{i=1}^{n}C_{\text{exp}}\,x_{i}^{\top}U\Sigma^{-1}U^{\top}x_{i}
=\frac{C_{\text{exp}}}{C_{\text{down}}}\sum_{i=1}^{n}tr(x_{i}^{\top}U\Sigma^{-1}U^{\top}x_{i})
=\frac{C_{\text{exp}}}{C_{\text{down}}}\sum_{i=1}^{n}tr(U\Sigma^{-1}U^{\top}x_{i}x_{i}^{\top})
=\frac{C_{\text{exp}}}{C_{\text{down}}}tr\left(U\Sigma^{-1}U^{\top}\sum_{i=1}^{n}x_{i}x_{i}^{\top}\right)
=\frac{C_{\text{exp}}}{C_{\text{down}}}tr(U\Sigma^{-1}U^{\top}H)
=\frac{C_{\text{exp}}}{C_{\text{down}}}tr(U\Sigma^{-1}U^{\top}U\Sigma U^{\top})
=\frac{C_{\text{exp}}}{C_{\text{down}}}tr(UU^{\top})
=\frac{C_{\text{exp}}}{C_{\text{down}}}tr(U^{\top}U)
=\frac{C_{\text{exp}}}{C_{\text{down}}}tr(I_{r})
=\frac{C_{\text{exp}}}{C_{\text{down}}}r
\leq\frac{C_{\text{exp}}}{C_{\text{down}}}d,

where the second inequality holds because, by Lemma 7, (||\nabla f_{i}(\tilde{\eta}_{i})||^{*}_{\Sigma})^{2}=\nabla f_{i}(\tilde{\eta}_{i})^{\top}\Sigma^{-1}\nabla f_{i}(\tilde{\eta}_{i})\leq C_{\text{exp}}\,x_{i}^{\top}U\Sigma^{-1}U^{\top}x_{i}.

Thus Inequality (22) is proved. Based on it, we have:

𝔼h[L(θ~n)]L(θ)\displaystyle\mathbb{E}_{h}[L(\tilde{\theta}_{n})]-L(\theta^{*})
=\displaystyle= 𝔼h[L(θ~n)]𝔼h[L^(θ)]\displaystyle\mathbb{E}_{h}[L(\tilde{\theta}_{n})]-\mathbb{E}_{h}[\hat{L}(\theta^{*})]
\displaystyle\leq 𝔼h[L(θ~n)]𝔼h[L^(θ^)]\displaystyle\mathbb{E}_{h}[L(\tilde{\theta}_{n})]-\mathbb{E}_{h}[\hat{L}(\hat{\theta})]
=\displaystyle= 1ni=1n𝔼h[L(θ~i)]𝔼h[L^(θ^)]\displaystyle\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{h}[L(\tilde{\theta}_{i})]-\mathbb{E}_{h}[\hat{L}(\hat{\theta})]
=\displaystyle= 1ni=1n𝔼h[li(θ~i)]𝔼h[L^(θ^)]\displaystyle\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{h}[l_{i}(\tilde{\theta}_{i})]-\mathbb{E}_{h}[\hat{L}(\hat{\theta})]
=\ \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{h}[l_{i}(\tilde{\theta}_{i})]-\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{h}[l_{i}(\hat{\theta})]
=\displaystyle= 1ni=1n𝔼h[li(θ~i)li(θ^)]\displaystyle\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{h}[l_{i}(\tilde{\theta}_{i})-l_{i}(\hat{\theta})]
\displaystyle\leq CexpCdowndn\displaystyle\frac{C_{\text{exp}}}{C_{down}}\frac{d}{n}
=\displaystyle= O(dn)\displaystyle O(\frac{d}{n})

Thus we have proved that \mathbb{E}_{h}[L(\tilde{\theta}_{n})]-L(\theta^{*})\leq\frac{C_{\text{exp}}}{C_{\text{down}}}\cdot\frac{d}{n}. Notice that \tilde{\theta}_{n} is obtained by optimizing the leave-one-out likelihood function \tilde{L}_{n}(\theta)=\frac{1}{n}\sum_{j=1}^{n-1}l_{j}(\theta), which does not contain l_{n}(\theta), and that the expected likelihood function L(\theta) does not depend on any specific outcome of this round. That is to say, no term in this inequality depends on the last round (x_{n},v_{n},\mathbbm{1}_{n}) at all. In other words, the inequality remains valid if we only conduct this epoch from round 1 to round (n-1).

Now let n=τ+1n=\tau+1, and then we know that θ~τ+1=θ^\tilde{\theta}_{\tau+1}=\hat{\theta}. Therefore, the theorem holds. ∎

B.3 Proof of Lower bound in Section 5.3

Proof.

We assume a fixed u^{*} such that x^{\top}\theta^{*}=u^{*} for all x\in\mathbb{D}. In other words, we consider a context-free setting. Therefore, we can define a policy as \Psi:\{0,1\}^{t}\rightarrow\mathbb{R}^{+}, t=1,2,\ldots, that does not observe x_{t} at all. Before the proof begins, we first define some notation: we denote by \Phi_{\sigma}(\omega) and p_{\sigma}(\omega) the CDF and PDF of the Gaussian distribution \mathcal{N}(0,\sigma^{2}), and define the corresponding pricing function J_{\sigma}(u)=\arg\max_{v}v(1-\Phi_{\sigma}(v-u)).

Since we have proved that J^{\prime}(u)\in(0,1) for u\in\mathbb{R} in Appendix B.2.1, we have the following lemma:

Lemma 14.

uJσ(u)u-J_{\sigma}(u) monotonically increases as u(0,+),σ>0u\in(0,+\infty),\forall\sigma>0. Also, we know that Jσ(0)>0,σ>0J_{\sigma}(0)>0,\forall\sigma>0.

Now consider the following cases: σ1=1,σ2=1f(T)\sigma_{1}=1,\sigma_{2}=1-f(T), where limTf(T)=0,f(T)<0,0<f(T)<12\lim_{T\rightarrow\infty}f(T)=0,f^{\prime}(T)<0,0<f(T)<\frac{1}{2}. We will later determine the explicit form of f(T)f(T).

Suppose u^{*} satisfies J_{\sigma_{1}}(u^{*})=u^{*}. Solving this equation gives u^{*}=\sqrt{\frac{\pi}{2}}. Therefore, we have u\in(0,u^{*})\Leftrightarrow J_{1}(u)>u, and u\in(u^{*},+\infty)\Leftrightarrow J_{1}(u)<u. As a result, we have the following lemma.

Lemma 15.

For any σ(12,1)\sigma\in(\frac{1}{2},1), we have:

Jσ(u)(0,u)J_{\sigma}(u^{*})\in(0,u^{*}) (31)
Proof.

Firstly, we have:

Jσ(u)=\displaystyle J_{\sigma}(u)= argmaxvvΦσ(uv)\displaystyle\mathop{\arg\max}\limits_{v}v\Phi_{\sigma}(u-v)
=\displaystyle= argmaxvvΦ1(uvσ)\displaystyle\mathop{\arg\max}\limits_{v}v\Phi_{1}(\frac{u-v}{\sigma})
=\displaystyle= argmaxω=vσσωΦ1(uσω)\displaystyle\mathop{\arg\max}\limits_{\omega=\frac{v}{\sigma}}\sigma\omega\Phi_{1}(\frac{u}{\sigma}-\omega)
=\ \sigma\mathop{\arg\max}\limits_{\omega}\ \omega\Phi_{1}\!\left(\frac{u}{\sigma}-\omega\right)
=\displaystyle= σJ1(uσ).\displaystyle\sigma J_{1}(\frac{u}{\sigma}).

When σ(12,1)\sigma\in(\frac{1}{2},1), we know uσ>u\frac{u^{*}}{\sigma}>u^{*}. Since J1(u)=uJ_{1}(u^{*})=u^{*} and that u(u,+)J1(u)<uu\in(u^{*},+\infty)\Leftrightarrow J_{1}(u)<u, we have uσ>J1(uσ)\frac{u^{*}}{\sigma}>J_{1}(\frac{u^{*}}{\sigma}). Hence

u>σJ1(uσ)=Jσ(u).u^{*}>\sigma J_{1}(\frac{u^{*}}{\sigma})=J_{\sigma}(u^{*}). (32)

Therefore, without loss of generality, we assume that for the problem parameterized by \sigma_{2}, the price satisfies v\in(0,u^{*}). To be specific, let v^{*}(\sigma)=J_{\sigma}(u^{*}). Define \Psi_{t+1}:\{0,1\}^{t}\rightarrow(0,u^{*}) as any policy that proposes a price at time t+1, and define \Psi=\{\Psi_{1},\Psi_{2},\ldots,\Psi_{T-1},\Psi_{T}\}.

Define the sequence of prices as V=\{v_{1},v_{2},\ldots,v_{T-1},v_{T}\} and the sequence of decisions as \mathbbm{1}=\{\mathbbm{1}_{1},\mathbbm{1}_{2},\ldots,\mathbbm{1}_{T-1},\mathbbm{1}_{T}\}. Denote V^{t}=\{v_{1},v_{2},\ldots,v_{t}\}.

Define the probability (also the likelihood if we change uu^{*} to other parameter uu):

QTV,σ(𝟙)=t=1TΦσ(uvt)𝟙tΦσ(vtu)1𝟙t.Q^{V,\sigma}_{T}(\mathbbm{1})=\prod_{t=1}^{T}\Phi_{\sigma}(u^{*}-v_{t})^{\mathbbm{1}_{t}}\Phi_{\sigma}(v_{t}-u^{*})^{1-\mathbbm{1}_{t}}. (33)

Define a random variable Yt{0,1}t,YtQtVt,σY_{t}\in\{0,1\}^{t},Y_{t}\sim Q^{V^{t},\sigma}_{t} and one possible assignment
yt={𝟙1,𝟙2,,𝟙t1,𝟙t}y_{t}=\{\mathbbm{1}_{1},\mathbbm{1}_{2},\ldots,\mathbbm{1}_{t-1},\mathbbm{1}_{t}\} . For any price vv and any parameter σ\sigma, define the expected reward function as r(v,σ):=vΦσ(uv)r(v,\sigma):=v\Phi_{\sigma}(u^{*}-v). Based on this, we can further define the expected regret Regret(σ,T,Ψ)\mathrm{Regret}(\sigma,T,\Psi):

Regret(σ,T,Ψ)=𝔼[t=1Tr(Jσ(u),σ)r(Ψt(yt1),σ)]\mathrm{Regret}(\sigma,T,\Psi)=\mathbb{E}[\sum_{t=1}^{T}r(J_{\sigma}(u^{*}),\sigma)-r(\Psi_{t}(y_{t-1}),\sigma)] (34)

Now we have the following properties:

Lemma 16.
  1.

    r(v(σ),σ)r(v,σ)160(v(σ)v)2r(v^{*}(\sigma),\sigma)-r(v,\sigma)\geq\frac{1}{60}(v^{*}(\sigma)-v)^{2};

  2.

    |v(σ)u|25|1σ||v^{*}(\sigma)-u^{*}|\geq\frac{2}{5}|1-\sigma|;

  3.

    |Φσ(uv)Φ1(uv)||uv||σ1||\Phi_{\sigma}(u^{*}-v)-\Phi_{1}(u^{*}-v)|\leq|u^{*}-v|\cdot|\sigma-1|.

Proof.
  1.

    We have:

    \frac{\partial r(v,\sigma)}{\partial v}\Big|_{v=v^{*}(\sigma)}=0,\qquad\frac{\partial^{2}r(v,\sigma)}{\partial v^{2}}=\frac{1}{\sigma^{2}}(v^{2}-u^{*}v-2\sigma^{2})\,p_{\sigma}(u^{*}-v).

    Since v(0,u)v\in(0,u^{*}), we have (v2uv2σ2)<2σ2(v^{2}-u^{*}v-2\sigma^{2})<-2\sigma^{2}. Also, since σ(1/2,1)\sigma\in(1/2,1), we have pσ(uv)>12πe(u)22(1/2)2=12πeπ>0.017p_{\sigma}(u^{*}-v)>\frac{1}{\sqrt{2\pi}}\cdot e^{-\frac{(u^{*})^{2}}{2\cdot(1/2)^{2}}}=\frac{1}{\sqrt{2\pi}e^{\pi}}>0.017. Therefore, we have

    2r(v,σ)v2<20.017<130\frac{\partial^{2}r(v,\sigma)}{\partial v^{2}}<-2*0.017<-\frac{1}{30}

    As a result, we have:

    r(v^{*}(\sigma),\sigma)-r(v,\sigma)=-(v-v^{*}(\sigma))\frac{\partial r(v,\sigma)}{\partial v}\Big|_{v=v^{*}(\sigma)}-\frac{1}{2}(v^{*}(\sigma)-v)^{2}\frac{\partial^{2}r(v,\sigma)}{\partial v^{2}}\Big|_{v=\tilde{v}} (35)
    =0-\frac{1}{2}(v^{*}(\sigma)-v)^{2}\frac{\partial^{2}r(v,\sigma)}{\partial v^{2}}\Big|_{v=\tilde{v}}
    \geq\frac{1}{2}\cdot\frac{1}{30}(v^{*}(\sigma)-v)^{2}.
  2.

    According to Equation 32, we know that:

    v(σ)=σJ1(uσ)v^{*}(\sigma)=\sigma J_{1}(\frac{u^{*}}{\sigma})

    For u\in(u^{*},+\infty), J_{1}(u)<u. Plugging the standard Gaussian CDF into the expression for J^{\prime}(u) derived in Appendix B.2.1, we have:

    J1(u)\displaystyle J^{\prime}_{1}(u) =1+1J1(u)(J1(u)u)2\displaystyle=1+\frac{1}{J_{1}(u)(J_{1}(u)-u)-2}
    >1+102\displaystyle>1+\frac{1}{0-2}
    =12.\displaystyle=\frac{1}{2}.

    Also, for u(u,uσ)u\in(u^{*},\frac{u^{*}}{\sigma}), we have:

    J1(u)\displaystyle J^{\prime}_{1}(u) =112+J1(u)(uJ1(u))\displaystyle=1-\frac{1}{2+J_{1}(u)(u-J_{1}(u))}
    0<J1(u)<u112+u(u0)\displaystyle\underset{\mathclap{\overset{\uparrow}{0<J_{1}(u)<u}}}{\leq}1-\frac{1}{2+u(u-0)}
    u<uσ<u2112+(u2)2\displaystyle\underset{\mathclap{\overset{\uparrow}{u<\frac{u^{*}}{\sigma}<\frac{u^{*}}{2}}}}{\leq}1-\frac{1}{2+(\frac{u^{*}}{2})^{2}}
    =112+π8\displaystyle=1-\frac{1}{2+\frac{\pi}{8}}
    <35.\displaystyle<\frac{3}{5}.

    Therefore, we have:

    J1(u)Jσ(u)\displaystyle J_{1}(u^{*})-J_{\sigma}(u^{*}) =J1(u)σJ1(uσ)\displaystyle=J_{1}(u^{*})-\sigma J_{1}(\frac{u^{*}}{\sigma})
    =J1(u)σJ1(u)+σ(J1(u)J1(uσ))\displaystyle=J_{1}(u^{*})-\sigma J_{1}(u^{*})+\sigma(J_{1}(u^{*})-J_{1}(\frac{u^{*}}{\sigma}))
    =J1(u)(1σ)σ(J1(uσ)J1(u))\displaystyle=J_{1}(u^{*})(1-\sigma)-\sigma(J_{1}(\frac{u^{*}}{\sigma})-J_{1}(u^{*}))
    >u(1σ)σ35(uσu)\displaystyle>u^{*}(1-\sigma)-\sigma\cdot\frac{3}{5}(\frac{u^{*}}{\sigma}-u^{*})
    =u(1σ)35σ(1σ1)u\displaystyle=u^{*}(1-\sigma)-\frac{3}{5}\sigma(\frac{1}{\sigma}-1)u^{*}
    =u(1σ)(135)\displaystyle=u^{*}(1-\sigma)(1-\frac{3}{5})
    >25(1σ).\displaystyle>\frac{2}{5}(1-\sigma).
  3.

    This is because:

    |Φσ(uv)Φ1(uv)|\displaystyle|\Phi_{\sigma}(u^{*}-v)-\Phi_{1}(u^{*}-v)| =|Φσ(uv)Φσ(σ(uv))|\displaystyle=|\Phi_{\sigma}(u^{*}-v)-\Phi_{\sigma}(\sigma(u^{*}-v))|
    max|pσ||(uv)σ(uv)|\displaystyle\leq\max|p_{\sigma}|\cdot|(u^{*}-v)-\sigma(u^{*}-v)|
    12πσ|(uv)σ(uv)|\displaystyle\leq\frac{1}{\sqrt{2\pi}\sigma}\cdot|(u^{*}-v)-\sigma(u^{*}-v)|
    σ>12(1σ)|uv|.\displaystyle\underset{\mathclap{\overset{\uparrow}{\sigma>\frac{1}{2}}}}{\leq}(1-\sigma)|u^{*}-v|.
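The three properties can also be verified numerically for a few values of \sigma; the sketch below does so on an illustrative grid (the optimizer bounds, grids and tolerances are assumptions of the demo).

```python
# Numerical spot check of the three properties in Lemma 16 for Gaussian demand.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

u_star = np.sqrt(np.pi / 2.0)

def r(v, sigma):
    # expected reward v * Phi_sigma(u* - v), with Phi_sigma(x) = Phi_1(x / sigma)
    return v * norm.cdf((u_star - v) / sigma)

def v_opt(sigma):
    res = minimize_scalar(lambda v: -r(v, sigma), bounds=(0.0, 3.0), method="bounded")
    return res.x

for sigma in np.linspace(0.55, 0.95, 9):
    vs = v_opt(sigma)
    assert abs(vs - u_star) >= 0.4 * abs(1 - sigma) - 1e-6              # Property 2
    for v in np.linspace(1e-3, u_star - 1e-3, 50):
        assert r(vs, sigma) - r(v, sigma) >= (vs - v) ** 2 / 60 - 1e-7  # Property 1
        gap = abs(norm.cdf((u_star - v) / sigma) - norm.cdf(u_star - v))
        assert gap <= abs(u_star - v) * abs(sigma - 1) + 1e-12          # Property 3
```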

In the following part, we will propose two theorems, which balance the cost of learning against the cost of uncertainty. This part largely follows Section 3 of Broder and Rusmevichientong [2012], but we adopt a different family of demand curves here.

Theorem 17 (Learning is costly).

Let σ(1/2,1)\sigma\in(1/2,1) and vt(0,u)v_{t}\in(0,u^{*}), and we have:

𝒦(QV,1;QV,σ)<9900(1σ)2Regret(1,T,Ψ).\mathcal{K}(Q^{V,1};Q^{V,\sigma})<9900(1-\sigma)^{2}\mathrm{Regret}(1,T,\Psi). (36)

Here vt=Ψ(yt1),t=1,2,,Tv_{t}=\Psi(y_{t-1}),t=1,2,\ldots,T.

Proof.

First of all, we cite the following lemma that would facilitate the proof.

Lemma 18 (Corollary 3.1 in Taneja and Kumar, 2004).

Suppose B1B_{1} and B2B_{2} are distributions of Bernoulli random variables with parameters q1q_{1} and q2q_{2}, respectively, with q1,q2(0,1)q_{1},q_{2}\in(0,1). Then,

𝒦(B1;B2)(q1q2)2q2(1q2).\mathcal{K}(B_{1};B_{2})\leq\frac{(q_{1}-q_{2})^{2}}{q_{2}(1-q_{2})}.
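Lemma 18 is the standard bound of the Bernoulli KL-divergence by the corresponding \chi^{2}-divergence; a quick numerical check over an illustrative grid of parameters:

```python
# Check that KL(B1; B2) <= (q1 - q2)^2 / (q2 (1 - q2)) for Bernoulli distributions.
import numpy as np

def kl_bernoulli(q1, q2):
    return q1 * np.log(q1 / q2) + (1 - q1) * np.log((1 - q1) / (1 - q2))

qs = np.linspace(0.01, 0.99, 99)
for q1 in qs:
    for q2 in qs:
        assert kl_bernoulli(q1, q2) <= (q1 - q2) ** 2 / (q2 * (1 - q2)) + 1e-12
```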

According to the definition of KL-divergence, we have:

𝒦(QTV,1;QTV,σ)=s=1T𝒦(QsVs,1;QsVs,σ|Ys1).\mathcal{K}(Q_{T}^{V,1};Q_{T}^{V,\sigma})=\sum_{s=1}^{T}\mathcal{K}(Q_{s}^{V^{s},1};Q_{s}^{V^{s},\sigma}|Y_{s-1}).

For each term of the RHS, we have:

𝒦(QsVs,1,QsVs,σ|Ys1)\displaystyle\qquad\mathcal{K}(Q_{s}^{V^{s},1},Q_{s}^{V^{s},\sigma}|Y_{s-1})
=ys{0,1}sQsVs,1(ys)log(QsVs,1(𝟙s|ys1)QsVs,σ(𝟙s|ys1))\displaystyle=\quad\sum_{y_{s}\in\{0,1\}^{s}}Q_{s}^{V^{s},1}(y_{s})\log\left(\frac{Q_{s}^{V^{s},1}(\mathbbm{1}_{s}|y_{s-1})}{Q_{s}^{V^{s},\sigma}(\mathbbm{1}_{s}|y_{s-1})}\right)
\underset{\mathclap{\overset{\uparrow}{\text{split $y_{s}$ into $y_{s-1}$ and $\mathbbm{1}_{s}$}}}}{=}\qquad\qquad\sum_{y_{s-1}\in\{0,1\}^{s-1}}Q_{s-1}^{V^{s-1},1}(y_{s-1})\cdot\sum_{\mathbbm{1}_{s}\in\{0,1\}}Q_{s}^{V^{s},1}(\mathbbm{1}_{s}|y_{s-1})\log\left(\frac{Q_{s}^{V^{s},1}(\mathbbm{1}_{s}|y_{s-1})}{Q_{s}^{V^{s},\sigma}(\mathbbm{1}_{s}|y_{s-1})}\right)
=ys1{0,1}s1Qs1Vs1,1(ys1)𝒦(QsVs,1(|ys1),QsVs,σ(|ys1))\displaystyle=\qquad\sum_{y_{s-1}\in\{0,1\}^{s-1}}Q_{s-1}^{V^{s-1},1}(y_{s-1})\mathcal{K}\left(Q_{s}^{V^{s},1}(\cdot|y_{s-1}),Q_{s}^{V^{s},\sigma}(\cdot|y_{s-1})\right)
Lemma 18ys1{0,1}s1Qs1Vs1,1(ys1)(Φ1(uvs)Φσ(uvs))2Φσ(uvs)(1Φσ(uvs))\displaystyle\underset{\mathclap{\overset{\uparrow}{\text{Lemma \ref{lemma_Taneja_and_Kumar_bernoulli_divergence}}}}}{\leq}\qquad\sum_{y_{s-1}\in\{0,1\}^{s-1}}Q_{s-1}^{V^{s-1},1}(y_{s-1})\frac{(\Phi_{1}(u^{*}-v_{s})-\Phi_{\sigma}(u^{*}-v_{s}))^{2}}{\Phi_{\sigma}(u^{*}-v_{s})(1-\Phi_{\sigma}(u^{*}-v_{s}))}
=1Φσ(uvs)(1Φσ(uvs))ys1{0,1}s1Qs1Vs1,1(ys1)(Φ1(uvs)Φσ(uvs))2\displaystyle=\quad\frac{1}{\Phi_{\sigma}(u^{*}-v_{s})(1-\Phi_{\sigma}(u^{*}-v_{s}))}\sum_{y_{s-1}\in\{0,1\}^{s-1}}Q_{s-1}^{V^{s-1},1}(y_{s-1})(\Phi_{1}(u^{*}-v_{s})-\Phi_{\sigma}(u^{*}-v_{s}))^{2}
()165ys1{0,1}s1Qs1Vs1,1(ys1)(Φ1(uvs)Φσ(uvs))2\displaystyle\underset{\mathclap{\overset{\uparrow}{(**)}}}{\leq}\qquad 165\cdot\sum_{y_{s-1}\in\{0,1\}^{s-1}}Q_{s-1}^{V^{s-1},1}(y_{s-1})(\Phi_{1}(u^{*}-v_{s})-\Phi_{\sigma}(u^{*}-v_{s}))^{2}
Lemma 16 Property 3165ys1{0,1}s1Qs1Vs1,1(ys1)(uvs)2(1σ)2\displaystyle\underset{\mathclap{\overset{\uparrow}{\text{Lemma \ref{lemma_properties} Property 3}}}}{\leq}\qquad 165\cdot\sum_{y_{s-1}\in\{0,1\}^{s-1}}Q_{s-1}^{V^{s-1},1}(y_{s-1})(u^{*}-v_{s})^{2}(1-\sigma)^{2}
=165(1σ)2𝔼Ys1[(uvs)2].\displaystyle=\qquad 165(1-\sigma)^{2}\mathbb{E}_{Y_{s-1}}[(u^{*}-v_{s})^{2}].

Here inequality (**) above is proved as follows: since vs(0,u)v_{s}\in(0,u^{*}) as is assumed, we have:

\frac{1}{2}<\Phi_{\sigma}(u^{*}-v_{s})<\Phi_{\sigma}(u^{*})
=\ \Phi_{1}\left(\frac{u^{*}}{\sigma}\right)
\leq\ \Phi_{1}\left(\frac{\sqrt{\pi/2}}{1/2}\right)
=\ \Phi_{1}(\sqrt{2\pi})
\leq\ 0.9939.

As a result, we have \frac{1}{\Phi_{\sigma}(u^{*}-v_{s})(1-\Phi_{\sigma}(u^{*}-v_{s}))}\leq\frac{1}{0.9939\times 0.0061}\approx 164.94\leq 165. Therefore, by summing over all s, we have:

𝒦(QTV,1;QTV,σ)\displaystyle\mathcal{K}(Q_{T}^{V,1};Q_{T}^{V,\sigma}) =s=1T𝒦(QsVs,1;QsVs,σ|Ys1)\displaystyle=\sum_{s=1}^{T}\mathcal{K}(Q_{s}^{V^{s},1};Q_{s}^{V^{s},\sigma}|Y_{s-1})
165(1σ)2s=1T𝔼Ys1[(uvs)2]\displaystyle\leq 165(1-\sigma)^{2}\sum_{s=1}^{T}\mathbb{E}_{Y_{s-1}}[(u^{*}-v_{s})^{2}]
Lemma 16 Property 1165×60(1σ)2s=1T(r(u,1)r(vs,1))\displaystyle\underset{\mathclap{\overset{\uparrow}{\text{Lemma~{}\ref{lemma_properties} Property 1}}}}{\leq}165\times 60\cdot(1-\sigma)^{2}\sum_{s=1}^{T}\left(r(u^{*},1)-r(v_{s},1)\right)
=definition of regret and vs=Ψ(ys1).9900(1σ)2Regret(1,T,Ψ),\displaystyle\underset{\mathclap{\overset{\uparrow}{\text{definition of regret and }v_{s}=\Psi(y_{s-1}).}}}{=}9900(1-\sigma)^{2}\mathrm{Regret}(1,T,\Psi),

which concludes the proof. ∎

Theorem 19 (Uncertainty is costly).

Let σ1T14\sigma\leq 1-T^{-\frac{1}{4}}, and we have:

Regret(1,T,Ψ)+Regret(σ,T,Ψ)124000Te𝒦(QV,1;QV,σ).\mathrm{Regret}(1,T,\Psi)+\mathrm{Regret}(\sigma,T,\Psi)\geq\frac{1}{24000}\cdot\sqrt{T}\cdot e^{-\mathcal{K}(Q^{V,1};Q^{V,\sigma})}. (37)

Here vt=Ψ(yt1),t=1,2,,Tv_{t}=\Psi(y_{t-1}),t=1,2,\ldots,T.

Proof.

First of all, we cite a lemma that would facilitate our proof:

Lemma 20.

Let Q0Q_{0} and Q1Q_{1} be two probability distributions on a finite space 𝒴\mathcal{Y}; with Q0(y),Q1(y)>0,y𝒴Q_{0}(y),Q_{1}(y)>0,\forall y\in\mathcal{Y}. Then for any function F:𝒴{0,1}F:\mathcal{Y}\rightarrow\{0,1\},

Q_{0}\{F=1\}+Q_{1}\{F=0\}\geq\frac{1}{2}e^{-\mathcal{K}(Q_{0};Q_{1})},

where 𝒦(Q0;Q1)\mathcal{K}(Q_{0};Q_{1}) denotes the KL-divergence of Q0Q_{0} and Q1Q_{1}.
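On a finite space, the left-hand side of Lemma 20 minimized over all tests F equals \sum_{y}\min(Q_{0}(y),Q_{1}(y)); the following sketch checks the inequality on randomly drawn distributions (the support size and number of trials are illustrative assumptions).

```python
# Numerical check of Lemma 20 on a small finite space.
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    q0 = rng.dirichlet(np.ones(6))
    q1 = rng.dirichlet(np.ones(6))
    lhs = np.minimum(q0, q1).sum()        # min over F of Q0{F=1} + Q1{F=0}
    kl = np.sum(q0 * np.log(q0 / q1))     # KL-divergence K(Q0; Q1)
    assert lhs >= 0.5 * np.exp(-kl) - 1e-12
```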

Define two intervals of prices:

C_{1}=\left\{v:|v-u^{*}|\leq\frac{1}{10T^{1/4}}\right\}\ \ \text{and}\ \ C_{2}=\left\{v:|J_{\sigma}(u^{*})-v|\leq\frac{1}{10T^{1/4}}\right\}

Note that C_{1} and C_{2} are disjoint, since |u^{*}-J_{\sigma}(u^{*})|\geq\frac{2}{5}|1-\sigma|\geq\frac{2}{5}T^{-1/4}>2\cdot\frac{1}{10}T^{-1/4} according to Lemma 16 Property 2. Also, for v\in(0,u^{*})\backslash C_{2}, the regret is large according to Lemma 16 Property 1, because:

r(v(σ),σ)r(v,σ)160(vv(σ))216000T12.r(v^{*}(\sigma),\sigma)-r(v,\sigma)\geq\frac{1}{60}(v-v^{*}(\sigma))^{2}\geq\frac{1}{6000{T}^{\frac{1}{2}}}.

Then, we have:

Regret(1,T,Ψ)+Regret(σ,T,Ψ)\displaystyle\mathrm{Regret}(1,T,\Psi)+\mathrm{Regret}(\sigma,T,\Psi)
\displaystyle\geq t=1T1𝔼1[r(u,1)r(vt+1,1)]+𝔼σ[r(Jσ(u),σ)r(vt+1,σ)]\displaystyle\ \sum_{t=1}^{T-1}\mathbb{E}_{1}[r(u^{*},1)-r(v_{t+1},1)]+\mathbb{E}_{\sigma}[r(J_{\sigma}(u^{*}),\sigma)-r(v_{t+1},\sigma)]
\geq\ \frac{1}{6000\sqrt{T}}\sum_{t=1}^{T-1}\left(\mathbb{P}_{1}[v_{t+1}\notin C_{1}]+\mathbb{P}_{\sigma}[v_{t+1}\notin C_{2}]\right)
SupposeFt+1=𝟙[vt+1C2]\displaystyle\underset{\mathclap{\overset{\uparrow}{Suppose\ F_{t+1}=\mathbbm{1}[v_{t+1}\in C_{2}]}}}{\geq} 16000Tt=1T11[Ft+1=1]+σ[Ft+1=0]\displaystyle\ \frac{1}{6000\sqrt{T}}\sum_{t=1}^{T-1}\mathbb{P}_{1}[F_{t+1}=1]+\mathbb{P}_{\sigma}[F_{t+1}=0]
Lemma20\displaystyle\underset{\mathclap{\overset{\uparrow}{Lemma\ \ref{lemma_Tsybakov2009}}}}{\geq} 16000Tt=1T112e𝒦(QtVt,1;QtVt,σ)\displaystyle\ \frac{1}{6000\sqrt{T}}\sum_{t=1}^{T-1}\frac{1}{2}e^{-\mathcal{K}(Q_{t}^{V^{t},1};Q_{t}^{V^{t},\sigma})}
𝒦(QtVt,1;QtVt,σ)notdecreasing\displaystyle\underset{\mathclap{\overset{\uparrow}{\mathcal{K}(Q_{t}^{V^{t},1};Q_{t}^{V^{t},\sigma})\ not\ decreasing\ }}}{\geq} 16000TT12e𝒦(QTV,1;QTV,σ)\displaystyle\ \frac{1}{6000\sqrt{T}}\frac{T-1}{2}e^{-\mathcal{K}(Q_{T}^{V,1};Q_{T}^{V,\sigma})}
\displaystyle\geq 124000Te𝒦(QTV,1;QTV,σ).\displaystyle\ \frac{1}{24000}\sqrt{T}e^{-\mathcal{K}(Q_{T}^{V,1};Q_{T}^{V,\sigma})}.

According to Theorem 17 and Theorem 19, we can now prove Theorem 12. Let \sigma=1-T^{-\frac{1}{4}}. Then:

2(Regret(1,T,Ψ)+Regret(σ,T,Ψ))\displaystyle 2\left(\mathrm{Regret}(1,T,\Psi)+\mathrm{Regret}(\sigma,T,\Psi)\right)
\displaystyle\geq Regret(1,T,Ψ)+(Regret(1,T,Ψ)+Regret(σ,T,Ψ))\displaystyle\mathrm{Regret}(1,T,\Psi)+\left(\mathrm{Regret}(1,T,\Psi)+\mathrm{Regret}(\sigma,T,\Psi)\right)
\displaystyle\geq 19900T1/2𝒦(QV,1;QV,σ)+124000Te𝒦(QV,1;QV,σ)\displaystyle\frac{1}{9900T^{-1/2}}\mathcal{K}(Q^{V,1};Q^{V,\sigma})+\frac{1}{24000}\cdot\sqrt{T}\cdot e^{-\mathcal{K}(Q^{V,1};Q^{V,\sigma})}
\displaystyle\geq 124000T(𝒦(QV,1;QV,σ)+e𝒦(QV,1;QV,σ))\displaystyle\frac{1}{24000}\sqrt{T}\left(\mathcal{K}(Q^{V,1};Q^{V,\sigma})+e^{-\mathcal{K}(Q^{V,1};Q^{V,\sigma})}\right)
Thefactexx+1,x\displaystyle\underset{\mathclap{\overset{\uparrow}{The\ fact\ e^{x}\geq x+1,\forall{x}\in\mathbb{R}}}}{\geq} 124000T.\displaystyle\frac{1}{24000}\sqrt{T}.

Thus Theorem 12 is proved valid. ∎

Appendix C More Discussions

C.1 Dependence on BB and Noise Variance

Here we use a concrete example to analyze the coefficients of regret bounds. Again, we assume that Nt𝒩(0,σ2)N_{t}\sim\mathcal{N}(0,\sigma^{2}). Notice that both CsC_{s} and CaC_{a} have a component of CexpCdown\frac{C_{\exp}}{C_{down}}. In order to analyze CexpCdown\frac{C_{\exp}}{C_{down}}, we define a hazard function denoted as λ(ω)\lambda(\omega) with ω\omega\in\mathbb{R}:

λ(ω):=p1(ω)1Φ1(ω)=p1(ω)Φ1(ω),\lambda(\omega):=\frac{p_{1}(\omega)}{1-\Phi_{1}(\omega)}=\frac{p_{1}(-\omega)}{\Phi_{1}(-\omega)}, (38)

where \Phi_{1} and p_{1} are the CDF and PDF of the standard Gaussian distribution. The concept of a hazard function comes from the area of survival analysis. Plugging Equation 38 into Equations 11 and 13, we get:

C_{\text{down}}\ \geq\inf_{\omega\in[-\frac{B}{\sigma},\frac{B}{\sigma}]}\left\{\frac{1}{\sigma^{2}}\left(\lambda(-\omega)^{2}+\omega\cdot\lambda(-\omega)\right)\right\} (39)
Cexp\displaystyle C_{\text{exp}} supω[Bσ,Bσ]{1σ2λ(ω)2}.\displaystyle\leq\sup_{\omega\in[-\frac{B}{\sigma},\frac{B}{\sigma}]}\left\{\frac{1}{\sigma^{2}}\lambda(-\omega)^{2}\right\}.

In Lemma 21, we will show that \lambda(\omega) is exponentially small as \omega\rightarrow-\infty, and is asymptotically close to \omega as \omega\rightarrow+\infty. Therefore, C_{\text{down}} is exponentially small and C_{\text{exp}} is quadratically large with respect to B/\sigma. Although we assume that B and \sigma are constants, we should be alert that the scale of B/\sigma can be very large as \sigma goes to zero, i.e., as the noise becomes "insignificant". In practice (especially when T is finite), this may cause extremely large regret at the beginning. The "Shallow Pricing" method introduced by Cohen et al. [2020] (as well as other domain-cutting methods in contextual search) may serve as a good pre-processing step, as it repeatedly bisects the feasible region of \theta^{*} with high probability. According to Theorem 3 in Cohen et al. [2020], their Shallow Pricing algorithm bisects the parameter set at most a logarithmic number of times to ensure that \frac{B}{\sigma} is small enough (i.e., upper bounded by O(\text{poly}\log(T))). However, this does not necessarily mean that we can use an O(\log{T})-step pre-process to achieve the same effect, since they run the algorithm throughout the session while we would only take it as a pre-process. Intuitively, at least under the adversarial feature assumption, we cannot rely entirely on a few features occurring at the beginning (as they might be misleading) to cut the parameter set once and for all. A mixed approach of Shallow Pricing and EMLP/ONSP might work, as the algorithm can detect whether the current \frac{B}{\sigma} exceeds a bisection threshold. However, this requires new regret analysis, since the parameter domain being operated on changes over time. Therefore, we claim in Section 7 that the regret bound is still open if \sigma=\Theta(T^{-\alpha}) for \alpha\in(0,1).

Lemma 21 (Properties of λ(ω)\lambda(\omega)).

For λ(ω):=p1(ω)1Φ1(ω)\lambda(\omega):=\frac{p_{1}(\omega)}{1-\Phi_{1}(\omega)}, we have:

1,\displaystyle 1, ddωλ(ω)>0.\displaystyle\qquad\frac{d}{d\omega}\lambda(\omega)>0.
2,\displaystyle 2, limωωkλ(ω)=0,k>0.\displaystyle\qquad\lim_{\omega\rightarrow-\infty}\omega^{k}\lambda(\omega)=0,\ \forall k>0.
3,\displaystyle 3, limω+λ(ω)ω=0.\displaystyle\qquad\lim_{\omega\rightarrow+\infty}\lambda(\omega)-\omega=0.
4,\displaystyle 4, limω+ω(λ(ω)ω)=1.\displaystyle\qquad\lim_{\omega\rightarrow+\infty}\omega\left(\lambda(\omega)-\omega\right)=1.
Proof.

We prove the Lemma 21 sequentially:

  1.

    We have:

    λ(ω)=\displaystyle\lambda^{\prime}(\omega)= p12(ω)p1(ω)Φ1(ω)Φ1(ω)2\displaystyle\frac{p_{1}^{2}(-\omega)-p^{\prime}_{1}(-\omega)\Phi_{1}(-\omega)}{\Phi_{1}(-\omega)^{2}} (40)
    =\displaystyle= p12(ω)ωp1(ω)Φ1(ω)Φ1(ω)2\displaystyle\frac{p_{1}^{2}(-\omega)-\omega p_{1}(-\omega)\Phi_{1}(-\omega)}{\Phi_{1}(-\omega)^{2}}
    =\displaystyle= p1(ω)(p1(ω)ωΦ1(ω))Φ1(ω)2.\displaystyle\frac{p_{1}(-\omega)\left(p_{1}(-\omega)-\omega\Phi_{1}(-\omega)\right)}{\Phi_{1}(-\omega)^{2}}.

    Therefore, it is equivalent to prove that p1(ω)ωΦ1(ω)>0p_{1}(-\omega)-\omega\Phi_{1}(-\omega)>0.

    Suppose f(ω)=p1(ω)+ωΦ1(ω)f(\omega)=p_{1}(\omega)+\omega\Phi_{1}(\omega). We now take its derivatives as follows:

    f(ω)\displaystyle f^{\prime}(\omega) =p1(ω)+(Φ1(ω)+ωp1(ω))\displaystyle=p^{\prime}_{1}(\omega)+(\Phi_{1}(\omega)+\omega\cdot p_{1}(\omega)) (41)
    =(ω)p1(ω)+Φ1(ω)+ωp1(ω)\displaystyle=(-\omega)p_{1}(\omega)+\Phi_{1}(\omega)+\omega\cdot p_{1}(\omega)
    =Φ1(ω)\displaystyle=\Phi_{1}(\omega)
    >0\displaystyle>0

    Therefore, we know that f(ω)f(\omega) monotonically increases in \mathbb{R}. Additionally, since we have:

    \lim_{\omega\rightarrow-\infty}f(\omega) (42)
    =\ \lim_{\omega\rightarrow-\infty}p_{1}(\omega)+\lim_{\omega\rightarrow-\infty}\omega\Phi_{1}(\omega)
    =\ 0+\lim_{\omega\rightarrow-\infty}\frac{\Phi_{1}(\omega)}{1/\omega}
    =\ \lim_{\omega\rightarrow-\infty}\frac{p_{1}(\omega)}{-1/\omega^{2}}
    =\ \lim_{\omega\rightarrow-\infty}\left(-\frac{1}{\sqrt{2\pi}}\cdot\frac{\omega^{2}}{\exp\{\frac{\omega^{2}}{2}\}}\right)
    =\ 0,

    where the third equality follows from L'Hospital's rule.

    Therefore, we know that f(ω)>0f(\omega)>0, ω\forall\omega\in\mathbb{R}, and as a result, λ(ω)>0\lambda^{\prime}(\omega)>0.

  2.

    We have:

    limωωkλ(ω)\displaystyle\lim_{\omega\rightarrow-\infty}\omega^{k}\lambda(\omega) (43)
    =\displaystyle= limωωkp1(ω)Φ1(ω)\displaystyle\lim_{\omega\rightarrow-\infty}\omega^{k}\frac{p_{1}(-\omega)}{\Phi_{1}(-\omega)}
    =\displaystyle= limωωkp1(ω)limωΦ1(ω)\displaystyle\frac{\mathop{\lim}\limits_{\omega\rightarrow-\infty}\omega^{k}p_{1}(-\omega)}{\mathop{\lim}\limits_{\omega\rightarrow-\infty}\Phi_{1}(-\omega)}
    =\displaystyle= limωωk(12πexp{ω22})1\displaystyle\frac{\mathop{\lim}\limits_{\omega\rightarrow-\infty}\omega^{k}(\frac{1}{\sqrt{2\pi}}\exp\{-\frac{\omega^{2}}{2}\})}{1}
    =\displaystyle= 01\displaystyle\frac{0}{1}
    =\displaystyle= 0.\displaystyle 0.
  3.

    We only need to prove that

    limω+λ(ω)ω=0.\lim_{\omega\rightarrow+\infty}\lambda(\omega)-\omega=0.

    Actually, we have:

    limω+λ(ω)ω\displaystyle\lim_{\omega\rightarrow+\infty}\lambda(\omega)-\omega (44)
    =\displaystyle= limω+p1(ω)ωΦ1(ω)Φ1(ω)\displaystyle\lim_{\omega\rightarrow+\infty}\frac{p_{1}(-\omega)-\omega\Phi_{1}(-\omega)}{\Phi_{1}(-\omega)}
    =\displaystyle= limωp1(ω)+ωΦ1(ω)Φ1(ω)\displaystyle\lim_{\omega\rightarrow-\infty}\frac{p_{1}(\omega)+\omega\Phi_{1}(\omega)}{\Phi_{1}(\omega)}
    =L’Hospital’s rule\displaystyle\overset{\mathclap{\underset{\downarrow}{\text{L'Hospital's rule}}}}{=} limω(ω)p1(ω)+Φ1(ω)+ωp1(ω)p1(ω)\displaystyle\lim_{\omega\rightarrow-\infty}\frac{(-\omega)p_{1}(\omega)+\Phi_{1}(\omega)+\omega p_{1}(\omega)}{p_{1}(\omega)}
    =\displaystyle= limωΦ1(ω)p1(ω)\displaystyle\lim_{\omega\rightarrow-\infty}\frac{\Phi_{1}(\omega)}{p_{1}(\omega)}
    =\displaystyle= limωp1(ω)(ω)p1(ω)\displaystyle\lim_{\omega\rightarrow-\infty}\frac{p_{1}(\omega)}{(-\omega)p_{1}(\omega)}
    =\displaystyle= 0\displaystyle 0
  4.
    limω+ω(λ(ω)ω)\displaystyle\lim_{\omega\rightarrow+\infty}\omega(\lambda(\omega)-\omega) (45)
    =\displaystyle= limω+ω(p1(ω)ωΦ1(ω))Φ1(ω)\displaystyle\lim_{\omega\rightarrow+\infty}\frac{\omega\left(p_{1}(-\omega)-\omega\Phi_{1}(-\omega)\right)}{\Phi_{1}(-\omega)}
    =\displaystyle= limωωp1(ω)ω2Φ1(ω)Φ1(ω)\displaystyle\lim_{\omega\rightarrow-\infty}\frac{-\omega p_{1}(\omega)-\omega^{2}\Phi_{1}(\omega)}{\Phi_{1}(\omega)}
    =L’Hospital’s rule\displaystyle\overset{\mathclap{\underset{\downarrow}{\text{L'Hospital's rule}}}}{=} limωp1(ω)ω(ω)p1(ω)ω2p1(ω)2ωΦ1(ω)p1(ω)\displaystyle\lim_{\omega\rightarrow-\infty}\frac{-p_{1}(\omega)-\omega(-\omega)p_{1}(\omega)-\omega^{2}p_{1}(\omega)-2\omega\cdot\Phi_{1}(\omega)}{p_{1}(\omega)}
    =\displaystyle= 12limωωΦ1(ω)p1(ω)\displaystyle-1-2\lim_{\omega\rightarrow-\infty}\frac{\omega\Phi_{1}(\omega)}{p_{1}(\omega)}
    =\displaystyle= 1+2limω+1λ(ω)ω\displaystyle-1+2\lim_{\omega\rightarrow+\infty}\frac{1}{\frac{\lambda(\omega)}{\omega}}
    =\displaystyle= 1+2\displaystyle-1+2
    =\displaystyle= 1.\displaystyle 1.

Thus the lemma holds.
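The following sketch illustrates Lemma 21 and the discussion above numerically: it evaluates the Gaussian hazard rate via logpdf/logsf for numerical stability and shows how the quantities controlling C_{\text{down}} and C_{\text{exp}} scale with B/\sigma (the grid of ratios is an illustrative assumption).

```python
# Numerical illustration of the Gaussian hazard rate lambda(w) = p(w) / (1 - Phi(w)).
import numpy as np
from scipy.stats import norm

def hazard(w):
    return np.exp(norm.logpdf(w) - norm.logsf(w))

print(hazard(-6.0))                    # exponentially small as w -> -infinity
print(hazard(6.0) - 6.0, 1.0 / 6.0)    # lambda(w) - w ~ 1/w as w -> +infinity

for ratio in [1.0, 2.0, 4.0, 8.0]:     # ratio plays the role of B / sigma
    w = np.linspace(-ratio, ratio, 2001)
    lam = hazard(w)
    c_down = np.min(lam * (lam - w))   # proportional to C_down (up to a 1/sigma^2 factor)
    c_exp = np.max(lam ** 2)           # proportional to C_exp (up to a 1/sigma^2 factor)
    print(ratio, c_down, c_exp)        # c_down decays exponentially, c_exp grows ~ (B/sigma)^2
```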

C.2 Algorithmic Design

C.2.1 Probit and Logistic Regressions

A probit/logit model is described as follows: a Boolean random variable Y satisfies \mathbb{P}[Y=1|X]=F(X^{\top}\beta), where X\in\mathbb{R}^{d} is a random vector, \beta\in\mathbb{R}^{d} is a parameter vector, and F is the cumulative distribution function (CDF) of a (standard) Gaussian/logistic distribution. In our problem, we may treat \mathbbm{1}_{t} as Y, [x_{t}^{\top},v_{t}]^{\top} as X and [\theta^{*\top},-1]^{\top} as \beta, which exactly fits this model if we assume the noise is Gaussian or logistic. Therefore, \hat{\theta}_{k}=\mathop{\arg\min}_{\theta}\hat{L}_{k}(\theta) can be solved via highly efficient implementations of generalized linear models, e.g., GLMnet, rather than resorting to generic tools for convex programming. As a heuristic, we could leverage the vast body of statistical work on probit and logit models and adopt a fully Bayesian approach that jointly estimates \theta and the hyper-parameters of F. This would make the algorithm more practical by eliminating the need to choose the hyper-parameters when running the algorithm.
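As a minimal illustration of this reduction (a sketch on simulated data under the Gaussian-noise assumption, not the paper's implementation), the following code fits the probit model by directly minimizing the negative log-likelihood; \theta^{*}, the feature and price distributions, and the use of scipy are illustrative choices.

```python
# Probit-regression estimate of theta from censored sales feedback (simulated data).
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, n = 3, 5000
theta_star = np.array([0.5, -0.3, 0.8])
X = rng.uniform(-1, 1, size=(n, d))
V = rng.uniform(0, 2, size=n)                             # posted prices
sold = (V <= X @ theta_star + rng.normal(size=n)).astype(float)

def neg_log_lik(theta):
    z = V - X @ theta                                     # P(sale) = 1 - F(v - x^T theta)
    return -np.sum(sold * norm.logsf(z) + (1 - sold) * norm.logcdf(z))

theta_hat = minimize(neg_log_lik, np.zeros(d), method="BFGS").x
print(theta_hat)   # close to theta_star for moderately large n
```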

C.2.2 Advantages of EMLP over ONSP.

For the stochastic setting, we specifically propose EMLP even though ONSP also works. This is because EMLP only "switches" the pricing parameter \hat{\theta} O(\log{T}) times. This makes it appealing in many applications (especially brick-and-mortar sales) where the number of policy updates is a bottleneck. In fact, the iterations within one epoch can be carried out entirely in parallel.

C.2.3 Agnostic Dynamic Pricing: Explorations versus Exploitation

At the moment, the proposed algorithms rely on the assumption of a linear valuation function (see Appendix C.3 for more discussion on problem modeling). It would be interesting to investigate model-misspecified cases and fully agnostic settings. The key would be to exploit the structural feedback in model-free policy-evaluation methods such as importance sampling. The main reason why we do not need explicit exploration lies in the noise model: essentially, we are implicitly exploring a higher (permitted) price using the naturally occurring noise in the data. In comparison, there is another problem setting named "adversarial irrationality" in which some of the customers valuate the product adaptively and adversarially (an adaptive adversary may take actions adversarially in response to the environmental changes; in comparison, the "adversarial features" we allow are chosen by an oblivious adversary before the interactions start). Existing work by Krishnamurthy et al. [2021] adopts this setting and shows a regret that depends linearly on the number of irrational customers, but they consider a different loss function (see the Related Works section).

C.3 Problem Modeling

C.3.1 Noise Distributions

In this work, we have made four assumptions on the noise distribution: strict log-concavity, 2^{\text{nd}}-order smoothness, known distribution, and i.i.d. noise. Here we explain each of them specifically.

  • The assumption of knowing the exact F is critical to the regret bound: if we have this knowledge, then we achieve O(\log{T}) regret even with adversarial features; otherwise, an \Omega(\sqrt{T}) regret is unavoidable even with stochastic features.

  • The strictly log-concave distribution family includes the Gaussian and logistic distributions as two common noise models. In comparison, Javanmard and Nazerzadeh [2019] assume log-concavity, which further covers the Laplacian, exponential and uniform distributions. Javanmard and Nazerzadeh [2019] also consider the cases where (1) the noise distribution is unknown but log-concave, and (2) the noise distribution is zero-mean and bounded with support [-\delta,\delta]. For case (1), they propose an algorithm with O(\sqrt{T}) regret and prove a matching lower bound. For case (2), they propose an algorithm with linear regret.

  • The assumption that F is 2^{\text{nd}}-order smooth is also made by Javanmard and Nazerzadeh [2019], who take the derivative f^{\prime}(v) and apply its upper bound in the proof. Consequently, we are still unaware of the regret bound when the noise distribution is discrete, in which case a lower bound of \Omega(\sqrt{T}) can be directly applied from Kleinberg and Leighton [2003].

  • We further assume that the noise is identically distributed. In reality, however, the noise would vary across customers. The same concern applies to the parameter \theta^{*}: can we assume that different people share the same valuation parameter? We may interpret it in the following two ways, but both still have flaws: (1) the "customer" is the public, i.e., its aggregate behavior is quite stable in general; or (2) the customer is the same individual over the whole time series. However, the former interpretation does not match the assumption that we sell only one product at each time, and the latter would undermine the independence assumption on the noise: a person learns as well and might gradually reduce the noise in their decisions. In that respect, it would be closer to reality to assume the noises form a martingale. This assumption has been stated in Qiang and Bayati [2016].

C.3.2 Linear Valuations on Features

There exist many products whose valuations do not depend linearly on their features. One famous instance is the diamond: a kilogram of diamond powder is very cheap because it can be produced artificially, but a single 5-carat (i.e., 1 gram) diamond might cost more than $100,000. This is because of an intrinsic non-linear property of diamonds: large ones are rare and cannot be (at least not easily) composed from smaller ones. Another example lies in electricity pricing [Joskow and Wolfram, 2012], where the more you consume, the higher the unit price you pay. On the contrary, goods bought in bulk tend to be cheaper than at retail prices. These are both consequences of marginal costs: consuming a large volume of electricity may require extra maintenance and increase the cost, while purchasing a large amount frees up storage and thus reduces the cost. In a word, our problem setting might not be suitable when the features are large enough, and thus an upper bound on x^{\top}\theta becomes a necessity.

C.4 Ex Ante v.s. Ex Post Regrets

In this work, we considered the ex ante regret Reg_{ea}=\sum_{t=1}^{T}\max_{\theta}\mathbb{E}[v^{\theta}_{t}\cdot\mathbbm{1}(v^{\theta}_{t}\leq w_{t})]-\mathbb{E}[v_{t}\cdot\mathbbm{1}(v_{t}\leq w_{t})], where v^{\theta}_{t}=J(x_{t}^{\top}\theta) is the greedy price with parameter \theta and w_{t}=x_{t}^{\top}\theta^{*}+N_{t} is the realized random valuation. The ex post definition of the cumulative regret, i.e., Reg_{ep}=\max_{\theta}\sum_{t=1}^{T}v^{\theta}_{t}\mathbbm{1}(v^{\theta}_{t}\leq w_{t})-v_{t}\mathbbm{1}(v_{t}\leq w_{t}), makes sense too. Note that we can decompose \mathbb{E}\left[Reg_{ep}\right]=Reg_{ea}+\mathbb{E}[\max_{\theta}\sum_{t=1}^{T}v^{\theta}_{t}\mathbbm{1}(v^{\theta}_{t}\leq w_{t})-\sum_{t=1}^{T}v^{\theta^{*}}_{t}\mathbbm{1}(v^{\theta^{*}}_{t}\leq w_{t})]. While the second term might be as large as \Omega(\sqrt{dT}), it does not depend on the algorithm. For this reason, we believe using Reg_{ea} is without loss of generality, and it reveals more nuanced performance differences among algorithms.

For the ex post dynamic regret, i.e., Reg_{d}=\sum_{t=1}^{T}w_{t}-v_{t}\cdot\mathbbm{1}(v_{t}\leq w_{t}), it is argued in Cohen et al. [2020] that any policy must suffer an expected regret of \Omega(T) (even if \theta^{*} is known). A good illustrative example is N_{t}\sim\mathcal{N}(0,1) and x_{t}^{\top}\theta^{*}=\sqrt{\frac{\pi}{2}}: the optimal price is \sqrt{\frac{\pi}{2}} as well, but the probability of acceptance is only 1/2, which leads to a constant per-step regret of \frac{1}{2}\sqrt{\frac{\pi}{2}}.
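A quick Monte Carlo check of this example (the sample size is an illustrative choice):

```python
# Per-step dynamic regret when N_t ~ N(0, 1), x_t^T theta* = sqrt(pi/2) and we post
# the optimal price v = sqrt(pi/2): the expectation equals (1/2) * sqrt(pi/2).
import numpy as np

rng = np.random.default_rng(0)
u = np.sqrt(np.pi / 2)
w = u + rng.normal(size=1_000_000)          # realized valuations
per_step = np.mean(w - u * (u <= w))        # E[w_t - v_t * 1(v_t <= w_t)]
print(per_step, 0.5 * np.sqrt(np.pi / 2))   # both approximately 0.6267
```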

C.5 Ethical Issues

A field of study addresses "personalized dynamic pricing" [Aydin and Ziya, 2009, Chen and Gallego, 2021], where a firm makes use of information about individual customers and sets a unique price for each of them. This has been frequently applied in airline pricing [Krämer et al., 2018]. However, it amounts to first-degree price discrimination. Even though this "discrimination" is not necessarily immoral, it is certainly awkward to be caught proposing the same product at different prices to different customers. For example, if we know that the coming customer is wealthy and not very price-sensitive (e.g., their valuation noise has a larger variance than other customers'), then we could probably raise the price without taking much risk. Or, if the customer habitually purchases goods from us, then they might have a higher expectation of our products (e.g., their parameter is \theta=a\theta^{*} with a>1), and we might take advantage of this and propose a higher price than we do to others. These situations would not arise in an auction-based setting (such as a live sale), but might frequently occur in less transparent settings, for instance, customized travel plans.