
No-Regret Learning for Stackelberg Equilibrium Computation in Newsvendor Pricing Games

Larkin Liu, Technical University of Munich; [email protected]
   Yuming Rong, Google; [email protected]
Abstract

We introduce the application of online learning in a Stackelberg game pertaining to a system with two learning agents in a dyadic exchange network, consisting of a supplier and retailer, specifically where the parameters of the demand function are unknown. In this game, the supplier is the first-moving leader, and must determine the optimal wholesale price of the product. Subsequently, the retailer who is the follower, must determine both the optimal procurement amount and selling price of the product. In the perfect information setting, this is known as the classical price-setting Newsvendor problem, and we prove the existence of a unique Stackelberg equilibrium when extending this to a two-player pricing game. In the framework of online learning, the parameters of the reward function for both the follower and leader must be learned, under the assumption that the follower will best respond with optimism under uncertainty. A novel algorithm based on contextual linear bandits with a measurable uncertainty set is used to provide a confidence bound on the parameters of the stochastic demand. Consequently, optimal finite time regret bounds on the Stackelberg regret, along with convergence guarantees to an approximate Stackelberg equilibrium, are provided.

footnotetext: An extended abstract was accepted at the 8th International Conference on Algorithmic Decision Theory (ADT 2024).

Keywords: Stackelberg Games, Online Learning, Dynamic Pricing

1 Introduction

We study the problem of dynamic pricing in a two-player general sum Stackelberg game. An economic interaction between learning agents in a dyadic exchange network is modelled as a form of economic competition between two firms in a connected supply chain. Economic competition can arise between self-interested firms in the intermediate stages of a supply chain, especially when they are non-centralized or not fully coordinated, operating as individual entities [Per07] [CZ99].

From a single agent perspective, online learning for dynamic pricing has been studied extensively, and known algorithms exist guaranteeing sub-linear regret [GP21] [BSW22] [Bab+15]. [KL03] first presented the problem of demand learning, where the firm must optimally learn the parameters of demand in an online learning setting; from a single-agent perspective, this has been previously studied in [BP06] and [PK01], to name a few. When more than one agent is involved, the dynamic pricing environment becomes more complex. [Bra+18] investigates the problem setting where a seller is selling to a regret-minimizing buyer. Under the assumption that the buyer is running a contextual bandit algorithm, a reward-maximizing algorithm for the seller, upper-bounded by a margin above the monopoly revenue, is provided. [GP21] applied linear contextual bandits to model multinomial logit demand in the multi-assortment dynamic pricing problem. [BSW22] investigates online learning for dynamic pricing with a partially linear demand function. Multi-agent dynamic pricing has also been previously studied in [DD00] and [KUP03], to name a few. Nevertheless, these aforementioned works do not consider the context of agents in a supply chain setting, where the inventory policy has an impact on the expected profits.

Dynamic pricing can also be considered in scenarios with limited supply [Bab+15], thereby limiting the retailer from profiting from the theoretical revenue maximizing price for the product. In situations where the retailer must order in advance the product to be sold, and bear the risk of limited supply, or supply surplus, this is known as a Newsvendor firm [AHM51] [Mil59] [HW63]. Specifically, learning and optimization within the Newsvendor framework have also been studied in [BR19]. When the Newsvendor must also dynamically price the item, this is referred to as the price-setting Newsvendor [JK13] [PD99] [Mil59], and methods to compute such optimal solutions under perfect information were first provided in [Mil59]. We further extend the well-known problem of online learning in a single agent dynamic pricing environment to a two agent competitive learning environment. One example of such competition is the rivalry between independent self-interested suppliers and retailers within a supply chain. Research by [Per07] and [Ma+22] has demonstrated that this type of competition can lead to a degradation in the general social welfare.

In our case we model the competition in a dyadic exchange network as a repeated Stackelberg game, which finds numerous applications in market economics [DD90] [AE92] and supply chain analysis [NKS19]. In previous work within the domain of supply chain management, the elements of parameter learning and regret minimization remain under-investigated. Specifically, in scenarios where suppliers and retailers engage in myopic supply chain competition aimed at profit maximization through the exchange of demand function information, as in [ZC13], equilibrium is often established straightforwardly under a perfect information setting. Most saliently, the ramifications of worst-case scenarios and suboptimal deviations in the face of uncertainty remain wholly unquantified. Therefore, addressing this challenge with mathematical rigor constitutes one of the focal points of our work.

In the framing of online learning for a supply chain Stackelberg game, [Ces+23] considered Newsvendor competition between supplier and retailer, under perfect information and limited information settings. Utilizing bandit learning techniques, such as explore-then-commit (ETC) and follow-the-leader (FTL), they are able to produce bounds on regret under mild assumptions [GLK16].

Our supply chain game consists of a leader (supplier) and a follower (retailer). The leader dynamically prices the item to sell to the follower, and the follower must thereafter dynamically price the item for the market.

The objective of this repeated game is to determine the optimal policy to maximize the expected profit in the face of stochastic demand for the leader, and determine the best response of the follower (under the assumptions in Section 2.4.2), leading to an approximate Stackelberg equilibrium. The rationale behind this is that, oftentimes, decisions between firms competing with one another do not occur simultaneously, but rather in a turn-based manner. We define this Newsvendor pricing game more concretely in Section 2.4.

In most cases, given the known parameters of the model, the best response of the follower can be computed for any first action of the leader. Assuming rationality of both agents, it follows from the best response of the follower that the optimal policy of the leader can be computed. However, the sample-efficient learning of the parameters of the joint reward function presents a genuine problem for Stackelberg games, as the leader is learning both the joint reward function of the system and computing the best response of the follower. To increase the sample efficiency of parameter learning in Stackelberg games, bandit algorithms have been proposed to address this problem [Bai+21] [FCR19] [FCR20] [Bal+15]. Our paper proposes online learning algorithms for Stackelberg games in a dyadic exchange network utilizing linear contextual bandits [APS11] [Chu+11], borrowing bounds on misspecification from these bandit methods to achieve theoretical guarantees on the behaviour of the algorithm in terms of regret and optimality, later discussed in Section 3.

1.1 Key Contributions

Our study integrates established economic theory with online learning theory within a practical game theoretic framework. Specifically, we employ a linear contextual bandit algorithm within a Stackelberg game to address a well-defined economic scenario. The primary technical challenge lies in optimizing for both contextual and Stackelberg regret, particularly when facing stochastic environmental parameters, as well as uncertainty in agent strategies. Incorporating economic theory into online learning theory is inherently challenging due to the unique characteristics of reward functions and best response functions, as expounded in Section 2.4.3. In some cases, closed-form solutions are unattainable, requiring innovative approaches for optimization and bounding functions, as expounded in Section 3.5, in order to derive theoretical worst-case bounds on regret.

To summarize, we provide technical contributions in the following areas:

  1. 1.

    We prove the existence of a unique Stackelberg equilibrium under perfect information for a general sum game within a Newsvendor pricing game (described in Section 2.4).

  2. 2.

    An online learning algorithm for the Newsvendor pricing game, leveraging stochastic linear contextual bandits, is outlined, which is integrated with established economic theory.

  3. 3.

    The convergence properties of an online learning algorithm to an approximate Stackelberg equilibrium, as well as theoretical guarantees for bounds on finite-time regret, are provided.

  4. 4.

    We demonstrate our theoretical results via economic simulations, to show that our learning algorithm outperforms baseline algorithms in terms of finite-time cumulative regret.

2 Model Formulation

Definitions: We define two players in a repeated Stackelberg game. Player A is the leader, she acts first with action 𝐚\mathbf{a}. Player B is the follower, he acts second with action 𝐛\mathbf{b}. The follower acts in response to the leader’s action, and both players earn a joint payoff as a function of the reward function. In the abstract sense, 𝐚\mathbf{a} and 𝐛\mathbf{b} denote the actions of the leader and follower in action spaces 𝒜\mathcal{A} and \mathcal{B} respectively. To be precise, given an action space with dimension d\mathbbm{R}^{d}, 𝐚\mathbf{a} and 𝐛\mathbf{b} are vectors of different dimensions (we later illustrate that 𝒜1\mathcal{A}\subseteq\mathbbm{R}^{1} and 2\mathcal{B}\subseteq\mathbbm{R}^{2} in Section 2.4.1). This is due to the fact that the leader’s and follower’s actions do not necessarily follow the same form.

2.1 Repeated Stackelberg Games

In a repeated Stackelberg game, the leader takes actions 𝐚𝒜\mathbf{a}\in\mathcal{A}, and the follower takes actions 𝐛\mathbf{b}\in\mathcal{B} in reaction to the leader. The joint action is represented as (𝐚t,𝐛t)(\mathbf{a}^{t},\mathbf{b}^{t}) for each time step tt from 11 to TT. The joint utility (reward) functions for the leader and follower are denoted as 𝒢A(𝐚t,𝐛t)\mathcal{G}_{A}(\mathbf{a}^{t},\mathbf{b}^{t}) and 𝒢B(𝐚t,𝐛t)\mathcal{G}_{B}(\mathbf{a}^{t},\mathbf{b}^{t}), respectively. From an economic standpoint, the application of pure strategies aligns with the prevalent strategic behaviours observed in many firms, emphasizing practicality and realism [PS80] [Tir88].

Strategy Definition: Let πA\pi_{A} and πB\pi_{B} denote the pure strategies of both agents represented as standard maps,

πA(t):t𝒜,πB(𝐚):𝒜,t{1tT}\displaystyle\pi_{A}(t):\mathcal{F}_{t}\to\mathcal{A},\quad\pi_{B}(\mathbf{a}):\mathcal{A}\to\mathcal{B},\quad t\in\mathbb{Z}\cap\{1\leq t\leq T\} (2.1)

The strategy of the leader at time tt is denoted by πA(t)\pi_{A}(t), which maps from the filtration t\mathcal{F}_{t} to the set of possible actions 𝒜\mathcal{A}, where the filtration t\mathcal{F}_{t} is generated by the observations up until time tt, i.e., t{(𝐚1,𝐛1),,(𝐚t,𝐛t)}\mathcal{F}_{t}\equiv\{(\mathbf{a}^{1},\mathbf{b}^{1}),\ldots,(\mathbf{a}^{t},\mathbf{b}^{t})\}. To clarify, tt belongs to the set of integers \mathbb{Z} such that 1tT1\leq t\leq T, where TT is a fixed finite-time endpoint. The action taken by the leader at time tt, denoted as 𝐚t𝒜\mathbf{a}^{t}\in\mathcal{A}, is determined by the leader’s strategy πA(t)\pi_{A}(t), which depends on the observed actions up to time tt. Therefore, πA\pi_{A} is a morphism from the set of filtrations to the set of actions in discrete time, πA:{1,,T}{𝐚1,,𝐚T}\pi_{A}:\{\mathcal{F}_{1},\ldots,\mathcal{F}_{T}\}\mapsto\{\mathbf{a}^{1},\ldots,\mathbf{a}^{T}\}. On the other hand, the strategy of the follower is formulated as a response to the leader’s action; therefore, it is a morphism, πB:{𝐚1,,𝐚T}{πB(𝐚1),,πB(𝐚T)}\pi_{B}:\{\mathbf{a}^{1},\dots,\mathbf{a}^{T}\}\mapsto\{\pi_{B}(\mathbf{a}^{1}),\dots,\pi_{B}(\mathbf{a}^{T})\}.

Best Response Function: To be specific, 𝒢B(𝐚t,𝐛t)\mathcal{G}_{B}(\mathbf{a}^{t},\mathbf{b}^{t}) is expressed as a parametric function with exogenous sub-Gaussian noise (the exact parametric form for 𝒢B(𝐚t,𝐛t)\mathcal{G}_{B}(\mathbf{a}^{t},\mathbf{b}^{t}) is provided in Eq. (2.14)). Conversely, for the leader, 𝒢A(𝐚t,𝐛t)\mathcal{G}_{A}(\mathbf{a}^{t},\mathbf{b}^{t}) is a deterministic function solely driven by the direct action of the leader 𝐚t\mathbf{a}^{t} followed by the reaction of the follower 𝐛t\mathbf{b}^{t}. At each discrete time interval, the best response of the follower, 𝔅(𝐚t)\mathfrak{B}(\mathbf{a}^{t}), is defined as follows:

𝔅(𝐚t)=argmax𝐛𝔼[𝒢B(𝐚t,𝐛)]\mathfrak{B}(\mathbf{a}^{t})=\underset{\mathbf{b}\in\mathcal{B}}{\mathrm{argmax}}\ \mathbbm{E}[\mathcal{G}_{B}(\mathbf{a}^{t},\mathbf{b})] (2.2)

Eq. (2.2) makes the assumption that there exists a unique best response to 𝐚t\mathbf{a}^{t}. We demonstrate the uniqueness of 𝔅(𝐚t)\mathfrak{B}(\mathbf{a}^{t}) later in Theorem 1. 𝔅(𝐚)\mathfrak{B}(\mathbf{a}) is the best response function of 𝐛\mathbf{b} to 𝐚\mathbf{a}, as defined in Equation (2.2) under perfect information. Given 𝔅(𝐚)\mathfrak{B}(\mathbf{a}) is unique, the objective for the leader is to maximize her reward based on the expectation of the follower’s best response. In a Stackelberg game, the leader takes actions {𝐚1,,𝐚T}\{\mathbf{a}^{1},\dots,\mathbf{a}^{T}\} over time and receives rewards {𝒢A(𝐚1,πB(𝐚1)),,𝒢A(𝐚T,πB(𝐚T))}\{\mathcal{G}_{A}(\mathbf{a}^{1},\pi_{B}(\mathbf{a}^{1})),\dots,\mathcal{G}_{A}(\mathbf{a}^{T},\pi_{B}(\mathbf{a}^{T}))\}. As the follower is always reacting to the leader with his perfect or approximate best response, he receives rewards {𝒢B(𝐚1,πB(𝐚1)),,𝒢B(𝐚T,πB(𝐚T))}\{\mathcal{G}_{B}(\mathbf{a}^{1},\pi_{B}(\mathbf{a}^{1})),\dots,\mathcal{G}_{B}(\mathbf{a}^{T},\pi_{B}(\mathbf{a}^{T}))\}. We shall use the notation (πA,πB)(\pi_{A},\pi_{B}) to represent the joint strategy of both players.

Contextual Regret (Follower): In the bandit setting with no state transitions, the leader’s strategy πA\pi_{A} represents a sequence consisting of actions, or contexts, 𝐚t\mathbf{a}^{t}, t{1,,T}\forall t\in\{1,\dots,T\} for the follower. The contextual regret is defined as,

RBT(πA,πB)\displaystyle R_{B}^{T}(\pi_{A},\pi_{B}) =𝔼θ[t=1T𝒢B(𝐚t,𝔅(𝐚t))t=1T𝒢B(𝐚t,𝐛t)]\displaystyle=\mathbbm{E}_{\theta}\Big{[}\sum^{T}_{t=1}\mathcal{G}_{B}(\mathbf{a}^{t},\mathfrak{B}(\mathbf{a}^{t}))-\sum^{T}_{t=1}\mathcal{G}_{B}(\mathbf{a}^{t},\mathbf{b}^{t})\Big{]} (2.3)

Contextual regret, defined as RBT()R_{B}^{T}(\cdot) in Eq. (2.3), is the difference between cumulative follower rewards from the best-responding follower and any other strategy adhered to by the follower in response to a series of actions from the leader, {𝐚1,,𝐚T}\{\mathbf{a}^{1},\dots,\mathbf{a}^{T}\}. Given context 𝐚t\mathbf{a}^{t} and environmental parameters θ\theta, the follower attempts to maximize his rewards via some no-regret learner (we choose an optimistic stochastic linear contextual bandit algorithm as outlined in Section 3). 𝔼θ[𝒢B()]\mathbbm{E}_{\theta}[\mathcal{G}_{B}(\cdot)] indicates that the expectation is taken with respect to the stochastic demand environment, governed by parameter θ\theta.

Stackelberg Regret (Leader): The Stackelberg regret, defined as RAT()R_{A}^{T}(\cdot) for the leader, is the difference in cumulative rewards when players optimize their strategies in hindsight, versus via any joint strategy (πA,πB)(\pi_{A},\pi_{B}). Stackelberg regret, also known as the external regret [Hag+22] [Zha+23], is defined from the leader’s perspective in Eq. (2.4) as the cumulative difference in leader’s rewards between the maximization of 𝒢A()\mathcal{G}_{A}(\cdot) under a best responding follower, versus any other joint strategy adhered to by both agents. This definition assumes that the rational leader must commit to a strategy πA\pi_{A} at the beginning of the game, which she believes is optimal on expectation. The leader maximizes her reward, taking into account her anticipated uncertainty of the follower’s response 𝐛t\mathbf{b}^{t}. When computing the expectation of 𝒢A()\mathcal{G}_{A}(\cdot), the randomness of 𝒢A()\mathcal{G}_{A}(\cdot) is fully strategic, that is, it arises from the follower’s action according to the strategy 𝐛tπB\mathbf{b}^{t}\sim\pi_{B}. Our algorithm seeks to minimize the Stackelberg regret, constituting a no-regret learner from the leader’s perspective.

RAT(πA,πB)\displaystyle R_{A}^{T}(\pi_{A},\pi_{B}) =𝔼πB[t=1Tmax𝐚𝒜𝒢A(𝐚,𝔅(𝐚t))t=1T𝒢A(𝐚t,𝐛t)]\displaystyle=\mathbbm{E}_{\pi_{B}}\Big{[}\sum_{t=1}^{T}\underset{\mathbf{a}\in\mathcal{A}}{\mathrm{max}}\ \mathcal{G}_{A}(\mathbf{a},\mathfrak{B}(\mathbf{a}^{t}))-\sum_{t=1}^{T}\,\mathcal{G}_{A}(\mathbf{a}^{t},\mathbf{b}^{t})\Big{]} (2.4)

For convenience of notation, moving forward, we will establish the equivalences for 𝔼πB[𝒢A()]𝔼[𝒢A()]\mathbbm{E}_{\pi_{B}}[\mathcal{G}_{A}(\cdot)]\equiv\mathbbm{E}[\mathcal{G}_{A}(\cdot)] and 𝔼θ[𝒢B()]𝔼[𝒢B()]\mathbbm{E}_{\theta}[\mathcal{G}_{B}(\cdot)]\equiv\mathbbm{E}[\mathcal{G}_{B}(\cdot)].
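For illustration, the following minimal sketch shows how the empirical counterparts of Eq. (2.3) and Eq. (2.4) could be tallied from logged play; it assumes access to the benchmark (best-response and hindsight-optimal) reward sequences and is not part of our formal development.

def contextual_regret(follower_benchmark_rewards, follower_rewards):
    # Eq. (2.3): sum_t [ G_B(a^t, BR(a^t)) - G_B(a^t, b^t) ].
    return sum(r_star - r for r_star, r in
               zip(follower_benchmark_rewards, follower_rewards))

def stackelberg_regret(leader_benchmark_rewards, leader_rewards):
    # Eq. (2.4): sum_t [ hindsight-optimal leader reward - G_A(a^t, b^t) ].
    return sum(r_star - r for r_star, r in
               zip(leader_benchmark_rewards, leader_rewards))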

2.2 Literature Review on Learning in Stackelberg Games

Source Prob. Setting Info. Methodology Theorem Regret
[Zha+23] Omniscient follower with noise. Partial Info. UCB with expert information. Thm. 3.3 𝒪(ϵ^T+N(ϵ^)T)\mathcal{O}(\hat{\epsilon}T+\sqrt{N(\hat{\epsilon})T})
[Hag+22] Stackelberg Security Game. Partial Info. Batched binary search, with γ\gamma discounting, and delayed feedback. Thm. 3.13 𝒪(log(T)+Tγlog2(Tγ))\mathcal{O}(\log(T)+T_{\gamma}\log^{2}{(T_{\gamma})})
[Hag+22] Dynamic pricing. Partial Info. Successive elimination with upper-confidence bound. Thm. 4.7 𝒪(Tlog(T)+Tγlog2(TγT))\mathcal{O}(\sqrt{T\log(T)}+T_{\gamma}\log^{2}(T_{\gamma}T)) or 𝒪(log(T)+T)\mathcal{O}(\log(T)+T)
[FKM04] Convex joint reward function* Partial Info. Gradient approximation. Thm. 2 𝒪(P(T)+log3/4(T))\mathcal{O}(P(T)+\log^{3/4}{(T)})
[CLP19] Spam Classification Full Info. Adaptive polytope partitioning. Thm. 4.1 𝒪(Tlog((T)))\mathcal{O}(\sqrt{T\log{(T)}})
[Bal+15] Stackelberg Security Game Full Info. Follow the leader. Thm. 5.1 𝒪(T)\mathcal{O}(\sqrt{T})
Table 1: Worst case regret comparison among learning algorithms for Stackelberg games.

We classify game theoretic learning as consisting of two key aspects. The first is the learning of the opposing player’s strategy. In particular, the strategy of the follower can be optimal, in the case of an omniscient follower, or ϵ\epsilon-optimal [CLP19] [Bal+15] [Gan+23], in the case of an imperfect best response stemming from hindered information sharing or suboptimal follower policies. In this setting, the parameters of the reward function θ\theta^{*} are known, and the players are learning an approximate or perfect equilibrium strategy. Notably, [Bal+15] demonstrated that there always exists an algorithm that will produce a worst case leader regret bound of 𝒪(log((T)))\mathcal{O}(\log{(T)}). In another example, [CLP19] presents a method where polytopes constructed within the joint action space of the players can be adaptively partitioned to produce a learning algorithm achieving sublinear regret under perfect information, demonstrated in a spam classification setting.

The second aspect of online learning in SGs pertains to the learning of the parameters of the joint reward function θ\theta^{*}, which corresponds to 𝒢B()\mathcal{G}_{B}(\cdot) and influences 𝒢A()\mathcal{G}_{A}(\cdot). In this environment, θ\theta^{*} is unknown or only partially known, and must be learned over time in a bandit setting. Several approaches address this problem setting. For example, a worst case regret bound of 𝒪(ϵ^T+N(ϵ^)T)\mathcal{O}(\hat{\epsilon}T+\sqrt{N(\hat{\epsilon})T}) was proven in [Zha+23], where the authors use an ϵ^\hat{\epsilon}-covering method to achieve the bound on regret, where N(ϵ^)N(\hat{\epsilon}) is the ϵ^\hat{\epsilon}-covering number over the joint action space. [Hag+22] produce a worst-case regret bound of 𝒪(log(T)+Tγlog2(Tγ))\mathcal{O}(\log(T)+T_{\gamma}\log^{2}{(T_{\gamma})}) by leveraging batched binary search in a demand learning problem setting. Furthermore, they investigate the problem of a non-myopic follower, where the anticipation of future rewards, discounted by a factor γ\gamma, induces the follower to best respond suboptimally. Tγ=1/(1γ)T_{\gamma}=1/(1-\gamma) represents a parameter proportional to the delay in the revelation of the follower’s actions by a third party. Under this mechanism, the regret is proportional to both TT as well as γ\gamma, where for values of γ\gamma close to 1 (close to non-discounting), TγT_{\gamma} could potentially dominate the regret. [FKM04] provide a bound for a convex joint reward function using approximate gradient descent, where P(T)P(T) represents a polynomial function with respect to TT. Note that in the partial information setting, the bound on worst-case leader regret is typically superlinear.

2.3 Approximate Stackelberg Equilibrium

The Stackelberg Equilibrium of this game is defined as any joint strategy which,

𝐚\displaystyle\mathbf{a}^{*} =argmax𝐚𝒜𝔼[𝒢A(𝐚,𝔅(𝐚))]\displaystyle=\underset{\mathbf{a}\in\mathcal{A}}{\mathrm{argmax}}\ \mathop{\mathbb{E}}[\mathcal{G}_{A}(\mathbf{a},\mathfrak{B}(\mathbf{a}))] (2.5)

Where 𝔅(𝐚)\mathfrak{B}(\mathbf{a}) is the best response of the follower, as defined in Eq. (2.2). Let ΔB\Delta_{B} denote a probability simplex over possible strategies πB\pi_{B}, thus the expectation on 𝔼πB[]\mathbbm{E}_{\pi_{B}}[\cdot] is with respect to πB\pi_{B}. From the follower’s perspective, suppose he adopts an approximate best response strategy π^Bπ^B(𝐚)\hat{\pi}_{B}\equiv\hat{\pi}_{B}(\mathbf{a}). This implies that there exists some ϵB0\epsilon_{B}\geq 0, such that,

𝔼θ[𝒢B(𝐚,π^B)]𝔼θ[𝒢B(𝐚,𝔅(𝐚))]ϵB,𝐚𝒜,πB()ΔB\displaystyle\mathbbm{E}_{\theta}[\mathcal{G}_{B}(\mathbf{a},\hat{\pi}_{B})]\geq\mathbbm{E}_{\theta}[\mathcal{G}_{B}(\mathbf{a},\mathfrak{B}(\mathbf{a}))]-\epsilon_{B},\quad\forall\mathbf{a}\in\mathcal{A},\quad\forall\pi_{B}(\cdot)\in\Delta_{B} (2.6)

This formalization allows for an ϵ\epsilon deviation, potentially making the problem computationally tractable, particularly when optimization over certain functions lacks simple closed-form solutions or is computationally intensive (as we explore in detail in Section 2.4.3). It is worth noting that the expectation for the leader reflects uncertainty about the follower’s strategy, while the expectation for the follower relates to the randomness of the environment. Acknowledging the presence of error in the best response function, resulting from the possibility that the follower acts imperfectly, or strategically with the purpose of influencing the leader’s strategy, we propose an ϵ\epsilon-approximate equilibrium with full support [Rou10] [DGP09].

Upper-Bounding Approximation Function: Let 𝒰B(𝐚,𝐛)\mathcal{U}_{B}(\mathbf{a},\mathbf{b}) represent a Lipschitz function, 𝒰B:𝒜×+\mathcal{U}_{B}:\mathcal{A}\times\mathcal{B}\to\mathbb{R}^{+}, such that it is consistently greater than or equal to 𝔼[𝒢B(𝐚,𝐛)]\mathbbm{E}[\mathcal{G}_{B}(\mathbf{a},\mathbf{b})]. The selection of this upper-bounding function is appropriate within our problem framework for two key reasons. Firstly, it leverages established theory on economic pricing, allowing us to derive a rational best response for the follower (see Section 2.4.3). Secondly, its adoption aligns with our implementation of an optimistic bandit algorithm, ensuring an inherent tendency to overestimate reward potential in the face of uncertainty.

𝒰B(𝐚,𝐛)\displaystyle\mathcal{U}_{B}(\mathbf{a},\mathbf{b}) 𝔼[𝒢B(𝐚,𝐛)],𝐚𝒜,𝐛\displaystyle\geq\mathbbm{E}[\mathcal{G}_{B}(\mathbf{a},\mathbf{b})],\ \forall\mathbf{a}\in\mathcal{A},\ \mathbf{b}\in\mathcal{B} (2.7)

Therefore, we define the approximate follower best response as,

𝔅~(𝐚)\displaystyle\widetilde{\mathfrak{B}}(\mathbf{a}) =argmaxb𝒰B(𝐚,𝐛)\displaystyle=\underset{b\in\mathcal{B}}{\mathrm{argmax}}\ \mathcal{U}_{B}(\mathbf{a},\mathbf{b}) (2.8)

In accordance with Eq. (2.7), we note that the follower reward utilizing the approximate best response is less than that of the optimal best response as a consequence, where 𝔼[𝒢B(a,𝔅~(a))]𝔼[𝒢B(a,𝔅(a))]\mathop{\mathbb{E}}[\mathcal{G}_{B}(a,\widetilde{\mathfrak{B}}(a))]\leq\mathop{\mathbb{E}}[\mathcal{G}_{B}(a,\mathfrak{B}(a))]. Furthermore, in our problem definition, the follower approximates his best response with 𝔅~(𝐚)\widetilde{\mathfrak{B}}(\mathbf{a}) such that, across its domain of 𝔼[𝒢B()]\mathbb{E}[\mathcal{G}_{B}(\cdot)], the image of the approximating function, 𝒰()\mathcal{U}(\cdot), is greater than or equal to the image of 𝔼[𝒢B()]\mathop{\mathbb{E}}[\mathcal{G}_{B}(\cdot)]. The motivation behind this is that 𝒰()\mathcal{U}(\cdot) can represent a simple parametric function with a known analytical solution, whereas 𝔼[𝒢B()]\mathbb{E}[\mathcal{G}_{B}(\cdot)] could represent a more elaborate function, with no known exact analytical solution. Therefore it is possible that 𝒰()\mathcal{U}(\cdot) and 𝔼[𝒢B()]\mathbb{E}[\mathcal{G}_{B}(\cdot)] may not necessarily belong to the same parametric family, nor adhere to the same functional form.

Approximate SE: As a consequence, we can therefore infer that there exists some additive ϵB0\epsilon_{B}\geq 0 which upper-bounds this approximation error in the image of 𝔼[𝒢B()]\mathbb{E}[\mathcal{G}_{B}(\cdot)]. We define an approximate ϵ\epsilon-Stackelberg equilibrium with ϵB\epsilon_{B} approximation error from the follower’s perspective as follows,

𝔼[𝒢B(𝐚,𝔅~(𝐚))]𝔼[𝒢B(𝐚,𝔅(𝐚))]𝔼[𝒢B(𝐚,𝔅~(𝐚))]+ϵB,𝐚𝒜\mathop{\mathbb{E}}[\mathcal{G}_{B}(\mathbf{a},\widetilde{\mathfrak{B}}(\mathbf{a}))]\leq\mathop{\mathbb{E}}[\mathcal{G}_{B}(\mathbf{a},\mathfrak{B}(\mathbf{a}))]\leq\mathop{\mathbb{E}}[\mathcal{G}_{B}(\mathbf{a},\widetilde{\mathfrak{B}}(\mathbf{a}))]+\epsilon_{B},\quad\forall\mathbf{a}\in\mathcal{A} (2.9)

Effectively, ϵB\epsilon_{B} defines an upper bound on the range in which 𝔼[𝒢B(𝐚,𝔅(𝐚))]\mathop{\mathbb{E}}[\mathcal{G}_{B}(\mathbf{a},\mathfrak{B}(\mathbf{a}))] could lie, given the approximate best response under perfect information, defined as 𝔅~(𝐚)\widetilde{\mathfrak{B}}(\mathbf{a}). ϵB\epsilon_{B} represents the worst case additive approximation error of 𝔼[𝒢B(𝐚,𝔅(𝐚))]\mathbbm{E}[\mathcal{G}_{B}(\mathbf{a},\mathfrak{B}(\mathbf{a}))], where ϵB>0\epsilon_{B}>0.

ϵB\displaystyle\epsilon_{B} =sup𝐚𝒜(𝔼[𝒢B(𝐚,𝔅(𝐚))]𝔼[𝒢B(𝐚,𝔅~(𝐚))]),ϵB>0\displaystyle=\underset{\mathbf{a}\in\mathcal{A}}{\mathrm{sup}}\ \Big{(}\mathbbm{E}[\mathcal{G}_{B}(\mathbf{a},\mathfrak{B}(\mathbf{a}))]-\mathbbm{E}[\mathcal{G}_{B}(\mathbf{a},\widetilde{\mathfrak{B}}(\mathbf{a}))]\Big{)},\quad\epsilon_{B}>0 (2.10)

The suboptimality of 𝔼[𝒢B(𝐚,𝔅~(𝐚))]\mathop{\mathbb{E}}[\mathcal{G}_{B}(\mathbf{a},\widetilde{\mathfrak{B}}(\mathbf{a}))] is attributed to the error in the approximation of the theoretically optimal response 𝔅(𝐚)\mathfrak{B}(\mathbf{a}) with 𝔅~(𝐚)\widetilde{\mathfrak{B}}(\mathbf{a}), in response to the leader’s action 𝐚\mathbf{a} (the economic definitions of 𝔅()\mathfrak{B}(\cdot), 𝒰B()\mathcal{U}_{B}(\cdot) and 𝒢B()\mathcal{G}_{B}(\cdot) are later illustrated in Section 2.4). In our learning scenario, the follower has access to 𝒰()\mathcal{U}(\cdot), which shares the same parameters as, but differs in functional form from, 𝔼[𝒢B()]\mathbb{E}[\mathcal{G}_{B}(\cdot)]. This allows us to devise a learning algorithm which converges up to the approximation error ϵB\epsilon_{B}. Thus, any algorithm which learns an ϵ\epsilon-approximate Stackelberg equilibrium is formally defined as an algorithm that converges to an equilibrium solution with ϵB\epsilon_{B} approximation error, as defined in Eq. (2.10).

2.4 Multi-Agent Economic Setting - the Newsvendor Pricing Game (NPG)

We model the two learning agents in a Newsvendor pricing game, involving a supplier AA and a retailer BB. The leader, a supplier, is learning to dynamically price the product for the follower, a retailer, aiming to maximize her reward. To achieve this, the follower adheres to classical Newsvendor theory, which involves finding the optimal order quantity given a known demand distribution before the realization of the demand. We provide further economic details in Section 2.4.3.

Differing from the closely related work of [Ces+23], we make some relaxations to their problem setting. Firstly, we assume no production cost for the item, which provides a non-consequential change to the economic theory. Secondly, we assume full observability of the realized demand of both the supplier and retailer. This is commonplace in settings where the realized sales must be disclosed to an upstream supplier, which could occur as a stipulation of the supplier contract, and/or when the sales ecosystem is part of an integrated eCommerce platform.

Nevertheless, our problem setting is more sophisticated than that of [Ces+23] in multiple ways. Firstly, in addition to determining the optimal order amount (known as the Newsvendor problem), the follower must also dynamically determine the optimal retail price for the product in the face of demand uncertainty (known as the price-setting Newsvendor). Thereby, we introduce the problem of demand learning coupled with inventory risk. Secondly, in our problem setting, although the linear relation of price to demand pdθ(p)p\sim d_{\theta}(p) is assumed, the distribution of neither demand nor market price is explicitly given. Third, we apply the optimism under uncertainty linear bandit (OFUL) [APS11] to the Stackelberg game, whereas in [Ces+23] the Piyavskii-Shubert [Bou+20] algorithm is applied. Lastly, we provide Stackelberg regret guarantees on the cumulative regret, whereas [Ces+23] provides guarantees on the simple regret.

2.4.1 Rules of the Newsvendor Pricing Game

Moving forward, and away from abstraction, we explicitly denote a𝐚1a\equiv\mathbf{a}\in\mathbbm{R}^{1}, and 𝐛[b,p]2\mathbf{b}\equiv[b,p]^{\intercal}\in\mathbbm{R}^{2}, where aa denotes the wholesale price of the supplier firm, and pp and bb denote the retail price and order amount of the retail firm, respectively.

  1. 1.

    The supplier selects wholesale price aa, and provides it to the retailer.

  2. 2.

    Given wholesale cost aa, the retailer reacts with his best response [b,p][b,p]^{\intercal}, consisting of retail price pp, and order amount bb.

  3. 3.

    As the retailer determines the optimal order amount bb, he pays 𝒢A(a,b)=ab\mathcal{G}_{A}(a,b)=ab to the supplier.

  4. 4.

    At time tt, nature draws demand dtdθ(p)d^{t}\sim d_{\theta}(p), and it is revealed to the retailer.

  5. 5.

    The retailer makes a profit of 𝒢B(p,b|a)=pmin{dt,b}ab\mathcal{G}_{B}(p,b|a)=p\ \min\{d^{t},b\}-ab.

  6. 6.

    Steps 1 to 5 are repeated for t1Tt\in 1...T iterations (a minimal simulation sketch of a single round is given after this list).
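For illustration, the following is a minimal simulation sketch of a single round of steps 1 to 5. It assumes the linear additive demand of Eq. (2.11)-(2.12) with Gaussian noise, known demand parameters, and the riskless price and critical-fractile order rule discussed later in Sections 2.4.3 and 3.2; all numerical values are purely illustrative.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta0, theta1, sigma = 100.0, 2.0, 5.0         # illustrative demand parameters (assumed known here)

a = 10.0                                        # step 1: supplier selects wholesale price a
p = (theta0 / theta1 + a) / 2.0                 # step 2: retailer's riskless price (cf. Section 3.2)
mu = max(0.0, theta0 - theta1 * p)              # expected demand at price p, Eq. (2.11)
b = norm.ppf((p - a) / p, loc=mu, scale=sigma)  # step 2: order amount via the critical fractile, Eq. (2.16)

supplier_profit = a * b                         # step 3: G_A(a, b) = a * b
d = mu + sigma * rng.standard_normal()          # step 4: nature draws demand d^t ~ d_theta(p)
retailer_profit = p * min(d, b) - a * b         # step 5: G_B = p * min(d^t, b) - a * b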

2.4.2 Economic Game Assumptions

From the price setting Newsvendor, the majority of the assumptions from [Mil59] are inherited as standard practice. In our current problem setting, we explicitly state the assumptions of the Newsvendor pricing game as follows:

  1. 1.

    Expected demand, Γθ(p)\Gamma_{\theta}(p), is an additive linear model, as illustrated later in Eq. (2.11).

  2. 2.

    Inventory is perishable, and we assume no goodwill costs and/or salvage value.

  3. 3.

    Market demand, dtd^{t}, and the action history, (at,bt,pt)(a^{t},b^{t},p^{t}), are fully observable to both retailer and supplier.

  4. 4.

    The follower is rational and greedy; he maximizes his immediate reward under the available information and best response computation.

  5. 5.

    No deviation, stochastic or otherwise, exists between the amount of goods ordered by the retailer, btb^{t}, and the amount of goods received from the supplier.

  6. 6.

    Decisions are made during fixed periodic discrete time intervals.

[Figure 1 diagram: the leader (supplier) issues a wholesale price aa to the follower (retailer); the retailer responds with an order quantity bab_{a} and a retail price pap_{a} for the market, facing demand d(pa)d(p_{a}); rewards 𝒢B(pa,ba)=pamin{d(pa),ba}\mathcal{G}_{B}(p_{a},b_{a})=p_{a}\min\{d(p_{a}),b_{a}\} and 𝒢A(a)=aba\mathcal{G}_{A}(a)=ab_{a}.]
Figure 1: The Newsvendor Pricing Game. In this Stackelberg game, there is a logistics network between a supplier (leader) and a retailer (follower), where the utility functions are not necessarily supermodular; the supplier issues a wholesale price aa, and the retailer issues a purchase quantity bb and a retail price pp in response.

At each time period the leader dynamically selects a wholesale price, aa. The follower must make two decisions in response to aa, the order amount, bb, and the market price pp (which we show later, by consequence of Eq. 2.15, are unique bijections). The uniqueness of the optimal bab_{a}^{*}, given aa, is proven in Theorem 1. This game can be viewed as an ultimatum game, where the retailer is constrained to accept or reject the terms dictated by the supplier. However, the inclusion of inventory risk and demand parameter uncertainty in our model complicates this dynamic. The retailer must weigh his ordering decision against his risk assessment in the face of these uncertainties. Thus the problem setting can be interpreted as a variant of the ultimatum game, where a learning agent operates under market uncertainty.

2.4.3 Economic Theory of the Newsvendor Pricing Game

Stochastic demand is represented in Eq. 2.12, and is governed by the linear additive demand function Γθ(p)\Gamma_{\theta}(p) in Eq. 2.11, which represents the expected demand, 𝔼[d(p)]\mathbbm{E}[d(p)], as a function of pp. The demand function is governed by parameters θ\theta.

Γθ(p)\displaystyle\Gamma_{\theta}(p) =max{0,θ0θ1p},θ00,θ10\displaystyle=\max\{0,\theta_{0}-\theta_{1}p\},\quad\theta_{0}\geq 0,\ \theta_{1}\geq 0 (2.11)
dθ(p)\displaystyle d_{\theta}(p) =Γθ(p)+ϵ,ϵ∼𝒩(0,σ)\displaystyle=\Gamma_{\theta}(p)+\epsilon,\quad\epsilon\sim\mathcal{N}(0,\sigma) (2.12)

We stipulate that noise ϵ\epsilon\in\mathop{\mathbb{R}} is zero mean and σ\sigma-sub-Gaussian. The retailer profit, 𝒢B(p,b|a)\mathcal{G}_{B}(p,b|a), is restricted by the order quantity bb and, given wholesale price aa, is therefore equal to,

𝒢A(a,b)\displaystyle\mathcal{G}_{A}(a,b) =ab\displaystyle=ab (2.13)
𝒢B(p,b|a)\displaystyle\mathcal{G}_{B}(p,b|a) ={dθ(p)pab,0dθ(p)bb(pa),b<dθ(p)\displaystyle=\begin{cases}d_{\theta}(p)p-ab,\quad&0\leq d_{\theta}(p)\leq b\\ b(p-a),\quad&b<d_{\theta}(p)\\ \end{cases} (2.14)

The retailer’s profit should be greater than the supplier’s wholesale price multiplied by the order quantity, denoted as abab. Let fθ()f_{\theta}(\cdot) be the demand probability density function given expected demand Γθ(p)\Gamma_{\theta}(p), and Fθ()F_{\theta}(\cdot) be the cumulative demand distribution function. As the noise with respect to dθ(p)d_{\theta}(p) is σ\sigma-sub-Gaussian, by the additive property so is the noise with respect to the retailer profit, whose expectation given supplier cost aa is 𝔼[𝒢B(p,b|a)]\mathbbm{E}[\mathcal{G}_{B}(p,b|a)]. Given any price pp, we can compute the expected profit of the retailer 𝒢B\mathcal{G}_{B} as,

𝔼[𝒢B(p,b|a)]=p0bxfθ(x)𝑑x+pb(1Fθ(b))ab\displaystyle\mathbbm{E}[\mathcal{G}_{B}(p,b|a)]=p\int_{0}^{b}xf_{\theta}(x)dx+pb(1-F_{\theta}(b))-ab (2.15)

Given Γθ(p)\Gamma_{\theta}(p) and ϵ𝒩(0,σ)\epsilon\sim\mathcal{N}(0,\sigma) we can obtain a probability distribution fθ(x)f_{\theta}(x), which has a unique solution [AHM51]. The retailer selects a price pp, generating demand dΓθ(p)d\sim\Gamma_{\theta}(p), and must then decide on an amount bb to order from the supplier. The general solution to the Newsvendor problem is to maximize expected profit via the order quantity 𝔅(a)\mathfrak{B}(a); the unique closed form solution for the optimal order amount b0(p,a)b_{0}(p,a) is presented in Eq. (2.16). In classical Newsvendor theory, the optimal solution is a function of the value γ\gamma, referred to as the critical fractile, where γ=(pa)/p\gamma=(p-a)/p.

b0(p,a)=Fθ1(γ)=Fθ1(pap)\displaystyle b_{0}(p,a)=F_{\theta}^{-1}(\gamma)=F_{\theta}^{-1}\Big{(}\frac{p-a}{p}\Big{)} (2.16)

Unlike typical dynamic pricing, in a price-setting Newsvendor problem setting, the reward maximizing price pp^{*} is not necessarily the same as the profit maximizing price p0p_{0} under no inventory risk, also known as the riskless price. From [Mil59] we know that the optimal price to set in the price-setting Newsvendor with linear additive demand is,

p\displaystyle p^{*} =p0Ψθ(b)2θ1,where,Ψθ(b)=b(xb)fθ,p(x)𝑑x\displaystyle=p_{0}-\frac{\Psi_{\theta}(b)}{2\theta_{1}},\quad\text{where,}\quad\Psi_{\theta}(b)=\int_{b}^{\infty}(x-b)f_{\theta,p}(x)dx (2.17)

Here, fθ,p(x)f_{\theta,p}(x) is the probability density of demand at the value xx, under demand parameters θ\theta and price pp, with expected demand 𝔼[dθ(p)]=Γθ(p)\mathbbm{E}[d_{\theta}(p)]=\Gamma_{\theta}(p).

Lemma 2.1.

Invariance of Ψθ(b)\Psi_{\theta}(b): Assuming symmetric behaviour of fθ,p()f_{\theta,p}(\cdot) with respect to Γθ(p)\Gamma_{\theta}(p) and constant standard deviation σ\sigma, the term representing supply shortage probability defined as Ψθ(b)\Psi_{\theta}(b), is invariant of the estimate of θ\theta. (Proof in Appendix A.2.)

The optimal price pp^{*} satisfies the relation pp0p^{*}\leq p_{0}. fθ,p(x)f_{\theta,p}(x) expresses the probability of demand being equal to xx, under demand parameters θ\theta and price pp. This defines a unique relation between bb and p(b)p(b), as evidenced by Eq. (2.17). Knowing this relation, it is also logical to see that the riskless price, p0p_{0}, of the retailer must always be above the wholesale price, aa, as introduced in the constraint; thus, a<pp0a<p0a<p^{*}\leq p_{0}\xrightarrow[]{}a<p_{0}.
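To make the computation of pp^{*} concrete, the following minimal numerical sketch solves Eq. (2.16) and Eq. (2.17) jointly via a simple fixed-point iteration, assuming Gaussian demand noise; the particular iteration scheme and the parameter values are illustrative and are not prescribed by our analysis.

import numpy as np
from scipy.stats import norm

theta0, theta1, sigma, a = 100.0, 2.0, 5.0, 10.0     # illustrative parameters
p0 = (theta0 / theta1 + a) / 2.0                     # riskless price maximizing (theta0 - theta1*p)(p - a)

def psi(b, mu, s):
    # Expected shortage E[(d - b)^+] for d ~ N(mu, s^2), i.e., Psi_theta(b) in Eq. (2.17).
    z = (b - mu) / s
    return s * (norm.pdf(z) - z * (1.0 - norm.cdf(z)))

p = p0
for _ in range(100):                                 # fixed-point iteration on Eqs. (2.16)-(2.17)
    mu = max(0.0, theta0 - theta1 * p)               # expected demand at the current price
    b = norm.ppf((p - a) / p, loc=mu, scale=sigma)   # critical-fractile order quantity, Eq. (2.16)
    p_new = p0 - psi(b, mu, sigma) / (2.0 * theta1)  # price update, Eq. (2.17)
    if abs(p_new - p) < 1e-10:
        break
    p = p_new
# At convergence, p <= p0 (as stated above) and b is the corresponding optimal order amount.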

Theorem 1.

Uniqueness of Best Response under Perfect Information: In the perfect information NPG (from Sec. 2.4), with a known demand function, under the assumptions listed in Sec. 2.4.2, the optimal order amount bab^{*}_{a} which maximizes expected profit of the retailer 𝔼[𝒢B()]\mathbbm{E}[\mathcal{G}_{B}(\cdot)] is 𝔅(a)\mathfrak{B}(a) and is a surjection from a𝔅(a)a\xrightarrow[]{}\mathfrak{B}(a), and a bijection when a<p¯a<\bar{p}. Furthermore, within the range a<p¯a<\bar{p}, as aa monotonically increases, bab_{a}^{*} is monotonically decreasing. Where p¯\bar{p} denotes the optimistic estimate of the riskless price p0p_{0}. (Proof in Appendix A.1.)

Theorem 2.

Unique Stackelberg Equilibrium: A unique pure strategy Stackelberg equilibrium exists in the full-information Newsvendor pricing game. (Proof in Appendix A.4.)

Sketch of Proof (Theorem 1 and 2): Under perfect information, the unique solution to a𝔅(a)a\xrightarrow[]{}\mathfrak{B}(a) in Eq. (2.16) can be obtained, so long as 𝔅(a)>0\mathfrak{B}(a)>0. If each leader action aa elicits a unique action 𝔅(a)\mathfrak{B}(a), then there must exist at least one optimal action for the leader in a𝒜a\in\mathcal{A}. Note that, under perfect information, the follower would always act in accordance with his best approximation of 𝔅(a)\mathfrak{B}(a), which is unique under the assumptions in Section 2.4.2. This determines a unique best response. (Mathematical details are given in Appendix A.1 and A.4).

Evident from the previous discourse, 𝔼[𝒢B(p,b|a)]\mathbbm{E}[\mathcal{G}_{B}(p^{*},b^{*}|a)], representing the optimal expected reward (profit) function for the follower under perfect information, does not lend itself to a straightforward computation. Instead, it necessitates an inverse lookup of a cumulative distribution function, from Eq. (2.16), which has no parametric representation nor closed form solution. In addition, the solution to an integral equation from Eq. (2.17) is required, which also lacks a closed-form solution. Consequently, this motivates the introduction of an upper-bounding approximation function, such as 𝒰B(p|a)\mathcal{U}_{B}(p|a) (represented in Eq. (2.18)), which both bounds 𝔼[𝒢B(p,b|a)]\mathbbm{E}[\mathcal{G}_{B}(p^{*},b|a)], and is characterized by a straightforward parametric structure, (θ0θ1p)(pa)(\theta_{0}-\theta_{1}p)(p-a). The absence of the variable bb from the argument is noteworthy, albeit without consequence, as 𝒰B(p|a)\mathcal{U}_{B}(p|a) effectively provides an upper bound for 𝔼[𝒢B(p,b|a)]\mathbbm{E}[\mathcal{G}_{B}(p,b|a)] across all admissible values of bb. Subsequently, we can leverage this approximation within an online learning framework.

𝔼[𝒢B(p,b|a)]𝒰B(p|a)\displaystyle\mathbbm{E}[\mathcal{G}_{B}(p,b|a)]\leq\mathcal{U}_{B}(p|a) =(θ0θ1p)(pa)\displaystyle=(\theta_{0}-\theta_{1}p)(p-a) (2.18)

Note that the linear parametric form is preserved under this representation, where 𝒰B(p|a)=θX\mathcal{U}_{B}(p|a)=\theta X, with θ[θ0,θ1]\theta\equiv[\theta_{0},\theta_{1}] and X[(pa),(p2ap)]X\equiv[(p-a),-(p^{2}-ap)]^{\intercal}.

Economic Interpretation of 𝒰B(p|a)\mathcal{U}_{B}(p|a): 𝒰B(p|a)\mathcal{U}_{B}(p|a) in Eq. (2.18) represents the theoretical reward the follower could achieve if all realized demand was met, and no cost of inventory surplus or shortage is considered. This is simply the unit profit margin, denoted as pap-a, multiplied by the expected demand generated by the price, denoted as θ0θ1p\theta_{0}-\theta_{1}p, assuming all inventory is sold. Therefore, this riskless profit, 𝒰B(p|a)\mathcal{U}_{B}(p|a), is effectively a concave polynomial with respect to pp, and is always greater than or equal to the expected reward 𝔼[𝒢B(p,b|a)]\mathbbm{E}[\mathcal{G}_{B}(p,b|a)] with inventory risk (a key result from [Mil59] [PD99]).
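As a minimal numerical check of Eq. (2.18), the following sketch compares the riskless profit 𝒰B(p|a)\mathcal{U}_{B}(p|a) against the expected profit of Eq. (2.15) for Gaussian demand at a few illustrative price points; the parameter values are assumptions for illustration only.

import numpy as np
from scipy.stats import norm

theta0, theta1, sigma, a = 100.0, 2.0, 5.0, 10.0     # illustrative parameters

def expected_profit(p, b):
    # E[G_B(p, b | a)] of Eq. (2.15) for Gaussian demand: p*E[min(d, b)] - a*b,
    # using E[min(d, b)] = mu - E[(d - b)^+].
    mu = max(0.0, theta0 - theta1 * p)
    z = (b - mu) / sigma
    shortage = sigma * (norm.pdf(z) - z * (1.0 - norm.cdf(z)))
    return p * (mu - shortage) - a * b

def riskless_upper_bound(p):
    # U_B(p|a) = (theta0 - theta1*p)(p - a), Eq. (2.18).
    return (theta0 - theta1 * p) * (p - a)

for p in (20.0, 25.0, 30.0):
    b = norm.ppf((p - a) / p, loc=theta0 - theta1 * p, scale=sigma)  # order via Eq. (2.16)
    assert expected_profit(p, b) <= riskless_upper_bound(p) + 1e-9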

3 Online Learning Algorithms

We present a learning algorithm for the Newsvendor Pricing Game (NPG) involving two learning agents, namely two firms representing a supplier (leader) and retailer (follower), in a repeated Stackelberg game. In this online learning setting, both agents must learn the parameters of a reward function that is derived from a linear demand function. This presents additional challenges for both agents. For the leader, she bears the opportunity costs of suboptimal pricing given the best response of the retailer. For the retailer, he bears the cost of additional surplus or shortage when inaccurately estimating the demand function. These factors create additional complexity in comparison with a pure dynamic pricing problem setting.

More specifically, we adapt a contextual linear bandit learning algorithm [APS11] [Chu+11], which already enjoys robust theoretical guarantees in a single-agent setting, to a multi-agent setting. Furthermore, we extend its application to the economic domain by integrating classical economic theory from the price-setting Newsvendor, as discussed in Section 2.4.3, with contextual linear bandits. In this game, the follower optimizes to maximize 𝒰B(p|a)\mathcal{U}_{B}(p|a), an optimistic upper-bounding concave function on his reward function, while the leader aims to learn an optimal strategy by leveraging information about the follower’s policy. We present technically novel analytical techniques for quantifying bounds on estimated parameters and best response functions. For example, we introduce an innovative argument via analysis in Euclidean space in Theorem 3 and approximation methods for non-elementary functions in Theorem 5. Given this, we successfully extend the robust theoretical guarantees of single-agent contextual linear bandits to an economic multi-agent setting.

3.1 Optimism in the Face of Uncertainty

We propose an adaptation of the optimism-under-uncertainty (OFUL) algorithm from [APS11] (illustrated in Appendix B.1 for reference) to learn the parameters of demand from Eq. (2.11). The OFUL algorithm is a contextual linear bandit algorithm, also previously studied in [DHK08] and [RT10], which constructs a confidence ball 𝒞t\mathcal{C}^{t} within which the parameters of the estimated reward function fall.

θtθtV¯t𝒪(log(t))\displaystyle\norm{\theta^{*}_{t}-\theta_{t}}_{\bar{V}_{t}}\in\mathcal{O}(\sqrt{\log{t}}) (3.1)

Given context aa, we have two parameters to estimate, θ0\theta_{0} and θ1\theta_{1}, which lead us to the learning of the optimal riskless price. From Eq. (3.1) we see that θtθtV¯t||\theta^{*}_{t}-\theta_{t}||_{\bar{V}_{t}} grows logarithmically, matching the covariance of θt\theta_{t}, where ||||V¯t||\cdot||_{\bar{V}_{t}} is defined as the self-normalized norm [PKL04]. Lemma 3.1 extends this relationship to a shrinking finite 2-norm bound, which we later use to derive bounds on regret.
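For reference, the following is a simplified sketch of the ridge-regression estimate and self-normalized confidence radius maintained by OFUL-style algorithms [APS11]; the constants shown are illustrative assumptions, and the full algorithm used in this work is given in Appendix B.1.

import numpy as np

lam, sigma, delta, S = 1.0, 5.0, 0.05, 10.0   # regularizer, noise scale, failure probability, bound on ||theta*||_2

def fit_confidence_set(X, y):
    # X: (t x d) matrix of observed contexts; y: (t,) vector of observed responses.
    d = X.shape[1]
    V_bar = lam * np.eye(d) + X.T @ X                 # regularized design matrix V_bar_t
    theta_hat = np.linalg.solve(V_bar, X.T @ y)       # ridge estimate of theta
    # Radius of the set {theta : ||theta - theta_hat||_{V_bar} <= beta}, growing as O(sqrt(log t)).
    log_det_ratio = np.linalg.slogdet(V_bar)[1] - d * np.log(lam)
    beta = sigma * np.sqrt(2.0 * np.log(1.0 / delta) + log_det_ratio) + np.sqrt(lam) * S
    return theta_hat, V_bar, beta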

Lemma 3.1.

With probability 1δ1-\delta, and tet\geq e, the constraint θθV¯t𝒪(log(t))||\theta-\theta^{*}||_{\bar{V}_{t}}\leq\mathcal{O}(\sqrt{\log(t)}) from [APS11] implies the finite 22-norm bound θθ2𝒪(log(t)/t)||\theta-\theta^{*}||_{2}\leq\mathcal{O}(\sqrt{\log(t)/t}), and also thereby the existence of κ>0\kappa>0 such that θθ2κlog(t)/t||\theta-\theta^{*}||_{2}\leq\kappa\sqrt{\log(t)/t}. (Proof in Appendix B.3.)
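One standard route to such a bound, sketched below under the assumption that the minimum eigenvalue of V¯t\bar{V}_{t} grows linearly in tt for sufficiently exciting contexts (the complete argument is given in Appendix B.3), is the norm equivalence

\begin{align*}
\norm{\theta-\theta^{*}}_{2}\;\leq\;\frac{\norm{\theta-\theta^{*}}_{\bar{V}_{t}}}{\sqrt{\lambda_{\min}(\bar{V}_{t})}}\;\leq\;\frac{\mathcal{O}(\sqrt{\log t})}{\sqrt{\Omega(t)}}\;=\;\mathcal{O}\big(\sqrt{\log(t)/t}\big).
\end{align*}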

3.2 Online Learning Algorithm for Newsvendor Pricing Game

Let (θ)θ0/θ1\mathcal{H}(\theta)\equiv\theta_{0}/\theta_{1}. We introduce a learning algorithm where both the leader and the follower optimistically estimate (θ)\mathcal{H}(\theta), while the follower learns 𝒰B(p0,b0|a)\mathcal{U}_{B}(p_{0},b_{0}|a). We denote by 𝒞t\mathcal{C}^{t}, the confidence set, wherein the true parameters lie with high probability 1δ1-\delta (i.e. θ𝒞t\theta^{*}\in\mathcal{C}^{t}). Our intuition, supported by Lemma 3.1, suggests that the size of 𝒞t\mathcal{C}^{t} decreases with the sample size tt at a rate of 𝒪(log(t)/t)\mathcal{O}(\sqrt{\log(t)/t}). Moreover, the linear structure of 𝒰B(p0,b0|a)\mathcal{U}_{B}(p_{0},b_{0}|a) enables Lemma 3.1 to remain valid during estimation of θ\theta via linear contextual bandits. To highlight, the follower executes the optimistic linear contextual bandit, while the leader selects her actions maximizing over the follower’s optimistic best response.

Algorithm 1 Learning Algorithm for Newsvendor Pricing Game (LNPG)
1:for t1Tt\in 1...T do:
2:     Leader and follower estimates 𝒞t\mathcal{C}^{t} from available data 𝒟t\mathcal{D}^{t}.
3:     Leader plays action aa, where a=argmaxa𝒜,θ𝒞taFθ¯a1(12a(θ)+a)a=\underset{a\in\mathcal{A},\theta\in\mathcal{C}^{t}}{\mathrm{argmax}}\ aF^{-1}_{\bar{\theta}_{a}}\Big{(}1-\frac{2a}{\mathcal{H}(\theta)+a}\Big{)} from Eq. (3.8).
4:     Follower sets price p=((θ)+a)/2p=(\mathcal{H}(\theta)+a)/2, referencing the maximization problem in Eq. (3.5a).
5:     Follower estimates their optimistic parameters θ¯a\bar{\theta}_{a}, and best response b¯a\bar{b}_{a} from Eq. (3.4) and (3.5a) respectively, and plays action b=b¯ab=\bar{b}_{a}.
6:     Leader obtains reward, 𝒢A=ab\mathcal{G}_{A}=ab.
7:     Follower obtains reward, 𝒢B=pmin{b,d(p)}\mathcal{G}_{B}=p\min\{b,d(p)\} (Eq. (2.12)).
8:end for
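For illustration, a simplified, self-contained sketch of Algorithm 1 is given below. The confidence set 𝒞t\mathcal{C}^{t} is crudely approximated by a 2-norm ball of radius κlog(t)/t\kappa\sqrt{\log(t)/t} around a ridge estimate (cf. Lemma 3.1), the demand itself is regressed on the features [1,p][1,-p] rather than using the reward feature map of Eq. (2.18), the leader’s maximization is performed over a finite grid of wholesale prices, and all numerical constants are illustrative assumptions rather than the implementation evaluated in Appendix E.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
theta_star, sigma = np.array([100.0, 2.0]), 5.0        # true (unknown) demand parameters
lam, kappa = 1.0, 2.0                                  # ridge regularizer and (assumed) confidence constant
T, a_grid = 2000, np.linspace(1.0, 40.0, 80)           # horizon and leader's candidate wholesale prices

X, Y = [], []                                          # logged demand-regression features and demand draws
theta_hat = np.array([80.0, 1.5])                      # arbitrary initial estimate
for t in range(1, T + 1):
    radius = kappa * np.sqrt(np.log(max(t, 3)) / t)    # approximate confidence radius (Lemma 3.1)
    theta_bar = np.array([theta_hat[0] + radius,       # crude optimistic parameters inflating H(theta)
                          max(theta_hat[1] - radius, 0.5)])
    H_bar = theta_bar[0] / theta_bar[1]

    def leader_payoff(a):                              # line 3: a * F^{-1}(1 - 2a / (H_bar + a))
        p = (H_bar + a) / 2.0
        mu = max(0.0, theta_bar[0] - theta_bar[1] * p)
        frac = np.clip(1.0 - 2.0 * a / (H_bar + a), 1e-6, 1.0 - 1e-6)
        return a * max(norm.ppf(frac, loc=mu, scale=sigma), 0.0)

    a = max(a_grid, key=leader_payoff)                 # line 3: leader's wholesale price
    p = (H_bar + a) / 2.0                              # line 4: follower's optimistic riskless price
    mu_bar = max(0.0, theta_bar[0] - theta_bar[1] * p)
    frac = np.clip((p - a) / p, 1e-6, 1.0 - 1e-6)
    b = max(norm.ppf(frac, loc=mu_bar, scale=sigma), 0.0)   # line 5: optimistic order amount

    d = max(0.0, theta_star[0] - theta_star[1] * p) + sigma * rng.standard_normal()  # realized demand
    leader_reward = a * b                              # per the game rules in Section 2.4.1
    follower_reward = p * min(d, b) - a * b
    X.append([1.0, -p])                                # demand is linear in [1, -p] with coefficients theta
    Y.append(d)
    V = lam * np.eye(2) + np.array(X).T @ np.array(X)
    theta_hat = np.linalg.solve(V, np.array(X).T @ np.array(Y))   # ridge estimate of theta

With these assumed constants, the optimism radius shrinks as 𝒪(log(t)/t)\mathcal{O}(\sqrt{\log(t)/t}), mirroring the behaviour analyzed in Section 3.3.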

Given leader action aa, the optimal follower response, 𝒢B()\mathcal{G}_{B}^{*}(\cdot), is unique from Theorem 2. Thus there exists a unique solution where 𝒢A(a)\mathcal{G}_{A}^{*}(a) is maximized for any estimate of θ\theta. Although Theorem 2 illustrates the uniqueness of the SE under perfect information, there still exists uncertainty under imperfect information, when the parameters of demand are unknown. From the leader’s perspective, she is rewarded 𝒢A(a,b)=ab\mathcal{G}_{A}(a,b)=ab depending on the reaction of the follower. The follower’s economic reaction is determined by Fθ1((p0a)/p0)F_{\theta}^{-1}((p_{0}-a)/p_{0}), which is affected by the estimate θ\theta. We present empirical results of Algorithm 1 in Appendix E.

3.3 Measuring the Best Response under Optimism

To characterize the best response of the follower, we must first establish a bound on the optimistic, yet uncertain, pricing behaviour pap_{a} of the follower given leader action aa. From Theorem (1) there exists a direct relationship between ptp^{t} and batb_{a}^{t}. As the follower is learning parameters via OFUL, its certainty about the model parameters is bounded by θθ2𝒪(log(t)/t)||\theta-\theta^{*}||_{2}\leq\mathcal{O}(\sqrt{\log(t)/t}). For each given action aa, and a confidence bound on the parameters θ\theta^{*}, denoted as 𝒞t\mathcal{C}^{t}, we can have an optimistic and pessimistic estimate of the best response, bab_{a}, denoted as (b¯a,b¯a)(\bar{b}_{a},\underline{b}_{a}) respectively. We define the pessimistic estimate of bab_{a}^{*} as,

b¯a\displaystyle\underline{b}_{a} max{0,Fθ^1(p0(θ)ap0(θ))}max{0,Fθ^1(p(θ)ap(θ))}\displaystyle\equiv\max\Big{\{}0,F_{\hat{\theta}}^{-1}\Big{(}\frac{p_{0}(\theta)-a}{p_{0}(\theta)}\Big{)}\Big{\}}\geq\max\Big{\{}0,F_{\hat{\theta}}^{-1}\Big{(}\frac{p^{*}(\theta)-a}{p^{*}(\theta)}\Big{)}\Big{\}} (3.2)

The pessimistic best response b¯a\underline{b}_{a} to aa occurs when the optimistic parameter estimate is equal to the maximum likelihood estimate itself, denoted as θ^\hat{\theta}. Thus for realism, we set a lower bound of 0 for bab_{a}. Additionally, we recognize that the riskless price estimate p0p_{0} is always greater than or equal to pp^{*}, represented as app0a\leq p^{*}\leq p_{0} in Eq. (3.2). Later, in Lemma 3.3, we will demonstrate that when the algorithm converges to p0p_{0}, the learned parameters enable the follower to calculate pp^{*} and bab_{a}^{*} respectively.

The upper bound of the optimistic best response, b¯a\bar{b}_{a} (Eq. (3.4)), can be determined by estimating the optimistic riskless price p¯0\bar{p}_{0}. Let 𝔼[𝒢B(θ|a)]\mathbb{E}[\mathcal{G}_{B}(\theta|a)] denote the expected profit under parameter θ\theta, given aa. Both the left and right terms of Eq. (3.3) are optimistic estimates of expected profit, 𝔼[𝒢B(θ|a)]\mathbb{E}[\mathcal{G}_{B}(\theta|a)], using the riskless price p0p_{0} as a surrogate for the optimal price. The follower’s pricing with p¯0\bar{p}_{0}, and ordering with b¯a\bar{b}_{a}, under an optimistic estimate of the demand parameters θ\theta given data 𝒟t\mathcal{D}^{t}, can be summarized from Eq. (3.3) to Eq. (3.4). (The derivation from Eq. (3.3) to Eq. (3.4) is in Appendix C.2.)

θ¯\displaystyle\bar{\theta} =argmaxθ𝒞t𝔼[𝒢B(θ|a)],𝔼[𝒢B(θ|a)]maxθ𝒞t((θ)+a2)Fθ1(12a(θ)+a)\displaystyle=\underset{\theta\in\mathcal{C}^{t}}{\mathrm{argmax}}\ \mathbb{E}[\mathcal{G}_{B}(\theta|a)],\quad\mathbb{E}[\mathcal{G}_{B}(\theta|a)]\leq\underset{\theta\in\mathcal{C}^{t}}{\mathrm{max}}\ \Big{(}\frac{\mathcal{H}(\theta)+a}{2}\Big{)}F_{\theta}^{-1}\Big{(}1-\frac{2a}{\mathcal{H}(\theta)+a}\Big{)} (3.3)
b¯a\displaystyle\bar{b}_{a} =Fθ¯a1(p0ap0),p0=a+p¯0a2=(θ¯)+a2,𝔅(a)[b¯a,b¯a],b¯a0\displaystyle=F^{-1}_{\bar{\theta}_{a}}\Big{(}\frac{p_{0}-a}{p_{0}}\Big{)},\quad p_{0}=a+\frac{\bar{p}_{0}-a}{2}=\frac{\mathcal{H}(\bar{\theta})+a}{2},\quad\mathfrak{B}(a)\in[\underline{b}_{a},\bar{b}_{a}],\quad\underline{b}_{a}\geq 0 (3.4)
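For completeness, the following short derivation shows how the fractile 12a/((θ)+a)1-2a/(\mathcal{H}(\theta)+a) appearing in Eq. (3.3) and Algorithm 1 arises: maximizing the riskless profit 𝒰B(p|a)\mathcal{U}_{B}(p|a) of Eq. (2.18) over pp and substituting the resulting price into the critical fractile of Eq. (2.16),

\begin{align*}
\frac{\partial}{\partial p}\,(\theta_{0}-\theta_{1}p)(p-a)=0
\;\;\Rightarrow\;\; p_{0}=\frac{\theta_{0}/\theta_{1}+a}{2}=\frac{\mathcal{H}(\theta)+a}{2},
\qquad
\frac{p_{0}-a}{p_{0}}=\frac{\mathcal{H}(\theta)-a}{\mathcal{H}(\theta)+a}=1-\frac{2a}{\mathcal{H}(\theta)+a}.
\end{align*}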
Theorem 3.

Bounding the Optimistic Follower Pricing: Let Δ(t)=(θ)θ0/θ1\Delta_{\mathcal{H}}(t)=\mathcal{H}^{*}(\theta)-\theta_{0}^{*}/\theta_{1}^{*} represent the deviation in pricing action of the follower (retailer) at time tt. Then Δ(t)κHlog(t)/t\Delta_{\mathcal{H}}(t)\leq\kappa_{H}\sqrt{\log(t)/t} for some κH>0\kappa_{H}>0 and tet\geq e. (Proof in Appendix B.4.)

maxθ\displaystyle\!\max_{\theta} (θ)\displaystyle\qquad\qquad\mathcal{H}(\theta) =θ0θ1\displaystyle=\ \frac{\theta_{0}}{\theta_{1}} (3.5a)
subject to (θ0θ0)2+(θ1θ1)2\displaystyle\qquad\sqrt{(\theta_{0}-\theta^{*}_{0})^{2}+(\theta_{1}-\theta^{*}_{1})^{2}} κlog(t)/t\displaystyle\leq\kappa\sqrt{\log(t)/t} (3.5b)

Sketch of Proof: As illustrated in Fig. 3, we specify the optimization problem, where with probability 1δ1-\delta, the 2-norm error of the parameter estimates for θ\theta lies within the confidence ball characterized by a radius less than κlog(t)/t\kappa\sqrt{\log(t)/t}. The general maximization problem is expressed in Eq. (3.5a) and (3.5b). The optimistic estimate of (θ)\mathcal{H}(\theta), denoted as (θ)\mathcal{H}^{*}(\theta) (which we denote in short as \mathcal{H}^{*}), lies at the extreme point where the ray (θ)\mathcal{H}(\theta) meets the boundary of θθ2κlog(t)/t||\theta-\theta^{*}||_{2}\leq\kappa\sqrt{\log(t)}/\sqrt{t}.

θ1\displaystyle\mathcal{H}^{*}\theta_{1} =θ0\displaystyle=\theta_{0} (3.6)
|θ1θ0|2+1\displaystyle\frac{|\mathcal{H}^{*}\theta_{1}-\theta_{0}|}{\sqrt{{\mathcal{H}^{*}}^{2}+1}} =κlog(t)/t\displaystyle=\kappa\sqrt{\log(t)/t} (3.7)

Leveraging Eq. (3.7), we can construct a bound on Δ(t)\Delta_{\mathcal{H}}(t). Theorem 3 characterizes the shrinking nature of Δ(t)\Delta_{\mathcal{H}}(t) with respect to tt, as more samples are observed. Similar to Lemma 3.1, there also exists some κH\kappa_{H}, where κHlog(t)/t\kappa_{H}\sqrt{\log(t)/t} bounds Δ(t)\Delta_{\mathcal{H}}(t), as expressed in Theorem 3. The establishment of this bound characterizes the deviation of the follower’s pricing strategy, and is used to prove Theorem 4. To obtain a closed form solution to this optimization problem, we note the closed form expression for the Euclidean distance from a point in 2-space, (θ0,θ1)(\theta_{0},\theta_{1}), to the line described in Eq. (3.6). This distance measure is expressed in Eq. (3.7).
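A rough reconstruction of the resulting bound from Eq. (3.6) and Eq. (3.7) (the complete argument is given in Appendix B.4) writes r=κlog(t)/tr=\kappa\sqrt{\log(t)/t} and uses optimism, (θ)θ0/θ1\mathcal{H}^{*}(\theta)\geq\theta_{0}^{*}/\theta_{1}^{*}, so that

\begin{align*}
\mathcal{H}^{*}\theta_{1}^{*}-\theta_{0}^{*}=r\sqrt{{\mathcal{H}^{*}}^{2}+1}
\;\;\Longrightarrow\;\;
\Delta_{\mathcal{H}}(t)=\mathcal{H}^{*}-\frac{\theta_{0}^{*}}{\theta_{1}^{*}}
=\frac{r\sqrt{{\mathcal{H}^{*}}^{2}+1}}{\theta_{1}^{*}}
\;\leq\;\kappa_{H}\sqrt{\log(t)/t},
\end{align*}

where the constant κH\kappa_{H} absorbs κ2+1/θ1\kappa\sqrt{{\mathcal{H}^{*}}^{2}+1}/\theta_{1}^{*}, which remains bounded once tt is large enough that \mathcal{H}^{*} lies within a constant of θ0/θ1\theta_{0}^{*}/\theta_{1}^{*}.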

Figure 2: Linear demand uncertainty. The linear demand function Γθ(p)=θ0θ1p\Gamma_{\theta}(p)=\theta_{0}-\theta_{1}p is plotted as expected demand 𝔼[d]\mathbbm{E}[d] against price, together with aa, p0p_{0}, p¯0\bar{p}_{0} and p¯\bar{p}, under parameter uncertainty θθ2κlog(t)/t||\theta-\theta^{*}||_{2}\leq\kappa\sqrt{\log(t)/t}.
Figure 3: Confidence ball. The ball of radius κlog(t)/t\kappa\sqrt{\log(t)/t} in (θ0,θ1)(\theta_{0},\theta_{1})-space, with p¯=argmaxθ𝒞tθ0/θ1\bar{p}=\underset{\theta\in\mathcal{C}^{t}}{\mathrm{argmax}}\ \theta_{0}/\theta_{1}.
Fig. 2 represents the uncertainty of the linear demand function given θθ2κlog(t)/t||\theta-\theta^{*}||_{2}\leq\kappa\sqrt{\log(t)/t}. We can see that the optimistic estimate of the riskless price, p¯0\bar{p}_{0}, is proportional to the estimate of p¯=θ0/θ1\bar{p}=\theta_{0}/\theta_{1}. Fig. 3 visualizes a confidence ball with the radius κlog(t)/t\kappa\sqrt{\log(t)/t}. The extreme values of p¯\bar{p} are represented by the rays extending from the origin and intersecting the maximum radius of the confidence ball.

From Lemma 3.2 we can see that the cumulative optimistic pricing deviation of the follower is Δ(T)𝒪(Tlog(T))\Delta_{\mathcal{H}}(T)\in\mathcal{O}(\sqrt{T\log(T)}). This will have further ramifications when we compute the Stackelberg regret in Theorem 5. Of interest to the reader, the current bound in Lemma 3.2 would naturally extend to non-linear settings such as exponential-class Generalized Linear Models (GLMs) (we provide a description of this in Appendix C.6).

Lemma 3.2.

Bounding the Cumulation of Δ(t)\Delta_{\mathcal{H}}(t): Suppose we wish to measure the cumulative value of Δ(t)\Delta_{\mathcal{H}}(t) from 11 to TT, denoted as Δ(T)\Delta_{\mathcal{H}}(T). Then Δ(T)2κHTlog(T)\Delta_{\mathcal{H}}(T)\leq 2\kappa_{H}\sqrt{T\log(T)}. (Proof in Appendix B.5.)
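As a quick check of the order of this bound (ignoring the first few rounds t<et<e, which contribute only an additive constant), summing the per-round bound of Theorem 3 gives

\begin{align*}
\Delta_{\mathcal{H}}(T)=\sum_{t=1}^{T}\Delta_{\mathcal{H}}(t)
\;\leq\;\kappa_{H}\sqrt{\log(T)}\sum_{t=1}^{T}\frac{1}{\sqrt{t}}
\;\leq\;\kappa_{H}\sqrt{\log(T)}\cdot 2\sqrt{T}
\;=\;2\kappa_{H}\sqrt{T\log(T)}.
\end{align*}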

Lemma 3.3.

Convergence of p¯\bar{p}^{*} to pp^{*}: Suppose the solution to Eq. (2.17) is computable. For any symmetric demand distribution, the optimistic estimate of the optimal market price from Eq. (2.17), denoted as p¯\bar{p}^{*}, which is the solution to pp^{*} under optimistic parameters θ¯\bar{\theta}, is decreasing with respect to TT, such that p¯(θT+1)p¯(θT)\bar{p}^{*}(\theta_{T+1})\leq\bar{p}^{*}(\theta_{T}), where the difference p¯(θT)p\bar{p}^{*}(\theta_{T})-p^{*} is bounded by 𝒪(log(T)/T)\mathcal{O}(\log(T)/T). (Proof in Appendix A.5.)

In Algorithm 1 we propose that the follower prices the product at the optimistic riskless price p¯0\bar{p}_{0}, and not necessarily the optimistic optimal price p¯\bar{p}^{*}. The profits derived from pricing with any of these suboptimal prices would result in suboptimal follower rewards (we characterize this regret later in Theorem 4).

3.4 Analysis from the Follower’s Perspective

For any approximately best responding follower, we can always expect a worst case error of ϵB\epsilon_{B}, defined in Eq. (2.10), for the follower at any discrete time interval. Thus, by intuition alone, we can presume the regret to be at least 𝒪(T)\mathcal{O}(T). Nevertheless, it is worth noting that there exist two distinct components to the regret, one caused specifically by the approximation of the best response, and one caused by the parameter uncertainty of the learning algorithm itself. The problem ultimately reduces to a contextual linear bandit problem.

Given our bounds on the optimistic pricing behaviour of the follower under context aa and uncertainty θ𝒞t\theta\in\mathcal{C}^{t}, for any sequence of actions, {a1,,aT}\{a^{1},...,a^{T}\}, the expected regret of the follower is expressed in Eq. (2.3).

Theorem 4.

Follower Regret in Online Learning for NPG: Given pure leader strategy πA\pi_{A}, the worst-case regret of the follower, RBT(πA)R_{B}^{T}(\pi_{A}), as defined in Eq. (2.3) when adopting the LNPG algorithm, is bounded by RBT(πA)B+ϵBTθ1κH2log2(T)R_{B}^{T}(\pi_{A})\leq\aleph_{B}+\epsilon_{B}T-\theta_{1}\kappa_{H}^{2}\log^{2}(T), for sufficiently large values of TT, where B,ϵB,θ1,κH\aleph_{B},\epsilon_{B},\theta_{1},\kappa_{H} are positive real constants in +\mathbb{R}^{+}. (Proof in Appendix C.3.)

Sketch of Proof: As the follower implements the OFUL algorithm (see Appendix B.1) to estimate the demand parameters \theta, we propose \mathcal{U}_{B}(p_{0}|a) as the theoretically maximum achievable expected reward. We quantify the variation in \mathcal{U}_{B}(\bar{p}_{0}|a), represented by \Delta_{\mathcal{U}}(t), across a finite number of samples T. Furthermore, we provide an upper bound on the instantaneous regret using the larger upper-bounding discrepancy \mathcal{U}_{B}(\bar{p}_{0}|a)-\mathbb{E}[\mathcal{G}_{B}(p^{*}|a)]. It is worth noting that although \mathcal{U}_{B}(p|a) is a concave function, \mathbb{E}[\mathcal{G}_{B}(p|a)] may not be. Under perfect information, the follower may incur an approximation error \tilde{\epsilon} derived from this discrepancy, rendering it \tilde{\epsilon}-suboptimal in the worst-case scenario. Nonetheless, this error is independent of t and of the learning error for \theta. Next, we establish a bound for the linear function estimation used to compute the optimal riskless pricing; for this we apply the results of Theorem 3 and establish the bound \Delta_{\mathcal{U}}(t)\leq\theta_{1}\kappa_{H}^{2}\log(t)/t. We then partition the learning time period into two phases: Case 2 denotes the period where \mathbbm{E}[\mathcal{G}_{B}(p^{*}|a)]>\mathcal{U}_{B}(\bar{p}_{0}|a), occurring within the time interval 1\leq t\leq T^{\prime}, and Case 1 denotes when \mathbbm{E}[\mathcal{G}_{B}(p^{*}|a)]\leq\mathcal{U}_{B}(\bar{p}_{0}|a), occurring within the interval T^{\prime}\leq t\leq T (assuming sufficiently large values of T). We note the differences in regret behaviour within these regimes and focus on the asymptotic regime as T approaches \infty, where the cumulative regret is bounded between the worst-case approximation error for \mathbbm{E}[\mathcal{G}_{B}(p^{*}|a)], denoted as \epsilon_{B}, and the logarithmic term \Delta_{\mathcal{U}}(t) pertaining to linear stochastic bandit learning, presented as \aleph_{B}+\epsilon_{B}T-\theta_{1}\kappa_{H}^{2}\log^{2}(T). (Complete details are provided in Appendix C.3.)

To provide further insight, Theorem 4 suggests that the regret is bounded by the linear worst-case approximation error \epsilon_{B}T, minus the sublinear term \theta_{1}\kappa_{H}^{2}\log^{2}(T), for sufficiently large values of T. As the follower forms an approximate best response to the leader’s action a based on optimistic riskless pricing, we can expect linear regret in the worst case. Nevertheless, knowing the characteristic behaviour of this regret is useful when we can quantify the learning error caused by the approximation of \mathbb{E}[\mathcal{G}_{B}(\cdot)] with \mathcal{U}_{B}(\cdot).
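To illustrate the shape of this bound, the following minimal sketch evaluates \aleph_{B}+\epsilon_{B}T-\theta_{1}\kappa_{H}^{2}\log^{2}(T) for made-up constants (all values below are assumptions for illustration only); the linear term dominates as T grows, while the subtracted log-squared term matters at smaller T.

import numpy as np

# Illustrative evaluation of the follower regret bound from Theorem 4 with
# made-up constants; only the qualitative shape matters here.
aleph_B, eps_B, theta_1, kappa_H = 5.0, 0.1, 2.0, 1.5
for T in (10, 100, 1_000, 10_000, 100_000):
    bound = aleph_B + eps_B * T - theta_1 * kappa_H**2 * np.log(T) ** 2
    print(f"T = {T:>6}, bound = {bound:,.1f}")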

3.5 Stackelberg Regret in the Newsvendor Pricing Game

We focus on providing a theoretical worst-case guarantee on the regret from the leader’s perspective as defined in Eq. (2.4), also known as the Stackelberg regret under an approximately best responding follower. From Eq. (2.16), the critical fractile used to compute b is decreasing with respect to a. This makes F^{-1}_{\theta} also strictly decreasing with respect to a, so b is strictly decreasing with respect to a. Therefore there exists a unique a, under the assumptions provided in Section 2.4.2, which maximizes the upper bound on the leader profit \mathcal{G}_{A}^{*}(\cdot), given the best response of the follower, b, for any leader action a.

𝒢A(a,𝔅(a))maxa𝒜aFθ1(p0ap0)=maxa𝒜aFθ1(12a(θ)+a)\displaystyle\mathcal{G}_{A}^{*}(a,\mathfrak{B}(a))\leq\underset{a\in\mathcal{A}}{\mathrm{max}}\ aF_{\theta}^{-1}\Big{(}\frac{p_{0}-a}{p_{0}}\Big{)}=\underset{a\in\mathcal{A}}{\mathrm{max}}\ aF_{\theta}^{-1}\Big{(}1-\frac{2a}{\mathcal{H}(\theta)+a}\Big{)} (3.8)

To bound the Stackelberg regret, we must effectively characterize the best response of the follower given the assumptions outlined in Sec. 2.4.2. The key idea is that the follower will (1) act optimistically in the face of uncertainty, and (2) act rationally and greedily under its best approximation of the best response, meaning it will not try to deceive or misplay its own actions in order to influence the leader’s future actions.

If the follower’s strategy is known to the leader, and the information available to both agents is the same, then in theory there is no regret from the leader’s perspective, as the leader has perfect premonition of the actions of the follower. In our case, regret arises from information asymmetry, for example when the follower has received more observations about the demand than the leader, and therefore tends to be more confident in its estimate of \Gamma_{\theta}(p) and less optimistic in its ordering and pricing strategies. Algorithm 1 (LNPG) seeks to minimize the Stackelberg regret, as defined in Eq. (2.4), constituting a no-regret learner over the follower’s strategy from the leader’s perspective, where a^{*} denotes the optimal action from the leader’s perspective.

a\displaystyle a^{*} =argmaxa𝒜𝒢A(a,𝔅(a))\displaystyle=\underset{a\in\mathcal{A}}{\mathrm{argmax}}\ \mathcal{G}_{A}(a,\mathfrak{B}(a)) (3.9)
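To make the leader’s optimization concrete, the following minimal sketch grid-searches the upper bound from Eq. (3.8), assuming (purely for illustration) Gaussian demand noise with scale sigma and the demand parameters \theta_{0}=18, \theta_{1}=7 from our experimental configuration; the noise scale and the grid are arbitrary assumptions.

import numpy as np
from scipy.stats import norm

# A minimal sketch of the leader's decision in Eq. (3.9), using the upper bound in
# Eq. (3.8) with H(theta) = theta_0 / theta_1. Gaussian demand noise is assumed
# purely for illustration.
theta_0, theta_1, sigma = 18.0, 7.0, 2.0
H = theta_0 / theta_1

def leader_upper_bound(a):
    p0 = (H + a) / 2.0                                  # riskless price given wholesale price a
    gamma = 1.0 - 2.0 * a / (H + a)                     # critical fractile (p0 - a) / p0
    if gamma <= 0.0:
        return -np.inf
    mean_demand = theta_0 - theta_1 * p0                # expected demand at the riskless price
    b = norm.ppf(gamma, loc=mean_demand, scale=sigma)   # follower order quantity F^{-1}(gamma)
    return a * max(b, 0.0)                              # leader profit upper bound

grid = np.linspace(0.01, H, 500)
a_star = grid[np.argmax([leader_upper_bound(a) for a in grid])]
print(f"approximate maximizer a* = {a_star:.3f}")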
Refer to caption
Figure 4: Denotes the change in the optimistic order amount b_{a} as improved estimates of the demand function are obtained.
Theorem 5.

Stackelberg Regret: Given the pure strategy best response of the follower defined as \mathfrak{B}(a)\in[\underline{b}_{a},\bar{b}_{a}] from Eq. (3.2) and (3.4) respectively, the worst-case regret, R_{A}^{T}(\pi_{A}), from the leader’s perspective, as defined in Eq. (2.4) when adopting the LNPG algorithm, is bounded by \mathcal{O}(\sqrt{T\log(T)}). (Proof in Appendix C.4.)

Sketch of Proof: The technicalities in the proof for Theorem 5 are best illustrated in Fig. 4. The change in the best response of the follower is affected by two components: \Delta_{b}(\gamma) reflects the change in b_{a} due to changes in the critical fractile \gamma, and \Delta_{b}(\theta) reflects the change in b_{a} due to the overall shift in the expected demand caused by the increasing confidence on ||\theta^{*}-\hat{\theta}||_{2}. We construct a linear function approximator, f(\gamma)\simeq F_{\theta}^{-1}(\gamma), with a bounded estimation error. One existing challenge is that F_{\theta}^{-1}(\gamma) has no closed-form solution expressible in elementary terms. We utilize the function f(\gamma) to provide an estimate of the value of a^{*}, given our confidence set \theta^{*}\in\mathcal{C}^{t}. The core technical challenges lie in securing a sufficiently accurate approximation of F_{\theta}^{-1}(\gamma) while imposing upper bounds on the error arising from such a function approximation. To address this, we employ a Taylor series expansion to approximate F_{\theta}^{-1}(\gamma), subsequently applying the Weierstrass approximation theorem for polynomials to confine the approximation error. Following this, we measure this error within the function’s domain, exploiting certain inherent properties such as the concavity of the upper-bounding function (refer to details in Lemma C.1 in the Appendix). Consequently, the approximation of \bar{b}^{*}_{a}(\theta) establishes a bound on \Delta_{b}(\gamma), while \Delta_{b}(\theta) can be bounded using the outcome derived from Theorem 3 (the mathematical details are outlined in Appendix C.4).

Corollary 3.1.

Suppose the follower were to respond with an alternative best-response function \tilde{\mathfrak{B}}(a). As a^{t}\to a^{*}, where a^{*} maximizes a\mathfrak{B}(a) and \tilde{a}^{*} maximizes a\tilde{\mathfrak{B}}(a), the Stackelberg regret R_{A}^{T}(\pi_{A}) from Eq. (C.136) would suffer an additive penalty term of \hat{\epsilon}T, where \hat{\epsilon}=\tilde{a}^{*}\tilde{\mathfrak{B}}(\tilde{a}^{*})-a^{*}\tilde{\mathfrak{B}}(a^{*}). (Proof in Appendix C.5.)

As one of our key assumptions in Section 2.4.2 was that the follower acts via risk-free pricing due to the lack of a closed-form solution for the price-setting Newsvendor problem, a prudent reader might question what would happen if an alternative approximation for the price-setting Newsvendor were devised and applied in this Stackelberg game setting, thereby violating one of the assumptions in Section 2.4.2. Corollary 3.1 states that, in this case, a penalty term of \hat{\epsilon}T, where \hat{\epsilon} can be computed in advance, would be added to the Stackelberg regret from Theorem 5. This notion is relatively intuitive, as the two objective functions converge to different solutions for the leader, \tilde{a}^{*} and a^{*}, while maintaining the alternate best response \tilde{\mathfrak{B}}(a). This is an interesting notion that relaxes the constraint of assuming one specific best-response structure for the follower. Nevertheless, we stress that this serves only as an extension of the theoretical framework, and all Stackelberg games require some form of assumptions on the structure of the follower’s best response.

3.6 Convergence to Stackelberg Equilibrium

Theorem 6.

The LNPG algorithm (Algorithm 1) converges to an approximate Stackelberg equilibrium, as defined in Section 2.3. (Proof in Appendix D.1.)

Sketch of Proof: Given the regret properties of the leader in Theorem 5 and of the follower in Theorem 4, and since the instantaneous regret approaches 0 asymptotically as t\to\infty, the algorithm achieves an approximate Stackelberg equilibrium.

4 Empirical Results

We compare the performance of our proposed algorithm, LNPG, with a well-known and widely used algorithm in the field of multi-armed bandit problems: Upper Confidence Bound (UCB) [Sze10]. UCB follows a straightforward exploration-exploitation trade-off strategy by selecting the action with the highest upper confidence bound. It requires discretization over the action space, and we use this as a baseline measure when computing empirical Stackelberg regret. The results are presented in Fig. 5, and additional results and configurations are presented in Appendix E.
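For concreteness, a minimal sketch of such a discretized-UCB baseline for the leader is shown below; the toy profit function and all constants are illustrative assumptions and not the actual NPG simulator used in our experiments.

import numpy as np

# Minimal discretized-UCB sketch for the leader: each candidate wholesale price is an arm,
# and the arm with the highest upper confidence bound is played. The profit signal below
# is a toy stand-in for the leader's observed reward in the game.
rng = np.random.default_rng(0)

def ucb_leader(reward_fn, arms, T, c=2.0):
    counts = np.zeros(len(arms))
    means = np.zeros(len(arms))
    for t in range(1, T + 1):
        if t <= len(arms):
            i = t - 1                                     # play each arm once first
        else:
            i = int(np.argmax(means + c * np.sqrt(np.log(t) / counts)))
        r = reward_fn(arms[i])
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]            # incremental mean update
    return arms[int(np.argmax(means))]

arms = np.linspace(0.1, 2.5, 25)
toy_profit = lambda a: a * (2.6 - a) + rng.normal(0.0, 0.1)  # illustrative concave profit plus noise
print("UCB-selected wholesale price:", round(ucb_leader(toy_profit, arms, T=5_000), 2))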

Refer to caption
Refer to caption
Figure 5: Episodic reward and Stackelberg (leader) regret performance across a Newsvendor pricing game, comparing Algorithm 1 (LNPG) against UCB. All shaded areas, denoting confidence intervals, are within a quarter quantile. (Parameters used: \kappa=3, \theta_{0}=18, \theta_{1}=7.)

5 Conclusion

We demonstrated the effective use of an optimistic online learning algorithm in a dyadic exchange network. We define a Newsvendor pricing game (NPG) as an economic game between two learning agents, representing firms in a supply chain. Both agents are learning the parameters of an additive demand function. The follower, or retailer, is a Newsvendor firm, meaning it must bear the cost of surpluses and shortages when subject to stochastic demand. The follower must also determine the optimal price for the product, constituting the classical price-setting Newsvendor problem. We establish that, under perfect information, the NPG contains a unique follower best response and that a Stackelberg equilibrium must exist. In the context of online learning, we combine established economic theory of the price-setting Newsvendor with that of linear contextual bandits to learn the parameters of the reward function. We demonstrate that, from the leader’s perspective, this algorithm achieves a Stackelberg regret of \mathcal{O}(\sqrt{T\log(T)}) over T finite samples, and is guaranteed to converge to an approximate Stackelberg equilibrium. One limitation of this work is the structure of the demand function. Currently, the demand function is a linear function of price. To better represent reality, we propose in future work to expand the structure of the demand function to accommodate, in addition to price, multiple features (such as customer demographic features, product features, etc.). This implies the extension of Lemma 3.1 to the p-norm rather than the 2-norm, and a restructuring of Theorem 3 to accommodate a more general solution to the maximization problem from Eq. (3.5a).

References

  • [AE92] Simon P Anderson and Maxim Engers “Stackelberg versus Cournot oligopoly equilibrium” In International Journal of Industrial Organization 10.1 Elsevier, 1992, pp. 127–135
  • [AHM51] Kenneth J Arrow, Theodore Harris and Jacob Marschak “Optimal inventory policy” In Econometrica: Journal of the Econometric Society JSTOR, 1951, pp. 250–272
  • [APS11] Yasin Abbasi-Yadkori, Dávid Pál and Csaba Szepesvári “Improved algorithms for linear stochastic bandits” In Advances in neural information processing systems 24, 2011
  • [Bab+15] Moshe Babaioff, Shaddin Dughmi, Robert Kleinberg and Aleksandrs Slivkins “Dynamic pricing with limited supply” ACM New York, NY, USA, 2015
  • [Bai+21] Yu Bai, Chi Jin, Huan Wang and Caiming Xiong “Sample-efficient learning of stackelberg equilibria in general-sum games” In Advances in Neural Information Processing Systems 34, 2021, pp. 25799–25811
  • [Bal+15] Maria-Florina Balcan, Avrim Blum, Nika Haghtalab and Ariel D Procaccia “Commitment without regrets: Online learning in stackelberg security games” In Proceedings of the sixteenth ACM conference on economics and computation, 2015, pp. 61–78
  • [Bou+20] Clément Bouttier, Tommaso Cesari, Mélanie Ducoffe and Sébastien Gerchinovitz “Regret analysis of the Piyavskii-Shubert algorithm for global Lipschitz optimization” In arXiv preprint arXiv:2002.02390, 2020
  • [BP06] Dimitris Bertsimas and Georgia Perakis “Dynamic pricing: A learning approach” In Mathematical and computational models for congestion charging Springer, 2006, pp. 45–79
  • [BR19] Gah-Yi Ban and Cynthia Rudin “The big data newsvendor: Practical insights from machine learning” In Operations Research 67.1 INFORMS, 2019, pp. 90–108
  • [Bra+18] Mark Braverman, Jieming Mao, Jon Schneider and Matt Weinberg “Selling to a no-regret buyer” In Proceedings of the 2018 ACM Conference on Economics and Computation, 2018, pp. 523–538
  • [BSW22] Jinzhi Bu, David Simchi-Levi and Chonghuan Wang “Context-Based Dynamic Pricing with Partially Linear Demand Model” In Advances in Neural Information Processing Systems 35, 2022, pp. 23780–23791
  • [Ces+23] Nicolò Cesa-Bianchi et al. “Learning the stackelberg equilibrium in a newsvendor game” In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, 2023, pp. 242–250
  • [Chu+11] Wei Chu, Lihong Li, Lev Reyzin and Robert Schapire “Contextual bandits with linear payoff functions” In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 208–214 JMLR Workshop and Conference Proceedings
  • [CLP19] Yiling Chen, Yang Liu and Chara Podimata “Grinding the space: Learning to classify against strategic agents” In arXiv preprint arXiv:1911.04004, 2019
  • [CZ99] Gérard P Cachon and Paul H Zipkin “Competitive and cooperative inventory policies in a two-stage supply chain” In Management science 45.7 INFORMS, 1999, pp. 936–953
  • [DD00] Prithviraj Dasgupta and Rajarshi Das “Dynamic pricing with limited competitor information in a multi-agent economy” In Cooperative Information Systems: 7th International Conference, CoopIS 2000 Eilat, Israel, September 6-8, 2000. Proceedings 7, 2000, pp. 299–310 Springer
  • [DD90] Giovanni De Fraja and Flavio Delbono “Game theoretic models of mixed oligopoly” In Journal of economic surveys 4.1 Wiley Online Library, 1990, pp. 1–17
  • [DGP09] Constantinos Daskalakis, Paul W Goldberg and Christos H Papadimitriou “The complexity of computing a Nash equilibrium” In Communications of the ACM 52.2 ACM New York, NY, USA, 2009, pp. 89–97
  • [DHK08] Varsha Dani, Thomas P Hayes and Sham M Kakade “Stochastic linear optimization under bandit feedback”, 2008
  • [FCR19] Tanner Fiez, Benjamin Chasnov and Lillian J Ratliff “Convergence of learning dynamics in stackelberg games” In arXiv preprint arXiv:1906.01217, 2019
  • [FCR20] Tanner Fiez, Benjamin Chasnov and Lillian Ratliff “Implicit learning dynamics in stackelberg games: Equilibria characterization, convergence analysis, and empirical study” In International Conference on Machine Learning, 2020, pp. 3133–3144 PMLR
  • [FKM04] Abraham D. Flaxman, Adam Tauman Kalai and H. McMahan “Online convex optimization in the bandit setting: gradient descent without a gradient” arXiv:cs/0408007 arXiv, 2004 URL: http://arxiv.org/abs/cs/0408007
  • [Gan+23] Jiarui Gan, Minbiao Han, Jibang Wu and Haifeng Xu “Robust Stackelberg Equilibria” In Proceedings of the 24th ACM Conference on Economics and Computation, EC ’23 London, United Kingdom: Association for Computing Machinery, 2023, pp. 735 DOI: 10.1145/3580507.3597680
  • [GLK16] Aurélien Garivier, Tor Lattimore and Emilie Kaufmann “On explore-then-commit strategies” In Advances in Neural Information Processing Systems 29, 2016
  • [GP21] Vineet Goyal and Noemie Perivier “Dynamic pricing and assortment under a contextual MNL demand” In arXiv preprint arXiv:2110.10018, 2021
  • [Hag+22] Nika Haghtalab, Thodoris Lykouris, Sloan Nietert and Alexander Wei “Learning in Stackelberg Games with Non-myopic Agents” In Proceedings of the 23rd ACM Conference on Economics and Computation, 2022, pp. 917–918
  • [HW63] George Hadley and Thomson M Whitin “Analysis of inventory systems”, 1963
  • [JK13] Werner Jammernegg and Peter Kischka “The price-setting newsvendor with service and loss constraints” In Omega 41.2 Elsevier, 2013, pp. 326–335
  • [KL03] Robert Kleinberg and Tom Leighton “The value of knowing a demand curve: Bounds on regret for online posted-price auctions” In 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., 2003, pp. 594–605 IEEE
  • [KUP03] Erich Kutschinski, Thomas Uthmann and Daniel Polani “Learning competitive pricing strategies by multi-agent reinforcement learning” In Journal of Economic Dynamics and Control 27.11-12 Elsevier, 2003, pp. 2207–2218
  • [LS20] Tor Lattimore and Csaba Szepesvári “Bandit algorithms” Cambridge University Press, 2020
  • [Ma+22] Zu-Jun Ma, Yu-Sen Ye, Ying Dai and Hong Yan “The price of anarchy in closed-loop supply chains” In International Transactions in Operational Research 29.1 Wiley Online Library, 2022, pp. 624–656
  • [Mil59] Edwin S Mills “Uncertainty and price theory” In The Quarterly Journal of Economics 73.1 MIT Press, 1959, pp. 116–130
  • [NKS19] Jiseong Noh, Jong Soo Kim and Biswajit Sarkar “Two-echelon supply chain coordination with advertising-driven demand under Stackelberg game policy” In European journal of industrial engineering 13.2 Inderscience Publishers (IEL), 2019, pp. 213–244
  • [PD99] Nicholas C Petruzzi and Maqbool Dada “Pricing and the newsvendor problem: A review with extensions” In Operations research 47.2 INFORMS, 1999, pp. 183–194
  • [Per07] Georgia Perakis “The “price of anarchy” under nonlinear and asymmetric costs” In Mathematics of Operations Research 32.3 INFORMS, 2007, pp. 614–628
  • [PK ̵01] P.K. Kannan and Praveen K Kopalle “Dynamic pricing on the Internet: Importance and implications for consumer behavior” In International Journal of Electronic Commerce 5.3 Taylor & Francis, 2001, pp. 63–83
  • [PKL04] Victor H Pena, Michael J Klass and Tze Leung Lai “Self-normalized processes: exponential inequalities, moment bounds and iterated logarithm laws”, 2004
  • [PS80] Michael E Porter “Competitive Strategy: Techniques for Analyzing Industries and Competitors” New York: Free Press, 1980
  • [Rou10] Tim Roughgarden “Algorithmic game theory” In Communications of the ACM 53.7 ACM New York, NY, USA, 2010, pp. 78–86
  • [RT10] Paat Rusmevichientong and John N Tsitsiklis “Linearly parameterized bandits” In Mathematics of Operations Research 35.2 INFORMS, 2010, pp. 395–411
  • [Sze10] Csaba Szepesvári “Algorithms for reinforcement learning” In Synthesis lectures on artificial intelligence and machine learning 4.1 Morgan & Claypool Publishers, 2010, pp. 1–103
  • [Tir88] Jean Tirole “The theory of industrial organization” MIT press, 1988
  • [Wei85] Karl Weierstrass “Über die analytische Darstellbarkeit sogenannter willkürlicher Functionen einer reellen Veränderlichen” In Verl. d. Kgl. Akad. d. Wiss 2, 1885
  • [ZC13] Juliang Zhang and Jian Chen “Coordination of information sharing in a supply chain” In International Journal of Production Economics 143.1 Elsevier, 2013, pp. 178–187
  • [Zha+23] Geng Zhao, Banghua Zhu, Jiantao Jiao and Michael I. Jordan “Online Learning in Stackelberg Games with an Omniscient Follower” arXiv:2301.11518 [cs] arXiv, 2023 URL: http://arxiv.org/abs/2301.11518

Appendix A Economic Theory

A.1 Proof of Theorem 1

Uniqueness of Best Response under Perfect Information: In the perfect information NPG (from Sec. 2.4) with a known demand function, under the assumptions listed in Sec. 2.4.2, the optimal order amount b^{*}_{a} which maximizes the expected profit of the retailer \mathbbm{E}[\mathcal{G}_{B}(\cdot)] is \mathfrak{B}(a), a surjection from a\to\mathfrak{B}(a), and a bijection when a<\bar{p}. Furthermore, within the range a<\bar{p}, as a monotonically increases, b_{a}^{*} monotonically decreases, where \bar{p} denotes the optimistic estimate of the riskless price p_{0}.

Proof.

Under the assumptions in Section 2.4.2, given any aa, there exists a unique maximizing solution to Eq. (2.15). From [Mil59] we know that the optimal price to set in the price-setting Newsvendor, with linear additive demand, is of the form,

p(z)\displaystyle p^{*}(z) =p0z(xz)fθ(x)𝑑x2θ1\displaystyle=p_{0}-\frac{\int_{z}^{\infty}(x-z)f_{\theta}(x)dx}{2\theta_{1}} (A.1)

Where p_{0} is the riskless price, the price which maximizes the riskless expected profit under the linear expected demand. Therefore, the optimal price p^{*} satisfies p^{*}\leq p_{0}. f_{\theta}(x) expresses the probability of demand equal to x, under demand parameters \theta and price p. This defines the unique relation of z to p(z) from [Mil59], which, as evidenced by Eq. (2.17), is a unique function. Knowing this relation, it is also logical to see that the riskless price of the retailer must always be above the wholesale price of the product, as introduced in the constraint in Eq. (A.2).

a<p^{*}\leq p_{0}\xrightarrow[]{}a<p_{0} (A.2)

Given the unique relation from zz to p(z)p(z), pp^{*} represents the optimal pricing for ordering zz quantity above the expected demand from Γθ(p)\Gamma_{\theta}(p). Having this unique relation, we are ultimately faced with the optimization problem in Eq. (A.3).

𝒢B(a,p)\displaystyle\mathcal{G}_{B}^{*}(a,p) =argmaxz[μ,)𝔼[𝒢B(Γ(p)+z,p(z))]\displaystyle=\underset{z\in[-\mu,\infty)}{\mathrm{argmax}}\ \mathbbm{E}[\mathcal{G}_{B}(\Gamma(p)+z,p^{*}(z))] (A.3)

\mathcal{G}_{B}^{*}(p) is unique under certain conditions outlined in [PD99] and [Mil59], and there exists a unique optimal price p^{*} for each problem defined by \theta. Examining the behaviour of the critical fractile \gamma^{*}=(p^{*}-a)/p^{*}, we see that \gamma^{*} is monotonically decreasing with respect to a.

𝔅(a)=Fθ1(pap),\displaystyle\mathfrak{B}(a)=F_{\theta}^{-1}\Big{(}\frac{p^{*}-a}{p^{*}}\Big{)},\quad p>a\displaystyle p^{*}>a (A.4)

We see from Eq. (A.4) that \mathfrak{B}(a) is a unique function of a: as a strictly monotonically increases, b_{a}^{*} strictly monotonically decreases.

A.2 Proof of Lemma 2.1

Invariance of \Psi_{\theta}(b): Assuming symmetric behaviour of f_{\theta,p}(\cdot) with respect to \Gamma_{\theta}(p) and constant standard deviation \sigma, the term representing the expected supply shortage, defined as \Psi_{\theta}(b), is invariant of the estimate of \theta.

Proof.

First we define the properties of the probability density functions of demand, f_{\theta}(\cdot) and f(\cdot), where f(\cdot) is the \theta-invariant version of the demand density, and f_{\theta}(\cdot) is the probabilistic demand density with respect to the realized value of demand, such that Eq. (A.5) holds. x is the difference between the realized demand and the expected demand \mu, and z is the difference between the order amount b and the expected demand \mu.

f(x)\displaystyle f(x) =fθ(x+Γθ(p))\displaystyle=f_{\theta}(x+\Gamma_{\theta}(p)) (A.5)
Γθ(p)\displaystyle\Gamma_{\theta}(p) =𝔼[dθ(p)]\displaystyle=\mathbbm{E}[d_{\theta}(p)] (A.6)
z\displaystyle z =bΓθ(p)\displaystyle=b-\Gamma_{\theta}(p) (A.7)

For the price-setting Newsvendor, the optimal price is the solution to the equation denoting the expected profit given wholesale price a, \mathbbm{E}[\mathcal{G}_{B}(p,b|a)], as expressed in Eq. (2.15).

p\displaystyle p^{*} =argmax(p,b)𝒫×𝔼[𝒢B(p,b|a)]\displaystyle=\underset{(p,b)\in\mathcal{P}\crossproduct\mathcal{B}}{\mathrm{argmax}}\ \mathbbm{E}[\mathcal{G}_{B}(p,b|a)] (A.8)

From [Mil59] and [PD99] we know that under perfect information, the optimal price for the price-setting Newsvendor p^{*} is always the risk-free price p_{0} subtracted by a term proportional to \Psi_{\theta}(b) accounting for expected inventory loss. It follows that \mathbbm{E}[\mathcal{G}_{B}(p,b|a)] is concave in p given b, and conversely concave in b given p. Therefore, there exists a unique solution to p^{*} given b, as illustrated in Eq. (2.17).

p\displaystyle p^{*} =p0Ψθ(b)2θ1\displaystyle=p_{0}-\frac{\Psi_{\theta}(b)}{2\theta_{1}} (A.9)
where,p0\displaystyle\text{where,}\quad p_{0} =θ0+θ1a2\displaystyle=\frac{\theta_{0}+\theta_{1}a}{2} (A.10)

Given this relation we can then maximize over a subspace of 𝒫×\mathcal{P}\crossproduct\mathcal{B} more efficiently.

p=p(b)\displaystyle p^{*}=p(b^{*}) =argmaxb𝔼[𝒢B(p(b),b|a)]\displaystyle=\underset{b\in\mathcal{B}}{\mathrm{argmax}}\ \mathbbm{E}[\mathcal{G}_{B}(p(b),b|a)] (A.11)

We wish to show that Ψ(z)=Ψθ(b)\Psi(z)=\Psi_{\theta}(b) remains the same despite variation in θ\theta.

Ψθ(b)\displaystyle\Psi_{\theta}(b) =Ψ(z),θ\displaystyle=\Psi(z),\quad\forall\theta (A.12)
where,Ψθ(b)\displaystyle\text{where,}\quad\Psi_{\theta}(b) =b(xb)fθ,p(x)𝑑x\displaystyle=\int_{b}^{\infty}(x-b)f_{\theta,p}(x)dx (A.13)
Ψ(z)\displaystyle\quad\Psi(z) =z(xz)f(xΓθ(p))𝑑x\displaystyle=\int_{z}^{\infty}(x-z)f(x-\Gamma_{\theta}(p))dx (A.14)

We show the equivalence of \Psi_{\theta}(b) and \Psi(z) via the equivalence of the integral definitions from Eq. (A.13) and Eq. (A.14) respectively, utilizing the relation from Eq. (A.5). For convenience, let \mu_{p}=\Gamma_{\theta}(p).

z(xz)f(x)𝑑x\displaystyle\int_{z}^{\infty}(x-z)f(x)dx =z(xz)fθ,p(x+μp)𝑑x\displaystyle=\int_{z}^{\infty}(x-z)f_{\theta,p}(x+\mu_{p})dx (A.15)
=bμp(x(bμp))fθ,p(x+μp,p)𝑑x\displaystyle=\int_{b-\mu_{p}}^{\infty}\Big{(}x-(b-\mu_{p})\Big{)}f_{\theta,p}(x+\mu_{p},p)dx (A.16)
=b((xμp)(bμp))fθ,p(x+μpμp)𝑑x\displaystyle=\int_{b}^{\infty}\Big{(}(x-\mu_{p})-(b-\mu_{p})\Big{)}f_{\theta,p}(x+\mu_{p}-\mu_{p})dx (A.17)
=b(xb)fθ,p(x)𝑑x\displaystyle=\int_{b}^{\infty}(x-b)f_{\theta,p}(x)dx (A.18)

Because θ\theta can only vary by updating samples with respect to tt, it follows that Ψ(z)\Psi(z) is also invariant with respect to tt.
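A quick numerical check of this invariance is given by the following sketch, assuming (arbitrarily, for illustration) Gaussian demand with mean \mu_{p} and standard deviation \sigma.

import numpy as np
from scipy import integrate
from scipy.stats import norm

# Sketch of Lemma 2.1: the expected shortage against the shifted density,
# int_b^inf (x - b) f_theta(x) dx, equals int_z^inf (x - z) f(x) dx with z = b - mu_p.
# Gaussian demand and all numbers are illustrative assumptions.
mu_p, sigma, b = 9.0, 2.0, 11.0
z = b - mu_p
psi_theta, _ = integrate.quad(lambda x: (x - b) * norm.pdf(x, loc=mu_p, scale=sigma), b, np.inf)
psi_shift, _ = integrate.quad(lambda x: (x - z) * norm.pdf(x, loc=0.0, scale=sigma), z, np.inf)
print(round(psi_theta, 6), round(psi_shift, 6))  # both values coincide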

A.3 Theorem A.3

Theorem A.3. Strictly decreasing γ\gamma^{*}: Given the condition a<pp0a<p^{*}\leq p_{0}, γ\gamma^{*} is strictly decreasing with respect to aa.

Proof.

Sketch of idea: as a increases, the gap \bar{p}-p_{0} decreases.

This gap satisfies \bar{p}-p_{0}=p_{0}-a due to symmetry.

p0\displaystyle p_{0} =p¯+a2\displaystyle=\frac{\bar{p}+a}{2} (A.19)
p¯p0\displaystyle\bar{p}-p_{0} =2p¯2p¯+a2=p¯a2=2p0aa2\displaystyle=\frac{2\bar{p}}{2}-\frac{\bar{p}+a}{2}=\frac{\bar{p}-a}{2}=\frac{2p_{0}-a-a}{2} (A.20)
=p0a\displaystyle=p_{0}-a (A.21)

Therefore given p(a,p0]p^{*}\in(a,p_{0}], we know that |(a,p0]||(a,p_{0}]| is decreasing w.r.t increasing aa, holding p¯\bar{p} constant.

Given p^{*}\leq p_{0}, and \gamma=(p-a)/p,

pp0(pa)/p(p0a)/p0\displaystyle p^{*}\leq p_{0}\xrightarrow[]{}(p^{*}-a)/p^{*}\leq(p_{0}-a)/p_{0} (A.22)

The behaviour of \gamma^{*}=(p^{*}-a)/p^{*} will therefore also be decreasing with increasing a.

A.4 Proof of Theorem 2

Theorem 2. Unique Stackelberg Equilibrium: A unique pure strategy Stackelberg Equilibrium exists in the full-information Newsvendor pricing game.

Proof.

Let us define an arbitrary probability density function f_{\theta}(x)>0 which operates on the domain x\in[0,\bar{\bar{p}}], and is strictly positive. The cumulative distribution function is therefore,

Fθ(x)=0xfθ(x)𝑑x=γ\displaystyle F_{\theta}(x)=\int_{0}^{x}f_{\theta}(x)dx=\gamma (A.23)

Therefore, from Eq. (A.23), the cumulative distribution F_{\theta}(x), defined as the integral of such a probability density function, is strictly monotonically increasing on x\in[0,\bar{\bar{p}}]. Therefore the inverse of F_{\theta}(x), defined as F_{\theta}^{-1}, is an injection from \gamma\to x. We also know that the critical fractile \gamma^{*}=(p^{*}-a)/p^{*} is strictly decreasing with respect to a. Therefore, from the follower’s perspective, since F_{\theta}^{-1}(\gamma^{*}) is an injection from \gamma^{*}\to\mathfrak{B}(a), there is always a best response defined by \mathfrak{B}(a) given a. As \mathfrak{B}(a) is unique given a, Eq. (A.24) has a unique maximum.

𝒢A(a)maxa𝒜aFθ1(pap)\displaystyle\mathcal{G}_{A}^{*}(a)\leq\underset{a\in\mathcal{A}}{\mathrm{max}}\ aF_{\theta}^{-1}\Big{(}\frac{p^{*}-a}{p^{*}}\Big{)} (A.24)

A.5 Proof of Lemma 3.3

Lemma 3.3. Convergence of p¯\bar{p}^{*} to pp^{*}: Suppose the solution to Eq. (2.17) is computable. For any symmetric demand distribution, the optimistic estimate of the optimal market price from Eq. (2.17), denoted as p¯\bar{p}^{*} which is the solution to pp^{*} under optimistic parameters θ¯\bar{\theta}, is decreasing with respect to tt such that p¯(θt+1)p¯(θt)\bar{p}^{*}(\theta_{t+1})\leq\bar{p}^{*}(\theta_{t}), where the difference p¯(θt)p\bar{p}^{*}(\theta_{t})-p^{*} is bounded by 𝒪(log(T)/T)\mathcal{O}(\log(T)/T).

Proof.

Suppose the solution to Eq. (2.17) is computable. We examine the equation for the optimal price function, from Eq. (2.17), and we make the assumption that the term Ψ(z)\Psi(z) is computable,

pθ(z)\displaystyle p_{\theta}^{*}(z) =p0(θ)Ψ(z)2θ1\displaystyle=p_{0}(\theta)-\frac{\Psi(z)}{2\theta_{1}} (A.25)

Due to Lemma 2.1 we know that \Psi(z) is invariant of \theta for symmetric probability distributions with constant variance. Using this, we know from Eq. (B.43) that the rate of convergence to \theta^{*} is \mathcal{O}(\log(t)/t). We then look at the behaviour of \theta_{1}. From Eq. (3.5b), we can obtain an optimistic upper bound on \theta_{1}, denoted as \bar{\theta}_{1}, as t\to\infty, given the confidence ball \mathcal{C}^{t}: by setting \hat{\theta}_{0}=\theta_{0}^{*}, the slack \kappa\sqrt{\log(t)/t} can be allocated entirely to maximizing \bar{\theta}_{1}^{t}. In other words, we assume that the estimate \hat{\theta}_{0}^{t} is exact. Subsequently, we have a 1-dimensional optimization problem.

Δθ1=|θ^1θ1|\displaystyle\Delta_{\theta_{1}}=|\hat{\theta}_{1}-\theta_{1}^{*}| κlog(t)/t\displaystyle\leq\kappa\sqrt{\log(t)/t} (A.26)

Let \Delta_{\theta_{1}}=\hat{\theta}_{1}-\theta_{1}^{*}. We maximize \Delta_{\theta_{1}} by setting \Delta_{\theta_{1}}=\kappa\sqrt{\log(t)/t}, and this maximum value shrinks at a rate of \mathcal{O}(\log(t)/t). We examine the asymptotic behaviour of p^{*}_{\theta}(z) as t\to\infty.

limtpθ(z)=limtp0(θ)Ψ(z)2(θ1±Δθ1)\displaystyle\lim_{t\to\infty}p_{\theta}^{*}(z)=\lim_{t\to\infty}p_{0}(\theta)-\frac{\Psi(z)}{2(\theta_{1}^{*}\pm\Delta_{\theta_{1}})} (A.27)

Examining Eq. (A.27), we know that ultimately \bar{p}_{0}(t)\to p_{0} and \Delta_{\theta_{1}}\to 0 as t\to\infty. Thus, \bar{p}^{*}\to p^{*} asymptotically as t\to\infty.

Appendix B Online Learning Theory

B.1 Optimism Under Uncertainty (OFUL) Algorithm

Algorithm 2 OFUL Algorithm (from [APS11])
1:for t\in 1...T do:
2:     (X_{t},\hat{\theta}_{t})=\underset{(x,\theta)\in\mathcal{X}\times\mathcal{C}^{t}}{\mathrm{argmax}}\langle x,\theta\rangle
3:     Agent plays X_{t} and observes reward Y_{t}
4:     Update \mathcal{C}^{t}
5:end for

B.2 Optimization Under Uncertainty (OFUL)

The OFUL algorithm presented in Appendix B.1 is a linear bandit optimization algorithm. It works by constructing a confidence ball \mathcal{C}^{t} which provides a bound on the variation of the estimated parameters \theta. This confidence bound shrinks as more observations are obtained. When running this algorithm, the error of the estimated parameters \norm{\theta^{*}_{t}-\hat{\theta}_{t}}_{\bar{V}_{t}} is bounded under the self-normalizing norm \norm{\cdot}_{\bar{V}_{t}}, where by definition,

V¯t=λI+i=1tXiXiT,SV¯t=STV¯tS\displaystyle\bar{V}_{t}=\lambda I+\sum_{i=1}^{t}X_{i}X_{i}^{T},\quad\norm{S}_{\bar{V}_{t}}=\sqrt{S^{T}\bar{V}_{t}S} (B.1)

Where I is the identity matrix and \lambda>0. With these definitions in Eq. (B.1), the OFUL algorithm, with respect to the shrinking confidence ball \mathcal{C}^{t}, admits a bound on the confidence of the parameters as,

\norm{\theta^{*}_{t}-\hat{\theta}_{t}}_{\bar{V}_{t}}\leq R\sqrt{d\log\Big(\frac{1+tL^{2}/\lambda}{\delta}\Big)}+\lambda^{1/2}S\in\mathcal{O}(\sqrt{\log(t)}) (B.2)

Where the noise random variables \eta_{t} are conditionally R-sub-Gaussian with respect to the filtration \{F_{t}\}_{t=0}^{\infty}, for some R>0, such that,

\mathbbm{E}[e^{\lambda\eta_{t}}|F_{t-1}]\leq\exp\Big(\frac{\lambda^{2}R^{2}}{2}\Big),\quad\forall\lambda\in\mathbbm{R} (B.3)

Where d is the dimension of the parameters, L is a bound on the actions such that \norm{X_{t}}_{2}\leq L,\forall t\geq 1, and S is a bound on the parameters such that \norm{\theta^{*}}_{2}\leq S.
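A minimal Python sketch of this loop is given below, assuming (for illustration only) a 2-dimensional linear reward \langle x,\theta^{*}\rangle with Gaussian noise, a finite action set of price features, and a crude heuristic confidence radius in place of the exact expression in Eq. (B.2).

import numpy as np

# Sketch of an OFUL-style loop (illustrative only; beta is a heuristic stand-in for
# the exact radius in Eq. (B.2)). theta_star mimics a linear demand model with
# feature x = (1, price), so the reward is theta_0 - theta_1 * price plus noise.
rng = np.random.default_rng(0)
theta_star = np.array([18.0, -7.0])
actions = np.array([[1.0, p] for p in np.linspace(0.5, 2.5, 21)])
lam, d, T = 1.0, 2, 2_000
V = lam * np.eye(d)                               # regularized design matrix V_bar
xy = np.zeros(d)                                  # running sum of y_t * x_t
for t in range(1, T + 1):
    theta_hat = np.linalg.solve(V, xy)            # ridge estimate of theta
    beta = 1.0 + np.sqrt(np.log(t + 1))           # heuristic confidence radius (assumption)
    V_inv = np.linalg.inv(V)
    widths = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv, actions))
    x = actions[int(np.argmax(actions @ theta_hat + beta * widths))]   # optimistic action
    y = x @ theta_star + rng.normal(0.0, 1.0)     # observe noisy reward
    V += np.outer(x, x)
    xy += y * x
print("final estimate of theta*:", np.round(np.linalg.solve(V, xy), 2))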

B.3 Proof of Lemma 3.1

OFUL - Bound on 2-Norm: With probability 1-\delta, and t\geq e, the constraint ||\theta-\theta^{*}||_{\bar{V}_{t}}\leq\mathcal{O}(\sqrt{\log(t)}) from [APS11] implies the finite 2-norm bound ||\theta-\theta^{*}||_{2}\leq\mathcal{O}(\sqrt{\log(t)/t}), and thereby the existence of \kappa>0 such that ||\theta-\theta^{*}||_{2}\leq\kappa\sqrt{\log(t)/t}. (Proof in Appendix B.3.)

Proof.

We assume \bar{V}_{t} is full rank and that sufficient exploration has occurred. Let \mathbf{\Delta}_{\theta}=\theta^{*}-\hat{\theta}_{t}. From [APS11], we define \bar{V}_{t} as,

\bar{V}_{t}=V+\sum_{i=1}^{t}X_{i}X_{i}^{T} (B.4)

Let VV be a positive definite real-valued matrix defined on d\mathbb{R}^{d}, represented as a diagonal matrix with entries λ\lambda, where λ+\lambda\in\mathbb{R}^{+}. Moreover, XX is the vector corresponding to the arm selected (specifically in our case, this corresponds to the pricing action of the follower). Furthermore, we define the self-normalizing norm of 𝚫θ\mathbf{\Delta}_{\theta} as,

𝚫θV¯t=𝚫θTV¯t𝚫θ\displaystyle\norm{\mathbf{\Delta}_{\theta}}_{\bar{V}_{t}}=\sqrt{\mathbf{\Delta}_{\theta}^{T}\bar{V}_{t}\mathbf{\Delta}_{\theta}} (B.5)

Fundamentally, we wish to demonstrate that given κ\kappa there exists κ2\kappa_{2} such that,

||\theta-\theta^{*}||_{\bar{V}_{t}} \leq\kappa\sqrt{\log(t)}\implies||\theta-\theta^{*}||_{2}\leq\kappa_{2}\sqrt{\log(t)/t} (B.6)
\forall t \in\{\mathbbm{Z}\,|\,1\leq t\leq T\},\quad\forall\kappa\in\mathbb{R}^{+},\quad\exists\kappa_{2}\in\mathbb{R}^{+} (B.7)

We note that ||\theta-\theta^{*}||_{\bar{V}_{t}}\leq\kappa\sqrt{\log(t)}, with non-decreasing eigenvalues of \bar{V}_{t} with respect to t, generally denotes a volume-decreasing ellipsoid [LS20]. Suppose there exists a scalar \kappa_{2}\in(0,\infty), where we impose that the minimum eigenvalue of \bar{V}_{t} scales as \kappa_{2}t for all t\in[1,T]. This would result in the inequality,

κ2t𝚫θ2𝚫θTV¯t𝚫θ=𝚫θV¯tlog(t)\displaystyle\kappa_{2}\sqrt{t}||\mathbf{\Delta}_{\theta}||_{2}\leq\sqrt{\mathbf{\Delta}_{\theta}^{T}\bar{V}_{t}\mathbf{\Delta}_{\theta}}=\norm{\mathbf{\Delta}_{\theta}}_{\bar{V}_{t}}\leq\sqrt{\log(t)} (B.8)

The inequality in Eq. (B.8) holds independent of tt, as 𝚫θ2||\mathbf{\Delta}_{\theta}||_{2} is constant and 𝚫θV¯t||\mathbf{\Delta}_{\theta}||_{\bar{V}_{t}} is non-decreasing with respect to time. Therefore, the existence of a scalar constant κ2\kappa_{2} implies that,

κ2tθtθ^t2\displaystyle\kappa_{2}\sqrt{t}\norm{\theta_{t}^{*}-\hat{\theta}_{t}}_{2} θtθ^tV¯t\displaystyle\leq\norm{\theta_{t}^{*}-\hat{\theta}_{t}}_{\bar{V}_{t}} (B.9)

Next, suppose we select some \kappa_{v} such that \kappa_{v} obeys the constraint in Eq. (B.10); we show the existence of such a \kappa_{v}>0 next.

\norm{\theta_{t}^{*}-\hat{\theta}_{t}}_{\bar{V}_{t}}\leq R\sqrt{d\log\Big(\frac{1+tL^{2}/\lambda}{\delta}\Big)}+\lambda^{1/2}S\leq\kappa_{v}\sqrt{\log(t)} (B.10)

Existence of κv\kappa_{v}: We wish to show the existence of κv\kappa_{v}, such that,

R\sqrt{d\log\Big(\frac{1+tL^{2}/\lambda}{\delta}\Big)}+\lambda^{1/2}S\leq\kappa_{v}\sqrt{\log(t)} (B.11)

Let us define constants,

c1log(c2+c3t)+λ1/2S\displaystyle c_{1}\sqrt{\log(c_{2}+c_{3}t)}+\lambda^{1/2}S κvlog(t)\displaystyle\leq\kappa_{v}\sqrt{\log(t)} (B.12)
wherec1=Rd,c2=1/δ,c3\displaystyle\text{where}\quad c_{1}=R\sqrt{d},\,c_{2}=1/\delta,\,c_{3} =L2λδ\displaystyle=\frac{L^{2}}{\lambda\delta} (B.13)

We first look at the relation where,

c1log(c2+c3t)\displaystyle c_{1}\sqrt{\log(c_{2}+c_{3}t)} κvlog(t)\displaystyle\leq\kappa_{v}\sqrt{\log(t)} (B.14)

We wish to find a value for \kappa_{v} such that Eq. (B.14) holds. Let c^{\prime}=c_{2}c_{3}, so for t\geq 1, we have,

c1log(c2+c3t)\displaystyle c_{1}\sqrt{\log(c_{2}+c_{3}t)} c1log(ct)\displaystyle\leq c_{1}\sqrt{\log(c^{\prime}t)} (B.15)
=c1log(c)+log(t)\displaystyle=c_{1}\sqrt{\log(c^{\prime})+\log(t)} (B.16)

Suppose κv\kappa_{v} can be arbitrarily large, and we want to infer the existence of a κv\kappa_{v} such that Eq. (B.14) holds.

c1log(c)+log(t)\displaystyle c_{1}\sqrt{\log(c^{\prime})+\log(t)} κvlog(t)\displaystyle\leq\kappa_{v}\sqrt{\log(t)} (B.17)
c12(log(c)+log(t))\displaystyle c_{1}^{2}(\log(c^{\prime})+\log(t)) κv2log(t)\displaystyle\leq\kappa_{v}^{2}\log(t) (B.18)
c_{1}\sqrt{\log(c^{\prime})/\log(t)+1} \leq\kappa_{v} (B.19)

Therefore we select κv\kappa_{v} as,

κv\displaystyle\kappa_{v} =c1log(c)+1\displaystyle=c_{1}\sqrt{\log(c^{\prime})+1} (B.20)
=Rdlog(L2λδ2)+1\displaystyle=R\sqrt{d}\sqrt{\log(\frac{L^{2}}{\lambda\delta^{2}})+1} (B.21)

Such that Eq. (B.19) holds, as long as t\geq e. The second condition is that,

κvlog(t)λ1/2S\displaystyle\kappa_{v}\sqrt{\log(t)}\geq\lambda^{1/2}S (B.22)

For this we can argue that, so long as \sqrt{\log(t)}\geq 1\iff t\geq e, we can choose \kappa_{v}=\lambda^{1/2}S such that Eq. (B.22) holds. Therefore, we can select \kappa_{v} as,

κv=max{λ1/2S,Rdlog(L2λδ2)+1}\displaystyle\kappa_{v}=\max\Big{\{}\lambda^{1/2}S,\ R\sqrt{d}\sqrt{\log(\frac{L^{2}}{\lambda\delta^{2}})+1}\Big{\}} (B.23)

Which holds when tet\geq e, allowing for an upper bound for κv\kappa_{v} in Eq. (B.23). We combine our bound with Eq. (B.9), where

κ2tθtθ^t2\displaystyle\kappa_{2}\sqrt{t}\norm{\theta_{t}^{*}-\hat{\theta}_{t}}_{2} θtθ^tV¯tκvlog(t)\displaystyle\leq\norm{\theta_{t}^{*}-\hat{\theta}_{t}}_{\bar{V}_{t}}\leq\kappa_{v}\sqrt{\log(t)} (B.24)
θtθ^t2\displaystyle\norm{\theta_{t}^{*}-\hat{\theta}_{t}}_{2} κlog(t)/t\displaystyle\leq\kappa\sqrt{\log(t)/t} (B.25)

Where κ=κv/κ2\kappa=\kappa_{v}/\kappa_{2}.

B.4 Proof of Theorem 3

Bounding the Optimistic Follower Action: Let \Delta_{\mathcal{H}}(t) represent the deviation in the pricing action of the follower (retailer) at time t. Then \Delta_{\mathcal{H}}(t)\leq\kappa_{H}\sqrt{\log(t)/t} for some \kappa_{H}>0 and t\geq e.

Proof.

We connect the bound on the estimated parameters \norm{\hat{\theta}-\theta^{*}}_{2}\leq\mathcal{O}(\sqrt{\log(t)/t}) with bounds on the optimal expected follower reward \mathcal{G}(p_{a})^{*}_{a}. Under the optimism principle, we therefore solve the optimization problem of maximizing \mathcal{H}(\theta)=p_{0}(\theta), defined in Eq. (B.26a) and (B.26b). In this optimization problem, we optimize for the optimistic estimate of p_{0} by maximizing \theta_{0}/\theta_{1}.

By Lemma 3.1, we have an exact bound on the inequality Eq. (B.26b). Therefore we can solve the maximization problem for the value \mathcal{H}(\theta) to give the optimistic estimate of the upper bound on the follower reward. With probability 1-\delta,

maxθ\displaystyle\!\max_{\theta} (θ)\displaystyle\qquad\qquad\mathcal{H}(\theta) =θ^0θ^1\displaystyle=\ \frac{\hat{\theta}_{0}}{\hat{\theta}_{1}} (B.26a)
subject to (θ^0θ0)2+(θ^1θ1)2\displaystyle\qquad\sqrt{(\hat{\theta}_{0}-\theta_{0}^{*})^{2}+(\hat{\theta}_{1}-\theta_{1}^{*})^{2}} κlog(t)/t\displaystyle\leq\kappa\sqrt{\log(t)}/\sqrt{t} (B.26b)

Where \theta_{0} and \theta_{1} describe a hyperplane (a line in the 2D case), and the confidence ball constraint in Eq. (B.26b) denotes the parameter constraints. The optimization problem can be visualized in 2D space by considering the distance from a point to a line, and solving for the parameters which maximize the slope \theta_{0}/\theta_{1}, as illustrated in Fig. 3.

θ^1\displaystyle\mathcal{H}^{*}\hat{\theta}_{1} =θ^0\displaystyle=\hat{\theta}_{0} (B.27)
|θ1θ0|2+1\displaystyle\frac{|\mathcal{H}^{*}\theta_{1}-\theta_{0}|}{\sqrt{{\mathcal{H}^{*}}^{2}+1}} =κlog(t)/t\displaystyle=\kappa\sqrt{\log(t)}/\sqrt{t} (B.28)

Solving Eq. (3.7) for \mathcal{H}^{*} gives us a closed-form expression for \mathcal{H}^{*}(\theta) in terms of \theta, as expressed in Eq. (B.29).

(θ)\displaystyle\mathcal{H}^{*}(\theta) =θ^0θ^1±(κlog(t)/t)2(θ^02+θ^12)(κlog(t)/t)4θ^12(κlog(t)/t)2\displaystyle=\frac{\hat{\theta}_{0}\hat{\theta}_{1}\pm\sqrt{(\kappa\sqrt{\log(t)}/\sqrt{t})^{2}(\hat{\theta}_{0}^{2}+\hat{\theta}_{1}^{2})-(\kappa\sqrt{\log(t)}/\sqrt{t})^{4}}}{\hat{\theta}_{1}^{2}-(\kappa\sqrt{\log(t)}/\sqrt{t})^{2}} (B.29)
=θ^0θ^1±(θ^02+θ^12)κ2log(t)/tκ4log2(t)/t2θ^12κ2log(t)/t\displaystyle=\frac{\hat{\theta}_{0}\hat{\theta}_{1}\pm\sqrt{(\hat{\theta}_{0}^{2}+\hat{\theta}_{1}^{2})\kappa^{2}\log(t)/t-\kappa^{4}\log^{2}(t)/t^{2}}}{\hat{\theta}_{1}^{2}-\kappa^{2}\log(t)/t} (B.30)
<θ^0θ^1+(θ^02+θ^12)κ2log(t)/tθ^12κ2log(t)/t\displaystyle<\frac{\hat{\theta}_{0}\hat{\theta}_{1}+\sqrt{(\hat{\theta}_{0}^{2}+\hat{\theta}_{1}^{2})\kappa^{2}\log(t)/t}}{\hat{\theta}_{1}^{2}-\kappa^{2}\log(t)/t} (B.31)

When we obtain extreme points for (θ)\mathcal{H}(\theta) we obtain the bounds for 𝒢A(pa)[𝒢¯A(pa,a),𝒢¯A(pa,a)]\mathcal{G}_{A}(p_{a})^{*}\in[\underline{\mathcal{G}}_{A}(p_{a},a),\ \bar{\mathcal{G}}_{A}(p_{a},a)]. The extremum of (θ)\mathcal{H}(\theta) always occurs at the tangent of the confidence ball constraints, where the slope of the line denoted by θ0/θ1\theta_{0}/\theta_{1} is maximized or minimized. We note (θ)\mathcal{H}(\theta) is convex in both θ0\theta_{0} and θ1\theta_{1}.
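A brief numerical check of the closed form in Eq. (B.29) is sketched below; \hat{\theta}_{0}, \hat{\theta}_{1}, and the radius r (standing in for \kappa\sqrt{\log(t)/t}) are arbitrary illustrative values.

import numpy as np

# Sketch verifying Eq. (B.29): the maximum of theta_0 / theta_1 over a confidence disk
# of radius r centred at (theta_0_hat, theta_1_hat) matches the '+' root of the
# tangency condition in Eq. (B.28). All numbers are illustrative.
th0, th1, r = 18.0, 7.0, 0.8
closed_form = (th0 * th1 + np.sqrt(r**2 * (th0**2 + th1**2) - r**4)) / (th1**2 - r**2)
phi = np.linspace(0.0, 2.0 * np.pi, 200_000)
boundary_ratios = (th0 + r * np.cos(phi)) / (th1 + r * np.sin(phi))  # ratio on the circle
print(round(closed_form, 4), round(boundary_ratios.max(), 4))        # the two values agree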

As (θ)\mathcal{H}^{*}(\theta) denotes the optimistic estimate of p0p_{0}, we see from the bounded inequality expressed in Eq. (B.31) that (θ)\mathcal{H}^{*}(\theta) is a decreasing function as the estimate of θ\theta improves. Therefore, the critical fractile γ\gamma is also decreasing, with respect to tt, where p¯0=(θ)/2\bar{p}_{0}=\mathcal{H}^{*}(\theta)/2, with reference to Eq. (B.32).

γ=p0ap0\displaystyle\gamma=\frac{p_{0}-a}{p_{0}} =12a(θ)+a\displaystyle=1-\frac{2a}{\mathcal{H}(\theta)+a} (B.32)

The learning algorithm converges to within a gap, which, at large values of t, we define as,

Δ(t)\displaystyle\Delta_{\mathcal{H}}(t) =θ^0θ^1±(θ^02+θ^12)κ2log(t)/tκ4log2(t)/t2θ^12κ2log(t)/tθ^0θ^1\displaystyle=\frac{\hat{\theta}_{0}\hat{\theta}_{1}\pm\sqrt{(\hat{\theta}_{0}^{2}+\hat{\theta}_{1}^{2})\kappa^{2}\log(t)/t-\kappa^{4}\log^{2}(t)/t^{2}}}{\hat{\theta}_{1}^{2}-\kappa^{2}\log(t)/t}-\frac{\hat{\theta}_{0}}{\hat{\theta}_{1}} (B.33)

As t\to T for sufficiently large values of T, \Delta_{\mathcal{H}}(t) approaches 0, as \log(t)/t approaches 0.

Δ(t)\displaystyle\Delta_{\mathcal{H}}(t) =θ^1(θ^0θ^1+(θ^02+θ^12)κ2log(t)/t)θ^0(θ^12κ2log(t)/t)θ^1(θ^12κ2log(t)/t)\displaystyle=\frac{\hat{\theta}_{1}\Big{(}\hat{\theta}_{0}\hat{\theta}_{1}+\sqrt{(\hat{\theta}_{0}^{2}+\hat{\theta}_{1}^{2})\kappa^{2}\log(t)/t}\Big{)}-\hat{\theta}_{0}(\hat{\theta}_{1}^{2}-\kappa^{2}\log(t)/t)}{\hat{\theta}_{1}(\hat{\theta}_{1}^{2}-\kappa^{2}\log(t)/t)} (B.34)
=θ^1θ^0θ^1+θ^1(θ^02+θ^12)κ2log(t)/tθ^0θ^12+θ^0κ2log(t)/tθ^1(θ^12κ2log(t)/t)\displaystyle=\frac{\hat{\theta}_{1}\hat{\theta}_{0}\hat{\theta}_{1}+\hat{\theta}_{1}\sqrt{(\hat{\theta}_{0}^{2}+\hat{\theta}_{1}^{2})\kappa^{2}\log(t)/t}-\hat{\theta}_{0}\hat{\theta}_{1}^{2}+\hat{\theta}_{0}\kappa^{2}\log(t)/t}{\hat{\theta}_{1}(\hat{\theta}_{1}^{2}-\kappa^{2}\log(t)/t)} (B.35)
=θ^12θ^0θ^0θ^12+θ^1(θ^02+θ^12)κ2log(t)/t+θ^0κ2log(t)/tθ^1(θ^12κ2log(t)/t)\displaystyle=\frac{\hat{\theta}_{1}^{2}\hat{\theta}_{0}-\hat{\theta}_{0}\hat{\theta}_{1}^{2}+\hat{\theta}_{1}\sqrt{(\hat{\theta}_{0}^{2}+\hat{\theta}_{1}^{2})\kappa^{2}\log(t)/t}+\hat{\theta}_{0}\kappa^{2}\log(t)/t}{\hat{\theta}_{1}(\hat{\theta}_{1}^{2}-\kappa^{2}\log(t)/t)} (B.36)
=θ^1(θ^02+θ^12)κ2log(t)/t+θ^0κ2log(t)/tθ^1θ^12θ^1κ2log(t)/t\displaystyle=\frac{\hat{\theta}_{1}\sqrt{(\hat{\theta}_{0}^{2}+\hat{\theta}_{1}^{2})\kappa^{2}\log(t)/t}+\hat{\theta}_{0}\kappa^{2}\log(t)/t}{\hat{\theta}_{1}\hat{\theta}_{1}^{2}-\hat{\theta}_{1}\kappa^{2}\log(t)/t} (B.37)
<θ^1(θ^02+θ^12)κ2log(t)/t+θ^0κ2log(t)/tθ^1θ^12\displaystyle<\frac{\hat{\theta}_{1}\sqrt{(\hat{\theta}_{0}^{2}+\hat{\theta}_{1}^{2})\kappa^{2}\log(t)/t}+\hat{\theta}_{0}\kappa^{2}\log(t)/t}{\hat{\theta}_{1}\hat{\theta}_{1}^{2}} (B.38)

We have decreasing behaviour at sufficiently large values of t, such that \sqrt{\log(t)/t}\gg\log(t)/t, which is not difficult to obtain; moreover, \kappa^{2}\log(t)/t is decreasing for t>e, so the denominator satisfies \hat{\theta}_{1}\hat{\theta}_{1}^{2}-\hat{\theta}_{1}\kappa^{2}\log(t)/t\leq\hat{\theta}_{1}\hat{\theta}_{1}^{2}, i.e. it is upper-bounded by \hat{\theta}_{1}\hat{\theta}_{1}^{2} (see Appendix B.7). The upper bound on \Delta_{\mathcal{H}}(t) is denoted in Eq. (B.39).

Δ(t)\displaystyle\Delta_{\mathcal{H}}(t) θ^1(θ^02+θ^12)κ2log(t)/t+θ^0κ2log(t)/tθ^1θ^12\displaystyle\leq\frac{\hat{\theta}_{1}\sqrt{(\hat{\theta}_{0}^{2}+\hat{\theta}_{1}^{2})\kappa^{2}\log(t)/t}+\hat{\theta}_{0}\kappa^{2}\log(t)/t}{\hat{\theta}_{1}\hat{\theta}_{1}^{2}} (B.39)

Essentially, Eq. (B.39) is a linear combination of \sqrt{\log(t)/t} and \log(t)/t, where \sqrt{\log(t)/t}\geq\log(t)/t for t\geq 1. We denote two constants v_{1} and v_{2} such that the bound equals v_{1}\sqrt{\log(t)/t}+v_{2}\log(t)/t. Therefore,

v1log(t)/t+v2log(t)/t(v1+v2)log(t)/t\displaystyle v_{1}\sqrt{\log(t)/t}+v_{2}\log(t)/t\leq(v_{1}+v_{2})\sqrt{\log(t)/t} (B.40)

Where,

v1=θ^1(θ^02+θ^12)κ2θ^1θ^12,v2=θ^0κ2θ^1θ^12\displaystyle v_{1}=\frac{\hat{\theta}_{1}\sqrt{(\hat{\theta}_{0}^{2}+\hat{\theta}_{1}^{2})\kappa^{2}}}{\hat{\theta}_{1}\hat{\theta}_{1}^{2}},\quad v_{2}=\frac{\hat{\theta}_{0}\kappa^{2}}{\hat{\theta}_{1}\hat{\theta}_{1}^{2}} (B.41)

We can see from the inequality in Eq. (B.40) that, for \kappa_{H}=v_{1}+v_{2},

Δ(t)\displaystyle\Delta_{\mathcal{H}}(t) κHlog(t)/t\displaystyle\leq\kappa_{H}\sqrt{\log(t)/t} (B.42)

Therefore, we can deduce that the decreasing order of magnitude of \Delta_{\mathcal{H}}(t) is \mathcal{O}(\sqrt{\log(t)/t}) for sufficiently large values of t\geq e, via the bound in Eq. (B.39).

Δ(t)\displaystyle\Delta_{\mathcal{H}}(t) 𝒪(log(t)/t),te\displaystyle\in\mathcal{O}(\sqrt{\log(t)/t}),\quad\forall t\geq e (B.43)

Δ(t)\Delta_{\mathcal{H}}(t) describes the behaviour of the convergence of (θ)\mathcal{H}^{*}(\theta) towards the true p0p_{0} with respect to tt, and this decreases with 𝒪(log(t)/t)\mathcal{O}(\sqrt{\log(t)/t}).

B.5 Proof of Lemma 3.2

Bounding the Cumulation of \Delta_{\mathcal{H}}(t): Suppose we wish to measure the cumulative value of \Delta_{\mathcal{H}}(t) from 1 to T, denoted as \Delta_{\mathcal{H}}(T). Then \Delta_{\mathcal{H}}(T)\leq 2\kappa_{H}\sqrt{T\log(T)}.

Proof.

To quantify \sum^{T}_{t=\bar{t}}\Delta_{\mathcal{H}}(t) we take the integral of \Delta_{\mathcal{H}}(t), noting that the integral of \Delta_{\mathcal{H}}(t) from 1 to T will always be greater than the discrete sum so long as \Delta_{\mathcal{H}}(t)>0.

\Delta_{\mathcal{H}}(T) <\sum^{T}_{t=\bar{t}}\Delta_{\mathcal{H}}(t) (B.44)
<\kappa_{H}\int_{1}^{T}\sqrt{\frac{\log(t)}{t}}dt (B.45)
=\kappa_{H}\Big(2\sqrt{t\log(t)}-\sqrt{2\pi}\,\mathrm{erfi}(\sqrt{\log(t)/2})\Big|_{1}^{T}\Big) (B.46)
=\kappa_{H}\Big(2\sqrt{T\log(T)}-\sqrt{2\pi}\,\mathrm{erfi}(\sqrt{\log(T)/2})-2\sqrt{1\cdot\log(1)}+\sqrt{2\pi}\,\mathrm{erfi}(\sqrt{\log(1)/2})\Big) (B.47)
=\kappa_{H}\Big(2\sqrt{T\log(T)}-\sqrt{2\pi}\,\mathrm{erfi}(\sqrt{\log(T)/2})\Big) (B.48)
\leq 2\kappa_{H}\sqrt{T\log(T)} (B.49)

From Lemma 3.2 we can see that the cumulative optimistic pricing deviation of the follower is,

Δ(T)\displaystyle\Delta_{\mathcal{H}}(T) 𝒪(Tlog(T))\displaystyle\in\mathcal{O}(\sqrt{T\log(T)}) (B.50)
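A quick numerical check of the antiderivative used in Eq. (B.46) is sketched below; T is an arbitrary illustrative value.

import numpy as np
from scipy import integrate
from scipy.special import erfi

# Sketch checking int_1^T sqrt(log(t)/t) dt against the closed form
# 2*sqrt(T*log(T)) - sqrt(2*pi)*erfi(sqrt(log(T)/2)) used in Eq. (B.46)-(B.48).
T = 5_000.0
numeric, _ = integrate.quad(lambda t: np.sqrt(np.log(t) / t), 1.0, T)
analytic = 2.0 * np.sqrt(T * np.log(T)) - np.sqrt(2.0 * np.pi) * erfi(np.sqrt(np.log(T) / 2.0))
print(round(numeric, 3), round(analytic, 3))  # the two values agree, and both are below 2*sqrt(T*log(T))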

B.6 Proof of Lemma B.1

Lemma B.1.

Demand estimate reduction: Given any price p, the maximum optimistic demand estimate \bar{d}_{\theta}(p) deviates from the true expected demand by at most \mathcal{O}(\sqrt{\log(t)/t}), and |\theta_{0}^{t}-\theta_{0}^{*}|\in\mathcal{O}(\sqrt{\log(t)/t}). (Proof in Appendix B.6.)

Proof.

Suppose we are interested in only the confidence on θ0\theta_{0}, then Eq. (3.5b) simplifies to Eq. (B.51).

(θ^0θ0)2\displaystyle(\hat{\theta}_{0}-\theta_{0}^{*})^{2} κ2log(t)/t\displaystyle\leq\kappa^{2}\log(t)/t (B.51)
θ^0θ0\displaystyle\hat{\theta}_{0}-\theta_{0}^{*} κlog(t)/t\displaystyle\leq\kappa\sqrt{\log(t)/t} (B.52)

Suppose in Eq. (B.52) we set \theta^{*}_{1}=\hat{\theta}_{1}, thereby maximizing the possible demand. Effectively, this is the error in the maximum possible demand estimate (at p=0), which shrinks at a rate of \kappa\sqrt{\log(t)/t}.

B.7 Notes on the Bound on Δ(t)\Delta_{\mathcal{H}}(t)

We want to show that θ^1θ^12θ^1κ2log(t)/tθ^1θ^12\hat{\theta}_{1}\hat{\theta}_{1}^{2}-\hat{\theta}_{1}\kappa^{2}\log(t)/t\leq\hat{\theta}_{1}\hat{\theta}_{1}^{2} for tet\geq e.

ddt(κ2logtt)\displaystyle\frac{d}{dt}\left(\frac{\kappa^{2}\log t}{t}\right) =vdudtudvdtv2\displaystyle=\frac{v\frac{du}{dt}-u\frac{dv}{dt}}{v^{2}}
=tddt(κ2logt)κ2logtddt(t)t2\displaystyle=\frac{t\cdot\frac{d}{dt}(\kappa^{2}\log t)-\kappa^{2}\log t\cdot\frac{d}{dt}(t)}{t^{2}}
=tκ21tκ2logt1t2\displaystyle=\frac{t\cdot\kappa^{2}\frac{1}{t}-\kappa^{2}\log t\cdot 1}{t^{2}}
=κ2κ2logtt2\displaystyle=\frac{\kappa^{2}-\kappa^{2}\log t}{t^{2}}
=κ2(1logt)t2\displaystyle=\frac{\kappa^{2}(1-\log t)}{t^{2}}

Therefore,

ddt(κ2logtt)=0t=e\displaystyle\frac{d}{dt}\left(\frac{\kappa^{2}\log t}{t}\right)=0\xrightarrow[]{}t=e (B.53)
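A one-line symbolic check of Eq. (B.53), sketched with SymPy:

import sympy as sp

# Symbolic check that kappa^2 * log(t) / t is maximized at t = e, matching Eq. (B.53).
t, kappa = sp.symbols("t kappa", positive=True)
print(sp.solve(sp.diff(kappa**2 * sp.log(t) / t, t), t))  # -> [E]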

Appendix C Regret Bounds

C.1 Proof for Lemma C.1

Lemma C.1.

Approximation of Smooth Concave Function: Suppose any smooth differentiable concave function f(x):\mathbb{R}\to\mathbb{R}, and its approximation \hat{f}(x):\mathbb{R}\to\mathbb{R}, which is also smooth and differentiable (but not guaranteed to be concave). If f(x) exhibits a decreasing upper bound on the approximation error \lVert\hat{f}(x)-f(x)\rVert\leq\epsilon, then \epsilon\to 0 consequently implies \norm{\hat{x}^{*}-x^{*}}\to 0 almost surely, where x^{*}=\mathrm{argmax}\ f(x), \hat{x}^{*}=\mathrm{argmax}\ \hat{f}(x), and ||\cdot|| denotes the supremum norm.

Proof.

Suppose some controllable function \epsilon(N), monotonically decreasing with respect to increasing N, bounds the supremum norm of the approximation error of f(x). Then the following statements hold true for any x\in\mathbb{R}.

f(x)f^(x)ϵ(N),f(x^)f^(x^)ϵ(N)\displaystyle\norm{f(x^{*})-\hat{f}(x^{*})}\leq\epsilon(N),\quad\norm{f(\hat{x}^{*})-\hat{f}(\hat{x}^{*})}\leq\epsilon(N) (C.1)

Summing these conditions, we have the new relation,

f(x)f^(x)+f^(x^)f(x^)2ϵ(N)\displaystyle\norm{f(x^{*})-\hat{f}(x^{*})}+\norm{\hat{f}(\hat{x}^{*})-f(\hat{x}^{*})}\leq 2\epsilon(N) (C.2)

By triangle inequality,

f(x)f^(x)+f^(x^)f(x^)f(x)f^(x)+f^(x^)f(x^)2ϵ(N)\displaystyle\norm{f(x^{*})-\hat{f}(x^{*})+\hat{f}(\hat{x}^{*})-f(\hat{x}^{*})}\leq\norm{f(x^{*})-\hat{f}(x^{*})}+\norm{\hat{f}(\hat{x}^{*})-f(\hat{x}^{*})}\leq 2\epsilon(N) (C.3)

Note: the order of \hat{f}(\cdot)-f(\cdot) versus f(\cdot)-\hat{f}(\cdot) can be swapped and will yield the same result. The left hand side of Eq. (C.3) allows us to obtain an equivalent expression by rearranging the terms inside the norm,

f(x)f(x^)+f^(x^)f^(x)2ϵ(N)\displaystyle\norm{f(x^{*})-f(\hat{x}^{*})+\hat{f}(\hat{x}^{*})-\hat{f}(x^{*})}\leq 2\epsilon(N) (C.4)

By definition of the maximizers x^{*} and \hat{x}^{*},

f(x)f(x^)0,f^(x^)f^(x)0\displaystyle f(x^{*})-f(\hat{x}^{*})\geq 0,\quad\hat{f}(\hat{x}^{*})-\hat{f}(x^{*})\geq 0 (C.5)

Thus we have equality,

f^(x^)f^(x)0+f(x)f(x^)0=f^(x^)f^(x)+f(x)f(x^)\displaystyle\lVert\underbrace{\hat{f}(\hat{x}^{*})-\hat{f}(x^{*})}_{\geq 0}+\underbrace{f(x^{*})-f(\hat{x}^{*})}_{\geq 0}\rVert=\lVert\hat{f}(\hat{x}^{*})-\hat{f}(x^{*})\rVert+\norm{f(x^{*})-f(\hat{x}^{*})} (C.6)

Thus we have the new inequality,

0f^(x^)f^(x)+f(x)f(x^)2ϵ(N)\displaystyle 0\leq\lVert\hat{f}(\hat{x}^{*})-\hat{f}(x^{*})\rVert+\norm{f(x^{*})-f(\hat{x}^{*})}\leq 2\epsilon(N) (C.7)

As 2\epsilon(N)\to 0, this implies both,

limϵ0f^(x^)f^(x)=0\displaystyle\lim_{{\epsilon\to 0}}\norm{\hat{f}(\hat{x}^{*})-\hat{f}(x^{*})}=0 (C.8)
limϵ0f(x)f(x^)=0\displaystyle\lim_{{\epsilon\to 0}}\norm{f(x^{*})-f(\hat{x}^{*})}=0 (C.9)

Concavity of f(x)f(x): This is where it is important to note the concavity of f(x)f(x) particularly in Eq. (C.9). Due to the concavity of f(x)f(x), as the function approximation error approaches 0, so does x^x\norm{\hat{x}^{*}-x^{*}}.

limϵ0f(x)f(x^)=0limϵ0x^x=0\displaystyle\lim_{{\epsilon\to 0}}\norm{f(x^{*})-f(\hat{x}^{*})}=0\implies\lim_{{\epsilon\to 0}}\norm{\hat{x}^{*}-x^{*}}=0 (C.10)
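The following minimal sketch illustrates Lemma C.1 with a concrete concave function and a perturbed approximation, both chosen arbitrarily for illustration.

import numpy as np

# Sketch of Lemma C.1: f(x) = -(x - 2)^2 is concave with maximizer x* = 2, and
# f_hat = f + eps * sin(5x) has sup-norm error at most eps. As eps shrinks,
# the maximizer of f_hat approaches x*.
x = np.linspace(0.0, 4.0, 100_001)
f = -(x - 2.0) ** 2
for eps in (1.0, 0.1, 0.01, 0.001):
    x_hat = x[np.argmax(f + eps * np.sin(5.0 * x))]
    print(f"eps = {eps:<6} |x_hat - x*| = {abs(x_hat - 2.0):.4f}")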

C.2 Notes on Optimistic Best Response

The upper bound of the optimistic best response, \bar{b}_{a}, can be determined by estimating the optimistic riskless price \bar{p}_{0}, while both sides of Eq. (3.3) provide optimistic estimates of expected profit, \mathbb{E}[\mathcal{G}_{B}(\theta|a)], assuming the optimal price is at the riskless price, p_{0}=\mathcal{H}(\theta). These expressions summarize follower pricing and ordering under a greedy strategy, accounting for optimistic parameter estimation. In principle, although the follower may never converge to the optimal p^{*} given the limitations of the approximation tools, aiming for convergence towards p_{0} serves as a practical approach. p_{0} represents not only a simplifying mathematical assumption, but also aligns with the idea of optimization under optimism, as p^{*}\leq p_{0}. By pricing at the riskless price, the follower prices optimistically, which serves as a reasonable economic strategy for dynamic pricing.

𝔼[𝒢¯B(θ|a)]\displaystyle\mathbb{E}[\bar{\mathcal{G}}_{B}(\theta|a)] =maxθ𝒞t(p(θ)a)Fθ1(p(θ)ap(θ))\displaystyle=\underset{\theta\in\mathcal{C}^{t}}{\mathrm{max}}\ \Big{(}p^{*}(\theta)-a\Big{)}F_{\theta}^{-1}\Big{(}\frac{p^{*}(\theta)-a}{p^{*}(\theta)}\Big{)} (C.11)
maxθ𝒞t((θ)+a2)Fθ1((θ)a(θ))\displaystyle\leq\underset{\theta\in\mathcal{C}^{t}}{\mathrm{max}}\ \Big{(}\frac{\mathcal{H}(\theta)+a}{2}\Big{)}F_{\theta}^{-1}\Big{(}\frac{\mathcal{H}(\theta)-a}{\mathcal{H}(\theta)}\Big{)} (C.12)
=maxθ𝒞t((θ)+a2)Fθ1(12a(θ)+a)\displaystyle=\underset{\theta\in\mathcal{C}^{t}}{\mathrm{max}}\ \Big{(}\frac{\mathcal{H}(\theta)+a}{2}\Big{)}F_{\theta}^{-1}\Big{(}1-\frac{2a}{\mathcal{H}(\theta)+a}\Big{)} (C.13)
θ¯\displaystyle\bar{\theta} =argmaxθ𝒞t((θ)+a2)Fθ1(12a(θ)+a)\displaystyle=\underset{\theta\in\mathcal{C}^{t}}{\mathrm{argmax}}\ \Big{(}\frac{\mathcal{H}(\theta)+a}{2}\Big{)}F_{\theta}^{-1}\Big{(}1-\frac{2a}{\mathcal{H}(\theta)+a}\Big{)} (C.14)
b¯a\displaystyle\bar{b}_{a} =Fθa1(p0ap0),p0=a+p¯a2=(θ)+a2,𝔅(a)[b¯a,b¯a],b¯a0\displaystyle=F^{-1}_{\theta_{a}}\Big{(}\frac{p_{0}-a}{p_{0}}\Big{)},\quad p_{0}=a+\frac{\bar{p}-a}{2}=\frac{\mathcal{H}(\theta)+a}{2},\quad\mathfrak{B}(a)\in[\underline{b}_{a},\bar{b}_{a}],\ \underline{b}_{a}\geq 0\ (C.15)
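
To make the optimistic best response above concrete, the following sketch (in Python) grid-searches a confidence set for the optimistic parameters and returns the corresponding riskless price and order quantity. It is a minimal illustration only: the linear-Gaussian demand model, the riskless price intercept \mathcal{H}(\theta)=\theta_{0}/\theta_{1}, the box-shaped confidence set, and all numerical values are assumptions made for the example rather than quantities fixed by Eqs. (C.11)-(C.15).

import numpy as np
from scipy.stats import norm

# Hedged sketch of the optimistic best response in Eqs. (C.11)-(C.15).
# Assumptions for this illustration (not fixed by the equations above):
# linear expected demand Gamma_theta(p) = theta0 - theta1 * p with additive
# N(0, sigma^2) noise, riskless price intercept H(theta) = theta0 / theta1,
# and a box-shaped confidence set C_t around the current estimate theta_hat.

def riskless_price_and_order(theta, a, sigma):
    """Riskless price p0 and order quantity b = F_theta^{-1}((p0 - a)/p0)."""
    theta0, theta1 = theta
    H = theta0 / theta1                      # riskless price intercept H(theta)
    p0 = (H + a) / 2.0                       # riskless price, Eq. (C.15)
    gamma = 1.0 - 2.0 * a / (H + a)          # critical fractile, Eq. (C.13)
    if gamma <= 0.0:
        return p0, 0.0                       # no profitable order at this price
    mu = theta0 - theta1 * p0                # expected demand at p0
    return p0, mu + sigma * norm.ppf(gamma)  # additive-demand quantile

def optimistic_best_response(theta_hat, radius, a, sigma, n_grid=50):
    """Grid search over a box confidence set for the optimistic theta_bar, Eq. (C.14)."""
    t0 = np.linspace(theta_hat[0] - radius, theta_hat[0] + radius, n_grid)
    t1 = np.linspace(max(theta_hat[1] - radius, 0.1), theta_hat[1] + radius, n_grid)
    best, best_val = None, -np.inf
    for theta0 in t0:
        for theta1 in t1:
            p0, b = riskless_price_and_order((theta0, theta1), a, sigma)
            val = (p0 - a) * b               # optimistic objective, Eq. (C.12)
            if val > best_val:
                best, best_val = ((theta0, theta1), p0, b), val
    return best

if __name__ == "__main__":
    theta_bar, p0_bar, b_bar = optimistic_best_response(
        theta_hat=(10.0, 0.5), radius=0.5, a=4.0, sigma=1.0)
    print("optimistic theta:", np.round(theta_bar, 3),
          "p0_bar:", round(p0_bar, 3), "b_bar:", round(b_bar, 3))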

C.3 Proof of Theorem 4

Follower Regret in Online Learning for NPG: Given pure leader strategy \pi_{A}, the worst-case regret of the follower, R_{B}^{T}(\pi_{A}), as defined in Eq. (2.3) when adopting the LNPG algorithm, is bounded by R_{B}^{T}(\pi_{A})\leq\aleph_{B}+\epsilon_{B}T-\theta_{1}\kappa_{H}^{2}\log^{2}(T), for sufficiently large values of T, where \aleph_{B},\epsilon_{B},\theta_{1},\kappa_{H} are positive real constants in \mathbb{R}^{+}.

Proof.

Let \mathcal{U}_{B}(p|a) represent the riskless profit for the follower at follower action (retail price) p, given leader action (supplier price) a.

𝒰B(p|a)\displaystyle\mathcal{U}_{B}(p|a) =θ0(pa)θ1(pa)2=θ1((pa)θ02θ1)2+θ024θ1\displaystyle=\theta_{0}(p-a)-\theta_{1}(p-a)^{2}=-\theta_{1}\Big{(}(p-a)-\frac{\theta_{0}}{2\theta_{1}}\Big{)}^{2}+\frac{\theta_{0}^{2}}{4\theta_{1}} (C.16)

We argue that given any pure strategy \pi_{A}, the contextual regret R_{B}^{T}(\pi_{A}) is bounded by R_{B}^{T}(\pi_{A})\leq\aleph_{B}+\epsilon_{B}T-\theta_{1}\kappa_{H}^{2}\log^{2}(T), for well-defined positive real constants \aleph_{B},\epsilon_{B},\theta_{1},\kappa_{H}. For the sake of the argument, we assume that, given a, the solution p^{*}(z) to Eq. (A.3) is known, where z represents the quantity ordered above the expected demand \Gamma_{\theta}(p) (refer to Appendix A.1). For the mechanics of this proof, this allows us to treat \mathbb{E}[\mathcal{G}_{B}(p^{*}|a)] as a constant value.

Riskless Pricing Difference \Delta_{\mathcal{U}}(t): We begin by analyzing the convergence properties of the upper-bounding function \mathcal{U}_{B}(p|a). The difference in exact profit between pricing at the riskless price, \mathcal{U}_{B}(p_{0}), and at some optimistic riskless price, \mathcal{U}_{B}(\bar{p}_{0}), denoted \Delta_{\mathcal{U}}(t)=\mathcal{U}_{B}(p_{0})-\mathcal{U}_{B}(\bar{p}_{0}), satisfies \Delta_{\mathcal{U}}(t)\leq\theta_{1}\kappa_{H}^{2}\log(t)/t, which we now show.

For convenience, where no ambiguity exists, we may omit the variable a from the notation in the subsequent equations, writing \mathcal{G}_{B}(p|a)\equiv\mathcal{G}_{B}(p), \mathcal{U}_{B}(p|a)\equiv\mathcal{U}_{B}(p), and so forth.

𝒰B(p)\displaystyle\mathcal{U}_{B}(p) =(θ0θ1p)(pa)\displaystyle=(\theta_{0}-\theta_{1}p)(p-a) (C.17)
\displaystyle=-\theta_{1}\left(p^{2}-\frac{\theta_{1}a+\theta_{0}}{\theta_{1}}p\right)-\theta_{0}a (C.18)
\displaystyle=-\theta_{1}\left(p-\frac{\theta_{1}a+\theta_{0}}{2\theta_{1}}\right)^{2}+\frac{\theta_{1}a^{2}}{4}-\frac{\theta_{0}a}{2}+\frac{\theta_{0}^{2}}{4\theta_{1}} (C.19)
\displaystyle=-\theta_{1}\left(p-\frac{\theta_{1}a+\theta_{0}}{2\theta_{1}}\right)^{2}+g(a,\theta_{0},\theta_{1}),\quad\text{where}\quad g(a,\theta_{0},\theta_{1})\equiv\frac{\theta_{1}a^{2}}{4}-\frac{\theta_{0}a}{2}+\frac{\theta_{0}^{2}}{4\theta_{1}} (C.20)
=θ1(p(θ)+a2)2+g(a,θ0,θ1)\displaystyle=-\theta_{1}\left(p-\frac{\mathcal{H}(\theta)+a}{2}\right)^{2}+g(a,\theta_{0},\theta_{1}) (C.21)
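
As a sanity check on the completed-square form above, the following short symbolic computation (a sketch using sympy, not part of the proof) verifies the identity in Eqs. (C.17)-(C.21) with the constant term g(a,\theta_{0},\theta_{1}) as stated.

import sympy as sp

# Symbolic sanity check of the completed-square identity for the riskless
# profit U_B(p) = (theta0 - theta1*p)*(p - a) in Eqs. (C.17)-(C.21).
p, a, th0, th1 = sp.symbols('p a theta0 theta1', positive=True)

U = (th0 - th1 * p) * (p - a)
vertex = (th1 * a + th0) / (2 * th1)                   # (H(theta) + a) / 2
g = th1 * a**2 / 4 - th0 * a / 2 + th0**2 / (4 * th1)  # constant term g(a, theta0, theta1)
U_completed = -th1 * (p - vertex)**2 + g

assert sp.simplify(U - U_completed) == 0               # the two forms agree identically
print("completed square verified; vertex =", sp.simplify(vertex))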

The difference in riskless profit, noting that p_{0}=\frac{\mathcal{H}(\theta)+a}{2} and combining with Theorem 3, is

Δ𝒰(t)\displaystyle\Delta_{\mathcal{U}}(t) =𝒰B(p0)𝒰B(p¯0)\displaystyle=\mathcal{U}_{B}(p_{0})-\mathcal{U}_{B}(\bar{p}_{0}) (C.22)
=g(a,θ0,θ1)θ1(p0(θ)+a2)2g(a,θ0,θ1)+θ1(p0+Δ(t)(θ)+a2)2\displaystyle=g(a,\theta_{0},\theta_{1})-\theta_{1}\Big{(}p_{0}-\frac{\mathcal{H}(\theta)+a}{2}\Big{)}^{2}-g(a,\theta_{0},\theta_{1})+\theta_{1}\Big{(}p_{0}+\Delta_{\mathcal{H}}(t)-\frac{\mathcal{H}(\theta)+a}{2}\Big{)}^{2} (C.23)
=θ1(Δ(t))2\displaystyle=\theta_{1}(\Delta_{\mathcal{H}}(t))^{2} (C.24)
θ1κH2log(t)/t\displaystyle\leq\theta_{1}\kappa_{H}^{2}\log(t)/t (C.25)

We integrate the term \log(t)/t to obtain a closed form expression for Eq. (C.25).

1Tlog(t)t𝑑t=12(log2(T)log2(1))\displaystyle\int_{1}^{T}\frac{\log(t)}{t}\,dt=\frac{1}{2}\Big{(}\log^{2}(T)-\log^{2}(1)\Big{)} (C.26)
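
As a quick numerical sanity check of Eq. (C.26), the short snippet below compares the partial sums of \log(t)/t with \frac{1}{2}\log^{2}(T); it is purely illustrative.

import numpy as np

# Small numerical sanity check of Eq. (C.26): the partial sums of log(t)/t
# grow like (1/2) * log(T)^2, which is how the Delta_U(t) terms are integrated.
for T in (10**2, 10**4, 10**6):
    t = np.arange(1, T + 1)
    partial_sum = float(np.sum(np.log(t) / t))
    half_log_sq = 0.5 * np.log(T) ** 2
    print(f"T={T:>8d}  sum={partial_sum:10.4f}  0.5*log^2(T)={half_log_sq:10.4f}")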

Regret Analysis: We reintroduce a into the notation, as the expected profit \mathbb{E}[\mathcal{G}_{B}(p|a)] and the upper-bounding function \mathcal{U}_{B}(p) are conditioned on the leader action a. We present the definition of the regret R_{B}^{T}(\pi_{A}) conditioned upon the policy \pi_{A}, which is a pure strategy consisting of a sequence of actions. We assume a rational and greedy follower, who does not attempt to alter the strategy of the leader.

RBT(πA)\displaystyle R_{B}^{T}(\pi_{A}) =t=1T𝔼[𝒢B(p|a)]t=1T𝔼[𝒢B(pt|a)]\displaystyle=\sum^{T}_{t=1}\mathbbm{E}[\mathcal{G}_{B}(p^{*}|a)]-\sum^{T}_{t=1}\mathbbm{E}[\mathcal{G}_{B}(p^{t}|a)] (C.27)
=t=1T(𝔼[𝒢B(p|a)]𝔼[𝒢B(p¯0|a)])\displaystyle=\sum^{T}_{t=1}\ \Big{(}\mathbbm{E}[\mathcal{G}_{B}(p^{*}|a)]-\mathbbm{E}[\mathcal{G}_{B}(\bar{p}_{0}|a)]\Big{)} (C.28)

We also know that, when using the riskless price estimate,

𝔼[𝒢B(p¯0|a)]\displaystyle\mathbbm{E}[\mathcal{G}_{B}(\bar{p}_{0}|a)] 𝔼[𝒢B(p|a)]𝒰B(p0|a)\displaystyle\leq\mathbbm{E}[\mathcal{G}_{B}(p^{*}|a)]\leq\mathcal{U}_{B}(p_{0}|a) (C.29)
where,𝔼[𝒢B(p|a)]\displaystyle\text{where,}\quad\mathbbm{E}[\mathcal{G}_{B}(p|a)] 𝒰B(p|a),p,a\displaystyle\leq\mathcal{U}_{B}(p|a),\quad\forall p,\ \forall a (C.30)

Where p¯0\bar{p}_{0} is the upper estimate of the riskless price p0p_{0}. And in the general case,

𝔼[𝒢B(p~|a)]\displaystyle\mathbbm{E}[\mathcal{G}_{B}(\tilde{p}|a)] 𝔼[𝒢B(p|a)]𝒰B(p0|a)\displaystyle\leq\mathbbm{E}[\mathcal{G}_{B}(p^{*}|a)]\leq\mathcal{U}_{B}(p_{0}|a) (C.31)
where, p~{x+:xp}\displaystyle\tilde{p}\in\{x\in\mathbb{R}^{+}:x\neq p^{*}\} (C.32)
where,𝔼[𝒢B(p|a)]\displaystyle\text{where,}\quad\mathbbm{E}[\mathcal{G}_{B}(p|a)] 𝒢B(p|a),p,a\displaystyle\leq\mathcal{G}_{B}(p^{*}|a),\quad\forall p,\ \forall a (C.33)

For any price p, the profit under no inventory risk is always at least as high as the profit when inventory risk exists, i.e. \mathbb{E}[\mathcal{G}_{B}(p|a)]\leq\mathcal{U}_{B}(p|a) for all p\in\mathbb{R}^{+},\ a\in\mathbb{R}^{+}. Thus we define \mathcal{Z}_{B}(p|a) as,

𝒵B(p|a)=𝒰B(p|a)𝔼[𝒢B(p|a)]\displaystyle\mathcal{Z}_{B}(p|a)=\mathcal{U}_{B}(p|a)-\mathbbm{E}[\mathcal{G}_{B}(p|a)] (C.34)

As a consequence of this approximation error, 𝒵B(p|a)\mathcal{Z}_{B}(p|a), we note there must be a constant term which upper bounds the regret of the follower.

ϵ0(a)=𝔼[𝒢B(p|a)]𝔼[𝒢B(p0|a)]\displaystyle\epsilon_{0}(a)=\mathbbm{E}[\mathcal{G}_{B}(p^{*}|a)]-\mathbbm{E}[\mathcal{G}_{B}(p_{0}|a)] (C.35)

Our approach involves analyzing two cases.

Case 1: 𝔼[𝒢B(p)]𝒰B(p¯0)\mathbbm{E}[\mathcal{G}_{B}(p^{*})]\leq\mathcal{U}_{B}(\bar{p}_{0}),

RBT(πA)\displaystyle R_{B}^{T}(\pi_{A}) t=1T𝒰B(p¯0)𝔼[𝒢B(p0)]\displaystyle\leq\sum^{T}_{t=1}\ \mathcal{U}_{B}(\bar{p}_{0})-\mathbbm{E}[\mathcal{G}_{B}(p_{0})] (C.36)
=t=1T(𝒰B(p0)Δ𝒰(t))(𝒰B(p0)𝒵B(p0))\displaystyle=\sum^{T}_{t=1}\Big{(}\mathcal{U}_{B}(p_{0})-\Delta_{\mathcal{U}}(t)\Big{)}-\Big{(}\mathcal{U}_{B}(p_{0})-\mathcal{Z}_{B}(p_{0})\Big{)} (C.37)
=t=1T𝒵B(p0)Δ𝒰(t)\displaystyle=\sum^{T}_{t=1}\mathcal{Z}_{B}(p_{0})-\Delta_{\mathcal{U}}(t) (C.38)

Case 2: 𝔼[𝒢B(p)]>𝒰B(p¯0)\mathbbm{E}[\mathcal{G}_{B}(p^{*})]>\mathcal{U}_{B}(\bar{p}_{0}),

RBT(πA)\displaystyle R_{B}^{T}(\pi_{A}) t=1T𝒰B(p0)𝔼[𝒢B(p¯0)]\displaystyle\leq\sum^{T}_{t=1}\ \ \mathcal{U}_{B}(p_{0})-\mathbbm{E}[\mathcal{G}_{B}(\bar{p}_{0})] (C.39)
t=1T𝒰B(p0)(𝒰B(p¯0)𝒵B(p¯0))\displaystyle\leq\sum^{T}_{t=1}\ \ \mathcal{U}_{B}(p_{0})-(\mathcal{U}_{B}(\bar{p}_{0})-\mathcal{Z}_{B}(\bar{p}_{0})) (C.40)
=t=1T𝒰B(p0)(𝒰B(p0)Δ𝒰(t)𝒵B(p¯0))\displaystyle=\sum^{T}_{t=1}\ \ \mathcal{U}_{B}(p_{0})-(\mathcal{U}_{B}(p_{0})-\Delta_{\mathcal{U}}(t)-\mathcal{Z}_{B}(\bar{p}_{0})) (C.41)
=t=1T𝒵B(p¯0)+Δ𝒰(t)\displaystyle=\sum^{T}_{t=1}\ \mathcal{Z}_{B}(\bar{p}_{0})+\Delta_{\mathcal{U}}(t) (C.42)

We define Δ𝒵(t)\Delta_{\mathcal{Z}}(t) as,

Δ𝒵(t)\displaystyle\Delta_{\mathcal{Z}}(t) =𝒵B(p¯0)𝒵B(p0)\displaystyle=\mathcal{Z}_{B}(\bar{p}_{0})-\mathcal{Z}_{B}(p_{0}) (C.43)
=𝒰B(p¯0)𝔼[𝒢B(p¯0)]𝒰B(p0)+𝔼[𝒢B(p0)]\displaystyle=\mathcal{U}_{B}(\bar{p}_{0})-\mathbb{E}[\mathcal{G}_{B}(\bar{p}_{0})]-\mathcal{U}_{B}(p_{0})+\mathbb{E}[\mathcal{G}_{B}(p_{0})] (C.44)
=𝒰B(p¯0)𝒰B(p0)Δ𝒰(t)+𝔼[𝒢B(p0)]𝔼[𝒢B(p¯0)]Δ𝒢(t)\displaystyle=\underbrace{\mathcal{U}_{B}(\bar{p}_{0})-\mathcal{U}_{B}(p_{0})}_{\Delta_{\mathcal{U}}(t)}+\underbrace{\mathbb{E}[\mathcal{G}_{B}(p_{0})]-\mathbb{E}[\mathcal{G}_{B}(\bar{p}_{0})]}_{\Delta_{\mathcal{G}}(t)} (C.45)

We take the definition of Δ𝒵(t)\Delta_{\mathcal{Z}}(t) and insert it back to the definition in Eq. (C.42).

RBT(πA)\displaystyle R_{B}^{T}(\pi_{A}) t=1T𝒵B(p¯0)+Δ𝒰(t)\displaystyle\leq\sum^{T}_{t=1}\ \mathcal{Z}_{B}(\bar{p}_{0})+\Delta_{\mathcal{U}}(t) (C.46)
=t=1T𝒵B(p0)+Δ𝒵(t)+Δ𝒰(t)\displaystyle=\sum^{T}_{t=1}\ \mathcal{Z}_{B}(p_{0})+\Delta_{\mathcal{Z}}(t)+\Delta_{\mathcal{U}}(t) (C.47)
=t=1T𝒵B(p0)+2Δ𝒰(t)+Δ𝒢(t)\displaystyle=\sum^{T}_{t=1}\ \mathcal{Z}_{B}(p_{0})+2\Delta_{\mathcal{U}}(t)+\Delta_{\mathcal{G}}(t) (C.48)

We note the peculiarity that Δ𝒢(t)\Delta_{\mathcal{G}}(t)\in\mathbb{R} and Δ𝒰(t)0\Delta_{\mathcal{U}}(t)\geq 0. Thus in the worst case,

RBT(πA)\displaystyle R_{B}^{T}(\pi_{A}) t=1T𝒵B(p0)+2Δ𝒰(t)+|Δ𝒢(t)|\displaystyle\leq\sum^{T}_{t=1}\ \mathcal{Z}_{B}(p_{0})+2\Delta_{\mathcal{U}}(t)+|\Delta_{\mathcal{G}}(t)| (C.49)

We upper bound 𝒵B(p¯0|a)\mathcal{Z}_{B}(\bar{p}_{0}|a) by ϵ~(a)\tilde{\epsilon}(a), where,

𝔼[𝒢B(p|a)]𝔼[𝒢B(p0|a)]\displaystyle\mathbbm{E}[\mathcal{G}_{B}(p^{*}|a)]-\mathbbm{E}[\mathcal{G}_{B}(p_{0}|a)] 𝒰B(p0|a)𝔼[𝒢B(p0|a)]\displaystyle\leq\mathcal{U}_{B}(p_{0}|a)-\mathbbm{E}[\mathcal{G}_{B}(p_{0}|a)] (C.50)
ϵ~0\displaystyle\tilde{\epsilon}_{0} =𝒰B(p0|a)𝔼[𝒢B(p0|a)]\displaystyle=\mathcal{U}_{B}(p_{0}|a)-\mathbbm{E}[\mathcal{G}_{B}(p_{0}|a)] (C.51)

We note that 𝒰B(p|a)\mathcal{U}_{B}(p|a) can be expressed in linear form, with parameters θ\theta, whereas 𝔼[𝒢B(p|a)]\mathbbm{E}[\mathcal{G}_{B}(p|a)] requires the solution to Eq. (A.3), which may not always have a closed form solution.

ϵ~(a)\displaystyle\tilde{\epsilon}(a) =maxp+𝒵B(p|a)\displaystyle=\underset{p\in\mathbb{R}^{+}}{\mathrm{max}}\ \mathcal{Z}_{B}(p|a) (C.52)

We can see that there are two contributions to the contextual regret. ϵ~(a)\tilde{\epsilon}(a) is the worst-case error term of the approximation of 𝔼[𝒢B()]\mathbb{E}[\mathcal{G}_{B}(\cdot)] via 𝒰B()\mathcal{U}_{B}(\cdot). Δ𝒰(t)\Delta_{\mathcal{U}}(t) represents the learning error of learning to achieve riskless pricing under imperfect information. Economically, ϵ~(a)\tilde{\epsilon}(a) represents the upper-bound on the difference between expected profit with and without potential inventory risk. Thus using this approximation, we learn to price at p0p_{0}, and converge to a constant error term ϵ0(a)\epsilon_{0}(a).

In Case 2, where \mathbb{E}[\mathcal{G}_{B}(p^{*}|a)]\geq\mathcal{U}_{B}(\bar{p}_{0}|a), the follower learns to price at \bar{p}_{0} due to misspecification. Nevertheless, in terms of asymptotic behaviour, as t\to\infty, \bar{p}_{0}\to p_{0} and \Delta_{\mathcal{U}}(t)\to 0 almost surely. Since by definition \mathcal{U}_{B}(p_{0})\geq\mathbb{E}[\mathcal{G}_{B}(p)],\ \forall p\in\mathbb{R}^{+}, the system approaches Case 1 as t\to\infty. In this regime, the learning error \Delta_{\mathcal{U}}(t) is vanishing, while the riskless inventory approximation error \epsilon_{0}(a) is persistent and bounded by \epsilon_{0}(a)\leq\tilde{\epsilon}(a). We can now impose a worst-case bound on the expected regret, assuming t is sufficiently large and \tilde{\epsilon}(a)\geq\Delta_{\mathcal{U}}(t),\,\forall a\in\mathbb{R}^{+} (which holds in Case 1). We consider the behaviour as T approaches infinity: for relatively small values of T, the worst-case regret may span the entire range of \mathcal{U}(\cdot), given the absence of constraints on the magnitude of |\Delta_{\mathcal{G}}(t)|. It is therefore necessary to assume that the time horizon T is sufficiently large.

RBT(πA)\displaystyle R_{B}^{T}(\pi_{A}) t=1T(ϵ~(a)Δ𝒰(t))\displaystyle\leq\sum^{T}_{t=1}\ \Big{(}\tilde{\epsilon}(a)-\Delta_{\mathcal{U}}(t)\Big{)} (C.53)

Therefore, examining the asymptotic behaviour of Case 2, and letting \epsilon_{B} be the maximum value of \tilde{\epsilon}(a) over all a\in\mathbb{R}^{+}, we observe that,

RBT(πA)\displaystyle R_{B}^{T}(\pi_{A}) ϵBTθ1κH2log2(T)\displaystyle\leq\epsilon_{B}T-\theta_{1}\kappa_{H}^{2}\log^{2}(T) (C.54)
\displaystyle\text{where}\quad\epsilon_{B}=\underset{a\in\mathbb{R}^{+}}{\mathrm{max}}\ \tilde{\epsilon}(a) (C.55)

Where Eq. (C.55) serves as an alternative expression for Eq. (2.10). As recounted in Eq. (C.51) and Eq. (C.53), the system will approach Case 1 almost surely as tt\xrightarrow[]{}\infty. We note that the constant worst-case linear scaling via ϵB\epsilon_{B} due to the approximate best response of the follower is external to the learning algorithm, and thus is dependent on the error of the approximation of the solution to Eq. (A.3), via 𝒵B(p0)\mathcal{Z}_{B}(p_{0}).

General Case: We combine the characteristics of Case 1 and Case 2 and express the instantaneous regret behaviour as,

rBt(πA)={𝒵B(p0)+2Δ𝒰(t)+Δ𝒢(t),1tT𝒵B(p0)Δ𝒰(t)T<t\displaystyle r^{t}_{B}(\pi_{A})=\begin{cases}\mathcal{Z}_{B}(p_{0})+2\Delta_{\mathcal{U}}(t)+\Delta_{\mathcal{G}}(t),\quad&1\leq t\leq T^{\prime}\\ \mathcal{Z}_{B}(p_{0})-\Delta_{\mathcal{U}}(t)\quad&T^{\prime}<t\end{cases} (C.56)

To clarify, TT^{\prime} denotes the change point where the system transitions from Case 2, where 𝔼[𝒢B(p|a)]>𝒰B(p¯0|a)\mathbbm{E}[\mathcal{G}_{B}(p^{*}|a)]>\mathcal{U}_{B}(\bar{p}_{0}|a), to Case 1, where 𝔼[𝒢B(p|a)]𝒰B(p¯0|a)\mathbbm{E}[\mathcal{G}_{B}(p^{*}|a)]\leq\mathcal{U}_{B}(\bar{p}_{0}|a).
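
The piecewise behaviour in Eq. (C.56) can be illustrated with a short simulation, shown below. The constants \mathcal{Z}_{B}(p_{0}), \theta_{1}, \kappa_{H}, the bound on \Delta_{\mathcal{G}}, and the change point T^{\prime} are hypothetical placeholders for illustration, not quantities derived in the proof.

import numpy as np

# Illustrative simulation of the piecewise instantaneous regret in Eq. (C.56).
# All constants below are hypothetical placeholders.
Z_B_p0, theta1, kappa_H = 0.5, 1.0, 0.8
Delta_G_bar, T_prime, T = 2.0, 200, 10_000

def delta_U(t):
    # learning-error bound from Eq. (C.25): Delta_U(t) <= theta1 * kappa_H^2 * log(t)/t
    return theta1 * kappa_H**2 * np.log(t) / t

t = np.arange(1, T + 1)
inst_regret = np.where(
    t <= T_prime,
    Z_B_p0 + 2 * delta_U(t) + Delta_G_bar,   # Case 2 regime, before the change point
    Z_B_p0 - delta_U(t),                     # Case 1 regime, after the change point
)
print("cumulative follower regret at T:", round(float(np.sum(inst_regret)), 2))
print("dominant linear term Z_B(p0) * T:", Z_B_p0 * T)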

RBT(πA)\displaystyle R^{T}_{B}(\pi_{A}) =t=1TrBt(πA)\displaystyle=\sum_{t=1}^{T}r^{t}_{B}(\pi_{A}) (C.57)
\displaystyle\leq\int_{t=1}^{T}\,r^{t}_{B}(\pi_{A})\,dt (C.58)
=t=1T(𝒵B(p0)+2Δ𝒰(t)+|Δ𝒢(t)|)𝑑t+t=TT(𝒵B(p0)Δ𝒰(t))𝑑t\displaystyle=\int_{t=1}^{T^{\prime}}\,\Big{(}\mathcal{Z}_{B}(p_{0})+2\Delta_{\mathcal{U}}(t)+|\Delta_{\mathcal{G}}(t)|\Big{)}dt+\int_{t=T^{\prime}}^{T}\,\Big{(}\mathcal{Z}_{B}(p_{0})-\Delta_{\mathcal{U}}(t)\Big{)}dt (C.59)

Suppose an upper-bounding term bounds |Δ𝒢(t)||\Delta_{\mathcal{G}}(t)|, such that |Δ𝒢(t)|Δ¯𝒢|\Delta_{\mathcal{G}}(t)|\leq\bar{\Delta}_{\mathcal{G}}. We can then rearrange,

RBT(πA)\displaystyle R^{T}_{B}(\pi_{A}) 𝒵B(p0)(T1)+Δ¯𝒢(T1)+21TΔ𝒰(t)𝑑t+𝒵B(p0)(TT)TTΔ𝒰(t)𝑑t\displaystyle\leq\mathcal{Z}_{B}(p_{0})(T^{\prime}-1)+\bar{\Delta}_{\mathcal{G}}(T^{\prime}-1)+2\int_{1}^{T^{\prime}}\,\Delta_{\mathcal{U}}(t)\,dt+\mathcal{Z}_{B}(p_{0})(T-T^{\prime})-\int_{T^{\prime}}^{T}\,\Delta_{\mathcal{U}}(t)\,dt (C.60)
=C𝒵+T𝒵B(p0)+2θ1κH2log2(t)|1Tθ1κH2log2(t)|TT\displaystyle=C_{\mathcal{Z}}+T\mathcal{Z}_{B}(p_{0})+2\theta_{1}\kappa_{H}^{2}\log^{2}(t)\Big{|}_{1}^{T^{\prime}}-\theta_{1}\kappa_{H}^{2}\log^{2}(t)\Big{|}_{T^{\prime}}^{T} (C.61)

Although \Delta_{\mathcal{G}} is fundamentally the value of interest, it potentially exhibits arbitrary growth until T^{\prime}. Given the unconstrained nature of this growth, we are particularly concerned with Case 2. Therefore, in a hypothetical scenario where we possess knowledge of \mathcal{G}_{B}(p^{*}|a) and \mathcal{U}(p_{0}|a) (without direct access to p^{*} and p_{0}), we could theoretically determine T^{\prime}, contingent upon the convergence rate of \theta towards \theta^{*}, as outlined in Lemma 3.1.

C𝒵=Δ¯𝒢(T1)𝒵B(p0)\displaystyle C_{\mathcal{Z}}=\bar{\Delta}_{\mathcal{G}}(T^{\prime}-1)-\mathcal{Z}_{B}(p_{0}) (C.62)

From which, it holds that,

RBT(πA)\displaystyle R^{T}_{B}(\pi_{A}) C𝒵+T𝒵B(p0)+θ1κH2(2log2(T)log2(T)log2(T))\displaystyle\leq C_{\mathcal{Z}}+T\mathcal{Z}_{B}(p_{0})+\theta_{1}\kappa_{H}^{2}\Big{(}2\log^{2}(T^{\prime})-\log^{2}(T)-\log^{2}(T^{\prime})\Big{)} (C.63)
=C𝒵+T𝒵B(p0)+θ1κH2(log2(T)log2(T))\displaystyle=C_{\mathcal{Z}}+T\mathcal{Z}_{B}(p_{0})+\theta_{1}\kappa_{H}^{2}\Big{(}\log^{2}(T^{\prime})-\log^{2}(T)\Big{)} (C.64)
=B+ϵBTθ1κH2log2(T)\displaystyle=\aleph_{B}+\epsilon_{B}T-\theta_{1}\kappa_{H}^{2}\log^{2}(T) (C.65)

Where B\aleph_{B} absorbs the finite θ1κH2log2(T)\theta_{1}\kappa_{H}^{2}\log^{2}(T^{\prime}) term, B=C𝒵+θ1κH2log2(T)\aleph_{B}=C_{\mathcal{Z}}+\theta_{1}\kappa_{H}^{2}\log^{2}(T^{\prime}), and ϵB=𝒵B(p0)\epsilon_{B}=\mathcal{Z}_{B}(p_{0}) by definition.

C.4 Proof for Theorem 5

Stackelberg Regret: Given the pure strategy best response of the follower defined 𝔅(a)[b¯a,b¯a]\mathfrak{B}(a)\in[\underline{b}_{a},\bar{b}_{a}] from Eq. (3.2) and (3.4) respectively, the worst-case regret, RAT(πA)R_{A}^{T}(\pi_{A}), from the leader’s perspective, as defined in Eq. (2.4) when adopting the LNPG algorithm, is bounded by 𝒪(Tlog(T))\mathcal{O}(\sqrt{T\log(T)}).

Proof.

From the perspective of the leader, we are given the bounds on the follower's best response to any action a, denoted \mathfrak{B}(a)\in[\underline{b}_{a},\bar{b}_{a}], from Eq. (3.2) and (3.4) respectively, and we can express this as a function of \mathcal{H}^{*}(\theta) (the riskless price). Taking an upper bound on the follower actions b,

b¯a\displaystyle\bar{b}_{a} =Fθ¯a1(12a(θ)+a)\displaystyle=F^{-1}_{\bar{\theta}_{a}}\Big{(}1-\frac{2a}{\mathcal{H}^{*}(\theta)+a}\Big{)} (C.66)

In our algorithm the leader has access to the same parameter estimates of θ\theta as the follower, and the follower is running OFUL, with logarithmic regret bounds. Given the same confidence ball 𝒞t\mathcal{C}^{t} which leads to an optimistic parameter estimate θ¯\bar{\theta}, the leader can also optimistically estimate her rewards, 𝒢¯A(θ)\bar{\mathcal{G}}_{A}(\theta), by anticipating the optimistic best response of the follower.

𝒢¯A(θ)\displaystyle\bar{\mathcal{G}}_{A}(\theta) =maxa𝒜ab¯a=maxa𝒜aFθ¯1(12a(θ)+a)\displaystyle=\underset{a\in\mathcal{A}}{\mathrm{max}}\ a\bar{b}_{a}=\underset{a\in\mathcal{A}}{\mathrm{max}}\ aF^{-1}_{\bar{\theta}}\Big{(}1-\frac{2a}{\mathcal{H}^{*}(\theta)+a}\Big{)} (C.67)

Essentially the leader plays the arm that maximizes 𝒢¯A(θ)\bar{\mathcal{G}}_{A}(\theta), which is the optimization problem for the leader. We know that the solution to (θ)\mathcal{H}^{*}(\theta) is expressed in Eq. (B.29). We express leader regret by assuming b¯\underline{b} to be the lower bound of the optimistic best response of the follower.

RAT(πA)\displaystyle R_{A}^{T}(\pi_{A}) =t=1T𝒢A(a,𝔅(a))t=1T𝒢A(at,bt)\displaystyle=\sum^{T}_{t=1}\mathcal{G}_{A}(a^{*},\mathfrak{B}(a))-\sum^{T}_{t=1}\mathcal{G}_{A}(a^{t},b^{t}) (C.68)
=t=1Ta𝔅(a)t=1Tat𝔅(at)\displaystyle=\sum^{T}_{t=1}a^{*}\mathfrak{B}(a^{*})-\sum^{T}_{t=1}a^{t}\mathfrak{B}(a^{t}) (C.69)
=t=1Tmaxa𝒜aFθ1(12a(θ)+a)t=1Ta¯Fθ¯1(12a¯(θ)+a¯)\displaystyle=\sum^{T}_{t=1}\ \underset{a\in\mathcal{A}}{\mathrm{max}}\ aF^{-1}_{\theta^{*}}\Big{(}1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+a}\Big{)}-\sum^{T}_{t=1}\bar{a}F^{-1}_{\bar{\theta}}\Big{(}1-\frac{2\bar{a}}{\mathcal{H}^{*}(\theta^{*})+\bar{a}}\Big{)} (C.70)
t=1Tmaxa𝒜aFθ1(12a(θ)+a)t=1Ta¯Fθ1(12a¯(θ)+a¯)\displaystyle\leq\sum^{T}_{t=1}\ \underset{a\in\mathcal{A}}{\mathrm{max}}\ aF^{-1}_{\theta^{*}}\Big{(}1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+a}\Big{)}-\sum^{T}_{t=1}\bar{a}F^{-1}_{\theta^{*}}\Big{(}1-\frac{2\bar{a}}{\mathcal{H}^{*}(\theta^{*})+\bar{a}}\Big{)} (C.71)
a\displaystyle a^{*} =argmaxa𝒜a𝔅(a),a¯=argmaxa𝒜aFθ¯1(12a(θ)+Δ+a)\displaystyle=\underset{a\in\mathcal{A}}{\mathrm{argmax}}\ a\,\mathfrak{B}(a),\quad\bar{a}=\underset{a\in\mathcal{A}}{\mathrm{argmax}}\ aF^{-1}_{\bar{\theta}}\Big{(}1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}+a}\Big{)} (C.72)

There exists a confidence ball \mathcal{C}^{t} for each agent, one of which could be a subset of the other (i.e. one agent has more information than the other), and a unique maximizing point for \mathcal{G}_{A}(a^{t}), for b(\theta^{t}), which depends on the estimate of \theta\in\mathcal{C}^{t}. From Theorem 1, we know that b_{a} is monotonically decreasing w.r.t. p, and from Lemma 3.2, we know that the optimistic estimate \bar{p}_{0} is decreasing. A unique maximizing solution exists when \theta^{*} is known; however, when we solve the approximation, a suboptimality is introduced by \bar{a}.

We would like to quantify and bound the behaviour of \bar{a}(t) from Eq. (C.71), which in turn allows us to quantify the behaviour of the leader regret. First we note an assumption from Section 2.4.2, where the demand distribution is a symmetric \mathcal{N}(\mu^{t},\sigma) with constant variance \sigma and \mu=\Gamma_{\theta}(p), a property of additive demand. We characterize the change in \gamma with respect to the order amount (or follower response) b. When considering optimistic demand parameters \bar{\theta}, we have the relationship b=F^{-1}_{\bar{\theta}}(\gamma). Holding a constant, the quantifiable difference between \gamma(a^{t}) and the optimal \gamma(a^{*}) is defined in Eq. (C.74).

γ\displaystyle\gamma =12a(θ)+Δ+a\displaystyle=1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}+a} (C.73)
Δγ(t)\displaystyle\Delta_{\gamma}(t) =γ(at)γ(a)\displaystyle=\gamma(a^{t})-\gamma(a^{*}) (C.74)
=12a(θ)+Δ+a1+2a(θ)+a\displaystyle=1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}+a}-1+\frac{2a^{*}}{\mathcal{H}^{*}(\theta^{*})+a^{*}} (C.75)
=2a(θ)+a2a(θ)+Δ+a\displaystyle=\frac{2a^{*}}{\mathcal{H}^{*}(\theta^{*})+a^{*}}-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}+a} (C.76)

Regarding the position of a^{t} relative to a^{*}, there are two scenarios: either a^{t}>a^{*} or a^{t}\leq a^{*}. We argue that, with probability 1-\delta, we start with a^{t}\geq a^{*}, since the estimates are optimistic and prices are set above the optimal a^{*}; the sequence a^{t} is therefore decreasing, and by design a^{t}\geq 0, so a^{t}-a^{*}\geq 0.

Δγ(t)\displaystyle\Delta_{\gamma}(t) 2at(1(θ)+at1(θ)+Δ+at)\displaystyle\leq 2a^{t}\Big{(}\frac{1}{\mathcal{H}^{*}(\theta^{*})+a^{t}}-\frac{1}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}+a^{t}}\Big{)} (C.77)
=2at(θ)+at(1(θ)+at(θ)+Δ+at)\displaystyle=\frac{2a^{t}}{\mathcal{H}^{*}(\theta^{*})+a^{t}}\Big{(}1-\frac{\mathcal{H}^{*}(\theta^{*})+a^{t}}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}+a^{t}}\Big{)} (C.78)
=2at(θ)+at(Δ(θ)+Δ+at),aat\displaystyle=\frac{2a^{t}}{\mathcal{H}^{*}(\theta^{*})+a^{t}}\Big{(}\frac{\Delta_{\mathcal{H}}}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}+a^{t}}\Big{)},\quad a^{*}\leq a^{t} (C.79)

We know from Lemma B.1 that \Delta_{\mathcal{H}}\leq 2\kappa_{H}\sqrt{\log(t)/t}. Substituting this bound into Eq. (C.79), which preserves the inequality, we have,

Δγ(t)\displaystyle\Delta_{\gamma}(t) 2a(θ)+a(2κHlog(t)/t(θ)+a+2κHlog(t)/t)\displaystyle\leq\frac{2a}{\mathcal{H}^{*}(\theta^{*})+a}\Big{(}\frac{2\kappa_{H}\sqrt{\log(t)/t}}{\mathcal{H}^{*}(\theta^{*})+a+2\kappa_{H}\sqrt{\log(t)/t}}\Big{)} (C.80)
=2a(θ)+a(2κHlog(t)2κHlog(t)+((θ)+a)tlog(t))\displaystyle=\frac{2a}{\mathcal{H}^{*}(\theta^{*})+a}\Big{(}\frac{2\kappa_{H}\log(t)}{2\kappa_{H}\log(t)+(\mathcal{H}^{*}(\theta^{*})+a)\sqrt{t\log(t)}}\Big{)} (C.81)
<2a(θ)+a(2κHlog(t)((θ)+a)tlog(t))\displaystyle<\frac{2a}{\mathcal{H}^{*}(\theta^{*})+a}\Big{(}\frac{2\kappa_{H}\log(t)}{(\mathcal{H}^{*}(\theta^{*})+a)\sqrt{t\log(t)}}\Big{)} (C.82)
=4aκH((θ)+a)2(log(t)t)\displaystyle=\frac{4a\kappa_{H}}{(\mathcal{H}^{*}(\theta^{*})+a)^{2}}\Big{(}\sqrt{\frac{\log(t)}{t}}\Big{)} (C.83)
4κH(θ)2(log(t)t)a\displaystyle\leq\frac{4\kappa_{H}}{\mathcal{H}^{*}(\theta^{*})^{2}}\Big{(}\sqrt{\frac{\log(t)}{t}}\Big{)}a (C.84)

Suppose F_{\bar{\theta}}(b) is continuous and differentiable on the domain b\in[0,\bar{b}], and there exists a finite minimum \mathscr{M}>0 that lower bounds the derivative of F_{\bar{\theta}}(b) with respect to b,

γb=Fθ¯(b)b\displaystyle\mathscr{M}\leq\frac{\partial\gamma}{\partial b}=\frac{\partial F_{\bar{\theta}}(b)}{\partial b} (C.85)

We can use the existence of \mathscr{M} to bound the change in γ\gamma with respect to the change in bb, denoted as Δγ(t)\Delta_{\gamma}(t) and Δb(t)\Delta_{b}(t) respectively.

Δb(t)Δγ(t)\displaystyle\Delta_{b}(t)\leq\frac{\Delta_{\gamma}(t)}{\mathscr{M}} (C.86)
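
The chain of bounds in Eqs. (C.84)-(C.86) can be evaluated numerically. The sketch below lower bounds the density to obtain \mathscr{M} on a bounded action range and converts the bound on \Delta_{\gamma}(t) into a bound on \Delta_{b}(t); the Gaussian demand parameters, \mathcal{H}^{*}(\theta^{*}), \kappa_{H}, a, and the action range are illustrative assumptions, not values from the paper.

import numpy as np
from scipy.stats import norm

# Hedged numerical sketch of Eqs. (C.84)-(C.86).
mu, sigma = 8.0, 1.5                       # assumed additive Gaussian demand
H_star, kappa_H, a = 20.0, 0.8, 6.0
b_lo, b_hi = mu - 2 * sigma, mu + 2 * sigma

# M lower-bounds dF/db = pdf(b) on the bounded action range [b_lo, b_hi].
grid = np.linspace(b_lo, b_hi, 1000)
M = norm.pdf(grid, loc=mu, scale=sigma).min()

def delta_gamma_bound(t):
    # Eq. (C.84): Delta_gamma(t) <= (4 kappa_H / H_star^2) * sqrt(log(t)/t) * a
    return 4 * kappa_H / H_star**2 * np.sqrt(np.log(t) / t) * a

for t in (10, 100, 1000, 10_000):
    dg = delta_gamma_bound(t)
    print(f"t={t:>6d}  Delta_gamma <= {dg:.4f}  Delta_b <= {dg / M:.4f}")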

We construct a lower bound on 𝒢A(a¯t,𝔅(a¯t))\mathcal{G}_{A}(\bar{a}^{t},\mathfrak{B}(\bar{a}^{t})), where 𝔅¯(a)\bar{\mathfrak{B}}(a) denotes the optimistic best response as a function of leader action aa. For notation purposes, unless otherwise specified, 𝒢A(a)\mathcal{G}_{A}(a) denotes that the leader is acting with aa, and the follower is responding with 𝔅(a)\mathfrak{B}(a), thus we will use the shorthand definition 𝒢A(a)\mathcal{G}_{A}(a),

𝒢A(a)𝒢A(a,𝔅(a))=a𝔅(a),a𝒜\displaystyle\mathcal{G}_{A}(a)\equiv\mathcal{G}_{A}(a,\mathfrak{B}(a))=a\,\mathfrak{B}(a),\forall a\in\mathcal{A} (C.87)

Subsequently, we can bound the instantaneous expected leader reward by examining the behaviour of Δa\Delta_{a} and Δb\Delta_{b}, where if Δa>0Δb<0\Delta_{a}>0\xrightarrow[]{}\Delta_{b}<0. Any suboptimal action a=a+Δaa=a^{*}+\Delta_{a} will result in a potentially lower expected reward for the leader, thus 𝒢A(a)𝒢A(a)\mathcal{G}_{A}(a)\leq\mathcal{G}_{A}(a^{*}). We can quantify this reduction in reward as,

𝒢A(a+Δa)=(a+Δa)(𝔅(a)+Δb(a,t))𝒢A(a)\displaystyle\mathcal{G}_{A}(a^{*}+\Delta_{a})=(a^{*}+\Delta_{a})(\mathfrak{B}(a)+\Delta_{b}(a,t))\leq\mathcal{G}_{A}(a^{*}) (C.88)

Let us characterize 𝒢A(a¯t)\mathcal{G}_{A}(\bar{a}^{t}), denoted in Eq. (C.89), as the leader’s reward, under a best responding follower under perfect information, given an estimate of aa computed using an optimistic estimate of θ\theta with respect to tt, denoted as a¯t\bar{a}^{t}. We define a lower bound to 𝒢A(a¯t)\mathcal{G}_{A}(\bar{a}^{t}), in Eq. (C.92), to establish that 𝒢A(a¯t)𝒢¯A(a¯t)\mathcal{G}_{A}(\bar{a}^{t})\geq\underline{\mathcal{G}}_{A}(\bar{a}^{t}).

𝒢A(a¯t)\displaystyle\mathcal{G}_{A}(\bar{a}^{t}) =a¯t𝔅(a¯t)=(a+Δa(t))𝔅(a+Δa(t))\displaystyle=\bar{a}^{t}\,\mathfrak{B}(\bar{a}^{t})=(a^{*}+\Delta_{a}(t))\mathfrak{B}(a^{*}+\Delta_{a}(t)) (C.89)
a𝔅(a+Δa(t))\displaystyle\geq a^{*}\mathfrak{B}(a^{*}+\Delta_{a}(t)) (C.90)
=a𝔅(argmaxa𝒜a𝔅¯(a))=a𝔅(argmaxa𝒜aFθ¯1(γ¯(a)))\displaystyle=a^{*}\mathfrak{B}(\underset{a\in\mathcal{A}}{\mathrm{argmax}}\ a\bar{\mathfrak{B}}(a))=a^{*}\mathfrak{B}\Big{(}\underset{a\in\mathcal{A}}{\mathrm{argmax}}\ aF^{-1}_{\bar{\theta}}(\bar{\gamma}(a))\Big{)} (C.91)
a𝔅(argmaxa𝒜aFθ1(γ¯(a)))𝒢¯A(a¯t)\displaystyle\geq a^{*}\mathfrak{B}\Big{(}\underset{a\in\mathcal{A}}{\mathrm{argmax}}\ aF^{-1}_{\theta}(\bar{\gamma}(a))\Big{)}\equiv\underline{\mathcal{G}}_{A}(\bar{a}^{t}) (C.92)

We denote upper bound a¯\bar{a} as the optimistic estimate of aa^{*}, and quantify the relation from Eq. (C.72), in a closed form expression. The value in Eq. (C.92) forms a lower bound for the best response under optimism 𝒢A(a¯t)\mathcal{G}_{A}(\bar{a}^{t}), and therefore 𝒢¯(a¯t)\underline{\mathcal{G}}(\bar{a}^{t}) can be used to form an upper bound on the leader regret. To be more specific, 𝒢¯(a¯t)\underline{\mathcal{G}}(\bar{a}^{t}) is the hypothetical value, which lower bounds the leader’s reward obtained under optimistic misspecification, 𝒢A(a¯t)\mathcal{G}_{A}(\bar{a}^{t}).

We give the expressions for a^{*}, the optimal leader action, and \bar{a}, the optimistic leader action, in Eq. (C.93) and Eq. (C.94) respectively.

a\displaystyle a^{*} =argmaxa𝒜a𝔅(a)\displaystyle=\underset{a\in\mathcal{A}}{\mathrm{argmax}}\ a\,\mathfrak{B}(a) (C.93)
a¯\displaystyle\bar{a} =argmaxa𝒜aFθ¯1(12a(θ)+Δ+a)\displaystyle=\underset{a\in\mathcal{A}}{\mathrm{argmax}}\ a\,F^{-1}_{\bar{\theta}}\Big{(}1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}+a}\Big{)} (C.94)

Eq. (C.94) is strictly concave and thus a unique solution \bar{a} exists. The major obstacle is that there is no closed form solution to F^{-1}_{\bar{\theta}}, and therefore we need an appropriate approximation function such that we can perform the maximization over the concave space with controllable approximation error. Let \mathcal{P}_{N}(\cdot) denote an approximation of F^{-1}_{\bar{\theta}}(\cdot) that can be expressed in closed form. We estimate \bar{a} by setting,

Fθ¯1(12a(θ)+Δ)𝒫N(12a(θ)+Δ)\displaystyle F^{-1}_{\bar{\theta}}\Big{(}1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}}\Big{)}\approx\mathcal{P}_{N}\Big{(}1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}}\Big{)} (C.95)

Where Eq. (C.95) represents an approximation of the inverse CDF Fθ¯1()F^{-1}_{\bar{\theta}}(\cdot). This approximation will give us a maximum approximation error denoted as,

Δ~a=argmaxa𝒜[a𝒫N(12a(θ)+Δ+a)]a¯\displaystyle\widetilde{\Delta}_{a}=\underset{a\in\mathcal{A}}{\mathrm{argmax}}\ \Big{[}a\,\mathcal{P}_{N}\Big{(}1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}+a}\Big{)}\Big{]}-\bar{a} (C.96)

Where Δ~a\widetilde{\Delta}_{a} is the error in approximation of a¯\bar{a} due to the approximation of the inverse CDF Fθ1()F_{\theta}^{-1}(\cdot) with 𝒫N()\mathcal{P}_{N}(\cdot).

Bounded Action Space: We stipulate explicitly that, as a consequence of a bounded action space for the follower, [\underline{b},\bar{b}], the interval of the critical fractile is also bounded in [\underline{\gamma},\bar{\gamma}]. This implies that the image of F^{-1}_{\theta}(\underline{\gamma}) is lower bounded by 0.

Fθ1(γ¯)=0,γ[γ¯,γ¯]\displaystyle F^{-1}_{\theta}(\underline{\gamma})=0,\quad\gamma\in[\underline{\gamma},\bar{\gamma}] (C.97)

Since in the economic sense we cannot order negative items, we implement a lower bound on the action space of the follower, and conduct our analysis in the range [γ¯,γ¯][\underline{\gamma},\bar{\gamma}], implying a finite upper bound on aa, such that a[a¯,a¯]a\in[\underline{a},\bar{a}] for the leader. Within this range lies the optimal solution for G¯A()\underline{G}_{A}(\cdot) .

Weierstrass Approximation Theorem: For any continuous function f(γ):[γ¯,γ¯]f(\gamma):[\underline{\gamma},\bar{\gamma}]\rightarrow\mathbb{R}, for any ϵ(N)+\epsilon(N)\in\mathbb{R}^{+} arbitrarily small, there exists a sequence of polynomials {PN}\{P_{N}\} of the NthN^{th} order such that the supremum norm, denoted as \lVert\cdot\rVert, is bounded by ϵ(N)\epsilon(N), as stated in Eq. (C.98). [Wei85]

f(γ)PNϵ(N)\displaystyle\lVert f(\gamma)-P_{N}\rVert\leq\epsilon(N) (C.98)

As F^{-1}_{\bar{\theta}}(\cdot) is a bounded and smooth function, we can leverage the Weierstrass approximation theorem to obtain an N^{th} order polynomial approximation such that, for N large enough, \lVert f(\gamma)-P_{N}\rVert<\epsilon(N) for any arbitrarily small \epsilon. Applying a polynomial approximation and the Weierstrass approximation theorem brings two crucial benefits. First, the error \epsilon(N) can be made arbitrarily small and does not depend on the sample size t, but rather on a separate, controllable computational term N. Secondly, by approximating F^{-1}_{\bar{\theta}}(\cdot) with P_{N}, we obtain a closed form expression for an approximation of F^{-1}_{\bar{\theta}}(\cdot) with controllable error. This allows us to perform the optimization over a polynomial rather than some arbitrary concave function derived from a cumulative distribution function.
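
In practice, a polynomial of the kind guaranteed by the Weierstrass approximation theorem can be obtained by standard fitting routines. The sketch below fits Chebyshev polynomials of increasing order to a Gaussian inverse CDF on [\underline{\gamma},\bar{\gamma}] and reports the sup-norm error \epsilon(N); the Gaussian parameters and the fractile range are illustrative assumptions, and the Chebyshev fit is a practical stand-in for the existential polynomial of Eq. (C.98).

import numpy as np
from numpy.polynomial import chebyshev as cheb
from scipy.stats import norm

# Hedged sketch: fit an N-th order polynomial to the inverse CDF on the bounded
# fractile range and track the sup-norm error eps(N). Parameters are illustrative.
mu, sigma = 8.0, 1.5
gamma_lo, gamma_hi = 0.05, 0.95

g = np.linspace(gamma_lo, gamma_hi, 2001)
f = norm.ppf(g, loc=mu, scale=sigma)          # target F_theta^{-1}(gamma)

for N in (2, 4, 8, 16):
    poly = cheb.Chebyshev.fit(g, f, deg=N)
    sup_err = np.max(np.abs(f - poly(g)))
    print(f"N={N:>2d}  sup-norm error eps(N) ~ {sup_err:.2e}")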

RA(T)\displaystyle R_{A}(T) t=1Ta𝔅(a)𝒢A(a¯t,𝔅(a¯t))\displaystyle\leq\sum^{T}_{t=1}a^{*}\mathfrak{B}(a^{*})-\mathcal{G}_{A}(\bar{a}^{t},\mathfrak{B}(\bar{a}^{t})) (C.99)
t=1Ta𝔅(a)𝒢¯A(a¯t,𝔅(a¯t))\displaystyle\leq\sum^{T}_{t=1}a^{*}\mathfrak{B}(a^{*})-\underline{\mathcal{G}}_{A}(\bar{a}^{t},\mathfrak{B}(\bar{a}^{t})) (C.100)

We characterize \underline{\mathcal{G}}_{A}(\bar{a}^{t},\mathfrak{B}(\bar{a}^{t})) as the hypothetical leader profit, derived within a best response framework, where the best response is an additive composite of the term \Delta_{b}(\theta) under perfect information (where \Delta_{b}(\theta)=0) and the term \Delta_{b}(\gamma) under imperfect information (where \Delta_{b}(\gamma)\geq 0). In the context of Fig. 4, an optimistic assessment under uncertainty influences two components with respect to the inverse CDF. First, it affects the overall optimism in demand, inducing a systematic shift along the vertical axis in the negative direction; we denote this change as \Delta_{b}(\theta). Second, it results in a descent along the CDF curve itself, signified by the diminishing value of \gamma(a); we denote this change as \Delta_{b}(\gamma). Both aspects additively contribute to the total change in \mathfrak{B}(a). This formulation enables the construction of a lower bound for the best response mechanism, which is instrumental in the subsequent analysis.

Solution via N^{th} Order Polynomial Approximation: In theory, an infinite Taylor series expansion centred around \underline{\gamma} (the limit of \mathcal{P}_{N}(\gamma) as N\to\infty) can represent F^{-1}_{\theta}(\gamma).

Fθ1(γ)\displaystyle F^{-1}_{\theta}(\gamma) =Fθ1(γ¯)+11!dFθ1(γ¯)dγ(γγ¯)+12!d2Fθ1(γ¯)dγ2(γγ¯)2+13!d3Fθ1(γ¯)dγ3(γγ¯)3+\displaystyle=F^{-1}_{\theta}(\underline{\gamma})+\frac{1}{1!}\frac{dF^{-1}_{\theta}(\underline{\gamma})}{d\gamma}(\gamma-\underline{\gamma})+\frac{1}{2!}\frac{d^{2}F^{-1}_{\theta}(\underline{\gamma})}{d\gamma^{2}}(\gamma-\underline{\gamma})^{2}+\frac{1}{3!}\frac{d^{3}F^{-1}_{\theta}(\underline{\gamma})}{d\gamma^{3}}(\gamma-\underline{\gamma})^{3}+\cdots (C.101)
=limNFθ1(γ¯)+n=1N1n!dnFθ1(γ¯)dγn(γγ¯)n\displaystyle=\lim_{N\to\infty}F^{-1}_{\theta}(\underline{\gamma})+\sum_{n=1}^{N}\frac{1}{n!}\frac{d^{n}F^{-1}_{\theta}(\underline{\gamma})}{d\gamma^{n}}(\gamma-\underline{\gamma})^{n} (C.102)

As there is no closed form solution for Fθ1()F_{\theta}^{-1}(\cdot), we can apply a finite Taylor series expansion 𝒫N(γ)\mathcal{P}_{N}(\gamma) to approximate Fθ1()F_{\theta}^{-1}(\cdot). Let 𝒫N(γ)\mathcal{P}_{N}(\gamma) denote the NthN^{th} order Taylor expansion centred around γ¯\underline{\gamma}. Consequently, Fθ1(γ)F^{-1}_{\theta}(\gamma) is equivalent to the sum of 𝒫N(γ)\mathcal{P}_{N}(\gamma) and the remainder term RN(γ)R_{N}(\gamma), which decreases with increasing order of the polynomial series, NN.

𝒫N(γ)\displaystyle\mathcal{P}_{N}(\gamma) =Fθ1(γ¯)+n=1N1n!dnFθ1(γ¯)dγn(γγ¯)n\displaystyle=F^{-1}_{\theta}(\underline{\gamma})+\sum_{n=1}^{N}\frac{1}{n!}\frac{d^{n}F^{-1}_{\theta}(\underline{\gamma})}{d\gamma^{n}}(\gamma-\underline{\gamma})^{n} (C.103)
Fθ1(γ)\displaystyle F^{-1}_{\theta}(\gamma) =𝒫N(γ)+RN(γ)\displaystyle=\mathcal{P}_{N}(\gamma)+R_{N}(\gamma) (C.104)

Let 𝒱(a,t)\mathcal{V}(a,t) represent a polynomial approximation of 𝒢¯(a¯t,𝔅(a¯t))\underline{\mathcal{G}}(\bar{a}^{t},\mathfrak{B}(\bar{a}^{t})), given 𝒫N(γ)\mathcal{P}_{N}(\gamma).

𝒱(a,t)\displaystyle\mathcal{V}(a,t) =a𝒫N(γ(a)γ¯)\displaystyle=a\mathcal{P}_{N}\Big{(}\gamma(a)-\underline{\gamma}\Big{)} (C.105)
\displaystyle=a\sum_{n=0}^{N}\beta_{n}\Big{(}\gamma(a)-\underline{\gamma}\Big{)}^{n} (C.106)
\displaystyle=a\sum_{n=0}^{N}\beta_{n}\Big{(}1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}+a}-\underline{\gamma}\Big{)}^{n} (C.107)

The structural form of Eq. (C.107) allows us to exploit the Weierstrass approximation theorem to place an arbitrarily small error bound on the approximation of 𝒢¯(a¯t,𝔅(a¯t))\underline{\mathcal{G}}(\bar{a}^{t},\mathfrak{B}(\bar{a}^{t})) via 𝒱(a,t)\mathcal{V}(a,t), which can be expressed as the sum of finite polynomials, with some controllable additive error decreasing with NN, which we denote as ϵ(N)\epsilon(N).

𝒢¯(a¯t,𝔅(a¯t))\displaystyle\underline{\mathcal{G}}(\bar{a}^{t},\mathfrak{B}(\bar{a}^{t})) =𝒱(a,t)+ϵ(N)\displaystyle=\mathcal{V}(a,t)+\epsilon(N) (C.108)
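
The construction in Eqs. (C.105)-(C.108) can be sketched as follows: fit a polynomial surrogate for F^{-1}_{\theta}, form \mathcal{V}(a,t)=a\,\mathcal{P}_{N}(\gamma(a)), and maximize over a grid of leader actions. The demand parameters, \mathcal{H}^{*}(\theta^{*}), \Delta_{\mathcal{H}}, the fractile range, and the grid are illustrative assumptions; the polynomial is fit directly in \gamma, an equivalent reparametrization of the centred expansion used above.

import numpy as np
from numpy.polynomial import chebyshev as cheb
from scipy.stats import norm

# Hedged sketch of Eqs. (C.105)-(C.108). All numeric parameters are illustrative.
mu, sigma = 8.0, 1.5                 # assumed additive Gaussian demand at the relevant price
H_star, Delta_H = 20.0, 0.3          # assumed riskless price intercept and its current error
gamma_lo, gamma_hi = 0.05, 0.95      # bounded critical-fractile range

g = np.linspace(gamma_lo, gamma_hi, 2001)
poly = cheb.Chebyshev.fit(g, norm.ppf(g, loc=mu, scale=sigma), deg=8)

def gamma_of_a(a):
    return 1.0 - 2.0 * a / (H_star + Delta_H + a)    # Eq. (C.73)

def V(a):
    gam = gamma_of_a(a)
    if not (gamma_lo <= gam <= gamma_hi):
        return -np.inf                               # stay inside the fitted fractile range
    return a * poly(gam)

a_grid = np.linspace(0.1, H_star, 2000)
vals = np.array([V(a) for a in a_grid])
a_bar_hat = a_grid[np.argmax(vals)]
print("approximate optimistic leader action:", round(float(a_bar_hat), 3))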

We also showed that, as the approximation error of F^{-1}_{\theta}(\gamma) via \mathcal{P}_{N}(\gamma) approaches 0, the distance between the estimated optimum and the true optimum, denoted \widetilde{\Delta}_{a}, also approaches 0. Thus by Lemma C.1, \epsilon(N)\to 0\implies\widetilde{\Delta}_{a}\to 0, where \widetilde{\Delta}_{a} represents the maximum discrepancy between the value of a that maximizes \mathcal{V}(a,t) and the value of a that maximizes \underline{\mathcal{G}}(\bar{a}^{t},\mathfrak{B}(\bar{a}^{t})).

a¯\displaystyle\bar{a} =argmaxa𝒜𝒱(a,t)+Δ~a\displaystyle=\underset{a\in\mathcal{A}}{\mathrm{argmax}}\ \mathcal{V}(a,t)+\widetilde{\Delta}_{a} (C.109)

But the Weierstrass approximation theorem alone is insufficient to complete the derivation for \bar{a}: we must differentiate \mathcal{P}_{N}(\gamma), set \frac{\partial\mathcal{P}_{N}(\gamma)}{\partial\gamma}=0, and solve for \bar{a}, and this first-order condition has no concise closed-form solution that generalizes in N for \mathcal{P}_{N}(\gamma). This challenge motivates the use of the \gamma(a) trick.

The \gamma(a) trick: The selection of \underline{\gamma} as the centre of the Taylor series expansion was a deliberate simplification, stemming from the fact that we impose F^{-1}_{\theta}(\underline{\gamma})=0 in Eq. (C.97). To address the complexity of the root-finding problem, a strategic approach is employed. We introduce \mathcal{P}^{(k)}_{N}(\gamma) as the Taylor expansion of F^{-1}_{\theta}(\gamma) centred at \underline{\gamma}, while omitting the initial k terms. To be clear,

\displaystyle\mathcal{P}^{(k)}_{N}(\gamma)=\sum_{n=k+1}^{N}\beta_{n}\Big{(}\gamma(a)-\underline{\gamma}\Big{)}^{n} (C.110)
\displaystyle=\sum_{n=k+1}^{N}\beta_{n}\Big{(}1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}+a}-\underline{\gamma}\Big{)}^{n} (C.111)

We set k=2. Note also that \beta_{0}=0, by construction of our Taylor expansion, as it is centred at \underline{\gamma} where F^{-1}_{\theta}(\underline{\gamma})=0.

𝒫N(γ)\displaystyle\mathcal{P}_{N}(\gamma) =β0+β1(γ(a)γ¯)+𝒫N(k=2)(γ)\displaystyle=\beta_{0}+\beta_{1}\Big{(}\gamma(a)-\underline{\gamma}\Big{)}+\mathcal{P}^{(k=2)}_{N}(\gamma) (C.112)

We define Λ(γ)\Lambda(\gamma) as,

Λ(γ)\displaystyle\Lambda(\gamma) β0+β1(γ(a)γ¯)\displaystyle\equiv\beta_{0}+\beta_{1}\Big{(}\gamma(a)-\underline{\gamma}\Big{)} (C.113)
=β0+β1(12a(θ)+Δ+aγ¯)\displaystyle=\beta_{0}+\beta_{1}\Big{(}1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}+a}-\underline{\gamma}\Big{)} (C.114)

Subsequently, a linear affine transform on our function of interest Fθ1(γ)F^{-1}_{\theta}(\gamma) can be performed, where we denote this affine linear transform as ϕ(Fθ1(γ))\phi(F^{-1}_{\theta}(\gamma)).

ϕ(Fθ1(γ))Fθ1(γ)Λ(γ)\displaystyle\phi(F^{-1}_{\theta}(\gamma))\equiv F^{-1}_{\theta}(\gamma)-\Lambda(\gamma) (C.115)

Since \Lambda(\gamma) is a linear affine function, we can substitute it into the Weierstrass approximation theorem, as expressed in Eq. (C.98), for the relation between \phi(F^{-1}_{\theta}(\gamma)) and \mathcal{P}^{(k)}_{N}(\gamma), resulting in,

Fθ1(γ)𝒫N(γ)=ϕ(Fθ1(γ))𝒫N(k)(γ)ϵ(N)\displaystyle\lVert F^{-1}_{\theta}(\gamma)-\mathcal{P}_{N}(\gamma)\rVert=\lVert\phi(F^{-1}_{\theta}(\gamma))-\mathcal{P}^{(k)}_{N}(\gamma)\rVert\leq\epsilon(N) (C.116)

Now we examine the behaviour of \mathcal{P}^{(k)}_{N}(\gamma(a)) specifically. We wish to find the value of a that maximizes \mathcal{V}(a,t), by taking the partial derivative with respect to a and setting the result to 0, applying the product rule followed by the chain rule.

\displaystyle\frac{\partial}{\partial a}\mathcal{V}(a,t)=\sum_{n=2}^{N}\beta_{n}\left(\gamma(a)-\underline{\gamma}\right)^{n}+a\,\frac{\partial(\gamma(a)-\underline{\gamma})}{\partial a}\sum_{n=2}^{N}\beta_{n}\,n\Big{(}1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}+a}-\underline{\gamma}\Big{)}^{n-1} (C.117)
\displaystyle=\sum_{n=2}^{N}\beta_{n}\Big{(}1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}+a}-\underline{\gamma}\Big{)}^{n}
\displaystyle\quad\quad-\Big{(}\frac{2a(\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}})}{(\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}+a)^{2}}\Big{)}\sum_{n=2}^{N}\beta_{n}\,n\Big{(}1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}+a}-\underline{\gamma}\Big{)}^{n-1}=0 (C.118)

Here is where the \gamma(a) trick becomes pivotal. We can now solve Eq. (C.118) by solving,

γ(a)γ¯=12a(θ)+Δ+aγ¯\displaystyle\gamma(a)-\underline{\gamma}=1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}+a}-\underline{\gamma} =0\displaystyle=0 (C.119)

Which gives us the solution,

a\displaystyle a =(θ)+Δγ¯((θ)+Δ)γ¯+1\displaystyle=\frac{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}-\underline{\gamma}(\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}})}{\underline{\gamma}+1} (C.120)
=γ~((θ)+Δ),γ~=1γ¯1+γ¯\displaystyle=\tilde{\gamma}\,(\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}),\quad\tilde{\gamma}=\frac{1-\underline{\gamma}}{1+\underline{\gamma}} (C.121)

By setting a equal to the value in Eq. (C.121), we effectively solve the first-order condition in Eq. (C.118). The result from Eq. (C.121) allows us to simply set a=\tilde{\gamma}(\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}) to solve for a in Eq. (C.109). Essentially, by maximizing our approximation \mathcal{V}(a,t) w.r.t. a, we force the condition \Lambda(\gamma(a))=0. Therefore, the bound on \lVert F^{-1}_{\theta}(\gamma)-\mathcal{P}_{N}(\gamma)\rVert expressed in Eq. (C.116) holds.
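
The closed-form action from the \gamma(a) trick is simple to evaluate; the following minimal sketch computes a=\tilde{\gamma}(\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}) from Eq. (C.121) and checks that it satisfies Eq. (C.119). The numeric values of \mathcal{H}^{*}(\theta^{*}), \Delta_{\mathcal{H}} and \underline{\gamma} are illustrative placeholders.

# Minimal illustration of the closed-form leader action in Eq. (C.121),
# a = gamma_tilde * (H_star + Delta_H), gamma_tilde = (1 - gamma_lo) / (1 + gamma_lo).
# The numeric values below are illustrative placeholders.
H_star, Delta_H, gamma_lo = 20.0, 0.3, 0.05

gamma_tilde = (1.0 - gamma_lo) / (1.0 + gamma_lo)
a_closed_form = gamma_tilde * (H_star + Delta_H)

# Sanity check: this choice of a satisfies gamma(a) - gamma_lo = 0, Eq. (C.119).
gamma_a = 1.0 - 2.0 * a_closed_form / (H_star + Delta_H + a_closed_form)
print("a =", round(a_closed_form, 4), " gamma(a) - gamma_lo =", round(gamma_a - gamma_lo, 12))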

Fθ1(γ)𝒫N(γ)=ϕ(Fθ1(γ))𝒫Nk=2(γ)=Fθ1(γ)𝒫N(k=2)(γ)\displaystyle\lVert F^{-1}_{\theta}(\gamma)-\mathcal{P}_{N}(\gamma)\rVert=\lVert\phi(F^{-1}_{\theta}(\gamma))-\mathcal{P}^{k=2}_{N}(\gamma)\rVert=\lVert F^{-1}_{\theta}(\gamma)-\mathcal{P}^{(k=2)}_{N}(\gamma)\rVert (C.122)

The key point is that the resulting error term \epsilon(N) can be made arbitrarily small. We can then apply the lower bound from Eq. (C.92) together with the \mathcal{V}(a,t) approximation of \mathcal{G}_{A}(\bar{a}^{t},\mathfrak{B}(\bar{a}^{t})) to obtain a bound on \Delta_{a}(t). Given \Delta_{a}(t)=\bar{a}-a^{*}, we now bound \mathcal{G}_{A}(\bar{a}) given the bounds on \Delta_{a} and \Delta_{\gamma},

𝒢A(a¯t,𝔅(a¯t))\displaystyle\mathcal{G}_{A}(\bar{a}^{t},\mathfrak{B}(\bar{a}^{t})) =(a+Δa(t))Fθ1(γΔγ(t))\displaystyle=\Big{(}a^{*}+\Delta_{a}(t)\Big{)}F^{-1}_{\theta}\Big{(}\gamma^{*}-\Delta_{\gamma}(t)\Big{)} (C.123)
aFθ1(γΔγ(t))\displaystyle\geq a^{*}F^{-1}_{\theta}\Big{(}\gamma^{*}-\Delta_{\gamma}(t)\Big{)} (C.124)
=a(𝔅(a)Δb(t))\displaystyle=a^{*}\Big{(}\mathfrak{B}(a)-\Delta_{b}(t)\Big{)} (C.125)
a(𝔅(a)14κH(θ)2(log(t)/t)a¯)\displaystyle\geq a^{*}\Big{(}\mathfrak{B}(a)-\frac{1}{\mathscr{M}}\frac{4\kappa_{H}}{\mathcal{H}^{*}(\theta^{*})^{2}}(\sqrt{\log(t)/t})\,\bar{a}\Big{)} (C.126)

We aim to express the change in b(a) as a constant multiple of the change in \gamma(a). Thus we apply the relations from Eq. (C.84) and Eq. (C.85), in combination with Eq. (C.126), to obtain the bound on \mathcal{G}_{A}(\bar{a}^{t},\mathfrak{B}(\bar{a}^{t})) expressed in Eq. (C.130).

𝒢A(a¯t,𝔅(a¯t))\displaystyle\mathcal{G}_{A}(\bar{a}^{t},\mathfrak{B}(\bar{a}^{t})) a(𝔅(a)14κH(θ)2(log(t)/t)a¯)\displaystyle\geq a^{*}\Big{(}\mathfrak{B}(a)-\frac{1}{\mathscr{M}}\frac{4\kappa_{H}}{\mathcal{H}^{*}(\theta^{*})^{2}}(\sqrt{\log(t)/t})\bar{a}\Big{)} (C.127)
a(𝔅(a)14κH(θ)2log(t)/t(γ~((θ)+Δ)Δ~a))\displaystyle\geq a^{*}\Big{(}\mathfrak{B}(a)-\frac{1}{\mathscr{M}}\frac{4\kappa_{H}}{\mathcal{H}^{*}(\theta^{*})^{2}}\sqrt{\log(t)/t}\Big{(}\tilde{\gamma}\,(\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}})-\widetilde{\Delta}_{a}\Big{)}\Big{)} (C.128)
=a(𝔅(a)14κH(θ)2log(t)/t(γ~((θ)+κHlog(t)/t)Δ~a))\displaystyle=a^{*}\Big{(}\mathfrak{B}(a)-\frac{1}{\mathscr{M}}\frac{4\kappa_{H}}{\mathcal{H}^{*}(\theta^{*})^{2}}\sqrt{\log(t)/t}\Big{(}\tilde{\gamma}\,\Big{(}\mathcal{H}^{*}(\theta^{*})+\kappa_{H}\sqrt{\log(t)/t}\Big{)}-\widetilde{\Delta}_{a}\Big{)}\Big{)} (C.129)
=ab+C1log(t)/t+C2log(t)/t+C3Δ~a\displaystyle=a^{*}b^{*}+C_{1}\sqrt{\log(t)/t}+C_{2}\log(t)/t+C_{3}\widetilde{\Delta}_{a} (C.130)

Effectively, expanding Eq. (C.129) yields the decomposition in Eq. (C.130) with three constants, defined as,

C1\displaystyle C_{1} =4aκH(γ~(θ)Δ~a)(θ)2\displaystyle=-\frac{4a^{*}\kappa_{H}(\tilde{\gamma}\mathcal{H}^{*}(\theta^{*})-\widetilde{\Delta}_{a})}{\mathscr{M}\mathcal{H}^{*}(\theta^{*})^{2}} (C.131)
C2\displaystyle C_{2} =4aκH2γ~(θ)2\displaystyle=-\frac{4a^{*}\kappa_{H}^{2}\tilde{\gamma}}{\mathscr{M}\mathcal{H}^{*}(\theta^{*})^{2}} (C.132)
C3\displaystyle C_{3} =4aκHΔ~a(θ)2\displaystyle=\frac{4a^{*}\kappa_{H}\widetilde{\Delta}_{a}}{\mathscr{M}\mathcal{H}^{*}(\theta^{*})^{2}} (C.133)

The novelty in our method is that we can force \widetilde{\Delta}_{a} to be arbitrarily small, as a consequence of Lemma C.1, by increasing the order of the polynomial approximation such that \epsilon(N)<\min\{|C_{1}|,|C_{2}|\}. Given this, we can eliminate the term introduced by \widetilde{\Delta}_{a} in the big-\mathcal{O} analysis. Substituting Eq. (C.130) into Eq. (C.99), we obtain a bound on the leader regret.

RA(T)\displaystyle R_{A}(T) t=1Ta𝔅(a)aFθ1(γΔγ(t))\displaystyle\leq\sum^{T}_{t=1}a^{*}\mathfrak{B}(a^{*})-a^{*}F^{-1}_{\theta}\Big{(}\gamma^{*}-\Delta_{\gamma}(t)\Big{)} (C.134)
=t=1TC1log(t)/t+C2log(t)/t+C3Δ~a\displaystyle=\sum^{T}_{t=1}C_{1}\sqrt{\log(t)/t}+C_{2}\log(t)/t+C_{3}\widetilde{\Delta}_{a} (C.135)

By the same process as Lemma 3.2, we take the integral of the discrete upper bound and obtain a big-\mathcal{O} bound on the leader regret. The term scaling linearly in T originates from the estimation error of the inverse CDF, \widetilde{\Delta}_{a}, which we take to be sufficiently small given a suitable range [\underline{b},\bar{b}]; this error is external to the learning algorithm.

RA(T)𝒪(Tlog(T))\displaystyle R_{A}(T)\in\mathcal{O}(\sqrt{T\log(T)}) (C.136)
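
As a numerical illustration of the rate in Eq. (C.136), the snippet below accumulates the terms of Eq. (C.135) with hypothetical positive magnitudes for C_{1} and C_{2} (the arbitrarily small \widetilde{\Delta}_{a} term is dropped) and compares the total against \sqrt{T\log(T)}.

import numpy as np

# Numerical illustration of the rate in Eq. (C.136); C1 and C2 are hypothetical.
C1, C2 = 1.0, 0.5
for T in (10**3, 10**4, 10**5, 10**6):
    t = np.arange(1, T + 1)
    regret = float(np.sum(C1 * np.sqrt(np.log(t) / t) + C2 * np.log(t) / t))
    rate = float(np.sqrt(T * np.log(T)))
    print(f"T={T:>8d}  accumulated bound={regret:12.2f}  sqrt(T log T)={rate:12.2f}  ratio={regret / rate:.3f}")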

C.5 Proof for Corollary 3.1

Relaxation of the Follower’s Objective: Suppose we extend our analysis to another best-response proxy 𝔅~(a)\tilde{\mathfrak{B}}(a), which gives a closer solution to the price-setting Newsvendor (PSN). Let RAT(πA)R_{A}^{T}(\pi_{A}) be the Stackelberg regret of the original Algorithm 1 (as also expressed in Eq. (C.136)). As ataa^{t}\to a^{*}, RAT(πA)R_{A}^{T}(\pi_{A}) now becomes R~AT(πA)\tilde{R}_{A}^{T}(\pi_{A}) with new target a~\tilde{a}^{*} (where a~\tilde{a}^{*} maximizes a𝔅~(a)a\tilde{\mathfrak{B}}(a)),

RAT(πA)=t=1Ta𝔅(a)t=1Tat𝔅(at),becomesR~AT(πA)=t=1Ta~𝔅~(a~)t=1Tat𝔅~(at)\displaystyle R_{A}^{T}(\pi_{A})=\sum^{T}_{t=1}a^{*}\mathfrak{B}(a^{*})-\sum^{T}_{t=1}a^{t}\mathfrak{B}(a^{t}),\xrightarrow{\text{becomes}}\tilde{R}_{A}^{T}(\pi_{A})=\sum^{T}_{t=1}\tilde{a}^{*}\tilde{\mathfrak{B}}(\tilde{a}^{*})-\sum^{T}_{t=1}a^{t}\tilde{\mathfrak{B}}(a^{t}) (C.137)

Since the leader is still learning via the original best-response proxy \mathfrak{B}(\cdot), converging to the solution a^{*}, the regret for the new proxy, \tilde{R}_{A}^{T}(\pi_{A}), can be expressed as:

R~AT(πA)=t=1T(a𝔅(a)ΔG)t=1T(at𝔅(at)ΔG(t))\displaystyle\tilde{R}_{A}^{T}(\pi_{A})=\sum^{T}_{t=1}\Big{(}a^{*}\mathfrak{B}(a^{*})-\Delta_{G}\Big{)}-\sum^{T}_{t=1}\Big{(}a^{t}\mathfrak{B}(a^{t})-\Delta_{G}(t)\Big{)} (C.138)

Where:

ΔG=a𝔅(a)a~𝔅~(a~),ΔG(t)=at𝔅(at)at𝔅~(at)\displaystyle\Delta_{G}=a^{*}\mathfrak{B}(a^{*})-\tilde{a}^{*}\tilde{\mathfrak{B}}(\tilde{a}^{*}),\quad\Delta_{G}(t)=a^{t}\mathfrak{B}(a^{t})-a^{t}\tilde{\mathfrak{B}}(a^{t}) (C.139)

We can express the new regret with the original expression as:

R~AT(πA)\displaystyle\tilde{R}_{A}^{T}(\pi_{A}) =t=1Ta𝔅(a)t=1Tat𝔅(at)RT(πA)+t=1TΔG(t)t=1TΔG=RAT(πA)+t=1T(ΔG(t)ΔG)\displaystyle=\underbrace{\sum^{T}_{t=1}a^{*}\mathfrak{B}(a^{*})-\sum^{T}_{t=1}a^{t}\mathfrak{B}(a^{t})}_{R^{T}(\pi_{A})}+\sum^{T}_{t=1}\Delta_{G}(t)-\sum^{T}_{t=1}\Delta_{G}=R_{A}^{T}(\pi_{A})+\sum^{T}_{t=1}\Big{(}\Delta_{G}(t)-\Delta_{G}\Big{)} (C.140)
=RAT(πA)+t=1T(at𝔅(at)at𝔅~(at)a𝔅(a)+a~𝔅~(a~))\displaystyle=R_{A}^{T}(\pi_{A})+\sum^{T}_{t=1}\Big{(}a^{t}\mathfrak{B}(a^{t})-a^{t}\tilde{\mathfrak{B}}(a^{t})-a^{*}\mathfrak{B}(a^{*})+\tilde{a}^{*}\tilde{\mathfrak{B}}(\tilde{a}^{*})\Big{)} (C.141)
\displaystyle=R_{A}^{T}(\pi_{A})+\underbrace{\sum^{T}_{t=1}\Big{(}a^{t}\mathfrak{B}(a^{t})-a^{*}\mathfrak{B}(a^{*})\Big{)}}_{c(T)\leq 0}+\underbrace{\sum^{T}_{t=1}\Big{(}\tilde{a}^{*}\tilde{\mathfrak{B}}(\tilde{a}^{*})-a^{t}\tilde{\mathfrak{B}}(a^{t})\Big{)}}_{f(T)\geq 0} (C.142)

Notes on Asymptotic Performance: In short, by rearranging the expression for R~AT(πA)\tilde{R}_{A}^{T}(\pi_{A}), for sufficiently large sample size TT, we can express:

R~AT(πA)RAT(πA)+c(T)+f(T)\displaystyle\tilde{R}_{A}^{T}(\pi_{A})\leq R_{A}^{T}(\pi_{A})+c(T)+f(T) (C.143)

We note that as T\to\infty, |c(T)|\leq C\sqrt{T\log(T)} for some constant C, by extension of Thm. 4. Furthermore, \tilde{a}^{*}\tilde{\mathfrak{B}}(\tilde{a}^{*})-a^{t}\tilde{\mathfrak{B}}(a^{t})\to\hat{\epsilon}, where \hat{\epsilon}=\tilde{a}^{*}\tilde{\mathfrak{B}}(\tilde{a}^{*})-a^{*}\tilde{\mathfrak{B}}(a^{*}), due to the divergence between the proxy objectives. Additionally, c(T)\leq 0 holds by definition, since a^{*} is the optimal solution to a\,\mathfrak{B}(a) under the original risk-free pricing proxy and a^{t}\mathfrak{B}(a^{t})\leq a^{*}\mathfrak{B}(a^{*}). Likewise, \hat{\epsilon}\geq 0, as \tilde{a}^{*} optimizes a\,\tilde{\mathfrak{B}}(a) and \tilde{a}^{*}\tilde{\mathfrak{B}}(\tilde{a}^{*})\geq a^{t}\tilde{\mathfrak{B}}(a^{t}). Therefore, for sufficiently large values of T:

R~AT(πA)𝒪(Tlog(T)+ϵ^T)\displaystyle\tilde{R}_{A}^{T}(\pi_{A})\in\mathcal{O}(\sqrt{T\log(T)}+\hat{\epsilon}T) (C.144)

Because the PSN has no closed-form solution, the follower must always approximate their best response with some proxy \tilde{\mathfrak{B}}(a), which induces a suboptimality term \hat{\epsilon} in the leader's Stackelberg regret. Depending on the fixed value of \hat{\epsilon}, relative to the terms expressed in Eq. (C.159) of Appendix C.6, \hat{\epsilon} could be negligible; in other cases, \hat{\epsilon} can be computed explicitly, given \tilde{\mathfrak{B}}(\cdot) and \mathfrak{B}(\cdot), and added to the regret term, depending entirely on the approximation method. Fundamentally, this Stackelberg game always relies on assumptions regarding the follower's approximation to the PSN problem, and the current regret bounds transfer to a new proxy given its approximation error.

C.6 Non-Linear Demand Function under Exponential Link Function

The current framework can accommodate certain non-linear demand functions by employing Generalized Linear Models (GLMs). We apply a link function Φ()\Phi(\cdot) with a well-defined functional form in order to establish a linear relationship: Φ(Γ(p))=θ0θ1p\Phi(\Gamma(p))=\theta_{0}-\theta_{1}p. Consider the link function Φ(x)=log(x)\Phi(x)=\log(x), which implies Γ(p)=exp(θ0θ1p)\Gamma(p)=\exp(\theta_{0}-\theta_{1}p). This choice simplifies the analysis for Theorem 3. Specifically, the maximization operator from Eq. (3.5a) becomes (θ)=θ^0\mathcal{H}(\theta)=\hat{\theta}_{0}, where θ^1\hat{\theta}_{1} can be disregarded as it does not influence the optimum of the new expected revenue function pexp(θ0θ1p)p\exp(\theta_{0}-\theta_{1}p). This can be verified through calculus or graphical methods by evaluating the solution of the optimum of pexp(θ0θ1p)p\exp(\theta_{0}-\theta_{1}p).

Thus, our updated objective is to maximize θ0\theta_{0} alone, subject to the constraints given by:

Maximize: θ0,\displaystyle\theta_{0}, (C.145)
Subject to: (θ^0θ0)2κlog(t)t.\displaystyle\sqrt{(\hat{\theta}_{0}-\theta_{0}^{*})^{2}}\leq\frac{\kappa\sqrt{\log(t)}}{\sqrt{t}}. (C.146)

This adjustment simplifies the analysis from Appendix Eq. (B.29) to Eq. (B.42), focusing solely on the estimated variable \hat{\theta}_{0}. It is straightforward to demonstrate that the bound in Appendix Eq. (B.42) remains valid under this new objective function. Therefore, the non-linear link function \Phi(x)=\log(x) preserves the bounds on Stackelberg regret established in Theorem 4, and it follows that the rest of the theoretical results regarding Stackelberg regret continue to hold. For any hypothetical non-linear demand function, the applicability of our theoretical results will depend on the specific functional form of the GLM link function. However, we posit that for any demand function of the logarithmic or exponential class, the theoretical contributions of this paper remain valid.
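
A minimal sketch of the exponential-link setup above: with \Phi(x)=\log(x), log-demand is linear in price, so \theta can be estimated by least squares and the optimistic \theta_{0} is obtained by adding the confidence width of Eq. (C.146). The simulated demand model, noise level, and the constant \kappa are illustrative assumptions, not values from the paper.

import numpy as np

# Hedged sketch of the exponential-link GLM: log(Gamma(p)) = theta0 - theta1 * p.
rng = np.random.default_rng(0)
theta0_true, theta1_true, t = 3.0, 0.4, 500

prices = rng.uniform(1.0, 6.0, size=t)
log_demand = theta0_true - theta1_true * prices + rng.normal(0.0, 0.1, size=t)

X = np.column_stack([np.ones(t), -prices])       # design so coefficients are [theta0, theta1]
theta_hat, *_ = np.linalg.lstsq(X, log_demand, rcond=None)

kappa = 1.0                                      # placeholder confidence constant
radius = kappa * np.sqrt(np.log(t) / t)          # confidence width as in Eq. (C.146)
theta0_optimistic = theta_hat[0] + radius        # maximize theta0 over the confidence interval
print("theta_hat:", np.round(theta_hat, 3), " optimistic theta0:", round(float(theta0_optimistic), 3))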

Appendix D Equilibrium Proofs

D.1 Proof of Theorem 6

Convergence to Equilibrium (Algorithm 1): The LNPG algorithm (Algorithm 1) converges to an approximate Stackelberg equilibrium, as defined in Section 2.3.

Proof.

Follower Best Response: By Lemma 3.3, so long as \bar{p}_{0} converges to p_{0}, then \bar{p}^{*}(t) converges to p^{*}. We also know from Theorem 3 that \bar{p}_{0} converges to p_{0} in the asymptotic infinite time horizon, where,

p0\displaystyle p_{0} pFθ1(12a(θ)+a)Fθ1(1ap)\displaystyle\geq p^{*}\xrightarrow[]{}F_{\theta}^{-1}\Big{(}1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+a}\Big{)}\geq F_{\theta}^{-1}\Big{(}1-\frac{a}{p^{*}}\Big{)} (D.1)

We can solve p^{*}-p_{0} from Eq. (2.17), and we can also compute a solution to the inverse CDF, F_{\theta}^{-1}\big{(}1-\frac{a}{p^{*}}\big{)}. We define,

𝔼[𝒢B(p|a)]\displaystyle\mathbbm{E}[\mathcal{G}_{B}(p|a)] =𝔼[θ(p|a)]𝔼[aFθ1(1a/p)]\displaystyle=\mathbbm{E}[\mathcal{M}_{\theta}(p|a)]-\mathbbm{E}[aF_{\theta}^{-1}(1-a/p)] (D.2)
\displaystyle\text{where,}\quad\mathbbm{E}[\mathcal{M}_{\theta}(p|a)]\equiv\mathbbm{E}[p\min(d_{\theta}(p),F_{\theta}^{-1}(1-a/p))] (D.3)

\mathbb{E}[\mathcal{M}_{\theta}(p|a)] is a controllable quantity which consequently maximizes \mathbb{E}[\mathcal{G}_{B}(p|a)]. As F_{\theta}^{-1}(\cdot) is a deterministic function, we note that \mathbb{E}[aF_{\theta}^{-1}(1-a/p)]=aF_{\theta}^{-1}(1-a/p). Therefore,

$\mathbbm{E}[\mathcal{G}_{B}(p^{*}|a)] = \mathbbm{E}[\mathcal{M}_{\theta}(p^{*}|a)]-aF_{\theta}^{-1}(1-a/p^{*})$ (D.4)
$\mathbbm{E}[\mathcal{G}_{B}(p_{0}|a)] = \mathbbm{E}[\mathcal{M}_{\theta}(p_{0}|a)]-aF_{\theta}^{-1}(1-a/p_{0})$ (D.5)
$= \mathbbm{E}\Big[\mathcal{M}_{\theta}\Big(\frac{\mathcal{H}^{*}(\theta^{*})+a}{2}\Big)\Big]-aF_{\theta}^{-1}\Big(1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+a}\Big)$ (D.6)

From Theorem 4, the follower always incurs a suboptimality gap upper bounded by $\epsilon_{B}>0$, given an upper-bounding approximation function for $\mathcal{G}_{B}(\cdot)$, where $\epsilon_{B}$ denotes the gap between the optimal reward under the best response $p^{*}$ and the reward under $p_{0}$, given $a$.

$\epsilon_{B}(a)=\mathbbm{E}[\mathcal{G}_{B}(p^{*}|a)]-\mathbbm{E}[\mathcal{G}_{B}(p_{0}|a)]$ (D.7)

The optimistic estimate $\bar{p}_{0}$ approaches $p_{0}$ with a worst-case error of $\mathcal{O}(\sqrt{\log(t)/t})$ by Theorem 3. Specifically, we can express $\mathbbm{E}[\mathcal{G}_{B}(\bar{p}_{0}|a)]$ as,

$\mathbbm{E}[\mathcal{G}_{B}(\bar{p}_{0}|a)] = \mathbbm{E}\Big[\mathcal{M}_{\theta}\Big(\frac{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}(t)+a}{2}\Big)\Big]-aF_{\bar{\theta}}^{-1}\Big(1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}(t)+a}\Big)$ (D.8)

Let $\Delta_{B}(t)$ denote the suboptimality gap over time.

$\Delta_{B}(t|a) = \mathbbm{E}[\mathcal{G}_{B}(p^{*}|a)]-\mathbbm{E}[\mathcal{G}_{B}(\bar{p}_{0}|a)]$ (D.9)
$= \mathbbm{E}[\mathcal{G}_{B}(p^{*}|a)]-\mathbbm{E}[\mathcal{G}_{B}(p_{0}+\Delta_{\mathcal{H}}(t)|a)]$ (D.10)
$= \mathbbm{E}[\mathcal{G}_{B}(p^{*}|a)]-\mathbbm{E}[\mathcal{M}_{\theta}(\bar{p}_{0})]+aF_{\theta}^{-1}(1-a/\bar{p}_{0})$ (D.11)
$= \mathbbm{E}[\mathcal{G}_{B}(p^{*}|a)]-\mathbbm{E}\Big[\mathcal{M}_{\theta}\Big(\frac{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}(t)+a}{2}\Big)\Big]+aF_{\bar{\theta}}^{-1}\Big(1-\frac{2a}{\mathcal{H}^{*}(\theta^{*})+\Delta_{\mathcal{H}}(t)+a}\Big)$ (D.12)

As $t\xrightarrow[]{}\infty$, $\bar{\theta}\xrightarrow[]{}\theta^{*}$, and $\Delta_{\mathcal{H}}(t)\xrightarrow[]{}0$. Therefore,

$\underset{a\in\mathcal{A}}{\max}\ \Big(\lim_{t\to\infty}\ \Delta_{B}(t|a)\Big)=\epsilon_{B}$ (D.13)
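To make the limit behind Eq. (D.13) concrete, the following is a minimal Monte Carlo sketch of $\mathbbm{E}[\mathcal{G}_{B}(p|a)]$ per Eqs. (D.2)-(D.3), under an assumed linear demand with additive Gaussian noise; the parameter values ($\theta_{0}$, $\theta_{1}$, $\sigma$, $a$) are purely illustrative and the best response is found by grid search rather than by the paper's closed-form analysis. It shows the follower's reward loss from a $\Delta$-perturbed price vanishing as the perturbation shrinks, mirroring $\Delta_{\mathcal{H}}(t)\to 0$.

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo sketch of E[G_B(p|a)] from Eqs. (D.2)-(D.3) under an *assumed*
# linear-Gaussian demand with hypothetical parameters (not the paper's).

rng = np.random.default_rng(0)
theta0, theta1, sigma = 40.0, 5.0, 5.8   # hypothetical demand parameters
a = 2.0                                   # hypothetical wholesale price set by the leader
eps = rng.standard_normal(100_000)        # common random numbers across price evaluations

def expected_follower_reward(p):
    """Estimate E[G_B(p|a)] = E[p * min(d_theta(p), b)] - a * b, with b = F^{-1}(1 - a/p)."""
    b = theta0 - theta1 * p + sigma * norm.ppf(1.0 - a / p)   # order quantity (demand quantile)
    demand = theta0 - theta1 * p + sigma * eps                 # stochastic demand at price p
    return np.mean(p * np.minimum(demand, b)) - a * b

# Grid-search an approximate best-response price p* for this a.
grid = np.linspace(a + 0.5, theta0 / theta1 - 0.5, 800)
p_star = grid[np.argmax([expected_follower_reward(p) for p in grid])]

# The reward gap from a perturbed price shrinks as the perturbation Delta -> 0,
# mirroring Delta_H(t) -> 0 in Eq. (D.12); residual fluctuation is Monte Carlo error.
for delta in (1.0, 0.3, 0.1, 0.03):
    gap = expected_follower_reward(p_star) - expected_follower_reward(p_star + delta)
    print(f"Delta = {delta:4.2f}  ->  reward gap = {gap:.4f}")
```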

Follower’s Perspective: As the follower will always be subject to a suboptimality gap of at most $\epsilon_{B}$, the learned policy constitutes an $\epsilon$-approximate Stackelberg equilibrium, with $\epsilon_{B}$ denoting the maximum difference in the follower’s expected reward due to the approximate best response (as defined in Eq. (2.8)).

Leader’s Perspective: In this problem setting, the follower does not attempt to alter the strategy of the leader via its own strategy, and only the optimal order amount computed by $b=F_{\theta}^{-1}(\gamma)$ affects the reward of the leader. Assuming the follower abides by an optimistic riskless price $\bar{p}_{0}$, we suppose that there exists some uncertainty with respect to the level of optimism of the follower’s response, in which $\mathfrak{B}(a)\in\mathfrak{B}_{\varepsilon}(a)\equiv[\underline{b_{a}},\bar{b}_{a}]$.

Refer to caption
Figure 6: Visualizing the optimistic $\bar{\mu}=\bar{\theta}_{0}$. $\mu^{*}$ represents the true $\theta_{0}^{*}$. Suppose A and B have different estimates of $\theta_{0}$, where $\mu^{*}$ lies between $[\hat{\mu},\bar{\mu}]$ with probability $1-\delta$. The orange lines for $[\hat{\mu}_{A}^{\prime},\bar{\mu}_{A}^{\prime}]$ and $[\hat{\mu}_{B}^{\prime},\bar{\mu}_{B}^{\prime}]$ denote the worst-case scenario for $\Delta_{\mu}=\bar{\mu}_{B}^{\prime}-\hat{\mu}_{A}^{\prime}$ due to misspecification.

We define $\theta$ as a 2-element vector $[\theta_{0},\theta_{1}]^{\intercal}$, where $\hat{\theta}\equiv[\hat{\theta}_{0},\hat{\theta}_{1}]^{\intercal}$ and $\theta^{*}\equiv[\theta_{0}^{*},\theta_{1}^{*}]^{\intercal}$. We let $\hat{\theta}_{A}$ and $\hat{\theta}_{B}$ denote the estimates of $\theta^{*}$ for players A and B respectively. Leveraging Lemma 3.1, we bound the confidence ranges for the estimates of $\theta^{*}$, as expressed in Eq. (D.14) to Eq. (D.15). Thus we construct the following uncertainty ranges, denoted $\mathbf{u}_{A}$ and $\mathbf{u}_{B}$,

$\mathbf{u}_{A} = \left\{\theta:\norm{\theta-\hat{\theta}_{A}}_{2}\leq\kappa\sqrt{\log(t_{A})/t_{A}}\right\}$ (D.14)
$\mathbf{u}_{B} = \left\{\theta:\norm{\theta-\hat{\theta}_{B}}_{2}\leq\kappa\sqrt{\log(t_{B})/t_{B}}\right\}$ (D.15)

We see that, due to potential differences in sample sizes $t_{A}$ and $t_{B}$, the confidence regions $\mathbf{u}_{A}$ and $\mathbf{u}_{B}$ can exhibit different Lebesgue measures, highlighting the effect of varying sample sizes on parameter estimation in the respective domains (for example, A may have a different estimate of $\theta^{*}$ than B because she is more confident in her market demand estimate, having been in the market longer and obtained more observations than B). In this scenario, $t_{A}>t_{B}$. For the purpose of this analysis, without loss of generality, we suppose A is always more confident than B (the opposite can be argued without altering the mechanics of the proof).

It follows that, with probability $1-\delta^{2}$, $\theta^{*}$ must lie within the union of the two uncertainty ranges, which overlap each other.

$\theta^{*}\in\mathbf{u}_{A}\cup\mathbf{u}_{B}$ (D.16)
$\mathbf{u}_{A}\cap\mathbf{u}_{B}\neq\emptyset$ (D.17)

Eq. (D.16) and Eq. (D.17) represent the union and intersection of the two confidence regions; due to Eq. (D.17), the two confidence regions overlap, while the probability that $\theta^{*}$ lies outside their union is less than $\delta$. These conditions imply the existence of a non-empty subset common to both sets, ensuring the non-disjointness of the confidence regions.
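As an illustration of Eqs. (D.14)-(D.17), the following is a minimal sketch with hypothetical estimates and sample sizes (chosen only so that the conditions are satisfied): two Euclidean confidence balls of radius $\kappa\sqrt{\log(t)/t}$ around each player's estimate, a check that the balls intersect, and a check that a candidate parameter lies in their union.

```python
import numpy as np

# Sketch (hypothetical values) of the confidence-region conditions in
# Eqs. (D.14)-(D.17): Euclidean balls around each player's estimate of theta*.

kappa = 1.0
theta_hat_A, t_A = np.array([73.02, 7.01]), 5_000   # A: longer market history
theta_hat_B, t_B = np.array([72.98, 6.99]), 800     # B: fewer observations

def radius(t):
    """Confidence radius kappa * sqrt(log(t) / t), as in Eqs. (D.14)-(D.15)."""
    return kappa * np.sqrt(np.log(t) / t)

r_A, r_B = radius(t_A), radius(t_B)

def in_ball(theta, center, r):
    return np.linalg.norm(theta - center) <= r

# Eq. (D.17): the balls intersect iff the distance between centers is at most r_A + r_B.
balls_overlap = np.linalg.norm(theta_hat_A - theta_hat_B) <= r_A + r_B

# Eq. (D.16): a candidate theta (e.g. the true parameter) should lie in the union w.h.p.
theta_candidate = np.array([73.0, 7.0])
in_union = in_ball(theta_candidate, theta_hat_A, r_A) or in_ball(theta_candidate, theta_hat_B, r_B)

print(f"r_A = {r_A:.4f}, r_B = {r_B:.4f}, overlap = {balls_overlap}, in_union = {in_union}")
```

Since $t_{A}>t_{B}$, the radius of $\mathbf{u}_{A}$ is the smaller of the two, consistent with A being the more confident player.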

Refer to caption
Figure 7: Visualizing the confidence regions $\mathbf{u}_{A}$ and $\mathbf{u}_{B}$ as two ellipsoid regions, where $\theta^{*}$ must fall within the intersection of the two regions. As the volumes of both ellipsoids shrink with time $t$, the uncertainty in $\theta^{*}$ also decreases.

As both $t_{A}$ and $t_{B}$ approach infinity, any optimistic estimate of $\theta^{*}$, denoted $\bar{\theta}^{*}$, for either A or B would approach $\hat{\theta}_{A}$ and $\hat{\theta}_{B}$ respectively. From Eq. (D.14) and Eq. (D.15), both $\hat{\theta}_{A}$ and $\hat{\theta}_{B}$ approach the true parameter $\theta^{*}$. By the squeezing of the two overlapping confidence regions as $\min\{t_{A},t_{B}\}\to\infty$, any action the leader takes leads to a tighter range of optimistic best responses for the follower. Thus, with probability $1-\delta$, we have convergence to a unique approximate best response, as the best response range $\widetilde{\mathfrak{B}}(a)\in\mathfrak{B}_{\varepsilon}(a)\equiv[\underline{b_{a}},\bar{b}_{a}]$ is shrinking and $\bar{b}_{a}-\underline{b_{a}}\xrightarrow[]{}0$ as $\min\{t_{A},t_{B}\}\to\infty$. Consequently, the system converges to a unique best response with increasing sample size, $\mathfrak{B}_{\varepsilon}(a)\xrightarrow[]{}\{\widetilde{\mathfrak{B}}(a)\}$. The convergence of the algorithm to $\epsilon_{B}$ completes the definition of the $\epsilon$-approximate Stackelberg equilibrium, as defined in Section 2.3.

Appendix E Experimental Results

We present additional experiments with varying environment and LNPG algorithm parameters. Empirical analysis across multiple settings consistently demonstrates LNPG’s convergence to a stable equilibrium and its superiority over the baseline algorithm (UCB), particularly when minimizing Stackelberg regret.

Refer to caption
Refer to caption
Parameters: $\kappa=1.0,\theta_{0}=73,\theta_{1}=7,\sigma=3.2$.
Refer to caption
Refer to caption
Parameters: $\kappa=1.0,\theta_{0}=80,\theta_{1}=11,\sigma=3.2$.
Refer to caption
Refer to caption
Parameters: $\kappa=0.05,\theta_{0}=40,\theta_{1}=5,\sigma=5.8$.
Figure 8: Regret performance of various experiments. Mean values are computed over 20,000 trials. All shaded areas, denoting confidence intervals, are within a quarter quantile.
Parameter Value
Number of Arms (UCB) 50
Pricing Range $[0,50]$
Number of Trials 20,000
Table 2: Global hyperparameters.
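For reproducibility, the settings above can be collected into a single configuration object. The following is a hypothetical sketch gathering the global hyperparameters of Table 2 together with the first environment setting of Figure 8; the field names and the dataclass structure are illustrative and are not taken from the paper's implementation.

```python
from dataclasses import dataclass

# Hypothetical configuration sketch combining Table 2 with one Figure 8 setting.
# Field names are illustrative only.

@dataclass(frozen=True)
class ExperimentConfig:
    n_arms_ucb: int = 50              # number of discretized price arms for the UCB baseline
    price_range: tuple = (0.0, 50.0)  # admissible pricing range
    n_trials: int = 20_000            # trials averaged to produce the regret curves
    kappa: float = 1.0                # confidence-width scaling
    theta0: float = 73.0              # demand intercept
    theta1: float = 7.0               # demand slope
    sigma: float = 3.2                # demand noise standard deviation

config = ExperimentConfig()
print(config)
```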