Online Stochastic Allocation of Reusable Resources
Abstract
We study a multi-objective model on the allocation of reusable resources under model uncertainty. Heterogeneous customers arrive sequentially according to a latent stochastic process, request certain amounts of resources, and occupy them for random durations. The decision maker's goal is to simultaneously maximize multiple types of rewards generated by the customers, while satisfying the resource capacity constraints in each time step. We develop models and algorithms for deciding on the allocation actions. We show that when the usage durations are relatively small compared with the length of the planning horizon, our policy achieves a fraction of the optimal expected rewards, and the optimality gap decays to zero at a near-optimal rate as the resource capacities grow.
I Introduction
In the online optimization framework, information is revealed sequentially in time. Decisions are made without knowledge of future information, but they can depend on past observations. In this work, we study online optimization algorithms for reusable resource allocation applications, where a resource unit is returned to the system after a period of usage and can then be assigned to another customer. The decision-maker (DM) assigns limited inventories of reusable resources to sequentially arriving customers. In each time step, the DM's decision leads to a set of allocation outcomes, consisting of the amounts of rewards earned, the amounts of resources consumed and the usage durations of the assigned resources. Our model captures a variety of real-life applications, including hotel booking, rentals of cars and fashion items, and cloud computing services. Our problem instance incorporates the following features:
1. Multiple objectives. The DM's goal is to maximize multiple types of rewards.
2. Customer heterogeneity. The customers are associated with different customer types.
3. Online setting. In each time step, the arriving customer's type is drawn independently and identically from an unknown probability distribution.
4. Reusability. Each type of resource is endowed with a stochastic usage duration, whose probability distribution is known to the DM. However, the DM does not know the realized usage duration of an allocation before the resource is returned.
Features 1-3 are shared by both non-reusable and reusable resource allocation problems, while feature 4 is distinctive of reusable resource allocation problems. Without feature 4, our problem reduces to the non-reusable setting as in [1]. On feature 3, we remark that in many applications, given a customer type, its mean allocation outcome is accessible by machine learning (ML) approaches in a data-efficient manner ([2, 3, 4]). In much of the existing resource allocation research ([21, 5, 6, 7, 8]), the mean allocation outcomes are assumed to be prior knowledge acquired through ML models. However, in applications where customer types are represented as high-dimensional feature vectors, the number of types can be exponential in the dimension of the feature vectors or even unbounded. Such a curse of dimensionality hinders estimation of the proportion of each customer type. Therefore, we treat the probability distribution over customer types as the unknown object. We further remark on feature 4 that our usage duration is defined in the same manner as in [6] and [9], which are recent works on offline reusable resource allocation problems. We highlight that the probability distribution of each usage duration can be arbitrary, which means our result does not depend on specific structures of particular usage distributions, such as the exponential distribution.
Traditional resource allocation problems [10, 11] concern the allocation of non-reusable resources. Online algorithms for allocating non-reusable resources have been extensively studied in [12, 13, 14, 1, 15, 16, 17]. These algorithms involve adaptive weighing processes that balance the trade-off between the rewards earned and the resources consumed. Their analyses largely depend on the monotonically decreasing inventories. However, in our reusable model, the effect of an allocation may differ across future time steps, contingent on whether the allocated resources have been returned, causing the resource consumption amounts to fluctuate across consecutive time steps.
To our knowledge, this is the first paper to address reusable resource allocation problems in an online stochastic setting and demonstrate a near-optimal performance guarantee. Some studies focus on assortment planning problems in adversarial settings ([20, 18, 19]), and achieve non-trivial competitive ratios. Offline pricing and assortment planning problems have been studied in [6, 9, 20], where near-optimality is achieved in the form of approximation ratios under full model certainty. The main contributions of our paper can be summarized as follows.
- Model generality. We propose a general reusable resource allocation model which allows for various decision settings (such as admission control, matching, pricing and assortment planning), multiple objectives (such as revenue, market share and service level), and large numbers of customer types or allocation types (the algorithm's performance is independent of these sizes).
- Near-optimal algorithm performance. We develop an adaptive weighing algorithm that trades off among not only the resources occupied and the rewards earned, but also the usage durations. In the regime where each usage duration is short compared with the length of the planning horizon, our algorithm achieves near-optimal performance matching the online non-reusable resource setting ([1]) as well as the state-of-the-art offline reusable setting ([20]).
The remainder of the paper is organized as follows. In Section II, we present our model and highlight some of its applications. An online algorithm and the corresponding performance analysis are presented in Section III. In Section IV, numerical experiments are provided.
II Model
Notation. The reward types and the resource types are respectively indexed by two finite sets and . A generic reward type or resource type is denoted as . For each , the DM has units of resource for allocation. Each customer is associated with a customer type , which reflects the customer's features. We denote the set of possible allocation actions as , and each element as an action. The action set can model a broad range of decisions, which is elaborated at the end of Section II.
The DM allocates the resources in discrete time steps. In time step , at most one customer arrives. We denote the customer type of the arrival at time as . In particular, we designate the type to represent the case of no arrival. We assume that are independently and identically distributed (i.i.d.) random variables over . We denote , and .
When a customer (say of type ) arrives, the DM chooses an action . The choice leads to an array of stochastic outcomes, consisting of the amounts of rewards earned , the amounts of resources occupied , and the usage durations . For the no-arrival customer type , we stipulate that for all , since there should be no reward earned and no resource occupied in the case of no arrival. To ensure feasibility in our resource-constrained model, we assume that there exists a null action that satisfies for all . Selecting the null action is equivalent to rejecting a customer.
For each , the stochastic outcomes follow the joint distribution , namely . We allow to be arbitrarily correlated. For each , the random usage duration is independent of . This assumption is also made in related works on offline reusable resource allocation, such as [6] and [9], since the usage duration primarily reflects a customer's intrinsic need for each resource. We assume that for each , almost surely for each , and almost surely for each . Additionally, we denote , , and .
Model uncertainty and dynamics. We assume that the DM knows the horizon length , the values of , , , as well as for every . However, the DM does not know the probability distribution over customer types. At each time step , the DM observes the type of the arriving customer, and the mean outcomes specific to the type . Then, the DM chooses an action , and observes the stochastic outcomes of rewards and resources at time . Our model uncertainty scenario includes the case where the DM knows the mean outcomes in advance. For example, the DM could have estimates of by constructing supervised learning models [2, 3, 4] on a pool of customer demand data.
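To make these dynamics concrete, the following sketch simulates the interaction loop for a toy single-resource instance. The policy, distributions, and names (`occupied`, the geometric duration, the capacity value) are placeholders introduced for illustration only; they are not the paper's algorithm or notation.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000                   # planning horizon
capacity = 20              # capacity of the single (hypothetical) resource
in_use = []                # return times of currently occupied units
total_reward = 0.0

def occupied(t):
    """Number of units still held by earlier customers at time t."""
    return sum(1 for ret in in_use if ret > t)

for t in range(T):
    j = rng.integers(0, 5)  # customer type, drawn i.i.d. from an unknown distribution
    # The DM observes j (and the type-j mean outcomes) and chooses an action.
    # Placeholder policy: admit whenever a unit is free, otherwise take the null action.
    if occupied(t) < capacity:
        total_reward += rng.uniform(0, 1)   # stochastic reward of the allocation
        duration = rng.geometric(0.1)       # stochastic usage duration
        in_use.append(t + duration)         # the unit is returned after `duration` steps

print("total reward over the horizon:", round(total_reward, 2))
```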
An integer programming formulation. We let binary decision variables be the DM's decisions under a non-anticipatory algorithm , where if action is taken at time , and otherwise. Under a non-anticipatory algorithm, depends on historical observations . The DM aims to maximize , which achieves the simultaneous maximization of all the reward types by ensuring max-min fairness. Here we maximize the rewards, in line with the resource allocation literature, instead of minimizing the regret as in the classical online convex optimization literature; we remark that the two are essentially equivalent. For each and , we require that the resource constraint
(1)
holds with certainty. The left-hand side of (1) represents the amount of type resources occupied at time step . Our goal can be formulated as the following binary integer program:
We remark that the term induces non-stationarity in resource consumption, since even when a DM selects the same action at , the amounts of resource consumption at time are differently distributed. Existing works on non-reusable resources crucially hinge on model stationarity, which does not hold in our setting.
A tractable benchmark. Constructing a non-anticipatory algorithm that achieves the optimal value of (IP-C) is analytically intractable due to the curse of dimensionality. The intractability motivates us to consider an alternative linear program (LP), dubbed (LP-E), in which the realizations of the customer arrivals, their usage durations and outcomes exactly follow their expectations:
Define the optimal objective value of (LP-E) to be , and let the optimal objective value of (IP-C) be . The following lemma shows that is a tractable upper bound on the expected reward of any online algorithm.
Lemma 1.
.
For the algorithm design, we further introduce a “steady state” benchmark, assuming the decision variables are invariant across time:
(LP-SS):
s.t.
We denote an optimal solution of (LP-SS) as , and the optimal value of (LP-SS) as . We further define
Assumption 1.
There exists and such that , .
This assumption indicates that our algorithm does not apply to large , say for non-reusable resources where with certainty. In the next lemma, we show that is close to .
Lemma 2.
.
We remark that under a wide range of usage durations, we can use (LP-SS) as a benchmark to gauge the performance of our algorithm. For instance, for light tailed (for example, there exists such that ), we can take , . If has bounded support, i.e. almost surely, we can take and let .
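To illustrate the light-tailed case with a concrete example (using hypothetical notation, since the symbols above did not fully survive extraction): if a usage duration d is geometric with parameter q, its tail decays exponentially, so a truncation level logarithmic in T already captures all but a 1/T fraction of the probability mass:

```latex
% Geometric usage duration with parameter q \in (0,1):
\Pr[d > s] = (1-q)^{s} \le e^{-qs},
\qquad\text{so}\qquad
\Pr\!\Bigl[d > \tfrac{1}{q}\log T\Bigr] \le \frac{1}{T}.
```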
Applications. Before proceeding to our algorithm development, we highlight the generality of our model by discussing some of its applications, where the reward type set , the customer type set and the action set can be chosen to model a variety of decisions.
- Admission control. In this setting, the DM either admits or rejects each arriving customer [21]. Real-life examples include patient inflow control in an emergency department or an ICU. The admission control setting can be modeled by letting the action set . The reward of a resource is fixed at for . Upon taking an action for a type customer, an array of stochastic demands is generated. Our model captures different reward settings; we list some examples (see the sketch after this list). For simultaneously maximizing the revenue/social welfare of each type of resource, define and . For maximizing the total revenue/social welfare of all resources, let be a singleton, and define . For maximizing the service level of each resource, we define and . We remark that multiple kinds of rewards can be modeled simultaneously.
- Assortment planning. In assortment planning problems, one unit of resource is associated with a fixed price . The DM influences the customers' demands by offering different assortments of resources. Real-life assortment planning examples with reusable resources include rentals of fashion items and vehicles. Contingent upon the arrival of a customer, say of type , the DM decides the assortment to display, where is a collection of subsets of ([9, 22]). Let denote the probability for customer type to choose product in assortment . In the revenue management literature, this probability is modeled by a random utility choice model. The assortment planning problem (simultaneously maximizing the revenue of each resource) can be incorporated in our model by setting , setting to be the Bernoulli random variable with mean , and setting .
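The reward settings listed under the admission control application map directly onto functions of the realized demand vector. The sketch below illustrates the per-resource revenue, total revenue, and service-level rewards with hypothetical prices and demands; it illustrates the encoding only and is not code from the paper.

```python
import numpy as np

prices = np.array([3.0, 5.0, 2.0])       # hypothetical fixed price of each resource

def per_resource_revenue(served):
    """One reward type per resource: revenue earned on each resource separately."""
    return prices * served

def total_revenue(served):
    """A single reward type: total revenue summed over all resources."""
    return np.array([prices @ served])

def service_level(served, requested):
    """One reward type per resource: fraction of the requested amount actually served."""
    out = np.ones_like(served, dtype=float)
    np.divide(served, requested, out=out, where=requested > 0)
    return out

served = np.array([1, 0, 2])             # amounts served to an admitted customer
requested = np.array([1, 1, 2])          # amounts the customer requested
print(per_resource_revenue(served), total_revenue(served), service_level(served, requested))
```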
III Online algorithm and performance analysis
Main results. In this section, we propose a multi-stage adaptive weighing algorithm (dubbed Algorithm as in “Adaptive”, and displayed in Algorithm 1). We first provide our main results. We assume there exists satisfying , where . Let
denote the type reward achieved by Algorithm . Our main result is stated in the following theorem.
Theorem 1.
Let be an arbitrary constant satisfying . Without knowing ,
for every with probability at least .
We remark that if has bounded support, i.e. almost surely for all , then the above reward simplifies to for every with probability at least . In the case when and , our algorithm achieves a reward of at least . This result nearly matches that of [1] for the online non-reusable resource allocation problem, as well as the state-of-the-art result of [20] for offline assortment planning with reusable resources.
The assumption means that the amount of resource for each is sufficiently large, and the planning horizon is sufficiently long. Crucially, the performance guarantee in Theorem 1 does not deteriorate even when or grows. Consequently, our Algorithm is applicable to complex scenarios when .
High-level description and comparison against [1]. Our Algorithm extends [1]'s idea of adaptive weighing from the non-reusable setting to the reusable setting. [1] proposes a multi-stage adaptive weighing algorithm to trade off between the rewards and the resources. For each reward and resource constraint of a fluid LP (corresponding to our (LP-E)), a penalty weight is defined. The weight on each constraint progressively grows as the reward generated or the resource consumed gets closer to its total capacity (the capacity of each reward is approximated). [1] does not apply to the reusable setting, as their penalty weights and algorithm design depend on the monotonically decreasing resource levels.
Somewhat surprisingly, we can still utilize their adaptive weighing idea. We approximate (LP-E) with a knapsack-constrained (LP-SS), and define penalty weights that incorporate the usage durations as well as the resource consumption. In a series of lemmas that eventually lead to Theorem 1, we show that our algorithm effectively capitalizes on the reusability of the resources to maximize the total rewards. We overcome the technical difficulty posed by non-monotonic and time-correlated resource levels while achieving near-optimal performance. Nevertheless, we remark that the closeness of (LP-E) and (LP-SS) builds upon Assumption 1, and hence our algorithm is not applicable to the non-reusable setting.
A multistage online algorithm. In Algorithm 1, we divide the time horizon into stages, where satisfies for some . Stage consists of time periods. This stage is solely for exploration of the latent , and the first customers are served with random actions. Stage consists of time periods. The assumption ensures that (there is at least 1 stage) and (each stage consists of at least time periods). We denote as the type of the customer who arrives at the -th time step in stage (where ), meaning that .
In each stage , Algorithm consists of two steps. In Step 1, we estimate the optimum of (LP-SS). In Step 2, we define "penalty weights" on the constraints of (LP-SS), and choose the action that balances the constraints against one another.
Step 1: Estimate the value of (Line 3 of Algorithm 1). We first derive , which is the optimal objective value of the linear program
s.t.
where , denoting the empirical customer distribution based on the customer arrivals in stage . is a sample average approximation (SAA) of (LP-SS). It is worth mentioning that both (LP-SS) and are highly tractable, even in the assortment planning application when is exponential in ([23]). In addition, the knapsack-type constraints in allow us to apply the adaptive weighing in Step 2 in a manner similar to the non-reusable setting. Given that is easily obtained in Step 1, we define in the following lemma, and show that it is a progressively more accurate estimate of as grows.
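To illustrate Step 1, the sketch below builds an empirical type distribution and solves a small sample-average approximation with `scipy.optimize.linprog`. The exact constraints of (LP-SS) did not survive extraction here, so the steady-state occupancy form (mean consumption times mean duration bounded by capacity) and the max-min reward formulation below are assumptions of this sketch, not the paper's precise LP.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n_types, n_actions, n_resources, n_rewards = 4, 3, 2, 2

p_hat = rng.dirichlet(np.ones(n_types))          # empirical type distribution from the stage
mean_reward = rng.uniform(0, 1, (n_rewards, n_types, n_actions))  # E[r^k] per (type, action)
mean_use = rng.uniform(0, 1, (n_resources, n_types, n_actions))   # E[a^i] per (type, action)
mean_dur = rng.uniform(1, 5, (n_resources, n_types, n_actions))   # E[d^i] per (type, action)
cap = np.array([3.0, 4.0])                                        # resource capacities

# Variables: x[j, a] flattened in row-major order, followed by lam (the min reward rate).
n_x = n_types * n_actions
c = np.zeros(n_x + 1)
c[-1] = -1.0                      # maximize lam  <=>  minimize -lam

A_ub, b_ub = [], []
for i in range(n_resources):      # occupancy: sum_j p_j sum_a x_ja E[a] E[d] <= cap_i
    row = np.zeros(n_x + 1)
    row[:n_x] = (p_hat[:, None] * mean_use[i] * mean_dur[i]).ravel()
    A_ub.append(row); b_ub.append(cap[i])
for k in range(n_rewards):        # reward rate of type k must be at least lam
    row = np.zeros(n_x + 1)
    row[:n_x] = -(p_hat[:, None] * mean_reward[k]).ravel()
    row[-1] = 1.0
    A_ub.append(row); b_ub.append(0.0)
for j in range(n_types):          # each type plays a (sub-)distribution over actions
    row = np.zeros(n_x + 1)
    row[j * n_actions:(j + 1) * n_actions] = 1.0
    A_ub.append(row); b_ub.append(1.0)

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(0, None)] * (n_x + 1))
print("estimated steady-state objective:", -res.fun)
```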
Lemma 3.
Define for . For any , with probability at least ,
where .
Step 2: Run an online algorithm given (Line 4 - Line 10 of Algorithm 1). With slight abuse of notation, we write as , as , and as . In addition, we denote as . Define, at time in stage , and respectively as the resources consumed by the customer and the reward earned.
At the -th time step of stage , after observing the customer type we take the action according to Line 7. The parameter represents a "penalty weight" for the resource constraint in (LP-SS). If the allocation decisions during in stage lead to a high amount of resource occupation at time , the penalty is also high. Similarly, a lower amount of accrued type reward during in stage leads to a higher value of the weight . Both weights quantify the DM's emphasis on resources and rewards.
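The exact rule in Line 7 of Algorithm 1 is not reproduced above, so the sketch below shows one plausible instantiation of the adaptive-weighing step in the spirit of [1]: penalty weights grow multiplicatively as duration-weighted consumption approaches capacity or as accrued rewards lag their targets, and the chosen action minimizes weighted consumption minus weighted reward. All functional forms, names, and constants here are assumptions, not the paper's exact rule.

```python
import numpy as np

def choose_action(j, actions, res_weight, rew_weight, mean_use, mean_dur, mean_reward):
    """Pick the action minimizing weighted (duration-scaled) consumption minus weighted reward.
    mean_use, mean_dur, mean_reward hold the type-j mean outcomes revealed upon arrival."""
    scores = []
    for a in actions:
        cost = sum(res_weight[i] * mean_use[i, j, a] * mean_dur[i, j, a]
                   for i in range(len(res_weight)))
        gain = sum(rew_weight[k] * mean_reward[k, j, a] for k in range(len(rew_weight)))
        scores.append(cost - gain)
    return actions[int(np.argmin(scores))]

def update_weights(res_weight, rew_weight, used, earned, cap_rate, target_rate, eta=0.1):
    """Multiplicative updates (assumed form): resource weights grow as consumption nears
    capacity; reward weights grow as accrued rewards lag their targets."""
    res_weight = res_weight * np.exp(eta * used / cap_rate)
    rew_weight = rew_weight * np.exp(eta * np.maximum(target_rate - earned, 0.0)
                                     / np.maximum(target_rate, 1e-12))
    return res_weight, rew_weight
```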
Input: the number of time periods , the capacities for each resource , the values of and .
Output: actions to take , for .
Performance guarantee of Algorithm . Our analysis involves bounding the total probability of violating each constraint in (LP-C) in all time steps. Each violation leads to unserved customers and lost rewards. We show that under Algorithm , all constraints of (LP-C) are satisfied with high probability. To understand the choice of in Line 7 of Algorithm 1, we introduce an auxiliary offline static algorithm (dubbed Algorithm as in “Static”). Algorithm requires knowing , an optimal solution to (LP-SS), and a tuning parameter for preserving capacities in anticipation of any constraint violation. At a time , if a customer of type arrives, the DM selects action with probability , and selects the null action with probability . We denote as . Define and where for each in stage . A performance guarantee of Algorithm is provided in the following lemma.
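The static randomization of Algorithm admits a compact implementation: given an optimal solution of (LP-SS) and the capacity-preserving tuning parameter, a type-j arrival is assigned action a with a probability obtained by scaling the (LP-SS) solution down by the tuning parameter (taken here as (1 − γ)·x*_{j,a}, an assumed form), and the null action absorbs the remaining mass. The sketch below uses hypothetical names and should be read as an illustration of this randomization, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)

def static_policy(j, x_star, gamma, null_action=None):
    """Algorithm S (sketch): serve a type-j arrival with action a w.p. (1 - gamma) * x_star[j, a],
    and fall back to the null action with the remaining probability."""
    probs = (1.0 - gamma) * x_star[j]                     # x_star[j] sums to at most 1
    probs_full = np.append(probs, 1.0 - probs.sum())      # residual mass -> null action
    idx = rng.choice(len(probs_full), p=probs_full)
    return null_action if idx == len(probs) else idx

# Toy usage: 2 customer types, 3 actions; None denotes the null action.
x_star = np.array([[0.0, 0.5, 0.3], [0.0, 0.2, 0.6]])     # hypothetical (LP-SS) solution
print(static_policy(0, x_star, gamma=0.1))
```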
Lemma 4.
Let . Then Algorithm achieves a total reward of at least
for every with probability at least .
For the sake of analysis, we consider a hybrid Algorithm in each stage . For Algorithm , the DM makes allocation decisions based on Algorithm in time step , and based on Algorithm in time step . We show that outperforms , which inductively leads to the conclusion that the performance of the online adaptive Algorithm is no worse than that of the offline static Algorithm (see Lemma 4). This induction technique is introduced in [1] for the non-reusable setting. We need more refined techniques to disentangle the time correlation between the resource constraints at each time step, and finally prove the following lemma.
Lemma 5.
Algorithm achieves a total reward of at least
for every with probability at least .
IV Numerical experiments
We consider an assortment planning problem. One unit of resource is associated with a fixed price . Contingent upon the arrival of a customer, say of type , the DM decides the assortment to display, where is a collection of subsets of . Let denote the probability for customer type to choose product in assortment . Therefore, the assortment planning problem (simultaneously maximizing the revenue of each resource) can be incorporated in our model by setting , setting to be the Bernoulli random variable with mean , and setting . In our test, the probability is modeled by the multinomial logit (MNL) choice model. Each resource is associated with a feature vector , and each customer type is associated with a set of feature vectors , where for each . The feature vector could involve the fixed price , and if , and if . In complement, the probability of no purchase is . Notice that the size of the action set scales exponentially with the number of products. Nevertheless, for MNL models, can be computed efficiently by solving a simple LP whose computational time is polynomial in [24].
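The MNL choice probabilities used in the experiments can be computed as below. The feature-based utility specification above is only partially recoverable, so the linear-in-features utilities and the unit attraction value of the no-purchase option are assumptions of this sketch.

```python
import numpy as np

def mnl_choice_probs(assortment, features, beta):
    """MNL choice probabilities (sketch): `features` has one row of product features per
    resource, `beta` is a type-specific taste vector, and the outside option has utility 0
    (attraction value 1)."""
    v = np.exp(features[assortment] @ beta)       # attraction values of the offered products
    denom = 1.0 + v.sum()
    probs = {i: vi / denom for i, vi in zip(assortment, v)}
    probs["no_purchase"] = 1.0 / denom
    return probs

features = np.array([[1.0, 0.2], [0.5, 0.8], [0.3, 0.1]])   # hypothetical product features
beta = np.array([0.7, -0.4])                                # hypothetical taste vector
print(mnl_choice_probs([0, 2], features, beta))
```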
We consider a synthetic data set with types of resources indexed by , and types of customers indexed by . We allow offering any assortment of fewer than 5 products, and hence the assortment set is of size . We let follow a discrete distribution with a support of . For any , we set , and let follow a randomly generated probability distribution with a bounded support of . Theoretically, the regret of both Algorithms and should grow sublinearly in the scale , roughly at the rate of . For each value, we run simulations using a column generation approach and report the average as well as the standard deviation.
T |  | UB | Total Revenue (Offline, Mean) | (Offline, Std) | (Online, Mean) | (Online, Std) | Gap from UB (Offline) | Gap from UB (Online)
---|---|---|---|---|---|---|---|---
1000 | 0.3 | 644.90 | 507.06 | 42.1689 | 331.03 | 2.531798 | 21.37% | 48.67% |
2000 | 0.22 | 1289.80 | 1097.47 | 50.88593 | 887.28 | 7.681146 | 14.91% | 31.21% |
3000 | 0.185 | 1934.70 | 1702.97 | 75.61902 | 1419.54 | 8.537564 | 11.98% | 26.63% |
4000 | 0.162 | 2579.60 | 2327.51 | 86.79659 | 1966.83 | 6.663332 | 9.77% | 23.75% |
5000 | 0.148 | 3224.50 | 2936.00 | 98.77243 | 2522.00 | 10.44462 | 8.95% | 21.79% |
6000 | 0.136 | 3869.40 | 3557.05 | 74.11313 | 3086.28 | 20.91315 | 8.07% | 20.24% |
7000 | 0.127 | 4514.30 | 4200.24 | 67.91298 | 3647.95 | 9.508417 | 6.96% | 19.19% |
8000 | 0.12 | 5159.20 | 4804.85 | 68.66264 | 4217.17 | 14.06983 | 6.87% | 18.26% |
In Table I, the upper bounds in the third column are the optimal values of (LP-SS). The offline Algorithm performs better than the online Algorithm . Algorithm achieves rewards within fraction of the upper bounds.
References
- [1] N. R. Devanur, K. Jain, B. Sivan, and C. A. Wilkens, “Near optimal online algorithms and fast approximation algorithms for resource allocation problems,” Journal of the ACM (JACM), vol. 66, no. 1, pp. 1–41, 2019.
- [2] X. Chen, Z. Owen, C. Pixton, and D. Simchi-Levi, “A statistical learning approach to personalization in revenue management,” Management Science, vol. 68, no. 3, pp. 1923–1937, 2022.
- [3] Y. Han, F. C. Pereira, M. Ben-Akiva, and C. Zegras, “A neural-embedded choice model: Tastenet-mnl modeling taste heterogeneity with flexibility and interpretability,” arXiv preprint arXiv:2002.00922, 2020.
- [4] A. Aouad and A. Désir, “Representing random utility choice models with neural networks,” arXiv preprint arXiv:2207.12877, 2022.
- [5] Y. Chen, R. Levi, and C. Shi, “Revenue management of reusable resources with advanced reservations,” Production and Operations Management, vol. 26, no. 5, pp. 836–859, 2017.
- [6] Y. Lei and S. Jasin, “Real-time dynamic pricing for revenue management with reusable resources, advance reservation, and deterministic service time requirements,” Operations Research, vol. 68, no. 3, pp. 676–685, 2020.
- [7] J. Baek and W. Ma, “Bifurcating constraints to improve approximation ratios for network revenue management with reusable resources,” Operations Research, 2022.
- [8] O. Besbes, A. N. Elmachtoub, and Y. Sun, “Static pricing: Universal guarantees for reusable resources,” Operations Research, 2021.
- [9] P. Rusmevichientong, M. Sumida, and H. Topaloglu, “Dynamic assortment optimization for reusable products with random usage durations,” Management Science, vol. 66, no. 7, pp. 2820–2844, 2020.
- [10] D. Adelman, “Dynamic bid prices in revenue management,” Operations Research, vol. 55, no. 4, pp. 647–661, 2007.
- [11] S. Alaei, M. Hajiaghayi, and V. Liaghat, “Online prophet-inequality matching with applications to ad allocation,” in Proceedings of the 13th ACM Conference on Electronic Commerce, pp. 18–35, 2012.
- [12] S. Agrawal and N. R. Devanur, “Fast algorithms for online stochastic convex programming,” in Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms, pp. 1405–1424, SIAM, 2014.
- [13] S. Agrawal, Z. Wang, and Y. Ye, “A dynamic near-optimal algorithm for online linear programming,” Operations Research, vol. 62, no. 4, pp. 876–890, 2014.
- [14] S. Balseiro, H. Lu, and V. Mirrokni, “Dual mirror descent for online allocation problems,” in International Conference on Machine Learning, pp. 613–628, PMLR, 2020.
- [15] J. Feldman, M. Henzinger, N. Korula, V. S. Mirrokni, and C. Stein, “Online stochastic packing applied to display ad allocation,” in Algorithms – ESA 2010 (M. de Berg and U. Meyer, eds.), (Berlin, Heidelberg), pp. 182–194, Springer Berlin Heidelberg, 2010.
- [16] H. Yu, M. Neely, and X. Wei, “Online convex optimization with stochastic constraints,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [17] J. Yuan and A. Lamperski, “Online convex optimization for cumulative constraints,” Advances in Neural Information Processing Systems, vol. 31, 2018.
- [18] X. Y. Gong, V. Goyal, G. Iyengar, D. Simchi-Levi, R. Udwani, and S. Wang, “Online assortment optimization with reusable resources,” Available at SSRN 3334789, 2019.
- [19] V. Goyal, G. Iyengar, and R. Udwani, “Asymptotically optimal competitive ratio for online allocation of reusable resources,” arXiv preprint arXiv:2002.02430, 2020.
- [20] Y. Feng, R. Niazadeh, and A. Saberi, “Near-optimal bayesian online assortment of reusable resources,” Proceedings of the 23rd ACM Conference on Economics and Computation, 2022.
- [21] R. Levi and A. Radovanović, “Provably near-optimal lp-based policies for revenue management in systems with reusable resources,” Operations Research, vol. 58, no. 2, pp. 503–507, 2010.
- [22] Y. Feng, R. Niazadeh, and A. Saberi, “Linear programming based online policies for real-time assortment of reusable resources,” Available at SSRN 3421227, 2019.
- [23] J. J. M. Bront, I. Méndez-Díaz, and G. Vulcano, “A column generation algorithm for choice-based network revenue management,” Operations research, vol. 57, no. 3, pp. 769–784, 2009.
- [24] J. Davis, G. Gallego, and H. Topaloglu, “Assortment planning under the multinomial logit model with totally unimodular constraint structures,” Work in Progress, 2013.
Appendix
IV-A Proof of Lemma 1
Proof.
Take the expectation of the resource constraints of (LP-C) over , , and for :
The first equation follows from the summation of probabilities. The second equation holds because is independent of , and the decision is made before the realization of and . The third inequality is valid since and only depend on . The fourth inequality follows by separately considering and the customers at other time periods. The fifth inequality stands since are i.i.d. across time, and we can simply denote as . The last inequality holds since the clairvoyant's LP is feasible for all sample paths. Letting , we have a feasible solution for the online DM's expected LP. For the reward constraints, . Since the solution achieves an objective of and is a feasible solution of the DM's expected LP, we have . ∎
IV-B Proof of Lemma 2
Proof.
For each we have . Hence, the capacity constraints of (LP-SS) can be expressed as:
From the above, it is evident that the optimal solution of (LP-SS) is feasible for (LP-E) by letting for all , and therefore . In contrast, the optimal solution of (LP-E) may not be feasible for (LP-SS).
We consider a truncated version (LP-TE) of the DM’s expected LP (LP-E). For each resource constraint:
s.t.
For (LP-TE), denote the optimal objective value as and the optimal solution as for . Since the feasible region of (LP-TE) contains the feasible region of (LP-E) and both LPs have the same objective, we have . Note that may not be feasible for (LP-E). Despite that, we can show that is feasible for (LP-E):
By letting , the objective value is at least . Hence . We formulate a truncated version of the steady-state LP (LP-SS) in a similar fashion:
s.t.
For (LP-TSS), denote the optimal objective value as and the optimal solution as . Similar to (LP-TE), we have and . It is also obvious that , since we can let and get a feasible solution for (LP-TE). Next we show that is not far from .
The dual of the truncated DM’s expected LP is:
s.t.
The dual of truncated steady-state LP is:
s.t.
Obviously, the optimal objective value of (LP-TE-D) is larger than or equal to times the optimal objective value of (LP-TSS-D) . Let denote the optimal solution of (LP-TSS-D). Since these are minimization problems, if for (LP-TE-D) we can let , for all , and and still have a feasible LP, then . Consider the first set of constraints in (LP-TE-D). For , by letting , and , we have:
However, for , the constraints may be violated:
where . Notice that since (LP-TE-D) is a minimization problem, is never strictly larger than . Hence, to make the above expression larger than or equal to , it is enough to let . In all, by letting for , for , for and , we have a feasible solution for (LP-TE-D). Then . Concluding the above analysis, we have:
Putting these together, we have:
∎
IV-C Proof of Lemma 3
Before proceeding to the proof, we introduce the multiplicative Chernoff bounds.
Lemma 6 (Multiplicative Chernoff bounds).
Let , where are independent random variables, and let .
(a) For all ,
Therefore, for all , with probability at least ,
(b) For ,
Hence, for all , with probability at least ,
For ,
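For reference, the standard multiplicative Chernoff bounds for a sum X of independent [0, 1]-valued random variables with mean μ read as follows; the constants in Lemma 6 may be stated slightly differently.

```latex
% X = \sum_t X_t with independent X_t \in [0,1] and \mu = \mathbb{E}[X]:
\Pr\bigl[X \ge (1+\delta)\mu\bigr] \le \exp\!\Bigl(-\tfrac{\delta^{2}\mu}{2+\delta}\Bigr)
  \quad\text{for } \delta \ge 0,
\qquad
\Pr\bigl[X \le (1-\delta)\mu\bigr] \le \exp\!\Bigl(-\tfrac{\delta^{2}\mu}{2}\Bigr)
  \quad\text{for } 0 \le \delta \le 1.
```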
Lemma 3.
Since , showing is equivalent to showing . We first show that the lower bound holds with probability . It’s obvious that the expected instance of has the same optimal solution and objective function as (LP-SS). In , letting , the probability of violating each resource constraint is:
The first and second inequalities stand by the definition of and respectively. For the third inequality, since ’s are independent, and
we apply the multiplicative Chernoff bounds to derive the result. The fourth inequality follows if
This is indeed the case since we assume the usage duration is small compared with . Taking a union bound over all , the probability of violating any resource constraints is at most .
A similar process holds for bounding the probability of violating each reward constraint:
Combining the union bounds for all resource constraints and reward constraints, with probability at least , the stage LP achieves an objective value of at least .
Next we show that the upper bound holds with probability . Note that the dual of (LP-SS) is:
s.t.
The dual of is:
s.t.
Since the domains of the constraints are exactly the same for the two dual LPs, the optimal solution of (LP-SS-D) is feasible for . Suppose the optimal solution to (LP-SS-D) is , and . The optimal objective value is then . Letting , and , since is a minimization problem, we have an upper bound for the optimal objective value , i.e. . If is strictly larger than , we can lower the values of and still get a feasible solution for (LP-SS-D). In this case the objective is not optimal. Therefore, . It can be seen that , since otherwise we can lower the value of and still have a feasible solution. Hence if , then with probability at most we have
The inequality is derived by applying the Chernoff bounds to . We have under the square root since and therefore . By the assumption that and definition of , we have:
Thus with a probability at least ,
If , then the above procedure doesn’t hold. Notice that all we need is with high probability. Hence define . Supposing (which will be proved later), then using the Chernoff bound, we have:
The third inequality follows from . The fourth inequality holds by the definitions of and the fact that for all . The last inequality holds if . This is indeed the case since and we let (which will be demonstrated later in the proof of Lemma 5). Hence the only thing left to prove is that given , we have . This would be a direct result if . Next we show that given , . This completes the whole proof.
The second inequality holds since . The third inequality follows from . The fourth inequality holds since . ∎
IV-D Proof of Lemma 4
Before proving Lemma 4, we first demonstrate the validity of the following lemma.
Lemma 7.
Suppose . Define for . In stage , Algorithm achieves a reward of at least
for every with probability at least .
Lemma 7.
To aid the analysis, we leave customers unserved at the end of each stage, so the next stage starts nearly empty. This causes a total reward loss of . Suppose we run Algorithm over the entire planning horizon. Then at any time of stage , the number of resources occupied by customers from previous stages can be defined as . Since , it is evident that . Then the probability of violating the resource constraint at the -th time of stage is bounded as:
The first inequality follows from Markov's inequality. The second inequality holds since . The third inequality holds since . The fourth inequality stands since for all and , and the function for . The fifth inequality holds because for each time period , are independent from each other. The sixth inequality holds by the fact that the function . The seventh inequality follows from . The eighth inequality stands for all , and the last inequality follows by assuming . Taking a union bound over all and , the probability of satisfying all resource constraints for is at least
For the reward constraints, noticing that , we have:
The first inequality follows from applying the multiplicative Chernoff bounds to . It is evident that the summands are mutually independent and lie within the range , and meanwhile . The last inequality holds by the definition of . By a union bound over the resource and reward constraints, we have the result of Lemma 7. ∎
Lemma 4.
Suppose holds. Using as an approximation of , over the stages, we achieve a total reward of
The first inequality holds since . The third inequality holds by the definitions of and , and the fact that . The fourth and fifth inequalities stem from and respectively.
Notice that with probability at most , fails. Therefore, Algorithm gives us a reward of at least with probability at least
∎
IV-E Proof of Lemma 5
Proof.
To aid the analysis, we leave customers unserved at the end of each stage, so the next stage starts nearly empty. This causes a total reward loss of at most . Define the number of resources occupied by customers from previous stages as at time of stage . It is evident that . Then at any time step , by Markov's inequality, the probability of violating the resource constraint is:
(33)
While it is desirable to upper bound the probabilities of violating any resource constraints without the term , the values of are not known in the online setting. Nevertheless, we use the fact that for the “steady-state” optimal solution, and upper bound as follows:
(34)
Plugging (34) into (33), at time step , we aim to minimize the following term which, by the union bound over and , upper bounds the probability of violating at least one future resource constraint in stage :
(35)
Similarly, since , by the Markov inequality, we upper bound the reward constraint violation probabilities in stage as:
(36)
Define the sum of terms (35) and (36) as . In time step of stage , the DM aims to take the action that minimizes , which has exactly the form of (7). We further show that in fact implies that as follows, and inductively we have .
Suppose we run Algorithm in the first time periods of stage ; then at time period , Algorithm obviously minimizes . Hence, if we replace the action of Algorithm at time by the choice of Algorithm , the probability of constraint violation is no smaller:
By induction, we have . Since Algorithm has no worse performance than Algorithm in each stage, following Lemma 4, the total probability of constraint violation over the entire planning horizon
Therefore, Algorithm achieves a reward of at least for every with probability at least . ∎
IV-F More on the numerical experiments
It is worth mentioning that for the assortment planning application, the size of the action set scales exponentially with the number of products, and therefore equation (1) is computationally intractable under a general choice model. However, we do not need to enumerate all possible assortments in certain cases, say under MNL models. For simultaneously maximizing the revenue of each resource, we set , and . In this case, equation (1) can be written as:
(4)
Under an MNL model, if , and if . It can be shown that for a type customer, (4) can be solved efficiently via the following LP ([24]):
s.t.
where the decision variables are .
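To illustrate why the per-arrival optimization remains tractable under an MNL model, the sketch below solves a cardinality-constrained instance through a choice-probability LP reformulation in the spirit of [24]; the exact LP in the paper may differ, and the per-product scores stand in for whatever weighted combination of rewards and penalties appears in (4).

```python
import numpy as np
from scipy.optimize import linprog

v = np.array([1.2, 0.8, 0.5, 0.3])       # MNL attraction values (outside option has value 1)
score = np.array([4.0, 3.0, 5.0, 1.0])   # hypothetical per-product coefficients from (4)
K = 2                                    # at most K products may be offered
n = len(v)

# Variables: w_1..w_n (choice probabilities of the products) and w_0 (no-purchase probability).
c = np.concatenate([-score, [0.0]])      # maximize sum_i score_i * w_i

A_eq = np.ones((1, n + 1)); b_eq = [1.0]                 # probabilities sum to one
A_ub, b_ub = [], []
for i in range(n):                                       # w_i / v_i <= w_0 (offered or not)
    row = np.zeros(n + 1); row[i] = 1.0 / v[i]; row[-1] = -1.0
    A_ub.append(row); b_ub.append(0.0)
row = np.zeros(n + 1); row[:n] = 1.0 / v; row[-1] = -K   # at most K products offered
A_ub.append(row); b_ub.append(0.0)

res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (n + 1))
assortment = [i for i in range(n) if res.x[i] > 1e-9]
print("offered assortment:", assortment, "objective value:", -res.fun)
```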
In our numerical tests, when the number of customer types is large, the computation time is large despite the use of the column generation method. We can use sample average approximation (SAA) to reduce the number of customer types. Specifically, we sample 100 times from the 1000 customer types and approximate the value of on the 100 samples. This technique is adopted in Algorithm 3, where we have shown that with large enough sample sizes, the SAA works well. Hence, essentially we are performing SAA on SAA when the size of is large. The following table shows the results and computational time comparison between the original algorithm and the algorithm with SAA.
T |  | UB | Total Revenue (Original) | Total Revenue (SAA) | Gap from UB (Original) | Gap from UB (SAA) | Computational Time (Original) | Computational Time (SAA)
---|---|---|---|---|---|---|---|---
1000 | (300, 0.3) | 644.90 | 331.03 | 305.90 | 48.67% | 52.57% | 134.18 | 55.19 |
2000 | (600, 0.22) | 1289.80 | 887.28 | 838.60 | 31.21% | 34.98% | 260.01 | 103.77 |
3000 | (900, 0.185) | 1934.70 | 1419.54 | 1357.70 | 26.63% | 29.82% | 390.10 | 143.35 |
4000 | (1200, 0.162) | 2579.60 | 1966.83 | 1861.60 | 23.75% | 27.83% | 489.99 | 188.43 |
5000 | (1500, 0.148) | 3224.50 | 2522.00 | 2398.70 | 21.79% | 25.61% | 576.65 | 246.04 |
6000 | (1800, 0.136) | 3869.40 | 3086.28 | 2934.40 | 20.24% | 24.16% | 641.84 | 297.80 |
7000 | (2100, 0.127) | 4514.30 | 3647.95 | 3477.90 | 19.19% | 22.96% | 732.04 | 352.72 |
8000 | (2400, 0.12) | 5159.20 | 4217.17 | 4014.40 | 18.26% | 22.19% | 1114.00 | 687.89 |
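For completeness, the type-subsampling step used to produce the SAA columns above can be sketched as follows; the re-weighting by empirical frequencies is an assumption of this sketch, while the sample size of 100 follows the text.

```python
import numpy as np

rng = np.random.default_rng(3)

def subsample_types(p, sample_size=100):
    """Draw customer types i.i.d. from p and return the empirical distribution over the
    (at most sample_size) distinct sampled types, used in place of the full distribution."""
    draws = rng.choice(len(p), size=sample_size, p=p)
    types, counts = np.unique(draws, return_counts=True)
    return types, counts / sample_size

p = rng.dirichlet(np.ones(1000))     # 1000 customer types, as in the experiments
types, weights = subsample_types(p)
print(len(types), "distinct types retained out of", len(p))
```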