A Tractable Online Learning Algorithm for the Multinomial Logit Contextual Bandit
Abstract
In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic set optimization problem, where a decision-maker offers a subset (assortment) of products to a consumer and observes the response in every round. Consumers purchase products to maximize their utility. We assume that a set of attributes describes the products, and the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior using the widely used Multinomial Logit (MNL) model and consider the decision-maker's problem of dynamically learning the model parameters while optimizing cumulative revenue over the selling horizon $T$. Though this problem has recently attracted considerable attention, many existing methods involve solving an intractable non-convex optimization problem, and their theoretical performance guarantees depend on a problem-dependent parameter that can be prohibitively large. In particular, the regret of existing algorithms for this problem scales multiplicatively with a problem-dependent constant $\kappa$, which may have an exponential dependency on the number of attributes, $d$. In this paper, we propose an optimistic algorithm and show that $\kappa$ affects only the lower-order terms of its regret, with a leading term free of $\kappa$, significantly improving the performance over existing methods. Further, we propose a convex relaxation of the optimization step, which allows for tractable decision-making while retaining the favorable regret guarantee. We also demonstrate through numerical experiments that our algorithm has robust performance for varying values of $\kappa$.
keywords:
Revenue management, OR in marketing, Multi-armed bandit, Multinomial Logit model, Sequential decision-making

1 Introduction
Assortment optimization problems arise in many industries, and prominent examples include retailing and online advertising (see Alfandari et al. [2021], Timonina-Farkas et al. [2020], Wang et al. [2020]; Kök & Fisher [2007] provide a detailed review). The problem faced by a decision-maker is that of selecting a subset (assortment) of items to offer from a universe of substitutable items (an item is termed substitutable if all consumers have identical preferences towards the same characteristics of that item) such that the expected revenue is maximized. In many e-commerce applications, the data on consumer choices tends to be either limited or non-existent (similar to the cold-start problem in recommendation systems). Consumer preferences must be learned by experimenting with various assortments and observing consumer choices, but this experimentation must be balanced against the goal of maximizing cumulative revenue. Furthermore, in many settings, the retailer has to consider a very large number of similar products (examples range from apparel to consumer electronics). The commonality in their features can be expressed with the aid of auxiliary variables that summarize product attributes. This enables a significant reduction in dimensionality but introduces additional challenges in designing policies that must dynamically balance demand learning (exploration) while simultaneously maximizing cumulative revenues (exploitation).
Motivated by these issues, we consider the dynamic assortment optimization problem. In every round, the retailer offers a subset (assortment) of products to a consumer and observes the consumer's response. Consumers purchase (at most one product from each assortment) products that maximize their utility, and the retailer earns revenue from a successful purchase. We assume that the products are described by a set of attributes and that the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior using the widely used Multinomial Logit (MNL) model and consider the retailer's problem of dynamically learning the model parameters while optimizing cumulative revenues over the selling horizon $T$. Specifically, we have a universe of $N$ substitutable items, and each item $i$ is associated with an attribute vector $x_i$, which is known a priori. The mean utility of product $i$ for the consumer is given by the inner product $x_i^{\top}\theta^*$, where $\theta^*$ is a fixed but initially unknown parameter vector. Each of the $d$ coordinates of $x_i$ represents a characteristic of product $i$, such as cost, popularity, or brand. Under the substitutable-goods assumption, the preferences of all consumers towards these characteristics are identical and captured by the same parameter $\theta^*$ (this assumption may appear quite restrictive at first, but, as described in the following paragraph and Section 2.2, the model is rich enough to capture non-identical consumer behavior as well). Further, any two products $i$ and $j$ can vary in terms of these characteristics and hence are associated with different vectors $x_i$ and $x_j$, respectively. Our goal is to offer assortments $S_1, \dots, S_T$ from a feasible collection of assortments such that the cumulative expected revenue of the retailer over the horizon is maximized. In general, the feasible set of assortments can reflect the constraints of retailers and online platforms (such as cardinality, inventory availability, and other related constraints).
For an intuitive understanding of the choice model, consider an online furniture retailer that offers $N$ distinct products, where product $i$ has an attribute vector $x_i$ (in general, this attribute vector can vary over time, representing varying consumers' choices, and is more appropriately written $x_{ti}$). Suppose consumers query for a specific product category, say tables. In this example, the parameter $\theta^*$ will be a distinct vector corresponding to the product category: table. As discussed before, the true $\theta^*$ that determines consumer choice behavior is unknown. With each interaction with the consumer, the online retailer learns which of the $N$ products offers the most utility (captured by $x_i^{\top}\theta^*$ for each product $i$) by observing the past purchase decisions of consumers. The online furniture retailer is constrained to offer at most $K$ products in each interaction with the consumer. Such a constraint may be encountered in practical situations: the online consumer interface may only be able to display a limited number of products, or the consumer may prefer to examine only a subset of products at a time. Out of the $N$ furniture items, some particular table could have high utility $x_i^{\top}\theta^*$, whereas for some other product (say, a table with an unpopular color, bad design, or inferior material) the utility could be low. The consumer may purchase one or none of the presented products. Later, in Section 2.2, we demonstrate that when the consumer's propensity to purchase a specific product is driven by its utility, the retailer's expected revenue at each round is given by a softmax function.
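To make the choice model concrete, here is a minimal sketch (Python/NumPy); the helper names mnl_choice_probabilities and expected_revenue, the toy attribute values, and the taste vector are all illustrative, not taken from the paper. It computes the MNL choice probabilities and the resulting expected revenue for one offered assortment, matching the softmax structure described above.

```python
import numpy as np

def mnl_choice_probabilities(theta, X):
    """Choice probabilities under the MNL model for one assortment.

    theta : (d,) taste parameter
    X     : (K, d) attribute vectors of the offered items (one row per item)
    Returns an array of length K + 1: P(item 1), ..., P(item K), P(no purchase).
    """
    weights = np.exp(X @ theta)                 # exp of mean utilities x_i^T theta
    denom = 1.0 + weights.sum()                 # the "1" is the no-purchase option
    return np.append(weights / denom, 1.0 / denom)

def expected_revenue(theta, X, prices=None):
    """Expected revenue of offering assortment X (unit prices by default)."""
    probs = mnl_choice_probabilities(theta, X)[:-1]
    prices = np.ones(len(X)) if prices is None else np.asarray(prices)
    return float(prices @ probs)

# Toy instance: three tables described by two attributes, hypothetical taste vector.
theta_star = np.array([0.8, -0.3])
X = np.array([[1.0, 0.2], [0.4, 0.9], [0.1, 0.1]])
print(mnl_choice_probabilities(theta_star, X), expected_revenue(theta_star, X))
```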
The rest of this section is organized as follows: we first describe the related literature and the qualitative significance of the parameter $\kappa$; we then highlight our contributions and end the section by contrasting them with recent notable research works.
1.1 Related literature
The MNL model is a widely used choice model for capturing consumer purchase behavior in assortment selection models (see Flores et al. [2019] and Avadhanula [2019]). Recently, large-scale field experiments at Alibaba [Feldman et al., 2018] have demonstrated the efficacy of the MNL model in boosting revenues. Rusmevichientong et al. [2010] and Sauré & Zeevi [2013] were among the early works that studied explore-then-commit strategies for the dynamic assortment selection problem under the MNL model when there are no contexts/product features. The works of Agrawal et al. [2019] and Agrawal et al. [2017] revisited this problem and presented adaptive online learning algorithms based on Upper Confidence Bound (UCB) and Thompson Sampling (TS) ideas. These approaches, unlike earlier ones, do not require prior information about the problem parameters and have near-optimal regret bounds. Following these developments, the contextual variant of the problem has received considerable attention. Cheung & Simchi-Levi [2017] and Oh & Iyengar [2019] propose TS-based approaches and establish Bayesian regret bounds on their performance (our results give worst-case regret bounds, which are strictly stronger: worst-case regret bounds directly imply Bayesian regret bounds with the same order dependence). Chen et al. [2020] present a UCB-based algorithm and establish min-max regret bounds. However, these contextual MNL algorithms and their performance bounds depend on a problem parameter $\kappa$ that can be prohibitively large, even for simple real-life examples. See Figure 1 for an illustration and Section 1.2 for a detailed discussion.

We note that Ou et al. [2018] also consider a similar problem of developing an online algorithm for the MNL model with linear utility parameters. Though they establish a regret bound that does not depend on the aforementioned parameter $\kappa$, they work with an inaccurate version of the MNL model: in the MNL model, the probability of a consumer preferring an item is proportional to the exponential of its utility, not linear in the utility as assumed in Ou et al. [2018].
The multi-armed bandit problem, which underlies these dynamic decision-making settings, has been well studied in the literature (see Xu et al. [2021], Grant & Szechtman [2021]). Our problem is closely related to the parametric bandit problem, where a common unknown parameter connects the rewards of the arms. In particular, for linear bandits, each arm is associated with a $d$-dimensional vector known a priori, and the expected reward upon selecting an arm is given by the inner product of its vector with an unknown parameter vector (see Dani et al. [2008], Rusmevichientong & Tsitsiklis [2010], Abbasi-Yadkori et al. [2011]). The key difference is that the rewards (i.e., the revenue of the retailer) corresponding to an assortment under the MNL model cannot be modeled in the framework of linear payoffs. Closer to our formulation is the literature on generalized linear bandits (see Filippi et al. [2010] and Faury et al. [2020]), where the expected payoff upon selecting an arm is a fixed non-linear (link) function of that inner product. However, unlike our setting, where an action can be a collection of up to $K$ products (thus involving up to $K$ $d$-dimensional vectors), the link function in these prior works is a function of a single scalar.
1.2 On the parameter $\kappa$
As discussed earlier, the retailer’s revenue (reward function) is the softmax function. Intuitively, the curvature of the reward function influences how easy (or difficult) it is to learn the true choice parameter $\theta^*$. In Section 2, we explicitly define $\kappa$ as inversely proportional to a lower bound on the derivative of the reward function over the entire decision region. The existence of a global lower bound on the curvature of the reward function is a necessary assumption for the maximum likelihood estimation of $\theta^*$.
In previous works on generalized linear bandits and variants [Filippi et al., 2010, Li et al., 2017, Oh & Iyengar, 2019], the quantity $\kappa$ features in the regret guarantees as a multiplicative factor of the primary $\sqrt{T}$ term. This is because these works ignore the local effect of the curvature and use global properties (via $\kappa$), leading to loose worst-case bounds. For a cleaner exposition of this issue, let us take the case $K = 1$, i.e., the rewards are given by a sigmoid function of the utility. The derivative of the sigmoid is “bell”-shaped (see Figure 1). When the utility is very high (i.e., the assortment contains products with high utilities) or very low (i.e., the assortment contains products with low utilities), the derivative is small and, by Assumption 2, the value of $\kappa$ is correspondingly large. In the context of our problem, this translates to an exponential dependence of the per-round regret on the magnitude of the utilities.
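For a numerical illustration of this blow-up, the following sketch assumes the $K = 1$ sigmoid reward discussed above and evaluates the inverse of the smallest slope over the decision region (a $\kappa$-like quantity) as the maximum utility magnitude grows; the function name and the sampled bounds are illustrative.

```python
import numpy as np

def sigmoid_derivative(z):
    p = 1.0 / (1.0 + np.exp(-z))
    return p * (1.0 - p)

# kappa is (up to constants) the inverse of the smallest slope of the reward
# function over the decision region; for the sigmoid this slope is smallest
# at the largest attainable |x^T theta|.
for utility_bound in [1, 2, 5, 10]:
    kappa = 1.0 / sigmoid_derivative(utility_bound)
    print(f"max |x^T theta| = {utility_bound:2d}  ->  kappa ~ {kappa:.1f}")
# kappa grows roughly like exp(|x^T theta|): algorithms whose leading regret term
# is multiplied by kappa degrade quickly as utilities grow in magnitude.
```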
1.3 Contributions
In this paper, we build on recent developments for generalized linear bandits (Faury et al. [2020]) to propose a new optimistic algorithm, CB-MNL, for the problem of contextual multinomial logit bandits. CB-MNL follows the standard template of optimistic parameter search strategies (also known as optimism-in-the-face-of-uncertainty approaches) [Abbasi-Yadkori et al., 2011, Abeille et al., 2021]. We use a Bernstein-style concentration inequality for self-normalized martingales, previously proposed in the context of scalar logistic bandits by Faury et al. [2020], to define our confidence set over the true parameter, taking into account the effects of the local curvature of the reward function. We show that the regret of CB-MNL has a leading term that is free of the problem-dependent parameter $\kappa$, significantly improving the theoretical performance over existing algorithms, in which $\kappa$ appears as a multiplicative factor in the leading term. We also leverage a self-concordance-like relation [Bach, 2010] for the multinomial logit reward function [Zhang & Lin, 2015], which helps us limit the effect of $\kappa$ on the final regret upper bound to only the higher-order terms. Finally, we propose a different, convex confidence set for the optimization problem in the decision step of CB-MNL, which reduces that optimization problem to a constrained convex problem.
In summary, our work establishes strong worst-case regret guarantees by carefully accounting for local gradient information and using second-order function approximation for the estimation error.
1.4 Comparison with notable prior works
Comparison with Filippi et al. [2010] Our setting is different from the standard generalized linear bandit of Filippi et al. [2010]. In our setting, the reward of an action (assortment) can depend on up to $K$ variables (one per offered item) instead of a single variable. Further, we focus on removing the multiplicative dependence on $\kappa$ from the regret bounds. This leads to a more involved technical treatment in our work.
Comparison with Oh & Iyengar [2019] Their Thompson Sampling based approach is inherently different from our optimism-in-the-face-of-uncertainty (OFU) style algorithm CB-MNL. However, the main result in Oh & Iyengar [2019] also relies on a confidence-set-based analysis along the lines of Filippi et al. [2010] and has a multiplicative $\kappa$ factor in the bound.
Comparison with Faury et al. [2020] Faury et al. [2020] use a bonus term for optimization in each round, and their algorithm performs non-trivial projections on the admissible log-odds. While we do reuse the Bernstein-style concentration inequality proposed by them, their results do not extend directly to the MNL setting without significantly more work. Further, our algorithm CB-MNL performs an optimistic parameter search for making decisions instead of using a bonus term, which allows for a cleaner and shorter analysis.
Comparison with Oh & Iyengar [2021] While the authors of Oh & Iyengar [2021] provide bounds that are sharper by a dimension-dependent factor, they still retain the multiplicative $\kappa$ factor in their regret bounds. Their focus is on improving the dependence on the dimension parameter $d$ for the dynamic assortment optimization problem.
Comparison with Abeille et al. [2021] Abeille et al. [2021] recently proposed the idea of convex relaxation of the confidence set for the more straightforward logistic bandit setting. Our work can be viewed as an extension of their construction to the MNL setting.
Comparison with Amani & Thrampoulidis [2021] While the authors of Amani & Thrampoulidis [2021] also extend the algorithms of Faury et al. [2020] to a multinomial problem, their setting is materially different from ours. They model various click-types for the same advertisement (action) via the multinomial distribution. Further, they consider the actions played at each round to be non-combinatorial, i.e., a single action as opposed to a bundle of actions, which differs from the assortment optimization setting in this work. Therefore, their approach and technical analysis are different from ours.
2 Preliminaries
2.1 Notations
For a vector $x$, $x^{\top}$ denotes its transpose. Given a positive definite matrix $M$, the induced norm of a vector $x$ is $\|x\|_M = \sqrt{x^{\top} M x}$. For two symmetric matrices $A$ and $B$, $A \succeq B$ means that $A - B$ is positive semi-definite. For any positive integer $n$, $[n]$ denotes the set $\{1, \dots, n\}$. $\mathbf{I}_d$ denotes an identity matrix of dimension $d$. The platform (i.e., the learner) is referred to using the pronouns she/her/hers.
2.2 Model setting
Rewards Model:
At every round $t$, the platform (learner) is presented with a set of $N$ distinct items, indexed by $i \in [N]$, and their attribute vectors (contexts) $x_{ti} \in \mathbb{R}^d$, where $N$ is the cardinality of the item set. The platform then selects an assortment $S_t \subseteq [N]$, and the interacting consumer (environment) provides the reward to the platform. The assortments have a cardinality of at most $K$, i.e., $|S_t| \le K$. The platform’s decision is based on the entire history of interaction, represented by the filtration set $\mathcal{F}_t$ (the $\sigma$-algebra generated by the interaction sequence up to round $t$, together with any prior information available to the platform). The interaction lasts for $T$ rounds. Conditioned on the offered assortment, the reward is a binary vector with at most one nonzero entry, and it follows a multinomial distribution with a single trial. Specifically, the probability that the consumer purchases item $i \in S_t$ is given by the softmax function:
$$p_t(i \mid S_t, \theta^*) \;=\; \frac{\exp(x_{ti}^{\top}\theta^*)}{1 + \sum_{j \in S_t} \exp(x_{tj}^{\top}\theta^*)}, \qquad (1)$$
where $\theta^*$ is an unknown time-invariant parameter. The numeral $1$ in the denominator accounts for the case when the consumer purchases none of the items in the assortment. By definition, the entries of the reward vector sum to at most one, i.e., the reward is multinomial with a single trial. Each item is also associated with a price (revenue) parameter for each round; we assume it equals $1$ for all items and rounds for an uncluttered exposition of the results (if it is not $1$, it features as a fixed factor in the definition of the expected revenue, and the analysis follows exactly as presented here for all rounds and items). The expected revenue due to the assortment $S_t$ is then given by:
$$R_t(S_t, \theta^*) \;=\; \sum_{i \in S_t} p_t(i \mid S_t, \theta^*) \;=\; \sum_{i \in S_t} \frac{\exp(x_{ti}^{\top}\theta^*)}{1 + \sum_{j \in S_t} \exp(x_{tj}^{\top}\theta^*)}. \qquad (2)$$
The attribute vectors $x_{ti}$ may also vary adversarially across rounds in our model, unlike in Li et al. [2017], where the attribute vectors are assumed to be drawn from an unknown i.i.d. distribution. When $K = 1$, the above model reduces to the case of the logistic bandit.
Choice Modeling Perspective:
Eq (1) can be considered from a discrete choice modeling viewpoint, where the platform presents an assortment of items to a user, and the user selects at most one item from this assortment. In this interpretation, the probability of choosing item $i \in S_t$ is given by $p_t(i \mid S_t, \theta^*)$. Likewise, the probability of the user not selecting any item is given by $1/\big(1 + \sum_{j \in S_t}\exp(x_{tj}^{\top}\theta^*)\big)$. The platform is motivated to offer an assortment for which the user’s propensity to make a successful selection is high.
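A minimal simulation of a single consumer interaction under this interpretation, assuming the choice probabilities of Eq (1); the helper name simulate_choice and the toy numbers are illustrative.

```python
import numpy as np

def simulate_choice(theta, X, rng):
    """One consumer interaction under the MNL model of Eq (1):
    returns the index of the purchased item, or len(X) for 'no purchase'."""
    weights = np.exp(X @ theta)
    probs = np.append(weights, 1.0) / (1.0 + weights.sum())
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
theta_star = np.array([0.8, -0.3])                  # hypothetical taste vector
X = np.array([[1.0, 0.2], [0.4, 0.9], [0.1, 0.1]])  # offered assortment (rows = items)
print(simulate_choice(theta_star, X, rng))
```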
Regret:
The platform does not know the value of $\theta^*$. Our learning algorithm CB-MNL (see Algorithm 1) makes the assortment selection decisions sequentially so that the cumulative expected revenue is high. Its performance is quantified by pseudo-regret, which is the gap between the expected revenue generated by the algorithm and that of the optimal assortments in hindsight. The learning goal is to minimize the cumulative pseudo-regret up to time $T$, defined as:
$$\mathbf{R}_T \;=\; \sum_{t=1}^{T}\Big[R_t(S_t^{*}, \theta^*) - R_t(S_t, \theta^*)\Big], \qquad (3)$$
where $S_t^{*}$ is the offline optimal assortment at round $t$ under full information about $\theta^*$, defined as $S_t^{*} = \operatorname{argmax}_{S} R_t(S, \theta^*)$, with the maximization taken over the feasible collection of assortments.
As in the case of contextual linear bandits [Abbasi-Yadkori et al., 2011, Chu et al., 2011], the emphasis here is on making good sequential decisions while tracking the true parameter $\theta^*$ with a close estimate (see Section 2.4). Our algorithm (like others) does not necessarily improve the estimate at each round. However, it ensures that $\theta^*$ always lies within a confidence set around the estimate (with high probability), and the subsequent analysis demonstrates that the aggregate prediction error over all rounds is bounded.
Our model is fairly general, as the contextual information $x_{ti}$ may encode combined information about item $i$ and the user at round $t$. For instance, if the user at round $t$ is described by a feature vector and item $i$ by an attribute vector, then $x_{ti}$ can be taken as the vectorized outer product of the two. We assume that the platform knows the interaction horizon $T$.
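For illustration, a sketch of the vectorized outer-product construction mentioned above (the feature values and dimensions are hypothetical):

```python
import numpy as np

user = np.array([1.0, 0.5, -0.2])      # user features at round t (hypothetical)
item = np.array([0.3, 0.8])            # attributes of item i (hypothetical)

# Combined context: vectorized outer product, dimension = len(user) * len(item).
x_ti = np.outer(user, item).ravel()
print(x_ti.shape, x_ti)
```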
Additional notation: $X_t$ denotes a design matrix whose columns are the attribute vectors of the items in the assortment $S_t$. We also write the choice probabilities as functions of $X_t$ whenever we wish to make this dependence explicit.
2.3 Assumptions
Following Filippi et al. [2010], Li et al. [2017], Oh & Iyengar [2019], Faury et al. [2020], we introduce the following assumptions on the problem structure.
Assumption 1 (Bounded parameters).
The true parameter $\theta^*$ lies in a compact subset of $\mathbb{R}^d$, and a bound on its norm is known to the learner. Further, the attribute vectors are bounded in norm for all items and all rounds.
This assumption simplifies analysis and removes scaling constants from the equations.
Assumption 2.
There exists $\kappa > 0$ such that, for every item $i \in S_t$, for any assortment $S_t$ and all rounds $t$, the derivative of the softmax function along the $i$th direction is bounded below by $1/\kappa$ over the decision region.
This assumption is necessary from likelihood theory [Lehmann & Casella, 2006], as it ensures that the Fisher information matrix for the estimation of $\theta^*$ is invertible for all possible input instances. We refer to Oh & Iyengar [2019] for a detailed discussion in this regard. We also use upper bounds on the first and second derivatives of the softmax function along any component; such bounds exist for all problem instances [Gao & Pavel, 2017].
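As an illustration of Assumption 2, the following sketch estimates, by Monte Carlo sampling, the smallest per-item derivative of the softmax over randomly drawn parameters and contexts, i.e., a rough proxy for $1/\kappa$. The sampling ranges and the closed form $p_i(1-p_i)$ used for the per-item derivative are assumptions of this sketch, not statements from the paper.

```python
import numpy as np

rng = np.random.default_rng(7)

def estimate_inverse_kappa(d=3, K=3, num_samples=2000):
    """Monte-Carlo proxy for the lower bound in Assumption 2: the smallest
    per-item softmax derivative p_i * (1 - p_i) over sampled instances."""
    smallest = np.inf
    for _ in range(num_samples):
        theta = rng.uniform(-1, 1, size=d)
        X = rng.normal(size=(K, d))
        X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
        p = np.exp(X @ theta)
        p /= 1.0 + p.sum()
        smallest = min(smallest, float(np.min(p * (1.0 - p))))
    return smallest

print("estimated 1/kappa:", estimate_inverse_kappa())
```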
2.4 Maximum likelihood estimate
CB-MNL, described in Algorithm 1, uses a regularized maximum likelihood estimator to compute an estimate of $\theta^*$. Since the reward vector follows a multinomial distribution, the regularized log-likelihood (negative cross-entropy loss) function up to round $t$, under a parameter $\theta$, can be written as:
$$\mathcal{L}_t^{\lambda}(\theta) \;=\; \sum_{s=1}^{t-1}\;\sum_{i \in S_s \cup \{0\}} r_{si}\,\log p_s(i \mid S_s, \theta) \;-\; \frac{\lambda}{2}\,\|\theta\|_2^2, \qquad (4)$$
where $i = 0$ denotes the no-purchase outcome.
The log-likelihood $\mathcal{L}_t^{\lambda}$ is concave in $\theta$ for $\lambda > 0$, and the maximum likelihood estimator is obtained at its critical point. Setting the gradient with respect to $\theta$ to zero, we get $\hat\theta_t$ as the solution of:
$$\sum_{s=1}^{t-1}\sum_{i \in S_s}\big(r_{si} - p_s(i \mid S_s, \theta)\big)\,x_{si} \;=\; \lambda\,\theta. \qquad (5)$$
For future analysis we also define
$$g_t(\theta) \;:=\; \sum_{s=1}^{t-1}\sum_{i \in S_s} p_s(i \mid S_s, \theta)\,x_{si} \;+\; \lambda\,\theta. \qquad (6)$$
At the start of the interaction, when no contexts have been observed yet, the estimate is still well-defined by Eq (5) as long as $\lambda > 0$. Therefore, the regularization parameter makes CB-MNL free of a burn-in period, in contrast to some previous works, e.g., Filippi et al. [2010].
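A sketch of the regularized maximum-likelihood step: it minimizes the negative of the regularized log-likelihood in Eq (4) with an off-the-shelf solver. The data layout and the use of scipy are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def fit_theta(history, d, lam):
    """Regularized MLE sketch for the MNL model.

    history : list of (X_s, y_s) pairs, where X_s is a (K_s, d) array of offered
              attribute vectors and y_s is the index of the chosen item in
              0..K_s-1, or K_s if nothing was purchased.
    Returns the maximizer of the regularized log-likelihood of Eq (4)."""

    def neg_log_likelihood(theta):
        total = 0.5 * lam * theta @ theta            # (lambda/2) * ||theta||^2 penalty
        for X, y in history:
            utilities = X @ theta
            # log(1 + sum_j exp(u_j)), computed stably
            log_denom = np.logaddexp(0.0, np.logaddexp.reduce(utilities))
            chosen_utility = 0.0 if y == len(X) else utilities[y]
            total -= chosen_utility - log_denom      # minus the observed log-probability
        return total

    return minimize(neg_log_likelihood, x0=np.zeros(d), method="BFGS").x
```

Since the regularized negative log-likelihood is strictly convex for $\lambda > 0$, the estimate is well-defined even before any data is observed, which is consistent with the burn-in-free property noted above.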
2.5 Confidence sets
Algorithm 1 follows the template of optimism in the face of uncertainty (OFU) strategies [Auer et al., 2002, Filippi et al., 2010, Faury et al., 2020]. The technical analysis of OFU algorithms relies on two key factors: the design of the confidence set and the ease of choosing an action using the confidence set.
In Section 4, we derive the confidence set defined below as a set that contains the true parameter $\theta^*$ with probability at least $1-\delta$ (the randomness is over user choices); it is used by CB-MNL in Algorithm 1 for making decisions at each round (see Eq (12)):
(7)
where , and
(8)
A confidence set similar to the one in Eq (7) was recently proposed in Abeille et al. [2021] for the simpler logistic bandit setting; here, we extend its construction to the MNL setting. The set in Eq (7) is convex since the log-loss function is convex. This makes the decision step in Eq (12) a constrained convex optimization problem. However, it is difficult to prove regret bounds directly with this set. Therefore we leverage a result of Faury et al. [2020], in which the authors proposed a new Bernstein-like tail inequality for self-normalized vectorial martingales (see Appendix A.1), to derive another confidence set on $\theta^*$:
(9)
where
(10)
Here the weighting in Eq (10) involves the partial derivative of the softmax in the direction of each component of the assortment, and the confidence width is defined in Eq (8); its value is an outcome of the concentration result of Faury et al. [2020]. As a consequence of this concentration, $\theta^*$ belongs to the set of Eq (9) with probability at least $1-\delta$ (the randomness is over user choices). The Bernstein-like concentration inequality used here is similar to Theorem 1 of Abbasi-Yadkori et al. [2011], with the difference that we take local variance information (and hence local curvature information of the reward function) into account. The above discussion is formalized in Appendix A.1.
The set of Eq (9) is non-convex, which follows from the non-linearity of the softmax. We use it directly to prove the regret guarantees. In Section 4.3, we show how the convex set of Eq (7) is related to it and shares many of its useful properties. Until then, to maintain the ease of the technical flow and to compare with the previous work of Faury et al. [2020], we assume that the algorithm uses the set of Eq (9) as its confidence set. We highlight that Algorithm CB-MNL is identical under either confidence set, except for the calculation in Eq (12). For later sections we also define the following norm-inducing design matrix based on all the contexts observed till time $t$:
(11)
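For intuition about the norm-inducing design matrix, here is a sketch that accumulates the observed contexts into a regularized Gram matrix and evaluates the induced norm. The unweighted form $\lambda I + \sum x x^{\top}$ used below is an assumption for illustration and may differ from the exact weighting used in Eqs (10)-(11).

```python
import numpy as np

def design_matrix(contexts, lam, d):
    """Accumulate observed contexts into a regularized Gram matrix lam*I + sum x x^T."""
    V = lam * np.eye(d)
    for x in contexts:
        V += np.outer(x, x)
    return V

def induced_norm(x, M):
    """||x||_M = sqrt(x^T M x) for a positive definite M."""
    return float(np.sqrt(x @ M @ x))

d, lam = 2, 1.0
observed = [np.array([1.0, 0.2]), np.array([0.4, 0.9])]   # contexts seen so far
V = design_matrix(observed, lam, d)
x_new = np.array([0.1, 0.1])
# A small ||x_new||_{V^{-1}} indicates that the direction of x_new is already well explored.
print(induced_norm(x_new, np.linalg.inv(V)))
```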
3 Algorithm
At each round $t$, the attribute parameters (contexts) are made available to the algorithm (online platform) CB-MNL. The algorithm computes the estimate $\hat\theta_t$ of the true parameter according to Eq (5) and keeps track of the confidence set defined in Eq (9) (or Eq (7)). Let the set $\mathcal{S}$ contain all feasible assortments with cardinality up to $K$. The algorithm makes the following decision:
$$S_t \;=\; \operatorname*{argmax}_{S \in \mathcal{S}}\;\max_{\theta}\; R_t(S, \theta) \quad \text{subject to } \theta \text{ lying in the confidence set of Eq (9) (or Eq (7))}. \qquad (12)$$
In each round $t$, the reward of the online platform is denoted by the vector $r_t$. The prediction error of a parameter $\theta$ at the chosen assortment, defined as:
$$\Delta^{\text{pred}}(S_t, \theta) \;:=\; R_t(S_t, \theta) - R_t(S_t, \theta^*), \qquad (13)$$
represents the difference in perceived rewards due to the inaccuracy in the estimation of the parameter $\theta^*$.
Remark 1 (Optimistic parameter search).
CB-MNL enforces optimism via an optimistic parameter search (as, e.g., in Abbasi-Yadkori et al. [2011]), in contrast to the use of an exploration bonus as seen in Faury et al. [2020], Filippi et al. [2010]. Optimistic parameter search provides a cleaner description of the learning strategy. In non-linear reward models, the two approaches need not follow similar trajectories, but they may admit overlapping styles of analysis (see Filippi et al. [2010] for a short discussion).
Remark 2 (Tractable decision-making).
Input: regularization parameter $\lambda$ and the set of $N$ distinct items.
for $t = 1, \dots, T$ do
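A high-level sketch of one CB-MNL round follows. It brute-forces the assortment search and replaces the confidence set by a finite list of candidate parameters; both are simplifications for illustration only, whereas the actual decision step in Eq (12) optimizes over the full confidence set.

```python
import numpy as np
from itertools import combinations

def expected_revenue(theta, X):
    """Expected revenue of an assortment under the MNL model with unit prices (Eq (2))."""
    weights = np.exp(X @ theta)
    return weights.sum() / (1.0 + weights.sum())

def cb_mnl_round(contexts, candidate_thetas, K):
    """One optimistic decision: pick the assortment with the best revenue achievable
    by any of the candidate parameters (a finite stand-in for the confidence set)."""
    best_value, best_assortment = -np.inf, None
    for size in range(1, K + 1):
        for assortment in combinations(range(len(contexts)), size):
            X = contexts[list(assortment)]
            value = max(expected_revenue(theta, X) for theta in candidate_thetas)
            if value > best_value:
                best_value, best_assortment = value, assortment
    return best_assortment

rng = np.random.default_rng(1)
contexts = rng.normal(size=(6, 2))                      # this round's item attributes
candidates = [rng.normal(size=2) for _ in range(20)]    # plausible parameters
print(cb_mnl_round(contexts, candidates, K=2))
```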
4 Main results
We present a regret upper bound for the CB-MNL algorithm in Theorem 1.
Theorem 1.
With probability at least $1-\delta$ over the randomness of user choices, the regret of CB-MNL satisfies:
where the constants are given as , , and is given by Eq (8).
The formal proof is deferred to the technical Appendix; in this section we discuss the key technical ideas leading to this result. The order dependence on the model parameters is made explicit by the following corollary.
Corollary 2.
Setting the regularization parameter $\lambda$ as a function of $K$, the maximum cardinality of the assortments to be selected, simplifies the constants of Theorem 1. The resulting regret upper bound has a leading term of order $\sqrt{T}$ that does not involve $\kappa$; the parameter $\kappa$ enters only through lower-order additive terms.
Recall the expression for the cumulative regret in Eq (3). Each per-round term decomposes into a pessimism term and a prediction-error term (written out explicitly below), where pessimism is the additive inverse of the optimism (the difference between the payoff under the true parameter and the optimistic payoff computed by CB-MNL). Due to optimistic decision-making, i.e., the fact that the chosen pair of assortment and parameter maximizes the expected revenue over all feasible assortments and all parameters in the confidence set (see Eq (12)), the pessimism term is non-positive for all rounds. Thus, the regret is upper bounded by the sum of the prediction errors over the $T$ rounds. In Section 4.1 we derive an upper bound on the prediction error for a single round $t$. We also contrast our approach with the previous works Filippi et al. [2010], Li et al. [2017], Oh & Iyengar [2021] and point out the specific technical differences that allow us to use a Bernstein-like tail concentration inequality and, therefore, achieve stronger regret guarantees. In Section 4.2, we describe the additional steps leading to the statement of Theorem 1; the style of the arguments is simpler and shorter than in Faury et al. [2020]. Finally, in Section 4.3, we discuss the relationship between the two confidence sets of Eq (7) and Eq (9) and show that, even using the convex set of Eq (7) in place of Eq (9), we obtain regret upper bounds with the same parameter dependence as in Corollary 2. Lemma 3 gives the expression for an upper bound on the prediction error.
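To make this decomposition explicit, write $\theta_t$ for a parameter attaining the inner maximum in Eq (12) for the chosen assortment $S_t$; then, for every round $t$ (a standard step, using the notation introduced above),
$$R_t(S_t^{*}, \theta^*) - R_t(S_t, \theta^*) \;=\; \underbrace{\big(R_t(S_t^{*}, \theta^*) - R_t(S_t, \theta_t)\big)}_{\text{pessimism}\ \le\ 0} \;+\; \underbrace{\big(R_t(S_t, \theta_t) - R_t(S_t, \theta^*)\big)}_{\text{prediction error}},$$
where the pessimism term is non-positive whenever $\theta^*$ belongs to the confidence set, since the pair $(S_t, \theta_t)$ maximizes the expected revenue over all feasible assortments and all parameters in that set.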
4.1 Bounds on prediction error
The detailed proof is provided in A.4. Here we develop the main ideas leading to this result and an analytical flow that will be re-used when working with the convex confidence set in Section 4.3. In the previous works Filippi et al. [2010], Li et al. [2017], Oh & Iyengar [2021], global upper and lower bounds on the derivative of the link function (here, the softmax) are employed early in the analysis, leading to a loss of the local information carried by the MLE estimate $\hat\theta_t$. In those works, the first step is to upper bound the prediction error by the Lipschitz constant (a global property) of the softmax (or sigmoid, in the logistic bandit case) function, as:
(15)
For building intuition, assume that the true utilities lie in a “flatter” region of the reward function; then Eq (15) is a loose upper bound.
Next, we show how using a global lower bound in the form of $1/\kappa$ (see Assumption 2) early in the analysis, as done in Filippi et al. [2010], Li et al. [2017], Oh & Iyengar [2021], leads to a loose upper bound on the prediction error. For this, we first introduce new notation:
(16)
We also define the corresponding matrix quantity (see A.2). From Eq (6), we obtain (see A.2 for the details of this derivation):
(17)
From Assumption 2, this matrix is positive definite for $\lambda > 0$ and can therefore be used to define a norm. Using the Cauchy-Schwarz inequality together with Eq (17), the prediction error simplifies as:
(18)
The previous literature Filippi et al. [2010], Oh & Iyengar [2021] upper bounds this quantity by the Lipschitz constant directly at this stage, thereby incurring loose regret bounds. Instead, we work with the norm induced by the local design matrix and retain the local information carried by the estimate.
(19)
It is not straightforward to bound this term directly; we extend the self-concordance-style relations of Faury et al. [2020] to the multinomial logit function, which allows us to relate the design matrices evaluated at different parameters and thereby develop the required bound.
Lemma 4.
For all , the following inequalities hold:
Lemma 5.
For , we have the following relation with probability at least :
Proofs of Lemmas 4 and 5 are deferred to A.3. Notice that Lemma 5 is a key result that characterizes the quality of the confidence set: with $\lambda$ tuned as in Corollary 2, any parameter in the confidence set is not too far from $\theta^*$ under the norm induced by the local design matrix. Now, we use Lemma 5 in Eq (19) to get:
(20)
The quantity described in Eq (16) is upper bounded in the following result.
Lemma 6.
For the assortment chosen by the algorithm CB-MNL, as given by Eq (12), and any parameter $\theta$ in the confidence set, the following holds with probability at least $1-\delta$:
4.2 Regret calculation
The complete technical work is provided in A.4 and A.5. The key step in retrieving the upper bound of Theorem 1 is to sum the per-round prediction error bounds over the $T$ rounds, as in Eq (3). Compared to the previous literature Filippi et al. [2010], Li et al. [2017], Oh & Iyengar [2021], one of the resulting summation terms is new here. Further, our treatment of this term is much simpler and more straightforward than that in Faury et al. [2020].
4.3 Convex relaxation of the optimization step
Sections 4.1 & 4.2 provide an analytical framework for calculating the regret bounds given: (1) the confidence set of Eq (9), with the guarantee that it contains $\theta^*$ with probability at least $1-\delta$; and (2) the assurance that the confidence set is small (Lemma 5). In order to re-use the previously developed techniques, we show: (1) that the set of Eq (9) is contained in the set of Eq (7), and therefore the latter also contains $\theta^*$ with probability at least $1-\delta$ (see Lemma 7); and (2) an analog of Lemma 5 for the set of Eq (7) (see Lemma 8). The proof of Theorem 1 is then repeated using Lemma 8, following the steps sketched in Sections 4.1 & 4.2. The order dependence of the regret upper bound is retained (see Corollary 2).
Lemma 7.
The confidence set of Eq (9) is contained in the confidence set of Eq (7); therefore, any parameter belonging to the former also belongs to the latter.
The complete proof is provided in A.6. We highlight the usefulness of Lemma 7: since the set of Eq (9) lies entirely within the set of Eq (7), the concentration guarantee for the former also implies that $\theta^*$ lies in the latter with high probability.
Lemma 8.
Under the event , the following holds :
When , then , , and .
The complete proof can be found in A.6.
5 Numerical experiments
In this section we compare the empirical performance of our proposed algorithm CB-MNL with the previous state of the art in the MNL contextual bandit literature, UCB-MNL [Oh & Iyengar, 2021] and TS-MNL [Oh & Iyengar, 2019], on artificial data. We focus on the performance comparison for varying values of the parameter $\kappa$, and Figure 2 shows that our algorithm performs consistently better across these values. This highlights the primary contribution of our theoretical analysis. Refer to A.8 for additional empirical analysis.



For each experimental configuration, we consider a problem instance with fixed inventory size, instance dimension, maximum assortment size, and time horizon, averaged over multiple Monte Carlo simulation runs. The true parameter $\theta^*$ is a $d$-dimensional random vector with each coordinate drawn independently and uniformly from a bounded interval. The contexts follow a multivariate Gaussian distribution. The parameter $\kappa$ of the instance is manually tuned to produce the different configurations in Figure 2. Algorithm CB-MNL does not need to know the value of $\kappa$; in contrast, algorithms TS-MNL and UCB-MNL require the value of $\kappa$ for their implementation. We observe that our algorithm CB-MNL has robust performance for varying values of $\kappa$.
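A sketch of how such a synthetic instance can be generated; the sizes, ranges, and scaling below are placeholders, not the values used to produce the paper's figures.

```python
import numpy as np

rng = np.random.default_rng(42)

d, N, T = 5, 20, 1000                          # placeholder sizes, not the paper's values
theta_star = rng.uniform(-1.0, 1.0, size=d)    # coordinates i.i.d. uniform on an interval

# One round's contexts: N items, each with a d-dimensional Gaussian attribute vector,
# rescaled so that the attribute norms stay bounded (Assumption 1).
contexts = rng.multivariate_normal(np.zeros(d), np.eye(d) / d, size=N)
norms = np.linalg.norm(contexts, axis=1, keepdims=True)
contexts = contexts / np.maximum(norms, 1.0)

print(theta_star.round(2), contexts.shape)
```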
6 Conclusion and discussion
In this work, we proposed an optimistic algorithm for learning under the MNL contextual bandit framework. Using techniques from Faury et al. [2020], we developed an improved technical analysis to deal with the non-linear nature of the MNL reward function. As a result, the leading term in our regret bound does not suffer from the problem-dependent parameter $\kappa$. This contribution is significant, as $\kappa$ can be very large (refer to Section 1.2). For example, when the utilities are large in magnitude, the regret bounds of Oh & Iyengar [2021, 2019] blow up exponentially with that magnitude, while the leading term of our bound is unaffected. Further, we also presented a tractable version of the decision-making step of the algorithm by constructing a convex relaxation of the confidence set.
Our result is still away from the minimax lower bound of Chu et al. [2011] known for the linear contextual bandit. In the case of logistic bandits, Li et al. [2017] make an i.i.d. assumption on the contexts to bridge this gap (however, they still retain the multiplicative $\kappa$ factor). Improving the worst-case regret bound further while keeping $\kappa$ only as an additive term is an open problem. It may be possible to improve the dependence on $\kappa$ by using a higher-order approximation for the estimation error. Finding a lower bound on the $\kappa$ dependence is an interesting open problem and may require newer techniques than presented in this work.
Oh & Iyengar [2019] gave a Thompson sampling (TS) based learning strategy for the MNL contextual bandit. Thompson sampling approaches may not have to search the entire action space to make decisions, as optimistic algorithms (such as ours) do, so TS-based strategies are likely to have better empirical performance. The authors of Oh & Iyengar [2019] use a confidence-set-based analysis to bound the estimation error term; however, their results suffer from the prohibitive scaling with the problem-dependent parameter $\kappa$ that we have overcome here. Modifying our analysis for a TS-based learning strategy could bring together the best of both worlds.
References
- Abbasi-Yadkori et al. [2011] Abbasi-Yadkori, Y., Pál, D., & Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2312–2320.
- Abeille et al. [2021] Abeille, M., Faury, L., & Calauzènes, C. (2021). Instance-wise minimax-optimal algorithms for logistic bandits. In International Conference on Artificial Intelligence and Statistics (pp. 3691–3699). PMLR.
- Agrawal et al. [2017] Agrawal, S., Avadhanula, V., Goyal, V., & Zeevi, A. (2017). Thompson sampling for the mnl-bandit. In Conference on Learning Theory (pp. 76–78). PMLR.
- Agrawal et al. [2019] Agrawal, S., Avadhanula, V., Goyal, V., & Zeevi, A. (2019). Mnl-bandit: A dynamic learning approach to assortment selection. Operations Research, 67, 1453–1485. doi:10.1287/opre.2018.1832.
- Alfandari et al. [2021] Alfandari, L., Hassanzadeh, A., & Ljubić, I. (2021). An exact method for assortment optimization under the nested logit model. European Journal of Operational Research, 291, 830–845. doi:https://doi.org/10.1016/j.ejor.2020.12.007.
- Amani & Thrampoulidis [2021] Amani, S., & Thrampoulidis, C. (2021). Ucb-based algorithms for multinomial logistic regression bandits. Advances in Neural Information Processing Systems, 34, 2913–2924.
- Auer et al. [2002] Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine learning, 47, 235–256.
- Avadhanula [2019] Avadhanula, V. (2019). The MNL-Bandit Problem: Theory and Applications. Ph.D. thesis Columbia University.
- Bach [2010] Bach, F. (2010). Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4, 384 – 414. URL: https://doi.org/10.1214/09-EJS521. doi:10.1214/09-EJS521.
- Chen et al. [2020] Chen, X., Wang, Y., & Zhou, Y. (2020). Dynamic assortment optimization with changing contextual information. Journal of Machine Learning Research, 21, 1–44.
- Cheung & Simchi-Levi [2017] Cheung, W. C., & Simchi-Levi, D. (2017). Thompson sampling for online personalized assortment optimization problems with multinomial logit choice models. Available at SSRN 3075658.
- Chu et al. [2011] Chu, W., Li, L., Reyzin, L., & Schapire, R. (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (pp. 208–214).
- Dani et al. [2008] Dani, V., Hayes, T. P., & Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. In Conference on Learning Theory.
- Faury et al. [2020] Faury, L., Abeille, M., Calauzènes, C., & Fercoq, O. (2020). Improved optimistic algorithms for logistic bandits. In International Conference on Machine Learning (pp. 3052–3060). PMLR.
- Feldman et al. [2018] Feldman, J., Zhang, D., Liu, X., & Zhang, N. (2018). Taking assortment optimization from theory to practice: Evidence from large field experiments on alibaba. Available at SSRN.
- Filippi et al. [2010] Filippi, S., Cappe, O., Garivier, A., & Szepesvári, C. (2010). Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems (pp. 586–594).
- Flores et al. [2019] Flores, A., Berbeglia, G., & Van Hentenryck, P. (2019). Assortment optimization under the sequential multinomial logit model. European Journal of Operational Research, 273, 1052–1064. doi:https://doi.org/10.1016/j.ejor.2018.08.047.
- Gao & Pavel [2017] Gao, B., & Pavel, L. (2017). On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805.
- Grant & Szechtman [2021] Grant, J. A., & Szechtman, R. (2021). Filtered poisson process bandit on a continuum. European Journal of Operational Research, 295, 575–586. doi:https://doi.org/10.1016/j.ejor.2021.03.033.
- Kök & Fisher [2007] Kök, A. G., & Fisher, M. L. (2007). Demand estimation and assortment optimization under substitution: Methodology and application. Operations Research, 55, 1001–1021. doi:https://doi.org/10.1287/opre.1070.0409.
- Lehmann & Casella [2006] Lehmann, E. L., & Casella, G. (2006). Theory of point estimation. Springer Science & Business Media.
- Li et al. [2017] Li, L., Lu, Y., & Zhou, D. (2017). Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 (pp. 2071–2080).
- Oh & Iyengar [2019] Oh, M.-h., & Iyengar, G. (2019). Thompson sampling for multinomial logit contextual bandits. In Advances in Neural Information Processing Systems (pp. 3151–3161).
- Oh & Iyengar [2021] Oh, M.-h., & Iyengar, G. (2021). Multinomial logit contextual bandits: Provable optimality and practicality. In Proceedings of the AAAI Conference on Artificial Intelligence (pp. 9205–9213). volume 35.
- Ou et al. [2018] Ou, M., Li, N., Zhu, S., & Jin, R. (2018). Multinomial logit bandit with linear utility functions. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (pp. 2602–2608).
- Perivier & Goyal [2022] Perivier, N., & Goyal, V. (2022). Dynamic pricing and assortment under a contextual mnl demand. Advances in Neural Information Processing Systems, 35, 3461–3474.
- Rusmevichientong et al. [2010] Rusmevichientong, P., Shen, Z.-J. M., & Shmoys, D. B. (2010). Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations research, 58, 1666–1680. doi:https://doi.org/10.1287/opre.1100.0866.
- Rusmevichientong & Tsitsiklis [2010] Rusmevichientong, P., & Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35, 395–411. doi:https://doi.org/10.1287/moor.1100.0446.
- Sauré & Zeevi [2013] Sauré, D., & Zeevi, A. (2013). Optimal dynamic assortment planning with demand learning. Manufacturing & Service Operations Management, 15, 387–404. doi:https://doi.org/10.1287/msom.2013.0429.
- Timonina-Farkas et al. [2020] Timonina-Farkas, A., Katsifou, A., & Seifert, R. W. (2020). Product assortment and space allocation strategies to attract loyal and non-loyal customers. European Journal of Operational Research, 285, 1058–1076. doi:https://doi.org/10.1016/j.ejor.2020.02.019.
- Wang et al. [2020] Wang, X., Zhao, X., & Liu, B. (2020). Design and pricing of extended warranty menus based on the multinomial logit choice model. European Journal of Operational Research, 287, 237–250. doi:https://doi.org/10.1016/j.ejor.2020.05.012.
- Xu et al. [2021] Xu, J., Chen, L., & Tang, O. (2021). An online algorithm for the risk-aware restless bandit. European Journal of Operational Research, 290, 622–639. doi:https://doi.org/10.1016/j.ejor.2020.08.028.
- Zhang & Lin [2015] Zhang, Y., & Lin, X. (2015). Disco: Distributed optimization for self-concordant empirical loss. In International conference on machine learning (pp. 362–370).
Appendix A Appendix
A.1 Confidence set
In this section, we justify the design of the confidence set defined in Eq (9). This particular choice is based on the following concentration inequality for self-normalized vectorial martingales.
Theorem 9.
Appears as Theorem 4 in Abeille et al. [2021] Let be a filtration. Let be a stochastic process in such that is measurable. Let be a martingale difference sequence such that is measurable. Furthermore, assume that conditionally on we have almost surely, and note . Let be a predictable sequence of non-negative scalars. Define:
Then for any :
Theorem 9 cannot be used directly in our setting, since in the MNL model the rewards within a round (for any time step) are correlated. Hence a concentration result almost identical to Theorem 9 (differing only in minor constants), appearing as Theorem C.6 in Perivier & Goyal [2022], is used instead.
Lemma 10 (confidence bounds for multinomial logistic rewards).
Proof.
is the maximizer of the regularized log-likelihood:
where is given by Eq (1) as . Solving for , we obtain:
This result, combined with the definition of yields:
where we denoted for all and and for all . For any , from the definition of it follows that . Hence, . Later in the proof of Theorem 1, we present our choice of which always ensures .
(21) |
Conditioned on the filtration set (see Section 2.2 to review its definition), the noise term is a martingale difference sequence and is bounded, since we assume that the maximum reward accrued at any round is upper bounded. We calculate, for all rounds:
(22) |
Also from Remark 3, we have :
Therefore setting as and as we invoke an instance of Theorem C.6 in Perivier & Goyal [2022] to obtain:
It is insightful to compare Theorem 9 with Theorem 1 of Abbasi-Yadkori et al. [2011]. The latter is restated below:
Theorem 11.
Let be a filtration. Let be a real-valued stochastic process such that is -measurable and is conditionally -sub-Gaussian for some , i.e
Let be an valued stochastic process such that is -measurable. Assume is a positive definite matrix. For any , define:
Then, for any , with probability at least . for all ,
A.2 Local information preserving norm
Deviating from the previous analyses in Filippi et al. [2010], Li et al. [2017], we describe a norm that preserves local information. The matrix $X_t$ is the design matrix composed of the contexts received at time step $t$ as its columns. The expected reward due to the $i$th item in the assortment is given by:
Further, we consider the following integral:
(25) |
where is the partial derivative of in the direction of the component and represents integration of with respect to the coordinate ( hence the limits of the integration only consider change in the coordinate ). For notation purposes which would become clear later, we define:
(26)
where the second step is due to the Fundamental Theorem of Calculus. We have exploited the two ways of viewing the multinomial logit function: as a sum of individual probabilities and as a vector-valued function. We write:
(27) |
We also have:
(28) |
It follows that:
where . Since (from Assumption 2), therefore . Hence we get:
(29) |
A.3 Self-Concordance Style Relations for Multinomial Logistic Function
Lemma 12.
For an assortment and , the following holds:
Proof.
We write:
(30) |
where represents integration of with respect to the coordinate ( hence the limits of the integration only consider change in the coordinate ). For some , consider:
where is the double derivative of . Using Lemma 16, we have . Thus we get:
Using Fundamental Theorem of Calculus, we get:
(31) |
Using Eq (A.3) and for , and for all and we have:
(32) |
Reversing the role of and , such that then again by using Eq (A.3) we write:
(33) |
(34) |
If , then , and therefore . Thus we lower bound the left hand side of Eq (34) as:
Using above with and in Eq (30) gives:
∎
Lemma 4.
For all such that (Assumption 1), the following inequalities hold:
Proof.
From Lemma 12, we have:
(Cauchy-Schwartz) | ||||
() |
Now we write as:
Since, and have symmetric roles in the definition of , we also obtain the second relation by a change of variable directly. ∎
The following Lemma presents a crucial bound over the deviation , which we extensively use in our derivations.
Lemma 5.
For , we have the following relation with probability at least :
(35) |
A.4 Bounds on prediction error
Lemma 6.
For the assortment chosen by the algorithm CB-MNL, as given by Eq (12) and any the following holds with probability at least :
Proof.
Consider the multinomial logit function:
(36) |
We use a second-order Taylor expansion for each component of the multinomial logit function. Consider, for each component:
(37) |
In Eq (A.4), we substitute: , , and . Thus we re-write Eq (36) as:
where we upper bound by . An application of Cauchy-Schwarz gives us:
(38) |
∎
Lemma 3.
For the assortment chosen by the algorithm CB-MNL, as given by Eq (12) and any the following holds with probability at least :
Proof.
Corollary 7.
For the assortment chosen by the algorithm CB-MNL, as given by Eq (12) and any the following holds with probability at least :
where and .
Proof.
This directly follows from the uniqueness and realizability of .
∎
A.5 Regret calculation
The following two lemmas give the upper bounds on the self-normalized vector summations.
Lemma 13.
Proof.
The proof follows by a direct application of Lemma 17 and 18 as:
(From Lemma 17) | ||||
(Upper bound by Lipschitz constant) | ||||
(From Lemma 18) |
∎
Similar to Lemma 13, we prove the following.
Lemma 14.
Theorem 1.
A.6 Convex relaxation
Lemma 8.
, therefore for any , we also have (see Eq (7)).
Proof.
Let be the maximum likelihood estimate (see Eq (5)), the second-order Taylor series expansion of the log-loss (with integral remainder term) for any is given by:
(40) |
by definition since is maximum likelihood estimate. Therefore :
() | ||||
(def. of ) | ||||
(Eq (29)) |
Thus we obtain:
(from Lemma 15) |
where the last inequality suggests that by the definition of the set . Therefore, . ∎
The following helper lemma translates the confidence set definition of Lemma 10 to the norm defined by the local design matrix.
Lemma 15.
Let . For all and as the maximum likelihood estimate in Eq (5).
Proof.
We have:
(def. of ) | ||||
(from Lemma 12) | ||||
(Cauchy-Schwarz inequality) | ||||
() | ||||
(def. of ) | ||||
(from Eq (29)) |
where
is the local-information-containing counterpart of the relation in Lemma 4. This gives:
(from Lemma 10) |
where the last relation is a quadratic inequality in , which on solving completes the proof of the statement in the lemma. ∎
Lemma 9.
Under the event , the following holds :
When , then , , and
Proof.
A.7 Technical lemmas
Remark 3 (Derivatives for MNL choice function).
For the multinomial logit choice function, where the expected reward due to item of the assortment is modeled as:
the partial derivative with respect to the expected reward of item is given as:
and the double derivative as:
Lemma 16 (Self-Concordance like relation for MNL).
For the multinomial logit choice function, where the expected reward due to item of the assortment is modeled as:
the following relation holds:
Proof.
The proof directly follows from Remark 3 and the observation
for all items, in the assortment choice.
∎
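A numerical sanity check of this self-concordance-style property: the sketch assumes the standard closed forms $\partial p_i/\partial u_i = p_i(1-p_i)$ and $\partial^2 p_i/\partial u_i^2 = p_i(1-p_i)(1-2p_i)$ for the MNL choice probability along its own utility coordinate (these closed forms are assumptions of the sketch), and verifies that the second derivative never exceeds the first in magnitude on random instances.

```python
import numpy as np

rng = np.random.default_rng(3)

def check_self_concordance(num_trials=1000, K=4):
    for _ in range(num_trials):
        utilities = rng.uniform(-5, 5, size=K)
        weights = np.exp(utilities)
        p = weights / (1.0 + weights.sum())          # MNL choice probabilities
        first = p * (1.0 - p)                        # d p_i / d u_i
        second = p * (1.0 - p) * (1.0 - 2.0 * p)     # d^2 p_i / d u_i^2
        assert np.all(np.abs(second) <= first + 1e-12)
    return True

print(check_self_concordance())
```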
Lemma 17 (Generalized elliptical potential).
Let be a sequence in such that for each , has columns as where for all and . Also, let be a non-negative scalar. For , define where is strictly positive for all . Then the following inequality holds:
with .
Proof.
By the definition of :
Taking log from both sides and summing from to :
(By a telescopic sum cancellation) | ||||
(44) |
For any such that , it follows that . Therefore, we write:
(From Eq (44)) |
∎
Lemma 18 (Determinant-trace inequality, see Lemma 10 in Abbasi-Yadkori et al. [2011]).
Let a sequence in such that for all , and let be a non-negative scalar. For define . The following inequality holds:
A.8 Numerical experiments
We build on Section 5 and compare the empirical performance of our proposed algorithm CB-MNL with the previous state of the art in the MNL contextual bandit literature, UCB-MNL [Oh & Iyengar, 2021] and TS-MNL [Oh & Iyengar, 2019], on artificial data for varying model parameters. The true parameter $\theta^*$ is a $d$-dimensional random vector with each coordinate drawn independently and uniformly from a bounded interval. The contexts follow a multivariate Gaussian distribution. Algorithm CB-MNL does not need to know the value of $\kappa$, whereas algorithms TS-MNL and UCB-MNL require it for their implementation. Here we simulate two additional parameter instances, again averaged over the same number of Monte Carlo runs.

