
A Tractable Online Learning Algorithm for the Multinomial Logit Contextual Bandit

Priyank Agrawal [email protected] Theja Tulabandhula [email protected] Vashist Avadhanula [email protected] 500 W 120th St, New York, NY 10027 University Hall, 601 S Morgan St, Chicago, IL 60607 1100 Enterprise Way, Sunnyvale, CA 94089
Abstract

In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic set optimization problem, where a decision-maker offers a subset (assortment) of products to a consumer and observes the response in every round. Consumers purchase products to maximize their utility. We assume that a set of attributes describe the products, and the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior using the widely used Multinomial Logit (MNL) model and consider the decision maker’s problem of dynamically learning the model parameters while optimizing cumulative revenue over the selling horizon T. Though this problem has recently attracted considerable attention, many existing methods often involve solving an intractable non-convex optimization problem. Their theoretical performance guarantees depend on a problem-dependent parameter which could be prohibitively large. In particular, current algorithms for this problem have regret bounded by O(\sqrt{\kappa dT}), where \kappa is a problem-dependent constant that may have an exponential dependency on the number of attributes, d. In this paper, we propose an optimistic algorithm and show that the regret is bounded by O(\sqrt{dT}+\kappa), significantly improving the performance over existing methods. Further, we propose a convex relaxation of the optimization step, which allows for tractable decision-making while retaining the favorable regret guarantee. We also demonstrate that our algorithm has robust performance for varying \kappa values through numerical experiments.

keywords:
Revenue management , OR in marketing , Multi-armed bandit , Multinomial Logit model , Sequential decision-making
journal: European Journal of Operational Research

1 Introduction

Assortment optimization problems arise in many industries, and prominent examples include retailing and online advertising (see Alfandari et al. [2021], Timonina-Farkas et al. [2020], Wang et al. [2020], and Kök & Fisher [2007] for a detailed review). The problem faced by a decision-maker is that of selecting a subset (assortment) of items to offer from a universe of substitutable items (if all consumers have identical preferences towards the same characteristics of an item, that item is termed substitutable) such that the expected revenue is maximized. In many e-commerce applications, the data on consumer choices tends to be either limited or non-existent (similar to the cold start problem in recommendation systems). Consumer preferences must be learned by experimenting with various assortments and observing consumer choices, but this experimentation must be balanced to maximize cumulative revenue. Furthermore, in many settings, the retailer has to consider a very large number of products that are similar (examples range from apparel to consumer electronics). The commonality in their features can be expressed with the aid of auxiliary variables which summarize product attributes. This enables a significant reduction in dimensionality but introduces additional challenges in designing policies that have to dynamically balance demand learning (exploration) while simultaneously maximizing cumulative revenues (exploitation).

Motivated by these issues, we consider the dynamic assortment optimization problem. In every round, the retailer offers a subset (assortment) of products to a consumer and observes the consumer response. Consumers purchase (at most one product from each assortment) products that maximize their utility, and the retailer enjoys revenue from the successful purchase. We assume that the products are described by a set of attributes and the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior using the widely used Multinomial Logit (MNL) model and consider the retailer’s problem of dynamically learning the model parameters while optimizing cumulative revenues over the selling horizon T. Specifically, we have a universe of N substitutable items, and each item i is associated with an attribute vector x_{i}\in\mathbb{R}^{d}, which is known a priori. The mean utility for the consumer for product i is given by the inner product \theta\cdot x_{i}, where \theta\in\mathbb{R}^{d} is some fixed but initially unknown parameter vector. Each of the d coordinates of x_{i} for product i represents a variety of characteristics such as cost, popularity, brand, etc. Given the substitutable good assumption, the preferences of all consumers towards these characteristics are identical and denoted by the same parameter \theta\in\mathbb{R}^{d} (this assumption may appear quite restrictive at first, but, as described in the following paragraph and Section 2.2, the model is rich enough to capture non-identical consumer behavior as well). Further, any two products i and j could vary in terms of these characteristics and hence are associated with different vectors x_{i} and x_{j}\in\mathbb{R}^{d}, respectively. Our goal is to offer assortments \mathcal{Q}_{1},\cdots,\mathcal{Q}_{T} at times 1,\cdots,T from a feasible collection of assortments such that the cumulative expected revenue of the retailer over the said horizon is maximized. In general, the feasible set of assortments can reflect the constraints of retailers and online platforms (such as cardinality, inventory availability, and other related constraints).

For an intuitive understanding of the choice model, consider an example of an online furniture retailer that offers N distinct products, where the i-th product has an attribute vector x_{i} (in general, this attribute can vary over time, representing varying consumers’ choices, and is more appropriately represented by x_{t,i}). Suppose consumers query for a specific product category, say tables. In this example, the \theta parameter will be a distinct vector corresponding to the product category: table. As discussed before, the true \theta_{*} that determines consumer choice behavior is unknown. With each interaction with the consumer, the online retailer is learning which of the N products offers the most utility (captured by \theta\cdot x_{i} for each product i) to the consumer by observing the past purchase decisions of the consumers. The online furniture retailer is constrained to offer at most K of the N products in each interaction with the consumer. Such a constraint may be encountered in practical situations: limitations of the online consumer interface to display a large number of products, consumers preferring to examine only a subset of products at a time, etc. Out of the N furniture items, some particular table j could have high utility, \theta\cdot x_{j}, whereas for some product k (say, a table with an unpopular color, bad design, or inferior material) \theta\cdot x_{k} could be low. The consumer may purchase one or none of the K presented products. Later in Section 2.2, we demonstrate that when the consumer’s propensity to purchase a specific product is driven by its utility, the retailer’s expected revenue at each round is given by a softmax function.

The rest of this section is organized as follows: We first describe the related literature and qualitative significance of the parameter κ\kappa. Then, we highlight our contributions and end the section by contrasting them with recent notable research works.

1.1 Related literature

The MNL model is a widely used choice model for capturing consumer purchase behavior in assortment selection models (see Flores et al. [2019] and Avadhanula [2019]). Recently, large-scale field experiments at Alibaba [Feldman et al., 2018] have demonstrated the efficacy of the MNL model in boosting revenues. Rusmevichientong et al. [2010] and Sauré & Zeevi [2013] were a couple of early works that studied explore-then-commit strategies for the dynamic assortment selection problem under the MNL model when there are no contexts/product features. The works of Agrawal et al. [2019] and Agrawal et al. [2017] revisited this problem and presented adaptive online learning algorithms based on the Upper Confidence Bound (UCB) and Thompson Sampling (TS) ideas. These approaches, unlike earlier ideas, did not require prior information about the problem parameters and had near-optimal regret bounds. Following these developments, the contextual variant of the problem has received considerable attention. Cheung & Simchi-Levi [2017] and Oh & Iyengar [2019] propose TS-based approaches and establish Bayesian regret bounds on their performance (our results give worst-case regret bounds, which are strictly stronger than Bayesian regret bounds; worst-case regret bounds directly imply Bayesian regret bounds with the same order dependence). Chen et al. [2020] present a UCB-based algorithm and establish min-max regret bounds. However, these contextual MNL algorithms and their performance bounds depend on a problem parameter \kappa that can be prohibitively large, even for simple real-life examples. See Figure 1 for an illustration and Section 1.2 for a detailed discussion.

Figure 1: Illustration of the impact of the \kappa parameter (logistic case; the multinomial logit case closely follows): A representative plot of the derivative of the reward function. The x-axis represents the linear function x^{\top}\theta and the y-axis is proportional to 1/\kappa. The parameter \kappa is small only in the narrow region around 0 and grows arbitrarily large depending on the problem instance (captured by the x^{\top}\theta values).

We note that Ou et al. [2018] also consider a similar problem of developing an online algorithm for the MNL model with linear utility parameters. Though they establish a regret bound that does not depend on the aforementioned parameter \kappa, they work with an inaccurate version of the MNL model. More specifically, in the MNL model, the probability of a consumer preferring an item is proportional to the exponential of the utility parameter and is not linear in the utility parameter as assumed in Ou et al. [2018].

The multi-armed bandit problem, which underlies these dynamic decision making settings, has been well studied in the literature (see Xu et al. [2021], Grant & Szechtman [2021]). Our problem is closely related to the parametric bandit problem, where a common unknown parameter connects the rewards of each arm. In particular, for linear bandits, each arm a\in A (consider A to be the set of all arms) is associated with a d-dimensional vector x_{a}\in\mathbb{R}^{d} known a priori, and the expected reward upon selecting arm a\in A is given by the inner product \theta\cdot x_{a}, for some unknown parameter vector \theta (see Dani et al. [2008], Rusmevichientong & Tsitsiklis [2010], Abbasi-Yadkori et al. [2011]). The key difference is that the rewards (i.e., the revenue of the retailer) corresponding to an assortment under the MNL model cannot be modeled in the framework of linear payoffs. Closer to our formulation is the literature on generalized linear bandits (see Filippi et al. [2010] and Faury et al. [2020]), where the expected payoff upon selecting arm a is given by f(\theta\cdot x_{a}), where f is a real-valued, non-linear function. However, unlike our setting, where an arm could be a collection of K products (thus involving K d-dimensional vectors), f(\cdot) is a single-variable function in these prior works.

1.2 On the parameter κ\kappa

As discussed earlier, the retailer’s revenue (reward function) is the softmax function. Intuitively, the curvature of the reward function influences how easy (or difficult) it is to learn the true choice parameter \theta_{*}. In Section 2, we explicitly define \kappa as inversely proportional to the lower bound on the derivative of the reward function over the entire decision region. The existence of a global lower bound on the curvature of the reward function is a necessary assumption for the maximum likelihood estimation of \theta_{*}.

In previous works on generalized linear bandits and variants [Filippi et al., 2010, Li et al., 2017, Oh & Iyengar, 2019], the quantity \kappa features in regret guarantees as a multiplicative factor of the primary term (i.e., as \tilde{\mathrm{O}}(\kappa\sqrt{T})), because they ignore the local effect of the curvature and use global properties (via \kappa), leading to loose worst-case bounds. For a cleaner exposition of this issue, let us take K=1, i.e., the rewards are given by a sigmoid function of \theta\cdot x. The derivative of the sigmoid is “bell”-shaped (see Figure 1). When \theta\cdot x is very high (i.e., the assortment contains products with high utilities) or when \theta\cdot x is very low (i.e., the assortment contains products with low utility), the value of \kappa will be large. From Assumption 2, for K=1, \kappa is equivalent to \max\frac{1}{a(1-a)}, for some a\in(0,1). Thus, when a is close to 1 or 0, the value of \kappa will be large. The exponential dependence for the K=1 case follows when we replace a with a sigmoid function. In the context of our problem, this translates to an exponential dependence of the per-round regret on the magnitude of the utilities (i.e., \theta\cdot x).
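As a small numerical illustration of this growth (a sketch in Python, assuming NumPy; the utility values below are arbitrary), the reciprocal of the sigmoid derivative, which lower-bounds \kappa for K=1, blows up quickly as the magnitude of the utility grows:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# For K = 1 the reward is sigmoid(theta . x), so kappa >= 1 / (p * (1 - p))
# at any utility u the instance can realize; this grows exponentially in |u|.
for u in [0.0, 2.0, 5.0, 10.0]:
    p = sigmoid(u)
    print(f"utility {u:5.1f}  ->  1 / (p(1-p)) = {1.0 / (p * (1.0 - p)):.1f}")
```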

1.3 Contributions

In this paper, we build on recent developments for generalized linear bandits (Faury et al. [2020]) to propose a new optimistic algorithm, CB-MNL, for the problem of contextual multinomial logit bandits. CB-MNL follows the standard template of optimistic parameter search strategies (also known as optimism in the face of uncertainty approaches) [Abbasi-Yadkori et al., 2011, Abeille et al., 2021]. We use Bernstein-style concentration for self-normalized martingales, which were previously proposed in the context of scalar logistic bandits in Faury et al. [2020], to define our confidence set over the true parameter, taking into account the effects of the local curvature of the reward function. We show that the performance of CB-MNL (as measured by regret) is bounded as \tilde{\mathrm{O}}(d\sqrt{T}+\kappa), significantly improving the theoretical performance over existing algorithms where \kappa appears as a multiplicative factor in the leading term. We also leverage a self-concordance [Bach, 2010] like relation for the multinomial logit reward function [Zhang & Lin, 2015], which helps us limit the effect of \kappa on the final regret upper bound to only the higher-order terms. Finally, we propose a different convex confidence set for the optimization problem in the decision step of CB-MNL, which reduces the optimization problem to a constrained convex problem.

In summary, our work establishes strong worst-case regret guarantees by carefully accounting for local gradient information and using second-order function approximation for the estimation error.

1.4 Comparison with notable prior works

Comparison with Filippi et al. [2010] Our setting is different from the standard generalized linear bandit of Filippi et al. [2010]. In our setting, the reward due to an action (assortment) can depend on up to K variables (\theta_{*}\cdot x_{t,i},\,i\in\mathcal{Q}_{t}) instead of a single variable. Further, we focus on removing the multiplicative dependence on \kappa from the regret bounds. This leads to a more involved technical treatment in our work.

Comparison with Oh & Iyengar [2019] The Thompson Sampling based approach is inherently different from our Optimism in the face of uncertainty (OFU) style Algorithm CB-MNL. However, the main result in Oh & Iyengar [2019] also relies on a confidence set based analysis along the lines of Filippi et al. [2010] but has a multiplicative \kappa factor in the bound.

Comparison with Faury et al. [2020] Faury et al. [2020] use a bonus term for optimization in each round, and their algorithm performs non-trivial projections on the admissible log-odds. While we do reuse the Bernstein-style concentration inequality as proposed by them, their results do not seem to extend directly to the MNL setting without requiring significantly more work. Further, our algorithm CB-MNL performs an optimistic parameter search for making decisions instead of using a bonus term, which allows for a cleaner and shorter analysis.

Comparison with Oh & Iyengar [2021] While the authors in Oh & Iyengar [2021] provide sharper bounds by a factor of \tilde{\mathrm{O}}(\sqrt{d}), they still retain the multiplicative \kappa factor in their regret bounds. Their focus is on improving the dependence on the dimension parameter d for the dynamic assortment optimization problem.

Comparison with Abeille et al. [2021]  Abeille et al. [2021] recently proposed the idea of convex relaxation of the confidence set for the more straightforward logistic bandit setting. Our work can be viewed as an extension of their construction to the MNL setting.

Comparison with Amani & Thrampoulidis [2021] While the authors in Amani & Thrampoulidis [2021] also extend the algorithms of Faury et al. [2020] to a multinomial problem, their setting is materially different from ours. They model various click-types for the same advertisement (action) via the multinomial distribution. Further, they consider actions played at each round to be non-combinatorial, i.e., a single action as opposed to a bundle of actions, which differs from the assortment optimization setting in this work. Therefore, their approach and technical analysis are different from ours.

2 Preliminaries

2.1 Notations

For a vector x\in\mathbb{R}^{d}, x^{\top} denotes the transpose. Given a positive definite matrix \mathbf{M}\in\mathbb{R}^{d\times d}, the induced norm is given by ||x||_{\mathbf{M}}=\sqrt{x^{\top}\mathbf{M}x}. For two symmetric matrices \mathbf{M_{1}} and \mathbf{M_{2}}, \mathbf{M_{1}}\succeq\mathbf{M_{2}} means that \mathbf{M_{1}}-\mathbf{M_{2}} is positive semi-definite. For any positive integer n, [n]\coloneqq\{1,2,3,\cdots,n\}. \mathbf{I}_{d} denotes an identity matrix of dimension d\times d. The platform (i.e., the learner) is referred to using the pronouns she/her/hers.

2.2 Model setting

Rewards Model:

At every round t, the platform (learner) is presented with a set \mathcal{N} of distinct items, indexed by i\in[N], and their attribute vectors (contexts) \{x_{t,i}\}_{i=1}^{N} such that \forall\,i\in[N],\,x_{t,i}\in\mathbb{R}^{d}, where N=|\mathcal{N}| is the cardinality of the set \mathcal{N}. The platform then selects an assortment \mathcal{Q}_{t}\subset\mathcal{N} and the interacting consumer (environment) offers the reward r_{t} to the platform. The assortments have a cardinality of at most K, i.e., |\mathcal{Q}_{t}|\leq K. The platform’s decision is based on the entire history of interaction. The history is represented by the filtration set \mathcal{F}_{t}\coloneqq\{\mathcal{F}_{0},\sigma(\{\{x_{s,i}\}_{i=1}^{N},\mathcal{Q}_{s}\}_{s=1}^{t-1})\} (here \sigma(\{\cdot\}) denotes the \sigma-algebra over the sequence \{\cdot\}), where \mathcal{F}_{0} is any prior information available to the platform. The interaction lasts for t=1,2,\cdots,T rounds. Conditioned on \mathcal{F}_{t}, the reward r_{t} is a binary vector such that r_{t}\in\{0,1\}^{N} and the vector \{r_{t,i}\}_{i\in\mathcal{Q}_{t}} follows a multinomial distribution. We have r_{t,i}=0,\forall\,i\notin\mathcal{Q}_{t}. Specifically, the probability that r_{t,i}=1 for i\in\mathcal{Q}_{t} is given by the softmax function:

\displaystyle\mathbb{P}(r_{t,i}=1|\mathcal{Q}_{t},\mathcal{F}_{t})=\mu_{i}(\mathcal{Q}_{t},\theta_{*})\coloneqq\frac{\exp(x_{t,i}^{\top}\theta_{*})}{1+\sum_{j\in\mathcal{Q}_{t}}\exp(x_{t,j}^{\top}\theta_{*})}, (1)

where \theta_{*} is an unknown time-invariant parameter. The 1 in the denominator accounts for the case when the consumer purchases none of the items in the assortment. By definition, \sum_{i\in\mathcal{Q}_{t}}r_{t,i}\leq 1, i.e., r_{t} is multinomial with a single trial. Also, the expected revenue due to the assortment \mathcal{Q}_{t} is given by (each item i is also associated with a price (or revenue) parameter p_{t,i} for round t; we assume p_{t,i}=1 for all items and rounds for an uncluttered exposition of results; if p_{t,i} is not 1, it features as a fixed factor in the definition of \mu_{i}(\cdot) and the analysis follows exactly as presented here):

\mu(\mathcal{Q}_{t},\theta_{*})\coloneqq\sum_{i\in\mathcal{Q}_{t}}\mu_{i}(\mathcal{Q}_{t},\theta_{*}). (2)

Also, \{x_{t,i}\} may vary adversarially in each round in our model, unlike in Li et al. [2017], where the attribute vectors are assumed to be drawn from an unknown i.i.d. distribution. When K=1, the above model reduces to the case of the logistic bandit.
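The choice probabilities in Eq (1) and the expected revenue in Eq (2) are straightforward to compute; the following is a minimal sketch in Python (assuming NumPy; the function names are ours and purely illustrative):

```python
import numpy as np

def mnl_choice_probabilities(X_Q, theta):
    """Choice probabilities of Eq (1) for one assortment.

    X_Q   : (|Q|, d) array whose rows are the attribute vectors x_{t,i}, i in Q_t.
    theta : (d,) parameter vector.
    Returns an array of length |Q| with P(r_{t,i} = 1 | Q_t, theta).
    """
    weights = np.exp(X_Q @ theta)            # exp(x_{t,i}^T theta) for each offered item
    return weights / (1.0 + weights.sum())   # the "+1" is the no-purchase option

def expected_revenue(X_Q, theta):
    """Expected revenue of Eq (2), with unit prices p_{t,i} = 1."""
    return mnl_choice_probabilities(X_Q, theta).sum()

# Example: a 3-item assortment in d = 2 dimensions.
rng = np.random.default_rng(0)
X_Q = rng.normal(size=(3, 2))
theta_star = np.array([0.5, -0.2])
print(mnl_choice_probabilities(X_Q, theta_star), expected_revenue(X_Q, theta_star))
```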

Choice Modeling Perspective:

Eq (1) can be considered from a discrete choice modeling viewpoint, where the platform presents an assortment of items to a user, and the user selects at most one item from this assortment. In this interpretation, the probability of choosing an item i is given by \mu_{i}(\mathcal{Q}_{t},\theta_{*}). Likewise, the probability of the user not selecting any item is given by 1/(1+\sum_{j\in\mathcal{Q}_{t}}\exp(x_{t,j}^{\top}\theta_{*})). The platform is motivated to offer such an assortment that the user’s propensity to make a successful selection is high.

Regret:

The platform does not know the value of \theta_{*}. Our learning algorithm CB-MNL (see Algorithm 1) sequentially makes the assortment selection decisions \mathcal{Q}_{1},\mathcal{Q}_{2},\cdots,\mathcal{Q}_{T} so that the cumulative expected revenue \sum_{t=1}^{T}\mu(\mathcal{Q}_{t},\theta_{*}) is high. Its performance is quantified by pseudo-regret, which is the gap between the expected revenue generated by the algorithm and that of the optimal assortments in hindsight. The learning goal is to minimize the cumulative pseudo-regret up to time T, defined as:

\mathbf{R}_{T}\coloneqq\sum_{t=1}^{T}[\mu(\mathcal{Q}_{t}^{*},\theta_{*})-\mu(\mathcal{Q}_{t},\theta_{*})], (3)

where \mathcal{Q}_{t}^{*} is the offline optimal assortment at round t under full information of \theta_{*}, defined as \mathcal{Q}_{t}^{*}\coloneqq\operatorname*{argmax}_{\mathcal{Q}\subset\mathcal{N}}\mu(\mathcal{Q},\theta_{*}).
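For small instances, the offline benchmark \mathcal{Q}_{t}^{*} can be computed by brute force; a sketch (reusing expected_revenue from the earlier snippet; the helper name and the explicit cardinality bound K are illustrative):

```python
from itertools import combinations

def offline_optimal_assortment(X, theta, K):
    """Brute-force Q_t^* = argmax over assortments of size <= K of mu(Q, theta).

    X : (N, d) array of attribute vectors at round t.  Enumerates all
    sum_{k <= K} C(N, k) assortments, so this is only feasible for small N.
    """
    best_Q, best_val = (), 0.0
    for k in range(1, K + 1):
        for Q in combinations(range(X.shape[0]), k):
            val = expected_revenue(X[list(Q)], theta)   # Eq (2)
            if val > best_val:
                best_Q, best_val = Q, val
    return best_Q, best_val
```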

As in the case of contextual linear bandits [Abbasi-Yadkori et al., 2011, Chu et al., 2011], the emphasis here is on making good sequential decisions while tracking the true parameter \theta_{*} with a close estimate \hat{\theta}_{t} (see Section 2.4). Our algorithm (like others) does not necessarily improve the estimate at each round. However, it ensures that \theta_{*} is always within a confidence set around its estimate (with high probability), and the subsequent analysis demonstrates that the aggregate prediction error over all T rounds is bounded.

Our model is fairly general, as the contextual information x_{t,i} may be used to model combined information of the item i in the set \mathcal{N} and the user at round t. Suppose the user at round t is represented by a vector v_{t} and the item i has attribute vector w_{t,i}; then x_{t,i}=\text{vec}(v_{t}w_{t,i}^{\top}) (the vectorized outer product of v_{t} and w_{t,i}). We assume that the platform knows the interaction horizon T.
Additional notation: \mathbf{X}_{\mathcal{Q}_{t}} denotes a design matrix whose columns are the attribute vectors (x_{t,i}) of the items in the assortment \mathcal{Q}_{t}. Also, we now denote \mu(\mathcal{Q}_{t},\theta_{*}) as \mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}) to signify that \mu(\mathcal{Q}_{t},\theta_{*}):\mathbb{R}^{|\mathcal{Q}_{t}|}\to\mathbb{R}.
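As an illustration of this context construction (a sketch with made-up dimensions; `ravel` fixes one of the two standard vectorization conventions):

```python
import numpy as np

v_t = np.array([1.0, 0.3])             # user features at round t
w_ti = np.array([0.5, -1.0, 2.0])      # attributes of item i
x_ti = np.outer(v_t, w_ti).ravel()     # x_{t,i} = vec(v_t w_{t,i}^T), dimension d = 2 * 3
```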

2.3 Assumptions

Following Filippi et al. [2010], Li et al. [2017], Oh & Iyengar [2019], Faury et al. [2020], we introduce the following assumptions on the problem structure.

Assumption 1 (Bounded parameters).

\theta_{*}\in\Theta, where \Theta is a compact subset of \mathbb{R}^{d}. S\coloneqq\max_{\theta\in\Theta}||\theta||_{2} is known to the learner. Further, ||x_{t,i}||_{2}\leq 1 for all values of t and i.

This assumption simplifies analysis and removes scaling constants from the equations.

Assumption 2.

There exists \kappa>0 such that for every item i\in\mathcal{Q}_{t}, for any \mathcal{Q}_{t}\subset\mathcal{N}, and for all rounds t:

\inf_{\mathcal{Q}_{t}\subset\mathcal{N},\theta\in\mathbb{R}^{d}}\mu_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta)(1-\mu_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta))\geq\frac{1}{\kappa}.

Note that \mu_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta)(1-\mu_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta)) denotes the derivative of the softmax function along the i-th direction. This assumption is necessary from likelihood theory [Lehmann & Casella, 2006], as it ensures that the Fisher matrix for the estimation of \theta_{*} is invertible for all possible input instances. We refer to Oh & Iyengar [2019] for a detailed discussion in this regard. We denote by L and M the upper bounds on the first and second derivatives of the softmax function along any component, respectively. We have L,M\leq 1 [Gao & Pavel, 2017] for all problem instances.
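To get a feel for how large \kappa can be for a concrete instance, one can estimate the infimum in Assumption 2 by Monte Carlo over random parameters in the S-ball and random assortments. This sampling scheme is our own illustration (it reuses mnl_choice_probabilities from the earlier sketch) and only yields a crude lower bound on the true \kappa:

```python
import numpy as np

def empirical_kappa(X, S, K, n_samples=10_000, seed=0):
    """Monte Carlo estimate of kappa = 1 / inf mu_i(1 - mu_i) (Assumption 2),
    over assortments of size <= K and parameters with ||theta||_2 <= S.
    Sampling can only underestimate kappa; illustration only.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    worst = np.inf
    for _ in range(n_samples):
        theta = rng.normal(size=d)
        theta *= S * rng.uniform() / np.linalg.norm(theta)   # random point in the S-ball
        Q = rng.choice(N, size=rng.integers(1, K + 1), replace=False)
        p = mnl_choice_probabilities(X[Q], theta)
        worst = min(worst, float(np.min(p * (1.0 - p))))
    return 1.0 / worst
```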

2.4 Maximum likelihood estimate

CB-MNL, described in Algorithm 1, uses a regularized maximum likelihood estimator to compute an estimate \hat{\theta}_{t} of \theta_{*}. Since \{r_{t,i}\}_{i\in\mathcal{Q}_{t}} follows a multinomial distribution, the regularized log-likelihood (negative cross-entropy loss) function up to the (t-1)-th round, under parameter \theta, can be written as:

\displaystyle\mathcal{L}_{t}^{\lambda_{t}}(\theta)=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}r_{s,i}\log(\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta))-\frac{\lambda_{t}}{2}||\theta||_{2}^{2}, (4)

\mathcal{L}_{t}^{\lambda_{t}}(\theta) is concave in \theta for \lambda_{t}>0, and the maximum likelihood estimator is given by calculating the critical point of \mathcal{L}_{t}^{\lambda_{t}}(\theta). Setting \nabla_{\theta}\mathcal{L}_{t}^{\lambda_{t}}(\theta)=0, we get \hat{\theta}_{t} as the solution of:

\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}[\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\hat{\theta}_{t})-r_{s,i}]x_{s,i}+\lambda_{t}\hat{\theta}_{t}=0. (5)

For future analysis we also define

\displaystyle g_{t}(\theta)\coloneqq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta)x_{s,i}+\lambda_{t}\theta,\quad g_{t}(\hat{\theta}_{t})\coloneqq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}r_{s,i}x_{s,i}. (6)

At the start of the interaction, when no contexts have been observed, \hat{\theta}_{t} is well-defined by Eq (5) when \lambda_{t}>0. Therefore, the regularization parameter \lambda_{t} makes CB-MNL free of a burn-in period, in contrast to some previous works, e.g., Filippi et al. [2010].
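Since the objective in Eq (4) is concave, the estimator of Eq (5) can be computed with any off-the-shelf solver; a minimal sketch (assuming NumPy and SciPy, and reusing mnl_choice_probabilities from the earlier snippet; the layout of `history` is our own convention):

```python
import numpy as np
from scipy.optimize import minimize

def fit_theta_hat(history, lam, d):
    """Regularized MLE of Eq (5), obtained by minimizing the negative of Eq (4).

    history : list of (X_Q, r) pairs, one per past round, where X_Q is the
              (|Q_s|, d) matrix of offered attribute vectors and r is the binary
              response array over that assortment (at most one entry equals 1).
    """
    def neg_log_lik(theta):
        val = 0.5 * lam * theta @ theta
        for X_Q, r in history:
            val -= r @ np.log(mnl_choice_probabilities(X_Q, theta))  # Eq (1)
        return val

    return minimize(neg_log_lik, x0=np.zeros(d), method="BFGS").x
```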

2.5 Confidence sets

Algorithm 1 follows the template of optimism in the face of uncertainty (OFU) strategies [Auer et al., 2002, Filippi et al., 2010, Faury et al., 2020]. Technical analysis of OFU algorithms relies on two key factors: the design of the confidence set and the ease of choosing an action using the confidence set.

In Section 4, we derive E_{t}(\delta) (defined below) as a confidence set on \theta_{*} such that \theta_{*}\in E_{t}(\delta),\,\forall t with probability at least 1-\delta (the randomness is over user choices). E_{t}(\delta) is used for making decisions at each round (see Eq (12)) by CB-MNL in Algorithm 1:

E_{t}(\delta)\coloneqq\{\theta\in\Theta,\,\mathcal{L}^{\lambda_{t}}_{t}(\hat{\theta}_{t})-\mathcal{L}^{\lambda_{t}}_{t}(\theta)\leq\beta^{2}_{t}(\delta)\}, (7)

where \beta_{t}(\delta)\coloneqq\gamma_{t}(\delta)+\frac{\gamma_{t}^{2}(\delta)}{\lambda_{t}}, and

\displaystyle\gamma_{t}(\delta)\coloneqq\frac{\sqrt{\lambda_{t}}}{2}+\frac{2}{\sqrt{\lambda_{t}}}\log\left(\frac{(\lambda_{t}+LKt/d)^{d/2}\lambda_{t}^{-d/2}}{\delta}\right)+\frac{2d}{\sqrt{\lambda_{t}}}\log(2). (8)
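Both radii are explicit and cheap to evaluate; a small sketch of Eqs (7)-(8) in code (assuming NumPy; the log of the ratio is expanded to avoid overflow for large d, and L\leq 1 is the derivative bound from Section 2.3):

```python
import numpy as np

def gamma_beta(t, d, lam, delta, K, L=1.0):
    """gamma_t(delta) of Eq (8) and beta_t(delta) = gamma_t + gamma_t^2 / lambda_t."""
    log_ratio = (d / 2.0) * (np.log(lam + L * K * t / d) - np.log(lam)) - np.log(delta)
    gamma = (np.sqrt(lam) / 2.0
             + (2.0 / np.sqrt(lam)) * log_ratio
             + (2.0 * d / np.sqrt(lam)) * np.log(2.0))
    return gamma, gamma + gamma ** 2 / lam
```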

A confidence set similar to E_{t}(\delta) in Eq (7) was recently proposed in Abeille et al. [2021] for the simpler logistic bandit setting. Here, we extend its construction to the MNL setting. The set E_{t}(\delta) is convex since the log-loss function is convex. This makes the decision step in Eq (12) a constrained convex optimization problem. However, it is difficult to prove bounds directly with E_{t}(\delta). Therefore we leverage a result in Faury et al. [2020], where the authors proposed a new Bernstein-like tail inequality for self-normalized vectorial martingales (see Appendix A.1), to derive another confidence set on \theta_{*}:

C_{t}(\delta)\coloneqq\{\theta\in\Theta,\,||g_{t}(\theta)-g_{t}(\hat{\theta}_{t})||_{\mathbf{H}_{t}^{-1}(\theta)}\leq\gamma_{t}(\delta)\}, (9)

where

\mathbf{H}_{t}(\theta_{1})\coloneqq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1})x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}. (10)

\dot{\mu}_{i}(\cdot) is the partial derivative of \mu_{i} in the direction of the i-th component of the assortment, and \gamma_{t}(\delta) is defined in Eq (8). The value of \gamma_{t}(\delta) is an outcome of the concentration result of Faury et al. [2020]. As a consequence of this concentration, we have \theta_{*}\in C_{t}(\delta) with probability at least 1-\delta (the randomness is over user choices). The Bernstein-like concentration inequality used here is similar to Theorem 1 of Abbasi-Yadkori et al. [2011], with the difference that we take into account local variance information (hence local curvature information of the reward function) in defining \mathbf{H}_{t}. The above discussion is formalized in Appendix A.1.

The set C_{t}(\delta) is non-convex, which follows from the non-linearity of \mathbf{H}^{-1}_{t}(\theta). We use C_{t}(\delta) directly to prove regret guarantees. In Section 4.3, we describe how the convex set E_{t}(\delta) is related to C_{t}(\delta) and shares many of its useful properties. Until then, to maintain ease of technical flow and to compare with the previous work of Faury et al. [2020], we assume that the algorithm uses C_{t}(\delta) as the confidence set. We highlight that for the two confidence sets, C_{t}(\delta) and E_{t}(\delta), Algorithm CB-MNL is identical except for the calculation in Eq (12). For later sections, we also define the following norm-inducing design matrix based on all the contexts observed up to time t-1:

\mathbf{V}_{t}\coloneqq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}. (11)
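The matrices \mathbf{H}_{t}(\theta) of Eq (10) and \mathbf{V}_{t} of Eq (11) are simple weighted sums of outer products; a sketch of their construction (same `history` convention and helper as in the earlier snippets, and using \mu_{i}(1-\mu_{i}) for the per-item derivative as in Assumption 2):

```python
import numpy as np

def design_matrices(history, theta, lam, d):
    """H_t(theta) of Eq (10) and V_t of Eq (11) from the past assortments."""
    H = lam * np.eye(d)
    V = lam * np.eye(d)
    for X_Q, _ in history:
        probs = mnl_choice_probabilities(X_Q, theta)      # Eq (1)
        for x, p in zip(X_Q, probs):
            H += p * (1.0 - p) * np.outer(x, x)           # mu_i (1 - mu_i) weighting
            V += np.outer(x, x)
    return H, V
```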

3 Algorithm

At each round t, the attribute parameters (contexts) \{x_{t,1},x_{t,2},\cdots,x_{t,N}\} are made available to the algorithm (online platform) CB-MNL. The algorithm calculates an estimate of the true parameter \theta_{*} according to Eq (5). The algorithm keeps track of the confidence set C_{t}(\delta) (respectively E_{t}(\delta)) as defined in Eq (9) (respectively Eq (7)). Let the set \mathcal{A} contain all feasible assortments of \mathcal{N} with cardinality up to K. The algorithm makes the following decision:

(\mathcal{Q}_{t},\theta_{t})=\operatorname*{argmax}_{A_{t}\in\mathcal{A},\theta\in C_{t}(\delta)}\mu(\mathbf{X}_{A_{t}}^{\top}\theta). (12)

In each round t, the reward of the online platform is denoted by the vector r_{t}. Also, the prediction error of \theta at \mathbf{X}_{\mathcal{Q}_{t}} is defined as:

\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta)\coloneqq|\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})-\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta)|. (13)

\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta) represents the difference in perceived rewards due to the inaccuracy in the estimation of the parameter \theta_{*}.

Remark 1 (Optimistic parameter search).

CB-MNL enforces optimism via an optimistic parameter search (as in, e.g., Abbasi-Yadkori et al. [2011]), which is in contrast to the use of an exploration bonus as seen in Faury et al. [2020], Filippi et al. [2010]. Optimistic parameter search provides a cleaner description of the learning strategy. In non-linear reward models, the two approaches may not follow a similar trajectory but may have overlapping analysis styles (see Filippi et al. [2010] for a short discussion).

Remark 2 (Tractable decision-making).

In Section 4.3, we show that the decision problem of Eq (12) can be relaxed to a convex optimization problem by using the convex set E_{t}(\delta) instead of C_{t}(\delta), while keeping the regret performance of Algorithm 1 intact up to constant factors.

Input: regularization parameters \lambda_{t},\forall\,t\in[T], N distinct items: \mathcal{N}, K
for t \geq 1 do

       Given: Set \{x_{t,1},x_{t,2},\cdots,x_{t,N}\} of d-dimensional parameters.
Estimate \hat{\theta}_{t} according to Eq (5).
Construct C_{t}(\delta) as defined in Eq (9).
Construct the set \mathcal{A} of all feasible assortments of \mathcal{N} with cardinality up to K.
Play (\mathcal{Q}_{t},\theta_{t})=\operatorname*{argmax}_{A_{t}\in\mathcal{A},\theta\in C_{t}(\delta)}\mu(\mathbf{X}_{A_{t}}^{\top}\theta).
Observe rewards \mathbf{r}_{t}.
end for
Algorithm 1 CB-MNL
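The decision step in Eq (12) is a joint search over assortments and over parameters in the (non-convex) set C_{t}(\delta). The sketch below is only a crude illustrative approximation of one CB-MNL round, not the paper's exact solver: it samples candidate parameters around \hat{\theta}_{t}, filters them with the membership condition of Eq (9), and enumerates assortments, reusing the helper functions from the earlier snippets (mnl_choice_probabilities, expected_revenue, design_matrices, fit_theta_hat):

```python
import numpy as np
from itertools import combinations

def g_t(history, theta, lam):
    """g_t(theta) of Eq (6)."""
    out = lam * theta
    for X_Q, _ in history:
        out += X_Q.T @ mnl_choice_probabilities(X_Q, theta)
    return out

def cb_mnl_round(X, history, lam, gamma, K, n_samples=200, seed=0):
    """One (approximate) round of CB-MNL: estimate theta_hat, build an
    approximate C_t(delta) by rejection sampling, then pick the optimistic
    (assortment, theta) pair as in Eq (12)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    theta_hat = fit_theta_hat(history, lam, d)               # Eq (5)
    H_hat, _ = design_matrices(history, theta_hat, lam, d)   # Eq (10) at theta_hat
    g_hat = g_t(history, theta_hat, lam)
    root = np.linalg.cholesky(np.linalg.inv(H_hat))          # proposal scale ~ gamma * H^{-1/2}

    candidates = [theta_hat]
    for _ in range(n_samples):
        theta = theta_hat + gamma * root @ rng.normal(size=d)
        H_theta, _ = design_matrices(history, theta, lam, d)
        diff = g_t(history, theta, lam) - g_hat
        if diff @ np.linalg.solve(H_theta, diff) <= gamma ** 2:   # membership test, Eq (9)
            candidates.append(theta)

    best_Q, best_theta, best_val = None, None, -np.inf
    for k in range(1, K + 1):
        for Q in combinations(range(N), k):
            for theta in candidates:
                val = expected_revenue(X[list(Q)], theta)
                if val > best_val:
                    best_Q, best_theta, best_val = Q, theta, val
    return best_Q, best_theta
```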

4 Main results

We present a regret upper bound for the CB-MNL algorithm in Theorem 1.

Theorem 1.

With probability at least 1-\delta over the randomness of user choices:

\displaystyle\mathbf{R}_{T}\leq C_{1}\gamma_{T}(\delta)\sqrt{2d\log\left(1+\frac{LKT}{d\lambda_{T}}\right)T}+C_{2}\kappa\gamma_{T}(\delta)^{2}d\log\left(1+\frac{KT}{d\lambda_{T}}\right),

where the constants are given as C_{1}=(4+8S), C_{2}=4(4+8S)^{3/2}M, and \gamma_{T}(\delta) is given by Eq (8).

The formal proof is deferred to the technical Appendix; in this section, we discuss the key technical ideas leading to this result. The order dependence on the model parameters is made explicit by the following corollary.

Corollary 2.

Setting the regularization parameter \lambda_{T}=\mathrm{O}(d\log(KT)), where K is the maximum cardinality of the assortments to be selected, makes \gamma_{T}(\delta)=\mathrm{O}(d^{1/2}\log^{1/2}(KT)). The regret upper bound is then given by \mathbf{R}_{T}=\mathrm{O}(d\sqrt{T}\log(KT)+\kappa d^{2}\log^{2}(KT)).

Recall the expression for cumulative regret

\displaystyle\mathbf{R}_{T}=\sum_{t=1}^{T}[\mu(\mathbf{X}_{\mathcal{Q}^{*}_{t}}^{\top}\theta_{*})-\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})]
=\sum_{t=1}^{T}\underbrace{[\mu(\mathbf{X}_{\mathcal{Q}^{*}_{t}}^{\top}\theta_{*})-\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{t})]}_{\text{pessimism}}+\sum_{t=1}^{T}\underbrace{[\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{t})-\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})]}_{\text{prediction error}},

where pessimism is the additive inverse of the optimism (the difference between the payoffs under the true parameter and those estimated by CB-MNL). Due to optimistic decision-making and the fact that \theta_{*}\in C_{t}(\delta) (see Eq (12)), the pessimism term is non-positive for all rounds. Thus, the regret is upper bounded by the sum of the prediction error over T rounds. In Section 4.1, we derive an upper bound on the prediction error for a single round t. We also contrast with the previous works Filippi et al. [2010], Li et al. [2017], Oh & Iyengar [2021] and point out specific technical differences which allow us to use a Bernstein-like tail concentration inequality and therefore achieve stronger regret guarantees. In Section 4.2, we describe the additional steps leading to the statement of Theorem 1. The style of the arguments is simpler and shorter than that in Faury et al. [2020]. Finally, in Section 4.3, we discuss the relationship between the two confidence sets C_{t}(\delta) and E_{t}(\delta) and show that even using E_{t}(\delta) in place of C_{t}(\delta), we obtain regret upper bounds with the same parameter dependence as in Corollary 2. Lemma 3 gives the expression for an upper bound on the prediction error.

4.1 Bounds on prediction error

Lemma 3.

For \theta_{t}\in C_{t}(\delta) (see Eq (12)), with probability at least 1-\delta:

\displaystyle\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t})\leq(2+4S)\gamma_{t}(\delta)\sum_{i\in\mathcal{Q}_{t}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})||x_{t,i}||_{\mathbf{H}_{t}^{-1}(\theta_{*})}+4\kappa(1+2S)^{2}M\gamma_{t}(\delta)^{2}\sum_{i\in\mathcal{Q}_{t}}||x_{t,i}||^{2}_{\mathbf{V}_{t}^{-1}}, (14)

where \mathbf{V}_{t} is defined in Eq (11).

The detailed proof is provided in A.4. Here we develop the main ideas leading to this result and establish an analytical flow that will be re-used while working with the convex confidence set E_{t}(\delta) in Section 4.3. In the previous works Filippi et al. [2010], Li et al. [2017], Oh & Iyengar [2021], global upper and lower bounds on the derivative of the link function (here, the softmax) are employed early in the analysis, leading to a loss of the local information carried by the MLE estimate \theta_{t}. In those previous works, the first step was to upper bound the prediction error by the Lipschitz constant (which is a global property) of the softmax (or sigmoid, in the logistic bandit case) function, as:

|\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})-\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{t})|\leq L|\mathbf{X}_{\mathcal{Q}_{t}}^{\top}(\theta_{*}-\theta_{t})|. (15)

For building intuition, assume that \mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*} lies in a “flatter” region of \mu(\cdot); then Eq (15) is a loose upper bound.

Next, we show how using a global lower bound in the form of \kappa (see Assumption 2) early in the analysis in the works Filippi et al. [2010], Li et al. [2017], Oh & Iyengar [2021] leads to a loose prediction error upper bound. For this, we first introduce new notation:

\displaystyle\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t},\theta_{*})x_{t,i}^{\top}(\theta_{*}-\theta_{t})\coloneqq\mu_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})-\mu_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{t}). (16)

We also define \mathbf{G}_{t}(\theta_{t},\theta_{*})\coloneqq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{s}},\theta_{t},\theta_{*})x_{s,i}x_{s,i}^{\top}+\lambda\mathbf{I}_{d}. From Eq (6), we obtain (see A.2 for details of this derivation):

\displaystyle g(\theta_{*})-g(\theta_{t})=\mathbf{G}_{t}(\theta_{t},\theta_{*})(\theta_{*}-\theta_{t}). (17)

From Assumption 2, \mathbf{G}_{t}(\theta_{t},\theta_{*}) is a positive definite matrix for \lambda>0 and therefore can be used to define a norm. Using the Cauchy-Schwarz inequality with Eq (17) simplifies the prediction error as:

\displaystyle\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t})\leq\big{|}\sum_{i\in\mathcal{Q}_{t}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t},\theta_{*})||x_{t,i}||_{\mathbf{G}_{t}^{-1}(\theta_{t},\theta_{*})}||\theta_{*}-\theta_{t}||_{\mathbf{G}_{t}(\theta_{t},\theta_{*})}\big{|}. (18)

The previous literature Filippi et al. [2010], Oh & Iyengar [2021] has utilized \mathbf{G}_{t}(\theta_{t},\theta_{*})\succeq\kappa^{-1}\mathbf{V}_{t} and upper bounded \alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t},\theta_{*}) by the Lipschitz constant (directly at this stage), thereby incurring loose regret bounds. Instead, here we work with the norm induced by \mathbf{H}_{t}(\theta_{*}) and retain the location information in \alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t},\theta_{*}).

\displaystyle\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t})\leq\big{|}\sum_{i\in\mathcal{Q}_{t}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t},\theta_{*})||x_{t,i}||_{\mathbf{H}_{t}^{-1}(\theta_{*})}||\theta_{*}-\theta_{t}||_{\mathbf{H}_{t}(\theta_{*})}\big{|}. (19)

It is not straightforward to bound ||\theta_{*}-\theta_{t}||_{\mathbf{H}_{t}(\theta_{*})} directly; we extend the self-concordance style relations from Faury et al. [2020] to the multinomial logit function, which allows us to relate \mathbf{G}_{t}^{-1}(\theta_{t},\theta_{*}) and \mathbf{H}_{t}^{-1}(\theta_{t}) (or \mathbf{H}_{t}^{-1}(\theta_{*})) and to develop a bound on ||\theta_{*}-\theta||_{\mathbf{H}_{t}(\theta_{*})}.

Lemma 4.

For all \theta_{1},\theta_{2}\in\Theta, the following inequalities hold:

\displaystyle\mathbf{G}_{t}(\theta_{1},\theta_{2})\succeq(1+2S)^{-1}\mathbf{H}_{t}(\theta_{1}),\quad\mathbf{G}_{t}(\theta_{1},\theta_{2})\succeq(1+2S)^{-1}\mathbf{H}_{t}(\theta_{2}).
Lemma 5.

For \theta_{t}\in C_{t}(\delta), we have the following relation with probability at least 1-\delta: ||\theta_{t}-\theta_{*}||_{\mathbf{H}_{t}(\theta_{*})}\leq 2(1+2S)\gamma_{t}(\delta).

Proofs of Lemmas 4 and 5 have been deferred to A.3. Notice that Lemma 5 is a key result which characterizes the quality of the confidence set C_{t}(\delta). Recall that \gamma_{T}(\delta)=\mathrm{O}(\sqrt{d\log(KT)}) (with a tuned \lambda as in Corollary 2). Therefore, any \theta\in C_{t}(\delta) is not too far from the optimal \theta_{*} under the norm induced by ||\cdot||_{\mathbf{H}_{t}(\theta_{*})}. Now, we use Lemma 5 in Eq (19) to get:

\displaystyle\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t})\leq 2(1+2S)\gamma_{t}(\delta)\sum_{i\in\mathcal{Q}_{t}}|\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta_{t})|\,||x_{t,i}||_{\mathbf{H}_{t}^{-1}(\theta_{*})}. (20)

The quantity \alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t},\theta_{*}) described in Eq (16) is upper bounded by the following result.

Lemma 6.

For the assortment \mathcal{Q}_{t} chosen by the algorithm CB-MNL as given by Eq (12) and any \theta\in C_{t}(\delta), the following holds with probability at least 1-\delta: \alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta)\leq\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})+2(1+2S)M\gamma_{t}(\delta)||x_{t,i}||_{\mathbf{H}_{t}^{-1}(\theta_{*})}.

We use the result of Lemma 6 in Eq (20), followed by an application of Lemma 4 and the relation \sum_{i\in\mathcal{Q}_{t}}||x_{t,i}||^{2}_{\mathbf{H}_{t}^{-1}(\theta_{t})}\leq\kappa\sum_{i\in\mathcal{Q}_{t}}||x_{t,i}||^{2}_{\mathbf{V}_{t}^{-1}} from Assumption 2, to arrive at the statement of Lemma 3.

4.2 Regret calculation

The complete technical work is provided in A.4 and A.5. The key step to retrieve the upper bounds of Theorem 1 is to sum the prediction error of Lemma 3 over T rounds, as in Eq (3). Compared to the previous literature Filippi et al. [2010], Li et al. [2017], Oh & Iyengar [2021], the term \sum_{i\in\mathcal{Q}_{t}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})||x_{t,i}||_{\mathbf{H}_{t}^{-1}(\theta_{*})} is new here. Further, our treatment of this term is much simpler and more straightforward than that in Faury et al. [2020].

4.3 Convex relaxation of the optimization step

Sections 4.1–4.2 provide an analytical framework for calculating the regret bounds given: (1) the confidence set (Eq (9)) with the guarantee that \theta_{*}\in C_{t}(\delta) with probability at least 1-\delta; (2) the assurance that the confidence set is small (Lemma 5). In order to re-use the previously developed techniques, we show: (1) E_{t}(\delta)\supseteq C_{t}(\delta) (see Eq (7)), and therefore \theta_{*}\in E_{t}(\delta) with probability at least 1-\delta (see Lemma 7); (2) an analog of Lemma 5 using E_{t}(\delta) (see Lemma 8). The proof of Theorem 1 is then repeated while using Lemma 8, following the steps sketched in Sections 4.1–4.2. The order dependence of the regret upper bound is retained (see Corollary 2).

Lemma 7.

E_{t}(\delta)\supseteq C_{t}(\delta); therefore, for any \theta\in C_{t}(\delta), we also have \theta\in E_{t}(\delta) (see Eq (7)).

The complete proof is provided in A.6. We highlight the usefulness of Lemma 7: since all of the set C_{t}(\delta) lies within E_{t}(\delta), the concentration inequality also implies \mathbb{P}(\forall t\geq 1,\theta_{*}\in E_{t}(\delta))\geq 1-\delta.

Lemma 8.

Under the event \theta_{*}\in C_{t}(\delta), the following holds \forall\,\theta\in E_{t}(\delta):

\displaystyle||\theta-\theta_{*}||_{\mathbf{H}_{t}(\theta_{*})}\leq(2+2S)\gamma_{t}(\delta)+2\sqrt{1+S}\,\beta_{t}(\delta).

When \lambda_{t}=\mathrm{O}(d\log(Kt)), then \gamma_{t}(\delta)=\tilde{\mathrm{O}}(\sqrt{d\log(t)}), \beta_{t}(\delta)=\tilde{\mathrm{O}}(\sqrt{d\log(t)}), and ||\theta-\theta_{*}||_{\mathbf{H}_{t}(\theta_{*})}=\tilde{\mathrm{O}}(\sqrt{d\log(t)}).

The complete proof can be found in A.6.

5 Numerical experiments

In this section, we compare the empirical performance of our proposed algorithm CB-MNL with the previous state of the art in the MNL contextual bandit literature: UCB-MNL [Oh & Iyengar, 2021] and TS-MNL [Oh & Iyengar, 2019], on artificial data. We focus on performance comparison for varying values of the parameter \kappa, and show in Figure 2 that our algorithm has consistently superior performance for different \kappa values. This highlights the primary contribution of our theoretical analysis. Refer to A.8 for additional empirical analysis.

Figure 2: Comparison of cumulative regret as a function of time for varying \kappa (left to right: \kappa\gg\sqrt{T}, \kappa<\sqrt{T}, and \kappa\ll\sqrt{T}).

For each experimental configuration, we consider a problem instance with inventory size N=15, instance dimension d=5, maximum assortment size K=4, and time horizon T=100, averaged over 25 Monte Carlo simulation runs. \theta_{*}\in\mathbb{R}^{d} is a d-dimensional random vector with each coordinate in [0,1], independently and uniformly distributed. The contexts follow a multivariate Gaussian distribution. The \lambda parameter is manually tuned. Algorithm CB-MNL only knows the values of N,T,K,d. In contrast, the algorithms TS-MNL and UCB-MNL also need to know the value of \kappa for their implementation. We observe that our algorithm CB-MNL has robust performance for varying values of \kappa.
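For concreteness, a sketch of how such a synthetic instance can be generated (assuming NumPy; the exact covariance of the Gaussian contexts and the rescaling used to satisfy Assumption 1 are our own choices, not specified above):

```python
import numpy as np

N, d, K, T = 15, 5, 4, 100
rng = np.random.default_rng(42)
theta_star = rng.uniform(0.0, 1.0, size=d)     # true parameter, coordinates in [0, 1]

def sample_contexts():
    """Draw the N attribute vectors for one round from a multivariate Gaussian
    and rescale so that ||x_{t,i}||_2 <= 1 (Assumption 1)."""
    X = rng.multivariate_normal(np.zeros(d), np.eye(d) / d, size=N)
    return X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
```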

6 Conclusion and discussion

In this work, we proposed an optimistic algorithm for learning under the MNL contextual bandit framework. Using techniques from Faury et al. [2020], we developed an improved technical analysis to deal with the non-linear nature of the MNL reward function. As a result, the leading term in our regret bound does not suffer from the problem-dependent parameter \kappa. This contribution is significant as \kappa can be very large (refer to Section 1.2). For example, for \kappa=\mathrm{O}(\sqrt{T}), the results of Oh & Iyengar [2021, 2019] suffer \tilde{\mathrm{O}}(T) regret, while our algorithm continues to enjoy \tilde{\mathrm{O}}(\sqrt{T}) regret. Further, we also presented a tractable version of the decision-making step of the algorithm by constructing a convex relaxation of the confidence set.

Our result is still \mathrm{O}(\sqrt{d}) away from the minimax lower bound of Chu et al. [2011] known for the linear contextual bandit. In the case of logistic bandits, Li et al. [2017] make an i.i.d. assumption on the contexts to bridge the gap (however, they still retain the \kappa factor). Improving the worst-case regret bound by \mathrm{O}(\sqrt{d}) while keeping \kappa as an additive term is an open problem. It may be possible to improve the dependence on \kappa by using a higher-order approximation for the estimation error. Finding a lower bound on the dependence on \kappa is an interesting open problem and may require newer techniques than those presented in this work.

Oh & Iyengar [2019] gave a Thompson sampling (TS) based learning strategy for the MNL contextual bandit. Thompson sampling approaches may not have to search the entire action space to take decisions as optimistic algorithms (such as ours) do. TS-based strategies are likely to have better empirical performance. Authors in Oh & Iyengar [2019] use a confidence set based analysis to bound the estimation error term. However, results in Oh & Iyengar [2019] suffer from the prohibitive scaling of the problem-dependent parameter κ\kappa that we have overcome here. Modifying our analysis for a TS-based learning strategy could bring together the best of both worlds.

References

  • Abbasi-Yadkori et al. [2011] Abbasi-Yadkori, Y., Pál, D., & Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2312–2320.
  • Abeille et al. [2021] Abeille, M., Faury, L., & Calauzènes, C. (2021). Instance-wise minimax-optimal algorithms for logistic bandits. In International Conference on Artificial Intelligence and Statistics (pp. 3691–3699). PMLR.
  • Agrawal et al. [2017] Agrawal, S., Avadhanula, V., Goyal, V., & Zeevi, A. (2017). Thompson sampling for the mnl-bandit. In Conference on Learning Theory (pp. 76–78). PMLR.
  • Agrawal et al. [2019] Agrawal, S., Avadhanula, V., Goyal, V., & Zeevi, A. (2019). Mnl-bandit: A dynamic learning approach to assortment selection. Operations Research, 67, 1453–1485. doi:10.1287/opre.2018.1832.
  • Alfandari et al. [2021] Alfandari, L., Hassanzadeh, A., & Ljubić, I. (2021). An exact method for assortment optimization under the nested logit model. European Journal of Operational Research, 291, 830–845. doi:https://doi.org/10.1016/j.ejor.2020.12.007.
  • Amani & Thrampoulidis [2021] Amani, S., & Thrampoulidis, C. (2021). Ucb-based algorithms for multinomial logistic regression bandits. Advances in Neural Information Processing Systems, 34, 2913–2924.
  • Auer et al. [2002] Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine learning, 47, 235–256.
  • Avadhanula [2019] Avadhanula, V. (2019). The MNL-Bandit Problem: Theory and Applications. Ph.D. thesis Columbia University.
  • Bach [2010] Bach, F. (2010). Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4, 384 – 414. URL: https://doi.org/10.1214/09-EJS521. doi:10.1214/09-EJS521.
  • Chen et al. [2020] Chen, X., Wang, Y., & Zhou, Y. (2020). Dynamic assortment optimization with changing contextual information. Journal of Machine Learning Research, 21, 1–44.
  • Cheung & Simchi-Levi [2017] Cheung, W. C., & Simchi-Levi, D. (2017). Thompson sampling for online personalized assortment optimization problems with multinomial logit choice models. Available at SSRN 3075658, .
  • Chu et al. [2011] Chu, W., Li, L., Reyzin, L., & Schapire, R. (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (pp. 208–214).
  • Dani et al. [2008] Dani, V., Hayes, T. P., & Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. In Conference on Learning Theory.
  • Faury et al. [2020] Faury, L., Abeille, M., Calauzènes, C., & Fercoq, O. (2020). Improved optimistic algorithms for logistic bandits. In International Conference on Machine Learning (pp. 3052–3060). PMLR.
  • Feldman et al. [2018] Feldman, J., Zhang, D., Liu, X., & Zhang, N. (2018). Taking assortment optimization from theory to practice: Evidence from large field experiments on alibaba. Available at SSRN, .
  • Filippi et al. [2010] Filippi, S., Cappe, O., Garivier, A., & Szepesvári, C. (2010). Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems (pp. 586–594).
  • Flores et al. [2019] Flores, A., Berbeglia, G., & Van Hentenryck, P. (2019). Assortment optimization under the sequential multinomial logit model. European Journal of Operational Research, 273, 1052–1064. doi:https://doi.org/10.1016/j.ejor.2018.08.047.
  • Gao & Pavel [2017] Gao, B., & Pavel, L. (2017). On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805, .
  • Grant & Szechtman [2021] Grant, J. A., & Szechtman, R. (2021). Filtered poisson process bandit on a continuum. European Journal of Operational Research, 295, 575–586. doi:https://doi.org/10.1016/j.ejor.2021.03.033.
  • Kök & Fisher [2007] Kök, A. G., & Fisher, M. L. (2007). Demand estimation and assortment optimization under substitution: Methodology and application. Operations Research, 55, 1001–1021. doi:https://doi.org/10.1287/opre.1070.0409.
  • Lehmann & Casella [2006] Lehmann, E. L., & Casella, G. (2006). Theory of point estimation. Springer Science & Business Media.
  • Li et al. [2017] Li, L., Lu, Y., & Zhou, D. (2017). Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 (pp. 2071–2080).
  • Oh & Iyengar [2019] Oh, M.-h., & Iyengar, G. (2019). Thompson sampling for multinomial logit contextual bandits. In Advances in Neural Information Processing Systems (pp. 3151–3161).
  • Oh & Iyengar [2021] Oh, M.-h., & Iyengar, G. (2021). Multinomial logit contextual bandits: Provable optimality and practicality. In Proceedings of the AAAI Conference on Artificial Intelligence (pp. 9205–9213). volume 35.
  • Ou et al. [2018] Ou, M., Li, N., Zhu, S., & Jin, R. (2018). Multinomial logit bandit with linear utility functions. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (pp. 2602–2608).
  • Perivier & Goyal [2022] Perivier, N., & Goyal, V. (2022). Dynamic pricing and assortment under a contextual mnl demand. Advances in Neural Information Processing Systems, 35, 3461–3474.
  • Rusmevichientong et al. [2010] Rusmevichientong, P., Shen, Z.-J. M., & Shmoys, D. B. (2010). Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations research, 58, 1666–1680. doi:https://doi.org/10.1287/opre.1100.0866.
  • Rusmevichientong & Tsitsiklis [2010] Rusmevichientong, P., & Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35, 395–411. doi:https://doi.org/10.1287/moor.1100.0446.
  • Sauré & Zeevi [2013] Sauré, D., & Zeevi, A. (2013). Optimal dynamic assortment planning with demand learning. Manufacturing & Service Operations Management, 15, 387–404. doi:https://doi.org/10.1287/msom.2013.0429.
  • Timonina-Farkas et al. [2020] Timonina-Farkas, A., Katsifou, A., & Seifert, R. W. (2020). Product assortment and space allocation strategies to attract loyal and non-loyal customers. European Journal of Operational Research, 285, 1058–1076. doi:https://doi.org/10.1016/j.ejor.2020.02.019.
  • Wang et al. [2020] Wang, X., Zhao, X., & Liu, B. (2020). Design and pricing of extended warranty menus based on the multinomial logit choice model. European Journal of Operational Research, 287, 237–250. doi:https://doi.org/10.1016/j.ejor.2020.05.012.
  • Xu et al. [2021] Xu, J., Chen, L., & Tang, O. (2021). An online algorithm for the risk-aware restless bandit. European Journal of Operational Research, 290, 622–639. doi:https://doi.org/10.1016/j.ejor.2020.08.028.
  • Zhang & Lin [2015] Zhang, Y., & Lin, X. (2015). Disco: Distributed optimization for self-concordant empirical loss. In International conference on machine learning (pp. 362–370).

Appendix A Appendix

A.1 Confidence set

In this section, we justify the design of the confidence set defined in Eq (9). This particular choice is based on the following concentration inequality for self-normalized vectorial martingales.

Theorem 9.

Appears as Theorem 4 in Abeille et al. [2021]. Let {t}t=1\{\mathcal{F}_{t}\}_{t=1}^{\infty} be a filtration. Let {xt}t=1\{x_{t}\}_{t=1}^{\infty} be a stochastic process in 2(d)\mathcal{B}_{2}(d) such that xtx_{t} is t\mathcal{F}_{t}-measurable. Let {εt}t=2\{\varepsilon_{t}\}_{t=2}^{\infty} be a martingale difference sequence such that εt+1\varepsilon_{t+1} is t+1\mathcal{F}_{t+1}-measurable. Furthermore, assume that, conditionally on t\mathcal{F}_{t}, we have |εt+1|1|\varepsilon_{t+1}|\leq 1 almost surely, and denote σt2𝔼[εt+12|t]\sigma_{t}^{2}\coloneqq\mathbb{E}\left[\varepsilon_{t+1}^{2}|\mathcal{F}_{t}\right]. Let {λt}t=1\{\lambda_{t}\}_{t=1}^{\infty} be a predictable sequence of non-negative scalars. Define:

𝐇ts=1t1σs2xsxsT+λt𝐈d,Sts=1t1εs+1xs.\displaystyle\mathbf{H}_{t}\coloneqq\sum_{s=1}^{t-1}\sigma_{s}^{2}x_{s}x_{s}^{T}+\lambda_{t}\mathbf{I}_{d},\qquad S_{t}\coloneqq\sum_{s=1}^{t-1}\varepsilon_{s+1}x_{s}.

Then for any δ(0,1]\delta\in(0,1]:

(t1,St𝐇t1λt2+2λtlog(det(𝐇t)12λtd2δ)+2λtdlog(2))δ.\displaystyle\mathbb{P}\Bigg{(}\exists t\geq 1,\,\left\lVert S_{t}\right\rVert_{\mathbf{H}_{t}^{-1}}\!\geq\!\frac{\sqrt{\lambda_{t}}}{2}\!+\!\frac{2}{\sqrt{\lambda_{t}}}\log\!\left(\frac{\det\left(\mathbf{H}_{t}\right)^{\frac{1}{2}}\!\lambda_{t}^{-\frac{d}{2}}}{\delta}\right)+\frac{2}{\sqrt{\lambda_{t}}}d\log(2)\Bigg{)}\leq\delta.

Theorem 9 cannot be used directly in our setting because, in the MNL model, the actual rewards (for any time step ss) {rs,i}iQs\{r_{s,i}\}_{i\in Q_{s}} are correlated. Hence we instead use a concentration inequality that is almost identical to Theorem 9 (differing only in minor constants) and appears as Theorem C.6 in Perivier & Goyal [2022].

Lemma 10 (confidence bounds for multinomial logistic rewards).

With θ^t\hat{\theta}_{t} the regularized maximum log-likelihood estimate defined in Eq (5), the following holds with probability at least 1δ1-\delta:

t 1,gt(θ^t)gt(θ)𝐇t1γt(δ),\displaystyle\forall t\,\geq\,1,\quad\mathinner{\!\left\lVert g_{t}(\hat{\theta}_{t})-g_{t}(\theta_{*})\right\rVert}_{\mathbf{H}_{t}^{-1}}\leq\gamma_{t}\mathinner{\left(\delta\right)},

where 𝐇t(θ1)=s=1t1i𝒬sμ˙i(𝐗𝒬sθ1)xs,ixs,i+λ𝐈d\mathbf{H}_{t}\mathinner{\left(\theta_{1}\right)}=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1}\right)}x_{s,i}x_{s,i}^{\top}+\lambda\mathbf{I}_{d} and gt()g_{t}(\cdot) is defined in Eq (6).

Proof.

θ^t\hat{\theta}_{t} is the maximizer of the regularized log-likelihood:

tλt(θ)\displaystyle\mathcal{L}_{t}^{\lambda_{t}}(\theta) =s=1t1i𝒬srs,ilog(μi(𝐗𝒬sθ))λt2θ22,\displaystyle=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}r_{s,i}\log\left(\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta)\right)-\frac{\lambda_{t}}{2}\mathinner{\!\left\lVert\theta\right\rVert}_{2}^{2},

where μi(𝐗𝒬sθ)\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta) is given by Eq (1) as exs,iθ1+j𝒬sexs,jθ\frac{e^{x^{\top}_{s,i}\theta}}{1+\sum_{j\in\mathcal{Q}_{s}}e^{x^{\top}_{s,j}\theta}}. Solving for θtλt=0\nabla_{\theta}\mathcal{L}_{t}^{\lambda_{t}}=0, we obtain:

\displaystyle\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\hat{\theta}_{t})x_{s,i}+\lambda_{t}\hat{\theta}_{t}=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}r_{s,i}x_{s,i}

This result, combined with the definition of gt(θ)=s=1t1i𝒬sμi(𝐗𝒬sθ)xs,i+λtθg_{t}(\theta_{*})=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})x_{s,i}\\ +\lambda_{t}\theta_{*} yields:

gt(θ^t)gt(θ)\displaystyle g_{t}(\hat{\theta}_{t})-g_{t}(\theta_{*}) =s=1t1i𝒬sεs,ixs,iλtθ\displaystyle=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\varepsilon_{s,i}x_{s,i}-\lambda_{t}\theta_{*}
=St,Kλtθ\displaystyle=S_{t,K}-\lambda_{t}\theta_{*}

where we denoted εs,irs,iμi(𝐗𝒬sθ)\varepsilon_{s,i}\coloneqq r_{s,i}-\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*}) for all s1s\geq 1 and i[K]i\in[K] and St,Ks=1t1i𝒬sεs,ixs,iS_{t,K}\coloneqq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\varepsilon_{s,i}x_{s,i} for all t1t\geq 1. Since 𝐇t(θ)λt𝐈d\mathbf{H}_{t}(\theta_{*})\succeq\lambda_{t}\mathbf{I}_{d} by definition, we have 𝐇t1(θ)λt1𝐈d\mathbf{H}^{-1}_{t}(\theta_{*})\preceq\lambda_{t}^{-1}\mathbf{I}_{d}, and hence \mathinner{\!\left\lVert\lambda_{t}\theta_{*}\right\rVert}_{\mathbf{H}^{-1}_{t}(\theta_{*})}\leq\sqrt{\lambda_{t}}\mathinner{\!\left\lVert\theta_{*}\right\rVert}_{2}\leq\sqrt{\lambda_{t}}S. Later in the proof of Theorem 1, we present our choice of λt\lambda_{t}, which always ensures λt1\lambda_{t}\geq 1.

gt(θ^t)gt(θ)𝐇t1(θ)St,K𝐇t1(θ)+λtS\displaystyle\mathinner{\!\left\lVert g_{t}(\hat{\theta}_{t})-g_{t}(\theta_{*})\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\leq\mathinner{\!\left\lVert S_{t,K}\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}+\sqrt{\lambda_{t}}S (21)

Conditioned on the filtration set t,i\mathcal{F}_{t,i} (see Section 2.2 for the definition of the filtration set), εs,i\varepsilon_{s,i} is a martingale difference and is bounded by 11, since we assume the maximum reward accrued at any round is upper bounded by 11. We calculate, for all s1s\geq 1:

𝔼[εs,i2|t]=𝔼[(rs,iμi(𝐗𝒬sθ))2|t]\displaystyle\mathbb{E}\left[\varepsilon^{2}_{s,i}\big{|}\mathcal{F}_{t}\right]=\mathbb{E}\left[\mathinner{\left(r_{s,i}-\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})\right)}^{2}\bigg{|}\mathcal{F}_{t}\right]
=\displaystyle= 𝕍[rs,i|t]=μi(𝐗𝒬sθ)(1μi(𝐗𝒬sθ)).\displaystyle\mathbb{V}\left[r_{s,i}|\mathcal{F}_{t}\right]=\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})\mathinner{\left(1-\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})\right)}. (22)

Also from Remark 3, we have :

μ˙i(𝐗𝒬sθ)=μi(𝐗𝒬sθ)(1μi(𝐗𝒬sθ)).\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})=\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})\mathinner{\left(1-\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})\right)}.

Therefore setting Ht{H}_{t} as 𝐇t(θ)=s=1t1i𝒬sμ˙i(𝐗𝒬sθ)xs,ixs,i+λt𝐈d\mathbf{H}_{t}(\theta_{*})=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d} and UtU_{t} as St,KS_{t,K} we invoke an instance of Theorem C.6 in Perivier & Goyal [2022] to obtain:

1δ\displaystyle 1-\delta\leq (t1,St𝐇t1(θ)λt2+2λtlog(2ddet(𝐇t(θ))1/2λtd/2δ)\displaystyle\mathbb{P}\left(\forall t\geq 1,\mathinner{\!\left\lVert S_{t}\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\leq\frac{\sqrt{\lambda_{t}}}{2}+\frac{2}{\sqrt{\lambda_{t}}}\log\left(\frac{2^{d}\det(\mathbf{H}_{t}(\theta_{*}))^{1/2}\lambda_{t}^{-d/2}}{\delta}\right)\right.
+2dλtlog(2))\displaystyle+\left.\frac{2d}{\sqrt{\lambda_{t}}}\log(2)\right)

We simplify det(𝐇t(θ))\det(\mathbf{H}_{t}(\theta_{*})), using the fact that the multinomial logistic function is LL-Lipschitz (see Assumption 2):

det(𝐇t(θ))=\displaystyle\det(\mathbf{H}_{t}(\theta_{*}))= det(s=1t1i𝒬sμ˙i(𝐗𝒬sθ)xs,ixs,i+λt𝐈d)\displaystyle\det\left(\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}\right)
\displaystyle\leq Lddet(s=1t1i𝒬sxs,ixs,i+λtL𝐈d).\displaystyle L^{d}\det\left(\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}x_{s,i}x_{s,i}^{\top}+\frac{\lambda_{t}}{L}\mathbf{I}_{d}\right).

Further, using Lemma 18 and the bound xs,i21\mathinner{\!\left\lVert x_{s,i}\right\rVert}_{2}\leq 1, we write:

Lddet(s=1t1i𝒬sxs,ixs,i+λtL𝐈d)(λt+LKtd)d.\displaystyle L^{d}\det\left(\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}x_{s,i}x_{s,i}^{\top}+\frac{\lambda_{t}}{L}\mathbf{I}_{d}\right)\leq\left(\lambda_{t}+\frac{LKt}{d}\right)^{d}.

Thus we simplify the bound above as:

1δ\displaystyle 1-\delta\leq (t1,St𝐇t1(θ)λt2+2λtlog((λt+LKt/d)d/2λtd/2δ)\displaystyle\mathbb{P}\left(\forall t\geq 1,\mathinner{\!\left\lVert S_{t}\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\leq\frac{\sqrt{\lambda_{t}}}{2}+\frac{2}{\sqrt{\lambda_{t}}}\log\left(\frac{\left(\lambda_{t}+LKt/d\right)^{d/2}\lambda_{t}^{-d/2}}{\delta}\right)\right.
+2dλtlog(2))\displaystyle+\left.\frac{2d}{\sqrt{\lambda_{t}}}\log(2)\right)
(t1,St𝐇t1(θ)λt2+2λtlog((1+LKtλtd)d/2δ)\displaystyle\leq\mathbb{P}\left(\forall t\geq 1,\mathinner{\!\left\lVert S_{t}\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\leq\frac{\sqrt{\lambda_{t}}}{2}+\frac{2}{\sqrt{\lambda_{t}}}\log\left(\frac{\left(1+\frac{LKt}{\lambda_{t}d}\right)^{d/2}}{\delta}\right)\right.
+2dλtlog(2))\displaystyle+\left.\frac{2d}{\sqrt{\lambda_{t}}}\log(2)\right)
=(t1,St𝐇t1(θ)γt(δ)λtS)\displaystyle=\mathbb{P}\left(\forall t\geq 1,\mathinner{\!\left\lVert S_{t}\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\leq\gamma_{t}(\delta)-\sqrt{\lambda_{t}}S\right) (24)

Combining Eq (21) and Eq (24) yields:

(t1,gt(θ^t)gt(θ)𝐇t1(θ)γt(δ))\displaystyle\mathbb{P}\left(\forall t\geq 1,\,\mathinner{\!\left\lVert g_{t}(\hat{\theta}_{t})-g_{t}(\theta_{*})\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\leq\gamma_{t}(\delta)\right)
(t1,St𝐇t1(θ)+λtSγt(δ))\displaystyle\geq\mathbb{P}\left(\forall t\geq 1,\,\mathinner{\!\left\lVert S_{t}\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}+\sqrt{\lambda_{t}}S\leq\gamma_{t}(\delta)\right)
1δ.\displaystyle\geq 1-\delta.

This completes the proof. ∎
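As a purely illustrative aside, the regularized maximum log-likelihood estimate θ^t\hat{\theta}_{t} of Eq (5) used above can be computed with any off-the-shelf optimizer. The following minimal Python sketch (assuming numpy and scipy are available, and using synthetic contexts and purchase indicators in place of the data collected by the algorithm; the helper name choice_probs and the data-generation choices are ours) maximizes the objective of Eq (5) by minimizing its negation.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, K, T, lam = 3, 4, 50, 1.0

# Synthetic history (a stand-in for the data collected by the algorithm):
# contexts X[s] (K x d) and one-hot purchase indicators r[s].
X = rng.normal(size=(T, K, d)) / np.sqrt(d)
theta_true = rng.uniform(size=d)

def choice_probs(Xs, theta):
    # MNL probabilities of Eq (1): item i gets exp(x_i' theta) / (1 + sum_j exp(x_j' theta)).
    u = np.exp(Xs @ theta)
    return u / (1.0 + u.sum())

r = np.zeros((T, K))
for s in range(T):
    p = choice_probs(X[s], theta_true)
    idx = rng.choice(K + 1, p=np.append(p, 1.0 - p.sum()))  # last index = outside option
    if idx < K:
        r[s, idx] = 1.0

def neg_log_lik(theta):
    # Negative of the regularized log-likelihood in Eq (5).
    val = 0.0
    for s in range(T):
        val -= np.sum(r[s] * np.log(choice_probs(X[s], theta)))
    return val + 0.5 * lam * theta @ theta

theta_hat = minimize(neg_log_lik, x0=np.zeros(d), method="BFGS").x
print("theta_hat =", theta_hat)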

It is insightful to compare Theorem 9 with Theorem 1 of Abbasi-Yadkori et al. [2011]. The latter is restated below:

Theorem 11.

Let {t}t=0\{\mathcal{F}_{t}\}_{t=0}^{\infty} be a filtration. Let {ηt}t=1\{\eta_{t}\}_{t=1}^{\infty} be a real-valued stochastic process such that ηt\eta_{t} is t\mathcal{F}_{t}-measurable and ηt\eta_{t} is conditionally RR-sub-Gaussian for some R0R\geq 0, i.e.,

λt,𝔼[exp(λtηt)t1]exp(λt2R22).\forall\,\lambda_{t}\,\in\mathbb{R},\qquad\mathbb{E}\mathinner{\left[\exp(\lambda_{t}\eta_{t})\mid\mathcal{F}_{t-1}\right]}\leq\exp\mathinner{\left(\frac{\lambda_{t}^{2}R^{2}}{2}\right)}.

Let {xt}t=1\{x_{t}\}_{t=1}^{\infty} be an d\mathbb{R}^{d}-valued stochastic process such that xtx_{t} is t1\mathcal{F}_{t-1}-measurable. Assume 𝐕\mathbf{V} is a d×dd\times d positive definite matrix. For any t0t\geq 0, define:

𝐕¯t=𝐕+s=1txsxs,St=s=1tηsxs.\overline{\mathbf{V}}_{t}=\mathbf{V}+\sum_{s=1}^{t}x_{s}x_{s}^{\top},\qquad\quad S_{t}=\sum_{s=1}^{t}\eta_{s}x_{s}.

Then, for any δ>0\delta>0, with probability at least 1δ1-\delta, for all t0t\geq 0,

||S_{t}||_{\overline{\mathbf{V}}_{t}^{-1}}\leq R\sqrt{2\log\mathinner{\left(\frac{\det(\overline{\mathbf{V}}_{t})^{\nicefrac{{1}}{{2}}}\det(\mathbf{V})^{-\nicefrac{{1}}{{2}}}}{\delta}\right)}}.

Theorem 11 makes a uniform sub-Gaussian assumption and, unlike Theorem 9, does not take the local variance information into account.

A.2 Local information preserving norm

Deviating from the previous analyses in Filippi et al. [2010], Li et al. [2017], we describe a norm which preserves the local information. The matrix 𝐗𝒬s\mathbf{X}_{\mathcal{Q}_{s}} is the design matrix whose columns are the contexts xs,1,xs,2,,xs,Kx_{s,1},x_{s,2},\cdots,x_{s,K} received at time step ss. The expected reward due to the ii-th item in the assortment is given by:

μi(𝐗𝒬sθ)=exs,iθ1+j𝒬sexs,jθ.\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta)=\frac{e^{x^{\top}_{s,i}\theta}}{1+\sum_{j\in\mathcal{Q}_{s}}e^{x^{\top}_{s,j}\theta}}.

Further, we consider the following integral:

ν=01μ˙i(ν𝐗𝒬sθ2+(1ν)𝐗𝒬sθ1)𝑑ν\displaystyle\int_{\nu=0}^{1}\dot{\mu}_{i}\mathinner{\left(\nu\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{2}+\mathinner{\left(1-\nu\right)}\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1}\right)}\cdot d\nu =xs,iθ1xs,iθ21xs,i(θ2θ1)μ˙i(ti)𝑑ti,\displaystyle=\int_{x_{s,i}^{\top}\theta_{1}}^{x_{s,i}^{\top}\theta_{2}}\frac{1}{x_{s,i}^{\top}(\theta_{2}-\theta_{1})}\dot{\mu}_{i}(t_{i})\cdot dt_{i}, (25)

where μ˙i\dot{\mu}_{i} is the partial derivative of μi\mu_{i} in the direction of the ii-th component and xs,iθ1xs,iθ2μ˙i(ti)𝑑ti\int_{x_{s,i}^{\top}\theta_{1}}^{x_{s,i}^{\top}\theta_{2}}\dot{\mu}_{i}(t_{i})\cdot dt_{i} represents integration of μ˙()\dot{\mu}(\cdot) with respect to the coordinate tit_{i} (hence the limits of the integration only track the change in the coordinate tit_{i}). For notational purposes that will become clear later, we define:

αi(𝐗𝒬s,θ1,θ2)xs,i(θ2θ1)\displaystyle\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{s}},\theta_{1},\theta_{2})x_{s,i}^{\top}(\theta_{2}-\theta_{1}) μi(𝐗𝒬sθ2)μi(𝐗𝒬sθ1)\displaystyle\coloneqq\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{2})-\mu_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1})
=exs,iθ21+j𝒬sexs,jθ2exs,iθ11+j𝒬sexs,jθ1\displaystyle=\frac{e^{x_{s,i}^{\top}\theta_{2}}}{1+\sum_{j\in\mathcal{Q}_{s}}e^{x_{s,j}^{\top}\theta_{2}}}-\frac{e^{x_{s,i}^{\top}\theta_{1}}}{1+\sum_{j\in\mathcal{Q}_{s}}e^{x_{s,j}^{\top}\theta_{1}}} (26)
=xs,iθ1xs,iθ2μ˙i(ti)𝑑ti,\displaystyle=\int_{x_{s,i}^{\top}\theta_{1}}^{x_{s,i}^{\top}\theta_{2}}\dot{\mu}_{i}(t_{i})\cdot dt_{i},

where the second step is due to the Fundamental Theorem of Calculus. We have exploited the two ways of viewing the multinomial logit function: as a sum of individual probabilities and as a vector-valued function. We write:

\sum_{i\in\mathcal{Q}_{s}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{s}},\theta_{1},\theta_{2})=\sum_{i\in\mathcal{Q}_{s}}\int_{\nu=0}^{1}\dot{\mu}_{i}\mathinner{\left(\nu\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{2}+\mathinner{\left(1-\nu\right)}\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1}\right)}\cdot d\nu (27)

We also have:

μ(𝐗𝒬sθ1)μ(𝐗𝒬sθ2)=i=1Kαi(𝐗𝒬s,θ2,θ1)xs,i(θ1θ2).\displaystyle\mu(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1})-\mu(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{2})=\sum_{i=1}^{K}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{s}},\theta_{2},\theta_{1})x_{s,i}^{\top}(\theta_{1}-\theta_{2}). (28)

It follows that:

g(θ1)g(θ2)=\displaystyle g(\theta_{1})-g(\theta_{2})= s=1t1i𝒬s(exs,iθ11+j𝒬sexs,jθ1exs,iθ21+j𝒬sexs,jθ2)xs,i\displaystyle\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\mathinner{\left(\frac{e^{x_{s,i}^{\top}\theta_{1}}}{1+\sum_{j\in\mathcal{Q}_{s}}e^{x_{s,j}^{\top}\theta_{1}}}-\frac{e^{x_{s,i}^{\top}\theta_{2}}}{1+\sum_{j\in\mathcal{Q}_{s}}e^{x_{s,j}^{\top}\theta_{2}}}\right)}x_{s,i}
+λt(θ1θ2)\displaystyle+\lambda_{t}(\theta_{1}-\theta_{2})
\displaystyle=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{s}},\theta_{2},\theta_{1})x_{s,i}x_{s,i}^{\top}(\theta_{1}-\theta_{2})+\lambda_{t}(\theta_{1}-\theta_{2})
=\displaystyle= 𝐆t(θ2,θ1)(θ1θ2),\displaystyle\mathbf{G}_{t}(\theta_{2},\theta_{1})(\theta_{1}-\theta_{2}),

where 𝐆t(θ1,θ2)\mathbf{G}_{t}\mathinner{\left(\theta_{1},\theta_{2}\right)}\coloneqq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\alpha_{i}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{s}},\theta_{1},\theta_{2}\right)}x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}. Since αi(𝐗𝒬s,θ1,θ2)1κ\alpha_{i}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{s}},\theta_{1},\theta_{2}\right)}\geq\frac{1}{\kappa} (from Assumption 2), we have 𝐆t(θ1,θ2)𝐎d×d\mathbf{G}_{t}(\theta_{1},\theta_{2})\succ\mathbf{O}_{d\times d}. Hence we get:

θ1θ2𝐆t(θ2,θ1)=g(θ1)g(θ2)𝐆t1(θ2,θ1).\mathinner{\!\left\lVert\theta_{1}-\theta_{2}\right\rVert}_{\mathbf{G}_{t}(\theta_{2},\theta_{1})}=\mathinner{\!\left\lVert g(\theta_{1})-g(\theta_{2})\right\rVert}_{\mathbf{G}_{t}^{-1}(\theta_{2},\theta_{1})}. (29)
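As a sanity check (not part of the argument above), the identity g(θ1)g(θ2)=𝐆t(θ2,θ1)(θ1θ2)g(\theta_{1})-g(\theta_{2})=\mathbf{G}_{t}(\theta_{2},\theta_{1})(\theta_{1}-\theta_{2}), and hence Eq (29), can be verified numerically for a single time step by computing each αi\alpha_{i} as the difference quotient implied by Eq (26). The short Python sketch below does exactly this with synthetic contexts; the variable names are illustrative.

import numpy as np

rng = np.random.default_rng(1)
d, K, lam = 3, 4, 1.0
X = rng.normal(size=(K, d)) / np.sqrt(d)            # rows are the contexts x_{s,i}
th1, th2 = rng.uniform(size=d), rng.uniform(size=d)

def mu(theta):
    u = np.exp(X @ theta)
    return u / (1.0 + u.sum())                      # MNL probabilities of Eq (1)

def g(theta):
    return X.T @ mu(theta) + lam * theta            # single-step version of g_t

alpha = (mu(th1) - mu(th2)) / (X @ (th1 - th2))     # difference quotient from Eq (26)
G = (X * alpha[:, None]).T @ X + lam * np.eye(d)    # G_t(theta_2, theta_1)

lhs = g(th1) - g(th2)
rhs = G @ (th1 - th2)
print(np.allclose(lhs, rhs))                        # True: matches the derivation above

diff = th1 - th2
n1 = np.sqrt(diff @ G @ diff)                       # ||theta_1 - theta_2||_{G}
n2 = np.sqrt(lhs @ np.linalg.solve(G, lhs))         # ||g(theta_1) - g(theta_2)||_{G^{-1}}
print(np.isclose(n1, n2))                           # True: Eq (29)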

A.3 Self-Concordance Style Relations for Multinomial Logistic Function

Lemma 12.

For an assortment 𝒬s\mathcal{Q}_{s} and θ1,θ2Θ\theta_{1},\theta_{2}\,\in\,\Theta, the following holds:

i𝒬sαi(𝐗𝒬s,θ2,θ1)\displaystyle\sum_{i\in\mathcal{Q}_{s}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{s}},\theta_{2},\theta_{1}) =i𝒬sν=01μ˙i(ν𝐗𝒬sθ2+(1ν)𝐗𝒬sθ1)𝑑ν\displaystyle=\sum_{i\in\mathcal{Q}_{s}}\int_{\nu=0}^{1}\dot{\mu}_{i}\mathinner{\left(\nu\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{2}+\mathinner{\left(1-\nu\right)}\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1}\right)}\cdot d\nu
i𝒬sμ˙i(𝐗𝒬sθ1)(1+|xs,iθ1xs,iθ2|)1\displaystyle\geq\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1})\mathinner{\left(1+|x_{s,i}^{\top}\theta_{1}-x_{s,i}^{\top}\theta_{2}|\right)}^{-1}
Proof.

We write:

\displaystyle\sum_{i\in\mathcal{Q}_{s}}\int_{\nu=0}^{1}\dot{\mu}_{i}\mathinner{\left(\nu\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{2}+\mathinner{\left(1-\nu\right)}\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1}\right)}\cdot d\nu =\sum_{i\in\mathcal{Q}_{s}}\int_{x_{s,i}^{\top}\theta_{1}}^{x_{s,i}^{\top}\theta_{2}}\frac{1}{x_{s,i}^{\top}(\theta_{2}-\theta_{1})}\dot{\mu}_{i}(t_{i})\cdot dt_{i}, (30)

where xs,iθ1xs,iθ2μ˙i(ti)𝑑ti\int_{x_{s,i}^{\top}\theta_{1}}^{x_{s,i}^{\top}\theta_{2}}\dot{\mu}_{i}(t_{i})\cdot dt_{i} represents integration of μ˙()\dot{\mu}(\cdot) with respect to the coordinate tit_{i} ( hence the limits of the integration only consider change in the coordinate tit_{i}). For some z>z1z>z_{1}\,\in\,\mathbb{R}, consider:

z1zddtilog(μ˙i(ti))dti=z1z2μi,i(ti)μ˙i(ti)𝑑ti,\displaystyle\int_{z_{1}}^{z}\frac{d}{dt_{i}}\log\mathinner{\left(\dot{\mu}_{i}(t_{i})\right)}\cdot dt_{i}=\int_{z_{1}}^{z}\frac{\nabla^{2}\mu_{i,i}(t_{i})}{\dot{\mu}_{i}(t_{i})}dt_{i},

where 2μi,i()\nabla^{2}\mu_{i,i}(\cdot) is the second derivative of μi()\mu_{i}(\cdot). Using Lemma 16, we have 12μi,i()μ˙i()1-1\leq\frac{\nabla^{2}\mu_{i,i}(\cdot)}{\dot{\mu}_{i}(\cdot)}\leq 1. Thus we get:

(zz1)z1zddtilog(μ˙i(ti)).dti(zz1)-(z-z_{1})\leq\int_{z_{1}}^{z}\frac{d}{dt_{i}}\log\mathinner{\left(\dot{\mu}_{i}(t_{i})\right)}.dt_{i}\leq(z-z_{1})

Using Fundamental Theorem of Calculus, we get:

(zz1)log(μ˙i(z))log(μ˙i(z1))(zz1)\displaystyle-(z-z_{1})\leq\log\mathinner{\left(\dot{\mu}_{i}(z)\right)}-\log\mathinner{\left(\dot{\mu}_{i}(z_{1})\right)}\leq(z-z_{1})
\displaystyle\therefore μ˙i(z1)exp((zz1))μ˙i(z)μ˙i(z1)exp(zz1)\displaystyle~{}\dot{\mu}_{i}(z_{1})\exp(-(z-z_{1}))\leq\dot{\mu}_{i}(z)\leq\dot{\mu}_{i}(z_{1})\exp(z-z_{1}) (31)

Using Eq (31), for z2z1z_{2}\geq z_{1}\,\in\,\mathbb{R} and for all i[K]i\,\in\,[K], we have:

μ˙i(z1)(1exp((z2z1)))z1z2μ˙(ti)𝑑tiμ˙i(z1)(exp(z2z1)1)\displaystyle\dot{\mu}_{i}(z_{1})\mathinner{\left(1-\exp(-(z_{2}-z_{1}))\right)}\leq\int_{z_{1}}^{z_{2}}\dot{\mu}(t_{i})dt_{i}\leq\dot{\mu}_{i}(z_{1})\mathinner{\left(\exp(z_{2}-z_{1})-1\right)}
\displaystyle\therefore μ˙i(z1)1exp((z2z1))z2z11z2z1z1z2μ˙(ti)𝑑tiμ˙i(z1)exp(z2z1)1z2z1.\displaystyle~{}\dot{\mu}_{i}(z_{1})\frac{1-\exp(-(z_{2}-z_{1}))}{z_{2}-z_{1}}\leq\frac{1}{z_{2}-z_{1}}\int_{z_{1}}^{z_{2}}\dot{\mu}(t_{i})dt_{i}\leq\dot{\mu}_{i}(z_{1})\frac{\exp(z_{2}-z_{1})-1}{z_{2}-z_{1}}. (32)

Reversing the roles of z1z_{1} and z2z_{2}, so that z2z1z_{2}\leq z_{1}, and again using Eq (31), we write:

μ˙i(z1)exp((z1z2))1z2z11z2z1z1z2μ˙(ti)𝑑tiμ˙i(z1)exp(z1z2)1z2z1.\displaystyle\dot{\mu}_{i}(z_{1})\frac{\exp(-(z_{1}-z_{2}))-1}{z_{2}-z_{1}}\leq\frac{1}{z_{2}-z_{1}}\int_{z_{1}}^{z_{2}}\dot{\mu}(t_{i})dt_{i}\leq\dot{\mu}_{i}(z_{1})\frac{\exp(z_{1}-z_{2})-1}{z_{2}-z_{1}}. (33)

Combining Eqs (32) and (33), for all i[K]i\,\in\,[K] we get:

μ˙i(z1)1exp(|z1z2|)|z1z2|1z2z1z1z2μ˙(ti)𝑑ti.\displaystyle\dot{\mu}_{i}(z_{1})\frac{1-\exp(-|z_{1}-z_{2}|)}{|z_{1}-z_{2}|}\leq\frac{1}{z_{2}-z_{1}}\int_{z_{1}}^{z_{2}}\dot{\mu}(t_{i})dt_{i}. (34)

If x0x\geq 0, then ex(1+x)1e^{-x}\leq(1+x)^{-1}, and therefore (1ex)/x(1+x)1(1-e^{-x})/x\geq(1+x)^{-1}. Thus we lower bound the left hand side of Eq (34) as:

μ˙i(z1)(1+|z1z2|)1μ˙i(z1)1exp(|z1z2|)|z1z2|1z2z1z1z2μ˙(ti)𝑑ti.\displaystyle\dot{\mu}_{i}(z_{1})\mathinner{\left(1+|z_{1}-z_{2}|\right)}^{-1}\leq\dot{\mu}_{i}(z_{1})\frac{1-\exp(-|z_{1}-z_{2}|)}{|z_{1}-z_{2}|}\leq\frac{1}{z_{2}-z_{1}}\int_{z_{1}}^{z_{2}}\dot{\mu}(t_{i})dt_{i}.

Using the above with z2=xs,iθ2z_{2}=x_{s,i}^{\top}\theta_{2} and z1=xs,iθ1z_{1}=x_{s,i}^{\top}\theta_{1} in Eq (30) gives:

i𝒬sν=01μ˙i(ν𝐗𝒬sθ2+(1ν)𝐗𝒬sθ1)𝑑ν\displaystyle\sum_{i\in\mathcal{Q}_{s}}\int_{\nu=0}^{1}\dot{\mu}_{i}\mathinner{\left(\nu\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{2}+\mathinner{\left(1-\nu\right)}\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1}\right)}\cdot d\nu
=\displaystyle= i𝒬sxs,iθ1xs,iθ21xs,i(θ2θ1)μ˙i(ti)𝑑tii𝒬sμ˙i(𝐗𝒬sθ1)(1+|xs,iθ1xs,iθ2|)1.\displaystyle\sum_{i\in\mathcal{Q}_{s}}\int_{x_{s,i}^{\top}\theta_{1}}^{x_{s,i}^{\top}\theta_{2}}\frac{1}{x_{s,i}^{\top}(\theta_{2}-\theta_{1})}\dot{\mu}_{i}(t_{i})\cdot dt_{i}\geq\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1})\mathinner{\left(1+|x_{s,i}^{\top}\theta_{1}-x_{s,i}^{\top}\theta_{2}|\right)}^{-1}.
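To make the envelope behind Eq (31) concrete, the following Python snippet (a numerical spot check with arbitrary fixed utilities for the other items, not part of the proof) evaluates μ˙i\dot{\mu}_{i} along a single coordinate and confirms that it stays within the multiplicative band μ˙i(z1)e±|zz1|\dot{\mu}_{i}(z_{1})e^{\pm|z-z_{1}|} implied by applying Eq (31) in both directions.

import numpy as np

rng = np.random.default_rng(2)
C = np.exp(rng.normal(size=4)).sum()      # fixed contribution of the other items

def f(t):
    # choice probability of item i as its own utility t varies
    return np.exp(t) / (1.0 + np.exp(t) + C)

def fdot(t):
    return f(t) * (1.0 - f(t))            # derivative, by Remark 3

z1 = 0.3
for z in np.linspace(-2.0, 2.0, 9):
    lo = fdot(z1) * np.exp(-abs(z - z1))
    hi = fdot(z1) * np.exp(abs(z - z1))
    assert lo - 1e-12 <= fdot(z) <= hi + 1e-12    # the envelope implied by Eq (31)
print("Eq (31) envelope holds at all sampled points")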

Lemma 4.

For all θ1,θ2Θ\theta_{1},\theta_{2}\,\in\,\Theta, with SmaxθΘθ2S\coloneqq\max_{\theta\,\in\,\Theta}\mathinner{\!\left\lVert\theta\right\rVert}_{2} (Assumption 1), the following inequalities hold:

𝐆t(θ1,θ2)(1+2S)1𝐇t(θ1)\displaystyle\mathbf{G}_{t}(\theta_{1},\theta_{2})\succeq(1+2S)^{-1}\mathbf{H}_{t}(\theta_{1})
𝐆t(θ1,θ2)(1+2S)1𝐇t(θ2)\displaystyle\mathbf{G}_{t}(\theta_{1},\theta_{2})\succeq(1+2S)^{-1}\mathbf{H}_{t}(\theta_{2})
Proof.

From Lemma 12, we have:

i𝒬sαi(𝐗𝒬s,θ2,θ1)\displaystyle\sum_{i\in\mathcal{Q}_{s}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{s}},\theta_{2},\theta_{1}) i𝒬s(1+|xs,iθ1xs,iθ2|)1μ˙i(𝐗𝒬sθ1)\displaystyle\geq\sum_{i\in\mathcal{Q}_{s}}\mathinner{\left(1+|x_{s,i}^{\top}\theta_{1}-x_{s,i}^{\top}\theta_{2}|\right)}^{-1}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1})
i𝒬s(1+xs,i2θ1θ22)1μ˙i(𝐗𝒬sθ1)\displaystyle\geq\sum_{i\in\mathcal{Q}_{s}}\left(1+\mathinner{\!\left\lVert x_{s,i}\right\rVert}_{2}\mathinner{\!\left\lVert\theta_{1}-\theta_{2}\right\rVert}_{2}\right)^{-1}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1}) (Cauchy-Schwarz)
i𝒬s(1+2S)1μ˙i(𝐗𝒬sθ1)\displaystyle\geq\sum_{i\in\mathcal{Q}_{s}}\left(1+2S\right)^{-1}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1}) (θ1,θ2Θ,xs,i21\theta_{1},\theta_{2}\in\Theta,\,||x_{s,i}||_{2}\leq 1)

Now we write 𝐆t(θ1,θ2)\mathbf{G}_{t}(\theta_{1},\theta_{2}) as:

𝐆t(θ1,θ2)\displaystyle\mathbf{G}_{t}(\theta_{1},\theta_{2}) =s=1t1i𝒬sαi(𝐗𝒬s,θ2,θ1)xs,ixs,i+λt𝐈d\displaystyle=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{s}},\theta_{2},\theta_{1})x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}
(1+2S)1s=1t1i𝒬sμ˙i(𝐗𝒬sθ1)xs,ixs,i+λt𝐈d\displaystyle\succeq(1+2S)^{-1}\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1})x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}
=(1+2S)1(s=1t1i𝒬sμ˙i(𝐗𝒬sθ1)xs,ixs,i+(1+2S)λt𝐈d)\displaystyle=(1+2S)^{-1}\mathinner{\left(\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1})x_{s,i}x_{s,i}^{\top}+(1+2S)\lambda_{t}\mathbf{I}_{d}\right)}
(1+2S)1(s=1t1i𝒬sμ˙i(𝐗𝒬sθ1)xs,ixs,i+λt𝐈d)\displaystyle\succeq(1+2S)^{-1}\mathinner{\left(\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{1})x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}\right)}
=(1+2S)1𝐇t(θ1).\displaystyle=(1+2S)^{-1}\mathbf{H}_{t}(\theta_{1}).

Since, θ1\theta_{1} and θ2\theta_{2} have symmetric roles in the definition of αi(𝐗𝒬s,θ2,θ1)\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{s}},\theta_{2},\theta_{1}), we also obtain the second relation by a change of variable directly. ∎

The following Lemma presents a crucial bound over the deviation (θθ)(\theta-\theta_{*}), which we extensively use in our derivations.

Lemma 5.

For θCt(δ)\theta\in C_{t}\mathinner{\left(\delta\right)}, we have the following relation with probability at least 1δ1-\delta:

θθ𝐇t(θ)2(1+2S)γt(δ).\displaystyle\mathinner{\!\left\lVert\theta-\theta_{*}\right\rVert}_{\mathbf{H}_{t}(\theta)}\leq 2(1+2S)\gamma_{t}(\delta). (35)
Proof.

Since θ,θΘ\theta,\theta_{*}\in\Theta, by Lemma 4 it follows that:

θθ𝐇t(θ)1+2Sθθ𝐆t(θ,θ).\mathinner{\!\left\lVert\theta-\theta_{*}\right\rVert}_{\mathbf{H}_{t}(\theta)}\leq\sqrt{1+2S}\mathinner{\!\left\lVert\theta-\theta_{*}\right\rVert}_{\mathbf{G}_{t}(\theta,\theta_{*})}.

From the triangle inequality, we write:

g(θ)g(θ)𝐆t1(θ,θ)g(θ)g(θ^t)𝐆t1(θ,θ)+g(θ^t)g(θ)𝐆t1(θ,θ),\displaystyle\mathinner{\!\left\lVert g(\theta_{*})-g(\theta)\right\rVert}_{\mathbf{G}_{t}^{-1}(\theta,\theta_{*})}\leq\mathinner{\!\left\lVert g(\theta_{*})-g(\hat{\theta}_{t})\right\rVert}_{\mathbf{G}_{t}^{-1}(\theta,\theta_{*})}+\mathinner{\!\left\lVert g(\hat{\theta}_{t})-g(\theta)\right\rVert}_{\mathbf{G}_{t}^{-1}(\theta,\theta_{*})},

where θ^t\hat{\theta}_{t} is the MLE estimate. Further Lemma 4 gives:

g(θ)g(θ)𝐆t1(θ,θ)\displaystyle\mathinner{\!\left\lVert g(\theta_{*})-g(\theta)\right\rVert}_{\mathbf{G}_{t}^{-1}(\theta,\theta_{*})} 1+2Sg(θ)g(θ^t)𝐇t1(θ)\displaystyle\leq\sqrt{1+2S}\mathinner{\!\left\lVert g(\theta_{*})-g(\hat{\theta}_{t})\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}
+1+2Sg(θ^t)g(θ)𝐇t1(θ).\displaystyle+\sqrt{1+2S}\mathinner{\!\left\lVert g(\hat{\theta}_{t})-g(\theta)\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta)}.

Since θ\theta is the minimizer of g(θ)g(θ^t)𝐇t1(θ)\mathinner{\!\left\lVert g(\theta)-g(\hat{\theta}_{t})\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta)}, we write:

g(θ)g(θ)𝐆t1(θ,θ)21+2Sg(θ)g(θ^t)𝐇t1(θ).\displaystyle\mathinner{\!\left\lVert g(\theta_{*})-g(\theta)\right\rVert}_{\mathbf{G}_{t}^{-1}(\theta,\theta_{*})}\leq 2\sqrt{1+2S}\mathinner{\!\left\lVert g(\theta_{*})-g(\hat{\theta}_{t})\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}.

Finally, Eq (35) follows by an application of Lemma 10:

g(θ)g(θ)𝐇t1(θ)\displaystyle\mathinner{\!\left\lVert g(\theta_{*})-g(\theta)\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})} γt(δ).\displaystyle\leq\gamma_{t}(\delta).

A.4 Bounds on prediction error

Lemma 6.

For the assortment chosen by the algorithm CB-MNL, 𝒬t\mathcal{Q}_{t} as given by Eq (12) and any θCt(δ)\theta\in C_{t}(\delta) the following holds with probability at least 1δ1-\delta:

αi(𝐗𝒬t,θ,θ)\displaystyle\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta) μ˙i(𝐗𝒬tθ)+2(1+2S)Mγt(δ)xt,i𝐇t1(θ).\displaystyle\leq\dot{\mu}_{i}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}\right)}+2(1+2S)M\gamma_{t}(\delta)\mathinner{\!\left\lVert x_{t,i}\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}.
Proof.

Consider the multinomial logit function:

αi(𝐗𝒬t,θ,θ)xt,i(θθ)=ext,iθ1+j𝒬text,jθext,iθ1+j𝒬text,jθ.\displaystyle\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta)x_{t,i}^{\top}(\theta-\theta_{*})=\frac{e^{x_{t,i}^{\top}\theta}}{1+\sum_{j\in\mathcal{Q}_{t}}e^{x_{t,j}^{\top}\theta}}-\frac{e^{x_{t,i}^{\top}\theta_{*}}}{1+\sum_{j\in\mathcal{Q}_{t}}e^{x_{t,j}^{\top}\theta_{*}}}. (36)

We use a second-order Taylor expansion of each component of the multinomial logit function at aia_{i}. Consider, for all i[K]i\,\in\,[K]:

fi(ri)\displaystyle f_{i}(r_{i}) =eri1+eri+j𝒬s,jierj\displaystyle=\frac{e^{r_{i}}}{1+e^{r_{i}}+\sum_{j\in\mathcal{Q}_{s},j\neq i}e^{r_{j}}}
\displaystyle\leq f_{i}(a_{i})+f_{i}^{\prime}(a_{i})(r_{i}-a_{i})+\frac{f_{i}^{\prime\prime}(a_{i})(r_{i}-a_{i})^{2}}{2}. (37)

In Eq (37), we substitute fi()μif_{i}(\cdot)\to\mu_{i}, rixt,iθr_{i}\to x_{t,i}^{\top}\theta, and aixt,iθa_{i}\to x_{t,i}^{\top}\theta_{*}. Thus we rewrite Eq (36) as:

\displaystyle\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta)x_{t,i}^{\top}(\theta-\theta_{*})\leq\dot{\mu}_{i}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}\right)}(x_{t,i}^{\top}(\theta-\theta_{*}))+\ddot{\mu}_{i}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}\right)}(x_{t,i}^{\top}(\theta-\theta_{*}))^{2},
\displaystyle\therefore~{}~{}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta)\leq\dot{\mu}_{i}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}\right)}+\ddot{\mu}_{i}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}\right)}\mathinner{\!\left\lvert x_{t,i}^{\top}(\theta-\theta_{*})\right\rvert}
\displaystyle\leq\dot{\mu}_{i}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}\right)}+M\mathinner{\!\left\lvert x_{t,i}^{\top}(\theta-\theta_{*})\right\rvert},

where we upper bound μ¨i\ddot{\mu}_{i} by MM. An application of Cauchy-Schwarz gives us:

|xt,i(θθ)|\displaystyle\mathinner{\!\left\lvert x_{t,i}^{\top}(\theta_{*}-\theta)\right\rvert} xt,i𝐇t1(θ)θθ𝐇t(θ)\displaystyle\leq\mathinner{\!\left\lVert x_{t,i}\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\mathinner{\!\left\lVert\theta_{*}-\theta\right\rVert}_{\mathbf{H}_{t}(\theta_{*})} (38)

Combining the last two equations, we get:

αi(𝐗𝒬t,θ,θ)\displaystyle\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta) μ˙i(𝐗𝒬tθ)+xt,i𝐇t1(θ)θθ𝐇t(θ).\displaystyle\leq\dot{\mu}_{i}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}\right)}+\mathinner{\!\left\lVert x_{t,i}\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\mathinner{\!\left\lVert\theta_{*}-\theta\right\rVert}_{\mathbf{H}_{t}(\theta_{*})}.

From Lemma 5 we get:

αi(𝐗𝒬t,θ,θ)\displaystyle\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta) μ˙i(𝐗𝒬tθ)+2(1+2S)Mγt(δ)xt,i𝐇t1(θ).\displaystyle\leq\dot{\mu}_{i}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}\right)}+2(1+2S)M\gamma_{t}(\delta)\mathinner{\!\left\lVert x_{t,i}\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}.

Lemma 3.

For the assortment chosen by the algorithm CB-MNL, 𝒬t\mathcal{Q}_{t} as given by Eq (12) and any θCt(δ)\theta\in C_{t}(\delta) the following holds with probability at least 1δ1-\delta:

Δpred(𝐗𝒬t,θ)\displaystyle\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta)\leq (2+4S)γt(δ)i𝒬tμ˙i(𝐗𝒬tθ)xt,i𝐇t1(θ)\displaystyle\mathinner{\left(2+4S\right)}\gamma_{t}(\delta)\sum_{i\in\mathcal{Q}_{t}}\dot{\mu}_{i}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}\right)}\mathinner{\!\left\lVert x_{t,i}\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}
+4κ(1+2S)2Mγt(δ)2i𝒬txt,i𝐕t12\displaystyle+4\kappa(1+2S)^{2}M\gamma_{t}(\delta)^{2}\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert x_{t,i}\right\rVert}^{2}_{\mathbf{V}_{t}^{-1}}
Proof.
Δpred(𝐗𝒬t,θ)\displaystyle\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta) =|μ(𝐗𝒬tθ)μ(𝐗𝒬tθ)|\displaystyle=~{}\mathinner{\!\left\lvert\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta)-\mu(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})\right\rvert}
=\displaystyle= |i𝒬tαi(𝐗𝒬t,θ,θ)xt,i(θθ)|\displaystyle~{}\mathinner{\!\left\lvert\sum_{i\in\mathcal{Q}_{t}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta)x_{t,i}^{\top}(\theta-\theta_{*})\right\rvert} (From Eq (28))
\displaystyle\leq |i𝒬tαi(𝐗𝒬t,θ,θ)xt,i𝐇t1(θ)θθ𝐇t(θ)|\displaystyle~{}\mathinner{\!\left\lvert\sum_{i\in\mathcal{Q}_{t}}\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta)\mathinner{\!\left\lVert x_{t,i}\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\mathinner{\!\left\lVert\theta_{*}-\theta\right\rVert}_{\mathbf{H}_{t}(\theta_{*})}\right\rvert} (Cauchy-Schwarz inequality and Eq (29))
\displaystyle\leq 2(1+2S)γt(δ)i𝒬t|αi(𝐗𝒬t,θ,θ)xt,i𝐇t1(θ)|\displaystyle~{}2(1+2S)\gamma_{t}(\delta)\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lvert\alpha_{i}(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{*},\theta)\mathinner{\!\left\lVert x_{t,i}\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\right\rvert} (From Lemma 5)
\displaystyle\leq 2(1+2S)γt(δ)i𝒬t(μ˙i(𝐗𝒬tθ)xt,i𝐇t1(θ)\displaystyle~{}2(1+2S)\gamma_{t}(\delta)\sum_{i\in\mathcal{Q}_{t}}\left(\dot{\mu}_{i}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}\right)}\mathinner{\!\left\lVert x_{t,i}\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\right.
+2(1+2S)Mγt(δ)xt,i𝐇t1(θ)2)\displaystyle+\left.2(1+2S)M\gamma_{t}(\delta)\mathinner{\!\left\lVert x_{t,i}\right\rVert}^{2}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\right) (From Lemma 6)

Upon re-arranging the terms we get:

Δpred(𝐗𝒬t,θ)\displaystyle\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta)\leq (2+4S)γt(δ)i𝒬tμ˙i(𝐗𝒬tθ)xt,i𝐇t1(θ)\displaystyle\mathinner{\left(2+4S\right)}\gamma_{t}(\delta)\sum_{i\in\mathcal{Q}_{t}}\dot{\mu}_{i}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*}\right)}\mathinner{\!\left\lVert x_{t,i}\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}
+4κ(1+2S)2Mγt(δ)2i𝒬txt,i𝐕t12,\displaystyle+4\kappa(1+2S)^{2}M\gamma_{t}(\delta)^{2}\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert x_{t,i}\right\rVert}^{2}_{\mathbf{V}_{t}^{-1}},

where we use 𝐇t(θ)κ1𝐕t\mathbf{H}_{t}(\theta_{*})\succeq\kappa^{-1}\mathbf{V}_{t} (Assumption 2), which implies xt,i𝐇t1(θ)2κxt,i𝐕t12\mathinner{\!\left\lVert x_{t,i}\right\rVert}^{2}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\leq\kappa\mathinner{\!\left\lVert x_{t,i}\right\rVert}^{2}_{\mathbf{V}_{t}^{-1}}. ∎

Corollary 7.

For the assortment chosen by the algorithm CB-MNL, 𝒬t\mathcal{Q}_{t} as given by Eq (12) and any θCt(δ)\theta\in C_{t}(\delta) the following holds with probability at least 1δ1-\delta:

Δpred(𝐗𝒬t,θ)\displaystyle\Delta^{\text{pred}}(\mathbf{X}_{\mathcal{Q}_{t}},\theta)\leq 2(1+2S)γt(δ)i𝒬tx~t,i𝐉t1\displaystyle 2\mathinner{\left(1+2S\right)}\gamma_{t}(\delta)\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert\tilde{x}_{t,i}\right\rVert}_{\mathbf{J}_{t}^{-1}}
+4κ(1+2S)2Mγt(δ)2i𝒬txt,i𝐕t12,\displaystyle+4\kappa(1+2S)^{2}M\gamma_{t}(\delta)^{2}\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert x_{t,i}\right\rVert}^{2}_{\mathbf{V}_{t}^{-1}},

where x~t,i=μ˙i(𝐗𝒬tθ)xt,i\tilde{x}_{t,i}=\sqrt{\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{t}}^{\top}\theta_{*})}x_{t,i} and x𝐇t1(θ)=x𝐉t1\mathinner{\!\left\lVert x\right\rVert}_{\mathbf{H}^{-1}_{t}(\theta_{*})}=\mathinner{\!\left\lVert x\right\rVert}_{\mathbf{J}^{-1}_{t}}.

Proof.

This directly follows from the uniqueness and realizability of θ\theta_{*}.

A.5 Regret calculation

The following two lemmas give the upper bounds on the self-normalized vector summations.

Lemma 13.
t=1Tmin{i𝒬tx~t,i𝐉T+11(θ)2,1}\displaystyle\sum_{t=1}^{T}\min\mathinner{\left\{\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert\tilde{x}_{t,i}\right\rVert}^{2}_{\mathbf{J}^{-1}_{T+1}(\theta)},1\right\}} 2dlog(1+LKTdλt).\displaystyle\leq~{}2d\log\mathinner{\left(1+\frac{LKT}{d\lambda_{t}}\right)}.
Proof.

The proof follows by a direct application of Lemmas 17 and 18:

t=1Tmin{i𝒬tx~t,i𝐉T+11(θ)2,1}\displaystyle\sum_{t=1}^{T}\min\mathinner{\left\{\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert\tilde{x}_{t,i}\right\rVert}^{2}_{\mathbf{J}^{-1}_{T+1}(\theta)},1\right\}}
\displaystyle\leq 2log(det(𝐉T+1)λtd)\displaystyle~{}2\log\mathinner{\left(\frac{\det(\mathbf{J}_{T+1})}{\lambda_{t}^{d}}\right)} (From Lemma 17)
\displaystyle= 2\log\mathinner{\left(\frac{\det\mathinner{\left(\sum_{s=1}^{T}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta_{*})x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}\right)}}{\lambda_{t}^{d}}\right)}
\displaystyle\leq 2\log\mathinner{\left(\frac{\det\mathinner{\left(\sum_{s=1}^{T}\sum_{i\in\mathcal{Q}_{s}}Lx_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}\right)}}{\lambda_{t}^{d}}\right)} (Upper bound by the Lipschitz constant)
\displaystyle= 2\log\mathinner{\left(\frac{L^{d}\det\mathinner{\left(\sum_{s=1}^{T}\sum_{i\in\mathcal{Q}_{s}}x_{s,i}x_{s,i}^{\top}+\nicefrac{{\lambda_{t}}}{{L}}\mathbf{I}_{d}\right)}}{\lambda_{t}^{d}}\right)}
\displaystyle\leq 2dlog(1+LKTdλt).\displaystyle~{}2d\log\mathinner{\left(1+\frac{LKT}{d\lambda_{t}}\right)}. (From Lemma 18)

Similar to Lemma 13, we prove the following.

Lemma 14.
t=1Tmin{i𝒬txt,i𝐕T+11(θ)2,1}\displaystyle\sum_{t=1}^{T}\min\mathinner{\left\{\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert x_{t,i}\right\rVert}^{2}_{\mathbf{V}^{-1}_{T+1}(\theta)},1\right\}} 2dlog(1+KTdλt).\displaystyle\leq~{}2d\log\mathinner{\left(1+\frac{KT}{d\lambda_{t}}\right)}.
Proof.
t=1Tmin{i𝒬txt,i𝐕T+11(θ)2,1}\displaystyle\sum_{t=1}^{T}\min\mathinner{\left\{\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert x_{t,i}\right\rVert}^{2}_{\mathbf{V}^{-1}_{T+1}(\theta)},1\right\}}
\displaystyle\leq 2log(det(𝐕T+1)λtd)\displaystyle~{}2\log\mathinner{\left(\frac{\det(\mathbf{V}_{T+1})}{\lambda_{t}^{d}}\right)} (From Lemma 17, set μi¯˙()=1\dot{\underline{\mu_{i}}}(\cdot)=1)
\displaystyle= 2\log\mathinner{\left(\frac{\det\mathinner{\left(\sum_{s=1}^{T}\sum_{i\in\mathcal{Q}_{s}}x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}\right)}}{\lambda_{t}^{d}}\right)}
\displaystyle\leq 2dlog(1+KTdλt).\displaystyle~{}2d\log\mathinner{\left(1+\frac{KT}{d\lambda_{t}}\right)}. (From Lemma 18)
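The bound of Lemma 14 can be spot-checked numerically. The Python sketch below (illustrative only, with random unit-norm contexts) uses the running matrix 𝐕t\mathbf{V}_{t} as in Lemma 17; since 𝐕T+11𝐕t1\mathbf{V}_{T+1}^{-1}\preceq\mathbf{V}_{t}^{-1}, the accumulated sum dominates the left-hand side in the statement of Lemma 14, so the check is conservative.

import numpy as np

rng = np.random.default_rng(6)
d, K, T, lam = 3, 4, 500, 1.0

lhs, V = 0.0, lam * np.eye(d)
for t in range(T):
    X = rng.normal(size=(K, d))
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # ||x_{t,i}||_2 <= 1
    Vinv = np.linalg.inv(V)
    lhs += min(sum(x @ Vinv @ x for x in X), 1.0)
    V += X.T @ X                                   # V_{t+1} = V_t + sum_i x_{t,i} x_{t,i}^T

rhs = 2 * d * np.log(1 + K * T / (d * lam))
print(lhs <= rhs)    # True, consistent with Lemma 14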

Theorem 1.

With probability at least 1δ1-\delta:

𝐑T\displaystyle\mathbf{R}_{T}\leq C1γt(δ)2dlog(1+LKTdλt)T+C2κγt(δ)2dlog(1+KTdλt),\displaystyle C_{1}\gamma_{t}(\delta)\sqrt{2d\log\mathinner{\left(1+\frac{LKT}{d\lambda_{t}}\right)}T}+C_{2}\kappa\gamma_{t}(\delta)^{2}d\log\mathinner{\left(1+\frac{KT}{d\lambda_{t}}\right)},

where the constants are given as C1=(4+8S)C_{1}=\mathinner{\left(4+8S\right)}, C2=4(4+8S)3/2MC_{2}=4(4+8S)^{\nicefrac{{3}}{{2}}}M and γt(δ)\gamma_{t}(\delta) is given by Eq (8).

Proof.

The regret is upper bounded by the prediction error.

𝐑T\displaystyle\mathbf{R}_{T}\leq t=1Tmin{Δpred(𝐗𝒬t,θtest),1}\displaystyle\sum_{t=1}^{T}\min\mathinner{\left\{\Delta^{\text{pred}}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{t}},\theta_{t}^{\,\text{est}}\right)},1\right\}} (Rmax=1R_{\max}=1)
\displaystyle\leq t=1Tmin{(2+4S)γt(δ)i𝒬txt,i𝐉t1\displaystyle\sum_{t=1}^{T}\min\left\{\mathinner{\left(2+4S\right)}\gamma_{t}(\delta)\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert x_{t,i}\right\rVert}_{\mathbf{J}_{t}^{-1}}\right.
+8κ(1+2S)2Mγt(δ)2i𝒬txt,i𝐕t12,1}\displaystyle+\left.8\kappa(1+2S)^{2}M\gamma_{t}(\delta)^{2}\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert x_{t,i}\right\rVert}^{2}_{\mathbf{V}_{t}^{-1}},1\right\} (From Corollary 7)
\displaystyle\leq 2(1+2S)γt(δ)t=1Tmin{i𝒬txt,i𝐉t1,1}\displaystyle 2\mathinner{\left(1+2S\right)}\gamma_{t}(\delta)\sum_{t=1}^{T}\min\mathinner{\left\{\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert x_{t,i}\right\rVert}_{\mathbf{J}_{t}^{-1}},1\right\}}
+8(1+2S)2κMγt(δ)2t=1Tmin{i𝒬txt,i𝐕t12,1}\displaystyle+8(1+2S)^{2}\kappa M\gamma_{t}(\delta)^{2}\sum_{t=1}^{T}\min\mathinner{\left\{\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert x_{t,i}\right\rVert}^{2}_{\mathbf{V}_{t}^{-1}},1\right\}}
\displaystyle\leq 2(1+2S)γt(δ)Tt=1Tmin{i𝒬txt,i𝐉t12,1}\displaystyle 2\mathinner{\left(1+2S\right)}\gamma_{t}(\delta)\sqrt{T}\sqrt{\sum_{t=1}^{T}\min\mathinner{\left\{\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert x_{t,i}\right\rVert}^{2}_{\mathbf{J}_{t}^{-1}},1\right\}}}
+8(1+2S)2κMγt(δ)2t=1Tmin{i𝒬txt,i𝐕t12,1}\displaystyle+8(1+2S)^{2}\kappa M\gamma_{t}(\delta)^{2}\sum_{t=1}^{T}\min\mathinner{\left\{\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert x_{t,i}\right\rVert}^{2}_{\mathbf{V}_{t}^{-1}},1\right\}} (Using Cauchy-Schwarz inequality)
\displaystyle\leq 2(1+2S)γt(δ)2dlog(1+LKTdλt)T\displaystyle 2\mathinner{\left(1+2S\right)}\gamma_{t}(\delta)\sqrt{2d\log\mathinner{\left(1+\frac{LKT}{d\lambda_{t}}\right)}T}
+8(1+2S)2κMγt(δ)2dlog(1+KTdλt).\displaystyle+8(1+2S)^{2}\kappa M\gamma_{t}(\delta)^{2}d\log\mathinner{\left(1+\frac{KT}{d\lambda_{t}}\right)}. (From Lemma 13 and 14)

For the choice λt=dlog(KT)\lambda_{t}=d\log(KT), we have γt(δ)=O(d1/2log1/2(KT))\gamma_{t}(\delta)=\mathrm{O}\mathinner{\left(d^{\nicefrac{{1}}{{2}}}\log^{\nicefrac{{1}}{{2}}}\mathinner{\left(KT\right)}\right)}. ∎

A.6 Convex relaxation

Lemma 8.

Et(δ)Ct(δ)E_{t}\mathinner{\left(\delta\right)}\supseteq C_{t}\mathinner{\left(\delta\right)}, therefore for any θCt(δ)\theta\in C_{t}(\delta), we also have θEt(δ)\theta\,\in\,E_{t}(\delta) (see Eq (7)).

Proof.

Let θ^t\hat{\theta}_{t} be the maximum likelihood estimate (see Eq (5)). The second-order Taylor series expansion of the log-loss (with integral remainder term) around θ^t\hat{\theta}_{t}, for any θd\theta\in\mathbb{R}^{d}, is given by:

tλ(θ)=\displaystyle\mathcal{L}^{\lambda}_{t}(\theta)= tλ(θ^t)+tλ(θ^t)(θθ^t)\displaystyle\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t})+\nabla\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t})^{\top}(\theta-\hat{\theta}_{t})
\displaystyle+(\theta-\hat{\theta}_{t})^{\top}\mathinner{\left(\int^{1}_{\nu=0}(1-\nu)\nabla^{2}\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t}+\nu(\theta-\hat{\theta}_{t}))\cdot d\nu\right)}(\theta-\hat{\theta}_{t}) (40)

tλ(θ^t)=0\nabla\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t})=0 by definition since θ^t\hat{\theta}_{t} is the maximum likelihood estimate. Therefore:

tλ(θ)\displaystyle\mathcal{L}^{\lambda}_{t}(\theta) =tλ(θ^t)+(θθ^t)(ν=01(1ν)2tλ(θ^t+ν(θθ^t))𝑑ν)(θθ^t)\displaystyle=\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t})+(\theta-\hat{\theta}_{t})^{\top}\mathinner{\left(\int^{1}_{\nu=0}(1-\nu)\nabla^{2}\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t}+\nu(\theta-\hat{\theta}_{t}))\cdot d\nu\right)}(\theta-\hat{\theta}_{t})
=tλ(θ^t)+(θθ^t)(ν=01(1ν)𝐇t(θ^t+ν(θθ^t))𝑑ν)(θθ^t)\displaystyle=\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t})+(\theta-\hat{\theta}_{t})^{\top}\mathinner{\left(\int^{1}_{\nu=0}(1-\nu)\mathbf{H}_{t}(\hat{\theta}_{t}+\nu(\theta-\hat{\theta}_{t}))\cdot d\nu\right)}(\theta-\hat{\theta}_{t}) (2tλ()=𝐇t()\nabla^{2}\mathcal{L}^{\lambda}_{t}(\cdot)=\mathbf{H}_{t}(\cdot))
tλ(θ^t)+θθ^t𝐆t(θ,θ^t)2\displaystyle\leq\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t})+\mathinner{\!\left\lVert\theta-\hat{\theta}_{t}\right\rVert}^{2}_{\mathbf{G}_{t}(\theta,\hat{\theta}_{t})} (def. of 𝐆t(θ,θ^t)\mathbf{G}_{t}(\theta,\hat{\theta}_{t}))
tλ(θ^t)+gt(θ)gt(θ^t)𝐆t1(θ,θ^t)2.\displaystyle\leq\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t})+\mathinner{\!\left\lVert g_{t}(\theta)-g_{t}(\hat{\theta}_{t})\right\rVert}^{2}_{\mathbf{G}^{-1}_{t}(\theta,\hat{\theta}_{t})}. (Eq (29))

Thus we obtain:

tλ(θ)tλ(θ^t)\displaystyle\mathcal{L}^{\lambda}_{t}(\theta)-\mathcal{L}^{\lambda}_{t}(\hat{\theta}_{t}) gt(θ)gt(θt^)𝐆t1(θ,θ^t)2\displaystyle\leq\mathinner{\!\left\lVert g_{t}(\theta)-g_{t}(\hat{\theta_{t}})\right\rVert}^{2}_{\mathbf{G}_{t}^{-1}(\theta,\hat{\theta}_{t})}
(γt2(δ)λt+γt(δ))2=βt2(δ),\displaystyle\leq\mathinner{\left(\frac{\gamma_{t}^{2}(\delta)}{\lambda_{t}}+\gamma_{t}(\delta)\right)}^{2}=\beta^{2}_{t}(\delta), (from Lemma 15)

where the last inequality shows that θEt(δ)\theta\in E_{t}(\delta) by the definition of the set Et(δ)E_{t}(\delta). Since θCt(δ)\theta_{*}\in C_{t}(\delta) for all t1t\geq 1 with probability at least 1δ1-\delta, it follows that (t1,θEt(δ))1δ\mathbb{P}\mathinner{\left(\forall t\geq 1,\theta_{*}\in E_{t}(\delta)\right)}\geq 1-\delta. ∎

The following helper lemma translates the confidence set definition of Lemma 10 to the norm defined by 𝐆t1(θ1,θ2)\mathbf{G}_{t}^{-1}(\theta_{1},\theta_{2}).

Lemma 15.

Let δ(0,1]\delta\in(0,1]. For all θCt(δ)\theta\in C_{t}(\delta), with θ^t\hat{\theta}_{t} the maximum likelihood estimate in Eq (5), we have:

gt(θ)gt(θt^)𝐆t1(θ,θ^t)γt2(δ)λt+γt(δ).\mathinner{\!\left\lVert g_{t}(\theta)-g_{t}(\hat{\theta_{t}})\right\rVert}_{\mathbf{G}_{t}^{-1}(\theta,\hat{\theta}_{t})}\leq\frac{\gamma_{t}^{2}(\delta)}{\lambda_{t}}+\gamma_{t}(\delta).
Proof.

We have:

𝐆t(θ,θ^t)\displaystyle\mathbf{G}_{t}\mathinner{\left(\theta,\hat{\theta}_{t}\right)} =s=1t1i𝒬sαi(𝐗𝒬s,θ,θ^t)xs,ixs,i+λt𝐈d\displaystyle=\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\alpha_{i}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{s}},\theta,\hat{\theta}_{t}\right)}x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d} (def. of 𝐆t(θ,θ^t)\mathbf{G}_{t}\mathinner{\left(\theta,\hat{\theta}_{t}\right)})
s=1t1i𝒬sμ˙i(𝐗𝒬sθ)(1+|xs,iθxs,iθ^t|)1xs,ixs,i+λt𝐈d\displaystyle\geq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta)\mathinner{\left(1+|x_{s,i}^{\top}\theta-x_{s,i}^{\top}\hat{\theta}_{t}|\right)}^{-1}x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d} (from Lemma 12)
s=1t1i𝒬sμ˙i(𝐗𝒬sθ)(1+xs,i𝐆t1(θ,θ^t)θθ^t𝐆t(θ,θ^t))1xs,ixs,i\displaystyle\geq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta)\mathinner{\left(1+\mathinner{\!\left\lVert x_{s,i}\right\rVert}_{\mathbf{G}^{-1}_{t}\mathinner{\left(\theta,\hat{\theta}_{t}\right)}}\mathinner{\!\left\lVert\theta-\hat{\theta}_{t}\right\rVert}_{\mathbf{G}_{t}\mathinner{\left(\theta,\hat{\theta}_{t}\right)}}\right)}^{-1}x_{s,i}x_{s,i}^{\top}
+λt𝐈d\displaystyle+\lambda_{t}\mathbf{I}_{d} (Cauchy-Schwarz inequality)
(1+λt1/2θθ^t𝐆t(θ,θ^t))1s=1t1i𝒬sμ˙i(𝐗𝒬sθ)xs,ixs,i+λt𝐈d\displaystyle\geq\mathinner{\left(1+\lambda_{t}^{-\nicefrac{{1}}{{2}}}\mathinner{\!\left\lVert\theta-\hat{\theta}_{t}\right\rVert}_{\mathbf{G}_{t}\mathinner{\left(\theta,\hat{\theta}_{t}\right)}}\right)}^{-1}\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta)x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d} (𝐆t(θ,θ^t)λt𝐈d\mathbf{G}_{t}(\theta,\hat{\theta}_{t})\succeq\lambda_{t}\mathbf{I}_{d})
(1+λt1/2θθ^t𝐆t(θ,θ^t))1(s=1t1i𝒬sμ˙i(𝐗𝒬sθ)xs,ixs,i+λt𝐈d)\displaystyle\geq\mathinner{\left(1+\lambda_{t}^{-\nicefrac{{1}}{{2}}}\mathinner{\!\left\lVert\theta-\hat{\theta}_{t}\right\rVert}_{\mathbf{G}_{t}\mathinner{\left(\theta,\hat{\theta}_{t}\right)}}\right)}^{-1}\mathinner{\left(\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\mu}_{i}(\mathbf{X}_{\mathcal{Q}_{s}}^{\top}\theta)x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}\right)}
=(1+λt1/2θθ^t𝐆t(θ,θ^t))1𝐇t(θ)\displaystyle=\mathinner{\left(1+\lambda_{t}^{-\nicefrac{{1}}{{2}}}\mathinner{\!\left\lVert\theta-\hat{\theta}_{t}\right\rVert}_{\mathbf{G}_{t}\mathinner{\left(\theta,\hat{\theta}_{t}\right)}}\right)}^{-1}\mathbf{H}_{t}\mathinner{\left(\theta\right)} (def. of 𝐇t(θ)\mathbf{H}_{t}\mathinner{\left(\theta\right)})
=(1+λt1/2gt(θ)gt(θ^t)𝐆t1(θ,θ^t))1𝐇t(θ),\displaystyle=\mathinner{\left(1+\lambda_{t}^{-\nicefrac{{1}}{{2}}}\mathinner{\!\left\lVert g_{t}(\theta)-g_{t}(\hat{\theta}_{t})\right\rVert}_{\mathbf{G}_{t}^{-1}\mathinner{\left(\theta,\hat{\theta}_{t}\right)}}\right)}^{-1}\mathbf{H}_{t}\mathinner{\left(\theta\right)}, (from Eq (29))

where

𝐆t(θ,θ^t)(1+λt1/2gt(θ)gt(θ^t)𝐆t1(θ,θ^t))1𝐇t(θ)\mathbf{G}_{t}\mathinner{\left(\theta,\hat{\theta}_{t}\right)}\succeq\mathinner{\left(1+\lambda_{t}^{-\nicefrac{{1}}{{2}}}\mathinner{\!\left\lVert g_{t}(\theta)-g_{t}(\hat{\theta}_{t})\right\rVert}_{\mathbf{G}_{t}^{-1}\mathinner{\left(\theta,\hat{\theta}_{t}\right)}}\right)}^{-1}\mathbf{H}_{t}\mathinner{\left(\theta\right)}

is the local-information-preserving counterpart of the relation in Lemma 4. This gives:

gt(θ)gt(θ^t)𝐆t1(θ,θ^t)2\displaystyle\mathinner{\!\left\lVert g_{t}(\theta)-g_{t}(\hat{\theta}_{t})\right\rVert}^{2}_{\mathbf{G}_{t}^{-1}\mathinner{\left(\theta,\hat{\theta}_{t}\right)}}
\displaystyle\leq\mathinner{\left(1+\lambda_{t}^{-\nicefrac{{1}}{{2}}}\mathinner{\!\left\lVert g_{t}(\theta)-g_{t}(\hat{\theta}_{t})\right\rVert}_{\mathbf{G}_{t}^{-1}\mathinner{\left(\theta,\hat{\theta}_{t}\right)}}\right)}\mathinner{\!\left\lVert g_{t}(\theta)-g_{t}(\hat{\theta}_{t})\right\rVert}^{2}_{\mathbf{H}_{t}^{-1}\mathinner{\left(\theta\right)}}
\displaystyle\leq λt1/2γt2(δ)gt(θ)gt(θ^t)𝐆t1(θ,θ^t)+γt2(δ),\displaystyle\lambda_{t}^{-\nicefrac{{1}}{{2}}}\gamma_{t}^{2}(\delta)\mathinner{\!\left\lVert g_{t}(\theta)-g_{t}(\hat{\theta}_{t})\right\rVert}_{\mathbf{G}_{t}^{-1}\mathinner{\left(\theta,\hat{\theta}_{t}\right)}}+\gamma^{2}_{t}(\delta), (from Lemma 10)

where the last relation is a quadratic inequality in gt(θ)gt(θ^t)𝐆t1(θ,θ^t)\mathinner{\!\left\lVert g_{t}(\theta)-g_{t}(\hat{\theta}_{t})\right\rVert}_{\mathbf{G}_{t}^{-1}\mathinner{\left(\theta,\hat{\theta}_{t}\right)}}, which on solving completes the proof of the statement in the lemma. ∎

Lemma 9.

Under the event θCt(δ)\theta_{*}\,\in\,C_{t}(\delta), the following holds θEt(δ)\forall\,\theta\,\in\,E_{t}(\delta):

θθ𝐇t(θ)(2+2S)γt(δ)+21+Sβt(δ).\displaystyle\mathinner{\!\left\lVert\theta-\theta_{*}\right\rVert}_{\mathbf{H}_{t}(\theta_{*})}\leq(2+2S)\gamma_{t}(\delta)+2\sqrt{1+S}\beta_{t}(\delta).

When λt=dlog(t)\lambda_{t}=d\log(t), then γt(δ)=O~(dlog(t))\gamma_{t}(\delta)=\tilde{\mathrm{O}}\mathinner{\left(\sqrt{d\log(t)}\right)}, βt(δ)=O~(dlog(t))\beta_{t}(\delta)=\tilde{\mathrm{O}}\mathinner{\left(\sqrt{d\log(t)}\right)}, and

θθ𝐇t(θ)=O~(dlog(t)).\mathinner{\!\left\lVert\theta-\theta_{*}\right\rVert}_{\mathbf{H}_{t}(\theta_{*})}=\tilde{\mathrm{O}}\mathinner{\left(\sqrt{d\log(t)}\right)}.
Proof.

Second-order Taylor expansion of the log-likelihood function with integral remainder term gives:

\displaystyle\mathcal{L}^{\lambda}_{t}(\theta)=\mathcal{L}^{\lambda}_{t}(\theta_{*})+\nabla\mathcal{L}^{\lambda}_{t}(\theta_{*})^{\top}(\theta-\theta_{*})
\displaystyle+(\theta-\theta_{*})^{\top}\mathinner{\left(\int^{1}_{\nu=0}(1-\nu)\nabla^{2}\mathcal{L}^{\lambda}_{t}(\theta_{*}+\nu(\theta-\theta_{*}))\cdot d\nu\right)}(\theta-\theta_{*})
\displaystyle=\mathcal{L}^{\lambda}_{t}(\theta_{*})+\nabla\mathcal{L}^{\lambda}_{t}(\theta_{*})^{\top}(\theta-\theta_{*})+\mathinner{\!\left\lVert\theta-\theta_{*}\right\rVert}^{2}_{\mathbf{\tilde{G}}_{t}(\theta_{*},\theta)},

where 𝐆~t(θ,θ)\mathbf{\tilde{G}}_{t}(\theta_{*},\theta)\coloneqq\int^{1}_{\nu=0}(1-\nu)\mathbf{H}_{t}(\theta_{*}+\nu(\theta-\theta_{*}))\cdot d\nu. From Lemma 4 and Lemma 8 of Abeille et al. [2021] it also follows that

θθ𝐆~t(θ,θ)2(2+2S)1θθ𝐇t(θ)2\mathinner{\!\left\lVert\theta-\theta_{*}\right\rVert}^{2}_{\mathbf{\tilde{G}}_{t}(\theta_{*},\theta)}\geq(2+2S)^{-1}\mathinner{\!\left\lVert\theta-\theta_{*}\right\rVert}^{2}_{\mathbf{H}_{t}(\theta_{*})}

Therefore we have:

θθ𝐇t(θ)2\displaystyle\mathinner{\!\left\lVert\theta-\theta_{*}\right\rVert}^{2}_{\mathbf{H}_{t}(\theta_{*})}
\displaystyle\leq(2+2S)\mathinner{\!\left\lvert\mathcal{L}^{\lambda}_{t}(\theta)-\mathcal{L}^{\lambda}_{t}(\theta_{*})\right\rvert}+(2+2S)\mathinner{\!\left\lvert\nabla\mathcal{L}^{\lambda}_{t}(\theta_{*})^{\top}(\theta-\theta_{*})\right\rvert}
\displaystyle\leq 2(2+2S)\beta_{t}^{2}(\delta)+(2+2S)\mathinner{\!\left\lvert\nabla\mathcal{L}^{\lambda}_{t}(\theta_{*})^{\top}(\theta-\theta_{*})\right\rvert} (def. of Et(δ)E_{t}(\delta))
\displaystyle\leq 2(2+2S)\beta_{t}^{2}(\delta)+(2+2S)\mathinner{\!\left\lVert\nabla\mathcal{L}^{\lambda}_{t}(\theta_{*})\right\rVert}_{\mathbf{H}_{t}^{-1}(\theta_{*})}\mathinner{\!\left\lVert\theta-\theta_{*}\right\rVert}_{\mathbf{H}_{t}(\theta_{*})} (Cauchy-Schwarz inequality)
\displaystyle\leq 2(2+2S)\beta_{t}^{2}(\delta)+(2+2S)\gamma_{t}(\delta)\mathinner{\!\left\lVert\theta-\theta_{*}\right\rVert}_{\mathbf{H}_{t}(\theta_{*})}. (From Lemma 10, since tλ(θ)=gt(θ^t)gt(θ)\nabla\mathcal{L}^{\lambda}_{t}(\theta_{*})=g_{t}(\hat{\theta}_{t})-g_{t}(\theta_{*}))

Solving the quadratic inequality in θθ𝐇t(θ)\mathinner{\!\left\lVert\theta-\theta_{*}\right\rVert}_{\mathbf{H}_{t}(\theta_{*})}, we get:

θθ𝐇t(θ)(2+2S)γt(δ)+21+Sβt(δ).\displaystyle\mathinner{\!\left\lVert\theta-\theta_{*}\right\rVert}_{\mathbf{H}_{t}(\theta_{*})}\leq(2+2S)\gamma_{t}(\delta)+2\sqrt{1+S}\beta_{t}(\delta).

When λt=dlog(t)\lambda_{t}=d\log(t), then γt(δ)=O~(dlog(t))\gamma_{t}(\delta)=\tilde{\mathrm{O}}\mathinner{\left(\sqrt{d\log(t)}\right)} and βt(δ)=O~(dlog(t))\beta_{t}(\delta)=\tilde{\mathrm{O}}\mathinner{\left(\sqrt{d\log(t)}\right)}. ∎

A.7 Technical lemmas

Remark 3 (Derivatives for MNL choice function).

For the multinomial logit choice function, where the expected reward due to item ii of the assortment StS_{t} is modeled as:

fi(St,𝐫)=eri1+eri+jSt,jiKerjf_{i}(S_{t},\mathbf{r})=\frac{e^{r_{i}}}{1+e^{r_{i}}+\sum_{j\,\in\,S_{t},\,j\neq i}^{K}e^{r_{j}}}

the partial derivative with respect to the expected reward of the ii-th item is given as:

firi=fi(St,𝐫)(1fi(St,𝐫))\frac{\partial f_{i}}{\partial r_{i}}=f_{i}(S_{t},\mathbf{r})\mathinner{\left(1-f_{i}(S_{t},\mathbf{r})\right)}

and the second derivative as:

2firi2=fi(St,𝐫)(1fi(St,𝐫))(12fi(St,𝐫)).\frac{\partial^{2}f_{i}}{\partial r_{i}^{2}}=f_{i}(S_{t},\mathbf{r})\mathinner{\left(1-f_{i}(S_{t},\mathbf{r})\right)}\mathinner{\left(1-2f_{i}(S_{t},\mathbf{r})\right)}.
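The two formulas in Remark 3 are easy to sanity check by finite differences. The Python sketch below (illustrative only, with arbitrary utilities) compares both closed-form expressions against numerical differentiation and also checks the relation proved in Lemma 16 below.

import numpy as np

rng = np.random.default_rng(3)
r = rng.normal(size=5)          # utilities of the items in the assortment
i = 2

def f_i(ri):
    u = np.exp(np.concatenate(([ri], np.delete(r, i))))
    return u[0] / (1.0 + u.sum())

fi = f_i(r[i])
d1_closed = fi * (1.0 - fi)                        # first derivative, Remark 3
d2_closed = fi * (1.0 - fi) * (1.0 - 2.0 * fi)     # second derivative, Remark 3

h = 1e-5
d1_numeric = (f_i(r[i] + h) - f_i(r[i] - h)) / (2 * h)
d2_numeric = (f_i(r[i] + h) - 2 * fi + f_i(r[i] - h)) / h**2

print(np.isclose(d1_closed, d1_numeric))
print(np.isclose(d2_closed, d2_numeric, atol=1e-4))
print(abs(d2_closed) <= d1_closed)                 # the relation in Lemma 16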
Lemma 16 (Self-Concordance like relation for MNL).

For the multinomial logit choice function, where the expected reward due to item ii of the assortment StS_{t} is modeled as:

fi(St,𝐫)=eri1+eri+jSt,jiKerjf_{i}(S_{t},\mathbf{r})=\frac{e^{r_{i}}}{1+e^{r_{i}}+\sum_{j\,\in\,S_{t},\,j\neq i}^{K}e^{r_{j}}}

the following relation holds:

|2firi2|firi\mathinner{\!\left\lvert\frac{\partial^{2}f_{i}}{\partial r_{i}^{2}}\right\rvert}\leq\frac{\partial f_{i}}{\partial r_{i}}
Proof.

The proof directly follows from Remark 3 and the observation
|12fi(St,𝐫)|1\mathinner{\!\left\lvert 1-2f_{i}(S_{t},\mathbf{r})\right\rvert}\leq 1 for all items ii in the assortment. ∎

Lemma 17 (Generalized elliptical potential).

Let {𝐗𝒬s}\mathinner{\left\{\mathbf{X}_{\mathcal{Q}_{s}}\right\}} be a sequence in d×K\mathbb{R}^{d\times K} such that, for each ss, 𝐗𝒬s\mathbf{X}_{\mathcal{Q}_{s}} has columns {xs,1,xs,2,,xs,K}\{x_{s,1},x_{s,2},\cdots,x_{s,K}\} with xs,i2w\mathinner{\!\left\lVert x_{s,i}\right\rVert}_{2}\leq w and xs,idx_{s,i}\in\mathbb{R}^{d} for all s1s\geq 1 and i[K]i\,\in\,[K]. Also, let λt\lambda_{t} be a non-negative scalar. For t1t\geq 1, define 𝐉ts=1t1i𝒬sμi¯˙(𝐗𝒬s)xs,ixs,i+λt𝐈d\mathbf{J}_{t}\coloneqq\sum_{s=1}^{t-1}\sum_{i\in\mathcal{Q}_{s}}\dot{\underline{\mu_{i}}}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{s}}\right)}x_{s,i}x_{s,i}^{\top}+\lambda_{t}\mathbf{I}_{d}, where μi¯˙(𝐗𝒬s)\dot{\underline{\mu_{i}}}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{s}}\right)} is strictly positive for all i[K]i\,\in\,[K]. Then the following inequality holds:

t=1Tmin{i𝒬tx~t,i𝐉t12,1}2log(det(𝐉T+1)λtd)\sum_{t=1}^{T}\min\mathinner{\left\{\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert\tilde{x}_{t,i}\right\rVert}^{2}_{\mathbf{J}^{-1}_{t}},1\right\}}\leq 2\log\mathinner{\left(\frac{\det(\mathbf{J}_{T+1})}{\lambda_{t}^{d}}\right)}

with x~t,i=μi¯˙(𝐗𝒬t)xt,i\tilde{x}_{t,i}=\sqrt{\dot{\underline{\mu_{i}}}\mathinner{\left(\mathbf{X}_{\mathcal{Q}_{t}}\right)}}x_{t,i}.

Proof.

By the definition of 𝐉t\mathbf{J}_{t}:

det(𝐉t+1)\displaystyle\det\mathinner{\left(\mathbf{J}_{t+1}\right)} =det(𝐉t+i𝒬tx~t,ix~t,i)\displaystyle=\det\mathinner{\left(\mathbf{J}_{t}+\sum_{i\in\mathcal{Q}_{t}}\tilde{x}_{t,i}\tilde{x}_{t,i}^{\top}\right)}
=det(𝐉t)det(𝐈d+𝐉t1/2i𝒬tx~t,ix~t,i𝐉t1/2)\displaystyle=\det\mathinner{\left(\mathbf{J}_{t}\right)}\det\mathinner{\left(\mathbf{I}_{d}+\mathbf{J}_{t}^{-\nicefrac{{1}}{{2}}}\sum_{i\in\mathcal{Q}_{t}}\tilde{x}_{t,i}\tilde{x}_{t,i}^{\top}\mathbf{J}_{t}^{-\nicefrac{{1}}{{2}}}\right)}
=det(𝐉t)(1+i𝒬tx~t,i𝐉t12).\displaystyle=\det\mathinner{\left(\mathbf{J}_{t}\right)}\mathinner{\left(1+\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert\tilde{x}_{t,i}\right\rVert}^{2}_{\mathbf{J}^{-1}_{t}}\right)}.

Taking log from both sides and summing from t=1t=1 to TT:

t=1Tlog(1+i𝒬tx~t,i𝐉t12)\displaystyle\sum_{t=1}^{T}\log\mathinner{\left(1+\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert\tilde{x}_{t,i}\right\rVert}^{2}_{\mathbf{J}^{-1}_{t}}\right)} =t=1Tlog(det(𝐉t+1))log(det(𝐉t))\displaystyle=\sum_{t=1}^{T}\log\mathinner{\left(\det(\mathbf{J}_{t+1})\right)}-\log\mathinner{\left(\det(\mathbf{J}_{t})\right)}
=t=1Tlog(𝐉t+1𝐉t)\displaystyle=\sum_{t=1}^{T}\log\mathinner{\left(\frac{\mathbf{J}_{t+1}}{\mathbf{J}_{t}}\right)}
=log(det(𝐉t+1)det(λt𝐈d))\displaystyle=\log\mathinner{\left(\frac{\det(\mathbf{J}_{t+1})}{\det(\lambda_{t}\mathbf{I}_{d})}\right)} (By a telescopic sum cancellation)
=log(det(𝐉t+1)λtd).\displaystyle=\log\mathinner{\left(\frac{\det(\mathbf{J}_{t+1})}{\lambda_{t}^{d}}\right)}. (44)

For any aa such that 0a10\leq a\leq 1, it follows that a2log(1+a)a\leq 2\log(1+a). Therefore, we write:

t=1Tmin{i𝒬tx~t,i𝐉t12,1}\displaystyle\sum_{t=1}^{T}\min\mathinner{\left\{\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert\tilde{x}_{t,i}\right\rVert}^{2}_{\mathbf{J}^{-1}_{t}},1\right\}} 2t=1Tlog(1+i𝒬tx~t,i𝐉t12)\displaystyle\leq 2\sum_{t=1}^{T}\log\mathinner{\left(1+\sum_{i\in\mathcal{Q}_{t}}\mathinner{\!\left\lVert\tilde{x}_{t,i}\right\rVert}^{2}_{\mathbf{J}^{-1}_{t}}\right)}
=2log(det(𝐉T+1)λtd).\displaystyle=2\log\mathinner{\left(\frac{\det(\mathbf{J}_{T+1})}{\lambda_{t}^{d}}\right)}. (From Eq (44))

Lemma 18 (Determinant-trace inequality, see Lemma 10 in Abbasi-Yadkori et al. [2011]).

Let {xs}s=1\{x_{s}\}_{s=1}^{\infty} be a sequence in d\mathbb{R}^{d} such that xs2X\mathinner{\!\left\lVert x_{s}\right\rVert}_{2}\leq X for all ss\in\mathbb{N}, and let λt\lambda_{t} be a non-negative scalar. For t1t\geq 1 define 𝐕ts=1t1xsxs+λt𝐈d\mathbf{V}_{t}\coloneqq\sum_{s=1}^{t-1}x_{s}x_{s}^{\top}+\lambda_{t}\mathbf{I}_{d}. The following inequality holds:

det(𝐕t+1)(λt+tX2/d)d.\displaystyle\det(\mathbf{V}_{t+1})\leq\left(\lambda_{t}+tX^{2}/d\right)^{d}.
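The determinant-trace bound of Lemma 18 can likewise be verified numerically on random data; the Python sketch below (illustrative only, with vectors rescaled so that X=1X=1) builds 𝐕t+1\mathbf{V}_{t+1} and compares its determinant with the stated upper bound.

import numpy as np

rng = np.random.default_rng(4)
d, t, lam, Xmax = 3, 200, 1.0, 1.0

xs = rng.normal(size=(t, d))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)    # enforce ||x_s||_2 <= X = 1

V = lam * np.eye(d) + xs.T @ xs                    # V_{t+1} = sum_s x_s x_s^T + lam I
bound = (lam + t * Xmax**2 / d) ** d               # (lam + t X^2 / d)^d

print(np.linalg.det(V) <= bound)                   # True by Lemma 18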

A.8 Numerical experiments

We build on Section 5 and compare the empirical performance of our proposed algorithm CB-MNL with the previous state of the art in the MNL contextual bandit literature: UCB-MNL [Oh & Iyengar, 2021] and TS-MNL [Oh & Iyengar, 2019], on artificial data for varying model parameters. θd\theta_{*}\in\mathbb{R}^{d} is a dd-dimensional random vector with each coordinate distributed independently and uniformly in [0,1][0,1]. The contexts follow a multivariate Gaussian distribution. Algorithm CB-MNL only knows the values of N,T,K,dN,T,K,d. In addition, algorithms TS-MNL and UCB-MNL also need to know the value of κ\kappa for their implementation. Here we simulate two additional parameter instances, again averaged over 2525 Monte Carlo runs.
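For completeness, a minimal Python sketch of the simulation environment described above is shown below; it only generates θ\theta_{*}, the Gaussian contexts, and MNL purchase feedback for an offered assortment, while the bandit policies themselves (CB-MNL, UCB-MNL, TS-MNL) are deliberately omitted and replaced with a uniformly random assortment as a placeholder. The function names, the context scaling, and this placeholder policy are our own illustrative choices and are not part of the experimental protocol.

import numpy as np

rng = np.random.default_rng(5)
N, d, K, T = 10, 3, 6, 100            # one of the parameter instances of Figure 3

theta_star = rng.uniform(size=d)      # each coordinate i.i.d. uniform on [0, 1]

def draw_contexts():
    # One context per item; multivariate Gaussian, rescaled to keep utilities bounded.
    return rng.multivariate_normal(np.zeros(d), np.eye(d) / d, size=N)

def mnl_feedback(X_assort):
    # Index of the purchased item within the offered assortment, or -1 for no purchase.
    u = np.exp(X_assort @ theta_star)
    probs = np.append(u, 1.0) / (1.0 + u.sum())
    idx = rng.choice(len(probs), p=probs)
    return idx if idx < len(u) else -1

for t in range(T):
    X = draw_contexts()
    assortment = rng.choice(N, size=K, replace=False)   # uniformly random placeholder policy (not CB-MNL)
    choice = mnl_feedback(X[assortment])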

Figure 3: Comparison of cumulative regret for two additional parameter instances (left: κT\kappa\gg\sqrt{T}, N=10,d=3,K=6,T=100N=10,d=3,K=6,T=100; right: κT\kappa\gg\sqrt{T}, N=20,d=3,K=5,T=100N=20,d=3,K=5,T=100).