Adversarial Combinatorial Bandits with General Non-linear Reward Functions
Abstract
In this paper we study the adversarial combinatorial bandit with a known non-linear reward function, extending existing work on the adversarial linear combinatorial bandit. The adversarial combinatorial bandit with general non-linear rewards is an important open problem in the bandit literature, and it is still unclear whether there is a significant gap from the case of linear rewards, stochastic bandits, or semi-bandit feedback. We show that, with $N$ arms and subsets of $K$ arms being chosen at each of $T$ time periods, the minimax optimal regret is $\widetilde\Theta_K(\sqrt{N^d T})$ if the reward function is a $d$-degree polynomial with $d < K$, and $\widetilde\Theta_K(\sqrt{N^K T})$ if the reward function is not a low-degree polynomial. Both bounds are significantly different from the $\widetilde\Theta(\sqrt{NT})$ bound for the linear case, which suggests that there is a fundamental gap between the linear and non-linear reward structures. Our result also finds applications to the adversarial assortment optimization problem in online recommendation. We show that in the worst case of the adversarial assortment problem, the optimal algorithm must treat each individual assortment as an independent arm.
1 Introduction
In this paper we study the combinatorial bandit problem, which is a natural extension of the multi-armed bandit problem (Auer et al., 1995) and has applications to online advertising, online shortest paths and many other practical problems (Cesa-Bianchi & Lugosi, 2012; Chen et al., 2013, 2016a, 2016b; Wang & Chen, 2018). In the adversarial combinatorial bandit setting, there are $T$ time periods and $N$ arms. At the beginning of each time period $t$, an adaptive adversary chooses a reward vector $v_t = (v_{t1}, \dots, v_{tN}) \in [0,1]^N$ that is not revealed to the algorithm. The algorithm chooses a subset $S_t \subseteq [N]$ consisting of exactly $K$ distinct arms (i.e., $|S_t| = K$). The algorithm then receives a bandit feedback $r_t$ satisfying
$\mathbb{E}[r_t \mid S_t, v_t] = f(\{v_{t,i}\}_{i \in S_t}), \qquad (1)$
where $f : [0,1]^K \to [0,1]$ is a known link function. The objective is to minimize the regret of the algorithm compared against a stationary benchmark, defined as
$\mathrm{Reg}_T := \max_{S \subseteq [N],\, |S| = K} \mathbb{E}\Big[\sum_{t=1}^T f(\{v_{t,i}\}_{i \in S}) - \sum_{t=1}^T f(\{v_{t,i}\}_{i \in S_t})\Big], \qquad (2)$
where the maximum is taken over all subsets $S$ of size $K$, and $S_1, \dots, S_T$ are the subsets output by a regret minimization algorithm.
As far as we know, all existing works on adversarial combinatorial bandits studied only the case where the link function is linear (i.e., $f(x_1, \dots, x_K) = x_1 + \cdots + x_K$) (Cesa-Bianchi & Lugosi, 2012; Bubeck et al., 2012; Audibert et al., 2014). While there has been research on combinatorial bandits with general link functions, such results are established exclusively for the stochastic setting, in which the reward vectors $v_t$ do not change over time (Rusmevichientong et al., 2010; Agarwal & Aggarwal, 2018; Agrawal et al., 2019), and most of them further assume a semi-bandit feedback where noisy observations of all $v_{t,i}$ with $i \in S_t$ are available (Combes et al., 2015; Kveton et al., 2015; Chen et al., 2016a, 2018a, 2018b). This motivates the following natural question:
Question 1.
For adversarial combinatorial bandits with a non-linear link function and bandit feedback, is it possible to achieve an $\widetilde{O}(\mathrm{poly}(N, K)\sqrt{T})$ regret?
Note that in adversarial combinatorial bandits with a linear link function, or in stochastic combinatorial (semi-)bandits with general link functions, the regret targeted in Question 1 can be attained (Bubeck et al., 2012; Combes et al., 2015; Kveton et al., 2015; Chen et al., 2016a; Agrawal et al., 2019). The question also has important practical motivations beyond theoretical/mathematical interest, because many interesting applications of combinatorial bandits involve non-linear link functions, such as online assortment optimization with a multinomial logit (MNL) model, which corresponds to the link function $f(x_1, \dots, x_K) = \frac{x_1 + \cdots + x_K}{1 + x_1 + \cdots + x_K}$. Please see more discussion in Sec. 1.2.
1.1 Our results
Below is an informal statement of our main result, as a summary of Theorem 1 later in the paper.
Corollary 1 (Informal).
Fix an arbitrary, known reward function $f : [0,1]^K \to [0,1]$. If $f$ is a $d$-degree polynomial for some $d < K$, then the optimal regret is $\widetilde\Theta_K(\sqrt{N^d T})$.
Otherwise, if $f$ is either not a polynomial or a polynomial of degree at least $K$, the optimal regret is $\widetilde\Theta_K(\sqrt{N^K T})$.
The results in Corollary 1 easily cover the linear link function case $f(x_1, \dots, x_K) = x_1 + \cdots + x_K$, with $d = 1$ and the optimal regret being $\widetilde\Theta(\sqrt{NT})$ (Bubeck et al., 2012). On the other hand, Corollary 1 shows that when $f$ is not a polynomial, no algorithm can achieve a regret better than $\Omega_K(\sqrt{N^K T})$. This shows that when $f$ is a general non-linear reward function, the $\binom{N}{K}$ subsets of the arms can only be treated as "independent" arms, and it is information-theoretically impossible for any algorithm to exploit the correlation between overlapping subsets of arms to achieve a significantly smaller regret.
1.2 Dynamic assortment optimization
Dynamic assortment optimization is a key problem in revenue management and online recommendation, which naturally serves as a motivation for the generalized combinatorial bandit problem studied in this paper. In the standard setup of dynamic assortment optimization (Agrawal et al., 2019), there are $N$ substitutable products, each associated with a known profit parameter $r_i$ and an unknown mean utility parameter $v_i$. At each time $t$, the seller offers an assortment $S_t \subseteq [N]$ (i.e., the recommended set of products) of size $K$, e.g., there are $K$ display spots for recommended products on an Amazon webpage. Then the customer either purchases one of the products being offered (i.e., $i_t \in S_t$) or leaves without making any purchase (i.e., $i_t = 0$). The choice behavior of the customer is governed by the well-known MultiNomial-Logit (MNL) model from economics (Train, 2009):
$\Pr[i_t = i \mid S_t] = \frac{v_i}{1 + \sum_{j \in S_t} v_j}, \qquad i \in S_t, \qquad (3)$
with the definition that $\Pr[i_t = 0 \mid S_t] = \frac{1}{1 + \sum_{j \in S_t} v_j}$, where the mean utility of the no-purchase option is normalized to one. The objective for the seller or retailer is to maximize the expected profit/revenue $R(S_t) := \mathbb{E}[r_{i_t} \mid S_t] = \frac{\sum_{i \in S_t} r_i v_i}{1 + \sum_{j \in S_t} v_j}$, with the convention $r_0 = 0$.
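For concreteness, the following Python sketch (our own illustration, not code from the paper; the product utilities, profits and assortment below are arbitrary example values) computes the MNL choice probabilities in Eq. (3) and the resulting expected revenue of an assortment:

```python
import numpy as np

def mnl_choice_probabilities(v, S):
    """P(customer purchases product i | assortment S) under the MNL model,
    with the no-purchase utility normalized to 1 as in Eq. (3)."""
    denom = 1.0 + sum(v[j] for j in S)
    return {i: v[i] / denom for i in S}   # remaining mass 1/denom is no-purchase

def expected_revenue(v, r, S):
    """Expected revenue of assortment S: sum_{i in S} r_i v_i / (1 + sum_{j in S} v_j)."""
    denom = 1.0 + sum(v[j] for j in S)
    return sum(r[i] * v[i] for i in S) / denom

# Example: with unit profits (r_i = 1), the expected revenue reduces to the
# link function f(x_1, ..., x_K) = (x_1 + ... + x_K) / (1 + x_1 + ... + x_K).
v = np.array([0.5, 1.0, 0.2, 0.8])   # mean utilities (arbitrary example values)
r = np.ones(4)                       # unit profit parameters
print(mnl_choice_probabilities(v, [0, 1, 3]))
print(expected_revenue(v, r, [0, 1, 3]))   # 2.3 / 3.3 ≈ 0.697
```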
Note also that, in the adversarial setting, the mean utility vector $v_t = (v_{t1}, \dots, v_{tN})$ will be different for each time period $t$, and will be selected by an adaptive adversary. The regret is then defined as in Eq. (2).
Let us first consider a special case, where all the products have profit one (i.e., $r_i \equiv 1$) and only binary purchase/no-purchase actions are observable. That is, one only observes a binary reward $r_t = \mathbf{1}\{i_t \neq 0\}$ at time $t$, which indicates whether there is a purchase. Then the dynamic assortment optimization question reduces to the generalized (adversarial) combinatorial bandit problem formulated in Eqs. (1, 2) with the link function $f(x_1, \dots, x_K) = \frac{x_1 + \cdots + x_K}{1 + x_1 + \cdots + x_K}$. Since $f$ is clearly not a polynomial, Corollary 1 shows that $\widetilde\Theta_K(\sqrt{N^K T})$ should be the optimal regret. The following corollary extends this to the general case of dynamic assortment optimization, where different products can have different profit parameters.
Corollary 2.
Consider the dynamic assortment optimization question with known profit parameters $\{r_i\}_{i=1}^N \subseteq [0,1]$ and unknown mean utility parameters $\{v_{t,i}\}$ chosen by an adaptive adversary. Then there exists an algorithm with regret upper bounded by $O(\sqrt{\binom{N}{K} T})$.
The next corollary, on the other hand, shows that the $\sqrt{\binom{N}{K} T}$ regret is not improvable up to $K$-dependent constants, even with a richer non-binary feedback and even if all products have the same profit parameter.
Corollary 3.
Suppose $r_i = 1$ for all $i \in [N]$. There exists an adaptive adversary that chooses $\{v_{t,i}\}$, such that for any algorithm, the regret is lower bounded by $\Omega_K(\sqrt{N^K T})$.
1.3 Proof techniques
As we shall see later in this paper, the $\widetilde{O}(\sqrt{N^d T})$ and $O(\sqrt{N^K T})$ upper bounds in Corollary 1 are relatively easy to establish, via reductions to known adversarial multi-armed or linear bandit algorithms. The key challenge is to establish the corresponding $\Omega(\sqrt{N^d T})$ and $\Omega_K(\sqrt{N^K T})$ lower bounds in Corollary 1.
In this section we give a high-level sketch of the key ideas in our lower bound proof. For simplicity we consider only the case when $f$ is not a polynomial function. The key insight is to prove the existence of a distribution $\mu$ of the reward vector $v \in [0,1]^N$, associated with an optimal subset $S^*$, such that the following holds for the expected feedback:

$\mathbb{E}_{v \sim \mu}\big[f(\{v_i\}_{i \in S})\big] = c_0 \ \text{ for every } S \neq S^* \text{ with } |S| = K, \qquad \text{while} \qquad \mathbb{E}_{v \sim \mu}\big[f(\{v_i\}_{i \in S^*})\big] > c_0. \qquad (4)$
Intuitively, Eq. (4) shows that no information is gained unless an algorithm guesses the optimal subset $S^*$ exactly, even if the subset produced by the algorithm differs from $S^*$ by only a single element. Since there are $\binom{N}{K}$ different subsets, the question of guessing the optimal subset exactly right is similar to locating the best arm in a multi-armed bandit problem with $\binom{N}{K}$ arms, which incurs a regret of $\Omega(\sqrt{\binom{N}{K} T})$.
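To get a sense of the scale of this gap, the following small computation (with arbitrary example values of $N$, $K$ and $T$; our own illustration) compares the number of candidate subsets with the resulting regret scales:

```python
from math import comb, sqrt

# Arbitrary example values, only to illustrate the scale of the gap.
N, K, T = 50, 4, 10**6
M = comb(N, K)                 # number of candidate optimal subsets
print(M)                       # 230300
print(sqrt(M * T))             # ~4.8e5: regret scale when f is non-polynomial
print(sqrt(N * T))             # ~7.1e3: regret scale for a linear link function
```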
To gain deeper insights into the construction of the distribution $\mu$ that satisfies Eq. (4), it is instructive to consider some simpler bandit settings in which a small regret can be achieved, and to understand why such a construction of $\mu$ does not apply there.
• The first setting is dynamic assortment optimization under the stationary setting, in which the mean utility vectors $v_t \equiv v$ remain the same for all time periods. The results of (Agrawal et al., 2019) achieve an $\widetilde{O}(\sqrt{NT})$ regret in this setting. Here the vector $v$ is deterministic and fixed, and therefore the feedback laws under two subsets $S$ and $S'$ must be correlated as long as $S \cap S' \neq \emptyset$. This means that Eq. (4) cannot possibly be satisfied, with every suboptimal subset revealing no information about $S^*$.
• The second setting is the adversarial combinatorial bandit with a linear link function $f(x_1, \dots, x_K) = x_1 + \cdots + x_K$, for which an $\widetilde{O}(\mathrm{poly}(N, K)\sqrt{T})$ regret is attainable (Bubeck et al., 2012; Combes et al., 2015). When $f$ is linear, the expected reward of a subset $S$ under the mixture distribution $\mu$ is $\mathbb{E}_{v \sim \mu}[f(\{v_i\}_{i \in S})] = \langle \mathbb{E}_\mu[v], \chi_S \rangle$, where $\chi_S \in \{0,1\}^N$ is the indicator vector of the subset $S$. Clearly, it is impossible to achieve (4), as there is no vector $\bar{v} = \mathbb{E}_\mu[v]$ satisfying that $\langle \bar{v}, \chi_S \rangle$ is constant for all $S \neq S^*$ and strictly larger for $S = S^*$.
• The third setting is a special stochastic combinatorial bandit, where $v$ is random but there exists a total ordering of the stochastic dominance relation among the components of $v$, and the link function $f$ is increasing. In this setting, it was shown in (Agarwal & Aggarwal, 2018) that a regret of order $\mathrm{poly}(N, K)\sqrt{T}$ can be achieved, and the stochastic dominance requirement implies that once an element of $S$ is replaced by a dominating element of $S^* \setminus S$, the expected reward must increase. Therefore, (4) cannot hold in this scenario either.
1.4 Other related works
Combinatorial bandit is a classical question in machine learning and has been extensively studied under the settings of stochastic semi-bandits (Chen et al., 2013; Combes et al., 2015; Kveton et al., 2015; Chen et al., 2016a, b; Wang & Chen, 2018; Merlis & Mannor, 2019, 2020), stochastic bandits (Agarwal & Aggarwal, 2018; Rejwan & Mansour, 2020; Kuroki et al., 2020), and adversarial linear bandits (Cesa-Bianchi & Lugosi, 2012; Bubeck et al., 2012; Audibert et al., 2014; Combes et al., 2015). In the above-mentioned works, either the reward link function is linear, or the model is stochastic (stationary).
There is another line of research on dynamic assortment optimization with the multinomial logit model, which is a form of combinatorial bandit with general reward functions (Rusmevichientong et al., 2010; Agrawal et al., 2017; Chen et al., 2018a, b; Chen & Wang, 2018; Chen et al., 2019; Agrawal et al., 2019). All of these works are carried out under the stochastic setting, with the exception of (Chen et al., 2019), which studied an $\varepsilon$-contamination model and obtained a regret upper bound containing a term linear in $\varepsilon T$. Clearly, in the fully adversarial setting of this paper (corresponding to $\varepsilon = 1$), the regret bound in (Chen et al., 2019) becomes linear in $T$ and is thus vacuous.
1.5 Notations
For a multi-index $\mathbf{a} = (a_1, \dots, a_K)$ of non-negative integers, let $|\mathbf{a}| := a_1 + \cdots + a_K$, and $\partial^{\mathbf{a}} f := \partial^{|\mathbf{a}|} f / (\partial x_1^{a_1} \cdots \partial x_K^{a_K})$ for a $K$-variate function $f$. For $m \in \mathbb{N}$ and an interval $I \subseteq \mathbb{R}$, let $C^m(I)$ be the set of $m$-times continuously differentiable functions on $I$. For two probability distributions $P$ and $Q$, let $\mathrm{TV}(P, Q)$ and $\mathrm{KL}(P \| Q)$ be the total variation (TV) distance and the Kullback–Leibler (KL) divergence, respectively. We adopt standard asymptotic notation: for two non-negative sequences $\{a_n\}$ and $\{b_n\}$, we use $a_n = O_K(b_n)$ to denote that $a_n \le C b_n$ for all $n$ and a constant $C$ depending only on $K$, $a_n = \Omega_K(b_n)$ to denote $b_n = O_K(a_n)$, and $a_n = \Theta_K(b_n)$ to denote both $a_n = O_K(b_n)$ and $a_n = \Omega_K(b_n)$. We also use $\widetilde{O}$, $\widetilde{\Omega}$ and $\widetilde{\Theta}$ to denote the respective meanings up to a multiplicative poly-logarithmic factor in $N$ and $T$.
2 Problem formulation and results
Suppose there are $N$ arms, $T$ time periods and a known reward function $f : [0,1]^K \to [0,1]$. At each time period $t$, the algorithm outputs a subset $S_t \subseteq [N]$ with $|S_t| = K$, and receives a binary bandit feedback $r_t \in \{0,1\}$. Note that the binary feedback structure can be significantly relaxed for the purpose of upper bounds, as discussed in Sec. 2.1. Let $\mathcal{F}_t := \sigma(S_1, r_1, \dots, S_t, r_t)$ be the filtration of observable statistics at time $t$, and $\mathcal{G}_t := \sigma(v_1, \dots, v_t)$ be the filtration of unobservable reward vectors. Let also $\mathrm{Adv}$ be an unknown adversary and $\pi$ be an admissible policy. The reward dynamics are modeled as follows: for each $t \in [T]$,

$v_t \sim \mathrm{Adv}(\cdot \mid \mathcal{F}_{t-1}, \mathcal{G}_{t-1}), \qquad S_t \sim \pi(\cdot \mid \mathcal{F}_{t-1}), \qquad r_t \mid (S_t, v_t) \sim \mathrm{Bernoulli}\big(f(\{v_{t,i}\}_{i \in S_t})\big).$
For any $N$, $K$ and $T$, the minimax regret is defined as

$\mathrm{Reg}^*_{N,K,T}(f) := \inf_{\pi} \sup_{\mathrm{Adv}} \max_{S \subseteq [N],\, |S| = K} \mathbb{E}\Big[\sum_{t=1}^T f(\{v_{t,i}\}_{i \in S}) - \sum_{t=1}^T f(\{v_{t,i}\}_{i \in S_t})\Big], \qquad (5)$

where the infimum is over all admissible policies $\pi$, the supremum is over all adaptive adversaries $\mathrm{Adv}$, and the expectation is taken with respect to the randomness of both the bandit algorithm and the adaptive adversary.
The following theorem is a rigorous statement of the main result of this paper.
Theorem 1.
Fix a function $f : [0,1]^K \to [0,1]$ that is $K$-times continuously differentiable on $[0,1]^K$. If $f$ is a polynomial with degree $d < K$, then there exist constants $C_1, c_1 > 0$ depending only on $f$ and $K$ such that, for all sufficiently large $N$ and $T$, it holds that

$c_1 \sqrt{N^d T} \;\le\; \mathrm{Reg}^*_{N,K,T}(f) \;\le\; C_1 \sqrt{N^d T} \cdot \mathrm{polylog}(N, T).$

Furthermore, if $f$ is a polynomial with degree at least $K$ or not a polynomial, then there exist constants $C_2, c_2 > 0$ depending only on $f$ and $K$ such that, for all sufficiently large $N$ and $T$, it holds that

$c_2 \sqrt{N^K T} \;\le\; \mathrm{Reg}^*_{N,K,T}(f) \;\le\; C_2 \sqrt{N^K T}.$
Remark 1.
Based on Propositions 1 and 2 later, the hidden dependence of the upper-bound constants $C_1$ and $C_2$ on $K$ can be made explicit. However, since our lower bound relies on an existential result (cf. Lemma 2), the hidden dependence of the lower-bound constants $c_1$ and $c_2$ on $K$ is unknown. It is an outstanding open question to characterize an explicit dependence on $K$ in the lower bound.
The results in Theorem 1 cover the linear reward case $f(x_1, \dots, x_K) = x_1 + \cdots + x_K$ via $d = 1$ and a regret of $\widetilde\Theta(\sqrt{NT})$, which matches the existing results on adversarial linear combinatorial bandits. On the other hand, the results for general non-polynomial reward functions are quite negative, with a $\Theta_K(\sqrt{N^K T})$ regret showing that all $\binom{N}{K}$ subsets are essentially independent and the bandit algorithm cannot hope to exploit the correlation between overlapping subsets as in the linear case. Finally, the case of the reward function being a low-degree polynomial interpolates between the linear case and the general case, with a regret of $\widetilde\Theta_K(\sqrt{N^d T})$ for $d$ between $1$ and $K - 1$.
In the rest of this section we sketch the proofs of Theorem 1 by studying the upper bounds and lower bounds separately. We will also adapt the proof of Theorem 1 to cover the more general dynamic assortment optimization model described in Sec. 1.2.
2.1 Upper bounds
We first prove the $O(\sqrt{N^K T})$ regret upper bound for general link functions. In fact, we state the following result that is much more general than Theorem 1.
Proposition 1.
Suppose at each time $t$ the adversary could choose an arbitrary combinatorial reward model, and an arbitrary bandit feedback $r_t \in [0,1]$. Let also $R_t(S) := \mathbb{E}[r_t \mid S_t = S]$ denote the expected reward as a function of the selected subset $S$. There exists a bandit algorithm and a universal constant $C > 0$ such that for any $N$, $K$, $T$ and any such adversary, the regret is at most $C\sqrt{\binom{N}{K} T}$.
Proof.
For each subset $S \subseteq [N]$ of size $K$, let $a_S$ be a constructed arm whose bandit reward feedback at time $t$ is $r_t$ if the arm $a_S$ is pulled (i.e., subset $S$ is selected). This reduces the problem to an adversarial multi-armed bandit problem with $\binom{N}{K}$ independent arms. Applying the Implicitly Normalized Forecaster (INF) algorithm from (Audibert & Bubeck, 2009), we obtain the regret upper bound $O(\sqrt{\binom{N}{K} T})$. ∎
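A minimal sketch of this reduction is given below; for simplicity it uses EXP3 in place of the INF policy, and the callback `get_reward` is a hypothetical interface standing in for the environment's bandit feedback:

```python
import numpy as np
from itertools import combinations

def exp3_over_subsets(N, K, T, get_reward, seed=0):
    """Run EXP3 treating every size-K subset of [N] as an independent arm.

    `get_reward(t, S)` is a hypothetical callback returning the bandit
    feedback in [0, 1] observed when subset S is played at time t.
    Note: enumerating all C(N, K) subsets is only practical for small N and K.
    """
    rng = np.random.default_rng(seed)
    arms = list(combinations(range(N), K))        # all C(N, K) "meta-arms"
    M = len(arms)
    eta = np.sqrt(2.0 * np.log(M) / (T * M))      # standard EXP3 learning rate
    log_weights = np.zeros(M)
    for t in range(T):
        probs = np.exp(log_weights - log_weights.max())
        probs /= probs.sum()
        a = rng.choice(M, p=probs)                # sample a subset to play
        r = get_reward(t, arms[a])
        log_weights[a] += eta * r / probs[a]      # importance-weighted update
    return arms, log_weights
```

With $M = \binom{N}{K}$ meta-arms, EXP3 gives a regret of $O(\sqrt{MT\log M})$; the INF policy removes the logarithmic factor and yields the bound stated in Proposition 1.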
We remark that Proposition 1 is more general and contains the $O(\sqrt{N^K T})$ upper bound in Theorem 1 as a special case. By considering the feedback $r_t = r_{i_t}$ (with the convention $r_0 = 0$) and the reward model $R_t(S) = \sum_{i \in S} r_i v_{t,i} / (1 + \sum_{j \in S} v_{t,j})$, Proposition 1 also covers the dynamic assortment optimization model described in Sec. 1.2 and implies Corollary 2.
We next establish the $\widetilde{O}(\sqrt{N^d T})$ upper bound for polynomial reward functions.
Proposition 2.
Fix a known $d$-degree polynomial $f$ with $d < K$. Suppose at each time $t$, conditioned on the selected subset $S_t$ of size $K$, the bandit feedback $r_t$ is supported on an arbitrary bounded set (not necessarily $\{0,1\}$), such that $\mathbb{E}[r_t \mid S_t, v_t] = f(\{v_{t,i}\}_{i \in S_t})$. Then there exists a bandit algorithm such that, for any adaptive adversary, the regret is upper bounded by $\widetilde{O}(\sqrt{N^d T})$, where the hidden constant and poly-logarithmic factor depend only on $d$, $K$, the coefficients of the polynomial $f$, and the bound on the support of the feedback.
Proof.
For any $N$-dimensional vector $u$, let $u^{\otimes d}$ be the order-$d$ tensorization of $u$. Since $f$ is a $d$-degree polynomial and $|S_t| = K$ is fixed, it is easy to verify that for any subset $S$ of size $K$ and any reward vector $v \in [0,1]^N$, the reward can be written as $f(\{v_i\}_{i \in S}) = \langle \Theta_v, \chi_S^{\otimes d} \rangle$, where $\chi_S \in \{0,1\}^N$ is the indicator vector of $S$ and $\Theta_v \in \mathbb{R}^{N^d}$ is a tensor whose entries are bounded functions of the entries of $v$ and the coefficients of $f$. Hence, the expected reward at each time is a linear function of the feature vector $\chi_{S_t}^{\otimes d}$, and the problem reduces to an adversarial linear bandit in dimension $N^d$ with the finite action set $\{\chi_S^{\otimes d} : |S| = K\}$. Applying an optimal adversarial linear bandit algorithm (Bubeck et al., 2012) to this reduced problem yields the claimed $\widetilde{O}(\sqrt{N^d T})$ regret bound. ∎
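The following toy example (our own illustration, using an arbitrary degree-2 polynomial and arbitrary values) shows the tensorization identity underlying this reduction, namely that the reward of a subset is a linear function of the order-$d$ tensorization of its indicator vector:

```python
import numpy as np

N, K = 6, 3
rng = np.random.default_rng(1)
v = rng.uniform(size=N)             # a reward vector chosen by the adversary
S = [0, 2, 5]                       # a size-K subset played by the learner
chi = np.zeros(N)
chi[S] = 1.0                        # indicator vector of S

# Example degree-2 polynomial link: f(x_1, ..., x_K) = (x_1 + ... + x_K)^2.
reward = v[S].sum() ** 2

# Linear representation: with Theta = v v^T and feature = chi chi^T (the
# order-2 tensorization of chi), the reward equals <Theta, feature>.
Theta = np.outer(v, v)
feature = np.outer(chi, chi)
assert np.isclose(reward, np.sum(Theta * feature))
print(reward)
```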
2.2 Lower bounds
We first prove the following result corresponding to the minimax lower bounds in Theorem 1.
Lemma 1.
Suppose $\mathbb{E}[r_t \mid S_t, v_t] = f(\{v_{t,i}\}_{i \in S_t})$ for some fixed, known function $f$ that is $K$-times continuously differentiable on $[0,1]^K$. If $f$ is a degree-$d$ polynomial with $d < K$, then there exists a constant $c > 0$ depending only on $f$ and $K$ such that for all sufficiently large $N$ and $T$,

$\mathrm{Reg}^*_{N,K,T}(f) \ge c \sqrt{N^d T}.$

Otherwise, there exists a constant $c' > 0$ depending only on $f$ and $K$ such that for all sufficiently large $N$ and $T$,

$\mathrm{Reg}^*_{N,K,T}(f) \ge c' \sqrt{N^K T}.$
Our proof is based on the following technical lemma:
Lemma 2.
Let $f$ be a real-valued, symmetric (i.e., permutation-invariant) and $K$-times continuously differentiable function on $[0,1]^K$. Then the following two statements are equivalent:
1. $f$ is not a polynomial of degree at most $K - 1$;
2. there exists a random vector $U = (U_1, \dots, U_K)$ supported on $[0,1]^K$, which follows an exchangeable joint distribution $\mu$, and a scalar $v_0 \in [0,1]$, such that
$\mathbb{E}_\mu\big[f(U_1, \dots, U_k, v_0, \dots, v_0)\big] = f(v_0, \dots, v_0) \quad \text{for all } 1 \le k \le K - 1, \qquad (6)$
and
$\mathbb{E}_\mu\big[f(U_1, \dots, U_K)\big] > f(v_0, \dots, v_0). \qquad (7)$
The proof of Lemma 2 is deferred to Sec. 3. The construction of the distribution $\mu$ in Lemma 2 is non-constructive and relies on duality-based existential arguments. Its proof also applies several technical tools from real analysis and functional analysis (Rudin, 1991; Donoghue, 1969; Dudley, 2018).
We are now ready to prove Lemma 1.
Proof.
We first prove the case when $f$ is not a polynomial of degree at most $K - 1$. We use the exchangeable distribution $\mu$ and the scalar $v_0$ constructed in Lemma 2, and construct i.i.d. copies $U^{(t)} \sim \mu$ for each $t \in [T]$. Consider the following random strategy of the adversary when the optimal subset is $S^*$: at each time $t$, nature assigns $U^{(t)}$ to the restriction of $v_t$ to $S^*$ with probability $\delta$, and assigns $(v_0, \dots, v_0)$ otherwise, with the parameter $\delta \in (0,1)$ to be specified later; nature also assigns $v_{t,i} = v_0$ for all $i \notin S^*$. Suppose an algorithm selects subset $S_t$ at time $t$, such that $S_t \neq S^*$. Then, writing $k := |S_t \cap S^*| < K$ and using condition (6) together with the exchangeability of $\mu$,

$\mathbb{E}[r_t \mid S_t] = (1 - \delta) f(v_0, \dots, v_0) + \delta\, \mathbb{E}_\mu\big[f(U_1, \dots, U_k, v_0, \dots, v_0)\big] = f(v_0, \dots, v_0) =: p_0, \qquad (8)$

where, in contrast, $\mathbb{E}[r_t \mid S_t = S^*] = p_0 + \delta \Delta$ with $\Delta := \mathbb{E}_\mu[f(U_1, \dots, U_K)] - f(v_0, \dots, v_0) > 0$ by condition (7). Essentially, Eq. (8) shows that the marginal distribution of $r_t$ conditioned on $S_t$ is $\mathrm{Bern}(p_0)$ if $S_t \neq S^*$, but $\mathrm{Bern}(p_0 + \delta \Delta)$ if $S_t = S^*$.

Let $P_{S^*}$ denote the distribution of the observable history $(S_1, r_1, \dots, S_T, r_T)$ when the adversary chooses $S^*$ as the optimal subset. Let also $P_0$ denote the distribution of the history with $\delta = 0$, under which no subset is special. As $r_1, \dots, r_T$ are binary, Eq. (8) implies that, for every $S^*$,

$\mathrm{KL}(P_0 \| P_{S^*}) \le C_0\, \delta^2\, \mathbb{E}_0[n_{S^*}], \qquad (9)$

where $\mathbb{E}_0$ is the expectation under $P_0$, $n_{S^*} := \sum_{t=1}^T \mathbf{1}\{S_t = S^*\}$ is the number of times $S^*$ is selected, and the constant $C_0$ depends only on $p_0$ and $\Delta$ and thus only on $f$ and $K$. By Pinsker's inequality,

$\mathrm{TV}(P_0, P_{S^*}) \le \sqrt{\mathrm{KL}(P_0 \| P_{S^*}) / 2} \le \delta \sqrt{C_0\, \mathbb{E}_0[n_{S^*}] / 2}. \qquad (10)$

Subsequently, writing $M := \binom{N}{K}$ for the number of candidate optimal subsets,

$\sup_{S^*} \mathrm{Reg}_T \ge \frac{\delta \Delta}{M} \sum_{S^*} \big(T - \mathbb{E}_{S^*}[n_{S^*}]\big) \ge \delta \Delta \Big(T - \frac{T}{M} - \frac{T}{M} \sum_{S^*} \mathrm{TV}(P_0, P_{S^*})\Big) \qquad (11)$
$\ge \delta \Delta \Big(T - \frac{T}{M} - \delta T \sqrt{\frac{C_0}{2}} \sqrt{\frac{T}{M}}\Big). \qquad (12)$

Here, Eq. (11) holds because $\mathbb{E}_{S^*}[n_{S^*}] \le \mathbb{E}_0[n_{S^*}] + T \cdot \mathrm{TV}(P_0, P_{S^*})$ and $\sum_{S^*} \mathbb{E}_0[n_{S^*}] = T$, and Eq. (12) is due to the Cauchy–Schwarz inequality. Setting $\delta \asymp \sqrt{M / T}$, Eq. (12) is lower bounded by the minimum between $\Omega(\Delta T)$ and

$\Omega\big(\Delta \sqrt{M T}\big) = \Omega_K\big(\sqrt{N^K T}\big),$

which is to be demonstrated.
The scenario when $f$ is a polynomial of degree $d < K$ can be proved in an entirely similar way. Applying Lemma 2 with $K$ replaced by $d$ (to a suitable restriction of $f$ in which $K - d$ of its arguments are fixed), we could obtain a random vector $(U_1, \dots, U_d)$ with an exchangeable distribution and some scalar $v_0$ such that the conditions (6) and (7) hold. Nature uses the same strategy, with the only difference being that the size of the random subset $S^*$ is $d$ instead of $K$. In this case, any size-$K$ set $S$ with $S \supseteq S^*$ gives the optimal reward, and the learner observes the non-informative outcome at time $t$ if and only if $S_t \not\supseteq S^*$. Consequently, both Eqs. (8, 9) still hold, with the condition $|S_t \cap S^*| < K$ replaced by $S_t \not\supseteq S^*$ in Eq. (8) and the definition of $n_{S^*}$ changed to $n_{S^*} := \sum_{t=1}^T \mathbf{1}\{S_t \supseteq S^*\}$ in Eq. (9). Using

(12) with $M$ replaced by $\binom{N}{d} / \binom{K}{d}$ yields a lower bound of:

$\delta \Delta \Big(T - \frac{T}{M} - \delta T \sqrt{\frac{C_0}{2}} \sqrt{\frac{T}{M}}\Big), \qquad M = \binom{N}{d} \Big/ \binom{K}{d}.$

Setting $\delta \asymp \sqrt{M / T}$ and noting that $\binom{K}{d}$ does not depend on $N$, the above lower bound is simplified to

$\Omega_K\big(\sqrt{N^d T}\big),$

which is to be demonstrated. ∎
Next, we establish an $\Omega_K(\sqrt{N^K T})$ lower bound for the dynamic assortment optimization model described in Sec. 1.2. It implies Corollary 3 in the introduction section.
Lemma 3.
Consider the dynamic assortment optimization question with multinomial logit choice models described in Sec. 1.2. Adopt the unit-price model, with $r_i = 1$ for all $i \in [N]$. Then there exists an adversary choosing $\{v_{t,i}\}_{t \in [T], i \in [N]}$ and a constant $c_K > 0$ depending only on $K$, such that for any bandit algorithm and all sufficiently large $N$ and $T$, it holds that

$\mathrm{Reg}_T \ge c_K \sqrt{N^K T},$

where $\mathrm{Reg}_T$ is the regret defined in Eq. (2) with respect to the expected revenue.
Proof.
It is easy to verify that, with $r_i \equiv 1$, the expected revenue of an assortment $S$ equals $f(\{v_i\}_{i \in S})$ with the link function $f(x_1, \dots, x_K) = \frac{x_1 + \cdots + x_K}{1 + x_1 + \cdots + x_K}$. Since this link function is clearly not a polynomial of any degree, the construction in Lemma 1 should immediately imply an $\Omega_K(\sqrt{N^K T})$ lower bound. However, in the multinomial logit choice model for dynamic assortment optimization, the bandit feedback is the purchase decision $i_t$, which is not binary. This means that the expectation calculations in Eq. (8) are not sufficient, and we have to calculate the KL divergence between the discrete observations directly.
Recall that in the MNL model, the bandit feedback $i_t$ is supported on $S_t \cup \{0\}$, with $\Pr[i_t = i \mid S_t] = \frac{v_{t,i}}{1 + \sum_{j \in S_t} v_{t,j}}$ for all $i \in S_t$ and $\Pr[i_t = 0 \mid S_t] = \frac{1}{1 + \sum_{j \in S_t} v_{t,j}}$. Suppose $S_t \neq S^*$. Then
(13) |
For every , , and therefore
(14) |
Next consider any . Because are exchangeable, the probabilities are the same for all . Define for some . By the law of total probability and Eqs. (13,14) we have that
Consequently,
(15) |
Comparing Eqs. (13, 14, 15), we conclude that for any $S_t \neq S^*$, the law of the observed feedback $i_t$ is the same. This shows that all subsets $S_t \neq S^*$ are information-theoretically indistinguishable. On the other hand, for $S_t = S^*$, with probability $\delta$ all elements of $S^*$ are assigned the alternative utility values, for which the law of $i_t$ differs. Hence, the learner gains information about $S^*$ only in the rounds where $S_t = S^*$. The rest of the proof is identical to the proof of Lemma 1 when the link function is not a polynomial. ∎
3 Proof of Lemma 2
We first prove the easy direction $2 \Rightarrow 1$, whose contrapositive is that if $f$ is a polynomial of degree at most $K - 1$, then (6) and (7) cannot hold simultaneously. In fact, defining $A_k := \mathbb{E}_\mu[f(U_1, \dots, U_k, v_0, \dots, v_0)]$ for all $0 \le k \le K$, condition (6) implies that for all $1 \le k \le K - 1$, it holds that $A_k = A_0 = f(v_0, \dots, v_0)$. By exchangeability of $\mu$ and the symmetry of $f$, this also shows that $A_k = \mathbb{E}_\mu[a_k(U_1, \dots, U_K)]$, where for all $0 \le k \le K$,

$a_k(u_1, \dots, u_K) := \binom{K}{k}^{-1} \sum_{I \subseteq [K],\, |I| = k} f\big(u_I, v_0 \mathbf{1}_{[K] \setminus I}\big), \qquad (16)$

with $f(u_I, v_0 \mathbf{1}_{[K] \setminus I})$ denoting $f$ evaluated at the vector whose coordinates in $I$ equal those of $u$ and whose remaining coordinates equal $v_0$. Since $f$ is a polynomial of degree at most $K - 1$ and $A_k = A_0$ for all $0 \le k \le K - 1$, the following lemma shows that $A_K = A_0$ as well, a contradiction to (7).
Lemma 4.
For reals $u_1, \dots, u_K, v_0 \in [0,1]$ and any polynomial $f$ of degree at most $K - 1$, define $a_0, \dots, a_K$ as in (16). Then the following identity holds:

$\sum_{k=0}^{K} (-1)^{K-k} \binom{K}{k}\, a_k(u_1, \dots, u_K) = 0.$
Proof.
By linearity it suffices to prove Lemma 4 for , . In this case, simple algebra verifies that the coefficient of with in is
where , and if . Consequently, the coefficient of in the LHS of the equation is
where the last identity makes use of the inequality . ∎
Next we prove the hard direction $1 \Rightarrow 2$. The proof makes use of ideas from convex analysis: after fixing $v_0$, the problem is to find an exchangeable distribution of $(U_1, \dots, U_K)$ from the convex set of all such distributions; moreover, both constraints (6) and (7) are linear in the joint distribution, and the Dirac point measure on $(v_0, \dots, v_0)$ satisfies (6) as well as (7) if the strict inequality in (7) is replaced by an equality. Therefore, there are two convex sets in total, one being the family of all exchangeable joint distributions and one arising from the constraints (6) and (7), and their closures have an intersection point, i.e. the Dirac point measure at $(v_0, \dots, v_0)$. Now our target is to show that these sets have a non-empty intersection without taking the closure, and the following lemma characterizes a sufficient condition via duality.
Lemma 5.
Proof.
We prove the contrapositive of this result. Let be the topological vector space of all finite Borel signed measures on (which are also Radon measures by Ulam’s theorem; cf. (Dudley, 2018, Theorem 7.1.4)) equipped with the weak-⋆ topology (as the dual of ), and be the collection of all exchangeable Borel probability measures (i.e. invariant to permutations). Moreover, for , we define as the collection of all signed Borel measures such that
for and
Therefore, the non-existence of such an exchangeable distribution implies that for all . Since is compact, the set is a closed subset of the unit weak-⋆ ball, and is therefore compact due to Banach–Alaoglu (see, e.g. (Rudin, 1991, Theorem 3.15)). Moreover, as , the set is closed under the weak-⋆ topology. Finally, as both sets are convex, and (Rudin, 1991, Theorem 3.10) shows that the dual of under the weak-⋆ topology is , (Rudin, 1991, Theorem 3.4) implies that there exist some and such that
(17) |
As is a linear subspace of , the RHS of (17) must be zero (otherwise it would be ). Then by (Rudin, 1991, Lemma 3.9), we must have for some vector . Plugging this back into the first inequality of (17), and defining as the Dirac point measure on , we arrive at
(18) |
Since is a symmetric function (i.e. invariant to permutations of the input), the LHS of (18) is simply . Then , and by multiplying a positive constant to all entries of , we may assume that and . Choosing , the compactness of implies some convergent subsequence as . Taking the limit along this subsequence, it is clear that is a non-zero vector with , and is a global maximum of the function . ∎
Now it remains to choose a suitable such that the condition in Lemma 5 holds. We choose to be any point in such that for all , whose existence is ensured by the following lemma, which itself is a standard exercise on the Baire category theorem.
Lemma 6 (A slight variant of (Donoghue, 1969), Page 53).
If satisfies for some and every , then is a polynomial of degree at most on .
Now we assume by contradiction that the function attains its maximum at for some non-zero vector . We will prove by induction on the following claims:
1. for all multi-indices with ;
2. , with and for .
For the base case , the first claim is exactly the first-order condition for the maximum, and the second claim follows from the first claim, the identity , and our choice of that . Now suppose that these claims hold for . Then for with , simple algebra gives that
where the last identity is due to the inductive hypothesis for . Therefore, for the first claim it remains to consider multi-indices with . By the inductive hypothesis and the Taylor expansion for multivariate functions, for sufficiently small and every binary vector , it holds that
As when are Rademacher random variables, and the term inside the expectation is not identically zero, we conclude that this term could be either positive or negative after carefully choosing . As a result, in order for to be a maximum of , the above expansion implies that (recall that ), which is exactly the second claim for . The remainder of the first claim also follows from the second claim, and the induction is complete.
Finally we derive a contradiction from the above result, and thereby show that such a non-zero vector cannot exist. In fact, the second claim of the inductive result for constitutes a linear system for the vector , where is an upper triangular matrix with non-zero diagonal entries. Therefore, we have , which is a contradiction.
4 Conclusion and future directions
In this paper we studied the adversarial combinatorial bandit problem with general reward functions $f$, with a complete characterization of the minimax regret depending on whether $f$ is a low-degree polynomial. For the most general case when $f$ is not a polynomial, including dynamic assortment optimization under the multinomial logit choice model, our results imply an $\Omega_K(\sqrt{N^K T})$ regret lower bound, hinting that it is not possible for any bandit algorithm to exploit the inherent correlation among subsets/assortments. We believe it is a promising future research direction to study models that interpolate between the stochastic and the adversarial settings, in order to achieve intermediate regret bounds.
When $f$ is a general non-linear function, the adversarial construction in our lower bound proof can be interpreted as a latent variable model: first sample the reward vector $v$, and then sample the feedback conditioned on $v$. To foster identifiability, one common assumption is to have access to multiple observations conditioned on the same latent variable, such as "high dimensionality" assumptions in learning Gaussian mixture models (Ge et al., 2015) and learning topic models with multiple words per document (Anandkumar et al., 2015). This motivates the following model: at the beginning of each epoch of $\tau$ consecutive time periods, an adaptive adversary chooses a reward vector $v$; afterwards, the bandit algorithm produces subsets and observes feedback for the $\tau$ periods of the epoch, during which $v$ is held fixed. Clearly, with $\tau = T$ we recover the stochastic (stationary) setting in which the reward vectors do not change, and with $\tau = 1$ we recover the adversarial setting in which the reward vector is different for each time period. An intermediate value of $\tau$ is likely to result in an interesting interpolation between the two settings, and we leave this as an interesting future research question.
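A sketch of this interaction protocol is given below; the interfaces `adversary`, `learner` and `feedback` are hypothetical placeholders for the adversary's choice, the bandit algorithm, and a reward channel whose mean satisfies Eq. (1):

```python
def run_epoch_protocol(T, tau, adversary, learner, feedback):
    """Interaction with an adversary that refreshes the reward vector once per
    epoch of length tau; tau = 1 is the adversarial setting, tau = T the
    stochastic (stationary) one."""
    assert T % tau == 0
    for epoch in range(T // tau):
        v = adversary(epoch)            # reward vector fixed within the epoch
        for _ in range(tau):
            S = learner.select()        # size-K subset chosen by the algorithm
            reward = feedback(v, S)     # bandit feedback with mean f(v restricted to S)
            learner.update(S, reward)
```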
References
- Agarwal & Aggarwal (2018) Agarwal, M. and Aggarwal, V. Regret bounds for stochastic combinatorial multi-armed bandits with linear space complexity. arXiv preprint arXiv:1811.11925, 2018.
- Agrawal et al. (2017) Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. Thompson sampling for the mnl-bandit. arXiv preprint arXiv:1706.00977, 2017.
- Agrawal et al. (2019) Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. Mnl-bandit: A dynamic learning approach to assortment selection. Operations Research, 67(5):1453–1485, 2019.
- Anandkumar et al. (2015) Anandkumar, A., Foster, D. P., Hsu, D., Kakade, S. M., and Liu, Y.-K. A spectral algorithm for latent dirichlet allocation. Algorithmica, 72(1):193–214, 2015.
- Audibert & Bubeck (2009) Audibert, J.-Y. and Bubeck, S. Minimax policies for adversarial and stochastic bandits. In COLT, pp. 217–226, 2009.
- Audibert et al. (2014) Audibert, J.-Y., Bubeck, S., and Lugosi, G. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2014.
- Auer et al. (1995) Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE Annual Conference on Foundations of Computer Science (FOCS), pp. 322–331. IEEE, 1995.
- Bubeck et al. (2012) Bubeck, S., Cesa-Bianchi, N., and Kakade, S. M. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, pp. 41–1. JMLR Workshop and Conference Proceedings, 2012.
- Cesa-Bianchi & Lugosi (2012) Cesa-Bianchi, N. and Lugosi, G. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
- Chen et al. (2013) Chen, W., Wang, Y., and Yuan, Y. Combinatorial multi-armed bandit: General framework and applications. In Proceedings of the International Conference on Machine Learning (ICML), pp. 151–159, 2013.
- Chen et al. (2016a) Chen, W., Hu, W., Li, F., Li, J., Liu, Y., and Lu, P. Combinatorial multi-armed bandit with general reward functions. Advances in Neural Information Processing Systems (NIPS), 29:1659–1667, 2016a.
- Chen et al. (2016b) Chen, W., Wang, Y., Yuan, Y., and Wang, Q. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. Journal of Machine Learning Research, 17(1):1746–1778, 2016b.
- Chen & Wang (2018) Chen, X. and Wang, Y. A note on a tight lower bound for capacitated mnl-bandit assortment selection models. Operations Research Letters, 46(5):534–537, 2018.
- Chen et al. (2018a) Chen, X., Wang, Y., and Zhou, Y. Dynamic assortment selection under the nested logit models. Production and Operations Management (to appear), 2018a.
- Chen et al. (2018b) Chen, X., Wang, Y., and Zhou, Y. An optimal policy for dynamic assortment planning under uncapacitated multinomial logit models. Mathematics of Operations Research (to appear), 2018b.
- Chen et al. (2019) Chen, X., Krishnamurthy, A., and Wang, Y. Robust dynamic assortment optimization in the presence of outlier customers. arXiv preprint arXiv:1910.04183, 2019.
- Combes et al. (2015) Combes, R., Talebi Mazraeh Shahi, M. S., Proutiere, A., et al. Combinatorial bandits revisited. Advances in neural information processing systems, 28:2116–2124, 2015.
- Donoghue (1969) Donoghue, W. F. Distributions and Fourier transforms. Academic Press, 1969.
- Dudley (2018) Dudley, R. M. Real analysis and probability. CRC Press, 2018.
- Ge et al. (2015) Ge, R., Huang, Q., and Kakade, S. M. Learning mixtures of gaussians in high dimensions. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing (STOC), pp. 761–770, 2015.
- Kuroki et al. (2020) Kuroki, Y., Xu, L., Miyauchi, A., Honda, J., and Sugiyama, M. Polynomial-time algorithms for multiple-arm identification with full-bandit feedback. Neural Computation, 32(9):1733–1773, 2020.
- Kveton et al. (2015) Kveton, B., Wen, Z., Ashkan, A., and Szepesvari, C. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, pp. 535–543, 2015.
- Merlis & Mannor (2019) Merlis, N. and Mannor, S. Batch-size independent regret bounds for the combinatorial multi-armed bandit problem. arXiv preprint arXiv:1905.03125, 2019.
- Merlis & Mannor (2020) Merlis, N. and Mannor, S. Tight lower bounds for combinatorial multi-armed bandits. arXiv preprint arXiv:2002.05392, 2020.
- Rejwan & Mansour (2020) Rejwan, I. and Mansour, Y. Top-$k$ combinatorial bandits with full-bandit feedback. In Algorithmic Learning Theory, pp. 752–776. PMLR, 2020.
- Rudin (1991) Rudin, W. Functional analysis. Mcgraw-Hill Inc, New York, 1991.
- Rusmevichientong et al. (2010) Rusmevichientong, P., Shen, Z.-J. M., and Shmoys, D. B. Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations Research, 58(6):1666–1680, 2010.
- Train (2009) Train, K. Discrete choice methods with simulation. Cambridge University Press, 2nd edition, 2009.
- Wang & Chen (2018) Wang, S. and Chen, W. Thompson sampling for combinatorial semi-bandits. arXiv preprint arXiv:1803.04623, 2018.