
Adversarial Combinatorial Bandits with General Non-linear Reward Functions

Xi Chen    Yanjun Han    Yining Wang
Abstract

In this paper we study the adversarial combinatorial bandit with a known non-linear reward function, extending existing work on adversarial linear combinatorial bandits. The adversarial combinatorial bandit with general non-linear reward is an important open problem in the bandit literature, and it is still unclear whether there is a significant gap from the case of linear reward, stochastic bandit, or semi-bandit feedback. We show that, with $N$ arms and subsets of $K$ arms being chosen at each of $T$ time periods, the minimax optimal regret is $\widetilde{\Theta}_{d}(\sqrt{N^{d}T})$ if the reward function is a $d$-degree polynomial with $d<K$, and $\Theta_{K}(\sqrt{N^{K}T})$ if the reward function is not a low-degree polynomial. Both bounds are significantly different from the bound $O(\sqrt{\mathrm{poly}(N,K)T})$ for the linear case, which suggests that there is a fundamental gap between the linear and non-linear reward structures. Our result also finds applications to the adversarial assortment optimization problem in online recommendation. We show that in the worst case of the adversarial assortment problem, the optimal algorithm must treat each of the $\binom{N}{K}$ assortments as independent.

combinatorial bandit, assortment optimization, minimax analysis

1 Introduction

In this paper we study the combinatorial bandit problem, which is a natural extension of the multi-armed bandit problem (Auer et al., 1995) and has applications to online advertising, online shortest paths and many other practical problems (Cesa-Bianchi & Lugosi, 2012; Chen et al., 2013, 2016a, 2016b; Wang & Chen, 2018). In the adversarial combinatorial bandit setting, there are $T$ time periods and $N$ arms. At the beginning of each time period $t$, an adaptive adversary chooses a reward vector $v_{t}=(v_{t1},\cdots,v_{tN})\in[0,1]^{N}$ that is not revealed to the algorithm. The algorithm chooses a subset $S_{t}\subseteq[N]$ consisting of exactly $K\leq N$ distinct arms (i.e., $|S_{t}|=K$). The algorithm then receives a bandit feedback $r_{t}\in[0,1]$ satisfying

$$\mathbb{E}[r_{t}|S_{t},v_{t}]=g\left(\sum_{i\in S_{t}}v_{ti}\right), \qquad (1)$$

where $g:\mathbb{R}^{+}\to[0,1]$ is a known link function. The objective is to minimize the regret of the algorithm compared against a stationary benchmark, defined as

$$\max_{|S^{\star}|=K}\mathbb{E}\left[\sum_{t=1}^{T}R(S^{\star},v_{t})-R(S_{t},v_{t})\right], \qquad (2)$$

where $R(S,v_{t}):=g(\sum_{i\in S}v_{ti})$ and $\{S_{t}\}_{t=1}^{T}$ are the subsets output by a regret minimization algorithm.
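To make the protocol in Eqs. (1)-(2) concrete, the following is a minimal simulation sketch in Python. Everything in it is illustrative rather than part of the formal model: the MNL-style link $g(x)=x/(1+x)$, the uniformly random (oblivious) adversary, and the deliberately naive uniform-random policy are all assumptions chosen for demonstration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

N, K, T = 6, 2, 1000
g = lambda x: x / (1 + x)   # illustrative non-linear link (the MNL link of Sec. 1.2)

# A randomly drawn (oblivious, for simplicity) adversary: one vector v_t per round.
V = rng.uniform(0.0, 1.0, size=(T, N))

def expected_reward(S, v):
    """R(S, v) = g(sum_{i in S} v_i), cf. Eq. (1)."""
    return g(v[list(S)].sum())

def pull(S, v):
    """Binary bandit feedback r_t with E[r_t | S_t, v_t] = g(sum_{i in S_t} v_ti)."""
    return rng.binomial(1, expected_reward(S, v))

# A deliberately naive policy for illustration: play a uniformly random subset.
subsets = list(itertools.combinations(range(N), K))
chosen = [subsets[rng.integers(len(subsets))] for _ in range(T)]
feedback = [pull(S, V[t]) for t, S in enumerate(chosen)]

# Regret (2) against the best fixed size-K subset in hindsight.
best = max(subsets, key=lambda S: sum(expected_reward(S, V[t]) for t in range(T)))
regret = sum(expected_reward(best, V[t]) - expected_reward(S, V[t])
             for t, S in enumerate(chosen))
print(f"regret of the uniform-random policy over T={T} rounds: {regret:.1f}")
```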

As far as we know, all existing works on adversarial combinatorial bandits study only the case where the link function is linear (i.e., $g(x)=cx$) (Cesa-Bianchi & Lugosi, 2012; Bubeck et al., 2012; Audibert et al., 2014). While there has been research on combinatorial bandits with general link functions, such results are established exclusively for the stochastic setting, in which the reward vectors $\{v_{t}\}_{t=1}^{T}$ do not change over time (Rusmevichientong et al., 2010; Agarwal & Aggarwal, 2018; Agrawal et al., 2019), and most of them further assume semi-bandit feedback, where noisy observations of all $v_{ti}$ with $i\in S_{t}$ are available (Combes et al., 2015; Kveton et al., 2015; Chen et al., 2016a, 2018a, 2018b). This motivates the following natural question:

Question 1.

For adversarial combinatorial bandits with a non-linear link function $g(\cdot)$ and bandit feedback, is it possible to achieve $\widetilde{O}(\sqrt{\mathrm{poly}(N,K)T})$ regret?

Note that in adversarial combinatorial bandits with a linear link function, or stochastic combinatorial (semi-)bandits with general link functions, the $\widetilde{O}(\sqrt{\mathrm{poly}(N,K)T})$ regret targeted in Question 1 can be attained (Bubeck et al., 2012; Combes et al., 2015; Kveton et al., 2015; Chen et al., 2016a; Agrawal et al., 2019). The question also has important practical motivations beyond theoretical/mathematical reasoning, because many interesting applications of combinatorial bandits involve non-linear link functions, such as online assortment optimization with a multinomial logit (MNL) model, which corresponds to the link function $g(x)=x/(1+x)$. Please see more discussions in Sec. 1.2.

1.1 Our results

Below is an informal statement of our main result, as a summary of Theorem 1 later in the paper.

Corollary 1 (Informal).

Fix an arbitrary, known reward function $g:\mathbb{R}^{+}\to[0,1]$. If $g$ is a $d$-degree polynomial for some $d<K$, then the optimal regret is $\widetilde{\Theta}_{g,d,K}(\sqrt{N^{d}T})$.

Otherwise, if $g$ is either not a polynomial or a polynomial of degree at least $K$, the optimal regret is $\Theta_{g,K}(\sqrt{N^{K}T})$.

The results in Corollary 1 easily cover the linear link function case $g(x)=x$, with $d=1$ and the optimal regret being $\widetilde{\Theta}(\sqrt{NT})$ (Bubeck et al., 2012). On the other hand, Corollary 1 shows that when $g$ is not a polynomial, no algorithm can achieve a regret smaller than $\Omega(\sqrt{N^{K}T})$. This shows that when $g$ is a general non-linear reward function, the $\binom{N}{K}$ subsets of the $N$ arms can only be treated as “independent”, and it is information-theoretically impossible for any algorithm to exploit correlation between subsets of arms to achieve a significantly smaller regret.

1.2 Dynamic assortment optimization

Dynamic assortment optimization is a key problem in revenue management and online recommendation, and naturally serves as a motivation for the generalized combinatorial bandit problem studied in this paper. In the standard setup of dynamic assortment optimization (Agrawal et al., 2019), there are $N$ substitutable products, each associated with a known profit parameter $p_{i}\in[0,1]$ and an unknown mean utility parameter $v_{i}\in[0,1]$. At each time, the seller offers an assortment (i.e., a recommended set of products) $S_{t}\subseteq[N]$ of size $K$; e.g., there are $K$ display spots for recommended products on an Amazon webpage. The customer then either purchases one of the products being offered (i.e., $i_{t}\in S_{t}$) or leaves without making any purchase (i.e., $i_{t}=0$). The choice behavior of the customer is governed by the well-known multinomial logit (MNL) model from economics (Train, 2009):

$$\mathbb{P}[i_{t}=i|S_{t},v]=\frac{v_{i}}{v_{0}+\sum_{j\in S_{t}}v_{j}},\qquad\forall i\in S_{t}\cup\{0\}, \qquad (3)$$

with the definition $v_{0}:=1$, where $v_{0}$ denotes the utility of no-purchase. The objective of the seller or retailer is to maximize the expected profit/revenue

$$R(S,v)=\sum_{i\in S}p_{i}\mathbb{P}[i|S,v]=\frac{\sum_{i\in S}p_{i}v_{i}}{v_{0}+\sum_{i\in S}v_{i}}.$$

Note also that, in the adversarial setting, the mean utility vector $v_{t}=\{v_{ti}\}_{i=1}^{N}$ will be different for each time period $t=1,2,\cdots,T$, and will be selected by an adaptive adversary. The regret is then defined as in Eq. (2).
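For concreteness, here is a minimal sketch of the MNL choice model (3) and the revenue formula above. The instance (the values of $v$ and $p$ and the assortment $S$) is hypothetical, and the Monte Carlo comparison is only a sanity check:

```python
import numpy as np

rng = np.random.default_rng(1)

def mnl_choice(S, v):
    """Sample a customer choice from the MNL model (3); None means no purchase."""
    items = list(S)
    weights = np.array([1.0] + [v[i] for i in items])  # v_0 := 1 for no-purchase
    probs = weights / weights.sum()
    j = rng.choice(len(weights), p=probs)
    return None if j == 0 else items[j - 1]

def revenue(S, v, p):
    """R(S, v) = sum_{i in S} p_i v_i / (1 + sum_{i in S} v_i)."""
    items = list(S)
    return sum(p[i] * v[i] for i in items) / (1.0 + sum(v[i] for i in items))

# Hypothetical instance: N = 5 products, assortment of size K = 2.
v = rng.uniform(0.0, 1.0, size=5)   # mean utilities (unknown to the seller)
p = rng.uniform(0.0, 1.0, size=5)   # known profit parameters
S = (1, 3)

# Monte Carlo sanity check: empirical revenue should match the closed form.
draws = [mnl_choice(S, v) for _ in range(200_000)]
emp = np.mean([0.0 if i is None else p[i] for i in draws])
print(f"empirical revenue {emp:.4f} vs. closed form {revenue(S, v, p):.4f}")
```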

Let us first consider a special case, where all products have profit one (i.e., $p_{i}\equiv 1$) and only binary purchase/no-purchase actions are observable. That is, one only observes a binary reward $r_{t}=\boldsymbol{1}\{i_{t}\in S_{t}\}$ at time $t$, which indicates whether there is a purchase. Then the dynamic assortment optimization question reduces to the generalized (adversarial) combinatorial bandit problem formulated in Eqs. (1)-(2) with the link function $g(x)=x/(1+x)$. Since $g(x)=x/(1+x)$ is clearly not a polynomial, Corollary 1 shows that $\Theta(\sqrt{N^{K}T})$ should be the optimal regret. The following corollary extends this to the general case of dynamic assortment optimization, where different products can have different profit parameters.

Corollary 2.

Consider the dynamic assortment optimization question with known profit parameters $\{p_{i}\}_{i=1}^{N}\subseteq[0,1]$ and unknown mean utility parameters $\{v_{ti}\}_{t,i=1}^{T,N}\subseteq[0,1]$ chosen by an adaptive adversary. Then there exists an algorithm with regret upper bounded by $O_{K}(\sqrt{N^{K}T})$.

The next corollary, on the other hand, shows that the $O(\sqrt{N^{K}T})$ regret is not improvable, even with richer non-binary feedback and even if all products have the same profit parameter $p_{i}\equiv 1$.

Corollary 3.

Suppose $p_{i}\equiv 1$. There exists an adaptive adversary that chooses $\{v_{ti}\}_{t,i=1}^{T,N}\subseteq[0,1]$, such that for any algorithm, the regret is lower bounded by $\Omega_{K}(\sqrt{N^{K}T})$.

Both corollaries are consequences of Proposition 1 and Lemma 3 later in the paper.

1.3 Proof techniques

As we shall see later in this paper, the upper bounds of $\widetilde{O}_{g,K,d}(\sqrt{N^{d}T})$ and $O_{g,K}(\sqrt{N^{K}T})$ in Corollary 1 are relatively easier to establish, via reduction to known adversarial multi-armed or linear bandit algorithms. The key challenge is to establish the corresponding $\Omega(\sqrt{N^{d}T})$ and $\Omega(\sqrt{N^{K}T})$ lower bounds in Corollary 1.

In this section we give a high-level sketch of the key ideas in our lower bound proof. For simplicity we consider only the case when $g(\cdot)$ is not a polynomial function. The key insight is to prove the existence of a distribution $\mu$ on $v\in[0,1]^{N}$ such that, for any $S\subseteq[N]$ with $|S|=K$, the following holds for the reward distribution $\mathbb{P}(\cdot|S,v)$, with $\mathbb{P}_{0}\neq\mathbb{P}_{1}$:

$$\mathbb{E}_{v\sim\mu}[\mathbb{P}(\cdot|S,v)]\equiv\begin{cases}\mathbb{P}_{0},&\text{if }S=S^{\star},\\ \mathbb{P}_{1},&\text{if }S\neq S^{\star}.\end{cases} \qquad (4)$$

Intuitively, Eq. (4) shows that no information is gained unless an algorithm exactly guesses the optimal subset $S^{\star}$, even if the subset $S_{t}$ produced by the algorithm differs from $S^{\star}$ by only a single element. Since there are $\binom{N}{K}$ different subsets, the question of guessing the optimal subset $S^{\star}$ exactly is similar to locating the best arm of a multi-armed bandit question with $\binom{N}{K}=\Theta_{K}(N^{K})$ arms, which incurs a regret of $\Omega_{K}(\sqrt{N^{K}T})$.

To gain deeper insights into the construction of $v\sim\mu$ satisfying Eq. (4), it is instructive to consider some simpler bandit settings in which a small regret can be achieved, and to understand why the construction of $\mu$ does not apply there.

  • The first setting is dynamic assortment optimization under the stationary setting, in which the $\{v_{t}\}_{t=1}^{T}$ vectors remain the same for all $T$ periods. The results of (Agrawal et al., 2019) achieve $\widetilde{O}(\sqrt{NT})$ regret in this setting. Here the vector $v$ is deterministic and fixed, and therefore the laws $\mathbb{P}(\cdot|S,v)$ and $\mathbb{P}(\cdot|S^{\prime},v)$ must be correlated as long as $S\cap S^{\prime}\neq\emptyset$. This means that Eq. (4) cannot possibly be satisfied, with every subset $S\neq S^{\star}$ revealing no information about $S^{\star}$.

  • The second setting is the adversarial combinatorial bandit with a linear link function $g(x)=cx$, for which an $\widetilde{O}_{K}(\sqrt{NT})$ regret is attainable (Bubeck et al., 2012; Combes et al., 2015). When $g$ is linear, the mean reward under the mixture distribution $\mathbb{E}_{v\sim\mu}[\mathbb{P}(\cdot|S,v)]$ is

    $$\mathbb{E}_{v\sim\mu}[R(S,v)]=g(\langle\boldsymbol{1}_{S},\mathbb{E}_{\mu}[v]\rangle),$$

    where $\boldsymbol{1}_{S}\in\{0,1\}^{N}$ is the indicator vector of the subset $S\subseteq[N]$. Clearly, (4) cannot be achieved in this case, as there is no vector $w\in\mathbb{R}^{N}$ such that $\langle\boldsymbol{1}_{S},w\rangle$ is constant for all $S\neq S^{\star}$ while $\langle\boldsymbol{1}_{S^{\star}},w\rangle\neq\langle\boldsymbol{1}_{S},w\rangle$.

  • The third setting is a special stochastic combinatorial bandit, where $v\sim\mu$ is random with a product distribution $\mu=\prod_{i=1}^{N}\mu_{i}$ whose components $(\mu_{1},\cdots,\mu_{N})$ admit a total ordering under stochastic dominance, and $g$ is increasing. In this setting, it was shown in (Agarwal & Aggarwal, 2018) that a regret of $\widetilde{O}_{K}(N^{1/3}T^{2/3})$ can be achieved, and the stochastic dominance requirement implies that once an element of $[N]\backslash S^{\star}$ is replaced by an element of $S^{\star}$, the expected reward must increase. Therefore, (4) cannot hold in this scenario either.

1.4 Other related works

Combinatorial bandit is a classical question in machine learning and has been extensively studied under the settings of stochastic semi-bandits (Chen et al., 2013; Combes et al., 2015; Kveton et al., 2015; Chen et al., 2016a, b; Wang & Chen, 2018; Merlis & Mannor, 2019, 2020), stochastic bandits (Agarwal & Aggarwal, 2018; Rejwan & Mansour, 2020; Kuroki et al., 2020), and adversarial linear bandits (Cesa-Bianchi & Lugosi, 2012; Bubeck et al., 2012; Audibert et al., 2014; Combes et al., 2015). In the above-mentioned works, either the reward link function $g(\cdot)$ is linear, or the model is stochastic (stationary).

There is another line of research on dynamic assortment optimization with the multinomial logit model, which is a form of combinatorial bandit with general reward functions (Rusmevichientong et al., 2010; Agrawal et al., 2017; Chen et al., 2018a, b; Chen & Wang, 2018; Chen et al., 2019; Agrawal et al., 2019). All of these works are carried out under the stochastic setting, with the exception of (Chen et al., 2019), which studied an $\varepsilon$-contamination model and obtained a regret upper bound of $\widetilde{O}(\sqrt{NT}+\varepsilon T)$. Clearly, in the adversarial setting of this paper ($\varepsilon=1$) the regret bound of (Chen et al., 2019) becomes linear in $T$ and thus meaningless.

1.5 Notations

For a multi-index $\alpha\in\mathbb{N}^{d}$, let $|\alpha|=\sum_{i=1}^{d}\alpha_{i}$, and let $D^{\alpha}f=\partial^{|\alpha|}f/\prod_{i=1}^{d}\partial x_{i}^{\alpha_{i}}$ for a $d$-variate function $f$. For $m\in\mathbb{N}$ and an interval $I$, let $C^{m}(I)$ be the set of $m$-times continuously differentiable functions on $I$. For two probability distributions $P$ and $Q$, let $\mathrm{TV}(P,Q)=\frac{1}{2}\int|dP-dQ|$ and $D_{\mathrm{KL}}(P\|Q)=\int dP\log(dP/dQ)$ be the total variation (TV) distance and the Kullback–Leibler (KL) divergence, respectively. We adopt standard asymptotic notation: for two non-negative sequences $\{a_{n}\}$ and $\{b_{n}\}$, we write $a_{n}=O_{c}(b_{n})$ to denote that $a_{n}\leq Cb_{n}$ for all $n$, where the constant $C<\infty$ depends only on $c$; $a_{n}=\Omega_{c}(b_{n})$ to denote $b_{n}=O_{c}(a_{n})$; and $a_{n}=\Theta_{c}(b_{n})$ to denote both $a_{n}=O_{c}(b_{n})$ and $a_{n}=\Omega_{c}(b_{n})$. We also use $\widetilde{O}(\cdot),\widetilde{\Omega}(\cdot),\widetilde{\Theta}(\cdot)$ for the respective meanings up to a multiplicative poly-logarithmic factor in $(N,T)$.

2 Problem formulation and results

Suppose there are $N$ arms, $T$ time periods and a known reward function $g:\mathbb{R}_{+}\to[0,1]$. At each time period $t$, the algorithm outputs a subset $S_{t}\subseteq[N]$ with $|S_{t}|=K$ and receives a binary bandit feedback $r_{t}\in\{0,1\}$. Note that the binary feedback structure can be significantly relaxed for the purpose of upper bounds, as discussed in Sec. 2.1. Let $\mathcal{F}_{t}=\{S_{\tau},r_{\tau}\}_{\tau\leq t}$ be the filtration of observable statistics at time $t$, and $\mathcal{V}_{t}=\{v_{\tau}\}_{\tau\leq t}$ be the filtration of unobservable reward vectors. Let also $\mathcal{A}$ be an unknown adversary and $\pi$ be an admissible policy. The reward dynamics are modeled as follows:

$$v_{t}\sim\mathcal{A}(\mathcal{F}_{t-1},\mathcal{V}_{t-1});\qquad S_{t}\sim\pi(\mathcal{F}_{t-1});\qquad r_{t}\sim\mathrm{Bernoulli}\Big(g\Big(\sum_{i\in S_{t}}v_{ti}\Big)\Big).$$

For any $g(\cdot),N,K,T$, the minimax regret $\mathfrak{R}(g,N,K,T)$ is defined as

$$\mathfrak{R}(g,N,K,T):=\inf_{\pi}\sup_{\mathcal{A}}\max_{|S^{\star}|=K}\mathbb{E}\left[\sum_{t=1}^{T}R(S^{\star},v_{t})-R(S_{t},v_{t})\right], \qquad (5)$$

where $R(S,v_{t})=g(\sum_{i\in S}v_{ti})$, and the expectation is taken with respect to both the bandit algorithm $\pi$ and the adaptive adversary $\mathcal{A}$.

The following theorem is a rigorous statement of the main result of this paper.

Theorem 1.

Fix a function $g:\mathbb{R}^{+}\to[0,1]$ that is $K$-times continuously differentiable on $(0,K)$. If $g$ is a polynomial with degree $d\in[1,K)$, then there exist constants $0<c_{g,d,K}\leq C_{g,d,K}<\infty$ depending only on $g,d,K$ such that, for every $N\geq K$ and $T\geq 1$, it holds that

$$c_{g,d,K}\leq\frac{\mathfrak{R}(g,N,K,T)}{\min\{T,\sqrt{N^{d}T}\}}\leq C_{g,d,K}\sqrt{\log N}.$$

Furthermore, if $g$ is a polynomial with degree at least $K$ or not a polynomial, then there exist constants $0<c_{g,K}\leq C_{g,K}<\infty$ depending only on $g,K$ such that, for every $N\geq K$ and $T\geq 1$, it holds that

$$c_{g,K}\leq\frac{\mathfrak{R}(g,N,K,T)}{\min\{T,\sqrt{N^{K}T}\}}\leq C_{g,K}.$$
Remark 1.

Based on Propositions 1 and 2 later, the hidden dependence of the constants $C_{g,K}$ and $C_{g,d,K}$ on $(g,d,K)$ is $O(1)$ and $O(\sqrt{K})$, respectively. However, since our lower bound relies on an existential result (cf. Lemma 2), the hidden dependence of the constants $c_{g,d,K}$ and $c_{g,K}$ is unknown. It is an outstanding open question to characterize an explicit dependence on $(g,d,K)$ in the lower bound.

The results in Theorem 1 cover the linear reward case $g(x)=cx$ via $d=1$, with a regret of $\widetilde{\Theta}_{K}(\min\{T,\sqrt{NT}\})$, which matches the existing results on adversarial linear combinatorial bandits. On the other hand, the results for general non-polynomial reward functions $g(\cdot)$ are quite negative, with a $\Theta_{g,K}(\min\{T,\sqrt{N^{K}T}\})$ regret showing that all $\binom{N}{K}$ subsets are essentially independent, and the bandit algorithm cannot hope to exploit correlation between overlapping subsets as in the linear case. Finally, the case of $g(\cdot)$ being a low-degree polynomial interpolates between the linear case and the general case, with a regret of $\widetilde{\Theta}_{g,d,K}(\min\{T,\sqrt{N^{d}T}\})$ for $d\in(1,K)$, between $\widetilde{\Theta}_{K}(\min\{T,\sqrt{NT}\})$ and $\Theta_{g,K}(\min\{T,\sqrt{N^{K}T}\})$.

In the rest of this section we sketch the proofs of Theorem 1 by studying the upper bounds and lower bounds separately. We will also adapt the proof of Theorem 1 to cover the more general dynamic assortment optimization model described in Sec. 1.2.

2.1 Upper bounds

We first prove the $O_{K}(\min\{T,\sqrt{N^{K}T}\})$ regret upper bound for general link functions. In fact, we state the following result, which is much more general than Theorem 1.

Proposition 1.

Suppose at each time $t$ the adversary $\mathcal{A}$ could choose an arbitrary combinatorial reward model $\{\mathbb{P}_{t}(\cdot|S)\}_{S\subseteq[N],|S|=K}$ and an arbitrary bandit feedback $\{b_{t}\}_{t=1}^{T}$. Let also $r_{t}(b_{t})\in[0,1]$ denote the reward as a function of $b_{t}$. There exists a bandit algorithm $\pi$ and a universal constant $C<\infty$ such that for any $N\geq K$, $T\geq 1$,

$$\sup_{\mathcal{A}}\max_{|S^{\star}|=K}\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}(r_{t}(b_{t})|S^{\star},\mathbb{P}_{t})-\mathbb{E}(r_{t}(b_{t})|S_{t},\mathbb{P}_{t})\right]\leq C\min\{T,\sqrt{N^{K}T}\}.$$
Proof.

For each subset $S\subseteq[N]$ of size $K$, let $j_{S}$ be a constructed arm and $r_{t}(b_{t,j_{S}})$ be the bandit reward feedback at time $t$ if arm $j_{S}$ is pulled (i.e., subset $S$ is selected). This reduces the problem to an adversarial multi-armed bandit problem with $\binom{N}{K}$ independent arms. Applying the Implicitly Normalized Forecaster (INF) algorithm from (Audibert & Bubeck, 2009), we have the regret upper bound $O(\min\{T,\sqrt{\binom{N}{K}T}\})=O_{K}(\min\{T,\sqrt{N^{K}T}\})$. ∎
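As a minimal illustration of this reduction (not the algorithm analyzed in the proof), the sketch below runs EXP3 (Auer et al., 1995) over the $\binom{N}{K}$ meta-arms; EXP3 is a simpler stand-in for INF that attains the same $\sqrt{\binom{N}{K}T}$ scaling up to a logarithmic factor. The instance at the bottom is hypothetical:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(2)

def exp3_over_subsets(N, K, T, play):
    """EXP3 on the binom(N, K) meta-arms; `play(t, S)` returns a reward in [0, 1]."""
    arms = list(itertools.combinations(range(N), K))
    n = len(arms)
    gamma = min(1.0, math.sqrt(n * math.log(n) / ((math.e - 1) * T)))
    logw = np.zeros(n)  # log-weights, for numerical stability
    total = 0.0
    for t in range(T):
        w = np.exp(logw - logw.max())
        probs = (1 - gamma) * w / w.sum() + gamma / n
        j = rng.choice(n, p=probs)
        r = play(t, arms[j])
        total += r
        # Importance-weighted reward estimate, updated for the pulled arm only.
        logw[j] += gamma * (r / probs[j]) / n
    return total

# Usage on the model of Eq. (1) with an illustrative non-polynomial link.
N, K, T = 5, 2, 5000
g = lambda x: x / (1 + x)
V = rng.uniform(0.0, 1.0, size=(T, N))
total = exp3_over_subsets(
    N, K, T, lambda t, S: rng.binomial(1, g(V[t][list(S)].sum())))
print(f"cumulative reward of EXP3 over T={T} rounds: {total:.0f}")
```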

We remark that Proposition 1 is more general and contains the $C_{g,K}\min\{T,\sqrt{N^{K}T}\}$ upper bound in Theorem 1 as a special case. By considering the feedback model $b_{t}\in S_{t}\cup\{0\}$ and $r_{t}(b_{t})=\sum_{i\in S_{t}}p_{i}\boldsymbol{1}\{b_{t}=i\}$, Proposition 1 also covers the dynamic assortment optimization model described in Sec. 1.2 and Corollary 2.

We next establish the $O(\min\{T,\sqrt{N^{d}T\log N}\})$ upper bound for polynomial reward functions.

Proposition 2.

Fix a known $d$-degree polynomial $g(x)=a_{d}x^{d}+a_{d-1}x^{d-1}+\cdots+a_{1}x+a_{0}$. Suppose at each time $t$, conditioned on the selected subset $S_{t}\subseteq[N]$ of size $K$, the bandit feedback $r_{t}$ is supported on an arbitrary bounded set (not necessarily $\{0,1\}$), such that $\mathbb{E}[r_{t}|S_{t},v_{t}]=g(\sum_{i\in S_{t}}v_{ti})$. Then there exists a bandit algorithm $\pi$ such that, for any adaptive adversary $\mathcal{A}$,

$$\max_{|S^{\star}|=K}\mathbb{E}\left[\sum_{t=1}^{T}R(S^{\star},v_{t})-R(S_{t},v_{t})\right]\leq C_{g,d,K}\min\{T,\sqrt{N^{d}T\log N}\},$$

where $R(S,v_{t})=g(\sum_{i\in S}v_{ti})$ and $C_{g,d,K}=O(\sqrt{K})$.

Proof.

For any $N$-dimensional vector $x\in\mathbb{R}^{N}$, let $x^{\otimes d}=(x_{i_{1}}x_{i_{2}}\cdots x_{i_{d}})_{i_{1},\cdots,i_{d}=1}^{N}\in\mathbb{R}^{N^{d}}$ be the order-$d$ tensorization of $x$. It is easy to verify that, for any $0\leq k\leq d$ and $S_{t}\subseteq[N]$, $(\sum_{i\in S_{t}}v_{ti})^{k}=\langle v_{t},\boldsymbol{1}_{S_{t}}\rangle^{k}=\langle v_{t}^{\otimes k},\boldsymbol{1}_{S_{t}}^{\otimes k}\rangle$, where $\boldsymbol{1}_{S_{t}}\in\{0,1\}^{N}$ is the indicator vector of $S_{t}$. Hence,

$$R(S_{t},v_{t})=\sum_{k=0}^{d}a_{k}\langle v_{t},\boldsymbol{1}_{S_{t}}\rangle^{k}=\sum_{k=0}^{d}a_{k}\langle v_{t}^{\otimes k},\boldsymbol{1}_{S_{t}}^{\otimes k}\rangle.$$

Define $\widetilde{v}_{t}\in\mathbb{R}^{N^{d}}$ as

$$\widetilde{v}_{t}:=\sum_{k=0}^{d}a_{k}\underbrace{v_{t}\otimes\cdots\otimes v_{t}}_{k\text{ times}}\otimes\underbrace{\frac{\boldsymbol{1}}{K}\otimes\cdots\otimes\frac{\boldsymbol{1}}{K}}_{d-k\text{ times}}.$$

As $\langle\boldsymbol{1},\boldsymbol{1}_{S_{t}}\rangle=K$, it is easy to verify that, for every $S_{t}\subseteq[N]$, $R(S_{t},v_{t})=\mathbb{E}[r_{t}|S_{t},v_{t}]=\langle\widetilde{v}_{t},\boldsymbol{1}_{S_{t}}^{\otimes d}\rangle$.

With this transformation, the problem reduces to an adversarial linear bandit with dimension $D=N^{d}$ and fixed action set $\mathcal{X}=\{\boldsymbol{1}_{S}^{\otimes d}\}_{S\subseteq[N],|S|=K}$ with $|\mathcal{X}|=\binom{N}{K}$. Applying the EXP2 algorithm with John's exploration and the analysis in (Bubeck et al., 2012), the regret is upper bounded by $O(\sqrt{DT\log|\mathcal{X}|})=O(\sqrt{N^{d}TK\log N})$, which proves Proposition 2. ∎
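The tensorization identity $R(S_{t},v_{t})=\langle\widetilde{v}_{t},\boldsymbol{1}_{S_{t}}^{\otimes d}\rangle$ at the heart of this proof can be checked numerically. Below is a small sanity-check sketch with illustrative sizes, a random polynomial and a random subset; it is a verification aid, not part of the algorithm:

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, d = 6, 3, 2                        # illustrative sizes with d < K
a = rng.standard_normal(d + 1)           # polynomial coefficients a_0, ..., a_d
v = rng.uniform(0.0, 1.0, size=N)
S = (0, 2, 5)                            # an arbitrary size-K subset

one_S = np.zeros(N)
one_S[list(S)] = 1.0
ones = np.ones(N)

def tensor(vecs):
    """Flattened outer product of the given vectors (order-len(vecs) tensorization)."""
    out = np.array([1.0])
    for x in vecs:
        out = np.outer(out, x).ravel()
    return out

# v_tilde = sum_k a_k * v^{(x)k} (x) (1/K)^{(x)(d-k)}, as in the displayed definition.
v_tilde = sum(a[k] * tensor([v] * k + [ones / K] * (d - k)) for k in range(d + 1))

lhs = sum(a[k] * v[list(S)].sum() ** k for k in range(d + 1))  # g(sum_{i in S} v_i)
rhs = v_tilde @ tensor([one_S] * d)                            # <v_tilde, 1_S^{(x)d}>
print(np.isclose(lhs, rhs))                                    # True
```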

Propositions 1 and 2 complete the proof of minimax upper bounds in Theorem 1.

2.2 Lower bounds

We first prove the following result corresponding to the minimax lower bounds in Theorem 1.

Lemma 1.

Suppose $r_{t}\sim\mathrm{Bernoulli}(g(\sum_{i\in S_{t}}v_{ti}))$ for some fixed, known function $g:\mathbb{R}_{+}\to[0,1]$ that is $K$-times continuously differentiable on $(0,K)$. If $g$ is a degree-$d$ polynomial with $d<K$, then there exists a constant $c_{g,d,K}>0$ such that for all $N\geq K$, $T\geq 1$,

$$\mathfrak{R}(g,N,K,T)\geq c_{g,d,K}\min\{T,\sqrt{N^{d}T}\}.$$

Otherwise, there exists a constant $c_{g,K}>0$ such that for all $N\geq K$, $T\geq 1$,

$$\mathfrak{R}(g,N,K,T)\geq c_{g,K}\min\{T,\sqrt{N^{K}T}\}.$$

Our proof is based on the following technical lemma:

Lemma 2.

Let $g\in C^{m}([0,b])$ be a real-valued, $m$-times continuously differentiable function on $[0,b]$ with $b\geq m$. Then the following two statements are equivalent:

  1. $g$ is not a polynomial of degree at most $m-1$;

  2. there exists a random vector $(X_{1},\cdots,X_{m})$ supported on $[0,1]^{m}$, following an exchangeable joint distribution $\mu$, and a scalar $x_{0}\in[0,1]$, such that

    $$\mathbb{E}_{\mu}[g(X_{1}+\cdots+X_{\ell-1}+(b-\ell+1)x_{0})]=\mathbb{E}_{\mu}[g(X_{1}+\cdots+X_{\ell}+(b-\ell)x_{0})] \qquad (6)$$

    for all $\ell=1,2,\cdots,m-1$, and

    $$\mathbb{E}_{\mu}[g(X_{1}+\cdots+X_{m-1}+(b-m+1)x_{0})]<\mathbb{E}_{\mu}[g(X_{1}+\cdots+X_{m}+(b-m)x_{0})]. \qquad (7)$$

The proof of Lemma 2 is deferred to Sec. 3. The existence claim in Lemma 2 is non-constructive and is established via duality arguments. Its proof also applies several technical tools from real analysis and functional analysis (Rudin, 1991; Donoghue, 1969; Dudley, 2018).

We are now ready to prove Lemma 1.

Proof.

We first prove the case when $g(\cdot)$ is not a polynomial of degree at most $K-1$. We use the construction $(X_{1},\cdots,X_{K})$ and $x_{0}$ in Lemma 2 with $(m,b)=(K,K)$, and construct i.i.d. copies $(X_{t,1},\cdots,X_{t,K})$ for each $t\in[T]$. Consider the following random strategy of the adversary $\mathcal{A}$ when the optimal subset is $S^{\star}$: at each time $t\in[T]$, nature assigns $(X_{t,1},\cdots,X_{t,K})$ to the restriction of $v_{t}$ to $S^{\star}$ with probability $\delta\in(0,1]$, and assigns $(x_{0},\cdots,x_{0})$ otherwise, with the parameter $\delta$ to be specified later; nature also assigns $v_{ti}=x_{0}$ for all $i\notin S^{\star}$. Suppose an algorithm selects a subset $S_{t}\subseteq[N]$, $|S_{t}|=K$ at time $t$ with $|S_{t}\cap S^{\star}|=\ell$. Then

$$\mathbb{E}_{v_{t}}[\mathbb{E}[r_{t}|S_{t},v_{t}]]=\delta\,\mathbb{E}_{\mu}[g(X_{1}+\cdots+X_{\ell}+(K-\ell)x_{0})]+(1-\delta)g(Kx_{0})=g(Kx_{0})+\delta\gamma\cdot\boldsymbol{1}\{\ell=K\}, \qquad (8)$$

where $\gamma:=\mathbb{E}_{\mu}[g(X_{1}+\cdots+X_{K})]-g(Kx_{0})>0$. Essentially, Eq. (8) shows that the marginal distribution of $r_{t}$ conditioned on $S_{t}$ is $\mathrm{Bernoulli}(g(Kx_{0}))$ if $S_{t}\neq S^{\star}$, but $\mathrm{Bernoulli}(g(Kx_{0})+\delta\gamma)$ if $S_{t}=S^{\star}$.

Let $\mathbb{P}_{S}$ denote the distribution of $\{r_{t}\}_{t=1}^{T}$ when the adversary chooses $S^{\star}=S$ as the optimal subset, and let $\mathbb{P}_{0}$ denote the distribution of $\{r_{t}\}_{t=1}^{T}$ with $v_{ti}\equiv x_{0}$. As the $\{r_{t}\}$ are binary, Eq. (8) implies that, for every $S\subseteq[N]$ with $|S|=K$,

$$D_{\mathrm{KL}}(\mathbb{P}_{0}\|\mathbb{P}_{S})\leq\mathbb{E}_{0}[T_{S}]\times\Gamma_{g,K}^{2}\delta^{2}, \qquad (9)$$

where $\mathbb{E}_{0}$ is the expectation under $\mathbb{P}_{0}$, $T_{S}:=\sum_{t=1}^{T}\boldsymbol{1}\{S_{t}=S\}$ is the number of times $S$ is selected, and the constant $\Gamma_{g,K}<\infty$ depends only on $g(Kx_{0})$ and $\mathbb{E}_{\mu}[g(X_{1}+\cdots+X_{K})]$, and thus only on $g,K$. By Pinsker's inequality,

$$\big|\mathbb{E}_{0}[T_{S}]-\mathbb{E}_{S}[T_{S}]\big|\leq T\cdot\mathrm{TV}(\mathbb{P}_{0},\mathbb{P}_{S})\leq\Gamma_{g,K}\delta T\sqrt{\mathbb{E}_{0}[T_{S}]}. \qquad (10)$$

Subsequently,

$$\mathfrak{R}(g,N,K,T)=\max_{|S^{\star}|=K}\mathbb{E}\left[\sum_{t=1}^{T}R(S^{\star},v_{t})-R(S_{t},v_{t})\right]$$
$$\geq\frac{1}{\binom{N}{K}}\sum_{|S^{\star}|=K}\mathbb{E}\left[\sum_{t=1}^{T}R(S^{\star},v_{t})-R(S_{t},v_{t})\right]$$
$$=\frac{\delta\gamma}{\binom{N}{K}}\sum_{|S^{\star}|=K}\big(T-\mathbb{E}_{S^{\star}}[T_{S^{\star}}]\big)$$
$$\geq\frac{\delta\gamma}{\binom{N}{K}}\sum_{|S^{\star}|=K}\Big(T-\mathbb{E}_{0}[T_{S^{\star}}]-\big|\mathbb{E}_{S^{\star}}[T_{S^{\star}}]-\mathbb{E}_{0}[T_{S^{\star}}]\big|\Big)$$
$$\geq\frac{\delta\gamma}{\binom{N}{K}}\left[\binom{N}{K}T-T-\Gamma_{g,K}\delta T\sum_{|S|=K}\sqrt{\mathbb{E}_{0}[T_{S}]}\right] \qquad (11)$$
$$\geq\frac{\delta\gamma}{\binom{N}{K}}\left[\binom{N}{K}T-T-\Gamma_{g,K}\delta T\sqrt{\binom{N}{K}T}\right]. \qquad (12)$$

Here, Eq. (11) follows from the identity $\sum_{|S|=K}\mathbb{E}_{0}[T_{S}]=T$ together with Eq. (10), and Eq. (12) is due to the Cauchy–Schwarz inequality. Setting $\delta=\min\{1,\sqrt{\binom{N}{K}}/(2\Gamma_{g,K}\sqrt{T})\}$, Eq. (12) is lower bounded by the minimum of $\Omega(T)$ and

$$\frac{\delta\gamma}{\binom{N}{K}}\left[\binom{N}{K}T-T-\binom{N}{K}\frac{T}{2}\right]=\Omega_{g,K}\left(\sqrt{\binom{N}{K}T}\right),$$

which is to be demonstrated.

The scenario when $g(\cdot)$ is a polynomial of degree $d<K$ can be proved in an entirely similar way. Applying Lemma 2 with $(m,b)=(d,K)$, we obtain a random vector $(X_{1},\cdots,X_{d})$ and some $x_{0}\in[0,1]$ such that conditions (6) and (7) hold. Nature uses the same strategy, with the only difference being that the size of the random subset $S^{\star}$ is $d$ instead of $K$. In this case, any size-$K$ set $S$ with $S^{\star}\subseteq S$ gives the optimal reward, and the learner observes the non-informative outcome at time $t$ if and only if $S^{\star}\nsubseteq S_{t}$. Consequently, both Eqs. (8) and (9) still hold, with $\boldsymbol{1}\{\ell=K\}$ replaced by $\boldsymbol{1}\{\ell\geq d\}$ in Eq. (8) and the definition of $T_{S}$ changed to $T_{S}=\sum_{t=1}^{T}\boldsymbol{1}\{S\subseteq S_{t}\}$ in Eq. (9). Using

$$\sum_{|S|=d}T_{S}=\sum_{t=1}^{T}\sum_{|S|=d}\boldsymbol{1}\{S\subseteq S_{t}\}=\binom{K}{d}T,$$

the analogue of Eq. (12) with $|S|=d$ yields the following lower bound on $\mathfrak{R}(g,N,K,T)$:

$$\frac{\delta\gamma}{\binom{N}{d}}\left[\binom{N}{d}T-\binom{K}{d}T-\Gamma_{g,d,K}\delta T\sqrt{\binom{N}{d}\binom{K}{d}T}\right].$$

Setting $\delta=\min\{1,\sqrt{\binom{N}{d}}/(2\Gamma_{g,d,K}\sqrt{\binom{K}{d}T})\}$ and noting that $\binom{K}{d}$ does not depend on $(N,T)$, the above lower bound simplifies to

$$\Omega_{g,d,K}\left(\sqrt{\binom{N}{d}T}\right),$$

which is to be demonstrated. ∎
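The step behind Eq. (9) is a per-round quadratic bound on the Bernoulli KL divergence, $D_{\mathrm{KL}}(\mathrm{Bern}(p)\|\mathrm{Bern}(q))\leq(p-q)^{2}/(q(1-q))$ (the standard chi-square upper bound on KL), applied with $p=g(Kx_{0})$ and $q=g(Kx_{0})+\delta\gamma$. A quick numerical sanity check of this bound on randomly drawn illustrative values:

```python
import numpy as np

def kl_bern(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), with p, q in (0, 1)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

rng = np.random.default_rng(4)
for _ in range(10):
    q = rng.uniform(0.1, 0.9)                    # plays the role of g(K x_0) + delta*gamma
    p = float(np.clip(q + rng.uniform(-0.05, 0.05), 0.01, 0.99))  # plays the role of g(K x_0)
    chi2 = (p - q) ** 2 / (q * (1 - q))          # chi-square divergence between the two laws
    assert kl_bern(p, q) <= chi2 + 1e-12
print("KL(Bern(p) || Bern(q)) <= (p - q)^2 / (q (1 - q)) on all random instances")
```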

Next, we establish an $\Omega(\sqrt{N^{K}T})$ lower bound for the dynamic assortment optimization model described in Sec. 1.2. It implies Corollary 3 in the introduction.

Lemma 3.

Consider the dynamic assortment optimization question with multinomial logit choice models described in Sec. 1.2. Adopt the unit-price model, with $p_{i}\equiv 1$ for all $i\in[N]$. Then there exist an adversary choosing $\{v_{ti}\}_{t,i=1}^{T,N}$ and a constant $c_{K}>0$ depending only on $K$, such that for any bandit algorithm and $N\geq K$, $T\geq 1$, it holds that

$$\max_{|S^{\star}|=K}\mathbb{E}\left[\sum_{t=1}^{T}R(S^{\star},v_{t})-R(S_{t},v_{t})\right]\geq c_{K}\min\{T,\sqrt{N^{K}T}\},$$

where $R(S,v_{t})=\frac{\sum_{i\in S}p_{i}v_{ti}}{1+\sum_{i\in S}v_{ti}}$.

Proof.

It is easy to verify that $R(S,v_{t})=g(\sum_{i\in S}v_{ti})$ with $g(x)=x/(1+x)$. Since the link function $g$ here is clearly not a polynomial of any degree, the construction in Lemma 1 should immediately imply an $\Omega(\sqrt{N^{K}T})$ lower bound. However, in the multinomial logit choice model for dynamic assortment optimization, the bandit feedback is not binary. This means that the expectation calculations in Eq. (8) are not sufficient, and we have to calculate the KL divergence between the discrete observations directly.

Recall that in the MNL model, the bandit feedback $i_{t}$ is supported on $S_{t}\cup\{0\}$, with $\mathbb{P}[i_{t}=i|S_{t},v_{t}]=v_{ti}/(1+\sum_{j\in S_{t}}v_{tj})$ for all $i\in S_{t}$ and $\mathbb{P}[i_{t}=0|S_{t},v_{t}]=1/(1+\sum_{j\in S_{t}}v_{tj})$. Suppose $|S_{t}\cap S^{\star}|=\ell<K$. Then

$$\mathbb{E}_{v_{t}}[\mathbb{P}[i_{t}=0|S_{t},v_{t}]]=1-\mathbb{E}_{\mu}[g(X_{1}+\cdots+X_{\ell}+(K-\ell)x_{0})]=1-g(Kx_{0})=1/(1+Kx_{0}). \qquad (13)$$

For every $i\in S_{t}\backslash S^{\star}$, $v_{ti}\equiv x_{0}$, and therefore

$$\mathbb{E}_{v_{t}}[\mathbb{P}[i_{t}=i|S_{t},v_{t}]]=x_{0}\big(1-\mathbb{E}_{\mu}[g(X_{1}+\cdots+X_{\ell}+(K-\ell)x_{0})]\big)=x_{0}(1-g(Kx_{0}))=x_{0}/(1+Kx_{0}). \qquad (14)$$

Next consider any $i\in S_{t}\cap S^{\star}$. Because $X_{1},\cdots,X_{\ell}$ are exchangeable, the probabilities $\mathbb{E}_{v_{t}}[\mathbb{P}[i_{t}=i|S_{t},v_{t}]]$ are the same for all $i\in S_{t}\cap S^{\star}$. Define $\beta:=\mathbb{E}_{v_{t}}[\mathbb{P}[i_{t}=i|S_{t},v_{t}]]$ for some $i\in S_{t}\cap S^{\star}$. By the law of total probability and Eqs. (13) and (14), we have that

$$1=\sum_{i\in S_{t}\cup\{0\}}\mathbb{E}_{v_{t}}[\mathbb{P}(i_{t}=i|S_{t},v_{t})]=\ell\beta+\frac{1+(K-\ell)x_{0}}{1+Kx_{0}}.$$

Consequently,

$$\beta=\frac{1}{\ell}\left[1-\frac{1+(K-\ell)x_{0}}{1+Kx_{0}}\right]=\frac{x_{0}}{1+Kx_{0}}. \qquad (15)$$

Comparing Eqs. (13)–(15), we conclude that for any $|S_{t}|=K$ with $S_{t}\neq S^{\star}$, $\mathbb{E}_{v_{t}}[\mathbb{P}[i_{t}=0|S_{t},v_{t}]]=1/(1+Kx_{0})$ and $\mathbb{E}_{v_{t}}[\mathbb{P}[i_{t}=i|S_{t},v_{t}]]=x_{0}/(1+Kx_{0})$ for all $i\in S_{t}$. This shows that all $S_{t}$ with $|S_{t}|=K$ and $S_{t}\neq S^{\star}$ are information-theoretically indistinguishable. On the other hand, for $S_{t}=S^{\star}$, with probability $1-\delta$ all elements of $v_{t}$ are assigned $x_{0}$, in which case $\mathbb{P}[\cdot|S_{t}=S^{\star},v_{t}]=\mathbb{P}[\cdot|S_{t}\neq S^{\star},v_{t}]$. Hence, $D_{\mathrm{KL}}(\mathbb{P}_{0}\|\mathbb{P}_{S^{\star}})\leq O(\delta^{2})\times\mathbb{E}_{0}[T_{S^{\star}}]$. The rest of the proof is identical to the proof of Lemma 1 when the link function $g$ is not a polynomial. ∎

3 Proof of Lemma 2

We first prove the easy direction $2\Rightarrow 1$, whose contrapositive is that if $g$ is a polynomial of degree at most $m-1$, then (6) and (7) cannot hold simultaneously. In fact, defining $s(x):=g(x+bx_{0})-g(bx_{0})$ and $Y_{i}:=X_{i}-x_{0}$ for all $i\in[m]$, condition (6) implies that for all $\ell=1,\cdots,m-1$, it holds that $\mathbb{E}[s(Y_{1}+\cdots+Y_{\ell})]=0$. By the exchangeability of $(Y_{1},\cdots,Y_{m})$, this also shows that $\mathbb{E}[s_{\ell}(Y_{1},\cdots,Y_{m})]=0$, where for all $\ell=1,2,\cdots,m-1$,

$$s_{\ell}(Y_{1},\cdots,Y_{m}):=\frac{1}{\binom{m}{\ell}}\sum_{S\subseteq[m]:|S|=\ell}s\left(\sum_{i\in S}Y_{i}\right). \qquad (16)$$

Since $s$ is a polynomial of degree at most $m-1$ and $s(0)=0$, the following lemma shows that $\mathbb{E}[s_{m}(Y_{1},\cdots,Y_{m})]=0$, a contradiction to (7).

Lemma 4.

For $m$ reals $y_{1},\cdots,y_{m}$ and any polynomial $s$ of degree at most $m-1$, define $s_{0}\equiv s(0)$ and $s_{\ell}$ as in (16) for $\ell=1,\cdots,m$. Then the following identity holds:

$$\sum_{\ell=0}^{m}(-1)^{\ell}\binom{m}{\ell}s_{\ell}(y_{1},\cdots,y_{m})=0.$$
Proof.

By linearity it suffices to prove Lemma 4 for $s(x)=x^{d}$, $d\in\{0,1,\cdots,m-1\}$. In this case, simple algebra verifies that the coefficient of $\prod_{i=1}^{m}y_{i}^{r_{i}}$, with $r_{i}\geq 0$ and $\sum_{i=1}^{m}r_{i}=d$, in $s_{\ell}(y_{1},\cdots,y_{m})$ is

$$\frac{d!}{\prod_{i=1}^{m}(r_{i}!)}\cdot\frac{\binom{m-I(r)}{\ell-I(r)}}{\binom{m}{\ell}},$$

where $I(r):=\sum_{i=1}^{m}\boldsymbol{1}(r_{i}>0)$ and $\binom{a}{b}:=0$ if $b<0$. Consequently, the coefficient of $\prod_{i=1}^{m}y_{i}^{r_{i}}$ on the LHS of the identity is

$$\frac{d!}{\prod_{i=1}^{m}(r_{i}!)}\sum_{\ell=0}^{m}(-1)^{\ell}\binom{m-I(r)}{\ell-I(r)}=\frac{d!}{\prod_{i=1}^{m}(r_{i}!)}\sum_{\ell=I(r)}^{m}(-1)^{\ell}\binom{m-I(r)}{\ell-I(r)}=0,$$

where the last identity makes use of the inequality $I(r)\leq d<m$. ∎
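Lemma 4 is a purely combinatorial identity, so it can be verified numerically; the following is a minimal sketch checking the alternating sum for random reals and a random polynomial of degree at most $m-1$ (illustrative $m$):

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(5)
m = 5
y = rng.standard_normal(m)
coef = rng.standard_normal(m)                  # random polynomial of degree <= m - 1
s = lambda x: sum(c * x ** j for j, c in enumerate(coef))

def s_ell(ell):
    """s_0 = s(0); s_ell averages s over all size-ell subsets of y, cf. Eq. (16)."""
    if ell == 0:
        return s(0.0)
    subs = itertools.combinations(range(m), ell)
    return float(np.mean([s(y[list(sub)].sum()) for sub in subs]))

total = sum((-1) ** ell * math.comb(m, ell) * s_ell(ell) for ell in range(m + 1))
print(f"alternating sum = {total:.2e} (should vanish up to rounding)")
```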

Next we prove the hard direction $1\Rightarrow 2$. The proof uses ideas from convex analysis: after fixing $x_{0}$, the problem is to find an exchangeable distribution of $(X_{1},\cdots,X_{m})$ from the convex set of all such distributions; moreover, both constraints (6) and (7) are linear in the joint distribution, and the Dirac point measure on $(x_{0},\cdots,x_{0})$ satisfies (6), as well as (7) if $>$ is replaced by $\geq$. Therefore, there are two convex sets in total, one being the family of all exchangeable joint distributions and the other arising from the constraints (6) and (7), and their closures have an intersection point, namely the Dirac point measure at $(x_{0},\cdots,x_{0})$. Our target is to show that these sets have a non-empty intersection without taking closures, and the following lemma characterizes a sufficient condition via duality.

Lemma 5.

For $g\in C([0,b])$ and a fixed $x_{0}\in[0,1]$, there exists an exchangeable Borel distribution on $[0,1]^{m}$ such that both (6) and (7) hold if the following condition is satisfied: for every non-zero vector $\lambda=(\lambda_{1},\cdots,\lambda_{m})$ with $\lambda_{m}\geq 0$, and with the functions $g_{\ell}(x_{1},\cdots,x_{m})$ defined as

$$g_{\ell}(x_{1},\cdots,x_{m})=\frac{1}{\binom{m}{\ell}}\sum_{S\subseteq[m]:|S|=\ell}g\left(\sum_{i\in S}x_{i}+(b-\ell)x_{0}\right)$$

for $\ell=1,2,\cdots,m$, the point $(x_{0},\cdots,x_{0})$ is not a global maximum of the function $\sum_{\ell=1}^{m}\lambda_{\ell}g_{\ell}(x)$ over $[0,1]^{m}$.

Proof.

We prove the contrapositive of this result. Let $X$ be the topological vector space of all finite Borel signed measures on $[0,1]^{m}$ (which are also Radon measures by Ulam's theorem; cf. (Dudley, 2018, Theorem 7.1.4)) equipped with the weak-$*$ topology (as the dual of $C([0,1]^{m})$), and let $A\subseteq X$ be the collection of all exchangeable Borel probability measures (i.e., those invariant to permutations). Moreover, for $\varepsilon>0$, we define $B_{\varepsilon}\subseteq X$ as the collection of all signed Borel measures $\mu$ such that

$$\int_{[0,1]^{m}}(g_{\ell}(x)-g(bx_{0}))\,\mu(\mathrm{d}x)=0$$

for $\ell=1,\cdots,m-1$ and

$$\int_{[0,1]^{m}}(g_{m}(x)-g(bx_{0})-\varepsilon)\,\mu(\mathrm{d}x)=0.$$

Therefore, the non-existence of such an exchangeable distribution implies that $A\cap B_{\varepsilon}=\emptyset$ for all $\varepsilon>0$. Since $[0,1]^{m}$ is compact, the set $A$ is a closed subset of the unit weak-$*$ ball, and is therefore compact by Banach–Alaoglu (see, e.g., (Rudin, 1991, Theorem 3.15)). Moreover, as $g_{\ell}\in C([0,1]^{m})$, the set $B_{\varepsilon}$ is closed under the weak-$*$ topology. Finally, as both sets are convex, and (Rudin, 1991, Theorem 3.10) shows that the dual of $X$ under the weak-$*$ topology is $C([0,1]^{m})$, (Rudin, 1991, Theorem 3.4) implies that there exist some $f_{\varepsilon}\in C([0,1]^{m})$ and $\gamma_{\varepsilon}\in\mathbb{R}$ such that

$$\sup_{\mu\in A}\int f_{\varepsilon}\,\mathrm{d}\mu<\gamma_{\varepsilon}\leq\inf_{\nu\in B_{\varepsilon}}\int f_{\varepsilon}\,\mathrm{d}\nu. \qquad (17)$$

As $B_{\varepsilon}$ is a linear subspace of $X$, the RHS of (17) must be zero (otherwise it would be $-\infty$). Then by (Rudin, 1991, Lemma 3.9), we must have $f_{\varepsilon}(x)=\sum_{\ell=1}^{m-1}\lambda_{\varepsilon,\ell}(g_{\ell}(x)-g(bx_{0}))+\lambda_{\varepsilon,m}(g_{m}(x)-g(bx_{0})-\varepsilon)$ for some vector $\lambda_{\varepsilon}=(\lambda_{\varepsilon,1},\cdots,\lambda_{\varepsilon,m})$. Plugging this back into the first inequality of (17), and defining $\mu_{0}\in A$ as the Dirac point measure on $(x_{0},\cdots,x_{0})$, we arrive at

$$\sup_{\mu\in A}\int\left(\sum_{\ell=1}^{m}\lambda_{\varepsilon,\ell}g_{\ell}(x)\right)(\mu(\mathrm{d}x)-\mu_{0}(\mathrm{d}x))<\lambda_{\varepsilon,m}\varepsilon. \qquad (18)$$

Since $g_{\varepsilon}(x)=\sum_{\ell=1}^{m}\lambda_{\varepsilon,\ell}g_{\ell}(x)$ is a symmetric function (i.e., invariant to permutations of the input), the LHS of (18) is simply $\max_{x\in[0,1]^{m}}g_{\varepsilon}(x)-g_{\varepsilon}(x_{0},\cdots,x_{0})\geq 0$. Then $\lambda_{\varepsilon,m}>0$, and by multiplying all entries of $\lambda_{\varepsilon}$ by a positive constant, we may assume that $\max_{\ell}|\lambda_{\varepsilon,\ell}|=1$ and $\max_{x\in[0,1]^{m}}g_{\varepsilon}(x)-g_{\varepsilon}(x_{0},\cdots,x_{0})<\varepsilon$. Choosing $\varepsilon_{n}\to 0$, the compactness of $[-1,1]^{m}\ni\lambda_{\varepsilon_{n}}$ implies that some subsequence satisfies $\lambda_{\varepsilon_{n_{k}}}\to\lambda$ as $k\to\infty$. Taking the limit along this subsequence, it is clear that $\lambda$ is a non-zero vector with $\lambda_{m}\geq 0$, and $(x_{0},\cdots,x_{0})$ is a global maximum of the function $g(x)=\sum_{\ell=1}^{m}\lambda_{\ell}g_{\ell}(x)$. ∎

Now it remains to choose a suitable $x_{0}\in[0,1]$ such that the condition in Lemma 5 holds. We choose $x_{0}$ to be any point in $(0,1)$ such that $g^{(\ell)}(bx_{0})\neq 0$ for all $\ell\in[m]$, whose existence is ensured by the following lemma, itself a standard exercise on the Baire category theorem.

Lemma 6 (A slight variant of (Donoghue, 1969), Page 53).

If $g\in C^{m}$ satisfies $g^{(n_{x})}(x)=0$ for some $n_{x}\in[m]$ and every $x\in(0,1)$, then $g$ is a polynomial of degree at most $m-1$ on $(0,1)$.

Now we assume by contradiction that the function $f(x):=\sum_{\ell=1}^{m}\lambda_{\ell}g_{\ell}(x)$ attains its maximum at $x=(x_{0},\cdots,x_{0})$ for some non-zero vector $\lambda$. We will prove by induction on $k=1,2,\cdots,m$ the following claims:

  1. $D^{\alpha}f(x_{0},\cdots,x_{0})=0$ for all multi-indices $\alpha=(\alpha_{1},\cdots,\alpha_{m})\in\mathbb{N}^{m}$ with $|\alpha|:=\sum_{\ell=1}^{m}\alpha_{\ell}=k$;

  2. $\sum_{\ell=1}^{m}\lambda_{\ell}\cdot\binom{m-k}{\ell-k}/\binom{m}{\ell}=0$, with $\binom{0}{0}:=1$ and $\binom{a}{b}:=0$ for $b<0$.

For the base case $k=1$, the first claim is exactly the first-order condition for the maximum, and the second claim follows from the first claim, the identity $\nabla g_{\ell}(x_{0},\cdots,x_{0})=g^{\prime}(bx_{0})\boldsymbol{1}\cdot\binom{m-1}{\ell-1}/\binom{m}{\ell}$, and our choice of $x_{0}$ ensuring $g^{\prime}(bx_{0})\neq 0$. Now suppose that these claims hold for $1,2,\cdots,k-1$. Then for $|\alpha|=k$ with $I(\alpha):=\sum_{\ell=1}^{m}\boldsymbol{1}(\alpha_{\ell}>0)\leq k-1$, simple algebra gives that

$$D^{\alpha}f(x_{0},\cdots,x_{0})=g^{(k)}(bx_{0})\cdot\sum_{\ell=1}^{m}\lambda_{\ell}\frac{\binom{m-I(\alpha)}{\ell-I(\alpha)}}{\binom{m}{\ell}}=0,$$

where the last identity is due to the inductive hypothesis for $k^{\prime}=I(\alpha)<k$. Therefore, for the first claim it remains to consider multi-indices $\alpha$ with $|\alpha|=I(\alpha)=k$. By the inductive hypothesis and the Taylor expansion for multivariate functions, for $\delta>0$ sufficiently small and every binary vector $(\varepsilon_{1},\cdots,\varepsilon_{m})\in\{\pm 1\}^{m}$, it holds that

$$f(x_{0}+\delta\varepsilon_{1},\cdots,x_{0}+\delta\varepsilon_{m})=f(x_{0},\cdots,x_{0})+g^{(k)}(bx_{0})\sum_{\ell=1}^{m}\lambda_{\ell}\frac{\binom{m-k}{\ell-k}}{\binom{m}{\ell}}\,\delta^{k}\sum_{S\subseteq[m]:|S|=k}\prod_{i\in S}\varepsilon_{i}+o(\delta^{k}).$$

As $\mathbb{E}[\sum_{S\subseteq[m]:|S|=k}\prod_{i\in S}\varepsilon_{i}]=0$ when $\varepsilon_{1},\cdots,\varepsilon_{m}$ are Rademacher random variables, and the term inside the expectation is not identically zero, this term can be made either positive or negative by carefully choosing $(\varepsilon_{1},\cdots,\varepsilon_{m})$. As a result, in order for $(x_{0},\cdots,x_{0})$ to be a maximum of $f$, the above expansion implies that $\sum_{\ell=1}^{m}\lambda_{\ell}\binom{m-k}{\ell-k}/\binom{m}{\ell}=0$ (recall that $g^{(k)}(bx_{0})\neq 0$), which is exactly the second claim for $k$. The remainder of the first claim also follows from the second claim, and the induction is complete.

Finally, we derive a contradiction from the above result, thereby showing that such a non-zero vector $\lambda$ cannot exist. In fact, the second claim of the inductive result for $k=1,2,\cdots,m$ constitutes a linear system $A\lambda=0$ for the vector $\lambda$, where $A$ is an upper triangular matrix with non-zero diagonal entries. Therefore, we have $\lambda=0$, a contradiction.
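As a quick numerical sanity check of this last step, the sketch below builds the matrix $A$ with entries $A_{k\ell}=\binom{m-k}{\ell-k}/\binom{m}{\ell}$ for $k,\ell\in[m]$, with an illustrative $m$, and confirms that it is upper triangular with non-zero diagonal, hence invertible:

```python
import math
import numpy as np

m = 6  # illustrative value

def entry(k, ell):
    """A_{k, ell} = binom(m - k, ell - k) / binom(m, ell); binom(a, b) = 0 for b < 0."""
    return 0.0 if ell < k else math.comb(m - k, ell - k) / math.comb(m, ell)

A = np.array([[entry(k, ell) for ell in range(1, m + 1)] for k in range(1, m + 1)])
assert np.allclose(A, np.triu(A))       # upper triangular
assert np.all(np.diag(A) > 0)           # diagonal entries 1 / binom(m, k) are non-zero
print("A is invertible, so A @ lam = 0 forces lam = 0")
```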

4 Conclusion and future directions

In this paper we study the adversarial combinatorial bandit problem with a general reward function $g$, with a complete characterization of the minimax regret depending on whether $g$ is a low-degree polynomial. For the most general case when $g$ is not a polynomial, including dynamic assortment optimization under the multinomial logit choice model, our results imply an $\Omega_{K}(\sqrt{N^{K}T})$ regret lower bound, showing that it is not possible for any bandit algorithm to exploit the inherent correlation among subsets/assortments. We believe it is a promising future research direction to study models that interpolate between the stochastic and the adversarial settings, in order to achieve intermediate regret bounds.

When $g$ is a general non-linear function, the adversarial construction in our lower bound proof can be interpreted as a latent variable model: first sample $v_{t}\sim\mu$, and then sample $r_{t}\sim\mathbb{P}(\cdot|S_{t},v_{t})$. To foster identifiability, one common assumption is to have access to multiple observations $\{r_{t}^{1},\cdots,r_{t}^{M}\}$ conditioned on the same latent variable $v_{t}$, such as the “high dimensionality” assumptions in learning Gaussian mixture models (Ge et al., 2015) and in learning topic models with multiple words per document (Anandkumar et al., 2015). This motivates the following model: at the beginning of each epoch $\tau\in\{1,2,\cdots,T/M\}$, an adaptive adversary chooses $v_{\tau}\in[0,1]^{N}$; afterwards, the bandit algorithm produces subsets $\{S_{\tau}^{1},\cdots,S_{\tau}^{M}\}\subseteq[N]$ and observes feedback $r_{\tau}^{t}\sim\mathbb{P}(\cdot|S_{\tau}^{t},v_{\tau})$ for $t=1,2,\cdots,M$. Clearly, with $M=T$ we recover the stochastic (stationary) setting in which $\{v_{t}\}$ do not change, and with $M=1$ we recover the adversarial setting in which $v_{t}$ is different in each time period. An intermediate value $1<M<T$ is likely to result in an interesting interpolation between the two settings, and we leave this as an interesting future research question.

References

  • Agarwal & Aggarwal (2018) Agarwal, M. and Aggarwal, V. Regret bounds for stochastic combinatorial multi-armed bandits with linear space complexity. arXiv preprint arXiv:1811.11925, 2018.
  • Agrawal et al. (2017) Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. Thompson sampling for the MNL-bandit. arXiv preprint arXiv:1706.00977, 2017.
  • Agrawal et al. (2019) Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. MNL-bandit: A dynamic learning approach to assortment selection. Operations Research, 67(5):1453–1485, 2019.
  • Anandkumar et al. (2015) Anandkumar, A., Foster, D. P., Hsu, D., Kakade, S. M., and Liu, Y.-K. A spectral algorithm for latent Dirichlet allocation. Algorithmica, 72(1):193–214, 2015.
  • Audibert & Bubeck (2009) Audibert, J.-Y. and Bubeck, S. Minimax policies for adversarial and stochastic bandits. In COLT, pp.  217–226, 2009.
  • Audibert et al. (2014) Audibert, J.-Y., Bubeck, S., and Lugosi, G. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2014.
  • Auer et al. (1995) Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE Annual Conference on Foundations of Computer Science (FOCS), pp.  322–331. IEEE, 1995.
  • Bubeck et al. (2012) Bubeck, S., Cesa-Bianchi, N., and Kakade, S. M. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, pp.  41–1. JMLR Workshop and Conference Proceedings, 2012.
  • Cesa-Bianchi & Lugosi (2012) Cesa-Bianchi, N. and Lugosi, G. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
  • Chen et al. (2013) Chen, W., Wang, Y., and Yuan, Y. Combinatorial multi-armed bandit: General framework and applications. In Proceedings of the International Conference on Machine Learning (ICML), pp.  151–159, 2013.
  • Chen et al. (2016a) Chen, W., Hu, W., Li, F., Li, J., Liu, Y., and Lu, P. Combinatorial multi-armed bandit with general reward functions. Advances in Neural Information Processing Systems (NIPS), 29:1659–1667, 2016a.
  • Chen et al. (2016b) Chen, W., Wang, Y., Yuan, Y., and Wang, Q. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. Journal of Machine Learning Research, 17(1):1746–1778, 2016b.
  • Chen & Wang (2018) Chen, X. and Wang, Y. A note on a tight lower bound for capacitated MNL-bandit assortment selection models. Operations Research Letters, 46(5):534–537, 2018.
  • Chen et al. (2018a) Chen, X., Wang, Y., and Zhou, Y. Dynamic assortment selection under the nested logit models. Production and Operations Management (to appear), 2018a.
  • Chen et al. (2018b) Chen, X., Wang, Y., and Zhou, Y. An optimal policy for dynamic assortment planning under uncapacitated multinomial logit models. Mathematics of Operations Research (to appear), 2018b.
  • Chen et al. (2019) Chen, X., Krishnamurthy, A., and Wang, Y. Robust dynamic assortment optimization in the presence of outlier customers. arXiv preprint arXiv:1910.04183, 2019.
  • Combes et al. (2015) Combes, R., Talebi Mazraeh Shahi, M. S., Proutiere, A., et al. Combinatorial bandits revisited. Advances in neural information processing systems, 28:2116–2124, 2015.
  • Donoghue (1969) Donoghue, W. F. Distributions and Fourier transforms. Academic Press, 1969.
  • Dudley (2018) Dudley, R. M. Real analysis and probability. CRC Press, 2018.
  • Ge et al. (2015) Ge, R., Huang, Q., and Kakade, S. M. Learning mixtures of Gaussians in high dimensions. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing (STOC), pp. 761–770, 2015.
  • Kuroki et al. (2020) Kuroki, Y., Xu, L., Miyauchi, A., Honda, J., and Sugiyama, M. Polynomial-time algorithms for multiple-arm identification with full-bandit feedback. Neural Computation, 32(9):1733–1773, 2020.
  • Kveton et al. (2015) Kveton, B., Wen, Z., Ashkan, A., and Szepesvari, C. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, pp.  535–543, 2015.
  • Merlis & Mannor (2019) Merlis, N. and Mannor, S. Batch-size independent regret bounds for the combinatorial multi-armed bandit problem. arXiv preprint arXiv:1905.03125, 2019.
  • Merlis & Mannor (2020) Merlis, N. and Mannor, S. Tight lower bounds for combinatorial multi-armed bandits. arXiv preprint arXiv:2002.05392, 2020.
  • Rejwan & Mansour (2020) Rejwan, I. and Mansour, Y. Top-$k$ combinatorial bandits with full-bandit feedback. In Algorithmic Learning Theory, pp. 752–776. PMLR, 2020.
  • Rudin (1991) Rudin, W. Functional analysis. Mcgraw-Hill Inc, New York, 1991.
  • Rusmevichientong et al. (2010) Rusmevichientong, P., Shen, Z.-J. M., and Shmoys, D. B. Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations Research, 58(6):1666–1680, 2010.
  • Train (2009) Train, K. Discrete choice methods with simulation. Cambridge University Press, 2nd edition, 2009.
  • Wang & Chen (2018) Wang, S. and Chen, W. Thompson sampling for combinatorial semi-bandits. arXiv preprint arXiv:1803.04623, 2018.