
Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions

Kei Takemura,1 Shinji Ito,1 Daisuke Hatano,2 Hanna Sumita,3 Takuro Fukunaga,4,2,5
Naonori Kakimura,6 Ken-ichi Kawarabayashi7
Abstract

The contextual combinatorial semi-bandit problem with linear payoff functions is a decision-making problem in which a learner chooses, in each round, a set of arms with feature vectors under given constraints so as to maximize the sum of the rewards of the arms. Several existing algorithms have regret bounds that are optimal with respect to the number of rounds $T$. However, there is a gap of $\tilde{O}(\max(\sqrt{d},\sqrt{k}))$ between the current best upper and lower bounds, where $d$ is the dimension of the feature vectors, $k$ is the number of arms chosen in a round, and $\tilde{O}(\cdot)$ ignores logarithmic factors. The dependence on $k$ and $d$ is of practical importance because $k$ may be larger than $T$ in real-world applications such as recommender systems. In this paper, we fill the gap by improving the upper and lower bounds. More precisely, we show that the C2UCB algorithm proposed by Qin, Chen, and Zhu (2014) has the optimal regret bound $\tilde{O}(d\sqrt{kT}+dk)$ for partition matroid constraints. For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C2UCB algorithm and demonstrate that it enjoys the optimal regret bound for a more general problem that can take other objectives into account simultaneously. We also show that our technique would be applicable to related problems. Numerical experiments support our theoretical results and considerations.

Introduction

               | Upper bound | Lower bound
The best known | $\tilde{O}(\max(\sqrt{d},\sqrt{k})\sqrt{dkT})$ (Qin, Chen, and Zhu 2014; Takemura and Ito 2019) | $\Omega(\min(\sqrt{dkT},kT))$ (Kveton et al. 2015)
This work      | $\tilde{O}(d\sqrt{kT}+dk)$ | $\Omega(\min(d\sqrt{kT}+dk,kT))$

Table 1: Regret bounds for the CCS problem ($\tilde{O}(\cdot)$ ignores logarithmic factors).

This paper investigates the contextual combinatorial semi-bandit problem with linear payoff functions, which we call the CCS problem (Qin, Chen, and Zhu 2014; Takemura and Ito 2019; Wen, Kveton, and Ashkan 2015). In this problem, a learner iterates the following process $T$ times. First, the learner observes $d$-dimensional vectors, called arms, and a set of feasible combinations of arms, where the size of each combination is $k$. Each arm offers a reward defined by a common linear function over the arms, but the reward is not revealed to the learner at this point. Next, the learner chooses a feasible combination of arms. At the end, the learner observes the rewards of the chosen arms. The objective of the learner is to maximize the sum of rewards.

The CCS problem includes the linear bandit (LB) problem (Abbasi-Yadkori, Pál, and Szepesvári 2011; Agrawal and Goyal 2013; Auer 2002; Chu et al. 2011; Dani, Hayes, and Kakade 2008) and the combinatorial semi-bandit (CS) problem (Chen et al. 2016a, b; Combes et al. 2015; Gai, Krishnamachari, and Jain 2012; Kveton et al. 2015; Wang et al. 2017; Wen, Kveton, and Ashkan 2015) as special cases. (Here, the CS problem denotes the problem of maximizing the sum of rewards (Combes et al. 2015; Kveton et al. 2014, 2015), while Chen et al. (2016a, b) deal with a more general objective.) The difference from the LB problem is that, in the CCS problem, the learner chooses multiple arms at once. Moreover, while the given arms are fixed over the rounds and orthogonal to each other in the CS problem, they may change in each round and be correlated in the CCS problem.

These differences enable the CCS problem to model more realistic situations in applications such as routing networks (Kveton et al. 2014), shortest paths (Gai, Krishnamachari, and Jain 2012; Wen, Kveton, and Ashkan 2015), and recommender systems (Li et al. 2010; Qin, Chen, and Zhu 2014; Wang et al. 2017). For example, when a recommender system is modeled with the LB problem, it is assumed that once a recommendation result is obtained, the internal predictive model is updated before the next recommendation. However, in a real recommender system, it is more common to update the predictive model after multiple recommendations, e.g., by periodic updates (Chapelle and Li 2011). Such a situation can be modeled with the CCS problem, where the number of recommendations between updates is $k$ and the number of updates is $T$ (Takemura and Ito 2019). (Strictly speaking, the LB problem with periodic updates is a little more restrictive than the CCS problem; however, most algorithms for the CCS problem, including the ones proposed in this paper, are applicable to it.)

As in numerous previous studies on bandit algorithms, we measure the performance of an algorithm by its regret, which is the difference between the sum of the rewards of the optimal choices and that of the algorithm's choices. The existing regret bounds are summarized in Table 1, where $\tilde{O}(\cdot)$ means that logarithmic factors are ignored. The best known upper bound on the regret is achieved by the C2UCB algorithm, which is given by Qin, Chen, and Zhu (2014). Takemura and Ito (2019) refined their analysis to improve the dependence on other parameters in the regret bound. The best lower bound is given for the CS problem by Kveton et al. (2015). Note that any lower bound for the CS problem is also a lower bound for the CCS problem, as the CCS problem covers the CS problem.

Although these regret upper and lower bounds match with respect to $T$, there is a gap of $\tilde{O}(\max(\sqrt{d},\sqrt{k}))$ between them. In the literature on regret analysis, the degree of dependence on $T$ in the regret bound usually draws much attention. However, for the CCS problem, the degree of dependence on $k$ is also important because there are real-world applications of the CCS problem in which $k$ is large. In recommender systems with periodic updates, for example, the number of recommendations between updates could be large. Another example is the promotion-sending problem, in which the number of users to whom a promotion is sent at once is much larger than the number of times the promotion is sent, i.e., $k\gg T$ (Takemura and Ito 2019).

Our contribution is two-fold. First, we improve the dependence on $d$ and $k$ in both the regret upper and lower bounds; our upper and lower bounds match up to logarithmic factors. Second, we clarify a drawback of UCB-type algorithms for other related problems and propose general techniques to overcome it.

To improve the upper bound for the CCS problem, we first revisit the C2UCB algorithm. This algorithm optimistically estimates the rewards of arms using confidence intervals of the estimates and then chooses a set of arms based on the optimistic estimates. Existing upper bounds have a $k\sqrt{T}$ factor, which leads to the gap from the lower bound. In our analysis, however, we reveal that the linear dependence on $k$ in the regret comes from the arms with large confidence intervals, and we obtain $\tilde{O}(d\sqrt{kT}+dk^2)$ regret by handling such arms separately. For further improvement, we focus on the case where the feasible combinations of arms are given by partition matroids and show that the algorithm has the optimal regret bound in this case. Unfortunately, this analysis does not extend to general constraints, and we do not know whether the C2UCB algorithm achieves the optimal regret upper bound in general. Instead, based on these analyses, we propose another algorithm that estimates the rewards of arms with large confidence intervals more rigorously; the algorithm divides the given arms into two groups based on their confidence intervals and underestimates the rewards of the arms with large confidence intervals. We show that the proposed algorithm enjoys the optimal regret bound for the CCS problem with any feasible combinations of arms, and that it is also optimal for a more general problem that can take into account both the sum of rewards and other objectives. For example, recommender systems often require diversity of the recommended items (Qin and Zhu 2013; Qin, Chen, and Zhu 2014).

We support our theoretical analysis through numerical experiments. We first evaluate the performance of the algorithms on instances in which constraints are not represented by the partition matroid. We observe that the proposed algorithm is superior to the C2UCB algorithm on these instances, which confirms our theoretical analysis that the C2UCB algorithm may not achieve the optimal regret bound while our proposed algorithm does. We also evaluate the algorithms on instances with partition matroid constraints. For these instances, we observe that the C2UCB and our proposed algorithms perform similarly.

Our theoretical and numerical analyses indicate that the sub-optimality of the C2UCB algorithm arises from the combinatorial structure of the CCS problem, i.e., choosing a set of arms in each round. More precisely, the existence of an arm with a confidence interval that is too large makes the algorithm choose a bad set of arms. This is an interesting phenomenon that does not occur in the LB problem (the CCS problem with $k=1$) or in the case of partition matroid constraints. Since the technique we propose for the CCS problem is so general that it is independent of the linearity of the payoff functions, we believe it could be generalized to overcome the same issue in other semi-bandit problems.

Problem Setting

In this section, we present the formal definition of the CCS problem and the required assumptions. The CCS problem consists of $T$ rounds. Let $N$ denote the number of arms; each arm is indexed by an integer in $[N]:=\{1,2,\ldots,N\}$. We denote by $S_t$ the set of combinations of arms that can be chosen in the $t$-th round. We assume that each combination has size $k$; thus, $S_t\subseteq\{I\subseteq[N]\mid|I|=k\}$.

The learner progresses through each round as follows. At the beginning of the $t$-th round, the learner observes the set of arms with the associated feature vectors $\{x_t(i)\}_{i\in[N]}\subseteq\mathbb{R}^d$ and the set of combinations of arms $S_t$. Then, the learner chooses $I_t\in S_t$. At the end of the round, the learner obtains the rewards $\{r_t(i)\}_{i\in I_t}$, where for all $i\in I_t$, $r_t(i)={\theta^*}^\top x_t(i)+\eta_t(i)$ for some $\theta^*\in\mathbb{R}^d$, and $\eta_t(i)\in\mathbb{R}$ is a random noise with zero mean.

We evaluate the performance of an algorithm by the expected regret $R(T)$, which is defined as

$$R(T)=\sum_{t=1}^{T}\left(\sum_{i\in I^*_t}{\theta^*}^\top x_t(i)-\sum_{i\in I_t}{\theta^*}^\top x_t(i)\right),$$

where $I^*_t=\operatorname*{argmax}_{I\in S_t}\sum_{i\in I}{\theta^*}^\top x_t(i)$.

As in previous work (Qin, Chen, and Zhu 2014; Takemura and Ito 2019), we assume the following:

Assumption 1.

For all $t\in[T]$ and $i\in I_t$, the random noise $\eta_t(i)$ is conditionally $R$-sub-Gaussian, i.e.,

$$\forall\lambda\in\mathbb{R},\quad\mathbb{E}\left[\exp(\lambda\eta_t(i))\mid\mathcal{F}_t\right]\leq\exp\left(\lambda^2R^2/2\right),$$

where $\mathcal{F}_t=\sigma\left(\{\{x_s(j)\}_{j\in I_s}\}_{s\in[t]},\{\{\eta_s(j)\}_{j\in I_s}\}_{s\in[t-1]}\right)$.

In addition, we define the following parameters of the CCS problem: (i) $L>0$ such that $\|x_t(i)\|_2\leq L$ for all $i\in[N]$ and $t\in[T]$; (ii) $S>0$ such that $\|\theta^*\|_2\leq S$; and (iii) $B>0$ such that $|{\theta^*}^\top x_t(i)|\leq B$ for all $i\in[N]$ and $t\in[T]$. Note that $LS$ is an obvious upper bound on $B$.
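To make the setting concrete, the round protocol above can be simulated directly. The following sketch instantiates one round with small illustrative values ($d=3$, $N=6$, $k=2$), a cardinality constraint, and Gaussian noise; all parameter choices are hypothetical.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
d, N, k, R = 3, 6, 2, 0.1

theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)       # so S = ||theta*||_2 = 1
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # so L = max_i ||x_t(i)||_2 = 1

mean_rewards = X @ theta_star                  # theta*^T x_t(i) for each arm
B = np.abs(mean_rewards).max()                 # B <= LS by Cauchy-Schwarz

# S_t: all size-k subsets of [N] (the cardinality constraint).
S_t = list(itertools.combinations(range(N), k))
I_opt = max(S_t, key=lambda I: mean_rewards[list(I)].sum())

# The learner picks some feasible I_t; only those rewards are revealed,
# perturbed by R-sub-Gaussian (here Gaussian) noise.
I_t = S_t[0]
observed = mean_rewards[list(I_t)] + rng.normal(scale=R, size=k)

# This round's contribution to R(T) uses mean rewards, not the noisy ones.
round_regret = mean_rewards[list(I_opt)].sum() - mean_rewards[list(I_t)].sum()
```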

Regret Analysis of the C2UCB Algorithm

Algorithm 1 C2UCB (Qin, Chen, and Zhu 2014)
1: Input: $\lambda>0$ and $\{\alpha_t\}_{t\in[T]}$ s.t. $\alpha_t>0$ for all $t\in[T]$.
2: $V_0\leftarrow\lambda I$ and $b_0\leftarrow\mathbf{0}$.
3: for $t=1,2,\dots,T$ do
4:   Observe $\{x_t(i)\}_{i\in[N]}$ and $S_t$.
5:   $\hat{\theta}_t\leftarrow V_{t-1}^{-1}b_{t-1}$.
6:   for $i\in[N]$ do
7:     $\hat{r}_t(i)\leftarrow\hat{\theta}_t^\top x_t(i)+\alpha_t\sqrt{x_t(i)^\top V_{t-1}^{-1}x_t(i)}$.
8:   end for
9:   Play a set of arms $I_t=\operatorname*{argmax}_{I\in S_t}\sum_{i\in I}\hat{r}_t(i)$ and observe rewards $\{r_t(i)\}_{i\in I_t}$.
10:  $V_t\leftarrow V_{t-1}+\sum_{i\in I_t}x_t(i)x_t(i)^\top$ and $b_t\leftarrow b_{t-1}+\sum_{i\in I_t}r_t(i)x_t(i)$.
11: end for
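Algorithm 1 can be sketched in a few lines of NumPy. The version below is a minimal, illustrative implementation restricted to the cardinality constraint $|I|=k$, for which the argmax on line 9 reduces to taking the top-$k$ optimistic estimates; it uses a fixed $\alpha_t=\alpha$ and an explicit matrix inverse rather than rank-one updates, and all names are ours, not the paper's.

```python
import numpy as np

def c2ucb(contexts, rewards_fn, k, lam=1.0, alpha=1.0):
    """contexts: list of (N, d) arrays, one per round.
    rewards_fn(t, I): returns the (noisy) rewards of arms I in round t."""
    d = contexts[0].shape[1]
    V = lam * np.eye(d)                          # V_0 = lambda * I
    b = np.zeros(d)                              # b_0 = 0
    chosen = []
    for t, X in enumerate(contexts):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b                    # line 5
        widths = np.sqrt(np.einsum("ij,jk,ik->i", X, V_inv, X))
        r_hat = X @ theta_hat + alpha * widths   # line 7
        I_t = np.argsort(r_hat)[-k:]             # line 9: top-k for |I| = k
        r_t = np.asarray(rewards_fn(t, I_t))
        V += X[I_t].T @ X[I_t]                   # line 10
        b += X[I_t].T @ r_t
        chosen.append(I_t)
    return chosen

# Usage: three orthogonal arms with noiseless linear rewards; once the
# confidence widths shrink, the algorithm settles on the best arm (index 0).
X = np.eye(3)
theta_star = np.array([1.0, 0.5, -1.0])
chosen = c2ucb([X] * 30, lambda t, I: X[I] @ theta_star, k=1)
```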

Existing Analyses

Qin, Chen, and Zhu (2014) proposed the C2UCB algorithm (Algorithm 1), which chooses a set of arms based on optimistically estimated rewards in a similar way to other UCB-type algorithms (Auer 2002; Chen et al. 2016b; Chu et al. 2011; Li et al. 2010).

The C2UCB algorithm works as follows. At the beginning of each round, it constructs an estimator of $\theta^*$ using the arms chosen so far and their rewards (line 5). It then computes an optimistic reward estimate $\hat{r}_t(i)$ for each observed arm $i$ (line 7), where $\alpha_t\sqrt{x_t(i)^\top V_{t-1}^{-1}x_t(i)}$ represents the size of the confidence interval of the estimated reward of arm $i$. Then, it chooses the set of arms $I_t$ obtained by solving the optimization problem based on $\{\hat{r}_t(i)\}_{i\in[N]}$ (line 9). Finally, it observes the rewards of the chosen arms and updates the internal parameters $b_t$ and $V_t$ (line 10).

Qin, Chen, and Zhu (2014) showed that the algorithm admits a sublinear regret bound with respect to $T$. Takemura and Ito (2019) refined their analysis to improve the dependence on $R$, $S$, and $L$ as follows. Here, for $\delta\in(0,1)$, we define $\beta_t(\delta)=R\sqrt{d\log\left(\frac{1+L^2kt/\lambda}{\delta}\right)}+S\sqrt{\lambda}$.

Theorem 1 (Theorem 4 of Takemura and Ito (2019); the original versions of Theorem 1 and Lemma 1 assume $L=1$, but these results can be obtained by scaling $x_t(i)$ and $\lambda$ without modifying the proof).

If $\alpha_t=\beta_t(\delta)$ and $\lambda=R^2S^{-2}d$, the C2UCB algorithm has the following regret bound with probability $1-\delta$:

$$\tilde{O}\left(Rd\sqrt{kT}\right)\quad\text{if }\lambda\geq L^2k,\qquad\tilde{O}\left(LSk\sqrt{dT}\right)\quad\text{otherwise}.$$

To prove Theorem 1, it suffices to bound the cumulative estimation error of the rewards, i.e., $\sum_{t\in[T]}\sum_{i\in I_t}(\theta^*-\hat{\theta}_t)^\top x_t(i)$. Let $\|x_t(i)\|_{V_{t-1}^{-1}}$ denote $\sqrt{x_t(i)^\top V_{t-1}^{-1}x_t(i)}$ for all $i\in[N]$ and $t\in[T]$. To bound the error, Takemura and Ito (2019) showed that

$$\sum_{t\in[T]}\sum_{i\in I_t}(\theta^*-\hat{\theta}_t)^\top x_t(i)\leq\beta_T(\delta)\sum_{t\in[T]}\sum_{i\in I_t}\|x_t(i)\|_{V_{t-1}^{-1}}. \qquad (1)$$

The right-hand side is then bounded by the following lemma:

Lemma 1 (Lemma 5 of Takemura and Ito (2019)).

Let $\lambda>0$, and let $\{\{x_t(i)\}_{i\in[k]}\}_{t\in[T]}$ be any sequence such that $x_t(i)\in\mathbb{R}^d$ and $\|x_t(i)\|_2\leq L$ for all $i\in[k]$ and $t\in[T]$. Let $V_t=\lambda I+\sum_{t'\in[t]}\sum_{i\in[k]}x_{t'}(i)x_{t'}(i)^\top$ for all $t\in[T]$. Then, we have

$$\sum_{t\in[T]}\sum_{i\in[k]}\|x_t(i)\|_{V_{t-1}^{-1}}=\tilde{O}\left(L\sqrt{dk^2T/\lambda}\right).$$

This bound is tight up to logarithmic factors because we have $\sum_{t\in[T]}\sum_{i\in[k]}\|x_t(i)\|_{V_{t-1}^{-1}}=Ldk/\sqrt{\lambda}$ when $T=d$ and $x_t(i)=Le_t$ for all $i\in[k]$ and $t\in[T]$, where for all $l\in[d]$, $e_l\in\mathbb{R}^d$ is the vector whose $l$-th element is 1 and whose other elements are 0.
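The tightness example can be checked numerically. The sketch below builds the stated worst case ($T=d$, $x_t(i)=Le_t$) for small illustrative values of $d$, $k$, $L$, and $\lambda$, and confirms that the sum equals $Ldk/\sqrt{\lambda}$ exactly: since $e_t$ has not been observed before round $t$, each width equals $L/\sqrt{\lambda}$, so the total is $Tk\cdot L/\sqrt{\lambda}=Ldk/\sqrt{\lambda}$.

```python
import numpy as np

d, k, L, lam = 5, 3, 2.0, 4.0
T = d
V = lam * np.eye(d)                 # V_0 = lambda * I
total = 0.0
for t in range(T):
    x = L * np.eye(d)[t]            # x_t(i) = L * e_t for every i in [k]
    V_inv = np.linalg.inv(V)
    w = np.sqrt(x @ V_inv @ x)      # ||x_t(i)||_{V_{t-1}^{-1}}, same for all i
    total += k * w
    V += k * np.outer(x, x)         # V_t = V_{t-1} + sum_i x_t(i) x_t(i)^T
```

With these values the total is $2\cdot 5\cdot 3/\sqrt{4}=15$.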

Improved Regret Bound

In this subsection, we improve the regret bound of the C2UCB algorithm. A key observation of our analysis is that Lemma 1 is not tight for sufficiently large $T$. To improve Lemma 1, we divide $\{\{x_t(i)\}_{i\in[k]}\}_{t\in[T]}$ into two groups: the family of $x_t(i)$ such that $\|x_t(i)\|_{V_{t-1}^{-1}}\leq 1/\sqrt{k}$, and the others. As shown in Lemma 2 below, the sum of $\|x_t(i)\|_{V_{t-1}^{-1}}$ over the former group is $\tilde{O}(\sqrt{dkT})$, which is smaller than the bound in Lemma 1. Moreover, the number of arms in the latter group is shown to be $\tilde{O}(dk)$, which means that not many arms $x_t(i)$ have large $\|x_t(i)\|_{V_{t-1}^{-1}}$.

Lemma 2.

Let $\lambda>0$, and let $\{\{x_t(i)\}_{i\in[k]}\}_{t\in[T]}$ be any sequence such that $x_t(i)\in\mathbb{R}^d$ and $\|x_t(i)\|_2\leq L$ for all $i\in[k]$ and $t\in[T]$. Let $V_t=\lambda I+\sum_{t'\in[t]}\sum_{i\in[k]}x_{t'}(i)x_{t'}(i)^\top$ for all $t\in[T]$. Then, we have

$$\sum_{t\in[T]}\sum_{i\in[k]}\min\left(\frac{1}{\sqrt{k}},\|x_t(i)\|_{V_{t-1}^{-1}}\right)=\tilde{O}\left(\sqrt{dkT}\right) \qquad (2)$$

and

$$\sum_{t\in[T]}\sum_{i\in[k]}\mathds{1}\left(\|x_t(i)\|_{V_{t-1}^{-1}}>1/\sqrt{k}\right)=\tilde{O}(dk). \qquad (3)$$
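On the worst-case instance used for Lemma 1 ($T=d$, $x_t(i)=Le_t$), both quantities in Lemma 2 can be evaluated exactly: with $\lambda<L^2k$ every width exceeds the threshold $1/\sqrt{k}$, so the indicator sum attains $dk$, while the capped sum is only $d\sqrt{k}=\sqrt{dkT}$. The sketch below checks this for illustrative values.

```python
import numpy as np

d, k, L, lam = 5, 4, 1.0, 1.0       # lam < L^2 k: every width exceeds 1/sqrt(k)
V = lam * np.eye(d)
count, capped = 0, 0.0
for t in range(d):                   # T = d rounds with x_t(i) = L * e_t
    x = L * np.eye(d)[t]
    w = np.sqrt(x @ np.linalg.inv(V) @ x)
    for _ in range(k):
        capped += min(1.0 / np.sqrt(k), w)   # term of the capped sum (2)
        count += int(w > 1.0 / np.sqrt(k))   # term of the indicator sum (3)
    V += k * np.outer(x, x)
```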

Based on Lemma 2, we can bound the right-hand side of (1) to obtain a better regret upper bound. The regret bound given by the following theorem is optimal when $LS=B$.

Theorem 2.

If $\alpha_t=\beta_t(\delta)$ and $\lambda=R^2S^{-2}d$, the C2UCB algorithm has the following regret bound with probability $1-\delta$:

$$R(T)=\tilde{O}\left(Rd\sqrt{kT}+\min\left(LS,Bk\right)dk\right).$$
Proof sketch.

Let $J_t=\{i\in[N]\mid\|x_t(i)\|_{V_{t-1}^{-1}}>1/\sqrt{k}\}$ and $J'_t=I_t\cap J_t$. We separate the chosen arms into two groups, $\{J'_t\}_{t\in[T]}$ and the remaining arms. (To show the regret bound of the LinUCB algorithm (Chu et al. 2011; Li et al. 2010), i.e., the C2UCB algorithm for the case $k=1$, Lattimore and Szepesvári (2020) take a similar approach in the note of Exercise 19.3.) For the remaining arms, replacing Lemma 1 with Lemma 2 in the proof of Theorem 1 gives the first term of the regret bound. There are two ways to bound the regret caused by $\{J'_t\}_{t\in[T]}$. In one, we use the same argument as for the remaining arms, which yields $\tilde{O}(LSdk)$. In the other, we use Lemma 2 to bound the number of rounds in which arms of this group are chosen; since the regret in any single round is at most $2Bk$, this yields $\tilde{O}(Bdk^2)$. The second term of the regret bound is obtained by combining these two ways. ∎

Next, we show that Theorem 2 is better than Theorem 1. We first consider the case $\lambda\geq L^2k$. From the definition of $\lambda$, we have $LSk\sqrt{dT}\leq Rd\sqrt{kT}$. Since Theorem 1 implies $\tilde{O}(Rd\sqrt{kT})$ regret in this case, it suffices to compare $LSk\sqrt{dT}$ with $\min(LS,Bk)dk$. If $T<d$, the C2UCB algorithm has the obvious regret upper bound $2BkT$, which satisfies both $\tilde{O}(LSk\sqrt{dT})$ and $\tilde{O}(\min(LS,Bk)dk)$; otherwise, we have $LSdk\leq LSk\sqrt{dT}$. In the other case, Theorem 1 implies $\tilde{O}(LSk\sqrt{dT})$ regret and we have $Rd\sqrt{kT}\leq LSk\sqrt{dT}$, so it again suffices to compare $LSk\sqrt{dT}$ with $\min(LS,Bk)dk$. By the discussion in the first case, we obtain the desired result.

Improved Regret Bound for the CCS Problem with Partition Matroid Constraints

In this subsection, we show that the C2UCB algorithm admits an improved regret upper bound for the CCS problem with partition matroid constraints, which matches the regret lower bound shown in Table 1.

Now we define the partition matroid constraint. Let $\{B_t(j)\}_{j\in[M]}$ be a partition of $[N]$ into $M$ subsets, and let $\{d_t(j)\}_{j\in[M]}$ be a set of $M$ natural numbers. The partition matroid constraint $S_t$ is then defined from $\{B_t(j)\}_{j\in[M]}$ and $\{d_t(j)\}_{j\in[M]}$ as

$$S_t=\left\{I\subseteq[N]\mid|I\cap B_t(j)|=d_t(j),\ \forall j\in[M]\right\}. \qquad (4)$$

Such an $S_t$ is known as the set of bases of a partition matroid. It is also known that linear optimization problems over a partition matroid constraint can be solved by the greedy algorithm. This class of $S_t$ is large enough to include many fundamental classes. Indeed, the CCS problem with these constraints reduces to the CCS problem with the uniform matroid constraint (i.e., the cardinality constraint) when $M=1$ and $d_t(1)=k$ for all $t\in[T]$, and to the LB problem with periodic updates when $M=k$ and $d_t(j)=1$ for all $j\in[M]$ and $t\in[T]$.
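Because each block imposes an independent quota, the greedy algorithm mentioned above amounts to per-block selection of the highest-reward arms. A minimal sketch (with hypothetical names) for maximizing $\sum_{i\in I}\hat{r}(i)$ over the bases (4):

```python
def greedy_partition_matroid(r_hat, blocks, quotas):
    """r_hat: dict mapping arm -> estimated reward; blocks: disjoint lists
    B_t(j) partitioning the arms; quotas: the corresponding d_t(j)."""
    chosen = []
    for block, q in zip(blocks, quotas):
        # Within each block, a linear objective is maximized by keeping
        # the q arms with the largest estimated rewards.
        chosen.extend(sorted(block, key=lambda i: r_hat[i], reverse=True)[:q])
    return chosen

# Usage: two blocks with quotas (2, 1).
r_hat = {0: 0.1, 1: 0.9, 2: 0.5, 3: 0.3, 4: 0.7}
I = greedy_partition_matroid(r_hat, [[0, 1, 2], [3, 4]], [2, 1])
```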

We show that the C2UCB algorithm achieves the optimal regret bound for the CCS problem with constraints satisfying (4):

Theorem 3.

Assume that $S_t$ is defined by (4) for all $t\in[T]$. Then, if $\alpha_t=\beta_t(\delta)$ and $\lambda=R^2S^{-2}d$, the C2UCB algorithm has the following regret bound with probability $1-\delta$:

$$R(T)=\tilde{O}\left(Rd\sqrt{kT}+Bdk\right).$$
Proof sketch.

Recall that $I_t$ is the set of arms chosen by the C2UCB algorithm in the $t$-th round. Let $J_t=\{i\in[N]\mid\|x_t(i)\|_{V_{t-1}^{-1}}^2>1/k\}$ and $J'_t=I_t\cap J_t$. As in the proof of Theorem 2, we separate the chosen arms into two groups: $I_t\setminus J'_t$ and $J'_t$. From the definition of $I_t$ and $J'_t$, we obtain $I_t\setminus J'_t=\operatorname*{argmax}_{I\in S'_t}\sum_{i\in I}\hat{r}_t(i)$ for all $t\in[T]$, where

$$S'_t=\left\{I\subseteq[N]\setminus J'_t\mid\forall j\in[M],\ |I\cap B_t(j)|=d_t(j)-|B_t(j)\cap J'_t|\right\}.$$

Let $J^*_t$ be a subset of $I^*_t$ that consists of the arms in $I^*_t\cap J'_t$, together with $|B_t(j)\cap J'_t|-|I^*_t\cap J'_t\cap B_t(j)|$ arms chosen arbitrarily from $I^*_t\cap B_t(j)$ for each $j\in[M]$. Then, $I^*_t\setminus J^*_t\in S'_t$ and $|J^*_t|=|J'_t|$ for all $t\in[T]$. Similarly to $I_t$, we divide $I^*_t$ into $I^*_t\setminus J^*_t$ and $J^*_t$. This gives

$$R(T)=\sum_{t\in[T]}\left(\sum_{i\in I^*_t\setminus J^*_t}{\theta^*}^\top x_t(i)-\sum_{i\in I_t\setminus J'_t}{\theta^*}^\top x_t(i)\right)+\sum_{t\in[T]}\left(\sum_{i\in J^*_t}{\theta^*}^\top x_t(i)-\sum_{i\in J'_t}{\theta^*}^\top x_t(i)\right).$$

The former term on the right-hand side of this equation is $\tilde{O}(Rd\sqrt{kT})$ by the optimality of $I_t\setminus J'_t$ and the discussion in the proof of Theorem 2. The latter term is $\tilde{O}(Bdk)$ by the definition of $B$ and Lemma 2. ∎

Note that for the LB problem with periodic updates, the C2UCB algorithm reduces to the LinUCB algorithm (Chu et al. 2011; Li et al. 2010) with periodic updates, and thus has the optimal regret bound. Note also that a similar result can be shown for related problems whenever we have a UCB-type algorithm and an upper bound on the number of chosen arms with large confidence bounds.

Proposed Algorithm

In this section, we propose an algorithm for a more general problem than the CCS problem and show that it achieves the optimal regret bound for this general problem.

First, let us define the general CCS problem. Let $X_t=\{x_t(i)\}_{i\in[N]}$ and $r^*_t=\{{\theta^*}^\top x_t(i)\}_{i\in[N]}$ for all $t\in[T]$. In this problem, the learner aims to maximize the sum of values $\sum_{t\in[T]}f_{r^*_t,X_t}(I_t)$ instead of the sum of rewards, where $f_{r^*_t,X_t}(I_t)$ measures the quality of the chosen arms. As in Qin, Chen, and Zhu (2014), we assume that the learner has access to an $\alpha$-approximation oracle $\mathcal{O}_S(r,X)$, which provides $I\in S$ such that $f_{r,X}(I)\geq\alpha\max_{I'\in S}f_{r,X}(I')$ for some $\alpha\in(0,1]$. Thus, we evaluate the performance of an algorithm by the $\alpha$-regret $R^\alpha(T)$, which is defined as

$$R^\alpha(T)=\sum_{t=1}^{T}\left(\alpha f_{r^*_t,X_t}(I^*_t)-f_{r^*_t,X_t}(I_t)\right),$$

where $I^*_t=\operatorname*{argmax}_{I\in S_t}f_{r^*_t,X_t}(I)$. Note that the regret of the CCS problem is recovered if $\alpha=1$ and $f_{r,X}(I)$ is the sum of rewards. We make the following assumptions, which are almost identical to those in Qin, Chen, and Zhu (2014).

Assumption 2.

For all $t\in[T]$ and $I\in S_t$, if a pair of rewards $r$ and $r'$ satisfies $r(i)\leq r'(i)$ for all $i\in[N]$, we have $f_{r,X_t}(I)\leq f_{r',X_t}(I)$.

Assumption 3.

There exists a constant $C>0$ such that for all $t\in[T]$, all $I\in S_t$, and any pair of rewards $r$ and $r'$, we have $|f_{r,X_t}(I)-f_{r',X_t}(I)|\leq C\sum_{i\in I}|r(i)-r'(i)|$.

The class of functions that satisfies these assumptions includes practically useful functions. For example, the sum of rewards with the entropy regularizer (Qin and Zhu 2013; Qin, Chen, and Zhu 2014), which has been applied to recommender systems in order to take into account both the sum of rewards and the diversity of the chosen arms, satisfies the assumptions with $C=1$.

The proposed algorithm is described in Algorithm 2. When $f_{r,X}(I)$ is the sum of rewards, the only difference between the C2UCB algorithm and the proposed algorithm is the definition of $\hat{r}_t(i)$. We now show the effectiveness of this difference. In the analysis of the C2UCB algorithm, the regret can be decomposed as

$$R(T)=\sum_{t\in[T]}\left(\sum_{i\in I^*_t}{\theta^*}^\top x_t(i)-\sum_{i\in I_t}\hat{r}_t(i)\right)+\sum_{t\in[T]}\sum_{i\in I_t}\left(\hat{r}_t(i)-{\theta^*}^\top x_t(i)\right),$$

and the first term can be bounded by 0 since the estimates are optimistic and $I_t$ is an optimal solution to the problem $\max_{I\in S_t}\sum_{i\in I}\hat{r}_t(i)$. Then, the right-hand side is bounded as

$$R(T)\leq\sum_{t\in[T]}\sum_{i\in I_t\setminus J'_t}\left(\hat{r}_t(i)-{\theta^*}^\top x_t(i)\right)+\sum_{t\in[T]}\sum_{i\in J'_t}\left(\hat{r}_t(i)-{\theta^*}^\top x_t(i)\right),$$

where we recall that $J'_t\subseteq I_t$ is the set of chosen arms such that $\|x_t(i)\|_{V_{t-1}^{-1}}>1/\sqrt{k}$. In the proof of Theorem 2, the first term on the right-hand side is shown to be $\tilde{O}(Rd\sqrt{kT})$, which is optimal, while the second term can be $\tilde{O}(\min(LS,Bk)dk)$. The reason the second term is so large is that each arm $i\in J'_t$ may have an overly optimistic reward estimate (i.e., $\hat{r}_t(i)$ may be large). To overcome this issue, we reduce $\hat{r}_t(i)$ when arm $i$ has an overly optimistic reward estimate, while keeping the reduced value an optimistic estimate as required by UCB-type algorithms. As described in Algorithm 2, we adopt $B$, the maximum absolute value of the mean reward, as $\hat{r}_t(i)$ when $i\in J_t$.

Similar to the above, we can show that the proposed algorithm (Algorithm 2) has the following regret bound:

Theorem 4.

If $\alpha_t=\beta_t(\delta)$ and $\lambda=R^2S^{-2}d$, the proposed algorithm has the following regret bound with probability $1-\delta$:

$$R^\alpha(T)=\tilde{O}\left(C\left(Rd\sqrt{kT}+Bdk\right)\right).$$

We now show that this regret bound is optimal. We can define an instance of the general problem with any $C>0$ from any instance of the CCS problem: indeed, for any $C>0$, we can define $f_{r,X}(I)=C\sum_{i\in I}r(i)$. Thus, the optimal degree of dependence on $C$ in the regret is linear. For the other parameters, we show the lower bound in the next section.

Algorithm 2 Proposed algorithm
1: Input: $\lambda>0$ and $\{\alpha_t\}_{t\in[T]}$ s.t. $\alpha_t>0$ for all $t\in[T]$.
2: $V_0\leftarrow\lambda I$ and $b_0\leftarrow\mathbf{0}$.
3: for $t=1,2,\dots,T$ do
4:   Observe $X_t=\{x_t(i)\}_{i\in[N]}$ and $S_t$, and let $J_t=\{i\in[N]\mid x_t(i)^\top V_{t-1}^{-1}x_t(i)>1/k\}$.
5:   $\hat{\theta}_t\leftarrow V_{t-1}^{-1}b_{t-1}$.
6:   for $i\in[N]$ do
7:     If $i\in J_t$ then $\hat{r}_t(i)\leftarrow B$; otherwise $\hat{r}_t(i)\leftarrow\hat{\theta}_t^\top x_t(i)+\alpha_t\sqrt{x_t(i)^\top V_{t-1}^{-1}x_t(i)}$.
8:   end for
9:   Play the set of arms $I_t=\mathcal{O}_{S_t}(\{\hat{r}_t(i)\}_{i\in[N]},X_t)$ and observe rewards $\{r_t(i)\}_{i\in I_t}$.
10:  $V_t\leftarrow V_{t-1}+\sum_{i\in I_t}x_t(i)x_t(i)^\top$ and $b_t\leftarrow b_{t-1}+\sum_{i\in I_t}r_t(i)x_t(i)$.
11: end for
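The only change from Algorithm 1 is the estimate on line 7, which can be sketched as below. The function name and the numeric values are ours and purely illustrative: arms whose squared confidence width $x^\top V^{-1}x$ exceeds $1/k$ receive the conservative (but still optimistic) value $B$ instead of a possibly far too large UCB value.

```python
import numpy as np

def modified_estimates(X, V_inv, theta_hat, alpha, k, B):
    """Line 7 of Algorithm 2 for an (N, d) matrix of feature vectors X."""
    widths_sq = np.einsum("ij,jk,ik->i", X, V_inv, X)  # x^T V^{-1} x per arm
    ucb = X @ theta_hat + alpha * np.sqrt(widths_sq)   # the C2UCB estimate
    return np.where(widths_sq > 1.0 / k, B, ucb)       # arms in J_t get B

# Usage: with V = I (no data yet), the first arm's squared width (1.0)
# exceeds 1/k for k = 4, so its estimate is clipped to B; the second arm
# (squared width 0.01) keeps its UCB value 0 + 2 * 0.1 = 0.2.
est = modified_estimates(
    X=np.array([[1.0, 0.0], [0.1, 0.0]]),
    V_inv=np.eye(2), theta_hat=np.zeros(2), alpha=2.0, k=4, B=0.5,
)
```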

Lower Bounds

In this section, we show a regret lower bound that matches the regret upper bounds shown in Theorems 3 and 4 up to logarithmic factors. To achieve the lower bound, we mix two types of instances, which provide $\Omega(Rd\sqrt{kT})$ and $\Omega(Bdk)$ regret, respectively. While the first type of instance represents the difficulty of learning due to the noise added to the rewards, the second represents the minimum sample size required to learn the $d$-dimensional vector $\theta^*$ in the CCS problem.

We first consider instances that achieve $\Omega(Rd\sqrt{kT})$ and are analogous to instances for the LB problem. Since the lower bound for the LB problem is known to be $\Omega(d\sqrt{T})$ with $R=1$, the CCS problem, in which the total number of selected arms is $kT$, would yield $\Omega(Rd\sqrt{kT})$. In these instances, the learner chooses $k$ vertices from a $d$-dimensional hypercube; note that duplication of vertices is allowed.

Theorem 5.

Let $\{x_t(i)\}_{i=(s-1)2^d+1}^{s2^d}=\{-1,1\}^d$ and $S_t=\{I\subseteq[k2^d]\mid|I|=k\}$ for any $s\in[k]$ and $t\in[T]$. Let $\Theta=\{-R/\sqrt{kT},R/\sqrt{kT}\}^d$. Assume that $\eta_t(i)\sim\mathcal{N}(0,R^2)$ independently. Then, for any algorithm, there exists $\theta^*\in\Theta$ such that $R(T)=\Omega(Rd\sqrt{kT})$.

Proof.

We first consider instances that achieve the lower bound for the LB problem. Using the discussion of Theorem 24.1 of Lattimore and Szepesvári (2020), we obtain the lower bound of $\Omega(Rd\sqrt{T})$ for a certain $\theta\in\Theta$ when $k=1$. Note that this lower bound holds even if the algorithm knows in advance the given sets of arms of all rounds.

Then, we observe that the set of algorithms for the above instances with kT rounds includes any algorithm for the CCS problem, which proves the theorem. ∎
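As an illustration, the instance family of Theorem 5 can be generated as follows (a sketch under our own naming; the helper `hypercube_instance` and its interface are not from the paper):

```python
import itertools
import numpy as np

def hypercube_instance(d, k, T, R, rng):
    """Sketch of the Theorem 5 instance: k copies of the 2^d hypercube
    vertices as arms, and theta* with entries +-R/sqrt(kT)."""
    vertices = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
    arms = np.tile(vertices, (k, 1))  # k * 2^d arms; duplication is allowed
    theta = rng.choice([-1.0, 1.0], size=d) * R / np.sqrt(k * T)
    return arms, theta

rng = np.random.default_rng(0)
arms, theta = hypercube_instance(d=3, k=2, T=10, R=1.0, rng=rng)
# The best single arm matches the signs of theta, so the optimal k-subset
# repeats that vertex; its per-arm mean reward equals the l1-norm of theta.
best = np.max(arms @ theta)
```

Because the entries of \theta^{*} are scaled as R/\sqrt{kT}, no algorithm can reliably identify the sign pattern within kT pulls, which is what forces the \Omega(Rd\sqrt{kT}) regret.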

We next introduce instances that achieve \Omega(Bdk), based on the fact that no feedback is received until k arms have been selected in the CCS problem. More specifically, these instances consist of \Theta(d) independent 2-armed bandit problems with delayed feedback. In each problem, the learner suffers \Omega(Bk) regret due to the delayed feedback.

Theorem 6.

Let d=2d^{\prime} and S_{t}=\{I\subseteq[2k]\mid|I|=k\}. Assume that \eta_{t}(i)=0. Define \min(d^{\prime},T) groups by dividing the rounds. For each group j\in[\min(d^{\prime},T)], the given arms are defined as x_{t}(i)=B\sqrt{d}e_{2j-1} for i\leq k and x_{t}(i)=B\sqrt{d}e_{2j} for i>k, where \{e_{l}\}_{l\in[d]} is the normalized standard basis. Let \Theta=\{-1/\sqrt{d},1/\sqrt{d}\}^{d}. Then, for any algorithm, there exists \theta^{*}\in\Theta such that R(T)=\Omega(\min(Bdk,BkT)).

Proof.

As in Appendix A of Auer et al. (2002), it suffices to consider only deterministic algorithms. In the first round of each group, any algorithm selects k/2 or more arms of one of the two types. Therefore, we can choose \theta^{*}\in\Theta so that, for each group, the majority type of the chosen arms is not optimal, in which case the algorithm suffers \Theta(Bk) regret. ∎

Finally, by combining the two types of instances above, we obtain instances that achieve the matching regret lower bound:

Theorem 7.

Suppose that kT=\Omega((Rd/B)^{2}) and d=2d^{\prime}. Then, for any given algorithm, if we use the instances of Theorem 5 and Theorem 6, constructed on disjoint sets of d^{\prime} dimensions, in the first and second halves of the rounds, respectively, the resulting instance achieves the following:

\displaystyle R(T)=\Omega(\min(Rd\sqrt{kT}+Bdk,BkT)).
Proof.

From kT=\Omega((Rd/B)^{2}), we have |{\theta^{*}}^{\top}x_{t}(i)|<B for all i\in[N] and t\in[T]. Hence, we obtain R(T)=O(BkT). Alternatively, from Theorem 5 and Theorem 6, we have R(T)=\Omega(Rd\sqrt{kT}+\min(Bdk,BkT)). ∎

Note that we can set R>0 and B>0 arbitrarily in the instances of Theorem 7, but L and S are automatically determined as L=O(\max(1,B\sqrt{d})) and S=O(1).

Numerical Experiments

Algorithm Parameters
\varepsilon-greedy \varepsilon=0.05 and \lambda=1
C2UCB (Algorithm 1) (Qin, Chen, and Zhu 2014) \lambda=d and \forall t,\alpha_{t}=\sqrt{d}
Thompson sampling (Takemura and Ito 2019) \lambda=d and \forall t,v_{t}=\sqrt{d}
CombLinUCB (Wen, Kveton, and Ashkan 2015) \lambda=1, \sigma=1, and c=\sqrt{d}
CombLinTS (Wen, Kveton, and Ashkan 2015) \lambda=1 and \sigma=1
Proposed (Algorithm 2) \lambda=d and \forall t,\alpha_{t}=\sqrt{d}
Table 2: Algorithms in the numerical experiments.
Figure 1: Experimental results.

Setup

In this section, we evaluate the performance of the C2UCB algorithm and the proposed algorithm through numerical experiments. We prepare two types of instances: one in which the constraints cannot be represented by a partition matroid and one in which they can. We call these the grouped type and the uniform matroid type, respectively. Our analysis suggests that the C2UCB algorithm performs well only on the uniform matroid type, whereas the proposed algorithm performs well on both types. The aim of our experiments is to verify this.

Let us explain the details of the instances. The grouped type is given by combining the instances of Theorem 5 with d=4 and R=1 and an instance defined as follows. Suppose that d=3, N=2k, and \theta^{*}=(0,0.1,0.9)^{\top}. Let f(t)=t-k\lfloor t/k\rfloor. The feature vectors are defined as

\displaystyle x_{t}(i)=\begin{cases}2^{f(t)}e_{1}&\mathrm{if}\ i=1\\ e_{2}&\mathrm{if}\ 1<i\leq k\\ e_{3}&\mathrm{if}\ i>k\end{cases}

for all t\in[T]. The random noise \eta_{t}(i) follows \mathcal{N}(0,1) independently for all t\in[T] and i\in[N]. The feasible combinations are defined as S_{t}=\{\{1,2,\ldots,k\},\{k+1,k+2,\ldots,2k\}\} for all t\in[T]. Note that this constraint cannot be represented by a partition matroid. As for the uniform matroid type, the feasible combinations are defined as S_{t}=\{I\subseteq[N]\mid|I|=k\} for all t\in[T]. This is a uniform matroid constraint, which forms a subclass of partition matroid constraints. The other parameters are the same as in the grouped type.
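The grouped-type feature vectors can be sketched as follows (the helper `grouped_features` and its interface are ours; the feasible sets are the two groups \{1,\ldots,k\} and \{k+1,\ldots,2k\}):

```python
import numpy as np

def grouped_features(k, t):
    """Sketch of the grouped-type feature vectors (d = 3, N = 2k):
    arm 1 gets the exponentially growing vector 2^{f(t)} e_1 with
    f(t) = t - k * floor(t / k); arms 2..k get e_2; arms k+1..2k get e_3."""
    f = t - k * (t // k)
    X = np.zeros((2 * k, 3))
    X[0, 0] = 2.0 ** f    # arm 1: 2^{f(t)} e_1
    X[1:k, 1] = 1.0       # arms 2..k: e_2
    X[k:, 2] = 1.0        # arms k+1..2k: e_3
    return X

theta = np.array([0.0, 0.1, 0.9])
X = grouped_features(k=4, t=2)
# Only the two groups are feasible; their true expected rewards are:
first_group = X[:4] @ theta   # arm 1 contributes 0, arms 2..4 contribute 0.1 each
second_group = X[4:] @ theta  # each of the k arms contributes 0.9
```

The second group is always optimal, but the exponentially growing first coordinate of arm 1 is what later inflates the C2UCB algorithm's confidence bound for the first group.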

We start with k=2 and T=40, and increase k and T so that they satisfy k=\Theta(T). We run 100 simulations to obtain the mean regret. We evaluate the performance of an algorithm by the mean regret for the worst \theta^{*}: we compare the means over all \theta^{*} for the largest kT and choose the \theta^{*} with the largest mean.

We compare the proposed algorithm with five existing algorithms as baselines, using the parameters described in Table 2. The \varepsilon-greedy algorithm has two ways of estimating the rewards of the given arms: one is to use values sampled from \mathcal{N}(0,1) independently, and the other is to estimate the rewards as in line 6 of Algorithm 1 with \alpha_{t}=0. The algorithm chooses the former with probability \varepsilon and the latter otherwise. It then plays a set of arms as in line 8 of Algorithm 1.
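The two estimation modes of the \varepsilon-greedy baseline can be sketched as follows (a rough illustration; the function `epsilon_greedy_scores` and its interface are our own, not from the paper):

```python
import numpy as np

def epsilon_greedy_scores(X, theta_hat, eps, rng):
    """Per-round reward estimates of the epsilon-greedy baseline:
    with probability eps, use independent N(0, 1) scores (exploration);
    otherwise, use the greedy linear estimates theta_hat^T x_t(i)
    (line 6 of Algorithm 1 with alpha_t = 0)."""
    if rng.random() < eps:
        return rng.standard_normal(X.shape[0])
    return X @ theta_hat

rng = np.random.default_rng(1)
X = np.eye(3)
theta_hat = np.array([0.2, 0.5, 0.1])
# eps=0.0 exercises the greedy branch deterministically for this demo.
scores = epsilon_greedy_scores(X, theta_hat, eps=0.0, rng=rng)
# The oracle then plays the feasible set maximizing the sum of scores,
# e.g., the top-k arms under a uniform matroid constraint (k = 2 here).
top2 = np.argsort(scores)[-2:]
```

The returned scores play the role of \hat{r}_{t}(i) in line 8 of Algorithm 1; only the way they are computed differs from C2UCB.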

Results

Figure 1(a) and (b) show the relation between the number of pulled arms (i.e., kT) and the regret for the grouped type and the uniform matroid type, respectively. Error bars represent the standard error.

As we can see in Figure 1(a), the regret of the proposed algorithm increased most slowly, which indicates that the regrets of the existing and proposed algorithms depend on the number of pulled arms to different degrees. We can explain this phenomenon in terms of overly optimistic reward estimates. Since \|x_{t}(1)\|_{2} grows exponentially until the k-th round, the C2UCB algorithm often gave that arm an overly optimistic reward in these rounds. As a result, the sum of optimistic rewards in the first group \{1,2,\ldots,k\} was often greater than that in the other group. Hence, the C2UCB algorithm often chose the sub-optimal group and suffered \Theta(Bk) regret in a round. Note that this phenomenon is largely independent of the linearity of the payoff function, which implies that the negative effect of overly optimistic estimates could appear in UCB-type algorithms for related problems with semi-bandit feedback.

On the other hand, as shown in Figure 1(b), the regrets of all the algorithms except the \varepsilon-greedy algorithm were almost the same. This is because the constraints of the uniform matroid type satisfy condition (4), so the C2UCB algorithm enjoys the optimal regret bound described in Theorem 3. More precisely, as opposed to the grouped type, the regret incurred from overly optimistic estimates is at most O(B) in a round.

Conclusion

We have discussed the CCS problem and shown matching upper and lower bounds of the regret. Our analysis has improved the existing regret bound of the C2UCB algorithm and clarified the negative effect of the overly optimistic estimates of rewards in bandit problems with semi-bandit feedback. We have solved this issue in two ways: introducing partition matroid constraints and providing other optimistic rewards to arms with large confidence intervals. Our theoretical and numerical analyses have demonstrated the impact of the overly optimistic estimation and the effectiveness of our approaches.

As we discussed, the negative effect of the overly optimistic estimation could appear in related problems as well. Since the ideas of our approaches do not depend on the linearity of the linear payoff functions, we believe they are applicable to overly optimistic estimation in related problems.

Although the proposed algorithm achieves the optimal regret bound, it uses B explicitly, as opposed to the C2UCB algorithm. It is an open question whether there exists an algorithm that achieves the optimal regret bound for general constraints without knowledge of a tight upper bound on B.

Acknowledgements

SI was supported by JST, ACT-I, Grant Number JPMJPR18U5, Japan. TF was supported by JST, PRESTO, Grant Number JPMJPR1759, Japan. NK and KK were supported by JSPS, KAKENHI, Grant Number JP18H05291, Japan.

References

  • Abbasi-Yadkori, Pál, and Szepesvári (2011) Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2312–2320.
  • Agrawal and Goyal (2013) Agrawal, S.; and Goyal, N. 2013. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, 127–135.
  • Auer (2002) Auer, P. 2002. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3(Nov): 397–422.
  • Auer et al. (2002) Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 2002. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32(1): 48–77.
  • Chapelle and Li (2011) Chapelle, O.; and Li, L. 2011. An empirical evaluation of thompson sampling. In Advances in Neural Information Processing Systems, 2249–2257.
  • Chen et al. (2016a) Chen, W.; Hu, W.; Li, F.; Li, J.; Liu, Y.; and Lu, P. 2016a. Combinatorial multi-armed bandit with general reward functions. In Advances in Neural Information Processing Systems, 1659–1667.
  • Chen et al. (2016b) Chen, W.; Wang, Y.; Yuan, Y.; and Wang, Q. 2016b. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research 17(1): 1746–1778.
  • Chu et al. (2011) Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. 2011. Contextual Bandits with Linear Payoff Functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 208–214.
  • Combes et al. (2015) Combes, R.; Shahi, M. S. T. M.; Proutiere, A.; et al. 2015. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, 2116–2124.
  • Dani, Hayes, and Kakade (2008) Dani, V.; Hayes, T. P.; and Kakade, S. M. 2008. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory, 355–366.
  • Gai, Krishnamachari, and Jain (2012) Gai, Y.; Krishnamachari, B.; and Jain, R. 2012. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking 20(5): 1466–1478.
  • Kveton et al. (2014) Kveton, B.; Wen, Z.; Ashkan, A.; Eydgahi, H.; and Eriksson, B. 2014. Matroid bandits: Fast combinatorial optimization with learning. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, 420–429.
  • Kveton et al. (2015) Kveton, B.; Wen, Z.; Ashkan, A.; and Szepesvári, C. 2015. Tight regret bounds for stochastic combinatorial semi-bandits. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 535–543.
  • Lattimore and Szepesvári (2020) Lattimore, T.; and Szepesvári, C. 2020. Bandit Algorithms. Cambridge University Press.
  • Li et al. (2010) Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A Contextual-Bandit Approach to Personalized News Article Recommendation. In Proceedings of the 19th International Conference on World Wide Web, 661–670.
  • Qin, Chen, and Zhu (2014) Qin, L.; Chen, S.; and Zhu, X. 2014. Contextual Combinatorial Bandit and its Application on Diversified Online Recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining, 461–469.
  • Qin and Zhu (2013) Qin, L.; and Zhu, X. 2013. Promoting diversity in recommendation by entropy regularizer. In Twenty-Third International Joint Conference on Artificial Intelligence.
  • Takemura and Ito (2019) Takemura, K.; and Ito, S. 2019. An Arm-Wise Randomization Approach to Combinatorial Linear Semi-Bandits. In Proceedings of the 2019 IEEE International Conference on Data Mining, 1318–1323.
  • Wang et al. (2017) Wang, Y.; Ouyang, H.; Wang, C.; Chen, J.; Asamov, T.; and Chang, Y. 2017. Efficient Ordered Combinatorial Semi-Bandits for Whole-Page Recommendation. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2746–2753.
  • Wen, Kveton, and Ashkan (2015) Wen, Z.; Kveton, B.; and Ashkan, A. 2015. Efficient Learning in Large-Scale Combinatorial Semi-Bandits. In Proceedings of the 32nd International Conference on Machine Learning, 1113–1122.

Appendix A Proofs

Known Results

Our proofs use the following known results:

Lemma 3 (Theorem 2 of Abbasi-Yadkori, Pál, and Szepesvári (2011)).

Let \{F_{t}\}_{t=0}^{\infty} be a filtration, \{X_{t}\}_{t=1}^{\infty} be an \mathbb{R}^{d}-valued stochastic process such that X_{t} is F_{t-1}-measurable, and \{\eta_{t}\}_{t=1}^{\infty} be a real-valued stochastic process such that \eta_{t} is F_{t}-measurable. Let V=\lambda I be a positive definite matrix, V_{t}=V+\sum_{s\in[t]}X_{s}X_{s}^{\top}, Y_{t}=\sum_{s\in[t]}\left({\theta^{*}}^{\top}X_{s}+\eta_{s}\right)X_{s}, and \hat{\theta}_{t}=V_{t-1}^{-1}Y_{t-1}. Assume for all t that \eta_{t} is conditionally R-sub-Gaussian for some R>0 and that \|\theta^{*}\|_{2}\leq S. Then, for any \delta>0, with probability at least 1-\delta, for any t\geq 1,

\displaystyle\|\hat{\theta}_{t}-\theta^{*}\|_{V_{t-1}}\leq R\sqrt{2\log\left(\frac{\det(V_{t-1})^{1/2}\det(\lambda I)^{-1/2}}{\delta}\right)}+\sqrt{\lambda}S.

Furthermore, if Xt2L\|X_{t}\|_{2}\leq L for all t1t\geq 1, then with probability at least 1δ1-\delta, for all t1t\geq 1,

\displaystyle\|\hat{\theta}_{t}-\theta^{*}\|_{V_{t-1}}\leq R\sqrt{d\log\left(\frac{1+(t-1)L^{2}/\lambda}{\delta}\right)}+\sqrt{\lambda}S.
Lemma 4 (Lemma 10 of Abbasi-Yadkori, Pál, and Szepesvári (2011)).

Suppose X_{1}, X_{2}, \dots, X_{t}\in\mathbb{R}^{d} and \|X_{s}\|_{2}\leq L for all 1\leq s\leq t. Let V_{t}=\lambda I+\sum_{s\in[t]}X_{s}X_{s}^{\top} for some \lambda>0. Then,

\displaystyle\det(V_{t})\leq(\lambda+tL^{2}/d)^{d}.
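Lemma 4 is easy to sanity-check numerically; the following sketch (random data, our variable names) verifies the determinant bound for one draw:

```python
import numpy as np

# Numerical sanity check of Lemma 4 (determinant-trace bound):
# for V_t = lambda * I + sum_s X_s X_s^T with ||X_s||_2 <= L, we should have
# det(V_t) <= (lambda + t * L^2 / d)^d.
rng = np.random.default_rng(0)
d, t, lam, L = 4, 50, 1.0, 2.0

X = rng.standard_normal((t, d))
# Rescale rows so that every ||X_s||_2 is at most L.
X *= L / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
V = lam * np.eye(d) + X.T @ X

lhs = np.linalg.det(V)
rhs = (lam + t * L**2 / d) ** d
```

The inequality is the AM-GM bound \det(V_{t})\leq(\operatorname{tr}(V_{t})/d)^{d} combined with \operatorname{tr}(V_{t})\leq d\lambda+tL^{2}, so it holds for any admissible data, not only this random draw.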

Proof of Lemma 2

We first show the lemma below, which proves the first part of Lemma 2. Note that this lemma is an extension of Lemma 11 of Abbasi-Yadkori, Pál, and Szepesvári (2011).

Lemma 5.

Let \{\{x_{t}(i)\}_{i\in[k]}\}_{t\in[T]} be any sequence such that x_{t}(i)\in\mathbb{R}^{d} and \|x_{t}(i)\|_{2}\leq L for all i\in[k] and t\in[T]. Let V_{t}=\lambda I+\sum_{s\in[t]}\sum_{i\in[k]}x_{s}(i)x_{s}(i)^{\top} with \lambda>0. Then, we have

\displaystyle\sum_{t\in[T]}\sum_{i\in[k]}\min\left(\frac{1}{k},\|x_{t}(i)\|_{V_{t-1}^{-1}}^{2}\right)\leq 2d\log(1+L^{2}kT/(d\lambda)). (5)

Accordingly, we have

\displaystyle\sum_{t\in[T]}\sum_{i\in[k]}\min\left(\frac{1}{\sqrt{k}},\|x_{t}(i)\|_{V_{t-1}^{-1}}\right)\leq\sqrt{2dkT\log(1+L^{2}kT/(d\lambda))}. (6)
Proof.

We have

\displaystyle\log\det\left(V_{T}\right)=\log\det\left(V_{T-1}+\sum_{i\in[k]}x_{T}(i)x_{T}(i)^{\top}\right)
\displaystyle=\log\det\left(V_{T-1}\right)+\log\det\left(I+\sum_{i\in[k]}V_{T-1}^{-1/2}x_{T}(i)(V_{T-1}^{-1/2}x_{T}(i))^{\top}\right)
\displaystyle=\log\det\left(V_{T-1}\right)+\log\det\left(\sum_{i\in[k]}\frac{1}{k}\left(I+kV_{T-1}^{-1/2}x_{T}(i)(V_{T-1}^{-1/2}x_{T}(i))^{\top}\right)\right)
\displaystyle\geq\log\det\left(V_{T-1}\right)+\sum_{i\in[k]}\frac{1}{k}\log\det\left(I+kV_{T-1}^{-1/2}x_{T}(i)(V_{T-1}^{-1/2}x_{T}(i))^{\top}\right)
\displaystyle\geq\log\det\left(\lambda I\right)+\sum_{t\in[T]}\sum_{i\in[k]}\frac{1}{k}\log\det\left(I+kV_{t-1}^{-1/2}x_{t}(i)(V_{t-1}^{-1/2}x_{t}(i))^{\top}\right),

where the first inequality follows from Jensen's inequality applied to the concave function \log\det(\cdot). Then, we have

\displaystyle\sum_{i\in[k]}\frac{1}{k}\log\det\left(I+kV_{t-1}^{-1/2}x_{t}(i)(V_{t-1}^{-1/2}x_{t}(i))^{\top}\right)
\displaystyle=\sum_{i\in[k]}\frac{1}{k}\log\left(1+kx_{t}(i)^{\top}V_{t-1}^{-1}x_{t}(i)\right)
\displaystyle\geq\sum_{i\in[k]}\frac{1}{k}\log\left(1+\min\left(1,kx_{t}(i)^{\top}V_{t-1}^{-1}x_{t}(i)\right)\right)
\displaystyle\geq\sum_{i\in[k]}\frac{1}{2}\min\left(\frac{1}{k},x_{t}(i)^{\top}V_{t-1}^{-1}x_{t}(i)\right)

for all t\in[T], where the second inequality is derived from 2\log(1+x)\geq x for all x\in[0,1]. Therefore, we obtain

\displaystyle\sum_{t\in[T]}\sum_{i\in[k]}\frac{1}{2}\min\left(\frac{1}{k},x_{t}(i)^{\top}V_{t-1}^{-1}x_{t}(i)\right)\leq\log\det\left(V_{T}\right)-\log\det(\lambda I).

Applying Lemma 4 to the above inequality, we obtain (5). Finally, we obtain (6) by applying the Cauchy-Schwarz inequality to (5). ∎
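Inequality (5), the elliptical-potential bound of Lemma 5, can likewise be checked numerically (a sketch with random feature vectors; all variable names are ours):

```python
import numpy as np

# Numerical sanity check of Lemma 5, inequality (5): with
# V_t = lambda*I + sum_{s<=t} sum_i x_s(i) x_s(i)^T, the sum of
# min(1/k, ||x_t(i)||^2_{V_{t-1}^{-1}}) over all rounds and arms
# is at most 2 d log(1 + L^2 k T / (d lambda)).
rng = np.random.default_rng(0)
d, k, T, lam, L = 3, 4, 30, 1.0, 1.0

V = lam * np.eye(d)
potential = 0.0
for t in range(T):
    X = rng.standard_normal((k, d))
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # ||x|| <= L = 1
    Vinv = np.linalg.inv(V)
    norms_sq = np.einsum('ij,jl,il->i', X, Vinv, X)  # ||x_t(i)||^2 in V_{t-1}^{-1}
    potential += np.minimum(1.0 / k, norms_sq).sum()
    V += X.T @ X  # V_{t-1} -> V_t: rank-k update with this round's arms

bound = 2 * d * np.log(1 + L**2 * k * T / (d * lam))
```

Note that the potential accumulates slowly precisely because each V_{t} absorbs the k feature vectors observed in round t, which is the mechanism behind the \sqrt{dkT} term in the regret bounds.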

Using Lemma 5, we prove the second part of Lemma 2.

Lemma 6.

Let \{\{x_{t}(i)\}_{i\in[k]}\}_{t\in[T]} be any sequence such that x_{t}(i)\in\mathbb{R}^{d} and \|x_{t}(i)\|_{2}\leq L for all i\in[k] and t\in[T]. Let V_{t}=\lambda I+\sum_{s\in[t]}\sum_{i\in[k]}x_{s}(i)x_{s}(i)^{\top} with \lambda>0. Then, we have

\displaystyle\sum_{t\in[T]}\sum_{i\in[k]}\mathds{1}\left(\|x_{t}(i)\|_{V_{t-1}^{-1}}>1/\sqrt{k}\right)\leq 2dk\log(1+L^{2}kT/(d\lambda)).
Proof.

From Lemma 5, we obtain

\displaystyle\frac{1}{k}\sum_{t\in[T]}\sum_{i\in[k]}\mathds{1}\left(\|x_{t}(i)\|_{V_{t-1}^{-1}}>1/\sqrt{k}\right)\leq\sum_{t\in[T]}\sum_{i\in[k]}\min\left(\frac{1}{k},\|x_{t}(i)\|_{V_{t-1}^{-1}}^{2}\right)
\displaystyle\leq 2d\log(1+L^{2}kT/(d\lambda)).

Proof of Theorem 2

Recall that \beta_{t}(\delta)=R\sqrt{d\log\left(\frac{1+L^{2}kt/\lambda}{\delta}\right)}+S\sqrt{\lambda} for \delta>0. Theorem 2 is a corollary of the following theorem.

Theorem 8.

If \alpha_{t}=\beta_{t}(\delta), the C2UCB algorithm has the following regret bound with probability 1-\delta:

\displaystyle R(T)=O\left(\beta_{T}(\delta)\sqrt{dkT\log\left(1+\frac{L^{2}kT}{d\lambda}\right)}+\min\left(\frac{L\beta_{T}(\delta)}{\sqrt{\lambda}},Bk\right)dk\log\left(1+\frac{L^{2}kT}{d\lambda}\right)\right).

We note that this theorem holds even if the definition of B is replaced with

\displaystyle\forall t\in[T],\forall i,j\in[N],\quad|{\theta^{*}}^{\top}(x_{t}(i)-x_{t}(j))|\leq 2B. (7)

Condition (7) is weaker than the original definition of B, as (7) follows from |{\theta^{*}}^{\top}x_{t}(i)|\leq B for all t\in[T] and i\in[N].

Proof of Theorem 8.

Let J_{t}=\{i\in[N]\mid\|x_{t}(i)\|_{V_{t-1}^{-1}}>1/\sqrt{k}\} and J^{\prime}_{t}=I_{t}\cap J_{t}. From Lemma 3, we have \|\hat{\theta}_{t}-\theta^{*}\|_{V_{t-1}}\leq\beta_{t}(\delta) for all t\in[T] with probability 1-\delta. Then, we have

\displaystyle\hat{r}_{t}(i)-{\theta^{*}}^{\top}x_{t}(i)=\beta_{t}(\delta)\|x_{t}(i)\|_{V_{t-1}^{-1}}+(\hat{\theta}_{t}-\theta^{*})^{\top}x_{t}(i)
\displaystyle\geq\left(\beta_{t}(\delta)-\|\hat{\theta}_{t}-\theta^{*}\|_{V_{t-1}}\right)\|x_{t}(i)\|_{V_{t-1}^{-1}}
\displaystyle\geq 0

for all i\in[N] and t\in[T]. Similarly, we have \hat{\theta}_{t}^{\top}x_{t}(i)-\beta_{t}(\delta)\|x_{t}(i)\|_{V_{t-1}^{-1}}-{\theta^{*}}^{\top}x_{t}(i)\leq 0 for all i\in[N] and t\in[T]. Thus, it follows that

\displaystyle R(T)=\sum_{t\in[T]}\min\left(\sum_{i\in I_{t}^{*}}{\theta^{*}}^{\top}x_{t}(i)-\sum_{i\in I_{t}}{\theta^{*}}^{\top}x_{t}(i),2Bk\right)
\displaystyle\leq\sum_{t\in[T]}\min\left(\sum_{i\in I_{t}^{*}}\hat{r}_{t}(i)-\sum_{i\in I_{t}}{\theta^{*}}^{\top}x_{t}(i),2Bk\right)
\displaystyle\leq\sum_{t\in[T]}\min\left(\sum_{i\in I_{t}}(\hat{r}_{t}(i)-{\theta^{*}}^{\top}x_{t}(i)),2Bk\right)
\displaystyle\leq\sum_{t\in[T]}\min\left(\sum_{i\in I_{t}}2\beta_{t}(\delta)\|x_{t}(i)\|_{V_{t-1}^{-1}},2Bk\right)
\displaystyle\leq\sum_{t\in[T]}\sum_{i\in I_{t}\setminus J^{\prime}_{t}}2\beta_{t}(\delta)\|x_{t}(i)\|_{V_{t-1}^{-1}}+\sum_{t\in[T]}\min\left(\sum_{i\in J^{\prime}_{t}}2\beta_{t}(\delta)\|x_{t}(i)\|_{V_{t-1}^{-1}},2Bk\right), (8)

where the second inequality follows since I_{t}=\operatorname*{argmax}_{I\in S_{t}}\sum_{i\in I}\hat{r}_{t}(i), and the last inequality is derived from \beta_{t}(\delta)\|x_{t}(i)\|_{V_{t-1}^{-1}}>0 for all \delta\in(0,1), i\in[N], and t\in[T].

From Lemma 5, we bound the first term of (8) as follows:

\displaystyle\sum_{t\in[T]}\sum_{i\in I_{t}\setminus J^{\prime}_{t}}2\beta_{t}(\delta)\|x_{t}(i)\|_{V_{t-1}^{-1}}\leq 2\beta_{T}(\delta)\sum_{t\in[T]}\sum_{i\in I_{t}\setminus J^{\prime}_{t}}\|x_{t}(i)\|_{V_{t-1}^{-1}}
\displaystyle\leq 2\beta_{T}(\delta)\sqrt{2dkT\log(1+L^{2}kT/(d\lambda))}.

We show that the second term of (8) is O\left(\min\left(\frac{L\beta_{T}(\delta)}{\sqrt{\lambda}},Bk\right)dk\log\left(1+\frac{L^{2}kT}{d\lambda}\right)\right) by bounding this term in two ways. We have \|x_{t}(i)\|_{V_{t-1}^{-1}}^{2}\leq L^{2}/\lambda for all i\in[N] and t\in[T]. From this fact and Lemma 6, we obtain

\displaystyle\sum_{t\in[T]}\min\left(\sum_{i\in J^{\prime}_{t}}2\beta_{t}(\delta)\|x_{t}(i)\|_{V_{t-1}^{-1}},2Bk\right)\leq\sum_{t\in[T]}\sum_{i\in J^{\prime}_{t}}2\beta_{t}(\delta)\|x_{t}(i)\|_{V_{t-1}^{-1}}
\displaystyle\leq 2\beta_{T}(\delta)\sum_{t\in[T]}\sum_{i\in J^{\prime}_{t}}\|x_{t}(i)\|_{V_{t-1}^{-1}}
\displaystyle\leq 4\beta_{T}(\delta)\frac{Ldk}{\sqrt{\lambda}}\log(1+L^{2}kT/(d\lambda)).

Alternatively, it follows from Lemma 6 that

\displaystyle\sum_{t\in[T]}\min\left(\sum_{i\in J^{\prime}_{t}}2\beta_{t}(\delta)\|x_{t}(i)\|_{V_{t-1}^{-1}},2Bk\right)\leq\sum_{t\in[T]}2|J^{\prime}_{t}|Bk
\displaystyle\leq 4Bdk^{2}\log(1+L^{2}kT/(d\lambda)).

Combining these inequalities, we obtain Theorem 8. ∎

Proof of Theorem 3

Theorem 3 is a corollary of the theorem below. Note that this theorem holds when B is a parameter satisfying (7).

Theorem 9.

Assume that S_{t} satisfies (4) for a partition \{B_{t}(j)\}_{j\in[M]} and \{d_{t}(j)\}_{j\in[M]} for all t\in[T]. Then, if \alpha_{t}=\beta_{t}(\delta), the C2UCB algorithm has the following regret bound with probability 1-\delta:

\displaystyle R(T)=O\left(\beta_{T}(\delta)\sqrt{dkT\log\left(1+\frac{L^{2}kT}{d\lambda}\right)}+Bdk\log\left(1+\frac{L^{2}kT}{d\lambda}\right)\right).
Proof.

Let J_{t}=\{i\in[N]\mid\|x_{t}(i)\|_{V_{t-1}^{-1}}>1/\sqrt{k}\} and J^{\prime}_{t}=I_{t}\cap J_{t}. We separate the chosen arms into two groups: \{J^{\prime}_{t}\}_{t\in[T]} and the remaining arms. From the optimality of I_{t}, I_{t}\cap B_{t}(j) consists of the top d_{t}(j) arms of B_{t}(j) in terms of \hat{r}_{t}(\cdot) for each j\in[M]. Thus, we obtain

\displaystyle I_{t}\setminus J^{\prime}_{t}=\operatorname*{argmax}_{I\in S^{\prime}_{t}}\sum_{i\in I}\hat{r}_{t}(i) (9)

for all t\in[T], where

\displaystyle S^{\prime}_{t}=\left\{I\subseteq[N]\setminus J^{\prime}_{t}\mid|I\cap B_{t}(j)|=d_{t}(j)-|B_{t}(j)\cap J^{\prime}_{t}|,\forall j\in[M]\right\}.

Let J^{*}_{t} be a subset of I^{*}_{t} consisting of the arms in I^{*}_{t}\cap J^{\prime}_{t} together with, for each j\in[M], |B_{t}(j)\cap J^{\prime}_{t}|-|I^{*}_{t}\cap J^{\prime}_{t}\cap B_{t}(j)| arms chosen arbitrarily from I^{*}_{t}\cap B_{t}(j). Then, I^{*}_{t}\setminus J^{*}_{t}\in S^{\prime}_{t} and |J^{*}_{t}|=|J^{\prime}_{t}| for all t\in[T]. Similarly to I_{t}, we divide I^{*}_{t} into I^{*}_{t}\setminus J^{*}_{t} and J^{*}_{t}. This gives

\displaystyle R(T)=\sum_{t\in[T]}\left(\sum_{i\in I_{t}^{*}\setminus J_{t}^{*}}{\theta^{*}}^{\top}x_{t}(i)-\sum_{i\in I_{t}\setminus J^{\prime}_{t}}{\theta^{*}}^{\top}x_{t}(i)\right)+\sum_{t\in[T]}\left(\sum_{i\in J_{t}^{*}}{\theta^{*}}^{\top}x_{t}(i)-\sum_{i\in J^{\prime}_{t}}{\theta^{*}}^{\top}x_{t}(i)\right). (10)

For the second term of (10), from Lemma 6 and the definition of B, we have

\displaystyle\sum_{t\in[T]}\left(\sum_{i\in J_{t}^{*}}{\theta^{*}}^{\top}x_{t}(i)-\sum_{i\in J^{\prime}_{t}}{\theta^{*}}^{\top}x_{t}(i)\right)\leq 4Bdk\log(1+L^{2}kT/(d\lambda)).

It remains to show that the first term of (10) is O\left(\beta_{T}(\delta)\sqrt{dkT\log\left(1+\frac{L^{2}kT}{d\lambda}\right)}\right). By the same discussion as in the proof of Theorem 8, we have \hat{r}_{t}(i)\geq{\theta^{*}}^{\top}x_{t}(i) for all i\in[N] and t\in[T] with probability 1-\delta. Then, we have

\displaystyle\sum_{t\in[T]}\left(\sum_{i\in I_{t}^{*}\setminus J_{t}^{*}}{\theta^{*}}^{\top}x_{t}(i)-\sum_{i\in I_{t}\setminus J^{\prime}_{t}}{\theta^{*}}^{\top}x_{t}(i)\right)
\displaystyle\leq\sum_{t\in[T]}\left(\sum_{i\in I_{t}^{*}\setminus J_{t}^{*}}\hat{r}_{t}(i)-\sum_{i\in I_{t}\setminus J^{\prime}_{t}}{\theta^{*}}^{\top}x_{t}(i)\right)
\displaystyle\leq\sum_{t\in[T]}\sum_{i\in I_{t}\setminus J^{\prime}_{t}}\left(\hat{r}_{t}(i)-{\theta^{*}}^{\top}x_{t}(i)\right)
\displaystyle=\sum_{t\in[T]}\sum_{i\in I_{t}\setminus J^{\prime}_{t}}\left(\hat{r}_{t}(i)-\hat{\theta}_{t}^{\top}x_{t}(i)\right)+\sum_{t\in[T]}\sum_{i\in I_{t}\setminus J^{\prime}_{t}}\left(\hat{\theta}_{t}-\theta^{*}\right)^{\top}x_{t}(i),

where the second inequality is derived from (9). We define R^{alg}(T) and R^{est}(T) as

\displaystyle R^{alg}(T)=\sum_{t\in[T]}\sum_{i\in I_{t}\setminus J^{\prime}_{t}}\left(\hat{r}_{t}(i)-\hat{\theta}_{t}^{\top}x_{t}(i)\right)\quad\mathrm{and}
\displaystyle R^{est}(T)=\sum_{t\in[T]}\sum_{i\in I_{t}\setminus J^{\prime}_{t}}\left(\hat{\theta}_{t}-\theta^{*}\right)^{\top}x_{t}(i).

From Claim 1 and Claim 2 below, we can bound R^{alg}(T) and R^{est}(T), respectively, which gives the desired bound on the first term of (10). ∎

Claim 1.
\displaystyle R^{alg}(T)=O\left(\beta_{T}(\delta)\sqrt{dkT\log\left(1+\frac{L^{2}kT}{d\lambda}\right)}\right)
Proof of Claim 1.

Recall that J_{t}=\{i\in[N]\mid\|x_{t}(i)\|_{V_{t-1}^{-1}}>1/\sqrt{k}\} and J^{\prime}_{t}=I_{t}\cap J_{t}. We can bound R^{alg}(T) as follows:

\displaystyle R^{alg}(T)=\sum_{t\in[T]}\sum_{i\in I_{t}\setminus J^{\prime}_{t}}\left(\hat{r}_{t}(i)-\hat{\theta}_{t}^{\top}x_{t}(i)\right)
\displaystyle=\sum_{t\in[T]}\sum_{i\in I_{t}\setminus J^{\prime}_{t}}\beta_{t}(\delta)\|x_{t}(i)\|_{V_{t-1}^{-1}}
\displaystyle\leq\beta_{T}(\delta)\sum_{t\in[T]}\sum_{i\in I_{t}\setminus J^{\prime}_{t}}\|x_{t}(i)\|_{V_{t-1}^{-1}}.

Then, Lemma 5 gives the desired result. ∎

Claim 2.
\displaystyle R^{est}(T)=O\left(\beta_{T}(\delta)\sqrt{dkT\log\left(1+\frac{L^{2}kT}{d\lambda}\right)}\right)
Proof of Claim 2.

Recall that J_{t}=\{i\in[N]\mid\|x_{t}(i)\|_{V_{t-1}^{-1}}>1/\sqrt{k}\} and J^{\prime}_{t}=I_{t}\cap J_{t}. It follows from Lemma 3 that with probability 1-\delta,

\displaystyle R^{est}(T)=\sum_{t\in[T]}\sum_{i\in I_{t}\setminus J^{\prime}_{t}}(\hat{\theta}_{t}-\theta^{*})^{\top}x_{t}(i)
\displaystyle\leq\sum_{t\in[T]}\sum_{i\in I_{t}\setminus J^{\prime}_{t}}\|\hat{\theta}_{t}-\theta^{*}\|_{V_{t-1}}\|x_{t}(i)\|_{V_{t-1}^{-1}}
\displaystyle\leq\sum_{t\in[T]}\sum_{i\in I_{t}\setminus J^{\prime}_{t}}\beta_{t}(\delta)\|x_{t}(i)\|_{V_{t-1}^{-1}}
\displaystyle\leq\beta_{T}(\delta)\sum_{t\in[T]}\sum_{i\in I_{t}\setminus J^{\prime}_{t}}\|x_{t}(i)\|_{V_{t-1}^{-1}}.

Applying Lemma 5 to the above, we obtain the desired result. ∎

Proof of Theorem 4

Theorem 4 is a corollary of the theorem below. In this subsection, we prove this theorem.

Theorem 10.

If \alpha_{t}=\beta_{t}(\delta), the proposed algorithm has the following regret bound with probability 1-\delta:

\displaystyle R^{\alpha}(T)=O\left(C\left(\beta_{T}(\delta)\sqrt{dkT\log\left(1+\frac{L^{2}kT}{d\lambda}\right)}+Bdk\log\left(1+\frac{L^{2}kT}{d\lambda}\right)\right)\right).
Proof.

Let J_{t}=\{i\in[N]\mid\|x_{t}(i)\|_{V_{t-1}^{-1}}>1/\sqrt{k}\} and J^{\prime}_{t}=I_{t}\cap J_{t}. Similarly to the discussion in the proof of Theorem 8, we have {\theta^{*}}^{\top}x_{t}(i)\leq\hat{r}_{t}(i) for all i\in[N]\setminus J_{t} and t\in[T] with probability 1-\delta. Furthermore, from the definition of B, we also have {\theta^{*}}^{\top}x_{t}(i)\leq\hat{r}_{t}(i) for all i\in J_{t} and t\in[T]. Thus, we obtain \hat{r}_{t}(i)\geq{\theta^{*}}^{\top}x_{t}(i) for all i\in[N] and t\in[T].

Let \hat{r}_{t}=\{\hat{r}_{t}(i)\}_{i\in[N]} for all t\in[T]. Recall that r_{t}^{*}=\{{\theta^{*}}^{\top}x_{t}(i)\}_{i\in[N]}. Then, we have

\displaystyle R^{\alpha}(T)=\sum_{t\in[T]}(\alpha f_{r_{t}^{*},X_{t}}(I^{*}_{t})-f_{r_{t}^{*},X_{t}}(I_{t}))
\displaystyle\leq\sum_{t\in[T]}(\alpha f_{\hat{r}_{t},X_{t}}(I^{*}_{t})-f_{r_{t}^{*},X_{t}}(I_{t}))
\displaystyle\leq\sum_{t\in[T]}(f_{\hat{r}_{t},X_{t}}(I_{t})-f_{r_{t}^{*},X_{t}}(I_{t}))
\displaystyle\leq C\sum_{t\in[T]}\sum_{i\in I_{t}}(\hat{r}_{t}(i)-{\theta^{*}}^{\top}x_{t}(i)),

where the second inequality is derived from the definition of I_{t}, and the first and third inequalities follow from Assumptions 2 and 3 and the optimism of \hat{r}_{t} for all t\in[T]. From the above discussion, we obtain

\displaystyle R^{\alpha}(T)\leq C\left(R^{alg}(T)+R^{est}(T)+R^{con}(T)\right),

where

\displaystyle R^{alg}(T)=\sum_{t\in[T]}\sum_{i\in I_{t}\setminus J^{\prime}_{t}}\left(\hat{r}_{t}(i)-\hat{\theta}_{t}^{\top}x_{t}(i)\right),
\displaystyle R^{est}(T)=\sum_{t\in[T]}\sum_{i\in I_{t}\setminus J^{\prime}_{t}}\left(\hat{\theta}_{t}-\theta^{*}\right)^{\top}x_{t}(i),\quad\mathrm{and}
\displaystyle R^{con}(T)=\sum_{t\in[T]}\sum_{i\in J^{\prime}_{t}}\left(\hat{r}_{t}(i)-{\theta^{*}}^{\top}x_{t}(i)\right).

We will bound each of them separately.

For R^{con}(T), it follows from the definition of B and Lemma 6 that

\displaystyle R^{con}(T)\leq 2B\sum_{t\in[T]}|J^{\prime}_{t}|
\displaystyle\leq 4Bdk\log(1+L^{2}kT/(d\lambda)).

Similarly to Claim 1 and Claim 2, we obtain

\displaystyle R^{alg}(T)=O\left(\beta_{T}(\delta)\sqrt{dkT\log\left(1+\frac{L^{2}kT}{d\lambda}\right)}\right)\quad\mathrm{and}
\displaystyle R^{est}(T)=O\left(\beta_{T}(\delta)\sqrt{dkT\log\left(1+\frac{L^{2}kT}{d\lambda}\right)}\right).

This completes the proof. ∎