Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions
Abstract
The contextual combinatorial semi-bandit problem with linear payoff functions is a decision-making problem in which a learner chooses a set of arms with feature vectors in each round under given constraints so as to maximize the sum of rewards of arms. Several existing algorithms have regret bounds that are optimal with respect to the number of rounds $T$. However, there is a gap of $\tilde{O}(\sqrt{dk})$ between the current best upper and lower bounds, where $d$ is the dimension of the feature vectors, $k$ is the number of the chosen arms in a round, and $\tilde{O}(\cdot)$ ignores the logarithmic factors. The dependence on $k$ and $d$ is of practical importance because $k$ may be larger than $T$ in real-world applications such as recommender systems. In this paper, we fill the gap by improving the upper and lower bounds. More precisely, we show that the C2UCB algorithm proposed by Qin, Chen, and Zhu (2014) has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints. For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C2UCB algorithm and demonstrate that it enjoys the optimal regret bound for a more general problem that can take into account other objectives simultaneously. We also show that our technique would be applicable to related problems. Numerical experiments support our theoretical results and considerations.
Introduction
Table 1: Regret bounds for the CCS problem.

| | Upper bound | Lower bound |
|---|---|---|
| The best known | $\tilde{O}(dk\sqrt{T})$ (Qin, Chen, and Zhu 2014; Takemura and Ito 2019) | $\Omega(\sqrt{dkT})$ (Kveton et al. 2015) |
| This work | $\tilde{O}(d\sqrt{kT} + dk)$ | $\Omega(d\sqrt{kT} + dk)$ |
This paper investigates the contextual combinatorial semi-bandit problem with linear payoff functions, which we call the CCS problem (Qin, Chen, and Zhu 2014; Takemura and Ito 2019; Wen, Kveton, and Ashkan 2015). In this problem, a learner iterates the following process $T$ times. First, the learner observes $N$ $d$-dimensional vectors, called arms, and a set of feasible combinations of arms, where the size of each combination is $k$. Each arm offers a reward defined by a common linear function over the arms, but the reward is not revealed to the learner at this point. Next, the learner chooses a feasible combination of arms. At the end, the learner observes the rewards of the chosen arms. The objective of the learner is to maximize the sum of rewards.
The CCS problem includes the linear bandit (LB) problem (Abbasi-Yadkori, Pál, and Szepesvári 2011; Agrawal and Goyal 2013; Auer 2002; Chu et al. 2011; Dani, Hayes, and Kakade 2008) and the combinatorial semi-bandit (CS) problem (Chen et al. 2016a, b; Combes et al. 2015; Gai, Krishnamachari, and Jain 2012; Kveton et al. 2015; Wang et al. 2017; Wen, Kveton, and Ashkan 2015) as special cases. (Here, the CS problem denotes the problem of maximizing the sum of rewards (Combes et al. 2015; Kveton et al. 2014, 2015), while Chen et al. (2016a, b) deal with a more general objective.) The difference from the LB problem is that, in the CCS problem, the learner chooses multiple arms at once. Moreover, while the given arms are fixed over the rounds and orthogonal to each other in the CS problem, they may change in each round and be correlated in the CCS problem.
These differences enable the CCS problem to model more realistic situations in applications such as routing networks (Kveton et al. 2014), shortest paths (Gai, Krishnamachari, and Jain 2012; Wen, Kveton, and Ashkan 2015), and recommender systems (Li et al. 2010; Qin, Chen, and Zhu 2014; Wang et al. 2017). For example, when a recommender system is modeled with the LB problem, it is assumed that once a recommendation result is obtained, the internal predictive model is updated before the next recommendation. However, in a real recommender system, it is more common to update the predictive model after multiple recommendations, e.g., periodic updates (Chapelle and Li 2011). Such a situation can be modeled with the CCS problem, where the number of recommendations between the updates is $k$ and the number of the updates is $T$ (Takemura and Ito 2019). (Strictly speaking, the LB problem with periodic updates is a little more restrictive than the CCS problem. However, most algorithms for the CCS problem, including the ones proposed in this paper, are applicable to that problem.)
As in numerous previous studies on bandit algorithms, we measure the performance of an algorithm by its regret, which is the difference between the sum of the rewards of the optimal choices and that of the algorithm's choices. The existing regret bounds are summarized in Table 1, where $\tilde{O}(\cdot)$ means that the logarithmic factors are ignored. The best known upper bound on the regret, $\tilde{O}(dk\sqrt{T})$, is achieved by the C2UCB algorithm, which is given by Qin, Chen, and Zhu (2014). Takemura and Ito (2019) refined their analysis to improve the dependence on other parameters in the regret bound. The best lower bound, $\Omega(\sqrt{dkT})$, is given for the CS problem by Kveton et al. (2015). Note that any lower bound for the CS problem is also a lower bound for the CCS problem, as the CCS problem covers the CS problem.
Although these regret upper and lower bounds match with respect to $T$, there is a gap of $\tilde{O}(\sqrt{dk})$ between them. In the literature on regret analysis, the degree of dependence on $d$ in the regret bound usually draws much attention. However, for the CCS problem, the degree of dependence on $k$ is also important because there are real-world applications of the CCS problem in which $k$ is large. In recommender systems with periodic updates, for example, the number of recommendations between the updates could be large. An alternative example is the sending promotion problem, in which the number of users to send a promotion to at once is much larger than the number of times the promotion is sent, i.e., $k \gg T$ (Takemura and Ito 2019).
Our contribution is two-fold. First, we improve the dependence on $d$ and $k$ in both the regret upper and lower bounds. Our new upper and lower bounds match up to logarithmic factors. Second, we clarify a drawback of UCB-type algorithms for other related problems and propose general techniques to overcome the drawback.
To improve the upper bound of the CCS problem, we first revisit the C2UCB algorithm. This algorithm optimistically estimates the rewards of arms using confidence intervals of the estimates and then chooses a set of arms based on the optimistic estimates. Existing upper bounds have a factor linear in $k$, which leads to the gap from the lower bound. In our analysis, however, we reveal that the linear dependence on $k$ in the regret comes from the arms with large confidence intervals, and we obtain an $\tilde{O}(d\sqrt{kT} + dk\min(\sqrt{d}, k))$ regret bound by handling such arms separately. For further improvement, we focus on the case where the feasible combinations of arms are given by partition matroids. We show that the algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ in this case. Unfortunately, this analysis cannot be applied to general constraints, and we do not know whether the C2UCB algorithm achieves the optimal regret upper bound in general. Instead, based on these analyses, we propose another algorithm that estimates the rewards of arms with large confidence intervals more carefully; the algorithm divides the given arms into two groups based on their confidence intervals and underestimates the rewards of the arms with large confidence intervals. We show that the proposed algorithm enjoys the optimal regret bound for the CCS problem with any feasible combinations of arms, and that it is also optimal for a more general problem that can take into account both the sum of rewards and other objectives. For example, recommender systems often require diversity of the recommended items (Qin and Zhu 2013; Qin, Chen, and Zhu 2014).
We support our theoretical analysis through numerical experiments. We first evaluate the performance of the algorithms on instances in which constraints are not represented by the partition matroid. We observe that the proposed algorithm is superior to the C2UCB algorithm on these instances, which confirms our theoretical analysis that the C2UCB algorithm may not achieve the optimal regret bound while our proposed algorithm does. We also evaluate the algorithms on instances with partition matroid constraints. For these instances, we observe that the C2UCB and our proposed algorithms perform similarly.
Our theoretical and numerical analyses indicate that the sub-optimality of the C2UCB algorithm arises from the combinatorial structure of the CCS problem, i.e., choosing a set of arms in each round. More precisely, the existence of an arm with a confidence interval that is too large makes the algorithm choose a bad set of arms. This is an interesting phenomenon that does not occur in the LB problem (the CCS problem when $k = 1$) or in the case of partition matroid constraints. Since the technique we propose for the CCS problem is general enough to be independent of the linearity of the payoff functions, we believe it could be generalized to overcome the same issue for other semi-bandit problems.
Problem Setting
In this section, we present the formal definition of the CCS problem and the required assumptions. The CCS problem consists of $T$ rounds. Let $N$ denote the number of arms; each arm is indexed by an integer in $[N] = \{1, 2, \ldots, N\}$. We denote by $\mathcal{I}_t$ the set of combinations of arms the learner can choose in the $t$-th round. We assume that each combination is of size $k$. Thus, $\mathcal{I}_t \subseteq \{I \subseteq [N] : |I| = k\}$.
The learner progresses through each round as follows. At the beginning of the $t$-th round, the learner observes the set of arms with the associated feature vectors $\{a_t(i)\}_{i \in [N]} \subseteq \mathbb{R}^d$ and the set of feasible combinations of arms $\mathcal{I}_t$. Then, the learner chooses $I_t \in \mathcal{I}_t$. At the end of the round, the learner obtains the rewards $\{r_t(i)\}_{i \in I_t}$, where $r_t(i) = \theta^{*\top} a_t(i) + \eta_t(i)$ for all $i \in I_t$ for some unknown $\theta^* \in \mathbb{R}^d$, and $\eta_t(i)$ is a random noise with zero mean.
We evaluate the performance of an algorithm by the expected regret $R(T)$, which is defined as
$$R(T) = \mathbb{E}\left[\sum_{t=1}^{T}\left(\sum_{i \in I_t^*} \theta^{*\top} a_t(i) - \sum_{i \in I_t} \theta^{*\top} a_t(i)\right)\right],$$
where $I_t^* \in \arg\max_{I \in \mathcal{I}_t} \sum_{i \in I} \theta^{*\top} a_t(i)$.
Assumption 1.
For all $t \in [T]$ and $i \in [N]$, the random noise $\eta_t(i)$ is conditionally $R$-sub-Gaussian, i.e.,
$$\mathbb{E}\left[\exp(\xi \eta_t(i)) \mid \mathcal{H}_{t,i}\right] \le \exp\left(\frac{\xi^2 R^2}{2}\right) \quad \text{for all } \xi \in \mathbb{R},$$
where $\mathcal{H}_{t,i}$ denotes the history of the feature vectors, the chosen arms, and the rewards observed before $r_t(i)$ is revealed.
In addition, we define the following parameters of the CCS problem: (i) $L > 0$ such that $\|a_t(i)\|_2 \le L$ for all $t \in [T]$ and $i \in [N]$, (ii) $S > 0$ such that $\|\theta^*\|_2 \le S$, and (iii) $C > 0$ such that $|\theta^{*\top} a_t(i)| \le C$ for all $t \in [T]$ and $i \in [N]$. Note that $LS$ is an obvious upper bound of $C$.
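For concreteness, the following is a minimal simulation of one round of this protocol. It is an illustration we add here, not part of the original formulation: the Gaussian noise and the cardinality constraint (any $k$ of the $N$ arms) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 50, 5, 10
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)          # so that S = 1

arms = rng.normal(size=(N, d))                    # feature vectors a_t(i) of one round
chosen = rng.choice(N, size=k, replace=False)     # some feasible combination I_t
expected = arms @ theta_star                      # expected rewards theta*^T a_t(i)
rewards = expected[chosen] + rng.normal(size=k)   # observed semi-bandit feedback
round_regret = np.sort(expected)[-k:].sum() - expected[chosen].sum()
print(round_regret)
```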
Regret Analysis of the C2UCB Algorithm
Existing Analyses
Qin, Chen, and Zhu (2014) proposed the C2UCB algorithm (Algorithm 1), which chooses a set of arms based on optimistically estimated rewards in a similar way to other UCB-type algorithms (Auer 2002; Chen et al. 2016b; Chu et al. 2011; Li et al. 2010).
The C2UCB algorithm works as follows. At the beginning of each round, it constructs an estimator $\hat\theta_t$ of $\theta^*$ using the arms chosen so far and their rewards (line 3). It then computes an optimistic reward estimate $\hat r_t(i) = \hat\theta_t^\top a_t(i) + \alpha_t \|a_t(i)\|_{V_{t-1}^{-1}}$ for each observed arm (line 6), where $\alpha_t \|a_t(i)\|_{V_{t-1}^{-1}}$ represents the size of the confidence interval of the estimated reward of arm $i$. Then, it chooses the set of arms obtained by solving the optimization problem $\max_{I \in \mathcal{I}_t} \sum_{i \in I} \hat r_t(i)$ (line 8). Finally, it observes the rewards of the chosen arms and updates the internal parameters $V_t$ and $b_t$ (line 9).
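Since the pseudocode of Algorithm 1 is not reproduced above, the following is a minimal Python sketch of one round of C2UCB under the cardinality constraint, for which the exact oracle is a top-$k$ selection. The function names and the top-$k$ oracle are our own choices for the illustration.

```python
import numpy as np

def c2ucb_round(V, b, arms, k, alpha_t):
    """One round of C2UCB (sketch). V: (d, d) Gram matrix, b: (d,) response
    vector, arms: (N, d) feature matrix, k: number of arms to choose."""
    theta_hat = np.linalg.solve(V, b)                      # line 3: ridge estimate
    V_inv = np.linalg.inv(V)
    widths = np.sqrt(np.einsum("nd,de,ne->n", arms, V_inv, arms))
    r_hat = arms @ theta_hat + alpha_t * widths            # line 6: optimistic estimates
    return np.argsort(r_hat)[-k:], theta_hat               # line 8: oracle (top-k here)

def c2ucb_update(V, b, arms, chosen, rewards):
    """Line 9: update V and b with the observed semi-bandit feedback."""
    for i, r in zip(chosen, rewards):
        V += np.outer(arms[i], arms[i])
        b += r * arms[i]
    return V, b
```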
Qin, Chen, and Zhu (2014) showed that the algorithm admits a sublinear regret bound with respect to $T$. Takemura and Ito (2019) refined their analysis to improve the dependence on $S$, $L$, and $\lambda$ as follows. Here, for $t \in [T]$, we define $\alpha_t = R\sqrt{d \log\left(\left(1 + ktL^2/\lambda\right)/\delta\right)} + \sqrt{\lambda} S$.
Theorem 1 (Theorem 4 of Takemura and Ito (2019); the original versions of Theorem 1 and Lemma 1 assume $L \le 1$, but it is possible to obtain these results by scaling $a_t(i)$ and $\lambda$ without modifying the proof).
If $\lambda \ge \max(1, L^2)$ and $\alpha_t$ is set as defined above for all $t \in [T]$, the C2UCB algorithm has the following regret bound with probability at least $1 - \delta$:
$$R(T) = O\left(\alpha_T k \sqrt{dT \log\left(1 + \frac{kTL^2}{d\lambda}\right)}\right) = \tilde{O}\left(dk\sqrt{T}\right).$$
To prove Theorem 1, it suffices to bound the cumulative estimation error of the rewards, i.e., $\sum_{t=1}^{T} \sum_{i \in I_t} (\hat r_t(i) - \theta^{*\top} a_t(i))$. Let $x_{t,i}$ denote the feature vector of the $i$-th arm chosen in the $t$-th round for all $t \in [T]$ and $i \in [k]$. To bound the error, Takemura and Ito (2019) showed that
$$\sum_{t=1}^{T} \sum_{i=1}^{k} \left(\hat r_t(i) - \theta^{*\top} x_{t,i}\right) \le 2\alpha_T \sum_{t=1}^{T} \sum_{i=1}^{k} \min\left(1, \|x_{t,i}\|_{V_{t-1}^{-1}}\right). \tag{1}$$
The right-hand side is then bounded by the following lemma:
Lemma 1 (Lemma 5 of Takemura and Ito (2019)).
Let $\lambda > 0$. Let $\{x_{t,i}\}_{t \in [T], i \in [k]}$ be any sequence such that $x_{t,i} \in \mathbb{R}^d$ and $\|x_{t,i}\|_2 \le L$ for all $t \in [T]$ and $i \in [k]$. Let $V_t = \lambda I + \sum_{s=1}^{t} \sum_{i=1}^{k} x_{s,i} x_{s,i}^\top$ for all $t \in \{0, 1, \ldots, T\}$. Then, we have
$$\sum_{t=1}^{T} \sum_{i=1}^{k} \min\left(1, \|x_{t,i}\|^2_{V_{t-1}^{-1}}\right) \le 2dk \log\left(1 + \frac{kTL^2}{d\lambda}\right).$$
This bound is tight up to logarithmic factors because we have $\sum_{t=1}^{T} \sum_{i=1}^{k} \min(1, \|x_{t,i}\|^2_{V_{t-1}^{-1}}) = \Omega(dk)$ when $T \ge d$, $\lambda = 1$, and $x_{t,i} = e_{((t-1) \bmod d) + 1}$ for all $t \in [T]$ and $i \in [k]$, where for all $j \in [d]$, $e_j$ is a vector in which the $j$-th element is 1 and the other elements are 0.
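The following short computation (our own check, not part of the original text) reproduces this example numerically: with $\lambda = 1$ and $k$ copies of a cycling basis vector per round, the sum is of order $dk$.

```python
import numpy as np

d, k, T, lam = 20, 5, 40, 1.0
V = lam * np.eye(d)
total = 0.0
for t in range(T):
    x = np.zeros(d)
    x[t % d] = 1.0                        # all k arms in round t equal e_{(t mod d) + 1}
    w2 = x @ np.linalg.solve(V, x)        # squared width ||x||^2_{V_{t-1}^{-1}}
    total += k * min(1.0, w2)             # k identical arms per round
    V += k * np.outer(x, x)               # V is updated only after the round
print(total, d * k)                       # total exceeds d * k, matching Omega(dk)
```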
Improved Regret Bound
In this subsection, we improve the regret bound of the C2UCB algorithm. A key observation of our analysis is that Lemma 1 is not tight for sufficiently large $T$. To improve Lemma 1, we divide the pairs $(t, i)$ into two groups: the family of pairs such that $\|x_{t,i}\|^2_{V_{t-1}^{-1}} \le 1/k$, and the others. As shown in Lemma 2 below, the sum of $\|x_{t,i}\|^2_{V_{t-1}^{-1}}$ over the former group is $O(d \log(kT))$, which is smaller than the bound of Lemma 1 by a factor of $k$. Moreover, the number of arms in the latter group is shown to be $O(dk \log(kT))$, which means that not so many arms have large $\|x_{t,i}\|_{V_{t-1}^{-1}}$.
Lemma 2.
Let $\lambda > 0$. Let $\{x_{t,i}\}_{t \in [T], i \in [k]}$ be any sequence such that $x_{t,i} \in \mathbb{R}^d$ and $\|x_{t,i}\|_2 \le L$ for all $t \in [T]$ and $i \in [k]$. Let $V_t = \lambda I + \sum_{s=1}^{t} \sum_{i=1}^{k} x_{s,i} x_{s,i}^\top$ for all $t \in \{0, 1, \ldots, T\}$. Then, we have
$$\sum_{(t, i)\,:\, \|x_{t,i}\|^2_{V_{t-1}^{-1}} \le 1/k} \|x_{t,i}\|^2_{V_{t-1}^{-1}} \le 2d \log\left(1 + \frac{kTL^2}{d\lambda}\right) \tag{2}$$
and
$$\left|\left\{(t, i) \in [T] \times [k] : \|x_{t,i}\|^2_{V_{t-1}^{-1}} > \frac{1}{k}\right\}\right| \le 2dk \log\left(1 + \frac{kTL^2}{d\lambda}\right). \tag{3}$$
Based on Lemma 2, we can bound the right-hand side of (1) to obtain a better regret upper bound. The regret bound given by this theorem is optimal when $T = \Omega(dk)$.
Theorem 2.
If $\lambda \ge \max(1, L^2)$ and $\alpha_t$ is set as in Theorem 1, the C2UCB algorithm has the following regret bound with probability at least $1 - \delta$:
$$R(T) = \tilde{O}\left(d\sqrt{kT} + dk \min\left(\sqrt{d}, k\right)\right),$$
where $\tilde{O}(\cdot)$ ignores logarithmic factors and treats $R$, $S$, $L$, and $C$ as constants.
Proof sketch.
Let $J = \{(t, i) : i \in I_t,\ \|a_t(i)\|^2_{V_{t-1}^{-1}} \le 1/k\}$ and $\bar J = \{(t, i) : i \in I_t\} \setminus J$. We separate the chosen arms into two groups: $J$ and the remaining arms. (To show the regret bound of the LinUCB algorithm (Chu et al. 2011; Li et al. 2010), i.e., the C2UCB algorithm for the case $k = 1$, Lattimore and Szepesvári (2020) take a similar approach in the note of Exercise 19.3.) For $J$, replacing Lemma 1 with Lemma 2 in the proof of Theorem 1 gives the first term of the regret bound. There are two ways to bound the regret caused by the other group. In one way, we use the same proof as for the former group, which yields $\tilde{O}(\alpha_T dk) = \tilde{O}(d^{3/2}k)$. In the other way, by Lemma 2, we bound the number of rounds in which the arms of this group are chosen by $\tilde{O}(dk)$. Then, we use an upper bound of the regret in a single round, which is $O(Ck)$. Thus, we obtain $\tilde{O}(Cdk^2)$ in this way. The second term of the regret bound is obtained by combining these two ways. ∎
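In symbols, writing $m$ for the number of chosen arms with $\|a_t(i)\|^2_{V_{t-1}^{-1}} > 1/k$ (so that $m = \tilde{O}(dk)$ by Lemma 2), the combination of the two ways reads as follows (our own summary of the sketch, with $R$, $S$, $L$, and $C$ treated as constants):

```latex
\underbrace{\min\left\{ 2\alpha_T m,\; 2Ck\, m \right\}}_{\text{regret from the arms with large confidence intervals}}
= 2m \min\left(\alpha_T, Ck\right)
= \tilde{O}\left(dk \min\left(\sqrt{d}, k\right)\right).
```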
Next, we show that Theorem 2 is better than Theorem 1. We first consider the case $\min(\sqrt{d}, k) \le \sqrt{T}$. In this case, the second term of Theorem 2 is at most $dk\sqrt{T}$, so the bound of Theorem 2 is no worse than that of Theorem 1, and its first term $d\sqrt{kT}$ improves the bound of Theorem 1 by a factor of $\sqrt{k}$. In the other case, we have $T < \min(d, k^2)$; in this regime, the C2UCB algorithm has the obvious regret upper bound $O(CkT)$, which is smaller than the bounds of both theorems, and thus the comparison is vacuous. Hence, in every meaningful regime, the guarantee of Theorem 2 is at least as good as that of Theorem 1.
Improved Regret Bound for the CCS Problem with Partition Matroid Constraints
In this subsection, we show that the C2UCB algorithm admits an improved regret upper bound for the CCS problem with the partition matroid constraints, which matches the regret lower bound shown in Table 1.
Now we define the partition matroid constraint. Let $\{P_j\}_{j=1}^{m}$ be a partition of $[N]$ into $m$ subsets. Let $\{c_j\}_{j=1}^{m}$ be a set of natural numbers such that $\sum_{j=1}^{m} c_j = k$. Then the partition matroid constraint is defined from $\{P_j\}_{j=1}^{m}$ and $\{c_j\}_{j=1}^{m}$ as
$$\mathcal{I} = \left\{ I \subseteq [N] : |I \cap P_j| = c_j \text{ for all } j \in [m] \right\}. \tag{4}$$
Such an $\mathcal{I}$ is known as the set of the bases of a partition matroid. It is also known that linear optimization problems over a partition matroid constraint can be solved by the greedy algorithm, as sketched below. The class of such $\mathcal{I}$ is large enough to include many fundamental classes. Indeed, the CCS problem with these constraints covers the CCS problem with the uniform matroid constraints (i.e., the cardinality constraint), obtained when $m = 1$ and $c_1 = k$ for all $t \in [T]$, and the LB problem with periodic updates, obtained when $m = k$ and $c_j = 1$ for all $j \in [m]$ and $t \in [T]$.
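The greedy algorithm mentioned above is simple enough to state as code; the following sketch (our own illustration) picks the $c_j$ highest-scoring arms inside each part.

```python
from collections import defaultdict

def greedy_partition_matroid(scores, part, c):
    """Maximize a linear objective over the bases of a partition matroid:
    pick the c[j] highest-scoring arms inside each part j."""
    groups = defaultdict(list)
    for i, j in enumerate(part):          # part[i] = index of the part containing arm i
        groups[j].append(i)
    chosen = []
    for j, arms in groups.items():
        arms.sort(key=lambda i: scores[i], reverse=True)
        chosen.extend(arms[: c[j]])
    return chosen

# Two parts {0, 1, 2} and {3, 4}; choose 2 arms from the first and 1 from the second.
print(greedy_partition_matroid([0.3, 0.9, 0.5, 0.1, 0.8], [0, 0, 0, 1, 1], {0: 2, 1: 1}))
```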
We show that the C2UCB algorithm achieves the optimal regret bound for the CCS problem with constraints satisfying (4):
Theorem 3.
Assume that $\mathcal{I}_t$ is defined by (4) for all $t \in [T]$. Then, if $\lambda \ge \max(1, L^2)$ and $\alpha_t$ is set as in Theorem 1, the C2UCB algorithm has the following regret bound with probability at least $1 - \delta$:
$$R(T) = \tilde{O}\left(d\sqrt{kT} + Cdk\right).$$
Proof sketch.
Recall that $I_t$ is the set of arms chosen by the C2UCB algorithm in the $t$-th round. Let $J_t = \{i \in I_t : \|a_t(i)\|^2_{V_{t-1}^{-1}} \le 1/k\}$ and $\bar J_t = I_t \setminus J_t$. As in the proof of Theorem 2, we separate the chosen arms into two groups: $J_t$ and $\bar J_t$. From the definition of (4) and the optimality of $I_t$, we obtain that $I_t \cap P_j$ consists of the $c_j$ arms with the largest optimistic estimates $\hat r_t(i)$ in $P_j$ for all $j \in [m]$, where $\hat r_t(i) = \hat\theta_t^\top a_t(i) + \alpha_t \|a_t(i)\|_{V_{t-1}^{-1}}$.
Let $\tilde I_t$ be a subset of $[N]$ that consists of the arms in $\bar J_t$, and arms chosen arbitrarily from $I_t^* \setminus \bar J_t$ for each part $P_j$ so that $|\tilde I_t \cap P_j| = c_j$. Then, $\tilde I_t \in \mathcal{I}_t$ and $\bar J_t \subseteq \tilde I_t$ for all $t \in [T]$. Similarly to $I_t$, we divide $\tilde I_t$ into $\bar J_t$ and $\tilde J_t = \tilde I_t \setminus \bar J_t$. This gives
$$\sum_{i \in I_t^*} \theta^{*\top} a_t(i) - \sum_{i \in I_t} \theta^{*\top} a_t(i) = \left(\sum_{i \in \tilde J_t} \theta^{*\top} a_t(i) - \sum_{i \in J_t} \theta^{*\top} a_t(i)\right) + \left(\sum_{i \in I_t^*} \theta^{*\top} a_t(i) - \sum_{i \in \tilde I_t} \theta^{*\top} a_t(i)\right).$$
The former term in the right-hand side of this equation is $\tilde{O}(d\sqrt{kT})$ in total by the optimality of $I_t$ and the discussion in the proof of Theorem 2. The latter term is $\tilde{O}(Cdk)$ in total by the definition of $\bar J_t$ and Lemma 2. ∎
Note that for the LB problem with periodic updates, the C2UCB algorithm reduces to the LinUCB algorithm (Chu et al. 2011; Li et al. 2010) with periodic updates, and it has the optimal regret bound. Note also that we can show a similar result for related problems whenever we have a UCB-type algorithm and an upper bound on the number of chosen arms that have large confidence bounds.
Proposed Algorithm
In this section, we propose an algorithm for a more general problem than the CCS problem. We show that the proposed algorithm achieves the optimal regret bound for this general problem.
First, let us define the general CCS problem. Let $\bar r_t(i) = \theta^{*\top} a_t(i)$ and $f_t : \mathbb{R}^N \times \mathcal{I}_t \to \mathbb{R}$ for all $t \in [T]$. In this problem, the learner aims to maximize the sum of the values $f_t(r_t, I_t)$ instead of the sum of rewards, where $f_t$ measures the quality of the chosen arms. As in Qin, Chen, and Zhu (2014), we assume that the learner has access to an $\alpha$-approximation oracle $O_t$, which, given estimated rewards $r \in \mathbb{R}^N$, provides $I \in \mathcal{I}_t$ such that $f_t(r, I) \ge \alpha \max_{I' \in \mathcal{I}_t} f_t(r, I')$ for some $\alpha \in (0, 1]$. Thus, we evaluate the performance of an algorithm by the $\alpha$-regret $R_\alpha(T)$, which is defined as
$$R_\alpha(T) = \mathbb{E}\left[\sum_{t=1}^{T} \left(\alpha f_t(\bar r_t, I_t^*) - f_t(\bar r_t, I_t)\right)\right],$$
where $I_t^* \in \arg\max_{I \in \mathcal{I}_t} f_t(\bar r_t, I)$. Note that the regret of the CCS problem is recovered if $\alpha = 1$ and $f_t$ is the sum of rewards. We make the following assumptions, which are almost identical to those in Qin, Chen, and Zhu (2014).
Assumption 2.
For all $t \in [T]$ and $I \in \mathcal{I}_t$, if a pair of reward vectors $r, r' \in \mathbb{R}^N$ satisfies $r(i) \le r'(i)$ for all $i \in I$, we have $f_t(r, I) \le f_t(r', I)$.
Assumption 3.
There exists a constant $B > 0$ such that for all $t \in [T]$, all $I \in \mathcal{I}_t$, and any pair of reward vectors $r, r' \in \mathbb{R}^N$, we have $|f_t(r, I) - f_t(r', I)| \le B \sum_{i \in I} |r(i) - r'(i)|$.
The class of functions that satisfy the assumptions includes practically useful functions. For example, the sum of rewards with the entropy regularizer (Qin and Zhu 2013; Qin, Chen, and Zhu 2014), which has been applied to recommender systems in order to take into account both the sum of rewards and the diversity of the chosen arms, satisfies the assumptions with $B = 1$.
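To illustrate, the following is a small sketch of an objective in the spirit of the entropy regularizer: the reward sum plus the entropy of the category distribution of the chosen arms. The category assignment and weight are hypothetical. Since the regularizer does not depend on $r$, Assumptions 2 and 3 hold with $B = 1$.

```python
import numpy as np

def f_entropy(r, I, category, weight=1.0):
    """Sum of rewards plus the entropy of the category distribution of the
    chosen arms. r: rewards of all arms, I: chosen arms, category[i]: category
    of arm i. The entropy term rewards diverse sets and is independent of r."""
    reward_sum = sum(r[i] for i in I)
    counts = np.bincount([category[i] for i in I])
    p = counts[counts > 0] / len(I)
    return reward_sum + weight * (-np.sum(p * np.log(p)))

# Three arms from two categories; more diverse sets receive a bonus.
print(f_entropy(np.array([0.5, 0.4, 0.3]), [0, 1, 2], [0, 0, 1]))
```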
The proposed algorithm is described in Algorithm 2. When $f_t$ is the sum of rewards, the only difference between the C2UCB algorithm and the proposed algorithm is the definition of $\hat r_t(i)$. We show the effectiveness of this difference. In the analysis of the C2UCB algorithm, the regret can be decomposed as
$$R_\alpha(T) \le \mathbb{E}\left[\sum_{t=1}^{T} \left(\alpha f_t(\hat r_t, I_t^*) - f_t(\hat r_t, I_t)\right)\right] + \mathbb{E}\left[\sum_{t=1}^{T} \left(f_t(\hat r_t, I_t) - f_t(\bar r_t, I_t)\right)\right]$$
(the inequality uses Assumption 2 and the optimism $\hat r_t(i) \ge \bar r_t(i)$), and the first term can be bounded by 0 since $I_t$ is an (approximately) optimal solution to the problem $\max_{I \in \mathcal{I}_t} f_t(\hat r_t, I)$. Then, the right-hand side is bounded by
$$\mathbb{E}\left[\sum_{t=1}^{T} \left(f_t(\hat r_t, I_t) - f_t(\bar r_t, I_t)\right)\right] \le B \, \mathbb{E}\left[\sum_{t=1}^{T} \left(\sum_{i \in J_t} \left(\hat r_t(i) - \bar r_t(i)\right) + \sum_{i \in \bar J_t} \left(\hat r_t(i) - \bar r_t(i)\right)\right)\right],$$
where we recall that $\bar J_t$ is the set of chosen arms such that $\|a_t(i)\|^2_{V_{t-1}^{-1}} > 1/k$, and $J_t = I_t \setminus \bar J_t$. In the proof of Theorem 2, the first term of the right-hand side is shown to be $\tilde{O}(d\sqrt{kT})$, which is optimal, while the second term can be $\tilde{O}(dk \min(\sqrt{d}, k))$. The reason the second term is so large is that each arm in $\bar J_t$ may have an overly optimistic reward estimate (i.e., $\hat r_t(i)$ may be large). To overcome this issue, we reduce $\hat r_t(i)$ when arm $i$ has an overly optimistic reward estimate, while keeping the reduced value an optimistic estimate as required by UCB-type algorithms. As described in Algorithm 2, we adopt the maximum value $C$ of the expected rewards as $\hat r_t(i)$ when $\|a_t(i)\|^2_{V_{t-1}^{-1}} > 1/k$.
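The following is a minimal sketch (our own illustration; the names are hypothetical) of the modified estimates: the usual optimistic estimate for arms with small confidence widths, and the cap $C$ for the rest, where the threshold $1/k$ matches the split used in our analysis.

```python
import numpy as np

def proposed_estimates(theta_hat, V_inv, arms, alpha_t, k, C):
    """Reward estimates of the proposed algorithm (sketch): the usual UCB score
    for arms with small confidence widths, and the known upper bound C on the
    expected rewards for arms with large widths."""
    widths_sq = np.einsum("nd,de,ne->n", arms, V_inv, arms)
    ucb = arms @ theta_hat + alpha_t * np.sqrt(widths_sq)
    return np.where(widths_sq <= 1.0 / k, ucb, C)
```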
Similarly to the above, we can show that the proposed algorithm (Algorithm 2) has the following regret bound:
Theorem 4.
If $\lambda \ge \max(1, L^2)$ and $\alpha_t$ is set as in Theorem 1, the proposed algorithm has the following regret bound with probability at least $1 - \delta$:
$$R_\alpha(T) = \tilde{O}\left(B\left(d\sqrt{kT} + Cdk\right)\right).$$
We show that this regret bound is optimal. We can define an instance of the general problem with any $B > 0$ from any instance of the CCS problem. Indeed, for any $B > 0$, we can define $f_t(r, I) = B \sum_{i \in I} r(i)$. Thus, the optimal degree of dependence on $B$ in the regret is linear. For the other parameters, we show the lower bound in the next section.
Lower Bounds
In this section, we show a regret lower bound that matches the regret upper bounds shown in Theorems 3 and 4 up to logarithmic factors. To achieve the lower bound, we mix two types of instances, which provide $\Omega(d\sqrt{kT})$ and $\Omega(dk)$ regret, respectively. While the first type of instance represents the difficulty of learning due to the noise added to the rewards, the second represents the minimum sample size required to learn the $d$-dimensional vector $\theta^*$ in the CCS problem.
We first consider instances that achieve $\Omega(d\sqrt{kT})$ and are analogous to the instances for the LB problem. Since the lower bound of the LB problem over $n$ rounds is known to be $\Omega(d\sqrt{n})$, the CCS problem, in which the number of arms to select in each round is $k$, would yield $\Omega(d\sqrt{kT})$. In these instances, the learner chooses $k$ vertices from a $d$-dimensional hypercube. Note that the duplication of vertices is allowed.
Theorem 5.
Let $N = 2^d$ and let $\{a_t(i)\}_{i \in [N]}$ be the vertices of the hypercube $\{-1, 1\}^d$ for any $t \in [T]$ and $i \in [N]$. Let $\mathcal{I}_t$ be the set of all combinations of $k$ vertices, where the same vertex may be chosen more than once. Assume that $\eta_t(i) \sim \mathcal{N}(0, 1)$ independently. Then, for any algorithm, there exists $\theta^* \in \{-\sqrt{1/(kT)}, \sqrt{1/(kT)}\}^d$ such that $R(T) = \Omega(d\sqrt{kT})$.
Proof.
We first consider instances that achieve the lower bound of the LB problem. Using the discussion of Theorem 24.1 of Lattimore and Szepesvári (2020), we obtain a lower bound of $\Omega(d\sqrt{n})$ for a certain $\theta^* \in \{-\sqrt{1/n}, \sqrt{1/n}\}^d$ when the action set is $\{-1, 1\}^d$ and the number of rounds is $n$. Note that this lower bound holds even if the algorithm knows in advance the given sets of arms of all rounds.
Then, we observe that the set of algorithms for the above instances with $n = kT$ rounds includes any algorithm for the CCS problem, which proves the theorem. ∎
We next introduce the instances of $\Omega(dk)$, based on the fact that no feedback can be received until all $k$ arms of a round are selected in the CCS problem. More specifically, these instances consist of $d/2$ independent 2-armed bandit problems with delayed feedback. In each problem, the learner suffers $\Omega(k)$ regret due to the delayed feedback.
Theorem 6.
Let $N = 2k$ and let $d$ be even. Assume that $T \ge d/2$. Define $d/2$ groups by dividing the $T$ rounds. For each group $j \in [d/2]$, the given arms are defined as $a_t(i) = e_{2j-1}$ for $i \in \{1, \ldots, k\}$ and $a_t(i) = e_{2j}$ for $i \in \{k+1, \ldots, 2k\}$, where $e_j$ is the $j$-th standard basis vector. Let $\mathcal{I}_t = \{I \subseteq [N] : |I| = k\}$. Then, for any algorithm, there exists $\theta^* \in \{0, 1\}^d$ such that $R(T) = \Omega(dk)$.
Proof.
As in Appendix A of Auer et al. (2002), it is sufficient to consider only deterministic algorithms. In the first round of each group, any algorithm selects $k/2$ or more arms from one of the two types of arms. Therefore, we can choose $\theta^*$ so that, for each group, the majority type of the chosen arms is not optimal, in which case the algorithm suffers $\Omega(k)$ regret in that round. Summing over the $d/2$ groups yields $R(T) = \Omega(dk)$. ∎
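As a worked version of the counting in this proof (our own arithmetic, normalizing the per-arm reward gap to 1):

```latex
R(T) \;\ge\; \sum_{j=1}^{d/2} \underbrace{\frac{k}{2}}_{\substack{\text{misplaced arms in the}\\ \text{first round of group } j}} \cdot \underbrace{1}_{\text{per-arm gap}} \;=\; \frac{dk}{4} \;=\; \Omega(dk).
```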
Finally, by combining the two types of instances above, we have instances achieving the matching regret lower bound:
Theorem 7.
For any algorithm, there exists an instance of the CCS problem such that $R(T) = \Omega(d\sqrt{kT} + dk)$.
Proof.
We combine the two constructions: half of the coordinates of $\theta^*$ are allocated to the instances of Theorem 5 and the other half to those of Theorem 6. Any algorithm then suffers the larger of the two lower bounds, and hence $R(T) = \Omega(\max(d\sqrt{kT}, dk)) = \Omega(d\sqrt{kT} + dk)$. ∎
Note that we can set $d$ and $k$ arbitrarily in the instances of Theorem 7, but $S$ and $C$ are then automatically determined by the construction.
Numerical Experiments
Table 2: Compared algorithms and their parameters.

| Algorithm | Parameters |
|---|---|
| $\varepsilon$-greedy | $\varepsilon$ and $\lambda$ |
| C2UCB (Algorithm 1) (Qin, Chen, and Zhu 2014) | $\lambda$ and $\alpha_t$ |
| Thompson sampling (Takemura and Ito 2019) | $\lambda$ and $v_t$ |
| CombLinUCB (Wen, Kveton, and Ashkan 2015) | $\lambda$, $\sigma$, and $c$ |
| CombLinTS (Wen, Kveton, and Ashkan 2015) | $\lambda$ and $\sigma$ |
| Proposed (Algorithm 2) | $\lambda$ and $\alpha_t$ |
Figure 1: Regret of each algorithm on (a) the grouped-type instances and (b) the uniform-matroid-type instances.
Setup
In this section, we evaluate the performance of the C2UCB algorithm and the proposed algorithm through numerical experiments. Two types of instances are prepared: one in which the constraints are not represented by a partition matroid and one in which they are. We call these types the grouped type and the uniform matroid type, respectively. Our analysis suggests that the C2UCB algorithm performs well only on the uniform matroid type, whereas our proposed algorithm does well on both types. The aim of our experiments is to verify this.
Let us explain the details of the instances. The grouped type is given by combining the instances of Theorem 5 and an instance defined as follows. Suppose that $N = 2k$ and that the arms are divided into two groups, $\{1, \ldots, k\}$ and $\{k+1, \ldots, 2k\}$. The feature vectors are defined so that the optimistic reward estimate of an arm in the first group grows rapidly in the early rounds, while the second group is the better one in terms of the expected rewards. The random noise $\eta_t(i)$ follows a standard Gaussian distribution independently for all $t \in [T]$ and $i \in [N]$.
The feasible combinations are defined as $\mathcal{I}_t = \{\{1, \ldots, k\}, \{k+1, \ldots, 2k\}\}$ for all $t \in [T]$; that is, the learner must choose one of the two groups as a whole. Note that this $\mathcal{I}_t$ is not represented by a partition matroid. As for the uniform matroid type, the feasible combinations are defined as $\mathcal{I}_t = \{I \subseteq [N] : |I| = k\}$ for all $t \in [T]$. This is one of the uniform matroid constraints, which form a subclass of the partition matroid constraints. The other parameters are the same as in the grouped type.
We start with small $k$ and $T$, and increase $k$ and $T$ simultaneously. We run 100 simulations to obtain the means of the regrets. We evaluate the performance of an algorithm by the mean regret for the worst $\theta^*$: we compare the means for all candidate $\theta^*$ at the largest number of pulled arms and choose the $\theta^*$ with the largest mean.
We compare the proposed algorithm with five existing algorithms as baselines, using the parameters described in Table 2. The $\varepsilon$-greedy algorithm has two ways of estimating the rewards of the given arms: one is to use values sampled from a standard Gaussian distribution independently, and the other is to estimate the rewards as in line 6 of Algorithm 1 with $\alpha_t = 0$. This algorithm chooses the former way with probability $\varepsilon$ and the latter way otherwise. Then, it plays a set of arms as in line 8 of Algorithm 1, as sketched below.
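A minimal sketch of this baseline's scoring rule (our own illustration; the names are hypothetical):

```python
import numpy as np

def epsilon_greedy_scores(theta_hat, arms, eps, rng):
    """Scores of the epsilon-greedy baseline: random scores with probability
    eps, greedy linear estimates otherwise (i.e., line 6 with alpha_t = 0)."""
    if rng.random() < eps:
        return rng.standard_normal(arms.shape[0])
    return arms @ theta_hat
```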
Results
Figures 1(a) and 1(b) show the relation between the number of pulled arms (i.e., $kT$) and the regret for the grouped type and the uniform matroid type, respectively. Error bars represent the standard error.
As we can see in Figure 1(a), the regret of the proposed algorithm increased most slowly, which indicates that the regrets of the existing and proposed algorithms have different degrees of dependence on the number of pulled arms. We can explain this phenomenon from the viewpoint of overly optimistic estimates of rewards. Since the optimistic reward estimate of an arm in the first group increased exponentially during the early rounds, the C2UCB algorithm often gave that arm an overly optimistic reward in these rounds. It follows from this optimistic estimate that the sum of optimistic rewards in the first group was often greater than that in the other group. Hence, the C2UCB algorithm often chose the sub-optimal group and suffered $\Omega(k)$ regret in such a round. Note that this phenomenon is almost completely independent of the linearity of the linear payoff function, which implies that the negative effect of the overly optimistic estimates could appear in UCB-type algorithms for related problems with semi-bandit feedback.
On the other hand, as shown in Figure 1(b), the regrets of all the algorithms except the $\varepsilon$-greedy algorithm were almost the same. This is because the constraints of the uniform matroid type satisfy condition (4), so the C2UCB algorithm has the optimal regret bound described in Theorem 3. More precisely, as opposed to the grouped type, the regret suffered from the overly optimistic estimates in a round is at most $O(C|\bar J_t|)$, i.e., proportional to the number of chosen arms with large confidence intervals.
Conclusion
We have discussed the CCS problem and shown matching upper and lower bounds of the regret. Our analysis has improved the existing regret bound of the C2UCB algorithm and clarified the negative effect of the overly optimistic estimates of rewards in bandit problems with semi-bandit feedback. We have solved this issue in two ways: introducing partition matroid constraints and providing other optimistic rewards to arms with large confidence intervals. Our theoretical and numerical analyses have demonstrated the impact of the overly optimistic estimation and the effectiveness of our approaches.
As we discussed, the negative effect of the overly optimistic estimation could appear in related problems as well. Since the ideas of our approaches do not depend on the linearity of the linear payoff functions, we believe they are applicable to overly optimistic estimation in related problems.
Although the proposed algorithm achieves the optimal regret bound, it uses $C$ explicitly, as opposed to the C2UCB algorithm. It is an open question whether there exists an algorithm that achieves the optimal regret bound for general constraints without knowledge of a tight upper bound $C$ on $|\theta^{*\top} a_t(i)|$.
Acknowledgements
SI was supported by JST, ACT-I, Grant Number JPMJPR18U5, Japan. TF was supported by JST, PRESTO, Grant Number JPMJPR1759, Japan. NK and KK were supported by JSPS, KAKENHI, Grant Number JP18H05291, Japan.
References
- Abbasi-Yadkori, Pál, and Szepesvári (2011) Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2312–2320.
- Agrawal and Goyal (2013) Agrawal, S.; and Goyal, N. 2013. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, 127–135.
- Auer (2002) Auer, P. 2002. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3(Nov): 397–422.
- Auer et al. (2002) Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 2002. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32(1): 48–77.
- Chapelle and Li (2011) Chapelle, O.; and Li, L. 2011. An empirical evaluation of thompson sampling. In Advances in Neural Information Processing Systems, 2249–2257.
- Chen et al. (2016a) Chen, W.; Hu, W.; Li, F.; Li, J.; Liu, Y.; and Lu, P. 2016a. Combinatorial multi-armed bandit with general reward functions. In Advances in Neural Information Processing Systems, 1659–1667.
- Chen et al. (2016b) Chen, W.; Wang, Y.; Yuan, Y.; and Wang, Q. 2016b. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research 17(1): 1746–1778.
- Chu et al. (2011) Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. 2011. Contextual Bandits with Linear Payoff Functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 208–214.
- Combes et al. (2015) Combes, R.; Shahi, M. S. T. M.; Proutiere, A.; et al. 2015. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, 2116–2124.
- Dani, Hayes, and Kakade (2008) Dani, V.; Hayes, T. P.; and Kakade, S. M. 2008. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory, 355–366.
- Gai, Krishnamachari, and Jain (2012) Gai, Y.; Krishnamachari, B.; and Jain, R. 2012. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking 20(5): 1466–1478.
- Kveton et al. (2014) Kveton, B.; Wen, Z.; Ashkan, A.; Eydgahi, H.; and Eriksson, B. 2014. Matroid bandits: Fast combinatorial optimization with learning. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, 420–429.
- Kveton et al. (2015) Kveton, B.; Wen, Z.; Ashkan, A.; and Szepesvári, C. 2015. Tight regret bounds for stochastic combinatorial semi-bandits. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 535–543.
- Lattimore and Szepesvári (2020) Lattimore, T.; and Szepesvári, C. 2020. Bandit Algorithms. Cambridge University Press.
- Li et al. (2010) Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A Contextual-Bandit Approach to Personalized News Article Recommendation. In Proceedings of the 19th International Conference on World Wide Web, 661–670.
- Qin, Chen, and Zhu (2014) Qin, L.; Chen, S.; and Zhu, X. 2014. Contextual Combinatorial Bandit and its Application on Diversified Online Recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining, 461–469.
- Qin and Zhu (2013) Qin, L.; and Zhu, X. 2013. Promoting diversity in recommendation by entropy regularizer. In Twenty-Third International Joint Conference on Artificial Intelligence.
- Takemura and Ito (2019) Takemura, K.; and Ito, S. 2019. An Arm-Wise Randomization Approach to Combinatorial Linear Semi-Bandits. In Proceedings of the 2019 IEEE International Conference on Data Mining, 1318–1323.
- Wang et al. (2017) Wang, Y.; Ouyang, H.; Wang, C.; Chen, J.; Asamov, T.; and Chang, Y. 2017. Efficient Ordered Combinatorial Semi-Bandits for Whole-Page Recommendation. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2746–2753.
- Wen, Kveton, and Ashkan (2015) Wen, Z.; Kveton, B.; and Ashkan, A. 2015. Efficient Learning in Large-Scale Combinatorial Semi-Bandits. In Proceedings of the 32nd International Conference on Machine Learning, 1113–1122.
Appendix A Proofs
Known Results
Our proofs use the following known results:
Lemma 3 (Theorem 2 of Abbasi-Yadkori, Pál, and Szepesvári (2011)).
Let $\{\mathcal{F}_t\}_{t=0}^{\infty}$ be a filtration, $\{x_t\}_{t=1}^{\infty}$ be an $\mathbb{R}^d$-valued stochastic process such that $x_t$ is $\mathcal{F}_{t-1}$-measurable, and $\{\eta_t\}_{t=1}^{\infty}$ be a real-valued stochastic process such that $\eta_t$ is $\mathcal{F}_t$-measurable. Let $V_0 = \lambda I$ with $\lambda > 0$ be a positive definite matrix, $V_t = V_0 + \sum_{s=1}^{t} x_s x_s^\top$, and $\hat\theta_t = V_t^{-1} \sum_{s=1}^{t} r_s x_s$ with $r_s = \theta^{*\top} x_s + \eta_s$. Assume for all $t$ that $\eta_t$ is conditionally $R$-sub-Gaussian for some $R \ge 0$ and that $\|\theta^*\|_2 \le S$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $t \ge 0$,
$$\left\|\hat\theta_t - \theta^*\right\|_{V_t} \le R \sqrt{2 \log\left(\frac{\det(V_t)^{1/2} \det(V_0)^{-1/2}}{\delta}\right)} + \sqrt{\lambda} S.$$
Furthermore, if $\|x_t\|_2 \le L$ for all $t$, then with probability at least $1 - \delta$, for all $t \ge 0$,
$$\left\|\hat\theta_t - \theta^*\right\|_{V_t} \le R \sqrt{d \log\left(\frac{1 + tL^2/\lambda}{\delta}\right)} + \sqrt{\lambda} S.$$
Lemma 4 (Lemma 10 of Abbasi-Yadkori, Pál, and Szepesvári (2011)).
Suppose $x_1, x_2, \ldots, x_t \in \mathbb{R}^d$ and $\|x_s\|_2 \le L$ for any $s \in [t]$. Let $V_t = \lambda I + \sum_{s=1}^{t} x_s x_s^\top$ for some $\lambda > 0$. Then,
$$\det(V_t) \le \left(\lambda + tL^2/d\right)^d.$$
Proof of Lemma 2
We first show the lemma below, which proves the first part of Lemma 2. Note that this lemma is an extension of Lemma 11 of Abbasi-Yadkori, Pál, and Szepesvári (2011).
Lemma 5.
Let $\{x_{t,i}\}_{t \in [T], i \in [k]}$ be any sequence such that $x_{t,i} \in \mathbb{R}^d$ and $\|x_{t,i}\|_2 \le L$ for all $t \in [T]$ and $i \in [k]$. Let $V_t = V_{t-1} + \sum_{i=1}^{k} x_{t,i} x_{t,i}^\top$ with $V_0 = \lambda I$ for some $\lambda > 0$. Then, we have
$$\sum_{t=1}^{T} \sum_{i=1}^{k} \min\left(\frac{1}{k}, \|x_{t,i}\|^2_{V_{t-1}^{-1}}\right) \le 2d \log\left(1 + \frac{kTL^2}{d\lambda}\right). \tag{5}$$
Accordingly, we have
$$\sum_{t=1}^{T} \sum_{i=1}^{k} \min\left(\frac{1}{\sqrt{k}}, \|x_{t,i}\|_{V_{t-1}^{-1}}\right) \le \sqrt{2dkT \log\left(1 + \frac{kTL^2}{d\lambda}\right)}. \tag{6}$$
Proof.
We have
$$\sum_{i=1}^{k} \min\left(\frac{1}{k}, \|x_{t,i}\|^2_{V_{t-1}^{-1}}\right) \le \min\left(1, \sum_{i=1}^{k} \|x_{t,i}\|^2_{V_{t-1}^{-1}}\right) \le 2 \log\left(1 + \sum_{i=1}^{k} \|x_{t,i}\|^2_{V_{t-1}^{-1}}\right)$$
for all $t \in [T]$, where the second inequality is derived from $\min(1, y) \le 2\log(1 + y)$ for all $y \ge 0$. Then, we have
$$\det(V_t) \ge \det(V_{t-1}) \left(1 + \sum_{i=1}^{k} \|x_{t,i}\|^2_{V_{t-1}^{-1}}\right)$$
for all $t \in [T]$. Therefore, we obtain
$$\sum_{t=1}^{T} \sum_{i=1}^{k} \min\left(\frac{1}{k}, \|x_{t,i}\|^2_{V_{t-1}^{-1}}\right) \le 2 \sum_{t=1}^{T} \log\frac{\det(V_t)}{\det(V_{t-1})} = 2 \log\frac{\det(V_T)}{\det(\lambda I)}.$$
Applying Lemma 4 to the above inequality, we obtain (5). Finally, we obtain (6) by applying the Cauchy–Schwarz inequality to (5). ∎
Lemma 6.
Let $\{x_{t,i}\}_{t \in [T], i \in [k]}$ be any sequence such that $x_{t,i} \in \mathbb{R}^d$ and $\|x_{t,i}\|_2 \le L$ for all $t \in [T]$ and $i \in [k]$. Let $V_t = V_{t-1} + \sum_{i=1}^{k} x_{t,i} x_{t,i}^\top$ with $V_0 = \lambda I$ for some $\lambda > 0$. Then, we have
$$\left|\left\{(t, i) \in [T] \times [k] : \|x_{t,i}\|^2_{V_{t-1}^{-1}} > \frac{1}{k}\right\}\right| \le 2dk \log\left(1 + \frac{kTL^2}{d\lambda}\right).$$
Proof.
Each pair $(t, i)$ with $\|x_{t,i}\|^2_{V_{t-1}^{-1}} > 1/k$ contributes exactly $1/k$ to the left-hand side of (5). Hence, the number of such pairs is at most $k$ times the right-hand side of (5), which proves the lemma. ∎
Proof of Theorem 2
Recall that $\alpha_t = R\sqrt{d \log((1 + ktL^2/\lambda)/\delta)} + \sqrt{\lambda} S$ for $t \in [T]$. Theorem 2 is a corollary of the following theorem.
Theorem 8.
If $\lambda \ge \max(1, L^2)$, the C2UCB algorithm has the following regret bound with probability at least $1 - \delta$:
$$R(T) = O\left(\alpha_T \sqrt{dkT \log\left(1 + \frac{kTL^2}{d\lambda}\right)} + dk \log\left(1 + \frac{kTL^2}{d\lambda}\right) \min\left(\alpha_T, Ck\right)\right).$$
We note that this theorem holds even if the definition of $\alpha_t$ is replaced with any sequence $\{\alpha_t\}_{t \in [T]}$ satisfying
$$\left\|\hat\theta_t - \theta^*\right\|_{V_{t-1}} \le \alpha_t \quad \text{for all } t \in [T] \text{ with probability at least } 1 - \delta. \tag{7}$$
Condition (7) is weaker than the original definition of $\alpha_t$, as (7) is derived from Lemma 3 for all $t \in [T]$ and $\lambda \ge \max(1, L^2)$.
Proof of Theorem 3
Theorem 3 is a corollary of the theorem below. Note that this theorem holds when $\alpha_t$ is a parameter satisfying (7).
Theorem 9.
Assume that $\mathcal{I}_t$ satisfies (4) for a partition $\{P_j\}_{j=1}^{m}$ of $[N]$ and $\{c_j\}_{j=1}^{m}$ for all $t \in [T]$. Then, if $\lambda \ge \max(1, L^2)$, the C2UCB algorithm has the following regret bound with probability at least $1 - \delta$:
$$R(T) = O\left(\alpha_T \sqrt{dkT \log\left(1 + \frac{kTL^2}{d\lambda}\right)} + Cdk \log\left(1 + \frac{kTL^2}{d\lambda}\right)\right).$$
Proof.
Let $J_t = \{i \in I_t : \|a_t(i)\|^2_{V_{t-1}^{-1}} \le 1/k\}$ and $\bar J_t = I_t \setminus J_t$. We separate the chosen arms into two groups: $J_t$ and the other arms. From the optimality of $I_t$, $I_t \cap P_j$ is the top $c_j$ arms in terms of $\hat r_t(i)$ in $P_j$ for each $j \in [m]$. Thus, we obtain
$$\hat r_t(i) \ge \hat r_t(i') \quad \text{for all } i \in I_t \cap P_j \text{ and } i' \in P_j \setminus I_t \tag{9}$$
for all $t \in [T]$ and $j \in [m]$, where $\hat r_t(i) = \hat\theta_t^\top a_t(i) + \alpha_t \|a_t(i)\|_{V_{t-1}^{-1}}$.
Let $\tilde I_t$ be a subset of $[N]$ that consists of the arms in $\bar J_t$, and arms chosen arbitrarily from $I_t^* \setminus \bar J_t$ for each part $P_j$ so that $|\tilde I_t \cap P_j| = c_j$. Then, $\tilde I_t \in \mathcal{I}_t$ and $\bar J_t \subseteq \tilde I_t$ for all $t \in [T]$. Similarly to $I_t$, we divide $\tilde I_t$ into $\bar J_t$ and $\tilde J_t = \tilde I_t \setminus \bar J_t$. This gives
$$R(T) \le \mathbb{E}\left[\sum_{t=1}^{T} \left(\sum_{i \in \tilde J_t} \theta^{*\top} a_t(i) - \sum_{i \in J_t} \theta^{*\top} a_t(i)\right)\right] + \mathbb{E}\left[\sum_{t=1}^{T} \left(\sum_{i \in I_t^*} \theta^{*\top} a_t(i) - \sum_{i \in \tilde I_t} \theta^{*\top} a_t(i)\right)\right]. \tag{10}$$
The second term of (10) is $O(Cdk \log(1 + kTL^2/(d\lambda)))$, since each round contributes at most $2C|\bar J_t|$ and Lemma 6 bounds $\sum_{t=1}^{T} |\bar J_t|$. It remains to be shown that the first term of (10) is $O(\alpha_T \sqrt{dkT \log(1 + kTL^2/(d\lambda))})$. By the same discussion as the proof of Theorem 8, we obtain $|\hat\theta_t^\top a_t(i) - \theta^{*\top} a_t(i)| \le \alpha_t \|a_t(i)\|_{V_{t-1}^{-1}}$ for all $t \in [T]$ and $i \in [N]$ with probability at least $1 - \delta$. Then, we have
$$\sum_{i \in \tilde J_t} \theta^{*\top} a_t(i) - \sum_{i \in J_t} \theta^{*\top} a_t(i) \le \left(\sum_{i \in \tilde J_t} \hat r_t(i) - \sum_{i \in J_t} \hat r_t(i)\right) + 2 \sum_{i \in J_t} \alpha_t \|a_t(i)\|_{V_{t-1}^{-1}}$$
for all $t \in [T]$. We define $X_1$ and $X_2$ as
$$X_1 = \sum_{t=1}^{T} \sum_{i \in J_t} \alpha_t \|a_t(i)\|_{V_{t-1}^{-1}}, \qquad X_2 = \sum_{t=1}^{T} \left(\sum_{i \in \tilde J_t} \hat r_t(i) - \sum_{i \in J_t} \hat r_t(i)\right).$$
From Claim 1 and Claim 2 below, we can bound $X_1$ and $X_2$, respectively, which gives the desired bound of the first term of (10). ∎
Claim 1.
$X_1 = O\left(\alpha_T \sqrt{dkT \log\left(1 + kTL^2/(d\lambda)\right)}\right)$.
Proof of Claim 1.
Since $\|a_t(i)\|_{V_{t-1}^{-1}} \le 1/\sqrt{k}$ for all $i \in J_t$ and $\alpha_t \le \alpha_T$, the claim follows from (6) of Lemma 5. ∎
Claim 2.
$X_2 \le 0$. This follows from (9): within each part $P_j$, the sets $\tilde J_t$ and $J_t$ have the same cardinality, and each arm of $\tilde J_t \setminus I_t$ can be matched with a distinct arm of $J_t$ whose optimistic estimate $\hat r_t$ is at least as large.
Proof of Theorem 4
Theorem 4 is a corollary of the theorem below. In this subsection, we prove this theorem.
Theorem 10.
If $\lambda \ge \max(1, L^2)$, the proposed algorithm has the following regret bound with probability at least $1 - \delta$:
$$R_\alpha(T) = O\left(B\left(\alpha_T \sqrt{dkT \log\left(1 + \frac{kTL^2}{d\lambda}\right)} + Cdk \log\left(1 + \frac{kTL^2}{d\lambda}\right)\right)\right).$$
Proof.
Let $J_t = \{i \in I_t : \|a_t(i)\|^2_{V_{t-1}^{-1}} \le 1/k\}$ and $\bar J_t = I_t \setminus J_t$. Similarly to the discussion in the proof of Theorem 8, we have $|\hat\theta_t^\top a_t(i) - \theta^{*\top} a_t(i)| \le \alpha_t \|a_t(i)\|_{V_{t-1}^{-1}}$ for all $t \in [T]$ and $i \in [N]$ with probability at least $1 - \delta$. Furthermore, from the definition of $\hat r_t(i)$, we also have $\bar r_t(i) \le \hat r_t(i)$ for all $t \in [T]$ and $i \in [N]$. Thus, we obtain $0 \le \hat r_t(i) - \bar r_t(i) \le \min\left(2\alpha_t \|a_t(i)\|_{V_{t-1}^{-1}}, 2C\right)$ for all $t \in [T]$ and $i \in [N]$.