Contextual Combinatorial Bandits with Probabilistically Triggered Arms
Abstract
We study contextual combinatorial bandits with probabilistically triggered arms (C2MAB-T) under a variety of smoothness conditions that capture a wide range of applications, such as contextual cascading bandits and contextual influence maximization bandits. Under the triggering probability modulated (TPM) condition, we devise the C2-UCB-T algorithm and propose a novel analysis that achieves an $\tilde{O}(d\sqrt{KT})$ regret bound, removing a potentially exponentially large factor $O(1/p_{\min})$, where $d$ is the dimension of contexts, $p_{\min}$ is the minimum positive probability that any arm can be triggered, and the batch size $K$ is the maximum number of arms that can be triggered per round. Under the variance modulated (VM) or triggering probability and variance modulated (TPVM) conditions, we propose a new variance-adaptive algorithm VAC2-UCB and derive a regret bound $\tilde{O}(d\sqrt{T})$, which is independent of the batch size $K$. As a valuable by-product, our analysis technique and variance-adaptive algorithm can be applied to the CMAB-T and C2MAB settings, improving existing results there as well. We also include experiments that demonstrate the improved performance of our algorithms over benchmark algorithms on synthetic and real-world datasets.
1 Introduction
C2MAB-T | Algorithm | Condition | Coefficient | Regret Bound |
C3-UCB (Li et al., 2016)∗ | 1-norm | |||
(Main Result 1) | C2-UCB-T (Algorithm 1) | 1-norm TPM | ||
(Main Result 2) | VAC2-UCB (Algorithm 2) | VM | † | |
(Main Result 3) | VAC2-UCB (Algorithm 2) | TPVM | , ‡ | |
CMAB-T | Algorithm | Condition | Coefficient | Regret Bound |
BCUCB-T (Liu et al., 2022) | TPVM | , | ||
(Additional Result 1) | BCUCB-T (Our new analysis) | TPVM | , | |
C2MAB | Algorithm | Condition | Coefficient | Regret Bound |
C2UCB (Qin et al., 2014) | 2-norm | § | ||
C2UCB (Takemura et al., 2021) | 1-norm | |||
(Additional Result 2) | VAC2-UCB (Algorithm 2) | VM |
- ∗ This work is specialized to contextual combinatorial cascading bandits, without formally defining the arm triggering process.
- † Generally, the coefficient and the existing regret bound are improved under the stated condition.
- ‡ $\lambda$ is a coefficient in the TPVM condition: when $\lambda$ is larger, the condition is stronger, yielding a smaller regret bound but covering fewer applications.
- ∗∗ We also show an improved distribution-dependent regret bound in Appendix C.
- § Almost all applications satisfy this condition.
The stochastic multi-armed bandit (MAB) problem is a classical sequential decision-making problem that has been widely studied (Robbins, 1952; Auer et al., 2002; Bubeck et al., 2012). As an extension of MAB, combinatorial multi-armed bandits (CMAB) have drawn attention due to fruitful applications in online advertising, network optimization, and healthcare systems (Gai et al., 2012; Kveton et al., 2015a; Chen et al., 2013, 2016a; Wang & Chen, 2017; Merlis & Mannor, 2019). CMAB is a sequential decision-making game between a learning agent and an environment. In each round, the agent chooses a combinatorial action that triggers a set of base arms (i.e., a super-arm) to be pulled simultaneously, and the outcomes of these pulled base arms are observed as feedback (typically known as semi-bandit feedback). The goal of the agent is to minimize the expected regret, which is the difference in expectation for the overall rewards between always playing the best action (i.e., the action with the highest expected reward) and playing according to the agent’s own policy.
Motivated by large-scale applications with a huge number of items (base arms), there is a prominent line of work that advances the CMAB model: combinatorial contextual bandits (or C2MAB for short) (Qin et al., 2014; Li et al., 2016; Takemura et al., 2021). Specifically, C2MAB incorporates contextual information and adds a simple yet effective linear structure assumption to allow scalability, which yields regret bounds that are independent of the number of base arms $m$. Despite C2MAB’s success in leveraging contextual information for better scalability, existing works fail to formulate the general arm triggering process, which is essential to model a wider range of applications, e.g., cascading bandits (CB) and influence maximization (IM); more importantly, they do not provide satisfying results for settings with probabilistically triggered arms. For example, Qin et al. (2014); Takemura et al. (2021) only consider deterministic semi-bandit feedback for C2MAB. Li et al. (2016); Wen et al. (2017) implicitly consider the arm triggering process for specific CB or IM applications but only give sub-optimal results with unsatisfying factors (e.g., a $1/p_{\min}$ factor that could be as large as the number of base arms), owing to loose analysis, weak conditions, or inefficient algorithms that explore the unknown parameters too conservatively.
To handle the above issues, we enhance the C2MAB framework by considering an arm triggering process. Specifically, we propose the general framework of contextual combinatorial bandits with probabilistically triggered arms (or C2MAB-T for short). At the base-arm level, C2MAB-T uses a time-varying feature map to model the contextual information at each round $t$, and the mean outcome of each arm is the inner product of its feature vector and an unknown parameter vector (where the feature dimension $d$ is typically much smaller than the number of arms $m$, so as to handle large-scale applications). At the (combinatorial) action level, inspired by the non-contextual CMAB with probabilistically triggered arms (or CMAB-T) works (Chen et al., 2016b; Wang & Chen, 2017; Liu et al., 2022), we formally define an arm-triggering process to cover more general feedback models such as semi-bandit, cascading, and probabilistic feedback. We also inherit smoothness conditions for the non-linear reward function to cover different application scenarios, such as CB, IM, and online probabilistic maximum coverage (PMC) problems (Chen et al., 2016a; Wang & Chen, 2017). With this formulation, C2MAB-T retains C2MAB’s scalability while also enjoying CMAB-T’s rich reward functions and general feedback models.
Contributions. Our main results are shown in Table 1.
First, we study C2MAB-T under the triggering probability modulated (TPM) smoothness condition, a condition introduced by Wang & Chen (2017) to remove a $1/p_{\min}$ factor in the pioneering CMAB-T work (Chen et al., 2016a). Our result follows a similar vein: we devise the C2-UCB-T algorithm and prove an $\tilde{O}(d\sqrt{KT})$ regret bound, which removes a $1/p_{\min}$ factor for prior contextual CB applications (Li et al., 2016) (Main Result 1 in Table 1). The key technical challenge is that the triggering group (TG) analysis (Wang & Chen, 2017) for CMAB-T cannot handle triggering probabilities determined by time-varying contexts. To tackle this issue, we devise a new technique, called the triggering probability equivalence (TPE), which links the triggering probabilities with the random triggering events under expectation. In this way, we no longer need to bound the regret caused by possibly triggered arms, but only the regret caused by actually triggered arms. As a result, we can directly apply the simpler non-triggering C2MAB analysis to obtain the regret bound for C2MAB-T. In addition, our TPE can reproduce the existing results for CMAB-T in a similar way.
Second, we study C2MAB-T under the variance modulated (VM) smoothness condition (Liu et al., 2022), in light of the recent variance-adaptive algorithms that remove the batch-size dependence for CMAB-T (Merlis & Mannor, 2019; Liu et al., 2022; Vial et al., 2022). We propose a new variance-adaptive algorithm, VAC2-UCB, and prove a batch-size independent $\tilde{O}(d\sqrt{T})$ regret bound under the VM condition (Main Result 2 in Table 1). The main technical difficulty is dealing with the unknown variance. Inspired by Lattimore et al. (2015), we use the UCB/LCB values to construct an optimistic variance, and on top of that, we prove a new concentration bound that incorporates the triggered arms and the optimistic variance to obtain the desired results.
Third, we investigate the stronger triggering probability and variance modulated (TPVM) condition (Liu et al., 2022) in order to remove the remaining undesirable factor. The key challenge is that we cannot directly use TPE to link the true triggering probabilities with the random triggering events as before, since the TPVM condition only yields a mismatched triggering probability associated with the optimistic variance used in the algorithm. Our solution is to bound this additional mismatch by lower-order terms under a mild condition on the triggering probabilities, which yields the regret bounds of Main Result 3 in Table 1.
As a valuable by-product, our TPE analysis and the VAC2-UCB algorithm can be applied to non-contextual CMAB-T and to C2MAB, improving the existing results there as well (Additional Results 1 and 2 in Table 1). Our empirical results on both synthetic and real data demonstrate that VAC2-UCB outperforms state-of-the-art variance-agnostic and variance-aware bandit algorithms in the linear cascading bandit application, which satisfies the TPVM condition.
Related Work. The stochastic CMAB model has received significant attention. The literature was initiated by Gai et al. (2012); since then, its regret bounds have been improved by Kveton et al. (2015b), Combes et al. (2015), and Chen et al. (2016a). There are two prominent lines of work closely related to our study: contextual CMAB and CMAB with probabilistically triggered arms (C2MAB and CMAB-T).
For C2MAB, Qin et al. (2014) is the first study; it proposes the C2UCB algorithm, which handles reward functions under a 2-norm smoothness condition. Takemura et al. (2021) then replace the 2-norm smoothness condition with a new 1-norm smoothness condition and prove a corresponding regret bound. In this work, we extend C2MAB with triggering arms to cover more application scenarios (e.g., contextual CB and contextual IM). Moreover, we further consider the stronger VM condition and propose a new variance-adaptive algorithm that achieves a further improvement in the regret upper bound for applications like PMC.
For CMAB-T, Chen et al. (2016a) is the first work that considers the arm triggering process to cover CB and IM applications; the authors propose the CUCB algorithm and give a regret bound under a 1-norm smoothness condition. Wang & Chen (2017) then propose the stronger 1-norm triggering probability modulated (TPM) smoothness condition and use the triggering group (TG) analysis to remove a $1/p_{\min}$ factor from the previous regret bound. Recently, Liu et al. (2022) incorporate variance information and propose the variance-adaptive algorithm BCUCB-T, which also uses the TG analysis and further reduces the regret’s dependency on the batch size under the new triggering probability and variance modulated (TPVM) condition. The smoothness conditions considered in this work are mostly inspired by the above works, but directly following their algorithms and TG analysis fails to obtain any meaningful result for our C2MAB-T setting. Conversely, our new TPE analysis can be applied to CMAB-T, reproducing CMAB-T’s result under the 1-norm TPM condition and further improving the result under the TPVM condition.
There are also many studies considering specific applications that fall under the C2MAB-T framework (which unifies C2MAB and CMAB-T), including contextual CB (Li et al., 2016; Vial et al., 2022) and contextual IM (Wen et al., 2017). One can see that these applications fit into our framework by verifying that they satisfy the TPM, VM, or TPVM conditions, and they thus inherit our improved regret bounds. We defer a detailed theoretical and empirical comparison to Sections 3, 4 and 5. Zuo et al. (2022) study online competitive IM and also use “C2MAB-T” to denote their contextual setting. However, their “contexts” are the actions of the competitor, which act at the action level and only affect the reward function (or regret) but not the base arms’ estimation. This is very different from our setting, where contexts act at the base arm level, and hence one cannot directly apply their results.
2 Problem Setting
We study contextual combinatorial bandits with probabilistically triggered arms (C2MAB-T). We use $[m]$ to represent the set $\{1,2,\ldots,m\}$. We use boldface lowercase letters and boldface CAPITALIZED letters for column vectors and matrices, respectively. $\|\boldsymbol{x}\|$ denotes the $\ell_2$-norm of vector $\boldsymbol{x}$. For any symmetric positive semi-definite (PSD) matrix $\boldsymbol{M}$ (i.e., $\boldsymbol{x}^\top\boldsymbol{M}\boldsymbol{x}\ge 0$ for all $\boldsymbol{x}$), $\|\boldsymbol{x}\|_{\boldsymbol{M}}=\sqrt{\boldsymbol{x}^\top\boldsymbol{M}\boldsymbol{x}}$ denotes the matrix norm of $\boldsymbol{x}$ with respect to matrix $\boldsymbol{M}$.
We specify a C2MAB-T problem instance using a tuple consisting of the following components: the set of base arms (or arms) $[m]$; the set of eligible actions $\mathcal{S}$, where each $S\in\mathcal{S}$ is an action (in general $\mathcal{S}$ can be an arbitrary action space, and when it is a collection of subsets of $[m]$ we often refer to $S$ as a super arm); the set of possible feature maps, where a feature map $\boldsymbol{\phi}$ is a function that maps an arm to a $d$-dimensional feature vector (and w.l.o.g. we normalize $\|\boldsymbol{\phi}(i)\|\le 1$); the parameter space; the probabilistic triggering function, which characterizes the arm triggering process (and feedback); and the reward function.
In C2MAB-T, a learning game is played between a learning agent (or player) and the unknown environment in a sequential manner. Before the game starts, the environment chooses a parameter $\boldsymbol{\theta}^*$ unknown to the agent (and w.l.o.g. we assume $\|\boldsymbol{\theta}^*\|\le 1$). At the beginning of round $t$, the environment reveals a feature vector $\boldsymbol{\phi}_t(i)$ for each arm $i$, where the feature map $\boldsymbol{\phi}_t$ is known to the agent. Given these feature vectors, the agent selects an action $S_t\in\mathcal{S}$, and the environment draws Bernoulli outcomes $X_{t,i}$ for the base arms (we assume Bernoulli outcomes for ease of exposition, yet our setting can handle any distribution with bounded support), with mean $\mu_{t,i}=\boldsymbol{\phi}_t(i)^\top\boldsymbol{\theta}^*$ for each base arm $i$. Here $\mathcal{H}_t$ denotes the history before the agent chooses $S_t$ and will be specified shortly. Note that the outcomes are assumed to be conditionally independent across arms given the history $\mathcal{H}_t$, similar to previous works (Qin et al., 2014; Li et al., 2016; Vial et al., 2022). For convenience, we use $\boldsymbol{\mu}_t=(\mu_{t,1},\ldots,\mu_{t,m})$ to denote the mean vector; the set of all possible mean vectors is generated by the feature maps and the parameter space.
After the action $S_t$ is played on the outcomes, the base arms in a random set $\tau_t\subseteq[m]$ are triggered, meaning that the outcomes of arms in $\tau_t$, i.e., $(X_{t,i})_{i\in\tau_t}$, are revealed as feedback to the agent and are involved in determining the reward of action $S_t$. At the end of round $t$, the agent receives a non-negative reward determined by $S_t$, the outcomes, and $\tau_t$; similar to (Wang & Chen, 2017), the expected reward is assumed to be $r(S_t;\boldsymbol{\mu}_t)$, a function of the unknown mean vector $\boldsymbol{\mu}_t$, where the expectation is taken over the randomness of the outcomes and of $\tau_t$. To allow the algorithm to estimate the underlying parameter directly from samples, we assume the outcome of an arm does not depend on whether the arm is triggered. With this, we can give the formal definition of the history $\mathcal{H}_t$, which contains all information before round $t$ as well as the contextual information at round $t$.
The goal of C2MAB-T is to accumulate as much reward as possible over $T$ rounds by learning the underlying parameter $\boldsymbol{\theta}^*$. The performance of an online learning algorithm is measured by its regret, defined as the difference of the expected cumulative reward between always playing the best action in each round and playing the actions chosen by the algorithm. For many reward functions, it is NP-hard to compute the exact best action even when $\boldsymbol{\mu}_t$ is known, so similar to (Wang & Chen, 2017), we assume that the algorithm has access to an offline $\alpha$-approximation oracle, which for a mean vector $\boldsymbol{\mu}$ outputs an action $S$ such that $r(S;\boldsymbol{\mu})\ge\alpha\cdot r(S^*;\boldsymbol{\mu})$, where $S^*=\operatorname*{arg\,max}_{S'\in\mathcal{S}} r(S';\boldsymbol{\mu})$. The $T$-round $\alpha$-approximate regret is defined as
$$\mathrm{Reg}(T)\;=\;\mathbb{E}\left[\sum_{t=1}^{T}\Big(\alpha\cdot r(S_t^{*};\boldsymbol{\mu}_t)\;-\;r(S_t;\boldsymbol{\mu}_t)\Big)\right],\qquad(1)$$
where $S_t^{*}$ denotes the best action at round $t$, and the expectation is taken over the randomness of the outcomes, the triggered sets $\tau_t$, and the algorithm itself.
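To make the interaction protocol concrete, here is a minimal Python sketch of one C2MAB-T round under illustrative assumptions of ours: Bernoulli outcomes with linear means, and a toy cascading-style triggering function. The names (`feature_map`, `one_round`, the stopping rule) are ours and not part of the formal model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 20                                  # feature dimension, number of base arms
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)      # w.l.o.g. the parameter has norm at most 1

def feature_map(t, i):
    """Time-varying feature vector phi_t(i); here just a random unit vector."""
    v = rng.normal(size=d)
    return v / np.linalg.norm(v)

def one_round(t, action):
    """Simulate one C2MAB-T round for a list-type action with cascading feedback."""
    phi = {i: feature_map(t, i) for i in range(m)}
    mu = {i: float(np.clip(phi[i] @ theta_star, 0.0, 1.0)) for i in range(m)}
    outcome = {i: rng.binomial(1, mu[i]) for i in range(m)}   # Bernoulli outcomes X_{t,i}
    # Toy triggering function: scan the ordered action and stop after the first
    # positive outcome, so only a prefix of the action is triggered/observed.
    triggered, reward = [], 0.0
    for i in action:
        triggered.append(i)
        if outcome[i] == 1:
            reward = 1.0
            break
    feedback = {i: outcome[i] for i in triggered}   # semi-bandit feedback on triggered arms
    return phi, feedback, reward

phi, feedback, reward = one_round(t=1, action=[3, 7, 2])
print(feedback, reward)
```

The only feedback returned to the agent is the dictionary of outcomes for the triggered prefix, matching the semi-bandit feedback on $\tau_t$ described above.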
Remark 1 (Difference with CMAB-T).
C2MAB-T strictly generalizes CMAB-T by allowing a possibly time-varying feature map. Specifically, letting $d=m$ and fixing $\boldsymbol{\phi}_t(i)=\boldsymbol{e}_i$ for all $t$, where $\boldsymbol{e}_i$ is the one-hot vector with 1 at the $i$-th entry and 0 elsewhere, one can easily reproduce the CMAB-T setting of (Wang & Chen, 2017).
Remark 2 (Difference with C2MAB).
C2MAB-T enhances the modeling power of prior C2MAB (Qin et al., 2014; Takemura et al., 2021) by capturing the probabilistic nature of the feedback (vs. deterministic semi-bandit feedback). This enables a wider range of applications such as combinatorial CB, multi-layered network exploration, and online IM (Wang & Chen, 2017; Liu et al., 2022).
2.1 Key Quantities and Conditions
In the C2MAB-T model, there are several quantities and assumptions that are crucial to the subsequent study. We define the triggering probability $p_i^{\boldsymbol{\mu}}(S)$ as the probability that base arm $i$ is triggered when the action is $S$, the mean vector is $\boldsymbol{\mu}$, and the probabilistic triggering function is the one given by the problem instance. Since the triggering function is always fixed in a given application context, we omit it from the notation for simplicity and write $p_i^{\boldsymbol{\mu}}(S)$ henceforth. Triggering probabilities are crucial for the triggering probability modulated bounded smoothness conditions defined below. We define $\tilde{S}$ to be the set of arms that can possibly be triggered by $S$, i.e., $\tilde{S}=\{i\in[m]: p_i^{\boldsymbol{\mu}}(S)>0 \text{ for some } \boldsymbol{\mu}\}$, the batch size $K$ as the maximum number of arms that can be triggered, i.e., $K=\max_{S\in\mathcal{S}}|\tilde{S}|$, and $p_{\min}$ as the minimum positive triggering probability.
Owing to the nonlinearity and the combinatorial structure of the reward, it is essential to give some conditions for the reward function in order to achieve any meaningful regret bounds (Chen et al., 2013, 2016a; Wang & Chen, 2017; Merlis & Mannor, 2019; Liu et al., 2022). For C2MAB-T, we consider the following conditions.
Condition 1 (Monotonicity).
We say that a C2MAB-T problem instance satisfies the monotonicity condition if, for any action $S\in\mathcal{S}$ and any mean vectors $\boldsymbol{\mu},\boldsymbol{\mu}'\in[0,1]^m$ such that $\mu_i\le\mu_i'$ for all $i\in[m]$, we have $r(S;\boldsymbol{\mu})\le r(S;\boldsymbol{\mu}')$.
Condition 2 (1-norm TPM Bounded Smoothness, (Wang & Chen, 2017)).
We say that a C2MAB-T problem instance satisfies the triggering probability modulated (TPM) $B_1$-bounded smoothness condition if, for any action $S\in\mathcal{S}$ and any mean vectors $\boldsymbol{\mu},\boldsymbol{\mu}'\in[0,1]^m$, we have $|r(S;\boldsymbol{\mu}')-r(S;\boldsymbol{\mu})|\le B_1\sum_{i\in[m]} p_i^{\boldsymbol{\mu}}(S)\,|\mu_i'-\mu_i|$.
Condition 3 (VM Bounded Smoothness, (Liu et al., 2022)).
We say that a C2MAB-T problem instance satisfies the variance modulated (VM) $(B_v,B_1)$-bounded smoothness condition if, for any action $S\in\mathcal{S}$, any mean vectors $\boldsymbol{\mu},\boldsymbol{\mu}'\in(0,1)^m$, and any $\boldsymbol{\zeta},\boldsymbol{\eta}\in[-1,1]^m$ s.t. $\boldsymbol{\mu}'=\boldsymbol{\mu}+\boldsymbol{\zeta}+\boldsymbol{\eta}$, we have $|r(S;\boldsymbol{\mu}')-r(S;\boldsymbol{\mu})|\le B_v\sqrt{\sum_{i\in\tilde{S}}\frac{\zeta_i^2}{(1-\mu_i)\mu_i}}+B_1\sum_{i\in\tilde{S}}|\eta_i|$.
Condition 4 (TPVM Bounded Smoothness, (Liu et al., 2022)).
We say that a C2MAB-T problem instance satisfies the triggering probability and variance modulated (TPVM) $(B_v,B_1,\lambda)$-bounded smoothness condition if, for any action $S\in\mathcal{S}$, any mean vectors $\boldsymbol{\mu},\boldsymbol{\mu}'\in(0,1)^m$, and any $\boldsymbol{\zeta},\boldsymbol{\eta}\in[-1,1]^m$ s.t. $\boldsymbol{\mu}'=\boldsymbol{\mu}+\boldsymbol{\zeta}+\boldsymbol{\eta}$, we have $|r(S;\boldsymbol{\mu}')-r(S;\boldsymbol{\mu})|\le B_v\sqrt{\sum_{i\in[m]}\big(p_i^{\boldsymbol{\mu}}(S)\big)^{\lambda}\frac{\zeta_i^2}{(1-\mu_i)\mu_i}}+B_1\sum_{i\in[m]}p_i^{\boldsymbol{\mu}}(S)\,|\eta_i|$.
Condition 1 indicates that the reward is monotonically increasing when the parameter increases. Conditions 2, 3, and 4 all bound the reward smoothness/sensitivity.
For Condition 2, the key feature is that the parameter change in each base arm $i$ is modulated by the triggering probability $p_i^{\boldsymbol{\mu}}(S)$. Intuitively, for a base arm $i$ that is unlikely to be triggered/observed (small $p_i^{\boldsymbol{\mu}}(S)$), Condition 2 ensures that a large change in $\mu_i$ (due to insufficient observations) causes only a small change (multiplied by $p_i^{\boldsymbol{\mu}}(S)$) in the reward, which helps to save a $1/p_{\min}$ factor for non-contextual CMAB-T.
For Condition 3, intuitively, if we ignore the denominator of the leading term, then when each arm’s parameter changes by $\epsilon$ the reward change is of order $B_v\sqrt{K}\epsilon$. This gives a $\sqrt{K}$-factor reduction in the reward change, and translates to a $\sqrt{K}$ improvement in the regret, compared with the $B_1 K\epsilon$ reward change obtained by applying the non-triggering version of Condition 2 (i.e., with $p_i^{\boldsymbol{\mu}}(S)=1$ if $i\in\tilde{S}$ and $0$ otherwise). However, for real applications the coefficient $B_v$ can itself grow with $\sqrt{K}$, which cancels the improvement. To reduce this coefficient, the leading term is modulated by the inverse of the variance $(1-\mu_i)\mu_i$, which allows applications to achieve a $B_v$ coefficient independent of $K$ (or at least $o(\sqrt{K})$), leading to significant savings in the regret bound for applications like PMC (Liu et al., 2022). Condition 2 and Condition 3 are in general not comparable, but compared with Condition 2’s non-triggering counterpart (i.e., the 1-norm condition), Condition 3 is stronger.
Finally, Condition 4 combines both the triggering-probability modulation of Condition 2 and the variance modulation of Condition 3. The exponent $\lambda$ of the triggering probability gives additional flexibility to trade off the strength of the condition against the regret: with a larger $\lambda$, one obtains a smaller regret bound, while with a smaller $\lambda$, the condition is easier to satisfy and covers more applications. In general, Condition 4 is stronger than Conditions 2 and 3: it implies Condition 2 by setting $\boldsymbol{\zeta}=\boldsymbol{0}$, and it implies Condition 3 by the fact that $p_i^{\boldsymbol{\mu}}(S)\le 1$ for $i\in\tilde{S}$ and $p_i^{\boldsymbol{\mu}}(S)=0$ otherwise. Conversely, by applying the Cauchy-Schwarz inequality, one can verify that if a reward function is TPM $B_1$-bounded smooth, then it is TPVM-bounded smooth with $B_v$ of order $B_1\sqrt{K}$ (for any $\lambda\le 2$), and similarly VM-bounded smooth.
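As a concrete sanity check of Condition 2, the snippet below numerically verifies the 1-norm TPM inequality with coefficient $B_1=1$ for the disjunctive cascading reward $r(S;\boldsymbol{\mu})=1-\prod_{i\in S}(1-\mu_i)$, whose triggering probabilities are $p_i^{\boldsymbol{\mu}}(S)=\prod_{j<i}(1-\mu_j)$ for arms listed in order. This is an illustrative check under our own toy setup, not part of the paper's formal proofs.

```python
import numpy as np

rng = np.random.default_rng(1)

def reward(mu):                 # disjunctive cascade: probability of at least one success
    return 1.0 - np.prod(1.0 - mu)

def trig_prob(mu):              # p_i = probability that the i-th listed arm is examined
    return np.concatenate(([1.0], np.cumprod(1.0 - mu)[:-1]))

K = 6
for _ in range(10_000):
    mu, mu_prime = rng.random(K), rng.random(K)
    lhs = abs(reward(mu_prime) - reward(mu))
    rhs = np.sum(trig_prob(mu) * np.abs(mu_prime - mu))   # B_1 = 1
    assert lhs <= rhs + 1e-12
print("1-norm TPM inequality with B_1 = 1 held on all random instances")
```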
In light of the above conditions, which significantly advance non-contextual CMAB-T, the goal of the subsequent sections is to design algorithms and conduct analyses that yield (improved) results for the contextual setting. Later, in Section 5, we demonstrate how these conditions are applied to applications such as CB and online IM to achieve both theoretical and empirical improvements. Due to space limits, detailed proofs are included in the Appendix.
3 Algorithm and Regret Analysis for C2MAB-T under the TPM Condition
Our proposed algorithm C2-UCB-T (Algorithm 1) is a generalization of the C3-UCB algorithm originally designed for contextual combinatorial cascading bandits (Li et al., 2016). Our main contribution is to show a regret bound improved by a factor of $1/p_{\min}$ under the 1-norm TPM condition.
Recall the history data defined in Section 2. Different from the CUCB algorithm (Wang & Chen, 2017), which directly estimates the mean of each arm, Algorithm 1 estimates the underlying parameter $\boldsymbol{\theta}^*$ via ridge regression over the history data. More specifically, we estimate $\boldsymbol{\theta}^*$ by solving the following regularized least-squares problem with regularization parameter $\gamma$:
$$\hat{\boldsymbol{\theta}}_t\;=\;\operatorname*{arg\,min}_{\boldsymbol{\theta}\in\mathbb{R}^d}\;\sum_{s=1}^{t-1}\sum_{i\in\tau_s}\big(X_{s,i}-\boldsymbol{\phi}_s(i)^\top\boldsymbol{\theta}\big)^2\;+\;\gamma\,\|\boldsymbol{\theta}\|_2^2.\qquad(2)$$
The closed-form solution is precisely the estimate $\hat{\boldsymbol{\theta}}_t$ calculated in line 4, where the Gram matrix and the b-vector are computed in lines 10 and 11. We claim that $\hat{\boldsymbol{\theta}}_t$ is a good estimator of $\boldsymbol{\theta}^*$ by bounding their difference via the following lemma, which is also used in (Qin et al., 2014; Li et al., 2016).
Proposition 1 (Theorem 2, (Abbasi-Yadkori et al., 2011)).
Let $\delta\in(0,1)$, and let $\boldsymbol{V}_{t-1}$ denote the regularized Gram matrix built from the observations before round $t$. Then with probability at least $1-\delta$, for all $t\ge 1$, $\|\hat{\boldsymbol{\theta}}_t-\boldsymbol{\theta}^*\|_{\boldsymbol{V}_{t-1}}\le\rho_t$, where the confidence radius $\rho_t$ scales as $O\big(\sqrt{d\log(t/\delta)}\big)$.
Building on this, we construct an optimistic estimate $\bar{\mu}_{t,i}$ of each arm’s mean in line 6 by adding to the empirical mean $\boldsymbol{\phi}_t(i)^\top\hat{\boldsymbol{\theta}}_t$ a confidence interval of width $\rho_t\|\boldsymbol{\phi}_t(i)\|_{\boldsymbol{V}_{t-1}^{-1}}$ in the direction $\boldsymbol{\phi}_t(i)$, where $\rho_t$ is the radius from Proposition 1. As a convention, we clip $\bar{\mu}_{t,i}$ into $[0,1]$ if it falls below 0 or above 1.
Thanks to Proposition 1, we have the following lemma for the desired amount of the base arm level optimism,
Lemma 1.
With probability at least $1-\delta$, we have $\mu_{t,i}\le\bar{\mu}_{t,i}\le\mu_{t,i}+2\rho_t\,\|\boldsymbol{\phi}_t(i)\|_{\boldsymbol{V}_{t-1}^{-1}}$ for all $t\ge1$ and all arms $i\in[m]$.
Proof.
See Section A.2. ∎
After computing the UCB values $\bar{\mu}_{t,i}$, the agent selects the action $S_t$ via the offline oracle with the UCB vector $\bar{\boldsymbol{\mu}}_t$ as input. Then the base arms in $\tau_t$ are triggered, and the agent receives the observation set $\{(i,X_{t,i}):i\in\tau_t\}$ as feedback to improve future decisions.
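The following Python sketch mirrors the structure of Algorithm 1 under illustrative assumptions of ours (a generic `env`/`oracle` interface, regularization parameter `gamma`, and a fixed confidence radius `rho` standing in for $\rho_t$); it shows the ridge-regression update and per-arm UCB construction rather than the paper's exact pseudocode or constants.

```python
import numpy as np

def c2_ucb_t(env, oracle, T, d, m, gamma=1.0, rho=1.0):
    """Sketch of C2-UCB-T: ridge regression, per-arm UCBs, and an offline oracle."""
    V = gamma * np.eye(d)                    # regularized Gram matrix
    b = np.zeros(d)                          # b-vector
    for t in range(1, T + 1):
        theta_hat = np.linalg.solve(V, b)    # ridge-regression estimate
        V_inv = np.linalg.inv(V)
        phi = env.features(t)                # phi[i]: d-dimensional feature of arm i
        ucb = np.empty(m)
        for i in range(m):
            mean_i = phi[i] @ theta_hat
            width_i = rho * np.sqrt(phi[i] @ V_inv @ phi[i])   # confidence interval
            ucb[i] = np.clip(mean_i + width_i, 0.0, 1.0)
        action = oracle(ucb)                 # alpha-approximation oracle on the UCB vector
        feedback = env.play(t, action)       # {i: X_{t,i}} for the triggered arms only
        for i, x in feedback.items():        # update statistics with observed arms
            V += np.outer(phi[i], phi[i])
            b += x * phi[i]
```

In a real instantiation, `oracle` would be the application-specific $\alpha$-approximation algorithm (e.g., a greedy routine), and `rho` would grow with $t$ as in Proposition 1.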
Theorem 1.
For a C2MAB-T instance that satisfies monotonicity (Condition 1) and TPM smoothness (Condition 2) with coefficient $B_1$, C2-UCB-T (Algorithm 1) with an $\alpha$-approximation oracle achieves an $\alpha$-approximate regret bounded by $\tilde{O}\big(B_1 d\sqrt{KT}\big)$.
Discussion. Looking at Theorem 1, we achieve an $\tilde{O}(d\sqrt{KT})$ regret bound when $B_1=O(1)$, which is independent of the number of arms $m$ and the minimum triggering probability $p_{\min}$. For combinatorial cascading bandits (Li et al., 2016), which satisfy Condition 2 with $B_1=1$ (see Section 5), our result improves over Li et al. (2016) by a factor of $1/p_{\min}$. For the linear reward function (Takemura et al., 2021) without triggered arms (i.e., $p_i^{\boldsymbol{\mu}}(S)=1$ for $i\in S$ and $0$ otherwise), one can easily verify $B_1=1$, and our regret matches the lower bound of Takemura et al. (2021) up to logarithmic factors.
Analysis. Here we explain how to prove a regret bound that removes the $1/p_{\min}$ factor under the 1-norm TPM condition. The main challenge is that the mean vector $\boldsymbol{\mu}_t$ and the triggering probabilities depend on the time-varying contexts, so it is impossible to derive any meaningful concentration inequality or regret bound based on the counter of how many times arm $i$ has been triggered, the quantity used by the triggering group (TG) technique (Wang & Chen, 2017) to remove $1/p_{\min}$. To deal with this problem, we bypass this counter and use the triggering-probability equivalence (TPE) technique, which equates the triggering probability $p_i^{\boldsymbol{\mu}_t}(S_t)$ with the conditional expectation of the triggering event $\{i\in\tau_t\}$; this in turn replaces the expected regret produced by all possibly triggered arms with the expected regret produced by the actually triggered arms $\tau_t$, thereby avoiding $1/p_{\min}$. To sketch our proof idea, we assume the oracle is deterministic (the randomness of the oracle is handled in Appendix A), and let the filtration $\mathcal{F}_t$ be the history data defined in Section 2. By Condition 1, Lemma 1, and the definition of the $\alpha$-approximation oracle, the per-round $\alpha$-approximate regret is at most $\mathbb{E}[r(S_t;\bar{\boldsymbol{\mu}}_t)-r(S_t;\boldsymbol{\mu}_t)]$. Then
$$\mathbb{E}\big[r(S_t;\bar{\boldsymbol{\mu}}_t)-r(S_t;\boldsymbol{\mu}_t)\big]\;\overset{(a)}{\le}\;B_1\,\mathbb{E}\Big[\sum_{i\in\tilde{S}_t}p_i^{\boldsymbol{\mu}_t}(S_t)\,\big(\bar{\mu}_{t,i}-\mu_{t,i}\big)\Big]\;\overset{(b)}{=}\;B_1\,\mathbb{E}\Big[\sum_{i\in\tilde{S}_t}\mathbb{E}\big[\mathbb{I}\{i\in\tau_t\}\mid\mathcal{F}_t\big]\,\big(\bar{\mu}_{t,i}-\mu_{t,i}\big)\Big]\;\overset{(c)}{=}\;B_1\,\mathbb{E}\Big[\sum_{i\in\tilde{S}_t}\mathbb{I}\{i\in\tau_t\}\,\big(\bar{\mu}_{t,i}-\mu_{t,i}\big)\Big]\;\overset{(d)}{=}\;B_1\,\mathbb{E}\Big[\sum_{i\in\tau_t}\big(\bar{\mu}_{t,i}-\mu_{t,i}\big)\Big],\qquad(3)$$
where (a) is by Condition 2, (b) is because $S_t$, $\bar{\boldsymbol{\mu}}_t$, and $\boldsymbol{\mu}_t$ are $\mathcal{F}_t$-measurable, so the only remaining randomness is from the triggered set $\tau_t$ and we can substitute the triggering probability with the conditional expectation of the triggering event, (c) is by absorbing the inner conditional expectation into the outer expectation, and (d) is a change of notation. After applying the TPE, we only need to bound the regret produced by the actually triggered arms $\tau_t$. Hence
$$B_1\,\mathbb{E}\Big[\sum_{t=1}^{T}\sum_{i\in\tau_t}\big(\bar{\mu}_{t,i}-\mu_{t,i}\big)\Big]\;\overset{(a)}{\le}\;2B_1\rho_T\,\mathbb{E}\Big[\sum_{t=1}^{T}\sum_{i\in\tau_t}\|\boldsymbol{\phi}_t(i)\|_{\boldsymbol{V}_{t-1}^{-1}}\Big]\;\overset{(b)}{\le}\;2B_1\rho_T\sqrt{KT}\;\mathbb{E}\bigg[\sqrt{\sum_{t=1}^{T}\sum_{i\in\tau_t}\|\boldsymbol{\phi}_t(i)\|_{\boldsymbol{V}_{t-1}^{-1}}^{2}}\,\bigg]\;\overset{(c)}{=}\;\tilde{O}\big(B_1 d\sqrt{KT}\big),\qquad(4)$$
where (a) follows from Lemma 1 (and the monotonicity of the confidence radius), (b) is by the Cauchy-Schwarz inequality applied over both $t\in[T]$ and $i\in\tau_t$, and (c) is by the ellipsoidal potential lemma (Lemma 5) in the Appendix.
Remark 3.
In addition to the general C2MAB-T setting, the TPE technique can also replace the more involved TG technique (Wang & Chen, 2017) for CMAB-T. Such a replacement saves an unnecessary union bound over the group index, which in turn reproduces Theorem 1 of Wang & Chen (2017) under Condition 2 and improves Theorem 1 of Liu et al. (2022) under Condition 4 by a factor of $O(\sqrt{\log K})$; see Appendix C for details.
4 Variance-Adaptive Algorithm and Analysis for C2MAB-T under VM/TPVM Condition
In this section, we propose a new variance-adaptive algorithm, VAC2-UCB (Algorithm 2), to further remove the $\sqrt{K}$ factor and achieve an $\tilde{O}(d\sqrt{T})$ regret bound for applications satisfying the stronger VM/TPVM conditions.
Different from Algorithm 1, VAC2-UCB leverages second-order statistics (i.e., variances) to speed up the learning process. To build intuition, first assume the variance $V_{t,i}$ of each base arm’s outcome at round $t$ is known in advance. In this case, VAC2-UCB adopts weighted ridge regression to learn the parameter:
$$\hat{\boldsymbol{\theta}}_t\;=\;\operatorname*{arg\,min}_{\boldsymbol{\theta}\in\mathbb{R}^d}\;\sum_{s=1}^{t-1}\sum_{i\in\tau_s}\frac{\big(X_{s,i}-\boldsymbol{\phi}_s(i)^\top\boldsymbol{\theta}\big)^2}{V_{s,i}}\;+\;\gamma\,\|\boldsymbol{\theta}\|_2^2,\qquad(5)$$
where the first term is weighted by the inverse of the true variance $V_{s,i}$. The closed-form solution of the above estimator has the same structure as lines 12 and 13 of Algorithm 2 (but uses different weights): the Gram matrix and the b-vector are built from the features and observations reweighted by $1/V_{s,i}$.
The intuition for re-weighting the observations by the inverse of the variance is that the smaller the variance, the more accurate the observation, and thus the more important it is for learning the unknown parameter. In fact, the above estimator is closely related to the best linear unbiased estimator (BLUE) (Henderson, 1975). Concretely, in the linear regression literature, Equation (5) is the lowest-variance estimator among all unbiased linear estimators when the regularization term vanishes, the weights are the true variance proxies of the outcomes, and the context sequence is treated as a fixed design.
For our C2MAB-T setting, a new challenge arises since the variance is not known a priori. Inspired by (Lattimore et al., 2015; Zhou et al., 2021), we construct an optimistic variance estimate $\bar{V}_{t,i}$ to replace the true variance in Equation (5). Indeed, we construct $\bar{V}_{t,i}$ as the optimal value of the following optimization problem,
$$\bar{V}_{t,i}\;=\;\max_{v\in[\underline{\mu}_{t,i},\,\bar{\mu}_{t,i}]}\;v\,(1-v),\qquad(6)$$
where $\bar{\mu}_{t,i}$ and $\underline{\mu}_{t,i}$ are the UCB and LCB values to be introduced shortly. Notice that with high probability the true mean $\mu_{t,i}$ lies between the LCB and UCB values, and as they become more accurate, the optimistic variance $\bar{V}_{t,i}$ also approaches the true variance $\mu_{t,i}(1-\mu_{t,i})$.
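Under our reading of this construction, namely that $\bar{V}_{t,i}$ is the largest Bernoulli variance consistent with the confidence interval $[\underline{\mu}_{t,i},\bar{\mu}_{t,i}]$, its closed form is

$$\bar{V}_{t,i}\;=\;\begin{cases}\tfrac{1}{4}, & \text{if } \underline{\mu}_{t,i}\le\tfrac{1}{2}\le\bar{\mu}_{t,i},\\[2pt]\max\big\{\underline{\mu}_{t,i}(1-\underline{\mu}_{t,i}),\;\bar{\mu}_{t,i}(1-\bar{\mu}_{t,i})\big\}, & \text{otherwise},\end{cases}$$

so the optimistic variance is never smaller than the true variance on the high-probability event of Lemma 3, and it shrinks toward $\mu_{t,i}(1-\mu_{t,i})$ as the confidence interval shrinks.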
To guarantee that $\hat{\boldsymbol{\theta}}_t$ remains a good estimator, we prove a new lemma (similar to Proposition 1) that provides a concentration bound in the face of the unknown variance. Note that the seminal work of Lattimore et al. (2015) proves a similar concentration bound; the difference is that we have multiple arms triggered in each round instead of a single arm. To address this, we replace the original concentration bound with a new one that carries an extra batch-size-dependent factor, which ultimately enters the confidence radius.
Building on this lemma, we construct the UCB value $\bar{\mu}_{t,i}$ as an upper bound of $\mu_{t,i}$ in line 6, and the LCB value $\underline{\mu}_{t,i}$ as a lower bound of $\mu_{t,i}$ in line 7, based on our variance-adaptive estimator and Gram matrix. Note that doubling the radius, rather than using the radius in Lemma 2 directly, is purely for the correctness of our technical analysis. As a convention, we clip $\bar{\mu}_{t,i}$ and $\underline{\mu}_{t,i}$ into $[0,1]$ if they are above 1 or below 0.
Lemma 3.
With probability at least $1-\delta$, we have $\underline{\mu}_{t,i}\le\mu_{t,i}\le\bar{\mu}_{t,i}$ for all $t\ge1$ and $i\in[m]$.
Proof.
After the agent plays $S_t$, the base arms in $\tau_t$ are triggered, and the agent receives the observation set $\{(i,X_{t,i}):i\in\tau_t\}$ as feedback. These observations (reweighted by the optimistic variances $\bar{V}_{t,i}$) are then used to update the Gram matrix and the b-vector for future rounds.
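A minimal Python sketch of the variance-adaptive update is given below, again under illustrative assumptions of ours (a generic `env`/`oracle` interface, a fixed radius `rho`, a small variance floor for numerical stability, and the optimistic variance taken as the largest Bernoulli variance over the confidence interval); it is not the paper's exact pseudocode.

```python
import numpy as np

def optimistic_var(lcb, ucb):
    """Largest Bernoulli variance v(1-v) consistent with [lcb, ucb] (our reading)."""
    v = 0.25 if lcb <= 0.5 <= ucb else max(lcb * (1 - lcb), ucb * (1 - ucb))
    return max(v, 1e-4)            # small floor, purely for numerical stability here

def vac2_ucb(env, oracle, T, d, m, gamma=1.0, rho=1.0):
    """Sketch of VAC2-UCB: weighted ridge regression with optimistic variances."""
    V = gamma * np.eye(d)          # weighted Gram matrix
    b = np.zeros(d)                # weighted b-vector
    for t in range(1, T + 1):
        theta_hat = np.linalg.solve(V, b)
        V_inv = np.linalg.inv(V)
        phi = env.features(t)
        ucb, lcb, var_bar = np.empty(m), np.empty(m), np.empty(m)
        for i in range(m):
            mean_i = phi[i] @ theta_hat
            width_i = 2 * rho * np.sqrt(phi[i] @ V_inv @ phi[i])   # doubled radius
            ucb[i] = np.clip(mean_i + width_i, 0.0, 1.0)
            lcb[i] = np.clip(mean_i - width_i, 0.0, 1.0)
            var_bar[i] = optimistic_var(lcb[i], ucb[i])
        action = oracle(ucb)
        feedback = env.play(t, action)            # {i: X_{t,i}} for the triggered arms
        for i, x in feedback.items():             # observations reweighted by 1 / var_bar
            V += np.outer(phi[i], phi[i]) / var_bar[i]
            b += x * phi[i] / var_bar[i]
```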
Application | Condition | Regret | Improvement | |
---|---|---|---|---|
Online Influence Maximization (Wen et al., 2017) | TPM | |||
Disjunctive Combinatorial Cascading Bandits (Li et al., 2016) | TPVM | |||
Conjunctive Combinatorial Cascading Bandits (Li et al., 2016) | TPVM | |||
Linear Cascading Bandits (Vial et al., 2022)∗ | TPVM | |||
Multi-layered Network Exploration (Liu et al., 2021b) | TPVM | |||
Probabilistic Maximum Coverage (Chen et al., 2013)∗∗ | VM |
- These denote the number of target nodes, the number of edges that can be triggered by the set of seed nodes, the number of layers, the number of seed nodes, and the length of the longest directed path, respectively.
- $K$ is the length of the ordered list.
- ∗ A special case of disjunctive combinatorial cascading bandits.
- ∗∗ This row is for a C2MAB application; the rest of the rows are for C2MAB-T applications.
4.1 Results and Analysis under VM condition
We first show a regret bound for VAC2-UCB that is independent of the batch size $K$ when the VM condition holds.
Theorem 2.
For a C2MAB-T instance that satisfies monotonicity (Condition 1) and VM smoothness (Condition 3) with coefficients $(B_v,B_1)$, VAC2-UCB (Algorithm 2) with an $\alpha$-approximation oracle achieves an $\alpha$-approximate regret bounded by $\tilde{O}\big(B_v d\sqrt{T}\big)$ (up to lower-order terms).
Discussion. Looking at Theorem 2, we achieve an $\tilde{O}(d\sqrt{T})$ regret bound when $B_v=O(1)$. For combinatorial cascading bandits (Li et al., 2016), our regret is independent of the batch size $K$ and improves over Li et al. (2016) by a factor of $\sqrt{K}/p_{\min}$.
In addition to the general C2MAB-T setting, one can verify that for non-triggering C2MAB the triggered set coincides with the action itself, and we obtain the batch-size independent regret bound $\tilde{O}(B_v d\sqrt{T})$. Recall that $B_v$ can always be taken as $O(B_1\sqrt{K})$ for any C2MAB-T instance, so our regret bound reproduces $\tilde{O}(B_1 d\sqrt{KT})$ in the worst case and thus matches the corresponding lower bound (Takemura et al., 2021) for linear reward functions. For the more interesting non-linear reward functions whose $B_v$ is independent of $K$, our regret improves over the non-variance-adaptive algorithm C2UCB, whose regret depends on the batch size (Qin et al., 2014; Takemura et al., 2021).
Analysis. At a high level, the improvement comes from the VM condition and the optimistic variance, which together avoid the Cauchy-Schwarz step that generates the $\sqrt{K}$ factor in step (b) of Section 3. In order to leverage the variance information, we decompose the regret into terms (I) and (II) as in Equation (7), where the auxiliary vector collects, for each arm $i$, the maximizer that achieves the optimistic variance $\bar{V}_{t,i}$. We then sketch how to bound term (I); term (II) can be bounded similarly. In Equation (8), (a) is by Condition 3, (b) is by the definition of the variance maximizers, (c) is by the Cauchy-Schwarz and Jensen inequalities, (d) follows from the TPE and Lemma 3, and (e) follows from Lemma 6.
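One way to write such a decomposition explicitly, using $\tilde{\boldsymbol{\mu}}_t$ for the vector of Bernoulli-variance maximizers over the confidence intervals (our notation; the paper's Equation (7) may differ in details), is

$$r(S_t;\bar{\boldsymbol{\mu}}_t)-r(S_t;\boldsymbol{\mu}_t)\;=\;\underbrace{\big(r(S_t;\bar{\boldsymbol{\mu}}_t)-r(S_t;\tilde{\boldsymbol{\mu}}_t)\big)}_{(\mathrm{I})}\;+\;\underbrace{\big(r(S_t;\tilde{\boldsymbol{\mu}}_t)-r(S_t;\boldsymbol{\mu}_t)\big)}_{(\mathrm{II})},$$

after which each term is handled using the VM condition.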
4.2 Results and Analysis under TPVM Condition
Next, we show that VAC2-UCB can achieve improved regret bounds, removing further undesirable factors, for applications satisfying the stronger TPVM condition.
We first introduce a mild condition over the triggering probability (which is similar to Condition 2) to give our regret bounds and analysis.
Condition 5 (1-norm TPM Bounded Smoothness for Triggering Probability).
We say that a C2MAB-T problem instance satisfies the triggering probability modulated $B_{p}$-bounded smoothness condition over the triggering probabilities if, for any action $S\in\mathcal{S}$, any mean vectors $\boldsymbol{\mu},\boldsymbol{\mu}'\in[0,1]^m$, and any arm $i\in[m]$, we have $\big|p_i^{\boldsymbol{\mu}'}(S)-p_i^{\boldsymbol{\mu}}(S)\big|\le B_{p}\sum_{j\in[m]}p_j^{\boldsymbol{\mu}}(S)\,|\mu_j'-\mu_j|$.
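As an illustration of why such a condition is mild, consider the disjunctive cascading feedback from Section 5, where the $i$-th arm in the ordered list $S=(a_1,\dots,a_K)$ is examined only if all earlier arms fail, so $p_{a_i}^{\boldsymbol{\mu}}(S)=\prod_{j<i}(1-\mu_{a_j})$; a standard telescoping argument (our illustration, not the paper's proof) gives

$$\big|p_{a_i}^{\boldsymbol{\mu}'}(S)-p_{a_i}^{\boldsymbol{\mu}}(S)\big|\;\le\;\sum_{j<i}\Big(\prod_{k<j}(1-\mu_{a_k})\Big)\,\big|\mu'_{a_j}-\mu_{a_j}\big|\;\le\;\sum_{j\in[m]}p_j^{\boldsymbol{\mu}}(S)\,\big|\mu'_j-\mu_j\big|,$$

i.e., the triggering probabilities themselves satisfy a 1-norm TPM-type condition with coefficient $1$ in this example.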
Now we state our main theorem as follows.
Theorem 3.
For a C2MAB-T instance whose reward function satisfies monotonicity (Condition 1) and TPVM smoothness (Condition 4) with coefficients $(B_v,B_1,\lambda)$, and whose triggering probabilities satisfy 1-norm TPM smoothness with coefficient $B_p$ (Condition 5): if $\lambda\ge 1$, then VAC2-UCB (Algorithm 2) with an $\alpha$-approximation oracle achieves an $\alpha$-approximate regret bounded by
(9) |
and if $\lambda<1$, then VAC2-UCB (Algorithm 2) achieves an $\alpha$-approximate regret bounded by
(10) |
Discussion. The leading term of Theorem 3 is $\tilde{O}(B_v d\sqrt{T})$, which removes a further factor compared with Theorem 2. Also, notice that Theorem 3 relies on an additional smoothness condition over the triggering probabilities (Condition 5). However, we claim that this condition is mild and is satisfied, with coefficient $B_p=1$, by almost all applications considered in this paper (see Appendix D).
Analysis. We use the regret decomposition of Equation (7) into the same terms (I) and (II), and leverage the TPVM condition (Condition 4) to obtain:
(11) |
However, we cannot apply the TPE as in Equation (8), because the TPVM condition yields triggering probabilities evaluated at the optimistic parameters rather than at the true means, so the two no longer match in general. To handle this mismatch, we use the fact that the triggering probabilities usually satisfy the smoothness condition of Condition 5, and prove that the mismatch only affects the lower-order terms as follows:
By Condition 5, the mismatch is controlled by the distance between the optimistic parameters and the true means, and the regret is bounded by the terms shown below, where the leading term is of the desired order by the same derivation as steps (c)-(e) in Equation (8), and the lower-order term is bounded via the TPE and the weighted ellipsoidal potential lemma (Lemma 6). For $\lambda<1$, the lower-order term becomes larger, which results in a larger lower-order regret term. See Section B.3 for details.
5 Applications and Experiments
We now move to applications and experimental results. We first show how our theoretical results improve various C2MAB and C2MAB-T applications under 1-norm TPM, TPVM and VM smoothness conditions with their corresponding coefficients. Then, we provide an empirical comparison in the context of the contextual cascading bandit application.
The instantiation of our theoretical results in the context of a variety of specific C2MAB and C2MAB-T applications is shown in Table 2. The final column of the table details the improvement in regret that our results yield in each case. For detailed settings, proofs, and the discussion of the application results, see Appendix D.
Our experimental results are summarized in Figure 1, which reports experiments on the MovieLens-1M dataset (grouplens.org/datasets/movielens/1m/). Experiments on other data are included in the Appendix. Figure 1 illustrates that our VAC2-UCB algorithm outperforms C3-UCB (Li et al., 2016), the variance-agnostic cascading bandit algorithm, and CascadeWOFUL (Vial et al., 2022), the state-of-the-art variance-aware cascading bandit algorithm, eventually incurring noticeably less regret than both. For detailed settings, comparisons, and discussions, see Appendix E.
6 Conclusion
This paper studies contextual combinatorial bandits with probabilistically triggered arms (C2MAB-T) under a variety of smoothness conditions. Under the triggering probability modulated (TPM) condition, we design the C2-UCB-T algorithm and propose a novel analysis to achieve an $\tilde{O}(d\sqrt{KT})$ regret bound, removing a potentially exponentially large factor $O(1/p_{\min})$. Under the variance modulated conditions (VM or TPVM), we propose a new variance-adaptive algorithm VAC2-UCB and derive a regret bound $\tilde{O}(d\sqrt{T})$, which removes the batch-size dependence. As a valuable by-product, we find that our TPE analysis technique and variance-adaptive algorithm can be applied to the CMAB-T and C2MAB settings, improving existing results as well. Experiments show that our algorithm achieves at least 13% and 25% improvement over benchmark algorithms on synthetic and real-world datasets, respectively. For future work, it would be interesting to extend our application scenarios. One could also relax the perfectly linear assumption by introducing model mis-specifications or corruptions.
Acknowledgement
The work of John C.S. Lui was supported in part by RGC’s GRF 14215722. The work of Mohammad Hajiesmaili is supported by NSF CAREER-2045641, CPS-2136199, CNS-2106299, and CNS-2102963. Wierman is supported by NSF grants CNS-2146814, CPS-2136197, CNS-2106403, and NGSDI-2105648.
References
- Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011.
- Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
- Bernstein (1946) Bernstein, S. The Theory of Probabilities (Russian). Moscow, 1946.
- Bubeck et al. (2012) Bubeck, S., Cesa-Bianchi, N., et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- Chen et al. (2013) Chen, W., Wang, Y., and Yuan, Y. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pp. 151–159. PMLR, 2013.
- Chen et al. (2016a) Chen, W., Wang, Y., Yuan, Y., and Wang, Q. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research, 17(1):1746–1778, 2016a.
- Chen et al. (2016b) Chen, X., Li, Y., Wang, P., and Lui, J. A general framework for estimating graphlet statistics via random walk. Proceedings of the VLDB Endowment, 10(3):253–264, 2016b.
- Combes et al. (2015) Combes, R., Talebi Mazraeh Shahi, M. S., Proutiere, A., et al. Combinatorial bandits revisited. Advances in neural information processing systems, 28, 2015.
- Freedman (1975) Freedman, D. A. On tail probabilities for martingales. the Annals of Probability, pp. 100–118, 1975.
- Gai et al. (2012) Gai, Y., Krishnamachari, B., and Jain, R. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking (TON), 20(5):1466–1478, 2012.
- Henderson (1975) Henderson, C. R. Best linear unbiased estimation and prediction under a selection model. Biometrics, pp. 423–447, 1975.
- Kempe et al. (2003) Kempe, D., Kleinberg, J., and Tardos, É. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 137–146, 2003.
- Kveton et al. (2015a) Kveton, B., Wen, Z., Ashkan, A., and Szepesvári, C. Combinatorial cascading bandits. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 1, pp. 1450–1458, 2015a.
- Kveton et al. (2015b) Kveton, B., Wen, Z., Ashkan, A., and Szepesvari, C. Tight regret bounds for stochastic combinatorial semi-bandits. In AISTATS, 2015b.
- Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020.
- Lattimore et al. (2015) Lattimore, T., Crammer, K., and Szepesvári, C. Linear multi-resource allocation with semi-bandit feedback. Advances in Neural Information Processing Systems, 28, 2015.
- Li et al. (2016) Li, S., Wang, B., Zhang, S., and Chen, W. Contextual combinatorial cascading bandits. In International conference on machine learning, pp. 1245–1253. PMLR, 2016.
- Li et al. (2020) Li, S., Kong, F., Tang, K., Li, Q., and Chen, W. Online influence maximization under linear threshold model. Advances in Neural Information Processing Systems, 33:1192–1204, 2020.
- Liu et al. (2021a) Liu, L. T., Ruan, F., Mania, H., and Jordan, M. I. Bandit learning in decentralized matching markets. Journal of Machine Learning Research, 22(211):1–34, 2021a.
- Liu et al. (2021b) Liu, X., Zuo, J., Chen, X., Chen, W., and Lui, J. C. Multi-layered network exploration via random walks: From offline optimization to online learning. In International Conference on Machine Learning, pp. 7057–7066. PMLR, 2021b.
- Liu et al. (2022) Liu, X., Zuo, J., Wang, S., Joe-Wong, C., Lui, J., and Chen, W. Batch-size independent regret bounds for combinatorial semi-bandits with probabilistically triggered arms or independent arms. In Advances in Neural Information Processing Systems, 2022.
- Merlis & Mannor (2019) Merlis, N. and Mannor, S. Batch-size independent regret bounds for the combinatorial multi-armed bandit problem. In Conference on Learning Theory, pp. 2465–2489. PMLR, 2019.
- Qin et al. (2014) Qin, L., Chen, S., and Zhu, X. Contextual combinatorial bandit and its application on diversified online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 461–469. SIAM, 2014.
- Robbins (1952) Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
- Takemura et al. (2021) Takemura, K., Ito, S., Hatano, D., Sumita, H., Fukunaga, T., Kakimura, N., and Kawarabayashi, K.-i. Near-optimal regret bounds for contextual combinatorial semi-bandits with linear payoff functions. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 9791–9798, 2021.
- Vial et al. (2022) Vial, D., Shakkottai, S., and Srikant, R. Minimax regret for cascading bandits. In Advances in Neural Information Processing Systems, 2022.
- Wang & Chen (2017) Wang, Q. and Chen, W. Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications. In Advances in Neural Information Processing Systems, pp. 1161–1171, 2017.
- Wen et al. (2017) Wen, Z., Kveton, B., Valko, M., and Vaswani, S. Online influence maximization under independent cascade model with semi-bandit feedback. Advances in neural information processing systems, 30, 2017.
- Zhou et al. (2021) Zhou, D., Gu, Q., and Szepesvari, C. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory, pp. 4532–4576. PMLR, 2021.
- Zong et al. (2016) Zong, S., Ni, H., Sung, K., Ke, N. R., Wen, Z., and Kveton, B. Cascading bandits for large-scale recommendation problems. arXiv preprint arXiv:1603.05359, 2016.
- Zuo et al. (2022) Zuo, J., Liu, X., Joe-Wong, C., Lui, J. C., and Chen, W. Online competitive influence maximization. In International Conference on Artificial Intelligence and Statistics, pp. 11472–11502. PMLR, 2022.
Appendix
The Appendix is organized as follows. Appendix A gives the detailed proofs for theorems and lemmas in Section 3. Appendix B provides the detailed proofs for theorems and lemmas in Section 4. Appendix C shows how the triggering probability equivalence technique can be applied to non-contextual CMAB-T to obtain improved results. Appendix D gives the detailed settings, results and comparisons included in Table 2. Appendix E provides detailed experimental setups and additional results. Appendix F summarizes the concentration bounds, facts, and technical lemmas used in this paper.
Appendix A Proofs for C2MAB-T under the TPM Condition (Section 3)
A.1 Proof of Theorem 1
We first give/recall some definitions and events. Recall that in Algorithm 1, the gram matrix, the b-vector and the estimator are
(12) | ||||
(13) | ||||
(14) |
Let us use to denote the nice event when the oracle can output solution with where for any at round . We use to denote the nice event when the holds for any . Define the filtration to be that takes both history data and action to handle the randomness of the oracle, and let . Now we bound the regret under nice event and ,
(15) |
where (a) follows from the regret definition and the tower rule, (b) is by Condition 1 and Lemma 1, (c) is by the nice event and the definition of the oracle, (d) is by Condition 2, (e) follows from the TPE trick (Lemma 4), (f) is by Lemma 1, (g) is by the tower rule, (h) is by the Cauchy-Schwarz inequality, and (i) is by the ellipsoidal potential lemma (Lemma 5). Similar to (Wang & Chen, 2017), the theorem is concluded by the definition of the $\alpha$-approximate regret and by accounting for the failure events, which contribute at most a lower-order regret term.
A.2 Important Lemmas used for proving Theorem 1
See 1
Proof.
For any , we have
where (a) is by the Cauchy-Schwarz inequality and (b) is by Proposition 1. Using the definition of the UCB value finishes the proof. ∎
Lemma 4 (Triggering Probability Equivalence (TPE)).
For any $\mathcal{F}_t$-measurable quantities $x_{t,i}$, we have $\mathbb{E}\big[\sum_{i\in\tilde{S}_t}p_i^{\boldsymbol{\mu}_t}(S_t)\,x_{t,i}\big]=\mathbb{E}\big[\sum_{i\in\tau_t}x_{t,i}\big]$.
Proof.
We have
(16) |
(a) is because the relevant quantities are $\mathcal{F}_t$-measurable, so that the only randomness is from the triggered set $\tau_t$ and we can substitute the triggering probability with the triggering event under expectation, (b) is by absorbing the expectation over $\tau_t$ into the outer expectation, and (c) is a simple change of notation. In fact, TPE can be applied whenever the quantities (other than the triggered set) are $\mathcal{F}_t$-measurable, which will be helpful in later sections. ∎
Lemma 5 (Ellipsoidal Potential Lemma).
when .
Proof.
(17) |
where (a) follows from the definition, (b) follows from and , (c) follows from Lemma 14, (d) follows from repeatedly applying (c).
Since , we have . Using the fact that for any , we have
where the (a) follows from Equation 17, (b) follows from Lemma 15. ∎
Appendix B Proofs for C2MAB-T under the VM or TPVM Condition (Section 4)
B.1 Proof of Lemma 2
Our analysis is inspired by the derivation of Theorem 3 of (Lattimore et al., 2015), bounding the key ellipsoidal radius for the C2MAB-T setting, where multiple arms can be triggered in each round. Before going into the main proof, we first introduce some notation and events as follows.
Recall that for , is a Bernoulli random variable with mean , suppose , we can represent by , where noise , its mean , and its variance . Also note that in Algorithm 2, the gram matrices, the b-vector and the weighted least-square estimator are the following.
(18) | ||||
(19) | ||||
(20) |
where we set , and optimistic variances are defined as in Equation 6 of Algorithm 2.
Let us define , and the key of this proof is to bound (this quantity is often denoted as in the self-normalized bound (Abbasi-Yadkori et al., 2011), but is occupied to denote actions at round in this work).
We finally define failure events be a sequence of events defined by
(21) |
These failure events are crucial in the sense that lies in the confidence ellipsoid (see Lemma 8 for its proof).
Next, we can prove by induction that the probability of given is very small, for (see its proof in Lemma 7). Based on this, we can have (as always holds), and thus by Equation 147, Lemma 2 is proved as desired.
B.2 Proof of Theorem 2 under VM condition
Similar to Appendix A, we first give/recall some definitions and events. Recall that in Algorithm 2, the Gram matrices, the b-vector, and the weighted least-squares estimator are defined in Equation (18). The optimistic variances are defined as in Equation (6) of Algorithm 2. Let us use to denote the nice event when the oracle can output a solution with where for any at round . We use to denote the nice event when the holds for any (which can be implied by ). Define the filtration to be that takes both history data and action to handle the randomness of the oracle, and let .
Let be the vector whose -th entry is the maximizer that achieves , i.e., . Now we bound the regret under nice event and (where can be implied from by derivation in Lemma 8),
(22) | |||
(23) | |||
(24) | |||
(25) |
where (a) is by definition, (b) follows from Condition 1 and Lemma 3, (c) from event and the definition of , (d) from triangle inequality.
Now we show how to bound term (I),
(26) |
where (a) follows from Condition 3, (b) follows from the definition of s.t. for , (c) follows from Cauchy–Schwarz, (d) follows from Jensen’s inequality, (e) follows from the TPE trick, (f) follows from Lemma 3, (g) follows from Lemma 6.
The term (II) follows from a derivation similar to that of Section B.2, with the corresponding quantities replaced. The theorem is then concluded by considering the failure events, similar to Appendix A.
Lemma 6 (Weighted Ellipsoidal Potential Lemma).
when and .
Proof.
(27) |
where (a) follows from the definition, (b) follows from and , (c) follows from Lemma 14, (d) follows from repeatedly applying (c).
If , , else if , and since , by Lemma 9, . Therefore, we have . Using the fact that for any , we have
where the (a) follows from Equation 17, (b) follows from Lemma 15 by setting (from Lemma 11). ∎
B.3 Proof of Theorem 3 Under TPVM Condition
In this section, we consider two cases when and . Recall that to use the TPVM condition (Condition 4), we need one additional condition over the triggering probability (Condition 5).
B.3.1 When :
We inherit the same notation and events as in Section A.1, and start to bound term (I) in Section B.2 differently,
(28) | ||||
(29) | ||||
(30) | ||||
(31) | ||||
(32) | ||||
(33) | ||||
(34) | ||||
(35) | ||||
(36) |
where (a) follows from Condition 4, (b) is by applying Condition 5 for triggering probability and , (c) follows from , (d) follows from same derivation from Section B.2, (e) follows from , (f) follows from Cauchy-Schwarz, (g) follows from , (h) follows from by event , (i) follows from the similar analysis of (d)-(g) in Section B.2 inside the square-root without considering the additional .
For the term (II), one can easily verify that it follows from a derivation similar to that of the term (I), with differences only in the constant terms. Theorem 3 is then concluded by considering the small-probability failure events.
B.3.2 When :
We inherit the same notation and events as in Section A.1, and start to bound term (I) in Equation (28) as follows,
(37) | ||||
(38) | ||||
(39) | ||||
(40) | ||||
(41) | ||||
(42) | ||||
(43) | ||||
(44) | ||||
(45) | ||||
(46) |
where (a) follows from Condition 4, (b) is by applying Condition 5 for triggering probability and , (c) follows from , (d) follows from same derivation from Section B.2, (e) follows from , (f) follows from Cauchy-Schwarz, (g) follows from , (h) follows from by event , (i) follows from Holder’s inequality with , (j) follows from the similar analysis of (d)-(g) in Section B.2 inside the square-root without considering the additional .
For the term (II), one can easily verify that it follows from a derivation similar to that of the term (I), with differences only in the constant terms. Theorem 3 is then concluded by considering the small-probability failure events.
Appendix C Proofs for TPE Trick to Improve Non-Contextual CMAB-T
We first introduce some definitions that are used in (Wang & Chen, 2017) and (Liu et al., 2022). Recall that non-contextual CMAB-T is a degenerate case when and , where is the mean of the true outcome distribution .
Definition 1 ((Approximation) Gap).
Fix a distribution and its mean vector , for each action , we define the (approximation) gap as . For each arm , we define , . As a convention, if there is no action such that and , then . We define and .
Definition 2 (Event-Filtered Regret).
For any series of events indexed by round number , we define the as the regret filtered by events , or the regret is only counted in if happens in . Formally,
(47) |
For simplicity, we will omit and rewrite as when contexts are clear.
C.1 Reproducing Theorem 1 of (Wang & Chen, 2017) under 1-norm TPM Condition
Theorem 4.
For a CMAB-T problem instance that satisfies monotonicity (Condition 1), and TPM bounded smoothness (Condition 2) with coefficient , if , CUCB (Wang & Chen, 2017) with an -approximation oracle achieves an -approximate distribution-dependent regret bounded by
(48) |
And the distribution-independent regret,
(49) |
The main idea is to use TPE trick to replace (arms that could be probabilistically triggered by action ) with (arms that are actually triggered by action ) under conditional expectation, so that we can use the simpler Appendix B.2 of Wang & Chen (2017) to avoid the much more involved Appendix B.3 of Wang & Chen (2017). Such replacement bypasses the triggering group analysis (and its counter ) in Appendix B.3, which uses to associate with the counters for . For our simplified analysis, we can directly associate the with the arm triggering for the arms that are actually triggered/observed and eventually reproduce the regret bounds of (Wang & Chen, 2017).
We follow exactly the same CUCB algorithm (Algorithm 1 (Wang & Chen, 2017)), conditions (Condition 1, 2 (Wang & Chen, 2017)). We also inherit the event definitions of (Definition 4 (Wang & Chen, 2017)) that for every arm , , and the event being . Let us further denote , be the arms actually triggered by at time . Let filtration be , and let We also have that , are measurable. Also note that we use to denote the triggering probability for any in order to match the notation of Wang & Chen (2017).
Proof.
Under event and , and given filtration , we have
(50) | ||||
(51) | ||||
(52) | ||||
(53) | ||||
(54) |
where (a) follows from exactly the Equation (10) of Appendix B.3 in Wang & Chen (2017), (b) is by the reverse amortization trick that multiplies two to both sides of (a) and rearranges the terms, (c) is by and , (d) by event so that .
Let
(55) |
where .
It follows that
(56) | ||||
(57) | ||||
(58) | ||||
(59) |
where (a) follows from Equation 54, (b) follows from the TPE trick to replace , (c) follows from that if , we have , and if , then , so .
Now we apply the definition of the event-filtered regret,
(60) | ||||
(61) | ||||
(62) | ||||
(63) | ||||
(64) | ||||
(65) | ||||
(66) | ||||
(67) |
where (a) follows from Equation 59, (b) follows from the tower rule, (c) follows from that is increased by if and only if , (d) is by the sum integral inequality for non-increasing function .
Following Wang & Chen (2017) to handle small probability events and , we have
(68) |
and the distribution-independent regret is
(69) |
∎
C.2 Improving Theorem 1 of (Liu et al., 2022) under TPVM Condition
We first show the regret bound of using our TPE technique in Theorem 5 and its prior result in Proposition 2.
Theorem 5.
For a CMAB-T problem instance that satisfies monotonicity (Condition 1), and TPVM bounded smoothness (Condition 4) with coefficient , if , BCUCB-T (Liu et al., 2022) with an -approximation oracle achieves an -approximate distribution-dependent regret bounded by
(70) |
where . And the distribution-independent regret,
(71) |
Proposition 2 (Theorem 1, Liu et al. (2022)).
Looking at our regret bound (Theorem 5), there are two improvements compared with Proposition 2: (1) the minimum gap appearing in the distribution-dependent bound is improved, and (2) we remove a factor from the leading term. For (2), this translates to an improvement in the distribution-independent bound.
Proof.
Similar to Section C.1, the main idea is to use TPE trick to replace (arms that could be probabilistically triggered by action ) with (arms that are actually triggered by action ) under conditional expectation to avoid the usage of much more involved triggering group analysis (Wang & Chen, 2017). Such replacement bypasses the triggering group analysis (and its counter ) (Liu et al., 2022), which uses to associate with the counters for . By doing so, we do not need a union bound over the group index , which saves a (or ) factor.
We follow exactly the same BCUCB-T algorithm (Algorithm 1 (Liu et al., 2022)), conditions (Condition 1, 2, 3 (Liu et al., 2022)). We also inherit the event definitions of (Definition 6 (Liu et al., 2022)) that (1) for every base arm , , where ; (2) for every base arm , . We use the event being . Let us further denote , be the arms actually triggered by at time . Let filtration be , and let We also have that , are measurable. Also note that we use to denote the triggering probability for any in order to match the notation of Wang & Chen (2017); Liu et al. (2022).
We follow the same regret decomposition as in Lemma 9 of Liu et al. (2022), to decompose the event-filtered regret into two event-filtered regret and under events .
(74) |
where event , event , ,.
C.2.1 Bounding the term
We bound the leading term under two cases when and .
(a) When ,
Let , . Given filtration and event , we have
(75) | ||||
(76) | ||||
(77) | ||||
(78) | ||||
(79) |
where (a) follows from event , (b) is by the reverse amortization trick that multiplies two to both sides of (a) and rearranges the terms, (c), (d) are by definition of .
It follows that
(80) | ||||
(81) | ||||
(82) |
where (a) follows from Equation 79, (b) follows from TPE trick to replace , (c) is because we define a regret allocation function
(83) |
where , , , and (c) holds due to Lemma 16.
(84) | ||||
(85) | ||||
(86) | ||||
(87) | ||||
(88) | ||||
(89) | ||||
(90) |
where (a) follows from Equation 82, (b) follows from the tower rule, (c) follows from that is increased by if and only if .
(b) When ,
Let , . Note that when , when , and , when , for any . Given filtration and under event , we have
(91) | ||||
(92) | ||||
(93) | ||||
(94) | ||||
(95) |
(96) | ||||
(97) | ||||
(98) |
where the last inequality is by Lemma 17 and we define a regret allocation function
(99) |
where , , , .
(100) | ||||
(101) | ||||
(102) | ||||
(103) | ||||
(104) | ||||
(105) | ||||
(106) |
where (a) follows from Equation 98, (b) follows from the tower rule, (c) follows from that is increased by if and only if .
C.2.2 Bounding the term
Let . Given filtration and event , we have
(107) | ||||
(108) |
where (a) follows from event , (b) is by the reverse amortization trick that multiplies two to both sides of (a) and rearranges the terms, (c) follows from .
It follows that
(109)
where we define a regret allocation function
(110)
where , . Here, (a) follows from Equation 108, (b) follows from the TPE trick, and (c) follows from Lemma 18.
(111)–(118)
where (a) follows from Equation 109, (b) follows from the tower rule, and (c) follows from the fact that is increased by if and only if . ∎
Appendix D Applications
For convenience, we show our table again in Table 3.
Application | Condition | Regret | Improvement | |
---|---|---|---|---|
Online Influence Maximization (Wen et al., 2017) | TPM | † | ||
Disjunctive Combinatorial Cascading Bandits (Li et al., 2016) | TPVM | ‡ | ||
Conjunctive Combinatorial Cascading Bandits (Li et al., 2016) | TPVM | |||
Linear Cascading Bandits (Vial et al., 2022)∗ | TPVM | ‡ | ||
Multi-layered Network Exploration (Liu et al., 2021b) | TPVM | † | ||
Probabilistic Maximum Coverage (Chen et al., 2013)∗∗ | VM |
- † denotes the number of target nodes, the number of edges that can be triggered by the set of seed nodes, the number of layers, the number of seed nodes, and the length of the longest directed path, respectively;
- ‡ K is the length of the ordered list, ;
- ∗ A special case of disjunctive combinatorial cascading bandits.
- ∗∗ This row is for a C2MAB application and the remaining rows are for C2MAB-T applications.
D.1 Online Influence Maximization Bandit (Wang & Chen, 2017) and Its Contextual Generalization (Wen et al., 2017)
Following the setting of (Wang & Chen, 2017, Section 2.1), we consider a weighted directed graph , where is the set of vertices, is the set of directed edges, and each edge is associated with a probability . When the agent selects a set of seed nodes , the influence propagates as follows: At time , the seed nodes are activated; At time , a node activated at time will have one chance to activate its inactive out-neighbor with independent probability . The influence spread of is denoted as and is defined as the expected number of activated nodes after the propagation process ends. The problem of Influence Maximization is to find seed nodes with so that the influence spread is maximized.
For the problem of online influence maximization (OIM), we consider repeated influence maximization tasks over rounds, where the edge probabilities are assumed to be unknown initially. For each round , the agent selects seed nodes , observes the influence propagation of , and receives as reward the number of nodes activated in round . The agent's goal is to accumulate as much reward as possible over rounds. OIM fits into the CMAB-T framework: the edges are the set of base arms , the (unknown) outcome distribution is the joint distribution of independent Bernoulli random variables for the edge set , and the actions are any seed node sets of size at most . For the arm triggering, the triggered set is the set of edges whose source node is reachable from . Let be the outcomes of the edges according to probability , and let the live-edge graph be the induced graph with edges that are alive, i.e., iff for . The triggering probability distribution degenerates to a deterministic triggered set, i.e., is deterministically decided given and . The reward equals the number of activated nodes at the end of round , i.e., the nodes that are reachable from in the live-edge graph . The offline oracle is a -approximation algorithm given by the greedy algorithm of Kempe et al. (2003).
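As a concrete illustration of the live-edge view of the propagation, a minimal sketch is given below; this is not the authors' implementation, and all function and variable names are ours.

```python
import random

def live_edge_spread(edges, probs, seeds):
    """Sample one live-edge graph and count the nodes reachable from the seeds.

    edges: list of directed edges (u, v); probs: dict (u, v) -> p(u, v);
    seeds: iterable of seed nodes S.  All names here are illustrative.
    """
    # Each edge is alive independently with probability p(u, v); the alive
    # edges form the live-edge graph.
    live = [e for e in edges if random.random() < probs[e]]
    # Activated nodes are exactly those reachable from the seeds over live edges.
    activated, frontier = set(seeds), list(seeds)
    while frontier:
        u = frontier.pop()
        for (a, b) in live:
            if a == u and b not in activated:
                activated.add(b)
                frontier.append(b)
    return len(activated)  # the round's reward: number of activated nodes

def influence_spread(edges, probs, seeds, n_samples=1000):
    """Monte-Carlo estimate of the influence spread of the seed set."""
    return sum(live_edge_spread(edges, probs, seeds)
               for _ in range(n_samples)) / n_samples
```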
We now consider OIM's contextual generalization for large-scale problems, following Wen et al. (2017), where each edge is associated with a known feature vector and an unknown parameter , and the edge probability is . By Lemma 2 of (Wang & Chen, 2017), , where is the largest number of nodes any node can reach and the batch size is , so by Theorem 1, C2-UCB-T obtains a worst-case regret bound. Compared with the IMLinUCB algorithm (Wen et al., 2017), which achieves , our regret is improved by up to a factor of .
As for the claim that the triggering probability satisfies , it follows from Theorem 4 of Li et al. (2020) by identifying .
D.2 Contextual Combinatorial Cascading Bandits (Li et al., 2016)
Contextual combinatorial cascading bandits come in two categories: conjunctive cascading bandits and disjunctive cascading bandits (Li et al., 2016). We also compare with linear cascading bandits, a special case that likewise uses variance-adaptive algorithms and achieves very competitive results.
Disjunctive form. For the disjunctive form, we want to select an ordered list of items from total items so as to maximize the probability that at least one of the outcomes of the selected items is . Each item is associated with a Bernoulli random variable with mean at round , indicating whether the user will be satisfied with the item if they scan it. To leverage contextual information, Li et al. (2016) assume , where is the known context at round for arm and is the unknown parameter to be learned. This setting models a movie recommendation system in which the user sequentially scans a list of recommended items and the system is rewarded when the user is satisfied with any recommended item. After the user is satisfied with an item, or scans all items without being satisfied with any of them, the user leaves the system. Due to this stopping rule, the agent can only observe the outcomes of items up to and including the first item whose outcome is . If there are no satisfactory items, the outcomes must all be . In other words, the triggered set is the prefix of items up to the point where the stopping condition holds.
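The triggering and feedback structure of one round can be summarized by the following sketch (illustrative only; the names are ours, and in the contextual setting the Bernoulli means would come from the linear model described above).

```python
import random

def disjunctive_cascade_round(action, click_probs):
    """Simulate one round of disjunctive cascading feedback.

    action: ordered list of item ids; click_probs: dict item -> mean of the
    item's Bernoulli outcome (in the contextual setting this would be the
    linear estimate).  Returns (reward, observed), where `observed` maps each
    scanned item to its observed Bernoulli outcome (the triggered arms).
    """
    observed = {}
    for item in action:                       # the user scans items in order
        outcome = int(random.random() < click_probs[item])
        observed[item] = outcome              # this arm is triggered/observed
        if outcome == 1:                      # satisfied: the user leaves
            return 1, observed                # reward 1: at least one click
    return 0, observed                        # scanned all items, no click
```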
Without loss of generality, let the action be ; then the reward function is and the triggering probability is . Let and , where with . By Lemma 19 in Liu et al. (2021a), disjunctive CB satisfies Condition 4 with . We can also verify that disjunctive CB satisfies as follows:
(119)–(122)
Now by Theorem 3, VAC2-UCB obtains a regret bound of . Compared with Corollary 4.5 in Li et al. (2016), which yields a regret, our result improves by a factor of .
Conjunctive form. For the conjunctive form, the learning agent wants to select paths from total paths (i.e., base arms) so as to maximize the probability that the outcomes of the selected paths are all . Each path is associated with a Bernoulli random variable with mean at round , indicating whether the path will be live if the package is transmitted via this path. This setting models the network routing problem (Kveton et al., 2015a), where the items are routing paths and the package is delivered only when all selected paths are alive. The learning agent observes the outcomes of the first few paths up to the first one that is down, since the transmission stops as soon as any path is down. In other words, the triggered set is the prefix of paths up to the point where the stopping condition holds.
Without loss of generality, let the action be ; then the reward function is and the triggering probability is . Let and , where with . By Lemma 20 in Liu et al. (2021a), conjunctive CB satisfies Condition 4 with . We can also verify that conjunctive CB satisfies as follows:
(123)–(126)
Now by Theorem 3, VAC2-UCB obtains a regret bound of . Compared with Corollary 4.6 in Li et al. (2016), which yields a regret, our result improves by a factor of .
Linear Cascading Bandit. Linear cascading bandits (Vial et al., 2022) are a special case of combinatorial cascading bandits (Li et al., 2016). The former assumes that the action space is the collection of all permutations of size (i.e., a uniform matroid). In this case, the items in the feasible solutions are exchangeable (a critical property of matroids), i.e., , for any . Their analysis relies on this property to obtain the correct results. For the latter, however, (i.e., in Li et al. (2016)) consists of arbitrary feasible actions (possibly of different sizes), e.g., could refer to any path that connects the source and the destination in network routing applications.
Other than the above difference, linear cascading bandits follow the same setting as disjunctive contextual combinatorial cascading bandits. Following a similar argument to the disjunctive case, the regret bound of VAC2-UCB is . Compared with CascadeWOFUL, which achieves by Theorem 4 in Vial et al. (2022), our regret improves by a factor of . For an empirical comparison, see Section 5 for details.
D.3 Multi-layered Network Exploration Problem (MuLaNE) (Liu et al., 2021b)
We consider the MuLaNE problem with random node weights. After we apply the bipartite coverage graph, the corresponding graph is a tri-partite graph (i.e., a 3-layered graph where the first and second layers form a bipartite graph, and the second and third layers form another bipartite graph): the left nodes represent random walkers, the middle nodes are the possible targets to be explored, and the right nodes are nodes each of which has a single edge connecting to its middle node. The MuLaNE task is to allocate budgets into layers to explore target nodes , and the base arms are .
With budget allocation , the (effective) base arms consist of two parts:
(1) , each of which is associated with a visiting probability indicating whether node will be visited by explorer given budgets. All base arms corresponding to budget are triggered.
(2) for , which represents the random node weight. The triggering probability is .
For its contextual generalization, we assume , , where are the known features for visiting probability and the node weights for large-scale MuLaNE applications, respectively. Let effective base arms , where , for . For the target node , the per-target reward function . Denote . Based on Lemma 21 in Liu et al. (2022), contextual MuLaNE satisfies Condition 4 with . To validate that this application satisfies Condition 5 with , we have
(127)–(129)
By Theorem 3, we obtain , which improves upon the result obtained by applying the C3-UCB algorithm (Li et al., 2016) by a factor of .
D.4 Probabilistic Maximum Coverage Bandit (Chen et al., 2016a; Merlis & Mannor, 2019)
In this section, we consider the probabilistic maximum coverage (PMC) problem. PMC is modeled by a weighted bipartite graph , where is the set of source nodes, is the set of target nodes, and each edge is associated with a probability . The task of PMC is to select a set of size so as to maximize the expected number of nodes activated in , where a node can be activated by a node with an independent probability . PMC naturally models the advertisement placement application, where are candidate web-pages, is the set of users, and is the probability that a user will click on web-page .
PMC fits into the non-triggering CMAB framework: each edge corresponds to a base arm, the action is the set of edges incident to the set , and the unknown mean vector is with ; we assume the outcomes are independent across all base arms. In this context, the reward function is .
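For intuition, the reward has the standard coverage form and can be computed as in the following sketch (an illustrative helper of ours, assuming independence of the edge outcomes as stated above, not the paper's code).

```python
def pmc_expected_reward(selected_sources, targets, edge_probs):
    """Expected number of activated target nodes in PMC.

    selected_sources: the chosen set S of source nodes; targets: the target
    nodes V; edge_probs: dict (u, v) -> p(u, v), treated as 0 when the edge
    is absent.
    """
    total = 0.0
    for v in targets:
        # v remains inactive only if every selected source fails to activate
        # it, using the independence of the edge outcomes.
        p_inactive = 1.0
        for u in selected_sources:
            p_inactive *= 1.0 - edge_probs.get((u, v), 0.0)
        total += 1.0 - p_inactive
    return total
```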
In this paper, we consider a contextual generalization by assuming that , where is the known context and is the unknown parameter. By Lemma 24 in Liu et al. (2022), PMC satisfies Condition 3 with . Following Theorem 2, VAC2-UCB obtains , which improves the C3-UCB algorithm's bound (Li et al., 2016) by a factor of .
Appendix E Experiments
Synthetic data. We consider the same disjunctive linear cascading bandit setting as in (Vial et al., 2022), where the goal is to choose out of items to maximize the reward. Notice that the linear cascading bandit problem is a simplified version of the contextual cascading bandit problem where the feature vectors of base arms are fixed in all rounds (see Section D.2 for details). For each , we sample the click probability of item uniformly in for and in for . We vary to generate the same and compute unit-norm vectors and satisfying . We compare VAC2-UCB to C3-UCB (Li et al., 2016) and CascadeWOFUL (Vial et al., 2022): C3-UCB is the variance-agnostic cascading bandit algorithm (essentially the same as CascadeLinUCB (Zong et al., 2016) in the linear cascading setting by using the tunable parameter ) and CascadeWOFUL is the state-of-the-art variance-aware cascading bandit algorithm. As shown in Figure 2, the regret of our VAC2-UCB algorithm has superior dependence on and over that of C3-UCB. When , VAC2-UCB achieves sublinear regret; it incurs and less regret than C3-UCB and CascadeWOFUL after rounds. Notice that CascadeWOFUL is also variance-aware but specifically designed for cascading bandits, while our algorithm can be applied to general C2MAB-T.
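A minimal skeleton of this synthetic simulation is sketched below; the sizes, probability ranges, and the random policy are placeholders of ours (the exact values used in the paper are not reproduced here), and the actual experiment runs VAC2-UCB, C3-UCB, and CascadeWOFUL in place of the random policy.

```python
import numpy as np

# Placeholder sizes; the exact L, K, d, T, and probability ranges used in the
# paper are not reproduced here.
L, K, d, T = 50, 5, 10, 10_000
rng = np.random.default_rng(0)

# Placeholder click probabilities and unit-norm features.  The paper instead
# constructs theta and the feature vectors so that the linear model reproduces
# the chosen click probabilities exactly.
p = rng.uniform(0.05, 0.15, size=L)
X = rng.normal(size=(L, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

def expected_reward(action):
    """Expected disjunctive reward of an ordered list: 1 - prod_i (1 - p_i)."""
    return 1.0 - np.prod(1.0 - p[action])

opt = expected_reward(np.argsort(-p)[:K])          # best fixed K-list under known p

cum_regret = 0.0
for t in range(T):
    action = rng.choice(L, size=K, replace=False)  # stand-in for a bandit policy
    cum_regret += opt - expected_reward(action)    # per-round expected regret
```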

Real data. We conduct experiments on the MovieLens-1M dataset which contains user ratings for movies. Following the same experimental setup in (Vial et al., 2022), we set , and the goal is to choose out of movies to maximize the reward of the cascading recommendation. We use their learned feature mapping from movies to the probability that a uniformly random user rated the movie more than three stars. We point the reader to Section 6 of (Vial et al., 2022) for more details. In each round, we sample a random user and define the potential click result . In other words, we observe the actual feedback of user instead of using the Bernoulli click model. Figure 1(a) shows that VAC2-UCB outperforms C3-UCB and CascadeWOFUL, incurring and less regret after rounds. To model platforms like Netflix that recommend movies in specific categories, we also run experiments while restricting the candidate items to movies of a particular genre. Figure 1(b) shows that VAC2-UCB is superior for all genres compared to C3-UCB and CascadeWOFUL.
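The per-round feedback used here can be sketched as follows (illustrative names; `ratings` is assumed to map a (user, movie) pair to a star rating).

```python
def user_cascade_feedback(action, user, ratings):
    """One round of cascading feedback from a sampled real user.

    The user is treated as 'satisfied' by an item iff they rated it more than
    three stars.  Returns the binary reward and the observed prefix of items.
    """
    for k, item in enumerate(action):
        if ratings.get((user, item), 0) > 3:
            return 1, action[:k + 1]     # satisfied: stop, observe the prefix
    return 0, list(action)               # no satisfactory item: observe all
```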
Appendix F Concentration Bounds, Facts, and Technical Lemmas
In this section, we first give key concentration bounds and then provide lemmas that are useful for the analysis.
F.1 Concentration Bounds
We mainly use the following concentration bound, which is essentially a modification of Freedman's version of Bernstein's inequality (Bernstein, 1946; Freedman, 1975).
Proposition 3 (Theorem 9, Lattimore et al. (2015)).
Let and be a sequence of random variables adapted to filtration with . Let be such that is -measurable and let be measurable such that almost surely. Let , , and . Then , where , and .
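Since the displayed statement does not render here, we recall, as a paraphrase of the standard form (possibly with different constants from the cited version), a Freedman-type inequality for a martingale difference sequence $(X_s)_{s\le n}$ with $|X_s|\le b$ almost surely and predictable variance $\sum_{s=1}^{n}\mathbb{E}[X_s^2\mid\mathcal{F}_{s-1}]\le V$:
\[
\Pr\left[\,\sum_{s=1}^{n} X_s \;\ge\; \sqrt{2V\log\tfrac{1}{\delta}} \;+\; \tfrac{2}{3}\,b\log\tfrac{1}{\delta}\,\right] \;\le\; \delta .
\]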
F.2 Facts
Fact 1.
For any positive-definite matrices and any vectors , it holds that:
1. If , then .
2. If , then .
3. Suppose has maximum eigenvalue ; then and .
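The precise formulas above do not render; representative facts of this kind, which we state only as plausible reconstructions rather than the exact items used in the analysis, are: for positive-definite matrices $A \succeq B \succ 0$ and any vector $x$,
\[
x^\top A\, x \;\ge\; x^\top B\, x,
\qquad
x^\top A^{-1} x \;\le\; x^\top B^{-1} x,
\qquad
x^\top A\, x \;\le\; \lambda_{\max}(A)\,\|x\|_2^2 ,
\]
where $\lambda_{\max}(A)$ denotes the maximum eigenvalue of $A$.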
F.3 Technical Lemmas
Recall that event is defined in Equation 21, the Gram matrix is defined in Equation 18, the optimistic variance is defined in Equation 6, and is defined in Equation 131.
Lemma 7.
, for .
Proof of Lemma 7.
Let and define
(130)–(131)
By applying Proposition 3, with probability at least it holds that
(132)
where .
Since could be a random variable in later proofs, we use a covering argument (Chapter 20, Lattimore & Szepesvári (2020)) to handle . Specifically, we define the covering set , whose size and parameters will be determined shortly. By applying a union bound over Equation 132, we have with probability at least that
(133)
Now we can set , and it follows from Lemma 12 that . Based on our construction of the covering set , there exists with , and , such that
(134)–(136)
where Equation 135 uses the fact that for any , and Equation 136 follows from the derivation below:
(137)–(141)
where Equation 137 follows because implies for by Lemma 8 and thus , Equation 138 follows from the definition of , and Equation 140 follows from .
Now, setting , we have
(142)–(144)
where Equation 143 bounds by Lemma 10, and Equation 144 is by the definition of as an upper bound of .
By rearranging and simplifying Equation 144, we have
(145)–(146)
where the last inequality holds because of from the definition of , Lemma 10, and Equation 141. Finally, we solve the above equation and set , which completes the argument bounding the probability of event . ∎
Lemma 8.
If holds, then it holds that,
(147)
Proof.
We have
(148)–(153)
where Equations (148)–(150) follow from the definitions and direct calculation, Equation 151 follows from and , and Equation 152 follows from the fact that if holds, then . ∎
Lemma 9.
For any , , and if holds and , for any .
Proof.
The first inequality is by and Fact 1. For the second inequality, when holds, as in Equation 147, and since , it follows from the definition of Equation 6 that at least one of the following is true:
(154)–(155)
which concludes the second inequality. ∎
Lemma 10.
Let . If , then
Proof.
For all and , we have , where the first inequality follows from Lemma 9, the last inequality follows from the Cauchy-Schwarz inequality. ∎
Lemma 11.
If , then .
Proof.
If , the inequality trivially holds since . Consider , and let be the maximum eigenvalue of . Then it holds that , where the first inequality follows from Lemma 9, the second inequality is by , and the last is by Fact 1(3).
Now assume the claim holds for , which is always the case for . By induction, considering round , it holds that , where the first inequality follows from the analysis in the last paragraph, the third inequality follows from the induction hypothesis over , and the last inequality is by direct calculation. ∎
Lemma 12.
If , then .
Proof.
Lemma 13.
If , then .
Proof.
, where the first inequality follows from , the second inequality follows from Lemma 12. ∎
Lemma 14 (Lemma A.3, (Li et al., 2016)).
Let , . Then we have
Lemma 15 (Lemma 11, (Abbasi-Yadkori et al., 2011)).
Let with , and let . Then
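In its usual form (our restatement; the notation may differ slightly from the lemma as used here), this elliptical potential lemma says that for $V_t = \lambda I + \sum_{s=1}^{t} x_s x_s^\top$ with $\|x_s\|_2 \le L$,
\[
\sum_{t=1}^{T} \min\!\Big(1,\; \|x_t\|_{V_{t-1}^{-1}}^{2}\Big)
\;\le\; 2\log\frac{\det V_T}{\det V_0}
\;\le\; 2\,d\,\log\!\Big(1 + \frac{T L^2}{d\lambda}\Big).
\]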
Lemma 16.
Equation 82 holds.
Proof.
When ,
we have .
When ,
We have .
When ,
We further consider two different cases or .
For the former case, if there exists such that , then we know , which makes Equation 82 hold regardless; hence we do not need to consider this case further.
For the latter case, when , we know that .
When ,
We have .
Combining all the above cases, we have . ∎
Lemma 17.
Equation 98 holds.
Proof.
When ,
we have .
When ,
We have .
When ,
We further consider two different cases or .
For the former case, if there exists such that , then we know , which makes Equation 98 hold regardless; hence we do not need to consider this case further.
For the latter case, when , we know that .
When ,
We have .
Combining all the above cases, we have . ∎
Lemma 18.
Equation 109 holds.
Proof.
When ,
we have .
When ,
We have .
When ,
If there exists such that , then we know , which makes Equation 109 hold regardless; hence we do not need to consider this case further.
Combining all the above cases, we have . ∎