
Contextual Combinatorial Bandits with Probabilistically Triggered Arms

Xutong Liu    Jinhang Zuo    Siwei Wang    John C.S. Lui    Mohammad Hajiesmaili    Adam Wierman    Wei Chen
Abstract

We study contextual combinatorial bandits with probabilistically triggered arms (C2MAB-T) under a variety of smoothness conditions that capture a wide range of applications, such as contextual cascading bandits and contextual influence maximization bandits. Under the triggering probability modulated (TPM) condition, we devise the C2-UCB-T algorithm and propose a novel analysis that achieves an $\tilde{O}(d\sqrt{KT})$ regret bound, removing a potentially exponentially large factor $O(1/p_{\min})$, where $d$ is the dimension of contexts, $p_{\min}$ is the minimum positive probability that any arm can be triggered, and the batch-size $K$ is the maximum number of arms that can be triggered per round. Under the variance modulated (VM) or triggering probability and variance modulated (TPVM) conditions, we propose a new variance-adaptive algorithm VAC2-UCB and derive a regret bound $\tilde{O}(d\sqrt{T})$, which is independent of the batch-size $K$. As a valuable by-product, our analysis technique and variance-adaptive algorithm can be applied to the CMAB-T and C2MAB settings, improving existing results there as well. We also include experiments that demonstrate the improved performance of our algorithms compared with benchmark algorithms on synthetic and real-world datasets.


1 Introduction

Table 1: Summary of the main results for C2MAB-T, and additional results for CMAB-T and C2MAB.
C2MAB-T Algorithm | Condition | Coefficient | Regret Bound
C3-UCB (Li et al., 2016) | 1-norm | $B_1$ | $O(B_1 d\sqrt{KT}\cdot\log T/p_{\min})$
(Main Result 1) C2-UCB-T (Algorithm 1) | 1-norm TPM | $B_1$ | $O(B_1 d\sqrt{KT}\cdot\log T)$
(Main Result 2) VAC2-UCB (Algorithm 2) | VM | $B_v$ | $O(B_v d\sqrt{T}\cdot\log T/\sqrt{p_{\min}})$
(Main Result 3) VAC2-UCB (Algorithm 2) | TPVM | $B_v$, $\lambda\geq 1$ | $O(B_v d\sqrt{T}\cdot\log T)$
CMAB-T Algorithm | Condition | Coefficient | Regret Bound
BCUCB-T (Liu et al., 2022) | TPVM | $B_v$, $\lambda\geq 1$ | $O(B_v\sqrt{m(\log K)T}\cdot\log T)$
(Additional Result 1) BCUCB-T (our new analysis) | TPVM | $B_v$, $\lambda\geq 1$ | $O(B_v\sqrt{m(\log K)T}\cdot(\log T)^{1/2})$**
C2MAB Algorithm | Condition | Coefficient | Regret Bound
C2UCB (Qin et al., 2014) | 2-norm | $B_2$§ | $O(B_2 d\sqrt{T}\log T)$
C2UCB (Takemura et al., 2021) | 1-norm | $B_1$ | $O(B_1 d\sqrt{KT}\log T)$
(Additional Result 2) VAC2-UCB (Algorithm 2) | VM | $B_v$ | $O(B_v d\sqrt{T}\log T)$
  • C3-UCB (Li et al., 2016) is specified for contextual combinatorial cascading bandits, without formally defining the arm triggering process.

  • Generally, the coefficient $B_v = O(B_1\sqrt{K})$, and the existing regret bound is improved whenever $B_v = o(B_1\sqrt{K})$.

  • $\lambda$ is a coefficient in the TPVM condition: a larger $\lambda$ makes the condition stronger, yielding a smaller regret but covering fewer applications.

  • ∗∗ We also show an improved distribution-dependent regret bound in Appendix C.

  • § Almost all applications satisfy $B_2=\Theta(B_1\sqrt{K})$.

The stochastic multi-armed bandit (MAB) problem is a classical sequential decision-making problem that has been widely studied (Robbins, 1952; Auer et al., 2002; Bubeck et al., 2012). As an extension of MAB, combinatorial multi-armed bandits (CMAB) have drawn attention due to fruitful applications in online advertising, network optimization, and healthcare systems (Gai et al., 2012; Kveton et al., 2015a; Chen et al., 2013, 2016a; Wang & Chen, 2017; Merlis & Mannor, 2019). CMAB is a sequential decision-making game between a learning agent and an environment. In each round, the agent chooses a combinatorial action that triggers a set of base arms (i.e., a super-arm) to be pulled simultaneously, and the outcomes of these pulled base arms are observed as feedback (typically known as semi-bandit feedback). The goal of the agent is to minimize the expected regret, which is the difference in expectation for the overall rewards between always playing the best action (i.e., the action with the highest expected reward) and playing according to the agent’s own policy.

Motivated by large-scale applications with a huge number of items (base arms), there exists a prominent line of work that advances the CMAB model: contextual combinatorial bandits (or C2MAB for short) (Qin et al., 2014; Li et al., 2016; Takemura et al., 2021). Specifically, C2MAB incorporates contextual information and adds a simple yet effective linear structure assumption to allow scalability, which yields regret bounds that are independent of the number of base arms $m$. Despite C2MAB's success in leveraging contextual information for better scalability, existing works fail to formulate the general arm triggering process, which is essential to model a wider range of applications, e.g., cascading bandits (CB) and influence maximization (IM); more importantly, they do not provide satisfying results for settings with probabilistically triggered arms. For example, Qin et al. (2014); Takemura et al. (2021) only consider deterministic semi-bandit feedback for C2MAB. Li et al. (2016); Wen et al. (2017) implicitly consider the arm triggering process for specific CB or IM applications but only give sub-optimal results with unsatisfying factors (e.g., $1/p_{\min}$ or $K$, which could be as large as the number of base arms), owing to loose analysis, weak conditions, or inefficient algorithms that explore the unknown parameters too conservatively.

To handle the above issues, we enhance the C2MAB framework by considering an arm triggering process. Specifically, we propose the general framework of contextual combinatorial bandits with probabilistically triggered arms (or C2MAB-T for short). At the base arm level, C2MAB-T uses a time-varying feature map $\bm{\phi}_t$ to model the contextual information at each round $t$, and the mean outcome of each arm $i\in[m]$ is the inner product of the feature vector $\bm{\phi}_t(i)\in\mathbb{R}^d$ and an unknown vector $\bm{\theta}^*\in\mathbb{R}^d$ (where $d\ll m$ to handle large-scale applications). At the (combinatorial) action level, inspired by the non-contextual CMAB with probabilistically triggered arms (or CMAB-T) works (Chen et al., 2016b; Wang & Chen, 2017; Liu et al., 2022), we formally define an arm-triggering process to cover more general feedback models such as semi-bandit, cascading, and probabilistic feedback. We also inherit smoothness conditions for the non-linear reward function to cover different application scenarios, such as CB, IM, and online probabilistic maximum coverage (PMC) problems (Chen et al., 2016a; Wang & Chen, 2017). With this formulation, C2MAB-T retains C2MAB's scalability while also enjoying CMAB-T's rich reward functions and general feedback models.

Contributions. Our main results are shown in Table 1.

First, we study C2MAB-T under the triggering probability modulated (TPM) smoothness condition, a condition introduced by Wang & Chen (2017) to remove a factor of $1/p_{\min}$ in the pioneering CMAB-T work (Chen et al., 2016a). Our result follows a similar vein by devising the C2-UCB-T algorithm and proving an $\tilde{O}(d\sqrt{KT})$ regret, which removes a $1/p_{\min}$ factor for prior contextual CB applications (Li et al., 2016) (Main Result 1 in Table 1). The key technical challenge is that the triggering group (TG) analysis (Wang & Chen, 2017) for CMAB-T cannot handle triggering probabilities determined by time-varying contexts. To tackle this issue, we devise a new technique, called triggering probability equivalence (TPE), which links the triggering probabilities with the random triggering event under expectation. In this way, we no longer need to bound the regret caused by possibly triggered arms, but only the regret caused by actually triggered arms. As a result, we can directly apply the simple non-triggering C2MAB analysis to obtain the regret bound for C2MAB-T. In addition, our TPE can reproduce the results for CMAB-T in a similar way.

Second, we study C2MAB-T under the variance modulated (VM) smoothness condition (Liu et al., 2022), in light of recent variance-adaptive algorithms that remove the batch-size dependence $O(\sqrt{K})$ for CMAB-T (Merlis & Mannor, 2019; Liu et al., 2022; Vial et al., 2022). We propose a new variance-adaptive algorithm VAC2-UCB and prove a batch-size independent regret $\tilde{O}(d\sqrt{T/p_{\min}})$ under the VM condition (Main Result 2 in Table 1). The main technical difficulty is to deal with the unknown variance. Inspired by Lattimore et al. (2015), we use the UCB/LCB values to construct an optimistic variance and, on top of that, prove a new concentration bound that incorporates the triggered arms and the optimistic variance to obtain the desired results.

Third, we investigate the stronger triggering probability and variance modulated (TPVM) condition (Liu et al., 2022) in order to remove the additional $1/\sqrt{p_{\min}}$ factor. The key challenge is that we cannot directly use TPE to link the true triggering probability with the random triggering event as before, since the TPVM condition only yields a mismatched triggering probability associated with the optimistic variance used in the algorithm. Our solution is to bound this additional mismatch by lower-order terms under mild conditions on the triggering probability, which achieves the $\tilde{O}(d\sqrt{T})$ regret bound (Main Result 3 in Table 1).

As a valuable by-product, our TPE analysis and VAC2-UCB algorithm can be applied to non-contextual CMAB-T and C2MAB, improving the existing results by factors of $\sqrt{\log T}$ (Additional Result 1 in Table 1) and $\sqrt{K}$ (Additional Result 2 in Table 1), respectively. Our empirical results on both synthetic and real data demonstrate that VAC2-UCB outperforms state-of-the-art variance-agnostic and variance-aware bandit algorithms in the linear cascading bandit application, which satisfies the TPVM condition.

Related Work. The stochastic CMAB model has received significant attention. The literature was initiated by Gai et al. (2012), and its regret has since been improved by Kveton et al. (2015b), Combes et al. (2015), and Chen et al. (2016a). Two prominent lines of work in the literature relate to our study: contextual CMAB and CMAB with probabilistically triggered arms (C2MAB and CMAB-T).

For C2MAB, Qin et al. (2014) is the first study; it proposes the C2UCB algorithm, which handles reward functions under the 2-norm $B_2$ smoothness condition. Takemura et al. (2021) then replaces the 2-norm smoothness condition with a new 1-norm $B_1$ smoothness condition and proves an $O(B_1 d\sqrt{KT}\log T)$ regret bound. In this work, we extend C2MAB with triggering arms to cover more application scenarios (e.g., contextual CB and contextual IM). Moreover, we further consider the stronger VM condition and propose a new variance-adaptive algorithm that achieves a $\sqrt{K}$-factor improvement in the regret upper bound for applications like PMC.

For CMAB-T, Chen et al. (2016a) is the first work that considers the arm triggering process to cover CB and IM applications. The authors propose the CUCB algorithm and give an $O(B_1\sqrt{mKT\log T}/p_{\min})$ regret bound under the 1-norm $B_1$ smoothness condition. Wang & Chen (2017) then propose the stronger 1-norm triggering probability modulated (TPM) $B_1$ smoothness condition and use the triggering group (TG) analysis to remove a $1/p_{\min}$ factor from the previous regret. Recently, Liu et al. (2022) incorporate variance information and propose the variance-adaptive algorithm BCUCB-T, which also uses the TG analysis and further reduces the regret's dependence on the batch size from $O(K)$ to $O(\log K)$ under the new triggering probability and variance modulated (TPVM) condition. The smoothness conditions considered in this work are mostly inspired by the above works, but directly following their algorithms and TG analysis fails to obtain any meaningful result for our C2MAB-T setting. Conversely, our new TPE analysis can be applied to CMAB-T, reproducing CMAB-T's result under the 1-norm TPM condition and improving the result under the TPVM condition by a factor of $\sqrt{\log T}$.

There are also many studies considering specific applications under the C2MAB-T framework (by unifying C2MAB and CMAB-T), including contextual CB (Li et al., 2016; Vial et al., 2022) and contextual IM (Wen et al., 2017). These applications fit into our framework, since one can verify that they satisfy the TPM, VM, or TPVM conditions, and thus achieve improved results in terms of the $K$ and $p_{\min}$ factors. We defer a detailed theoretical and empirical comparison to Sections 3, 4 and 5. Zuo et al. (2022) study online competitive IM and also use C2MAB-T to denote their contextual setting. However, their "contexts" are the actions of the competitor, which act at the action level and only affect the reward function (or regret) but not the base arms' estimation. This is very different from our setting, where contexts act at the base arm level, and hence one cannot directly apply their results.

2 Problem Setting

We study contextual combinatorial bandits with probabilistically triggered arms (C2MAB-T). We use $[n]$ to represent the set $\{1,...,n\}$. We use boldface lowercase letters and boldface capitalized letters for column vectors and matrices, respectively. $\left\lVert\bm{x}\right\rVert_p$ denotes the $\ell_p$ norm of vector $\bm{x}$. For any symmetric positive semi-definite (PSD) matrix $\bm{M}$ (i.e., $\bm{x}^{\top}\bm{M}\bm{x}\geq 0$ for all $\bm{x}$), $\left\lVert\bm{x}\right\rVert_{\bm{M}}=\sqrt{\bm{x}^{\top}\bm{M}\bm{x}}$ denotes the matrix norm of $\bm{x}$ with respect to $\bm{M}$.

We specify a C2MAB-T problem instance by a tuple $([m],\mathcal{S},\Phi,\Theta,D_{\text{trig}},R)$, where $[m]=\{1,2,...,m\}$ is the set of base arms (or arms); $\mathcal{S}$ is the set of eligible actions, where $S\in\mathcal{S}$ is an action ($\mathcal{S}$ is a general action space; when $\mathcal{S}$ is a collection of subsets of $[m]$, we often refer to $S\in\mathcal{S}$ as a super arm); $\Phi$ is the set of possible feature maps, where any feature map $\bm{\phi}\in\Phi$ is a function $[m]\rightarrow\mathbb{R}^d$ that maps an arm to a $d$-dimensional feature vector (and w.l.o.g. we normalize $\left\lVert\bm{\phi}(i)\right\rVert_2\leq 1$); $\Theta\subseteq\mathbb{R}^d$ is the parameter space; $D_{\text{trig}}$ is the probabilistic triggering function that characterizes the arm triggering process (and feedback); and $R$ is the reward function.

In C2MAB-T, a learning game is played between a learning agent (or player) and the unknown environment in a sequential manner. Before the game starts, the environment chooses a parameter $\bm{\theta}^*\in\Theta$ unknown to the agent (and w.l.o.g. we also assume $\left\lVert\bm{\theta}^*\right\rVert_2\leq 1$). At the beginning of round $t$, the environment reveals feature vectors $(\bm{\phi}_t(1),...,\bm{\phi}_t(m))$ for the arms, where $\bm{\phi}_t\in\Phi$ is the feature map known to the agent. Given $\bm{\phi}_t$, the agent selects an action $S_t\in\mathcal{S}$, and the environment draws Bernoulli outcomes $\bm{X}_t=(X_{t,1},...,X_{t,m})\in\{0,1\}^m$ for the base arms (we assume the $X_{t,i}$ are Bernoulli for ease of exposition, yet our setting can handle any distribution with bounded $X_{t,i}\in[0,1]$), with mean $\mathbb{E}[X_{t,i}|\mathcal{H}_t]=\langle\bm{\theta}^*,\bm{\phi}_t(i)\rangle$ for each base arm $i$. Here $\mathcal{H}_t$ denotes the history before the agent chooses $S_t$ and will be specified shortly. Note that the outcome $\bm{X}_t$ is assumed to be conditionally independent across arms given the history $\mathcal{H}_t$, similar to previous works (Qin et al., 2014; Li et al., 2016; Vial et al., 2022). For convenience, we use $\bm{\mu}_t\triangleq(\langle\bm{\theta}^*,\bm{\phi}_t(i)\rangle)_{i\in[m]}$ to denote the mean vector and $\mathcal{M}\triangleq\{(\langle\bm{\theta},\bm{\phi}(i)\rangle)_{i\in[m]}:\bm{\phi}\in\Phi,\bm{\theta}\in\Theta\}$ to denote the set of all possible mean vectors generated by $\Phi$ and $\Theta$.

After the action $S_t$ is played on the outcome $\bm{X}_t$, the base arms in a random set $\tau_t\sim D_{\text{trig}}(S_t,\bm{X}_t)$ are triggered, meaning that the outcomes of the arms in $\tau_t$, i.e., $(X_{t,i})_{i\in\tau_t}$, are revealed as feedback to the agent and are involved in determining the reward of action $S_t$. At the end of round $t$, the agent receives a non-negative reward $R(S_t,\bm{X}_t,\tau_t)$, determined by $S_t$, $\bm{X}_t$ and $\tau_t$, and similar to (Wang & Chen, 2017), the expected reward is assumed to be $r(S_t;\bm{\mu}_t)\triangleq\mathbb{E}[R(S_t,\bm{X}_t,\tau_t)]$, a function of the unknown mean vector $\bm{\mu}_t$, where the expectation is taken over the randomness of $\bm{X}_t$ and $\tau_t\sim D_{\text{trig}}(S_t,\bm{X}_t)$. To allow the algorithm to estimate the underlying parameter $\bm{\theta}^*$ directly from samples, we assume the outcome of an arm does not depend on whether the arm is triggered, i.e., $\mathbb{E}_{\bm{X}\sim D,\tau\sim D_{\text{trig}}(S,\bm{X})}[X_i|i\in\tau]=\mathbb{E}_{\bm{X}\sim D}[X_i]$. With this, we can formally define the history $\mathcal{H}_t=(\bm{\phi}_s,S_s,\tau_s,(X_{s,i})_{i\in\tau_s})_{s<t}\cup\{\bm{\phi}_t\}$, which contains all information before round $t$ as well as the contextual information $\bm{\phi}_t$ at round $t$.
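To make the triggering process concrete, the following minimal sketch (our own Python illustration, not from the paper) instantiates $D_{\text{trig}}$ for disjunctive cascading feedback, where an ordered list of arms is examined until the first success; the function name and toy data are hypothetical.

```python
import numpy as np

def trigger_cascade(S, X):
    """One possible instance of D_trig: disjunctive cascading feedback.

    S is an ordered list of base arms; the arms are examined one by one,
    and examination stops at the first arm whose Bernoulli outcome is 1.
    Only the examined prefix is triggered, i.e., observed by the agent.
    """
    triggered = []
    for i in S:
        triggered.append(i)
        if X[i] == 1:  # a "click" ends the cascade
            break
    return triggered

# Toy example with m = 5 base arms and action S = (2, 0, 4).
X = np.array([0, 1, 1, 0, 0])
print(trigger_cascade([2, 0, 4], X))  # -> [2], since arm 2 succeeds first
```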

The goal in C2MAB-T is to accumulate as much reward as possible over $T$ rounds by learning the underlying parameter $\bm{\theta}^*$. The performance of an online learning algorithm $A$ is measured by its regret, defined as the difference of the expected cumulative reward between always playing the best action $S^*_t\triangleq\operatorname*{argmax}_{S\in\mathcal{S}}r(S;\bm{\mu}_t)$ in each round $t$ and playing actions chosen by algorithm $A$. For many reward functions, it is NP-hard to compute the exact $S^*_t$ even when $\bm{\mu}_t$ is known, so similar to (Wang & Chen, 2017), we assume that the algorithm $A$ has access to an offline $(\alpha,\beta)$-approximation oracle, which for a mean vector $\bm{\mu}$ outputs an action $S$ such that $\Pr\left[r(S;\bm{\mu})\geq\alpha\cdot r(S^*;\bm{\mu})\right]\geq\beta$. The $T$-round $(\alpha,\beta)$-approximate regret is defined as

$$\text{Reg}(T)=\mathbb{E}\left[\sum_{t=1}^{T}\left(\alpha\beta\cdot r(S^*_t;\bm{\mu}_t)-r(S_t;\bm{\mu}_t)\right)\right], \qquad (1)$$

where the expectation is taken over the randomness of the outcomes $\bm{X}_1,...,\bm{X}_T$, the triggered sets $\tau_1,...,\tau_T$, and the algorithm $A$ itself.
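As a small illustration (our own, not from the paper), the following sketch accumulates the empirical counterpart of the $(\alpha,\beta)$-approximate regret in Equation (1), given per-round expected rewards of the benchmark and of the algorithm's actions.

```python
def approx_regret(opt_rewards, alg_rewards, alpha=1.0, beta=1.0):
    """Empirical (alpha, beta)-approximate regret as in Eq. (1).

    opt_rewards[t] plays the role of r(S_t^*; mu_t) and alg_rewards[t]
    of r(S_t; mu_t); in practice both are averaged over many runs to
    approximate the outer expectation.
    """
    return sum(alpha * beta * opt - alg
               for opt, alg in zip(opt_rewards, alg_rewards))

# Example: alpha = beta = 1, three rounds.
print(approx_regret([1.0, 0.9, 0.8], [0.7, 0.9, 0.6]))  # -> 0.5
```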

Remark 1 (Difference with CMAB-T).

C2MAB-T strictly generalizes CMAB-T by allowing a possibly time-varying feature map $\bm{\phi}_t$. Specifically, let $\bm{\theta}^*=(\mu_1,...,\mu_m)$ and fix $\bm{\phi}_t(i)=\bm{e}_i$, where $\bm{e}_i\in\mathbb{R}^m$ is the one-hot vector with $1$ at the $i$-th entry and $0$ elsewhere; one can then easily reproduce the CMAB-T setting of (Wang & Chen, 2017).
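The reduction in Remark 1 is mechanical; the short sketch below (our own illustration) builds the one-hot feature map under which the linear mean $\langle\bm{\theta}^*,\bm{\phi}_t(i)\rangle$ collapses to the per-arm mean $\mu_i$ of non-contextual CMAB-T.

```python
import numpy as np

def one_hot_feature_map(m):
    """Fixed feature map phi_t(i) = e_i recovering CMAB-T from C2MAB-T."""
    return lambda i: np.eye(m)[i]

m = 4
theta_star = np.array([0.2, 0.5, 0.9, 0.1])  # theta* stores the arm means
phi = one_hot_feature_map(m)
means = np.array([theta_star @ phi(i) for i in range(m)])
assert np.allclose(means, theta_star)  # <theta*, e_i> = mu_i
```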

Remark 2 (Difference with C2MAB).

C2MAB-T enhances the modeling power of prior C2MAB (Qin et al., 2014; Takemura et al., 2021) by capturing the probabilistic nature of the feedback (vs. deterministic semi-bandit feedback). This enables a wider range of applications such as combinatorial CB, multi-layered network exploration, and online IM (Wang & Chen, 2017; Liu et al., 2022).

2.1 Key Quantities and Conditions

In the C2MAB-T model, several quantities and assumptions are crucial for the subsequent study. We define the triggering probability $p_i^{\bm{\mu},D_{\text{trig}},S}$ as the probability that base arm $i$ is triggered when the action is $S$, the mean vector is $\bm{\mu}$, and the probabilistic triggering function is $D_{\text{trig}}$. Since $D_{\text{trig}}$ is always fixed in a given application context, we omit it from the notation for simplicity and write $p_i^{\bm{\mu},S}$ henceforth. The triggering probabilities $p_i^{\bm{\mu},S}$ are crucial for the triggering probability modulated bounded smoothness conditions defined below. We define $\tilde{S}$ to be the set of arms that can be triggered by $S$, i.e., $\tilde{S}=\{i\in[m]:p_i^{\bm{\mu},S}>0 \text{ for some }\bm{\mu}\in\mathcal{M}\}$, the batch size $K$ as the maximum number of arms that can be triggered, i.e., $K=\max_{S\in\mathcal{S}}|\tilde{S}|$, and $p_{\min}=\min_{i\in[m],\bm{\mu}\in\mathcal{M},S\in\mathcal{S},p_i^{\bm{\mu},S}>0}p_i^{\bm{\mu},S}$.
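To fix ideas, the following sketch (our own illustration) computes $\tilde{S}$, the batch size $K$, and $p_{\min}$ from a table of triggering probabilities $p_i^{\bm{\mu},S}$ stored in a hypothetical dictionary representation.

```python
# p[(S, mu_id)][i] = triggering probability of arm i under action S and
# the mean vector indexed by mu_id (a hypothetical tabular representation).
p = {
    ("S1", 0): {0: 1.0, 1: 0.3, 2: 0.0},
    ("S1", 1): {0: 1.0, 1: 0.5, 2: 0.0},
    ("S2", 0): {0: 0.0, 1: 0.2, 2: 1.0},
    ("S2", 1): {0: 0.0, 1: 0.1, 2: 1.0},
}

actions = ["S1", "S2"]
# S_tilde[S]: arms triggerable by S under some mean vector.
S_tilde = {S: {i for (a, _), probs in p.items() if a == S
               for i, q in probs.items() if q > 0} for S in actions}
K = max(len(arms) for arms in S_tilde.values())          # batch size
p_min = min(q for probs in p.values() for q in probs.values() if q > 0)
print(S_tilde, K, p_min)   # {'S1': {0, 1}, 'S2': {1, 2}} 2 0.1
```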

Owing to the nonlinearity and the combinatorial structure of the reward, it is essential to give some conditions for the reward function in order to achieve any meaningful regret bounds (Chen et al., 2013, 2016a; Wang & Chen, 2017; Merlis & Mannor, 2019; Liu et al., 2022). For C2MAB-T, we consider the following conditions.

Condition 1 (Monotonicity).

We say that a C2MAB-T problem instance satisfies the monotonicity condition if, for any action $S\in\mathcal{S}$ and any mean vectors $\bm{\mu},\bm{\mu}'\in[0,1]^m$ such that $\mu_i\leq\mu'_i$ for all $i\in[m]$, we have $r(S;\bm{\mu})\leq r(S;\bm{\mu}')$.

Condition 2 (1-norm TPM Bounded Smoothness, (Wang & Chen, 2017)).

We say that a C2MAB-T problem instance satisfies the triggering probability modulated (TPM) $B_1$-bounded smoothness condition if, for any action $S\in\mathcal{S}$ and any mean vectors $\bm{\mu},\bm{\mu}'\in[0,1]^m$, we have $|r(S;\bm{\mu}')-r(S;\bm{\mu})|\leq B_1\sum_{i\in[m]}p_i^{\bm{\mu},S}|\mu_i-\mu'_i|$.

Condition 3 (VM Bounded Smoothness, (Liu et al., 2022)).

We say that a C2MAB-T problem instance satisfies the variance modulated (VM) $(B_v,B_1)$-bounded smoothness condition if, for any action $S\in\mathcal{S}$, any mean vectors $\bm{\mu},\bm{\mu}'\in(0,1)^m$, and any $\bm{\zeta},\bm{\eta}\in[-1,1]^m$ such that $\bm{\mu}'=\bm{\mu}+\bm{\zeta}+\bm{\eta}$, we have $|r(S;\bm{\mu}')-r(S;\bm{\mu})|\leq B_v\sqrt{\sum_{i\in\tilde{S}}\frac{\zeta_i^2}{(1-\mu_i)\mu_i}}+B_1\sum_{i\in\tilde{S}}|\eta_i|$.

Condition 4 (TPVM Bounded Smoothness, (Liu et al., 2022)).

We say that a C2MAB-T problem instance satisfies the triggering probability and variance modulated (TPVM) $(B_v,B_1,\lambda)$-bounded smoothness condition if, for any action $S\in\mathcal{S}$, any mean vectors $\bm{\mu},\bm{\mu}'\in(0,1)^m$, and any $\bm{\zeta},\bm{\eta}\in[-1,1]^m$ such that $\bm{\mu}'=\bm{\mu}+\bm{\zeta}+\bm{\eta}$, we have $|r(S;\bm{\mu}')-r(S;\bm{\mu})|\leq B_v\sqrt{\sum_{i\in[m]}(p_i^{\bm{\mu},S})^{\lambda}\frac{\zeta_i^2}{(1-\mu_i)\mu_i}}+B_1\sum_{i\in[m]}p_i^{\bm{\mu},S}|\eta_i|$.

Condition 1 states that the reward is monotonically increasing in the parameter $\bm{\mu}$. Conditions 2, 3 and 4 all bound the reward smoothness/sensitivity.

For Condition 2, the key feature is that the parameter change in each base arm $i$ is modulated by the triggering probability $p_i^{\bm{\mu},S}$. Intuitively, for a base arm $i$ that is unlikely to be triggered/observed (small $p_i^{\bm{\mu},S}$), Condition 2 ensures that a large change in $\mu_i$ (due to insufficient observations) causes only a small change (multiplied by $p_i^{\bm{\mu},S}$) in the reward, which helps to save a $p_{\min}$ factor for non-contextual CMAB-T.

For Condition 3, intuitively, if we ignore the denominator $(1-\mu_i)\mu_i$ of the leading $B_v$ term, the reward change would be $O(B_v\sqrt{K}\Delta)$ when the parameter change is $|\mu'_i-\mu_i|=\Delta$ for each arm $i$. This introduces an $O(\sqrt{K})$-factor reduction in the reward change and translates to an $O(K)$ improvement in the regret, compared with the $O(B_1 K\Delta)$ reward change when applying the non-triggering version of Condition 2 (i.e., $p_i^{\bm{\mu},S}=1$ if $i\in\tilde{S}$ and $p_i^{\bm{\mu},S}=0$ otherwise). However, for real applications, $B_v=\Theta(B_1\sqrt{K})$, which cancels the $O(\sqrt{K})$ improvement. To reduce the $B_v$ coefficient, the leading $B_v$ term is modulated by the inverse of the variance $V_i=(1-\mu_i)\mu_i$, which allows applications to achieve a $B_v$ coefficient independent of $K$ (or at least $B_v=o(B_1\sqrt{K})$), leading to significant savings in the regret bound for applications like PMC (Liu et al., 2022). Conditions 2 and 3 are generally not comparable, but compared with Condition 2's non-triggering counterpart (i.e., the 1-norm condition), Condition 3 is stronger.

Finally, Condition 4 combines both the triggering-probability modulation from Condition 2 and the variance modulation from Condition 3. The exponent $\lambda$ of $p_i^{\bm{\mu},S}$ gives additional flexibility to trade off the strength of the condition against the regret: with a larger $\lambda$, one obtains a smaller regret bound, while with a smaller $\lambda$, the condition is easier to satisfy and covers more applications. In general, Condition 4 is stronger than Conditions 2 and 3, as it degenerates to the other two conditions by setting $\bm{\zeta}=\bm{0}$ and by the fact that $p_i^{\bm{\mu},S}\leq 1$ for $i\in\tilde{S}$ and $p_i^{\bm{\mu},S}=0$ otherwise, respectively. Conversely, by applying the Cauchy-Schwarz inequality, one can verify that if a reward function is TPM $B_1$-bounded smooth, then it is TPVM $(B_1\sqrt{K}/2,B_1,\lambda)$-bounded smooth for any $\lambda\leq 2$, and similarly VM $(B_1\sqrt{K}/2,B_1)$-bounded smooth.
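For concreteness, here is a short derivation behind this last claim (our own reconstruction, sketched for the pure-variance case $\bm{\eta}=\bm{0}$). Writing $p_i$ for $p_i^{\bm{\mu},S}$ and using $(1-\mu_i)\mu_i\leq 1/4$, the TPM condition and Cauchy-Schwarz give
$$|r(S;\bm{\mu}')-r(S;\bm{\mu})|\leq B_1\sum_{i\in\tilde{S}}p_i|\zeta_i|=B_1\sum_{i\in\tilde{S}}\sqrt{(1-\mu_i)\mu_i}\cdot\frac{p_i|\zeta_i|}{\sqrt{(1-\mu_i)\mu_i}}\leq B_1\sqrt{\sum_{i\in\tilde{S}}(1-\mu_i)\mu_i}\sqrt{\sum_{i\in\tilde{S}}\frac{p_i^2\zeta_i^2}{(1-\mu_i)\mu_i}}\leq\frac{B_1\sqrt{K}}{2}\sqrt{\sum_{i\in\tilde{S}}\frac{p_i^2\zeta_i^2}{(1-\mu_i)\mu_i}},$$
which is the TPVM bound with $B_v=B_1\sqrt{K}/2$ and $\lambda=2$; since $p_i\leq 1$, the right-hand side only increases for $\lambda\leq 2$, and dropping the $p_i$ factors entirely yields the VM bound with the same $B_v$.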

In light of the above conditions, which significantly advance the non-contextual CMAB-T literature, the goal of the subsequent sections is to design algorithms and conduct analysis that derive (improved) results for the contextual setting. Later, in Section 5, we demonstrate how these conditions apply to applications such as CB and online IM to achieve both theoretical and empirical improvements. Due to the space limit, detailed proofs are included in the Appendix.

3 Algorithm and Regret Analysis for C2MAB-T under the TPM Condition

Algorithm 1 C2-UCB-T: Contextual Combinatorial Upper Confidence Bound Algorithm for C2MAB-T
1:  Input: Base arms $[m]$, dimension $d$, regularizer $\gamma$, failure probability $\delta=1/T$, offline oracle ORACLE.
2:  Initialize: Gram matrix $\bm{G}_1=\gamma\bm{I}$, vector $\bm{b}_1=\bm{0}$.
3:  for $t=1,...,T$ do
4:     $\hat{\bm{\theta}}_t=\bm{G}_t^{-1}\bm{b}_t$.
5:     for $i\in[m]$ do
6:        $\bar{\mu}_{t,i}=\langle\bm{\phi}_t(i),\hat{\bm{\theta}}_t\rangle+\rho(\delta)\left\lVert\bm{\phi}_t(i)\right\rVert_{\bm{G}_t^{-1}}$.
7:     end for
8:     $S_t=\text{ORACLE}(\bar{\mu}_{t,1},...,\bar{\mu}_{t,m})$.
9:     Play $S_t$ and observe the triggered arm set $\tau_t$ and observations $(X_{t,i})_{i\in\tau_t}$.
10:     $\bm{G}_{t+1}=\bm{G}_t+\sum_{i\in\tau_t}\bm{\phi}_t(i)\bm{\phi}_t(i)^{\top}$.
11:     $\bm{b}_{t+1}=\bm{b}_t+\sum_{i\in\tau_t}\bm{\phi}_t(i)X_{t,i}$.
12:  end for

Our proposed algorithm C2-UCB-T (Algorithm 1) is a generalization of the C3-UCB algorithm originally designed for contextual combinatorial cascading bandits (Li et al., 2016). Our main contribution is to show a regret bound improved by a factor of $1/p_{\min}$ under the 1-norm TPM condition.

Recall that the history data is $\mathcal{H}_t=(\bm{\phi}_s,S_s,\tau_s,(X_{s,i})_{i\in\tau_s})_{s<t}\cup\{\bm{\phi}_t\}$. Different from the CUCB algorithm (Wang & Chen, 2017), which directly estimates the mean $\mu_{t,i}$ for each arm, Algorithm 1 estimates the underlying parameter $\bm{\theta}^*$ via a ridge regression over the history data $\mathcal{H}_t$. More specifically, we estimate $\bm{\theta}^*$ by solving the following $\ell_2$-regularized least-squares problem with regularization parameter $\gamma>0$:

$$\hat{\bm{\theta}}_t=\operatorname*{argmin}_{\bm{\theta}\in\Theta}\sum_{s<t}\sum_{i\in\tau_s}(\langle\bm{\theta},\bm{\phi}_s(i)\rangle-X_{s,i})^2+\gamma\left\lVert\bm{\theta}\right\rVert_2^2. \qquad (2)$$

The closed-form solution is precisely the $\hat{\bm{\theta}}_t$ computed in line 4, where the Gram matrix $\bm{G}_t=\gamma\bm{I}+\sum_{s<t}\sum_{i\in\tau_s}\bm{\phi}_s(i)\bm{\phi}_s(i)^{\top}$ and the b-vector $\bm{b}_t=\sum_{s<t}\sum_{i\in\tau_s}\bm{\phi}_s(i)X_{s,i}$ are computed in lines 10 and 11. We claim that $\hat{\bm{\theta}}_t$ is a good estimator of $\bm{\theta}^*$ by bounding their difference via the following proposition, which is also used in (Qin et al., 2014; Li et al., 2016).

Proposition 1 (Theorem 2, (Abbasi-Yadkori et al., 2011)).

Let $\rho(\delta)=\sqrt{\log\left(\frac{(\gamma+KT/d)^d}{\gamma^d\delta^2}\right)}+\sqrt{\gamma}$. Then with probability at least $1-\delta$, for all $t\in[T]$, $\left\lVert\hat{\bm{\theta}}_t-\bm{\theta}^*\right\rVert_{\bm{G}_t}\leq\rho(\delta)$.

Building on this, we construct an optimistic estimate $\bar{\mu}_{t,i}$ of each arm's mean in line 6, where $\rho(\delta)$ is as in Proposition 1, and $\langle\bm{\phi}_t(i),\hat{\bm{\theta}}_t\rangle$ and $\rho(\delta)\left\lVert\bm{\phi}_t(i)\right\rVert_{\bm{G}_t^{-1}}$ are the empirical mean and the confidence interval in the direction $\bm{\phi}_t(i)$, respectively. As a convention, we clip $\bar{\mu}_{t,i}$ into $[0,1]$ if $\bar{\mu}_{t,i}>1$ or $\bar{\mu}_{t,i}<0$.

Thanks to Proposition 1, we have the following lemma on the desired amount of base-arm-level optimism.

Lemma 1.

With probability at least $1-\delta$, we have $\mu_{t,i}\leq\bar{\mu}_{t,i}\leq\mu_{t,i}+2\rho(\delta)\left\lVert\bm{\phi}_t(i)\right\rVert_{\bm{G}_t^{-1}}$ for all $i\in[m]$, $t\in[T]$.

Proof.

See Section A.2. ∎

After computing the UCB values $\bar{\bm{\mu}}_t$, the agent selects action $S_t$ via the offline oracle with $\bar{\bm{\mu}}_t$ as input. Then the base arms in $\tau_t$ are triggered, and the agent receives the observations $(X_{t,i})_{i\in\tau_t}$ as feedback to improve future decisions.
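For concreteness, a minimal Python sketch of this loop (our own rendering, not the authors' released code; the oracle and the environment's triggering simulator are passed in as hypothetical callables) is:

```python
import numpy as np

def c2_ucb_t(T, d, phi, oracle, play, gamma=1.0, K=1):
    """Sketch of C2-UCB-T (Algorithm 1).

    phi(t)     -> (m, d) array of feature vectors at round t
    oracle(u)  -> action S_t given UCB values u (the offline oracle)
    play(t, S) -> (triggered indices tau_t, observed outcomes X[tau_t])
    """
    G = gamma * np.eye(d)          # Gram matrix G_1 = gamma * I
    b = np.zeros(d)                # b-vector b_1 = 0
    delta = 1.0 / T
    for t in range(1, T + 1):
        theta_hat = np.linalg.solve(G, b)          # line 4
        feats = phi(t)                             # shape (m, d)
        # Confidence radius rho(delta) from Proposition 1.
        rho = np.sqrt(d * np.log((gamma + K * T / d) / gamma)
                      - 2 * np.log(delta)) + np.sqrt(gamma)
        G_inv = np.linalg.inv(G)
        width = np.sqrt(np.einsum('ij,jk,ik->i', feats, G_inv, feats))
        ucb = np.clip(feats @ theta_hat + rho * width, 0.0, 1.0)  # line 6
        S = oracle(ucb)                                           # line 8
        tau, X_obs = play(t, S)                                   # line 9
        for i, x in zip(tau, X_obs):               # lines 10-11
            G += np.outer(feats[i], feats[i])
            b += feats[i] * x
    return theta_hat
```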

Theorem 1.

For a C2MAB-T instance that satisfies monotonicity (Condition 1) and TPM smoothness (Condition 2) with coefficient $B_1$, C2-UCB-T (Algorithm 1) with an $(\alpha,\beta)$-approximation oracle achieves an $(\alpha,\beta)$-approximate regret bounded by $O\left(B_1(\sqrt{d\log(KT/\gamma)}+\sqrt{\gamma})\sqrt{KTd\log(KT/\gamma)}\right)$.

Discussion. Theorem 1 yields an $O(B_1 d\sqrt{KT}\log T)$ regret bound when $d\ll K\leq m\ll T$, which is independent of the number of arms $m$ and the minimum triggering probability $p_{\min}$. For combinatorial cascading bandits (Li et al., 2016), which satisfy $B_1=1$ (see Section 5), our result improves Li et al. (2016) by a factor of $1/p_{\min}$. For the linear reward function (Takemura et al., 2021) without triggering arms (i.e., $p_i^{\bm{\mu},S}=1$ for $i\in S$, and $0$ otherwise), one can easily verify $B_1=1$, and our regret matches the lower bound $\Omega(d\sqrt{KT})$ of Takemura et al. (2021) up to logarithmic factors.

Analysis. Here we explain how to prove a regret bound that removes the $1/p_{\min}$ factor under the 1-norm TPM condition. The main challenge is that the mean vector $\bm{\mu}_t$ and the triggering probability $p_i^{\bm{\mu}_t,S}$ depend on the time-varying contexts $\bm{\phi}_t(i)$, so it is impossible to derive any meaningful concentration inequality or regret bound based on $T_{t,i}$, the number of times that arm $i$ has been triggered, which is used by the triggering group (TG) technique (Wang & Chen, 2017) to remove $1/p_{\min}$. To deal with this problem, we bypass the quantity $T_{t,i}$ and use the triggering probability equivalence (TPE) technique, which equates $p_i^{\bm{\mu}_t,S}$ with $\mathbb{E}_t[\mathbb{I}\{i\in\tau_t\}]$; this in turn replaces the expected regret produced by all possibly triggered arms with the expected regret produced by $i\in\tau_t$, avoiding $p_{\min}$. To sketch the proof, we assume the oracle is deterministic with $\beta=1$ (the randomness of the oracle and $\beta<1$ are handled in Appendix A), and let the filtration $\mathcal{F}_{t-1}$ be the history data $\mathcal{H}_t$ (defined in Section 2). Denoting $\mathbb{E}_t[\cdot]=\mathbb{E}[\cdot\mid\mathcal{F}_{t-1}]$, the round-$t$ regret satisfies $\mathbb{E}_t[\alpha\cdot r(S^*_t;\bm{\mu}_t)-r(S_t;\bm{\mu}_t)]\leq\mathbb{E}_t[r(S_t;\bar{\bm{\mu}}_t)-r(S_t;\bm{\mu}_t)]$, based on Condition 1, Lemma 1 and the definition of $S_t$. Then

$$\begin{aligned}
\mathbb{E}_t[r(S_t;\bar{\bm{\mu}}_t)-r(S_t;\bm{\mu}_t)]
&\overset{(a)}{\leq}\mathbb{E}_t\left[\sum_{i\in\tilde{S}_t}B_1 p_i^{\bm{\mu}_t,S_t}(\bar{\mu}_{t,i}-\mu_{t,i})\right]\\
&\overset{(b)}{=}\mathbb{E}\left[\sum_{i\in\tilde{S}_t}B_1\mathbb{E}_{\tau_t}[\mathbb{I}\{i\in\tau_t\}](\bar{\mu}_{t,i}-\mu_{t,i})\mid\mathcal{F}_{t-1}\right]\\
&\overset{(c)}{=}\mathbb{E}_t\left[\sum_{i\in\tilde{S}_t}\mathbb{I}\{i\in\tau_t\}B_1(\bar{\mu}_{t,i}-\mu_{t,i})\right]\\
&\overset{(d)}{=}\mathbb{E}_t\left[\sum_{i\in\tau_t}B_1(\bar{\mu}_{t,i}-\mu_{t,i})\right], \qquad (3)
\end{aligned}$$

where (a) is by Condition 2, (b) holds because $\bar{\mu}_{t,i},\mu_{t,i},S_t$ are $\mathcal{F}_{t-1}$-measurable, so the only remaining randomness is from the triggered set $\tau_t$ and we can substitute $p_i^{\bm{\mu}_t,S_t}$ with the event $\mathbb{I}\{i\in\tau_t\}$ under expectation, (c) absorbs the expectation over $\tau_t$ into $\mathbb{E}_t$, and (d) is a change of notation. After applying the TPE, we only need to bound the regret produced by $i\in\tau_t$. Hence

$$\begin{aligned}
\text{Reg}(T)&\leq\mathbb{E}\left[\sum_{t\in[T]}\mathbb{E}_t\left[\sum_{i\in\tau_t}B_1(\bar{\mu}_{t,i}-\mu_{t,i})\right]\right]\\
&\overset{(a)}{\leq}\mathbb{E}\left[\sum_{t\in[T]}\mathbb{E}_t\left[\sum_{i\in\tau_t}2B_1\rho(\delta)\left\lVert\bm{\phi}_t(i)\right\rVert_{\bm{G}_t^{-1}}\right]\right]\\
&\overset{(b)}{\leq}2B_1\rho(\delta)\,\mathbb{E}\left[\sqrt{KT\sum_{t\in[T]}\sum_{i\in\tau_t}\left\lVert\bm{\phi}_t(i)\right\rVert^2_{\bm{G}_t^{-1}}}\right]\\
&\overset{(c)}{\leq}O(B_1 d\sqrt{KT}\log T), \qquad (4)
\end{aligned}$$

where (a) follows from Lemma 1, (b) is by the Cauchy-Schwarz inequality over both $i$ and $t$, and (c) is by the ellipsoidal potential lemma (Lemma 5) in the Appendix.

Remark 3.

In addition to the general C2MAB-T setting, the TPE technique can also replace the more involved TG technique (Wang & Chen, 2017) for CMAB-T. Such a replacement saves an unnecessary union bound over the group index, which in turn reproduces Theorem 1 of Wang & Chen (2017) under Condition 2 and improves Theorem 1 of Liu et al. (2022) under Condition 4 by a factor of $O(\sqrt{\log T})$; see Appendix C for details.

4 Variance-Adaptive Algorithm and Analysis for C2MAB-T under VM/TPVM Condition

Algorithm 2 VAC2-UCB: Variance-Adaptive Contextual Combinatorial Upper Confidence Bound Algorithm
1:  Input: Base arms $[m]$, dimension $d$, regularizer $\gamma$, failure probability $\delta=1/T$, offline oracle ORACLE.
2:  Initialize: Gram matrix $\bm{G}_1=\gamma\bm{I}$, b-vector $\bm{b}_1=\bm{0}$.
3:  for $t=1,...,T$ do
4:     $\hat{\bm{\theta}}_t=\bm{G}_t^{-1}\bm{b}_t$.
5:     for $i\in[m]$ do
6:        $\bar{\mu}_{t,i}=\langle\bm{\phi}_t(i),\hat{\bm{\theta}}_t\rangle+2\rho(\delta)\left\lVert\bm{\phi}_t(i)\right\rVert_{\bm{G}_t^{-1}}$.
7:        $\underaccent{\bar}{\mu}_{t,i}=\langle\bm{\phi}_t(i),\hat{\bm{\theta}}_t\rangle-2\rho(\delta)\left\lVert\bm{\phi}_t(i)\right\rVert_{\bm{G}_t^{-1}}$.
8:        Set the optimistic variance $\bar{V}_{t,i}$ as in Equation (6).
9:     end for
10:     $S_t=\text{ORACLE}(\bar{\mu}_{t,1},...,\bar{\mu}_{t,m})$.
11:     Play $S_t$ and observe the triggered arm set $\tau_t$ and observations $(X_{t,i})_{i\in\tau_t}$.
12:     $\bm{G}_{t+1}=\bm{G}_t+\sum_{i\in\tau_t}\bar{V}_{t,i}^{-1}\bm{\phi}_t(i)\bm{\phi}_t(i)^{\top}$.
13:     $\bm{b}_{t+1}=\bm{b}_t+\sum_{i\in\tau_t}\bar{V}_{t,i}^{-1}\bm{\phi}_t(i)X_{t,i}$.
14:  end for

In this section, we propose a new variance-adaptive algorithm VAC2-UCB (Algorithm 2) to further remove the $O(\sqrt{K})$ factor and achieve an $\tilde{O}(B_v d\sqrt{T})$ regret bound for applications satisfying the stronger VM/TPVM conditions.

Different from Algorithm 1, VAC2-UCB leverages second-order statistics (i.e., variances) to speed up the learning process. To build intuition, we first assume the variance $V_{s,i}=\text{Var}[X_{s,i}]$ of each base arm $i$ at round $s$ is known in advance. In this case, VAC2-UCB adopts weighted ridge regression to learn the parameter $\bm{\theta}^*$:

$$\hat{\bm{\theta}}_t=\operatorname*{argmin}_{\bm{\theta}\in\Theta}\sum_{s<t}\sum_{i\in\tau_s}(\langle\bm{\theta},\bm{\phi}_s(i)\rangle-X_{s,i})^2/V_{s,i}+\gamma\left\lVert\bm{\theta}\right\rVert_2^2, \qquad (5)$$

where the first term is "weighted" by the true variance $V_{s,i}$. The closed-form solution of the above estimator is $\hat{\bm{\theta}}_t=\bm{G}_t^{-1}\bm{b}_t$, where the Gram matrix $\bm{G}_t=\gamma\bm{I}+\sum_{s<t}\sum_{i\in\tau_s}V_{s,i}^{-1}\bm{\phi}_s(i)\bm{\phi}_s(i)^{\top}$ and the b-vector $\bm{b}_t=\sum_{s<t}\sum_{i\in\tau_s}V_{s,i}^{-1}\bm{\phi}_s(i)X_{s,i}$, which take the same form as lines 12 and 13 of Algorithm 2 (but with the optimistic weights $\bar{V}_{s,i}$ in place of $V_{s,i}$).

The intuition for re-weighting each observation by the inverse of $V_{s,i}$ is that the smaller the variance, the more accurate the observation $(\bm{\phi}_t(i),X_{t,i})$ is, and thus the more important it is for learning the unknown $\bm{\theta}^*$. In fact, the above estimator $\hat{\bm{\theta}}_t$ is closely related to the best linear unbiased estimator (BLUE) (Henderson, 1975). Concretely, in the linear regression literature, Equation (5) yields the lowest-variance estimator of $\bm{\theta}^*$ among all unbiased linear estimators when the regularization parameter $\gamma=0$, the $V_{s,i}$ are the true variance proxies of the outcomes $(X_{s,i})_{s<t,i\in\tau_s}$, and the context sequence $(\bm{\phi}_s(i))_{s<t,i\in\tau_s}$ follows a fixed design.
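As a minimal sketch (our own illustration, assuming the triggered observations have been collected into flat arrays), the variance-weighted ridge estimator of Equation (5) can be computed in closed form as follows:

```python
import numpy as np

def weighted_ridge(Phi, X, V, gamma=1.0):
    """Closed-form solution of Eq. (5): theta_hat = G^{-1} b with
    G = gamma*I + sum_k Phi_k Phi_k^T / V_k and b = sum_k Phi_k X_k / V_k.

    Phi: (n, d) stacked feature vectors of all triggered observations
    X:   (n,)   observed outcomes
    V:   (n,)   (optimistic) variances used as weights
    """
    d = Phi.shape[1]
    W = Phi / V[:, None]                 # each row divided by its variance
    G = gamma * np.eye(d) + W.T @ Phi    # weighted Gram matrix
    b = W.T @ X                          # weighted b-vector
    return np.linalg.solve(G, b)

# Toy usage: two observations in d = 2; the low-variance one dominates.
Phi = np.array([[1.0, 0.0], [0.0, 1.0]])
theta = weighted_ridge(Phi, X=np.array([0.9, 0.4]), V=np.array([0.05, 0.25]))
print(theta)
```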

For our C2MAB-T setting, a new challenge arises since the variance $V_{s,i}=\mu_{s,i}(1-\mu_{s,i})$ is not known a priori. Inspired by (Lattimore et al., 2015; Zhou et al., 2021), we construct an optimistic estimate $\bar{V}_{s,i}$ to replace the true variance $V_{s,i}$ in Equation (5). Concretely, we obtain $\bar{V}_{t,i}$ as the optimal value of the problem $\max_{\mu\in[\underaccent{\bar}{\mu}_{t,i},\bar{\mu}_{t,i}]}\mu(1-\mu)$, whose closed-form solution is given by

$$\bar{V}_{t,i}=\begin{cases}(1-\bar{\mu}_{t,i})\bar{\mu}_{t,i},&\text{if }\bar{\mu}_{t,i}\leq\frac{1}{2},\\ (1-\underaccent{\bar}{\mu}_{t,i})\underaccent{\bar}{\mu}_{t,i},&\text{if }\underaccent{\bar}{\mu}_{t,i}\geq\frac{1}{2},\\ \frac{1}{4},&\text{otherwise},\end{cases} \qquad (6)$$

where $\bar{\mu}_{t,i}$ and $\underaccent{\bar}{\mu}_{t,i}$ are the UCB and LCB values introduced below. Notice that with high probability the true $\mu_{t,i}$ lies between the LCB and UCB values, and as they become more accurate, the optimistic variance $\bar{V}_{t,i}$ also approaches the true variance $V_{t,i}$.
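The case analysis in Equation (6) is simply the maximum of the concave map $\mu\mapsto\mu(1-\mu)$ over the confidence interval; a small sketch (our own illustration):

```python
def optimistic_variance(lcb, ucb):
    """Optimistic variance of Eq. (6): max of mu*(1-mu) over [lcb, ucb]."""
    if ucb <= 0.5:          # interval entirely left of 1/2: UCB is closest
        return ucb * (1.0 - ucb)
    if lcb >= 0.5:          # interval entirely right of 1/2: LCB is closest
        return lcb * (1.0 - lcb)
    return 0.25             # 1/2 lies inside the interval

print(optimistic_variance(0.1, 0.3))  # -> 0.21
print(optimistic_variance(0.6, 0.9))  # -> 0.24
print(optimistic_variance(0.4, 0.7))  # -> 0.25
```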

To guarantee that $\hat{\bm{\theta}}_t$ is a good estimator, we prove a new lemma (similar to Proposition 1) that establishes a concentration bound for $\hat{\bm{\theta}}_t$ in the face of the unknown variance. Note that the seminal work of Lattimore et al. (2015) proves a similar concentration bound; the difference is that we have multiple arms triggered in each round instead of a single arm. To address this, we replace the original concentration bound with the new one below, which has an extra $K^4$ factor in $N$ and ultimately yields a $\log K$ factor in the confidence radius $\rho(\delta)$.

Lemma 2.

Let $\gamma>0$ and $N=(4d^2K^4T^4)^d$, and let $\rho(\delta)=1+\sqrt{\gamma}+4\sqrt{\log\left(\frac{6TN}{\delta}\log\left(\frac{3TN}{\delta}\right)\right)}$. Then for all $t\leq T$, with probability at least $1-\delta$, $\left\lVert\hat{\bm{\theta}}_t-\bm{\theta}^*\right\rVert_{\bm{G}_t}^2\leq\rho(\delta)$.

Proof.

See Section B.1. ∎

Building on this lemma, we construct $\bar{\mu}_{t,i}$ as an upper bound on $\mu_{t,i}$ in line 6 and $\underaccent{\bar}{\mu}_{t,i}$ as a lower bound on $\mu_{t,i}$ in line 7, based on our variance-adaptive $\hat{\bm{\theta}}_t$ and $\bm{G}_t$. Note that doubling the radius to $2\rho(\delta)$ instead of using $\rho(\delta)$ from Lemma 2 is purely for the correctness of our technical analysis. As a convention, we clip $\bar{\mu}_{t,i},\underaccent{\bar}{\mu}_{t,i}$ into $[0,1]$ if they are above $1$ or below $0$.

Lemma 3.

With probability at least $1-\delta$, we have $\mu_{t,i}\leq\bar{\mu}_{t,i}\leq\mu_{t,i}+3\rho(\delta)\left\lVert\bm{\phi}_t(i)\right\rVert_{\bm{G}_t^{-1}}$ and $\mu_{t,i}\geq\underaccent{\bar}{\mu}_{t,i}\geq\mu_{t,i}-3\rho(\delta)\left\lVert\bm{\phi}_t(i)\right\rVert_{\bm{G}_t^{-1}}$ for all $i\in[m]$.

Proof.

This lemma follows from a derivation similar to that of Lemma 1, with the new definitions of $\bar{\mu}_{t,i},\underaccent{\bar}{\mu}_{t,i}$ and with the concentration now relying on Lemma 2. ∎

After the agent plays $S_t$, the base arms in $\tau_t$ are triggered, and the agent receives the observations $(X_{t,i})_{i\in\tau_t}$ as feedback. These observations (re-weighted by the optimistic variances $\bar{V}_{t,i}$) are then used to update $\bm{G}_t$ and $\bm{b}_t$ for future rounds.

Table 2: Summary of the coefficients, regret bounds and improvements for various applications.
Application | Condition | $(B_v, B_1, \lambda)$ | Regret | Improvement
Online Influence Maximization (Wen et al., 2017) | TPM | $(-,|V|,-)$ | $O(d|V|\sqrt{|E|T}\log T)$ | $\tilde{O}(\sqrt{|E|})$
Disjunctive Combinatorial Cascading Bandits (Li et al., 2016) | TPVM | $(1,1,1)$ | $O(d\sqrt{T}\log T)$ | $\tilde{O}(\sqrt{K}/p_{\min})$
Conjunctive Combinatorial Cascading Bandits (Li et al., 2016) | TPVM | $(1,1,1)$ | $O(d\sqrt{T}\log T)$ | $\tilde{O}(\sqrt{K}/r_{\max})$
Linear Cascading Bandits (Vial et al., 2022)∗ | TPVM | $(1,1,1)$ | $O(d\sqrt{T}\log T)$ | $\tilde{O}(\sqrt{K/d})$
Multi-layered Network Exploration (Liu et al., 2021b) | TPVM | $(\sqrt{1.25|V|},1,2)$ | $O(d\sqrt{|V|T}\log T)$ | $\tilde{O}(\sqrt{n}/p_{\min})$
Probabilistic Maximum Coverage (Chen et al., 2013)∗∗ | VM | $(3\sqrt{2|V|},1,-)$ | $O(d\sqrt{|V|T}\log T)$ | $\tilde{O}(\sqrt{k})$
  • $|V|, |E|, n, k, L$ denote the number of target nodes, the number of edges that can be triggered by the set of seed nodes, the number of layers, the number of seed nodes, and the length of the longest directed path, respectively;

  • $K$ is the length of the ordered list, and $r_{\max}=\alpha\cdot\max_{t\in[T],S\in\mathcal{S}}r(S;\bm{\mu}_t)$;

  • ∗ A special case of disjunctive combinatorial cascading bandits;

  • ∗∗ This row is for a C2MAB application; the remaining rows are for C2MAB-T applications.

4.1 Results and Analysis under VM Condition

We first show a regret bound for VAC2-UCB that is independent of the batch size $K$ when the VM condition holds.

Theorem 2.

For a C2MAB-T instance that satisfies monotonicity (Condition 1) and VM smoothness (Condition 3) with coefficient $(B_v,B_1)$, VAC2-UCB (Algorithm 2) with an $(\alpha,\beta)$-approximation oracle achieves an $(\alpha,\beta)$-approximate regret bounded by $O\left(\frac{B_v}{\sqrt{p_{\min}}}(\sqrt{d\log(KT/\gamma)}+\sqrt{\gamma})\sqrt{Td\log(KT/\gamma)}\right)$.

Discussion. Theorem 2 yields an $O(B_v d\sqrt{T}\log T/\sqrt{p_{\min}})$ regret bound when $d\ll K\leq m\ll T$. For combinatorial cascading bandits (Li et al., 2016) with $B_v=1$, our regret is independent of $m$ and $K$ and improves Li et al. (2016) by a factor of $O(\sqrt{K/p_{\min}})$.

In addition to the general C2MAB-T setting, one can verify that for non-triggering C2MAB, $p_{\min}=1$, so we obtain the batch-size independent regret bound $O(B_v d\sqrt{T}\log T)$. Recall that $B_v=O(B_1\sqrt{K})$ for any C2MAB-T instance, so our regret bound reproduces $O(B_1 d\sqrt{KT}\log T)$ and thus matches the corresponding lower bound (Takemura et al., 2021) for linear reward functions. For the more interesting non-linear reward functions with $B_v=o(B_1\sqrt{K})$, our regret improves upon the non-variance-adaptive algorithm C2UCB, whose regret is $O(B_1 d\sqrt{KT}\log T)$ (Qin et al., 2014; Takemura et al., 2021).

Analysis. At a high level, the $\sqrt{K}$ improvement comes from the VM condition and the optimistic variance, which together avoid the use of the Cauchy-Schwarz inequality that generates the $O(\sqrt{K})$ factor in step (b) of Section 3. To leverage the variance information, we decompose the regret into terms (I) and (II):

$$\text{Reg}(T)\leq\mathbb{E}\left[\sum_{t=1}^{T}r(S_t;\bar{\bm{\mu}}_t)-r(S_t;\bm{\mu}_t)\right]\leq\mathbb{E}\left[\sum_{t=1}^{T}\underbrace{\left|r(S_t;\bar{\bm{\mu}}_t)-r(S_t;\tilde{\bm{\mu}}_t)\right|}_{\text{(I)}}+\underbrace{\left|r(S_t;\bm{\mu}_t)-r(S_t;\tilde{\bm{\mu}}_t)\right|}_{\text{(II)}}\right], \qquad (7)$$

where $\tilde{\bm{\mu}}_t$ is the vector whose $i$-th entry is the maximizer that achieves the optimistic variance $\bar{V}_{t,i}$, i.e., $\tilde{\mu}_{t,i}=\operatorname*{argmax}_{\mu\in[\underaccent{\bar}{\mu}_{t,i},\bar{\mu}_{t,i}]}\mu(1-\mu)$. We now sketch how to bound term (I); term (II) can be bounded similarly.

$$\begin{aligned}
\mathbb{E}\left[\sum_{t\in[T]}\text{(I)}\right]
&\overset{(a)}{\leq}B_v\,\mathbb{E}\left[\sum_{t=1}^{T}\sqrt{\sum_{i\in\tilde{S}_t}(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^2/\bar{V}_{t,i}}\right]\\
&\overset{(b)}{\leq}\frac{B_v}{\sqrt{p_{\min}}}\,\mathbb{E}\left[\sum_{t=1}^{T}\sqrt{\sum_{i\in\tilde{S}_t}p_i^{\bm{\mu}_t,S_t}(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^2/\bar{V}_{t,i}}\right]\\
&\overset{(c)}{\leq}\frac{B_v}{\sqrt{p_{\min}}}\sqrt{T\,\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i\in\tilde{S}_t}p_i^{\bm{\mu}_t,S_t}(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^2/\bar{V}_{t,i}\right]}\\
&\overset{(d)}{\leq}\frac{B_v}{\sqrt{p_{\min}}}\sqrt{T\,\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i\in\tau_t}(6\rho(\delta)\left\lVert\bm{\phi}_t(i)\right\rVert_{\bm{G}_t^{-1}})^2/\bar{V}_{t,i}\right]}\\
&\overset{(e)}{\leq}O(B_v d\sqrt{T\log T}/\sqrt{p_{\min}}), \qquad (8)
\end{aligned}$$

where (a) is by Condition 3, (b) is because $p_i^{\bm{\mu}_t,S_t}\geq p_{\min}$ for $i\in\tilde{S}_t$, (c) is by Cauchy-Schwarz over $t$ and Jensen's inequality, (d) follows from the TPE and Lemma 3, and (e) follows from Lemma 6.

4.2 Results and Analysis under TPVM Condition

Next, we show that VAC2-UCB can achieve regret bounds that remove the $O(\sqrt{K})$ and $O(1/\sqrt{p_{\min}})$ factors for applications satisfying the stronger TPVM condition.

We first introduce a mild condition on the triggering probabilities (similar to Condition 2) that supports our regret bounds and analysis.

Condition 5 (1-norm TPM Bounded Smoothness for Triggering Probability).

We say that a C2MAB-T problem instance satisfies the triggering probability modulated $B_p$-bounded smoothness condition over the triggering probability if, for any action $S\in\mathcal{S}$, any mean vectors $\bm{\mu},\bm{\mu}'\in[0,1]^m$, and any arm $i\in[m]$, we have $|p_i^{\bm{\mu}',S}-p_i^{\bm{\mu},S}|\leq B_p\sum_{j\in[m]}p_j^{\bm{\mu},S}|\mu_j-\mu'_j|$.

Now we state our main theorem as follows.

Theorem 3.

For a C2MAB-T instance whose reward function satisfies monotonicity (Condition 1) and TPVM smoothness (Condition 4) with coefficients $(B_v,B_1,\lambda)$, and whose triggering probabilities $p_i^{\bm{\mu},S}$ satisfy 1-norm TPM smoothness with coefficient $B_p$ (Condition 5): if $\lambda\geq 2$, then VAC2-UCB (Algorithm 2) with an $(\alpha,\beta)$-approximation oracle achieves an $(\alpha,\beta)$-approximate regret bounded by

$$O\left(B_v d\sqrt{T}\log T+B_v B_p\sqrt{K}(d\log T)^2/\sqrt{p_{\min}}\right), \qquad (9)$$

and if $\lambda\geq 1$, then VAC2-UCB (Algorithm 2) achieves an $(\alpha,\beta)$-approximate regret bounded by

$$O\left(B_v d\sqrt{T}\log T+B_v\sqrt{B_p}(KT)^{1/4}(d\log T)^{3/2}/\sqrt{p_{\min}}\right). \qquad (10)$$

Discussion. The leading term of Theorem 3 is $O(B_v d\sqrt{T}\log T)$ when $d\ll K\leq m\ll T$, which removes the $1/\sqrt{p_{\min}}$ factor compared with Theorem 2. Also, notice that Theorem 3 relies on an additional $B_p$-smoothness condition over the triggering probability. However, we claim that this condition is mild and is almost always satisfied with $B_p=B_1$ for the applications considered in this paper (see Appendix D).

Analysis. We use the regret decomposition of Equation (7) into the same terms (I) and (II), and leverage the TPVM condition (Condition 4) to obtain:

$$\mathbb{E}\left[\sum_{t\in[T]}\text{(I)}\right]\overset{(a)}{\leq}B_v\,\mathbb{E}\left[\sum_{t=1}^{T}\sqrt{\sum_{i\in\tilde{S}_t}(p_i^{\tilde{\bm{\mu}}_t,S_t})^{\lambda}(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^2/\bar{V}_{t,i}}\right]. \qquad (11)$$

However, we cannot apply TPE as in Equation (8), because $p_i^{\tilde{\bm{\mu}}_t,S_t}\neq p_i^{\bm{\mu}_t,S_t}$ in general. To handle this mismatch, we use the fact that the triggering probability usually satisfies the smoothness condition in Condition 5, and prove that the mismatch only affects lower-order terms, as follows.

By Condition 5, $(p_i^{\tilde{\bm{\mu}}_t,S_t})^{\lambda}$ is upper bounded by $\left(p_i^{\bm{\mu}_t,S_t}+\min\{1,\sum_{j\in\tilde{S}_t}B_p p_j^{\bm{\mu}_t,S_t}|\mu_{t,j}-\tilde{\mu}_{t,j}|\}\right)^2$ when $\lambda\geq 2$, and the regret is bounded as follows:

$$\text{Eq. }(11)\leq\underbrace{\mathbb{E}\left[\sum_{t=1}^{T}B_v\sqrt{\sum_{i\in\tilde{S}_t}3p_i^{\bm{\mu}_t,S_t}(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^2/\bar{V}_{t,i}}\right]}_{\text{leading term}}+\underbrace{\frac{B_v B_p\sqrt{K}}{\sqrt{p_{\min}}}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i\in\tilde{S}_t}p_i^{\bm{\mu}_t,S_t}(\bar{\mu}_{t,i}-\underaccent{\bar}{\mu}_{t,i})^2/\bar{V}_{t,i}\right]}_{\text{lower-order term}},$$

where the leading term is of order $O(B_v\sqrt{T}\log T)$ by the same derivation as steps (c)-(e) of Equation (8), and the lower-order term is bounded by $O(B_v B_p\sqrt{K/p_{\min}}\log T)$ by TPE and the weighted ellipsoidal potential lemma (Lemma 6). For $\lambda\geq 1$, the lower-order term becomes $\frac{B_v\sqrt{B_p}K^{1/4}}{\sqrt{p_{\min}}}\mathbb{E}\left[\sum_{t=1}^{T}\left(\sum_{i\in\tilde{S}_t}\frac{p_i^{\bm{\mu}_t,S_t}(\bar{\mu}_{t,i}-\underaccent{\bar}{\mu}_{t,i})^2}{\bar{V}_{t,i}}\right)^{3/4}\right]$, which results in a larger lower-order regret term. See Section B.3 for details.

5 Applications and Experiments

We now turn to applications and experimental results. We first show how our theoretical results improve various C2MAB and C2MAB-T applications under the 1-norm TPM, TPVM and VM smoothness conditions with their corresponding $B_1, B_v, \lambda$ coefficients. Then, we provide an empirical comparison in the context of the contextual cascading bandit application.

The instantiation of our theoretical results in the context of a variety of specific C2MAB  and C2MAB-T applications is shown in Table 2. The final column of the table details the improvement in regret that our results yield in each case. For detailed settings, proofs, and the discussion of the application results, see Appendix D.

Our experimental results are summarized in Figure 1, which reports experiments on the MovieLens-1M dataset (grouplens.org/datasets/movielens/1m/). Experiments on other data are included in the Appendix. Figure 1 illustrates that our VAC2-UCB algorithm outperforms C3-UCB (Li et al., 2016), the variance-agnostic cascading bandit algorithm, and CascadeWOFUL (Vial et al., 2022), the state-of-the-art variance-aware cascading bandit algorithm, eventually incurring $45\%$ and $25\%$ less regret, respectively. For detailed settings, comparisons, and discussions, see Appendix E.

Figure 1: Regret results for MovieLens data. (a) All genres. (b) Particular genre.

6 Conclusion

This paper studies contextual combinatorial bandits with probabilistically triggered arms (C2MAB-T) under a variety of smoothness conditions. Under the triggering probability modulated (TPM) condition, we design the C2-UCB-T algorithm and propose a novel analysis to achieve an $\tilde{O}(d\sqrt{KT})$ regret bound, removing a potentially exponentially large factor $O(1/p_{\min})$. Under the variance modulated conditions (VM or TPVM), we propose a new variance-adaptive algorithm VAC2-UCB and derive a regret bound $\tilde{O}(d\sqrt{T})$, which removes the batch-size $K$ dependence. As a valuable by-product, our TPE analysis technique and variance-adaptive algorithm can be applied to the CMAB-T and C2MAB settings, improving existing results as well. Experiments show that our algorithm achieves at least 13% and 25% improvement over benchmark algorithms on synthetic and real-world datasets, respectively. For future work, it would be interesting to extend our application scenarios. One could also relax the perfectly linear assumption by introducing model mis-specifications or corruptions.

Acknowledgement

The work of John C.S. Lui was supported in part by RGC’s GRF 14215722. The work of Mohammad Hajiesmaili is supported by NSF CAREER-2045641, CPS-2136199, CNS-2106299, and CNS-2102963. Wierman is supported by NSF grants CNS-2146814, CPS-2136197, CNS-2106403, and NGSDI-2105648.

References

  • Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011.
  • Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
  • Bernstein (1946) Bernstein, S. The Theory of Probabilities (Russian). Moscow, 1946.
  • Bubeck et al. (2012) Bubeck, S., Cesa-Bianchi, N., et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • Chen et al. (2013) Chen, W., Wang, Y., and Yuan, Y. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pp. 151–159. PMLR, 2013.
  • Chen et al. (2016a) Chen, W., Wang, Y., Yuan, Y., and Wang, Q. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research, 17(1):1746–1778, 2016a.
  • Chen et al. (2016b) Chen, X., Li, Y., Wang, P., and Lui, J. A general framework for estimating graphlet statistics via random walk. Proceedings of the VLDB Endowment, 10(3):253–264, 2016b.
  • Combes et al. (2015) Combes, R., Talebi Mazraeh Shahi, M. S., Proutiere, A., et al. Combinatorial bandits revisited. Advances in neural information processing systems, 28, 2015.
  • Freedman (1975) Freedman, D. A. On tail probabilities for martingales. The Annals of Probability, pp. 100–118, 1975.
  • Gai et al. (2012) Gai, Y., Krishnamachari, B., and Jain, R. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking (TON), 20(5):1466–1478, 2012.
  • Henderson (1975) Henderson, C. R. Best linear unbiased estimation and prediction under a selection model. Biometrics, pp.  423–447, 1975.
  • Kempe et al. (2003) Kempe, D., Kleinberg, J., and Tardos, É. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp.  137–146, 2003.
  • Kveton et al. (2015a) Kveton, B., Wen, Z., Ashkan, A., and Szepesvári, C. Combinatorial cascading bandits. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 1, pp.  1450–1458, 2015a.
  • Kveton et al. (2015b) Kveton, B., Wen, Z., Ashkan, A., and Szepesvari, C. Tight regret bounds for stochastic combinatorial semi-bandits. In AISTATS, 2015b.
  • Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020.
  • Lattimore et al. (2015) Lattimore, T., Crammer, K., and Szepesvári, C. Linear multi-resource allocation with semi-bandit feedback. Advances in Neural Information Processing Systems, 28, 2015.
  • Li et al. (2016) Li, S., Wang, B., Zhang, S., and Chen, W. Contextual combinatorial cascading bandits. In International conference on machine learning, pp. 1245–1253. PMLR, 2016.
  • Li et al. (2020) Li, S., Kong, F., Tang, K., Li, Q., and Chen, W. Online influence maximization under linear threshold model. Advances in Neural Information Processing Systems, 33:1192–1204, 2020.
  • Liu et al. (2021a) Liu, L. T., Ruan, F., Mania, H., and Jordan, M. I. Bandit learning in decentralized matching markets. Journal of Machine Learning Research, 22(211):1–34, 2021a.
  • Liu et al. (2021b) Liu, X., Zuo, J., Chen, X., Chen, W., and Lui, J. C. Multi-layered network exploration via random walks: From offline optimization to online learning. In International Conference on Machine Learning, pp. 7057–7066. PMLR, 2021b.
  • Liu et al. (2022) Liu, X., Zuo, J., Wang, S., Joe-Wong, C., Lui, J., and Chen, W. Batch-size independent regret bounds for combinatorial semi-bandits with probabilistically triggered arms or independent arms. In Advances in Neural Information Processing Systems, 2022.
  • Merlis & Mannor (2019) Merlis, N. and Mannor, S. Batch-size independent regret bounds for the combinatorial multi-armed bandit problem. In Conference on Learning Theory, pp.  2465–2489. PMLR, 2019.
  • Qin et al. (2014) Qin, L., Chen, S., and Zhu, X. Contextual combinatorial bandit and its application on diversified online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining, pp.  461–469. SIAM, 2014.
  • Robbins (1952) Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
  • Takemura et al. (2021) Takemura, K., Ito, S., Hatano, D., Sumita, H., Fukunaga, T., Kakimura, N., and Kawarabayashi, K.-i. Near-optimal regret bounds for contextual combinatorial semi-bandits with linear payoff functions. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.  9791–9798, 2021.
  • Vial et al. (2022) Vial, D., Shakkottai, S., and Srikant, R. Minimax regret for cascading bandits. In Advances in Neural Information Processing Systems, 2022.
  • Wang & Chen (2017) Wang, Q. and Chen, W. Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications. In Advances in Neural Information Processing Systems, pp. 1161–1171, 2017.
  • Wen et al. (2017) Wen, Z., Kveton, B., Valko, M., and Vaswani, S. Online influence maximization under independent cascade model with semi-bandit feedback. Advances in neural information processing systems, 30, 2017.
  • Zhou et al. (2021) Zhou, D., Gu, Q., and Szepesvari, C. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory, pp.  4532–4576. PMLR, 2021.
  • Zong et al. (2016) Zong, S., Ni, H., Sung, K., Ke, N. R., Wen, Z., and Kveton, B. Cascading bandits for large-scale recommendation problems. arXiv preprint arXiv:1603.05359, 2016.
  • Zuo et al. (2022) Zuo, J., Liu, X., Joe-Wong, C., Lui, J. C., and Chen, W. Online competitive influence maximization. In International Conference on Artificial Intelligence and Statistics, pp.  11472–11502. PMLR, 2022.

Appendix

The Appendix is organized as follows. Appendix A gives the detailed proofs for theorems and lemmas in Section 3. Appendix B provides the detailed proofs for theorems and lemmas in Section 4. Appendix C shows how the triggering probability equivalence technique can be applied to non-contextual CMAB-T to obtain improved results. Appendix D gives the detailed settings, results and comparisons included in Table 2. Appendix E provides detailed experimental setups and additional results. Appendix F summarizes the concentration bounds, facts, and technical lemmas used in this paper.

Appendix A Proofs for C2MAB-T under the TPM Condition (Section 3)

A.1 Proof of Theorem 1

We first recall some definitions and events. Recall that in Algorithm 1, the Gram matrix, the b-vector, and the estimator are

𝑮t\displaystyle\bm{G}_{t} =γ𝑰+s<tiτsϕs(i)ϕs(i)\displaystyle=\gamma\bm{I}+\sum_{s<t}\sum_{i\in\tau_{s}}\bm{\phi}_{s}(i)\bm{\phi}_{s}(i)^{\top} (12)
𝒃t\displaystyle\bm{b}_{t} =s<tiτsϕs(i)Xs,i\displaystyle=\sum_{s<t}\sum_{i\in\tau_{s}}\bm{\phi}_{s}(i)X_{s,i} (13)
𝜽^t\displaystyle\hat{\bm{\theta}}_{t} =𝑮t1𝒃t.\displaystyle=\bm{G}_{t}^{-1}\bm{b}_{t}. (14)
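For readers who prefer code, the following is a minimal numpy sketch of the update in Equations 12–14; the function name, the `history` container, and the default `gamma` are illustrative choices and not part of Algorithm 1's pseudocode.

```python
import numpy as np

def least_squares_estimate(history, d, gamma=1.0):
    """Minimal sketch of Equations 12-14: ridge-regression estimate built from the
    observed (triggered) arms.

    `history` is a list over rounds s < t; each entry is a list of (phi, x) pairs,
    one per triggered arm i in tau_s, with phi the d-dimensional feature phi_s(i)
    and x the observed outcome X_{s,i}.
    """
    G = gamma * np.eye(d)                 # Gram matrix G_t
    b = np.zeros(d)                       # b-vector b_t
    for round_obs in history:
        for phi, x in round_obs:
            G += np.outer(phi, phi)       # accumulate phi_s(i) phi_s(i)^T
            b += x * phi                  # accumulate phi_s(i) * X_{s,i}
    theta_hat = np.linalg.solve(G, b)     # theta_hat_t = G_t^{-1} b_t
    return G, b, theta_hat
```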

Let {\mathcal{W}}_{t} denote the nice event that the oracle outputs a solution S with r(S;\bm{\mu})\geq\alpha\cdot r(S^{*};\bm{\mu}), where S^{*}=\operatorname*{argmax}_{S\in\mathcal{S}}r(S;\bm{\mu}), for any \bm{\mu} at round t. Let \mathcal{N}_{t} denote the nice event that \left\lVert\hat{\bm{\theta}}_{t}-\bm{\theta}^{*}\right\rVert_{\bm{G}_{t}}\leq\rho(\delta) holds for any t\in[T]. Define the filtration \mathcal{F}_{t-1}=(S_{1},\bm{\phi}_{1},\tau_{1},(X_{1,i})_{i\in\tau_{1}},...,S_{t-1},\bm{\phi}_{t-1},\tau_{t-1},(X_{t-1,i})_{i\in\tau_{t-1}},S_{t},\bm{\phi}_{t}), which takes both the history data \mathcal{H}_{t} and the action S_{t} so as to handle the randomness of the oracle, and let \mathbb{E}_{t}[\cdot]=\mathbb{E}[\cdot\mid\mathcal{F}_{t-1}]. Now we bound the regret under the nice events {\mathcal{W}}_{t} and \mathcal{N}_{t}:

Reg(T)=(a)𝔼[t[T]𝔼t[αr(St;𝝁t)r(St;𝝁t)]]\displaystyle\text{Reg}(T)\overset{(a)}{=}\mathbb{E}\left[\sum_{t\in[T]}\mathbb{E}_{t}[\alpha\cdot r(S^{*}_{t};\bm{\mu}_{t})-r(S_{t};\bm{\mu}_{t})]\right]
(b)𝔼[t[T]𝔼t[αr(St;𝝁¯t)r(St;𝝁t)]](c)𝔼[t[T]𝔼t[r(St;𝝁¯t)r(St;𝝁t)]]\displaystyle\overset{(b)}{\leq}\mathbb{E}\left[\sum_{t\in[T]}\mathbb{E}_{t}[\alpha\cdot r(S^{*}_{t};\bar{\bm{\mu}}_{t})-r(S_{t};\bm{\mu}_{t})]\right]\overset{(c)}{\leq}\mathbb{E}\left[\sum_{t\in[T]}\mathbb{E}_{t}[r(S_{t};\bar{\bm{\mu}}_{t})-r(S_{t};\bm{\mu}_{t})]\right]
(d)𝔼[t[T]𝔼t[iS~tB1pi𝝁t,St(μ¯t,iμt,i)]]\displaystyle\overset{(d)}{\leq}\mathbb{E}\left[\sum_{t\in[T]}\mathbb{E}_{t}\left[\sum_{i\in\tilde{S}_{t}}B_{1}p_{i}^{\bm{\mu}_{t},S_{t}}(\bar{\mu}_{t,i}-\mu_{t,i})\right]\right]
=(e)𝔼[t[T]𝔼t[iτtB1(μ¯t,iμt,i)]]\displaystyle\overset{(e)}{=}\mathbb{E}\left[\sum_{t\in[T]}\mathbb{E}_{t}\left[\sum_{i\in\tau_{t}}B_{1}(\bar{\mu}_{t,i}-\mu_{t,i})\right]\right]
(f)𝔼[t[T]𝔼t[iτt2B1ρ(δ)ϕt(i)𝑮t1]]=(g)2B1ρ(δ)𝔼[t[T]iτtϕt(i)𝑮t1]\displaystyle\overset{(f)}{\leq}\mathbb{E}\left[\sum_{t\in[T]}\mathbb{E}_{t}\left[\sum_{i\in\tau_{t}}2B_{1}\rho(\delta)\left\lVert\bm{\phi}_{t}(i)\right\rVert_{\bm{G}_{t}^{-1}}\right]\right]\overset{(g)}{=}2B_{1}\rho(\delta)\mathbb{E}\left[\sum_{t\in[T]}\sum_{i\in\tau_{t}}\left\lVert\bm{\phi}_{t}(i)\right\rVert_{\bm{G}_{t}^{-1}}\right]
(h)2B1ρ(δ)𝔼[KTt[T]iτtϕt(i)𝑮t12]\displaystyle\overset{(h)}{\leq}2B_{1}\rho(\delta)\mathbb{E}\left[\sqrt{KT\sum_{t\in[T]}\sum_{i\in\tau_{t}}\left\lVert\bm{\phi}_{t}(i)\right\rVert^{2}_{\bm{G}_{t}^{-1}}}\right]
(i)O(B1(2dlogT+γ)2KTlogT)O(B1dKTlogT).\displaystyle\overset{(i)}{\leq}O\left(B_{1}(\sqrt{2d\log T}+\sqrt{\gamma})\sqrt{2KT\log T}\right)\leq O(B_{1}d\sqrt{KT}\log T). (15)

where (a) follows from the regret definition and the tower rule, (b) is by Condition 1 and Lemma 1, which gives \mu_{t,i}\leq\bar{\mu}_{t,i}, (c) is by the nice event {\mathcal{W}}_{t} and the definition of S_{t}, (d) is by Condition 2, (e) follows from the TPE trick (Lemma 4), (f) is by Lemma 1, (g) is by the tower rule, (h) is by the Cauchy–Schwarz inequality, and (i) is by the ellipsoidal potential lemma (Lemma 5). Similar to Wang & Chen (2017), the theorem is concluded by the definition of the (\alpha,\beta)-approximate regret, together with the events \neg{\mathcal{W}}_{t} or \neg\mathcal{N}_{t}, which contribute at most (1-\beta)T\Delta_{\max}+\delta T\Delta_{\max} regret.

A.2 Important Lemmas used for proving Theorem 1

See Lemma 1.

Proof.

For any i[m],t[T]i\in[m],t\in[T], we have

|𝜽^t,ϕt(i)𝜽,ϕt(i)|\displaystyle\left|\langle\hat{\bm{\theta}}_{t},\bm{\phi}_{t}(i)\rangle-\langle\bm{\theta}^{*},\bm{\phi}_{t}(i)\rangle\right|
=|𝜽^t𝜽,ϕt(i)|\displaystyle{=}\left|\langle\hat{\bm{\theta}}_{t}-\bm{\theta}^{*},\bm{\phi}_{t}(i)\rangle\right|
(a)𝜽^t𝜽𝑮tϕt(i)𝑮t1\displaystyle\overset{(a)}{\leq}\left\lVert\hat{\bm{\theta}}_{t}-\bm{\theta}^{*}\right\rVert_{\bm{G}_{t}}\cdot\left\lVert\bm{\phi}_{t}(i)\right\rVert_{\bm{G}_{t}^{-1}}
(b)ρ(δ)ϕt(i)𝑮t1,\displaystyle\overset{(b)}{\leq}\rho(\delta)\left\lVert\bm{\phi}_{t}(i)\right\rVert_{\bm{G}_{t}^{-1}},

where (a) is by Cauchy–Schwarz and (b) is by Proposition 1. Using the definitions \mu_{t,i}=\langle\bm{\theta}^{*},\bm{\phi}_{t}(i)\rangle and \bar{\mu}_{t,i}=\langle\hat{\bm{\theta}}_{t},\bm{\phi}_{t}(i)\rangle+\rho(\delta)\left\lVert\bm{\phi}_{t}(i)\right\rVert_{\bm{G}_{t}^{-1}} finishes the proof. ∎
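As a concrete illustration of the resulting upper confidence bound, a minimal sketch of computing \bar{\mu}_{t,i} from \hat{\bm{\theta}}_{t}, \bm{G}_{t}, and \rho(\delta) might look as follows; the function name is an illustrative choice.

```python
import numpy as np

def ucb_index(theta_hat, G, phi, rho):
    """Illustrative sketch of the UCB used with Lemma 1:
    mu_bar = <theta_hat_t, phi_t(i)> + rho(delta) * ||phi_t(i)||_{G_t^{-1}}."""
    width = rho * np.sqrt(phi @ np.linalg.solve(G, phi))  # rho(delta) * ||phi||_{G^{-1}}
    return float(theta_hat @ phi) + width                 # one may clip to [0, 1] for Bernoulli means
```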

Lemma 4 (Triggering Probability Equivalence (TPE)).

𝔼t[iS~tB1pi𝝁t,St(μ¯t,iμt,i)]=𝔼t[iτtB1(μ¯t,iμt,i)]\mathbb{E}_{t}\left[\sum_{i\in\tilde{S}_{t}}B_{1}p_{i}^{\bm{\mu}_{t},S_{t}}(\bar{\mu}_{t,i}-\mu_{t,i})\right]=\mathbb{E}_{t}\left[\sum_{i\in\tau_{t}}B_{1}(\bar{\mu}_{t,i}-\mu_{t,i})\right].

Proof.

We have

𝔼t[iS~tB1pi𝝁t,St(μ¯t,iμt,i)]\displaystyle\mathbb{E}_{t}\left[\sum_{i\in\tilde{S}_{t}}B_{1}p_{i}^{\bm{\mu}_{t},S_{t}}(\bar{\mu}_{t,i}-\mu_{t,i})\right]
=(a)𝔼[iS~tB1𝔼τt[𝕀{iτt}](μ¯t,iμt,i)t1]\displaystyle\overset{(a)}{=}\mathbb{E}\left[\sum_{i\in\tilde{S}_{t}}B_{1}\mathbb{E}_{\tau_{t}}[\mathbb{I}\{i\in\tau_{t}\}](\bar{\mu}_{t,i}-\mu_{t,i})\mid\mathcal{F}_{t-1}\right]
=(b)𝔼t[iS~t𝕀{iτt}B1(μ¯t,iμt,i)]\displaystyle\overset{(b)}{=}\mathbb{E}_{t}\left[\sum_{i\in\tilde{S}_{t}}\mathbb{I}\{i\in\tau_{t}\}B_{1}(\bar{\mu}_{t,i}-\mu_{t,i})\right]
=(c)𝔼t[iτtB1(μ¯t,iμt,i)],\displaystyle\overset{(c)}{=}\mathbb{E}_{t}\left[\sum_{i\in\tau_{t}}B_{1}(\bar{\mu}_{t,i}-\mu_{t,i})\right], (16)

where (a) holds because \bar{\mu}_{t,i},\mu_{t,i},S_{t} are \mathcal{F}_{t-1}-measurable, so the only remaining randomness is the triggering set \tau_{t} and we can replace p_{i}^{\bm{\mu}_{t},S_{t}} with the indicator \mathbb{I}\{i\in\tau_{t}\} under the expectation, (b) absorbs the expectation over \tau_{t} into \mathbb{E}_{t}, and (c) is a simple change of notation. In general, TPE can be applied whenever the quantities (other than p_{i}^{D,S}) are \mathcal{F}_{t-1}-measurable, which will be useful in later sections. ∎
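A quick Monte Carlo sanity check of the TPE identity is given below, under the simplifying and purely hypothetical assumption that arms are triggered independently with probabilities p_i; the lemma itself only uses linearity of expectation and does not require independence.

```python
import numpy as np

# Monte Carlo sanity check of the TPE identity in Lemma 4 (independent triggering assumed
# only for this illustration).
rng = np.random.default_rng(1)
p = np.array([0.9, 0.5, 0.2])              # triggering probabilities p_i^{mu_t, S_t}
w = np.array([0.3, 0.1, 0.05])             # weights B_1 * (mu_bar_{t,i} - mu_{t,i})
emp = np.mean([(rng.random(3) < p) @ w for _ in range(200_000)])
print(emp, p @ w)                          # the two values agree up to Monte Carlo error
```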

Lemma 5 (Ellipsoidal Potential Lemma).

t=1Tiτtϕt(i)𝑮t122dlog(1+KT/(γd))2dlogT\sum_{t=1}^{T}\sum_{i\in\tau_{t}}\left\lVert\bm{\phi}_{t}(i)\right\rVert_{\bm{G}_{t}^{-1}}^{2}\leq 2d\log(1+KT/(\gamma d))\leq 2d\log T when γK\gamma\geq K.

Proof.
det(𝑮t+1)\displaystyle\det(\bm{G}_{t+1}) =(a)det(𝑮t+iτtϕt(i)ϕt(i))\displaystyle\overset{(a)}{=}\det\left(\bm{G}_{t}+\sum_{i\in\tau_{t}}\bm{\phi}_{t}(i)\bm{\phi}_{t}(i)^{\top}\right)
=(b)det(𝑮t)det(𝑰+iτtGt1/2ϕt(i)(Gt1/2ϕt(i)))\displaystyle\overset{(b)}{=}\det(\bm{G}_{t})\cdot\det\left(\bm{I}+\sum_{i\in\tau_{t}}G_{t}^{-1/2}\bm{\phi}_{t}(i)(G_{t}^{-1/2}\bm{\phi}_{t}(i))^{\top}\right)
(c)det(𝑮t)(1+iτtϕt(i)𝑮t12)\displaystyle\overset{(c)}{\geq}\det(\bm{G}_{t})\cdot\left(1+\sum_{i\in\tau_{t}}\left\lVert\bm{\phi}_{t}(i)\right\rVert_{\bm{G}_{t}^{-1}}^{2}\right)
(d)det(γ𝑰)s=1t(1+iτsϕs(i)𝑮s12),\displaystyle\overset{(d)}{\geq}\det(\gamma\bm{I})\prod_{s=1}^{t}\left(1+\sum_{i\in\tau_{s}}\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}^{2}\right), (17)

where (a) follows from the definition, (b) follows from det(𝑨𝑩)=det(𝑨)det(𝑩)\det(\bm{A}\bm{B})=\det(\bm{A})\det(\bm{B}) and 𝑨+𝑩=𝑨1/2(𝑰+𝑨1/2𝑩𝑨1/2)𝑨1/2\bm{A}+\bm{B}=\bm{A}^{1/2}(\bm{I}+\bm{A}^{-1/2}\bm{B}\bm{A}^{-1/2})\bm{A}^{1/2}, (c) follows from Lemma 14, (d) follows from repeatedly applying (c).

Since \left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}^{2}\leq\frac{\left\lVert\bm{\phi}_{s}(i)\right\rVert^{2}}{\lambda_{\min}(\bm{G}_{s})}\leq 1/\gamma\leq 1/K and |\tau_{s}|\leq K, we have \sum_{i\in\tau_{s}}\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}^{2}\leq 1. Using the fact that 2\log(1+x)\geq x for any x\in[0,1], we have

\sum_{s=1}^{t}\sum_{i\in\tau_{s}}\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}^{2}
2s=1tlog(1+iτsϕs(i)𝑮s12)\displaystyle{\leq}2\sum_{s=1}^{t}\log\left(1+\sum_{i\in\tau_{s}}\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}^{2}\right)
=2logs=1t(1+iτsϕs(i)𝑮s12)\displaystyle=2\log\prod_{s=1}^{t}\left(1+\sum_{i\in\tau_{s}}\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}^{2}\right)
(a)2log(det(𝑮t+1)det(γ𝑰))\displaystyle\overset{(a)}{\leq}2\log\left(\frac{\det(\bm{G}_{t+1})}{\det(\gamma\bm{I})}\right)
(b)2log((γ+KT/d)dγd)=2dlog(1+KT/(γd))2dlog(T),\displaystyle\overset{(b)}{\leq}2\log\left(\frac{(\gamma+KT/d)^{d}}{\gamma^{d}}\right)=2d\log(1+KT/(\gamma d))\leq 2d\log(T),

where (a) follows from Equation 17 and (b) follows from Lemma 15. ∎
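The determinant inequality used in step (c) of Equation 17 (via Lemma 14) can also be checked numerically; the sketch below uses random features with \left\lVert\bm{\phi}\right\rVert\leq 1 and a generic positive-definite Gram matrix, and is an illustrative check rather than a proof.

```python
import numpy as np

# Numerical check of det(G + sum_i phi_i phi_i^T) >= det(G) * (1 + sum_i ||phi_i||^2_{G^{-1}}).
rng = np.random.default_rng(0)
d, K, gamma = 5, 3, 4.0
G = gamma * np.eye(d)
for _ in range(10):                               # build a generic positive-definite G
    v = rng.normal(size=d)
    v /= max(1.0, np.linalg.norm(v))              # keep ||v|| <= 1, mimicking feature norms
    G += np.outer(v, v)

phis = [rng.normal(size=d) for _ in range(K)]
phis = [p / max(1.0, np.linalg.norm(p)) for p in phis]   # enforce ||phi|| <= 1

lhs = np.linalg.det(G + sum(np.outer(p, p) for p in phis))
rhs = np.linalg.det(G) * (1 + sum(p @ np.linalg.solve(G, p) for p in phis))
assert lhs >= rhs - 1e-9
print(lhs, rhs)
```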

Appendix B Proofs for C2MAB-T under the VM or TPVM Condition (Section 4)

B.1 Proof of Lemma 2

Our analysis is inspired by the derivation of Theorem 3 of Lattimore et al. (2015) to bound the key ellipsoidal radius \left\lVert\bm{\theta}^{*}-\hat{\bm{\theta}}_{t}\right\rVert_{\bm{G}_{t}}\leq\rho in the C2MAB-T setting, where multiple arms can be triggered in each round. Before going into the main proof, we first introduce some notation and events.

Recall that for t\geq 1, X_{t,i} is a Bernoulli random variable with mean \mu_{t,i}=\langle\bm{\theta}^{*},\bm{\phi}_{t}(i)\rangle. Assuming \left\lVert\bm{\theta}^{*}\right\rVert_{2}\leq 1 and \left\lVert\bm{\phi}_{t}(i)\right\rVert\leq 1, we can write X_{t,i}=\mu_{t,i}+\eta_{t,i}, where the noise \eta_{t,i}\in[-1,1] has mean \mathbb{E}[\eta_{t,i}\mid\mathcal{F}_{t-1}]=0 and variance {\rm Var}[\eta_{t,i}\mid\mathcal{F}_{t-1}]=\mu_{t,i}(1-\mu_{t,i}). Also note that in Algorithm 2, the Gram matrix, the b-vector, and the weighted least-squares estimator are the following.

𝑮t\displaystyle\bm{G}_{t} =γ𝑰+s=1t1iτsV¯s,i1ϕs(i)ϕs(i),\displaystyle=\gamma\cdot\bm{I}+\sum_{s=1}^{t-1}\sum_{i\in\tau_{s}}\bar{V}_{s,i}^{-1}\bm{\phi}_{s}(i)\bm{\phi}_{s}(i)^{\top}, (18)
𝒃t\displaystyle\bm{b}_{t} =s=1t1iτsV¯s,i1ϕs(i)Xs,i,\displaystyle=\sum_{s=1}^{t-1}\sum_{i\in\tau_{s}}\bar{V}_{s,i}^{-1}\bm{\phi}_{s}(i)X_{s,i}, (19)
𝜽^t\displaystyle\hat{\bm{\theta}}_{t} =𝑮t1𝒃t,\displaystyle=\bm{G}_{t}^{-1}\bm{b}_{t}, (20)

where we set G0=γ𝑰G_{0}=\gamma\bm{I}, and optimistic variances V¯s,i\bar{V}_{s,i} are defined as in Equation 6 of Algorithm 2.
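Since Equation 6 of Algorithm 2 is not reproduced here, the sketch below only illustrates the maximizer \tilde{\mu}_{t,i}=\operatorname*{argmax}_{\mu\in[\underaccent{\bar}{\mu}_{t,i},\bar{\mu}_{t,i}]}\mu(1-\mu) used later in Appendix B.2; the optimistic variance actually used by Algorithm 2 may include additional exploration terms.

```python
import numpy as np

def optimistic_mu_and_variance(mu_lower, mu_upper):
    """Sketch of the maximizer mu_tilde = argmax_{mu in [mu_lower, mu_upper]} mu*(1-mu);
    not the exact optimistic variance of Equation 6 in Algorithm 2."""
    lo, hi = float(np.clip(mu_lower, 0.0, 1.0)), float(np.clip(mu_upper, 0.0, 1.0))
    if lo <= 0.5 <= hi:
        mu_tilde = 0.5                      # interior maximizer of the concave map mu*(1-mu)
    else:
        mu_tilde = lo if abs(lo - 0.5) < abs(hi - 0.5) else hi   # endpoint closer to 1/2
    return mu_tilde, mu_tilde * (1.0 - mu_tilde)

print(optimistic_mu_and_variance(0.1, 0.3))   # -> (0.3, 0.21)
print(optimistic_mu_and_variance(0.4, 0.9))   # -> (0.5, 0.25)
```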

Let us define \bm{Z}_{t}=\sum_{s<t}\sum_{i\in\tau_{s}}\eta_{s,i}\bm{\phi}_{s}(i)/\bar{V}_{s,i}; the key of this proof is to bound \bm{Z}_{t} (this quantity is often denoted S_{t} in the self-normalized bound of Abbasi-Yadkori et al. (2011), but S_{t} is already used to denote the action at round t in this work).

Finally, we define the failure events F_{0}\subseteq F_{1}\subseteq...\subseteq F_{T} to be the sequence of events given by

Ft={st such that 𝒁s𝑮s+γρ}.\displaystyle F_{t}=\{\exists s\leq t\text{ such that }\left\lVert\bm{Z}_{s}\right\rVert_{\bm{G}_{s}}+\sqrt{\gamma}\geq\rho\}. (21)

These failure events are crucial in the sense that, under \neg F_{t}, \bm{\theta}^{*} lies in the confidence ellipsoid \left\lVert\bm{\theta}^{*}-\hat{\bm{\theta}}_{t}\right\rVert_{\bm{G}_{t}}\leq\rho (see Lemma 8 for the proof).

Next, we can prove by induction that the probability of 𝒁t𝑮t+γρ\left\lVert\bm{Z}_{t}\right\rVert_{\bm{G}_{t}}+\sqrt{\gamma}\geq\rho given ¬Ft1\neg F_{t-1} is very small, for t=1,,Tt=1,...,T (see its proof in Lemma 7). Based on this, we can have Pr[¬FT]=1Pr[F0]t=1TPr[𝒁t𝑮t+γρ and ¬Ft1]1δ\Pr[\neg F_{T}]=1-\Pr[F_{0}]-\sum_{t=1}^{T}\Pr\left[\left\lVert\bm{Z}_{t}\right\rVert_{\bm{G}_{t}}+\sqrt{\gamma}\geq\rho\text{ and }\neg F_{t-1}\right]\geq 1-\delta (as ¬F0\neg F_{0} always holds), and thus by Equation 147, Lemma 2 is proved as desired.

B.2 Proof of Theorem 2 under VM condition

Similar to Appendix A, we first recall some definitions and events. Recall that in Algorithm 2, the Gram matrix, the b-vector, and the weighted least-squares estimator are defined in Equations 18–20, and the optimistic variances \bar{V}_{s,i} are defined as in Equation 6 of Algorithm 2. Let {\mathcal{W}}_{t} denote the nice event that the oracle outputs a solution S with r(S;\bm{\mu})\geq\alpha\cdot r(S^{*};\bm{\mu}), where S^{*}=\operatorname*{argmax}_{S\in\mathcal{S}}r(S;\bm{\mu}), for any \bm{\mu} at round t. Let \mathcal{N}_{t} denote the nice event that \left\lVert\hat{\bm{\theta}}_{t}-\bm{\theta}^{*}\right\rVert_{\bm{G}_{t}}\leq\rho(\delta) holds for any t\in[T] (which is implied by \neg F_{T}). Define the filtration \mathcal{F}_{t-1}=(S_{1},\bm{\phi}_{1},\tau_{1},(X_{1,i})_{i\in\tau_{1}},...,S_{t-1},\bm{\phi}_{t-1},\tau_{t-1},(X_{t-1,i})_{i\in\tau_{t-1}},S_{t},\bm{\phi}_{t}), which takes both the history data \mathcal{H}_{t} and the action S_{t} so as to handle the randomness of the oracle, and let \mathbb{E}_{t}[\cdot]=\mathbb{E}[\cdot\mid\mathcal{F}_{t-1}].

Let \tilde{\bm{\mu}}_{t} be the vector whose i-th entry is the maximizer that achieves \bar{V}_{t,i}, i.e., \tilde{\mu}_{t,i}=\operatorname*{argmax}_{\mu\in[\underaccent{\bar}{\mu}_{t,i},\bar{\mu}_{t,i}]}\mu(1-\mu). Now we bound the regret under the nice events {\mathcal{W}}_{t} and \mathcal{N}_{t} (where \mathcal{N}_{t} is implied by \neg F_{T}, by the derivation in Lemma 8),

Reg(T)=(a)𝔼[t=1Tαr(St;𝝁t)r(St;𝝁t)]\displaystyle\text{Reg}(T)\overset{(a)}{=}\mathbb{E}\left[\sum_{t=1}^{T}\alpha r(S_{t}^{*};\bm{\mu}_{t})-r(S_{t};\bm{\mu}_{t})\right] (22)
(b)𝔼[t=1Tαr(St;𝝁¯t)r(St;𝝁t)]\displaystyle\overset{(b)}{\leq}\mathbb{E}\left[\sum_{t=1}^{T}\alpha r(S^{*}_{t};\bar{\bm{\mu}}_{t})-r(S_{t};\bm{\mu}_{t})\right] (23)
(c)𝔼[t=1Tr(St;𝝁¯t)r(St;𝝁t)]\displaystyle\overset{(c)}{\leq}\mathbb{E}\left[\sum_{t=1}^{T}r(S_{t};\bar{\bm{\mu}}_{t})-r(S_{t};\bm{\mu}_{t})\right] (24)
(d)𝔼[t=1T|r(St;𝝁¯t)r(St;𝝁~t)|(\@slowromancapi@)+|r(St;𝝁t)r(St;𝝁~t)|(\@slowromancapii@)],\displaystyle\overset{(d)}{\leq}\mathbb{E}\left[\sum_{t=1}^{T}\underbrace{\left|r(S_{t};\bar{\bm{\mu}}_{t})-r(S_{t};\tilde{\bm{\mu}}_{t})\right|}_{\text{(\@slowromancap i@)}}+\underbrace{\left|r(S_{t};\bm{\mu}_{t})-r(S_{t};\tilde{\bm{\mu}}_{t})\right|}_{\text{(\@slowromancap ii@)}}\right], (25)

where (a) is by definition, (b) follows from Condition 1 and Lemma 3, (c) from the event {\mathcal{W}}_{t} and the definition of S_{t}, and (d) from the triangle inequality.

Now we show how to bound term (\@slowromancapi@),

𝔼[t[T](\@slowromancapi@)](a)Bv𝔼[t=1TiS~t(μ¯t,iμ~t,i)2V¯t,i]\displaystyle\mathbb{E}\left[\sum_{t\in[T]}\text{(\@slowromancap i@)}\right]\overset{(a)}{\leq}B_{v}\mathbb{E}\left[\sum_{t=1}^{T}\sqrt{\sum_{i\in\tilde{S}_{t}}\frac{(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\right]
(b)Bvpmin𝔼[t=1TiS~tpi𝝁t,St(μ¯t,iμ~t,i)2V¯t,i]\displaystyle\overset{(b)}{\leq}\frac{B_{v}}{\sqrt{p_{\min}}}\cdot\mathbb{E}\left[\sum_{t=1}^{T}\sqrt{\sum_{i\in\tilde{S}_{t}}p_{i}^{\bm{\mu}_{t},S_{t}}\frac{(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\right]
(c)Bvpmin𝔼[Tt=1TiS~tpi𝝁t,St(μ¯t,iμ~t,i)2V¯t,i]\displaystyle\overset{(c)}{\leq}\frac{B_{v}}{\sqrt{p_{\min}}}\cdot\mathbb{E}\left[\sqrt{T\sum_{t=1}^{T}\sum_{i\in\tilde{S}_{t}}p_{i}^{\bm{\mu}_{t},S_{t}}\frac{(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\right]
(d)BvpminT𝔼[t=1TiS~tpi𝝁t,St(μ¯t,iμ~t,i)2V¯t,i]\displaystyle\overset{(d)}{\leq}\frac{B_{v}}{\sqrt{p_{\min}}}\cdot\sqrt{T\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i\in\tilde{S}_{t}}p_{i}^{\bm{\mu}_{t},S_{t}}\frac{(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}\right]}
=(e)BvpminT𝔼[t=1Tiτt(μ¯t,iμ~t,i)2V¯t,i]\displaystyle\overset{(e)}{=}\frac{B_{v}}{\sqrt{p_{\min}}}\cdot\sqrt{T\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i\in\tau_{t}}\frac{(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}\right]}
(f)BvpminT𝔼[t=1Tiτt(6ρ(δ)ϕt(i)𝑮t1)2V¯t,i]\displaystyle\overset{(f)}{\leq}\frac{B_{v}}{\sqrt{p_{\min}}}\cdot\sqrt{T\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i\in\tau_{t}}\frac{(6\rho(\delta)\left\lVert\bm{\phi}_{t}(i)\right\rVert_{\bm{G}_{t}^{-1}})^{2}}{\bar{V}_{t,i}}\right]}
(g)O(BvdTlog(KT)/pmin),\displaystyle\overset{(g)}{\leq}{O}(B_{v}d\sqrt{T}\log(KT)/\sqrt{p_{\min}}), (26)

where (a) follows from Condition 3, (b) follows from the definition of pminp_{\min} s.t. pi𝝁t,Stpminp_{i}^{\bm{\mu}_{t},S_{t}}\geq p_{\min} for iS~ti\in\tilde{S}_{t}, (c) follows from Cauchy–Schwarz, (d) follows from Jensen’s inequality, (e) follows from the TPE trick, (f) follows from Lemma 3, (g) follows from Lemma 6.

The bound term (II) \leq{O}(B_{v}d\sqrt{T}\log(KT)/\sqrt{p_{\min}}) follows from a derivation similar to that of Section B.2 (Equation 26), replacing (\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2} with ({\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}. The theorem is then concluded by accounting for the events \neg{\mathcal{W}}_{t} and \neg\mathcal{N}_{t}, as in Appendix A.

Lemma 6 (Weighted Ellipsoidal Potential Lemma).

t=1Tiτtϕt(i)𝑮t12/V¯t,i2dlog(1+KT/(γd))2dlogT\sum_{t=1}^{T}\sum_{i\in\tau_{t}}\left\lVert\bm{\phi}_{t}(i)\right\rVert_{\bm{G}_{t}^{-1}}^{2}/\bar{V}_{t,i}\leq 2d\log(1+KT/(\gamma d))\leq 2d\log T when ¬FT\neg F_{T} and γ4K\gamma\geq 4K.

Proof.
det(𝑮t+1)\displaystyle\det(\bm{G}_{t+1}) =(a)det(𝑮t+iτtϕt(i)ϕt(i)/V¯t,i)\displaystyle\overset{(a)}{=}\det\left(\bm{G}_{t}+\sum_{i\in\tau_{t}}\bm{\phi}_{t}(i)\bm{\phi}_{t}(i)^{\top}/\bar{V}_{t,i}\right)
=(b)det(𝑮t)det(𝑰+iτtGt1/2ϕt(i)(Gt1/2ϕt(i))/V¯t,i)\displaystyle\overset{(b)}{=}\det(\bm{G}_{t})\cdot\det\left(\bm{I}+\sum_{i\in\tau_{t}}G_{t}^{-1/2}\bm{\phi}_{t}(i)(G_{t}^{-1/2}\bm{\phi}_{t}(i))^{\top}/\bar{V}_{t,i}\right)
(c)det(𝑮t)(1+iτtϕt(i)𝑮t12/V¯t,i)\displaystyle\overset{(c)}{\geq}\det(\bm{G}_{t})\cdot\left(1+\sum_{i\in\tau_{t}}\left\lVert\bm{\phi}_{t}(i)\right\rVert_{\bm{G}_{t}^{-1}}^{2}/\bar{V}_{t,i}\right)
(d)det(γ𝑰)s=1t(1+iτsϕs(i)𝑮s12/V¯t,i),\displaystyle\overset{(d)}{\geq}\det(\gamma\bm{I})\prod_{s=1}^{t}\left(1+\sum_{i\in\tau_{s}}\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}^{2}/\bar{V}_{t,i}\right), (27)

where (a) follows from the definition, (b) follows from det(𝑨𝑩)=det(𝑨)det(𝑩)\det(\bm{A}\bm{B})=\det(\bm{A})\det(\bm{B}) and 𝑨+𝑩=𝑨1/2(𝑰+𝑨1/2𝑩𝑨1/2)𝑨1/2\bm{A}+\bm{B}=\bm{A}^{1/2}(\bm{I}+\bm{A}^{-1/2}\bm{B}\bm{A}^{-1/2})\bm{A}^{1/2}, (c) follows from Lemma 14, (d) follows from repeatedly applying (c).

If \bar{V}_{s,i}=\frac{1}{4}, then \left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}^{2}/\bar{V}_{s,i}\leq\frac{4\left\lVert\bm{\phi}_{s}(i)\right\rVert^{2}}{\lambda_{\min}(\bm{G}_{s})}\leq 4/\gamma\leq 1/K; otherwise, if \bar{V}_{s,i}<\frac{1}{4}, then since \neg F_{T} holds, Lemma 9 gives \left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}^{2}/\bar{V}_{s,i}\leq\frac{1}{\rho(\delta)}\frac{1}{\sqrt{\gamma}}\leq\frac{1}{\gamma}\leq 1/(4K). Therefore, we have \sum_{i\in\tau_{s}}\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}^{2}/\bar{V}_{s,i}\leq 1. Using the fact that 2\log(1+x)\geq x for any x\in[0,1], we have

\sum_{s=1}^{t}\sum_{i\in\tau_{s}}\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}^{2}/\bar{V}_{s,i}
2s=1tlog(1+iτsϕs(i)𝑮s12/V¯s,i)\displaystyle{\leq}2\sum_{s=1}^{t}\log\left(1+\sum_{i\in\tau_{s}}\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}^{2}/\bar{V}_{s,i}\right)
=2logs=1t(1+iτsϕs(i)𝑮s12/V¯s,i)\displaystyle=2\log\prod_{s=1}^{t}\left(1+\sum_{i\in\tau_{s}}\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}^{2}/\bar{V}_{s,i}\right)
(a)2log(det(𝑮t+1)det(γ𝑰))\displaystyle\overset{(a)}{\leq}2\log\left(\frac{\det(\bm{G}_{t+1})}{\det(\gamma\bm{I})}\right)
(b)2log((γ+KT/d)dγd)=2dlog(1+4dK2T2/(γd))4dlog(KT),\displaystyle\overset{(b)}{\leq}2\log\left(\frac{(\gamma+KT/d)^{d}}{\gamma^{d}}\right)=2d\log(1+4dK^{2}T^{2}/(\gamma d))\leq 4d\log(KT),

where (a) follows from Equation 27, and (b) follows from Lemma 15 by setting L=\left\lVert\bm{\phi}_{s}(i)\right\rVert^{2}/\bar{V}_{s,i}\leq 4dKs (from Lemma 11). ∎

B.3 Proof of Theorem 3 Under TPVM Condition

In this section, we consider the two cases \lambda\geq 2 and \lambda\geq 1. Recall that to use the TPVM condition (Condition 4), we need an additional condition on the triggering probabilities (Condition 5).

B.3.1 When λ2\lambda\geq 2:

We inherit the same notation and events as in Section A.1, and start to bound term (\@slowromancapi@) in Section B.2 differently,

𝔼[t[T](\@slowromancapi@)](a)𝔼[t=1TBviS~t(pi𝝁~t,St)2(μ¯t,iμ~t,i)2V¯t,i]\displaystyle\mathbb{E}\left[\sum_{t\in[T]}\text{(\@slowromancap i@)}\right]\overset{(a)}{\leq}\mathbb{E}\left[\sum_{t=1}^{T}B_{v}\sqrt{\sum_{i\in\tilde{S}_{t}}(p_{i}^{\tilde{\bm{\mu}}_{t},S_{t}})^{2}\frac{(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\right] (28)
(b)𝔼[t=1TBviS~t(pi𝝁t,St+min{1,jS~tBppj𝝁,St|μt,jμ~t,j|})2(μ¯t,iμ~t,i)2V¯t,i]\displaystyle\overset{(b)}{\leq}\mathbb{E}\left[\sum_{t=1}^{T}B_{v}\sqrt{\sum_{i\in\tilde{S}_{t}}\left(p_{i}^{\bm{\mu}_{t},S_{t}}+\min\left\{1,\sum_{j\in\tilde{S}_{t}}B_{p}p_{j}^{\bm{\mu},S_{t}}\left|{\mu}_{t,j}-\tilde{\mu}_{t,j}\right|\right\}\right)^{2}\cdot\frac{(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\right] (29)
(c)𝔼[t=1TBviS~t3pi𝝁t,St(μ¯t,iμ~t,i)2V¯t,i]\displaystyle\overset{(c)}{\leq}\mathbb{E}\left[\sum_{t=1}^{T}B_{v}\sqrt{\sum_{i\in\tilde{S}_{t}}3p_{i}^{\bm{\mu}_{t},S_{t}}\cdot\frac{(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\right]
+𝔼[t=1TBviS~t(jS~tBppj𝝁t,St|μt,jμ~t,j|)2(μ¯t,iμ~t,i)2V¯t,i]\displaystyle+\mathbb{E}\left[\sum_{t=1}^{T}B_{v}\sqrt{\sum_{i\in\tilde{S}_{t}}\left(\sum_{j\in\tilde{S}_{t}}B_{p}p_{j}^{\bm{\mu}_{t},S_{t}}\left|{\mu}_{t,j}-\tilde{\mu}_{t,j}\right|\right)^{2}\cdot\frac{(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\right] (30)
=(d)O(BvdTlog(KT))+Bv𝔼[t=1TiS~t(μ¯t,iμ~t,i)2V¯t,ijS~tBppj𝝁t,St|μt,jμ~t,j|]\displaystyle\overset{(d)}{=}O(B_{v}d\sqrt{T}\log(KT))+B_{v}\mathbb{E}\left[\sum_{t=1}^{T}\sqrt{\sum_{i\in\tilde{S}_{t}}\frac{(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\cdot\sum_{j\in\tilde{S}_{t}}B_{p}p_{j}^{\bm{\mu}_{t},S_{t}}\left|{\mu}_{t,j}-\tilde{\mu}_{t,j}\right|\right] (31)
(e)O(BvdTlog(KT))+Bv1pmin𝔼[t=1TiS~tpi𝝁t,St(μ¯t,iμ~t,i)2V¯t,ijS~tBppj𝝁t,St|μt,jμ~t,j|]\displaystyle\overset{(e)}{\leq}O(B_{v}d\sqrt{T}\log(KT))+B_{v}\frac{1}{\sqrt{p_{\min}}}\mathbb{E}\left[\sum_{t=1}^{T}\sqrt{\sum_{i\in\tilde{S}_{t}}\frac{p_{i}^{\bm{\mu}_{t},S_{t}}(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\cdot\sum_{j\in\tilde{S}_{t}}B_{p}p_{j}^{\bm{\mu}_{t},S_{t}}\left|{\mu}_{t,j}-\tilde{\mu}_{t,j}\right|\right] (32)
(f)O(BvdTlog(KT))+Bv1pmin𝔼[t=1TiS~tpi𝝁t,St(μ¯t,iμ~t,i)2V¯t,iKjS~tBp2pj𝝁t,St|μt,jμ~t,j|2]\displaystyle\overset{(f)}{\leq}O(B_{v}d\sqrt{T}\log(KT))+B_{v}\frac{1}{\sqrt{p_{\min}}}\mathbb{E}\left[\sum_{t=1}^{T}\sqrt{\sum_{i\in\tilde{S}_{t}}\frac{p_{i}^{\bm{\mu}_{t},S_{t}}(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\cdot\sqrt{K\sum_{j\in\tilde{S}_{t}}B_{p}^{2}p_{j}^{\bm{\mu}_{t},S_{t}}\left|{\mu}_{t,j}-\tilde{\mu}_{t,j}\right|^{2}}\right] (33)
(g)O(BvdTlog(KT))+BvBpKpmin𝔼[t=1TiS~tpi𝝁t,St(μ¯t,iμ~t,i)2V¯t,ijS~tpj𝝁t,St|μt,jμ~t,j|2/V¯t,j]\displaystyle\overset{(g)}{\leq}O(B_{v}d\sqrt{T}\log(KT))+B_{v}B_{p}\frac{\sqrt{K}}{\sqrt{p_{\min}}}\mathbb{E}\left[\sum_{t=1}^{T}\sqrt{\sum_{i\in\tilde{S}_{t}}\frac{p_{i}^{\bm{\mu}_{t},S_{t}}(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\cdot\sqrt{\sum_{j\in\tilde{S}_{t}}p_{j}^{\bm{\mu}_{t},S_{t}}\left|{\mu}_{t,j}-\tilde{\mu}_{t,j}\right|^{2}/\bar{V}_{t,j}}\right] (34)
(h)O(BvdTlog(KT))+BvBpKpmin𝔼[t=1TiS~tpi𝝁t,St(μ¯t,iμ¯t,i)2V¯t,i]\displaystyle\overset{(h)}{\leq}O(B_{v}d\sqrt{T}\log(KT))+B_{v}B_{p}\frac{\sqrt{K}}{\sqrt{p_{\min}}}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i\in\tilde{S}_{t}}\frac{p_{i}^{\bm{\mu}_{t},S_{t}}(\bar{\mu}_{t,i}-\underaccent{\bar}{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}\right] (35)
(i)O(BvdTlog(KT)+BvBpKpmin(dlog(KT))2)=O(BvdTlog(KT)),\displaystyle\overset{(i)}{\leq}O(B_{v}d\sqrt{T}\log(KT)+B_{v}B_{p}\frac{\sqrt{K}}{\sqrt{p_{\min}}}(d\log(KT))^{2})=O(B_{v}d\sqrt{T}\log(KT)), (36)

where (a) follows from Condition 4, (b) is by applying Condition 5 to the triggering probability p_{i}^{\tilde{\bm{\mu}}_{t},S_{t}} together with p_{i}^{\tilde{\bm{\mu}}_{t},S_{t}},p_{i}^{\bm{\mu}_{t},S_{t}}\leq 1, (c) follows from \sqrt{a+b}\leq\sqrt{a}+\sqrt{b}, (d) follows from the same derivation as in Section B.2, (e) follows from p_{i}^{\bm{\mu}_{t},S_{t}}\geq p_{\min}, (f) follows from Cauchy–Schwarz, (g) follows from \bar{V}_{t,j}\leq 1/4, (h) follows from \tilde{\mu}_{t,i},\mu_{t,i}\in[\underaccent{\bar}{\mu}_{t,i},\bar{\mu}_{t,i}] under event \mathcal{N}_{t}, and (i) follows from an analysis similar to steps (d)–(g) of Section B.2 applied to the quantity inside the square root, without the additional B_{v}\sqrt{T}/\sqrt{p_{\min}} factor.

For the term (II), one can easily verify that it follows from a derivation similar to that of term (I), up to constant factors. Theorem 3 is then concluded by accounting for the small-probability events \neg{\mathcal{W}}_{t} and \neg\mathcal{N}_{t}.

B.3.2 When λ1\lambda\geq 1:

We inherit the same notation and events as in Section A.1, and start to bound term (\@slowromancapi@) in Equation 28 as follows,

𝔼[t[T](\@slowromancapi@)](a)𝔼[t=1TBviS~t(pi𝝁~t,St)(μ¯t,iμ~t,i)2V¯t,i]\displaystyle\mathbb{E}\left[\sum_{t\in[T]}\text{(\@slowromancap i@)}\right]\overset{(a)}{\leq}\mathbb{E}\left[\sum_{t=1}^{T}B_{v}\sqrt{\sum_{i\in\tilde{S}_{t}}(p_{i}^{\tilde{\bm{\mu}}_{t},S_{t}})\frac{(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\right] (37)
(b)𝔼[t=1TBviS~t(pi𝝁t,St+min{1,jS~tBppj𝝁,St|μt,jμ~t,j|})(μ¯t,iμ~t,i)2V¯t,i]\displaystyle\overset{(b)}{\leq}\mathbb{E}\left[\sum_{t=1}^{T}B_{v}\sqrt{\sum_{i\in\tilde{S}_{t}}\left(p_{i}^{\bm{\mu}_{t},S_{t}}+\min\left\{1,\sum_{j\in\tilde{S}_{t}}B_{p}p_{j}^{\bm{\mu},S_{t}}\left|{\mu}_{t,j}-\tilde{\mu}_{t,j}\right|\right\}\right)\cdot\frac{(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\right] (38)
(c)𝔼[t=1TBviS~tpi𝝁t,St(μ¯t,iμ~t,i)2V¯t,i]\displaystyle\overset{(c)}{\leq}\mathbb{E}\left[\sum_{t=1}^{T}B_{v}\sqrt{\sum_{i\in\tilde{S}_{t}}p_{i}^{\bm{\mu}_{t},S_{t}}\cdot\frac{(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\right]
+𝔼[t=1TBviS~t(jS~tBppj𝝁t,St|μt,jμ~t,j|)(μ¯t,iμ~t,i)2V¯t,i]\displaystyle+\mathbb{E}\left[\sum_{t=1}^{T}B_{v}\sqrt{\sum_{i\in\tilde{S}_{t}}\left(\sum_{j\in\tilde{S}_{t}}B_{p}p_{j}^{\bm{\mu}_{t},S_{t}}\left|{\mu}_{t,j}-\tilde{\mu}_{t,j}\right|\right)\cdot\frac{(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\right] (39)
=(d)O(BvdTlog(KT))+Bv𝔼[t=1TiS~t(μ¯t,iμ~t,i)2V¯t,ijS~tBppj𝝁t,St|μt,jμ~t,j|]\displaystyle\overset{(d)}{=}O(B_{v}d\sqrt{T}\log(KT))+B_{v}\mathbb{E}\left[\sum_{t=1}^{T}\sqrt{\sum_{i\in\tilde{S}_{t}}\frac{(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\cdot\sqrt{\sum_{j\in\tilde{S}_{t}}B_{p}p_{j}^{\bm{\mu}_{t},S_{t}}\left|{\mu}_{t,j}-\tilde{\mu}_{t,j}\right|}\right] (40)
(e)O(BvdTlog(KT))+Bv1pmin𝔼[t=1TiS~tpi𝝁t,St(μ¯t,iμ~t,i)2V¯t,ijS~tBppj𝝁t,St|μt,jμ~t,j|]\displaystyle\overset{(e)}{\leq}O(B_{v}d\sqrt{T}\log(KT))+B_{v}\frac{1}{\sqrt{p_{\min}}}\mathbb{E}\left[\sum_{t=1}^{T}\sqrt{\sum_{i\in\tilde{S}_{t}}\frac{p_{i}^{\bm{\mu}_{t},S_{t}}(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\cdot\sqrt{\sum_{j\in\tilde{S}_{t}}B_{p}p_{j}^{\bm{\mu}_{t},S_{t}}\left|{\mu}_{t,j}-\tilde{\mu}_{t,j}\right|}\right] (41)
(f)O(BvdTlog(KT))+Bv1pmin𝔼[t=1TiS~tpi𝝁t,St(μ¯t,iμ~t,i)2V¯t,i(KjS~tBp2pj𝝁t,St|μt,jμ~t,j|2)1/4]\displaystyle\overset{(f)}{\leq}O(B_{v}d\sqrt{T}\log(KT))+B_{v}\frac{1}{\sqrt{p_{\min}}}\mathbb{E}\left[\sum_{t=1}^{T}\sqrt{\sum_{i\in\tilde{S}_{t}}\frac{p_{i}^{\bm{\mu}_{t},S_{t}}(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\cdot\left({K\sum_{j\in\tilde{S}_{t}}B_{p}^{2}p_{j}^{\bm{\mu}_{t},S_{t}}\left|{\mu}_{t,j}-\tilde{\mu}_{t,j}\right|^{2}}\right)^{1/4}\right] (42)
(g)O(BvdTlog(KT))+BvBpK1/4pmin𝔼[t=1TiS~tpi𝝁t,St(μ¯t,iμ~t,i)2V¯t,i(jS~tpj𝝁t,St|μt,jμ~t,j|2/V¯t,j)1/4]\displaystyle\overset{(g)}{\leq}O(B_{v}d\sqrt{T}\log(KT))+B_{v}\sqrt{B_{p}}\frac{K^{1/4}}{\sqrt{p_{\min}}}\mathbb{E}\left[\sum_{t=1}^{T}\sqrt{\sum_{i\in\tilde{S}_{t}}\frac{p_{i}^{\bm{\mu}_{t},S_{t}}(\bar{\mu}_{t,i}-\tilde{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}}\cdot\left(\sum_{j\in\tilde{S}_{t}}p_{j}^{\bm{\mu}_{t},S_{t}}\left|{\mu}_{t,j}-\tilde{\mu}_{t,j}\right|^{2}/\bar{V}_{t,j}\right)^{1/4}\right] (43)
(h)O(BvdTlog(KT))+BvBpK1/4pmin𝔼[t=1T(iS~tpi𝝁t,St(μ¯t,iμ¯t,i)2V¯t,i)3/4]\displaystyle\overset{(h)}{\leq}O(B_{v}d\sqrt{T}\log(KT))+B_{v}\sqrt{B_{p}}\frac{{K}^{1/4}}{\sqrt{p_{\min}}}\mathbb{E}\left[\sum_{t=1}^{T}\left(\sum_{i\in\tilde{S}_{t}}\frac{p_{i}^{\bm{\mu}_{t},S_{t}}(\bar{\mu}_{t,i}-\underaccent{\bar}{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}\right)^{3/4}\right] (44)
(i)O(BvdTlog(KT))+BvBp(KT)1/4pmin𝔼[(t=1TiS~tpi𝝁t,St(μ¯t,iμ¯t,i)2V¯t,i)3/4]\displaystyle\overset{(i)}{\leq}O(B_{v}d\sqrt{T}\log(KT))+B_{v}\sqrt{B_{p}}\frac{{(KT)}^{1/4}}{\sqrt{p_{\min}}}\mathbb{E}\left[\left(\sum_{t=1}^{T}\sum_{i\in\tilde{S}_{t}}\frac{p_{i}^{\bm{\mu}_{t},S_{t}}(\bar{\mu}_{t,i}-\underaccent{\bar}{\mu}_{t,i})^{2}}{\bar{V}_{t,i}}\right)^{3/4}\right] (45)
(j)O(BvdTlog(KT)+BvBp(KT)1/4pmin(dlog(KT))3/2)=O(BvdTlog(KT)),\displaystyle\overset{(j)}{\leq}O\left(B_{v}d\sqrt{T}\log(KT)+B_{v}\sqrt{B_{p}}\frac{(KT)^{1/4}}{\sqrt{p_{\min}}}\left(d\log(KT)\right)^{3/2}\right)=O(B_{v}d\sqrt{T}\log(KT)), (46)

where (a) follows from Condition 4, (b) is by applying Condition 5 to the triggering probability p_{i}^{\tilde{\bm{\mu}}_{t},S_{t}} together with p_{i}^{\tilde{\bm{\mu}}_{t},S_{t}},p_{i}^{\bm{\mu}_{t},S_{t}}\leq 1, (c) follows from \sqrt{a+b}\leq\sqrt{a}+\sqrt{b}, (d) follows from the same derivation as in Section B.2, (e) follows from p_{i}^{\bm{\mu}_{t},S_{t}}\geq p_{\min}, (f) follows from Cauchy–Schwarz, (g) follows from \bar{V}_{t,j}\leq 1/4, (h) follows from \tilde{\mu}_{t,i},\mu_{t,i}\in[\underaccent{\bar}{\mu}_{t,i},\bar{\mu}_{t,i}] under event \mathcal{N}_{t}, (i) follows from Hölder's inequality with p=4,q=4/3, and (j) follows from an analysis similar to steps (d)–(g) of Section B.2 applied to the quantity inside the square root, without the additional B_{v}\sqrt{T}/\sqrt{p_{\min}} factor.

For the term (II), one can easily verify that it follows from a derivation similar to that of term (I), up to constant factors. Theorem 3 is then concluded by accounting for the small-probability events \neg{\mathcal{W}}_{t} and \neg\mathcal{N}_{t}.

Appendix C Proofs for TPE Trick to Improve Non-Contextual CMAB-T

We first introduce some definitions that are used in Wang & Chen (2017) and Liu et al. (2022). Recall that non-contextual CMAB-T is a degenerate case in which \bm{\phi}_{t}(i)=\bm{e}_{i} and \bm{\theta}^{*}=\bm{\mu}, where \bm{\mu}\triangleq\mathbb{E}_{\bm{X}_{t}\sim D}[\bm{X}_{t}\mid\mathcal{H}_{t}] is the mean of the true outcome distribution D.

Definition 1 ((Approximation) Gap).

Fix a distribution D𝒟D\in\mathcal{D} and its mean vector 𝛍\bm{\mu}, for each action S𝒮S\in\mathcal{S}, we define the (approximation) gap as ΔS=max{0,αr(S;𝛍)r(S;𝛍)}\Delta_{S}=\max\{0,\alpha r(S^{*};\bm{\mu})-r(S;\bm{\mu})\}. For each arm ii, we define Δimin=infS𝒮:piD,S>0, ΔS>0ΔS\Delta_{i}^{\min}=\inf_{S\in\mathcal{S}:p_{i}^{D,S}>0,\text{ }\Delta_{S}>0}\Delta_{S}, Δimax=supS𝒮:piD,S>0,ΔS>0ΔS\Delta_{i}^{\max}=\sup_{S\in\mathcal{S}:p_{i}^{D,S}>0,\Delta_{S}>0}\Delta_{S}. As a convention, if there is no action S𝒮S\in\mathcal{S} such that piD,S>0p_{i}^{D,S}>0 and ΔS>0\Delta_{S}>0, then Δimin=+,Δimax=0\Delta_{i}^{\min}=+\infty,\Delta_{i}^{\max}=0. We define Δmin=mini[m]Δimin\Delta_{\min}=\min_{i\in[m]}\Delta_{i}^{\min} and Δmax=maxi[m]Δimax\Delta_{\max}=\max_{i\in[m]}\Delta_{i}^{\max}.
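A toy numerical illustration of Definition 1 with hypothetical rewards and triggering probabilities (two base arms, three actions, \alpha=1) is given below.

```python
# Toy illustration (hypothetical numbers) of Definition 1: approximation gaps.
alpha = 1.0
r = {"S1": 0.9, "S2": 0.7, "S3": 0.4}           # r(S; mu) for each action
p = {"S1": {1: 1.0, 2: 0.0},                    # triggering probabilities p_i^{D,S}
     "S2": {1: 0.5, 2: 1.0},
     "S3": {1: 0.0, 2: 1.0}}
opt = max(r.values())                           # r(S*; mu)

gap = {S: max(0.0, alpha * opt - r[S]) for S in r}                 # Delta_S
for i in (1, 2):
    positive = [gap[S] for S in r if p[S][i] > 0 and gap[S] > 0]
    delta_min = min(positive) if positive else float("inf")        # Delta_i^min
    delta_max = max(positive) if positive else 0.0                 # Delta_i^max
    print(f"arm {i}: Delta_min = {delta_min}, Delta_max = {delta_max}")
```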

Definition 2 (Event-Filtered Regret).

For any series of events (\mathcal{E}_{t})_{t\in[T]} indexed by the round number t, we define Reg^{A}_{\alpha,\bm{\mu}}(T,(\mathcal{E}_{t})_{t\in[T]}) as the regret filtered by the events (\mathcal{E}_{t})_{t\in[T]}, i.e., the regret at round t is only counted if \mathcal{E}_{t} happens at round t. Formally,

Regα,𝝁A(T,(t)t[T])=𝔼[t[T]𝕀(t)(αr(S;𝝁)r(St;𝝁))].\displaystyle Reg^{A}_{\alpha,\bm{\mu}}(T,(\mathcal{E}_{t})_{t\in[T]})=\mathbb{E}\left[\sum_{t\in[T]}\mathbb{I}(\mathcal{E}_{t})(\alpha\cdot r(S^{*};\bm{\mu})-r(S_{t};\bm{\mu}))\right]. (47)

For simplicity, we will omit A,α,𝛍,t[T]A,\alpha,\bm{\mu},t\in[T] and rewrite Regα,𝛍A(T,(t)t[T])Reg^{A}_{\alpha,\bm{\mu}}(T,(\mathcal{E}_{t})_{t\in[T]}) as Reg(T,t)Reg(T,\mathcal{E}_{t}) when contexts are clear.

C.1 Reproducing Theorem 1 of (Wang & Chen, 2017) under 1-norm TPM Condition

Theorem 4.

For a CMAB-T problem instance ([m],\mathcal{S},\mathcal{D},D_{\text{trig}},R) that satisfies monotonicity (Condition 1) and TPM bounded smoothness (Condition 2) with coefficient B_{1}, CUCB (Wang & Chen, 2017) with an (\alpha,\beta)-approximation oracle achieves an (\alpha,\beta)-approximate distribution-dependent regret bounded by

Reg(T)i[m]48B12KlogTΔimin+2B1m+π23Δmax.Reg(T)\leq\sum_{i\in[m]}\frac{48B_{1}^{2}K\log T}{\Delta_{i}^{\min}}+2B_{1}m+\frac{\pi^{2}}{3}\cdot\Delta_{\max}. (48)

And the distribution-independent regret is bounded by

Reg(T)14B1mKTlogT+2B1m+π23Δmax.\displaystyle Reg(T)\leq 14B_{1}\sqrt{mKT\log T}+2B_{1}m+\frac{\pi^{2}}{3}\cdot\Delta_{\max}. (49)

The main idea is to use the TPE trick to replace \tilde{S}_{t} (the arms that could be probabilistically triggered by action S_{t}) with \tau_{t} (the arms that are actually triggered by action S_{t}) under the conditional expectation, so that we can use the simpler Appendix B.2 of Wang & Chen (2017) and avoid the much more involved Appendix B.3 of Wang & Chen (2017). This replacement bypasses the triggering group analysis (and its counter N_{t,i,j}) in Appendix B.3, which uses N_{t,i,j} to associate T_{t,i} with the counters for \tilde{S}_{t}. In our simplified analysis, we can directly associate T_{t,i} with the arms \tau_{t} that are actually triggered and observed, and eventually reproduce the regret bounds of Wang & Chen (2017).
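For concreteness, the per-arm counters that this argument manipulates are maintained as in standard CUCB-style updates; the following minimal sketch (illustrative, not the full CUCB algorithm) shows that T_{t,i} and the empirical mean \hat{\mu}_{t,i} change only for arms i\in\tau_{t}, which is exactly what the TPE trick exploits.

```python
from collections import defaultdict

T = defaultdict(int)          # T_{t,i}: number of times arm i has been observed
mu_hat = defaultdict(float)   # empirical mean estimate of arm i

def observe(triggered):
    """`triggered` is a list of (arm, outcome) pairs for the arms in tau_t."""
    for i, x in triggered:
        T[i] += 1
        mu_hat[i] += (x - mu_hat[i]) / T[i]   # incremental mean update

observe([(3, 1.0), (7, 0.0)])   # arms 3 and 7 triggered this round; all others untouched
```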

We follow exactly the same CUCB algorithm (Algorithm 1 (Wang & Chen, 2017)), conditions (Condition 1, 2 (Wang & Chen, 2017)). We also inherit the event definitions of 𝒩ts\mathcal{N}_{t}^{s} (Definition 4 (Wang & Chen, 2017)) that for every arm i[m]i\in[m], |μ^t1,iμi|<ρt,i=3logt2Tt1,i\left|\hat{\mu}_{t-1,i}-\mu_{i}\right|<\rho_{t,i}=\sqrt{\frac{3\log t}{2T_{t-1,i}}}, and the event FtF_{t} being {r(St;𝝁¯t)<αopt(𝝁¯t)}\{r(S_{t};{\bar{\bm{\mu}}}_{t})<\alpha\cdot\text{opt}(\bar{\bm{\mu}}_{t})\}. Let us further denote ΔSt=αr(S;𝝁)r(St;𝝁)\Delta_{S_{t}}=\alpha r(S^{*};\bm{\mu})-r(S_{t};\bm{\mu}), τt\tau_{t} be the arms actually triggered by StS_{t} at time tt. Let filtration t1\mathcal{F}_{t-1} be (ϕ1,S1,τ1,(X1,i)iτ1,,ϕt1,St1,τt1,(Xt1,i)iτt1,ϕt,St)(\phi_{1},S_{1},\tau_{1},(X_{1,i})_{i\in\tau_{1}},...,\phi_{t-1},S_{t-1},\tau_{t-1},(X_{t-1,i})_{i\in\tau_{t-1}},\phi_{t},S_{t}), and let 𝔼t[]=𝔼[t1]\mathbb{E}_{t}[\cdot]=\mathbb{E}[\cdot\mid\mathcal{F}_{t-1}] We also have that t1\mathcal{F}_{t-1}, Tt1,i,μ^t,iT_{t-1,i},\hat{\mu}_{t,i} are measurable. Also note that we use piD,Sp_{i}^{D,S} to denote the triggering probability pi𝝁,Sp_{i}^{\bm{\mu},S} for any i[m],S𝒮i\in[m],S\in\mathcal{S} in order to match the notation of Wang & Chen (2017).

Proof.

Under event 𝒩ts\mathcal{N}^{s}_{t} and ¬Ft\neg F_{t}, and given filtration t1\mathcal{F}_{t-1}, we have

ΔSt\displaystyle\Delta_{S_{t}} (a)B1i[m]piD,St(μ¯t,iμi)\displaystyle\overset{(a)}{\leq}B_{1}\sum_{i\in[m]}p_{i}^{D,S_{t}}(\bar{\mu}_{t,i}-\mu_{i}) (50)
(b)ΔSt+2B1i[m]piD,St(μ¯t,iμi)\displaystyle\overset{(b)}{\leq}-\Delta_{S_{t}}+2B_{1}\sum_{i\in[m]}p_{i}^{D,S_{t}}(\bar{\mu}_{t,i}-\mu_{i}) (51)
=i[m]piD,StΔSti[m]piD,St+2B1i[m]piD,St(μ¯t,iμi)\displaystyle=-\frac{\sum_{i\in[m]}p_{i}^{D,S_{t}}\Delta_{S_{t}}}{\sum_{i\in[m]}p_{i}^{D,S_{t}}}+2B_{1}\sum_{i\in[m]}p_{i}^{D,S_{t}}(\bar{\mu}_{t,i}-\mu_{i}) (52)
(c)2B1i[m]piD,St(Δimin2B1K+(μ¯t,iμi))\displaystyle\overset{(c)}{\leq}2B_{1}\sum_{i\in[m]}p_{i}^{D,S_{t}}\left(-\frac{\Delta_{i}^{\min}}{2B_{1}K}+(\bar{\mu}_{t,i}-\mu_{i})\right) (53)
(d)2B1i[m]piD,St(Δimin2B1K+min{1,6logTTt1,i}),\displaystyle\overset{(d)}{\leq}2B_{1}\sum_{i\in[m]}p_{i}^{D,S_{t}}\left(-\frac{\Delta_{i}^{\min}}{2B_{1}K}+\min\left\{1,\sqrt{\frac{6\log T}{T_{t-1,i}}}\right\}\right), (54)

where (a) follows exactly from Equation (10) of Appendix B.3 in Wang & Chen (2017), (b) is by the reverse amortization trick, which multiplies both sides of (a) by two and rearranges the terms, (c) is by p_{i}^{D,S_{t}}\leq 1 and \Delta_{i}^{\min}\leq\Delta_{S_{t}}, and (d) is by the event \mathcal{N}^{s}_{t}, so that (\bar{\mu}_{t,i}-\mu_{i})\leq\min\{1,2\rho_{t,i}\}\leq\min\left\{1,\sqrt{\frac{6\log T}{T_{t-1,i}}}\right\}.

Let

κi,T()={2B1,if =0,2B16logT,if 1Li,T,0,if >Li,T,\kappa_{i,T}(\ell)=\begin{cases}2B_{1},&\text{if $\ell=0$,}\\ 2B_{1}\sqrt{\frac{6\log T}{\ell}},&\text{if $1\leq\ell\leq L_{i,T}$,}\\ 0,&\text{if $\ell>L_{i,T}$,}\end{cases} (55)

where Li,T=24B12K2logT(Δimin)2L_{i,T}=\frac{24B_{1}^{2}K^{2}\log T}{(\Delta_{i}^{\min})^{2}}.

It follows that

ΔSt\displaystyle\Delta_{S_{t}} =𝔼t[ΔSt](a)𝔼t[2B1i[m]piD,St(Δimin2B1K+min{1,6logTTt1,i})]\displaystyle=\mathbb{E}_{t}[\Delta_{S_{t}}]\overset{(a)}{\leq}\mathbb{E}_{t}\left[2B_{1}\sum_{i\in[m]}p_{i}^{D,S_{t}}\left(-\frac{\Delta_{i}^{\min}}{2B_{1}K}+\min\left\{1,\sqrt{\frac{6\log T}{T_{t-1,i}}}\right\}\right)\right] (56)
=(b)𝔼t[2B1i[m]𝕀{iτt}(Δimin2B1K+min{1,6logTTt1,i})]\displaystyle\overset{(b)}{=}\mathbb{E}_{t}\left[2B_{1}\sum_{i\in[m]}\mathbb{I}\{i\in\tau_{t}\}\left(-\frac{\Delta_{i}^{\min}}{2B_{1}K}+\min\left\{1,\sqrt{\frac{6\log T}{T_{t-1,i}}}\right\}\right)\right] (57)
=𝔼t[2B1iτt(Δimin2B1K+min{1,6logTTt1,i})]\displaystyle=\mathbb{E}_{t}\left[2B_{1}\sum_{i\in\tau_{t}}\left(-\frac{\Delta_{i}^{\min}}{2B_{1}K}+\min\left\{1,\sqrt{\frac{6\log T}{T_{t-1,i}}}\right\}\right)\right] (58)
(c)𝔼t[iτtκi,T(Tt1,i)],\displaystyle\overset{(c)}{\leq}\mathbb{E}_{t}\left[\sum_{i\in\tau_{t}}\kappa_{i,T}(T_{t-1,i})\right], (59)

where (a) follows from Equation 54, (b) follows from the TPE trick, replacing p_{i}^{D,S_{t}}=\mathbb{E}_{t}[\mathbb{I}\{i\in\tau_{t}\}], and (c) follows from the fact that if T_{t-1,i}\leq L_{i,T}, we have \min\{\sqrt{\frac{6\log T}{T_{t-1,i}}},1\}\leq\frac{1}{2B_{1}}\kappa_{i,T}(T_{t-1,i}), and if T_{t-1,i}\geq L_{i,T}+1, then \min\left\{1,\sqrt{\frac{6\log T}{T_{t-1,i}}}\right\}\leq\frac{\Delta_{i}^{\min}}{2B_{1}K}, so -\frac{\Delta_{i}^{\min}}{2B_{1}K}+\min\left\{1,\sqrt{\frac{6\log T}{T_{t-1,i}}}\right\}\leq 0=\kappa_{i,T}(T_{t-1,i}).

Now we apply the definition of the event-filtered regret,

Reg(𝒩ts,¬Ft)\displaystyle Reg(\mathcal{N}_{t}^{s},\neg F_{t}) =𝔼[t=1TΔSt]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{S_{t}}\right] (60)
(a)𝔼[t=1T𝔼t[iτtκi,T(Tt1,i)]]\displaystyle\overset{(a)}{\leq}\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{E}_{t}\left[\sum_{i\in\tau_{t}}\kappa_{i,T}(T_{t-1,i})\right]\right] (61)
=(b)𝔼[t=1Tiτtκi,T(Tt1,i)]\displaystyle\overset{(b)}{=}\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i\in\tau_{t}}\kappa_{i,T}(T_{t-1,i})\right] (62)
=(c)𝔼[i[m]s=0TT1,iκi,T(s)]\displaystyle\overset{(c)}{=}\mathbb{E}\left[\sum_{i\in[m]}\sum_{s=0}^{T_{T-1,i}}\kappa_{i,T}(s)\right] (63)
i[m]s=0Li,Tκi,T(s)\displaystyle\leq\sum_{i\in[m]}\sum_{s=0}^{L_{i,T}}\kappa_{i,T}(s) (64)
=2B1m+i[m]s=1Li,T2B16logTs\displaystyle=2B_{1}m+\sum_{i\in[m]}\sum_{s=1}^{L_{i,T}}2B_{1}\sqrt{\frac{6\log T}{s}} (65)
(d)2B1m+i[m]s=0Li,T2B16logTs𝑑s\displaystyle\overset{(d)}{\leq}2B_{1}m+\sum_{i\in[m]}\int_{s=0}^{L_{i,T}}2B_{1}\sqrt{\frac{6\log T}{s}}\cdot ds (66)
2B1m+i[m]48B12KlogTΔimin,\displaystyle\leq 2B_{1}m+\sum_{i\in[m]}\frac{48B_{1}^{2}K\log T}{\Delta_{i}^{\min}}, (67)

where (a) follows from Equation 59, (b) follows from the tower rule, (c) follows from the fact that T_{t-1,i} is increased by 1 if and only if i\in\tau_{t}, and (d) is by the sum-integral inequality \int_{L-1}^{U}f(x)dx\geq\sum_{i=L}^{U}f(i)\geq\int_{L}^{U+1}f(x)dx for a non-increasing function f.
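The allocation function \kappa_{i,T} in Equation 55 and the per-arm bound it yields in Equation 67 can also be checked numerically; in the sketch below the constants B_1, K, \Delta_i^{\min}, and T are hypothetical, and the assertion mirrors the per-arm bound 2B_{1}+\frac{48B_{1}^{2}K\log T}{\Delta_{i}^{\min}}.

```python
import math

def kappa(ell, B1, K, delta_min, T):
    """Sketch of the regret allocation function kappa_{i,T} in Equation 55."""
    L_iT = 24 * B1**2 * K**2 * math.log(T) / delta_min**2
    if ell == 0:
        return 2 * B1
    if ell <= L_iT:
        return 2 * B1 * math.sqrt(6 * math.log(T) / ell)
    return 0.0

# Hypothetical constants; the accumulated allocation stays within the per-arm bound.
B1, K, delta_min, T = 1.0, 4, 0.2, 10_000
L_iT = int(24 * B1**2 * K**2 * math.log(T) / delta_min**2)
total = sum(kappa(ell, B1, K, delta_min, T) for ell in range(L_iT + 1))
bound = 2 * B1 + 48 * B1**2 * K * math.log(T) / delta_min
assert total <= bound
print(total, bound)
```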

Following Wang & Chen (2017) to handle small probability events ¬𝒩ts\neg\mathcal{N}_{t}^{s} and FtF_{t}, we have

Reg(T)i[m]48B12KlogTΔimin+2B1m+π23Δmax,\displaystyle Reg(T)\leq\sum_{i\in[m]}\frac{48B_{1}^{2}K\log T}{\Delta_{i}^{\min}}+2B_{1}m+\frac{\pi^{2}}{3}\cdot\Delta_{\max}, (68)

and the distribution-independent regret is

Reg(T)14B1mKTlogT+2B1m+π23Δmax.\displaystyle Reg(T)\leq 14B_{1}\sqrt{mKT\log T}+2B_{1}m+\frac{\pi^{2}}{3}\cdot\Delta_{\max}. (69)

C.2 Improving Theorem 1 of (Liu et al., 2022) under TPVM Condition

We first state the regret bound obtained using our TPE technique in Theorem 5, and the prior result in Proposition 2.

Theorem 5.

For a CMAB-T problem instance ([m],𝒮,𝒟,Dtrig,R)([m],\mathcal{S},\mathcal{D},D_{\text{trig}},R) that satisfies monotonicity (Condition 1), and TPVM bounded smoothness (Condition 4) with coefficient (Bv,B1,λ)(B_{v},B_{1},\lambda), if λ1\lambda\geq 1, BCUCB-T (Liu et al., 2022) with an (α,β)(\alpha,\beta)-approximation oracle achieves an (α,β)(\alpha,\beta)-approximate distribution-dependent regret bounded by

O(i[m]Bv2logKlogTΔ~i,λmin+i[m]B1log(B1KΔimin)logT),O\left(\sum_{i\in[m]}\frac{B_{v}^{2}\log K\log T}{\tilde{\Delta}_{i,\lambda}^{\min}}+\sum_{i\in[m]}B_{1}\log\left(\frac{B_{1}K}{\Delta_{i}^{\min}}\right)\log T\right), (70)

where \tilde{\Delta}_{i,\lambda}^{\min}=\min_{S:p_{i}^{D,S}>0,\Delta_{S}>0}\Delta_{S}/(p_{i}^{D,S})^{\lambda-1}. And the distribution-independent regret is bounded by

Reg(T)O(Bvm(logK)TlogT+B1mlog(KT)logT).\displaystyle Reg(T)\leq O\left(B_{v}\sqrt{m(\log K)T\log T}+B_{1}m\log(KT)\log T\right). (71)
Proposition 2 (Theorem 1, Liu et al. (2022)).

For a CMAB-T problem instance ([m],𝒮,𝒟,Dtrig,R)([m],\mathcal{S},\mathcal{D},D_{\text{trig}},R) that satisfies monotonicity (Condition 1), and TPVM bounded smoothness (Condition 4) with coefficient (Bv,B1,λ)(B_{v},B_{1},\lambda), if λ1\lambda\geq 1, BCUCB-T with an (α,β)(\alpha,\beta)-approximation oracle achieves an (α,β)(\alpha,\beta)-approximate regret bounded by

O(i[m]log(BvKΔimin)Bv2logKlogTΔimin+i[m]B1log2(B1KΔimin)logT).O\left(\sum_{i\in[m]}\log\left(\frac{B_{v}K}{\Delta_{i}^{\min}}\right)\frac{B_{v}^{2}\log K\log T}{\Delta_{i}^{\min}}+\sum_{i\in[m]}B_{1}\log^{2}\left(\frac{B_{1}K}{\Delta_{i}^{\min}}\right)\log T\right). (72)

And the distribution-independent regret is bounded by

Reg(T)O(Bvm(logK)Tlog(KT)+B1mlog2(KT)logT).\displaystyle Reg(T)\leq O\left(B_{v}\sqrt{m(\log K)T}\log(KT)+B_{1}m\log^{2}(KT)\log T\right). (73)

Comparing our regret bound (Theorem 5) with Proposition 2, there are two improvements: (1) the minimum gap is improved to \tilde{\Delta}_{i,\lambda}^{\min}\geq\Delta_{i}^{\min}; (2) we remove an O(\log(\frac{B_{v}K}{\Delta_{i}^{\min}})) factor from the leading term. Improvement (2) translates to an O(\sqrt{\log T}) improvement in the distribution-independent bound.

Proof.

Similar to Section C.1, the main idea is to use the TPE trick to replace \tilde{S}_{t} (the arms that could be probabilistically triggered by action S_{t}) with \tau_{t} (the arms that are actually triggered by action S_{t}) under the conditional expectation, avoiding the much more involved triggering group analysis (Wang & Chen, 2017). This replacement bypasses the triggering group analysis (and its counter N_{t,i,j}) of Liu et al. (2022), which uses N_{t,i,j} to associate T_{t,i} with the counters for \tilde{S}_{t}. By doing so, we do not need a union bound over the group index j, which saves a \log(B_{v}K/\Delta_{i}^{\min}) (or \log(B_{1}K/\Delta_{i}^{\min})) factor.

We follow exactly the same BCUCB-T algorithm (Algorithm 1 (Liu et al., 2022)), conditions (Condition 1, 2, 3 (Liu et al., 2022)). We also inherit the event definitions of 𝒩ts\mathcal{N}_{t}^{s} (Definition 6 (Liu et al., 2022)) that (1) for every base arm i[m]i\in[m], |μ^t1,iμi|ρt,i|\hat{\mu}_{t-1,i}-\mu_{i}|\leq\rho_{t,i}, where ρt,i=6V^t1,ilogtTt1,i+9logtTt1,i\rho_{t,i}=\sqrt{\frac{6\hat{V}_{t-1,i}\log t}{T_{t-1,i}}}+\frac{9\log t}{T_{t-1,i}}; (2) for every base arm i[m]i\in[m], V^t1,i2μi(1μi)+3.5logtTt1,i\hat{V}_{t-1,i}\leq 2\mu_{i}(1-\mu_{i})+\frac{3.5\log t}{T_{t-1,i}}. We use the event FtF_{t} being {r(St;𝝁¯t)<αopt(𝝁¯t)}\{r(S_{t};\bar{\bm{\mu}}_{t})<\alpha\cdot\text{opt}(\bar{\bm{\mu}}_{t})\}. Let us further denote ΔSt=αr(S;𝝁)r(St;𝝁)\Delta_{S_{t}}=\alpha r(S^{*};\bm{\mu})-r(S_{t};\bm{\mu}), τt\tau_{t} be the arms actually triggered by StS_{t} at time tt. Let filtration t1\mathcal{F}_{t-1} be (ϕ1,S1,τ1,(X1,i)iτ1,,ϕt1,St1,τt1,(Xt1,i)iτt1,ϕt,St)(\phi_{1},S_{1},\tau_{1},(X_{1,i})_{i\in\tau_{1}},...,\phi_{t-1},S_{t-1},\tau_{t-1},(X_{t-1,i})_{i\in\tau_{t-1}},\phi_{t},S_{t}), and let 𝔼t[]=𝔼[t1]\mathbb{E}_{t}[\cdot]=\mathbb{E}[\cdot\mid\mathcal{F}_{t-1}] We also have that t1\mathcal{F}_{t-1}, Tt1,i,μ^t,iT_{t-1,i},\hat{\mu}_{t,i} are measurable. Also note that we use piD,Sp_{i}^{D,S} to denote the triggering probability pi𝝁,Sp_{i}^{\bm{\mu},S} for any i[m],S𝒮i\in[m],S\in\mathcal{S} in order to match the notation of Wang & Chen (2017); Liu et al. (2022).

We follow the same regret decomposition as in Lemma 9 of Liu et al. (2022) to decompose the event-filtered regret Reg(T,\mathcal{N}^{s}_{t},\neg F_{t}) into two event-filtered regrets, Reg(T,E_{t,1}) and Reg(T,E_{t,2}), under the events \mathcal{N}^{s}_{t},\neg F_{t}.

Reg(T)Reg(T,Et,1)+Reg(T,Et,2),\displaystyle Reg(T)\leq Reg(T,E_{t,1})+Reg(T,E_{t,2}), (74)

where event E_{t,1}=\{\Delta_{S_{t}}\leq 2e_{t,1}(S_{t})\}, event E_{t,2}=\{\Delta_{S_{t}}\leq 2e_{t,2}(S_{t})\}, e_{t,1}(S_{t})=4\sqrt{3}B_{v}\sqrt{\sum_{i\in\tilde{S}_{t}}(\frac{\log t}{T_{t-1,i}}\wedge\frac{1}{28})(p_{i}^{D,S_{t}})^{\lambda}}, and e_{t,2}(S_{t})=28B_{1}\sum_{i\in\tilde{S}_{t}}(\frac{\log t}{T_{t-1,i}}\wedge\frac{1}{28})p_{i}^{D,S_{t}}.
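For concreteness, e_{t,1} and e_{t,2} above can be computed directly from their definitions; in this sketch `p` maps each arm i\in\tilde{S}_{t} to p_{i}^{D,S_{t}} and `counts` maps i to T_{t-1,i} (both container names are illustrative).

```python
import math

def e_t1(p, counts, t, Bv, lam):
    """e_{t,1}(S_t) = 4*sqrt(3)*B_v*sqrt(sum_i min(log t / T_{t-1,i}, 1/28) * (p_i)^lambda)."""
    return 4 * math.sqrt(3) * Bv * math.sqrt(
        sum(min(math.log(t) / counts[i], 1.0 / 28) * p[i] ** lam for i in p))

def e_t2(p, counts, t, B1):
    """e_{t,2}(S_t) = 28*B_1*sum_i min(log t / T_{t-1,i}, 1/28) * p_i."""
    return 28 * B1 * sum(min(math.log(t) / counts[i], 1.0 / 28) * p[i] for i in p)

# Example with hypothetical values: two arms in S_t-tilde.
print(e_t1({1: 0.8, 2: 0.3}, {1: 50, 2: 20}, t=100, Bv=1.0, lam=1.0))
print(e_t2({1: 0.8, 2: 0.3}, {1: 50, 2: 20}, t=100, B1=1.0))
```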

C.2.1 Bounding the Reg(T,Et,1)Reg(T,E_{t,1}) term

We bound the leading Reg(T,Et,1)Reg(T,E_{t,1}) term under two cases when λ[1,2)\lambda\in[1,2) and λ[2,)\lambda\in[2,\infty).

(a) When λ[1,2)\lambda\in[1,2),

Let c1=43c_{1}=4\sqrt{3}, Δ~St=ΔSt/(piD,S)λ1\tilde{\Delta}_{S_{t}}=\Delta_{S_{t}}/(p_{i}^{D,S})^{\lambda-1}. Given filtration t1\mathcal{F}_{t-1} and event Et,1E_{t,1}, we have

ΔSt\displaystyle\Delta_{S_{t}} (a)i[m]4c12Bv2(piD,St)λlogtTt1,iΔSt\displaystyle\overset{(a)}{\leq}\sum_{i\in[m]}\frac{4c_{1}^{2}B_{v}^{2}(p_{i}^{D,S_{t}})^{\lambda}\frac{\log t}{T_{t-1,i}}}{\Delta_{S_{t}}} (75)
=(b)ΔSt+2i[m]4c12Bv2(piD,St)λlogtTt1,iΔSt\displaystyle\overset{(b)}{=}-\Delta_{S_{t}}+2\sum_{i\in[m]}\frac{4c_{1}^{2}B_{v}^{2}(p_{i}^{D,S_{t}})^{\lambda}\frac{\log t}{T_{t-1,i}}}{\Delta_{S_{t}}} (76)
=iS~tpiD,StΔSt/(piD,St)λ1i[m](piD,St)2λ+2i[m]4c12Bv2(piD,St)λlogtTt1,iΔSt\displaystyle=-\frac{\sum_{i\in\tilde{S}_{t}}p_{i}^{D,S_{t}}\Delta_{S_{t}}/(p_{i}^{D,S_{t}})^{\lambda-1}}{\sum_{i\in[m]}(p_{i}^{D,S_{t}})^{2-\lambda}}+2\sum_{i\in[m]}\frac{4c_{1}^{2}B_{v}^{2}(p_{i}^{D,S_{t}})^{\lambda}\frac{\log t}{T_{t-1,i}}}{\Delta_{S_{t}}} (77)
(c)i[m]piD,St(8c12Bv2logtTt1,iΔSt/(piD,St)λ1ΔSt/(piD,St)λ1K)\displaystyle\overset{(c)}{\leq}\sum_{i\in[m]}p_{i}^{D,S_{t}}\left(\frac{8c_{1}^{2}B_{v}^{2}\frac{\log t}{T_{t-1,i}}}{\Delta_{S_{t}}/(p_{i}^{D,S_{t}})^{\lambda-1}}-\frac{\Delta_{S_{t}}/(p_{i}^{D,S_{t}})^{\lambda-1}}{K}\right) (78)
=(d)i[m]piD,St(8c12Bv2logtTt1,iΔ~StΔ~StK),\displaystyle\overset{(d)}{=}\sum_{i\in[m]}p_{i}^{D,S_{t}}\left(\frac{8c_{1}^{2}B_{v}^{2}\frac{\log t}{T_{t-1,i}}}{{\tilde{\Delta}}_{S_{t}}}-\frac{\tilde{\Delta}_{S_{t}}}{K}\right), (79)

where (a) follows from event E_{t,1}, (b) is by the reverse amortization trick, which multiplies both sides of (a) by two and rearranges the terms, and (c), (d) are by the definitions of K and \tilde{\Delta}_{S_{t}}.

It follows that

ΔSt=𝔼t[ΔSt]\displaystyle\Delta_{S_{t}}=\mathbb{E}_{t}[\Delta_{S_{t}}] (a)𝔼t[i[m]piD,St(8c12Bv2logtTt1,iΔ~StΔ~StK)]\displaystyle\overset{(a)}{\leq}\mathbb{E}_{t}\left[\sum_{i\in[m]}p_{i}^{D,S_{t}}\left(\frac{8c_{1}^{2}B_{v}^{2}\frac{\log t}{T_{t-1,i}}}{\tilde{\Delta}_{S_{t}}}-\frac{\tilde{\Delta}_{S_{t}}}{K}\right)\right] (80)
=(b)𝔼t[iτt(8c12Bv2logtTt1,iΔ~StΔ~StK)]\displaystyle\overset{(b)}{=}\mathbb{E}_{t}\left[\sum_{i\in\tau_{t}}\left(\frac{8c_{1}^{2}B_{v}^{2}\frac{\log t}{T_{t-1,i}}}{\tilde{\Delta}_{S_{t}}}-\frac{\tilde{\Delta}_{S_{t}}}{K}\right)\right] (81)
(c)𝔼t[iτtκi,T(Tt1,i)]\displaystyle\overset{(c)}{\leq}\mathbb{E}_{t}\left[\sum_{i\in\tau_{t}}{\kappa_{i,T}(T_{t-1,i})}\right] (82)

where (a) follows from Equation 79, (b) follows from the TPE trick, replacing p_{i}^{D,S_{t}}=\mathbb{E}_{t}[\mathbb{I}\{i\in\tau_{t}\}], and (c) holds because we define the regret allocation function

κi,T()={c12Bv2Δ~imin,if =0,24c12Bv2logT,if 1Li,T,1,8c12Bv2logTΔ~imin1,if Li,T,1<Li,T,2,0,if >Li,T,2,\kappa_{i,T}(\ell)=\begin{cases}\frac{c_{1}^{2}B_{v}^{2}}{\tilde{\Delta}_{i}^{\min}},&\text{if $\ell=0$,}\\ 2\sqrt{\frac{4c_{1}^{2}B_{v}^{2}\log T}{\ell}},&\text{if $1\leq\ell\leq L_{i,T,1}$,}\\ \frac{8c_{1}^{2}B_{v}^{2}\log T}{\tilde{\Delta}_{i}^{\min}}\frac{1}{\ell},&\text{if $L_{i,T,1}<\ell\leq L_{i,T,2}$,}\\ 0,&\text{if $\ell>L_{i,T,2}$,}\end{cases} (83)

where $L_{i,T,1}=\frac{4c_{1}^{2}B_{v}^{2}\log T}{(\tilde{\Delta}_{i}^{\min})^{2}}$, $L_{i,T,2}=\frac{8c_{1}^{2}B_{v}^{2}K\log T}{(\tilde{\Delta}_{i}^{\min})^{2}}$, $\tilde{\Delta}_{i}^{\min}=\min_{S:p_{i}^{D,S}>0,\Delta_{S}>0}\Delta_{S}/(p_{i}^{D,S})^{\lambda-1}$, and (c) holds due to Lemma 16.
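For concreteness, the following is a minimal Python sketch (illustration only, not the paper's implementation) of how the regret allocation function $\kappa_{i,T}(\ell)$ in Equation 83 is evaluated; the constants $c_{1},B_{v},K,T$ and the gap $\tilde{\Delta}_{i}^{\min}$ below are placeholder values chosen purely for illustration.

```python
import math

def kappa(ell, c1, Bv, K, T, gap_min):
    """Regret allocation function kappa_{i,T}(ell) from Equation 83.

    ell     : trigger count T_{t-1,i} of base arm i so far
    gap_min : the modulated gap tilde{Delta}_i^min of arm i (placeholder)
    """
    log_T = math.log(T)
    L1 = 4 * c1**2 * Bv**2 * log_T / gap_min**2      # L_{i,T,1}
    L2 = 8 * c1**2 * Bv**2 * K * log_T / gap_min**2  # L_{i,T,2}
    if ell == 0:
        return c1**2 * Bv**2 / gap_min
    if ell <= L1:
        return 2 * math.sqrt(4 * c1**2 * Bv**2 * log_T / ell)
    if ell <= L2:
        return 8 * c1**2 * Bv**2 * log_T / (gap_min * ell)
    return 0.0

# Placeholder parameters for illustration only.
c1, Bv, K, T, gap_min = 4 * math.sqrt(3), 1.0, 8, 10000, 0.1
print([round(kappa(ell, c1, Bv, K, T, gap_min), 3) for ell in (0, 10, 10**5, 10**7)])
```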

Reg(T,Et,1)\displaystyle Reg(T,E_{t,1}) =𝔼[t=1TΔSt]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{S_{t}}\right] (84)
(a)𝔼[t[T]𝔼t[iτtκi,T(Tt1,i)]]\displaystyle\overset{(a)}{\leq}\mathbb{E}\left[\sum_{t\in[T]}\mathbb{E}_{t}\left[\sum_{i\in\tau_{t}}{\kappa_{i,T}(T_{t-1,i})}\right]\right] (85)
=(b)𝔼[t[T]iτtκi,T(Tt1,i)]\displaystyle\overset{(b)}{=}\mathbb{E}\left[\sum_{t\in[T]}\sum_{i\in\tau_{t}}{\kappa_{i,T}(T_{t-1,i})}\right] (86)
=(c)𝔼[i[m]s=0TT1,iκi,T(s)]\displaystyle\overset{(c)}{=}\mathbb{E}\left[\sum_{i\in[m]}\sum_{s=0}^{T_{T-1,i}}\kappa_{i,T}(s)\right] (87)
i[m]c12Bv2Δ~imin+i[m]s=1Li,T,124c12Bv2logTs+i[m]s=Li,T,1+1Li,T,28c12Bv2logTΔ~imin1s\displaystyle\leq\sum_{i\in[m]}\frac{c_{1}^{2}B_{v}^{2}}{\tilde{\Delta}_{i}^{\min}}+\sum_{i\in[m]}\sum_{s=1}^{L_{i,T,1}}2\sqrt{\frac{4c_{1}^{2}B_{v}^{2}\log T}{s}}+\sum_{i\in[m]}\sum_{s=L_{i,T,1}+1}^{L_{i,T,2}}\frac{8c_{1}^{2}B_{v}^{2}\log T}{\tilde{\Delta}_{i}^{\min}}\frac{1}{s} (88)
i[m]c12Bv2Δ~imin+i[m]s=0Li,T,124c12Bv2logTs𝑑s+i[m]s=Li,T,1Li,T,28c12Bv2logTΔ~imin1s𝑑s\displaystyle\leq\sum_{i\in[m]}\frac{c_{1}^{2}B_{v}^{2}}{\tilde{\Delta}_{i}^{\min}}+\sum_{i\in[m]}\int_{s=0}^{L_{i,T,1}}2\sqrt{\frac{4c_{1}^{2}B_{v}^{2}\log T}{s}}\cdot ds+\sum_{i\in[m]}\int_{s=L_{i,T,1}}^{L_{i,T,2}}\frac{8c_{1}^{2}B_{v}^{2}\log T}{\tilde{\Delta}_{i}^{\min}}\frac{1}{s}\cdot ds (89)
i[m]c12Bv2Δ~imin+i[m]8c12Bv2logTΔ~imin(3+logK),\displaystyle\leq\sum_{i\in[m]}\frac{c_{1}^{2}B_{v}^{2}}{\tilde{\Delta}_{i}^{\min}}+\sum_{i\in[m]}\frac{8c_{1}^{2}B_{v}^{2}\log T}{\tilde{\Delta}_{i}^{\min}}(3+\log K), (90)

where (a) follows from Equation 82, (b) follows from the tower rule, and (c) follows from the fact that $T_{t-1,i}$ is increased by $1$ if and only if $i\in\tau_{t}$.

(b) When λ[2,)\lambda\in[2,\infty),

Let Δ~St,λ=ΔSt/(piD,St)λ1\tilde{\Delta}_{S_{t},\lambda}=\Delta_{S_{t}}/(p_{i}^{D,S_{t}})^{\lambda-1}, Δ~St=ΔSt/piD,St\tilde{\Delta}_{S_{t}}=\Delta_{S_{t}}/p_{i}^{D,S_{t}}. Note that Δ~S,λ=Δ~S\tilde{\Delta}_{S,\lambda}=\tilde{\Delta}_{S} when λ=2\lambda=2, Δ~S,λΔ~S\tilde{\Delta}_{S,\lambda}\geq\tilde{\Delta}_{S} when λ2\lambda\geq 2, and Δ~S,λΔ~S\tilde{\Delta}_{S,\lambda}\leq\tilde{\Delta}_{S}, when λ2\lambda\leq 2, for any i,Si,S. Given filtration t1\mathcal{F}_{t-1} and under event Et,1E_{t,1}, we have

ΔSt\displaystyle\Delta_{S_{t}} i[m]4c12Bv2(piD,St)λlogtTt1,iΔSt\displaystyle\leq\sum_{i\in[m]}\frac{4c_{1}^{2}B_{v}^{2}(p_{i}^{D,S_{t}})^{\lambda}\frac{\log t}{T_{t-1,i}}}{\Delta_{S_{t}}} (91)
=ΔSt+2i[m]4c12Bv2(piD,St)λlogtTt1,iΔSt\displaystyle=-\Delta_{S_{t}}+2\sum_{i\in[m]}\frac{4c_{1}^{2}B_{v}^{2}(p_{i}^{D,S_{t}})^{\lambda}\frac{\log t}{T_{t-1,i}}}{\Delta_{S_{t}}} (92)
=iS~tpiD,StΔSt/piD,StK+2i[m]4c12Bv2(piD,St)λlogtTt1,iΔSt\displaystyle=-\frac{\sum_{i\in\tilde{S}_{t}}p_{i}^{D,S_{t}}\Delta_{S_{t}}/p_{i}^{D,S_{t}}}{K}+2\sum_{i\in[m]}\frac{4c_{1}^{2}B_{v}^{2}(p_{i}^{D,S_{t}})^{\lambda}\frac{\log t}{T_{t-1,i}}}{\Delta_{S_{t}}} (93)
i[m]piD,St(8c12Bv2logtTt1,iΔSt/(piD,St)λ1ΔSt/piD,StK)\displaystyle\leq\sum_{i\in[m]}p_{i}^{D,S_{t}}\left(\frac{8c_{1}^{2}B_{v}^{2}\frac{\log t}{T_{t-1,i}}}{\Delta_{S_{t}}/(p_{i}^{D,S_{t}})^{\lambda-1}}-\frac{\Delta_{S_{t}}/p_{i}^{D,S_{t}}}{K}\right) (94)
=i[m]piD,St(8c12Bv2logtTt1,iΔ~St,λΔ~StK).\displaystyle=\sum_{i\in[m]}p_{i}^{D,S_{t}}\left(\frac{8c_{1}^{2}B_{v}^{2}\frac{\log t}{T_{t-1,i}}}{{\tilde{\Delta}}_{S_{t},\lambda}}-\frac{\tilde{\Delta}_{S_{t}}}{K}\right). (95)
ΔSt=𝔼t[ΔSt]\displaystyle\Delta_{S_{t}}=\mathbb{E}_{t}[\Delta_{S_{t}}] 𝔼t[i[m]piD,St(8c12Bv2logtTt1,iΔ~St,λΔ~StK)]\displaystyle\leq\mathbb{E}_{t}\left[\sum_{i\in[m]}p_{i}^{D,S_{t}}\left(\frac{8c_{1}^{2}B_{v}^{2}\frac{\log t}{T_{t-1,i}}}{\tilde{\Delta}_{S_{t},\lambda}}-\frac{\tilde{\Delta}_{S_{t}}}{K}\right)\right] (96)
=𝔼t[iτt(8c12Bv2logtTt1,iΔ~St,λΔ~StK)]\displaystyle=\mathbb{E}_{t}\left[\sum_{i\in\tau_{t}}\left(\frac{8c_{1}^{2}B_{v}^{2}\frac{\log t}{T_{t-1,i}}}{\tilde{\Delta}_{S_{t},\lambda}}-\frac{\tilde{\Delta}_{S_{t}}}{K}\right)\right] (97)
𝔼t[iτtκi,T(Tt1,i)]\displaystyle\leq\mathbb{E}_{t}\left[\sum_{i\in\tau_{t}}{\kappa_{i,T}(T_{t-1,i})}\right] (98)

where the last inequality is by Lemma 17 and we define a regret allocation function

κi,T()={c12Bv2Δ~i,λmin,if =0,24c12Bv2logT,if 1Li,T,1,8c12Bv2logTΔ~i,λmin1,if Li,T,1<Li,T,2,0,if >Li,T,2,\kappa_{i,T}(\ell)=\begin{cases}\frac{c_{1}^{2}B_{v}^{2}}{\tilde{\Delta}_{i,\lambda}^{\min}},&\text{if $\ell=0$,}\\ 2\sqrt{\frac{4c_{1}^{2}B_{v}^{2}\log T}{\ell}},&\text{if $1\leq\ell\leq L_{i,T,1}$,}\\ \frac{8c_{1}^{2}B_{v}^{2}\log T}{\tilde{\Delta}_{i,\lambda}^{\min}}\frac{1}{\ell},&\text{if $L_{i,T,1}<\ell\leq L_{i,T,2}$,}\\ 0,&\text{if $\ell>L_{i,T,2}$,}\end{cases} (99)

where $L_{i,T,1}=\frac{4c_{1}^{2}B_{v}^{2}\log T}{\tilde{\Delta}_{i}^{\min}\cdot\tilde{\Delta}_{i,\lambda}^{\min}}$, $L_{i,T,2}=\frac{8c_{1}^{2}B_{v}^{2}K\log T}{\tilde{\Delta}_{i}^{\min}\cdot\tilde{\Delta}_{i,\lambda}^{\min}}$, $\tilde{\Delta}_{i}^{\min}=\min_{S:p_{i}^{D,S}>0,\Delta_{S}>0}\Delta_{S}/p_{i}^{D,S}$, and $\tilde{\Delta}_{i,\lambda}^{\min}=\min_{S:p_{i}^{D,S}>0,\Delta_{S}>0}\Delta_{S}/(p_{i}^{D,S})^{\lambda-1}$.

Reg(T,Et,1)\displaystyle Reg(T,E_{t,1}) =𝔼[t=1TΔSt]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{S_{t}}\right] (100)
(a)𝔼[t[T]𝔼t[iτtκi,T(Tt1,i)]]\displaystyle\overset{(a)}{\leq}\mathbb{E}\left[\sum_{t\in[T]}\mathbb{E}_{t}\left[\sum_{i\in\tau_{t}}{\kappa_{i,T}(T_{t-1,i})}\right]\right] (101)
=(b)𝔼[t[T]iτtκi,T(Tt1,i)]\displaystyle\overset{(b)}{=}\mathbb{E}\left[\sum_{t\in[T]}\sum_{i\in\tau_{t}}{\kappa_{i,T}(T_{t-1,i})}\right] (102)
=(c)𝔼[i[m]s=0TT1,iκi,T(s)]\displaystyle\overset{(c)}{=}\mathbb{E}\left[\sum_{i\in[m]}\sum_{s=0}^{T_{T-1,i}}\kappa_{i,T}(s)\right] (103)
$\displaystyle\leq\sum_{i\in[m]}\frac{c_{1}^{2}B_{v}^{2}}{\tilde{\Delta}_{i,\lambda}^{\min}}+\sum_{i\in[m]}\sum_{s=1}^{L_{i,T,1}}2\sqrt{\frac{4c_{1}^{2}B_{v}^{2}\log T}{s}}+\sum_{i\in[m]}\sum_{s=L_{i,T,1}+1}^{L_{i,T,2}}\frac{8c_{1}^{2}B_{v}^{2}\log T}{\tilde{\Delta}_{i,\lambda}^{\min}}\frac{1}{s}$ (104)
$\displaystyle\leq\sum_{i\in[m]}\frac{c_{1}^{2}B_{v}^{2}}{\tilde{\Delta}_{i,\lambda}^{\min}}+\sum_{i\in[m]}\int_{s=0}^{L_{i,T,1}}2\sqrt{\frac{4c_{1}^{2}B_{v}^{2}\log T}{s}}\cdot ds+\sum_{i\in[m]}\int_{s=L_{i,T,1}}^{L_{i,T,2}}\frac{8c_{1}^{2}B_{v}^{2}\log T}{\tilde{\Delta}_{i,\lambda}^{\min}}\frac{1}{s}\cdot ds$ (105)
i[m]c12Bv2Δ~i,λmin+i[m]8c12Bv2logTΔ~i,λmin(1+logK)+i[m]16c12Bv2logTΔ~i,λminΔ~imin,\displaystyle\leq\sum_{i\in[m]}\frac{c_{1}^{2}B_{v}^{2}}{\tilde{\Delta}_{i,\lambda}^{\min}}+\sum_{i\in[m]}\frac{8c_{1}^{2}B_{v}^{2}\log T}{\tilde{\Delta}_{i,\lambda}^{\min}}(1+\log K)+\sum_{i\in[m]}\frac{16c_{1}^{2}B_{v}^{2}\log T}{\sqrt{\tilde{\Delta}_{i,\lambda}^{\min}\cdot\tilde{\Delta}_{i}^{\min}}}, (106)

where (a) follows from Equation 98, (b) follows from the tower rule, and (c) follows from the fact that $T_{t-1,i}$ is increased by $1$ if and only if $i\in\tau_{t}$.

C.2.2 Bounding the Reg(T,Et,2)Reg(T,E_{t,2}) term

Let c2=28c_{2}=28. Given filtration t1\mathcal{F}_{t-1} and event Et,2E_{t,2}, we have

ΔSt\displaystyle\Delta_{S_{t}} (a)iS~t2c2B1piD,Stmin{1/28,logTTt1,i}\displaystyle\overset{(a)}{\leq}\sum_{i\in\tilde{S}_{t}}2c_{2}B_{1}p_{i}^{D,S_{t}}\min\left\{1/28,\frac{\log T}{T_{t-1,i}}\right\}
(b)ΔSt+2iS~t2c2B1piD,Stmin{1/28,logTTt1,i}\displaystyle\overset{(b)}{\leq}-\Delta_{S_{t}}+2\sum_{i\in\tilde{S}_{t}}2c_{2}B_{1}p_{i}^{D,S_{t}}\min\left\{1/28,\frac{\log T}{T_{t-1,i}}\right\}
=i[m]piD,StΔSti[m]piD,St+2i[m]2c2B1piD,Stmin{1/28,logTTt1,i}\displaystyle=-\frac{\sum_{i\in[m]}p_{i}^{D,S_{t}}\Delta_{S_{t}}}{\sum_{i\in[m]}p_{i}^{D,S_{t}}}+2\sum_{i\in[m]}2c_{2}B_{1}p_{i}^{D,S_{t}}\min\left\{1/28,\frac{\log T}{T_{t-1,i}}\right\} (107)
(c)i[m]piD,St(ΔStK+4c2B1min{1/28,logTTt1,i}),\displaystyle\overset{(c)}{\leq}\sum_{i\in[m]}p_{i}^{D,S_{t}}\left(-\frac{\Delta_{S_{t}}}{K}+4c_{2}B_{1}\min\left\{1/28,\frac{\log T}{T_{t-1,i}}\right\}\right), (108)

where (a) follows from event $E_{t,2}$, (b) is by the reverse amortization trick, which multiplies both sides of (a) by two and rearranges the terms, and (c) follows from $\sum_{i\in[m]}p_{i}^{D,S_{t}}\leq K$ (since $p_{i}^{D,S_{t}}\leq 1$ and at most $K$ arms can be triggered).

It follows that

ΔSt=𝔼t[ΔSt]\displaystyle\Delta_{S_{t}}=\mathbb{E}_{t}[\Delta_{S_{t}}] (a)𝔼t[i[m]piD,St(ΔStK+4c2B1min{1/28,logTTt1,i})]\displaystyle\overset{(a)}{\leq}\mathbb{E}_{t}\left[\sum_{i\in[m]}p_{i}^{D,S_{t}}\left(-\frac{\Delta_{S_{t}}}{K}+4c_{2}B_{1}\min\left\{1/28,\frac{\log T}{T_{t-1,i}}\right\}\right)\right]
=(b)𝔼t[iτt(ΔStK+4c2B1min{1/28,logTTt1,i})]\displaystyle\overset{(b)}{=}\mathbb{E}_{t}\left[\sum_{i\in\tau_{t}}\left(-\frac{\Delta_{S_{t}}}{K}+4c_{2}B_{1}\min\left\{1/28,\frac{\log T}{T_{t-1,i}}\right\}\right)\right]
(c)𝔼t[iτtκi,T(Tt1,i)]\displaystyle\overset{(c)}{\leq}\mathbb{E}_{t}\left[\sum_{i\in\tau_{t}}{\kappa_{i,T}(T_{t-1,i})}\right] (109)

where (c) holds because we define the regret allocation function

$\kappa_{i,T}(\ell)=\begin{cases}\Delta_{i}^{\max},&\text{if $0\leq\ell\leq L_{i,T,1}$,}\\ \frac{4c_{2}B_{1}\log T}{\ell},&\text{if $L_{i,T,1}<\ell\leq L_{i,T,2}$,}\\ 0,&\text{if $\ell>L_{i,T,2}$,}\end{cases}$ (110)

where $L_{i,T,1}=\frac{4c_{2}B_{1}\log T}{\Delta_{i}^{\max}}$ and $L_{i,T,2}=\frac{4c_{2}B_{1}K\log T}{\Delta_{i}^{\min}}$. Here (a) follows from Equation 108, (b) from the TPE trick, and (c) from Lemma 18.

Reg(T,Et,2)\displaystyle Reg(T,E_{t,2}) =𝔼[t=1TΔSt]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{S_{t}}\right] (111)
(a)𝔼[t[T]𝔼t[iτtκi,T(Tt1,i)]]\displaystyle\overset{(a)}{\leq}\mathbb{E}\left[\sum_{t\in[T]}\mathbb{E}_{t}\left[\sum_{i\in\tau_{t}}{\kappa_{i,T}(T_{t-1,i})}\right]\right] (112)
=(b)𝔼[t[T]iτtκi,T(Tt1,i)]\displaystyle\overset{(b)}{=}\mathbb{E}\left[\sum_{t\in[T]}\sum_{i\in\tau_{t}}{\kappa_{i,T}(T_{t-1,i})}\right] (113)
=(c)𝔼[i[m]s=0TT1,iκi,T(s)]\displaystyle\overset{(c)}{=}\mathbb{E}\left[\sum_{i\in[m]}\sum_{s=0}^{T_{T-1,i}}\kappa_{i,T}(s)\right] (114)
mΔmax+i[m]=1Li,T,1Δimax+i[m]=Li,T,1+1Li,T,24c2B1logT\displaystyle\leq m\Delta_{\max}+\sum_{i\in[m]}\sum_{\ell=1}^{L_{i,T,1}}\Delta_{i}^{\max}+\sum_{i\in[m]}\sum_{\ell=L_{i,T,1}+1}^{L_{i,T,2}}\frac{4c_{2}B_{1}\log T}{\ell} (115)
mΔmax+i[m]4c2B1logT+i[m]4c2B1log(KΔimaxΔimin)logT\displaystyle\leq m\Delta_{\max}+\sum_{i\in[m]}4c_{2}B_{1}\log T+\sum_{i\in[m]}4c_{2}B_{1}\log(\frac{K\Delta_{i}^{\max}}{\Delta_{i}^{\min}})\log T (116)
$\displaystyle=m\Delta_{\max}+\sum_{i\in[m]}4c_{2}B_{1}\left(1+\log(\frac{K\Delta_{i}^{\max}}{\Delta_{i}^{\min}})\right)\log T,$ (117)

where (a) follows from Equation 109, (b) follows from the tower rule, and (c) follows from the fact that $T_{t-1,i}$ is increased by $1$ if and only if $i\in\tau_{t}$. ∎

Appendix D Applications

For convenience, we show our table again in Table 3.

Table 3: Summary of the coefficients, regret bounds and improvements for various applications.
Application Condition (Bv,B1,λ)(B_{v},B_{1},\lambda) Regret Improvement
Online Influence Maximization (Wen et al., 2017) TPM (,|V|,)(-,|V|,-) O(d|V||E|TlogT)O(d|V|\sqrt{|E|T}\log T) O~(|E|)\tilde{O}(\sqrt{|E|})
Disjunctive Combinatorial Cascading Bandits (Li et al., 2016) TPVM (1,1,1)(1,1,1) O(dTlogT)O(d\sqrt{T}\log T) O~(K/pmin)\tilde{O}(\sqrt{K}/p_{\min})
Conjunctive Combinatorial Cascading Bandits (Li et al., 2016) TPVM (1,1,1)(1,1,1) O(dTlogT)O(d\sqrt{T}\log T) O~(K/rmax)\tilde{O}(\sqrt{K}/r_{\max})
Linear Cascading Bandits (Vial et al., 2022) TPVM (1,1,1)(1,1,1) O(dTlogT)O(d\sqrt{T}\log T) O~(K/d)\tilde{O}(\sqrt{K/d})
Multi-layered Network Exploration (Liu et al., 2021b) TPVM (1.25|V|,1,2)(\sqrt{1.25|V|},1,2) O(d|V|TlogT)O(d\sqrt{|V|T}\log T) O~(n/pmin)\tilde{O}(\sqrt{n}/p_{\min})
Probabilistic Maximum Coverage (Chen et al., 2013)∗∗ VM (32|V|,1,)(3\sqrt{2|V|},1,-) O(d|V|TlogT)O(d\sqrt{|V|T}\log T) O~(k)\tilde{O}(\sqrt{k})
  • $|V|,|E|,n,k,L$ denote the number of target nodes, the number of edges that can be triggered by the set of seed nodes, the number of layers, the number of seed nodes, and the length of the longest directed path, respectively;

  • K is the length of the ordered list, rmax=αmaxt[T],S𝒮r(S;𝝁t)r_{\max}=\alpha\cdot\max_{t\in[T],S\in\mathcal{S}}r(S;\bm{\mu}_{t});

  • A special case of disjunctive combinatorial cascading bandits.

  • ∗∗ This row is for C2MAB application and the rest of rows are for C2MAB-T applications.

D.1 Online Influence Maximization Bandit (Wang & Chen, 2017) and Its Contextual Generalization (Wen et al., 2017)

Following the setting of (Wang & Chen, 2017, Section 2.1), we consider a weighted directed graph $G(V,E,p)$, where $V$ is the set of vertices, $E$ is the set of directed edges, and each edge $(u,v)\in E$ is associated with a probability $p(u,v)\in[0,1]$. When the agent selects a set of seed nodes $S\subseteq V$, the influence propagates as follows: at time $0$, the seed nodes $S$ are activated; at time $t\geq 1$, a node $u$ activated at time $t-1$ has one chance to activate each of its inactive out-neighbors $v$ with independent probability $p(u,v)$. The influence spread of $S$, denoted $\sigma(S)$, is defined as the expected number of activated nodes after the propagation process ends. The influence maximization problem is to find a seed set $S$ with $|S|\leq k$ so that the influence spread $\sigma(S)$ is maximized.

For the problem of online influence maximization (OIM), we consider $T$ rounds of repeated influence maximization tasks where the edge probabilities $p(u,v)$ are initially unknown. In each round $t\in[T]$, the agent selects $k$ seed nodes as $S_{t}$, the influence propagation of $S_{t}$ is observed, and the reward is the number of nodes activated in round $t$. The agent's goal is to accumulate as much reward as possible over the $T$ rounds. OIM fits into the CMAB-T framework: the edges $E$ are the set of base arms $[m]$, the (unknown) outcome distribution $D$ is the joint distribution of $m$ independent Bernoulli random variables over the edge set $E$, and the actions $S$ are seed node sets of size at most $k$. For the arm triggering, the triggered set $\tau_{t}$ is the set of edges $(u,v)$ whose source node $u$ is reachable from $S_{t}$. Let $X_{t}$ be the outcomes of the edges $E$ drawn according to the probabilities $p(u,v)$, and let the live-edge graph $G_{t}^{\text{live}}(V,E^{\text{live}})$ be the induced graph containing the edges that are alive, i.e., $e\in E^{\text{live}}$ iff $X_{t,e}=1$ for $e\in E$. The triggering probability distribution $D_{\text{trig}}(S_{t},X_{t})$ degenerates to a deterministic triggered set, i.e., $\tau_{t}$ is deterministically decided given $S_{t}$ and $X_{t}$. The reward $R(S_{t},X_{t},\tau_{t})$ equals the number of nodes activated by the end of round $t$, i.e., the nodes that are reachable from $S_{t}$ in the live-edge graph $G_{t}^{\text{live}}$. The offline oracle is a $(1-1/e-\varepsilon,1/|V|)$-approximation algorithm given by the greedy algorithm of (Kempe et al., 2003).
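To make the arm-triggering process concrete, here is a minimal Python sketch (an illustration under a toy graph with placeholder probabilities, not the paper's implementation) that samples a live-edge graph from the edge probabilities, computes the triggered edge set $\tau_{t}$, and returns the reward for a given seed set.

```python
import random
from collections import deque

def simulate_oim_round(edge_prob, seeds, rng=random):
    """One OIM round: draw the live-edge graph X_t, then spread from the seeds.

    edge_prob : dict mapping a directed edge (u, v) to its probability p(u, v)
    Returns (triggered_edges, reward), where triggered_edges are the edges whose
    source node is reachable from the seeds (the observed arms tau_t) and
    reward is the number of activated nodes.
    """
    # Edge e is alive iff its Bernoulli outcome X_{t,e} equals 1.
    alive = {e: rng.random() < p for e, p in edge_prob.items()}
    activated = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for (src, dst), is_alive in alive.items():
            if src == u and is_alive and dst not in activated:
                activated.add(dst)
                queue.append(dst)
    # Observed (triggered) arms: edges whose source node was reached.
    triggered = [e for e in edge_prob if e[0] in activated]
    return triggered, len(activated)

# Toy 4-node graph with placeholder edge probabilities (illustration only).
edge_prob = {(0, 1): 0.8, (1, 2): 0.5, (2, 3): 0.3, (0, 3): 0.1}
print(simulate_oim_round(edge_prob, seeds=[0]))
```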

Turning to OIM's contextual generalization for large-scale problems, we follow Wen et al. (2017), where each edge $e=(u,v)$ is associated with a known feature vector $\bm{x}_{e}\in\mathbb{R}^{d}$ and an unknown parameter $\bm{\theta}^{*}\in\mathbb{R}^{d}$, and the edge probability is $p(u,v)=\langle\bm{x}_{e},\bm{\theta}^{*}\rangle$. By Lemma 2 of (Wang & Chen, 2017), $B_{1}=\tilde{C}\leq|V|$, where $\tilde{C}$ is the largest number of nodes any node can reach, and the batch size satisfies $K\leq|E|$, so by Theorem 1, C2-UCB-T obtains a worst-case $\tilde{O}(d|V|\sqrt{|E|T})$ regret bound. Compared with the IMLinUCB algorithm (Wen et al., 2017), which achieves $\tilde{O}(d(|V|-k)|E|\sqrt{T})$, our regret bound achieves an improvement of up to a factor of $\tilde{O}(\sqrt{|E|})$.

As for the claim that the triggering probability satisfies $B_{p}=B_{1}$, it follows from Theorem 4 of Li et al. (2020) by identifying $f(S,w,v)=p_{i}^{w,S}$.

D.2 Contextual Combinatorial Cascading Bandits (Li et al., 2016)

Contextual combinatorial cascading bandits fall into two categories: conjunctive cascading bandits and disjunctive cascading bandits (Li et al., 2016). We also compare with linear cascading bandits, a special case that also uses variance-adaptive algorithms and achieves very competitive results.

Disjunctive form. For the disjunctive form, we want to select an ordered list $S$ of $K$ items out of $L$ items in total, so as to maximize the probability that at least one of the outcomes of the selected items is $1$. Each item is associated with a Bernoulli random variable with mean $\mu_{t,i}\in[0,1]$ at round $t$, indicating whether the user will be satisfied with the item upon scanning it. To leverage contextual information, Li et al. (2016) assume $\mu_{t,i}=\langle\bm{x}_{t,i},\bm{\theta}^{*}\rangle$, where $\bm{x}_{t,i}\in\mathbb{R}^{d}$ is the known context at round $t$ for arm $i$ and $\bm{\theta}^{*}\in\mathbb{R}^{d}$ is the unknown parameter to be learned. This setting models a movie recommendation system where the user sequentially scans a list of recommended items and the system is rewarded when the user is satisfied with any recommended item. Once the user is satisfied with some item, or has scanned all $K$ items without being satisfied, the user leaves the system. Due to this stopping rule, the agent only observes the outcomes of items up to and including the first item whose outcome is $1$; if no item is satisfactory, all observed outcomes are $0$. In other words, the triggered set is the prefix of items until the stopping condition holds.

Without loss of generality, let the action be $\{1,...,K\}$; then the reward function is $r(S;\bm{\mu})=1-\prod_{j=1}^{K}(1-\mu_{j})$ and the triggering probability is $p_{i}^{\bm{\mu},S}=\prod_{j=1}^{i-1}(1-\mu_{j})$. Let $\bar{\bm{\mu}}=(\bar{\mu}_{1},...,\bar{\mu}_{K})$ and $\bm{\mu}=(\mu_{1},...,\mu_{K})$, where $\bar{\bm{\mu}}=\bm{\mu}+\bm{\zeta}+\bm{\eta}$ with $\bar{\bm{\mu}},\bm{\mu}\in(0,1)^{K}$ and $\bm{\zeta},\bm{\eta}\in[-1,1]^{K}$. By Lemma 19 in Liu et al. (2021a), disjunctive CB satisfies Condition 4 with $(B_{v},B_{1},\lambda)=(1,1,1)$. We can also verify that disjunctive CB satisfies $B_{p}=B_{1}=1$ as follows:

|pi𝝁¯,Spi𝝁,S|\displaystyle\left|p_{i}^{\bar{\bm{\mu}},S}-p_{i}^{\bm{\mu},S}\right|
=|j=1i(1μj)j=1i(1μ¯j)|\displaystyle=\left|\prod_{j=1}^{i}(1-\mu_{j})-\prod_{j=1}^{i}(1-\bar{\mu}_{j})\right| (119)
$\displaystyle\leq\sum_{j=1}^{i}\left|\bar{\mu}_{j}-\mu_{j}\right|(1-\mu_{1})...(1-\mu_{j-1})(1-\bar{\mu}_{j+1})...(1-\bar{\mu}_{i})$ (120)
j=1i|μ¯jμj|(1μ1)(1μj1)\displaystyle\leq\sum_{j=1}^{i}\left|\bar{\mu}_{j}-\mu_{j}\right|(1-\mu_{1})...(1-\mu_{j-1}) (121)
=j=1i|μ¯jμj|pj𝝁,S.\displaystyle=\sum_{j=1}^{i}\left|\bar{\mu}_{j}-\mu_{j}\right|p_{j}^{\bm{\mu},S}. (122)

Now by Theorem 3, VAC2-UCB obtains a regret bound of $O(d\sqrt{T}\log T)$. Compared with Corollary 4.5 in Li et al. (2016), which yields an $O(d\sqrt{KT}\log T/p_{\min})$ regret bound, our result improves by a factor of $O(\sqrt{K}/p_{\min})$.
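For illustration, the following Python sketch (not from the paper; the click probabilities are placeholders) computes the disjunctive reward $r(S;\bm{\mu})$, the triggering probabilities $p_{i}^{\bm{\mu},S}$, and simulates one round of cascade feedback under the stopping rule described above.

```python
import random

def disjunctive_reward(mu):
    """r(S; mu) = 1 - prod_j (1 - mu_j): prob. at least one item satisfies the user."""
    prod = 1.0
    for m in mu:
        prod *= 1.0 - m
    return 1.0 - prod

def trigger_probs(mu):
    """p_i^{mu,S} = prod_{j<i} (1 - mu_j): prob. the user reaches (observes) item i."""
    probs, prefix = [], 1.0
    for m in mu:
        probs.append(prefix)
        prefix *= 1.0 - m
    return probs

def cascade_feedback(mu, rng=random):
    """Observed prefix of outcomes: stop right after the first item with outcome 1."""
    observed = []
    for m in mu:
        x = int(rng.random() < m)
        observed.append(x)
        if x == 1:
            break
    return observed

mu = [0.2, 0.1, 0.05, 0.05]  # placeholder click probabilities for an ordered list
print(disjunctive_reward(mu), trigger_probs(mu), cascade_feedback(mu))
```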

Conjunctive form. For the conjunctive form, the learning agent wants to select $K$ paths out of $L$ paths (i.e., base arms) in total, so as to maximize the probability that the outcomes of the selected paths are all $1$. Each path is associated with a Bernoulli random variable with mean $\mu_{t,i}$ at round $t$, indicating whether the path is live when a packet is transmitted via it. Such a setting models the network routing problem (Kveton et al., 2015a), where the items are routing paths and the packet is delivered when all paths are alive. The learning agent observes the outcomes of the paths up to and including the first one that is down, since transmission stops once any path is down. In other words, the triggered set is the prefix of paths until the stopping condition holds.

Without loss of generality, let the action be $\{1,...,K\}$; then the reward function is $r(S;\bm{\mu})=\prod_{j=1}^{K}\mu_{j}$ and the triggering probability is $p_{i}^{\bm{\mu},S}=\prod_{j=1}^{i-1}\mu_{j}$. Let $\bar{\bm{\mu}}=(\bar{\mu}_{1},...,\bar{\mu}_{K})$ and $\bm{\mu}=(\mu_{1},...,\mu_{K})$, where $\bar{\bm{\mu}}=\bm{\mu}+\bm{\zeta}+\bm{\eta}$ with $\bar{\bm{\mu}},\bm{\mu}\in(0,1)^{K}$ and $\bm{\zeta},\bm{\eta}\in[-1,1]^{K}$. By Lemma 20 in Liu et al. (2021a), conjunctive CB satisfies Condition 4 with $(B_{v},B_{1},\lambda)=(1,1,1)$. We can also verify that conjunctive CB satisfies $B_{p}=B_{1}=1$ as follows:

|pi𝝁¯,Spi𝝁,S|\displaystyle\left|p_{i}^{\bar{\bm{\mu}},S}-p_{i}^{\bm{\mu},S}\right|
=|j=1iμjj=1iμ¯j|\displaystyle=\left|\prod_{j=1}^{i}\mu_{j}-\prod_{j=1}^{i}\bar{\mu}_{j}\right| (123)
$\displaystyle\leq\sum_{j=1}^{i}\left|\bar{\mu}_{j}-\mu_{j}\right|(\mu_{1})...(\mu_{j-1})(\bar{\mu}_{j+1})...(\bar{\mu}_{i})$ (124)
j=1i|μ¯jμj|(μ1)(μj1)\displaystyle\leq\sum_{j=1}^{i}\left|\bar{\mu}_{j}-\mu_{j}\right|(\mu_{1})...(\mu_{j-1}) (125)
=j=1i|μ¯jμj|pj𝝁,S.\displaystyle=\sum_{j=1}^{i}\left|\bar{\mu}_{j}-\mu_{j}\right|p_{j}^{\bm{\mu},S}. (126)

Now by Theorem 3, VAC2-UCB obtains a regret bound of $O(d\sqrt{T}\log T)$. Compared with Corollary 4.6 in Li et al. (2016), which yields an $O(d\sqrt{KT}\log T/r_{\max})$ regret bound, our result improves by a factor of $O(\sqrt{K}/r_{\max})$.
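Analogously, the conjunctive quantities can be sketched as follows (illustrative Python only; the path reliabilities are placeholders): the reward is the probability that all selected paths are up, and the agent observes the prefix of paths up to and including the first one that is down.

```python
import random

def conjunctive_reward(mu):
    """r(S; mu) = prod_j mu_j: probability that every selected path is up."""
    prod = 1.0
    for m in mu:
        prod *= m
    return prod

def conjunctive_trigger_probs(mu):
    """p_i^{mu,S} = prod_{j<i} mu_j: prob. the transmission reaches path i."""
    probs, prefix = [], 1.0
    for m in mu:
        probs.append(prefix)
        prefix *= m
    return probs

def routing_feedback(mu, rng=random):
    """Observe path outcomes up to (and including) the first one that is down."""
    observed = []
    for m in mu:
        x = int(rng.random() < m)
        observed.append(x)
        if x == 0:
            break
    return observed

mu = [0.9, 0.95, 0.8]  # placeholder path reliabilities
print(conjunctive_reward(mu), conjunctive_trigger_probs(mu), routing_feedback(mu))
```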

Linear Cascading Bandit. The linear cascading bandit (Vial et al., 2022) is a special case of the combinatorial cascading bandit (Li et al., 2016). The former assumes that the action space $\mathcal{S}$ is the collection of all ordered lists of size exactly $K$ (i.e., a uniform matroid). In this case, the items in feasible solutions are exchangeable (a critical property of matroids), i.e., $S-\{e_{1}\}+\{e_{2}\}\in\mathcal{S}$ for any $S\in\mathcal{S}$ and $e_{1},e_{2}\in[m]$, and their analysis relies on this property. For the latter, however, $\mathcal{S}$ (i.e., $\Theta$ in Li et al. (2016)) consists of arbitrary feasible actions (possibly of different sizes); e.g., $S\in\mathcal{S}$ could be any path connecting the source and the destination in network routing applications.

Other than the above difference, linear cascading bandits follow the same setting as disjunctive contextual combinatorial cascading bandits. By an argument similar to the disjunctive case, the regret bound of VAC2-UCB is $O(d\sqrt{T}\log T)$. Compared with CascadeWOFUL, which achieves $\tilde{O}(\sqrt{d(d+K)T})$ by Theorem 4 in Vial et al. (2022), our regret improves by a factor of $\tilde{O}(\sqrt{1+K/d})$. For an empirical comparison, see Section 5.

D.3 Multi-layered Network Exploration Problem (MuLaNE) (Liu et al., 2021b)

We consider the MuLaNE problem with random node weights. After applying the bipartite coverage graph, the corresponding graph is a tri-partite graph $(n,V,R)$ (i.e., a 3-layered graph where the first and second layers form one bipartite graph, and the second and third layers form another), where the left nodes represent the $n$ random walkers, the middle nodes are the $|V|$ possible targets $V$ to be explored, and the right nodes $R$ are the $V$ nodes, each of which has a single edge connecting it to the corresponding middle node. The MuLaNE task is to allocate a budget $B$ across the $n$ layers to explore the target nodes $V$, and the base arms are $\mathcal{A}=\{(i,u,b):i\in[n],u\in V,b\in[B]\}$.

With budget allocation $k_{1},...,k_{n}$, the (effective) base arms consist of two parts:

(1) $\{(i,j):i\in[n],j\in V\}$, each of which is associated with a visiting probability $x_{i,j}\in[0,1]$ indicating whether node $j$ will be visited by explorer $i$ given budget $k_{i}$. All base arms corresponding to the budgets $k_{i},i\in[n]$, are triggered.

(2) $y_{j}\in[0,1]$ for $j\in V$, representing the random node weight; its triggering probability is $p_{j}^{\bm{\mu},S}=1-\prod_{i\in[n]}{(1-x_{i,j})}$.

For its contextual generalization, we assume xi,j=ϕx(i,j),𝜽x_{i,j}=\langle\bm{\phi}_{x}(i,j),\bm{\theta}^{*}\rangle, yj=ϕy(j),𝜽y_{j}=\langle\bm{\phi}_{y}(j),\bm{\theta}^{*}\rangle, where ϕx(i,j),ϕy(j)\bm{\phi}_{x}(i,j),\bm{\phi}_{y}(j) are the known features for visiting probability and the node weights for large-scale MuLaNE applications, respectively. Let effective base arms 𝝁=(𝒙,𝒚)(0,1)(n|V|+|V|),𝝁¯=(𝒙¯,𝒚¯)(0,1)(n|V|+|V|)\bm{\mu}=(\bm{x},\bm{y})\in(0,1)^{(n|V|+|V|)},\bar{\bm{\mu}}=(\bar{\bm{x}},\bar{\bm{y}})\in(0,1)^{(n|V|+|V|)}, where 𝒙¯=𝜻x+𝜼x+𝒙,𝒚¯=𝜻y+𝜼y+𝒚\bar{\bm{x}}=\bm{\zeta}_{x}+\bm{\eta}_{x}+\bm{x},\bar{\bm{y}}=\bm{\zeta}_{y}+\bm{\eta}_{y}+\bm{y}, for 𝜻,𝜼[1,1](n|V|+|V|)\bm{\zeta},\bm{\eta}\in[-1,1]^{(n|V|+|V|)}. For the target node jVj\in V, the per-target reward function rj(S;𝒙,𝒚)=yj(1i[n](1xi,j))r_{j}(S;\bm{x},\bm{y})=y_{j}(1-\prod_{i\in[n]}(1-x_{i,j})). Denote p¯j𝝁,S=1i[n](1x¯i,j)\bar{p}_{j}^{\bm{\mu},S}=1-\prod_{i\in[n]}{(1-\bar{x}_{i,j})}. Based on Lemma 21 in Liu et al. (2022), contextual MuLaNE satisfies Condition 4 with (Bv,B1,λ)=(1.25|V|,1,2)(B_{v},B_{1},\lambda)=(\sqrt{1.25|V|},1,2). To validate that this application satisfies Condition 5 with Bp=B1=1B_{p}=B_{1}=1, we have

|pj𝝁¯,Spj𝝁,S|\displaystyle\left|p_{j}^{\bar{\bm{\mu}},S}-p_{j}^{\bm{\mu},S}\right|
=|i[n](1xi,j)i[n](1x¯i,j)|\displaystyle=\left|\prod_{i\in[n]}(1-x_{i,j})-\prod_{i\in[n]}(1-\bar{x}_{i,j})\right| (127)
$\displaystyle\leq\sum_{i\in[n]}\left|\bar{x}_{i,j}-x_{i,j}\right|(1-x_{1,j})...(1-x_{i-1,j})(1-\bar{x}_{i+1,j})...(1-\bar{x}_{n,j})$ (128)
$\displaystyle\leq\sum_{i\in[n]}\left|\bar{x}_{i,j}-x_{i,j}\right|.$ (129)

By Theorem 3, we obtain an $O(d\sqrt{|V|T}\log T)$ regret bound, which improves the $O(d\sqrt{n|V|T}\log T/p_{\min})$ bound that follows from the C3UCB algorithm (Li et al., 2016) by a factor of $O(\sqrt{n}/p_{\min})$.
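As a small illustration (Python, with placeholder visiting probabilities and node weight, not the paper's code), the sketch below evaluates the MuLaNE triggering probability $p_{j}^{\bm{\mu},S}$ and the per-target reward $r_{j}(S;\bm{x},\bm{y})$ for a single target node.

```python
def mulane_per_target(x_col, y_j):
    """x_col : visiting probabilities x_{i,j} of one target j for the n explorers.
    y_j     : random node weight of target j.
    Returns (p_j, r_j) with p_j = 1 - prod_i (1 - x_{i,j}) and r_j = y_j * p_j.
    """
    not_visited = 1.0
    for x in x_col:
        not_visited *= 1.0 - x
    p_j = 1.0 - not_visited
    return p_j, y_j * p_j

# Toy instance: n = 3 explorers and a single target with weight y_j (placeholders).
print(mulane_per_target(x_col=[0.3, 0.5, 0.2], y_j=0.7))
```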

D.4 Probabilistic Maximum Coverage Bandit (Chen et al., 2016a; Merlis & Mannor, 2019)

In this section, we consider the probabilistic maximum coverage (PMC) problem. PMC is modeled by a weighted bipartite graph $G=(L,V,E)$, where $L$ is the set of source nodes, $V$ is the set of target nodes, and each edge $(u,v)\in E$ is associated with a probability $p(u,v)$. The task of PMC is to select a set $S\subseteq L$ of size $k$ so as to maximize the expected number of nodes activated in $V$, where a node $v\in V$ can be activated by a node $u\in S$ with independent probability $p(u,v)$. PMC naturally models the advertisement placement application, where $L$ is the set of candidate web-pages, $V$ is the set of users, and $p(u,v)$ is the probability that user $v$ clicks on web-page $u$.

PMC fits into the non-triggering CMAB framework: each edge $(u,v)\in E$ corresponds to a base arm, the action is the set of edges incident to the chosen set $S\subseteq L$, and the unknown mean vector $\bm{\mu}\in(0,1)^{E}$ has $\mu_{u,v}=p(u,v)$, with the entries assumed independent across base arms. In this context, the reward function is $r(S;\bm{\mu})=\sum_{v\in V}(1-\prod_{u\in S}(1-\mu_{u,v}))$.

In this paper, we consider a contextual generalization by assuming that p(u,v)=ϕ(u,v),𝜽p(u,v)=\langle\bm{\phi}(u,v),\bm{\theta}^{*}\rangle, where ϕ(u,v)d\bm{\phi}(u,v)\in\mathbb{R}^{d} is the known context and 𝜽d\bm{\theta}^{*}\in\mathbb{R}^{d} is the unknown parameter. By Lemma 24 in Liu et al. (2022), PMC satisfies Condition 3 with (Bv,B1)=(32|V|,1)(B_{v},B_{1})=(3\sqrt{2|V|},1). Following Theorem 2, VAC2-UCB obtains O(d|V|TlogT)O(d\sqrt{|V|T}\log T), which improves the C3UCB algorithm’s bound O(dk|V|TlogT)O(d\sqrt{k|V|T}\log T) (Li et al., 2016) by a factor of O(k)O(\sqrt{k}).
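As a small worked example (illustrative Python with made-up features and parameter vector), the sketch below forms the contextual click probabilities $p(u,v)=\langle\bm{\phi}(u,v),\bm{\theta}^{*}\rangle$ and evaluates the PMC reward $r(S;\bm{\mu})=\sum_{v\in V}(1-\prod_{u\in S}(1-\mu_{u,v}))$.

```python
def pmc_reward(selected_pages, users, phi, theta):
    """phi : dict mapping (u, v) -> feature vector; theta : parameter vector.
    Click probability mu_{u,v} = <phi(u,v), theta>; the reward sums, over users v,
    the probability that at least one selected page u activates v.
    """
    total = 0.0
    for v in users:
        miss = 1.0
        for u in selected_pages:
            mu_uv = sum(a * b for a, b in zip(phi[(u, v)], theta))
            miss *= 1.0 - mu_uv
        total += 1.0 - miss
    return total

# Toy instance (placeholders): 2 pages, 2 users, d = 2.
theta = [0.4, 0.2]
phi = {("p1", "v1"): [1.0, 0.5], ("p1", "v2"): [0.2, 0.1],
       ("p2", "v1"): [0.3, 0.8], ("p2", "v2"): [0.9, 0.4]}
print(pmc_reward(["p1", "p2"], ["v1", "v2"], phi, theta))
```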

Appendix E Experiments

Synthetic data. We consider the same disjunctive linear cascading bandit setting as in (Vial et al., 2022), where the goal is to choose $K\in\{2i\}_{i=2}^{8}$ out of $m=100$ items to maximize the reward. Notice that the linear cascading bandit problem is a simplified version of the contextual cascading bandit problem where the feature vectors of the base arms are fixed across rounds (see Section D.2 for details). For each $K$, we sample the click probability $\mu_{i}$ of item $i$ uniformly in $[\frac{2}{3K},\frac{1}{K}]$ for $i\leq K$ and in $[0,\frac{1}{3K}]$ for $i>K$. For each $d\in\{2i\}_{i=2}^{8}$, we keep the same $\bm{\mu}$ and compute unit-norm vectors $\theta^{*}$ and $\phi(i)$ satisfying $\mu_{i}=\langle\theta^{*},\phi(i)\rangle$. We compare VAC2-UCB to C3-UCB (Li et al., 2016) and CascadeWOFUL (Vial et al., 2022): C3-UCB is the variance-agnostic cascading bandit algorithm (essentially the same as CascadeLinUCB (Zong et al., 2016) in the linear cascading setting, using the tunable parameter $\sigma=1$) and CascadeWOFUL is the state-of-the-art variance-aware cascading bandit algorithm. As shown in Figure 2, the regret of our VAC2-UCB algorithm has superior dependence on $K$ and $d$ compared with C3-UCB. When $d=K=10$, VAC2-UCB achieves sublinear regret; it incurs $75\%$ and $13\%$ less regret than C3-UCB and CascadeWOFUL, respectively, after $100{,}000$ rounds. Notice that CascadeWOFUL is also variance-aware but specifically designed for cascading bandits, while our algorithm applies to general C2MAB-T.

Figure 2: Results for synthetic data
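The unit-norm feature construction used in the synthetic setup can be sketched as follows (illustrative Python; the embedding shown is one simple way, assumed here for concreteness, to obtain unit-norm $\theta^{*}$ and $\phi(i)$ with $\mu_{i}=\langle\theta^{*},\phi(i)\rangle$; it is not necessarily the exact construction of Vial et al. (2022)).

```python
import math
import random

def sample_click_probs(m, K, rng=random):
    """Sample mu_i in [2/(3K), 1/K] for the first K items and in [0, 1/(3K)] otherwise."""
    return [rng.uniform(2 / (3 * K), 1 / K) if i < K else rng.uniform(0, 1 / (3 * K))
            for i in range(m)]

def embed(mu, d):
    """One simple embedding (assumption, d >= 2): theta* = e_1 and
    phi(i) = (mu_i, sqrt(1 - mu_i^2), 0, ..., 0), so both are unit-norm
    and <theta*, phi(i)> = mu_i exactly."""
    theta = [1.0] + [0.0] * (d - 1)
    phis = [[m, math.sqrt(1.0 - m * m)] + [0.0] * (d - 2) for m in mu]
    return theta, phis

mu = sample_click_probs(m=100, K=8)
theta, phis = embed(mu, d=8)
print(abs(sum(t * p for t, p in zip(theta, phis[0])) - mu[0]) < 1e-12)  # expected: True
```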

Real data. We conduct experiments on the MovieLens-1M dataset, which contains user ratings for $m\approx 4000$ movies. Following the same experimental setup as in (Vial et al., 2022), we set $d=20,K=4$, and the goal is to choose $K$ out of $m$ movies to maximize the reward of the cascading recommendation. We use their learned feature mapping $\phi$ from movies to the probability that a uniformly random user rated the movie more than three stars. We point the reader to Section 6 of (Vial et al., 2022) for more details. In each round, we sample a random user $J_{t}$ and define the potential click result $X_{t,i}=\mathbb{I}\{\text{user $J_{t}$ rated movie $i$ more than 3 stars}\}$. In other words, we observe the actual feedback of user $J_{t}$ instead of using the Bernoulli click model. Figure 1(a) shows that VAC2-UCB outperforms C3-UCB and CascadeWOFUL, incurring $45\%$ and $25\%$ less regret, respectively, after $100{,}000$ rounds. To model platforms like Netflix that recommend movies within specific categories, we also run experiments restricting the candidate items to movies of a particular genre. Figure 1(b) shows that VAC2-UCB is superior to C3-UCB and CascadeWOFUL across all genres.

Appendix F Concentration Bounds, Facts, and Technical Lemmas

In this section, we first give key concentration bounds and then provide lemmas that are useful for the analysis.

F.1 Concentration Bounds

We mainly use the following concentration bound, which is essentially a modification of Freedman's version of Bernstein's inequality (Bernstein, 1946; Freedman, 1975).

Proposition 3 (Theorem 9, Lattimore et al. (2015)).

Let δ(0,1)\delta\in(0,1) and X1,,XnX_{1},...,X_{n} be a sequence of random variables adapted to filtration {t}\{\mathcal{F}_{t}\} with 𝔼[Xtt1]=0\mathbb{E}[X_{t}\mid\mathcal{F}_{t-1}]=0. Let Z[n]Z\subseteq[n] be such that 𝕀{tZ}\mathbb{I}\{t\in Z\} is t1\mathcal{F}_{t-1}-measurable and let RtR_{t} be t1\mathcal{F}_{t-1} measurable such that |Xt|Rt|X_{t}|\leq R_{t} almost surely. Let V=tZVar[Xtt1]+tZRt2/2V=\sum_{t\in Z}{\rm Var}[X_{t}\mid\mathcal{F}_{t-1}]+\sum_{t\notin Z}R_{t}^{2}/2, R=maxtZRtR=\max_{t\in Z}R_{t}, and S=t=1nXtS=\sum_{t=1}^{n}X_{t}. Then Pr[Sf(R,V)]δ\Pr[S\geq f(R,V)]\leq\delta, where f(r,v)=2(r+1)3log2δr,v+2(v+1)log2δr,vf(r,v)=\frac{2(r+1)}{3}\log\frac{2}{\delta_{r,v}}+\sqrt{2(v+1)\log\frac{2}{\delta_{r,v}}}, and δr,v=δ3(r+1)2(v+1)\delta_{r,v}=\frac{\delta}{3(r+1)^{2}(v+1)}.
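To make the shape of this bound concrete, the following small Python helper (illustration only, with placeholder inputs) evaluates the threshold $f(R,V)$ of Proposition 3 for given $\delta$, $R$, and $V$.

```python
import math

def freedman_threshold(delta, r, v):
    """f(r, v) from Proposition 3 with delta_{r,v} = delta / (3 (r+1)^2 (v+1))."""
    delta_rv = delta / (3 * (r + 1) ** 2 * (v + 1))
    log_term = math.log(2 / delta_rv)
    return 2 * (r + 1) / 3 * log_term + math.sqrt(2 * (v + 1) * log_term)

# Placeholder values: failure probability 0.05, range R = 1, predictable variance V = 50.
print(round(freedman_threshold(delta=0.05, r=1.0, v=50.0), 3))
```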

F.2 Facts

Fact 1.

For any positive-definite matrices $\bm{A},\bm{B}\succ\bm{0}$ and any vectors $\bm{x},\bm{y}\in\mathbb{R}^{d}$, it holds that:

  1. If $\bm{A}\leq\bm{B}$, then $\bm{A}^{-1}\geq\bm{B}^{-1}$.

  2. If $\bm{A}\leq\bm{B}$, then $\left\lVert\bm{x}\right\rVert_{\bm{A}}\leq\left\lVert\bm{x}\right\rVert_{\bm{B}}$.

  3. Suppose $\bm{A}$ has maximum eigenvalue $\lambda_{\max}$; then $\left\lVert\bm{A}\bm{x}\right\rVert_{2}\leq\lambda_{\max}\cdot\left\lVert\bm{x}\right\rVert_{2}$ and $\lambda_{\max}\leq\mathrm{trace}(\bm{A})$.

F.3 Technical Lemmas

Recall that the event $F_{t}$ is defined in Equation 21, the Gram matrix $\bm{G}_{t}$ in Equation 18, the optimistic variance $\bar{V}_{t,i}$ in Equation 6, and $R_{\bm{v}}$ in Equation 131.

Lemma 7.

$\Pr\left[\left\lVert\bm{Z}_{t}\right\rVert_{\bm{G}_{t}^{-1}}+\sqrt{\gamma}\geq\rho\text{ and }\neg F_{t-1}\right]\leq\delta/T$, for $t=1,...,T$.

Proof of Lemma 7.

Let 𝒗d\bm{v}\in\mathbb{R}^{d} and define

Vs,i,𝒗={Var[ηs,is1]ϕs(i),𝒗2/V¯s,i2,if V¯s,i<14ϕs(i),𝒗2/V¯s,i,otherwise.V_{s,i,\bm{v}}=\begin{cases}{\rm Var}[\eta_{s,i}\mid\mathcal{F}_{s-1}]\langle\bm{\phi}_{s}(i),\bm{v}\rangle^{2}/\bar{V}_{s,i}^{2},&\text{if $\bar{V}_{s,i}<\frac{1}{4}$}\\ \langle\bm{\phi}_{s}(i),\bm{v}\rangle^{2}/\bar{V}_{s,i},&\text{otherwise}.\end{cases} (130)
R𝒗=maxs<t,iτs{ϕs(i),𝒗/V¯s,i:V¯s,i<14}R_{\bm{v}}=\max_{s<t,i\in\tau_{s}}\{\langle\bm{\phi}_{s}(i),\bm{v}\rangle/\bar{V}_{s,i}:\bar{V}_{s,i}<\frac{1}{4}\} (131)

By applying Proposition 3, with probability at least $1-\delta/T$ it holds that

Zt,𝒗=s<tiτsηs,iϕs(i),𝒗/V¯s,i2(R𝒗+1)3log1δ𝒗+2(1+s<tiτsVs,i,𝒗)log1δ𝒗\displaystyle\langle Z_{t},\bm{v}\rangle=\sum_{s<t}\sum_{i\in\tau_{s}}\eta_{s,i}\langle\bm{\phi}_{s}(i),\bm{v}\rangle/\bar{V}_{s,i}\leq\frac{2(R_{\bm{v}}+1)}{3}\log\frac{1}{\delta_{\bm{v}}}+\sqrt{2(1+\sum_{s<t}\sum_{i\in\tau_{s}}V_{s,i,\bm{v}})\log\frac{1}{\delta_{\bm{v}}}} (132)

where δ𝒗=3δT(1+R𝒗)2(1+s<tiτsVs,i,𝒗)2\delta_{\bm{v}}=\frac{3\delta}{T(1+R_{\bm{v}})^{2}(1+\sum_{s<t}\sum_{i\in\tau_{s}}V_{s,i,\bm{v}})^{2}}.

Since $\bm{v}$ could be a random variable in later proofs, we use the covering argument (Chapter 20, Lattimore & Szepesvári (2020)) to handle $\bm{v}$. Specifically, we define the covering set $\Lambda=\{j\cdot\varepsilon:j=-\frac{C}{\varepsilon},-\frac{C}{\varepsilon}+1,...,\frac{C}{\varepsilon}-1,\frac{C}{\varepsilon}\}^{d}$ with size $N=|\Lambda|=(2C/\varepsilon)^{d}$, where the parameters $C$ and $\varepsilon$ will be determined shortly. By applying a union bound to Equation 132, we have with probability at least $1-\delta$ that

Zt,𝒗2(R𝒗+1)3logNδ𝒗+2(1+s<tiτsVs,i,𝒗)logNδ𝒗 for all 𝒗Λ.\displaystyle\langle Z_{t},\bm{v}\rangle\leq\frac{2(R_{\bm{v}}+1)}{3}\log\frac{N}{\delta_{\bm{v}}}+\sqrt{2(1+\sum_{s<t}\sum_{i\in\tau_{s}}V_{s,i,\bm{v}})\log\frac{N}{\delta_{\bm{v}}}}\text{ for all }\bm{v}\in\Lambda. (133)

Now we can set $\bm{v}=\bm{G}_{t}^{-1}\bm{Z}_{t}$, and it follows from Lemma 13 that $\left\lVert\bm{v}\right\rVert_{\infty}\leq\left\lVert\bm{Z}_{t}\right\rVert_{1}\leq 2dK^{2}t^{2}=C$. Based on our construction of the covering set $\Lambda$, there exists $\bm{v}^{\prime}\in\Lambda$ with $\bm{v}^{\prime}\leq\bm{v}$ and $\left\lVert\bm{v}^{\prime}-\bm{v}\right\rVert_{\infty}\leq\varepsilon$, such that

𝒁t𝑮t12\displaystyle\left\lVert\bm{Z}_{t}\right\rVert^{2}_{\bm{G}_{t}^{-1}} =𝒁t,𝒗𝒁t1ε+𝒁t,𝒗\displaystyle=\langle\bm{Z}_{t},\bm{v}\rangle\leq\left\lVert\bm{Z}_{t}\right\rVert_{1}\varepsilon+\langle\bm{Z}_{t},\bm{v}^{\prime}\rangle (134)
𝒁t1ε+2(R𝒗+1)3logNδ𝒗+2(1+s<tiτsVs,i,𝒗)logNδ𝒗\displaystyle\leq\left\lVert\bm{Z}_{t}\right\rVert_{1}\varepsilon+\frac{2(R_{\bm{v}}+1)}{3}\log\frac{N}{\delta_{\bm{v}}}+\sqrt{2(1+\sum_{s<t}\sum_{i\in\tau_{s}}V_{s,i,\bm{v}})\log\frac{N}{\delta_{\bm{v}}}} (135)
𝒁t1ε+2(R𝒗+1)3logNδ𝒗+2(1+𝒁t𝑮t12)logNδ𝒗\displaystyle\leq\left\lVert\bm{Z}_{t}\right\rVert_{1}\varepsilon+\frac{2(R_{\bm{v}}+1)}{3}\log\frac{N}{\delta_{\bm{v}}}+\sqrt{2(1+\left\lVert\bm{Z}_{t}\right\rVert_{\bm{G}_{t}^{-1}}^{2})\log\frac{N}{\delta_{\bm{v}}}} (136)

where Equation 135 uses the facts that $R_{\bm{v}^{\prime}}\leq R_{\bm{v}}$, $V_{s,i,\bm{v}^{\prime}}\leq V_{s,i,\bm{v}}$, and $\frac{1}{\delta_{\bm{v}^{\prime}}}\leq\frac{1}{\delta_{\bm{v}}}$ for any $\bm{v}^{\prime}\leq\bm{v}$, and Equation 136 follows from the derivation below,

s<tiτsVs,i,𝒗\displaystyle\sum_{s<t}\sum_{i\in\tau_{s}}V_{s,i,\bm{v}} s<tiτsϕs(i),𝒗2/V¯s,i\displaystyle\leq\sum_{s<t}\sum_{i\in\tau_{s}}\langle\bm{\phi}_{s}(i),\bm{v}\rangle^{2}/\bar{V}_{s,i} (137)
=s<tiτs(𝑮t1𝒁t)ϕs(i)ϕs(i)𝑮t1𝒁t/V¯s,i\displaystyle=\sum_{s<t}\sum_{i\in\tau_{s}}(\bm{G}_{t}^{-1}\bm{Z}_{t})^{\top}\bm{\phi}_{s}(i)\bm{\phi}_{s}(i)^{\top}\bm{G}_{t}^{-1}\bm{Z}_{t}/\bar{V}_{s,i} (138)
=(𝑮t1𝒁t)(s<tiτsϕs(i)ϕs(i)/V¯s,i)𝑮t1𝒁t\displaystyle=(\bm{G}_{t}^{-1}\bm{Z}_{t})^{\top}\left(\sum_{s<t}\sum_{i\in\tau_{s}}\bm{\phi}_{s}(i)\bm{\phi}_{s}(i)^{\top}/\bar{V}_{s,i}\right)\bm{G}_{t}^{-1}\bm{Z}_{t} (139)
(𝑮t1𝒁t)𝑮t(𝑮t1𝒁t)\displaystyle\leq(\bm{G}_{t}^{-1}\bm{Z}_{t})^{\top}\bm{G}_{t}(\bm{G}_{t}^{-1}\bm{Z}_{t}) (140)
=𝒁t𝑮t12,\displaystyle=\left\lVert\bm{Z}_{t}\right\rVert_{\bm{G}_{t}^{-1}}^{2}, (141)

where Equation 137 follows from the fact that $\neg F_{s-1}$ implies $\left\lVert\bm{\theta}^{*}-\hat{\bm{\theta}}_{s}\right\rVert_{\bm{G}_{s}}\leq\rho$ for $s<t$ by Lemma 8, and thus $\bar{V}_{s,i}\geq{\rm Var}[\eta_{s,i}\mid\mathcal{F}_{s-1}]$; Equation 138 follows from the definition of $\bm{v}$; and Equation 140 follows from $\sum_{s<t}\sum_{i\in\tau_{s}}\bm{\phi}_{s}(i)\bm{\phi}_{s}(i)^{\top}/\bar{V}_{s,i}\preceq\bm{G}_{t}$.

Now, setting $\varepsilon=1/C=1/(2dK^{2}t^{2})$, we have

𝒁t𝑮t12\displaystyle\left\lVert\bm{Z}_{t}\right\rVert^{2}_{\bm{G}_{t}^{-1}} RHS of Equation 136\displaystyle\leq\text{RHS of \lx@cref{creftypecap~refnum}{apdx_eq:Zt_Gt_2}} (142)
Cε+2(2𝒁t𝑮t1/ρ+1)3logNδ𝒗+2(1+𝒁t𝑮t12)logNδ𝒗\displaystyle\leq C\varepsilon+\frac{2(2\left\lVert\bm{Z}_{t}\right\rVert_{\bm{G}_{t}^{-1}}/\rho+1)}{3}\log\frac{N}{\delta_{\bm{v}}}+\sqrt{2(1+\left\lVert\bm{Z}_{t}\right\rVert_{\bm{G}_{t}^{-1}}^{2})\log\frac{N}{\delta_{\bm{v}}}} (143)
1+2logNδ𝒗+2(1+𝒁t𝑮t12)logNδ𝒗\displaystyle\leq 1+2\log\frac{N}{\delta_{\bm{v}}}+\sqrt{2(1+\left\lVert\bm{Z}_{t}\right\rVert_{\bm{G}_{t}^{-1}}^{2})\log\frac{N}{\delta_{\bm{v}}}} (144)

where Equation 143 bounds $R_{\bm{v}}$ using Lemma 10, and Equation 144 uses $C\varepsilon=1$ together with the definition of $\rho$ as an upper bound of $\left\lVert\bm{Z}_{t}\right\rVert_{\bm{G}_{t}^{-1}}$.

By rearranging and simplifying Equation 144, we have

𝒁t𝑮t1+γ\displaystyle\left\lVert\bm{Z}_{t}\right\rVert_{\bm{G}_{t}^{-1}}+\sqrt{\gamma} 1+γ+4logNδ𝒗\displaystyle\leq 1+\sqrt{\gamma}+4\sqrt{\log\frac{N}{\delta_{\bm{v}}}} (145)
1+γ+4log(6TNδ(1+𝒁t𝑮t12)),\displaystyle\leq 1+\sqrt{\gamma}+4\sqrt{\log\left(\frac{6TN}{\delta}(1+\left\lVert\bm{Z}_{t}\right\rVert_{\bm{G}_{t}^{-1}}^{2})\right)}, (146)

where the last inequality holds because $\delta_{\bm{v}}\geq\frac{3\delta}{T(1+\left\lVert\bm{Z}_{t}\right\rVert_{\bm{G}_{t}^{-1}}^{2})}$ by the definition of $\delta_{\bm{v}}$, Lemma 10, and Equation 141. Finally, we solve the above inequality and set $\rho=1+\sqrt{\gamma}+4\sqrt{\log\left(\frac{6TN}{\delta}\log(\frac{3TN}{\delta})\right)}$, which completes the induction on $t$ and shows that $\Pr[\left\lVert\bm{Z}_{t}\right\rVert_{\bm{G}_{t}^{-1}}+\sqrt{\gamma}\geq\rho\text{ and }\neg F_{t-1}]\leq\delta/T$. ∎

Lemma 8.

If $\neg F_{t}$ holds, then

𝜽𝜽^t𝑮tρ.\displaystyle\left\lVert\bm{\theta}^{*}-\hat{\bm{\theta}}_{t}\right\rVert_{\bm{G}_{t}}\leq\rho. (147)
Proof.

We have

𝜽𝜽^t𝑮t\displaystyle\left\lVert\bm{\theta}^{*}-\hat{\bm{\theta}}_{t}\right\rVert_{\bm{G}_{t}} =𝑮t1(s<tiτsϕs(i)Xs,i/V¯s,i)𝑮t1𝑮t𝜽𝑮t\displaystyle=\left\lVert\bm{G}_{t}^{-1}(\sum_{s<t}\sum_{i\in\tau_{s}}\bm{\phi}_{s}(i)X_{s,i}/\bar{V}_{s,i})-\bm{G}_{t}^{-1}\bm{G}_{t}\bm{\theta}^{*}\right\rVert_{\bm{G}_{t}} (148)
=𝑮t1𝒁t+𝑮t1(s<tiτsϕs(i)ϕs(i)𝜽/V¯s,i)𝑮t1𝑮t𝜽𝑮t\displaystyle=\left\lVert\bm{G}_{t}^{-1}\bm{Z}_{t}+\bm{G}_{t}^{-1}(\sum_{s<t}\sum_{i\in\tau_{s}}\bm{\phi}_{s}(i)\bm{\phi}_{s}(i)^{\top}\bm{\theta}^{*}/\bar{V}_{s,i})-\bm{G}_{t}^{-1}\bm{G}_{t}\bm{\theta}^{*}\right\rVert_{\bm{G}_{t}} (149)
=𝑮t1𝒁tγ𝑮t1𝜽𝑮t\displaystyle=\left\lVert\bm{G}_{t}^{-1}\bm{Z}_{t}-\gamma\bm{G}_{t}^{-1}\bm{\theta}^{*}\right\rVert_{\bm{G}_{t}} (150)
𝒁t𝑮t1+γ𝜽𝑮t1\displaystyle\leq\left\lVert\bm{Z}_{t}\right\rVert_{\bm{G}_{t}^{-1}}+\gamma\left\lVert\bm{\theta}^{*}\right\rVert_{\bm{G}_{t}^{-1}} (151)
𝒁t𝑮t1+γ\displaystyle\leq\left\lVert\bm{Z}_{t}\right\rVert_{\bm{G}_{t}^{-1}}+\sqrt{\gamma} (152)
ργ+γ=ρ,\displaystyle\leq\rho-\sqrt{\gamma}+\sqrt{\gamma}=\rho, (153)

where Equations (148)-(150) follow from the definitions and direct calculation, Equation 151 follows from the triangle inequality, Equation 152 follows from $\bm{G}_{t}\geq\bm{G}_{0}=\gamma\bm{I}$ and $\left\lVert\bm{\theta}^{*}\right\rVert_{2}\leq 1$, and the last step follows from the fact that if $\neg F_{t}$ holds, then $\left\lVert\bm{Z}_{t}\right\rVert_{\bm{G}_{t}^{-1}}+\sqrt{\gamma}\leq\rho$. ∎

Lemma 9.

For any s<ts<t, ϕs(i)𝐆t1V¯s,iϕs(i)𝐆s1V¯s,i\frac{\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{t}^{-1}}}{\bar{V}_{s,i}}\leq\frac{\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}}{\bar{V}_{s,i}}, and if ¬Ft1\neg F_{t-1} holds and V¯s,i<14\bar{V}_{s,i}<\frac{1}{4}, ϕs(i)𝐆s1V¯s,i2ρ1\frac{\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}}{\bar{V}_{s,i}}\leq\frac{2}{\rho}\leq 1 for any i[m]i\in[m].

Proof.

The first inequality is by $\bm{G}_{t}\geq\bm{G}_{s}$ and Fact 1. For the second inequality, when $\neg F_{t-1}$ holds, $\left\lVert\bm{\theta}^{*}-\hat{\bm{\theta}}_{s}\right\rVert_{\bm{G}_{s}}\leq\rho$ as in Equation 147, and since $\bar{V}_{s,i}<\frac{1}{4}$, it follows from the definition of $\bar{V}_{s,i}$ in Equation 6 that at least one of the following is true:

V¯s,i\displaystyle\bar{V}_{s,i} 12(ϕs(i),𝜽^s+2ρϕs(i)𝑮s1)ρϕs(i)𝑮s1/2,\displaystyle\geq\frac{1}{2}(\langle\bm{\phi}_{s}(i),\hat{\bm{\theta}}_{s}+2\rho\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}\rangle)\geq\rho\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}/2, (154)
V¯s,i\displaystyle\bar{V}_{s,i} 12(1ϕs(i),𝜽^s+2ρϕs(i)𝑮s1)ρϕs(i)𝑮s1/2,\displaystyle\geq\frac{1}{2}(1-\langle\bm{\phi}_{s}(i),\hat{\bm{\theta}}_{s}+2\rho\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}\rangle)\geq\rho\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{s}^{-1}}/2, (155)

which proves the second inequality. ∎

Lemma 10.

Let $\bm{v}=\bm{G}_{t}^{-1}\bm{Z}_{t}$; if $\neg F_{t-1}$ holds, then $R_{\bm{v}}\leq\frac{2{\left\lVert\bm{Z}_{t}\right\rVert_{\bm{G}_{t}^{-1}}}}{\rho}.$

Proof.

For all s<ts<t and i[m]i\in[m], we have ϕs(i),𝒗/V¯s,i2ϕs(i),𝒗ϕs(i)𝑮t1ρ=2ϕs(i),𝑮t1𝒁tϕs(i)𝑮t1ρ2ϕs(i)𝑮t1𝑮t1𝒁t𝑮tϕs(i)𝑮t1ρ=2𝒁t𝑮t1ρ\langle\bm{\phi}_{s}(i),\bm{v}\rangle/\bar{V}_{s,i}\leq\frac{2\langle\bm{\phi}_{s}(i),\bm{v}\rangle}{\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{t}^{-1}}\cdot\rho}=\frac{2\langle\bm{\phi}_{s}(i),\bm{G}_{t}^{-1}\bm{Z}_{t}\rangle}{\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{t}^{-1}}\cdot\rho}\leq\frac{2\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{t}^{-1}}\cdot\left\lVert\bm{G}_{t}^{-1}\bm{Z}_{t}\right\rVert_{\bm{G}_{t}}}{\left\lVert\bm{\phi}_{s}(i)\right\rVert_{\bm{G}_{t}^{-1}}\cdot\rho}=\frac{2\left\lVert\bm{Z}_{t}\right\rVert_{\bm{G}_{t}^{-1}}}{\rho}, where the first inequality follows from Lemma 9, the last inequality follows from the Cauchy-Schwarz inequality. ∎

Lemma 11.

If ¬Ft\neg F_{t}, then ϕt(i)22/V¯t,i4dKt\left\lVert\bm{\phi}_{t}(i)\right\rVert_{2}^{2}/\bar{V}_{t,i}\leq 4dKt.

Proof.

If $\bar{V}_{t,i}=\frac{1}{4}$, the inequality trivially holds since $\left\lVert\bm{\phi}_{t}(i)\right\rVert\leq 1$. Consider $\bar{V}_{t,i}<\frac{1}{4}$ and let $\lambda_{\max}$ be the maximum eigenvalue of $\bm{G}_{t}$. Then it holds that $\left\lVert\bm{\phi}_{t}(i)\right\rVert_{2}^{2}/\bar{V}_{t,i}\leq\frac{\left\lVert\bm{\phi}_{t}(i)\right\rVert_{2}^{2}}{\rho\left\lVert\bm{\phi}_{t}(i)\right\rVert_{\bm{G}_{t}^{-1}}}\leq\frac{\left\lVert\bm{\phi}_{t}(i)\right\rVert_{2}}{\left\lVert\bm{\phi}_{t}(i)\right\rVert_{\bm{G}_{t}^{-1}}}=\frac{\left\lVert\bm{G}_{t}^{1/2}\bm{G}_{t}^{-1/2}\bm{\phi}_{t}(i)\right\rVert_{2}}{\left\lVert\bm{\phi}_{t}(i)\right\rVert_{\bm{G}_{t}^{-1}}}\leq\sqrt{\lambda_{\max}}$, where the first inequality follows from Lemma 9, the second inequality is by $\rho\geq 1$ and $\left\lVert\bm{\phi}_{t}(i)\right\rVert\leq 1$, and the last is by Fact 1(3).

Now assume $\left\lVert\bm{\phi}_{s}(i)\right\rVert_{2}^{2}/\bar{V}_{s,i}\leq 4dKs$ for all $s<t$, which trivially holds for $t=1$. By induction, at round $t$ it holds that $\left\lVert\bm{\phi}_{t}(i)\right\rVert_{2}^{2}/\bar{V}_{t,i}\leq\sqrt{\lambda_{\max}}\leq\sqrt{\mathrm{trace}(\bm{G}_{t})}=\sqrt{\gamma d+\sum_{s=1}^{t-1}\sum_{i\in\tau_{s}}\left\lVert\bm{\phi}_{s}(i)\right\rVert^{2}_{2}/\bar{V}_{s,i}}\leq\sqrt{Kd+\sum_{s=1}^{t-1}4dK^{2}s}\leq\sqrt{d(K+2K^{2}t(t-1))}\leq 4dKt$, where the first inequality follows from the analysis in the last paragraph, the third inequality follows from the induction hypothesis over $s<t$, and the last inequality is by direct calculation. ∎

Lemma 12.

If ¬Ft\neg F_{t}, then ϕt(i)1/V¯t,i4dKt\left\lVert\bm{\phi}_{t}(i)\right\rVert_{1}/\bar{V}_{t,i}\leq 4dKt.

Proof.

Similar to the proof of Lemma 11, $\left\lVert\bm{\phi}_{t}(i)\right\rVert_{1}/\bar{V}_{t,i}\leq\sqrt{d}\left\lVert\bm{\phi}_{t}(i)\right\rVert_{2}/\bar{V}_{t,i}\leq\frac{\sqrt{d}\left\lVert\bm{\phi}_{t}(i)\right\rVert_{2}}{\rho\left\lVert\bm{\phi}_{t}(i)\right\rVert_{\bm{G}_{t}^{-1}}}\leq\frac{\left\lVert\bm{\phi}_{t}(i)\right\rVert_{2}}{\left\lVert\bm{\phi}_{t}(i)\right\rVert_{\bm{G}_{t}^{-1}}}=\frac{\left\lVert\bm{G}_{t}^{1/2}\bm{G}_{t}^{-1/2}\bm{\phi}_{t}(i)\right\rVert_{2}}{\left\lVert\bm{\phi}_{t}(i)\right\rVert_{\bm{G}_{t}^{-1}}}\leq\sqrt{\lambda_{\max}}\leq 4dKt$, where the first inequality uses $\left\lVert\bm{x}\right\rVert_{1}\leq\sqrt{d}\left\lVert\bm{x}\right\rVert_{2}$ (Cauchy-Schwarz), the second follows from Lemma 9, the third uses $\rho\geq\sqrt{d}$, and the rest follows from the proof of Lemma 11. ∎

Lemma 13.

If ¬Ft1\neg F_{t-1}, then 𝐙t12dK2t2\left\lVert\bm{Z}_{t}\right\rVert_{1}\leq 2dK^{2}t^{2}.

Proof.

$\left\lVert\bm{Z}_{t}\right\rVert_{1}=\left\lVert\sum_{s<t}\sum_{i\in\tau_{s}}\eta_{s,i}\bm{\phi}_{s}(i)/\bar{V}_{s,i}\right\rVert_{1}\leq\sum_{s<t}\sum_{i\in\tau_{s}}\left\lVert\bm{\phi}_{s}(i)/\bar{V}_{s,i}\right\rVert_{1}\leq\sum_{s<t}\sum_{i\in\tau_{s}}4dKs\leq 2dK^{2}t^{2}$, where the first inequality follows from $\eta_{s,i}\in[-1,1]$ and the triangle inequality, and the second inequality follows from Lemma 12. ∎

Lemma 14 (Lemma A.3, (Li et al., 2016)).

Let 𝐱id\bm{x}_{i}\in\mathbb{R}^{d}, 1in1\leq i\leq n. Then we have

$\displaystyle\det\left(\bm{I}+\sum_{i=1}^{n}{\bm{x}_{i}\bm{x}_{i}^{\top}}\right)\geq 1+\sum_{i=1}^{n}\left\lVert\bm{x}_{i}\right\rVert_{2}^{2}.$
Lemma 15 (Lemma 11, (Abbasi-Yadkori et al., 2011)).

Let 𝐱id\bm{x}_{i}\in\mathbb{R}^{d} with 𝐱i2L\left\lVert\bm{x}_{i}\right\rVert_{2}\leq L, 1in1\leq i\leq n and let 𝐆t=γ𝐈+i=1t1𝐱i𝐱i\bm{G}_{t}=\gamma\bm{I}+\sum_{i=1}^{t-1}\bm{x}_{i}\bm{x}_{i}^{\top}, then

det(𝑮t+1)(γ+tL2/d)d.\displaystyle\det(\bm{G}_{t+1})\leq(\gamma+tL^{2}/d)^{d}.
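As a quick numerical sanity check (illustration only, using random vectors generated with NumPy), the following snippet verifies the determinant inequalities of Lemmas 14 and 15 on a toy instance.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, gamma, L = 4, 10, 1.0, 1.0
xs = rng.uniform(-1, 1, size=(n, d))
# Rescale so that every ||x_i||_2 <= L, as required by Lemma 15.
xs = xs / np.maximum(np.linalg.norm(xs, axis=1, keepdims=True), 1.0) * L

# Lemma 14: det(I + sum_i x_i x_i^T) >= 1 + sum_i ||x_i||_2^2.
lhs14 = np.linalg.det(np.eye(d) + xs.T @ xs)
rhs14 = 1 + np.sum(np.linalg.norm(xs, axis=1) ** 2)

# Lemma 15: det(G_{t+1}) <= (gamma + t L^2 / d)^d with G_{t+1} = gamma I + sum_i x_i x_i^T.
G = gamma * np.eye(d) + xs.T @ xs
lhs15 = np.linalg.det(G)
rhs15 = (gamma + n * L**2 / d) ** d

print(lhs14 >= rhs14, lhs15 <= rhs15)  # expected: True True
```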
Lemma 16.

Equation 82 holds.

Proof.

When Tt1,i>Li,T,2=8c12Bv2KlogT(Δ~imin)2T_{t-1,i}>L_{i,T,2}=\frac{8c_{1}^{2}B_{v}^{2}K\log T}{(\tilde{\Delta}_{i}^{\min})^{2}},

we have $(82,i)\leq\frac{8c_{1}^{2}B_{v}^{2}\log T}{T_{t-1,i}\cdot\tilde{\Delta}_{S_{t}}}-\frac{\tilde{\Delta}_{S_{t}}}{K}<\frac{(\tilde{\Delta}_{i}^{\min})^{2}}{K\tilde{\Delta}_{S_{t}}}-\frac{\tilde{\Delta}_{S_{t}}}{K}\leq 0=\kappa_{i,T}(T_{t-1,i})$.

When Li,T,1<Tt1,iLi,T,2L_{i,T,1}<T_{t-1,i}\leq L_{i,T,2},

We have $(82,i)\leq\frac{8c_{1}^{2}B_{v}^{2}\log T}{T_{t-1,i}\cdot\tilde{\Delta}_{S_{t}}}-\frac{\tilde{\Delta}_{S_{t}}}{K}<\frac{8c_{1}^{2}B_{v}^{2}\log T}{T_{t-1,i}\cdot\tilde{\Delta}_{S_{t}}}\leq\frac{8c_{1}^{2}B_{v}^{2}\log T}{T_{t-1,i}\cdot\tilde{\Delta}_{i}^{\min}}=\kappa_{i,T}(T_{t-1,i})$.

When Tt1,iLi,T,1T_{t-1,i}\leq L_{i,T,1},

We further consider two different cases Tt1,i4c12Bv2logT(Δ~St)2T_{t-1,i}\leq\frac{4c_{1}^{2}B_{v}^{2}\log T}{(\tilde{\Delta}_{S_{t}})^{2}} or 4c12Bv2logT(Δ~St)2<Tt1,iLi,T,1=4c12Bv2logT(Δ~imin)2\frac{4c_{1}^{2}B_{v}^{2}\log T}{(\tilde{\Delta}_{S_{t}})^{2}}<T_{t-1,i}\leq L_{i,T,1}=\frac{4c_{1}^{2}B_{v}^{2}\log T}{(\tilde{\Delta}_{i}^{\min})^{2}}.

For the former case, if there exists $i\in\tau_{t}$ such that $T_{t-1,i}\leq\frac{4c_{1}^{2}B_{v}^{2}\log T}{(\tilde{\Delta}_{S_{t}})^{2}}$, then $\sum_{q\in\tilde{S}_{t}}\kappa_{q,T}(T_{t-1,q})\geq\kappa_{i,T}(T_{t-1,i})=2\sqrt{\frac{4c_{1}^{2}B_{v}^{2}\log T}{T_{t-1,i}}}\geq 2\tilde{\Delta}_{S_{t}}>\Delta_{S_{t}}$, so Equation 82 holds regardless and this case requires no further consideration.

For the latter case, when $\frac{4c_{1}^{2}B_{v}^{2}\log T}{(\tilde{\Delta}_{S_{t}})^{2}}<T_{t-1,i}$, we know that $(82,i)\leq\frac{8c_{1}^{2}B_{v}^{2}\log T}{\tilde{\Delta}_{S_{t}}}\frac{1}{T_{t-1,i}}=2\sqrt{\frac{4c_{1}^{2}B_{v}^{2}\log T}{(\tilde{\Delta}_{S_{t}})^{2}}\frac{1}{T_{t-1,i}}}\sqrt{\frac{4c_{1}^{2}B_{v}^{2}\log T}{T_{t-1,i}}}\leq 2\sqrt{\frac{4c_{1}^{2}B_{v}^{2}\log T}{T_{t-1,i}}}=\kappa_{i,T}(T_{t-1,i})$.

When $T_{t-1,i}=0$,

We have (82,i)8c12Bv2Δ~St128Δ~StKc12Bv2Δ~Stc12Bv2Δ~imin=κi,T(Tt1,i)(\ref{apdx_eq:kappa_tpvm},i)\leq\frac{8c_{1}^{2}B_{v}^{2}}{\tilde{\Delta}_{S_{t}}}\cdot\frac{1}{28}-\frac{\tilde{\Delta}_{S_{t}}}{K}\leq\frac{c_{1}^{2}B_{v}^{2}}{\tilde{\Delta}_{S_{t}}}\leq\frac{c_{1}^{2}B_{v}^{2}}{\tilde{\Delta}_{i}^{\min}}=\kappa_{i,T}(T_{t-1,i}).

Combining all of the above cases, we have $\Delta_{S_{t}}\leq\mathbb{E}_{t}[\sum_{i\in\tau_{t}}\kappa_{i,T}(T_{t-1,i})]$. ∎

Lemma 17.

Equation 98 holds.

Proof.

When Tt1,i>Li,T,2=8c12Bv2KlogTΔ~iminΔ~i,λminT_{t-1,i}>L_{i,T,2}=\frac{8c_{1}^{2}B_{v}^{2}K\log T}{\tilde{\Delta}_{i}^{\min}\cdot\tilde{\Delta}_{i,\lambda}^{\min}},

we have $(98,i)\leq\frac{8c_{1}^{2}B_{v}^{2}\log T}{T_{t-1,i}\cdot\tilde{\Delta}_{S_{t},\lambda}}-\frac{\tilde{\Delta}_{S_{t}}}{K}<\frac{\tilde{\Delta}_{i}^{\min}\cdot\tilde{\Delta}_{i,\lambda}^{\min}}{K\tilde{\Delta}_{S_{t},\lambda}}-\frac{\tilde{\Delta}_{S_{t}}}{K}\leq 0=\kappa_{i,T}(T_{t-1,i})$.

When Li,T,1<Tt1,iLi,T,2L_{i,T,1}<T_{t-1,i}\leq L_{i,T,2},

We have $(98,i)\leq\frac{8c_{1}^{2}B_{v}^{2}\log T}{T_{t-1,i}\cdot\tilde{\Delta}_{S_{t},\lambda}}-\frac{\tilde{\Delta}_{S_{t}}}{K}<\frac{8c_{1}^{2}B_{v}^{2}\log T}{T_{t-1,i}\cdot\tilde{\Delta}_{S_{t},\lambda}}\leq\frac{8c_{1}^{2}B_{v}^{2}\log T}{T_{t-1,i}\cdot\tilde{\Delta}_{i,\lambda}^{\min}}=\kappa_{i,T}(T_{t-1,i})$.

When Tt1,iLi,T,1T_{t-1,i}\leq L_{i,T,1},

We further consider two different cases Tt1,i4c12Bv2logTΔ~St,λΔ~StT_{t-1,i}\leq\frac{4c_{1}^{2}B_{v}^{2}\log T}{\tilde{\Delta}_{S_{t},\lambda}\cdot\tilde{\Delta}_{S_{t}}} or 4c12Bv2logTΔ~St,λΔ~St<Tt1,iLi,T,1=4c12Bv2logTΔ~i,λminΔ~imin\frac{4c_{1}^{2}B_{v}^{2}\log T}{\tilde{\Delta}_{S_{t},\lambda}\cdot\tilde{\Delta}_{S_{t}}}<T_{t-1,i}\leq L_{i,T,1}=\frac{4c_{1}^{2}B_{v}^{2}\log T}{\tilde{\Delta}_{i,\lambda}^{\min}\cdot\tilde{\Delta}_{i}^{\min}}.

For the former case, if there exists $i\in\tau_{t}$ such that $T_{t-1,i}\leq\frac{4c_{1}^{2}B_{v}^{2}\log T}{\tilde{\Delta}_{S_{t},\lambda}\cdot\tilde{\Delta}_{S_{t}}}$, then $\sum_{q\in\tilde{S}_{t}}\kappa_{q,T}(T_{t-1,q})\geq\kappa_{i,T}(T_{t-1,i})=2\sqrt{\frac{4c_{1}^{2}B_{v}^{2}\log T}{T_{t-1,i}}}\geq 2\sqrt{\tilde{\Delta}_{S_{t},\lambda}\cdot\tilde{\Delta}_{S_{t}}}\geq\Delta_{S_{t}}$, so Equation 98 holds regardless and this case requires no further consideration.

For the latter case, when $\frac{4c_{1}^{2}B_{v}^{2}\log T}{\tilde{\Delta}_{S_{t},\lambda}\cdot\tilde{\Delta}_{S_{t}}}<T_{t-1,i}$, we know that $(98,i)\leq\frac{8c_{1}^{2}B_{v}^{2}\log T}{\tilde{\Delta}_{S_{t},\lambda}}\frac{1}{T_{t-1,i}}=2\sqrt{\frac{4c_{1}^{2}B_{v}^{2}\log T}{(\tilde{\Delta}_{S_{t},\lambda})^{2}}\frac{1}{T_{t-1,i}}}\sqrt{\frac{4c_{1}^{2}B_{v}^{2}\log T}{T_{t-1,i}}}\leq 2\sqrt{\frac{\tilde{\Delta}_{S_{t}}}{\tilde{\Delta}_{S_{t},\lambda}}}\sqrt{\frac{4c_{1}^{2}B_{v}^{2}\log T}{T_{t-1,i}}}\leq 2\sqrt{\frac{4c_{1}^{2}B_{v}^{2}\log T}{T_{t-1,i}}}=\kappa_{i,T}(T_{t-1,i})$.

When =0\ell=0,

We have (98,i)8c12Bv2Δ~St,λ128Δ~StKc12Bv2Δ~St,λc12Bv2Δ~i,λmin=κi,T(Tt1,i)(\ref{apdx_eq:kappa_tpvm_general},i)\leq\frac{8c_{1}^{2}B_{v}^{2}}{\tilde{\Delta}_{S_{t},\lambda}}\cdot\frac{1}{28}-\frac{\tilde{\Delta}_{S_{t}}}{K}\leq\frac{c_{1}^{2}B_{v}^{2}}{\tilde{\Delta}_{S_{t},\lambda}}\leq\frac{c_{1}^{2}B_{v}^{2}}{\tilde{\Delta}_{i,\lambda}^{\min}}=\kappa_{i,T}(T_{t-1,i}).

Combining all of the above cases, we have $\Delta_{S_{t}}\leq\mathbb{E}_{t}[\sum_{i\in\tau_{t}}\kappa_{i,T}(T_{t-1,i})]$. ∎

Lemma 18.

Equation 109 holds.

Proof.

When Tt1,i>Li,T,2=4c2B1KlogTΔiminT_{t-1,i}>L_{i,T,2}=\frac{4c_{2}B_{1}K\log T}{\Delta_{i}^{\min}},

we have (109,i)4c2B1logTTt1,iΔStK<ΔiminKΔStK0=κi,T(Tt1,i)(\ref{apdx_eq:tpvm_lambda_term_2},i)\leq 4c_{2}B_{1}\frac{\log T}{T_{t-1,i}}-\frac{\Delta_{S_{t}}}{K}<\frac{\Delta_{i}^{\min}}{K}-\frac{\Delta_{S_{t}}}{K}\leq 0=\kappa_{i,T}(T_{t-1,i}).

When $L_{i,T,1}<T_{t-1,i}\leq L_{i,T,2}$,

We have $(109,i)\leq 4c_{2}B_{1}\frac{\log T}{T_{t-1,i}}-\frac{\Delta_{S_{t}}}{K}<\frac{4c_{2}B_{1}\log T}{T_{t-1,i}}=\kappa_{i,T}(T_{t-1,i})$.

When Tt1,iLi,T,1T_{t-1,i}\leq L_{i,T,1},

If there exists $i\in\tilde{S}_{t}$ such that $T_{t-1,i}\leq L_{i,T,1}$, then $\sum_{q\in\tilde{S}_{t}}\kappa_{q,T}(T_{t-1,q})\geq\kappa_{i,T}(T_{t-1,i})=\Delta_{i}^{\max}\geq\Delta_{S_{t}}$, so Equation 109 holds regardless and this case requires no further consideration.

Combining all above cases, we have ΔSt𝔼t[iτtκi,T(Tt1,i)]\Delta_{S_{t}}\leq\mathbb{E}_{t}[\sum_{i\in\tau_{t}}\kappa_{i,T}(T_{t-1,i})]. ∎