
Thompson Sampling for (Combinatorial) Pure Exploration

Siwei Wang    Jun Zhu
Abstract

Existing methods of combinatorial pure exploration mainly focus on the UCB approach. To make the algorithm efficient, they usually use the sum of upper confidence bounds within arm set S to represent the upper confidence bound of S. Since the empirical means of different arms in S are independent, this sum can be much larger than the tight upper confidence bound of S, which leads to a much higher complexity than necessary. To deal with this challenge, we explore the idea of Thompson Sampling (TS), which uses independent random samples instead of upper confidence bounds, and design the first TS-based algorithm TS-Explore for (combinatorial) pure exploration. In TS-Explore, the sum of independent random samples within arm set S will not exceed the tight upper confidence bound of S with high probability. Hence it solves the above challenge, and achieves a lower complexity upper bound than existing efficient UCB-based algorithms in general combinatorial pure exploration. As for pure exploration of the classic multi-armed bandit, we show that TS-Explore achieves an asymptotically optimal complexity upper bound.

Combinatorial Pure Exploration, Thompson Sampling

1 Introduction

Pure exploration is an important task in online learning, whose goal is to identify the target arm as fast as possible. In pure exploration of the classic multi-armed bandit (MAB) (Audibert et al., 2010), there are m arms, and each arm i is associated with a probability distribution D_{i} with mean \mu_{i}. Once arm i is pulled, it returns an observation r_{i}, which is drawn independently from D_{i} by the environment. At each time step t, the learning policy \pi either chooses an arm i(t) to pull, or chooses to output an arm a(t). The goal of the learning policy is to pull arms properly such that, with error probability at most \delta, its output arm a(t) is the optimal arm (i.e., a(t)=\operatornamewithlimits{argmax}_{i\in[m]}\mu_{i}), with complexity (the total number of observations) as low as possible.

Pure exploration is widely adopted in real applications. For example, in the selling procedure of cosmetics, there is always a testing phase before the commercialization phase (Audibert et al., 2010). The goal of the testing phase is to help maximize the cumulative reward collected in the commercialization phase. Therefore, instead of regret minimization (Berry & Fristedt, 1985; Auer et al., 2002), the testing phase only needs to do exploration (e.g., to investigate which product is the most popular one), and aims to find the target with both a correctness guarantee and low cost. In the real world, the system sometimes looks for the best action under a specific combinatorial structure, instead of the best single arm (Chen et al., 2014). For example, a network routing system needs to search for the path with minimum delay between the source and the destination. Since there can be an exponential number of paths, the cost of exploring them separately is unacceptable. Therefore, people choose to find the best path by exploring single edges, and this is a pure exploration problem instance in the combinatorial multi-armed bandit (CMAB). In this setting, we still pull single arms (base arms) at each time step, and there is a super arm set \mathcal{I}\subseteq 2^{[m]}. The expected reward of a super arm S\in\mathcal{I} is \sum_{i\in S}\mu_{i}, i.e., the sum of the expected rewards of its contained base arms. The goal of the player is to find the optimal super arm with error probability at most \delta and complexity as low as possible.

Most existing solutions for pure exploration follow the UCB approach (Audibert et al., 2010; Kalyanakrishnan et al., 2012; Chen et al., 2014; Kaufmann & Kalyanakrishnan, 2013). They compute confidence bounds for all the arms, and claim that an arm is optimal only if its lower confidence bound is larger than the upper confidence bounds of all the other arms. Though this approach is asymptotically optimal in pure exploration of classic MAB problems (Kalyanakrishnan et al., 2012; Kaufmann & Kalyanakrishnan, 2013), it faces some challenges in the CMAB setting (Chen et al., 2014). In the UCB approach, for a super arm S, the algorithm usually uses the sum of the upper confidence bounds of all its contained base arms as its upper confidence bound, to achieve a low implementation cost. This means that the gap between the empirical mean of super arm S and the used upper confidence bound of S is about \tilde{O}(\sum_{i\in S}\sqrt{1/N_{i}}) (here N_{i} is the number of observations on base arm i). However, since the observations of different base arms are independent, the standard deviation of the empirical mean of super arm S is \tilde{O}(\sqrt{\sum_{i\in S}{1/N_{i}}}), which is much smaller than \tilde{O}(\sum_{i\in S}\sqrt{1/N_{i}}). This means that the used upper confidence bound of S is much larger than the tight upper confidence bound of S. Combes et al. (2015) deal with this problem by computing the upper confidence bounds of all the super arms independently, which leads to an exponential time cost and is hard to implement in practice. In fact, existing efficient UCB-based solutions either suffer from a high complexity bound (Chen et al., 2014; Gabillon et al., 2016), or need further assumptions on the combinatorial structure to achieve an optimal complexity bound efficiently (Chen et al., 2017).

A natural solution to this challenge is to use random samples of arm i (with mean equal to its empirical mean and standard deviation \tilde{O}(\sqrt{1/N_{i}})) instead of its upper confidence bound to judge whether a super arm is optimal. If we let the random samples of different base arms be independent, then with high probability, the gap between the empirical mean of super arm S and the sum of random samples within S is \tilde{O}(\sqrt{\sum_{i\in S}{1/N_{i}}}), which has the same order as the gap between the empirical mean of super arm S and its real mean. Therefore, using independent random samples can behave better than using confidence bounds, and this is the key idea of Thompson Sampling (TS) (Thompson, 1933; Kaufmann et al., 2012; Agrawal & Goyal, 2013). In fact, many prior works show that TS-based algorithms achieve smaller cumulative regret than UCB-based algorithms in regret minimization for the CMAB model (Wang & Chen, 2018; Perrault et al., 2020). However, studies on adapting the idea of TS, i.e., using random samples to make decisions, to the pure exploration setting are still lacking.
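The two scalings can be illustrated numerically. The following sketch is ours, not from the paper; it assumes five base arms with N_{i}=100 observations each and Gaussian samples with variance 1/N_{i}:

```python
# A numeric illustration (ours; Gaussian noise is an assumption) of the two
# scalings: the sum of per-arm radii is sum_i sqrt(1/N_i), while the standard
# deviation of a sum of independent per-arm samples is sqrt(sum_i 1/N_i).
import numpy as np

rng = np.random.default_rng(0)
N = np.array([100, 100, 100, 100, 100])   # observation counts of 5 base arms in S

sum_of_radii = np.sum(np.sqrt(1.0 / N))   # what the sum-of-UCBs approach pays: 0.5
std_of_sum = np.sqrt(np.sum(1.0 / N))     # deviation of the sum of samples: ~0.224

# Empirical check: draw theta_i ~ N(0, 1/N_i) independently and inspect the sum.
theta = rng.normal(0.0, np.sqrt(1.0 / N), size=(100_000, len(N)))
print(sum_of_radii, std_of_sum, theta.sum(axis=1).std())
```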

In this paper, we attempt to fill up this gap, and study using random samples in pure exploration under the frequentist setting for both MAB and CMAB instances. We emphasize that it is non-trivial to design (and analyze) such a TS-based algorithm. The first challenge is that there is a lot more uncertainty in the random samples than in the confidence bounds. In UCB-based algorithms, the upper confidence bound of the optimal arm is larger than its expected reward with high probability. Thus, the arm with the largest upper confidence bound is either an insufficiently learned sub-optimal arm (i.e., the number of its observations is not enough to make sure that it is a sub-optimal arm) or the optimal arm, which means that the number of pulls on sufficiently learned sub-optimal arms is limited. However, for TS-based algorithms that use random samples instead, there is always a constant probability that the random sample of the optimal arm is smaller than its expected reward. In this case, the arm with the largest random sample may be a sufficiently learned sub-optimal arm. Therefore, the mechanism of the TS-based policy should be designed carefully to make sure that we can still obtain an upper bound for the number of pulls on sufficiently learned sub-optimal arms.

Another challenge is that using random samples to make decisions is a kind of Bayesian algorithm, and it loses many good properties in the frequentist setting. In the Bayesian setting, at each time step t, the parameters of the game follow a posterior distribution \mathcal{P}_{t}, and the random samples are drawn from \mathcal{P}_{t} independently as well. Therefore, using the random samples to output the optimal arm can explicitly ensure the correctness of the TS-based algorithm. However, in the frequentist setting, the parameters of the game are fixed but unknown, and they have no such correlations with the random samples. Because of this, if we still choose to use the random samples to output the optimal arm, then the distributions of the random samples in the TS-based algorithm need to be chosen carefully to make sure that we can still obtain the correctness guarantee in the frequentist setting.

Besides, the analysis of a TS-based algorithm in pure exploration is also very different from that in regret minimization. In regret minimization, at each time step, we only need to draw one sample for each arm (Thompson, 1933; Agrawal & Goyal, 2013; Wang & Chen, 2018; Perrault et al., 2020). However, in pure exploration, one set of samples at each time step is not enough, since the algorithm needs to i) check whether there is an arm that is the optimal one with high probability; and ii) look for an arm that needs exploration. Neither of these goals can be achieved with only one set of samples. Therefore, we must draw several sets of samples to make decisions, and this breaks the existing analysis of TS in regret minimization.

In this paper, we solve the above challenges, and design a TS-based algorithm TS-Explore for (combinatorial) pure exploration under the frequentist setting with polynomial implementation cost. At each time step t, TS-Explore first draws independent random samples \theta_{i}^{k}(t) for all the (base) arms i\in[m] and k=1,2,\cdots,M (i.e., M independent samples for each arm in total). Then it finds the M best (super) arms \tilde{S}^{k}_{t} under the sample sets \bm{\theta}^{k}(t)=[\theta_{1}^{k}(t),\theta_{2}^{k}(t),\cdots,\theta_{m}^{k}(t)] for k=1,2,\cdots,M. If in all these sample sets, the best (super) arm is the same as the empirically best (super) arm \hat{S}_{t} (i.e., the best arm under the empirical means), then the algorithm outputs this (super) arm as the optimal one. Otherwise, for all k\in[M], the algorithm checks the reward gap between \tilde{S}^{k}_{t} and \hat{S}_{t} under the parameter set \bm{\theta}^{k}(t). Then it focuses on the (super) arm \tilde{S}^{k_{t}^{*}}_{t} with the largest reward gap, i.e., it chooses to pull a (base) arm in the exchange set of \hat{S}_{t} and \tilde{S}^{k_{t}^{*}}_{t}.

Recording such reward gaps and focusing on \tilde{S}^{k_{t}^{*}}_{t} is the key mechanism we use to solve the above challenges. On the one hand, our analysis shows that with a proper value of M (the number of sample sets) and proper random sample distributions, \tilde{S}^{k_{t}^{*}}_{t} shares some properties with the (super) arm with the largest upper confidence bound. These properties play a critical role in the analysis of UCB approaches. Thus, we can also use them to obtain an upper bound on the number of pulls of sufficiently learned sub-optimal arms in TS-Explore, as well as the correctness guarantee of TS-Explore. On the other hand, this novel mechanism preserves the key advantage of Thompson Sampling, i.e., the sum of random samples within S will not exceed the tight upper confidence bound of S with high probability. Hence, TS-Explore can achieve a lower complexity upper bound than existing efficient UCB-based solutions. These results indicate that our TS-Explore policy is correct, efficient (with low implementation cost), and effective (with low complexity).

In the general CMAB setting, we show that TS-Explore is near optimal and achieves a lower complexity upper bound than existing efficient UCB-based algorithms (Chen et al., 2014). The optimal algorithm in (Chen et al., 2017) is only efficient when the combinatorial structure satisfies some specific properties (otherwise it suffers from an exponential implementation cost), and is thus less general than our result. We also conduct experiments to compare the complexity of TS-Explore with existing baselines. The experimental results show that TS-Explore outperforms the efficient baseline CLUCB (Chen et al., 2014), and behaves only a little worse than the optimal but non-efficient baseline NaiveGapElim (Chen et al., 2017). As for the MAB setting, we show that TS-Explore is asymptotically optimal, i.e., it has a complexity comparable to existing optimal algorithms (Kalyanakrishnan et al., 2012; Kaufmann & Kalyanakrishnan, 2013) when \delta\to 0. All these results indicate that our TS-based algorithm is efficient and effective in dealing with pure exploration problems. To the best of our knowledge, this is the first result of using this kind of TS-based algorithm (i.e., always making decisions based on random samples) in pure exploration under the frequentist setting.

2 Related Work

Pure exploration of the classic MAB model was first proposed by Audibert et al. (2010). Since then, many learning policies have been designed for this problem. The two most representative algorithms are successive-elimination (Even-Dar et al., 2006; Audibert et al., 2010; Kaufmann & Kalyanakrishnan, 2013) and LUCB (Kalyanakrishnan et al., 2012; Kaufmann & Kalyanakrishnan, 2013). Both adopt the idea of UCB (Auer et al., 2002) and achieve an asymptotically optimal complexity upper bound (i.e., one that matches the complexity lower bound proposed by Kalyanakrishnan et al. (2012)). Compared to these results, our TS-Explore policy uses a totally different approach, and achieves an asymptotically optimal complexity upper bound as well.

Combinatorial pure exploration was first studied by Chen et al. (2014). They propose CLUCB, an LUCB-based algorithm that is efficient as long as there is an offline oracle that outputs the best super arm under any given parameter set. Chen et al. (2017) then design an asymptotically optimal algorithm for this problem. However, their algorithm can only be implemented efficiently when the combinatorial structure follows some specific constraints. Recently, based on the game approach, Jourdan et al. (2021) provide another optimal learning policy for pure exploration in CMAB, but their algorithm still suffers from an exponential implementation cost. Compared with these UCB-based algorithms, our TS-Explore policy achieves a lower complexity bound than CLUCB (Chen et al., 2014) (with a similar polynomial time cost), and has a much lower implementation cost than the optimal policies (Chen et al., 2017; Jourdan et al., 2021) in the most general combinatorial pure exploration setting.

There is also some research on applying Thompson Sampling to pure exploration. For example, Russo (2016) considers a frequentist setting of pure exploration in classic MAB, and proposes algorithms called TTPS, TTVS, and TTTS; Shang et al. (2020) extend the results in (Russo, 2016), design a T3C algorithm, and provide its analysis for Gaussian bandits; Li et al. (2021) study Bayesian contextual pure exploration, propose an algorithm called BRE, and obtain its corresponding analysis. However, these results are still very different from ours. First, they mainly use the posterior distributions rather than random samples to decide the next arm to pull or when to stop, which may cause a high implementation cost when extended to the combinatorial setting, since it is much more complex to deal with distributions than with random samples. Moreover, our results are more general even if we only consider pure exploration in classic MAB under the frequentist setting. For example, Russo (2016) does not provide a correct stopping rule, and Shang et al. (2020) only obtain the correctness guarantee for Gaussian bandits. Besides, their complexity bounds are asymptotic (requiring \delta\to 0), while ours hold for any \delta\in(0,1).

3 Model Setting

3.1 Pure Exploration in Multi-armed Bandit

A pure exploration problem instance of MAB is a tuple ([m],\bm{D},\delta). Here [m]=\{1,2,\cdots,m\} is the set of arms, \bm{D}=\{D_{1},D_{2},\cdots,D_{m}\} are the corresponding reward distributions of the arms, and \delta is the error constraint. In this paper, we assume that all the distributions D_{i} are supported on [0,1]. Let \mu_{i}\triangleq\mathbb{E}_{X\sim D_{i}}[X] denote the expected reward of arm i, and a^{*}=\operatornamewithlimits{argmax}_{i\in[m]}\mu_{i} the optimal arm with the largest expected reward. As in many existing works (e.g., (Audibert et al., 2010)), we assume that the optimal arm is unique. At each time step t, the learning policy \pi can either pull an arm i(t)\in[m], or output an arm a(t)\in[m]. If it chooses to pull arm i(t), then it will receive an observation r_{i(t)}(t), which is drawn independently from D_{i(t)}. The goal of the learning policy is to make sure that with probability at least 1-\delta, its output a(t)=a^{*}. Under this constraint, it aims to minimize the complexity

Z^{\pi}\triangleq\sum_{i=1}^{m}N_{i}(T^{\pi}),

where T^{\pi} denotes the time step t at which policy \pi chooses to output a(t), and N_{i}(t) denotes the number of observations on arm i until time step t.

Let \Delta_{i,m}\triangleq\mu_{a^{*}}-\mu_{i} denote the expected reward gap between the optimal arm a^{*} and any other arm i\neq a^{*}. For the optimal arm a^{*}, its gap \Delta_{a^{*},m} is defined as \mu_{a^{*}}-\max_{i\neq a^{*}}\mu_{i}. We also define H_{m}\triangleq\sum_{i\in[m]}{1\over\Delta_{i,m}^{2}}, and existing works (Kalyanakrishnan et al., 2012) show that the complexity lower bound of any pure exploration algorithm is \Omega(H_{m}\log{1\over\delta}).
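For concreteness, the gaps and H_{m} can be computed as follows (a small sketch of ours; the means are hypothetical):

```python
# A small sketch (hypothetical means) computing the gaps Delta_{i,m} and the
# hardness H_m = sum_i 1 / Delta_{i,m}^2 defined above.
import numpy as np

def hardness_mab(mu):
    mu = np.asarray(mu, dtype=float)
    best = int(np.argmax(mu))
    gaps = mu[best] - mu                               # Delta_{i,m} for i != a*
    gaps[best] = mu[best] - np.partition(mu, -2)[-2]   # gap of a* to the runner-up
    return float(np.sum(1.0 / gaps ** 2))

print(hardness_mab([0.5, 0.4, 0.3]))   # gaps 0.1, 0.1, 0.2 -> H_m = 225.0
```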

3.2 Pure Exploration in Combinatorial Multi-armed Bandit

A pure exploration problem instance of CMAB is an extension of the MAB setting. The arms i\in[m] are called base arms, and there is also a super arm set \mathcal{I}\subseteq 2^{[m]}. For each super arm S\in\mathcal{I}, its expected reward is \sum_{i\in S}\mu_{i}. Let S^{*}=\operatornamewithlimits{argmax}_{S\in\mathcal{I}}\sum_{i\in S}\mu_{i} denote the optimal super arm with the largest expected reward; we assume that the optimal super arm is unique (as in (Chen et al., 2014)). At each time step t, the learning policy \pi can either pull a base arm i(t)\in[m], or output a super arm S(t)\in\mathcal{I}. The goal of the learning policy is to make sure that with probability at least 1-\delta, its output S(t)=S^{*}. Under this constraint, it also wants to minimize its complexity Z^{\pi}.

As in many existing works (Chen et al., 2013, 2014; Wang & Chen, 2018), we assume that there exists an offline {\sf Oracle}, which takes a set of parameters \bm{\theta}=[\theta_{1},\cdots,\theta_{m}] as input, and outputs the best super arm under this parameter set, i.e., {\sf Oracle}(\bm{\theta})=\operatornamewithlimits{argmax}_{S\in\mathcal{I}}\sum_{i\in S}\theta_{i}.
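For a tiny, explicitly enumerated super arm set, such an oracle can be sketched by brute force (our illustration; the instance is hypothetical, and real applications would plug in a problem-specific solver such as a shortest-path routine):

```python
# A brute-force sketch of the assumed offline Oracle for a small, explicitly
# enumerated super arm set I (a hypothetical instance).
def oracle(theta, super_arms):
    """Return argmax_{S in I} sum_{i in S} theta_i."""
    return max(super_arms, key=lambda S: sum(theta[i] for i in S))

super_arms = [{0, 1}, {1, 2}, {0, 3}]   # I over m = 4 base arms
theta = [0.9, 0.1, 0.8, 0.2]
print(oracle(theta, super_arms))        # {0, 3}, with value 0.9 + 0.2 = 1.1
```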

In this paper, for i\notin S^{*}, we use \Delta_{i,c}\triangleq\sum_{j\in S^{*}}\mu_{j}-\max_{S\in\mathcal{I}:i\in S}\sum_{j\in S}\mu_{j} to denote the expected reward gap between the optimal super arm S^{*} and the best super arm that contains i. As for i\in S^{*}, its gap is defined as \Delta_{i,c}\triangleq\sum_{j\in S^{*}}\mu_{j}-\max_{S\in\mathcal{I}:i\notin S}\sum_{j\in S}\mu_{j}, i.e., the expected reward gap between S^{*} and the best super arm that does not contain i. We also define S\oplus S^{\prime}=(S\setminus S^{\prime})\cup(S^{\prime}\setminus S) and \mathrm{width}\triangleq\max_{S\neq S^{\prime}}|S\oplus S^{\prime}|, and let H_{1,c}\triangleq\mathrm{width}\sum_{i\in[m]}{1\over\Delta_{i,c}^{2}} and H_{2,c}\triangleq\mathrm{width}^{2}\sum_{i\in[m]}{1\over\Delta_{i,c}^{2}}.
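These quantities can be computed by enumeration on tiny instances (a sketch of ours; it assumes each base arm lies in some super arm and outside some other one). The example is the two-super-arm instance of Section 5 (Problem 5.1 with n=2):

```python
# A sketch computing Delta_{i,c}, width, H_{1,c} and H_{2,c} by enumeration
# (tiny enumerated instances only; real instances need a problem-specific method).
from itertools import combinations

def hardness_cmab(mu, super_arms):
    val = lambda S: sum(mu[i] for i in S)
    S_star = max(super_arms, key=val)
    gaps = []
    for i in range(len(mu)):
        if i in S_star:
            rival = max(val(S) for S in super_arms if i not in S)
        else:
            rival = max(val(S) for S in super_arms if i in S)
        gaps.append(val(S_star) - rival)              # Delta_{i,c}
    width = max(len(S ^ T) for S, T in combinations(super_arms, 2))
    h = sum(1.0 / g ** 2 for g in gaps)
    return width * h, width ** 2 * h                  # (H_{1,c}, H_{2,c})

print(hardness_cmab([0.1, 0.1, 0.9, 0.9], [{0, 1}, {2, 3}]))   # (6.25, 25.0)
```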

Chen et al. (2017) prove that the complexity lower bound for combinatorial pure exploration is \Omega(H_{0,c}\log{1\over\delta}), where H_{0,c} is the optimal value of the following convex program (here \Delta_{S^{*},S}=\sum_{i\in S^{*}}\mu_{i}-\sum_{i\in S}\mu_{i}):

\min\quad\sum_{i\in[m]}N_{i}
\mathrm{s.t.}\quad\sum_{i\in S^{*}\oplus S}{1\over N_{i}}\leq\Delta_{S^{*},S}^{2},\quad\forall S\in\mathcal{I},S\neq S^{*}

The following result shows the relationship between H_{0,c}, H_{1,c} and H_{2,c}.

Proposition 3.1.

For any combinatorial pure exploration instance, H_{0,c}\leq H_{1,c}\leq H_{2,c}.

Proof.

Since \mathrm{width}\geq 1, we have H_{1,c}\leq H_{2,c}. As for the first inequality, note that N_{i}={\mathrm{width}\over\Delta_{i,c}^{2}}, \forall i\in[m], is a feasible solution of the above convex program: for any S\neq S^{*}, we have |S^{*}\oplus S|\leq\mathrm{width} and \Delta_{i,c}\leq\Delta_{S^{*},S} for every i\in S^{*}\oplus S, so \sum_{i\in S^{*}\oplus S}{1\over N_{i}}=\sum_{i\in S^{*}\oplus S}{\Delta_{i,c}^{2}\over\mathrm{width}}\leq\Delta_{S^{*},S}^{2}. Hence we have that H_{0,c}\leq H_{1,c}. ∎
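The feasibility step can also be checked numerically (our sketch, on the two-super-arm instance of Section 5 with n=2):

```python
# Numeric check (a sketch on a hypothetical 2-super-arm instance) that
# N_i = width / Delta_{i,c}^2 satisfies every constraint of the convex program,
# which gives H_{0,c} <= sum_i N_i = H_{1,c}.
mu = [0.1, 0.1, 0.9, 0.9]
super_arms = [{0, 1}, {2, 3}]
val = lambda S: sum(mu[i] for i in S)
S_star = max(super_arms, key=val)
width = max(len(S ^ T) for S in super_arms for T in super_arms)
delta_ic = [1.6, 1.6, 1.6, 1.6]          # every Delta_{i,c} = 0.8 * n = 1.6 here
N = [width / d ** 2 for d in delta_ic]
for S in super_arms:
    if S != S_star:
        lhs = sum(1.0 / N[i] for i in S_star ^ S)
        rhs = (val(S_star) - val(S)) ** 2
        print(lhs, "<=", rhs)            # 2.56 <= 2.56 (tight for this instance)
```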

4 Thompson Sampling-based Pure Exploration Algorithm

Note that pure exploration in the classic multi-armed bandit is a special case of combinatorial multi-armed bandit (i.e., \mathcal{I}=\{\{1\},\{2\},\cdots,\{m\}\}). Therefore, in this section, we mainly focus on the TS-based pure exploration algorithm in the combinatorial multi-armed bandit setting. Its framework and analysis directly lead to results in the classic multi-armed bandit setting.

In the following, we let \Phi(x,\mu,\sigma^{2})\triangleq\Pr_{X\sim\mathcal{N}(\mu,\sigma^{2})}[X\geq x]. For any x\in(0,0.5), we also define \phi(x) as the function of x satisfying \Phi(\phi(x),0,1)=x.
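In code, these two quantities are the Gaussian upper tail and its inverse (a sketch of ours; using scipy here is an assumption, the paper fixes no implementation):

```python
# Phi(x, mu, sigma^2) is the Gaussian survival function; phi is its inverse at
# mean 0, variance 1, so that Phi(phi(x), 0, 1) = x.
from scipy.stats import norm

def Phi(x, mu, sigma2):
    return norm.sf(x, loc=mu, scale=sigma2 ** 0.5)   # Pr[X >= x], X ~ N(mu, sigma2)

def phi(x):
    return norm.isf(x)                               # inverse survival function

print(Phi(phi(0.05), 0, 1))                          # 0.05, up to floating point
```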

4.1 Algorithm Framework

Algorithm 1 TS-Explore
1:  Input: Error constraint \delta, q\in[\delta,0.1], t\leftarrow m, N_{i}\leftarrow 0, R_{i}\leftarrow 0 for all i\in[m].
2:  Pull each arm once, and update their numbers of pulls N_{i} and the sums of their observations R_{i}.
3:  while true do
4:     t\leftarrow t+1.
5:     For each base arm i\in[m], \hat{\mu}_{i}(t)\leftarrow{R_{i}\over N_{i}}.
6:     \bm{\hat{\mu}}(t)\leftarrow[\hat{\mu}_{1}(t),\hat{\mu}_{2}(t),\cdots,\hat{\mu}_{m}(t)].
7:     \hat{S}_{t}\leftarrow{\sf Oracle}(\bm{\hat{\mu}}(t)).
8:     for k=1,2,\cdots,M(\delta,q,t) do
9:        For each arm i, draw \theta_{i}^{k}(t) independently from the distribution \mathcal{N}(\hat{\mu}_{i}(t),{C(\delta,q,t)\over N_{i}}).
10:        \bm{\theta}^{k}(t)\leftarrow[\theta_{1}^{k}(t),\theta_{2}^{k}(t),\cdots,\theta_{m}^{k}(t)].
11:        \tilde{S}^{k}_{t}\leftarrow{\sf Oracle}(\bm{\theta}^{k}(t)).
12:        \tilde{\Delta}^{k}_{t}\leftarrow\sum_{i\in\tilde{S}^{k}_{t}}\theta_{i}^{k}(t)-\sum_{i\in\hat{S}_{t}}\theta_{i}^{k}(t).
13:     end for
14:     if \forall 1\leq k\leq M(\delta,q,t), \tilde{S}^{k}_{t}=\hat{S}_{t} then
15:        Return: \hat{S}_{t}.
16:     else
17:        k^{*}_{t}\leftarrow\operatornamewithlimits{argmax}_{k}\tilde{\Delta}^{k}_{t}, \tilde{S}_{t}\leftarrow\tilde{S}^{k^{*}_{t}}_{t}.
18:        Pull arm i(t)\leftarrow\operatornamewithlimits{argmin}_{i\in\hat{S}_{t}\oplus\tilde{S}_{t}}N_{i}, and update its number of pulls N_{i(t)} and the sum of its observations R_{i(t)}.
19:     end if
20:  end while

Our Thompson Sampling-based algorithm (TS-Explore) is described in Algorithm 1. We use N_{i} to denote the number of observations on arm i, R_{i} to denote the sum of all the observations from arm i, and N_{i}(t),R_{i}(t) to denote the values of N_{i},R_{i} at the beginning of time step t.

At each time step t, for every i\in[m], TS-Explore first draws M(\delta,q,t)\triangleq{1\over q}\log(12|\mathcal{I}|^{2}t^{2}/\delta) random samples \{\theta_{i}^{k}(t)\}_{k=1}^{M(\delta,q,t)} independently from a Gaussian distribution with mean \hat{\mu}_{i}(t)\triangleq{R_{i}(t)\over N_{i}(t)} (i.e., the empirical mean of arm i) and variance {C(\delta,q,t)\over N_{i}(t)} (i.e., inversely proportional to N_{i}(t)), where C(\delta,q,t)\triangleq{\log(12|\mathcal{I}|^{2}t^{2}/\delta)\over\phi^{2}(q)}. Then it uses the offline {\sf Oracle} to find the best super arm under the empirical mean parameter set \bm{\hat{\mu}}(t)=[\hat{\mu}_{1}(t),\hat{\mu}_{2}(t),\cdots,\hat{\mu}_{m}(t)] and under the k-th sample set \bm{\theta}^{k}(t)=[\theta_{1}^{k}(t),\theta_{2}^{k}(t),\cdots,\theta_{m}^{k}(t)] for all k, i.e., \hat{S}_{t}={\sf Oracle}(\bm{\hat{\mu}}(t)) and \tilde{S}^{k}_{t}={\sf Oracle}(\bm{\theta}^{k}(t)). If all the best super arms \tilde{S}^{k}_{t} are the same as \hat{S}_{t}, then TS-Explore outputs this super arm as the optimal one. Otherwise, for all 1\leq k\leq M(\delta,q,t), it computes the reward gap between \tilde{S}^{k}_{t} and \hat{S}_{t} under the k-th sample set \bm{\theta}^{k}(t), and focuses on the index k^{*}_{t} with the largest reward gap: TS-Explore pulls the base arm with the least number of observations in \hat{S}_{t}\oplus\tilde{S}^{k^{*}_{t}}_{t} (in the following, we write \tilde{S}_{t} for \tilde{S}^{k^{*}_{t}}_{t} to simplify notation). Note that \hat{S}_{t}\neq\tilde{S}_{t} (otherwise we would have output \hat{S}_{t} as the optimal super arm), so the rule for choosing which arm to pull is well defined.
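A compact, runnable sketch of this loop is given below. It is our reading of Algorithm 1 under extra assumptions the paper does not fix: Bernoulli base arms, a brute-force oracle over an explicitly enumerated super arm set, and \phi computed via scipy; the function name ts_explore and the pull() interface are ours.

```python
# TS-Explore sketch: M(delta,q,t) and C(delta,q,t) follow the formulas above.
import numpy as np
from scipy.stats import norm

def ts_explore(pull, m, super_arms, delta, q, rng):
    val = lambda S, th: sum(th[i] for i in S)
    oracle = lambda th: max(super_arms, key=lambda S: val(S, th))
    N, R = np.zeros(m), np.zeros(m)
    for i in range(m):                        # line 2: pull each base arm once
        N[i], R[i] = 1, pull(i)
    t, I2 = m, len(super_arms) ** 2
    while True:
        t += 1
        mu_hat = R / N
        S_hat = oracle(mu_hat)                # empirically best super arm
        L1 = np.log(12 * I2 * t * t / delta)
        M = int(np.ceil(L1 / q))              # M(delta, q, t) sample sets
        C = L1 / norm.isf(q) ** 2             # C(delta, q, t)
        best_gap, best_S = 0.0, S_hat
        for _ in range(M):                    # lines 8-13
            theta = rng.normal(mu_hat, np.sqrt(C / N))
            S_k = oracle(theta)
            gap = val(S_k, theta) - val(S_hat, theta)
            if gap > best_gap:                # ties (probability zero) ignored
                best_gap, best_S = gap, S_k
        if best_S == S_hat:                   # every sample-optimal arm equals S_hat
            return S_hat, int(N.sum())        # output, total complexity
        i = min(S_hat ^ best_S, key=lambda j: N[j])   # least-observed exchange arm
        N[i] += 1
        R[i] += pull(i)

# Problem 5.1 with n = 2; delta = q = 0.05 keeps M small enough for a quick run.
rng = np.random.default_rng(1)
mu = [0.1, 0.1, 0.9, 0.9]
arms = [frozenset({0, 1}), frozenset({2, 3})]
print(ts_explore(lambda i: float(rng.random() < mu[i]), 4, arms, 0.05, 0.05, rng))
```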

4.2 Analysis of TS-Explore

Theorem 4.1.

In the CMAB setting, for q\in[\delta,0.1], with probability at least 1-\delta, TS-Explore outputs the optimal super arm S^{*} with complexity upper bounded by O(H_{1,c}(\log{1\over\delta}+\log(|\mathcal{I}|H_{1,c})){\log{1\over\delta}\over\log{1\over q}}+H_{1,c}{\log^{2}(|\mathcal{I}|H_{1,c})\over\log{1\over q}}). Specifically, if we choose q=\delta, then the complexity upper bound is O(H_{1,c}\log{1\over\delta}+H_{1,c}\log^{2}(|\mathcal{I}|H_{1,c})).

Remark 4.2.

When the error constraint \delta\in(0.1,1), we can still let the parameters (\delta,q) in TS-Explore be (\delta_{0},q_{0})=(0.1,0.1). In this case: i) its error probability is upper bounded by \delta_{0}=0.1<\delta; and ii) since {\delta\over\delta_{0}}<{1\over 0.1}=10, the complexity of TS-Explore is still upper bounded by O(H_{1,c}\log{1\over\delta_{0}}+H_{1,c}\log^{2}(|\mathcal{I}|H_{1,c}))=O(H_{1,c}\log{1\over\delta}+H_{1,c}\log^{2}(|\mathcal{I}|H_{1,c})).

By Theorem 4.1, we can directly obtain the correctness guarantee and the complexity upper bound for applying TS-Explore in pure exploration of classic multi-armed bandit.

Corollary 4.3.

In the MAB setting, for q\in[\delta,0.1], with probability at least 1-\delta, TS-Explore outputs the optimal arm a^{*} with complexity upper bounded by O(H_{m}(\log{1\over\delta}+\log(mH_{m})){\log{1\over\delta}\over\log{1\over q}}+H_{m}{\log^{2}(mH_{m})\over\log{1\over q}}). Specifically, if we choose q=\delta, then the complexity upper bound is O(H_{m}\log{1\over\delta}+H_{m}\log^{2}(mH_{m})).

Remark 4.4.

The value q in TS-Explore controls the number of random samples we draw at each time step. Note that M(\delta,q,t)={1\over q}\log(12|\mathcal{I}|^{2}t^{2}/\delta). Hence when q becomes larger, we need fewer samples, but the complexity bound becomes worse: there is a trade-off between the algorithm's complexity and the number of random samples it needs to draw. Our analysis shows that using q=\delta^{1\over\beta} for some constant \beta\geq 1 keeps the complexity upper bound at the same order, while reducing the number of random samples significantly.
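To give a feel for the numbers (a sketch of ours; the values of \delta, t and |\mathcal{I}| below are hypothetical):

```python
# Remark 4.4 in numbers: the number of sample sets per step is
# M(delta, q, t) = (1/q) log(12 |I|^2 t^2 / delta).
import numpy as np

def M(delta, q, t, I_size):
    return int(np.ceil(np.log(12 * I_size ** 2 * t ** 2 / delta) / q))

delta, t, I_size = 1e-5, 100, 10
print(M(delta, delta, t, I_size))           # q = delta:       ~2.8e6 sample sets
print(M(delta, delta ** 0.5, t, I_size))    # q = delta^(1/2): ~8.8e3 sample sets
```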

Remark 4.5.

If the value |\mathcal{I}| is unknown (which is common in real applications), we can use 2^{m} instead (in M(\delta,q,t) and C(\delta,q,t)). This only increases the constant term in the complexity upper bound and does not influence the major term O(H_{1,c}\log{1\over\delta}).

Theorem 4.1 shows that the complexity upper bound of TS-Explore is a factor of \mathrm{width} lower than that of the CLUCB policy in (Chen et al., 2014). To the best of our knowledge, TS-Explore is the first algorithm that efficiently achieves an O(H_{1,c}\log{1\over\delta}) complexity upper bound in the most general combinatorial pure exploration setting. Besides, this is also the first theoretical analysis of using a TS-based algorithm (i.e., one that uses random samples to make decisions) to deal with combinatorial pure exploration problems under the frequentist setting.

Though the complexity bound of TS-Explore still has some gap from the optimal one O(H_{0,c}\log{1\over\delta}), we emphasize that this is because we only use the simple offline oracle and do not seek more detailed information about the combinatorial structure. This makes our policy more efficient. As a comparison, the existing optimal policies (Chen et al., 2017; Jourdan et al., 2021) either suffer from an exponential time cost, or require the combinatorial structure to satisfy some specific constraints so that they can adopt more powerful offline oracles that explore detailed information about the combinatorial structure efficiently (please see Appendix C for discussions about this). Therefore, our algorithm is more efficient in the most general CMAB setting, and can be attractive in real applications with large-scale and complex combinatorial structures.

On the other hand, this gap between the complexity upper and lower bounds does not exist in the MAB setting. Corollary 4.3 shows that the complexity upper bound of applying TS-Explore in the classic MAB setting matches the complexity lower bound \Omega(H_{m}\log{1\over\delta}) (Kalyanakrishnan et al., 2012) when \delta\to 0, i.e., TS-Explore is asymptotically optimal in the classic MAB setting.

Now we provide the proof of Theorem 4.1.

In this proof, we denote \mathcal{J}=\{U:\exists S,S^{\prime}\in\mathcal{I},U=S\setminus S^{\prime}\}. Note that |\mathcal{J}|\leq|\mathcal{I}|^{2} and for any U\in\mathcal{J}, |U|\leq\mathrm{width}. We also denote L_{1}(t)=\log(12|\mathcal{I}|^{2}t^{2}/\delta) and L_{2}(t)=\log(12|\mathcal{I}|^{2}t^{2}M(\delta,q,t)/\delta) to simplify notation.

We first define three events as follows: \mathcal{E}_{0,c} is the event that for all t>0 and U\in\mathcal{J},

|\sum_{i\in U}(\hat{\mu}_{i}(t)-\mu_{i})|\leq\sqrt{\sum_{i\in U}{1\over 2N_{i}(t)}L_{1}(t)};

\mathcal{E}_{1,c} is the event that for all t>0, 1\leq k\leq M(\delta,q,t), and U\in\mathcal{J},

|\sum_{i\in U}(\theta_{i}^{k}(t)-\hat{\mu}_{i}(t))|\leq\sqrt{\sum_{i\in U}{2C(\delta,q,t)\over N_{i}(t)}L_{2}(t)};

and \mathcal{E}_{2,c} is the event that for all t>0 and S,S^{\prime}\in\mathcal{I}, there exists 1\leq k\leq M(\delta,q,t) such that

\sum_{i\in S}\theta_{i}^{k}(t)-\sum_{i\in S^{\prime}}\theta_{i}^{k}(t)\geq\sum_{i\in S}\mu_{i}-\sum_{i\in S^{\prime}}\mu_{i}.

Roughly speaking, \mathcal{E}_{0,c} means that for any U\in\mathcal{J}, the gap between its real mean and its empirical mean lies within the corresponding confidence radius; \mathcal{E}_{1,c} means that for any U\in\mathcal{J}, the gap between its empirical mean and the sum of its random samples lies within the corresponding confidence radius; and \mathcal{E}_{2,c} means that for any two super arms S\neq S^{\prime}, at each time step, there is at least one sample set under which the gap between the sums of their random samples is at least the gap between their real means.

We will first prove that \mathcal{E}_{0,c}\land\mathcal{E}_{1,c}\land\mathcal{E}_{2,c} happens with high probability, and then show the correctness guarantee and complexity upper bound under \mathcal{E}_{0,c}\land\mathcal{E}_{1,c}\land\mathcal{E}_{2,c}.

Lemma 4.6.

In Algorithm 1, we have that

\Pr[\mathcal{E}_{0,c}\land\mathcal{E}_{1,c}\land\mathcal{E}_{2,c}]\geq 1-\delta.
Proof.

Note that the random variable (\hat{\mu}_{i}(t)-\mu_{i}) is zero-mean and {1\over 4N_{i}(t)}-sub-Gaussian, and for different i, the random variables (\hat{\mu}_{i}(t)-\mu_{i}) are independent. Therefore, \sum_{i\in U}(\hat{\mu}_{i}(t)-\mu_{i}) is zero-mean and \sum_{i\in U}{1\over 4N_{i}(t)}-sub-Gaussian. Then by the concentration inequality for sub-Gaussian random variables (see details in Appendix D),

\Pr\left[|\sum_{i\in U}(\hat{\mu}_{i}(t)-\mu_{i})|>\sqrt{\sum_{i\in U}{1\over 2N_{i}(t)}L_{1}(t)}\right]\leq 2\exp(-L_{1}(t))={\delta\over 6|\mathcal{I}|^{2}t^{2}}.

This implies that

\Pr[\neg\mathcal{E}_{0,c}]\leq\sum_{U,t}{\delta\over 6|\mathcal{I}|^{2}t^{2}}\leq\sum_{t}{\delta\over 6t^{2}}\leq{\delta\over 3},

where the second inequality is because |\mathcal{J}|\leq|\mathcal{I}|^{2}, and the third inequality is because \sum_{t}{1\over t^{2}}\leq 2.

Similarly, the random variable (\theta_{i}^{k}(t)-\hat{\mu}_{i}(t)) is a zero-mean Gaussian random variable with variance {C(\delta,q,t)\over N_{i}(t)}, and for different i, the random variables (\theta_{i}^{k}(t)-\hat{\mu}_{i}(t)) are also independent. Then by the concentration inequality,

\Pr\left[|\sum_{i\in U}(\theta_{i}^{k}(t)-\hat{\mu}_{i}(t))|>\sqrt{\sum_{i\in U}{2C(\delta,q,t)\over N_{i}(t)}L_{2}(t)}\right]\leq 2\exp(-L_{2}(t))={\delta\over 6|\mathcal{I}|^{2}t^{2}M(\delta,q,t)}.

This implies that

\Pr[\neg\mathcal{E}_{1,c}]\leq\sum_{U,t,k}{\delta\over 6|\mathcal{I}|^{2}t^{2}M(\delta,q,t)}\leq\sum_{U,t}{\delta\over 6|\mathcal{I}|^{2}t^{2}}\leq{\delta\over 3},

where the second inequality is because there are M(\delta,q,t) sample sets at time step t.

Finally, we consider the probability \Pr[\neg\mathcal{E}_{2,c}|\mathcal{E}_{0,c}]. In the following, we denote \Delta_{S,S^{\prime}}\triangleq\sum_{i\in S}\mu_{i}-\sum_{i\in S^{\prime}}\mu_{i}=\sum_{i\in S\setminus S^{\prime}}\mu_{i}-\sum_{i\in S^{\prime}\setminus S}\mu_{i} as the reward gap (under the real means) between S and S^{\prime}.

For any fixed S\neq S^{\prime}, we denote A(t)=\sum_{i\in S\setminus S^{\prime}}{1\over 2N_{i}(t)}, B(t)=\sum_{i\in S^{\prime}\setminus S}{1\over 2N_{i}(t)} and C(t)=C(\delta,q,t) to simplify notation. Then under event \mathcal{E}_{0,c}, we must have \sum_{i\in S\setminus S^{\prime}}\hat{\mu}_{i}(t)-\sum_{i\in S^{\prime}\setminus S}\hat{\mu}_{i}(t)\geq(\sum_{i\in S\setminus S^{\prime}}\mu_{i}-\sqrt{A(t)L_{1}(t)})-(\sum_{i\in S^{\prime}\setminus S}\mu_{i}+\sqrt{B(t)L_{1}(t)})\geq\Delta_{S,S^{\prime}}-\sqrt{A(t)L_{1}(t)}-\sqrt{B(t)L_{1}(t)}.

Since \sum_{i\in S\setminus S^{\prime}}\theta_{i}^{k}(t)-\sum_{i\in S^{\prime}\setminus S}\theta_{i}^{k}(t) is a Gaussian random variable with mean \sum_{i\in S\setminus S^{\prime}}\hat{\mu}_{i}(t)-\sum_{i\in S^{\prime}\setminus S}\hat{\mu}_{i}(t) and variance 2A(t)C(t)+2B(t)C(t), under event \mathcal{E}_{0,c} we have (recall that \Phi(x,\mu,\sigma^{2})=\Pr_{X\sim\mathcal{N}(\mu,\sigma^{2})}[X\geq x]):

\Pr\left[\sum_{i\in S}\theta_{i}^{k}(t)-\sum_{i\in S^{\prime}}\theta_{i}^{k}(t)\geq\Delta_{S,S^{\prime}}\right]
=\Pr\left[\sum_{i\in S\setminus S^{\prime}}\theta_{i}^{k}(t)-\sum_{i\in S^{\prime}\setminus S}\theta_{i}^{k}(t)\geq\Delta_{S,S^{\prime}}\right]
=\Phi\left(\Delta_{S,S^{\prime}},\sum_{i\in S\setminus S^{\prime}}\hat{\mu}_{i}(t)-\sum_{i\in S^{\prime}\setminus S}\hat{\mu}_{i}(t),2A(t)C(t)+2B(t)C(t)\right)
\geq\Phi\left(\Delta_{S,S^{\prime}},\Delta_{S,S^{\prime}}-\sqrt{A(t)L_{1}(t)}-\sqrt{B(t)L_{1}(t)},2A(t)C(t)+2B(t)C(t)\right)
=\Phi\left(\sqrt{A(t)L_{1}(t)}+\sqrt{B(t)L_{1}(t)},0,2A(t)C(t)+2B(t)C(t)\right)
=\Phi\left(\sqrt{L_{1}(t)\over C(t)}\cdot{\sqrt{A(t)}+\sqrt{B(t)}\over\sqrt{2A(t)+2B(t)}},0,1\right)
\geq\Phi\left(\sqrt{L_{1}(t)\over C(t)},0,1\right)
=q,

where the last equality holds because we choose C(t)={L_{1}(t)\over\phi^{2}(q)} and \Phi(\phi(q),0,1)=q (by the definition of \phi).
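This last step can be sanity-checked numerically (a scipy-based sketch of ours; the values of q and L_{1}(t) are hypothetical):

```python
# With C(t) = L1(t) / phi(q)^2 we get sqrt(L1(t)/C(t)) = phi(q), and hence
# Phi(sqrt(L1(t)/C(t)), 0, 1) = q by the definition of phi.
from scipy.stats import norm

q, L1 = 0.05, 20.0
C = L1 / norm.isf(q) ** 2
print(norm.sf((L1 / C) ** 0.5), q)   # both 0.05, up to floating point
```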

Note that the sample sets \{\bm{\theta}^{k}(t)\}_{k=1}^{M(\delta,q,t)} are drawn independently; therefore, under event \mathcal{E}_{0,c}, we have

\Pr\left[\forall k,\sum_{i\in S}\theta_{i}^{k}(t)-\sum_{i\in S^{\prime}}\theta_{i}^{k}(t)<\Delta_{S,S^{\prime}}\right]\leq(1-q)^{M(\delta,q,t)}\leq{\delta\over 12|\mathcal{I}|^{2}t^{2}},

where the last inequality is because (1-q)^{M(\delta,q,t)}\leq\exp(-qM(\delta,q,t)) and we choose M(\delta,q,t)={1\over q}\log(12|\mathcal{I}|^{2}t^{2}/\delta).

This implies that

\Pr[\neg\mathcal{E}_{2,c}|\mathcal{E}_{0,c}]\leq\sum_{t,S,S^{\prime}}{\delta\over 12|\mathcal{I}|^{2}t^{2}}\leq\sum_{t}{\delta\over 12t^{2}}\leq{\delta\over 3}.

All these show that \Pr[\mathcal{E}_{0,c}\land\mathcal{E}_{1,c}\land\mathcal{E}_{2,c}]\geq 1-\delta. ∎

It is then sufficient to prove that under event \mathcal{E}_{0,c}\land\mathcal{E}_{1,c}\land\mathcal{E}_{2,c}, TS-Explore works correctly with the complexity upper bound stated in Theorem 4.1.

Firstly, we prove that TS-Explore will output S^{*}. The proof is quite straightforward: if \hat{S}_{t}\neq S^{*}, then under event \mathcal{E}_{2,c}, there exists k such that \sum_{i\in S^{*}}\theta^{k}_{i}(t)-\sum_{i\in\hat{S}_{t}}\theta^{k}_{i}(t)\geq\Delta_{S^{*},\hat{S}_{t}}>0. Therefore \tilde{S}^{k}_{t}\neq\hat{S}_{t} and we will not output \hat{S}_{t}. Because of this, TS-Explore can only return S^{*} under event \mathcal{E}_{0,c}\land\mathcal{E}_{1,c}\land\mathcal{E}_{2,c}, which finishes the proof of its correctness.

We now bound the complexity of TS-Explore, using the following lemma in our analysis.

Lemma 4.7.

Under event \mathcal{E}_{0,c}\land\mathcal{E}_{1,c}\land\mathcal{E}_{2,c}, a base arm i will not be pulled if N_{i}(t)\geq{98\,\mathrm{width}\,C(t)L_{2}(t)\over\Delta_{i,c}^{2}}.

Due to the space limit, we only prove Lemma 4.7 for the case (\hat{S}_{t}=S^{*})\lor(\tilde{S}_{t}=S^{*}) in the main text, and defer the complete proof to Appendix A.

Proof.

We will prove this lemma by contradiction.

First we consider the case (\hat{S}_{t}=S^{*})\land(\tilde{S}_{t}\neq S^{*}). In this case, i\in S^{*}\oplus\tilde{S}_{t}, which implies \Delta_{i,c}\leq\Delta_{S^{*},\tilde{S}_{t}}. If we choose a base arm i with N_{i}(t)\geq{98\,\mathrm{width}\,C(t)L_{2}(t)\over\Delta_{i,c}^{2}}\geq{98\,\mathrm{width}\,C(t)L_{2}(t)\over\Delta_{S^{*},\tilde{S}_{t}}^{2}} to pull, we know that \forall j\in S^{*}\oplus\tilde{S}_{t}, N_{j}(t)\geq{98\,\mathrm{width}\,C(t)L_{2}(t)\over\Delta_{S^{*},\tilde{S}_{t}}^{2}}. This implies \sqrt{\sum_{j\in S^{*}\setminus\tilde{S}_{t}}{2\over N_{j}(t)}}\leq{\Delta_{S^{*},\tilde{S}_{t}}\over 7\sqrt{C(t)L_{2}(t)}} and \sqrt{\sum_{j\in\tilde{S}_{t}\setminus S^{*}}{2\over N_{j}(t)}}\leq{\Delta_{S^{*},\tilde{S}_{t}}\over 7\sqrt{C(t)L_{2}(t)}}.

By \mathcal{E}_{2,c}, there exists k such that \sum_{j\in S^{*}\setminus\tilde{S}_{t}}\theta_{j}^{k}(t)-\sum_{j\in\tilde{S}_{t}\setminus S^{*}}\theta_{j}^{k}(t)\geq\Delta_{S^{*},\tilde{S}_{t}}. By \mathcal{E}_{1,c}, |\sum_{j\in S^{*}\setminus\tilde{S}_{t}}(\theta_{j}^{k}(t)-\hat{\mu}_{j}(t))|\leq\sqrt{\sum_{j\in S^{*}\setminus\tilde{S}_{t}}{2C(t)\over N_{j}(t)}L_{2}(t)}\leq{\Delta_{S^{*},\tilde{S}_{t}}\over 7} and similarly |\sum_{j\in\tilde{S}_{t}\setminus S^{*}}(\theta_{j}^{k}(t)-\hat{\mu}_{j}(t))|\leq{\Delta_{S^{*},\tilde{S}_{t}}\over 7}. Hence

\sum_{j\in S^{*}\setminus\tilde{S}_{t}}\hat{\mu}_{j}(t)-\sum_{j\in\tilde{S}_{t}\setminus S^{*}}\hat{\mu}_{j}(t)
\geq\sum_{j\in S^{*}\setminus\tilde{S}_{t}}\theta_{j}^{k}(t)-\sum_{j\in\tilde{S}_{t}\setminus S^{*}}\theta_{j}^{k}(t)-|\sum_{j\in S^{*}\setminus\tilde{S}_{t}}(\theta_{j}^{k}(t)-\hat{\mu}_{j}(t))|-|\sum_{j\in\tilde{S}_{t}\setminus S^{*}}(\theta_{j}^{k}(t)-\hat{\mu}_{j}(t))|
\geq{5\Delta_{S^{*},\tilde{S}_{t}}\over 7}.

\mathcal{E}_{1,c} also means |\sum_{j\in S^{*}\setminus\tilde{S}_{t}}(\theta_{j}^{k_{t}^{*}}(t)-\hat{\mu}_{j}(t))|\leq{\Delta_{S^{*},\tilde{S}_{t}}\over 7} and |\sum_{j\in\tilde{S}_{t}\setminus S^{*}}(\theta_{j}^{k_{t}^{*}}(t)-\hat{\mu}_{j}(t))|\leq{\Delta_{S^{*},\tilde{S}_{t}}\over 7}. Thus we know that

\sum_{j\in S^{*}\setminus\tilde{S}_{t}}\theta_{j}^{k_{t}^{*}}(t)-\sum_{j\in\tilde{S}_{t}\setminus S^{*}}\theta_{j}^{k_{t}^{*}}(t)
\geq\sum_{j\in S^{*}\setminus\tilde{S}_{t}}\hat{\mu}_{j}(t)-\sum_{j\in\tilde{S}_{t}\setminus S^{*}}\hat{\mu}_{j}(t)-|\sum_{j\in S^{*}\setminus\tilde{S}_{t}}(\theta_{j}^{k_{t}^{*}}(t)-\hat{\mu}_{j}(t))|-|\sum_{j\in\tilde{S}_{t}\setminus S^{*}}(\theta_{j}^{k_{t}^{*}}(t)-\hat{\mu}_{j}(t))|
\geq{3\Delta_{S^{*},\tilde{S}_{t}}\over 7}.

This contradicts the fact that \tilde{S}_{t} is the optimal super arm under the k_{t}^{*}-th sample set \bm{\theta}^{k_{t}^{*}}(t) (the above shows that S^{*}=\hat{S}_{t} has a strictly larger sample reward under \bm{\theta}^{k_{t}^{*}}(t)).

Then we come to the case (\hat{S}_{t}\neq S^{*})\land(\tilde{S}_{t}=S^{*}). In this case, i\in S^{*}\oplus\hat{S}_{t}, which implies \Delta_{i,c}\leq\Delta_{S^{*},\hat{S}_{t}}. If we choose a base arm i with N_{i}(t)\geq{98\,\mathrm{width}\,C(t)L_{2}(t)\over\Delta_{i,c}^{2}}\geq{98\,\mathrm{width}\,C(t)L_{2}(t)\over\Delta_{S^{*},\hat{S}_{t}}^{2}} to pull, we have \sqrt{\sum_{j\in S^{*}\setminus\hat{S}_{t}}{2\over N_{j}(t)}}\leq{\Delta_{S^{*},\hat{S}_{t}}\over 7\sqrt{C(t)L_{2}(t)}} and \sqrt{\sum_{j\in\hat{S}_{t}\setminus S^{*}}{2\over N_{j}(t)}}\leq{\Delta_{S^{*},\hat{S}_{t}}\over 7\sqrt{C(t)L_{2}(t)}}.

By \mathcal{E}_{2,c}, there exists k such that \sum_{j\in S^{*}\setminus\hat{S}_{t}}\theta_{j}^{k}(t)-\sum_{j\in\hat{S}_{t}\setminus S^{*}}\theta_{j}^{k}(t)\geq\Delta_{S^{*},\hat{S}_{t}}. By \mathcal{E}_{1,c}, |\sum_{j\in S^{*}\setminus\hat{S}_{t}}(\theta_{j}^{k}(t)-\hat{\mu}_{j}(t))|\leq{\Delta_{S^{*},\hat{S}_{t}}\over 7} and |\sum_{j\in\hat{S}_{t}\setminus S^{*}}(\theta_{j}^{k}(t)-\hat{\mu}_{j}(t))|\leq{\Delta_{S^{*},\hat{S}_{t}}\over 7}. All these imply

\sum_{j\in S^{*}\setminus\hat{S}_{t}}\hat{\mu}_{j}(t)-\sum_{j\in\hat{S}_{t}\setminus S^{*}}\hat{\mu}_{j}(t)
\geq\sum_{j\in S^{*}\setminus\hat{S}_{t}}\theta_{j}^{k}(t)-\sum_{j\in\hat{S}_{t}\setminus S^{*}}\theta_{j}^{k}(t)-|\sum_{j\in S^{*}\setminus\hat{S}_{t}}(\theta_{j}^{k}(t)-\hat{\mu}_{j}(t))|-|\sum_{j\in\hat{S}_{t}\setminus S^{*}}(\theta_{j}^{k}(t)-\hat{\mu}_{j}(t))|
\geq{5\Delta_{S^{*},\hat{S}_{t}}\over 7}.

This contradicts the fact that \hat{S}_{t} is the empirically optimal super arm. ∎

Lemma 4.7 is similar to Lemma 10 in (Chen et al., 2014). Both give an upper bound on the number of pulls of base arm i. The key difference is that in (Chen et al., 2014), for an arm set U\in\mathcal{J}, the gap between its real mean and its upper confidence bound is \tilde{O}(\sum_{i\in U}\sqrt{1\over N_{i}}), which means that all the N_{i} are required to be \tilde{\Theta}({\mathrm{width}^{2}\over\Delta_{i,c}^{2}}) to make sure that this gap is less than \Delta_{i,c}. In our paper, based on event \mathcal{E}_{0,c}\land\mathcal{E}_{1,c}, the gap between U's real mean and the sum of the random samples in U is \tilde{O}(\sqrt{\sum_{i\in U}{C(t)\over N_{i}}}). Therefore, we only require all the N_{i} to be \tilde{\Theta}({\mathrm{width}\,C(t)\over\Delta_{i,c}^{2}}) to make sure that this gap is less than \Delta_{i,c}, and this removes a factor of \mathrm{width} from the number of pulls of base arm i (our analysis shows that C(t) is approximately a constant). In fact, removing a \mathrm{width} factor in Lemma 4.7 is the key reason that the complexity upper bound of TS-Explore is a factor of \mathrm{width} lower than that of the CLUCB policy in (Chen et al., 2014).

The novelty of our proof of Lemma 4.7 mainly lies in the event \mathcal{E}_{2,c} (as well as the mechanism of recording all the \tilde{\Delta}^{k}_{t} and focusing on the largest one): we show that under event \mathcal{E}_{2,c}, \tilde{S}_{t} has properties similar to those of the super arm with the largest upper confidence bound (which is used in LUCB-based policies such as CLUCB). For example, when \hat{S}_{t} does not equal the optimal super arm S^{*}, \mathcal{E}_{2,c} tells us that there must exist k such that \sum_{i\in S^{*}}\theta_{i}^{k}(t)-\sum_{i\in\hat{S}_{t}}\theta_{i}^{k}(t)\geq\Delta_{S^{*},\hat{S}_{t}}. Along with the fact k^{*}_{t}=\operatornamewithlimits{argmax}_{k}\tilde{\Delta}^{k}_{t}, we know \tilde{\Delta}^{k^{*}_{t}}_{t}\geq\Delta_{S^{*},\hat{S}_{t}}. This means that the reward gap between \hat{S}_{t} and \tilde{S}_{t} (\tilde{S}^{k^{*}_{t}}_{t}) is at least \Delta_{S^{*},\hat{S}_{t}}, which implies that \tilde{S}_{t} is either an insufficiently learned sub-optimal arm or the optimal arm (see the complete proof in Appendix A). This method solves the challenge of the uncertainty in random samples, and allows us to use similar analysis techniques (e.g., (Chen et al., 2014)) to prove Lemma 4.7.

By Lemma 4.7, if \forall i, N_{i}(t)\geq{98\,\mathrm{width}\,C(t)L_{2}(t)\over\Delta_{i,c}^{2}}, then TS-Explore must terminate (and output the correct answer). Thus, the complexity Z satisfies

Z\leq\sum_{i\in[m]}{98\,\mathrm{width}\,C(Z)L_{2}(Z)\over\Delta_{i,c}^{2}}=98H_{1,c}C(Z)L_{2}(Z).

For q\leq 0.1, {1\over\phi^{2}(q)}=O({1\over\log{1\over q}}). Then with C(Z)={\log(12|\mathcal{I}|^{2}Z^{2}/\delta)\over\phi^{2}(q)}, L_{2}(Z)=\log(12|\mathcal{I}|^{2}Z^{2}M(Z)/\delta) and M(Z)={\log(12|\mathcal{I}|^{2}Z^{2}/\delta)\over q}, we have (note that q\geq\delta):

Z\leq O\left(H_{1,c}{\left(\log(|\mathcal{I}|Z)+\log{1\over\delta}\right)^{2}\over\log{1\over q}}\right). (1)

Therefore, after some basic calculations (the details are deferred to Appendix B), we know that

Z=O\left(H_{1,c}{\left(\log(|\mathcal{I}|H_{1,c})+\log{1\over\delta}\right)^{2}\over\log{1\over q}}\right).
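The flavor of these calculations can be seen by iterating the recursion (1) numerically (a sketch of ours with hypothetical constants; the true bound carries an unspecified constant factor):

```python
# Iterating Z <- a * (log(|I| Z) + log(1/delta))^2 converges to a fixed point
# of order H_{1,c} (log(|I| H_{1,c}) + log(1/delta))^2 / log(1/q).
import math

def fixed_point(a, I_size, delta, Z0=10.0, iters=50):
    Z = Z0
    for _ in range(iters):
        Z = a * (math.log(I_size * Z) + math.log(1.0 / delta)) ** 2
    return Z

print(fixed_point(a=6.25 / math.log(1e3), I_size=2, delta=1e-3))   # ~143
```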

5 Experiments

In this section, we conduct experiments to compare the complexity of the efficient learning algorithms (i.e., TS-Explore and CLUCB (Chen et al., 2014)) and the optimal but non-efficient algorithm NaiveGapElim (Chen et al., 2017) in the CMAB setting. We consider the following combinatorial pure exploration problem instance with parameter n\in\mathbb{N}^{+}, and always choose q=\delta in TS-Explore. All the results (average complexities and their standard deviations) are averaged over 100 independent runs.

Problem 5.1.

For a fixed value n, there are 2n base arms. The first n base arms have expected reward 0.1, and the last n base arms have expected reward 0.9. There are only two super arms: S_{1} contains the first n base arms and S_{2} contains the last n base arms.

We first fix \delta=10^{-3}, and compare the complexity of the above algorithms for different values of n (Fig. 1(a)). We can see that when n increases, the complexities of TS-Explore and NaiveGapElim do not increase much, while the complexity of CLUCB increases linearly. This accords with our analysis, since H_{0,c}(n)=H_{1,c}(n)=2n\cdot{2n\over(0.8n)^{2}}=6.25 is a constant but H_{2,c}(n)=2n\cdot{(2n)^{2}\over(0.8n)^{2}}=12.5n is linear in n (here H_{0,c}(n),H_{1,c}(n),H_{2,c}(n) are the values of H_{0,c},H_{1,c},H_{2,c} under the problem instance with parameter n, respectively).
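These constants can be reproduced directly (a short sketch of ours):

```python
# For Problem 5.1, width = 2n (S_1 and S_2 are disjoint, so S_1 ⊕ S_2 covers
# all 2n base arms) and every Delta_{i,c} = (0.9 - 0.1) * n = 0.8n.
def hardness(n):
    gap = 0.8 * n
    h = 2 * n / gap ** 2          # sum_i 1 / Delta_{i,c}^2
    width = 2 * n
    return width * h, width ** 2 * h

for n in (2, 4, 8):
    print(n, hardness(n))         # H_{1,c}(n) = 6.25 for every n, H_{2,c}(n) = 12.5 n
```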

Then we fix n=2, and compare the complexity of the above algorithms for different values of \delta (Fig. 1(b)). We can see that when \delta is large, the complexity of TS-Explore decreases as \delta decreases, and when \delta is small, the complexity of TS-Explore increases as \delta decreases. Moreover, the complexities of TS-Explore and NaiveGapElim increase much more slowly than that of CLUCB (as \delta decreases). This also accords with our analysis. Note that there is a term O(H_{1,c}{\log^{2}(|\mathcal{I}|H_{1,c})\over\log{1\over q}}) in our complexity bound. Since we choose q=\delta, this term decreases as \delta decreases. When \delta=10^{-1}, this term is very large and dominates the complexity, and therefore the complexity decreases as \delta decreases from 10^{-1} to 10^{-3}. When \delta=10^{-3}, the term O(H_{1,c}\log{1\over\delta}) becomes dominant, so the complexity increases as \delta decreases from 10^{-3} to 10^{-5}.

Figure 1: Comparison of TS-Explore, CLUCB and NaiveGapElim. (a) \delta=10^{-3}; (b) n=2.

All the experimental results indicate that TS-Explore outperforms CLUCB (especially when the size of the problem is large or the error constraint \delta is small). On the other hand, the complexity of our efficient algorithm TS-Explore is only a little higher than that of the optimal but non-efficient algorithm NaiveGapElim. These results demonstrate the effectiveness of our algorithm.

6 Conclusions

In this paper, we explore the idea of Thompson Sampling to solve pure exploration problems under the frequentist setting. We first propose TS-Explore, an efficient policy that uses random samples to make decisions, and then show that: i) in the combinatorial multi-armed bandit setting, our policy achieves a lower complexity bound than existing efficient policies; and ii) in the classic multi-armed bandit setting, our policy achieves an asymptotically optimal complexity bound.

There remain many interesting topics to be further studied, e.g., how to achieve the optimal complexity bound in the combinatorial multi-armed bandit setting by seeking detailed information about the combinatorial structure in TS-Explore; and what are the complexity bounds of using our TS-Explore framework in other pure exploration problems (e.g., dueling bandit and linear bandit). It is also worth studying how to design TS-based pure exploration algorithms for the fixed budget setting.

Acknowledgement

This work was supported by the National Key Research and Development Program of China (2017YFA0700904, 2020AAA0104304), NSFC Projects (Nos. 62061136001, 62106122, 62076147, U19B2034, U1811461, U19A2081), Beijing NSF Project (No. JQ19016), Tsinghua-Huawei Joint Research Program, and a grant from Tsinghua Institute for Guo Qiang.

References

  • Agrawal & Goyal (2013) Agrawal, S. and Goyal, N. Further optimal regret bounds for thompson sampling. In International Conference on Artificial Intelligence and Statistics, 2013.
  • Audibert et al. (2010) Audibert, J. Y., Bubeck, S., and Munos, R. Best arm identification in multi-armed bandits. In COLT 2010 - the Conference on Learning Theory, Haifa, Israel, June, pp.  41–53, 2010.
  • Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
  • Berry & Fristedt (1985) Berry, D. A. and Fristedt, B. Bandit problems: sequential allocation of experiments (Monographs on statistics and applied probability). Springer, 1985.
  • Chen et al. (2017) Chen, L., Gupta, A., Li, J., Qiao, M., and Wang, R. Nearly optimal sampling algorithms for combinatorial pure exploration. In Conference on Learning Theory, pp.  482–534. PMLR, 2017.
  • Chen et al. (2014) Chen, S., Lin, T., King, I., Lyu, M. R., and Chen, W. Combinatorial pure exploration of multi-armed bandits. In International Conference on Neural Information Processing Systems, pp.  379–387, 2014.
  • Chen et al. (2013) Chen, W., Wang, Y., and Yuan, Y. Combinatorial multi-armed bandit: General framework, results and applications. In Proceedings of the 30th International Conference on Machine Learning, pp.  151–159, 2013.
  • Combes et al. (2015) Combes, R., Talebi, S., Proutière, A., and Lelarge, M. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, 2015.
  • Even-Dar et al. (2006) Even-Dar, E., Mannor, S., Mansour, Y., and Mahadevan, S. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(6), 2006.
  • Gabillon et al. (2016) Gabillon, V., Lazaric, A., Ghavamzadeh, M., Ortner, R., and Bartlett, P. Improved learning complexity in combinatorial pure exploration bandits. In Artificial Intelligence and Statistics, pp.  1004–1012. PMLR, 2016.
  • Jourdan et al. (2021) Jourdan, M., Mutnỳ, M., Kirschner, J., and Krause, A. Efficient pure exploration for combinatorial bandits with semi-bandit feedback. In Algorithmic Learning Theory, pp.  805–849. PMLR, 2021.
  • Kalyanakrishnan et al. (2012) Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. Pac subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning, 2012.
  • Kaufmann & Kalyanakrishnan (2013) Kaufmann, E. and Kalyanakrishnan, S. Information complexity in bandit subset selection. In Conference on Learning Theory, pp.  228–251, 2013.
  • Kaufmann et al. (2012) Kaufmann, E., Korda, N., and Munos, R. Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, 2012.
  • Li et al. (2021) Li, J., Du, C., and Zhu, J. A Bayesian approach for subset selection in contextual bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 8384–8391, 2021.
  • Perrault et al. (2020) Perrault, P., Boursier, E., Perchet, V., and Valko, M. Statistical efficiency of Thompson sampling for combinatorial semi-bandits. In Advances in Neural Information Processing Systems, 2020.
  • Russo (2016) Russo, D. Simple bayesian algorithms for best arm identification. In Conference on Learning Theory, pp.  1417–1418. PMLR, 2016.
  • Shang et al. (2020) Shang, X., Heide, R., Menard, P., Kaufmann, E., and Valko, M. Fixed-confidence guarantees for Bayesian best-arm identification. In International Conference on Artificial Intelligence and Statistics, pp. 1823–1832. PMLR, 2020.
  • Thompson (1933) Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
  • Wang & Chen (2018) Wang, S. and Chen, W. Thompson sampling for combinatorial semi-bandit. In International Conference on Machine Learning, 2018.

Appendix A The Complete Proof of Lemma 4.7

Lemma 4.7 (restated) Under event \mathcal{E}_{0,c}\land\mathcal{E}_{1,c}\land\mathcal{E}_{2,c}, a base arm i will not be pulled if N_{i}(t)\geq{98DC(t)L_{2}(t)\over\Delta_{i,c}^{2}}, where D=\max_{S\in\mathcal{I}}|S| denotes the maximum size of a super arm.

Proof.

We still prove this lemma by contradiction.

Assume that arm i is pulled at time t with N_{i}(t)\geq{98DC(t)L_{2}(t)\over\Delta_{i,c}^{2}}. Since the pulled arm lies in \hat{S}_{t}\oplus\tilde{S}_{t}, there are four possibilities: (i\in\hat{S}_{t})\land(i\in S^{*}), (i\in\hat{S}_{t})\land(i\notin S^{*}), (i\in\tilde{S}_{t})\land(i\in S^{*}), and (i\in\tilde{S}_{t})\land(i\notin S^{*}).

Case i): (i\in\hat{S}_{t})\land(i\in S^{*}) or (i\in\tilde{S}_{t})\land(i\notin S^{*}). In either case, i\in S^{*}\oplus\tilde{S}_{t} and therefore \Delta_{i,c}\leq\Delta_{S^{*},\tilde{S}_{t}}.

Since \hat{S}_{t}={\sf Oracle}(\bm{\hat{\mu}}(t)), we have that \sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}\hat{\mu}_{j}(t)\leq\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}\hat{\mu}_{j}(t). By event \mathcal{E}_{1,c}, we also have that for any k, |\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}(\theta_{j}^{k}(t)-\hat{\mu}_{j}(t))|\leq\sqrt{C(t)\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}{2\over N_{j}(t)}L_{2}(t)}\leq{\Delta_{i,c}\over 7}, and similarly |\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}(\theta_{j}^{k}(t)-\hat{\mu}_{j}(t))|\leq\sqrt{C(t)\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}{2\over N_{j}(t)}L_{2}(t)}\leq{\Delta_{i,c}\over 7} (recall that N_{i}(t)\geq{98DC(t)L_{2}(t)\over\Delta_{i,c}^{2}} and i=\operatornamewithlimits{argmin}_{j\in\hat{S}_{t}\oplus\tilde{S}_{t}}N_{j}(t)).
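Explicitly, both {\Delta_{i,c}\over 7} bounds follow by direct substitution, since each of the two difference sets contains at most D base arms and i attains the minimum pull count within \hat{S}_{t}\oplus\tilde{S}_{t}:

\sqrt{C(t)\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}{2\over N_{j}(t)}L_{2}(t)}\leq\sqrt{{2DC(t)L_{2}(t)\over N_{i}(t)}}\leq\sqrt{2DC(t)L_{2}(t)\cdot{\Delta_{i,c}^{2}\over 98DC(t)L_{2}(t)}}={\Delta_{i,c}\over 7}.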

Therefore, for any k, we have that

\displaystyle\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}\theta_{j}^{k}(t)-\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}\theta_{j}^{k}(t)
\displaystyle=\left(\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}\hat{\mu}_{j}(t)-\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}\hat{\mu}_{j}(t)\right)+\left(\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}(\theta_{j}^{k}(t)-\hat{\mu}_{j}(t))\right)-\left(\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}(\theta_{j}^{k}(t)-\hat{\mu}_{j}(t))\right)
\displaystyle\leq\left(\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}\hat{\mu}_{j}(t)-\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}\hat{\mu}_{j}(t)\right)+\left|\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}(\theta_{j}^{k}(t)-\hat{\mu}_{j}(t))\right|+\left|\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}(\theta_{j}^{k}(t)-\hat{\mu}_{j}(t))\right|
\displaystyle\leq 0+{\Delta_{i,c}\over 7}+{\Delta_{i,c}\over 7}={2\over 7}\Delta_{i,c}.

This means that \tilde{\Delta}_{t}^{k_{t}^{*}}=\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}\theta_{j}^{k_{t}^{*}}(t)-\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}\theta_{j}^{k_{t}^{*}}(t)\leq{2\over 7}\Delta_{i,c}.

Moreover, since \sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}\theta_{j}^{k_{t}^{*}}(t)\geq\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}\theta_{j}^{k_{t}^{*}}(t) (as \tilde{S}_{t} maximizes the sampled reward under \bm{\theta}^{k_{t}^{*}}(t)), we have that

\displaystyle\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}\hat{\mu}_{j}(t)-\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}\hat{\mu}_{j}(t)\qquad(2)
\displaystyle=\left(\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}\theta_{j}^{k_{t}^{*}}(t)-\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}\theta_{j}^{k_{t}^{*}}(t)\right)-\left(\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}(\theta_{j}^{k_{t}^{*}}(t)-\hat{\mu}_{j}(t))\right)+\left(\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}(\theta_{j}^{k_{t}^{*}}(t)-\hat{\mu}_{j}(t))\right)
\displaystyle\geq\left(\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}\theta_{j}^{k_{t}^{*}}(t)-\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}\theta_{j}^{k_{t}^{*}}(t)\right)-\left|\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}(\theta_{j}^{k_{t}^{*}}(t)-\hat{\mu}_{j}(t))\right|-\left|\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}(\theta_{j}^{k_{t}^{*}}(t)-\hat{\mu}_{j}(t))\right|
\displaystyle\geq 0-{\Delta_{i,c}\over 7}-{\Delta_{i,c}\over 7}=-{2\over 7}\Delta_{i,c}.

On the other hand, by event \mathcal{E}_{2,c}, we know that \exists k^{\prime} such that \sum_{j\in S^{*}}\theta_{j}^{k^{\prime}}(t)-\sum_{j\in\tilde{S}_{t}}\theta_{j}^{k^{\prime}}(t)\geq\Delta_{S^{*},\tilde{S}_{t}}. Then

\displaystyle\sum_{j\in S^{*}}\theta_{j}^{k^{\prime}}(t)-\sum_{j\in\hat{S}_{t}}\theta_{j}^{k^{\prime}}(t)
\displaystyle=\left(\sum_{j\in S^{*}}\theta_{j}^{k^{\prime}}(t)-\sum_{j\in\tilde{S}_{t}}\theta_{j}^{k^{\prime}}(t)\right)+\left(\sum_{j\in\tilde{S}_{t}}\theta_{j}^{k^{\prime}}(t)-\sum_{j\in\hat{S}_{t}}\theta_{j}^{k^{\prime}}(t)\right)
\displaystyle\geq\Delta_{S^{*},\tilde{S}_{t}}+\left(\sum_{j\in\tilde{S}_{t}}\theta_{j}^{k^{\prime}}(t)-\sum_{j\in\hat{S}_{t}}\theta_{j}^{k^{\prime}}(t)\right)
\displaystyle=\Delta_{S^{*},\tilde{S}_{t}}+\left(\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}\theta_{j}^{k^{\prime}}(t)-\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}\theta_{j}^{k^{\prime}}(t)\right)
\displaystyle=\Delta_{S^{*},\tilde{S}_{t}}+\left(\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}\hat{\mu}_{j}(t)-\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}\hat{\mu}_{j}(t)\right)+\left(\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}(\theta_{j}^{k^{\prime}}(t)-\hat{\mu}_{j}(t))\right)-\left(\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}(\theta_{j}^{k^{\prime}}(t)-\hat{\mu}_{j}(t))\right)
\displaystyle\geq\Delta_{S^{*},\tilde{S}_{t}}-{2\over 7}\Delta_{i,c}-\left|\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}(\theta_{j}^{k^{\prime}}(t)-\hat{\mu}_{j}(t))\right|-\left|\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}(\theta_{j}^{k^{\prime}}(t)-\hat{\mu}_{j}(t))\right|
\displaystyle\geq\Delta_{S^{*},\tilde{S}_{t}}-{4\over 7}\Delta_{i,c}
\displaystyle\geq{3\over 7}\Delta_{i,c}>{2\over 7}\Delta_{i,c},

where the inequality introducing -{2\over 7}\Delta_{i,c} follows from Eq. (2), the next one uses event \mathcal{E}_{1,c}, and the last one uses \Delta_{S^{*},\tilde{S}_{t}}\geq\Delta_{i,c}. This means that \tilde{\Delta}^{k^{\prime}}_{t}>{2\over 7}\Delta_{i,c}, which contradicts \tilde{\Delta}_{t}^{k_{t}^{*}}\leq{2\over 7}\Delta_{i,c} (since k_{t}^{*}=\operatornamewithlimits{argmax}_{k}\tilde{\Delta}^{k}_{t}).

Case ii): (i\in\hat{S}_{t})\land(i\notin S^{*}) or (i\in\tilde{S}_{t})\land(i\in S^{*}). In either case, i\in S^{*}\oplus\hat{S}_{t} and therefore \Delta_{i,c}\leq\Delta_{S^{*},\hat{S}_{t}}.

By event \mathcal{E}_{2,c}, we know that \exists k,\sum_{j\in S^{*}}\theta_{j}^{k}(t)-\sum_{j\in\hat{S}_{t}}\theta_{j}^{k}(t)\geq\Delta_{S^{*},\hat{S}_{t}}. Hence \tilde{\Delta}_{t}^{k}\geq\Delta_{S^{*},\hat{S}_{t}}. Moreover, since k_{t}^{*}=\operatornamewithlimits{argmax}_{k}\tilde{\Delta}_{t}^{k}, we have that \sum_{j\in\tilde{S}_{t}}\theta_{j}^{k_{t}^{*}}(t)-\sum_{j\in\hat{S}_{t}}\theta_{j}^{k_{t}^{*}}(t)\geq\Delta_{S^{*},\hat{S}_{t}}, which is the same as \sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}\theta_{j}^{k_{t}^{*}}(t)-\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}\theta_{j}^{k_{t}^{*}}(t)\geq\Delta_{S^{*},\hat{S}_{t}}. On the other hand, by event \mathcal{E}_{1,c}, we have that |\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}(\theta_{j}^{k_{t}^{*}}(t)-\hat{\mu}_{j}(t))|\leq\sqrt{C(t)\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}{2\over N_{j}(t)}L_{2}(t)}\leq{\Delta_{i,c}\over 7}, and similarly |\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}(\theta_{j}^{k_{t}^{*}}(t)-\hat{\mu}_{j}(t))|\leq{\Delta_{i,c}\over 7} (again using N_{i}(t)\geq{98DC(t)L_{2}(t)\over\Delta_{i,c}^{2}} and i=\operatornamewithlimits{argmin}_{j\in\hat{S}_{t}\oplus\tilde{S}_{t}}N_{j}(t)).

Therefore,

\displaystyle\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}\hat{\mu}_{j}(t)-\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}\hat{\mu}_{j}(t)
\displaystyle=\left(\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}\theta_{j}^{k_{t}^{*}}(t)-\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}\theta_{j}^{k_{t}^{*}}(t)\right)-\left(\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}(\theta_{j}^{k_{t}^{*}}(t)-\hat{\mu}_{j}(t))\right)+\left(\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}(\theta_{j}^{k_{t}^{*}}(t)-\hat{\mu}_{j}(t))\right)
\displaystyle\geq\left(\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}\theta_{j}^{k_{t}^{*}}(t)-\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}\theta_{j}^{k_{t}^{*}}(t)\right)-\left|\sum_{j\in\tilde{S}_{t}\setminus\hat{S}_{t}}(\theta_{j}^{k_{t}^{*}}(t)-\hat{\mu}_{j}(t))\right|-\left|\sum_{j\in\hat{S}_{t}\setminus\tilde{S}_{t}}(\theta_{j}^{k_{t}^{*}}(t)-\hat{\mu}_{j}(t))\right|
\displaystyle\geq\Delta_{S^{*},\hat{S}_{t}}-{2\over 7}\Delta_{i,c}
\displaystyle\geq{5\over 7}\Delta_{i,c}>0.

This means that \sum_{j\in\tilde{S}_{t}}\hat{\mu}_{j}(t)-\sum_{j\in\hat{S}_{t}}\hat{\mu}_{j}(t)>0, which contradicts the fact that \hat{S}_{t}={\sf Oracle}(\bm{\hat{\mu}}(t)). ∎

Appendix B How to Obtain the Complexity Upper Bound from Eq. (1)

Eq. (1) is restated below:

Z\leq O\left(H_{1,c}{\left(\log(|\mathcal{I}|Z)+\log{1\over\delta}\right)^{2}\over\log{1\over q}}\right).

We can then use the following lemma to derive an upper bound on the complexity Z.

Lemma B.1.

Given K functions f_{1}(x),\cdots,f_{K}(x) and K positive values X_{1},\cdots,X_{K}, if Kf_{k}(x)<x holds for every 1\leq k\leq K and every x\geq X_{k}, then \sum_{k}f_{k}(x)<x for any x\geq\sum_{k}X_{k}.

Proof.

Since X_{1},\cdots,X_{K} are positive, any x\geq\sum_{k}X_{k} satisfies x\geq X_{k} for every k. Therefore Kf_{k}(x)<x for every k, and summing over k gives

\displaystyle\sum_{k}Kf_{k}(x)<\sum_{k}x=Kx.

Dividing both sides by K yields \sum_{k}f_{k}(x)<x. ∎

To apply Lemma B.1, we set f_{1}(Z)=H_{1,c}{\log^{2}(|\mathcal{I}|Z)\over\log{1\over q}}, f_{2}(Z)=H_{1,c}{\log(|\mathcal{I}|Z)\log{1\over\delta}\over\log{1\over q}}, and f_{3}(Z)=H_{1,c}{\log^{2}{1\over\delta}\over\log{1\over q}}. After some basic calculations, we get that X_{1}\leq c_{1}H_{1,c}{\log^{2}(|\mathcal{I}|H_{1,c})\over\log{1\over q}}, X_{2}\leq c_{2,1}H_{1,c}{\log(|\mathcal{I}|H_{1,c})\log{1\over\delta}\over\log{1\over q}}+c_{2,2}H_{1,c}{\log^{2}{1\over\delta}\over\log{1\over q}}, and X_{3}\leq c_{3}H_{1,c}{\log^{2}{1\over\delta}\over\log{1\over q}}. Here c_{1},c_{2,1},c_{2,2},c_{3} are universal constants.

Then we know that for Z\geq\Theta\left(H_{1,c}\left(\log{1\over\delta}+\log(|\mathcal{I}|H_{1,c})\right){\log{1\over\delta}\over\log{1\over q}}+H_{1,c}{\log^{2}(|\mathcal{I}|H_{1,c})\over\log{1\over q}}\right), we have f_{1}(Z)+f_{2}(Z)+f_{3}(Z)<Z (by Lemma B.1). This contradicts Eq. (1). Therefore, we know that

Z=O\left(H_{1,c}{\left(\log(|\mathcal{I}|H_{1,c})+\log{1\over\delta}\right)^{2}\over\log{1\over q}}\right)=O\left(H_{1,c}\left(\log{1\over\delta}+\log(|\mathcal{I}|H_{1,c})\right){\log{1\over\delta}\over\log{1\over q}}+H_{1,c}{\log^{2}(|\mathcal{I}|H_{1,c})\over\log{1\over q}}\right),

and this is the complexity upper bound in Theorem 4.1.
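As an illustration of this derivation, the implicit bound in Eq. (1) can also be resolved numerically: since the right-hand side grows only polylogarithmically in Z, a fixed-point iteration converges quickly. The following sketch uses hypothetical values for H_{1,c}, |\mathcal{I}|, \delta, and q, and treats the hidden big-O constant as 1; it is an illustration, not part of the analysis.

```python
# Minimal numeric sketch: resolve Z <= H * (log(|I| Z) + log(1/delta))^2 / log(1/q)
# by fixed-point iteration (toy values; the big-O constant is taken to be 1).
import math

H, I_size, delta, q = 50.0, 1000, 0.05, 0.5  # hypothetical H_{1,c}, |I|, delta, q

def rhs(z):
    # Right-hand side of Eq. (1); grows only polylogarithmically in z.
    return H * (math.log(I_size * z) + math.log(1 / delta)) ** 2 / math.log(1 / q)

z = 1.0
for _ in range(100):
    z = rhs(z)  # converges: near the fixed point, rhs changes slowly in z

print(z)  # approximate threshold: any Z satisfying Eq. (1) is at most about z
```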

Appendix C Discussions about the Optimal Algorithms for Combinatorial Pure Exploration

Chen et al. (2017) prove that the complexity lower bound for combinatorial pure exploration is \Omega(H_{0,c}\log{1\over\delta}), where H_{0,c} is the optimal value of the following convex program (here \Delta_{S^{*},S}=\sum_{i\in S^{*}}\mu_{i}-\sum_{i\in S}\mu_{i}):

\min\quad\sum_{i\in[m]}N_{i}
\mathrm{s.t.}\quad\sum_{i\in S^{*}\oplus S}{1\over N_{i}}\leq\Delta_{S^{*},S}^{2},\quad\forall S\in\mathcal{I},S\neq S^{*}

In other words, for any correct combinatorial pure exploration algorithm, [{\mathbb{E}[N_{1}]\over\log{1\over\delta}},{\mathbb{E}[N_{2}]\over\log{1\over\delta}},\cdots,{\mathbb{E}[N_{m}]\over\log{1\over\delta}}] must be a feasible solution of the above convex program, where \mathbb{E}[N_{i}] denotes the expected number of pulls of base arm i. We say base arm i needs exploration the most at time t if \alpha N_{i}^{*}\log{1\over\delta}-N_{i}(t) is positive, where \alpha is some universal constant and N_{i}^{*} is the value of N_{i} in the optimal solution of the above convex program (note that several base arms may need exploration the most at the same time step). By this definition, if we always pull a base arm that needs exploration the most, then the frequency of pulling each base arm converges to the optimal allocation, which leads to an optimal complexity upper bound. A small worked instance is sketched below.
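For concreteness, this convex program can be solved directly on small instances. The following sketch (our illustration, not code from Chen et al. (2017)) uses the cvxpy package on a hypothetical instance whose super arms are singletons, so that S^{*}\oplus S contains exactly two base arms:

```python
# A minimal sketch (illustrative): solve the convex program defining H_{0,c}
# on a toy instance where each super arm is a singleton {i}, so that
# S* ⊕ S = {best arm, s} for every suboptimal super arm S = {s}.
import cvxpy as cp

mu = [0.9, 0.7, 0.5]              # hypothetical base arm means; S* = {arm 0}
m = len(mu)
N = cp.Variable(m, pos=True)      # N_i: pulls of arm i per unit of log(1/delta)

constraints = []
for s in range(1, m):             # one constraint per suboptimal super arm
    gap = mu[0] - mu[s]           # Delta_{S*, {s}}
    constraints.append(cp.inv_pos(N[0]) + cp.inv_pos(N[s]) <= gap ** 2)

prob = cp.Problem(cp.Minimize(cp.sum(N)), constraints)
H0c = prob.solve()                # optimal value H_{0,c}
print(H0c, N.value)               # N.value is the optimal allocation N_i*
```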

However, the simple offline oracle used in TS-Explore (as described in Section 3.2) is not enough to look for a base arm that needs exploration the most. In fact, both existing optimal policies (Chen et al., 2017; Jourdan et al., 2021) rely not only on this simple offline oracle, but also on additional mechanisms that extract detailed information about the combinatorial structure of the problem instance in order to find a base arm that needs exploration the most. The algorithms in (Chen et al., 2017) need to record all the super arms in \mathcal{I} with an empirical mean larger than some threshold \eta. This is one kind of structural information that can help to find a base arm that needs exploration the most. Nevertheless, the authors only provide an efficient way to do this when the combinatorial structure satisfies some specific constraints; in the most general setting, the algorithms in (Chen et al., 2017) must pay an exponential time cost to collect this information. As for (Jourdan et al., 2021), the best-response oracle used by the \lambda-player needs to return a super arm within the set N(S_{t}) that has the shortest distance to a fixed target. Here S_{t} is a super arm, N(S_{t}) denotes the set of super arms whose cells' boundaries intersect the boundary of the cell of S_{t}, and the cell of a super arm S_{t} is defined as the set of all parameter vectors [\theta_{1},\theta_{2},\cdots,\theta_{m}] under which S_{t} is the optimal super arm. This is another kind of structural information that can help to find a base arm that needs exploration the most, but this best-response oracle also has an exponential time cost (which scales with |N(S_{t})|). By using these exponential-time mechanisms, the optimal algorithms (Chen et al., 2017; Jourdan et al., 2021) can find a base arm that needs exploration the most, which is critical to achieving the optimal complexity upper bound.

In this paper, to make TS-Explore efficient in the most general setting, we only use the simple offline oracle in our algorithm, and our mechanism can only inform us of one violated constraint in the optimization problem (if no constraint is violated, TS-Explore will output the correct optimal super arm). This means that we know nothing else about the combinatorial structure of the problem instance. Therefore, the best we can do is to treat all the base arms in the violated constraint equally, i.e., we choose to pull the base arm (in the violated constraint) with the smallest number of pulls. This leads to a complexity upper bound of O(H_{1,c}\log{1\over\delta}). If we want TS-Explore to achieve the optimal complexity upper bound O(H_{0,c}\log{1\over\delta}), then we need to know which base arm needs exploration the most, e.g., by applying a more powerful offline oracle that takes the empirical means and random samples as input and outputs a base arm that needs exploration the most. How to design such offline oracles and how to implement them efficiently are among our future research topics.
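For concreteness, here is a minimal sketch of what the simple offline oracle looks like for one specific structure, top-K subsets; the structure and the function name are illustrative assumptions, since TS-Explore only requires an exact maximization oracle for whatever structure \mathcal{I} has:

```python
# A minimal sketch of a simple offline oracle for the top-K structure, where
# I = {S ⊆ [m] : |S| = K} and Oracle(w) = argmax_{S in I} sum_{i in S} w_i.
# The top-K structure and the function name are illustrative assumptions.
import numpy as np

def oracle_top_k(weights: np.ndarray, k: int) -> set:
    """Return the index set S of size k maximizing sum_{i in S} weights[i]."""
    return set(np.argsort(weights)[-k:].tolist())

print(oracle_top_k(np.array([0.3, 0.9, 0.1, 0.7]), 2))  # {1, 3}
```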

Appendix D Concentration Inequality of Sub-Gaussian Random Variables

Fact D.1.

If X is zero-mean and \sigma^{2}-sub-Gaussian, then for any \epsilon>0,

\Pr[|X|>\epsilon]\leq 2\exp\left(-{\epsilon^{2}\over 2\sigma^{2}}\right).
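As a quick empirical illustration (not part of the analysis), the bound can be checked by Monte Carlo simulation for a standard Gaussian, which is 1-sub-Gaussian:

```python
# Monte Carlo sanity check (illustration only) of the sub-Gaussian tail bound
# Pr[|X| > eps] <= 2 exp(-eps^2 / (2 sigma^2)) for X ~ N(0, 1), i.e., sigma = 1.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1_000_000)
for eps in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(x) > eps)   # empirical tail probability
    bound = 2 * np.exp(-eps**2 / 2)        # sub-Gaussian bound with sigma = 1
    print(f"eps={eps}: empirical={empirical:.4f}, bound={bound:.4f}")
```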