Thompson Sampling for (Combinatorial) Pure Exploration
Abstract
Existing methods for combinatorial pure exploration mainly follow the UCB approach. To keep the algorithm efficient, they usually use the sum of the upper confidence bounds of the base arms within an arm set $S$ to represent the upper confidence bound of $S$. Since the empirical means of different arms in $S$ are independent, this quantity can be much larger than the tight upper confidence bound of $S$, which leads to a much higher complexity than necessary. To deal with this challenge, we explore the idea of Thompson Sampling (TS), which uses independent random samples instead of upper confidence bounds, and design the first TS-based algorithm, TS-Explore, for (combinatorial) pure exploration. In TS-Explore, the sum of the independent random samples within an arm set $S$ will not exceed the tight upper confidence bound of $S$ with high probability. Hence it solves the above challenge, and achieves a lower complexity upper bound than existing efficient UCB-based algorithms in general combinatorial pure exploration. For pure exploration in the classic multi-armed bandit, we show that TS-Explore achieves an asymptotically optimal complexity upper bound.
1 Introduction
Pure exploration is an important task in online learning: it tries to find the target arm as fast as possible. In pure exploration of the classic multi-armed bandit (MAB) (Audibert et al., 2010), there are $n$ arms in total, and each arm $i$ is associated with a probability distribution $D_i$ with mean $\mu_i$. Once arm $i$ is pulled, it returns an observation drawn independently from $D_i$ by the environment. At each time step $t$, the learning policy either chooses an arm to pull, or chooses to output an arm. The goal of the learning policy is to pull arms properly such that, with an error probability at most $\delta$, its output arm is the optimal arm (i.e., the arm with the largest expected reward), with complexity (the total number of observations) as low as possible.
Pure exploration is widely adopted in real applications. For example, in the selling procedure of cosmetics, there is always a testing phase before the commercialization phase (Audibert et al., 2010). The goal of the testing phase is to help maximize the cumulative reward collected in the commercialization phase. Therefore, instead of regret minimization (Berry & Fristedt, 1985; Auer et al., 2002), the testing phase only needs to do exploration (e.g., to investigate which product is the most popular one), and wants to find the target with both a correctness guarantee and low cost. In the real world, sometimes the system cares about the best action under a specific combinatorial structure, instead of the best single arm (Chen et al., 2014). For example, a network routing system needs to search for the path with minimum delay between the source and the destination. Since there can be an exponential number of paths, the cost of exploring them separately is unacceptable. Therefore, people choose to find the best path by exploring single edges, and this is a pure exploration problem instance in the combinatorial multi-armed bandit (CMAB). In this setting, we still pull single arms (base arms) at each time step, and there is a super arm set $\mathcal{S}$, where each super arm $S \in \mathcal{S}$ is a subset of base arms. The expected reward of a super arm $S$ is the sum of the expected rewards of its contained base arms. The goal of the player is to find the optimal super arm with error probability at most $\delta$ and complexity as low as possible.
Most of the existing solutions for pure exploration follow the UCB approach (Audibert et al., 2010; Kalyanakrishnan et al., 2012; Chen et al., 2014; Kaufmann & Kalyanakrishnan, 2013). They compute confidence bounds for all the arms, and claim that an arm is optimal only if its lower confidence bound is larger than the upper confidence bounds of all the other arms. Though this approach is asymptotically optimal in pure exploration of classic MAB problems (Kalyanakrishnan et al., 2012; Kaufmann & Kalyanakrishnan, 2013), it faces some challenges in the CMAB setting (Chen et al., 2014). In the UCB approach, for a super arm $S$, the algorithm usually uses the sum of the upper confidence bounds of all its contained base arms as the upper confidence bound of $S$, to achieve a low implementation cost. This means that the gap between the empirical mean of super arm $S$ and this upper confidence bound is about $|S|$ times a per-arm confidence radius (which depends on the number of observations of each base arm in $S$). However, since the observations of different base arms are independent, the standard deviation of the empirical mean of super arm $S$ is only about $\sqrt{|S|}$ times a per-arm radius, which is much smaller. This means that the used upper confidence bound of $S$ is much larger than the tight upper confidence bound of $S$. Combes et al. (2015) deal with this problem by computing the upper confidence bounds of all the super arms independently, which leads to an exponential time cost and is hard to implement in practice. In fact, existing efficient UCB-based solutions either suffer from a high complexity bound (Chen et al., 2014; Gabillon et al., 2016), or need further assumptions on the combinatorial structure to achieve an optimal complexity bound efficiently (Chen et al., 2017).
A natural way to deal with this challenge is to use a random sample of each arm (with mean equal to its empirical mean and standard deviation comparable to its confidence radius), instead of its upper confidence bound, to judge whether a super arm is optimal. If we let the random samples of different base arms be independent, then with high probability, the gap between the empirical mean of super arm $S$ and the sum of the random samples within $S$ is only about $\sqrt{|S|}$ times a per-arm radius, which has the same order as the gap between the empirical mean of super arm $S$ and its real mean. Therefore, using independent random samples can behave better than using confidence bounds, and this is the key idea of Thompson Sampling (TS) (Thompson, 1933; Kaufmann et al., 2012; Agrawal & Goyal, 2013). In fact, many prior works show that TS-based algorithms achieve smaller cumulative regret than UCB-based algorithms in regret minimization for the CMAB model (Wang & Chen, 2018; Perrault et al., 2020). However, studies on adapting the idea of TS, i.e., using random samples to make decisions, to the pure exploration setting are still lacking.
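To make this scaling argument concrete, here is a small numerical illustration (it is not taken from the paper; the confidence-radius formula, the constants, and all variable names below are only illustrative choices):

```python
# Compare the sum of per-arm confidence radii with the typical deviation of a
# sum of independent per-arm samples, for a super arm of K base arms.
import numpy as np

K, T, delta = 20, 100, 0.05
r = np.sqrt(np.log(2 / delta) / (2 * T))      # a typical per-arm confidence radius
ucb_style_gap = K * r                         # sum of per-arm upper-confidence gaps

samples = np.random.normal(0.0, r, size=(100_000, K))            # independent per-arm samples
ts_style_gap = np.quantile(np.abs(samples.sum(axis=1)), 1 - delta)

print(f"sum of radii: {ucb_style_gap:.3f}, typical deviation of sample sum: {ts_style_gap:.3f}")
# the second number is smaller by roughly a sqrt(K) factor, up to constants
```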
In this paper, we attempt to fill this gap, and study the use of random samples in pure exploration under the frequentist setting for both MAB and CMAB instances. We emphasize that it is non-trivial to design (and analyze) such a TS-based algorithm. The first challenge is that there is much more uncertainty in the random samples than in the confidence bounds. In UCB-based algorithms, the upper confidence bound of the optimal arm is larger than its expected reward with high probability. Thus, the arm with the largest upper confidence bound is either an insufficiently learned sub-optimal arm (i.e., the number of its observations is not enough to make sure that it is a sub-optimal arm) or the optimal arm, which means that the number of pulls on sufficiently learned sub-optimal arms is limited. However, for TS-based algorithms that use random samples instead, there is always a constant probability that the random sample of the optimal arm is smaller than its expected reward. In this case, the arm with the largest random sample may be a sufficiently learned sub-optimal arm. Therefore, the mechanism of the TS-based policy must be designed carefully to make sure that we can still obtain an upper bound on the number of pulls of sufficiently learned sub-optimal arms.
Another challenge is that making decisions with random samples is essentially a Bayesian approach, and it loses many good properties in the frequentist setting. In the Bayesian setting, at each time step $t$, the parameters of the game follow a posterior distribution, and the random samples are drawn independently from this posterior. Therefore, using the random samples to output the optimal arm explicitly ensures the correctness of the TS-based algorithm. However, in the frequentist setting, the parameters of the game are fixed but unknown, and they have no such correlation with the random samples. Because of this, if we still choose to use the random samples to output the optimal arm, then the distributions of the random samples in the TS-based algorithm need to be chosen carefully to make sure that we can still obtain a correctness guarantee in the frequentist setting.
Besides, the analysis of a TS-based algorithm in pure exploration is also very different from that in regret minimization. In regret minimization, at each time step, we only need to draw one sample for each arm (Thompson, 1933; Agrawal & Goyal, 2013; Wang & Chen, 2018; Perrault et al., 2020). However, in pure exploration, one set of samples at each time step is not enough, since the algorithm needs to i) check whether there is an arm that is the optimal one with high probability; and ii) look for an arm that needs exploration. Neither of these two goals can be achieved with only one set of samples. Therefore, we must draw several sets of samples to make decisions, and this invalidates the existing analyses of TS for regret minimization.
In this paper, we solve the above challenges, and design a TS-based algorithm, TS-Explore, for (combinatorial) pure exploration under the frequentist setting with polynomial implementation cost. At each time step $t$, TS-Explore first draws $M$ independent random samples for each (base) arm, i.e., $M$ independent sample sets in total. Then it finds the best (super) arm under each of the $M$ sample sets. If in all these sample sets the best (super) arm is the same as the empirically best (super) arm (i.e., the best arm under the empirical means), then the algorithm outputs this (super) arm as the optimal one. Otherwise, for every sample set whose best (super) arm differs from the empirically best one, the algorithm records the reward gap between the two under that sample set. Then it focuses on the sample set with the largest recorded reward gap, and chooses to pull a (base) arm in the exchange set of its best (super) arm and the empirically best (super) arm.
Recording these reward gaps and focusing on the largest one is the key mechanism we use to solve the above challenges. On the one hand, our analysis shows that with a proper number of sample sets $M$ and proper random sample distributions, the (super) arm we focus on has properties similar to those of the (super) arm with the largest upper confidence bound. These properties play a critical role in the analysis of UCB approaches. Thus, we can also use them to obtain an upper bound on the number of pulls of sufficiently learned sub-optimal arms in TS-Explore, as well as the correctness guarantee of TS-Explore. On the other hand, this novel mechanism preserves the key advantage of Thompson Sampling, i.e., the sum of the random samples within a super arm will not exceed its tight upper confidence bound with high probability. Hence, TS-Explore can achieve a lower complexity upper bound than existing efficient UCB-based solutions. These results indicate that our TS-Explore policy is correct, efficient (with low implementation cost), and effective (with low complexity).
In the general CMAB setting, we show that TS-Explore is near optimal and achieves a lower complexity upper bound than existing efficient UCB-based algorithms (Chen et al., 2014). The optimal algorithm in (Chen et al., 2017) is only efficient when the combinatorial structure satisfies some specific properties (otherwise it suffers from an exponential implementation cost), and is therefore less general than our result. We also conduct experiments to compare the complexity of TS-Explore with existing baselines. The experimental results show that TS-Explore outperforms the efficient baseline CLUCB (Chen et al., 2014), and behaves only a little worse than the optimal but non-efficient baseline NaiveGapElim (Chen et al., 2017). As for the MAB setting, we show that TS-Explore is asymptotically optimal, i.e., it has a complexity comparable to existing optimal algorithms (Kalyanakrishnan et al., 2012; Kaufmann & Kalyanakrishnan, 2013) when $\delta \to 0$. All these results indicate that our TS-based algorithm is efficient and effective in dealing with pure exploration problems. To the best of our knowledge, this is the first result on using this kind of TS-based algorithm (i.e., always making decisions based on random samples) in pure exploration under the frequentist setting.
2 Related Works
Pure exploration of the classic MAB model was first proposed by Audibert et al. (2010). Since then, many learning policies have been designed for this problem. The two most representative algorithms are successive elimination (Even-Dar et al., 2006; Audibert et al., 2010; Kaufmann & Kalyanakrishnan, 2013) and LUCB (Kalyanakrishnan et al., 2012; Kaufmann & Kalyanakrishnan, 2013). Both of them adopt the idea of UCB (Auer et al., 2002) and achieve an asymptotically optimal complexity upper bound (i.e., one that matches the complexity lower bound proposed by Kalyanakrishnan et al. (2012)). Compared to these results, our TS-Explore policy uses a totally different approach, and achieves an asymptotically optimal complexity upper bound as well.
Combinatorial pure exploration is first studied by Chen et al. (2014). They propose CLUCB, an LUCB-based algorithm that is efficient as long as there is an offline oracle to output the best super arm under any given parameter set. Chen et al. (2017) then design an asymptotically optimal algorithm for this problem. However, their algorithm can only be implemented efficiently when the combinatorial structure follows some specific constraints. Recently, based on the game approach, Jourdan et al. (2021) provide another optimal learning policy for pure exploration in CMAB. But their algorithm still suffers from an exponential implementation cost. Compared with these UCB-based algorithms, our TS-Explore policy achieves a lower complexity bound than CLUCB (Chen et al., 2014) (with a similar polynomial time cost), and has a much lower implementation cost than the optimal policies (Chen et al., 2017; Jourdan et al., 2021) in the most general combinatorial pure exploration setting.
There also exists some research on applying Thompson Sampling to pure exploration. For example, Russo (2016) considers a frequentist setting of pure exploration in the classic MAB, and proposes algorithms called TTPS, TTVS, and TTTS; Shang et al. (2020) extend the results in (Russo, 2016), design a T3C algorithm, and provide its analysis for Gaussian bandits; Li et al. (2021) study Bayesian contextual pure exploration, propose an algorithm called BRE, and obtain a corresponding analysis. However, these results are still very different from ours. First, they mainly use the (posterior) distributions themselves, rather than random samples, to decide the next chosen arm or when to stop, which may cause a high implementation cost when extended to the combinatorial setting, since it is much more complex to deal with distributions than with random samples. Moreover, our results are more general even when we only consider pure exploration in the classic MAB under the frequentist setting. For example, Russo (2016) does not provide a correct stopping rule, and Shang et al. (2020) only obtain a correctness guarantee for Gaussian bandits. Besides, their complexity bounds are asymptotic ones (which require $\delta \to 0$), while ours holds for any $\delta$.
3 Model Setting
3.1 Pure Exploration in Multi-armed Bandit
A pure exploration problem instance of MAB is a tuple $([n], \{D_i\}_{i \in [n]}, \delta)$. Here $[n] = \{1, 2, \dots, n\}$ is the set of arms, the $D_i$'s are the corresponding reward distributions of the arms, and $\delta$ is the error constraint. In this paper, we assume that all the distributions $D_i$ are supported on $[0,1]$. Let $\mu_i$ denote the expected reward of arm $i$, and let $a^*$ be the optimal arm with the largest expected reward. Similar to many existing works (e.g., (Audibert et al., 2010)), we assume that the optimal arm is unique. At each time step $t$, the learning policy can either pull an arm, or output an arm $a^{\mathrm{out}}$. If it chooses to pull an arm, then it receives an observation drawn independently from that arm's reward distribution. The goal of the learning policy $\pi$ is to make sure that with probability at least $1-\delta$, its output $a^{\mathrm{out}} = a^*$. Under this constraint, it aims to minimize the complexity, i.e., the total number of observations $\sum_{i \in [n]} T_i(T_\pi)$, where $T_\pi$ denotes the time step at which policy $\pi$ chooses to output $a^{\mathrm{out}}$, and $T_i(t)$ denotes the number of observations of arm $i$ up to time step $t$.
For any sub-optimal arm $i \ne a^*$, let $\Delta_i = \mu_{a^*} - \mu_i$ denote the expected reward gap between the optimal arm and arm $i$. For the optimal arm $a^*$, its gap is defined as $\Delta_{a^*} = \mu_{a^*} - \max_{i \ne a^*} \mu_i$. We also define the hardness $H = \sum_{i \in [n]} \Delta_i^{-2}$, and existing works (Kalyanakrishnan et al., 2012) show that the complexity lower bound of any pure exploration algorithm is $\Omega(H \log(1/\delta))$.
3.2 Pure Exploration in Combinatorial Multi-armed Bandit
A pure exploration problem instance of CMAB is an extension of the MAB setting. The $n$ arms are called base arms, and there is also a super arm set $\mathcal{S} \subseteq 2^{[n]}$. For each super arm $S \in \mathcal{S}$, its expected reward is $\mu(S) = \sum_{i \in S} \mu_i$. Let $S^*$ denote the optimal super arm with the largest expected reward, and we assume that the optimal super arm is unique (as in (Chen et al., 2014)). At each time step $t$, the learning policy can either pull a base arm, or output a super arm $S^{\mathrm{out}}$. The goal of the learning policy is to make sure that with probability at least $1-\delta$, its output $S^{\mathrm{out}} = S^*$. Under this constraint, it also wants to minimize its complexity, i.e., the total number of observations.
As in many existing works (Chen et al., 2013, 2014; Wang & Chen, 2018), we also assume that there exists an offline oracle, which takes a set of parameters $\theta = (\theta_1, \dots, \theta_n)$ as input, and outputs the best super arm under this parameter set, i.e., $\mathrm{Oracle}(\theta) = \arg\max_{S \in \mathcal{S}} \sum_{i \in S} \theta_i$.
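For concreteness, here is a minimal sketch of what such an oracle can look like; the top-K structure, the brute-force variant, and the function names are illustrative assumptions, not definitions from the paper:

```python
import numpy as np

def topk_oracle(theta, K):
    # Super arms are all subsets of size K; the best one is given by the K largest entries.
    return set(np.argsort(theta)[-K:])

def brute_force_oracle(theta, super_arms):
    # For an explicitly listed (small) super arm set, maximize the sum of parameters.
    return max(super_arms, key=lambda S: sum(theta[i] for i in S))
```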
In this paper, for a base arm $i \notin S^*$, we use $\Delta_i$ to denote the expected reward gap between the optimal super arm $S^*$ and the best super arm that contains $i$. For a base arm $i \in S^*$, its gap $\Delta_i$ is defined as the expected reward gap between $S^*$ and the best super arm that does not contain $i$. We also define the hardness $H = \sum_{i \in [n]} \Delta_i^{-2}$ as in the MAB setting, and let $m = \max_{S \in \mathcal{S}} |S|$ denote the maximum super arm size.
Chen et al. (2017) prove that the complexity lower bound for combinatorial pure exploration is $\Omega(\mathrm{Low}(\mathcal{C}) \cdot \log(1/\delta))$, where $\mathrm{Low}(\mathcal{C})$ is the optimal value of a convex program over the expected numbers of pulls $N_1, \dots, N_n$ (here $\Delta_S = \mu(S^*) - \mu(S)$ for $S \ne S^*$).
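The program itself is not reproduced in the extracted text above; a standard form of this lower-bound program (as in Chen et al. (2017)), with constant factors omitted, reads:
\[
\mathrm{Low}(\mathcal{C}) \;=\; \min_{N_1,\dots,N_n \,\ge\, 0}\ \sum_{i=1}^{n} N_i
\qquad \text{s.t.} \qquad \sum_{i \in S^{*} \,\triangle\, S} \frac{1}{N_i} \;\le\; \Delta_S^{2}
\quad \text{for all } S \in \mathcal{S} \setminus \{S^{*}\}.
\]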
The following result shows the relationships between , and .
Proposition 3.1.
For any combinatorial pure exploration instance, .
Proof.
Since , we have that . As for the first inequality, note that , is a feasible solution of the above convex program. Hence we have that . ∎
4 Thompson Sampling-based Pure Exploration Algorithm
Note that pure exploration of the classic multi-armed bandit is a special case of combinatorial multi-armed bandit (i.e., the case $\mathcal{S} = \{\{1\}, \dots, \{n\}\}$). Therefore, in this section, we mainly focus on the TS-based pure exploration algorithm in the combinatorial multi-armed bandit setting. Its framework and analysis directly lead to results in the classic multi-armed bandit setting.
In the following, we let . For any , we also define as a function of such that .
4.1 Algorithm Framework
Our Thompson Sampling-based algorithm (TS-Explore) is described in Algorithm 1. We use $T_i$ to denote the number of observations of arm $i$, $R_i$ to denote the sum of all the observations from arm $i$, and, for any such quantity $X$, $X(t)$ to denote its value at the beginning of time step $t$; thus the empirical mean of arm $i$ at time step $t$ is $\hat{\mu}_i(t) = R_i(t)/T_i(t)$.
At each time step $t$, for every base arm $i$, TS-Explore first draws $M$ random samples $\theta_i^{(1)}(t), \dots, \theta_i^{(M)}(t)$ independently from a Gaussian distribution with mean $\hat{\mu}_i(t)$ (i.e., the empirical mean of arm $i$) and a variance inversely proportional to $T_i(t)$. Then it checks which super arm is optimal under the empirical mean parameter set and under the $k$-th sample set for each $k \in [M]$, by calling the offline oracle. If the best super arms under all $M$ sample sets are the same as the empirically best super arm, then TS-Explore outputs this super arm as the optimal one. Otherwise, for every $k$ whose best super arm differs from the empirically best one, it computes the reward gap between the two under the $k$-th sample set, and focuses on the sample set with the largest such gap: TS-Explore then pulls the base arm with the least number of observations in the exchange set of that sample set's best super arm and the empirically best super arm. Note that this exchange set is non-empty (otherwise we would have output the empirically best super arm as optimal), so the rule for choosing arms to pull is well defined.
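To make the framework above concrete, here is a minimal Python sketch of the loop just described. It is an illustrative reconstruction, not Algorithm 1 itself: the sampling-variance constant `c`, the treatment of the exchange set as a symmetric difference, and the names `oracle`, `pull`, and `M` are all assumptions, and the specific choice of $M$ from Theorem 4.1 is not reproduced here.

```python
import numpy as np

def ts_explore(n, oracle, pull, M, c=1.0, max_steps=100_000):
    """Illustrative sketch of TS-Explore.

    n      -- number of base arms
    oracle -- oracle(theta) returns the best super arm (an iterable of base arms)
              under parameter vector theta
    pull   -- pull(i) returns one observation of base arm i in [0, 1]
    M      -- number of independent sample sets drawn per time step
    c      -- assumed sampling-variance constant (variance of each sample is c / T_i)
    """
    T = np.zeros(n)   # number of observations of each base arm
    R = np.zeros(n)   # sum of observations of each base arm
    for i in range(n):                    # pull every arm once so empirical means exist
        R[i] += pull(i); T[i] += 1

    def reward(S, theta):                 # reward of super arm S under parameters theta
        return sum(theta[i] for i in S)

    for _ in range(max_steps):
        mu_hat = R / T
        S_hat = set(oracle(mu_hat))       # empirically best super arm
        best_gap, best_S = -np.inf, None
        for _ in range(M):
            theta = np.random.normal(mu_hat, np.sqrt(c / T))   # one independent sample set
            S_k = set(oracle(theta))
            if S_k != S_hat:
                gap = reward(S_k, theta) - reward(S_hat, theta)
                if gap > best_gap:
                    best_gap, best_S = gap, S_k
        if best_S is None:                # all M sample sets agree with the empirical best
            return S_hat
        exchange = best_S ^ S_hat         # exchange set, taken here as the symmetric difference
        i = min(exchange, key=lambda j: T[j])   # least-observed base arm in the exchange set
        R[i] += pull(i); T[i] += 1
    return S_hat
```

This sketch can be combined with the instance construction shown in Section 5 and a brute-force oracle over the listed super arms.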
4.2 Analysis of TS-Explore
Theorem 4.1.
In the CMAB setting, for , with probability at least , TS-Explore will output the optimal super arm with complexity upper bounded by . Specifically, if we choose , then the complexity upper bound is .
Remark 4.2.
When the error constraint , we can still let the parameters in TS-Explore be . In this case: i) its error probability is upper bounded by ; and ii) since , the complexity of TS-Explore is still upper bounded by .
By Theorem 4.1, we can directly obtain the correctness guarantee and the complexity upper bound for applying TS-Explore in pure exploration of classic multi-armed bandit.
Corollary 4.3.
In the MAB setting, for , with probability at least , TS-Explore will output the optimal arm with complexity upper bounded by . Specifically, if we choose , then the complexity upper bound is .
Remark 4.4.
The value in TS-Explore is used to control the number of times that we draw random samples at each time step. Note that . Hence when it becomes larger, we need fewer samples, but the complexity bound becomes worse. This is a trade-off between the algorithm’s complexity and the number of random samples it needs to draw. Our analysis shows that using for some constant keeps the complexity upper bound of the same order while reducing the number of random samples significantly.
Remark 4.5.
If the value is unknown (which is common in real applications), we can use instead (in and ). This only increases the constant term in the complexity upper bound and does not influence the major term .
Theorem 4.1 shows that the complexity upper bound of TS-Explore is lower than the CLUCB policy in (Chen et al., 2014). To the best of our knowledge, TS-Explore is the first algorithm that efficiently achieves an complexity upper bound in the most general combinatorial pure exploration setting. Besides, this is also the first theoretical analysis of using a TS-based algorithm (i.e., using random samples to make decisions) to deal with combinatorial pure exploration problems under the frequentist setting.
Though the complexity bound of TS-Explore still has some gap with the optimal one , we emphasize that this is because we only use the simple offline oracle and do not seek more detailed information about the combinatorial structure. This makes our policy more efficient. As a comparison, the existing optimal policies (Chen et al., 2017; Jourdan et al., 2021) either suffer from an exponential time cost, or require the combinatorial structure to satisfy some specific constraints so that they can adopt more powerful offline oracles that explore detailed information about the combinatorial structure efficiently (please see Appendix C for discussions about this). Therefore, our algorithm is more efficient in the most general CMAB setting, and can be attractive in real applications with large scale and complex combinatorial structures.
On the other hand, the gap between the complexity upper bound and the complexity lower bound does not exist in the MAB setting. Corollary 4.3 shows that the complexity upper bound of applying TS-Explore in the classic MAB setting matches the complexity lower bound (Kalyanakrishnan et al., 2012) when $\delta \to 0$, i.e., TS-Explore is asymptotically optimal in the classic MAB setting.
Now we provide the proof of Theorem 4.1.
In this proof, we denote . Note that and for any , . We also denote and to simplify notations.
We first define three events as follows: is the event that for all , ,
is the event that for all , , ,
and is the event that for any two super arms and any time step, there exists a sample set such that
Roughly speaking, means that for any , the gap between its real mean and its empirical mean lies in the corresponding confidence radius; means that for any , the gap between its empirical mean and the sum of its random samples lies in the corresponding confidence radius; and means that for any two super arms , at each time step, there is at least one sample set such that the gap between the sum of their random samples is larger than the gap between their real means.
We will first prove that happens with high probability, and then show the correctness guarantee and complexity upper bound under .
Lemma 4.6.
In Algorithm 1, we have that
Proof.
Note that the random variable is zero-mean and sub-Gaussian, and for different , the random variables ’s are independent. Therefore, is zero-mean and sub-Gaussian. Then by concentration inequality of sub-Gaussian random variables (see details in Appendix D),
This implies that
where the second inequality holds because , and the third inequality holds because .
Similarly, the random variable is a zero-mean Gaussian random variable with variance , and for different , the random variables ’s are also independent. Then by concentration inequality,
This implies that
where the second inequality holds because there are $M$ sample sets in total at time step $t$.
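Written out, the Gaussian step above takes the following form, assuming the sampling variance of each $\theta_i^{(k)}(t)$ is of the form $c_t / T_i(t)$ for some factor $c_t$ (the exact factor used by the algorithm is not reproduced in the extracted text):
\[
\sum_{i \in S} \bigl(\theta_i^{(k)}(t) - \hat{\mu}_i(t)\bigr) \;\sim\; \mathcal{N}\Bigl(0,\ \sum_{i \in S} \tfrac{c_t}{T_i(t)}\Bigr),
\qquad
\Pr\Bigl[\Bigl|\sum_{i \in S} \bigl(\theta_i^{(k)}(t) - \hat{\mu}_i(t)\bigr)\Bigr| \ge r\Bigr] \;\le\; 2\exp\Bigl(-\frac{r^{2}}{2\sum_{i \in S} c_t / T_i(t)}\Bigr),
\]
and a union bound over the $M$ sample sets and the super arms then gives the stated bound on the probability that the second event fails.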
Finally we consider the probability . In the following, we denote as the reward gap (under the real means) between and .
For any fixed , we denote , and to simplify notations. Then under event , we must have that .
Since is a Gaussian random variable with mean and variance , then under event , (recall that ):
where the last equation holds because we choose and (by definition of ).
Note that the parameter sets are chosen independently, therefore under event , we have that
where the last inequality holds because we choose .
This implies that
All these show that . ∎
Then it is sufficient to prove that under event , TS-Explore works correctly with complexity upper bound shown in Theorem 4.1.
First, we prove that TS-Explore will output the optimal super arm. The proof is quite straightforward: if the empirically best super arm is not the optimal one, then under the third event, there exists a sample set under which the best super arm differs from it. Therefore the stopping condition is not met and we will not output it. Because of this, TS-Explore can only return the optimal super arm under these events, and we finish the proof of its correctness.
Then we come to bound the complexity of TS-Explore, and we will use the following lemma in our analysis.
Lemma 4.7.
Under event , a base arm will not be pulled if .
Due to the space limit, we only prove Lemma 4.7 for the case that in our main text, and defer the complete proof to Appendix A.
Proof.
We will prove this lemma by contradiction.
First we consider the case that . In this case, , which implies that . If we choose a base arm with to pull, we know that , . This implies that and .
By , there exists such that . By , and similarly . Hence
also means and . Thus we know that
This contradicts the fact that is the optimal super arm under the -th sample set .
Then we come to the case that . In this case, , which implies that . If we choose a base arm with to pull, we have that and .
By , there exists such that . By , and . All these imply
This contradicts the fact that is the empirically optimal super arm. ∎
Lemma 4.7 is similar to Lemma 10 in (Chen et al., 2014). Both of them give an upper bound for the number of pulls on base arm . The key difference is that in (Chen et al., 2014), for an arm set , the gap between its real mean and its upper confidence bound is , which means that we require all the ’s to be to make sure that this gap is less than . In our paper, based on event , the gap between ’s real mean and the sum of random samples in is . Therefore, we only require all the ’s to be to make sure that this gap is less than , and this reduces a factor of in the number of pulls on base arm (our analysis shows that is approximately a constant). In fact, reducing a factor in Lemma 4.7 is the key reason that the complexity upper bound of TS-Explore is lower than the CLUCB policy in (Chen et al., 2014).
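To make the factor saving concrete, here is the back-of-the-envelope comparison behind this paragraph, writing every per-arm confidence radius generically as $r \approx \sqrt{\log(\cdot)/T_i}$, using $\Delta$ for the relevant gap, and ignoring logarithmic factors and the exact constants of the events (which are not reproduced above):
\[
\text{UCB-style:}\quad \sum_{i \in S} r \approx |S|\, r \le \tfrac{\Delta}{2} \;\Longrightarrow\; r \lesssim \tfrac{\Delta}{2|S|} \;\Longrightarrow\; T_i \gtrsim \tfrac{|S|^2}{\Delta^2},
\qquad
\text{TS-style:}\quad \sqrt{\textstyle\sum_{i \in S} r^2} \approx \sqrt{|S|}\, r \le \tfrac{\Delta}{2} \;\Longrightarrow\; r \lesssim \tfrac{\Delta}{2\sqrt{|S|}} \;\Longrightarrow\; T_i \gtrsim \tfrac{|S|}{\Delta^2},
\]
so the per-arm sample requirement drops from order $|S|^2/\Delta^2$ to order $|S|/\Delta^2$, which is the saving exploited in Lemma 4.7.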
The novelty of our proof for Lemma 4.7 mainly lies in the event (as well as the mechanism of recording all the ’s and focusing on the largest one), i.e., we show that under event , has properties similar to those of the super arm with the largest upper confidence bound (which is used in LUCB-based policies such as CLUCB). For example, when does not equal the optimal super arm , tells us that there must exist such that . Along with the fact , we know . This means that the reward gap between and () could be larger than , which implies that is either an insufficiently learned sub-optimal arm or the optimal arm (see details in the complete proof in Appendix A). This method solves the challenge of the uncertainty in random samples, and allows us to use similar analysis techniques (e.g., (Chen et al., 2014)) to prove Lemma 4.7.
By Lemma 4.7, if , then TS-Explore must terminate (and output the correct answer). Thus, the complexity satisfies
For , . Then with , and , we have that (note that ):
(1)
Therefore, after some basic calculations (the details are deferred to Appendix B), we know that
5 Experiments
In this section, we conduct experiments to compare the complexity of the efficient learning algorithms (i.e., TS-Explore and CLUCB (Chen et al., 2014)) and the optimal but non-efficient algorithm NaiveGapElim (Chen et al., 2017) in the CMAB setting. We consider the following combinatorial pure exploration problem instance with size parameter $n$, and always use a fixed parameter choice in TS-Explore. All the results (average complexities and their standard deviations) are averaged over 100 independent runs.
Problem 5.1.
For a fixed value $n$, there are $2n$ base arms in total. For the first $n$ base arms, their expected rewards equal 0.1, and for the last $n$ base arms, their expected rewards equal 0.9. There are only two super arms: $S_1$ contains the first $n$ base arms and $S_2$ contains the last $n$ base arms.
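A sketch of this instance in code, assuming the size parameter is $n$ as above (so there are $2n$ base arms) and taking the base arm rewards to be Bernoulli:

```python
import numpy as np

def make_instance(n):
    means = np.array([0.1] * n + [0.9] * n)      # first n arms: 0.1, last n arms: 0.9
    S1 = list(range(n))                           # super arm 1: the first n base arms
    S2 = list(range(n, 2 * n))                    # super arm 2: the last n base arms
    pull = lambda i: float(np.random.rand() < means[i])   # one Bernoulli observation
    return means, [S1, S2], pull
```

Passing `pull` and a brute-force oracle over `[S1, S2]` into the TS-Explore sketch from Section 4.1 reproduces the comparison setting used here.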
We first fix $\delta$, and compare the complexity of the above algorithms under different values of $n$ (Fig. 1(a)). We can see that when $n$ increases, the complexities of TS-Explore and NaiveGapElim do not increase much, while the complexity of CLUCB increases linearly. This accords with our analysis, since the hardness term in our complexity bound stays constant for this instance, while the corresponding term in CLUCB's bound grows linearly with $n$.
Then we fix $n$, and compare the complexity of the above algorithms under different values of $\delta$ (Fig. 1(b)). We can see that when $\delta$ is large, the complexity of TS-Explore decreases as $\delta$ decreases, and when $\delta$ is small, the complexity of TS-Explore increases as $\delta$ decreases. Moreover, the complexities of TS-Explore and NaiveGapElim increase much more slowly than that of CLUCB (as $\delta$ decreases). This also accords with our analysis. Our complexity bound contains a term that, with our choice of $M$, decreases as $\delta$ decreases. When $\delta$ is large, this term dominates the complexity, and therefore the complexity decreases as $\delta$ decreases; when $\delta$ is small, the $\log(1/\delta)$ term dominates, and therefore the complexity increases as $\delta$ decreases.
Figure 1: Average complexity (with standard deviation) of TS-Explore, CLUCB, and NaiveGapElim on Problem 5.1: (a) varying the problem size with the error constraint fixed; (b) varying the error constraint with the problem size fixed.
All the experimental results indicate that TS-Explore outperforms CLUCB (especially when the size of the problem is large or the error constraint is small). On the other hand, the complexity of our efficient algorithm TS-Explore is only a little higher than the optimal but non-efficient algorithm NaiveGapElim. These results demonstrate the effectiveness of our algorithm.
6 Conclusions
In this paper, we explore the idea of Thompson Sampling to solve the pure exploration problems under the frequentist setting. We first propose TS-Explore, an efficient policy that uses random samples to make decisions, and then show that: i) in the combinatorial multi-armed bandit setting, our policy can achieve a lower complexity bound than existing efficient policies; and ii) in the classic multi-armed bandit setting, our policy can achieve an asymptotically optimal complexity bound.
There remain many interesting topics to be further studied, e.g., how to achieve the optimal complexity bound in the combinatorial multi-armed bandit setting by seeking detailed information about the combinatorial structure in TS-Explore; and what are the complexity bounds of using our TS-Explore framework in other pure exploration problems (e.g., dueling bandit and linear bandit). It is also worth studying how to design TS-based pure exploration algorithms for the fixed budget setting.
Acknowledgement
This work was supported by the National Key Research and Development Program of China (2017YFA0700904, 2020AAA0104304), NSFC Projects (Nos. 62061136001, 62106122, 62076147, U19B2034, U1811461, U19A2081), Beijing NSF Project (No. JQ19016), Tsinghua-Huawei Joint Research Program, and a grant from Tsinghua Institute for Guo Qiang.
References
- Agrawal & Goyal (2013) Agrawal, S. and Goyal, N. Further optimal regret bounds for thompson sampling. In International Conference on Artificial Intelligence and Statistics, 2013.
- Audibert et al. (2010) Audibert, J. Y., Bubeck, S., and Munos, R. Best arm identification in multi-armed bandits. In COLT 2010 - the Conference on Learning Theory, Haifa, Israel, June, pp. 41–53, 2010.
- Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
- Berry & Fristedt (1985) Berry, D. A. and Fristedt, B. Bandit problems: sequential allocation of experiments (Monographs on statistics and applied probability). Springer, 1985.
- Chen et al. (2017) Chen, L., Gupta, A., Li, J., Qiao, M., and Wang, R. Nearly optimal sampling algorithms for combinatorial pure exploration. In Conference on Learning Theory, pp. 482–534. PMLR, 2017.
- Chen et al. (2014) Chen, S., Lin, T., King, I., Lyu, M. R., and Chen, W. Combinatorial pure exploration of multi-armed bandits. In International Conference on Neural Information Processing Systems, pp. 379–387, 2014.
- Chen et al. (2013) Chen, W., Wang, Y., and Yuan, Y. Combinatorial multi-armed bandit: General framework, results and applications. In Proceedings of the 30th International Conference on Machine Learning, pp. 151–159, 2013.
- Combes et al. (2015) Combes, R., Talebi, S., Proutière, A., and Lelarge, M. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, 2015.
- Even-Dar et al. (2006) Even-Dar, E., Mannor, S., Mansour, Y., and Mahadevan, S. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(6), 2006.
- Gabillon et al. (2016) Gabillon, V., Lazaric, A., Ghavamzadeh, M., Ortner, R., and Bartlett, P. Improved learning complexity in combinatorial pure exploration bandits. In Artificial Intelligence and Statistics, pp. 1004–1012. PMLR, 2016.
- Jourdan et al. (2021) Jourdan, M., Mutnỳ, M., Kirschner, J., and Krause, A. Efficient pure exploration for combinatorial bandits with semi-bandit feedback. In Algorithmic Learning Theory, pp. 805–849. PMLR, 2021.
- Kalyanakrishnan et al. (2012) Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. Pac subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning, 2012.
- Kaufmann & Kalyanakrishnan (2013) Kaufmann, E. and Kalyanakrishnan, S. Information complexity in bandit subset selection. In Conference on Learning Theory, pp. 228–251, 2013.
- Kaufmann et al. (2012) Kaufmann, E., Korda, N., and Munos, R. Thompson sampling: an asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, 2012.
- Li et al. (2021) Li, J., Du, C., and Zhu, J. A bayesian approach for subset selection in contextual bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 8384–8391, 2021.
- Perrault et al. (2020) Perrault, P., Boursier, E., Perchet, V., and Valko, M. Statistical efficiency of thompson sampling for combinatorial semi-bandits. In Neural Information Processing Systems, 2020.
- Russo (2016) Russo, D. Simple bayesian algorithms for best arm identification. In Conference on Learning Theory, pp. 1417–1418. PMLR, 2016.
- Shang et al. (2020) Shang, X., Heide, R., Menard, P., Kaufmann, E., and Valko, M. Fixed-confidence guarantees for bayesian best-arm identification. In International Conference on Artificial Intelligence and Statistics, pp. 1823–1832. PMLR, 2020.
- Thompson (1933) Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
- Wang & Chen (2018) Wang, S. and Chen, W. Thompson sampling for combinatorial semi-bandit. In International Conference on Machine Learning, 2018.
Appendix A The Complete Proof of Lemma 4.7
Lemma 4.7 (restated) Under event , a base arm will not be pulled if .
Proof.
We still prove this lemma by contradiction.
Assume that arm is pulled with . Then there are four possibilities: , , , .
Case i): or . In this case, and therefore .
Since , we have that . By event , we also have that for any , , and similarly (recall that and ).
Therefore, for any , we have that
This means that .
Moreover, since , we have that
(2)
On the other hand, by event , we know that there exists such that . Then
where Eq. (A) follows from Eq. (2). This means that , which contradicts (since ).
Case ii): or . In this case, and therefore .
By event , we know that . Hence . Moreover, since , we have that , which is the same as . On the other hand, by event , we have that , and similarly (recall that and ).
Therefore,
This means that , which contradicts the fact that . ∎
Appendix B How to Obtain Complexity Upper Bound by Eq. (1)
Eq. (1) is restated below:
We can then use the following lemma to find an upper bound for complexity .
Lemma B.1.
Given functions and positive values , if holds for all , then for any .
Proof.
Since are positive values, for any , we must have that . Therefore , which implies that
This is the same as . ∎
To apply Lemma B.1, we set , , and . After some basic calculations, we get that , and . Here are universal constants.
Appendix C Discussions about the Optimal Algorithms for Combinatorial Pure Exploration
Chen et al. (2017) prove that the complexity lower bound for combinatorial pure exploration is $\Omega(\mathrm{Low}(\mathcal{C}) \cdot \log(1/\delta))$, where $\mathrm{Low}(\mathcal{C})$ is the optimal value of the convex program over the expected numbers of pulls $N_1, \dots, N_n$ given in Section 3.2.
In other words, for any correct combinatorial pure exploration algorithm, must be a feasible solution for the above convex program, where represents the expected number of pulls on base arm . We say base arm needs exploration the most at time if is positive, where is some universal constant and is the value of in the optimal solution (note that there may be several base arms that need exploration the most in one time step). By this definition, if we always pull a base arm that needs exploration the most, then the frequency of pulling each base arm converges to the optimal solution of , which leads to an optimal complexity upper bound.
However, the simple offline oracle used in TS-Explore (as described in Section 3.2) is not enough to look for a base arm that needs exploration the most. In fact, both the existing optimal policies (Chen et al., 2017; Jourdan et al., 2021) not only need to use this simple offline oracle, but also require some other mechanisms to explore detailed information about the combinatorial structure of the problem instance to look for a base arm that needs exploration the most. The algorithms in (Chen et al., 2017) need to record all the super arms in with an empirical mean larger than some threshold . This is one kind of information about the combinatorial structure that can help to find out a base arm that needs exploration the most. Nevertheless, the authors only provide an efficient way to do this when the combinatorial structure satisfies some specific constraints. In the most general setting, the algorithms in (Chen et al., 2017) must pay an exponential time cost for collecting the detailed information. As for (Jourdan et al., 2021), the best-response oracle used by the -player needs to return a super arm within set that has the shortest distance to a fixed target. Here is a super arm, represents the set of super arms whose cells’ boundaries intersect the boundary of the cell of , and the cell of a super arm is defined as all the possible parameter sets in which is the optimal super arm. This is another kind of information about the combinatorial structure that can help to find out a base arm that needs exploration the most. Nevertheless, this best-response oracle also has an exponential time cost (which is scaled with ). By using these exponential time cost mechanisms, the optimal algorithms (Chen et al., 2017; Jourdan et al., 2021) can find out a base arm that needs exploration the most, which is critical to achieving the optimal complexity upper bound.
In this paper, to make TS-Explore efficient in the most general setting, we only use the simple offline oracle in our algorithm and our mechanism can only inform us of one of the violated constraints in the optimization problem (if all the constraints are not violated, TS-Explore will output the correct optimal arm). This means that we know nothing about the combinatorial structure of the problem instance. Therefore, the best thing we can do is to treat all the base arms in the violated constraint equally, i.e., we choose to pull the base arm (in the violated constraint) with the smallest number of pulls. This leads to a complexity upper bound of . If we want TS-Explore to achieve an optimal complexity upper bound , then we need to know which base arm needs exploration the most, e.g., by applying a powerful offline oracle that takes the empirical means and random samples as input and outputs a base arm that needs exploration the most. How to design such offline oracles and how to implement them efficiently is one of our future research topics.
Appendix D Concentration Inequality of Sub-Gaussian Random Variables
Fact D.1.
If $X$ is a zero-mean $\sigma$-sub-Gaussian random variable, then for any $t \ge 0$, $\Pr\left[|X| \ge t\right] \le 2\exp\left(-\frac{t^2}{2\sigma^2}\right)$.