Asymptotic Instance-Optimal Algorithms for Interactive Decision Making
Abstract
Past research on interactive decision making problems (bandits, reinforcement learning, etc.) mostly focuses on the minimax regret that measures the algorithm’s performance on the hardest instance. However, an ideal algorithm should adapt to the complexity of a particular problem instance and incur smaller regrets on easy instances than on worst-case instances. In this paper, we design the first asymptotic instance-optimal algorithm for general interactive decision making problems with a finite number of decisions under mild conditions. On every instance , our algorithm outperforms all consistent algorithms (those achieving non-trivial regrets on all instances), and has asymptotic regret , where is an exact characterization of the complexity of . The key step of the algorithm involves hypothesis testing with active data collection. It computes the most economical decisions with which the algorithm collects observations to test whether an estimated instance is indeed correct; thus, the complexity is the minimum cost to test the instance against other instances. Our results, instantiated on concrete problems, recover the classical gap-dependent bounds for multi-armed bandits (Lai et al., 1985) and prior works on linear bandits (Lattimore and Szepesvari, 2017), and improve upon the previous best instance-dependent upper bound (Xu et al., 2021) for reinforcement learning.
1 Introduction
Bandit and reinforcement learning (RL) algorithms have demonstrated a wide range of successful real-life applications (Silver et al., 2016, 2017; Mnih et al., 2013; Berner et al., 2019; Vinyals et al., 2019; Mnih et al., 2015; Degrave et al., 2022). Past works have theoretically studied the regret or sample complexity of various interactive decision making problems, such as contextual bandits, reinforcement learning (RL), and partially observable Markov decision processes (see Azar et al. (2017); Jin et al. (2018); Dong et al. (2021); Li et al. (2019); Agarwal et al. (2014); Foster and Rakhlin (2020); Jin et al. (2020), and references therein). Recently, Foster et al. (2021) presented a unified algorithmic principle for achieving the minimax regret—the optimal regret for the worst-case problem instances.
However, minimax regret bounds do not always present a full picture of the statistical complexity of the problem. They characterize the complexity of the most difficult instances, but potentially many other instances are much easier. An ideal algorithm should adapt to the complexity of a particular instance and incur smaller regrets on easy instances than on the worst-case instances. Thus, an ideal regret bound should be instance-dependent, that is, depending on some properties of each instance. Prior works designed algorithms with instance-dependent regret bounds that are stronger than minimax regret bounds, but they are often not directly comparable because they depend on different properties of the instances, such as the gap conditions and the variance of the value function (Zanette and Brunskill, 2019; Xu et al., 2021; Foster et al., 2020; Tirinzoni et al., 2021).
A more ambitious and challenging goal is to design instance-optimal algorithms that outperform, on every instance, all consistent algorithms (those achieving non-trivial regrets on all instances). Past works designed instance-optimal algorithms for multi-armed bandits (Lai et al., 1985), linear bandits (Kirschner et al., 2021; Hao et al., 2020), Lipschitz bandits (Magureanu et al., 2014), and ergodic MDPs (Ok et al., 2018). However, instance-optimal regret bounds for tabular reinforcement learning remain an open question, despite recent progress (Tirinzoni et al., 2021, 2022). The challenge partly stems from the fact that the existence of such an instance-optimal algorithm is a priori unclear for general interactive decision making problems. Conceivably, each algorithm could have its own specialty on a subset of instances, so that no algorithm dominates all others on all instances.
Somewhat surprisingly, we prove that there exists a simple algorithm (T2C, stated in Alg. 1) that is asymptotically instance-optimal for general interactive decision making problems with a finite number of decisions. We determine the exact leading term of the optimal asymptotic regret for an instance $f$ to be $\mathcal{C}(f)\ln n$. Here, $n$ is the number of interactions and $\mathcal{C}(f)$ is a complexity measure for the instance $f$ that intuitively captures the difficulty of distinguishing $f$ from other instances (that have different optimal decisions) using observations. Concretely, under mild conditions on the interactive decision making problem, our algorithm achieves an asymptotic regret $\mathcal{C}(f)\ln n$ (Theorem 5.2) for every instance $f$, while every consistent algorithm must have an asymptotic regret of at least $\mathcal{C}(f)\ln n$ (Theorem 3.2).
Our algorithm consists of three simple steps. First, it explores uniformly for a vanishing fraction of the steps and computes the MLE estimate of the instance with relatively low confidence. Then, it tests whether the estimated instance (or, precisely, its associated optimal decision) is indeed correct using the most economical set of queries/decisions. Concretely, it computes a set of decisions with minimal regret such that, using a log-likelihood ratio test, it can either distinguish the estimated instance from all other instances (with different optimal decisions) with high confidence, or determine that the estimate was incorrect. Third, with the high-confidence estimate, it commits to the optimal decision of the estimated instance for the rest of the steps. The algorithmic framework essentially reduces the problem to the key second step — a problem of optimal hypothesis testing with active data collection.
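To make the three-step structure concrete, the following is a minimal, self-contained Python sketch of the T2C template on a toy two-armed Bernoulli bandit with a finite hypothesis class. Everything here is an illustrative assumption rather than the paper's implementation: the hypothesis class, the initialization length, the acceptance threshold, the crude grid search standing in for the linear program that defines the complexity measure, and the uniform fallback standing in for UCB.

```python
import math
import random

HYPOTHESES = [(0.7, 0.4), (0.4, 0.7), (0.6, 0.5)]   # per-arm mean rewards of each candidate instance
TRUTH = HYPOTHESES[0]                               # ground-truth instance (unknown to the algorithm)

def play(arm):
    return 1.0 if random.random() < TRUTH[arm] else 0.0

def log_lik(means, data):
    eps = 1e-9
    return sum(math.log(means[a] + eps) if r > 0.5 else math.log(1.0 - means[a] + eps)
               for a, r in data)

def kl_bern(p, q):
    eps = 1e-9
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def t2c(n):
    # Step 1 (initialization): short uniform exploration, then a maximum-likelihood
    # estimate that only needs to be correct with moderate probability.
    n_init = max(2, int(math.sqrt(math.log(n))))
    data = [(a, play(a)) for _ in range(n_init) for a in (0, 1)]
    f_hat = max(HYPOTHESES, key=lambda f: log_lik(f, data))
    best = 0 if f_hat[0] >= f_hat[1] else 1

    # Step 2 (identification): find the cheapest allocation (w0, w1) of decisions that
    # separates f_hat from every hypothesis with a different optimal arm (a grid search
    # standing in for the linear program), then run the log-likelihood ratio test.
    alts = [g for g in HYPOTHESES if (0 if g[0] >= g[1] else 1) != best]
    target = math.log(n)
    best_alloc, best_cost = (0, 0), float("inf")
    for w0 in range(200):
        for w1 in range(200):
            if all(w0 * kl_bern(f_hat[0], g[0]) + w1 * kl_bern(f_hat[1], g[1]) >= target
                   for g in alts):
                cost = w0 * (max(f_hat) - f_hat[0]) + w1 * (max(f_hat) - f_hat[1])
                if cost < best_cost:
                    best_alloc, best_cost = (w0, w1), cost
    test_data = [(a, play(a)) for a, w in zip((0, 1), best_alloc) for _ in range(w)]
    accepted = all(log_lik(f_hat, test_data) - log_lik(g, test_data) >= 0.5 * target
                   for g in alts)

    # Step 3 (exploitation): commit to the estimated optimal arm if the test accepts;
    # otherwise fall back to uniform play (a stand-in for the UCB fallback).
    reward = sum(r for _, r in data + test_data)
    for _ in range(max(n - len(data) - len(test_data), 0)):
        reward += play(best if accepted else random.randrange(2))
    return reward

print("average reward over n = 2000 rounds:", t2c(2000) / 2000)
```

In the general algorithm, the allocation in Step 2 comes from the optimization problem defining the complexity measure (Section 3.1), which is what makes the leading regret term match the lower bound.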
Our results recover the classical gap-dependent regret bounds for multi-armed bandits (Lai et al., 1985) and prior works on linear bandits (Lattimore and Szepesvari, 2017; Hao et al., 2020). As an instantiation of the general algorithm, we present the first asymptotic instance-optimal algorithm for episodic tabular RL, improving upon prior instance-dependent algorithms (Xu et al., 2021; Simchowitz and Jamieson, 2019; Tirinzoni et al., 2021, 2022).
1.1 Additional Related Works
Instance-optimal and instance-dependent analysis.
Prior works have designed instance-optimal algorithms for specific interactive decision making problems. Variants of UCB algorithms are instance-optimal for bandits under various assumptions (Lattimore and Szepesvári, 2020; Gupta et al., 2021; Tirinzoni et al., 2020; Degenne et al., 2020; Magureanu et al., 2014), but are suboptimal for linear bandits (Lattimore and Szepesvari, 2017). These algorithms rely on the optimism-in-the-face-of-uncertainty principle to deal with the exploration-exploitation tradeoff, whereas our algorithm explicitly finds the best tradeoff. There are also non-optimistic instance-optimal algorithms for linear bandits (Kirschner et al., 2021; Lattimore and Szepesvari, 2017; Hao et al., 2020), ergodic MDPs (Ok et al., 2018; Burnetas and Katehakis, 1997), and other settings (Graves and Lai, 1997; Rajeev et al., 1989).
Another line of research focuses on instance-optimal pure exploration algorithms in multi-armed bandits (Mason et al., 2020; Marjani et al., 2022; Chen et al., 2017b, a) and deterministic MDPs (Tirinzoni et al., 2022), where the goal is to use the minimum number of samples to identify an -optimal decision.
Many algorithms’ regret bounds depend on some properties of instances such as the gap condition. Foster et al. (2020) prove a gap-dependent regret bound for contextual bandits. For reinforcement learning, the regret bound may depend on the variance of the optimal value function (Zanette and Brunskill, 2019) or the gap of the -function (Xu et al., 2021; Simchowitz and Jamieson, 2019; Yang et al., 2021). Xu et al. (2021) and Foster et al. (2020) also prove that the gap-dependent bounds cannot be improved on some instances. To some extent, these instance-dependent lower bounds can be viewed as minimax bounds for a fine-grained instance family (e.g., all instances with the same -function gap), and therefore are different from ours.
Instance-optimal algorithm via log-likelihood ratio test.
The idea of using the log-likelihood ratio test to design instance-optimal algorithms traces back to Rajeev et al. (1989) and Graves and Lai (1997). Rajeev et al. (1989) design an asymptotically instance-optimal algorithm for general decision making problems with a finite hypothesis class, finite state-action space, and known rewards. The algorithm in Graves and Lai (1997) is instance-optimal for infinite-horizon MDPs where the Markov chain induced by every policy is uniformly recurrent, meaning that the probability of visiting every state is uniformly lower bounded away from 0. In comparison, our analysis applies to infinite hypothesis settings without the uniform recurrence assumption (e.g., episodic tabular RL).
2 Setup and Notations
Additional notations.
We use to denote a polynomial in . For two non-negative sequences , we write if and if
Interactive decision making problems.
We focus on interactive decision making with structured observations (Foster et al., 2021), which includes bandits and reinforcement learning as special cases. An interactive decision making problem is defined by a family of decisions , a space of observations , a reward function , and a function (also called an instance) that maps a decision to a distribution over observations . We use to denote the density function of the distribution . We assume that the reward is a deterministic and known function.
An environment picks a ground-truth instance from an instance family , and then an agent (which knows the instance family but not the ground-truth ) interacts with the environment for a total of rounds. In round ,
1. the learner selects a decision from the decision class , and
2. the environment generates an observation following the ground-truth distribution and reveals the observation. Then the agent receives a reward .
As an example, multi-armed bandits, linear bandits, and reinforcement learning all belong to interactive decision making problems. For bandits, a decision corresponds to an action and an observation corresponds to a reward. For reinforcement learning, a decision is a deterministic policy that maps from states to actions, and an observation is a trajectory (including the reward at each step). In other words, a round of interactions in the interactive decision making problem corresponds to an episode of the reinforcement learning problem. We will formally discuss the reinforcement learning setup in Section 5.
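To make the correspondence concrete, a bandit can be wrapped as an interactive decision making instance with only a few lines; the class below (with hypothetical names) maps each decision to an observation distribution, and the reward is a known function of the decision and observation.

```python
import random
from dataclasses import dataclass

@dataclass
class BernoulliBandit:
    """A bandit viewed as an interactive decision making instance:
    a decision is an arm index, an observation is the sampled reward."""
    means: tuple                    # mean reward of each arm

    def decisions(self):
        return range(len(self.means))

    def sample(self, decision):
        # observation o ~ f(decision); for bandits the observation is the reward itself
        return 1.0 if random.random() < self.means[decision] else 0.0

    def reward(self, decision, observation):
        # R(decision, observation): deterministic and known given the observation
        return observation

# one round of interaction
instance = BernoulliBandit(means=(0.3, 0.8))
pi = 1                              # the agent selects a decision
o = instance.sample(pi)             # the environment reveals an observation
r = instance.reward(pi, o)          # the agent receives the reward
```

For RL, the same interface would take a deterministic policy as the decision and return a whole trajectory as the observation, so that one round corresponds to one episode.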
Let $R_f(\pi)$ denote the expected reward of decision $\pi$ under instance $f$, and $\pi^\star(f)$ the optimal decision of instance $f$. The expected regret measures how much worse the agent’s decisions are than the optimal decision:
$$\mathrm{Reg}_n \;\triangleq\; \mathbb{E}\Big[\sum_{t=1}^{n}\big(R_{f^\star}(\pi^\star(f^\star)) - R_{f^\star}(\pi_t)\big)\Big], \tag{1}$$
where $\pi_t$ denotes the decision selected in round $t$ and $f^\star$ the ground-truth instance.
We consider the case where the decision family is finite, and every instance has a unique optimal decision, denoted by . We also assume that for every and , almost surely when is drawn from the distribution , and .
Consistent algorithms and asymptotic instance-optimality.
We first note that it’s unreasonable to ask for an algorithm that can outperform or match any arbitrary algorithm on every instance. This is because, for any instance , a bespoke algorithm that always outputs can achieve zero regret on instance (though it has terrible regrets for other instances). Therefore, if an algorithm can outperform or match any algorithm on any instance, it must have zero regret on every instance, which is generally impossible. Instead, we are interested in finding an algorithm that is as good as any reasonable algorithm, that is, any algorithm that is not merely customized to a single instance at the cost of completely failing on all others.
We say an algorithm is consistent if its expected regret satisfies for every and (Lai et al., 1985). Most reasonable algorithms are consistent, such as the UCB algorithm for multi-armed bandits (Lattimore and Szepesvári, 2020) and the UCBVI and UCB-Q algorithms for tabular reinforcement learning (Simchowitz and Jamieson, 2019; Yang et al., 2021), because all of them achieve asymptotically regrets on any instance, where hides constants that depend on the properties of the particular instance. (Here the regret scales in because we are in the asymptotic setting where the instance is fixed and tends to infinity; if the instance can depend on , which is not the setting we are interested in, then the minimax regret typically scales in .) However, consistent algorithms exclude the algorithm mentioned in the previous paragraph, which is purely customized to a particular instance.
We say an algorithm is asymptotically instance-optimal if on every instance , is the smallest among all consistent algorithms (Lai et al., 1985). We note that even though an instance-optimal algorithm only needs to perform as well as every consistent algorithm, a priori, it’s still unclear if such an algorithm exists.
Some prior works have used a slightly weaker definition in place of consistent algorithms, e.g., -uniformly good algorithms in Tirinzoni et al. (2021), which allows a sublinear regret bound for some constant . The alternative definition, though it apparently includes more algorithms, does not change the essence. Our algorithm is still instance-optimal up to a constant factor—a simple modification of the lower bound part of the proof shows that its asymptotic regret is at most a factor bigger than that of any -uniformly good algorithm on any instance. This paper thus primarily compares with consistent algorithms for simplicity.
3 Main Results
In this section, we present an intrinsic complexity measure of an instance (Section 3.1), and an instance-optimal algorithm that achieves an asymptotic regret (Section 3.2).
3.1 Complexity Measure and Regret Lower Bounds
In this section, our goal is to understand the minimum regret of consistent algorithms, which serves as an intrinsic complexity measure of an instance. We define an instance-dependent complexity measure $\mathcal{C}(f)$ and prove that any consistent algorithm must have asymptotic regret at least $\mathcal{C}(f)\ln n$ on instance $f$.
The key observation is that any consistent algorithm has to collect enough information from the observations to tell apart the ground-truth instance from other instances with different optimal decisions. Consider the situation when a sequence of decisions is insufficient to distinguish two instances, denoted by and , with different optimal decisions. If an algorithm achieves a sublinear regret on , the sequence must contain (the optimal decision for ) at least times. As a result, if the true instance were , the algorithm suffers a linear regret (due to the linear number of ), and therefore cannot be consistent.
An ideal algorithm should find the decisions that most efficiently identify the ground-truth instance (or, more precisely, the family of instances with the same optimal decision as the ground-truth). However, decisions both collect information and incur regret. So the algorithm should pick a list of decisions with the best tradeoff between these two effects—minimizing the decisions’ regret while collecting sufficient information to identify the true instance. Concretely, suppose a sequence of decisions includes $w_\pi$ occurrences of decision $\pi$. The regret of these decisions is $\sum_{\pi \in \Pi} w_\pi \Delta(f, \pi)$, where $\Pi$ is the decision family and $\Delta(f, \pi) \triangleq R_f(\pi^\star(f)) - R_f(\pi)$ is the sub-optimality gap of decision $\pi$ under instance $f$. We use the KL divergence between the observations of $\pi$ under two instances $f$ and $g$ (denoted by $D_{\mathrm{KL}}(f(\pi) \,\|\, g(\pi))$) to measure $\pi$'s power to distinguish them. The following optimization problem defines the optimal mixture of decisions (in terms of regret) that has sufficient power to distinguish $f$ from all other instances in the family $\mathcal{F}$ with a different optimal decision.
$$\mathcal{C}(f, n) \;\triangleq\; \min_{w \in \mathbb{R}_{\ge 0}^{|\Pi|}} \;\sum_{\pi \in \Pi} w_\pi \,\Delta(f, \pi) \tag{2}$$
$$\text{s.t.} \quad \sum_{\pi \in \Pi} w_\pi \, D_{\mathrm{KL}}\big(f(\pi) \,\|\, g(\pi)\big) \;\ge\; 1 \quad \text{for all } g \in \mathcal{F} \text{ with } \pi^\star(g) \neq \pi^\star(f), \tag{3}$$
$$w_\pi \;\le\; n \quad \text{for all } \pi \in \Pi. \tag{4}$$
The last constraint ensures that $w_\pi$ stays finite even if the decision $\pi$ has no regret, and we only care about the case when $n$ approaches infinity. The asymptotic complexity of $f$ is
$$\mathcal{C}(f) \;\triangleq\; \lim_{n \to \infty} \mathcal{C}(f, n). \tag{5}$$
In Eq. (3), we only require separation between instances when they have different optimal decisions. As there are only finitely many decisions, there are finitely many equivalence classes, so $f$ and $g$ are in principle separable with a sufficient number of observations. Thus, the complexity measure is well defined, as stated in the following lemma.
Lemma 3.1.
For any $f \in \mathcal{F}$, $\mathcal{C}(f, n)$ is non-increasing in $n$, and there exists $n_0 > 0$ such that $\mathcal{C}(f, n) = \mathcal{C}(f, n_0)$ for all $n \ge n_0$. As a corollary, the limit in Eq. (5) exists and $\mathcal{C}(f)$ is well defined.
The proof of Lemma 3.1 is deferred to Appendix B.1. Below, we formalize the intuition above and prove that $\mathcal{C}(f)\ln n$ is a lower bound on the asymptotic regret, as stated in the following theorem.
Theorem 3.2.
For every instance , the expected regret of any consistent algorithm satisfies
$$\liminf_{n \to \infty}\; \frac{\mathbb{E}[\mathrm{Reg}_n]}{\ln n} \;\ge\; \mathcal{C}(f). \tag{6}$$
We prove Theorem 3.2 in Appendix B.2. The proof of Theorem 3.2 is inspired by previous works (Lattimore and Szepesvari, 2017; Tirinzoni et al., 2021). When applied to RL problems, our lower bound is very similar to that in Tirinzoni et al. (2021); the only difference is that their optimization problem omits the constraint in Eq. (4). (Informally, the lower bound of Tirinzoni et al. (2021) is equal to in our language.) For some carefully constructed hypothesis classes, there exists an instance such that , meaning that the lower bound in Tirinzoni et al. (2021) is not tight (see Appendix B.5 for the construction and corresponding proof). However, this constraint can be removed for linear bandits (see Lattimore and Szepesvari (2017, Lemma 17)). For standard tabular RL problems, whose hypothesis class is the entire set of possible RL problems with a fixed number of states and actions, we conjecture that our lower bound is the same as that in Tirinzoni et al. (2021). We also note that Tirinzoni et al. (2021) do not have a matching upper bound.
The complexity measure can be computed explicitly in concrete settings: it reduces to the well-known inverse action gap bound for multi-armed bandits (Proposition B.2), and recovers the result of Lattimore and Szepesvari (2017) for linear bandits (Proposition B.3). For reinforcement learning, Tirinzoni et al. (2021) prove that the instance-optimal bound can be smaller than the gap-dependent bounds in Xu et al. (2021) and Simchowitz and Jamieson (2019).
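For a finite hypothesis class, the program in Eqs. (2)-(4) is a linear program in the allocation weights and can be solved numerically. The sketch below is an illustrative computation for a toy Bernoulli bandit family; the instances, the restriction to the listed alternatives, and the use of `scipy.optimize.linprog` are assumptions for this example, not part of the paper's algorithm.

```python
import numpy as np
from scipy.optimize import linprog

def kl_bernoulli(p, q, eps=1e-9):
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def complexity(f, alternatives, n):
    """Solve the linear program defining C(f, n) for Bernoulli observations:
    minimize sum_pi w_pi * gap(f, pi) subject to
    sum_pi w_pi * KL(f(pi) || g(pi)) >= 1 for every alternative g, and 0 <= w_pi <= n."""
    f = np.asarray(f, dtype=float)
    gaps = f.max() - f                                   # sub-optimality gap of each decision
    # linprog handles A_ub @ w <= b_ub, so negate the ">= 1" constraints
    A_ub = np.array([-kl_bernoulli(f, np.asarray(g, dtype=float)) for g in alternatives])
    b_ub = -np.ones(len(alternatives))
    res = linprog(c=gaps, A_ub=A_ub, b_ub=b_ub, bounds=[(0, n)] * len(f), method="highs")
    return res.fun, res.x

# toy family: f's optimal decision is arm 0; the single alternative makes arm 1 optimal
value, w = complexity(f=[0.7, 0.5, 0.4], alternatives=[[0.7, 0.8, 0.4]], n=10**6)
print("C(f, n) =", value, " allocation w =", w)
```

Multiplying the resulting value by $\ln n$ gives the leading term of the asymptotic regret on this toy instance.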
3.2 Instance-Optimal Algorithms
We first present the regret bound for our algorithm, and then discuss the T2C algorithm. For simplicity, we consider a finite hypothesis class (i.e., ) here, and extend to the infinite hypothesis case in Section 5.
We start by stating a condition that excludes abnormal observation distributions . Recall that for any , the Rényi divergence of two distributions is
$$D_{\alpha}\big(p \,\|\, q\big) \;=\; \frac{1}{\alpha - 1}\,\ln \int p(o)^{\alpha}\, q(o)^{1 - \alpha}\,\mathrm{d}o. \tag{7}$$
The Rényi divergence is non-decreasing in $\alpha$ and converges to the KL divergence as $\alpha \to 1$ (Van Erven and Harremos, 2014). We require the limits to converge uniformly for all instances and decisions with bounded away from 1 (the choice of the constants in Condition 1 is not critical to our result), as stated below.
Condition 1.
For any fixed , instance , there exists such that for all , and , Moreover, we require for some universal constants , where is a function that only depends on .
Condition 1 holds for a wide range of distributions, such as Gaussian, Bernoulli, multinomial, Laplace with bounded mean, and log-normal (Gil et al., 2013), as well as tabular RL where is a distribution over a trajectory consisting of state, action, and reward tuples (see Theorem 5.1 for the proof for tabular RL). A stronger but more interpretable variant of Condition 1 is that the log density ratio of and has finite fourth moments (see Proposition B.4); therefore, Condition 1 can also potentially be verified for other distributions.
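As a numerical sanity check of the quantities appearing in Condition 1, the sketch below evaluates the Rényi divergence in Eq. (7) between two Gaussians by numerical integration and compares it with the KL divergence as $\alpha$ approaches 1; the particular distributions and integration range are illustrative choices.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def renyi(alpha, p, q):
    """D_alpha(p || q) = 1/(alpha - 1) * ln ∫ p(x)^alpha q(x)^(1-alpha) dx,
    evaluated by numerical integration for two continuous densities."""
    integrand = lambda x: p.pdf(x) ** alpha * q.pdf(x) ** (1.0 - alpha)
    val, _ = quad(integrand, -30, 30)
    return np.log(val) / (alpha - 1.0)

p = norm(loc=0.0, scale=1.0)
q = norm(loc=0.5, scale=1.0)
kl = 0.5 ** 2 / 2.0          # closed-form KL for equal-variance Gaussians: (mu_p - mu_q)^2 / (2 sigma^2)
for alpha in (2.0, 1.5, 1.1, 1.01):
    print(f"alpha = {alpha:>4}:  D_alpha = {renyi(alpha, p, q):.4f}   (KL = {kl:.4f})")
```

The output is non-decreasing in $\alpha$ and approaches the KL value, illustrating the limit that Condition 1 requires to hold uniformly.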
The main theorem analyzing Alg. 1 is shown below. The asymptotic regret of Alg. 1 matches the constants in the lower bound (Theorem 3.2), indicating the asymptotic instance-optimality of our algorithm.
Theorem 3.3.
Suppose Condition 1 holds and the hypothesis class $\mathcal{F}$ is finite. Then for every instance $f^\star \in \mathcal{F}$, the regret of Alg. 1 satisfies
$$\limsup_{n \to \infty}\; \frac{\mathbb{E}[\mathrm{Reg}_n]}{\ln n} \;\le\; \mathcal{C}(f^\star). \tag{9}$$
Our algorithm is stated in Alg. 1 and consists of three steps: initialization, identification, and exploitation. In the initialization step, we explore uniformly for a short period and compute the MLE estimate of the true instance (Line 3 of Alg. 1), where we only require the estimation to be accurate with moderate probability (i.e., ). The estimation is used to compute the lowest-regret list of decisions that can distinguish the optimal decision of . Since we only collect samples, the regret of this step is negligible asymptotically.
In the identification step, we hold the belief that is the true instance and solve to get the best list of decisions to fortify this belief (see Line 4 of Alg. 1). Then we collect more samples using this list (with minor decorations) to boost the confidence of our initial estimation to (or reject it when the estimation is not accurate) by running the following log-likelihood ratio test:
$$\text{accept } \hat f \quad \text{iff} \quad \sum_{i=1}^{m} \ln \frac{p_{\hat f}(o_i \mid \pi_i)}{p_{g}(o_i \mid \pi_i)} \;\ge\; \tfrac{1}{2}\ln n \ \ \text{for every } g \in \mathcal{F} \text{ with } \pi^\star(g) \neq \pi^\star(\hat f), \tag{10}$$
where $(\pi_i, o_i)_{i=1}^{m}$ are the decisions and observations collected in this step and $p_f(\cdot \mid \pi)$ denotes the density of $f(\pi)$.
Intuitively, if $\hat f$ and the ground-truth $f^\star$ have different optimal decisions, the accept event is unlikely to hold, because for the alternative $g = f^\star$ the expected log-likelihood ratio is non-positive:
$$\mathbb{E}_{o \sim f^\star(\pi)}\Big[\ln \frac{p_{\hat f}(o \mid \pi)}{p_{f^\star}(o \mid \pi)}\Big] \;=\; -\,D_{\mathrm{KL}}\big(f^\star(\pi) \,\|\, \hat f(\pi)\big) \;\le\; 0.$$
So with high probability a wrong guess cannot be accepted. On the other hand, the first constraint (Eq. (3)) in the definition of guarantees that when , the expected log-likelihood ratio is large for all and , so an accurate guess will be accepted. In this step , so Step 2 dominates the regret of Alg. 1. In other words, the efficiency of the log-likelihood test is critical to our analysis.
Finally, in the exploitation step, the algorithm commits to the optimal decision of if it believes this estimation with confidence , or runs a standard UCB algorithm as if on a multi-armed bandit problem (with a well-known regret) when the initial estimation is not accurate (which happens with probability at most ). Therefore the expected regret here is .
Our three-stage algorithm is inspired by Lattimore and Szepesvari (2017), whose test in Step 2 is specially designed for linear bandits with Gaussian noise. In contrast, the log-likelihood ratio test is a general hypothesis testing method. Hence, our algorithm works for a large family of interactive decision making problems, including tabular RL.
4 Proof Sketches of the Main Results
In this section, we discuss the proof sketch of our main results. Section 4.1 discusses the proof sketch of the lower bound, and Section 4.2 shows the main lemmas for each step in Alg. 1. In Section 4.3, we discuss the log-likelihood ratio test in detail.
4.1 Proof Sketch of Theorem 3.2
Recall that is the minimum regret of a list of decisions that can distinguish the instance with all other instances (with a different optimal decision). So to prove the lower bound, we show that the sequence of decisions played by any consistent algorithm must also distinguish , and thus, lower bounds the regret of any consistent algorithm.
For any consistent algorithm, number of interactions $n$, and two instances $f, g \in \mathcal{F}$ with different optimal decisions, let $\mathbb{P}_{f,n}$ and $\mathbb{P}_{g,n}$ denote the distributions of the interaction history generated by running the algorithm on instances $f$ and $g$ respectively for $n$ rounds. Since $f$ and $g$ have different optimal decisions, $\mathbb{P}_{f,n}$ and $\mathbb{P}_{g,n}$ should be very different. Indeed, let $N_\pi$ be the random variable denoting the number of times decision $\pi$ is executed, and consider the event $A = \{N_{\pi^\star(f)} \ge n/2\}$. By the basic property of TV distance, we have
$$\mathrm{TV}\big(\mathbb{P}_{f,n}, \mathbb{P}_{g,n}\big) \;\ge\; \mathbb{P}_{f,n}(A) - \mathbb{P}_{g,n}(A).$$
Since the algorithm has sub-polynomial regret, for every we must have and (otherwise the algorithm incurs regret on either or ). Consequently,
$$\mathrm{TV}\big(\mathbb{P}_{f,n}, \mathbb{P}_{g,n}\big) \;\ge\; 1 - n^{-1 + o(1)}. \tag{11}$$
This states that the observations from the two instances should be distinguishable, although the TV distance is complex and hard to deal with (especially when the algorithm chooses its decision sequence adaptively). So we translate the requirement from the TV distance to a more localized quantity. First, we invoke an extended version of Pinsker’s inequality (see Sampson and Guttorp (1992))
$$\mathrm{TV}\big(\mathbb{P}_{f,n}, \mathbb{P}_{g,n}\big) \;\le\; 1 - \tfrac{1}{2}\exp\!\Big(-D_{\mathrm{KL}}\big(\mathbb{P}_{f,n} \,\|\, \mathbb{P}_{g,n}\big)\Big). \tag{12}$$
Combining with Eq. (11), we get $D_{\mathrm{KL}}\big(\mathbb{P}_{f,n} \,\|\, \mathbb{P}_{g,n}\big) \ge (1 - o(1)) \ln n$. Let the random variable $\pi_t$ be the decision of the algorithm in round $t$. The chain rule of KL divergence shows
$$D_{\mathrm{KL}}\big(\mathbb{P}_{f,n} \,\|\, \mathbb{P}_{g,n}\big) \;=\; \sum_{\pi \in \Pi} \mathbb{E}_{f}\big[N_\pi\big]\, D_{\mathrm{KL}}\big(f(\pi) \,\|\, g(\pi)\big).$$
Now consider the vector $w$ with $w_\pi = \mathbb{E}_f[N_\pi] / ((1 - o(1)) \ln n)$. Based on the derivation above, we can verify that $w$ is a valid solution to the program defining $\mathcal{C}(f, n)$, and therefore
$$\sum_{\pi \in \Pi} \mathbb{E}_{f}\big[N_\pi\big]\, \Delta(f, \pi) \;\ge\; (1 - o(1))\, \mathcal{C}(f, n)\, \ln n. \tag{13}$$
Then the final result is proved by the facts that $\mathbb{E}[\mathrm{Reg}_n] = \sum_{\pi \in \Pi} \mathbb{E}_f[N_\pi]\, \Delta(f, \pi)$ and $\mathcal{C}(f, n) \to \mathcal{C}(f)$.
4.2 Proof Sketch of Theorem 3.3
In the following, we discuss main lemmas and their proof sketches for the three steps of Alg. 1.
Step 1: Initialization.
In this step, we show that the maximum likelihood estimation can find the exact instance (i.e., ) with probability at least and with negligible regret. Note that the failure probability is not small enough to directly commit to the optimal decision of (i.e., play forever), which would incur linear regret when . Formally speaking, the main lemma for Step 1 is stated as follows; its proof is deferred to Appendix A.1.
Lemma 4.1.
Under Condition 1, with probability at least we get In addition, the regret of Step 1 is upper bounded by
Step 2: Identification.
In the identification step, we boost the confidence of the estimation (i.e., drive the failure probability down to ) using the log-likelihood ratio test. To this end, we compute the optimal list of decisions that distinguishes by solving (Line 4 of Alg. 1). The choice of is not critical, and could be replaced by any smaller quantity that approaches infinity as . Then we run the log-likelihood ratio test using the observations collected by executing the list of decisions , and achieve the following:
(a) when the true instance and the estimation have different optimal decisions, accept with probability at most ;
(b) when , accept with probability at least .
In Section 4.3, we will discuss the log-likelihood ratio test in detail. Note that the regret after we falsely reject the true instance is (by the regret bound of the UCB algorithm), so we only require a failure probability of for (b), because this already leads to an expected regret of . The main lemma for Step 2 is stated as follows, and its proof is deferred to Appendix A.2.
Lemma 4.2.
Under Condition 1, for any finite hypothesis , for large enough the following holds:
(a) conditioned on the event , is true with probability at most ;
(b) conditioned on the event , is true with probability at least ;
(c) conditioned on the event , the expected regret of Step 2 is upper bounded by ; otherwise, the expected regret of Step 2 is upper bounded by .
Step 3: Exploitation.
Finally, in Step 3 we either commit to the optimal decision when the estimation is accepted, or run a standard UCB algorithm with regret (Lattimore and Szepesvári, 2020). Combining Lemma 4.1 and Lemma 4.2, we can prove that
1. the probability of Step 3 incurring a regret is at most , and
2. the probability of Step 3 incurring a regret is at most .
As a result, the expected regret in Step 3 is , and thus negligible. Finally, Theorem 3.3 is proved by stitching the three steps together; the full proof is deferred to Appendix A.4.
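The fallback in Step 3 is a standard UCB routine run on the multi-armed bandit view of the problem; below is a minimal sketch (the bonus constant and the Bernoulli toy environment are illustrative choices).

```python
import math
import random

def ucb(play, num_arms, rounds, bonus_scale=2.0):
    """Standard UCB: pull each arm once, then the arm maximizing
    empirical mean + sqrt(bonus_scale * ln(t) / pulls). `play(arm)` is
    assumed to return a reward in [0, 1]."""
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    total = 0.0
    for t in range(1, rounds + 1):
        if t <= num_arms:
            arm = t - 1                       # initialization: pull every arm once
        else:
            arm = max(range(num_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(bonus_scale * math.log(t) / counts[a]))
        r = play(arm)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total

# toy usage: two Bernoulli arms with means 0.4 and 0.7
means = (0.4, 0.7)
reward = ucb(lambda a: float(random.random() < means[a]), num_arms=2, rounds=5000)
print("average reward:", reward / 5000)
```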
4.3 The Log-Likelihood Ratio Test
The goal of the log-likelihood ratio test is to boost the confidence of our initial estimation to , so that the algorithm can safely commit to its optimal decision in Step 3. In other words, the test should reject a wrong estimation but also accept a correct one.
Formally speaking, in the test we observe a list of observations collected by the list of decisions on the true instance , and we achieve the following two goals simultaneously:
(a) when and are sufficiently different (i.e., their KL divergence is large in the sense that ), accept with probability at most , and
(b) when , accept with probability at least .
We prove that the log-likelihood ratio test with proper parameters achieves (a) and (b) under Condition 1 in the next lemma, whose proof is deferred to Appendix A.3.
Lemma 4.3.
Given two sequences of distributions and , and a sequence of independent random variables . For any fixed , and the test event
satisfies
(14) | |||
(15) |
To use Lemma 4.3 in the proof of Lemma C.2, we set and . Note that the choice of in Line 4 of Alg. 1 and Condition 1 implies when have different optimal policies. So the conclusion of this lemma matches the first two items in Lemma 4.2.
Lemma 4.3 is closely related to the Chernoff-Stein lemma (see Chernoff (1959) and Mao (2021, Theorem 4.11)). The failure probability in Eq. (15) of the classical Chernoff-Stein lemma is a constant, while here it decreases with . The proof of Eq. (14) is the same as in the Chernoff-Stein lemma, and the proof of Eq. (15) only requires a one-sided concentration of the empirical log-likelihood ratio. Indeed, the expectation of the empirical log-likelihood ratio can be lower bounded by
(16) |
So is the probability that the empirical log-likelihood ratio is (approximately) larger than its expectation. We also note that the concentration is non-trivial because we do not make assumptions on the boundedness of the tail of .
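A minimal sketch of the log-likelihood ratio test for discrete observation distributions is given below; the Bernoulli instances and the (1/2)·ln n threshold are illustrative assumptions rather than the exact constants used in the analysis.

```python
import math
import random

def llr_test(p_hat, alternatives, data, threshold):
    """Accept the estimated instance iff, for every alternative g, the total
    log-likelihood ratio sum_i ln[p_hat(o_i | pi_i) / g(o_i | pi_i)] clears the
    threshold.  `p_hat` and each alternative map a decision to a Bernoulli mean."""
    def loglik(means, data):
        return sum(math.log(means[pi]) if o > 0.5 else math.log(1.0 - means[pi])
                   for pi, o in data)
    base = loglik(p_hat, data)
    return all(base - loglik(g, data) >= threshold for g in alternatives)

# toy usage: two decisions, observations drawn from the "true" instance
truth = {0: 0.7, 1: 0.4}
p_hat = {0: 0.7, 1: 0.4}                    # suppose Step 1 estimated correctly
alternatives = [{0: 0.4, 1: 0.7}]           # instances with a different optimal decision
n = 10_000
data = [(0, float(random.random() < truth[0])) for _ in range(200)]
print("accept:", llr_test(p_hat, alternatives, data, threshold=0.5 * math.log(n)))
```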
5 Extensions to Infinite Hypothesis Class
Now we extend our analysis to infinite hypothesis settings, and instantiate our results on tabular RL. Our results in the infinite hypothesis case need two additional conditions. The first condition requires an upper bound on the covering number of the hypothesis class, and is used to prove an infinite-hypothesis version of Lemma 4.2. Formally speaking, for any , define their distance as
(17) |
A (proper) covering of is a subset such that for any , there exists with . The covering number is the size of the minimum covering of .
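For intuition, a proper ε-cover of any finite set of instances under a pairwise distance can be constructed greedily; the generic sketch below (with a placeholder metric and point set) only illustrates the definition and is not how the covering number in Condition 2 is bounded.

```python
def greedy_cover(points, dist, eps):
    """Return a subset C of `points` such that every point is within `eps`
    of some element of C (a proper epsilon-cover; greedy, not minimal)."""
    cover = []
    for p in points:
        if all(dist(p, c) > eps for c in cover):
            cover.append(p)
    return cover

# toy usage: instances summarized by per-decision means, sup-norm distance
instances = [(0.1 * i, 0.05 * j) for i in range(11) for j in range(11)]
dist = lambda f, g: max(abs(a - b) for a, b in zip(f, g))
print("cover size at eps = 0.1:", len(greedy_cover(instances, dist, eps=0.1)))
```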
Condition 2.
There exists a constant that depends on such that for every . In addition, the base measure of the probability space has a finite volume .
The second condition upper bounds the distance of two instances by a polynomial of their KL divergence, and is used to prove a stronger version of Lemma 4.1.
Condition 3.
There exists a constant (which may depend on ) such that for all and In addition, there exists a constant that only depends on and , such that for all , when
Many interactive decision making problems satisfy Conditions 2 and 3, such as multi-armed bandits and linear bandits with Bernoulli or truncated Gaussian rewards. For bandit problems, Conditions 2 and 3 only depend on the noise of the reward function and can be verified easily.
For tabular RL, an observation denotes a trajectory in which the state-action pairs are discrete random variables but the rewards are continuous. A decision is a mapping from the state space to the action space, so . In Appendix D, we formally define the tabular RL problem, in which the reward given follows a truncated Gaussian distribution, and prove that Conditions 1-3 are satisfied, as stated in the following theorem.
Theorem 5.1.
Tabular RL with truncated Gaussian rewards satisfies Conditions 1-3.
For tabular RL with truncated Gaussian reward, the observation is a mixture of discrete random variables (i.e., the states and actions ) and continuous random variables (i.e., the rewards ). To prove Conditions 1-3, we deal with the discrete and continuous parts of the observation separately. We also have flexibility in the reward distribution; e.g., our proof technique can deal with other distributions such as the truncated Laplace distribution.
We present these unified conditions for bounded random variables, but Conditions 2 and 3 do not hold for unbounded random variables, because the base measure has an infinite volume (which contradicts Condition 2) and the density of the observation cannot be lower bounded (which contradicts Condition 3). However, we can still prove the same result using a slightly different approach (see Appendix E).
With Conditions 1-3, we can prove our main results for infinite hypothesis class. The asymptotic regret of Alg. 1 in this case still matches the lower bound exactly. The proof of Theorem 5.2 is deferred to Appendix C.1.
Theorem 5.2.
Suppose Conditions 1-3 hold. Then for every instance $f^\star \in \mathcal{F}$, the regret of Alg. 1 satisfies $\limsup_{n \to \infty} \mathbb{E}[\mathrm{Reg}_n] / \ln n \;\le\; \mathcal{C}(f^\star)$.
As a corollary of Theorem 5.1 and Theorem 5.2, our algorithm strictly improves upon the previous best gap-dependent bounds in Xu et al. (2021) and Simchowitz and Jamieson (2019), because the gap-dependent bounds are not tight for many instances (Tirinzoni et al., 2021).
In the following we discuss the challenges when extending our analysis to infinite hypothesis settings.
Covering Number.
With an infinite hypothesis class, we need to accept an accurate estimation even when there are infinitely many other choices. Recall that the accept event is
(19) |
So informally speaking, we need to show that with high probability,
(20) |
which is essentially a uniform concentration inequality, as discussed in Section 4.3. So we resort to a covering-number argument. The standard covering argument is not directly suitable in this case — even if is small, it’s still possible that is very different from (especially when is very close to 0). Instead, we consider the distribution with density where is the normalization factor, and only prove a one-sided covering (that is, ).
Initialization.
With an infinite hypothesis class, we cannot hope to recover the true instance exactly — some instances can be arbitrarily close to and thus indistinguishable. Instead, we prove that the estimation in Step 1 satisfies . The main lemma of Step 1 in the infinite hypothesis class case is stated in Lemma C.1.
6 Conclusion
In this paper, we design instance-optimal algorithms for general interactive decision making problems with finitely many decisions. As an instantiation, our algorithm is the first instance-optimal algorithm for tabular MDPs. For future work, we raise the following open questions.
1. To implement Alg. 1, we need to solve , which is a linear program with variables and infinitely many constraints. However, is exponentially large for tabular MDPs. Can we solve this optimization problem efficiently for tabular MDPs?
2. Although our algorithm is asymptotically optimal, the lower-order terms may dominate the regret unless is very large. Can we design non-asymptotic instance-optimal algorithms?
Acknowledgment
The authors would like to thank Yuanhao Wang, Jason D. Lee for helpful discussions, and the support from JD.com.
References
- Agarwal et al. [2014] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646, 2014.
- Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272, 2017.
- Berner et al. [2019] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemyslaw Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
- Burnetas and Katehakis [1997] Apostolos N Burnetas and Michael N Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997.
- Chen et al. [2017a] Lijie Chen, Jian Li, and Mingda Qiao. Nearly instance optimal sample complexity bounds for top-k arm selection. In Artificial Intelligence and Statistics, pages 101–110. PMLR, 2017a.
- Chen et al. [2017b] Lijie Chen, Jian Li, and Mingda Qiao. Towards instance optimal bounds for best arm identification. In Conference on Learning Theory, pages 535–592. PMLR, 2017b.
- Chernoff [1959] Herman Chernoff. Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770, 1959.
- Cover [1999] Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.
- Degenne et al. [2020] Rémy Degenne, Han Shao, and Wouter Koolen. Structure adaptive algorithms for stochastic bandits. In International Conference on Machine Learning, pages 2443–2452. PMLR, 2020.
- Degrave et al. [2022] Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897):414–419, 2022.
- Dong et al. [2021] Kefan Dong, Jiaqi Yang, and Tengyu Ma. Provable model-based nonlinear bandit and reinforcement learning: Shelve optimism, embrace virtual curvature. arXiv preprint arXiv:2102.04168, 2021.
- Foster and Rakhlin [2020] Dylan Foster and Alexander Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3199–3210. PMLR, 13–18 Jul 2020.
- Foster et al. [2020] Dylan J Foster, Alexander Rakhlin, David Simchi-Levi, and Yunzong Xu. Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective. arXiv preprint arXiv:2010.03104, 2020.
- Foster et al. [2021] Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487, 2021.
- Gil et al. [2013] Manuel Gil, Fady Alajaji, and Tamas Linder. Rényi divergence measures for commonly used univariate continuous distributions. Information Sciences, 249:124–131, 2013.
- Graves and Lai [1997] Todd L Graves and Tze Leung Lai. Asymptotically efficient adaptive choice of control laws incontrolled markov chains. SIAM journal on control and optimization, 35(3):715–743, 1997.
- Gupta et al. [2021] Samarth Gupta, Shreyas Chaudhari, Gauri Joshi, and Osman Yağan. Multi-armed bandits with correlated arms. IEEE Transactions on Information Theory, 67(10):6711–6732, 2021.
- Hao et al. [2020] Botao Hao, Tor Lattimore, and Csaba Szepesvari. Adaptive exploration in linear contextual bandit. In International Conference on Artificial Intelligence and Statistics, pages 3536–3545. PMLR, 2020.
- Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 4868–4878, 2018.
- Jin et al. [2020] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
- Kirschner et al. [2021] Johannes Kirschner, Tor Lattimore, Claire Vernade, and Csaba Szepesvári. Asymptotically optimal information-directed sampling. In Conference on Learning Theory, pages 2777–2821. PMLR, 2021.
- Lai et al. [1985] Tze Leung Lai, Herbert Robbins, et al. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
- Lattimore and Szepesvari [2017] Tor Lattimore and Csaba Szepesvari. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pages 728–737. PMLR, 2017.
- Lattimore and Szepesvári [2020] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
- Li et al. [2019] Yingkai Li, Yining Wang, and Yuan Zhou. Nearly minimax-optimal regret for linearly parameterized bandits. In Conference on Learning Theory, pages 2173–2174. PMLR, 2019.
- Magureanu et al. [2014] Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bound and optimal algorithms. In Conference on Learning Theory, pages 975–999. PMLR, 2014.
- Mao [2021] Cheng Mao. Hypothesis testing: Lecture notes for math 6263 at georgia tech. 2021.
- Marjani et al. [2022] Aymen Al Marjani, Tomáš Kocák, and Aurélien Garivier. On the complexity of all -best arms identification. arXiv preprint arXiv:2202.06280, 2022.
- Mason et al. [2020] Blake Mason, Lalit Jain, Ardhendu Tripathy, and Robert Nowak. Finding all -good arms in stochastic bandits. Advances in Neural Information Processing Systems, 33:20707–20718, 2020.
- Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
- Ok et al. [2018] Jungseul Ok, Alexandre Proutiere, and Damianos Tranos. Exploration in structured reinforcement learning. In Advances in Neural Information Processing Systems, pages 8874–8882, 2018.
- Rajeev et al. [1989] A Rajeev, Demosthenis Teneketzis, and Venkatachalam Anantharam. Asymptotically efficient adaptive allocation schemes for controlled iid processes: Finite parameter space. IEEE Transactions on Automatic Control, 34(3), 1989.
- Sampson and Guttorp [1992] Paul D Sampson and Peter Guttorp. Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association, 87(417):108–119, 1992.
- Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
- Simchowitz and Jamieson [2019] Max Simchowitz and Kevin G Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. In Advances in Neural Information Processing Systems, pages 1153–1162, 2019.
- Tirinzoni et al. [2020] Andrea Tirinzoni, Alessandro Lazaric, and Marcello Restelli. A novel confidence-based algorithm for structured bandits. In International Conference on Artificial Intelligence and Statistics, pages 3175–3185. PMLR, 2020.
- Tirinzoni et al. [2021] Andrea Tirinzoni, Matteo Pirotta, and Alessandro Lazaric. A fully problem-dependent regret lower bound for finite-horizon mdps. arXiv preprint arXiv:2106.13013, 2021.
- Tirinzoni et al. [2022] Andrea Tirinzoni, Aymen Al-Marjani, and Emilie Kaufmann. Near instance-optimal pac reinforcement learning for deterministic mdps. arXiv preprint arXiv:2203.09251, 2022.
- Van de Geer [2000] Sara Van de Geer. Empirical Processes in M-estimation. Cambridge University Press, 2000.
- Van Erven and Harremos [2014] Tim Van Erven and Peter Harremos. Rényi divergence and kullback-leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
- Vinyals et al. [2019] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
- Xu et al. [2021] Haike Xu, Tengyu Ma, and Simon S Du. Fine-grained gap-dependent bounds for tabular mdps via adaptive multi-step bootstrap. arXiv preprint arXiv:2102.04692, 2021.
- Yang et al. [2021] Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2021.
- Zanette and Brunskill [2019] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312, 2019.
Appendix A Proofs for Finite Hypothesis Class
In this section we show the missing proofs in Section 4.
A.1 Proof of Lemma 4.1
In this section, we prove Lemma 4.1.
Proof of Lemma 4.1.
Recall that . Let be the sequence of decisions in the initialization step of Alg. 1, and the corresponding observations. Define
(21) |
For a finite hypothesis class we have . Then, by Lemma A.2, for any we get
(22) |
Let . By definition of we get, for all
(23) | ||||
(24) | ||||
(25) |
By union bound, with probability at least we have
(26) |
which implies that
(27) |
Recalling that , Eq. (27) implies
By algebraic manipulation, for large enough we have
(28) |
As a result, for large enough we get In addition, the regret of Step 1 is upper bounded by ∎
A.2 Proof of Lemma 4.2
In this section, we prove Lemma 4.2. To this end, we need the following lemma.
Lemma A.1.
Consider an instance and . Let and Define
(29) |
as the value for which Condition 1 holds with the corresponding parameters.
Consider any such that . Let be a list of decisions where a decision occurs times for every , and .
Define the set . For any constant , there exists such that for all ,
(30) |
In addition, we have and
Proof.
To prove this lemma, we invoke Condition 1 with proper parameters.
First we bound the value of . By the assumption of this lemma, we have . Consequently, for large enough we get
(31) |
On the other hand,
(32) |
Define Then for large enough we have We can also lower bound . By the upper bound of and the fact that , we get
We invoke Condition 1 with parameters . Note that defined in Eq. (29) satisfies because . Therefore we get
(33) |
In the following, we prove that for all . First of all, for large enough , for all we get
(34) | ||||
(35) |
where the last inequality comes from the fact that
Now for a fixed , consider the following two cases.
Case 1:
Case 2:
Combining the two cases together we get the desired result. The lower bound of follows directly from Condition 1. ∎
Now we are ready to prove Lemma 4.2.
Proof of item (a).
In this case we have By Lemma 4.3 we get
(41) | ||||
(42) |
Proof of item (b).
Recall that in this case we have . Since is the solution to (Line 4 of Alg. 1), we have for all . Recall that is the list of decisions computed by Line 5 of Alg. 1. By Lemma A.1 we get
(43) |
Let . By Lemma 4.3, for every we have
(44) |
Lemma A.1 also yields and . Therefore . Consequently, for large enough we have
(45) |
Applying union bound, under the event we get
(46) |
Proof of item (c).
Recall that is the sub-optimality gap of decision on instance , and is the maximum decision gap of instance . Since is the solution to , we have As a result, the regret of Step 2 is upper bounded by
(47) |
which proves the second part of (c). For the first part, when we have
(48) | ||||
(49) | ||||
(50) |
By Lemma B.1, . As a result,
(51) |
∎
A.3 Proof of Lemma 4.3
In this section, we prove Lemma 4.3. To this end, we need the following concentration inequality.
Lemma A.2.
Consider two sequences of distributions and , and a sequence of random variables independently drawn from . For any and we have
(52) |
Proof of Lemma A.2.
Now we are ready to prove Lemma 4.3.
A.4 Proof of Theorem 3.3
Proof of Theorem 3.3.
We upper bound the regret of Alg. 1 by discussing the regret under the following events separately. Let and be the regret incurred in Steps 1, 2, and 3, respectively. Let be the event that the initial estimation is accurate.
Regret of Step 1:
By Lemma 4.1 we have,
Regret of Step 2.
Regret of Step 3.
First focus on the event . Under event , there is no regret in Step 3. On the other hand, by item (b) of Lemma 4.2 we have Since UCB gives logarithmic regret, we have
(70) |
As a result,
(71) |
Now we focus on the event . Let Under event the algorithm has no regret in Step 3:
(72) |
On the other hand, consider the event . By item (a) of Lemma 4.2 we have
(73) |
Under the event , Step 3 incurs logarithmic regret. As a result,
(74) |
Combining Eqs. (72), (73) and (74) we get
(75) |
Therefore, combining Eqs. (71) and (75),
(76) |
Stitching the regrets from steps 1-3 together we prove Theorem 3.3. ∎
Appendix B Missing Proofs in Section 3.1
In this section, we present the missing proofs in Section 3.1.
B.1 Proof of Lemma 3.1
To prove Lemma 3.1, we need the following lemma.
Lemma B.1.
For any , let Consider such that for all . Then is a valid solution to for all As a corollary, for all
Proof of Lemma B.1.
Recall that is the reward gap of decision under instance , and is the minimum decision gap of the instance .
Let For any instance , consider the following two decisions: We claim that
(77) |
This claim can be proved by contradiction. Suppose on the contrary that
Then we have
which contradicts the fact that
Recall that As a result, . Now, by Pinsker’s inequality, for any we have
(78) |
Combining with Eq. (77), for any we get
(79) |
Since this inequality holds for every , we prove the desired result. ∎
Now we present the proof of Lemma 3.1.
B.2 Proof of Theorem 3.2
Now we show the proof of Theorem 3.2.
Proof of Theorem 3.2.
In the following we fix a consistent algorithm . Consider any instance such that .
Recall that is the reward gap of decision under instance and is the minimum decision gap of the instance . Let . Consider two distributions (over observations and decisions) induced by running steps of algorithm on instances respectively. In addition, let be the random variable indicating the number of times decision is executed, and the random variable indicating the decision executed at step .
Let and be the regret of running algorithm on instances and respectively. By definition we have
(80) |
By basic inequality of KL divergence [Lattimore and Szepesvari, 2017, Lemma 5] we get,
(81) |
Combining Eq. (80) and Eq. (81) we get,
(82) |
Now applying the chain rule for KL divergence (see, e.g., [Cover, 1999, Theorem 2.5.3]), we get
(83) |
Therefore we have
(84) |
Consider the mixture of decisions
Since and , satisfies the constraint of for large enough . In addition, the expected regret of algorithm is
(85) | ||||
(86) | ||||
(87) |
Since is consistent, for any we have Consequently, for any ,
(88) | ||||
(89) | ||||
(90) |
Since is an arbitrary constant, we get
(91) |
∎
B.3 Instantiation of the Complexity Measure
In this section, we instantiate our complexity measure on concrete settings.
First we consider multi-armed bandits with unit Gaussian reward. The family of decisions is . Let be the mean rewards of each decision. An instance in this case is characterized by the mean rewards, and
Proposition B.2.
For a multi-armed bandit instance with a unique optimal decision and unit Gaussian noise, let be the mean reward of each action. Then
where
Proof.
We prove this proposition by constructing a solution to the optimization problem for large enough .
Without loss of generality, we assume for all Consider the action frequency
(92) |
Then when is large enough we have On the other hand, consider any such that Suppose In the following we show that
(93) |
Let be the mean reward of instance . For multi-armed bandits with unit Gaussian noise, we have
(94) |
By the condition that , we get Combining with the fact that we get
(95) | ||||
(96) | ||||
(97) |
By the definition of , we have
(98) |
As a result,
(99) |
It follows that
(100) |
∎
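As a concrete sanity check of Proposition B.2, one can compare the value of the linear program with the classical sum of inverse gaps; for unit-variance Gaussian rewards, $D_{\mathrm{KL}}(\mathcal{N}(\mu, 1) \,\|\, \mathcal{N}(\nu, 1)) = (\mu - \nu)^2 / 2$, and the binding alternatives raise one sub-optimal arm just above the optimal mean. The instance below and the use of `scipy` are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

mu = np.array([1.0, 0.6, 0.2])                  # mean rewards; arm 0 is optimal
gaps = mu.max() - mu
n = 10 ** 8                                     # large enough that C(f, n) has stabilized

# binding alternatives: raise one sub-optimal arm slightly above the optimal mean
alternatives = []
for a in range(1, len(mu)):
    g = mu.copy()
    g[a] = mu.max() + 1e-6
    alternatives.append(g)

# constraints: sum_a w_a * (mu_a - g_a)^2 / 2 >= 1 for every alternative g
A_ub = np.array([[-(mu[a] - g[a]) ** 2 / 2 for a in range(len(mu))] for g in alternatives])
res = linprog(c=gaps, A_ub=A_ub, b_ub=-np.ones(len(alternatives)),
              bounds=[(0, n)] * len(mu), method="highs")
print("LP value:        ", res.fun)
print("sum of 2 / gap_a:", sum(2.0 / d for d in gaps[1:]))   # classical inverse-gap value
```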
In the following, we focus on a linear bandit instance where the action set is . The mean of an action is given by for some . Let be the optimal action, and be the sub-optimality gap of action . Define . A linear bandit instance is characterized by the vector , and
We assume that is discrete, full rank, and for all Then we have the following proposition.
Proposition B.3.
For a linear bandit instance with a unique optimal decision and unit Gaussian noise, our complexity measure recovers that in Lattimore and Szepesvari [2017]. That is,
(101) | ||||
s.t. | (102) |
where
Proof.
Let be the solution to the RHS of Eq. (101). In the following we construct a solution to our optimization problem from . Recall that
(103) | ||||
s.t. | (104) | |||
(105) |
For any and any instance associated with parameter , we have
(106) |
Consider any such that . Suppose . It follows that Recall that By algebraic manipulation we have,
(107) |
Therefore to satisfy Eq. (104), it’s enough to construct a solution such that
(108) |
Define When the action set is full rank, is positive definite (see Lattimore and Szepesvari [2017, Appendix C]). We use to denote the maximum/minimum singular value of a matrix respectively. Then for any , consider the following solution
(109) |
For large enough we get In the following, we prove that satisfies Eq. (107). Let
for shorthand. Since , we have Therefore Then for any , by Eq. (102) we have
(110) | ||||
(111) |
Invoking Lemma G.7 we get
which implies Eq. (108). As a result, is a valid solution to . Consequently,
(112) |
By definition we have . Consequently, ∎
B.4 A Sufficient Condition for Condition 1
In this section, we discuss a sufficient but more interpretable condition for Condition 1.
Proposition B.4.
Proof.
For any fixed , and , by Lemma G.8 we get
(113) |
Therefore we have
(114) |
Since is monotonically decreasing with , we prove the desired result. ∎
B.5 Comparison with Existing Lower Bounds
Recall that when applied to tabular RL problems, our lower bound is very similar to that in Tirinzoni et al. [2021], and the only difference is that their optimization problems omit the second constraint (Eq. (4)). On a high level, the constraint is necessary to prevent degenerate cases and guarantee mathematical rigor.
The value can be different from the solution without this constraint for some artificial hypothesis class . For example, we can construct a multi-armed bandit hypothesis class (where represents the mean reward of each arm). Then for , for every (because there exist other instances in whose mean reward of action 1 is arbitrarily close to 0.5). As a result, by the definition of limits, and in this case . However, without the constraint , the solution would be , achieved by letting and .
Appendix C Extension to Infinite Hypothesis Class
In this section, we extend our results to the infinite hypothesis case.
C.1 Proof of Theorem 5.2
To prove Theorem 5.2, we require the following lemmas for steps 1 and 2 by analogy with the finite hypothesis case.
Lemma C.1 (Main lemma for Initialization).
Lemma C.2 (Main lemma for Identification).
C.2 Proof of Lemma C.1
Proof of Lemma C.1.
We prove the two items in this lemma separately.
Proof of item (a):
Let be the constant from Condition 1. Set , , and . Let be the list of decisions run by Step 1, and the corresponding observations. Recall that by definition,
(115) |
Combining with the fact that , we have
(116) |
Let We will prove that for all , we have . Combining with Eq. (116) we get
To this end, we apply Lemma F.1 with parameters . Following Lemma F.1, define
Then we have Let be the value that satisfies Condition 1, and
Recall that the condition of Lemma F.1 states
(117) |
The failure probability in this case is . By Condition 1, for large enough we get By Condition 2 we get
As a result, when is large enough
(118) | ||||
(119) |
where the last inequality comes from the definition of , i.e., . Recall that . When is large enough, the condition of Lemma F.1 (i.e., Eq. (117)) is satisfied.
Proof of item (b):
Now we focus on item (b). For any fixed , by Pinsker’s inequality and Eq. (121) we get
(122) |
By assumption we have almost surely for both and . It follows that
(123) |
Then we prove item (b) with and
Proof of item (c):
Since , (c) follows from (b) directly when is large enough.
Proof of regret:
The number of samples collected in Step 1 is upper bounded by As a result, the regret is upper bounded by
(124) |
∎
C.3 Proof of Lemma C.2
Item (a):
First we prove item (a) of Lemma C.2. By Markov inequality, for any , we have
(125) | ||||
(126) | ||||
(127) |
The last equality follows from the fact that and are both probability distributions given any decision
Recall that in this case . Therefore,
(128) | ||||
(129) |
Item (b):
Let and . We prove this statement by invoking Lemma F.1 with parameters . Following Lemma F.1, let and be the value that satisfies Condition 1. Let
First of all, we prove that the condition for Lemma F.1 holds. That is,
(130) |
Recall that in Alg. 1, is the solution of , , and . As a result, Now consider the RHS of Eq. (130). By the definition of we get , so and . It follows from Condition 1 that By the definition of and Condition 2 we get
(131) |
Consequently, when is sufficiently large, Eq. (130) holds. By Lemma F.1 we get, with probability at least ,
(132) |
where . In the following, we prove that Eq. (132) implies Recall that is the event defined as follows:
(133) |
Recall that Next, we apply Lemma G.4 to show that To verify the condition of Lemma G.4, we have for all by item (a) of Lemma C.1. By the definition of we get
(134) |
Therefore when is large enough,
(135) |
Then . It follows from Eq. (132) that
(136) |
Finally, by Condition 3 we get for any . As a result
(137) |
When , applying the basic inequality we get
(138) |
where the last inequality comes from item (a) of Lemma C.1 and Condition 3. Therefore, for large enough we get As a result,
(139) |
Since is arbitrary, we have
(140) | ||||
(141) |
Item (c):
By the definition of , we have for every . As a result, . Therefore, the expected regret of Step 2 is upper bounded by
(142) |
Item (d):
Recall that is the solution of As a result, the regret of Step 2 is upper bounded by
(143) |
where is the sub-optimality gap of decision under instance . In the following, we prove that
(144) |
Let be the solution to . Define Let and . We will show that is also (approximately) a solution of
By the definition of , for any we have
(145) |
Let be the list of decisions that appears times for every . Define and Next we apply Lemma G.4 with parameters To verify its condition, item (a) of Lemma C.1 gives
(146) |
which satisfies the condition of Lemma G.4. Consequently, we get for every . Therefore,
(147) |
By item (c) of Lemma C.1, . When is large enough, we have Therefore, satisfies all the constraints of . Recall that is the solution to . By the optimality of we have
(148) |
By item (b) of Lemma C.1, As a result,
(149) | ||||
(150) |
In addition,
(151) |
Stitching the inequalities above we have
(152) |
As a result, the regret in Step 2 is bounded by
∎
Appendix D Proof of Conditions for Tabular Reinforcement Learning
In this section, we first prove that Conditions 1-3 hold for tabular RL with truncated Gaussian rewards. Consequently, Theorem 5.2 gives the first asymptotic instance-optimal regret bound in this setting. Later, in Appendix E, we extend our analysis to Gaussian rewards.
First of all, we define some additional notation. A finite-horizon Markov Decision Process (MDP) is denoted by a tuple . and are the state and action spaces, respectively. denotes the horizon. The transition function maps a state-action pair to a distribution over , and the reward function maps a state-action pair to a truncated Gaussian distribution. In particular, for a fixed , the mean reward is , and the density of the reward is
(153) |
where is the normalization factor. We use a fixed truncation for every state-action pair regardless of because otherwise the KL divergence between two instances can easily be infinite. Without loss of generality, we assume is partitioned into where for any state and action , . We also assume that the initial state is fixed.
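To make the truncated reward model concrete, here is a minimal numerical sketch of the density in Eq. (153). The truncation interval [0, 1] and the unit pre-truncation variance are illustrative assumptions; the text above only fixes that the truncation is the same for every state-action pair.

```python
from scipy.stats import norm

def truncated_gaussian_pdf(r, mean, lo=0.0, hi=1.0):
    """Density of a unit-variance Gaussian with the given mean, truncated to [lo, hi].

    The normalization factor is the Gaussian mass of [lo, hi], so the density
    integrates to 1 over the truncation interval and vanishes outside it.
    """
    if r < lo or r > hi:
        return 0.0
    z = norm.cdf(hi, loc=mean) - norm.cdf(lo, loc=mean)  # normalization factor
    return norm.pdf(r, loc=mean) / z

# Example: density of reward 0.7 when the mean reward is 0.5.
print(truncated_gaussian_pdf(0.7, mean=0.5))
```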
A decision (a.k.a. policy) is a deterministic mapping from states to actions, and is the set of possible decisions with An observation is a trajectory . The reward of an observation is We emphasize that are discrete random variables and is a continuous random variable. In the following, we use to denote the set of state-action pairs in and the set of rewards. As a basic property of MDPs, the rewards are independent conditioned on , and .
An instance is represented by the transition function and the reward function (which is uniquely determined by ). The family of instances is defined as the set of all possible instances.
For an instance and a decision , the density of an observation is
(154) |
For simplicity, we also use to denote the distribution , and to denote .
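As a concrete illustration of the factorization in Eq. (154), the sketch below computes the log-density of a trajectory as a sum of per-step log transition probabilities and log reward densities. The dictionary `P` and the function `reward_pdf` are hypothetical stand-ins for an instance's transition and reward model.

```python
import math

def trajectory_log_density(trajectory, P, reward_pdf):
    """Log-density of a trajectory under a tabular instance.

    trajectory: list of (state, action, reward) tuples, one per step.
    P:          P[(s, a)][s_next] = transition probability of the instance.
    reward_pdf: reward_pdf(s, a, r) = density of reward r at (s, a).
    """
    log_p = 0.0
    for h, (s, a, r) in enumerate(trajectory):
        log_p += math.log(reward_pdf(s, a, r))          # reward density term
        if h + 1 < len(trajectory):
            s_next = trajectory[h + 1][0]
            log_p += math.log(P[(s, a)][s_next])        # transition term
    return log_p

# Tiny usage example: two states, one action, and a dummy uniform reward density.
P = {("s1", "a"): {"s1": 0.3, "s2": 0.7}, ("s2", "a"): {"s1": 1.0}}
uniform_pdf = lambda s, a, r: 1.0   # uniform density on [0, 1], for illustration only
tau = [("s1", "a", 0.4), ("s2", "a", 0.9)]
print(trajectory_log_density(tau, P, uniform_pdf))      # equals log(0.7)
```

Because the rewards are independent given the state-action pairs, the per-step terms simply add up in the log-density.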
By definition, for all we have We also define
(155) |
to be the marginal density of and
the conditional distribution of
For an instance , we define Since are finite, we have .
D.1 Proof of Theorem 5.1
In this section, we prove Theorem 5.1.
Proof.
In the following, we prove the three conditions separately.
Proof of Condition 1.
Proof of Condition 2.
The first part of this condition is proved by Lemma D.2. On the other hand, we have
(156) |
Proof of Condition 3.
To prove the first part of this condition, recall that
(157) |
where is the transition function and is the reward distribution. Let
As a result, for all . We prove the second part of this condition in Lemma D.3. ∎
D.2 Proof of Condition 1
In this section, we present a lemma that establishes Condition 1 for tabular RL.
Lemma D.1.
Consider any fixed RL instance with discrete states and actions and a general reward distribution. Suppose there exists a constant such that for any , the reward distributions of instances and at state and action (denoted by respectively) satisfy
(158) |
Then for every , Condition 1 holds with
(159) |
Proof of Lemma D.1.
Let
(160) |
Recall that for reinforcement learning, an observation represents a trajectory, and denotes the state-action pairs in the trajectory. Consider any fixed decision . For any , we prove the following two cases separately.
Case 1:
In this case, we prove that
Let By the condition of this case we have Applying Lemma G.1 we get
(161) |
In the following we prove the RHS of Eq. (161) is larger than . We start with Hölder’s inequality and the basic inequality that for any :
(162) | ||||
(163) | ||||
(164) | ||||
(165) | ||||
(166) | ||||
(167) |
Recall the basic inequality for all . Therefore,
(168) | ||||
(169) | ||||
(170) | ||||
(171) |
Recall the basic inequality that for all . Since we have
when we get
(172) | ||||
(173) |
By the definition of we get
(174) |
which leads to
Case 2:
By Lemma G.8, for any we get
(175) |
Let be the reward distributions of instance and respectively, and the densities of the reward given state-action pair . Recall that for a trajectory we have
(176) |
By Hölder’s inequality we get
(177) | ||||
(178) | ||||
(179) | ||||
(180) | ||||
(181) |
where the last inequality comes from item (c) of Condition 4. Therefore, when we get
(182) | ||||
(183) | ||||
(184) | ||||
(185) | ||||
(186) |
Recall that
(187) |
By algebraic manipulation we get
(188) |
which proves the desired result. ∎
D.3 Proof of Condition 2
Recall that for two instances , their distance is
Lemma D.2.
Suppose that represents tabular RL with truncated Gaussian rewards, with state space and action space . Then we have for every .
Proof.
For instances and state-action pair , let be its reward distribution and its transition. Recall that for tabular RL we have
(189) |
where the observation is . Consequently,
Therefore, we construct a covering such that for every , there exists with
(190) | |||
(191) |
Since is a discrete distribution, the covering number for the transition function is upper bounded by On the other hand, is a truncated Gaussian distribution, so the covering number for the reward is also upper bounded by As a result,
(192) |
∎
D.4 Proof of Condition 3
Lemma D.3.
Consider tabular reinforcement learning with truncated Gaussian reward. For a fixed instance , for all we have
(193) |
Proof.
Recall that for tabular RL, an observation is . We use to denote the collection of states and actions in the trajectory , and the collection of rewards. For instances and state-action pair , let be its reward distribution and its transition.
Let Consider the random variables and By the chain rule of KL divergence we have
(194) |
Since is a discrete random variable, we have
(195) |
Therefore, for any ,
(196) |
where the last inequality comes from the fact that
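The chain rule of KL divergence used in this proof is the standard decomposition: the KL divergence of a joint distribution equals the KL divergence of the first component plus the expected KL divergence of the conditionals. A small numerical sanity check on a made-up two-stage discrete pair:

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Two-stage distributions: first X in {0, 1}, then Y | X in {0, 1}.
pX, qX = [0.6, 0.4], [0.5, 0.5]
pY_given = [[0.7, 0.3], [0.2, 0.8]]
qY_given = [[0.5, 0.5], [0.4, 0.6]]

# Joint distributions over (X, Y).
pXY = [pX[x] * pY_given[x][y] for x in range(2) for y in range(2)]
qXY = [qX[x] * qY_given[x][y] for x in range(2) for y in range(2)]

# Chain rule: KL of the joint = KL of the first stage
#             + expected KL of the conditionals under p.
lhs = kl(pXY, qXY)
rhs = kl(pX, qX) + sum(pX[x] * kl(pY_given[x], qY_given[x]) for x in range(2))
print(np.isclose(lhs, rhs))   # True
```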
Appendix E RL with General Reward Distribution
In this section, we extend our analysis to RL with general reward distributions, where the support of the reward may have infinite volume (e.g., the real line when the reward distribution is Gaussian). We assume that for any state-action pair , the reward distribution comes from a distribution family (e.g., when the reward is Gaussian). For any , let be its density function and its mean, and we assume We emphasize that for general reward distributions, we do not require Conditions 2 and 3 because they do not hold for the Gaussian distribution. Instead, we require the following.
Condition 4.
Let be the reward distribution family. Then
(a) for all , there exists a constant such that for every ,
(208)
(b) for all , ;
(c) for all ,
(d) for any , there exists a covering such that
(209) And
For tabular reinforcement learning problems, we only require Condition 4 for the reward distribution of any fixed state-action pair, which is one-dimensional. Condition 4 holds for the Gaussian distribution (see Proposition E.5), the Laplace distribution, etc.
Our main result for tabular RL is stated as follows.
Theorem E.1.
In Section E.1 we present the main lemmas for Steps 1-2. Then, Theorem E.1 is proved using exactly the same approach as Theorem 5.2 (by replacing the main lemmas for Steps 1-2).
E.1 Lemmas for RL with General Reward Distribution
In this section, we state the main lemmas for RL with general reward distribution (namely, Lemma E.2 for Step 1 and Lemma E.3 for Step 2).
Lemma E.2.
Let be the event that there exists a universal constant and a function that depends only on such that
(a) for all , ,
(b) , for all ,
(c)
For tabular reinforcement learning with general reward distribution that satisfies Condition 4, there exists such that when , In addition, the regret of Step 1 is upper bounded by
Proof.
Items (a) and (c):
Item (b):
Recall that an observation is a sequence of state-action-reward tuples: . Therefore, for all we have
(211) | ||||
(212) |
where and are the mean reward of instance , respectively. Since we have
(213) | ||||
(214) | ||||
(215) | ||||
(216) |
In the following we upper bound the second term of Eq. (216). Let be the reward distributions of instance and respectively, and the densities of the reward given state-action pair . By item (b) of Condition 4 and Cauchy-Schwarz we get
(217) | ||||
(218) | ||||
(219) | ||||
(220) |
Recall that denotes the transition function of instances respectively. By the chain rule of KL divergence,
(221) | ||||
(222) | ||||
(223) |
By item (a) of this lemma, as . Therefore for large enough we get
(224) |
Combining with item (a) of this lemma, we prove the desired result. ∎
Lemma E.3.
For tabular reinforcement learning with general reward distribution that satisfies Condition 4, there exists such that when , the following holds.
(a) Conditioned on the event , .
(b) Conditioned on the event , .
(c) The expected regret of Step 2 is always upper bounded by
(d) Conditioned on the event , the expected regret of Step 2 is upper bounded by
Proof.
We prove the four items above separately.
Items (a), (c), and (d):
The proofs of items (a), (c), and (d) are the same as those in the proof of Lemma C.2.
Item (b):
Let and . We prove this statement by invoking Lemma F.2 with parameters . Following Lemma F.2, let and be the value that satisfies Condition 1. Let
First of all, we prove that the condition for Lemma F.2 holds. That is,
(225) |
Recall that . As a result, Now consider the RHS of Eq. (225). By the definition of we get and . It follows from Condition 1 that By the definition of and Condition 2 we get
(226) |
Consequently, the RHS of Eq. (225) is at most , and Eq. (225) holds when is sufficiently large. By Lemma F.2 we get, with probability at least ,
(227) |
where .
In the following, we prove that Eq. (227) implies Recall that is the event defined as follows:
(228) |
Next, we apply Lemma G.4 to any To verify its condition, we have for all by item (a) of Lemma E.2. Therefore when is large enough,
(229) |
Then . It follows from Eq. (227) that
(230) |
By Lemma E.4, with probability at least we have
(231) |
For large enough , As a result, combining Eq. (230) and Eq. (231), with probability we have
(232) |
∎
The following lemma is used in the proof of Lemma E.3.
Lemma E.4.
Let be any fixed tabular RL instances with reward distributions that satisfy Condition 4. For any sequence of policies , let be a sequence of random observations drawn from . Then there exist a constant and that depend only on such that, for any , with probability at least ,
(233) |
assuming
Proof.
Recall that for reinforcement learning, an observation represents a trajectory, and denotes the state-action pairs in the trajectory. By algebraic manipulation, we get
(234) |
Recall that for the instance , we have
(235) |
Since both and are finite for tabular RL, we get On the one hand, for any , by Pinsker’s inequality we get
(236) |
As a result
When , applying the basic inequality we get for all and ,
(237) | ||||
(238) |
On the other hand, let and be the reward distribution of instance and given state-action pair respectively. Then
(239) |
For any , by the chain rule of KL divergence we get
(240) | ||||
(241) | ||||
(242) |
Because , for any we have and . As a result, for all , item (a) of Condition 4 implies that with probability at least
(243) | ||||
(244) | ||||
(245) |
Let . Applying a union bound, we get
(246) |
It follows that with probability at least ,
Therefore, Eq. (233) is satisfied by setting
(247) |
∎
E.2 Proof of Condition 4 for Gaussian Distribution
In this section, we prove that the reward distribution family satisfies Condition 4.
Proposition E.5.
Condition 4 holds for the reward distribution family
Proof.
In the following we prove the four items of Condition 4 respectively.
Item (a).
For any fixed , let be the mean of . In other words, Then we have
(248) |
By definition, for any , we have
Therefore when we get
(249) |
As a result, with probability at least we have
(250) |
In addition, when we have As a result, item (a) holds with .
Item (b).
Recall that for Gaussian distribution with unit variance, when we have Therefore item (b) holds with .
Item (c).
Recall that for any , . Therefore when we get
(251) | ||||
(252) |
Therefore, item (c) holds with
Item (d).
For , we have
(253) |
Therefore, we can set . Then ∎
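For intuition on item (d): for unit-variance Gaussians the KL divergence is half the squared difference of the means, so by Pinsker's inequality a grid of means gives a covering at the corresponding TV scale. The precise covering metric above is elided, so the check below only illustrates this bound under a TV-distance reading, with made-up mean gaps.

```python
from scipy.stats import norm

def tv_unit_gaussians(mu1, mu2):
    """Exact total variation distance between N(mu1, 1) and N(mu2, 1)."""
    d = abs(mu1 - mu2)
    return 2 * norm.cdf(d / 2) - 1

def pinsker_bound(mu1, mu2):
    """Pinsker: TV <= sqrt(KL / 2), and KL(N(mu1,1) || N(mu2,1)) = (mu1 - mu2)^2 / 2."""
    return abs(mu1 - mu2) / 2

for gap in [0.01, 0.1, 0.5]:
    print(gap, tv_unit_gaussians(0.0, gap), pinsker_bound(0.0, gap))
# The exact TV distance always lies below the Pinsker bound |mu1 - mu2| / 2,
# so a grid of means with spacing proportional to the target accuracy suffices.
```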
Appendix F Uniform Concentration
In this section, we present the uniform concentration results.
F.1 Uniform Concentration with Covering
In this section we prove uniform concentration results with covering. For two instances , define the distance as
(254) |
Let be the proper covering number of w.r.t. the distance Then we have the following lemma.
Lemma F.1.
Consider any fixed , list of decisions , . Let and let be the value that satisfies Condition 1. Let and define
Then for any , when we have
(255) |
Proof.
For any , we define an induced distribution for all :
(256) |
Because and
the induced distribution is a valid distribution.
Properties of the induced covering.
Let . Consider the minimum covering set of For any , let be its cover and the induced distribution of . Now, we prove that satisfies the following two properties:
(a) For any and , and
(b) for
First we prove item (a). Since is the cover of , we get for any and . As a result,
(257) |
By the basic inequality we get
(258) |
As a result, item (a) follows directly. Now we prove item (b). By algebraic manipulation,
(259) | ||||
(260) | ||||
(261) | ||||
(262) |
Applying Lemma G.5, we get item (b).
Uniform concentration via covering.
F.2 Uniform Concentration for RL with General Reward Distribution
In the following, we present a uniform concentration lemma for tabular RL with general rewards.
Lemma F.2.
Proof.
Recall that for a tabular RL instance , we use to denote its reward distribution given state-action pair , and its transition. We prove this lemma using a covering argument.
The covering set.
Define
Let be the minimum covering of such that for any (parameterized by ), there exists (parameterized by ) such that
(268) |
By item (d) of Condition 4 and a standard covering argument for discrete distributions, we have For any , we consider the induced instance defined by
(269) |
In the following, we prove that
(a) with probability at least ,
(270)
(b) for
To prove (a), recall that represents a trajectory. Consequently,
(271) | ||||
(272) | ||||
(273) |
By item (a) of Condition 4 and union bound we get with probability at least
(274) |
Therefore,
(275) |
which proves (a). For (b), by algebraic manipulation we have
(276) | ||||
(277) | ||||
(278) | ||||
(279) |
Applying Lemma G.5, we get item (b).
Uniform concentration via covering.
Appendix G Helper Lemmas
The following lemma shows that the Rényi divergence between the marginal distributions is no larger than that between the original (joint) distributions.
Lemma G.1 (Theorem 1 of Van Erven and Harremoës [2014]).
Consider any distributions over variables . Let denote their marginal distribution over respectively. For any , we have
(284) |
Proof.
When , we have As a result, we only need to prove that
(285) |
We prove this by Hölder’s inequality. In particular,
(286) | ||||
(287) | ||||
(288) | ||||
(289) |
∎
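A quick numerical illustration of this marginalization property for discrete distributions; the joint distributions below are made up purely for the check.

```python
import numpy as np

def renyi(p, q, alpha):
    """Renyi divergence D_alpha(p || q) for discrete distributions, alpha != 1."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1))

# Joint distributions over (X, Y) with X, Y in {0, 1} (rows indexed by X).
P = np.array([[0.3, 0.2], [0.1, 0.4]])
Q = np.array([[0.2, 0.2], [0.35, 0.25]])

alpha = 1.5
d_joint = renyi(P.ravel(), Q.ravel(), alpha)
d_marginal = renyi(P.sum(axis=1), Q.sum(axis=1), alpha)  # marginals over X
print(d_marginal <= d_joint)   # True: marginalizing cannot increase the divergence
```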
Intuitively, the following two lemmas upper bound the difference of Rényi divergence by the TV distance.
Lemma G.2.
For and fixed and distribution , consider two distributions such that Then we have
(290) |
Proof.
Let We start by proving
(291) |
By Hölder's inequality we get
(292) | ||||
(293) | ||||
(294) | ||||
(295) | ||||
(296) |
where Eq. (295) follows from the basic inequality for
Lemma G.3.
For and fixed and distribution , consider two distributions such that Then we have
(300) |
Proof.
We use a proof similar to that of Lemma G.2. Let Then we have
(301) | ||||
(302) | ||||
(303) | ||||
(304) |
The following lemma is used to prove that when is close to (measured by TV distance), we can use to approximately solve (see Lemma C.2).
Lemma G.4.
For an instance and . Let and For any such that , define to be a list of decisions in which a decision occurs times for every , and . Consider two instances such that there exists a constant with
Define the set Then for any constant , there exists such that for all ,
(308) |
Proof of Lemma G.4.
First we invoke Condition 1 with proper parameters. Define
(309) |
By the definition of and the fact that we get
(310) |
Let Consider . By the upper bound of we get
We invoke Condition 1 with parameters . For large enough we have , which implies that . Then for any we get
(311) |
We claim that for large enough . Indeed, by the definition of we get
(312) |
Let
By Lemma G.3 with parameters and the assumption that for all , we get
(313) |
Now we consider the following two cases.
Case 1:
Case 2:
For all , . In this case we also have . Therefore Eq. (313) and Eq. (311) imply that
(316) | ||||
(317) |
As a result,
(318) | ||||
(319) |
Combining the two cases, we get ∎
The following lemma is used to prove a nice property of the covering (see Lemma F.1).
Lemma G.5.
Consider any , sequence of decisions . Let and be the value that satisfies Condition 1. For two distributions such that
we have
(320) |
Case 1:
Case 2:
For all , . In this case we also have . Therefore Eq. (321) and Eq. (322) imply that
(325) | ||||
(326) |
As a result,
(327) | ||||
(328) |
Combining the two cases, we get ∎
The following lemma shows that for truncated Gaussian distributions, the difference in their density functions can be upper bounded by their KL divergence. We use this lemma to prove Condition 3 for tabular RL with truncated Gaussian rewards.
Lemma G.6.
Consider two truncated Gaussian distributions with density
(329) |
where is the normalization factor. Assuming , we have
(330) |
Proof.
We first prove that Then we show that .
By Pinsker’s inequality,
(331) |
Now we prove that W.l.o.g., we assume and Then we have, for any ,
(332) | ||||
(333) | ||||
(334) |
As a result,
(335) |
The following lemma can be viewed as a perturbation analysis of the covariance matrix in the linear bandit setting. We use the lemma to prove that our complexity measure recovers that in [Lattimore and Szepesvari, 2017].
Lemma G.7.
For a fixed positive definite matrix and a unit vector , let and Then
(344) |
Proof.
By the Sherman–Morrison formula we get
(345) | ||||
(346) |
Then for any such that , we get
(347) | ||||
(348) | ||||
(349) | ||||
(350) | ||||
(351) |
∎
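The Sherman–Morrison identity used in this proof can be checked numerically; the positive definite matrix and unit vector below are arbitrary choices for the check.

```python
import numpy as np

# Sherman-Morrison: (A + u u^T)^{-1} = A^{-1} - (A^{-1} u u^T A^{-1}) / (1 + u^T A^{-1} u)
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)          # a positive definite matrix
u = rng.standard_normal(4)
u = u / np.linalg.norm(u)        # a unit vector

A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, u))
rhs = A_inv - (A_inv @ np.outer(u, u) @ A_inv) / (1 + u @ A_inv @ u)
print(np.allclose(lhs, rhs))     # True
```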
The following lemma upper bounds the difference between KL divergence and Rényi divergence, and is used to prove Condition 1.
Lemma G.8.
For any two distributions and constant we have
(352) |
Proof.
Recall that
Define the function By basic algebra we get
(353) | ||||
(354) |
By Taylor expansion, there exists such that
(355) |
By definition we have and As a result, we get
(356) | ||||
(357) | ||||
(358) |
By Hölder’s inequality, when we get
(359) | ||||
(360) | ||||
(361) | ||||
(362) |
Combining the inequalities above we get the desired result. ∎
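The lemma quantifies how the Rényi divergence approaches the KL divergence as the order tends to 1. A small numerical illustration on made-up discrete distributions, showing the gap shrinking with the order:

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def renyi(p, q, alpha):
    return float(np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# The gap between the Renyi and KL divergences is nonnegative and shrinks
# as alpha decreases toward 1, consistent with the lemma's upper bound.
for alpha in [1.5, 1.1, 1.01, 1.001]:
    print(alpha, renyi(p, q, alpha) - kl(p, q))
```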