
Asymptotic Instance-Optimal Algorithms for Interactive Decision Making

Kefan Dong
Stanford University
[email protected]
   Tengyu Ma
Stanford University
[email protected]
Abstract

Past research on interactive decision making problems (bandits, reinforcement learning, etc.) mostly focuses on the minimax regret that measures the algorithm's performance on the hardest instance. However, an ideal algorithm should adapt to the complexity of a particular problem instance and incur smaller regrets on easy instances than on worst-case instances. In this paper, we design the first asymptotic instance-optimal algorithm for general interactive decision making problems with a finite number of decisions under mild conditions. On every instance $f$, our algorithm outperforms all consistent algorithms (those achieving non-trivial regrets on all instances), and has asymptotic regret $\mathcal{C}(f)\ln n$, where $\mathcal{C}(f)$ is an exact characterization of the complexity of $f$. The key step of the algorithm involves hypothesis testing with active data collection. It computes the most economical decisions with which the algorithm collects observations to test whether an estimated instance is indeed correct; thus, the complexity $\mathcal{C}(f)$ is the minimum cost to test the instance $f$ against other instances. Our results, instantiated on concrete problems, recover the classical gap-dependent bounds for multi-armed bandits (Lai et al., 1985) and prior works on linear bandits (Lattimore and Szepesvari, 2017), and improve upon the previous best instance-dependent upper bound (Xu et al., 2021) for reinforcement learning.

1 Introduction

Bandit and reinforcement learning (RL) algorithms have demonstrated a wide range of successful real-life applications (Silver et al., 2016, 2017; Mnih et al., 2013; Berner et al., 2019; Vinyals et al., 2019; Mnih et al., 2015; Degrave et al., 2022). Past works have theoretically studied the regret or sample complexity of various interactive decision making problems, such as contextual bandits, reinforcement learning, and partially observable Markov decision processes (see Azar et al. (2017); Jin et al. (2018); Dong et al. (2021); Li et al. (2019); Agarwal et al. (2014); Foster and Rakhlin (2020); Jin et al. (2020), and references therein). Recently, Foster et al. (2021) present a unified algorithmic principle for achieving the minimax regret—the optimal regret for the worst-case problem instances.

However, minimax regret bounds do not always present a full picture of the statistical complexity of the problem. They characterize the complexity of the most difficult instances, but many other instances are potentially much easier. An ideal algorithm should adapt to the complexity of a particular instance and incur smaller regrets on easy instances than on worst-case instances. Thus, an ideal regret bound should be instance-dependent, that is, depend on some properties of each instance. Prior works designed algorithms with instance-dependent regret bounds that are stronger than minimax regret bounds, but these bounds are often not directly comparable because they depend on different properties of the instances, such as gap conditions and the variance of the value function (Zanette and Brunskill, 2019; Xu et al., 2021; Foster et al., 2020; Tirinzoni et al., 2021).

A more ambitious and challenging goal is to design instance-optimal algorithms that outperform, on every instance, all consistent algorithms (those achieving non-trivial regrets on all instances). Past works designed instance-optimal algorithms for multi-armed bandit (Lai et al., 1985), linear bandits (Kirschner et al., 2021; Hao et al., 2020), Lipschitz bandits (Magureanu et al., 2014), and ergodic MDPs (Ok et al., 2018). However, instance-optimal regret bounds for tabular reinforcement learning remain an open question, despite recent progress (Tirinzoni et al., 2021, 2022). The challenge partly stems from the fact that the existence of such an instance-optimal algorithm is even a priori unclear for general interactive decision making problems. Conceivably, each algorithm can have its own specialty on a subset of instances, and no algorithm can dominate all others on all instances.

Somewhat surprisingly, we prove that there exists a simple algorithm (T2C, stated in Alg. 1) that is asymptotically instance-optimal for general interactive decision making problems with a finite number of decisions. We determine the exact leading term of the optimal asymptotic regret for instance $f$ to be $\mathcal{C}(f)\ln n$. Here, $n$ is the number of interactions and $\mathcal{C}(f)$ is a complexity measure for the instance $f$ that intuitively captures the difficulty of distinguishing $f$ from other instances (those with different optimal decisions) using observations. Concretely, under mild conditions on the interactive decision problem, our algorithm achieves an asymptotic regret $\mathcal{C}(f)\ln n$ (Theorem 5.2) for every instance $f$, while every consistent algorithm must have an asymptotic regret of at least $\mathcal{C}(f)\ln n$ (Theorem 3.2).

Our algorithm consists of three simple steps. First, it explores uniformly for an $o(1)$-fraction of the steps and computes the MLE estimate of the instance with relatively low confidence. Second, it tests whether the estimated instance (or, precisely, its associated optimal decision) is indeed correct using the most economical set of queries/decisions. Concretely, it computes a set of decisions with minimal regret such that, using a log-likelihood ratio test, it can either distinguish the estimated instance from all other instances (with different optimal decisions) with high confidence, or determine that the estimate was incorrect. Third, with the high-confidence estimate, it commits to the optimal decision of the estimated instance for the rest of the steps. The algorithmic framework essentially reduces the problem to the key second step — a problem of optimal hypothesis testing with active data collection.

Our results recover the classical gap-dependent regret bounds for multi-armed bandits (Lai et al., 1985) and prior works on linear bandits (Lattimore and Szepesvari, 2017; Hao et al., 2020). As an instantiation of the general algorithm, we present the first asymptotic instance-optimal algorithm for episodic tabular RL, improving upon prior instance-dependent algorithms (Xu et al., 2021; Simchowitz and Jamieson, 2019; Tirinzoni et al., 2021, 2022).

1.1 Additional Related Works

Instance-optimal and instance-dependent analysis.

Prior works have designed instance-optimal algorithms for specific interactive decision making problems. Variants of UCB algorithms are instance-optimal for bandits with various assumptions (Lattimore and Szepesvári, 2020; Gupta et al., 2021; Tirinzoni et al., 2020; Degenne et al., 2020; Magureanu et al., 2014), but are suboptimal for linear bandits (Lattimore and Szepesvari, 2017). These algorithms rely on the optimism-in-face-of-uncertainty principle to deal with exploration-exploitation tradeoff, whereas our algorithm explicitly finds the best tradeoff. There are also non-optimistic instance-optimal algorithms for linear bandits (Kirschner et al., 2021; Lattimore and Szepesvari, 2017; Hao et al., 2020), ergodic MDPs (Ok et al., 2018; Burnetas and Katehakis, 1997), and others (Graves and Lai, 1997; Rajeev et al., 1989).

Another line of research focuses on instance-optimal pure exploration algorithms for multi-armed bandits (Mason et al., 2020; Marjani et al., 2022; Chen et al., 2017b, a) and deterministic MDPs (Tirinzoni et al., 2022), where the goal is to use the minimum number of samples to identify an $\epsilon$-optimal decision.

Many algorithms' regret bounds depend on properties of the instance such as the gap condition. Foster et al. (2020) prove a gap-dependent regret bound for contextual bandits. For reinforcement learning, the regret bound may depend on the variance of the optimal value function (Zanette and Brunskill, 2019) or the gap of the $Q$-function (Xu et al., 2021; Simchowitz and Jamieson, 2019; Yang et al., 2021). Xu et al. (2021); Foster et al. (2020) also prove that the gap-dependent bounds cannot be improved on some instances. To some extent, these instance-dependent lower bounds can be viewed as minimax bounds for a fine-grained instance family (e.g., all instances with the same $Q$-function gap), and are therefore different from ours.

Instance-optimal algorithm via log-likelihood ratio test.

The idea of using the log-likelihood ratio test to design instance-optimal algorithms traces back to Rajeev et al. (1989); Graves and Lai (1997). Rajeev et al. (1989) design an asymptotic instance-optimal algorithm for general decision making problems with a finite hypothesis class, finite state-action space, and known rewards. The algorithm in Graves and Lai (1997) is instance-optimal for infinite-horizon MDPs where the Markov chain induced by every policy is uniformly recurrent, meaning that the probability of visiting every state is uniformly bounded away from 0. In comparison, our analysis applies to infinite hypothesis settings without the uniform recurrence assumption (e.g., episodic tabular RL).

2 Setup and Notations

Additional notations.

We use $\mathrm{poly}(n)$ to denote a polynomial in $n$. For two non-negative sequences $a_n, b_n$, we write $a_n = \mathcal{O}(b_n)$ if $\limsup_{n\to\infty} a_n/b_n < \infty$ and $a_n = o(b_n)$ if $\limsup_{n\to\infty} a_n/b_n = 0$.

Interactive decision making problems.

We focus on interactive decision making with structured observations (Foster et al., 2021), which includes bandits and reinforcement learning as special cases. An interactive decision making problem is defined by a family of decisions $\Pi$, a space of observations $O$, a reward function $R: O \to \mathbb{R}$, and a function $f$ (also called an instance) that maps a decision $\pi \in \Pi$ to a distribution over observations $f[\pi]$. We use $f[\pi](\cdot): O \to \mathbb{R}_+$ to denote the density function of the distribution $f[\pi]$. We assume that the reward $R$ is a deterministic and known function.

An environment picks a ground-truth instance $f^{\star}$ from an instance family $\mathcal{F}$, and then an agent (which knows the instance family $\mathcal{F}$ but not the ground truth $f^{\star}$) interacts with the environment for $n$ total rounds. In round $t \leq n$,

  1. the learner selects a decision $\pi_{t}$ from the decision class $\Pi$, and

  2. the environment generates an observation $o_{t}$ following the ground-truth distribution $f^{\star}[\pi_{t}]$ and reveals the observation. Then the agent receives a reward $R(o_{t})$.

As an example, multi-armed bandits, linear bandits, and reinforcement learning all belong to interactive decision making problems. For bandits, a decision $\pi$ corresponds to an action and an observation $o$ corresponds to a reward. For reinforcement learning, a decision $\pi$ is a deterministic policy that maps from states to actions, and an observation $o$ is a trajectory (including the reward at each step). In other words, a round of interaction in the interactive decision making problem corresponds to an episode of the reinforcement learning problem. We will formally discuss the reinforcement learning setup in Section 5.

Let $R_f(\pi) = \mathbb{E}_{o \sim f[\pi]}[R(o)]$ be the expected reward of decision $\pi$ under instance $f$, and $\pi^{\star}(f) \triangleq \operatorname*{argmax}_{\pi} R_f(\pi)$ the optimal decision of instance $f$. The expected regret measures how much worse the agent's decisions are than the optimal decision:

$$\mathrm{Reg}_{f,n}=\mathbb{E}_{f}\left[\sum_{t=1}^{n}\Big(\max_{\pi\in\Pi}R_{f}(\pi)-R_{f}(\pi_{t})\Big)\right]. \tag{1}$$

We consider the case where the decision family $\Pi$ is finite, and every instance $f \in \mathcal{F}$ has a unique optimal decision, denoted by $\pi^{\star}(f)$. We also assume that for every $f \in \mathcal{F}$ and $\pi \in \Pi$, $0 \leq R(o) \leq R_{\rm max}$ almost surely when $o$ is drawn from the distribution $f[\pi]$, and $\sup_o f[\pi](o) < \infty$.
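To make the protocol concrete, the following is a minimal Python sketch (ours, not from the paper) of the interaction loop for a Bernoulli multi-armed bandit viewed as an interactive decision making problem; the class and function names are purely illustrative.

```python
import numpy as np

class BernoulliBanditInstance:
    """An instance f: each decision (arm) pi maps to a Bernoulli observation distribution f[pi]."""
    def __init__(self, means):
        self.means = np.asarray(means)               # R_f(pi) = E_{o ~ f[pi]}[R(o)] = means[pi]

    def sample(self, pi, rng):
        return float(rng.random() < self.means[pi])  # draw an observation o ~ f[pi]; here R(o) = o

def run_interaction(instance, policy, n, seed=0):
    """Play n rounds; policy(history) returns the next decision. Returns the expected regret."""
    rng = np.random.default_rng(seed)
    history, regret = [], 0.0
    best = instance.means.max()
    for _ in range(n):
        pi = policy(history)                         # the learner selects a decision pi_t
        o = instance.sample(pi, rng)                 # the environment reveals an observation o_t
        history.append((pi, o))
        regret += best - instance.means[pi]          # accumulate the expected sub-optimality gap
    return regret

# Usage: a policy that always plays arm 0 on the instance with means (0.5, 0.9).
f = BernoulliBanditInstance([0.5, 0.9])
print(run_interaction(f, lambda history: 0, n=1000))  # roughly 0.4 * 1000
```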

Consistent algorithms and asymptotic instance-optimality.

We first note that it is unreasonable to ask for an algorithm that outperforms or matches every other algorithm on every instance. This is because, for any instance $f \in \mathcal{F}$, a bespoke algorithm that always outputs $\pi^{\star}(f)$ achieves zero regret on instance $f$ (though it has terrible regret on other instances). Therefore, if an algorithm could outperform or match any algorithm on any instance, it would have to achieve zero regret on every instance, which is generally impossible. Instead, we are interested in finding an algorithm that is as good as any other reasonable algorithm that is not merely customized to a single instance at the cost of completely failing on other instances.

We say an algorithm is consistent if its expected regret satisfies $\mathrm{Reg}_{f,n} = o(n^p)$ for every $p > 0$ and $f \in \mathcal{F}$ (Lai et al., 1985). Most reasonable algorithms are consistent, such as the UCB algorithm for multi-armed bandits (Lattimore and Szepesvári, 2020), and the UCBVI and UCB-Q algorithms for tabular reinforcement learning (Simchowitz and Jamieson, 2019; Yang et al., 2021), because all of them achieve asymptotically $\mathcal{O}(\ln n)$ regret on any instance, where $\mathcal{O}$ hides constants that depend on properties of the particular instance. (The regret scales in $\ln n$ here because we are in the asymptotic setting where the instance is fixed and $n$ tends to infinity. If the instance can depend on $n$, which is not the setting we are interested in, then the minimax regret typically scales as $O(\sqrt{n})$.) However, consistent algorithms exclude the algorithm mentioned in the previous paragraph, which is purely customized to a particular instance.

We say an algorithm is asymptotically instance-optimal if on every instance $f \in \mathcal{F}$, $\limsup_{n\to\infty} \mathrm{Reg}_{f,n}/\ln n$ is the smallest among all consistent algorithms (Lai et al., 1985). We note that even though an instance-optimal algorithm only needs to perform as well as every consistent algorithm, a priori it is still unclear whether such an algorithm exists.

Some prior works have also used a slightly weaker definition in place of consistent algorithms, e.g., $\alpha$-uniformly good algorithms in Tirinzoni et al. (2021), which allows a sublinear regret bound $\mathcal{O}(n^{\alpha})$ for some constant $\alpha < 1$. The alternative definition, though it apparently includes more algorithms, does not change the essence. Our algorithm is still instance-optimal up to a constant factor—a simple modification of the lower bound part of the proof shows that its asymptotic regret is at most a $(1-\alpha)^{-1}$ factor larger than that of any $\alpha$-uniformly good algorithm on any instance. This paper thus primarily compares with consistent algorithms for simplicity.

3 Main Results

In this section, we present an intrinsic complexity measure $\mathcal{C}(f)$ of an instance $f$ (Section 3.1), and an instance-optimal algorithm that achieves an asymptotic regret $\mathcal{C}(f)\ln n$ (Section 3.2).

3.1 Complexity Measure and Regret Lower Bounds

In this section, our goal is to understand the minimum regret of consistent algorithms, which is an intrinsic complexity measure of an instance. We define an instance-dependent complexity measure $\mathcal{C}(f)$ and prove that any consistent algorithm must have asymptotic regret at least $\mathcal{C}(f)\ln n$.

The key observation is that any consistent algorithm has to collect enough information from the observations to tell the ground-truth instance apart from other instances with different optimal decisions. Consider the situation where a sequence of $n$ decisions is insufficient to distinguish two instances, denoted by $f$ and $g$, with different optimal decisions. If an algorithm achieves a sublinear regret on $f$, the sequence must contain $\pi^{\star}(f)$ (the optimal decision for $f$) at least $n - o(n)$ times. As a result, if the true instance were $g$, the algorithm would suffer a linear regret (due to the linear number of plays of $\pi^{\star}(f)$), and therefore cannot be consistent.

An ideal algorithm should find the decisions that most efficiently identify the ground-truth instance (or, precisely, the family of instances with the same optimal decision as the ground truth). However, decisions collect information but also incur regret. So the algorithm should pick a list of decisions with the best tradeoff between these two effects—minimizing the decisions' regret while collecting sufficient information to identify the true instance. Concretely, suppose a sequence of decisions includes $w_{\pi}$ occurrences of decision $\pi$. The regret of these decisions is $\sum_{\pi\in\Pi} w_{\pi}\Delta(f,\pi)$, where $\Delta(f,\pi) \triangleq R_f(\pi^{\star}(f)) - R_f(\pi)$ is the sub-optimality gap of decision $\pi$. We use the KL divergence between the observations of $\pi$ under two instances $f$ and $g$ (denoted by $D_{\mathrm{KL}}(f[\pi]\|g[\pi])$) to measure $\pi$'s power to distinguish them. The following optimization problem defines the optimal mixture of decisions (in terms of regret) that has sufficient power to distinguish $f$ from all other instances with a different optimal decision.

$$\begin{aligned}
\mathcal{C}(f,n) \;\triangleq\; \min_{w\in\mathbb{R}^{|\Pi|}_{+}}\; & \sum_{\pi\in\Pi} w_{\pi}\,\Delta(f,\pi) \qquad (2)\\
\text{s.t.}\quad & \sum_{\pi\in\Pi} w_{\pi}\,D_{\mathrm{KL}}(f[\pi]\,\|\,g[\pi])\;\geq\; 1,\quad \forall g\in\mathcal{F},\ \pi^{\star}(g)\neq\pi^{\star}(f), \qquad (3)\\
& \|w\|_{\infty}\;\leq\; n. \qquad (4)
\end{aligned}$$

The last constraint makes sure that $w_{\pi}<\infty$ even if the decision has no regret, and we only care about the case when $n$ approaches infinity. The asymptotic complexity of $f$ is

$$\mathcal{C}(f)\triangleq\lim_{n\to\infty}\mathcal{C}(f,n). \tag{5}$$

In Eq. (3), we only require separation between instances $f, g$ when they have different optimal decisions. Since there are only finitely many decisions, there are finitely many equivalence classes, so $f$ and $g$ are in principle separable with a sufficient number of observations. Thus, the complexity measure $\mathcal{C}(f,n)$ is well defined, as stated in the following lemma.
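For a finite hypothesis class, the program in Eqs. (2)-(4) is a linear program in $w$. The sketch below (ours, for illustration only) computes $\mathcal{C}(f,n)$ for Bernoulli multi-armed bandit instances with `scipy.optimize.linprog`; the instance representation and helper names are assumptions of this example, not part of the paper.

```python
import numpy as np
from scipy.optimize import linprog

def kl_bernoulli(p, q, eps=1e-12):
    """Per-arm KL divergence between Bernoulli(p) and Bernoulli(q), vectorized."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def complexity(f, family, n):
    """C(f, n): minimize sum_pi w_pi * Delta(f, pi) subject to
       sum_pi w_pi * KL(f[pi] || g[pi]) >= 1 for every g with a different optimal decision,
       and 0 <= w_pi <= n.  Instances are vectors of Bernoulli means; decisions are arm indices."""
    f = np.asarray(f)
    gaps = f.max() - f                                  # Delta(f, pi)
    A_ub, b_ub = [], []
    for g in family:
        g = np.asarray(g)
        if g.argmax() == f.argmax():
            continue                                    # only alternatives with a different optimum matter
        A_ub.append(-kl_bernoulli(f, g))                # encode >= 1 as -sum <= -1
        b_ub.append(-1.0)
    if not A_ub:                                        # no alternative to separate from
        return 0.0, np.zeros(len(f))
    res = linprog(c=gaps, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0, n)] * len(f), method="highs")
    return res.fun, res.x

# Usage: a two-armed instance against alternatives that move the optimal arm.
val, w = complexity([0.9, 0.5], family=[[0.9, 0.5], [0.9, 0.95], [0.4, 0.5]], n=100)
print(val, w)
```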

Lemma 3.1.

For any $f \in \mathcal{F}$, $\mathcal{C}(f,n)$ is non-increasing in $n$, and there exists $n_0 > 0$ such that for all $n > n_0$, $\mathcal{C}(f,n) < \infty$. As a corollary, $\mathcal{C}(f) < \infty$ and is well defined.

The proof of Lemma 3.1 is deferred to Appendix B.1. We now formalize the intuition above and prove that $\mathcal{C}(f)$ is a lower bound on the asymptotic regret, as stated in the following theorem.

Theorem 3.2.

For every instance $f \in \mathcal{F}$, the expected regret of any consistent algorithm satisfies

$$\limsup_{n\to\infty}\frac{\mathrm{Reg}_{f,n}}{\ln n}\geq\mathcal{C}(f). \tag{6}$$

We prove Theorem 3.2 in Appendix B.2. The proof of Theorem 3.2 is inspired by previous works (Lattimore and Szepesvari, 2017; Tirinzoni et al., 2021). When applied to RL problems, our lower bound is very similar to that in Tirinzoni et al. (2021), and the only difference is that their optimization problem omits the constraint in Eq. (4). (Informally, the lower bound of Tirinzoni et al. (2021) equals $\mathcal{C}(f,\infty)$ in our language.) For some carefully constructed hypothesis classes, there exists an instance $f$ such that $\lim_{n\to\infty}\mathcal{C}(f,n)\neq\mathcal{C}(f,\infty)$, meaning that the lower bound in Tirinzoni et al. (2021) is not tight (see Appendix B.5 for the construction and corresponding proof). However, this constraint can be removed for linear bandits (see Lattimore and Szepesvari (2017, Lemma 17)). For standard tabular RL problems whose hypothesis class is the entire set of possible RL problems with a fixed number of states and actions, we conjecture that our lower bound is the same as that in Tirinzoni et al. (2021). We also note that Tirinzoni et al. (2021) do not have a matching upper bound.

The complexity measure $\mathcal{C}(f)$ can be computed explicitly in concrete settings. $\mathcal{C}(f)$ reduces to the well-known inverse action gap bound $\mathcal{O}(\sum_{a\in\mathcal{A},\Delta(a)>0}1/\Delta(a))$ for multi-armed bandits (Proposition B.2), and recovers the result of Lattimore and Szepesvari (2017) (Proposition B.3). For reinforcement learning, Tirinzoni et al. (2021) prove that the instance-optimal bound can be smaller than the gap-dependent bound $\mathcal{O}(\sum_{s,a:\Delta(s,a)>0}1/\Delta(s,a))$ in Xu et al. (2021) and Simchowitz and Jamieson (2019).

3.2 Instance-Optimal Algorithms

We first present the regret bound for our algorithm, and then discuss the T2C algorithm. For simplicity, we consider a finite hypothesis class (i.e., $|\mathcal{F}|<\infty$) here, and extend to the infinite hypothesis case in Section 5.

We start by stating a condition that excludes abnormal observation distributions $f[\pi]$. Recall that for any $\zeta\in(0,1)$, the Rényi divergence of two distributions $p,q$ is

$$D_{\zeta}(p\|q)=\frac{1}{\zeta-1}\ln\int_{x}p(x)^{\zeta}q(x)^{1-\zeta}\,\mathrm{d}x. \tag{7}$$

The Rényi divergence $D_{\zeta}(p\|q)$ is non-decreasing in $\zeta$, and $\lim_{\zeta\uparrow 1}D_{\zeta}(p\|q)=D_{\mathrm{KL}}(p\|q)$ (Van Erven and Harremos, 2014). We require that this limit converges uniformly over all instances $g\in\mathcal{F}$ and decisions $\pi\in\Pi$, with a $\zeta$ bounded away from 1 (the choice of the constants in Condition 1 is not critical to our result), as stated below.

Condition 1.

For any fixed $\alpha>0$, $\epsilon>0$, and instance $f\in\mathcal{F}$, there exists $\lambda_{0}(\alpha,\epsilon,f)>0$ such that for all $\lambda\leq\lambda_{0}(\alpha,\epsilon,f)$, $g\in\mathcal{F}$, and $\pi\in\Pi$, $D_{1-\lambda}(f[\pi]\|g[\pi])\geq\min\{D_{\mathrm{KL}}(f[\pi]\|g[\pi])-\epsilon,\alpha\}$. Moreover, we require $\lambda_{0}(\alpha,\epsilon,f)\geq\epsilon^{c_{1}}\min\{1/\alpha,c_{2}\}^{c_{3}}\iota(f)$ for some universal constants $c_{1},c_{2},c_{3}>0$, where $\iota(f)>0$ is a function that only depends on $f$.

Condition 1 holds for a wide range of distributions, such as Gaussian, Bernoulli, multinomial, Laplace with bounded mean, and log-normal (Gil et al., 2013), as well as tabular RL where $f[\pi]$ is a distribution over trajectories consisting of state, action, and reward tuples (see Theorem 5.1 for the proof for tabular RL). A stronger but more interpretable variant of Condition 1 is that the log density ratio of $f[\pi]$ and $g[\pi]$ has finite fourth moments (see Proposition B.4); therefore Condition 1 can also potentially be verified for other distributions.
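As a quick numerical check of the monotonicity and limit properties used above, the sketch below (ours, for illustration) evaluates the Rényi divergence $D_{1-\lambda}$ between two Bernoulli distributions and verifies that it increases toward the KL divergence as $\lambda \downarrow 0$.

```python
import numpy as np

def renyi_bernoulli(p, q, zeta):
    """D_zeta(Ber(p) || Ber(q)) = (1/(zeta-1)) * ln( p^zeta q^(1-zeta) + (1-p)^zeta (1-q)^(1-zeta) )."""
    return np.log(p**zeta * q**(1 - zeta) + (1 - p)**zeta * (1 - q)**(1 - zeta)) / (zeta - 1)

def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p, q = 0.7, 0.4
for lam in [0.5, 0.1, 0.01, 0.001]:
    print(lam, renyi_bernoulli(p, q, 1 - lam))   # non-decreasing as lambda shrinks ...
print("KL:", kl_bernoulli(p, q))                 # ... and converging to the KL divergence
```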

The main theorem analyzing Alg. 1 is shown below. The asymptotic regret of Alg. 1 matches the constants in the lower bound (Theorem 3.2), indicating the asymptotic instance-optimality of our algorithm.

Theorem 3.3.

Suppose $\mathcal{F}$ is a finite hypothesis class that satisfies Condition 1. The regret of Alg. 1 satisfies

$$\limsup_{n\to\infty}\frac{\mathrm{Reg}_{f^{\star},n}}{\ln n}\leq\mathcal{C}(f^{\star}). \tag{8}$$
Algorithm 1 Test-to-Commit (T2C)
1: Parameters: the number of rounds of interactions $n$.
Step 1: Initialization.
2: Play each decision $\lceil\frac{\ln n}{\ln\ln n}\rceil$ times. We denote these decisions by $\{\hat{\pi}_{i}\}_{i=1}^{m_{\rm init}}$, where $m_{\rm init}=|\Pi|\lceil\frac{\ln n}{\ln\ln n}\rceil$, and the corresponding observations by $\{\hat{o}_{i}\}_{i=1}^{m_{\rm init}}$.
3: Compute the max likelihood estimation (MLE) with arbitrary tie-breaking:
$$\hat{f}=\operatorname*{argmax}_{f\in\mathcal{F}}\;\sum_{i=1}^{m_{\rm init}}\ln f[\hat{\pi}_{i}](\hat{o}_{i}). \tag{9}$$
Step 2: Identification.
4: Let $\hat{w}$ be the solution of the program defining $\mathcal{C}(\hat{f},(\ln\ln n)^{1/4})$, and let $\bar{w}_{\pi}=\left(1+\frac{1}{(\ln\ln n)^{1/4}}\right)\hat{w}_{\pi}+\frac{1}{(\ln\ln n)^{1/4}}$ for all $\pi\in\Pi$.
5: Play each decision $\pi$ for $\lceil\bar{w}_{\pi}\ln n\rceil$ times. Denote these decisions by $w=\{\pi_{i}\}_{i=1}^{m}$, where $m=\sum_{\pi}\lceil\bar{w}_{\pi}\ln n\rceil$, and the corresponding observations by $\{o_{i}\}_{i=1}^{m}$.
6: Run the log-likelihood ratio test on instance $\hat{f}$, the sequence of decisions $\{\pi_{i}\}_{i=1}^{m}$, and its corresponding observations $\{o_{i}\}_{i=1}^{m}$ (that is, compute the event $\mathcal{E}_{\rm acc}^{\hat{f}}$ defined in Eq. (10)).
Step 3: Exploitation.
7: if $\mathcal{E}_{\rm acc}^{\hat{f}}=\rm true$ then
8:     Commit to $\pi^{\star}(\hat{f})$ (i.e., run $\pi^{\star}(\hat{f})$ for the remaining steps).
9: else
10:     Run UCB for the remaining steps.

Our algorithm is stated in Alg. 1 and consists of three steps: initialization, identification, and exploitation. In the initialization step, we explore uniformly for a short period and compute the MLE estimate $\hat{f}$ of the true instance (Line 3 of Alg. 1), where we only require $\hat{f}$ to be accurate with moderate probability (i.e., $1-1/\ln n$). The estimation is used to compute the lowest-regret list of decisions that can distinguish the optimal decision of $\hat{f}$. Since we only collect $o(\ln n)$ samples, the regret of this step is asymptotically negligible.
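A hedged sketch of the initialization step for a finite hypothesis class of Bernoulli bandit instances (continuing the illustrative setup above; `sample` and the family representation are assumptions of the example): play each decision $\lceil\ln n/\ln\ln n\rceil$ times and return the hypothesis maximizing the log-likelihood, as in Eq. (9).

```python
import numpy as np

def bernoulli_log_lik(mean, obs):
    """Log-likelihood of binary observations obs under a Bernoulli(mean) observation model."""
    eps = 1e-12
    mean = np.clip(mean, eps, 1 - eps)
    return float(np.sum(obs * np.log(mean) + (1 - obs) * np.log(1 - mean)))

def mle_initialization(family, sample, n):
    """Step 1 of T2C (sketch): uniform exploration followed by MLE over a finite family.
       family: list of mean vectors (one per hypothesis); sample(pi): one observation from f*."""
    num_decisions = len(family[0])
    reps = int(np.ceil(np.log(n) / np.log(np.log(n))))          # ceil(ln n / ln ln n) plays per decision
    data = {pi: np.array([sample(pi) for _ in range(reps)]) for pi in range(num_decisions)}
    scores = [sum(bernoulli_log_lik(f[pi], data[pi]) for pi in range(num_decisions)) for f in family]
    return int(np.argmax(scores)), data                         # index of the MLE instance, and the data

# Usage: true instance (0.5, 0.9); the MLE should recover it with moderate probability.
rng = np.random.default_rng(0)
true_means = np.array([0.5, 0.9])
family = [[0.5, 0.9], [0.9, 0.5], [0.5, 0.6]]
idx, _ = mle_initialization(family, lambda pi: float(rng.random() < true_means[pi]), n=10**6)
print(family[idx])
```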

In the identification step, we hold the belief that $\hat{f}$ is the true instance and solve $\mathcal{C}(\hat{f},(\ln\ln n)^{1/4})$ to get the best list of decisions to fortify this belief (see Line 4 of Alg. 1). Then we collect more samples using this list (with minor decorations) to boost the confidence of our initial estimation to $1-1/n$ (or reject it when the estimation is not accurate) by running the following log-likelihood ratio test:

$$\mathcal{E}_{\rm acc}^{\hat{f}}=\mathbb{I}\bigg[\forall g\in\mathcal{F}\text{ and }\pi^{\star}(g)\neq\pi^{\star}(\hat{f}),\;\sum_{i=1}^{m}\ln\frac{\hat{f}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\ln n\bigg]. \tag{10}$$

Intuitively, if $\hat{f}$ is not the ground truth $f^{\star}$, $\mathcal{E}_{\rm acc}^{\hat{f}}$ is unlikely to hold because the expected log-likelihood ratio is non-positive:

$$\mathbb{E}_{o\sim f^{\star}[\pi]}\big[\ln(\hat{f}[\pi](o)/f^{\star}[\pi](o))\big]=-D_{\mathrm{KL}}(f^{\star}[\pi]\|\hat{f}[\pi])\leq 0.$$

So with high probability a wrong guess cannot be accepted. On the other hand, the first constraint (Eq. (3)) in the definition of $\mathcal{C}(\hat{f})$ guarantees that when $\hat{f}=f^{\star}$, the expected log-likelihood ratio is large for all $g\in\mathcal{F}$ with $\pi^{\star}(g)\neq\pi^{\star}(\hat{f})$, so an accurate guess will be accepted. In this step $m=\tilde{\Theta}(\ln n)$, so Step 2 dominates the regret of Alg. 1. In other words, the efficiency of the log-likelihood ratio test is critical to our analysis.
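The test in Eq. (10) is a sum of log density ratios compared against $\ln n$. Below is a minimal sketch for the Bernoulli case (ours, continuing the illustrative setup; the density model is an assumption of the example).

```python
import numpy as np

def log_likelihood_ratio_test(f_hat, family, decisions, observations, n, opt=np.argmax):
    """Accept f_hat iff, for every g in the family with a different optimal decision,
       sum_i ln( f_hat[pi_i](o_i) / g[pi_i](o_i) ) >= ln(n)."""
    eps = 1e-12
    def log_density(means, pi, o):                # Bernoulli density of observation o under decision pi
        m = np.clip(means[pi], eps, 1 - eps)
        return np.log(m) if o == 1 else np.log(1 - m)
    f_hat = np.asarray(f_hat)
    for g in family:
        g = np.asarray(g)
        if opt(g) == opt(f_hat):
            continue                              # only alternatives with a different optimal decision matter
        ratio = sum(log_density(f_hat, pi, o) - log_density(g, pi, o)
                    for pi, o in zip(decisions, observations))
        if ratio < np.log(n):
            return False                          # some alternative is not ruled out: reject f_hat
    return True                                   # the event E_acc holds: safe to commit to pi*(f_hat)
```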

Finally, in the exploitation step, the algorithm commits to the optimal decision of $\hat{f}$ if it believes this estimation with confidence $1-1/n$, or runs a standard UCB algorithm as if on a multi-armed bandit problem (with a well-known $\mathcal{O}(\ln n)$ regret) when the initial estimation is not accurate (which happens with probability at most $1/\ln n$). Therefore the expected regret here is $\mathcal{O}(\ln n)/\ln n+\mathcal{O}(n)/n=\mathcal{O}(1)$.
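For completeness, here is a standard UCB1 sketch of the kind used in the fallback branch (ours; in the general setting the reward $R(o_t)$ of each observation plays the role of the bandit reward).

```python
import numpy as np

def ucb_fallback(sample, num_decisions, horizon):
    """UCB1 (sketch): play each decision once, then the decision with the largest upper confidence bound."""
    counts = np.zeros(num_decisions)
    sums = np.zeros(num_decisions)
    total_reward = 0.0
    for t in range(horizon):
        if t < num_decisions:
            pi = t                                          # initial round-robin over decisions
        else:
            bonus = np.sqrt(2.0 * np.log(t + 1) / counts)
            pi = int(np.argmax(sums / counts + bonus))      # optimistic index
        r = sample(pi)                                      # observe a reward R(o) in [0, R_max]
        counts[pi] += 1
        sums[pi] += r
        total_reward += r
    return total_reward
```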

Our three-stage algorithm is inspired by Lattimore and Szepesvari (2017), where their test in Step 2 is specially designed for linear bandits with Gaussian noise. In contrast, the log-likelihood ratio test is a general hypothesis testing method. Hence, our algorithm works for a large family of interactive decision making problems including tabular RL.

4 Proof Sketches of the Main Results

In this section, we discuss the proof sketch of our main results. Section 4.1 discusses the proof sketch of the lower bound, and Section 4.2 shows the main lemmas for each step in Alg. 1. In Section 4.3, we discuss the log-likelihood ratio test in detail.

4.1 Proof Sketch of Theorem 3.2

Recall that $\mathcal{C}(f)$ is the minimum regret of a list of decisions that can distinguish the instance $f$ from all other instances $g$ (with a different optimal decision). So to prove the lower bound, we show that the sequence of decisions played by any consistent algorithm must also distinguish $f$, and thus $\mathcal{C}(f)$ lower bounds the regret of any consistent algorithm.

For any consistent algorithm, number of interactions $n>0$, and two instances $f,g\in\mathcal{F}$ with different optimal decisions, let $P_{f,n}$ and $P_{g,n}$ denote the probability measures generated by running the algorithm on instances $f$ and $g$, respectively, for $n$ rounds. Since $f,g$ have different optimal decisions, $P_{f,n}$ and $P_{g,n}$ should be very different. Indeed, let $N_{\pi}$ be the random variable denoting the number of times decision $\pi$ is executed, and consider the event $A=\mathbb{I}\left[N_{\pi^{\star}(f)}\geq\frac{n}{2}\right]$. By the basic property of TV distance, we have

$$\mathop{\rm Pr}\nolimits_{f,n}(A)-\mathop{\rm Pr}\nolimits_{g,n}(A)\leq D_{\rm TV}(P_{f,n}\|P_{g,n}).$$

Since the algorithm has sub-polynomial regret, for every $p>0$ we must have $\mathop{\rm Pr}\nolimits_{f,n}(A)\geq 1-\mathcal{O}(1/n^{1-p})$ and $\mathop{\rm Pr}\nolimits_{g,n}(A)\leq\mathcal{O}(1/n^{1-p})$ (otherwise the algorithm incurs $\Omega(n^{p})$ regret on either $f$ or $g$). Consequently,

$$\forall p>0,\quad D_{\rm TV}(P_{f,n}\|P_{g,n})\geq 1-\mathcal{O}(1/n^{1-p}). \tag{11}$$

This states that the observations from the two instances should be distinguishable, although the TV distance is complex and hard to deal with (especially when the algorithm chooses its decision sequence adaptively). So we translate the requirement from the TV distance to a more localized quantity. First, we invoke an extended version of Pinsker's inequality (see Sampson and Guttorp (1992)):

$$D_{\rm TV}(P_{f,n}\|P_{g,n})\leq 1-\exp(-D_{\mathrm{KL}}(P_{f,n}\|P_{g,n}))/2. \tag{12}$$

Combining with Eq. (11), we get $D_{\mathrm{KL}}(P_{f,n}\|P_{g,n})\geq(1+o(1))\ln n$. Let the random variable $\pi_{i}$ be the decision of the algorithm in round $i$. The chain rule of KL divergence shows

$$(1+o(1))\ln n\leq D_{\mathrm{KL}}(P_{f,n}\|P_{g,n})=\mathbb{E}_{f,n}\bigg[\sum_{i=1}^{n}D_{\mathrm{KL}}(f[\pi_{i}]\|g[\pi_{i}])\bigg]=\sum_{\pi}\mathbb{E}_{f,n}[N_{\pi}]\,D_{\mathrm{KL}}(f[\pi]\|g[\pi]).$$

Now consider the vector $w\in\mathbb{R}^{|\Pi|}_{+}$ where $w_{\pi}=\mathbb{E}_{f}[N_{\pi}]/((1+o(1))\ln n)$. Based on the derivation above, we can verify that $w$ is a valid solution to $\mathcal{C}(f,n)$, and therefore

$$\mathrm{Reg}_{f,n}=\sum_{\pi}\mathbb{E}_{f}[N_{\pi}]\Delta(f,\pi)=\sum_{\pi}w_{\pi}\Delta(f,\pi)(1+o(1))\ln n\geq\mathcal{C}(f,n)(1+o(1))\ln n. \tag{13}$$

Then the final result is proved by the fact that $\mathcal{C}(f)=\lim_{n\to\infty}\mathcal{C}(f,n)$.

4.2 Proof Sketch of Theorem 3.3

In the following, we discuss the main lemmas and their proof sketches for the three steps of Alg. 1.

Step 1: Initialization.

In this step we show that the max likelihood estimation finds the exact instance (i.e., $\hat{f}=f^{\star}$) with probability at least $1-1/\ln n$ and negligible regret. Note that the failure probability is not small enough to directly commit to the optimal decision of $\hat{f}$ (i.e., play $\pi^{\star}(\hat{f})$ forever), which would incur linear regret when $\hat{f}\neq f^{\star}$. Formally, the main lemma for Step 1 is stated as follows, with proof deferred to Appendix A.1.

Lemma 4.1.

Under Condition 1, with probability at least $1-1/\ln n$ we get $\hat{f}=f^{\star}$. In addition, the regret of Step 1 is upper bounded by $\mathcal{O}(\frac{\ln n}{\ln\ln n})$.

Lemma 4.1 is not surprising since the MLE only requires $\mathcal{O}(\log(1/\delta))$ samples to reach a failure probability $\delta$ in general (Van de Geer, 2000, Section 7). In Step 1, we require a $1/\ln n$ failure probability but use $(|\Pi|\ln n)/(\ln\ln n)=\Omega(\ln\ln n)$ samples, so the result is expected for large enough $n$.

Step 2: Identification.

In the identification step, we boost the confidence to $1-1/n$ using the log-likelihood ratio test. To this end, we compute the optimal list of decisions $w=\{\pi_{1},\cdots,\pi_{m}\}$ that distinguishes $\hat{f}$ by solving $\mathcal{C}(\hat{f},(\ln\ln n)^{1/4})$ (Line 4 of Alg. 1). The choice of $(\ln\ln n)^{1/4}$ is not critical, and could be replaced by any smaller quantity that approaches infinity as $n\to\infty$. Then we run the log-likelihood ratio test using the observations collected by executing the list of decisions $w$, and achieve the following:

  (a) when the true instance $f^{\star}$ and the estimation $\hat{f}$ have different optimal decisions, accept $\hat{f}$ with probability at most $1/n$;

  (b) when $\hat{f}=f^{\star}$, accept $\hat{f}$ with probability at least $1-1/\ln n$.

In Section 4.3, we will discuss the log-likelihood ratio test in detail. Note that the regret after we falsely reject the true instance is $\mathcal{O}(\ln n)$ (by the regret bound of the UCB algorithm), so we only require a $1/\ln n$ failure probability for (b), because this leads to an $\mathcal{O}(\ln n)/\ln n=\mathcal{O}(1)$ expected regret. The main lemma for Step 2 is stated as follows, and its proof is deferred to Appendix A.2.

Lemma 4.2.

Under Condition 1, for any finite hypothesis class $\mathcal{F}$ and large enough $n$, the following holds:

  (a) conditioned on the event $\pi^{\star}(\hat{f})\neq\pi^{\star}(f^{\star})$, $\mathcal{E}_{\rm acc}^{\hat{f}}$ is true with probability at most $1/n$;

  (b) conditioned on the event $\hat{f}=f^{\star}$, $\mathcal{E}_{\rm acc}^{\hat{f}}$ is true with probability at least $1-1/\ln n$;

  (c) conditioned on the event $\hat{f}=f^{\star}$, the expected regret of Step 2 is upper bounded by $\left(\mathcal{C}(f^{\star},(\ln\ln n)^{1/4})+o(1)\right)\ln n$. Otherwise, the expected regret of Step 2 is upper bounded by $\mathcal{O}(\ln n\ln\ln n)$.

Step 3: Exploitation.

Finally, in Step 3 we either commit to the optimal decision $\pi^{\star}(\hat{f})$ when the estimation $\hat{f}$ is accepted, or run a standard UCB algorithm with $\mathcal{O}(\ln n)$ regret (Lattimore and Szepesvári, 2020). Combining Lemma 4.1 and Lemma 4.2, we can prove that

  1. the probability of Step 3 incurring an $\mathcal{O}(n)$ regret is at most $1/n$, and

  2. the probability of Step 3 incurring an $\mathcal{O}(\ln n)$ regret is at most $1/\ln n$.

As a result, the expected regret in Step 3 is $\mathcal{O}(1)$ and negligible. Finally, Theorem 3.3 is proved by stitching the three steps together; the full proof is deferred to Appendix A.4.

4.3 The Log-Likelihood Ratio Test

The goal of the log-likelihood ratio test is to boost the confidence of our initial estimation $\hat{f}$ to $1-1/n$, so that the algorithm can safely commit to its optimal decision in Step 3. In other words, the test should reject a wrong estimation but also accept a correct one.

Formally speaking, in the test we observe a list of observations $\{o_{1},\cdots,o_{m}\}$ collected by the list of decisions $\{\pi_{1},\cdots,\pi_{m}\}$ on the true instance $f^{\star}$, and aim to achieve the following two goals simultaneously:

  (a) when $\hat{f}$ and $f^{\star}$ are sufficiently different (i.e., their KL divergence is large in the sense that $\sum_{i=1}^{m}D_{\mathrm{KL}}(\hat{f}[\pi_{i}]\|f^{\star}[\pi_{i}])\geq(1+o(1))\ln n$), accept $\hat{f}$ with probability at most $1/n$, and

  (b) when $\hat{f}=f^{\star}$, accept $\hat{f}$ with probability at least $1-1/\ln n$.

We prove that the log-likelihood ratio test with proper parameters achieves (a) and (b) under Condition 1 in the next lemma, whose proof is deferred to Appendix A.3.

Lemma 4.3.

Let $P=\{P_{i}\}_{i=1}^{m}$ and $Q=\{Q_{i}\}_{i=1}^{m}$ be two sequences of distributions, and let $o=\{o_{i}\}_{i=1}^{m}$ be a sequence of independent random variables. For any fixed $\lambda>0$, $c>0$, and $\beta=\frac{1}{m}\sum_{i=1}^{m}D_{1-\lambda}(P_{i}\|Q_{i})$, the test event

$$\mathcal{E}_{\rm acc}=\mathbb{I}\bigg[\sum_{i=1}^{m}\ln\frac{P_{i}(o_{i})}{Q_{i}(o_{i})}\geq c\bigg]$$

satisfies

$$\mathop{\rm Pr}\nolimits_{o\sim Q}(\mathcal{E}_{\rm acc})\leq\exp(-c), \tag{14}$$
$$\mathop{\rm Pr}\nolimits_{o\sim P}(\mathcal{E}_{\rm acc})\geq 1-\exp(-\lambda(m\beta-c)). \tag{15}$$

To use Lemma 4.3 in the proof of Lemma C.2, we set $c=\ln n$ and $\lambda=\mathrm{poly}(\ln\ln n)$. Note that the choice of $w$ in Line 4 of Alg. 1 and Condition 1 imply $m\beta=\sum_{i=1}^{m}D_{1-\lambda}(P_{i}\|Q_{i})\gtrapprox\sum_{i=1}^{m}D_{\mathrm{KL}}(\hat{f}[\pi_{i}]\|f^{\star}[\pi_{i}])=(1+o(1))\ln n$ when $\hat{f},f^{\star}$ have different optimal policies. So the conclusion of this lemma matches the first two items of Lemma 4.2.

Lemma 4.3 is closely related to the Chernoff-Stein lemma (see Chernoff (1959) and Mao (2021, Theorem 4.11)). The failure probability of Eq. (15) in the classical Chernoff-Stein lemma is a constant, while here it decreases with $m$. The proof of Eq. (14) is the same as in the Chernoff-Stein lemma, and the proof of Eq. (15) only requires a one-sided concentration of the empirical log-likelihood ratio. Indeed, the expectation of the empirical log-likelihood ratio can be lower bounded by

$$\mathbb{E}_{P}\bigg[\sum_{i=1}^{m}\ln\frac{P_{i}(o_{i})}{Q_{i}(o_{i})}\bigg]=\sum_{i=1}^{m}D_{\mathrm{KL}}\left(P_{i}\|Q_{i}\right)\geq\sum_{i=1}^{m}D_{1-\lambda}\left(P_{i}\|Q_{i}\right)=m\beta. \tag{16}$$

So $\mathop{\rm Pr}\nolimits_{P}(\mathcal{E}_{\rm acc})$ is the probability that the empirical log-likelihood ratio is (approximately) larger than its expectation. We also note that the concentration is non-trivial because we do not make boundedness assumptions on the tail of $Q_{i}$.

5 Extensions to Infinite Hypothesis Class

Now we extend our analysis to infinite hypothesis settings, and instantiate our results on tabular RL. Our results in the infinite hypothesis case need two additional conditions. The first condition requires an upper bound on the covering number of the hypothesis class, and is used to prove an infinite-hypothesis version of Lemma 4.2. Formally, for any $f,g\in\mathcal{F}$, define their distance as

$$d(f,g)=\sup_{\pi\in\Pi,\,o}\left|f[\pi](o)-g[\pi](o)\right|. \tag{17}$$

An $\epsilon$ (proper) covering of $\mathcal{F}$ is a subset $\mathcal{C}\subseteq\mathcal{F}$ such that for any $g\in\mathcal{F}$, there exists $g^{\prime}\in\mathcal{C}$ with $d(g,g^{\prime})\leq\epsilon$. The covering number $\mathcal{N}(\mathcal{F},\epsilon)$ is the size of the minimum $\epsilon$-covering of $\mathcal{F}$.

Condition 2.

There exists a constant $c$ that depends on $\mathcal{F}$ such that $\ln\mathcal{N}(\mathcal{F},\epsilon)\leq\mathcal{O}(c\ln(1/\epsilon))$ for every $\epsilon>0$. In addition, the base measure of the probability space has a finite volume $\rm Vol<\infty$.

The second condition upper bounds the distance of two instances by a polynomial of their KL divergence, and is used to prove a stronger version of Lemma 4.1.

Condition 3.

There exists a constant $c_{\rm min}>0$ (which may depend on $f^{\star}$) such that for all $\pi\in\Pi$ and $o\in\operatorname{supp}(f^{\star}[\pi])$, $f^{\star}[\pi](o)>c_{\rm min}$. In addition, there exist a constant $\iota(f^{\star})>0$ that only depends on $f^{\star}$ and a constant $c_{5}>0$, such that for all $f\in\mathcal{F}$ and $\pi\in\Pi$, when $D_{\mathrm{KL}}(f^{\star}[\pi]\|f[\pi])\leq 1$,

$$\|f^{\star}[\pi]-f[\pi]\|_{\infty}\leq\iota(f^{\star})\,D_{\mathrm{KL}}(f^{\star}[\pi]\|f[\pi])^{c_{5}}.$$

Many interactive decision making problems satisfy Conditions 2-3, such as multi-armed bandits and linear bandits with Bernoulli reward or truncated Gaussian reward. For bandit problems, Conditions 2 and 3 only depend on the noise of the reward function and can be verified easily.

For tabular RL, an observation $o$ denotes a trajectory $(s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H})$, where the state-action pairs $(s_{h},a_{h})$ are discrete random variables but the rewards $r_{h}$ are continuous. A decision $\pi:\mathcal{S}\to\mathcal{A}$ is a mapping from the state space to the action space, so $|\Pi|=|\mathcal{A}|^{|\mathcal{S}|}<\infty$. In Appendix D, we formally define the tabular RL problems where the reward $r_{h}$ given $(s_{h},a_{h})$ follows a truncated Gaussian distribution, and prove that Conditions 1-3 are satisfied, as stated in the following theorem.
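As a small illustration of the decision class for tabular RL, the following sketch (ours) enumerates all deterministic policies $\pi:\mathcal{S}\to\mathcal{A}$, confirming $|\Pi|=|\mathcal{A}|^{|\mathcal{S}|}$.

```python
from itertools import product

def all_deterministic_policies(num_states, num_actions):
    """Each decision is a mapping from states to actions, represented as a tuple of length |S|."""
    return list(product(range(num_actions), repeat=num_states))

policies = all_deterministic_policies(num_states=3, num_actions=2)
print(len(policies))  # |A|^|S| = 2**3 = 8
```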

Theorem 5.1.

Let $\mathcal{F}$ be the family of tabular RL problems with truncated Gaussian rewards and unique optimal policies. Then Conditions 1, 2, and 3 hold simultaneously.

For tabular RL with truncated Gaussian reward, the observation $o$ is a mixture of discrete random variables (i.e., the states and actions $s_{h},a_{h}$) and continuous random variables (i.e., the rewards $r_{h}$). To prove Conditions 1-3, we deal with the discrete and continuous parts of the observation separately. We also have flexibility in the reward distribution; e.g., our proof technique can deal with other distributions such as the truncated Laplace distribution.

We present these unified conditions for bounded random variables, but Conditions 2 and 3 do not hold for unbounded random variables because the base measure has an infinite volume (which contradicts Condition 2), and the density of the observation cannot be lower bounded (which contradicts Condition 3). However, we can still prove the same result using a slightly different approach (see Appendix E).

With Conditions 1-3, we can prove our main results for infinite hypothesis classes. The asymptotic regret of Alg. 1 in this case still matches the lower bound exactly. The proof of Theorem 5.2 is deferred to Appendix C.1.

Theorem 5.2.

Suppose $\mathcal{F}$ is an infinite hypothesis class that satisfies Conditions 1-3. Then the regret of Alg. 1 satisfies

$$\limsup_{n\to\infty}\frac{\mathrm{Reg}_{f^{\star},n}}{\ln n}\leq\mathcal{C}(f^{\star}). \tag{18}$$

As a corollary of Theorem 5.1 and Theorem 5.2, our algorithm strictly improves upon the previous best gap-dependent bounds $\mathcal{O}(\sum_{s,a:\Delta(s,a)>0}1/\Delta(s,a))$ of Xu et al. (2021) and Simchowitz and Jamieson (2019), because the gap-dependent bounds are not tight for many instances (Tirinzoni et al., 2021).

In the following we discuss the challenges when extending our analysis to infinite hypothesis settings.

Covering Number.

With an infinite hypothesis class, we need to accept an accurate estimation even when there are infinitely many other choices. Recall that the accept event is

$$\mathcal{E}_{\rm acc}^{\hat{f}}=\mathbb{I}\left[\forall g\in\mathcal{F}\text{ and }\pi^{\star}(g)\neq\pi^{\star}(\hat{f}),\;\sum_{i=1}^{m}\ln\frac{\hat{f}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\ln n\right]. \tag{19}$$

So informally speaking, we need to show that with high probability,

$$\forall g\in\mathcal{F}\text{ and }\pi^{\star}(g)\neq\pi^{\star}(\hat{f}),\quad\sum_{i=1}^{m}\ln\frac{\hat{f}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\gtrsim(D^{w}_{1-\lambda}(\hat{f}\|g)-\epsilon)m\geq\ln n, \tag{20}$$

which is essentially a uniform concentration inequality, as discussed in Section 4.3. So we resort to a covering number argument. The standard covering argument is not directly applicable in this case — even if $d(g,g^{\prime})$ is small, it is still possible that $\ln(f[\pi](o)/g[\pi](o))$ is very different from $\ln(f[\pi](o)/g^{\prime}[\pi](o))$ (especially when $g[\pi](o)$ is very close to 0). Instead, we consider the distribution with density $\hat{g}[\pi](o)\triangleq(g^{\prime}[\pi](o)+\epsilon)/Z$, where $Z$ is the normalization factor, and only prove a one-sided covering (that is, $\ln(f[\pi](o)/g[\pi](o))\geq\ln(f[\pi](o)/\hat{g}[\pi](o))-\mathcal{O}(\epsilon)$). We state and prove the uniform concentration in Appendix F.

Initialization.

With an infinite hypothesis class, we cannot hope to recover the true instance $f^{\star}$ exactly — some instances can be arbitrarily close to $f^{\star}$ and thus indistinguishable. Instead, we prove that the estimation in Step 1 satisfies $\sup_{\pi\in\Pi}D_{\mathrm{KL}}(f^{\star}[\pi]\|\hat{f}[\pi])\leq\mathrm{poly}(\frac{\ln\ln n}{\ln n})$. The main lemma of Step 1 in the infinite hypothesis class case is stated in Lemma C.1.

6 Conclusion

In this paper, we design instance-optimal algorithms for general interactive decision making problems with a finite number of decisions. As an instantiation, our algorithm is the first instance-optimal algorithm for tabular MDPs. For future work, we raise the following open questions.

  1. To implement Alg. 1, we need to solve $\mathcal{C}(f,(\ln\ln n)^{1/4}/2)$, which is a linear program with $|\Pi|$ variables and infinitely many constraints. However, $|\Pi|$ is exponentially large for tabular MDPs. Can we solve this optimization problem efficiently for tabular MDPs?

  2. Although our algorithm is asymptotically optimal, the lower-order terms may dominate the regret unless $n$ is very large. Can we design non-asymptotic instance-optimal algorithms?

Acknowledgment

The authors would like to thank Yuanhao Wang and Jason D. Lee for helpful discussions, and gratefully acknowledge the support from JD.com.

References

  • Agarwal et al. [2014] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646, 2014.
  • Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272, 2017.
  • Berner et al. [2019] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemyslaw Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
  • Burnetas and Katehakis [1997] Apostolos N Burnetas and Michael N Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997.
  • Chen et al. [2017a] Lijie Chen, Jian Li, and Mingda Qiao. Nearly instance optimal sample complexity bounds for top-k arm selection. In Artificial Intelligence and Statistics, pages 101–110. PMLR, 2017a.
  • Chen et al. [2017b] Lijie Chen, Jian Li, and Mingda Qiao. Towards instance optimal bounds for best arm identification. In Conference on Learning Theory, pages 535–592. PMLR, 2017b.
  • Chernoff [1959] Herman Chernoff. Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770, 1959.
  • Cover [1999] Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.
  • Degenne et al. [2020] Rémy Degenne, Han Shao, and Wouter Koolen. Structure adaptive algorithms for stochastic bandits. In International Conference on Machine Learning, pages 2443–2452. PMLR, 2020.
  • Degrave et al. [2022] Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897):414–419, 2022.
  • Dong et al. [2021] Kefan Dong, Jiaqi Yang, and Tengyu Ma. Provable model-based nonlinear bandit and reinforcement learning: Shelve optimism, embrace virtual curvature. arXiv preprint arXiv:2102.04168, 2021.
  • Foster and Rakhlin [2020] Dylan Foster and Alexander Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3199–3210. PMLR, 13–18 Jul 2020.
  • Foster et al. [2020] Dylan J Foster, Alexander Rakhlin, David Simchi-Levi, and Yunzong Xu. Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective. arXiv preprint arXiv:2010.03104, 2020.
  • Foster et al. [2021] Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487, 2021.
  • Gil et al. [2013] Manuel Gil, Fady Alajaji, and Tamas Linder. Rényi divergence measures for commonly used univariate continuous distributions. Information Sciences, 249:124–131, 2013.
  • Graves and Lai [1997] Todd L Graves and Tze Leung Lai. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3):715–743, 1997.
  • Gupta et al. [2021] Samarth Gupta, Shreyas Chaudhari, Gauri Joshi, and Osman Yağan. Multi-armed bandits with correlated arms. IEEE Transactions on Information Theory, 67(10):6711–6732, 2021.
  • Hao et al. [2020] Botao Hao, Tor Lattimore, and Csaba Szepesvari. Adaptive exploration in linear contextual bandit. In International Conference on Artificial Intelligence and Statistics, pages 3536–3545. PMLR, 2020.
  • Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 4868–4878, 2018.
  • Jin et al. [2020] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
  • Kirschner et al. [2021] Johannes Kirschner, Tor Lattimore, Claire Vernade, and Csaba Szepesvári. Asymptotically optimal information-directed sampling. In Conference on Learning Theory, pages 2777–2821. PMLR, 2021.
  • Lai et al. [1985] Tze Leung Lai, Herbert Robbins, et al. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
  • Lattimore and Szepesvari [2017] Tor Lattimore and Csaba Szepesvari. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pages 728–737. PMLR, 2017.
  • Lattimore and Szepesvári [2020] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • Li et al. [2019] Yingkai Li, Yining Wang, and Yuan Zhou. Nearly minimax-optimal regret for linearly parameterized bandits. In Conference on Learning Theory, pages 2173–2174. PMLR, 2019.
  • Magureanu et al. [2014] Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bound and optimal algorithms. In Conference on Learning Theory, pages 975–999. PMLR, 2014.
  • Mao [2021] Cheng Mao. Hypothesis testing: Lecture notes for math 6263 at georgia tech. 2021.
  • Marjani et al. [2022] Aymen Al Marjani, Tomáš Kocák, and Aurélien Garivier. On the complexity of all $\varepsilon$-best arms identification. arXiv preprint arXiv:2202.06280, 2022.
  • Mason et al. [2020] Blake Mason, Lalit Jain, Ardhendu Tripathy, and Robert Nowak. Finding all $\epsilon$-good arms in stochastic bandits. Advances in Neural Information Processing Systems, 33:20707–20718, 2020.
  • Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
  • Ok et al. [2018] Jungseul Ok, Alexandre Proutiere, and Damianos Tranos. Exploration in structured reinforcement learning. In Advances in Neural Information Processing Systems, pages 8874–8882, 2018.
  • Rajeev et al. [1989] A Rajeev, Demosthenis Teneketzis, and Venkatachalam Anantharam. Asymptotically efficient adaptive allocation schemes for controlled iid processes: Finite parameter space. IEEE Transactions on Automatic Control, 34(3), 1989.
  • Sampson and Guttorp [1992] Paul D Sampson and Peter Guttorp. Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association, 87(417):108–119, 1992.
  • Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
  • Simchowitz and Jamieson [2019] Max Simchowitz and Kevin G Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. In Advances in Neural Information Processing Systems, pages 1153–1162, 2019.
  • Tirinzoni et al. [2020] Andrea Tirinzoni, Alessandro Lazaric, and Marcello Restelli. A novel confidence-based algorithm for structured bandits. In International Conference on Artificial Intelligence and Statistics, pages 3175–3185. PMLR, 2020.
  • Tirinzoni et al. [2021] Andrea Tirinzoni, Matteo Pirotta, and Alessandro Lazaric. A fully problem-dependent regret lower bound for finite-horizon mdps. arXiv preprint arXiv:2106.13013, 2021.
  • Tirinzoni et al. [2022] Andrea Tirinzoni, Aymen Al-Marjani, and Emilie Kaufmann. Near instance-optimal pac reinforcement learning for deterministic mdps. arXiv preprint arXiv:2203.09251, 2022.
  • Van de Geer [2000] Sara Van de Geer. Empirical Processes in M-estimation. Cambridge University Press, 2000.
  • Van Erven and Harremos [2014] Tim Van Erven and Peter Harremos. Rényi divergence and kullback-leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
  • Vinyals et al. [2019] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • Xu et al. [2021] Haike Xu, Tengyu Ma, and Simon S Du. Fine-grained gap-dependent bounds for tabular MDPs via adaptive multi-step bootstrap. arXiv preprint arXiv:2102.04692, 2021.
  • Yang et al. [2021] Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2021.
  • Zanette and Brunskill [2019] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312, 2019.


Appendix A Proofs for Finite Hypothesis Class

In this section, we present the missing proofs from Section 4.

A.1 Proof of Lemma 4.1

In this section, we prove Lemma 4.1.

Proof of Lemma 4.1.

Recall that minit=|Π|lnnlnlnn{m_{\rm init}}=|\Pi|\lceil\frac{\ln n}{\ln\ln n}\rceil. Let w={π^i}i=1minitw=\{\hat{\pi}_{i}\}_{i=1}^{{m_{\rm init}}} be the sequence of decisions in the initialization step of Alg. 1, and {o^i}i=1minit\{\hat{o}_{i}\}_{i=1}^{{m_{\rm init}}} the corresponding observations. Define

ρ=ming,gfmaxπΠD1/2(f[π]g[π]).\displaystyle\rho=\min_{g\in{\mathcal{F}},g\neq{f^{\star}}}\max_{\pi\in\Pi}D_{1/2}({f^{\star}}[\pi]\|g[\pi]). (21)

Since the hypothesis class {\mathcal{F}} is finite, we have \rho>0. Then by Lemma A.2, for any \epsilon>0 we get

Pr(1miniti=1minitlnf[π^i](o^i)g[π^i](o^i)1miniti=1minitD1/2(f[π^i]g[π^i])ϵ)1exp(minitϵ/2).\displaystyle\mathop{\rm Pr}\nolimits\left(\frac{1}{{m_{\rm init}}}\sum_{i=1}^{{m_{\rm init}}}\ln\frac{{f^{\star}}[\hat{\pi}_{i}](\hat{o}_{i})}{g[\hat{\pi}_{i}](\hat{o}_{i})}\geq\frac{1}{{m_{\rm init}}}\sum_{i=1}^{{m_{\rm init}}}D_{1/2}({f^{\star}}[\hat{\pi}_{i}]\|g[\hat{\pi}_{i}])-\epsilon\right)\geq 1-\exp(-{m_{\rm init}}\epsilon/2). (22)

Let \epsilon=\frac{\rho}{2|\Pi|}. Since every decision appears exactly \lceil\frac{\ln n}{\ln\ln n}\rceil times in the initialization step, the definition of \rho implies \frac{1}{{m_{\rm init}}}\sum_{i=1}^{{m_{\rm init}}}D_{1/2}({f^{\star}}[\hat{\pi}_{i}]\|g[\hat{\pi}_{i}])\geq\frac{\rho}{|\Pi|} for every g\in{\mathcal{F}}\setminus\{{f^{\star}}\}. Therefore, for all g\in{\mathcal{F}}\setminus\{{f^{\star}}\},

Pr(1miniti=1minitlnf[π^i](o^i)g[π^i](o^i)ρ2|Π|)\displaystyle\mathop{\rm Pr}\nolimits\left(\frac{1}{{m_{\rm init}}}\sum_{i=1}^{{m_{\rm init}}}\ln\frac{{f^{\star}}[\hat{\pi}_{i}](\hat{o}_{i})}{g[\hat{\pi}_{i}](\hat{o}_{i})}\geq\frac{\rho}{2|\Pi|}\right) (23)
\displaystyle\geq\; Pr(1miniti=1minitlnf[π^i](o^i)g[π^i](o^i)1miniti=1minitD1/2(f[π^i]g[π^i])ρ2|Π|)\displaystyle\mathop{\rm Pr}\nolimits\left(\frac{1}{{m_{\rm init}}}\sum_{i=1}^{{m_{\rm init}}}\ln\frac{{f^{\star}}[\hat{\pi}_{i}](\hat{o}_{i})}{g[\hat{\pi}_{i}](\hat{o}_{i})}\geq\frac{1}{{m_{\rm init}}}\sum_{i=1}^{{m_{\rm init}}}D_{1/2}({f^{\star}}[\hat{\pi}_{i}]\|g[\hat{\pi}_{i}])-\frac{\rho}{2|\Pi|}\right) (24)
\displaystyle\geq\; 1exp(ρlnn4lnlnn).\displaystyle 1-\exp\left(-\frac{\rho\ln n}{4\ln\ln n}\right). (25)

By union bound, with probability at least 1||exp(ρlnn4lnlnn)1-|{\mathcal{F}}|\exp\left(-\frac{\rho\ln n}{4\ln\ln n}\right) we have

g{f},i=1minitlnf[π^i](o^i)g[π^i](o^i)>0=i=1minitlnf[π^i](o^i)f[π^i](o^i),\displaystyle\forall g\in{\mathcal{F}}\setminus\{{f^{\star}}\},\quad\sum_{i=1}^{{m_{\rm init}}}\ln\frac{{f^{\star}}[\hat{\pi}_{i}](\hat{o}_{i})}{g[\hat{\pi}_{i}](\hat{o}_{i})}>0=\sum_{i=1}^{{m_{\rm init}}}\ln\frac{{f^{\star}}[\hat{\pi}_{i}](\hat{o}_{i})}{{f^{\star}}[\hat{\pi}_{i}](\hat{o}_{i})}, (26)

which implies that

g{f},i=1minitlng[π^i](o^i)<i=1minitlnf[π^i](o^i).\displaystyle\forall g\in{\mathcal{F}}\setminus\{{f^{\star}}\},\quad\sum_{i=1}^{{m_{\rm init}}}\ln g[\hat{\pi}_{i}](\hat{o}_{i})<\sum_{i=1}^{{m_{\rm init}}}\ln{f^{\star}}[\hat{\pi}_{i}](\hat{o}_{i}). (27)

Recalling that f^=argmaxfi=1minitlnf[π^i](o^i){\hat{f}}=\operatorname*{argmax}_{f\in{\mathcal{F}}}\sum_{i=1}^{{m_{\rm init}}}\ln f[\hat{\pi}_{i}](\hat{o}_{i}), Eq. (27) implies f^=f.{\hat{f}}={f^{\star}}.

By algebraic manipulation, for large enough nn we have

exp(ρlnn4lnlnn)exp(ln||lnlnn)=1||lnn.\displaystyle\exp\left(-\frac{\rho\ln n}{4\ln\ln n}\right)\leq\exp(-\ln|{\mathcal{F}}|-\ln\ln n)=\frac{1}{|{\mathcal{F}}|\ln n}. (28)

As a result, for large enough nn we get Pr(f^=f)11/lnn.\mathop{\rm Pr}\nolimits({\hat{f}}={f^{\star}})\geq 1-1/\ln n. In addition, the regret of Step 1 is upper bounded by 𝒪(minit)=𝒪(lnnlnlnn).{\mathcal{O}}({m_{\rm init}})={\mathcal{O}}(\frac{\ln n}{\ln\ln n}).
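For concreteness, the initialization step analyzed above admits a direct implementation when the hypothesis class and the observation space are finite. The following is a minimal sketch under those assumptions; the function and variable names are illustrative and the snippet is not part of Alg. 1.

```python
import numpy as np

# Minimal sketch of the initialization step (Step 1): play every decision
# ceil(ln n / ln ln n) times and return the maximum-likelihood hypothesis.
# Assumes a finite hypothesis class over a finite observation space.
def initialize(hypotheses, sample, n, rng):
    """hypotheses: array of shape (F, Pi, O) with hypotheses[f, pi, o] = f[pi](o).
    sample(pi, rng): draws one observation index from the true instance f*."""
    num_f, num_pi, _ = hypotheses.shape
    reps = int(np.ceil(np.log(n) / np.log(np.log(n))))   # ceil(ln n / ln ln n)
    log_lik = np.zeros(num_f)
    for pi in range(num_pi):          # every decision is played equally often
        for _ in range(reps):
            o = sample(pi, rng)
            log_lik += np.log(hypotheses[:, pi, o])      # accumulate ln f[pi](o)
    return int(np.argmax(log_lik))    # hat{f} = argmax_f sum_i ln f[pi_i](o_i)
```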

A.2 Proof of Lemma 4.2

In this section, we prove Lemma 4.2. To this end, we need the following lemma.

Lemma A.1.

Consider an instance ff\in{\mathcal{F}} and n>0n>0. Let δ=(lnlnn)1/4\delta=(\ln\ln n)^{-1/4} and ϵ=(lnlnn)1.\epsilon=(\ln\ln n)^{-1}. Define

λ=λ0(4(lnlnn)3/4,1lnlnn,f)\displaystyle\lambda=\lambda_{0}\left(4(\ln\ln n)^{3/4},\frac{1}{\ln\ln n},f\right) (29)

as the value for which Condition 1 holds with the corresponding parameters.

Consider any w^+|Π|\hat{w}\in\mathbb{R}^{|\Pi|}_{+} such that w^(lnlnn)1/4\|\hat{w}\|_{\infty}\leq(\ln\ln n)^{1/4}. Let w={πi}i=1mw=\{\pi_{i}\}_{i=1}^{m} be a list of decisions where a decision π\pi occurs ((1+δ)w^π+δ)lnn\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil times for every πΠ\pi\in\Pi, and m=π((1+δ)w^π+δ)lnnm=\sum_{\pi}\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil.

Define the set {\mathcal{F}}(\hat{w},f)=\{g\in{\mathcal{F}}:\sum_{\pi\in\Pi}\hat{w}_{\pi}D_{\mathrm{KL}}(f[\pi]\|g[\pi])\geq 1\}. For any constant c>0, there exists n_{0}>0 such that for all n>n_{0},

D1λw(fg)lnnm+cϵ,g(w^,f).\displaystyle D^{w}_{1-\lambda}(f\|g)\geq\frac{\ln n}{m}+c\epsilon,\quad\forall g\in{\mathcal{F}}(\hat{w},f). (30)

In addition, we have λ1=O(poly(lnlnn))\lambda^{-1}=O(\mathrm{poly}(\ln\ln n)) and m|Π|lnn(lnlnn)1/4.m\geq\frac{|\Pi|\ln n}{(\ln\ln n)^{1/4}}.

Proof.

To prove this lemma, we invoke Condition 1 with proper parameters.

First we bound the value of mm. By the assumption of this lemma, we have w^(lnlnn)1/4\|\hat{w}\|_{\infty}\leq(\ln\ln n)^{1/4}. Consequently, for large enough nn we get

m=π((1+δ)w^π+δ)lnn2|Π|lnn(lnlnn)1/4.\displaystyle m=\sum_{\pi}\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil\leq 2|\Pi|\ln n(\ln\ln n)^{1/4}. (31)

On the other hand,

m=π((1+δ)w^π+δ)lnn|Π|δlnn=|Π|lnn(lnlnn)1/4.\displaystyle m=\sum_{\pi}\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil\geq|\Pi|\delta\ln n=\frac{|\Pi|\ln n}{(\ln\ln n)^{1/4}}. (32)

Define α=lnnm+cϵ.\alpha=\frac{\ln n}{m}+c\epsilon. Then for large enough nn we have α+ϵ2|Π|(lnlnn)1/4.\alpha+\epsilon\leq\frac{2}{|\Pi|}(\ln\ln n)^{1/4}. We can also lower bound γ=1mminπ((1+δ)w^π+δ)lnn\gamma=\frac{1}{m}\min_{\pi}\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil. By the upper bound of mm and the fact that ((1+δ)w^π+δ)lnnδlnn=lnn(lnlnn)1/4\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil\geq\delta\ln n=\frac{\ln n}{(\ln\ln n)^{1/4}}, we get γ12|Π|(lnlnn)1/2.\gamma\geq\frac{1}{2|\Pi|(\ln\ln n)^{1/2}}.

We invoke Condition 1 with parameters ((α+ϵ)/γ,ϵ,f)((\alpha+\epsilon)/\gamma,\epsilon,f). Note that λ\lambda defined in Eq. (29) satisfies λλ0((α+ϵ)/γ,ϵ,f)\lambda\leq\lambda_{0}((\alpha+\epsilon)/\gamma,\epsilon,f) because (α+ϵ)/γ4(lnlnn)3/4(\alpha+\epsilon)/\gamma\leq 4(\ln\ln n)^{3/4}. Therefore we get

D1λ(f[π]g[π])min{DKL(f[π]g[π])ϵ,(α+ϵ)/γ},g,πΠ.\displaystyle D_{1-\lambda}(f[\pi]\|g[\pi])\geq\min\{D_{\mathrm{KL}}(f[\pi]\|g[\pi])-\epsilon,(\alpha+\epsilon)/\gamma\},\quad\forall g\in{\mathcal{F}},\pi\in\Pi. (33)

In the following, we prove that D1λw(fg)αD^{w}_{1-\lambda}(f\|g)\geq\alpha for all g(w^,f)g\in{\mathcal{F}}(\hat{w},f). First of all, for large enough nn, for all g(w^,f)g\in{\mathcal{F}}(\hat{w},f) we get

DKLw(fg)=1mi=1mDKL(f[πi]g[πi])=1mπΠ((1+δ)w^π+δ)lnnDKL(f[π]g[π])\displaystyle D_{\mathrm{KL}}^{w}(f\|g)=\frac{1}{m}\sum_{i=1}^{m}D_{\mathrm{KL}}(f[\pi_{i}]\|g[\pi_{i}])=\frac{1}{m}\sum_{\pi\in\Pi}\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil D_{\mathrm{KL}}(f[\pi]\|g[\pi]) (34)
\displaystyle\geq 1mπΠ(1+δ)(lnn)w^πDKL(f[π]g[π])1m(1+δ)lnn1m(lnn+(c+1)mϵ),\displaystyle\;\frac{1}{m}\sum_{\pi\in\Pi}(1+\delta)(\ln n)\hat{w}_{\pi}D_{\mathrm{KL}}(f[\pi]\|g[\pi])\geq\frac{1}{m}(1+\delta)\ln n\geq\frac{1}{m}(\ln n+(c+1)m\epsilon), (35)

where the last inequality comes from the fact that δlnn=lnn(lnlnn)1/42(c+1)|Π|lnn(lnlnn)3/4(c+1)mϵ.\delta\ln n=\frac{\ln n}{(\ln\ln n)^{1/4}}\gtrsim 2(c+1)|\Pi|\frac{\ln n}{(\ln\ln n)^{3/4}}\geq(c+1)m\epsilon.

Now for a fixed g(w^,f)g\in{\mathcal{F}}(\hat{w},f), consider the following two cases.

Case 1: π¯Π,DKL(f[π¯]g[π¯])(α+ϵ)/γ.\exists\bar{\pi}\in\Pi,D_{\mathrm{KL}}(f[\bar{\pi}]\|g[\bar{\pi}])\geq(\alpha+\epsilon)/\gamma.

In this case we have

D1λw(fg)=1mi=1mD1λ(f[πi]g[πi])\displaystyle D^{w}_{1-\lambda}(f\|g)=\frac{1}{m}\sum_{i=1}^{m}D_{1-\lambda}(f[\pi_{i}]\|g[\pi_{i}]) (36)
\displaystyle\geq\; ((1+δ)w^π¯+δ)lnnmD1λ(f[π¯]g[π¯])\displaystyle\frac{\lceil((1+\delta)\hat{w}_{\bar{\pi}}+\delta)\ln n\rceil}{m}D_{1-\lambda}(f[\bar{\pi}]\|g[\bar{\pi}]) (37)
\displaystyle\geq\; γD1λ(f[π¯]g[π¯])α,\displaystyle\gamma D_{1-\lambda}(f[\bar{\pi}]\|g[\bar{\pi}])\geq\alpha, (38)

where the last inequality comes from Eq. (33).

Case 2: πΠ,DKL(f[π]g[π])(α+ϵ)/γ.\forall\pi\in\Pi,D_{\mathrm{KL}}(f[\pi]\|g[\pi])\leq(\alpha+\epsilon)/\gamma.

In this case, Eq. (33) implies that

D1λ(f[π]g[π])DKL(f[π]g[π])ϵ,πΠ.\displaystyle D_{1-\lambda}(f[\pi]\|g[\pi])\geq D_{\mathrm{KL}}(f[\pi]\|g[\pi])-\epsilon,\quad\forall\pi\in\Pi. (39)

Consequently,

D1λw(fg)=1mi=1mD1λ(f[πi]g[πi])1mi=1m(DKL(f[πi]g[πi])ϵ)=DKLw(fg)ϵ.\displaystyle D^{w}_{1-\lambda}(f\|g)=\frac{1}{m}\sum_{i=1}^{m}D_{1-\lambda}(f[\pi_{i}]\|g[\pi_{i}])\geq\frac{1}{m}\sum_{i=1}^{m}(D_{\mathrm{KL}}(f[\pi_{i}]\|g[\pi_{i}])-\epsilon)=D_{\mathrm{KL}}^{w}(f\|g)-\epsilon.

By Eq. (35) we get DKLw(fg)α+ϵ.D_{\mathrm{KL}}^{w}(f\|g)\geq\alpha+\epsilon. Therefore D1λw(fg)α.D^{w}_{1-\lambda}(f\|g)\geq\alpha.

Combining the two cases together we get the desired result. The lower bound of λ1\lambda^{-1} follows directly from Condition 1. ∎

Now we are ready to prove Lemma 4.2.

Proof of Lemma 4.2.

We prove the items in Lemma 4.2 separately. We will invoke Lemma A.1 and Lemma 4.3 in the proof. Following Lemma A.1, let \delta=(\ln\ln n)^{-1/4}, \epsilon=(\ln\ln n)^{-1} and

λ=λ0(4(lnlnn)3/4,1lnlnn,f).\displaystyle\lambda=\lambda_{0}\left(4(\ln\ln n)^{3/4},\frac{1}{\ln\ln n},f\right). (40)

Recall that Λ(f)={g:π(g)π(f)}.\Lambda(f)=\{g\in{\mathcal{F}}:\pi^{\star}(g)\neq\pi^{\star}(f)\}.

Proof of item (a).

In this case we have fΛ(f^).{f^{\star}}\in\Lambda({\hat{f}}). By Lemma 4.3 we get

Prf(accf^)=Prf(gΛ(f^),i=1mlnf^[πi](oi)g[πi](oi)lnn)\displaystyle\mathop{\rm Pr}\nolimits_{{f^{\star}}}\left({\mathcal{E}}_{\rm acc}^{{\hat{f}}}\right)=\mathop{\rm Pr}\nolimits_{{f^{\star}}}\left(\forall g\in\Lambda({\hat{f}}),\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\ln n\right) (41)
\displaystyle\leq Prf(i=1mlnf^[πi](oi)f[πi](oi)lnn)exp(lnn)=1/n.\displaystyle\;\mathop{\rm Pr}\nolimits_{{f^{\star}}}\left(\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}\geq\ln n\right)\leq\exp(-\ln n)=1/n. (42)

Proof of item (b).

Recall that in this case we have f^=f{\hat{f}}={f^{\star}}. Since w^\hat{w} is the solution to 𝒞(f^,(lnlnn)1/4){\mathcal{C}}({\hat{f}},(\ln\ln n)^{1/4}) (Line 4 of Alg. 1), we have πΠw^πDKL(f^[π]g[π])1\sum_{\pi\in\Pi}\hat{w}_{\pi}D_{\mathrm{KL}}({\hat{f}}[\pi]\|g[\pi])\geq 1 for all gΛ(f^)g\in\Lambda({\hat{f}}). Recall that ww is the list of decisions computed by Line 5 of Alg. 1. By Lemma A.1 we get

D1λw(fg)lnnm+ϵ,gΛ(f^).\displaystyle D^{w}_{1-\lambda}(f\|g)\geq\frac{\ln n}{m}+\epsilon,\quad\forall g\in\Lambda({\hat{f}}). (43)

Let β=lnnm+ϵ\beta=\frac{\ln n}{m}+\epsilon. By Lemma 4.3, for every gΛ(f^)g\in\Lambda({\hat{f}}) we have

Prf(i=1mlnf[πi](oi)g[πi](oi)lnn)1exp(mλϵ).\displaystyle\mathop{\rm Pr}\nolimits_{{f^{\star}}}\left(\sum_{i=1}^{m}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\ln n\right)\geq 1-\exp(-m\lambda\epsilon). (44)

Lemma A.1 also yields λ1poly(lnlnn)\lambda^{-1}\leq\mathrm{poly}(\ln\ln n) and m|Π|lnn(lnlnn)1/4m\geq\frac{|\Pi|\ln n}{(\ln\ln n)^{1/4}}. Therefore mλϵlnnpoly(lnlnn)m\lambda\epsilon\geq\frac{\ln n}{\mathrm{poly}(\ln\ln n)}. Consequently, for large enough nn we have

\displaystyle\exp(-m\lambda\epsilon)\leq\frac{1}{|{\mathcal{F}}|\ln n}. (45)

Applying a union bound over g\in\Lambda({\hat{f}}), under the event \mathbb{I}\left[{\hat{f}}={f^{\star}}\right] we get

\displaystyle\mathop{\rm Pr}\nolimits_{{f^{\star}}}\left({\mathcal{E}}_{\rm acc}^{{\hat{f}}}\right)=\mathop{\rm Pr}\nolimits_{{f^{\star}}}\left(\forall g\in\Lambda({\hat{f}}),\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\ln n\right)\geq 1-\frac{1}{\ln n}. (46)

Proof of item (c).

Recall that Δ(f,π)\Delta(f,\pi) is the sub-optimality gap of decision π\pi on instance ff, and Δmax(f)=maxπΔ(f,π)\Delta_{\rm max}(f)=\max_{\pi}\Delta(f,\pi) is the maximum decision gap of instance ff. Since w^\hat{w} is the solution to 𝒞(f^,(lnlnn)1/4){\mathcal{C}}({\hat{f}},(\ln\ln n)^{1/4}), we have w^(lnlnn)1/4.\|\hat{w}\|_{\infty}\leq(\ln\ln n)^{1/4}. As a result, the regret of Step 2 is upper bounded by

π((1+δ)w^π+δ)lnnΔmax(f)|Π|lnn(lnlnn)1/4,\displaystyle\sum_{\pi}\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil\Delta_{\rm max}({f^{\star}})\lesssim|\Pi|\ln n(\ln\ln n)^{1/4}, (47)

which proves the second part of (c). For the first part, when f^=f{\hat{f}}={f^{\star}} we have

π((1+δ)w^π+δ)lnnΔ(f,π)\displaystyle\sum_{\pi}\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil\Delta({f^{\star}},\pi) (48)
\displaystyle\leq\; π(1+δ)w^π(lnn)Δ(f,π)+|Π|Δmax(f)(1+δlnn)\displaystyle\sum_{\pi}(1+\delta)\hat{w}_{\pi}(\ln n)\Delta({f^{\star}},\pi)+|\Pi|\Delta_{\rm max}({f^{\star}})(1+\delta\ln n) (49)
=\displaystyle=\; ((1+δ)𝒞(f,(lnlnn)1/4)+o(1))lnn.\displaystyle((1+\delta){\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4})+o(1))\ln n. (50)

By Lemma B.1, 𝒞(f,(lnlnn)1/4)=𝒪(1){\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4})={\mathcal{O}}(1). As a result,

((1+δ)𝒞(f,(lnlnn)1/4)+o(1))lnn(𝒞(f,(lnlnn)1/4)+o(1))lnn.\displaystyle((1+\delta){\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4})+o(1))\ln n\leq({\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4})+o(1))\ln n. (51)
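The event {\mathcal{E}}_{\rm acc}^{{\hat{f}}} analyzed in items (a) and (b) above is simply a log-likelihood-ratio test against every alternative with a different optimal decision (cf. Eq. (41)). The following is a minimal sketch of this test for a finite hypothesis class with a finite observation space; the names are illustrative and the snippet is not part of Alg. 1.

```python
import numpy as np

# Minimal sketch of the acceptance event E_acc^{f_hat}: after executing the
# decision list w and observing o_1, ..., o_m, accept f_hat iff the
# log-likelihood ratio against every alternative g with pi*(g) != pi*(f_hat)
# is at least ln(n).  Each model is an array of shape (Pi, O) of probabilities.
def accept(f_hat, alternatives, decisions, observations, n):
    def llr(g):
        return sum(np.log(f_hat[pi][o]) - np.log(g[pi][o])
                   for pi, o in zip(decisions, observations))
    return all(llr(g) >= np.log(n) for g in alternatives)
```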

A.3 Proof of Lemma 4.3

In this section, we prove Lemma 4.3. To this end, we need the following concentration inequality.

Lemma A.2.

Consider two sequences of distributions P={Pi}i=1mP=\{P_{i}\}_{i=1}^{m} and Q={Qi}i=1mQ=\{Q_{i}\}_{i=1}^{m}, and a sequence of random variables o={oi}i=1mo=\{o_{i}\}_{i=1}^{m} independently drawn from PP. For any λ(0,1)\lambda\in(0,1) and ϵ>0\epsilon>0 we have

ProP(1mi=1mlnPi(oi)Qi(oi)1mi=1mD1λ(PiQi)ϵ)1exp(mλϵ).\displaystyle\mathop{\rm Pr}\nolimits_{o\sim P}\left(\frac{1}{m}\sum_{i=1}^{m}\ln\frac{P_{i}(o_{i})}{Q_{i}(o_{i})}\geq\frac{1}{m}\sum_{i=1}^{m}D_{1-\lambda}(P_{i}\|Q_{i})-\epsilon\right)\geq 1-\exp(-m\lambda\epsilon). (52)
Proof of Lemma A.2.

We prove Lemma A.2 by moment method. Consider any λ(0,1),ϵ>0\lambda\in(0,1),\epsilon>0 and let β=1mi=1mD1λ(PiQi).\beta=\frac{1}{m}\sum_{i=1}^{m}D_{1-\lambda}(P_{i}\|Q_{i}). Then we have

ProP(i=1mlnPi(oi)Qi(oi)m(βϵ))\displaystyle\mathop{\rm Pr}\nolimits_{o\sim P}\left(\sum_{i=1}^{m}\ln\frac{P_{i}(o_{i})}{Q_{i}(o_{i})}\leq m(\beta-\epsilon)\right) (53)
=\displaystyle=\; ProP(i=1mlnQi(oi)Pi(oi)m(βϵ))\displaystyle\mathop{\rm Pr}\nolimits_{o\sim P}\left(\sum_{i=1}^{m}\ln\frac{Q_{i}(o_{i})}{P_{i}(o_{i})}\geq-m(\beta-\epsilon)\right) (54)
=\displaystyle=\; ProP(exp(λi=1mlnQi(oi)Pi(oi))exp(λm(βϵ)))\displaystyle\mathop{\rm Pr}\nolimits_{o\sim P}\left(\exp\left(\lambda\sum_{i=1}^{m}\ln\frac{Q_{i}(o_{i})}{P_{i}(o_{i})}\right)\geq\exp(-\lambda m(\beta-\epsilon))\right) (55)
\displaystyle\leq\; exp(λm(βϵ))𝔼oP[exp(λi=1mlnQi(oi)Pi(oi))]\displaystyle\exp(\lambda m(\beta-\epsilon))\mathbb{E}_{o\sim P}\left[\exp\left(\lambda\sum_{i=1}^{m}\ln\frac{Q_{i}(o_{i})}{P_{i}(o_{i})}\right)\right] (56)
\displaystyle\leq\; exp(λm(βϵ))i=1m𝔼oiPi[(Qi(oi)λPi(oi)λ)]\displaystyle\exp(\lambda m(\beta-\epsilon))\prod_{i=1}^{m}\mathbb{E}_{o_{i}\sim P_{i}}\left[\left(Q_{i}(o_{i})^{\lambda}P_{i}(o_{i})^{-\lambda}\right)\right] (57)
\displaystyle\leq\; exp(λm(βϵ))i=1m(o(Qi(o)λPi(o)1λ)do)\displaystyle\exp(\lambda m(\beta-\epsilon))\prod_{i=1}^{m}\left(\int_{o}\left(Q_{i}(o)^{\lambda}P_{i}(o)^{1-\lambda}\right)\mathrm{d}o\right) (58)
\displaystyle\leq\; exp(λm(βϵ))i=1mexp((λ1)Dλ(QiPi)).\displaystyle\exp(\lambda m(\beta-\epsilon))\prod_{i=1}^{m}\exp((\lambda-1)D_{\lambda}(Q_{i}\|P_{i})). (59)

where Eq. (56) follows from Markov inequality. By Van Erven and Harremos [2014, Proposition 2] we get Dλ(QP)=λ1λD1λ(PQ)D_{\lambda}(Q\|P)=\frac{\lambda}{1-\lambda}D_{1-\lambda}(P\|Q) for any distributions P,QP,Q. As a result,

exp(λm(βϵ))i=1mexp((λ1)Dλ(QiPi))\displaystyle\exp(\lambda m(\beta-\epsilon))\prod_{i=1}^{m}\exp((\lambda-1)D_{\lambda}(Q_{i}\|P_{i})) (60)
=\displaystyle=\; exp(λm(βϵ))i=1mexp(λD1λ(PiQi))\displaystyle\exp(\lambda m(\beta-\epsilon))\prod_{i=1}^{m}\exp(-\lambda D_{1-\lambda}(P_{i}\|Q_{i})) (61)
=\displaystyle=\; exp(λm(βϵ))exp(λmβ)\displaystyle\exp(\lambda m(\beta-\epsilon))\exp\left(-\lambda m\beta\right) (62)
=\displaystyle=\; exp(mλϵ).\displaystyle\exp\left(-m\lambda\epsilon\right). (63)

Now we are ready to prove Lemma 4.3.

Proof of Lemma 4.3.

First we prove Eq. (14). By the moment method, for any c>0c>0

PrQ(acc)=PrQ(i=1mlnPi(oi)Qi(oi)c)\displaystyle\mathop{\rm Pr}\nolimits_{Q}({\mathcal{E}}_{\rm acc})=\mathop{\rm Pr}\nolimits_{Q}\left(\sum_{i=1}^{m}\ln\frac{P_{i}(o_{i})}{Q_{i}(o_{i})}\geq c\right) (64)
=\displaystyle=\; PrQ(exp(i=1mlnPi(oi)Qi(oi))exp(c))\displaystyle\mathop{\rm Pr}\nolimits_{Q}\left(\exp\left(\sum_{i=1}^{m}\ln\frac{P_{i}(o_{i})}{Q_{i}(o_{i})}\right)\geq\exp(c)\right) (65)
\displaystyle\leq\; exp(c)𝔼Q[i=1mPi(oi)Qi(oi)]\displaystyle\exp(-c)\mathbb{E}_{Q}\left[\prod_{i=1}^{m}\frac{P_{i}(o_{i})}{Q_{i}(o_{i})}\right] (Markov inequality)
\displaystyle\leq\; exp(c)i=1m𝔼oiQi[Pi(oi)Qi(oi)]\displaystyle\exp(-c)\prod_{i=1}^{m}\mathbb{E}_{o_{i}\sim Q_{i}}\left[\frac{P_{i}(o_{i})}{Q_{i}(o_{i})}\right] (independence of oio_{i}’s)
=\displaystyle=\; exp(c),\displaystyle\exp(-c), (66)

where the last line comes from the fact that both Pi,QiP_{i},Q_{i} are valid distributions. For Eq. (15), we can directly invoke Lemma A.2 with ϵ=βc/m\epsilon=\beta-c/m. ∎

A.4 Proof of Theorem 3.3

We prove Theorem 3.3 by stitching Lemma 4.1 and Lemma 4.2 together.

Proof of Theorem 3.3.

We upper bound the regret of Alg. 1 by bounding the regret under the following events separately. Let \mathrm{Reg}_{\rm Step1},\mathrm{Reg}_{\rm Step2}, and \mathrm{Reg}_{\rm Step3} be the regret incurred in Steps 1, 2, and 3 respectively. Let {\mathcal{E}}_{\rm init}=\mathbb{I}\left[{\hat{f}}={f^{\star}}\right] be the event that the initial estimation is accurate.

Regret of Step 1:

By Lemma 4.1 we have,

lim supn1lnn𝔼[RegStep1]0.\limsup_{n\to\infty}\frac{1}{\ln n}\mathbb{E}[\mathrm{Reg}_{\rm Step1}]\leq 0.

Regret of Step 2.

By item (c) of Lemma 4.2 and Lemma 4.1,

𝔼[RegStep2]=𝔼[RegStep2init]Pr(init)+𝔼[RegStep2initc]Pr(initc)\displaystyle\mathbb{E}\left[\mathrm{Reg}_{\rm Step2}\right]=\mathbb{E}\left[\mathrm{Reg}_{\rm Step2}\mid{\mathcal{E}}_{\rm init}\right]\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm init})+\mathbb{E}\left[\mathrm{Reg}_{\rm Step2}\mid{\mathcal{E}}_{\rm init}^{c}\right]\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm init}^{c}) (67)
\displaystyle\leq\; (𝒞(f,(lnlnn)1/4)+o(1))ln(n)+𝒪(lnnlnlnn)/lnn.\displaystyle({\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4})+o(1))\ln(n)+{\mathcal{O}}(\ln n\ln\ln n)/\ln n. (68)

As a result,

lim supn1lnn𝔼[RegStep2]lim supn𝒞(f,(lnlnn)1/4)=𝒞(f).\displaystyle\limsup_{n\to\infty}\frac{1}{\ln n}\mathbb{E}\left[\mathrm{Reg}_{\rm Step2}\right]\leq\limsup_{n\to\infty}{\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4})={\mathcal{C}}({f^{\star}}). (69)

Regret of Step 3.

First focus on the event init{\mathcal{E}}_{\rm init}. Under event accf^init{\mathcal{E}}_{\rm acc}^{\hat{f}}\cap{\mathcal{E}}_{\rm init}, there is no regret in Step 3. On the other hand, by item (b) of Lemma 4.2 we have Pr(init(accf^)c)1/lnn.\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm init}\cap({\mathcal{E}}_{\rm acc}^{\hat{f}})^{c})\leq 1/\ln n. Since UCB gives logarithmic regret, we have

lim supn1lnn𝔼[𝕀[init,(accf^)c]RegStep3]lim supn𝒪(lnn)lnnPr(init(accf^)c)0.\displaystyle\limsup_{n\to\infty}\frac{1}{\ln n}\mathbb{E}\left[\mathbb{I}\left[{\mathcal{E}}_{\rm init},({\mathcal{E}}_{\rm acc}^{\hat{f}})^{c}\right]\mathrm{Reg}_{\rm Step3}\right]\leq\limsup_{n\to\infty}\frac{{\mathcal{O}}(\ln n)}{\ln n}\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm init}\cap({\mathcal{E}}_{\rm acc}^{\hat{f}})^{c})\leq 0. (70)

As a result,

lim supn1lnn𝔼[𝕀[init]RegStep3]0.\displaystyle\limsup_{n\to\infty}\frac{1}{\ln n}\mathbb{E}\left[\mathbb{I}\left[{\mathcal{E}}_{\rm init}\right]\mathrm{Reg}_{\rm Step3}\right]\leq 0. (71)

Now we focus on the event initc{\mathcal{E}}_{\rm init}^{c}. Let 1={π(f^)=π(f)}.{\mathcal{E}}_{1}=\{\pi^{\star}({\hat{f}})=\pi^{\star}({f^{\star}})\}. Under event accf^1{\mathcal{E}}_{\rm acc}^{\hat{f}}\cap{\mathcal{E}}_{1} the algorithm has no regret in Step 3:

𝔼[𝕀[initc,1,accf^]RegStep3]=0.\displaystyle\mathbb{E}\left[\mathbb{I}\left[{\mathcal{E}}_{\rm init}^{c},{\mathcal{E}}_{1},{\mathcal{E}}_{\rm acc}^{\hat{f}}\right]\mathrm{Reg}_{\rm Step3}\right]=0. (72)

On the other hand, consider the event accf^1c{\mathcal{E}}_{\rm acc}^{\hat{f}}\cap{\mathcal{E}}_{1}^{c}. By item (a) of Lemma 4.2 we have

𝔼[𝕀[initc,1c,accf^]RegStep3]nΔmaxPr(𝕀[π(f)π(f^),accf^])nΔmaxn𝒪(1).\displaystyle\mathbb{E}\left[\mathbb{I}\left[{\mathcal{E}}_{\rm init}^{c},{\mathcal{E}}_{1}^{c},{\mathcal{E}}_{\rm acc}^{\hat{f}}\right]\mathrm{Reg}_{\rm Step3}\right]\leq n\Delta_{\rm max}\mathop{\rm Pr}\nolimits\left(\mathbb{I}\left[\pi^{\star}({f^{\star}})\neq\pi^{\star}({\hat{f}}),{\mathcal{E}}_{\rm acc}^{\hat{f}}\right]\right)\leq\frac{n\Delta_{\rm max}}{n}\leq{\mathcal{O}}(1). (73)

Under the event (accf^)c({\mathcal{E}}_{\rm acc}^{\hat{f}})^{c}, Step 3 incurs logarithmic regret. As a result,

𝔼[𝕀[initc,(accf^)c]RegStep3]𝒪(lnn)Pr(initc)𝒪(1).\displaystyle\mathbb{E}\left[\mathbb{I}\left[{\mathcal{E}}_{\rm init}^{c},({\mathcal{E}}_{\rm acc}^{\hat{f}})^{c}\right]\mathrm{Reg}_{\rm Step3}\right]\leq{\mathcal{O}}(\ln n)\mathop{\rm Pr}\nolimits\left({\mathcal{E}}_{\rm init}^{c}\right)\leq{\mathcal{O}}(1). (74)

Combining Eqs. (72), (73) and (74) we get

lim supn1lnn𝔼[𝕀[initc]RegStep3]=0.\displaystyle\limsup_{n\to\infty}\frac{1}{\ln n}\mathbb{E}\left[\mathbb{I}\left[{\mathcal{E}}_{\rm init}^{c}\right]\mathrm{Reg}_{\rm Step3}\right]=0. (75)

Therefore, combining Eqs. (71) and (75),

lim supn1lnn𝔼[RegStep3]=0.\displaystyle\limsup_{n\to\infty}\frac{1}{\ln n}\mathbb{E}\left[\mathrm{Reg}_{\rm Step3}\right]=0. (76)
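Explicitly, since the total regret is the sum of the regrets of the three steps, combining the bounds above yields

\displaystyle\limsup_{n\to\infty}\frac{1}{\ln n}\mathbb{E}\left[\mathrm{Reg}_{\rm Step1}+\mathrm{Reg}_{\rm Step2}+\mathrm{Reg}_{\rm Step3}\right]\leq 0+{\mathcal{C}}({f^{\star}})+0={\mathcal{C}}({f^{\star}}).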

Stitching the regret bounds for Steps 1-3 together, we prove Theorem 3.3. ∎

Appendix B Missing Proofs in Section 3.1

In this section, we present the missing proofs from Section 3.1.

B.1 Proof of Lemma 3.1

To prove Lemma 3.1, we need the following lemma.

Lemma B.1.

For any ff\in{\mathcal{F}}, let n0=5(Δmin(f)/Rmax)2.n_{0}=5(\Delta_{\rm min}(f)/R_{\rm max})^{-2}. Consider w+|Π|w\in\mathbb{R}^{|\Pi|}_{+} such that wπ=n0w_{\pi}=n_{0} for all πΠ\pi\in\Pi. Then ww is a valid solution to 𝒞(f,n){\mathcal{C}}(f,n) for all n>n0.n>n_{0}. As a corollary, 𝒞(f,n)Δmax(f)|Π|n0{\mathcal{C}}(f,n)\leq\Delta_{\rm max}(f)|\Pi|n_{0} for all n>n0.n>n_{0}.

Proof of Lemma B.1.

Recall that Δ(f,π)=maxπΠRf(π)Rf(π)\Delta(f,\pi)=\max_{\pi^{\prime}\in\Pi}R_{f}(\pi^{\prime})-R_{f}(\pi) is the reward gap of decision π\pi under instance ff, and Δmin(f)=minπ:Δ(f,π)>0Δ(f,π)\Delta_{\rm min}(f)=\min_{\pi:\Delta(f,\pi)>0}\Delta(f,\pi) is the minimum decision gap of the instance ff.

Let Λ(f)={g,π(g)π(f)}.\Lambda(f)=\{g\in{\mathcal{F}},\pi^{\star}(g)\neq\pi^{\star}(f)\}. For any instance gΛ(f)g\in\Lambda(f), consider the following two decisions: π1=π(f),π2=π(g).\pi_{1}=\pi^{\star}(f),\pi_{2}=\pi^{\star}(g). We claim that

maxπ{π1,π2}|Rf(π)Rg(π)|Δmin(f)3.\displaystyle\max_{\pi\in\{\pi_{1},\pi_{2}\}}{\left|{R_{f}(\pi)-R_{g}(\pi)}\right|}\geq\frac{\Delta_{\rm min}(f)}{3}. (77)

This claim can be proved by contradiction. Suppose on the contrary that

maxπ{π1,π2}|Rf(π)Rg(π)|<Δmin(f)/3.\max_{\pi\in\{\pi_{1},\pi_{2}\}}{\left|{R_{f}(\pi)-R_{g}(\pi)}\right|}<\Delta_{\rm min}(f)/3.

Then we have

Rg(π1)Rf(π1)Δmin(f)3Rf(π2)+2Δmin(f)3Rg(π2)+Δmin(f)3>Rg(π2),\displaystyle R_{g}(\pi_{1})\geq R_{f}(\pi_{1})-\frac{\Delta_{\rm min}(f)}{3}\geq R_{f}(\pi_{2})+\frac{2\Delta_{\rm min}(f)}{3}\geq R_{g}(\pi_{2})+\frac{\Delta_{\rm min}(f)}{3}>R_{g}(\pi_{2}),

which contradicts the fact that \pi_{2}=\pi^{\star}(g).

Recall that Rf(π)=𝔼of[π][R(o)].R_{f}(\pi)=\mathbb{E}_{o\sim f[\pi]}[R(o)]. As a result, |Rf(π)Rg(π)|RmaxDTV(f[π]g[π]){\left|{R_{f}(\pi)-R_{g}(\pi)}\right|}\leq R_{\rm max}D_{\rm TV}(f[\pi]\|g[\pi]). Now, by Pinsker’s inequality, for any πΠ\pi\in\Pi we have

DKL(f[π]g[π])2DTV(f[π]g[π])22(|Rf(π)Rg(π)|/Rmax)2.\displaystyle D_{\mathrm{KL}}(f[\pi]\|g[\pi])\geq 2D_{\rm TV}(f[\pi]\|g[\pi])^{2}\geq 2({\left|{R_{f}(\pi)-R_{g}(\pi)}\right|}/R_{\rm max})^{2}. (78)

Combining with Eq. (77) and the choice n_{0}=5(\Delta_{\rm min}(f)/R_{\rm max})^{-2}, for any g\in\Lambda(f) we get

\displaystyle\sum_{\pi\in\Pi}n_{0}D_{\mathrm{KL}}(f[\pi]\|g[\pi])\geq 2n_{0}\left(\frac{\Delta_{\rm min}(f)}{3R_{\rm max}}\right)^{2}=\frac{10}{9}\geq 1. (79)

Since this inequality holds for every gΛ(f)g\in\Lambda(f), we prove the desired result. ∎

Now we present the proof of Lemma 3.1.

Proof of Lemma 3.1.

By Lemma B.1, for n>5(Δmin(f)/Rmax)2n>5(\Delta_{\rm min}(f)/R_{\rm max})^{-2}, we get 𝒞(f,n)<{\mathcal{C}}(f,n)<\infty. ∎

B.2 Proof of Theorem 3.2

Now we show the proof of Theorem 3.2.

Proof of Theorem 3.2.

In the following we fix a consistent algorithm 𝒜{\mathcal{A}}. Consider any instance gg\in{\mathcal{F}} such that π(g)π(f)\pi^{\star}(g)\neq\pi^{\star}(f).

Recall that \Delta(f,\pi)=\max_{\pi^{\prime}\in\Pi}R_{f}(\pi^{\prime})-R_{f}(\pi) is the reward gap of decision \pi under instance f and \Delta_{\rm min}(f)=\min_{\pi:\Delta(f,\pi)>0}\Delta(f,\pi) is the minimum decision gap of the instance f. Let \epsilon=\min\{\Delta_{\rm min}(g),\Delta_{\rm min}(f)\}/2. Consider the two distributions (over observations and decisions) P_{f,n},P_{g,n} induced by running n steps of algorithm {\mathcal{A}} on instances f and g respectively. In addition, let N_{\pi} be the random variable that indicates the number of times decision \pi is executed, and \pi_{i} the random variable indicating the decision executed at step i.

Let Regf,n\mathrm{Reg}_{f,n} and Regg,n\mathrm{Reg}_{g,n} be the regret of running algorithm 𝒜{\mathcal{A}} on instances ff and gg respectively. By definition we have

Regf,n+Regg,nϵnPrf(Nπ(f)n2)+Prg(Nπ(f)>n2)\displaystyle\frac{\mathrm{Reg}_{f,n}+\mathrm{Reg}_{g,n}}{\epsilon n}\geq\mathop{\rm Pr}\nolimits\nolimits_{f}\left(N_{\pi^{\star}(f)}\leq\frac{n}{2}\right)+\mathop{\rm Pr}\nolimits\nolimits_{g}\left(N_{\pi^{\star}(f)}>\frac{n}{2}\right) (80)

By a basic inequality relating the KL divergence to testing error [Lattimore and Szepesvari, 2017, Lemma 5] we get,

Prf(Nπ(f)n2)+Prg(Nπ(f)>n2)12exp(DKL(Pf,nPg,n)).\displaystyle\mathop{\rm Pr}\nolimits\nolimits_{f}\left(N_{\pi^{\star}(f)}\leq\frac{n}{2}\right)+\mathop{\rm Pr}\nolimits\nolimits_{g}\left(N_{\pi^{\star}(f)}>\frac{n}{2}\right)\geq\frac{1}{2}\exp(-D_{\mathrm{KL}}(P_{f,n}\|P_{g,n})). (81)

Combining Eq. (80) and Eq. (81) we get,

DKL(Pf,nPg,n)ln(ϵn2(Regf,n+Regg,n)).\displaystyle D_{\mathrm{KL}}(P_{f,n}\|P_{g,n})\geq\ln\left(\frac{\epsilon n}{2\left(\mathrm{Reg}_{f,n}+\mathrm{Reg}_{g,n}\right)}\right). (82)

Now applying the chain rule for KL divergence (see, e.g., [Cover, 1999, Theorem 2.5.3]), we get

DKL(Pf,nPg,n)=𝔼f[i=1nDKL(f[πi]g[πi])]=πΠ𝔼f[Nπ]DKL(f[π]g[π]).\displaystyle D_{\mathrm{KL}}(P_{f,n}\|P_{g,n})=\mathbb{E}_{f}\left[\sum_{i=1}^{n}D_{\mathrm{KL}}(f[\pi_{i}]\|g[\pi_{i}])\right]=\sum_{\pi\in\Pi}\mathbb{E}_{f}[N_{\pi}]D_{\mathrm{KL}}(f[\pi]\|g[\pi]). (83)

Therefore we have

πΠ𝔼f[Nπ]ln(ϵn)ln(2(Regf,n+Regg,n))DKL(f[π]g[π])1.\displaystyle\sum_{\pi\in\Pi}\frac{\mathbb{E}_{f}[N_{\pi}]}{\ln\left(\epsilon n\right)-\ln\left({2\left(\mathrm{Reg}_{f,n}+\mathrm{Reg}_{g,n}\right)}\right)}D_{\mathrm{KL}}(f[\pi]\|g[\pi])\geq 1. (84)

Consider the mixture of decisions

wπ=𝔼f[Nπ]ln(ϵn)ln(2(Regf,n+Regg,n)).w_{\pi}=\frac{\mathbb{E}_{f}[N_{\pi}]}{\ln\left(\epsilon n\right)-\ln\left({2\left(\mathrm{Reg}_{f,n}+\mathrm{Reg}_{g,n}\right)}\right)}.

Since \mathrm{Reg}_{f,n}+\mathrm{Reg}_{g,n}\leq n\Delta_{\rm max} and \mathbb{E}_{f}[N_{\pi}]\leq n, the vector w satisfies the constraints of {\mathcal{C}}\left(f,n\right) for large enough n. In addition, the expected regret of algorithm {\mathcal{A}} is

Regf,n=𝔼f[i=1nΔπi]=π𝔼f[Nπ]Δπ\displaystyle\mathrm{Reg}_{f,n}=\mathbb{E}_{f}\left[\sum_{i=1}^{n}\Delta_{\pi_{i}}\right]=\sum_{\pi}\mathbb{E}_{f}[N_{\pi}]\Delta_{\pi} (85)
\displaystyle\geq\; πΔπwπln(ϵn2(Regf,n+Regg,n))\displaystyle\sum_{\pi}\Delta_{\pi}w_{\pi}\ln\left(\frac{\epsilon n}{2\left(\mathrm{Reg}_{f,n}+\mathrm{Reg}_{g,n}\right)}\right) (86)
\displaystyle\geq\; 𝒞(f,n)ln(ϵn2(Regf,n+Regg,n)).\displaystyle{\mathcal{C}}\left(f,n\right)\ln\left(\frac{\epsilon n}{2\left(\mathrm{Reg}_{f,n}+\mathrm{Reg}_{g,n}\right)}\right). (87)

Since 𝒜{\mathcal{A}} is consistent, for any p>0p>0 we have Regf,n+Regg,n=O(np).\mathrm{Reg}_{f,n}+\mathrm{Reg}_{g,n}=O(n^{p}). Consequently, for any p>0p>0,

lim supnRegf,nln(n)\displaystyle\limsup_{n\to\infty}\frac{\mathrm{Reg}_{f,n}}{\ln(n)} (88)
\displaystyle\geq\; lim supn𝒞(f,n)ln(n)ln(2(Regf,n+Regg,n)/ϵ)ln(n)\displaystyle\limsup_{n\to\infty}{\mathcal{C}}\left(f,n\right)\frac{\ln(n)-\ln\left(2\left(\mathrm{Reg}_{f,n}+\mathrm{Reg}_{g,n}\right)/\epsilon\right)}{\ln(n)} (89)
\displaystyle\geq\; 𝒞(f)(1p).\displaystyle{\mathcal{C}}(f)(1-p). (90)

Since p>0p>0 is an arbitrary constant, we get

lim supnRegf,nln(n)𝒞(f).\displaystyle\limsup_{n\to\infty}\frac{\mathrm{Reg}_{f,n}}{\ln(n)}\geq{\mathcal{C}}(f). (91)

B.3 Instantiation of the Complexity Measure

In this section, we instantiate our complexity measure {\mathcal{C}}(f) in concrete settings.

First we consider multi-armed bandits with unit-variance Gaussian rewards. The family of decisions is \Pi=\{1,2,\cdots,A\}. Let (\mu_{1},\mu_{2},\cdots,\mu_{A}) be the mean rewards of the decisions. An instance f in this case is characterized by its mean rewards, and f[i]={\mathcal{N}}(\mu_{i},1).

Proposition B.2.

For a multi-armed bandit instance f with a unique optimal decision and unit-variance Gaussian noise, let (\mu_{1},\mu_{2},\cdots,\mu_{A}) be the mean rewards of the actions. Then

𝒞(f)i[1,A] and Δi>02Δi{\mathcal{C}}(f)\leq\sum_{i\in[1,A]\text{ and }\Delta_{i}>0}\frac{2}{\Delta_{i}}

where Δi=maxiμiμi.\Delta_{i}=\max_{i^{\prime}}\mu_{i^{\prime}}-\mu_{i}.

Proof.

We prove this proposition by constructing a solution w+Aw\in\mathbb{R}^{A}_{+} to the optimization problem 𝒞(f,n){\mathcal{C}}(f,n) for large enough nn.

Without loss of generality, we assume μ1>μi\mu_{1}>\mu_{i} for all i2.i\geq 2. Consider the action frequency

wi={2nnΔi23,i2,n,i=1.\displaystyle w_{i}=\begin{cases}\frac{2n}{n\Delta_{i}^{2}-3},&i\geq 2,\\ n,&i=1.\end{cases} (92)

Then when nn is large enough we have wn.\|w\|_{\infty}\leq n. On the other hand, consider any gg\in{\mathcal{F}} such that π(g)π(f).\pi^{\star}(g)\neq\pi^{\star}(f). Suppose π(g)=i.\pi^{\star}(g)=i. In the following we show that

\displaystyle\sum_{i=1}^{A}w_{i}D_{\mathrm{KL}}(f[i]\|g[i])\geq 1. (93)

Let (μ1,μ2,,μA)(\mu_{1}^{\prime},\mu_{2}^{\prime},\cdots,\mu_{A}^{\prime}) be the mean reward of instance gg. For multi-armed bandits with unit Gaussian noise, we have

\displaystyle\sum_{i=1}^{A}w_{i}D_{\mathrm{KL}}(f[i]\|g[i])=\frac{1}{2}\sum_{i=1}^{A}w_{i}(\mu_{i}-\mu_{i}^{\prime})^{2}\geq\frac{1}{2}\left(w_{1}(\mu_{1}-\mu_{1}^{\prime})^{2}+w_{i}(\mu_{i}-\mu_{i}^{\prime})^{2}\right). (94)

By the condition that π(g)=i\pi^{\star}(g)=i, we get μiμ1.\mu_{i}^{\prime}\geq\mu_{1}^{\prime}. Combining with the fact that μ1>μi\mu_{1}>\mu_{i} we get

minμ1,μi:μ1μi(w1(μ1μ1)2+wi(μiμi)2)\displaystyle\min_{\mu_{1}^{\prime},\mu_{i}^{\prime}:\mu_{1}^{\prime}\leq\mu_{i}^{\prime}}\left(w_{1}(\mu_{1}-\mu_{1}^{\prime})^{2}+w_{i}(\mu_{i}-\mu_{i}^{\prime})^{2}\right) (95)
=\displaystyle= minμ1[μi,μ1](w1(μ1μ1)2+wi(μiμ1)2)\displaystyle\;\min_{\mu_{1}^{\prime}\in[\mu_{i},\mu_{1}]}\left(w_{1}(\mu_{1}-\mu_{1}^{\prime})^{2}+w_{i}(\mu_{i}-\mu_{1}^{\prime})^{2}\right) (96)
=\displaystyle= w1wiw1+wi(μ1μi)2.\displaystyle\;\frac{w_{1}w_{i}}{w_{1}+w_{i}}(\mu_{1}-\mu_{i})^{2}. (97)

By the definition of wiw_{i}, we have

w1wiw1+wi(μ1μi)2=2n2nΔi23n+2nnΔi23Δi2=2Δi2nnΔi212.\displaystyle\frac{w_{1}w_{i}}{w_{1}+w_{i}}(\mu_{1}-\mu_{i})^{2}=\frac{\frac{2n^{2}}{n\Delta_{i}^{2}-3}}{n+\frac{2n}{n\Delta_{i}^{2}-3}}\Delta_{i}^{2}=\frac{2\Delta_{i}^{2}n}{n\Delta_{i}^{2}-1}\geq 2. (98)

As a result,

𝒞(f,n)i=2A2nnΔi23Δi.\displaystyle{\mathcal{C}}(f,n)\leq\sum_{i=2}^{A}\frac{2n}{n\Delta_{i}^{2}-3}\Delta_{i}. (99)

It follows that

𝒞(f)=limn𝒞(f,n)limni=2A2nnΔi23Δi=i=2A2Δi.\displaystyle{\mathcal{C}}(f)=\lim_{n\to\infty}{\mathcal{C}}(f,n)\leq\lim_{n\to\infty}\sum_{i=2}^{A}\frac{2n}{n\Delta_{i}^{2}-3}\Delta_{i}=\sum_{i=2}^{A}\frac{2}{\Delta_{i}}. (100)
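As a quick numerical illustration of Proposition B.2 (not part of the algorithm; the function names below are ours), the upper bound \sum_{i:\Delta_{i}>0}2/\Delta_{i} and the explicit weights w_{i}=\frac{2n}{n\Delta_{i}^{2}-3} used in the proof can be evaluated as follows.

```python
import numpy as np

# Numerical sketch of Proposition B.2 for a Gaussian multi-armed bandit with
# mean rewards mu: the stated upper bound on C(f) and the proof's weights.
def mab_complexity_upper_bound(mu):
    mu = np.asarray(mu, dtype=float)
    gaps = mu.max() - mu                          # Delta_i = max_i' mu_i' - mu_i
    return sum(2.0 / d for d in gaps if d > 0)

def proof_weights(mu, n):
    mu = np.asarray(mu, dtype=float)
    gaps = mu.max() - mu
    w = np.full(len(mu), float(n))                # optimal arm: w = n
    sub = gaps > 0
    w[sub] = 2.0 * n / (n * gaps[sub] ** 2 - 3.0) # w_i = 2n / (n Delta_i^2 - 3)
    return w

mu = [1.0, 0.5, 0.2]
print(mab_complexity_upper_bound(mu))             # 2/0.5 + 2/0.8 = 6.5
print(proof_weights(mu, n=10**6))                 # approx [1e6, 8.0, 3.125]
```

For large n, w_{i}\Delta_{i} approaches 2/\Delta_{i}, matching Eq. (100).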

In the following, we focus on a linear bandit instance f where the action set is {\mathcal{A}}\subset\mathbb{R}^{d}. The mean reward of an action x\in{\mathcal{A}} is given by \mu_{x}=\left<x,\theta\right> for some \theta\in\mathbb{R}^{d}. Let x^{\star}=\operatorname*{argmax}_{x}\mu_{x} be the optimal action, and \Delta_{x}\triangleq\mu_{x^{\star}}-\mu_{x} the sub-optimality gap of action x. Define {\mathcal{A}}^{-}={\mathcal{A}}\setminus\{x^{\star}\}. A linear bandit instance is characterized by the vector \theta, and f[x]={\mathcal{N}}(\left<x,\theta\right>,1).

We assume that 𝒜{\mathcal{A}} is discrete, full rank, and x21\|x\|_{2}\leq 1 for all x𝒜.x\in{\mathcal{A}}. Then we have the following proposition.

Proposition B.3.

For a linear bandit instance f with a unique optimal decision and unit-variance Gaussian noise, our complexity measure {\mathcal{C}}(f) recovers that of Lattimore and Szepesvari [2017]. That is,

𝒞(f)infw+A\displaystyle{\mathcal{C}}(f)\leq\inf_{w\in\mathbb{R}^{A}_{+}}\; x𝒜wxΔx,\displaystyle\sum_{x\in{\mathcal{A}}^{-}}w_{x}\Delta_{x}, (101)
s.t. xH(w)12Δx22,x𝒜,\displaystyle\|x\|_{H(w)^{-1}}^{2}\leq\frac{\Delta_{x}^{2}}{2},\quad\forall x\in{\mathcal{A}}^{-}, (102)

where H(w)=xwxxx.H(w)=\sum_{x}w_{x}xx^{\top}.

Proof.

Let w^\hat{w} be the solution to the RHS of Eq. (101). In the following we construct a solution to our optimization problem 𝒞(f,n){\mathcal{C}}(f,n) from w^\hat{w}. Recall that

𝒞(f,n)minw+|𝒜|\displaystyle{\mathcal{C}}(f,n)\triangleq\min_{w\in\mathbb{R}^{|{\mathcal{A}}|}_{+}}\; x𝒜wπΔ(f,x)\displaystyle\sum_{x\in{\mathcal{A}}}w_{\pi}\Delta(f,x) (103)
s.t. x𝒜wxDKL(f[x]g[x])1,g,π(g)π(f),\displaystyle\sum_{x\in{\mathcal{A}}}w_{x}D_{\mathrm{KL}}(f[x]\|g[x])\geq 1,\;\forall g\in{\mathcal{F}},\pi^{\star}(g)\neq\pi^{\star}(f), (104)
wn.\displaystyle\|w\|_{\infty}\leq n. (105)

For any w+Aw\in\mathbb{R}^{A}_{+} and any instance gg\in{\mathcal{F}} associates with parameter θ\theta^{\prime}, we have

x𝒜wxDKL(f[x]g[x])=x𝒜wxDKL(𝒩(x,θ,1)𝒩(x,θ,1))=θθH(w)22.\displaystyle\sum_{x\in{\mathcal{A}}}w_{x}D_{\mathrm{KL}}(f[x]\|g[x])=\sum_{x\in{\mathcal{A}}}w_{x}D_{\mathrm{KL}}({\mathcal{N}}(\left<x,\theta\right>,1)\|{\mathcal{N}}(\left<x,\theta^{\prime}\right>,1))=\frac{\|\theta-\theta^{\prime}\|_{H(w)}^{2}}{2}. (106)

Consider any gg such that π(g)π(f)\pi^{\star}(g)\neq\pi^{\star}(f). Suppose π(g)=xx\pi^{\star}(g)=x\neq x^{\star}. It follows that xx,θ>0.\left<x-x^{\star},\theta^{\prime}\right>>0. Recall that Δx=xx,θ.\Delta_{x}=\left<x^{\star}-x,\theta\right>. By algebraic manipulation we have,

minθ:xx,θθ>Δx12θθH(w)2=12Δx2xxH(w)12\displaystyle\min_{\theta^{\prime}:\left<x^{\star}-x,\theta-\theta^{\prime}\right>>\Delta_{x}}\frac{1}{2}\|\theta-\theta^{\prime}\|_{H(w)}^{2}=\frac{1}{2}\frac{\Delta_{x}^{2}}{\|x^{\star}-x\|_{H(w)^{-1}}^{2}} (107)

Therefore to satisfy Eq. (104), it’s enough to construct a solution ww such that

12Δx2xxH(w)121,x𝒜.\displaystyle\frac{1}{2}\frac{\Delta_{x}^{2}}{\|x^{\star}-x\|_{H(w)^{-1}}^{2}}\geq 1,\forall x\in{\mathcal{A}}^{-}. (108)

Define A=x𝒜w^xxx+(x)(x).A=\sum_{x\in{\mathcal{A}}^{-}}\hat{w}_{x}xx^{\top}+(x^{\star})(x^{\star})^{\top}. When the action set 𝒜{\mathcal{A}} is full rank, AA is positive definite (see Lattimore and Szepesvari [2017, Appendix C]). We use σmax(A),σmin(A)\sigma_{\rm max}(A),\sigma_{\rm min}(A) to denote the maximum/minimum singular value of a matrix AA respectively. Then for any n>0n>0, consider the following solution

wx={w^x(18Δmin2σmax(A1)1+(n1)σmin(A1))1, when xx,n, when x=x.\displaystyle w_{x}=\begin{cases}\hat{w}_{x}\left(1-\frac{8}{\Delta_{\rm min}^{2}}\frac{\sigma_{\rm max}(A^{-1})}{1+(n-1)\sigma_{\rm min}(A^{-1})}\right)^{-1},&\text{ when }x\neq x^{\star},\\ n,&\text{ when }x=x^{\star}.\end{cases} (109)

For large enough n we get \|w\|_{\infty}\leq n. In the following, we prove that w satisfies Eq. (108). Let

cn=(18Δmin2σmax(A1)1+(n1)σmin(A1))1c_{n}=\left(1-\frac{8}{\Delta_{\rm min}^{2}}\frac{\sigma_{\rm max}(A^{-1})}{1+(n-1)\sigma_{\rm min}(A^{-1})}\right)^{-1}

for shorthand. Since Δx=0\Delta_{x^{\star}}=0, we have w^x=.\hat{w}_{x^{\star}}=\infty. Therefore xH(w^)1=0.\|x^{\star}\|_{H(\hat{w})^{-1}}=0. Then for any x𝒜x\in{\mathcal{A}}^{-}, by Eq. (102) we have

(xx)H(cnw^)1(xx)=cn1(xx)H(w^)1(xx)\displaystyle(x^{\star}-x)^{\top}H(c_{n}\hat{w})^{-1}(x^{\star}-x)=c_{n}^{-1}(x^{\star}-x)^{\top}H(\hat{w})^{-1}(x^{\star}-x) (110)
=\displaystyle=\; cn1xH(w^)1xΔx224σmax(A1)1+(n1)σmin(A1).\displaystyle c_{n}^{-1}x^{\top}H(\hat{w})^{-1}x\leq\frac{\Delta_{x}^{2}}{2}-\frac{4\sigma_{\rm max}(A^{-1})}{1+(n-1)\sigma_{\rm min}(A^{-1})}. (111)

Invoking Lemma G.7 we get

\displaystyle(x^{\star}-x)^{\top}H(w)^{-1}(x^{\star}-x)\leq(x^{\star}-x)^{\top}H(c_{n}\hat{w})^{-1}(x^{\star}-x)+\frac{4\sigma_{\rm max}(A^{-1})}{1+(n-1)\sigma_{\rm min}(A^{-1})}\leq\frac{\Delta_{x}^{2}}{2},

which implies Eq. (108). As a result, w is a valid solution to {\mathcal{C}}(f,n). Consequently,

𝒞(f,n)xΔxwxcnxw^xΔx.\displaystyle{\mathcal{C}}(f,n)\leq\sum_{x}\Delta_{x}w_{x}\leq c_{n}\sum_{x}\hat{w}_{x}\Delta_{x}. (112)

By definition we have limncn=1\lim_{n\to\infty}c_{n}=1. Consequently, 𝒞(f)=limn𝒞(f,n)xw^xΔx.{\mathcal{C}}(f)=\lim_{n\to\infty}{\mathcal{C}}(f,n)\leq\sum_{x}\hat{w}_{x}\Delta_{x}.
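The feasibility condition in Eq. (108) can also be checked numerically for a candidate allocation. The following is a minimal sketch under the assumptions of this subsection (finite, full-rank action set); the function name is illustrative.

```python
import numpy as np

# Minimal sketch: given actions (rows of a matrix), the parameter theta, and a
# candidate allocation w, form H(w) = sum_x w_x x x^T and check the sufficient
# condition ||x* - x||^2_{H(w)^{-1}} <= Delta_x^2 / 2 of Eq. (108); also report
# the objective sum_x w_x Delta_x.
def check_allocation(actions, theta, w):
    actions = np.asarray(actions, dtype=float)    # shape (A, d)
    theta = np.asarray(theta, dtype=float)
    w = np.asarray(w, dtype=float)
    means = actions @ theta
    star = int(np.argmax(means))
    gaps = means[star] - means
    H = (w[:, None, None] * actions[:, :, None] * actions[:, None, :]).sum(axis=0)
    H_inv = np.linalg.inv(H)                      # full-rank action set => invertible
    feasible = all(
        (actions[star] - actions[i]) @ H_inv @ (actions[star] - actions[i]) <= gaps[i] ** 2 / 2
        for i in range(len(actions)) if i != star
    )
    return feasible, float(w @ gaps)
```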

B.4 A Sufficient Condition for Condition 1

In this section, we discuss a sufficient and more interpretable condition for Condition 1.

Proposition B.4.

Suppose there exists a constant cMc_{M} such that for every f,gf,g\in{\mathcal{F}} and πΠ\pi\in\Pi,

𝔼of[π][(lnf[π](o)g[π](o))4]cM4.\mathbb{E}_{o\sim f[\pi]}\left[\left(\ln\frac{f[\pi](o)}{g[\pi](o)}\right)^{4}\right]\leq c_{M}^{4}.

Then Condition 1 holds with λ0(α,ϵ,f)=min{2ϵ/cM2,1/2}.\lambda_{0}(\alpha,\epsilon,f)=\min\{2\epsilon/c_{M}^{2},1/2\}.

Proof.

For any fixed f,gf,g\in{\mathcal{F}}, πΠ\pi\in\Pi and λ<1/2\lambda<1/2, by Lemma G.8 we get

DKL(f[π]g[π])D1λ(f[π]g[π])λ2𝔼of[π][(lnf[π](o)g[π](o))4]1/2.\displaystyle D_{\mathrm{KL}}(f[\pi]\|g[\pi])-D_{1-\lambda}(f[\pi]\|g[\pi])\leq\frac{\lambda}{2}\mathbb{E}_{o\sim f[\pi]}\left[\left(\ln\frac{f[\pi](o)}{g[\pi](o)}\right)^{4}\right]^{1/2}. (113)

Therefore we have

DKL(f[π]g[π])D1λ(f[π]g[π])λ2cM2.\displaystyle D_{\mathrm{KL}}(f[\pi]\|g[\pi])-D_{1-\lambda}(f[\pi]\|g[\pi])\leq\frac{\lambda}{2}c_{M}^{2}. (114)

Since D1λ(f[π]g[π])D_{1-\lambda}(f[\pi]\|g[\pi]) is monotonically decreasing with λ\lambda, we prove the desired result. ∎
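For instance, for the unit-variance Gaussian models of Appendix B.3, if we additionally assume that all mean rewards lie in [-B,B] for some constant B (an assumption made here purely for illustration), then for f[\pi]={\mathcal{N}}(\mu,1) and g[\pi]={\mathcal{N}}(\mu^{\prime},1),

\displaystyle\ln\frac{f[\pi](o)}{g[\pi](o)}=(\mu-\mu^{\prime})(o-\mu)+\frac{(\mu-\mu^{\prime})^{2}}{2},

and using (a+b)^{4}\leq 8(a^{4}+b^{4}) together with \mathbb{E}_{o\sim{\mathcal{N}}(\mu,1)}[(o-\mu)^{4}]=3,

\displaystyle\mathbb{E}_{o\sim f[\pi]}\left[\left(\ln\frac{f[\pi](o)}{g[\pi](o)}\right)^{4}\right]\leq 24(\mu-\mu^{\prime})^{4}+\frac{(\mu-\mu^{\prime})^{8}}{2}\leq 24(2B)^{4}+\frac{(2B)^{8}}{2},

so Proposition B.4 applies with c_{M}^{4}=24(2B)^{4}+(2B)^{8}/2.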

B.5 Comparison with Existing Lower Bounds

Recall that when applied to tabular RL problems, our lower bound is very similar to that of Tirinzoni et al. [2021]; the only difference is that their optimization problems omit the second constraint (Eq. (4)). On a high level, the constraint \|w\|_{\infty}\leq n is necessary to prevent degenerate cases and to guarantee mathematical rigor.

The value {\mathcal{C}}(f,n) can be different from the solution without this constraint for some artificial hypothesis classes {\mathcal{F}}. For example, we can construct a multi-armed bandit hypothesis class {\mathcal{F}}=\{\mu\in\mathbb{R}^{A}:\mu(1)\neq 0.5\}\cup\{[0.5,0.1,\cdots,0.1]\} (where \mu\in\mathbb{R}^{A} represents the mean rewards of the arms). Then for f=[0.5,0.1,\cdots,0.1], {\mathcal{C}}(f,n)>0 for every n>0 (because there exist other instances in {\mathcal{F}} whose mean reward for action 1 is arbitrarily close to 0.5). As a result, {\mathcal{C}}(f)=\lim_{n\to\infty}{\mathcal{C}}(f,n)>0 by the definition of limits, and in this case \lim_{n\to\infty}{\mathcal{C}}(f,n)\neq{\mathcal{C}}(f,\infty). However, without the constraint \|w\|_{\infty}\leq n, the solution would be 0, achieved by letting w_{1}=\infty and w_{i}=0 for all i\neq 1.

For other hypothesis classes (such as the standard MAB and linear bandit classes discussed in Propositions B.2 and B.3, and tabular RL), however, this constraint does not change the value of {\mathcal{C}}(f,n).
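For a finite hypothesis class with a finite observation space, the program defining {\mathcal{C}}(f,n) (cf. Eqs. (103)-(105)) is a linear program in w and can be solved directly. The following is a minimal sketch with illustrative names; it is not part of Alg. 1.

```python
import numpy as np
from scipy.optimize import linprog

# Minimal sketch: solve  min_w sum_pi w_pi * Delta(f, pi)
#                 s.t.   sum_pi w_pi * KL(f[pi] || g[pi]) >= 1 for all g in Lambda(f),
#                        0 <= w_pi <= n,
# for a finite hypothesis class over a finite observation space.
def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

def complexity(f_idx, hypotheses, rewards, n):
    """hypotheses[k]: array (Pi, O) of observation probabilities;
    rewards[k]: array (Pi,) of mean rewards of instance k."""
    f = hypotheses[f_idx]
    gaps = rewards[f_idx].max() - rewards[f_idx]          # Delta(f, pi)
    opt = int(np.argmax(rewards[f_idx]))
    alts = [g for k, g in enumerate(hypotheses)
            if int(np.argmax(rewards[k])) != opt]         # Lambda(f)
    A_ub = -np.array([[kl(f[pi], g[pi]) for pi in range(f.shape[0])] for g in alts])
    b_ub = -np.ones(len(alts))
    res = linprog(c=gaps, A_ub=A_ub, b_ub=b_ub, bounds=[(0, n)] * f.shape[0])
    return res.fun, res.x                                 # C(f, n) and the minimizer w
```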

Appendix C Extension to Infinite Hypothesis Class

In this section, we extend our results to the infinite hypothesis case.

C.1 Proof of Theorem 5.2

To prove Theorem 5.2, we require the following lemmas for steps 1 and 2 by analogy with the finite hypothesis case.

Lemma C.1 (Main lemma for Initialization).

Let {\mathcal{E}}_{\rm init} be the event that there exists a universal constant c_{4}>0 such that

  1. (a)

    maxπDKL(f[π]f^[π])(lnlnnlnn)c4\max_{\pi}D_{\mathrm{KL}}(f^{\star}[\pi]\|{\hat{f}}[\pi])\leq\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}},

  2. (b)

    |Rf^(π)Rf(π)|Rmax(lnlnnlnn)c4{\left|{R_{{\hat{f}}}(\pi)-R_{{f^{\star}}}(\pi)}\right|}\leq R_{\rm max}\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}}, for all πΠ\pi\in\Pi,

  3. (c)

    π(f^)=π(f).\pi^{\star}({\hat{f}})=\pi^{\star}({f^{\star}}).

Under Conditions 1 and 2, there exists n0>0n_{0}>0 such that when n>n0n>n_{0}, Pr(init)11/lnn.\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm init})\geq 1-1/\ln n. In addition, the regret of Step 1 is upper bounded by 𝒪(lnnlnlnn).{\mathcal{O}}(\frac{\ln n}{\ln\ln n}).

Proof of Lemma C.1 is deferred to Appendix C.2.

Lemma C.2 (Main lemma for Identification).

Under Conditions 1, 2, and 3, there exists n0>0n_{0}>0 such that when n>n0n>n_{0}, the following holds.

  1. (a)

    Conditioned on the event π(f^)π(f)\pi^{\star}({\hat{f}})\neq\pi^{\star}(f^{\star}), Pr(acc)1/n\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm acc})\leq 1/n.

  2. (b)

    Conditioned on the event init{\mathcal{E}}_{\rm init}, Pr(acc)11/lnn\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm acc})\geq 1-1/\ln n.

  3. (c)

    The expected regret of Step 2 is always upper bounded by 𝒪(lnnlnlnn).{\mathcal{O}}(\ln n\ln\ln n).

  4. (d)

    Conditioned on the event init{\mathcal{E}}_{\rm init}, the expected regret of Step 2 is upper bounded by

    (𝒞(f,(lnlnn)1/4/2)+o(1))lnn.\left({\mathcal{C}}(f^{\star},(\ln\ln n)^{1/4}/2)+o(1)\right)\ln n.

Proof of Lemma C.2 is deferred to Appendix C.3.

Then, the proof of Theorem 5.2 is the same as that of Theorem 3.3 after plugging in Lemma C.1 and Lemma C.2.

C.2 Proof of Lemma C.1

Proof of Lemma C.1.

We prove the items of this lemma and the regret bound separately.

Proof of item (a):

Let c1>0c_{1}>0 be the constant from Condition 1. Set c6=12c1+3c_{6}=\frac{1}{2c_{1}+3}, α=(lnlnnlnn)c6\alpha=\left(\frac{\ln\ln n}{\ln n}\right)^{c_{6}}, and ϵ=α5\epsilon=\frac{\alpha}{5}. Let w=(π1,,πminit)w=\left(\pi_{1},\cdots,\pi_{{m_{\rm init}}}\right) be the list of decisions run by Step 1, and o1,,ominito_{1},\cdots,o_{m_{\rm init}} the corresponding observations. Recall that by definition,

f^=argmaxgi=1minitlng[πi](oi).\displaystyle{\hat{f}}=\operatorname*{argmax}_{g\in{\mathcal{F}}}\sum_{i=1}^{{m_{\rm init}}}\ln g[\pi_{i}](o_{i}). (115)

Combining with the fact that f{f^{\star}}\in{\mathcal{F}}, we have

\displaystyle\sum_{i=1}^{{m_{\rm init}}}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{{\hat{f}}[\pi_{i}](o_{i})}\leq\sum_{i=1}^{{m_{\rm init}}}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}=0. (116)

Let 𝒢(α)={g:π:DKL(f[π]g[π])α}.{\mathcal{G}}(\alpha)=\{g\in{\mathcal{F}}:\exists\pi:D_{\mathrm{KL}}({f^{\star}}[\pi]\|g[\pi])\geq\alpha\}. We will prove that for all g𝒢(α)g\in{\mathcal{G}}(\alpha), we have i=1minitlnf[πi](oi)g[πi](oi)>0\sum_{i=1}^{{m_{\rm init}}}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}>0. Combining with Eq. (116) we get DKL(f[π]f^[π])α,π.D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\leq\alpha,\forall\pi.

To this end, we apply Lemma F.1 with parameters (α/|Π|,ϵ/|Π|,w)(\alpha/|\Pi|,\epsilon/|\Pi|,w). Following Lemma F.1, define

γ=1minitminπi=1minit𝕀[πi=π].\gamma=\frac{1}{{m_{\rm init}}}\min_{\pi}\sum_{i=1}^{{m_{\rm init}}}\mathbb{I}\left[\pi_{i}=\pi\right].

Then we have γ=1/|Π|.\gamma=1/|\Pi|. Let λ=λ0(α,ϵ/|Π|,f)\lambda=\lambda_{0}(\alpha,\epsilon/|\Pi|,f) be the value that satisfies Condition 1, and

ϵ0=exp(α)(ϵλ/|Π|)1/λ3Vol.\epsilon_{0}=\frac{\exp(-\alpha)(\epsilon\lambda/|\Pi|)^{1/\lambda}}{3{\rm Vol}}.

Recall that the condition of Lemma F.1 states

minit1λϵ(ln𝒩(,ϵ0)+ln(1/δ))).\displaystyle{m_{\rm init}}\geq\frac{1}{\lambda\epsilon}\left(\ln{\mathcal{N}}({\mathcal{F}},\epsilon_{0})+\ln(1/\delta))\right). (117)

The failure probability in this case is δ=1/lnn\delta=1/\ln n. By Condition 1, for large enough nn we get 1/λϵc1.1/\lambda\lesssim\epsilon^{-c_{1}}. By Condition 2 we get

ln𝒩(,ϵ0)ln(1/ϵ0)𝒪(1)+1λln(1/(ϵλ))+α.\ln{\mathcal{N}}({\mathcal{F}},\epsilon_{0})\lesssim\ln(1/\epsilon_{0})\lesssim{\mathcal{O}}(1)+\frac{1}{\lambda}\ln(1/(\epsilon\lambda))+\alpha.

As a result, when nn is large enough

1λϵ(ln𝒩(,ϵ0)+ln(1/δ)))1ϵc1+1(𝒪(1)+lnlnn+1ϵc1ln1ϵc1)\displaystyle\frac{1}{\lambda\epsilon}\left(\ln{\mathcal{N}}({\mathcal{F}},\epsilon_{0})+\ln(1/\delta))\right)\lesssim\frac{1}{\epsilon^{c_{1}+1}}\left({\mathcal{O}}(1)+\ln\ln n+\frac{1}{\epsilon^{c_{1}}}\ln\frac{1}{\epsilon^{c_{1}}}\right) (118)
\displaystyle\lesssim\; ϵ(2c1+2)(lnnlnlnn)112c1+3,\displaystyle\epsilon^{-(2c_{1}+2)}\lesssim\left(\frac{\ln n}{\ln\ln n}\right)^{1-\frac{1}{2c_{1}+3}}, (119)

where the last inequality comes from the definition of ϵ\epsilon, i.e., ϵ=𝒪((lnlnnlnn)12c1+3)\epsilon={\mathcal{O}}\left(\left(\frac{\ln\ln n}{\ln n}\right)^{\frac{1}{2c_{1}+3}}\right). Recall that minit|Π|lnnlnlnn{m_{\rm init}}\geq|\Pi|\frac{\ln n}{\ln\ln n}. When nn is large enough, the condition of Lemma F.1 (i.e., Eq. (117)) is satisfied.

Because every policy appears in w exactly the same number of times, we have D_{\mathrm{KL}}^{w}({f^{\star}}\|g)\geq\alpha/|\Pi| for all g\in{\mathcal{G}}(\alpha). Therefore, by Lemma F.1 with parameters (\alpha/|\Pi|,\epsilon/|\Pi|,w),

g𝒢(α),i=1minitlnf[πi](oi)g[πi](oi)(α|Π|4ϵ|Π|)m>0.\displaystyle\forall g\in{\mathcal{G}}(\alpha),\sum_{i=1}^{{m_{\rm init}}}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\left(\frac{\alpha}{|\Pi|}-4\frac{\epsilon}{|\Pi|}\right)m>0. (120)

Combining with Eq. (116), we get f^𝒢(α){\hat{f}}\not\in{\mathcal{G}}(\alpha). As a result

πΠ,DKL(f[π]f^[π])α=(lnlnnlnn)c6.\displaystyle\forall\pi\in\Pi,\quad D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\leq\alpha=\left(\frac{\ln\ln n}{\ln n}\right)^{c_{6}}. (121)

Proof of item (b):

Now we focus on item (b). For any fixed πΠ\pi\in\Pi, by Pinsker’s inequality and Eq. (121) we get

DTV(f[π]f^[π])12DKL(f[π]f^[π])(lnlnnlnn)c6/2.\displaystyle D_{\rm TV}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\leq\sqrt{\frac{1}{2}D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])}\leq\left(\frac{\ln\ln n}{\ln n}\right)^{c_{6}/2}. (122)

By assumption we have 0R(o)Rmax0\leq R(o)\leq R_{\rm max} almost surely for both f[π]{f^{\star}}[\pi] and f^[π]{\hat{f}}[\pi]. It follows that

|Rf^(π)Rf(π)|RmaxDTV(f[π]f^[π]).\displaystyle{\left|{R_{{\hat{f}}}(\pi)-R_{{f^{\star}}}(\pi)}\right|}\leq R_{\rm max}D_{\rm TV}({f^{\star}}[\pi]\|{\hat{f}}[\pi]). (123)

Then we prove item (b) with c4=c6/2c_{4}=c_{6}/2 and ι(f)=Rmax.\iota({f^{\star}})=R_{\rm max}.

Proof of item (c):

Since Δmin(f)>0\Delta_{\rm min}({f^{\star}})>0, (c) follows from (b) directly when nn is large enough.

Proof of regret:

The number of samples collected in Step 1 is upper bounded by minit=|Π|lnnlnlnn.{m_{\rm init}}=|\Pi|\lceil\frac{\ln n}{\ln\ln n}\rceil. As a result, the regret is upper bounded by

𝒪(Δmaxminit)=𝒪(lnnlnlnn).\displaystyle{\mathcal{O}}(\Delta_{\rm max}{m_{\rm init}})={\mathcal{O}}\left(\frac{\ln n}{\ln\ln n}\right). (124)

C.3 Proof of Lemma C.2

Proof of Lemma C.2.

We prove the four items in Lemma C.2 separately.

Item (a):

First we prove item (a) of Lemma C.2. By Markov's inequality, we have

Prf(i=1mlnf^[πi](oi)f[πi](oi)lnn)=Prf(exp(i=1mlnf^[πi](oi)f[πi](oi))exp(lnn))\displaystyle\mathop{\rm Pr}\nolimits_{f^{\star}}\left(\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}\geq\ln n\right)=\mathop{\rm Pr}\nolimits_{f^{\star}}\left(\exp\left(\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}\right)\geq\exp(\ln n)\right) (125)
\displaystyle\leq\; exp(lnn)𝔼f[exp(i=1mlnf^[πi](oi)f[πi](oi))]=exp(lnn)i=1m𝔼f[f^[πi](oi)f[πi](oi)]\displaystyle\exp(-\ln n)\mathbb{E}_{f^{\star}}\left[\exp\left(\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}\right)\right]=\exp(-\ln n)\prod_{i=1}^{m}\mathbb{E}_{f^{\star}}\left[\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}\right] (126)
=\displaystyle=\; 1/n.\displaystyle 1/n. (127)

The last equality follows from the fact that f^[π]{\hat{f}}[\pi] and f[π]f^{\star}[\pi] are both probability distributions given any decision πΠ.\pi\in\Pi.

Recall that in this case π(f^)π(f)\pi^{\star}({\hat{f}})\neq\pi^{\star}(f^{\star}). Therefore,

Pr(accf^)=Pr(g and π(g)π(f^),i=1mlnf^[πi](oi)g[πi](oi)lnn)\displaystyle\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm acc}^{\hat{f}})=\mathop{\rm Pr}\nolimits\left(\forall g\in{\mathcal{F}}\text{ and }\pi^{\star}(g)\neq\pi^{\star}({\hat{f}}),\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\ln n\right) (128)
\displaystyle\leq\; Pr(i=1mlnf^[πi](oi)f[πi](oi)lnn)1/n.\displaystyle\mathop{\rm Pr}\nolimits\left(\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}\geq\ln n\right)\leq 1/n. (129)

Item (b):

Let ϵ=1/lnlnn\epsilon=1/\ln\ln n and α=lnnm\alpha=\frac{\ln n}{m}. We prove this statement by invoking Lemma F.1 with parameters (α+5ϵ,ϵ,w)(\alpha+5\epsilon,\epsilon,w). Following Lemma F.1, let γ=1mminπi=1m𝕀[πi=π]\gamma=\frac{1}{m}\min_{\pi}\sum_{i=1}^{m}\mathbb{I}\left[\pi_{i}=\pi\right] and λ=λ0((α+5ϵ)/γ,ϵ,f)\lambda=\lambda_{0}((\alpha+5\epsilon)/\gamma,\epsilon,{f^{\star}}) be the value that satisfies Condition 1. Let

ϵ0=exp((α+5ϵ)/γ)(ϵλ)1/λ3Vol.\epsilon_{0}=\frac{\exp(-(\alpha+5\epsilon)/\gamma)(\epsilon\lambda)^{1/\lambda}}{3{\rm Vol}}.

First of all, we prove that the condition for Lemma F.1 holds. That is,

m1λϵ(ln𝒩(,ϵ0)+lnlnn).\displaystyle m\geq\frac{1}{\lambda\epsilon}\left(\ln{\mathcal{N}}\left({\mathcal{F}},\epsilon_{0}\right)+\ln\ln n\right). (130)

Recall that in Alg. 1, w^\hat{w} is the solution of 𝒞(f^,(lnlnn)1/4){\mathcal{C}}({\hat{f}},(\ln\ln n)^{1/4}), w¯π=(1+1(lnlnn)1/4)w^π+1(lnlnn)1/4\bar{w}_{\pi}=\left(1+\frac{1}{(\ln\ln n)^{1/4}}\right)\hat{w}_{\pi}+\frac{1}{(\ln\ln n)^{1/4}}, and m=xw¯πln(n)m=\sum_{x}\lceil\bar{w}_{\pi}\ln(n)\rceil. As a result, m|Π|lnn(lnlnn)1/4.m\geq\frac{|\Pi|\ln n}{(\ln\ln n)^{1/4}}. Now consider the RHS of Eq. (130). By the definition of w¯π\bar{w}_{\pi} we get m2|Π|lnn(lnlnn)1/4m\leq 2|\Pi|\ln n(\ln\ln n)^{1/4}, so α1|Π|(lnlnn)1/4\alpha\leq\frac{1}{|\Pi|}(\ln\ln n)^{1/4} and γ12|Π|(lnlnn)1/2\gamma^{-1}\leq 2|\Pi|(\ln\ln n)^{1/2}. It follows from Condition 1 that λpoly(1/lnlnn).\lambda\geq\mathrm{poly}(1/\ln\ln n). By the definition of ϵ0\epsilon_{0} and Condition 2 we get

ln𝒩(,ϵ0)ln(1/ϵ0)poly(lnlnn).\displaystyle\ln{\mathcal{N}}\left({\mathcal{F}},\epsilon_{0}\right)\lesssim\ln(1/\epsilon_{0})\lesssim\mathrm{poly}(\ln\ln n). (131)

Consequently, when nn is sufficiently large, Eq. (130) holds. By Lemma F.1 we get, with probability at least 11/lnn1-1/\ln n,

i=1mlnf[πi](oi)g[πi](oi)(α+ϵ)m,g(w,f,α+5ϵ),\displaystyle\sum_{i=1}^{m}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq(\alpha+\epsilon)m,\quad\forall g\in{\mathcal{F}}(w,f^{\star},\alpha+5\epsilon), (132)

where (w,f,α+5ϵ)={g:DKLw(fg)α+5ϵ}{\mathcal{F}}(w,f^{\star},\alpha+5\epsilon)=\{g\in{\mathcal{F}}:D_{\mathrm{KL}}^{w}({f^{\star}}\|g)\geq\alpha+5\epsilon\}. In the following, we prove that Eq. (132) implies accf^.{\mathcal{E}}_{\rm acc}^{{\hat{f}}}. Recall that accf^{\mathcal{E}}_{\rm acc}^{{\hat{f}}} is the event defined as follows:

accf^\displaystyle{\mathcal{E}}_{\rm acc}^{\hat{f}} =𝕀[gΛ(f^),i=1mlnf^[πi](oi)g[πi](oi)lnn].\displaystyle=\mathbb{I}\left[\forall g\in\Lambda({\hat{f}}),\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\ln n\right]. (133)

Recall that \Lambda({\hat{f}})=\{g\in{\mathcal{F}}:\pi^{\star}(g)\neq\pi^{\star}({\hat{f}})\}. Next, we apply Lemma G.4 to show that \Lambda({\hat{f}})\subseteq{\mathcal{F}}(w,f^{\star},\alpha+5\epsilon). To verify the condition of Lemma G.4, note that item (a) of Lemma C.1 and Pinsker's inequality give D_{\rm TV}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\lesssim\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}/2} for all \pi\in\Pi. By the definition of \hat{w} we get

gΛ(f^),πΠw^πDKL(f^[π]g[π])1.\displaystyle\forall g\in\Lambda({\hat{f}}),\quad\sum_{\pi\in\Pi}\hat{w}_{\pi}D_{\mathrm{KL}}({\hat{f}}[\pi]\|g[\pi])\geq 1. (134)

Therefore when nn is large enough,

gΛ(f^),DKLw(fg)lnnm+5ϵ=α+5ϵ.\displaystyle\forall g\in\Lambda({\hat{f}}),D_{\mathrm{KL}}^{w}(f^{\star}\|g)\geq\frac{\ln n}{m}+5\epsilon=\alpha+5\epsilon. (135)

Then Λ(f^)(w,f,α+5ϵ)\Lambda({\hat{f}})\subseteq{\mathcal{F}}(w,{f^{\star}},\alpha+5\epsilon). It follows from Eq. (132) that

i=1mlnf[πi](oi)g[πi](oi)(α+ϵ)m,gΛ(f^).\displaystyle\sum_{i=1}^{m}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq(\alpha+\epsilon)m,\quad\forall g\in\Lambda({\hat{f}}). (136)

Finally, by Condition 3 we get f[π](o)>cmin{f^{\star}}[\pi](o)>c_{\rm min} for any πΠ and osupp(f[π])\pi\in\Pi\text{ and }o\in\operatorname{supp}({f^{\star}}[\pi]). As a result

|lnf^[π](o)f[π](o)||ln(1+f^[π](o)f[π](o)f[π](o))||ln(1+f^[π](o)f[π](o)cmin)|.\displaystyle{\left|{\ln\frac{{\hat{f}}[\pi](o)}{{f^{\star}}[\pi](o)}}\right|}\leq{\left|{\ln\left(1+\frac{{\hat{f}}[\pi](o)-{f^{\star}}[\pi](o)}{{f^{\star}}[\pi](o)}\right)}\right|}\leq{\left|{\ln\left(1+\frac{{\hat{f}}[\pi](o)-{f^{\star}}[\pi](o)}{c_{\rm min}}\right)}\right|}. (137)

When \|{\hat{f}}-f^{\star}\|_{\infty}\leq c_{\rm min}/2, applying the basic inequality {\left|{\ln(1+x)}\right|}\leq 2|x| for all |x|\leq 1/2, we get

|ln(1+f^[π](o)f[π](o)cmin)|2cminf^f2cmin(lnlnnlnn)c4c5,\displaystyle{\left|{\ln\left(1+\frac{{\hat{f}}[\pi](o)-{f^{\star}}[\pi](o)}{c_{\rm min}}\right)}\right|}\leq\frac{2}{c_{\rm min}}\|{\hat{f}}-f^{\star}\|_{\infty}\lesssim\frac{2}{c_{\rm min}}\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}c_{5}}, (138)

where the last inequality comes from item (a) of Lemma C.1 and Condition 3. Therefore, for large enough nn we get |lnf^[π](o)f[π](o)|ϵ.{\left|{\ln\frac{{\hat{f}}[\pi](o)}{{f^{\star}}[\pi](o)}}\right|}\leq\epsilon. As a result,

i=1mlnf^[πi](oi)g[πi](oi)=i=1mlnf[πi](oi)g[πi](oi)+i=1mlnf^[πi](oi)f[πi](oi)αm=lnn.\displaystyle\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}=\sum_{i=1}^{m}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}+\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}\geq\alpha m=\ln n. (139)

Since g\in\Lambda({\hat{f}}) is arbitrary, we have

Pr(accf^)=Pr(gΛ(f^),i=1mlnf^[πi](oi)g[πi](oi)lnn)\displaystyle\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm acc}^{{\hat{f}}})=\mathop{\rm Pr}\nolimits\left(\forall g\in\Lambda({\hat{f}}),\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\ln n\right) (140)
\displaystyle\geq Pr(g(w,f,α+5ϵ),i=1mlnf[πi](oi)g[πi](oi)(α+ϵ)m)11/lnn.\displaystyle\;\mathop{\rm Pr}\nolimits\left(\forall g\in{\mathcal{F}}(w,f^{\star},\alpha+5\epsilon),\sum_{i=1}^{m}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq(\alpha+\epsilon)m\right)\geq 1-1/\ln n. (141)

Item (c):

By the definition of {\mathcal{C}}({\hat{f}},(\ln\ln n)^{1/4}), we have \hat{w}_{\pi}\leq(\ln\ln n)^{1/4} for every \pi\in\Pi. As a result, m\leq 2|\Pi|\ln n(\ln\ln n)^{1/4}. Therefore, the expected regret of Step 2 is upper bounded by

\displaystyle\Delta_{\rm max}\cdot 2|\Pi|\ln n(\ln\ln n)^{1/4}=O(\ln n\ln\ln n). (142)

Item (d):

Recall that w^\hat{w} is the solution of 𝒞(f^,(lnlnn)1/4).{\mathcal{C}}({\hat{f}},(\ln\ln n)^{1/4}). As a result, the regret of Step 2 is upper bounded by

(πΠw^πΔ(f,π)+o(1))lnn\displaystyle\left(\sum_{\pi\in\Pi}\hat{w}_{\pi}\Delta({f^{\star}},\pi)+o(1)\right)\ln n (143)

where Δ(f,π)\Delta({f^{\star}},\pi) is the sub-optimality gap of decision π\pi under instance f{f^{\star}}. In the following, we prove that

\displaystyle\sum_{\pi\in\Pi}\hat{w}_{\pi}\Delta({f^{\star}},\pi)\leq{\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4}/2)+o(1). (144)

Let $\hat{w}^{\star}$ be the solution to $\mathcal{C}(f^{\star},(\ln\ln n)^{1/4}/2)$. Define $\delta=\frac{1}{(\ln\ln n)^{1/4}}$. Let $\bar{w}^{\star}\triangleq\{(1+\delta)\hat{w}^{\star}_{\pi}+\delta\}_{\pi}$ and $m^{\star}=\sum_{\pi\in\Pi}\lceil\bar{w}_{\pi}^{\star}\ln n\rceil$. We will show that $\bar{w}^{\star}$ is (approximately) a feasible solution to the program defining $\mathcal{C}(\hat{f},(\ln\ln n)^{1/4})$.

By the definition of 𝒞(f,(lnlnn)1/4/2){\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4}/2), for any gΛ(f),g\in\Lambda({f^{\star}}), we have

πΠ(w^π)DKL(f[π]g[π])1.\displaystyle\sum_{\pi\in\Pi}(\hat{w}^{\star}_{\pi})D_{\mathrm{KL}}({f^{\star}}[\pi]\|g[\pi])\geq 1. (145)

Let $w^{\star}$ be the list of decisions in which each $\pi\in\Pi$ appears $\lceil\bar{w}^{\star}_{\pi}\ln n\rceil$ times. Define $\alpha^{\star}=\frac{\ln n}{m^{\star}}$ and $\epsilon=\frac{1}{\ln\ln n}$. Next we apply Lemma G.4 with parameters $(\epsilon,w^{\star})$. To verify its condition, item (a) of Lemma C.1 gives

DTV(f[π]f^[π])DKL(f[π]f^[π])1/2(lnlnnlnn)c4/2,πΠ,\displaystyle D_{\rm TV}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\leq D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])^{1/2}\lesssim\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}/2},\quad\forall\pi\in\Pi, (146)

which satisfies the condition of Lemma G.4. Consequently, we get $D_{\mathrm{KL}}^{w^{\star}}(\hat{f}\|g)\geq\frac{\ln n}{m^{\star}}$ for every $g\in\Lambda(f^{\star})$. Therefore,

gΛ(f),πΠw¯πlnnlnnDKL(f^[π]g[π])1.\displaystyle\forall g\in\Lambda({f^{\star}}),\quad\sum_{\pi\in\Pi}\frac{\lceil\bar{w}^{\star}_{\pi}\ln n\rceil}{\ln n}D_{\mathrm{KL}}({\hat{f}}[\pi]\|g[\pi])\geq 1. (147)

By item (c) of Lemma C.1, Λ(f^)=Λ(f)\Lambda({\hat{f}})=\Lambda({f^{\star}}). When nn is large enough, we have w¯πlnnlnn(lnlnn)1/4.\frac{\lceil\bar{w}^{\star}_{\pi}\ln n\rceil}{\ln n}\leq(\ln\ln n)^{1/4}. Therefore, {w¯πlnnlnn}πΠ\left\{\frac{\lceil\bar{w}^{\star}_{\pi}\ln n\rceil}{\ln n}\right\}_{\pi\in\Pi} satisfies all the constraints of 𝒞(f^,(lnlnn)1/4){\mathcal{C}}({\hat{f}},(\ln\ln n)^{1/4}). Recall that w^\hat{w} is the solution to 𝒞(f^,(lnlnn)1/4){\mathcal{C}}({\hat{f}},(\ln\ln n)^{1/4}). By the optimality of w^\hat{w} we have

πw^πΔ(f^,π)πw¯πlnnlnnΔ(f^,π)πw¯πΔ(f^,π)+o(1).\displaystyle\sum_{\pi}\hat{w}_{\pi}\Delta({\hat{f}},\pi)\leq\sum_{\pi}\frac{\lceil\bar{w}^{\star}_{\pi}\ln n\rceil}{\ln n}\Delta({\hat{f}},\pi)\leq\sum_{\pi}\bar{w}^{\star}_{\pi}\Delta({\hat{f}},\pi)+o(1). (148)

By item (b) of Lemma C.1, |Δ(f^,π)Δ(f,π)|ι(f)(lnlnnlnn)c4.{\left|{\Delta({\hat{f}},\pi)-\Delta({f^{\star}},\pi)}\right|}\leq\iota({f^{\star}})\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}}. As a result,

πw¯πΔ(f^,π)πw¯πΔ(f,π)+o(1)\displaystyle\sum_{\pi}\bar{w}^{\star}_{\pi}\Delta({\hat{f}},\pi)\leq\sum_{\pi}\bar{w}^{\star}_{\pi}\Delta({f^{\star}},\pi)+o(1) (149)
\displaystyle\leq\; πw^πΔ(f,π)+o(1)=𝒞(f,(lnlnn)1/4/2)+o(1).\displaystyle\sum_{\pi}\hat{w}^{\star}_{\pi}\Delta({f^{\star}},\pi)+o(1)={\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4}/2)+o(1). (150)

In addition,

πΠw^πΔ(f,π)πw^πΔ(f^,π)+o(1).\displaystyle\sum_{\pi\in\Pi}\hat{w}_{\pi}\Delta({f^{\star}},\pi)\leq\sum_{\pi}\hat{w}_{\pi}\Delta({\hat{f}},\pi)+o(1). (151)

Combining the inequalities above, we have

πw^πΔ(f,π)𝒞(f,(lnlnn)1/4/2)+o(1).\displaystyle\sum_{\pi}\hat{w}_{\pi}\Delta({f^{\star}},\pi)\leq{\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4}/2)+o(1). (152)

As a result, the regret in Step 2 is bounded by

(𝒞(f,(lnlnn)1/4/2)+o(1))lnn.\left({\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4}/2)+o(1)\right)\ln n.
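For intuition, the quantity ${\mathcal{C}}(f,M)$, as it enters Eqs. (143)-(152), is the value of a program that minimizes $\sum_{\pi}w_{\pi}\Delta(f,\pi)$ subject to $\sum_{\pi}w_{\pi}D_{\mathrm{KL}}(f[\pi]\|g[\pi])\geq 1$ for every $g\in\Lambda(f)$ and $0\leq w_{\pi}\leq M$. Once $\Lambda(f)$ is replaced by a finite set of alternatives, this is a linear program; below is a hedged numerical sketch (the function name, the use of scipy, and all numbers are illustrative, not part of the paper).

```python
import numpy as np
from scipy.optimize import linprog

def complexity_C(gaps, kl_matrix, M):
    """Sketch of the program behind C(f, M) as it is used in Eqs. (143)-(152):
    minimize sum_pi w_pi * Delta(f, pi) subject to, for every alternative g in
    Lambda(f), sum_pi w_pi * D_KL(f[pi] || g[pi]) >= 1, and 0 <= w_pi <= M.

    `gaps[pi]` and `kl_matrix[g, pi]` are hypothetical finite arrays standing in
    for Delta(f, pi) and D_KL(f[pi] || g[pi])."""
    n_alt, n_dec = kl_matrix.shape
    res = linprog(c=gaps,
                  A_ub=-kl_matrix,            # encodes sum_pi w_pi * KL >= 1
                  b_ub=-np.ones(n_alt),
                  bounds=[(0.0, M)] * n_dec,
                  method="highs")
    return res.fun, res.x

# Toy example with two decisions and two alternatives (numbers are illustrative).
gaps = np.array([0.2, 0.5])                  # Delta(f, pi)
kls = np.array([[1.0, 0.2], [0.1, 2.0]])     # rows: g in Lambda(f); cols: decisions
print(complexity_C(gaps, kls, M=10.0))
```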

Appendix D Proof of Conditions for Tabular Reinforcement Learning

In this section, we first prove that Conditions 1-3 hold for tabular RL with truncated Gaussian rewards. Consequently, Theorem 5.2 gives the first asymptotic instance-optimal regret bound in this setting. Later, in Appendix E, we extend our analysis to Gaussian rewards.

First of all, we define some additional notation. A finite-horizon Markov Decision Process (MDP) is denoted by a tuple $(\mathcal{S},\mathcal{A},p,r,H)$. $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces respectively. $H>0$ denotes the horizon. The transition function $p$ maps a state-action pair $(s,a)\in\mathcal{S}\times\mathcal{A}$ to a distribution over $\mathcal{S}$, and the reward function $r$ maps a state-action pair $(s,a)$ to a truncated Gaussian distribution. In particular, for a fixed $(s,a)$, the mean reward is $\mu(s,a)\in[-1,1]$, and the density of the reward $r(s,a)$ is

r[s,a](x)=𝕀[x[2,2]]1Zexp((xμ(s,a))22),\displaystyle r[s,a](x)=\mathbb{I}\left[x\in[-2,2]\right]\frac{1}{Z}\exp\left(-\frac{(x-\mu(s,a))^{2}}{2}\right), (153)

where $Z=\int_{x\in[-2,2]}\exp\left(-(x-\mu(s,a))^{2}/2\right)\mathrm{d}x$ is the normalization factor. We use a fixed truncation interval for every state-action pair, regardless of $\mu(s,a)$, because otherwise the KL divergence between two instances can easily be infinite. Without loss of generality, we assume $\mathcal{S}$ is partitioned into $\mathcal{S}_{1},\cdots,\mathcal{S}_{H}$ where for any state $s\in\mathcal{S}_{h}$ and action $a\in\mathcal{A}$, $\operatorname{supp}(p(s,a))\subseteq\mathcal{S}_{h+1}$. We also assume that the initial state $s_{1}$ is fixed.
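For concreteness, the normalization constant $Z$ has a closed form in terms of the standard Gaussian CDF, so the density in Eq. (153) is easy to evaluate numerically. The following is a minimal Python sketch (the function name and default truncation interval are illustrative):

```python
import math

def truncated_gaussian_density(x, mu, lo=-2.0, hi=2.0):
    """Reward density of Eq. (153); `mu` plays the role of mu(s, a) in [-1, 1].
    Function name and keyword defaults are illustrative."""
    if x < lo or x > hi:
        return 0.0

    def std_normal_cdf(t):
        return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

    # Z = int_{lo}^{hi} exp(-(t - mu)^2 / 2) dt, in closed form via the Gaussian CDF.
    Z = math.sqrt(2.0 * math.pi) * (std_normal_cdf(hi - mu) - std_normal_cdf(lo - mu))
    return math.exp(-0.5 * (x - mu) ** 2) / Z

# On [-2, 2] the density lies between exp(-9/2)/sqrt(2*pi) and 1/Z < 1.
print(truncated_gaussian_density(0.3, mu=0.5))
```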

A decision (a.k.a. policy) π:𝒮𝒜\pi:{\mathcal{S}}\to{\mathcal{A}} is a deterministic mapping from a state to an action, and Π=𝒜𝒮\Pi={\mathcal{A}}^{\mathcal{S}} is the set of possible decisions with |Π|=|𝒜||𝒮|<.|\Pi|=|{\mathcal{A}}|^{|{\mathcal{S}}|}<\infty. An observation oo is a trajectory (s1,a1,r1,,sH,aH,rH)(s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H}). The reward of an observation is R(o)=h=1Hrh.R(o)=\sum_{h=1}^{H}r_{h}. We emphasize that sh,ahs_{h},a_{h} are discrete random variables and rhr_{h}\in\mathbb{R} is a continuous random variable. In the following, we use osa=(s1,a1,,sH,aH)o^{\rm sa}=(s_{1},a_{1},\cdots,s_{H},a_{H}) to denote the set of state-action pairs in oo and or=(r1,,rH)o^{\rm r}=(r_{1},\cdots,r_{H}) the set of rewards. As a basic property of MDP, the rewards are independent conditioned on osao^{\rm sa}, and rhr(sh,ah)r_{h}\sim r(s_{h},a_{h}).

An instance $f$ is represented by the transition function $p$ and the reward function $r$ (which is uniquely determined by $\mu:\mathcal{S}\times\mathcal{A}\to[-1,1]$). The family of instances $\mathcal{F}$ is defined as the set of all possible instances.

For an instance ff and a decision π\pi, the density of an observation o=(s1,a1,r1,,sH,aH,rH)o=(s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H}) is

f[π](o)=h=1Hp[sh,ah](sh+1)r[sh,ah](rh).\displaystyle f[\pi](o)=\prod_{h=1}^{H}p[s_{h},a_{h}](s_{h+1})r[s_{h},a_{h}](r_{h}). (154)

For simplicity, we also use fp[s,a](s)f_{p}[s,a](s^{\prime}) to denote the distribution p[s,a](s)p[s,a](s^{\prime}), and fr[s,a](r)f_{r}[s,a](r) to denote r[s,a](r)r[s,a](r).

By definition, for all s,a,r,ss,a,r,s^{\prime} we have p[s,a](s)r[s,a](r)<1.p[s,a](s^{\prime})r[s,a](r)<1. We also define

f[π](osa)=h=1Hp[sh,ah](sh+1)\displaystyle f[\pi](o^{\rm sa})=\prod_{h=1}^{H}p[s_{h},a_{h}](s_{h+1}) (155)

to be the marginal density of osa,o^{\rm sa}, and

f[π](orosa)=h=1Hr[sh,ah](rh)f[\pi](o^{\rm r}\mid o^{\rm sa})=\prod_{h=1}^{H}r[s_{h},a_{h}](r_{h})

the conditional distribution of or.o^{\rm r}.

For an instance ff\in{\mathcal{F}}, we define μmin(f)=minosa,π:f[π](osa)>0f[π](osa).\mu_{\rm min}(f)=\min_{o^{\rm sa},\pi:f[\pi](o^{\rm sa})>0}f[\pi](o^{\rm sa}). Since osa,πo^{\rm sa},\pi are finite, we have μmin(f)>0\mu_{\rm min}(f)>0.
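To make the factorization explicit, the joint density in Eq. (154) splits into the $o^{\rm sa}$-marginal of Eq. (155) and the conditional reward density of $o^{\rm r}$. A minimal sketch, assuming hypothetical data structures (`p[(s, a)]` a dictionary of next-state probabilities and `reward_density(s, a, r)` a callable such as the truncated-Gaussian sketch above):

```python
def trajectory_density(o, p, reward_density):
    """Joint density f[pi](o) of Eq. (154) and the o^sa-marginal of Eq. (155) for a
    trajectory o = [(s_1, a_1, r_1), ..., (s_H, a_H, r_H)] (illustrative sketch)."""
    density_sa, density_r = 1.0, 1.0
    for h, (s, a, r) in enumerate(o):
        if h + 1 < len(o):                     # transition factor p[s_h, a_h](s_{h+1})
            density_sa *= p[(s, a)].get(o[h + 1][0], 0.0)
        density_r *= reward_density(s, a, r)   # reward factor r[s_h, a_h](r_h)
    return density_sa * density_r, density_sa
```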

D.1 Proof of Theorem 5.1

In this section, we prove Theorem 5.1.

Proof.

In the following, we prove the three conditions separately.

Proof of Condition 1.

We prove this condition by invoking Lemma D.1. For instances $f,g\in\mathcal{F}$ and a state-action pair $(s,a)\in\mathcal{S}\times\mathcal{A}$, let $f_{r}[s,a],g_{r}[s,a]$ be their reward distributions respectively. Since $f_{r}[s,a],g_{r}[s,a]$ are truncated Gaussians with means in $[-1,1]$ and support $[-2,2]$, we get $\exp(-6)\leq g_{r}[s,a](x)\leq 1$ for every $x\in\operatorname{supp}(f_{r}[s,a])$ and $g\in\mathcal{F}$ (the density is at least $\exp(-9/2)/\sqrt{2\pi}\geq\exp(-6)$ and at most $1/Z\leq 1$). Therefore $\sup_{x}\left|\ln\frac{f_{r}[s,a](x)}{g_{r}[s,a](x)}\right|\leq 6$ for all $s\in\mathcal{S},a\in\mathcal{A}$. By Lemma D.1 with $c_{M}=6$, we get Condition 1.

Proof of Condition 2.

The first part of this condition is proved by Lemma D.2. On the other hand, we have

\displaystyle\int 1\,\mathrm{d}o=(4|{\mathcal{S}}||{\mathcal{A}}|)^{H}<\infty. (156)

Proof of Condition 3.

To prove the first part of this condition, recall that

f[π](o)=h=1Hp[sh,ah](sh+1)r[sh,ah](rh),\displaystyle{f^{\star}}[\pi](o)=\prod_{h=1}^{H}p[s_{h},a_{h}](s_{h+1})r[s_{h},a_{h}](r_{h}), (157)

where pp is the transition function and rr is the reward distribution. Let

c_{\rm min}=\left(\min_{s,a\in{\mathcal{S}}\times{\mathcal{A}},\,s^{\prime}\in\operatorname{supp}(p[s,a](\cdot))}p[s,a](s^{\prime})\right)^{H}\exp(-6H).

As a result, ${f^{\star}}[\pi](o)>c_{\rm min}$ for all $o\in\operatorname{supp}({f^{\star}}[\pi])$, since every transition factor is at least the minimum transition probability and every reward-density factor is at least $\exp(-9/2)/\sqrt{2\pi}>\exp(-6)$. We prove the second part of this condition in Lemma D.3. ∎

D.2 Proof of Condition 1

In this section, we present a lemma that establishes Condition 1 for tabular RL.

Lemma D.1.

Consider any fixed RL instance ff with discrete state, action and general reward distribution. Suppose there exists a constant cM>0c_{M}>0 such that for any g,s𝒮,a𝒜g\in{\mathcal{F}},s\in{\mathcal{S}},a\in{\mathcal{A}}, the reward distributions of instance ff and gg at state ss and action aa (denoted by fr[s,a],gr[s,a]f_{r}[s,a],g_{r}[s,a] respectively) satisfy

𝔼xfr[s,a][(lnfr[s,a](x)gr[s,a](x))4]cM4.\displaystyle\mathbb{E}_{x\sim f_{r}[s,a]}\left[\left(\ln\frac{f_{r}[s,a](x)}{g_{r}[s,a](x)}\right)^{4}\right]\leq c_{M}^{4}. (158)

Then for every α>0,ϵ(0,1)\alpha>0,\epsilon\in(0,1), Condition 1 holds with

λ0(α,ϵ,f)=ϵ32H2min{μmin(f)4α,μmin(f)10,1cM}2.\displaystyle\lambda_{0}(\alpha,\epsilon,f)=\frac{\epsilon}{32H^{2}}\min\left\{\frac{\mu_{\rm min}(f)}{4\alpha},\frac{\mu_{\rm min}(f)}{10},\frac{1}{c_{M}}\right\}^{2}. (159)
Proof of Lemma D.1.

Let

κ=μmin(f)e2exp(2αμmin(f)).\displaystyle\kappa=\frac{\mu_{\rm min}(f)}{e^{2}}\exp\left(-\frac{2\alpha}{\mu_{\rm min}(f)}\right). (160)

Recall that for reinforcement learning, an observation $o=(s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H})$ represents a trajectory, and $o^{\rm sa}=(s_{1},a_{1},s_{2},a_{2},\cdots,s_{H},a_{H})$ denotes the state-action pairs in the trajectory. Consider any fixed decision $\pi$. For any $\lambda<\lambda_{0}$, we prove the following two cases separately.

Case 1: minosa:f[π](osa)>0g[π](osa)<κ.\min_{o^{\rm sa}:f[\pi](o^{\rm sa})>0}g[\pi](o^{\rm sa})<\kappa.

In this case, we prove that D1λ(f[π]g[π])α.D_{1-\lambda}(f[\pi]\|g[\pi])\geq\alpha.

Let osa^=argminosa:f[π](osa)>0g[π](osa).\hat{o^{\rm sa}}=\operatorname*{argmin}_{o^{\rm sa}:f[\pi](o^{\rm sa})>0}g[\pi](o^{\rm sa}). By the condition of this case we have g[π](osa^)<κ.g[\pi](\hat{o^{\rm sa}})<\kappa. Applying Lemma G.1 we get

D1λ(f[π](o)g[π](o))D1λ(f[π](osa)g[π](osa)).\displaystyle D_{1-\lambda}(f[\pi](o)\|g[\pi](o))\geq D_{1-\lambda}(f[\pi](o^{\rm sa})\|g[\pi](o^{\rm sa})). (161)

In the following we prove the RHS of Eq. (161) is larger than α\alpha. We start with Hölder’s inequality and the basic inequality that (1x)t1tx(1-x)^{t}\leq 1-tx for any t(0,1),x(0,1)t\in(0,1),x\in(0,1):

\displaystyle\sum_{o^{\rm sa}}f[\pi](o^{\rm sa})^{1-\lambda}g[\pi](o^{\rm sa})^{\lambda} (162)
=\displaystyle=\; osaosa^f[π](osa)1λg[π](osa)λ+f[π](o^sa)1λg[π](o^sa)λ\displaystyle\sum_{o^{\rm sa}\neq\hat{o^{\rm sa}}}f[\pi](o^{\rm sa})^{1-\lambda}g[\pi](o^{\rm sa})^{\lambda}+f[\pi](\hat{o}^{\rm sa})^{1-\lambda}g[\pi](\hat{o}^{\rm sa})^{\lambda} (163)
\displaystyle\leq\; (osaosa^f[π](osa))1λ(osaosa^g[π](osa))λ+f[π](o^sa)1λg[π](o^sa)λ\displaystyle\left(\sum_{o^{\rm sa}\neq\hat{o^{\rm sa}}}f[\pi](o^{\rm sa})\right)^{1-\lambda}\left(\sum_{o^{\rm sa}\neq\hat{o^{\rm sa}}}g[\pi](o^{\rm sa})\right)^{\lambda}+f[\pi](\hat{o}^{\rm sa})^{1-\lambda}g[\pi](\hat{o}^{\rm sa})^{\lambda} (164)
\displaystyle\leq\; (1f[π](o^sa))1λ(1g[π](o^sa))λ+f[π](o^sa)1λg[π](o^sa)λ\displaystyle\left(1-f[\pi](\hat{o}^{\rm sa})\right)^{1-\lambda}\left(1-g[\pi](\hat{o}^{\rm sa})\right)^{\lambda}+f[\pi](\hat{o}^{\rm sa})^{1-\lambda}g[\pi](\hat{o}^{\rm sa})^{\lambda} (165)
\displaystyle\leq\; (1f[π](o^sa))1λ+f[π](o^sa)1λg[π](o^sa)λ\displaystyle\left(1-f[\pi](\hat{o}^{\rm sa})\right)^{1-\lambda}+f[\pi](\hat{o}^{\rm sa})^{1-\lambda}g[\pi](\hat{o}^{\rm sa})^{\lambda} (166)
\displaystyle\leq\; 1f[π](o^sa)(1λ)+f[π](o^sa)1λg[π](o^sa)λ.\displaystyle 1-f[\pi](\hat{o}^{\rm sa})(1-\lambda)+f[\pi](\hat{o}^{\rm sa})^{1-\lambda}g[\pi](\hat{o}^{\rm sa})^{\lambda}. (167)

Recall the basic inequality ln(1+x)x\ln(1+x)\leq x for all x>1x>-1. Therefore,

\displaystyle\frac{1}{\lambda}\ln\left(\sum_{o^{\rm sa}}f[\pi](o^{\rm sa})^{1-\lambda}g[\pi](o^{\rm sa})^{\lambda}\right) (168)
\displaystyle\leq\; 1λ(f[π](o^sa)(1λ)+f[π](o^sa)1λg[π](o^sa)λ)\displaystyle\frac{1}{\lambda}\left(-f[\pi](\hat{o}^{\rm sa})(1-\lambda)+f[\pi](\hat{o}^{\rm sa})^{1-\lambda}g[\pi](\hat{o}^{\rm sa})^{\lambda}\right) (169)
\displaystyle\leq\; 1λ(f[π](o^sa)((g(o^sa)f(o^sa))λ1)+λf[π](o^sa))\displaystyle\frac{1}{\lambda}\left(f[\pi](\hat{o}^{\rm sa})\left(\left(\frac{g(\hat{o}^{\rm sa})}{f(\hat{o}^{\rm sa})}\right)^{\lambda}-1\right)+\lambda f[\pi](\hat{o}^{\rm sa})\right) (170)
\displaystyle\leq\; 1λ(f[π](o^sa)((κf(o^sa))λ1)+λf[π](o^sa)).\displaystyle\frac{1}{\lambda}\left(f[\pi](\hat{o}^{\rm sa})\left(\left(\frac{\kappa}{f(\hat{o}^{\rm sa})}\right)^{\lambda}-1\right)+\lambda f[\pi](\hat{o}^{\rm sa})\right). (171)

Recall the basic inequality that exp(x)1+x/2\exp(x)\leq 1+x/2 for all 1x0-1\leq x\leq 0. Since we have

(κf[π](o^sa))λ=exp(λln(κf[π](o^sa))),\left(\frac{\kappa}{f[\pi](\hat{o}^{\rm sa})}\right)^{\lambda}=\exp\left(\lambda\ln\left(\frac{\kappa}{f[\pi](\hat{o}^{\rm sa})}\right)\right),

when λ(ln(f[π](o^sa)/κ))1\lambda\leq\left(\ln\left(f[\pi](\hat{o}^{\rm sa})/\kappa\right)\right)^{-1} we get

1λ(f[π](o^sa)((κf[π](o^sa))λ1)+λf[π](o^sa))\displaystyle\frac{1}{\lambda}\left(f[\pi](\hat{o}^{\rm sa})\left(\left(\frac{\kappa}{f[\pi](\hat{o}^{\rm sa})}\right)^{\lambda}-1\right)+\lambda f[\pi](\hat{o}^{\rm sa})\right) (172)
\displaystyle\leq\; 12f[π](o^sa)ln(κ/f[π](o^sa))+f[π](o^sa)=12f[π](o^sa)ln(e2κ/f[π](o^sa)).\displaystyle\frac{1}{2}f[\pi](\hat{o}^{\rm sa})\ln(\kappa/f[\pi](\hat{o}^{\rm sa}))+f[\pi](\hat{o}^{\rm sa})=\frac{1}{2}f[\pi](\hat{o}^{\rm sa})\ln(e^{2}\kappa/f[\pi](\hat{o}^{\rm sa})). (173)

By the definition of κ\kappa we get

12f[π](o^sa)ln(e2κ/f[π](o^sa))α,\displaystyle\frac{1}{2}f[\pi](\hat{o}^{\rm sa})\ln(e^{2}\kappa/f[\pi](\hat{o}^{\rm sa}))\leq-\alpha, (174)

which leads to $D_{1-\lambda}(f[\pi](o^{\rm sa})\|g[\pi](o^{\rm sa}))\geq\alpha$. Combined with Eq. (161), this proves $D_{1-\lambda}(f[\pi]\|g[\pi])\geq\alpha$ in Case 1.
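For reference, the quantity bounded in Case 1 is the Rényi divergence of order $1-\lambda$ evaluated on the discrete marginal over $o^{\rm sa}$, with the sign convention implied by Eqs. (168)-(174). A minimal sketch for discrete distributions:

```python
import math

def renyi_one_minus_lambda(p, q, lam):
    """Renyi divergence of order 1 - lam between discrete distributions p and q:
    D_{1-lam}(p || q) = -(1/lam) * ln( sum_o p(o)^{1-lam} * q(o)^{lam} ),
    the sign convention implied by Eqs. (168)-(174). Requires 0 < lam < 1."""
    total = sum((pi ** (1.0 - lam)) * (qi ** lam)
                for pi, qi in zip(p, q) if pi > 0.0)
    return -math.log(total) / lam

# Example on three outcomes; as lam -> 0 this approaches the KL divergence.
print(renyi_one_minus_lambda([0.7, 0.2, 0.1], [0.1, 0.2, 0.7], lam=0.1))
```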

Case 2: minosa:f[π](osa)>0g[π](osa)κ.\min_{o^{\rm sa}:f[\pi](o^{\rm sa})>0}g[\pi](o^{\rm sa})\geq\kappa.

By Lemma G.8, for any λ(0,1/2)\lambda\in\left(0,1/2\right) we get

DKL(f[π]g[π])D1λ(f[π]g[π])λ2𝔼of[π][(lnf[π](o)g[π](o))4]1/2.\displaystyle D_{\mathrm{KL}}(f[\pi]\|g[\pi])-D_{1-\lambda}(f[\pi]\|g[\pi])\leq\frac{\lambda}{2}\mathbb{E}_{o\sim f[\pi]}\left[\left(\ln\frac{f[\pi](o)}{g[\pi](o)}\right)^{4}\right]^{1/2}. (175)

Let fr,gr:𝒮×𝒜Δ()f_{r},g_{r}:{\mathcal{S}}\times{\mathcal{A}}\to\Delta(\mathbb{R}) be the reward distributions of instance ff and gg respectively, and fr[s,a](),gr[s,a]()f_{r}[s,a](\cdot),g_{r}[s,a](\cdot) the densities of the reward given state-action pair (s,a)(s,a). Recall that for a trajectory o=(s1,a1,r1,,sH,aH,rH)o=(s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H}) we have

f[π](o)=f[π](osa)h=1Hfr[sh,ah](rh).\displaystyle f[\pi](o)=f[\pi](o^{\rm sa})\prod_{h=1}^{H}f_{r}[s_{h},a_{h}](r_{h}). (176)

By Hölder’s inequality we get

𝔼of[π][(lnf[π](o)g[π](o))4]\displaystyle\mathbb{E}_{o\sim f[\pi]}\left[\left(\ln\frac{f[\pi](o)}{g[\pi](o)}\right)^{4}\right] (177)
=\displaystyle=\; 𝔼of[π][(lnf[π](osa)g[π](osa)+h=1Hlnfr[sh,ah](rh)gr[sh,ah](rh))4]\displaystyle\mathbb{E}_{o\sim f[\pi]}\left[\left(\ln\frac{f[\pi](o^{\rm sa})}{g[\pi](o^{\rm sa})}+\sum_{h=1}^{H}\ln\frac{f_{r}[s_{h},a_{h}](r_{h})}{g_{r}[s_{h},a_{h}](r_{h})}\right)^{4}\right] (178)
\displaystyle\leq\;\mathbb{E}_{o\sim f[\pi]}\left[(H+1)^{3}\left(\left(\ln\frac{f[\pi](o^{\rm sa})}{g[\pi](o^{\rm sa})}\right)^{4}+\sum_{h=1}^{H}\left(\ln\frac{f_{r}[s_{h},a_{h}](r_{h})}{g_{r}[s_{h},a_{h}](r_{h})}\right)^{4}\right)\right] (179)
\displaystyle\leq\; (H+1)3ln(1/κ)4+(H+1)4sups𝒮,a𝒜𝔼xfr[s,a][(lnfr[s,a](x)gr[s,a](x))4]\displaystyle(H+1)^{3}\ln(1/\kappa)^{4}+(H+1)^{4}\sup_{s\in{\mathcal{S}},a\in{\mathcal{A}}}\mathbb{E}_{x\sim f_{r}[s,a]}\left[\left(\ln\frac{f_{r}[s,a](x)}{g_{r}[s,a](x)}\right)^{4}\right] (180)
\displaystyle\leq\; (H+1)3ln(1/κ)4+(H+1)4cM4,\displaystyle(H+1)^{3}\ln(1/\kappa)^{4}+(H+1)^{4}c_{M}^{4}, (181)

where the last inequality comes from the assumption of this lemma (Eq. (158)). Therefore, when $\lambda\leq\epsilon(H+1)^{-2}\min\{\ln(1/\kappa)^{-2},c_{M}^{-2}\}$ we get

DKL(f[π]g[π])D1λ(f[π]g[π])λ2𝔼of[π][(lnf[π](o)g[π](o))4]1/2\displaystyle D_{\mathrm{KL}}(f[\pi]\|g[\pi])-D_{1-\lambda}(f[\pi]\|g[\pi])\leq\frac{\lambda}{2}\mathbb{E}_{o\sim f[\pi]}\left[\left(\ln\frac{f[\pi](o)}{g[\pi](o)}\right)^{4}\right]^{1/2} (182)
\displaystyle\leq\; ϵ2(H+1)2min{ln(1/κ)2,cM2}((H+1)3ln(1/κ)4+(H+1)4cM4)1/2\displaystyle\frac{\epsilon}{2}(H+1)^{-2}\min\{\ln(1/\kappa)^{-2},c_{M}^{-2}\}\left((H+1)^{3}\ln(1/\kappa)^{4}+(H+1)^{4}c_{M}^{4}\right)^{1/2} (183)
\displaystyle\leq\; ϵ2min{ln(1/κ)2,cM2}(ln(1/κ)4+cM4)1/2\displaystyle\frac{\epsilon}{2}\min\{\ln(1/\kappa)^{-2},c_{M}^{-2}\}\left(\ln(1/\kappa)^{4}+c_{M}^{4}\right)^{1/2} (184)
\displaystyle\leq\; ϵ2min{ln(1/κ)2,cM2}(ln(1/κ)2+cM2)\displaystyle\frac{\epsilon}{2}\min\{\ln(1/\kappa)^{-2},c_{M}^{-2}\}\left(\ln(1/\kappa)^{2}+c_{M}^{2}\right) (185)
\displaystyle\leq\; ϵ.\displaystyle\epsilon. (186)

Recall that

κ=μmin(f)e2exp(2αμmin(f)).\displaystyle\kappa=\frac{\mu_{\rm min}(f)}{e^{2}}\exp\left(-\frac{2\alpha}{\mu_{\rm min}(f)}\right). (187)

By algebraic manipulation we get

ϵ32H2min{μmin(f)4α,μmin(f)10,1cM}2ϵ(H+1)2min{ln(1/κ)2,cM2},\displaystyle\frac{\epsilon}{32H^{2}}\min\left\{\frac{\mu_{\rm min}(f)}{4\alpha},\frac{\mu_{\rm min}(f)}{10},\frac{1}{c_{M}}\right\}^{2}\leq\epsilon(H+1)^{-2}\min\{\ln(1/\kappa)^{-2},c_{M}^{-2}\}, (188)

which proves the desired result. ∎

D.3 Proof of Condition 2

Recall that for two instances f,gf,g\in{\mathcal{F}}, their distance is d(f,g)=supπ,o|f[π](o)g[π](o)|.d(f,g)=\sup_{\pi,o}{\left|{f[\pi](o)-g[\pi](o)}\right|}.

Lemma D.2.

Suppose ${\mathcal{F}}$ represents tabular RL with truncated Gaussian rewards, with state space ${\mathcal{S}}$ and action space ${\mathcal{A}}$. Then we have $\ln{\mathcal{N}}({\mathcal{F}},\epsilon)\leq{\mathcal{O}}(|{\mathcal{S}}|^{2}|{\mathcal{A}}|\ln(1/\epsilon))$ for every $\epsilon>0$.

Proof.

For instances ff\in{\mathcal{F}} and state-action pair s,a𝒮×𝒜s,a\in{\mathcal{S}}\times{\mathcal{A}}, let fr[s,a]f_{r}[s,a] be its reward distribution and fp[s,a]f_{p}[s,a] its transition. Recall that for tabular RL we have

f[π](o)=h=1Hfp[sh,ah](sh+1)fr[sh,ah](rh),\displaystyle f[\pi](o)=\prod_{h=1}^{H}f_{p}[s_{h},a_{h}](s_{h+1})f_{r}[s_{h},a_{h}](r_{h}), (189)

where the observation is o=(s1,a1,r1,,sH,aH,rH)o=(s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H}). Consequently,

|f[π](o)g[π](o)|\displaystyle{\left|{f[\pi](o)-g[\pi](o)}\right|}
\displaystyle\leq h=1H(|fp[sh,ah](sh+1)fr[sh,ah](rh)gp[sh,ah](sh+1)gr[sh,ah](rh)|\displaystyle\sum_{h=1}^{H}\Bigg{(}{\left|{f_{p}[s_{h},a_{h}](s_{h+1})f_{r}[s_{h},a_{h}](r_{h})-g_{p}[s_{h},a_{h}](s_{h+1})g_{r}[s_{h},a_{h}](r_{h})}\right|}
h=1h1fp[sh,ah](sh+1)fr[sh,ah](rh)h=h+1Hgp[sh,ah](sh+1)gr[sh,ah](rh))\displaystyle\qquad\prod_{h^{\prime}=1}^{h-1}f_{p}[s_{h^{\prime}},a_{h^{\prime}}](s_{h^{\prime}+1})f_{r}[s_{h^{\prime}},a_{h^{\prime}}](r_{h^{\prime}})\prod_{h^{\prime}=h+1}^{H}g_{p}[s_{h^{\prime}},a_{h^{\prime}}](s_{h^{\prime}+1})g_{r}[s_{h^{\prime}},a_{h^{\prime}}](r_{h^{\prime}})\Bigg{)}
\displaystyle\leq h=1H|fp[sh,ah](sh+1)fr[sh,ah](rh)gp[sh,ah](sh+1)gr[sh,ah](rh)|\displaystyle\sum_{h=1}^{H}{\left|{f_{p}[s_{h},a_{h}](s_{h+1})f_{r}[s_{h},a_{h}](r_{h})-g_{p}[s_{h},a_{h}](s_{h+1})g_{r}[s_{h},a_{h}](r_{h})}\right|}
\displaystyle\leq h=1H(|fp[sh,ah](sh+1)gp[sh,ah](sh+1)|+|fr[sh,ah](rh)gr[sh,ah](rh)|)\displaystyle\sum_{h=1}^{H}\left({\left|{f_{p}[s_{h},a_{h}](s_{h+1})-g_{p}[s_{h},a_{h}](s_{h+1})}\right|}+{\left|{f_{r}[s_{h},a_{h}](r_{h})-g_{r}[s_{h},a_{h}](r_{h})}\right|}\right)

Therefore, we construct a covering 𝒞{\mathcal{C}} such that for every ff\in{\mathcal{F}}, there exists g𝒞g\in{\mathcal{C}} with

|fp[s,a](s)gp[s,a](s)|ϵ2H,s,a,s,\displaystyle{\left|{f_{p}[s,a](s^{\prime})-g_{p}[s,a](s^{\prime})}\right|}\leq\frac{\epsilon}{2H},\quad\forall s,a,s^{\prime}, (190)
|fr[s,a](r)gr[s,a](r)|ϵ2H,s,a,r.\displaystyle{\left|{f_{r}[s,a](r)-g_{r}[s,a](r)}\right|}\leq\frac{\epsilon}{2H},\quad\forall s,a,r. (191)

Since $f_{p}[s,a]$ is a discrete distribution over at most $|\mathcal{S}|$ states, the covering number for the transition function is upper bounded by $(4H/\epsilon)^{|\mathcal{S}|^{2}|\mathcal{A}|}$. On the other hand, $f_{r}[s,a]$ is a truncated Gaussian distribution parameterized by its mean, so the covering number for the reward is upper bounded by $(4H/\epsilon)^{|\mathcal{S}||\mathcal{A}|}$. As a result,

\displaystyle\ln{\mathcal{N}}({\mathcal{F}},\epsilon)\leq{\mathcal{O}}(|{\mathcal{S}}|^{2}|{\mathcal{A}}|\ln(1/\epsilon)). (192)
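To make the counting explicit, the bound above is just the sum of the two grid exponents. A minimal sketch (all argument values are illustrative):

```python
import math

def log_covering_number_bound(S, A, H, eps):
    """Log-size of the grid cover constructed in the proof of Lemma D.2: transition
    entries (|S|^2 |A| of them) and reward means (|S||A| of them) are each
    discretized to resolution eps/(2H), giving the (4H/eps)^... counts above."""
    log_grid = math.log(4.0 * H / eps)
    return S * S * A * log_grid + S * A * log_grid

print(log_covering_number_bound(S=3, A=2, H=4, eps=0.1))   # illustrative values
```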

D.4 Proof of Condition 3

Lemma D.3.

Consider tabular reinforcement learning with truncated Gaussian reward. For a fixed instance f{f^{\star}}, for all f,πΠf\in{\mathcal{F}},\pi\in\Pi we have

f[π]f[π]37Hμmin(f)1/6DKL(f[π]f[π])1/6.\displaystyle\|{f^{\star}}[\pi]-f[\pi]\|_{\infty}\leq\frac{37H}{\mu_{\rm min}({f^{\star}})^{1/6}}D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi])^{1/6}. (193)
Proof.

Recall that for tabular RL, an observation is $o=(s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H})$. We use $o^{\rm sa}=(s_{1},a_{1},s_{2},a_{2},\cdots,s_{H},a_{H})$ to denote the collection of states and actions in the trajectory $o$, and $o^{\rm r}=(r_{1},\cdots,r_{H})$ the collection of rewards. For an instance $f\in{\mathcal{F}}$ and a state-action pair $(s,a)\in{\mathcal{S}}\times{\mathcal{A}}$, let $f_{r}[s,a]$ be its reward distribution and $f_{p}[s,a]$ its transition.

Let $\epsilon_{0}=\frac{H}{\mu_{\rm min}({f^{\star}})^{1/6}}$. Consider the random variables $o^{\rm sa}$ and $o^{\rm r}$. By the chain rule of KL divergence we have

DKL(f[π]f[π])=DKL(f[π](osa)f[π](osa))+𝔼osaf[DKL(f[π](orosa)f[π](orosa))].\displaystyle D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi])=D_{\mathrm{KL}}({f^{\star}}[\pi](o^{\rm sa})\|f[\pi](o^{\rm sa}))+\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}}\left[D_{\mathrm{KL}}({f^{\star}}[\pi](o^{\rm r}\mid o^{\rm sa})\|f[\pi](o^{\rm r}\mid o^{\rm sa}))\right]. (194)

Since osao^{\rm sa} is a discrete random variable, we have

\displaystyle\left|{f^{\star}}[\pi](o^{\rm sa})-f[\pi](o^{\rm sa})\right|\leq D_{\rm TV}({f^{\star}}[\pi]\|f[\pi])\leq D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi])^{1/2}. (195)

Therefore, for any osasupp(f[π])o^{\rm sa}\not\in\operatorname{supp}({f^{\star}}[\pi]),

|f[π](osa,or)f[π](osa,or)|f[π](osa,or)f[π](osa)f[π](orosa)DKL(f[π]f[π])1/2,\displaystyle{\left|{{f^{\star}}[\pi](o^{\rm sa},o^{\rm r})-f[\pi](o^{\rm sa},o^{\rm r})}\right|}\leq f[\pi](o^{\rm sa},o^{\rm r})\leq f[\pi](o^{\rm sa})f[\pi](o^{\rm r}\mid o^{\rm sa})\leq D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi])^{1/2}, (196)

where the last inequality uses $f[\pi](o^{\rm r}\mid o^{\rm sa})=\prod_{(s_{h},a_{h},r_{h})\in o}f_{r}[s_{h},a_{h}](r_{h})\leq 1$ together with Eq. (195) and ${f^{\star}}[\pi](o^{\rm sa})=0$, which give $f[\pi](o^{\rm sa})\leq D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi])^{1/2}$.

Now, for any osasupp(f[π])o^{\rm sa}\in\operatorname{supp}({f^{\star}}[\pi]), by Eq. (194) we get

DKL(f[π](orosa)f[π](orosa))1μmin(f)DKL(f[π]f[π]).\displaystyle D_{\mathrm{KL}}({f^{\star}}[\pi](o^{\rm r}\mid o^{\rm sa})\|f[\pi](o^{\rm r}\mid o^{\rm sa}))\leq\frac{1}{\mu_{\rm min}({f^{\star}})}D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi]). (197)

By the chain rule of KL divergence,

DKL(f[π](orosa)f[π](orosa))=h=1HDKL(fr[sh,ah]fr[sh,ah]).\displaystyle D_{\mathrm{KL}}({f^{\star}}[\pi](o^{\rm r}\mid o^{\rm sa})\|f[\pi](o^{\rm r}\mid o^{\rm sa}))=\sum_{h=1}^{H}D_{\mathrm{KL}}({f^{\star}}_{r}[s_{h},a_{h}]\|f_{r}[s_{h},a_{h}]). (198)

As a result, for every (s,a)osa(s,a)\in o^{\rm sa} we get

DKL(fr[s,a]fr[s,a])1μmin(f)DKL(f[π]f[π]).\displaystyle D_{\mathrm{KL}}({f^{\star}}_{r}[s,a]\|f_{r}[s,a])\leq\frac{1}{\mu_{\rm min}({f^{\star}})}D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi]). (199)

By Lemma G.6, |fr[s,a](r)fr[s,a](r)|36(1μmin(f)DKL(f[π]f[π]))1/6.{\left|{{f^{\star}}_{r}[s,a](r)-f_{r}[s,a](r)}\right|}\leq 36\left(\frac{1}{\mu_{\rm min}({f^{\star}})}D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi])\right)^{1/6}. Therefore,

|f[π](orosa)f[π](orosa)|\displaystyle{\left|{{f^{\star}}[\pi](o^{\rm r}\mid o^{\rm sa})-f[\pi](o^{\rm r}\mid o^{\rm sa})}\right|} (200)
=\displaystyle=\; |h=1Hfr[sh,ah](rh)h=1Hfr[sh,ah](rh)|\displaystyle{\left|{\prod_{h=1}^{H}{f^{\star}}_{r}[s_{h},a_{h}](r_{h})-\prod_{h=1}^{H}f_{r}[s_{h},a_{h}](r_{h})}\right|} (201)
\displaystyle\leq\; h=1H|fr[sh,ah](rh)fr[sh,ah](rh)|\displaystyle\sum_{h=1}^{H}{\left|{{f^{\star}}_{r}[s_{h},a_{h}](r_{h})-f_{r}[s_{h},a_{h}](r_{h})}\right|} (202)
\displaystyle\leq\; 36H(1μmin(f)DKL(f[π]f[π]))1/6.\displaystyle 36H\left(\frac{1}{\mu_{\rm min}({f^{\star}})}D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi])\right)^{1/6}. (203)

It follows that,

|f[π](osa,or)f[π](osa,or)|\displaystyle{\left|{{f^{\star}}[\pi](o^{\rm sa},o^{\rm r})-f[\pi](o^{\rm sa},o^{\rm r})}\right|} (204)
\displaystyle\leq\; |f[π](osa)f[π](osa)|+|f[π](orosa)f[π](orosa)|\displaystyle{\left|{{f^{\star}}[\pi](o^{\rm sa})-f[\pi](o^{\rm sa})}\right|}+{\left|{{f^{\star}}[\pi](o^{\rm r}\mid o^{\rm sa})-f[\pi](o^{\rm r}\mid o^{\rm sa})}\right|} (205)
\displaystyle\leq\; DKL(f[π]f[π])1/2+36H(1μmin(f)DKL(f[π]f[π]))1/6\displaystyle D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi])^{1/2}+36H\left(\frac{1}{\mu_{\rm min}({f^{\star}})}D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi])\right)^{1/6} (206)
\displaystyle\leq\;37H\left(\frac{1}{\mu_{\rm min}({f^{\star}})}D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi])\right)^{1/6}. (207)
∎
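The chain-rule decomposition used in Eqs. (194) and (198) (and again in Eqs. (221)-(223) of Appendix E) can be evaluated by a forward pass over the layers, weighting each per-step KL term by the occupancy under $f^{\star}$. A minimal sketch, assuming hypothetical data structures (`p1`, `p2` dictionaries of next-state probabilities, `kl_reward[(s, a)]` the per-pair reward KL, `policy[s]` the action, `layers[h]` the states of layer $h+1$):

```python
import math

def mdp_kl_under_policy(p1, p2, kl_reward, policy, layers):
    """D_KL(f1[pi] || f2[pi]) for a layered tabular MDP via the chain rule of
    Eqs. (194), (198), and (221)-(223): the occupancy-weighted sum of per-step
    transition KLs and reward KLs (illustrative sketch, hypothetical inputs)."""
    occupancy = {layers[0][0]: 1.0}            # the initial state s_1 is fixed
    total = 0.0
    for h, states in enumerate(layers):
        next_occupancy = {}
        for s in states:
            w = occupancy.get(s, 0.0)
            if w == 0.0:
                continue
            a = policy[s]
            total += w * kl_reward[(s, a)]     # reward term D_KL(r1[s,a] || r2[s,a])
            if h + 1 == len(layers):
                continue                       # no transition out of the last layer
            for s_next, q1 in p1[(s, a)].items():
                if q1 == 0.0:
                    continue
                q2 = p2[(s, a)].get(s_next, 0.0)
                if q2 == 0.0:
                    return math.inf            # support mismatch: KL is infinite
                total += w * q1 * math.log(q1 / q2)
                next_occupancy[s_next] = next_occupancy.get(s_next, 0.0) + w * q1
        if h + 1 < len(layers):
            occupancy = next_occupancy
    return total
```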

Appendix E RL with General Reward Distribution

In this section, we extend our analysis to RL with general reward distributions, where the support of the reward may have infinite volume (e.g., the real line $\mathbb{R}$ when the reward distribution is Gaussian). We assume that for any state-action pair $(s,a)$, the reward distribution $r[s,a]$ comes from a distribution family $\mathcal{R}$ (e.g., $\mathcal{R}=\{\mathcal{N}(\mu,1):\mu\in[-1,1]\}$ when the reward is Gaussian). For any $g\in\mathcal{R}$, let $g(\cdot)$ be its density function and $\mu(g)\triangleq\mathbb{E}_{x\sim g}[x]$ its mean, and we assume $\sup_{g\in\mathcal{R}}|\mu(g)|\leq R_{\rm max}$. We emphasize that for general reward distributions, we do not require Conditions 2 and 3 because they do not hold for the Gaussian distribution. Instead, we require the following.

Condition 4.

Let {\mathcal{R}} be the reward distribution family. Then

  1. (a)

    for all ff\in{\mathcal{R}}, there exists a constant c7(0,1],c8(0,1],cM>0,c_{7}\in(0,1],c_{8}\in(0,1],c_{M}>0, such that for every δ>0\delta>0,

    \displaystyle\mathop{\rm Pr}\nolimits_{x\sim f}\left(\exists g,g^{\prime}\in{\mathcal{R}}:\;{\left|{\ln\frac{g(x)}{g^{\prime}(x)}}\right|}>D_{\rm TV}(g\|g^{\prime})^{c_{7}}\mathrm{polylog}(1/\delta)\right)\leq\delta; (208)
  2. (b)

    for all g,gg,g^{\prime}\in{\mathcal{R}}, |μ(g)μ(g)|DTV(gg)c8{\left|{\mu(g)-\mu(g^{\prime})}\right|}\lesssim D_{\rm TV}(g\|g^{\prime})^{c_{8}};

  3. (c)

    for all g,gg,g^{\prime}\in{\mathcal{R}},

    𝔼xg[(lng(x)g(x))4]cM4;\mathbb{E}_{x\sim g}\left[\left(\ln\frac{g(x)}{g^{\prime}(x)}\right)^{4}\right]\leq c_{M}^{4};
  4. (d)

    for any ϵ>0\epsilon>0, there exists a covering 𝒞(,ϵ){\mathcal{C}}({\mathcal{R}},\epsilon)\subseteq{\mathcal{R}} such that

    g,g𝒞(,ϵ),DTV(gg)ϵ.\displaystyle\forall g\in{\mathcal{R}},\;\exists g^{\prime}\in{\mathcal{C}}({\mathcal{R}},\epsilon),\;D_{\rm TV}(g\|g^{\prime})\leq\epsilon. (209)

    And log|𝒞(,ϵ)|=𝒪(log(1/ϵ)).\log|{\mathcal{C}}({\mathcal{R}},\epsilon)|={\mathcal{O}}(\log(1/\epsilon)).

Condition 1 still holds in this case because of Lemma D.1 and item (c) of Condition 4.

For tabular reinforcement learning problems, we only require Condition 4 for the reward distribution of each fixed state-action pair, which is a one-dimensional distribution. Condition 4 holds for the Gaussian distribution (see Proposition E.5), the Laplace distribution, etc.

Our main result for tabular RL is stated as follows.

Theorem E.1.

Suppose ${\mathcal{F}}$ is the hypothesis class representing tabular reinforcement learning with a reward distribution family that satisfies Condition 4. Then the regret of Alg. 1 satisfies

lim supnRegf,nlnn𝒞(f).\displaystyle\limsup_{n\to\infty}\frac{\mathrm{Reg}_{{f^{\star}},n}}{\ln n}\leq{\mathcal{C}}({f^{\star}}). (210)

In Section E.1 we present the main lemmas for Steps 1-2. Then, Theorem E.1 is proved using exactly the same approach as Theorem 5.2 (by replacing the main lemmas for Steps 1-2).

E.1 Lemmas for RL with General Reward Distribution

In this section, we state the main lemmas for RL with general reward distribution (namely, Lemma E.2 for Step 1 and Lemma E.3 for Step 2).

Lemma E.2.

Let init{\mathcal{E}}_{\rm init} be the event that, there exists a universal constant c4>0c_{4}>0 and a function ι(f)\iota({f^{\star}}) that only depends on ff^{\star} such that

  1. (a)

    for all πΠ\pi\in\Pi, DKL(f[π]f^[π])(lnlnnlnn)c4D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\leq\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}},

  2. (b)

    |Rf^(π)Rf(π)|2HRmax(lnlnnlnn)c4{\left|{R_{{\hat{f}}}(\pi)-R_{{f^{\star}}}(\pi)}\right|}\lesssim 2HR_{\rm max}\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}}, for all πΠ\pi\in\Pi,

  3. (c)

    π(f^)=π(f).\pi^{\star}({\hat{f}})=\pi^{\star}({f^{\star}}).

For tabular reinforcement learning with general reward distribution that satisfies Condition 4, there exists n0>0n_{0}>0 such that when n>n0n>n_{0}, Pr(init)11/lnn.\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm init})\geq 1-1/\ln n. In addition, the regret of Step 1 is upper bounded by 𝒪(lnnlnlnn).{\mathcal{O}}(\frac{\ln n}{\ln\ln n}).

Proof.

In the following, we prove the three items separately. Recall that Condition 1 holds by Lemma D.1 and item (c) of Condition 4.

Items (a) and (c):

The proofs of items (a) and (c) are the same as those of Lemma C.1, with Lemma F.1 replaced by Lemma F.2.

Item (b):

Recall that an observation is a sequence of state-action-reward tuples: o=(s1,a1,r1,,sH,aH,rH)o=(s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H}). Therefore, for all πΠ\pi\in\Pi we have

|Rf^(π)Rf(π)|=|𝔼of^[π][h=1Hrh]𝔼of[π][h=1Hrh]|\displaystyle{\left|{R_{{\hat{f}}}(\pi)-R_{{f^{\star}}}(\pi)}\right|}={\left|{\mathbb{E}_{o\sim{\hat{f}}[\pi]}\left[\sum_{h=1}^{H}r_{h}\right]-\mathbb{E}_{o\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}r_{h}\right]}\right|} (211)
\displaystyle\leq\; |𝔼osaf^[π][h=1Hμ^(sh,ah)]𝔼osaf[π][h=1Hμ(sh,ah)]|,\displaystyle{\left|{\mathbb{E}_{o^{\rm sa}\sim{\hat{f}}[\pi]}\left[\sum_{h=1}^{H}\hat{\mu}(s_{h},a_{h})\right]-\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}\mu^{\star}(s_{h},a_{h})\right]}\right|}, (212)

where μ^\hat{\mu} and μ\mu^{\star} are the mean reward of instance f^{\hat{f}},f{f^{\star}} respectively. Since μ^[0,1],\hat{\mu}\in[0,1], we have

|𝔼osaf^[π][h=1Hμ^(sh,ah)]𝔼osaf[π][h=1Hμ(sh,ah)]|\displaystyle{\left|{\mathbb{E}_{o^{\rm sa}\sim{\hat{f}}[\pi]}\left[\sum_{h=1}^{H}\hat{\mu}(s_{h},a_{h})\right]-\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}\mu^{\star}(s_{h},a_{h})\right]}\right|} (213)
\displaystyle\leq\; |𝔼osaf^[π][h=1Hμ^(sh,ah)]𝔼osaf[π][h=1Hμ^(sh,ah)]|\displaystyle{\left|{\mathbb{E}_{o^{\rm sa}\sim{\hat{f}}[\pi]}\left[\sum_{h=1}^{H}\hat{\mu}(s_{h},a_{h})\right]-\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}\hat{\mu}(s_{h},a_{h})\right]}\right|} (214)
+|𝔼osaf[π][h=1Hμ^(sh,ah)]𝔼osaf[π][h=1Hμ(sh,ah)]|\displaystyle\quad+{\left|{\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}\hat{\mu}(s_{h},a_{h})\right]-\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}\mu^{\star}(s_{h},a_{h})\right]}\right|} (215)
\displaystyle\leq\; HDTV(f[π]f^[π])+𝔼osaf[π][h=1H|μ^(sh,ah)μ(sh,ah)|].\displaystyle HD_{\rm TV}({f^{\star}}[\pi]\|{\hat{f}}[\pi])+\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}|\hat{\mu}(s_{h},a_{h})-\mu^{\star}(s_{h},a_{h})|\right]. (216)

In the following we upper bound the second term of Eq. (216). Let ${f^{\star}}_{r},{\hat{f}}_{r}:{\mathcal{S}}\times{\mathcal{A}}\to\Delta(\mathbb{R})$ be the reward distributions of instances ${f^{\star}}$ and ${\hat{f}}$ respectively, and ${f^{\star}}_{r}[s,a](\cdot),{\hat{f}}_{r}[s,a](\cdot)$ the densities of the reward given a state-action pair $(s,a)$. By item (b) of Condition 4 (together with Pinsker's inequality), Hölder's inequality, and Jensen's inequality, we get

𝔼osaf[π][h=1H|μ^(sh,ah)μ(sh,ah)|]\displaystyle\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}|\hat{\mu}(s_{h},a_{h})-\mu^{\star}(s_{h},a_{h})|\right] (217)
\displaystyle\lesssim\; 𝔼osaf[π][h=1HDKL(fr[sh,ah]f^r[sh,ah])c8]\displaystyle\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}D_{\mathrm{KL}}({f^{\star}}_{r}[s_{h},a_{h}]\|{\hat{f}}_{r}[s_{h},a_{h}])^{c_{8}}\right] (218)
\displaystyle\leq\; 𝔼osaf[π][H1c8(h=1HDKL(fr[sh,ah]f^r[sh,ah]))c8]\displaystyle\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[H^{1-c_{8}}\left(\sum_{h=1}^{H}D_{\mathrm{KL}}({f^{\star}}_{r}[s_{h},a_{h}]\|{\hat{f}}_{r}[s_{h},a_{h}])\right)^{c_{8}}\right] (219)
\displaystyle\leq\; H1c8𝔼osaf[π][h=1HDKL(fr[sh,ah]f^r[sh,ah])]c8.\displaystyle H^{1-c_{8}}\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}D_{\mathrm{KL}}({f^{\star}}_{r}[s_{h},a_{h}]\|{\hat{f}}_{r}[s_{h},a_{h}])\right]^{c_{8}}. (220)

Recall that ${f^{\star}}_{p},{\hat{f}}_{p}$ denote the transition functions of instances ${f^{\star}},{\hat{f}}$ respectively. By the chain rule of KL divergence,

DKL(f[π]f^[π])\displaystyle D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi]) (221)
=\displaystyle=\; 𝔼osaf[π][h=1H(DKL(fp[sh,ah]f^p[sh,ah])+DKL(fr[sh,ah]f^r[sh,ah]))]\displaystyle\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}(D_{\mathrm{KL}}\left({f^{\star}}_{p}[s_{h},a_{h}]\|{\hat{f}}_{p}[s_{h},a_{h}]\right)+D_{\mathrm{KL}}({f^{\star}}_{r}[s_{h},a_{h}]\|{\hat{f}}_{r}[s_{h},a_{h}]))\right] (222)
\displaystyle\geq\; 𝔼osaf[π][h=1HDKL(fr[sh,ah]f^r[sh,ah])].\displaystyle\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}D_{\mathrm{KL}}({f^{\star}}_{r}[s_{h},a_{h}]\|{\hat{f}}_{r}[s_{h},a_{h}])\right]. (223)

By item (a) of this lemma, DKL(f[π]f^[π])0D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\to 0 as nn\to\infty. Therefore for large enough nn we get

|Rf^(π)Rf(π)|HDTV(f[π]f^[π])+H1c8DKL(f[π]f^[π])c82HDKL(f[π]f^[π])c8.\displaystyle{\left|{R_{{\hat{f}}}(\pi)-R_{{f^{\star}}}(\pi)}\right|}\lesssim HD_{\rm TV}({f^{\star}}[\pi]\|{\hat{f}}[\pi])+H^{1-c_{8}}D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])^{c_{8}}\leq 2HD_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])^{c_{8}}. (224)

Combining with item (a) of this lemma, we prove the desired result. ∎
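Item (b) works with the representation $R_{f}(\pi)=\mathbb{E}_{o^{\rm sa}\sim f[\pi]}\left[\sum_{h=1}^{H}\mu(s_{h},a_{h})\right]$, which for a layered tabular MDP can be evaluated by backward induction. A hedged sketch with hypothetical data structures:

```python
def expected_return(p, mu, policy, layers):
    """R_f(pi) = E[ sum_h mu(s_h, a_h) ] for a layered tabular MDP, computed by
    backward induction (illustrative sketch; `layers[h]` lists the states of layer
    h+1, `p[(s, a)]` is a dict of next-state probabilities, `mu[(s, a)]` the mean
    reward, `policy[s]` the chosen action, and layers[0][0] is the fixed s_1)."""
    # Last layer: only the immediate mean reward is collected.
    value = {s: mu[(s, policy[s])] for s in layers[-1]}
    # Earlier layers: immediate mean reward plus the expected continuation value.
    for states in reversed(layers[:-1]):
        value = {s: mu[(s, policy[s])]
                    + sum(prob * value[s_next]
                          for s_next, prob in p[(s, policy[s])].items())
                 for s in states}
    return value[layers[0][0]]
```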

Lemma E.3.

For tabular reinforcement learning with general reward distribution that satisfies Condition 4, there exists n0>0n_{0}>0 such that when n>n0n>n_{0}, the following holds.

  1. (a)

    Conditioned on the event π(f^)π(f)\pi^{\star}({\hat{f}})\neq\pi^{\star}(f^{\star}), Pr(acc)1/n\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm acc})\leq 1/n.

  2. (b)

    Conditioned on the event init{\mathcal{E}}_{\rm init}, Pr(acc)11/lnn\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm acc})\geq 1-1/\ln n.

  3. (c)

    The expected regret of Step 2 is always upper bounded by 𝒪(lnnlnlnn).{\mathcal{O}}(\ln n\ln\ln n).

  4. (d)

    Conditioned on the event init{\mathcal{E}}_{\rm init}, the expected regret of Step 2 is upper bounded by

    (𝒞(f,(lnlnn)1/4/2)+o(1))lnn.\left({\mathcal{C}}(f^{\star},(\ln\ln n)^{1/4}/2)+o(1)\right)\ln n.
Proof.

We prove the four items above separately.

Items (a), (c), and (d):

The proofs of items (a), (c), and (d) are the same as those of Lemma C.2.

Item (b):

Let ϵ=1/lnlnn,α=lnnm\epsilon=1/\ln\ln n,\alpha=\frac{\ln n}{m} and δ=1/(2lnn)\delta=1/(2\ln n). We prove this statement by invoking Lemma F.2 with parameters (α+5ϵ,ϵ,w)(\alpha+5\epsilon,\epsilon,w). Following Lemma F.2, let γ=1mminπi=1m𝕀[πi=π]\gamma=\frac{1}{m}\min_{\pi}\sum_{i=1}^{m}\mathbb{I}\left[\pi_{i}=\pi\right] and λ=λ0((α+5ϵ)/γ,ϵ,f)\lambda=\lambda_{0}((\alpha+5\epsilon)/\gamma,\epsilon,{f^{\star}}) be the value that satisfies Condition 1. Let

ϵ0=exp((α+5ϵ)/γ)(ϵλ)1/λ.\epsilon_{0}=\exp(-(\alpha+5\epsilon)/\gamma)(\epsilon\lambda)^{1/\lambda}.

First of all, we prove that the condition for Lemma F.2 holds. That is,

mpoly(|𝒮||𝒜|H)λϵ((ln(m/ϵ0))+ln(1/δ)).\displaystyle m\gtrsim\frac{\mathrm{poly}(|{\mathcal{S}}||{\mathcal{A}}|H)}{\lambda\epsilon}\left((\ln(m/\epsilon_{0}))+\ln(1/\delta)\right). (225)

Recall that $m=\sum_{\pi\in\Pi}\lceil\bar{w}_{\pi}\ln n\rceil$. As a result, $m\geq\frac{|\Pi|\ln n}{(\ln\ln n)^{1/4}}$. Now consider the RHS of Eq. (225). By the definition of $\bar{w}_{\pi}$ we get $\alpha\leq\frac{2}{|\Pi|}(\ln\ln n)^{1/4}$ and $\gamma^{-1}\leq 2|\Pi|(\ln\ln n)^{1/2}$. It follows from Condition 1 that $\lambda\geq\mathrm{poly}(1/\ln\ln n)$. By the definition of $\epsilon_{0}$ we get

ln(m/ϵ0)poly(lnlnn).\displaystyle\ln(m/\epsilon_{0})\lesssim\mathrm{poly}(\ln\ln n). (226)

Consequently, the RHS of Eq. (225) is at most poly(lnlnn)\mathrm{poly}(\ln\ln n), and Eq. (225) holds when nn is sufficiently large. By Lemma F.2 we get, with probability at least 11/(2lnn)1-1/(2\ln n),

i=1mlnf[πi](oi)g[πi](oi)(α+ϵ)m,g(w,f,α+5ϵ),\displaystyle\sum_{i=1}^{m}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq(\alpha+\epsilon)m,\quad\forall g\in{\mathcal{F}}(w,f^{\star},\alpha+5\epsilon), (227)

where (w,f,α+5ϵ)={g:DKLw(fg)α+5ϵ}{\mathcal{F}}(w,f^{\star},\alpha+5\epsilon)=\{g\in{\mathcal{F}}:D_{\mathrm{KL}}^{w}({f^{\star}}\|g)\geq\alpha+5\epsilon\}.

In the following, we prove that Eq. (227) implies accf^.{\mathcal{E}}_{\rm acc}^{{\hat{f}}}. Recall that accf^{\mathcal{E}}_{\rm acc}^{{\hat{f}}} is the event defined as follows:

accf^\displaystyle{\mathcal{E}}_{\rm acc}^{\hat{f}} =𝕀[gΛ(f^),i=1mlnf^[πi](oi)g[πi](oi)lnn].\displaystyle=\mathbb{I}\left[\forall g\in\Lambda({\hat{f}}),\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\ln n\right]. (228)

Next, we apply Lemma G.4 to any gΛ(f^).g\in\Lambda({\hat{f}}). To verify its condition, we have DTV(f[π]f^[π])DKL(f[π]f^[π])1/2(lnlnnlnn)c4/2D_{\rm TV}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\leq D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])^{1/2}\lesssim\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}/2} for all πΠ\pi\in\Pi by item (a) of Lemma E.2. Therefore when nn is large enough,

DKLw(fg)lnnm+5ϵ=α+5ϵ.\displaystyle D_{\mathrm{KL}}^{w}(f^{\star}\|g)\geq\frac{\ln n}{m}+5\epsilon=\alpha+5\epsilon. (229)

Then g(w,f,α+5ϵ)g\in{\mathcal{F}}(w,{f^{\star}},\alpha+5\epsilon). It follows from Eq. (227) that

i=1mlnf[πi](oi)g[πi](oi)(α+ϵ)m,gΛ(f^).\displaystyle\sum_{i=1}^{m}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq(\alpha+\epsilon)m,\quad\forall g\in\Lambda({\hat{f}}). (230)

By Lemma E.4, with probability at least 11/(2lnn)1-1/(2\ln n) we have

i=1mlnf^[πi](oi)f[πi](oi)m(lnlnnlnn)c4c6poly(lnlnn)ι(f).\displaystyle\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}\geq-m\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}c_{6}}\mathrm{poly}(\ln\ln n)\iota({f^{\star}}). (231)

For large enough nn, (lnlnnlnn)c4c6poly(lnlnn)ι(f)1lnlnn=ϵ.\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}c_{6}}\mathrm{poly}(\ln\ln n)\iota({f^{\star}})\leq\frac{1}{\ln\ln n}=\epsilon. As a result, combining Eq. (230) and Eq. (231), with probability 11/lnn1-1/\ln n we have

gΛ(f^),i=1mlnf^[πi](oi)g[πi](oi)αm=lnn.\displaystyle\forall g\in\Lambda({\hat{f}}),\quad\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\alpha m=\ln n. (232)

The following lemma is used in the proof of Lemma E.3.

Lemma E.4.

Let ${f^{\star}},{\hat{f}}$ be any fixed tabular RL instances with reward distributions that satisfy Condition 4. For any sequence of policies $\{\pi_{i}\}_{i=1}^{m}$, let $o_{i}\sim{f^{\star}}[\pi_{i}](\cdot),\forall i\in[m]$ be a sequence of random observations drawn from ${f^{\star}}$. Then there exist a constant $c_{6}>0$ and a quantity $\iota({f^{\star}})$ depending only on ${f^{\star}}$ such that, for any $\delta>0$, with probability at least $1-\delta$,

1mi=1mlnf^[πi](oi)f[πi](oi)>(maxπDKL(f[π]f^[π]))c6ι(f)polylog(mH/δ)\displaystyle\frac{1}{m}\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}>-\left(\max_{\pi}D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\right)^{c_{6}}\iota({f^{\star}})\mathrm{polylog}(mH/\delta) (233)

assuming maxπDKL(f[π]f^[π])1/2μmin(f)/2.\max_{\pi}D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])^{1/2}\leq\mu_{\rm min}({f^{\star}})/2.

Proof.

Recall that for reinforcement learning, an observation o=(s1,a1,r1,,sH,aH,rH)o=(s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H}) represents a trajectory, and osa=(s1,a1,s2,a2,,sH,aH)o^{\rm sa}=(s_{1},a_{1},s_{2},a_{2},\cdots,s_{H},a_{H}) denotes the state-action pairs in the trajectory. By algebraic manipulation, we get

i=1mlnf^[πi](oi)f[πi](oi)=i=1mlnf^[πi](oisa)f[πi](oisa)+i=1mlnf^[πi](oiroisa)f[πi](oiroisa).\displaystyle\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}=\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o^{\rm sa}_{i})}{{f^{\star}}[\pi_{i}](o^{\rm sa}_{i})}+\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o^{\rm r}_{i}\mid o^{\rm sa}_{i})}{{f^{\star}}[\pi_{i}](o^{\rm r}_{i}\mid o^{\rm sa}_{i})}. (234)

Recall that for the instance f{f^{\star}}, we have

μmin(f)=minπminosasupp(f[π]())f[π](osa).\displaystyle\mu_{\rm min}({f^{\star}})=\min_{\pi}\min_{o^{\rm sa}\in\operatorname{supp}({f^{\star}}[\pi](\cdot))}{f^{\star}}[\pi](o^{\rm sa}). (235)

Since both osao^{\rm sa} and π\pi are finite for tabular RL, we get μmin(f)>0.\mu_{\rm min}({f^{\star}})>0. On the one hand, for any osasupp(f[π])o^{\rm sa}\in\operatorname{supp}({f^{\star}}[\pi]), by Pinsker’s inequality we get

|f[π](osa)f^[π](osa)|DTV(f[π]f^[π])DKL(f[π]f^[π])1/2.\displaystyle{\left|{{f^{\star}}[\pi](o^{\rm sa})-{\hat{f}}[\pi](o^{\rm sa})}\right|}\leq D_{\rm TV}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\leq D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])^{1/2}. (236)

As a result

|lnf^[π](osa)f[π](osa)||ln(1+f^[π](osa)f[π](osa)f[π](osa))||ln(1+f^[π](osa)f[π](osa)μmin(f))|.\displaystyle{\left|{\ln\frac{{\hat{f}}[\pi](o^{\rm sa})}{{f^{\star}}[\pi](o^{\rm sa})}}\right|}\leq{\left|{\ln\left(1+\frac{{\hat{f}}[\pi](o^{\rm sa})-{f^{\star}}[\pi](o^{\rm sa})}{{f^{\star}}[\pi](o^{\rm sa})}\right)}\right|}\leq{\left|{\ln\left(1+\frac{{\hat{f}}[\pi](o^{\rm sa})-{f^{\star}}[\pi](o^{\rm sa})}{\mu_{\rm min}({f^{\star}})}\right)}\right|}.

When $\max_{\pi}D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])^{1/2}\leq\mu_{\rm min}({f^{\star}})/2$, applying the basic inequality $|\ln(1+x)|\leq 2|x|$ for all $|x|\leq 1/2$, we get for all $\pi\in\Pi$ and $o^{\rm sa}\in\operatorname{supp}({f^{\star}}[\pi](\cdot))$,

|ln(1+f^[π](osa)f[π](osa)μmin(f))|2μmin(f)|f[π](osa)f^[π](osa)|\displaystyle{\left|{\ln\left(1+\frac{{\hat{f}}[\pi](o^{\rm sa})-{f^{\star}}[\pi](o^{\rm sa})}{\mu_{\rm min}({f^{\star}})}\right)}\right|}\leq\frac{2}{\mu_{\rm min}({f^{\star}})}{\left|{{f^{\star}}[\pi](o^{\rm sa})-{\hat{f}}[\pi](o^{\rm sa})}\right|} (237)
\displaystyle\leq\; 2μmin(f)maxπDKL(f[π]f^[π])1/2.\displaystyle\frac{2}{\mu_{\rm min}({f^{\star}})}\max_{\pi}D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])^{1/2}. (238)

On the other hand, let fr[s,a]{f^{\star}}_{r}[s,a] and f^r[s,a]{\hat{f}}_{r}[s,a] be the reward distribution of instance f{f^{\star}} and f^{\hat{f}} given state-action pair (s,a)(s,a) respectively. Then

i=1mlnf^[πi](oiroisa)f[πi](oiroisa)=i=1mh=1Hlnf^r[si,h,ai,h](ri,h)fr[si,h,ai,h](ri,h).\displaystyle\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o^{\rm r}_{i}\mid o^{\rm sa}_{i})}{{f^{\star}}[\pi_{i}](o^{\rm r}_{i}\mid o^{\rm sa}_{i})}=\sum_{i=1}^{m}\sum_{h=1}^{H}\ln\frac{{\hat{f}}_{r}[s_{i,h},a_{i,h}](r_{i,h})}{{f^{\star}}_{r}[s_{i,h},a_{i,h}](r_{i,h})}. (239)

For any πΠ\pi\in\Pi, by the chain rule of KL divergence we get

DKL(f[π]f^[π])\displaystyle D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi]) (240)
=\displaystyle=\; 𝔼osaf[π][h=1H(DKL(fp[sh,ah]f^p[sh,ah])+DKL(fr[sh,ah]f^r[sh,ah]))]\displaystyle\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}(D_{\mathrm{KL}}\left({f^{\star}}_{p}[s_{h},a_{h}]\|{\hat{f}}_{p}[s_{h},a_{h}]\right)+D_{\mathrm{KL}}({f^{\star}}_{r}[s_{h},a_{h}]\|{\hat{f}}_{r}[s_{h},a_{h}]))\right] (241)
\displaystyle\geq\; μmin(f)DKL(fr[s,a]f^r[s,a])𝕀[(s,a)osa for some osasupp(f[π])].\displaystyle\mu_{\rm min}({f^{\star}})D_{\mathrm{KL}}({f^{\star}}_{r}[s,a]\|{\hat{f}}_{r}[s,a])\mathbb{I}\left[(s,a)\in o^{\rm sa}\text{ for some }o^{\rm sa}\in\operatorname{supp}({f^{\star}}[\pi])\right]. (242)

Because oif[πi]o_{i}\sim{f^{\star}}[\pi_{i}], for any i[m],h[H]i\in[m],h\in[H] we have (si,h,ai,h)oisa(s_{i,h},a_{i,h})\in o^{\rm sa}_{i} and oisasupp(f[πi])o^{\rm sa}_{i}\in\operatorname{supp}({f^{\star}}[\pi_{i}]). As a result, for all i[m],h[H]i\in[m],h\in[H], item (a) of Condition 4 implies that with probability at least 1δ/(mH)1-\delta/(mH)

\displaystyle{\left|{\ln\frac{{\hat{f}}_{r}[s_{i,h},a_{i,h}](r_{i,h})}{{f^{\star}}_{r}[s_{i,h},a_{i,h}](r_{i,h})}}\right|}\leq D_{\rm TV}({f^{\star}}_{r}[s_{i,h},a_{i,h}]\|{\hat{f}}_{r}[s_{i,h},a_{i,h}])^{c_{7}}\mathrm{polylog}(mH/\delta) (243)
\displaystyle\leq\; D_{\mathrm{KL}}({f^{\star}}_{r}[s_{i,h},a_{i,h}]\|{\hat{f}}_{r}[s_{i,h},a_{i,h}])^{c_{7}/2}\mathrm{polylog}(mH/\delta) (244)
\displaystyle\leq\; \frac{1}{\mu_{\rm min}({f^{\star}})^{c_{7}/2}}\max_{\pi}D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])^{c_{7}/2}\mathrm{polylog}(mH/\delta). (245)

Let $\epsilon=\frac{1}{\mu_{\rm min}({f^{\star}})^{c_{7}/2}}\left(\max_{\pi}D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\right)^{c_{7}/2}$. Applying a union bound over $i\in[m]$ and $h\in[H]$, we get

\displaystyle\mathop{\rm Pr}\nolimits_{o_{i}\sim{f^{\star}}[\pi_{i}],\forall i}\left(\frac{1}{m}\sum_{i=1}^{m}\sum_{h=1}^{H}{\left|{\ln\frac{{\hat{f}}_{r}[s_{i,h},a_{i,h}](r_{i,h})}{{f^{\star}}_{r}[s_{i,h},a_{i,h}](r_{i,h})}}\right|}\geq\epsilon H\mathrm{polylog}(mH/\delta)\right)\leq\delta. (246)

It follows that with probability at least 1δ1-\delta,

1mi=1mlnf^[πi](oi)f[πi](oi)\displaystyle\frac{1}{m}\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}
\displaystyle\geq\; 2μmin(f)maxπDKL(f[π]f^[π])1/2Hpolylog(mH/δ)1μmin(f)c7/2(maxπDKL(f[π]f^[π]))c7/2\displaystyle-\frac{2}{\mu_{\rm min}({f^{\star}})}\max_{\pi}D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])^{1/2}-H\mathrm{polylog}(mH/\delta)\frac{1}{\mu_{\rm min}({f^{\star}})^{c_{7}/2}}\left(\max_{\pi}D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\right)^{c_{7}/2}
\displaystyle\geq\; 2Hμmin(f)polylog(mH/δ)(maxπDKL(f[π]f^[π]))c7/2.\displaystyle-\frac{2H}{\mu_{\rm min}({f^{\star}})}\mathrm{polylog}(mH/\delta)\left(\max_{\pi}D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\right)^{c_{7}/2}.

Therefore, Eq. (233) is satisfied by setting

ι(f)=2Hμmin(f),c6=c7/2.\displaystyle\iota({f^{\star}})=\frac{2H}{\mu_{\rm min}({f^{\star}})},\quad c_{6}=c_{7}/2. (247)

E.2 Proof of Condition 4 for Gaussian Distribution

In this section, we prove that the reward distribution family ={𝒩(μ,1),μ[0,1]}{\mathcal{R}}=\{{\mathcal{N}}(\mu,1),\mu\in[0,1]\} satisfies Condition 4.

Proposition E.5.

Condition 4 holds for the reward distribution family ${\mathcal{R}}=\{{\mathcal{N}}(\mu,1):\mu\in[0,1]\}$.

Proof.

In the following we prove the four items of Condition 4 separately.

Item (a).

For any fixed ff\in{\mathcal{R}}, let μ\mu be the mean of ff. In other words, f=𝒩(μ,1).f={\mathcal{N}}(\mu,1). Then we have

Prxf(|x|>μ+2log(2/δ))δ.\displaystyle\mathop{\rm Pr}\nolimits_{x\sim f}\left(|x|>\mu+2\sqrt{\log(2/\delta)}\right)\leq\delta. (248)

By definition, for any g=𝒩(μg,1),g=𝒩(μg,1)g={\mathcal{N}}(\mu_{g},1),g^{\prime}={\mathcal{N}}(\mu_{g}^{\prime},1)\in{\mathcal{R}}, we have

\ln\frac{g(x)}{g^{\prime}(x)}=\frac{1}{2}\left((x-\mu_{g}^{\prime})^{2}-(x-\mu_{g})^{2}\right)=(\mu_{g}-\mu_{g}^{\prime})\left(x-(\mu_{g}+\mu_{g}^{\prime})/2\right).

Therefore when μg,μg[0,1]\mu_{g},\mu_{g}^{\prime}\in[0,1] we get

\displaystyle{\left|{\ln\frac{g(x)}{g^{\prime}(x)}}\right|}\leq|\mu_{g}-\mu_{g}^{\prime}|\left(|x|+1\right). (249)

As a result, with probability at least 1δ1-\delta we have

\displaystyle\forall g,g^{\prime}\in{\mathcal{R}},\quad{\left|{\ln\frac{g(x)}{g^{\prime}(x)}}\right|}\leq|\mu_{g}-\mu_{g}^{\prime}|\left(2+2\sqrt{\log(2/\delta)}\right). (250)

In addition, when μg,μg[0,1]\mu_{g},\mu_{g}^{\prime}\in[0,1] we have DTV(gg)110|μgμg|.D_{\rm TV}(g\|g^{\prime})\geq\frac{1}{10}|\mu_{g}-\mu_{g}^{\prime}|. As a result, item (a) holds with c7=1c_{7}=1.

Item (b).

Recall that for Gaussian distribution with unit variance, when μg,μg[0,1]\mu_{g},\mu_{g}^{\prime}\in[0,1] we have DTV(gg)110|μgμg|.D_{\rm TV}(g\|g^{\prime})\geq\frac{1}{10}|\mu_{g}-\mu_{g}^{\prime}|. Therefore item (b) holds with c8=1c_{8}=1.

Item (c).

Recall that for any $g={\mathcal{N}}(\mu_{g},1),g^{\prime}={\mathcal{N}}(\mu_{g}^{\prime},1)\in{\mathcal{R}}$, we have $\ln\frac{g(x)}{g^{\prime}(x)}=(\mu_{g}-\mu_{g}^{\prime})\left(x-(\mu_{g}+\mu_{g}^{\prime})/2\right)$. Write $c=(\mu_{g}-\mu_{g}^{\prime})/2$ and note that $x-(\mu_{g}+\mu_{g}^{\prime})/2=z+c$ with $z\sim{\mathcal{N}}(0,1)$ when $x\sim g$. Therefore, when $\mu_{g},\mu_{g}^{\prime}\in[0,1]$ we get

\mathbb{E}_{x\sim g}\left[\left(\ln\frac{g(x)}{g^{\prime}(x)}\right)^{4}\right]=(\mu_{g}-\mu_{g}^{\prime})^{4}\,\mathbb{E}_{x\sim g}\left[\left(x-(\mu_{g}+\mu_{g}^{\prime})/2\right)^{4}\right] (251)
=\;(\mu_{g}-\mu_{g}^{\prime})^{4}\left(3+6c^{2}+c^{4}\right)\leq 5, (252)

since $|c|\leq 1/2$ and $|\mu_{g}-\mu_{g}^{\prime}|\leq 1$.

Therefore, item (c) holds with $c_{M}=2$, because $5\leq c_{M}^{4}=16$.
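As a numerical sanity check on the constant above, the fourth moment can be estimated by Monte Carlo; a minimal sketch (the function name, sample size, and seed are illustrative):

```python
import random

def fourth_moment_log_ratio(mu_g, mu_gp, n_samples=200_000, seed=0):
    """Monte-Carlo estimate of E_{x ~ N(mu_g, 1)}[ (ln g(x)/g'(x))^4 ] using
    ln g(x)/g'(x) = (mu_g - mu_gp) * (x - (mu_g + mu_gp)/2); for means in [0, 1]
    the exact value is (mu_g - mu_gp)^4 * (3 + 6c^2 + c^4) <= 5, c = (mu_g - mu_gp)/2."""
    rng = random.Random(seed)
    d = mu_g - mu_gp
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(mu_g, 1.0)
        total += (d * (x - (mu_g + mu_gp) / 2.0)) ** 4
    return total / n_samples

print(fourth_moment_log_ratio(1.0, 0.0))   # close to 3 + 6/4 + 1/16 = 4.5625 <= 2**4
```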

Item (d).

For g=𝒩(μg,1),g=𝒩(μg,1)g={\mathcal{N}}(\mu_{g},1),g^{\prime}={\mathcal{N}}(\mu_{g}^{\prime},1)\in{\mathcal{R}}, we have

DTV(gg)DKL(gg)/2=|μgμg|2.\displaystyle D_{\rm TV}(g\|g^{\prime})\leq\sqrt{D_{\mathrm{KL}}(g\|g^{\prime})/2}=\frac{|\mu_{g}-\mu_{g}^{\prime}|}{2}. (253)

Therefore, we can set ${\mathcal{C}}({\mathcal{R}},\epsilon)=\{{\mathcal{N}}(k\epsilon,1):k\in\{0,1,\cdots,\lfloor 1/\epsilon\rfloor\}\}$, which is a subset of ${\mathcal{R}}$; for any $\mu\in[0,1]$, the nearest grid point $\mu^{\prime}$ satisfies $|\mu-\mu^{\prime}|\leq\epsilon$, hence $D_{\rm TV}({\mathcal{N}}(\mu,1)\|{\mathcal{N}}(\mu^{\prime},1))\leq\epsilon/2\leq\epsilon$. Then $\log|{\mathcal{C}}({\mathcal{R}},\epsilon)|\leq\log(\lfloor 1/\epsilon\rfloor+1)={\mathcal{O}}(\log(1/\epsilon))$. ∎

Appendix F Uniform Concentration

In this section, we present the uniform concentration results.

F.1 Uniform Concentration with \ell_{\infty} Covering

In this section we prove uniform concentration results with \ell_{\infty} covering. For two instances g,gg,g^{\prime}, define the \ell_{\infty} distance as

ggsupπ,o|g[π](o)g[π](o)|.\displaystyle\|g-g^{\prime}\|_{\infty}\triangleq\sup_{\pi,o}{\left|{g[\pi](o)-g^{\prime}[\pi](o)}\right|}. (254)

Let ${\mathcal{N}}({\mathcal{F}},\epsilon)$ be the proper covering number of ${\mathcal{F}}$ with respect to the $\ell_{\infty}$ distance. Then we have the following lemma.

Lemma F.1.

Consider any fixed α>0,0<ϵ<α/2\alpha>0,0<\epsilon<\alpha/2, list of decisions w=(π1,,πm)w=(\pi_{1},\cdots,\pi_{m}), ff\in{\mathcal{F}}. Let γ=1mminπΠi=1m𝕀[πi=π]\gamma=\frac{1}{m}\min_{\pi\in\Pi}\sum_{i=1}^{m}\mathbb{I}\left[\pi_{i}=\pi\right] and let λ=λ0(α/γ,ϵ,f)\lambda=\lambda_{0}(\alpha/\gamma,\epsilon,f) be the value that satisfies Condition 1. Let (w,f,α)={g:DKLw(fg)α}{\mathcal{F}}(w,f,\alpha)=\{g\in{\mathcal{F}}:D_{\mathrm{KL}}^{w}(f\|g)\geq\alpha\} and define

ϵ0=exp(α/γ)(ϵλ)1/λ3Vol.\epsilon_{0}=\frac{\exp(-\alpha/\gamma)(\epsilon\lambda)^{1/\lambda}}{3{\rm Vol}}.

Then for any δ>0\delta>0, when m1λϵ(ln𝒩(,ϵ0)+ln(1/δ))m\geq\frac{1}{\lambda\epsilon}\left(\ln{\mathcal{N}}\left({\mathcal{F}},\epsilon_{0}\right)+\ln(1/\delta)\right) we have

Proif[πi],i(g(w,f,α),i=1mlnf[πi](oi)g[πi](oi)(α4ϵ)m)1δ.\displaystyle\mathop{\rm Pr}\nolimits_{o_{i}\sim f[\pi_{i}],\forall i}\left(\forall g\in{\mathcal{F}}(w,f,\alpha),\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq(\alpha-4\epsilon)m\right)\geq 1-\delta. (255)
Proof.

For any $g\in{\mathcal{F}}$, we define an induced distribution $\hat{g}$ for all $\pi\in\Pi$:

g^[π](o)=g[π](o)+ϵ01+ϵ0Vol.\displaystyle\hat{g}[\pi](o)=\frac{g[\pi](o)+\epsilon_{0}}{1+\epsilon_{0}{\rm Vol}}. (256)

Because g^[π](o)0\hat{g}[\pi](o)\geq 0 and

\|\hat{g}[\pi]\|_{1}=\int\frac{g[\pi](o)+\epsilon_{0}}{1+\epsilon_{0}{\rm Vol}}\mathrm{d}o=\frac{1}{1+\epsilon_{0}{\rm Vol}}\int\left(g[\pi](o)+\epsilon_{0}\right)\mathrm{d}o=1,

the induced distribution g^[π]\hat{g}[\pi] is a valid distribution.
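The induced distribution simply mixes $g$ with an $\epsilon_{0}$-thick uniform layer and renormalizes, which is easy to see on a discretized density. A minimal sketch (the grid, spacing, and values are illustrative):

```python
def smooth_density(g_values, eps0, vol):
    """Pointwise version of Eq. (256): ghat = (g + eps0) / (1 + eps0 * Vol),
    applied to a density discretized on a grid of total volume `vol`."""
    return [(g + eps0) / (1.0 + eps0 * vol) for g in g_values]

# Illustrative check on a uniform density over [0, 1] with grid spacing 0.01.
spacing = 0.01
g_values = [1.0] * 100                       # integrates to 1
ghat = smooth_density(g_values, eps0=0.05, vol=1.0)
print(sum(v * spacing for v in ghat))        # still integrates to 1
```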

Properties of the induced covering.

Let Z=ϵ0VolZ=\epsilon_{0}{\rm Vol}. Consider the minimum ϵ0\epsilon_{0} covering set 𝒞(w,f,α){\mathcal{C}}(w,f,\alpha) of (w,f,α).{\mathcal{F}}(w,f,\alpha). For any g(w,f,α)g\in{\mathcal{F}}(w,f,\alpha), let g𝒞(w,f,α)g^{\prime}\in{\mathcal{C}}(w,f,\alpha) be its cover and g^\hat{g} the induced distribution of gg^{\prime}. Now, we prove that g^\hat{g} satisfies the following two properties:

  1. (a)

    For any πΠ\pi\in\Pi and osuppf[π]o\in\operatorname{supp}f[\pi], lnf[π](o)g[π](o)lnf[π](o)g^[π](o)ϵ,\ln\frac{f[\pi](o)}{g[\pi](o)}\geq\ln\frac{f[\pi](o)}{\hat{g}[\pi](o)}-\epsilon, and

  2. (b)

    D1λw(fg^)α2ϵD^{w}_{1-\lambda}(f\|\hat{g})\geq\alpha-2\epsilon for g(w,f,α).g\in{\mathcal{F}}(w,f,\alpha).

First we prove item (a). Since $g^{\prime}$ is an $\epsilon_{0}$-cover of $g$, we get $g^{\prime}[\pi](o)+\epsilon_{0}\geq g[\pi](o)$ for any $\pi\in\Pi$ and $o\in\operatorname{supp}f[\pi]$. As a result,

\displaystyle\ln\frac{f[\pi](o)}{g[\pi](o)}\geq\ln\frac{f[\pi](o)}{g^{\prime}[\pi](o)+\epsilon_{0}}=\ln\frac{f[\pi](o)}{\hat{g}[\pi](o)}-\ln(1+Z). (257)

By the basic inequality ln(1+x)x,x(1,)\ln(1+x)\leq x,\forall x\in(-1,\infty) we get

ln(1+Z)Z=ϵ0Volϵ.\displaystyle\ln(1+Z)\leq Z=\epsilon_{0}{\rm Vol}\leq\epsilon. (258)

As a result, item (a) follows directly. Now we prove item (b). By algebraic manipulation,

DTV(g[π]g^[π])|g^[π](o)g[π](o)|do=|g[π](o)+ϵ01+ϵ0Volg[π](o)|do\displaystyle D_{\rm TV}(g[\pi]\|\hat{g}[\pi])\leq\int{\left|{\hat{g}[\pi](o)-g[\pi](o)}\right|}\mathrm{d}o=\int{\left|{\frac{g^{\prime}[\pi](o)+\epsilon_{0}}{1+\epsilon_{0}{\rm Vol}}-g[\pi](o)}\right|}\mathrm{d}o (259)
\displaystyle\leq |g[π](o)+ϵ01+ϵ0Volg[π](o)ϵ0|do+|g[π](o)+ϵ0g[π](o)|do\displaystyle\;\int{\left|{\frac{g^{\prime}[\pi](o)+\epsilon_{0}}{1+\epsilon_{0}{\rm Vol}}-g^{\prime}[\pi](o)-\epsilon_{0}}\right|}\mathrm{d}o+\int{\left|{g^{\prime}[\pi](o)+\epsilon_{0}-g[\pi](o)}\right|}\mathrm{d}o (260)
\displaystyle\leq 11+ϵ0Vol|ϵ0Vol(g[π](o)+ϵ0)|do+2ϵ0Vol\displaystyle\;\frac{1}{1+\epsilon_{0}{\rm Vol}}\int{\left|{\epsilon_{0}{\rm Vol}\left(g^{\prime}[\pi](o)+\epsilon_{0}\right)}\right|}\mathrm{d}o+2\epsilon_{0}{\rm Vol} (261)
\displaystyle\leq ϵ0Vol+2ϵ0Vol=3ϵ0Vol=exp(α/γ)(ϵλ)1/λ.\displaystyle\;\epsilon_{0}{\rm Vol}+2\epsilon_{0}{\rm Vol}=3\epsilon_{0}{\rm Vol}=\exp(-\alpha/\gamma)(\epsilon\lambda)^{1/\lambda}. (262)

Applying Lemma G.5, we get item (b).

Uniform concentration via covering.

We define the induced covering set 𝒞^(w,f,α)={g^:g𝒞(w,f,α)}.\hat{\mathcal{C}}(w,f,\alpha)=\{\hat{g}:g^{\prime}\in{\mathcal{C}}(w,f,\alpha)\}. Then |𝒞^(w,f,α)||𝒞(w,f,α)|𝒩(,ϵ0).|\hat{\mathcal{C}}(w,f,\alpha)|\leq|{\mathcal{C}}(w,f,\alpha)|\leq{\mathcal{N}}({\mathcal{F}},\epsilon_{0}). Applying Lemma A.2 and union bound we get, with probability at least 1δ1-\delta,

i=1mlnf[πi](oi)g^[πi](oi)m(D1λw(fg^)ϵ),g^𝒞^(w,f,α).\displaystyle\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{\hat{g}[\pi_{i}](o_{i})}\geq m(D^{w}_{1-\lambda}(f\|\hat{g})-\epsilon),\quad\forall\hat{g}\in\hat{\mathcal{C}}(w,f,\alpha). (263)

By item (a) of the covering property,

i=1mlnf[πi](oi)g[πi](oi)i=1mlnf[πi](oi)g^[πi](oi)mϵ.\displaystyle\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{\hat{g}[\pi_{i}](o_{i})}-m\epsilon. (264)

By item (b) of the covering property,

D1λw(fg^)α2ϵ\displaystyle D^{w}_{1-\lambda}(f\|\hat{g})\geq\alpha-2\epsilon (265)

Combining Eqs. (263), (264), (265), with probability at least 1δ1-\delta,

i=1mlnf[πi](oi)g[πi](oi)m(α4ϵ).\displaystyle\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq m(\alpha-4\epsilon). (266)

F.2 Uniform Concentration for RL with General Reward Distribution

In the following, we present a uniform concentration lemma for tabular RL with general reward distributions.

Lemma F.2.

Let {\mathcal{F}} be the instance family representing tabular RL with general reward distribution that satisfies Condition 4. For any fixed α>0,0<ϵ<α/2\alpha>0,0<\epsilon<\alpha/2, list of decisions w=(π1,,πm)w=(\pi_{1},\cdots,\pi_{m}), ff\in{\mathcal{F}}, let γ=1mminπ:πwi=1m𝕀[πi=π]\gamma=\frac{1}{m}\min_{\pi:\pi\in w}\sum_{i=1}^{m}\mathbb{I}\left[\pi_{i}=\pi\right] and λ=λ0(α/γ,ϵ,f)\lambda=\lambda_{0}(\alpha/\gamma,\epsilon,f) be the value that satisfies Condition 1. Let (w,f,α)={g:DKLw(fg)α}{\mathcal{F}}(w,f,\alpha)=\{g\in{\mathcal{F}}:D_{\mathrm{KL}}^{w}(f\|g)\geq\alpha\} and define

ϵ0=exp(α/γ)(ϵλ)1/λ.\epsilon_{0}=\exp(-\alpha/\gamma)(\epsilon\lambda)^{1/\lambda}.

Then for any δ>0\delta>0, when mpoly(|𝒮||𝒜|H)λϵ((ln(m/ϵ0))+ln(1/δ))m\gtrsim\frac{\mathrm{poly}(|{\mathcal{S}}||{\mathcal{A}}|H)}{\lambda\epsilon}\left((\ln(m/\epsilon_{0}))+\ln(1/\delta)\right) we have

Proif[πi],i(g(w,f,α),i=1mlnf[πi](oi)g[πi](oi)(α4ϵ)m)1δ.\displaystyle\mathop{\rm Pr}\nolimits_{o_{i}\sim f[\pi_{i}],\forall i}\left(\forall g\in{\mathcal{F}}(w,f,\alpha),\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq(\alpha-4\epsilon)m\right)\geq 1-\delta. (267)
Proof.

Recall that for a tabular RL instance gg\in{\mathcal{F}}, we use gr[s,a]g_{r}[s,a] to denote its reward distribution given state-action pair (s,a)(s,a), and gp[s,a]g_{p}[s,a] its transition distribution. We prove this lemma using a covering argument.

The covering set.

Define

ϵ1=min{ϵ2H|𝒮|,ϵ2Hpolylog(2mH/δ),exp(α/γ)(ϵλ)1/λ4|𝒮|2|𝒜|}.\epsilon_{1}=\min\left\{\frac{\epsilon}{2H|{\mathcal{S}}|},\frac{\epsilon}{2H\mathrm{polylog}(2mH/\delta)},\frac{\exp(-\alpha/\gamma)(\epsilon\lambda)^{1/\lambda}}{4|{\mathcal{S}}|^{2}|{\mathcal{A}}|}\right\}.

Let 𝒞{\mathcal{C}}\subseteq{\mathcal{F}} be the minimum covering of {\mathcal{F}} such that for any gg\in{\mathcal{F}} (parameterized by p,μp,\mu), there exists gg^{\prime}\in{\mathcal{C}} (parameterized by p,μp^{\prime},\mu^{\prime}) such that

sups,a,s|gp[s,a](s)gp[s,a](s)|ϵ1,sups,aDTV(gr[s,a]gr[s,a])ϵ11/c7.\displaystyle\sup_{s,a,s^{\prime}}{\left|{g_{p}[s,a](s^{\prime})-g^{\prime}_{p}[s,a](s^{\prime})}\right|}\leq\epsilon_{1},\quad\sup_{s,a}D_{\rm TV}\left(g_{r}[s,a]\|g^{\prime}_{r}[s,a]\right)\leq\epsilon_{1}^{1/c_{7}}. (268)

By item (d) of Condition 4 and a standard covering argument for discrete distributions, we have ln|𝒞||𝒮||𝒜|ln(1/ϵ1).\ln|{\mathcal{C}}|\lesssim|{\mathcal{S}}||{\mathcal{A}}|\ln(1/\epsilon_{1}). For any g𝒞g^{\prime}\in{\mathcal{C}}, we consider the induced instance g^\hat{g} defined by

g^p[s,a](s)=gp[s,a](s)+ϵ11+ϵ1|𝒮|,g^r[s,a]()=gr[s,a]().\displaystyle\hat{g}_{p}[s,a](s^{\prime})=\frac{g^{\prime}_{p}[s,a](s^{\prime})+\epsilon_{1}}{1+\epsilon_{1}|{\mathcal{S}}|},\quad\hat{g}_{r}[s,a](\cdot)=g^{\prime}_{r}[s,a](\cdot). (269)
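
Analogously to Eq. (256), the smoothing in Eq. (269) can be sanity-checked numerically; the sketch below (ours, with S, A, eps1 as illustrative stand-ins for |S|, |A|, ϵ1) applies it to a random tabular transition kernel.

```python
import numpy as np

# A numerical sketch of the smoothing in Eq. (269) for a random transition kernel.
rng = np.random.default_rng(1)
S, A, eps1 = 6, 3, 1e-3
gp_cover = rng.dirichlet(np.ones(S), size=(S, A))   # g'_p[s, a](s')
gp_hat = (gp_cover + eps1) / (1 + eps1 * S)         # induced kernel, Eq. (269)

# Each smoothed row is a valid distribution over next states.
assert np.allclose(gp_hat.sum(axis=-1), 1.0)
# The per-step log-ratio against g'_p + eps1 is exactly -ln(1 + |S| eps1); together
# with g'_p + eps1 >= g_p (the covering property), this gives the first sum in Eq. (273).
assert np.allclose(np.log(gp_hat) - np.log(gp_cover + eps1), -np.log(1 + S * eps1))
```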

In the following, we prove that

  1. (a)

    the following concentration inequality holds:

    Proif[πi],i(g,i=1mlng^[πi](oi)g[πi](oi)mϵ)1δ/2.\displaystyle\mathop{\rm Pr}\nolimits_{o_{i}\sim f[\pi_{i}],\forall i}\left(\forall g\in{\mathcal{F}},\sum_{i=1}^{m}\ln\frac{\hat{g}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq-m\epsilon\right)\geq 1-\delta/2. (270)
  2. (b)

    D1λw(fg^)α2ϵD^{w}_{1-\lambda}(f\|\hat{g})\geq\alpha-2\epsilon for g(w,f,α).g\in{\mathcal{F}}(w,f,\alpha).

To prove (a), recall that oi={(si,h,ai,h,ri,h)}h=1Ho_{i}=\{(s_{i,h},a_{i,h},r_{i,h})\}_{h=1}^{H} represents a trajectory. Consequently,

i=1mlng^[πi](oi)g[πi](oi)\displaystyle\sum_{i=1}^{m}\ln\frac{\hat{g}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})} (271)
=\displaystyle= i=1mh=1Hlng^p[si,h,ai,h](si,h+1)gp[si,h,ai,h](si,h+1)+i=1mh=1Hlng^r[si,h,ai,h](ri,h)gr[si,h,ai,h](ri,h)\displaystyle\;\sum_{i=1}^{m}\sum_{h=1}^{H}\ln\frac{\hat{g}_{p}[s_{i,h},a_{i,h}](s_{i,h+1})}{g_{p}[s_{i,h},a_{i,h}](s_{i,h+1})}+\sum_{i=1}^{m}\sum_{h=1}^{H}\ln\frac{\hat{g}_{r}[s_{i,h},a_{i,h}](r_{i,h})}{g_{r}[s_{i,h},a_{i,h}](r_{i,h})} (272)
\displaystyle\geq i=1mh=1Hln(1+|𝒮|ϵ1)+i=1mh=1Hlng^r[si,h,ai,h](ri,h)gr[si,h,ai,h](ri,h).\displaystyle\;-\sum_{i=1}^{m}\sum_{h=1}^{H}\ln(1+|{\mathcal{S}}|\epsilon_{1})+\sum_{i=1}^{m}\sum_{h=1}^{H}\ln\frac{\hat{g}_{r}[s_{i,h},a_{i,h}](r_{i,h})}{g_{r}[s_{i,h},a_{i,h}](r_{i,h})}. (273)

By item (a) of Condition 4 and a union bound, we get that with probability at least 1δ/21-\delta/2,

i=1mh=1Hlng^r[si,h,ai,h](ri,h)gr[si,h,ai,h](ri,h)mHϵ1polylog(mH/δ)mϵ/2.\displaystyle\sum_{i=1}^{m}\sum_{h=1}^{H}\ln\frac{\hat{g}_{r}[s_{i,h},a_{i,h}](r_{i,h})}{g_{r}[s_{i,h},a_{i,h}](r_{i,h})}\geq-mH\epsilon_{1}\mathrm{polylog}(mH/\delta)\geq-m\epsilon/2. (274)

Therefore,

i=1mlng^[πi](oi)g[πi](oi)i=1mh=1H|𝒮|ϵ1mϵ/2mϵ,\displaystyle\sum_{i=1}^{m}\ln\frac{\hat{g}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq-\sum_{i=1}^{m}\sum_{h=1}^{H}|{\mathcal{S}}|\epsilon_{1}-m\epsilon/2\geq-m\epsilon, (275)

which proves (a). For (b), by algebraic manipulation we have

DTV(g[π]g^[π])s,a(DTV(p[s,a]g^p[s,a])+DTV(r[s,a]g^r[s,a]))\displaystyle D_{\rm TV}(g[\pi]\|\hat{g}[\pi])\leq\sum_{s,a}\left(D_{\rm TV}(p[s,a]\|\hat{g}_{p}[s,a])+D_{\rm TV}(r[s,a]\|\hat{g}_{r}[s,a])\right) (276)
\displaystyle\leq s,a,s|p[s,a](s)+ϵ11+ϵ1|𝒮|p[s,a](s)|+|𝒮||𝒜|ϵ1\displaystyle\;\sum_{s,a,s^{\prime}}{\left|{\frac{p^{\prime}[s,a](s^{\prime})+\epsilon_{1}}{1+\epsilon_{1}|{\mathcal{S}}|}-p[s,a](s^{\prime})}\right|}+|{\mathcal{S}}||{\mathcal{A}}|\epsilon_{1} (277)
\displaystyle\leq s,a,s|p[s,a](s)+ϵ11+ϵ1|𝒮|p[s,a](s)ϵ1|+s,a,s|p[s,a](s)+ϵ1p[s,a](s)|+|𝒮||𝒜|ϵ1\displaystyle\;\sum_{s,a,s^{\prime}}{\left|{\frac{p^{\prime}[s,a](s^{\prime})+\epsilon_{1}}{1+\epsilon_{1}|{\mathcal{S}}|}-p^{\prime}[s,a](s^{\prime})-\epsilon_{1}}\right|}+\sum_{s,a,s^{\prime}}{\left|{p^{\prime}[s,a](s^{\prime})+\epsilon_{1}-p[s,a](s^{\prime})}\right|}+|{\mathcal{S}}||{\mathcal{A}}|\epsilon_{1} (278)
\displaystyle\leq  4|𝒮|2|𝒜|ϵ1exp(α/γ)(ϵλ)1/λ.\displaystyle\;4|{\mathcal{S}}|^{2}|{\mathcal{A}}|\epsilon_{1}\leq\exp(-\alpha/\gamma)(\epsilon\lambda)^{1/\lambda}. (279)

Applying Lemma G.5, we get item (b).

Uniform concentration via covering.

Now we apply Lemma A.2 and union bound. When

m|𝒮||𝒜|ln(1/ϵ1)λϵln(|𝒞|/δ)λϵ,m\gtrsim\frac{|{\mathcal{S}}||{\mathcal{A}}|\ln(1/\epsilon_{1})}{\lambda\epsilon}\gtrsim\frac{\ln(|{\mathcal{C}}|/\delta)}{\lambda\epsilon},

with probability at least 1δ/21-\delta/2 we get

i=1mlnf[πi](oi)g^[πi](oi)m(D1λw(fg^)ϵ),g𝒞.\displaystyle\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{\hat{g}[\pi_{i}](o_{i})}\geq m(D^{w}_{1-\lambda}(f\|\hat{g})-\epsilon),\quad\forall g^{\prime}\in{\mathcal{C}}. (280)

By item (a) of the covering property, with probability at least 1δ/21-\delta/2,

i=1mlnf[πi](oi)g[πi](oi)i=1mlnf[πi](oi)g^[πi](oi)mϵ.\displaystyle\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{\hat{g}[\pi_{i}](o_{i})}-m\epsilon. (281)

By item (b) of the covering property,

D1λw(fg^)α2ϵ\displaystyle D^{w}_{1-\lambda}(f\|\hat{g})\geq\alpha-2\epsilon (282)

Combining Eqs. (280), (281), (282), with probability at least 1δ1-\delta,

i=1mlnf[πi](oi)g[πi](oi)m(α4ϵ).\displaystyle\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq m(\alpha-4\epsilon). (283)

Appendix G Helper Lemmas

The following lemma shows that the Rényi divergence between marginal distributions is no larger than the Rényi divergence between the original joint distributions.

Lemma G.1 (Theorem 1 of Van Erven and Harremos [2014]).

Consider any distributions f,gf,g over variables x,yx,y. Let f(x),g(x)f(x),g(x) denote their marginal distributions over xx, respectively. For any α(0,1)\alpha\in(0,1), we have

Dα(f(x)g(x))Dα(f(x,y)g(x,y)).\displaystyle D_{\alpha}(f(x)\|g(x))\leq D_{\alpha}(f(x,y)\|g(x,y)). (284)
Proof.

When α(0,1)\alpha\in(0,1), we have 1α1<0.\frac{1}{\alpha-1}<0. As a result, we only need to prove that

f(x,y)αg(x,y)1αdxyf(x)αg(x)1αdx.\displaystyle\int f(x,y)^{\alpha}g(x,y)^{1-\alpha}\mathrm{d}xy\leq\int f(x)^{\alpha}g(x)^{1-\alpha}\mathrm{d}x. (285)

We prove this by Hölder’s inequality. In particular,

f(x,y)αg(x,y)1αdxy\displaystyle\int f(x,y)^{\alpha}g(x,y)^{1-\alpha}\mathrm{d}xy (286)
=\displaystyle= f(x)αg(x)1αf(yx)αg(yx)1αdydx\displaystyle\int f(x)^{\alpha}g(x)^{1-\alpha}\int f(y\mid x)^{\alpha}g(y\mid x)^{1-\alpha}\mathrm{d}y\mathrm{d}x (287)
\displaystyle\leq f(x)αg(x)1α(f(yx)dy)α(g(yx)dy)1αdx\displaystyle\int f(x)^{\alpha}g(x)^{1-\alpha}\left(\int f(y\mid x)\mathrm{d}y\right)^{\alpha}\left(\int g(y\mid x)\mathrm{d}y\right)^{1-\alpha}\mathrm{d}x (288)
\displaystyle\leq f(x)αg(x)1αdx.\displaystyle\int f(x)^{\alpha}g(x)^{1-\alpha}\mathrm{d}x. (289)
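
As a quick numerical sanity check (not used anywhere in the argument), Eq. (284) can be verified on random discrete joint distributions; the sketch below uses arbitrary shapes and seeds of our choosing.

```python
import numpy as np

# Numerical check of Lemma G.1 on random discrete joint distributions over (x, y).
def renyi(p, q, a):
    # D_a(p||q) = ln(sum p^a q^(1-a)) / (a - 1), for a in (0, 1)
    return np.log(np.sum(p ** a * q ** (1 - a))) / (a - 1)

rng = np.random.default_rng(2)
f_xy = rng.dirichlet(np.ones(12)).reshape(3, 4)    # joint f(x, y)
g_xy = rng.dirichlet(np.ones(12)).reshape(3, 4)    # joint g(x, y)
f_x, g_x = f_xy.sum(axis=1), g_xy.sum(axis=1)      # marginals over x

for a in (0.1, 0.5, 0.9):
    # marginalization can only decrease the Renyi divergence
    assert renyi(f_x, g_x, a) <= renyi(f_xy.ravel(), g_xy.ravel(), a) + 1e-12
```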

Intuitively, the following two lemmas show that the Rényi divergence changes by at most ϵ\epsilon when one of its two arguments is perturbed slightly in TV distance.

Lemma G.2.

For any fixed λ(0,1),α>0,ϵ>0\lambda\in(0,1),\alpha>0,\epsilon>0 and any distribution ff, consider two distributions g,g^g,\hat{g} such that DTV(gg^)exp(α)(λϵ)1/λ.D_{\rm TV}(g\|\hat{g})\leq\exp(-\alpha)(\lambda\epsilon)^{1/\lambda}. Then we have

D1λ(fg^)min{α,D1λ(fg)}ϵ.\displaystyle D_{1-\lambda}(f\|\hat{g})\geq\min\{\alpha,D_{1-\lambda}(f\|g)\}-\epsilon. (290)
Proof.

Let κ=min{α,D1λ(fg)}.\kappa=\min\{\alpha,D_{1-\lambda}(f\|g)\}. We start by proving

|f(x)1λg(x)λdxf(x)1λg^(x)λdx|(exp(λϵ)1)exp(λκ).\displaystyle{\left|{\int f(x)^{1-\lambda}g(x)^{\lambda}\mathrm{d}x-\int f(x)^{1-\lambda}\hat{g}(x)^{\lambda}\mathrm{d}x}\right|}\leq\left(\exp(\lambda\epsilon)-1\right)\exp(-\lambda\kappa). (291)

By Hölder’s inequality we get

|f(x)1λg(x)λdxf(x)1λg^(x)λdx|\displaystyle{\left|{\int f(x)^{1-\lambda}g(x)^{\lambda}\mathrm{d}x-\int f(x)^{1-\lambda}\hat{g}(x)^{\lambda}\mathrm{d}x}\right|} (292)
\displaystyle\leq\; (|g(x)g^(x)|dx)λ(f(x)dx)1λ\displaystyle\left(\int{\left|{g(x)-\hat{g}(x)}\right|}\mathrm{d}x\right)^{\lambda}\left(\int f(x)\mathrm{d}x\right)^{1-\lambda} (293)
\displaystyle\leq\; exp(λα)λϵ\displaystyle\exp(-\lambda\alpha)\lambda\epsilon (294)
\displaystyle\leq\; exp(λα)(exp(λϵ)1)\displaystyle\exp(-\lambda\alpha)\left(\exp(\lambda\epsilon)-1\right) (295)
\displaystyle\leq\; exp(λκ)(exp(λϵ)1),\displaystyle\exp(-\lambda\kappa)\left(\exp(\lambda\epsilon)-1\right), (296)

where Eq. (295) follows from the basic inequality 1+xexp(x)1+x\leq\exp(x) for x>0.x>0.

By the definition of Rényi divergence,

f(x)1λg(x)λdx=exp(λD1λ(fg))exp(λκ).\displaystyle\int f(x)^{1-\lambda}g(x)^{\lambda}\mathrm{d}x=\exp(-\lambda D_{1-\lambda}(f\|g))\leq\exp(-\lambda\kappa). (297)

Combining Eq. (296) and Eq. (297) we get,

f(x)1λg^(x)λdxexp(λ(κϵ)).\displaystyle\int f(x)^{1-\lambda}\hat{g}(x)^{\lambda}\mathrm{d}x\leq\exp(-\lambda(\kappa-\epsilon)). (298)

It follows that

D1λ(fg^)=1λlnf(x)1λg^(x)λdxκϵ.\displaystyle D_{1-\lambda}(f\|\hat{g})=-\frac{1}{\lambda}\ln\int f(x)^{1-\lambda}\hat{g}(x)^{\lambda}\mathrm{d}x\geq\kappa-\epsilon. (299)
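
The conclusion of Lemma G.2 can likewise be checked numerically on discrete distributions. The sketch below (with parameters chosen arbitrarily by us) perturbs g within the allowed TV radius by mixing in another distribution, which keeps the perturbed distribution normalized by construction.

```python
import numpy as np

# Numerical sketch of Lemma G.2 on discrete distributions.
def renyi_1m(f, g, lam):
    # D_{1-lam}(f||g) = -(1/lam) * ln sum f^(1-lam) g^lam
    return -np.log(np.sum(f ** (1 - lam) * g ** lam)) / lam

rng = np.random.default_rng(3)
lam, alpha, eps = 0.3, 1.0, 0.1
f = rng.dirichlet(np.ones(8))
g = rng.dirichlet(np.ones(8))
u = rng.dirichlet(np.ones(8))

radius = np.exp(-alpha) * (lam * eps) ** (1 / lam)   # allowed L1 perturbation
t = radius / 2                                       # ||(1 - t) g + t u - g||_1 <= 2t
g_hat = (1 - t) * g + t * u                          # perturbed, still a distribution

assert np.abs(g_hat - g).sum() <= radius + 1e-15
assert renyi_1m(f, g_hat, lam) >= min(alpha, renyi_1m(f, g, lam)) - eps
```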

Lemma G.3.

For any fixed λ(0,1/2),α>0,ϵ>0\lambda\in(0,1/2),\alpha>0,\epsilon>0 and any distribution gg, consider two distributions f,f^f,\hat{f} such that DTV(ff^)exp(λ1λα)(λϵ)1/(1λ).D_{\rm TV}(f\|\hat{f})\leq\exp\left(-\frac{\lambda}{1-\lambda}\alpha\right)(\lambda\epsilon)^{1/(1-\lambda)}. Then we have

D1λ(f^g)min{α,D1λ(fg)}ϵ.\displaystyle D_{1-\lambda}(\hat{f}\|g)\geq\min\{\alpha,D_{1-\lambda}(f\|g)\}-\epsilon. (300)
Proof.

We use an argument similar to the proof of Lemma G.2. Let κ=min{α,D1λ(fg)}.\kappa=\min\{\alpha,D_{1-\lambda}(f\|g)\}. Then we have

|f(x)1λg(x)λdxf^(x)1λg(x)λdx|\displaystyle{\left|{\int f(x)^{1-\lambda}g(x)^{\lambda}\mathrm{d}x-\int\hat{f}(x)^{1-\lambda}g(x)^{\lambda}\mathrm{d}x}\right|} (301)
\displaystyle\leq\; (|f(x)f^(x)|dx)1λ(g(x)dx)λ\displaystyle\left(\int{\left|{f(x)-\hat{f}(x)}\right|}\mathrm{d}x\right)^{1-\lambda}\left(\int g(x)\mathrm{d}x\right)^{\lambda} (302)
\displaystyle\leq\; exp(λα)λϵ\displaystyle\exp\left(-\lambda\alpha\right)\lambda\epsilon (303)
\displaystyle\leq\; exp(λκ)(exp(λϵ)1),\displaystyle\exp(-\lambda\kappa)\left(\exp(\lambda\epsilon)-1\right), (304)

By the definition of Rényi divergence,

f(x)1λg(x)λdxexp(λD1λ(fg))exp(λκ).\displaystyle\int f(x)^{1-\lambda}g(x)^{\lambda}\mathrm{d}x\leq\exp(-\lambda D_{1-\lambda}(f\|g))\leq\exp(-\lambda\kappa). (305)

Combining Eq. (304) and Eq. (305) we get,

f^(x)1λg(x)λdxexp(λ(κϵ)).\displaystyle\int\hat{f}(x)^{1-\lambda}g(x)^{\lambda}\mathrm{d}x\leq\exp(-\lambda(\kappa-\epsilon)). (306)

It follows that

D1λ(f^g)=1λlnf^(x)1λg(x)λdxκϵ.\displaystyle D_{1-\lambda}(\hat{f}\|g)=-\frac{1}{\lambda}\ln\int\hat{f}(x)^{1-\lambda}g(x)^{\lambda}\mathrm{d}x\geq\kappa-\epsilon. (307)

The following lemma is used to prove that when f^{\hat{f}} is close to f{f^{\star}} (measured by TV distance), we can use f^{\hat{f}} to approximately solve 𝒞(f){\mathcal{C}}({f^{\star}}) (see Lemma C.2).

Lemma G.4.

Consider an instance ff\in{\mathcal{F}} and n>0n>0. Let δ=(lnlnn)1/4\delta=(\ln\ln n)^{-1/4} and ϵ=(lnlnn)1.\epsilon=(\ln\ln n)^{-1}. For any w^+|Π|\hat{w}\in\mathbb{R}^{|\Pi|}_{+} such that w^(lnlnn)1/4\|\hat{w}\|_{\infty}\leq(\ln\ln n)^{1/4}, let w={πi}i=1mw=\{\pi_{i}\}_{i=1}^{m} be the list of decisions in which every decision πΠ\pi\in\Pi occurs ((1+δ)w^π+δ)lnn\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil times, and m=π((1+δ)w^π+δ)lnnm=\sum_{\pi}\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil. Consider an instance f^{\hat{f}}\in{\mathcal{F}} such that there exists a constant c6>0c_{6}>0 with

DTV(f[π]f^[π])(lnlnnlnn)c6,πΠ.D_{\rm TV}(f[\pi]\|\hat{f}[\pi])\lesssim\left(\frac{\ln\ln n}{\ln n}\right)^{c_{6}},\forall\pi\in\Pi.

Define the set (w^,f)={g:πΠw^πDKL(f[π]g[π])1}.{\mathcal{F}}(\hat{w},f)=\{g\in{\mathcal{F}}:\sum_{\pi\in\Pi}\hat{w}_{\pi}D_{\mathrm{KL}}(f[\pi]\|g[\pi])\geq 1\}. Then for any constant c>0c>0, there exists n0>0n_{0}>0 such that for all n>n0n>n_{0},

DKLw(f^g)lnnm+cϵ,g(w^,f).\displaystyle D_{\mathrm{KL}}^{w}(\hat{f}\|g)\geq\frac{\ln n}{m}+c\epsilon,\quad\forall g\in{\mathcal{F}}(\hat{w},f). (308)
Proof of Lemma G.4.

First we invoke Condition 1 with proper parameters. Define

λ=λ0(4(lnlnn)3/4,1lnlnn,f)\displaystyle\lambda=\lambda_{0}\left(4(\ln\ln n)^{3/4},\frac{1}{\ln\ln n},f\right) (309)

By the definition of mm and the fact that w^(lnlnn)1/4\|\hat{w}\|_{\infty}\leq(\ln\ln n)^{1/4} we get

|Π|lnn(lnlnn)1/4m2|Π|lnn(lnlnn)1/4.\displaystyle\frac{|\Pi|\ln n}{(\ln\ln n)^{1/4}}\leq m\leq 2|\Pi|\ln n(\ln\ln n)^{1/4}. (310)

Let αlnnm+(c+2)ϵ2|Π|(lnlnn)1/4.\alpha\triangleq\frac{\ln n}{m}+(c+2)\epsilon\lesssim\frac{2}{|\Pi|}(\ln\ln n)^{1/4}. Consider γ=minπ1m((1+δ)w^π+δ)lnn\gamma=\min_{\pi}\frac{1}{m}\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil. By the upper bound of mm we get γ12|Π|(lnlnn)1/2.\gamma\geq\frac{1}{2|\Pi|(\ln\ln n)^{1/2}}.

We invoke Condition 1 with parameters (α/γ,ϵ,f)(\alpha/\gamma,\epsilon,f). For large enough nn we have α/γ4(lnlnn)3/4\alpha/\gamma\leq 4(\ln\ln n)^{3/4}, which implies that λ<λ0(α/γ,ϵ,f)\lambda<\lambda_{0}(\alpha/\gamma,\epsilon,f). Then for any πΠ\pi\in\Pi we get

D1λ(f[π]g[π])min{α/γ,DKL(f[π]g[π])ϵ}.\displaystyle D_{1-\lambda}(f[\pi]\|g[\pi])\geq\min\{\alpha/\gamma,D_{\mathrm{KL}}(f[\pi]\|g[\pi])-\epsilon\}. (311)

We claim that DKLw(fg)lnnm+(c+2)ϵ,g(w^,f)D_{\mathrm{KL}}^{w}(f\|g)\geq\frac{\ln n}{m}+(c+2)\epsilon,\forall g\in{\mathcal{F}}(\hat{w},f) for large enough nn. Indeed, by the definition of ww, the definition of (w^,f){\mathcal{F}}(\hat{w},f), and the fact that δlnn(c+2)mϵ\delta\ln n\geq(c+2)m\epsilon for large enough nn (which follows from Eq. (310) because mϵ2|Π|lnn(lnlnn)3/4m\epsilon\leq 2|\Pi|\ln n(\ln\ln n)^{-3/4} while δlnn=lnn(lnlnn)1/4\delta\ln n=\ln n(\ln\ln n)^{-1/4}), we get

mDKLw(fg)π((1+δ)w^π+δ)(lnn)DKL(f[π]g[π])(1+δ)lnnlnn+(c+2)mϵ.\displaystyle mD_{\mathrm{KL}}^{w}(f\|g)\geq\sum_{\pi}((1+\delta)\hat{w}_{\pi}+\delta)(\ln n)D_{\mathrm{KL}}(f[\pi]\|g[\pi])\geq(1+\delta)\ln n\geq\ln n+(c+2)m\epsilon. (312)

Let

ϵ1=exp(α/γ)(λϵ)2=Ω(exp((lnlnn)3/4)poly(lnlnn)).\epsilon_{1}=\exp(-\alpha/\gamma)(\lambda\epsilon)^{2}=\Omega\left(\exp(-(\ln\ln n)^{3/4})\mathrm{poly}(\ln\ln n)\right).

By Lemma G.3 with parameters (λ,α/γ,ϵ)(\lambda,\alpha/\gamma,\epsilon) and the assumption that DTV(f[π]f^[π])(lnlnnlnn)c6=o(ϵ1)D_{\rm TV}(f[\pi]\|\hat{f}[\pi])\lesssim\left(\frac{\ln\ln n}{\ln n}\right)^{c_{6}}=o(\epsilon_{1}) for all πΠ\pi\in\Pi, we get

D1λ(f^[π]g[π])min{α/γ,D1λ(f[π]g[π])}ϵ,πΠ.\displaystyle D_{1-\lambda}(\hat{f}[\pi]\|g[\pi])\geq\min\{\alpha/\gamma,D_{1-\lambda}(f[\pi]\|g[\pi])\}-\epsilon,\forall\pi\in\Pi. (313)

Now we consider the following two cases.

Case 1:

There exists πw\pi\in w such that DKL(f[π]g[π])α/γD_{\mathrm{KL}}(f[\pi]\|g[\pi])\geq\alpha/\gamma. In this case Eq. (311) implies that D1λ(f[π]g[π])α/γϵ.D_{1-\lambda}(f[\pi]\|g[\pi])\geq\alpha/\gamma-\epsilon. Combining with Eq. (313) we have

D1λ(f^[π]g[π])α/γ2ϵ.\displaystyle D_{1-\lambda}({\hat{f}}[\pi]\|g[\pi])\geq\alpha/\gamma-2\epsilon. (314)

As a result,

D1λw(f^g)1mi=1m𝕀[πi=π](α/γ2ϵ)α2ϵ=lnnm+cϵ.\displaystyle D^{w}_{1-\lambda}(\hat{f}\|g)\geq\frac{1}{m}\sum_{i=1}^{m}\mathbb{I}\left[\pi_{i}=\pi\right](\alpha/\gamma-2\epsilon)\geq\alpha-2\epsilon=\frac{\ln n}{m}+c\epsilon. (315)

Case 2:

For all πw\pi\in w, DKL(f[π]g[π])α/γD_{\mathrm{KL}}(f[\pi]\|g[\pi])\leq\alpha/\gamma. In this case we also have D1λ(f[π]g[π])α/γ,πwD_{1-\lambda}(f[\pi]\|g[\pi])\leq\alpha/\gamma,\;\forall\pi\in w, because the Rényi divergence is non-decreasing in its order. Therefore Eq. (313) and Eq. (311) imply that

D1λ(f^[π]g[π])\displaystyle D_{1-\lambda}({\hat{f}}[\pi]\|g[\pi]) D1λ(f[π]g[π])ϵ,\displaystyle\geq D_{1-\lambda}(f[\pi]\|g[\pi])-\epsilon, (316)
D1λ(f[π]g[π])\displaystyle D_{1-\lambda}(f[\pi]\|g[\pi]) DKL(f[π]g[π])ϵ.\displaystyle\geq D_{\mathrm{KL}}(f[\pi]\|g[\pi])-\epsilon. (317)

As a result,

D1λw(f^g)1mi=1m(D1λ(f[πi]g[πi])ϵ)1mi=1m(DKL(f[πi]g[πi])2ϵ)\displaystyle D^{w}_{1-\lambda}(\hat{f}\|g)\geq\frac{1}{m}\sum_{i=1}^{m}\left(D_{1-\lambda}(f[\pi_{i}]\|g[\pi_{i}])-\epsilon\right)\geq\frac{1}{m}\sum_{i=1}^{m}\left(D_{\mathrm{KL}}(f[\pi_{i}]\|g[\pi_{i}])-2\epsilon\right) (318)
=\displaystyle=\; DKLw(fg)2ϵlnnm+cϵ.\displaystyle D_{\mathrm{KL}}^{w}(f\|g)-2\epsilon\geq\frac{\ln n}{m}+c\epsilon. (319)

Combining the two cases, we get D1λw(f^g)lnnm+cϵ.D^{w}_{1-\lambda}(\hat{f}\|g)\geq\frac{\ln n}{m}+c\epsilon. Since the KL divergence upper bounds the Rényi divergence of order 1λ1-\lambda, Eq. (308) follows.

The following lemma is used to prove item (b) of the covering property in the proofs of Lemma F.1 and Lemma F.2.

Lemma G.5.

Consider any ϵ>0,α>0\epsilon>0,\alpha>0, sequence of decisions w={πi}i=1mw=\{\pi_{i}\}_{i=1}^{m}, and instance ff\in{\mathcal{F}}. Let γ=1mminπ:πwi=1m𝕀[πi=π]\gamma=\frac{1}{m}\min_{\pi:\pi\in w}\sum_{i=1}^{m}\mathbb{I}\left[\pi_{i}=\pi\right] and λ=λ0(α/γ,ϵ,f)\lambda=\lambda_{0}(\alpha/\gamma,\epsilon,f) be the value that satisfies Condition 1. For two instances g,g^g,\hat{g} such that

DTV(g[π]g^[π])exp(α/γ)(λϵ)1/λ,πΠ,D_{\rm TV}(g[\pi]\|\hat{g}[\pi])\leq\exp(-\alpha/\gamma)(\lambda\epsilon)^{1/\lambda},\forall\pi\in\Pi,

we have

D1λw(fg^)min{DKLw(fg),α}2ϵ.\displaystyle D^{w}_{1-\lambda}(f\|\hat{g})\geq\min\{D_{\mathrm{KL}}^{w}(f\|g),\alpha\}-2\epsilon. (320)
Proof of Lemma G.5.

Let ϵ1=exp(α/γ)(λϵ)1/λ\epsilon_{1}=\exp(-\alpha/\gamma)(\lambda\epsilon)^{1/\lambda} and κ=min{DKLw(fg),α}\kappa=\min\{D_{\mathrm{KL}}^{w}(f\|g),\alpha\}.

By Lemma G.2 and the fact that DTV(g[π]g^[π])ϵ1,πΠD_{\rm TV}(g[\pi]\|\hat{g}[\pi])\leq\epsilon_{1},\forall\pi\in\Pi, for any πΠ\pi\in\Pi we have

D1λ(f[π]g^[π])min{α/γ,D1λ(f[π]g[π])}ϵ.\displaystyle D_{1-\lambda}(f[\pi]\|\hat{g}[\pi])\geq\min\{\alpha/\gamma,D_{1-\lambda}(f[\pi]\|g[\pi])\}-\epsilon. (321)

Applying Condition 1, for any πΠ\pi\in\Pi we have

D1λ(f[π]g[π])min{α/γ,DKL(f[π]g[π])ϵ}.\displaystyle D_{1-\lambda}(f[\pi]\|g[\pi])\geq\min\{\alpha/\gamma,D_{\mathrm{KL}}(f[\pi]\|g[\pi])-\epsilon\}. (322)

Now we consider the following two cases.

Case 1:

There exists πw\pi\in w such that DKL(f[π]g[π])κ/γD_{\mathrm{KL}}(f[\pi]\|g[\pi])\geq\kappa/\gamma. By the definition of κ\kappa we have κα\kappa\leq\alpha. In this case Eq. (322) implies that D1λ(f[π]g[π])κ/γϵ.D_{1-\lambda}(f[\pi]\|g[\pi])\geq\kappa/\gamma-\epsilon. Combining with Eq. (321) we have

D1λ(f[π]g^[π])κ/γ2ϵ.\displaystyle D_{1-\lambda}(f[\pi]\|\hat{g}[\pi])\geq\kappa/\gamma-2\epsilon. (323)

As a result,

D1λw(fg^)1mi=1m𝕀[πi=π](κ/γ2ϵ)κ2ϵ.\displaystyle D^{w}_{1-\lambda}(f\|\hat{g})\geq\frac{1}{m}\sum_{i=1}^{m}\mathbb{I}\left[\pi_{i}=\pi\right](\kappa/\gamma-2\epsilon)\geq\kappa-2\epsilon. (324)

Case 2:

For all πw\pi\in w, DKL(f[π]g[π])κ/γD_{\mathrm{KL}}(f[\pi]\|g[\pi])\leq\kappa/\gamma. In this case we also have D1λ(f[π]g[π])κ/γ,πwD_{1-\lambda}(f[\pi]\|g[\pi])\leq\kappa/\gamma,\;\forall\pi\in w, because the Rényi divergence is non-decreasing in its order. Therefore Eq. (321) and Eq. (322) imply that

D1λ(f[π]g^[π])\displaystyle D_{1-\lambda}(f[\pi]\|\hat{g}[\pi]) D1λ(f[π]g[π])ϵ,\displaystyle\geq D_{1-\lambda}(f[\pi]\|g[\pi])-\epsilon, (325)
D1λ(f[π]g[π])\displaystyle D_{1-\lambda}(f[\pi]\|g[\pi]) DKL(f[π]g[π])ϵ.\displaystyle\geq D_{\mathrm{KL}}(f[\pi]\|g[\pi])-\epsilon. (326)

As a result,

D1λw(fg^)1mi=1m(D1λ(f[πi]g[πi])ϵ)1mi=1m(DKL(f[πi]g[πi])2ϵ)\displaystyle D^{w}_{1-\lambda}(f\|\hat{g})\geq\frac{1}{m}\sum_{i=1}^{m}\left(D_{1-\lambda}(f[\pi_{i}]\|g[\pi_{i}])-\epsilon\right)\geq\frac{1}{m}\sum_{i=1}^{m}\left(D_{\mathrm{KL}}(f[\pi_{i}]\|g[\pi_{i}])-2\epsilon\right) (327)
=\displaystyle=\; DKLw(fg)2ϵκ2ϵ.\displaystyle D_{\mathrm{KL}}^{w}(f\|g)-2\epsilon\geq\kappa-2\epsilon. (328)

Combining the two cases, we get D1λw(fg^)κ2ϵ.D^{w}_{1-\lambda}(f\|\hat{g})\geq\kappa-2\epsilon.

The following lemma shows that for truncated Gaussian distributions, the pointwise difference between their density functions can be upper bounded by a power of their KL divergence. We use this lemma to prove Condition 3 for tabular RL with truncated Gaussian rewards.

Lemma G.6.

Consider two truncated Gaussian distributions p1,p2p_{1},p_{2} with density

pi(x)=𝕀[x[2,2]]1Ziexp((xμi)22),\displaystyle p_{i}(x)=\mathbb{I}\left[x\in[-2,2]\right]\frac{1}{Z_{i}}\exp\left(-\frac{(x-\mu_{i})^{2}}{2}\right), (329)

where ZiZ_{i} is the normalization factor. Assuming μ1,μ2[1,1]\mu_{1},\mu_{2}\in[-1,1], we have

supx[2,2]|p1(x)p2(x)|36DKL(p1p2)1/6.\displaystyle\sup_{x\in[-2,2]}{\left|{p_{1}(x)-p_{2}(x)}\right|}\leq 36D_{\mathrm{KL}}(p_{1}\|p_{2})^{1/6}. (330)
Proof.

We first prove that |μ1μ2|DKL(p1p2)1/6.{\left|{\mu_{1}-\mu_{2}}\right|}\lesssim D_{\mathrm{KL}}(p_{1}\|p_{2})^{1/6}. Then we show that supx[2,2]|p1(x)p2(x)||μ1μ2|\sup_{x\in[-2,2]}{\left|{p_{1}(x)-p_{2}(x)}\right|}\lesssim{\left|{\mu_{1}-\mu_{2}}\right|}.

By Pinsker’s inequality,

DTV(p1p2)DKL(p1p2)1/2.\displaystyle D_{\rm TV}(p_{1}\|p_{2})\lesssim D_{\mathrm{KL}}(p_{1}\|p_{2})^{1/2}. (331)

Now we prove that |μ1μ2|3DTV(p1p2).{\left|{\mu_{1}-\mu_{2}}\right|}^{3}\lesssim D_{\rm TV}(p_{1}\|p_{2}). W.l.o.g. (relabeling the two distributions and reflecting xxx\mapsto-x if necessary), we assume Z1Z22πZ_{1}\leq Z_{2}\leq\sqrt{2\pi} and μ1μ2\mu_{1}\leq\mu_{2}; note that Zi2πZ_{i}\leq\sqrt{2\pi} always holds. Then we have, for any x[μ1,μ1+14(μ2μ1)]x\in[\mu_{1},\mu_{1}+\frac{1}{4}(\mu_{2}-\mu_{1})],

p1(x)p2(x)\displaystyle p_{1}(x)-p_{2}(x) 12π[exp((μ2μ1)232)exp(9(μ2μ1)232)]\displaystyle\geq\frac{1}{\sqrt{2\pi}}\left[\exp(-\frac{(\mu_{2}-\mu_{1})^{2}}{32})-\exp(-\frac{9(\mu_{2}-\mu_{1})^{2}}{32})\right] (332)
12πexp(9(μ2μ1)232)(exp(14(μ2μ1)2)1)\displaystyle\geq\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{9(\mu_{2}-\mu_{1})^{2}}{32}\right)\left(\exp(\frac{1}{4}(\mu_{2}-\mu_{1})^{2})-1\right) (333)
14e22π(μ2μ1)2.\displaystyle\geq\frac{1}{4e^{2}\sqrt{2\pi}}(\mu_{2}-\mu_{1})^{2}. (334)

As a result,

DTV(p1p2)14(μ2μ1)(p1(x)p2(x))|μ2μ1|3.\displaystyle D_{\rm TV}(p_{1}\|p_{2})\geq\frac{1}{4}(\mu_{2}-\mu_{1})(p_{1}(x)-p_{2}(x))\gtrsim{\left|{\mu_{2}-\mu_{1}}\right|}^{3}. (335)

Now we prove that supx[2,2]|p1(x)p2(x)||μ1μ2|\sup_{x\in[-2,2]}{\left|{p_{1}(x)-p_{2}(x)}\right|}\lesssim{\left|{\mu_{1}-\mu_{2}}\right|}. By definition, for any x[2,2]x\in[-2,2] we have

|p1(x)p2(x)|\displaystyle{\left|{p_{1}(x)-p_{2}(x)}\right|} =|1Z1exp((xμ1)22)1Z2exp((xμ2)22)|\displaystyle={\left|{\frac{1}{Z_{1}}\exp\left(-\frac{(x-\mu_{1})^{2}}{2}\right)-\frac{1}{Z_{2}}\exp\left(-\frac{(x-\mu_{2})^{2}}{2}\right)}\right|} (336)
|1/Z11/Z2|+1Z1|exp((xμ1)22)exp((xμ2)22)|,\displaystyle\leq{\left|{1/Z_{1}-1/Z_{2}}\right|}+\frac{1}{Z_{1}}{\left|{\exp\left(-\frac{(x-\mu_{1})^{2}}{2}\right)-\exp\left(-\frac{(x-\mu_{2})^{2}}{2}\right)}\right|}, (337)
4|Z1Z2|+2|exp((xμ1)22)exp((xμ2)22)|,\displaystyle\leq 4{\left|{Z_{1}-Z_{2}}\right|}+2{\left|{\exp\left(-\frac{(x-\mu_{1})^{2}}{2}\right)-\exp\left(-\frac{(x-\mu_{2})^{2}}{2}\right)}\right|}, (338)

where the last inequality comes from the fact that Zi1/2Z_{i}\geq 1/2 when |μi|1.|\mu_{i}|\leq 1. For the second term,

|exp((xμ1)22)exp((xμ2)22)|\displaystyle{\left|{\exp\left(-\frac{(x-\mu_{1})^{2}}{2}\right)-\exp\left(-\frac{(x-\mu_{2})^{2}}{2}\right)}\right|} (339)
=\displaystyle=\; exp((xμ2)22)|exp((μ2μ1)(μ2+μ12x)/2)1|\exp\left(-\frac{(x-\mu_{2})^{2}}{2}\right){\left|{\exp((\mu_{2}-\mu_{1})(\mu_{2}+\mu_{1}-2x)/2)-1}\right|} (340)
\displaystyle\leq\; 2|μ2μ1|.\displaystyle 2|\mu_{2}-\mu_{1}|. (341)

For the first term, integrating the pointwise bound in Eq. (341) over [2,2][-2,2] gives

|Z1Z2|x=22|exp((xμ1)22)exp((xμ2)22)|dx8|μ2μ1|.\displaystyle{\left|{Z_{1}-Z_{2}}\right|}\leq\int_{x=-2}^{2}{\left|{\exp\left(-\frac{(x-\mu_{1})^{2}}{2}\right)-\exp\left(-\frac{(x-\mu_{2})^{2}}{2}\right)}\right|}\mathrm{d}x\leq 8|\mu_{2}-\mu_{1}|. (342)

As a result, we get

|p1(x)p2(x)|36|μ2μ1|.\displaystyle{\left|{p_{1}(x)-p_{2}(x)}\right|}\leq 36|\mu_{2}-\mu_{1}|. (343)

Combining Eqs. (331), (335) and (343), we prove this lemma. ∎
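
As a numerical illustration of Eq. (330) (again not used in the argument), the sketch below evaluates both sides on a grid over [-2, 2] for random pairs of means in [-1, 1]; the grid resolution is an arbitrary choice of ours.

```python
import numpy as np

# Numerical sketch of Lemma G.6 for truncated Gaussians on [-2, 2].
xs = np.linspace(-2, 2, 4001)
dx = xs[1] - xs[0]

def trunc_gauss(mu):
    dens = np.exp(-(xs - mu) ** 2 / 2)
    return dens / (dens.sum() * dx)       # normalize on the grid

rng = np.random.default_rng(4)
for _ in range(5):
    mu1, mu2 = rng.uniform(-1, 1, size=2)
    p1, p2 = trunc_gauss(mu1), trunc_gauss(mu2)
    kl = np.sum(p1 * np.log(p1 / p2)) * dx             # grid approximation of KL
    assert np.max(np.abs(p1 - p2)) <= 36 * kl ** (1 / 6)
```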

The following lemma can be viewed as a perturbation analysis of the covariance matrix in the linear bandit setting. We use this lemma to prove that our complexity measure recovers that of Lattimore and Szepesvari [2017].

Lemma G.7.

For a fixed positive definite matrix Ad×dA\in\mathbb{R}^{d\times d} and a unit vector xdx\in\mathbb{R}^{d}, let G1=(A+nxx)1G_{1}=(A+nxx^{\top})^{-1} and G2=limρ(A+ρxx)1.G_{2}=\lim_{\rho\to\infty}(A+\rho xx^{\top})^{-1}. Then

G1G22σmax(A1)1+nσmin(A1).\displaystyle\|G_{1}-G_{2}\|_{2}\leq\frac{\sigma_{\rm max}(A^{-1})}{1+n\sigma_{\rm min}(A^{-1})}. (344)
Proof.

By the Sherman–Morrison formula we get

G1\displaystyle G_{1} =A1nA1xxA11+nxA1x,\displaystyle=A^{-1}-\frac{nA^{-1}xx^{\top}A^{-1}}{1+nx^{\top}A^{-1}x}, (345)
G2\displaystyle G_{2} =limρA1ρA1xxA11+ρxA1x.\displaystyle=\lim_{\rho\to\infty}A^{-1}-\frac{\rho A^{-1}xx^{\top}A^{-1}}{1+\rho x^{\top}A^{-1}x}. (346)

Then for any vdv\in\mathbb{R}^{d} such that v2=1\|v\|_{2}=1, we get

v(G1G2)v\displaystyle v^{\top}(G_{1}-G_{2})v =limρρvA1xxA1v1+ρxA1xnvA1xxA1v1+nxA1x\displaystyle=\lim_{\rho\to\infty}\frac{\rho v^{\top}A^{-1}xx^{\top}A^{-1}v}{1+\rho x^{\top}A^{-1}x}-\frac{nv^{\top}A^{-1}xx^{\top}A^{-1}v}{1+nx^{\top}A^{-1}x} (347)
=limρ(ρn)vA1xxA1v(1+nxA1x)(1+ρxA1x)\displaystyle=\lim_{\rho\to\infty}\frac{(\rho-n)v^{\top}A^{-1}xx^{\top}A^{-1}v}{(1+nx^{\top}A^{-1}x)(1+\rho x^{\top}A^{-1}x)} (348)
=vA1xxA1vxA1x(1+nxA1x)\displaystyle=\frac{v^{\top}A^{-1}xx^{\top}A^{-1}v}{x^{\top}A^{-1}x(1+nx^{\top}A^{-1}x)} (349)
vA1v1+nxA1x\displaystyle\leq\frac{v^{\top}A^{-1}v}{1+nx^{\top}A^{-1}x} (350)
σmax(A1)1+nσmin(A1).\displaystyle\leq\frac{\sigma_{\rm max}(A^{-1})}{1+n\sigma_{\rm min}(A^{-1})}. (351)
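
The bound in Eq. (344) is straightforward to check numerically; the sketch below (with dimensions and seeds chosen by us) approximates G2 by taking ρ very large rather than computing the exact limit.

```python
import numpy as np

# Numerical sketch of Lemma G.7 via the same Sherman-Morrison computation.
rng = np.random.default_rng(5)
d, n = 5, 50
B = rng.normal(size=(d, d))
A = B @ B.T + np.eye(d)                         # positive definite
x = rng.normal(size=d)
x /= np.linalg.norm(x)                          # unit vector

G1 = np.linalg.inv(A + n * np.outer(x, x))
G2 = np.linalg.inv(A + 1e9 * np.outer(x, x))    # proxy for the rho -> infinity limit

eigs_Ainv = np.linalg.eigvalsh(np.linalg.inv(A))
bound = eigs_Ainv.max() / (1 + n * eigs_Ainv.min())
assert np.linalg.norm(G1 - G2, 2) <= bound + 1e-6   # spectral-norm comparison
```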

The following lemma upper bounds the difference between KL divergence and Rényi divergence, and is used to prove Condition 1.

Lemma G.8.

For any two distributions f,gf,g and any constant λ(0,1/2),\lambda\in(0,1/2), we have

DKL(fg)D1λ(fg)λ2𝔼of[(lnf(o)g(o))4]1/2.\displaystyle D_{\mathrm{KL}}(f\|g)-D_{1-\lambda}(f\|g)\leq\frac{\lambda}{2}\mathbb{E}_{o\sim f}\left[\left(\ln\frac{f(o)}{g(o)}\right)^{4}\right]^{1/2}. (352)
Proof.

Recall that

D1λ(fg)=1λlnf(o)1λg(o)λdo.D_{1-\lambda}(f\|g)=-\frac{1}{\lambda}\ln\int f(o)^{1-\lambda}g(o)^{\lambda}\mathrm{d}o.

Define the function h(λ)f(o)1λg(o)λdo.h(\lambda)\triangleq\int f(o)^{1-\lambda}g(o)^{\lambda}\mathrm{d}o. By basic algebra we get

h(λ)\displaystyle h^{\prime}(\lambda) =f(o)1λg(o)λlng(o)f(o)do\displaystyle=\int f(o)^{1-\lambda}g(o)^{\lambda}\ln\frac{g(o)}{f(o)}\mathrm{d}o (353)
h′′(λ)\displaystyle h^{\prime\prime}(\lambda) =f(o)1λg(o)λln2g(o)f(o)do.\displaystyle=\int f(o)^{1-\lambda}g(o)^{\lambda}\ln^{2}\frac{g(o)}{f(o)}\mathrm{d}o. (354)

By Taylor expansion, there exists ξ(0,λ)\xi\in(0,\lambda) such that

h(λ)=h(0)+λh(0)+λ22h′′(ξ).\displaystyle h(\lambda)=h(0)+\lambda h^{\prime}(0)+\frac{\lambda^{2}}{2}h^{\prime\prime}(\xi). (355)

By definition we have h(0)=1h(0)=1 and h(0)=DKL(fg).h^{\prime}(0)=-D_{\mathrm{KL}}(f\|g). As a result, we get

D1λ(fg)=1λlnh(λ)\displaystyle D_{1-\lambda}(f\|g)=-\frac{1}{\lambda}\ln h(\lambda) (356)
=\displaystyle=\; 1λln(1λDKL(fg)+λ22f(o)1ξg(o)ξln2g(o)f(o)do)-\frac{1}{\lambda}\ln\left(1-\lambda D_{\mathrm{KL}}(f\|g)+\frac{\lambda^{2}}{2}\int f(o)^{1-\xi}g(o)^{\xi}\ln^{2}\frac{g(o)}{f(o)}\mathrm{d}o\right) (357)
\displaystyle\geq\; DKL(fg)λ2f(o)1ξg(o)ξln2g(o)f(o)do.\displaystyle D_{\mathrm{KL}}(f\|g)-\frac{\lambda}{2}\int f(o)^{1-\xi}g(o)^{\xi}\ln^{2}\frac{g(o)}{f(o)}\mathrm{d}o. (358)

By Hölder’s inequality, when ξ<λ<1/2\xi<\lambda<1/2 we get

f(o)1ξg(o)ξln2g(o)f(o)do\displaystyle\int f(o)^{1-\xi}g(o)^{\xi}\ln^{2}\frac{g(o)}{f(o)}\mathrm{d}o (359)
=\displaystyle=\; 𝔼of[(g(o)f(o))ξln2g(o)f(o)]\displaystyle\mathbb{E}_{o\sim f}\left[\left(\frac{g(o)}{f(o)}\right)^{\xi}\ln^{2}\frac{g(o)}{f(o)}\right] (360)
\displaystyle\leq\; 𝔼of[g(o)f(o)]ξ𝔼of[(lng(o)f(o))21ξ]1ξ\displaystyle\mathbb{E}_{o\sim f}\left[\frac{g(o)}{f(o)}\right]^{\xi}\mathbb{E}_{o\sim f}\left[\left(\ln\frac{g(o)}{f(o)}\right)^{\frac{2}{1-\xi}}\right]^{1-\xi} (361)
\displaystyle\leq\; 𝔼of[(lng(o)f(o))4]1/2.\displaystyle\mathbb{E}_{o\sim f}\left[\left(\ln\frac{g(o)}{f(o)}\right)^{4}\right]^{1/2}. (362)

Combining the inequalities above we get the desired result. ∎
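
Finally, Eq. (352) can also be verified numerically on random discrete distributions; the short sketch below (with arbitrary sizes and seeds of our choosing) compares both sides directly.

```python
import numpy as np

# Numerical sketch of Lemma G.8 on random discrete distributions.
rng = np.random.default_rng(6)
for _ in range(5):
    f = rng.dirichlet(np.ones(10))
    g = rng.dirichlet(np.ones(10))
    lam = rng.uniform(0.01, 0.49)

    log_ratio = np.log(f / g)
    d_kl = np.sum(f * log_ratio)
    d_renyi = -np.log(np.sum(f ** (1 - lam) * g ** lam)) / lam
    rhs = (lam / 2) * np.sqrt(np.sum(f * log_ratio ** 4))

    assert d_kl - d_renyi <= rhs + 1e-12
```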