Asymptotic Instance-Optimal Algorithms for Interactive Decision Making

Kefan Dong
Stanford University
[email protected] Tengyu Ma
Stanford University
[email protected]

Abstract

Past research on interactive decision making problems (bandits, reinforcement learning, etc.) mostly focuses on the minimax regret that measures the algorithm’s performance on the hardest instance. However, an ideal algorithm should adapt to the complexity of a particular problem instance and incur smaller regrets on easy instances than worst-case instances. In this paper, we design the first asymptotic instance-optimal algorithm for general interactive decision making problems with finite number of decisions under mild conditions. On every instance $f$ , our algorithm outperforms all consistent algorithms (those achieving non-trivial regrets on all instances), and has asymptotic regret $\mathcal{C}(f)\ln n$ , where $\mathcal{C}(f)$ is an exact characterization of the complexity of $f$ . The key step of the algorithm involves hypothesis testing with active data collection. It computes the most economical decisions with which the algorithm collects observations to test whether an estimated instance is indeed correct; thus, the complexity $\mathcal{C}(f)$ is the minimum cost to test the instance $f$ against other instances. Our results, instantiated on concrete problems, recover the classical gap-dependent bounds for multi-armed bandits (Lai et al., 1985) and prior works on linear bandits (Lattimore and Szepesvari, 2017), and improve upon the previous best instance-dependent upper bound (Xu et al., 2021) for reinforcement learning.

1 Introduction

Bandit and reinforcement learning (RL) algorithms demonstrated a wide range of successful real-life applications (Silver et al., 2016, 2017; Mnih et al., 2013; Berner et al., 2019; Vinyals et al., 2019; Mnih et al., 2015; Degrave et al., 2022). Past works have theoretically studied the regret or sample complexity of various interactive decision making problems, such as contextual bandits, reinforcement learning (RL), partially observable Markov decision process (see Azar et al. (2017); Jin et al. (2018); Dong et al. (2021); Li et al. (2019); Agarwal et al. (2014); Foster and Rakhlin (2020); Jin et al. (2020), and references therein). Recently, Foster et al. (2021) present a unified algorithmic principle for achieving the minimax regret—the optimal regret for the worst-case problem instances.

However, minimax regret bounds do not necessarily always present a full picture of the statistical complexity of the problem. They characterize the complexity of the most difficult instances, but potentially many other instances are much easier. An ideal algorithm should adapt to the complexity of a particular instance and incur smaller regrets on easy instances than the worst-case instances. Thus, an ideal regret bound should be instance-dependent, that is, depending on some properties of each instance. Prior works designed algorithms with instance-dependent regret bounds that are stronger than minimax regret bounds, but they are often not directly comparable because they depend on different properties of the instances, such as the gap conditions and the variance of the value function (Zanette and Brunskill, 2019; Xu et al., 2021; Foster et al., 2020; Tirinzoni et al., 2021).

A more ambitious and challenging goal is to design instance-optimal algorithms that outperform, on every instance, all consistent algorithms (those achieving non-trivial regrets on all instances). Past works designed instance-optimal algorithms for multi-armed bandit (Lai et al., 1985), linear bandits (Kirschner et al., 2021; Hao et al., 2020), Lipschitz bandits (Magureanu et al., 2014), and ergodic MDPs (Ok et al., 2018). However, instance-optimal regret bounds for tabular reinforcement learning remain an open question, despite recent progress (Tirinzoni et al., 2021, 2022). The challenge partly stems from the fact that the existence of such an instance-optimal algorithm is even a priori unclear for general interactive decision making problems. Conceivably, each algorithm can have its own specialty on a subset of instances, and no algorithm can dominate all others on all instances.

Somewhat surprisingly, we prove that there exists a simple algorithm (T2C, stated in Alg. 1) that is asymptotic instance-optimal for general interactive decision making problems with finite number of decisions.We determine the exact leading term of the optimal asymptotic regret for instance $f$ to be ${\mathcal{C}}(f)\ln n$ . Here, $n$ is the number of interactions and ${\mathcal{C}}(f)$ is a complexity measure for the instance $f$ that intuitively captures the difficulty of distinguishing $f$ from other instances (that have different optimal decisions) using observations. Concretely, under mild conditions on the interactive decision problem, our algorithm achieves an asymptotic regret ${\mathcal{C}}(f)\ln n$ (Theorem 5.2) for every instance $f$ , while every consistent algorithm must have an asymptotic regret at least ${\mathcal{C}}(f)\ln n$ (Theorem 3.2).

Our algorithm consists of three simple steps. First, it explores uniformly for $o(1)$ -fraction of the steps and computes the MLE estimate of the instance with relatively low confidence. Then, it tests whether the estimate instance (or, precisely, its associated optimal decision) is indeed correct using the most economical set of queries/decisions. Concretely, it computes a set of decisions with minimal regret, such that, using a log-likelihood ratio test, it can either distinguish the estimated instance from all other instances (with different optimal decisions) with high confidence, or determine that our estimate was incorrect. Thirdly, with the high-confidence estimate, it commits to the optimal decision of the estimated instance in the rest of the steps. The algorithmic framework essentially reduces the problem to the key second step — a problem of optimal hypothesis testing with active data collection.

Our results recover the classical gap-dependent regret bounds for multi-armed bandits (Lai et al., 1985) and prior works on linear bandits (Lattimore and Szepesvari, 2017; Hao et al., 2020). As an instantiation of the general algorithm, we present the first asymptotic instance-optimal algorithm for episodic tabular RL, improving upon prior instance-dependent algorithms (Xu et al., 2021; Simchowitz and Jamieson, 2019; Tirinzoni et al., 2021, 2022).

1.1 Additional Related Works

Instance-optimal and instance-dependent analysis.

Prior works have designed instance-optimal algorithms for specific interactive decision making problems. Variants of UCB algorithms are instance-optimal for bandits with various assumptions (Lattimore and Szepesvári, 2020; Gupta et al., 2021; Tirinzoni et al., 2020; Degenne et al., 2020; Magureanu et al., 2014), but are suboptimal for linear bandits (Lattimore and Szepesvari, 2017). These algorithms rely on the optimism-in-face-of-uncertainty principle to deal with exploration-exploitation tradeoff, whereas our algorithm explicitly finds the best tradeoff. There are also non-optimistic instance-optimal algorithms for linear bandits (Kirschner et al., 2021; Lattimore and Szepesvari, 2017; Hao et al., 2020), ergodic MDPs (Ok et al., 2018; Burnetas and Katehakis, 1997), and others (Graves and Lai, 1997; Rajeev et al., 1989).

Another line of research focuses on instance-optimal pure exploration algorithms in multi-armed bandits (Mason et al., 2020; Marjani et al., 2022; Chen et al., 2017b, a) and deterministic MDPs (Tirinzoni et al., 2022), where the goal is to use minimum number of samples to identify an $\epsilon$ -optimal decision.

Many algorithms’ regret bounds depend on some properties of instances such as the gap condition. Foster et al. (2020) prove a gap-dependent regret bound for contextual bandits. For reinforcement learning, the regret bound may depend on the variance of the optimal value function (Zanette and Brunskill, 2019) or the gap of the $Q$ -function (Xu et al., 2021; Simchowitz and Jamieson, 2019; Yang et al., 2021). Xu et al. (2021); Foster et al. (2020) also prove that the gap-dependent bounds cannot be improve on some instances. To some extent, these instance-dependent lower bounds can be viewed as minimax bounds for a fine-grained instance family (e.g., all instances with the same $Q$ -function gap), and therefore are different from ours.

Instance-optimal algorithm via log-likelihood ratio test.

The idea of using log-likelihood ratio test to design instance-optimal algorithms traces back to Rajeev et al. (1989); Graves and Lai (1997). Rajeev et al. (1989) design an asymptotic instance-optimal algorithm for general decision making problems with finite hypothesis class, finite state-action space, and known rewards. The algorithm in Graves and Lai (1997) is instance-optimal for infinite horizon MDPs where the Markov chain induced by every policy is uniformly recurrent, meaning that the probability of visiting every state is uniformly lower bounded away from 0. In comparison, our analysis is applicable to infinite hypothesis settings without the uniform recurrent assumption (e.g., episodic tabular RL).

2 Setup and Notations

Additional notations.

We use $\mathrm{poly}(n)$ to denote a polynomial in $n$ . For two non-negative sequences $a_{n},b_{n}$ , we write $a_{n}={\mathcal{O}}(b_{n})$ if $\limsup_{n\to\infty}a_{n}/b_{n}<\infty$ and $a_{n}=o(b_{n})$ if $\limsup_{n\to\infty}a_{n}/b_{n}=0.$

Interactive decision making problems.

We focus on interactive decision making with structured observations (Foster et al., 2021), which includes bandits and reinforcement learning as special cases. An interactive decision making problem is defined by a family of decisions $\Pi$ , a space of observations $O$ , a reward function $R:O\to\mathbb{R}$ , and a function $f$ (also called an instance) that maps a decision $\pi\in\Pi$ to a distribution over observations $f[\pi]$ . We use $f[\pi](\cdot):O\to\mathbb{R}_{+}$ to denote the density function of the distribution $f[\pi]$ . We assume that the reward $R$ is a deterministic and known function.

An environment picks an ground-truth instance ${f^{\star}}$ from a instance family ${\mathcal{F}}$ , and then an agent (which knows the instance family ${\mathcal{F}}$ but not the ground-truth ${f^{\star}}$ ) interacts with the environment for $n$ total rounds. In round $t\leq n$ ,

1.

the learner selects a decision $\pi_{t}$ from the decision class $\Pi$ , and
2.

the environment generates an observation $o_{t}$ following the ground-truth distribution ${f^{\star}}[\pi_{t}]$ and reveals the observation. Then the agent receives a reward $R(o_{t})$ .

As an example, multi-armed bandits, linear bandits, and reinforcement learning all belong to interactive decision making problems. For bandits, a decision $\pi$ corresponds to an action and an observation $o$ corresponds to a reward. For reinforcement learning, a decision $\pi$ is a deterministic policy that maps from states to actions, and an observation $o$ is a trajectory (including the reward at each step). In other words, a round of interactions in the interactive decision making problem corresponds to an episode of the reinforcement learning problem. We will formally discuss the reinforcement learning setup in Section 5.

Let $R_{f}(\pi)=\mathbb{E}_{o\sim f[\pi]}[R(o)]$ be the expected reward for decision $\pi$ under instance $f$ , and $\pi^{\star}(f)\triangleq\operatorname*{argmax}_{\pi}R_{f}(\pi)$ the optimal decision of instance $f$ . The expected regret measures how much worse the agent’s decision is than the optimal decision:

\displaystyle\mathrm{Reg}_{f,n}=\mathbb{E}_{f}\left[\sum_{t=1}^{n}\Big{(}\max_{\pi\in\Pi}R_{f}(\pi)-R_{f}(\pi_{t})\Big{)}\right].

(1)

We consider the case where the decision family $\Pi$ is finite, and every instance $f\in{\mathcal{F}}$ has a unique optimal decision, denoted by $\pi^{\star}(f)$ . We also assume that for every $f\in{\mathcal{F}}$ and $\pi\in\Pi$ , $0\leq R(o)\leq R_{\rm max}$ almost surely when $o$ is drawn from the distribution $f[\pi]$ , and $\sup_{o}f[\pi](o)<\infty$ .

Consistent algorithms and asymptotic instance-optimality.

We first note that it’s unreasonable to ask for an algorithm that can outperform or match any arbitrary algorithm on every instance. This is because, for any instance $f\in{\mathcal{F}}$ , a bespoke algorithm that always outputs $\pi^{\star}(f)$ can achieve zero regret on instance $f$ (though it has terrible regrets for other instances). Therefore, if an algorithm can outperform or match any algorithm on any instance, it must have zero regret on every instance, which is generally impossible. Instead, we are interested in finding an algorithm that is as good as any other reasonable algorithms that are not only customized to a single instance and completely fail on other instances.

We say an algorithm is consistent if its expected regret satisfies $\mathrm{Reg}_{f,n}=o(n^{p})$ for every $p>0$ and $f\in{\mathcal{F}}$ (Lai et al., 1985). Most of the reasonable algorithms are consistent, such as the UCB algorithm for multi-armed bandits (Lattimore and Szepesvári, 2020), UCBVI, and UCB-Q algorithm for tabular reinforcement learning (Simchowitz and Jamieson, 2019; Yang et al., 2021), because all of them achieve asymptotically ${\mathcal{O}}(\ln n)$ regrets on any instance, where ${\mathcal{O}}$ hides constants that depend on the property of the particular instance.¹¹1Here the regret scales in $\log n$ because we are in the asymptotic setting where the instance is fixed and $n$ tends to infinity. If the instance can depend on $n$ (which is not the setting we are interested in), then the minimax regret typically scales in $O(\sqrt{n})$ . However, consistent algorithms exclude the algorithm mentioned in the previous paragraph which is purely customized to a particular instance.

We say an algorithm is asymptotically instance-optimal if on every instance $f\in{\mathcal{F}}$ , $\limsup_{n\to\infty}\mathrm{Reg}_{f,n}/\ln n$ is the smallest among all consistent algorithms (Lai et al., 1985). We note that even though an instance-optimal algorithm only needs to perform as well as every consistent algorithm, a priori, it’s still unclear if such an algorithm exists.

Some prior works have also used a slightly weaker definition in place of consistent algorithms, e.g., $\alpha$ -uniformly good algorithms in Tirinzoni et al. (2021), which allows a sublinear regret bound ${\mathcal{O}}(n^{\alpha})$ for some constant $\alpha<1$ . The alternative definition, though apparently includes more algorithms, does not change the essence. Our algorithm is still instance-optimal up to a constant factor—a simple modification of the lower bound part of the proof shows that its asymptotic regret is at most $(1-\alpha)^{-1}$ factor bigger than any $\alpha$ -uniformly good algorithms on any instance. This paper thus primarily compares with consistent algorithms for simplicity.

3 Main Results

In this section, we present an intrinsic complexity measure ${\mathcal{C}}(f)$ of an instance $f$ (Section 3.1), and an instance-optimal algorithm that achieves an asymptotic regret ${\mathcal{C}}(f)\ln n$ (Section 3.2).

3.1 Complexity Measure and Regret Lower Bounds

In this section, our goal is to understand the minimum regret of consistent algorithms, which is an intrinsic complexity measure of an instance. We define an instance-dependent complexity measure ${\mathcal{C}}(f)$ and prove that any consistent algorithm must have asymptotic regret at least ${\mathcal{C}}(f)\ln n.$

The key observation is that any consistent algorithm has to collect enough information from the observations to tell apart the ground-truth instance from other instances with different optimal decisions. Consider the situation when a sequence of $n$ decisions is insufficient to distinguish two instances, denoted by $f$ and $g$ , with different optimal decisions. If an algorithm achieves a sublinear regret on $f$ , the sequence must contain $\pi^{\star}(f)$ (the optimal decision for $f$ ) at least $n-o(n)$ times. As a result, if the true instance were $g$ , the algorithm suffers a linear regret (due to the linear number of $\pi^{\star}(f)$ ), and therefore cannot be consistent.

An ideal algorithm should find out decisions that can most efficiently identify the ground-truth instance (or precisely the family of instances with the same optimal decision as the ground-truth). However, decisions collect information but also incur regrets. So the algorithm should pick a list of decisions with the best tradeoff between these two effects—minimizing the decisions’ regret and collecting sufficient information to identify the true instance. Concretely, suppose a sequence of decisions includes $w_{\pi}$ occurrences of decision $\pi$ . The regret of these decisions is $\sum_{\pi\in\Pi}w_{\pi}\Delta(f,\pi)$ , where $\Delta(f,\pi)\triangleq R_{f}(\pi^{\star}(f))-R_{f}(\pi)$ is the sub-optimality gap of decision $\pi$ . We use KL divergence between the observations of $\pi$ under two instances $f$ and $g$ (denoted by $D_{\mathrm{KL}}(f[\pi]\|g[\pi])$ ) to measure $\pi$ ’s power to distinguish them. The following optimization problem defines the optimal mixture of the decisions (in terms of the regret) that has sufficient power to distinguish $f$ from all other instances with different optimal decision.

$\displaystyle{\mathcal{C}}(f,n)\triangleq\min_{w\in\mathbb{R}^{\|\Pi\|}_{+}}\;$	$\displaystyle\sum_{\pi\in\Pi}w_{\pi}\Delta(f,\pi)$	(2)
s.t.	$\displaystyle\sum_{\pi\in\Pi}w_{\pi}D_{\mathrm{KL}}(f[\pi]\\|g[\pi])\geq 1,\;\forall g\in{\mathcal{F}},\pi^{\star}(g)\neq\pi^{\star}(f),$	(3)
	$\displaystyle\\|w\\|_{\infty}\leq n.$	(4)

The last constraint makes sure that $w_{\pi}<\infty$ even if the decision has no regret, and we only care about the case when $n$ approaches infinity. The asymptotic complexity of $f$ is

\displaystyle{\mathcal{C}}(f)\triangleq\lim_{n\to\infty}{\mathcal{C}}(f,n).

(5)

In Eq. (3), we only require separation between instances $f,g$ when they have different optimal decisions. As there are only finite decisions, there are finite equivalent classes, so $f$ and $g$ are in principle separable with sufficient number of observations. Thus, the complexity measure ${\mathcal{C}}(f,n)$ is well defined as stated in the following lemma.

Lemma 3.1.

For any $f\in{\mathcal{F}}$ , ${\mathcal{C}}(f,n)$ is non-increasing in $n$ , and there exists $n_{0}>0$ such that for all $n>n_{0}$ , ${\mathcal{C}}(f,n)<\infty.$ As a corollary, ${\mathcal{C}}(f)<\infty$ and is well defined.

Proof of Lemma 3.1 is deferred to Appendix B.1. In Section 3.1, we formalize the intuition above and prove that ${\mathcal{C}}(f)$ is a lower bound of the asymptotic regret, as stated in the following theorem.

Theorem 3.2.

For every instance $f\in{\mathcal{F}}$ , the expected regret of any consistent algorithm satisfies

\displaystyle\limsup_{n\to\infty}\frac{\mathrm{Reg}_{f,n}}{\ln n}\geq{\mathcal{C}}(f).

(6)

We prove Theorem 3.2 in Appendix B.2. The proof of Theorem 3.2 is inspired by previous works (Lattimore and Szepesvari, 2017; Tirinzoni et al., 2021). When applied to RL problems, our lower bound is very similar to that in Tirinzoni et al. (2021) and the only difference is that their optimization problems omit the constraint Eq. (4). (Informally, the lower bound of Tirinzoni et al. (2021) is equal to ${\mathcal{C}}(f,\infty)$ in our language.) For some carefully-constructed hypothesis classes, there exists an instance $f$ such that $\lim_{n\to\infty}{\mathcal{C}}(f,n)\neq{\mathcal{C}}(f,\infty)$ , meaning that the lower bound in Tirinzoni et al. (2021) is not tight (see Appendix B.5 for the construction and corresponding proof). However, this constraint can be removed for linear bandits (see Lattimore and Szepesvari (2017, Lemma 17)). For the standard tabular RL problems whose hypothesis class is the entire set of possible RL problems with fixed number of states and actions, we conjecture that our lower bound would be the same as that in Tirinzoni et al. (2021). We also note that Tirinzoni et al. (2021) do not have a matching upper bound.

The complexity measure ${\mathcal{C}}(f)$ can be computed explicitly in concrete settings. ${\mathcal{C}}(f)$ reduces to the well-known inverse action gap bound ${\mathcal{O}}(\sum_{a\in{\mathcal{A}},\Delta(a)>0}1/\Delta(a))$ for multi-armed bandits (Proposition B.2), and recovers the result of Lattimore and Szepesvari (2017) (Proposition B.3). For reinforcement learning, Tirinzoni et al. (2021) prove that the instance-optimal bound can be smaller than the gap-dependent bound ${\mathcal{O}}(\sum_{s,a:\Delta(s,a)>0}1/\Delta(s,a))$ in Xu et al. (2021) and Simchowitz and Jamieson (2019).

3.2 Instance-Optimal Algorithms

We first present the regret bound for our algorithm, and then discuss the T2C algorithm. For simplicity, we consider a finite hypothesis (i.e., $|{\mathcal{F}}|<\infty$ ) here, and extend to infinite hypothesis case in Section 5.

We start by stating a condition that excludes abnormal observation distributions $f[\pi]$ . Recall that for any $\zeta\in(0,1)$ , the Rényi divergence of two distributions $p,q$ is

\displaystyle D_{\zeta}(p\|q)=\frac{1}{\zeta-1}\ln\int_{x}p(x)^{\zeta}q(x)^{1-\zeta}\mathrm{d}x.

(7)

The Rényi divergence $D_{\zeta}(p\|q)$ is non-decreasing in $\zeta$ , and $\lim_{\zeta\uparrow 1}D_{\zeta}(p\|q)=D_{\mathrm{KL}}(p\|q)$ (Van Erven and Harremos, 2014). We require the limits converge uniformly for all instances $g\in{\mathcal{F}}$ and decisions $\pi\in\Pi$ with a $\zeta$ bounded away from 1 (the choice of the constants in Condition 1 is not critical to our result), as stated below.

Condition 1.

For any fixed $\alpha>0,\epsilon>0$ , instance $f\in{\mathcal{F}}$ , there exists $\lambda_{0}(\alpha,\epsilon,f)>0$ such that for all $\lambda\leq\lambda_{0}(\alpha,\epsilon,f)$ , $g\in{\mathcal{F}}$ and $\pi\in\Pi$ , $D_{1-\lambda}(f[\pi]\|g[\pi])\geq\min\{D_{\mathrm{KL}}(f[\pi]\|g[\pi])-\epsilon,\alpha\}.$ Moreover, we require $\lambda_{0}(\alpha,\epsilon,f)\geq\epsilon^{c_{1}}\min\{1/\alpha,c_{2}\}^{c_{3}}\iota(f)$ for some universal constants $c_{1},c_{2},c_{3}>0$ , where $\iota(f)>0$ is a function that only depends on $f$ .

Condition 1 holds for a wide range of distributions, such as Gaussian, Bernoulli, multinomial, Laplace with bounded mean, Log-normal (Gil et al., 2013), and tabular RL where $f[\pi]$ is a distribution over a trajectory consists of state, action and reward tuples (see Theorem 5.1 for the proof of tabular RL). A stronger but more interpretable variant of Condition 1 is that the log density ratio of $f[\pi]$ and $g[\pi]$ has finite fourth moments (see Proposition B.4), therefore Condition 1 can also be potentially verified for other distributions.

The main theorem analyzing Alg. 1 is shown below. The asymptotic regret of Alg. 1 matches the constants in the lower bound (Theorem 3.2), indicating the asymptotic instance-optimality of our algorithm.

Theorem 3.3.

Suppose ${\mathcal{F}}$ is a finite hypothesis class and satisfies Condition 1 The regret of Alg. 1 satisfies

\displaystyle\limsup_{n\to\infty}\frac{\mathrm{Reg}_{{f^{\star}},n}}{\ln n}\leq{\mathcal{C}}({f^{\star}}).

(8)

Algorithm 1 Test-to-Commit (T2C)

1:Parameters: the number of rounds of interactions

n

. Step 1: Initialization.

2:Play each decision

\lceil\frac{\ln n}{\ln\ln n}\rceil

times. We denote these decisions by

\{\hat{\pi}_{i}\}_{i=1}^{{m_{\rm init}}}

, where

{m_{\rm init}}=|\Pi|\lceil\frac{\ln n}{\ln\ln n}\rceil

, and the corresponding observations by

\{\hat{o}_{i}\}_{i=1}^{{m_{\rm init}}}

3:Compute the max likelihood estimation (MLE) with arbitrary tie-breaking

\displaystyle{\hat{f}}=\operatorname*{argmax}_{f\in{\mathcal{F}}}\;\sum_{i=1}^{{m_{\rm init}}}\ln f[\hat{\pi}_{i}](\hat{o}_{i}).

(9)

Step 2: Identification.

4:Let

\hat{w}

be the solution of the program defining

{\mathcal{C}}({\hat{f}},(\ln\ln n)^{1/4})

and

\bar{w}_{\pi}=\left(1+\frac{1}{(\ln\ln n)^{1/4}}\right)\hat{w}_{\pi}+\frac{1}{(\ln\ln n)^{1/4}}

for all

\pi\in\Pi

5:Play each decision

\pi

for

\lceil\bar{w}_{\pi}\ln n\rceil

times. Denote these decisions by

w=\{\pi_{i}\}_{i=1}^{m}

where

m=\sum_{\pi}\lceil\bar{w}_{\pi}\ln(n)\rceil

, and the corresponding observations by

\{o_{i}\}_{i=1}^{m}.

6:Run the log-likelihood ratio test on instance

{\hat{f}}

, the sequence of decision

\{\pi_{i}\}_{i=1}^{m}

and its corresponding observations

\{o_{i}\}_{i=1}^{m}

(that is, compute the event

{\mathcal{E}}_{\rm acc}^{\hat{f}}

defined in Eq. (10)). Step 3: Exploitation.

7:if

{\mathcal{E}}_{\rm acc}^{\hat{f}}=\rm true

then

8: Commit to

\pi^{\star}({\hat{f}})

(i.e., run

\pi^{\star}({\hat{f}})

for the remaining steps).

9:else

10: Run UCB for the remaining steps.

Our algorithm is stated in Alg. 1 and consists of three steps: initialization, identification, and exploitation. In the initialization step, we explore uniformly for a short period and compute the MLE estimate ${\hat{f}}$ of the true instance (Line 3 of Alg. 1), where we only requires ${\hat{f}}$ to be accurate with moderate probability (i.e., $1-1/\ln n$ ). The estimation is used to compute the lowest-regret list of decisions that can distinguish the optimal decision of ${\hat{f}}$ . Since we only collect $o(\ln n)$ samples, the regret of this step is negligible asymptotically.

In the identification step, we hold the belief that ${\hat{f}}$ is the true instance and solve ${\mathcal{C}}({\hat{f}},(\ln\ln n)^{1/4})$ to get the best list of decisions to fortify this belief (see Line 4 of Alg. 1). Then we collect more samples using this list (with minor decorations) to boost the confidence of our initial estimation to $1-1/n$ (or reject it when the estimation is not accurate) by running the following log-likelihood ratio test:

\displaystyle{\mathcal{E}}_{\rm acc}^{{\hat{f}}}

\displaystyle=\mathbb{I}\bigg{[}\forall g\in{\mathcal{F}}\text{ and }\pi^{\star}(g)\neq\pi^{\star}({\hat{f}}),\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\ln n\bigg{]}.

(10)

Intuitively, if ${\hat{f}}$ is not the ground-truth ${f^{\star}}$ , ${\mathcal{E}}_{\rm acc}^{{\hat{f}}}$ is unlikely to hold because the expected log-likelihood ratio is non-positive:

\mathbb{E}_{o\sim{f^{\star}}[\pi]}[\ln({\hat{f}}[\pi](o)/{f^{\star}}[\pi](o))]=-D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\leq 0.

So with high probability a wrong guess cannot be accepted. On the other hand, the first constraint (Eq. (3)) in the definition of ${\mathcal{C}}({\hat{f}})$ guarantees that when ${\hat{f}}={f^{\star}}$ , the expected log-likelihood ratio is large for all $g\in{\mathcal{F}}$ and $\pi^{\star}(g)\neq\pi^{\star}({\hat{f}})$ , so an accurate guess will be accepted. In this step $m=\tilde{\Theta}(\ln n)$ , so Step 2 dominates the regret of Alg. 1. In other words, the efficiency of the log-likelihood test is critical to our analysis.

Finally in the exploitation step, the algorithm commits to the optimal decision of ${\hat{f}}$ if it believes this estimation with confidence $1-1/n$ , or run a standard UCB algorithm as if on a multi-armed bandits problem (with a well-known ${\mathcal{O}}(\ln n)$ regret) when the initial estimation is not accurate (which happens with probability at most $1/\ln n$ ). Therefore the expected regret here is ${\mathcal{O}}(\ln n)/\ln n+{\mathcal{O}}(n)/n={\mathcal{O}}(1)$ .

Our three-stage algorithm is inspired by Lattimore and Szepesvari (2017), where their test in Step 2 is specially designed for linear bandits with Gaussian noise. In contrast, the log-likelihood ratio test is a general hypothesis testing method. Hence, our algorithm works for a large family of interactive decision making problems including tabular RL.

4 Proof Sketches of the Main Results

In this section, we discuss the proof sketch of our main results. Section 4.1 discusses the proof sketch of the lower bound, and Section 4.2 shows the main lemmas for each step in Alg. 1. In Section 4.3, we discuss the log-likelihood ratio test in detail.

4.1 Proof Sketch of Theorem 3.2

Recall that ${\mathcal{C}}(f)$ is the minimum regret of a list of decisions that can distinguish the instance $f$ with all other instances $g$ (with a different optimal decision). So to prove the lower bound, we show that the sequence of decisions played by any consistent algorithm must also distinguish $f$ , and thus, ${\mathcal{C}}(f)$ lower bounds the regret of any consistent algorithm.

For any consistent algorithm, number of interactions $n>0$ , and two instances $f,g\in{\mathcal{F}}$ with different optimal decisions, let $P_{f,n}$ and $P_{g,n}$ denote the probability space generated by running the algorithm on instances $f$ and $g$ respectively for $n$ rounds. Since $f,g$ have different optimal decisions, $P_{f,n}$ and $P_{g,n}$ should be very different. Indeed, let $N_{\pi}$ be the random variable denoting the number of times decision $\pi$ is executed, and we consider the event $A=\mathbb{I}\left[N_{\pi^{\star}(f)}\geq\frac{n}{2}\right]$ . By the basic property of TV distance, we have

\mathop{\rm Pr}\nolimits_{f,n}(A)-\mathop{\rm Pr}\nolimits_{g,n}(A)\leq D_{\rm TV}(P_{f,n}\|P_{g,n}).

Since the algorithm has sub-polynomial regret, for every $p>0$ we must have $\mathop{\rm Pr}\nolimits_{f,n}(A)\geq 1-{\mathcal{O}}(1/n^{1-p})$ and $\mathop{\rm Pr}\nolimits_{g,n}(A)\leq{\mathcal{O}}(1/n^{1-p})$ (otherwise the algorithm incurs ${\mathcal{O}}(n^{p})$ regret on either $f$ or $g$ ). Consequently,

\displaystyle\forall p>0,D_{\rm TV}(P_{f,n}\|P_{g,n})\geq 1-{\mathcal{O}}(1/n^{1-p}).

(11)

This states that the observations from the two instances should be distinguishable, although the TV distance is complex and hard to deal with (especially when the algorithm chooses its decision sequence adaptively). So we translate the requirement from TV distance to a more localized quantity. First, we invoke a extended version of Pinsker’s inequality (see Sampson and Guttorp (1992))

\displaystyle D_{\rm TV}(P_{f,n}\|P_{g,n})\leq 1-\exp(-D_{\mathrm{KL}}(P_{f,n}\|P_{g,n}))/2.

(12)

Combining with Eq. (11) we get $D_{\mathrm{KL}}(P_{f,n}\|P_{g,n})\geq(1+o(1))\ln n.$ Let the random variable $\pi_{i}$ be the decision of the algorithm in round $i$ . The chain rule of KL divergence shows

\displaystyle(1+o(1))\ln n\leq D_{\mathrm{KL}}(P_{f,n}\|P_{g,n})=\mathbb{E}_{f,n}\bigg{[}\sum_{i=1}^{n}D_{\mathrm{KL}}(f[\pi_{i}]\|g[\pi_{i}])\bigg{]}=\sum_{\pi}\mathbb{E}_{f,n}[N_{\pi}]D_{\mathrm{KL}}(f[\pi]\|g[\pi]).

Now consider the vector $w\in\mathbb{R}^{|\Pi|}_{+}$ where $w_{\pi}=\mathbb{E}_{f}[N_{\pi}]/((1+o(1))\ln n)$ . Based on the derivation above, we can verify that $w$ is a valid solution to ${\mathcal{C}}(f,n),$ and therefore

\displaystyle\mathrm{Reg}_{f,n}=\sum_{\pi}\mathbb{E}_{f}[N_{\pi}]\Delta(f,\pi)=\sum_{\pi}w_{\pi}\Delta(f,\pi)(1+o(1))\ln n\geq{\mathcal{C}}(f,n)(1+o(1))\ln n.

(13)

Then the final result is proved by the fact that ${\mathcal{C}}(f)=\lim_{n\to\infty}{\mathcal{C}}(f,n)$ .

4.2 Proof Sketch of Theorem 3.3

In the following, we discuss main lemmas and their proof sketches for the three steps of Alg. 1.

Step 1: Initialization.

In this step we show that the max likelihood estimation can find the exact instance (i.e., ${\hat{f}}={f^{\star}}$ ) with probability at least $1-1/\ln n$ and negligible regret. Note that the failure probability is not small enough to directly commit to the optimal decision of ${\hat{f}}$ (i.e., play $\pi^{\star}({\hat{f}})$ forever), which would incur linear regret when ${\hat{f}}\neq{f^{\star}}$ . Formally speaking, the main lemma for Step 1 is stated as follows, whose proof is deterred to Appendix A.1.

Lemma 4.1.

Under Condition 1, with probability at least $1-1/\ln n$ we get ${\hat{f}}={f^{\star}}.$ In addition, the regret of Step 1 is upper bounded by ${\mathcal{O}}(\frac{\ln n}{\ln\ln n}).$

Lemma 4.1 is not surprising since the MLE only requires ${\mathcal{O}}(\log(1/\delta))$ samples to reach a failure probability $\delta$ in general (Van de Geer, 2000, Section 7). In Step 1, we require $1/\ln n$ failure probability but use $(|\Pi|\ln n)/(\ln\ln n)=\Omega(\ln\ln n)$ samples, so the result is expected for large enough $n$ .

Step 2: Identification.

In the identification step, we boost the failure probability to $1/n$ using the log-likelihood test. To this end, we compute the optimal list of decisions $w=\{\pi_{1},\cdots,\pi_{m}\}$ that distinguishes ${\hat{f}}$ by solving ${\mathcal{C}}({\hat{f}},(\ln\ln n)^{1/4})$ (Line 4 of Alg. 1). The choice of $(\ln\ln n)^{1/4}$ is not critical, and could be replaced by any smaller quantity that approaches infinity as $n\to\infty$ . Then we run the log-likelihood ratio test using the observations collected by executing the list of decision $w$ , and achieve the following:

(a)

when the true instance ${f^{\star}}$ and the estimation ${\hat{f}}$ have different optimal decisions, accept ${\hat{f}}$ with probability at most $1/n$ ;
(b)

when ${\hat{f}}={f^{\star}}$ , accept ${\hat{f}}$ with probability at least $1-1/\ln n$ .

In Section 4.3, we will discuss the log-likelihood ratio test in detail. Note that the regret after we falsely reject the true instance is ${\mathcal{O}}(\ln n)$ (by the regret bound of UCB algorithm), so we only require a $1/\ln n$ failure probability for (b) because then it leads to a ${\mathcal{O}}(\ln n)/\ln n={\mathcal{O}}(1)$ expected regret. The main lemma for Step 2 is stated as follows, and its proof is deferred to Appendix A.2

Lemma 4.2.

Under Condition 1, for any finite hypothesis ${\mathcal{F}}$ , for large enough $n$ the following holds:

(a)

conditioned on the event $\pi^{\star}({\hat{f}})\neq\pi^{\star}(f^{\star})$ , ${\mathcal{E}}_{\rm acc}^{\hat{f}}$ is true with probability at most $1/n$ ;
(b)

conditioned on the event ${\hat{f}}={f^{\star}}$ , ${\mathcal{E}}_{\rm acc}^{\hat{f}}$ is true with probability at least $1-1/\ln n$ ;
(c)

conditioned on the event ${\hat{f}}={f^{\star}}$ , the expected regret of Step 2 is upper bounded by

$\left({\mathcal{C}}(f^{\star},(\ln\ln n)^{1/4})+o(1)\right)\ln n.$

Otherwise, the expected regret of Step 2 is upper bounded by ${\mathcal{O}}(\ln n\ln\ln n).$

Step 3: Exploitation.

Finally, in Step 3 we either commits to the optimal decision $\pi^{\star}({\hat{f}})$ when the estimation ${\hat{f}}$ is accepted, or run a standard UCB algorithm with ${\mathcal{O}}(\ln n)$ regret (Lattimore and Szepesvári, 2020). Combining Lemma 4.1 and Lemma 4.2 we can prove that

1.

the probability of Step 3 incurring a ${\mathcal{O}}(n)$ regret is at most $1/n$ , and
2.

the probability of Step 3 incurring a ${\mathcal{O}}(\ln n)$ regret is at most $1/\ln n$ .

As a result, the expected regret in Step 3 is ${\mathcal{O}}(1)$ and negligible . Finally Theorem 3.3 is proved by stitching the three steps together, and deferred to Appendix A.4.

4.3 The Log-Likelihood Ratio Test

The goal of the log-likelihood ratio test is to boost the confidence of our initial estimation ${\hat{f}}$ to $1-1/n$ , so that the algorithm can safely commit to its optimal decision in Step 3. In other words, the test should reject a wrong estimation but also accept a correct one.

Formally speaking, in the test we observe a list of observations $\{o_{1},\cdots,o_{m}\}$ collected by the list of decision $\{\pi_{1},\cdots,\pi_{m}\}$ on the true instance ${f^{\star}}$ , and achieve the following two goals simultaneously,

(a)

when ${\hat{f}}$ and ${f^{\star}}$ are sufficiently different (i.e., their KL divergence is large in the sense that $\sum_{i=1}^{m}D_{\mathrm{KL}}({\hat{f}}[\pi_{i}]\|{f^{\star}}[\pi_{i}])\geq(1+o(1))\ln n$ ), accept ${\hat{f}}$ with probability at most $1/n$ , and
(b)

when ${\hat{f}}={f^{\star}}$ , accept ${\hat{f}}$ with probability at least $1-1/\ln n$ .

We prove that the log-likelihood ratio test with proper parameters achieves (a) and (b) under Condition 1 in the next lemma, whose proof is deferred to Appendix A.3.

Lemma 4.3.

Given two sequences of distributions $P=\{P_{i}\}_{i=1}^{m}$ and $Q=\{Q_{i}\}_{i=1}^{m}$ , and a sequence of independent random variables $o=\{o_{i}\}_{i=1}^{m}$ . For any fixed $\lambda>0,c>0$ , and $\beta=\frac{1}{m}\sum_{i=1}^{m}D_{1-\lambda}(P_{i}\|Q_{i}),$ the test event

{\mathcal{E}}_{\rm acc}=\mathbb{I}\bigg{[}\sum_{i=1}^{m}\ln\frac{P_{i}(o_{i})}{Q_{i}(o_{i})}\geq c\bigg{]}

satisfies

	$\displaystyle\mathop{\rm Pr}\nolimits_{o\sim Q}({\mathcal{E}}_{\rm acc})\leq\exp(-c),$		(14)
	$\displaystyle\mathop{\rm Pr}\nolimits_{o\sim P}({\mathcal{E}}_{\rm acc})\geq 1-\exp(-\lambda(m\beta-c)).$		(15)

To use Lemma 4.3 in the proof of Lemma C.2, we set $c=\ln n$ and $\lambda=\mathrm{poly}(\ln\ln n)$ . Note that the choice of $w$ in Line 4 of Alg. 1 and Condition 1 implies $m\beta=\sum_{i=1}^{m}D_{1-\lambda}(P_{i}\|Q_{i})\gtrapprox\sum_{i=1}^{m}D_{\mathrm{KL}}({\hat{f}}[\pi_{i}]\|{f^{\star}}[\pi_{i}])=(1+o(1))\ln n$ when ${\hat{f}},{f^{\star}}$ have different optimal policies. So the conclusion of this lemma matches the first two items in Lemma 4.2.

Lemma 4.3 is closely related to the Chernoff-Stein lemma (see Chernoff (1959) and Mao (2021, Theorem 4.11)). The failure probability of Eq. (15) in classical Chernoff-Stein lemma is a constant, while here it decreases with $m$ . The proof of Eq. (14) is the same as in Chernoff-Stein lemma, and proof of Eq. (15) only requires an one-sided concentration of the empirical log-likelihood ratio. Indeed, the expectation of empirical log-likelihood ratio can be lower bounded by

\displaystyle\mathbb{E}_{P}\bigg{[}\sum_{i=1}^{m}\ln\frac{P_{i}(o_{i})}{Q_{i}(o_{i})}\bigg{]}=\sum_{i=1}^{m}D_{\mathrm{KL}}\left(P_{i}\|Q_{i}\right)\geq\sum_{i=1}^{m}D_{1-\lambda}\left(P_{i}\|Q_{i}\right)=m\beta.

(16)

So $\mathop{\rm Pr}\nolimits_{P}({\mathcal{E}}_{\rm acc})$ is the probability that the empirical log-likelihood ratio is (approximately) larger than its expectation. We also note that the concentration is non-trivial because we do not make assumptions on the boundedness on the tail of $Q_{i}$ .

5 Extensions to Infinite Hypothesis Class

Now we extend our analysis to infinite hypothesis settings, and instantiate our results on tabular RL. Our results in the infinite hypothesis case need two additional conditions. The first condition requires an upper bound on covering number of the hypothesis, and is used to prove a infinite-hypothesis version of Lemma 4.2. Formally speaking, for any $f,g\in{\mathcal{F}}$ , define their distance as

\displaystyle d(f,g)=\sup_{\pi\in\Pi,o}{\left|{f[\pi](o)-g[\pi](o)}\right|}.

(17)

An $\epsilon$ (proper) covering of ${\mathcal{F}}$ is a subset ${\mathcal{C}}\subseteq{\mathcal{F}}$ , such that for any $g\in{\mathcal{F}}$ , there exists $g^{\prime}\in{\mathcal{C}}$ with $d(g,g^{\prime})\leq\epsilon.$ The covering number ${\mathcal{N}}({\mathcal{F}},\epsilon)$ is the size of the minimum $\epsilon$ covering of ${\mathcal{F}}$ .

Condition 2.

There exists constant $c$ that depends on ${\mathcal{F}}$ such that $\ln{\mathcal{N}}({\mathcal{F}},\epsilon)\leq{\mathcal{O}}(c\ln(1/\epsilon))$ for every $\epsilon>0.$ In addition, the base measure of the probability space has a finite volume ${\rm Vol}<\infty.$

The second condition upper bounds the distance of two instances by a polynomial of their KL divergence, and is used to prove a stronger version of Lemma 4.1.

Condition 3.

There exists a constant $c_{\rm min}>0$ (which may depend on ${f^{\star}}$ ) such that for all $\pi\in\Pi$ and $o\in\operatorname{supp}({f^{\star}}[\pi])$ ${f^{\star}}[\pi](o)>c_{\rm min}.$ In addition, there exists a constant $\iota({f^{\star}})>0$ that only depends on ${f^{\star}}$ and $c_{5}>0$ , such that for all $f\in{\mathcal{F}},\pi\in\Pi$ , when $D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi])\leq 1$

\|{f^{\star}}[\pi]-f[\pi]\|_{\infty}\leq\iota({f^{\star}})D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi])^{c_{5}}.

A lot of interactive decision making problems satisfy Conditions 2, 3, such as multi-armed bandits and linear bandits with Bernoulli reward or truncated Gaussian reward. For bandit problems, Conditions 2 and 3 only depend on the noise of the reward function and can be verified easily.

For tabular RL, an observation $o$ denotes a trajectory $(s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H})$ where the state-action pairs $(s_{h},a_{h})$ are discrete random variables but the rewards $r_{h}$ are continuous. A decision $\pi:{\mathcal{S}}\to{\mathcal{A}}$ is a mapping from the state space to action space, so $|\Pi|=|{\mathcal{A}}|^{|{\mathcal{S}}|}<\infty.$ In Appendix D, we formally define the tabular RL problems where the reward $r_{h}$ given $(s_{h},a_{h})$ follows a truncated Gaussian distribution, and prove that Conditions 1-3 are satisfied as stated in the following theorem.

Theorem 5.1.

Let ${\mathcal{F}}$ be the family of tabular RL with truncated Gaussian reward and unique optimal policies. Then Conditions 1, 2, 3 holds simultaneously.

For tabular RL with truncated Gaussian reward, the observation $o$ is a mixture of discrete random variables (i.e., the states and actions $s_{h},a_{h}$ ) and continuous random variables (i.e., the rewards $r_{h}$ ). To prove Conditions 1-3, we deal with the discrete and continuous part of the observation separately. We also have flexibility in the reward distribution, e.g., our proof technique can deal with other distributions such as truncated Laplace distribution.

We present these unified conditions for bounded random variables, but Conditions 2 and 3 do not hold for unbounded random variables because the base measure has an infinite volume (which contradicts to Condition 2), and the density of the observation cannot be lower bounded (which contradicts to Condition 3). However, we can still prove the same result using a slightly different approach (see Appendix E).

With Conditions 1-3, we can prove our main results for infinite hypothesis class. The asymptotic regret of Alg. 1 in this case still matches the lower bound exactly. The proof of Theorem 5.2 is deferred to Appendix C.1.

Theorem 5.2.

Suppose ${\mathcal{F}}$ is an infinite hypothesis class that satisfies Conditions 1-3, the regret of Alg. 1 satisfies

\displaystyle\limsup_{n\to\infty}\frac{\mathrm{Reg}_{{f^{\star}},n}}{\ln n}\leq{\mathcal{C}}({f^{\star}}).

(18)

As a corollary of Theorem 5.1 and Theorem 5.2, our algorithm strictly improves upon previous best gap-dependent bounds ${\mathcal{O}}(\sum_{s,a:\Delta(s,a)>0}1/\Delta(s,a))$ in Xu et al. (2021) and Simchowitz and Jamieson (2019) because the gap-dependent bounds are not tight for many instances (Tirinzoni et al., 2021).

In the following we discuss the challenges when extending our analysis to infinite hypothesis settings.

Covering Number.

With infinite hypothesis, we need to accept an accurate estimation even when there are infinitely many other choices. Recall that the accept event is

\displaystyle{\mathcal{E}}_{\rm acc}^{{\hat{f}}}

\displaystyle=\mathbb{I}\left[\forall g\in{\mathcal{F}}\text{ and }\pi^{\star}(g)\neq\pi^{\star}({\hat{f}}),\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\ln n\right].

(19)

So informally speaking, we need to show that with high probability,

\displaystyle\forall g\in{\mathcal{F}}\text{ and }\pi^{\star}(g)\neq\pi^{\star}({\hat{f}}),\quad\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\gtrsim(D^{w}_{1-\lambda}({\hat{f}}\|g)-\epsilon)m\geq\ln n,

(20)

which is essentially an uniform concentration inequality as discussed in Section 4.3. So we resort to a covering number argument. The standard covering argument is not directly suitable in this case — even if $d(g,g^{\prime})$ is small, it’s still possible that $\ln(f[\pi](o)/g[\pi](o))$ is very different from $\ln(f[\pi](o)/g^{\prime}[\pi](o))$ (especially when $g[\pi](o)$ is very close to 0). Instead, we consider the distribution with density $\hat{g}[\pi](o)\triangleq(g^{\prime}[\pi](o)+\epsilon)/Z$ where $Z$ is the normalization factor, and only prove a one-sided covering (that is, $\ln(f[\pi](o)/g[\pi](o))\geq\ln(f[\pi](o)/\hat{g}[\pi](o))-{\mathcal{O}}(\epsilon)$ ). We state and prove the uniform concentration in Appendix F.

Initialization.

With infinite hypothesis, we cannot hope to recover the true instance ${f^{\star}}$ exactly — some instances can be arbitrarily close to ${f^{\star}}$ and thus indistinguishable. Instead, we prove that the estimation in Step 1 satisfies $\sup_{\pi\in\Pi}D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\leq\mathrm{poly}(\frac{\ln\ln n}{\ln n})$ . The main lemma of Step 1 in the infinite hypothesis class case is stated in Lemma C.1.

6 Conclusion

In this paper, we design instance-optimal algorithms for general interactive decision making problems with finite decisions. As an instantiation, our algorithm is the first instance-optimal algorithm for tabular MDPs. For future works, we raise the following open questions.

1.

To implement Alg. 1, we need to solve ${\mathcal{C}}(f,(\ln\ln n)^{1/4}/2)$ , which is a linear programming with $|\Pi|$ variables and infinitely many constraints. However, $|\Pi|$ is exponentially large for tabular MDPs. Can we compute this optimization problem efficiently for tabular MDPs?
2.

Although our algorithm is asymptotically optimal, the lower order terms may dominate the regret unless $n$ is very large. Can we design non-asymptotic instance optimal algorithms?

Acknowledgment

The authors would like to thank Yuanhao Wang, Jason D. Lee for helpful discussions, and the support from JD.com.

References

Agarwal et al. [2014] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646, 2014.
Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272, 2017.
Berner et al. [2019] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemyslaw Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
Burnetas and Katehakis [1997] Apostolos N Burnetas and Michael N Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997.
Chen et al. [2017a] Lijie Chen, Jian Li, and Mingda Qiao. Nearly instance optimal sample complexity bounds for top-k arm selection. In Artificial Intelligence and Statistics, pages 101–110. PMLR, 2017a.
Chen et al. [2017b] Lijie Chen, Jian Li, and Mingda Qiao. Towards instance optimal bounds for best arm identification. In Conference on Learning Theory, pages 535–592. PMLR, 2017b.
Chernoff [1959] Herman Chernoff. Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770, 1959.
Cover [1999] Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.
Degenne et al. [2020] Rémy Degenne, Han Shao, and Wouter Koolen. Structure adaptive algorithms for stochastic bandits. In International Conference on Machine Learning, pages 2443–2452. PMLR, 2020.
Degrave et al. [2022] Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897):414–419, 2022.
Dong et al. [2021] Kefan Dong, Jiaqi Yang, and Tengyu Ma. Provable model-based nonlinear bandit and reinforcement learning: Shelve optimism, embrace virtual curvature. arXiv preprint arXiv:2102.04168, 2021.
Foster and Rakhlin [2020] Dylan Foster and Alexander Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3199–3210. PMLR, 13–18 Jul 2020.
Foster et al. [2020] Dylan J Foster, Alexander Rakhlin, David Simchi-Levi, and Yunzong Xu. Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective. arXiv preprint arXiv:2010.03104, 2020.
Foster et al. [2021] Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487, 2021.
Gil et al. [2013] Manuel Gil, Fady Alajaji, and Tamas Linder. Rényi divergence measures for commonly used univariate continuous distributions. Information Sciences, 249:124–131, 2013.
Graves and Lai [1997] Todd L Graves and Tze Leung Lai. Asymptotically efficient adaptive choice of control laws incontrolled markov chains. SIAM journal on control and optimization, 35(3):715–743, 1997.
Gupta et al. [2021] Samarth Gupta, Shreyas Chaudhari, Gauri Joshi, and Osman Yağan. Multi-armed bandits with correlated arms. IEEE Transactions on Information Theory, 67(10):6711–6732, 2021.
Hao et al. [2020] Botao Hao, Tor Lattimore, and Csaba Szepesvari. Adaptive exploration in linear contextual bandit. In International Conference on Artificial Intelligence and Statistics, pages 3536–3545. PMLR, 2020.
Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 4868–4878, 2018.
Jin et al. [2020] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
Kirschner et al. [2021] Johannes Kirschner, Tor Lattimore, Claire Vernade, and Csaba Szepesvári. Asymptotically optimal information-directed sampling. In Conference on Learning Theory, pages 2777–2821. PMLR, 2021.
Lai et al. [1985] Tze Leung Lai, Herbert Robbins, et al. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
Lattimore and Szepesvari [2017] Tor Lattimore and Csaba Szepesvari. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pages 728–737. PMLR, 2017.
Lattimore and Szepesvári [2020] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
Li et al. [2019] Yingkai Li, Yining Wang, and Yuan Zhou. Nearly minimax-optimal regret for linearly parameterized bandits. In Conference on Learning Theory, pages 2173–2174. PMLR, 2019.
Magureanu et al. [2014] Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bound and optimal algorithms. In Conference on Learning Theory, pages 975–999. PMLR, 2014.
Mao [2021] Cheng Mao. Hypothesis testing: Lecture notes for math 6263 at georgia tech. 2021.
Marjani et al. [2022] Aymen Al Marjani, Tomáš Kocák, and Aurélien Garivier. On the complexity of all $\varepsilon$ -best arms identification. arXiv preprint arXiv:2202.06280, 2022.
Mason et al. [2020] Blake Mason, Lalit Jain, Ardhendu Tripathy, and Robert Nowak. Finding all $\epsilon$ -good arms in stochastic bandits. Advances in Neural Information Processing Systems, 33:20707–20718, 2020.
Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
Ok et al. [2018] Jungseul Ok, Alexandre Proutiere, and Damianos Tranos. Exploration in structured reinforcement learning. In Advances in Neural Information Processing Systems, pages 8874–8882, 2018.
Rajeev et al. [1989] A Rajeev, Demosthenis Teneketzis, and Venkatachalam Anantharam. Asymptotically efficient adaptive allocation schemes for controlled iid processes: Finite parameter space. IEEE Transactions on Automatic Control, 34(3), 1989.
Sampson and Guttorp [1992] Paul D Sampson and Peter Guttorp. Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association, 87(417):108–119, 1992.
Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
Simchowitz and Jamieson [2019] Max Simchowitz and Kevin G Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. In Advances in Neural Information Processing Systems, pages 1153–1162, 2019.
Tirinzoni et al. [2020] Andrea Tirinzoni, Alessandro Lazaric, and Marcello Restelli. A novel confidence-based algorithm for structured bandits. In International Conference on Artificial Intelligence and Statistics, pages 3175–3185. PMLR, 2020.
Tirinzoni et al. [2021] Andrea Tirinzoni, Matteo Pirotta, and Alessandro Lazaric. A fully problem-dependent regret lower bound for finite-horizon mdps. arXiv preprint arXiv:2106.13013, 2021.
Tirinzoni et al. [2022] Andrea Tirinzoni, Aymen Al-Marjani, and Emilie Kaufmann. Near instance-optimal pac reinforcement learning for deterministic mdps. arXiv preprint arXiv:2203.09251, 2022.
Van de Geer [2000] Sara Van de Geer. Empirical Processes in M-estimation. Cambridge University Press, 2000.
Van Erven and Harremos [2014] Tim Van Erven and Peter Harremos. Rényi divergence and kullback-leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
Vinyals et al. [2019] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
Xu et al. [2021] Haike Xu, Tengyu Ma, and Simon S Du. Fine-grained gap-dependent bounds for tabular mdps via adaptive multi-step bootstrap. arXiv preprint arXiv:2102.04692, 2021.
Yang et al. [2021] Kunhe Yang, Lin Yang, and Simon Du. Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pages 1576–1584. PMLR, 2021.
Zanette and Brunskill [2019] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312, 2019.

List of Appendices

\startcontents

[sections] \printcontents[sections]l1

Appendix A Proofs for Finite Hypothesis Class

In this section we show the missing proofs in Section 4.

A.1 Proof of Lemma 4.1

In this section, we prove Lemma 4.1.

Proof of Lemma 4.1.

Recall that ${m_{\rm init}}=|\Pi|\lceil\frac{\ln n}{\ln\ln n}\rceil$ . Let $w=\{\hat{\pi}_{i}\}_{i=1}^{{m_{\rm init}}}$ be the sequence of decisions in the initialization step of Alg. 1, and $\{\hat{o}_{i}\}_{i=1}^{{m_{\rm init}}}$ the corresponding observations. Define

\displaystyle\rho=\min_{g\in{\mathcal{F}},g\neq{f^{\star}}}\max_{\pi\in\Pi}D_{1/2}({f^{\star}}[\pi]\|g[\pi]).

(21)

For finite hypothesis we have $\rho>0.$ Then Lemma A.2, for any $\epsilon>0$ we get

\displaystyle\mathop{\rm Pr}\nolimits\left(\frac{1}{{m_{\rm init}}}\sum_{i=1}^{{m_{\rm init}}}\ln\frac{{f^{\star}}[\hat{\pi}_{i}](\hat{o}_{i})}{g[\hat{\pi}_{i}](\hat{o}_{i})}\geq\frac{1}{{m_{\rm init}}}\sum_{i=1}^{{m_{\rm init}}}D_{1/2}({f^{\star}}[\hat{\pi}_{i}]\|g[\hat{\pi}_{i}])-\epsilon\right)\geq 1-\exp(-{m_{\rm init}}\epsilon/2).

(22)

Let $\epsilon=\frac{\rho}{2|\Pi|}$ . By definition of $\rho$ we get, for all $g\in{\mathcal{F}}\setminus\{{f^{\star}}\}$

	$\displaystyle\mathop{\rm Pr}\nolimits\left(\frac{1}{{m_{\rm init}}}\sum_{i=1}^{{m_{\rm init}}}\ln\frac{{f^{\star}}[\hat{\pi}_{i}](\hat{o}_{i})}{g[\hat{\pi}_{i}](\hat{o}_{i})}\geq\frac{\rho}{2\|\Pi\|}\right)$	(23)
$\displaystyle\geq\;$	$\displaystyle\mathop{\rm Pr}\nolimits\left(\frac{1}{{m_{\rm init}}}\sum_{i=1}^{{m_{\rm init}}}\ln\frac{{f^{\star}}[\hat{\pi}_{i}](\hat{o}_{i})}{g[\hat{\pi}_{i}](\hat{o}_{i})}\geq\frac{1}{{m_{\rm init}}}\sum_{i=1}^{{m_{\rm init}}}D_{1/2}({f^{\star}}[\hat{\pi}_{i}]\\|g[\hat{\pi}_{i}])-\frac{\rho}{2\|\Pi\|}\right)$	(24)
$\displaystyle\geq\;$	$\displaystyle 1-\exp\left(-\frac{\rho\ln n}{4\ln\ln n}\right).$	(25)

By union bound, with probability at least $1-|{\mathcal{F}}|\exp\left(-\frac{\rho\ln n}{4\ln\ln n}\right)$ we have

\displaystyle\forall g\in{\mathcal{F}}\setminus\{{f^{\star}}\},\quad\sum_{i=1}^{{m_{\rm init}}}\ln\frac{{f^{\star}}[\hat{\pi}_{i}](\hat{o}_{i})}{g[\hat{\pi}_{i}](\hat{o}_{i})}>0=\sum_{i=1}^{{m_{\rm init}}}\ln\frac{{f^{\star}}[\hat{\pi}_{i}](\hat{o}_{i})}{{f^{\star}}[\hat{\pi}_{i}](\hat{o}_{i})},

(26)

which implies that

\displaystyle\forall g\in{\mathcal{F}}\setminus\{{f^{\star}}\},\quad\sum_{i=1}^{{m_{\rm init}}}\ln g[\hat{\pi}_{i}](\hat{o}_{i})<\sum_{i=1}^{{m_{\rm init}}}\ln{f^{\star}}[\hat{\pi}_{i}](\hat{o}_{i}).

(27)

Recalling that ${\hat{f}}=\operatorname*{argmax}_{f\in{\mathcal{F}}}\sum_{i=1}^{{m_{\rm init}}}\ln f[\hat{\pi}_{i}](\hat{o}_{i})$ , Eq. (27) implies ${\hat{f}}={f^{\star}}.$

By algebraic manipulation, for large enough $n$ we have

\displaystyle\exp\left(-\frac{\rho\ln n}{4\ln\ln n}\right)\leq\exp(-\ln|{\mathcal{F}}|-\ln\ln n)=\frac{1}{|{\mathcal{F}}|\ln n}.

(28)

As a result, for large enough $n$ we get $\mathop{\rm Pr}\nolimits({\hat{f}}={f^{\star}})\geq 1-1/\ln n.$ In addition, the regret of Step 1 is upper bounded by ${\mathcal{O}}({m_{\rm init}})={\mathcal{O}}(\frac{\ln n}{\ln\ln n}).$ ∎

A.2 Proof of Lemma 4.2

In this section, we prove Lemma 4.2. To this end, we need the following lemma.

Lemma A.1.

Consider an instance $f\in{\mathcal{F}}$ and $n>0$ . Let $\delta=(\ln\ln n)^{-1/4}$ and $\epsilon=(\ln\ln n)^{-1}.$ Define

\displaystyle\lambda=\lambda_{0}\left(4(\ln\ln n)^{3/4},\frac{1}{\ln\ln n},f\right)

(29)

as the value that Condition 1 holds with corresponding parameters.

Consider any $\hat{w}\in\mathbb{R}^{|\Pi|}_{+}$ such that $\|\hat{w}\|_{\infty}\leq(\ln\ln n)^{1/4}$ . Let $w=\{\pi_{i}\}_{i=1}^{m}$ be a list of decisions where a decision $\pi$ occurs $\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil$ times for every $\pi\in\Pi$ , and $m=\sum_{\pi}\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil$ .

Define the set ${\mathcal{F}}(\hat{w},f)=\{g\in{\mathcal{F}}:\sum_{\pi\in\Pi}\hat{w}_{\pi}D_{\mathrm{KL}}(f[\pi]\|g[\pi])\geq 1\}.$ For any constant $c>0$ , there exits $n_{0}>0$ such that for all $n>n_{0}$ ,

\displaystyle D^{w}_{1-\lambda}(f\|g)\geq\frac{\ln n}{m}+c\epsilon,\quad\forall g\in{\mathcal{F}}(\hat{w},f).

(30)

In addition, we have $\lambda^{-1}=O(\mathrm{poly}(\ln\ln n))$ and $m\geq\frac{|\Pi|\ln n}{(\ln\ln n)^{1/4}}.$

Proof.

To prove this lemma, we invoke Condition 1 with proper parameters.

First we bound the value of $m$ . By the assumption of this lemma, we have $\|\hat{w}\|_{\infty}\leq(\ln\ln n)^{1/4}$ . Consequently, for large enough $n$ we get

\displaystyle m=\sum_{\pi}\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil\leq 2|\Pi|\ln n(\ln\ln n)^{1/4}.

(31)

On the other hand,

\displaystyle m=\sum_{\pi}\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil\geq|\Pi|\delta\ln n=\frac{|\Pi|\ln n}{(\ln\ln n)^{1/4}}.

(32)

Define $\alpha=\frac{\ln n}{m}+c\epsilon.$ Then for large enough $n$ we have $\alpha+\epsilon\leq\frac{2}{|\Pi|}(\ln\ln n)^{1/4}.$ We can also lower bound $\gamma=\frac{1}{m}\min_{\pi}\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil$ . By the upper bound of $m$ and the fact that $\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil\geq\delta\ln n=\frac{\ln n}{(\ln\ln n)^{1/4}}$ , we get $\gamma\geq\frac{1}{2|\Pi|(\ln\ln n)^{1/2}}.$

We invoke Condition 1 with parameters $((\alpha+\epsilon)/\gamma,\epsilon,f)$ . Note that $\lambda$ defined in Eq. (29) satisfies $\lambda\leq\lambda_{0}((\alpha+\epsilon)/\gamma,\epsilon,f)$ because $(\alpha+\epsilon)/\gamma\leq 4(\ln\ln n)^{3/4}$ . Therefore we get

\displaystyle D_{1-\lambda}(f[\pi]\|g[\pi])\geq\min\{D_{\mathrm{KL}}(f[\pi]\|g[\pi])-\epsilon,(\alpha+\epsilon)/\gamma\},\quad\forall g\in{\mathcal{F}},\pi\in\Pi.

(33)

In the following, we prove that $D^{w}_{1-\lambda}(f\|g)\geq\alpha$ for all $g\in{\mathcal{F}}(\hat{w},f)$ . First of all, for large enough $n$ , for all $g\in{\mathcal{F}}(\hat{w},f)$ we get

		$\displaystyle D_{\mathrm{KL}}^{w}(f\\|g)=\frac{1}{m}\sum_{i=1}^{m}D_{\mathrm{KL}}(f[\pi_{i}]\\|g[\pi_{i}])=\frac{1}{m}\sum_{\pi\in\Pi}\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil D_{\mathrm{KL}}(f[\pi]\\|g[\pi])$		(34)
	$\displaystyle\geq$	$\displaystyle\;\frac{1}{m}\sum_{\pi\in\Pi}(1+\delta)(\ln n)\hat{w}_{\pi}D_{\mathrm{KL}}(f[\pi]\\|g[\pi])\geq\frac{1}{m}(1+\delta)\ln n\geq\frac{1}{m}(\ln n+(c+1)m\epsilon),$		(35)

where the last inequality comes from the fact that $\delta\ln n=\frac{\ln n}{(\ln\ln n)^{1/4}}\gtrsim 2(c+1)|\Pi|\frac{\ln n}{(\ln\ln n)^{3/4}}\geq(c+1)m\epsilon.$

Now for a fixed $g\in{\mathcal{F}}(\hat{w},f)$ , consider the following two cases.

Case 1: $\exists\bar{\pi}\in\Pi,D_{\mathrm{KL}}(f[\bar{\pi}]\|g[\bar{\pi}])\geq(\alpha+\epsilon)/\gamma.$

In this case we have

	$\displaystyle D^{w}_{1-\lambda}(f\\|g)=\frac{1}{m}\sum_{i=1}^{m}D_{1-\lambda}(f[\pi_{i}]\\|g[\pi_{i}])$	(36)
$\displaystyle\geq\;$	$\displaystyle\frac{\lceil((1+\delta)\hat{w}_{\bar{\pi}}+\delta)\ln n\rceil}{m}D_{1-\lambda}(f[\bar{\pi}]\\|g[\bar{\pi}])$	(37)
$\displaystyle\geq\;$	$\displaystyle\gamma D_{1-\lambda}(f[\bar{\pi}]\\|g[\bar{\pi}])\geq\alpha,$	(38)

where the last inequality comes from Eq. (33).

Case 2: $\forall\pi\in\Pi,D_{\mathrm{KL}}(f[\pi]\|g[\pi])\leq(\alpha+\epsilon)/\gamma.$

In this case, Eq. (33) implies that

\displaystyle D_{1-\lambda}(f[\pi]\|g[\pi])\geq D_{\mathrm{KL}}(f[\pi]\|g[\pi])-\epsilon,\quad\forall\pi\in\Pi.

(39)

Consequently,

\displaystyle D^{w}_{1-\lambda}(f\|g)=\frac{1}{m}\sum_{i=1}^{m}D_{1-\lambda}(f[\pi_{i}]\|g[\pi_{i}])\geq\frac{1}{m}\sum_{i=1}^{m}(D_{\mathrm{KL}}(f[\pi_{i}]\|g[\pi_{i}])-\epsilon)=D_{\mathrm{KL}}^{w}(f\|g)-\epsilon.

By Eq. (35) we get $D_{\mathrm{KL}}^{w}(f\|g)\geq\alpha+\epsilon.$ Therefore $D^{w}_{1-\lambda}(f\|g)\geq\alpha.$

Combining the two cases together we get the desired result. The lower bound of $\lambda^{-1}$ follows directly from Condition 1. ∎

Now we are ready to prove Lemma 4.2.

Proof of Lemma 4.2.

We prove the four items in Lemma 4.2 separately. We will invoke Lemma A.1 and Lemma 4.3 in the proof. Following Lemma A.1, let $\delta=(\ln\ln n)^{1/4},\epsilon=(\ln\ln n)^{-1}$ and

\displaystyle\lambda=\lambda_{0}\left(4(\ln\ln n)^{3/4},\frac{1}{\ln\ln n},f\right).

(40)

Recall that $\Lambda(f)=\{g\in{\mathcal{F}}:\pi^{\star}(g)\neq\pi^{\star}(f)\}.$

Proof of item (a).

In this case we have ${f^{\star}}\in\Lambda({\hat{f}}).$ By Lemma 4.3 we get

		$\displaystyle\mathop{\rm Pr}\nolimits_{{f^{\star}}}\left({\mathcal{E}}_{\rm acc}^{{\hat{f}}}\right)=\mathop{\rm Pr}\nolimits_{{f^{\star}}}\left(\forall g\in\Lambda({\hat{f}}),\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\ln n\right)$		(41)
	$\displaystyle\leq$	$\displaystyle\;\mathop{\rm Pr}\nolimits_{{f^{\star}}}\left(\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}\geq\ln n\right)\leq\exp(-\ln n)=1/n.$		(42)

Proof of item (b).

Recall that in this case we have ${\hat{f}}={f^{\star}}$ . Since $\hat{w}$ is the solution to ${\mathcal{C}}({\hat{f}},(\ln\ln n)^{1/4})$ (Line 4 of Alg. 1), we have $\sum_{\pi\in\Pi}\hat{w}_{\pi}D_{\mathrm{KL}}({\hat{f}}[\pi]\|g[\pi])\geq 1$ for all $g\in\Lambda({\hat{f}})$ . Recall that $w$ is the list of decisions computed by Line 5 of Alg. 1. By Lemma A.1 we get

\displaystyle D^{w}_{1-\lambda}(f\|g)\geq\frac{\ln n}{m}+\epsilon,\quad\forall g\in\Lambda({\hat{f}}).

(43)

Let $\beta=\frac{\ln n}{m}+\epsilon$ . By Lemma 4.3, for every $g\in\Lambda({\hat{f}})$ we have

\displaystyle\mathop{\rm Pr}\nolimits_{{f^{\star}}}\left(\sum_{i=1}^{m}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\ln n\right)\geq 1-\exp(-m\lambda\epsilon).

(44)

Lemma A.1 also yields $\lambda^{-1}\leq\mathrm{poly}(\ln\ln n)$ and $m\geq\frac{|\Pi|\ln n}{(\ln\ln n)^{1/4}}$ . Therefore $m\lambda\epsilon\geq\frac{\ln n}{\mathrm{poly}(\ln\ln n)}$ . Consequently, for large enough $n$ we have

\displaystyle\exp(-m\lambda\epsilon)\leq\frac{\ln n}{|{\mathcal{F}}|}.

(45)

Applying union bound, under the event $\mathbb{I}\left[{\hat{f}}={f^{\star}}\right]$ we get

\displaystyle\mathop{\rm Pr}\nolimits_{{f^{\star}}}\left({\mathcal{E}}_{\rm acc}^{{\hat{f}}}\right)=\mathop{\rm Pr}\nolimits_{{f^{\star}}}\left(\forall g\in\Lambda({\hat{f}}),\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\ln n\right)\geq 1-\ln n.

(46)

Proof of item (c).

Recall that $\Delta(f,\pi)$ is the sub-optimality gap of decision $\pi$ on instance $f$ , and $\Delta_{\rm max}(f)=\max_{\pi}\Delta(f,\pi)$ is the maximum decision gap of instance $f$ . Since $\hat{w}$ is the solution to ${\mathcal{C}}({\hat{f}},(\ln\ln n)^{1/4})$ , we have $\|\hat{w}\|_{\infty}\leq(\ln\ln n)^{1/4}.$ As a result, the regret of Step 2 is upper bounded by

\displaystyle\sum_{\pi}\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil\Delta_{\rm max}({f^{\star}})\lesssim|\Pi|\ln n(\ln\ln n)^{1/4},

(47)

which proves the second part of (c). For the first part, when ${\hat{f}}={f^{\star}}$ we have

	$\displaystyle\sum_{\pi}\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil\Delta({f^{\star}},\pi)$	(48)
$\displaystyle\leq\;$	$\displaystyle\sum_{\pi}(1+\delta)\hat{w}_{\pi}(\ln n)\Delta({f^{\star}},\pi)+\|\Pi\|\Delta_{\rm max}({f^{\star}})(1+\delta\ln n)$	(49)
$\displaystyle=\;$	$\displaystyle((1+\delta){\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4})+o(1))\ln n.$	(50)

By Lemma B.1, ${\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4})={\mathcal{O}}(1)$ . As a result,

\displaystyle((1+\delta){\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4})+o(1))\ln n\leq({\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4})+o(1))\ln n.

(51)

∎

A.3 Proof of Lemma 4.3

In this section, we prove Lemma 4.3. To this end, we need the following concentration inequality.

Lemma A.2.

Consider two sequences of distributions $P=\{P_{i}\}_{i=1}^{m}$ and $Q=\{Q_{i}\}_{i=1}^{m}$ , and a sequence of random variables $o=\{o_{i}\}_{i=1}^{m}$ independently drawn from $P$ . For any $\lambda\in(0,1)$ and $\epsilon>0$ we have

\displaystyle\mathop{\rm Pr}\nolimits_{o\sim P}\left(\frac{1}{m}\sum_{i=1}^{m}\ln\frac{P_{i}(o_{i})}{Q_{i}(o_{i})}\geq\frac{1}{m}\sum_{i=1}^{m}D_{1-\lambda}(P_{i}\|Q_{i})-\epsilon\right)\geq 1-\exp(-m\lambda\epsilon).

(52)

Proof of Lemma A.2.

We prove Lemma A.2 by moment method. Consider any $\lambda\in(0,1),\epsilon>0$ and let $\beta=\frac{1}{m}\sum_{i=1}^{m}D_{1-\lambda}(P_{i}\|Q_{i}).$ Then we have

	$\displaystyle\mathop{\rm Pr}\nolimits_{o\sim P}\left(\sum_{i=1}^{m}\ln\frac{P_{i}(o_{i})}{Q_{i}(o_{i})}\leq m(\beta-\epsilon)\right)$	(53)
$\displaystyle=\;$	$\displaystyle\mathop{\rm Pr}\nolimits_{o\sim P}\left(\sum_{i=1}^{m}\ln\frac{Q_{i}(o_{i})}{P_{i}(o_{i})}\geq-m(\beta-\epsilon)\right)$	(54)
$\displaystyle=\;$	$\displaystyle\mathop{\rm Pr}\nolimits_{o\sim P}\left(\exp\left(\lambda\sum_{i=1}^{m}\ln\frac{Q_{i}(o_{i})}{P_{i}(o_{i})}\right)\geq\exp(-\lambda m(\beta-\epsilon))\right)$	(55)
$\displaystyle\leq\;$	$\displaystyle\exp(\lambda m(\beta-\epsilon))\mathbb{E}_{o\sim P}\left[\exp\left(\lambda\sum_{i=1}^{m}\ln\frac{Q_{i}(o_{i})}{P_{i}(o_{i})}\right)\right]$	(56)
$\displaystyle\leq\;$	$\displaystyle\exp(\lambda m(\beta-\epsilon))\prod_{i=1}^{m}\mathbb{E}_{o_{i}\sim P_{i}}\left[\left(Q_{i}(o_{i})^{\lambda}P_{i}(o_{i})^{-\lambda}\right)\right]$	(57)
$\displaystyle\leq\;$	$\displaystyle\exp(\lambda m(\beta-\epsilon))\prod_{i=1}^{m}\left(\int_{o}\left(Q_{i}(o)^{\lambda}P_{i}(o)^{1-\lambda}\right)\mathrm{d}o\right)$	(58)
$\displaystyle\leq\;$	$\displaystyle\exp(\lambda m(\beta-\epsilon))\prod_{i=1}^{m}\exp((\lambda-1)D_{\lambda}(Q_{i}\\|P_{i})).$	(59)

where Eq. (56) follows from Markov inequality. By Van Erven and Harremos [2014, Proposition 2] we get $D_{\lambda}(Q\|P)=\frac{\lambda}{1-\lambda}D_{1-\lambda}(P\|Q)$ for any distributions $P,Q$ . As a result,

	$\displaystyle\exp(\lambda m(\beta-\epsilon))\prod_{i=1}^{m}\exp((\lambda-1)D_{\lambda}(Q_{i}\\|P_{i}))$	(60)
$\displaystyle=\;$	$\displaystyle\exp(\lambda m(\beta-\epsilon))\prod_{i=1}^{m}\exp(-\lambda D_{1-\lambda}(P_{i}\\|Q_{i}))$	(61)
$\displaystyle=\;$	$\displaystyle\exp(\lambda m(\beta-\epsilon))\exp\left(-\lambda m\beta\right)$	(62)
$\displaystyle=\;$	$\displaystyle\exp\left(-m\lambda\epsilon\right).$	(63)

∎

Now we are ready to prove Lemma 4.3.

Proof of Lemma 4.3.

First we prove Eq. (14). By the moment method, for any $c>0$

	$\displaystyle\mathop{\rm Pr}\nolimits_{Q}({\mathcal{E}}_{\rm acc})=\mathop{\rm Pr}\nolimits_{Q}\left(\sum_{i=1}^{m}\ln\frac{P_{i}(o_{i})}{Q_{i}(o_{i})}\geq c\right)$	(64)
$\displaystyle=\;$	$\displaystyle\mathop{\rm Pr}\nolimits_{Q}\left(\exp\left(\sum_{i=1}^{m}\ln\frac{P_{i}(o_{i})}{Q_{i}(o_{i})}\right)\geq\exp(c)\right)$	(65)
$\displaystyle\leq\;$	$\displaystyle\exp(-c)\mathbb{E}_{Q}\left[\prod_{i=1}^{m}\frac{P_{i}(o_{i})}{Q_{i}(o_{i})}\right]$	(Markov inequality)
$\displaystyle\leq\;$	$\displaystyle\exp(-c)\prod_{i=1}^{m}\mathbb{E}_{o_{i}\sim Q_{i}}\left[\frac{P_{i}(o_{i})}{Q_{i}(o_{i})}\right]$	(independence of $o_{i}$ ’s)
$\displaystyle=\;$	$\displaystyle\exp(-c),$	(66)

where the last line comes from the fact that both $P_{i},Q_{i}$ are valid distributions. For Eq. (15), we can directly invoke Lemma A.2 with $\epsilon=\beta-c/m$ . ∎

A.4 Proof of Theorem 3.3

We prove Theorem 3.3 by stitching Lemma 4.1 and Lemma 4.2 together.

Proof of Theorem 3.3.

We upper bound the regret of Alg. 1 by discussing the regret under following events separately. Let $\mathrm{Reg}_{\rm Step1},\mathrm{Reg}_{\rm Step2},$ and $\mathrm{Reg}_{\rm Step3}$ be the regret incurred in Step 1, 2, and 3 respectively. Let ${\mathcal{E}}_{\rm init}=\mathbb{I}\left[{\hat{f}}={f^{\star}}\right]$ be the event that the initial estimation is accurate.

Regret of Step 1:

By Lemma 4.1 we have,

\limsup_{n\to\infty}\frac{1}{\ln n}\mathbb{E}[\mathrm{Reg}_{\rm Step1}]\leq 0.

Regret of Step 2.

By item (c) of Lemma 4.2 and Lemma 4.1,

		$\displaystyle\mathbb{E}\left[\mathrm{Reg}_{\rm Step2}\right]=\mathbb{E}\left[\mathrm{Reg}_{\rm Step2}\mid{\mathcal{E}}_{\rm init}\right]\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm init})+\mathbb{E}\left[\mathrm{Reg}_{\rm Step2}\mid{\mathcal{E}}_{\rm init}^{c}\right]\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm init}^{c})$		(67)
	$\displaystyle\leq\;$	$\displaystyle({\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4})+o(1))\ln(n)+{\mathcal{O}}(\ln n\ln\ln n)/\ln n.$		(68)

As a result,

\displaystyle\limsup_{n\to\infty}\frac{1}{\ln n}\mathbb{E}\left[\mathrm{Reg}_{\rm Step2}\right]\leq\limsup_{n\to\infty}{\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4})={\mathcal{C}}({f^{\star}}).

(69)

Regret of Step 3.

First focus on the event ${\mathcal{E}}_{\rm init}$ . Under event ${\mathcal{E}}_{\rm acc}^{\hat{f}}\cap{\mathcal{E}}_{\rm init}$ , there is no regret in Step 3. On the other hand, by item (b) of Lemma 4.2 we have $\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm init}\cap({\mathcal{E}}_{\rm acc}^{\hat{f}})^{c})\leq 1/\ln n.$ Since UCB gives logarithmic regret, we have

\displaystyle\limsup_{n\to\infty}\frac{1}{\ln n}\mathbb{E}\left[\mathbb{I}\left[{\mathcal{E}}_{\rm init},({\mathcal{E}}_{\rm acc}^{\hat{f}})^{c}\right]\mathrm{Reg}_{\rm Step3}\right]\leq\limsup_{n\to\infty}\frac{{\mathcal{O}}(\ln n)}{\ln n}\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm init}\cap({\mathcal{E}}_{\rm acc}^{\hat{f}})^{c})\leq 0.

(70)

As a result,

\displaystyle\limsup_{n\to\infty}\frac{1}{\ln n}\mathbb{E}\left[\mathbb{I}\left[{\mathcal{E}}_{\rm init}\right]\mathrm{Reg}_{\rm Step3}\right]\leq 0.

(71)

Now we focus on the event ${\mathcal{E}}_{\rm init}^{c}$ . Let ${\mathcal{E}}_{1}=\{\pi^{\star}({\hat{f}})=\pi^{\star}({f^{\star}})\}.$ Under event ${\mathcal{E}}_{\rm acc}^{\hat{f}}\cap{\mathcal{E}}_{1}$ the algorithm has no regret in Step 3:

\displaystyle\mathbb{E}\left[\mathbb{I}\left[{\mathcal{E}}_{\rm init}^{c},{\mathcal{E}}_{1},{\mathcal{E}}_{\rm acc}^{\hat{f}}\right]\mathrm{Reg}_{\rm Step3}\right]=0.

(72)

On the other hand, consider the event ${\mathcal{E}}_{\rm acc}^{\hat{f}}\cap{\mathcal{E}}_{1}^{c}$ . By item (a) of Lemma 4.2 we have

\displaystyle\mathbb{E}\left[\mathbb{I}\left[{\mathcal{E}}_{\rm init}^{c},{\mathcal{E}}_{1}^{c},{\mathcal{E}}_{\rm acc}^{\hat{f}}\right]\mathrm{Reg}_{\rm Step3}\right]\leq n\Delta_{\rm max}\mathop{\rm Pr}\nolimits\left(\mathbb{I}\left[\pi^{\star}({f^{\star}})\neq\pi^{\star}({\hat{f}}),{\mathcal{E}}_{\rm acc}^{\hat{f}}\right]\right)\leq\frac{n\Delta_{\rm max}}{n}\leq{\mathcal{O}}(1).

(73)

Under the event $({\mathcal{E}}_{\rm acc}^{\hat{f}})^{c}$ , Step 3 incurs logarithmic regret. As a result,

\displaystyle\mathbb{E}\left[\mathbb{I}\left[{\mathcal{E}}_{\rm init}^{c},({\mathcal{E}}_{\rm acc}^{\hat{f}})^{c}\right]\mathrm{Reg}_{\rm Step3}\right]\leq{\mathcal{O}}(\ln n)\mathop{\rm Pr}\nolimits\left({\mathcal{E}}_{\rm init}^{c}\right)\leq{\mathcal{O}}(1).

(74)

Combining Eqs. (72), (73) and (74) we get

\displaystyle\limsup_{n\to\infty}\frac{1}{\ln n}\mathbb{E}\left[\mathbb{I}\left[{\mathcal{E}}_{\rm init}^{c}\right]\mathrm{Reg}_{\rm Step3}\right]=0.

(75)

Therefore, combining Eqs. (71) and (75),

\displaystyle\limsup_{n\to\infty}\frac{1}{\ln n}\mathbb{E}\left[\mathrm{Reg}_{\rm Step3}\right]=0.

(76)

Stitching the regrets from steps 1-3 together we prove Theorem 3.3. ∎

Appendix B Missing Proofs in Section 3.1

In this section, we present the missing proofs in section 3.1.

B.1 Proof of Lemma 3.1

To prove Lemma 3.1, we need the following lemma.

Lemma B.1.

For any $f\in{\mathcal{F}}$ , let $n_{0}=5(\Delta_{\rm min}(f)/R_{\rm max})^{-2}.$ Consider $w\in\mathbb{R}^{|\Pi|}_{+}$ such that $w_{\pi}=n_{0}$ for all $\pi\in\Pi$ . Then $w$ is a valid solution to ${\mathcal{C}}(f,n)$ for all $n>n_{0}.$ As a corollary, ${\mathcal{C}}(f,n)\leq\Delta_{\rm max}(f)|\Pi|n_{0}$ for all $n>n_{0}.$

Proof of Lemma B.1.

Recall that $\Delta(f,\pi)=\max_{\pi^{\prime}\in\Pi}R_{f}(\pi^{\prime})-R_{f}(\pi)$ is the reward gap of decision $\pi$ under instance $f$ , and $\Delta_{\rm min}(f)=\min_{\pi:\Delta(f,\pi)>0}\Delta(f,\pi)$ is the minimum decision gap of the instance $f$ .

Let $\Lambda(f)=\{g\in{\mathcal{F}},\pi^{\star}(g)\neq\pi^{\star}(f)\}.$ For any instance $g\in\Lambda(f)$ , consider the following two decisions: $\pi_{1}=\pi^{\star}(f),\pi_{2}=\pi^{\star}(g).$ We claim that

\displaystyle\max_{\pi\in\{\pi_{1},\pi_{2}\}}{\left|{R_{f}(\pi)-R_{g}(\pi)}\right|}\geq\frac{\Delta_{\rm min}(f)}{3}.

(77)

This claim can be proved by contradiction. Suppose on the contrary that

\max_{\pi\in\{\pi_{1},\pi_{2}\}}{\left|{R_{f}(\pi)-R_{g}(\pi)}\right|}<\Delta_{\rm min}(f)/3.

Then we have

\displaystyle R_{g}(\pi_{1})\geq R_{f}(\pi_{1})-\frac{\Delta_{\rm min}(f)}{3}\geq R_{f}(\pi_{2})+\frac{2\Delta_{\rm min}(f)}{3}\geq R_{g}(\pi_{2})+\frac{\Delta_{\rm min}(f)}{3}>R_{g}(\pi_{2}),

which contradicts with the fact that $\pi_{2}=\pi^{\star}(g).$

Recall that $R_{f}(\pi)=\mathbb{E}_{o\sim f[\pi]}[R(o)].$ As a result, ${\left|{R_{f}(\pi)-R_{g}(\pi)}\right|}\leq R_{\rm max}D_{\rm TV}(f[\pi]\|g[\pi])$ . Now, by Pinsker’s inequality, for any $\pi\in\Pi$ we have

\displaystyle D_{\mathrm{KL}}(f[\pi]\|g[\pi])\geq 2D_{\rm TV}(f[\pi]\|g[\pi])^{2}\geq 2({\left|{R_{f}(\pi)-R_{g}(\pi)}\right|}/R_{\rm max})^{2}.

(78)

Combining with Eq. (77), for any $g\in\Lambda(f)$ we get

\displaystyle\sum_{\pi\in\Pi}n_{0}D_{\mathrm{KL}}(f[\pi]\|g[\pi])\geq 1.

(79)

Since this inequality holds for every $g\in\Lambda(f)$ , we prove the desired result. ∎

Now we present the proof of Lemma 3.1.

Proof of Lemma 3.1.

By Lemma B.1, for $n>5(\Delta_{\rm min}(f)/R_{\rm max})^{-2}$ , we get ${\mathcal{C}}(f,n)<\infty$ . ∎

B.2 Proof of Theorem 3.2

Now we show the proof of Theorem 3.2.

Proof of Theorem 3.2.

In the following we fix a consistent algorithm ${\mathcal{A}}$ . Consider any instance $g\in{\mathcal{F}}$ such that $\pi^{\star}(g)\neq\pi^{\star}(f)$ .

Recall that $\Delta(f,\pi)=\max_{\pi^{\prime}\in\Pi}R_{f}(\pi^{\prime})-R_{f}(\pi)$ is the reward gap of decision $\pi$ under instance $f$ and $\Delta_{\rm min}(f)=\min_{\pi:\Delta(f,\pi)>0}\Delta(f,\pi)$ is the minimum decision gap of the instance $f$ . Let $\epsilon=\min\{\Delta_{\rm min}(g),\Delta_{\rm min}(f)\}/2.$ Consider two distributions (over observations and decisions) $P_{f,n},P_{g,n}$ induced by running $n$ steps of algorithm ${\mathcal{A}}$ on instance $f,g$ respectively. In addition, let $N_{\pi}$ be the random variable indicates the number of times decision $\pi$ is executed, and $\pi_{i}$ the random variable indicating the decision executed at step $i$ .

Let $\mathrm{Reg}_{f,n}$ and $\mathrm{Reg}_{g,n}$ be the regret of running algorithm ${\mathcal{A}}$ on instances $f$ and $g$ respectively. By definition we have

\displaystyle\frac{\mathrm{Reg}_{f,n}+\mathrm{Reg}_{g,n}}{\epsilon n}\geq\mathop{\rm Pr}\nolimits\nolimits_{f}\left(N_{\pi^{\star}(f)}\leq\frac{n}{2}\right)+\mathop{\rm Pr}\nolimits\nolimits_{g}\left(N_{\pi^{\star}(f)}>\frac{n}{2}\right)

(80)

By basic inequality of KL divergence [Lattimore and Szepesvari, 2017, Lemma 5] we get,

\displaystyle\mathop{\rm Pr}\nolimits\nolimits_{f}\left(N_{\pi^{\star}(f)}\leq\frac{n}{2}\right)+\mathop{\rm Pr}\nolimits\nolimits_{g}\left(N_{\pi^{\star}(f)}>\frac{n}{2}\right)\geq\frac{1}{2}\exp(-D_{\mathrm{KL}}(P_{f,n}\|P_{g,n})).

(81)

Combining Eq. (80) and Eq. (81) we get,

\displaystyle D_{\mathrm{KL}}(P_{f,n}\|P_{g,n})\geq\ln\left(\frac{\epsilon n}{2\left(\mathrm{Reg}_{f,n}+\mathrm{Reg}_{g,n}\right)}\right).

(82)

Now applying the chain rule for KL divergence (see, e.g., [Cover, 1999, Theorem 2.5.3]), we get

\displaystyle D_{\mathrm{KL}}(P_{f,n}\|P_{g,n})=\mathbb{E}_{f}\left[\sum_{i=1}^{n}D_{\mathrm{KL}}(f[\pi_{i}]\|g[\pi_{i}])\right]=\sum_{\pi\in\Pi}\mathbb{E}_{f}[N_{\pi}]D_{\mathrm{KL}}(f[\pi]\|g[\pi]).

(83)

Therefore we have

\displaystyle\sum_{\pi\in\Pi}\frac{\mathbb{E}_{f}[N_{\pi}]}{\ln\left(\epsilon n\right)-\ln\left({2\left(\mathrm{Reg}_{f,n}+\mathrm{Reg}_{g,n}\right)}\right)}D_{\mathrm{KL}}(f[\pi]\|g[\pi])\geq 1.

(84)

Consider the mixture of decisions

w_{\pi}=\frac{\mathbb{E}_{f}[N_{\pi}]}{\ln\left(\epsilon n\right)-\ln\left({2\left(\mathrm{Reg}_{f,n}+\mathrm{Reg}_{g,n}\right)}\right)}.

Since $\mathrm{Reg}_{f,n}+\mathrm{Reg}_{g,n}\leq n\Delta_{\rm max}$ and $\mathbb{E}_{f}[N_{\pi}]\leq n$ , $w_{\pi}$ satisfies the constraint of ${\mathcal{C}}\left(f,n\right)$ for large enough $n$ . In addition, the expected regret of algorithm ${\mathcal{A}}$ is

	$\displaystyle\mathrm{Reg}_{f,n}=\mathbb{E}_{f}\left[\sum_{i=1}^{n}\Delta_{\pi_{i}}\right]=\sum_{\pi}\mathbb{E}_{f}[N_{\pi}]\Delta_{\pi}$	(85)
$\displaystyle\geq\;$	$\displaystyle\sum_{\pi}\Delta_{\pi}w_{\pi}\ln\left(\frac{\epsilon n}{2\left(\mathrm{Reg}_{f,n}+\mathrm{Reg}_{g,n}\right)}\right)$	(86)
$\displaystyle\geq\;$	$\displaystyle{\mathcal{C}}\left(f,n\right)\ln\left(\frac{\epsilon n}{2\left(\mathrm{Reg}_{f,n}+\mathrm{Reg}_{g,n}\right)}\right).$	(87)

Since ${\mathcal{A}}$ is consistent, for any $p>0$ we have $\mathrm{Reg}_{f,n}+\mathrm{Reg}_{g,n}=O(n^{p}).$ Consequently, for any $p>0$ ,

	$\displaystyle\limsup_{n\to\infty}\frac{\mathrm{Reg}_{f,n}}{\ln(n)}$	(88)
$\displaystyle\geq\;$	$\displaystyle\limsup_{n\to\infty}{\mathcal{C}}\left(f,n\right)\frac{\ln(n)-\ln\left(2\left(\mathrm{Reg}_{f,n}+\mathrm{Reg}_{g,n}\right)/\epsilon\right)}{\ln(n)}$	(89)
$\displaystyle\geq\;$	$\displaystyle{\mathcal{C}}(f)(1-p).$	(90)

Since $p>0$ is an arbitrary constant, we get

\displaystyle\limsup_{n\to\infty}\frac{\mathrm{Reg}_{f,n}}{\ln(n)}\geq{\mathcal{C}}(f).

(91)

∎

B.3 Instantiation of the Complexity Measure

In this section, we instantiate our complexity measure ${\mathcal{C}}(f)$ on concrete settings.

First we consider multi-armed bandits with unit Gaussian reward. The family of decisions is $\Pi=\{1,2,\cdots,A\}$ . Let $(\mu_{1},\mu_{2},\cdots,\mu_{A})$ be the mean rewards of each decision. An instance $f$ in this case is characterized by the mean rewards, and $f[i]={\mathcal{N}}(\mu_{i},1).$

Proposition B.2.

For an multi-armed bandit instance $f$ with unique optimal decision and unit Gaussian noise, let $(\mu_{1},\mu_{2},\cdots,\mu_{A})$ be the mean reward of each action. Then

{\mathcal{C}}(f)\leq\sum_{i\in[1,A]\text{ and }\Delta_{i}>0}\frac{2}{\Delta_{i}}

where $\Delta_{i}=\max_{i^{\prime}}\mu_{i^{\prime}}-\mu_{i}.$

Proof.

We prove this proposition by constructing a solution $w\in\mathbb{R}^{A}_{+}$ to the optimization problem ${\mathcal{C}}(f,n)$ for large enough $n$ .

Without loss of generality, we assume $\mu_{1}>\mu_{i}$ for all $i\geq 2.$ Consider the action frequency

\displaystyle w_{i}=\begin{cases}\frac{2n}{n\Delta_{i}^{2}-3},&i\geq 2,\\ n,&i=1.\end{cases}

(92)

Then when $n$ is large enough we have $\|w\|_{\infty}\leq n.$ On the other hand, consider any $g\in{\mathcal{F}}$ such that $\pi^{\star}(g)\neq\pi^{\star}(f).$ Suppose $\pi^{\star}(g)=i.$ In the following we show that

\displaystyle\sum_{i=1}^{n}w_{i}D_{\mathrm{KL}}(f[i]\|g[i])\geq 1.

(93)

Let $(\mu_{1}^{\prime},\mu_{2}^{\prime},\cdots,\mu_{A}^{\prime})$ be the mean reward of instance $g$ . For multi-armed bandits with unit Gaussian noise, we have

\displaystyle\sum_{i=1}^{n}w_{i}D_{\mathrm{KL}}(f[i]\|g[i])=\frac{1}{2}\sum_{i=1}^{n}w_{i}(\mu_{i}-\mu_{i}^{\prime})^{2}\geq\frac{1}{2}\left(w_{1}(\mu_{1}-\mu_{1}^{\prime})^{2}+w_{i}(\mu_{i}-\mu_{i}^{\prime})^{2}\right).

(94)

By the condition that $\pi^{\star}(g)=i$ , we get $\mu_{i}^{\prime}\geq\mu_{1}^{\prime}.$ Combining with the fact that $\mu_{1}>\mu_{i}$ we get

	$\displaystyle\min_{\mu_{1}^{\prime},\mu_{i}^{\prime}:\mu_{1}^{\prime}\leq\mu_{i}^{\prime}}\left(w_{1}(\mu_{1}-\mu_{1}^{\prime})^{2}+w_{i}(\mu_{i}-\mu_{i}^{\prime})^{2}\right)$	(95)
$\displaystyle=$	$\displaystyle\;\min_{\mu_{1}^{\prime}\in[\mu_{i},\mu_{1}]}\left(w_{1}(\mu_{1}-\mu_{1}^{\prime})^{2}+w_{i}(\mu_{i}-\mu_{1}^{\prime})^{2}\right)$	(96)
$\displaystyle=$	$\displaystyle\;\frac{w_{1}w_{i}}{w_{1}+w_{i}}(\mu_{1}-\mu_{i})^{2}.$	(97)

By the definition of $w_{i}$ , we have

\displaystyle\frac{w_{1}w_{i}}{w_{1}+w_{i}}(\mu_{1}-\mu_{i})^{2}=\frac{\frac{2n^{2}}{n\Delta_{i}^{2}-3}}{n+\frac{2n}{n\Delta_{i}^{2}-3}}\Delta_{i}^{2}=\frac{2\Delta_{i}^{2}n}{n\Delta_{i}^{2}-1}\geq 2.

(98)

As a result,

\displaystyle{\mathcal{C}}(f,n)\leq\sum_{i=2}^{A}\frac{2n}{n\Delta_{i}^{2}-3}\Delta_{i}.

(99)

It follows that

\displaystyle{\mathcal{C}}(f)=\lim_{n\to\infty}{\mathcal{C}}(f,n)\leq\lim_{n\to\infty}\sum_{i=2}^{A}\frac{2n}{n\Delta_{i}^{2}-3}\Delta_{i}=\sum_{i=2}^{A}\frac{2}{\Delta_{i}}.

(100)

∎

In the following, we focus on a linear bandit instance $f$ where the action set is ${\mathcal{A}}\subset\mathbb{R}^{d}.$ The mean of an action $x\in{\mathcal{A}}$ is given by $\mu_{x}=\left<x,\theta\right>$ for some $\theta\in\mathbb{R}^{d}.$ Let $x^{\star}=\operatorname*{argmax}_{x}\mu_{x}$ be the optimal action, and $\Delta_{x}\triangleq\mu_{x^{\star}}-\mu_{x}$ be the sub-optimality gap of action $x$ . Define ${\mathcal{A}}^{-}={\mathcal{A}}\setminus\{x^{\star}\}.$ An linear bandit instance is characterized by the vector $\theta$ , and $f[x]={\mathcal{N}}(\left<x,\theta\right>,1).$

We assume that ${\mathcal{A}}$ is discrete, full rank, and $\|x\|_{2}\leq 1$ for all $x\in{\mathcal{A}}.$ Then we have the following proposition.

Proposition B.3.

For an linear bandit instance $f$ with unique optimal decision and unit Gaussian noise, our complexity measure ${\mathcal{C}}(f)$ recovers that in Lattimore and Szepesvari [2017]. That is,

	$\displaystyle{\mathcal{C}}(f)\leq\inf_{w\in\mathbb{R}^{A}_{+}}\;$	$\displaystyle\sum_{x\in{\mathcal{A}}^{-}}w_{x}\Delta_{x},$		(101)
	s.t.	$\displaystyle\\|x\\|_{H(w)^{-1}}^{2}\leq\frac{\Delta_{x}^{2}}{2},\quad\forall x\in{\mathcal{A}}^{-},$		(102)

where $H(w)=\sum_{x}w_{x}xx^{\top}.$

Proof.

Let $\hat{w}$ be the solution to the RHS of Eq. (101). In the following we construct a solution to our optimization problem ${\mathcal{C}}(f,n)$ from $\hat{w}$ . Recall that

$\displaystyle{\mathcal{C}}(f,n)\triangleq\min_{w\in\mathbb{R}^{\|{\mathcal{A}}\|}_{+}}\;$	$\displaystyle\sum_{x\in{\mathcal{A}}}w_{\pi}\Delta(f,x)$	(103)
s.t.	$\displaystyle\sum_{x\in{\mathcal{A}}}w_{x}D_{\mathrm{KL}}(f[x]\\|g[x])\geq 1,\;\forall g\in{\mathcal{F}},\pi^{\star}(g)\neq\pi^{\star}(f),$	(104)
	$\displaystyle\\|w\\|_{\infty}\leq n.$	(105)

For any $w\in\mathbb{R}^{A}_{+}$ and any instance $g\in{\mathcal{F}}$ associates with parameter $\theta^{\prime}$ , we have

\displaystyle\sum_{x\in{\mathcal{A}}}w_{x}D_{\mathrm{KL}}(f[x]\|g[x])=\sum_{x\in{\mathcal{A}}}w_{x}D_{\mathrm{KL}}({\mathcal{N}}(\left<x,\theta\right>,1)\|{\mathcal{N}}(\left<x,\theta^{\prime}\right>,1))=\frac{\|\theta-\theta^{\prime}\|_{H(w)}^{2}}{2}.

(106)

Consider any $g$ such that $\pi^{\star}(g)\neq\pi^{\star}(f)$ . Suppose $\pi^{\star}(g)=x\neq x^{\star}$ . It follows that $\left<x-x^{\star},\theta^{\prime}\right>>0.$ Recall that $\Delta_{x}=\left<x^{\star}-x,\theta\right>.$ By algebraic manipulation we have,

\displaystyle\min_{\theta^{\prime}:\left<x^{\star}-x,\theta-\theta^{\prime}\right>>\Delta_{x}}\frac{1}{2}\|\theta-\theta^{\prime}\|_{H(w)}^{2}=\frac{1}{2}\frac{\Delta_{x}^{2}}{\|x^{\star}-x\|_{H(w)^{-1}}^{2}}

(107)

Therefore to satisfy Eq. (104), it’s enough to construct a solution $w$ such that

\displaystyle\frac{1}{2}\frac{\Delta_{x}^{2}}{\|x^{\star}-x\|_{H(w)^{-1}}^{2}}\geq 1,\forall x\in{\mathcal{A}}^{-}.

(108)

Define $A=\sum_{x\in{\mathcal{A}}^{-}}\hat{w}_{x}xx^{\top}+(x^{\star})(x^{\star})^{\top}.$ When the action set ${\mathcal{A}}$ is full rank, $A$ is positive definite (see Lattimore and Szepesvari [2017, Appendix C]). We use $\sigma_{\rm max}(A),\sigma_{\rm min}(A)$ to denote the maximum/minimum singular value of a matrix $A$ respectively. Then for any $n>0$ , consider the following solution

\displaystyle w_{x}=\begin{cases}\hat{w}_{x}\left(1-\frac{8}{\Delta_{\rm min}^{2}}\frac{\sigma_{\rm max}(A^{-1})}{1+(n-1)\sigma_{\rm min}(A^{-1})}\right)^{-1},&\text{ when }x\neq x^{\star},\\ n,&\text{ when }x=x^{\star}.\end{cases}

(109)

For large enough $n$ we get $\|w\|_{\infty}\leq n.$ In the following, we prove that $w$ satisfies Eq. (107). Let

c_{n}=\left(1-\frac{8}{\Delta_{\rm min}^{2}}\frac{\sigma_{\rm max}(A^{-1})}{1+(n-1)\sigma_{\rm min}(A^{-1})}\right)^{-1}

for shorthand. Since $\Delta_{x^{\star}}=0$ , we have $\hat{w}_{x^{\star}}=\infty.$ Therefore $\|x^{\star}\|_{H(\hat{w})^{-1}}=0.$ Then for any $x\in{\mathcal{A}}^{-}$ , by Eq. (102) we have

		$\displaystyle(x^{\star}-x)^{\top}H(c_{n}\hat{w})^{-1}(x^{\star}-x)=c_{n}^{-1}(x^{\star}-x)^{\top}H(\hat{w})^{-1}(x^{\star}-x)$		(110)
	$\displaystyle=\;$	$\displaystyle c_{n}^{-1}x^{\top}H(\hat{w})^{-1}x\leq\frac{\Delta_{x}^{2}}{2}-\frac{4\sigma_{\rm max}(A^{-1})}{1+(n-1)\sigma_{\rm min}(A^{-1})}.$		(111)

Invoking Lemma G.7 we get

\displaystyle(x^{\star}-x)^{\top}H(w)^{-1}(x^{\star}-x)\leq(x^{\star}-x)^{\top}H(c_{n}\hat{w})^{-1}(x^{\star}-x)+\frac{4\sigma_{\rm max}(A^{-1})}{1+(n-1)\sigma_{\rm min}(A^{-1})}=\frac{\Delta_{x}^{2}}{2},

which implies Eq. (108). As a result, $w$ is a valid solution ot ${\mathcal{C}}(f,n).$ Consequently,

\displaystyle{\mathcal{C}}(f,n)\leq\sum_{x}\Delta_{x}w_{x}\leq c_{n}\sum_{x}\hat{w}_{x}\Delta_{x}.

(112)

By definition we have $\lim_{n\to\infty}c_{n}=1$ . Consequently, ${\mathcal{C}}(f)=\lim_{n\to\infty}{\mathcal{C}}(f,n)\leq\sum_{x}\hat{w}_{x}\Delta_{x}.$ ∎

B.4 A Sufficient Condition for Condition 1

In this section, we discuss a sufficient but more interpretable condition to Condition 1.

Proposition B.4.

Suppose there exists a constant $c_{M}$ such that for every $f,g\in{\mathcal{F}}$ and $\pi\in\Pi$ ,

\mathbb{E}_{o\sim f[\pi]}\left[\left(\ln\frac{f[\pi](o)}{g[\pi](o)}\right)^{4}\right]\leq c_{M}^{4}.

Then Condition 1 holds with $\lambda_{0}(\alpha,\epsilon,f)=\min\{2\epsilon/c_{M}^{2},1/2\}.$

Proof.

For any fixed $f,g\in{\mathcal{F}}$ , $\pi\in\Pi$ and $\lambda<1/2$ , by Lemma G.8 we get

\displaystyle D_{\mathrm{KL}}(f[\pi]\|g[\pi])-D_{1-\lambda}(f[\pi]\|g[\pi])\leq\frac{\lambda}{2}\mathbb{E}_{o\sim f[\pi]}\left[\left(\ln\frac{f[\pi](o)}{g[\pi](o)}\right)^{4}\right]^{1/2}.

(113)

Therefore we have

\displaystyle D_{\mathrm{KL}}(f[\pi]\|g[\pi])-D_{1-\lambda}(f[\pi]\|g[\pi])\leq\frac{\lambda}{2}c_{M}^{2}.

(114)

Since $D_{1-\lambda}(f[\pi]\|g[\pi])$ is monotonically decreasing with $\lambda$ , we prove the desired result. ∎

B.5 Comparison with Existing Lower Bounds

Recall that when applied to tabular RL problems, our lower bound is very similar to that in Tirinzoni et al. [2021], and the only difference is that their optimization problems omit the second constraint (Eq. (4)), On a high level, the constraint $\|w_{\pi}\|_{\infty}\leq n$ is necessary to prevent degenerate cases and guarantee mathematical rigor.

The value ${\mathcal{C}}(f,n)$ can be different from the solution without this constraint for some artificial hypothesis class $\mathcal{F}$ . For example, we can construct a multi-armed bandit hypothesis class $\mathcal{F}=\{\mu\in\mathbb{R}^{A}:\mu(1)\neq 0.5\}\cup\{[0.5,0.1,\cdots,0.1]\}$ (where $\mu\in\mathbb{R}^{A}$ represents the mean reward of each arm). Then for $f=[0.5,0.1,\cdots,0.1]$ , ${\mathcal{C}}(f,n)>0$ for every $n>0$ (because there exits other instances in $\mathcal{F}$ whose mean reward of action 1 is arbitrarily close of 0.5). As a result, ${\mathcal{C}}(f)=\lim_{n\to\infty}{\mathcal{C}}(f,n)>0$ by the definition of limits and in this case $\lim_{n\to\infty}{\mathcal{C}}(f,n)\neq{\mathcal{C}}(f,\infty)$ . However, without the constraint $\|w_{\pi}\|_{\infty}\leq n$ , the solution will be $0$ , achieved by letting $w_{1}=\infty$ and $w_{i}=0,\forall i\neq 1$ .

For other hypothesis classes (such as the standard MAB and linear bandits discussed in Proposition B.2 & B.3, and tabular RL), however, this constraint does not change the value of $C(f,n).$

Appendix C Extension to Infinite Hypothesis Class

In this section, we extend our results to the infinite hypothesis case.

C.1 Proof of Theorem 5.2

To prove Theorem 5.2, we require the following lemmas for steps 1 and 2 by analogy with the finite hypothesis case.

Lemma C.1 (Main lemma for Initialization).

Let ${\mathcal{E}}_{\rm init}$ be the event that, there exists a universal constant $c_{4}>0$ such that

(a)

$\max_{\pi}D_{\mathrm{KL}}(f^{\star}[\pi]\|{\hat{f}}[\pi])\leq\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}}$ ,
(b)

${\left|{R_{{\hat{f}}}(\pi)-R_{{f^{\star}}}(\pi)}\right|}\leq R_{\rm max}\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}}$ , for all $\pi\in\Pi$ ,
(c)

$\pi^{\star}({\hat{f}})=\pi^{\star}({f^{\star}}).$

Under Conditions 1 and 2, there exists $n_{0}>0$ such that when $n>n_{0}$ , $\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm init})\geq 1-1/\ln n.$ In addition, the regret of Step 1 is upper bounded by ${\mathcal{O}}(\frac{\ln n}{\ln\ln n}).$

Proof of Lemma C.1 is deferred to Appendix C.2.

Lemma C.2 (Main lemma for Identification).

Under Conditions 1, 2, and 3, there exists $n_{0}>0$ such that when $n>n_{0}$ , the following holds.

(a)

Conditioned on the event $\pi^{\star}({\hat{f}})\neq\pi^{\star}(f^{\star})$ , $\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm acc})\leq 1/n$ .
(b)

Conditioned on the event ${\mathcal{E}}_{\rm init}$ , $\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm acc})\geq 1-1/\ln n$ .
(c)

The expected regret of Step 2 is always upper bounded by ${\mathcal{O}}(\ln n\ln\ln n).$
(d)

Conditioned on the event ${\mathcal{E}}_{\rm init}$ , the expected regret of Step 2 is upper bounded by

$\left({\mathcal{C}}(f^{\star},(\ln\ln n)^{1/4}/2)+o(1)\right)\ln n.$

Proof of Lemma C.2 is deferred to Appendix C.3.

Then, proof of Theorem 5.2 is the same as that of Theorem 3.3 by plugging in Lemma C.1 and Lemma C.2,

C.2 Proof of Lemma C.1

Proof of Lemma C.1.

We prove the two items in this lemma separately.

Proof of item (a):

Let $c_{1}>0$ be the constant from Condition 1. Set $c_{6}=\frac{1}{2c_{1}+3}$ , $\alpha=\left(\frac{\ln\ln n}{\ln n}\right)^{c_{6}}$ , and $\epsilon=\frac{\alpha}{5}$ . Let $w=\left(\pi_{1},\cdots,\pi_{{m_{\rm init}}}\right)$ be the list of decisions run by Step 1, and $o_{1},\cdots,o_{m_{\rm init}}$ the corresponding observations. Recall that by definition,

\displaystyle{\hat{f}}=\operatorname*{argmax}_{g\in{\mathcal{F}}}\sum_{i=1}^{{m_{\rm init}}}\ln g[\pi_{i}](o_{i}).

(115)

Combining with the fact that ${f^{\star}}\in{\mathcal{F}}$ , we have

\displaystyle\sum_{i=1}^{{m_{\rm init}}}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{{\hat{f}}[\pi_{i}](o_{i})}\leq\sum_{i=1}^{{m_{\rm init}}}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}\leq 0.

(116)

Let ${\mathcal{G}}(\alpha)=\{g\in{\mathcal{F}}:\exists\pi:D_{\mathrm{KL}}({f^{\star}}[\pi]\|g[\pi])\geq\alpha\}.$ We will prove that for all $g\in{\mathcal{G}}(\alpha)$ , we have $\sum_{i=1}^{{m_{\rm init}}}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}>0$ . Combining with Eq. (116) we get $D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\leq\alpha,\forall\pi.$

To this end, we apply Lemma F.1 with parameters $(\alpha/|\Pi|,\epsilon/|\Pi|,w)$ . Following Lemma F.1, define

\gamma=\frac{1}{{m_{\rm init}}}\min_{\pi}\sum_{i=1}^{{m_{\rm init}}}\mathbb{I}\left[\pi_{i}=\pi\right].

Then we have $\gamma=1/|\Pi|.$ Let $\lambda=\lambda_{0}(\alpha,\epsilon/|\Pi|,f)$ be the value that satisfies Condition 1, and

\epsilon_{0}=\frac{\exp(-\alpha)(\epsilon\lambda/|\Pi|)^{1/\lambda}}{3{\rm Vol}}.

Recall that the condition of Lemma F.1 states

\displaystyle{m_{\rm init}}\geq\frac{1}{\lambda\epsilon}\left(\ln{\mathcal{N}}({\mathcal{F}},\epsilon_{0})+\ln(1/\delta))\right).

(117)

The failure probability in this case is $\delta=1/\ln n$ . By Condition 1, for large enough $n$ we get $1/\lambda\lesssim\epsilon^{-c_{1}}.$ By Condition 2 we get

\ln{\mathcal{N}}({\mathcal{F}},\epsilon_{0})\lesssim\ln(1/\epsilon_{0})\lesssim{\mathcal{O}}(1)+\frac{1}{\lambda}\ln(1/(\epsilon\lambda))+\alpha.

As a result, when $n$ is large enough

		$\displaystyle\frac{1}{\lambda\epsilon}\left(\ln{\mathcal{N}}({\mathcal{F}},\epsilon_{0})+\ln(1/\delta))\right)\lesssim\frac{1}{\epsilon^{c_{1}+1}}\left({\mathcal{O}}(1)+\ln\ln n+\frac{1}{\epsilon^{c_{1}}}\ln\frac{1}{\epsilon^{c_{1}}}\right)$		(118)
	$\displaystyle\lesssim\;$	$\displaystyle\epsilon^{-(2c_{1}+2)}\lesssim\left(\frac{\ln n}{\ln\ln n}\right)^{1-\frac{1}{2c_{1}+3}},$		(119)

where the last inequality comes from the definition of $\epsilon$ , i.e., $\epsilon={\mathcal{O}}\left(\left(\frac{\ln\ln n}{\ln n}\right)^{\frac{1}{2c_{1}+3}}\right)$ . Recall that ${m_{\rm init}}\geq|\Pi|\frac{\ln n}{\ln\ln n}$ . When $n$ is large enough, the condition of Lemma F.1 (i.e., Eq. (117)) is satisfied.

Because every policy appears in $w$ exactly the same number of times, we have $D_{\mathrm{KL}}^{w}({f^{\star}}\|g)\geq\alpha/|\Pi|$ for all $g\in{\mathcal{G}}(\alpha)$ . Therefore, by Lemma F.1 with paremeters $(\alpha/|\Pi|,\epsilon/|\Pi|,w)$ ,

\displaystyle\forall g\in{\mathcal{G}}(\alpha),\sum_{i=1}^{{m_{\rm init}}}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\left(\frac{\alpha}{|\Pi|}-4\frac{\epsilon}{|\Pi|}\right)m>0.

(120)

Combining with Eq. (116), we get ${\hat{f}}\not\in{\mathcal{G}}(\alpha)$ . As a result

\displaystyle\forall\pi\in\Pi,\quad D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\leq\alpha=\left(\frac{\ln\ln n}{\ln n}\right)^{c_{6}}.

(121)

Proof of item (b):

Now we focus on item (b). For any fixed $\pi\in\Pi$ , by Pinsker’s inequality and Eq. (121) we get

\displaystyle D_{\rm TV}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\leq\sqrt{\frac{1}{2}D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])}\leq\left(\frac{\ln\ln n}{\ln n}\right)^{c_{6}/2}.

(122)

By assumption we have $0\leq R(o)\leq R_{\rm max}$ almost surely for both ${f^{\star}}[\pi]$ and ${\hat{f}}[\pi]$ . It follows that

\displaystyle{\left|{R_{{\hat{f}}}(\pi)-R_{{f^{\star}}}(\pi)}\right|}\leq R_{\rm max}D_{\rm TV}({f^{\star}}[\pi]\|{\hat{f}}[\pi]).

(123)

Then we prove item (b) with $c_{4}=c_{6}/2$ and $\iota({f^{\star}})=R_{\rm max}.$

Proof of item (c):

Since $\Delta_{\rm min}({f^{\star}})>0$ , (c) follows from (b) directly when $n$ is large enough.

Proof of regret:

The number of samples collected in Step 1 is upper bounded by ${m_{\rm init}}=|\Pi|\lceil\frac{\ln n}{\ln\ln n}\rceil.$ As a result, the regret is upper bounded by

\displaystyle{\mathcal{O}}(\Delta_{\rm max}{m_{\rm init}})={\mathcal{O}}\left(\frac{\ln n}{\ln\ln n}\right).

(124)

∎

C.3 Proof of Lemma C.2

Proof of Lemma C.2.

We prove the four items in Lemma C.2 separately.

Item (a):

First we prove item (a) of Lemma C.2. By Markov inequality, for any $c>0$ , we have

	$\displaystyle\mathop{\rm Pr}\nolimits_{f^{\star}}\left(\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}\geq\ln n\right)=\mathop{\rm Pr}\nolimits_{f^{\star}}\left(\exp\left(\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}\right)\geq\exp(\ln n)\right)$	(125)
$\displaystyle\leq\;$	$\displaystyle\exp(-\ln n)\mathbb{E}_{f^{\star}}\left[\exp\left(\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}\right)\right]=\exp(-\ln n)\prod_{i=1}^{m}\mathbb{E}_{f^{\star}}\left[\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}\right]$	(126)
$\displaystyle=\;$	$\displaystyle 1/n.$	(127)

The last equality follows from the fact that ${\hat{f}}[\pi]$ and $f^{\star}[\pi]$ are both probability distributions given any decision $\pi\in\Pi.$

Recall that in this case $\pi^{\star}({\hat{f}})\neq\pi^{\star}(f^{\star})$ . Therefore,

		$\displaystyle\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm acc}^{\hat{f}})=\mathop{\rm Pr}\nolimits\left(\forall g\in{\mathcal{F}}\text{ and }\pi^{\star}(g)\neq\pi^{\star}({\hat{f}}),\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\ln n\right)$		(128)
	$\displaystyle\leq\;$	$\displaystyle\mathop{\rm Pr}\nolimits\left(\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}\geq\ln n\right)\leq 1/n.$		(129)

Item (b):

Let $\epsilon=1/\ln\ln n$ and $\alpha=\frac{\ln n}{m}$ . We prove this statement by invoking Lemma F.1 with parameters $(\alpha+5\epsilon,\epsilon,w)$ . Following Lemma F.1, let $\gamma=\frac{1}{m}\min_{\pi}\sum_{i=1}^{m}\mathbb{I}\left[\pi_{i}=\pi\right]$ and $\lambda=\lambda_{0}((\alpha+5\epsilon)/\gamma,\epsilon,{f^{\star}})$ be the value that satisfies Condition 1. Let

\epsilon_{0}=\frac{\exp(-(\alpha+5\epsilon)/\gamma)(\epsilon\lambda)^{1/\lambda}}{3{\rm Vol}}.

First of all, we prove that the condition for Lemma F.1 holds. That is,

\displaystyle m\geq\frac{1}{\lambda\epsilon}\left(\ln{\mathcal{N}}\left({\mathcal{F}},\epsilon_{0}\right)+\ln\ln n\right).

(130)

Recall that in Alg. 1, $\hat{w}$ is the solution of ${\mathcal{C}}({\hat{f}},(\ln\ln n)^{1/4})$ , $\bar{w}_{\pi}=\left(1+\frac{1}{(\ln\ln n)^{1/4}}\right)\hat{w}_{\pi}+\frac{1}{(\ln\ln n)^{1/4}}$ , and $m=\sum_{x}\lceil\bar{w}_{\pi}\ln(n)\rceil$ . As a result, $m\geq\frac{|\Pi|\ln n}{(\ln\ln n)^{1/4}}.$ Now consider the RHS of Eq. (130). By the definition of $\bar{w}_{\pi}$ we get $m\leq 2|\Pi|\ln n(\ln\ln n)^{1/4}$ , so $\alpha\leq\frac{1}{|\Pi|}(\ln\ln n)^{1/4}$ and $\gamma^{-1}\leq 2|\Pi|(\ln\ln n)^{1/2}$ . It follows from Condition 1 that $\lambda\geq\mathrm{poly}(1/\ln\ln n).$ By the definition of $\epsilon_{0}$ and Condition 2 we get

\displaystyle\ln{\mathcal{N}}\left({\mathcal{F}},\epsilon_{0}\right)\lesssim\ln(1/\epsilon_{0})\lesssim\mathrm{poly}(\ln\ln n).

(131)

Consequently, when $n$ is sufficiently large, Eq. (130) holds. By Lemma F.1 we get, with probability at least $1-1/\ln n$ ,

\displaystyle\sum_{i=1}^{m}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq(\alpha+\epsilon)m,\quad\forall g\in{\mathcal{F}}(w,f^{\star},\alpha+5\epsilon),

(132)

where ${\mathcal{F}}(w,f^{\star},\alpha+5\epsilon)=\{g\in{\mathcal{F}}:D_{\mathrm{KL}}^{w}({f^{\star}}\|g)\geq\alpha+5\epsilon\}$ . In the following, we prove that Eq. (132) implies ${\mathcal{E}}_{\rm acc}^{{\hat{f}}}.$ Recall that ${\mathcal{E}}_{\rm acc}^{{\hat{f}}}$ is the event defined as follows:

\displaystyle{\mathcal{E}}_{\rm acc}^{\hat{f}}

\displaystyle=\mathbb{I}\left[\forall g\in\Lambda({\hat{f}}),\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\ln n\right].

(133)

Recall that $\Lambda({\hat{f}})=\{g\in{\mathcal{F}},\pi^{\star}(g)\neq\pi^{\star}({\hat{f}})\}.$ Next, we apply Lemma G.4 to show that $\Lambda({\hat{f}})\subseteq{\mathcal{F}}(w,f^{\star},\alpha+5\epsilon).$ To verify the condition of Lemma G.4, we have $D_{\rm TV}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\lesssim\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}}$ for all $\pi\in\Pi$ by item (a) of Lemma C.1. By the definition of $\hat{w}$ we get

\displaystyle\forall g\in\Lambda({\hat{f}}),\quad\sum_{\pi\in\Pi}\hat{w}_{\pi}D_{\mathrm{KL}}({\hat{f}}[\pi]\|g[\pi])\geq 1.

(134)

Therefore when $n$ is large enough,

\displaystyle\forall g\in\Lambda({\hat{f}}),D_{\mathrm{KL}}^{w}(f^{\star}\|g)\geq\frac{\ln n}{m}+5\epsilon=\alpha+5\epsilon.

(135)

Then $\Lambda({\hat{f}})\subseteq{\mathcal{F}}(w,{f^{\star}},\alpha+5\epsilon)$ . It follows from Eq. (132) that

\displaystyle\sum_{i=1}^{m}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq(\alpha+\epsilon)m,\quad\forall g\in\Lambda({\hat{f}}).

(136)

Finally, by Condition 3 we get ${f^{\star}}[\pi](o)>c_{\rm min}$ for any $\pi\in\Pi\text{ and }o\in\operatorname{supp}({f^{\star}}[\pi])$ . As a result

\displaystyle{\left|{\ln\frac{{\hat{f}}[\pi](o)}{{f^{\star}}[\pi](o)}}\right|}\leq{\left|{\ln\left(1+\frac{{\hat{f}}[\pi](o)-{f^{\star}}[\pi](o)}{{f^{\star}}[\pi](o)}\right)}\right|}\leq{\left|{\ln\left(1+\frac{{\hat{f}}[\pi](o)-{f^{\star}}[\pi](o)}{c_{\rm min}}\right)}\right|}.

(137)

When $\|{\hat{f}}-f^{\star}\|_{\infty}\leq c_{\rm min}/2$ , applying the basic inequality ${\left|{\ln(1+x)}\right|}\leq 2x,\forall|x|\leq 1/2$ we get

\displaystyle{\left|{\ln\left(1+\frac{{\hat{f}}[\pi](o)-{f^{\star}}[\pi](o)}{c_{\rm min}}\right)}\right|}\leq\frac{2}{c_{\rm min}}\|{\hat{f}}-f^{\star}\|_{\infty}\lesssim\frac{2}{c_{\rm min}}\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}c_{5}},

(138)

where the last inequality comes from item (a) of Lemma C.1 and Condition 3. Therefore, for large enough $n$ we get ${\left|{\ln\frac{{\hat{f}}[\pi](o)}{{f^{\star}}[\pi](o)}}\right|}\leq\epsilon.$ As a result,

\displaystyle\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}=\sum_{i=1}^{m}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}+\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}\geq\alpha m=\ln n.

(139)

Since $g\in\Lambda({\hat{f}})$ is arbitrary, have

		$\displaystyle\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm acc}^{{\hat{f}}})=\mathop{\rm Pr}\nolimits\left(\forall g\in\Lambda({\hat{f}}),\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\ln n\right)$		(140)
	$\displaystyle\geq$	$\displaystyle\;\mathop{\rm Pr}\nolimits\left(\forall g\in{\mathcal{F}}(w,f^{\star},\alpha+5\epsilon),\sum_{i=1}^{m}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq(\alpha+\epsilon)m\right)\geq 1-1/\ln n.$		(141)

Item (c):

By the definition of ${\mathcal{C}}(\hat{f},(\ln\ln n)^{1/4})$ , we have $\hat{w}_{\pi}\leq(\ln\ln n)^{1/4}$ for every $\pi\in\Pi$ . As a result, $m\leq 2A\ln n(\ln\ln n)^{1/4}.$ Therefore, the expect regret of Step 2 is upper bounded by

\displaystyle\Delta_{\rm max}2A\ln n(\ln\ln n)^{1/4}=O(\ln n\ln\ln n).

(142)

Item (d):

Recall that $\hat{w}$ is the solution of ${\mathcal{C}}({\hat{f}},(\ln\ln n)^{1/4}).$ As a result, the regret of Step 2 is upper bounded by

\displaystyle\left(\sum_{\pi\in\Pi}\hat{w}_{\pi}\Delta({f^{\star}},\pi)+o(1)\right)\ln n

(143)

where $\Delta({f^{\star}},\pi)$ is the sub-optimality gap of decision $\pi$ under instance ${f^{\star}}$ . In the following, we prove that

\displaystyle\sum_{\pi\in\Pi}\hat{w}_{\pi}\Delta({f^{\star}},\pi)\leq{\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4}/2).

(144)

Let $\hat{w}^{\star}$ be the solution to ${\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4}/2)$ . Define $\delta=\frac{1}{(\ln\ln n)^{1/4}}.$ Let $\bar{w}^{\star}\triangleq\{\left(1+\delta\right)\hat{w}^{\star}_{\pi}+\delta\}_{\pi}$ and $m^{\star}=\sum_{\pi\in\Pi}\lceil\bar{w}_{\pi}^{\star}\ln n\rceil$ . We will show that $\bar{w}^{\star}$ is also (approximately) a solution of ${\mathcal{C}}({\hat{f}},(\ln\ln n)^{1/4}/2)$

By the definition of ${\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4}/2)$ , for any $g\in\Lambda({f^{\star}}),$ we have

\displaystyle\sum_{\pi\in\Pi}(\hat{w}^{\star}_{\pi})D_{\mathrm{KL}}({f^{\star}}[\pi]\|g[\pi])\geq 1.

(145)

Let $w^{\star}$ be the list of decisions that $\pi$ appears $\lceil\bar{w}^{\star}\ln n\rceil$ times for every $\pi\in\Pi$ . Define $\alpha^{\star}=\frac{\ln n}{m^{\star}}$ and $\epsilon=\frac{1}{\ln\ln n}.$ Next we apply Lemma G.4 with parameters $(\epsilon,w^{\star}).$ To verify its condition, item (a) of Lemma C.1 gives

\displaystyle D_{\rm TV}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\leq D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])^{1/2}\lesssim\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}/2},\quad\forall\pi\in\Pi,

(146)

which satisfies the condition of Lemma G.4. Consequently, we get $D_{\mathrm{KL}}^{w^{\star}}({\hat{f}}[\pi]\|g[\pi])\geq\frac{\ln n}{m^{\star}}$ for every $g\in\Lambda({f^{\star}})$ . Therefore,

\displaystyle\forall g\in\Lambda({f^{\star}}),\quad\sum_{\pi\in\Pi}\frac{\lceil\bar{w}^{\star}_{\pi}\ln n\rceil}{\ln n}D_{\mathrm{KL}}({\hat{f}}[\pi]\|g[\pi])\geq 1.

(147)

By item (c) of Lemma C.1, $\Lambda({\hat{f}})=\Lambda({f^{\star}})$ . When $n$ is large enough, we have $\frac{\lceil\bar{w}^{\star}_{\pi}\ln n\rceil}{\ln n}\leq(\ln\ln n)^{1/4}.$ Therefore, $\left\{\frac{\lceil\bar{w}^{\star}_{\pi}\ln n\rceil}{\ln n}\right\}_{\pi\in\Pi}$ satisfies all the constraints of ${\mathcal{C}}({\hat{f}},(\ln\ln n)^{1/4})$ . Recall that $\hat{w}$ is the solution to ${\mathcal{C}}({\hat{f}},(\ln\ln n)^{1/4})$ . By the optimality of $\hat{w}$ we have

\displaystyle\sum_{\pi}\hat{w}_{\pi}\Delta({\hat{f}},\pi)\leq\sum_{\pi}\frac{\lceil\bar{w}^{\star}_{\pi}\ln n\rceil}{\ln n}\Delta({\hat{f}},\pi)\leq\sum_{\pi}\bar{w}^{\star}_{\pi}\Delta({\hat{f}},\pi)+o(1).

(148)

By item (b) of Lemma C.1, ${\left|{\Delta({\hat{f}},\pi)-\Delta({f^{\star}},\pi)}\right|}\leq\iota({f^{\star}})\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}}.$ As a result,

		$\displaystyle\sum_{\pi}\bar{w}^{\star}_{\pi}\Delta({\hat{f}},\pi)\leq\sum_{\pi}\bar{w}^{\star}_{\pi}\Delta({f^{\star}},\pi)+o(1)$		(149)
	$\displaystyle\leq\;$	$\displaystyle\sum_{\pi}\hat{w}^{\star}_{\pi}\Delta({f^{\star}},\pi)+o(1)={\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4}/2)+o(1).$		(150)

In addition,

\displaystyle\sum_{\pi\in\Pi}\hat{w}_{\pi}\Delta({f^{\star}},\pi)\leq\sum_{\pi}\hat{w}_{\pi}\Delta({\hat{f}},\pi)+o(1).

(151)

Stitching the inequalities above we have

\displaystyle\sum_{\pi}\hat{w}_{\pi}\Delta({f^{\star}},\pi)\leq{\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4}/2)+o(1).

(152)

As a result, the regret in Step 2 is bounded by

\left({\mathcal{C}}({f^{\star}},(\ln\ln n)^{1/4}/2)+o(1)\right)\ln n.

∎

Appendix D Proof of Conditions for Tabular Reinforcement Learning

In this section, we first prove that Conditions 1- 3 holds for tabular RL with truncated Gaussian reward. Consequently, Theorem 5.2 gives the first asymptotic instance-optimal regret bound in this setting. Later in Appendix E, we extend our analysis to Gaussian reward.

First of all, we define some additional notations. A finite horizon Markov Decision Process (MDP) is denoted by a tuple $({\mathcal{S}},{\mathcal{A}},p,r,H)$ . ${\mathcal{S}}$ and ${\mathcal{A}}$ are the state and action spaces respectively. $H>0$ denotes the horizon. The tansition function $p$ maps a state-action pair $(s,a)\in{\mathcal{S}}\times{\mathcal{A}}$ to a distribution over ${\mathcal{S}}$ , and the reward function $r$ maps state-action pair $(s,a)$ to a truncated Gaussian distribution. In particular, for a fixed $(s,a)$ , the mean reward is $\mu(s,a)\in[-1,1]$ , and the density of the reward $r(s,a)$ is

\displaystyle r[s,a](x)=\mathbb{I}\left[x\in[-2,2]\right]\frac{1}{Z}\exp\left(-\frac{(x-\mu(s,a))^{2}}{2}\right),

(153)

where $Z=\int_{x\in[-2,2]}\exp\left(-(x-\mu(s,a))^{2}/2\right)$ is the normalization factor. We use a fixed truncation for every state-action pair regardless of $\mu(s,a)$ because otherwise the KL divergence of two instances will easily be infinity. Without loss of generality, we assume ${\mathcal{S}}$ is partitioned into ${\mathcal{S}}_{1},\cdots,{\mathcal{S}}_{H}$ where for any state $s\in{\mathcal{S}}_{h}$ and action $a\in{\mathcal{A}}$ , $\operatorname{supp}(p(s,a))\subseteq{\mathcal{S}}_{h+1}$ . We also assume that the initial state $s_{1}$ is fixed.

A decision (a.k.a. policy) $\pi:{\mathcal{S}}\to{\mathcal{A}}$ is a deterministic mapping from a state to an action, and $\Pi={\mathcal{A}}^{\mathcal{S}}$ is the set of possible decisions with $|\Pi|=|{\mathcal{A}}|^{|{\mathcal{S}}|}<\infty.$ An observation $o$ is a trajectory $(s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H})$ . The reward of an observation is $R(o)=\sum_{h=1}^{H}r_{h}.$ We emphasize that $s_{h},a_{h}$ are discrete random variables and $r_{h}\in\mathbb{R}$ is a continuous random variable. In the following, we use $o^{\rm sa}=(s_{1},a_{1},\cdots,s_{H},a_{H})$ to denote the set of state-action pairs in $o$ and $o^{\rm r}=(r_{1},\cdots,r_{H})$ the set of rewards. As a basic property of MDP, the rewards are independent conditioned on $o^{\rm sa}$ , and $r_{h}\sim r(s_{h},a_{h})$ .

An instance $f$ is a represented by the transition function $p$ and reward function $r$ (which is uniquely determined by $\mu:{\mathcal{S}}\times{\mathcal{A}}\to[-1,1]$ ). The family of instances ${\mathcal{F}}$ is defined as the set of all possible instances.

For an instance $f$ and a decision $\pi$ , the density of an observation $o=(s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H})$ is

\displaystyle f[\pi](o)=\prod_{h=1}^{H}p[s_{h},a_{h}](s_{h+1})r[s_{h},a_{h}](r_{h}).

(154)

For simplicity, we also use $f_{p}[s,a](s^{\prime})$ to denote the distribution $p[s,a](s^{\prime})$ , and $f_{r}[s,a](r)$ to denote $r[s,a](r)$ .

By definition, for all $s,a,r,s^{\prime}$ we have $p[s,a](s^{\prime})r[s,a](r)<1.$ We also define

\displaystyle f[\pi](o^{\rm sa})=\prod_{h=1}^{H}p[s_{h},a_{h}](s_{h+1})

(155)

to be the marginal density of $o^{\rm sa},$ and

f[\pi](o^{\rm r}\mid o^{\rm sa})=\prod_{h=1}^{H}r[s_{h},a_{h}](r_{h})

the conditional distribution of $o^{\rm r}.$

For an instance $f\in{\mathcal{F}}$ , we define $\mu_{\rm min}(f)=\min_{o^{\rm sa},\pi:f[\pi](o^{\rm sa})>0}f[\pi](o^{\rm sa}).$ Since $o^{\rm sa},\pi$ are finite, we have $\mu_{\rm min}(f)>0$ .

D.1 Proof of Theorem 5.1

In this section, we prove Theorem 5.1.

Proof.

In the following, we prove the three conditions separately.

Proof of Condition 1.

We prove this condition by invoking Lemma D.1. For instances $f,g\in{\mathcal{F}}$ and state-action pair $s,a\in{\mathcal{S}}\times{\mathcal{A}}$ , let $f_{r}[s,a],g_{r}[s,a]$ be their reward distributions respectively. Since $f_{r}[s,a],g_{r}[s,a]$ are truncated Gaussian distributions, we get $\exp(-3)\leq g_{r}[s,a](x)\leq 1$ for every $x\in\operatorname{supp}(f_{r}[s,a])$ and $g\in{\mathcal{F}}$ . Therefore $\sup_{x}{\left|{\ln\frac{f_{r}[s,a](x)}{g_{r}[s,a](x)}}\right|}\leq 3$ for all $s\in{\mathcal{S}},a\in{\mathcal{A}}.$ By Lemma D.1 with $c_{M}=3$ , we get Condition 1.

Proof of Condition 2.

The first part of this condition is proved by Lemma D.2. On the other hand, we have

\displaystyle\int 1\mathrm{d}o=(2|{\mathcal{S}}||{\mathcal{A}}|)^{H}<\infty.

(156)

Proof of Condition 3.

To prove the first part of this condition, recall that

\displaystyle{f^{\star}}[\pi](o)=\prod_{h=1}^{H}p[s_{h},a_{h}](s_{h+1})r[s_{h},a_{h}](r_{h}),

(157)

where $p$ is the transition function and $r$ is the reward distribution. Let

c_{\rm min}=\left(\min_{s,a\in{\mathcal{S}}\times{\mathcal{A}},s^{\prime}\in\operatorname{supp}p[s,a](\cdot)}p[s,a](s^{\prime})\right)^{H}\exp(-4H),

As a result, ${f^{\star}}[\pi](o)>c_{\rm min}$ for all $o\in\operatorname{supp}({f^{\star}}[\pi])$ . We prove the second part of this condition in Lemma D.3. ∎

D.2 Proof of Condition 1

In this section, we present a lemma that establishes Condition 1 for tabular RL.

Lemma D.1.

Consider any fixed RL instance $f$ with discrete state, action and general reward distribution. Suppose there exists a constant $c_{M}>0$ such that for any $g\in{\mathcal{F}},s\in{\mathcal{S}},a\in{\mathcal{A}}$ , the reward distributions of instance $f$ and $g$ at state $s$ and action $a$ (denoted by $f_{r}[s,a],g_{r}[s,a]$ respectively) satisfy

\displaystyle\mathbb{E}_{x\sim f_{r}[s,a]}\left[\left(\ln\frac{f_{r}[s,a](x)}{g_{r}[s,a](x)}\right)^{4}\right]\leq c_{M}^{4}.

(158)

Then for every $\alpha>0,\epsilon\in(0,1)$ , Condition 1 holds with

\displaystyle\lambda_{0}(\alpha,\epsilon,f)=\frac{\epsilon}{32H^{2}}\min\left\{\frac{\mu_{\rm min}(f)}{4\alpha},\frac{\mu_{\rm min}(f)}{10},\frac{1}{c_{M}}\right\}^{2}.

(159)

Proof of Lemma D.1.

Let

\displaystyle\kappa=\frac{\mu_{\rm min}(f)}{e^{2}}\exp\left(-\frac{2\alpha}{\mu_{\rm min}(f)}\right).

(160)

Recall that for reinforcement learning, an observation $o=(s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H})$ represents a trajectory, and $o^{\rm sa}=(s_{1},a_{1},s_{2},a_{2},\cdots,s_{H},a_{H})$ denotes the state-action pairs in the trajectory. Consider any fixed decision $\pi$ , for any $\lambda<\lambda_{0}$ , we prove the following two cases separately.

Case 1: $\min_{o^{\rm sa}:f[\pi](o^{\rm sa})>0}g[\pi](o^{\rm sa})<\kappa.$

In this case, we prove that $D_{1-\lambda}(f[\pi]\|g[\pi])\geq\alpha.$

Let $\hat{o^{\rm sa}}=\operatorname*{argmin}_{o^{\rm sa}:f[\pi](o^{\rm sa})>0}g[\pi](o^{\rm sa}).$ By the condition of this case we have $g[\pi](\hat{o^{\rm sa}})<\kappa.$ Applying Lemma G.1 we get

\displaystyle D_{1-\lambda}(f[\pi](o)\|g[\pi](o))\geq D_{1-\lambda}(f[\pi](o^{\rm sa})\|g[\pi](o^{\rm sa})).

(161)

In the following we prove the RHS of Eq. (161) is larger than $\alpha$ . We start with Hölder’s inequality and the basic inequality that $(1-x)^{t}\leq 1-tx$ for any $t\in(0,1),x\in(0,1)$ :

	$\displaystyle\sum_{o}^{\rm sa}f[\pi](o^{\rm sa})^{1-\lambda}g[\pi](o^{\rm sa})^{\lambda}$	(162)
$\displaystyle=\;$	$\displaystyle\sum_{o^{\rm sa}\neq\hat{o^{\rm sa}}}f[\pi](o^{\rm sa})^{1-\lambda}g[\pi](o^{\rm sa})^{\lambda}+f[\pi](\hat{o}^{\rm sa})^{1-\lambda}g[\pi](\hat{o}^{\rm sa})^{\lambda}$	(163)
$\displaystyle\leq\;$	$\displaystyle\left(\sum_{o^{\rm sa}\neq\hat{o^{\rm sa}}}f[\pi](o^{\rm sa})\right)^{1-\lambda}\left(\sum_{o^{\rm sa}\neq\hat{o^{\rm sa}}}g[\pi](o^{\rm sa})\right)^{\lambda}+f[\pi](\hat{o}^{\rm sa})^{1-\lambda}g[\pi](\hat{o}^{\rm sa})^{\lambda}$	(164)
$\displaystyle\leq\;$	$\displaystyle\left(1-f[\pi](\hat{o}^{\rm sa})\right)^{1-\lambda}\left(1-g[\pi](\hat{o}^{\rm sa})\right)^{\lambda}+f[\pi](\hat{o}^{\rm sa})^{1-\lambda}g[\pi](\hat{o}^{\rm sa})^{\lambda}$	(165)
$\displaystyle\leq\;$	$\displaystyle\left(1-f[\pi](\hat{o}^{\rm sa})\right)^{1-\lambda}+f[\pi](\hat{o}^{\rm sa})^{1-\lambda}g[\pi](\hat{o}^{\rm sa})^{\lambda}$	(166)
$\displaystyle\leq\;$	$\displaystyle 1-f[\pi](\hat{o}^{\rm sa})(1-\lambda)+f[\pi](\hat{o}^{\rm sa})^{1-\lambda}g[\pi](\hat{o}^{\rm sa})^{\lambda}.$	(167)

Recall the basic inequality $\ln(1+x)\leq x$ for all $x>-1$ . Therefore,

	$\displaystyle\frac{1}{\lambda}\ln\left(\sum_{o}^{\rm sa}f[\pi](o^{\rm sa})^{1-\lambda}g[\pi](o^{\rm sa})^{\lambda}\right)$	(168)
$\displaystyle\leq\;$	$\displaystyle\frac{1}{\lambda}\left(-f[\pi](\hat{o}^{\rm sa})(1-\lambda)+f[\pi](\hat{o}^{\rm sa})^{1-\lambda}g[\pi](\hat{o}^{\rm sa})^{\lambda}\right)$	(169)
$\displaystyle\leq\;$	$\displaystyle\frac{1}{\lambda}\left(f[\pi](\hat{o}^{\rm sa})\left(\left(\frac{g(\hat{o}^{\rm sa})}{f(\hat{o}^{\rm sa})}\right)^{\lambda}-1\right)+\lambda f[\pi](\hat{o}^{\rm sa})\right)$	(170)
$\displaystyle\leq\;$	$\displaystyle\frac{1}{\lambda}\left(f[\pi](\hat{o}^{\rm sa})\left(\left(\frac{\kappa}{f(\hat{o}^{\rm sa})}\right)^{\lambda}-1\right)+\lambda f[\pi](\hat{o}^{\rm sa})\right).$	(171)

Recall the basic inequality that $\exp(x)\leq 1+x/2$ for all $-1\leq x\leq 0$ . Since we have

\left(\frac{\kappa}{f[\pi](\hat{o}^{\rm sa})}\right)^{\lambda}=\exp\left(\lambda\ln\left(\frac{\kappa}{f[\pi](\hat{o}^{\rm sa})}\right)\right),

when $\lambda\leq\left(\ln\left(f[\pi](\hat{o}^{\rm sa})/\kappa\right)\right)^{-1}$ we get

		$\displaystyle\frac{1}{\lambda}\left(f[\pi](\hat{o}^{\rm sa})\left(\left(\frac{\kappa}{f[\pi](\hat{o}^{\rm sa})}\right)^{\lambda}-1\right)+\lambda f[\pi](\hat{o}^{\rm sa})\right)$		(172)
	$\displaystyle\leq\;$	$\displaystyle\frac{1}{2}f[\pi](\hat{o}^{\rm sa})\ln(\kappa/f[\pi](\hat{o}^{\rm sa}))+f[\pi](\hat{o}^{\rm sa})=\frac{1}{2}f[\pi](\hat{o}^{\rm sa})\ln(e^{2}\kappa/f[\pi](\hat{o}^{\rm sa})).$		(173)

By the definition of $\kappa$ we get

\displaystyle\frac{1}{2}f[\pi](\hat{o}^{\rm sa})\ln(e^{2}\kappa/f[\pi](\hat{o}^{\rm sa}))\leq-\alpha,

(174)

which leads to $D_{1-\lambda}(f[\pi](o^{\rm sa})\|g[\pi](o^{\rm sa}))\geq\alpha.$

Case 2: $\min_{o^{\rm sa}:f[\pi](o^{\rm sa})>0}g[\pi](o^{\rm sa})\geq\kappa.$

By Lemma G.8, for any $\lambda\in\left(0,1/2\right)$ we get

\displaystyle D_{\mathrm{KL}}(f[\pi]\|g[\pi])-D_{1-\lambda}(f[\pi]\|g[\pi])\leq\frac{\lambda}{2}\mathbb{E}_{o\sim f[\pi]}\left[\left(\ln\frac{f[\pi](o)}{g[\pi](o)}\right)^{4}\right]^{1/2}.

(175)

Let $f_{r},g_{r}:{\mathcal{S}}\times{\mathcal{A}}\to\Delta(\mathbb{R})$ be the reward distributions of instance $f$ and $g$ respectively, and $f_{r}[s,a](\cdot),g_{r}[s,a](\cdot)$ the densities of the reward given state-action pair $(s,a)$ . Recall that for a trajectory $o=(s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H})$ we have

\displaystyle f[\pi](o)=f[\pi](o^{\rm sa})\prod_{h=1}^{H}f_{r}[s_{h},a_{h}](r_{h}).

(176)

By Hölder’s inequality we get

	$\displaystyle\mathbb{E}_{o\sim f[\pi]}\left[\left(\ln\frac{f[\pi](o)}{g[\pi](o)}\right)^{4}\right]$	(177)
$\displaystyle=\;$	$\displaystyle\mathbb{E}_{o\sim f[\pi]}\left[\left(\ln\frac{f[\pi](o^{\rm sa})}{g[\pi](o^{\rm sa})}+\sum_{h=1}^{H}\ln\frac{f_{r}[s_{h},a_{h}](r_{h})}{g_{r}[s_{h},a_{h}](r_{h})}\right)^{4}\right]$	(178)
$\displaystyle=\;$	$\displaystyle\mathbb{E}_{o\sim f[\pi]}\left[(H+1)^{3}\left(\left(\ln\frac{f[\pi](o^{\rm sa})}{g[\pi](o^{\rm sa})}\right)^{4}+\sum_{h=1}^{H}\left(\ln\frac{f_{r}[s_{h},a_{h}](r_{h})}{g_{r}[s_{h},a_{h}](r_{h})}\right)^{4}\right)\right]$	(179)
$\displaystyle\leq\;$	$\displaystyle(H+1)^{3}\ln(1/\kappa)^{4}+(H+1)^{4}\sup_{s\in{\mathcal{S}},a\in{\mathcal{A}}}\mathbb{E}_{x\sim f_{r}[s,a]}\left[\left(\ln\frac{f_{r}[s,a](x)}{g_{r}[s,a](x)}\right)^{4}\right]$	(180)
$\displaystyle\leq\;$	$\displaystyle(H+1)^{3}\ln(1/\kappa)^{4}+(H+1)^{4}c_{M}^{4},$	(181)

where the last inequality comes from item (c) of Condition 4. Therefore, when $\lambda\leq\epsilon(H+1)^{-2}\min\{\ln(1/\kappa)^{-2},c_{M}^{-2}\}$ we get

	$\displaystyle D_{\mathrm{KL}}(f[\pi]\\|g[\pi])-D_{1-\lambda}(f[\pi]\\|g[\pi])\leq\frac{\lambda}{2}\mathbb{E}_{o\sim f[\pi]}\left[\left(\ln\frac{f[\pi](o)}{g[\pi](o)}\right)^{4}\right]^{1/2}$	(182)
$\displaystyle\leq\;$	$\displaystyle\frac{\epsilon}{2}(H+1)^{-2}\min\{\ln(1/\kappa)^{-2},c_{M}^{-2}\}\left((H+1)^{3}\ln(1/\kappa)^{4}+(H+1)^{4}c_{M}^{4}\right)^{1/2}$	(183)
$\displaystyle\leq\;$	$\displaystyle\frac{\epsilon}{2}\min\{\ln(1/\kappa)^{-2},c_{M}^{-2}\}\left(\ln(1/\kappa)^{4}+c_{M}^{4}\right)^{1/2}$	(184)
$\displaystyle\leq\;$	$\displaystyle\frac{\epsilon}{2}\min\{\ln(1/\kappa)^{-2},c_{M}^{-2}\}\left(\ln(1/\kappa)^{2}+c_{M}^{2}\right)$	(185)
$\displaystyle\leq\;$	$\displaystyle\epsilon.$	(186)

Recall that

\displaystyle\kappa=\frac{\mu_{\rm min}(f)}{e^{2}}\exp\left(-\frac{2\alpha}{\mu_{\rm min}(f)}\right).

(187)

By algebraic manipulation we get

\displaystyle\frac{\epsilon}{32H^{2}}\min\left\{\frac{\mu_{\rm min}(f)}{4\alpha},\frac{\mu_{\rm min}(f)}{10},\frac{1}{c_{M}}\right\}^{2}\leq\epsilon(H+1)^{-2}\min\{\ln(1/\kappa)^{-2},c_{M}^{-2}\},

(188)

which proves the desired result. ∎

D.3 Proof of Condition 2

Recall that for two instances $f,g\in{\mathcal{F}}$ , their distance is $d(f,g)=\sup_{\pi,o}{\left|{f[\pi](o)-g[\pi](o)}\right|}.$

Lemma D.2.

Suppose ${\mathcal{F}}$ represents tabular RL with truncated Gaussian reward with state space ${\mathcal{S}}$ and action space ${\mathcal{A}}$ , then we have $\ln{\mathcal{N}}({\mathcal{F}},\epsilon)\leq{\mathcal{O}}(|{\mathcal{S}}||{\mathcal{A}}|\ln(1/\epsilon))$ for every $\epsilon>0$ .

Proof.

For instances $f\in{\mathcal{F}}$ and state-action pair $s,a\in{\mathcal{S}}\times{\mathcal{A}}$ , let $f_{r}[s,a]$ be its reward distribution and $f_{p}[s,a]$ its transition. Recall that for tabular RL we have

\displaystyle f[\pi](o)=\prod_{h=1}^{H}f_{p}[s_{h},a_{h}](s_{h+1})f_{r}[s_{h},a_{h}](r_{h}),

(189)

where the observation is $o=(s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H})$ . Consequently,

		$\displaystyle{\left\|{f[\pi](o)-g[\pi](o)}\right\|}$
	$\displaystyle\leq$	$\displaystyle\sum_{h=1}^{H}\Bigg{(}{\left\|{f_{p}[s_{h},a_{h}](s_{h+1})f_{r}[s_{h},a_{h}](r_{h})-g_{p}[s_{h},a_{h}](s_{h+1})g_{r}[s_{h},a_{h}](r_{h})}\right\|}$
		$\displaystyle\qquad\prod_{h^{\prime}=1}^{h-1}f_{p}[s_{h^{\prime}},a_{h^{\prime}}](s_{h^{\prime}+1})f_{r}[s_{h^{\prime}},a_{h^{\prime}}](r_{h^{\prime}})\prod_{h^{\prime}=h+1}^{H}g_{p}[s_{h^{\prime}},a_{h^{\prime}}](s_{h^{\prime}+1})g_{r}[s_{h^{\prime}},a_{h^{\prime}}](r_{h^{\prime}})\Bigg{)}$
	$\displaystyle\leq$	$\displaystyle\sum_{h=1}^{H}{\left\|{f_{p}[s_{h},a_{h}](s_{h+1})f_{r}[s_{h},a_{h}](r_{h})-g_{p}[s_{h},a_{h}](s_{h+1})g_{r}[s_{h},a_{h}](r_{h})}\right\|}$
	$\displaystyle\leq$	$\displaystyle\sum_{h=1}^{H}\left({\left\|{f_{p}[s_{h},a_{h}](s_{h+1})-g_{p}[s_{h},a_{h}](s_{h+1})}\right\|}+{\left\|{f_{r}[s_{h},a_{h}](r_{h})-g_{r}[s_{h},a_{h}](r_{h})}\right\|}\right)$

Therefore, we construct a covering ${\mathcal{C}}$ such that for every $f\in{\mathcal{F}}$ , there exists $g\in{\mathcal{C}}$ with

	$\displaystyle{\left\|{f_{p}[s,a](s^{\prime})-g_{p}[s,a](s^{\prime})}\right\|}\leq\frac{\epsilon}{2H},\quad\forall s,a,s^{\prime},$		(190)
	$\displaystyle{\left\|{f_{r}[s,a](r)-g_{r}[s,a](r)}\right\|}\leq\frac{\epsilon}{2H},\quad\forall s,a,r.$		(191)

Since $f_{p}[s,a]$ is a discrete distribution, the covering number for the transition function is upper bounded by $(4H/\epsilon)^{|{\mathcal{S}}|^{2}|{\mathcal{A}}|}.$ On the other hand, $f_{r}[s,a](r)$ is a truncated Gaussian distribution, so covering number for the reward is also upper bounded by $(4H/\epsilon)^{|{\mathcal{S}}||{\mathcal{A}}|}.$ As a result,

\displaystyle\ln{\mathcal{N}}({\mathcal{F}},\epsilon)\leq{\mathcal{O}}(|{\mathcal{S}}||{\mathcal{A}}|\ln(1/\epsilon)).

(192)

∎

D.4 Proof of Condition 3

Lemma D.3.

Consider tabular reinforcement learning with truncated Gaussian reward. For a fixed instance ${f^{\star}}$ , for all $f\in{\mathcal{F}},\pi\in\Pi$ we have

\displaystyle\|{f^{\star}}[\pi]-f[\pi]\|_{\infty}\leq\frac{37H}{\mu_{\rm min}({f^{\star}})^{1/6}}D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi])^{1/6}.

(193)

Proof.

Recall that for tabular RL, an observation is $o=(s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H})$ . We use $o^{\rm sa}=(s_{1},a_{1},s_{2},a_{2},\cdots,s_{H},a_{H})$ to denote the collection of states and actions in the trajectory $o$ , and $o^{\rm r}=(r_{1},\cdots,r_{H})$ the collection of of rewards.For instances $f\in{\mathcal{F}}$ and state-action pair $s,a\in{\mathcal{S}}\times{\mathcal{A}}$ , let $f_{r}[s,a]$ be its reward distribution and $f_{p}[s,a]$ its transition.

Let $\epsilon_{0}=\frac{H}{\mu_{\rm min}({f^{\star}})^{1/6}}.$ Consider the random variables $o^{\rm sa}$ and $o^{\rm r}.$ By the chain rules of KL divergence we have

\displaystyle D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi])=D_{\mathrm{KL}}({f^{\star}}[\pi](o^{\rm sa})\|f[\pi](o^{\rm sa}))+\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}}\left[D_{\mathrm{KL}}({f^{\star}}[\pi](o^{\rm r}\mid o^{\rm sa})\|f[\pi](o^{\rm r}\mid o^{\rm sa}))\right].

(194)

Since $o^{\rm sa}$ is a discrete random variable, we have

\displaystyle{\left|{{f^{\star}}[\pi](o^{\rm sa})-f[\pi](o^{\rm sa})}\right|}\leq D_{\rm TV}(({f^{\star}}[\pi]\|f[\pi])\leq D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi])^{1/2}.

(195)

Therefore, for any $o^{\rm sa}\not\in\operatorname{supp}({f^{\star}}[\pi])$ ,

\displaystyle{\left|{{f^{\star}}[\pi](o^{\rm sa},o^{\rm r})-f[\pi](o^{\rm sa},o^{\rm r})}\right|}\leq f[\pi](o^{\rm sa},o^{\rm r})\leq f[\pi](o^{\rm sa})f[\pi](o^{\rm r}\mid o^{\rm sa})\leq D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi])^{1/2},

(196)

where the last inequality comes from the fact that $f[\pi](o^{\rm r}\mid o^{\rm sa})=\prod_{(s_{h},a_{h},r_{h})\in o}f_{r}[s_{h},a_{h}](r_{h})\leq 1.$

Now, for any $o^{\rm sa}\in\operatorname{supp}({f^{\star}}[\pi])$ , by Eq. (194) we get

\displaystyle D_{\mathrm{KL}}({f^{\star}}[\pi](o^{\rm r}\mid o^{\rm sa})\|f[\pi](o^{\rm r}\mid o^{\rm sa}))\leq\frac{1}{\mu_{\rm min}({f^{\star}})}D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi]).

(197)

By the chain rule of KL divergence,

\displaystyle D_{\mathrm{KL}}({f^{\star}}[\pi](o^{\rm r}\mid o^{\rm sa})\|f[\pi](o^{\rm r}\mid o^{\rm sa}))=\sum_{h=1}^{H}D_{\mathrm{KL}}({f^{\star}}_{r}[s_{h},a_{h}]\|f_{r}[s_{h},a_{h}]).

(198)

As a result, for every $(s,a)\in o^{\rm sa}$ we get

\displaystyle D_{\mathrm{KL}}({f^{\star}}_{r}[s,a]\|f_{r}[s,a])\leq\frac{1}{\mu_{\rm min}({f^{\star}})}D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi]).

(199)

By Lemma G.6, ${\left|{{f^{\star}}_{r}[s,a](r)-f_{r}[s,a](r)}\right|}\leq 36\left(\frac{1}{\mu_{\rm min}({f^{\star}})}D_{\mathrm{KL}}({f^{\star}}[\pi]\|f[\pi])\right)^{1/6}.$ Therefore,

	$\displaystyle{\left\|{{f^{\star}}[\pi](o^{\rm r}\mid o^{\rm sa})-f[\pi](o^{\rm r}\mid o^{\rm sa})}\right\|}$	(200)
$\displaystyle=\;$	$\displaystyle{\left\|{\prod_{h=1}^{H}{f^{\star}}_{r}[s_{h},a_{h}](r_{h})-\prod_{h=1}^{H}f_{r}[s_{h},a_{h}](r_{h})}\right\|}$	(201)
$\displaystyle\leq\;$	$\displaystyle\sum_{h=1}^{H}{\left\|{{f^{\star}}_{r}[s_{h},a_{h}](r_{h})-f_{r}[s_{h},a_{h}](r_{h})}\right\|}$	(202)
$\displaystyle\leq\;$	$\displaystyle 36H\left(\frac{1}{\mu_{\rm min}({f^{\star}})}D_{\mathrm{KL}}({f^{\star}}[\pi]\\|f[\pi])\right)^{1/6}.$	(203)

It follows that,

	$\displaystyle{\left\|{{f^{\star}}[\pi](o^{\rm sa},o^{\rm r})-f[\pi](o^{\rm sa},o^{\rm r})}\right\|}$	(204)
$\displaystyle\leq\;$	$\displaystyle{\left\|{{f^{\star}}[\pi](o^{\rm sa})-f[\pi](o^{\rm sa})}\right\|}+{\left\|{{f^{\star}}[\pi](o^{\rm r}\mid o^{\rm sa})-f[\pi](o^{\rm r}\mid o^{\rm sa})}\right\|}$	(205)
$\displaystyle\leq\;$	$\displaystyle D_{\mathrm{KL}}({f^{\star}}[\pi]\\|f[\pi])^{1/2}+36H\left(\frac{1}{\mu_{\rm min}({f^{\star}})}D_{\mathrm{KL}}({f^{\star}}[\pi]\\|f[\pi])\right)^{1/6}$	(206)
$\displaystyle\leq\;$	$\displaystyle 37H\left(\frac{1}{\mu_{\rm min}({f^{\star}})}D_{\mathrm{KL}}({f^{\star}}[\pi]\\|f[\pi])\right)^{1/6}$	(207)

∎

Appendix E RL with General Reward Distribution

In this section, we extend our analysis to RL with general reward distribution, where the support of the reward may have infinite volume (e.g., the real line $\mathbb{R}$ when the reward distribution is Gaussian). We assume that for any state-action pair $(s,a)$ , the reward distribution $r[s,a]$ comes from a distribution family ${\mathcal{R}}$ (e.g., ${\mathcal{R}}=\{{\mathcal{N}}(\mu,1):\mu\in[-1,1]\}$ when the reward is Gaussian). For any $g\in{\mathcal{R}}$ , let $g(\cdot)$ be its density function and $\mu(g)\triangleq\mathbb{E}_{x\sim g}[x]$ its mean and we assume $\sup_{g\in{\mathcal{R}}}\mu(g)\leq R_{\rm max}.$ We emphasize that for general reward distributions, we do not require Conditions 2 and 3 because they do not hold for Gaussian distribution. Instead, we require the following.

Condition 4.

Let ${\mathcal{R}}$ be the reward distribution family. Then

(a)

for all $f\in{\mathcal{R}}$ , there exists a constant $c_{7}\in(0,1],c_{8}\in(0,1],c_{M}>0,$ such that for every $\delta>0$ ,

\displaystyle\mathop{\rm Pr}\nolimits_{x\sim f}\left(\forall g,g^{\prime}\in{\mathcal{R}},\;{\left|{\ln\frac{g(x)}{g^{\prime}(x)}}\right|}>D_{\rm TV}(g\|g^{\prime})^{c_{7}}\mathrm{polylog}(1/\delta)\right)\leq\delta;

(208)

(b)

for all $g,g^{\prime}\in{\mathcal{R}}$ , ${\left|{\mu(g)-\mu(g^{\prime})}\right|}\lesssim D_{\rm TV}(g\|g^{\prime})^{c_{8}}$ ;
(c)

for all $g,g^{\prime}\in{\mathcal{R}}$ ,

$\mathbb{E}_{x\sim g}\left[\left(\ln\frac{g(x)}{g^{\prime}(x)}\right)^{4}\right]\leq c_{M}^{4};$

(d)

for any $\epsilon>0$ , there exists a covering ${\mathcal{C}}({\mathcal{R}},\epsilon)\subseteq{\mathcal{R}}$ such that

\displaystyle\forall g\in{\mathcal{R}},\;\exists g^{\prime}\in{\mathcal{C}}({\mathcal{R}},\epsilon),\;D_{\rm TV}(g\|g^{\prime})\leq\epsilon.

(209)

And $\log|{\mathcal{C}}({\mathcal{R}},\epsilon)|={\mathcal{O}}(\log(1/\epsilon)).$

Condition 1 still holds in this case because of Lemma D.1 and item (c) of Condition 4.

For tabular reinforcement learning problems, we only require Condition 4 for the reward distribution of any fixed state-action pair, which is one-dimensional. Condition 4 holds for Gaussian distribution (see Proposition E.5), Laplace distribution, etc.

Our main result for tabular RL is stated as follows.

Theorem E.1.

Suppose ${\mathcal{F}}$ is the hypothesis class representing tabular reinforcement learning with a reward distribution that satisfies Conditions 4, then the regret of Alg. 1 satisfies

\displaystyle\limsup_{n\to\infty}\frac{\mathrm{Reg}_{{f^{\star}},n}}{\ln n}\leq{\mathcal{C}}({f^{\star}}).

(210)

In section E.1 we present the main lemmas for Steps 1-2. Then, Theorem E.1 is proved using exactly the same approach as Theorem 5.2 (by replacing the main lemmas for Steps 1-2).

E.1 Lemmas for RL with General Reward Distribution

In this section, we state the main lemmas for RL with general reward distribution (namely, Lemma E.2 for Step 1 and Lemma E.3 for Step 2).

Lemma E.2.

Let ${\mathcal{E}}_{\rm init}$ be the event that, there exists a universal constant $c_{4}>0$ and a function $\iota({f^{\star}})$ that only depends on $f^{\star}$ such that

(a)

for all $\pi\in\Pi$ , $D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\leq\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}}$ ,
(b)

${\left|{R_{{\hat{f}}}(\pi)-R_{{f^{\star}}}(\pi)}\right|}\lesssim 2HR_{\rm max}\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}}$ , for all $\pi\in\Pi$ ,
(c)

$\pi^{\star}({\hat{f}})=\pi^{\star}({f^{\star}}).$

For tabular reinforcement learning with general reward distribution that satisfies Condition 4, there exists $n_{0}>0$ such that when $n>n_{0}$ , $\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm init})\geq 1-1/\ln n.$ In addition, the regret of Step 1 is upper bounded by ${\mathcal{O}}(\frac{\ln n}{\ln\ln n}).$

Proof.

In the following, we prove the three items separately. Recall that Condition 1 holds by Lemma D.1 and item (c) of Condition 4.

Items (a) and (c):

Proofs of (a) and (c) are the same as that of Lemma C.1 if we replace Lemma F.1 by Lemma F.2.

Item (b):

Recall that an observation is a sequence of state-action-reward tuples: $o=(s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H})$ . Therefore, for all $\pi\in\Pi$ we have

		$\displaystyle{\left\|{R_{{\hat{f}}}(\pi)-R_{{f^{\star}}}(\pi)}\right\|}={\left\|{\mathbb{E}_{o\sim{\hat{f}}[\pi]}\left[\sum_{h=1}^{H}r_{h}\right]-\mathbb{E}_{o\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}r_{h}\right]}\right\|}$		(211)
	$\displaystyle\leq\;$	$\displaystyle{\left\|{\mathbb{E}_{o^{\rm sa}\sim{\hat{f}}[\pi]}\left[\sum_{h=1}^{H}\hat{\mu}(s_{h},a_{h})\right]-\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}\mu^{\star}(s_{h},a_{h})\right]}\right\|},$		(212)

where $\hat{\mu}$ and $\mu^{\star}$ are the mean reward of instance ${\hat{f}}$ , ${f^{\star}}$ respectively. Since $\hat{\mu}\in[0,1],$ we have

	$\displaystyle{\left\|{\mathbb{E}_{o^{\rm sa}\sim{\hat{f}}[\pi]}\left[\sum_{h=1}^{H}\hat{\mu}(s_{h},a_{h})\right]-\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}\mu^{\star}(s_{h},a_{h})\right]}\right\|}$	(213)
$\displaystyle\leq\;$	$\displaystyle{\left\|{\mathbb{E}_{o^{\rm sa}\sim{\hat{f}}[\pi]}\left[\sum_{h=1}^{H}\hat{\mu}(s_{h},a_{h})\right]-\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}\hat{\mu}(s_{h},a_{h})\right]}\right\|}$	(214)
	$\displaystyle\quad+{\left\|{\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}\hat{\mu}(s_{h},a_{h})\right]-\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}\mu^{\star}(s_{h},a_{h})\right]}\right\|}$	(215)
$\displaystyle\leq\;$	$\displaystyle HD_{\rm TV}({f^{\star}}[\pi]\\|{\hat{f}}[\pi])+\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}\|\hat{\mu}(s_{h},a_{h})-\mu^{\star}(s_{h},a_{h})\|\right].$	(216)

In the following we upper bound the second term of Eq. (216). Let ${f^{\star}}_{r},{\hat{f}}_{r}:{\mathcal{S}}\times{\mathcal{A}}\to\Delta(\mathbb{R})$ be the reward distributions of instance ${f^{\star}}$ and ${\hat{f}}$ respectively, and ${f^{\star}}_{r}[s,a](\cdot),{\hat{f}}_{r}[s,a](\cdot)$ the densities of the reward given state-action pair $(s,a)$ . By item (b) of Condition 4 and Cauchy-Schwarz we get

	$\displaystyle\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}\|\hat{\mu}(s_{h},a_{h})-\mu^{\star}(s_{h},a_{h})\|\right]$	(217)
$\displaystyle\lesssim\;$	$\displaystyle\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}D_{\mathrm{KL}}({f^{\star}}_{r}[s_{h},a_{h}]\\|{\hat{f}}_{r}[s_{h},a_{h}])^{c_{8}}\right]$	(218)
$\displaystyle\leq\;$	$\displaystyle\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[H^{1-c_{8}}\left(\sum_{h=1}^{H}D_{\mathrm{KL}}({f^{\star}}_{r}[s_{h},a_{h}]\\|{\hat{f}}_{r}[s_{h},a_{h}])\right)^{c_{8}}\right]$	(219)
$\displaystyle\leq\;$	$\displaystyle H^{1-c_{8}}\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}D_{\mathrm{KL}}({f^{\star}}_{r}[s_{h},a_{h}]\\|{\hat{f}}_{r}[s_{h},a_{h}])\right]^{c_{8}}.$	(220)

Recall that ${f^{\star}}_{p},{\hat{f}}_{p}$ denotes the transition function of instances ${f^{\star}},{\hat{f}}$ respectively. By the chain rule of KL divergence,

	$\displaystyle D_{\mathrm{KL}}({f^{\star}}[\pi]\\|{\hat{f}}[\pi])$	(221)
$\displaystyle=\;$	$\displaystyle\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}(D_{\mathrm{KL}}\left({f^{\star}}_{p}[s_{h},a_{h}]\\|{\hat{f}}_{p}[s_{h},a_{h}]\right)+D_{\mathrm{KL}}({f^{\star}}_{r}[s_{h},a_{h}]\\|{\hat{f}}_{r}[s_{h},a_{h}]))\right]$	(222)
$\displaystyle\geq\;$	$\displaystyle\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}D_{\mathrm{KL}}({f^{\star}}_{r}[s_{h},a_{h}]\\|{\hat{f}}_{r}[s_{h},a_{h}])\right].$	(223)

By item (a) of this lemma, $D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\to 0$ as $n\to\infty$ . Therefore for large enough $n$ we get

\displaystyle{\left|{R_{{\hat{f}}}(\pi)-R_{{f^{\star}}}(\pi)}\right|}\lesssim HD_{\rm TV}({f^{\star}}[\pi]\|{\hat{f}}[\pi])+H^{1-c_{8}}D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])^{c_{8}}\leq 2HD_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])^{c_{8}}.

(224)

Combining with item (a) of this lemma, we prove the desired result. ∎

Lemma E.3.

For tabular reinforcement learning with general reward distribution that satisfies Condition 4, there exists $n_{0}>0$ such that when $n>n_{0}$ , the following holds.

(a)

Conditioned on the event $\pi^{\star}({\hat{f}})\neq\pi^{\star}(f^{\star})$ , $\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm acc})\leq 1/n$ .
(b)

Conditioned on the event ${\mathcal{E}}_{\rm init}$ , $\mathop{\rm Pr}\nolimits({\mathcal{E}}_{\rm acc})\geq 1-1/\ln n$ .
(c)

The expected regret of Step 2 is always upper bounded by ${\mathcal{O}}(\ln n\ln\ln n).$
(d)

Conditioned on the event ${\mathcal{E}}_{\rm init}$ , the expected regret of Step 2 is upper bounded by

$\left({\mathcal{C}}(f^{\star},(\ln\ln n)^{1/4}/2)+o(1)\right)\ln n.$

Proof.

We prove the four items above separately.

Items (a), (c), and (d):

Proofs of (a), (c), and (d) are the same as that of Lemma C.2.

Item (b):

Let $\epsilon=1/\ln\ln n,\alpha=\frac{\ln n}{m}$ and $\delta=1/(2\ln n)$ . We prove this statement by invoking Lemma F.2 with parameters $(\alpha+5\epsilon,\epsilon,w)$ . Following Lemma F.2, let $\gamma=\frac{1}{m}\min_{\pi}\sum_{i=1}^{m}\mathbb{I}\left[\pi_{i}=\pi\right]$ and $\lambda=\lambda_{0}((\alpha+5\epsilon)/\gamma,\epsilon,{f^{\star}})$ be the value that satisfies Condition 1. Let

\epsilon_{0}=\exp(-(\alpha+5\epsilon)/\gamma)(\epsilon\lambda)^{1/\lambda}.

First of all, we prove that the condition for Lemma F.2 holds. That is,

\displaystyle m\gtrsim\frac{\mathrm{poly}(|{\mathcal{S}}||{\mathcal{A}}|H)}{\lambda\epsilon}\left((\ln(m/\epsilon_{0}))+\ln(1/\delta)\right).

(225)

Recall that $m=\sum_{x}\lceil\bar{w}_{\pi}\ln n\rceil$ . As a result, $m\geq\frac{|\Pi|\ln n}{(\ln\ln n)^{1/4}}.$ Now consider the RHS of Eq. (225). By the definition of $\bar{w}_{\pi}$ we get $\alpha\leq\frac{2}{|\Pi|}(\ln\ln n)^{1/4}$ and $\gamma^{-1}\leq 2|\Pi|(\ln\ln n)^{1/2}$ . It follows from Condition 1 that $\lambda\geq\mathrm{poly}(1/\ln\ln n).$ By the definition of $\epsilon_{0}$ and Condition 2 we get

\displaystyle\ln(m/\epsilon_{0})\lesssim\mathrm{poly}(\ln\ln n).

(226)

Consequently, the RHS of Eq. (225) is at most $\mathrm{poly}(\ln\ln n)$ , and Eq. (225) holds when $n$ is sufficiently large. By Lemma F.2 we get, with probability at least $1-1/(2\ln n)$ ,

\displaystyle\sum_{i=1}^{m}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq(\alpha+\epsilon)m,\quad\forall g\in{\mathcal{F}}(w,f^{\star},\alpha+5\epsilon),

(227)

where ${\mathcal{F}}(w,f^{\star},\alpha+5\epsilon)=\{g\in{\mathcal{F}}:D_{\mathrm{KL}}^{w}({f^{\star}}\|g)\geq\alpha+5\epsilon\}$ .

In the following, we prove that Eq. (227) implies ${\mathcal{E}}_{\rm acc}^{{\hat{f}}}.$ Recall that ${\mathcal{E}}_{\rm acc}^{{\hat{f}}}$ is the event defined as follows:

\displaystyle{\mathcal{E}}_{\rm acc}^{\hat{f}}

\displaystyle=\mathbb{I}\left[\forall g\in\Lambda({\hat{f}}),\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\ln n\right].

(228)

Next, we apply Lemma G.4 to any $g\in\Lambda({\hat{f}}).$ To verify its condition, we have $D_{\rm TV}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\leq D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])^{1/2}\lesssim\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}/2}$ for all $\pi\in\Pi$ by item (a) of Lemma E.2. Therefore when $n$ is large enough,

\displaystyle D_{\mathrm{KL}}^{w}(f^{\star}\|g)\geq\frac{\ln n}{m}+5\epsilon=\alpha+5\epsilon.

(229)

Then $g\in{\mathcal{F}}(w,{f^{\star}},\alpha+5\epsilon)$ . It follows from Eq. (227) that

\displaystyle\sum_{i=1}^{m}\ln\frac{{f^{\star}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq(\alpha+\epsilon)m,\quad\forall g\in\Lambda({\hat{f}}).

(230)

By Lemma E.4, with probability at least $1-1/(2\ln n)$ we have

\displaystyle\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}\geq-m\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}c_{6}}\mathrm{poly}(\ln\ln n)\iota({f^{\star}}).

(231)

For large enough $n$ , $\left(\frac{\ln\ln n}{\ln n}\right)^{c_{4}c_{6}}\mathrm{poly}(\ln\ln n)\iota({f^{\star}})\leq\frac{1}{\ln\ln n}=\epsilon.$ As a result, combining Eq. (230) and Eq. (231), with probability $1-1/\ln n$ we have

\displaystyle\forall g\in\Lambda({\hat{f}}),\quad\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\alpha m=\ln n.

(232)

∎

The following lemma is used in the proof of Lemma E.3.

Lemma E.4.

Let ${f^{\star}},{\hat{f}}$ be any fixed tabular RL instances with reward distribution that satisfies Condition 4. For any sequence of policies $\{\pi_{i}\}_{i=1}^{m}$ , let $o_{i}\sim{f^{\star}}[\pi_{i}](\cdot),\forall i\in[m]$ be a sequence of random observations drawn from ${f^{\star}}$ . Then there exists constant $c_{6}>0$ and $\iota({f^{\star}})$ that only depends on ${f^{\star}}$ such that, for any $\delta>0$ , with probability at least $1-\delta$ ,

\displaystyle\frac{1}{m}\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}>-\left(\max_{\pi}D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\right)^{c_{6}}\iota({f^{\star}})\mathrm{polylog}(mH/\delta)

(233)

assuming $\max_{\pi}D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])^{1/2}\leq\mu_{\rm min}({f^{\star}})/2.$

Proof.

\displaystyle\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}=\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o^{\rm sa}_{i})}{{f^{\star}}[\pi_{i}](o^{\rm sa}_{i})}+\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o^{\rm r}_{i}\mid o^{\rm sa}_{i})}{{f^{\star}}[\pi_{i}](o^{\rm r}_{i}\mid o^{\rm sa}_{i})}.

(234)

Recall that for the instance ${f^{\star}}$ , we have

\displaystyle\mu_{\rm min}({f^{\star}})=\min_{\pi}\min_{o^{\rm sa}\in\operatorname{supp}({f^{\star}}[\pi](\cdot))}{f^{\star}}[\pi](o^{\rm sa}).

(235)

Since both $o^{\rm sa}$ and $\pi$ are finite for tabular RL, we get $\mu_{\rm min}({f^{\star}})>0.$ On the one hand, for any $o^{\rm sa}\in\operatorname{supp}({f^{\star}}[\pi])$ , by Pinsker’s inequality we get

\displaystyle{\left|{{f^{\star}}[\pi](o^{\rm sa})-{\hat{f}}[\pi](o^{\rm sa})}\right|}\leq D_{\rm TV}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\leq D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])^{1/2}.

(236)

As a result

\displaystyle{\left|{\ln\frac{{\hat{f}}[\pi](o^{\rm sa})}{{f^{\star}}[\pi](o^{\rm sa})}}\right|}\leq{\left|{\ln\left(1+\frac{{\hat{f}}[\pi](o^{\rm sa})-{f^{\star}}[\pi](o^{\rm sa})}{{f^{\star}}[\pi](o^{\rm sa})}\right)}\right|}\leq{\left|{\ln\left(1+\frac{{\hat{f}}[\pi](o^{\rm sa})-{f^{\star}}[\pi](o^{\rm sa})}{\mu_{\rm min}({f^{\star}})}\right)}\right|}.

When $\max_{\pi}D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])^{1/2}\leq\mu_{\rm min}({f^{\star}})/2$ , applying the basic inequality ${\left|{\ln(1+x)}\right|}\leq 2x,\forall|x|\leq 1/2$ we get for all $\pi\in\Pi$ and $o^{\rm sa}\in\operatorname{supp}({f^{\star}}[\pi](\cdot))$ ,

		$\displaystyle{\left\|{\ln\left(1+\frac{{\hat{f}}[\pi](o^{\rm sa})-{f^{\star}}[\pi](o^{\rm sa})}{\mu_{\rm min}({f^{\star}})}\right)}\right\|}\leq\frac{2}{\mu_{\rm min}({f^{\star}})}{\left\|{{f^{\star}}[\pi](o^{\rm sa})-{\hat{f}}[\pi](o^{\rm sa})}\right\|}$		(237)
	$\displaystyle\leq\;$	$\displaystyle\frac{2}{\mu_{\rm min}({f^{\star}})}\max_{\pi}D_{\mathrm{KL}}({f^{\star}}[\pi]\\|{\hat{f}}[\pi])^{1/2}.$		(238)

On the other hand, let ${f^{\star}}_{r}[s,a]$ and ${\hat{f}}_{r}[s,a]$ be the reward distribution of instance ${f^{\star}}$ and ${\hat{f}}$ given state-action pair $(s,a)$ respectively. Then

\displaystyle\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o^{\rm r}_{i}\mid o^{\rm sa}_{i})}{{f^{\star}}[\pi_{i}](o^{\rm r}_{i}\mid o^{\rm sa}_{i})}=\sum_{i=1}^{m}\sum_{h=1}^{H}\ln\frac{{\hat{f}}_{r}[s_{i,h},a_{i,h}](r_{i,h})}{{f^{\star}}_{r}[s_{i,h},a_{i,h}](r_{i,h})}.

(239)

For any $\pi\in\Pi$ , by the chain rule of KL divergence we get

	$\displaystyle D_{\mathrm{KL}}({f^{\star}}[\pi]\\|{\hat{f}}[\pi])$	(240)
$\displaystyle=\;$	$\displaystyle\mathbb{E}_{o^{\rm sa}\sim{f^{\star}}[\pi]}\left[\sum_{h=1}^{H}(D_{\mathrm{KL}}\left({f^{\star}}_{p}[s_{h},a_{h}]\\|{\hat{f}}_{p}[s_{h},a_{h}]\right)+D_{\mathrm{KL}}({f^{\star}}_{r}[s_{h},a_{h}]\\|{\hat{f}}_{r}[s_{h},a_{h}]))\right]$	(241)
$\displaystyle\geq\;$	$\displaystyle\mu_{\rm min}({f^{\star}})D_{\mathrm{KL}}({f^{\star}}_{r}[s,a]\\|{\hat{f}}_{r}[s,a])\mathbb{I}\left[(s,a)\in o^{\rm sa}\text{ for some }o^{\rm sa}\in\operatorname{supp}({f^{\star}}[\pi])\right].$	(242)

Because $o_{i}\sim{f^{\star}}[\pi_{i}]$ , for any $i\in[m],h\in[H]$ we have $(s_{i,h},a_{i,h})\in o^{\rm sa}_{i}$ and $o^{\rm sa}_{i}\in\operatorname{supp}({f^{\star}}[\pi_{i}])$ . As a result, for all $i\in[m],h\in[H]$ , item (a) of Condition 4 implies that with probability at least $1-\delta/(mH)$

	$\displaystyle{\left\|{\frac{{\hat{f}}_{r}[s_{i,h},a_{i,h}](r_{i,h})}{{f^{\star}}_{r}[s_{i,h},a_{i,h}]}}\right\|}\leq D_{\rm TV}({f^{\star}}_{r}[s_{i,h},a_{i,h}](r_{i,h})\\|{\hat{f}}_{r}[s_{i,h},a_{i,h}])^{c_{7}}\mathrm{polylog}(mH/\delta)$	(243)
$\displaystyle\leq\;$	$\displaystyle D_{\mathrm{KL}}({f^{\star}}_{r}[s_{i,h},a_{i,h}]\\|{\hat{f}}_{r}[s_{i,h},a_{i,h}])^{c_{7}/2}\mathrm{polylog}(mH/\delta)$	(244)
$\displaystyle\leq\;$	$\displaystyle\frac{1}{\mu_{\rm min}({f^{\star}})^{c_{7}/2}}D_{\mathrm{KL}}({f^{\star}}[\pi]\\|{\hat{f}}[\pi])^{c_{7}/2}\mathrm{polylog}(mH/\delta).$	(245)

Let $\epsilon=\frac{1}{\mu_{\rm min}({f^{\star}})^{c_{7}/2}}\left(\max_{\pi}D_{\mathrm{KL}}({f^{\star}}[\pi]\|{\hat{f}}[\pi])\right)^{c_{7}/2}$ . Apply union bound and we get

\displaystyle\mathop{\rm Pr}\nolimits_{o\sim{f^{\star}}[\pi]}\left(\frac{1}{m}\sum_{i=1}^{m}\sum_{h=1}^{H}{\left|{\ln\frac{{\hat{f}}_{r}[s_{i,h},a_{i,h}](r_{i,h})}{{f^{\star}}_{r}[s_{i,h},a_{i,h}](r_{i,h})}}\right|}\geq\epsilon H\mathrm{polylog}(mH/\delta)\right)\leq 1-\delta.

(246)

It follows that with probability at least $1-\delta$ ,

		$\displaystyle\frac{1}{m}\sum_{i=1}^{m}\ln\frac{{\hat{f}}[\pi_{i}](o_{i})}{{f^{\star}}[\pi_{i}](o_{i})}$
	$\displaystyle\geq\;$	$\displaystyle-\frac{2}{\mu_{\rm min}({f^{\star}})}\max_{\pi}D_{\mathrm{KL}}({f^{\star}}[\pi]\\|{\hat{f}}[\pi])^{1/2}-H\mathrm{polylog}(mH/\delta)\frac{1}{\mu_{\rm min}({f^{\star}})^{c_{7}/2}}\left(\max_{\pi}D_{\mathrm{KL}}({f^{\star}}[\pi]\\|{\hat{f}}[\pi])\right)^{c_{7}/2}$
	$\displaystyle\geq\;$	$\displaystyle-\frac{2H}{\mu_{\rm min}({f^{\star}})}\mathrm{polylog}(mH/\delta)\left(\max_{\pi}D_{\mathrm{KL}}({f^{\star}}[\pi]\\|{\hat{f}}[\pi])\right)^{c_{7}/2}.$

Therefore, Eq. (233) is satisfied by setting

\displaystyle\iota({f^{\star}})=\frac{2H}{\mu_{\rm min}({f^{\star}})},\quad c_{6}=c_{7}/2.

(247)

∎

E.2 Proof of Condition 4 for Gaussian Distribution

In this section, we prove that the reward distribution family ${\mathcal{R}}=\{{\mathcal{N}}(\mu,1),\mu\in[0,1]\}$ satisfies Condition 4.

Proposition E.5.

Condition 4 holds for the reward distribution family ${\mathcal{R}}=\{{\mathcal{N}}(\mu,1),\mu\in[0,1]\}$

Proof.

In the following we prove the four items of Condition 4 respectively.

Item (a).

For any fixed $f\in{\mathcal{R}}$ , let $\mu$ be the mean of $f$ . In other words, $f={\mathcal{N}}(\mu,1).$ Then we have

\displaystyle\mathop{\rm Pr}\nolimits_{x\sim f}\left(|x|>\mu+2\sqrt{\log(2/\delta)}\right)\leq\delta.

(248)

By definition, for any $g={\mathcal{N}}(\mu_{g},1),g^{\prime}={\mathcal{N}}(\mu_{g}^{\prime},1)\in{\mathcal{R}}$ , we have

\ln\frac{g(x)}{g^{\prime}(x)}=\frac{1}{2}\left((x-\mu_{g})^{2}-(x-\mu_{g}^{\prime})^{2}\right)=(\mu_{g}^{\prime}-\mu_{g})(x-(\mu_{g}^{\prime}+\mu_{g})/2).

Therefore when $\mu_{g},\mu_{g}^{\prime}\in[0,1]$ we get

\displaystyle{\left|{\ln\frac{g(x)}{g^{\prime}(x)}}\right|}\leq|\mu_{g}-\mu_{g}^{\prime}||x+1|.

(249)

As a result, with probability at least $1-\delta$ we have

\displaystyle\forall g,g^{\prime}\in{\mathcal{R}},\quad{\left|{\ln\frac{g(x)}{g^{\prime}(x)}}\right|}\leq|\mu_{g}-\mu_{g}^{\prime}|\left(1+2\sqrt{\log(2/\delta)}\right).

(250)

In addition, when $\mu_{g},\mu_{g}^{\prime}\in[0,1]$ we have $D_{\rm TV}(g\|g^{\prime})\geq\frac{1}{10}|\mu_{g}-\mu_{g}^{\prime}|.$ As a result, item (a) holds with $c_{7}=1$ .

Item (b).

Recall that for Gaussian distribution with unit variance, when $\mu_{g},\mu_{g}^{\prime}\in[0,1]$ we have $D_{\rm TV}(g\|g^{\prime})\geq\frac{1}{10}|\mu_{g}-\mu_{g}^{\prime}|.$ Therefore item (b) holds with $c_{8}=1$ .

Item (c).

Recall that for any $g={\mathcal{N}}(\mu_{g},1),g^{\prime}={\mathcal{N}}(\mu_{g}^{\prime},1)\in{\mathcal{R}}$ , $\ln\frac{g(x)}{g^{\prime}(x)}=\frac{1}{2}\left((x-\mu_{g})^{2}-(x-\mu_{g}^{\prime})^{2}\right)$ . Therefore when $\mu_{g},\mu_{g}^{\prime}\in[0,1]$ we get

		$\displaystyle\mathbb{E}_{x\sim g}\left[\left(\ln\frac{g(x)}{g^{\prime}(x)}\right)^{4}\right]=\frac{1}{2}\mathbb{E}_{x\sim g}\left[\left((x-\mu_{g})^{2}-(x-\mu_{g}^{\prime})^{2}\right)^{4}\right]$		(251)
	$\displaystyle\leq\;$	$\displaystyle\frac{(\mu_{g}^{\prime}-\mu_{g})^{4}}{2}\mathbb{E}_{x\sim g}\left[(x-(\mu_{g}^{\prime}+\mu_{g})/2)^{4}\right]\leq 3(\mu_{g}^{\prime}-\mu_{g})^{4}.$		(252)

Therefore, item (c) holds with $c_{M}=2.$

Item (d).

For $g={\mathcal{N}}(\mu_{g},1),g^{\prime}={\mathcal{N}}(\mu_{g}^{\prime},1)\in{\mathcal{R}}$ , we have

\displaystyle D_{\rm TV}(g\|g^{\prime})\leq\sqrt{D_{\mathrm{KL}}(g\|g^{\prime})/2}=\frac{|\mu_{g}-\mu_{g}^{\prime}|}{2}.

(253)

Therefore, we can set ${\mathcal{C}}({\mathcal{R}},\epsilon)=\{{\mathcal{N}}(k\epsilon,1):k\in\{-\lfloor 1/\epsilon\rfloor,-\lfloor 1/\epsilon\rfloor+1,\cdots,\lfloor 1/\epsilon\rfloor\}\}$ . Then $\log|{\mathcal{C}}({\mathcal{R}},\epsilon)|\leq\log\lceil 2/\epsilon\rceil={\mathcal{O}}(\log(1/\epsilon)).$ ∎

Appendix F Uniform Concentration

In this section, we present the uniform concentration results.

F.1 Uniform Concentration with $\ell_{\infty}$ Covering

In this section we prove uniform concentration results with $\ell_{\infty}$ covering. For two instances $g,g^{\prime}$ , define the $\ell_{\infty}$ distance as

\displaystyle\|g-g^{\prime}\|_{\infty}\triangleq\sup_{\pi,o}{\left|{g[\pi](o)-g^{\prime}[\pi](o)}\right|}.

(254)

Let ${\mathcal{N}}({\mathcal{F}},\epsilon)$ be the proper covering number of ${\mathcal{F}}$ w.r.t. the distance $\ell_{\infty}.$ Then we have the following lemma.

Lemma F.1.

Consider any fixed $\alpha>0,0<\epsilon<\alpha/2$ , list of decisions $w=(\pi_{1},\cdots,\pi_{m})$ , $f\in{\mathcal{F}}$ . Let $\gamma=\frac{1}{m}\min_{\pi\in\Pi}\sum_{i=1}^{m}\mathbb{I}\left[\pi_{i}=\pi\right]$ and let $\lambda=\lambda_{0}(\alpha/\gamma,\epsilon,f)$ be the value that satisfies Condition 1. Let ${\mathcal{F}}(w,f,\alpha)=\{g\in{\mathcal{F}}:D_{\mathrm{KL}}^{w}(f\|g)\geq\alpha\}$ and define

\epsilon_{0}=\frac{\exp(-\alpha/\gamma)(\epsilon\lambda)^{1/\lambda}}{3{\rm Vol}}.

Then for any $\delta>0$ , when $m\geq\frac{1}{\lambda\epsilon}\left(\ln{\mathcal{N}}\left({\mathcal{F}},\epsilon_{0}\right)+\ln(1/\delta)\right)$ we have

\displaystyle\mathop{\rm Pr}\nolimits_{o_{i}\sim f[\pi_{i}],\forall i}\left(\forall g\in{\mathcal{F}}(w,f,\alpha),\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq(\alpha-4\epsilon)m\right)\geq 1-\delta.

(255)

Proof.

For any $g\in{\mathcal{F}}$ , we define a induced distribution $\hat{g}$ for all $\pi\in\Pi$ :

\displaystyle\hat{g}[\pi](o)=\frac{g[\pi](o)+\epsilon_{0}}{1+\epsilon_{0}{\rm Vol}}.

(256)

Because $\hat{g}[\pi](o)\geq 0$ and

\|\hat{g}[\pi](o)\|_{1}=\int\frac{g[\pi](o)+\epsilon_{0}}{1+\epsilon_{0}{\rm Vol}}\mathrm{d}x=\frac{1}{1+\epsilon_{0}{\rm Vol}}\int g[\pi](o)+\epsilon_{0}\mathrm{d}x=1,

the induced distribution $\hat{g}[\pi]$ is a valid distribution.

Properties of the induced covering.

Let $Z=\epsilon_{0}{\rm Vol}$ . Consider the minimum $\epsilon_{0}$ covering set ${\mathcal{C}}(w,f,\alpha)$ of ${\mathcal{F}}(w,f,\alpha).$ For any $g\in{\mathcal{F}}(w,f,\alpha)$ , let $g^{\prime}\in{\mathcal{C}}(w,f,\alpha)$ be its cover and $\hat{g}$ the induced distribution of $g^{\prime}$ . Now, we prove that $\hat{g}$ satisfies the following two properties:

(a)

For any $\pi\in\Pi$ and $o\in\operatorname{supp}f[\pi]$ , $\ln\frac{f[\pi](o)}{g[\pi](o)}\geq\ln\frac{f[\pi](o)}{\hat{g}[\pi](o)}-\epsilon,$ and
(b)

$D^{w}_{1-\lambda}(f\|\hat{g})\geq\alpha-2\epsilon$ for $g\in{\mathcal{F}}(w,f,\alpha).$

First we prove item (a). Since $g^{\prime}$ is the $\epsilon_{0}$ cover of $g$ , we get $g^{\prime}[\pi](o)+\epsilon_{0}\geq g[\pi](o)$ for any $\pi\in\Pi$ and $x\in\operatorname{supp}f[\pi]$ . As a result,

\displaystyle\ln\frac{f[\pi][o]}{g[\pi](o)}\geq\ln\frac{f[\pi][o]}{g^{\prime}[\pi](o)+\epsilon_{0}}=\ln\frac{f[\pi][o]}{\hat{g}[\pi](o)}-\ln(1+Z).

(257)

By the basic inequality $\ln(1+x)\leq x,\forall x\in(-1,\infty)$ we get

\displaystyle\ln(1+Z)\leq Z=\epsilon_{0}{\rm Vol}\leq\epsilon.

(258)

As a result, item (a) follows directly. Now we prove item (b). By algebraic manipulation,

	$\displaystyle D_{\rm TV}(g[\pi]\\|\hat{g}[\pi])\leq\int{\left\|{\hat{g}[\pi](o)-g[\pi](o)}\right\|}\mathrm{d}o=\int{\left\|{\frac{g^{\prime}[\pi](o)+\epsilon_{0}}{1+\epsilon_{0}{\rm Vol}}-g[\pi](o)}\right\|}\mathrm{d}o$	(259)
$\displaystyle\leq$	$\displaystyle\;\int{\left\|{\frac{g^{\prime}[\pi](o)+\epsilon_{0}}{1+\epsilon_{0}{\rm Vol}}-g^{\prime}[\pi](o)-\epsilon_{0}}\right\|}\mathrm{d}o+\int{\left\|{g^{\prime}[\pi](o)+\epsilon_{0}-g[\pi](o)}\right\|}\mathrm{d}o$	(260)
$\displaystyle\leq$	$\displaystyle\;\frac{1}{1+\epsilon_{0}{\rm Vol}}\int{\left\|{\epsilon_{0}{\rm Vol}\left(g^{\prime}[\pi](o)+\epsilon_{0}\right)}\right\|}\mathrm{d}o+2\epsilon_{0}{\rm Vol}$	(261)
$\displaystyle\leq$	$\displaystyle\;\epsilon_{0}{\rm Vol}+2\epsilon_{0}{\rm Vol}=3\epsilon_{0}{\rm Vol}=\exp(-\alpha/\gamma)(\epsilon\lambda)^{1/\lambda}.$	(262)

Applying Lemma G.5, we get item (b).

Uniform concentration via covering.

We define the induced covering set $\hat{\mathcal{C}}(w,f,\alpha)=\{\hat{g}:g^{\prime}\in{\mathcal{C}}(w,f,\alpha)\}.$ Then $|\hat{\mathcal{C}}(w,f,\alpha)|\leq|{\mathcal{C}}(w,f,\alpha)|\leq{\mathcal{N}}({\mathcal{F}},\epsilon_{0}).$ Applying Lemma A.2 and union bound we get, with probability at least $1-\delta$ ,

\displaystyle\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{\hat{g}[\pi_{i}](o_{i})}\geq m(D^{w}_{1-\lambda}(f\|\hat{g})-\epsilon),\quad\forall\hat{g}\in\hat{\mathcal{C}}(w,f,\alpha).

(263)

By item (a) of the covering property,

\displaystyle\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{\hat{g}[\pi_{i}](o_{i})}-m\epsilon.

(264)

By item (b) of the covering property,

\displaystyle D^{w}_{1-\lambda}(f\|\hat{g})\geq\alpha-2\epsilon

(265)

Combining Eqs. (263), (264), (265), with probability at least $1-\delta$ ,

\displaystyle\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq m(\alpha-4\epsilon).

(266)

∎

F.2 Uniform Concentration for RL with General Reward Distribution

In the following, we present an uniform concentration lemma for tabular RL with general reward.

Lemma F.2.

Let ${\mathcal{F}}$ be the instance family representing tabular RL with general reward distribution that satisfies Condition 4. For any fixed $\alpha>0,0<\epsilon<\alpha/2$ , list of decisions $w=(\pi_{1},\cdots,\pi_{m})$ , $f\in{\mathcal{F}}$ , let $\gamma=\frac{1}{m}\min_{\pi:\pi\in w}\sum_{i=1}^{m}\mathbb{I}\left[\pi_{i}=\pi\right]$ and $\lambda=\lambda_{0}(\alpha/\gamma,\epsilon,f)$ be the value that satisfies Condition 1. Let ${\mathcal{F}}(w,f,\alpha)=\{g\in{\mathcal{F}}:D_{\mathrm{KL}}^{w}(f\|g)\geq\alpha\}$ and define

\epsilon_{0}=\exp(-\alpha/\gamma)(\epsilon\lambda)^{1/\lambda}.

Then for any $\delta>0$ , when $m\gtrsim\frac{\mathrm{poly}(|{\mathcal{S}}||{\mathcal{A}}|H)}{\lambda\epsilon}\left((\ln(m/\epsilon_{0}))+\ln(1/\delta)\right)$ we have

\displaystyle\mathop{\rm Pr}\nolimits_{o_{i}\sim f[\pi_{i}],\forall i}\left(\forall g\in{\mathcal{F}}(w,f,\alpha),\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq(\alpha-4\epsilon)m\right)\geq 1-\delta.

(267)

Proof.

Recall that a tabular RL instance $g\in{\mathcal{F}}$ , we use $g_{r}[s,a]$ to denote its reward distribution given state-action pair $(s,a)$ , and $g_{p}[s,a]$ its transition. We prove this lemma using the covering argument.

The covering set.

Define

\epsilon_{1}=\min\left\{\frac{\epsilon}{2H|{\mathcal{S}}|},\frac{\epsilon}{2H\mathrm{polylog}(2mH/\delta)},\frac{\exp(-\alpha/\gamma)(\epsilon\lambda)^{1/\lambda}}{4|{\mathcal{S}}|^{2}|{\mathcal{A}}|}\right\}.

Let ${\mathcal{C}}\subseteq{\mathcal{F}}$ be the minimum covering of ${\mathcal{F}}$ such that for any $g\in{\mathcal{F}}$ (parameterized by $p,\mu$ ), there exists $g^{\prime}\in{\mathcal{F}}$ (parameterized by $p^{\prime},\mu^{\prime}$ ) such that

\displaystyle\sup_{s,a,s^{\prime}}{\left|{g_{p}[s,a](s^{\prime})-g^{\prime}_{p}[s,a](s^{\prime})}\right|}\leq\epsilon_{1},\quad\sup_{s,a}D_{\rm TV}\left(g_{r}[s,a]\|g^{\prime}_{r}[s,a]\right)\leq\epsilon_{1}^{1/c_{7}}.

(268)

By item (d) of Condition 4 and a standard covering argument for discrete distributions, we have $\ln|{\mathcal{C}}|\lesssim|{\mathcal{S}}||{\mathcal{A}}|\ln(1/\epsilon_{1}).$ For any $g^{\prime}\in{\mathcal{C}}$ , we consider the induced instance $\hat{g}$ defined by

\displaystyle\hat{g}_{p}[s,a](s^{\prime})=\frac{g^{\prime}_{p}[s,a](s^{\prime})+\epsilon_{1}}{1+\epsilon_{1}|{\mathcal{S}}|},\quad\hat{g}_{r}[s,a](\cdot)=g^{\prime}_{r}[s,a](\cdot).

(269)

In the following, we prove that

(a)

with probability at least $1-\delta/2$ ,

\displaystyle\mathop{\rm Pr}\nolimits_{o_{i}\sim f[\pi_{i}],\forall i}\left(\forall g\in{\mathcal{F}},\sum_{i=1}^{m}\ln\frac{\hat{g}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq-m\epsilon\right)\geq 1-\delta/2.

(270)

(b)

$D^{w}_{1-\lambda}(f\|\hat{g})\geq\alpha-2\epsilon$ for $g\in{\mathcal{F}}(w,f,\alpha).$

To prove (a), recall that $o_{i}=\{(s_{i,h},a_{i,h},r_{i,h})\}_{h=1}^{H}$ represents a trajectory. Consequently,

	$\displaystyle\sum_{i=1}^{m}\ln\frac{\hat{g}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}$	(271)
$\displaystyle=$	$\displaystyle\;\sum_{i=1}^{m}\sum_{h=1}^{H}\ln\frac{\hat{g}_{p}[s_{i,h},a_{i,h}](s_{i,h+1})}{g_{p}[s_{i,h},a_{i,h}](s_{i,h+1})}+\sum_{i=1}^{m}\sum_{h=1}^{H}\ln\frac{\hat{g}_{r}[s_{i,h},a_{i,h}](r_{i,h})}{g_{r}[s_{i,h},a_{i,h}](r_{i,h})}$	(272)
$\displaystyle\geq$	$\displaystyle\;-\sum_{i=1}^{m}\sum_{h=1}^{H}\ln(1+\|{\mathcal{S}}\|\epsilon_{1})+\sum_{i=1}^{m}\sum_{h=1}^{H}\ln\frac{\hat{g}_{r}[s_{i,h},a_{i,h}](r_{i,h})}{g_{r}[s_{i,h},a_{i,h}](r_{i,h})}.$	(273)

By item (a) of Condition 4 and union bound we get with probability at least $1-\delta/2$

\displaystyle\sum_{i=1}^{m}\sum_{h=1}^{H}\ln\frac{\hat{g}_{r}[s_{i,h},a_{i,h}](r_{i,h})}{g_{r}[s_{i,h},a_{i,h}](r_{i,h})}\geq-mH\epsilon_{1}\mathrm{polylog}(mH/\delta)\geq-m\epsilon/2.

(274)

Therefore,

\displaystyle\sum_{i=1}^{m}\ln\frac{\hat{g}[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq-\sum_{i=1}^{m}\sum_{h=1}^{H}|{\mathcal{S}}|\epsilon_{1}-m\epsilon/2\geq-m\epsilon,

(275)

which proves (a). For (b), by algebraic manipulation we have

	$\displaystyle D_{\rm TV}(g[\pi]\\|\hat{g}[\pi])\leq\sum_{s,a}\left(D_{\rm TV}(p[s,a]\\|\hat{g}_{p}[s,a])+D_{\rm TV}(r[s,a]\\|\hat{g}_{r}[s,a])\right)$	(276)
$\displaystyle\leq$	$\displaystyle\;\sum_{s,a,s^{\prime}}{\left\|{\frac{p^{\prime}[s,a](s^{\prime})+\epsilon_{1}}{1+\epsilon_{1}\|{\mathcal{S}}\|}-p[s,a](s^{\prime})}\right\|}+\|{\mathcal{S}}\|\|{\mathcal{A}}\|\epsilon_{1}$	(277)
$\displaystyle\leq$	$\displaystyle\;\sum_{s,a,s^{\prime}}{\left\|{\frac{p^{\prime}[s,a](s^{\prime})+\epsilon_{1}}{1+\epsilon_{1}\|{\mathcal{S}}\|}-p^{\prime}[s,a](s^{\prime})-\epsilon_{1}}\right\|}+\sum_{s,a,s^{\prime}}{\left\|{p^{\prime}[s,a](s^{\prime})+\epsilon_{1}-p[s,a](s^{\prime})}\right\|}+\|{\mathcal{S}}\|\|{\mathcal{A}}\|\epsilon_{1}$	(278)
$\displaystyle\leq$	$\displaystyle\;4\|{\mathcal{S}}\|^{2}\|{\mathcal{A}}\|\epsilon_{1}\leq\exp(-\alpha/\gamma)(\epsilon\lambda)^{1/\lambda}.$	(279)

Applying Lemma G.5, we get item (b).

Uniform concentration via covering.

Now we apply Lemma A.2 and union bound. When

m\gtrsim\frac{|{\mathcal{S}}||{\mathcal{A}}|\ln(1/\epsilon_{1})}{\lambda\epsilon}\gtrsim\frac{\ln(|{\mathcal{C}}|/\delta)}{\lambda\epsilon},

with probability at least $1-\delta/2$ we get

\displaystyle\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{\hat{g}[\pi_{i}](o_{i})}\geq m(D^{w}_{1-\lambda}(f\|\hat{g})-\epsilon),\quad\forall\hat{g}\in{\mathcal{C}}.

(280)

By item (a) of the covering property, with probability at least $1-\delta/2$ ,

\displaystyle\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{\hat{g}[\pi_{i}](o_{i})}-m\epsilon.

(281)

By item (b) of the covering property,

\displaystyle D^{w}_{1-\lambda}(f\|\hat{g})\geq\alpha-2\epsilon

(282)

Combining Eqs. (280), (281), (282), with probability at least $1-\delta$ ,

\displaystyle\sum_{i=1}^{m}\ln\frac{f[\pi_{i}](o_{i})}{g[\pi_{i}](o_{i})}\geq m(\alpha-4\epsilon).

(283)

∎

Appendix G Helper Lemmas

The following lemma shows that the Rényi divergence of a marginal distribution is smaller than that of the original distribution.

Lemma G.1 (Theorem 1 of Van Erven and Harremos [2014]).

Consider any distributions $f,g$ over variables $x,y$ . Let $f(x),g(x)$ denote their marginal distribution over $x$ respectively. For any $\alpha\in(0,1)$ , we have

\displaystyle D_{\alpha}(f(x)\|g(x))\leq D_{\alpha}(f(x,y)\|g(x,y)).

(284)

Proof.

When $\alpha\in(0,1)$ , we have $\frac{1}{\alpha-1}<0.$ As a result, we only need to prove that

\displaystyle\int f(x,y)^{\alpha}g(x,y)^{1-\alpha}\mathrm{d}xy\leq\int f(x)^{\alpha}g(x)^{1-\alpha}\mathrm{d}x.

(285)

We prove this by Hölder’s inequality. In particular,

	$\displaystyle\int f(x,y)^{\alpha}g(x,y)^{1-\alpha}\mathrm{d}xy$	(286)
$\displaystyle=$	$\displaystyle\int f(x)^{\alpha}g(x)^{1-\alpha}\int f(y\mid x)^{\alpha}g(y\mid x)^{1-\alpha}\mathrm{d}y\mathrm{d}x$	(287)
$\displaystyle\leq$	$\displaystyle\int f(x)^{\alpha}g(x)^{1-\alpha}\left(\int f(y\mid x)\mathrm{d}y\right)^{\alpha}\left(\int g(y\mid x)\mathrm{d}y\right)^{1-\alpha}\mathrm{d}x$	(288)
$\displaystyle\leq$	$\displaystyle\int f(x)^{\alpha}g(x)^{1-\alpha}\mathrm{d}x.$	(289)

∎

Intuitively, the following two lemmas upper bound the difference of Rényi divergence by the TV distance.

Lemma G.2.

For and fixed $\lambda\in(0,1),\alpha>0,\epsilon>0$ and distribution $f$ , consider two distributions $g,\hat{g}$ such that $D_{\rm TV}(g\|\hat{g})\leq\exp(-\alpha)(\lambda\epsilon)^{1/\lambda}.$ Then we have

\displaystyle D_{1-\lambda}(f\|\hat{g})\geq\min\{\alpha,D_{1-\lambda}(f\|g)\}-\epsilon.

(290)

Proof.

Let $\kappa=\min\{\alpha,D_{1-\lambda}(f\|g)\}.$ We start by proving

\displaystyle{\left|{\int f(x)^{1-\lambda}g(x)^{\lambda}\mathrm{d}x-\int f(x)^{1-\lambda}\hat{g}(x)^{\lambda}\mathrm{d}x}\right|}\leq\left(\exp(\lambda\epsilon)-1\right)\exp(-\lambda\kappa).

(291)

By Hólder’s inequality we get

	$\displaystyle{\left\|{\int f(x)^{1-\lambda}g(x)^{\lambda}\mathrm{d}x-\int f(x)^{1-\lambda}\hat{g}(x)^{\lambda}\mathrm{d}x}\right\|}$	(292)
$\displaystyle\leq\;$	$\displaystyle\left(\int{\left\|{g(x)-\hat{g}(x)}\right\|}\mathrm{d}x\right)^{\lambda}\left(\int f(x)\mathrm{d}x\right)^{1-\lambda}$	(293)
$\displaystyle\leq\;$	$\displaystyle\exp(-\lambda\alpha)\lambda\epsilon$	(294)
$\displaystyle\leq\;$	$\displaystyle\exp(-\lambda\alpha)\left(\exp(\lambda\epsilon)-1\right)$	(295)
$\displaystyle\leq\;$	$\displaystyle\exp(-\lambda\kappa)\left(\exp(\lambda\epsilon)-1\right),$	(296)

where the Eq. (295) follows from the basic inequality $1+x\leq\exp(x)$ for $x>0.$

By the definition of Rényi divergence,

\displaystyle\int f(x)^{1-\lambda}g(x)^{\lambda}\mathrm{d}x=\exp(-\lambda D_{1-\lambda}(f\|g))\leq\exp(-\lambda\kappa).

(297)

Combining Eq. (296) and Eq. (297) we get,

\displaystyle\int f(x)^{1-\lambda}\hat{g}(x)^{\lambda}\mathrm{d}x\leq\exp(-\lambda(\kappa-\epsilon)).

(298)

It follows that

\displaystyle D_{1-\lambda}(f\|\hat{g})=-\frac{1}{\lambda}\ln\int f(x)^{1-\lambda}\hat{g}(x)^{\lambda}\mathrm{d}x\geq\kappa-\epsilon.

(299)

∎

Lemma G.3.

For and fixed $\lambda\in(0,1/2),\alpha>0,\epsilon>0$ and distribution $g$ , consider two distributions $f,\hat{f}$ such that $D_{\rm TV}(f\|\hat{f})\leq\exp\left(-\frac{\lambda}{1-\lambda}\alpha\right)(\lambda\epsilon)^{1/(1-\lambda)}.$ Then we have

\displaystyle D_{1-\lambda}(\hat{f}\|g)\geq\min\{\alpha,D_{1-\lambda}(f\|g)\}-\epsilon.

(300)

Proof.

We use a similar proof as Lemma G.2. Let $\kappa=\min\{\alpha,D_{1-\lambda}(f\|g)\}.$ Then we have

	$\displaystyle{\left\|{\int f(x)^{1-\lambda}g(x)^{\lambda}\mathrm{d}x-\int\hat{f}(x)^{1-\lambda}g(x)^{\lambda}\mathrm{d}x}\right\|}$	(301)
$\displaystyle\leq\;$	$\displaystyle\left(\int{\left\|{f(x)-\hat{f}(x)}\right\|}\mathrm{d}x\right)^{1-\lambda}\left(\int g(x)\mathrm{d}x\right)^{\lambda}$	(302)
$\displaystyle\leq\;$	$\displaystyle\exp\left(-\lambda\alpha\right)\lambda\epsilon$	(303)
$\displaystyle\leq\;$	$\displaystyle\exp(-\lambda\kappa)\left(\exp(\lambda\epsilon)-1\right),$	(304)

By the definition of Rényi divergence,

\displaystyle\int f(x)^{1-\lambda}g(x)^{\lambda}\mathrm{d}x\leq\exp(-\lambda D_{1-\lambda}(f\|g))\leq\exp(-\lambda\kappa).

(305)

Combining Eq. (304) and Eq. (305) we get,

\displaystyle\int\hat{f}(x)^{1-\lambda}g(x)^{\lambda}\mathrm{d}x\leq\exp(-\lambda(\kappa-\epsilon)).

(306)

It follows that

\displaystyle D_{1-\lambda}(\hat{f}\|g)=-\frac{1}{\lambda}\ln\int\hat{f}(x)^{1-\lambda}g(x)^{\lambda}\mathrm{d}x\geq\kappa-\epsilon.

(307)

∎

The following lemma is used to prove that when ${\hat{f}}$ is close to ${f^{\star}}$ (measured by TV distance), we can use ${\hat{f}}$ to approximately solve ${\mathcal{C}}({f^{\star}})$ (see Lemma C.2).

Lemma G.4.

For an instance $f\in{\mathcal{F}}$ and $n>0$ . Let $\delta=(\ln\ln n)^{-1/4}$ and $\epsilon=(\ln\ln n)^{-1}.$ For any $\hat{w}\in\mathbb{R}^{|\Pi|}_{+}$ such that $\|\hat{w}\|_{\infty}\leq(\ln\ln n)^{1/4}$ , define $w=\{\pi_{i}\}_{i=1}^{m}$ be a list of decisions where a decision $\pi$ occurs $\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil$ times for every $\pi\in\Pi$ , and $m=\sum_{\pi}\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil$ . Consider two instances $f,{\hat{f}}\in{\mathcal{F}}$ such that there exists constant $c_{6}>0$ with

D_{\rm TV}(f[\pi]\|\hat{f}[\pi])\lesssim\left(\frac{\ln\ln n}{\ln n}\right)^{c_{6}},\forall\pi\in\Pi.

Define the set ${\mathcal{F}}(\hat{w},f)=\{g\in{\mathcal{F}}:\sum_{\pi\in\Pi}\hat{w}_{\pi}D_{\mathrm{KL}}(f[\pi]\|g[\pi])\geq 1\}.$ Then for any constant $c>0$ , there exits $n_{0}>0$ such that for all $n>n_{0}$ ,

\displaystyle D_{\mathrm{KL}}^{w}(\hat{f}\|g)\geq\frac{\ln n}{m}+c\epsilon,\quad\forall g\in{\mathcal{F}}(\hat{w},f).

(308)

Proof of Lemma G.4.

First we invoke Condition 1 with proper parameters. Define

\displaystyle\lambda=\lambda_{0}\left(4(\ln\ln n)^{3/4},\frac{1}{\ln\ln n},f\right)

(309)

By the definition of $m$ and the fact that $\|\hat{w}\|_{\infty}\leq(\ln\ln n)^{1/4}$ we get

\displaystyle\frac{|\Pi|\ln n}{(\ln\ln n)^{1/4}}\leq m\leq 2|\Pi|\ln n(\ln\ln n)^{1/4}.

(310)

Let $\alpha\triangleq\frac{\ln n}{m}+(c+2)\epsilon\lesssim\frac{2}{|\Pi|}(\ln\ln n)^{1/4}.$ Consider $\gamma=\min_{\pi}\frac{1}{m}\lceil((1+\delta)\hat{w}_{\pi}+\delta)\ln n\rceil$ . By the upper bound of $m$ we get $\gamma\geq\frac{1}{2|\Pi|(\ln\ln n)^{1/2}}.$

We invoke Condition 1 with parameters $(\alpha/\gamma,\epsilon,f)$ . For large enough $n$ we have $\alpha/\gamma\leq 4(\ln\ln n)^{3/4}$ , which implies that $\lambda<\lambda_{0}(\alpha/\gamma,\epsilon,f)$ . Then for any $\pi\in\Pi$ we get

\displaystyle D_{1-\lambda}(f[\pi]\|g[\pi])\geq\min\{\alpha/\gamma,D_{\mathrm{KL}}(f[\pi]\|g[\pi])-\epsilon\}.

(311)

We claim that $D_{\mathrm{KL}}^{w}(f\|g)\geq\frac{\ln n}{m}+(c+2)\epsilon,\forall g\in{\mathcal{F}}(\hat{w},f)$ for large enough $n$ . Indeed, by the definition of $w$ we get

\displaystyle mD_{\mathrm{KL}}^{w}(f\|g)\geq\sum_{\pi}((1+\delta)\hat{w}_{\pi}+\delta)(\ln n)D_{\mathrm{KL}}(f[\pi]\|g[\pi])\geq(1+\delta)\ln n\geq\ln n+(c+2)m\epsilon.

(312)

Let

\epsilon_{1}=\exp(-\alpha/\gamma)(\lambda\epsilon)^{2}=\Omega\left(\exp(-(\ln\ln n)^{3/4})\mathrm{poly}(\ln\ln n)\right).

By Lemma G.3 with parameters $(\lambda,\alpha/\gamma,\epsilon)$ and the assumption that $D_{\rm TV}(f[\pi]\|\hat{f}[\pi])\lesssim\left(\frac{\ln\ln n}{\ln n}\right)^{c_{6}}=o(\epsilon_{1})$ for all $\pi\in\Pi$ , we get

\displaystyle D_{1-\lambda}(\hat{f}[\pi]\|g[\pi])\geq\min\{\alpha/\gamma,D_{1-\lambda}(f[\pi]\|g[\pi])\}-\epsilon,\forall\pi\in\Pi.

(313)

Now we consider the following two cases.

Case 1:

There exists $\pi\in w$ such that $D_{\mathrm{KL}}(f[\pi]\|g[\pi])\geq\alpha/\gamma$ . In this case Eq. (311) implies that $D_{1-\lambda}(f[\pi]\|g[\pi])\geq\alpha/\gamma-\epsilon.$ Combining with Eq. (313) we have

\displaystyle D_{1-\lambda}({\hat{f}}[\pi]\|g[\pi])\geq\alpha/\gamma-2\epsilon.

(314)

As a result,

\displaystyle D^{w}_{1-\lambda}(\hat{f}\|g)\geq\frac{1}{m}\sum_{i=1}^{m}\mathbb{I}\left[\pi_{i}=\pi\right](\alpha/\gamma-2\epsilon)\geq\alpha-2\epsilon=\frac{\ln n}{m}+c\epsilon.

(315)

Case 2:

For all $\pi\in w$ , $D_{\mathrm{KL}}(f[\pi]\|g[\pi])\leq\alpha/\gamma$ . In this case we also have $D_{1-\lambda}(f[\pi]\|g[\pi])\leq\alpha/\gamma,\;\forall\pi\in w$ . Therefore Eq. (313) and Eq. (311) implies that

	$\displaystyle D_{1-\lambda}({\hat{f}}[\pi]\\|g[\pi])$	$\displaystyle\geq D_{1-\lambda}(f[\pi]\\|g[\pi])-\epsilon,$		(316)
	$\displaystyle D_{1-\lambda}(f[\pi]\\|g[\pi])$	$\displaystyle\geq D_{\mathrm{KL}}(f[\pi]\\|g[\pi])-\epsilon.$		(317)

As a result,

		$\displaystyle D^{w}_{1-\lambda}(\hat{f}\\|g)\geq\frac{1}{m}\sum_{i=1}^{m}\left(D_{1-\lambda}(f[\pi_{i}]\\|g[\pi_{i}])-\epsilon\right)\geq\frac{1}{m}\sum_{i=1}^{m}\left(D_{\mathrm{KL}}(f[\pi_{i}]\\|g[\pi_{i}])-2\epsilon\right)$		(318)
	$\displaystyle=\;$	$\displaystyle D_{\mathrm{KL}}^{w}(f\\|g)-2\epsilon\geq\frac{\ln n}{m}+c\epsilon.$		(319)

Combining the two cases together, we get $D^{w}_{1-\lambda}(\hat{f}\|g)\geq\frac{\ln n}{m}+c\epsilon.$ ∎

The following lemma is used to prove a nice property of the covering (see Lemma F.1).

Lemma G.5.

Consider any $\epsilon>0,\alpha>0$ , sequence of decisions $w=\{\pi_{i}\}_{i=1}^{m}$ . Let $\gamma=\frac{1}{m}\min_{\pi:\pi\in w}\sum_{i=1}^{m}\mathbb{I}\left[\pi_{i}=\pi\right]$ and $\lambda=\lambda_{0}(\alpha/\gamma,\epsilon,f)$ be the value that satisfies Condition 1. For two distributions $f,{\hat{f}}\in{\mathcal{F}}$ such that

D_{\rm TV}(g[\pi]\|\hat{g}[\pi])\leq\exp(-\alpha/\gamma)(\lambda\epsilon)^{1/\lambda},\forall\pi\in\Pi.

we have

\displaystyle D^{w}_{1-\lambda}(f\|\hat{g})\geq\min\{D_{\mathrm{KL}}^{w}(f\|g),\alpha\}-2\epsilon.

(320)

Proof of Lemma G.5.

Let $\epsilon_{1}=\exp(-\alpha/\gamma)(\lambda\epsilon)^{1/\lambda}$ and $\kappa=\min\{D_{\mathrm{KL}}^{w}(f\|g),\alpha\}$ .

By Lemma G.2 and the fact that $D_{\rm TV}(g[\pi]\|\hat{g}[\pi])\leq\epsilon_{1},\forall\pi\in\Pi$ , for any $\pi\in\Pi$ we have

\displaystyle D_{1-\lambda}(f[\pi]\|\hat{g}[\pi])\geq\min\{\alpha/\gamma,D_{1-\lambda}(f[\pi]\|g[\pi])\}-\epsilon.

(321)

Applying Condition 1, for any $\pi\in\Pi$ we have

\displaystyle D_{1-\lambda}(f[\pi]\|g[\pi])\geq\min\{\alpha/\gamma,D_{\mathrm{KL}}(f[\pi]\|g[\pi])-\epsilon\}.

(322)

Now we consider the following two cases.

Case 1:

There exists $\pi\in w$ such that $D_{\mathrm{KL}}(f[\pi]\|g[\pi])\geq\kappa/\gamma$ . By the definition of $\kappa$ we have $\kappa\leq\alpha$ . In this case Eq. (322) implies that $D_{1-\lambda}(f[\pi]\|g[\pi])\geq\kappa/\gamma-\epsilon.$ Combining with Eq. (321) we have

\displaystyle D_{1-\lambda}(f[\pi]\|\hat{g}[\pi])\geq\kappa/\gamma-2\epsilon.

(323)

As a result,

\displaystyle D^{w}_{1-\lambda}(f\|\hat{g})\geq\frac{1}{m}\sum_{i=1}^{m}\mathbb{I}\left[\pi_{i}=\pi\right](\kappa/\gamma-2\epsilon)\geq\kappa-2\epsilon.

(324)

Case 2:

For all $\pi\in w$ , $D_{\mathrm{KL}}(f[\pi]\|g[\pi])\leq\kappa/\gamma$ . In this case we also have $D_{1-\lambda}(f[\pi]\|g[\pi])\leq\kappa/\gamma,\;\forall\pi\in w$ . Therefore Eq. (321) and Eq. (322) implies that

	$\displaystyle D_{1-\lambda}(f[\pi]\\|\hat{g}[\pi])$	$\displaystyle\geq D_{1-\lambda}(f[\pi]\\|g[\pi])-\epsilon,$		(325)
	$\displaystyle D_{1-\lambda}(f[\pi]\\|g[\pi])$	$\displaystyle\geq D_{\mathrm{KL}}(f[\pi]\\|g[\pi])-\epsilon.$		(326)

As a result,

		$\displaystyle D^{w}_{1-\lambda}(f\\|\hat{g})\geq\frac{1}{m}\sum_{i=1}^{m}\left(D_{1-\lambda}(f[\pi_{i}]\\|g[\pi_{i}])-\epsilon\right)\geq\frac{1}{m}\sum_{i=1}^{m}\left(D_{\mathrm{KL}}(f[\pi_{i}]\\|g[\pi_{i}])-2\epsilon\right)$		(327)
	$\displaystyle=\;$	$\displaystyle D_{\mathrm{KL}}^{w}(f\\|g)-2\epsilon\geq\kappa-2\epsilon.$		(328)

Combining the two cases together, we get $D^{w}_{1-\lambda}(f\|\hat{g})\geq\kappa-2\epsilon.$ ∎

The following lemma shows that for truncated Gaussian distributions, the difference in their density function can be upper bounded by their KL divergence. We use this lemma to prove Condition 3 for tabular RL with truncated Gaussian reward.

Lemma G.6.

Consider two truncated Gaussian distributions $p_{1},p_{2}$ with density

\displaystyle p_{i}(x)=\mathbb{I}\left[x\in[-2,2]\right]\frac{1}{Z_{i}}\exp\left(-\frac{(x-\mu_{i})^{2}}{2}\right),

(329)

where $Z_{i}$ is the normalization factor. Assuming $\mu_{1},\mu_{2}\in[-1,1]$ , we have

\displaystyle\sup_{x\in[-2,2]}{\left|{p_{1}(x)-p_{2}(x)}\right|}\leq 36D_{\mathrm{KL}}(p_{1}\|p_{2})^{1/6}.

(330)

Proof.

We first prove that ${\left|{\mu_{1}-\mu_{2}}\right|}\lesssim D_{\mathrm{KL}}(p_{1}\|p_{2})^{1/6}.$ Then we show that $\sup_{x\in[-2,2]}{\left|{p_{1}(x)-p_{2}(x)}\right|}\lesssim{\left|{\mu_{1}-\mu_{2}}\right|}$ .

By Pinsker’s inequality,

\displaystyle D_{\rm TV}(p_{1}\|p_{2})\lesssim D_{\mathrm{KL}}(p_{1}\|p_{2})^{1/2}.

(331)

Now we prove that ${\left|{\mu_{1}-\mu_{2}}\right|}^{3}\lesssim D_{\rm TV}(p_{1}\|p_{2}).$ W.l.o.g., we assume $Z_{1}\geq Z_{2}>1/\sqrt{2\pi}$ and $\mu_{1}\leq\mu_{2}.$ Then we have, for any $x\in[\mu_{1},\mu_{1}+\frac{1}{4}(\mu_{2}-\mu_{1})]$ ,

$\displaystyle p_{1}(x)-p_{2}(x)$	$\displaystyle\geq\frac{1}{\sqrt{2\pi}}\left[\exp(-\frac{(\mu_{2}-\mu_{1})^{2}}{32})-\exp(-\frac{9(\mu_{2}-\mu_{1})^{2}}{32})\right]$	(332)
	$\displaystyle\geq\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{9(\mu_{2}-\mu_{1})^{2}}{32}\right)\left(\exp(\frac{1}{4}(\mu_{2}-\mu_{1})^{2})-1\right)$	(333)
	$\displaystyle\geq\frac{1}{4e^{2}\sqrt{2\pi}}(\mu_{2}-\mu_{1})^{2}.$	(334)

As a result,

\displaystyle D_{\rm TV}(p_{1}\|p_{2})\geq\frac{1}{4}(\mu_{2}-\mu_{1})(p_{1}(x)-p_{2}(x))\gtrsim{\left|{\mu_{2}-\mu_{1}}\right|}^{3}.

(335)

Now we prove that $\sup_{x\in[-2,2]}{\left|{p_{1}(x)-p_{2}(x)}\right|}\lesssim{\left|{\mu_{1}-\mu_{2}}\right|}$ . By definition, for any $x\in[-2,2]$ we have

$\displaystyle{\left\|{p_{1}(x)-p_{2}(x)}\right\|}$	$\displaystyle={\left\|{\frac{1}{Z_{1}}\exp\left(-\frac{(x-\mu_{1})^{2}}{2}\right)-\frac{1}{Z_{2}}\exp\left(-\frac{(x-\mu_{2})^{2}}{2}\right)}\right\|}$	(336)
	$\displaystyle\leq{\left\|{1/Z_{1}-1/Z_{2}}\right\|}+\frac{1}{Z_{1}}{\left\|{\exp\left(-\frac{(x-\mu_{1})^{2}}{2}\right)-\exp\left(-\frac{(x-\mu_{2})^{2}}{2}\right)}\right\|},$	(337)
	$\displaystyle\leq 4{\left\|{Z_{1}-Z_{2}}\right\|}+2{\left\|{\exp\left(-\frac{(x-\mu_{1})^{2}}{2}\right)-\exp\left(-\frac{(x-\mu_{2})^{2}}{2}\right)}\right\|},$	(338)

where the last inequality comes from the fact that $Z_{i}\geq 1/2$ when $|\mu_{i}|\leq 1.$ For the second term,

	$\displaystyle{\left\|{\exp\left(-\frac{(x-\mu_{1})^{2}}{2}\right)-\exp\left(-\frac{(x-\mu_{2})^{2}}{2}\right)}\right\|}$	(339)
$\displaystyle=\;$	$\displaystyle\exp\left(-\frac{(x-\mu_{2})^{2}}{2}\right){\left\|{\exp((\mu_{2}-\mu_{1})(\mu_{2}+\mu_{1}-x)/2)-1}\right\|}$	(340)
$\displaystyle\leq\;$	$\displaystyle 2\|\mu_{2}-\mu_{1}\|.$	(341)

For the first term,

\displaystyle{\left|{Z_{1}-Z_{2}}\right|}\leq\int_{x=-2}^{2}{\left|{p_{1}(x)-p_{2}(x)}\right|}\mathrm{d}x\leq 8|\mu_{2}-\mu_{1}|.

(342)

As a result, we get

\displaystyle{\left|{p_{1}(x)-p_{2}(x)}\right|}\leq 36|\mu_{2}-\mu_{1}|.

(343)

Combining Eq. (335) and Eq. (343) we prove the this lemma. ∎

The following lemma can be viewed as an perturbation analysis to the covariance matrix in linear bandits setting. We use the lemma to prove that our complexity measure recovers that in [Lattimore and Szepesvari, 2017].

Lemma G.7.

For a fixed positive definite matrix $A\in\mathbb{R}^{d\times d}$ and an unit vector $x\in\mathbb{R}^{d}$ , let $G_{1}=(A+nxx^{\top})^{-1}$ and $G_{2}=\lim_{\rho\to\infty}(A+\rho xx^{\top})^{-1}.$ Then

\displaystyle\|G_{1}-G_{2}\|_{2}\leq\frac{\sigma_{\rm max}(A^{-1})}{1+n\sigma_{\rm min}(A^{-1})}.

(344)

Proof.

By Sherman–Morrison formula we get

	$\displaystyle G_{1}$	$\displaystyle=A^{-1}-\frac{nA^{-1}xx^{\top}A^{-1}}{1+nx^{\top}A^{-1}x},$		(345)
	$\displaystyle G_{2}$	$\displaystyle=\lim_{\rho\to\infty}A^{-1}-\frac{\rho A^{-1}xx^{\top}A^{-1}}{1+\rho x^{\top}A^{-1}x}.$		(346)

Then for any $v\in\mathbb{R}^{d}$ such that $\|v\|_{2}=1$ , we get

$\displaystyle v^{\top}(G_{1}-G_{2})v$	$\displaystyle=\lim_{\rho\to\infty}\frac{\rho v^{\top}A^{-1}xx^{\top}A^{-1}v}{1+\rho x^{\top}A^{-1}x}-\frac{nv^{\top}A^{-1}xx^{\top}A^{-1}v}{1+nx^{\top}A^{-1}x}$	(347)
	$\displaystyle=\lim_{\rho\to\infty}\frac{(\rho-n)v^{\top}A^{-1}xx^{\top}A^{-1}v}{(1+nx^{\top}A^{-1}x)(1+\rho x^{\top}A^{-1}x)}$	(348)
	$\displaystyle=\frac{v^{\top}A^{-1}xx^{\top}A^{-1}v}{x^{\top}A^{-1}x(1+nx^{\top}A^{-1}x)}$	(349)
	$\displaystyle\leq\frac{v^{\top}A^{-1}v}{1+nx^{\top}A^{-1}x}$	(350)
	$\displaystyle\leq\frac{\sigma_{\rm max}(A^{-1})}{1+n\sigma_{\rm min}(A^{-1})}.$	(351)

∎

The following lemma upper bounds the difference between KL divergence and Rényi divergence, and is used to prove Condition 1.

Lemma G.8.

For any two distribution $f,g$ and constant $\lambda\in(0,1/2),$ we have

\displaystyle D_{\mathrm{KL}}(f\|g)-D_{1-\lambda}(f\|g)\leq\frac{\lambda}{2}\mathbb{E}_{o\sim f}\left[\left(\ln\frac{f(o)}{g(o)}\right)^{4}\right]^{1/2}.

(352)

Proof.

Recall that

D_{1-\lambda}(f\|g)=-\frac{1}{\lambda}\ln\int f(o)^{1-\lambda}g(o)^{\lambda}\mathrm{d}o.

Define the function $h(\lambda)\triangleq\int f(o)^{1-\lambda}g(o)^{\lambda}\mathrm{d}o.$ By basic algebra we get

	$\displaystyle h^{\prime}(\lambda)$	$\displaystyle=\int f(o)^{1-\lambda}g(o)^{\lambda}\ln\frac{g(o)}{f(o)}\mathrm{d}o$		(353)
	$\displaystyle h^{\prime\prime}(\lambda)$	$\displaystyle=\int f(o)^{1-\lambda}g(o)^{\lambda}\ln^{2}\frac{g(o)}{f(o)}\mathrm{d}o.$		(354)

By Taylor expansion, there exists $\xi\in(0,\lambda)$ such that

\displaystyle h(\lambda)=h(0)+\lambda h^{\prime}(0)+\frac{\lambda^{2}}{2}h^{\prime\prime}(\xi).

(355)

By definition we have $h(0)=1$ and $h^{\prime}(0)=-D_{\mathrm{KL}}(f\|g).$ As a result, we get

	$\displaystyle D_{1-\lambda}(f\\|g)=-\frac{1}{\lambda}\ln h(\lambda)$	(356)
$\displaystyle=\;$	$\displaystyle-\frac{1}{\lambda}\ln\left(1-\lambda D_{\mathrm{KL}}(f\\|g)+\frac{\lambda^{2}}{2}\int f(o)^{1-\zeta}g(o)^{\zeta}\ln^{2}\frac{g(o)}{f(o)}\mathrm{d}o\right)$	(357)
$\displaystyle\geq\;$	$\displaystyle D_{\mathrm{KL}}(f\\|g)-\frac{\lambda}{2}\int f(o)^{1-\xi}g(o)^{\xi}\ln^{2}\frac{g(o)}{f(o)}\mathrm{d}o.$	(358)

By Hölder’s inequality, when $\xi<\lambda<1/2$ we get

	$\displaystyle\int f(o)^{1-\zeta}g(o)^{\xi}\ln^{2}\frac{g(o)}{f(o)}\mathrm{d}o$	(359)
$\displaystyle=\;$	$\displaystyle\mathbb{E}_{o\sim f}\left[\left(\frac{g(o)}{f(o)}\right)^{\xi}\ln^{2}\frac{g(o)}{f(o)}\right]$	(360)
$\displaystyle\leq\;$	$\displaystyle\mathbb{E}_{o\sim f}\left[\frac{g(o)}{f(o)}\right]^{\xi}\mathbb{E}_{o\sim f}\left[\left(\ln\frac{g(o)}{f(o)}\right)^{\frac{2}{1-\xi}}\right]^{1-\xi}$	(361)
$\displaystyle\leq\;$	$\displaystyle\mathbb{E}_{o\sim f}\left[\left(\ln\frac{g(o)}{f(o)}\right)^{4}\right]^{1/2}.$	(362)

Combining the inequalities above we get the desired result. ∎

	$\displaystyle D_{1-\lambda}({\hat{f}}[\pi]\\|g[\pi])$	$\displaystyle\geq D_{1-\lambda}(f[\pi]\\|g[\pi])-\epsilon,$		(316)
	$\displaystyle D_{1-\lambda}(f[\pi]\\|g[\pi])$	$\displaystyle\geq D_{\mathrm{KL}}(f[\pi]\\|g[\pi])-\epsilon.$		(317)

	$\displaystyle D_{1-\lambda}(f[\pi]\\|\hat{g}[\pi])$	$\displaystyle\geq D_{1-\lambda}(f[\pi]\\|g[\pi])-\epsilon,$		(325)
	$\displaystyle D_{1-\lambda}(f[\pi]\\|g[\pi])$	$\displaystyle\geq D_{\mathrm{KL}}(f[\pi]\\|g[\pi])-\epsilon.$		(326)

Asymptotic Instance-Optimal Algorithms for Interactive Decision Making

Abstract

1 Introduction

1.1 Additional Related Works

Instance-optimal and instance-dependent analysis.

Instance-optimal algorithm via log-likelihood ratio test.

2 Setup and Notations

Additional notations.

Interactive decision making problems.

Consistent algorithms and asymptotic instance-optimality.

3 Main Results

3.1 Complexity Measure and Regret Lower Bounds

Lemma 3.1.

Theorem 3.2.

3.2 Instance-Optimal Algorithms

Condition 1.

Theorem 3.3.

4 Proof Sketches of the Main Results

4.1 Proof Sketch of Theorem 3.2

4.2 Proof Sketch of Theorem 3.3

Step 1: Initialization.

Lemma 4.1.

Step 2: Identification.

Lemma 4.2.

Step 3: Exploitation.

4.3 The Log-Likelihood Ratio Test

Lemma 4.3.

5 Extensions to Infinite Hypothesis Class

Condition 2.

Condition 3.

Theorem 5.1.

Theorem 5.2.

Covering Number.

Initialization.

6 Conclusion

Acknowledgment

References

List of Appendices

Appendix A Proofs for Finite Hypothesis Class

A.1 Proof of Lemma 4.1

Proof of Lemma 4.1.

A.2 Proof of Lemma 4.2

Lemma A.1.

Proof.

Case 1: ∃π¯∈Π,DKL​(f​[π¯]∥g​[π¯])≥(α+ϵ)/γ.\exists\bar{\pi}\in\Pi,D_{\mathrm{KL}}(f[\bar{\pi}]\|g[\bar{\pi}])\geq(\alpha+\epsilon)/\gamma.

Case 2: ∀π∈Π,DKL​(f​[π]∥g​[π])≤(α+ϵ)/γ.\forall\pi\in\Pi,D_{\mathrm{KL}}(f[\pi]\|g[\pi])\leq(\alpha+\epsilon)/\gamma.

Proof of Lemma 4.2.

Proof of item (a).

Proof of item (b).

Proof of item (c).

A.3 Proof of Lemma 4.3

Lemma A.2.

Proof of Lemma A.2.

Proof of Lemma 4.3.

A.4 Proof of Theorem 3.3

Proof of Theorem 3.3.

Regret of Step 1:

Regret of Step 2.

Regret of Step 3.

Appendix B Missing Proofs in Section 3.1

B.1 Proof of Lemma 3.1

Lemma B.1.

Proof of Lemma B.1.

Proof of Lemma 3.1.

B.2 Proof of Theorem 3.2

Proof of Theorem 3.2.

B.3 Instantiation of the Complexity Measure

Proposition B.2.

Proof.

Proposition B.3.

Proof.

B.4 A Sufficient Condition for Condition 1

Proposition B.4.

Proof.

B.5 Comparison with Existing Lower Bounds

Appendix C Extension to Infinite Hypothesis Class

C.1 Proof of Theorem 5.2

Lemma C.1 (Main lemma for Initialization).

Lemma C.2 (Main lemma for Identification).

C.2 Proof of Lemma C.1

Case 1: $\exists\bar{\pi}\in\Pi,D_{\mathrm{KL}}(f[\bar{\pi}]\|g[\bar{\pi}])\geq(\alpha+\epsilon)/\gamma.$

Case 2: $\forall\pi\in\Pi,D_{\mathrm{KL}}(f[\pi]\|g[\pi])\leq(\alpha+\epsilon)/\gamma.$

Case 1: $\min_{o^{\rm sa}:f[\pi](o^{\rm sa})>0}g[\pi](o^{\rm sa})<\kappa.$

Case 2: $\min_{o^{\rm sa}:f[\pi](o^{\rm sa})>0}g[\pi](o^{\rm sa})\geq\kappa.$

F.1 Uniform Concentration with $\ell_{\infty}$ Covering