
Adversarial Online Multi-Task Reinforcement Learning

Quan Nguyen and Nishant A. Mehta, Department of Computer Science, University of Victoria. Emails: [email protected], [email protected]
Abstract

We consider the adversarial online multi-task reinforcement learning setting, where in each of $K$ episodes the learner is given an unknown task taken from a finite set $\mathcal{M}$ of $M$ unknown finite-horizon MDP models. The learner's objective is to minimize its regret with respect to the optimal policy for each task. We assume the MDPs in $\mathcal{M}$ are well-separated under a notion of $\lambda$-separability, and show that this notion generalizes many task-separability notions from previous works. We prove a minimax lower bound of $\Omega(K\sqrt{DSAH})$ on the regret of any learning algorithm and an instance-specific lower bound of $\Omega(\frac{K}{\lambda^{2}})$ in sample complexity for a class of uniformly good cluster-then-learn algorithms. We use a novel construction called 2-JAO MDP for proving the instance-specific lower bound. The lower bounds are complemented with a polynomial time algorithm that obtains an $\tilde{O}(\frac{K}{\lambda^{2}})$ sample complexity guarantee for the clustering phase and an $\tilde{O}(\sqrt{MK})$ regret guarantee for the learning phase, indicating that the dependency on $K$ and $\frac{1}{\lambda^{2}}$ is tight.

1 Introduction

The majority of theoretical works in online reinforcement learning (RL) have focused on single-task settings in which the learner is given the same task in every episode. In practice, an autonomous agent might face a sequence of different tasks. For example, an automatic medical diagnosis system could be given an arbitrarily ordered sequence of patients who are suffering from an unknown set of variants of a virus. In this example, the system needs to classify and learn the appropriate treatment for each variant of the virus. This example is an instance of the adversarial online multi-task episodic RL setting, an important learning setting for which the theoretical understanding is rather limited. The framework commonly used in existing theoretical works is an episodic setting of $K$ episodes; in each episode an unknown Markov decision process (MDP) from a finite set $\mathcal{M}$ of size $M$ is given to the learner. When $M=1$, the setting reduces to single-task episodic RL. Most existing algorithms for single-task episodic RL are based on aggregating samples across all episodes to obtain sub-linear bounds on various notions of regret (Azar et al., 2017; Jin et al., 2018; Simchowitz and Jamieson, 2019) or finite $(\epsilon,\delta)$-PAC bounds on the sample complexity of exploration (Dann and Brunskill, 2015). When $M>1$, without any assumptions on the common structure of the tasks, aggregating samples from different tasks could produce negative transfer (Brunskill and Li, 2013). To avoid negative transfer, existing works (Brunskill and Li, 2013; Hallak et al., 2015; Kwon et al., 2021) assumed that there exists some notion of task-separability that defines how different the tasks in $\mathcal{M}$ are. Based on this notion of separability, most existing algorithms follow a two-phase cluster-then-learn paradigm that first attempts to figure out which MDP is being given and then uses the samples from previous episodes of the same MDP for learning. However, most existing works employ strong assumptions, such as that the tasks are given stochastically following a fixed distribution (Azar et al., 2013; Brunskill and Li, 2013; Steimle et al., 2021; Kwon et al., 2021) or that the task-separability notion allows the MDPs to be distinguished in a small number of exploration steps (Hallak et al., 2015; Kwon et al., 2021). Relaxing these strong assumptions is the main theoretical challenge towards understanding this setting.

Our goal in this work is to study the adversarial setting with a more general task-separability notion, in which the aforementioned strong assumptions do not hold. Specifically, the learner makes no statistical assumptions on the sequence of tasks; the task in each episode can be either the same as or different from the tasks in any other episode. Moreover, the difference between the tasks in two consecutive episodes can be large (linear in the length of the episodes), so that algorithms based on a fixed budget for total variation, such as RestartQ-UCB (Mao et al., 2021), cannot be applied. The performance of the learner is measured by its regret with respect to an omniscient agent that knows which task is coming in every episode and the optimal policies for these tasks. We consider the same cluster-then-learn paradigm of the previous works and focus on the following two questions:

  • Is there a task-separability notion that generalizes the notions from previous works while still enabling tasks to be distinguished by a cluster-then-learn algorithm with polynomial time and sample complexity? If so, what is the optimal sample complexity of clustering under this notion?

  • Is there a polynomial time cluster-then-learn algorithm that simultaneously obtains near-optimal sample complexity in the clustering phase and near-optimal regret guarantee for the learning phase in the adversarial setting?

We answer both questions positively. For the first question, we introduce the notion of $\lambda$-separability, a task-separability notion that generalizes the task-separability definitions in previous works in the same setting (Brunskill and Li, 2013; Hallak et al., 2015; Kwon et al., 2021). Definition 1 formally defines $\lambda$-separability. A more informal version of $\lambda$-separability has appeared in the discounted setting of Concurrent PAC RL (Guo and Brunskill, 2015), where multiple MDPs are learned concurrently; however, the implications for the episodic sequential setting and the tightness of their results were lacking. In essence, $\lambda$-separability assumes that between every pair of MDPs in $\mathcal{M}$, there exists some state-action pair whose transition functions are well-separated in $\ell_{1}$-norm. This setting is more challenging than the one considered by Hallak et al. (2015), where all state-action pairs are well-separated. In Appendix B, we show that $\lambda$-separability is more general than the entropy-based separability defined in Kwon et al. (2021) and thus requires novel approaches to exploring and clustering samples from different episodes. Under this notion of $\lambda$-separability, we show an instance-specific lower bound of $\Omega(\frac{K}{\lambda^{2}})$ on both the sample complexity and regret of the clustering phase for a class of cluster-then-learn algorithms that includes most of the existing works. (Here and throughout the introduction, we suppress factors related to the MDPs, such as the number of states and actions and the horizon length, in all the bounds.)

To answer the second question, we propose a new cluster-then-learn algorithm, AOMultiRL, which obtains a regret upper bound of $\tilde{O}\left(\frac{K}{\lambda^{2}}+\sqrt{MK}\right)$ (the $\tilde{O}$ hides logarithmic terms). This upper bound indicates that the dependencies on $K$ and $\frac{1}{\lambda^{2}}$ in the lower bounds are tight. The $\tilde{O}(\sqrt{MK})$ upper bound for the learning phase is near-optimal because, if the identity of the model were revealed to the learner at the beginning of every episode (so that no clustering is necessary), there is a straightforward $\Omega(\sqrt{MK})$ lower bound obtained by combining the lower bound for the single-task episodic setting of Domingues et al. (2021b) with the Cauchy-Schwarz inequality. In the stochastic setting, the L-UCRL algorithm (Kwon et al., 2021) obtains $O(\sqrt{MK})$ regret with respect to the optimal policy of a partially observable MDP (POMDP) formulation that does not know the identity of the MDP in each episode; thus their notion of regret is weaker than the one in our work.

Overview of Techniques

  • In Section 3, we present two lower bounds. The first is a minimax lower bound of $\Omega(K\sqrt{DSAH})$ on the total regret of any algorithm. This result uses the construction of JAO MDPs in Jaksch et al. (2010). The second is an $\Omega\left(\frac{K}{\lambda^{2}}\right)$ instance-specific lower bound on the sample complexity and regret of the clustering phase for a class of uniformly good cluster-then-learn algorithms when both $\lambda$ and $M$ are sufficiently large. The instance-specific lower bound relies on the novel construction of the 2-JAO MDP, a hard instance combining two JAO MDPs, in which one is the minimax lower bound instance and the other satisfies $\lambda$-separability. We show that learning 2-JAO MDPs is fundamentally a two-dimensional extension of the problem of finding a biased coin among a collection of fair coins (e.g., Tulsiani, 2014), for which information-theoretic techniques of the one-dimensional problem can be adapted.

  • In Section 4, we show that AOMultiRL obtains a regret upper bound of $\tilde{O}\left(\frac{K}{\lambda^{2}}+\sqrt{MK}\right)$. The main idea of AOMultiRL is based on the observation that a fixed horizon of order $\Theta(\frac{1}{\lambda^{2}})$ with a small constant factor is sufficient to obtain a $\lambda$-dependent coarse estimate of the transition functions of all state-action pairs. In turn, this coarse estimate is sufficient to obtain high-probability guarantees for the correctness of the clustering phase. This allows AOMultiRL to have a fixed horizon for the learning phase and hence to apply single-task RL algorithms with theoretical guarantees, such as UCBVI-CH (Azar et al., 2017), in that phase.

Our paper is structured as follows: Section 2 formally sets up the problem. Section 3 presents the lower bounds. AOMultiRL and its regret upper bound are shown in Section 4. Several numerical simulations are in Section 5. The appendix contains formal proofs of all results. We defer detailed discussion on related works to Appendix A.

2 Problem Setup

Our learning setting consists of $K$ episodes. In episode $k=1,2,\dots,K$, an adversary chooses an unknown Markov decision process (MDP) $m^{k}$ from a set of finite-horizon tabular stationary MDP models $\mathcal{M}=\{(\mathcal{S},\mathcal{A},H,P_{i},r):i=1,2,\dots,M\}$, where $r:\mathcal{S}\times\mathcal{A}\mapsto[0,1]$ is the shared reward function, $\mathcal{S}$ is the set of states with size $S$, $\mathcal{A}$ is the set of actions with size $A$, $H$ is the length of each episode, and $P_{i}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\mapsto[0,1]$ is the transition function, where $P_{i}(s'\mid s,a)$ specifies the probability of being in state $s'$ after taking action $a$ at state $s$. The state space $\mathcal{S}$ and action space $\mathcal{A}$ are known and shared between all models; however, the transition functions are distinct and unknown. Following a common practice in the single-task RL literature (Azar et al., 2017; Jin et al., 2018), we assume that the reward function is known and deterministic; however, our techniques and results extend to the setting of an unknown stochastic $r$. Furthermore, the MDPs are assumed to be communicating with a finite diameter $D$ (Jaksch et al., 2010). A justification for this assumption on the diameter is provided in Section 2.1.

The adversary also chooses the initial state $s^{k}_{1}$. The policy of the learner in episode $k$ is a collection of $H$ functions $\pi^{k}=\{\pi^{k}_{h}:\mathcal{S}\mapsto\mathcal{A}\}_{h=1}^{H}$, which can be non-stationary and history-dependent. The value function of $\pi^{k}$ starting in state $s$ at step $h$ is the expected reward obtained by following $\pi^{k}$ for the remaining $H-h+1$ steps, $V_{h}^{\pi^{k}}(s)=\mathbb{E}\left[\sum_{h'=h}^{H}r(s^{k}_{h'},\pi^{k}_{h'}(s^{k}_{h'}))\mid s^{k}_{h}=s\right]$, where the expectation is taken with respect to the stochasticity in $m^{k}$ and $\pi^{k}$. Let $V_{1}^{k,*}$ denote the value function of the optimal policy in episode $k$.
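When the model $m^{k}$ is known, $V^{\pi^{k}}_{h}$ can be computed by backward induction. The following is a minimal sketch, not part of the paper's algorithms, assuming a known transition tensor of shape (S, A, S), a known reward matrix of shape (S, A), and a deterministic policy stored as an (H, S) array of actions.

```python
import numpy as np

def evaluate_policy(P, r, pi, H):
    """Backward-induction evaluation of a non-stationary deterministic policy.

    P : (S, A, S) transition tensor, P[s, a, s'] = probability of s' given (s, a).
    r : (S, A) known deterministic reward.
    pi: (H, S) integer array, pi[h, s] = action taken at step h in state s.
    Returns V of shape (H + 1, S) with V[h, s] = expected reward-to-go from step h.
    """
    S, A, _ = P.shape
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        for s in range(S):
            a = pi[h, s]
            V[h, s] = r[s, a] + P[s, a] @ V[h + 1]
    return V
```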

The performance of the learner is measured by its regret with respect to the optimal policies in every episode:

\[
\text{Regret}(K)=\sum_{k=1}^{K}\left[V^{k,*}_{1}-V^{\pi^{k}}_{1}\right](s^{k}_{1}). \tag{1}
\]

Let $[M]=\{1,2,\dots,M\}$. We assume that the MDPs in $\mathcal{M}$ are $\lambda$-separable:

Definition 1 ($\lambda$-separability)

Let λ>0\lambda>0 and consider set of MDP models ={m1,,mM}\mathcal{M}=\{m_{1},\dots,m_{M}\} with MM models. For all (i,j)[M]×[M](i,j)\in[M]\times[M] and iji\neq j, the λ\lambda-distinguishing set for two models mim_{i} and mjm_{j} is defined as the set of state-action pairs such that the 1\ell_{1} distance between Pi(s,a)P_{i}(s,a) and Pj(s,a)P_{j}(s,a) is larger than λ\lambda: Γi,jλ={(s,a)𝒮×𝒜:Pi(s,a)Pj(s,a)λ}\Gamma^{\lambda}_{i,j}=\{(s,a)\in\mathcal{S\times A}:\left\lVert P_{i}(s,a)-P_{j}(s,a)\right\rVert\geq\lambda\}, where \left\lVert\cdot\right\rVert denotes the 1\ell_{1}-norm and Pi(s,a)=Pi(s,a)P_{i}(s,a)=P_{i}(\cdot\mid s,a).

The set $\mathcal{M}$ is $\lambda$-separable if for every two models $m_{i},m_{j}$ in $\mathcal{M}$, the set $\Gamma^{\lambda}_{i,j}$ is non-empty:

\[
\forall i,j\in[M],\ i\neq j:\ \Gamma^{\lambda}_{i,j}\neq\emptyset.
\]

In addition, $\lambda$ is called a separation level of $\mathcal{M}$, and we say a state-action pair $(s,a)$ is $\lambda$-distinguishing for two models $m_{i}$ and $m_{j}$ if $\lVert P_{i}(s,a)-P_{j}(s,a)\rVert\geq\lambda$.

We use the following notion of a $\lambda$-distinguishing set for a collection of MDP models $\mathcal{M}$:

Definition 2 ($\lambda$-distinguishing set)

Given a $\lambda$-separable set of MDPs $\mathcal{M}$, a $\lambda$-distinguishing set of $\mathcal{M}$ is a set of state-action pairs $\Gamma^{\lambda}\subseteq\mathcal{S}\times\mathcal{A}$ such that for all $i,j\in[M]$, $\Gamma^{\lambda}_{i,j}\cap\Gamma^{\lambda}\neq\emptyset$. In particular, the set $\Gamma=\cup_{i,j}\Gamma^{\lambda}_{i,j}$ is a $\lambda$-distinguishing set of $\mathcal{M}$.

By definition, a state-action pair can be $\lambda$-distinguishing for some pairs of models and not $\lambda$-distinguishing for other pairs of models.
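As an illustration of Definitions 1 and 2, the following sketch (our own, with assumed numpy conventions, not code from the paper) checks $\lambda$-separability and extracts pairwise distinguishing sets from known transition tensors.

```python
import numpy as np
from itertools import combinations

def distinguishing_pairs(P_i, P_j, lam):
    """Return the lambda-distinguishing set Gamma_{i,j} for two models.

    P_i, P_j : (S, A, S) transition tensors.
    A pair (s, a) is included when ||P_i(.|s,a) - P_j(.|s,a)||_1 >= lam.
    """
    l1 = np.abs(P_i - P_j).sum(axis=2)          # (S, A) matrix of l1 distances
    return [tuple(sa) for sa in np.argwhere(l1 >= lam)]

def is_lambda_separable(models, lam):
    """Check Definition 1: every pair of models has a non-empty distinguishing set."""
    return all(distinguishing_pairs(models[i], models[j], lam)
               for i, j in combinations(range(len(models)), 2))
```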

2.1 Assumption on the finite diameter of the MDPs

In this work, all MDPs are assumed to be communicating. We employ the following formal definition and assumption, commonly used in the literature (Jaksch et al., 2010; Brunskill and Li, 2013; Sun and Huang, 2020; Tarbouriech et al., 2021):

Definition 3

(Jaksch et al., 2010) Given an ergodic Markov chain $\mathcal{F}$, let $T^{\mathcal{F}}_{s,s'}=\inf\{t>0\mid s_{t}=s',s_{0}=s\}$ be the first passage time between two states $s,s'$ on $\mathcal{F}$. Then the hitting time of a unichain MDP $G$ is $T_{G}=\max_{s,s'\in\mathcal{S}}\max_{\pi}\mathbb{E}[T^{\mathcal{F}_{\pi}}_{s,s'}]$, where $\mathcal{F}_{\pi}$ is the Markov chain induced by $\pi$ on $G$. In addition, $T'_{G}=\max_{s,s'\in\mathcal{S}}\min_{\pi}\mathbb{E}[T^{\mathcal{F}_{\pi}}_{s,s'}]$ is the diameter of $G$.
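For intuition, the diameter of a known communicating MDP can be computed by value iteration on expected hitting times. The sketch below is our own illustration (not part of the paper's algorithms) and assumes the transition tensor is given.

```python
import numpy as np

def diameter(P, n_iter=10_000, tol=1e-8):
    """Estimate the diameter of a communicating MDP by value iteration.

    For each target state t, h(s) = min over policies of the expected time
    to reach t from s, i.e. the fixed point of
        h(s) = 1 + min_a sum_{s'} P(s'|s,a) h(s'),   with h(t) = 0.
    The diameter is the maximum of h(s) over all ordered pairs (s, t).
    """
    S, A, _ = P.shape
    diam = 0.0
    for t in range(S):
        h = np.zeros(S)
        for _ in range(n_iter):
            h_new = 1.0 + (P @ h).min(axis=1)   # (S, A) -> min over actions
            h_new[t] = 0.0
            if np.max(np.abs(h_new - h)) < tol:
                h = h_new
                break
            h = h_new
        diam = max(diam, h.max())
    return diam
```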

Assumption 4

The diameters of all MDPs in $\mathcal{M}$ are bounded by a constant $D$.

While this finite diameter assumption is common in the undiscounted and discounted single-task settings (Jaksch et al., 2010; Guo and Brunskill, 2015), it is not necessary in the episodic single-task setting (Jin et al., 2018; Mao et al., 2021). Therefore, it is important to justify this assumption in the episodic multi-task setting. In the episodic single-task setting, for any initial state $s_{1}$, the average time between any pair of states reachable from $s_{1}$ is bounded by $2H$; hence, $H$ plays the same role as $D$ (Domingues et al., 2021b). This allows the learner to visit and gather state-transition samples in each state multiple times and construct accurate estimates of the model.

However, in the multi-task setting, the same initial state $s_{1}$ in one episode might belong to a different MDP than the state $s_{1}$ in previous episodes. Therefore, the set of reachable states and their state-transition distributions could change drastically. Hence, it is important that the $\lambda$-distinguishing state-action pairs be reachable from any initial state $s_{1}$ for the learner to recognize which MDP it is in and use the samples appropriately. Otherwise, combining samples from different MDPs could lead to negative transfer. Conversely, if the MDPs are allowed to be non-communicating, the component that makes them $\lambda$-separable might be unreachable from other components. In this case, the adversary can pick the initial states in these components and block the learner from accessing the $\lambda$-distinguishing state-action pairs. A construction that formalizes this argument is shown at the end of Section 3.

3 Minimax and Instance-Dependent Lower Bounds

We first show that if $\lambda$ is sufficiently small and $M=\Theta(SA)$, then the setting is uninteresting in the sense that one cannot do much better than learning every episode individually without any transfer, leading to an expected regret that grows linearly in the number of episodes $K$.

Lemma 5 (Minimax Lower Bound)

Suppose $S,A\geq 10$, $D\geq 20\log_{A}(S)$ and $H\geq DSA$ are given. Let $\lambda=\Theta(\sqrt{\frac{SA}{HD}})$. There exists a set of $\lambda$-separable MDPs $\mathcal{M}$ of size $M=\frac{SA}{4}$, each with $S$ states, $A$ actions, diameter at most $D$ and horizon $H$, such that if the tasks are chosen uniformly at random from $\mathcal{M}$, the expected regret of any sequence of policies $(\pi^{k})_{k=1,\dots,K}$ over $K$ episodes is

\[
\mathbb{E}[\mathrm{Regret}(K)]\geq\Omega\left(K\sqrt{DSAH}\right).
\]

Proof (Sketch) We construct $\mathcal{M}$ so that each MDP in $\mathcal{M}$ is a JAO MDP (Jaksch et al., 2010) with two states $\{0,1\}$, $\frac{SA}{4}$ actions and diameter $\frac{D}{4}$. Figure 1 (left) illustrates the structure of a JAO MDP. State $0$ has no reward, while state $1$ has reward $+1$. Each model has a unique best action $a^{*}$ that leads from $0$ to $1$. The pair $(0,a^{*})$ is a $\lambda$-distinguishing state-action pair.

A JAO MDP can be converted into an MDP with $S$ states, $A$ actions and diameter $D$, and this type of MDP gives the minimax lower bound proof in the undiscounted setting (Jaksch et al., 2010). The adversary selects a model from $\mathcal{M}$ uniformly at random, so previous episodes provide no useful information for the current episode; hence, the regret of any learner is equal to the sum of its $K$ one-episode learning regrets. The one-episode learning regret for JAO MDPs is known to be $\Omega(\sqrt{DSAH})$ when comparing against the optimal infinite-horizon average reward. For JAO MDPs, the optimal infinite-horizon policy is also optimal for the finite horizon; so, we can use a geometric convergence result from Markov chain theory (Levin et al., 2008) to convert this lower bound into a lower bound on the standard finite-horizon regret of the same order, giving the result.

Figure 1: A JAO MDP (left) and a 2-JAO MDP (right), where $\delta=\frac{4}{D}$. Only state $1$ has reward $+1$. The dashed arrows indicate the best actions.
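To make the construction concrete, the following sketch builds the transition tensor and reward matrix of a JAO MDP (the left panel of Figure 1) as numpy arrays. The function name and argument conventions are our own; for the lower bound instance one would take delta = 4/D and eps = lambda/2.

```python
import numpy as np

def jao_mdp(n_actions, delta, eps, best_action):
    """Transition tensor and rewards of a two-state JAO MDP (sketch).

    From state 0, every action moves to state 1 with probability delta, except
    best_action, which succeeds with probability delta + eps. From state 1,
    every action falls back to state 0 with probability delta. Only state 1
    gives reward +1.
    """
    P = np.zeros((2, n_actions, 2))
    P[0, :, 1] = delta                  # state 0 -> state 1
    P[0, best_action, 1] = delta + eps  # the biased best action
    P[0, :, 0] = 1.0 - P[0, :, 1]
    P[1, :, 0] = delta                  # state 1 -> state 0
    P[1, :, 1] = 1.0 - delta
    r = np.zeros((2, n_actions))
    r[1, :] = 1.0
    return P, r
```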

Using the same technique as in the proof of Lemma 5, we can show that applying UCRL2 (Jaksch et al., 2010) in every episode individually leads to a regret upper bound of $O\left(KDS\sqrt{AH\ln H}\right)$. This implies that learning every episode individually already gives a near-optimal regret guarantee.

Remark 6

Our proof of Lemma 5 contains a simple yet rigorous proof of the mixing-time argument used in Mao et al. (2021) and Jin et al. (2018). This argument claims that for JAO MDPs, when the diameter is sufficiently small compared to the horizon, the optimal $H$-step value function $V^{*}_{1}$ in the regret of the episodic setting can be replaced by the optimal average reward $\rho^{*}H$ of the undiscounted setting without changing the order of the lower bound. To the best of our knowledge, our proof is the first rigorous proof of this argument that applies for any number of episodes, including $K=1$. Domingues et al. (2021b) provide an alternative proof; however, the results therein hold in a different setting where $K$ is sufficiently large and the horizon $H$ can be much smaller than $D$.

We emphasize that the lower bound in Lemma 5 holds for any learning algorithm. This result motivates the more interesting setting in which $\lambda$ is a fixed and large constant independent of $H$. In this case, we are interested in an instance-specific lower bound. For multi-armed bandits, instance-specific lower bounds are constructed with respect to a class of uniformly good learning algorithms (Lai and Robbins, 1985). In our setting, we focus on defining a class of uniformly good algorithms that includes the cluster-then-learn algorithms in the previous works on multi-task PAC RL settings, such as Finite-Model-RL (Brunskill and Li, 2013) and PAC-EXPLORE (Guo and Brunskill, 2015). We consider a class of MDPs and a cluster-then-learn algorithm uniformly good if they satisfy an intuitive property: for any MDP in that class, the algorithm should be able to correctly classify whether a cluster of samples is from that MDP or not with an arbitrarily low (but not zero) failure probability, provided that the horizon $H$ is sufficiently long for the algorithm to collect enough samples. The following definition formalizes this idea.

Definition 7 (PAC identifiability of MDPs)

A set of models $\mathcal{M}$ of size $M$ is PAC identifiable if there exist a function $f:(0,1)\mapsto\mathbb{N}$, a sample collection policy $\pi$ and a classification algorithm $\mathcal{C}$ with the following property: for every $p\in(0,1)$ and for each model $1\leq m\leq M$ in $\mathcal{M}$, if $\pi$ is run for $f(p)$ steps and the state-transition samples are given to $\mathcal{C}$, then the algorithm $\mathcal{C}$ returns the correct identity of $m$ with probability at least $1-p$, where the probability is taken over all possible sequences of $f(p)$ samples collected by running $\pi$ on $m$ for $f(p)$ steps. The smallest choice of the function $f(p)$ among all possible choices is called the sample complexity of model identification of $\mathcal{M}$.

The clustering algorithm in a cluster-then-learn framework solves a problem different from classification: it only needs to tell whether a cluster of samples belongs to the same or a different distribution than another cluster of samples, not the identity of the distribution. We can reduce one problem to the other by the following construction: consider the adversary that gives all $M$ models in the first $M$ episodes. After the first $M$ episodes, there are $M$ clusters of samples, each corresponding to one model in $\mathcal{M}$. Once the learner has constructed $M$ different clusters, from episode $M+1$ onward the clustering problem is as hard as classification, since identifying the right cluster immediately implies the identity of the MDP from which the samples come, and vice versa. Hence, the sample complexity lower bound for classification also applies to clustering.

Next, we show the lower bound on the sample complexity of model identification for the class of $\lambda$-separable communicating MDPs.

Lemma 8

For any $S,A\geq 20$, $D\geq 16$ and $\lambda\in(0,\frac{1}{2}]$, there exists a PAC identifiable $\lambda$-separable set of MDPs $\mathcal{M}$ of size $\frac{SA}{12}$, each with at most $S$ states, $A$ actions and diameter $D$, such that for any classification algorithm $\mathcal{C}$, if the number of state-transition samples given to $\mathcal{C}$ is less than $\frac{SA}{180\lambda^{2}}$, then for at least one MDP in $\mathcal{M}$, algorithm $\mathcal{C}$ fails to identify that MDP with probability at least $\frac{1}{2}$.

Proof (Sketch) The set $\mathcal{M}$ is a set of 2-JAO MDPs, shown in Figure 1 (right). Each 2-JAO MDP combines two JAO MDPs with the same number of actions and with diameter in the range $[\frac{D}{2},D]$; one is $\lambda$-separable and the other is the hard instance for the minimax lower bound of Jaksch et al. (2010). Rewards exist only in the part containing the hard instance. If a learner completely ignores the $\lambda$-separable part, then by Lemma 5 the learner cannot do much better than learning every episode individually. On the other hand, with enough samples from the $\lambda$-separable part, the learner can identify the MDP and use the samples collected in previous episodes of the same MDP to accelerate learning the hard-instance part. However, the $\lambda$-separable part is also a JAO MDP, for which no useful information from previous episodes can help identify the MDP in the current episode.

Only the actions at state $0$ are $\lambda$-distinguishing and can be used to identify the MDPs. Taking an action in state $0$ can be seen as flipping a coin: heads for transitioning to another state and tails for staying in state $0$. Identifying a 2-JAO MDP reduces to the problem of using at most $H$ coin flips to identify, in a $Q\times 2$ matrix of coins, the row $j$ whose coins are slightly different from the others. The first column contains fair coins, except in row $j$, where the success probability is $\frac{1}{2}+\lambda$. The second column contains coins with success probability $\delta\leq\frac{1}{4}$, except in row $j$, where the coin is upwardly biased by $\Delta\leq\lambda$. Lemma 23 and Corollary 24 in the appendix show an $\Omega\left(\frac{Q}{\lambda^{2}}\right)$ lower bound on the number of coin flips in the first column (the left part of the 2-JAO MDP), implying the desired result.

Lemma 8 implies that for 2-JAO MDPs, any uniformly good model identification algorithm needs to collect at least $\Omega\left(\frac{SA}{\lambda^{2}}\right)$ samples from state $0$ of the left part. Whenever an action towards state $2$ is taken from state $0$, the learner may end up in state $2$. Once in state $2$, the learner needs to get back to state $0$ to obtain the next useful sample. The expected number of actions needed to get back to state $0$ from state $2$ is $\frac{1}{\delta}=\frac{D}{4}$. This implies the following two lower bounds on the horizon of the clustering phase and the total regret of any cluster-then-learn algorithm.

Corollary 9

For any $S,A\geq 20$, $D\geq 16$ and $\lambda\in(0,1]$, there exists a PAC identifiable $\lambda$-separable set of MDPs $\mathcal{M}$ of size $M=\frac{SA}{12}$, each with $S$ states, $A$ actions and diameter $D$, such that for any uniformly good cluster-then-learn algorithm, to find the correct cluster with probability at least $\frac{1}{2}$, the expected number of exploration steps needed in the clustering phase is $\Omega(\frac{DSA}{\lambda^{2}})$. Furthermore, the expected regret over $K$ episodes of the same algorithm is

\[
\mathbb{E}[\mathrm{Regret}(K)]\geq\Omega\left(\frac{KDSA}{\lambda^{2}}\right).
\]

Proof (Sketch) In the lower bound construction, the learner is assumed to know everything about the set of models, including their optimal policies. Hence, after having identified the model in the clustering phase, the learner can follow the optimal policy in the learning phase and incur a small regret of at most $D/2$ in this phase. Therefore, the regret is dominated by the regret of the clustering phase, which is of order $\frac{DSA}{\lambda^{2}}$.

Remark 10

The lower bound in Corollary 9 holds for a particular class of uniformly good cluster-then-learn algorithms under an adaptive adversary. It remains an open question whether this lower bound holds for all algorithms, not just cluster-then-learn ones.

Remark 11

Corollary 9 implies that, without further assumptions, it is not possible to improve the $\frac{1}{\lambda^{2}}$ dependency on $\lambda$. At first glance this seems to contradict existing results in the bandits and online learning literature, where the regret bound depends on $\frac{1}{\mathrm{gap}}$, where $\mathrm{gap}$ is the difference in expected reward between the best arm and the sub-optimal arms. However, $\lambda$ does not play the same role as the gaps in bandits. Observe that in the 2-JAO MDPs, the set of arms with positive reward lies only in the right JAO MDP. The lower-bound learner knows this, but chooses to pull the arms of the left JAO MDP (with zero reward) to collect side information that helps learn the right part faster. In this analogy, $\lambda$ does not play the same role as the gaps in bandits, since the learner already knows the arms of the left JAO MDP are suboptimal. The role of $\lambda$ is in model identification, for which similar $\frac{1}{\lambda^{2}}$ lower bounds are known (e.g., Tulsiani, 2014).

Finally, we construct a non-communicating variant of the 2-JAO MDP to show that the finite diameter assumption is necessary. Figure 4 in Appendix C illustrates this construction. In this variant, all the transitions from state $0$ to state $2$ are reversed. In addition, no action takes state $0$ to state $2$, making this MDP non-communicating. A set of these non-communicating MDPs is still $\lambda$-separable due to the state-action pairs that start at state $2$. However, by setting the initial state to $0$, the adversary can force the learner to operate only on the right part, regardless of how large $\lambda$ is.

4 Non-Asymptotic Upper Bounds

We propose and analyze AOMultiRL, a polynomial time cluster-then-learn algorithm that obtains a high-probability regret bound of $\tilde{O}(\frac{KDSA}{\lambda^{2}}+H^{3/2}\sqrt{MSAK})$. In each episode, the learner starts with the clustering phase to identify the cluster of samples generated in previous episodes that has the same task. Once the right cluster is identified, the learner can use the samples from previous episodes in the learning phase.

A fundamental difference between the undiscounted infinite-horizon setting considered in previous works (Guo and Brunskill, 2015; Brunskill and Li, 2013) and the episodic finite-horizon setting in our work is the horizon of the two phases. In previous works, different episodes might have different horizons for the clustering phase depending on whether the learner decides to start exploration at all (Brunskill and Li, 2015) or on which state-action pairs are to be explored (Brunskill and Li, 2013). This poses a challenge for the episodic finite-horizon setting, because a varying horizon for the clustering phase leads to a varying horizon for the learning phase. Thus, standard single-task algorithms that rely on a fixed horizon, such as UCBVI (Azar et al., 2017) and StrongEuler (Simchowitz and Jamieson, 2019), cannot be applied directly. From an algorithmic standpoint, for a fixed horizon $H$, a non-asymptotic bound on the horizon of the clustering phase is necessary so that the learner knows exactly whether $H$ is large enough and when to stop collecting samples.

AOMultiRL alleviates this issue by setting a fixed horizon for the clustering phase, which reduces the learning phase to standard single-task episodic RL. First, we state an assumption on the ergodicity of the MDPs.

Assumption 12

The hitting times of all MDPs in $\mathcal{M}$ are bounded by a known constant $\tilde{D}$.

The main purpose of Assumption 12 is to simplify the computation of a non-asymptotic upper bound for the clustering phase in order to focus the exposition on the main ideas. We discuss a method for removing this assumption in Appendix G.

Algorithm 1 outlines the main steps of our approach. Given a set $\Gamma^{\alpha}$ of $\alpha$-distinguishing state-action pairs, in the clustering phase the learner employs a history-dependent policy specified by Algorithm 2, ExploreID, to collect at least $N$ samples for each state-action pair in $\Gamma^{\alpha}$, where $N$ will be determined later. Once all $(s,a)$ in $\Gamma^{\alpha}$ have been visited at least $N$ times, Algorithm 3, IdentifyCluster, computes the empirical means of the transition functions of these $(s,a)$ and then compares them with those in each cluster to determine which cluster contains the samples from the same task (or none do, in which case a new cluster is created). For the rest of the episode, the learner uses the UCBVI-CH algorithm (Azar et al., 2017) to learn the optimal policy.

The algorithms and results up to Theorem 16 are presented for a general set $\Gamma^{\alpha}$. Since $\Gamma^{\alpha}$ is generally unknown, Corollary 17 states the result for $\alpha=\lambda$ and $\Gamma^{\alpha}=\mathcal{S}\times\mathcal{A}$.

Input: Number of models $M$, number of episodes $K$, MDP parameters $\mathcal{S},\mathcal{A},H,\tilde{D},\lambda$, probability $p$, separation level $\alpha$ and an $\alpha$-distinguishing set $\Gamma^{\alpha}$.
Compute $p_{1}=p/3$, $N=\frac{256}{\lambda^{2}}\max\{S,\ln(\frac{K|\Gamma^{\alpha}|}{p_{1}})\}$, $\delta=\alpha-\lambda/4$, $H_{0}=12\tilde{D}|\Gamma^{\alpha}|N$
Initialize $\mathcal{C}\leftarrow\emptyset$
for $k=1,\dots,K$ do
       Initialize $\mathcal{B}_{k}\leftarrow\emptyset$
       The environment chooses a task $m^{k}$
       Observe the initial state $s_{1}$
       for $h=1,\dots,H_{0}$ do
             $a_{h}=$ ExploreID($s_{h},\Gamma^{\alpha}$)
             Observe $s_{h+1}$ and $r_{h+1}$
             Add $(s_{h},a_{h},s_{h+1})$ to $\mathcal{B}_{k}$
       $id\leftarrow$ IdentifyCluster($\mathcal{B}_{k},\Gamma^{\alpha},\mathcal{C},\delta$)
       if $id\geq 1$ then $\mathcal{C}_{id}^{model}=\mathcal{C}_{id}^{model}\cup\mathcal{B}_{k}$
       else
             $id\leftarrow|\mathcal{C}|+1$
             $\mathcal{C}_{id}^{model}=\mathcal{B}_{k}$, $\mathcal{C}_{id}^{regret}=\emptyset$
             $\mathcal{C}\leftarrow\mathcal{C}\cup\mathcal{C}_{id}$
       $\pi^{k}=$ UCBVI-CH($\mathcal{C}_{id}^{regret}$)
       for $h=H_{0}+1,\dots,H$ do
             $a_{h}=\pi^{k}(h,s_{h})$
             Observe $s_{h+1}$ and $r_{h+1}$
             $\mathcal{C}_{id}^{regret}=\mathcal{C}_{id}^{regret}\cup(s_{h},a_{h},s_{h+1})$
Algorithm 1 Adversarial online multi-task RL
Input: Episode $k$, state $s$, set $\Gamma^{\alpha}$ and number $N$
Set $\mathcal{G}(s)=\{a\in\mathcal{A}:(s,a)\in\Gamma^{\alpha},\ N_{\mathcal{B}_{k}}(s,a)<N\}$
if $\mathcal{G}(s)\neq\emptyset$ then
       return $\operatorname{arg\,max}_{a\in\mathcal{G}(s)}N_{\mathcal{B}_{k}}(s,a)$
else
       return $\operatorname{arg\,max}_{a\in\mathcal{A}}\sum_{s'=1}^{S}\hat{P}^{k}(s'\mid s,a)\,\mathbb{I}\{\mathcal{G}(s')\neq\emptyset\}$
Algorithm 2 ExploreID
Input: Episode $k$, set $\Gamma^{\alpha}$, clusters $\mathcal{C}$, and threshold $\delta$
for $c=1,\dots,|\mathcal{C}|$ do
       Initialize $id\leftarrow c$
       for $(s,a)\in\Gamma^{\alpha}$ do
             if $\lVert[\hat{P}_{c}-\hat{P}^{k}](s,a)\rVert>\delta$ then
                   $id\leftarrow 0$
                   break
       if $id==c$ then
             return $id$
return $0$
Algorithm 3 IdentifyCluster

4.1 The Exploration Algorithm

Given a collection $\mathcal{B}$ of tuples $(s,a,s')$, the empirical transition function estimated from $\mathcal{B}$ is

\[
\hat{P}_{\mathcal{B}}(s'\mid s,a)=\begin{cases}\frac{N_{\mathcal{B}}(s,a,s')}{N_{\mathcal{B}}(s,a)}&\text{if }N_{\mathcal{B}}(s,a)>0,\\ 0&\text{otherwise},\end{cases}
\]
where
\[
N_{\mathcal{B}}(s,a,s')=\sum_{(x,y,z)\in\mathcal{B}}\mathbb{I}\{x=s,y=a,z=s'\},\qquad N_{\mathcal{B}}(s,a)=\sum_{s'\in\mathcal{S}}N_{\mathcal{B}}(s,a,s')
\]

are the numbers of instances of $(s,a,s')$ and $(s,a)$ in $\mathcal{B}$, respectively.
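A direct implementation of this estimator might look as follows (a sketch with assumed array conventions, not code from the paper):

```python
import numpy as np

def empirical_transitions(samples, S, A):
    """Empirical transition estimate P_hat from a buffer of (s, a, s') tuples.

    Matches the definition above: P_hat(s'|s,a) = N(s,a,s') / N(s,a) when
    N(s,a) > 0, and 0 otherwise.
    """
    counts = np.zeros((S, A, S))
    for s, a, s_next in samples:
        counts[s, a, s_next] += 1
    n_sa = counts.sum(axis=2, keepdims=True)     # N(s, a)
    with np.errstate(invalid="ignore", divide="ignore"):
        P_hat = np.where(n_sa > 0, counts / n_sa, 0.0)
    return P_hat, counts
```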

For each episode $k$, let $P^{k}$ denote the transition function of the task $m^{k}$ and let $\mathcal{B}_{k}$ denote the collection of samples $(s_{h},a_{h},s_{h+1})$ collected during the clustering phase. The empirical means $\hat{P}^{k}$ estimated using samples in $\mathcal{B}_{k}$ are $\hat{P}^{k}=\hat{P}_{\mathcal{B}_{k}}$. The value of $N$ can be chosen so that for all $(s,a)\in\Gamma^{\alpha}$, with high probability $\hat{P}^{k}(s,a)$ is close to $P^{k}(s,a)$. Specifically, we find that if $N$ is large enough so that $\hat{P}^{k}(s,a)$ is within $\lambda/8$ in $\ell_{1}$-norm of the true function $P^{k}(s,a)$, then the right cluster can be identified in every episode. The exact value of $N$ is given in the following lemma.

Lemma 13

Suppose the learner is given a constant $p_{1}\in(0,1)$ and an $\alpha$-distinguishing set $\Gamma^{\alpha}\subseteq\mathcal{S}\times\mathcal{A}$. If each state-action pair in $\Gamma^{\alpha}$ is visited at least $N=\frac{256}{\lambda^{2}}\max\{S,\ln(\frac{K|\Gamma^{\alpha}|}{p_{1}})\}$ times during the clustering phase of each episode $k=1,2,\dots,K$, then with probability at least $1-p_{1}$, the event

\[
\mathcal{E}^{\Gamma^{\alpha}}_{k}=\left\{\forall(s,a)\in\Gamma^{\alpha},\ \lVert P^{k}(s,a)-\hat{P}^{k}(s,a)\rVert\leq\frac{\lambda}{8}\right\} \text{ holds for all } k\in[K].
\]

The exploration in AOMultiRL is modelled as an instance of the active model estimation problem (Tarbouriech et al., 2020). Given the current state $s$, if there exists an action $a$ such that $(s,a)\in\Gamma^{\alpha}$ and $(s,a)$ has not yet been visited $N$ times, this action is chosen (with ties broken by selecting the most chosen action). Otherwise, the algorithm chooses the action with the highest estimated probability of leading to an under-sampled state-action pair in $\Gamma^{\alpha}$. The following lemma computes the number of steps $H_{0}$ of the clustering phase.

Lemma 14

Consider $p_{1}$ and $N$ defined in Lemma 13. By setting

\[
H_{0}=12\tilde{D}|\Gamma^{\alpha}|N=\frac{3072\tilde{D}|\Gamma^{\alpha}|}{\lambda^{2}}\max\left\{S,\ln\left(\frac{K|\Gamma^{\alpha}|}{p_{1}}\right)\right\},
\]

with probability at least $1-p_{1}$, Algorithm 2 visits each state-action pair in $\Gamma^{\alpha}$ at least $N$ times during the clustering phase in each of the $K$ episodes.
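As a concrete illustration of the clustering phase, the sketch below (our own, not code from the paper) computes $N$ and $H_{0}$ as stated in Lemmas 13 and 14 and implements the action-selection rule of Algorithm 2 on top of numpy arrays; the function names and array shapes are assumptions.

```python
import math
import numpy as np

def clustering_constants(S, lam, K, gamma_size, p1, D_tilde):
    """N from Lemma 13 and H0 from Lemma 14, using the stated constants."""
    N = math.ceil((256.0 / lam**2) * max(S, math.log(K * gamma_size / p1)))
    H0 = 12 * D_tilde * gamma_size * N
    return N, H0

def explore_id(s, Gamma, counts_sa, P_hat, N):
    """One step of the ExploreID rule (Algorithm 2).

    s         : current state.
    Gamma     : set of distinguishing (state, action) pairs.
    counts_sa : (S, A) visit counts N_{B_k}(s, a) in the current clustering phase.
    P_hat     : (S, A, S) running empirical transition estimate.
    N         : target number of visits per pair in Gamma.
    """
    S, A = counts_sa.shape
    under_sampled = [a for a in range(A) if (s, a) in Gamma and counts_sa[s, a] < N]
    if under_sampled:
        # Ties broken toward the most-visited under-sampled action.
        return max(under_sampled, key=lambda a: counts_sa[s, a])
    # Otherwise, move toward states that still have under-sampled pairs in Gamma.
    needs_visit = np.array(
        [any((sp, a) in Gamma and counts_sa[sp, a] < N for a in range(A))
         for sp in range(S)],
        dtype=float,
    )
    return int(np.argmax(P_hat[s] @ needs_visit))
```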

4.2 The Clustering Algorithm

Denote by $\mathcal{C}$ the set of clusters, by $C=|\mathcal{C}|$ the number of clusters and by $\mathcal{C}_{i}$ the $i$-th cluster. Each $\mathcal{C}_{i}$ is a collection of two multisets $\mathcal{C}_{i}^{model},\mathcal{C}_{i}^{regret}\subset\mathcal{S}\times\mathcal{A}\times\mathcal{S}$, which contain the $(s,a,s')$ samples collected during the clustering and learning phases, respectively. Formally, up to episode $k$ we have

\[
\mathcal{C}_{i}^{model}=\cup_{k'=1}^{k-1}\{(s^{k'}_{h},a^{k'}_{h},s^{k'}_{h+1}):h\leq H_{0},\ \mathrm{id}^{k'}=i\},
\qquad
\mathcal{C}_{i}^{regret}=\cup_{k'=1}^{k-1}\{(s^{k'}_{h},a^{k'}_{h},s^{k'}_{h+1}):h>H_{0},\ \mathrm{id}^{k'}=i\},
\]

where $s^{k}_{h}$ and $a^{k}_{h}$ are the state and action at time step $h$ of episode $k$, respectively, and $\mathrm{id}^{k'}$ is the cluster index returned by Algorithm 3 in episode $k'$.

Let $\hat{P}_{i}=\hat{P}_{\mathcal{C}_{i}^{model}}$ denote the empirical means estimated using samples in $\mathcal{C}_{i}^{model}$. For each episode $k$, by Lemma 14, with high probability after the first $H_{0}$ steps each state-action pair $(s,a)\in\Gamma^{\alpha}$ has been visited at least $N$ times. Algorithm 3 determines the right cluster for a task by computing the $\ell_{1}$ distance between $\hat{P}^{k}$ and the empirical transition function $\hat{P}_{i}$ of each cluster $i=1,2,\dots,C$. If there exists an $(s,a)\in\Gamma^{\alpha}$ such that the distance is larger than a certain threshold $\delta$, i.e., $\lVert[\hat{P}_{i}-\hat{P}^{k}](s,a)\rVert>\delta$, then the algorithm concludes that the task belongs to another cluster. Otherwise, the task is considered to belong to cluster $i$. We set $\delta=\alpha-\lambda/4$. The following lemma shows that with this choice of $\delta$, the right cluster is identified by Algorithm 3 in all episodes.

Lemma 15

Consider a $\lambda$-separable set of MDPs $\mathcal{M}$ and an $\alpha$-distinguishing set $\Gamma^{\alpha}$ where $\alpha\geq\lambda/2$. If the events $\mathcal{E}^{\Gamma^{\alpha}}_{k}$ defined in Lemma 13 hold for all $k\in[K]$, then with the distance threshold $\delta=\alpha-\lambda/4$, Algorithm 3 always produces a correct output in each episode: the trajectories of the same model in two different episodes are clustered together, and no two trajectories of two different models are in the same cluster.
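A minimal sketch of the cluster-matching step of Algorithm 3 under this choice of $\delta$, written against the empirical estimator above (the names and shapes are our own conventions):

```python
import numpy as np

def identify_cluster(P_hat_k, cluster_estimates, Gamma, delta):
    """Return the 1-based index of the matching cluster, or 0 if none matches.

    P_hat_k          : (S, A, S) estimate from the current clustering phase.
    cluster_estimates: list of (S, A, S) estimates P_hat_i, one per cluster.
    Gamma            : distinguishing (s, a) pairs to compare on.
    delta            : distance threshold (alpha - lambda / 4 in the paper).
    """
    for c, P_hat_c in enumerate(cluster_estimates, start=1):
        # Accept cluster c only if every distinguishing pair is within delta in l1.
        if all(np.abs(P_hat_c[s, a] - P_hat_k[s, a]).sum() <= delta
               for (s, a) in Gamma):
            return c
    return 0
```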

Once the clustering phase finishes, the learner enters the learning phase and uses the UCBVI-CH algorithm (Azar et al., 2017) to learn the optimal policy for this phase. In principle, almost all standard single-task RL algorithms with a near-optimal regret guarantee can be used for this phase. We chose UCBVI-CH to simplify the analysis and make the exposition clear.

To simulate the standard single-task episodic learning setting, the learner only uses the samples in $\mathcal{C}_{i}^{regret}$ for regret minimization. The impact of combining samples from the two phases for regret minimization is addressed in Appendix H. Theorem 16 states a regret bound for Algorithm 1.

Theorem 16

For any failure probability $p\in(0,1)$, with probability at least $1-p$ the regret of Algorithm 1 is bounded as

\[
\text{Regret}(K)\leq 2KH_{0}+67H_{1}^{3/2}L\sqrt{MSAK}+15MS^{2}AH_{1}^{2}L^{2},
\]

where $H_{0}=12\tilde{D}|\Gamma^{\alpha}|N$, $N=\frac{256}{\lambda^{2}}\max\{S,\ln(\frac{3K|\Gamma^{\alpha}|}{p})\}$, $H_{1}=H-H_{0}$, and $L=\ln(15SAKHM/p)$.

For $K>MS^{3}AH$, the first two terms are the most significant. The $2KH_{0}$ term accounts for the clustering phase and for the fact that the exploration policy might leave the learner in an undesirable state after $H_{0}$ steps. The $\tilde{O}(\sqrt{K})$ term comes from the fact that the learning phase is equivalent to episodic single-task learning with horizon $H_{1}$. When $H\gg H_{0}$, the sub-linear bound on the learning phase is a major improvement over the $O(K\sqrt{HSA})$ bound of the strategy that learns each episode individually.

By setting $\Gamma^{\alpha}=\mathcal{S}\times\mathcal{A}$ and $\alpha=\lambda$, we obtain:

Corollary 17

For any failure probability $p\in(0,1)$, with probability at least $1-p$, by setting $\Gamma^{\alpha}=\mathcal{S}\times\mathcal{A}$ with $\alpha=\lambda$, the regret of Algorithm 1 is

\[
\text{Regret}(K)\leq O\left(\frac{K\tilde{D}SA}{\lambda^{2}}\ln\left(\frac{KSA}{p}\right)+H^{3/2}L\sqrt{MSAK}\right), \tag{2}
\]

where $L=\ln(15SAKH_{1}M/p)$.

Time Complexity The clustering algorithm runs once in each episode, which leads to a time complexity of $O(MSA+H)$. When $H\gg H_{0}$, the overall time complexity is dominated by the learning phase, which is $O(HSA)$ for UCBVI-CH.

Remark 18

Instead of clustering, a different paradigm involves actively merging samples from different MDPs to learn a model that is an averaged estimate of the MDPs in $\mathcal{M}$. The best regret guarantee in this paradigm, to the best of our knowledge, is $\tilde{O}(S^{1/3}A^{1/3}B^{1/3}H^{5/3}K^{2/3})$, where $B$ is a variation budget, achieved by RestartQ-UCB (Mao et al., 2021, Theorem 3). In our setting, if the adversary frequently alternates between tasks then $B=\Omega(KH\lambda)$, and therefore this bound becomes $\tilde{O}(\lambda^{1/3}S^{1/3}A^{1/3}H^{2}K)$, which is larger than the trivial bound $KH$ and worse than the bound in Corollary 17. If the adversary selects tasks so that $B$ is small, i.e., $B=o(K)$, then the bound offered by RestartQ-UCB is better since it is sub-linear in $K$. Note that this does not contradict the lower bound result in Section 3, since the lower bound is constructed with an adversary that selects tasks uniformly at random, for which $B$ is linear in $K$.

4.3 Learning a distinguishing set when $M$ is small

As pointed out by Brunskill and Li (2013), for all $\alpha>0$, the size of the smallest $\alpha$-distinguishing set of $\mathcal{M}$ is at most $\binom{M}{2}$. If $M^{2}\ll SA$ and such a set is known to the learner, then the clustering phase only needs to collect samples from this set instead of the full $\mathcal{S}\times\mathcal{A}$ set of state-action pairs. However, in general this set is not known. We show that if the adversary is weaker, so that all models are guaranteed to appear at least once early on, the learner is able to discover a $\frac{\lambda}{2}$-distinguishing set $\hat{\Gamma}$ of size at most $\binom{M}{2}$. Specifically, we employ the following assumption:

Assumption 19

There exists an unknown constant $K_{1}\geq M$ satisfying $K_{1}SA<K$ such that after at most $K_{1}$ episodes, each model in $\mathcal{M}$ has been given to the learner at least once.

In order to discover $\hat{\Gamma}$, the learner uses Algorithm 4, which consists of two stages:

  • Stage 1: the learner starts by running Algorithm 1 with the $\lambda$-distinguishing set candidate $\mathcal{S}\times\mathcal{A}$ until the number of clusters is $M$. With high probability, each cluster corresponds to a model. At the end of stage 1, the learner uses the empirical estimates $\hat{P}_{i}$, $i\in[M]$, of all clusters to construct a $\lambda/2$-distinguishing set $\hat{\Gamma}$ for $\mathcal{M}$.

  • Stage 2: the learner runs Algorithm 1 with the distinguishing set $\hat{\Gamma}$ as an input.

Extracting $\lambda/2$-distinguishing pairs: After $K_{1}$ episodes, with high probability there are $M$ clusters corresponding to the $M$ models. For two clusters $i$ and $j$, the set $\hat{\Gamma}_{i,j}$ contains the first state-action pair $(s,a)$ that satisfies $\lVert\hat{P}_{i}(s,a)-\hat{P}_{j}(s,a)\rVert>3\lambda/4$. With high probability, every $(s,a)\in\Gamma^{\lambda}_{i,j}$ satisfies this condition, hence $\hat{\Gamma}_{i,j}\neq\emptyset$.

Let $i^{\star}\in[M]$ denote the index of the MDP model corresponding to cluster $i$. For all $(s,a)\in\hat{\Gamma}_{i,j}$, by the triangle inequality, we have

\[
\lVert P_{i^{\star}}-P_{j^{\star}}\rVert\geq\lVert\hat{P}_{i}-\hat{P}_{j}\rVert-\lVert\hat{P}_{i}-P_{i^{\star}}+P_{j^{\star}}-\hat{P}_{j}\rVert>3\lambda/4-(\lambda/8+\lambda/8)=\lambda/2,
\]

where $(s,a)$ is omitted for brevity. It follows that the set $\hat{\Gamma}=\cup_{i,j}\hat{\Gamma}_{i,j}$ is $\lambda/2$-distinguishing and $|\hat{\Gamma}|\leq\binom{M}{2}$. Although $\lambda/2$ is smaller than the $\lambda$-separation level of $\Gamma$, it is sufficient for the conditions in Lemma 15 to hold. Thus, with high probability the clustering algorithm in stage 2 works correctly. The next theorem shows the regret guarantee of Algorithm 4.

Theorem 20

Under Assumption 19, with probability at least $1-p$, the regret of Algorithm 4 is

\[
\text{Regret}(K)=O\left(\frac{K\tilde{D}M^{2}}{\lambda^{2}}\ln\frac{KM^{2}}{p}+H^{3/2}L\sqrt{MKSA}\right),
\]

where $H_{0,M}=\frac{3072\tilde{D}M^{2}}{\lambda^{2}}\max\{S,\ln(\frac{3KM^{2}}{p})\}$ and $L=\ln(15SAKH_{1}M/p)$.

Compared to Corollary 17, Theorem 20 improves the clustering phase's dependency from $SA$ to $M^{2}$. This implies that if the number of models is small and all models appear relatively early, we can discover a $\lambda/2$-distinguishing set quickly without increasing the order of the total regret bound.

Input: Number of models $M$, number of episodes $K$, MDP parameters $\mathcal{S},\mathcal{A},H,\tilde{D},\lambda$, probability $p$
Stage 1: Run Algorithm 1 with the distinguishing set $\Gamma^{\alpha}=\mathcal{S}\times\mathcal{A}$ and $\alpha=\lambda$ until the number of clusters is $M$
for $(i,j)\in[M]\times[M],\ i\neq j$ do
       $\hat{\Gamma}_{i,j}=\emptyset$
       for $(s,a)\in\mathcal{S}\times\mathcal{A}$ do
             if $\lVert\hat{P}_{i}(s,a)-\hat{P}_{j}(s,a)\rVert>3\lambda/4$ then
                   $\hat{\Gamma}_{i,j}=\hat{\Gamma}_{i,j}\cup(s,a)$
                   break
$\hat{\Gamma}=\cup_{i,j}\hat{\Gamma}_{i,j}$
Stage 2: Run Algorithm 1 with the distinguishing set $\hat{\Gamma}$ and $\alpha=\lambda/2$ for $K_{2}=K-K_{1}$ episodes.
Algorithm 4 AOMultiRL with all models being given at least once
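A Python sketch of the stage-1 extraction loop above, operating on the per-cluster empirical estimates $\hat{P}_{i}$ (our own conventions, not code from the paper):

```python
import numpy as np
from itertools import combinations

def discover_distinguishing_set(cluster_estimates, lam):
    """Extract a lambda/2-distinguishing set from cluster estimates (Algorithm 4, stage 1).

    For each pair of clusters, keep the first (s, a) whose empirical l1
    distance exceeds 3 * lambda / 4.
    """
    Gamma_hat = set()
    S, A, _ = cluster_estimates[0].shape
    for i, j in combinations(range(len(cluster_estimates)), 2):
        l1 = np.abs(cluster_estimates[i] - cluster_estimates[j]).sum(axis=2)
        for s in range(S):
            for a in range(A):
                if l1[s, a] > 3 * lam / 4:
                    Gamma_hat.add((s, a))
                    break
            else:
                continue   # no pair found in this row; keep scanning
            break          # first qualifying pair found for this (i, j)
    return Gamma_hat
```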

5 Experiments

Figure 2: Average per-episode reward.

We evaluate AOMultiRL on a sequence of $K=200$ episodes, where the task in each episode is taken from a set of $M=4$ MDPs. Each MDP in $\mathcal{M}$ is a $4\times 4$ grid of $S=16$ cells with $A=4$ valid actions: up, down, left, right. The state in row $r$ and column $c$ (0-indexed) is represented by the tuple $(r,c)$. The reward is $0$ in every state, except for the four corners $(0,0),(0,3),(3,0)$ and $(3,3)$, where the reward is $1$. We fix the initial state at $(1,1)$.

To simulate an adversarial sequence of tasks, episodes 100 to 150 and episodes 180 to 200 contain only the MDP $m_{4}$. The other episodes choose among $m_{1},m_{2}$ and $m_{3}$ uniformly at random. The hitting time is $\tilde{D}=7$ and the failure probability is $p=0.03$. We use the rlberry framework (Domingues et al., 2021a) for our implementation; the code is available at https://github.com/ngmq/adversarial-online-multi-task-reinforcement-learning.
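For reference, the adversarial task schedule described above can be generated as follows. This is a sketch: the seed, the 1-indexed episodes, and the inclusive-range convention are our assumptions, not taken from the released code.

```python
import numpy as np

def task_schedule(K=200, seed=0):
    """Task index (1..4) per episode: episodes 100-150 and 180-200 are m4;
    all other episodes pick m1, m2 or m3 uniformly at random."""
    rng = np.random.default_rng(seed)
    tasks = []
    for k in range(1, K + 1):
        if 100 <= k <= 150 or 180 <= k <= 200:
            tasks.append(4)
        else:
            tasks.append(int(rng.integers(1, 4)))   # uniform over {1, 2, 3}
    return tasks
```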

We construct the transition functions so that each MDP has only one easy-to-reach corner, which corresponds to a unique optimal policy. The separation level $\lambda$ is $1.2999$. Furthermore, there exist state-action pairs that are $\lambda/2$-distinguishing but not $\lambda$-distinguishing. More details can be found in Appendix J.

The baseline algorithms include a random agent that chooses actions uniformly at random, a one-episode UCBVI agent that does not group episodes and learns using only the samples of each episode, and the optimal non-stationary agent that acts optimally in every episode. The first and the last baselines serve as lower and upper bounds on the performance of AOMultiRL, while the second baseline helps illustrate the effectiveness of clustering episodes correctly. We evaluate two instances of AOMultiRL: AOMultiRL1, which is given a distinguishing set $\Gamma$ with $|\Gamma|=3$, and AOMultiRL2, which is given no distinguishing set. We follow the approach of Kwon et al. (2021) and evaluate all five algorithms based on their expected cumulative reward when starting at state $(1,1)$ and following their learned policy for $H_{1}=200$ steps (averaged over 10 runs). While the horizon for the learning phase is much smaller than the horizon of $H_{0}\approx 80000$ for the clustering phase, we ensure the fairness of the comparisons by not using the samples collected in the clustering phase in the learning phase, thus simulating the setting where $H_{0}\ll H_{1}$ without the need to use significantly larger MDPs. We use the average per-episode reward as the performance metric. Figure 2 shows the results.

The effectiveness of the clustering on the learning phase. To measure the effectiveness of aggregating samples from episodes of the same task for the learning phase, we compare AOMultiRL1 and the one-episode UCBVI agent. Since for every pair of MDP models the transition functions are distinct for state-action pairs adjacent to two of the corners, AOMultiRL1 can learn the estimated model accurately for each MDP model only if the clustering phase produces correct clusters in most of the episodes. We can observe in Figure 2 that after about thirty episodes, AOMultiRL1 starts outperforming the one-episode UCBVI agent and approaching the performance of the optimal non-stationary agent. The model $m_{4}$ appears for the first time in episode 100, which accounts for the sudden drop in performance in that episode. Afterwards, the performance of AOMultiRL1 steadily increases again. This demonstrates that AOMultiRL1 is able to identify the correct cluster in most of the episodes, which enables the multi-episode UCBVI algorithm in AOMultiRL1 to estimate the MDP models much more accurately than the non-transfer one-episode UCBVI agent. This suggests that for larger MDPs where $H_{1}\gg H_{0}$, spending a number of initial steps on finding the episodes of the same task would yield higher long-term rewards.

Performance of AOMultiRL with the discovered $\hat{\Gamma}$. Next, we examine the performance of AOMultiRL2 when no distinguishing set is given. We run AOMultiRL2 for 204 episodes, in which stage 1 consists of the first four episodes, each containing one of the four MDP models in $\mathcal{M}$. As the identities of the models are not given, the algorithm has to correctly construct four clusters and then compute a $\lambda/2$-distinguishing set after the 4th episode, even though each model is seen just once. As mentioned above, the MDPs are set up so that if AOMultiRL2 correctly identifies four clusters, then the discovered $\hat{\Gamma}$ will contain at least one state-action pair that is $\lambda/2$-distinguishing but not $\lambda$-distinguishing. In stage 2, the horizon of the learning phase is set to the same $H_{1}=200$ used for AOMultiRL1. The performance of AOMultiRL2 in stage 2 approaches that of AOMultiRL1, indicating that the discovered $\hat{\Gamma}$ is as effective as the set $\Gamma$ given to AOMultiRL1.

6 Conclusion

In this paper, we studied the adversarial online multi-task RL setting with the tasks belonging to a finite set of well-separated models. We used a general notion of task-separability, which we call $\lambda$-separability. Under this notion, we proved a minimax regret lower bound that applies to all algorithms and an instance-specific regret lower bound that applies to a class of uniformly good cluster-then-learn algorithms. We further proposed AOMultiRL, a polynomial time cluster-then-learn algorithm that obtains a nearly optimal instance-specific regret upper bound. These results address two fundamental aspects of online multi-task RL, namely learning an adversarial task sequence and learning under a general task-separability notion. Adversarial online multi-task learning remains challenging when the diameter and the number of models are unknown; this is left for future work.

Acknowledgement

This work was supported by the NSERC Discovery Grant RGPIN-2018-03942.

References

  • Abel et al. (2018) David Abel, Yuu Jinnai, Sophie Yue Guo, George Konidaris, and Michael Littman. Policy and value transfer in lifelong reinforcement learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 20–29. PMLR, 10–15 Jul 2018.
  • Auer and Ortner (2007) Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19. MIT Press, 2007.
  • Auer et al. (2002) Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002. doi: 10.1137/S0097539701398375. URL https://doi.org/10.1137/S0097539701398375.
  • Azar et al. (2013) Mohammad Gheshlaghi Azar, Alessandro Lazaric, and Emma Brunskill. Sequential transfer in multi-armed bandit with finite set of models. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
  • Azar et al. (2017) Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 263–272. PMLR, 06–11 Aug 2017.
  • Brunskill and Li (2013) Emma Brunskill and Lihong Li. Sample complexity of multi-task reinforcement learning. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI’13, page 122–131, Arlington, Virginia, USA, 2013. AUAI Press.
  • Brunskill and Li (2015) Emma Brunskill and Lihong Li. The online discovery problem and its application to lifelong reinforcement learning. CoRR, abs/1506.03379, 2015.
  • Cover and Thomas (2006) Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006. ISBN 0471241954.
  • Dann and Brunskill (2015) Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
  • Dann et al. (2017) Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • Domingues et al. (2021a) Omar Darwiche Domingues, Yannis Flet-Berliac, Edouard Leurent, Pierre Ménard, Xuedong Shang, and Michal Valko. rlberry - A Reinforcement Learning Library for Research and Education, 10 2021a. URL https://github.com/rlberry-py/rlberry.
  • Domingues et al. (2021b) Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, and Michal Valko. Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited. In Vitaly Feldman, Katrina Ligett, and Sivan Sabato, editors, Proceedings of the 32nd International Conference on Algorithmic Learning Theory, volume 132 of Proceedings of Machine Learning Research, pages 578–598. PMLR, 16–19 Mar 2021b. URL https://proceedings.mlr.press/v132/domingues21a.html.
  • Guo and Brunskill (2015) Zhaohan Guo and Emma Brunskill. Concurrent PAC RL. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, page 2624–2630. AAAI Press, 2015. ISBN 0262511290.
  • Hallak et al. (2015) Assaf Hallak, Dotan Di Castro, and Shie Mannor. Contextual Markov Decision Processes, 2015.
  • Jaksch et al. (2010) Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(51):1563–1600, 2010.
  • Jin et al. (2018) Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  • Kwon et al. (2021) Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, and Shie Mannor. RL for latent MDPs: Regret guarantees and a lower bound. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
  • Lai and Robbins (1985) Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
  • Lecarpentier et al. (2021) Erwan Lecarpentier, David Abel, Kavosh Asadi, Yuu Jinnai, Emmanuel Rachelson, and Michael L. Littman. Lipschitz lifelong reinforcement learning. In AAAI, 2021.
  • Levin et al. (2008) David Asher Levin, Yuval Peres, and Elizabeth Lee Wilmer. Markov Chains and Mixing Times. American Mathematical Soc., 2008. ISBN 9780821886274. URL http://pages.uoregon.edu/dlevin/MARKOV/.
  • Mao et al. (2021) Weichao Mao, Kaiqing Zhang, Ruihao Zhu, David Simchi-Levi, and Tamer Basar. Near-optimal model-free reinforcement learning in non-stationary episodic MDPs. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 7447–7458. PMLR, 18–24 Jul 2021. URL http://proceedings.mlr.press/v139/mao21b.html.
  • Sason (2015) Igal Sason. On reverse Pinsker inequalities. arXiv preprint arXiv:1503.07118, 2015.
  • Simchowitz and Jamieson (2019) Max Simchowitz and Kevin G Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • Steimle et al. (2021) Lauren N. Steimle, David L. Kaufman, and Brian T. Denton. Multi-model Markov decision processes. IISE Transactions, 53(10):1124–1139, 2021. doi: 10.1080/24725854.2021.1895454.
  • Sun and Huang (2020) Yanchao Sun and Furong Huang. Can agents learn by analogy? An inferable model for PAC reinforcement learning. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’20, page 1332–1340, Richland, SC, 2020. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450375184.
  • Tarbouriech et al. (2020) Jean Tarbouriech, Shubhanshu Shekhar, Matteo Pirotta, Mohammad Ghavamzadeh, and Alessandro Lazaric. Active model estimation in Markov Decision Processes. In Uncertainty in Artificial Intelligence, 2020. URL http://proceedings.mlr.press/v124/tarbouriech20a/tarbouriech20a-supp.pdf.
  • Tarbouriech et al. (2021) Jean Tarbouriech, Matteo Pirotta, Michal Valko, and Alessandro Lazaric. A provably efficient sample collection strategy for reinforcement learning. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=AvVDR8R-kQX.
  • Tulsiani (2014) Madhur Tulsiani. Lecture 6, lecture notes in information and coding theory, October 2014. URL https://home.ttic.edu/~madhurt/courses/infotheory2014/l6.pdf.
  • Weissman et al. (2003) Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.

Appendix A Related Work

Stochastic Online Multi-task RL. The Finite-Model-RL algorithm (Brunskill and Li, 2013) considered the stochastic setting with infinite-horizon MDPs and focused on deriving a sample complexity of exploration in a (ϵ,δ)(\epsilon,\delta)-PAC setting. As shown by Dann et al. (2017), even an optimal (ϵ,δ)(\epsilon,\delta)-PAC bound can only guarantee a necessarily sub-optimal O(Km2/3)O(K_{m}^{2/3}) regret bound for each task m[M]m\in[M] that appears in KmK_{m} episodes, leading to an overall O(M1/3K2/3)O(M^{1/3}K^{2/3}) regret bound for the learning phase in the multi-task setting.

The Contextual MDPs algorithm by Hallak et al. (2015) is capable of obtaining an O(K)O(\sqrt{K}) regret bound in the learning phase after the right cluster has been identified; however, their clustering phase has time complexity exponential in KK. The recent L-UCRL algorithm (Kwon et al., 2021) considered the stochastic finite-horizon setting and reduced the problem to learning the optimal policy of a POMDP. Under a set of assumptions that allow the clusters to be discovered in O(polylog(MSA))O(\operatorname{polylog}(MSA)) steps, L-UCRL is able to obtain an overall O(MK)O(\sqrt{MK}) regret with respect to a POMDP planning oracle, which aims to learn a policy that maximizes the expected single-task return when a task is randomly drawn from a known distribution of tasks. In contrast, our work adopts a stronger notion of regret that encourages the learner to maximize its expected return for a sequence of tasks chosen by an adversary. When the models are bandits instead of MDPs, Azar et al. (2013) use spectral learning to estimate the mean reward of the arms in all models and obtain an upper bound linear in KK.

Lifelong RL. Learning a sequence of related tasks is better studied in the lifelong learning literature. Recent works in lifelong RL (Abel et al., 2018; Lecarpentier et al., 2021) often focus on the setting where tasks are drawn from an unknown distribution of MDPs and there exists some similarity measure between MDPs that supports transfer learning. Our work instead focuses on learning the dissimilarity between tasks for the clustering phase and avoiding negative transfer.

Active model estimation. The exploration in AOMultiRL is modelled after the active model estimation problem (Tarbouriech et al., 2020), which is often studied in the PAC-RL setting. Several recent works on active model estimation are PAC-Explore (Guo and Brunskill, 2015), FW-MODEST (Tarbouriech et al., 2020), β\beta-curious walking (Sun and Huang, 2020), and GOSPRL (Tarbouriech et al., 2021).

The Θ(D~|Γα|N)\Theta(\tilde{D}|\Gamma^{\alpha}|N) bound on the horizon of clustering in Lemma 14 has the same O(S2A)O(S^{2}A) dependency on the number of states and actions as the state-of-the-art bound by GOSPRL (Tarbouriech et al., 2021) for the active model estimation problem. The main drawback is that H0H_{0} depends linearly on the hitting time D~\tilde{D} and not the diameter DD of the MDPs. As the hitting time is often strictly larger than the diameter (Jaksch et al., 2010; Tarbouriech et al., 2021), this dependency on D~\tilde{D} is sub-optimal. On the other hand, AOMultiRL is substantially less computationally expensive than GOSPRL since there is no shortest-path policy computation involved.

Appendix B The generality of λ\lambda-separability notion

In this section, we show that the general separation notion in Definition 1 defines a broader class of online multi-task RL problems that extends the entropy-based separation assumption in the latent MDPs setting (Kwon et al., 2021). We start by restating the entropy-based separation condition of Kwon et al. (2021):

Definition 21

Let Π\Pi denote the class of all history-dependent and possibly non-Markovian policies, and let τ(m,π)\tau\sim(m,\pi) be a trajectory of length HH sampled from MDP mm by a policy πΠ\pi\in\Pi. The set \mathcal{M} is well-separated if the following condition holds:

m,m,mm,πΠ,Prτ(m,π)(Prm,π(τ)Prm,π(τ)>(ϵp/M)c1)<(ϵp/M)c2,\displaystyle\forall m,m^{\prime}\in\mathcal{M},m^{\prime}\neq m,\pi\in\Pi,\Pr_{\tau\sim(m,\pi)}\left(\frac{\Pr_{m^{\prime},\pi}(\tau)}{\Pr_{m,\pi}(\tau)}>(\epsilon_{p}/M)^{c_{1}}\right)<(\epsilon_{p}/M)^{c_{2}}, (3)

where ϵp(0,1)\epsilon_{p}\in(0,1) is a target failure probability, c14,c24c_{1}\geq 4,c_{2}\geq 4 are universal constants and Prm,π(τ)\Pr_{m,\pi}(\tau) is the probability that τ\tau is realized when running policy π\pi on model mm.

Figure 3: An instance of λ\lambda-separable LMDPs where Definition 21 does not apply. The two models share all transitions except those out of (s1,a1)(s^{1},a^{1}), which lead to s2s^{2} and s3s^{3} with probabilities λ\lambda and 1λ1-\lambda in m1m_{1}, versus λ/2\lambda/2 and 1λ/21-\lambda/2 in m2m_{2}.

The following lemma constructs a set \mathcal{M} of just two models that satisfy the λ\lambda-separability condition but not the entropy-based separation condition.

Lemma 22

Given any λ(0,1),ϵp(0,1),H>0\lambda\in(0,1),\epsilon_{p}\in(0,1),H>0 and any constants c1,c24c_{1},c_{2}\geq 4, there exists a set of MDPs ={m1,m2}\mathcal{M}=\{m_{1},m_{2}\} with horizon HH that is λ\lambda-separable but is not well-separated in the sense of Definition 21.

Proof  Consider the set \mathcal{M} with M=2,𝒮={s1,s2,s3},𝒜={a1,a2}M=2,\mathcal{S}=\{s^{1},s^{2},s^{3}\},\mathcal{A}=\{a^{1},a^{2}\} in Figure 3. Both m1m_{1} and m2m_{2} have the same transition functions in all state-action pairs except for (s1,a1)(s^{1},a^{1}):

1(s2s1,a1)=λ\displaystyle\mathbb{P}_{1}(s^{2}\mid s^{1},a^{1})=\lambda
1(s3s1,a1)=1λ\displaystyle\mathbb{P}_{1}(s^{3}\mid s^{1},a^{1})=1-\lambda
2(s2s1,a1)=λ/2\displaystyle\mathbb{P}_{2}(s^{2}\mid s^{1},a^{1})=\lambda/2
2(s3s1,a1)=1λ/2.\displaystyle\mathbb{P}_{2}(s^{3}\mid s^{1},a^{1})=1-\lambda/2.

It follows that the 1\ell_{1} distance between P1(s1,a1)P_{1}(s^{1},a^{1}) and P2(s1,a1)P_{2}(s^{1},a^{1}) is

P1(s1,a1)P2(s1,a1)\displaystyle\left\lVert P_{1}(s^{1},a^{1})-P_{2}(s^{1},a^{1})\right\rVert =P1(s2s1,a1)P2(s2s1,a1)+P1(s3s1,a1)P2(s3s1,a1)\displaystyle=\left\lVert P_{1}(s^{2}\mid s^{1},a^{1})-P_{2}(s^{2}\mid s^{1},a^{1})\right\rVert+\left\lVert P_{1}(s^{3}\mid s^{1},a^{1})-P_{2}(s^{3}\mid s^{1},a^{1})\right\rVert
=2(λλ/2)\displaystyle=2(\lambda-\lambda/2)
=λ.\displaystyle=\lambda.

As a result, this set \mathcal{M} is λ\lambda-separable. However, any deterministic policy that takes action a2a_{2} in s1s_{1} and an arbitrary action in s2s_{2} and s3s_{3} induces the same Markov chain on the two MDP models. Thus, the entropy-based separation definition does not apply. An example of such a policy is shown below.

Consider running the following deterministic policy on model m1m_{1}:

π(s1)\displaystyle\pi(s^{1}) =a2\displaystyle=a^{2}
π(s2)\displaystyle\pi(s^{2}) =a1\displaystyle=a^{1}
π(s3)\displaystyle\pi(s^{3}) =a1.\displaystyle=a^{1}.

Consider an arbitrary trajectory τ\tau. The probability that this trajectory is realized with respect to both models is

Prm1,π(τ)\displaystyle\Pr_{m_{1},\pi}(\tau) =t=1HP1(st+1st,at)\displaystyle=\prod_{t=1}^{H}P_{1}(s_{t+1}\mid s_{t},a_{t}) (4)
=t=1HP1(st+1st,π(st))\displaystyle=\prod_{t=1}^{H}P_{1}(s_{t+1}\mid s_{t},\pi(s_{t})) (5)
=t=1HP2(st+1st,at)since (st,π(st))(s1,a1)\displaystyle=\prod_{t=1}^{H}P_{2}(s_{t+1}\mid s_{t},a_{t})\qquad\text{since }(s_{t},\pi(s_{t}))\neq(s^{1},a^{1}) (6)
=Prm2,π(τ).\displaystyle=\Pr_{m_{2},\pi}(\tau). (7)

As a result, for all τ\tau,

Prm2,π(τ)Prm1,π(τ)=1,\displaystyle\frac{\Pr_{m_{2},\pi}(\tau)}{\Pr_{m_{1},\pi}(\tau)}=1, (8)

which implies that

Prτm1,π(Prm2,π(τ)Prm1,π(τ)>(ϵp/M)c1)=Prτm1,π(1>(ϵp/M)c1)=1,\displaystyle\Pr_{\tau\sim m_{1},\pi}\left(\frac{\Pr_{m_{2},\pi}(\tau)}{\Pr_{m_{1},\pi}(\tau)}>(\epsilon_{p}/M)^{c_{1}}\right)=\Pr_{\tau\sim m_{1},\pi}(1>(\epsilon_{p}/M)^{c_{1}})=1, (9)

which is larger than (ϵp/M)c2\left(\epsilon_{p}/M\right)^{c_{2}}.
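To make the argument concrete, the following short Python sketch (with an illustrative value of λ\lambda chosen by us) checks both claims numerically: the 1\ell_{1} distance at (s1,a1)(s^{1},a^{1}) equals λ\lambda, and the policy π\pi above induces identical Markov chains on the two models.

```python
import numpy as np

# Numerical check of Lemma 22 (illustrative lambda; states 0,1,2 stand for s^1,s^2,s^3
# and actions 0,1 stand for a^1,a^2).
lam = 0.3
S, A = 3, 2

# Both models share all transitions except those out of (s^1, a^1); the shared
# transitions chosen here (return to s^1 with probability 1) are arbitrary.
P1 = np.zeros((S, A, S))
P2 = np.zeros((S, A, S))
P1[:, :, 0] = 1.0
P2[:, :, 0] = 1.0
P1[0, 0] = [0.0, lam, 1.0 - lam]            # model m_1 at (s^1, a^1)
P2[0, 0] = [0.0, lam / 2, 1.0 - lam / 2]    # model m_2 at (s^1, a^1)

print(np.abs(P1[0, 0] - P2[0, 0]).sum())    # ell_1 distance = lambda, so M is lambda-separable

# The policy pi from the proof never takes a^1 in s^1, so both models induce the same chain.
pi = [1, 0, 0]
chain1 = np.stack([P1[s, pi[s]] for s in range(S)])
chain2 = np.stack([P2[s, pi[s]] for s in range(S)])
print(np.array_equal(chain1, chain2))       # True: all trajectory probabilities coincide
```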

Appendix C Proofs of the lower bounds

Figure 4: A non-communicating 2-JAO MDP. There are no rewards at states 0 and 22, while state 11 has reward +1+1. We set Δ=Θ(SAHD)\Delta=\Theta(\sqrt{\frac{SA}{HD}}). The dashed arrows indicate the unique actions with highest transition probabilities on the left and right parts of the MDP. No actions take state 0 to state 22, making this MDP non-communicating.

See 5

Proof  We construct \mathcal{M} in the following way: each MDP in \mathcal{M} is a JAO MDP (Jaksch et al., 2010) of two states and SASA actions and diameter D=D/4D^{\prime}=D/4. The translation from this JAO MDP to an MDP with SS states, AA actions and diameter DD is straightforward (Jaksch et al., 2010). State 11 has reward +1+1 while state 0 has no reward. In state 0, for all actions the probability of transitioning to state 11 is δ\delta except for one best action where this probability is δ+λ/2\delta+\lambda/2. Every MDP in \mathcal{M} has a unique best action: for i=1,,SAi=1,\dots,SA, the ithi^{\text{th}} action is the best action in the MDP mim_{i}. The starting state is always s1=0s_{1}=0.

We consider a learner who knows all the parameters of the models in \mathcal{M}, except the identity of the task mkm^{k} given in episode kk. We employ the following information-theoretic argument from Mao et al. (2021): when the task mkm^{k} in episode kk is chosen uniformly at random from \mathcal{M}, no useful information from the previous episodes can help the learner identify the best action in mkm^{k}. This is true since all the information in the previous episodes consists of samples from the MDPs in \mathcal{M}, which provide no further information than the parameters of the models in \mathcal{M}. Since the set \mathcal{M} contains SASA models, all actions (from state 0) are equally likely to be the best action in mkm^{k}. Therefore, the learner is forced to learn mkm^{k} from scratch. It follows that the total regret of the learner is the sum of the one-episode-learning regrets over all episodes:

Regret(K)=k=1KRk,\displaystyle\mathrm{Regret}(K)=\sum_{k=1}^{K}R^{k},

where Rk=V1(s1)V1πk(s1)R^{k}=V_{1}^{*}(s_{1})-V_{1}^{\pi_{k}}(s_{1}) is the one-episode-learning regret in episode kk. One-episode learning is equivalent to learning in the undiscounted setting with horizon HH. Applying the lower bound for the undiscounted setting in Jaksch et al. (2010, Theorem 5) yields that for all πk\pi_{k},

ρH𝔼mkV1πk(s1)Ω(DSAH),\displaystyle\rho^{*}H-\mathbb{E}_{m^{k}\sim\mathcal{M}}V_{1}^{\pi_{k}}(s_{1})\geq\Omega(\sqrt{DSAH}),

where ρ=δ+λ/22δ+λ/2\rho^{*}=\frac{\delta+\lambda/2}{2\delta+\lambda/2} is the average reward of the optimal policy (Jaksch et al., 2010). Note that since only state 11 has reward +1+1, ρ\rho^{*} is also the stationary probability that the optimal learner is at state 11.

Next, we show that for all H2H\geq 2 and mkm^{k}\in\mathcal{M}, it holds that |V1ρH|D2|V_{1}^{*}-\rho^{*}H|\leq\frac{D}{2}. The optimal policy on all mkm^{k} induces a Markov chain between two states with transition matrix

[1δλ/2δ+λ/2δ1δ].\displaystyle\begin{bmatrix}1-\delta-\lambda/2&\delta+\lambda/2\\ \delta&1-\delta\end{bmatrix}.

Let mk(st=1s1=0)\mathbb{P}_{m^{k}}(s_{t}=1\mid s_{1}=0) be the probability that the Markov chain is in state 11 after tt time steps with the initial state s1=0s_{1}=0. Let Δt=mk(st=1s1=0)ρ\Delta_{t}=\mathbb{P}_{m^{k}}(s_{t}=1\mid s_{1}=0)-\rho^{*}. Obviously, Δ1=ρ\Delta_{1}=-\rho^{*}. By Levin et al. (2008, Equation 1.8), we have Δt=(12δλ/2)t1Δ1\Delta_{t}=(1-2\delta-\lambda/2)^{t-1}\Delta_{1}. It follows that, for the optimal policy,

V1(s1)\displaystyle V_{1}^{*}(s_{1}) =t=1Hmk(st=1s1=0)\displaystyle=\sum_{t=1}^{H}\mathbb{P}_{m^{k}}(s_{t}=1\mid s_{1}=0) (10)
=t=1H(Δt+ρ)\displaystyle=\sum_{t=1}^{H}(\Delta_{t}+\rho^{*}) (11)
=ρH+t=1HΔt\displaystyle=\rho^{*}H+\sum_{t=1}^{H}\Delta_{t} (12)
=ρH+t=1H(12δλ/2)t1Δ1\displaystyle=\rho^{*}H+\sum_{t=1}^{H}(1-2\delta-\lambda/2)^{t-1}\Delta_{1} (13)
=ρH+Δ11(12δλ/2)H2δ+λ/2.\displaystyle=\rho^{*}H+\Delta_{1}\frac{1-(1-2\delta-\lambda/2)^{H}}{2\delta+\lambda/2}. (14)

Hence,

|V1(s1)ρH|\displaystyle|V_{1}^{*}(s_{1})-\rho^{*}H| =|Δ11(12δλ/2)H2δ+λ/2|\displaystyle=\left|\Delta_{1}\frac{1-(1-2\delta-\lambda/2)^{H}}{2\delta+\lambda/2}\right| (15)
|Δ12δ+λ/2|\displaystyle\leq\left|\frac{\Delta_{1}}{2\delta+\lambda/2}\right| (16)
=ρ2δ+λ/2\displaystyle=\frac{\rho^{*}}{2\delta+\lambda/2} (17)
12δ+λ/2\displaystyle\leq\frac{1}{2\delta+\lambda/2} (18)
12δ\displaystyle\leq\frac{1}{2\delta} (19)
D2,\displaystyle\leq\frac{D}{2}, (20)

where the last inequality follows from δ=4D\delta=\frac{4}{D}, which gives 12δ=D8D2\frac{1}{2\delta}=\frac{D}{8}\leq\frac{D}{2}.
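The bound |V1(s1)ρH|D2|V_{1}^{*}(s_{1})-\rho^{*}H|\leq\frac{D}{2} can also be checked numerically. The sketch below (with illustrative values of DD, λ\lambda and HH chosen by us, and δ=4/D\delta=4/D as in the construction) iterates the two-state chain induced by the optimal policy and compares V1(s1)V_{1}^{*}(s_{1}) with ρH\rho^{*}H.

```python
import numpy as np

# Numerical check of Equations (10)-(20): |V_1^*(s_1) - rho^* H| <= D/2 for the
# two-state chain induced by the optimal policy (illustrative D, lambda, H; delta = 4/D).
D, lam, H = 40.0, 0.1, 1000
delta = 4.0 / D
rho_star = (delta + lam / 2) / (2 * delta + lam / 2)

p_state1 = 0.0      # P[s_t = 1 | s_1 = 0], starting from state 0
V1_star = 0.0       # expected number of visits to the rewarding state 1 over H steps
for _ in range(H):
    V1_star += p_state1
    p_state1 = (1 - p_state1) * (delta + lam / 2) + p_state1 * (1 - delta)

print(abs(V1_star - rho_star * H), "<=", D / 2)
```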

For any HDSAH\geq DSA and S,A2S,A\geq 2, we have HDSADSA4D\sqrt{HDSA}\geq DSA\geq 4D, and hence HDSAD2HDSA2\sqrt{HDSA}-\frac{D}{2}\geq\frac{\sqrt{HDSA}}{2}. We conclude that

𝔼[Regret(K)]\displaystyle\mathbb{E}[\mathrm{Regret}(K)] =k=1K𝔼[Rk]\displaystyle=\sum_{k=1}^{K}\mathbb{E}[R^{k}]
=k=1K𝔼[V1V1πk](s1)\displaystyle=\sum_{k=1}^{K}\mathbb{E}[V^{*}_{1}-V_{1}^{\pi_{k}}](s_{1})
k=1K(ρHD2V1πk(s1))\displaystyle\geq\sum_{k=1}^{K}(\rho^{*}H-\frac{D}{2}-V_{1}^{\pi_{k}}(s_{1}))
=Ω(KDHSA).\displaystyle=\Omega(K\sqrt{DHSA}).

The upper bound of UCRL2 can be proved similarly: Theorem 2 in Jaksch et al. (2010) states that for any p(0,1)p\in(0,1), by running UCRL2 with failure parameter pp, we obtain that for any initial state s1s_{1} and any H>1H>1, with probability at least 1p1-p,

ρHh=1HrhO(DSAHlnHp).\displaystyle\rho^{*}H-\sum_{h=1}^{H}r_{h}\leq O\left(DS\sqrt{AH\ln{\frac{H}{p}}}\right). (21)

Setting p=1Hp=\frac{1}{H} and trivially bounding the regret in the failure case by HH, we obtain

ρHE[h=1Hrh]\displaystyle\rho^{*}H-E[\sum_{h=1}^{H}r_{h}] O(DSAHlnH2)+1H×H\displaystyle\leq O\left(DS\sqrt{AH\ln{H^{2}}}\right)+\frac{1}{H}\times H (22)
=O(DSAHlnH).\displaystyle=O\left(DS\sqrt{AH\ln{H}}\right). (23)

This bound holds across all episodes, hence the total regret bound with respect to ρH\rho^{*}H is O(KDSAHlnH)O\left(KDS\sqrt{AH\ln{H}}\right). Combining this with the fact that V1(s1)ρH+D2V_{1}^{*}(s_{1})\leq\rho^{*}H+\frac{D}{2}, we obtain the upper bound.

See 8

Before showing the proof of Lemma 8, we consider the following auxiliary problem: suppose we are given three constants δ,λ,ϵ(0,14]\delta,\lambda,\epsilon\in(0,\frac{1}{4}] and a set of 2Q2Q coins. The coins are arranged into a Q×2Q\times 2 table of QQ rows and 22 columns so that each cell contains exactly one coin. The rows are indexed from 11 to QQ and the columns are indexed from 11 to 22. In the first column, all coins are fair except for one coin at row θ\theta which is biased with probability of heads equal to 12+λ\frac{1}{2}+\lambda. In the second column, all coins have probability of heads equal to δ\delta except for the coin at row θ\theta which has probability of heads δ+ϵ\delta+\epsilon. In this setting, row θ\theta is a special row that contains the most biased coins in the two columns. The objective is to find this special row θ\theta after at most HH coin flips, where H>0H>0 is a constant representing a fixed budget. Note that if we ignore the second column, then this problem reduces to the well-known problem of identifying one biased coin in a collection of QQ coins (Tulsiani, 2014).

Let N1,N2N_{1},N_{2} be the number of flips an algorithm performs on the first and second column, respectively. For a fixed global budget HH, after τ=N1+N2H\tau=N_{1}+N_{2}\leq H coin flips, the algorithm recommends θ^\hat{\theta} as its prediction for θ\theta. Note that τ\tau is a random stopping time which can depend on any information the algorithm observes up to time τ\tau. Let XtX_{t} be the random variable for the outcome of the ttht^{\text{th}} flip, and X1τ=(X1,X2,,Xτ)X_{1}^{\tau}=(X_{1},X_{2},\dots,X_{\tau}) be the sequence of outcomes after τ\tau flips. For j[Q]j\in[Q], let j\mathbb{P}_{j} denote the probability measure induced by Alg\mathrm{Alg} corresponding to the case when θ=j\theta=j. We first show that if the algorithm fails to flip the coins sufficiently many times in both columns, then for some θ\theta the probability of failure is at least 12\frac{1}{2}.
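For intuition, here is a small simulation sketch of the auxiliary problem; the parameter values and the naive uniform-flipping strategy are our illustrations, not part of the lower-bound argument.

```python
import numpy as np

# Simulation sketch of the Q x 2 coin table (all values illustrative).
rng = np.random.default_rng(0)
Q, lam, delta, eps = 12, 0.25, 0.1, 0.05
theta = rng.integers(Q)                  # hidden row containing the most biased coins

heads_prob = np.full((Q, 2), 0.5)        # column 1: fair coins
heads_prob[:, 1] = delta                 # column 2: heads probability delta
heads_prob[theta, 0] = 0.5 + lam
heads_prob[theta, 1] = delta + eps

# A naive strategy (ours, for illustration): spend the whole budget H on column 1,
# splitting the flips uniformly over the rows, and report the row with the most heads.
H = 4000
flips_per_row = H // Q
heads = rng.binomial(flips_per_row, heads_prob[:, 0])
theta_hat = int(np.argmax(heads))
print(theta, theta_hat)
```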

Lemma 23

Let Q12,C1=40Q\geq 12,C_{1}=40 and C2=64C_{2}=64. For any algorithm Alg\mathrm{Alg}, if

N1T1:=Q4C1λ2andN2T2:=Q(δ+ϵ)4C2ϵ2,\displaystyle N_{1}\leq T_{1}:=\frac{Q}{4C_{1}\lambda^{2}}\quad\text{and}\quad N_{2}\leq T_{2}:=\frac{Q(\delta+\epsilon)}{4C_{2}\epsilon^{2}},

then there exists a set J[Q]J\subseteq[Q] with |J|Q6|J|\geq\frac{Q}{6} such that

jJ,j[θ^=j]12.\displaystyle\forall j\in J,~{}\mathbb{P}_{j}[\hat{\theta}=j]\leq\frac{1}{2}.

The proof uses a reasonably well-known reverse Pinsker inequality (Sason, 2015, Equation 10):

Let PP and QQ be probability measures over a common discrete set. Then

KL(PQ)4log2eminxQ(x)DTV(PQ)2.\displaystyle KL(P\;\|\;Q)\leq\frac{4\log_{2}e}{\min_{x}Q(x)}\cdot D_{TV}(P\operatorname*{\|}Q)^{2}. (24)

where DTVD_{TV} is the total variation distance. In the particular case where PP and QQ are Bernoulli distributions with success probabilities pp and q12q\leq\frac{1}{2} respectively, we get

KL(PQ)4log2eq(pq)2.\displaystyle KL(P\;\|\;Q)\leq\frac{4\log_{2}e}{q}\cdot(p-q)^{2}. (25)
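As a quick numerical sanity check of this Bernoulli case (with illustrative values of pp and qq chosen by us, and KL measured in bits to match the log2e\log_{2}e factor above):

```python
import math

# Sanity check of Equation (25) for Bernoulli distributions (illustrative p and q,
# KL measured in bits).
def kl_bernoulli(p, q):
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

p, q = 0.15, 0.10                        # q <= 1/2
lhs = kl_bernoulli(p, q)
rhs = 4 * math.log2(math.e) / q * (p - q) ** 2
print(lhs, "<=", rhs)
```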

Proof (of Lemma 23) As reasoned in the proof of the lower bound for multi-armed bandits (Auer et al., 2002), we can assume that Alg\mathrm{Alg} is deterministic (that is, deterministic conditional on the random history). Our proof closely follows the main steps in the proof of Tulsiani (2014) for the setting where there is only one column. We will lower bound the probability of mistake of Alg\mathrm{Alg} based on its behavior on a hypothetical instance where λ=ϵ=0\lambda=\epsilon=0.

To account for algorithms which do not exhaust both budgets T1T_{1} and T2T_{2}, we introduce two “dummy coins” by adding a zero’th row with two identical coins, solely for the analysis. These two coins have the same mean of 1 under all QQ models and hence flipping either of them provides no information. An algorithm which wishes to stop in a round τ<H\tau<H will simply flip any dummy coin in the remaining rounds τ+1,τ+2,,H\tau+1,\tau+2,\ldots,H. This way, we have the convenient option of always working with a sequence of outcomes X1HX_{1}^{H} in the analysis.

Let 0\mathbb{P}_{0} and 𝔼0\mathbb{E}_{0} denote the probability and expectation over X1HX_{1}^{H} taken on the hypothetical instance with λ=ϵ=0\lambda=\epsilon=0, respectively. Let at=(at,0,at,1){0,1,,Q}×{1,2}a_{t}=(a_{t,0},a_{t,1})\in\{0,1,\ldots,Q\}\times\{1,2\} be the coin that the algorithm flips in step tt. Let xt{0,1}x_{t}\in\{0,1\} denote the outcome of ata_{t} where 0 is tails and 11 is heads.

The number of flips of the coin in row ii, column kk is

Ni,k=t=1H𝕀{at=(i,k)}.\displaystyle N_{i,k}=\sum_{t=1}^{H}\mathbb{I}\{a_{t}=(i,k)\}.

By the earlier definition of NkN_{k} for k{1,2}k\in\{1,2\}, we have

N1=i=1QNi,1,\displaystyle N_{1}=\sum_{i=1}^{Q}N_{i,1},
N2=i=1QNi,2.\displaystyle N_{2}=\sum_{i=1}^{Q}N_{i,2}.

We define

J1:={i[Q]:(𝔼0[Ni,1]4T1Q)(𝔼0[Ni,2]4T2Q)}.\displaystyle J_{1}:=\left\{i\in[Q]\colon\left(\mathbb{E}_{0}[N_{i,1}]\leq\frac{4T_{1}}{Q}\right)\wedge\left(\mathbb{E}_{0}[N_{i,2}]\leq\frac{4T_{2}}{Q}\right)\right\}.

Clearly, at most Q4\frac{Q}{4} rows ii satisfy 𝔼0[Ni,1]>4T1Q\mathbb{E}_{0}[N_{i,1}]>\frac{4T_{1}}{Q} and, similarly, at most Q4\frac{Q}{4} rows ii satisfy 𝔼0[Ni,2]>4T2Q\mathbb{E}_{0}[N_{i,2}]>\frac{4T_{2}}{Q}. Therefore, |J1|Q2Q4=Q2|J_{1}|\geq Q-2\cdot\frac{Q}{4}=\frac{Q}{2}.

We also define

J2:={i[Q]:0(θ^=i)3Q}.\displaystyle J_{2}:=\left\{i\in[Q]\colon\mathbb{P}_{0}(\hat{\theta}=i)\leq\frac{3}{Q}\right\}.

As at most Q3\frac{Q}{3} arms ii can satisfy 0(θ^=i)>3Q\mathbb{P}_{0}(\hat{\theta}=i)>\frac{3}{Q}, it holds that |J2|2Q3|J_{2}|\geq\frac{2Q}{3}.

Consequently, defining J:=J1J2J:=J_{1}\cap J_{2}, we have |J|Q6|J|\geq\frac{Q}{6}.

For any jJj\in J, we have

|j[θ^=j]0[θ^=j]|\displaystyle\left|\mathbb{P}_{j}[\hat{\theta}=j]-\mathbb{P}_{0}[\hat{\theta}=j]\right| =|𝔼j[𝕀{θ^=j}]𝔼0[𝕀{θ^=j}]|\displaystyle=\left|\mathbb{E}_{j}[\mathbb{I}\{\hat{\theta}=j\}]-\mathbb{E}_{0}[\mathbb{I}\{\hat{\theta}=j\}]\right| (26)
120(X1H)j(X1H)1\displaystyle\leq\frac{1}{2}\left\lVert\mathbb{P}_{0}({X_{1}^{H}})-\mathbb{P}_{j}({X_{1}^{H}})\right\rVert_{1} (27)
122ln2KL(P0(X1H)j(X1H)),\displaystyle\leq\frac{1}{2}\sqrt{2\ln{2}KL(P_{0}({X_{1}^{H}})\;\|\;\mathbb{P}_{j}({X_{1}^{H}}))}, (28)

where the first inequality follows from Auer et al. (2002, Equation 28) since the final output θ^\hat{\theta} is a function of the outcomes X1HX_{1}^{H}, and the last inequality is Pinsker's inequality.

Since Alg\mathrm{Alg} is deterministic, the flip ata_{t} at step tt is fully determined given the previous outcomes x1t1x_{1}^{t-1}. Applying the chain rule for KL-divergences (Cover and Thomas, 2006, Theorem 2.5.3) we obtain

KL(P0(X1H)j(X1H))\displaystyle KL(P_{0}({X_{1}^{H}})\;\|\;\mathbb{P}_{j}({X_{1}^{H}})) =t=1Hx1t10[x1:t1]KL(0[xt]j[xt]x1t1).\displaystyle=\sum_{t=1}^{H}\sum_{x_{1}^{t-1}}\mathbb{P}_{0}[x_{1:t-1}]KL(\mathbb{P}_{0}[x_{t}]\;\|\;\mathbb{P}_{j}[x_{t}]\mid x_{1}^{t-1}).

Note that xtx_{t} is the result of a single coin flip. When at,0ja_{t,0}\neq j, the KL-divergence is zero since the two instances have identical coins in both columns. When at,0=ja_{t,0}=j, the KL-divergence is either B1=KL(1212+λ)B_{1}=KL(\frac{1}{2}\;\|\;\frac{1}{2}+\lambda) or B2=KL(δδ+ϵ)B_{2}=KL(\delta\;\|\;\delta+\epsilon), depending on whether at,1=1a_{t,1}=1 or at,1=2a_{t,1}=2, respectively. It follows that

KL(P0(X1H)j(X1H))\displaystyle KL(P_{0}({X_{1}^{H}})\;\|\;\mathbb{P}_{j}({X_{1}^{H}})) =t=1Hx1:t10[x1:t1](𝕀{at=(j,1)}B1+𝕀{at=(j,2)}B2)\displaystyle=\sum_{t=1}^{{H}}\sum_{x_{1:t-1}}\mathbb{P}_{0}[x_{1:t-1}]\left(\mathbb{I}\{a_{t}=(j,1)\}B_{1}+\mathbb{I}\{a_{t}=(j,2)\}B_{2}\right)
=𝔼0[Nj,1]B1+𝔼0[Nj,2]B2\displaystyle=\mathbb{E}_{0}[N_{j,1}]B_{1}+\mathbb{E}_{0}[N_{j,2}]B_{2}
4T1QB1+4T2QB2\displaystyle\leq\frac{4T_{1}}{Q}B_{1}+\frac{4T_{2}}{Q}B_{2}
B1C1λ2+(δ+ϵ)B2C2ϵ2\displaystyle\leq\frac{B_{1}}{C_{1}\lambda^{2}}+\frac{(\delta+\epsilon)B_{2}}{C_{2}\epsilon^{2}}

Since λ14\lambda\leq\frac{1}{4} and δ+ϵ12\delta+\epsilon\leq\frac{1}{2}, we can bound B15λ22ln2B_{1}\leq\frac{5\lambda^{2}}{2\ln{2}} (Tulsiani, 2014) and B24log2(e)ϵ2δ+ϵB_{2}\leq\frac{4\log_{2}(e)\epsilon^{2}}{\delta+\epsilon}. Consequently,

KL(P0(X1H)j(X1H))\displaystyle KL(P_{0}({X_{1}^{H}})\;\|\;\mathbb{P}_{j}({X_{1}^{H}})) 5(2ln2)C1+4log2(e)C2\displaystyle\leq\frac{5}{(2\ln{2})C_{1}}+\frac{4\log_{2}(e)}{C_{2}}

Plugging this into Equation 28 and applying Q12Q\geq 12, we obtain

j[θ^=j]\displaystyle\mathbb{P}_{j}[\hat{\theta}=j] 0[θ^=j]+122ln2(5(2ln2)C1+4log2(e)C2)\displaystyle\leq\mathbb{P}_{0}[\hat{\theta}=j]+\frac{1}{2}\sqrt{2\ln{2}\left(\frac{5}{(2\ln 2)C_{1}}+\frac{4\log_{2}(e)}{C_{2}}\right)}
3Q+125C1+8C2\displaystyle\leq\frac{3}{Q}+\frac{1}{2}\sqrt{\frac{5}{C_{1}}+\frac{8}{C_{2}}}
312+12540+864\displaystyle\leq\frac{3}{12}+\frac{1}{2}\sqrt{\frac{5}{40}+\frac{8}{64}}
=12.\displaystyle=\frac{1}{2}.

The next result shows that if ϵ\epsilon is sufficiently small, then any algorithm has to flip the coins in the first column sufficiently many times; otherwise the probability of failure is at least 12\frac{1}{2}.

Corollary 24

Let Q,C1Q,C_{1} and C2C_{2} be the constants defined in Lemma 23. Let H>0H>0 be the budget for the number of flips on both columns. If ϵ=120QδH\epsilon=\frac{1}{20}\sqrt{\frac{Q\delta}{H}}, then for any algorithm Alg\mathrm{Alg}, if

N1Q4C1λ2,\displaystyle N_{1}\leq\frac{Q}{4C_{1}\lambda^{2}},

then there exists a set J[Q]J\subseteq[Q] with |J|Q6|J|\geq\frac{Q}{6} such that

jJ,j[θ^=j]12.\displaystyle\forall j\in J,\mathbb{P}_{j}[\hat{\theta}=j]\leq\frac{1}{2}.

Proof  We will show that when ϵ=120QδH\epsilon=\frac{1}{20}\sqrt{\frac{Q\delta}{H}}, the inequality N2T2=Q(δ+ϵ)4C2ϵ2N_{2}\leq T_{2}=\frac{Q(\delta+\epsilon)}{4C_{2}\epsilon^{2}} holds trivially for any N2HN_{2}\leq H (recall that HH is the fixed budget for the total number of coin flips). The result then follows directly from Lemma 23. We have

T2=Q(δ+ϵ)4C2ϵ2\displaystyle{T_{2}=}\frac{Q(\delta+\epsilon)}{4C_{2}\epsilon^{2}} Qδ4C2ϵ2\displaystyle\geq\frac{Q\delta}{4C_{2}\epsilon^{2}}
=Qδ256ϵ2 since C2=64\displaystyle=\frac{Q\delta}{256\epsilon^{2}}\qquad\text{ since }C_{2}=64
=400256H\displaystyle=\frac{400}{256}H
>H\displaystyle>H
N2,\displaystyle\geq{N_{2}},

which implies that N2T2N_{2}\leq T_{2} always holds for any N2HN_{2}\leq H.

We are now ready to prove Lemma 8.

Proof  We construct \mathcal{M} as the set of SA12\frac{SA}{12} 2-JAO MDPs in Figure 1 (right). Each MDP has a left part and a right part, where each part is a JAO MDP. The left part of the MDP mim_{i} consists of two states {0,2}\{0,2\} and SA12\frac{SA}{12} actions numbered from 11 to SA12\frac{SA}{12}, where all actions from state 0 transition to state 22 with probability 12\frac{1}{2} or stay at state 0 with probability 12\frac{1}{2}, except for the ithi^{\text{th}} action, which transitions to state 22 with probability 12+λ2\frac{1}{2}+\frac{\lambda}{2} and stays at state 0 with probability 12λ2\frac{1}{2}-\frac{\lambda}{2}. The right part of the ithi^{\text{th}} MDP consists of two states {0,1}\{0,1\} and also SA12\frac{SA}{12} actions numbered from 11 to SA12\frac{SA}{12}, where all actions from state 0 transition to state 11 with probability δ=4D14\delta=\frac{4}{D}\leq\frac{1}{4} or stay at state 0 with probability 1δ1-\delta, except for the ithi^{\text{th}} action, which transitions to state 11 with probability δ+Δ\delta+\Delta and stays at state 0 with probability 1δΔ1-\delta-\Delta. We set Δ=120(SA3HD)\Delta=\frac{1}{20}\left(\sqrt{\frac{SA}{3HD}}\right). We will show the conversion from these 2-JAO MDPs to MDPs with SS states and AA actions later.

Since each model in \mathcal{M} has a distinct index such that the corresponding actions on both parts transition from state 0 to states 11 and 22 with higher probability than any other action, identifying a model in \mathcal{M} is equivalent to identifying this distinct action index. Each action on both parts can be seen as a (possibly biased) coin, where the probability of getting tails equals the probability of ending up in state 0 when the action is taken. Thus, the problem of identifying this distinct action index reduces to the above auxiliary problem of identifying the row of the most biased coins, where taking an action from state 0 is equivalent to flipping a coin, Q=SA1212,ϵ=ΔQ=\frac{SA}{12}\geq 12,\epsilon=\Delta and λ\lambda is replaced by λ/2\lambda/2. Corollary 24 states that for every algorithm, if the number of coin flips on the first column is less than SA480λ2\frac{SA}{480\lambda^{2}}, then there exists a set of at least SA72\frac{SA}{72} candidate rows for which the algorithm fails to identify the most biased row with probability at least 12\frac{1}{2}. Correspondingly, for any model classification algorithm, if the number of state-transition samples from state 0 towards state 22 (i.e., the first column) is less than SA480λ2\frac{SA}{480\lambda^{2}}, then the algorithm fails to identify the model for at least SA72\frac{SA}{72} MDPs in \mathcal{M}.

Finally, we show the conversion from the 2-JAO MDP to an MDP with SS states and AA actions. The conversion is almost identical to that of Jaksch et al. (2010), which starts with an atomic 2-JAO MDP of three states and A=A2A^{\prime}=\frac{A}{2} actions and builds an AA^{\prime}-ary tree from there. Assuming AA^{\prime} is an even positive number, each part of the atomic 2-JAO MDP has A2\frac{A^{\prime}}{2} actions. We make S3\frac{S}{3} copies of these atomic 2-JAO MDPs, where only one of them has the best action on the right part. Arranging S3\frac{S}{3} copies of these atomic 2-JAO MDPs and connecting their states 0 by AAA-A^{\prime} connections, we obtain an AA^{\prime}-ary tree which represents a composite MDP with at most SS states, AA actions and diameter DD. The transitions of the AAA-A^{\prime} actions on the tree are defined identically to that of Jaksch et al. (2010): self-loops for states 11 and 22, deterministic connections to the state 0 of other nodes on the tree for state 0. By having δ=4D\delta=\frac{4}{D} in each atomic 2-JAO MDP, the diameter of this composite MDP is at most 2δ+logAS3D\frac{2}{\delta}+\log_{A^{\prime}}{\frac{S}{3}}\leq D. This composite MDP is harder to explore and learn than the 2-JAO MDP with three states and SA6\frac{SA}{6} actions, and hence all the lower bound results apply.
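The following minimal sketch (with an assumed, illustrative parameterization) builds the transition probabilities of a single 2-JAO MDP mim_{i} before the tree conversion; only the probabilities of leaving state 0 are stored, since the remaining probability mass stays at state 0.

```python
import numpy as np

# Sketch (assumed parameterization) of a single 2-JAO MDP m_i before the tree
# conversion: states {0, 1, 2}, n_act actions on each of the left and right parts.
def two_jao_mdp(i, n_act, lam, delta, Delta):
    # Probability of leaving state 0 for each action of each part; the remaining
    # probability mass stays at state 0.
    left = np.full(n_act, 0.5)           # 0 -> 2 transitions (left part)
    right = np.full(n_act, delta)        # 0 -> 1 transitions (right part)
    left[i] += lam / 2                   # the distinguished action of model m_i
    right[i] += Delta
    # From states 1 and 2, every action returns to state 0 with probability delta.
    return {"left": left, "right": right, "back_to_0": delta}

mdp = two_jao_mdp(i=3, n_act=12, lam=0.2, delta=0.1, Delta=0.02)
print(mdp["left"][3], mdp["right"][3])   # 0.6 and 0.12
```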

See 9

Proof  As argued in Section 3, we can transfer the sample complexity lower bound for the classification problem to the clustering algorithm. Using the same set \mathcal{M} of 2-JAO MDPs constructed in the proof of Lemma 8, for any given MDP in \mathcal{M}, any PAC classification learner has to take at least Z=Ω(SAλ2)Z=\Omega(\frac{SA}{\lambda^{2}}) actions from state 0 towards state 22. If the learner stays at state 0, then it can take the next action from 0 to 22 in the next time step. However, if the learner transitions to state 22, then it has to wait until it gets back to state 0 to take the next action. Let Z2Z_{2} denote the number of times the learner ends up in state 22 after taking ZZ actions on the left part from state 0. Since every action from 0 to 22 has probability at least 12\frac{1}{2} of ending up in state 22, we have

𝔼[Z2Z]Z2.\displaystyle\mathbb{E}[Z_{2}\mid Z]\geq\frac{Z}{2}. (29)

Since every action from state 22 transitions to state 0 with the same probability δ=Θ(1D)\delta=\Theta(\frac{1}{D}), every time the learner is in state 22, the expected number of time steps it needs to get back to state 0 is Θ(1δ)=Θ(D)\Theta(\frac{1}{\delta})=\Theta(D). Hence, the expected number of time steps the learner needs to get back to state 0 after Z2Z_{2} visits to state 22 is Θ(Z2D)\Theta(Z_{2}D). We conclude that for any PAC learner, the expected number of exploration steps needed to identify the model with probability of success at least 12\frac{1}{2} is at least

𝔼[Z+Z2D]Ω(ZD)=Ω(DSAλ2).\displaystyle\mathbb{E}[Z+Z_{2}D]\geq\Omega(ZD)=\Omega\left(\frac{DSA}{\lambda^{2}}\right). (30)

Next, we lower bound the expected regret of the same algorithm. Let H0H_{0} be the number of time steps the algorithm spends on the left part and H1H_{1} on the right part of each model in \mathcal{M}. Note that H0H_{0} and H1H_{1} are random variables. Recall that the right part of each MDP in \mathcal{M} resembles the JAO MDP in the minimax lower bound proof in Lemma 5, hence we can apply the regret formula of the JAO MDP to the 2-JAO MDP and obtain that the regret in each episode is of the same order as

Regret\displaystyle\mathrm{Regret} =ρH𝔼[h=1Hr(sh,ah)]\displaystyle=\rho^{*}H-\mathbb{E}[\sum_{h=1}^{H}r(s_{h},a_{h})] (31)
=ρE[H0+H1]𝔼[𝔼[h=1H0r(sh,ah)]+𝔼[h=H0+1Hr(sh,ah)]H0,H1]\displaystyle=\rho^{*}E[H_{0}+H_{1}]-\mathbb{E}[\mathbb{E}[\sum_{h=1}^{H_{0}}r(s_{h},a_{h})]+\mathbb{E}[\sum_{h=H_{0}+1}^{H}r(s_{h},a_{h})]\mid H_{0},H_{1}] (32)
=ρE[H0+H1]𝔼[𝔼[h=H0+1Hr(sh,ah)H0,H1]]\displaystyle=\rho^{*}E[H_{0}+H_{1}]-\mathbb{E}[\mathbb{E}[\sum_{h=H_{0}+1}^{H}r(s_{h},a_{h})\mid H_{0},H_{1}]] (33)
=ρE[H0]+E[(ρH1𝔼[h=H0+1Hr(sh,ah)])H1]\displaystyle=\rho^{*}E[H_{0}]+E\left[\left(\rho^{*}H_{1}-\mathbb{E}[\sum_{h=H_{0}+1}^{H}r(s_{h},a_{h})]\right)\mid H_{1}\right] (34)
Ω(ρE[H0])D2\displaystyle\geq\Omega\left(\rho^{*}E[H_{0}]\right)-\frac{D}{2} (35)
=Ω(DSAλ2),\displaystyle=\Omega\left(\frac{DSA}{\lambda^{2}}\right), (36)

where

  • the second equality follows from H=H0+H1H=H_{0}+H_{1},

  • the third equality follows from the fact that the H0H_{0} time steps spent on the left part of the MDP returns no rewards,

  • the fourth equality follows from the linearity of expectation,

  • the inequality follows from H1=HH0H_{1}=H-H_{0} and equation 20,

  • the last equality follows from ρ=δ+Δ2δ+Δ12\rho^{*}=\frac{\delta+\Delta}{2\delta+\Delta}\geq\frac{1}{2} for all δ,Δ>0\delta,\Delta>0 and E[H0]Ω(DSAλ2)E[H_{0}]\geq\Omega\left(\frac{DSA}{\lambda^{2}}\right).

We conclude that the expected regret over KK episodes is at least

Ω(𝔼[KH0])=Ω(KDSAλ2).\displaystyle\Omega(\mathbb{E}[KH_{0}])=\Omega\left(\frac{KDSA}{\lambda^{2}}\right).

Appendix D Proofs of the upper bounds

First, we state the following concentration inequality for vector-valued random variables by Weissman et al. (2003).

Lemma 25 (Weissman et al. (2003))

Let PP be a probability distribution on the set 𝒮={1,,S}\mathcal{S}=\{1,\dots,S\}. Let 𝒳N\mathcal{X}^{N} be a set of NN i.i.d. samples drawn from PP. Then, for all ϵ>0\epsilon>0:

Pr(PP^𝒳Nϵ)(2S2)eNϵ2/2.\Pr(\left\lVert P-\hat{P}_{\mathcal{X}^{N}}\right\rVert\geq\epsilon)\leq(2^{S}-2)e^{-N\epsilon^{2}/2}.

Using Lemma 25, we can show that N=O(Sλ2)N=O(\frac{S}{\lambda^{2}}) samples are sufficient for each (s,a)Γ(s,a)\in\Gamma so that with high probability, the empirical means of the transition function P^(s,a)\hat{P}_{\mathcal{B}}(\cdot\mid s,a) are within λ/8\lambda/8 of their true values, measured in 1\ell_{1} distance.

Corollary 26

Let p1(0,1)p_{1}\in(0,1). If a state-action pair (s,a)(s,a) is visited at least

N=256λ2max{S,ln(1/p1)}\displaystyle N=\frac{256}{\lambda^{2}}\max\{S,\ln(1/p_{1})\} (37)

times, then with probability at least 1p11-p_{1},

P(s,a)P^𝒳N(s,a)λ/8.\displaystyle\left\lVert P(s,a)-\hat{P}_{\mathcal{X}^{N}}(s,a)\right\rVert\leq\lambda/8.

Proof  We simplify the bound in Lemma 25 as follows:

Pr(PP^𝒳Nϵ)(2S2)eNϵ2/2eSNϵ2/2\Pr(\left\lVert P-\hat{P}_{\mathcal{X}^{N}}\right\rVert\geq\epsilon)\leq(2^{S}-2)e^{-N\epsilon^{2}/2}\leq e^{S-N\epsilon^{2}/2}

Next, we substitute ϵ=λ/8\epsilon=\lambda/8 into the right hand side and solve the following inequality for NN:

eSNλ2/128p1\begin{split}e^{S-N\lambda^{2}/128}\leq p_{1}\end{split}

to obtain N128λ2(S+ln(1/p1))N\geq\frac{128}{\lambda^{2}}(S+\ln(1/p_{1})). Thus N=256λ2max{S,ln(1/p1)}N=\frac{256}{\lambda^{2}}\max\{S,\ln(1/p_{1})\} satisfies this condition.
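For a sense of the magnitude of NN, the sketch below (with illustrative SS, λ\lambda and p1p_{1} chosen by us) computes NN from Equation 37 and empirically checks the 1\ell_{1} deviation for a randomly drawn next-state distribution.

```python
import numpy as np

# Sketch: the sample size N of Corollary 26 and an empirical check of the ell_1
# deviation (illustrative S, lambda, p_1).
rng = np.random.default_rng(1)
S, lam, p1 = 6, 0.2, 0.05
N = int(np.ceil(256 / lam**2 * max(S, np.log(1 / p1))))

P = rng.dirichlet(np.ones(S))                       # an arbitrary next-state distribution
samples = rng.choice(S, size=N, p=P)
P_hat = np.bincount(samples, minlength=S) / N
print(N, np.abs(P - P_hat).sum(), "<=", lam / 8)    # holds with probability >= 1 - p_1
```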

Taking a union bound of the result in Corollary 26 over all state-action pairs in the set Γ\Gamma and all episodes from 11 to KK, we obtain Lemma 13.

Next, we show the proof of Lemma 14. The proof strategy is similar to that of Auer and Ortner (2007); Sun and Huang (2020).

See 14

Proof  The history-dependent exploration policy in Algorithm 2 visits an under-sampled state-action pair in Γα\Gamma^{\alpha} whenever possible; otherwise it starts a sequence of steps that would lead to such a state-action pair. In the latter case, denote the current state of the learner by ss^{\prime} and the number of steps needed to travel from ss^{\prime} to an under-sampled state ss by T(s,s)T(s^{\prime},s). By Assumption 4 and Markov's inequality, we have

Pr(T(s,s)>2D~)E[T(s,s)]2D~D~2D~=12.\Pr(T(s^{\prime},s)>2\tilde{D})\leq\frac{E[T(s^{\prime},s)]}{2\tilde{D}}\leq\frac{\tilde{D}}{2\tilde{D}}=\frac{1}{2}.

In other words, in every interval of 2D~2\tilde{D} time steps, the probability of visiting an under-sampled state-action pair in Γα\Gamma^{\alpha} is at least 1/21/2. Over nn such intervals, the expected number of such visits is therefore at least n/2n/2. Fix a (s,a)Γα(s,a)\in\Gamma^{\alpha} and let VnV_{n} denote the number of visits to (s,a)(s,a) after nn intervals. Using a Chernoff bound for Poisson trials, we have

Pr(Vn(1ϵ)n/2)1eϵ2n/4\Pr(V_{n}\geq(1-\epsilon)n/2)\geq 1-e^{-\epsilon^{2}n/4}

for any ϵ(0,1)\epsilon\in(0,1). Setting ϵ=12N/n\epsilon=1-2N/n and solving

e(12N/n)2n/4p1e^{-(1-2N/n)^{2}n/4}\leq p_{1}

for nn, we obtain

n2(N+ln(1/p1))+22Nln(1/p1)+(ln(1/p1))2.n\geq 2(N+\ln(1/p_{1}))+2\sqrt{2N\ln(1/p_{1})+(\ln(1/p_{1}))^{2}}. (38)

By definition of NN,

2Nln(1/p1)+(ln(1/p1))2(1+512λ2)max{S,ln(1/p1)}2(256λmax{S,ln(1/p1)})2N2.\begin{split}2N\ln(1/p_{1})+(\ln(1/p_{1}))^{2}&\leq(1+\frac{512}{\lambda^{2}})\max\{S,\ln(1/p_{1})\}^{2}\\ &\leq\left(\frac{256}{\lambda}\max\{S,\ln(1/p_{1})\}\right)^{2}\\ &\leq N^{2}.\end{split}

We also have Nln(1/p1)N\geq\ln(1/p_{1}). Overall, n=6Nn=6N satisfies the condition in Equation 38. Taking a union bound over all (s,a)Γα(s,a)\in\Gamma^{\alpha} and noting that each interval has length 2D~2\tilde{D} steps, the total number of time steps needed in the clustering phase is H0=2D~n|Γα|=12D~|Γα|NH_{0}=2\tilde{D}n|\Gamma^{\alpha}|=12\tilde{D}|\Gamma^{\alpha}|N.
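As a numerical sanity check (with illustrative values of SS, λ\lambda and p1p_{1} chosen by us), n=6Nn=6N indeed satisfies the condition in Equation 38:

```python
import numpy as np

# Check, for illustrative S, lambda and p_1, that n = 6N satisfies Equation 38.
S, lam, p1 = 6, 0.2, 0.05
N = 256 / lam**2 * max(S, np.log(1 / p1))
rhs = 2 * (N + np.log(1 / p1)) + 2 * np.sqrt(2 * N * np.log(1 / p1) + np.log(1 / p1) ** 2)
print(6 * N >= rhs)                                 # True
```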

To prove Lemma 15, we state the following auxiliary proposition and its corollary.

Proposition 27

Suppose we are given a probability distribution PP over 𝒮={1,,S}\mathcal{S}=\{1,\dots,S\}, a constant ϵ>0\epsilon>0 and two sets of samples 𝒳=(X1,,XN𝒳)\mathcal{X}=(X_{1},\dots,X_{N_{\mathcal{X}}}) and 𝒴=(Y1,,YN𝒴)\mathcal{Y}=(Y_{1},\dots,Y_{N_{\mathcal{Y}}}) drawn from PP such that PP^𝒳ϵ\left\lVert P-\hat{P}_{\mathcal{X}}\right\rVert\leq\epsilon and PP^𝒴ϵ\left\lVert P-\hat{P}_{\mathcal{Y}}\right\rVert\leq\epsilon. Then,

PP^𝒳𝒴ϵ.\left\lVert P-\hat{P}_{\mathcal{X}\cup\mathcal{Y}}\right\rVert\leq\epsilon.

Proof  Let N𝒳(s)N_{\mathcal{X}}(s) and N𝒴(s)N_{\mathcal{Y}}(s) denote the number of samples of s[S]s\in[S] in 𝒳\mathcal{X} and 𝒴\mathcal{Y}, respectively. We have:

PP^𝒳𝒴\displaystyle\left\lVert P-\hat{P}_{\mathcal{X}\cup\mathcal{Y}}\right\rVert =s=1S|P(s)N𝒳(s)+N𝒴(s)N𝒳+N𝒴|\displaystyle=\sum_{s=1}^{S}|P(s)-\frac{N_{\mathcal{X}}(s)+N_{\mathcal{Y}}(s)}{N_{\mathcal{X}}+N_{\mathcal{Y}}}| (39)
=1N𝒳+N𝒴s=1S|N𝒳P(s)N𝒳(s)+N𝒴P(s)N𝒴(s)|\displaystyle=\frac{1}{N_{\mathcal{X}}+N_{\mathcal{Y}}}\sum_{s=1}^{S}|N_{\mathcal{X}}P(s)-N_{\mathcal{X}}(s)+N_{\mathcal{Y}}P(s)-N_{\mathcal{Y}}(s)| (40)
1N𝒳+N𝒴s=1S(|N𝒳P(s)N𝒳(s)|+|N𝒴P(s)N𝒴(s)|)(triangle inequality)\displaystyle\leq\frac{1}{N_{\mathcal{X}}+N_{\mathcal{Y}}}\sum_{s=1}^{S}(|N_{\mathcal{X}}P(s)-N_{\mathcal{X}}(s)|+|N_{\mathcal{Y}}P(s)-N_{\mathcal{Y}}(s)|)\quad\text{(triangle inequality)} (41)
=1N𝒳+N𝒴(N𝒳s=1S|P(s)N𝒳(s)N𝒳|)+1N𝒳+N𝒴(N𝒴s=1S|P(s)N𝒴(s)N𝒴|)\displaystyle=\frac{1}{N_{\mathcal{X}}+N_{\mathcal{Y}}}\left(N_{\mathcal{X}}\sum_{s=1}^{S}|P(s)-\frac{N_{\mathcal{X}}(s)}{N_{\mathcal{X}}}|\right)+\frac{1}{N_{\mathcal{X}}+N_{\mathcal{Y}}}\left(N_{\mathcal{Y}}\sum_{s=1}^{S}|P(s)-\frac{N_{\mathcal{Y}}(s)}{N_{\mathcal{Y}}}|\right) (42)
=1N𝒳+N𝒴(N𝒳PP^𝒳1+N𝒴PP^𝒴)\displaystyle=\frac{1}{N_{\mathcal{X}}+N_{\mathcal{Y}}}(N_{\mathcal{X}}\left\lVert P-\hat{P}_{\mathcal{X}}\right\rVert_{1}+N_{\mathcal{Y}}\left\lVert P-\hat{P}_{\mathcal{Y}}\right\rVert) (43)
ϵ\displaystyle\leq\epsilon (44)
Corollary 28

Suppose we are given a probability distribution PP over 𝒮={1,,S}\mathcal{S}=\{1,\dots,S\}, a constant ϵ>0\epsilon>0 and a finite number of sets of samples 𝒳1,𝒳2,,𝒳t\mathcal{X}_{1},\mathcal{X}_{2},\dots,\mathcal{X}_{t} such that PP^𝒳iϵ\left\lVert P-\hat{P}_{\mathcal{X}_{i}}\right\rVert\leq\epsilon for all i=1,2,,ti=1,2,\dots,t. Then,

PP^i=1,,t𝒳iϵ.\displaystyle\left\lVert P-\hat{P}_{\cup_{i=1,\dots,t}\mathcal{X}_{i}}\right\rVert\leq\epsilon. (45)

Proof (of Lemma 15) The proof is by induction. The claim is trivially true for the first episode (k=1k=1). For an episode k>1k>1, assume that the outputs of Algorithm 3 are correct until the beginning of this episode. We consider two cases:

  • When the task mkm_{k} has never been given to the learner before episode kk.

    Consider an arbitrary existing cluster cc. Denote by i[M]i\in[M] the identity of the model to which the samples in cc belong, j[M]j\in[M] the identity of the task mkm_{k}, and (s,a)(s,a) in Γi,jα\Gamma^{\alpha}_{i,j} a state-action pair that distinguishes these two models. Under the definition of Γi,jα\Gamma^{\alpha}_{i,j}, the result in Lemma 13 and the result in Corollary 28, the following three inequalities hold true:

    [PiPj](s,a)>α[PjP^k](s,a)λ/8[PiP^c](s,a)λ/8.\begin{split}\left\lVert[P_{i}-P_{j}](s,a)\right\rVert&>\alpha\\ \left\lVert[P_{j}-\hat{P}_{\mathcal{B}_{k}}](s,a)\right\rVert&\leq\lambda/8\\ \left\lVert[P_{i}-\hat{P}_{c}](s,a)\right\rVert&\leq\lambda/8.\\ \end{split}

    From here, we omit the (s,a)(s,a) and write PP for P(s,a)P(s,a) when no confusion is possible. Applying the triangle inequality twice, we obtain:

    P^cP^kPiPj(PiP^c+PjP^k)>α(λ/8+λ/8)=δ.\begin{split}\left\lVert\hat{P}_{c}-\hat{P}_{\mathcal{B}_{k}}\right\rVert&\geq\left\lVert P_{i}-P_{j}\right\rVert-(\left\lVert P_{i}-\hat{P}_{c}\right\rVert+\left\lVert P_{j}-\hat{P}_{\mathcal{B}_{k}}\right\rVert)\\ &>\alpha-(\lambda/8+\lambda/8)\\ &=\delta.\end{split}

    It follows that the break condition in Algorithm 3 is satisfied, and the correct value of 0 is returned. A new cluster is created containing only the samples generated by the new task mkm_{k}.

  • When the task mkm_{k} has been given to the learner before episode kk.

In this case, there exists a cluster cc^{\prime} containing the samples generated from model jj. Using an argument similar to that of the previous case, we have that whenever the iteration in Algorithm 3 reaches a cluster cc whose identity iji\neq j, the break condition is true for at least one (s,a)Γα(s,a)\in\Gamma^{\alpha}, and the algorithm moves to the next cluster. When the iteration reaches cluster cc^{\prime}, for all (s,a)Γ~α(s,a)\in\tilde{\Gamma}^{\alpha}, we have:

    P^kP^cP^kPj+PjP^cλ/8+λ/8=λ/4δ.\begin{split}\left\lVert\hat{P}_{\mathcal{B}_{k}}-\hat{P}_{c^{\prime}}\right\rVert&\leq\left\lVert\hat{P}_{\mathcal{B}_{k}}-P_{j}\right\rVert+\left\lVert P_{j}-\hat{P}_{c^{\prime}}\right\rVert\\ &\leq\lambda/8+\lambda/8=\lambda/4\\ &\leq\delta.\end{split}

    Hence, the break condition is false for all (s,a)Γ(s,a)\in\Gamma, and thus the algorithm returns id=c\texttt{id}=c^{\prime} as expected.

By induction, under event Γ\mathcal{E}_{\Gamma}, Algorithm 3 always produces correct outputs throughout the KK episodes.
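A minimal sketch of the cluster-assignment test analyzed above is given below; the data layout and names are our illustration of Algorithm 3's break condition, not the authors' code. Each cluster stores pooled empirical estimates over the distinguishing set, and an episode is assigned to the first cluster whose estimates agree within δ\delta on every distinguishing pair (returning 0 if none matches).

```python
import numpy as np

# Illustrative sketch of the cluster-assignment test (Algorithm 3's break condition).
# `clusters` maps a nonzero cluster id to the pooled empirical estimates
# {(s, a): P_hat_c(. | s, a)} over the distinguishing pairs; `P_hat_episode` holds the
# estimates from the current episode's clustering phase.
def assign_cluster(P_hat_episode, clusters, delta):
    """Return the id of a matching cluster, or 0 if a new cluster must be created."""
    for cid, P_hat_c in clusters.items():
        mismatch = any(
            np.abs(P_hat_episode[sa] - P_hat_c[sa]).sum() > delta   # break condition
            for sa in P_hat_c
        )
        if not mismatch:
            return cid            # all distinguishing pairs agree within delta
    return 0                      # no existing cluster matches
```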

We can now state the regret bound of Algorithm 1 where the regret minimization algorithm in every episode is UCBVI-CH (Azar et al., 2017). For each state-action pair (s,a)(s,a) in episode kk, UCBVI-CH needs a bonus term defined as

bk(s,a)=7H1Lk1Nkregret(s,a),b_{k}(s,a)=7H_{1}L_{k}\sqrt{\frac{1}{N^{regret}_{k}(s,a)}},

where Lk=ln(5SAKmkH1/p1)L_{k}=\ln(5SAK_{m_{k}}H_{1}/p_{1}), Nkregret(s,a)N^{regret}_{k}(s,a) is the total number of visits to (s,a)(s,a) in the learning phase before episode kk, and KmkK_{m^{k}} is the total number of episodes in which the model mkm^{k} is given to the learner. However, KmkK_{m^{k}} is unknown to the learner. We instead upper bound KmkK_{m^{k}} by KK and modify the bonus term as

bk(s,a)=7H1L1Nkregret(s,a)b^{\prime}_{k}(s,a)=7H_{1}L\sqrt{\frac{1}{N^{regret}_{k}(s,a)}} (46)

where L=ln(5SAKHM/p1)L=\ln(5SAKHM/p_{1}). Since bkbkb^{\prime}_{k}\geq b_{k}, this algorithm still retains the optimism principle needed for UCBVI-CH. The total regret of each model in \mathcal{M} is bounded by the following result, whose proof is in Appendix E.

Lemma 29

With probability at least 1p11-p_{1}, applying UCBVI-CH with the bonus term bkb^{\prime}_{k} defined in Equation 46, each task mm in \mathcal{M} has a total regret of

Regret(m,Km)Km(H0+D)+67H13/2LSAKm+15S2A2H12L2\text{Regret}(m,K_{m})\leq K_{m}(H_{0}+D)+67H_{1}^{3/2}L\sqrt{SAK_{m}}+15S^{2}A^{2}H_{1}^{2}L^{2}

See 16

Proof  Summing up the regret for all mm\in\mathcal{M} and applying the Cauchy-Schwarz inequality, Lemma 29 together with Lemma 15 and Lemma 14 imply that with probability 1p1-p, the total regret is bounded by

Regret(K)K(H0+D)+67H1LMSAKH1+15MS2AH12L2.\text{Regret}(K)\leq K(H_{0}+D)+67H_{1}L\sqrt{MSAKH_{1}}+15MS^{2}AH_{1}^{2}L^{2}. (47)

Note that the bound in Equation 47 is tighter than the bound in Theorem 16. To obtain the bound in Theorem 16, notice that DD~H0D\leq\tilde{D}\leq H_{0} and thus K(H0+D)K(H0+H0)=2KH0K(H_{0}+D)\leq K(H_{0}+H_{0})=2KH_{0}.

See 20

Proof  In stage 1, as the distinguishing set has size |Γ~|=SA|\tilde{\Gamma}|=SA, the number of time steps needed in the clustering phase is

H0,1=12D~|Γ~|N1=12D~SAN1,H_{0,1}=12\tilde{D}|\tilde{\Gamma}|N_{1}=12\tilde{D}SAN_{1},

where N1=256λ2max{S,ln(3KSAp)}N_{1}=\frac{256}{\lambda^{2}}\max\{S,\ln(\frac{3KSA}{p})\}.

In stage 2, the length of the clustering phase is

H0,2=12D~|Γ^|N2,H_{0,2}=12\tilde{D}|\hat{\Gamma}|N_{2},

where N2=256λ2max{S,ln(3K|Γ^|p)}N_{2}=\frac{256}{\lambda^{2}}\max\{S,\ln(\frac{3K|\hat{\Gamma}|}{p})\}.

Substituting H0,1H_{0,1} and H0,2H_{0,2} into Theorem 16, we obtain the regret bound of stage 1 and stage 2:

RegretStage12K1H0,1+67(H1,1)3/2L1MSAK1+15MS2A(H1,1)2L12,\begin{split}\text{Regret}_{Stage1}\leq 2K_{1}H_{0,1}&+67(H_{1,1})^{3/2}L_{1}\sqrt{MSAK_{1}}+15MS^{2}A(H_{1,1})^{2}L_{1}^{2},\end{split}

where L1=ln(15MSAKH1,1p)L_{1}=\ln(\frac{15MSAKH_{1,1}}{p}) and H1,1=HH0,1H_{1,1}=H-H_{0,1}.

RegretStage22K2H0,2+67H1,23/2L2MSAK2+15MS2AH1,22L22,\begin{split}\text{Regret}_{Stage2}\leq 2K_{2}H_{0,2}&+67H_{1,2}^{3/2}L_{2}\sqrt{MSAK_{2}}+15MS^{2}AH_{1,2}^{2}L_{2}^{2},\end{split}

where L2=ln(15MSAKH1,2p)L_{2}=\ln(\frac{15MSAKH_{1,2}}{p}) and H1,2=HH0,2H_{1,2}=H-H_{0,2}.

Since H0,1H0,2H_{0,1}\geq H_{0,2}, we have H1,1H1,2H_{1,1}\leq H_{1,2}. Using the assumption that K1SA<K2K_{1}SA<K_{2} and the Cauchy-Schwarz inequality for the sum K1+K2\sqrt{K_{1}}+\sqrt{K_{2}}, we obtain

Regret(K)\displaystyle\text{Regret}(K) =RegretStage1+RegretStage2\displaystyle=\text{Regret}_{Stage1}+\text{Regret}_{Stage2} (48)
4KH0,2+67H1,23/2L22MSAK+30MS2AH1,22L22.\displaystyle\leq 4KH_{0,2}+67H_{1,2}^{3/2}L_{2}\sqrt{2MSAK}+30MS^{2}AH_{1,2}^{2}L_{2}^{2}. (49)

By having |Γ^|(M2)M2,H1,2H|\hat{\Gamma}|\leq{M\choose 2}\leq M^{2},H_{1,2}\leq H and max{L1,L2}L\max\{L_{1},L_{2}\}\leq L, we obtain

Regret(K)4KH0,M\displaystyle\text{Regret}(K)\leq 4KH_{0,M} +67H3/2L2MSAK+30MS2AH2L2.\displaystyle+67H^{3/2}L\sqrt{2MSAK}+30MS^{2}AH^{2}L^{2}. (50)

where H0,M=3072D~M2λ2max{S,ln(3KM2p)}H_{0,M}=\frac{3072\tilde{D}M^{2}}{\lambda^{2}}\max\{S,\ln(\frac{3KM^{2}}{p})\}.

Appendix E Per-model Regret analysis

First, we prove the following lemma, which upper bounds the per-episode regret in terms of the length H0H_{0} of the clustering phase and the regret of the learning phase.

Lemma 30

The regret of Algorithm 1 in episode kk is

Δk=[V1k,V1πk](s1k)H0+D+maxs𝒮[VH0+1k,VH0+1πk](s).\Delta_{k}=[V_{1}^{k,*}-V_{1}^{\pi_{k}}](s^{k}_{1})\leq H_{0}+D+\max_{s\in\mathcal{S}}[V_{H_{0}+1}^{k,*}-V_{H_{0}+1}^{\pi_{k}}](s).

Proof  Denote by Pr(shk=ss1,π)\Pr(s^{k}_{h}=s\mid s_{1},\pi) the probability of visiting state ss at time hh when the learner follows a (possibly non-stationary) policy π\pi in model mkm^{k} starting from state s1s_{1}. The regret of task mm in a single episode k𝒦mk\in\mathcal{K}_{m} can be written as

Δk=[V1k,V1πk](s1k)=E[h=1Hr(sh,ah)s1=s1k,ah=πk(sh)]E[h=1Hr(sh,ah)s1=s1k,ah=πk(sh)]=(E[h=1H0r(sh,ah)s1=s1k,ah=πk(sh)]+s𝒮Prm(sH0+1k=ss1k,πk)VH0+1k,(s))(E[h=1H0r(sh,ah)s1=s1k,ah=πk(sh)]+s𝒮Prm(sH0+1k=ss1k,πk)VH0+1πk(s))H0+s𝒮Prm(sH0+1k=ss1k,πk)VH0+1k,(s)s𝒮Prm(sH0+1k=ss1k,πk)VH0+1πk(s)=H0+(s𝒮Prm(sH0+1k=ss1k,πk)VH0+1k,(s)s𝒮Prm(sH0+1k=ss1k,πk)VH0+1k,(s))+s𝒮Prm(sH0+1k=ss1k,πk)[VH0+1k,VH0+1πk](s)H0+(maxs𝒮VH0+1k,(s)mins𝒮VH0+1k,(s))()+maxs𝒮[VH0+1k,VH0+1πk](s).\begin{split}\Delta_{k}&=[V^{k,*}_{1}-V^{\pi_{k}}_{1}](s^{k}_{1})\\ &=E[\sum_{h=1}^{H}r(s_{h},a_{h})\mid s_{1}=s^{k}_{1},a_{h}=\pi_{k}^{*}(s_{h})]-E[\sum_{h=1}^{H}r(s_{h},a_{h})\mid s_{1}=s^{k}_{1},a_{h}=\pi_{k}(s_{h})]\\ &=\left(E[\sum_{h=1}^{H_{0}}r(s_{h},a_{h})\mid s_{1}=s^{k}_{1},a_{h}=\pi_{k}^{*}(s_{h})]+\sum_{s\in\mathcal{S}}{\Pr}_{m}(s^{k}_{H_{0}+1}=s\mid s^{k}_{1},\pi^{*}_{k})V_{H_{0}+1}^{k,*}(s)\right)\\ &\quad-\left(E[\sum_{h=1}^{H_{0}}r(s_{h},a_{h})\mid s_{1}=s^{k}_{1},a_{h}=\pi_{k}(s_{h})]+\sum_{s\in\mathcal{S}}{\Pr}_{m}(s^{k}_{H_{0}+1}=s\mid s^{k}_{1},\pi_{k})V_{H_{0}+1}^{\pi_{k}}(s)\right)\\ &\leq H_{0}+\sum_{s\in\mathcal{S}}{\Pr}_{m}(s^{k}_{H_{0}+1}=s\mid s^{k}_{1},\pi^{*}_{k})V_{H_{0}+1}^{k,*}(s)-\sum_{s\in\mathcal{S}}{\Pr}_{m}(s^{k}_{H_{0}+1}=s\mid s^{k}_{1},\pi_{k})V_{H_{0}+1}^{\pi_{k}}(s)\\ &=H_{0}+\left(\sum_{s\in\mathcal{S}}{\Pr}_{m}(s^{k}_{H_{0}+1}=s\mid s^{k}_{1},\pi^{*}_{k})V_{H_{0}+1}^{k,*}(s)-\sum_{s\in\mathcal{S}}{\Pr}_{m}(s^{k}_{H_{0}+1}=s\mid s^{k}_{1},\pi_{k})V_{H_{0}+1}^{k,*}(s)\right)\\ &\quad+\sum_{s\in\mathcal{S}}{\Pr}_{m}(s^{k}_{H_{0}+1}=s\mid s^{k}_{1},\pi_{k})[V_{H_{0}+1}^{k,*}-V^{\pi_{k}}_{H_{0}+1}](s)\\ &\leq H_{0}+\underbrace{\left(\max_{s\in\mathcal{S}}V_{H_{0}+1}^{k,*}(s)-\min_{s\in\mathcal{S}}V_{H_{0}+1}^{k,*}(s)\right)}_{(\clubsuit)}+\max_{s\in\mathcal{S}}[V_{H_{0}+1}^{k,*}-V^{\pi_{k}}_{H_{0}+1}](s).\end{split}

The first inequality follows from the assumption that r(s,a)[0,1]r(s,a)\in[0,1] for all (s,a)(s,a). The second inequality follows from the fact that

s𝒮Prm(sH0+1k=ss1k,πk)VH0+1k,(s)s𝒮Prm(sH0+1k=ss1k,πk)maxx𝒮VH0+1k,(x)=(maxx𝒮VH0+1k,(x))s𝒮Prm(sH0+1k=ss1k,πk)=maxx𝒮VH0+1k,(x),\begin{split}\sum_{s\in\mathcal{S}}{\Pr}_{m}(s^{k}_{H_{0}+1}=s\mid s^{k}_{1},\pi^{*}_{k})V_{H_{0}+1}^{k,*}(s)&\leq\sum_{s\in\mathcal{S}}{\Pr}_{m}(s^{k}_{H_{0}+1}=s\mid s^{k}_{1},\pi^{*}_{k})\max_{x\in\mathcal{S}}V_{H_{0}+1}^{k,*}(x)\\ &=\left(\max_{x\in\mathcal{S}}V_{H_{0}+1}^{k,*}(x)\right)\sum_{s\in\mathcal{S}}{\Pr}_{m}(s^{k}_{H_{0}+1}=s\mid s^{k}_{1},\pi^{*}_{k})\\ &=\max_{x\in\mathcal{S}}V_{H_{0}+1}^{k,*}(x),\end{split}

and

s𝒮Prm(sH0+1k=ss1k,πk)VH0+1k,(s)s𝒮Prm(sH0+1k=ss1k,πk)minx𝒮VH0+1k,(x)=(minx𝒮VH0+1k,(x))s𝒮Prm(sH0+1k=ss1k,πk)=minx𝒮VH0+1k,(x).\begin{split}\sum_{s\in\mathcal{S}}{\Pr}_{m}(s^{k}_{H_{0}+1}=s\mid s^{k}_{1},\pi_{k})V_{H_{0}+1}^{k,*}(s)&\geq\sum_{s\in\mathcal{S}}{\Pr}_{m}(s^{k}_{H_{0}+1}=s\mid s^{k}_{1},\pi_{k})\min_{x\in\mathcal{S}}V_{H_{0}+1}^{k,*}(x)\\ &=\left(\min_{x\in\mathcal{S}}V_{H_{0}+1}^{k,*}(x)\right)\sum_{s\in\mathcal{S}}{\Pr}_{m}(s^{k}_{H_{0}+1}=s\mid s^{k}_{1},\pi_{k})\\ &=\min_{x\in\mathcal{S}}V_{H_{0}+1}^{k,*}(x).\end{split}

Furthermore, since VH0+1k,(s)VH0+1πk(s)V^{k,*}_{H_{0}+1}(s)\geq V_{H_{0}+1}^{\pi_{k}}(s) for all s𝒮s\in\mathcal{S}, we have

s𝒮Prm(sH0+1k=ss1k,πk)[VH0+1k,VH0+1πk](s)s𝒮Prm(sH0+1k=ss1k,πk)maxx𝒮[VH0+1k,VH0+1πk](x)=maxx𝒮[VH0+1k,VH0+1πk](x)s𝒮Prm(sH0+1k=ss1k,πk)=maxx𝒮[VH0+1k,VH0+1πk](x).\begin{split}\sum_{s\in\mathcal{S}}{\Pr}_{m}(s^{k}_{H_{0}+1}=s\mid s^{k}_{1},\pi_{k})[V_{H_{0}+1}^{k,*}-V^{\pi_{k}}_{H_{0}+1}](s)&\leq\sum_{s\in\mathcal{S}}{\Pr}_{m}(s^{k}_{H_{0}+1}=s\mid s^{k}_{1},\pi_{k})\max_{x\in\mathcal{S}}[V_{H_{0}+1}^{k,*}-V^{\pi_{k}}_{H_{0}+1}](x)\\ &=\max_{x\in\mathcal{S}}[V_{H_{0}+1}^{k,*}-V^{\pi_{k}}_{H_{0}+1}](x)\sum_{s\in\mathcal{S}}{\Pr}_{m}(s^{k}_{H_{0}+1}=s\mid s^{k}_{1},\pi_{k})\\ &=\max_{x\in\mathcal{S}}[V_{H_{0}+1}^{k,*}-V^{\pi_{k}}_{H_{0}+1}](x).\end{split}

For each state ss, the value of Vhk,(s)V_{h}^{k,*}(s) is the expected total (Hh)(H-h)-step reward of an optimal non-stationary (Hh)(H-h)-step policy starting in state ss on the MDP mm. Thus, the term ()(\clubsuit) is the span of the finite-step optimal value function in MDP mm. Applying equation 11 of Jaksch et al. (2010), the span of the value function is bounded by the diameter of the MDP. We obtain, for all hh,

maxs𝒮Vhk,(s)mins𝒮Vhk,(s)D.\max_{s\in\mathcal{S}}V_{h}^{k,*}(s)-\min_{s\in\mathcal{S}}V^{k,*}_{h}(s)\leq D.

It follows that

ΔkH0+D+maxs𝒮[VH0+1k,VH0+1πk](s).\Delta_{k}\leq H_{0}+D+\max_{s\in\mathcal{S}}[V^{k,*}_{H_{0}+1}-V^{\pi_{k}}_{H_{0}+1}](s).

Denote by 𝒦m\mathcal{K}_{m} the set of episodes in which the model mm is given to the learner. The total regret of the learner over the episodes in 𝒦m\mathcal{K}_{m} is

Regret(m,Km)=k𝒦mΔkKm(H0+D)+k𝒦mmaxsS[VH0+1k,VH0+1πk](s)().\begin{split}\mathrm{Regret}(m,K_{m})&=\sum_{k\in\mathcal{K}_{m}}\Delta_{k}\\ &\leq K_{m}(H_{0}+D)+\underbrace{\sum_{k\in\mathcal{K}_{m}}\max_{s\in S}[V^{k,*}_{H_{0}+1}-V^{\pi_{k}}_{H_{0}+1}](s)}_{(\heartsuit)}.\end{split}

The policy πk\pi_{k} from time step H0+1H_{0}+1 to HH is produced by the UCBVI-CH algorithm (Azar et al., 2017). Therefore, the term ()(\heartsuit) corresponds to the total regret of UCBVI-CH in an adversarial setting in which the starting state s1ks^{k}_{1} of each episode is chosen by an adversary that maximizes the regret. In Appendix F, we give a simplified analysis of UCBVI-CH and show that with probability at least 1p1/M1-p_{1}/M,

()=k𝒦mmaxsS[VH0+1k,VH0+1πk](s)67H13/2LSAKm+15S2A2H12L2.(\heartsuit)=\sum_{k\in\mathcal{K}_{m}}\max_{s\in S}[V^{k,*}_{H_{0}+1}-V^{\pi_{k}}_{H_{0}+1}](s)\leq 67H_{1}^{3/2}L\sqrt{SAK_{m}}+15S^{2}A^{2}H_{1}^{2}L^{2}. (51)

The proof of Lemma 29 is completed by plugging the bound on ()(\heartsuit) from Equation 51 into the inequality above to obtain

Regret(m,Km)=k𝒦mΔkKm(H0+D)+67H13/2LSAKm+15S2A2H12L2.\begin{split}\mathrm{Regret}(m,K_{m})&=\sum_{k\in\mathcal{K}_{m}}\Delta_{k}\\ &\leq K_{m}(H_{0}+D)+67H_{1}^{3/2}L\sqrt{SAK_{m}}+15S^{2}A^{2}H_{1}^{2}L^{2}.\end{split}

Appendix F A simplified analysis for UCBVI-CH

Input: Failure probability pp
Initialize an empty collection \mathcal{B}
for episode k=1,,Kk=1,\dots,K: do
       Qk,h=Q_{k,h}= UCB-Q-Values (,p\mathcal{B},p)
       for h=1,,Hh=1,\dots,H:  do
             Take action ak,h=argmaxaQk,h(shk,a)a_{k,h}=\operatorname*{arg\,max}_{a}Q_{k,h}(s^{k}_{h},a)
             Add (shk,ahk,sh+1k)(s^{k}_{h},a^{k}_{h},s^{k}_{h+1}) to \mathcal{B}
            
Algorithm 5 UCBVI
Input: Collection \mathcal{B}, probability pp
Compute, for all (s,a,s)𝒮×𝒜×𝒮(s,a,s^{\prime})\in\mathcal{S\times A\times S}
Nk(s,a,s)=(x,a,y)𝕀(x=s,a=a,y=s)N_{k}(s,a,s^{\prime})=\sum_{(x,a^{\prime},y)\in\mathcal{B}}\mathbb{I}(x=s,a^{\prime}=a,y=s^{\prime})
Nk(s,a)=sSNk(s,a,s)N_{k}(s,a)=\sum_{s^{\prime}\in S}N_{k}(s,a,s^{\prime})
For all (s,a){(s,a):Nk(s,a)>0}(s,a)\in\{(s,a):N_{k}(s,a)>0\}, compute
P^k(ss,a)=Nk(s,a,s)Nk(s,a)\hat{P}_{k}(s^{\prime}\mid s,a)=\frac{N_{k}(s,a,s^{\prime})}{N_{k}(s,a)}
bk,h(s,a)=7HL1Nk(s,a)b_{k,h}(s,a)=7HL\sqrt{\frac{1}{N_{k}(s,a)}} where L=ln(5SAKH/p)L=\ln(5SAKH/p)
Initialize Vk,H+1(s)=0V_{k,H+1}(s)=0 for all s𝒮s\in\mathcal{S}
for h=H,H1,,1h=H,H-1,\dots,1: do
       for (s,a)𝒮×𝒜(s,a)\in\mathcal{S\times A} do
             if Nk(s,a)>0N_{k}(s,a)>0 then
                   Qk,h(s,a)=min{H,r(s,a)+(s𝒮P^k(ss,a)Vk,h+1(s))+bk,h(s,a)}Q_{k,h}(s,a)=\min\{H,r(s,a)+\left(\sum_{s^{\prime}\in\mathcal{S}}\hat{P}_{k}(s^{\prime}\mid s,a)V_{k,h+1}(s^{\prime})\right)+b_{k,h}(s,a)\}
             else
                   Qk,h(s,a)=HQ_{k,h}(s,a)=H
             Vk,h(s)=maxaQk,h(s,a)V_{k,h}(s)=\max_{a}Q_{k,h}(s,a)
Algorithm 6 UCB-Q-Values with Hoeffding bonus
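To make the backward induction in Algorithm 6 concrete, the following is a minimal Python sketch of the Hoeffding-bonus update under simplifying assumptions: the transition counts are stored in a dense array N_sas (playing the role of the counts computed from the buffer), and r is the known deterministic reward array. The array names are illustrative and not part of the original pseudocode.

import numpy as np

def ucb_q_values(N_sas, r, H, K, p):
    # N_sas: (S, A, S) transition counts from the collected samples; r: (S, A) known rewards in [0, 1].
    S, A, _ = N_sas.shape
    N_sa = N_sas.sum(axis=2)                          # visit counts N_k(s, a)
    L = np.log(5 * S * A * K * H / p)
    P_hat = N_sas / np.maximum(N_sa, 1)[..., None]    # empirical transitions (unvisited pairs handled below)
    bonus = 7 * H * L / np.sqrt(np.maximum(N_sa, 1))  # Hoeffding-type bonus b_{k,h}(s, a)
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))                          # V_{k, H+1} = 0
    for h in range(H - 1, -1, -1):
        q = r + P_hat @ V[h + 1] + bonus              # optimistic Q-value
        Q[h] = np.where(N_sa > 0, np.minimum(float(H), q), float(H))
        V[h] = Q[h].max(axis=1)
    return Q, V

The greedy step in Algorithm 5 then corresponds to taking the argmax over Q[h] at the current state.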

In this section, we give a simplified analysis of the UCBVI-CH algorithm of Azar et al. (2017). The proof largely follows the constructions in Azar et al. (2017), with two differences: the definition of “typical” episodes and the analysis are tailored to the Chernoff-type bonus of UCBVI-CH, and are therefore not complicated by the handling of variances required for the Bernstein-type bonus of UCBVI-BF in Azar et al. (2017). For completeness, the full UCBVI-CH algorithm from Azar et al. (2017) is shown in Algorithms 5 and 6.

Notation. In this section, we consider the standard single-task episodic RL setting in  Azar et al. (2017) where the learner is given the same MDP (𝒮,𝒜,H,P,r)(\mathcal{S,A},H,P,r) in KK episodes. We assume the reward function r:𝒮×𝒜[0,1]r:\mathcal{S\times A}\mapsto[0,1] is deterministic and known. The state and action spaces 𝒮\mathcal{S} and 𝒜\mathcal{A} are discrete spaces with size SS and AA, respectively. Denote by pp the failure probability and let L=ln(5SAKH/p)L=\ln(5SAKH/p). We assume the product SAKHSAKH is sufficiently large that L>1L>1.

Let V1V_{1}^{*} denote the optimal value function and V1πkV^{\pi_{k}}_{1} the value function of the policy πk\pi_{k} of the UCBVI-CH agent in episode kk. The regret is defined as follows.

Regret(K)=k=1Kδk,1,\mathrm{Regret}(K)=\sum_{k=1}^{K}\delta_{k,1}, (52)

where δk,h=[VhVhπk](shk)\delta_{k,h}=[V_{h}^{*}-V^{\pi_{k}}_{h}](s^{k}_{h}).

Denote by Nk(s,a)N_{k}(s,a) the number of visits to the state-action pair (s,a)(s,a) up to the beginning of episode kk.

We call an episode kk “typical” if all state-action pairs visited in episode kk have been visited at least HH times at the beginning of episode kk. The set of typical episodes is defined as follows.

[K]typ={i[K]:h[H],Ni(shi,ahi)H}.[K]_{typ}=\{i\in[K]:\forall h\in[H],N_{i}(s^{i}_{h},a^{i}_{h})\geq H\}. (53)
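As an illustration (with hypothetical data structures), the set of typical episodes can be computed directly from the visit counts:

def typical_episodes(visit_counts, trajectories, H):
    # visit_counts[k][(s, a)]: number of visits to (s, a) before episode k.
    # trajectories[k]: the H state-action pairs visited in episode k.
    return [k for k, traj in enumerate(trajectories)
            if all(visit_counts[k].get(sa, 0) >= H for sa in traj)]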

Equation 52 can be written as

Regret(K)=k[K]typδk,1+k[K]typδk,1k[K]typH+k[K]typδk,1SAH2+k[K]typδk,1.\begin{split}\mathrm{Regret}(K)&=\sum_{k\notin[K]_{typ}}\delta_{k,1}+\sum_{k\in[K]_{typ}}\delta_{k,1}\\ &\leq\sum_{k\notin[K]_{typ}}H+\sum_{k\in[K]_{typ}}\delta_{k,1}\\ &\leq SAH^{2}+\sum_{k\in[K]_{typ}}\delta_{k,1}.\end{split} (54)

The first inequality follows from the trivial upper bound of the regret in an episode δk,1H\delta_{k,1}\leq H. The second inequality comes from the fact that each state-action pair can cause at most HH episodes to be non-typical; therefore there are at most SAHSAH non-typical episodes.

Next, we have:

k[K]typδk,1=k=1Kδk,1𝕀{k[K]typ}.\begin{split}\sum_{k\in[K]_{typ}}\delta_{k,1}&=\sum_{k=1}^{K}\delta_{k,1}\mathbb{I}\{k\in[K]_{typ}\}.\end{split} (55)

From here we write 𝕀k=𝕀{k[K]typ}\mathbb{I}_{k}=\mathbb{I}\{k\in[K]_{typ}\} for brevity.

Lemma 3 in Azar et al. (2017) implies that, for all k[K]k\in[K],

δk,1eh=1H[εk,h+2Lε¯k,h+c1,k,h+bk,h+c4,k,h].\delta_{k,1}\leq e\sum_{h=1}^{H}\left[\varepsilon_{k,h}+2\sqrt{L}\bar{\varepsilon}_{k,h}+c_{1,k,h}+b_{k,h}+c_{4,k,h}\right]. (56)

where c4,k,h=4SH2LNk(shk,ahk)c_{4,k,h}=\frac{4SH^{2}L}{N_{k}(s^{k}_{h},a^{k}_{h})}, εk,h\varepsilon_{k,h} and ε¯k,h\bar{\varepsilon}_{k,h} are martingale difference sequences which, by Lemma 5 in Azar et al. (2017), satisfy

k=1Kh=1Hεk,hHKHLk=1Kh=1Hε¯k,hKH,\begin{split}\sum_{k=1}^{K}\sum_{h=1}^{H}\varepsilon_{k,h}&\leq H\sqrt{KHL}\\ \sum_{k=1}^{K}\sum_{h=1}^{H}\bar{\varepsilon}_{k,h}&\leq\sqrt{KH},\end{split} (57)

and c1,k,hc_{1,k,h} is a confidence interval to be defined later.

Plugging Equation 56 into Equation 55 and combining with Equation 57, we obtain:

k[K]typδk,1ek=1K(h=1H[εk,h+2Lε¯k,h+c1,k,h+bk,h+c4,k,h])𝕀k=e[(k=1K𝕀kh=1H(εk,h+2Lε¯k,h))+(k=1K𝕀kh=1H(bk,h+c1,k,h+c4,k,h))]e[(k=1Kh=1H(εk,h+2Lε¯k,h))+(k=1Kh=1H(bk,h𝕀k+c1,k,h𝕀k+c4,k,h𝕀k))]e[(HKHL+2LKH)+(k=1Kh=1H(bk,h𝕀k+c1,k,h𝕀k+c4,k,h𝕀k))]=e[((H+2)KHL)+(k=1Kh=1H(bk,h𝕀k+c1,k,h𝕀k+c4,k,h𝕀k))]\begin{split}\sum_{k\in[K]_{typ}}\delta_{k,1}&\leq e\sum_{k=1}^{K}\left(\sum_{h=1}^{H}\left[\varepsilon_{k,h}+2\sqrt{L}\bar{\varepsilon}_{k,h}+c_{1,k,h}+b_{k,h}+c_{4,k,h}\right]\right)\mathbb{I}_{k}\\ &=e\left[\left(\sum_{k=1}^{K}\mathbb{I}_{k}\sum_{h=1}^{H}(\varepsilon_{k,h}+2\sqrt{L}\bar{\varepsilon}_{k,h})\right)+\left(\sum_{k=1}^{K}\mathbb{I}_{k}\sum_{h=1}^{H}(b_{k,h}+c_{1,k,h}+c_{4,k,h})\right)\right]\\ &\leq e\left[\left(\sum_{k=1}^{K}\sum_{h=1}^{H}(\varepsilon_{k,h}+2\sqrt{L}\bar{\varepsilon}_{k,h})\right)+\left(\sum_{k=1}^{K}\sum_{h=1}^{H}(b_{k,h}\mathbb{I}_{k}+c_{1,k,h}\mathbb{I}_{k}+c_{4,k,h}\mathbb{I}_{k})\right)\right]\\ &\leq e\left[\left(H\sqrt{KHL}+2\sqrt{L}\sqrt{KH}\right)+\left(\sum_{k=1}^{K}\sum_{h=1}^{H}(b_{k,h}\mathbb{I}_{k}+c_{1,k,h}\mathbb{I}_{k}+c_{4,k,h}\mathbb{I}_{k})\right)\right]\\ &=e\left[\left((H+2)\sqrt{KHL}\right)+\left(\sum_{k=1}^{K}\sum_{h=1}^{H}(b_{k,h}\mathbb{I}_{k}+c_{1,k,h}\mathbb{I}_{k}+c_{4,k,h}\mathbb{I}_{k})\right)\right]\end{split}

Note that the second inequality follows from the fact that 𝕀k1\mathbb{I}_{k}\leq 1, and the last inequality follows directly from Equation 57.

Let 𝕀k,h=𝕀{Nk(shk,ahk)H}\mathbb{I}_{k,h}=\mathbb{I}\{N_{k}(s^{k}_{h},a^{k}_{h})\geq H\}. By the definition of a “typical” episode, 𝕀k=1\mathbb{I}_{k}=1 implies that 𝕀k,h=1\mathbb{I}_{k,h}=1 for all hh. It follows that 𝕀k𝕀k,h\mathbb{I}_{k}\leq\mathbb{I}_{k,h}. Thus,

k[K]typδk,1e((H+2)KHL+k=1Kh=1H(bk,h+c1,k,h+c4,k,h)),\sum_{k\in[K]_{typ}}\delta_{k,1}\leq e\left((H+2)\sqrt{KHL}+\sum_{k=1}^{K}\sum_{h=1}^{H}(b^{\prime}_{k,h}+c^{\prime}_{1,k,h}+c^{\prime}_{4,k,h})\right), (58)

where bk,h=bk,h𝕀k,hb^{\prime}_{k,h}=b_{k,h}\mathbb{I}_{k,h}, c1,k,h=c1,k,h𝕀k,hc^{\prime}_{1,k,h}=c_{1,k,h}\mathbb{I}_{k,h} and c4,k,h=c4,k,h𝕀k,hc^{\prime}_{4,k,h}=c_{4,k,h}\mathbb{I}_{k,h}.

Next, we compute c1,k,hc_{1,k,h}. In Equation (32) in Azar et al. (2017), c1,k,hc_{1,k,h} corresponds to the confidence interval of

(P^hπPhπ)Vh+1(shk)=s𝒮[P^(sshk,ahk)Ph(sshk,ahk)]Vh+1(s).(\hat{P}_{h}^{\pi}-P_{h}^{\pi})V^{*}_{h+1}(s^{k}_{h})=\sum_{s^{\prime}\in\mathcal{S}}\left[\hat{P}(s^{\prime}\mid s^{k}_{h},a^{k}_{h})-P_{h}(s^{\prime}\mid s^{k}_{h},a^{k}_{h})\right]V^{*}_{h+1}(s^{\prime}).

Equation (9) in Azar et al. (2017) computes a confidence interval for this term using the Bernstein inequality. Instead, we use the Hoeffding inequality and obtain

[(P^hπPhπ)Vh+1]HL2Nk(shk,ahk)=c1,k,h.[(\hat{P}_{h}^{\pi}-P_{h}^{\pi})V^{*}_{h+1}]\leq H\sqrt{\frac{L}{2N_{k}(s^{k}_{h},a^{k}_{h})}}=c_{1,k,h}. (59)

Combining Equations 59,  58 and 54, the total regret is bounded as

RegretSAH2+e((H+2)KHL+k=1Kh=1H(bk,h+c1,k,h+c4,k,h)(a))\begin{split}\text{Regret}\leq SAH^{2}+e\left((H+2)\sqrt{KHL}+\underbrace{\sum_{k=1}^{K}\sum_{h=1}^{H}(b^{\prime}_{k,h}+c^{\prime}_{1,k,h}+c^{\prime}_{4,k,h})}_{(a)}\right)\end{split} (60)

where bk,h=7HL𝕀k,hNk(shk,ahk),c1,k,h=HL𝕀k,h2Nk(shk,ahk)b^{\prime}_{k,h}=\frac{7HL\mathbb{I}_{k,h}}{\sqrt{N_{k}(s^{k}_{h},a^{k}_{h})}},c^{\prime}_{1,k,h}=\frac{H\sqrt{L}\mathbb{I}_{k,h}}{\sqrt{2N_{k}(s^{k}_{h},a^{k}_{h})}} and c4,k,h=4SH2L𝕀k,hNk(shk,ahk)c^{\prime}_{4,k,h}=\frac{4SH^{2}L\mathbb{I}_{k,h}}{N_{k}(s^{k}_{h},a^{k}_{h})}.

We focus on the third and dominant term (a)(a). As bk,hc1,k,hb_{k,h}\geq c_{1,k,h}, this term can be upper bounded by

(a)k=1Kh=1H[8HL𝕀k,hNk(shk,ahk)+4SH2L𝕀k,hNk(shk,ahk)](since L>1)=8HLk=1Kh=1H𝕀k,hNk(shk,ahk)(b)+4SH2Lk=1Kh=1H𝕀k,hNk(shk,ahk)(c).\begin{split}(a)&\leq\sum_{k=1}^{K}\sum_{h=1}^{H}\left[\frac{8HL\mathbb{I}_{k,h}}{\sqrt{N_{k}(s^{k}_{h},a^{k}_{h})}}+\frac{4SH^{2}L\mathbb{I}_{k,h}}{N_{k}(s^{k}_{h},a^{k}_{h})}\right]\quad\text{(since }L>1\text{)}\\ &=8HL\underbrace{\sum_{k=1}^{K}\sum_{h=1}^{H}\frac{\mathbb{I}_{k,h}}{\sqrt{N_{k}(s^{k}_{h},a^{k}_{h})}}}_{(b)}+4SH^{2}L\underbrace{\sum_{k=1}^{K}\sum_{h=1}^{H}\frac{\mathbb{I}_{k,h}}{N_{k}(s^{k}_{h},a^{k}_{h})}}_{(c)}.\end{split} (61)

We bound (b)(b) and (c)(c) separately.

First, we bound (b)(b). We introduce the following lemma, which is an analogue of Lemma 19 in Jaksch et al. (2010) for the finite-horizon setting.

Lemma 31

Let H1H\geq 1. For any sequence of numbers z1,,znz_{1},\dots,z_{n} with 0zkH0\leq z_{k}\leq H, consider the sequence Z0,Z1,ZnZ_{0},Z_{1},\dots Z_{n} defined as

Z0HZk=Zk1+zkfor k1.\begin{split}Z_{0}&\geq H\\ Z_{k}&=Z_{k-1}+z_{k}\qquad\text{for }k\geq 1.\end{split}

Then, for all n1n\geq 1,

k=1nzkZk1(2+1)Zn.\sum_{k=1}^{n}\frac{z_{k}}{\sqrt{Z_{k-1}}}\leq(\sqrt{2}+1)\sqrt{Z_{n}}.
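As a quick numerical sanity check of Lemma 31 (illustrative only, not part of the proof), one can verify the inequality on a random integer sequence:

import numpy as np

rng = np.random.default_rng(0)
H, n = 5, 1000
z = rng.integers(0, H + 1, size=n)            # 0 <= z_k <= H
Z = H + np.concatenate(([0], np.cumsum(z)))   # Z_0 = H, Z_k = Z_{k-1} + z_k
assert np.sum(z / np.sqrt(Z[:-1])) <= (np.sqrt(2) + 1) * np.sqrt(Z[-1])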

Using Lemma 31, we can bound (b)(b) in Lemma 32 below.

Lemma 32

Denote by vi(s,a)=j=1H𝕀(ai,j=a,si,j=s)v_{i}(s,a)=\sum_{j=1}^{H}\mathbb{I}(a_{i,j}=a,s_{i,j}=s) the number of times the state-action pair (s,a)(s,a) is visited during episode ii, and let τ(s,a)=min{k[K]:Nk(s,a)H}\tau(s,a)=\min\{k\in[K]:N_{k}(s,a)\geq H\} be the first episode at whose beginning the state-action pair (s,a)(s,a) has been visited at least HH times. Then,

(b)(2+1)SAKH.\begin{split}(b)\leq(\sqrt{2}+1)\sqrt{SAKH}.\end{split} (62)

Proof  By definition, Ni(s,a)=k=1i1vk(s,a)N_{i}(s,a)=\sum_{k=1}^{i-1}v_{k}(s,a). Regrouping the sum in (b)(b) by (s,a)(s,a), we have

(b)=s,ai=1Kvi(s,a)Ni(s,a)𝕀{Ni(s,a)H}=s,a(i=1τ(s,a)1vi(s,a)Ni(s,a)𝕀{Ni(s,a)H}+i=τ(s,a)Kvi(s,a)Ni(s,a))=s,ai=τ(s,a)Kvi(s,a)Ni(s,a)s,a(2+1)NK(s,a)+vK(s,a)(2+1)SAKH.\begin{split}(b)&=\sum_{s,a}\sum_{i=1}^{K}\frac{v_{i}(s,a)}{\sqrt{N_{i}(s,a)}}\mathbb{I}\{N_{i}(s,a)\geq H\}\\ &=\sum_{s,a}\left(\sum_{i=1}^{\tau(s,a)-1}\frac{v_{i}(s,a)}{\sqrt{N_{i}(s,a)}}\mathbb{I}\{N_{i}(s,a)\geq H\}+\sum_{i=\tau(s,a)}^{K}\frac{v_{i}(s,a)}{\sqrt{N_{i}(s,a)}}\right)\\ &=\sum_{s,a}\sum_{i=\tau(s,a)}^{K}\frac{v_{i}(s,a)}{\sqrt{N_{i}(s,a)}}\\ &\leq\sum_{s,a}(\sqrt{2}+1)\sqrt{N_{K}(s,a)+v_{K}(s,a)}\\ &\leq(\sqrt{2}+1)\sqrt{SAKH}.\end{split}

where the second-to-last inequality follows from Lemma 31, and the last inequality follows from the Cauchy-Schwarz inequality together with the fact that s,a(NK(s,a)+vK(s,a))KH\sum_{s,a}(N_{K}(s,a)+v_{K}(s,a))\leq KH.

In order to bound the term (c)(c) in Equation 61, we use the following lemma, which is a variant of Lemma 31 and was stated in Azar et al. (2017) without proof.

Lemma 33

Let H1H\geq 1. For any sequence of numbers z1,,znz_{1},\dots,z_{n} with 0zkH0\leq z_{k}\leq H, consider the sequence Z0,Z1,ZnZ_{0},Z_{1},\dots Z_{n} defined as

Z0HZk=Zk1+zkfor k1.\begin{split}Z_{0}&\geq H\\ Z_{k}&=Z_{k-1}+z_{k}\qquad\text{for }k\geq 1.\end{split}

Then, for all n1n\geq 1,

k=1nzkZk1j=1ZnZ01jln(ZnZ0)+1.\sum_{k=1}^{n}\frac{z_{k}}{Z_{k-1}}\leq\sum_{j=1}^{Z_{n}-Z_{0}}\frac{1}{j}\leq\ln(Z_{n}-Z_{0})+1.

Proof  The second half follows immediately from existing results for the partial sum of the harmonic series. We prove the first half of the inequality by induction. By definition of the two sequences, ZkH1Z_{k}\geq H\geq 1 and zkHZk1z_{k}\leq H\leq Z_{k-1} for all kk. At n=1n=1, if z1=0z_{1}=0 then the inequality trivially holds. If z1>0z_{1}>0, then Z1Z0=z1Z_{1}-Z_{0}=z_{1} and

z1Z0z1H=(1H++1Hz1 terms)1+12++1z1\frac{z_{1}}{Z_{0}}\leq\frac{z_{1}}{H}=\left(\underbrace{\frac{1}{H}+\dots+\frac{1}{H}}_{z_{1}\text{ terms}}\right)\leq 1+\frac{1}{2}+\dots+\frac{1}{z_{1}}

since z1Hz_{1}\leq H.

For n>1n>1, by the induction hypothesis, we have

k=1nzkZk1=k=1n1zkZk1+znZn1(j=1Zn1Z01j)+znZn1=(j=1Zn1Z01j)+(1Zn1++1Zn1znterms)(j=1Zn1Z01j)+(1Zn1Z0+1++1Zn1Z0+zn)=j=1ZnZ01j,\begin{split}\sum_{k=1}^{n}\frac{z_{k}}{Z_{k-1}}&=\sum_{k=1}^{n-1}\frac{z_{k}}{Z_{k-1}}+\frac{z_{n}}{Z_{n-1}}\\ &\leq\left(\sum_{j=1}^{Z_{n-1}-Z_{0}}\frac{1}{j}\right)+\frac{z_{n}}{Z_{n-1}}\\ &=\left(\sum_{j=1}^{Z_{n-1}-Z_{0}}\frac{1}{j}\right)+\left(\underbrace{\frac{1}{Z_{n-1}}+\dots+\frac{1}{Z_{n-1}}}_{z_{n}\text{terms}}\right)\\ &\leq\left(\sum_{j=1}^{Z_{n-1}-Z_{0}}\frac{1}{j}\right)+\left(\frac{1}{Z_{n-1}-Z_{0}+1}+\dots+\frac{1}{Z_{n-1}-Z_{0}+z_{n}}\right)\\ &=\sum_{j=1}^{Z_{n}-Z_{0}}\frac{1}{j},\end{split}

where the last inequality follows from znZ0z_{n}\leq Z_{0}. Therefore, the claim holds for all n1n\geq 1, which completes the induction.
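A similar random-sequence check applies to Lemma 33 (again illustrative only):

import numpy as np

rng = np.random.default_rng(1)
H, n = 5, 1000
z = rng.integers(1, H + 1, size=n)            # 1 <= z_k <= H
Z = H + np.concatenate(([0], np.cumsum(z)))   # Z_0 = H
assert np.sum(z / Z[:-1]) <= np.log(Z[-1] - Z[0]) + 1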

Using Lemma 33, the term (c)(c) can be bounded similarly to term (b)(b) as follows:

Lemma 34

With vi(s,a)v_{i}(s,a) and τ(s,a)\tau(s,a) defined in Lemma 32, we have

(c)SAL+SA.(c)\leq SAL+SA.

Proof  We write (c)(c) as

(c)=i=1Kj=1H𝕀{Ni(si,j,ai,j)H}Ni(si,j,ai,j)=s,ai=1Kvi(s,a)Ni(s,a)𝕀{Ni(s,a)H}s,a(i=1τ(s,a)1vi(s,a)Ni(s,a)𝕀{Ni(s,a)H}+i=τ(s,a)Kvi(s,a)Ni(s,a))=s,ai=τ(s,a)Kvi(s,a)Ni(s,a)s,a(ln(NK(s,a)+vK(s,a)Nτ(s,a)(s,a))+1)\begin{split}(c)&=\sum_{i=1}^{K}\sum_{j=1}^{H}\frac{\mathbb{I}\{N_{i}(s_{i,j},a_{i,j})\geq H\}}{N_{i}(s_{i,j},a_{i,j})}\\ &=\sum_{s,a}\sum_{i=1}^{K}\frac{v_{i}(s,a)}{N_{i}(s,a)}\mathbb{I}\{N_{i}(s,a)\geq H\}\\ &\leq\sum_{s,a}\left(\sum_{i=1}^{\tau(s,a)-1}\frac{v_{i}(s,a)}{N_{i}(s,a)}\mathbb{I}\{N_{i}(s,a)\geq H\}+\sum_{i=\tau(s,a)}^{K}\frac{v_{i}(s,a)}{N_{i}(s,a)}\right)\\ &=\sum_{s,a}\sum_{i=\tau(s,a)}^{K}\frac{v_{i}(s,a)}{N_{i}(s,a)}\\ &\leq\sum_{s,a}\left(\ln\left(N_{K}(s,a)+v_{K}(s,a)-N_{\tau(s,a)}(s,a)\right)+1\right)\end{split}

where the last inequality follows from Lemma 33. Trivially bounding the logarithm term by ln(KH)\ln(KH), we obtain

(c)SAln(KH)+SASAL+SA.(c)\leq SA\ln(KH)+SA\leq SAL+SA.

Combining Lemma 32 and Lemma 34, we obtain

(a)8HL((2+1)SAKH)+4SH2L(SAL+SA)20HLSAKH+5S2AH2L2.\begin{split}(a)&\leq 8HL((\sqrt{2}+1)\sqrt{SAKH})+4SH^{2}L(SAL+SA)\\ &\leq 20HL\sqrt{SAKH}+5S^{2}AH^{2}L^{2}.\end{split}

Substituting this into Equation 60, we obtain

RegretSAH2+e(H+2)KHL+e20HLSAKH+e5S2AH2L267HLSAKH+15S2AH2L2.\begin{split}\text{Regret}&\leq SAH^{2}+e(H+2)\sqrt{KHL}+e20HL\sqrt{SAKH}+e5S^{2}AH^{2}L^{2}\\ &\leq 67HL\sqrt{SAKH}+15S^{2}AH^{2}L^{2}.\end{split}

Appendix G Removing the assumption on the hitting time

GOSPRL (Tarbouriech et al., 2021, Lemma 3) guarantees that in the undiscounted infinite-horizon setting, with H0=O(DS2Aλ2)H_{0}=O(\frac{DS^{2}A}{\lambda^{2}}), Lemma 14 holds with high probability. Thus, in the episodic finite-horizon setting, by setting H0=cDS2Aλ2H_{0}=c\frac{DS^{2}A}{\lambda^{2}} for some appropriately large constant c>0c>0 and applying GOSPRL in each episode, we obtain a bound that is tight in its dependency on KK and λ\lambda for communicating MDPs. One difficulty with this approach is that both cc and DD are unknown. One possible way to overcome this is to apply the doubling trick as follows: at the beginning of episode kk, we set H0=ckS2Aλ2H_{0}=c_{k}\frac{S^{2}A}{\lambda^{2}}, where c1=1c_{1}=1. If the learner successfully visits every state-action pair at least NN times within H0H_{0} steps, we set ck+1=ckc_{k+1}=c_{k}. Otherwise, ck+1=2ckc_{k+1}=2c_{k}. There are at most log2(cD)\log_{2}{(cD)} episodes with failed exploration until ckc_{k} is large enough that, with high probability, all subsequent episodes have successful exploration. Moreover, the horizons of the clustering and learning phases change at most log2(cD)\log_{2}(cD) times. The full analysis of this approach is beyond the scope of this paper and is left to future work.
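A minimal Python sketch of this doubling schedule follows; explore_for is a hypothetical callback (not part of the paper's algorithms) that runs the clustering-phase exploration for H_0 steps and reports whether every state-action pair was visited at least N times.

def doubling_clustering_horizon(K, S, A, lam, N, explore_for):
    # Doubling trick for the clustering-phase horizon when c and D are unknown.
    c = 1.0
    horizons = []
    for k in range(K):
        H0 = int(c * S ** 2 * A / lam ** 2)
        if not explore_for(H0, N):      # failed exploration: not every (s, a) visited N times
            c *= 2                      # double the constant for subsequent episodes
        horizons.append(H0)
    return horizons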

Appendix H Using samples in both phases for regret minimization

One of the results from previous works on the stochastic infinite-horizon multi-task setting (Brunskill and Li, 2013) is that, in the cluster-then-learn paradigm, the samples collected in their first stage (before all models have been seen at least once) can be used to accelerate the learning in their second stage (after all models have been seen at least once). In this work, we study a similar effect at the phase level. Specifically, in the finite-horizon setting, the clustering phase is always followed by the learning phase; it is therefore desirable to use the samples collected in the clustering phase to improve the regret bound of the learning phase.

Our goal is to improve the regret of stage 1 in Algorithm 4. The reason that we focus on stage 1 is two-fold:

  • In case Assumption 19 does not hold, i.e. K1K_{1} is close to KK, the total regret is dominated by the regret of stage 1. Given that the length of the clustering phase H0H_{0} is already of the same order O(S2A)O(S^{2}A) as the state-of-the-art bound of the recently proposed GOSPRL algorithm (Tarbouriech et al., 2021), without further assumptions we conjecture that it is difficult to improve H0H_{0} substantially, and thus we focus on improving the learning phase.

  • In stage 1, every state-action pair is uniformly visited at least NN times before the learning phase. This uniformity allows us to study their impact in a systematic way without any further assumptions.

Using samples collected in both phases for the learning phase in Algorithm 1 is equivalent to using the policy

πk=UCBVI-CH(𝒞id)\pi_{k}=\texttt{UCBVI-CH}(\mathcal{C}_{id})

for the learning phase, since 𝒞id\mathcal{C}_{id} contains both 𝒞idmodel\mathcal{C}_{id}^{model} and 𝒞idregret\mathcal{C}_{id}^{regret}.

Input: Number of episode KK, horizon HH, failure probability pp, number of external samples for each state-action pair NN
Initialize two empty collections \mathcal{H} and \mathcal{B}
for episode k=1,2,,Kk=1,2,\dots,K do
       for (s,a)𝒮×𝒜(s,a)\in\mathcal{S\times A} do
             for counter=1,2,,Ncounter=1,2,\dots,N do
                   The oracle draws ss^{\prime} from P(s,a)P(\cdot\mid s,a)
                   Add (s,a,s)(s,a,s^{\prime}) to \mathcal{B}
                  
             end for
            
       end for
      πk=UCBVI-CH()\pi_{k}=\texttt{UCBVI-CH}(\mathcal{H\cup B})
       Observe the starting state s1s_{1}
       for h=1,2,,Hh=1,2,\dots,H do
             Learner takes action ah=πk(sh)a_{h}=\pi_{k}(s_{h})
             Observe state sh+1s_{h+1}
             Add (sh,ah,sh+1)(s_{h},a_{h},s_{h+1}) to \mathcal{H}
            
       end for
      
end for
Algorithm 7 UCBVI-CH with external samples

The regret minimization process in the learning phase is now equivalent to single-task episodic RL in which, at the beginning of each episode, the learner is given SANSAN additional (s,a,s)(s,a,s^{\prime}) samples, obtained by sampling the transition function P(s,a)P(\cdot\mid s,a) of each pair (s,a)(s,a) i.i.d. NN times. We extend the UCBVI-CH algorithm in Azar et al. (2017) to this new setting and obtain Algorithm 7. The bonus function of episode kk in UCBVI-CH is set to

bk(s,a)=7HLN1Nk(s,a)+kN,b_{k}(s,a)=7HL_{N}\sqrt{\frac{1}{N_{k}(s,a)+kN}}, (63)

where LN=ln(5SAK(H+N)/p)L_{N}=\ln(5SAK(H+N)/p).
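As a small illustration of Equation 63 (the array shapes and names are assumptions, not the paper's notation), the modified bonus simply adds the kN external samples to the learner's own counts:

import numpy as np

def bonus_with_external_samples(N_sa, k, N, H, S, A, K, p):
    # N_sa: (S, A) visit counts from the learner's own trajectories up to episode k (1-indexed).
    # Each (s, a) has received k * N additional external samples by the start of episode k.
    L_N = np.log(5 * S * A * K * (H + N) / p)
    return 7 * H * L_N / np.sqrt(N_sa + k * N)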

The regret of this algorithm is bounded in the following theorem (proved in Appendix I).

Theorem 35

Given a constant p(0,1)p\in(0,1). With probability at least 1p1-p, the regret of Algorithm 7 is bounded by

Regret(K)SAH2N+1+e(H+1)KLN+602H1N+2H1H3/2LNSAK+152H1N+2H1S2AH2LN2.\text{Regret}(K)\leq\frac{SAH^{2}}{N+1}+e(H+1)\sqrt{KL_{N}}+60\sqrt{\frac{2H-1}{N+2H-1}}H^{3/2}L_{N}\sqrt{SAK}+15\frac{2H-1}{N+2H-1}S^{2}AH^{2}L_{N}^{2}.

It can be observed that when N=0N=0, this bound recovers the bound of UCBVI-CH (up to a constant factor). Intuitively, when NN is small compared to HH, the regret should still be of order O(HSAKH)O(H\sqrt{SAKH}) since most of the useful information for learning still comes from exploring the environment. As NN increases, since the logarithmic term LNL_{N} grows much more slowly than the factor O(1/N)O(1/\sqrt{N}) decays, the dominant term O(2H1N+2H1H3/2LNSAK)O(\sqrt{\frac{2H-1}{N+2H-1}}H^{3/2}L_{N}\sqrt{SAK}) converges to 0.
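For a rough sense of the improvement, the shrink factor on the dominant term can be tabulated (H = 200 is chosen only for illustration):

import numpy as np

H = 200
for N in [0, 10, 100, 1000, 10000]:
    factor = np.sqrt((2 * H - 1) / (N + 2 * H - 1))
    print(f"N = {N:5d}: shrink factor = {factor:.3f}")
# For N >> H the factor behaves like sqrt(2H / N), so the sqrt(SAK) term vanishes as N grows.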

Using Theorem 35 and H1HH_{1}\leq H, we can directly bound the regret incurred on each model mm over the episodes in 𝒦m\mathcal{K}_{m}:

Lemma 36

The stage-1 regret of each model mm is

RegretStage1(m,Km)SAH12N+1+e(H1+1)KmLN+602H11N+2H11H13/2LNSAKm+152H11N+2H11S2AH12LN2.\begin{split}\text{Regret}_{Stage1}(m,K_{m})\leq&\frac{SAH_{1}^{2}}{N+1}+e(H_{1}+1)\sqrt{K_{m}L_{N}}\\ &+60\sqrt{\frac{2H_{1}-1}{N+2H_{1}-1}}H_{1}^{3/2}L_{N}\sqrt{SAK_{m}}+15\frac{2H_{1}-1}{N+2H_{1}-1}S^{2}AH_{1}^{2}L_{N}^{2}.\end{split}

where LN=ln(5SAK(H+N)/p)L_{N}=\ln(5SAK(H+N)/p).

Adding up the bound in Lemma 36 for all models mm\in\mathcal{M} and applying the Cauchy-Schwarz inequality, we obtain the total regret bound of Stage 1:

Theorem 37
RegretStage1K1H0+MSAH12N+1+e(H1+1)MKLN+602H11N+2H11H13/2LNMSAK+15M2H11N+2H11S2AH12LN2.\begin{split}\text{Regret}_{Stage1}\leq&K_{1}H_{0}+\frac{MSAH_{1}^{2}}{N+1}+e(H_{1}+1)\sqrt{MKL_{N}}\\ &+60\sqrt{\frac{2H_{1}-1}{N+2H_{1}-1}}H_{1}^{3/2}L_{N}\sqrt{MSAK}+15M\frac{2H_{1}-1}{N+2H_{1}-1}S^{2}AH_{1}^{2}L_{N}^{2}.\end{split}

In our setting, recall that N=O(Sλ2)N=O\left(\frac{S}{\lambda^{2}}\right) and H0=O(DSAN)=O(DS2A/λ2)H_{0}=O(DSAN)=O(DS^{2}A/\lambda^{2}). Since we assumed that SAHSA\ll H, we also have NH1=HH0N\ll H_{1}=H-H_{0}, and thus the bound in Theorem 37 is an improvement over the bound for stage 1 in the proof of Theorem 20, although the order stays the same. Intuitively, this means that the length of the learning phase is much larger than the length of the clustering phase, and therefore the learner spends more time on learning the optimal policy. When the length of the learning phase is small compared to NN, the samples collected in the clustering phase significantly reduce the regret bound of the learning phase. Therefore, Algorithm 7 also accelerates the learning phase after the exploration phase, which is consistent with the findings on the stochastic infinite-horizon multi-task setting in Brunskill and Li (2013).

Appendix I Proofs for Appendix H

We analyze the regret of the UCBVI-CH algorithm with external samples, where at the beginning of each episode, each state-action pair receives N1N\geq 1 additional samples drawn i.i.d. from the transition function P(s,a)P(\cdot\mid s,a).

Adapting Equation 60, the regret of Algorithm 7 (UCBVI-CH with external samples) can be bounded by

Regret(K)SAH2N+1+e(H+1)KHLN+ei=1Kj=1H[8HLN𝕀i,jiN+Ni(si,j,ai,j)+4SH2LN𝕀i,jiN+Ni(si,j,ai,j)](a),\displaystyle\begin{split}\text{Regret}(K)&\leq\frac{SAH^{2}}{N+1}+e(H+1)\sqrt{KHL_{N}}+e\underbrace{\sum_{i=1}^{K}\sum_{j=1}^{H}\left[\frac{8HL_{N}\mathbb{I}_{i,j}}{\sqrt{iN+N_{i}(s_{i,j},a_{i,j})}}+\frac{4SH^{2}L_{N}\mathbb{I}_{i,j}}{iN+N_{i}(s_{i,j},a_{i,j})}\right]}_{(a)},\end{split} (64)

where 𝕀i,j=𝕀{Ni(si,j,ai,j)H}\mathbb{I}_{i,j}=\mathbb{I}\{N_{i}(s_{i,j},a_{i,j})\geq H\} as defined in Appendix F.

The first term SAH2N+1\frac{SAH^{2}}{N+1} bounds the total regret of episodes in which some state-action pair with fewer than HH prior samples is visited: in each episode in which a pair (s,a)(s,a) is visited at least once, at least N+1N+1 new samples of this pair are collected, so each pair can cause at most HN+1\frac{H}{N+1} such episodes and there are at most SAHN+1\frac{SAH}{N+1} such episodes in total.

Similar to Appendix F, we bound (a)(a) by bounding its two components (b)(b) and (c)(c) where

(a)=8HLN(i=1Kj=1H𝕀i,jiN+Ni(si,j,ai,j)(b))+4SH2LN(i=1Kj=1H𝕀i,jiN+Ni(si,j,ai,j)(c)).(a)=8HL_{N}\left(\underbrace{\sum_{i=1}^{K}\sum_{j=1}^{H}\frac{\mathbb{I}_{i,j}}{\sqrt{iN+N_{i}(s_{i,j},a_{i,j})}}}_{(b)}\right)+4SH^{2}L_{N}\left(\underbrace{\sum_{i=1}^{K}\sum_{j=1}^{H}\frac{\mathbb{I}_{i,j}}{iN+N_{i}(s_{i,j},a_{i,j})}}_{(c)}\right).

In order to bound (b)(b), we first prove the following technical lemma, which quantifies the fraction of the regret that is reduced when using external samples.

Lemma 38

Suppose two constants N1,H1N\geq 1,H\geq 1 are given. For any sequence of numbers z1,,znz_{1},\dots,z_{n} with 0zkH0\leq z_{k}\leq H, consider the sequence Z0,Z1,ZnZ_{0},Z_{1},\dots Z_{n} defined as

Z02H1Zk=Zk1+zkfor k1\begin{split}Z_{0}&\leq 2H-1\\ Z_{k}&=Z_{k-1}+z_{k}\qquad\text{for }k\geq 1\end{split}

Then, for all kk,

zkkN+Zk1(k+1)H1kN+(k+1)H1zkZk1.\frac{z_{k}}{\sqrt{kN+Z_{k-1}}}\leq\sqrt{\frac{(k+1)H-1}{kN+(k+1)H-1}}\frac{z_{k}}{\sqrt{Z_{k-1}}}.

Proof  If zk=0z_{k}=0, then the claim is trivially true. For zk>0z_{k}>0, the claim is equivalent to

1kN+Zk1\displaystyle\frac{1}{\sqrt{kN+Z_{k-1}}} (k+1)H1kN+(k+1)H11Zk1\displaystyle\leq\sqrt{\frac{(k+1)H-1}{kN+(k+1)H-1}}\frac{1}{\sqrt{Z_{k-1}}}
\displaystyle\Leftrightarrow (kN+(k+1)H1)Zk1\displaystyle\sqrt{(kN+(k+1)H-1)}\sqrt{Z_{k-1}} (k+1)H1kN+Zk1\displaystyle\leq\sqrt{(k+1)H-1}\sqrt{kN+Z_{k-1}}
\displaystyle\Leftrightarrow Zk1\displaystyle Z_{k-1} (k+1)H1,\displaystyle\leq(k+1)H-1,

which is true, since Zk1=Z0+i=1k1ziZ0+i=1k1H2H1+(k1)H=(k+1)H1Z_{k-1}=Z_{0}+\sum_{i=1}^{k-1}z_{i}\leq Z_{0}+\sum_{i=1}^{k-1}H\leq 2H-1+(k-1)H=(k+1)H-1.

Corollary 39

Suppose two constants N1,H1N\geq 1,H\geq 1 are given. For any sequence of numbers z1,,znz_{1},\dots,z_{n} with 0zkH0\leq z_{k}\leq H, consider the sequence Z0,Z1,ZnZ_{0},Z_{1},\dots Z_{n} defined as

1Z02H1Zk=Zk1+zkfor k1\begin{split}1\leq Z_{0}&\leq 2H-1\\ Z_{k}&=Z_{k-1}+z_{k}\qquad\text{for }k\geq 1\end{split}

Then, for all n1n\geq 1,

k=1nzkkN+Zk1k=1n(k+1)H1kN+(k+1)H1zkZk12H1N+2H1k=1nzkZk1.\sum_{k=1}^{n}\frac{z_{k}}{\sqrt{kN+Z_{k-1}}}\leq\sum_{k=1}^{n}\sqrt{\frac{(k+1)H-1}{kN+(k+1)H-1}}\frac{z_{k}}{\sqrt{Z_{k-1}}}\leq\sqrt{\frac{2H-1}{N+2H-1}}\sum_{k=1}^{n}\frac{z_{k}}{\sqrt{Z_{k-1}}}.

Proof  The first inequality follows from Lemma 38. We now prove the second inequality. Consider the following function

f(x)=(x+1)H1xN+(x+1)H1f(x)=\frac{(x+1)H-1}{xN+(x+1)H-1}

The derivative is f(x)=N(1H)(xN+(x+1)H1)2f^{\prime}(x)=\frac{N(1-H)}{(xN+(x+1)H-1)^{2}}. Since H1H\geq 1, we have f(x)0xf^{\prime}(x)\leq 0~{}\forall x, and therefore f(x)f(x) is decreasing. It follows that for k1k\geq 1,

f(k)=(k+1)H1kN+(k+1)H1f(1)=2H1N+2H1,and hencef(k)2H1N+2H1.f(k)=\frac{(k+1)H-1}{kN+(k+1)H-1}\leq f(1)=\frac{2H-1}{N+2H-1},\qquad\text{and hence}\qquad\sqrt{f(k)}\leq\sqrt{\frac{2H-1}{N+2H-1}}.

Using Corollary 39, we can bound (b)(b) as follows.

Lemma 40

With vi(s,a)v_{i}(s,a) and τ(s,a)\tau(s,a) defined in Lemma 32, we have

(b)2H1N+2H1(2+1)SAKH.(b)\leq\sqrt{\frac{2H-1}{N+2H-1}}(\sqrt{2}+1)\sqrt{SAKH}.

Proof  We can write (b)(b) as follows

(b)=s,ai=τ(s,a)Kvi(s,a)iN+Ni(s,a).(b)=\sum_{s,a}\sum_{i=\tau(s,a)}^{K}\frac{v_{i}(s,a)}{\sqrt{iN+N_{i}(s,a)}}.

By definition of τ(s,a)\tau(s,a):

Nτ(s,a)=Nτ(s,a)1+vτ(s,a)1H1+H=2H1.N_{\tau(s,a)}=N_{\tau(s,a)-1}+v_{\tau(s,a)-1}\leq H-1+H=2H-1.

Applying Corollary 39 and Lemma 32 we obtain

(b)2H1N+2H1s,ai=τ(s,a)Kvi(s,a)Ni(s,a)2H1N+2H1(2+1)SAKH.\begin{split}(b)&\leq\sqrt{\frac{2H-1}{N+2H-1}}\sum_{s,a}\sum_{i=\tau(s,a)}^{K}\frac{v_{i}(s,a)}{\sqrt{N_{i}(s,a)}}\\ &\leq\sqrt{\frac{2H-1}{N+2H-1}}(\sqrt{2}+1)\sqrt{SAKH}.\end{split}

Next, we bound (c)(c). Using techniques similar to those in Lemma 38 and Corollary 39, we can show that the following claims hold.

Lemma 41

Given two constants N0,H1N\geq 0,H\geq 1. For any sequence of numbers z1,,znz_{1},\dots,z_{n} with 0zkH0\leq z_{k}\leq H, consider the sequence Z0,Z1,ZnZ_{0},Z_{1},\dots Z_{n} defined as

1Z02H1Zk=Zk1+zkfor k1\begin{split}1\leq Z_{0}&\leq 2H-1\\ Z_{k}&=Z_{k-1}+z_{k}\qquad\text{for }k\geq 1\end{split}

Then, for all kk,

zkkN+Zk1(k+1)H1kN+(k+1)H1zkZk1.\frac{z_{k}}{kN+Z_{k-1}}\leq\frac{(k+1)H-1}{kN+(k+1)H-1}\frac{z_{k}}{Z_{k-1}}.

And for all n1n\geq 1,

k=1nzkkN+Zk12H1N+2H1k=1nzkZk1.\sum_{k=1}^{n}\frac{z_{k}}{kN+Z_{k-1}}\leq\frac{2H-1}{N+2H-1}\sum_{k=1}^{n}\frac{z_{k}}{Z_{k-1}}.

Consequently, (c)(c) is bounded in the following corollary.

Corollary 42

With vi(s,a)v_{i}(s,a) and τ(s,a)\tau(s,a) defined in Lemma 32, we have

(c)2H1N+2H1(SALN+SA).(c)\leq\frac{2H-1}{N+2H-1}(SAL_{N}+SA).

Combining Lemma 40 and Corollary 42, we obtain

(a)8HLN2H1N+2H1((2+1)SAKH)+4SH2LN2H1N+2H1(SALN+SA)2H1N+2H120HLNSAKH+2H1N+2H15S2AH2LN2.\begin{split}(a)&\leq 8HL_{N}\sqrt{\frac{2H-1}{N+2H-1}}((\sqrt{2}+1)\sqrt{SAKH})+4SH^{2}L_{N}\frac{2H-1}{N+2H-1}(SAL_{N}+SA)\\ &\leq\sqrt{\frac{2H-1}{N+2H-1}}20HL_{N}\sqrt{SAKH}+\frac{2H-1}{N+2H-1}5S^{2}AH^{2}L_{N}^{2}.\end{split}

and the total regret is

Regret(K)SAH2N+1+e(H+1)KLN+e(2H1N+2H120HLNSAKH+2H1N+2H15S2AH2LN2).\begin{split}\text{Regret}(K)&\leq\frac{SAH^{2}}{N+1}+e(H+1)\sqrt{KL_{N}}+e\left(\sqrt{\frac{2H-1}{N+2H-1}}20HL_{N}\sqrt{SAKH}+\frac{2H-1}{N+2H-1}5S^{2}AH^{2}L_{N}^{2}\right).\end{split}

Appendix J Experimental Details

Transition functions. Figure 5 illustrates the 4×44\times 4 gridworld environment of the four MDPs in \mathcal{M}. The rows are numbered top to bottom from 0 to 33. The columns are numbered left to right from 0 to 33. The starting state s1s_{1} is at position (1,1)(1,1). In every state, the probability of success of all actions is 0.850.85. When an action is unsuccessful, the remaining probability of 0.150.15 is divided equally among the other adjacent cells. There are several exceptions:

  • In the four corners, if the agent takes an action in the direction of the border then with probability 0.70.7 it will stay in the same corner, and with probability 0.30.3 it will end up in the cell in the opposite direction. For example, if the agent is at (0,0)(0,0) and takes action up, then with probability 0.30.3 it will actually go down to the cell (1,0)(1,0).

  • Each of the four MDPs has an easy-to-reach corner and three hard-to-reach corners. The easy-to-reach corners in models m1,m2,m3m_{1},m_{2},m_{3} and m4m_{4} are (0,0),(0,3),(3,0)(0,0),(0,3),(3,0) and (3,3)(3,3), respectively. In each of these models, the probability of success of an action that leads to one of the hard-to-reach corners is 0.20.2, except for the (3,3)(3,3) corner, where this probability is 0.30.3. For example, in model m1m_{1}, taking action right in cell (0,2)(0,2) has probability of success equal to 0.20.2, while taking action down in cell (2,3)(2,3) has probability of success equal to 0.30.3.

  • On the four edges, any action that takes the agent out of the grid has probability of success equal to 0, and the agent ends up in one of the three adjacent cells with equal probability of 13\frac{1}{3}. For example, taking action up in position (0,1)(0,1) will take the agent to one of the three positions (0,0),(0,1)(0,0),(0,1) and (1,1)(1,1), each with probability 13\frac{1}{3}.

Under this construction, the separation level is λ=1.2999\lambda=1.2999. One example of a λ\lambda-distinguishing set of optimal size is Γ={(1,0),(8,3),(2,1)}\Gamma=\{(1,0),(8,3),(2,1)\}. One example of a set that is λ/2\lambda/2-distinguishing but not λ\lambda-distinguishing is Γλ/2={(11,3),(4,2),(13,0)}\Gamma^{\lambda/2}=\{(11,3),(4,2),(13,0)\}.

Figure 5: A 4×44\times 4 gridworld MDP with start state at (1,1)(1,1) and reward of 1 in four corners

Performance metric. At the end of each episode, the two AOMultiRL agents and the one-episode UCBVI agent obtain their estimated model P^\hat{P}. The estimated optimal policy computed from P^\hat{P} is run for H1=200H_{1}=200 steps starting from (1,1)(1,1). The average per-episode reward (APER) of an agent after episode k=1,2,,Kk=1,2,\dots,K is defined as

APER(k)=i=1kj=1H1ri,jk\displaystyle\mathrm{APER}(k)=\frac{\sum_{i=1}^{k}\sum_{j=1}^{H_{1}}r_{i,j}}{k} (65)

where ri,j=r(sji,aji)r_{i,j}=r(s^{i}_{j},a^{i}_{j}) is the reward the agent received in step jj of episode ii.
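A direct implementation of Equation 65, assuming the rewards are stored in a (K, H_1) array (an assumption about the logging format, not part of the experimental setup described above):

import numpy as np

def average_per_episode_reward(rewards):
    # rewards[i, j]: reward received in step j of episode i.
    cumulative = np.cumsum(rewards.sum(axis=1))    # total reward over the first k episodes
    return cumulative / np.arange(1, rewards.shape[0] + 1)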

Horizon settings. For AOMultiRL2, the horizons of the clustering phase in the two stages are different, since the distinguishing sets in the two stages are different. In order to make a fair comparison with the other algorithms, the horizon of the learning phase is set to H1=0H_{1}=0 in stage 1 and H1=200H_{1}=200 in stage 2. Since we assumed that stage 2 is dominant, the goal of the experiment is to examine whether a λ/2\lambda/2-distinguishing set can be discovered and how effective that set can be. We observe that AOMultiRL2 is able to discover the same λ/2\lambda/2-distinguishing set {(14,1),(7,2),(13,0)}\{(14,1),(7,2),(13,0)\} in all 10 runs. Since this set also has an optimal size of 33, in stage 2 the clustering phase’s horizon H0H_{0} of AOMultiRL2 is identical to that of AOMultiRL1.